<a href="https://colab.research.google.com/github/SDS-AAU/M2-2019/blob/master/notebooks/network_analysis_and_the_gram.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exploring Hashtag Networks on the Gram

In this notebook, we will collect data from Instagram to construct (snowball) a network of hashtags as well as a bipartite network of Instagram users and hashtags.

This work builds on [this post](https://medium.com/@marcosacj/creating-and-visualizing-a-complex-network-of-instagram-hashtags-based-on-posts-about-politics-2daf24f31088), but I've rewritten the code into a shorter, more "notebook-friendly" format that relies on more dependencies (pandas, numpy & co). You can find the original code here.

The networks in this example can be considered synthetic, since nodes and particularly edges represent virtual constructs rather than explicit connections.


The notebook explores:

- Graph construction (normal and bipartite)
- Calculation of centrality indicators 
- Community detection
- Projection of bipartite network

Furthermore, you will learn to:

- make simple (public) API requests (API: Application Programming Interface)
- parse JSON responses
- perform simple string manipulation/text-mining to extract features of interest (a transition into NLP)

### So what?

Such an analysis can be useful in marketing to identify sub-discussions in a niche or around a brand. We will detect popular hashtags within sub-niches that "correlate" with a topic of interest.
Furthermore, we will identify accounts with high engagement (post-counts) within specific hashtag communities.

Unfortunately, Instagram very recently (a few days ago) disabled a simple public API that made it possible to map user-ids to usernames. Therefore, we will use ```instaloader```, a module for interacting with Instagram.

We will only use public data that does not require log-in. If you want to explore other graph structures on Instagram (e.g. follow-networks), have a look at Instabot.

### Importing Modules

In [0]:
# Installing instaloader
!pip3 install instaloader

In [0]:
# The requests library handles "requests" to APIs 
# similar to a browser that requests a webpage given a URL

import requests as rq

# A bit of a transition into NLP. The tweet tokenizer from the NLTK library will help us extract
# the hashtags from post-text

from nltk.tokenize import TweetTokenizer
tknzr = TweetTokenizer()

# The networkx module for all network related manipulation + bipartite + the community module
# (from the python-louvain package, not part of networkx) for community identification in large graphs
import networkx as nx
from networkx.algorithms import bipartite 
import community

# Python's amazing iteration & combination library
import itertools

# The usual suspects
import pandas as pd
import numpy as np

# Instaloader for mapping user-ids to usernames
import instaloader
L = instaloader.Instaloader()

### Global Constants

In [0]:
# These things may change without a warning...

# Instagram base url prefix
tagurl_prefix = 'https://www.instagram.com/explore/tags/'

# suffix to append to tag request url to retrieve data in JSON format
tagurl_suffix = '/?__a=1'

# suffix that introduces the end cursor when requesting further posts by tag
tagurl_endcursor = '&max_id='

# a generic media post prefix (concat with media shortcode to view)
posturl_prefix = 'https://www.instagram.com/p/'

In [0]:
# target initial tags (we will run this with only one tag but the code can take multiple - just extend the list)

tags = ['fitnessworlddk']

In [0]:
# urls to initial tags using the above url-components
queries = [ tagurl_prefix + tag + tagurl_suffix for tag in tags ]

### Getting the Data

The response structure of this Instagram endpoint is not really straightforward. You can read more about it in the original post.
The data is most likely composed on request by some large-scale graph database and returned. Instagram obviously assumes that the receiving side is a browser exploring public posts.

We also don't get all posts for a hashtag right away, but only a "page" of ~25 posts.

To receive further posts, we need to send a new request specifying "our position" by providing an end_cursor.

This **end cursor** can be found in

```response['graphql']['hashtag']['edge_hashtag_to_media']['page_info']['end_cursor']```
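
The paging logic can be sketched without touching the network. The snippet below is a minimal illustration, assuming a hypothetical `fake_fetch` function that stands in for the real request and returns hand-made pages in the nested layout described above. The page contents and cursor strings are invented for the example.

```python
# Hypothetical stand-in for the real endpoint: three hand-made "pages",
# keyed by the cursor that leads to them (None = first page).
FAKE_PAGES = {
    None: {"posts": ["p1", "p2"], "end_cursor": "c1"},
    "c1": {"posts": ["p3", "p4"], "end_cursor": "c2"},
    "c2": {"posts": ["p5"], "end_cursor": None},  # last page: no cursor
}


def fake_fetch(cursor=None):
    """Return a response shaped like the (assumed) Instagram JSON layout."""
    page = FAKE_PAGES[cursor]
    return {
        "graphql": {
            "hashtag": {
                "edge_hashtag_to_media": {
                    "edges": [{"node": {"shortcode": p}} for p in page["posts"]],
                    "page_info": {"end_cursor": page["end_cursor"]},
                }
            }
        }
    }


def collect(max_pages=10):
    """Follow end_cursor from page to page until it runs out."""
    collected, cursor = [], None
    for _ in range(max_pages):
        media = fake_fetch(cursor)["graphql"]["hashtag"]["edge_hashtag_to_media"]
        collected.extend(media["edges"])
        cursor = media["page_info"]["end_cursor"]
        if cursor is None:  # nothing left to request
            break
    return collected


posts = collect()
print(len(posts))  # 5 posts gathered from 3 pages
```

Each request hands back one page plus the cursor for the next one, so the loop only has to remember the most recent cursor.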



#### Some thoughts on JSON

This brings us to JSON. Think of JSON objects as combinations of dictionaries and lists that can contain the basic data types that can be represented as text (e.g. strings, numbers, booleans, and further lists and dictionaries). Once parsed, you can work with JSON objects just like any other dictionary or list in Python.
More about JSON - here:

#### Where are the individual posts?

They hide in ```response['graphql']['hashtag']['edge_hashtag_to_media']['edges']```. Here you will find a list of dictionaries. If you think this is convoluted, wait until you see the structure of each post.
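
To make the nesting concrete, here is a small sketch that parses a hand-written JSON string with Python's built-in `json` module and walks down to the list of posts. The string only mimics the key layout described above. Its values (shortcodes, ids, the cursor) are made up for illustration.

```python
import json

# A made-up response with the same key layout as described above
raw = """
{"graphql": {"hashtag": {"edge_hashtag_to_media": {
    "page_info": {"end_cursor": "abc123"},
    "edges": [
        {"node": {"shortcode": "B1", "owner": {"id": "42"}}},
        {"node": {"shortcode": "B2", "owner": {"id": "7"}}}
    ]
}}}}
"""

response = json.loads(raw)  # text -> nested dicts and lists

# Dictionaries are indexed by key, lists by position
media = response["graphql"]["hashtag"]["edge_hashtag_to_media"]
post_list = media["edges"]

print(type(post_list).__name__, len(post_list))
print(post_list[0]["node"]["owner"]["id"])
```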

#### How do we collect them?

We create an empty list, iterate through the hashtags and the pages (depth) and, on each pass, extend the list with the elements of the most recent request.
Try running ```edges[0]``` once you have collected everything to see the structure of one of the posts.

In [0]:
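edges = []
for q in queries:
    url = q  # first page: no cursor yet
    for i in range(10):  # how many pages (depth) to fetch per tag?
        r = rq.get(url).json()
        media = r['graphql']['hashtag']['edge_hashtag_to_media']
        edges.extend(media['edges'])
        print(i)
        end_cursor = media['page_info']['end_cursor']
        if not end_cursor:  # no further pages for this tag
            break
        # always build the next url from the base query so cursors do not pile up
        url = q + tagurl_endcursor + end_cursor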

In [0]:
edges[0]

### Bringing the collected data into useful format...

In the next step, we will take the rich raw post data and extract only the information that we need for our analysis. We will just cut out the owner-id (the account that posted), a shortcode that we can use to identify the post and get more data on it in the future if needed, and the text including the hashtags.

To make things more compact we not only extract the raw data but we also preprocess a bit.

The hashtags are incorporated within the post-text. Therefore, we pass the text of each post through a tokenizer, that identifies individual words and elements (such as emoji). We use the tweet-tokenizer from the NLTK library, which is made for working with social media data.

```
  tokens = tknzr.tokenize(text)
  tags = [x.strip('#') for x in tokens if x.startswith('#')]
```

The first line turns the text of the post into a list of tokens (words & co.). The second line picks out only the elements that start with a "#" and strips the "#" when adding them to a list.

Then we construct a dictionary with these values and append it to a list.

This gives us a list of dictionaries - something that we can pass to Pandas to get a dataframe we can work with.
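
As a quick sanity check, the two tokenizer lines can be tried on a single caption. The caption, owner id and shortcode below are invented for illustration. The dictionary keys mirror the ones used in the next cell.

```python
from nltk.tokenize import TweetTokenizer

tknzr = TweetTokenizer()

# An invented caption, only for illustration
text = "Leg day done! #fitness #gym #legday"

tokens = tknzr.tokenize(text)
# keep tokens that start with "#" and drop the leading "#"
tags = [x.strip('#') for x in tokens if x.startswith('#')]

# one record per post, ready to be collected in a list
post_dict = {'id_owner': '42', 'shortcode': 'B1', 'tags': tags, 'text': text}
print(post_dict['tags'])
```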

In [0]:
post_dicts = [] #empty list

for post in edges: #iterate all raw posts

  if post['node']['edge_media_to_caption']['edges'] == []: # hop to the next if no text in the post
    continue
    
  post_dict = {} # empty dictionary
  id_owner = post['node']['owner']['id'] # pick out user-id
  shortcode = post['node']['shortcode'] # pick out short post identifier
  text = post['node']['edge_media_to_caption']['edges'][0]['node']['text'] # pick out post text
  
  # Pick hashtags from text
  tokens = tknzr.tokenize(text)
  tags = [x.strip('#') for x in tokens if x.startswith('#')]

  # fill in dictionary with values
  post_dict['id_owner'] = id_owner
  post_dict['shortcode'] = shortcode
  post_dict['tags'] = tags
  post_dict['text'] = text

  post_dicts.append(post_dict) #append the dictionary to a list of post-dictionaries

In [0]:
# Create DF
posts_df = pd.DataFrame(post_dicts)

# Remove hashtags that are not a hashtag (emptyspace & mistakes)
posts_df['tags'] = posts_df['tags'].map(lambda t: [x for x in t if x.isalnum()])

# Kick out posts with 0 hashtags
posts_df = posts_df[posts_df['tags'].map(len) != 0]

#### Simple stats

In [0]:
# People with most posts

posts_df['id_owner'].value_counts()

In [0]:
# Look up who these people are (this line gets us also other information about the user)
profile = instaloader.Profile.from_id(L.context, 6047745288)

In [0]:
profile.username

### Creating a graph

Networkx is rather "base-Pythonic" in its syntax, but not too crazy after all. The documentation is nice and clear...and you have probably seen most of it in the DC courses.

In [0]:
# Create empty undirected Graph
G = nx.Graph()

We will construct the graph from the hashtag combinations of each post, using ```itertools.combinations```. Given a list of n objects, this creates all possible unique combinations of size k (which we set to 2). Note that we can build up the graph sequentially: an edge list contains all the data we need.
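
A tiny sketch of this idea, using two invented posts and a throwaway graph named `G_toy` (a name chosen only for this example):

```python
import itertools
import networkx as nx

# Hashtags of two invented posts
post_tags = [['gym', 'fitness', 'legday'], ['gym', 'fitness']]

G_toy = nx.Graph()
for tag_list in post_tags:
    # all unordered pairs of hashtags within one post
    pairs = list(itertools.combinations(tag_list, 2))
    G_toy.add_edges_from(pairs)

print(sorted(G_toy.nodes()))
print(G_toy.number_of_edges())
```

Because `nx.Graph` keeps at most one edge per pair of nodes, the repeated pair `('gym', 'fitness')` does not create a second edge, so this construction records whether two hashtags co-occur, not how often.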

In [0]:
# Create the graph
for i in posts_df['tags']:
  G.add_edges_from(list(itertools.combinations(i,2)))

#### Filtering the Graph

It can be a good idea to filter the graph before analysing it. For instance, we can remove all hashtags with low degree centrality, which can be interpreted as kicking out made-up or extremely underused hashtags. We will calculate a percentile threshold and exclude everything below it.
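
To see what the numbers mean, here is a minimal sketch on an invented toy graph. It uses the 50th percentile instead of the 20th, purely so that the small example actually removes something.

```python
import networkx as nx
import numpy as np

# Toy graph: a hub connected to four nodes, plus one extra edge
G_toy = nx.Graph([('hub', 'a'), ('hub', 'b'), ('hub', 'c'), ('hub', 'd'), ('a', 'b')])

# degree centrality = degree / (number of nodes - 1)
dc = nx.degree_centrality(G_toy)

# percentile threshold over all centrality values
threshold = np.percentile(list(dc.values()), 50)

# keep only the nodes at or above the threshold
kept = [n for n, v in dc.items() if v >= threshold]

print(dc)
print(threshold, sorted(kept))
```

The hub touches every other node and therefore gets a centrality of 1.0, while the two nodes with a single connection fall below the threshold and are dropped.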

In [0]:
# Calculating degree centrality for the Graph
degree_centrality = nx.degree_centrality(G)

In [0]:
# Getting a "reasonable" lower bound.
perc_filter = np.percentile([v for u,v in degree_centrality.items()], 20)

In [0]:
# Make a subgraph based on nodes with a degree_centrality over the threshold
nodes_selected = [x for x,y in degree_centrality.items() if y >= perc_filter]

G = G.subgraph(nodes_selected).copy()  # copy() turns the read-only view into an independent graph

#### Analysing the Graph

Now we are going to calculate some network indicators and, once done, export them to a DataFrame to analyse them further.

In [0]:
# Recalculate degree centrality and assign it as a node attribute
degree_centrality = nx.degree_centrality(G)
nx.set_node_attributes(G, degree_centrality, 'degree')

In [0]:
# Same for Eigenvector Centrality
eigenvector = nx.eigenvector_centrality(G)
nx.set_node_attributes(G, eigenvector, 'eigenvector_centrality')

In [0]:
# Same for community detection
communities = community.best_partition(G, resolution = 1)
nx.set_node_attributes(G, communities, 'community')
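
To see what a partition looks like on a graph with an obvious answer, here is a small sketch. It deliberately swaps in `louvain_communities` from networkx itself (available in networkx 2.8 and later) instead of the `community` module used above, so it illustrates the same family of method rather than the exact call in this notebook. The toy graph and the seed are arbitrary choices.

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities

# Two 5-node cliques joined by a single edge
G_toy = nx.barbell_graph(5, 0)

# returns a list of node sets; the seed only makes the run repeatable
parts = louvain_communities(G_toy, resolution=1, seed=42)

# convert to the {node: community_id} form that set_node_attributes expects
membership = {node: i for i, nodes in enumerate(parts) for node in nodes}
nx.set_node_attributes(G_toy, membership, 'community')

print(sorted(sorted(p) for p in parts))
```

Nodes inside a clique are densely connected to each other and only weakly to the other clique, which is the kind of structure modularity-based methods pick up.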

In [0]:
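# One row per node; orient='index' keeps a separate dtype for each attribute column
graph_df = pd.DataFrame.from_dict(dict(G.nodes(data=True)), orient='index')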

In [0]:
graph_df['community'].value_counts(normalize=True)

In [0]:
# Find the 5 most popular hashtags for each identified community
tag_per_com = graph_df.groupby('community')['degree'].nlargest(5)

In [0]:
tag_per_com[0]

In [0]:
# Let's write the graph out to play around with it in Gephi
nx.write_gexf(G, 'G_hashtags.gexf')

### Let's try out something else: a bipartite graph between users and hashtags

Can we identify communities of users given their usage of hashtags?

In [0]:
# Create a new graph
B = nx.Graph()

In [0]:
# we will take the same data
posts_df.head()

In [0]:
# We need to specify the nodes for level 0 - this will be our users
B.add_nodes_from(list(set(posts_df.id_owner)), bipartite=0)

In [0]:
# Then we need to add hashtags nodes as level 1 nodes
B.add_nodes_from(list(set(itertools.chain(*posts_df.tags))), bipartite=1)

In [0]:
# This quick loop will generate edges between users and hashtags
# Every time someone mentions a #hashtag, a link is created

bi_edges = []
for i in posts_df[['id_owner','tags']].iterrows(): # we do this row-by-row since each row is a post
  id_owner = i[1]['id_owner']
  for j in i[1]['tags']:
    bi_edges.append((id_owner, j)) # edges are appended to a list as a tuple (id_owner, hashtag)


In [0]:
# Let's add the edges to our graph
B.add_edges_from(bi_edges)

In the next step, we will project the graph onto the account level. For this we need the node set of level 0. We also extract the level 1 node set (just because).
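
What the projection does is easiest to see on a toy case. The sketch below uses three invented users and three invented hashtags. The names `B_toy`, `users` and `P` exist only in this example.

```python
import networkx as nx
from networkx.algorithms import bipartite

# Invented users and hashtags
B_toy = nx.Graph()
B_toy.add_nodes_from(['u1', 'u2', 'u3'], bipartite=0)        # users
B_toy.add_nodes_from(['gym', 'yoga', 'vegan'], bipartite=1)  # hashtags
B_toy.add_edges_from([
    ('u1', 'gym'), ('u1', 'yoga'),
    ('u2', 'gym'), ('u2', 'yoga'),
    ('u3', 'yoga'), ('u3', 'vegan'),
])

users = {n for n, d in B_toy.nodes(data=True) if d['bipartite'] == 0}

# users become linked if they share at least one hashtag;
# the edge weight counts the shared hashtags
P = bipartite.weighted_projected_graph(B_toy, users)

print(sorted(P.edges(data='weight')))
```

Here `u1` and `u2` share two hashtags and end up with an edge of weight 2, while `u3` shares only `yoga` with each of them and gets edges of weight 1.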

In [0]:
# Extract a set of nodes with level 0
top_nodes = {n for n, d in B.nodes(data=True) if d['bipartite']==0}

# the remaining nodes are then level 1
bottom_nodes = set(B) - top_nodes

In [0]:
# Let's project this graph using a weighted projection
G_proj = bipartite.weighted_projected_graph(B, top_nodes)

In [0]:
# Again, we can identify communities
bi_communities = community.best_partition(G_proj, resolution = 1)
nx.set_node_attributes(G_proj, bi_communities, 'community')

In [0]:
# Calculate eigenvector centrality and set it as an attribute
bi_eigenvector = nx.eigenvector_centrality(G_proj)
nx.set_node_attributes(G_proj, bi_eigenvector, 'eigenvector_centrality')

In [0]:
# Create a new attribute "activity" - or propensity to spam
nx.set_node_attributes(G_proj, dict(posts_df.id_owner.value_counts()), 'activity' )

In [0]:
# Do spammers connect more in terms of spamming about the same stuff?

print(nx.numeric_assortativity_coefficient(G_proj,'activity'))
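
For intuition about how to read this coefficient, here is a minimal sketch with two invented four-node graphs that sit at the two extremes. The activity values are made up.

```python
import networkx as nx

activity = {'a': 1, 'b': 1, 'c': 5, 'd': 5}

# Assortative toy case: low-activity users link to each other, as do high-activity users
G_same = nx.Graph([('a', 'b'), ('c', 'd')])
nx.set_node_attributes(G_same, activity, 'activity')

# Disassortative toy case: every edge joins a low- and a high-activity user
G_mixed = nx.Graph([('a', 'c'), ('b', 'd')])
nx.set_node_attributes(G_mixed, activity, 'activity')

r_same = nx.numeric_assortativity_coefficient(G_same, 'activity')
r_mixed = nx.numeric_assortativity_coefficient(G_mixed, 'activity')
print(round(r_same, 3), round(r_mixed, 3))
```

Values close to 1 mean that accounts with similar activity tend to be connected, values close to -1 mean that active accounts mostly connect to inactive ones, and values around 0 indicate no clear pattern.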

In [0]:
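# One row per user; orient='index' keeps a separate dtype for each attribute column
graph_proj_df = pd.DataFrame.from_dict(dict(G_proj.nodes(data=True)), orient='index')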

In [0]:
graph_proj_df.head()

In [0]:
# Find the 5 most central for each identified community
user_per_com = graph_proj_df.groupby('community')['eigenvector_centrality'].nlargest(5)

In [0]:
user_per_com

In [0]:
profile = instaloader.Profile.from_id(L.context, 1929431148)

In [0]:
print(profile.biography)
print(profile.username)

In [0]:
nx.write_gexf(G_proj, 'G_proj.gexf')