# Practical assignment

In this assignment you will analyse user comments from the website [reddit.com](http://www.reddit.com). Reddit users can post content (e.g., a website, a question, news), which can be up- or downvoted. Posts with many upvotes tend to appear at the top of their category or on the front page of Reddit. The website is quite popular and has over half a billion monthly visitors. At times, appearing on the front page of Reddit generates so much traffic to the posted website that it crashes.

The community is organised in various subreddits, such as news, movies, music, et cetera. You will analyse user comments from the [politics subreddit](https://www.reddit.com/r/politics/). These user comments are either replies to the starting post, or replies to other users’ comments. The latter will be the basis for the communication network that you will construct here.

First, let us get started with the data.


## Data

If you have not done so already, download the data from https://storage.googleapis.com/css-files/reddit_discussion_network_2016_10.csv. This file is 377MB, so it may take some time to download. If you have trouble working with this dataset on your computer, please try the alternative: https://storage.googleapis.com/css-files/reddit_discussion_network_2015_02.csv, which is only 46MB.

### Importing libraries

In [1]:
import random

# For network
import igraph as ig
import louvain

# For NLP
import nltk
import gensim

# For data handling
import pandas as pd

# For calculation
import numpy as np
import scipy.stats  # import the submodule explicitly; scipy.stats is used below

# For plotting
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

### Reading in data

First, read in the Reddit data.

In [2]:
data_path = '../../../data/reddit/'
file_name = 'reddit_discussion_network_2016_10.csv'
comments_df = pd.read_csv(data_path + file_name)

Which columns does this dataset have?

In [None]:
comments_df.columns

The first post:

In [None]:
comments_df.head(1)

To help speed up the analysis, we already computed topic and sentiment values for each post. You can read the data as follows:

In [3]:
topic_sentiment_df = pd.read_csv(data_path + 'topic_sentiment_reddit.csv')

For each post, the topic distribution is saved in the columns `t_0` to `t_14` (15 topics).

In [None]:
topic_sentiment_df.head(5)

We then link this back to the original comments as follows:

In [4]:
comments_enriched_df = comments_df.merge(topic_sentiment_df)
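Called without an `on` argument, `merge` joins on every column name the two frames share, and the default inner join drops rows that have no match. The toy frames below illustrate this behaviour. The key column `comment_id` and all values are made up for the example and are not taken from the Reddit files.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables; `comment_id` is an
# assumed shared key, not necessarily the column name in the real data.
toy_comments = pd.DataFrame({
    'comment_id': [1, 2, 3],
    'author_from': ['ann', 'bob', 'cat'],
})
toy_scores = pd.DataFrame({
    'comment_id': [1, 2, 4],
    'pos': [0.2, 0.0, 0.5],
})

# No `on=` given: pandas joins on the shared column(s), here `comment_id`.
# The default how='inner' keeps only ids present in both frames.
toy_enriched = toy_comments.merge(toy_scores)
print(toy_enriched)
```

Comparing `len(comments_df)` with `len(comments_enriched_df)` is a quick way to see whether any comments were dropped by the join.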

We can now delete the `topic_sentiment_df` variable to save some memory.

In [5]:
del topic_sentiment_df

Let us calculate the interaction between all users.

In [6]:
grp_df = comments_enriched_df.groupby(['author_from', 'author_reply_to'])
interaction_df = grp_df.mean(numeric_only=True)  # average the numerical columns per author pair
interaction_df['count'] = grp_df.size()

This only keeps the numerical columns (and throws away the text). We can now use this to build the network.
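To see what this aggregation produces, here is a minimal sketch on a made-up table with the same two author columns. The `comment` and `pos` values are invented for illustration.

```python
import pandas as pd

# Toy comment table: ann replied to bob twice, bob replied to ann once.
toy = pd.DataFrame({
    'author_from':     ['ann', 'ann', 'bob'],
    'author_reply_to': ['bob', 'bob', 'ann'],
    'comment':         ['hi', 'again', 'hello'],
    'pos':             [0.2, 0.4, 0.1],
})

grp = toy.groupby(['author_from', 'author_reply_to'])
# numeric_only=True makes dropping the text column explicit.
toy_interaction = grp.mean(numeric_only=True)
# size() counts the comments per (from, to) pair; it aligns on the same index.
toy_interaction['count'] = grp.size()
print(toy_interaction)
```

Each row of the result is one directed edge: the index holds the two endpoints and the columns become edge attributes.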

In [7]:
G = ig.Graph.TupleList(
        edges=interaction_df.reset_index().values,
        edge_attrs=interaction_df.columns,
        directed=True)

There are now four smaller subassignments. You can choose any single one to work on. Hints for doing some of the analysis are provided after the description of the subassignments. Most of the techniques involved should already have been explained during the lectures, but these hints provide some more explicit help.



# Topics and centrality

Users that are central tend to interact with many different (central) users. There are several possible explanations. Users may become more central if they secure a position of authority in a single topic: everybody interacts with them because they are authoritative on that subject. Alternatively, users may be more central because they are active in many different topics. Finally, users may simply be more central because they are very active themselves, and every comment is likely to get a reply.

Techniques necessary
- Topic detection
- Centrality

## Topic modelling

We now calculate the average values for each user as follows:

In [8]:
topic_sentiment_user_df = comments_enriched_df.groupby('author_from').mean()

In [None]:
topic_sentiment_user_df.head(5)

You can get the values for a particular user:

In [None]:
topic_sentiment_user_df.loc['---CAISSON---']

You can easily select only the topic columns:

In [None]:
topic_sentiment_user_df.loc[:, 't_0':'t_14'].head(5)

You can also combine the selection of users and of topics:

In [None]:
topic_sentiment_user_df.loc['---CAISSON---', 't_0':'t_14']

### Entropy

One way to measure whether a user posts mostly about one topic or is active in multiple topics is **entropy** (https://en.wikipedia.org/wiki/Entropy_(information_theory)).

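This is an example where we have two topics. Because the probability of both topics is equal (0.5), the entropy is at its maximum for two topics (ln 2 ≈ 0.69, since `scipy.stats.entropy` uses the natural logarithm by default).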

In [None]:
scipy.stats.entropy([0.5, 0.5])

Because in the following example all the probability is concentrated on one topic, the entropy is low (0).

In [None]:
scipy.stats.entropy([1, 0])
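To make these two values concrete, the following sketch computes the same quantity by hand, assuming the natural logarithm that `scipy.stats.entropy` uses by default. The helper name `shannon_entropy` is ours and is not part of any library.

```python
import math

def shannon_entropy(probabilities):
    """Shannon entropy in nats; terms with p == 0 contribute nothing."""
    total = sum(probabilities)
    normalised = [p / total for p in probabilities]  # rescale so values sum to 1
    return sum(-p * math.log(p) for p in normalised if p > 0)

even_split = shannon_entropy([0.5, 0.5])   # two equally likely topics
single_topic = shannon_entropy([1, 0])     # everything on one topic
uniform_15 = shannon_entropy([1] * 15)     # spread evenly over 15 topics

print(even_split, single_topic, uniform_15)
```

With 15 topics the entropy ranges from 0 (a single topic) to ln 15 ≈ 2.71 (all topics equally likely), which helps when reading the per-user values.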

The overall entropy for all users can be calculated easily:

In [None]:
user_topic_entropy = topic_sentiment_user_df.loc[:, 't_0':'t_14'].apply(scipy.stats.entropy, axis=1)

## Centrality

There are various possible centralities. Betweenness is too slow to calculate for this network, so we will focus on eigenvector centrality, PageRank and (in- or out-)degree. You can try any one of them; just keep your choice in mind when interpreting further results. You can get the centralities by running the following:

In [9]:
G.es['weight'] = G.es['count']
G.vs['eigenvector_centrality'] = G.eigenvector_centrality(weights='weight')
G.vs['pagerank'] = G.pagerank(weights='weight')
G.vs['indegree'] = G.degree(mode=ig.IN)
G.vs['indegree_weighted'] = G.strength(mode=ig.IN, weights='weight')
G.vs['outdegree_weighted'] = G.strength(mode=ig.OUT, weights='weight')
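The difference between `degree` and `strength` matters for interpretation: the first counts distinct neighbours, the second sums edge weights. A pure-Python sketch on a made-up edge list (names and reply counts are hypothetical) shows the distinction:

```python
from collections import defaultdict

# (source, target, number of replies); toy values for illustration only
toy_edges = [('ann', 'bob', 3), ('cat', 'bob', 1), ('bob', 'ann', 2)]

indegree = defaultdict(int)           # number of distinct users replying to someone
indegree_weighted = defaultdict(int)  # total number of replies received
for source, target, count in toy_edges:
    indegree[target] += 1
    indegree_weighted[target] += count

print(dict(indegree))
print(dict(indegree_weighted))
```

Here bob is replied to by two different users but receives four replies in total, so the two measures can rank users differently.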


We can easily put all attributes from the graph in a pandas dataframe.

In [None]:
user_df = pd.DataFrame({attr: G.vs[attr] for attr in G.vertex_attributes()}).set_index('name')

Now let us also calculate the topic entropy for each user.

In [None]:
user_df['topic_entropy'] = user_topic_entropy

Note that information is not available for all users, because not all users wrote comments themselves in this period. For example, BigDaddy2014 received 187 replies, but did not write a single comment during this period.

In [None]:
user_df.loc['BigDaddy2014', :]

For this particular assignment, it might be useful to filter users. If you include *all* users, then users who have written only a few comments might have a topic distribution skewed towards a few topics, simply because they have not been very active. We can plot the results for users that have posted at least 50 comments.

In [None]:
user_df[user_df['outdegree_weighted'] >= 50].plot('eigenvector_centrality', 'topic_entropy', kind='scatter')

In [None]:
user_df[user_df['outdegree_weighted'] >= 50].corr()

**Todo:**
- Decide which users you will analyze
- Compute the centrality for each user
- Compute the topic distribution for each user. 
- Analyze whether there is a relation between the two measures.

# Sentiment and centrality 

In order to become central in the commenter network, sufficiently many people have to respond to your comments. Enticing others to respond is thus essential. This is more likely when comments are controversial, i.e. when many people disagree with them. What is controversial depends on the environment in which a statement is made. At any rate, we could expect a controversial statement to be met with criticism. We should then expect that central people are more likely to be criticised, and that they attract relatively many negative comments.

Techniques necessary
- Sentiment analysis
- Centrality

## Sentiment analysis

In [None]:
from empath import Empath
lexicon = Empath()

Take a look at post number 340:

In [None]:
print(comments_df.loc[340, 'comment'])

Analyze the comment using Empath

In [None]:
def tokenize(text):
    return list(gensim.utils.simple_preprocess(text))

In [None]:
lexicon.analyze(tokenize(comments_df.iloc[[340]]['comment'].values[0]), normalize=True)

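Again, we have precomputed the sentiment values (but if you have time, extend the analysis and consider other features as well, such as emotion).

This is very similar to what we did before, but now we are looking at responses, so we aggregate over incoming edges (i.e. over 'author_reply_to'). The cell below sums the positive and the negative sentiment on each user's incoming edges and takes the difference.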

In [None]:
G.vs['sentiment_strength_pos'] = G.strength(mode=ig.IN, weights='pos')
G.vs['sentiment_strength_neg'] = G.strength(mode=ig.IN, weights='neg')
G.vs['sentiment_strength'] = np.array(G.vs['sentiment_strength_pos']) - np.array(G.vs['sentiment_strength_neg'])
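The same aggregation can be sketched in pandas on a toy edge table, which may make it easier to see what `strength(mode=ig.IN, weights=...)` adds up. The column names `pos` and `neg` follow the edge attributes used above; the values are made up.

```python
import pandas as pd

# Toy edge table: one row per (author_from, author_reply_to) pair with the
# mean positive and negative sentiment of those replies (invented values).
toy_edges = pd.DataFrame({
    'author_from':     ['ann', 'cat', 'bob'],
    'author_reply_to': ['bob', 'bob', 'ann'],
    'pos':             [0.25, 0.0, 0.5],
    'neg':             [0.0, 0.5, 0.25],
})

# Summing over incoming edges corresponds to the weighted in-strength.
received = toy_edges.groupby('author_reply_to')[['pos', 'neg']].sum()
received['sentiment_strength'] = received['pos'] - received['neg']
print(received)
```

Because this is a sum, users with many incoming edges can reach large absolute values regardless of tone. Dividing by the indegree would give a per-edge average instead, which may be easier to compare across users.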

In [None]:
user_df = pd.DataFrame({attr: G.vs[attr] for attr in G.vertex_attributes()}).set_index('name')

In [None]:
user_df.corr()

In [None]:
user_df.plot('indegree', 'sentiment_strength', kind='scatter')

# Communities of interest

Earlier today you learned that interaction is often homophilous: people with the same interests are more likely to be connected to each other. We will look into this here on the basis of topics. Two questions are central in this assignment: (1) are users that share topics more likely to be connected? and (2) does this create communities of interest?

Techniques necessary
- Topic modelling
- Assortativity
- Community detection

## Topic modelling

We already have the average topic distribution for each user. It is easier to work with a single topic per user, so we assign each user the topic with the highest average weight.

In [None]:
user_topic = topic_sentiment_user_df.loc[:, 't_0':'t_14'].idxmax(axis=1)
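As a small illustration of what `idxmax(axis=1)` returns, consider a toy frame with three topic columns. The users and weights are invented; only the `t_<number>` naming follows the data.

```python
import pandas as pd

toy_topics = pd.DataFrame(
    {'t_0': [0.7, 0.1], 't_1': [0.2, 0.3], 't_2': [0.1, 0.6]},
    index=['ann', 'bob'])

# idxmax(axis=1) returns, per row, the column label holding the maximum.
dominant = toy_topics.idxmax(axis=1)
# Dropping the 't_' prefix turns the label into an integer topic id.
dominant_id = dominant.str[2:].astype(int)
print(dominant_id)
```

Note that this discards how dominant the topic is: a user with weights 0.51 and 0.49 gets the same label as one with 0.99 and 0.01.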

Let us put this information from the dataframe in the graph. Because we only have topic information if somebody wrote a post, we will look at the subgraph of people having written at least some number of posts.

In [10]:
H = G.subgraph(G.vs.select(outdegree_weighted_ge=50))

In [None]:
H.vs['user_topic'] = user_topic.loc[H.vs['name']].str[2:].astype(int).tolist()

To measure the distance between two topic distributions, we will use the Jensen-Shannon divergence:

In [11]:
from scipy.stats import entropy
from numpy.linalg import norm

def JSD(P, Q):
    _P = P / norm(P, ord=1)
    _Q = Q / norm(Q, ord=1)
    _M = 0.5 * (_P + _Q)
    return 0.5 * (entropy(_P, _M) + entropy(_Q, _M))
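A quick sanity check helps when interpreting the divergence. The sketch below re-implements the same formula under a different name (`jsd`, to avoid clashing with the function above) and evaluates it on hand-picked distributions: identical distributions should give 0, and distributions with no overlap should give ln 2, the maximum when using the natural logarithm.

```python
import numpy as np
from scipy.stats import entropy

def jsd(p, q):
    """Jensen-Shannon divergence (natural log) between two distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()          # normalise so each input sums to 1
    q = q / q.sum()
    m = 0.5 * (p + q)        # mixture of the two distributions
    # entropy(a, b) with two arguments is the Kullback-Leibler divergence
    return 0.5 * (entropy(p, m) + entropy(q, m))

same = jsd([0.5, 0.5], [0.5, 0.5])
disjoint = jsd([1, 0], [0, 1])
forward = jsd([0.9, 0.1], [0.1, 0.9])
backward = jsd([0.1, 0.9], [0.9, 0.1])
print(same, disjoint, forward, backward)
```

Unlike the Kullback-Leibler divergence, the result is symmetric in its arguments, which is convenient when comparing the two endpoints of an edge.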

In [13]:
from random import randint

topic_jsd = []
edge_weights = []

# Traverse all edges
for es in H.es:
    author_name1 = H.vs[es.target]['name']
    author_name2 = H.vs[es.source]['name']
    
    # Compute topic distributions, Jensen-Shannon Divergence and save values
    topic_dist1 = list(topic_sentiment_user_df.loc[author_name1, 't_0':'t_14'])
    topic_dist2 = list(topic_sentiment_user_df.loc[author_name2, 't_0':'t_14'])
    
    topic_jsd.append(JSD(topic_dist1, topic_dist2))
    edge_weights.append(es["weight"])

    # Sample another vertex
    vertex3 = randint(0, len(H.vs)-1)
    
    if H.are_connected(es.target, vertex3): #Alternative: resample vertex?
        continue
        
    topic_dist3 = list(topic_sentiment_user_df.loc[H.vs[vertex3]['name'], 't_0':'t_14'])
    topic_jsd.append(JSD(topic_dist1, topic_dist3))
    edge_weights.append(0)  # since they are not connected 
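The loop above skips the comparison whenever the sampled vertex happens to be connected. One possible alternative, hinted at in the code comment, is to resample until a non-neighbour is found. The sketch below shows that idea on a plain adjacency dictionary. The helper `sample_non_neighbour` is hypothetical, ignores edge direction, and is not what the notebook does.

```python
import random

def sample_non_neighbour(vertex, n_vertices, neighbours, rng, max_tries=100):
    """Draw a random vertex that is neither `vertex` nor adjacent to it.

    `neighbours` maps a vertex id to the set of ids it is connected to.
    Returns None if no suitable vertex is found within `max_tries` draws.
    """
    adjacent = neighbours.get(vertex, set())
    for _ in range(max_tries):
        candidate = rng.randrange(n_vertices)
        if candidate != vertex and candidate not in adjacent:
            return candidate
    return None

# Toy graph: vertex 0 is connected to 1 and 2, vertex 3 is isolated.
toy_neighbours = {0: {1, 2}, 1: {0}, 2: {0}, 3: set()}
rng = random.Random(42)
picked = sample_non_neighbour(0, 4, toy_neighbours, rng)
print(picked)
```

The `max_tries` cap keeps the loop from running forever for vertices that are connected to almost everyone.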


In [14]:
print(scipy.stats.spearmanr(topic_jsd, edge_weights))

SpearmanrResult(correlation=-0.1506997713462051, pvalue=0.0)


## Assortativity

The assortativity is easy to calculate:

In [None]:
H.assortativity_nominal(types='user_topic')
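To interpret the number, it can help to see how a nominal assortativity coefficient is built from the mixing matrix. The sketch below follows Newman's definition for directed, unweighted edges. It is meant as an illustration of the idea rather than a reproduction of igraph's implementation, and the helper name is ours.

```python
import numpy as np

def nominal_assortativity(edges, types):
    """Newman's assortativity coefficient for a categorical vertex attribute.

    edges: iterable of (source, target) vertex ids, treated as directed.
    types: list with one integer category per vertex.
    """
    n_types = max(types) + 1
    e = np.zeros((n_types, n_types))
    for source, target in edges:
        e[types[source], types[target]] += 1
    e /= e.sum()                 # fraction of edges between each pair of types
    a = e.sum(axis=1)            # share of edges leaving each type
    b = e.sum(axis=0)            # share of edges entering each type
    expected = (a * b).sum()     # same-type share expected by chance
    return (np.trace(e) - expected) / (1 - expected)

toy_types = [0, 0, 1, 1]
within_only = nominal_assortativity([(0, 1), (1, 0), (2, 3), (3, 2)], toy_types)
between_only = nominal_assortativity([(0, 2), (2, 0), (1, 3), (3, 1)], toy_types)
print(within_only, between_only)
```

A value of 1 means that edges only connect users with the same topic, a value near 0 means that topic tells us little about who talks to whom, and negative values mean that users mostly talk across topics.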

## Community detection

The most difficult part of community detection is deciding which method and, sometimes, which resolution is appropriate. Modularity is the most commonly used quality function, and a partition optimising it can be obtained as follows:

<div class="alert alert-warning">
Detecting communities may take some time.
</div>

In [None]:
mod_partition = louvain.find_partition(H, 'Modularity', weight='count')

Now compare it to the partition based on the topics.

In [None]:
topic_partition = ig.VertexClustering.FromAttribute(H, 'user_topic')
mod_partition.compare_to(topic_partition, 'nmi')
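Normalised mutual information (NMI) compares two partitions without caring about how the groups are labelled. The pure-Python sketch below uses one common normalisation, 2 * I(A;B) / (H(A) + H(B)). We assume, but have not verified here, that this matches the definition behind `compare_to(..., 'nmi')`, so please check the igraph documentation before relying on exact values.

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalised mutual information, 2 * I(A;B) / (H(A) + H(B))."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))

    def entropy(counts):
        return sum(-(c / n) * math.log(c / n) for c in counts.values())

    mutual_info = sum(
        (c / n) * math.log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in count_ab.items())
    h_a, h_b = entropy(count_a), entropy(count_b)
    if h_a + h_b == 0:   # both partitions put everyone in a single group
        return 1.0
    return 2 * mutual_info / (h_a + h_b)

relabelled = nmi([0, 0, 1, 1], [1, 1, 0, 0])   # same grouping, other labels
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])    # groupings share no information
print(relabelled, unrelated)
```

Values close to 1 indicate that the detected communities largely coincide with the topic groups, while values close to 0 indicate that they are unrelated.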

Alternatively, you can try out CPM with various resolution values. Good resolution values are usually quite small, but this may depend on the weights. A resolution parameter of around 1e-5 seems to be most interesting; the cell below uses 0.01 as a starting point.

In [None]:
CPM_partition = louvain.find_partition(H, 'CPM', weight='count', resolution_parameter=0.01)

In [None]:
CPM_partition.sizes()[:10]

In [None]:
CPM_partition.compare_to(topic_partition, 'nmi')

# Sentiment and language across communities

Following social balance theory, it is possible that the commenter network is polarized (not implausible given the divisive US politics). Simply looking at communication while disregarding the valence of the link (i.e. whether it was negative or positive) may distort our view of the integration of the network. We will use sentiment analysis of the comments to determine whether the links are in fact negative or positive. In this assignment two questions are central: (1) is sentiment within groups different from sentiment between groups? and (2) does the valence of links change the community structure?

Techniques necessary
- Sentiment analysis
- Community detection

## Sentiment analysis

First construct a single measure for whether a link is positive or negative.

In [None]:
H.es['sentiment'] = np.array(H.es['pos']) - np.array(H.es['neg'])

Now look at whether it is more positive or negative between the communities we previously detected.

In [None]:
edge_sentiment_group_df = pd.DataFrame({'sentiment': H.es['sentiment'],
                                        'crossing': CPM_partition.crossing()})
edge_sentiment_group_df.groupby('crossing').agg(['mean', 'std'])
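The `crossing()` call marks, for every edge, whether its two endpoints belong to different communities. A minimal sketch of the same comparison on a toy graph, with made-up memberships and sentiment values, looks as follows:

```python
import pandas as pd

toy_membership = [0, 0, 1, 1]                          # community of each vertex
toy_edges = [(0, 1), (1, 0), (0, 2), (3, 1), (2, 3)]   # (source, target)
toy_sentiment = [0.5, 0.25, -0.5, -0.25, 0.75]         # invented edge sentiment

# An edge "crosses" when its endpoints sit in different communities.
crossing = [toy_membership[s] != toy_membership[t] for s, t in toy_edges]

toy_df = pd.DataFrame({'sentiment': toy_sentiment, 'crossing': crossing})
summary = toy_df.groupby('crossing')['sentiment'].mean()
print(summary)
```

In this toy example the edges within communities are positive on average and the crossing edges negative, which is the pattern a polarized network would show.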

## Community detection

Modularity is ordinarily not suited for community detection if the weights are negative. This can be corrected though, as is illustrated below.

In [None]:
H_positive = H.subgraph_edges(H.es.select(sentiment_gt=0), delete_vertices=False)
H_negative = H.subgraph_edges(H.es.select(sentiment_lt=0), delete_vertices=False)

membership, quality = louvain.find_partition_multiplex([
    louvain.Layer(graph=H_positive, method='Modularity', weight='sentiment', layer_weight=1.0),
    louvain.Layer(graph=H_negative, method='Modularity', weight='sentiment', layer_weight=-1.0)])
balance_partition = ig.VertexClustering(H, membership)

Let us compare this partition to the previous partition we got using modularity.

In [None]:
mod_partition.compare_to(balance_partition, 'nmi')