# Practical assignment

In this assignment you will analyse user comments from the website [reddit.com](http://www.reddit.com). Reddit users can post content (e.g., a website, a question, news), which can be up- or downvoted. Posts with many upvotes tend to appear at the top of their category or on the front page of Reddit. The website is quite popular and has over half a billion monthly visitors. At times, appearing on the front page of Reddit generates so much traffic to the posted website that it crashes.

The community is organised in various subreddits, such as news, movies, music, et cetera. You will analyse user comments from the [politics subreddit](https://www.reddit.com/r/politics/). These user comments are either replies to the starting post, or replies to other users’ comments. The latter will be the basis for the communication network that you will construct here.

First let us get started with the data.


## Data

If you have not done so already, download all data from https://storage.googleapis.com/css-files/reddit_discussion_network_2016_10.csv. This file is 377MB, so it may take some time to download. If you have trouble working with this dataset on your computer, please try the alternative, https://storage.googleapis.com/css-files/reddit_discussion_network_2015_02.csv, which is only 46MB.

### Importing libraries

In [None]:
import random

# For network
import igraph as ig
import louvain

# For NLP
import nltk
import gensim

# For data handling
import pandas as pd

# For calculation
import numpy as np
import scipy

# For plotting
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

### Reading in data

First, read in the Reddit data.

In [None]:
data_path = '../../../data/reddit/'
file_name = 'reddit_discussion_network_2016_10.csv'
comments_df = pd.read_csv(data_path + file_name)

Which columns does this dataset have?

In [None]:
comments_df.columns

The first post:

In [None]:
comments_df.head(1)

To help speed up the analysis, we have already computed topic and sentiment values for each post. You can read the data as follows:

In [None]:
topic_sentiment_df = pd.read_csv(data_path + 'topic_sentiment_reddit.csv')

For each post, the topic distribution is saved in the columns `t_0` to `t_14` (15 topics).

In [None]:
topic_sentiment_df.head(5)

We then link this back to the original comments as follows:

In [None]:
comments_enriched_df = comments_df.merge(topic_sentiment_df)

We can now delete the `topic_sentiment_df` variable to save some memory.

In [None]:
del topic_sentiment_df

Let us aggregate the interactions between each pair of users.

In [None]:
grp_df = comments_enriched_df.groupby(['author_from', 'author_reply_to'])
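# numeric_only=True is needed in recent pandas versions, which otherwise fail on text columns
interaction_df = grp_df.mean(numeric_only=True)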
interaction_df['count'] = grp_df.size()

This only keeps the numerical columns (and throws away the text). We can now use this to build the network.

In [None]:
G = ig.Graph.TupleList(
        edges=interaction_df.reset_index().values,
        edge_attrs=interaction_df.columns,
        directed=True)
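
The aggregation behind this network can be illustrated on a toy example. The sketch below uses plain Python instead of pandas and made-up comments; the author names and values are hypothetical, and only the column names `author_from`, `author_reply_to`, `pos` and `neg` come from the dataset. Each pair of users ends up as one directed edge carrying the mean values and the number of comments.

```python
from collections import defaultdict

# Hypothetical toy comments: (author_from, author_reply_to, pos, neg).
comments = [
    ("alice", "bob", 0.2, 0.6),
    ("alice", "bob", 0.4, 0.2),
    ("bob", "alice", 0.1, 0.1),
]

# Group the numerical values by (author_from, author_reply_to) pair.
groups = defaultdict(list)
for author_from, author_reply_to, pos, neg in comments:
    groups[(author_from, author_reply_to)].append((pos, neg))

# One directed edge per pair: mean of each value plus the number of comments.
edges = []
for (source, target), values in sorted(groups.items()):
    count = len(values)
    mean_pos = sum(p for p, _ in values) / count
    mean_neg = sum(n for _, n in values) / count
    edges.append((source, target, mean_pos, mean_neg, count))

for edge in edges:
    print(edge)
```

The `count` value plays the role of an edge weight: it records how often one user replied to the other.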

There are now four smaller subassignments to work on. Please work through them in order. The first assignment contains a bit more help, and you increasingly have to do more yourself. Most of the techniques involved have already been explained during the lectures, but these hints provide some more explicit help.

# Topics and centrality

Central users tend to interact with many different (central) users. There are several ways in which this could come about. A user may become central by securing a position of authority in a single topic, in which case everybody interacts with that user because of their authority on the subject. Alternatively, a user may be central because they are active in many different topics. Finally, a user may simply be central because they write many comments, each of which is likely to get a reply.

Techniques necessary
- Topic detection
- Centrality

## Topic modelling

We now calculate the average values for each user as follows:

In [None]:
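# numeric_only=True is needed in recent pandas versions, which otherwise fail on text columns
topic_sentiment_user_df = comments_enriched_df.groupby('author_from').mean(numeric_only=True)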

In [None]:
topic_sentiment_user_df.head(5)

You can show the values for a particular user:

In [None]:
topic_sentiment_user_df.loc['---CAISSON---']

You can easily select only the topic columns:

In [None]:
topic_sentiment_user_df.loc[:, 't_0':'t_14'].head(5)

You can also combine the selection of a user and of the topics:

In [None]:
topic_sentiment_user_df.loc['---CAISSON---', 't_0':'t_14']

### Entropy

One way to quantify whether a user posts mostly about one topic or is active in multiple topics is to use **entropy** (https://en.wikipedia.org/wiki/Entropy_(information_theory)).

This is an example where we have two topics. Because the probability of both topics is equal (0.5), the entropy is high.

In [None]:
scipy.stats.entropy([0.5, 0.5])

Because in the following example all the probability is concentrated on one topic, the entropy is low (0).

In [None]:
scipy.stats.entropy([1, 0])
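
To see what these numbers mean, here is a minimal sketch of the same calculation in plain Python. It assumes the defaults of `scipy.stats.entropy`: the input is normalised to sum to 1 and the natural logarithm is used. The function name `shannon_entropy` is introduced here for illustration only.

```python
import math

def shannon_entropy(weights):
    """Shannon entropy (natural log) of non-negative weights, normalised to sum to 1."""
    total = sum(weights)
    probabilities = [w / total for w in weights]
    # Terms with p == 0 contribute nothing (the limit of p * log p is 0).
    return -sum(p * math.log(p) for p in probabilities if p > 0)

two_equal = shannon_entropy([0.5, 0.5])    # equals ln(2)
one_topic = shannon_entropy([1, 0])        # equals 0
fifteen_equal = shannon_entropy([1] * 15)  # equals ln(15), the maximum for 15 topics
```

With 15 topics, the entropy of a user therefore lies between 0 (all comments on one topic) and ln(15) (comments spread evenly over all topics).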

The overall entropy for all users can be calculated easily:

In [None]:
user_topic_entropy = topic_sentiment_user_df.loc[:, 't_0':'t_14'].apply(scipy.stats.entropy, axis=1)

## Centrality

There are various possible centrality measures. Betweenness is too slow to calculate for this network, so you should focus only on eigenvector centrality, PageRank and (in- or out-)degree. You can try any one of them, but keep your choice in mind when interpreting further results.
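
To make the difference between these measures concrete, the sketch below computes in-degree, out-degree and PageRank on a small hypothetical reply network in plain Python. It is an illustration under simplifying assumptions (unweighted edges, a fixed number of power iterations), not the way to do it on the real graph, where python-igraph methods such as `Graph.degree`, `Graph.pagerank` and `Graph.eigenvector_centrality` are the natural choice.

```python
# Hypothetical toy reply network: (author_from, author_reply_to).
edges = [("a", "b"), ("c", "b"), ("b", "a"), ("d", "a"), ("d", "b")]
nodes = sorted({n for edge in edges for n in edge})

# In-degree: how many users replied to you. Out-degree: how many users you replied to.
in_degree = {n: 0 for n in nodes}
out_degree = {n: 0 for n in nodes}
for source, target in edges:
    out_degree[source] += 1
    in_degree[target] += 1

def pagerank(nodes, edges, damping=0.85, iterations=100):
    """PageRank by power iteration on an unweighted directed edge list."""
    out_deg = {n: 0 for n in nodes}
    for source, _ in edges:
        out_deg[source] += 1
    rank = {n: 1 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[n] for n in nodes if out_deg[n] == 0)
        new_rank = {n: (1 - damping + damping * dangling) / len(nodes) for n in nodes}
        # Every node passes its rank on in equal shares along its outgoing edges.
        for source, target in edges:
            new_rank[target] += damping * rank[source] / out_deg[source]
        rank = new_rank
    return rank

rank = pagerank(nodes, edges)
most_central = max(rank, key=rank.get)
```

In this toy network user `d` has the highest out-degree but receives no replies, so its PageRank stays at the minimum. This is the kind of difference to keep in mind when choosing a measure.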

In [None]:
# TODO: Calculate the centrality values and store them in vertex attributes

We can easily put all attributes from the graph in a pandas dataframe.

In [None]:
user_df = pd.DataFrame({attr: G.vs[attr] for attr in G.vertex_attributes()}).set_index('name')

Now let us also store the topic entropy for each user.

In [None]:
user_df['topic_entropy'] = user_topic_entropy

Note that information is not available for all users, because not all users have written comments themselves in this period. For example, BigDaddy2014 was replied to 187 times, but did not write a single comment during this period.

In [None]:
user_df.loc['BigDaddy2014', :]

For this particular assignment, it might be useful to filter users. If you include *all* users, then users who have written only a few comments might have a topic distribution skewed towards a few topics, simply because they have not been very active.
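
One possible way to approach this is sketched below: filter out inactive users, then compute a rank correlation between centrality and topic entropy. The user values and the threshold of 50 comments are hypothetical, and Spearman's rank correlation is only one of several reasonable choices. It is implemented here in plain Python so that each step is visible.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical per-user values: (comments written, centrality, topic entropy).
users = {
    "u1": (120, 0.30, 2.1),
    "u2": (80, 0.20, 1.8),
    "u3": (60, 0.10, 1.2),
    "u4": (2, 0.05, 0.0),  # too few comments for a reliable topic distribution
    "u5": (200, 0.40, 2.4),
}
min_comments = 50  # arbitrary threshold chosen for this illustration
active = [v for v in users.values() if v[0] >= min_comments]
rho = spearman([c for _, c, _ in active], [e for _, _, e in active])
```

A rank correlation is used here because centrality values are typically very skewed, which makes a linear correlation hard to interpret.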

In [None]:
# TODO: Analyze whether there is a relation between the centrality of users and their topic distributions.

# Sentiment and centrality 

To become central in the commenter network, enough people have to respond to your comments, so enticing others to respond is essential. We could hypothesise that this is more likely when comments are controversial, i.e. when many people disagree with them. What counts as controversial depends on the environment in which a statement is made. At any rate, we could expect a controversial statement to be met with criticism. We should then expect central users to be criticised more often and to attract relatively many negative comments.

Techniques necessary
- Sentiment analysis
- Centrality

## Sentiment analysis

Again, we have precomputed the sentiment values (but if you have time, extend the analysis and consider other features as well, such as emotion). The average sentiment is available in the edge attributes `pos` and `neg`.
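
As a starting point, the sketch below shows one way to turn edge-level sentiment into a per-user value: the average negativity of the comments a user receives. The toy edges are hypothetical, and the choice to weight each edge by its `count` is an assumption made for this illustration. Because `neg` on an edge is already an average over `count` comments, the weighting recovers the average per received comment.

```python
from collections import defaultdict

# Hypothetical toy edges: (author_from, author_reply_to, pos, neg, count).
edges = [
    ("a", "c", 0.1, 0.5, 2),
    ("b", "c", 0.3, 0.2, 1),
    ("c", "a", 0.4, 0.0, 3),
]

weighted_neg = defaultdict(float)  # sum of neg * count over incoming edges
total_count = defaultdict(int)     # number of comments received
for source, target, pos, neg, count in edges:
    weighted_neg[target] += neg * count
    total_count[target] += count

# Average negativity per received comment, only for users who received comments.
received_neg = {user: weighted_neg[user] / total_count[user] for user in total_count}
```

The same pattern works for `pos`, or for sentiment in the comments a user writes (group by `source` instead of `target`).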

In [None]:
# TODO: Calculate the sentiment strength for users
# TODO: Analyze whether there is a relation between the centrality of users and sentiment in the comments.

# Communities of interest

Earlier today you learned that interaction is often homophilous: people with the same interests are more likely to be connected to each other. We will look into this question here on the basis of topics. Two questions are central in this assignment: (1) are users that share topics more likely to be connected; and (2) does this create communities of interest?

Techniques necessary
- Topic modelling
- Assortativity
- Community detection

## Topic modelling

We already have the average topic distribution for each user. It is easier to work with a single topic for each user, so we take the topic with the highest average value.

In [None]:
user_topic = topic_sentiment_user_df.loc[:, 't_0':'t_14'].idxmax(axis=1)

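Let us put this information from the dataframe in the graph. Because we only have topic information if somebody wrote a post, we will look at the subgraph of people who have written at least some number of posts. Note that the selection below relies on a vertex attribute `outdegree_weighted`, which you need to have stored on the vertices of `G` in the centrality step.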

In [None]:
H = G.subgraph(G.vs.select(outdegree_weighted_ge=50))

In [None]:
H.vs['user_topic'] = user_topic.loc[H.vs['name']].str[2:].astype(int).tolist()

## Assortativity

In [None]:
# TODO:  Calculate the assortativity.
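
As a reminder of what the assortativity coefficient measures, here is a minimal sketch of Newman's coefficient for a categorical vertex attribute on a directed edge list, in plain Python. The toy users and topics are hypothetical. For the real graph, python-igraph provides `Graph.assortativity_nominal`, which is the practical route.

```python
from collections import defaultdict

def nominal_assortativity(edges, category):
    """Newman's assortativity coefficient for a categorical vertex attribute (directed edges)."""
    m = len(edges)
    e = defaultdict(float)  # e[(i, j)]: fraction of edges from category i to category j
    for source, target in edges:
        e[(category[source], category[target])] += 1 / m
    categories = set(category.values())
    a = {i: sum(e[(i, j)] for j in categories) for i in categories}  # outgoing ends
    b = {j: sum(e[(i, j)] for i in categories) for j in categories}  # incoming ends
    trace = sum(e[(i, i)] for i in categories)        # observed same-category fraction
    expected = sum(a[i] * b[i] for i in categories)   # same-category fraction expected at random
    return (trace - expected) / (1 - expected)

# Hypothetical dominant topic per user.
topic = {"u1": 0, "u2": 0, "u3": 1, "u4": 1}

same_topic_edges = [("u1", "u2"), ("u2", "u1"), ("u3", "u4"), ("u4", "u3")]
cross_topic_edges = [("u1", "u3"), ("u3", "u1"), ("u2", "u4"), ("u4", "u2")]

r_same = nominal_assortativity(same_topic_edges, topic)    # perfectly assortative
r_cross = nominal_assortativity(cross_topic_edges, topic)  # perfectly disassortative
```

Values close to 0 indicate that users reply to others regardless of topic, while positive values indicate that users mostly reply to others with the same dominant topic.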

## Community detection

The most difficult part of community detection is deciding which method and which resolution are appropriate. Modularity is the most commonly used quality function.

<div class="alert alert-warning">
Detecting communities may take some time for graphs of this size.
</div>
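
To recall what modularity rewards, the sketch below computes it by hand for a small hypothetical graph: two triangles joined by a single edge. It assumes an undirected, unweighted graph for simplicity, whereas the Reddit network is directed and weighted, so treat it as an illustration of the idea rather than the exact quantity optimised on the real data.

```python
from collections import defaultdict

def modularity(edges, membership):
    """Modularity of an undirected, unweighted graph for a given community assignment."""
    m = len(edges)
    internal = defaultdict(int)    # edges with both endpoints in the community
    degree_sum = defaultdict(int)  # summed degree of the community's nodes
    for u, v in edges:
        degree_sum[membership[u]] += 1
        degree_sum[membership[v]] += 1
        if membership[u] == membership[v]:
            internal[membership[u]] += 1
    # Per community: observed fraction of internal edges minus the expected fraction.
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)

# Two triangles (0, 1, 2) and (3, 4, 5) connected by the edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

split = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}  # one community per triangle
single = {node: 0 for node in range(6)}       # everything in one community

q_split = modularity(edges, split)    # 5/14
q_single = modularity(edges, single)  # 0
```

Putting each triangle in its own community scores higher than lumping everything together, because more edges fall within communities than expected by chance.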

In [None]:
# TODO: Apply community detection. Try to answer the two questions:
# (1) Are users that share topics more likely to be connected?
# (2) Does this create communities of interest?

# Sentiment and language across communities

Following social balance theory, it is possible that the commenter network is polarized (not implausible given how divisive US politics is). Simply looking at communication while disregarding the valence of the link (i.e. whether it was negative or positive) may distort our view of the integration of the network. We will use sentiment analysis of the comments to determine whether the links are in fact negative or positive. In this assignment two questions are central: (1) is sentiment different within groups than between groups; and (2) does the valence of links change the community structure?

Techniques necessary
- Sentiment analysis
- Community detection

## Sentiment analysis

One hint: the `crossing` function of a partition indicates per edge whether it is between or within communities.
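
The idea behind this hint can be sketched in plain Python: mark every edge as crossing or not, then compare the average sentiment of the two groups. The memberships and the single sentiment score per edge are hypothetical values chosen for illustration.

```python
# Hypothetical toy data: community per user and a sentiment score per edge.
membership = {"a": 0, "b": 0, "c": 1, "d": 1}
edges = [
    ("a", "b", 0.4),
    ("c", "d", 0.2),
    ("a", "c", -0.3),
    ("b", "d", -0.5),
]

# An edge "crosses" when its endpoints belong to different communities.
crossing = [membership[source] != membership[target] for source, target, _ in edges]

within = [s for (_, _, s), is_crossing in zip(edges, crossing) if not is_crossing]
between = [s for (_, _, s), is_crossing in zip(edges, crossing) if is_crossing]

mean_within = sum(within) / len(within)
mean_between = sum(between) / len(between)
```

On the real graph, the list returned by `crossing` has one entry per edge in edge order, so it can be combined with the edge attributes in the same way.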

## Community detection

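Modularity is ordinarily not suited for community detection when edge weights can be negative. This can be corrected, though, by treating positive and negative links as separate layers, as illustrated below. The code assumes an edge attribute `sentiment` on `H`, which you need to define yourself from the `pos` and `neg` attributes.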

In [None]:
H_positive = H.subgraph_edges(H.es.select(sentiment_gt=0), delete_vertices=False)
H_negative = H.subgraph_edges(H.es.select(sentiment_lt=0), delete_vertices=False)

membership, quality = louvain.find_partition_multiplex([
    louvain.Layer(graph=H_positive, method='Modularity', weight='sentiment', layer_weight=1.0),
    louvain.Layer(graph=H_negative, method='Modularity', weight='sentiment', layer_weight=-1.0)])
balance_partition = ig.VertexClustering(H, membership)

In [None]:
# TODO: Analyse how this partition differs from what you have seen previously.