<div style="text-align: center;" >
<h1 style="margin-top: 0.2em; margin-bottom: 0.1em;">Assignment 4</h1>
<h4 style="margin-top: 0.7em; margin-bottom: 0.3em; font-style:italic">Commit your solutions to GitHub by July 12, 23:59</h4>
</div>
<br>

## Part 1
## Social Network Analysis of Swiss Politicians on Twitter Data
In the first part of this assignment you will do the following tasks:
1. Build a social network of retweets
2. Calculate assortativity
3. Run permutation tests
4. Detect communities

### Install requirements

The following cell installs all the dependencies needed for this task when you run it.

* [`pandas`](https://pandas.pydata.org/docs/index.html) is a Python package for creating and working with tabular data. [Here](https://pandas.pydata.org/docs/reference/index.html) is the documentation of `pandas`.
* [`numpy`](https://numpy.org/) is a Python package for mathematical functions. [Here](https://numpy.org/doc/stable/reference/index.html) is the documentation of `numpy`.
* [`matplotlib`](https://matplotlib.org/) is a Python package for creating plots. [Here](https://matplotlib.org/stable/api/index.html) is the documentation of `matplotlib`.
* [`networkx`](https://networkx.org/) is a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks. [Here](https://networkx.org/documentation/stable/reference/index.html) is the documentation of `networkx`.

In [1]:
! pip install pandas
! pip install numpy
! pip install matplotlib
! pip install networkx

### Import requirements
The cell below imports all necessary dependencies. Make sure they are installed (see cell above).

In [1]:
import pandas as pd
import networkx as nx
import numpy as np
import matplotlib.pyplot as plt

### Exercise 1: Load the social network of retweets *(1 point)*

The attached `swiss_pol_retweet_network.gexf` file contains an undirected retweet network of Swiss politicians for the time between 2021-07-12 and 2022-07-12. Each node in the network represents a politician, and stores their Twitter user id, username, and party affiliation. An edge exists between a pair of politicians that exchanged at least one retweet with each other (regardless of the direction).

* Import the graph.
* How many nodes and edges are there in the network?
* Visualize the graph. Use [`draw_networkx`](https://networkx.org/documentation/stable/reference/generated/networkx.drawing.nx_pylab.draw_networkx.html) for this. You can try different layouts, but the plot does not need to be sophisticated yet.
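As a rough sketch of the loading step, the cell below writes a tiny stand-in graph to a temporary GEXF file and reads it back with `read_gexf`. The toy nodes and the attribute keys `username` and `party` are assumptions for illustration; the keys in the real file may be named differently, so inspect one node first.

```python
import os
import tempfile

import networkx as nx

# Toy stand-in for the assignment file: three politicians, two retweet edges.
# The attribute names "username" and "party" are hypothetical.
toy = nx.Graph()
toy.add_node("1", username="alice", party="A")
toy.add_node("2", username="bob", party="A")
toy.add_node("3", username="carol", party="B")
toy.add_edges_from([("1", "2"), ("2", "3")])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_network.gexf")
    nx.write_gexf(toy, path)
    # For the assignment, pass the path of the attached .gexf file instead
    G = nx.read_gexf(path)

n_nodes = G.number_of_nodes()
n_edges = G.number_of_edges()
print(n_nodes, n_edges)

# Look at one node to see which attribute keys are actually stored
print(G.nodes["1"])
```

Once the graph is loaded, `nx.draw_networkx(G, pos=nx.spring_layout(G, seed=0))` is one possible starting point for the plot.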

### Exercise 2: Calculate graph assortativity *(2 points)*

Use the function [`attribute_assortativity_coefficient`](https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.assortativity.attribute_assortativity_coefficient.html) of `networkx` to calculate the assortativity with respect to party labels. How high is the value?

To see if the assortativity value fits your expectations, use the function [`draw_networkx`](https://networkx.org/documentation/stable/reference/generated/networkx.drawing.nx_pylab.draw_networkx.html) to plot the network coloring each node according to the political party label of the politician. Does the pattern of colors fit the value of assortativity?
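To get a feeling for the coefficient before applying it to the real network, here is a minimal sketch on a toy graph of two triangles joined by one edge. The attribute name `party` and the colour palette are assumptions, not taken from the data file.

```python
import networkx as nx

# Two triangles joined by a single edge (2, 3)
G = nx.Graph()
G.add_edges_from([(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)])

# Hypothetical party labels: one party per triangle
party = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
nx.set_node_attributes(G, party, name="party")

# Positive values mean that edges mostly connect nodes with the same label
r = nx.attribute_assortativity_coefficient(G, "party")
print(round(r, 3))

# One colour per node, in node order, usable as draw_networkx(node_color=...)
palette = {"A": "tab:red", "B": "tab:blue"}
node_colors = [palette[G.nodes[n]["party"]] for n in G.nodes]
```

Six of the seven edges stay within a party here, so the coefficient is clearly positive; a network in which parties mix freely would give a value near zero.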

Your answer can go here:

### Exercise 3: Permutation tests *(2 points)*

Next, we are going to use a permutation test to test whether the above result could have happened at random. 

First, let's run one permutation:
* Extract the party labels of nodes and permute them 
* Set the party for each node as a node attribute by using [`set_node_attributes`](https://networkx.org/documentation/stable/reference/generated/networkx.classes.function.set_node_attributes.html) (be careful: the parties are now permuted)
* Perform the same assortativity calculation as above, but with the permuted party labels


Is the value much closer to zero?
Repeat the calculation with 1000 permutations and plot the histogram of the resulting values. Add a line with the value of the assortativity without permutation. Is it far or close to the permuted values?
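The permutation procedure described above could look roughly like the sketch below. It uses a toy graph (two 5-cliques joined by one edge) and stores the shuffled labels under a separate, hypothetical attribute name `party_perm` so that the original labels are not overwritten.

```python
import networkx as nx
import numpy as np

rng = np.random.default_rng(0)

# Toy graph: two 5-cliques (nodes 0-4 and 5-9) joined by one edge
G = nx.barbell_graph(5, 0)
nodes = list(G.nodes)
labels = ["A"] * 5 + ["B"] * 5  # hypothetical party labels
nx.set_node_attributes(G, dict(zip(nodes, labels)), name="party")
observed = nx.attribute_assortativity_coefficient(G, "party")


def permuted_assortativity(graph, node_list, label_list, rng):
    """Shuffle the labels across nodes and recompute the assortativity."""
    shuffled = rng.permutation(label_list).tolist()
    nx.set_node_attributes(graph, dict(zip(node_list, shuffled)), name="party_perm")
    return nx.attribute_assortativity_coefficient(graph, "party_perm")


perm_values = np.array(
    [permuted_assortativity(G, nodes, labels, rng) for _ in range(1000)]
)
print(observed, perm_values.mean())
```

For the plot, `plt.hist(perm_values)` followed by `plt.axvline(observed)` is one simple way to show the observed value against the permuted distribution.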

Your answer can go here:

To be sure, let's calculate a p-value for the null hypothesis that the assortativity is zero and the alternative hypothesis that it is positive (what we expected):
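One common way to turn the permuted values into a one-sided p-value is to count how many of them are at least as large as the observed value. The sketch below adds one to numerator and denominator so that the estimate is never exactly zero; this correction is a design choice, not something the assignment prescribes. The numbers are synthetic placeholders, not results from the data.

```python
import numpy as np


def one_sided_p_value(observed, permuted):
    """Share of permuted values >= observed, with a +1 correction."""
    permuted = np.asarray(permuted, dtype=float)
    n_extreme = int(np.sum(permuted >= observed))
    return (n_extreme + 1) / (permuted.size + 1)


# Synthetic placeholder values standing in for 1000 permuted assortativities
rng = np.random.default_rng(1)
fake_permuted = rng.normal(loc=0.0, scale=0.05, size=1000)
p = one_sided_p_value(0.4, fake_permuted)
print(p)
```

With 1000 permutations the smallest p-value this estimator can return is 1/1001, so more permutations are needed if a finer resolution matters.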

After looking at the above results, do you think it is likely that the assortativity we found in the data was produced by chance?

Your answer can go here:

### Exercise 4: Community detection *(3 points)*

Let's test if Twitter communities match political affiliations. Remove nodes with degree zero in the network and run the [Louvain community detection algorithm](https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.community.louvain.louvain_communities.html). Visualize the result coloring nodes by community labels.

Run the [`modularity`](https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.community.quality.modularity.html) function with the above community labels. Is it high enough to think that the network has a community structure?
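A minimal sketch of these steps on a toy graph, assuming a `networkx` version that ships `louvain_communities` (2.8 or later). The graph and the seed are placeholders.

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities, modularity

# Toy graph: two 5-cliques joined by one edge, plus two isolated nodes
G = nx.barbell_graph(5, 0)
G.add_nodes_from([10, 11])

# Drop nodes with degree zero before detecting communities
G.remove_nodes_from(list(nx.isolates(G)))

# Louvain returns a list of node sets; a fixed seed makes the run repeatable
communities = louvain_communities(G, seed=42)
Q = modularity(G, communities)
print(len(communities), round(Q, 3))

# Integer community label per node, usable as draw_networkx(node_color=...)
community_of = {n: i for i, members in enumerate(communities) for n in members}
node_colors = [community_of[n] for n in G.nodes]
```

Louvain is randomized, so different seeds can give slightly different partitions and modularity values on the real network.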

Your answer can go here:

Repeat the calculation using the (true) party labels instead of the communities detected with Louvain. Is the modularity higher or lower? How far is it from the maximal one found with Louvain?

* To do this, iterate over the parties and, for each one, collect the ids of the users who belong to that party and are present in the graph. Add these ids as one group (without duplicates), and repeat for all parties.
* Afterwards you can calculate the modularity.
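The grouping described in the two steps above can be sketched as follows, again on a toy graph with a hypothetical `party` attribute. Using sets avoids duplicates, and `modularity` expects every node of the graph to appear in exactly one group.

```python
import networkx as nx
from networkx.algorithms.community import modularity

# Toy graph: two 5-cliques; node 4 sits in the first clique but
# carries the label of the second one (hypothetical labels)
G = nx.barbell_graph(5, 0)
party = {n: ("A" if n < 5 else "B") for n in G.nodes}
party[4] = "B"
nx.set_node_attributes(G, party, name="party")

# Collect the node ids of each party into a set
members_by_party = {}
for node, label in G.nodes(data="party"):
    members_by_party.setdefault(label, set()).add(node)

party_partition = list(members_by_party.values())
Q_party = modularity(G, party_partition)
print(round(Q_party, 3))
```

If isolated nodes were removed for Louvain, the party partition has to be built from the same reduced graph, otherwise the two modularity values are not comparable.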

Your answer can go here:

Finally, to understand which parties are represented in each community, build a data frame for each node with two columns: one with the party label and another one with the community label. Use the [`groupby()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) function to print a contingency table. Which party or parties compose each community?
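A small sketch of such a contingency table, using made-up labels rather than the real data:

```python
import pandas as pd

# Placeholder labels: one row per node
df = pd.DataFrame(
    {
        "party": ["A", "A", "A", "B", "B", "B", "C"],
        "community": [0, 0, 1, 1, 1, 1, 0],
    }
)

# Count nodes per (community, party) pair, then spread parties into columns
table = df.groupby(["community", "party"]).size().unstack(fill_value=0)
print(table)
```

Each row is a community and each column a party; `pd.crosstab(df["community"], df["party"])` would be an alternative way to get the same counts.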

Your answer can go here:

### Exercise 5: Prediction and discussion of other methods *(3 points)*
* How well can you predict the party of a politician from their neighbors in the network? Here you can use the rule of predicting the party as the majority party among the neighbors and evaluate the accuracy of this approach. See the `Graph.neighbors` function.
* What would be the results if we use the network of replies? Do you expect assortativity and modularity to be higher or lower?
* If you retrieved data of follower links, you can repeat the above analysis for undirected following relationships. Do you expect a higher or lower assortativity?
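The majority rule from the first question could be sketched like this. The helper `predict_party` and the attribute name `party` are hypothetical, and ties are broken by whichever party is counted first, which is only one of several reasonable choices.

```python
from collections import Counter

import networkx as nx


def predict_party(G, node, attr="party"):
    """Return the most common party among the node's neighbors, or None."""
    votes = Counter(G.nodes[nb][attr] for nb in G.neighbors(node))
    if not votes:
        return None  # isolated node: nothing to vote with
    return votes.most_common(1)[0][0]


# Toy graph: two 5-cliques joined by one edge, one party per clique
G = nx.barbell_graph(5, 0)
nx.set_node_attributes(
    G, {n: ("A" if n < 5 else "B") for n in G.nodes}, name="party"
)

predictions = {n: predict_party(G, n) for n in G.nodes}
scored = [n for n, p in predictions.items() if p is not None]
accuracy = sum(predictions[n] == G.nodes[n]["party"] for n in scored) / len(scored)
print(accuracy)
```

Nodes without neighbors get no prediction here and are left out of the accuracy; whether to count them as errors instead is a decision worth stating in the answer.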

Your answer can go here:

## Part 2
## Reddit

### Exercise 6: Data Collection *(3 points)*

#### Sign up for the Reddit API
* In this part of the assignment we will collect data using the Reddit API, and compare the tree structure of political and non-political subreddits.
* First, you need to sign up for the Reddit API. For this, follow the steps outlined in [this guide](https://towardsdatascience.com/how-to-use-the-reddit-api-in-python-5e05ddfd1e5c). You will need to create an app on the following [link](https://old.reddit.com/prefs/apps/).
* Next, install the [PRAW package](https://praw.readthedocs.io/en/stable/getting_started/quick_start.html), which provides a nice wrapper for the Reddit API.

#### Collect the data
* Navigate to the following [link](https://www.reddit.com/best/communities/1/) and select 4 political and 4 non-political subreddits. Ideally, you would want subreddits with around 100-200 thousand members. Select subreddits with enough engagement, but avoid ones that typically have a thousand replies to each submission, since the API has a relatively low rate limit.
* Extract the top 20 `hottest` submissions from each of your selected subreddits, ignoring `pinned` submissions.
* For each of the submissions, extract all the comments and replies, and store them so that you don't need to rerun this step later (and make sure to upload these data to GitHub as well!). Make sure to save the `id` of the post, the id of its `parent` (the post that it replies to), and the name of the user.
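The extraction step could be organized roughly as below. The helpers `comment_to_row` and `collect_rows` are hypothetical names; they rely on the PRAW attributes `id`, `parent_id`, `author`, and on `comments.replace_more()` / `comments.list()`. So that the sketch runs without credentials, it is exercised on a fake submission object; the commented lines indicate how it might be wired to PRAW, assuming the `stickied` flag is what marks a pinned submission.

```python
from types import SimpleNamespace


def comment_to_row(comment, submission_id):
    """Flatten one comment into a dict with id, parent id and user name."""
    author = comment.author  # None when the account was deleted
    return {
        "submission_id": submission_id,
        "id": comment.id,
        # parent_id carries a type prefix ("t1_" comment, "t3_" submission)
        "parent_id": comment.parent_id.split("_", 1)[-1],
        "author": author.name if author is not None else "[deleted]",
    }


def collect_rows(submission):
    """Expand all 'load more comments' stubs and flatten the comment forest."""
    submission.comments.replace_more(limit=None)
    return [comment_to_row(c, submission.id) for c in submission.comments.list()]


# With real credentials, the wiring might look like this:
#   import praw
#   reddit = praw.Reddit(client_id="...", client_secret="...", user_agent="...")
#   hot = reddit.subreddit("some_subreddit").hot(limit=30)
#   submissions = [s for s in hot if not s.stickied][:20]
#   rows = [row for s in submissions for row in collect_rows(s)]

# Fake submission standing in for a PRAW object, so the sketch runs offline
fake_comments = [
    SimpleNamespace(id="c1", parent_id="t3_s1", author=SimpleNamespace(name="user_a")),
    SimpleNamespace(id="c2", parent_id="t1_c1", author=None),
]
fake_submission = SimpleNamespace(
    id="s1",
    comments=SimpleNamespace(
        replace_more=lambda limit=None: [],
        list=lambda: fake_comments,
    ),
)
rows = collect_rows(fake_submission)
print(rows)
```

The resulting list of dicts can be stored with, for example, `pandas.DataFrame(rows).to_csv(...)`, so that the API does not have to be queried again.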

In [None]:
!pip install praw

In [None]:
import praw

### Exercise 7: Analysis *(3 points)*
* Create a network/tree for each of the submissions. For this, you may use the [networkx](https://networkx.org/documentation/stable/tutorial.html) package, or create your own classes to store the data.
* For each of the trees, calculate the `maximum depth` and `maximum width`. 
    * By maximum depth, we mean the number of edges between the root node, and the furthest leaf node (i.e. the reply which is deepest in the comment tree). 
    * The maximum width of the tree is the maximum number of comments or replies on one "level". The first "level" holds the submission itself, the next one the comments replying directly to the submission, the third the replies to the comments on the second level, and so on.
* Also calculate the `number of nodes` for each of the trees.
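The three measures can be computed in one breadth-first pass over the tree. The sketch below uses plain Python rather than `networkx` and assumes the data is available as `(child_id, parent_id)` pairs; the function name `tree_metrics` and the example ids are made up.

```python
from collections import defaultdict, deque


def tree_metrics(root, edges):
    """Compute node count, maximum depth and maximum width of a reply tree.

    edges is an iterable of (child_id, parent_id) pairs; root is the
    id of the submission.
    """
    children = defaultdict(list)
    for child, parent in edges:
        children[parent].append(child)

    level_sizes = []  # level_sizes[d] = number of nodes at depth d
    queue = deque([root])
    while queue:
        level_sizes.append(len(queue))
        # Pop exactly the nodes of the current level, enqueue their children
        for _ in range(len(queue)):
            node = queue.popleft()
            queue.extend(children[node])

    return {
        "n_nodes": sum(level_sizes),
        "max_depth": len(level_sizes) - 1,  # edges from root to deepest leaf
        "max_width": max(level_sizes),
    }


# Submission "s" with three comments; one of them has a chain of two replies
edges = [("c1", "s"), ("c2", "s"), ("c3", "s"), ("r1", "c1"), ("r2", "r1")]
metrics = tree_metrics("s", edges)
print(metrics)
```

Comments whose parent is missing from the data are never reached from the root and are therefore not counted, which is worth checking on the collected data.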

### Exercise 8: Comparison and interpretation *(3 points)*
* Compare the mean number of nodes, mean maximum width, and mean maximum depth of political and non-political subreddits. What differences can you notice?

* Can you conduct a statistical test to see if the differences are significant? If you find a feasible test, conduct it.

* Compare the distribution of the maximum width and maximum depth of political vs. non-political subreddits by plotting their relative frequencies.

* Create a scatterplot with the log of max width of the tree on the x-axis, and the max depth of the tree on the y-axis. Color the dots based on their group (political vs. non-political). Add a large dot for both groups to show the mean of the group. 

* Interpret your results, either here or separately after each plot.
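For the question about a statistical test, one feasible option (among others, such as a rank-based test) is a two-sided permutation test on the difference of group means, in the same spirit as Part 1. The sketch below is only an illustration; the function name and the example numbers are invented and are not results from any subreddit.

```python
import numpy as np


def permutation_test_mean_diff(x, y, n_perm=5000, seed=0):
    """Two-sided permutation test for the difference of two group means."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = x.mean() - y.mean()
    pooled = np.concatenate([x, y])

    n_extreme = 0
    for _ in range(n_perm):
        # Reassign the group labels at random and recompute the difference
        shuffled = rng.permutation(pooled)
        diff = shuffled[: x.size].mean() - shuffled[x.size :].mean()
        if abs(diff) >= abs(observed):
            n_extreme += 1

    # +1 correction keeps the estimated p-value above zero
    return observed, (n_extreme + 1) / (n_perm + 1)


# Invented maximum depths for two groups of submissions
political = [10, 11, 12, 13, 14, 15]
non_political = [1, 2, 3, 4, 5, 6]
diff, p_value = permutation_test_mean_diff(political, non_political)
print(diff, p_value)
```

The same function can be applied separately to the number of nodes, the maximum width, and the maximum depth; testing several measures at once raises the chance of a spurious significant result.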

Your answer can go here:


* What are the limitations of this analysis?

Your answer can go here: