<a href="https://colab.research.google.com/github/SimonTommerup/02805-sgai/blob/main/notebooks/thomas_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Imports used

In [1]:
import pandas as pd
import numpy as np
import re
import nltk
import networkx as nx
import matplotlib.pyplot as plt
import json

from collections import Counter
from tqdm import tqdm


# 1 Motivation
- What is your dataset?
- Why did you choose this/these particular dataset(s)?
- What was your goal for the end user's experience?

# 2 Basic stats. Let's understand the dataset better
- Write about your choices in data cleaning and preprocessing
- Write a short section that discusses the dataset stats (here you can recycle the work you did for Project Assignment A) **But leave network stats (#nodes, #edges, degree...) to next section!**

# 3 Tools, theory and analysis. Describe the process of theory to insight
In this section we analyze the above presented data using network science tools and data analysis strategies.


## 3.1 Introducing the analysis




As discussed in previous sections, we investigate typical users of the subreddits of the two 2020 US presidential candidates, [Trump](https://www.reddit.com/r/donaldtrump/) and [Biden](https://www.reddit.com/r/joebiden/).

Initially we will **Introduction..., then 3.2 then 3.3. then 3.4. (overview!)**.




### 3.1.2 The bipartite network **Consider: building actual bipartite network?**
We represent our data as a bipartite network of users and subreddits. For our data we have
- $U$, the set of users
- $S^u$, the set of subreddits which $u \in U$ has commented on, excluding the two candidate subreddits
- $S$, the set of all subreddits commented on by at least one user, i.e. $S = \bigcup_{u \in U} S^u$.

Note that we do not include the [Trump](https://www.reddit.com/r/donaldtrump/) and [Biden](https://www.reddit.com/r/joebiden/) subreddits in $S^u$ and $S$.

We can assemble this data into an undirected bipartite network $G_{bi}$ with two disjoint sets of nodes, $U$ and $S$. A user $u \in U$ is connected to a subreddit $s \in S$ if $s \in S^u$. In $G_{bi}$, each user is thus connected to every subreddit they have commented on, and each subreddit is in turn connected to every user who has commented on it.
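The construction above can be sketched on toy data. This is only an illustration: the users, the subreddit names and the constant `CANDIDATE_SUBREDDITS` are made up for the example and are not taken from our dataset.

```python
# Toy data (hypothetical): subreddits each user has commented on.
comments = {
    "alice": {"donaldtrump", "news", "politics"},
    "bob": {"joebiden", "politics", "memes"},
    "carol": {"joebiden", "news"},
}

# Assumed names of the two candidate subreddits, which are excluded.
CANDIDATE_SUBREDDITS = {"donaldtrump", "joebiden"}

# S^u for every user u, with the candidate subreddits removed.
S_u = {u: subs - CANDIDATE_SUBREDDITS for u, subs in comments.items()}

# U and S are the two disjoint node sets of the bipartite network.
U = set(S_u)
S = set().union(*S_u.values())

# One undirected edge (u, s) for every s in S^u.
bipartite_edges = sorted((u, s) for u, subs in S_u.items() for s in subs)

print(len(U), len(S), len(bipartite_edges))
```

Every edge joins a user to a subreddit, never two users or two subreddits, which is what makes the network bipartite.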



### 3.1.3 Building a weighted network of users
In order to apply a wide range of network science tools and data science strategies from 02805, we consider a projection of $G_{bi}$ onto the users. Specifically, we use a [simple projection](https://en.wikipedia.org/wiki/Bipartite_network_projection) in which users $u_i$ and $u_j$ are connected with weight $|S^{u_i}\cap S^{u_j}|$, i.e. the number of subreddits both users have commented on. Pairs with no subreddit in common are not connected.

We generate the weighted network $G$ implied by $G_{bi}$ directly from our pandas dataframe, as this is simpler than constructing $G_{bi}$ explicitly.
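As a small worked example of the weight definition, the sketch below computes the projection weights for a handful of made-up users. The function name `project_users` and the toy sets are hypothetical; the threshold parameter is named after `n_required_subreddits` in `create_graph`.

```python
from itertools import combinations

# Toy S^u sets (hypothetical users and subreddits).
S_u = {
    "alice": {"news", "politics"},
    "bob": {"politics", "memes"},
    "carol": {"news", "politics", "memes"},
    "dave": {"gardening"},
}


def project_users(S_u, n_required_subreddits=1):
    """Return {(u_i, u_j): weight} with weight = |S^{u_i} & S^{u_j}|."""
    weights = {}
    # combinations() visits every unordered pair of users exactly once.
    for u_i, u_j in combinations(sorted(S_u), 2):
        w = len(S_u[u_i] & S_u[u_j])
        if w >= n_required_subreddits:
            weights[(u_i, u_j)] = w
    return weights


weights = project_users(S_u)
print(weights)
```

Here `dave` shares no subreddit with anyone and therefore ends up without edges, while `alice` and `carol` share two subreddits and get weight 2.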

In [None]:

def get_ids_of_users_with_common_subreddits(user_id, users, used_subreddits):
    users_with_common_subreddits = []
    for other_user_id in range(len(users)):
        if other_user_id != user_id:  # Skip connection to self
            for subr in used_subreddits[user_id]:
                if subr in used_subreddits[other_user_id]:
                    users_with_common_subreddits.append(other_user_id)
                    break
    return users_with_common_subreddits


def get_common_subreddits(user_id, other_user_id, used_subreddits):
    common_subreddits = [subreddit for subreddit in used_subreddits[user_id] if subreddit in used_subreddits[other_user_id]]
    return common_subreddits

  
def create_graph(users, used_subreddits, from_subreddits, n_required_subreddits=1):
    G = nx.Graph()
    # Loop through all users
    for user_id in tqdm(range(len(users))):
        # Add a node for EVERY user in data set 
        G.add_node(users[user_id], from_subreddit=from_subreddits[user_id])

        # Get all other users with at least one subreddit in common
        other_users_id = get_ids_of_users_with_common_subreddits(user_id, users, used_subreddits)
        for other_user_id in other_users_id:
            # Save all UNIQUE common subreddits as edge property if constraints are satisfied
            common_subreddits = get_common_subreddits(user_id, other_user_id, used_subreddits)
            common_subreddits = list(set(common_subreddits))
            # Remove 'from_subreddit' as common subreddit.
            # TODO: Should we remove biden?
            common_subreddits = list(filter(lambda e: e not in from_subreddits[user_id], common_subreddits))

            if len(common_subreddits) >= n_required_subreddits:
                G.add_edge(users[user_id], users[other_user_id], common_subreddits=(common_subreddits), 
                           weight=len(common_subreddits))
    return G


# Create graph
G = create_graph(users, used_subreddits, from_subreddits, n_required_subreddits=1)

TODO: 
- Present basic NETWORK stats we got from Project A (#Edges, #nodes, avg/min/max degree)
- Plot the network (possibly coloured by the current classification, `from_subreddit`)
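The basic network statistics in the list above could be computed along these lines. This is a minimal sketch on a toy node and edge list; the helper name `basic_network_stats` is hypothetical.

```python
from collections import Counter
from statistics import mean


def basic_network_stats(nodes, edges):
    """Summarise an undirected network given its nodes and edge pairs."""
    degree = Counter({n: 0 for n in nodes})  # isolated nodes keep degree 0
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    degrees = list(degree.values())
    return {
        "n_nodes": len(degree),
        "n_edges": len(edges),
        "avg_degree": mean(degrees),
        "min_degree": min(degrees),
        "max_degree": max(degrees),
    }


nodes = ["alice", "bob", "carol", "dave"]
edges = [("alice", "bob"), ("alice", "carol"), ("bob", "carol")]
stats = basic_network_stats(nodes, edges)
print(stats)
```

For the networkx graph `G`, `G.number_of_nodes()`, `G.number_of_edges()` and `G.degree()` give the same quantities directly.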

## 3.2 Extracting networks of interest and classifying users

### 3.2.1 Extracting the "backbone" of the user network
- Motive: "As we saw in the introduction, the weighted network is very dense. Extracting a backbone might give a more informative network."
- Tools: "Works by applying a disparity filter, defined by ..."
- Results: "The resulting network is ..." (#nodes, #edges, avg/min/max degree) + PLOT
- Discussion: "Will be compared with the full weighted network to see which one gives more information."
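Assuming the filter meant here is the disparity filter for weighted networks, a minimal sketch follows. For a node with degree $k$ and strength $s$, an edge of weight $w$ gets the significance score $(1 - w/s)^{k-1}$, and the edge is kept if the score is below a threshold for at least one endpoint. The threshold value, the function name and the treatment of degree-1 nodes are assumptions made for this example.

```python
from collections import defaultdict


def disparity_filter(weights, alpha=0.05):
    """Keep edges that are significant for at least one endpoint.

    weights: {(u, v): w} for an undirected weighted network.
    """
    strength = defaultdict(float)
    degree = defaultdict(int)
    for (u, v), w in weights.items():
        for n in (u, v):
            strength[n] += w
            degree[n] += 1

    def score(node, w):
        k = degree[node]
        if k < 2:
            # Assumption: a node's only edge is never significant by itself.
            return 1.0
        return (1.0 - w / strength[node]) ** (k - 1)

    return {
        (u, v): w
        for (u, v), w in weights.items()
        if min(score(u, w), score(v, w)) < alpha
    }


# Toy network (hypothetical): one dominant edge and four weak ones.
toy_weights = {("a", "b"): 100, ("a", "c"): 1, ("a", "d"): 1, ("a", "e"): 1, ("a", "f"): 1}
backbone = disparity_filter(toy_weights)
print(backbone)
```

Only the edge carrying most of node `a`'s strength survives, which is the intended effect: the filter keeps edges that are unusually heavy relative to a node's other edges rather than applying one global weight cut-off.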

### 3.2.2 Classifying users with community detection and sentiment analysis
- Motive: "Classifying users by from_subreddit is not necessarily optimal."
- Tools: Three compared partitions: from_subreddit, Louvain and sentiment in comment. Modularity. Plots
- Results: Modularity=... Plots=... #Links_Across_partition=..., MORE to decide the better partition!!?
- Discussion: From X and Y we find __ as the best partitioning for representing each candidates' supporters
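To make the comparison of partitions concrete, the sketch below computes the weighted modularity $Q = \sum_c \left[ \frac{L_c}{m} - \left( \frac{d_c}{2m} \right)^2 \right]$ of a given partition, where $m$ is the total edge weight, $L_c$ the weight inside community $c$ and $d_c$ the summed strength of its nodes. The toy network and the function name are hypothetical.

```python
from collections import defaultdict


def modularity(weights, community_of):
    """Weighted modularity Q of a partition.

    weights: {(u, v): w} undirected edges without self-loops.
    community_of: {node: community label}.
    """
    m = sum(weights.values())      # total edge weight
    internal = defaultdict(float)  # edge weight inside each community
    strength = defaultdict(float)  # summed node strength per community
    for (u, v), w in weights.items():
        strength[community_of[u]] += w
        strength[community_of[v]] += w
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += w
    return sum(internal[c] / m - (strength[c] / (2 * m)) ** 2 for c in strength)


# Toy network (hypothetical): two triangles joined by a single edge.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
toy_weights = {edge: 1 for edge in toy_edges}

two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in two_groups}

q_two = modularity(toy_weights, two_groups)
q_one = modularity(toy_weights, one_group)
print(q_two, q_one)
```

The same function can score any of the three candidate partitions as long as each is given as a node-to-label mapping. networkx also provides `nx.community.modularity` and, in recent versions, `nx.community.louvain_communities`.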

### TBD: 3.2.3 Detecting communities with the bipartite network? ONLY MAYBE!
- Motive: "Bipartite networks might contain additional information, which is discarded in the projection."
- Tools: "Explain how community detection works"...
- Results: "We saw a lot more!!" or "revealed nothing..."
- Discussion: "Probably because..."

## 3.3 Comparing candidate sub-networks (of best partitioning)



### 3.3.1 Simple Network Statistics for candidate sub-networks
- Motive: "To compare the two networks in terms of simple statistics"
- Tools: #Nodes, #Edges, degrees, densities, median, mode
- Results: "compute them..."
- Discussion: "This could mean that... "


### 3.3.2 Degree Distributions and the Network types
- Motive: "To understand the characteristics of our networks... Does our network follow power-law? Which could mean..."
- Tools: explain theory...
- Results
- Discussion
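A minimal sketch of the empirical degree distribution $p_k$, the fraction of nodes with degree $k$. On a log-log plot of $p_k$ against $k$, a power law would show up as a roughly straight line. The degree list below is made up for illustration.

```python
from collections import Counter


def degree_distribution(degrees):
    """Return {k: p_k}, the fraction of nodes that have degree k."""
    counts = Counter(degrees)
    n = len(degrees)
    return {k: counts[k] / n for k in sorted(counts)}


# Toy degree sequence (hypothetical).
degrees = [1, 1, 1, 1, 2, 2, 3, 5]
p_k = degree_distribution(degrees)
print(p_k)
```

For the networkx graph `G`, the degree sequence could be obtained with `[d for _, d in G.degree()]`.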

### 3.3.3 Advanced statistics (maybe this should be 3 separate bullets)
- Motive: "Which supporters are more diverse in interests? Which are etc..."
- Tools: Clustering, Shortest paths and centralities in sub-networks?
- Results
- Discussion

### 3.3.4 Community detection within partitions
- Motive: "Investigate whether there are communities within the Biden/Trump partitions"
- Tools: Louvain
- Results
- Discussion

## 3.4 Comparing text/comments of candidates' supporters/subreddit forum users 

### 3.4.1 Natural Language Processing
- Motive: Is one community more eloquent? Does either community have more catch-phrases? Typical words?
- Tools: Lexical diversity, collocations, TF-TR + wordclouds
- Results
- Discussion
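Two of the listed tools can be sketched with the standard library alone: lexical diversity (distinct tokens divided by total tokens) and raw bigram counts as a crude stand-in for collocations. The tokenizer, the function names and the example text are assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case a comment and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def lexical_diversity(tokens):
    """Number of distinct tokens divided by the total number of tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def top_bigrams(tokens, n=3):
    """Most frequent adjacent word pairs."""
    return Counter(zip(tokens, tokens[1:])).most_common(n)


# Toy comment (hypothetical).
comment = "The cat sat on the mat. The cat slept."
tokens = tokenize(comment)
diversity = lexical_diversity(tokens)
bigrams = top_bigrams(tokens, n=1)
print(diversity, bigrams)
```

Raw counts favour frequent words, so a real collocation analysis would typically score pairs with an association measure rather than frequency alone.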


### 3.4.2 Sentiment analysis
- Motive: Are one candidate's supporters more positive than the other's?
- Tools: Sentiment analysis of comments
- Results
- Discussion
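One simple way to compare sentiment between the two groups is a dictionary-based score: average the scores of the words in each comment that appear in a sentiment lexicon, then average over comments. The sketch below assumes this approach; the lexicon, its scores and the example comments are entirely made up.

```python
from statistics import mean

# Hypothetical word scores on a 1 (negative) to 9 (positive) scale.
TOY_LEXICON = {"great": 8.0, "love": 8.5, "win": 7.5, "bad": 2.5, "hate": 2.0}


def comment_sentiment(tokens, lexicon):
    """Average score of the tokens found in the lexicon, or None if none are."""
    scores = [lexicon[t] for t in tokens if t in lexicon]
    return mean(scores) if scores else None


def group_sentiment(comments, lexicon):
    """Mean sentiment over the comments that contain at least one scored word."""
    scored = [comment_sentiment(c.lower().split(), lexicon) for c in comments]
    scored = [s for s in scored if s is not None]
    return mean(scored) if scored else None


# Toy comments (hypothetical).
group_a = ["great win", "love it"]
group_b = ["bad bad", "hate it", "no scored words"]

sentiment_a = group_sentiment(group_a, TOY_LEXICON)
sentiment_b = group_sentiment(group_b, TOY_LEXICON)
print(sentiment_a, sentiment_b)
```

Comments without any scored word are skipped rather than counted as neutral, which avoids pulling the group average towards an arbitrary midpoint.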

# 4 Discussion. Think critically about your creation
- What went well?
- What is still missing? What could be improved? Why?
  - Improvements: Maybe also try to incorporate how *often* the users have commented on the same subreddit in the link weights.

# 5 Contributions. Who did what?
- You should write (just briefly) which group member was the main responsible for which elements of the assignment. (I want you guys to understand every part of the assignment, but usually there is someone who took lead role on certain portions of the work. That’s what you should explain).

- It is not OK simply to write "All group members contributed equally".

