Instructions
Centrality measures can be used to predict (positive or negative) outcomes for a node.

Your task in this week’s assignment is to identify an interesting set of network data that is available on the web (either through web scraping or web APIs) that could be used for analyzing and comparing centrality measures across nodes. As an additional constraint, there should be at least one categorical variable available for each node (such as “Male” or “Female”; “Republican”, “Democrat”, or “Undecided”; etc.).

In addition to identifying your data source, you should create a high-level plan that describes how you would load the data for analysis, and describe a hypothetical outcome that could be predicted from comparing degree centrality across categorical groups. 

Dataset:
https://snap.stanford.edu/data/twitch-social-networks.html

>@misc{rozemberczki2019multiscale,    
       title = {Multi-scale Attributed Node Embedding},   
       author = {Benedek Rozemberczki and Carl Allen and Rik Sarkar},   
       year = {2019},   
       eprint = {1909.13021},  
       archivePrefix = {arXiv},  
       primaryClass = {cs.LG}   
       }

In [None]:
# Standard library
import io
from itertools import combinations

# Third-party
import matplotlib.pyplot as plt
import networkx as nx
import requests

# print("NetworkX version:", nx.__version__)

The dataset comes from the Twitch Social Networks collection: Twitch user-user networks of gamers who stream in a certain language, used for node classification and transfer learning. Nodes are the users themselves, and the links are mutual friendships between them. Vertex features are extracted from the games played and liked, location, and streaming habits. All of the networks share the same set of node features, which makes transfer learning across networks possible. The networks were collected in May 2018. The collection covers Twitch streamers in multiple languages (German, English, Spanish, French, Portuguese, and Russian), but I will restrict my analysis to the English network, which has 7,126 nodes and 35,324 edges and is sparse, with a reported density of 0.002.

The dataset is structured as follows:
Each node is a Twitch user.
Each edge (stored as a from/to pair) is a mutual friendship between two users, so the graph is undirected.
The dataset contains categorical variables such as 'mature' (whether a user uses explicit language) and 'partner' (whether or not a user is an official streaming partner with Twitch).
It also contains numerical variables such as the user ID, 'days' (the age of the user's account), and 'views' (the number of views their channel has).
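
The structure described above could be loaded into NetworkX roughly as follows. This is only a sketch: the column names (`from`, `to`, `new_id`, `mature`, `partner`, `days`, `views`) are assumptions about the CSV layout and should be checked against the downloaded files, `build_graph` is a hypothetical helper, and the sample rows are made up rather than taken from the Twitch data.

```python
import csv
import io

import networkx as nx


def build_graph(edges_csv, targets_csv):
    """Build an undirected graph with node attributes from two CSV texts."""
    graph = nx.Graph()  # mutual friendships, so edges are undirected

    # Assumed target columns: new_id (the node id used by the edge list),
    # mature, partner, days, views.
    for row in csv.DictReader(io.StringIO(targets_csv)):
        graph.add_node(
            int(row["new_id"]),
            mature=row["mature"] == "True",
            partner=row["partner"] == "True",
            days=int(row["days"]),
            views=int(row["views"]),
        )

    # Assumed edge columns: from, to.
    for row in csv.DictReader(io.StringIO(edges_csv)):
        graph.add_edge(int(row["from"]), int(row["to"]))

    return graph


# Tiny made-up sample in the assumed layout (not real Twitch data).
SAMPLE_TARGETS = """id,days,mature,views,partner,new_id
111,900,True,5000,False,0
222,400,False,120,False,1
333,1500,False,98000,True,2
"""
SAMPLE_EDGES = """from,to
0,1
0,2
"""

G = build_graph(SAMPLE_EDGES, SAMPLE_TARGETS)
print(G.number_of_nodes(), G.number_of_edges())
print(G.nodes[2])
```

Once the real files are downloaded, their text can be passed to the same function in place of the sample strings.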

High-Level Plan:

1) Load the data in Python: edges, node features, and target labels from the CSV/JSON dataset files.
2) Build the graph, attach node attributes (features/target labels), and create a simple plot.
3) Compute centrality measures (degree, betweenness, closeness, and eigenvector centrality).
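
Step 3 might look something like the sketch below, which computes the four measures with NetworkX on a small toy graph. The helper name `centrality_table` and the toy edges are illustrative assumptions, not part of the dataset.

```python
import networkx as nx


def centrality_table(graph):
    """Return {measure name: {node: score}} for four centrality measures."""
    return {
        "degree": nx.degree_centrality(graph),
        "betweenness": nx.betweenness_centrality(graph),
        "closeness": nx.closeness_centrality(graph),
        # Power iteration; a higher max_iter gives it more room to converge.
        "eigenvector": nx.eigenvector_centrality(graph, max_iter=1000),
    }


# Toy graph (not Twitch data): node 0 is connected to every other node.
toy = nx.Graph([(0, 1), (0, 2), (0, 3), (1, 2)])
scores = centrality_table(toy)

for name, values in scores.items():
    top = max(values, key=values.get)
    print(f"{name:12s} top node = {top}, score = {values[top]:.3f}")
```

On a network with several thousand nodes, exact betweenness can be slow; the `k` parameter of `nx.betweenness_centrality` estimates it from a sample of nodes and may be worth trying.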

Using degree centrality, I could test whether better-connected streamers (those with more mutual friendships, and plausibly more views) are more or less likely to use explicit language, or conversely whether explicit language repels viewers. I also want to see whether the streamers with high degree centrality form a tightly connected group that mainly befriends each other. Alternatively, we can compare the average degree centrality of streamers who use explicit language with that of streamers who don't.
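
The group comparison could be sketched as below, assuming each node carries a boolean 'mature' attribute. The helper `mean_degree_centrality_by_group` is hypothetical, and the toy graph and flags are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

import networkx as nx


def mean_degree_centrality_by_group(graph, attribute):
    """Average degree centrality for each value of a categorical node attribute."""
    centrality = nx.degree_centrality(graph)
    groups = defaultdict(list)
    for node, value in graph.nodes(data=attribute):
        groups[value].append(centrality[node])
    return {value: mean(scores) for value, scores in groups.items()}


# Toy graph with a made-up 'mature' flag (not Twitch data).
toy = nx.Graph([(0, 1), (0, 2), (0, 3), (1, 2)])
nx.set_node_attributes(toy, {0: True, 1: True, 2: False, 3: False}, name="mature")

averages = mean_degree_centrality_by_group(toy, "mature")
for flag, avg in sorted(averages.items()):
    print(f"mature={flag}: mean degree centrality = {avg:.3f}")
```

A gap between the two averages is only descriptive; deciding whether it is meaningful would need a follow-up test, such as a permutation test on the group labels.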