# Social Network Analysis Project

## Table of Contents

##### 1. [Introduction](#intro-header)

1.1. [The Data](#data)<br>
1.2. [Research Question](#research-question)<br>
1.3. [Our Approach](#approach)

##### 2. [Building the Network](#build-header)

2.1. [Data Cleansing](#cleansing)<br>
2.2. [Create the Network](#create)<br>
2.3. [Correlation between features](#feat-matrix)

##### 3. [Distributions](#dis-header)

3.1. [Degree Distributions](#degree-dis)<br>
3.2. [Betweenness Distributions](#bet-dis)

##### 4. [Degree Centrality](#deg-header)

4.1. [Degree in Category](#cat-degree)<br>
4.2. [Degree outside Category](#others-degree)


<hr>

# 1. Introduction <a class="anchor" id="intro-header"></a>
## 1.1. The Data <a class="anchor" id="data"></a>

We chose the Cora citation network. It is a directed, unweighted network in which nodes represent scientific papers and an edge between two nodes indicates that the left node cites the right node. In addition, the papers are classified into categories and sub-categories.
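
To make the edge direction concrete, here is a minimal sketch on a made-up three-paper graph (the paper names are hypothetical and not taken from Cora). With an edge `(u, v)` meaning "u cites v", the in-degree of a node is the number of times it is cited, and the out-degree is the number of papers it cites.

```python
import networkx as nx

# Hypothetical toy citation graph: an edge (u, v) means "u cites v".
toy = nx.DiGraph()
toy.add_edges_from([
    ("paper_b", "paper_a"),
    ("paper_c", "paper_a"),
    ("paper_c", "paper_b"),
])

times_cited = dict(toy.in_degree())        # how often each paper is cited
references_made = dict(toy.out_degree())   # how many papers each one cites

print(times_cited)
print(references_made)
```

In this toy graph `paper_a` is cited twice and cites nothing, which is the kind of node the in-degree analysis later in the report is looking for.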

## 1.2. Research Question <a class="anchor" id="research-question"></a>

1. What is the most cited sub-category that all other categories depend on?
2. If we cluster our nodes (look for communities within the network), which categories will be in each community?

We expect each community to include articles from the same category or sub-category, depending on the number of communities we look for.


## 1.3. Our Approach <a class="anchor" id="approach"></a>

##### Degree Centrality
We'll calculate two in-degrees for each node:
1. The in-degree of the node within its category.
2. The in-degree of the node from all categories other than its own.

Both degrees will be normalized so that we can compare them. We will use the second measure to answer our first research question.


##### Community Detection
We want to check whether the clusters consist of nodes from the same category or sub-category, using two methods:
1. Removing edges with high betweenness and checking the resulting communities, in other words the Girvan-Newman method.
2. Using either the modularity method or the Louvain community detection method.  
<b>#TODO change to selected one above</b>
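
As a rough illustration of the two methods, the sketch below runs them on a made-up undirected graph of two triangles joined by one bridge edge (not the Cora data). It uses `girvan_newman` and `greedy_modularity_communities` from NetworkX; the greedy modularity routine stands in here for "the modularity method", which is our assumption, and newer NetworkX versions also offer a Louvain implementation.

```python
import networkx as nx
from networkx.algorithms.community import (
    girvan_newman,
    greedy_modularity_communities,
)

# Toy graph (hypothetical): triangles 0-1-2 and 3-4-5 joined by the bridge 2-3.
G = nx.Graph([(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)])

# Girvan-Newman: repeatedly remove the edge with the highest betweenness.
# The first item of the generator is the first split into two communities.
gn_split = [set(c) for c in next(girvan_newman(G))]

# Greedy modularity maximisation: merge communities while modularity improves.
mod_split = [set(c) for c in greedy_modularity_communities(G)]

print(gn_split)
print(mod_split)
```

On this toy graph both methods separate the two triangles, because the bridge carries every shortest path between them. On the real network we would compare each detected community against the `category` and `sub_cat` labels.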


### Imports

In [1]:
# Import packages
import numpy as np
import os
import pandas as pd
import networkx as nx
import time
from random import sample
import sys
import matplotlib.pyplot as plt
import plotly
import plotly.subplots  # needed for plotly.subplots.make_subplots below
import plotly.graph_objects as go
import plotly.express as px

<hr>

# 2. Building the Network <a class="anchor" id="build-header"></a>
## 2.1. Data Cleansing <a class="anchor" id="cleansing"></a>

Load the data from the CSV files:

In [2]:
# edges.csv has 91500 rows, nodes.csv has 23166 rows
edges = pd.read_csv('./data/edges.csv', delimiter=' ')
nodes = pd.read_csv('./data/nodes.csv', delimiter=' ')
nodes = nodes[['network_id', 'node_id']]

# Get the ids and each node id's category, and combine them in the names df
names = pd.read_csv('./data/ent.subelj_cora_cora.id.csv', delimiter=' ')
cat = pd.read_csv('./data/ent.subelj_cora_cora.class.csv', delimiter=' ')
names['category'] = cat[['category']]

result = pd.merge(nodes, names, on='node_id')
result = result.drop(['node_id'], axis=1)

# Split the '/'-separated category path into category and sub-category,
# replacing underscores with spaces (vectorized instead of a row-by-row loop)
cat_df = result.copy()
parts = cat_df['category'].astype(str).str.split('/')
cat_df['category'] = parts.str[1].str.replace('_', ' ', regex=False)
cat_df['sub_cat'] = parts.str[2].str.replace('_', ' ', regex=False)

The final data consists of two dataframes: one for the edges, and one listing each node with its category and sub-category. For example:

In [3]:
edges.head(2)

|   | cites | cited |
|---|---|---|
| 0 | 20128 | 6078 |
| 1 | 22236 | 10436 |


In [4]:
cat_df.head(2)

|   | network_id | category | sub_cat |
|---|---|---|---|
| 0 | 1 | Databases | Performance |
| 1 | 2 | Human Computer Interaction | Cooperative |


In [5]:
'''
from sklearn import preprocessing 
  
# label_encoder object knows how to understand word labels. 
label_encoder = preprocessing.LabelEncoder() 
# Encode labels in column 'category'.
cat_df['category']= label_encoder.fit_transform(cat_df['category'])
label_encoder.fit_transform(cat_df['category'])
'''

## 2.2. Create the Network <a class="anchor" id="create"></a>


In [6]:
edges_tuple = [tuple(x) for x in edges.to_numpy()]
DG = nx.DiGraph()
DG.add_edges_from(edges_tuple)

Some of the network's properties:

In [7]:
# Note: nx.info was removed in NetworkX 3.0, so this line needs an older version
print(nx.info(DG))
print("Is the Graph directed? " + str(DG.is_directed()))
print("Graph Density is: " + str(nx.density(DG)))
print("Average Clustering: " + str(nx.average_clustering(DG, nodes=None, weight=None, count_zeros=True)))

Name: 
Type: DiGraph
Number of nodes: 23166
Number of edges: 91500
Average in degree:   3.9498
Average out degree:   3.9498
Is the Graph directed? True
Graph Density is: 0.00017050524281260305
Average Clustering: 0.14601503382808564
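
The average degree and the density reported above follow directly from the node and edge counts. The short check below reproduces them by hand; it only assumes the standard definition of density for a directed graph, edges divided by the number of ordered node pairs.

```python
import math

n_nodes = 23166   # number of nodes reported above
n_edges = 91500   # number of edges reported above

# Every directed edge adds one to some node's out-degree and one to some
# node's in-degree, so both averages equal edges / nodes.
avg_degree = n_edges / n_nodes

# Density of a directed graph: edges / (n * (n - 1)),
# where n * (n - 1) is the number of possible ordered pairs of nodes.
density = n_edges / (n_nodes * (n_nodes - 1))

print(round(avg_degree, 4))
print(density)
```

A density this small means that only about 0.017% of all possible citations are present, which is typical for a sparse citation network.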


<hr>

# 3. Distributions <a class="anchor" id="dis-header"></a>
## 3.1. Degree Distribution <a class="anchor" id="degree-dis"></a>

Since our network is directed, we'll calculate both the in-degree and the out-degree of each node. We expect both to follow a power-law distribution.

In [76]:
def dictionary_to_df(pairs, measure):
    '''
    Given the iterable of (node, value) pairs we get from networkx, return a
    df with two columns: one for the value (named by `measure`) and one for
    the number of nodes that have that value ('Freq').
    '''
    values = [value for _, value in pairs]
    # value_counts does the counting in one pass; sort by value for plotting
    counts = pd.Series(values).value_counts().sort_index()
    return pd.DataFrame({measure: counts.index, 'Freq': counts.to_numpy()})

In [9]:
out_deg = DG.out_degree()
in_deg = DG.in_degree()
out_df = dictionary_to_df(out_deg, 'Degree')
in_df = dictionary_to_df(in_deg, 'Degree')

In [31]:
# Plot the in and out degree distribution
fig = plotly.subplots.make_subplots(rows=1, cols=2, horizontal_spacing=0.1,
                                    subplot_titles=("In-Degree Distribution", "Out-Degree Distribution"),
                                    specs=[[{"type": "xy"}, {"type": "xy"}]])
fig.add_trace(
    go.Scatter(x=in_df['Degree'], y=in_df['Freq'], marker_symbol='hexagon2', mode="markers+text",
               marker=dict(size=12, color='rgba(135, 206, 250, 0.7)', line=dict(width=1, color='DarkSlateGrey'))),
    row=1, col=1)
fig.add_trace(
    go.Scatter(x=out_df['Degree'], y=out_df['Freq'], marker_symbol='hexagon2', mode="markers+text",
               marker=dict(size=12, color='rgba(135, 206, 250, 0.7)', line=dict(width=1, color='DarkSlateGrey'))),
    row=1, col=2)
fig.update_xaxes(title_text="Degree")
fig.update_yaxes(title_text="Frequency")
fig.update_layout(height=500, width=1000,showlegend=False)
fig.show()

#### Results 

As expected, both the in-degree and out-degree distributions follow a power law, which means that small degrees are extremely common and large degrees are rare. Notice that the number of nodes with an in-degree of 0 is much higher than the number with an out-degree of 0. This makes sense: most articles cite at least one other article, but a large number of articles have not been cited by anyone so far.
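
One rough way to check a power-law claim is to look at the distribution on log-log axes, where Freq ∝ Degree^(-γ) becomes a straight line with slope -γ. The sketch below fits such a line with `np.polyfit` on synthetic counts generated with γ = 2 (made-up numbers, not the Cora data). Degree 0 has to be left out because its logarithm is undefined, and a least-squares fit like this is only a heuristic, not a rigorous test.

```python
import numpy as np

# Synthetic (made-up) degree counts following Freq = 10000 * Degree**-2.
degree = np.arange(1, 51, dtype=float)
freq = 10000.0 * degree ** -2.0

# On log-log axes a power law is a straight line: log f = log C - gamma * log k.
log_k = np.log10(degree)
log_f = np.log10(freq)
slope, intercept = np.polyfit(log_k, log_f, 1)

gamma_estimate = -slope
print(gamma_estimate, intercept)
```

Applied to the notebook's data, the same fit would take the `Degree` and `Freq` columns of `in_df` or `out_df` after dropping the row with degree 0.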


## 3.2. Betweenness Distribution <a class="anchor" id="bet-dis"></a>


In [None]:
# Betweenness centrality is slow on a network of this size; try not to rerun
bet = nx.betweenness_centrality(DG)
betw_df = dictionary_to_df(bet.items(), 'Betweenness')
# Store the frequency on a log10 scale for plotting
betw_df['Freq'] = np.log10(betw_df['Freq'])

In [None]:
fig = px.scatter(betw_df, x="Betweenness", y="Freq")
fig.update_yaxes(title_text="Log Frequency")
fig.update_layout(height=500, width=700,showlegend=False, title='Betweenness Distribution')
fig.show()

<b>#TODO add conclusions about the betweenness distribution</b>

<hr>

# 4. Degree Centrality <a class="anchor" id="deg-header"></a>

We're aiming to find the category or sub-category that is cited the most. However, as noted in our approach, we can't compare raw in and out degrees across categories, since they don't take into account the number of articles in each category, which affects the degree.
To bypass this problem we decided to calculate two degrees (each with its own in and out degree) for each article:
1. The in-category degree: the degree of the article based on edges between it and other articles in the same category, normalized by the number of articles in the category.
2. The outside-category degree: the degree of the article based on edges between it and articles outside its category, normalized by the number of articles that are not in the node's category.
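
The two measures can be sketched on a tiny made-up example that reuses the column names of `edges` and `cat_df` (the five papers and their categories are hypothetical). This is one possible way to compute the in-degree variants with pandas, under the normalization described above, and not necessarily the final implementation.

```python
import pandas as pd

# Hypothetical toy data with the same column names as `edges` and `cat_df`.
toy_edges = pd.DataFrame({"cites": [2, 3, 4, 5, 5], "cited": [1, 1, 1, 1, 4]})
toy_cat = pd.DataFrame({"network_id": [1, 2, 3, 4, 5],
                        "category": ["A", "A", "A", "B", "B"]})

cat_of = toy_cat.set_index("network_id")["category"]
cat_size = toy_cat["category"].value_counts()

# Mark each edge as staying inside one category or crossing categories.
same_cat = toy_edges["cites"].map(cat_of) == toy_edges["cited"].map(cat_of)
in_cat_count = toy_edges[same_cat].groupby("cited").size()
out_cat_count = toy_edges[~same_cat].groupby("cited").size()

result = toy_cat.set_index("network_id")
own_size = result["category"].map(cat_size)

# Normalize by the number of articles inside / outside the node's category.
result["in_cat_in_degree"] = (
    in_cat_count.reindex(result.index, fill_value=0) / own_size
)
result["out_cat_in_degree"] = (
    out_cat_count.reindex(result.index, fill_value=0) / (len(toy_cat) - own_size)
)
print(result)
```

Here paper 1 is cited by two of the three papers in its own category and by both papers outside it, so its outside-category in-degree is 1.0, the highest possible value.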

## 4.1. Degree in Category <a class="anchor" id="cat-degree"></a>

We have 10 categories. We'll now divide the network into 10 sub-networks, one per category, calculate the degree within each one, and normalize it by dividing by the number of articles in that category.


In [None]:
unique_cat = cat_df.category.unique()
per_category = []
for category in unique_cat:
    # Get the ids of all nodes in the category
    idlist = cat_df.loc[cat_df['category'] == category, 'network_id']
    # Only keep edges that are within the category
    cat_edges = edges[edges['cites'].isin(idlist) & edges['cited'].isin(idlist)]
    # Create the graph
    cat_DG = nx.DiGraph()
    cat_DG.add_edges_from(tuple(x) for x in cat_edges.to_numpy())
    # Calculate in and out degree
    final_out = pd.DataFrame(cat_DG.out_degree(), columns=['network_id', 'out_degree'])
    final_in = pd.DataFrame(cat_DG.in_degree(), columns=['network_id', 'in_degree'])
    # Normalize the degree by the number of articles in the category graph
    # (articles with no in-category edges are not in cat_DG, so they are not counted)
    final_out['out_degree'] = final_out.out_degree.div(final_out.shape[0])
    final_in['in_degree'] = final_in.in_degree.div(final_in.shape[0])
    per_category.append(pd.merge(final_out, final_in, on='network_id'))

# Concatenate once instead of growing the dataframe inside the loop
degrees_df = pd.concat(per_category, ignore_index=True)
degrees_df = degrees_df.sort_values(by=['in_degree'], ascending=False)
degrees_df = pd.merge(degrees_df, cat_df, on='network_id')
degrees_df

|   | network_id | out_degree | in_degree | category | sub_cat |
|---|---|---|---|---|---|
| 0 | 11096 | 0.000000 | 0.144676 | Encryption and Compression | Encryption |
| 1 | 14889 | 0.000000 | 0.092074 | Networking | Protocols |
| 2 | 6317 | 0.001157 | 0.084491 | Encryption and Compression | Encryption |
| 3 | 14764 | 0.002188 | 0.076586 | Information Retrieval | Filtering |
| 4 | 8818 | 0.003203 | 0.076061 | Networking | Routing |
| ... | ... | ... | ... | ... | ... |
| 21356 | 18080 | 0.001033 | 0.000000 | Data Structures Algorithms and Theory | Logic |
| 21357 | 17010 | 0.000516 | 0.000000 | Data Structures Algorithms and Theory | Computational Complexity |
| 21358 | 3858 | 0.000516 | 0.000000 | Data Structures Algorithms and Theory | Randomized |
| 21359 | 10172 | 0.000516 | 0.000000 | Data Structures Algorithms and Theory | Parallel |


## 4.2. Degree outside Category <a class="anchor" id="others-degree"></a>

