<a href="https://colab.research.google.com/github/ArthurCBx/Applied_Social_Network_Analysis/blob/main/module%204/Network_Evolution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preferential Attachment Model

## Degree Distributions
- The degree of a node in an undirected graph is the number of neighbors it has;
- The degree distribution of a graph is the probability distribution of the degrees over the entire network

- To plot in networkx:
```python
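import networkx as nx

# Assumes G is an existing networkx graph
degrees = dict(G.degree())
degree_values = sorted(set(degrees.values()))
# Fraction of nodes having each observed degree value
histogram = [list(degrees.values()).count(i) / nx.number_of_nodes(G) for i in degree_values]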

import matplotlib.pyplot as plt
plt.bar(degree_values,histogram)
plt.xlabel('Degree')
plt.ylabel('Fraction of Nodes')
plt.show()
```

## In-degree Distributions
- The in-degree of a node in a directed graph is the number of in-links it has
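
The in-degree distribution is built the same way as the degree distribution, counting only incoming links. A minimal sketch, assuming a small hypothetical edge list stored as `(source, target)` pairs (with networkx, `G.in_degree()` on a `DiGraph` gives the same counts):

```python
from collections import Counter

# Hypothetical directed edge list: (source, target) means source -> target.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("D", "C"), ("C", "A")]
nodes = {node for edge in edges for node in edge}

# In-degree of a node = number of edges pointing at it.
in_degree = Counter(target for _, target in edges)

# Fraction of nodes with each in-degree value (nodes without in-links count as 0).
counts = Counter(in_degree.get(node, 0) for node in nodes)
in_degree_distribution = {k: counts[k] / len(nodes) for k in sorted(counts)}
print(in_degree_distribution)
```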

## Preferential Attachment Model
- Start with two nodes connected by an edge;
- At each time step, add a new node with an edge connecting it to an existing node;
- Choose the node to connect to at random, with probability proportional to each node's degree;
- The probability of connecting to a node u of degree $k_u$ is $k_u/\sum_i k_i$;
- As the number of nodes increases, the degree distribution of the network under the preferential attachment model approaches the power law $P(k) = Ck^{-3}$;
- The preferential attachment model produces networks with degree distributions similar to those of real networks.
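
The growth rule above can be simulated in a few lines of plain Python. This is only a sketch: `preferential_attachment` and the `endpoints` list are names chosen here for illustration, and each new node adds exactly one edge, as in the steps listed.

```python
import random
from collections import Counter

def preferential_attachment(n, seed=None):
    """Grow a graph to n nodes, one edge per new node (hypothetical helper)."""
    rng = random.Random(seed)
    edges = [(0, 1)]  # start with two nodes connected by an edge
    # Each node appears in `endpoints` once per incident edge, so a uniform
    # draw from this list picks node u with probability k_u / sum_i k_i.
    endpoints = [0, 1]
    for new_node in range(2, n):
        target = rng.choice(endpoints)
        edges.append((new_node, target))
        endpoints.extend([new_node, target])
    return edges

edges = preferential_attachment(2000, seed=0)
degree = Counter(node for edge in edges for node in edge)
degree_counts = Counter(degree.values())
# Fraction of nodes for the five smallest degrees: it drops quickly with k.
for k in sorted(degree_counts)[:5]:
    print(k, degree_counts[k] / len(degree))
```

Sampling from the list of edge endpoints avoids recomputing the degree sum at every step.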

### Networkx
- `barabasi_albert_graph(n, m)` returns a network with n nodes. Each new node attaches to m existing nodes according to the Preferential Attachment model.

# Small World Networks

## Milgram Small World Experiment
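### Set up (1960s)
- 296 randomly chosen "starters" asked to forward a letter to a "target" person;
- Target was a stockbroker in Boston;
- Instructions for starters:
  - Send the letter to the target if you know him on a first-name basis;
  - Otherwise, send it to someone you know on a first-name basis who is more likely to know the target.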
### Results
- 64/296 letters reached the target;
- Median chain length was 6

## Clustering Coefficient
- Local clustering coefficient of a node:
  - Fraction of pairs of the node's friends that are friends with each other.
- Facebook 2011: High average CC
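
The definition of the local clustering coefficient translates directly into code. A minimal sketch on a hypothetical four-person friendship graph stored as a dict of neighbor sets (`local_clustering` is an illustrative helper; networkx offers `nx.clustering` and `nx.average_clustering` for real graphs):

```python
from itertools import combinations

def local_clustering(adj, node):
    """Fraction of pairs of `node`'s neighbors that are themselves connected."""
    neighbors = adj[node]
    if len(neighbors) < 2:
        return 0.0  # convention: fewer than two friends means no pairs to check
    pairs = list(combinations(neighbors, 2))
    linked = sum(1 for a, b in pairs if b in adj[a])
    return linked / len(pairs)

# Hypothetical friendship graph as an adjacency dict of sets.
adj = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
print(local_clustering(adj, "A"))  # only B-C of the 3 pairs is linked: 1/3
average_cc = sum(local_clustering(adj, node) for node in adj) / len(adj)
print(average_cc)
```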

## Path Length and Clustering
- Social networks tend to have a high clustering coefficient and a small average path length.
- Can we think of a generative network model that has both properties?
  - How about the Preferential Attachment model?
    - Small average shortest path length, but very low average clustering.
  - How about the Small World model? It achieves both, as described in the next section.

## Small World Model
- **Motivation**: Real networks exhibit high clustering coefficient and small average shortest paths. This is a model that achieves both of these properties.
- Small-world model:
  - Start with a ring of n nodes, where each node is connected to its k nearest neighbors;
  - Fix a parameter $p \in [0,1]$;
  - Consider each edge (u, v). With probability p, select a node w at random and rewire the edge (u, v) so it becomes (u, w).
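
The rewiring procedure can be sketched in plain Python. The helper name `small_world`, the even-`k` assumption, and the choice to skip targets that would create self-loops or duplicate edges are assumptions made here for illustration.

```python
import random

def small_world(n, k, p, seed=None):
    """Ring lattice with random rewiring (hypothetical helper, k assumed even)."""
    rng = random.Random(seed)
    adj = {u: set() for u in range(n)}
    # Ring lattice: connect each node to its k/2 nearest neighbors on each side.
    for u in range(n):
        for step in range(1, k // 2 + 1):
            v = (u + step) % n
            adj[u].add(v)
            adj[v].add(u)
    # Visit each lattice edge (u, v) once and rewire it with probability p.
    for step in range(1, k // 2 + 1):
        for u in range(n):
            v = (u + step) % n
            if v in adj[u] and rng.random() < p:
                # Candidate targets exclude u itself and u's current neighbors.
                candidates = [w for w in range(n) if w != u and w not in adj[u]]
                if candidates:
                    w = rng.choice(candidates)
                    adj[u].remove(v)
                    adj[v].remove(u)
                    adj[u].add(w)
                    adj[w].add(u)
    return adj

lattice = small_world(20, 4, 0.0)           # p = 0: the ring lattice is untouched
rewired = small_world(20, 4, 0.3, seed=1)   # p = 0.3: some edges become shortcuts
print(sum(len(nbrs) for nbrs in rewired.values()) // 2)  # edge count is preserved
```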

### NetworkX
- `watts_strogatz_graph(n, k, p)` returns a small world network with n nodes, starting with a ring lattice with each node connected to its k nearest neighbors and rewiring probability p.

- Another option, `connected_watts_strogatz_graph(n, k, p, t)`, only returns connected networks: it tries up to t times to generate one.
- A further option, `newman_watts_strogatz_graph(n, k, p)`, runs a model similar to the small-world model, but rather than rewiring edges, it adds new edges with probability p.
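
To see how the generators differ on the two properties discussed above, one can compare average clustering and average shortest path length side by side. The sizes and probabilities below (`n = 200`, `k = 4`, `p = 0.05` and `p = 0.5`) are arbitrary values picked for this sketch.

```python
import networkx as nx

n = 200
graphs = {
    "preferential attachment": nx.barabasi_albert_graph(n, 2, seed=0),
    "small world, p=0.05": nx.connected_watts_strogatz_graph(n, 4, 0.05, tries=100, seed=0),
    "small world, p=0.5": nx.connected_watts_strogatz_graph(n, 4, 0.5, tries=100, seed=0),
}

for name, graph in graphs.items():
    clustering = nx.average_clustering(graph)
    path_length = nx.average_shortest_path_length(graph)
    print(f"{name}: clustering={clustering:.3f}, avg path length={path_length:.2f}")
```

The connected variant is used so that the average shortest path length is always defined.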

# Link Prediction
- What new edges are likely to form in this network?
- Given a pair of nodes, how can we assess whether they are likely to connect?

## Measure 1: Common Neighbors
- The number of common neighbors of nodes X and Y is:
$comm\_neigh(X, Y) = |N(X) \cap N(Y)|$, where N(X) is the set of neighbors of node X.

### NetworkX
- `nx.common_neighbors(G, u, v)`
- `nx.non_edges(G)`
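
The two calls combine naturally: `nx.non_edges` lists the candidate pairs, and `nx.common_neighbors` scores each one. A small sketch on a hypothetical six-edge graph:

```python
import networkx as nx

# Hypothetical toy graph.
G = nx.Graph([("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("C", "D"), ("D", "E")])

# Score every pair of nodes that is not yet connected.
scores = [
    (u, v, len(list(nx.common_neighbors(G, u, v))))
    for u, v in nx.non_edges(G)
]
scores.sort(key=lambda item: item[2], reverse=True)
for u, v, score in scores:
    print(u, v, score)
```

Here A and D share the neighbors B and C, so that pair ranks first.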

## Measure 2: Jaccard Coefficient
- Number of common neighbors normalized by the total number of neighbors
- $jacc\_coeff(X,Y) = \frac{|N(X)\cap N(Y)|}{|N(X) \cup N(Y)|}$
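
As a check on the formula, the coefficient can be computed by hand from neighbor sets. A minimal sketch, assuming a hypothetical graph stored as a dict of neighbor sets; returning 0 for two isolated nodes is a convention chosen here.

```python
def jaccard_coefficient(adj, x, y):
    """|N(x) & N(y)| / |N(x) | N(y)| for an adjacency dict of sets."""
    union = adj[x] | adj[y]
    if not union:
        return 0.0  # both nodes isolated: define the score as 0
    return len(adj[x] & adj[y]) / len(union)

# Hypothetical graph as an adjacency dict of sets.
adj = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "D"},
    "D": {"B", "C", "E"},
    "E": {"D"},
}
print(jaccard_coefficient(adj, "A", "D"))  # 2 common neighbors out of 3 in total
print(jaccard_coefficient(adj, "B", "E"))  # 1 common neighbor out of 3 in total
```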

### NetworkX
- `list(nx.jaccard_coefficient(G))`

## Measure 3: Resource Allocation
- Fraction of a "resource" that a node can send to another through their common neighbors.
- The Resource Allocation index of nodes X and Y is:
$res\_alloc(X, Y) = \sum_{u \in N(X) \cap N(Y)}\frac{1}{|N(u)|}$

### NetworkX
`list(nx.resource_allocation_index(G))`

## Measure 4: Adamic-Adar Index
- Similar to resource allocation index, but with log in the denominator.
- The Adamic-Adar index of nodes X and Y is: $adamic\_adar(X, Y) = \sum_{u \in N(X) \cap N(Y)}\frac{1}{\log(|N(u)|)}$
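
The resource allocation and Adamic-Adar indices differ only in how strongly they discount high-degree common neighbors. A plain-Python sketch on a hypothetical adjacency dict shows both (in networkx, `nx.adamic_adar_index(G)` plays the same role as `nx.resource_allocation_index(G)`):

```python
import math

def resource_allocation(adj, x, y):
    """Sum of 1/degree over the common neighbors of x and y."""
    return sum(1 / len(adj[u]) for u in adj[x] & adj[y])

def adamic_adar(adj, x, y):
    """Sum of 1/log(degree) over the common neighbors of x and y."""
    # A common neighbor of two distinct nodes has degree >= 2, so log(degree) > 0.
    return sum(1 / math.log(len(adj[u])) for u in adj[x] & adj[y])

# Hypothetical graph as an adjacency dict of sets.
adj = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "D"},
    "D": {"B", "C", "E"},
    "E": {"D"},
}
# A and D share B and C, each of degree 3.
print(resource_allocation(adj, "A", "D"))  # 1/3 + 1/3
print(adamic_adar(adj, "A", "D"))          # 1/log(3) + 1/log(3)
```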

## Measure 5: Pref. Attachment
- In the preferential attachment model, nodes with high degree get more neighbors.
- The preferential attachment score of nodes X and Y is: $pref\_attach(X, Y) = |N(X)||N(Y)|$

### NetworkX
`list(nx.preferential_attachment(G))`


## Community Structure
- Some measures consider the community structure of the network for link prediction;
- Assume the nodes in this network belong to different communities (sets of nodes);
- Pairs of nodes who belong to the same community and have many common neighbors in their community are likely to form an edge

## Measure 6: Community Common Neighbors
- The Common Neighbor Soundarajan-Hopcroft score of nodes X and Y is:
- $cn\_soundarajan\_hopcroft(X,Y) = |N(X) \cap N(Y)| + \sum_{u \in N(X) \cap N(Y)}f(u)$, where f(u) is 1 if u is in the same community as X and Y, and 0 otherwise.

### NetworkX
- Assign nodes to communities with the node attribute `community`;
- `list(nx.cn_soundarajan_hopcroft(G))`

## Measure 7: Community Resource Allocation
- Similar to resource allocation, but only considering nodes in the same community
- $ra\_soundarajan\_hopcroft(X, Y) = \sum_{u \in N(X) \cap N(Y)}\frac{f(u)}{|N(u)|}$, where $|N(u)|$ is the degree of the common neighbor u being summed over, and f(u) is 1 if u is in the same community as X and Y, and 0 otherwise.
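
Both community-aware scores can be checked by hand on a small example. The sketch below assumes a hypothetical graph and a hypothetical `community` dict mapping each node to a community id; f(u) is taken to be 1 only when u, X and Y all share a community.

```python
def cn_soundarajan_hopcroft(adj, community, x, y):
    """Common neighbors, plus one extra point per neighbor in the shared community."""
    common = adj[x] & adj[y]
    same = community[x] == community[y]
    bonus = sum(1 for u in common if same and community[u] == community[x])
    return len(common) + bonus

def ra_soundarajan_hopcroft(adj, community, x, y):
    """Resource allocation restricted to common neighbors in the shared community."""
    if community[x] != community[y]:
        return 0.0
    return sum(1 / len(adj[u]) for u in adj[x] & adj[y]
               if community[u] == community[x])

# Hypothetical graph and community assignment.
adj = {
    "A": {"B", "C", "E"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "D"},
    "D": {"B", "C", "E"},
    "E": {"A", "D"},
}
community = {"A": 0, "B": 0, "C": 1, "D": 0, "E": 0}

# A and D share B, C and E; only B and E are in their community.
print(cn_soundarajan_hopcroft(adj, community, "A", "D"))  # 3 + 2 = 5
print(ra_soundarajan_hopcroft(adj, community, "A", "D"))  # 1/3 + 1/2
```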

### NetworkX
- `list(nx.ra_index_soundarajan_hopcroft(G))`

# Assignment 4

In [None]:
!git clone https://github.com/ArthurCBx/Applied_Social_Network_Analysis.git

Cloning into 'Applied_Social_Network_Analysis'...
remote: Enumerating objects: 55, done.
remote: Counting objects: 100% (55/55), done.
remote: Compressing objects: 100% (46/46), done.
remote: Total 55 (delta 6), reused 41 (delta 2), pack-reused 0 (from 0)
Receiving objects: 100% (55/55), 5.60 MiB | 16.40 MiB/s, done.
Resolving deltas: 100% (6/6), done.


---

## Part 1 - Random Graph Identification

For the first part of this assignment you will analyze randomly generated graphs and determine which algorithm created them.

In [None]:
import networkx as nx
import pandas as pd
import numpy as np
import pickle

with open("Applied_Social_Network_Analysis/module 4/assets/A4_P1_G1", 'rb') as f:
  G1 = pickle.load(f)
with open("Applied_Social_Network_Analysis/module 4/assets/A4_P1_G2", 'rb') as f:
  G2 = pickle.load(f)
with open("Applied_Social_Network_Analysis/module 4/assets/A4_P1_G3", 'rb') as f:
  G3 = pickle.load(f)
with open("Applied_Social_Network_Analysis/module 4/assets/A4_P1_G4", 'rb') as f:
  G4 = pickle.load(f)
with open("Applied_Social_Network_Analysis/module 4/assets/A4_P1_G5", 'rb') as f:
  G5 = pickle.load(f)

P1_Graphs = [G1, G2, G3, G4, G5]

<br>
`P1_Graphs` is a list containing 5 networkx graphs. Each of these graphs was generated by one of three possible algorithms:
* Preferential Attachment (`'PA'`)
* Small World with low probability of rewiring (`'SW_L'`)
* Small World with high probability of rewiring (`'SW_H'`)

Analyze each of the 5 graphs using any methodology and determine which of the three algorithms generated each graph.

*The `graph_identification` function should return a list of length 5 where each element in the list is either `'PA'`, `'SW_L'`, or `'SW_H'`.*

In [None]:
for i, network in enumerate(P1_Graphs):
  print(f"Network {i} has avg_path_length: {nx.average_shortest_path_length(network)}")
  print(f"Network {i} has avg_clustering_coef: {nx.average_clustering(network)}")
  print()

Network 0 has avg_path_length: 6.530506506506507
Network 0 has avg_clustering_coef: 0.0

Network 1 has avg_path_length: 43.80284684684685
Network 1 has avg_clustering_coef: 0.49310000000000004

Network 2 has avg_path_length: 39.007695695695695
Network 2 has avg_clustering_coef: 0.48973333333333335

Network 3 has avg_path_length: 8.158990990990992
Network 3 has avg_clustering_coef: 0.0

Network 4 has avg_path_length: 8.532046046046046
Network 4 has avg_clustering_coef: 0.36504285714285717



In [None]:
def graph_identification():
  # Decision rule, based on the measurements printed above:
  #   low path length and near-zero clustering           -> 'PA'
  #   high path length and high clustering               -> 'SW_L'
  #   low path length and moderately reduced clustering  -> 'SW_H'
  return ['PA', 'SW_L', 'SW_L', 'PA', 'SW_H']
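
The same decision rule can be written as an explicit function. This is only a sketch: `classify_graph` and its two thresholds are hypothetical, picked by eye from the measurements printed above, and would need re-tuning for graphs of a different size.

```python
def classify_graph(avg_path_length, avg_clustering,
                   clustering_floor=0.05, path_ceiling=20.0):
    """Label a graph from two summary statistics (hypothetical thresholds)."""
    if avg_clustering < clustering_floor:
        return 'PA'    # hubs give short paths but almost no triangles
    if avg_path_length > path_ceiling:
        return 'SW_L'  # lattice mostly intact: high clustering, long paths
    return 'SW_H'      # many shortcuts: short paths, clustering partly kept

# (avg_path_length, avg_clustering), rounded from the output above
measurements = [
    (6.53, 0.0), (43.80, 0.4931), (39.01, 0.4897), (8.16, 0.0), (8.53, 0.3650),
]
labels = [classify_graph(*m) for m in measurements]
print(labels)  # ['PA', 'SW_L', 'SW_L', 'PA', 'SW_H']
```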

---

## Part 2 - Company Emails

For the second part of this assignment you will be working with a company's email network where each node corresponds to a person at the company, and each edge indicates that at least one email has been sent between two people.

The network also contains the node attributes `Department` and `ManagementSalary`.

`Department` indicates the department in the company to which the person belongs, and `ManagementSalary` indicates whether that person is receiving a management position salary.

In [None]:
with open('Applied_Social_Network_Analysis/module 4/assets/email_prediction_NEW.txt', 'rb') as f:
  G = pickle.load(f)

print(f"Graph with {len(nx.nodes(G))} nodes and {len(nx.edges(G))} edges")

Graph with 1005 nodes and 16706 edges


### Part 2A - Salary Prediction

Using network `G`, identify the people in the network with missing values for the node attribute `ManagementSalary` and predict whether or not these individuals are receiving a management position salary.

To accomplish this, you will need to create a matrix of node features of your choice using networkx, train a sklearn classifier on nodes that have `ManagementSalary` data, and predict a probability of the node receiving a management salary for nodes where `ManagementSalary` is missing.

Your predictions will need to be given as the probability that the corresponding employee is receiving a management position salary.

The evaluation metric for this assignment is the Area Under the ROC Curve (AUC).

Your grade will be based on the AUC score computed for your classifier. A model with an AUC of 0.75 or higher will receive full points.

Using your trained classifier, return a Pandas series of length 252 with the data being the probability of receiving a management salary, and the index being the node id.

    Example:
    
        1       1.0
        2       0.0
        5       0.8
        8       1.0
            ...
        996     0.7
        1000    0.5
        1001    0.0
        Length: 252, dtype: float64
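
The feature matrix can be built from any per-node network measures. A minimal sketch, assuming degree, clustering and two centralities as features; `node_feature_rows` is a hypothetical helper, and the karate club graph stands in for `G` so the block runs on its own.

```python
import networkx as nx

def node_feature_rows(graph):
    """Return {node: {feature_name: value}} (hypothetical helper)."""
    degree = dict(graph.degree())
    clustering = nx.clustering(graph)
    degree_centrality = nx.degree_centrality(graph)
    closeness = nx.closeness_centrality(graph)
    return {
        node: {
            "degree": degree[node],
            "clustering": clustering[node],
            "degree_centrality": degree_centrality[node],
            "closeness": closeness[node],
        }
        for node in graph
    }

example = nx.karate_club_graph()  # stand-in graph for illustration
rows = node_feature_rows(example)
print(rows[0])
```

Passing `rows` to `pd.DataFrame.from_dict(rows, orient='index')` would give one row per node, ready for a sklearn classifier.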

In [None]:
list(G.nodes(data=True))[:5] # print the first 5 nodes

[(0, {'Department': 1, 'ManagementSalary': 0.0}),
 (1, {'Department': 1, 'ManagementSalary': nan}),
 (581, {'Department': 3, 'ManagementSalary': 0.0}),
 (6, {'Department': 25, 'ManagementSalary': 1.0}),
 (65, {'Department': 4, 'ManagementSalary': nan})]

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.DataFrame(index=G.nodes(), columns=G.nodes()).fillna(0)
df['Department'] = pd.Series(nx.get_node_attributes(G, name='Department'))
df['ManagementSalary'] = pd.Series(nx.get_node_attributes(G, name='ManagementSalary'))

adj_df = nx.to_pandas_adjacency(G, nodelist=list(G.nodes()))
df.loc[:, list(G.nodes())] = adj_df

df = df.rename(str, axis='columns')
# Separate the rows with a missing label (copied so they can be modified later)
real_test_data = df[df['ManagementSalary'].isna()].copy()
df = df.dropna()
df.head(5)


Unnamed: 0,0,1,581,6,65,64,73,74,459,268,...,944,772,862,798,808,965,973,975,Department,ManagementSalary
0,1,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,1,0.0
581,1,0,1,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,3,0.0
6,1,0,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,25,1.0
64,1,0,0,1,1,1,0,1,0,0,...,0,0,0,0,0,0,0,0,25,1.0
73,1,0,0,0,0,0,1,1,0,1,...,0,0,0,0,0,0,0,0,1,0.0


In [None]:
# standardizing the department data
scaler = StandardScaler()
df['Department'] = scaler.fit_transform(df['Department'].to_frame())
X = df.drop('ManagementSalary',axis=1)
y = df['ManagementSalary']
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [None]:
from sklearn.metrics import roc_auc_score
random_forest_clf = RandomForestClassifier()
random_forest_clf.fit(X_train, y_train)
predict_proba = random_forest_clf.predict_proba(X_test)[:, 1]

roc_auc_score(y_test, predict_proba)

np.float64(0.8401939655172413)

In [None]:
real_test_data['Department'] = scaler.transform(real_test_data['Department'].to_frame())
X_to_predict = real_test_data.drop('ManagementSalary',axis=1)
pd.Series(random_forest_clf.predict_proba(X_to_predict)[:, 1], index=X_to_predict.index)

Unnamed: 0,0
1,0.17
65,0.22
18,0.21
215,0.32
283,0.61
...,...
691,0.00
788,0.02
944,0.01
798,0.00


In [None]:
def salary_predictions():
  from sklearn.preprocessing import StandardScaler
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.metrics import roc_auc_score


  df = pd.DataFrame(index=G.nodes(), columns=G.nodes()).fillna(0)
  df['Department'] = pd.Series(nx.get_node_attributes(G, name='Department'))
  df['ManagementSalary'] = pd.Series(nx.get_node_attributes(G, name='ManagementSalary'))

  # Adding node edges as features
  adj_df = nx.to_pandas_adjacency(G, nodelist=list(G.nodes()))
  df.loc[:, list(G.nodes())] = adj_df

  df = df.rename(str, axis='columns')

  # Separate the rows with a missing label (copied so they can be modified later)
  real_test_data = df[df['ManagementSalary'].isna()].copy()
  df = df.dropna()

  # standardizing the department data
  scaler = StandardScaler()
  df['Department'] = scaler.fit_transform(df['Department'].to_frame())
  X = df.drop('ManagementSalary',axis=1)
  y = df['ManagementSalary']

  # Training the classifier with as much data as possible
  random_forest_clf = RandomForestClassifier()
  random_forest_clf.fit(X, y)

  # Standardizing the department data with the same scaler used before
  real_test_data['Department'] = scaler.transform(real_test_data['Department'].to_frame())
  X_to_predict = real_test_data.drop('ManagementSalary',axis=1)

  # Predicting the answer
  predict_proba = random_forest_clf.predict_proba(X_to_predict)[:, 1]
  answer = pd.Series(predict_proba, index=X_to_predict.index)
  return answer

### Part 2B - New Connections Prediction

For the last part of this assignment, you will predict future connections between employees of the network. The future connections information has been loaded into the variable `future_connections`. The index is a tuple indicating a pair of nodes that currently do not have a connection, and the `Future Connection` column indicates if an edge between those two nodes will exist in the future, where a value of 1.0 indicates a future connection.

In [None]:
future_connections = pd.read_csv('Applied_Social_Network_Analysis/module 4/assets/Future_Connections.csv', index_col=0, converters={0: eval})
future_connections.head(5)

Unnamed: 0,Future Connection
"(6, 840)",0.0
"(4, 197)",0.0
"(620, 979)",0.0
"(519, 872)",0.0
"(382, 423)",0.0


Using network `G` and `future_connections`, identify the edges in `future_connections` with missing values and predict whether or not these edges will have a future connection.

To accomplish this, you will need to:      
1. Create a matrix of features of your choice for the edges found in `future_connections` using Networkx     
2. Train a sklearn classifier on those edges in `future_connections` that have `Future Connection` data     
3. Predict a probability of the edge being a future connection for those edges in `future_connections` where `Future Connection` is missing.



Your predictions will need to be given as the probability of the corresponding edge being a future connection.

The evaluation metric for this assignment is the Area Under the ROC Curve (AUC).

Your grade will be based on the AUC score computed for your classifier. A model with an AUC of 0.75 or higher will receive full points.

Using your trained classifier, return a series of length 122112 with the data being the probability of the edge being a future connection, and the index being the edge as represented by a tuple of nodes.

    Example:
    
        (107, 348)    0.35
        (542, 751)    0.40
        (20, 426)     0.55
        (50, 989)     0.35
                  ...
        (939, 940)    0.15
        (555, 905)    0.35
        (75, 101)     0.65
        Length: 122112, dtype: float64

In [None]:
series = pd.Series(nx.get_node_attributes(G, 'Department'), index = G.nodes(), name = 'Department')
series

Unnamed: 0,Department
0,1
1,1
581,3
6,25
65,4
...,...
798,1
808,20
965,4
973,14


In [None]:
edge_and_score_cn = list(nx.cn_soundarajan_hopcroft(G,community='Department'))
soundarajan_hopcroft_score = {(node1, node2): score for node1, node2, score in edge_and_score_cn}

In [None]:
edge_score_ra_index = nx.ra_index_soundarajan_hopcroft(G, community='Department')
edge_score_ra_index = {(node1, node2): score for node1, node2, score in edge_score_ra_index}

In [None]:
edge_jaccard_score = list(nx.jaccard_coefficient(G))
edge_jaccard_score = {(node1, node2): score for node1, node2, score in edge_jaccard_score}

In [None]:
edge_resource_allocation_score = list(nx.resource_allocation_index(G))
edge_resource_allocation_score = {(node1, node2): score for node1, node2, score in edge_resource_allocation_score}


In [None]:
df = pd.DataFrame(list(edge_score_ra_index.values()), index=list(edge_score_ra_index.keys()),columns=['ra_index_score'])
df['soundarajan_hopcroft_score'] = pd.DataFrame(list(soundarajan_hopcroft_score.values()), index=list(soundarajan_hopcroft_score.keys()),columns=['soundarajan_hopcroft_score'])
df['jaccard_score'] = pd.DataFrame(list(edge_jaccard_score.values()), index=list(edge_jaccard_score.keys()),columns=['edge_jaccard_score'])
df['edge_resource_allocation_score'] = pd.DataFrame(list(edge_resource_allocation_score.values()), index=list(edge_resource_allocation_score.keys()),columns=['edge_resource_allocation_score'])


In [None]:
# Use join to efficiently add the score columns from df to future_connections
future_connections = future_connections.join(df)

In [None]:
to_predict = future_connections[future_connections['Future Connection'].isna()].drop('Future Connection',axis=1)
new_df = future_connections.dropna()
X = new_df.drop('Future Connection',axis=1)
y = new_df['Future Connection']

to_predict.head(5)

Unnamed: 0,ra_index_score,soundarajan_hopcroft_score,jaccard_score,edge_resource_allocation_score
"(107, 348)",0.0,0.0,0.009009,0.025562
"(542, 751)",0.0,0.0,0.0,0.0
"(20, 426)",0.0,0.0,0.081967,0.082016
"(50, 989)",0.0,0.0,0.0,0.0
"(942, 986)",0.0,0.0,0.0,0.0


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y)
X_train_transformed = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index)

In [None]:
random_forest_clf = RandomForestClassifier()
random_forest_clf.fit(X_train_transformed, y_train)

X_test_transformed = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)

predict_proba = random_forest_clf.predict_proba(X_test_transformed)[:, 1]
roc_auc_score(y_test, predict_proba)

np.float64(0.874310523635102)

In [None]:
def new_connections_predictions():
  from sklearn.preprocessing import StandardScaler
  from sklearn.ensemble import RandomForestClassifier

  scaler = StandardScaler()
  future_connections = pd.read_csv('Applied_Social_Network_Analysis/module 4/assets/Future_Connections.csv', index_col=0, converters={0: eval})

  series = pd.Series(nx.get_node_attributes(G, 'Department'), index = G.nodes(), name = 'Department')

  # Community Common Neighbors score
  edge_and_score_cn = list(nx.cn_soundarajan_hopcroft(G,community='Department'))
  soundarajan_hopcroft_score = {(node1, node2): score for node1, node2, score in edge_and_score_cn}

  # Community Resource Allocation score
  edge_score_ra_index = nx.ra_index_soundarajan_hopcroft(G, community='Department')
  edge_score_ra_index = {(node1, node2): score for node1, node2, score in edge_score_ra_index}

  # Jaccard score
  edge_jaccard_score = list(nx.jaccard_coefficient(G))
  edge_jaccard_score = {(node1, node2): score for node1, node2, score in edge_jaccard_score}

  # Normal Resource Allocation score
  edge_resource_allocation_score = list(nx.resource_allocation_index(G))
  edge_resource_allocation_score = {(node1, node2): score for node1, node2, score in edge_resource_allocation_score}

  # Unite all the scores in a DataFrame
  df = pd.DataFrame(list(edge_score_ra_index.values()), index=list(edge_score_ra_index.keys()),columns=['ra_index_score'])
  df['soundarajan_hopcroft_score'] = pd.DataFrame(list(soundarajan_hopcroft_score.values()), index=list(soundarajan_hopcroft_score.keys()),columns=['soundarajan_hopcroft_score'])
  df['jaccard_score'] = pd.DataFrame(list(edge_jaccard_score.values()), index=list(edge_jaccard_score.keys()),columns=['edge_jaccard_score'])
  df['edge_resource_allocation_score'] = pd.DataFrame(list(edge_resource_allocation_score.values()), index=list(edge_resource_allocation_score.keys()),columns=['edge_resource_allocation_score'])

  # Use join to add the score columns from df to future_connections
  future_connections = future_connections.join(df)

  # Separate the rows to predict (those with a missing label)
  to_predict = future_connections[future_connections['Future Connection'].isna()]
  X_to_predict = to_predict.drop('Future Connection',axis=1)

  # Separate X, y to train the model
  new_df = future_connections.dropna()
  X = new_df.drop('Future Connection',axis=1)
  y = new_df['Future Connection']

  # Standardize features
  X_transformed = pd.DataFrame(scaler.fit_transform(X), columns=X.columns, index=X.index)
  X_to_predict_transformed = pd.DataFrame(scaler.transform(X_to_predict), columns=X_to_predict.columns, index=X_to_predict.index)

  # Train the data
  random_forest_clf = RandomForestClassifier()
  random_forest_clf.fit(X_transformed, y)

  # Calculate predict probabilities
  predict_proba = random_forest_clf.predict_proba(X_to_predict_transformed)[:, 1]
  return pd.Series(predict_proba, index=X_to_predict_transformed.index)