# Graph Analysis Example

In [1]:
!pip install networkx==1.11



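We need to install version 1.11 because with version 2.0 or later any call to `G.degree` or `G.degree()` raised the error below, most likely because the pickled graph was saved with a 1.x release and lacks the internal attributes that 2.x expects:  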
`--> 346         self._succ = G._succ if hasattr(G, "_succ") else G._adj`  
`AttributeError: 'Graph' object has no attribute '_adj'`

In [2]:
import networkx as nx
import pandas as pd
import numpy as np
import pickle

---

## Node and Edge Predictions

This was an interesting assignment that was based on a company's email network where each node corresponds to a person at the company, and each edge indicates that at least one email has been sent between two people.

The network also contains the node attributes `Department` and `ManagementSalary`. `ManagementSalary` indicates whether that person is receiving a management position salary.

The interesting thing about prediction on networks in this class was that you did not get a dataset of features as usual, but built your own features from network metrics, and which metrics to use depends on what you want to predict. I tried a number of classifiers, such as random forest and gradient boosting, but all gave similar scores, so with everything this close I decided to go with a simple algorithm, logistic regression.  

### Predict a node attribute

The goal was to predict the node attribute `ManagementSalary` using a sklearn classifier. The dependent variable could be 1 (true) or 0 (false), and predictions were evaluated using Area Under the ROC Curve (AUC). So we have a straightforward binary classification problem.
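
To make the evaluation metric concrete, here is a minimal sketch of what AUC measures: the fraction of (positive, negative) pairs in which the positive example receives the higher score. The function name `roc_auc` and the toy labels and scores are made up for illustration; in the notebook the value comes from sklearn's `roc_auc_score`.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities, for illustration only.
labels = [1, 0, 1, 0, 0]
probs = [0.9, 0.2, 0.4, 0.6, 0.1]

# 5 of the 6 (positive, negative) pairs are ordered correctly here.
auc = roc_auc(labels, probs)
print(auc)
```

Because only the ordering of the scores matters, this is why the code below passes predicted probabilities rather than hard 0/1 predictions to the scorer.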

I create a dataframe of node features using networkx. The key assumption with this assignment is that management is more central to the network, since managers communicate with many others by the nature of their jobs.

NetworkX has a lot of functions that provide useful metrics for this purpose. I used degree (the number of connections), degree centrality, closeness centrality, and betweenness centrality.
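
As a rough illustration of what two of these metrics capture, the sketch below computes degree centrality and closeness centrality by hand on a tiny made-up network. The graph, the node names, and both helper functions are hypothetical and written in plain Python; in the notebook the values come from `nx.degree_centrality` and `nx.closeness_centrality`. Betweenness is left out to keep the example short.

```python
from collections import deque

# Hypothetical toy email network: "boss" has exchanged email with everyone.
adj = {
    "boss": {"ann", "bob", "cat"},
    "ann": {"boss", "bob"},
    "bob": {"boss", "ann"},
    "cat": {"boss"},
}


def degree_centrality(adj):
    """Degree divided by the maximum possible degree, n - 1."""
    n = len(adj)
    return {node: len(nbrs) / (n - 1) for node, nbrs in adj.items()}


def closeness_centrality(adj, source):
    """(n - 1) / sum of shortest-path distances; assumes a connected graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:  # breadth-first search gives shortest path lengths
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return (len(adj) - 1) / sum(dist.values())


dc = degree_centrality(adj)
cc = {node: closeness_centrality(adj, node) for node in adj}
for node in adj:
    print(node, round(dc[node], 3), round(cc[node], 3))
```

The node connected to everyone scores highest on both metrics, which is the pattern the management assumption relies on.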

In [3]:
G = nx.read_gpickle('email_prediction.txt')
def node_predictions():
    
    # Custom code here
    forsubmission=False
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.metrics import roc_auc_score

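    # Here we create the data frame where all the features will be metrics of node centrality.
    # Rows are built and indexed from the same node list so attributes and labels stay aligned.
    nodes=list(G.node.keys())
    df=pd.DataFrame([G.node[i] for i in nodes], index=nodes)
    d=G.degree()  # dict of node -> degree in networkx 1.11
    dc=nx.degree_centrality(G)
    cc=nx.closeness_centrality(G)
    bc=nx.betweenness_centrality(G, normalized = True)
    # Look every metric up by node label instead of relying on dict ordering
    df['d'] = [d[i] for i in df.index]
    df['dc'] = [dc[i] for i in df.index]
    df['cc'] = [cc[i] for i in df.index]
    df['bc'] = [bc[i] for i in df.index]
    # Department is not used as a feature for this prediction
    df=df.drop(['Department'], axis=1)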
    
    df_val=df[df['ManagementSalary'].isnull()]
    df2=df[~df['ManagementSalary'].isnull()]
    X_val=df_val.drop(['ManagementSalary'], axis=1)

    X=df2[['d', 'dc', 'cc', 'bc']]    
    y=df2['ManagementSalary']
    X_train, X_test, y_train, y_test = (train_test_split(X,y,random_state = 0))

    # Now we transform the data so all features are on the same scale; the scaler is fit on the training split only
    scaler = StandardScaler().fit(X_train)
    std_X_train = scaler.transform(X_train)
    std_X_test = scaler.transform(X_test)
    std_X_val = scaler.transform(X_val)

    # To build the model we do a grid search over C, scored by AUC, to find the best hyperparameter
    lr = LogisticRegression()
    list_grid = {'C':[0.01, 0.1, 1, 10, 100]} 
    grid_lr_acc = GridSearchCV(lr, param_grid = list_grid, scoring = 'roc_auc')
    grid_lr_acc.fit(std_X_train, y_train)
    if forsubmission==False:
        print('Test set AUC: ', roc_auc_score(y_test, grid_lr_acc.predict_proba(std_X_test)[:,1]))

    return_series=pd.Series(grid_lr_acc.predict_proba(std_X_val)[:,1], index=X_val.index)
    
    return "In the assignment I returned the variable return_series for grading"

In [4]:
node_predictions()

Test set AUC:  0.8656184486373166


'In the assignment I returned the variable return_series for grading'

The automatic grading reported these for the validation predictions:  
`For the salary predictions your AUC 0.9211290992112909 was awarded` 

### Edge predictions

The goal of the edge prediction was to predict future connections. The key assumption here is that nodes that have a lot of connections in common will tend to form connections between themselves later.  
  
Once again, networkx provides a lot of useful link prediction metrics that we can put into a dataframe and use as features. This model uses the basic metrics: Number of Common Neighbors, Jaccard Coefficient, Resource Allocation Index, Adamic-Adar Index, and Preferential Attachment Score, plus two community-aware variants that use `Department` as the community: Common Neighbor Score and Resource Allocation Score (both Soundarajan-Hopcroft).
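
To show what these scores measure for a single candidate pair, here is a minimal plain-Python sketch on a tiny made-up graph. The graph, the department labels, and the helpers `link_scores` and `community_scores` are hypothetical; the formulas follow my understanding of the standard definitions that the networkx functions implement, so treat this as an illustration rather than a replacement for them.

```python
import math

# Hypothetical toy graph and department labels, for illustration only.
adj = {
    1: {2, 3, 4},
    2: {1, 3},
    3: {1, 2, 4, 5},
    4: {1, 3},
    5: {3},
}
department = {1: "A", 2: "A", 3: "B", 4: "A", 5: "B"}


def link_scores(adj, u, v):
    """Neighborhood-based scores for the candidate edge (u, v)."""
    common = adj[u] & adj[v]
    union = adj[u] | adj[v]
    return {
        "cn": len(common),                                     # common neighbors
        "jc": len(common) / len(union) if union else 0.0,      # Jaccard coefficient
        "ra": sum(1 / len(adj[w]) for w in common),            # resource allocation
        "aa": sum(1 / math.log(len(adj[w])) for w in common),  # Adamic-Adar
        "pa": len(adj[u]) * len(adj[v]),                       # preferential attachment
    }


def community_scores(adj, community, u, v):
    """Community-aware variants: shared neighbors in the pair's own community count extra."""
    common = adj[u] & adj[v]
    if community[u] != community[v]:
        return {"ccn": len(common), "cra": 0.0}
    same = [w for w in common if community[w] == community[u]]
    return {
        "ccn": len(common) + len(same),
        "cra": sum(1 / len(adj[w]) for w in same),
    }


# Nodes 2 and 4 are not connected yet but share neighbors 1 and 3.
s = link_scores(adj, 2, 4)
c = community_scores(adj, department, 2, 4)
print(s)
print(c)
```

Resource allocation and Adamic-Adar both down-weight shared neighbors that are themselves highly connected, so a common contact who emails everyone counts for less than one with only a few contacts.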

In [5]:
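import ast
# The index column holds node pairs stored as strings such as "(u, v)"; literal_eval parses them without executing arbitrary code
future_connections = pd.read_csv('Future_Connections.csv', index_col=0, converters={0: ast.literal_eval})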
def edge_predictions():
    
    # Custom code here
    forsubmission=False
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.metrics import roc_auc_score
    
    df=future_connections.copy()
    # Here we use the metrics to create a data frame of features
    cn=[len(list(nx.common_neighbors(G,u,v))) for u,v in df.index]
    jc=[p for u,v,p in nx.jaccard_coefficient(G, ebunch=df.index)]
    ra=[p for u,v,p in nx.resource_allocation_index(G, ebunch=df.index)]
    aa=[p for u,v,p in nx.adamic_adar_index(G, ebunch=df.index)]
    pa=[p for u,v,p in nx.preferential_attachment(G, ebunch=df.index)]
    ccn=[p for u,v,p in nx.cn_soundarajan_hopcroft(G,ebunch=df.index,community='Department')]
    cra=[p for u,v,p in nx.ra_index_soundarajan_hopcroft(G,ebunch=df.index,community='Department')]

    df['cn']=cn
    df['jc']=jc
    df['ra']=ra
    df['aa']=aa
    df['pa']=pa
    df['ccn']=ccn
    df['cra']=cra

    df_val=df[df['Future Connection'].isnull()]
    df2=df[~df['Future Connection'].isnull()]
    X_val=df_val.drop(['Future Connection'], axis=1)

    X=df2[['cn', 'jc', 'ra', 'aa', 'pa', 'ccn', 'cra']]    
    y=df2['Future Connection']
    X_train, X_test, y_train, y_test = (train_test_split(X,y,random_state = 0))

    # Transform the data so all the features are on the same scale
    scaler = StandardScaler().fit(X_train)
    std_X_train = scaler.transform(X_train)
    std_X_test = scaler.transform(X_test)
    std_X_val = scaler.transform(X_val)

    # we'll use a grid search so our model has the best hyperparameters
    lr = LogisticRegression()
    list_grid = {'C':[0.01, 0.1, 1, 10, 100]} 
    grid_lr_acc = GridSearchCV(lr, param_grid = list_grid, scoring = 'roc_auc')
    grid_lr_acc.fit(std_X_train, y_train)
    if forsubmission==False:
        print('Test set AUC: ', roc_auc_score(y_test, grid_lr_acc.predict_proba(std_X_test)[:,1]))

    return_series=pd.Series(grid_lr_acc.predict_proba(std_X_val)[:,1], index=X_val.index)
    return "In the assignment we returned the variable return_series for automatic grading"

In [6]:
edge_predictions()

Test set AUC:  0.9126945905047377


'In the assignment we returned the variable return_series for automatic grading'

The automatic grading reported these for the validation predictions:  
`For the new connections predictions your AUC 0.9136392864156448`

### Conclusion
So when everything is said and done I did better on the validation data than I did on the train/test split. It really was the feature creation that made a difference, as most algorithms came up with very close scores.