---

_You are currently looking at **version 1.2** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-social-network-analysis/resources/yPcBs) course resource._

---

# Assignment 4

In [1]:
import networkx as nx
import pandas as pd
import numpy as np
import pickle

---

## Part 1 - Random Graph Identification

For the first part of this assignment you will analyze randomly generated graphs and determine which algorithm created them.

In [2]:
P1_Graphs = pickle.load(open('A4_graphs','rb'))
P1_Graphs

[<networkx.classes.graph.Graph at 0x7ffacc0f0438>,
 <networkx.classes.graph.Graph at 0x7ffa952024a8>,
 <networkx.classes.graph.Graph at 0x7ffa952024e0>,
 <networkx.classes.graph.Graph at 0x7ffa95202518>,
 <networkx.classes.graph.Graph at 0x7ffa95202550>]

<br>
`P1_Graphs` is a list containing 5 networkx graphs. Each of these graphs were generated by one of three possible algorithms:
* Preferential Attachment (`'PA'`)
* Small World with low probability of rewiring (`'SW_L'`)
* Small World with high probability of rewiring (`'SW_H'`)

Anaylze each of the 5 graphs and determine which of the three algorithms generated the graph.

*The `graph_identification` function should return a list of length 5 where each element in the list is either `'PA'`, `'SW_L'`, or `'SW_H'`.*

In [3]:
def graph_identification():
    
    import matplotlib.pyplot as plt
    '''
    for i in P1_Graphs:
        degrees = i.degree()
        degree_values = sorted(set(degrees.values()))
        histogram = [list(degrees.values()).count(j)/float(nx.number_of_nodes(i)) for j in degree_values]
        plt.bar(degree_values, histogram)
        plt.xlabel('Degree')
        plt.ylabel('Fraction of Nodes')
        plt.show()
    '''
    return ['PA', 'SW_L', 'SW_L', 'PA', 'SW_H']

---

## Part 2 - Company Emails

For the second part of this assignment you will be workking with a company's email network where each node corresponds to a person at the company, and each edge indicates that at least one email has been sent between two people.

The network also contains the node attributes `Department` and `ManagementSalary`.

`Department` indicates the department in the company which the person belongs to, and `ManagementSalary` indicates whether that person is receiving a management position salary.

In [15]:
G = nx.read_gpickle('email_prediction.txt')

print(nx.info(G))

Name: 
Type: Graph
Number of nodes: 1005
Number of edges: 16706
Average degree:  33.2458


### Part 2A - Salary Prediction

Using network `G`, identify the people in the network with missing values for the node attribute `ManagementSalary` and predict whether or not these individuals are receiving a management position salary.

To accomplish this, you will need to create a matrix of node features using networkx, train a sklearn classifier on nodes that have `ManagementSalary` data, and predict a probability of the node receiving a management salary for nodes where `ManagementSalary` is missing.



Your predictions will need to be given as the probability that the corresponding employee is receiving a management position salary.

The evaluation metric for this assignment is the Area Under the ROC Curve (AUC).

Your grade will be based on the AUC score computed for your classifier. A model which with an AUC of 0.88 or higher will receive full points, and with an AUC of 0.82 or higher will pass (get 80% of the full points).

Using your trained classifier, return a series of length 252 with the data being the probability of receiving management salary, and the index being the node id.

    Example:
    
        1       1.0
        2       0.0
        5       0.8
        8       1.0
            ...
        996     0.7
        1000    0.5
        1001    0.0
        Length: 252, dtype: float64

In [20]:
def salary_predictions():
    
    manage_dict = nx.get_node_attributes(G, 'ManagementSalary')
    depart_dict = nx.get_node_attributes(G, 'Department')
    degree_dict = G.degree()
    degcent_dict = nx.degree_centrality(G)
    closecent_dict = nx.closeness_centrality(G)
    betcen_dict = nx.betweenness_centrality(G, normalized=True, endpoints=False)
    
    aver_neigh_deg = nx.average_neighbor_degree(G)
    cluster = nx.clustering(G)
    pagerank = nx.pagerank(G)
    eigen_cent = nx.eigenvector_centrality(G)
    #curr_flow = nx.current_flow_betweenness_centrality(G)

    df1 = pd.DataFrame(list(manage_dict.values()), columns=['ManagementSalary'], index=list(manage_dict.keys()))
    df2 = pd.DataFrame(list(depart_dict.values()), columns=['Department'], index=list(depart_dict.keys())).fillna(0)
    df3 = pd.DataFrame(list(degree_dict.values()), columns=['Degree'], index=list(degree_dict.keys())).fillna(0)
    df4 = pd.DataFrame(list(degcent_dict.values()), columns=['DegreeCentrality'], index=list(degcent_dict.keys())).fillna(0)
    df5 = pd.DataFrame(list(closecent_dict.values()), columns=['ClosebessCentrality'], index=list(closecent_dict.keys())).fillna(0)
    df6 = pd.DataFrame(list(betcen_dict.values()), columns=['BetweennessCentrality'], index=list(betcen_dict.keys())).fillna(0)
    
    df7 = pd.DataFrame(list(aver_neigh_deg.values()), columns=['AverageNearestNeighbor'], index=list(aver_neigh_deg.keys())).fillna(0)
    df8 = pd.DataFrame(list(cluster.values()), columns=['Clustering'], index=list(cluster.keys())).fillna(0)
    df9 = pd.DataFrame(list(pagerank.values()), columns=['PageRank'], index=list(pagerank.keys())).fillna(0)
    df10 = pd.DataFrame(list(eigen_cent.values()), columns=['EigenvectorCentrality'], index=list(eigen_cent.keys())).fillna(0)
    #df10 = pd.DataFrame(list(curr_flow.values()), columns=['CurrentFlowBetweennessCentrality'], index=list(curr_flow.keys())).fillna(0)
    
    df = df1.merge(df2, right_index=True, left_index=True).merge(df3, right_index=True, left_index=True).merge(df4, right_index=True, left_index=True).merge(df5, right_index=True, left_index=True).merge(df6, right_index=True, left_index=True).merge(df7, right_index=True, left_index=True).merge(df8, right_index=True, left_index=True).merge(df9, right_index=True, left_index=True).merge(df10, right_index=True, left_index=True)
    
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split
    from sklearn import preprocessing
    
    #df = pd.concat([df, pd.get_dummies(df['Department'], prefix="Department")], axis=1)
    #df.drop(['Department'], axis=1, inplace=True)
    
    train = df[(df['ManagementSalary'] == 0.0) | (df['ManagementSalary'] == 1.0)]
    test = df[(df['ManagementSalary'] != 0.0) & (df['ManagementSalary'] != 1.0)]
    
    X_train = train.drop(['ManagementSalary'], axis=1)
    y_train = train['ManagementSalary']
    X_train_norm = preprocessing.normalize(X_train)
    
    X_test = test.drop(['ManagementSalary'], axis=1)
    X_test_norm = preprocessing.normalize(X_test)
    
    clf = GradientBoostingClassifier(random_state=1234).fit(X_train_norm, y_train)
    scores = list(clf.predict_proba(X_test_norm)[:,1])
    labels = list(X_test.index)

    
    df_score = pd.DataFrame(scores, index=labels)
    
    return df_score

salary_predictions()

{0: 0.018708925446578928,
 1: 0.029163385482839088,
 2: 0.05309639118027564,
 3: 0.04692952551307305,
 4: 0.060522099541649374,
 5: 0.07946339705539311,
 6: 0.05061187724622949,
 7: 0.02288648776924591,
 8: 0.0156789806240652,
 9: 0.006950516462514145,
 10: 0.034893781871262256,
 11: 0.03155240516519104,
 12: 0.02972748506709673,
 13: 0.08569308559783256,
 14: 0.036773360432653085,
 15: 0.03100633041316911,
 16: 0.05662923469347452,
 17: 0.0758424911452088,
 18: 0.03829034389393094,
 19: 0.04193967084315033,
 20: 0.050472514081514576,
 21: 0.09830933449912932,
 22: 0.007635686441548312,
 23: 0.05850266617560135,
 24: 0.017834163310614676,
 25: 0.01934440203123387,
 26: 0.013949802391356026,
 27: 0.03239052749370683,
 28: 0.08305522664642136,
 29: 0.0348107916918594,
 30: 0.0471905913015998,
 31: 0.03398784746938911,
 32: 0.017487003693880464,
 33: 0.01569891103931655,
 34: 0.014054383244498023,
 35: 0.04128923518679919,
 36: 0.03503183463669878,
 37: 0.014162042057298018,
 38: 0.021492

### Part 2B - New Connections Prediction

For the last part of this assignment, you will predict future connections between employees of the network. The future connections information has been loaded into the variable `future_connections`. The index is a tuple indicating a pair of nodes that currently do not have a connection, and the `Future Connection` column indicates if an edge between those two nodes will exist in the future, where a value of 1.0 indicates a future connection.

In [21]:
future_connections = pd.read_csv('Future_Connections.csv', index_col=0, converters={0: eval})
future_connections.head(10)

Unnamed: 0,Future Connection
"(6, 840)",0.0
"(4, 197)",0.0
"(620, 979)",0.0
"(519, 872)",0.0
"(382, 423)",0.0
"(97, 226)",1.0
"(349, 905)",0.0
"(429, 860)",0.0
"(309, 989)",0.0
"(468, 880)",0.0


Using network `G` and `future_connections`, identify the edges in `future_connections` with missing values and predict whether or not these edges will have a future connection.

To accomplish this, you will need to create a matrix of features for the edges found in `future_connections` using networkx, train a sklearn classifier on those edges in `future_connections` that have `Future Connection` data, and predict a probability of the edge being a future connection for those edges in `future_connections` where `Future Connection` is missing.



Your predictions will need to be given as the probability of the corresponding edge being a future connection.

The evaluation metric for this assignment is the Area Under the ROC Curve (AUC).

Your grade will be based on the AUC score computed for your classifier. A model which with an AUC of 0.88 or higher will receive full points, and with an AUC of 0.82 or higher will pass (get 80% of the full points).

Using your trained classifier, return a series of length 122112 with the data being the probability of the edge being a future connection, and the index being the edge as represented by a tuple of nodes.

    Example:
    
        (107, 348)    0.35
        (542, 751)    0.40
        (20, 426)     0.55
        (50, 989)     0.35
                  ...
        (939, 940)    0.15
        (555, 905)    0.35
        (75, 101)     0.65
        Length: 122112, dtype: float64

In [31]:
def new_connections_predictions():
    
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import MinMaxScaler

    depart_dict = nx.get_node_attributes(G, 'Department')
    
    df_depart = pd.DataFrame(list(depart_dict.values()), columns=['Department'], index=list(depart_dict.keys())).fillna(0)
    
    pref_att = nx.preferential_attachment(G, list(future_connections.index))
    jacc = nx.jaccard_coefficient(G, future_connections.index)
    res_all = nx.resource_allocation_index(G, future_connections.index)
    
    future_connections['Start'] = [u for (u, v, p) in pref_att]
    future_connections['Finish'] = [v for (u, v, p) in pref_att]
    future_connections['PreferentialAttachment'] = [p for (u, v, p) in pref_att]
    future_connections['JaccardCoefficent'] = [p for (u, v, p) in jacc]
    future_connections['ResourceAllocation'] = [p for (u, v, p) in res_all]
    
    train = future_connections[(future_connections['Future Connection'] == 0.0) | (future_connections['Future Connection'] == 1.0)]
    test = future_connections[(future_connections['Future Connection'] != 0.0) & (future_connections['Future Connection'] != 1.0)]
    
    scaler = MinMaxScaler()
    
    X_train = train.drop(['Future Connection'], axis=1)
    y_train = train['Future Connection']
    X_train_norm = scaler.fit_transform(X_train)
    
    X_test = test.drop(['Future Connection'], axis=1)
    X_test_norm = scaler.fit_transform(X_test)
    
    clf = RandomForestClassifier(random_state=1234).fit(X_train_norm, y_train)
    scores = list(clf.predict_proba(X_test_norm)[:,1])
    labels = list(X_test.index)

    
    df_score = pd.DataFrame(scores, index=labels)
    
    return df_score

new_connections_predictions()

Unnamed: 0,0
"(107, 348)",0.0
"(542, 751)",0.0
"(20, 426)",1.0
"(50, 989)",0.0
"(942, 986)",0.0
"(324, 857)",0.0
"(13, 710)",0.0
"(19, 271)",0.0
"(319, 878)",0.0
"(659, 707)",0.0
