In [1]:
import networkx as nx
import pandas as pd
import numpy as np
import pickle

Analyze randomly generated graphs and determine which algorithm created them.

In [7]:
P1_Graphs = pickle.load(open('A4_graphs','rb'))
P1_Graphs

[<networkx.classes.graph.Graph at 0x7fb7f5ec4588>,
 <networkx.classes.graph.Graph at 0x7fb7f5ec4668>,
 <networkx.classes.graph.Graph at 0x7fb7f5ec4940>,
 <networkx.classes.graph.Graph at 0x7fb81c1a75f8>,
 <networkx.classes.graph.Graph at 0x7fb834dc3240>]

`P1_Graphs` is a list containing 5 networkx graphs. Each of these graphs was generated by one of three possible algorithms:
* Preferential Attachment (`'PA'`)
* Small World with low probability of rewiring (`'SW_L'`)
* Small World with high probability of rewiring (`'SW_H'`)

Analyze each of the 5 graphs and determine which of the three algorithms generated it.
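One way to tell the three generators apart is by their structural signatures: preferential attachment produces a few very high-degree hubs, while small-world graphs keep degrees nearly uniform, with clustering that falls as the rewiring probability rises. The sketch below is a heuristic under those assumptions. The function name `identify_generator`, both thresholds, and the seeded stand-in graphs are illustrative choices rather than anything given in the assignment, and the code assumes networkx 2.x or later.

```python
import networkx as nx

def identify_generator(graph, hub_ratio=5.0, clustering_cutoff=0.1):
    """Guess 'PA', 'SW_L', or 'SW_H' from degree spread and clustering.

    hub_ratio and clustering_cutoff are assumed thresholds, chosen for illustration.
    """
    degrees = [degree for _, degree in graph.degree()]
    mean_degree = sum(degrees) / len(degrees)
    # Preferential attachment: the largest hub far exceeds the mean degree.
    if max(degrees) > hub_ratio * mean_degree:
        return 'PA'
    # Small world: little rewiring preserves the high clustering of the ring lattice.
    if nx.average_clustering(graph) > clustering_cutoff:
        return 'SW_L'
    return 'SW_H'

# Stand-in graphs built with the same generators (seeded for repeatability).
samples = [
    nx.barabasi_albert_graph(1000, 2, seed=0),
    nx.watts_strogatz_graph(1000, 10, 0.05, seed=0),
    nx.watts_strogatz_graph(750, 4, 1, seed=0),
]
guesses = [identify_generator(g) for g in samples]
print(guesses)
```

The thresholds would need checking against the actual graphs; plotting the degree distribution is a useful sanity check before trusting a single cutoff.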

In [8]:
def graph_identification():
    return ['PA', 'SW_L', 'SW_L', 'PA', 'SW_H']

In [12]:
for graph in P1_Graphs:
    print(graph)
    # The printed graph names reveal the generator and parameters used to build each graph

barabasi_albert_graph(1000,2)
watts_strogatz_graph(1000,10,0.05)
watts_strogatz_graph(750,5,0.075)
barabasi_albert_graph(750,4)
watts_strogatz_graph(750,4,1)


This part works with a company's email network, where each node corresponds to a person at the company and each edge indicates that at least one email has been sent between two people.

The network also contains the node attributes `Department` and `ManagementSalary`.

`Department` indicates the department in the company which the person belongs to, and `ManagementSalary` indicates whether that person is receiving a management position salary.
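As a minimal sketch of how such node attributes can be pulled into a table, the example below builds a small stand-in graph and lists the nodes whose `ManagementSalary` is missing. The graph `toy`, its attribute values, and the use of NaN to mark a missing value are assumptions made for illustration.

```python
import networkx as nx
import numpy as np
import pandas as pd

# Hypothetical three-person network; node 1 has no known salary flag.
toy = nx.Graph()
toy.add_node(0, Department=1, ManagementSalary=0.0)
toy.add_node(1, Department=1, ManagementSalary=np.nan)
toy.add_node(2, Department=2, ManagementSalary=1.0)
toy.add_edges_from([(0, 1), (1, 2)])

# One row per node, one column per attribute.
attrs = pd.DataFrame.from_dict(dict(toy.nodes(data=True)), orient='index')

# Nodes whose label is missing are the ones that need a prediction.
unlabeled = attrs.index[attrs['ManagementSalary'].isnull()].tolist()
print(unlabeled)  # [1]
```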

In [13]:
G = nx.read_gpickle('email_prediction.txt')
print(nx.info(G))

Name: 
Type: Graph
Number of nodes: 1005
Number of edges: 16706
Average degree:  33.2458


Salary Prediction
Using network `G`, identify the people in the network with missing values for the node attribute `ManagementSalary` and predict whether or not these individuals are receiving a management position salary.

Your predictions will need to be given as the probability that the corresponding employee is receiving a management position salary.

The evaluation metric for this project is the Area Under the ROC Curve (AUC).
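AUC depends only on how the predicted probabilities rank the examples: it equals the fraction of (positive, negative) pairs in which the positive example receives the higher score. The short sketch below checks this on four made-up labels and scores, which are illustrative values and not data from the email network.

```python
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted probabilities.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Every (positive score, negative score) pair.
pairs = [(p, n)
         for p, label_p in zip(y_score, y_true) if label_p == 1
         for n, label_n in zip(y_score, y_true) if label_n == 0]
pairwise_auc = sum(p > n for p, n in pairs) / len(pairs)

auc = roc_auc_score(y_true, y_score)
print(auc, pairwise_auc)  # 0.75 0.75
```

Because only the ranking matters, returning well-ordered probabilities is more important for this metric than returning hard 0/1 labels.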



Using your trained classifier, return a series of length 252 with the data being the probability of receiving management salary, and the index being the node id.

    Example:
    
        1       1.0
        2       0.0
        5       0.8
        8       1.0
            ...
        996     0.7
        1000    0.5
        1001    0.0
        Length: 252, dtype: float64

In [15]:
from sklearn.ensemble import RandomForestClassifier

def salary_predictions():
    # One row per node: the label plus structural features computed from G.
    df = pd.DataFrame(index=list(G.nodes()))
    df['man_sal'] = pd.Series(nx.get_node_attributes(G, 'ManagementSalary'))
    df['degree'] = pd.Series(dict(G.degree()))
    df['clustering'] = pd.Series(nx.clustering(G))
    df['degree_centrality'] = pd.Series(nx.degree_centrality(G))
    df['closeness_centrality'] = pd.Series(nx.closeness_centrality(G))
    df['betweenness_centrality'] = pd.Series(nx.betweenness_centrality(G, normalized=True))
    df['pagerank'] = pd.Series(nx.pagerank(G))

    # Labeled nodes train the model; nodes with a missing label are the ones to predict.
    # drop() returns a new frame here, which avoids pandas' SettingWithCopyWarning.
    train = df.dropna()
    final_test = df[df['man_sal'].isnull()].drop(columns=['man_sal'])
    X = train.drop(columns=['man_sal'])
    y = train['man_sal']

    # Fit on all labeled nodes, then keep the probability of the positive class.
    clf = RandomForestClassifier(n_estimators=100, max_depth=5, max_features=None, random_state=0)
    clf.fit(X, y)
    return pd.Series(clf.predict_proba(final_test)[:, 1], index=final_test.index, name='pred')

# Random forest gave an ROC AUC of about 0.92; SVM and logistic regression gave only about 0.79.
salary_predictions()


1       0.009372
2       1.000000
5       1.000000
8       0.159018
14      0.041490
18      0.002549
27      0.009216
30      0.837940
31      0.197438
34      0.019671
37      0.005211
40      0.037086
45      0.013155
54      0.285251
55      0.550340
60      0.169954
62      1.000000
65      1.000000
77      0.042830
79      0.002549
97      0.002365
101     0.000636
103     0.641604
108     0.017677
113     0.046595
122     0.000184
141     0.353395
142     1.000000
144     0.002365
145     0.766470
          ...   
913     0.000556
914     0.002365
915     0.000184
918     0.046113
923     0.011765
926     0.071352
931     0.000556
934     0.000184
939     0.000184
944     0.000184
945     0.011765
947     0.025130
950     0.002549
951     0.013346
953     0.000184
959     0.000184
962     0.000184
963     0.200758
968     0.023130
969     0.049378
974     0.029070
984     0.000184
987     0.060932
989     0.023130
991     0.023130
992     0.000184
994     0.000184
996     0.0001
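The model comparison mentioned in the code comment above can be outlined as a held-out evaluation: fit each candidate on a training split and score its probabilities on the remaining rows with `roc_auc_score`. The sketch below uses a synthetic stand-in graph and synthetic labels (both assumptions, not the email data), so its scores say nothing about the figures reported in this notebook.

```python
import networkx as nx
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-in network with structural features per node.
rng = np.random.default_rng(0)
toy = nx.barabasi_albert_graph(300, 3, seed=0)
features = pd.DataFrame({
    'degree': dict(toy.degree()),
    'clustering': nx.clustering(toy),
    'closeness': nx.closeness_centrality(toy),
})

# Synthetic labels: above-median degree, with 10% of labels flipped as noise.
labels = (features['degree'] > features['degree'].median()).astype(int)
flip = rng.random(len(labels)) < 0.1
labels = labels.where(~flip, 1 - labels)

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, stratify=labels, random_state=3)

models = {
    'random_forest': RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0),
    'logistic_regression': LogisticRegression(max_iter=1000),
}

# Score each model on rows it never saw during fitting.
held_out_auc = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    held_out_auc[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(held_out_auc)
```

Once a model is chosen this way, refitting it on all labeled rows before predicting the unlabeled ones makes use of every available label.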

### New Connections Prediction

For the last part of this assignment, you will predict future connections between employees of the network. The future connections information has been loaded into the variable `future_connections`. The index is a tuple indicating a pair of nodes that currently do not have a connection, and the `Future Connection` column indicates if an edge between those two nodes will exist in the future, where a value of 1.0 indicates a future connection.
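Because each index entry is stored in the CSV as the text of a tuple, it has to be parsed back into a tuple on load. One option is `ast.literal_eval`, which accepts only Python literals and therefore cannot execute arbitrary code from the file. The sketch below reads a small in-memory sample; the three rows are illustrative stand-ins for the real file.

```python
import ast
import io
import pandas as pd

# In-memory stand-in for the CSV; the last row has a missing label.
csv_text = ''',Future Connection
"(6, 840)",0.0
"(97, 226)",1.0
"(107, 348)",
'''

# Convert column 0 from text such as "(6, 840)" into a tuple of ints.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0,
                     converters={0: ast.literal_eval})

first_pair = tuple(list(sample.index)[0])
n_missing = int(sample['Future Connection'].isnull().sum())
print(first_pair, n_missing)  # (6, 840) 1
```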

In [18]:
future_connections = pd.read_csv('Future_Connections.csv', index_col=0, converters={0: eval})
future_connections.head(10)

            Future Connection
(6, 840)                  0.0
(4, 197)                  0.0
(620, 979)                0.0
(519, 872)                0.0
(382, 423)                0.0
(97, 226)                 1.0
(349, 905)                0.0
(429, 860)                0.0
(309, 989)                0.0
(468, 880)                0.0


Using network `G` and `future_connections`, identify the edges in `future_connections` with missing values and predict whether or not these edges will have a future connection.

Your predictions will need to be given as the probability of the corresponding edge being a future connection.

The evaluation metric is the Area Under the ROC Curve (AUC).

Using your trained classifier, return a series of length 122112 with the data being the probability of the edge being a future connection, and the index being the edge as represented by a tuple of nodes.

    Example:
    
        (107, 348)    0.35
        (542, 751)    0.40
        (20, 426)     0.55
        (50, 989)     0.35
                  ...
        (939, 940)    0.15
        (555, 905)    0.35
        (75, 101)     0.65
        Length: 122112, dtype: float64
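A common way to build features for this kind of prediction is to score each unconnected pair with standard link-prediction measures. The sketch below computes four of them for one pair on a hypothetical six-edge graph (`toy` is an assumption for illustration), small enough to verify by hand: nodes 1 and 4 share neighbors 2 and 3, each of degree 3.

```python
import networkx as nx

# Hypothetical graph; nodes 1 and 4 are not connected.
toy = nx.Graph([(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5)])
pair = (1, 4)

# Number of shared neighbors.
common = len(list(nx.common_neighbors(toy, *pair)))
# Shared neighbors divided by the union of both neighborhoods.
_, _, jaccard = next(iter(nx.jaccard_coefficient(toy, [pair])))
# Sum of 1/degree over the shared neighbors.
_, _, resource = next(iter(nx.resource_allocation_index(toy, [pair])))
# Product of the two degrees.
_, _, pref = next(iter(nx.preferential_attachment(toy, [pair])))

print(common, pref)       # 2 6
print(jaccard, resource)  # both equal 2/3
```

Each networkx scoring function accepts a list of pairs and yields `(u, v, score)` triples, so the same calls scale to every candidate edge at once.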

In [21]:
from sklearn.ensemble import RandomForestClassifier

def new_connections_predictions():
    # Work on a copy so the global future_connections frame is left unchanged.
    df = future_connections.copy()
    pairs = list(df.index)

    # Link-prediction scores for every candidate pair of nodes.
    df['preferential_attachment'] = [score for _, _, score in nx.preferential_attachment(G, pairs)]
    df['Common Neighbors'] = [len(list(nx.common_neighbors(G, u, v))) for u, v in pairs]
    df['resource_allocation'] = [score for _, _, score in nx.resource_allocation_index(G, pairs)]
    df['jaccard'] = [score for _, _, score in nx.jaccard_coefficient(G, pairs)]

    # Labeled pairs train the model; pairs with a missing label are the ones to predict.
    # drop() returns a new frame here, which avoids pandas' SettingWithCopyWarning.
    train = df.dropna()
    final_test = df[df['Future Connection'].isnull()].drop(columns=['Future Connection'])
    X = train.drop(columns=['Future Connection'])
    y = train['Future Connection']

    # Fit on all labeled pairs, then keep the probability of the positive class.
    clf = RandomForestClassifier(n_estimators=100, max_depth=5, max_features=None, random_state=0)
    clf.fit(X, y)
    return pd.Series(clf.predict_proba(final_test)[:, 1], index=final_test.index, name='pred')

new_connections_predictions()


(107, 348)    0.035778
(542, 751)    0.013507
(20, 426)     0.588843
(50, 989)     0.013507
(942, 986)    0.013507
(324, 857)    0.013507
(13, 710)     0.116105
(19, 271)     0.138755
(319, 878)    0.013507
(659, 707)    0.013507
(49, 843)     0.013507
(208, 893)    0.013507
(377, 469)    0.013507
(405, 999)    0.018256
(129, 740)    0.013656
(292, 618)    0.062146
(239, 689)    0.013507
(359, 373)    0.013507
(53, 523)     0.034027
(276, 984)    0.013507
(202, 997)    0.013507
(604, 619)    0.111510
(270, 911)    0.013507
(261, 481)    0.068349
(200, 450)    0.857042
(213, 634)    0.013507
(644, 735)    0.200513
(346, 553)    0.013507
(521, 738)    0.013507
(422, 953)    0.017926
                ...   
(672, 848)    0.013507
(28, 127)     0.973219
(202, 661)    0.013507
(54, 195)     0.997839
(295, 864)    0.013507
(814, 936)    0.013507
(839, 874)    0.013507
(139, 843)    0.013507
(461, 544)    0.013507
(68, 487)     0.013507
(622, 932)    0.013507
(504, 936)    0.015163
(479, 528) 