### **Posing this problem as a classification problem:**

#### **Generating edges that are not present in the graph for supervised learning**  
- **Bad links are node pairs that have no edge in the graph and whose shortest-path length is greater than 2.**

- In our dataset, every listed pair has an edge, so we label all of them 1.
- To pose a classification problem, we also need pairs that do not have an edge, labeled 0.
- We can create such unconnected pairs by randomly sampling from all possible pairs of nodes.
- Let's call them bad edges.
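
The sampling idea above can be tried on a toy graph before running it on the full dataset. The following is a minimal sketch, not the notebook's exact code: the helper name `sample_negative_pairs`, its parameters, and the 8-node path graph are assumptions made for illustration.

```python
import random
import networkx as nx

def sample_negative_pairs(graph, n_pairs, min_distance=3, seed=0):
    """Rejection-sample ordered node pairs that are not edges and are
    either unreachable or at least `min_distance` hops apart."""
    rng = random.Random(seed)
    nodes = list(graph.nodes())
    negatives = set()
    while len(negatives) < n_pairs:
        a, b = rng.choice(nodes), rng.choice(nodes)
        if a == b or graph.has_edge(a, b):
            continue  # self-pairs and existing edges are never negatives
        try:
            far_enough = nx.shortest_path_length(graph, a, b) >= min_distance
        except nx.NetworkXNoPath:
            far_enough = True  # unreachable pairs also count as bad edges
        if far_enough:
            negatives.add((a, b))
    return negatives

# Toy directed graph (hypothetical data): 0 -> 1 -> ... -> 7
toy = nx.path_graph(8, create_using=nx.DiGraph)
negatives = sample_negative_pairs(toy, n_pairs=5)
print(sorted(negatives))
```

The loop only terminates if the graph has at least `n_pairs` eligible pairs, so on small graphs the requested count should be kept low.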

In [None]:
# Importing libraries
import warnings
warnings.filterwarnings("ignore")

import csv
import datetime  # convert to Unix time
import math
import os
import pdb
import pickle
import time  # convert to Unix time

import matplotlib
import matplotlib.pylab as plt
import numpy as np
import pandas as pd  # pandas to create small dataframes
import seaborn as sns
from matplotlib import rcParams  # size of plots
from sklearn.cluster import MiniBatchKMeans, KMeans  # clustering

!pip3 install xgboost
import xgboost as xgb

!pip3 install networkx
import networkx as nx



In [None]:
%%time
# Generating bad edges from the given graph
import random
if not os.path.isfile('/content/drive/MyDrive/Facebook/data/after_eda/missing_edges_final.p'):
    # Collect every existing edge as an (int, int) key so that lookups with the sampled ints match
    edges = dict()
    with open('/content/drive/MyDrive/Facebook/data/after_eda/train_woheader.csv', 'r') as f:
        for edge in csv.reader(f):
            edges[(int(edge[0]), int(edge[1]))] = 1

    missing_edges = set()
    while len(missing_edges) < 9437519:  # sample as many 0-labeled pairs as there are existing edges
        a = random.randint(1, 1862220)  # randomly sample two nodes a and b
        b = random.randint(1, 1862220)
        tmp = edges.get((a, b), -1)  # first check that there is no edge between them
        if tmp == -1 and a != b:
            try:
                # g is the directed graph of all edges, assumed to be built in an earlier cell
                if nx.shortest_path_length(g, source=a, target=b) > 2:  # keep the pair only if the path length is > 2
                    missing_edges.add((a, b))
            except (nx.NetworkXNoPath, nx.NodeNotFound):
                # no path from a to b (or a node is absent from g): also a valid bad edge
                missing_edges.add((a, b))
    with open('/content/drive/MyDrive/Facebook/data/after_eda/missing_edges_final.p', 'wb') as f:
        pickle.dump(missing_edges, f)
else:
    with open('/content/drive/MyDrive/Facebook/data/after_eda/missing_edges_final.p', 'rb') as f:
        missing_edges = pickle.load(f)

CPU times: user 2.04 s, sys: 861 ms, total: 2.91 s
Wall time: 2.92 s


In [None]:
len(missing_edges)

9437519

### **Training and Test data split:**  

**The real-world problem would be:**
- This graph changes over time.
- Real-world data of this kind is temporal, so the model should be built with that temporal nature in mind.
- So, in the real-world case we would use time-based splitting.

**BUT**

Our data doesn't include any temporal information.

**In our problem**
- We use random splitting, since we don't have any other option.
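
For contrast, here is a minimal sketch of what a time-based split could look like if each edge carried a creation time. The `created_at` column, the toy rows, and the helper `time_based_split` are hypothetical; nothing like them exists in this dataset.

```python
import pandas as pd

# Hypothetical edge list: `created_at` is NOT part of the real dataset.
edges = pd.DataFrame({
    "source_node": [1, 2, 3, 4, 5],
    "destination_node": [2, 3, 4, 5, 1],
    "created_at": pd.to_datetime([
        "2020-01-05", "2020-01-01", "2020-01-04", "2020-01-02", "2020-01-03",
    ]),
})

def time_based_split(df, time_col, test_size=0.2):
    """Put the oldest rows in train and the newest `test_size` share in test."""
    ordered = df.sort_values(time_col).reset_index(drop=True)
    cut = int(round(len(ordered) * (1 - test_size)))
    return ordered.iloc[:cut], ordered.iloc[cut:]

train_edges, test_edges = time_based_split(edges, "created_at")
print(len(train_edges), len(test_edges))
```

With such a split the model is only evaluated on links that formed after everything it was trained on, which is closer to how a link predictor would be used in practice.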

In [None]:
from sklearn.model_selection import train_test_split
if (not os.path.isfile('/content/drive/MyDrive/Facebook/data/after_eda/train_pos_after_eda.csv')) and (not os.path.isfile('/content/drive/MyDrive/Facebook/data/after_eda/test_pos_after_eda.csv')):
    # reading the full edge list
    df_pos = pd.read_csv('/content/drive/MyDrive/Facebook/data/train.csv')
    df_neg = pd.DataFrame(list(missing_edges), columns=['source_node', 'destination_node'])
    
    # Note: the counts printed below are node pairs (rows), not individual nodes
    print("Number of nodes in the graph with edges", df_pos.shape[0])
    print("Number of nodes in the graph without edges", df_neg.shape[0])
    
    # Train-test split
    # Split the data 80-20,
    # positive and negative links separately, because only the positive training links
    # are needed for creating the graph and for feature generation
    X_train_pos, X_test_pos, y_train_pos, y_test_pos  = train_test_split(df_pos,np.ones(len(df_pos)),test_size=0.2, random_state=9)
    X_train_neg, X_test_neg, y_train_neg, y_test_neg  = train_test_split(df_neg,np.zeros(len(df_neg)),test_size=0.2, random_state=9)
    
    print('='*60)
    print("Number of nodes in the train data graph with edges", X_train_pos.shape[0],"=",y_train_pos.shape[0])
    print("Number of nodes in the train data graph without edges", X_train_neg.shape[0],"=", y_train_neg.shape[0])
    print('='*60)
    print("Number of nodes in the test data graph with edges", X_test_pos.shape[0],"=",y_test_pos.shape[0])
    print("Number of nodes in the test data graph without edges", X_test_neg.shape[0],"=",y_test_neg.shape[0])

    #removing header and saving
    X_train_pos.to_csv('/content/drive/MyDrive/Facebook/data/after_eda/train_pos_after_eda.csv',header=False, index=False)
    X_test_pos.to_csv('/content/drive/MyDrive/Facebook/data/after_eda/test_pos_after_eda.csv',header=False, index=False)
    X_train_neg.to_csv('/content/drive/MyDrive/Facebook/data/after_eda/train_neg_after_eda.csv',header=False, index=False)
    X_test_neg.to_csv('/content/drive/MyDrive/Facebook/data/after_eda/test_neg_after_eda.csv',header=False, index=False)
else:
    # The split files already exist, so free the memory held by missing_edges
    del missing_edges

Number of nodes in the graph with edges 9437519
Number of nodes in the graph without edges 9437519
Number of nodes in the train data graph with edges 7550015 = 7550015
Number of nodes in the train data graph without edges 7550015 = 7550015
Number of nodes in the test data graph with edges 1887504 = 1887504
Number of nodes in the test data graph without edges 1887504 = 1887504


- We have a balanced dataset with the same number of class 0 and class 1 pairs.
- We split each class into train and test parts in an 80:20 ratio.
- Here we perform a random split, but with real-world data we would perform a time-based split.
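
The reported split sizes can be checked by hand. This short sketch assumes scikit-learn's rule for a float `test_size`: the test count is rounded up and the train count is whatever remains.

```python
import math

n_pairs = 9437519   # positive pairs; the same number of negative pairs was sampled
test_size = 0.2

n_test = math.ceil(n_pairs * test_size)  # test share is rounded up
n_train = n_pairs - n_test               # the remainder goes to train

print(n_train, n_test)
```

These values agree with the 7550015 train rows and 1887504 test rows printed above for each class.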

In [None]:
if (os.path.isfile('/content/drive/MyDrive/Facebook/data/after_eda/train_pos_after_eda.csv')) and (os.path.isfile('/content/drive/MyDrive/Facebook/data/after_eda/test_pos_after_eda.csv')):        
    train_graph=nx.read_edgelist('/content/drive/MyDrive/Facebook/data/after_eda/train_pos_after_eda.csv',delimiter=',',create_using=nx.DiGraph(),nodetype=int)
    test_graph=nx.read_edgelist('/content/drive/MyDrive/Facebook/data/after_eda/test_pos_after_eda.csv',delimiter=',',create_using=nx.DiGraph(),nodetype=int)
    # nx.info() was removed in NetworkX 3.0; printing the graph gives the same summary in recent versions
    print(train_graph)
    print(test_graph)

    # finding the unique nodes in both the train and test graphs
    train_nodes_pos = set(train_graph.nodes())
    test_nodes_pos = set(test_graph.nodes())

    trY_teY = len(train_nodes_pos.intersection(test_nodes_pos))
    trY_teN = len(train_nodes_pos - test_nodes_pos)
    teY_trN = len(test_nodes_pos - train_nodes_pos)

    print('no of people common in train and test -- ',trY_teY)
    print('no of people present in train but not present in test -- ',trY_teN)

    print('no of people present in test but not present in train -- ',teY_trN)
    print(' % of people not there in Train but exist in Test in total Test data are {} %'.format(teY_trN/len(test_nodes_pos)*100))

DiGraph with 1780722 nodes and 7550015 edges
DiGraph with 1144623 nodes and 1887504 edges
no of people common in train and test --  1063125
no of people present in train but not present in test --  717597
no of people present in test but not present in train --  81498
 % of people not there in Train but exist in Test in total Test data are 7.1200735962845405 %


- Roughly 7% of the people in the test data are not present in the train data. For those nodes we have no information at training time, which is known as the cold start problem.
- This shows that we have a partial cold start problem.
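
The cold-start share is just a set difference divided by the number of test nodes. The helper below is a small illustrative sketch (the name `cold_start_share` and the toy sets are assumptions), followed by the same arithmetic on the counts reported above.

```python
def cold_start_share(train_nodes, test_nodes):
    """Fraction of test nodes that never appear in the train graph."""
    test_nodes = set(test_nodes)
    unseen = test_nodes - set(train_nodes)
    return len(unseen) / len(test_nodes)

# Toy example: node 5 is the only test node missing from train.
toy_share = cold_start_share({1, 2, 3, 4}, {3, 4, 5})

# Same ratio using the counts printed by the notebook.
reported_pct = 81498 / 1144623 * 100
print(toy_share, reported_pct)
```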