*Author: Daniel Puente Viejo*  

<img src="https://cdn-icons-png.flaticon.com/512/5043/5043998.png" width="100" height="100" float ="right">    

This notebook walks through building the graph from the data that has already been analysed and cleaned. It also shows how to add new users to the graph.
We also provide a series of scripts in which the latter can be run with a single function: `graphs_management.py`
- <a href='#1'><ins>1. Loading of Libraries and Data</ins></a>
- <a href='#2'><ins>2. Split data</ins></a>
- <a href='#3'><ins>3. Graph creation</ins></a>
- <a href='#4'><ins>4. Graph save</ins></a>
- <a href='#5'><ins>5. New customer inclusion</ins></a>

### <a id='1'>1. Loading of Libraries and Data</a>
----

* Common libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle
import itertools

* Sklearn

In [2]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, recall_score, precision_score, confusion_matrix, roc_auc_score

* PyTorch

In [3]:
import torch

from torch_geometric.data import HeteroData
import torch_geometric.transforms as T
from torch_geometric.nn import Sequential, Linear, SAGEConv, to_hetero
from torch.nn import ReLU
import torch.nn.functional as F

* Paths and warnings

In [4]:
import warnings
warnings.filterwarnings("ignore")

path = "../data/eda_generated_data/"
output_path = "../data/graph_data/"

* Load data

In [5]:
def load_pickle(path, file_name):
    with open(path + file_name, 'rb') as f: return pickle.load(f)

df_train = load_pickle(path, "df_train.pkl")
df_val = load_pickle(path, "df_val.pkl")
df_test = load_pickle(path, "df_test.pkl")
scaler = load_pickle(path, "scaler.pkl")

### <a id='2'>2. Split data</a>
----

Because the full graph would be very large and complex, only 10,000 transactions are used to build it: 8,000 from the training set and 2,000 from the validation set.

In [6]:
df_train_graph, _ = train_test_split(df_train, train_size = 8000, random_state = 40, stratify = df_train['TARGET'])
df_val_graph, _ = train_test_split(df_val, train_size = 2000, random_state = 40, stratify = df_val['TARGET'])

df_graph = pd.concat([df_train_graph, df_val_graph], axis = 0).reset_index(drop = True)
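
To illustrate what the `stratify` argument does in the split above, here is a minimal standard-library sketch. It is not the scikit-learn implementation: the helper `stratified_sample` and the toy labels are hypothetical, and the per-class rounding is a simplification that may not sum exactly to the requested size for other inputs.

```python
import random
from collections import defaultdict

def stratified_sample(labels, n_samples, seed=40):
    """Return sorted row positions, sampling each class in proportion to its share."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for position, label in enumerate(labels):
        by_class[label].append(position)

    chosen = []
    for positions in by_class.values():
        # Per-class quota, rounded to the nearest integer (simplification)
        quota = round(n_samples * len(positions) / len(labels))
        chosen.extend(rng.sample(positions, quota))
    return sorted(chosen)

# Toy target with 10% positives (hypothetical data)
labels = [1] * 100 + [0] * 900
sample = stratified_sample(labels, n_samples=200)
positive_rate = sum(labels[i] for i in sample) / len(sample)
print(len(sample), positive_rate)
```

The subsample keeps the 10% positive rate of the toy target, which is the property the notebook relies on when shrinking `df_train` and `df_val`.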

### <a id='3'>3. Graph creation</a>
----

Creating relations for each feature individually would produce a huge number of relations, so the features are merged into combined features.   
The following function performs the merge.

In [7]:
def join(df, combination_list):

    new_df = pd.DataFrame()
    for i in combination_list:

        df_with_selected_columns = df[list(i)].copy()  # copy so the new column is not written to a view of df
        name = '_'.join(i)
        df_with_selected_columns[name] = df_with_selected_columns.apply(lambda x: ' '.join(x), axis=1)
        new_df = pd.concat([new_df, df_with_selected_columns[name]], axis=1)

    return new_df
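
As a small illustration of what the merge produces, the sketch below rebuilds one merged column with plain Python instead of pandas. The two rows are hypothetical; the column names come from the dataset used in this notebook.

```python
# Two hypothetical customers, described by three categorical features
rows = [
    {"NAME_INCOME_TYPE": "Working", "FLAG_OWN_CAR": "Y", "ORGANIZATION_TYPE": "Medicine"},
    {"NAME_INCOME_TYPE": "Working", "FLAG_OWN_CAR": "N", "ORGANIZATION_TYPE": "Medicine"},
]
combination = ["NAME_INCOME_TYPE", "FLAG_OWN_CAR", "ORGANIZATION_TYPE"]

# Column name: feature names joined with "_"; values: feature values joined with " "
column_name = "_".join(combination)
merged = [" ".join(row[col] for col in combination) for row in rows]

print(column_name)
print(merged)
```

Two customers are later linked by a relation only when their merged strings are identical, so these two rows would not be connected through this particular combination.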

These are the manually selected feature groups. A rename dictionary is also defined to make the column names more readable.

In [8]:
manual_selection = [['ORGANIZATION_TYPE'], ['ORGANIZATION_TYPE', 'WEEKDAY_APPR_PROCESS_START'], ['NAME_INCOME_TYPE', 'FLAG_OWN_CAR', 'ORGANIZATION_TYPE'], 
                    ['FLAG_OWN_REALTY', 'ORGANIZATION_TYPE'], ['NAME_TYPE_SUITE', 'ORGANIZATION_TYPE'], ['NAME_CONTRACT_TYPE', 'ORGANIZATION_TYPE'], 
                    ['NAME_HOUSING_TYPE', 'NAME_EDUCATION_TYPE', 'ORGANIZATION_TYPE'], ['NAME_HOUSING_TYPE', 'NAME_FAMILY_STATUS', 'ORGANIZATION_TYPE']]
                    
rename_dict = {'ORGANIZATION_TYPE': 'organization', 
               'ORGANIZATION_TYPE_WEEKDAY_APPR_PROCESS_START': 'organization_weekday',
               'NAME_INCOME_TYPE_FLAG_OWN_CAR_ORGANIZATION_TYPE': 'income_car_organization',
               'FLAG_OWN_REALTY_ORGANIZATION_TYPE': 'realty_organization',
               'NAME_TYPE_SUITE_ORGANIZATION_TYPE': 'suite_organization',
               'NAME_CONTRACT_TYPE_ORGANIZATION_TYPE': 'contract_organization',
               'NAME_HOUSING_TYPE_NAME_EDUCATION_TYPE_ORGANIZATION_TYPE': 'housing_education_organization',
               'NAME_HOUSING_TYPE_NAME_FAMILY_STATUS_ORGANIZATION_TYPE': 'housing_family_organization'}

The dataframe is created and shown

In [9]:
new_df = join(df_graph, manual_selection)
new_df.reset_index(drop=True, inplace=True)
new_df.rename(columns=rename_dict, inplace=True)

new_df.head(5)

| | organization | organization_weekday | income_car_organization | realty_organization | suite_organization | contract_organization | housing_education_organization | housing_family_organization |
|---|---|---|---|---|---|---|---|---|
| 0 | Business Entity Type 1 | Business Entity Type 1 FRIDAY | Working Y Business Entity Type 1 | N Business Entity Type 1 | Unaccompanied Business Entity Type 1 | Cash loans Business Entity Type 1 | House / apartment Higher education Business En... | House / apartment Married Business Entity Type 1 |
| 1 | Business Entity Type 3 | Business Entity Type 3 SATURDAY | Commercial associate Y Business Entity Type 3 | Y Business Entity Type 3 | Unaccompanied Business Entity Type 3 | Cash loans Business Entity Type 3 | House / apartment Higher education Business En... | House / apartment Married Business Entity Type 3 |
| 2 | Medicine | Medicine WEDNESDAY | Commercial associate N Medicine | Y Medicine | Family Medicine | Cash loans Medicine | House / apartment Secondary / secondary specia... | House / apartment Married Medicine |
| 3 | Business Entity Type 3 | Business Entity Type 3 FRIDAY | Working N Business Entity Type 3 | Y Business Entity Type 3 | Unaccompanied Business Entity Type 3 | Cash loans Business Entity Type 3 | House / apartment Secondary / secondary specia... | House / apartment Married Business Entity Type 3 |
| 4 | Self-employed | Self-employed WEDNESDAY | Working N Self-employed | N Self-employed | Unaccompanied Self-employed | Cash loans Self-employed | House / apartment Secondary / secondary specia... | House / apartment Married Self-employed |


The following functions create every relation.

In [10]:
def get_relations(new_df, feature, row):
    filtrado = new_df[feature]
    list_of_index_func = list(filtrado[filtrado == row[feature]].index)
    
    return list_of_index_func

The relations are:
* **Self-loop:** The node is connected to itself.
* **Bidirectional:** The node is connected to another node and vice versa.

In [11]:
def edge_creation(_user_list_user, new_df, test = None):
    
    if test is None: new_df_copy = new_df.copy()
    else: new_df_copy = new_df.iloc[test:].copy()
        
    for k, row in enumerate(new_df_copy.iterrows()):
        index, value = row

        for x, cols in enumerate(new_df.columns): 
            list_of_index = get_relations(new_df, cols, value)

            length_index = len(list_of_index)
            _user_list_user[cols][0] += list_of_index
            _user_list_user[cols][1] += list(np.full(length_index, index))

    edges = [torch.tensor([np.array(v[0]), np.array(v[1])], dtype = torch.long) for k, v in _user_list_user.items()]

    return edges

_user_list_user = {i:[[],[]] for i in new_df.columns}
edges = edge_creation(_user_list_user, new_df)
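
The row-by-row search above compares every row against the whole column. As a hedged alternative sketch (not the method used in this notebook), the same set of edges can be obtained by grouping row positions by value once and connecting every pair inside each group. The helper names and the toy values are hypothetical, and plain lists stand in for the dataframe column.

```python
from collections import defaultdict
from itertools import product

def edges_by_scanning(values):
    """Mirror of the row-by-row approach: for each row, find all rows with the same value."""
    sources, targets = [], []
    for target, value in enumerate(values):
        matches = [i for i, other in enumerate(values) if other == value]
        sources += matches
        targets += [target] * len(matches)
    return sources, targets

def edges_by_grouping(values):
    """Group row positions by value once, then connect every pair inside each group."""
    groups = defaultdict(list)
    for position, value in enumerate(values):
        groups[value].append(position)
    sources, targets = [], []
    for members in groups.values():
        for source, target in product(members, repeat=2):
            sources.append(source)
            targets.append(target)
    return sources, targets

values = ["Medicine", "Self-employed", "Medicine", "Medicine"]
scan = sorted(zip(*edges_by_scanning(values)))
group = sorted(zip(*edges_by_grouping(values)))
print(scan == group, len(scan))
```

Both versions already contain each node paired with itself and each pair in both directions, because every row matches itself and every row is visited as a target.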

The numeric features are selected and scaled.

In [12]:
df_graph_numeric = df_graph.select_dtypes(include=['float64', 'int64'])
df_graph_numeric_exclude = df_graph_numeric.drop(['SK_ID_CURR','TARGET'], axis=1)
df_graph_numeric_exclude_scaled = scaler.transform(df_graph_numeric_exclude)

With all the information the graph is created.

In [13]:
datas = HeteroData()

datas['users'].x = torch.from_numpy(df_graph_numeric_exclude_scaled).float()
datas['users'].y = torch.from_numpy(df_graph_numeric.TARGET.values).long()

for k, v in zip(new_df.columns, edges): datas['users', k, 'users'].edge_index = v

datas = T.ToUndirected()(datas)
datas = T.AddSelfLoops()(datas)
datas

HeteroData(
  users={
    x=[10000, 48],
    y=[10000]
  },
  (users, organization, users)={ edge_index=[2, 10581492] },
  (users, organization_weekday, users)={ edge_index=[2, 1659140] },
  (users, income_car_organization, users)={ edge_index=[2, 4373766] },
  (users, realty_organization, users)={ edge_index=[2, 6232718] },
  (users, suite_organization, users)={ edge_index=[2, 7100622] },
  (users, contract_organization, users)={ edge_index=[2, 8803078] },
  (users, housing_education_organization, users)={ edge_index=[2, 5235818] },
  (users, housing_family_organization, users)={ edge_index=[2, 3802318] }
)

Finally, the train and validation masks are created.

In [14]:
train_mask, val_mask = np.array([True] * 8000 + [False] * 2000), np.array([False] * 8000 + [True] * 2000)

datas['users'].train_mask = torch.from_numpy(train_mask).bool()
datas['users'].valid_mask = torch.from_numpy(val_mask).bool()
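
A quick sanity check, sketched here with plain Python lists rather than tensors, can confirm that the two masks are complementary before they are attached to the graph. The variable names mirror the cell above; the check itself is an addition and not part of the original notebook.

```python
n_train, n_val = 8000, 2000
train_mask = [True] * n_train + [False] * n_val
val_mask = [False] * n_train + [True] * n_val

# Every node should fall in exactly one of the two splits
is_partition = all(t != v for t, v in zip(train_mask, val_mask))
print(is_partition, sum(train_mask), sum(val_mask))
```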

### <a id='4'>4. Graph save</a>
----

In [16]:
torch.save(datas, output_path + 'training_graph.pt')

In [19]:
def save_pickle_file(file_name, file):
    with open(output_path + file_name, 'wb') as f: pickle.dump(file, f)

save_pickle_file('graph_df.pkl', new_df)

### <a id='5'>5. New customer inclusion</a>
----

The train preprocessing is applied to the test data

In [18]:
def cleaning_test(df_test):
    df_test_cleaned = df_test[df_train.columns]
    return df_test_cleaned

df_test_cleaned = cleaning_test(df_test)
df_test_cleaned = df_test_cleaned.iloc[:1000]

Numerical columns are selected and scaled.

In [19]:
df_test_cleaned_numeric = df_test_cleaned.select_dtypes(include=['float64', 'int64'])
df_test_cleaned_numeric_exclude = df_test_cleaned_numeric.drop(['SK_ID_CURR','TARGET'], axis=1)
df_test_cleaned_numeric_exclude_scaled = scaler.transform(df_test_cleaned_numeric_exclude)
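
The reason for calling `scaler.transform` here, rather than fitting a new scaler, is that new customers must be standardized with the statistics learned on the training data. The sketch below shows the idea with the standard library only. The helpers `fit_scaler` and `transform` and the income values are hypothetical stand-ins, assuming the saved scaler is a standardization scaler as the imports suggest.

```python
from statistics import mean, pstdev

def fit_scaler(column):
    """Return the (mean, population standard deviation) learned from training values."""
    return mean(column), pstdev(column)

def transform(column, params):
    """Standardize values with previously learned parameters."""
    mu, sigma = params
    return [(x - mu) / sigma for x in column]

train_income = [100.0, 200.0, 300.0]   # hypothetical training values
test_income = [200.0, 400.0]           # hypothetical new customers

params = fit_scaler(train_income)      # learned once, on training data only
scaled_test = transform(test_income, params)
print(scaled_test)
```

A test value equal to the training mean maps to zero, while values outside the training range map to correspondingly large scores, which keeps new nodes on the same scale as the existing ones.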

Scaled values are added to the graph

In [17]:
new_x_values = torch.from_numpy(df_test_cleaned_numeric_exclude_scaled).float()
datas['users'].x = torch.cat((datas['users'].x, new_x_values))

last_val = datas['users', 'organization', 'users']['edge_index'][1,-1].item()

The merged dataframe is created

In [18]:
df_test_cleaned.index = range(last_val+1, last_val+1 + len(df_test_cleaned))
df_test_cleaned_colums_agregated = join(df_test_cleaned, manual_selection)
df_test_cleaned_colums_agregated.rename(columns=rename_dict, inplace=True)

The edges are obtained

In [63]:
new_df2 = pd.concat([new_df, df_test_cleaned_colums_agregated], axis=0)
_user_list_user, length_test = {i:[[],[]] for i in new_df2.columns}, len(df_test_cleaned_colums_agregated)

new_edges = edge_creation(_user_list_user, new_df2, -length_test)
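
Passing a negative `test` offset restricts the outer loop to the new rows, while matches are still searched across all rows. A minimal sketch of that behaviour with plain lists is shown below; the helper `edges_for_new_rows` and the toy values are hypothetical.

```python
def edges_for_new_rows(values, n_new):
    """Create (source, target) pairs only for the last n_new rows, matching against all rows."""
    first_new = len(values) - n_new
    sources, targets = [], []
    for target in range(first_new, len(values)):
        matches = [i for i, other in enumerate(values) if other == values[target]]
        sources += matches
        targets += [target] * len(matches)
    return sources, targets

existing = ["Medicine", "Self-employed", "Medicine"]
new = ["Medicine"]
sources, targets = edges_for_new_rows(existing + new, n_new=len(new))
new_pairs = list(zip(sources, targets))
print(new_pairs)
```

Only edges pointing into the new node are produced, so the reverse direction from the new node back to existing nodes is still missing at this stage.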

A temporary graph is built from the new edges so that bidirectional relations can be created; its edges are then concatenated with those of the main graph.

In [68]:
datas_merge = HeteroData()

for k, v in zip(new_df2.columns, new_edges): datas_merge['users', k, 'users'].edge_index = v
datas_merge = T.ToUndirected()(datas_merge)
datas_merge = T.AddSelfLoops()(datas_merge)

Edges are merged

In [70]:
for k, v in enumerate(new_df2.columns):
    previous_relation, new_relations = datas['users', v, 'users'].edge_index, datas_merge['users', v, 'users'].edge_index
    new_relation = torch.cat((previous_relation, new_relations), dim=1)
    datas['users', v, 'users'].edge_index = new_relation

Test mask is created

In [75]:
test_new_mask = np.array([False] * (last_val+1) + [True] * len(df_test_cleaned_colums_agregated))
datas['users'].test_mask = torch.from_numpy(test_new_mask).bool()
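
Once new nodes have been appended, the train and validation masks created earlier are shorter than the node feature matrix. If downstream code indexes all nodes with those masks, they may need to be padded with `False` for the new nodes. The sketch below shows the idea with plain lists and small stand-in sizes; `pad_mask` is a hypothetical helper and this step is an assumption, not part of the original notebook.

```python
def pad_mask(mask, n_new):
    """Extend a boolean mask with False for nodes added later."""
    return mask + [False] * n_new

n_train, n_val, n_test = 8, 2, 3   # small stand-ins for 8000, 2000, 1000
train_mask = pad_mask([True] * n_train + [False] * n_val, n_test)
val_mask = pad_mask([False] * n_train + [True] * n_val, n_test)
test_mask = [False] * (n_train + n_val) + [True] * n_test

# Each node should belong to exactly one split
memberships = [t + v + s for t, v, s in zip(train_mask, val_mask, test_mask)]
print(len(train_mask), set(memberships))
```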