*Author: Daniel Puente Viejo*  

<img src="https://cdn-icons-png.flaticon.com/512/5043/5043998.png" width="100" height="100" float ="right">    

This notebook walks through building the graph from the data that has already been analysed and cleaned. It also shows how to add new customers to an existing graph.
The accompanying scripts let you run the latter step with a single function: `graphs_management.py`
- <a href='#1'><ins>1. Loading of Libraries and Data</ins></a>
- <a href='#2'><ins>2. Split data</ins></a>
- <a href='#3'><ins>3. Graph creation</ins></a>
- <a href='#4'><ins>4. Graph save</ins></a>
- <a href='#5'><ins>5. New customer inclusion</ins></a>

### <a id='1'>1. Loading of Libraries and Data</a>
----

* Common libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle
import itertools

* Sklearn

In [2]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, recall_score, precision_score, confusion_matrix, roc_auc_score

* PyTorch

In [3]:
import torch

from torch_geometric.data import HeteroData
import torch_geometric.transforms as T
from torch_geometric.nn import Sequential, Linear, SAGEConv, to_hetero
from torch.nn import ReLU
import torch.nn.functional as F


* Paths and warnings

In [4]:
import warnings
warnings.filterwarnings("ignore")

path = "../data/eda_generated_data/"
output_path = "../data/graph_data/"

def load_pickle(path, file_name):
    with open(path + file_name, 'rb') as f: return pickle.load(f)
    
def save_pickle_file(file_name, file):
    with open(output_path + file_name, 'wb') as f: pickle.dump(file, f)

* Load data

In [5]:
df_train = load_pickle(path, "df_train.pkl")
df_val = load_pickle(path, "df_val.pkl")
df_test = load_pickle(path, "df_test.pkl")
scaler = load_pickle(path, "scaler.pkl")

📚 Previous application filtering and transformation 📚

In [6]:
previous_df = pd.read_csv("../data/previous_application.csv")
df = pd.read_csv("../data/cleaning_generated_data/application_data_fraud.csv")

df_new_previous = previous_df[previous_df.SK_ID_CURR.isin(df.SK_ID_CURR)]


null_values = df_new_previous.isna().sum()/len(df_new_previous)
null_values = null_values[null_values > 0.3].index


df_new_previous = df_new_previous.drop(null_values, axis=1)
df_new_previous.dropna(inplace=True)
df_new_previous.sort_values(by=['SK_ID_CURR'], inplace=True)

numeric_previous = df_new_previous.select_dtypes(include=['float64', 'int64'])
numeric_previous_no = numeric_previous.drop(['SK_ID_PREV','AMT_CREDIT','AMT_GOODS_PRICE'], axis=1)

numeric_previous_no_pivoted = numeric_previous_no.pivot_table(index='SK_ID_CURR', aggfunc=['median', 'last', 'max', 'min'])
numeric_previous_no_pivoted.columns = ['_'.join(col).strip() for col in numeric_previous_no_pivoted.columns.values]
numeric_previous_no_pivoted.reset_index(inplace = True)

df_product_combination = (pd.get_dummies(df_new_previous[['SK_ID_CURR', 'PRODUCT_COMBINATION', 'NAME_CONTRACT_STATUS']], columns = ['PRODUCT_COMBINATION', 'NAME_CONTRACT_STATUS'])
                                        .groupby('SK_ID_CURR').sum()).reset_index()

df_previous_graph_all = pd.merge(numeric_previous_no_pivoted, df_product_combination, on='SK_ID_CURR')
save_pickle_file("df_previous_graph.pkl", df_previous_graph_all)

### <a id='2'>2. Split data</a>
----

As the full graph would become very large and complex, only a stratified sample of 21,000 applications (16,000 from the training set and 5,000 from the validation set) is used to build it.

In [7]:
df_train_graph, _ = train_test_split(df_train, train_size = 16000, random_state = 40, stratify = df_train['TARGET'])
df_val_graph, _ = train_test_split(df_val, train_size = 5000, random_state = 40, stratify = df_val['TARGET'])

df_graph = pd.concat([df_train_graph, df_val_graph], axis = 0).reset_index(drop = True)
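
The `stratify` argument keeps the share of each `TARGET` class in the sample close to its share in the full set. A minimal standard-library sketch of that idea follows. The helper name `stratified_sample` and the toy labels are hypothetical, and scikit-learn's own allocation may round differently.

```python
import random
from collections import defaultdict

def stratified_sample(labels, sample_size, seed=40):
    """Return sorted row positions whose label proportions mirror `labels`."""
    rng = random.Random(seed)

    # Group row positions by class label.
    by_label = defaultdict(list)
    for position, label in enumerate(labels):
        by_label[label].append(position)

    chosen = []
    for positions in by_label.values():
        # Proportional allocation; rounding may shift the total by a row or two.
        quota = round(sample_size * len(positions) / len(labels))
        chosen.extend(rng.sample(positions, quota))
    return sorted(chosen)

# Toy data (assumption): 10% of the rows belong to the positive class.
labels = [0] * 90 + [1] * 10
sample = stratified_sample(labels, sample_size=20)
positives = sum(labels[i] for i in sample)
```

With these toy labels the sample holds 20 rows, 2 of them positive, so the 10% positive rate is preserved.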

In [29]:
save_pickle_file("df_train_graph.pkl", df_train_graph)
save_pickle_file("df_val_graph.pkl", df_val_graph)

📚 Fitting the scaler for previous applications on the training data 📚

In [8]:
df_previous_graph_train = df_previous_graph_all[df_previous_graph_all.SK_ID_CURR.isin(df_train.SK_ID_CURR)]
scaler_previous = StandardScaler()
scaler_previous.fit(df_previous_graph_train.drop(['SK_ID_CURR'], axis=1))
save_pickle_file("scaler_previous.pkl", scaler_previous)

📚 Previous applications of the sampled train and validation customers 📚

In [9]:
df_previous_graph_train = df_previous_graph_train[df_previous_graph_train.SK_ID_CURR.isin(df_train_graph.SK_ID_CURR)]
df_previous_graph_val = df_previous_graph_all[df_previous_graph_all.SK_ID_CURR.isin(df_val_graph.SK_ID_CURR)]

df_previous_graph = pd.concat([df_previous_graph_train, df_previous_graph_val], axis = 0).reset_index(drop = True)

📚 Generating a dictionary that maps each ID to its position in the graph 📚

In [10]:
id_index_dict = df_graph.SK_ID_CURR.to_dict()
id_index_dict = {v: k for k, v in id_index_dict.items()}

id_intersection = set(df_previous_graph.SK_ID_CURR).intersection(set(id_index_dict.keys()))
dict_you_want = {key: id_index_dict[key] for key in id_intersection}
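
The mapping above can be illustrated on plain Python objects. This is only a sketch: the IDs and the names `id_to_position` and `users_with_previous` are hypothetical stand-ins for `id_index_dict` and `dict_you_want`.

```python
# Hypothetical customer IDs in the order they appear in the graph dataframe.
sk_ids = [100007, 100002, 100015, 100004]

# position -> ID (what Series.to_dict() would give), then inverted to ID -> position.
position_to_id = dict(enumerate(sk_ids))
id_to_position = {sk_id: position for position, sk_id in position_to_id.items()}

# Hypothetical IDs that have at least one previous application.
ids_with_previous = {100002, 100004, 100099}

# Keep only customers present in both the graph and the previous-application data.
shared_ids = ids_with_previous.intersection(id_to_position)
users_with_previous = {sk_id: id_to_position[sk_id] for sk_id in shared_ids}
```

Here `users_with_previous` ends up as `{100002: 1, 100004: 3}`: ID `100099` is dropped because it is not a node of the graph.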

In [23]:
save_pickle_file("dict_you_want.pkl", dict_you_want)
save_pickle_file("id_index_dict.pkl", id_index_dict)

### <a id='3'>3. Graph creation</a>
----

Creating relations from each feature individually would produce a huge number of edges, so features are merged into combined categories instead.   
The following function performs the merge.

In [11]:
def join(df, combination_list):
    # Build one merged column per feature combination, keeping the index of `df`.
    new_df = pd.DataFrame(index=df.index)
    for cols in combination_list:
        # Column name: the feature names joined by "_"
        name = '_'.join(cols)
        # Column value: the feature values of each row joined by a space
        new_df[name] = df[list(cols)].apply(lambda x: ' '.join(x), axis=1)

    return new_df
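
To make the merge concrete, the sketch below mirrors the same logic on plain dictionaries instead of a dataframe. The helper `merge_features` and the two toy rows are hypothetical and are not part of the notebook.

```python
def merge_features(rows, combinations):
    """For each row, join the values of every feature combination into one string."""
    merged = []
    for row in rows:
        merged.append({
            "_".join(cols): " ".join(row[col] for col in cols)
            for cols in combinations
        })
    return merged

# Toy rows (assumption) using two of the categorical columns from the notebook.
rows = [
    {"ORGANIZATION_TYPE": "XNA", "FLAG_OWN_REALTY": "Y"},
    {"ORGANIZATION_TYPE": "Other", "FLAG_OWN_REALTY": "N"},
]
combinations = [["ORGANIZATION_TYPE"], ["FLAG_OWN_REALTY", "ORGANIZATION_TYPE"]]

merged_rows = merge_features(rows, combinations)
```

The first row yields `"XNA"` for `ORGANIZATION_TYPE` and `"Y XNA"` for `FLAG_OWN_REALTY_ORGANIZATION_TYPE`. Two customers share an edge for a relation only when these merged strings match exactly.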

These are the feature combinations that were grouped manually. A rename dictionary is also defined to make the resulting column names more readable.

In [12]:
manual_selection = [['ORGANIZATION_TYPE'], ['ORGANIZATION_TYPE', 'WEEKDAY_APPR_PROCESS_START'], ['NAME_INCOME_TYPE', 'FLAG_OWN_CAR', 'ORGANIZATION_TYPE'], 
                    ['FLAG_OWN_REALTY', 'ORGANIZATION_TYPE'], ['NAME_TYPE_SUITE', 'ORGANIZATION_TYPE'], ['NAME_CONTRACT_TYPE', 'ORGANIZATION_TYPE'], 
                    ['NAME_HOUSING_TYPE', 'NAME_EDUCATION_TYPE', 'ORGANIZATION_TYPE'], ['NAME_HOUSING_TYPE', 'NAME_FAMILY_STATUS', 'ORGANIZATION_TYPE']]
                    
rename_dict = {'ORGANIZATION_TYPE': 'organization', 
               'ORGANIZATION_TYPE_WEEKDAY_APPR_PROCESS_START': 'organization_weekday',
               'NAME_INCOME_TYPE_FLAG_OWN_CAR_ORGANIZATION_TYPE': 'income_car_organization',
               'FLAG_OWN_REALTY_ORGANIZATION_TYPE': 'realty_organization',
               'NAME_TYPE_SUITE_ORGANIZATION_TYPE': 'suite_organization',
               'NAME_CONTRACT_TYPE_ORGANIZATION_TYPE': 'contract_organization',
               'NAME_HOUSING_TYPE_NAME_EDUCATION_TYPE_ORGANIZATION_TYPE': 'housing_education_organization',
               'NAME_HOUSING_TYPE_NAME_FAMILY_STATUS_ORGANIZATION_TYPE': 'housing_family_organization'}

The dataframe is created and shown

In [13]:
new_df = join(df_graph, manual_selection)
new_df.reset_index(drop=True, inplace=True)
new_df.rename(columns=rename_dict, inplace=True)

new_df.head(5)

| | organization | organization_weekday | income_car_organization | realty_organization | suite_organization | contract_organization | housing_education_organization | housing_family_organization |
|---|---|---|---|---|---|---|---|---|
| 0 | XNA | XNA SATURDAY | Pensioner N XNA | Y XNA | Spouse, partner XNA | Cash loans XNA | House / apartment Secondary / secondary specia... | House / apartment Married XNA |
| 1 | Business Entity Type 3 | Business Entity Type 3 THURSDAY | Working N Business Entity Type 3 | Y Business Entity Type 3 | Family Business Entity Type 3 | Cash loans Business Entity Type 3 | House / apartment Secondary / secondary specia... | House / apartment Married Business Entity Type 3 |
| 2 | Business Entity Type 3 | Business Entity Type 3 THURSDAY | Working Y Business Entity Type 3 | Y Business Entity Type 3 | Family Business Entity Type 3 | Cash loans Business Entity Type 3 | House / apartment Secondary / secondary specia... | House / apartment Single / not married Busines... |
| 3 | Other | Other SATURDAY | Working N Other | Y Other | Unaccompanied Other | Revolving loans Other | House / apartment Secondary / secondary specia... | House / apartment Married Other |
| 4 | Business Entity Type 3 | Business Entity Type 3 WEDNESDAY | Commercial associate N Business Entity Type 3 | Y Business Entity Type 3 | Unaccompanied Business Entity Type 3 | Cash loans Business Entity Type 3 | House / apartment Secondary / secondary specia... | House / apartment Married Business Entity Type 3 |


The following functions create every relation.

In [14]:
def get_relations(new_df, feature, row):
    filtrado = new_df[feature]
    list_of_index_func = list(filtrado[filtrado == row[feature]].index)
    
    return list_of_index_func

The relations are:
* **Self-loop:** The node is connected to itself.
* **Bidirectional:** The node is connected to another node and vice versa.
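
As a rough illustration of these two relation types, the sketch below applies them to a tiny edge list of `(source, target)` pairs for a single node type. The helpers `to_undirected` and `add_self_loops` are hypothetical. They use sets, so duplicates never arise, which may differ from how a graph library handles existing edges.

```python
def to_undirected(edges):
    """Add the reverse of every edge so each connection works in both directions."""
    return sorted(set(edges) | {(dst, src) for src, dst in edges})

def add_self_loops(edges, num_nodes):
    """Connect every node to itself."""
    return sorted(set(edges) | {(node, node) for node in range(num_nodes)})

# Toy directed edges (assumption) over three nodes.
edges = [(0, 1), (1, 2)]

bidirectional = to_undirected(edges)
with_loops = add_self_loops(bidirectional, num_nodes=3)
```

Starting from two directed edges, `bidirectional` holds four edges and `with_loops` adds the three self-loops `(0, 0)`, `(1, 1)` and `(2, 2)`.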

In [15]:
def edge_creation(_user_list_user, new_df, test = None):
    
    # With `test` set (a negative offset), only the last rows are used as target nodes.
    if test is None: new_df_copy = new_df.copy()
    else: new_df_copy = new_df.iloc[test:].copy()
        
    for index, value in new_df_copy.iterrows():

        for cols in new_df.columns: 
            # Indices of every row sharing this row's value for the merged feature
            list_of_index = get_relations(new_df, cols, value)

            _user_list_user[cols][0] += list_of_index
            _user_list_user[cols][1] += [index] * len(list_of_index)

    # One [2, num_edges] tensor per relation, built from a single array
    edges = [torch.tensor(np.array([v[0], v[1]]), dtype = torch.long) for v in _user_list_user.values()]

    return edges

_user_list_user = {i:[[],[]] for i in new_df.columns}
edges = edge_creation(_user_list_user, new_df)
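
The loop above links every pair of rows that share a value, so a group of n customers with the same merged category contributes n² edges (self-pairs included). That is why the edge counts grow so quickly. The sketch below shows the same result computed group by group. It is an alternative formulation under that assumption, not the notebook's implementation, and `edges_by_shared_value` is a hypothetical helper.

```python
from collections import defaultdict
from itertools import product

def edges_by_shared_value(values):
    """Return (sources, targets) linking every ordered pair of nodes with equal value."""
    # Collect the node positions belonging to each distinct value.
    groups = defaultdict(list)
    for node, value in enumerate(values):
        groups[value].append(node)

    sources, targets = [], []
    for members in groups.values():
        # All ordered pairs inside a group, including (node, node).
        for src, dst in product(members, repeat=2):
            sources.append(src)
            targets.append(dst)
    return sources, targets

# Toy column (assumption): three customers share "XNA", one has "Other".
organization = ["XNA", "Other", "XNA", "XNA"]
sources, targets = edges_by_shared_value(organization)
```

The toy column produces 3² + 1² = 10 edges. Grouping first avoids rescanning the whole column for every row, which may matter at the scale of this graph.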

In [16]:
_user_has_previous = [[],[]]
for k, v in dict_you_want.items():
    df_graph_prev_filtered = df_previous_graph[df_previous_graph.SK_ID_CURR == k]
    _user_has_previous[0].append(v)
    _user_has_previous[1].append(df_graph_prev_filtered.index[0])

edges_prev = torch.tensor([np.array(_user_has_previous[0]), np.array(_user_has_previous[1])])

📚 X values of previous application 📚

In [17]:
x_prev = scaler_previous.transform(df_previous_graph.drop(['SK_ID_CURR'], axis=1))

The numeric features are selected and scaled.

In [18]:
df_graph_numeric = df_graph.select_dtypes(include=['float64', 'int64'])
df_graph_numeric_exclude = df_graph_numeric.drop(['SK_ID_CURR','TARGET'], axis=1)
df_graph_numeric_exclude_scaled = scaler.transform(df_graph_numeric_exclude)
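
Note that the scaler is only applied here, not refitted: its mean and standard deviation were learned beforehand and are reused unchanged. The standard-library sketch below illustrates that fit-once, transform-later pattern on a single toy feature. The numbers are hypothetical, and the population standard deviation is used, which is what `StandardScaler` computes.

```python
from statistics import fmean, pstdev

# Toy training values (assumption) for one numeric feature.
train_values = [10.0, 20.0, 30.0]

# "Fit": learn the statistics from the training data only.
mean = fmean(train_values)
std = pstdev(train_values)  # population standard deviation (ddof = 0)

def standardize(values):
    """'Transform': reuse the training statistics on any later data."""
    return [(value - mean) / std for value in values]

# New rows are scaled with the training mean and std, never with their own.
new_scaled = standardize([20.0, 40.0])
```

A new value equal to the training mean maps to 0, while 40.0 lands about 2.45 standard deviations above it.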

📚 With all the information the graph is created. 📚

In [19]:
datas = HeteroData()

datas['users'].x = torch.from_numpy(df_graph_numeric_exclude_scaled).float()
datas['users'].y = torch.from_numpy(df_graph_numeric.TARGET.values).long()

for k, v in zip(new_df.columns, edges): datas['users', k, 'users'].edge_index = v

datas['previous'].x = torch.from_numpy(x_prev).float()
datas['users', 'has_previous', 'previous'].edge_index = edges_prev

datas = T.ToUndirected()(datas)
datas = T.AddSelfLoops()(datas)
datas

HeteroData(
  users={
    x=[21000, 48],
    y=[21000]
  },
  previous={ x=[20036, 48] },
  (users, organization, users)={ edge_index=[2, 46304716] },
  (users, organization_weekday, users)={ edge_index=[2, 7233784] },
  (users, income_car_organization, users)={ edge_index=[2, 19286166] },
  (users, realty_organization, users)={ edge_index=[2, 27157012] },
  (users, suite_organization, users)={ edge_index=[2, 31595802] },
  (users, contract_organization, users)={ edge_index=[2, 38976610] },
  (users, housing_education_organization, users)={ edge_index=[2, 22939114] },
  (users, housing_family_organization, users)={ edge_index=[2, 16688072] },
  (users, has_previous, previous)={ edge_index=[2, 20036] },
  (previous, rev_has_previous, users)={ edge_index=[2, 20036] }
)

Finally, the train and validation masks are created.

In [26]:
train_mask, val_mask = np.array([True] * 16000 + [False] * 5000), np.array([False] * 16000 + [True] * 5000)

datas['users'].train_mask = torch.from_numpy(train_mask).bool()
datas['users'].valid_mask = torch.from_numpy(val_mask).bool()

### <a id='4'>4. Graph save</a>
----

In [27]:
torch.save(datas, output_path + 'training_graph_prev.pt')

In [22]:
save_pickle_file('graph_df.pkl', new_df)

### <a id='5'>5. New customer inclusion</a>
----

In [21]:
datas = torch.load('../data/graph_data/training_graph_prev.pt')

The train preprocessing is applied to the test data

In [22]:
def cleaning_test(df_test):
    df_test_cleaned = df_test[df_train.columns]
    return df_test_cleaned

df_test_cleaned = cleaning_test(df_test)
df_test_cleaned = df_test_cleaned.iloc[:10000]

Numerical columns are selected and scaled.

In [23]:
df_test_cleaned_numeric = df_test_cleaned.select_dtypes(include=['float64', 'int64'])
df_test_cleaned_numeric_exclude = df_test_cleaned_numeric.drop(['SK_ID_CURR','TARGET'], axis=1)
df_test_cleaned_numeric_exclude_scaled = scaler.transform(df_test_cleaned_numeric_exclude)

Scaled values are added to the graph

In [24]:
new_x_values = torch.from_numpy(df_test_cleaned_numeric_exclude_scaled).float()
datas['users'].x = torch.cat((datas['users'].x, new_x_values))

last_val = datas['users', 'organization', 'users']['edge_index'][1,-1].item()

The merged dataframe is created

In [25]:
df_test_cleaned.index = range(last_val+1, last_val+1 + len(df_test_cleaned))
df_test_cleaned_colums_agregated = join(df_test_cleaned, manual_selection)
df_test_cleaned_colums_agregated.rename(columns=rename_dict, inplace=True)
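
New customers are appended after the existing nodes, so their row labels must continue the node numbering instead of restarting at 0. The sketch below shows that offsetting with toy sizes, together with the matching test mask built later in this section. The sizes and the customer labels are hypothetical.

```python
# Toy sizes (assumption): the graph already holds 5 user nodes, numbered 0..4.
existing_nodes = 5
last_val = existing_nodes - 1          # index of the last existing user node

new_customers = ["a", "b", "c"]        # three hypothetical new customers

# New node indices start right after the last existing node.
new_index = list(range(last_val + 1, last_val + 1 + len(new_customers)))

# Mask that is False for existing nodes and True only for the new ones.
test_mask = [False] * (last_val + 1) + [True] * len(new_customers)
```

Here the new customers receive indices 5, 6 and 7, and the mask has one entry per node in the enlarged graph.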

The edges are obtained

In [26]:
new_df2 = pd.concat([new_df, df_test_cleaned_colums_agregated], axis=0)
_user_list_user, length_test = {i:[[],[]] for i in new_df2.columns}, len(df_test_cleaned_colums_agregated)

new_edges = edge_creation(_user_list_user, new_df2, -length_test)

📚 Edges for previous application in test 📚

In [27]:
df_previous_graph_test = df_previous_graph_all[df_previous_graph_all.SK_ID_CURR.isin(df_test_cleaned.SK_ID_CURR)]
df_previous_graph_test.index = range(len(dict_you_want), len(dict_you_want) + len(df_previous_graph_test))

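# Map each test customer with previous applications to its user-node index
# (the index assigned to df_test_cleaned above), so that every edge starts at that customer's node.
test_ids_with_previous = set(df_previous_graph_test.SK_ID_CURR)
id_index_dict_test = {sk_id: idx for idx, sk_id in zip(df_test_cleaned.index, df_test_cleaned.SK_ID_CURR) if sk_id in test_ids_with_previous}
id_index_dict_new = {**id_index_dict, **id_index_dict_test}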

_user_has_previous_test = [[],[]]
for k, v in id_index_dict_test.items():
    df_graph_prev_filtered = df_previous_graph_test[df_previous_graph_test.SK_ID_CURR == k]
    _user_has_previous_test[0].append(v)
    _user_has_previous_test[1].append(df_graph_prev_filtered.index[0])

edges_prev_test = torch.tensor([np.array(_user_has_previous_test[0]), np.array(_user_has_previous_test[1])])

📚 Previous application x values in test 📚

In [28]:
new_x_prev_values = scaler_previous.transform(df_previous_graph_test.drop(['SK_ID_CURR'], axis=1))
datas['previous'].x = torch.cat((datas['previous'].x, torch.from_numpy(new_x_prev_values).float()))

📚 A temporary graph holding only the new edges is built so that bidirectional relations can be created; its edges are then concatenated with those of the main graph. 📚

In [29]:
datas_merge = HeteroData()

for k, v in zip(new_df2.columns, new_edges): datas_merge['users', k, 'users'].edge_index = v

datas_merge['users', 'has_previous', 'previous'].edge_index = edges_prev_test

datas_merge = T.ToUndirected()(datas_merge)
datas_merge = T.AddSelfLoops()(datas_merge)

📚 Edges are merged 📚

In [30]:
for k, v in enumerate(new_df2.columns):
    previous_relation, new_relations = datas['users', v, 'users'].edge_index, datas_merge['users', v, 'users'].edge_index
    new_relation = torch.cat((previous_relation, new_relations), dim=1)
    datas['users', v, 'users'].edge_index = new_relation

previous_relation, new_relations = datas['users', 'has_previous', 'previous'].edge_index, datas_merge['users', 'has_previous', 'previous'].edge_index
new_relation = torch.cat((previous_relation, new_relations), dim=1)
datas['users', 'has_previous', 'previous'].edge_index = new_relation

previous_relation_rev, new_relations_rev = datas['previous', 'rev_has_previous', 'users'].edge_index, datas_merge['previous', 'rev_has_previous', 'users'].edge_index
new_relation_rev = torch.cat((previous_relation_rev, new_relations_rev), dim=1)
datas['previous', 'rev_has_previous', 'users'].edge_index = new_relation_rev

Test mask is created

In [31]:
test_new_mask = np.array([False] * (last_val+1) + [True] * len(df_test_cleaned_colums_agregated))
datas['users'].test_mask = torch.from_numpy(test_new_mask).bool()

----

### **TRIALS** 

In [5]:
model = torch.load('../data/models/sage_heterogen_model_prev.pt')

In [35]:
model.eval()
out = model(datas.x_dict, datas.edge_index_dict)

mask = datas['users'].test_mask
pred = out['users'][mask].max(1)[1]

In [37]:
roc_auc_score(df_test.iloc[:10000].TARGET, pred), f1_score(df_test.iloc[:10000].TARGET, pred), precision_score(df_test.iloc[:10000].TARGET, pred), recall_score(df_test.iloc[:10000].TARGET, pred)

(0.6585472129231096,
 0.24780058651026396,
 0.1549511002444988,
 0.6182926829268293)

In [141]:
roc_auc_score(df_test.iloc[:10000].TARGET, pred), f1_score(df_test.iloc[:10000].TARGET, pred), precision_score(df_test.iloc[:10000].TARGET, pred), recall_score(df_test.iloc[:10000].TARGET, pred)

(0.6416281417716139,
 0.24770642201834858,
 0.1619190404797601,
 0.526829268292683)
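
Each tuple above lists, in order, ROC AUC, F1, precision and recall. Because `pred` holds hard class labels rather than probabilities, the ROC AUC here reduces to the mean of recall and specificity. As a quick consistency check, F1 is the harmonic mean of precision and recall, so it can be recomputed from the other two reported values. The helper `f1_from` is a hypothetical name used only for this check.

```python
def f1_from(precision, recall):
    """F1 score as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall copied from the two outputs above.
first_run = f1_from(0.1549511002444988, 0.6182926829268293)
second_run = f1_from(0.1619190404797601, 0.526829268292683)
```

Both values agree with the reported F1 scores (about 0.2478 and 0.2477). The two runs trade recall for precision while F1 stays essentially unchanged.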

In [1]:
import matplotlib.pyplot as plt
import torch
import torch.nn.functional as F
from captum.attr import IntegratedGradients
import pickle

import torch_geometric.transforms as T
import numpy as np
                                
def save_pickle_file(file_name, file):
    with open(output_path + file_name, 'wb') as f: pickle.dump(file, f)

def load_pickle(path, file_name):
    with open(path + file_name, 'rb') as f: return pickle.load(f)

path = "../data/graph_data/"
eda_path = "../data/eda_generated_data/"
output_path = "../data/models/"

data = torch.load(path + 'training_graph_prev.pt')
model = torch.load('../data/models/sage_heterogen_model_prev.pt')


In [2]:
from torch_geometric.explain.explainer import Explainer
from torch_geometric.explain.algorithm import GNNExplainer, base
from torch_geometric.explain import Explanation

In [3]:
from torch_geometric.utils import subgraph

In [67]:
node_mask_type = ["attributes", "object", "common_attributes"]
explanation_list = []
for i in node_mask_type:
    explainer = Explainer(
        model=model,
        algorithm=GNNExplainer(epochs=5),
        explainer_config=dict(
            explanation_type = "model",
            node_mask_type = i,
        ),
        model_config=dict(
            mode='classification',
            task_level='node',
            return_type='probs',
        ),
        
    )

    node_idx = 10
    explanation = explainer(data.x_dict, data.edge_index_dict, index=node_idx)
    explanation_list.append(explanation)

Epochs:  1
Epochs:  2
Epochs:  3
Epochs:  4
Epochs:  5
Epochs:  1
Epochs:  2
Epochs:  3
Epochs:  4
Epochs:  5
Epochs:  1
Epochs:  2
Epochs:  3
Epochs:  4
Epochs:  5


In [111]:
print(f"The node with the highest value in a feature is: {np.argmax((np.array(explanation_list[0].node_feat_mask)).max(axis=1))} with a value of {np.amax(np.array(explanation_list[0].node_feat_mask))}")
print(f"Most important node: {int(np.argmax(explanation_list[1].node_mask))}, with an importance of {float(max(explanation_list[1].node_mask))}")
print(f"Most important feature: {int(np.argmax(explanation_list[2].node_feat_mask[1]))} with an importance of {float(max(explanation_list[2].node_feat_mask[1]))}")

The node with the highest value in a feature is: 18842 with a value of 0.6036412715911865
Most important node: 3994, with an importance of 0.5811887383460999
Most important feature: 22 with an importance of 0.5546156167984009
