# Analysis 

In this notebook we take the data from the _Ego networks_ notebook and analyse it with three different methods: a multinomial logistic model, a random forest and an artificial neural network. 
<br><br>
First, we load the data and check for outliers; then we prepare and format the predictors so that each of these methods can be applied. 
<br><br>
The first step is loading the libraries. We use the standard numpy, pandas, matplotlib and seaborn for manipulating and plotting the data, and sklearn, statsmodels and tensorflow for the different analysis techniques. 

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
# Sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_validate, cross_val_predict
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.dummy import DummyClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
# Statsmodels
import statsmodels.formula.api as smf
from statsmodels.api import MNLogit



# Just to print prettier. Comment out the next two lines to see all (not important) warnings
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

## Load data

The next step is loading the .csv file from the previous notebook. We then select the columns used in the analysis, since the file contains a lot of information about the egos that is not related to the structural properties of their networks. Finally, we map the categorical columns _Subject origin_ and _Subject residence_ to a numerical encoding (the encoding of _Regime_ is currently commented out). 

In [4]:
### Read data
df_2 = pd.read_csv('Redes_2.csv')

### Drop unnecessary variables
df_2.drop('Unnamed: 0', axis=1, inplace=True)

### Take the necessary ones (explicit copy, so the assignments below do not act on a view)
df = df_2[df_2.columns[0:17]].copy()
for extra_col in ['EDUC', 'FMIG2', 'SEX', 'RELG']:
    df[extra_col] = df_2[extra_col]

### Use underscores in the column names before referring to them
### (this assumes the raw names are written with spaces)
df.columns = df.columns.str.replace(' ', '_')

### The numerical encoding
#not_apply = ['Subject_origin','Subject_residence','Regime']
not_apply = ['Subject_origin', 'Subject_residence']
diccs = []
for col in not_apply:
    uniques = list(df[col].unique())
    # Map each category to its position in the list of unique values
    diccs.append({value: code for code, value in enumerate(uniques)})
    df[col] = df[col].map(diccs[-1])

### Drop incomplete rows, then reset the datatype of the encoded columns
df.dropna(inplace=True)
df['Subject_origin'] = df['Subject_origin'].astype('int64')
df['Subject_residence'] = df['Subject_residence'].astype('int64')
#df['Regime'] = df['Regime'].astype('int64')

KeyError: 'Subject_residence'
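
The numerical encoding described above can also be sketched with `pd.factorize`, which assigns integer codes in order of first appearance, just like the dictionary-based loop. The toy frame, its values and the name `mappings` are hypothetical and only illustrate the idea; they are not taken from `Redes_2.csv`.

```python
import pandas as pd

# Hypothetical toy data; the real values come from the CSV file
toy = pd.DataFrame({
    "Subject origin": ["Morocco", "Argentina", "Morocco", "Senegal"],
    "Subject residence": ["Barcelona", "Barcelona", "Girona", "Girona"],
})

# Normalise the column names first, so that later lookups use underscores
toy.columns = toy.columns.str.replace(" ", "_")

mappings = {}
for col in ["Subject_origin", "Subject_residence"]:
    # codes: integer label per row; uniques: categories in order of appearance
    codes, uniques = pd.factorize(toy[col])
    toy[col] = codes
    mappings[col] = dict(zip(uniques, range(len(uniques))))

print(toy)
print(mappings)
```

Keeping the `mappings` dictionaries makes it possible to translate the integer codes back to the original labels later on.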

## Prepare and explore data

We take an overview of the main statistics of the data and of the properties generated in the previous notebook.

In [None]:
df

In [None]:
df.describe(include='all')

Some values of `Mu` are far out of range (min = -294), which points to divergences in the model. We mark observations greater than 100 (in absolute value) as `nan` and then drop the rows containing `nan`.

In [None]:
# Clean estimates for mu: mask values beyond +/-100, then drop incomplete rows
df.loc[df['Mu'].abs() > 100, 'Mu'] = np.nan
df.dropna(inplace=True)

## Group some nationalities in `others` group

We keep only classes with at least 50 observations. The remaining classes are merged into a single class called "Other". 

In [None]:
# There are few data on several origins
count_origins = df['Subject_origin'].value_counts()
t = 50 # threshold
# Code 10 is used as the label of the merged class (translated to 0 in the next cell)
df['Subject_origin'] = df['Subject_origin'].apply(lambda x: 10 if (count_origins[x] < t) else x)
#df['Subject_origin'].value_counts()

In [None]:
### This is just to translate the encoding to the first six integers (0 to 5)
dicc_traslation = {10:0,2:1,5:2,6:3,8:4,9:5}
dicc_final = {0:"Other",1:"Dominican",2:"PuertoRican",3:"Argentinean",4:"Moroccan",5:"Senegambian"}
df['Subject_origin'] = df['Subject_origin'].map(dicc_traslation)

### Define `predictors` for all the inference and prediction methods 

In [None]:
#predictors = ['Closeness','Clustering','Average_degree','Assortativity','Betweenness',
#             'Closeness_origin','Closeness_residence','Number_origin','Number_residence','Mu']
predictors = ['Closeness','Clustering','Average_degree','Assortativity','Betweenness',
             'Closeness_origin','Closeness_residence','Number_origin','Number_residence']
target = "Subject_origin"

### Define `train` and `test` split for the dataset

In [None]:
X = df[predictors]       # independent variables
y = df[target]

test_size = 0.20 #maybe more is needed (20% is standard though)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = test_size, random_state = 0)


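# Standard Scaler (fitted on the training set only)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Define dataframe as merge of X and y.
# A separate scaler is used here so that `sc` stays fitted on the training set.
X_scaled = pd.DataFrame(StandardScaler().fit_transform(X), columns=predictors, index=df.index)
df_str = df[target].to_frame().merge(X_scaled, left_index=True, right_index=True)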
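
As a minimal sketch of what the standardization step does, the block below reproduces it with plain numpy: the mean and standard deviation are estimated on the training rows only and then reused for the test rows. The random data and the names `X_all`, `X_tr` and `X_te` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X_all = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # hypothetical features
X_tr, X_te = X_all[:80], X_all[80:]                    # simple 80/20 split

# Statistics come from the training rows only
mean_tr = X_tr.mean(axis=0)
std_tr = X_tr.std(axis=0)  # population std (ddof=0), as StandardScaler uses

X_tr_scaled = (X_tr - mean_tr) / std_tr
X_te_scaled = (X_te - mean_tr) / std_tr  # reuse the training statistics

print(X_tr_scaled.mean(axis=0).round(6), X_tr_scaled.std(axis=0).round(6))
```

The training columns end up with zero mean and unit variance, while the test columns are only approximately standardized. This is expected: the test set must not influence the scaling.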

In [None]:
from sklearn.linear_model import LogisticRegression
log_test = LogisticRegression().fit(X_train,y_train)
predictions = log_test.predict(X_test)
# classification_report expects the true labels first, then the predictions
print(classification_report(y_test,predictions))

In [None]:
# Rows are the predictors, columns are the classes 0..5 in the order of dicc_final
coeffs = pd.DataFrame(data=log_test.coef_.T, index=predictors, columns=list(dicc_final.values()))

In [None]:
import seaborn as sns
sns.heatmap(coeffs)

# INFERENCE

At this point we turn to inference, starting with the multinomial logistic regression (MNL).
The library used for this analysis is mainly _statsmodels_; an introduction to the method can be found at this link:
https://stats.idre.ucla.edu/stata/dae/multinomiallogistic-regression/

In this part of the notebook we will prepare the variables, execute the regression and save the results. 

### Fit Multinomial Logistic Model

https://www.statsmodels.org/stable/generated/statsmodels.discrete.discrete_model.MNLogit.html

In [None]:
### Uses the list 'predictors' as independent variables
formula_predictors = ' + '.join(predictors)
target_str = target +" ~ {}"
model = MNLogit.from_formula(target_str.format(formula_predictors), df_str)
results = model.fit(maxiter=200)
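
To make the fitted model easier to read, here is a minimal numpy sketch of how a multinomial logit turns coefficients into class probabilities. It assumes the usual parametrization in which the first class is the reference (its score is fixed to zero) and the coefficient matrix has one column per non-reference class, which is the shape of `results.params`. The function name and the numbers are hypothetical.

```python
import numpy as np

def mnlogit_probabilities(X, params):
    """X: (n, p) design matrix including the intercept column.
    params: (p, J-1) coefficients, one column per non-reference class."""
    scores = X @ params                                    # (n, J-1)
    # The reference class has score 0 by construction
    scores = np.column_stack([np.zeros(len(X)), scores])   # (n, J)
    scores -= scores.max(axis=1, keepdims=True)            # numerical stability
    expo = np.exp(scores)
    return expo / expo.sum(axis=1, keepdims=True)

# Hypothetical example: intercept + 2 predictors, 3 classes
X_demo = np.array([[1.0, 0.5, -1.2],
                   [1.0, -0.3, 0.8]])
params_demo = np.array([[0.2, -0.1],
                        [1.0, 0.5],
                        [-0.7, 0.3]])
probs = mnlogit_probabilities(X_demo, params_demo)
print(probs)
```

Each coefficient is therefore a change in the log-odds of a class relative to the reference class, not relative to all other classes.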

#### Results

In [None]:
print(results.summary())

In [None]:
print('pseudo r-squared = {}'.format(np.round(results.prsquared,2)))

In [None]:
results.llr_pvalue
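
The two quantities printed above can be written out explicitly. The pseudo r-squared reported by statsmodels is McFadden's, computed from the log-likelihood of the fitted model (`llf`) and of the intercept-only model (`llnull`); `llr_pvalue` is the chi-squared p-value of the likelihood-ratio statistic. The sketch below uses hypothetical log-likelihood values and leaves out the p-value to avoid an extra dependency.

```python
def mcfadden_r2(llf, llnull):
    """McFadden's pseudo R-squared: 1 - llf / llnull."""
    return 1.0 - llf / llnull

def llr_statistic(llf, llnull):
    """Likelihood-ratio statistic comparing the fitted and the null model."""
    return 2.0 * (llf - llnull)

# Hypothetical log-likelihoods (both negative, the fitted one closer to zero)
llf_demo, llnull_demo = -120.0, -160.0
print(mcfadden_r2(llf_demo, llnull_demo))
print(llr_statistic(llf_demo, llnull_demo))
```

A pseudo r-squared is not a share of explained variance, so it should not be compared directly with the r-squared of a linear regression.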

# PREDICTION

We train a powerful non-linear (and non-parametric) machine learning classifier on the data: a Random Forest. There are many other alternatives, but tree-based methods are very powerful and there are recent techniques that help to identify the relevant predictors.

In this section we test whether this model can significantly outperform null (dummy) classifiers. If that is the case (which it is), it supports the hypothesis that the predictors carry relevant information about the nationalities of the subjects.

### Train and test with MNL regression

In [None]:

formula_predictors = ' + '.join(predictors)
model = MNLogit.from_formula(target_str.format(formula_predictors), df_str.loc[y_train.index])
results_prediction = model.fit(maxiter=200)
ypred = results_prediction.predict(df_str.loc[y_test.index])
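# The column with the highest predicted probability gives the class (labels are 0..5)
y_pred = np.asarray(ypred).argmax(axis=1)
## The accuracy is computed in the next cell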

In [None]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))

### Train and tune the model using k-fold cross-validation

In [None]:
scoring = 'accuracy' #'f1_macro' # This chooses the metric to optimise during training (there are others!)
njobs=-1                         # This is the number of CPU cores used (-1 means "all of them")
cv=5                             # the k in k-fold cross-validation
# RANDOM FOREST
print('\nFitting Random Forest\n')

rfc=RandomForestClassifier(random_state=0)
# Parameter combinations to explore
param_grid = { 
    'n_estimators': [75, 100,300,1000],
    'max_features': ['sqrt', None],  # 'sqrt' is what 'auto' meant for classifiers; 'auto' was removed in recent scikit-learn releases
    'min_samples_split' :[2,6, 10, 14],
    'max_depth' : [10, 15, 30, 50,None],
    'max_samples' : [0.5 ,0.7, None],}


CV_rfc = GridSearchCV(estimator=rfc, 
                  param_grid=param_grid, 
                  scoring = scoring,
                  verbose=0,
                  n_jobs=njobs,
                  cv= cv)
CV_rfc.fit(X_train, y_train)

print('\nRandom Forest:')
print('Best Score: ', CV_rfc.best_score_)
print('Best Params: ', CV_rfc.best_params_)
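
To get a feeling for the cost of this search, the number of model fits can be counted from the size of the grid. The sketch below only uses the number of values per hyperparameter in `param_grid` and the number of folds; the variable names are hypothetical.

```python
from math import prod

# Number of candidate values per hyperparameter in the grid above
grid_sizes = {
    "n_estimators": 4,
    "max_features": 2,
    "min_samples_split": 4,
    "max_depth": 5,
    "max_samples": 3,
}
n_folds = 5

n_candidates = prod(grid_sizes.values())  # combinations explored
n_cv_fits = n_candidates * n_folds        # one fit per combination and fold

print(n_candidates, n_cv_fits)
```

With the default `refit=True`, `GridSearchCV` adds one final fit of the best combination on the whole training set.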



### Evaluating the algorithm performance in the test set (unseen data)

In [None]:
y_pred = CV_rfc.predict(X_test)
print('Confusion Matrix:\n ', confusion_matrix(y_test,y_pred),'\n')
print(classification_report(y_test,y_pred),'\n')
print('Accuracy: {0:.2f}'.format(accuracy_score(y_test, y_pred)))
dicc_final = {0:"Other",1:"Dominican",2:"PuertoRican",3:"Argentinean",4:"Moroccan",5:"Senegambian"}

### Compare this performance with null models

In [None]:
df["Subject_origin"].value_counts()

In [None]:
#  relative prevalence of each class
rel_prev = (y.value_counts() / len(y))
print(rel_prev)

In [None]:
# Uniform Dummy Classifier (classifies randomly with p = 1/6)

# If the classifier randomly guesses: 
print('Accuracy of uniform dummy classifier: ',(((1/6) * y.value_counts()) / len(y)).sum()) # = 1/6

In [None]:
# Stratified Dummy Classifier (classifies randomly with p ~ prevalence of each class)
print('Accuracy of stratified dummy classifier: ',(rel_prev * y.value_counts()).sum() / len(y))

In [None]:
# Most frequent Dummy Classifier (classifies always in the most frequent class)
print('Accuracy of most frequent dummy classifier: ',rel_prev.max() )
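
The expected accuracies of the three null models follow directly from the class prevalences p_k: a uniform guess is right with probability 1/K, a stratified guess with probability equal to the sum of p_k squared, and the most-frequent rule with probability max p_k. The sketch below checks this on hypothetical labels, not on the data of this notebook.

```python
from collections import Counter

# Hypothetical labels: 3 classes with prevalences 0.5, 0.3 and 0.2
labels = [0] * 50 + [1] * 30 + [2] * 20
counts = Counter(labels)
n = len(labels)
prevalence = {k: c / n for k, c in counts.items()}

uniform_acc = 1 / len(counts)                              # 1/K
stratified_acc = sum(p ** 2 for p in prevalence.values())  # sum of p_k^2
most_frequent_acc = max(prevalence.values())               # max p_k

print(uniform_acc, stratified_acc, most_frequent_acc)
```

For unbalanced classes the most-frequent rule is the hardest of the three baselines to beat in terms of accuracy, which is why all three are reported.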

In [None]:
# SKLEARN versions of the dummy classifiers (to double check and for convenience methods)

dummy = "stratified"# most_frequent, stratified, uniform
dummy_clf = DummyClassifier(strategy=dummy,random_state=0) 

 

# Actual accuracy of the dummy in the same train-test split as the RF model
dummy_clf.fit(X_train, y_train)
dummy_score = dummy_clf.score(X_test, y_test)
print('Mean accuracy of null ' + dummy +' model: {0:.2f}'.format(dummy_score),'\n')
print('Mean accuracy (in test) of RF model: {0:.2f}'.format(CV_rfc.score(X_test, y_test)),'\n')




In [None]:
# Confusion matrix and report of the selected dummy classifier

y_pred_dummy = dummy_clf.predict(X_test)
print('Confusion Matrix:\n\n ',confusion_matrix(y_test,y_pred_dummy),'\n')
print(classification_report(y_test,y_pred_dummy),'\n')
print('Accuracy: {0:.2f}'.format(accuracy_score(y_test, y_pred_dummy)))


In [None]:
# Just for reference, the results of the RF Model

y_pred = CV_rfc.predict(X_test)
print('Confusion Matrix:\n\n ', confusion_matrix(y_test,y_pred),'\n')
print(classification_report(y_test,y_pred),'\n')
print('Accuracy: {0:.2f}'.format(accuracy_score(y_test, y_pred)))

In [None]:
dummy_report = pd.DataFrame(classification_report(y_test,dummy_clf.predict(X_test), output_dict= True))

rfc_report = pd.DataFrame(classification_report(y_test,CV_rfc.predict(X_test), output_dict= True))

#### Increase in prediction power (percentage with respect to null model)

i.e. 100% means twice as good

In [None]:
final_table = ((rfc_report - dummy_report)*100 / dummy_report).drop('support').round(decimals=2)
final_table
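
The entries of this table are relative gains with respect to the null model. As a minimal sketch of the formula applied to each cell (the function name and the scores are hypothetical):

```python
def relative_gain(model_score, null_score):
    """Percentage increase of model_score with respect to null_score."""
    return (model_score - null_score) * 100 / null_score

# Hypothetical scores: the model doubles the score of the null model
print(relative_gain(0.60, 0.30))
```

The ratio is undefined when the null score is zero, for example when the dummy classifier never predicts a class correctly in the test set, so such cells in the table should be read with care.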

These significant increases further support the claim that the predictors (based on ego-network properties) carry useful information to predict the countries of origin of the individuals.

## Shap Values

<ul>
  SHAP values are a tool to interpret a model, in this case our random forest. They give some intuition about how much of a prediction is attributable to each feature. 
</ul>
<ul>
A positive (negative) SHAP value indicates that the model output (in this case, the probability of belonging to a certain country) is increased (decreased) by the feature.  
</ul>
<ul>
We will use two kinds of plots. The first one is a summary plot, which shows the distribution of the SHAP values of each feature. The colour indicates the value of the feature named on the left. This plot lets us see which features contribute the most (that is, which have SHAP values of large magnitude). Features are ordered according to their overall contribution to the prediction.
</ul>
<ul>
The second kind of plot, used after the summary plot, is the dependence plot. It shows the SHAP values of one feature against the values of that feature. The colour map encodes a second feature, the one the algorithm estimates to interact most with the first. This lets us distinguish between different regimes of the coloured feature. 
</ul>
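
To make the idea of attributing a prediction to features concrete, here is a brute-force sketch of exact Shapley values for a tiny hand-written model, using a single baseline point. It is only an illustration under these assumptions: `shap.TreeExplainer` uses a much faster tree-specific algorithm and a background dataset, and the names `shapley_values` and `toy_model` are hypothetical.

```python
from itertools import permutations

def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x, relative to a single baseline point.

    Averages the marginal contribution of each feature over all orderings
    in which the features are switched from their baseline value to x.
    """
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = f(current)
        for i in order:
            current[i] = x[i]          # switch feature i to its actual value
            updated = f(current)
            phi[i] += updated - previous
            previous = updated
    return [p / len(orderings) for p in phi]

def toy_model(z):
    # One additive term and one interaction term
    return 2.0 * z[0] + z[1] * z[2]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi, sum(phi), toy_model(x) - toy_model(baseline))
```

The values sum to the difference between the prediction and the baseline output (the additivity property), and the interaction term is split equally between the two features involved.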

In [None]:
import shap
shap.__version__

In [None]:
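# Explain the model's predictions using SHAP
import shap

shap.initjs()
model = CV_rfc.best_estimator_
# check_additivity is an option of shap_values(), not of the constructor
explainer = shap.TreeExplainer(model, X_train)
shap_values = explainer.shap_values(X_train, check_additivity=False)
# Note: the plots below index shap_values[k] by class, which assumes that
# shap_values is a list with one array per class (this depends on the shap version)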


## Example of summary plot

We draw one summary plot per nationality, showing how each feature contributes to the prediction of that class.

<u>SHAP values for the Dominican</u>

In [None]:
shap.summary_plot(shap_values[1],X_train,feature_names = predictors)

<u>SHAP values for the Puerto Rican</u>

In [None]:
shap.summary_plot(shap_values[2],X_train,feature_names = predictors)

<u>SHAP values for the Argentinean</u>

In [None]:
shap.summary_plot(shap_values[3],X_train,feature_names = predictors)

<u>SHAP values for the Moroccan</u>

In [None]:
# Index 4 is the Moroccan class in dicc_final (index 5 is Senegambian)
shap.summary_plot(shap_values[4],X_train,feature_names = predictors)

<u>SHAP values for the "Other" (control) group</u>

In [None]:
shap.summary_plot(shap_values[0],X_train,feature_names = predictors)

# LIME 

<ul>
LIME (Local Interpretable Model-agnostic Explanations) is an algorithm that takes the decision function of the classifier (decision = f(features)). This function may be complex, but the algorithm fits a linear model around a single prediction, weighting the perturbed samples by their proximity to that prediction.   
</ul>
<ul>
This kind of algorithm helps us to explain single predictions.
</ul>
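
The local linear approximation described above can be sketched with numpy alone. This is a simplified illustration and not the implementation of the `lime` package: it perturbs a point with Gaussian noise, weights the samples with an exponential kernel and solves a weighted least-squares problem. The function names, the kernel and the toy black-box model are assumptions.

```python
import numpy as np

def local_linear_explanation(predict_fn, x0, n_samples=500, scale=0.5,
                             kernel_width=1.0, seed=0):
    """Fit a proximity-weighted linear model to predict_fn around x0."""
    rng = np.random.default_rng(seed)
    # Perturbed copies of the instance to explain
    Z = x0 + rng.normal(scale=scale, size=(n_samples, x0.size))
    y = predict_fn(Z)
    # Samples closer to x0 get a larger weight
    dist = np.linalg.norm(Z - x0, axis=1)
    w = np.exp(-(dist ** 2) / kernel_width ** 2)
    # Weighted least squares: intercept + one slope per feature
    A = np.column_stack([np.ones(n_samples), Z - x0])
    sw = np.sqrt(w)[:, None]
    coef, *_ = np.linalg.lstsq(A * sw, y * sw[:, 0], rcond=None)
    return coef[0], coef[1:]

def black_box(Z):
    # Hypothetical non-linear model: a logistic function of two features
    return 1.0 / (1.0 + np.exp(-(2.0 * Z[:, 0] - 1.0 * Z[:, 1])))

x0 = np.array([0.2, -0.1])
intercept, slopes = local_linear_explanation(black_box, x0)
print(intercept, slopes)
```

The signs and relative sizes of the slopes play the role of the feature weights shown by LIME for a single prediction: here the first feature pushes the output up and the second pushes it down.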

In [None]:
##Using LIME to interpret 
import lime
import lime.lime_tabular

In [None]:
explainer = lime.lime_tabular.LimeTabularExplainer(X_train, feature_names=predictors, discretize_continuous=True)

In [None]:
i = np.random.randint(0, X_test.shape[0])
exp = explainer.explain_instance(X_test[i], CV_rfc.predict_proba, num_features=3, top_labels=1)

In [None]:
exp.show_in_notebook(show_table=True, show_all=True)

## Artificial neural network

As a complementary method, we train a simple ANN to check the previous results with an independent model. We reuse the standardized train/test split defined above, then define the model and fit it to obtain a final value for the accuracy on the test set. 

In [None]:
### Import the package tensorflow
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' 
import tensorflow as tf
import logging
logging.getLogger("tensorflow").setLevel(logging.ERROR)

tf.random.set_seed(0)

In [None]:
### Define a simple ANN and fit our data
stat_accul = []
model_accul = tf.keras.Sequential([
    tf.keras.layers.Dense(70,activation="relu"),
    tf.keras.layers.Dense(70,activation="relu"),
    tf.keras.layers.Dense(6,activation="softmax")
])



###Compile the model 
model_accul.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(),
               optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
               metrics=["accuracy"])

### We train the model for 100 epochs and record the accuracy on the test set

history_accul = model_accul.fit(X_train,
                         np.array(y_train),
                         epochs=100,
                         verbose = 0)
stat_accul.append(model_accul.evaluate(X_test,np.array(y_test))[1])

## Display the final results

In [None]:
print(f"Test accuracy after one training run: {np.average(stat_accul):.2f}")
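
For reference, the output layer and the loss used by the network can be sketched in numpy: the softmax turns the six output scores into probabilities, and the sparse categorical cross-entropy is the mean negative log-probability of the true class given as an integer label. The logits below are hypothetical and use three classes to keep the example short.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    expo = np.exp(shifted)
    return expo / expo.sum(axis=1, keepdims=True)

def sparse_categorical_crossentropy(y_true, probs):
    """Mean negative log-probability assigned to the true integer labels."""
    return -np.log(probs[np.arange(len(y_true)), y_true]).mean()

# Hypothetical scores for 2 samples and 3 classes
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])
y_true_demo = np.array([0, 2])

probs_demo = softmax(logits)
loss = sparse_categorical_crossentropy(y_true_demo, probs_demo)
accuracy = (probs_demo.argmax(axis=1) == y_true_demo).mean()
print(loss, accuracy)
```

A network that outputs uniform probabilities over six classes has a loss of log(6), about 1.79, which is a useful reference when reading the training history.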

## Radar plots for regressions 

In [None]:
dicc_final

In [None]:
from sklearn.preprocessing import MinMaxScaler
# Separate name, so that the StandardScaler `sc` fitted above is not overwritten
mm_scaler = MinMaxScaler()

# Scale each coefficient to [0, 1] across the (non-reference) classes
df_polar = pd.DataFrame(mm_scaler.fit_transform(results.params.transpose())).transpose()
df_polar.columns = list(dicc_final.values())[1:]
# Drop the intercept row and label the remaining rows with the predictors
df_polar = df_polar.drop(0,axis=0).reset_index(drop=True)
df_polar.index = predictors

In [None]:
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import plotly.io as pio
pio.renderers.default = "notebook+pdf" 

categories = predictors



for col in df_polar.columns : 
    fig = go.Figure()
    fig.add_trace(go.Scatterpolar(
        r = df_polar[col].values,
        theta = categories,
        fill = "toself",
        name = col
    ))



    fig.update_layout(
      polar=dict(
        radialaxis=dict(
          visible=True,
          range=[0, 1]
        )),
      showlegend=True,
      font={"size":18}
    )
    
    # Static image export needs an image engine such as the kaleido package
    fig.write_image("Radar_"+str(col)+".jpg")
    fig.show()