<a id="0"></a> <br>
 # Table of Contents  
0. [Miscellaneous](#1)     
1. [Handling Data](#2) 
   

## General
### Download files from GitHub
Download a folder (repository): https://download-directory.github.io/ \
Download a single file: Click on file → Right click on « Raw » button → Save as
### Get information about function inputs/outputs:
Use "?". Also gives some instructions. Example: plt.scatter?

### Format for prints
- Use %d for integers, %f for floats (%.xf to specify a precision of x), %s for strings 
print("The mean is %f apples and %f oranges." % (apples, oranges)) 
- Use .format(), no need to specify data type \
print('The mean is {} apples and {} oranges'.format(apples, oranges))
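
A minimal sketch of the two styles above, with made-up values for `apples` and `oranges`. The precision specifier works the same way in both; the f-string variant is an addition that is not covered above.

```python
apples, oranges = 3.14159, 2.0

# %-style: %.2f keeps two decimals (%d would format an integer, %s a string)
old_style = "The mean is %.2f apples and %.2f oranges." % (apples, oranges)

# str.format: the precision goes after the colon inside the braces
new_style = "The mean is {:.2f} apples and {:.2f} oranges.".format(apples, oranges)

# f-string (Python 3.6+): same format spec, variables written inline
f_style = f"The mean is {apples:.2f} apples and {oranges:.2f} oranges."

print(old_style)
print(new_style)
print(f_style)
```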
--- 

# 0. Miscellaneous <a id="1"> </a> 
## 0.1 - import libraries 
Install new libraries directly from Notebook: ! pip install library_name

In [1]:
# Basic libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import json
import bz2

# Statistics
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats import diagnostic
'''
# Machine learning (regression, classification, dimensionality reduction, split, performance...)
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, LogisticRegressionCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error, auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, average_precision_score, balanced'''
# Networks
import networkx as nx
from networkx.algorithms.community.centrality import girvan_newman
from community import community_louvain
'''
from operator import itemgetter
import collections
# Turn off notebook warnings
import warnings
warnings.filterwarnings('ignore')'''



## 0.2 - Loading Dataframes

In [None]:
df = pd.read_csv('Data/file.csv', sep=',')
# pd.read_csv?
df = pd.read_json('Data/file.json')
# pd.read_json?
df = pd.read_excel('Data/file.xlsx')
# pd.read_excel?

### Separator type
The default separator is a comma ','. For other files pass sep= with, for example, tab '\t', semicolon ';', vertical bar '|' or colon ':'.
### Decimal data with commas
Example 12,6 instead of 12.6: pass decimal=',' to pd.read_csv
### Import only first 10 rows
Pass nrows=10 to pd.read_csv
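
A small sketch combining the three options. The file content is hypothetical and is read from a string with `io.StringIO` so the example runs without a file on disk.

```python
import io
import pandas as pd

# Hypothetical file content: semicolon-separated, decimal commas
raw = "name;price\napple;12,6\npear;3,25\nplum;0,5\n"

# sep=';' splits the columns, decimal=',' parses 12,6 as 12.6,
# nrows=2 keeps only the first two data rows
df_demo = pd.read_csv(io.StringIO(raw), sep=';', decimal=',', nrows=2)
print(df_demo)
print(df_demo.dtypes)
```

Without `decimal=','` the price column would be read as strings (dtype object) instead of floats.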

## 0.3 - preview Dataframe

In [None]:
df.head(5)
df.tail(5)
df.sample(5)
df.dtypes

## 0.4 - list of row and column names

In [None]:
# Get list of row names (index):
list(df.index)
# Get list of column names:
list(df.columns)

## 0.5 - Writing Data to Files

In [None]:
df.to_csv("mb.csv")
# An efficient way of storing data to disk is in binary format. Pandas supports this using Python's built-in pickle serialization.
df.to_pickle("baseball_pickle")

--- 

# 1. Handling Data <a id="2"> </a> 
## 1.1 Merging DataFrames

In [None]:
# Merging two dataframes df1, df2 on common column
df = df1.merge(df2, on='common_column')

# By default, merge performs an inner join (intersection of the two dfs).
# If a row from df1 has no match in df2, it is discarded => loss of info
# If you don't want this, do an outer join (fills NaN where a row has no match in the other df):
df = pd.merge(df1, df2, how='outer')

# ----- Concatenate -----
# axis=0 is the default: the rows of df2 are appended below the rows of df1
df = pd.concat([df1, df2], axis=0)

# Axis=1 will concatenate column-wise, but respecting the row indices of the two dfs.
df = pd.concat([df1, df2], axis=1)
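
A minimal sketch of the difference between inner and outer joins, on two made-up dataframes (`left`, `right` and their columns are illustrative names only).

```python
import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'name': ['Ann', 'Bob', 'Cy']})
right = pd.DataFrame({'id': [2, 3, 4], 'score': [10, 20, 30]})

inner = left.merge(right, on='id')               # only ids present in both
outer = left.merge(right, on='id', how='outer')  # all ids, NaN where missing

# indicator=True adds a _merge column telling where each row came from
outer_flagged = left.merge(right, on='id', how='outer', indicator=True)

# Row-wise concatenation; ignore_index=True rebuilds a clean 0..n-1 index
stacked = pd.concat([left, left], axis=0, ignore_index=True)

print(inner)
print(outer_flagged)
```

Checking the `_merge` column is a quick way to see how many rows an inner join would have discarded.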

## 1.2 selecting some columns of dataframe 


In [None]:
df[['col1','col2',...,'col(n)']].head(10)

## 1.3 dropping column and rows 
- axis=0 will act on all the ROWS, in each column
- axis=1 will act on all the COLUMNS, in each row

In [None]:
df.drop("drop_this_column", axis=1)
df.drop("drop_this_row", axis=0)
# Drop rows with at least one missing value (NaN)
df.dropna()
# Drop rows with all elements that are missing values
df.dropna(how='all')
# Drop duplicates based on id, keeping first
df.drop_duplicates(inplace=True, subset=['id'], keep='first')
# Drop duplicates based on two columns (two rows with albums by the same band with the same name)
df.drop_duplicates(inplace=True, subset=['artist','album'], keep='first')

## 1.3 - Reorganizing data: sort / group by

In [None]:
# Sort rows in increasing/decreasing order according to a column
df.sort_values("according_to_this_column", ascending=False)
# Count of unique values in column
df.column1.value_counts()
# Group rows by same column value
grouped = df.groupby("this_column")  # returns a GroupBy object; aggregate it (mean, count, ...) to get a dataframe
# Group rows by same column value and take mean of other columns in doing so
df = df.groupby(['this_column']).mean()
# Group rows by same column and take total count from other columns while doing so
df = df.groupby(['this_column']).count()
# Typically need to perform a reset index after a groupby
df = df.reset_index()
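
A short sketch of why the reset index is needed, on a made-up dataframe (`sales`, `team` and `points` are illustrative names).

```python
import pandas as pd

sales = pd.DataFrame({
    'team': ['A', 'A', 'B', 'B', 'B'],
    'points': [10, 20, 5, 15, 40],
})

# After groupby + aggregation, the grouping column becomes the index
mean_by_team = sales.groupby('team').mean()

# reset_index() turns it back into a regular column
mean_flat = mean_by_team.reset_index()

# Several aggregations at once with agg()
summary = sales.groupby('team')['points'].agg(['mean', 'count']).reset_index()
print(summary)
```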

## 1.4 - Isolate/select specific data

In [None]:
# Simple way to choose only the rows where the team column value is "Boston Celtics":
df[df.Team=='Boston Celtics']
# For several conditions:
df[(df.Team=='Boston Celtics') & (df.Year==2006)] # AND condition
df[(df.Team=='Boston Celtics') | (df.Year==2006)] # OR condition
# For several possibilities, in same column
df[(df.Team=='Boston Celtics') | (df.Team=='Tornadoes')]
df[df.Team.isin(['Boston Celtics', 'Tornadoes'])]
# ----- INDEXING ------
# Select row position by index name:
df.loc['row_name']
# Select row position by absolute index:
df.iloc[i]
# Select data based on dtypes
float_df = df.select_dtypes(include=['float64'])
# Select all columns that contain a specific word in name
# Example: Select only the {x}_onehot columns
onehot_cols = [col for col in df.columns if 'onehot' in col]

## 1.5 - replacing 

In [None]:
# In a certain column, replace Old1, Old2 and Old3 by New1, New2 and New3 respectively
df.Column_Name.replace({'Old1':'New1', 'Old2':'New2', 'Old3':'New3'})

## 1.6 - Get unique values in column 

In [None]:
# Get the number of unique values in a column:
df.Column1.nunique()
# Get all the unique values in a column:
df.Column1.unique()
# Get a count of number of elements for all unique values in column
df.topic.value_counts()
# If you have to do this for all columns:
for col in df.columns:
    print(col, df[col].nunique())
    print(df[col].unique())

## 1.7 - Apply function 

In [None]:
# One line method
data_features['adopted'] = data_features['outcome_type'].apply(lambda r: 1 if r=='Adoption' else 0)

# More detailed method, equivalent
def funct(r):
    if r == 'Adoption':
        return 1
    else:
        return 0
data_features['adopted'] = data_features['outcome_type'].apply(funct)

## 1.8 - date time conversion 

In [None]:
# Convert column to datetime format
df.date = pd.to_datetime(df.date)
# Remove hours
df['date'] = df['date'].dt.date
# Get current datetime
pd.Timestamp.now()

## 1.9 - get the values

In [None]:
df['col_name'].values 

# 2 - Visualizing Data <a id="3"> </a> 

## 2.1 -  simple plots 

In [None]:
# TIP: Use the pandas version if you need to be fast: it labels the axes automatically and figsize is easier to set

# Using PANDAS directly
df.plot(x='column1', y='column2', kind='scatter', title='title', figsize=(6,4));

# other kinds: 'line', 'bar', 'barh', 'hist', 'box', 'pie', 'scatter'
plt.xticks(rotation=90)

# Using MATPLOTLIB
plt.scatter(df.column1, df.column2)
plt.xlabel('Column1')
plt.ylabel('Column2')
plt.title('Title');
# other kinds: 'plot', 'bar', 'barh', 'hist', 'boxplot', 'pie', 'scatter', 'loglog' => ex "plt.hist()"

# For histograms, add "bins=100" option
df.column.hist(bins=100)
plt.hist(df.column, bins=100)

# If you need to compare two or more distributions depending on the value of a column, df.condition:
sns.histplot(data=df, x='column1', hue='condition')

## 2.2 - Grid of subplots 

In [None]:
# Creating a 3x4 (3 height, 4 width) grid of plots
fig, ax = plt.subplots(3, 4, figsize=(14,12), sharey=True, sharex=True)
column = list(df.columns)
for i in range(3):
    for j in range(4):
        # This is the trick: i*width + j
        ax[i,j].hist(df[column[i*4+j]], bins=100)
        ax[i,j].set_title(column[i*4+j])
# Give the whole grid of subplots a title if you want to (must come before plt.show())
fig.suptitle('Total title', fontsize=16)
plt.show()

## 2.3 -  Logplots 

In [None]:
# Histogram with y-value in log
plt.hist(df.column, bins=100, log=True)
# Log scale x and y
plt.loglog(df.column1, df.column2)

## 2.4 - Heatmaps

In [None]:
# ---- Heatmaps with two values ----

# To get the number of occurrences for each (column1,column2) pair
df_heatmap = pd.crosstab(df.column1, df.column2)
sns.heatmap(df_heatmap, annot=True) # to edit the color scale, add: vmin=0, vmax=20

# ---- Heatmaps with three values ----
# The color will represent the third variable ("values")
df_heatmap = pd.crosstab(df.column1, df.column2, values=df.column3, margins=False, aggfunc='sum')
sns.heatmap(df_heatmap, annot=True) 

## 2.5 - Plots with errorbars 

In [None]:
# Seaborn barplots compute error bars directly (by default a bootstrapped 95% CI of the mean)
sns.barplot(x="column1", y="column2", data=df)
# You might want to adjust the limits of the y-axis to see if the errorbars overlap or not.
# Otherwise compute a CI yourself, obtained by bootstrapping for example (see tutorial 2)
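
A minimal sketch of a bootstrapped confidence interval, assuming the percentile method for the mean. This is not necessarily the exact procedure of tutorial 2; `bootstrap_ci` and its parameters are hypothetical names chosen for illustration.

```python
import numpy as np

def bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample with replacement n_boot times and record each resample's mean
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    lower = np.percentile(boot_means, 100 * alpha / 2)
    upper = np.percentile(boot_means, 100 * (1 - alpha / 2))
    return lower, upper

# Made-up sample: 200 draws from a normal distribution
sample = np.random.default_rng(1).normal(loc=5.0, scale=2.0, size=200)
low, high = bootstrap_ci(sample)
print("mean = %.3f, 95%% CI = [%.3f, %.3f]" % (sample.mean(), low, high))
```

If the intervals of two groups do not overlap, the difference between their means is unlikely to be due to chance alone.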

## 2.6 - Boxplotting

In [None]:
df.boxplot(by='categdata', column='age', figsize = [5, 5], grid=True)
plt.show()

## 2.7 - grouped bar plot 

In [None]:
## adding a new column 
df = df.assign(col3= np.where((df['col1'] == 0) & (df['col2'] == 0), 1, 0))
ethnicity = df.groupby(['categcol'])[['col1','col2','col3']].sum()

## put in relative terms 
ethnicity = ethnicity.div(ethnicity.sum(axis=1), axis=0)

pl = ethnicity.plot(kind='bar', figsize=[7,5], rot=0)
pl.set_title('ethnicity')
pl.set_ylabel('participants')
pl.set_xlabel('treatment')
plt.show()

# 3 - Handling Data <a id="4"> </a> 

## 3.1 - Basic statistical data

In [None]:
# Easy way to get all at once: count, mean, std, min, max, 25%, 50%, 75%
df.describe()
# If you want something specific
df.column_name.mean()
df.column_name.median()
df.column_name.std()
df.column_name.sum()
df.column_name.min()
df.column_name.max()

## 3.2 - distribution tests

In [None]:
# Does the data come from a normal distribution?
diagnostic.kstest_normal(df.column1.values, dist='norm')
# Does the data come from an exponential distribution?
diagnostic.kstest_normal(df.column1.values, dist='exp')
# Returns: (ksstat, pvalue).
# If p-value < 0.05 => Reject the null hypothesis that the data comes from the tested distribution (normal or exponential).
# If p-value > 0.05 => Do not reject the null hypothesis.
# Run this if you want more information on outputs/inputs:
diagnostic.kstest_normal?

## 3.3 - Correlation tests

In [None]:
## NORMAL DISTRIBUTION ASSUMPTION 
stats.pearsonr(df.column1, df.column2)
# Returns: (r, p-value)
# If p-value < 0.05 => Reject null hypothesis that the variables are uncorrelated => they are correlated
# If p-value > 0.05 => Do not reject null hypothesis that the two variables are uncorrelated
# r lies between -1 and 1. The larger abs(r), the stronger the correlation. (r<0 => as x increases, y decreases).
stats.pearsonr?

## NOT NECESSARILY NORMALLY DISTRIBUTED 

stats.spearmanr(df.column1, df.column2)
# Returns: (correlation, p-value)
# Same interpretation as Pearson correlation
stats.spearmanr?

## 3.4 - Hypothesis testing (T-test)
 Two-sided test for the null hypothesis that 2 independent samples have identical average (expected) values. By default it assumes that the populations have identical variances (pass equal_var=False for Welch's t-test, which drops that assumption).

In [None]:
stats.ttest_ind(df.column1, df.column2)
# Returns: (statistic, p-value)
# The t-test quantifies difference between the means of the two samples.
# Null hypothesis: the samples are drawn from populations with the same means
# p-value < 0.05 => such a difference would be unlikely if the means were equal => reject H0 => not same mean.
# p-value > 0.05 => such a difference is not so unlikely under H0 => do not reject H0.
stats.ttest_ind?
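
To see what the statistic measures, here is a sketch of the equal-variance (pooled) t statistic computed by hand with NumPy. `pooled_t_statistic` is a hypothetical helper written for illustration; it does not compute the p-value.

```python
import numpy as np

def pooled_t_statistic(a, b):
    """t statistic for two independent samples, assuming equal variances."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n1, n2 = len(a), len(b)
    # Pooled variance: weighted average of the two sample variances (ddof=1)
    sp2 = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
    standard_error = np.sqrt(sp2 * (1 / n1 + 1 / n2))
    # Difference of the means, measured in standard errors
    return (a.mean() - b.mean()) / standard_error, n1 + n2 - 2

t_stat, dof = pooled_t_statistic([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
print("t = %.3f with %d degrees of freedom" % (t_stat, dof))
```

The further the statistic is from 0, the smaller the p-value.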

# 4 - Linear regression <a id="5"> </a> 
Equations are specified using patsy formula syntax. Important operators are:
1. ~ : Separates the left-hand side and right-hand side of a formula.
2. + : Creates a union of terms that are included in the model.
3. : : Interaction term.
4. * : a * b is short-hand for a + b + a:b , useful when you want to include all interactions between a set of variables.
- Intercepts are added by default.
- Categorical variables can be included directly by adding a term C(a).

## 4.1 - Basic Linear regression 

In [None]:
# Declare the model
mod = smf.ols(formula='time ~ C(diabetes) + C(high_blood_pressure)', data=df)
# Fit the model (finds the optimal coefficients; OLS is deterministic, so the seed only matters if other random steps follow)
np.random.seed(2)
res = mod.fit()
# Print the summary output
print(res.summary())

For each predictor (e.g. C(diabetes)) you get: coefficient, standard error of the coefficient, p-value, 95% confidence interval. A predictor is significant if p < 0.05.\
Interpretation: days at hospital = coeff(Intercept) + coeff(diabetes) * diabetes(bool, 1 if yes) + coeff(high blood pressure) * high blood pressure(bool, 1 if yes).
- People who have neither diabetes nor high blood pressure stay at the hospital on average for coeff(Intercept) days
- People who have diabetes, but don't have high blood pressure stay for coeff(Intercept) + coeff(diabetes) days
- etc.
### With interaction terms:
Adding a*b adds terms a , b and a:b at once (you get a coefficient for each).
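
A sketch of what the formula `y ~ a * b` fits, using plain NumPy least squares instead of statsmodels, on made-up data with two binary predictors. With both main effects and the interaction, the coefficients can be read directly as differences between group means.

```python
import numpy as np

# Hypothetical data: two binary predictors and an outcome (e.g. days at hospital)
a = np.array([0, 0, 1, 1, 0, 0, 1, 1])
b = np.array([0, 1, 0, 1, 0, 1, 0, 1])
y = np.array([10.0, 13.0, 12.0, 20.0, 12.0, 15.0, 14.0, 22.0])

# Design matrix for 'y ~ a * b': intercept, a, b and the interaction a:b
X = np.column_stack([np.ones_like(y), a, b, a * b])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, coef_a, coef_b, coef_ab = coef

def group_mean(a_val, b_val):
    """Mean outcome of the rows with the given values of a and b."""
    return y[(a == a_val) & (b == b_val)].mean()

print("intercept :", intercept, "(mean of group a=0, b=0:", group_mean(0, 0), ")")
print("a         :", coef_a)
print("b         :", coef_b)
print("a:b       :", coef_ab)
```

The interaction coefficient is the extra effect of having both a and b, beyond the sum of the two individual effects.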

## 4.2 - Logistic regression 

In [None]:
# Standardize CONTINUOUS features (get z-scores)
df['cont_feature1'] = (df['cont_feature1'] - df['cont_feature1'].mean()) / df['cont_feature1'].std()
df['cont_feature2'] = (df['cont_feature2'] - df['cont_feature2'].mean()) / df['cont_feature2'].std()
# Perform logistic regression
mod = smf.logit(formula='DEATH_EVENT ~ age + ... + C(diabetes) + C(high_blood_pressure)', data=df)
res = mod.fit()
# Print the summary output
print(res.summary())
# To access specific elements:
variables = res.params.index
coefficients = res.params.values
p_values = res.pvalues
standard_errors = res.bse.values
CIs = res.conf_int() # confidence intervals!
# Sort by coefficients:
l1, l2, l3, l4 = zip(*sorted(zip(coefficients[1:], variables[1:], standard_errors[1:],

# 5 - Observational studies <a id="6"> </a> 
The typical workflow is:
1. Standardize data --> continuous features 
2. Logistic regression 
3. Propensity score matching

## 5.1 - Extract propensity score 


In [None]:
# 1 - Standardize CONTINUOUS predictors (get z-scores)
df['cont_feature1'] = (df['cont_feature1'] - df['cont_feature1'].mean()) / df['cont_feature1'].std()
df['cont_feature2'] = (df['cont_feature2'] - df['cont_feature2'].mean()) / df['cont_feature2'].std()

# 2 - Perform logistic regression
mod = smf.logit(formula='treat ~ age + educ + C(black) + C(hispan)', data=df)
res = mod.fit()

# 3 - Extract estimated propensity scores (add to new column)
df['Propensity_score'] = res.predict()

## 5.2 - Matching
### Match subjects into pairs (1 treated, 1 control), with minimal propensity score difference

We will create a network (using the networkx library) with an edge between every (control, treated) pair of instances. The library has a function that finds a matching maximizing the sum of the weights of the selected edges. All we need to do is make sure that two instances with a small difference in propensity scores are connected by an edge with a large weight.

We can use similarity: $$ similarity(x,y) = 1 - | propensity\_score(x) - propensity\_score(y) |$$

In [None]:
# Function to compute similarity score
def get_similarity(propensity_score1, propensity_score2):
    return 1-np.abs(propensity_score1-propensity_score2)

# Separate the treatment and control groups
treatment_df = df[df['treat'] == 1]
control_df = df[df['treat'] == 0]
# Create an empty undirected graph
G = nx.Graph()
# Loop through all the pairs (1 control, 1 treatment) of instances
for control_id, control_row in control_df.iterrows():
    for treatment_id, treatment_row in treatment_df.iterrows():
        # Calculate the similarity between the two instances (=nodes)
        similarity = get_similarity(control_row['Propensity_score'], treatment_row['Propensity_score'])
        # Add an edge between the two instances (=nodes) weighted by the similarity between them
        G.add_weighted_edges_from([(control_id, treatment_id, similarity)])
# Get the maximum weight (=max similarity!) matching on the generated graph
matching = nx.max_weight_matching(G)
# Get the new dataset with the matched pairs
matched = [i[0] for i in list(matching)] + [i[1] for i in list(matching)]
# The node ids are index labels (from iterrows), so select with .loc
final_df = df.loc[matched]
# Now the treated and control groups in final_df should be balanced on the covariates,
# which removes much of their confounding effect on the results (e.g. in plots)

# 6 - Supervised Learning <a id="7"> </a> 
### __Supervised__: Input samples (X,y). Learn f such that y=f(X) to predict the y of new samples X
- Continuous y : regression
- Discrete y : classification

## 6.1 - Supervised learning --> Regression --> __Linear__ Regression

### Step 1: Prepare X and y 

In [None]:
# Choosing feature (input) columns, get X
feature_cols = ['input_column1', 'input_column2']
X = df[feature_cols]
# Choosing output column, y
y = df.output_column

### Step 2 : Perform a linear regression 

In [None]:
# Import library
from sklearn.linear_model import LinearRegression
# Perform linear regression
lin_reg = LinearRegression() # INITIALIZE REGRESSION
lin_reg.fit(X,y) # FIT
# Get the obtained beta coefficients
for f in range(len(feature_cols)):
    print("{0} * {1} + ".format(lin_reg.coef_[f], feature_cols[f]))
print(lin_reg.intercept_)

For example, predicting the sales (y) with inputs (X): advertising on TV, Radio, Newspaper.
This prints the terms of the following equation (the intercept is $\beta_0$, the other $\beta_i$ are the coefficients):
$$y = \beta_0 + \beta_1 \times TV + \beta_2 \times radio + \beta_3 \times newspaper$$

### Step 3 : Measure performance (MSE) using cross validation 

In [None]:
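# Import libraries
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import mean_squared_error
# Each entry of y_pred is a prediction obtained by cross-validation
y_pred = cross_val_predict(lin_reg, X, y, cv=5)
# Get MSE (mean squared error) between prediction and actual
mean_squared_error(y, y_pred)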

### Step 4 : Regularization 
Our dataset might have many records, but imagine that we had far fewer.

**Problem**: The model remembers the training records (overfitting).

**Solution**: Regularization

Regularization refers to methods that help to reduce overfitting. Let's try Ridge Regression, which puts a penalty on large weights $\beta_i$ and forces them to be smaller in magnitude. This reduces the complexity of the model.

In [None]:
from sklearn.linear_model import Ridge
ridge = Ridge(alpha=5)

# cross_val_predict returns an array of the same size as `y` where each entry
# is a prediction obtained by cross validation:
predicted_r = cross_val_predict(ridge, X, y, cv=5)
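
To illustrate how the penalty shrinks the weights, here is a sketch of the closed-form ridge solution in plain NumPy on synthetic data. It assumes no intercept term, whereas sklearn's `Ridge` fits one by default, so the numbers will not match sklearn exactly; `ridge_coefficients` is a hypothetical helper.

```python
import numpy as np

def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution (X^T X + alpha*I)^-1 X^T y, without intercept."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Synthetic data: 30 records, 3 features, known true weights plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = X @ np.array([4.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=30)

# alpha=0 is ordinary least squares; larger alpha means a stronger penalty
for alpha in [0, 5, 50]:
    beta = ridge_coefficients(X, y, alpha)
    print("alpha=%3d  beta=%s  ||beta||=%.3f" % (alpha, np.round(beta, 3), np.linalg.norm(beta)))
```

The norm of the coefficient vector decreases as alpha grows, which is the reduction in model complexity described above.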

## 6.2 - Supervised learning --> Regression --> __Logistic__ Regression
__Important__: For logistic regression, features should be either discrete or standardized.
You can therefore one-hot encode or label the categorical column values.


### Step 1 : Prepare X and y 

In [None]:
# Choosing feature (input) columns
feature_cols = ['input_column1', 'input_column2']
# Convert categorical values into dummy variables:
X = pd.get_dummies(df[feature_cols])
# You can also do this by simply labelling ('replace' is an easy way to do it)
# Choosing output column (y)
y = df.output_column

### Step 2 : Perform a logistic regression 

In [None]:
# Import library
from sklearn.linear_model import LogisticRegression
# Perform logistic regression
logistic = LogisticRegression(solver='lbfgs') # INITIALIZE REGRESSION
logistic.fit(X, y)                            # FIT

### Step 3: Measure performance (Precision and Recall) using cross validation

In [None]:
# Evaluate Precision and Recall
from sklearn.model_selection import cross_val_score
precision = cross_val_score(logistic, X, y, cv=10, scoring="precision")
print("Precision: %0.2f (+/- %0.2f)" % (precision.mean(), precision.std() * 2))
recall = cross_val_score(logistic, X, y, cv=10, scoring="recall")
print("Recall: %0.2f (+/- %0.2f)" % (recall.mean(), recall.std() * 2))
# Plot ROC curve: See tutorial 6

## 6.3 - Supervised learning --> Classification --> __K-NN__

In [None]:
# Import library
from sklearn.neighbors import KNeighborsClassifier
# Example k-NN with K=15 neighbours
classifier = KNeighborsClassifier(n_neighbors=15) # INITIALIZE CLASSIFIER
classifier.fit(X_train, y_train) # FIT
y_pred = classifier.predict(X_test) # PREDICT

## 6.4 - Supervised learning → Classification → Random forests
Important: In X and y, use dummies for non numerical variables (onehot encode or just label)

In [None]:
# Import library
from sklearn.ensemble import RandomForestClassifier
# Example Random Forest with K trees
classifier = RandomForestClassifier(max_depth=3, random_state=0, n_estimators=K) # INITIALIZE CLASSIFIER
classifier.fit(X_train, y_train) # FIT
y_pred = classifier.predict(X_test) # PREDICT

# 7 - Applied Machine Learning <a id="8"> </a> 
## 7.1 - Splitting the data into training set and testing set

In [None]:
# Import library
from sklearn.model_selection import train_test_split
# For a 70-30 split (70% train, 30% test), with random state 123
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)

# Alternative: manual random split (the 80/20 ratio holds only approximately)
def split_set(data_to_split, ratio=0.8):
    mask = np.random.rand(len(data_to_split)) < ratio
    return [data_to_split[mask].reset_index(drop=True), data_to_split[~mask].reset_index(drop=True)]

[train, test] = split_set(data)

## 7.1 - create dummies 

In [None]:
categorical_columns = ['col1','col2',..]
train_categorical = pd.get_dummies(train, columns=categorical_columns)
train_categorical.columns

# Make sure we use only the features available in the training set
test_categorical = pd.get_dummies(test, columns=categorical_columns)[train_categorical.columns]


train_label=train_categorical.label
train_features = train_categorical.drop('label', axis=1)
print('Length of the train dataset : {}'.format(len(train)))

test_label=test_categorical.label
test_features = test_categorical.drop('label', axis=1)
print('Length of the test dataset : {}'.format(len(test)))

In [None]:
## Standardization 
means = train_features.mean()
stddevs = train_features.std()

train_features_std = pd.DataFrame()
for c in train_features.columns:
    train_features_std[c] = (train_features[c]-means[c])/stddevs[c]

# Use the mean and stddev of the training set
test_features_std = pd.DataFrame()
for c in test_features.columns:
    test_features_std[c] = (test_features[c]-means[c])/stddevs[c]

train_features_std.head()
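
The same standardization can be written without explicit loops, since pandas aligns on column names. A minimal sketch on made-up data (`train_demo` and `test_demo` are illustrative names):

```python
import pandas as pd

train_demo = pd.DataFrame({'age': [20.0, 30.0, 40.0], 'income': [1000.0, 2000.0, 6000.0]})
test_demo = pd.DataFrame({'age': [50.0], 'income': [3000.0]})

# Statistics come from the training set only
means = train_demo.mean()
stddevs = train_demo.std()

# Broadcasting aligns on column names, so no explicit loop is needed
train_demo_std = (train_demo - means) / stddevs
test_demo_std = (test_demo - means) / stddevs

print(train_demo_std)
print(test_demo_std)
```

Reusing the training mean and standard deviation on the test set avoids leaking information from the test set into the model.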

## 7.2 - Computing the confusion matrix and recall, precision, F-score

Confusion matrix: $\begin{pmatrix}
TP & FP \\
FN & TN \\
\end{pmatrix}$

In [None]:
# NOTE: X_train and X_test should be standardized before this step
logistic = LogisticRegression(solver='lbfgs') # INITIALIZE REGRESSION
logistic.fit(X_train,y_train) # FIT
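y_pred = logistic.predict(X_test) # PREDICT

# For the confusion matrix
# NOTE: for binary labels 0/1, sklearn returns [[TN, FP], [FN, TP]] (rows = true class, columns = predicted class),
# which is a different layout from the matrix shown above
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))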

# For recall, precision, F-score
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred, digits=3))

> Another way of doing it

In [None]:
def compute_confusion_matrix(true_label, prediction_proba, decision_threshold=0.5): 
    
    predict_label = (prediction_proba[:,1]>decision_threshold).astype(int)   
                                                                                                                       
    TP = np.sum(np.logical_and(predict_label==1, true_label==1))
    TN = np.sum(np.logical_and(predict_label==0, true_label==0))
    FP = np.sum(np.logical_and(predict_label==1, true_label==0))
    FN = np.sum(np.logical_and(predict_label==0, true_label==1))
    
    confusion_matrix = np.asarray([[TP, FP],
                                    [FN, TN]])
    return confusion_matrix


def plot_confusion_matrix(confusion_matrix):
    [[TP, FP],[FN, TN]] = confusion_matrix
    label = np.asarray([['TP {}'.format(TP), 'FP {}'.format(FP)],
                        ['FN {}'.format(FN), 'TN {}'.format(TN)]])
    
    df_cm = pd.DataFrame(confusion_matrix, index=['Yes', 'No'], columns=['Positive', 'Negative']) 
    
    return sns.heatmap(df_cm, cmap='YlOrRd', annot=label, annot_kws={"size": 16}, cbar=False, fmt='')


def compute_all_score(confusion_matrix, t=0.5):
    [[TP, FP],[FN, TN]] = confusion_matrix.astype(float)
    
    accuracy =  (TP+TN)/np.sum(confusion_matrix)
    
    precision_positive = TP/(TP+FP) if (TP+FP) !=0 else np.nan
    precision_negative = TN/(TN+FN) if (TN+FN) !=0 else np.nan
    
    recall_positive = TP/(TP+FN) if (TP+FN) !=0 else np.nan
    recall_negative = TN/(TN+FP) if (TN+FP) !=0 else np.nan

    F1_score_positive = 2 *(precision_positive*recall_positive)/(precision_positive+recall_positive) if (precision_positive+recall_positive) !=0 else np.nan
    F1_score_negative = 2 *(precision_negative*recall_negative)/(precision_negative+recall_negative) if (precision_negative+recall_negative) !=0 else np.nan

    return [t, accuracy, precision_positive, recall_positive, F1_score_positive, precision_negative, recall_negative, F1_score_negative]

In [None]:
prediction_proba = logistic.predict_proba(X_test_std) # PREDICT PROBABILITY
# compute_confusion_matrix is defined in the previous cell
confusion_matrix = compute_confusion_matrix(y_test, prediction_proba, 0.5)
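
A small sketch of how the decision threshold moves the counts, using made-up labels and probabilities. `counts_at` is a hypothetical helper that mirrors the logic of `compute_confusion_matrix` on a 1-D array of positive-class probabilities.

```python
import numpy as np

# Hypothetical true labels and predicted probabilities of the positive class
true_label = np.array([1, 1, 1, 0, 0, 0, 0, 1])
proba_positive = np.array([0.9, 0.6, 0.4, 0.7, 0.2, 0.1, 0.45, 0.8])

def counts_at(threshold):
    """Return (TP, FP, FN, TN) when predicting 1 above the threshold."""
    pred = (proba_positive > threshold).astype(int)
    TP = int(np.sum((pred == 1) & (true_label == 1)))
    FP = int(np.sum((pred == 1) & (true_label == 0)))
    FN = int(np.sum((pred == 0) & (true_label == 1)))
    TN = int(np.sum((pred == 0) & (true_label == 0)))
    return TP, FP, FN, TN

for threshold in [0.3, 0.5, 0.75]:
    TP, FP, FN, TN = counts_at(threshold)
    precision = TP / (TP + FP) if (TP + FP) else float('nan')
    recall = TP / (TP + FN) if (TP + FN) else float('nan')
    print("t=%.2f  TP=%d FP=%d FN=%d TN=%d  precision=%.2f recall=%.2f"
          % (threshold, TP, FP, FN, TN, precision, recall))
```

Raising the threshold trades recall for precision: fewer samples are predicted positive, so there are fewer false positives but more false negatives.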

# 8 - Unsupervised learning <a id="9"> </a> 
__Unsupervised__: Input samples X. Learn f such that y = f(X) is a “simpler” representation.
- Discrete y : __clustering__
- Continuous y : __dimensionality reduction__

## 8.1 - Unsupervised Learning - __Clustering__ with K-means

In [None]:
# Import library
from sklearn.cluster import KMeans
# Example K-means for 5 clusters
kmean = KMeans(n_clusters=5, random_state=42).fit(X)
# Plot the resulting scatter (color shows the clusters)
plt.scatter(X[:,0], X[:,1], c=kmean.labels_)
# Plot the centroid of each cluster
for c in kmean.cluster_centers_:
    plt.scatter(c[0], c[1], marker="o", color="red")

__Select the right K__

1. Option 1: Look at the __silhouette score__:
Find the K with the desired tradeoff between the number of clusters and cohesion/separation.
To get the curve of the silhouette score, cluster the data with different values of K and plot the resulting values
(for the plot: x=K, y=Silhouette score).
Choose the K that maximizes the silhouette score.

2. Option 2: __elbow method__:
Find the "elbow" in the curve of the Sum of Squared Errors.
To get the curve for the elbow method, compute the SSE (sum of squared errors) for different values of K and plot the values
(for the plot: x=K, y=SSE).
The SSE keeps decreasing as K grows, so do not pick the K that minimizes it: choose the K at the elbow, after which adding clusters only reduces the SSE marginally.

See below (end of part 8.3) for __K-means with high dimensional data__.
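
Both options can be sketched in one loop. The example below assumes synthetic data made of three well-separated blobs (`X_demo` is an illustrative name) and uses `inertia_` as the SSE.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical data: three well-separated 2-D blobs
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 0], [0, 10]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

silhouettes, sse = {}, {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_demo)
    silhouettes[k] = silhouette_score(X_demo, km.labels_)
    sse[k] = km.inertia_  # sum of squared distances to the closest centroid

best_k = max(silhouettes, key=silhouettes.get)
print("K with the highest silhouette score:", best_k)
for k in sse:
    print("K=%d  silhouette=%.3f  SSE=%.1f" % (k, silhouettes[k], sse[k]))
```

To draw the two curves, plot the dictionaries with K on the x-axis, for example `plt.plot(list(sse.keys()), list(sse.values()))`.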

## 8.2 - Unsupervised learning - __Clustering__ with __DBSCAN__

In [None]:
# Import library
from sklearn.cluster import DBSCAN
# Example DBSCAN with eps=0.13
labels = DBSCAN(eps=0.13).fit_predict(X)
# Remember to play around with the eps value, it is the radius of the spheres
# eps grows => more points are density reachable => fewer clusters

## 8.3 - Unsupervised learning - __Dimensionality Reduction__
Goal: reduce the number of dimensions without losing too much information

### Option 1: __t-SNE__

In [None]:
# Import library
from sklearn.manifold import TSNE
# Context: X10d was a 10-dimensional set of data points
# Reduces X10d to 2-dimensions using t-SNE and call it X_reduced_tsne
X_reduced_tsne = TSNE(n_components=2, random_state=0).fit_transform(X10d)

### Option 2: __PCA__

In [None]:
# Import library
from sklearn.decomposition import PCA
# Context: X10d was a 10-dimensional set of data points
# Reduce X10d to 2-dimensions using PCA and call it X_reduced_pca
X_reduced_pca = PCA(n_components=2).fit(X10d).transform(X10d)

To use __K-means clustering on high dimensional data__, you have to perform the K-means on the original high-dimensional (ex: 10-D) data
(X10d) and then plot the reduced-dimension (2-D) data (X_reduced_tsne or X_reduced_pca)


# 9 - Handling text
## 9.1 - Basic processing of text data
### Loading text data

# 11 - Handling Networks <a id="12"> </a> 
## 11.1 - create a graph manually 
We will be using the NetworkX library (import networkx as nx) and several other libraries later on, so just copy all the imports from the list at the beginning of this file.

In [None]:
# DIRECTED graph
G = nx.DiGraph()
# UNDIRECTED graph
G = nx.Graph()
# Prints graph information (nx.info was removed in NetworkX 3.0; use print(G) there)
print(nx.info(G))

### add nodes and edges 

In [None]:
# Add nodes
G.add_node(1)
# Add several nodes at once
G.add_nodes_from(range(2,9))
# Add edges
G.add_edge(1,2)
# Add several edges at once
edges = [(2,3), (1,3), (4,1), (4,5)]
G.add_edges_from(edges)

### Plot + information graph/network

In [None]:
# Plots the network
nx.draw_spring(G, with_labels=True, alpha=0.8)
# Plot a subnetwork
subgraph_Alex = G.subgraph(['Alexander']+list(G.neighbors('Alexander')))
# Just the same as plotting G but replace by the subnetworks name
nx.draw_spring(subgraph_Alex, with_labels=True)

### Helper functions 

- Degree distribution P(k): probability that a randomly chosen node has degree k. Normalized histogram: $P(k)=\frac{N_k}{N}$. Consider plotting in log-log, especially for power laws!
- NB: If you just need the degree of one node use G.degree(node).
For in-degree: G.in_degree(node), for out-degree: G.out_degree(node)
- __Function for plotting the degree distribution:__

In [None]:
# Helper function for plotting the degree distribution of a Graph
def plot_degree_distribution(G):
    degrees = {}
    for node in G.nodes():
        degree = G.degree(node)
        if degree not in degrees:
            degrees[degree] = 0
        degrees[degree] += 1
    sorted_degree = sorted(degrees.items())
    deg = [k for (k,v) in sorted_degree]
    cnt = [v for (k,v) in sorted_degree]
    fig, ax = plt.subplots()
    plt.bar(deg, cnt, width=0.80, color='b')
    plt.title("Degree Distribution")
    plt.ylabel("Frequency")
    plt.xlabel("Degree")
    ax.set_xticks([d+0.05 for d in deg])
    ax.set_xticklabels(deg)
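
The helper above plots raw counts. A sketch of the normalized distribution $P(k)=\frac{N_k}{N}$ in pure Python follows, assuming a made-up undirected edge list. Nodes without any edge do not appear in an edge list, so they are not counted here.

```python
from collections import Counter

# Hypothetical undirected edge list
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (4, 5)]

# Degree of each node = number of edges it appears in
degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

# N_k = number of nodes with degree k, N = total number of nodes
n_nodes = len(degree)
count_per_degree = Counter(degree.values())
P = {k: n_k / n_nodes for k, n_k in sorted(count_per_degree.items())}
print(P)
```

With a networkx graph, the same counts can be obtained from `Counter(d for _, d in G.degree())`.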

In [None]:
# Helper function for printing various graph properties
def describe_graph(G):
    print(nx.info(G))  # on NetworkX >= 3.0 use print(G) instead
    if nx.is_connected(G):
        print("Avg. Shortest Path Length: %.4f" %nx.average_shortest_path_length(G))
        print("Diameter: %.4f" %nx.diameter(G)) # Longest shortest path
    else:
        print("Graph is not connected")
        print("Diameter and Avg shortest path length are not defined!")
    print("Sparsity: %.4f" %nx.density(G))  # #edges/#edges-complete-graph
    # #closed-triplets(3*#triangles)/#all-triplets
    print("Global clustering coefficient aka Transitivity: %.4f" %nx.transitivity(G))

In [None]:
# Helper function for visualizing the graph
def visualize_graph(G, with_labels=True, k=None, alpha=1.0, node_shape='o'):
    #nx.draw_spring(G, with_labels=with_labels, alpha = alpha)
    pos = nx.spring_layout(G, k=k)
    if with_labels:
        lab = nx.draw_networkx_labels(G, pos, labels=dict([(n, n) for n in G.nodes()]))
    ec = nx.draw_networkx_edges(G, pos, alpha=alpha)
    nc = nx.draw_networkx_nodes(G, pos, nodelist=G.nodes(), node_color='g', node_shape=node_shape)
    plt.axis('off')

## 11.2 - Networks from pandas dataframes
Creating a graph from a dataframe

In [None]:
# Import a dataframe with edgelist
df_edges = pd.read_csv('data/edgelist.csv')
# In df, 'Source' and 'Target' are columns that give the edges
# For directed graph
G = nx.from_pandas_edgelist(df_edges, 'Source', 'Target', edge_attr=None, create_using=nx.DiGraph())
# For undirected graph
G = nx.from_pandas_edgelist(df_edges, 'Source', 'Target', edge_attr=None, create_using=nx.Graph())

In [None]:
# Node attributes are in the node list dataframe
df_nodes = pd.read_csv('data/nodelist.csv')
# Add these attributes to their corresponding node in the graph
# (df_nodes must be indexed by node name, e.g. via index_col in read_csv)
nx.set_node_attributes(G, df_nodes['Gender'].to_dict(), 'Gender')
nx.set_node_attributes(G, df_nodes['Birthdate'].to_dict(), 'Birthdate')
# This way you can access one node's attributes very easily:
G.nodes['Guillaume Ryelandt'] # (here the index column in the df_nodes was the names)

{'Role': 'EPFL student',
'Gender': 'male',
'Birthdate': 1999}

## 11.3 - describing the networks 
### Sparsity
The sparsity of a graph with $n$ nodes is defined here as the number of edges divided by the maximum possible number of edges (NetworkX calls this quantity the density).
- A directed graph can have at most $n(n-1)$ edges
- An undirected graph can have at most $n(n-1)/2$ edges

In [None]:
nx.density(G)
print("The sparsity of the graph is: {}".format(nx.density(G)))
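
To make the definition concrete, the ratio can be computed by hand. This is a minimal sketch in plain Python. The helper name `density` and the toy numbers are assumptions for illustration only; on a real graph `nx.density(G)` is the call to use.

```python
def density(n_nodes, n_edges, directed=False):
    """Number of edges divided by the maximum possible number of edges."""
    if n_nodes < 2:
        return 0.0  # no edge is possible with fewer than two nodes
    max_edges = n_nodes * (n_nodes - 1)   # directed case: n(n-1)
    if not directed:
        max_edges /= 2                    # undirected case: n(n-1)/2
    return n_edges / max_edges

# Toy example: 4 nodes and 3 edges
print(density(4, 3))                 # undirected: 3 / 6
print(density(4, 3, directed=True))  # directed:   3 / 12
```

The same edge count gives a lower value in the directed case because twice as many edges are possible.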

### Transitivity 
(a.k.a. Global clustering coefficient)
It is the overall probability that two neighbours of a node are themselves connected, which reveals the existence of tightly connected
communities (or clusters). "A friend of my friend is my friend".

In [None]:
# GLOBAL clustering coefficient (for the whole graph)
nx.transitivity(G)
print("The transitivity of the graph is: {}".format(nx.transitivity(G)))
print("The global clustering coefficient of the graph is: {}".format(nx.transitivity(G)))
# LOCAL clustering coefficient of individual nodes (returns a dict: node -> coefficient)
nx.clustering(G, ['Alexander', 'John'])
print("The clustering coefficients are: {}".format(nx.clustering(G, ['Alexander', 'John'])))
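
The ratio "closed triplets over all triplets" can be checked by hand on a tiny graph. The sketch below is plain Python, not the NetworkX implementation. The function name `transitivity` and the adjacency dictionary `adj` are hypothetical, chosen for this example.

```python
from itertools import combinations

def transitivity(adj):
    """3 * #triangles / #connected triplets for an undirected graph.

    adj maps each node to the set of its neighbours.
    """
    closed = 0    # triplets whose two outer nodes are also linked
    triplets = 0  # all pairs of neighbours around a centre node
    for centre, neighbours in adj.items():
        for u, v in combinations(sorted(neighbours), 2):
            triplets += 1
            if v in adj[u]:
                closed += 1  # each triangle is counted once per corner, i.e. 3 times
    return closed / triplets if triplets else 0.0

# Triangle A-B-C with an extra node D attached to C
adj = {'A': {'B', 'C'}, 'B': {'A', 'C'}, 'C': {'A', 'B', 'D'}, 'D': {'C'}}
print(transitivity(adj))  # 1 triangle, 5 triplets -> 3 * 1 / 5
```

Node C contributes three triplets but only one of them is closed, which is what pulls the value below 1.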

## 11.4 - is the graph connected ? 
The graph is connected if __there is a path from any node to any other node in the graph__. No separate "communities".

In [None]:
if nx.is_connected(G):
    print("The network is connected.")
else:
    print("The network is not connected.")

### for connected graphs 
We can define:
- Average shortest path length
- Longest shortest path (Diameter)
- Shortest path between two nodes

In [None]:
# Average shortest path length:
nx.average_shortest_path_length(G)
print("The average shortest path length is {}".format(nx.average_shortest_path_length(G)))
# Longest shortest path length (a.k.a diameter):
print("The longest shortest path length (diameter) is {}".format(nx.diameter(G)))
# Shortest path between two nodes:
tom_bob_path = nx.shortest_path(G, source="Thomas", target="Bob")
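
For an unweighted graph these quantities all come from breadth-first search. The following is a rough sketch in plain Python to show what is being measured. The names `bfs_distances`, `path_stats` and `path_graph` are made up for the example, and the code assumes a connected, undirected graph stored as a dict of neighbour sets.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distance from source to every reachable node (unweighted graph)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def path_stats(adj):
    """Average shortest path length and diameter of a connected graph."""
    # One entry per ordered pair (s, t) with s != t
    lengths = [d for s in adj for t, d in bfs_distances(adj, s).items() if t != s]
    return sum(lengths) / len(lengths), max(lengths)

# Path graph 0 - 1 - 2 - 3
path_graph = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
avg_length, diameter = path_stats(path_graph)
print(avg_length, diameter)
```

On a disconnected graph some pairs have no path at all, which is why both measures are only defined for connected graphs.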

### for non-connected graphs 
We can look at the number of connected components.

In [None]:
# Get a list of connected components of G
comp = list(nx.connected_components(G))
print('The graph contains {} connected components'.format(len(comp)))

# Get the largest component
largest_comp = max(comp, key=len)
H = G.subgraph(list(largest_comp))
print('The largest component contains {} nodes'.format(len(largest_comp)))
print("The largest component: {}".format(H))  # nx.info() was removed in NetworkX 3.0
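
What `nx.connected_components` returns can be pictured with a small hand-written version. This is only an illustrative sketch: the function and the toy graph `toy` are assumptions, and the real NetworkX function returns a generator of sets rather than a list.

```python
from collections import deque

def connected_components(adj):
    """List of node sets, one per connected component (undirected graph)."""
    seen = set()
    components = []
    for start in adj:
        if start in seen:
            continue
        # Grow one component from this unvisited node with BFS
        component = {start}
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in component:
                    component.add(w)
                    queue.append(w)
        seen |= component
        components.append(component)
    return components

# Three components: a-b-c, d-e, and the isolated node f
toy = {'a': {'b'}, 'b': {'a', 'c'}, 'c': {'b'}, 'd': {'e'}, 'e': {'d'}, 'f': set()}
comps = connected_components(toy)
largest = max(comps, key=len)
print(len(comps), sorted(largest))
```

A graph is connected exactly when this list has a single element.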

### strong and weak connection 

In [None]:
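# These two checks are only defined for directed graphs (nx.DiGraph)
# Strongly connected: every node can reach every other node following edge directions
nx.is_strongly_connected(G)
# Weakly connected: the graph is connected once edge directions are ignored
nx.is_weakly_connected(G)
# There are many built-in functions in NetworkX, look them up!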
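
The difference between the two notions is easiest to see on a three-node example. The sketch below uses plain Python with hypothetical helper names. It assumes a directed graph stored as a dict mapping every node to the set of its successors.

```python
from collections import deque

def reachable(succ, start):
    """Set of nodes reachable from start by following edges forward."""
    seen = {start}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in succ[v]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen

def reversed_graph(succ):
    """Same nodes, every edge flipped."""
    rev = {n: set() for n in succ}
    for u, targets in succ.items():
        for v in targets:
            rev[v].add(u)
    return rev

def is_strongly_connected(succ):
    # One node must reach everyone and be reached by everyone
    start, nodes = next(iter(succ)), set(succ)
    return (reachable(succ, start) == nodes
            and reachable(reversed_graph(succ), start) == nodes)

def is_weakly_connected(succ):
    # Ignore directions by merging successors and predecessors
    rev = reversed_graph(succ)
    undirected = {n: succ[n] | rev[n] for n in succ}
    return reachable(undirected, next(iter(succ))) == set(succ)

chain = {'a': {'b'}, 'b': {'c'}, 'c': set()}   # a -> b -> c
cycle = {'a': {'b'}, 'b': {'c'}, 'c': {'a'}}   # a -> b -> c -> a
print(is_weakly_connected(chain), is_strongly_connected(chain))
print(is_weakly_connected(cycle), is_strongly_connected(cycle))
```

The chain is only weakly connected because nothing leads back to `a`. Closing the cycle makes it strongly connected.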

## 11.6 - Measuring importance of a node

Note: this is an undirected graph. For a **directed** one, use separate metrics for **indegree** and **outdegree**.
### Degree centrality
Many neighbours = important node

In [None]:
from operator import itemgetter

degrees = dict(G.degree(G.nodes()))
# Sort the (node, degree) pairs by degree, highest first
sorted_degree = sorted(degrees.items(), key=itemgetter(1), reverse=True)
# Get the top 5 most popular nodes
for quaker, degree in sorted_degree[:5]:
    print(quaker, 'who is', G.nodes[quaker]['Role'], 'knows', degree, 'people')
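
Raw degrees depend on the size of the graph, so they are often divided by the largest possible degree, $n-1$. The sketch below shows this normalisation in plain Python on a hypothetical toy graph. It follows the same idea as `nx.degree_centrality`, but the function here is written for the example only.

```python
def degree_centrality(adj):
    """Degree of each node divided by the maximum possible degree (n - 1)."""
    n = len(adj)
    if n < 2:
        return {node: 0.0 for node in adj}
    return {node: len(neighbours) / (n - 1) for node, neighbours in adj.items()}

# Triangle A-B-C with an extra node D attached to C
adj = {'A': {'B', 'C'}, 'B': {'A', 'C'}, 'C': {'A', 'B', 'D'}, 'D': {'C'}}
centrality = degree_centrality(adj)
ranking = sorted(centrality.items(), key=lambda item: item[1], reverse=True)
print(ranking[0])  # C is linked to every other node
```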

### Katz Centrality

In [None]:
degrees = dict(G.degree(G.nodes()))
# Compute katz centrality
katz = nx.katz_centrality(G)
# Add it to node attributes and sort
nx.set_node_attributes(G, katz, 'katz')
sorted_katz = sorted(katz.items(), key=itemgetter(1), reverse=True)
# Get the top 5 most popular nodes
for quaker, katzc in sorted_katz[:5]:
    print(quaker, 'who is', G.nodes[quaker]['Role'], 'has katz-centrality: %.3f' %katzc)
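
For reference, Katz centrality makes a node important when its neighbours are important. With adjacency matrix $A$, the score $x_i$ of node $i$ is usually written as

$$x_i = \alpha \sum_{j} A_{ij} \, x_j + \beta$$

where $\alpha$ is an attenuation factor (it has to be smaller than $1/\lambda_{\max}$, the inverse of the largest eigenvalue of $A$, for the scores to converge) and $\beta$ is a base score given to every node. They correspond to the `alpha` and `beta` arguments of `nx.katz_centrality`.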

### Betweenness centrality :

In [None]:
# Compute betweenness centrality
betweenness = nx.betweenness_centrality(G)
# Add it to node attributes and sort
nx.set_node_attributes(G, betweenness, 'betweenness')
sorted_betweenness = sorted(betweenness.items(), key=itemgetter(1), reverse=True)
# Get the top 5 most popular nodes
for quaker, bw in sorted_betweenness[:5]:
    print(quaker, 'who is', G.nodes[quaker]['Role'], 'has betweenness: %.3f' %bw)
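
For reference, the betweenness of a node $v$ counts the share of shortest paths that pass through it:

$$c_B(v) = \sum_{s \neq v \neq t} \frac{\sigma(s, t \mid v)}{\sigma(s, t)}$$

where $\sigma(s, t)$ is the number of shortest paths between $s$ and $t$, and $\sigma(s, t \mid v)$ is the number of those paths that go through $v$. By default `nx.betweenness_centrality` returns a normalized version of this sum, so values are comparable across graphs of different sizes.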

### show the results 

In [None]:
# similar pattern
list_nodes = list(G.nodes())
list_nodes.reverse()   # for showing the nodes with high betweenness centrality
pos = nx.spring_layout(G)
ec = nx.draw_networkx_edges(G, pos, alpha=0.1)
nc = nx.draw_networkx_nodes(G, pos, nodelist=list_nodes, node_color=[G.nodes[n]["betweenness"] for n in list_nodes],
                            alpha=0.8, node_shape = '.')
plt.colorbar(nc)
plt.axis('off')
plt.show()

## 11.7 - Community detection 
### Girvan Newman
Idea: Edges with high betweenness centrality separate communities. The algorithm starts with the entire graph and iteratively removes the
edge with the highest betweenness.

In [None]:
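import itertools
from networkx.algorithms.community import girvan_newman

comp = girvan_newman(G)  # generator: each item is a tuple of communities (sets of nodes)
iteration = 0
# Look at the first 4 levels of the hierarchy
for communities in itertools.islice(comp, 4):
    iteration += 1
    print('Iteration', iteration)
    print(tuple(sorted(c) for c in communities))
visualize_graph(G)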
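
The idea of "remove the edge with the highest betweenness until the graph splits" can be sketched without NetworkX. The code below is a simplified illustration for small, unweighted, undirected graphs without self-loops. It is not the library implementation, and all function names are hypothetical.

```python
from collections import deque

def edge_betweenness(adj):
    """Shortest-path betweenness of every edge (BFS-based accumulation)."""
    bet = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        dist, order = {s: 0}, []
        sigma = {n: 0 for n in adj}      # number of shortest paths from s
        sigma[s] = 1
        preds = {n: [] for n in adj}     # predecessors on shortest paths
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {n: 0.0 for n in adj}
        for w in reversed(order):        # accumulate from the farthest nodes back
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1 + delta[w])
                bet[frozenset((v, w))] += share
                delta[v] += share
    return bet

def components(adj):
    """Connected components as a list of node sets."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in comp:
                    comp.add(w)
                    queue.append(w)
        seen |= comp
        comps.append(comp)
    return comps

def girvan_newman_step(adj):
    """Remove highest-betweenness edges until one more component appears."""
    adj = {n: set(nbrs) for n, nbrs in adj.items()}  # work on a copy
    n_start = len(components(adj))
    while len(components(adj)) == n_start:
        bet = edge_betweenness(adj)
        if not bet:
            break  # no edges left to remove
        u, v = max(bet, key=bet.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)

# Two triangles joined by the single edge 2-3
two_triangles = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
                 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
split = girvan_newman_step(two_triangles)
print(sorted(sorted(c) for c in split))
```

Every shortest path between the two triangles uses the edge 2-3, so it has the highest betweenness and is removed first.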

### The Louvain method 

Idea: Proceeds the other way around: initially every node is considered as a community. The communities are traversed, and for each
community it is tested whether joining it to a neighboring community gives us a better clustering.
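
In the Louvain method "a better clustering" means a higher modularity $Q = \sum_c \left[ \frac{L_c}{m} - \left( \frac{d_c}{2m} \right)^2 \right]$, where $m$ is the number of edges, $L_c$ the number of edges inside community $c$ and $d_c$ the sum of the degrees of its nodes. The sketch below evaluates this score for a few candidate partitions of a toy graph. The function name, the edge list and the partitions are assumptions made for the example.

```python
def modularity(edges, community_of):
    """Modularity of a partition of an undirected, unweighted graph."""
    m = len(edges)
    internal = {}    # community -> number of edges with both ends inside
    degree_sum = {}  # community -> sum of the degrees of its nodes
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree_sum.items())

# Two triangles joined by the edge 2-3
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
singletons = {n: n for n in range(6)}                    # Louvain's starting point
two_groups = {0: 'left', 1: 'left', 2: 'left',
              3: 'right', 4: 'right', 5: 'right'}
one_group = {n: 'all' for n in range(6)}

q_singletons = modularity(edges, singletons)
q_two_groups = modularity(edges, two_groups)
q_one_group = modularity(edges, one_group)
print(q_singletons, q_two_groups, q_one_group)
```

Merging nodes into the two triangles raises the score, while merging everything into one community lowers it again, so the algorithm would stop at the two-triangle partition.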

In [5]:
import community as community_louvain  # provided by the python-louvain package

partition = community_louvain.best_partition(G)
# Add it as an attribute to the nodes
for n in G.nodes:
    G.nodes[n]["louvain"] = partition[n]
# Plot it out
pos = nx.spring_layout(G, k=0.2)
ec = nx.draw_networkx_edges(G, pos, alpha=0.2)
nc = nx.draw_networkx_nodes(G, pos, nodelist=G.nodes(),
                            node_color=[G.nodes[n]["louvain"] for n in G.nodes],
                            node_size=100, cmap=plt.cm.jet)
plt.show()
# ------------------------------ Take a look at specific cluster ------------------------------
cluster_James = partition['James Nayler']
# Take all the nodes that belong to James' cluster
members_cluster = [q for q in G.nodes if partition[q] == cluster_James]

# Get information about them
for quaker in members_cluster:
    print(quaker, 'who is', G.nodes[quaker]['Role'], 'and died in ', G.nodes[quaker]['Deathdate'])


## 11.8 - Homophily in network
### Influence: I copy eating behavior of those around me
### Homophily: people with similar eating behavior tend to become friends

How likely is it that two quakers who have the same attribute are linked?

Try to measure the similarity of connections in the graph with respect to a given attribute.   
*Intuition: Like correlation, but translated to graphs.*

In [None]:
# For categorical attributes
nx.attribute_assortativity_coefficient(G, 'Gender')
# For numerical attributes, values must be integers
nx.numeric_assortativity_coefficient(G, 'Deathdate')
# If this value is high, there is some form of gender homophily in the network:
# people of the same gender tend to be linked to each other.
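
The "correlation translated to graphs" intuition can be made concrete for a categorical attribute. The sketch below follows the usual mixing-matrix definition, $r = \frac{\sum_i e_{ii} - \sum_i a_i b_i}{1 - \sum_i a_i b_i}$, in plain Python. The function name, the edge list and the attribute values are hypothetical, and on a real graph `nx.attribute_assortativity_coefficient` should be preferred.

```python
from collections import Counter

def attribute_assortativity(edges, attr):
    """Assortativity of a categorical node attribute on an undirected graph."""
    # Count attribute pairs over both directions of every edge
    pairs = Counter()
    for u, v in edges:
        pairs[(attr[u], attr[v])] += 1
        pairs[(attr[v], attr[u])] += 1
    total = sum(pairs.values())
    values = {x for pair in pairs for x in pair}
    # Fraction of edge ends that link two nodes with the same value
    same = sum(pairs[(x, x)] for x in values) / total
    # Fraction expected if edge ends were paired at random
    a = {x: sum(c for (p, _), c in pairs.items() if p == x) / total for x in values}
    b = {x: sum(c for (_, q), c in pairs.items() if q == x) / total for x in values}
    expected = sum(a[x] * b[x] for x in values)
    # Undefined (division by zero) if every node has the same value
    return (same - expected) / (1 - expected)

gender = {'a': 'male', 'b': 'male', 'c': 'female', 'd': 'female'}
edges = [('a', 'b'), ('c', 'd'), ('b', 'c')]   # two same-gender links, one mixed
r = attribute_assortativity(edges, gender)
print(r)
```

A value of 1 means links only exist between nodes with the same attribute, 0 means no more than expected at random, and negative values mean that links mostly connect different attribute values.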

# 13 - problems <a id="14"> </a> 
## TODO 
> read before the exam 

## HOMEWORK 1
    - question 1.2 : count duplicates in a column 