# 1- EXPLORATORY DATA ANALYSIS

## INTRODUCTION

## Objective:
- Understand the available data as thoroughly as possible in order to define a modeling strategy

- Develop a first modeling strategy

#### SHAPE ANALYSIS:

- **Target identification**: stroke

- **Number of rows and columns**: 5110 rows and 12 columns

- **Variable types**: qualitative: 9, quantitative: 3

- **Missing values**: few NaN, only in the bmi variable (body mass index), where about 4% of the values are missing
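
The checks listed above can be sketched on a small stand-in table. This is a minimal illustration, not the notebook's actual data: the `toy` frame and its values are made up, and only the pandas calls mirror what the cells below do on `df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the stroke dataset (values are made up).
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "age": [67.0, 61.0, 80.0, 49.0],
    "bmi": [36.6, np.nan, 32.5, 34.4],
    "stroke": [1, 1, 0, 0],
})

n_rows, n_cols = toy.shape                    # number of rows and columns
dtype_counts = toy.dtypes.value_counts()      # how many columns per dtype
# Share of missing values per column, largest first
missing_ratio = toy.isna().mean().sort_values(ascending=False)

print(n_rows, n_cols)
print(dtype_counts)
print(missing_ratio)
```

`isna().mean()` gives the same ratio as `isna().sum() / df.shape[0]`, since the mean of a boolean column is the share of `True` values.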



#### SHAPE ANALYSIS

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


data = pd.read_csv('../input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv', delimiter = ',', encoding = 'utf-8')
data.head()

In [None]:
df = data.copy()
df.dtypes


In [None]:
df.dtypes.value_counts()

In [None]:
df.shape

In [None]:
# Identifying missing values:

import seaborn as sns 
plt.figure(figsize =  (20,10))
sns.heatmap(df.isna(), cbar = False)



In [None]:
# Share of missing values per column, largest first
(df.isna().sum()/df.shape[0]).sort_values(ascending = False)

#### CONTENT ANALYSIS:

- **Target visualization (histogram for a continuous variable / boxplot for a discrete one)**:
    - Only about 4 to 5% of positives, so the classes are heavily imbalanced

- **Meaning of the variables**:
    - Continuous variables: not standardized, skewed
    - Age variable: age ranges from roughly 0 to 80 years; an age-category variable could be created later
    - Qualitative variables: more women than men, more married than unmarried people, and a majority working in the private sector. Few patients have hypertension or heart disease. The smoking status includes an 'Unknown' category

- **Feature / target relationships (histogram / boxplot)**:
    - target / categorical: nothing can be concluded from these plots for now
    - target / age: very few strokes before age 40, and the frequency increases with age. Age could be an informative variable
    - target / numeric: age and glucose level could be factors => to be tested
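
The age-category variable mentioned above could be built with `pd.cut`. The sketch below is only an illustration: the bin edges and labels are hypothetical choices, not values taken from this notebook.

```python
import pandas as pd

ages = pd.Series([3, 25, 41, 58, 79], name="age")

# Hypothetical bin edges and labels; the notebook does not fix any.
bins = [0, 18, 40, 55, 120]
labels = ["child", "young_adult", "middle_aged", "senior"]

# right=False makes each bin closed on the left: [0, 18), [18, 40), ...
age_group = pd.cut(ages, bins=bins, labels=labels, right=False)
print(age_group.tolist())
```

The result is a categorical Series that can be passed to `pd.crosstab` against `stroke` in the same way as the other categorical columns.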

In [None]:
df['stroke'].value_counts(normalize = True)

In [None]:
for col in df.select_dtypes('float'):
    print(col)

In [None]:
for col in df.select_dtypes('float'):
    plt.figure()
    sns.histplot(df[col], kde=True)  # distplot is deprecated in recent seaborn versions

In [None]:
for col in df.select_dtypes('object'):
    plt.figure()
    df[col].value_counts().plot.pie()


In [None]:
df['hypertension'].value_counts().plot.pie()

In [None]:
df['heart_disease'].value_counts().plot.pie()

In [None]:
# Create the subsets

negative_df = df[df['stroke'] == 0]

positive_df = df[df['stroke'] == 1]


In [None]:
positive_df.shape

In [None]:
cat_columns = ['hypertension', 'heart_disease', 'gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status']

num_columns = ['age', 'avg_glucose_level', 'bmi']

In [None]:
# Categorical / target relationship

for col in cat_columns : 
    plt.figure()
    sns.heatmap(pd.crosstab(df['stroke'], df[col]), annot = True, fmt = 'd')


In [None]:
# target / age

plt.figure(figsize = (20,8))
sns.countplot(x = 'age', hue = 'stroke', data = df)

In [None]:
plt.figure(figsize = (12,8))
plt.scatter(df['age'], df['bmi'], c = df['stroke'], alpha = 0.4)

In [None]:
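# target / numeric

for i in num_columns:
    plt.figure()
    # histplot with a KDE overlay replaces the deprecated distplot
    sns.histplot(positive_df[i], label='positive', kde=True, stat='density', color='tab:orange')
    sns.histplot(negative_df[i], label='negative', kde=True, stat='density', color='tab:blue')
    plt.legend()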


#### DETAILED ANALYSIS:

- **Variable / variable relationships**:
    - Numeric / Numeric: no linear relationship
    - Numeric / Age: no linear relationship
    - Categorical / Categorical:
    - Categorical / Age:

- **Subsets**:
    - is sick (hypertension and/or heart disease): these patients have a higher BMI and are older

- **Hypothesis tests**:

In [None]:
sns.pairplot(df[num_columns])

In [None]:
sns.heatmap(df[num_columns].corr())

In [None]:
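# numeric_only=True skips the string columns, which recent pandas versions refuse to correlate
df.corr(numeric_only=True)['age'].sort_values()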

In [None]:
cat_columns

In [None]:
pd.crosstab(df['hypertension'], df['heart_disease'])

In [None]:
df.columns

In [None]:
df['smoking_status'].value_counts()

In [None]:
df['est malade'] = np.sum(df[['hypertension','heart_disease' ]] == 1, axis = 1) >=1

malade_df = df[df['est malade'] == True]
non_malade_df = df[df['est malade'] == False]

In [None]:
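for i in num_columns:
    plt.figure()
    # histplot with a KDE overlay replaces the deprecated distplot
    sns.histplot(malade_df[i], label='malade', kde=True, stat='density', color='tab:orange')
    sns.histplot(non_malade_df[i], label='non malade', kde=True, stat='density', color='tab:blue')
    plt.legend()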


In [None]:
from scipy.stats import ttest_ind

positive_df.shape

In [None]:
negative_df.shape

In [None]:
# Fixed seed so the balanced negative sample is the same on every run
balanced_neg = negative_df.sample(positive_df.shape[0], random_state = 0)

In [None]:
balanced_neg.shape

In [None]:
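def t_test(col):
    # Significance level set by the author; 0.05 is the more common choice
    alpha = 0.2
    # Two-sample t-test between the balanced negative sample and the positive cases
    stat, p = ttest_ind(balanced_neg[col].dropna(), positive_df[col].dropna())
    if p < alpha:
        return 'H0 rejected'
    else:
        return 'H0 not rejected'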

In [None]:
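# The t-test needs numeric values, so only the 0/1-encoded categorical columns are tested here
for col in ['hypertension', 'heart_disease']:
    print(f'{col}: {t_test(col)}')

In [None]:
for col in num_columns:
    print(f'{col}: {t_test(col)}')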
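
A t-test compares means, which suits numeric columns. For string-valued categorical columns such as `gender` or `work_type`, a chi-squared test of independence is one common alternative. The sketch below is an assumption about how such a check could look, not part of the original analysis: the `toy` data, the helper name `chi2_test` and the 0.05 level are all illustrative.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up data: 80 patients, with strokes concentrated among the married.
toy = pd.DataFrame({
    "ever_married": ["Yes"] * 40 + ["No"] * 40,
    "stroke": [1] * 30 + [0] * 10 + [1] * 10 + [0] * 30,
})

def chi2_test(frame, col, target="stroke", alpha=0.05):
    """Chi-squared independence test between a categorical column and the target."""
    table = pd.crosstab(frame[col], frame[target])   # contingency table of counts
    stat, p, dof, expected = chi2_contingency(table)
    return "H0 rejected" if p < alpha else "H0 not rejected"

result = chi2_test(toy, "ever_married")
print(result)
```

Here H0 is "the column and the target are independent", so a rejection suggests the column carries some information about `stroke`.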

# 2- DATA PREPROCESSING

In [None]:
df = df.drop('id', axis = 1)

In [None]:
df.columns 

In [None]:
from sklearn.model_selection import train_test_split

### Other visualizations

In [None]:
%matplotlib
from mpl_toolkits.mplot3d import Axes3D
ax = plt.axes(projection = '3d')
ax.scatter(df['hypertension'], df['age'], df['heart_disease'], c=df['stroke'])


### Encoding

In [None]:
# Gender column:

df.loc[df['gender'] == 'Male','gender'] = 0
df.loc[df["gender"] == "Female","gender"] = 1

In [None]:
# Ever Married column:

df.loc[df['ever_married'] == "Yes", 'ever_married'] = 1
df.loc[df['ever_married'] == "No", 'ever_married'] = 0

In [None]:
# Residence column


df.loc[df['Residence_type'] == "Urban", 'Residence_type'] = 1
df.loc[df['Residence_type'] == "Rural", 'Residence_type'] = 0

df = df.rename(columns = {'Residence_type': 'Urban_residence'}) 

In [None]:
cat_columns = ['hypertension','heart_disease','gender','ever_married','work_type','Urban_residence','smoking_status']

In [None]:
# Work Type and Smoking Status columns

df2 = pd.get_dummies(df[['work_type', 'smoking_status']], prefix=['work_type', 'smoking_status'])

df = df.join(df2)

In [None]:
# Drop the columns 'work_type','smoking_status','smoking_status_Unknown','smoking_status_never smoked'

df = df.drop(['work_type','smoking_status','smoking_status_Unknown','smoking_status_never smoked'], axis = 1)

In [None]:
# Drop the work_type_children column

df = df.drop('work_type_children', axis = 1)

In [None]:
df.loc[df['est malade'] == True, 'est malade'] = 1
df.loc[df['est malade'] == False, 'est malade'] = 0
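
The two encoding styles used in this section (a 0/1 mapping for binary columns, indicator columns for multi-class ones) can be written more compactly. The following is a minimal sketch on made-up rows, assuming `Series.map` and the `columns`/`dtype` arguments of `pd.get_dummies`; it is not the notebook's exact procedure.

```python
import pandas as pd

toy = pd.DataFrame({
    "ever_married": ["Yes", "No", "Yes"],
    "work_type": ["Private", "Govt_job", "children"],
})

# Binary column: explicit mapping; values absent from the dict would become NaN.
toy["ever_married"] = toy["ever_married"].map({"No": 0, "Yes": 1})

# Multi-class column: one indicator column per category, stored as integers.
encoded = pd.get_dummies(toy, columns=["work_type"], dtype=int)
print(encoded.columns.tolist())
```

Passing `dtype=int` yields integer indicators directly, which avoids a separate `astype(int)` step on the dummy columns.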

### Preprocessing functions

In [None]:
# Drop the row where Gender = Other
indexNames = df[df['gender'] == 'Other'].index
indexNames


In [None]:
df = df.drop(index = indexNames)

In [None]:
# Restore the correct column types

df[['age', 'avg_glucose_level','bmi']] = df[['age', 'avg_glucose_level','bmi']].astype(float)
df[['gender','ever_married','Urban_residence','est malade' ]] = df[['gender','ever_married','Urban_residence','est malade' ]].astype(int)

In [None]:
df[['work_type_Govt_job', 'work_type_Never_worked','work_type_Private','work_type_Self-employed',
   'smoking_status_formerly smoked','smoking_status_smokes']] = df[['work_type_Govt_job', 'work_type_Never_worked','work_type_Private','work_type_Self-employed',
   'smoking_status_formerly smoked','smoking_status_smokes']].astype(int)

In [None]:
df.dtypes

In [None]:
def imputation(df): 
    return df.dropna(axis =0)

In [None]:
def feature_engineering(df):
    # Flag patients with at least one of hypertension / heart disease
    df['est malade'] = np.sum(df[['hypertension', 'heart_disease']] == 1, axis=1) >= 1
    df = df.drop(['hypertension', 'heart_disease'], axis=1)

    # Flag patients who smoke or used to smoke
    df['a fumé'] = np.sum(df[['smoking_status_formerly smoked', 'smoking_status_smokes']] == 1, axis=1) >= 1
    df = df.drop(['smoking_status_formerly smoked', 'smoking_status_smokes'], axis=1)

    # 'vieux' (old) is 1 for patients aged 55 or more, 0 otherwise
    df['vieux'] = (df['age'] >= 55).astype(int)

    return df
    

In [None]:
def preprocessing (df) : 
    
    df = imputation(df)
    df = feature_engineering(df)
    
    X = df.drop('stroke', axis = 1)
    y = df['stroke']
    
    print(y.value_counts(normalize = True))
    
    return X,y

In [None]:
trainset, testset = train_test_split(df, test_size = 0.2, random_state = 0)

In [None]:
X_train, y_train = preprocessing(trainset)

In [None]:
X_test, y_test = preprocessing(testset)
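
With so few positive cases, a purely random split can leave the train and test sets with slightly different stroke rates. One option, shown here as a sketch on made-up data rather than as the notebook's method, is the `stratify` argument of `train_test_split`, which preserves the class ratio in both parts.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame: 10 positives out of 100 rows.
toy = pd.DataFrame({"age": np.arange(100), "stroke": [1] * 10 + [0] * 90})

# stratify keeps the 10% positive rate in both the train and the test part
train, test = train_test_split(
    toy, test_size=0.2, random_state=0, stratify=toy["stroke"]
)
print(train["stroke"].mean(), test["stroke"].mean())
```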

### Oversampling

In [None]:
from imblearn.over_sampling import SMOTE 

# random_state makes the synthetic samples reproducible
smote = SMOTE(sampling_strategy = 0.1, random_state = 0)
X_train, y_train = smote.fit_resample(X_train, y_train)

In [None]:
from imblearn.under_sampling import RandomUnderSampler 

# random_state makes the undersampled subset reproducible
rUs = RandomUnderSampler(sampling_strategy=0.9, random_state = 0)
X_train, y_train = rUs.fit_resample(X_train, y_train)


In [None]:
y_train.value_counts(normalize = True)
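
To see what the two `sampling_strategy` values imply, the arithmetic can be sketched without imblearn. The assumption here is that a float `sampling_strategy` means the desired minority / majority ratio after resampling, as described in the imbalanced-learn documentation; the function name and the class counts are hypothetical.

```python
def resampled_counts(n_majority, n_minority, over_ratio=0.1, under_ratio=0.9):
    """Approximate class counts after oversampling, then undersampling."""
    # Oversampling: the minority grows until minority / majority reaches over_ratio.
    # max() only keeps this sketch well-defined if the ratio is already exceeded.
    n_minority = max(n_minority, int(n_majority * over_ratio))
    # Undersampling: the majority shrinks until minority / majority reaches under_ratio.
    n_majority = min(n_majority, int(n_minority / under_ratio))
    return n_majority, n_minority

# Hypothetical training set: 4000 negatives, 200 positives
n_maj, n_min = resampled_counts(4000, 200)
print(n_maj, n_min, round(n_min / (n_maj + n_min), 3))
```

Under these assumptions the final training set is close to balanced, but much smaller than the original one, since most negatives are discarded.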

### Modelling

In [None]:
# Train a decision tree

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectKBest, f_classif, chi2
from sklearn.preprocessing import PolynomialFeatures
from sklearn.decomposition import PCA

model = make_pipeline(PolynomialFeatures(2),SelectKBest(score_func=chi2, k=12),
                      DecisionTreeClassifier(random_state = 0))

#model = make_pipeline(PolynomialFeatures(2),PCA (n_components = 3),
                      #DecisionTreeClassifier(random_state = 0))



#model = DecisionTreeClassifier(random_state = 0)

### Evaluation

In [None]:
from sklearn.metrics import f1_score,  confusion_matrix, classification_report
from sklearn.model_selection import learning_curve 

def evaluation(model):
    model.fit(X_train,y_train)
    y_pred = model.predict (X_test)
    
    print(confusion_matrix(y_test,y_pred))
    print(classification_report(y_test, y_pred))
    
    N, train_score, val_score = learning_curve(model, X_train,y_train, 
                                              cv = 4, scoring = 'f1',
                                               train_sizes = np.linspace(0.1,1,10))
    
    plt.figure(figsize = (12,8))
    plt.plot(N,train_score.mean(axis = 1), label = 'train score')
    plt.plot(N,val_score.mean(axis = 1), label = 'validation score')
    plt.legend()

In [None]:
evaluation(model)

In [None]:
#pd.DataFrame(model.feature_importances_, index = X_train.columns).plot.bar()
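
The `'f1'` score used in the learning curve combines precision and recall for the positive class. As a small illustration on made-up labels (not results from this notebook), the same quantities can be derived by hand from the confusion matrix:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Made-up labels and predictions
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)   # share of predicted strokes that are real
recall = tp / (tp + fn)      # share of real strokes that are detected
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
```

For a screening-style problem, recall on the positive class is often the quantity to watch, which is consistent with the `scoring = "recall"` choice in the optimisation section.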

# 3- MODELING

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectKBest, f_classif, chi2
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.decomposition import PCA

In [None]:
preprocessor = make_pipeline(PolynomialFeatures(2, include_bias = False), SelectKBest(chi2, k=12))

In [None]:
RandomForest = make_pipeline(preprocessor, RandomForestClassifier(random_state = 0))

AdaBoost = make_pipeline(preprocessor, AdaBoostClassifier(random_state = 0))

SVM = make_pipeline(preprocessor, StandardScaler(), SVC(random_state = 0))

KNN = make_pipeline(preprocessor, StandardScaler(), KNeighborsClassifier())

In [None]:
list_of_models = [RandomForest,AdaBoost, SVM, KNN ]
dict_of_models = {'RandomForest' :RandomForest ,
                 'Adaboost' : AdaBoost ,
                 'SVM' :SVM ,
                 'KNN':KNN }

In [None]:
for name,model in dict_of_models.items() : 
    print(name)
    evaluation(model)

## OPTIMISATION

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
hyper_params = {'svc__gamma' : [1e-3, 1e-4],
                'svc__C' : [1,10,100,1000]}

In [None]:

SVM = make_pipeline(preprocessor, StandardScaler(), SVC(random_state = 0))

grid = GridSearchCV(SVM,hyper_params, scoring = "recall", cv =4)

grid.fit(X_train,y_train)

print(grid.best_params_)

In [None]:
y_pred = grid.predict(X_test)
print(classification_report (y_test,y_pred))

In [None]:
evaluation(grid.best_estimator_)

In [None]:
hyper_params = {'svc__gamma' : [1e-3, 1e-4],
                'svc__C' : [1,10,100,1000],
               'svc__kernel':['rbf', 'linear', 'poly', 'sigmoid'],
               'pipeline__polynomialfeatures__degree' : [2,3],
               'pipeline__selectkbest__k' : range(40,80)}

In [None]:
from sklearn.model_selection import RandomizedSearchCV

SVM = make_pipeline(preprocessor, StandardScaler(), SVC(random_state = 0))

grid = RandomizedSearchCV(SVM,hyper_params, scoring = "recall", cv =4, n_iter = 40)

grid.fit(X_train,y_train)

print(grid.best_params_)

y_pred = grid.predict(X_test)
print(classification_report (y_test,y_pred))

evaluation(grid.best_estimator_)

In [None]:
SVM = make_pipeline(preprocessor, StandardScaler(), SVC(random_state = 0))

evaluation(SVM)
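
The keys used in the hyperparameter grids, such as `svc__C` or `pipeline__selectkbest__k`, come from the step names that `make_pipeline` generates automatically (the lowercased class name of each step). A quick way to check them, sketched here on a pipeline built like the one above (the lowercase variable `svm` is only used for this illustration), is `get_params()`:

```python
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import SVC

preprocessor = make_pipeline(PolynomialFeatures(2, include_bias=False),
                             SelectKBest(chi2, k=12))
svm = make_pipeline(preprocessor, StandardScaler(), SVC(random_state=0))

# Step names generated by make_pipeline; the nested pipeline is named 'pipeline'
step_names = [name for name, _ in svm.steps]
print(step_names)

# Every key of get_params() is a valid entry for a search grid
params = svm.get_params()
tunable = sorted(p for p in params
                 if p.startswith(("svc__", "pipeline__selectkbest__")))
print(tunable)
```

A misspelled key in the grid makes the search raise an error at fit time, so listing the names first can save a failed run.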

## PRECISION RECALL CURVE

In [None]:
from sklearn.metrics import precision_recall_curve

In [None]:
precision, recall, threshold = precision_recall_curve(y_test, SVM.decision_function(X_test))

In [None]:
plt.plot(threshold, precision[:-1], label = 'precision')
plt.plot(threshold, recall[:-1], label = 'recall')
plt.legend()

In [None]:
def model_final(model,X, threshold = 0.8) : 
    return model.decision_function(X) > threshold

In [None]:
y_pred = model_final(SVM, X_test,threshold = 0.1)

print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test, y_pred))
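
Instead of reading the threshold off the plot, it can be picked programmatically from the arrays returned by `precision_recall_curve`. The sketch below uses made-up scores and a hypothetical helper `threshold_for_recall`; it selects the largest threshold that still reaches a target recall. Note that `precision_recall_curve` counts a sample as positive when its score is greater than or equal to the threshold.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Made-up labels and decision_function scores
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1])
scores = np.array([-2.0, -1.5, -0.5, -0.2, 0.1, 0.4, 0.9, 1.5])

precision, recall, thresholds = precision_recall_curve(y_true, scores)

def threshold_for_recall(recall, thresholds, target_recall=0.75):
    """Largest threshold whose recall is still at least target_recall."""
    # recall[:-1] lines up with thresholds; recall never increases as the threshold grows
    ok = np.where(recall[:-1] >= target_recall)[0]
    return thresholds[ok[-1]]

chosen = threshold_for_recall(recall, thresholds, target_recall=0.75)
y_hat = scores >= chosen   # same >= convention as precision_recall_curve
print(chosen, y_hat.sum())
```

Choosing the threshold on the test set, as done here for illustration, gives an optimistic estimate; a separate validation set would be a more careful choice.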