
# Diabetes Detection Models

This [dataset](https://raw.githubusercontent.com/mansont/datasets-tests/main/diabetese.csv) contains patient data and each patient's diabetes status: "1" means the patient has diabetes, "0" means they do not.


Build the following models and compare their performance:
* A logistic regression model
* A single-layer perceptron model
* A multilayer perceptron

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Preprocessing and evaluation tools
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Models
from sklearn.linear_model import Perceptron, LogisticRegression
from sklearn.neural_network import MLPClassifier

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/mansont/datasets-tests/main/diabetese.csv")

In [None]:
df.head(2)

   pregnancies  glucose  diastolic  triceps  insulin   bmi    dpf  age  diabetes
0            6      148         72       35        0  33.6  0.627   50         1
1            1       85         66       29        0  26.6  0.351   31         0


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   pregnancies  768 non-null    int64  
 1   glucose      768 non-null    int64  
 2   diastolic    768 non-null    int64  
 3   triceps      768 non-null    int64  
 4   insulin      768 non-null    int64  
 5   bmi          768 non-null    float64
 6   dpf          768 non-null    float64
 7   age          768 non-null    int64  
 8   diabetes     768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


Preprocessing:

In [None]:
X = df.loc[:, ['pregnancies', 'glucose', 'diastolic', 'triceps', 'insulin', 'bmi', 'dpf', 'age']]
y = df['diabetes']

In [None]:
sc = StandardScaler()
X_std = sc.fit_transform(X)
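
The cell above fits the scaler on all 768 rows before any cross-validation, so each validation fold contributes to the mean and variance used for scaling. A minimal sketch of one way to avoid this, assuming a scikit-learn pipeline is acceptable here: the scaler and the model are wrapped together so that scaling is refit inside each training fold. The data below is a synthetic stand-in generated with `make_classification` (an assumption made to keep the example self-contained); in this notebook one would pass `X` and `y` instead. The names `X_demo`, `y_demo`, `pipeline` and `pipeline_scores` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's X and y (assumption: 768 rows, 8 features).
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=0)

# The scaler is refit on the training part of each fold, so the held-out
# fold never contributes to the scaling statistics.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

pipeline_scores = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring="accuracy")
print(f"Cross validation score: {pipeline_scores}")
print(f"Mean score: {np.mean(pipeline_scores):.2%}")
```

With only a standard scaler the effect on the scores is usually small, but the pipeline form keeps the evaluation honest if more preprocessing steps are added later.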

Logistic regression:

In [None]:
logistic_regression = LogisticRegression(max_iter=1000)

In [None]:
from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy of the logistic regression
results = cross_val_score(logistic_regression, X_std, y, cv=5, scoring="accuracy")
print(f"Cross validation score: {results}")
print(f"Mean score: {np.mean(results):.2%}")

Cross validation score: [0.77272727 0.74675325 0.75324675 0.81699346 0.76470588]
Mean score: 77.09%


Single-layer perceptron (implemented here as an MLP with one hidden layer of 100 units):

In [None]:
# One hidden layer of 100 units
single_layer_perceptron = MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000)

results = cross_val_score(single_layer_perceptron, X_std, y, cv=5, scoring="accuracy")
print(f"Cross validation score: {results}")
print(f"Mean score: {np.mean(results):.2%}")

Cross validation score: [0.76623377 0.74675325 0.76623377 0.81045752 0.77124183]
Mean score: 77.22%
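
Note that `MLPClassifier(hidden_layer_sizes=(100,))` has one hidden layer, so it is a small multilayer network. A perceptron in the strict single-layer sense, with no hidden layer, is available as `sklearn.linear_model.Perceptron`, which the notebook imports but does not use. The following is a hedged sketch of how it could be evaluated the same way. It runs on synthetic stand-in data from `make_classification` (an assumption), so it says nothing about the score on the diabetes data; `X_demo`, `y_demo`, `linear_perceptron` and `perceptron_scores` are hypothetical names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's X and y (assumption: 768 rows, 8 features).
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=0)

# Perceptron is a linear classifier with no hidden layer.
# random_state fixes the shuffling so reruns give the same scores.
linear_perceptron = make_pipeline(
    StandardScaler(),
    Perceptron(max_iter=1000, random_state=0),
)

perceptron_scores = cross_val_score(
    linear_perceptron, X_demo, y_demo, cv=5, scoring="accuracy"
)
print(f"Cross validation score: {perceptron_scores}")
print(f"Mean score: {np.mean(perceptron_scores):.2%}")
```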


Multilayer perceptron:

In [None]:
# Three hidden layers of 100, 50 and 25 units
multi_layer_perceptron = MLPClassifier(hidden_layer_sizes=(100, 50, 25), max_iter=1000)

results = cross_val_score(multi_layer_perceptron, X_std, y, cv=5, scoring="accuracy")
print(f"Cross validation score: {results}")
print(f"Mean score: {np.mean(results):.2%}")

Cross validation score: [0.72077922 0.65584416 0.73376623 0.79738562 0.73856209]
Mean score: 72.93%


### Is there a notable difference in the MLP performance when a ReLU, Sigmoid or SoftMax activation function is used?


In [None]:
# "logistic" is the sigmoid. Softmax is not an available hidden-layer
# activation in MLPClassifier; tanh is tested here alongside the other two.
for activation in ["logistic", "tanh", "relu"]:
    multi_layer_perceptron = MLPClassifier(hidden_layer_sizes=(100, 50, 25), max_iter=1000, activation=activation)

    results = cross_val_score(multi_layer_perceptron, X_std, y, cv=5, scoring="accuracy")
    print(f"Activation function: {activation}")
    print(f"Cross validation score: {results}")
    print(f"Mean score: {np.mean(results):.2%}")
    print('-'*30, '\n')

Activation function: logistic
Cross validation score: [0.75324675 0.74675325 0.74025974 0.83006536 0.77124183]
Mean score: 76.83%
------------------------------ 

Activation function: tanh
Cross validation score: [0.66883117 0.69480519 0.7012987  0.70588235 0.74509804]
Mean score: 70.32%
------------------------------ 

Activation function: relu
Cross validation score: [0.72727273 0.67532468 0.76623377 0.74509804 0.7254902 ]
Mean score: 72.79%
------------------------------ 
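
Whether these differences are notable depends on how much the scores vary from fold to fold. As a rough check, the sketch below recomputes the mean and the sample standard deviation of the fold accuracies printed above; the numbers are copied from that output, and `fold_scores` and `summary` are hypothetical names. No `random_state` is set on the models in this notebook, so reruns also shift the means: the same (100, 50, 25) ReLU network scores 72.93%, 72.79% and 73.32% in three separate cells.

```python
import numpy as np

# Fold accuracies copied from the output printed above.
fold_scores = {
    "logistic": [0.75324675, 0.74675325, 0.74025974, 0.83006536, 0.77124183],
    "tanh": [0.66883117, 0.69480519, 0.7012987, 0.70588235, 0.74509804],
    "relu": [0.72727273, 0.67532468, 0.76623377, 0.74509804, 0.7254902],
}

summary = {}
for activation, scores in fold_scores.items():
    mean = np.mean(scores)
    # ddof=1 gives the sample standard deviation across the 5 folds.
    spread = np.std(scores, ddof=1)
    summary[activation] = (mean, spread)
    print(f"{activation:>8}: mean {mean:.2%}, fold std {spread:.2%}")
```

With only five folds this is an informal indication, not a significance test: a gap between two means that is comparable to the fold spread is weak evidence of a real difference.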



### Does the network performance change when the density (number of neurons) of the hidden layers changes?

In [None]:
# Each tuple gives the number of neurons in the three hidden layers
for layer_config in [(10, 5, 2), (50, 25, 12), (100, 50, 25), (500, 250, 125)]:
    multi_layer_perceptron = MLPClassifier(hidden_layer_sizes=layer_config, max_iter=1000, activation="relu")

    results = cross_val_score(multi_layer_perceptron, X_std, y, cv=5, scoring="accuracy")
    print(f"Hidden layer sizes: {layer_config}")
    print(f"Cross validation score: {results}")
    print(f"Mean score: {np.mean(results):.2%}")
    print('-'*30, '\n')

Hidden layer sizes: (10, 5, 2)
Cross validation score: [0.75324675 0.74025974 0.69480519 0.82352941 0.79084967]
Mean score: 76.05%
------------------------------ 

Hidden layer sizes: (50, 25, 12)
Cross validation score: [0.74025974 0.70779221 0.74675325 0.73856209 0.75163399]
Mean score: 73.70%
------------------------------ 

Hidden layer sizes: (100, 50, 25)
Cross validation score: [0.71428571 0.7012987  0.71428571 0.7254902  0.81045752]
Mean score: 73.32%
------------------------------ 

Hidden layer sizes: (500, 250, 125)
Cross validation score: [0.72727273 0.7012987  0.72077922 0.73202614 0.77777778]
Mean score: 73.18%
------------------------------ 



The classes may be roughly linearly separable, which would explain why a simple model that is less prone to overfitting, such as logistic regression, performs best. The dataset is also small (768 rows).
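
One way to probe the overfitting explanation is to compare training and validation accuracy for each model: a large gap between the two suggests the model is fitting noise in the training folds. The sketch below does this with `cross_validate(..., return_train_score=True)`. It uses synthetic stand-in data from `make_classification` (an assumption), so the gaps it prints are illustrative only and are not results for the diabetes data; `X_demo`, `y_demo`, `candidates` and `gaps` are hypothetical names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's X and y (assumption: 768 rows, 8 features).
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "MLP (100, 50, 25)": MLPClassifier(
        hidden_layer_sizes=(100, 50, 25), max_iter=1000, random_state=0
    ),
}

gaps = {}
for name, model in candidates.items():
    cv_result = cross_validate(
        make_pipeline(StandardScaler(), model),
        X_demo,
        y_demo,
        cv=5,
        scoring="accuracy",
        return_train_score=True,  # also record accuracy on the training folds
    )
    train_mean = np.mean(cv_result["train_score"])
    test_mean = np.mean(cv_result["test_score"])
    gaps[name] = train_mean - test_mean
    print(f"{name}: train {train_mean:.2%}, validation {test_mean:.2%}, gap {gaps[name]:.2%}")
```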