<a href="https://colab.research.google.com/github/Gladybams/Wine_Pycaret/blob/main/WineProject.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Project: build a wine classifier with PyCaret, then use Streamlit to create and deploy the web application

In [2]:
import pandas as pd
import numpy as np

In [3]:
data = pd.read_csv("winequality-red.csv")

In [4]:
data.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


As the dataset shows, we have several features such as fixed acidity, citric acid, pH, and so on. The goal of our classifier is to predict whether a wine's quality is good or bad. However, the quality column holds numeric scores rather than the two labels we need, so we have to convert its values to 'Good' or 'Bad'.

To do so, we need a rule: if the quality score is 6 or higher, the wine is good; otherwise it is bad.

In [5]:
# Relabel the numeric score: 6 or higher becomes 'Good', anything lower 'Bad'
data['quality'] = np.where(data['quality'] >= 6, 'Good', 'Bad')
data.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,Bad
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,Bad
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,Bad
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,Good
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,Bad


Looking at the quality column again, the labels are now what we expect.
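One way to confirm the relabelling is to count the labels and cross-check them against the original scores. The sketch below uses a small hypothetical frame (`toy`) rather than the wine data, so the counts are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical scores standing in for the original `quality` column
toy = pd.DataFrame({"quality": [5, 5, 5, 6, 5, 7, 4, 6]})

# Same rule as in the notebook: 6 or higher is 'Good', anything lower is 'Bad'
toy["label"] = np.where(toy["quality"] >= 6, "Good", "Bad")

# Count each label, then tabulate score against label to spot any
# score that landed on the wrong side of the threshold
counts = toy["label"].value_counts()
check = pd.crosstab(toy["quality"], toy["label"])

print(counts)
print(check)
```

On the real data, the same two calls would also show how balanced the two classes are before modelling.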

The dataset needs little cleaning: it has no missing values and the data types are all correct. It is not strictly free of duplicates, though, since rows 0 and 4 in the preview above are identical.
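The checks behind a statement like this can be run directly with pandas. A minimal sketch on a hypothetical three-row frame (`toy`), not the wine data:

```python
import pandas as pd

# Hypothetical frame standing in for the wine data; rows 0 and 2 are identical
toy = pd.DataFrame({
    "pH": [3.51, 3.20, 3.51],
    "alcohol": [9.4, 9.8, 9.4],
    "quality": ["Bad", "Bad", "Bad"],
})

missing_per_column = toy.isna().sum()        # missing values in each column
n_duplicates = int(toy.duplicated().sum())   # rows that repeat an earlier row
column_types = toy.dtypes                    # dtype of each column

print(missing_per_column, n_duplicates, column_types, sep="\n")
```

Running the same three calls on `data` would report the actual number of missing values, repeated rows, and column types.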

## PyCaret

PyCaret is a low-code ML library that automates much of the ML workflow.

With PyCaret, we can build an ML model for classification, regression, or clustering problems in a few lines of code.


In [6]:
pip install pycaret

Collecting pycaret
  Downloading https://files.pythonhosted.org/packages/30/4b/c2b856b18c0553238908f34d53e6c211f3cc4bfa13a8e8d522567a00b3d7/pycaret-2.3.0-py3-none-any.whl (261kB)
Collecting imbalanced-learn>=0.7.0
  Downloading https://files.pythonhosted.org/packages/80/98/dc784205a7e3034e84d41ac4781660c67ad6327f2f5a80c568df31673d1c/imbalanced_learn-0.8.0-py3-none-any.whl (206kB)
Collecting pandas-profiling>=2.8.0
  Downloading https://files.pythonhosted.org/packages/dd/12/e2870750c5320116efe7bebd4ae1709cd7e35e3bc23ac8039864b05b9497/pandas_profiling-2.11.0-py2.py3-none-any.whl (243kB)
Collecting pyLDAvis
  Downloading https://files.pythonhosted.org/packages/03/a5/15a0da6b0150b8b68610cc78af80364a80a9a4c8b6dd5ee549b8989d4b60/pyLDAvis-3.3.1.tar.gz (1.7MB)

In [7]:
# Set up the PyCaret environment with default settings
from pycaret.classification import *
exp_clf01 = setup(data = data, target = 'quality', session_id = 123)

Unnamed: 0,Description,Value
0,session_id,123
1,Target,quality
2,Target Type,Binary
3,Label Encoded,"Bad: 0, Good: 1"
4,Original Data,"(1599, 12)"
5,Missing Values,False
6,Numeric Features,11
7,Categorical Features,0
8,Ordinal Features,False
9,High Cardinality Features,False


In [8]:
best = compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
rf,Random Forest Classifier,0.8222,0.8973,0.8384,0.8357,0.8364,0.6416,0.6429,0.568
et,Extra Trees Classifier,0.8159,0.9044,0.8319,0.8302,0.8306,0.629,0.6299,0.507
lightgbm,Light Gradient Boosting Machine,0.8132,0.8849,0.8204,0.8346,0.8266,0.6242,0.6257,0.125
gbc,Gradient Boosting Classifier,0.7855,0.8593,0.799,0.8071,0.8018,0.5682,0.5703,0.221
ridge,Ridge Classifier,0.7569,0.0,0.7497,0.791,0.7688,0.5131,0.5151,0.023
lr,Logistic Regression,0.7507,0.8177,0.748,0.7825,0.7642,0.5,0.5015,0.437
lda,Linear Discriminant Analysis,0.7489,0.8173,0.7513,0.7779,0.7635,0.496,0.4974,0.02
dt,Decision Tree Classifier,0.7444,0.7411,0.7809,0.7568,0.7684,0.4835,0.4841,0.024
nb,Naive Bayes,0.7418,0.8043,0.7646,0.7615,0.7621,0.4798,0.4811,0.02
ada,Ada Boost Classifier,0.7363,0.8126,0.7645,0.7548,0.7578,0.4684,0.4711,0.128


In [9]:
# Tuned configuration: normalize and transform the features

exp_clf102 = setup(data = data, target = 'quality', session_id=123, normalize = True, transformation = True)

Unnamed: 0,Description,Value
0,session_id,123
1,Target,quality
2,Target Type,Binary
3,Label Encoded,"Bad: 0, Good: 1"
4,Original Data,"(1599, 12)"
5,Missing Values,False
6,Numeric Features,11
7,Categorical Features,0
8,Ordinal Features,False
9,High Cardinality Features,False


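normalize: rescales the numeric features to a common scale. With PyCaret's default method ('zscore'), each feature is centered on 0 with a standard deviation of 1 rather than squeezed into a fixed range.

transformation: applies a power transform (Yeo-Johnson by default) so that each feature's distribution is closer to normal.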
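To get a feel for what these two options do to a single column, here is a rough sketch using scikit-learn's `StandardScaler` and `PowerTransformer` as stand-ins. It assumes the default methods ('zscore' and 'yeo-johnson'), runs on a hypothetical skewed feature rather than the wine data, and is not PyCaret's internal pipeline.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer, StandardScaler

# Hypothetical right-skewed feature (log-normal), standing in for a wine column
rng = np.random.default_rng(123)
x = rng.lognormal(mean=0.0, sigma=1.0, size=(500, 1))

# Roughly what normalize=True does with the z-score method:
# subtract the mean and divide by the standard deviation
x_scaled = StandardScaler().fit_transform(x)

# Roughly what transformation=True does with the Yeo-Johnson method:
# a power transform that pulls in the long tail (output is also standardized)
x_transformed = PowerTransformer(method="yeo-johnson").fit_transform(x)


def skewness(values):
    """Sample skewness: mean of the cubed z-scores."""
    z = (values - values.mean()) / values.std()
    return float((z ** 3).mean())


print("skew after scaling:     ", round(skewness(x_scaled), 2))
print("skew after transforming:", round(skewness(x_transformed), 2))
```

Scaling alone changes the units but leaves the shape of the distribution untouched, which is why the skewness only drops after the power transform.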


In [10]:
best = compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
rf,Random Forest Classifier,0.8222,0.8976,0.8351,0.838,0.8359,0.6418,0.643,0.569
et,Extra Trees Classifier,0.8222,0.9028,0.8434,0.8324,0.8373,0.6413,0.6425,0.513
lightgbm,Light Gradient Boosting Machine,0.8141,0.8835,0.8237,0.8337,0.8275,0.626,0.6277,0.088
gbc,Gradient Boosting Classifier,0.7873,0.8596,0.7991,0.8095,0.803,0.572,0.5741,0.218
lr,Logistic Regression,0.7525,0.8201,0.7727,0.7719,0.7711,0.5015,0.5032,0.023
qda,Quadratic Discriminant Analysis,0.7507,0.8123,0.776,0.7679,0.7711,0.4972,0.4985,0.019
ridge,Ridge Classifier,0.7498,0.0,0.7595,0.775,0.7659,0.4972,0.499,0.019
lda,Linear Discriminant Analysis,0.7498,0.8215,0.7595,0.775,0.7659,0.4972,0.499,0.021
dt,Decision Tree Classifier,0.7444,0.7413,0.7793,0.7578,0.768,0.4837,0.4844,0.024
nb,Naive Bayes,0.7373,0.8128,0.7119,0.7854,0.7461,0.4753,0.4787,0.019


In [11]:
rf_model = create_model('rf')

Unnamed: 0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.8393,0.9176,0.8197,0.8772,0.8475,0.6781,0.6798
1,0.8393,0.9166,0.8033,0.8909,0.8448,0.6791,0.683
2,0.8393,0.9156,0.8525,0.8525,0.8525,0.676,0.676
3,0.8393,0.9087,0.8689,0.8413,0.8548,0.6749,0.6754
4,0.8304,0.8901,0.8525,0.8387,0.8455,0.6574,0.6575
5,0.8214,0.9074,0.8525,0.8254,0.8387,0.6388,0.6392
6,0.8304,0.9002,0.8689,0.8281,0.848,0.6563,0.6573
7,0.7857,0.8663,0.8167,0.7903,0.8033,0.5681,0.5685
8,0.7857,0.8628,0.8333,0.7812,0.8065,0.567,0.5685
9,0.8108,0.891,0.7833,0.8545,0.8174,0.6219,0.6244
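The ten rows above appear to be per-fold cross-validation scores. Averaging the Accuracy column gives the 0.8222 reported for Random Forest in the comparison table, which can be checked with a few lines of NumPy (values copied from the output above):

```python
import numpy as np

# Accuracy of the ten folds, copied from the create_model('rf') output
fold_accuracy = np.array([0.8393, 0.8393, 0.8393, 0.8393, 0.8304,
                          0.8214, 0.8304, 0.7857, 0.7857, 0.8108])

mean_accuracy = fold_accuracy.mean()
std_accuracy = fold_accuracy.std()  # population standard deviation (ddof=0)

print(f"mean = {mean_accuracy:.4f}, std = {std_accuracy:.4f}")
```

The spread across folds is a useful reminder that a single accuracy figure hides some fold-to-fold variation.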


In [12]:
evaluate_model(rf_model)

[Interactive widget: "Plot Type" selector for the evaluation plots, starting with "Hyperparameters"]

In [13]:
predict_model(rf_model)

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Random Forest Classifier,0.7812,0.8781,0.8347,0.7638,0.7977,0.5606,0.5632


Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,Label,Score
0,1.020588,-0.262337,0.879953,0.909526,-0.811347,-1.540060,-1.767543,-0.001569,0.179457,-0.026319,1.507669,Good,Good,0.81
1,0.803909,0.365825,0.069862,0.182406,0.169895,0.853208,1.990040,0.562903,-0.216541,-0.678448,-1.071227,Bad,Bad,0.95
2,-0.548073,1.883447,-0.893326,-0.877415,-0.243956,0.627022,0.590835,-0.012133,0.309834,-0.576033,-1.403862,Bad,Bad,0.83
3,0.107596,-1.862275,0.790671,-0.626787,-0.545447,-0.666586,-1.164981,-1.619911,-0.619242,-0.783892,1.558527,Good,Good,0.94
4,-0.094340,-0.848967,-0.143052,-0.877415,-0.079076,0.853208,0.377180,-0.235514,-0.150025,0.344294,-1.234888,Bad,Good,0.61
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
475,-0.548073,-0.140215,0.322091,-0.184964,-1.025574,1.330908,0.533354,0.282230,-0.017590,0.205814,-0.205689,Good,Good,0.77
476,1.829654,-0.021461,1.097945,0.007663,0.466399,-1.351648,-1.357751,1.774646,-0.754683,-2.001079,-0.760163,Bad,Bad,0.96
477,-0.629808,-0.781026,0.222941,-0.877415,-0.301560,1.330908,0.934478,0.056549,0.761128,-0.287161,-0.912956,Good,Good,0.75
478,0.936345,-1.341807,0.654237,-1.148794,-1.025574,-1.735214,-1.697007,-1.047378,-1.095523,-0.476666,1.112170,Good,Good,0.89


In [14]:
# model_only=True saves just the trained estimator, not the preprocessing pipeline
save_model(rf_model, model_name = 'random_forest_model', model_only=True)

Transformation Pipeline and Model Succesfully Saved


(RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                        criterion='gini', max_depth=None, max_features='auto',
                        max_leaf_nodes=None, max_samples=None,
                        min_impurity_decrease=0.0, min_impurity_split=None,
                        min_samples_leaf=1, min_samples_split=2,
                        min_weight_fraction_leaf=0.0, n_estimators=100,
                        n_jobs=-1, oob_score=False, random_state=123, verbose=0,
                        warm_start=False), 'random_forest_model.pkl')
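Because `setup` was called with `normalize = True` and `transformation = True`, the model was trained on transformed features (visible in the `predict_model` output above), so whatever loads the saved file has to feed it features prepared the same way. One common pattern is to persist the preprocessing and the estimator together. The sketch below illustrates that pattern with scikit-learn and joblib on synthetic data; the pipeline, the data, and the file name `toy_pipeline.pkl` are hypothetical, and this is not PyCaret's own persistence code.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 200 rows, 3 raw (unscaled) features, binary label
rng = np.random.default_rng(123)
X = rng.normal(loc=10.0, scale=3.0, size=(200, 3))
y = (X[:, 0] + X[:, 1] > 20.0).astype(int)

# Scaling and model bundled together, so raw inputs can be passed at predict time
pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=100, random_state=123),
)
pipeline.fit(X, y)

# Save to a temporary file, then load it back as a fresh object
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipeline.pkl")
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

# The restored pipeline should reproduce the original predictions
same_predictions = bool(np.array_equal(pipeline.predict(X), restored.predict(X)))
print(same_predictions)
```

A web app that loads a bundled object like `restored` can pass raw measurements straight to `predict`; with an estimator saved on its own, the app has to reapply the same scaling and transformation first.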