<b>Working methodology:</b>

To build a Machine Learning model, we follow the CRISP-DM methodology:
- Business understanding: identify the prediction task (classification, regression, forecasting, ...)
- Data understanding: check that the available data suits the prediction task
- Data preparation: transform the data so that it can be used to train the model
- Data split:
    - Training data (70%)
    - Test data (30%)
- Model training: estimate the model parameters from the training data => Result: trained model
- Model evaluation: compute the model's performance metrics on the test set
- Model deployment

In [158]:
import pandas as pd

<b>1. Business understanding</b>

The prediction task is to predict whether the sale of a product is profitable or not.

This is a classification task where:
- The input (x) is a sale characterized by:
    - product name
    - category
    - puht (unit price excluding tax)
    - quantity sold
    - date
- The output/target:
    - a profitability class that takes the value 0 (No) or 1 (Yes)
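
The labeled dataset loaded later in this notebook stores the class as the text labels `Non`/`Oui` rather than as 0/1. A minimal sketch of the conversion, assuming the numeric form is wanted; the sample values and the names `classe` and `rentable` are illustrative only:

```python
import pandas as pd

# Illustrative labels only, not taken from the real dataset
classe = pd.Series(["Non", "Oui", "Oui"], name="Classe")

# Map the text labels to the 0/1 encoding described above
rentable = classe.map({"Non": 0, "Oui": 1})
print(rentable.tolist())
```

scikit-learn classifiers accept string labels directly, so this conversion is optional for training.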

<b>2. Data understanding</b>

Normally, the data must be <b>labeled</b> before a classifier can be trained on it.

<b>Data description:</b>
- A classification model is trained by a supervised algorithm.
- A supervised algorithm needs a labeled dataset.
- A labeled dataset is a data table in which:
    - The rows are the individuals (to be classified)
    - All columns except the last are the measurable characteristics (features)
    - The last column holds the label. It is not measured but is usually prepared manually by a domain expert. It is called the target.
- In our case, the dataset is interpreted as follows:
    - The individuals are the sales
    - The features are the columns categorie, nom (product name), puht, quantite_vendue (quantity sold)
    - The column representing the target is missing. For classification, the target is the class.

<b>Solution:</b>

- The missing target column is normally prepared by a domain expert who can judge from experience whether a sale is profitable or not.
- We join the 3 tables into a single table, which represents our dataset to be labeled and which we call <b>ventes</b> (sales).
- We save it to a file that can be opened in Excel.
- The expert can use Excel to run calculations on the data and deduce the label.

Creating the dataset for labeling

In [188]:
# Load the three source tables
df_produit = pd.read_csv('data/produit.csv')
df_temps = pd.read_csv('data/temps.csv')
df_vente = pd.read_csv('data/vente.csv')
# Join the sales with the time and product tables on their keys
df_main = pd.merge(df_vente, df_temps, on='id_temps')
df_main = pd.merge(df_main, df_produit, on='id_produit')
# Save the joined table without the index column
df_main.to_csv('data/main.csv', index=False)

In [189]:
df_main.head()

Unnamed: 0,id_produit,id_temps,puht,quantite_vendue,date,categorie,nom
0,1,1,1000,15,2023-01-01,smartphone,iPhone
1,1,2,1100,14,2023-02-01,smartphone,iPhone
2,1,3,1200,16,2023-03-01,smartphone,iPhone
3,1,4,1250,15,2023-04-01,smartphone,iPhone
4,1,5,1300,16,2023-05-01,smartphone,iPhone
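
A join on a duplicated key can silently multiply rows. The sketch below, on tiny invented tables, suggests one way to guard against that: the `validate` argument of `pd.merge` raises an error when the lookup table's keys are not unique. The table contents are assumptions for illustration, not the real files.

```python
import pandas as pd

# Tiny invented tables mirroring the three source files
df_vente = pd.DataFrame({"id_produit": [1, 1, 2], "id_temps": [1, 2, 1],
                         "puht": [1000, 1100, 650], "quantite_vendue": [15, 14, 22]})
df_temps = pd.DataFrame({"id_temps": [1, 2], "date": ["2023-01-01", "2023-02-01"]})
df_produit = pd.DataFrame({"id_produit": [1, 2],
                           "categorie": ["smartphone", "smartphone"],
                           "nom": ["iPhone", "Oppo"]})

# many_to_one: several sales may share a key, but the lookup table must have
# unique keys; pandas raises MergeError otherwise
df_main = pd.merge(df_vente, df_temps, on="id_temps", validate="many_to_one")
df_main = pd.merge(df_main, df_produit, on="id_produit", validate="many_to_one")

# With complete, unique keys the join keeps exactly one row per sale
print(len(df_main) == len(df_vente))
print(df_main)
```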


<b>3. Data understanding & preparation</b>

Before loading the dataset from the Excel file, install the <b>openpyxl</b> library.
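
pandas delegates the reading of `.xlsx` files to openpyxl, so `pd.read_excel` fails with an import error when the package is missing. A small sketch of a check that can be run first; the variable name `openpyxl_installed` is illustrative:

```python
import importlib.util

# find_spec returns None when the package cannot be imported
openpyxl_installed = importlib.util.find_spec("openpyxl") is not None
if not openpyxl_installed:
    print("openpyxl is missing; install it, e.g. with: pip install openpyxl")
```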

In [293]:
df_main = pd.read_excel("data/data_jour4.xlsx", sheet_name='data_etiquetée')
df_main.head()

Unnamed: 0,id_produit,id_temps,puht,quantite_vendue,date,categorie,nom,Montant,Classe
0,1,1,1000,15,2023-01-01,smartphone,iPhone,15000,Non
1,1,2,1100,14,2023-02-01,smartphone,iPhone,15400,Oui
2,1,3,1200,16,2023-03-01,smartphone,iPhone,19200,Oui
3,1,4,1250,15,2023-04-01,smartphone,iPhone,18750,Oui
4,1,5,1300,16,2023-05-01,smartphone,iPhone,20800,Oui


Encoding the categorical columns

In [294]:
from sklearn.preprocessing import OneHotEncoder
# Fit a one-hot encoder on the categorie column
ohe_categorie = OneHotEncoder()
ohe_categorie.fit(df_main[['categorie']])
# Build one 0/1 column per category, named categorie_<value>
df_categorie_encoded = pd.DataFrame(ohe_categorie.transform(df_main[['categorie']]).toarray(), columns='categorie_' + ohe_categorie.categories_[0])
# Append the encoded columns and drop the original text column
df_main = pd.concat([df_main, df_categorie_encoded], axis=1)
df_main.drop('categorie', axis=1, errors='ignore', inplace=True)

In [296]:
df_main.head()

Unnamed: 0,id_produit,id_temps,puht,quantite_vendue,date,nom,Montant,Classe,categorie_PC portable,categorie_electromenager,categorie_smartphone
0,1,1,1000,15,2023-01-01,iPhone,15000,Non,0.0,0.0,1.0
1,1,2,1100,14,2023-02-01,iPhone,15400,Oui,0.0,0.0,1.0
2,1,3,1200,16,2023-03-01,iPhone,19200,Oui,0.0,0.0,1.0
3,1,4,1250,15,2023-04-01,iPhone,18750,Oui,0.0,0.0,1.0
4,1,5,1300,16,2023-05-01,iPhone,20800,Oui,0.0,0.0,1.0


In [297]:
# Same one-hot encoding for the nom (product name) column
ohe_nom = OneHotEncoder()
ohe_nom.fit(df_main[['nom']])
df_nom_encoded = pd.DataFrame(ohe_nom.transform(df_main[['nom']]).toarray(), columns='nom_' + ohe_nom.categories_[0])
df_main = pd.concat([df_main, df_nom_encoded], axis=1)
df_main.drop('nom', axis=1, errors='ignore', inplace=True)

In [298]:
df_main.head()

Unnamed: 0,id_produit,id_temps,puht,quantite_vendue,date,Montant,Classe,categorie_PC portable,categorie_electromenager,categorie_smartphone,nom_Asus,nom_Decodeur,nom_Dell,nom_HP,nom_Oppo,nom_Samsung,nom_TV,nom_iPhone
0,1,1,1000,15,2023-01-01,15000,Non,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,1,2,1100,14,2023-02-01,15400,Oui,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,1,3,1200,16,2023-03-01,19200,Oui,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,1,4,1250,15,2023-04-01,18750,Oui,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,1,5,1300,16,2023-05-01,20800,Oui,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [240]:
import numpy as np
# Extract the month number (1-12) from the datetime column
mois = df_main['date'].dt.strftime('%m').astype(np.int32)
mois

0      1
1      2
2      3
3      4
4      5
      ..
59    12
60     9
61    10
62    11
63    12
Name: date, Length: 64, dtype: int32
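
The month can also be read directly with the `.dt.month` accessor, without going through a string. A short sketch on invented dates comparing both approaches:

```python
import pandas as pd

# Invented dates, for illustration only
dates = pd.Series(pd.to_datetime(["2023-01-01", "2023-09-01", "2023-12-01"]), name="date")

mois_strftime = dates.dt.strftime("%m").astype("int32")  # approach used above
mois_direct = dates.dt.month                              # built-in accessor

print(mois_direct.tolist())
```

Both give the same month numbers; the integer dtype returned by `.dt.month` can differ between pandas versions.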

In [241]:
# Extract the year to check how many distinct years the data covers
annee = df_main['date'].dt.strftime('%Y')
annee

0     2023
1     2023
2     2023
3     2023
4     2023
      ... 
59    2023
60    2023
61    2023
62    2023
63    2023
Name: date, Length: 64, dtype: object

In [242]:
annee.unique()

array(['2023'], dtype=object)

In [243]:
# Only one year (2023) is present, so only the month is kept as a feature
df_main['mois'] = df_main['date'].dt.strftime('%m').astype(np.int32)
df_main.head()

Unnamed: 0,id_produit,id_temps,puht,quantite_vendue,date,Montant,Classe,categorie_PC portable,categorie_electromenager,categorie_smartphone,nom_Asus,nom_Decodeur,nom_Dell,nom_HP,nom_Oppo,nom_Samsung,nom_TV,nom_iPhone,mois
0,1,1,1000,15,2023-01-01,15000,Non,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1
1,1,2,1100,14,2023-02-01,15400,Oui,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2
2,1,3,1200,16,2023-03-01,19200,Oui,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,3
3,1,4,1250,15,2023-04-01,18750,Oui,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,4
4,1,5,1300,16,2023-05-01,20800,Oui,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,5


In [244]:
# The raw date is no longer needed once the month has been extracted
df_main.drop('date', axis=1, errors='ignore', inplace=True)

In [245]:
# Drop the technical identifiers, which are not used as features
df_main.drop(['id_produit', 'id_temps'], axis=1, errors='ignore', inplace=True)

In [247]:
df_main.head()

Unnamed: 0,puht,quantite_vendue,Montant,Classe,categorie_PC portable,categorie_electromenager,categorie_smartphone,nom_Asus,nom_Decodeur,nom_Dell,nom_HP,nom_Oppo,nom_Samsung,nom_TV,nom_iPhone,mois
0,1000,15,15000,Non,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1
1,1100,14,15400,Oui,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2
2,1200,16,19200,Oui,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,3
3,1250,15,18750,Oui,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,4
4,1300,16,20800,Oui,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,5


Splitting the dataset into X and y

In [249]:
X=df_main.drop('Classe', axis=1)

In [250]:
y=df_main['Classe']

In [251]:
y

0     Non
1     Oui
2     Oui
3     Oui
4     Oui
     ... 
59    Oui
60    Non
61    Oui
62    Oui
63    Oui
Name: Classe, Length: 64, dtype: object

Splitting into train and test sets

In [258]:
from sklearn.model_selection import train_test_split

# 70% train / 30% test; stratify=y keeps the Oui/Non proportions similar in both subsets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=45, stratify=y)

In [259]:
X_train.head()

Unnamed: 0,puht,quantite_vendue,Montant,categorie_PC portable,categorie_electromenager,categorie_smartphone,nom_Asus,nom_Decodeur,nom_Dell,nom_HP,nom_Oppo,nom_Samsung,nom_TV,nom_iPhone,mois
43,1550,10,15500,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,8
5,1400,17,23800,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,6
3,1250,15,18750,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,4
55,1050,9,9450,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,12
29,650,22,14300,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,6
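
To see what `stratify=y` does, the sketch below repeats the same split on an invented dataset of 64 rows and compares class proportions before and after. The 40/24 class balance is an assumption for illustration, not the real distribution.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented dataset: 64 rows, 40 "Oui" and 24 "Non" (assumed proportions)
X = pd.DataFrame({"puht": range(64)})
y = pd.Series(["Oui"] * 40 + ["Non"] * 24, name="Classe")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=45, stratify=y
)

# Subset sizes, then class proportions in the full, train and test sets
print(len(X_train), len(X_test))
print(y.value_counts(normalize=True))
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
```

With 64 rows and `train_size=0.7`, the split yields 44 training rows and 20 test rows, which matches the 20 predictions shown later in this notebook.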


Training the model

In [260]:
X.shape

(64, 15)

In [264]:
from sklearn.linear_model import LogisticRegression
# Logistic regression with the default settings
rl = LogisticRegression()
rl.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
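
The warning itself suggests two remedies: raising `max_iter` or scaling the data. A minimal sketch of both combined, assuming a `StandardScaler` in front of the classifier; the data is synthetic and the resulting scores are not those of the notebook's model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (invented, for illustration)
rng = np.random.default_rng(0)
puht = rng.uniform(500, 2000, size=64)
quantite = rng.integers(5, 25, size=64)
montant = puht * quantite
X = np.column_stack([puht, quantite, montant])
# Invented labeling rule, only to obtain two classes
y = np.where(montant > np.median(montant), "Oui", "Non")

# Standardize each feature, then fit the logistic regression
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)
print(model.score(X, y))
```

Putting the scaler inside the pipeline means the same scaling is applied automatically when `predict` is called on new rows.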


In [268]:
y_pred=rl.predict(X_test)
y_pred

array(['Non', 'Oui', 'Non', 'Oui', 'Non', 'Non', 'Oui', 'Oui', 'Non',
       'Oui', 'Oui', 'Non', 'Oui', 'Oui', 'Oui', 'Non', 'Non', 'Oui',
       'Non', 'Oui'], dtype=object)

In [267]:
y_test

51    Oui
9     Oui
56    Non
22    Oui
47    Non
24    Non
0     Non
39    Oui
15    Non
32    Oui
30    Oui
28    Oui
11    Oui
31    Oui
36    Non
27    Non
12    Non
63    Oui
14    Non
35    Oui
Name: Classe, dtype: object

In [270]:
from sklearn.metrics import accuracy_score

print('Accuracy score=',accuracy_score(y_test,y_pred))

Accuracy score= 0.8


In [274]:
rl.classes_

array(['Non', 'Oui'], dtype=object)

In [273]:
from sklearn.metrics import confusion_matrix

# Rows are the true classes and columns the predicted classes, both in the order of rl.classes_
print(pd.DataFrame(confusion_matrix(y_test,y_pred)))

   0  1
0  7  2
1  2  9
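
The matrix is easier to read with class names on its axes, and the accuracy can be recomputed from it. The sketch below uses the true and predicted labels copied from the `y_test` and `y_pred` outputs above; the names `cm_df` and `accuracy` are illustrative.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix

# True and predicted labels copied from the y_test and y_pred outputs above
y_test = ["Oui", "Oui", "Non", "Oui", "Non", "Non", "Non", "Oui", "Non", "Oui",
          "Oui", "Oui", "Oui", "Oui", "Non", "Non", "Non", "Oui", "Non", "Oui"]
y_pred = ["Non", "Oui", "Non", "Oui", "Non", "Non", "Oui", "Oui", "Non", "Oui",
          "Oui", "Non", "Oui", "Oui", "Oui", "Non", "Non", "Oui", "Non", "Oui"]

labels = ["Non", "Oui"]
cm = confusion_matrix(y_test, y_pred, labels=labels)
cm_df = pd.DataFrame(cm,
                     index=[f"true {l}" for l in labels],
                     columns=[f"pred {l}" for l in labels])
print(cm_df)

# Accuracy is the diagonal (correct predictions) divided by the total
accuracy = cm.trace() / cm.sum()
print(accuracy, accuracy_score(y_test, y_pred))
```

Here 7 + 9 = 16 of the 20 test sales are classified correctly, which gives the accuracy of 0.8 reported above.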


How do we predict the class for a new sale?

In [311]:
# Describe the new sale with the same columns as the original dataset
df_vente_new = pd.DataFrame([[1, 1, 1000, 15, '2023-01-01', 'smartphone', 'iPhone', 1000*15]],
                            columns=['id_produit', 'id_temps', 'puht', 'quantite_vendue', 'date', 'categorie', 'nom', 'Montant'])
df_vente_new['date'] = pd.to_datetime(df_vente_new['date'])
# Replace categorie with its one-hot encoding, reusing the encoder fitted on the training data
df_categorie_encoded = pd.DataFrame(ohe_categorie.transform(df_vente_new[['categorie']]).toarray(),
                                    columns='categorie_' + ohe_categorie.categories_[0])
df_vente_new = pd.concat([df_vente_new, df_categorie_encoded], axis=1)
df_vente_new.drop('categorie', axis=1, errors='ignore', inplace=True)
# Replace nom with its one-hot encoding, again with the already fitted encoder
df_nom_encoded = pd.DataFrame(ohe_nom.transform(df_vente_new[['nom']]).toarray(),
                              columns='nom_' + ohe_nom.categories_[0])
df_vente_new = pd.concat([df_vente_new, df_nom_encoded], axis=1)
df_vente_new.drop('nom', axis=1, errors='ignore', inplace=True)
# Replace date with the month number
df_vente_new['mois'] = df_vente_new['date'].dt.strftime('%m').astype(np.int32)
df_vente_new.drop('date', axis=1, errors='ignore', inplace=True)
# Drop the identifiers, as was done for the training data
df_vente_new.drop(['id_produit', 'id_temps'], axis=1, errors='ignore', inplace=True)

In [312]:
df_vente_new

Unnamed: 0,puht,quantite_vendue,Montant,categorie_PC portable,categorie_electromenager,categorie_smartphone,nom_Asus,nom_Decodeur,nom_Dell,nom_HP,nom_Oppo,nom_Samsung,nom_TV,nom_iPhone,mois
0,1000,15,15000,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1


In [313]:
rl.predict(df_vente_new)

array(['Oui'], dtype=object)
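
Repeating every preparation step by hand for each new sale is error-prone. One possible alternative, sketched here on invented data, is to bundle the encoding and the classifier in a scikit-learn `Pipeline` with a `ColumnTransformer`, so that `predict` applies the same transformations automatically. The toy rows, the added scaling step and the names `preprocess`, `model` and `nouvelle_vente` are assumptions, not the notebook's method.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy labeled sales (invented values, for illustration only)
ventes = pd.DataFrame({
    "puht": [1000, 1100, 650, 700, 1550, 1500],
    "quantite_vendue": [15, 14, 22, 20, 10, 12],
    "categorie": ["smartphone", "smartphone", "smartphone",
                  "smartphone", "electromenager", "electromenager"],
    "nom": ["iPhone", "iPhone", "Oppo", "Oppo", "TV", "TV"],
    "mois": [1, 2, 6, 7, 8, 9],
    "Classe": ["Non", "Oui", "Oui", "Non", "Oui", "Non"],
})
X = ventes.drop(columns="Classe")
y = ventes["Classe"]

# One-hot encode the text columns and standardize the numeric ones
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["categorie", "nom"]),
    ("num", StandardScaler(), ["puht", "quantite_vendue", "mois"]),
])
model = Pipeline([("preprocess", preprocess), ("clf", LogisticRegression())])
model.fit(X, y)

# A new sale only needs the raw columns; the pipeline does the rest
nouvelle_vente = pd.DataFrame([{"puht": 1000, "quantite_vendue": 15,
                                "categorie": "smartphone", "nom": "iPhone", "mois": 1}])
prediction = model.predict(nouvelle_vente)
print(prediction)
```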