<a href="https://colab.research.google.com/github/Dhoulnoun/fairuz-ML-project/blob/master/fairuz.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

First, let's frame the problem

• Is it Supervised, Unsupervised, or Reinforcement Learning?

  It's Supervised because each instance has a corresponding target

• Is it a Classification task, a Regression task, or something else?

  It's a Classification task because our targets are categorical

• Should you use batch learning or online learning techniques?

   We should use batch learning because the dataset is small enough to train on in one pass (45,211 rows) and does not arrive as a real-time stream.


In [None]:
import pandas as pd
from scipy.io import arff

arff_file = arff.loadarff('/content/fairuz.arff')
df = pd.DataFrame(arff_file[0])

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 17 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   V1      45211 non-null  float64
 1   V2      45211 non-null  object 
 2   V3      45211 non-null  object 
 3   V4      45211 non-null  object 
 4   V5      45211 non-null  object 
 5   V6      45211 non-null  float64
 6   V7      45211 non-null  object 
 7   V8      45211 non-null  object 
 8   V9      45211 non-null  object 
 9   V10     45211 non-null  float64
 10  V11     45211 non-null  object 
 11  V12     45211 non-null  float64
 12  V13     45211 non-null  float64
 13  V14     45211 non-null  float64
 14  V15     45211 non-null  float64
 15  V16     45211 non-null  object 
 16  Class   45211 non-null  object 
dtypes: float64(7), object(10)
memory usage: 5.9+ MB


45,211 entries
17 columns: 16 features (V1 to V16) plus the `Class` target, with no missing values
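
The `object` columns returned by `arff.loadarff` hold Python `bytes` values, which is why the categorical outputs in this notebook show a `b'...'` prefix. A minimal sketch of one way to decode them, assuming UTF-8 content; the helper name `decode_bytes_columns` and the sample values are hypothetical and not part of the original notebook:

```python
import pandas as pd


def decode_bytes_columns(frame: pd.DataFrame, encoding: str = "utf-8") -> pd.DataFrame:
    """Return a copy where bytes values in object columns are decoded to str."""
    decoded = frame.copy()
    for col in decoded.select_dtypes(include=["object"]).columns:
        # Only decode actual bytes values; leave anything else untouched.
        decoded[col] = decoded[col].map(
            lambda v: v.decode(encoding) if isinstance(v, bytes) else v
        )
    return decoded


# Tiny stand-in for the loaded ARFF data (illustrative values only).
sample = pd.DataFrame({"V1": [58.0, 44.0], "V3": [b"married", b"single"]})
clean = decode_bytes_columns(sample)
print(clean["V3"].tolist())
```

On the notebook's data this would be called as `df = decode_bytes_columns(df)` right after loading, so that later outputs and one-hot column names read `married` rather than `b'married'`.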

In [None]:
df.describe()

,V1,V6,V10,V12,V13,V14,V15
count,45211.0,45211.0,45211.0,45211.0,45211.0,45211.0,45211.0
mean,40.93621,1362.272058,15.806419,258.16308,2.763841,40.197828,0.580323
std,10.618762,3044.765829,8.322476,257.527812,3.098021,100.128746,2.303441
min,18.0,-8019.0,1.0,0.0,1.0,-1.0,0.0
25%,33.0,72.0,8.0,103.0,1.0,-1.0,0.0
50%,39.0,448.0,16.0,180.0,2.0,-1.0,0.0
75%,48.0,1428.0,21.0,319.0,3.0,-1.0,0.0
max,95.0,102127.0,31.0,4918.0,63.0,871.0,275.0


Comments on the distribution of the numerical features:
  **TODO**
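
As a possible starting point for this comment, the summary above already hints at skew: V6 has a mean (1362.27) well above its median (448.0), and V14 equals -1 up to the 75th percentile. A short sketch that puts mean, median and skewness side by side; `summarize_skew` and the toy `balance` column are hypothetical names used only for illustration:

```python
import pandas as pd


def summarize_skew(frame: pd.DataFrame) -> pd.DataFrame:
    """Compare mean, median and sample skewness for every numeric column."""
    numeric = frame.select_dtypes(include=["number"])
    return pd.DataFrame(
        {
            "mean": numeric.mean(),
            "median": numeric.median(),
            "skew": numeric.skew(),
        }
    )


# Toy column with one large outlier, mimicking a right-skewed feature.
toy = pd.DataFrame({"balance": [0.0, 10.0, 20.0, 30.0, 1000.0]})
summary = summarize_skew(toy)
print(summary)
```

A mean far above the median together with a positive skew suggests a long right tail, which may call for a log transform or robust scaling depending on the model.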


In [None]:
df['V2'].value_counts()

V2,count
b'blue-collar',9732
b'management',9458
b'technician',7597
b'admin.',5171
b'services',4154
b'retired',2264
b'self-employed',1579
b'entrepreneur',1487
b'unemployed',1303
b'housemaid',1240


In [None]:
df['V3'].value_counts()

V3,count
b'married',27214
b'single',12790
b'divorced',5207


In [None]:
df['V4'].value_counts()

V4,count
b'secondary',23202
b'tertiary',13301
b'primary',6851
b'unknown',1857


In [None]:
df['V5'].value_counts()

V5,count
b'no',44396
b'yes',815


In [None]:
df['V7'].value_counts()

V7,count
b'yes',25130
b'no',20081


In [None]:
df['V8'].value_counts()

V8,count
b'no',37967
b'yes',7244


In [None]:
df['V9'].value_counts()

V9,count
b'cellular',29285
b'unknown',13020
b'telephone',2906


In [None]:
df['V11'].value_counts()

V11,count
b'may',13766
b'jul',6895
b'aug',6247
b'jun',5341
b'nov',3970
b'apr',2932
b'feb',2649
b'jan',1403
b'oct',738
b'sep',579


In [None]:
df['V16'].value_counts()

V16,count
b'unknown',36959
b'failure',4901
b'other',1840
b'success',1511


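Most categorical features look reasonably well distributed, but a few are strongly dominated by one value: V5 is 'no' for 44,396 of 45,211 rows (about 98%) and V16 is 'unknown' for 36,959 rows (about 82%). No specific treatment is applied to them at this stage.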
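
Rather than reading each `value_counts()` output by eye, the balance of the categorical features can be checked in one pass. The sketch below reports the most frequent value of each column and its share of rows; `dominant_share` and the toy data are hypothetical, chosen only to illustrate the idea:

```python
import pandas as pd


def dominant_share(frame: pd.DataFrame, columns) -> pd.DataFrame:
    """For each column, report its most frequent value and that value's share of rows."""
    rows = {}
    for col in columns:
        # value_counts sorts by frequency, so the first entry is the dominant value.
        shares = frame[col].value_counts(normalize=True)
        rows[col] = {"top": shares.index[0], "share": float(shares.iloc[0])}
    return pd.DataFrame.from_dict(rows, orient="index")


# Toy data: one heavily dominated column and one fairly balanced column.
toy = pd.DataFrame(
    {
        "V5": ["no"] * 9 + ["yes"],
        "V7": ["yes"] * 6 + ["no"] * 4,
    }
)
report = dominant_share(toy, ["V5", "V7"])
print(report)
```

On the notebook's data, the columns could be taken from `df.select_dtypes(include=['object']).columns`; a share close to 1.0 flags a feature that carries little variation.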

Now let's analyze the distribution of our target.

In [None]:
df['Class'].value_counts()

Class,count
b'1',39922
b'2',5289


The distribution is heavily skewed toward class '1' (39,922 rows against 5,289, roughly 88% vs. 12%).
Several kinds of solutions can mitigate this imbalance.

*   Rebalance the data by oversampling the minority class (e.g. SMOTE) or undersampling the majority class.
*   Use models that remain effective under class imbalance, such as random forest.
*   Collect additional data (impossible in our case).
*   During training, apply weights inversely proportional to class frequency to balance each class's impact on the loss function.
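
To make the last option concrete, here is a small sketch of inverse-frequency weights using the "balanced" heuristic `n_samples / (n_classes * class_count)`, which is the formula scikit-learn documents for `class_weight='balanced'`. The helper `balanced_class_weights` is a hypothetical name; the class counts are the ones shown above:

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Rebuild the target distribution reported above: 39922 rows of class 1, 5289 of class 2.
labels = [1] * 39922 + [2] * 5289
weights = balanced_class_weights(labels)
print(weights)
```

The minority class ends up with a weight of about 4.27 against about 0.57 for the majority class, so each misclassified class '2' row costs roughly 7.5 times more in the loss.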






**Oversampling the minority class '2'**

In [None]:
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# Separate the features from the target
X = df.drop('Class', axis=1)
y = df['Class']

print(X.dtypes)


# Identify the categorical columns
categorical_columns = X.select_dtypes(include=['object']).columns

# One-hot encode the categorical columns
encoder = OneHotEncoder(drop='first', sparse_output=False)  # drop='first' avoids multicollinearity
encoded_features = encoder.fit_transform(X[categorical_columns])

# Convert the encoded data to a DataFrame
encoded_df = pd.DataFrame(encoded_features, columns=encoder.get_feature_names_out(categorical_columns))

# Replace the original categorical columns with the encoded ones
X = X.drop(columns=categorical_columns)
X = pd.concat([X.reset_index(drop=True), encoded_df.reset_index(drop=True)], axis=1)


# Convert the labels from bytes (b'1', b'2') to integers
y = y.astype('int')

# Apply SMOTE (it interpolates every column, so synthetic rows can have fractional one-hot values)
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
# Rebuild a single DataFrame
df_resampled = pd.concat([pd.DataFrame(X_resampled), pd.DataFrame(y_resampled, columns=['Class'])], axis=1)

# Check the new class balance
print(df_resampled['Class'].value_counts())
df_resampled.describe()



V1     float64
V2      object
V3      object
V4      object
V5      object
V6     float64
V7      object
V8      object
V9      object
V10    float64
V11     object
V12    float64
V13    float64
V14    float64
V15    float64
V16     object
dtype: object




Class
1    39922
2    39922
Name: count, dtype: int64


,V1,V6,V10,V12,V13,V14,V15,V2_b'blue-collar',V2_b'entrepreneur',V2_b'housemaid',...,V11_b'jun',V11_b'mar',V11_b'may',V11_b'nov',V11_b'oct',V11_b'sep',V16_b'other',V16_b'success',V16_b'unknown',Class
count,79844.0,79844.0,79844.0,79844.0,79844.0,79844.0,79844.0,79844.0,79844.0,79844.0,...,79844.0,79844.0,79844.0,79844.0,79844.0,79844.0,79844.0,79844.0,79844.0,79844.0
mean,41.108247,1548.572511,15.50368,377.080401,2.495298,51.208589,0.824201,0.180495,0.028429,0.024828,...,0.111426,0.026116,0.248373,0.082991,0.035699,0.030002,0.047046,0.098948,0.745108,1.5
std,10.939095,3173.08165,7.796583,346.260499,2.566869,106.839315,2.316078,0.362371,0.156272,0.145851,...,0.293134,0.138659,0.409494,0.257089,0.16292,0.148073,0.195141,0.276992,0.430402,0.500003
min,18.0,-8019.0,1.0,0.0,1.0,-1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
25%,32.89518,123.584889,9.0,144.0,1.0,-1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.323245,1.0
50%,39.0,557.406991,15.154436,258.981436,2.0,-1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.5
75%,48.051756,1733.372402,21.0,504.6403,3.0,45.514108,1.0,0.0,0.0,0.0,...,0.0,0.0,0.483785,0.0,0.0,0.0,0.0,0.0,1.0,2.0
max,95.0,102127.0,31.0,4918.0,63.0,871.0,275.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0
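
The summary above shows quartiles such as 0.323245 for `V16_b'unknown'` and 0.483785 for `V11_b'may'`: because SMOTE interpolates between neighbours, some synthetic rows carry one-hot values that are neither 0 nor 1. The sketch below counts such rows; `count_fractional_rows` and the simplified column names are hypothetical, used only for illustration:

```python
import pandas as pd


def count_fractional_rows(frame: pd.DataFrame, dummy_columns) -> int:
    """Count rows where at least one one-hot column is neither 0 nor 1."""
    dummies = frame[list(dummy_columns)]
    is_binary = dummies.isin([0, 1])
    # A row is problematic as soon as one of its dummy values is not binary.
    return int((~is_binary.all(axis=1)).sum())


# Toy frame: the third row mimics an interpolated synthetic sample.
toy = pd.DataFrame(
    {
        "V16_unknown": [0.0, 1.0, 0.323245, 1.0],
        "V11_may": [1.0, 0.0, 0.483785, 0.0],
    }
)
n_bad = count_fractional_rows(toy, ["V16_unknown", "V11_may"])
print(n_bad)
```

On the notebook's data this could be called as `count_fractional_rows(df_resampled, encoded_df.columns)`. If the count is large, one option worth evaluating is imbalanced-learn's `SMOTENC`, which is designed for datasets mixing numerical and categorical features.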
