# Conclusion 

When we applied logistic regression to the resampled data with cross-validation, the model degraded at C = 0.001. To avoid over-penalizing, we raised C to 0.1, as we had done before (in scikit-learn, C is the inverse of the regularization strength, so a larger C means a weaker penalty). This gave excellent performance. Why, with the same setting, does training on the resampled data do so much better than the model trained on the raw data?
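For reference, the role of C can be made explicit. Up to normalization conventions, scikit-learn's L2-penalized logistic regression minimizes an objective of the following form, where $W$ is the weight matrix and $\ell$ is the multinomial cross-entropy loss on sample $i$ (the symbols $W$, $\ell$, $x_i$, $y_i$ and $n$ are notation introduced here, not taken from the notebook):

```latex
\min_{W} \; \frac{1}{2}\,\lVert W \rVert_F^2 \;+\; C \sum_{i=1}^{n} \ell\bigl(y_i,\, x_i;\, W\bigr)
```

With C = 0.001 the penalty term dominates and the weights are shrunk heavily, which is consistent with the underfitting observed; moving to C = 0.1 gives the data-fit term 100 times more weight.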

**Model 1: With Resampling**
- Performance:
  - Accuracy: 0.814
  - Macro F1 Score: 0.792
  - Micro F1 Score: 0.814
- Fairness:
  - TPR_GAP: 0.182
  - FPR_GAP: 0.004
  - PPR_GAP: 0.005
- Final Score: 0.805

**Model 2: Without Resampling**
- Performance:
  - Accuracy: 0.785
  - Macro F1 Score: 0.720
  - Micro F1 Score: 0.785
- Fairness:
  - TPR_GAP: 0.189
  - FPR_GAP: 0.008
  - PPR_GAP: 0.024
- Final Score: 0.766
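
The `evaluator` module that produces the final score is not shown in this notebook. The reported numbers are, however, consistent with averaging the macro F1 score and `1 - TPR_GAP`. The sketch below checks this inferred formula against the full-precision output of the C=0.1 run on resampled data and against the rounded Model 2 figures. `final_score` is a hypothetical helper name, and the formula is an assumption reverse-engineered from the outputs, not the evaluator's confirmed definition.

```python
def final_score(macro_f1: float, tpr_gap: float) -> float:
    """Inferred (assumed) formula: mean of macro F1 and (1 - TPR_GAP)."""
    return (macro_f1 + (1.0 - tpr_gap)) / 2.0


# Model 1 (resampled data, C=0.1): full-precision values from the notebook output
model_1_score = final_score(0.7916262462858562, 0.18156604252832018)

# Model 2 (raw data): only the rounded summary values are available
model_2_score = final_score(0.720, 0.189)

print(f"Model 1: {model_1_score:.6f}")  # notebook reports 0.805030...
print(f"Model 2: {model_2_score:.4f}")  # summary reports 0.766 (rounded)
```

If this formula holds, the gain from 0.766 to 0.805 comes mostly from macro F1 (+0.072), since TPR_GAP moves only from 0.189 to 0.182.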

**Impact of Resampling on Performance:**

The first model achieves higher accuracy and F1 scores (accuracy 0.814 vs. 0.785, macro F1 0.792 vs. 0.720). This suggests that resampling the data to reduce gender bias may have improved the model's ability to generalize to new data.

Why? Resampling balances the gender groups within each profession class in the training data. This is crucial when the groups are imbalanced: it keeps the model from favoring the majority group, which supports better overall performance.

**Impact of Resampling on Fairness:**

TPR_GAP, FPR_GAP, and PPR_GAP: the first model has smaller gaps on all three (0.182 vs. 0.189, 0.004 vs. 0.008, and 0.005 vs. 0.024), indicating fairer predictions.

Why? Resampling aims to remove bias by representing both gender groups equally within each class. This translates into fairer predictions, as shown by the smaller TPR and FPR gaps between groups.
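
To make the notion of a TPR gap concrete, here is a minimal sketch of one common definition: the per-class difference in true positive rate between the two gender groups, aggregated as a root mean square. The exact aggregation used by the `evaluator` module is not shown in this notebook, so this definition is an assumption, and `tpr_per_class` and `tpr_gap_rms` are hypothetical helper names.

```python
from collections import defaultdict
from math import sqrt


def tpr_per_class(y_true, y_pred, groups):
    """Return {group: {class: TPR}}, the recall on each class within each group."""
    totals = defaultdict(lambda: defaultdict(int))
    hits = defaultdict(lambda: defaultdict(int))
    for label, pred, group in zip(y_true, y_pred, groups):
        totals[group][label] += 1
        if pred == label:
            hits[group][label] += 1
    return {
        group: {label: hits[group][label] / count for label, count in per_class.items()}
        for group, per_class in totals.items()
    }


def tpr_gap_rms(y_true, y_pred, groups, group_a=0, group_b=1):
    """Root-mean-square TPR difference between two groups (assumed aggregation)."""
    tpr = tpr_per_class(y_true, y_pred, groups)
    # Only classes that contain samples from both groups can be compared
    shared = sorted(set(tpr.get(group_a, {})) & set(tpr.get(group_b, {})))
    if not shared:
        raise ValueError("no class contains samples from both groups")
    squared = [(tpr[group_a][c] - tpr[group_b][c]) ** 2 for c in shared]
    return sqrt(sum(squared) / len(squared))


# Toy data: 2 professions, 2 gender groups, 2 samples per (profession, gender) cell
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
groups = [0, 0, 1, 1, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0, 1, 1]

gap = tpr_gap_rms(y_true, y_pred, groups)
print(gap)  # 0.5: the TPRs are 1.0 vs 0.5 on class 0 and 0.5 vs 1.0 on class 1
```

A gap of 0 would mean that, for every profession, both gender groups are recognized equally often.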


The first model, which uses resampled data to weaken the relationship between gender and the target variable, is both more accurate and fairer. The performance gain is attributed to better generalization from balanced training data. Likewise, fairness improves because potential biases in the training data are reduced, allowing fairer predictions regardless of gender. In short, resampling has a clearly positive impact on both the performance and the fairness of the model.

This does not mean the hyperparameter is at its optimal value. We therefore ran the logistic regression with several values of C close to 0.1, varying it both above and below while monitoring how the results evolved on training and on the final test. C = 0.09 appears to give the best overall generalization and performance:

(LogisticRegression(C=0.09, max_iter=5000, multi_class='multinomial',
                    random_state=42),
 {'performance_metrics': {'Accuracy': 0.8126801152737753,
   'Macro F1 Score': 0.7888312053403073,
   'Micro F1 Score': 0.8126801152737753},
  'fairness_metrics': {'TPR_GAP': 0.17383415524618148,
   'FPR_GAP': 0.003952200171387337,
   'PPR_GAP': 0.004325430259110213},
  'final_score': 0.8074985250470629,
  'number_of_estimators': 'N/A'})

With this model, the score on the test set reaches 78.20.
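
The values of C tried in the cells of this notebook can be compared side by side. The sketch below only re-tabulates the cross-validated outputs reported in those cells and sorts them by final score. The dictionary `reported` and the variable `ranked` are names introduced here, and the test-set score, which the choice above also takes into account, is not part of this ranking.

```python
# Cross-validated results copied from the notebook outputs:
# C -> (macro F1, TPR_GAP, final score)
reported = {
    0.001: (0.4120925272854986, 0.10299692310266696, 0.6545478020914158),
    0.085: (0.7865837881329135, 0.171479220863163, 0.8075522836348752),
    0.089: (0.7882220258809245, 0.1739350021392385, 0.807143511870843),
    0.09: (0.7888312053403073, 0.17383415524618148, 0.8074985250470629),
    0.092: (0.7879424010677223, 0.17352185485139063, 0.8072102731081658),
    0.094: (0.7897626518746941, 0.1825012241152687, 0.8036307138797127),
    0.1: (0.7916262462858562, 0.18156604252832018, 0.805030101878768),
    0.2: (0.8037774064068582, 0.23596706886173952, 0.7839051687725593),
    0.5: (0.7973501815489147, 0.23168778050990602, 0.7828312005195044),
}

# Sort from best to worst cross-validated final score
ranked = sorted(reported.items(), key=lambda item: item[1][2], reverse=True)

for c, (macro_f1, tpr_gap, score) in ranked:
    print(f"C={c:<6} macro_F1={macro_f1:.4f}  TPR_GAP={tpr_gap:.4f}  final={score:.4f}")
```

On cross-validation alone, the candidates between 0.085 and 0.092 differ by less than 0.0005 in final score, while C = 0.2 and C = 0.5 raise accuracy but widen TPR_GAP enough to lower the final score.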

In [1]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, train_test_split
from sklearn.utils import resample

# Project-local helper modules (train_and_evaluate presumably comes from one of them)
from modelization import *
from evaluator import *

In [2]:
# Load the challenge data (a pickled dict-like object indexed by split name)
dat = pd.read_pickle('data-challenge-student.pickle')

In [3]:
# Features
X_train = dat['X_train']

# Target labels (profession class)
Y_train = dat['Y']

# Sensitive attribute (gender)
S_train = dat['S_train']

In [46]:
X_test = dat['X_test'] 
S_test = dat['S_test']

In [33]:
import pandas as pd
from sklearn.utils import resample

# Assemble one frame: sensitive attribute first, then features, then the target
data = pd.concat([S_train, X_train, Y_train], axis=1)

balanced_data_list = []
for nbclass in range(28):  # assumes 28 distinct profession classes, labelled 0..27
    # Rows belonging to this profession class
    data_class = data[data['profession_class'] == nbclass]

    # Split by gender group
    data_s0 = data_class[data_class['gender_class'] == 0]
    data_s1 = data_class[data_class['gender_class'] == 1]

    # Oversample the smaller gender group (with replacement) up to the size of the larger one
    if len(data_s0) >= len(data_s1):
        data_s1_resampled = resample(data_s1, replace=True, n_samples=len(data_s0), random_state=42)
        balanced_data = pd.concat([data_s0, data_s1_resampled])
    else:
        data_s0_resampled = resample(data_s0, replace=True, n_samples=len(data_s1), random_state=42)
        balanced_data = pd.concat([data_s1, data_s0_resampled])

    balanced_data_list.append(balanced_data)

# Combine all rebalanced classes into a single DataFrame
balanced_data_combined = pd.concat(balanced_data_list)
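
The rebalancing step above can also be written without hard-coding the number of classes, and its result checked with a count table. The following is a minimal sketch on a toy frame, assuming the same column names `profession_class` and `gender_class`. `oversample_gender_within_class` is a hypothetical helper, and it uses pandas' `DataFrame.sample` rather than `sklearn.utils.resample`, so the rows drawn will not match the cell above exactly.

```python
import pandas as pd


def oversample_gender_within_class(df, class_col="profession_class",
                                   group_col="gender_class", seed=42):
    """Within each class, resample smaller gender groups (with replacement) to the largest group's size."""
    parts = []
    for _, class_df in df.groupby(class_col):
        target = int(class_df[group_col].value_counts().max())
        for _, group_df in class_df.groupby(group_col):
            if len(group_df) < target:
                group_df = group_df.sample(n=target, replace=True, random_state=seed)
            parts.append(group_df)
    return pd.concat(parts)


# Toy data: class 0 has 3 vs 1 samples per gender, class 1 has 1 vs 4
toy = pd.DataFrame({
    "gender_class":     [0, 0, 0, 1, 0, 1, 1, 1, 1],
    "profession_class": [0, 0, 0, 0, 1, 1, 1, 1, 1],
})

balanced = oversample_gender_within_class(toy)

# One row per profession, one column per gender: the two columns should match
counts = balanced.groupby(["profession_class", "gender_class"]).size().unstack(fill_value=0)
print(counts)
```

The same `groupby(...).size().unstack()` check can be run on `balanced_data_combined` to confirm that each profession ends up with as many rows for one gender as for the other.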

In [34]:
balanced_data_combined

Unnamed: 0,gender_class,0,1,2,3,4,5,6,7,8,...,759,760,761,762,763,764,765,766,767,profession_class
15102,0,-0.106644,0.206202,-0.487275,-0.565128,0.002457,0.521392,0.222879,0.332393,-0.158473,...,-0.025967,-0.188476,-0.541974,0.066505,-0.000817,-0.237024,-0.169442,0.153728,0.160571,0
9853,0,-0.539986,0.367027,-1.042473,-0.315929,0.223228,-0.020717,-0.085954,0.298949,-0.063420,...,-0.231364,0.026662,-0.298197,0.536527,-0.356383,0.343746,0.356660,0.265317,0.140032,0
38521,0,-0.667977,0.534055,-0.372200,-0.912800,-0.162027,0.316608,0.409040,0.698239,-0.228332,...,-0.174538,-0.090482,-0.540254,0.252228,0.325463,-0.212083,0.143992,-0.016629,0.306894,0
22478,0,-0.646435,0.244130,-0.838173,-0.224850,-0.186924,0.327114,-0.150429,0.745634,-0.238859,...,-0.076886,0.060196,0.024156,0.312725,0.224630,-0.279840,0.124797,0.418145,0.388695,0
28294,0,-0.269952,0.055027,-0.498366,-0.789691,0.517833,0.072657,0.168787,0.356092,-0.075094,...,-0.280233,0.377222,-0.733948,0.610944,-0.117537,-0.160587,0.328925,0.379089,0.035509,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4076,0,-0.150664,-0.275969,0.328843,-0.289479,-0.359328,-0.589007,0.828748,0.445308,-0.073924,...,-0.451324,-0.064055,-0.831989,0.462128,0.823785,-0.522116,-0.273711,-0.010537,0.610961,27
4712,0,-0.696221,0.049864,-0.205544,-0.244224,-0.378419,-0.132278,0.664804,0.635028,0.081595,...,-0.171151,0.397429,-0.452958,0.146219,0.330908,-0.029582,-0.276181,0.445554,-0.012045,27
22208,0,-0.071043,0.044882,-0.311584,-0.620255,-0.161639,0.015657,0.541004,0.306950,0.011576,...,-0.627619,-0.025529,-0.389166,-0.107837,0.682593,-0.051020,0.044216,0.269877,0.008501,27
22664,0,-0.269814,-0.132485,0.134214,-0.446334,-0.440597,-0.351853,0.659138,0.553297,-0.405849,...,0.016632,0.678242,-0.739566,0.369372,0.412901,-0.314313,-0.157456,0.622148,-0.006625,27


In [35]:
# Features: drop the first column (gender_class) and the last column (profession_class)
X_balanced = balanced_data_combined.iloc[:, 1:-1]

# Sensitive attribute: first column
S_balanced = balanced_data_combined.iloc[:, 0]

# Target: last column
Y_balanced = balanced_data_combined.iloc[:, -1]

In [36]:
X_balanced

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,758,759,760,761,762,763,764,765,766,767
15102,-0.106644,0.206202,-0.487275,-0.565128,0.002457,0.521392,0.222879,0.332393,-0.158473,-0.425709,...,0.262542,-0.025967,-0.188476,-0.541974,0.066505,-0.000817,-0.237024,-0.169442,0.153728,0.160571
9853,-0.539986,0.367027,-1.042473,-0.315929,0.223228,-0.020717,-0.085954,0.298949,-0.063420,-0.127194,...,0.263436,-0.231364,0.026662,-0.298197,0.536527,-0.356383,0.343746,0.356660,0.265317,0.140032
38521,-0.667977,0.534055,-0.372200,-0.912800,-0.162027,0.316608,0.409040,0.698239,-0.228332,0.177368,...,0.074018,-0.174538,-0.090482,-0.540254,0.252228,0.325463,-0.212083,0.143992,-0.016629,0.306894
22478,-0.646435,0.244130,-0.838173,-0.224850,-0.186924,0.327114,-0.150429,0.745634,-0.238859,0.379400,...,0.040561,-0.076886,0.060196,0.024156,0.312725,0.224630,-0.279840,0.124797,0.418145,0.388695
28294,-0.269952,0.055027,-0.498366,-0.789691,0.517833,0.072657,0.168787,0.356092,-0.075094,-0.683792,...,0.199538,-0.280233,0.377222,-0.733948,0.610944,-0.117537,-0.160587,0.328925,0.379089,0.035509
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4076,-0.150664,-0.275969,0.328843,-0.289479,-0.359328,-0.589007,0.828748,0.445308,-0.073924,-0.626374,...,0.219889,-0.451324,-0.064055,-0.831989,0.462128,0.823785,-0.522116,-0.273711,-0.010537,0.610961
4712,-0.696221,0.049864,-0.205544,-0.244224,-0.378419,-0.132278,0.664804,0.635028,0.081595,-0.573454,...,-0.403174,-0.171151,0.397429,-0.452958,0.146219,0.330908,-0.029582,-0.276181,0.445554,-0.012045
22208,-0.071043,0.044882,-0.311584,-0.620255,-0.161639,0.015657,0.541004,0.306950,0.011576,-0.167356,...,0.098436,-0.627619,-0.025529,-0.389166,-0.107837,0.682593,-0.051020,0.044216,0.269877,0.008501
22664,-0.269814,-0.132485,0.134214,-0.446334,-0.440597,-0.351853,0.659138,0.553297,-0.405849,-0.363420,...,-0.137180,0.016632,0.678242,-0.739566,0.369372,0.412901,-0.314313,-0.157456,0.622148,-0.006625


In [37]:
S_balanced

15102    0
9853     0
38521    0
22478    0
28294    0
        ..
4076     0
4712     0
22208    0
22664    0
19012    0
Name: gender_class, Length: 34704, dtype: int64

In [38]:
Y_balanced

15102     0
9853      0
38521     0
22478     0
28294     0
         ..
4076     27
4712     27
22208    27
22664    27
19012    27
Name: profession_class, Length: 34704, dtype: int64

# Model with C=0.001

In [39]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold

# Cross-validation setup: 10 stratified folds, repeated 3 times
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# Multinomial logistic regression with the chosen regularization parameter C
model_1 = LogisticRegression(random_state=42, solver='lbfgs', C=0.001, multi_class='multinomial', max_iter=5000)

In [40]:
result_1 = train_and_evaluate(model_1, X_balanced, Y_balanced, S_balanced, cv)

In [41]:
result_1

(LogisticRegression(C=0.001, max_iter=5000, multi_class='multinomial'),
 {'performance_metrics': {'Accuracy': 0.6646499567847882,
   'Macro F1 Score': 0.4120925272854986,
   'Micro F1 Score': 0.6646499567847882},
  'fairness_metrics': {'TPR_GAP': 0.10299692310266696,
   'FPR_GAP': 0.008388676021826632,
   'PPR_GAP': 0.008886545606355751},
  'final_score': 0.6545478020914158,
  'number_of_estimators': 'N/A'})

# Model with C=0.1

In [42]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold

# Cross-validation setup: 10 stratified folds, repeated 3 times
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# Multinomial logistic regression with the chosen regularization parameter C
model_2 = LogisticRegression(random_state=42, solver='lbfgs', C=0.1, multi_class='multinomial', max_iter=5000)

In [43]:
result_2 = train_and_evaluate(model_2, X_balanced, Y_balanced, S_balanced, cv)

In [44]:
result_2

(LogisticRegression(C=0.1, max_iter=5000, multi_class='multinomial',
                    random_state=42),
 {'performance_metrics': {'Accuracy': 0.8141210374639769,
   'Macro F1 Score': 0.7916262462858562,
   'Micro F1 Score': 0.8141210374639769},
  'fairness_metrics': {'TPR_GAP': 0.18156604252832018,
   'FPR_GAP': 0.004012551187902068,
   'PPR_GAP': 0.004524875067262163},
  'final_score': 0.805030101878768,
  'number_of_estimators': 'N/A'})

In [63]:
regression_classique = result_2[0].predict(X_test.values)
results=pd.DataFrame(regression_classique, columns= ['score'])

results.to_csv("Data_Challenge_MDI_c01.csv", header = None,index=None)

# Model with C=0.2

In [52]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold

# Cross-validation setup: 10 stratified folds, repeated 3 times
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# Multinomial logistic regression with the chosen regularization parameter C
model_3 = LogisticRegression(random_state=42, solver='lbfgs', C=0.2, multi_class='multinomial', max_iter=5000)

In [53]:
result_3 = train_and_evaluate(model_3, X_balanced, Y_balanced, S_balanced, cv)

In [58]:
result_3

(LogisticRegression(C=0.2, max_iter=5000, multi_class='multinomial',
                    random_state=42),
 {'performance_metrics': {'Accuracy': 0.8193083573487032,
   'Macro F1 Score': 0.8037774064068582,
   'Micro F1 Score': 0.8193083573487032},
  'fairness_metrics': {'TPR_GAP': 0.23596706886173952,
   'FPR_GAP': 0.003611114369412836,
   'PPR_GAP': 0.00434063616888848},
  'final_score': 0.7839051687725593,
  'number_of_estimators': 'N/A'})

In [64]:
regression_classique = result_3[0].predict(X_test.values)
results=pd.DataFrame(regression_classique, columns= ['score'])

results.to_csv("Data_Challenge_MDI_c02.csv", header = None,index=None)

# Model with C=0.5

In [54]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold

# Cross-validation setup: 10 stratified folds, repeated 3 times
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# Multinomial logistic regression with the chosen regularization parameter C
model_4 = LogisticRegression(random_state=42, solver='lbfgs', C=0.5, multi_class='multinomial', max_iter=5000)

In [55]:
result_4 = train_and_evaluate(model_4, X_balanced, Y_balanced, S_balanced, cv)

In [59]:
result_4

(LogisticRegression(C=0.5, max_iter=5000, multi_class='multinomial',
                    random_state=42),
 {'performance_metrics': {'Accuracy': 0.822241428983002,
   'Macro F1 Score': 0.7973501815489147,
   'Micro F1 Score': 0.822241428983002},
  'fairness_metrics': {'TPR_GAP': 0.23168778050990602,
   'FPR_GAP': 0.004773274089815812,
   'PPR_GAP': 0.00517583851036082},
  'final_score': 0.7828312005195044,
  'number_of_estimators': 'N/A'})

In [65]:
regression_classique = result_4[0].predict(X_test.values)
results=pd.DataFrame(regression_classique, columns= ['score'])

results.to_csv("Data_Challenge_MDI_c05.csv", header = None,index=None)

# Model with C=0.09

In [67]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold

# Cross-validation setup: 10 stratified folds, repeated 3 times
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# Multinomial logistic regression with the chosen regularization parameter C
model_5 = LogisticRegression(random_state=42, solver='lbfgs', C=0.09, multi_class='multinomial', max_iter=5000)

In [68]:
result_5 = train_and_evaluate(model_5, X_balanced, Y_balanced, S_balanced, cv)

In [69]:
result_5

(LogisticRegression(C=0.09, max_iter=5000, multi_class='multinomial',
                    random_state=42),
 {'performance_metrics': {'Accuracy': 0.8126801152737753,
   'Macro F1 Score': 0.7888312053403073,
   'Micro F1 Score': 0.8126801152737753},
  'fairness_metrics': {'TPR_GAP': 0.17383415524618148,
   'FPR_GAP': 0.003952200171387337,
   'PPR_GAP': 0.004325430259110213},
  'final_score': 0.8074985250470629,
  'number_of_estimators': 'N/A'})

In [70]:
regression_classique = result_5[0].predict(X_test.values)
results=pd.DataFrame(regression_classique, columns= ['score'])

results.to_csv("Data_Challenge_MDI_c1.csv", header = None,index=None)

# Model with C=0.092

In [71]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold

# Cross-validation setup: 10 stratified folds, repeated 3 times
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# Multinomial logistic regression with the chosen regularization parameter C
model_6 = LogisticRegression(random_state=42, solver='lbfgs', C=0.092, multi_class='multinomial', max_iter=5000)

In [72]:
result_6 = train_and_evaluate(model_6, X_balanced, Y_balanced, S_balanced, cv)

In [73]:
result_6

(LogisticRegression(C=0.092, max_iter=5000, multi_class='multinomial',
                    random_state=42),
 {'performance_metrics': {'Accuracy': 0.8126801152737753,
   'Macro F1 Score': 0.7879424010677223,
   'Micro F1 Score': 0.8126801152737753},
  'fairness_metrics': {'TPR_GAP': 0.17352185485139063,
   'FPR_GAP': 0.003939873608294418,
   'PPR_GAP': 0.0043376700880842604},
  'final_score': 0.8072102731081658,
  'number_of_estimators': 'N/A'})

In [None]:
regression_classique = result_6[0].predict(X_test.values)
results=pd.DataFrame(regression_classique, columns= ['score'])

results.to_csv("Data_Challenge_MDI_c092.csv", header = None,index=None)

# Model with C=0.094

In [74]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold

# Cross-validation setup: 10 stratified folds, repeated 3 times
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# Multinomial logistic regression with the chosen regularization parameter C
model_7 = LogisticRegression(random_state=42, solver='lbfgs', C=0.094, multi_class='multinomial', max_iter=5000)

In [75]:
result_7 = train_and_evaluate(model_7, X_balanced, Y_balanced, S_balanced, cv)

In [None]:
regression_classique = result_7[0].predict(X_test.values)
results=pd.DataFrame(regression_classique, columns= ['score'])

results.to_csv("Data_Challenge_MDI_c094.csv", header = None,index=None)

In [78]:
result_7

(LogisticRegression(C=0.094, max_iter=5000, multi_class='multinomial',
                    random_state=42),
 {'performance_metrics': {'Accuracy': 0.8129682997118156,
   'Macro F1 Score': 0.7897626518746941,
   'Micro F1 Score': 0.8129682997118156},
  'fairness_metrics': {'TPR_GAP': 0.1825012241152687,
   'FPR_GAP': 0.0038970581967480095,
   'PPR_GAP': 0.004351441681491001},
  'final_score': 0.8036307138797127,
  'number_of_estimators': 'N/A'})

# Model with C=0.089

In [81]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold

# Cross-validation setup: 10 stratified folds, repeated 3 times
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# Multinomial logistic regression with the chosen regularization parameter C
model_8 = LogisticRegression(random_state=42, solver='lbfgs', C=0.089, multi_class='multinomial', max_iter=5000)

In [82]:
result_8 = train_and_evaluate(model_8, X_balanced, Y_balanced, S_balanced, cv)

In [83]:
result_8

(LogisticRegression(C=0.089, max_iter=5000, multi_class='multinomial',
                    random_state=42),
 {'performance_metrics': {'Accuracy': 0.8121037463976946,
   'Macro F1 Score': 0.7882220258809245,
   'Micro F1 Score': 0.8121037463976946},
  'fairness_metrics': {'TPR_GAP': 0.1739350021392385,
   'FPR_GAP': 0.0037898193533557846,
   'PPR_GAP': 0.004186467978556214},
  'final_score': 0.807143511870843,
  'number_of_estimators': 'N/A'})

In [None]:
regression_classique = result_8[0].predict(X_test.values)
results=pd.DataFrame(regression_classique, columns= ['score'])

results.to_csv("Data_Challenge_MDI_c096.csv", header = None,index=None)

# Model with C=0.085

In [84]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold

# Cross-validation setup: 10 stratified folds, repeated 3 times
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# Multinomial logistic regression with the chosen regularization parameter C
model_9 = LogisticRegression(random_state=42, solver='lbfgs', C=0.085, multi_class='multinomial', max_iter=5000)

In [85]:
result_9 = train_and_evaluate(model_9, X_balanced, Y_balanced, S_balanced, cv)

In [87]:
result_9

(LogisticRegression(C=0.085, max_iter=5000, multi_class='multinomial',
                    random_state=42),
 {'performance_metrics': {'Accuracy': 0.8106628242074928,
   'Macro F1 Score': 0.7865837881329135,
   'Micro F1 Score': 0.8106628242074928},
  'fairness_metrics': {'TPR_GAP': 0.171479220863163,
   'FPR_GAP': 0.0037942737037841927,
   'PPR_GAP': 0.004140113341055508},
  'final_score': 0.8075522836348752,
  'number_of_estimators': 'N/A'})

In [None]:
regression_classique = result_9[0].predict(X_test.values)
results=pd.DataFrame(regression_classique, columns= ['score'])

results.to_csv("Data_Challenge_MDI_c098.csv", header = None,index=None)

# Model with C=0.08

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold

# Cross-validation setup: 10 stratified folds, repeated 3 times
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# Multinomial logistic regression with the chosen regularization parameter C
model_10 = LogisticRegression(random_state=42, solver='lbfgs', C=0.08, multi_class='multinomial', max_iter=5000)

In [None]:
result_10 = train_and_evaluate(model_10, X_balanced, Y_balanced, S_balanced, cv)

In [88]:
result_10

NameError: name 'result_10' is not defined

In [None]:
regression_classique = result_10[0].predict(X_test.values)
results=pd.DataFrame(regression_classique, columns= ['score'])

results.to_csv("Data_Challenge_MDI_c099.csv", header = None,index=None)