<a name="Genre"></a>
## Predicting gender from a first name

We will build two models:

1. **Logistic regression** on character n-grams.
2. **Naive Bayes** on the same features.

The dataset comes from NLTK's **`names`** corpus, which contains lists of English-language male and female first names.
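To make "character n-grams" concrete, here is a minimal pure-Python sketch of the kind of features a character-level vectorizer with `ngram_range=(1, 4)` counts. The helper `char_ngrams` is hypothetical and only illustrates the idea; it is not the scikit-learn implementation.

```python
from collections import Counter

def char_ngrams(text, n_min=1, n_max=4):
    """Return every contiguous character n-gram of length n_min..n_max."""
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the string.
        for start in range(len(text) - n + 1):
            grams.append(text[start:start + n])
    return grams

example = char_ngrams("anna")
counts = Counter(example)  # bag-of-n-grams representation of one name
print(example)
print(counts)
```

Each name becomes a sparse vector of such counts, so suffixes like `na` or `nna` can carry signal for the classifier.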

In [1]:
import nltk, pandas as pd, numpy as np, re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report
nltk.download('names')

names_male = [(n, 'M') for n in nltk.corpus.names.words('male.txt')]
names_female = [(n, 'F') for n in nltk.corpus.names.words('female.txt')]

df_names = pd.DataFrame(names_male + names_female, columns=['name', 'gender'])
df_names = df_names.sample(frac=1, random_state=42)  # shuffle

def clean_name(n):
    return re.sub('[^a-z]', '', n.lower())

df_names['name_clean'] = df_names['name'].apply(clean_name)
df_names

[nltk_data] Downloading package names to
[nltk_data]     C:\Users\dell\AppData\Roaming\nltk_data...
[nltk_data]   Package names is already up-to-date!


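|      | name    | gender | name_clean |
|------|---------|--------|------------|
| 1163 | Hersh   | M      | hersh      |
| 2283 | Saxon   | M      | saxon      |
| 7156 | Roselyn | F      | roselyn    |
| 1421 | Karel   | M      | karel      |
| 3296 | Ariadne | F      | ariadne    |
| ...  | ...     | ...    | ...        |
| 5226 | Jacinta | F      | jacinta    |
| 5390 | Jo Ann  | F      | joann      |
| 860  | Forbes  | M      | forbes     |
| 7603 | Tessy   | F      | tessy      |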
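The `Jo Ann` → `joann` row shows what `clean_name` does to spaces. The short sketch below repeats the same function on a few extra inputs; apart from `Jo Ann`, the sample names are assumptions chosen for illustration. It also shows that letters outside `a-z` (accented characters, for instance) are dropped rather than transliterated.

```python
import re

def clean_name(n):
    # Same rule as the notebook: lowercase, then drop anything outside a-z.
    return re.sub('[^a-z]', '', n.lower())

# "Jo Ann" comes from the table above; the other samples are hypothetical.
samples = ["Jo Ann", "Jean-Luc", "Zoë", "O'Neil"]
cleaned = {s: clean_name(s) for s in samples}

for raw, clean in cleaned.items():
    print(f"{raw!r:>12} -> {clean!r}")
```

Because accented letters are removed, names from languages that rely on them lose information before they reach the vectorizer.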


In [2]:
X = df_names['name_clean']
y = df_names['gender']

vectorizer = CountVectorizer(analyzer='char', ngram_range=(1,4))

In [3]:
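from sklearn.base import clone

# 1. LogReg pipeline (clone() gives this pipeline its own vectorizer instance,
#    so fitting another pipeline cannot refit it)
pipe_lr = Pipeline([
    ('vect', clone(vectorizer)),
    ('clf', LogisticRegression(max_iter=500))
])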

In [4]:
from sklearn.base import clone

# 2. Naive Bayes pipeline (clone() gives this pipeline its own vectorizer instance)
pipe_nb = Pipeline([
    ('vect', clone(vectorizer)),
    ('clf', MultinomialNB())
])

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42, test_size=0.1)

In [6]:
for name, model in [('LogReg', pipe_lr), ('NaiveBayes', pipe_nb)]:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"\n==> {name}")
    print(classification_report(y_test, y_pred))

# Select the LogReg model for the enrichment step
gender_model = pipe_lr


==> LogReg
              precision    recall  f1-score   support

           F       0.87      0.87      0.87       500
           M       0.78      0.77      0.78       295

    accuracy                           0.83       795
   macro avg       0.82      0.82      0.82       795
weighted avg       0.83      0.83      0.83       795


==> NaiveBayes
              precision    recall  f1-score   support

           F       0.81      0.84      0.82       500
           M       0.71      0.66      0.68       295

    accuracy                           0.77       795
   macro avg       0.76      0.75      0.75       795
weighted avg       0.77      0.77      0.77       795
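As a sanity check on the two reports, overall accuracy can be recovered from the per-class recall and support, and compared with a majority-class baseline. The sketch below only reuses the rounded figures printed above, so its results are approximate.

```python
# Per-class (recall, support) copied from the two classification reports above.
reports = {
    "LogReg":     {"F": (0.87, 500), "M": (0.77, 295)},
    "NaiveBayes": {"F": (0.84, 500), "M": (0.66, 295)},
}

def accuracy_from_recall(per_class):
    """Accuracy = correctly classified samples / all samples."""
    correct = sum(recall * support for recall, support in per_class.values())
    total = sum(support for _, support in per_class.values())
    return correct / total

accuracies = {name: accuracy_from_recall(r) for name, r in reports.items()}

# Baseline: always predict the majority class (F, 500 of the 795 test names).
majority_baseline = 500 / 795

for name, acc in accuracies.items():
    print(f"{name:<10s} accuracy ~ {acc:.2f}")
print(f"{'Majority':<10s} accuracy ~ {majority_baseline:.2f}")
```

Both models beat the baseline of always answering `F`, and the gap between them is concentrated in the `M` class, where Naive Bayes has noticeably lower recall.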



In [7]:
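def predict_gender(name: str, model=pipe_nb):
    """Return (label, confidence) for a first name.

    NOTE: the default model is the Naive Bayes pipeline, not the
    gender_model (LogReg) selected above; pass model=gender_model to use it.
    """
    if not isinstance(name, str) or name.strip() == "":
        raise ValueError("The first name must be a non-empty string.")

    # Same cleaning as at training time
    name_clean = clean_name(name)
    if not name_clean:
        raise ValueError("The first name must contain at least one letter a-z.")

    # Prediction + probability of the most likely class
    proba = model.predict_proba([name_clean])[0]
    idx   = proba.argmax()
    label = model.classes_[idx]
    confidence = float(proba[idx])
    return label, confidence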
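The label and confidence returned by `predict_gender` are simply the most probable class and its probability. Here is a minimal sketch of that step in plain Python, using made-up stand-in values instead of a fitted model.

```python
# Stand-ins (assumptions): a fitted classifier would expose these as
# model.classes_ and model.predict_proba([...])[0].
classes = ["F", "M"]
proba = [0.16, 0.84]

# Index of the largest probability, i.e. what proba.argmax() returns for an array.
idx = max(range(len(proba)), key=proba.__getitem__)
label = classes[idx]
confidence = proba[idx]

print(f"{label} (confidence: {confidence:.2%})")
```

The confidence is the model's own probability estimate, so a high value does not guarantee a correct label, especially for names unlike those in the training corpus.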

In [8]:
for n in ["Mochtar", "Maimouna", "Khadija", "Abdellahi", "Mamadou"]:
    lab, conf = predict_gender(n)
    print(f"{n:<10s} → {lab}  (confidence: {conf:.2%})")

Mochtar    → M  (confidence: 99.72%)
Maimouna   → M  (confidence: 81.58%)
Khadija    → M  (confidence: 97.98%)
Abdellahi  → F  (confidence: 99.98%)
Mamadou    → M  (confidence: 84.17%)


In [9]:
for n in ["Mariem", "Marie", "Abou"]:
    lab, conf = predict_gender(n)
    print(f"{n:<10s} → {lab}  (confidence: {conf:.2%})")

Mariem     → F  (confidence: 99.97%)
Marie      → F  (confidence: 99.98%)
Abou       → M  (confidence: 99.21%)
