## Homework [DRAFT]

> Note: sometimes your answer doesn't match one of the options exactly.
> That's fine.
> Select the option that's closest to your solution.


### Dataset

In this homework, we will use the lead scoring dataset. Download it from [here](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv).

Or you can do it with `wget`:

```bash
wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv
```

In this dataset, the target for the classification task is the `converted` variable: whether the client has signed up to the platform or not.

### Data preparation

* Check whether the features contain missing values.
* If there are missing values:
    * For categorical features, replace them with 'NA'
    * For numerical features, replace them with 0.0
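
The two rules above can be sketched on a small frame. This is a minimal illustration only: the frame `toy` and its values are hypothetical, not taken from the dataset, and treating every non-numeric column as categorical is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are illustrative only
toy = pd.DataFrame({
    "industry": ["retail", None, "education"],
    "annual_income": [50000.0, np.nan, 72000.0],
})

# Assumption: every non-numeric column is treated as categorical
cat_cols = toy.select_dtypes(exclude="number").columns.tolist()
num_cols = toy.select_dtypes(include="number").columns.tolist()

# Fill each group of columns with its own placeholder
toy[cat_cols] = toy[cat_cols].fillna("NA")
toy[num_cols] = toy[num_cols].fillna(0.0)

# Total number of missing values left in the frame
print(toy.isnull().sum().sum())
```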


In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import accuracy_score, mutual_info_score

# Load the data
df = pd.read_csv('course_lead_scoring.csv')

print("=" * 80)
print("DATA OVERVIEW")
print("=" * 80)
print(df.head())
print("\nShape:", df.shape)
print("\nData types:")
print(df.dtypes)
print("\nMissing values:")
print(df.isnull().sum())

# Data preparation: handle missing values
print("\n" + "=" * 80)
print("DATA PREPARATION")
print("=" * 80)

# Identify the categorical and numerical columns
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns.tolist()

# Remove the target 'converted' from both lists if present
if 'converted' in categorical_cols:
    categorical_cols.remove('converted')
if 'converted' in numerical_cols:
    numerical_cols.remove('converted')

print(f"Categorical columns: {categorical_cols}")
print(f"Numerical columns: {numerical_cols}")

# Replace missing values: 'NA' for categorical, 0.0 for numerical
df[categorical_cols] = df[categorical_cols].fillna('NA')
df[numerical_cols] = df[numerical_cols].fillna(0.0)

print("\nMissing values after treatment:")
print(df.isnull().sum())

DATA OVERVIEW
    lead_source    industry  number_of_courses_viewed  annual_income  \
0      paid_ads         NaN                         1        79450.0   
1  social_media      retail                         1        46992.0   
2        events  healthcare                         5        78796.0   
3      paid_ads      retail                         2        83843.0   
4      referral   education                         3        85012.0   

  employment_status       location  interaction_count  lead_score  converted  
0        unemployed  south_america                  4        0.94          1  
1          employed  south_america                  1        0.80          0  
2        unemployed      australia                  3        0.69          1  
3               NaN      australia                  1        0.87          0  
4     self_employed         europe                  3        0.62          1  

Shape: (1462, 9)

Data types:
lead_source                  object
i


### Question 1

What is the most frequent observation (mode) for the column `industry`?

- `NA`
- `technology`
- `healthcare`
- `retail`
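
The order of operations matters here: `Series.mode()` ignores missing values by default, so the answer can change depending on whether missing values have already been replaced with 'NA'. A minimal sketch with a hypothetical series (the values are made up):

```python
import pandas as pd

# Hypothetical column with three missing values
s = pd.Series(["retail", None, None, None, "retail", "education"])

# mode() drops missing values by default, so they are not counted
mode_before = s.mode()[0]

# After filling, 'NA' is an ordinary value and competes with the others
mode_after = s.fillna("NA").mode()[0]

print(mode_before, mode_after)
```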


In [2]:

print("\n" + "=" * 80)
print("QUESTION 1: Mode of the 'industry' column")
print("=" * 80)
industry_mode = df['industry'].mode()[0]
print(f"Mode of 'industry': {industry_mode}")


QUESTION 1: Mode of the 'industry' column
Mode of 'industry': retail


### Question 2

Create the [correlation matrix](https://www.google.com/search?q=correlation+matrix) for the numerical features of your dataset.
In a correlation matrix, you compute the correlation coefficient between every pair of features.

What are the two features that have the biggest correlation?

- `interaction_count` and `lead_score`
- `number_of_courses_viewed` and `lead_score`
- `number_of_courses_viewed` and `interaction_count`
- `annual_income` and `lead_score`

Only consider the pairs above when answering this question.
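
As a rough sketch of the idea, the snippet below builds a correlation matrix for a hypothetical frame and picks the strongest pair. The column names `a`, `b`, `c` are made up, and ranking pairs by absolute value is an assumption about what "biggest" means.

```python
from itertools import combinations

import pandas as pd

# Hypothetical toy data; column names are illustrative only
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 3, 4, 1, 2],
})

corr = toy.corr()  # Pearson correlation by default

# Visit each unordered pair once (this skips the diagonal of 1.0 values)
pair_strength = {
    (x, y): abs(corr.loc[x, y]) for x, y in combinations(corr.columns, 2)
}
best_pair = max(pair_strength, key=pair_strength.get)
print(best_pair, round(pair_strength[best_pair], 3))
```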


In [3]:

# Build the correlation matrix for the numerical features
correlation_matrix = df[numerical_cols].corr()
print("\nCorrelation matrix:")
print(correlation_matrix)

# Look up the correlation for each of the listed pairs
pairs = [
    ('interaction_count', 'lead_score'),
    ('number_of_courses_viewed', 'lead_score'),
    ('number_of_courses_viewed', 'interaction_count'),
    ('annual_income', 'lead_score')
]

print("\nCorrelations for the listed pairs:")
correlations = {}
for feat1, feat2 in pairs:
    if feat1 in correlation_matrix.columns and feat2 in correlation_matrix.columns:
        corr = correlation_matrix.loc[feat1, feat2]
        # Compare pairs by absolute value so strong negative correlations count too
        correlations[f"{feat1} - {feat2}"] = abs(corr)
        print(f"{feat1} and {feat2}: {corr:.4f} (abs: {abs(corr):.4f})")

max_corr_pair = max(correlations, key=correlations.get)
print(f"\nPair with the largest correlation: {max_corr_pair}")


Correlation matrix:
                          number_of_courses_viewed  annual_income  \
number_of_courses_viewed                  1.000000       0.009770   
annual_income                             0.009770       1.000000   
interaction_count                        -0.023565       0.027036   
lead_score                               -0.004879       0.015610   

                          interaction_count  lead_score  
number_of_courses_viewed          -0.023565   -0.004879  
annual_income                      0.027036    0.015610  
interaction_count                  1.000000    0.009888  
lead_score                         0.009888    1.000000  

Correlations for the listed pairs:
interaction_count and lead_score: 0.0099 (abs: 0.0099)
number_of_courses_viewed and lead_score: -0.0049 (abs: 0.0049)
number_of_courses_viewed and interaction_count: -0.0236 (abs: 0.0236)
annual_income and lead_score: 0.0156 (abs: 0.0156)

Pair with the largest correlation: number_of_courses_view

### Split the data

* Split your data in train/val/test sets with 60%/20%/20% distribution.
* Use Scikit-Learn for that (the `train_test_split` function) and set the seed to `42`.
* Make sure that the target value `y` is not in your dataframe.
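
One possible way to get a 60%/20%/20% split is to call `train_test_split` twice. The sketch below uses hypothetical stand-in arrays and holds out the test set first; this ordering is an assumption, and a different ordering of the two calls generally assigns different rows to each set even with the same seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, one feature, binary target
X = np.arange(100).reshape(-1, 1)
y = np.arange(100) % 2

# Hold out 20% for test, then take 25% of the remaining 80% for validation
X_full_train, X_test, y_full_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_full_train, y_full_train, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

Because the target is passed as a separate array, it never ends up among the feature columns.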

### Question 3

* Calculate the mutual information score between `y` and other categorical variables in the dataset. Use the training set only.
* Round the scores to 2 decimals using `round(score, 2)`.

Which of these variables has the biggest mutual information score?

- `industry`
- `location`
- `lead_source`
- `employment_status`
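
A minimal sketch of the computation on hypothetical data follows. The column names `channel` and `region` and all values are made up for illustration; only `mutual_info_score` and the rounding rule come from the task.

```python
import pandas as pd
from sklearn.metrics import mutual_info_score

# Hypothetical toy training data; values are illustrative only
toy_train = pd.DataFrame({
    "channel": ["ads", "ads", "referral", "referral", "events", "events"],
    "region": ["eu", "us", "eu", "us", "eu", "us"],
})
toy_y = pd.Series([1, 1, 0, 0, 1, 0])

# Score each categorical column against the target, rounded to 2 decimals
scores = {
    col: round(mutual_info_score(toy_y, toy_train[col]), 2)
    for col in toy_train.columns
}
best = max(scores, key=scores.get)
print(scores, best)
```

In this toy example `channel` separates the target much better than `region`, so it receives the higher score.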


In [4]:
# Split the data
print("\n" + "=" * 80)
print("DATA SPLIT (60/20/20)")
print("=" * 80)

# Separate the target y from the features X
y = df['converted'].values
df_features = df.drop('converted', axis=1)

# Split into train/temp (60/40)
df_train, df_temp, y_train, y_temp = train_test_split(
    df_features, y, test_size=0.4, random_state=42
)

# Split temp into val/test (50/50 of the 40%, so 20/20 overall)
df_val, df_test, y_val, y_test = train_test_split(
    df_temp, y_temp, test_size=0.5, random_state=42
)

print(f"Train set: {len(df_train)} samples")
print(f"Validation set: {len(df_val)} samples")
print(f"Test set: {len(df_test)} samples")


# QUESTION 3: Mutual Information Score
print("\n" + "=" * 80)
print("QUESTION 3: Mutual Information Score")
print("=" * 80)

# Recompute the categorical columns from the training frame (no 'converted' here)
categorical_features = df_train.select_dtypes(include=['object']).columns.tolist()

mi_scores = {}
for col in categorical_features:
    mi = mutual_info_score(y_train, df_train[col])
    mi_scores[col] = round(mi, 2)
    print(f"{col}: {mi_scores[col]}")

max_mi_feature = max(mi_scores, key=mi_scores.get)
print(f"\nFeature with the highest MI score: {max_mi_feature} ({mi_scores[max_mi_feature]})")



DATA SPLIT (60/20/20)
Train set: 877 samples
Validation set: 292 samples
Test set: 293 samples

QUESTION 3: Mutual Information Score
lead_source: 0.03
industry: 0.02
employment_status: 0.02
location: 0.0

Feature with the highest MI score: lead_source (0.03)



### Question 4

* Now let's train a logistic regression.
* Remember that we have several categorical variables in the dataset. Include them using one-hot encoding.
* Fit the model on the training dataset.
    - To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
    - `model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)`
* Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

What accuracy did you get?

- 0.64
- 0.74
- 0.84
- 0.94


In [5]:

# QUESTION 4: Logistic Regression with one-hot encoding
print("\n" + "=" * 80)
print("QUESTION 4: Logistic Regression")
print("=" * 80)

# Convert the dataframes to lists of dicts for DictVectorizer
train_dict = df_train.to_dict(orient='records')
val_dict = df_val.to_dict(orient='records')
test_dict = df_test.to_dict(orient='records')

# One-hot encoding (fit on train only, then reuse the mapping)
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(train_dict)
X_val = dv.transform(val_dict)
X_test = dv.transform(test_dict)

print(f"Shape after encoding - Train: {X_train.shape}, Val: {X_val.shape}, Test: {X_test.shape}")

# Train the model
model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
model.fit(X_train, y_train)

# Predictions and accuracy
y_pred_val = model.predict(X_val)
accuracy_q4 = accuracy_score(y_val, y_pred_val)
accuracy_q4_rounded = round(accuracy_q4, 2)

print(f"Validation accuracy: {accuracy_q4:.6f}")
print(f"Rounded accuracy: {accuracy_q4_rounded}")



QUESTION 4: Logistic Regression
Shape after encoding - Train: (877, 31), Val: (292, 31), Test: (293, 31)
Validation accuracy: 0.743151
Rounded accuracy: 0.74




### Question 5

* Let's find the least useful feature using the *feature elimination* technique.
* Train a model using the same features and parameters as in Q4 (without rounding).
* Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
* For each feature, calculate the difference between the original accuracy and the accuracy without the feature.

Which of the following features has the smallest difference?

- `'industry'`
- `'employment_status'`
- `'lead_score'`

> **Note**: The difference doesn't have to be positive.
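
One way to sketch feature elimination so that it covers both categorical and numerical features is to drop the key from each record before encoding. The helpers `fit_and_score` and `drop_key`, and the toy records, are hypothetical names and data introduced for illustration; the signed difference is kept, as the note above allows.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


def fit_and_score(train_records, y_train, val_records, y_val):
    """Encode the records, fit the Q4-style model, return validation accuracy."""
    dv = DictVectorizer(sparse=False)
    X_train = dv.fit_transform(train_records)
    X_val = dv.transform(val_records)
    model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=42)
    model.fit(X_train, y_train)
    return accuracy_score(y_val, model.predict(X_val))


def drop_key(records, key):
    """Return copies of the records without the given key."""
    return [{k: v for k, v in r.items() if k != key} for r in records]


# Hypothetical toy data; values are illustrative only
train = [
    {"channel": "ads", "visits": 1},
    {"channel": "ads", "visits": 2},
    {"channel": "referral", "visits": 8},
    {"channel": "referral", "visits": 9},
]
y_train = [0, 0, 1, 1]
val = [{"channel": "ads", "visits": 1}, {"channel": "referral", "visits": 9}]
y_val = [0, 1]

baseline = fit_and_score(train, y_train, val, y_val)

# Signed difference: baseline accuracy minus accuracy without the feature
diffs = {}
for feature in train[0]:
    acc = fit_and_score(drop_key(train, feature), y_train, drop_key(val, feature), y_val)
    diffs[feature] = baseline - acc

print(diffs)
```

Dropping the key before encoding removes every derived column at once, whether the feature produced one numeric column or several one-hot columns.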



In [6]:

# QUESTION 5: Feature Elimination
print("\n" + "=" * 80)
print("QUESTION 5: Feature Elimination")
print("=" * 80)

baseline_accuracy = accuracy_q4
print(f"Baseline accuracy: {baseline_accuracy:.6f}")

# Get the feature names produced by the encoder
feature_names = dv.get_feature_names_out()
original_features = df_train.columns.tolist()

differences = {}

for feature in original_features:
    # Find the encoded columns that belong to this feature.
    # DictVectorizer names one-hot columns "<feature>=<value>", so this prefix
    # matches categorical features only; numerical features such as
    # 'lead_score' keep their bare name and are skipped by this loop.
    feature_cols = [i for i, name in enumerate(feature_names) if name.startswith(feature + '=')]

    if len(feature_cols) > 0:
        # Build X_train and X_val without this feature
        X_train_reduced = np.delete(X_train, feature_cols, axis=1)
        X_val_reduced = np.delete(X_val, feature_cols, axis=1)

        # Train the model without this feature
        model_reduced = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
        model_reduced.fit(X_train_reduced, y_train)

        # Compute the accuracy
        y_pred_reduced = model_reduced.predict(X_val_reduced)
        accuracy_reduced = accuracy_score(y_val, y_pred_reduced)

        # Absolute difference between baseline and reduced accuracy
        diff = abs(baseline_accuracy - accuracy_reduced)
        differences[feature] = diff

        print(f"{feature}: accuracy={accuracy_reduced:.6f}, diff={diff:.6f}")

# Find the feature with the smallest difference
min_diff_feature = min(differences, key=differences.get)
print(f"\nFeature with the smallest difference: {min_diff_feature} ({differences[min_diff_feature]:.6f})")



QUESTION 5: Feature Elimination
Baseline accuracy: 0.743151
lead_source: accuracy=0.729452, diff=0.013699
industry: accuracy=0.743151, diff=0.000000
employment_status: accuracy=0.746575, diff=0.003425
location: accuracy=0.743151, diff=0.000000

Feature with the smallest difference: industry (0.000000)



### Question 6

* Now let's train a regularized logistic regression.
* Let's try the following values of the parameter `C`: `[0.01, 0.1, 1, 10, 100]`.
* Train models using all the features as in Q4.
* Calculate the accuracy on the validation dataset and round it to 3 decimal digits.

Which of these `C` leads to the best accuracy on the validation set?

- 0.01
- 0.1
- 1
- 10
- 100
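
In scikit-learn, `C` is the inverse of the regularization strength, so smaller values regularize more strongly. A rough sketch of that effect on hypothetical toy data (the arrays below are made up) is to compare the size of the learned coefficients across the same grid of `C` values:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical, non-separable toy data: one feature, binary target
X = np.array([[0], [1], [2], [3], [4], [5], [6], [7]], dtype=float)
y = np.array([0, 0, 1, 0, 1, 0, 1, 1])

coef_norms = {}
for C in [0.01, 0.1, 1, 10, 100]:
    model = LogisticRegression(solver="liblinear", C=C, max_iter=1000, random_state=42)
    model.fit(X, y)
    # Norm of the coefficient vector as a simple measure of model "size"
    coef_norms[C] = float(np.linalg.norm(model.coef_))

for C, norm in coef_norms.items():
    print(f"C={C}: ||coef|| = {norm:.4f}")
```

On this toy data the coefficients are pulled toward zero at the smallest `C` and grow as `C` increases. Whether that changes validation accuracy depends on the data.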


In [7]:

# QUESTION 6: Regularized Logistic Regression
print("\n" + "=" * 80)
print("QUESTION 6: Regularized Logistic Regression")
print("=" * 80)

C_values = [0.01, 0.1, 1, 10, 100]
results = {}

for C in C_values:
    model_reg = LogisticRegression(solver='liblinear', C=C, max_iter=1000, random_state=42)
    model_reg.fit(X_train, y_train)

    y_pred_reg = model_reg.predict(X_val)
    accuracy_reg = accuracy_score(y_val, y_pred_reg)
    accuracy_reg_rounded = round(accuracy_reg, 3)

    results[C] = accuracy_reg_rounded
    print(f"C={C}: accuracy={accuracy_reg_rounded}")

# Find the best C; on ties, take the smallest C
best_accuracy = max(results.values())
best_C = min([c for c, acc in results.items() if acc == best_accuracy])

print(f"\nBest C: {best_C} with accuracy={best_accuracy}")



QUESTION 6: Regularized Logistic Regression
C=0.01: accuracy=0.743
C=0.1: accuracy=0.743
C=1: accuracy=0.743
C=10: accuracy=0.743
C=100: accuracy=0.743

Best C: 0.01 with accuracy=0.743



> **Note**: If there are multiple options, select the smallest `C`.

## Submit the results

* Submit your results here: https://courses.datatalks.club/ml-zoomcamp-2025/homework/hw03
* If your answer doesn't match options exactly, select the closest one

In [None]:

# SUMMARY OF ANSWERS
print("\n" + "=" * 80)
print("SUMMARY OF ANSWERS")
print("=" * 80)
print(f"Question 1 (Mode of 'industry'): {industry_mode}")
print(f"Question 2 (Largest correlation): {max_corr_pair}")
print(f"Question 3 (Highest MI score): {max_mi_feature}")
print(f"Question 4 (Accuracy): {accuracy_q4_rounded}")
print(f"Question 5 (Feature with the smallest difference): {min_diff_feature}")
print(f"Question 6 (Best C): {best_C}")
print("=" * 80)