## Homework

### Dataset

In this homework, we will use the lead scoring dataset. Download it from [here](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv).

Or you can do it with `wget`:

```
wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv
```

In this dataset, the target for the classification task is the `converted` variable: whether the client has signed up to the platform or not.

In [1]:
#Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [19]:
data = 'https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv'

### Data preparation

* Check whether the features contain missing values.
* If there are missing values:
    * For categorical features, replace them with 'NA'
    * For numerical features, replace them with 0.0

In [30]:
df = pd.read_csv(data)
df.head()

| | lead_source | industry | number_of_courses_viewed | annual_income | employment_status | location | interaction_count | lead_score | converted |
|---|---|---|---|---|---|---|---|---|---|
| 0 | paid_ads | NaN | 1 | 79450.0 | unemployed | south_america | 4 | 0.94 | 1 |
| 1 | social_media | retail | 1 | 46992.0 | employed | south_america | 1 | 0.80 | 0 |
| 2 | events | healthcare | 5 | 78796.0 | unemployed | australia | 3 | 0.69 | 1 |
| 3 | paid_ads | retail | 2 | 83843.0 | NaN | australia | 1 | 0.87 | 0 |
| 4 | referral | education | 3 | 85012.0 | self_employed | europe | 3 | 0.62 | 1 |


In [39]:
#Check whether the features contain missing values
df.isna().sum()

lead_source                 128
industry                    134
number_of_courses_viewed      0
annual_income               181
employment_status           100
location                     63
interaction_count             0
lead_score                    0
converted                     0
dtype: int64

In [32]:
df.dtypes

lead_source                  object
industry                     object
number_of_courses_viewed      int64
annual_income               float64
employment_status            object
location                     object
interaction_count             int64
lead_score                  float64
converted                     int64
dtype: object

In [38]:
categorical_columns = list(df.select_dtypes(include="object").columns)
categorical_columns

['lead_source', 'industry', 'employment_status', 'location']

In [37]:
numerical_columns = list(df.select_dtypes(exclude="object").columns)
numerical_columns

['number_of_courses_viewed',
 'annual_income',
 'interaction_count',
 'lead_score',
 'converted']

In [41]:
for col in categorical_columns:
    df[col] = df[col].fillna("NA")
for num in numerical_columns:
    df[num] = df[num].fillna(0.0)

In [42]:
df.isna().sum()

lead_source                 0
industry                    0
number_of_courses_viewed    0
annual_income               0
employment_status           0
location                    0
interaction_count           0
lead_score                  0
converted                   0
dtype: int64

### Question 1

What is the most frequent observation (mode) for the column `industry`?

- `NA`
- `technology`
- `healthcare`
- `retail`

In [46]:
df["industry"].value_counts(ascending=False)

industry
retail           203
finance          200
other            198
healthcare       187
education        187
technology       179
manufacturing    174
NA               134
Name: count, dtype: int64

The most frequent observation for the column `industry` is **retail**, with **203** values.

### Question 2

Create the [correlation matrix](https://www.google.com/search?q=correlation+matrix) for the numerical features of your dataset. 
In a correlation matrix, you compute the correlation coefficient between every pair of features.

What are the two features that have the biggest correlation?

- `interaction_count` and `lead_score`
- `number_of_courses_viewed` and `lead_score`
- `number_of_courses_viewed` and `interaction_count`
- `annual_income` and `interaction_count`

Only consider the pairs above when answering this question.

In [47]:
df[numerical_columns].corr()

| | number_of_courses_viewed | annual_income | interaction_count | lead_score | converted |
|---|---|---|---|---|---|
| number_of_courses_viewed | 1.0 | 0.00977 | -0.023565 | -0.004879 | 0.435914 |
| annual_income | 0.00977 | 1.0 | 0.027036 | 0.01561 | 0.053131 |
| interaction_count | -0.023565 | 0.027036 | 1.0 | 0.009888 | 0.374573 |
| lead_score | -0.004879 | 0.01561 | 0.009888 | 1.0 | 0.193673 |
| converted | 0.435914 | 0.053131 | 0.374573 | 0.193673 | 1.0 |


From the options above, `annual_income` and `interaction_count` have the highest correlation of **0.027036**.
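
Reading only the listed pairs out of a correlation matrix can be sketched as follows. This is a minimal illustration on toy data with made-up values, not the homework dataset; `toy`, `candidate_pairs`, and `pair_corr` are hypothetical names introduced here.

```python
import pandas as pd

# Toy data (hypothetical values, not the homework dataset)
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],  # perfectly correlated with "a"
    "c": [5, 3, 4, 1, 2],   # negatively correlated with "a"
})

corr = toy.corr()

# Restrict the comparison to the pairs of interest
candidate_pairs = [("a", "b"), ("a", "c"), ("b", "c")]

# Look up each pair's coefficient in the matrix
pair_corr = {pair: corr.loc[pair[0], pair[1]] for pair in candidate_pairs}

# Pick the pair with the largest (signed) coefficient
best_pair = max(pair_corr, key=pair_corr.get)
print(best_pair, round(pair_corr[best_pair], 3))
```

Ranking by `abs(...)` instead would compare the strength of the relationship regardless of sign. For the four pairs listed in this question, both rankings select the same pair.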

### Split the data

* Split your data into train/val/test sets with a 60%/20%/20% distribution.
* Use Scikit-Learn for that (the `train_test_split` function) and set the seed to `42`.
* Make sure that the target value `y` is not in your dataframe.

In [51]:
from sklearn.model_selection import train_test_split
#Split full train and test set
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=42)

#Split train and validation set
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=42)

#Reset the indices
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

#Set the y_train, y_val and y_test aside
y_train = df_train.converted.values
y_val = df_val.converted.values
y_test = df_test.converted.values

#Drop the targets from the dfs
del df_train['converted']
del df_val['converted']
del df_test['converted']

In [52]:
len(df_train), len(df_val), len(df_test)

(876, 293, 293)
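
The second split uses `test_size=0.25` because 25% of the remaining 80% equals 20% of the original data. A small sketch of that arithmetic, assuming a plain array of 1462 row indices (the sum of the three sizes above) in place of the real dataframe:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the dataframe: one index per row (876 + 293 + 293 = 1462)
rows = np.arange(1462)

# First split: hold out 20% for testing
full_train, test = train_test_split(rows, test_size=0.2, random_state=42)

# Second split: 0.25 of the remaining 80% is 20% of the original data
train, val = train_test_split(full_train, test_size=0.25, random_state=42)

print(len(train), len(val), len(test))
```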

### Question 3

* Calculate the mutual information score between `y` and other categorical variables in the dataset. Use the training set only.
* Round the scores to 2 decimals using `round(score, 2)`.

Which of these variables has the biggest mutual information score?
  
- `industry`
- `location`
- `lead_source`
- `employment_status`

In [60]:
#Import mutual_info_score from sklearn
from sklearn.metrics import mutual_info_score

#Create a function to calc the mutual_info_score
def mutual_info_converted_score(series):
    return mutual_info_score(series, df_full_train.converted)

#Apply the function to the categorical variables and round to 2 decimals
#Note: df_full_train holds the train and validation rows together (60% + 20%)
mutual_info = round(df_full_train[categorical_columns].apply(mutual_info_converted_score),2)

#Sort the results
mutual_info.sort_values(ascending=False)

lead_source          0.03
industry             0.01
employment_status    0.01
location             0.00
dtype: float64

From the sorted results above, `lead_source` has the highest mutual information score: **0.03**.
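
Since the question asks for the training set only, one way to compute the scores is to pair each categorical column of the training split with the separately stored target. The sketch below assumes toy data with made-up values; `toy_train` and `toy_y_train` are hypothetical stand-ins for `df_train` and `y_train`.

```python
import pandas as pd
from sklearn.metrics import mutual_info_score

# Toy training split (hypothetical values); the target is stored separately
toy_train = pd.DataFrame({
    "lead_source": ["ads", "ads", "referral", "referral", "events", "events"],
    "location": ["eu", "us", "eu", "us", "eu", "us"],
})
toy_y_train = [1, 1, 0, 0, 1, 0]

# Score every categorical column against the target
mi = toy_train.apply(lambda col: mutual_info_score(col, toy_y_train))

print(mi.round(2).sort_values(ascending=False))
```

In this toy example `lead_source` separates the two classes better than `location`, so it receives the higher score.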

### Question 4

* Now let's train a logistic regression.
* Remember that we have several categorical variables in the dataset. Include them using one-hot encoding.
* Fit the model on the training dataset.
    - To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
    - `model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)`
* Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

What accuracy did you get?

- 0.64
- 0.74
- 0.84
- 0.94

In [65]:
#Drop the target from the numerical columns (safe to re-run)
numerical_columns = [col for col in numerical_columns if col != 'converted']

In [68]:
print(numerical_columns)

['number_of_courses_viewed', 'annual_income', 'interaction_count', 'lead_score']


In [69]:
#Import DictVectorizer from sklearn
from sklearn.feature_extraction import DictVectorizer

#Instantiate DictVectorizer
dv = DictVectorizer(sparse=False)

#Convert the training dataset to dictionary
train_dict = df_train[categorical_columns + numerical_columns].to_dict(orient='records')

#Fit and transform to one hot encode the training set
X_train = dv.fit_transform(train_dict)

#Convert the validation dataset to dictionary
val_dict = df_val[categorical_columns + numerical_columns].to_dict(orient='records')

#Transform to one hot encode the validation set
X_val = dv.transform(val_dict)

In [72]:
#Import the LogisticRegression algorithm
from sklearn.linear_model import LogisticRegression

#Instantiate the model
model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)

#Fit the model to training data
model.fit(X_train, y_train)

#Make predictions on validation dataset
y_pred = model.predict_proba(X_val)[:, 1]

converted_decision = (y_pred >= 0.5)
accuracy = (y_val == converted_decision).mean()
print(f"Validation accuracy: {accuracy:.2f}")

Validation accuracy: 0.70
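
The manual accuracy above (threshold the probabilities at 0.5, then take the share of matching labels) should agree with Scikit-Learn's `accuracy_score`. A minimal check on hypothetical probabilities and labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical predicted probabilities and true labels
probs = np.array([0.9, 0.2, 0.6, 0.4, 0.5])
y_true = np.array([1, 0, 0, 0, 1])

# Manual computation: threshold, compare, average
decision = probs >= 0.5
manual = (y_true == decision).mean()

# Same computation through the library helper
library = accuracy_score(y_true, decision.astype(int))

print(manual, library)
```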


### Question 5 

* Let's find the least useful feature using the *feature elimination* technique.
* Train a model using the same features and parameters as in Q4 (without rounding).
* Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
* For each feature, calculate the difference between the original accuracy and the accuracy without the feature. 

Which of the following features has the smallest difference?

- `'industry'`
- `'employment_status'`
- `'lead_score'`

> **Note**: The difference doesn't have to be positive.

In [73]:
#Import DictVectorizer from sklearn
from sklearn.feature_extraction import DictVectorizer

#Instantiate DictVectorizer
dv = DictVectorizer(sparse=False)

#Convert the training dataset to dictionary
train_dict = df_train[categorical_columns + numerical_columns].to_dict(orient='records')

#Fit and transform to one hot encode the training set
X_train = dv.fit_transform(train_dict)

#Convert the validation dataset to dictionary
val_dict = df_val[categorical_columns + numerical_columns].to_dict(orient='records')

#Transform to one hot encode the validation set
X_val = dv.transform(val_dict)

#Import the LogisticRegression algorithm
from sklearn.linear_model import LogisticRegression

#Instantiate the model
model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)

#Fit the model to training data
model.fit(X_train, y_train)

#Make predictions on validation dataset
y_pred = model.predict_proba(X_val)[:, 1]

converted_decision = (y_pred >= 0.5)
base_accuracy = (y_val == converted_decision).mean()
print(f"Validation accuracy: {base_accuracy}")

Validation accuracy: 0.6996587030716723


In [88]:
base_accuracy = (y_val == (model.predict_proba(X_val)[:, 1] >= 0.5)).mean()
# Store differences
feature_diffs = {}

for feature in ['industry', 'employment_status', 'lead_score']:
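    # Drop this feature's one-hot columns (named 'feature=value') from training and validation
    # Caveat: numerical features such as lead_score are named without '=', so this filter leaves them in place
    idx_to_keep = [i for i, name in enumerate(dv.get_feature_names_out()) if not name.startswith(feature + '=')]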

    X_train_new = X_train[:, idx_to_keep]
    X_val_new = X_val[:, idx_to_keep]

    # Train new model
    model_new = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
    model_new.fit(X_train_new, y_train)

    # Compute accuracy without this feature
    y_pred_new = model_new.predict_proba(X_val_new)[:, 1]
    accuracy_new = (y_val == (y_pred_new >= 0.5)).mean()

    # Record difference
    diff = base_accuracy - accuracy_new
    feature_diffs[feature] = diff
    print(f"{feature}: change_accuracy = {diff:.6f}")

# Find feature with smallest difference
least_useful = min(feature_diffs, key=feature_diffs.get)
print(f"\nLeast useful feature: {least_useful}")

industry: change_accuracy = 0.000000
employment_status: change_accuracy = 0.003413
lead_score: change_accuracy = 0.000000

Least useful feature: industry


### Question 6

* Now let's train a regularized logistic regression.
* Let's try the following values of the parameter `C`: `[0.01, 0.1, 1, 10, 100]`.
* Train models using all the features as in Q4.
* Calculate the accuracy on the validation dataset and round it to 3 decimal digits.

Which of these `C` leads to the best accuracy on the validation set?

- 0.01
- 0.1
- 1
- 10
- 100

> **Note**: If there are multiple options, select the smallest `C`.

In [87]:
# Define the range of C values to test
C_values = [0.01, 0.1, 1, 10, 100]

# Store results
accuracies = {}

for C in C_values:
    model_reg = LogisticRegression(solver='liblinear', C=C, max_iter=1000, random_state=42)
    model_reg.fit(X_train, y_train)

    y_pred = model_reg.predict_proba(X_val)[:, 1]
    val_accuracy = (y_val == (y_pred >= 0.5)).mean()

    accuracies[C] = round(val_accuracy, 3)
    print(f"C: {C:<6} Validation accuracy: {val_accuracy:.3f}")

C: 0.01   Validation accuracy: 0.700
C: 0.1    Validation accuracy: 0.700
C: 1      Validation accuracy: 0.700
C: 10     Validation accuracy: 0.700
C: 100    Validation accuracy: 0.700
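
All five values of `C` give the same rounded accuracy, so the tie-breaking rule from the note applies. A short sketch of that rule, using the rounded accuracies printed above:

```python
# Rounded validation accuracies as printed above
accuracies = {0.01: 0.700, 0.1: 0.700, 1: 0.700, 10: 0.700, 100: 0.700}

# Best accuracy reached by any C
best_accuracy = max(accuracies.values())

# Among the C values that reach it, take the smallest
best_C = min(C for C, acc in accuracies.items() if acc == best_accuracy)

print(best_C)  # 0.01
```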


## Submit the results

* Submit your results here: https://courses.datatalks.club/ml-zoomcamp-2025/homework/hw03
* If your answer doesn't match the options exactly, select the closest one.