## Week 3 Homework 2025

### Dataset

In this homework, we will use the lead scoring dataset. Download it from [here](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv).
Alternatively, download it with `wget`:

```bash
wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv
```



In [1]:
data = "https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv"
!wget -O course_lead_scoring.csv $data

--2025-10-09 04:11:10--  https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 80876 (79K) [text/plain]
Saving to: ‘course_lead_scoring.csv’


2025-10-09 04:11:11 (61.3 MB/s) - ‘course_lead_scoring.csv’ saved [80876/80876]



In [4]:
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score, mutual_info_score
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [5]:
data = pd.read_csv("course_lead_scoring.csv")
data.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| lead_source | paid_ads | social_media | events | paid_ads | referral |
| industry | NaN | retail | healthcare | retail | education |
| number_of_courses_viewed | 1 | 1 | 5 | 2 | 3 |
| annual_income | 79450.0 | 46992.0 | 78796.0 | 83843.0 | 85012.0 |
| employment_status | unemployed | employed | unemployed | NaN | self_employed |
| location | south_america | south_america | australia | australia | europe |
| interaction_count | 4 | 1 | 3 | 1 | 3 |
| lead_score | 0.94 | 0.8 | 0.69 | 0.87 | 0.62 |
| converted | 1 | 0 | 1 | 0 | 1 |


### Data preparation

* Check whether the features contain missing values.
* If there are missing values:
    * For categorical features, replace them with 'NA'
    * For numerical features, replace them with 0.0
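
The two rules above can be sketched on a small made-up frame. The column values and the dtype-based split are assumptions for illustration; the notebook itself lists the categorical columns by hand.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the lead scoring data (values are made up).
toy = pd.DataFrame({
    "industry": ["retail", None, "education"],
    "annual_income": [79450.0, np.nan, 85012.0],
    "interaction_count": [4, 1, 3],
})

# Split columns by dtype: anything non-numeric is treated as categorical.
categorical = toy.select_dtypes(exclude="number").columns
numerical = toy.select_dtypes(include="number").columns

# Assign the result back rather than calling fillna(inplace=True) on a column.
toy[categorical] = toy[categorical].fillna("NA")
toy[numerical] = toy[numerical].fillna(0.0)

print(toy.isnull().sum().sum())
```

Assigning the filled columns back avoids the chained-assignment pattern that newer pandas versions warn about.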


In [6]:
data.isnull().sum()

lead_source                 128
industry                    134
number_of_courses_viewed      0
annual_income               181
employment_status           100
location                     63
interaction_count             0
lead_score                    0
converted                     0
dtype: int64

In [7]:
data.describe()

| | number_of_courses_viewed | annual_income | interaction_count | lead_score | converted |
|---|---|---|---|---|---|
| count | 1462.0 | 1281.0 | 1462.0 | 1462.0 | 1462.0 |
| mean | 2.031464 | 59886.273224 | 2.976744 | 0.506108 | 0.619015 |
| std | 1.449717 | 15070.140389 | 1.681564 | 0.288465 | 0.485795 |
| min | 0.0 | 13929.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1.0 | 49698.0 | 2.0 | 0.2625 | 0.0 |
| 50% | 2.0 | 60148.0 | 3.0 | 0.51 | 1.0 |
| 75% | 3.0 | 69639.0 | 4.0 | 0.75 | 1.0 |
| max | 9.0 | 109899.0 | 11.0 | 1.0 | 1.0 |


In [8]:
data.dtypes

lead_source                  object
industry                     object
number_of_courses_viewed      int64
annual_income               float64
employment_status            object
location                     object
interaction_count             int64
lead_score                  float64
converted                     int64
dtype: object

In [9]:
category = ['lead_source', 'industry', 'employment_status', 'location']
for c in category:
    data[c] = data[c].fillna('NA')

In [10]:
data.isnull().sum()

lead_source                   0
industry                      0
number_of_courses_viewed      0
annual_income               181
employment_status             0
location                      0
interaction_count             0
lead_score                    0
converted                     0
dtype: int64

In [11]:
data['annual_income'] = data['annual_income'].fillna(0.0)


In [12]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1462 entries, 0 to 1461
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   lead_source               1462 non-null   object 
 1   industry                  1462 non-null   object 
 2   number_of_courses_viewed  1462 non-null   int64  
 3   annual_income             1462 non-null   float64
 4   employment_status         1462 non-null   object 
 5   location                  1462 non-null   object 
 6   interaction_count         1462 non-null   int64  
 7   lead_score                1462 non-null   float64
 8   converted                 1462 non-null   int64  
dtypes: float64(2), int64(3), object(4)
memory usage: 102.9+ KB


### Question 1

What is the most frequent observation (mode) for the column `industry`?

- `NA`
- `technology`
- `healthcare`
- `retail`

In [13]:
data['industry'].mode()

0    retail
Name: industry, dtype: object

The most frequent observation (mode) for the column industry is __retail__.
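
As a small aside, `mode()` returns a Series rather than a single value because several values can tie for most frequent. A minimal sketch on made-up values, cross-checking it against `value_counts()`:

```python
import pandas as pd

# Made-up industry values; 'retail' appears most often.
industry = pd.Series(["retail", "NA", "retail", "healthcare", "NA", "retail"])

counts = industry.value_counts()  # frequencies, sorted in descending order
top = counts.idxmax()             # label with the highest count

# mode() returns a Series, so take its first element to get a scalar.
print(top, industry.mode()[0])
```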

### Question 2

Create the [correlation matrix](https://www.google.com/search?q=correlation+matrix) for the numerical features of your dataset. 
In a correlation matrix, you compute the correlation coefficient between every pair of features.

What are the two features that have the biggest correlation?

- `interaction_count` and `lead_score`
- `number_of_courses_viewed` and `lead_score`
- `number_of_courses_viewed` and `interaction_count`
- `annual_income` and `interaction_count`


In [14]:
data_numeric = data.copy()
data_numeric = data_numeric.drop(['lead_source', 'industry', 'employment_status', 'location', 'converted'], axis = 1)
data_numeric.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1462 entries, 0 to 1461
Data columns (total 4 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   number_of_courses_viewed  1462 non-null   int64  
 1   annual_income             1462 non-null   float64
 2   interaction_count         1462 non-null   int64  
 3   lead_score                1462 non-null   float64
dtypes: float64(2), int64(2)
memory usage: 45.8 KB


In [15]:
data_numeric.corr()

| | number_of_courses_viewed | annual_income | interaction_count | lead_score |
|---|---|---|---|---|
| number_of_courses_viewed | 1.0 | 0.00977 | -0.023565 | -0.004879 |
| annual_income | 0.00977 | 1.0 | 0.027036 | 0.01561 |
| interaction_count | -0.023565 | 0.027036 | 1.0 | 0.009888 |
| lead_score | -0.004879 | 0.01561 | 0.009888 | 1.0 |


In [16]:
data_numeric.corr().unstack().sort_values(ascending = False)

number_of_courses_viewed  number_of_courses_viewed    1.000000
annual_income             annual_income               1.000000
lead_score                lead_score                  1.000000
interaction_count         interaction_count           1.000000
annual_income             interaction_count           0.027036
interaction_count         annual_income               0.027036
lead_score                annual_income               0.015610
annual_income             lead_score                  0.015610
lead_score                interaction_count           0.009888
interaction_count         lead_score                  0.009888
annual_income             number_of_courses_viewed    0.009770
number_of_courses_viewed  annual_income               0.009770
lead_score                number_of_courses_viewed   -0.004879
number_of_courses_viewed  lead_score                 -0.004879
                          interaction_count          -0.023565
interaction_count         number_of_courses_viewed   -0.023565
dtype: float64

The two features with the biggest correlation are __annual_income__ and __interaction_count__ (coefficient 0.027), although every pairwise correlation here is close to zero.
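
Sorting the unstacked matrix lists every pair twice and puts the diagonal on top. One possible way to get the strongest pair directly is to mask everything except the strict upper triangle first. This is a sketch on synthetic data; the column names `a`, `b`, `c` are made up.

```python
import numpy as np
import pandas as pd

# Synthetic data: b is built from a, c is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

corr = toy.corr()

# Keep only the strict upper triangle so each pair appears once
# and the diagonal of 1.0 values is dropped.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).unstack().dropna()

# Rank by absolute value so a strong negative correlation also counts.
best_pair = pairs.abs().idxmax()
print(best_pair, pairs.loc[best_pair])
```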

### Split the data

* Split your data in train/val/test sets with 60%/20%/20% distribution.
* Use Scikit-Learn for that (the `train_test_split` function) and set the seed to `42`.
* Make sure that the target value `y` is not in your dataframe.
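
The 60%/20%/20% split takes two calls to `train_test_split`, and the second call needs `test_size=0.25` because 25% of the remaining 80% is 20% of the original data. A minimal sketch using a plain index array with the same number of rows as the dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(1462)  # stand-in for the 1462 rows of the dataset

# First split: hold out 20% for testing.
full_train, test = train_test_split(rows, test_size=0.2, random_state=42)

# Second split: 0.25 of the remaining 80% equals 20% of the original data.
train, val = train_test_split(full_train, test_size=0.25, random_state=42)

print(len(train), len(val), len(test))
```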


In [17]:
ranseed = 42
df_full_train, df_test = train_test_split(data, test_size = 0.2, random_state = ranseed)
df_train, df_val = train_test_split(df_full_train, test_size = 0.25, random_state = ranseed)

assert len(data) == len(df_train) + len(df_val) + len(df_test)

In [18]:
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [19]:
y_train = df_train.converted.values
y_test = df_test.converted.values
y_val = df_val.converted.values

In [20]:
df_train = df_train.drop(['converted'], axis = 1)
df_val = df_val.drop(['converted'], axis = 1)
df_test = df_test.drop(['converted'], axis = 1)

In [21]:
df_val.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 293 entries, 0 to 292
Data columns (total 8 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   lead_source               293 non-null    object 
 1   industry                  293 non-null    object 
 2   number_of_courses_viewed  293 non-null    int64  
 3   annual_income             293 non-null    float64
 4   employment_status         293 non-null    object 
 5   location                  293 non-null    object 
 6   interaction_count         293 non-null    int64  
 7   lead_score                293 non-null    float64
dtypes: float64(2), int64(2), object(4)
memory usage: 18.4+ KB


### Question 3

* Calculate the mutual information score between `y` and other categorical variables in the dataset. Use the training set only.
* Round the scores to 2 decimals using `round(score, 2)`.

Which of these variables has the biggest mutual information score?
  
- `industry`
- `location`
- `lead_source`
- `employment_status`


In [22]:
def calculate_mi(series):
    return mutual_info_score(series, y_train)

In [23]:
cat = ['lead_source', 'industry', 'employment_status', 'location']
mi = df_train[cat].apply(calculate_mi)
mi = mi.sort_values(ascending=False).to_frame(name='MI')
mi

| | MI |
|---|---|
| lead_source | 0.035396 |
| employment_status | 0.012938 |
| industry | 0.011575 |
| location | 0.004464 |


__lead_source__ has the biggest mutual information score (0.04 after rounding to 2 decimals).
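
To build intuition for the scale of these scores, here is a minimal sketch on made-up labels: a variable that fully determines a balanced binary target scores ln 2 ≈ 0.69, and one that is independent of the target scores 0.

```python
from sklearn.metrics import mutual_info_score

target = [1, 1, 0, 0, 1, 1, 0, 0]

# Made-up categorical variables.
informative = ["a", "a", "b", "b", "a", "a", "b", "b"]    # determines the target
uninformative = ["x", "y", "x", "y", "x", "y", "x", "y"]  # independent of the target

mi_informative = mutual_info_score(informative, target)
mi_uninformative = mutual_info_score(uninformative, target)

print(round(mi_informative, 2), round(mi_uninformative, 2))
```

Against that scale, the scores in the table above are all small, so no single categorical variable comes close to determining `converted` on its own.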

### Question 4

* Now let's train a logistic regression.
* Remember that we have several categorical variables in the dataset. Include them using one-hot encoding.
* Fit the model on the training dataset.
    - To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
    - `model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)`
* Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

What accuracy did you get?

- 0.64
- 0.74
- 0.84
- 0.94

In [24]:
dv = DictVectorizer(sparse = False)
train_dict = df_train.to_dict(orient = 'records')
X_train = dv.fit_transform(train_dict)

In [25]:
model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
model.fit(X_train, y_train)

| Parameter | Value |
|---|---|
| penalty | 'l2' |
| dual | False |
| tol | 0.0001 |
| C | 1.0 |
| fit_intercept | True |
| intercept_scaling | 1 |
| class_weight | None |
| random_state | 42 |
| solver | 'liblinear' |
| max_iter | 1000 |


In [26]:
val_dict = df_val.to_dict(orient='records')
X_val = dv.transform(val_dict)

y_pred = model.predict(X_val)

In [27]:
accuracy = np.round(accuracy_score(y_val, y_pred),2)
print(f'Accuracy = {accuracy}')

Accuracy = 0.7

The validation accuracy is 0.70, so the closest of the listed options is __0.74__.


### Question 5 

* Let's find the least useful feature using the *feature elimination* technique.
* Train a model using the same features and parameters as in Q4 (without rounding).
* Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
* For each feature, calculate the difference between the original accuracy and the accuracy without the feature. 

Which of the following features has the smallest difference?

- `'industry'`
- `'employment_status'`
- `'lead_score'`

> **Note**: The difference doesn't have to be positive.

In [28]:
features = df_train.columns.to_list()
features

['lead_source',
 'industry',
 'number_of_courses_viewed',
 'annual_income',
 'employment_status',
 'location',
 'interaction_count',
 'lead_score']

In [29]:
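# Note: `accuracy` was rounded to 2 decimals in cell 27, so the differences
# below are measured against 0.70 rather than the unrounded accuracy.
original_score = accuracy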
scores = pd.DataFrame(columns=['eliminated_feature', 'accuracy', 'difference'])
for feature in features:
    subset = features.copy()
    subset.remove(feature)

    dv = DictVectorizer(sparse = False)
    train_dict = df_train[subset].to_dict(orient = 'records')
    X_train = dv.fit_transform(train_dict)

    val_dict = df_val[subset].to_dict(orient='records')
    X_val = dv.transform(val_dict)

    model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
    
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)

    score = accuracy_score(y_val, y_pred)

    scores.loc[len(scores)] = [feature, score, original_score - score]

In [34]:
scores[scores['eliminated_feature'].isin(['industry', 'employment_status', 'lead_score'])]

| | eliminated_feature | accuracy | difference |
|---|---|---|---|
| 1 | industry | 0.699659 | 0.000341 |
| 4 | employment_status | 0.696246 | 0.003754 |
| 7 | lead_score | 0.706485 | -0.006485 |


__Industry__ has the smallest difference in absolute value (0.000341).
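
The elimination loop can also be written around a small helper that returns the unrounded validation accuracy for a given column subset. This is a sketch on synthetic data, not the notebook's actual run: the helper name `fit_and_score` and the columns `signal`, `noise`, and `group` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


def fit_and_score(train_df, y_tr, val_df, y_va, columns):
    """Train on the given columns and return the unrounded validation accuracy."""
    dv = DictVectorizer(sparse=False)
    X_tr = dv.fit_transform(train_df[columns].to_dict(orient="records"))
    X_va = dv.transform(val_df[columns].to_dict(orient="records"))
    model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=42)
    model.fit(X_tr, y_tr)
    return accuracy_score(y_va, model.predict(X_va))


# Synthetic data (made up): `signal` drives the target, the other columns do not.
rng = np.random.default_rng(42)
n = 400
toy = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise": rng.normal(size=n),
    "group": rng.choice(["a", "b"], size=n),
})
y = (toy["signal"] > 0).astype(int).values
train_df, val_df = toy.iloc[:300], toy.iloc[300:]
y_tr, y_va = y[:300], y[300:]

columns = list(toy.columns)
baseline = fit_and_score(train_df, y_tr, val_df, y_va, columns)

# Difference = baseline accuracy minus accuracy without the feature.
differences = {
    col: baseline - fit_and_score(train_df, y_tr, val_df, y_va,
                                  [c for c in columns if c != col])
    for col in columns
}
print(baseline, differences)
```

Computing the baseline with the same helper keeps it unrounded and directly comparable to each reduced model.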

### Question 6

* Now let's train a regularized logistic regression.
* Let's try the following values of the parameter `C`: `[0.01, 0.1, 1, 10, 100]`.
* Train models using all the features as in Q4.
* Calculate the accuracy on the validation dataset and round it to 3 decimal digits.

Which of these `C` leads to the best accuracy on the validation set?

- 0.01
- 0.1
- 1
- 10
- 100

> **Note**: If there are multiple options, select the smallest `C`.


In [36]:
df_full_train, df_test = train_test_split(data, test_size = 0.2, random_state = 42)
df_train, df_val = train_test_split(df_full_train, test_size = 0.25, random_state = 42)

In [37]:
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [38]:
y_train = df_train.converted.values
y_val = df_val.converted.values
y_test = df_test.converted.values

In [39]:
df_train = df_train.drop(['converted'], axis = 1)
df_val = df_val.drop(['converted'], axis = 1)
df_test = df_test.drop(['converted'], axis = 1)

assert 'converted' not in df_train
assert 'converted' not in df_val
assert 'converted' not in df_test

In [40]:
dv = DictVectorizer(sparse = False)
train_dict = df_train.to_dict(orient = 'records')
X_train = dv.fit_transform(train_dict)

In [41]:
val_dict = df_val.to_dict(orient = 'records')
X_val = dv.transform(val_dict)

In [49]:
results = []
for C in [0.01, 0.1, 1, 10, 100]:
    model = LogisticRegression(
        penalty='l2', 
        C=C, 
        solver='liblinear', 
        max_iter=1000,
        random_state=42
    )
    model.fit(X_train, y_train)
    
    y_pred = model.predict(X_val)
    accuracy = accuracy_score(y_val, y_pred)
    
    results.append({'C': C, 'Accuracy': round(accuracy, 3)})

In [50]:
result_data = pd.DataFrame(results)
result_data

| | C | Accuracy |
|---|---|---|
| 0 | 0.01 | 0.7 |
| 1 | 0.1 | 0.7 |
| 2 | 1.0 | 0.7 |
| 3 | 10.0 | 0.7 |
| 4 | 100.0 | 0.7 |


All five values of `C` give the same validation accuracy (0.7 after rounding to 3 decimals), so following the note above we select the smallest one: __0.01__.
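
The tie-breaking rule can also be applied programmatically instead of by eye. A minimal sketch, with the accuracies copied from the results table above:

```python
import pandas as pd

# Accuracies copied from the results table above.
results = pd.DataFrame({
    "C": [0.01, 0.1, 1, 10, 100],
    "Accuracy": [0.7, 0.7, 0.7, 0.7, 0.7],
})

# Sort by accuracy (descending), then by C (ascending), and take the first row,
# so ties on accuracy resolve to the smallest C.
best = results.sort_values(["Accuracy", "C"], ascending=[False, True]).iloc[0]
print(best["C"])
```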