# Module 3: Classification Homework (2025 cohort)

Dataset: Bank Marketing-style lead scoring; target column `converted`.


## 1. Setup

We use pandas and NumPy for data prep and scikit-learn for splitting, mutual information, accuracy scoring, and logistic regression. One-hot encoding is done with pandas.


In [2]:
import numpy as np
import pandas as pd
from itertools import combinations

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import mutual_info_score
from sklearn.linear_model import LogisticRegression

pd.set_option('display.float_format', lambda x: f'{x:.5f}')


## 2. Load and Inspect Data

We load the data once from the course URL. If you prefer a local copy, download the CSV to a `data/` folder and change `DATA_URL` accordingly.


In [3]:
DATA_URL = 'https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv'

df = pd.read_csv(DATA_URL)
df.head()


| | lead_source | industry | number_of_courses_viewed | annual_income | employment_status | location | interaction_count | lead_score | converted |
|---|---|---|---|---|---|---|---|---|---|
| 0 | paid_ads | NaN | 1 | 79450.0 | unemployed | south_america | 4 | 0.94 | 1 |
| 1 | social_media | retail | 1 | 46992.0 | employed | south_america | 1 | 0.80 | 0 |
| 2 | events | healthcare | 5 | 78796.0 | unemployed | australia | 3 | 0.69 | 1 |
| 3 | paid_ads | retail | 2 | 83843.0 | NaN | australia | 1 | 0.87 | 0 |
| 4 | referral | education | 3 | 85012.0 | self_employed | europe | 3 | 0.62 | 1 |


## 3. Target and Feature Types

- Target: `converted` (binary: 0/1)
- We'll detect numeric vs categorical by dtype; then apply the homework's imputation rules.


In [4]:
TARGET = 'converted'

# Separate features and target
full_cols = df.columns.tolist()
feature_cols = [c for c in full_cols if c != TARGET]

# Infer types
num_cols = [c for c in feature_cols if pd.api.types.is_numeric_dtype(df[c])]
cat_cols = [c for c in feature_cols if c not in num_cols]

num_cols, cat_cols


(['number_of_courses_viewed',
  'annual_income',
  'interaction_count',
  'lead_score'],
 ['lead_source', 'industry', 'employment_status', 'location'])
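As a rough illustration of the dtype-based split, the same check can be run on a tiny hand-made frame. The frame `toy` and its column names are hypothetical and are not part of the course dataset; only `pd.api.types.is_numeric_dtype` is taken from the cell above.

```python
import pandas as pd

# Hypothetical toy frame (not the course dataset)
toy = pd.DataFrame({
    'views': [1, 5, 2],                      # integer column
    'income': [79450.0, None, 83843.0],      # float column with a missing value
    'source': ['paid_ads', None, 'events'],  # string column with a missing value
})

# Same rule as above: numeric dtype means numeric, everything else is categorical
toy_num = [c for c in toy.columns if pd.api.types.is_numeric_dtype(toy[c])]
toy_cat = [c for c in toy.columns if c not in toy_num]

print(toy_num, toy_cat)
```

Missing values do not change the outcome here: a float column with NaNs is still numeric, and a string column with NaNs is still treated as categorical.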

## 4. Missing Values (Homework Data Preparation)

- Categorical NaNs → `'NA'`
- Numerical NaNs → `0.0`

We keep a clean `df_clean` for downstream steps.


In [5]:
df_clean = df.copy()

# Fill categorical with 'NA'
for c in cat_cols:
    df_clean[c] = df_clean[c].astype('object').fillna('NA')

# Fill numeric with 0.0
for c in num_cols:
    df_clean[c] = df_clean[c].astype('float64').fillna(0.0)

# Quick check remaining NAs
na_check = df_clean[feature_cols].isna().sum().sum()
na_check


np.int64(0)
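A minimal sketch of the two imputation rules on a hypothetical two-column frame (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy frame with one missing value per column
toy = pd.DataFrame({
    'industry': ['retail', None, 'education'],
    'annual_income': [46992.0, None, 85012.0],
})

toy_clean = toy.copy()
toy_clean['industry'] = toy_clean['industry'].fillna('NA')          # categorical rule
toy_clean['annual_income'] = toy_clean['annual_income'].fillna(0.0)  # numerical rule

remaining = int(toy_clean.isna().sum().sum())
print(toy_clean)
print(remaining)
```

After this step `'NA'` is an ordinary category, so it takes part in the mode, the mutual information scores, and the one-hot encoding like any other value.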

## 5. Question 1: Mode of `industry`

We compute the mode of `industry` after imputation, so the `'NA'` placeholder counts as a category like any other.


In [6]:
mode_industry = df_clean['industry'].mode(dropna=False).iloc[0]
mode_industry


'retail'

## 6. Question 2: Correlation Matrix (Numeric Only)

We compute a correlation matrix for the numeric features and then report the absolute correlation for the specified pairs:
- `interaction_count` and `lead_score`
- `number_of_courses_viewed` and `lead_score`
- `number_of_courses_viewed` and `interaction_count`
- `annual_income` and `lead_score`


In [7]:
corr = df_clean[num_cols].corr(numeric_only=True)

pairs = [
    ('interaction_count', 'lead_score'),
    ('number_of_courses_viewed', 'lead_score'),
    ('number_of_courses_viewed', 'interaction_count'),
    ('annual_income', 'lead_score'),
]

pair_vals = {}
for a, b in pairs:
    if a in corr.index and b in corr.columns:
        pair_vals[(a, b)] = float(abs(corr.loc[a, b]))
    else:
        pair_vals[(a, b)] = np.nan

pair_vals


{('interaction_count', 'lead_score'): 0.009888182496913131,
 ('number_of_courses_viewed', 'lead_score'): 0.004878998354681276,
 ('number_of_courses_viewed', 'interaction_count'): 0.023565222882888037,
 ('annual_income', 'lead_score'): 0.015609546050139008}
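To pick the answer from these values, one option is to take the pair with the largest absolute correlation. The sketch below copies the numbers from the output above and assumes that "biggest correlation" is meant in absolute terms; the name `largest_pair` is introduced here for illustration.

```python
# Absolute correlations copied from the output above
pair_vals = {
    ('interaction_count', 'lead_score'): 0.009888182496913131,
    ('number_of_courses_viewed', 'lead_score'): 0.004878998354681276,
    ('number_of_courses_viewed', 'interaction_count'): 0.023565222882888037,
    ('annual_income', 'lead_score'): 0.015609546050139008,
}

# Key with the largest value (assumption: "biggest" means largest magnitude)
largest_pair = max(pair_vals, key=pair_vals.get)
print(largest_pair, round(pair_vals[largest_pair], 4))
# ('number_of_courses_viewed', 'interaction_count') 0.0236
```

All four values are close to zero, so none of these pairs is strongly linearly related.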

## 7. Train/Val/Test Split (60/20/20)

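- Use `train_test_split` with `random_state=42`, stratified on the target, in two steps: split off 60% for training, then halve the remaining 40% into validation and test.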
- Ensure target is not in the features frame.


In [8]:
X = df_clean[feature_cols].copy()
y = df_clean[TARGET].astype('int64').values

# First split train vs temp (val+test)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y
)
# Split temp into val and test 50/50, so each becomes 20% of total
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp
)

len(X_train), len(X_val), len(X_test)


(877, 292, 293)
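The two-step split can be checked on synthetic data of the same length (877 + 292 + 293 = 1462 rows). This is only a sketch: `X_demo` and `y_demo` are random placeholders, not the course data, so only the split sizes and the effect of stratification carry over.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1462                                  # total rows implied by the sizes above
X_demo = np.arange(n).reshape(-1, 1)      # placeholder features
y_demo = rng.integers(0, 2, size=n)       # synthetic binary target

# Step 1: 60% train, 40% held out
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=42, stratify=y_demo
)
# Step 2: halve the held-out part into validation and test
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42, stratify=y_tmp
)

sizes = (len(X_tr), len(X_va), len(X_te))
# Share of positives per split; stratification keeps these nearly equal
rates = tuple(round(float(part.mean()), 3) for part in (y_tr, y_va, y_te))
print(sizes)
print(rates)
```

The held-out part has 585 rows, an odd number, which is why validation and test end up with 292 and 293 rows rather than equal sizes.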

## 8. Question 3: Mutual Information of Categorical Features

Compute mutual information on the training set only, round to 2 decimals, and report the largest among:
- `industry`, `location`, `lead_source`, `employment_status`.


In [9]:
def mutual_info_classif_series(x, y):
    # x and y must be 1D label arrays
    return mutual_info_score(x, y)

mi_features = ['industry', 'location', 'lead_source', 'employment_status']
mi_scores = {}
for c in mi_features:
    if c in X_train.columns:
        s = mutual_info_classif_series(X_train[c], y_train)
        mi_scores[c] = round(float(s), 2)
    else:
        mi_scores[c] = np.nan

mi_scores


{'industry': 0.01,
 'location': 0.0,
 'lead_source': 0.03,
 'employment_status': 0.01}
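For intuition about the scale of these scores, here is a small sketch with hand-made labels (all hypothetical). `mutual_info_score` uses the natural logarithm, so a feature that fully determines a balanced binary target scores ln 2 ≈ 0.693, while a feature that is independent of the target in the sample scores 0.

```python
import math
from sklearn.metrics import mutual_info_score

y_toy = [0, 0, 1, 1]
x_informative = ['a', 'a', 'b', 'b']    # determines y exactly
x_uninformative = ['a', 'b', 'a', 'b']  # independent of y in this sample

mi_informative = mutual_info_score(x_informative, y_toy)
mi_uninformative = mutual_info_score(x_uninformative, y_toy)

print(round(mi_informative, 4), round(math.log(2), 4))
print(round(mi_uninformative, 4))
```

Against that scale, scores of 0.00 to 0.03 indicate that each categorical feature on its own carries little information about `converted`.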

## 9. Question 4: Logistic Regression with One-Hot Encoding

We one-hot encode the categorical variables, take the column set from the training data, reindex the validation frame to those columns, and evaluate accuracy.


In [10]:
# One-hot encode train
X_train_enc = pd.get_dummies(X_train, columns=cat_cols, drop_first=False)

# Apply same columns to val: get_dummies and align
X_val_enc = pd.get_dummies(X_val, columns=cat_cols, drop_first=False)
X_val_enc = X_val_enc.reindex(columns=X_train_enc.columns, fill_value=0)

logreg = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
logreg.fit(X_train_enc, y_train)
val_pred = logreg.predict(X_val_enc)
acc_val = accuracy_score(y_val, val_pred)
acc_val, round(acc_val, 2)


(0.6815068493150684, 0.68)
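The reindex step matters when validation contains a category that training never saw, or lacks one that training has. A minimal sketch with hypothetical frames (`'antarctica'` is an invented category, and `dtype=int` is used here only to make the printed output easier to read):

```python
import pandas as pd

# Hypothetical toy frames: validation has an unseen category and lacks two seen ones
train_toy = pd.DataFrame({'location': ['europe', 'asia', 'africa'],
                          'lead_score': [0.6, 0.8, 0.9]})
val_toy = pd.DataFrame({'location': ['europe', 'antarctica'],
                        'lead_score': [0.5, 0.7]})

train_enc = pd.get_dummies(train_toy, columns=['location'], dtype=int)
val_enc = pd.get_dummies(val_toy, columns=['location'], dtype=int)

# Keep exactly the training columns: unseen categories are dropped,
# categories missing from validation are added as all-zero columns
val_enc = val_enc.reindex(columns=train_enc.columns, fill_value=0)

dummy_cols = [c for c in train_enc.columns if c.startswith('location_')]
print(val_enc)
```

The row with the unseen category ends up with zeros in every `location_*` column, so the model treats it as belonging to none of the known locations.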

## 10. Question 5: Simple Feature Elimination by Leaving-One-Out

Starting from the baseline model (Q4), we remove each candidate feature in turn, re-fit, and measure Δaccuracy = baseline - without(feature).
We report the smallest difference among the listed options.


In [11]:
baseline_acc = acc_val

candidates = ['industry', 'employment_status', 'lead_score']

acc_drop = {}
for feat in candidates:
    cols_minus = [c for c in feature_cols if c != feat]

    Xt = X_train[cols_minus]
    Xv = X_val[cols_minus]

    Xt_enc = pd.get_dummies(Xt, columns=[c for c in cols_minus if c in cat_cols], drop_first=False)
    Xv_enc = pd.get_dummies(Xv, columns=[c for c in cols_minus if c in cat_cols], drop_first=False)
    Xv_enc = Xv_enc.reindex(columns=Xt_enc.columns, fill_value=0)

    m = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
    m.fit(Xt_enc, y_train)
    pred = m.predict(Xv_enc)
    acc = accuracy_score(y_val, pred)

    acc_drop[feat] = float(baseline_acc - acc)

acc_drop


{'industry': -0.006849315068493178,
 'employment_status': 0.0,
 'lead_score': 0.006849315068493067}
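"Smallest difference" can be read in two ways, and the two readings give different answers here. The sketch below uses the values printed above; which reading the homework form intends is an assumption left to the reader, and the names `smallest_abs` and `smallest_signed` are introduced only for illustration.

```python
# Differences copied from the output above
acc_drop = {
    'industry': -0.006849315068493178,
    'employment_status': 0.0,
    'lead_score': 0.006849315068493067,
}

# Reading 1: smallest magnitude (removing the feature changes accuracy the least)
smallest_abs = min(acc_drop, key=lambda f: abs(acc_drop[f]))
# Reading 2: smallest signed value (removing the feature helps the most)
smallest_signed = min(acc_drop, key=acc_drop.get)

# With 292 validation rows, one prediction is worth 1/292 of accuracy
per_row = 1 / 292
print(smallest_abs, smallest_signed, round(per_row, 5))
# employment_status industry 0.00342
```

The nonzero differences equal 2/292, that is, two validation predictions, so these gaps are small relative to the size of the validation set.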

## 11. Question 6: Regularization Sweep

Evaluate validation accuracy for `C` in `[0.01, 0.1, 1, 10, 100]` using the same encoded features as in Q4. Report the best accuracy (rounded to 3 decimals) and the smallest `C` achieving it.


In [12]:
grid_C = [0.01, 0.1, 1, 10, 100]
acc_by_C = {}

for C in grid_C:
    m = LogisticRegression(solver='liblinear', C=C, max_iter=1000, random_state=42)
    m.fit(X_train_enc, y_train)
    p = m.predict(X_val_enc)
    acc = accuracy_score(y_val, p)
    acc_by_C[C] = float(acc)

acc_rounded = {k: round(v, 3) for k, v in acc_by_C.items()}
best_acc = max(acc_by_C.values())
best_C = min([C for C, a in acc_by_C.items() if abs(a - best_acc) < 1e-12])
acc_rounded, best_C, round(best_acc, 3)


({0.01: 0.688, 0.1: 0.682, 1: 0.682, 10: 0.682, 100: 0.682}, 0.01, 0.688)
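In scikit-learn, `C` is the inverse of the regularization strength, so smaller values shrink the coefficients more. The sketch below illustrates this on synthetic data from `make_classification`; the dataset and the names `X_demo`, `y_demo`, and `norms` are assumptions for illustration and have no connection to the lead-scoring data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, only for illustrating the effect of C
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)

norms = {}
for C in [0.01, 1, 100]:
    m = LogisticRegression(solver='liblinear', C=C, max_iter=1000, random_state=42)
    m.fit(X_demo, y_demo)
    # L2 norm of the fitted coefficient vector
    norms[C] = float(np.linalg.norm(m.coef_))

print(norms)
```

The coefficient norm grows with `C`. In the sweep above, the most strongly regularized model (`C=0.01`) also has the best validation accuracy, though the margin over the other settings is small.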

## 12. Summary for Submission

This cell collects all values needed for the multiple-choice form.


In [13]:
summary = {
    'Q1_mode_industry': str(mode_industry),
    'Q2_pairwise_corr_abs': pair_vals,  # inspect to choose the largest
    'Q3_mi_scores': mi_scores,
    'Q4_val_accuracy': round(acc_val, 2),
    'Q5_delta_accuracy_by_feature': acc_drop,
    'Q6_best_C': best_C,
    'Q6_best_val_accuracy': round(max(acc_by_C.values()), 3),
}
summary


{'Q1_mode_industry': 'retail',
 'Q2_pairwise_corr_abs': {('interaction_count',
   'lead_score'): 0.009888182496913131,
  ('number_of_courses_viewed', 'lead_score'): 0.004878998354681276,
  ('number_of_courses_viewed', 'interaction_count'): 0.023565222882888037,
  ('annual_income', 'lead_score'): 0.015609546050139008},
 'Q3_mi_scores': {'industry': 0.01,
  'location': 0.0,
  'lead_source': 0.03,
  'employment_status': 0.01},
 'Q4_val_accuracy': 0.68,
 'Q5_delta_accuracy_by_feature': {'industry': -0.006849315068493178,
  'employment_status': 0.0,
  'lead_score': 0.006849315068493067},
 'Q6_best_C': 0.01,
 'Q6_best_val_accuracy': 0.688}