
## 0. Setup

Minimal stack: `pandas`/`numpy` for data, `scikit-learn` for splitting, preprocessing, modeling, and metrics.


In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, mutual_info_score

## 1. Load data and identify schema

**Target:** `converted` (has the client signed up).  
**Features:** We'll programmatically identify categorical (object dtype) vs numerical and then confirm the specific columns used later in the questions.

In [13]:
# Pull in the data and preview it's contents
df = pd.read_csv('https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv')

df.head()

Unnamed: 0,lead_source,industry,number_of_courses_viewed,annual_income,employment_status,location,interaction_count,lead_score,converted
0,paid_ads,,1,79450.0,unemployed,south_america,4,0.94,1
1,social_media,retail,1,46992.0,employed,south_america,1,0.8,0
2,events,healthcare,5,78796.0,unemployed,australia,3,0.69,1
3,paid_ads,retail,2,83843.0,,australia,1,0.87,0
4,referral,education,3,85012.0,self_employed,europe,3,0.62,1


In [3]:


target_col = 'converted'
assert target_col in df.columns, "Expected 'converted' in dataset"

# Identify by dtype
categorical_cols = [c for c in df.columns if df[c].dtype == 'object' and c != target_col]
numeric_cols = [c for c in df.columns if c not in categorical_cols + [target_col]]

print("Categorical columns:", categorical_cols)
print("Numeric columns:", numeric_cols)
print("\nPreview:")
df.head()

Categorical columns: ['lead_source', 'industry', 'employment_status', 'location']
Numeric columns: ['number_of_courses_viewed', 'annual_income', 'interaction_count', 'lead_score']

Preview:


Unnamed: 0,lead_source,industry,number_of_courses_viewed,annual_income,employment_status,location,interaction_count,lead_score,converted
0,paid_ads,,1,79450.0,unemployed,south_america,4,0.94,1
1,social_media,retail,1,46992.0,employed,south_america,1,0.8,0
2,events,healthcare,5,78796.0,unemployed,australia,3,0.69,1
3,paid_ads,retail,2,83843.0,,australia,1,0.87,0
4,referral,education,3,85012.0,self_employed,europe,3,0.62,1



## 2. Data preparation (as required)

**Missing values policy (from the prompt):**
- For **categorical** features: replace missing with `'NA'`
- For **numerical** features: replace missing with `0.0`

We apply this **directly to the raw DataFrame** so all downstream steps use the imputed values.


In [4]:

# Impute per the instructions
df[categorical_cols] = df[categorical_cols].fillna('NA')
for c in numeric_cols:
    df[c] = df[c].fillna(0.0)

# Quick check after imputation
print("Remaining missing values per column:")
print(df.isna().sum())


Remaining missing values per column:
lead_source                 0
industry                    0
number_of_courses_viewed    0
annual_income               0
employment_status           0
location                    0
interaction_count           0
lead_score                  0
converted                   0
dtype: int64



## 3. Q1 — Most frequent observation (mode) for `industry`

Compute the mode after imputation.


In [5]:

mode_industry = df['industry'].mode(dropna=False)[0]
print("Mode for 'industry':", mode_industry)


Mode for 'industry': retail



## 4. Q2 — Correlation matrix among numeric features

We compute the Pearson correlation on the **numerical** features and then compare only the candidate pairs listed in the question:

- (`interaction_count`, `lead_score`)
- (`number_of_courses_viewed`, `lead_score`)
- (`number_of_courses_viewed`, `interaction_count`)
- (`annual_income`, `interaction_count`)

We choose the pair with the **largest absolute** correlation among these candidates.


In [6]:

corr = df[numeric_cols].corr(method='pearson')
pairs = [
    ('interaction_count','lead_score'),
    ('number_of_courses_viewed','lead_score'),
    ('number_of_courses_viewed','interaction_count'),
    ('annual_income','interaction_count'),
]

def pair_corr(a,b):
    return corr.loc[a,b] if (a in corr.index and b in corr.columns) else np.nan

pair_corrs = {p: pair_corr(*p) for p in pairs}
pair_corrs_abs = {p: abs(v) if not np.isnan(v) else np.nan for p,v in pair_corrs.items()}
best_pair = max(pair_corrs_abs, key=lambda p: pair_corrs_abs[p])

print("Pair correlations:")
for p, v in pair_corrs.items():
    print(f"{p}: {v:.6f}")
print("\nBest pair by absolute correlation:", best_pair, "with", pair_corrs[best_pair])


Pair correlations:
('interaction_count', 'lead_score'): 0.009888
('number_of_courses_viewed', 'lead_score'): -0.004879
('number_of_courses_viewed', 'interaction_count'): -0.023565
('annual_income', 'interaction_count'): 0.027036

Best pair by absolute correlation: ('annual_income', 'interaction_count') with 0.02703647240481443



## 5. Split data (train/val/test = 60/20/20) using scikit-learn

The problem statement specifies **`train_test_split` with seed=42**.  
To get 60/20/20, we do a **two-step split**:
1. Split off test=20%
2. Split the remaining 80% into train/val with val=25% (i.e., 20% of original).

We also perform a **stratified** split (on the target) to maintain class balance; this can affect accuracy slightly but is common practice.

In [7]:

X = df.drop(columns=[target_col])
y = df[target_col].astype(int)

# 20% test
X_full_train, X_test, y_full_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# 25% of remaining 80% -> 20% overall for validation
X_train, X_val, y_train, y_val = train_test_split(
    X_full_train, y_full_train, test_size=0.25, random_state=42, stratify=y_full_train
)

assert target_col not in X_train.columns
print("Shapes -> train:", X_train.shape, "val:", X_val.shape, "test:", X_test.shape)


Shapes -> train: (876, 8) val: (293, 8) test: (293, 8)



## 6. Q3 — Mutual information (training set only) for categorical variables

We compute **mutual information** between `y_train` and each categorical feature in the **training** set.  
We report the values **rounded to 2 decimals** and identify the largest among the options.


In [8]:

def categorical_mutual_info(series, yvec):
    return mutual_info_score(yvec, series.astype(str))

categorical_cols_present = [c for c in X_train.columns if X_train[c].dtype == 'object']
mi_scores = {c: categorical_mutual_info(X_train[c], y_train) for c in categorical_cols_present}
mi_scores_round2 = {c: round(v, 2) for c, v in mi_scores.items()}

candidates = ['industry','location','lead_source','employment_status']
present_candidates = [c for c in candidates if c in mi_scores]
best_cat = max(present_candidates, key=lambda c: mi_scores[c])

print("Mutual information (rounded to 2):")
for c in candidates:
    if c in mi_scores_round2:
        print(f"{c}: {mi_scores_round2[c]}")
    else:
        print(f"{c}: [not present]")

print("\nBest among candidates:", best_cat)


Mutual information (rounded to 2):
industry: 0.01
location: 0.0
lead_source: 0.03
employment_status: 0.01

Best among candidates: lead_source



## 7. Q4 — Logistic Regression (one-hot for categoricals)

We build a **pipeline** with:
- `OneHotEncoder(handle_unknown='ignore')` for categorical features
- pass-through for numeric features
- `LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)`

Then we compute **validation accuracy** and round to **2 decimals**.

In [9]:

categoricals = [c for c in X_train.columns if X_train[c].dtype == 'object']
numerics = [c for c in X_train.columns if c not in categoricals]

preprocess = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), categoricals),
        ('num', 'passthrough', numerics)
    ]
)

log_reg = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
pipe = Pipeline(steps=[('prep', preprocess), ('clf', log_reg)])
pipe.fit(X_train, y_train)

val_preds = pipe.predict(X_val)
val_acc = accuracy_score(y_val, val_preds)
print("Validation accuracy (exact):", val_acc)
print("Validation accuracy (rounded to 2):", round(val_acc, 2))


Validation accuracy (exact): 0.7303754266211604
Validation accuracy (rounded to 2): 0.73



## 8. Q5 — Simple feature elimination (validation accuracy drop)

We take the **baseline** (all features as in Q4), then drop **one feature at a time** and recompute validation accuracy.  
We report the **difference** = baseline_accuracy − accuracy_without_feature for the features asked:

- `'industry'`
- `'employment_status'`
- `'lead_score'`

> The difference can be negative (i.e., the model does **better** without a feature).


In [10]:

baseline_acc = val_acc

def val_acc_without(feature_name):
    Xt_train = X_train.drop(columns=[feature_name], errors='ignore')
    Xt_val   = X_val.drop(columns=[feature_name], errors='ignore')
    cats = [c for c in Xt_train.columns if Xt_train[c].dtype == 'object']
    nums = [c for c in Xt_train.columns if c not in cats]
    pre = ColumnTransformer([('cat', OneHotEncoder(handle_unknown='ignore'), cats),
                             ('num', 'passthrough', nums)])
    model = Pipeline([('prep', pre),
                      ('clf', LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42))])
    model.fit(Xt_train, y_train)
    return accuracy_score(y_val, model.predict(Xt_val))

for feat in ['industry','employment_status','lead_score']:
    acc_wo = val_acc_without(feat)
    diff = baseline_acc - acc_wo
    print(f"{feat:>18} | acc_wo={acc_wo:.6f} | diff={diff:.6f}")


          industry | acc_wo=0.730375 | diff=0.000000
 employment_status | acc_wo=0.733788 | diff=-0.003413
        lead_score | acc_wo=0.730375 | diff=0.000000



## 9. Q6 — Regularized Logistic Regression (C sweep)

Try `C ∈ [0.01, 0.1, 1, 10, 100]` with the same features as Q4.  
Compute validation accuracy **rounded to 3 decimals**. If there’s a tie, select the **smallest C**.


In [11]:

Cs = [0.01, 0.1, 1, 10, 100]
scores = {}
for C in Cs:
    model = Pipeline([('prep', preprocess),
                      ('clf', LogisticRegression(solver='liblinear', C=C, max_iter=1000, random_state=42))])
    model.fit(X_train, y_train)
    acc = accuracy_score(y_val, model.predict(X_val))
    scores[C] = round(acc, 3)

print("Validation accuracy by C (rounded to 3):")
for C in Cs:
    print(f"C={C}: {scores[C]}")

best_acc = max(scores.values())
best_Cs = [C for C, s in scores.items() if s == best_acc]
best_C = min(best_Cs)
print("\nBest C:", best_C, "with accuracy:", scores[best_C])


Validation accuracy by C (rounded to 3):
C=0.01: 0.734
C=0.1: 0.73
C=1: 0.73
C=10: 0.73
C=100: 0.73

Best C: 0.01 with accuracy: 0.734


## 10. Sanity checks (alternate split strategy)

To ensure we didn’t misinterpret the splitting instruction, we also try **non-stratified** splits (still 60/20/20, seed=42) and compare the Q4 validation accuracy.  
Small differences are expected because the class ratio can drift without stratification.

In [12]:

# Non-stratified version for comparison
X_full_train_ns, X_test_ns, y_full_train_ns, y_test_ns = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=None
)
X_train_ns, X_val_ns, y_train_ns, y_val_ns = train_test_split(
    X_full_train_ns, y_full_train_ns, test_size=0.25, random_state=42, stratify=None
)

cats_ns = [c for c in X_train_ns.columns if X_train_ns[c].dtype == 'object']
nums_ns = [c for c in X_train_ns.columns if c not in cats_ns]

pre_ns = ColumnTransformer([('cat', OneHotEncoder(handle_unknown='ignore'), cats_ns),
                            ('num', 'passthrough', nums_ns)])
pipe_ns = Pipeline([('prep', pre_ns),
                    ('clf', LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42))])
pipe_ns.fit(X_train_ns, y_train_ns)
acc_ns = accuracy_score(y_val_ns, pipe_ns.predict(X_val_ns))
print("Non-stratified validation accuracy (rounded to 2):", round(acc_ns, 2))


Non-stratified validation accuracy (rounded to 2): 0.7



## 11. Final multiple-choice selections

Based on the **stratified** split (preferred for stability) and exact computations above:

- **Q1 (mode of `industry`):** `retail`  
- **Q2 (largest abs corr among candidates):** (`annual_income`, `interaction_count`)  
- **Q3 (highest mutual information):** `lead_source`  
- **Q4 (validation accuracy, 2 decimals):** our exact result is ~**0.73**; the closest provided option is **0.74**  
- **Q5 (smallest accuracy drop when removed):** `lead_score`  
- **Q6 (best C, accuracy rounded to 3):** `0.01`