## Data preparation & Loading the dataset
* Check whether the features contain missing values.
* If there are missing values:
  * For categorical features, replace them with 'NA'
  * For numerical features, replace them with 0.0
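
As a minimal sketch of the steps above, the toy frame below (hypothetical column values, not the real dataset) counts missing values per column first and then fills each kind of column with its placeholder. The column lists are written out by hand here as an assumption; the notebook derives them from dtypes.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one gap in each kind of column
toy = pd.DataFrame({
    "industry": ["retail", None, "technology"],
    "annual_income": [52000.0, np.nan, 61000.0],
})

# Step 1: count missing values per column before changing anything
missing_before = toy.isna().sum()

# Step 2: fill each group of columns with its placeholder
toy_cat = ["industry"]        # categorical columns (listed by hand for the sketch)
toy_num = ["annual_income"]   # numerical columns
toy[toy_cat] = toy[toy_cat].fillna("NA")
toy[toy_num] = toy[toy_num].fillna(0.0)

missing_after = toy.isna().sum()
print(missing_before.to_dict())
print(missing_after.to_dict())
```

Checking the counts before imputation shows which columns actually needed filling; the post-imputation check alone cannot tell you that.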

In [24]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mutual_info_score
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Load dataset
df = pd.read_csv('https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv')
target = 'converted'

In [25]:
# Identify data types
cat_cols = [c for c in df.columns if df[c].dtype == 'object' and c != target]
num_cols = [c for c in df.columns if c not in cat_cols + [target]]

# Impute by filling categorical blanks with NA and numerical blanks with 0.0
df[cat_cols] = df[cat_cols].fillna('NA')
for c in num_cols:
    df[c] = df[c].fillna(0.0)

print("Categorical:", cat_cols)
print("Numerical  :", num_cols)
print("\nMissing values after imputation (should be 0 per column):")
print(df.isna().sum())


Categorical: ['lead_source', 'industry', 'employment_status', 'location']
Numerical  : ['number_of_courses_viewed', 'annual_income', 'interaction_count', 'lead_score']

Missing values after imputation (should be 0 per column):
lead_source                 0
industry                    0
number_of_courses_viewed    0
annual_income               0
employment_status           0
location                    0
interaction_count           0
lead_score                  0
converted                   0
dtype: int64


### Question 1
### What is the most frequent observation (mode) for the column industry?

* NA
* technology
* healthcare
* retail


In [26]:
q1_mode = df['industry'].mode(dropna=False)[0]
print("Q1 — mode(industry):", q1_mode)

Q1 — mode(industry): retail


### Question 2
Create the correlation matrix for the numerical features of your dataset. In a correlation matrix, you compute the correlation coefficient between every pair of features.

What are the two features that have the biggest correlation?
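
One way to search every pair rather than a hand-picked list is to keep only the upper triangle of the correlation matrix and rank the remaining entries by absolute value. The sketch below uses synthetic columns `a`, `b`, `c` (hypothetical, not from the dataset) to show the idea.

```python
import numpy as np
import pandas as pd

# Toy numerical data (hypothetical): b tracks a closely, c is unrelated noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=200),
    "c": rng.normal(size=200),
})

corr = toy.corr()

# Keep only the strict upper triangle so each pair appears once and
# the diagonal (always 1.0) is excluded
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Rank pairs by absolute correlation, strongest first
ranked = pairs.abs().sort_values(ascending=False)
top_pair = ranked.index[0]
print(top_pair, round(pairs.loc[top_pair], 3))
```

Ranking by absolute value matters because a strong negative correlation is as informative as a strong positive one.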



In [27]:
corr = df[num_cols].corr()
pairs = [
    ('interaction_count','lead_score'),
    ('number_of_courses_viewed','lead_score'),
    ('number_of_courses_viewed','interaction_count'),
    ('annual_income','interaction_count'),
]
pair_vals = {p: corr.loc[p[0], p[1]] for p in pairs}
best_pair = max(pair_vals.items(), key=lambda kv: abs(kv[1]))
print("Pair correlations:")
for p, v in pair_vals.items():
    print(f"{p}: {v:.6f}")
print("\nQ2 — biggest |correlation| among candidates:", best_pair[0], "=", best_pair[1])

Pair correlations:
('interaction_count', 'lead_score'): 0.009888
('number_of_courses_viewed', 'lead_score'): -0.004879
('number_of_courses_viewed', 'interaction_count'): -0.023565
('annual_income', 'interaction_count'): 0.027036

Q2 — biggest |correlation| among candidates: ('annual_income', 'interaction_count') = 0.02703647240481443


## Setting up the regression model framework

We split 60/20/20 with **seed=42** using scikit-learn:
1) hold out **test=20%**  
2) from the remaining 80%, take **validation=25%** (which is 20% overall).

We use **stratified** splits so that the class balance of `converted` stays roughly the same in the train, validation, and test sets.
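
The arithmetic behind the two-step split can be checked by hand. The sketch below assumes 1462 rows in total (876 + 293 + 293, the split sizes reported in this notebook) and assumes that scikit-learn rounds a fractional `test_size` up to the next whole row.

```python
import math

n_total = 1462  # 876 + 293 + 293, the split sizes reported in this notebook

# Assumption: with a float test_size, the held-out count is rounded up
# and the remainder goes to the other part
n_test = math.ceil(0.2 * n_total)
n_full_train = n_total - n_test
n_val = math.ceil(0.25 * n_full_train)
n_train = n_full_train - n_val

print(n_train, n_val, n_test)
print(round(n_train / n_total, 3), round(n_val / n_total, 3), round(n_test / n_total, 3))
```

Taking 25% of the remaining 80% gives 0.25 × 0.8 = 0.2 of the full data, which is why the second call uses `test_size=0.25` rather than `0.2`.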

In [28]:
X = df.drop(columns=[target])
y = df[target].astype(int)

X_full_train, X_test, y_full_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_full_train, y_full_train, test_size=0.25, random_state=42, stratify=y_full_train
)

print("Shapes: train", X_train.shape, "val", X_val.shape, "test", X_test.shape)

Shapes: train (876, 8) val (293, 8) test (293, 8)


### Question 3

- Calculate the mutual information score between `converted` and other categorical variables in the dataset. Use the training set only.
- Round the scores to 2 decimals using `round(score, 2)`.

Which of these variables has the biggest mutual information score?
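
For intuition, mutual information can be computed directly from counts. The function below is a hand-written sketch of the standard definition using natural logarithms, which is also what `mutual_info_score` uses; the helper name `mutual_information` and the toy labels are hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in nats) between two equal-length label sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), n_xy in joint.items():
        # p(x,y) * log( p(x,y) / (p(x) * p(y)) ), written with raw counts
        mi += (n_xy / n) * math.log(n_xy * n / (px[x] * py[y]))
    return mi

# Hypothetical labels: `same` determines the target, `indep` tells us nothing
toy_target = [0, 0, 1, 1]
same = ["a", "a", "b", "b"]
indep = ["a", "b", "a", "b"]

print(round(mutual_information(toy_target, same), 4))   # 0.6931, i.e. ln 2
print(round(mutual_information(toy_target, indep), 4))  # 0.0
```

A score near 0 means the variable says almost nothing about `converted`; the upper bound for a balanced binary target is ln 2 ≈ 0.69.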


In [29]:
def mi_for_cat(col):
    return mutual_info_score(y_train, X_train[col].astype(str))

candidates = ['industry','location','lead_source','employment_status']
mi_scores = {c: mi_for_cat(c) for c in candidates if c in X_train.columns}
mi_round2 = {c: round(v, 2) for c, v in mi_scores.items()}
best_cat = max(mi_scores, key=lambda c: mi_scores[c])

print("Mutual information (rounded to 2):")
for c in candidates:
    if c in mi_round2:
        print(f"{c}: {mi_round2[c]}")
    else:
        print(f"{c}: [not present]")
print("\nQ3 — best categorical by MI:", best_cat)

Mutual information (rounded to 2):
industry: 0.01
location: 0.0
lead_source: 0.03
employment_status: 0.01

Q3 — best categorical by MI: lead_source


### Question 4

- Now let's train a logistic regression.
- Remember that we have several categorical variables in the dataset. Include them using one-hot encoding.
- Fit the model on the training dataset.
  - To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
  - `model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)`
- Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

What accuracy did you get?
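
To see what one-hot encoding with `DictVectorizer` produces, here is a small sketch on hypothetical rows. String values become `column=value` indicator columns, numbers pass through unchanged, and a category that was not seen during fitting is encoded as all zeros.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical rows: one categorical and one numerical feature
train_rows = [
    {"industry": "retail", "lead_score": 0.4},
    {"industry": "technology", "lead_score": 0.9},
]
val_rows = [{"industry": "finance", "lead_score": 0.1}]  # category unseen in training

toy_dv = DictVectorizer(sparse=False)
X_train_toy = toy_dv.fit_transform(train_rows)  # learns the columns from training data
X_val_toy = toy_dv.transform(val_rows)          # reuses them; does not refit

print(toy_dv.feature_names_)
print(X_train_toy)
print(X_val_toy)
```

Fitting only on the training rows keeps the column layout identical across train and validation, which the model requires.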

In [30]:
# One Hot Encoding with DictVectorizer

dv = DictVectorizer(sparse=False)

def to_records(df_part):
    # Convert the selected feature columns into a list of row dicts for DictVectorizer
    return df_part[cat_cols + num_cols].to_dict(orient='records')

# Fit the vectorizer on the training set only, then reuse it for validation
X_train_dv = dv.fit_transform(to_records(X_train))
X_val_dv   = dv.transform(to_records(X_val))

X_train_dv.shape, X_val_dv.shape, len(dv.feature_names_)

((876, 31), (293, 31), 31)

In [31]:
# Fit logistic regression on the training set and evaluate validation accuracy (rounded to 2 decimals)
model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
model.fit(X_train_dv, y_train)
val_acc = accuracy_score(y_val, model.predict(X_val_dv))
print("Validation accuracy (exact):", val_acc)
print("Q4 — Validation accuracy (rounded to 2):", round(val_acc, 2))

Validation accuracy (exact): 0.7303754266211604
Q4 — Validation accuracy (rounded to 2): 0.73
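
Accuracy is the fraction of validation rows predicted correctly, so with 293 validation rows it can only take values of the form k/293. The quick check below suggests the exact value above corresponds to 214 correct predictions; that count is inferred from the printed number, not read from the model.

```python
n_val_rows = 293   # validation rows, from the split shapes printed earlier
n_correct = 214    # assumption: inferred from the printed accuracy

acc = n_correct / n_val_rows
print(acc)
print(round(acc, 2))

# One more or one fewer correct prediction moves accuracy by 1/293
print(round(1 / n_val_rows, 6))
```

This also gives a sense of scale: any accuracy change of about 0.0034 on this validation set amounts to a single row.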


### Question 5

- Let's find the least useful feature using the _feature elimination_ technique.
- Train a model using the same features and parameters as in Q4 (without rounding).
- Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
- For each feature, calculate the difference between the original accuracy and the accuracy without the feature.

Which of the following features has the smallest difference?

> **Note**: The difference doesn't have to be positive.

In [32]:
baseline_acc = val_acc

def acc_without(feat):
    cols_keep = [c for c in (cat_cols + num_cols) if c != feat]
    dv2 = DictVectorizer(sparse=False)
    Xtr = dv2.fit_transform(X_train[cols_keep].to_dict(orient='records'))
    Xva = dv2.transform(X_val[cols_keep].to_dict(orient='records'))
    m2 = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
    m2.fit(Xtr, y_train)
    return accuracy_score(y_val, m2.predict(Xva))

diffs = {}
for feat in ['industry','employment_status','lead_score']:
    acc_wo = acc_without(feat)
    diffs[feat] = baseline_acc - acc_wo
    print(f"{feat:>18} | acc_wo={acc_wo:.6f} | diff={diffs[feat]:.6f}")

q5_smallest = min(diffs, key=lambda k: diffs[k])
print("\nQ5 — smallest difference:", q5_smallest)

          industry | acc_wo=0.730375 | diff=0.000000
 employment_status | acc_wo=0.733788 | diff=-0.003413
        lead_score | acc_wo=0.730375 | diff=0.000000

Q5 — smallest difference: employment_status
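
Because the difference does not have to be positive, "smallest" can be read either as the signed minimum or as the smallest absolute value. The sketch below reuses the rounded differences printed above and shows that the two readings pick different features; which reading the question intends is a judgment call.

```python
# Differences as printed above (rounded to 6 decimals)
printed_diffs = {
    "industry": 0.000000,
    "employment_status": -0.003413,
    "lead_score": 0.000000,
}

# Reading 1: signed minimum, which is what min() over raw differences picks
signed_min = min(printed_diffs, key=printed_diffs.get)

# Reading 2: smallest absolute difference, keeping every tied feature
smallest_abs = min(abs(v) for v in printed_diffs.values())
abs_ties = [f for f, v in printed_diffs.items() if abs(v) == smallest_abs]

print(signed_min)  # employment_status
print(abs_ties)    # ['industry', 'lead_score']
```

Under the absolute reading, `industry` and `lead_score` are tied at the printed precision, so the code alone cannot separate them.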


### Question 6

- Now let's train a regularized logistic regression.
- Let's try the following values of the parameter `C`: `[0.01, 0.1, 1, 10, 100]`.
- Train models using all the features as in Q4.
- Calculate the accuracy on the validation dataset and round it to 3 decimal digits.

Which of these `C` leads to the best accuracy on the validation set?

> **Note**: If there are multiple options, select the smallest `C`.
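
In scikit-learn, `C` is the inverse of the regularization strength, so smaller values penalize large weights more heavily. The sketch below illustrates this on synthetic data (hypothetical, unrelated to the lead-scoring dataset) by printing the norm of the fitted weight vector for each `C`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, hypothetical data: two informative features plus label noise
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(300, 2))
noise = rng.normal(scale=0.5, size=300)
y_toy = (X_toy[:, 0] + 0.5 * X_toy[:, 1] + noise > 0).astype(int)

norms = {}
for C in [0.01, 0.1, 1, 10, 100]:
    m = LogisticRegression(solver='liblinear', C=C, max_iter=1000, random_state=42)
    m.fit(X_toy, y_toy)
    # Smaller C means a stronger penalty, which shrinks the weights
    norms[C] = float(np.linalg.norm(m.coef_))

for C, norm in norms.items():
    print(f"C={C}: ||w|| = {norm:.3f}")
```

Preferring the smallest `C` among tied scores therefore means preferring the most strongly regularized model that performs equally well.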

In [33]:
Cs = [0.01, 0.1, 1, 10, 100]
scores = {}
for C in Cs:
    m = LogisticRegression(solver='liblinear', C=C, max_iter=1000, random_state=42)
    scores[C] = round(accuracy_score(y_val, m.fit(X_train_dv, y_train).predict(X_val_dv)), 3)

print("Validation accuracy by C (rounded to 3):")
for C in Cs:
    print(f"C={C}: {scores[C]}")

best_acc = max(scores.values())
best_Cs = [C for C, v in scores.items() if v == best_acc]
best_C = min(best_Cs)
print("\nQ6 — best C:", best_C, "with accuracy:", scores[best_C])

Validation accuracy by C (rounded to 3):
C=0.01: 0.734
C=0.1: 0.73
C=1: 0.73
C=10: 0.73
C=100: 0.73

Q6 — best C: 0.01 with accuracy: 0.734



## Answers recap

- **Q1 (mode of `industry`):** `retail`  
- **Q2 (largest |corr| among candidates):** (`annual_income`, `interaction_count`)  
- **Q3 (highest MI):** `lead_source`  
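- **Q4 (val accuracy, 2 decimals):** 0.73 computed; the closest answer option chosen is **0.74**
- **Q5 (smallest difference):** `lead_score`, reading "smallest" as the smallest absolute difference (`industry` ties with it at 0.000000); the signed minimum printed by the code is `employment_status`  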
- **Q6 (best C):** `0.01`
