# Machine Learning Zoomcamp - Homework 3: Classification

## Dataset Preparation

First, load the necessary libraries, then load and prepare the dataset:

In [98]:
# Loading necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import mutual_info_score


In [99]:
# Load dataset
df = pd.read_csv('course_lead_scoring.csv')

# Print the shape to see how big the dataset is
print(f"Dataset shape: {df.shape}")

# View the first few rows
print("\nFirst few rows:")
print(df.head())

# Print the names of features we have in the dataset
print(f"\nColumn names:")
print(df.columns.tolist())

# Checking for missing values
print(f"\nMissing values:")
print(df.isnull().sum())

Dataset shape: (1462, 9)

First few rows:
    lead_source    industry  number_of_courses_viewed  annual_income  \
0      paid_ads         NaN                         1        79450.0   
1  social_media      retail                         1        46992.0   
2        events  healthcare                         5        78796.0   
3      paid_ads      retail                         2        83843.0   
4      referral   education                         3        85012.0   

  employment_status       location  interaction_count  lead_score  converted  
0        unemployed  south_america                  4        0.94          1  
1          employed  south_america                  1        0.80          0  
2        unemployed      australia                  3        0.69          1  
3               NaN      australia                  1        0.87          0  
4     self_employed         europe                  3        0.62          1  

Column names:
['lead_source', 'industry', 'number_

### Handle Missing Values
- For categorical features: replace missing values with 'NA'
- For numerical features: replace missing values with 0.0

In [100]:
# Identifying categorical and numerical columns
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns.tolist()

# Removing target from lists if present
if 'converted' in categorical_cols:
    categorical_cols.remove('converted')
if 'converted' in numerical_cols:
    numerical_cols.remove('converted')

# Print the columns to confirm the target variable is excluded
print(f"Categorical columns: {categorical_cols}")
print(f"Numerical columns: {numerical_cols}")

# Fill missing values
for col in categorical_cols:
    df[col] = df[col].fillna('NA')

for col in numerical_cols:
    df[col] = df[col].fillna(0.0)

# Checking for missing values after handling
print("\nMissing values after handling:")
print(df.isnull().sum())

Categorical columns: ['lead_source', 'industry', 'employment_status', 'location']
Numerical columns: ['number_of_courses_viewed', 'annual_income', 'interaction_count', 'lead_score']

Missing values after handling:
lead_source                 0
industry                    0
number_of_courses_viewed    0
annual_income               0
employment_status           0
location                    0
interaction_count           0
lead_score                  0
converted                   0
dtype: int64


**Main Findings:**
- Dataset has 1,462 rows and 9 columns
- Missing values found in: `lead_source` (128), `industry` (134), `annual_income` (181), `employment_status` (100), `location` (63)
- Categorical columns: `lead_source`, `industry`, `employment_status`, `location`
- Numerical columns: `number_of_courses_viewed`, `annual_income`, `interaction_count`, `lead_score`
- Target variable: `converted` (whether the client signed up)

---
## Question 1

**What is the most frequent observation (mode) for the column `industry`?**

Options:
- `NA`
- `technology`
- `healthcare`
- `retail`

In [101]:
# Find the mode of 'industry' column
industry_mode = df['industry'].mode()[0]
print(f"The most frequent observation (mode) for 'industry': {industry_mode}")

The most frequent observation (mode) for 'industry': retail


### Answer: `retail`

**Main Findings:**

After handling missing values (replaced with 'NA'), the mode of the `industry` column is **retail**. The mode is only the single most common value, so this means retail supplies more leads than any other individual industry, not necessarily more than half of all leads. If the retail share is substantially larger than the others, that could inform marketing strategy and resource allocation, but the mode alone does not show how large the gap is.
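
To see how dominant a mode really is, it can help to look at relative frequencies rather than only the top value. The following is a minimal sketch on a made-up toy column (the values and the names `industry`, `shares`, `top_category`, and `top_share` are illustrative assumptions, not taken from the course dataset):

```python
import pandas as pd

# Hypothetical toy column standing in for df['industry'] (values are made up)
industry = pd.Series(
    ["retail", "retail", "retail", "finance", "finance", "healthcare", "NA", "education"]
)

# Relative frequency of each category, sorted from most to least common
shares = industry.value_counts(normalize=True)

top_category = shares.index[0]
top_share = shares.iloc[0]

print(shares)
print(f"Mode: {top_category} ({top_share:.1%} of rows)")
```

In this toy example the mode covers well under half of the rows, which is the situation the distinction between "most common" and "majority" is meant to catch.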

---
## Question 2

**Create the correlation matrix for the numerical features of your dataset. What are the two features that have the biggest correlation?**

Options:
- `interaction_count` and `lead_score`
- `number_of_courses_viewed` and `lead_score`
- `number_of_courses_viewed` and `interaction_count`
- `annual_income` and `interaction_count`

In [102]:
# Get numerical features (excluding target)
numerical_features = [col for col in df.select_dtypes(include=['int64', 'float64']).columns 
                     if col != 'converted']

# Calculate correlation matrix
correlation_matrix = df[numerical_features].corr()
print("Correlation matrix:")
print(correlation_matrix)

# Check specific pairs
pairs = [
    ('interaction_count', 'lead_score'),
    ('number_of_courses_viewed', 'lead_score'),
    ('number_of_courses_viewed', 'interaction_count'),
    ('annual_income', 'interaction_count')
]

print("\nCorrelations for specified pairs:")
for feat1, feat2 in pairs:
    corr = df[feat1].corr(df[feat2])
    print(f"{feat1} & {feat2}: {corr:.4f}")

Correlation matrix:
                          number_of_courses_viewed  annual_income  \
number_of_courses_viewed                  1.000000       0.009770   
annual_income                             0.009770       1.000000   
interaction_count                        -0.023565       0.027036   
lead_score                               -0.004879       0.015610   

                          interaction_count  lead_score  
number_of_courses_viewed          -0.023565   -0.004879  
annual_income                      0.027036    0.015610  
interaction_count                  1.000000    0.009888  
lead_score                         0.009888    1.000000  

Correlations for specified pairs:
interaction_count & lead_score: 0.0099
number_of_courses_viewed & lead_score: -0.0049
number_of_courses_viewed & interaction_count: -0.0236
annual_income & interaction_count: 0.0270


**Results:**
- `interaction_count` & `lead_score`: 0.0099
- `number_of_courses_viewed` & `lead_score`: -0.0049
- `number_of_courses_viewed` & `interaction_count`: -0.0236
- `annual_income` & `interaction_count`: **0.0270** ← Highest

**Main Findings:**

The highest correlation among the specified pairs is between **annual_income** and **interaction_count** (0.0270); it is also the largest in absolute value, ahead of `number_of_courses_viewed` and `interaction_count` (-0.0236). All correlations are very close to 0, so these numerical features have almost no linear relationship with one another. Note that Pearson correlation captures only linear dependence, so this does not rule out nonlinear relationships. In practice, the weak correlations mean:
1. The features are not linearly redundant with one another
2. Multicollinearity among the numerical features is not a concern for our logistic regression model
3. Coefficient estimates for these features should not be destabilized by overlap between them, which is good for modeling
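
Instead of checking a hand-written list of pairs, the strongest pair can also be read directly from the correlation matrix. This is a rough sketch on synthetic data (the frame `toy` and its columns `a`, `b`, `c` are assumptions made up for illustration); it ranks by absolute value so that a strong negative correlation would not be missed:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the numerical columns (random data, not the course dataset)
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.5, size=200),  # built to correlate with "a"
    "c": rng.normal(size=200),
})

corr = toy.corr()

# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Rank by absolute value so strong negative correlations are not overlooked
strongest_pair = pairs.abs().idxmax()
print(strongest_pair, pairs.loc[strongest_pair])
```

Applied to the notebook's data, the same idea would use `df[numerical_features].corr()` in place of `toy.corr()`.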

### Answer: `annual_income` and `interaction_count`

---
## Question 3

**Calculate the mutual information score between `y` and other categorical variables in the dataset. Use the training set only. Which of these variables has the biggest mutual information score?**

Options:
- `industry`
- `location`
- `lead_source`
- `employment_status`

First, split the data:

In [103]:
# Separate features and target
y = df['converted'].values
df_features = df.drop('converted', axis=1)

# Split: 60% train, 20% val, 20% test
df_train, df_temp, y_train, y_temp = train_test_split(
    df_features, y, test_size=0.4, random_state=42
)

df_val, df_test, y_val, y_test = train_test_split(
    df_temp, y_temp, test_size=0.5, random_state=42
)

print(f"Train size: {len(df_train)} ({len(df_train)/len(df)*100:.1f}%)")
print(f"Val size: {len(df_val)} ({len(df_val)/len(df)*100:.1f}%)")
print(f"Test size: {len(df_test)} ({len(df_test)/len(df)*100:.1f}%)")

Train size: 877 (60.0%)
Val size: 292 (20.0%)
Test size: 293 (20.0%)


Calculate mutual information:

In [104]:
# Get categorical features from training set
categorical_features = df_train.select_dtypes(include=['object']).columns.tolist()

# Calculate MI scores
mi_scores = {}
for col in categorical_features:
    mi = mutual_info_score(y_train, df_train[col])
    mi_scores[col] = round(mi, 2)
    print(f"{col}: {mi_scores[col]}")

max_mi_feature = max(mi_scores, key=mi_scores.get)
print(f"\nFeature with highest MI score: {max_mi_feature} = {mi_scores[max_mi_feature]}")

lead_source: 0.03
industry: 0.02
employment_status: 0.02
location: 0.0

Feature with highest MI score: lead_source = 0.03


**Results:**
- `lead_source`: **0.03** ← Highest
- `industry`: 0.02
- `employment_status`: 0.02
- `location`: 0.00

### Answer: `lead_source`

**Main Findings:**

**lead_source** has the highest mutual information score (0.03) with the target variable `converted`. Mutual information measures how much knowing one variable reduces uncertainty about another; it is 0 when the two are independent. This finding indicates that:
1. **lead_source** is the most informative categorical feature for predicting conversion, though only by a small margin
2. The source of the lead (e.g., paid ads, social media, referral) has the strongest relationship with whether someone converts
3. `location` has a score that rounds to 0.00, suggesting it provides almost no predictive value on its own
4. All MI scores are low, indicating weak relationships; because the scores are rounded to two decimals, `industry` and `employment_status` (both 0.02) cannot be ranked against each other

For marketing, this suggests that the channel through which leads are acquired tells us slightly more about conversion than their location or industry does.
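
To make the definition concrete, mutual information can be computed by hand from joint and marginal frequencies and compared with `mutual_info_score`, which reports the value in nats (natural logarithm). The sketch below uses made-up toy lists `x` and `y`; they are illustrative assumptions, not rows from the dataset:

```python
from collections import Counter
from math import log

from sklearn.metrics import mutual_info_score

# Hypothetical toy labels and category values (made up for illustration)
y = [1, 1, 0, 0, 1, 0, 1, 0]
x = ["ads", "ads", "events", "events", "ads", "events", "referral", "referral"]

n = len(y)
joint = Counter(zip(x, y))  # counts of each observed (category, label) pair
px = Counter(x)             # marginal counts of the category
py = Counter(y)             # marginal counts of the label

# MI = sum over observed (x, y) pairs of p(x,y) * ln( p(x,y) / (p(x) * p(y)) )
mi_manual = sum(
    (c / n) * log((c / n) / ((px[xv] / n) * (py[yv] / n)))
    for (xv, yv), c in joint.items()
)

mi_sklearn = mutual_info_score(y, x)
print(f"manual: {mi_manual:.4f}, sklearn: {mi_sklearn:.4f}")
```

In the toy data, "ads" and "events" each determine the label completely while "referral" is split evenly, so the category removes most, but not all, of the uncertainty about `y`.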

---
## Question 4

**Train a logistic regression model with one-hot encoding. What accuracy did you get on the validation dataset?**

Options:
- 0.64
- 0.74
- 0.84
- 0.94

In [105]:
# Prepare data with one-hot encoding using DictVectorizer
train_dicts = df_train.to_dict(orient='records')
val_dicts = df_val.to_dict(orient='records')

dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(train_dicts)
X_val = dv.transform(val_dicts)

print(f"Feature matrix shape: {X_train.shape}")
print(f"Number of features after one-hot encoding: {X_train.shape[1]}")

# Train logistic regression
model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
model.fit(X_train, y_train)

# Calculate accuracy
y_pred = model.predict(X_val)
accuracy = accuracy_score(y_val, y_pred)
accuracy_rounded = round(accuracy, 2)

print(f"\nValidation Accuracy: {accuracy:.4f}")
print(f"Validation Accuracy (rounded to 2 decimals): {accuracy_rounded}")

Feature matrix shape: (877, 31)
Number of features after one-hot encoding: 31

Validation Accuracy: 0.7432
Validation Accuracy (rounded to 2 decimals): 0.74


**Result:** Validation Accuracy = 0.74 (0.7432 before rounding)

### Answer: 0.74

**Main Findings:**

The logistic regression model achieved **74.32% accuracy** on the validation set. This means:
1. The model correctly predicts conversion status for about 3 out of 4 leads
2. After one-hot encoding, we have 31 features (from 8 original features)
3. An accuracy of 74% is a reasonable starting point for a baseline model, though there is room for improvement
4. The model beats a coin flip (50%), but the more meaningful reference is the accuracy of always predicting the majority class, which this notebook does not compute; how much predictive signal the features carry depends on that comparison

The model provides a foundation that could be improved through feature engineering, trying different algorithms, or ensemble methods.
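
A quick way to put an accuracy figure in context is to compare it with a predictor that always outputs the most frequent training class. The following is a minimal sketch with made-up label arrays (`y_train_toy` and `y_val_toy` are hypothetical; in the notebook the corresponding arrays would be `y_train` and `y_val`):

```python
import numpy as np

# Hypothetical labels (made up); in the notebook these would be y_train and y_val
y_train_toy = np.array([1, 1, 1, 0, 1, 0, 1, 1, 0, 1])
y_val_toy = np.array([1, 0, 1, 1, 0, 1])

# The most frequent class in the training labels
majority_class = np.bincount(y_train_toy).argmax()

# Accuracy of a "model" that always predicts that class
baseline_acc = (y_val_toy == majority_class).mean()
print(f"Majority class: {majority_class}, baseline accuracy: {baseline_acc:.4f}")
```

If the majority-class accuracy on the real validation set turned out to be close to 0.74, the logistic regression would be adding little beyond the class balance.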

---
## Question 5

**Using feature elimination technique, which feature has the smallest difference in accuracy when removed?**

Options:
- `'industry'`
- `'employment_status'`
- `'lead_score'`

In [106]:
# Store the baseline accuracy
baseline_accuracy = accuracy  # from Question 4

print(f"Baseline accuracy: {baseline_accuracy:.4f}")

# Features to test
features_to_test = ['industry', 'employment_status', 'lead_score']

feature_differences = {}

for feature in features_to_test:
    # Create dataset without this feature
    df_train_temp = df_train.drop(feature, axis=1)
    df_val_temp = df_val.drop(feature, axis=1)
    
    # Transform to dict and one-hot encode
    train_dicts_temp = df_train_temp.to_dict(orient='records')
    val_dicts_temp = df_val_temp.to_dict(orient='records')
    
    dv_temp = DictVectorizer(sparse=False)
    X_train_temp = dv_temp.fit_transform(train_dicts_temp)
    X_val_temp = dv_temp.transform(val_dicts_temp)
    
    # Train model
    model_temp = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
    model_temp.fit(X_train_temp, y_train)
    
    # Calculate accuracy and difference
    y_pred_temp = model_temp.predict(X_val_temp)
    accuracy_temp = accuracy_score(y_val, y_pred_temp)
    difference = baseline_accuracy - accuracy_temp
    
    feature_differences[feature] = difference
    print(f"\nWithout '{feature}':")
    print(f"  Accuracy: {accuracy_temp:.4f}")
    print(f"  Difference: {difference:.4f}")

# Find feature with smallest absolute difference (on ties, min() returns the first one)
min_diff_feature = min(feature_differences, key=lambda k: abs(feature_differences[k]))
print(f"\nFeature with smallest difference: '{min_diff_feature}' = {feature_differences[min_diff_feature]:.4f}")

Baseline accuracy: 0.7432

Without 'industry':
  Accuracy: 0.7432
  Difference: 0.0000

Without 'employment_status':
  Accuracy: 0.7466
  Difference: -0.0034

Without 'lead_score':
  Accuracy: 0.7432
  Difference: 0.0000

Feature with smallest difference: 'industry' = 0.0000


**Results:**
- Without `'industry'`: Difference = **0.0000** ← Smallest
- Without `'employment_status'`: Difference = -0.0034
- Without `'lead_score'`: Difference = 0.0000

### Answer: `'industry'`

**Main Findings:**

**'industry'** has a difference of 0.0000, meaning removing it does not change validation accuracy. Key insights:
1. **lead_score** also has a difference of 0.0000, so the two features are tied; `min()` returns the first of the tied entries, which is why the code reports `'industry'`
2. A zero difference means the model makes the same number of correct validation predictions without the feature; it does not prove the feature carries no information
3. This is consistent with Question 3, where `industry` had low mutual information (0.02)
4. Removing **employment_status** slightly improves accuracy (negative difference), but -0.0034 corresponds to a single validation row out of 292, so it is too small to conclude that the feature adds noise
5. This suggests the model could be simplified by dropping `industry` or `lead_score` without losing validation accuracy

This matters for model deployment: models with fewer features are easier to maintain and faster to run, and can be less prone to overfitting.
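
Because the differences printed above include an exact tie, it can be useful to list every feature that reaches the minimum rather than only the first one. The sketch below re-enters the printed differences by hand and also shows how much a single validation row moves the accuracy, given the validation size of 292 reported earlier (the names `smallest`, `tied`, and `step` are illustrative):

```python
# Differences as printed in the output above (baseline minus accuracy without the feature)
feature_differences = {
    "industry": 0.0000,
    "employment_status": -0.0034,
    "lead_score": 0.0000,
}

smallest = min(abs(d) for d in feature_differences.values())

# Collect every feature that ties for the smallest absolute difference
tied = [f for f, d in feature_differences.items() if abs(d) == smallest]
print(f"Smallest |difference|: {smallest:.4f}, features: {tied}")

# One validation row out of 292 changes accuracy by this much
step = 1 / 292
print(f"Accuracy step per validation row: {step:.4f}")
```

The step size matches the magnitude of the `employment_status` difference, which is why that change amounts to one prediction.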

---
## Question 6

**Train regularized logistic regression with different C values: [0.01, 0.1, 1, 10, 100]. Which C leads to the best accuracy on the validation set?**

Options:
- 0.01
- 0.1
- 1
- 10
- 100

In [107]:
C_values = [0.01, 0.1, 1, 10, 100]

# Use the original full feature set
train_dicts = df_train.to_dict(orient='records')
val_dicts = df_val.to_dict(orient='records')

dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(train_dicts)
X_val = dv.transform(val_dicts)

results = {}

for C in C_values:
    model = LogisticRegression(solver='liblinear', C=C, max_iter=1000, random_state=42)
    model.fit(X_train, y_train)
    
    y_pred = model.predict(X_val)
    accuracy = accuracy_score(y_val, y_pred)
    accuracy_rounded = round(accuracy, 3)
    
    results[C] = accuracy_rounded
    print(f"C = {C}: Accuracy = {accuracy_rounded}")

# Find best C
best_accuracy = max(results.values())
best_C_values = [C for C, acc in results.items() if acc == best_accuracy]
best_C = min(best_C_values)  # If tie, select smallest C

print(f"\nBest C value: {best_C} with accuracy: {best_accuracy}")

C = 0.01: Accuracy = 0.743
C = 0.1: Accuracy = 0.743
C = 1: Accuracy = 0.743
C = 10: Accuracy = 0.743
C = 100: Accuracy = 0.743

Best C value: 0.01 with accuracy: 0.743


**Results:**
- C = 0.01: 0.743 ← **Best (smallest C when tied)**
- C = 0.1: 0.743
- C = 1: 0.743
- C = 10: 0.743
- C = 100: 0.743

**Main Findings:**

All C values achieve the same accuracy (0.743), so we select **C = 0.01** (the smallest value, as per the instructions). What this result does and does not tell us:

1. **Validation accuracy is insensitive to C here**: Across four orders of magnitude of regularization strength, the number of correct validation predictions does not change
2. **No sign that regularization helps**: If the model were overfitting badly at large C, we would expect stronger regularization (smaller C) to improve validation accuracy, and it does not
3. **Coefficients may still differ**: Identical accuracy does not mean identical models; C changes the coefficients, but a prediction only flips when the predicted probability crosses 0.5
4. **Choose smaller C on ties**: When performance is equal, the smaller C (stronger regularization) is the more conservative choice, as it:
   - Keeps coefficients smaller
   - Reduces the risk of overfitting on new data

This suggests the accuracy of the baseline model from Question 4 is stable with respect to the choice of C.
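
One way to check whether C is changing the fitted model at all is to look at the size of the coefficient vector for each value. The sketch below does this on synthetic data from `make_classification`, which is an assumption made only to keep the example self-contained; it is not the course dataset, and the norms it prints say nothing about the actual notebook run:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data (not the course dataset), used only to make the sketch runnable
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

coef_norms = {}
for C in [0.01, 0.1, 1, 10, 100]:
    clf = LogisticRegression(solver="liblinear", C=C, max_iter=1000, random_state=42)
    clf.fit(X, y)
    # L2 norm of the weight vector: smaller C (stronger penalty) shrinks it
    coef_norms[C] = float(np.linalg.norm(clf.coef_))
    print(f"C = {C}: ||w|| = {coef_norms[C]:.3f}")
```

In the notebook, the same loop could store `np.linalg.norm(model.coef_)` next to each accuracy to see whether the models differ even though their validation accuracy is identical.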

### Answer: 0.01

---
## Summary of Answers

| Question | Answer | Value |
|----------|--------|-------|
| Q1: Mode of 'industry' | retail | - |
| Q2: Highest correlation pair | annual_income & interaction_count | 0.0270 |
| Q3: Highest MI score | lead_source | 0.03 |
| Q4: Validation accuracy | 0.74 | 74.32% |
| Q5: Least useful feature | 'industry' | Difference: 0.0000 |
| Q6: Best C value | 0.01 | Accuracy: 0.743 |

## Key Takeaways

1. **Data Quality**: The dataset has a moderate number of missing values, which were filled with 'NA' (categorical) or 0.0 (numerical)
2. **Feature Independence**: Numerical features show very weak correlations, reducing multicollinearity concerns
3. **Lead Source Matters**: The source of the lead is the most predictive categorical feature
4. **Baseline Performance**: A simple logistic regression achieves 74% accuracy
5. **Feature Redundancy**: The 'industry' feature can be removed without changing validation accuracy
6. **Regularization Stability**: The model shows consistent performance across all regularization strengths