# Explore here

In [1]:
# Import pandas for data loading and table manipulation.
import pandas as pd

# Define the remote CSV file URL containing the bank marketing dataset.
url = "https://breathecode.herokuapp.com/asset/internal-link?id=413&path=bank-marketing-campaign-data.csv"
# Read the CSV file using semicolon as delimiter into a DataFrame.
df = pd.read_csv(url, sep=";")

# Display the first 5 rows for a quick data preview.
display(df.head())
# Print total number of rows and columns to confirm dataset dimensions.
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")

| | age | job | marital | education | default | housing | loan | contact | month | day_of_week | ... | campaign | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 56 | housemaid | married | basic.4y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 1 | 57 | services | married | high.school | unknown | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 2 | 37 | services | married | high.school | no | yes | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 3 | 40 | admin. | married | basic.6y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 4 | 56 | services | married | high.school | no | no | yes | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |


Rows: 41188, Columns: 21


## Model Training Overview

This notebook builds a **Logistic Regression** model to predict whether a client will subscribe to a term deposit (`yes` or `no`).

### What happens in simple steps
1. Load the dataset.
2. Split the data into inputs (`X`) and target (`y`).
3. Split into training and testing sets, preserving the class proportions (stratified split).
4. Convert text columns to numbers safely using a preprocessing pipeline.
5. Train a fast Logistic Regression model.
6. Measure accuracy on unseen test data.
7. Try a faster hyperparameter search to improve the model.


### Step 1: Load data
- `pd.read_csv(..., sep=';')` reads the CSV file.
- We use `sep=';'` because this file uses semicolons.
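
To see why the separator matters, here is a minimal sketch that parses a tiny in-memory sample instead of the real file. The three-row `sample` string is a hypothetical stand-in for the dataset, not actual data from it.

```python
import io

import pandas as pd

# Hypothetical three-row sample in the same semicolon-delimited layout.
sample = "age;job;y\n56;housemaid;no\n57;services;no\n37;admin.;yes\n"

# With the default comma separator, every row collapses into one column.
wrong = pd.read_csv(io.StringIO(sample))
# With sep=";", the three columns are parsed as intended.
right = pd.read_csv(io.StringIO(sample), sep=";")

print(wrong.shape)  # (3, 1)
print(right.shape)  # (3, 3)
```

A quick look at `df.shape` after loading is usually enough to catch a wrong separator.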

### Step 2: Create features and target
- `X = df.drop(columns=['y'])` keeps all input columns.
- `y = df['y'].map({'no': 0, 'yes': 1})` converts labels to numbers.
  - `0` = no subscription
  - `1` = subscription
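
One detail worth checking after the mapping: any label missing from the dictionary becomes `NaN` rather than raising an error. The sketch below uses a hypothetical series with one unexpected value (`"maybe"`) to show the effect.

```python
import pandas as pd

# Hypothetical labels; "maybe" is an invented bad value for illustration.
labels = pd.Series(["no", "yes", "no", "maybe"])

# Values not present in the mapping are silently converted to NaN.
encoded = labels.map({"no": 0, "yes": 1})

print(encoded.isna().sum())  # 1
```

In the notebook, a check such as `y.isna().sum()` before splitting would confirm that every label was mapped.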

### Step 3: Fast, stratified train/test split
- `train_test_split(..., test_size=0.2, stratify=y, random_state=42)` does an 80/20 split.
- `stratify=y` keeps class proportions similar in train and test.
- This is simpler and faster than manual index splitting.
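
The effect of `stratify` can be checked on a small synthetic target. This is a sketch with made-up data (900 zeros and 100 ones), not the notebook's dataset; the names `X_toy` and `y_toy` are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 900 zeros and 100 ones (10% positive).
y_toy = np.array([0] * 900 + [1] * 100)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# Both subsets keep the original 10% share of positives.
print(y_tr.mean(), y_te.mean())
```

Note that stratification preserves the original class ratio in both subsets; it does not make the classes equally frequent.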

### Step 4: Convert text to numeric (without leakage)
- The model cannot learn from raw text directly.
- `OneHotEncoder` turns categories (job, month, etc.) into numeric columns.
- `ColumnTransformer` applies encoding only to categorical columns.
- `Pipeline` keeps preprocessing + model together in one clean workflow.
- `handle_unknown='ignore'` avoids errors if a new category appears in test data.
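
As a small illustration of `handle_unknown='ignore'`, the sketch below fits an encoder on three job values and then transforms a category it has never seen. The arrays are hypothetical; only the category names are borrowed from the dataset preview.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training and test values for a single categorical column.
train_jobs = np.array([["admin."], ["services"], ["housemaid"]])
test_jobs = np.array([["services"], ["student"]])  # "student" was not seen in fit

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_jobs)

# The unseen category is encoded as a row of zeros instead of raising an error.
encoded = encoder.transform(test_jobs).toarray()
print(encoder.categories_[0])
print(encoded)
```

Because the encoder is fitted only on training data inside the pipeline, the test set never influences which columns exist, which is what keeps the preprocessing free of leakage.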

### Step 5: Train the baseline model
- We use `LogisticRegression(solver='liblinear', max_iter=200)` for fast binary classification.
- `model.fit(X_train_raw, y_train)` learns from the training set.

### Step 6: Evaluate model quality
- `model.predict(X_test_raw)` makes predictions on unseen data.
- `accuracy_score(y_test, y_pred)` returns the fraction of correct predictions (a value between 0 and 1).
- Accuracy around **0.91** means about **91%** of predictions are correct.
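
Accuracy is easier to interpret next to a trivial reference point. The sketch below, using hypothetical labels (90 negatives, 10 positives), shows the score obtained by always predicting the majority class.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels: 90 negatives and 10 positives.
y_true = np.array([0] * 90 + [1] * 10)
# A "model" that always predicts the majority class (0).
y_majority = np.zeros_like(y_true)

baseline_acc = accuracy_score(y_true, y_majority)
print(baseline_acc)  # 0.9
```

In the notebook, `y_test.value_counts(normalize=True)` would give the equivalent majority-class share, which is a useful number to compare against the model's accuracy.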

### Step 7: Faster hyperparameter tuning
- Instead of trying every combination (Grid Search), we use `RandomizedSearchCV`.
- It tests a small random sample of parameter combinations (`n_iter=8`, `cv=3`).
- This usually gives strong results in much less time.

---

### Why this approach is better
- Faster training and tuning.
- Cleaner code with one pipeline.
- Lower risk of preprocessing mistakes.
- Good accuracy with less compute time.

In [2]:
# Import pandas for tabular data handling.
import pandas as pd
# Import transformer to apply preprocessing by column type.
from sklearn.compose import ColumnTransformer
# Import logistic regression classifier for binary prediction.
from sklearn.linear_model import LogisticRegression
# Import metric to compute test-set accuracy.
from sklearn.metrics import accuracy_score
# Import helper to split data into train and test sets.
from sklearn.model_selection import train_test_split
# Import pipeline to chain preprocessing and model steps.
from sklearn.pipeline import Pipeline
# Import one-hot encoder to convert categorical text into numeric vectors.
from sklearn.preprocessing import OneHotEncoder

# Load the local CSV file using semicolon separator.
df = pd.read_csv("data/raw/bank-marketing-campaign-data.csv", sep=";")

# Create feature matrix by removing target column.
X = df.drop(columns=["y"])
# Map target labels from text (no/yes) to numeric (0/1).
y = df["y"].map({"no": 0, "yes": 1})
# Print shapes to verify expected dimensions before splitting.
print("X shape:", X.shape, "y shape:", y.shape)

# Split features and target into stratified train/test subsets (80/20).
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Print split shapes to confirm train/test partitioning.
print("X_train:", X_train_raw.shape, "X_test:", X_test_raw.shape)
# Print label vector shapes to confirm target split worked correctly.
print("y_train:", y_train.shape, "y_test:", y_test.shape)

# Identify categorical columns that need one-hot encoding.
categorical_cols = X.select_dtypes(include=["object", "string"]).columns.tolist()
# Build preprocessing step: one-hot encode categorical columns and pass through others.
preprocess = ColumnTransformer(
    transformers=[
        (
            "cat",
            OneHotEncoder(handle_unknown="ignore", sparse_output=True),
            categorical_cols,
        )
    ],
    remainder="passthrough",
)

# Create end-to-end pipeline with preprocessing followed by logistic regression.
model = Pipeline(
    steps=[
        ("preprocess", preprocess),
        (
            "clf",
            LogisticRegression(
                solver="liblinear",
                C=1.0,
                max_iter=200,
                random_state=42,
            ),
        ),
    ]
)

# Fit the full pipeline on the training data.
model.fit(X_train_raw, y_train)
# Predict target values for the unseen test set.
y_pred = model.predict(X_test_raw)
# Compute accuracy by comparing predictions to true labels.
acc = accuracy_score(y_test, y_pred)
# Print rounded accuracy for quick model quality check.
print("Accuracy:", round(acc, 4))

X shape: (41188, 20) y shape: (41188,)
X_train: (32950, 20) X_test: (8238, 20)
y_train: (32950,) y_test: (8238,)
Accuracy: 0.9132


## Hyperparameter Tuning Overview

This section improves the model by trying different settings automatically and choosing the best one.

### Key points
- We start with the model pipeline built in the previous cell.
- We test different values for `C` (the inverse of regularization strength: smaller values mean stronger regularization).
- The next cell runs `GridSearchCV` over a small grid of `C` values.
- A later cell uses `RandomizedSearchCV`, which samples from a larger space and also tests whether balancing class weights helps.
- Finally, we keep the best model and check its accuracy on test data.
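
The search cells address the classifier's settings through names such as `clf__C`. This follows scikit-learn's `<step name>__<parameter>` convention for pipelines. A minimal sketch, using a hypothetical two-step pipeline (the `scale` step is an assumption and is not part of this notebook):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical pipeline; only the "clf" step name matches the notebook.
pipe = Pipeline(
    steps=[
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(solver="liblinear")),
    ]
)

# "clf__C" targets the C parameter of the step named "clf".
pipe.set_params(clf__C=0.1)

print(pipe.named_steps["clf"].C)      # 0.1
print("clf__C" in pipe.get_params())  # True
```

Calling `model.get_params().keys()` on the notebook's pipeline lists every name that can be used in a search space.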

### Why this is useful
- It usually gives better or equal performance.
- It saves time compared with a full grid search.
- It helps find strong settings without manual trial and error.

In [3]:
# Import Python warnings module.
import warnings
# Hide repeated FutureWarning messages so output stays clean and easy to read.
warnings.filterwarnings("ignore", category=FutureWarning)

# Import GridSearchCV to test multiple parameter combinations.
from sklearn.model_selection import GridSearchCV

# Define the parameter grid to evaluate.
param_grid = {
    # Test different regularization strengths.
    "clf__C": [0.1, 1, 10],
    # Keep solver fixed to liblinear for binary logistic regression.
    "clf__solver": ["liblinear"],
    # Use L2 penalty only.
    "clf__penalty": ["l2"],
    # Use a higher max_iter to help convergence.
    "clf__max_iter": [500],
}

# Create GridSearchCV with the existing preprocessing + model pipeline.
search = GridSearchCV(
    # Use the pipeline created in the previous training cell.
    estimator=model,
    # Provide all parameter combinations to evaluate.
    param_grid=param_grid,
    # Use 5-fold cross-validation for robust validation.
    cv=5,
    # Optimize based on classification accuracy.
    scoring="accuracy",
    # Run folds/combinations in parallel across available CPU cores.
    n_jobs=-1,
)

# Fit grid search on the training split only.
search.fit(X_train_raw, y_train)

# Print the best hyperparameter combination found.
print("Best Params:", search.best_params_)
# Print mean CV accuracy for the best combination.
print("Best CV Accuracy:", round(search.best_score_, 4))
# Print the final test score of the best estimator.
print("Test Score:", round(search.best_estimator_.score(X_test_raw, y_test), 4))

Best Params: {'clf__C': 1, 'clf__max_iter': 500, 'clf__penalty': 'l2', 'clf__solver': 'liblinear'}
Best CV Accuracy: 0.9097
Test Score: 0.9132


## Observations: Accuracy Score vs GridSearchCV

### 1) Baseline model (`accuracy_score`)
- **Test Accuracy:** **91.32%**
- The baseline model correctly predicts approximately 91 out of 100 cases on the test set.
- This approach is computationally efficient because it uses a single fixed configuration.

### 2) Tuned model (`GridSearchCV`)
- **Best CV Accuracy (5-fold):** **90.97%**
- **Test Score:** **91.32%**
- Best hyperparameters: `C=1`, `solver='liblinear'`, `penalty='l2'`, `max_iter=500`.

### 3) Comparison and conclusion
- **Baseline Test Accuracy vs GridSearchCV Test Score:** **91.32% vs 91.32%**
- **Baseline Test Accuracy vs GridSearchCV Best CV Accuracy:** **91.32% vs 90.97%**
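- The two approaches deliver identical test-set performance in this notebook. This is expected: the grid search selected `C=1`, the value the baseline already uses, so only `max_iter` differs between the two models.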
- GridSearchCV provides stronger methodological confidence by validating performance across multiple folds.
- Recommended interpretation: retain the baseline model for speed-critical workflows, and use GridSearchCV when robust hyperparameter validation is required.
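
To judge how much the candidates actually differ, the per-candidate scores in `cv_results_` can be inspected. The sketch below runs on synthetic data from `make_classification`, so its numbers say nothing about the bank dataset; `X_syn`, `y_syn`, and `search_syn` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption, not the notebook's dataset).
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)

search_syn = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid={"C": [0.1, 1, 10]},
    cv=5,
    scoring="accuracy",
)
search_syn.fit(X_syn, y_syn)

# One row per candidate: mean and spread of accuracy across the 5 folds.
columns = ["param_C", "mean_test_score", "std_test_score", "rank_test_score"]
results = pd.DataFrame(search_syn.cv_results_)[columns]
print(results.sort_values("rank_test_score"))
```

When the difference between mean scores is smaller than the fold-to-fold standard deviation, the candidates are hard to tell apart and the simpler or default setting is a reasonable choice. The same table is available in the notebook via `pd.DataFrame(search.cv_results_)`.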

## RandomizedSearchCV: Code and Output Explanation

This cell tunes the Logistic Regression model by testing multiple hyperparameter combinations automatically.

### What the code does
- Imports `RandomizedSearchCV` and `numpy`.
- Defines `param_distributions` to explore different values for:
  - `clf__C` (inverse regularization strength; 10 values from 0.01 to 100 using `np.logspace`).
  - `clf__solver` (`liblinear`).
  - `clf__penalty` (`l2`).
  - `clf__class_weight` (`None` or `balanced`).
  - `clf__max_iter` (300 or 600).
- Builds `RandomizedSearchCV` with:
  - `n_iter=8` (tries 8 random combinations).
  - `cv=3` (3-fold cross-validation).
  - `scoring='accuracy'` (selects best accuracy).
  - `n_jobs=-1` (uses all CPU cores).
- Fits the search on training data and prints the best settings and scores.
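
The size of this search space can be worked out directly. The sketch below lists the `C` values produced by `np.logspace(-2, 2, 10)` and counts every combination with `ParameterGrid`, which is used here only as a counting aid and is not part of the notebook's code.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# 10 values from 10^-2 to 10^2, evenly spaced on a log scale.
c_values = np.logspace(-2, 2, 10)
print(np.round(c_values, 4))

space = {
    "clf__C": c_values,
    "clf__solver": ["liblinear"],
    "clf__penalty": ["l2"],
    "clf__class_weight": [None, "balanced"],
    "clf__max_iter": [300, 600],
}

# 10 * 1 * 1 * 2 * 2 = 40 possible combinations.
n_combinations = len(ParameterGrid(space))
print(n_combinations)      # 40
print(8 / n_combinations)  # 0.2
```

With `n_iter=8`, the randomized search therefore evaluates one fifth of the combinations an exhaustive grid search would try over the same space.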

### How to read the output
- `Fitting 3 folds for each of 8 candidates, totalling 24 fits`
  - Confirms total model trainings performed during cross-validation.
- `Best Params: {...}`
  - Shows the hyperparameter combination with the highest average CV accuracy.
- `Best CV Accuracy: ...`
  - Mean accuracy across the 3 validation folds for the best combination.
- `Test Score: ...`
  - Accuracy of the best model on unseen test data.

### Interpretation tip
If `Test Score` is close to `Best CV Accuracy`, the tuned model generalizes well and is likely not overfitting heavily.

In [5]:
# Import RandomizedSearchCV for faster randomized hyperparameter tuning.
from sklearn.model_selection import RandomizedSearchCV

# Import NumPy to generate logarithmically spaced C values.
import numpy as np

# Define parameter distributions to sample during randomized search.
param_distributions = {
    # Sample C from 10 values between 10^-2 (0.01) and 10^2 (100), evenly spaced on a log scale.
    "clf__C": np.logspace(-2, 2, 10),
    # Keep solver fixed to liblinear for this binary classification setup.
    "clf__solver": ["liblinear"],
    # Restrict to l2 penalty for compatibility and stable convergence.
    "clf__penalty": ["l2"],
    # Test default class weighting versus balanced class weighting.
    "clf__class_weight": [None, "balanced"],
    # Try moderate and higher iteration limits to ensure convergence.
    "clf__max_iter": [300, 600],
}

# Create randomized search configured for faster execution.
random_search = RandomizedSearchCV(
    # Use the pipeline defined in the training cell (the search clones and refits it).
    estimator=model,
    # Pass the parameter distributions defined above.
    param_distributions=param_distributions,
    # Evaluate 8 random combinations from the search space.
    n_iter=8,
    # Use 3-fold cross-validation for speed.
    cv=3,
    # Optimize hyperparameters using accuracy.
    scoring="accuracy",
    # Use all available CPU cores in parallel.
    n_jobs=-1,
    # Fix randomness so results are reproducible.
    random_state=42,
    # Show progress while fitting.
    verbose=1,
)

# Fit randomized search on the training split.
random_search.fit(X_train_raw, y_train)

# Print the best hyperparameter combination found.
print("Best Params:", random_search.best_params_)
# Print the best average cross-validation accuracy.
print("Best CV Accuracy:", round(random_search.best_score_, 4))
# Print the test accuracy of the best estimator.
print("Test Score:", round(random_search.best_estimator_.score(X_test_raw, y_test), 4))

Fitting 3 folds for each of 8 candidates, totalling 24 fits
Best Params: {'clf__solver': 'liblinear', 'clf__penalty': 'l2', 'clf__max_iter': 300, 'clf__class_weight': None, 'clf__C': np.float64(0.027825594022071243)}
Best CV Accuracy: 0.9096
Test Score: 0.9132


## Pros and Cons of This RandomizedSearchCV Strategy

### Pros
- Faster than a full GridSearchCV over the same space because it tests only a random subset of combinations (`n_iter=8` out of 10 × 2 × 2 = 40).
- Good coverage across scales of regularization strength using `np.logspace(-2, 2, 10)` for `C`.
- Uses cross-validation (`cv=3`), so parameter choice is more reliable than a single train/validation split.
- Reproducible results due to `random_state=42`.
- Efficient runtime with parallel processing (`n_jobs=-1`).

### Cons
- May miss the true best combination because it does not evaluate every possible parameter set.
- With only 8 iterations, the search covers a fifth of the 40 combinations here, and coverage shrinks further for larger parameter spaces.
- `cv=3` is faster but less stable than higher folds (e.g., 5-fold) for noisy data.
- Restricting to one solver (`liblinear`) and one penalty (`l2`) reduces exploration breadth.
- Accuracy-only optimization may overlook other useful metrics (precision, recall, F1, ROC-AUC) depending on business goals.
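
To illustrate the last point, here is a sketch with hypothetical predictions on an imbalanced target. The labels are invented for the example and do not come from the notebook's model.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical ground truth: 90 negatives followed by 10 positives.
y_true = np.array([0] * 90 + [1] * 10)
# Hypothetical predictions: 2 false positives, then 4 of the 10 positives found.
y_pred = np.array([0] * 88 + [1] * 2 + [1] * 4 + [0] * 6)

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(round(acc, 4))   # 0.92
print(round(prec, 4))  # 0.6667
print(round(rec, 4))   # 0.4
print(round(f1, 4))    # 0.5
```

Accuracy looks strong at 0.92 even though only 4 of 10 positives are found. If missed subscribers are costly, passing a different `scoring` value such as `"f1"`, `"recall"`, or `"roc_auc"` to the search may match the goal better.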