# Explore here

In [None]:
import pandas as pd

url = "https://breathecode.herokuapp.com/asset/internal-link?id=413&path=bank-marketing-campaign-data.csv"
df = pd.read_csv(url, sep=";")

display(df.head())
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")



## Model Training Overview

This notebook builds a **Logistic Regression** model to predict whether a client will subscribe to a term deposit (`yes` or `no`).

### What happens in simple steps
1. Load the dataset.
2. Split the data into inputs (`X`) and target (`y`).
3. Split into training and test sets with a stratified split, so both keep the original class proportions.
4. Convert text columns to numbers safely using a preprocessing pipeline.
5. Train a fast Logistic Regression model.
6. Measure accuracy on unseen test data.
7. Run a small cross-validated grid search over hyperparameters to try to improve the model.


### Step 1: Load data
- `pd.read_csv(..., sep=';')` reads the CSV file.
- We use `sep=';'` because this file uses semicolons.
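
To see why the separator matters, here is a minimal sketch using a small in-memory sample instead of the real file. The column names and values are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical two-row sample in the same semicolon-delimited layout.
sample = "age;job;y\n30;admin.;no\n45;technician;yes\n"

# Default separator (comma): each line lands in a single column.
wrong = pd.read_csv(io.StringIO(sample))

# Semicolon separator: three proper columns.
right = pd.read_csv(io.StringIO(sample), sep=";")

print("default sep:", wrong.shape)
print("sep=';':", right.shape)
```

If `df.shape[1]` comes back as `1` after loading, a wrong separator is a likely cause.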

### Step 2: Create features and target
- `X = df.drop(columns=['y'])` keeps all input columns.
- `y = df['y'].map({'no': 0, 'yes': 1})` converts labels to numbers.
  - `0` = no subscription
  - `1` = subscription
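
One detail worth checking is that `.map` turns any label missing from the dictionary into `NaN`. A minimal sketch, using a made-up series where `"maybe"` stands in for an unexpected value:

```python
import pandas as pd

# Hypothetical labels; "maybe" is not part of the mapping.
labels = pd.Series(["no", "yes", "no", "maybe"])

encoded = labels.map({"no": 0, "yes": 1})

# Labels that are not in the mapping become NaN, so count them before training.
n_unmapped = int(encoded.isna().sum())

print(encoded.tolist())
print("Unmapped labels:", n_unmapped)
```

Running the same `isna().sum()` check on the real `y` is a cheap way to confirm that the target contains only `yes` and `no`.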

### Step 3: Fast stratified train/test split
- `train_test_split(..., test_size=0.2, stratify=y, random_state=42)` does an 80/20 split.
- `stratify=y` keeps class proportions similar in train and test.
- This is simpler and faster than manual index splitting.
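
The effect of `stratify` can be sketched on toy data. The 80/20 class mix below is an assumption chosen for illustration, not the real distribution of this dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 80 zeros and 20 ones.
y_toy = pd.Series([0] * 80 + [1] * 20)
X_toy = pd.DataFrame({"feature": range(100)})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# With stratification, both subsets keep the same share of positives.
print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```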

### Step 4: Convert text to numeric (without leakage)
- The model cannot learn from raw text directly.
- `OneHotEncoder` turns categories (job, month, etc.) into numeric columns.
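- `ColumnTransformer` applies encoding only to categorical columns; numeric columns pass through unchanged (`remainder='passthrough'`).
- `Pipeline` keeps preprocessing + model together in one clean workflow. Because the encoder is fitted inside the pipeline, it only ever learns categories from the training data, which is what prevents leakage.
- `handle_unknown='ignore'` avoids errors if a new category appears in test data; that category is encoded as all zeros.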
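
A minimal sketch of how an unseen category is handled, using a single made-up `job` column. The values are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training and test values for one categorical column.
train = pd.DataFrame({"job": ["admin.", "technician", "admin."]})
test = pd.DataFrame({"job": ["technician", "student"]})  # "student" is unseen

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)  # categories are learned from the training rows only

# The default output is a sparse matrix; convert it to inspect the values.
encoded_test = encoder.transform(test).toarray()

print(encoder.categories_[0])
print(encoded_test)
```

The row for `"student"` contains only zeros, because the encoder has no column for a category it never saw during `fit`.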

### Step 5: Train the baseline model
- We use `LogisticRegression(solver='liblinear', max_iter=200)` for fast binary classification.
- `model.fit(X_train_raw, y_train)` learns from the training set.

### Step 6: Evaluate model quality
- `model.predict(X_test_raw)` makes predictions on unseen data.
- `accuracy_score(y_test, y_pred)` returns the fraction of correct predictions (a value between 0 and 1).
- An accuracy of about **0.91** means roughly **91%** of test predictions are correct.
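
As a quick sanity check on what the metric computes, here is a small sketch with made-up labels and predictions for ten clients:

```python
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_hat = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

# accuracy = number of correct predictions / total number of predictions
manual = sum(t == p for t, p in zip(y_true, y_hat)) / len(y_true)
acc_toy = accuracy_score(y_true, y_hat)

print("manual:", manual)
print("accuracy_score:", acc_toy)
```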

### Step 7: Hyperparameter tuning
- `GridSearchCV` tries every combination in a small parameter grid (here, three values of `C`).
- Each candidate is scored with 5-fold cross-validation (`cv=5`) on the training set only.
- The best candidate is refit on the full training set and then checked on the test set.

---

### Why this approach is better
- Faster training and tuning.
- Cleaner code with one pipeline.
- Lower risk of preprocessing mistakes.
- Good accuracy with less compute time.

In [None]:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# 1) Load CSV
df = pd.read_csv("data/raw/bank-marketing-campaign-data.csv", sep=";")

# 2) Define features and target
X = df.drop(columns=["y"])
y = df["y"].map({"no": 0, "yes": 1})
print("X shape:", X.shape, "y shape:", y.shape)

# 3) Fast stratified split (80/20)
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print("X_train:", X_train_raw.shape, "X_test:", X_test_raw.shape)
print("y_train:", y_train.shape, "y_test:", y_test.shape)

# 4) Sparse one-hot preprocessing + fast binary solver
categorical_cols = X.select_dtypes(include=["object", "string"]).columns.tolist()
preprocess = ColumnTransformer(
    transformers=[
        (
            "cat",
            OneHotEncoder(handle_unknown="ignore", sparse_output=True),
            categorical_cols,
        )
    ],
    remainder="passthrough",
)

model = Pipeline(
    steps=[
        ("preprocess", preprocess),
        (
            "clf",
            LogisticRegression(
                solver="liblinear",
                C=1.0,
                max_iter=200,
                random_state=42,
            ),
        ),
    ]
)

# 5) Train + evaluate
model.fit(X_train_raw, y_train)
y_pred = model.predict(X_test_raw)
acc = accuracy_score(y_test, y_pred)
print("Accuracy:", round(acc, 4))

## Hyperparameter Tuning Overview

This section improves the model by trying different settings automatically and choosing the best one.

### Key points
- We start with the model pipeline built in the previous cell.
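- We test different values for `C`, the inverse of regularization strength (smaller `C` means stronger regularization).
- The solver (`liblinear`) and penalty (`l2`) stay fixed, and `max_iter` is raised to 500.
- We do this with `GridSearchCV`, which scores every combination in the grid with 5-fold cross-validation.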
- Finally, we keep the best model and check its accuracy on test data.

### Why this is useful
- The selected settings score at least as well in cross-validation as every other candidate in the grid.
- The grid is small, so the full search stays cheap.
- It helps find strong settings without manual trial and error.
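
The tuning cell in this notebook passes a small grid to `GridSearchCV` with `cv=5`. A rough sketch of how many model fits that implies, assuming the default `refit=True`:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as the tuning cell in this notebook.
param_grid = {
    "clf__C": [0.1, 1, 10],
    "clf__solver": ["liblinear"],
    "clf__penalty": ["l2"],
    "clf__max_iter": [500],
}

n_candidates = len(ParameterGrid(param_grid))
n_folds = 5

# Each candidate is fitted once per fold; the final refit of the best
# candidate on the full training set adds one more fit.
n_fits = n_candidates * n_folds + 1

print("candidates:", n_candidates)
print("total fits:", n_fits)
```

Only `clf__C` has more than one value, so the grid holds three candidates and the search stays fast.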

In [None]:
# Import Python warnings module
import warnings
# Hide repeated FutureWarning messages so output stays clean and easy to read
warnings.filterwarnings("ignore", category=FutureWarning)

# Tool to test many settings and pick the best one
from sklearn.model_selection import GridSearchCV

# List the settings we want to test
param_grid = {
    # C is the inverse of regularization strength (smaller C = stronger regularization)
    "clf__C": [0.1, 1, 10],
    # Use this solver
    "clf__solver": ["liblinear"],
    # Use L2 penalty
    "clf__penalty": ["l2"],
    # Give the model enough training steps
    "clf__max_iter": [500],
}

# Set up Grid Search with the model pipeline
search = GridSearchCV(
    # Model from previous cell
    estimator=model,
    # Settings to test
    param_grid=param_grid,
    # Split training data into 5 parts
    cv=5,
    # Compare by accuracy
    scoring="accuracy",
    # Use all CPU cores for speed
    n_jobs=-1,
)

# Train all combinations on training data
search.fit(X_train_raw, y_train)

# Show the best settings
print("Best Params:", search.best_params_)
# Show best average score from 5-fold CV
print("Best CV Accuracy:", round(search.best_score_, 4))
# Test the best model on test data
print("Test Score:", round(search.best_estimator_.score(X_test_raw, y_test), 4))

## Observations: Accuracy Score vs GridSearchCV

### 1) Baseline model (`accuracy_score`)
- **Test Accuracy:** **91.32%**
- The baseline model correctly predicts approximately 91 out of 100 cases on the test set.
- This approach is computationally efficient because it uses a single fixed configuration.

### 2) Tuned model (`GridSearchCV`)
- **Best CV Accuracy (5-fold):** **90.97%**
- **Test Score:** **91.32%**
- Best hyperparameters: `C=1`, `solver='liblinear'`, `penalty='l2'`, `max_iter=500`.

### 3) Comparison and conclusion
- **Baseline Test Accuracy vs GridSearchCV Test Score:** **91.32% vs 91.32%**
- **Baseline Test Accuracy vs GridSearchCV Best CV Accuracy:** **91.32% vs 90.97%**
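- The two approaches deliver identical test-set performance in this notebook. This is expected: the search selected `C=1`, the same value the baseline uses, so the two models differ only in `max_iter`.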
- GridSearchCV provides stronger methodological confidence by validating performance across multiple folds.
- Recommended interpretation: retain the baseline model for speed-critical workflows, and use GridSearchCV when robust hyperparameter validation is required.

In [None]:
import numpy as np
from sklearn.model_selection import RandomizedSearchCV

