## 1) Imports

**What is this section for?**

This section imports NumPy, pandas, and the scikit-learn (sklearn) modules that you'll need for machine learning tasks. Think of it as gathering all your tools before starting a project.

**Key Components:**
- **numpy & pandas**: For data manipulation and numerical operations
- **model_selection**: Tools for splitting data and evaluating models
- **Pipeline & ColumnTransformer**: For creating reproducible preprocessing workflows
- **preprocessing**: For transforming features (scaling, encoding, etc.)
- **metrics**: For evaluating model performance

**Why import everything at the start?**
By importing all modules upfront, you ensure all dependencies are available and can quickly identify any missing packages before running your analysis.

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import (
    train_test_split, KFold, StratifiedKFold, GridSearchCV, cross_val_score, cross_val_predict
)
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler, OneHotEncoder, PolynomialFeatures
)
from sklearn.impute import SimpleImputer

from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score,
    confusion_matrix, classification_report,
    mean_absolute_error, mean_squared_error, r2_score,
    root_mean_squared_error,  # requires scikit-learn >= 1.4
)


## 2) Universal Split

**Purpose:** Splitting your data into training and test sets is crucial for evaluating how well your model generalizes to unseen data.

**Why split data?**
- **Training set**: Used to train/fit the model
- **Test set**: Used to evaluate model performance on unseen data
- A held-out test set exposes overfitting (it does not prevent it) and gives a realistic estimate of model performance

### 2.1 Basic Split

**What it does:** Randomly splits your data into training (80%) and test (20%) sets.

**Parameters explained:**
- `test_size=0.2`: 20% of data goes to test set, 80% to training
- `random_state=0`: Sets a seed for reproducibility (same split every time)
- `shuffle=True`: Randomly shuffles data before splitting (recommended)

**When to use:** Most general-purpose machine learning tasks where class balance isn't a major concern.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, shuffle=True
)

### 2.2 Stratified Split

**For Imbalanced Classes**

**What it does:** Ensures that the proportion of each class is maintained in both training and test sets.

**Parameters explained:**
- `stratify=y`: Maintains the same class distribution in train and test sets

**Example:** If your dataset has 70% class 0 and 30% class 1, stratified split ensures both train and test sets have approximately the same 70-30 ratio.

**When to use:** 
- Classification problems with imbalanced classes
- When you want to ensure representative sampling of minority classes
- Prevents situations where rare classes might be missing from test set

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, shuffle=True, stratify=y
)
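
To make the 70-30 example concrete, here is a minimal sketch on made-up data (the arrays `X_demo` and `y_demo` are hypothetical and only serve the illustration). It checks that the share of class 1 is preserved in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 70 samples of class 0 and 30 samples of class 1
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 70 + [1] * 30)

_, _, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Share of class 1 in each split; with stratify=y both stay close to 0.30
train_share = y_tr_demo.mean()
test_share = y_te_demo.mean()
print(f"train: {train_share:.2f}, test: {test_share:.2f}")
```

Without `stratify=y_demo`, the two shares can drift apart, especially on small datasets.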

### 2.3 Train Test Calibration Split

**What it does:** Creates THREE sets instead of two:
- **Training set (60%)**: For training the model
- **Calibration set (20%)**: For calibrating predicted probabilities
- **Test set (20%)**: For final evaluation

**Why calibration?** 
Some models (like SVM) produce scores that aren't well-calibrated probabilities. A calibration set helps adjust these scores to reflect true probabilities.

**When to use:**
- When you need reliable probability estimates (e.g., risk assessment)
- When working with models that output uncalibrated scores
- When threshold selection is important for decision-making

In [2]:
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0, stratify=y)
X_cal, X_te, y_cal, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0, stratify=y_tmp)
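
One way the calibration set could be used is sketched below, assuming synthetic data from `make_classification` and a hand-rolled Platt-style step (a logistic regression fitted on the SVM scores). This is an illustration under those assumptions, not the only way to calibrate; scikit-learn also ships `CalibratedClassifierCV` for this purpose.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in for X, y (assumption for this sketch)
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=0)

# Same 60/20/20 split as above
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=0, stratify=y_demo
)
X_cal, X_te, y_cal, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=0, stratify=y_tmp
)

# 1) Fit the base model on the training set only
base = LinearSVC(C=1.0, max_iter=20000).fit(X_tr, y_tr)

# 2) Fit a sigmoid on the calibration set: raw scores in, labels out
calibrator = LogisticRegression()
calibrator.fit(base.decision_function(X_cal).reshape(-1, 1), y_cal)

# 3) Map test scores to probabilities for the final evaluation
proba_te = calibrator.predict_proba(
    base.decision_function(X_te).reshape(-1, 1)
)[:, 1]
print(len(X_tr), len(X_cal), len(X_te))
```

Keeping the calibration rows out of the base model's training data is the point of the three-way split: the sigmoid is fitted on scores the model has not already memorized.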


## 3) Data Handling & Encoding

**What is this section for?**

This section covers essential data preprocessing techniques for handling different types of features in your dataset. Before feeding data into machine learning models, you often need to transform categorical variables, cyclical features, and handle various data types appropriately.

**Key Concepts:**
- **Categorical Encoding**: Converting text/category labels into numbers
- **One-Hot Encoding**: Creating binary columns for each category
- **Ordinal Encoding**: Assigning numeric order to categories
- **Cyclical Encoding**: Handling circular features like time/angles
- **Target Encoding**: Using target statistics for encoding

**Why is encoding important?**
Most machine learning algorithms work only with numerical data. Proper encoding ensures that categorical information is preserved while making it compatible with ML models.

### 3.1 One-Hot Encoding

**What it does:** Creates binary (0/1) columns for each unique category in a categorical variable. Each row gets a 1 in the column corresponding to its category, and 0 in all others.

**Example:** 
- Color: ['red', 'blue', 'green'] becomes:
  - `Color_red`: [1, 0, 0]
  - `Color_blue`: [0, 1, 0]
  - `Color_green`: [0, 0, 1]

**Parameters:**
- `drop='first'`: Removes one category to avoid multicollinearity (recommended for linear models)
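- `handle_unknown='ignore'`: Encodes categories unseen during fit as all zeros instead of raising an error (combined with `drop='first'`, such rows look identical to the dropped category)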
- `sparse_output=False`: Returns dense array instead of sparse matrix

**When to use:**
- Nominal categorical features (no inherent order)
- When you have relatively few unique categories (< 10-15)
- Tree-based models (don't need to drop first)

**Advantages:** No ordinal relationship assumed
**Disadvantages:** Creates many columns for high-cardinality features

In [None]:
from sklearn.preprocessing import OneHotEncoder

# One-Hot Encoding in a pipeline
cat_pipe = Pipeline(steps=[
    ("onehot", OneHotEncoder(drop='first', handle_unknown='ignore', sparse_output=False))
])

# Full preprocessing with OneHotEncoder
pre = ColumnTransformer(transformers=[
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(drop='first', handle_unknown='ignore'), categorical_cols)
])

# Usage
pre.fit(X_train)
X_train_processed = pre.transform(X_train)
X_test_processed = pre.transform(X_test)

# Or directly without pipeline
encoder = OneHotEncoder(drop='first', handle_unknown='ignore')
X_encoded = encoder.fit_transform(X_train[categorical_cols])
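
The following small sketch shows what the encoder produces on the color example, including an unseen category. The arrays `train_colors` and `test_colors` are made up for illustration; note that scikit-learn orders the output columns by sorted category name, so the column order differs from the listing above.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; "purple" never appears during fit
train_colors = np.array([["red"], ["blue"], ["green"]])
test_colors = np.array([["blue"], ["purple"]])

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train_colors)

# Columns follow the sorted categories: blue, green, red
categories = list(enc.categories_[0])

# Default output is sparse; convert to a dense array for display
encoded = enc.transform(test_colors).toarray()
print(categories)
print(encoded)  # "blue" -> [1, 0, 0]; unseen "purple" -> [0, 0, 0]
```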

### 3.2 Ordinal Encoding

**What it does:** Maps categories to integers while preserving order/rank. Useful when categories have a natural ordering.

**Example:**
- Education: ['High School', 'Bachelor', 'Master', 'PhD'] → [0, 1, 2, 3]
- Size: ['Small', 'Medium', 'Large'] → [0, 1, 2]

**Parameters:**
- `categories`: List of lists defining the order for each feature
- `handle_unknown='use_encoded_value'`: Handle unseen categories
- `unknown_value`: Value to use for unknown categories (e.g., -1)

**When to use:**
- Ordinal categorical features (clear ordering)
- Education level, satisfaction ratings, size categories
- When the numeric order matters

**Advantages:** 
- Single column output (memory efficient)
- Preserves ordinal relationships

**Disadvantages:** 
- Assumes equal spacing between categories
- Can mislead models into thinking numeric distance is meaningful

In [None]:
from sklearn.preprocessing import OrdinalEncoder

# Define category order
education_order = [['High School', 'Bachelor', 'Master', 'PhD']]
size_order = [['Small', 'Medium', 'Large']]

# Ordinal encoding in pipeline
ordinal_pipe = Pipeline(steps=[
    ("ordinal", OrdinalEncoder(
        categories=education_order,
        handle_unknown='use_encoded_value',
        unknown_value=-1
    ))
])

# Multiple ordinal features with ColumnTransformer
pre = ColumnTransformer(transformers=[
    ("num", StandardScaler(), numeric_cols),
    ("ordinal_edu", OrdinalEncoder(categories=education_order), ['education']),
    ("ordinal_size", OrdinalEncoder(categories=size_order), ['size']),
    ("cat", OneHotEncoder(drop='first'), other_categorical_cols)
])

# Direct usage
encoder = OrdinalEncoder(categories=education_order)
X_encoded = encoder.fit_transform(X_train[['education']])
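
As a quick check of the mapping and of the unknown-category handling, here is a minimal sketch on made-up size values ("Huge" is a hypothetical category that was not seen during fit).

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

size_enc = OrdinalEncoder(
    categories=[["Small", "Medium", "Large"]],  # explicit order: 0, 1, 2
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)
size_enc.fit(np.array([["Small"], ["Large"], ["Medium"]], dtype=object))

# "Huge" was not seen during fit, so it maps to unknown_value
sizes_encoded = size_enc.transform(
    np.array([["Large"], ["Small"], ["Huge"]], dtype=object)
)
print(sizes_encoded.ravel())  # Large -> 2, Small -> 0, Huge -> -1
```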

### 3.3 Label Encoding

**What it does:** Simple integer encoding for categorical variables. Maps each unique category to an integer (0, 1, 2, ...).

**Example:**
- Color: ['red', 'blue', 'green', 'red'] → [2, 0, 1, 2] (classes are sorted alphabetically: blue=0, green=1, red=2)

**When to use:**
- **Primary use**: Encoding the target variable (y) for classification
- Tree-based models can handle label-encoded features
- When you need simple integer mapping without creating multiple columns

**Important Note:** 
- For features, prefer OneHotEncoder or OrdinalEncoder depending on whether there's an order
- LabelEncoder is not meant for feature steps in pipelines: it expects a 1-D target array, and `transform` raises an error on labels unseen during `fit`
- LabelEncoder is mainly for target variables

**Advantages:** Simple, memory efficient, good for targets
**Disadvantages:** Implies ordinal relationship, not pipeline-compatible for features

In [None]:
from sklearn.preprocessing import LabelEncoder

# Primarily for target variable encoding
le = LabelEncoder()
y_train_encoded = le.fit_transform(y_train)  # ['cat', 'dog', 'cat'] → [0, 1, 0]
y_test_encoded = le.transform(y_test)

# Get original labels back
y_original = le.inverse_transform(y_train_encoded)

# See the mapping
print(le.classes_)  # ['cat', 'dog']

# For features (not recommended - use OrdinalEncoder instead)
# LabelEncoder expects a 1-D array and raises ValueError on labels unseen during fit
le_feature = LabelEncoder()
X_train['color_encoded'] = le_feature.fit_transform(X_train['color'])
X_test['color_encoded'] = le_feature.transform(X_test['color'])
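
A short sketch of the two behaviors worth remembering, using a made-up list of colors: the integer codes follow the sorted class names, and labels unseen during fit are rejected.

```python
from sklearn.preprocessing import LabelEncoder

le_demo = LabelEncoder()
codes = le_demo.fit_transform(["red", "blue", "green", "red"])

# Classes are stored sorted, so the codes follow alphabetical order
classes = list(le_demo.classes_)  # blue, green, red
print(classes, codes)             # codes: red=2, blue=0, green=1

# Labels that were not present during fit raise a ValueError
try:
    le_demo.transform(["purple"])
    unseen_accepted = True
except ValueError:
    unseen_accepted = False
print(unseen_accepted)
```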

### 3.4 Cyclical Encoding (Sine-Cosine Transform)

**What it does:** Transforms cyclical/circular features (like time, angles, days) into two continuous features using sine and cosine functions. This preserves the circular nature where the maximum value is "close" to the minimum.

**Why is it needed?**
- Regular encoding treats 23:59 and 00:01 as far apart, but they're actually 2 minutes apart
- Hour 0 and Hour 23 should be close, not distant
- Months: December (12) and January (1) are consecutive

**Example - Hours (0-23):**
- Hour 0 → sin=0, cos=1
- Hour 6 → sin=1, cos=0
- Hour 12 → sin=0, cos=-1
- Hour 18 → sin=-1, cos=0

**When to use:**
- Time features: hours, minutes, seconds
- Calendar features: day of week, day of month, month
- Angular measurements: compass directions
- Any feature that "wraps around"

**Formula:**
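- `sin_feature = sin(2π × value / period)`
- `cos_feature = cos(2π × value / period)`

Here `period` is the number of distinct values in one cycle (24 for hours 0-23, 7 for days of the week, 12 for months), not the largest value.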

In [None]:
import numpy as np
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline

# Create cyclical encoding function
def cyclical_encoding(X, period):
    """
    Encode cyclical features using sin/cos transformation
    X: array-like, shape (n_samples, 1) - the cyclical feature
    period: int - the period of the cycle (e.g., 24 for hours, 7 for days, 12 for months)
    """
    X = np.asarray(X).reshape(-1, 1)
    sin_encoded = np.sin(2 * np.pi * X / period)
    cos_encoded = np.cos(2 * np.pi * X / period)
    return np.concatenate([sin_encoded, cos_encoded], axis=1)

# Example: Encode hours (0-23)
X_train['hour_sin'] = np.sin(2 * np.pi * X_train['hour'] / 24)
X_train['hour_cos'] = np.cos(2 * np.pi * X_train['hour'] / 24)

# Example: Encode day of week (0-6)
X_train['day_sin'] = np.sin(2 * np.pi * X_train['day_of_week'] / 7)
X_train['day_cos'] = np.cos(2 * np.pi * X_train['day_of_week'] / 7)

# Example: Encode month (1-12)
X_train['month_sin'] = np.sin(2 * np.pi * X_train['month'] / 12)
X_train['month_cos'] = np.cos(2 * np.pi * X_train['month'] / 12)

# Using FunctionTransformer in a pipeline
# (kw_args instead of a lambda keeps the transformer picklable)
hour_transformer = FunctionTransformer(cyclical_encoding, kw_args={"period": 24})

pre = ColumnTransformer(transformers=[
    ("hour_cyclic", hour_transformer, ['hour']),
    ("num", StandardScaler(), other_numeric_cols),
    ("cat", OneHotEncoder(drop='first'), categorical_cols)
])
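
To see why the sine-cosine pair preserves the wrap-around, the sketch below compares distances between encoded hours. The helpers `encode_hour` and `hour_distance` are hypothetical names introduced only for this illustration.

```python
import numpy as np

def encode_hour(hour):
    """Return the (sin, cos) pair for an hour on a 24-hour cycle."""
    angle = 2 * np.pi * np.asarray(hour, dtype=float) / 24
    return np.stack([np.sin(angle), np.cos(angle)], axis=-1)

def hour_distance(a, b):
    """Euclidean distance between two hours in the encoded space."""
    return float(np.linalg.norm(encode_hour(a) - encode_hour(b)))

d_23_0 = hour_distance(23, 0)  # one hour apart across midnight
d_0_1 = hour_distance(0, 1)    # one hour apart within the day
d_0_12 = hour_distance(0, 12)  # opposite sides of the clock

print(d_23_0, d_0_1, d_0_12)
```

The first two distances are equal, and midnight versus noon is the largest possible distance (2.0), which is the behavior a raw 0-23 integer cannot express.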

### 3.5 Target Encoding (Mean Encoding)

**What it does:** Replaces each category with the mean of the target variable for that category. This creates a numeric representation that directly correlates with the target.

**Example:**
```
City        | Target | → City_encoded
------------|--------|----------------
New York    | 1      | 0.67
Los Angeles | 0      | 0.00
New York    | 1      | 0.67
Chicago     | 0      | 0.50
New York    | 0      | 0.67
Chicago     | 1      | 0.50
```
New York average: (1+1+0)/3 ≈ 0.67; Chicago: (0+1)/2 = 0.50; Los Angeles: 0/1 = 0.00.
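
The per-category means for the six example rows can be reproduced with a plain pandas `groupby`. This is only a sketch of the naive (non-cross-validated) encoding, and the DataFrame `df_demo` is made up to mirror the example.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "city": ["New York", "Los Angeles", "New York", "Chicago", "New York", "Chicago"],
    "target": [1, 0, 1, 0, 0, 1],
})

# Mean of the target per category
city_means = df_demo.groupby("city")["target"].mean()

# Replace each category with its mean
df_demo["city_encoded"] = df_demo["city"].map(city_means)
print(city_means.round(2).to_dict())
```

Because every row contributes to its own category mean here, this naive version leaks the target; it is shown only to make the arithmetic visible.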

**Advantages:**
- Handles high cardinality features well (many unique categories)
- Captures relationship with target
- Single column output

**Disadvantages:**
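- **Risk of overfitting/data leakage**: encode training rows out-of-fold (cross-validation) or on a holdout
- Requires the target during fitting: the mapping is learned from training targets only and then applied to the test set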
- Can overfit on rare categories

**When to use:**
- High cardinality categorical features (e.g., city names, zip codes)
- When One-Hot would create too many columns
- Tree-based models (gradient boosting, random forests)

**Important:** Always use cross-validation or holdout approach to prevent data leakage!

In [None]:
# Target Encoding using category_encoders library
# Install: pip install category-encoders

from category_encoders import TargetEncoder

# Target encoding (smoothing shrinks rare categories toward the global mean;
# it reduces overfitting but does not remove target leakage by itself)
target_enc = TargetEncoder(cols=['city', 'category'], smoothing=1.0)

# Fit on training data with target
target_enc.fit(X_train[['city', 'category']], y_train)

# Transform both train and test
X_train_encoded = target_enc.transform(X_train[['city', 'category']])
X_test_encoded = target_enc.transform(X_test[['city', 'category']])

# Manual target encoding with cross-validation to prevent overfitting
from sklearn.model_selection import KFold

def target_encode_cv(X_train, y_train, X_test, col, n_splits=5):
    """Out-of-fold target encoding to limit data leakage.

    Assumes y_train is a pandas Series sharing its index with X_train.
    """
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    global_mean = y_train.mean()

    X_train_encoded = np.zeros(len(X_train))
    # Cross-validation encoding for train
    for train_idx, val_idx in kf.split(X_train):
        # Per-category target mean computed on the fitting folds only
        fold_means = y_train.iloc[train_idx].groupby(X_train[col].iloc[train_idx]).mean()
        encoded = X_train[col].iloc[val_idx].map(fold_means)
        # Categories absent from the fitting folds fall back to the overall mean
        X_train_encoded[val_idx] = encoded.fillna(global_mean).to_numpy()

    # Category means over the full training set for the test set
    target_mean_global = y_train.groupby(X_train[col]).mean()
    X_test_encoded = X_test[col].map(target_mean_global)

    # Fill unknown categories with overall mean
    X_test_encoded = X_test_encoded.fillna(global_mean)

    return X_train_encoded, X_test_encoded

# Usage
X_train['city_encoded'], X_test['city_encoded'] = target_encode_cv(
    X_train, y_train, X_test, col='city'
)

### 3.6 Frequency Encoding

**What it does:** Replaces each category with its frequency (count) or proportion in the dataset.

**Example:**
```
Color   | Count | → Frequency | → Proportion
--------|-------|-------------|-------------
Red     | 3     | 3           | 0.50
Blue    | 2     | 2           | 0.33
Red     | 3     | 3           | 0.50
Green   | 1     | 1           | 0.17
Red     | 3     | 3           | 0.50
Blue    | 2     | 2           | 0.33
```

**When to use:**
- When the frequency of a category is informative
- High cardinality categorical features
- As a simple alternative to one-hot encoding
- Complement to other encoding methods

**Advantages:**
- Simple and fast
- Handles high cardinality
- No target leakage (the target is never used), as long as frequencies are computed on the training set only
- Single column output

**Disadvantages:**
- Different categories with same frequency get same encoding
- Doesn't capture relationship with target

In [None]:
# Frequency encoding - manual implementation
def frequency_encoding(X_train, X_test, col):
    """Encode categorical feature by its frequency"""
    # Calculate frequencies on training data
    freq_map = X_train[col].value_counts().to_dict()
    
    # Apply to train and test
    X_train_encoded = X_train[col].map(freq_map)
    X_test_encoded = X_test[col].map(freq_map)
    
    # Handle unseen categories in test (assign 0 or min frequency)
    X_test_encoded = X_test_encoded.fillna(0)
    
    return X_train_encoded, X_test_encoded

# Usage
X_train['color_freq'], X_test['color_freq'] = frequency_encoding(
    X_train, X_test, col='color'
)

# Proportion encoding (frequency / total count)
def proportion_encoding(X_train, X_test, col):
    """Encode categorical feature by its proportion"""
    freq_map = X_train[col].value_counts(normalize=True).to_dict()
    
    X_train_encoded = X_train[col].map(freq_map)
    X_test_encoded = X_test[col].map(freq_map).fillna(0)
    
    return X_train_encoded, X_test_encoded

# Usage
X_train['color_prop'], X_test['color_prop'] = proportion_encoding(
    X_train, X_test, col='color'
)

# Using category_encoders library
# pip install category-encoders
from category_encoders import CountEncoder

count_enc = CountEncoder(cols=['color', 'brand'])
count_enc.fit(X_train)
X_train_encoded = count_enc.transform(X_train)
X_test_encoded = count_enc.transform(X_test)

### 3.7 Binary Encoding

**What it does:** Converts categories to binary code (like computer binary). Each category gets a unique binary representation, then each bit becomes a separate feature.

**Example:**
```
Category | Integer | Binary  | → Bit_0 | Bit_1 | Bit_2
---------|---------|---------|---------|-------|-------
Red      | 0       | 000     | 0       | 0     | 0
Blue     | 1       | 001     | 0       | 0     | 1
Green    | 2       | 010     | 0       | 1     | 0
Yellow   | 3       | 011     | 0       | 1     | 1
Orange   | 4       | 100     | 1       | 0     | 0
```

**When to use:**
- High cardinality categorical features (100+ categories)
- When one-hot encoding creates too many columns
- Middle ground between one-hot and label encoding
- Tree-based models

**Advantages:**
- Fewer columns than one-hot: about ⌈log₂(n)⌉ columns instead of n
- Handles high cardinality better than one-hot
- Better than label encoding (doesn't imply order)

**Disadvantages:**
- Less interpretable than one-hot
- Binary bits may not be meaningful to linear models

In [None]:
# Binary encoding using category_encoders
# Install: pip install category-encoders

from category_encoders import BinaryEncoder

# Binary encoding
binary_enc = BinaryEncoder(cols=['city', 'product_id'])
binary_enc.fit(X_train)

X_train_encoded = binary_enc.transform(X_train)
X_test_encoded = binary_enc.transform(X_test)

# Use in pipeline with ColumnTransformer
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression

pre = make_column_transformer(
    (StandardScaler(), numeric_cols),
    (BinaryEncoder(), ['city', 'product_id']),
    (OneHotEncoder(drop='first'), low_cardinality_cols)
)

pipe = Pipeline([
    ("pre", pre),
    ("model", LogisticRegression(max_iter=5000))
])

pipe.fit(X_train, y_train)
pred = pipe.predict(X_test)
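
To show the idea behind the table without the extra dependency, here is a hand-rolled sketch. The function `binary_encode` is a hypothetical helper written for this illustration; it is not the `category_encoders` implementation, whose integer assignment may differ.

```python
import numpy as np
import pandas as pd

def binary_encode(values):
    """Map each category to an integer code, then split the code into bits."""
    # Integer codes in order of first appearance: 0, 1, 2, ...
    codes, uniques = pd.factorize(pd.Series(values))
    # Number of bits needed to represent every code
    n_bits = max(1, int(np.ceil(np.log2(len(uniques)))))
    # Extract the bits, most significant bit first
    shifts = np.arange(n_bits - 1, -1, -1)
    bits = (codes[:, None] >> shifts) & 1
    return pd.DataFrame(bits, columns=[f"bit_{i}" for i in range(n_bits)])

colors = ["Red", "Blue", "Green", "Yellow", "Orange"]
colors_binary = binary_encode(colors)
print(colors_binary)  # 5 categories fit in 3 bit columns
```

One-hot encoding would need five columns for the same data; the gap widens quickly as the number of categories grows.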

### 3.8 Text Preprocessing with TF-IDF

**What is TF-IDF?**
TF-IDF (Term Frequency-Inverse Document Frequency) converts text into numerical features by weighing how important each word is:
- **TF (Term Frequency)**: How often a word appears in a document
- **IDF (Inverse Document Frequency)**: How unique/rare a word is across all documents
- Words common everywhere get lower scores; rare discriminative words get higher scores

**Why use TF-IDF?**
- Transforms text into numbers that machine learning models can use
- Automatically identifies important words
- Reduces impact of very common words
- Works well for text classification tasks

**Parameters:**
- `lowercase=True`: Convert all text to lowercase
- `max_features`: Limit the vocabulary size (e.g., keep the 5000 terms with the highest frequency across the corpus)
- `ngram_range=(1,1)`: Use single words; (1,2) includes word pairs
- `min_df`: Minimum document frequency (ignore rare words)
- `max_df`: Maximum document frequency (ignore very common words)
- `stop_words='english'`: Remove common words like "the", "is", "and"

**When to use:**
- Text classification (spam detection, sentiment analysis)
- Document categorization
- Any NLP task requiring text vectorization

**Common use cases:**
- Email/message classification
- Product review sentiment analysis
- News article categorization
- Search relevance ranking

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Basic TF-IDF for a single text column
tfidf = TfidfVectorizer(
    lowercase=True,
    max_features=5000,        # Keep top 5000 features
    ngram_range=(1, 2),       # Use unigrams and bigrams
    min_df=2,                 # Ignore terms appearing in < 2 documents
    max_df=0.95,              # Ignore terms appearing in > 95% of documents
    stop_words='english'      # Remove common English words
)

# Fit and transform training data
X_train_tfidf = tfidf.fit_transform(X_train['text'].fillna(""))
X_test_tfidf = tfidf.transform(X_test['text'].fillna(""))

# Use in a pipeline with a classifier
from sklearn.linear_model import LogisticRegression

text_clf = Pipeline([
    ('tfidf', TfidfVectorizer(
        lowercase=True,
        max_features=10000,
        ngram_range=(1, 2),
        min_df=2,
        max_df=0.95
    )),
    ('clf', LogisticRegression(max_iter=5000))
])

# Fit and predict (fillna important for missing text)
text_clf.fit(X_train['text'].fillna(""), y_train)
predictions = text_clf.predict(X_test['text'].fillna(""))
probabilities = text_clf.predict_proba(X_test['text'].fillna(""))[:, 1]

# Multiple text columns with ColumnTransformer
text_transformer = ColumnTransformer(
    transformers=[
        ('title', TfidfVectorizer(max_features=3000), 'title'),
        ('body', TfidfVectorizer(max_features=7000), 'body')
    ]
)

multi_text_clf = Pipeline([
    ('text_features', text_transformer),
    ('clf', LogisticRegression(max_iter=5000))
])

# Prepare data (fill NaN for each text column)
X_train_clean = X_train.copy()
X_test_clean = X_test.copy()
for col in ['title', 'body']:
    X_train_clean[col] = X_train_clean[col].fillna("")
    X_test_clean[col] = X_test_clean[col].fillna("")

multi_text_clf.fit(X_train_clean, y_train)
predictions = multi_text_clf.predict(X_test_clean)
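
To illustrate how IDF down-weights words that appear everywhere, the sketch below fits a vectorizer on three made-up documents and compares the learned IDF of a common and a rare word. With scikit-learn's default smoothing, the weight is `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` the number of documents containing the term.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus: "now" is in every document, "prize" in only one
docs = [
    "free prize click now",
    "meeting notes for the team now",
    "lunch with the team now",
]

vec = TfidfVectorizer()
vec.fit(docs)

# Look up each word's column index, then its learned IDF weight
idf_now = vec.idf_[vec.vocabulary_["now"]]      # ln(4/4) + 1 = 1.0
idf_prize = vec.idf_[vec.vocabulary_["prize"]]  # ln(4/2) + 1, about 1.69

print(idf_now, idf_prize)
```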

## 4) Pipelining

**What is a Pipeline?**

A Pipeline is a powerful tool that chains together multiple preprocessing steps and a final model into a single object. This ensures:
- **Reproducibility**: Same preprocessing applied consistently
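- **No data leakage**: When the pipeline is fit on the training set (or inside each cross-validation fold), preprocessing statistics are never learned from test data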
- **Clean code**: All transformations in one place
- **Easy deployment**: One object to save and load

**Key concepts:**
- **ColumnTransformer**: Applies different transformations to different columns
- **Pipeline**: Chains multiple steps sequentially
- Each step transforms the data before passing to the next step

In [None]:
# This function creates a preprocessing pipeline for mixed data types
def make_preprocess(numeric_cols, categorical_cols):
    # Numeric pipeline: handle missing values → scale features
    num_pipe = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="median")),  # Fill missing with median
        ("scaler", StandardScaler())                    # Standardize (mean=0, std=1)
    ])

    # Categorical pipeline: handle missing values → one-hot encode
    cat_pipe = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),  # Fill with mode
        ("onehot", OneHotEncoder(handle_unknown="ignore"))     # Convert to binary columns
    ])

    # Combine both pipelines
    pre = ColumnTransformer(
        transformers=[
            ("num", num_pipe, numeric_cols),   # Apply num_pipe to numeric columns
            ("cat", cat_pipe, categorical_cols),  # Apply cat_pipe to categorical columns
        ],
        remainder="drop"  # Drop any columns not specified
    )
    return pre
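
A possible end-to-end usage is sketched below. To keep the block self-contained it rebuilds the same two sub-pipelines inline instead of calling `make_preprocess`, and it runs on a made-up DataFrame (`toy`, with hypothetical columns `age`, `income`, `device`). Passing the whole pipeline to `cross_val_score` means imputation, scaling, and encoding are refit inside every fold.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with some missing values
rng = np.random.default_rng(0)
n = 200
age = rng.normal(40, 10, n)
toy_y = (age + rng.normal(0, 5, n) > 40).astype(int)  # label driven by age
age_missing = age.copy()
age_missing[::10] = np.nan
device = rng.choice(["mobile", "desktop", "tablet"], n).astype(object)
device[::15] = np.nan
toy = pd.DataFrame({
    "age": age_missing,
    "income": rng.normal(50_000, 15_000, n),
    "device": device,
})

# Same structure as make_preprocess(["age", "income"], ["device"])
num_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
cat_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
pre_demo = ColumnTransformer(
    [("num", num_pipe, ["age", "income"]), ("cat", cat_pipe, ["device"])],
    remainder="drop",
)

model_demo = Pipeline([
    ("pre", pre_demo),
    ("model", LogisticRegression(max_iter=5000)),
])

# Preprocessing is refit on the training part of each fold
cv_scores = cross_val_score(model_demo, toy, toy_y, cv=5)
print(cv_scores.mean())
```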

### 4.1 Single text column → Classifier (Using TF-IDF)

**What is TF-IDF?**
TF-IDF (Term Frequency-Inverse Document Frequency) converts text into numerical features by:
- **TF**: How often a word appears in a document
- **IDF**: How unique/rare a word is across all documents
- Words that are common everywhere get lower scores

**Parameters explained:**
- `lowercase=True`: Convert all text to lowercase
- `stop_words=None`: Keep all words (or use "english" to remove common words like "the", "is")
- `ngram_range=(1,2)`: Use individual words (unigrams) and word pairs (bigrams)
- `min_df=2`: Ignore words that appear in fewer than 2 documents
- `max_df=0.95`: Ignore words that appear in more than 95% of documents
- `max_features=200000`: Keep at most the 200,000 most frequent terms across the corpus

**Use case:** Text classification (spam detection, sentiment analysis, document categorization)

In [None]:
import numpy as np
import pandas as pd

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

TEXT_COL = "text"  # change

clf = Pipeline([
    ("tfidf", TfidfVectorizer(
        lowercase=True,
        stop_words=None,          # or "english" if allowed/desired
        ngram_range=(1, 2),
        min_df=2,
        max_df=0.95,
        max_features=200000
    )),
    ("model", LogisticRegression(max_iter=5000))
])

# If X is a DataFrame:
clf.fit(X_train[TEXT_COL].fillna(""), y_train)

proba = clf.predict_proba(X_test[TEXT_COL].fillna(""))[:, 1]
pred  = (proba >= 0.5).astype(int)


### 4.2 Multiple Text Fields

**When to use:** Your dataset has multiple text columns (e.g., email subject + body, product title + description).

**How it works:**
- Each text field gets its own TF-IDF vectorizer
- Features from all fields are concatenated
- This allows the model to learn from different text sources independently

**Important:** Always fill NaN values with empty strings before processing text, as TF-IDF cannot handle missing values.

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

TEXT_COLS = ["subject", "body"]  # change

pre = ColumnTransformer(
    transformers=[
        ("subj", TfidfVectorizer(ngram_range=(1,2), min_df=2), "subject"),
        ("body", TfidfVectorizer(ngram_range=(1,2), min_df=2), "body"),
    ],
    remainder="drop"
)

model = Pipeline([
    ("pre", pre),
    ("clf", LogisticRegression(max_iter=5000))
])

# Important: fill NaNs per column
Xtr = X_train.copy()
Xte = X_test.copy()
for c in TEXT_COLS:
    Xtr[c] = Xtr[c].fillna("")
    Xte[c] = Xte[c].fillna("")

model.fit(Xtr, y_train)
proba = model.predict_proba(Xte)[:, 1]


### 4.3 Mixed Text, Numeric, Categorical

**Most realistic scenario:** Your dataset contains different types of features:
- **Text features**: Free-form text (reviews, descriptions, comments)
- **Numeric features**: Numbers (age, price, counts)
- **Categorical features**: Categories (country, device type, status)

**Why different processing?**
- Text needs TF-IDF vectorization
- Numbers need imputation and scaling
- Categories need one-hot encoding

**The pipeline handles all three types simultaneously**, ensuring proper preprocessing for each feature type.

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

TEXT_COL = "text"
NUM_COLS = ["len", "num_links"]      # change / can be []
CAT_COLS = ["country", "device"]     # change / can be []

num_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

cat_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

pre = ColumnTransformer([
    ("text", TfidfVectorizer(ngram_range=(1,2), min_df=2), TEXT_COL),
    ("num", num_pipe, NUM_COLS),
    ("cat", cat_pipe, CAT_COLS),
])

pipe = Pipeline([
    ("pre", pre),
    ("model", LogisticRegression(max_iter=5000))
])

Xtr = X_train.copy()
Xte = X_test.copy()
Xtr[TEXT_COL] = Xtr[TEXT_COL].fillna("")
Xte[TEXT_COL] = Xte[TEXT_COL].fillna("")

pipe.fit(Xtr, y_train)
proba = pipe.predict_proba(Xte)[:, 1]


### 4.4 Grid Search Template for Pipeline

**What is Grid Search?**
Grid Search automatically tests multiple combinations of hyperparameters to find the best model configuration.

**How to specify parameters:**
Use double underscore notation: `"step_name__parameter_name"`
- Example: `"tfidf__ngram_range"` refers to the `ngram_range` parameter of the `tfidf` step

**How it works:**
1. Tests all combinations of parameters
2. Uses cross-validation to evaluate each combination
3. Returns the best configuration

**Tip:** Use `n_jobs=-1` to utilize all CPU cores for faster computation.

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("model", LogisticRegression(max_iter=5000))
])

param_grid = {
    "tfidf__ngram_range": [(1,1), (1,2)],
    "tfidf__min_df": [1, 2, 5],
    "model__C": [0.1, 1.0, 10.0],
}

grid = GridSearchCV(pipe, param_grid=param_grid, scoring="f1", cv=5, n_jobs=-1)
grid.fit(X_train[TEXT_COL].fillna(""), y_train)

best = grid.best_estimator_
best_params = grid.best_params_
best_params
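
Before launching a search it can help to count how many models will be trained. The sketch below uses `ParameterGrid` on a copy of the grid above (named `param_grid_demo` here to keep the block self-contained) and multiplies by the number of folds.

```python
from sklearn.model_selection import ParameterGrid

param_grid_demo = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__min_df": [1, 2, 5],
    "model__C": [0.1, 1.0, 10.0],
}

# Every combination of the listed values: 2 * 3 * 3
n_candidates = len(ParameterGrid(param_grid_demo))

# Each candidate is fitted once per fold (cv=5); the best one is then
# refitted on the full training data when refit=True (the default)
n_cv_fits = n_candidates * 5

print(n_candidates, n_cv_fits)
```

The count grows multiplicatively with every added parameter, which is worth keeping in mind when the grid gets large.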


## 4) Classification Models

**What is classification?**
Predicting a categorical label (class) from input features. Examples:
- Spam vs. not spam
- Disease present vs. absent
- Customer will churn vs. won't churn

**Key concepts:**
- **proba**: Predicted probability (value between 0 and 1)
- **pred**: Predicted class (0 or 1) after applying a threshold
- Default threshold is usually 0.5

### 4.1 Logistic Regression

**What is it?**
Despite its name, Logistic Regression is a **classification** algorithm that predicts probabilities using a linear decision boundary.

**Characteristics:**
- ✅ Simple, fast, interpretable
- ✅ Works well with high-dimensional data (many features)
- ✅ Provides probability estimates
- ❌ Assumes linear relationship between features and log-odds

**Parameters:**
- `max_iter=5000`: Maximum iterations for optimization (increase if convergence warning appears)

**When to use:** Baseline model, text classification, when interpretability is important

In [None]:
from sklearn.linear_model import LogisticRegression

clf = Pipeline(steps=[
    ("pre", pre),  # from make_preprocess(...)
    ("model", LogisticRegression(max_iter=5000))
])
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_test)[:, 1]
pred  = (proba >= 0.5).astype(int)
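
The relationship between `proba`, the threshold, and `pred` can be seen on a handful of made-up probabilities (`proba_demo` is hypothetical):

```python
import numpy as np

proba_demo = np.array([0.10, 0.35, 0.50, 0.65, 0.90])

# Default threshold of 0.5
pred_default = (proba_demo >= 0.5).astype(int)

# A stricter threshold yields fewer positives (favors precision)
pred_strict = (proba_demo >= 0.7).astype(int)

# A more lenient threshold yields more positives (favors recall)
pred_lenient = (proba_demo >= 0.3).astype(int)

print(pred_default, pred_strict, pred_lenient)
```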


### 4.2 Linear SVM

**What is SVM?**
Support Vector Machine finds the optimal hyperplane that maximally separates classes.

**Characteristics:**
- ✅ Effective in high-dimensional spaces
- ✅ Memory efficient
- ✅ Robust to outliers
- ❌ Doesn't provide probability estimates by default (only decision scores)

**Important notes:**
- `decision_function()` returns raw scores (not probabilities)
- Default threshold is 0 (not 0.5 like in probabilities)
- `C`: Regularization parameter (smaller = more regularization)

**When to use:** High-dimensional data, text classification, when you need a strong linear classifier

In [None]:
from sklearn.svm import LinearSVC

svm = Pipeline(steps=[
    ("pre", pre),
    ("model", LinearSVC(C=1.0, max_iter=20000))
])
svm.fit(X_train, y_train)

scores = svm.decision_function(X_test)  # not probabilities
pred = (scores >= 0.0).astype(int)      # threshold 0 by default
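
As a sanity check on the threshold-0 rule, this sketch (on synthetic data from `make_classification`, an assumption for the example) compares a manual threshold on `decision_function` with the model's own `predict` for a binary problem with labels 0 and 1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

svm_demo = LinearSVC(C=1.0, max_iter=20000).fit(X_demo, y_demo)

# Signed distance-like scores: positive favors class 1, negative class 0
scores_demo = svm_demo.decision_function(X_demo)
manual_pred = (scores_demo > 0).astype(int)

# For binary 0/1 labels this matches predict()
matches_predict = np.array_equal(manual_pred, svm_demo.predict(X_demo))
print(matches_predict)
```

Moving the cutoff away from 0 trades precision against recall in the same way as moving a probability threshold away from 0.5.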


### 4.3 RBF SVM

**What is RBF kernel?**
RBF (Radial Basis Function) kernel allows SVM to learn **non-linear** decision boundaries by mapping data to higher-dimensional space.

**Characteristics:**
- ✅ Can capture complex, non-linear patterns
- ✅ Works well with small to medium datasets
- ❌ Slower to train than linear models
- ❌ More prone to overfitting
- ❌ Requires careful hyperparameter tuning

**Parameters:**
- `kernel="rbf"`: Use radial basis function kernel
- `gamma="scale"`: Kernel width; larger values make each training example's influence more local. `"scale"` sets gamma to `1 / (n_features * X.var())`
- `probability=True`: Enable probability estimates (slower but useful)

**When to use:** Non-linear relationships, small/medium datasets, when accuracy is more important than speed

In [None]:
from sklearn.svm import SVC

svm_rbf = Pipeline(steps=[
    ("pre", pre),
    ("model", SVC(C=1.0, kernel="rbf", gamma="scale", probability=True))
])
svm_rbf.fit(X_train, y_train)
proba = svm_rbf.predict_proba(X_test)[:, 1]
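
To make the role of `gamma` concrete, here is a small NumPy sketch of the RBF kernel itself. The helper name `rbf_kernel_matrix` and the random `X_demo` data are assumptions for illustration; `SVC` computes the kernel internally.

```python
import numpy as np

def rbf_kernel_matrix(A, B, gamma):
    """K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5, 3))

# scikit-learn documents gamma="scale" as 1 / (n_features * X.var()).
gamma_scale = 1.0 / (X_demo.shape[1] * X_demo.var())

K = rbf_kernel_matrix(X_demo, X_demo, gamma_scale)
print(K.round(3))
```

Each entry is a similarity between 0 and 1; a larger `gamma` makes similarity fall off faster with distance, which allows a more wiggly boundary and raises the risk of overfitting.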


### 4.4 Random Forest

**What is Random Forest?**
An ensemble of decision trees that:
1. Creates multiple decision trees on random subsets of data
2. Each tree makes a prediction
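3. Final prediction is by majority vote (classification) or average (regression); scikit-learn's classifier averages the trees' predicted class probabilities rather than counting hard votes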

**Characteristics:**
- ✅ Handles non-linear relationships naturally
- ✅ Robust to outliers and noise
- ✅ Can handle mixed feature types
- ✅ Provides feature importance
- ✅ Less prone to overfitting than single decision tree
- ❌ Slower to train and predict than linear models
- ❌ Less interpretable than single trees

**Parameters:**
- `n_estimators=300`: Number of trees (more trees give more stable predictions with diminishing returns, at higher training and prediction cost)
- `random_state=0`: For reproducibility

**When to use:** General-purpose strong classifier, when you need feature importance, when you have mixed data types

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf = Pipeline(steps=[
    ("pre", pre),
    ("model", RandomForestClassifier(n_estimators=300, random_state=0))
])
rf.fit(X_train, y_train)
pred = rf.predict(X_test)
proba = rf.predict_proba(X_test)[:, 1]


### 4.9 Decision Tree

**What is Decision Tree?**
A tree structure where internal nodes represent feature tests, branches represent outcomes, and leaf nodes represent class labels.

**Characteristics:**
- ✅ Highly interpretable (can visualize the tree)
- ✅ No feature scaling required
- ✅ Handles both numerical and categorical features (in scikit-learn, categorical features must still be encoded as numbers first)
- ✅ Automatically performs feature selection
- ✅ Can capture non-linear relationships
- ❌ Prone to overfitting (especially deep trees)
- ❌ Unstable (small data changes can drastically change tree)
- ❌ Biased toward features with many levels

**Parameters:**
- `max_depth=5`: Maximum tree depth (controls overfitting)
- `min_samples_split=20`: Minimum samples to split a node
- `min_samples_leaf=10`: Minimum samples in a leaf node
- `criterion='gini'`: Splitting criterion ('gini' or 'entropy')

**When to use:** When interpretability is crucial, as a baseline, or within ensemble methods

In [None]:
from sklearn.tree import DecisionTreeClassifier

dt = Pipeline(steps=[
    ("pre", pre),
    ("model", DecisionTreeClassifier(max_depth=5, min_samples_split=20, 
                                     min_samples_leaf=10, random_state=0))
])
dt.fit(X_train, y_train)
proba = dt.predict_proba(X_test)[:, 1]
pred = dt.predict(X_test)

# Optional: Visualize the tree
# from sklearn.tree import plot_tree
# import matplotlib.pyplot as plt
# plt.figure(figsize=(20,10))
# plot_tree(dt.named_steps["model"], filled=True, feature_names=feature_names)
# plt.show()
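
The `criterion='gini'` choice can be illustrated by computing Gini impurity by hand. This is a minimal sketch on toy arrays (`x_demo`, `y_demo` and both helper functions are illustrative names), not the tree's actual split search.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2 over the class proportions p_k."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / labels.size
    return float(1.0 - np.sum(p**2))

def split_impurity(feature, labels, threshold):
    """Weighted Gini impurity after splitting on feature <= threshold."""
    feature = np.asarray(feature)
    labels = np.asarray(labels)
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    n = labels.size
    return (left.size / n) * gini(left) + (right.size / n) * gini(right)

x_demo = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y_demo = np.array([0, 0, 0, 1, 1, 1])

print(gini(y_demo))                          # impurity before splitting
print(split_impurity(x_demo, y_demo, 3.0))   # a split that separates the classes
print(split_impurity(x_demo, y_demo, 1.0))   # a poorer split
```

A tree grows by picking, at each node, the feature and threshold with the lowest weighted impurity among the candidates it considers.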

### 4.8 K-Nearest Neighbors (KNN)

**What is KNN?**
A non-parametric method that classifies based on the majority class of k nearest neighbors.

**Characteristics:**
- ✅ Simple to understand
- ✅ No training phase (lazy learning)
- ✅ Naturally handles multi-class problems
- ✅ Can capture complex decision boundaries
- ❌ Very slow prediction (searches all training data)
- ❌ Sensitive to feature scaling (must scale features!)
- ❌ Poor with high-dimensional data ("curse of dimensionality")
- ❌ Requires large memory (stores all training data)

**Parameters:**
- `n_neighbors=5`: Number of neighbors to consider (an odd k avoids tied votes in binary classification)
- `weights='uniform'`: All neighbors weighted equally (or 'distance' for distance-weighted)
- `metric='minkowski'`: Distance metric

**When to use:** Small datasets, as a baseline, when decision boundary is complex

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn = Pipeline(steps=[
    ("pre", pre),  # IMPORTANT: KNN requires scaled features!
    ("model", KNeighborsClassifier(n_neighbors=5, weights='uniform'))
])
knn.fit(X_train, y_train)
proba = knn.predict_proba(X_test)[:, 1]
pred = knn.predict(X_test)
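
The majority-vote idea behind KNN fits in a few lines of NumPy. The sketch below assumes Euclidean distance and uniform weights; `knn_predict` and the toy arrays are illustrative and much slower than `KNeighborsClassifier`.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=5):
    """Classify each query row by majority vote among its k nearest training rows."""
    X_train = np.asarray(X_train, float)
    X_query = np.asarray(X_query, float)
    y_train = np.asarray(y_train)
    preds = []
    for q in X_query:
        dists = np.linalg.norm(X_train - q, axis=1)   # distance to every training row
        nearest = np.argsort(dists)[:k]               # indices of the k closest rows
        classes, votes = np.unique(y_train[nearest], return_counts=True)
        preds.append(classes[np.argmax(votes)])       # ties go to the smallest class label
    return np.array(preds)

# Toy data: two well-separated clusters (values chosen only for illustration).
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_pred = knn_predict(X_toy, y_toy, [[0.05, 0.05], [5.0, 5.1]], k=3)
print(toy_pred)
```

Because every prediction measures distance to all training rows, unscaled features with large ranges dominate the vote, which is why scaling matters here.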

### 4.7 Naive Bayes

**What is Naive Bayes?**
A probabilistic classifier based on Bayes' theorem with "naive" assumption of feature independence.

**Characteristics:**
- ✅ Very fast training and prediction
- ✅ Works well with high-dimensional data
- ✅ Good for text classification
- ✅ Requires little training data
- ❌ Assumes feature independence (rarely true in practice)
- ❌ Can be outperformed by more sophisticated models

**Variants:**
- **GaussianNB**: For continuous features (assumes Gaussian distribution)
- **MultinomialNB**: For discrete counts (word counts, frequencies)
- **BernoulliNB**: For binary features

**When to use:** Text classification, as a fast baseline, when training data is limited

In [None]:
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer  # needed for nb_multi below

# For continuous features
nb_gauss = Pipeline(steps=[
    ("pre", pre),
    ("model", GaussianNB())
])

# For text/count data (use with TfidfVectorizer or CountVectorizer)
nb_multi = Pipeline(steps=[
    ("tfidf", TfidfVectorizer(ngram_range=(1,2), min_df=2)),
    ("model", MultinomialNB(alpha=1.0))  # alpha: Laplace smoothing parameter
])

nb_gauss.fit(X_train, y_train)
proba = nb_gauss.predict_proba(X_test)[:, 1]

### 4.6 XGBoost Classifier

**What is XGBoost?**
An optimized implementation of gradient boosting with advanced features and better performance.

**Characteristics:**
- ✅ State-of-the-art performance on many datasets
- ✅ Very fast training (optimized C++ backend)
- ✅ Built-in regularization prevents overfitting
- ✅ Handles missing values automatically
- ✅ Parallel processing support
- ❌ Requires separate installation (`pip install xgboost`)
- ❌ More hyperparameters to tune

**Parameters:**
- `n_estimators=100`: Number of boosting rounds
- `max_depth=6`: Maximum tree depth
- `learning_rate=0.3`: Step size shrinkage
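- `use_label_encoder=False`: Only relevant to older XGBoost releases, where it suppresses a label-encoder warning; newer releases have deprecated the argument, so drop it if it triggers a warning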
- `eval_metric='logloss'`: Evaluation metric

**When to use:** Kaggle competitions, production systems, when you need top performance

In [None]:
# Install: pip install xgboost
from xgboost import XGBClassifier

xgb = Pipeline(steps=[
    ("pre", pre),
    ("model", XGBClassifier(n_estimators=100, max_depth=6, learning_rate=0.3, 
                           use_label_encoder=False, eval_metric='logloss', random_state=0))
])
xgb.fit(X_train, y_train)
proba = xgb.predict_proba(X_test)[:, 1]
pred = xgb.predict(X_test)

### 4.5 Gradient Boosting Classifier

**What is Gradient Boosting?**
Builds trees sequentially, where each new tree tries to correct the errors of previous trees.

**Characteristics:**
- ✅ Often the most accurate tree-based method on tabular data when well tuned
- ✅ Handles non-linear relationships
- ✅ Provides feature importance
- ✅ Less prone to overfitting than single trees
- ❌ Slower to train (sequential process)
- ❌ Requires careful hyperparameter tuning
- ❌ Can overfit if not properly regularized

**Parameters:**
- `n_estimators=100`: Number of boosting stages
- `learning_rate=0.1`: Shrinks contribution of each tree (lower = more conservative)
- `max_depth=3`: Maximum depth of trees (lower = simpler models)

**When to use:** Competitions, when accuracy is crucial, tabular data with complex patterns

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

gb = Pipeline(steps=[
    ("pre", pre),
    ("model", GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0))
])
gb.fit(X_train, y_train)
proba = gb.predict_proba(X_test)[:, 1]
pred = gb.predict(X_test)
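
The "each new tree corrects the previous errors" idea can be sketched by hand. To keep it short, this sketch assumes a regression target with squared-error loss, where the quantity to fit is simply the residual; the classifier above applies the same loop to gradients of the log-loss. All `_demo` names and the synthetic data are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_stages = 100

# Stage 0: start from a constant prediction (the mean of the target).
pred_demo = np.full_like(y_demo, y_demo.mean())
mse_start = np.mean((y_demo - pred_demo) ** 2)

trees = []
for _ in range(n_stages):
    residual = y_demo - pred_demo                      # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X_demo, residual)                         # the next tree targets those errors
    pred_demo += learning_rate * tree.predict(X_demo)  # shrink its contribution
    trees.append(tree)

mse_end = np.mean((y_demo - pred_demo) ** 2)
print(f"training MSE: {mse_start:.4f} -> {mse_end:.4f}")
```

A lower `learning_rate` makes each stage more cautious, so it usually needs a larger `n_estimators` to reach the same fit.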

## 5) Classification Evaluation Metrics

**Why evaluate?**
Accuracy alone is often misleading, especially with imbalanced classes. You need multiple metrics to understand model performance from different angles.

**Key metrics overview:**
- **Accuracy**: Overall correctness (use with caution on imbalanced data)
- **Precision**: Of predicted positives, how many are truly positive? (Focus on false positives)
- **Recall**: Of actual positives, how many did we catch? (Focus on false negatives)
- **F1-Score**: Harmonic mean of precision and recall (balanced metric)
- **Confusion Matrix**: Shows all combinations of predicted vs actual
- **AUC**: Overall ability to distinguish between classes across all thresholds

### 5.1 Classification Reports

**Confusion Matrix structure:**
```
              Predicted
             Neg    Pos
Actual Neg | TN  |  FP |
Actual Pos | FN  |  TP |
```

**Metric formulas:**
- **Accuracy** = (TP + TN) / (TP + TN + FP + FN)
- **Precision** = TP / (TP + FP) - "When I predict positive, how often am I right?"
- **Recall** = TP / (TP + FN) - "Of all actual positives, how many did I find?"
- **F1** = 2 × (Precision × Recall) / (Precision + Recall)

**`zero_division=0`**: Handles edge cases where division by zero occurs (e.g., no predicted positives)

**Classification Report**: Shows precision, recall, F1, and support for each class, plus overall accuracy and macro and weighted averages

In [None]:
pred = clf.predict(X_test)

print("Accuracy:", accuracy_score(y_test, pred))
print("Precision:", precision_score(y_test, pred, zero_division=0))
print("Recall:", recall_score(y_test, pred, zero_division=0))
print("F1:", f1_score(y_test, pred, zero_division=0))
print("Confusion matrix:\n", confusion_matrix(y_test, pred))
print(classification_report(y_test, pred, zero_division=0))
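
The formulas above can be checked by hand on a tiny example. The arrays `y_true_demo` and `y_pred_demo` are made up for illustration, and `safe_div` is a hypothetical helper that mirrors what `zero_division=0` does.

```python
import numpy as np

y_true_demo = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_demo = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

tp = int(np.sum((y_true_demo == 1) & (y_pred_demo == 1)))
tn = int(np.sum((y_true_demo == 0) & (y_pred_demo == 0)))
fp = int(np.sum((y_true_demo == 0) & (y_pred_demo == 1)))
fn = int(np.sum((y_true_demo == 1) & (y_pred_demo == 0)))

def safe_div(num, den):
    """Return num / den, or 0.0 when den is 0 (mirrors zero_division=0)."""
    return num / den if den else 0.0

accuracy = safe_div(tp + tn, tp + tn + fp + fn)
precision = safe_div(tp, tp + fp)
recall = safe_div(tp, tp + fn)
f1 = safe_div(2 * precision * recall, precision + recall)

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```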


### 5.2 AUC (Area Under Curve)

**What is AUC-ROC?**
Area Under the Receiver Operating Characteristic curve measures the model's ability to distinguish between classes across all possible thresholds.

**AUC interpretation:**
- **1.0**: Perfect classifier
- **0.9-1.0**: Excellent
- **0.8-0.9**: Good
- **0.7-0.8**: Fair
- **0.5-0.7**: Poor
- **0.5**: Random guessing (no better than coin flip)

**Why AUC?**
- Threshold-independent metric
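- Not driven by the class ratio the way accuracy is (on heavily imbalanced data it can still look optimistic, so also check precision and recall)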
- Single number summary of model performance

**Note:** Requires probability scores or decision function values, not just binary predictions

In [None]:
# proba or score for class 1
proba = clf.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, proba)
print("AUC:", auc)

In [None]:
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# Get predictions
proba = clf.predict_proba(X_test)[:, 1]

# Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, proba)
auc = roc_auc_score(y_test, proba)

# Plot
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'Model (AUC = {auc:.3f})', linewidth=2)
plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier (AUC = 0.5)')
plt.xlabel('False Positive Rate', fontsize=12)
plt.ylabel('True Positive Rate', fontsize=12)
plt.title('ROC Curve', fontsize=14)
plt.legend()
plt.grid(alpha=0.3)
plt.show()

print(f"AUC: {auc:.4f}")
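
One way to read the AUC value: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by comparing all positive/negative pairs; `auc_by_pairs` and the toy arrays are illustrative, and this brute-force form is only practical for small samples.

```python
import numpy as np

def auc_by_pairs(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]          # every positive compared with every negative
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return float(wins / (pos.size * neg.size))

y_demo = np.array([0, 0, 1, 1])
s_demo = np.array([0.1, 0.4, 0.35, 0.8])
auc_demo = auc_by_pairs(y_demo, s_demo)
print(auc_demo)
```

Only the ordering of the scores matters here, which is why AUC does not depend on any particular threshold.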

### 5.3 ROC Curve Visualization

**What is ROC Curve?**
ROC (Receiver Operating Characteristic) curve plots True Positive Rate vs False Positive Rate at various threshold settings.

**How to interpret:**
- **X-axis (FPR)**: False Positive Rate = FP / (FP + TN) - "False alarm rate"
- **Y-axis (TPR)**: True Positive Rate = TP / (TP + FN) - "Recall"
- Diagonal line = random classifier
- Closer to top-left corner = better performance
- Area under curve (AUC) summarizes overall performance

**Why plot ROC?**
- Visualize trade-off between TPR and FPR
- Compare multiple models on same plot
- Choose optimal threshold based on business requirements
- Less sensitive to class imbalance than accuracy (for rare positives, also inspect a precision-recall curve)

## 6) Cost-Sensitive Threshold Optimization

**Why adjust thresholds?**
Default threshold (0.5) assumes equal cost for false positives (FP) and false negatives (FN). In real-world scenarios, these costs are often different.

**Examples:**
- **Medical diagnosis**: Missing a disease (FN) is much worse than a false alarm (FP)
- **Spam filtering**: False positive (blocking real email) is worse than false negative (letting spam through)
- **Fraud detection**: Missing fraud (FN) costs money, but investigating false alarms (FP) costs time

**How it works:**
1. Test many different thresholds
2. Calculate cost for each threshold: `Cost = FP × c_fp + FN × c_fn`
3. Choose threshold that minimizes total cost

**Parameters:**
- `c_fp`: Cost of one false positive
- `c_fn`: Cost of one false negative

**Practical use:** Adjust decision threshold based on business requirements and asymmetric costs

In [None]:
def sweep_thresholds(y_true, scores, thresholds, c_fp=1.0, c_fn=1.0):
    y_true = np.asarray(y_true, int)
    scores = np.asarray(scores, float)
    best = None
    for t in thresholds:
        y_pred = (scores >= t).astype(int)
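        # labels=[0, 1] keeps the matrix 2x2 even if only one class is present at this threshold
        tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()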
        cost = fp*c_fp + fn*c_fn
        row = {"threshold": float(t), "cost": float(cost), "tp": tp, "tn": tn, "fp": fp, "fn": fn}
        if best is None or row["cost"] < best["cost"]:
            best = row
    return best

thresholds = np.linspace(np.min(scores), np.max(scores), 200)
best = sweep_thresholds(y_test, scores, thresholds, c_fp=100, c_fn=500)
best
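
To see how asymmetric costs move the chosen threshold, here is a self-contained, vectorized variant of the sweep on toy data. The function `cost_curve`, the arrays `y_demo` and `p_demo`, and the cost pairs are assumptions for illustration only.

```python
import numpy as np

def cost_curve(y_true, scores, thresholds, c_fp=1.0, c_fn=1.0):
    """Total cost FP*c_fp + FN*c_fn at each threshold (positive when score >= t)."""
    y_true = np.asarray(y_true, int)
    scores = np.asarray(scores, float)
    thresholds = np.asarray(thresholds, float)
    pred_pos = scores[None, :] >= thresholds[:, None]      # (n_thresholds, n_samples)
    fp = np.sum(pred_pos & (y_true == 0)[None, :], axis=1)
    fn = np.sum(~pred_pos & (y_true == 1)[None, :], axis=1)
    return fp * c_fp + fn * c_fn

y_demo = np.array([0, 0, 1, 0, 1, 0, 1, 1])
p_demo = np.array([0.07, 0.22, 0.31, 0.46, 0.52, 0.61, 0.73, 0.88])
grid = np.linspace(0.0, 1.0, 21)

best_threshold = {}
for c_fp, c_fn in [(1, 1), (1, 10), (10, 1)]:
    costs = cost_curve(y_demo, p_demo, grid, c_fp=c_fp, c_fn=c_fn)
    # argmin returns the first (lowest) threshold that reaches the minimum cost
    best_threshold[(c_fp, c_fn)] = float(grid[np.argmin(costs)])
    print(f"c_fp={c_fp}, c_fn={c_fn}: threshold={best_threshold[(c_fp, c_fn)]:.2f}, cost={costs.min()}")
```

When false negatives are expensive the best threshold drops (more positives are flagged); when false positives are expensive it rises.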


## 7) Calibration Templates

**What is probability calibration?**
Some models output scores that don't represent true probabilities. Calibration adjusts these scores so that:
- When model says "70% probability", about 70% of those predictions are actually positive
- Predicted probabilities match observed frequencies
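
A quick way to check the "70% means 70%" property is to bin the predicted probabilities and compare each bin's average prediction with the observed positive rate. The sketch below uses simulated scores that are calibrated by construction; `reliability_table` is an illustrative helper (scikit-learn offers `sklearn.calibration.calibration_curve` for the same purpose).

```python
import numpy as np

def reliability_table(y_true, proba, n_bins=5):
    """Return (mean predicted probability, observed positive rate, count) per non-empty bin."""
    y_true = np.asarray(y_true, float)
    proba = np.asarray(proba, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Using only the interior edges maps every probability to a bin id in 0..n_bins-1.
    bin_ids = np.digitize(proba, edges[1:-1])
    rows = []
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            rows.append((proba[mask].mean(), y_true[mask].mean(), int(mask.sum())))
    return rows

# Simulated scores: each label is drawn with exactly the probability the "model" reports.
rng = np.random.default_rng(0)
p_demo = rng.uniform(0, 1, size=20000)
y_demo = (rng.uniform(0, 1, size=20000) < p_demo).astype(int)

table = reliability_table(y_demo, p_demo)
for mean_pred, frac_pos, n in table:
    print(f"predicted {mean_pred:.2f}  observed {frac_pos:.2f}  n={n}")
```

For a miscalibrated model the two columns drift apart, for example predicted 0.90 against observed 0.70 for an overconfident one.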

**Why calibrate?**
- For reliable risk assessment and decision-making
- When you need to trust the probability values (not just rankings)
- Required for threshold-based decisions where probability matters

**Models that need calibration:**
- SVM (outputs decision scores, not probabilities)
- Naive Bayes (often overconfident)
- Neural networks (can be miscalibrated)

**Well-calibrated models:**
- Logistic Regression (generally well-calibrated)
- Random Forest (often reasonable, though averaging across trees tends to pull probabilities away from 0 and 1)

### 7.1 CalibratedClassifierCV

**What it does:**
Wraps any classifier and calibrates its output using cross-validation.

**Calibration methods:**
1. **Isotonic regression** (`method="isotonic"`):
   - Non-parametric, more flexible
   - Can overfit on small datasets
   - Requires more calibration data
   - Fits any monotonic (non-decreasing) mapping, not only a sigmoid shape

2. **Platt scaling** (`method="sigmoid"`):
   - Parametric (fits a sigmoid function)
   - Works with less calibration data
   - More stable on small datasets
   - Best when the miscalibration is roughly sigmoid-shaped

**Parameters:**
- `cv=3`: Uses 3-fold cross-validation for calibration

**When to use:** When working with SVMs or other models that don't output well-calibrated probabilities

In [None]:
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

base = LinearSVC(C=1.0, max_iter=20000)
cal = CalibratedClassifierCV(base, method="isotonic", cv=3)  # or method="sigmoid"
pipe = Pipeline([("pre", pre), ("model", cal)])
pipe.fit(X_train, y_train)

proba = pipe.predict_proba(X_test)[:, 1]


### 7.2 Separate Calibration Model on Calibration Split

**Advanced calibration approach:**
1. Train base model on training set
2. Get predictions on separate calibration set
3. Train a calibrator (e.g., Decision Tree) to map raw probabilities to calibrated probabilities
4. Apply both models in sequence at test time

**Why use this?**
- More control over calibration process
- Can use different calibration models (trees, isotonic regression, etc.)
- Useful when you need custom calibration approaches

**Important:** 
- Always clip final probabilities to [0, 1] range
- Requires a three-way split (train, calibration, test)

**Trade-off:** Leaves less data for training the base model, but can yield better-calibrated probabilities

In [None]:
from sklearn.tree import DecisionTreeRegressor

# base probability model
base = Pipeline([("pre", pre), ("model", LogisticRegression(max_iter=5000))])
base.fit(X_tr, y_tr)

p_cal = base.predict_proba(X_cal)[:, 1]
calibrator = DecisionTreeRegressor(max_depth=3, random_state=0)
calibrator.fit(p_cal.reshape(-1,1), y_cal)

p_test_raw = base.predict_proba(X_te)[:, 1]
p_test = np.clip(calibrator.predict(p_test_raw.reshape(-1,1)), 0, 1)
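
The cell above expects a three-way split (`X_tr`, `X_cal`, `X_te`). One possible way to produce it is two calls to `train_test_split`; the 60/20/20 proportions and the synthetic `X_demo`, `y_demo` data are assumptions, not something the notebook prescribes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 4))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

# First carve off the test set (20%), then split the remainder 75/25,
# giving 60% train, 20% calibration, 20% test overall.
X_rest, X_te, y_rest, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, stratify=y_demo, random_state=0
)
X_tr, X_cal, y_tr, y_cal = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0
)

print(len(X_tr), len(X_cal), len(X_te))
```

Keeping the calibration rows out of the base model's training data is the point: the calibrator should see predictions on examples the base model has not memorized.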


## 8) Regression Templates

**What is regression?**
Predicting a **continuous numerical value** (not a category). Examples:
- House prices
- Temperature
- Stock prices
- Customer lifetime value

**Key difference from classification:**
- Output is a number (not a class label)
- Different evaluation metrics (MAE, MSE, RMSE, R²)
- No probability predictions or thresholds

### 8.1 Linear Regression Baseline

**What is Linear Regression?**
Fits a linear equation to model the relationship between features and target:
`y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ`

**Characteristics:**
- ✅ Simple, fast, interpretable
- ✅ Works well when relationships are linear
- ✅ No hyperparameters to tune
- ❌ Assumes linear relationships
- ❌ Sensitive to outliers
- ❌ Can't capture complex non-linear patterns

**When to use:** 
- As a baseline model (always start here!)
- When relationships appear linear
- When interpretability is crucial (coefficients show feature impact)

In [4]:
from sklearn.linear_model import LinearRegression

reg = Pipeline([("pre", pre), ("model", LinearRegression())])
reg.fit(X_train, y_train)
pred = reg.predict(X_test)



### 8.2 Ridge/Lasso

**What are Ridge and Lasso?**
Regularized linear regression models that prevent overfitting by penalizing large coefficients.

**Ridge Regression (L2 regularization):**
- Shrinks all coefficients toward zero (but never exactly zero)
- Good when many features are useful
- `alpha`: Controls regularization strength (higher = more regularization)

**Lasso Regression (L1 regularization):**
- Can shrink coefficients exactly to zero (feature selection)
- Good for sparse models (many irrelevant features)
- Creates simpler, more interpretable models
- `alpha`: Regularization strength (start with 0.01-1.0)

**When to use:**
- **Ridge**: High multicollinearity, many correlated features, want to keep all features
- **Lasso**: Feature selection, suspect many features are irrelevant, want sparse model

In [None]:
from sklearn.linear_model import Ridge, Lasso

ridge = Pipeline([("pre", pre), ("model", Ridge(alpha=1.0))])
lasso = Pipeline([("pre", pre), ("model", Lasso(alpha=0.01, max_iter=20000))])
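
The difference between the two penalties shows up clearly in the fitted coefficients. This sketch uses synthetic data in which only two of ten features matter; the data, the `_demo` names, and the `alpha` values are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
n_samples, n_features = 200, 10
X_demo = rng.normal(size=(n_samples, n_features))
# Only the first two features carry signal; the other eight are pure noise.
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=n_samples)

ridge_demo = Ridge(alpha=1.0).fit(X_demo, y_demo)
lasso_demo = Lasso(alpha=0.1, max_iter=20000).fit(X_demo, y_demo)

n_zero_ridge = int(np.sum(ridge_demo.coef_ == 0.0))
n_zero_lasso = int(np.sum(lasso_demo.coef_ == 0.0))
print("ridge coefficients:", ridge_demo.coef_.round(3))
print("lasso coefficients:", lasso_demo.coef_.round(3))
print("exact zeros -> ridge:", n_zero_ridge, " lasso:", n_zero_lasso)
```

Ridge keeps small non-zero weights on the noise features, while Lasso sets noise features exactly to zero and slightly shrinks the two real ones.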

### 8.2.1 ElasticNet

**What is ElasticNet?**
Combines L1 (Lasso) and L2 (Ridge) regularization, getting benefits of both.

**Characteristics:**
- ✅ Feature selection like Lasso
- ✅ Stability of Ridge with correlated features
- ✅ Better than Lasso when features are highly correlated
- ❌ Two hyperparameters to tune (alpha and l1_ratio)

**Parameters:**
- `alpha`: Overall regularization strength
- `l1_ratio`: Balance between L1 and L2 (0 = Ridge, 1 = Lasso, 0.5 = equal mix)

**When to use:** Many correlated features, want both feature selection and stability

In [None]:
from sklearn.linear_model import ElasticNet

elastic = Pipeline([("pre", pre), ("model", ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=20000))])
elastic.fit(X_train, y_train)
pred = elastic.predict(X_test)

### 8.3 Random Forest Regressor

**What is it?**
Ensemble of decision trees for regression (similar to Random Forest Classifier but predicts numbers).

**Characteristics:**
- ✅ Handles non-linear relationships naturally
- ✅ Robust to outliers
- ✅ Can capture complex patterns
- ✅ Provides feature importance
- ✅ Works with mixed data types
- ❌ Slower than linear models
- ❌ Less interpretable
- ❌ Can overfit on noisy data

**Parameters:**
- `n_estimators=500`: Number of trees
- More trees reduce the variance of the averaged prediction, with diminishing returns and higher cost

**When to use:** 
- Non-linear relationships
- Complex patterns
- When accuracy matters more than speed
- When you need feature importance

In [None]:
from sklearn.ensemble import RandomForestRegressor

rf_reg = Pipeline([("pre", pre), ("model", RandomForestRegressor(n_estimators=500, random_state=0))])
rf_reg.fit(X_train, y_train)
pred = rf_reg.predict(X_test)


### 8.6 Support Vector Regressor (SVR)

**What is SVR?**
SVM adapted for regression - tries to fit data within a margin (epsilon-tube).

**Characteristics:**
- ✅ Effective in high-dimensional spaces
- ✅ Less sensitive to outliers than squared-error loss (errors inside the epsilon-tube are ignored; larger ones are penalized linearly)
- ✅ Can use different kernels for non-linear relationships
- ❌ Slow on large datasets
- ❌ Memory intensive
- ❌ Requires feature scaling

**Parameters:**
- `kernel='rbf'`: Kernel type ('linear', 'rbf', 'poly')
- `C=1.0`: Regularization parameter
- `epsilon=0.1`: Width of epsilon-tube (points inside don't contribute to loss)

**When to use:** High-dimensional data, small to medium datasets, when outlier robustness is needed

In [None]:
from sklearn.svm import SVR

svr = Pipeline([("pre", pre), ("model", SVR(kernel='rbf', C=1.0, epsilon=0.1))])
svr.fit(X_train, y_train)
pred = svr.predict(X_test)

### 8.5 XGBoost Regressor

**What is it?**
Optimized gradient boosting implementation for regression tasks.

**Characteristics:**
- ✅ State-of-the-art performance
- ✅ Very fast training
- ✅ Built-in regularization
- ✅ Handles missing values automatically
- ❌ Requires separate installation

**When to use:** Production systems, competitions, when you need best performance

In [None]:
# Install: pip install xgboost
from xgboost import XGBRegressor

xgb_reg = Pipeline([("pre", pre), ("model", XGBRegressor(
    n_estimators=100, max_depth=6, learning_rate=0.1, random_state=0
))])
xgb_reg.fit(X_train, y_train)
pred = xgb_reg.predict(X_test)

### 8.4 Gradient Boosting Regressor

**What is it?**
Gradient Boosting for regression - builds trees sequentially to correct errors.

**Characteristics:**
- ✅ Often the strongest tree-based method on tabular data when well tuned
- ✅ Handles non-linear relationships
- ✅ Can be made robust to outliers with `loss="huber"` (the default squared-error loss is not)
- ✅ Provides feature importance
- ❌ Slower to train (sequential)
- ❌ Requires hyperparameter tuning
- ❌ Can overfit if not regularized

**Parameters:**
- `n_estimators=100`: Number of boosting stages
- `learning_rate=0.1`: Shrinks contribution of each tree
- `max_depth=3`: Maximum depth of individual trees
- `subsample=1.0`: Fraction of samples for fitting trees (< 1.0 = stochastic gradient boosting)

**When to use:** Kaggle competitions, when accuracy is most important, complex patterns

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

gb_reg = Pipeline([("pre", pre), ("model", GradientBoostingRegressor(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0
))])
gb_reg.fit(X_train, y_train)
pred = gb_reg.predict(X_test)

### 8.7 Regression Metrics

**Key regression metrics:**

1. **MAE (Mean Absolute Error)**:
   - Average absolute difference between predicted and actual
   - Easy to interpret (same units as target)
   - Robust to outliers
   - Formula: `(1/n) Σ|yᵢ - ŷᵢ|`

2. **MSE (Mean Squared Error)**:
   - Average squared difference
   - Penalizes large errors more heavily
   - Not in same units as target (squared units)
   - Formula: `(1/n) Σ(yᵢ - ŷᵢ)²`

3. **RMSE (Root Mean Squared Error)**:
   - Square root of MSE
   - Same units as target (easy to interpret)
   - Penalizes large errors
   - Most common metric
   - Formula: `√MSE`

4. **R² (R-squared / Coefficient of Determination)**:
   - Proportion of variance explained by model
   - Range: -∞ to 1 (1 = perfect, 0 = baseline, negative = worse than baseline)
   - Scale-independent
   - Formula: `1 - (SS_res / SS_tot)`

**Which to use?**
- **RMSE**: Most common, interpretable, penalizes large errors
- **MAE**: When outliers shouldn't be penalized heavily
- **R²**: To understand proportion of variance explained

In [None]:
pred = reg.predict(X_test)
print("MAE:", mean_absolute_error(y_test, pred))
print("MSE:", mean_squared_error(y_test, pred))
print("RMSE:", root_mean_squared_error(y_test, pred))  # needs scikit-learn >= 1.4
print("R2:", r2_score(y_test, pred))
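
The four formulas can be verified by hand on a few numbers. The arrays below are made up for illustration and are unrelated to `reg` or `X_test`.

```python
import numpy as np

y_true_demo = np.array([3.0, 5.0, 2.5, 7.0])
y_pred_demo = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_true_demo - y_pred_demo

mae = np.mean(np.abs(errors))                        # (1/n) * sum |y - y_hat|
mse = np.mean(errors**2)                             # (1/n) * sum (y - y_hat)^2
rmse = np.sqrt(mse)                                  # back in the target's units

ss_res = np.sum(errors**2)                           # residual sum of squares
ss_tot = np.sum((y_true_demo - y_true_demo.mean())**2)  # spread around the mean
r2 = 1.0 - ss_res / ss_tot

print(f"MAE={mae:.4f} MSE={mse:.4f} RMSE={rmse:.4f} R2={r2:.4f}")
```

The single large miss (2.5 vs 4.0) accounts for most of the MSE but only half of the MAE, which is the sense in which squared metrics penalize large errors more.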

## 9) Cross-Validation Templates

**What is cross-validation?**
Instead of a single train-test split, cross-validation:
1. Divides data into k "folds"
2. Trains k times, each time using different fold as test set
3. Averages results across all folds

**Why use cross-validation?**
- ✅ More reliable performance estimate
- ✅ Uses all data for both training and testing
- ✅ Reduces variance in performance estimates
- ✅ Better for small datasets
- ❌ Computationally expensive (trains k models instead of 1)

**Common k values:**
- k=5: Standard choice, good balance
- k=10: More thorough, more expensive
- k=n (Leave-One-Out): Maximum data usage, very expensive
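
The fold mechanics can be sketched without scikit-learn. Below, `kfold_indices` is an illustrative helper and the "model" is just the training mean, so the focus stays on how the data is partitioned.

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample lands in exactly one test fold."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)       # shuffle once up front
    folds = np.array_split(order, k)         # k nearly equal chunks
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# Toy "model": predict the training mean; score each fold by mean squared error.
rng = np.random.default_rng(1)
y_demo = rng.normal(loc=10.0, scale=2.0, size=23)

fold_scores = []
all_test = []
for train_idx, test_idx in kfold_indices(len(y_demo), k=5):
    prediction = y_demo[train_idx].mean()
    fold_scores.append(np.mean((y_demo[test_idx] - prediction) ** 2))
    all_test.append(test_idx)

print(np.round(fold_scores, 3), "mean:", round(float(np.mean(fold_scores)), 3))
```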

### 9.1 Simple cross_val_score

**What it does:**
Performs k-fold cross-validation and returns an array of scores (one per fold).

**Parameters:**
- `cv=5`: Number of folds
- `scoring`: Metric to evaluate
  - Classification: "accuracy", "f1", "precision", "recall", "roc_auc"
  - Regression: "neg_mean_squared_error", "neg_mean_absolute_error", "r2"

**Output interpretation:**
- `scores.mean()`: Average performance across folds
- `scores.std()`: Variability between folds (lower = more stable)

**Note:** For error metrics, sklearn's scorers are negated (`neg_mean_squared_error`, `neg_mean_absolute_error`) so that higher is always better. Use `-scores.mean()` to recover the positive error. `r2` is not negated.

In [None]:
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")  # change scoring
print(scores.mean(), scores.std())


### 9.2 StratifiedKFold (Classification)

**What is StratifiedKFold?**
Ensures each fold maintains the same class distribution as the original dataset.

**Why use it?**
- Critical for imbalanced datasets
- Prevents folds with very few (or zero) examples of minority class
- More reliable performance estimates

**Parameters:**
- `n_splits=5`: Number of folds
- `shuffle=True`: Randomly shuffle data before splitting (recommended)
- `random_state=0`: For reproducibility

**When to use:** 
- Always for classification with imbalanced classes
- Recommended for any classification cross-validation

In [None]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="f1")
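
The effect of stratification is easiest to see by counting positives per test fold. The labels below are a deliberately sorted, imbalanced toy array (an assumption for illustration), which is the worst case for an unshuffled `KFold`.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Imbalanced toy labels: 80 negatives followed by 20 positives.
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.zeros((len(y_demo), 1))   # features are irrelevant for the split itself

plain = KFold(n_splits=5)             # no shuffling, no stratification
strat = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

plain_pos = [int(y_demo[test].sum()) for _, test in plain.split(X_demo)]
strat_pos = [int(y_demo[test].sum()) for _, test in strat.split(X_demo, y_demo)]

print("positives per test fold, KFold:          ", plain_pos)
print("positives per test fold, StratifiedKFold:", strat_pos)
```

With the plain split, four folds contain no positives at all, so metrics such as F1 are undefined or meaningless there; the stratified split gives every fold the same 20% positive rate.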


### 9.3 KFold (Regression)

**What is KFold?**
Standard k-fold cross-validation for regression tasks (no stratification needed for continuous targets).

**Parameters:**
- `n_splits=5`: Number of folds
- `shuffle=True`: Randomly shuffle before splitting (recommended)
- `random_state=0`: For reproducibility

**When to use:** Regression tasks, any time you don't need stratification

In [None]:
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(reg, X, y, cv=cv, scoring="neg_mean_squared_error")
# Note: negative MSE, so use -scores.mean() or np.abs()
print(f"MSE: {-scores.mean():.4f} (+/- {scores.std():.4f})")

## 10) GridSearchCV Template

**What is GridSearchCV?**
Automates hyperparameter tuning by:
1. Defining a grid of parameter combinations
2. Training a model for each combination
3. Evaluating using cross-validation
4. Returning the best combination

**How it works:**
- Tests **all possible combinations** (Cartesian product)
- Uses cross-validation for each combination
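- Averaging over folds makes the choice less dependent on any single validation split (it does not replace a final check on the held-out test set)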

**Parameters explained:**
- `estimator`: Your model/pipeline to optimize
- `param_grid`: Dictionary of parameters to try
- `scoring`: Metric to optimize
- `cv`: Number of cross-validation folds
- `n_jobs=-1`: Use all CPU cores (much faster)

**Important notes:**
- Accessing best model: `grid.best_estimator_`
- Best parameters: `grid.best_params_`
- Best CV score: `grid.best_score_`
- All results: `grid.cv_results_`

**Tip:** Grid grows exponentially! 3 parameters with 3 values each = 3³ = 27 combinations
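
The combinatorial growth in the tip can be counted directly with `itertools.product`. The parameter names and values in `param_grid_demo` are hypothetical, chosen only to give three parameters with three values each.

```python
from itertools import product

# Hypothetical grid: three parameters with three candidate values each.
param_grid_demo = {
    "model__C": [0.1, 1, 10],
    "model__gamma": [0.01, 0.1, 1],
    "model__degree": [2, 3, 4],
}
cv_folds = 5

names = list(param_grid_demo)
combos = [dict(zip(names, values)) for values in product(*param_grid_demo.values())]

n_combos = len(combos)
n_fits = n_combos * cv_folds     # model fits during the search, before the final refit
print(n_combos, n_fits)
print(combos[0])
```

Multiplying by the number of folds is what makes large grids expensive: 27 combinations with `cv=5` already means 135 fits.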

In [None]:
param_grid = {
    "model__C": [0.01, 0.1, 1, 10]
}
grid = GridSearchCV(
    estimator=svm,              # pipeline
    param_grid=param_grid,
    scoring="f1",               # change
    cv=5,
    n_jobs=-1
)
grid.fit(X_train, y_train)
best_model = grid.best_estimator_
grid.best_params_, grid.best_score_


### 10.1 RandomizedSearchCV

**What is RandomizedSearchCV?**
Randomly samples hyperparameter combinations instead of testing all combinations.

**Why use it?**
- ✅ Much faster than GridSearchCV
- ✅ Can cover wider parameter space
- ✅ Good for exploratory tuning
- ✅ Often finds near-optimal parameters with far fewer iterations

**How it works:**
- Randomly samples `n_iter` combinations from parameter distributions
- Each combination is evaluated with cross-validation
- Returns best combination found

**Parameters:**
- `param_distributions`: Dictionary of parameters (can use distributions or lists)
- `n_iter=20`: Number of random combinations to try
- `random_state=0`: For reproducibility

**When to use:**
- Large parameter space (would take too long with GridSearchCV)
- Early-stage hyperparameter exploration
- When you need results quickly

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

param_dist = {
    "model__n_estimators": randint(50, 500),           # Random integers in [50, 499]
    "model__max_depth": randint(3, 20),                # Random integers in [3, 19]
    "model__learning_rate": uniform(0.01, 0.29),       # uniform(loc, scale): samples from [0.01, 0.30]
}

random_search = RandomizedSearchCV(
    estimator=gb,
    param_distributions=param_dist,
    n_iter=20,              # Number of random combinations
    scoring="f1",
    cv=5,
    random_state=0,
    n_jobs=-1
)
random_search.fit(X_train, y_train)
best_model = random_search.best_estimator_
random_search.best_params_, random_search.best_score_
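
The SciPy distributions used above are easy to misread, because `uniform` takes `(loc, scale)` rather than `(low, high)` and `randint` excludes its upper bound. A short sampling check, with illustrative variable names:

```python
from scipy.stats import randint, uniform

lr_dist = uniform(0.01, 0.29)      # loc=0.01, scale=0.29 -> support [0.01, 0.30]
depth_dist = randint(3, 20)        # integers 3..19 (upper bound excluded)

lr_samples = lr_dist.rvs(size=1000, random_state=0)
depth_samples = depth_dist.rvs(size=1000, random_state=0)

print(lr_samples.min(), lr_samples.max())
print(depth_samples.min(), depth_samples.max())
```

For parameters that span orders of magnitude, such as a learning rate, sampling on a log scale (for example with `scipy.stats.loguniform`) is a common alternative.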

## 11) PCA Templates

**What is PCA?**
Principal Component Analysis is a dimensionality reduction technique that:
1. Finds directions of maximum variance in data
2. Projects data onto these directions (principal components)
3. Reduces dimensions while preserving most information

**Why use PCA?**
- ✅ Reduce computational cost (fewer features)
- ✅ Reduce overfitting (fewer features to learn)
- ✅ Visualization (reduce to 2D or 3D)
- ✅ Remove multicollinearity
- ✅ Denoise data
- ❌ Loses interpretability (PCs are combinations of original features)
- ❌ Assumes linear relationships

**Key concepts:**
- **Explained variance**: How much information each PC captures
- **Cumulative explained variance**: Total information captured by first k components

### 11.1 PCA for Dimensionality Reduction + Explained Variance

**What it does:**
Reduces dimensions while keeping a specified amount of variance.

**Parameters:**
- `n_components=0.90`: Keep 90% of variance
  - Alternative: `n_components=50` (keep exactly 50 components)
- `svd_solver="full"`: Algorithm for computing PCA

**How to use:**
1. Fit PCA on training data
2. Transform both training and test data
3. Check actual number of components: `pca.n_components_`

**Interpretation:**
- 0.90 means the retained components explain 90% of the total variance
- Trade-off: a higher variance target keeps more components, which means more computation
- Typical values: 0.85-0.95

**Important:** 
- Always fit PCA on training data only!
- Apply same transformation to test data
- PCA is sensitive to feature scale, so standardize features first (e.g., with StandardScaler)

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=0.90, svd_solver="full")  # keep 90% variance
pipe = Pipeline([("pre", pre), ("pca", pca)])
Z_train = pipe.fit_transform(X_train)
Z_test  = pipe.transform(X_test)

# if you need actual number of components:
k = pipe.named_steps["pca"].n_components_
k
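
How a variance target such as 0.90 turns into a number of components can be sketched with an SVD. The synthetic data below (three strong latent directions hidden in ten features) and all `_demo` names are assumptions; scikit-learn's own cutoff rule may differ at exact ties.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 3 strong latent directions embedded in 10 features, plus small noise.
latent = rng.normal(size=(300, 3)) * np.array([5.0, 3.0, 2.0])
mixing = rng.normal(size=(3, 10))
X_demo = latent @ mixing + rng.normal(scale=0.1, size=(300, 10))

X_centered = X_demo - X_demo.mean(axis=0)            # PCA works on centered data
_, singular_values, _ = np.linalg.svd(X_centered, full_matrices=False)

explained_variance = singular_values**2 / (len(X_demo) - 1)
explained_ratio = explained_variance / explained_variance.sum()
cumulative = np.cumsum(explained_ratio)

# Smallest k whose cumulative explained variance reaches the 90% target.
k_demo = int(np.searchsorted(cumulative, 0.90) + 1)
print(cumulative.round(3), "-> k =", k_demo)
```

Plotting `cumulative` against the component index gives the usual "elbow" view for choosing the target.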


### 11.2 Reconstruction Error (Anomaly Detection)

**What is reconstruction error?**
The difference between original data and reconstructed data after dimensionality reduction.

**How it works for anomaly detection:**
1. Train PCA on normal data
2. Project data to lower dimensions
3. Reconstruct back to original dimensions
4. Calculate error = ||original - reconstructed||
5. High error = potential anomaly (data point is unusual)

**Intuition:**
- Normal data reconstructs well (low error)
- Anomalies don't fit the patterns and reconstruct poorly (high error)
- Works because PCA learns patterns from normal data

**Use cases:**
- Fraud detection
- Manufacturing defect detection
- Network intrusion detection
- Quality control

**Parameters:**
- `n_components=k`: Use fewer components for more sensitive anomaly detection
- `top_idx`: Indices of most anomalous samples (highest reconstruction error)

In [None]:
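pca = PCA(n_components=k)
# Ideally fit `pre` and `pca` on data known to be normal, then score new data with transform()
X_scaled = pre.fit_transform(X)   # if using ColumnTransformer; must be a dense NumPy array
Z = pca.fit_transform(X_scaled)
X_hat = pca.inverse_transform(Z)
err = np.linalg.norm(X_scaled - X_hat, axis=1)  # per-row reconstruction error
top_idx = np.argsort(err)[::-1][:10]            # 10 rows with the largest error
top_idx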
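
The cell above fits and scores on the same matrix. A variant closer to the steps listed earlier, fitting on normal data only and then scoring new rows against a threshold, might look like the sketch below. The synthetic data, the helper `reconstruction_error`, and the 99th-percentile threshold are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic "normal" data lying close to a 2-D plane inside a 5-D space
mixing = rng.normal(size=(2, 5))
X_normal = rng.normal(size=(500, 2)) @ mixing + 0.05 * rng.normal(size=(500, 5))

# New rows to score: 20 normal-looking rows plus 5 rows of unstructured noise
X_new_ok = rng.normal(size=(20, 2)) @ mixing + 0.05 * rng.normal(size=(20, 5))
X_new_odd = 3.0 * rng.normal(size=(5, 5))
X_new = np.vstack([X_new_ok, X_new_odd])

# Fit the scaler and PCA on normal data only
scaler = StandardScaler().fit(X_normal)
pca_normal = PCA(n_components=2).fit(scaler.transform(X_normal))

def reconstruction_error(X):
    """Per-row distance between scaled data and its PCA reconstruction."""
    Xs = scaler.transform(X)
    X_hat = pca_normal.inverse_transform(pca_normal.transform(Xs))
    return np.linalg.norm(Xs - X_hat, axis=1)

# Hypothetical rule: flag rows whose error exceeds the 99th percentile seen on normal data
threshold = np.quantile(reconstruction_error(X_normal), 0.99)
err_new = reconstruction_error(X_new)
flagged = np.where(err_new > threshold)[0]
print(threshold, flagged)
```

The threshold is a tuning choice: a lower percentile flags more rows, at the cost of more false alarms.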


### 15.3 Random Under-sampling

**What it does:**
Randomly removes samples from majority class to balance classes.

**Pros and cons:**
- ✅ Fast and simple
- ✅ Reduces training time
- ❌ Loses potentially useful information
- ❌ May remove important samples

**When to use:**
- Very large majority class (can afford to lose samples)
- When training time is a concern
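- In combination with oversampling (shrink the majority class and grow the minority class)

**Alternative:** Chain `SMOTE` and `RandomUnderSampler` in one pipeline, or use the combined samplers `SMOTEENN` / `SMOTETomek`, which oversample with SMOTE and then clean the result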
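
To see what random under-sampling does to the class counts, here is a NumPy-only sketch. It illustrates the idea and is not imblearn's implementation; the helper `random_undersample` and the toy arrays are hypothetical.

```python
import numpy as np

def random_undersample(X, y, random_state=0):
    """Down-sample every class to the size of the smallest class."""
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    keep.sort()  # preserve the original row order
    return X[keep], y[keep]

# Toy data: 90 majority rows, 10 minority rows
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 90 + [1] * 10)

X_res, y_res = random_undersample(X_toy, y_toy)
print(np.bincount(y_toy), np.bincount(y_res))
```

Here 80 of the 90 majority rows are discarded, which is the information loss listed above.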

In [None]:
# Install: pip install imbalanced-learn
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline as ImbPipeline  # applies samplers during fit only
from sklearn.linear_model import LogisticRegression

# Random undersampling
pipe_under = ImbPipeline([
    ("pre", pre),
    ("undersampler", RandomUnderSampler(sampling_strategy='auto', random_state=0)),
    ("model", LogisticRegression(max_iter=5000))
])

pipe_under.fit(X_train, y_train)
pred = pipe_under.predict(X_test)

# Combined approach: SMOTE oversampling followed by Edited Nearest Neighbours cleaning
from imblearn.combine import SMOTEENN

pipe_combined = ImbPipeline([
    ("pre", pre),
    ("resample", SMOTEENN(random_state=0)),
    ("model", LogisticRegression(max_iter=5000))
])

pipe_combined.fit(X_train, y_train)
pred = pipe_combined.predict(X_test)

### 15.2 SMOTE (Synthetic Minority Over-sampling)

**What is SMOTE?**
Creates synthetic samples of minority class by interpolating between existing minority samples.

**How it works:**
1. For each minority sample, find k nearest minority neighbors
2. Create synthetic samples along the lines connecting neighbors
3. Results in more diverse minority samples than simple duplication
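
The interpolation step can be sketched in a few lines. This is a simplified illustration of the idea, not imblearn's `SMOTE` implementation; the helper `smote_sketch` and the toy minority data are assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min, n_new, k=5, random_state=0):
    """Generate n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(random_state)
    # Ask for k + 1 neighbours because each point is its own nearest neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]

    base_idx = rng.integers(0, len(X_min), size=n_new)               # pick a minority sample
    nb_idx = neighbours[base_idx, rng.integers(0, k, size=n_new)]    # pick one of its neighbours
    gap = rng.random((n_new, 1))                                     # position along the segment
    return X_min[base_idx] + gap * (X_min[nb_idx] - X_min[base_idx])

rng = np.random.default_rng(1)
X_minority = rng.normal(loc=2.0, size=(20, 2))
X_synthetic = smote_sketch(X_minority, n_new=30)
print(X_synthetic.shape)
```

Every synthetic point lies on a segment between two real minority points, which is why SMOTE can add noise where the classes overlap.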

**Pros and cons:**
- ✅ Better than random oversampling (no exact duplicates)
- ✅ Increases minority class representation
- ❌ Can create noisy samples in overlapping regions
- ❌ Requires installation: `pip install imbalanced-learn`

**When to use:** 
- Severe class imbalance (< 10% minority)
- When class weights alone aren't enough
- With careful validation (can cause overfitting)

In [None]:
# Install: pip install imbalanced-learn
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline  # Use imblearn's Pipeline
from sklearn.linear_model import LogisticRegression

# Create pipeline with SMOTE
pipe_smote = ImbPipeline([
    ("pre", pre),
    ("smote", SMOTE(sampling_strategy='auto', random_state=0)),  # 'auto' oversamples every class except the majority
    ("model", LogisticRegression(max_iter=5000))
])

pipe_smote.fit(X_train, y_train)
pred = pipe_smote.predict(X_test)
proba = pipe_smote.predict_proba(X_test)[:, 1]

# Or apply SMOTE separately
# smote = SMOTE(random_state=0)
# X_train_resampled, y_train_resampled = smote.fit_resample(X_train_processed, y_train)

### 15.1 Class Weights

**What it does:**
Gives higher penalty for misclassifying minority class during training.

**How it works:**
- Automatically adjusts loss function to account for class imbalance
- No need to resample data
- Many sklearn classifiers support the `class_weight` parameter

**Options:**
- `class_weight='balanced'`: Automatically adjusts weights inversely proportional to class frequencies
- `class_weight={0: 1, 1: 10}`: Manual weight assignment
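
To see what `'balanced'` actually computes, the sketch below uses scikit-learn's `compute_class_weight` on made-up labels and compares the result with the documented formula `n_samples / (n_classes * np.bincount(y))`. The label array `y_toy` is an assumption for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels: 90% negative, 10% positive
y_toy = np.array([0] * 90 + [1] * 10)

weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_toy
)

# Same values from the formula: n_samples / (n_classes * count per class)
manual = len(y_toy) / (2 * np.bincount(y_toy))

print(weights, manual)
```

With this 90/10 split, the minority class gets weight 5.0 and the majority class about 0.556, so one minority mistake costs roughly as much as nine majority mistakes.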

**When to use:** 
- First approach to try (simple, no data modification)
- Available in many sklearn classifiers (check for a `class_weight` parameter)
- When you want to keep original data distribution

In [None]:
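# Works with: LogisticRegression, SVC, RandomForest, DecisionTree, etc.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
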
# Automatic balanced weights
clf_balanced = Pipeline([
    ("pre", pre),
    ("model", LogisticRegression(max_iter=5000, class_weight='balanced'))
])

# Or with Random Forest
rf_balanced = Pipeline([
    ("pre", pre),
    ("model", RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=0))
])

# Manual weights (if class 1 is 10x more important)
clf_manual = Pipeline([
    ("pre", pre),
    ("model", LogisticRegression(max_iter=5000, class_weight={0: 1, 1: 10}))
])

clf_balanced.fit(X_train, y_train)
proba = clf_balanced.predict_proba(X_test)[:, 1]

## 14) Handling Class Imbalance

**What is class imbalance?**
When one class has significantly more samples than others (e.g., 95% negative, 5% positive).

**Why it's a problem:**
- Models tend to predict majority class
- Accuracy becomes misleading metric
- Minority class is often the most important (fraud, disease, churn)
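
The point about accuracy can be checked with a baseline that always predicts the majority class. The sketch below uses made-up labels with the 95/5 split from the example above; the features are dummy zeros because this baseline ignores them.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score, f1_score

y_toy = np.array([0] * 95 + [1] * 5)   # 95% negative, 5% positive
X_toy = np.zeros((100, 1))             # features are irrelevant for this baseline

baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
pred_toy = baseline.predict(X_toy)     # always predicts class 0

acc = accuracy_score(y_toy, pred_toy)
rec = recall_score(y_toy, pred_toy, zero_division=0)
f1 = f1_score(y_toy, pred_toy, zero_division=0)
print(acc, rec, f1)
```

The baseline scores 95% accuracy while finding none of the positive cases, so recall and F1 are both zero.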

**Solutions:**
1. **Resampling**: Oversample minority or undersample majority
2. **Class weights**: Penalize mistakes on minority class more
3. **Synthetic oversampling**: SMOTE, Borderline-SMOTE (generate new minority samples instead of duplicating existing ones)
4. **Evaluation**: Use appropriate metrics (F1, precision-recall, AUC)

### 14.3 Bagging Classifier

**What is bagging?**
Bootstrap aggregating: trains multiple models on bootstrap samples (random subsets drawn with replacement) and combines their predictions.

**How it works:**
1. Create multiple bootstrap samples (random sampling with replacement)
2. Train a model on each sample
3. Average predictions (regression) or vote (classification)
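
The three steps can be written out by hand to show what `BaggingClassifier` automates. This is a minimal sketch on synthetic data, not the library's implementation; the 25-tree count and the toy dataset are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):  # odd number of models avoids tied votes in binary problems
    # Step 1: bootstrap sample (same size as the data, drawn with replacement)
    idx = rng.integers(0, len(X_toy), size=len(X_toy))
    # Step 2: train one model on that sample
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_toy[idx], y_toy[idx]))

# Step 3: majority vote across the trees
votes = np.stack([t.predict(X_toy) for t in trees])    # shape (25, 300)
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)
print((bagged_pred == y_toy).mean())
```

The accuracy printed here is measured on the training rows, so it is optimistic; use a held-out set to judge the ensemble.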

**Benefits:**
- ✅ Reduces variance (overfitting)
- ✅ Works well with unstable models (decision trees)
- ✅ Can be parallelized (fast training)

**Note:** Random Forest is a special case of bagging with decision trees and additional randomization

**When to use:** With high-variance models (decision trees), when you want to reduce overfitting

In [None]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Bagging with decision trees
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=10),  # named `base_estimator` before scikit-learn 1.2
    n_estimators=100,      # Number of base models
    max_samples=0.8,       # Fraction of samples drawn for each base model
    max_features=0.8,      # Fraction of features drawn for each base model
    bootstrap=True,        # Sample with replacement
    random_state=0,
    n_jobs=-1
)

pipe = Pipeline([("pre", pre), ("bagging", bagging)])
pipe.fit(X_train, y_train)
proba = pipe.predict_proba(X_test)[:, 1]
pred = pipe.predict(X_test)

### 14.2 Stacking Classifier

**What is stacking?**
A two-level ensemble:
1. **Level 0**: Multiple base models make predictions
2. **Level 1**: A meta-model learns to combine base model predictions

**How it works:**
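- Train multiple diverse base models on the training data
- Use their cross-validated (out-of-fold) predictions as features for the meta-model, so it is not trained on predictions the base models made for rows they had already seen
- Meta-model learns how to combine the base predictions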
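
A hand-rolled version of these steps helps show why cross-validation is involved. The sketch below is a simplified illustration on synthetic data, not how `StackingClassifier` is implemented internally; the two base models and the toy dataset are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy
)

base = [
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(max_depth=3, random_state=0),
]

# Level 0: out-of-fold probabilities, so each meta-feature comes from a base
# model that did not see that row during training
meta_train = np.column_stack([
    cross_val_predict(m, X_tr, y_tr, cv=5, method="predict_proba")[:, 1]
    for m in base
])

# Refit each base model on all training rows to build test-time meta-features
meta_test = np.column_stack([
    m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for m in base
])

# Level 1: the meta-model learns how to weight the base predictions
meta = LogisticRegression().fit(meta_train, y_tr)
stacked_acc = meta.score(meta_test, y_te)
print(stacked_acc)
```

Training the meta-model on in-sample base predictions instead would reward whichever base model overfits the most.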

**Pros and cons:**
- ✅ Often achieves best performance
- ✅ Learns optimal combination weights
- ❌ More complex, slower to train

**When to use:** Competitions, when maximum accuracy is needed, final production model

In [None]:
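from sklearn.ensemble import StackingClassifier, RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression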

# Base models (level 0)
base_models = [
    ('lr', LogisticRegression(max_iter=5000, random_state=0)),
    ('rf', RandomForestClassifier(n_estimators=100, random_state=0)),
    ('gb', GradientBoostingClassifier(n_estimators=50, random_state=0))
]

# Meta-model (level 1)
meta_model = LogisticRegression(max_iter=5000)

# Create stacking classifier
stacking = StackingClassifier(
    estimators=base_models,
    final_estimator=meta_model,
    cv=5  # Cross-validation to generate meta-features
)

pipe = Pipeline([("pre", pre), ("stacking", stacking)])
pipe.fit(X_train, y_train)
proba = pipe.predict_proba(X_test)[:, 1]
pred = pipe.predict(X_test)

### 14.1 Voting Classifier

**What it does:**
Combines predictions from multiple different classifiers.

**Voting strategies:**
- **'hard'**: Majority vote (most common predicted class)
- **'soft'**: Averages predicted probabilities (requires `predict_proba`)
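
The two strategies can disagree when one model is much more confident than the others. The sketch below works through a single sample with hypothetical class-1 probabilities from three classifiers.

```python
import numpy as np

# Hypothetical class-1 probabilities from three classifiers for one sample
p_class1 = np.array([0.45, 0.45, 0.90])

# Hard voting: each model casts one vote for its predicted label
hard_votes = (p_class1 >= 0.5).astype(int)                      # votes: 0, 0, 1
hard_pred = int(np.bincount(hard_votes, minlength=2).argmax())  # class 0 wins 2 to 1

# Soft voting: average the probabilities, then threshold
soft_pred = int(p_class1.mean() >= 0.5)                         # mean is 0.6, so class 1

print(hard_pred, soft_pred)
```

Hard voting picks class 0 and soft voting picks class 1, because soft voting lets the confident third model outweigh two lukewarm ones.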

**When to use:**
- Combine diverse models (e.g., logistic regression + random forest + SVM)
- Often improves over individual models
- Simple and effective ensemble method

In [None]:
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Define individual models
clf1 = LogisticRegression(max_iter=5000, random_state=0)
clf2 = RandomForestClassifier(n_estimators=100, random_state=0)
clf3 = SVC(probability=True, random_state=0)  # probability=True for soft voting

# Create voting classifier
voting = VotingClassifier(
    estimators=[('lr', clf1), ('rf', clf2), ('svc', clf3)],
    voting='soft'  # or 'hard'
)

# Use in pipeline
pipe = Pipeline([("pre", pre), ("voting", voting)])
pipe.fit(X_train, y_train)
proba = pipe.predict_proba(X_test)[:, 1]
pred = pipe.predict(X_test)

## 13) Ensemble Methods

**What are ensemble methods?**
Combine multiple models to create a stronger predictor. The key principle: "wisdom of the crowd."

**Main approaches:**
1. **Voting**: Combine predictions from multiple different models
2. **Bagging**: Train same model on different data subsets (e.g., Random Forest)
3. **Boosting**: Train models sequentially, each correcting previous errors (e.g., Gradient Boosting)
4. **Stacking**: Use predictions from multiple models as input to a meta-model

**Benefits:**
- ✅ Often better performance than single models
- ✅ Reduces overfitting (averaging reduces variance)
- ✅ More robust predictions

### 13.3 Feature Importance from Tree-Based Models

**What it does:**
Uses feature importance scores from trained tree-based models to select features.

**Advantages:**
- ✅ Fast (uses already-trained model)
- ✅ Captures feature interactions
- ✅ Works well in practice

**When to use:** With tree-based models (RF, GB, XGB), after training your model

In [None]:
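from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Preprocess once so the forest and the selector see the same feature matrix
X_train_processed = pre.fit_transform(X_train)
X_test_processed = pre.transform(X_test)

# Train Random Forest and select important features
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train_processed, y_train)

# Select features with importance >= threshold
selector = SelectFromModel(rf, threshold="median", prefit=True)  # or threshold=0.01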
X_train_selected = selector.transform(X_train_processed)
X_test_selected = selector.transform(X_test_processed)

# Or use in pipeline
selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0), threshold="median")
pipe = Pipeline([
    ("pre", pre),
    ("selector", selector),
    ("model", LogisticRegression(max_iter=5000))
])
pipe.fit(X_train, y_train)

### 13.2 Recursive Feature Elimination (RFE)

**What it does:**
Repeatedly fits a model and removes the least important feature(s) until the desired number remains.

**How it works:**
1. Train model on all features
2. Rank features by importance
3. Remove least important feature(s)
4. Repeat until k features remain
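
The loop above can be sketched by hand to show what `RFE` automates. This is a simplified illustration that uses absolute logistic-regression coefficients as the importance measure; the synthetic data and `k_keep = 3` are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X_toy = StandardScaler().fit_transform(X_toy)  # put coefficients on a comparable scale

remaining = list(range(X_toy.shape[1]))  # column indices still in play
k_keep = 3

while len(remaining) > k_keep:
    # Step 1: train on the features that are still in play
    model = LogisticRegression(max_iter=1000).fit(X_toy[:, remaining], y_toy)
    # Step 2: rank them by absolute coefficient
    importance = np.abs(model.coef_).ravel()
    # Step 3: drop the weakest one, then repeat (step 4)
    remaining.pop(int(importance.argmin()))

print(remaining)
```

Each pass refits the model, which is why RFE gets slow when there are many features or the estimator is expensive.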

**Characteristics:**
- ❌ Slow (trains multiple models)
- ✅ Model-aware (uses actual model importance)
- ✅ Often better than filter methods

**When to use:** When you have time, when feature interactions matter, for final model

In [None]:
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Select top 10 features using Random Forest
estimator = RandomForestClassifier(n_estimators=50, random_state=0)
rfe = RFE(estimator=estimator, n_features_to_select=10, step=1)

pipe = Pipeline([
    ("pre", pre),
    ("rfe", rfe),
    ("model", LogisticRegression(max_iter=5000))
])
pipe.fit(X_train, y_train)
pred = pipe.predict(X_test)

# See which features were selected (after fitting)
# pipe.named_steps["rfe"].support_  # Boolean mask
# pipe.named_steps["rfe"].ranking_  # Feature rankings (1 = selected)

### 13.1 SelectKBest (Filter Method)

**What it does:**
Selects top k features based on statistical tests.

**Score functions:**
- **Classification**: `f_classif` (ANOVA F-value), `chi2` (requires non-negative features), `mutual_info_classif`
- **Regression**: `f_regression`, `mutual_info_regression`
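
As a quick illustration of how the score function drives the selection, the sketch below fits `SelectKBest` with `f_classif` and with `chi2` on synthetic data. The dataset, `k=3`, and the `MinMaxScaler` rescaling for `chi2` are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, chi2
from sklearn.preprocessing import MinMaxScaler

X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

# The ANOVA F-test works directly on real-valued features
skb_f = SelectKBest(score_func=f_classif, k=3).fit(X_toy, y_toy)
kept_f = skb_f.get_support(indices=True)   # column indices of the 3 best features

# chi2 rejects negative inputs, so rescale to [0, 1] first
X_nonneg = MinMaxScaler().fit_transform(X_toy)
skb_chi = SelectKBest(score_func=chi2, k=3).fit(X_nonneg, y_toy)
kept_chi = skb_chi.get_support(indices=True)

print(kept_f, skb_f.scores_.round(1))
print(kept_chi)
```

Different score functions can rank features differently, so it is worth comparing their selections before fixing `k`.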

**When to use:** Fast feature selection, as preprocessing step, when you know how many features you want

In [None]:
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Select top 10 features
selector = SelectKBest(score_func=f_classif, k=10)
pipe = Pipeline([
    ("pre", pre),
    ("selector", selector),
    ("model", LogisticRegression(max_iter=5000))
])
pipe.fit(X_train, y_train)
pred = pipe.predict(X_test)

# See selected feature scores
# selector.scores_

## 12) Feature Selection Templates

**What is feature selection?**
The process of selecting a subset of relevant features to:
- ✅ Reduce overfitting
- ✅ Improve model performance
- ✅ Reduce training time
- ✅ Improve model interpretability

**Three main approaches:**
1. **Filter methods**: Score features independently (fast, model-agnostic)
2. **Wrapper methods**: Use model performance to select features (slower, often more accurate)
3. **Embedded methods**: Feature selection during model training (e.g., Lasso)
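
The embedded approach can be illustrated with the Lasso example mentioned above. This sketch uses a synthetic regression problem; the dataset and `alpha=1.0` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X_toy, y_toy = make_regression(
    n_samples=300, n_features=20, n_informative=3, noise=5.0, random_state=0
)
X_toy = StandardScaler().fit_transform(X_toy)

# The L1 penalty shrinks weak coefficients to exactly zero while the model trains
embedded = SelectFromModel(Lasso(alpha=1.0)).fit(X_toy, y_toy)

mask = embedded.get_support()            # True for features with a non-negligible coefficient
X_embedded = embedded.transform(X_toy)   # keeps only the selected columns
print(int(mask.sum()), "of", mask.size, "features kept")
```

A larger `alpha` zeroes out more coefficients and keeps fewer features, so the penalty strength doubles as the selection knob.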