### **Table of Contents**

1. **Problem Definition**
    - 1.1: Problem Classification
    - 1.2: Defining Success Metrics
    - 1.3: 

2. **Data Collection** 

3. **Exploratory Data Analysis (EDA)**
    - 3.1: Data Quality Assessment
    - 3.2: Univariate Analysis
    - 3.3: Bivariate Analysis
    - 3.4: Multivariate Analysis
    - 3.5: Distribution Analysis

4. **Data Cleaning and Preprocessing** 
    - 4.1: Handling Missing Values
    - 4.2: Handling Outliers
    - 4.3: Data Type Correction
    - 4.4: Duplicates Removal 
    - 4.5: Data Transformation

5. **Feature Engineering** 
    - 5.1: Feature Creation
    - 5.2: Feature Transformation
    - 5.3: Encoding Categorical Variables
    - 5.4: Feature Scaling
    - 5.5: Feature Selection

6. **Data Splitting**
    - Train-Test Split
    - Cross Validation
    - Time Series Split

7. **Machine Learning: A Practical Overview**
    - 7.1 Baseline Model
    - 7.2 Linear Models
       - Linear Regression
       - Logistic Regression
       - Ridge/Lasso Regression
    - 7.3 Tree-Based Models
       - Decision Trees
       - Random Forest
       - Gradient Boosting (XGBoost, LightGBM, CatBoost)
    - 7.4 Other Algorithms
       - K-Nearest Neighbors (KNN)
       - Naive Bayes
       - Support Vector Machines (SVM)
    - 7.5 Model Selection & Comparison

    

### **Part 1.1: Problem Classification** 

- Correctly identify the problem you are trying to solve 

- Within the first 30 seconds, work through this step-by-step process:
1. Listen to the problem at hand 
2. Immediately classify the problem by considering: 
    - Is this a supervised or unsupervised learning problem?
    - If supervised: Is this a regression or classification problem?
    - If unsupervised: Is this a clustering or dimensionality reduction problem?
3. Understand the objective: Prediction, causal inference or finding patterns?
4. Confirm with the stakeholders that your understanding is correct

**Supervised Learning:** You have labeled data. Your goal is to learn a function that maps inputs to outputs:
- Regression: Predicting a continuous value 
- Classification: Predicting a discrete value (class labels)

**Unsupervised Learning:** You have unlabeled data. Your goal is to find patterns or structure in the data.
- Clustering: Grouping similar data points together
- Dimensionality Reduction: Reducing the number of features while preserving important information

**Prediction**
- Question: "What will happen?"
- Goal: Forecast outcomes accurately
- Example: "Which customers will churn?"
- Don't care about WHY - just want accurate predictions.

**Causal Inference**
- Question: "What CAUSED this?" or "What IF we do X?"
- Goal: Understand cause-and-effect for decision-making
- Example: "Will offering discounts CAUSE customers to stay?"
- Care deeply about WHY - need to know what actually works.



### **Part 1.2 Defining Success Metrics**

Before diving into the data, you need to define what business success looks like to your stakeholders.

The **SMART Framework** is a useful tool to define success metrics: 
- Specific: Clear definition of what you want
- Measurable: Quantifiable indicators or outcomes
- Achievable: Realistic goals, considering constraints 
- Relevant: Tied to business objectives
- Time bound: Clear deadlines for achievement

**Baseline:** This is the current performance level before you do anything. Your goal is to beat this baseline by adding value. 
*Example: Current churn rate: 20%*

**Target:** The desired performance level after intervention. It should be a specific, quantified goal. Be realistic: a 10-30% relative improvement is a typical range. *Example: Reduce churn to 15% (25% relative reduction)*

Note: Also mention the business value of hitting the target. *Example: This would save $500K annually in revenue*

**Minimum Viable Performance**: The lowest acceptable performance level to consider the project a success. Helps manage expectations. *Example: Reduce churn to 18% (10% relative reduction)*
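
The relative-reduction figures quoted for the target and the minimum viable performance follow from one small formula. A minimal sketch, where `relative_reduction` is an illustrative helper name (not a library function) and the churn rates are the example values used above:

```python
def relative_reduction(baseline: float, new_value: float) -> float:
    """Fraction by which new_value improves on baseline (lower is better)."""
    return (baseline - new_value) / baseline


baseline_churn = 0.20  # current churn rate
target_churn = 0.15    # desired churn rate after intervention
mvp_churn = 0.18       # minimum viable churn rate

target_gain = relative_reduction(baseline_churn, target_churn)
mvp_gain = relative_reduction(baseline_churn, mvp_churn)

print(f"Target: {target_gain:.0%} relative reduction")
print(f"MVP:    {mvp_gain:.0%} relative reduction")
```

Quoting both the absolute change (20% to 15%) and the relative change avoids ambiguity when talking to stakeholders.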

**Constraints**: Understand the real world limitations that affect feasibility
- Budget: Financial limits on resources
- Time: Deadlines for delivery
- Resources: Availability of data, tools, personnel

Break success down into two components: 

1. **Business Success**: The business outcome we are trying to achieve, measured in business units: dollars, percentages, customer counts, time saved, etc. *Example: Increase revenue by $1M annually*

2. **Model Success**: The technical performance of the machine learning model, measured in ML metrics: Accuracy, precision, recall, F1-score, RMSE, AUC-ROC, etc. *Example: Achieve 85% accuracy on test set*
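
To make the model metrics concrete, the sketch below derives accuracy, precision, recall and F1 from confusion-matrix counts. The counts are hypothetical and chosen only for illustration; they show how accuracy can look healthy while recall on the churners is much lower.

```python
# Hypothetical confusion-matrix counts for a churn classifier on 1,000 customers
tp, fp, fn, tn = 120, 60, 80, 740

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)  # of customers flagged as churners, share who churned
recall = tp / (tp + fn)     # of customers who churned, share who were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1-score:  {f1:.3f}")
```

Which of these to treat as the model success metric depends on the business cost of a missed churner versus a false alarm.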









### **Part 3.1: Data Quality Assessment**

The first step before any analysis is to understand the quality of the data. There are six main dimensions of data quality to assess. 

1. **Completeness**: Do we actually have all the required data points?
- Check for missing values in key columns. Understand how many fields are missing, what percentage of each column is missing, and whether there are rows that are largely empty. 

- Then check for missingness patterns. Are the missing values random or systematic? Is missingness in one column correlated with missingness in another? 

- Things to look out for: 
    - If missingness correlates with target = Results are biased
    - If entire columns/rows are missing = Data pipeline issues 
    - If over 30% of data is missing in key fields = Impute or use alternative data sources

In [None]:
# Check missing values
missing = df.isnull().sum()

# Check percentage of missing values
missing_pct = df.isnull().sum() / len(df) * 100

# To summarise the two together
missing_summary = pd.DataFrame({
    'missing_count': missing,
    'missing_pct': missing_pct
}).sort_values('missing_pct', ascending=False)

# Visualise missingness pattern 
import matplotlib.pyplot as plt
import missingno as msno
msno.matrix(df)
plt.show()

# Visualise correlation between missingness
msno.heatmap(df)
plt.show()
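
The warning above about missingness correlating with the target can be checked directly. A minimal sketch on a hypothetical toy frame (the `income` and `churned` columns and their values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'income' is missing more often for churned customers
df = pd.DataFrame({
    "income": [52_000, np.nan, 61_000, np.nan, 48_000, np.nan, 75_000, 58_000],
    "churned": [0, 1, 0, 1, 0, 1, 0, 1],
})

# Flag rows where the feature is missing
income_missing = df["income"].isnull()

# Target rate among rows with the value present (False) vs. missing (True)
churn_by_missing = df.groupby(income_missing)["churned"].mean()
print(churn_by_missing)

# Share of missing values within each target class
missing_by_churn = income_missing.groupby(df["churned"]).mean()
print(missing_by_churn)
```

A large gap between the two groups suggests the values are not missing at random, in which case a missing-indicator feature may be worth keeping.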

2. **Accuracy:** Is the data correct and reliable?
- Impossible values: Negative ages, future dates, out-of-range values = Data collection/entry errors
- Data entry errors: Typos, inconsistent formats, duplicates = Human error
- Outliers: Values that are extreme and likely erroneous = Measurement errors

- **Use summary statistics or visualisations like boxplots to identify these anomalies**

In [None]:
# Descriptive statistics
df.describe()  # Check min, max, mean

# Check for impossible values
df[df['age'] < 0]  # Negative ages
df[df['age'] > 120]  # Unrealistic ages
df[df['price'] < 0]  # Negative prices

# Value counts for categories
df['category'].value_counts()  # Spot typos

3. **Consistency:** Is the data consistent across different sources and formats?

In [None]:
# Check date formats
pd.to_datetime(df['date'], errors='coerce').isnull().sum()

# Check cross-field logic
df[df['end_date'] < df['start_date']]  # Illogical dates

# Check for sudden changes
df.groupby('date')['revenue'].mean().plot()  # Visual inspection

4. **Validity:** Does the data conform to defined formats, types, and ranges?


In [None]:
# Check email format
import re
email_pattern = r'^[\w\.-]+@[\w\.-]+\.\w+$'
df['email'].str.match(email_pattern).value_counts()

# Check category validity
valid_categories = ['A', 'B', 'C']
invalid = df[~df['category'].isin(valid_categories)]

# Check business rules
df[df['discount'] > df['price']]  # Invalid discounts

5. **Uniqueness:** Are there duplicate records or entries in the dataset?

- Check for exact duplicates, such as completely identical rows

- Check for partial duplicates based on key identifiers, such as same customer ID but different names

- Check for primary key violations: Expected unique ID's are not unique, multiple records for the same entity

In [None]:
# Check for exact duplicates
df.duplicated().sum()
df[df.duplicated(keep=False)]  # Show all duplicates

# Check primary key uniqueness
df['customer_id'].nunique() == len(df)

# Check for near-duplicates
df.groupby(['customer_id', 'date']).size()[lambda x: x > 1]

6. **Timeliness:** Is the data up-to-date and relevant for the analysis?

- Freshness: When was the data last updated, and is it recent enough for the use case

- Update Frequency: Is the data updated as scheduled, or are there gaps in the time series data

- Lag: Is there a delay between data generation and availability for analysis

In [None]:
# Check latest date
df['date'].max()

# Check for gaps in time-series
date_range = pd.date_range(df['date'].min(), df['date'].max(), freq='D')
missing_dates = date_range.difference(df['date'])

# Check update frequency
df.groupby('date').size().plot()

### **Part 3.2: Univariate Analysis** 
Analysing individual variables to understand distribution, tendencies, variance

**Numerical Variables**

- Distribution Shape: Is it normally distributed, skewed, bimodal? 
- Central Tendency: Mean, median, mode
- Dispersion: Range, variance, standard deviation, interquartile range
- Outliers: Identify extreme values using boxplots, z-scores

In [None]:
# Central tendency
df['age'].mean()
df['age'].median()
df['age'].mode()

# Spread
df['age'].std()
df['age'].var()
df['age'].min()
df['age'].max()
df['age'].quantile([0.25, 0.5, 0.75])

# Quick summary
df['age'].describe()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Histogram
df['age'].hist(bins=30)
plt.title('Age Distribution')

# Box plot (shows outliers)
sns.boxplot(y=df['age'])

# Density plot
df['age'].plot(kind='density')

**Categorical Variables**

- Frequency Distribution: Count of each category
- Cardinality: Number of unique categories 
- Proportions: Relative frequencies of each category
- Find rare categories that may need to be grouped together


In [None]:
# Frequency counts
df['category'].value_counts()

# Proportions
df['category'].value_counts(normalize=True)

# Number of unique values
df['category'].nunique()

In [None]:
# Bar chart
df['category'].value_counts().plot(kind='bar')

# Pie chart (if few categories)
df['category'].value_counts().plot(kind='pie')

### **Part 3.3: Bivariate Analysis**
Analysing the relationship between two variables, generally feature vs target or feature vs feature.

**Feature vs Target**: Identify which features are potentially predictive of the target variable 

- Numerical Feature vs Numerical Target: Scatter plots, correlation coefficients (Pearson, Spearman)

- Numerical Feature vs Categorical Target: Box plots, violin plots, histograms per category 

- Categorical Feature vs Categorical Target: Cross tabulations

In [None]:
# Relationship between two numerical variables
plt.scatter(df['age'], df['income'])
plt.xlabel('Age')
plt.ylabel('Income')

# Pearson correlation (-1 to 1)
df['age'].corr(df['income'])

# Correlation matrix for multiple variables
df[['age', 'income', 'credit_score']].corr()

In [None]:
# Mean of numerical by category
df.groupby('churned')['age'].mean()

# Multiple statistics
df.groupby('churned')['age'].agg(['mean', 'median', 'std'])

# Box plots by group
sns.boxplot(x='churned', y='age', data=df)

# Violin plots (distribution shape)
sns.violinplot(x='churned', y='age', data=df)

# Histogram by group
df[df['churned']==1]['age'].hist(alpha=0.5, label='Churned')
df[df['churned']==0]['age'].hist(alpha=0.5, label='Not Churned')
plt.legend()

In [None]:
# Frequency table
pd.crosstab(df['gender'], df['churned'])

# With percentages
pd.crosstab(df['gender'], df['churned'], normalize='index')

**Feature vs Feature**: The important thing here is to check for multicollinearity between features, which destabilises coefficient estimates in some models (such as linear regression). 

**High correlation (|r| > 0.8)** between two features means they are largely redundant = consider dropping one 

In [None]:
# Correlation heatmap (numeric columns only; pandas >= 2.0 raises on non-numeric columns otherwise)
corr_matrix = df.corr(numeric_only=True)
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')

### **Part 3.4: Multivariate Analysis**
Analyzing interactions between three or more variables simultaneously to uncover complex relationships and patterns
- Use pair plots to visualize relationships between multiple numerical variables
- Use 3D scatter plots for three numerical variables
- Use heatmaps to visualize correlations among multiple variables
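
As a tabular complement to the plots listed above, a pivot table can summarise a target across two features at once. This is a swapped-in technique rather than one of the plots named above, and the column names and values are hypothetical:

```python
import pandas as pd

# Hypothetical customer data: two categorical features plus the churn target
df = pd.DataFrame({
    "plan": ["basic", "basic", "basic", "basic",
             "premium", "premium", "premium", "premium"],
    "contract": ["monthly", "monthly", "annual", "annual",
                 "monthly", "monthly", "annual", "annual"],
    "churned": [1, 1, 0, 1, 1, 0, 0, 0],
})

# Mean churn rate for every plan/contract combination (three variables at once)
churn_rates = df.pivot_table(
    index="plan", columns="contract", values="churned", aggfunc="mean"
)
print(churn_rates)
```

The resulting table can be passed to `sns.heatmap` to get the heatmap view mentioned in the list.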


### **Part 3.5: Distribution Analysis**
Analyzing the distribution of key variables to understand their characteristics and inform modeling decisions

1. **Check for normality**
     - Some methods assume normality, but usually of the errors rather than the features (e.g., linear regression inference assumes normally distributed residuals)
     - Use QQ plots and statistical tests (Shapiro-Wilk, Kolmogorov-Smirnov)
     - Mean ≈ Median ≈ Mode indicates a symmetric distribution, which is consistent with (but does not prove) a Gaussian bell-shaped curve

2. **Identify skewness and kurtosis**
     - Skewed distributions may require transformations (log, square root)
     - High kurtosis indicates heavy tails, which usually means more extreme values. A normal distribution has kurtosis 3; note that pandas `.kurt()` reports excess kurtosis, where normal = 0


In [None]:
# Log transformation (for right-skewed)
df['log_income'] = np.log1p(df['income'])

# Square root
df['sqrt_income'] = np.sqrt(df['income'])

# Box-Cox transformation
from scipy.stats import boxcox
df['boxcox_income'], _ = boxcox(df['income'] + 1)
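
The normality, skewness and kurtosis checks from items 1 and 2 can be sketched as follows. The data is a simulated right-skewed sample standing in for an income column, so the numbers are purely illustrative:

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical right-skewed sample standing in for an income column
income = pd.Series(rng.lognormal(mean=10, sigma=0.8, size=500))
log_income = np.log1p(income)

for name, values in [("income", income), ("log_income", log_income)]:
    # Shapiro-Wilk: the null hypothesis is that the sample is normally distributed
    stat, p_value = stats.shapiro(values)
    print(
        f"{name}: skew={values.skew():.2f}, "
        f"excess kurtosis={values.kurt():.2f}, "
        f"Shapiro p={p_value:.4f}"
    )
```

A small p-value rejects normality. With large samples the test flags even minor deviations, so it is worth reading it alongside a QQ plot.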

3. **Detect Outliers**

     - Use boxplots and IQR method to identify extreme values
     - Use z-scores to find points beyond 3 standard deviations from the mean
     - Decide whether to cap, remove, or transform outliers

In [None]:
Q1 = df['age'].quantile(0.25)
Q3 = df['age'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = df[(df['age'] < lower_bound) | (df['age'] > upper_bound)]
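
The z-score rule from the list above has no cell of its own, so here is a minimal sketch. The ages are hypothetical: thirty plausible values plus one data-entry error.

```python
import pandas as pd

# Hypothetical ages: 30 plausible values plus one data-entry error
ages = pd.Series(list(range(20, 50)) + [150], name="age")

# z-score: how many standard deviations each value lies from the mean
z_scores = (ages - ages.mean()) / ages.std()

# Flag points more than 3 standard deviations from the mean
z_outliers = ages[z_scores.abs() > 3]
print(z_outliers)
```

Because the mean and standard deviation are themselves pulled by extreme values, the z-score rule can miss outliers in small samples; the IQR rule is less affected.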

### **Part 4.2: Dealing with Outliers**

There are three broad types of outliers: 

A. **Data errors:** Entry mistakes, measurement errors
Example: Age = 150 years
Action: Correct or remove

B. **True extreme values:** Valid but rare
Example: Luxury home price $10M in dataset of $200K homes
Action: Keep or handle carefully

C. **Different population:** From different distribution
Example: Corporate accounts mixed with individual customers
Action: Segment or model separately   

There are four strategies to deal with outliers:

**A. Remove Outliers:** Delete the rows completely

*When to use:*
- Confirmed data errors
- Less than 1% of data 
- Linear models that are sensitive to extremes


In [None]:
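# Remove using IQR bounds computed on 'price'
q1, q3 = df['price'].quantile([0.25, 0.75])
lower, upper = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
df_clean = df[(df['price'] >= lower) & (df['price'] <= upper)]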

**B. Cap Outliers:** Set extreme values to a maximum/minimum threshold

*When to use:*
- Outliers are valid but too extreme
- Want to keep all data points
- Linear models that need bounded ranges

In [None]:
# Cap at percentiles
lower_cap = df['price'].quantile(0.01)
upper_cap = df['price'].quantile(0.99)
df['price_capped'] = df['price'].clip(lower_cap, upper_cap)

**C. Transform Outliers**: Apply mathematical function to compress range. Reduces impact of extremes without removing information 

- Log: Strong right skew (most common for money, prices)
- Square root: Moderate right skew (counts, areas)
- Box-Cox: Automatically finds best transformation

*When to use:*

- Right-skewed distributions (income, prices, counts)
- Outliers are valid and informative
- Want to preserve ranking but reduce scale

**Note**: Changes interpretation. Now predicting log-price instead of price.

In [None]:
# Log transformation (reduces right skew)
df['log_price'] = np.log1p(df['price'])

# Square root
df['sqrt_price'] = np.sqrt(df['price'])

**D. Separate Treatment**: Treat outliers as a different group. You need enough outlier data to model. Example: Build one model for normal customers and one for luxury customers. 

*When to use:*

- Outliers represent fundamentally different behavior
- Example: VIP customers, luxury segment, fraud cases
- Different patterns require different models

In [None]:
# Flag outliers, model separately
# (lower/upper are the IQR bounds for 'price': Q1 - 1.5*IQR and Q3 + 1.5*IQR)
df['is_outlier'] = (df['price'] > upper) | (df['price'] < lower)

# Or split the data so each group gets its own model
normal_data = df[df['price'] <= upper]
outlier_data = df[df['price'] > upper]

**Quick Rules:** 
- Linear models (linear regression, logistic regression) are sensitive to outliers; tree models (RF, XGBoost) are largely robust to outliers in the features. 
- For linear models, transform > cap > remove.
- If unsure whether removing outliers helps, try training the model with and without outliers and compare test performance
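
The last rule, comparing test performance with and without outliers, can be sketched with a straight-line fit. Everything here is simulated: the data-generating process, the five corrupted points and the helper name `holdout_mse` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y = 2x + 1 plus noise
x_train = rng.uniform(0, 10, 200)
y_train = 2 * x_train + 1 + rng.normal(0, 1, 200)
x_test = rng.uniform(0, 10, 100)
y_test = 2 * x_test + 1 + rng.normal(0, 1, 100)

# Corrupt the five training points with the largest x (simulated data errors)
corrupted = np.argsort(x_train)[-5:]
y_train[corrupted] += 100


def holdout_mse(x, y):
    """Fit a straight line on (x, y) and return its MSE on the clean test set."""
    slope, intercept = np.polyfit(x, y, deg=1)
    predictions = slope * x_test + intercept
    return float(np.mean((y_test - predictions) ** 2))


# IQR rule on the training target decides which rows to keep
q1, q3 = np.percentile(y_train, [25, 75])
iqr = q3 - q1
keep = (y_train >= q1 - 1.5 * iqr) & (y_train <= q3 + 1.5 * iqr)

mse_with = holdout_mse(x_train, y_train)
mse_without = holdout_mse(x_train[keep], y_train[keep])
print(f"Test MSE with outliers:    {mse_with:.2f}")
print(f"Test MSE without outliers: {mse_without:.2f}")
```

The test set is left untouched in both runs, so the comparison reflects performance on data the model will actually see.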

### **Part 4.3: Data Type Correction**

Correct data types are essential for proper processing. The first thing to do is identify the type issues: 

- Numbers stored as strings? 
- Dates stored as strings? 
- Categories stored as objects (unnecessary memory)?
- Booleans stored as strings ("true"/"false")?

In [None]:
# Check current types
df.dtypes

# Check dtypes, non-null counts and memory usage (spot type issues)
df.info()

**A. Correcting numerical data** 
- When numbers are stored as text
- For example: "1234" instead of 1234
- Could also be special formats like "$1,234"

In [None]:
# Convert string to number
df['price'] = pd.to_numeric(df['price'], errors='coerce')
# errors ='coerce' → invalid values become NaN

In [None]:
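# Example of special formats ("$1,234"): Remove $ and commas, then convert
# regex=False treats '$' as a literal character, not the regex end-of-string anchor
df['price'] = df['price'].str.replace('$', '', regex=False).str.replace(',', '', regex=False)
df['price'] = pd.to_numeric(df['price'], errors='coerce')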

**B. Correct Date/Time Data**
- When dates are stored as strings
- Use the pd.to_datetime() function 
- Can also extract useful components 

In [None]:
# String to datetime
df['date'] = pd.to_datetime(df['date'], errors='coerce')

# Specify format for speed
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')

# Handle multiple formats (pandas >= 2.0; infer_datetime_format is deprecated)
df['date'] = pd.to_datetime(df['date'], format='mixed', errors='coerce')

**C. Handle Mixed Types**
- When a column holds more than one type
- Fixing this matters for:
    - Correct operations (can't calculate the mean of strings)
    - Feature engineering (can't extract the month from string dates)
    - Model compatibility (algorithms need numeric input)

In [None]:
# Check for mixed types
df['mixed_col'].apply(type).value_counts()

# Force conversion (lose invalid data)
df['clean_col'] = pd.to_numeric(df['mixed_col'], errors='coerce')

# Or separate valid/invalid
valid = pd.to_numeric(df['mixed_col'], errors='coerce')
invalid = df[valid.isnull() & df['mixed_col'].notnull()]

**DECISION FRAMEWORK:**
Check dtypes
- String numbers → pd.to_numeric()
- String dates → pd.to_datetime()
- String categories (few unique) → astype('category')
- String booleans → map() with an explicit dictionary (astype(bool) turns every non-empty string, including "false", into True)
- Mixed types → Convert + handle invalids
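
The decision framework can be applied end to end as in the sketch below. The `raw` frame, its column names and its values are hypothetical, chosen so that each rule has something to fix:

```python
import pandas as pd

# Hypothetical raw data where every column was read in as text
raw = pd.DataFrame({
    "price": ["$1,234", "$56", "n/a"],
    "signup_date": ["2024-01-15", "2024-02-30", "2024-03-01"],  # Feb 30 is invalid
    "plan": ["basic", "premium", "basic"],
    "is_active": ["true", "False", "TRUE"],
})

clean = pd.DataFrame({
    # Strip currency formatting, then coerce anything unparseable to NaN
    "price": pd.to_numeric(
        raw["price"].str.replace(r"[$,]", "", regex=True), errors="coerce"
    ),
    # Invalid dates become NaT instead of raising
    "signup_date": pd.to_datetime(
        raw["signup_date"], format="%Y-%m-%d", errors="coerce"
    ),
    # Low-cardinality strings take less memory as 'category'
    "plan": raw["plan"].astype("category"),
    # Map text explicitly: astype(bool) would turn the string "False" into True
    "is_active": raw["is_active"].str.lower().map({"true": True, "false": False}),
})

print(clean.dtypes)
print(clean)
```

Counting the NaN/NaT values created by `errors="coerce"` afterwards shows how many entries were invalid and need a decision.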

### **Part 4.4. Duplicates Removal**

### **Part 4.5. Data Transformation**

### **5.1. Feature Creation**
Creating meaningful features based on domain knowledge. Think about what would be useful in a business context.

**Step by Step Framework**

1. Domain: What makes business sense?
2. Time: Any datetime to decompose?
3. Aggregations: Can I summarize at customer/store level?
4. Ratios: Are relative measures better?
5. Flags: Any useful binary conditions?





1. **Domain Based Features**

- Use business logic to create meaningful features
- Always think about use in a business context


In [None]:
# E-commerce example
today = pd.Timestamp.today().normalize()  # reference date for recency features
df['avg_order_value'] = df['total_spent'] / df['num_orders']
df['days_since_last_purchase'] = (today - df['last_purchase_date']).dt.days

# Finance example
df['debt_to_income_ratio'] = df['total_debt'] / df['annual_income']

2. **Time Based features**: Extract components from datetime


In [None]:
# Basic components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.dayofweek  # 0=Monday
df['is_weekend'] = (df['day_of_week'] >= 5).astype(int)

# Time differences
df['days_since_signup'] = (df['current_date'] - df['signup_date']).dt.days

3. **Aggregation Features**
- Summarize at different levels 
- Common aggregations: sum, mean, count, min, max

In [None]:
# Customer-level aggregations from transactions
customer_features = transactions.groupby('customer_id').agg({
    'amount': ['sum', 'mean', 'count'],
    'date': ['min', 'max']
}).reset_index()

customer_features.columns = ['customer_id', 'total_spent', 'avg_spent', 
                              'num_transactions', 'first_purchase', 'last_purchase']

4. **Ratio Features**: Relative measures could be more useful than absolute ones 

In [None]:
# Ratios
df['click_through_rate'] = df['clicks'] / df['impressions']
df['price_per_sqft'] = df['price'] / df['square_feet']

5. **Interaction Features**: 
- Combine features that work together.
- Only use when domain knowledge suggests the interaction matters 

In [None]:
# Simple combinations: product and sum
df['bedrooms_x_bathrooms'] = df['bedrooms'] * df['bathrooms']
df['total_rooms'] = df['bedrooms'] + df['bathrooms']

6. **Boolean/Flag Features**: Binary indicators, usually based on thresholds


In [None]:
# Threshold-based
df['is_high_value'] = (df['order_value'] > 100).astype(int)
df['is_senior'] = (df['age'] >= 65).astype(int)

# Missing indicators (can be informative)
df['income_missing'] = df['income'].isnull().astype(int)

**Summary:**

- Start with domain knowledge: For churn prediction, I'd create features like:
    - days_since_last_login (time-based)
    - avg_session_duration (aggregation)
    - support_ticket_rate (ratio)
    - is_premium_user (flag)

- Extract from datetime: Day of week, month, is_weekend, days_since_signup
- Create aggregations: Total spend, average order value, number of transactions per customer
- Build ratios: Revenue per employee, clicks per impression

- **Key principle: Start simple with domain features, add complexity only if improves performance**

### **5.2. Feature Transformation**

Transforming skewed features helps improve model performance 

**1. Identify Need**
- Check skewness
- Transform when skewness > 1 (right skewed)
- Especially relevant for linear models
- Income, prices, population all tend to be right skewed

In [None]:
# Check skewness (pandas computes it directly)
skewness = df['income'].skew()
print(f"Skewness: {skewness:.2f}")  # |skew| > 1 suggests transformation

# Visual check
df['income'].hist(bins=50)

**2. Apply Log Transformation**
- Compresses large values more than small values 
- Use when there is strong right skew (skewness > 1)
- Linear models benefit most; tree models often don't need it

In [None]:
# For right-skewed data (income, prices, counts)
df['log_income'] = np.log1p(df['income'])  # log1p handles zeros

**3. Validate:** Goal is to get skewness closer to 0

In [None]:
# Check improvement
print(f"Original skew: {df['income'].skew():.2f}")
print(f"Transformed skew: {df['log_income'].skew():.2f}")

# Visualize before/after
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
df['income'].hist(bins=50, ax=axes[0])
axes[0].set_title('Original')  # hist() has no title argument, so set it on the axes
df['log_income'].hist(bins=50, ax=axes[1])
axes[1].set_title('Log Transformed')

**4. Fit ONLY on training data:** Fit the transformer on training data only, then apply it to test data with the same parameters

In [None]:
from sklearn.preprocessing import PowerTransformer

transformer = PowerTransformer()  # Yeo-Johnson (handles negatives)

# Fit on train, apply to test
train['transformed'] = transformer.fit_transform(train[['income']])
test['transformed'] = transformer.transform(test[['income']])  # Use train params

### **5.3. Encoding Categorical Variables**

ML algorithms need numerical input, so categorical variables must be encoded.

1. **Understand your variable type**

- Nominal = No order (e.g. city, color, etc)
- Ordinal = Has some order (e.g. ratings)

2. **One Hot Encoding**
- For linear models, required for nominal variables
- Creates a binary column for each category 
- Use when there's low cardinality (<20 categories)


In [None]:
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(drop='first', handle_unknown='ignore')
encoded = ohe.fit_transform(df[['city']])

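3. **Label Encoding**

- Assigns an integer to each category 
- Tree-based models: Works for nominal and ordinal variables
- Any model: Works for ordinal variables, provided the integers follow the real order
- Note: scikit-learn's `LabelEncoder` is designed for target labels and numbers categories alphabetically; for ordinal features use an explicit mapping or `OrdinalEncoder`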

In [None]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['city_encoded'] = le.fit_transform(df['city'])

# Classes are sorted alphabetically: Chicago → 0, LA → 1, NYC → 2

# Education has meaningful order
education_map = {'High School': 0, 'Bachelor': 1, 'Master': 2, 'PhD': 3}
df['education_encoded'] = df['education'].map(education_map)

4. **Handling High Cardinality**

- Problem: 500 cities → 500 one-hot columns (too many!)
- Solution: Group rare categories as "other"

In [None]:
# Keep top 10 categories, rest → "Other"
top_categories = df['city'].value_counts().head(10).index

df['city_grouped'] = df['city'].apply(
    lambda x: x if x in top_categories else 'Other'
)

# Now one-hot encode (only 10-11 columns)
df_encoded = pd.get_dummies(df, columns=['city_grouped'], drop_first=True)

5. **Handling unknown categories:** Test set might have categories we have not seen in training

In [None]:
# One-Hot: use handle_unknown='ignore'
ohe = OneHotEncoder(handle_unknown='ignore')  # Creates all-zero row

# Label Encoding: map with default
city_map = {'NYC': 0, 'LA': 1, 'Chicago': 2}
test['city_encoded'] = test['city'].map(city_map).fillna(-1)  # -1 for unknown

**Summary**

**Linear models:**
- Nominal variables → One-Hot encoding (required)
- `pd.get_dummies(df, columns=['city'], drop_first=True)`

**Tree models:**
- Can use Label Encoding (simpler, efficient)
- `LabelEncoder().fit_transform(df['city'])`

**Ordinal variables (any model):** Label encode with proper order

**High cardinality:**
- Group rare categories into 'Other' (keep top 10-20)
- Then one-hot encode

**Key:** Always fit on training data, apply to test with same mapping
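
The key rule above (fit on training data, reuse the same mapping on test data) can be sketched with scikit-learn encoders. The frames, city names and education levels are hypothetical; `handle_unknown="use_encoded_value"` assumes scikit-learn 0.24 or later.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical frames; 'Boston' and 'Diploma' never appear in training
train = pd.DataFrame({
    "city": ["NYC", "LA", "Chicago", "NYC"],
    "education": ["Bachelor", "PhD", "High School", "Master"],
})
test = pd.DataFrame({
    "city": ["LA", "Boston"],
    "education": ["PhD", "Diploma"],
})

# Nominal feature: one-hot, unknown categories become an all-zero row
ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(train[["city"]])  # learn categories from train only
city_test = ohe.transform(test[["city"]]).toarray()

# Ordinal feature: state the order explicitly, send unknowns to -1
education_order = [["High School", "Bachelor", "Master", "PhD"]]
ordinal = OrdinalEncoder(
    categories=education_order,
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)
ordinal.fit(train[["education"]])
education_test = ordinal.transform(test[["education"]])

print(ohe.categories_[0])      # categories learned from the training data
print(city_test)               # second row is all zeros (unknown city)
print(education_test.ravel())  # unknown level is encoded as -1
```

Both encoders can be placed inside a `Pipeline` or `ColumnTransformer` so the fit-on-train rule is enforced automatically during cross-validation.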

### **5.4. Feature Scaling**

It is important to scale features so that they are on similar ranges. 

*Example*: 
- age: 20-80 (range of 60)
- income: 20,000-200,000 (range of 180,000)
- Without scaling, income dominates distance calculations

**1. Check algorithm**
- Distance-based and regularized or gradient-based algorithms need scaling (KNN, SVM, regularized linear and logistic regression)
- Tree based models do not need scaling (RF, XGBoost, Decision Trees)

**2. StandardScaler()**
- What it does: Rescales each feature to mean = 0 and std = 1
- Most common scaler
- **Formula:** `(x - mean) / std`

- **Result:** Features centered around 0, most values between -3 and 3
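
The formula can be worked through by hand to see what the scaler stores. A minimal sketch with hypothetical income values; the point is that the test values reuse the training mean and standard deviation:

```python
import numpy as np

# Hypothetical incomes; the test values must reuse the training statistics
income_train = np.array([20_000, 40_000, 60_000, 80_000, 200_000], dtype=float)
income_test = np.array([30_000, 120_000], dtype=float)

mean = income_train.mean()
std = income_train.std()  # population std (ddof=0), which StandardScaler also uses

train_scaled = (income_train - mean) / std
test_scaled = (income_test - mean) / std  # same mean/std as training

print(train_scaled.round(2))
print(test_scaled.round(2))
```

The scaled test values generally do not have mean 0 and std 1 themselves, which is expected: they are expressed in units learned from the training set.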



In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Fit on train, transform both
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Use train mean/std

**3. MinMaxScaler()**
- Scales to range [0, 1]
- Use when you need a bounded range (e.g. for neural networks)
- Useful when you don't want negative values
- Limitation: Sensitive to outliers

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

**4. RobustScaler()**

- Useful to deal with outliers
- Uses median and IQR
- **Formula:** `(x - median) / IQR`

In [None]:
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
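
To see why the median/IQR formula helps with outliers, the sketch below applies it by hand next to standard scaling. The feature values are hypothetical, with one deliberately extreme point:

```python
import numpy as np

# Hypothetical feature with one extreme value
x = np.array([10, 12, 14, 16, 18, 20, 500], dtype=float)

# Robust scaling: centre on the median, divide by the interquartile range
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
robust_scaled = (x - median) / (q3 - q1)

# Standard scaling for comparison: mean and std are both pulled by the outlier
standard_scaled = (x - x.mean()) / x.std()

print(robust_scaled.round(2))
print(standard_scaled.round(2))
```

With standard scaling the six typical values are squeezed into a narrow band because the outlier inflates the standard deviation; robust scaling keeps them spread out.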

**Scale:**
- Numerical features with different units/ranges

**Don't scale:**
- Binary features (0/1) - already same scale
- One-hot encoded features - already 0/1
- Tree-based models - not needed

**Check if needed:**
- **Tree models:** Don't need scaling
- **Linear models with regularization:** Need scaling (regularization penalizes by coefficient size)
- **Distance-based (KNN, SVM):** Need scaling (distances dominated by large-scale features)

**Choose scaler:**
- **StandardScaler (default):** Centers at mean=0, std=1
- **RobustScaler:** If outliers present

In [None]:
# IMPLEMENT SCALER USING PIPELINE

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

pipeline.fit(X_train, y_train)  # Fits scaler on train only

### **5.5. Feature Selection**

Select relevant features to improve performance and reduce overfitting

**Benefits:**
- Reduces overfitting
- Faster training
- Simpler, more interpretable models
- Removes irrelevant/redundant features

**When needed:**
- Many features (100+)
- Some features irrelevant
- Model overfitting

1. **Remove low variance features**
- Remove features with low variation
- Features that are almost constant add no information


In [None]:
from sklearn.feature_selection import VarianceThreshold

# Remove features whose variance is below 0.01 (an absolute threshold, so it depends on each feature's scale)
selector = VarianceThreshold(threshold=0.01)
X_selected = selector.fit_transform(X_train)

2. **Remove highly correlated features:** Highly correlated features are redundant (multicollinearity)

In [None]:
# Calculate correlation matrix
corr_matrix = X_train.corr().abs()

# Find pairs with correlation > 0.9
upper_tri = corr_matrix.where(
    np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
)

# Drop one from each highly correlated pair
to_drop = [col for col in upper_tri.columns if any(upper_tri[col] > 0.9)]
X_selected = X_train.drop(columns=to_drop)

3. **Select based on Importance:** Use feature importance from tree models

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Train model
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Get importance
importance = pd.DataFrame({
    'feature': X_train.columns,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)

# Keep top N features
top_features = importance.head(20)['feature'].tolist()
X_selected = X_train[top_features]

4. **L1 Regularization (LASSO)**: Automatically zeros out unimportant features

In [None]:
from sklearn.linear_model import LassoCV

# Lasso with cross-validation
lasso = LassoCV(cv=5, random_state=42)
lasso.fit(X_train, y_train)

# Keep non-zero coefficients
selected_features = X_train.columns[lasso.coef_ != 0]
X_selected = X_train[selected_features]

print(f"Selected {len(selected_features)} out of {len(X_train.columns)} features")

5. **Dimensionality Reduction (PCA)**: Combine correlated features into fewer components 

   **When to use:**
- Many correlated features
- Want to reduce dimensions

- **Trade-off:** Lose interpretability, because components are combinations of features (can't say "age matters" - only "PC1 matters")

In [None]:
from sklearn.decomposition import PCA

# Keep as many components as needed to explain 95% of the variance
# (pass an integer, e.g. n_components=10, for a fixed number; scale features first)
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

print(f"Reduced from {X_train.shape[1]} to {X_pca.shape[1]} features")

6. **Forward/Backward Selection**
- Iteratively add/remove features
- Slow for many features, so use only for small feature sets

In [None]:
# Simplified forward selection concept
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

sfs = SequentialFeatureSelector(
    LogisticRegression(),
    n_features_to_select=10,  # Select best 10
    direction='forward',
    cv=5
)

sfs.fit(X_train, y_train)
selected_features = X_train.columns[sfs.get_support()]

**Summary**

**Step 1 - Remove obvious issues:**
- Low-variance features (almost constant)
- Highly correlated features (keep one from each pair with r > 0.9)

**Step 2 - Model-based selection:**
- **Tree models:** Use `feature_importances_`, keep top N features
- **Linear models:** Use Lasso, keeps non-zero coefficients

**Step 3 - Validate:**
- Train model with selected features
- Compare performance to using all features
- Ideally: similar performance with fewer features

**PCA:** Reduces dimensions by combining correlated features. Good for visualization or when many correlated features, but loses interpretability.

**Key principle:** Start with all features, remove if not helping performance

### **6. Data Splitting**

Important to split data in order to ensure valid model evaluation 

1. **Train/Validation/Test Split**

- Train (60%): Fit model parameters
- Validation (20%): Tune hyperparameters, compare models
- Test (20%): Final evaluation (touch ONCE at end)


In [None]:
from sklearn.model_selection import train_test_split

# First split: train+val vs test (80/20)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Second split: train vs val (75/25 of remaining = 60/20 overall)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42
)

# Final split: 60% train, 20% val, 20% test

2. **Stratified Split**
- Maintain class proportions in each split
- Used for classification

   When to Use: 
   - Imbalanced Classes
   - Small datasets
   



In [None]:
# For imbalanced data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    stratify=y,  # Maintains class proportions
    random_state=42
)

# If 80/20 class split in full data → 80/20 in train AND test

3. **Cross Validation**
- K-Fold CV: Split into K folds, train K times
- Split data into 5 parts (or k parts)
- Train on 4, validate on 1
- Repeat 5 times (each fold used for validation once)
- Average results

   When to Use: 
   - Small datasets (maximizes training data)
   - Robust performance estimate
   - Hyperparameter tuning

In [None]:
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation
scores = cross_val_score(
    model, X_train, y_train, 
    cv=5,  # 5 folds
    scoring='roc_auc'
)

print(f"CV scores: {scores}")
print(f"Mean AUC: {scores.mean():.3f} (+/- {scores.std():.3f})")

In [None]:
# Use stratified K fold for classification

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for train_idx, val_idx in skf.split(X, y):
    X_train_fold = X.iloc[train_idx]
    X_val_fold = X.iloc[val_idx]

4. **Time Series Split**: 
- Never shuffle, maintain order
- Always train on the past, predict future: no data leakage


In [None]:
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)

for train_idx, val_idx in tscv.split(X):
    X_train = X.iloc[train_idx]
    X_val = X.iloc[val_idx]
    # Train on past, validate on future

**Small Dataset (less than 1k samples):** Use cross validation, no separate test set

In [None]:
cv_scores = cross_val_score(model, X, y, cv=5)

**Medium dataset (1K-100K)**: Use 80/20 train/test + 5-fold CV on train

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

**Large dataset (>100K):** Simple train/val/test split (60/20/20). CV too slow

**Summary**

Standard approach:
- 60% train, 20% validation, 20% test
- Or 80/20 train/test + 5-fold CV on train

Classification: Use `stratify=y` to maintain class proportions

Time-series:
- Never shuffle! Use TimeSeriesSplit
- Train on past, validate on future

Prevent leakage:
- Fit all preprocessing (scaling, encoding) on training data only
- Use sklearn Pipeline to automate this
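
A sketch of the leakage point above, assuming a synthetic dataset and the step names `scaler` and `clf` (both chosen only for illustration). Because the scaler lives inside the pipeline, it is re-fitted on the training part of every CV fold and never sees validation or test rows.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative only)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

pipe = Pipeline([
    ("scaler", StandardScaler()),                  # fitted on training folds only
    ("clf", LogisticRegression(max_iter=1000)),
])

# Each fold: fit scaler + model on 4 parts, score on the held-out part
cv_scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="roc_auc")
print(f"CV AUC: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")

# Final fit on all training data, single evaluation on the test set
pipe.fit(X_train, y_train)
test_accuracy = pipe.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```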

### **7.1. Baseline Model**

This is the simplest possible model. It sets a minimum performance bar that shows whether complex models actually add value.
If your model is not able to beat the baseline, then something is wrong...

**For classification**: Predict the most frequent class. Example: If 80% of customers don't churn, predicting "no churn" for everyone gives 80% accuracy baseline

In [None]:
from sklearn.dummy import DummyClassifier

# Strategy 1: Predict most frequent class
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train, y_train)
baseline_score = baseline.score(X_test, y_test)

print(f"Baseline accuracy: {baseline_score}")

**For Regression**: The baseline can either be the mean or the median. Baseline tells you: 
- R² = 0 for mean prediction → any model should have R² > 0
- RMSE of mean prediction → target to beat 

In [None]:
from sklearn.dummy import DummyRegressor

# Strategy 1: Predict mean
baseline = DummyRegressor(strategy='mean')
baseline.fit(X_train, y_train)
baseline_score = baseline.score(X_test, y_test)  # R²

# Strategy 2: Predict median (robust to outliers)
baseline = DummyRegressor(strategy='median')

### **Part 7.2.1. Linear Regression**

Finds the best-fit straight line through data points. Starting point for regression problems because it is simple, interpretable and fast.

The model predicts: **y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε**

- **y:** Target (what we predict)
- **β₀:** Intercept (baseline value)
- **β₁, β₂, ...:** Coefficients (impact of each feature)
- **x₁, x₂, ...:** Features
- **ε:** Error term

**Example:** House price = 50,000 + 100×(square_feet) + 5,000×(bedrooms)

**How it learns:** Minimize the sum of squared errors (least squares)

Loss = Σ(actual - predicted)²

**Objective**: Draw the line that minimises this loss, i.e. the sum of squared vertical distances from each point to the line
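
A small sketch of the least-squares idea, assuming synthetic data with made-up true coefficients (3, 2, -1). It solves the least-squares problem directly with NumPy and checks that `LinearRegression` arrives at the same coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (illustrative only): y = 3 + 2*x1 - 1*x2 + noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

# Add a column of ones so the first coefficient is the intercept
X_design = np.column_stack([np.ones(len(X)), X])

# Coefficients that minimise the sum of squared errors
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
loss = np.sum((y - X_design @ beta) ** 2)

# Any other coefficients give a larger loss
loss_shifted = np.sum((y - X_design @ (beta + 0.1)) ** 2)

model = LinearRegression().fit(X, y)
print(f"NumPy:   intercept={beta[0]:.3f}, coefs={beta[1:]}")
print(f"sklearn: intercept={model.intercept_:.3f}, coefs={model.coef_}")
```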

**Good For:** 
- Linear relationships between features and target
- Need interpretability (can explain: "each bedroom adds $5K")
- Fast training needed
- Baseline model
- Small datasets

**Struggles with:**
- Non-linear relationships
- Multicollinearity (highly correlated features)
- Many outliers (sensitive to them)

**The 5 Key Assumptions**

A. Linearity: Relationship between X and Y is linear

In [None]:
# Check: scatter plots
plt.scatter(X['feature'], y)

B. Independence: Observations are independent (not time-series with autocorrelation)
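
One common check for independence is the Durbin-Watson statistic on the residuals: values near 2 suggest no first-order autocorrelation, values well below 2 suggest positive autocorrelation. The sketch below computes it by hand on two simulated residual series (the series and the helper name `durbin_watson` are assumptions for illustration).

```python
import numpy as np

def durbin_watson(residuals):
    """Sum of squared successive differences divided by sum of squared residuals."""
    residuals = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

rng = np.random.default_rng(0)

# Independent residuals
independent = rng.normal(size=500)

# Autocorrelated residuals: each value depends on the previous one (AR(1), rho = 0.9)
noise = rng.normal(size=500)
autocorrelated = np.zeros(500)
for t in range(1, 500):
    autocorrelated[t] = 0.9 * autocorrelated[t - 1] + noise[t]

dw_independent = durbin_watson(independent)
dw_autocorrelated = durbin_watson(autocorrelated)

print(f"Independent residuals:    DW = {dw_independent:.2f}")
print(f"Autocorrelated residuals: DW = {dw_autocorrelated:.2f}")
```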

C. Homoscedasticity: Constant variance of errors, scatter of errors looks like random cloud

In [None]:
# Check: residual plot (should be random cloud)
residuals = y_true - y_pred
plt.scatter(y_pred, residuals)

D. Normality of errors: Residuals are normally distributed

In [None]:
# Check: Q-Q plot
from scipy import stats
stats.probplot(residuals, dist="norm", plot=plt)

E. No multicollinearity: Features not highly correlated

In [None]:
# Check: correlation matrix
corr_matrix = X.corr()
sns.heatmap(corr_matrix, annot=True)

Linear regression has no hyperparameters to tune. The only option you normally set is `fit_intercept`

- fit_intercept=True: Estimate β₀ (usual case)
- fit_intercept=False: Force line through origin (rare)

In [None]:
model = LinearRegression(fit_intercept=True)


**Limitations**

**A. Sensitive to outliers:**
- One extreme point can skew entire line
- Solution: Remove outliers or use robust methods

**B. Assumes linearity:**
- Can't capture curves/interactions naturally
- Solution: Feature engineering (polynomials, interactions)

**C. Multicollinearity problems:**
- Unstable coefficients when features correlated
- Solution: Ridge/Lasso regression

**D. No automatic feature selection:**
- Uses all features (even irrelevant ones)
- Solution: Lasso regression or manual selection

**E. Extrapolation risk:**
- Predictions outside training range unreliable
- Example: Trained on $100K-$500K houses, predicting $2M house is risky

**Summing Up:** 

- Pros: Fast, interpretable (can explain each coefficient's impact), works well for linear relationships, no hyperparameters to tune.

- Checks: I'd verify assumptions - linearity via scatter plots, multicollinearity via correlation matrix, homoscedasticity via residual plots.

- If assumptions violated: Consider polynomial features for non-linearity, Ridge/Lasso for multicollinearity, or switch to tree-based models for complex non-linear patterns.

- Expected performance: If R² > 0.7 and assumptions hold, Linear Regression is often sufficient. If R² < 0.5, likely need more complex models or better features

**Comparison Table**


| Aspect | Linear Regression | Decision Tree | Random Forest |
|--------|------------------|---------------|---------------|
| Interpretability | ✅ Very high | ✅ High | ❌ Low |
| Speed | ✅ Very fast | ✅ Fast | ⚠️ Slower |
| Handles non-linearity | ❌ No | ✅ Yes | ✅ Yes |
| Handles outliers | ❌ No | ✅ Yes | ✅ Yes |
| Feature scaling needed | ⚠️ Only for regularized variants (Ridge/Lasso) | ❌ No | ❌ No |

In [None]:
# FULL IMPLEMENTATION OF LINEAR REGRESSION 

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train
model = LinearRegression()
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
rmse = mean_squared_error(y_test, y_pred) ** 0.5  # RMSE (the squared=False argument was removed in recent scikit-learn)
r2 = r2_score(y_test, y_pred)

# Get coefficients
coefficients = pd.DataFrame({
    'feature': X.columns,
    'coefficient': model.coef_
}).sort_values('coefficient', ascending=False)

print(coefficients)
print(f"Intercept: {model.intercept_}")

### **7.2.2. Logistic Regression**

Predict **probability** that something belongs to a class (0 or 1). Fast, interpretable and gives probabilities for binary classification

The **sigmoid function** squashes any number into the range (0, 1). It is an S-shaped curve.

**p(y=1) = 1 / (1 + e^-(β₀ + β₁x₁ + β₂x₂ + ...))**

- Large negative → probability near 0
- Large positive → probability near 1
- Zero → probability = 0.5
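
The three cases above can be checked numerically. This is a minimal sketch; the helper name `sigmoid` and the input values are chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# z stands for the linear part: β₀ + β₁x₁ + β₂x₂ + ...
z = np.array([-10.0, 0.0, 10.0])
p = sigmoid(z)

for z_i, p_i in zip(z, p):
    print(f"z = {z_i:6.1f} → p = {p_i:.5f}")

# Taking the log-odds of p recovers z
log_odds = np.log(p / (1 - p))
```

The last line shows why the coefficients are read on the log-odds scale: the linear part of the model is exactly log(p / (1 - p)).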


**How it learns:** It maximizes the likelihood, i.e. it finds the coefficients that make the observed outcomes most probable.

**Good for:**

- Binary classification (yes/no, 0/1)
- Need probability estimates (not just class labels)
- Need interpretability (coefficients show feature impact)
- Linear decision boundary acceptable
- Baseline classification model
- Class imbalance moderate (<90/10 split)

**Not good for:**

- Multi-class (use multinomial logistic or other methods)
- Highly non-linear decision boundaries
- Classes that are not linearly separable in the feature space

**Assumptions**

A. Linear relationship between features and log-odds:

- Log-odds = β₀ + β₁x₁ + ...
- Check with scatter plots of features vs. log-odds

B. Independence of observations: Each data point independent

C. No multicollinearity: Features shouldn't be highly correlated

D. Large sample size: Need enough data (rule of thumb: 10-15 events per feature)

**Hyperparameters**

A. **C: Inverse of Regularization Strength**

- Controls how much the model trusts the training data. Smaller C means stronger regularization.
- Default: C = 1.
- High C (e.g., 100): "Trust the training data more" → Complex model, fits training closely, risk overfitting
- Low C (e.g., 0.01): "Keep it simple" → Simpler model, ignores noise, risk underfitting
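
A quick way to see what C does is to fit the same model with several values and compare the size of the coefficients. The sketch below assumes synthetic data and the default L2 penalty; the dictionary name `coef_norms` is illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative only)
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           random_state=42)
X = StandardScaler().fit_transform(X)

coef_norms = {}
for C in [0.01, 1, 100]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    # Size of the coefficient vector: smaller = more strongly shrunk
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C = {C:>6}: coefficient norm = {coef_norms[C]:.3f}")
```

Low C shrinks the coefficients towards zero (simpler model), while high C lets them grow to fit the training data more closely.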

B. **L1 vs L2: Penalty**

- L2 (Ridge): "Shrink all features a little" → Keeps all features but reduces their impact
- L1 (Lasso): "Remove useless features" → Sets some coefficients to exactly zero (feature selection)
- elasticnet: Mix of l1 and l2
- none: No regularisation. Only use if there is no overfitting

C. **class_weight (for imbalanced data):**

- None (default): Equal weight to all classes
- 'balanced': Automatically adjusts weights inversely proportional to class frequency
- Custom dict: {0: 1, 1: 10} → penalize class 1 errors 10x more
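
For reference, 'balanced' uses the weights n_samples / (n_classes × class count). The sketch below reproduces that on a hypothetical 90/10 label vector and compares it with scikit-learn's `compute_class_weight` helper.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 90 negatives, 10 positives
y = np.array([0] * 90 + [1] * 10)
classes = np.unique(y)

# Weights as computed by scikit-learn
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# Same formula by hand: n_samples / (n_classes * count of each class)
manual = len(y) / (len(classes) * np.bincount(y))

print(dict(zip(classes.tolist(), weights.round(3).tolist())))
```

Here the minority class gets a weight of 5.0 and the majority class about 0.56, so errors on the rare class count roughly 9x more.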

D. **solver:**

- Different algorithms to find the coefficients
- 'lbfgs' (default): Good for most cases, works with l2
- 'liblinear': Good for small datasets, works with l1
- 'saga': Works with all penalties, good for large datasets

In [None]:
# HYPERPARAMETER TUNING USING GRIDSEARCHCV

from sklearn.model_selection import GridSearchCV

param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear']
}

grid_search = GridSearchCV(
    LogisticRegression(random_state=42),
    param_grid,
    cv=5,
    scoring='roc_auc'
)

grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")
print(f"Best AUC: {grid_search.best_score_:.3f}")

# Use best model
best_model = grid_search.best_estimator_

**Imbalanced classes: One class has way more examples than the other.**

Examples:
- Fraud detection: 99% legitimate, 1% fraud
- Churn: 80% stay, 20% leave
- Disease screening: 95% healthy, 5% sick

Problem: Model learns to just predict majority class (gets 99% accuracy by always saying "not fraud")

Rule of thumb:
- Balanced: 50/50 or 60/40 split
- Imbalanced: >70/30 split
- Severe imbalance: >90/10 split

In [None]:
# DEALING WITH IMBALANCED CLASSES

# Method 1: class_weight='balanced'
model = LogisticRegression(class_weight='balanced')

# Method 2: Custom weights
weights = {0: 1, 1: 5}  # Penalize minority class errors 5x more
model = LogisticRegression(class_weight=weights)

# Method 3: Adjust threshold (the model must be fitted first)
model.fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]
y_pred_custom = (y_proba > 0.3).astype(int)  # Lower threshold to catch more positives

**Limitations**

A. **Assumes linear decision boundary:**

- Can't naturally capture complex non-linear patterns
- Solution: Feature engineering (polynomial features, interactions)

B. **Sensitive to outliers:**

- Extreme values can influence coefficients
- Solution: Remove outliers or use robust scaling

C. **Requires feature scaling:**

- Features on different scales affect regularization unevenly
- Solution: Standardize features before training

In [None]:
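from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # Fit on training data only
X_test_scaled = scaler.transform(X_test)        # Reuse training statistics (no leakage)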

D. **Can overfit when features outnumber samples**:

- Many features relative to samples
- Solution: L1 regularization or feature selection

E. **Limited to linear separability:**

- If classes not linearly separable, performance suffers
- Solution: Try SVM with kernel or tree-based models

**Summary**

Advantages:

- Outputs probabilities (can rank customers by risk)
- Interpretable coefficients (show feature impact)
- Fast training and prediction
- Handles moderate imbalance with class_weight='balanced'

Hyperparameters to tune:

- C (regularization strength): Try [0.01, 0.1, 1, 10]
- penalty: L1 for feature selection, L2 if all features useful
- class_weight: 'balanced' if classes imbalanced

Feature prep:

- Scale features (StandardScaler) before training
- Check for multicollinearity (correlation >0.8)

Evaluation:

- Use AUC-ROC (threshold-independent)
- Precision-Recall curve if imbalanced
- Calibration plot to verify probability quality

**When to switch: If AUC < 0.7 or clear non-linear patterns, try tree-based models or add polynomial features**

In [None]:
# FULL IMPLEMENTATION OF LOGISTIC REGRESSION

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)

# Predict classes
y_pred = model.predict(X_test)

# Predict probabilities
y_proba = model.predict_proba(X_test)[:, 1]  # Probability of class 1

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_proba)

### **7.3.1: Decision Trees**

Intuitive interpretable models that make predictions by learning simple decision rules from data. Split data into branches based on feature values, like a flowchart. 

                     Age > 30?               
                   /          \               
                 Yes           No              
                 /              \           
        Income > 50K?      Student?              
         /        \         /      \          
       Yes        No      Yes      No          
        |          |       |        |         
      Buy      Don't   Don't      Buy          


**How it learns:** 
1. Find best feature and split point that separates classes best
2. Repeat for each branch recursively
3. Stop when pure (all same class) or max depth reached

**Mathematical intuition:**

At each split, find feature and threshold that maximizes:
- **Classification:** Information Gain (or Gini decrease)
- **Regression:** Variance reduction
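
A worked example of the Gini decrease for a classification split. The `age` and `bought` arrays are made-up toy data, and the helper names `gini` and `gini_decrease` are illustrative. The sketch tries every midpoint between consecutive ages and keeps the threshold with the largest decrease.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted

# Toy data (illustrative only)
age = np.array([22, 25, 28, 35, 40, 45, 50, 60])
bought = np.array([0, 0, 0, 1, 1, 1, 0, 1])

# Gain for the split "age > 30"
mask = age > 30
gain = gini_decrease(bought, bought[~mask], bought[mask])
print(f"Gini decrease for age > 30: {gain:.3f}")

# Try every midpoint between consecutive ages and keep the best one
thresholds = (age[:-1] + age[1:]) / 2
gains = [gini_decrease(bought, bought[age <= t], bought[age > t]) for t in thresholds]
best_threshold = thresholds[int(np.argmax(gains))]
print(f"Best threshold: {best_threshold}")
```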

**Hyperparameters to Tune**

**A. max_depth:**

- None (default): Grow until pure leaves → OVERFITS
- Small (3-5): Shallow tree, simple model, may underfit
- Medium (5-10): Good balance
- Large (>15): Deep tree, complex, likely overfits

**Intuition:** How many questions can you ask?

- max_depth=3 → 3 questions maximum
- Deeper = more complex patterns, more overfitting risk

Typical range: [3, 5, 7, 10, 15, 20, None]

**B. min_samples_split:**

- Default: 2 → Split if ≥2 samples
- Higher (20, 50): More samples needed to split → simpler tree

**Intuition:** Don't split unless you have enough data

- min_samples_split=50 → need 50 samples before considering a split
- Typical range: [2, 10, 20, 50, 100]

**C. min_samples_leaf:**

- Default: 1 → Leaves can have 1 sample → OVERFITS
- Higher (5, 10, 20): Each leaf needs more samples → simpler tree

**Intuition:** Don't trust tiny leaves

- min_samples_leaf=10 → every prediction based on ≥10 samples
- Typical range: [1, 5, 10, 20, 50]

**D. max_features:**

- None: Consider all features at each split
- 'sqrt': Consider √(n_features) random features
- 'log2': Consider log₂(n_features) features
- int: Specific number

**Intuition:** Adds randomness, reduces overfitting

- Often used in Random Forest, less critical for single tree

**E. class_weight (for classification):**

- None: Equal weight
- 'balanced': Adjust weights for imbalanced classes
- Dict: Custom weights
- Same concept as Logistic Regression

In [None]:
# HYPERPARAMETER TUNING

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    'max_depth': [3, 5, 7, 10, None],
    'min_samples_split': [2, 10, 20],
    'min_samples_leaf': [1, 5, 10]
}

grid_search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy'
)

grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.3f}")

# Use best model
best_model = grid_search.best_estimator_

**Overfitting Issue**: Decision trees overfit very easily 

- With no constraints, tree memorizes training data
- Creates one leaf per training sample
- Perfect training accuracy, poor test accuracy

Therefore it is important to ALWAYS limit tree growth through **pruning** strategies

**A. Pre-pruning = Stop growing early** 


In [None]:
model = DecisionTreeClassifier(
    max_depth=5,              # Limit depth
    min_samples_split=20,     # Need 20+ samples to split
    min_samples_leaf=10       # Leaves need 10+ samples
)

**B. Post-pruning = grow full, then cut backwards**

The higher the ccp_alpha, the more aggressive the pruning

In [None]:
model = DecisionTreeClassifier(ccp_alpha=0.01)  # Cost-complexity pruning
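
To illustrate the effect of ccp_alpha, the sketch below fits the same tree with increasing values on synthetic data and counts the leaves. The alpha values are arbitrary choices for illustration; in practice ccp_alpha would be tuned with cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (illustrative only)
X, y = make_classification(n_samples=400, n_features=8, random_state=42)

alphas = [0.0, 0.005, 0.02, 0.05]
leaf_counts = []
for alpha in alphas:
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X, y)
    leaf_counts.append(tree.get_n_leaves())
    print(f"ccp_alpha = {alpha:<6}: {tree.get_n_leaves()} leaves")
```

With ccp_alpha=0 the tree is grown fully. Each larger value cuts more branches away, so the leaf count never increases.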

**Advantages**
1. No feature scaling needed
2. Handles non-linearity automatically: Captures thresholds and curves naturally
3. Handles mixed data types: Can mix numerical and categorical (after encoding)
4. Robust to outliers: Splits based on ranking 
5. Highly interpretable: Can visualise entire decision path

**Limitations**

**A. Instability:**

- Small data changes → completely different tree
- Not robust
- Solution: Use Random Forest (ensemble of trees)

In [None]:
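# Train two identical trees on slightly different data
model1 = DecisionTreeClassifier(random_state=42)
model2 = DecisionTreeClassifier(random_state=42)
model1.fit(X_train, y_train)
model2.fit(X_train[:-10], y_train[:-10])  # Remove 10 samples
# Trees can look totally different!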

**B. Overfitting tendency:**

- Without constraints, memorizes training data
- Solution: ALWAYS tune max_depth, min_samples_leaf

**C. Biased toward features with more levels:**

- Features with more unique values favored in splits
- Solution: Use Random Forest or limit max_features

**D. Cannot Extrapolate**

- Trained on ages 18-65
- Predicting age 80 → uses closest leaf (age 65)
- Can't predict beyond training range
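
The extrapolation limit is easy to demonstrate on a toy problem. The sketch assumes a perfectly linear relationship y = 2x observed only for x between 0 and 10, and then asks both a tree and a linear model to predict at x = 20.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Toy training data (illustrative only): y = 2x for x = 0..10
X_train = np.arange(0, 11, dtype=float).reshape(-1, 1)
y_train = 2.0 * X_train.ravel()

tree = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)
linear = LinearRegression().fit(X_train, y_train)

# Predict outside the training range
X_new = np.array([[20.0]])
tree_pred = tree.predict(X_new)[0]
linear_pred = linear.predict(X_new)[0]

print(f"Tree prediction at x=20:   {tree_pred:.1f}")    # stuck at the last leaf
print(f"Linear prediction at x=20: {linear_pred:.1f}")  # continues the trend
```

The tree returns the value of its right-most leaf, so its prediction can never exceed the largest target seen in training.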

**E. Creates axis-parallel splits:**
- Only splits on one feature at a time
- Can't capture diagonal decision boundaries naturally
- **Solution:** Feature engineering (create interaction features)


**F. Biased with imbalanced classes:**
- Prefers majority class
- **Solution:** class_weight='balanced'


| Aspect | Decision Tree | Linear Regression | Logistic Regression |
|--------|---------------|-------------------|---------------------|
| **Non-linearity** | ✅ Handles naturally | ❌ Needs feature engineering | ❌ Needs feature engineering |
| **Interpretability** | ✅ Visual tree | ✅ Coefficients | ✅ Coefficients |
| **Feature scaling** | ✅ Not needed | ⚠️ Needed for Ridge/Lasso, not plain OLS | ⚠️ Recommended (regularized by default in scikit-learn) |
| **Outliers** | ✅ Robust | ❌ Sensitive | ❌ Sensitive |
| **Overfitting risk** | ⚠️ High (needs tuning) | ⚠️ Medium | ⚠️ Medium |
| **Stability** | ❌ Unstable | ✅ Stable | ✅ Stable |
| **Speed** | ✅ Fast | ✅ Very fast | ✅ Very fast |


**Advantages for this problem:**

- Handles non-linear patterns naturally [if applicable]
- No feature scaling needed
- Interpretable - can visualize decision rules
- Robust to outliers
- Captures feature interactions automatically

**Critical: Prevent overfitting:**

- Tune max_depth (start with 5-10)
- Set min_samples_leaf (10-20 for reasonable leaf size)
- Use cross-validation to find optimal hyperparameters

**Hyperparameter strategy:**

- Grid search over max_depth=[3,5,7,10], min_samples_leaf=[5,10,20]
- Monitor train vs test accuracy gap
- If large gap → more regularization (lower max_depth, higher min_samples_leaf)

**When to switch:**

- If single tree unstable → Random Forest
- If need higher performance → Gradient Boosting (XGBoost)
- If smooth linear relationship → Ridge/Lasso

**Expected outcome:** Decision tree should capture non-linear patterns that linear models miss, but single tree may overfit - Random Forest likely better choice for production.



In [None]:
# IMPLEMENTATION FOR CLASSIFICATION

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train (limiting depth to prevent overfitting)
model = DecisionTreeClassifier(
    max_depth=5,
    min_samples_split=20,
    min_samples_leaf=10,
    random_state=42
)
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(classification_report(y_test, y_pred))

In [None]:
# IMPLEMENTATION FOR REGRESSION

from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score

model = DecisionTreeRegressor(
    max_depth=5,
    min_samples_leaf=10,
    random_state=42
)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print(f"RMSE: {mean_squared_error(y_test, y_pred) ** 0.5:.2f}")  # squared=False was removed in recent scikit-learn
print(f"R²: {r2_score(y_test, y_pred):.3f}")

### **7.3.2. Random Forest**

Train many decision trees on random subsets of data, then average their predictions. It has 3 main steps: 

A. **Bootstrap Sampling (Bagging)**:

- Create N different training sets by **random sampling with replacement**
- Each tree trained on different subset
- Typical: each subset = same size as original, but with duplicates

*Example*: Original data = [1,2,3,4,5]

- Tree 1 trained on: [1,1,3,4,5]
- Tree 2 trained on: [2,2,3,3,4]
- Tree 3 trained on: [1,2,5,5,5]
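
A small simulation of bootstrap sampling, assuming a dataset of 10,000 rows identified by their index. Sampling with replacement leaves roughly 63% of the rows in each bootstrap sample. The remaining rows are "out-of-bag" for that tree.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000  # number of rows in the (hypothetical) training set

# Draw n row indices with replacement, as done for each tree
indices = rng.integers(0, n, size=n)

unique_fraction = len(np.unique(indices)) / n
oob_fraction = 1 - unique_fraction

print(f"Rows that appear in the bootstrap sample: {unique_fraction:.1%}")
print(f"Rows left out (out-of-bag):               {oob_fraction:.1%}")
```

The theoretical value is 1 - 1/e ≈ 63.2% for large n, which is why every tree sees a noticeably different dataset.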

B. **Random Feature Selection**:

- At each split, only consider random subset of features
- Typical: √(n_features) for classification, n_features/3 for regression
- Makes trees more diverse (decorrelated)

*Example*: 10 features total

- Split 1: randomly consider features [2, 5, 8]
- Split 2: randomly consider features [1, 3, 9]

C. **Aggregate Predictions:**

- Classification: Majority vote
- Regression: Average

*Example* - Classification:

- Tree 1 predicts: Class 1
- Tree 2 predicts: Class 0
- Tree 3 predicts: Class 1
- **Final prediction: Class 1 (2 out of 3)**

**Overcoming the Bias-Variance Tradeoff:**

Single Decision Tree:

- Low bias (can fit complex patterns)
- HIGH variance (unstable, changes with data)

Random Forest:

- Low bias (still flexible)
- LOW variance (averaging reduces instability)

Intuition:

- Individual trees make different mistakes
- Averaging cancels out random errors
- Only systematic patterns survive

Mathematical: Variance of average of N uncorrelated models = Variance/N 
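
The Variance/N rule only holds when the models are uncorrelated. For models with pairwise correlation ρ, the standard result is Var(average) = ρσ² + (1 − ρ)σ²/N, which is why decorrelating the trees matters. The simulation below is a sketch that checks both cases, with each "model" simulated as a unit-variance random prediction (an assumption for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_trials = 25, 20_000

def variance_of_average(rho):
    """Empirical variance of the average of n_models predictions with pairwise correlation rho."""
    shared = rng.normal(size=(n_trials, 1))        # component common to all models
    own = rng.normal(size=(n_trials, n_models))    # component unique to each model
    preds = np.sqrt(rho) * shared + np.sqrt(1 - rho) * own  # each has variance 1
    return preds.mean(axis=1).var()

var_uncorrelated = variance_of_average(0.0)  # theory: 1/25 = 0.04
var_correlated = variance_of_average(0.5)    # theory: 0.5 + 0.5/25 = 0.52

print(f"rho = 0.0: variance of average = {var_uncorrelated:.3f}")
print(f"rho = 0.5: variance of average = {var_correlated:.3f}")
```

With correlated models, adding more trees cannot push the variance below ρσ². Random feature selection lowers ρ and so lowers that floor.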


**Good For**

- Almost any tabular data problem (very general-purpose)
- Non-linear relationships
- Feature interactions
- Outliers present
- Need feature importance
- Imbalanced classes
- High-dimensional data
- When single decision tree overfits
- Need robust, stable predictions

**Not good for:**

- High-cardinality categorical features (many unique values)
- Very large datasets (slower than single tree)
- Need simple interpretability (can't visualize like single tree)
- Linear relationships (overkill, use linear models)
- Extreme real-time latency requirements

**Best use case: Structured/tabular data with complex patterns**

**Main Hyperparameters**

A. **n_estimators (number of trees):**

- More trees = better performance, but diminishing returns
- More trees = slower training/prediction

- Typical values: [50, 100, 200, 500]
- Start with 100
- If computational budget allows → 200-500
- More is almost always better (just slower)
- Intuition: More trees = more opinions = more stable

B. **max_depth:**

- None (default): Trees grow fully → individual trees overfit, but ensemble handles it
- Limit (5-20): Prevents individual trees from overfitting too much

- Typical values: [None, 10, 20, 30]

- Often can leave as None (RF handles overfitting via averaging)
- If very large dataset → limit to speed up training

- Difference from single tree: RF is MORE robust to deep trees than single tree

C. **min_samples_split:**

- Default: 2
- Higher (10, 20): Simpler trees

- Typical values: [2, 5, 10, 20]

- Less critical than for single tree (averaging helps)

D. **min_samples_leaf:**

- Default: 1
- Higher (5, 10): Smoother predictions

- Typical values: [1, 2, 5, 10]

- Increase if overfitting (though RF less prone to this)

E. **max_features (features considered per split):**

- 'sqrt' (default for classification): √(n_features)
- 'log2': log₂(n_features)
- None: All features (reduces diversity)
- int/float: Specific number or fraction

- Typical values: ['sqrt', 'log2', 0.3, 0.5]

- 'sqrt' is good default
- Lower → more diversity, less overfitting
- Higher → stronger trees, more correlation

- Intuition: Controls diversity of trees

- Fewer features → trees more different → less correlation → better averaging

F. **max_samples (bootstrap sample size):**

- None (default): Use all samples (with replacement)
- Float (0.5, 0.8): Use fraction of samples per tree

- Typical values: [None, 0.6, 0.8]

- Lower → faster training, more diversity
- Higher → stronger individual trees

G. **n_jobs:**

- -1: Use all CPU cores (parallel training)
- 1: Single core
- Always use -1 for faster training

In [None]:
# HYPERPARAMETER TUNING FOR Random Forest

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    'n_estimators': [100, 200, 500],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5],
    'max_features': ['sqrt', 'log2', 0.3]
}

# Use RandomizedSearchCV (faster than GridSearch for RF)
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42, n_jobs=-1),
    param_distributions=param_dist,
    n_iter=20,  # Try 20 random combinations
    cv=5,
    scoring='roc_auc',
    random_state=42,
    n_jobs=-1
)

random_search.fit(X_train, y_train)
print(f"Best params: {random_search.best_params_}")
print(f"Best AUC: {random_search.best_score_:.3f}")

best_model = random_search.best_estimator_

**Random Forest feature importance has biases:**

A. **Biased toward high-cardinality features:**

- Features with more unique values get higher importance
- Not always reflective of true predictive power

B. **Correlated features:**

- Importance split between correlated features
- May underestimate importance of any single correlated feature

C. **Not causal:**

- High importance ≠ causes outcome
- Just means useful for prediction

Alternative: **Permutation Importance** 
- Shuffle feature, measure drop in performance
- More reliable than built-in importance
- Slower computation

In [None]:
from sklearn.inspection import permutation_importance

perm_importance = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=42
)

importance_perm = pd.DataFrame({
    'feature': X.columns,
    'importance': perm_importance.importances_mean
}).sort_values('importance', ascending=False)

**Why Random Forest:**

- Ensemble of trees fixes instability of single decision tree
- Resistant to overfitting through averaging
- Handles non-linearity, interactions, outliers naturally
- No feature scaling needed
- Provides feature importance

**Hyperparameter strategy:**

- Start with n_estimators=100-200 (more if time allows)
- Keep max_depth=None initially (RF handles deep trees)
- Tune min_samples_leaf if overfitting (5-10)
- Use n_jobs=-1 for parallel training
- Monitor OOB score for quick validation
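
The OOB score mentioned above comes for free from bagging: each tree is evaluated on the rows it did not see during training. A minimal sketch on synthetic data, using `oob_score=True`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data (illustrative only)
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = RandomForestClassifier(
    n_estimators=200,
    oob_score=True,      # score each sample using only trees that did not train on it
    random_state=42
)
model.fit(X_train, y_train)

oob_accuracy = model.oob_score_
test_accuracy = model.score(X_test, y_test)

print(f"OOB accuracy:  {oob_accuracy:.3f}")
print(f"Test accuracy: {test_accuracy:.3f}")
```

The OOB accuracy is usually close to the test accuracy, so it works as a quick validation estimate without a separate validation set.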

**Expected performance:**

- Should significantly beat single decision tree
- Comparable to or slightly below XGBoost (but easier to tune)
- AUC typically 0.75-0.90 for good problems

**When to switch:**

- Need max performance → XGBoost/LightGBM
- Need interpretability → Single tree or linear model
- Very large data → LightGBM (faster)

Random Forest is my baseline for complex tabular data - works reliably with minimal tuning


| Aspect | Random Forest | Single Decision Tree | Gradient Boosting |
|--------|---------------|----------------------|-------------------|
| **Overfitting** | ✅ Resistant | ❌ Prone | ⚠️ Can overfit if not tuned |
| **Interpretability** | ⚠️ Medium | ✅ High | ❌ Low |
| **Training speed** | ⚠️ Medium | ✅ Fast | ❌ Slow |
| **Prediction speed** | ⚠️ Medium | ✅ Fast | ⚠️ Medium |
| **Performance** | ✅ Good | ⚠️ Medium | ✅ Best |
| **Hyperparameter tuning** | ✅ Easy | ✅ Easy | ❌ Requires care |
| **Stability** | ✅ Stable | ❌ Unstable | ✅ Stable |

In [None]:
# FULL CLASSIFICATION IMPLEMENTATION - RF 

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train
model = RandomForestClassifier(
    n_estimators=100,
    max_depth=None,
    min_samples_leaf=5,
    max_features='sqrt',
    n_jobs=-1,
    random_state=42
)
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]

# Evaluate
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(f"AUC: {roc_auc_score(y_test, y_proba):.3f}")
print(classification_report(y_test, y_pred))


In [None]:
# FULL REGRESSION IMPLEMENTATION - RF

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

model = RandomForestRegressor(
    n_estimators=100,
    max_depth=None,
    min_samples_leaf=5,
    n_jobs=-1,
    random_state=42
)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print(f"RMSE: {mean_squared_error(y_test, y_pred) ** 0.5:.2f}")  # squared=False was removed in recent scikit-learn
print(f"R²: {r2_score(y_test, y_pred):.3f}")

### **8.1. Evaluating Classification**

1. **Confusion Matrix**
- TP (True Positive): Correctly predicted positive
- TN (True Negative): Correctly predicted negative
- FP (False Positive): Predicted positive, actually negative (Type I error)
- FN (False Negative): Predicted negative, actually positive (Type II error)

```
                      PREDICTED
                 Negative  Positive
              ┌──────────┬──────────┐
ACTUAL   Neg  │    TN    │    FP    │
              ├──────────┼──────────┤
         Pos  │    FN    │    TP    │
              └──────────┴──────────┘
```

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(cm, display_labels=['No Churn', 'Churn'])
disp.plot()

2. **Accuracy**: What percentage of predictions are correct

   **When to use:**

- Balanced classes (50/50 or 60/40)
- All errors equally costly

   **When NOT to use:**

- Imbalanced data (99% one class → 99% accuracy by always predicting majority)
- Example: Email spam (95% legitimate, 5% spam)
- Predict "all legitimate" → 95% accuracy but catches zero spam!

**(TP + TN) / (TP + TN + FP + FN)**

In [None]:
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)
# Or: (TP + TN) / (TP + TN + FP + FN)
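
The spam example above can be checked numerically. The sketch below builds a hypothetical inbox of 1,000 emails with the same 95/5 split and scores a "model" that always predicts the majority class; the array names are illustrative only.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical inbox: 950 legitimate emails (0) and 50 spam emails (1)
y_true_spam = np.array([0] * 950 + [1] * 50)

# A "model" that always predicts the majority class (legitimate)
y_pred_majority = np.zeros_like(y_true_spam)

acc = accuracy_score(y_true_spam, y_pred_majority)
rec = recall_score(y_true_spam, y_pred_majority)

print(f"Accuracy: {acc:.2f}")        # Accuracy: 0.95
print(f"Recall on spam: {rec:.2f}")  # Recall on spam: 0.00
```

High accuracy and zero recall on the class of interest is the signal to switch to the metrics below.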

3. **Precision**: What percentage of positive predictions were correct

   When to use:

- Cost of False Positives is HIGH
- Don't want false alarms

   Examples:

- Spam filter: Don't want to mark important emails as spam (FP bad)
- Medical diagnosis for expensive treatment: Don't give treatment to healthy people (FP expensive)

   **TP / (TP + FP)**

In [None]:
from sklearn.metrics import precision_score

precision = precision_score(y_test, y_pred)

4. **Recall**: What percentage of actual positives did we catch. "Of all the actual cases, how many did I find?" 

   When to use:

- Cost of False Negatives is HIGH
- Can't afford to miss positives

   Examples:

- Cancer screening: Can't miss cancer cases (FN catastrophic)
- Fraud detection: Missing fraud costs money (FN expensive)
- Churn prediction: Missing churners loses customers (FN costly)

  **TP / (TP + FN)**

In [None]:
from sklearn.metrics import recall_score

recall = recall_score(y_test, y_pred)
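
Precision and recall share the same numerator (TP) and differ only in the denominator. As a rough illustration on made-up labels (the `_toy` names are not from the notebook), the two formulas can be computed by hand and compared against scikit-learn:

```python
from sklearn.metrics import precision_score, recall_score

# Toy labels, made up for illustration: 4 actual positives, 3 predicted positives
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true_toy, y_pred_toy))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # correctly flagged positives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false alarms
fn = sum(t == 1 and p == 0 for t, p in pairs)  # missed positives

precision_manual = tp / (tp + fp)  # of everything flagged, how much was right
recall_manual = tp / (tp + fn)     # of everything real, how much was found

print(f"Precision: {precision_manual:.3f}")  # Precision: 0.667
print(f"Recall:    {recall_manual:.3f}")     # Recall:    0.500
```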

5. **F1-Score**: Harmonic mean of Precision and Recall

   When to use:

- Need balance between Precision and Recall
- Imbalanced classes
- Single metric needed

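   Context: F1 is high only when both Precision and Recall are high (e.g. F1 = 0.8 when both are 0.8); one weak component drags it down

   **2 × (Precision × Recall) / (Precision + Recall)**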

In [None]:
from sklearn.metrics import f1_score

f1 = f1_score(y_test, y_pred)
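
To see why a harmonic mean is used, compare it with a plain average when one component is weak. The helper `f1_from` below is a hypothetical name introduced for this sketch, and the precision/recall values are made up.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (defined as 0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

balanced = f1_from(0.80, 0.80)   # both components strong
lopsided = f1_from(0.95, 0.10)   # high precision, very low recall
arithmetic = (0.95 + 0.10) / 2   # plain average of the lopsided case

print(f"Balanced F1:     {balanced:.3f}")    # Balanced F1:     0.800
print(f"Lopsided F1:     {lopsided:.3f}")    # Lopsided F1:     0.181
print(f"Arithmetic mean: {arithmetic:.3f}")  # Arithmetic mean: 0.525
```

The plain average hides the weak recall; the harmonic mean does not.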

6. **ROC and AUC**

- ROC Curve: True Positive Rate vs False Positive Rate at different thresholds
- AUC (Area Under Curve): Single number summarizing ROC

  **AUC Interpretation:**

- AUC = 0.5: Random guessing (worthless model)
- AUC = 0.7-0.8: Fair model
- AUC = 0.8-0.9: Good model
- AUC = 0.9-1.0: Excellent model
- AUC = 1.0: Perfect (or data leakage!)

  **When to use:**

- Comparing models overall
- Threshold-independent metric
- Imbalanced classes (with severe imbalance, also check the precision-recall curve, since ROC AUC can look optimistic)

In [None]:
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt

# Calculate AUC
auc = roc_auc_score(y_test, y_proba)

# Plot ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
plt.plot(fpr, tpr, label=f'AUC = {auc:.3f}')
plt.plot([0, 1], [0, 1], 'k--', label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate (Recall)')
plt.legend()
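
One way to build intuition for AUC: it equals the probability that a randomly chosen positive gets a higher score than a randomly chosen negative (ties counted as half). The sketch below checks this on made-up scores; `y_true_toy` and `scores_toy` are illustrative stand-ins for `y_test` and `y_proba`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and predicted scores, made up for illustration
y_true_toy = np.array([0, 0, 1, 1, 0, 1])
scores_toy = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.70])

pos = scores_toy[y_true_toy == 1]
neg = scores_toy[y_true_toy == 0]

# Score difference for every (positive, negative) pair
diffs = pos[:, None] - neg[None, :]

# Share of pairs ranked correctly; ties count as half
auc_pairwise = (np.sum(diffs > 0) + 0.5 * np.sum(diffs == 0)) / diffs.size
auc_sklearn = roc_auc_score(y_true_toy, scores_toy)

print(f"Pairwise AUC: {auc_pairwise:.3f}")  # Pairwise AUC: 0.889
print(f"sklearn AUC:  {auc_sklearn:.3f}")   # sklearn AUC:  0.889
```

This is why AUC = 0.5 corresponds to random guessing: positives and negatives are ranked correctly only half the time.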

**Summary**

Step 1 - Check confusion matrix:

- Understand TP, TN, FP, FN distribution
- Identify which errors are happening

Step 2 - Choose primary metric based on business:

- Churn prediction: Optimize Recall (can't afford to miss churners - FN costly)
- Fraud detection: Balance Precision and Recall (F1-score or tune threshold)
- Spam filter: Optimize Precision (don't mark important emails as spam - FP bad)

Step 3 - Use AUC for model comparison:

- Threshold-independent
- AUC > 0.8 is good, > 0.9 is excellent
- If AUC = 1.0, check for data leakage

Step 4 - Adjust threshold if needed
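
Step 4 can be sketched as follows. `predict` uses a 0.5 cutoff by default for most scikit-learn classifiers; applying a custom cutoff to the predicted probabilities trades precision for recall. The data here is made up, and `proba_toy` stands in for something like `model.predict_proba(X_test)[:, 1]`.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Toy labels and predicted probabilities, made up for illustration
y_true_toy = np.array([0, 0, 0, 0, 0, 1, 0, 1, 1, 1])
proba_toy = np.array([0.05, 0.10, 0.20, 0.30, 0.45, 0.40, 0.60, 0.55, 0.80, 0.90])

results = {}
for threshold in (0.5, 0.35):
    # Predict positive whenever the probability reaches the threshold
    y_pred_t = (proba_toy >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_true_toy, y_pred_t),
        recall_score(y_true_toy, y_pred_t),
    )
    p, r = results[threshold]
    print(f"threshold={threshold:.2f}  precision={p:.3f}  recall={r:.3f}")
# threshold=0.50  precision=0.750  recall=0.750
# threshold=0.35  precision=0.667  recall=1.000
```

Lowering the threshold catches every positive here at the cost of more false alarms, which is the usual direction to move when false negatives are the expensive error. Choose the threshold on a validation set, not the test set.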

### **8.2. Regression Evaluation**

1. **Mean Absolute Error (MAE)**

- Formula: Average of |actual - predicted|
- Interpretation: "On average, predictions are off by $X"
  

- Easy to interpret (same units as target)
- All errors treated equally
- Robust to outliers (compared to RMSE)

  Example: MAE = $5,000 means average error is $5,000

In [None]:
from sklearn.metrics import mean_absolute_error

mae = mean_absolute_error(y_test, y_pred)
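
The formula is short enough to verify by hand. The house prices below are made up so that the result matches the $5,000 example above; the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Toy house prices in dollars, made up for illustration
actual = np.array([200_000, 350_000, 150_000, 500_000])
predicted = np.array([195_000, 360_000, 150_000, 495_000])

# Average of |actual - predicted|: errors are 5K, 10K, 0 and 5K
mae_manual = np.mean(np.abs(actual - predicted))

print(f"MAE: ${mae_manual:,.0f}")  # MAE: $5,000
```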

2. **Mean Squared Error / Root Mean Squared Error (MSE/RMSE)**

- Penalizes large errors more
- MSE Formula: Average of (actual - predicted)²
- RMSE Formula: √MSE

- Large errors are particularly bad
- Most common regression metric
- Interpretable (RMSE in original units)

  **Difference from MAE:**
- RMSE penalizes large errors more (squared)
- Because errors are squared, an error of $10K contributes 4× as much as an error of $5K (not 2×)

- Example: RMSE = $8,000 means typical error is around $8,000, with extra penalty on large errors

In [None]:
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y_test, y_pred)
# Square root of MSE (squared=False is deprecated since scikit-learn 1.4
# and no longer accepted in recent releases)
rmse = mse ** 0.5
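
To make the difference from MAE concrete, the sketch below compares two made-up sets of errors with the same total: one spread evenly, one concentrated in a single large miss. The helper names `mae_of` and `rmse_of` are hypothetical.

```python
import numpy as np

# Two toy error profiles in dollars with the same total error (made up)
errors_even = np.array([5_000.0, 5_000.0, 5_000.0, 5_000.0])
errors_spiky = np.array([0.0, 0.0, 0.0, 20_000.0])

def mae_of(errors):
    """Mean absolute error from raw errors."""
    return np.mean(np.abs(errors))

def rmse_of(errors):
    """Root mean squared error from raw errors."""
    return np.sqrt(np.mean(errors ** 2))

for name, errs in [("even", errors_even), ("spiky", errors_spiky)]:
    print(f"{name:>5}: MAE=${mae_of(errs):,.0f}  RMSE=${rmse_of(errs):,.0f}")
#  even: MAE=$5,000  RMSE=$5,000
# spiky: MAE=$5,000  RMSE=$10,000
```

MAE cannot tell the two profiles apart; RMSE doubles when the error is concentrated in one prediction.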

3. **R-Squared (R²)**

- Percentage of variance explained by model

- Formula: 1 - (Sum of squared residuals / Total sum of squares)
- Range: -∞ to 1

- R² = 1: Perfect predictions
- R² = 0: Model no better than predicting mean
- R² < 0: Model worse than predicting mean

   **When to Use:**
- Comparing models
- Understanding model fit quality
- Standard regression metric

  **Interpretation:**

- R² = 0.85 means model explains 85% of variance in target
- Remaining 15% is unexplained (noise, missing features)

  **Limitation:** Can be misleading with non-linear relationships or outliers

In [None]:
from sklearn.metrics import r2_score

r2 = r2_score(y_test, y_pred)
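
The three reference points (R² = 1, 0, and below 0) follow directly from the formula. A minimal sketch on made-up numbers, with `r2_manual` as a hypothetical helper name:

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_manual(y, y_hat):
    """1 - (sum of squared residuals / total sum of squares)."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Toy targets and three sets of predictions, made up for illustration
y_toy = np.array([3.0, 5.0, 7.0, 9.0])
pred_good = np.array([2.5, 5.5, 7.0, 9.0])     # close to the targets
pred_mean = np.full_like(y_toy, y_toy.mean())  # always predict the mean
pred_bad = np.array([9.0, 7.0, 5.0, 3.0])      # trend reversed

print(f"Good model:    {r2_manual(y_toy, pred_good):.3f}")  # Good model:    0.975
print(f"Mean baseline: {r2_manual(y_toy, pred_mean):.3f}")  # Mean baseline: 0.000
print(f"Bad model:     {r2_manual(y_toy, pred_bad):.3f}")   # Bad model:     -3.000
```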

**Summary**

- RMSE = $45K means typical prediction error is $45,000
- R² = 0.85 means model explains 85% of price variation
- If R² < 0.5, model not capturing patterns well (need better features or model)

Validate with residual plot:

- Random scatter = good model fit
- Patterns = model missing something (non-linearity, heteroscedasticity)
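
A residual plot can be sketched as below. The data is synthetic (a linear signal plus Gaussian noise, generated here only for illustration), so the scatter should look random around zero; on real data, substitute the fitted model's predictions and the true targets.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic data, made up for illustration: y = 3x + noise
rng = np.random.default_rng(42)
X_toy = rng.uniform(0, 10, size=(200, 1))
y_toy = 3.0 * X_toy[:, 0] + rng.normal(0, 1.0, size=200)

model_toy = LinearRegression().fit(X_toy, y_toy)
y_hat = model_toy.predict(X_toy)
residuals = y_toy - y_hat  # actual - predicted

# Residuals against predictions: look for curves or funnels
fig, ax = plt.subplots()
ax.scatter(y_hat, residuals, alpha=0.5)
ax.axhline(0, color="k", linestyle="--")
ax.set_xlabel("Predicted value")
ax.set_ylabel("Residual (actual - predicted)")
ax.set_title("Residuals vs. predicted")
```

A curve in this plot suggests missed non-linearity; a funnel shape (spread growing with the prediction) suggests heteroscedasticity.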

Choose based on business:

- If large errors are particularly costly → RMSE
- If you prefer equal weighting of errors → MAE
- If you want a percentage error → MAPE (Mean Absolute Percentage Error)
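
MAPE is not computed elsewhere in this section, so here is a minimal sketch on made-up values. Note that scikit-learn's `mean_absolute_percentage_error` returns a fraction (0.10 for 10%), and that MAPE becomes unstable when actual values are at or near zero.

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

# Toy values, made up for illustration
actual = np.array([100.0, 200.0, 400.0])
predicted = np.array([110.0, 180.0, 400.0])

# Average of |actual - predicted| / |actual|: 10%, 10% and 0%
mape_manual = np.mean(np.abs((actual - predicted) / actual))
mape_sklearn = mean_absolute_percentage_error(actual, predicted)

print(f"MAPE: {mape_manual:.1%}")  # MAPE: 6.7%
```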