
# Cross Validation and Hyperparameter Tuning

**Introduction to Cross-Validation**

- What is Cross-Validation?

  - A technique for assessing how well a machine learning model generalizes to an independent dataset

- Types of Cross-Validation

  - K-Fold Cross-Validation

    - Splits the dataset into k folds of approximately equal size

    - The model is trained on k-1 folds and validated on the remaining fold

    - This process is repeated k times, so each fold serves as the validation set once, and the average performance is computed
  
  - Stratified K-Fold

    - Ensures that each fold maintains the same class distribution as the original dataset

    - Useful for imbalanced datasets

  - Leave-One-Out Cross-Validation (LOOCV)

    - Uses a single data point for validation and the rest for training

    - Repeats this process for all data points

    - Computationally expensive (one model fit per data point); each round trains on almost all of the data, although the resulting estimate can have high variance
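
The three strategies above can be compared side by side. The following is a minimal sketch, assuming a small synthetic dataset from `make_classification` and a `LogisticRegression` model; both are illustrative choices, not part of the notes above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, StratifiedKFold, cross_val_score

# Synthetic, imbalanced binary dataset (roughly 80% / 20%), used only for illustration.
X, y = make_classification(n_samples=100, n_features=5, weights=[0.8, 0.2], random_state=0)

model = LogisticRegression(max_iter=1000)

splitters = {
    "K-Fold (k=5)": KFold(n_splits=5, shuffle=True, random_state=0),
    "Stratified K-Fold (k=5)": StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    "Leave-One-Out": LeaveOneOut(),
}

results = {}
for name, cv in splitters.items():
    # One score per validation fold; LOOCV therefore needs one fit per data point.
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    results[name] = (len(scores), scores.mean())
    print(f"{name}: {len(scores)} fits, mean accuracy = {scores.mean():.3f}")
```

The number of fits makes the cost difference visible: 5 for the two K-Fold variants, and one per sample for LOOCV.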

**Hyperparameter Tuning**

## **What is Hyperparameter Tuning?**

In Machine Learning models, there are **two types of parameters**:

###  **1. Parameters (learned automatically)**

These are learned from data during training.
Example: weights (W), bias (b).

###  **2. Hyperparameters (set by you, NOT learned automatically)**

These control *how* the model learns.
Examples:

* Learning rate
* Number of trees in Random Forest
* Number of hidden layers in Neural Networks
* C (inverse of regularization strength) in SVM
* k value in KNN

**Hyperparameter tuning = finding the hyperparameter values** that give the model the **best performance on validation data** (for example, the highest cross-validated accuracy or the lowest error).
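
To make the distinction concrete, here is a small sketch, assuming scikit-learn's `LogisticRegression` on synthetic data (an illustrative choice). The values `C=0.5` and `max_iter=500` are arbitrary hyperparameters picked by hand; `coef_` and `intercept_` are the parameters learned by `fit()`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Hyperparameters: chosen by us before training starts.
model = LogisticRegression(C=0.5, max_iter=500)
print("Hyperparameters:", {"C": model.C, "max_iter": model.max_iter})

# Parameters: learned from the data during fit().
model.fit(X, y)
print("Learned weights (W):", model.coef_)
print("Learned bias (b):", model.intercept_)
```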

---

**Why do we need Hyperparameter Tuning?**

A model with poorly chosen hyperparameters can:

❌ Overfit

❌ Underfit

❌ Learn very slowly

❌ Learn too fast and become unstable

So tuning helps your model perform **at its best**.

---

**Techniques for Hyperparameter Tuning (Very Simple)**

## 1. **Grid Search**

* You give a **set of possible values** for each hyperparameter
* The algorithm tries **every combination**
* The combination with the best validation score (e.g. accuracy) is selected

✔️ Simple
❌ Slow for large ranges

Example:

```python
param_grid = {'learning_rate': [0.1, 0.01, 0.001]}
```
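
To see why a grid gets slow, it helps to count the combinations. This sketch uses scikit-learn's `ParameterGrid`; the `max_depth` entry is a hypothetical second hyperparameter added only to show how the grid multiplies.

```python
from sklearn.model_selection import ParameterGrid

# The first key matches the example above; the second is added for illustration.
param_grid = {
    "learning_rate": [0.1, 0.01, 0.001],
    "max_depth": [2, 4],
}

# ParameterGrid enumerates every combination: 3 values x 2 values = 6 candidates.
combinations = list(ParameterGrid(param_grid))
print(f"{len(combinations)} combinations to evaluate")
for params in combinations:
    print(params)
```

With 5-fold cross-validation, each of these 6 candidates is trained 5 times, so the cost grows quickly as values or hyperparameters are added.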

---

## 2. **Random Search**

* Samples **random combinations** from the search space instead of trying all of them
* Usually much **faster** than Grid Search, because the number of trials is fixed in advance

✔️ Good for large search space
✔️ Often finds good results faster
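
A minimal sketch of Random Search with scikit-learn's `RandomizedSearchCV`. The dataset is synthetic and the sampling ranges are illustrative assumptions, not recommended values.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Distributions to sample from (randint's upper bound is exclusive).
param_distributions = {
    "n_estimators": randint(20, 100),
    "max_depth": randint(2, 10),
    "min_samples_split": randint(2, 11),
}

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=8,            # only 8 sampled combinations are evaluated
    cv=3,
    scoring="accuracy",
    random_state=42,
)
search.fit(X, y)

print(f"Best hyperparameters: {search.best_params_}")
print(f"Best CV accuracy: {search.best_score_:.2f}")
```

The cost is controlled by `n_iter` rather than by the size of the search space.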

---

## 3. **Bayesian Optimization**

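* A surrogate model **learns** from past trials which hyperparameters give good results
* It steers away from unpromising regions and concentrates trials on promising values
* Usually needs fewer trials than Grid Search or Random Search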

Used in:

* Google Cloud AI Platform
* Keras Tuner
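
The idea can be sketched in a few lines without a dedicated tuning library. The toy loop below is an assumption-heavy illustration, not how any of the tools above are implemented: it tunes a single hyperparameter (`C` of an SVM, on a log scale), uses a Gaussian process as the surrogate model, and picks the next trial with an upper-confidence-bound rule.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)


def objective(log_c):
    """Mean 3-fold CV accuracy of an SVM with C = 10 ** log_c."""
    return cross_val_score(SVC(C=10.0 ** log_c), X, y, cv=3).mean()


# Search space for log10(C), discretized into 61 candidate values.
candidates = np.linspace(-3, 3, 61).reshape(-1, 1)

# Start with a few evenly spread evaluations.
tried = [-3.0, 0.0, 3.0]
scores = [objective(c) for c in tried]

for _ in range(7):
    # Surrogate model: predicts the score and its uncertainty for every candidate.
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-4, normalize_y=True)
    gp.fit(np.array(tried).reshape(-1, 1), scores)
    mean, std = gp.predict(candidates, return_std=True)
    # Upper confidence bound: prefer a high predicted score or high uncertainty.
    ucb = mean + 1.5 * std
    next_c = float(candidates[np.argmax(ucb), 0])
    tried.append(next_c)
    scores.append(objective(next_c))

best = int(np.argmax(scores))
print(f"Best log10(C) = {tried[best]:.2f}, CV accuracy = {scores[best]:.3f}")
```

Only 10 model evaluations are spent in total, and each new trial is chosen using what the earlier trials revealed.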

---

## 4. **Hyperband (for Neural Networks)**

* Trains many models with different hyperparameters
* Stops training *bad models early*
* Saves time

Used in:

* Keras Tuner
* Ray Tune
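
The early-stopping idea can be sketched with successive halving, the building block that Hyperband runs several times with different starting budgets. This is a simplified illustration, not full Hyperband: the budget is assumed to be the number of trees in a Random Forest, and the candidate values are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid, train_test_split

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# Nine hypothetical candidate configurations.
candidates = list(ParameterGrid({"max_depth": [2, 5, None], "min_samples_leaf": [1, 5, 20]}))

budget = 5   # number of trees each surviving candidate may train in this round
eta = 3      # keep the best 1/eta candidates after each round
history = []

while True:
    scored = []
    for params in candidates:
        model = RandomForestClassifier(n_estimators=budget, random_state=0, **params)
        model.fit(X_train, y_train)
        scored.append((model.score(X_val, y_val), params))
    history.append((budget, len(candidates)))
    scored.sort(key=lambda item: item[0], reverse=True)
    if len(candidates) == 1:
        break
    # Stop the weaker candidates early and give the survivors a larger budget.
    candidates = [params for _, params in scored[: max(1, len(candidates) // eta)]]
    budget *= eta

best_score, best_params = scored[0]
print("Rounds (budget, candidates):", history)
print("Best configuration:", best_params, f"validation accuracy = {best_score:.3f}")
```

Most of the compute goes to the few configurations that looked good on a small budget, which is where the time saving comes from.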

---

## 5. **Genetic Algorithms**

* Inspired by biological evolution
* Hyperparameter sets are combined (crossover) and randomly changed (mutation)
* Best performing ones survive

Used in:

* DEAP
* TPOT library
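
A toy version of the loop, written in plain Python rather than with DEAP or TPOT. All names here (`random_individual`, `fitness`, `crossover`, `mutate`), the value ranges, and the mutation rates are hypothetical choices for illustration.

```python
import random

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
rng = random.Random(0)


def random_individual():
    """One individual = one set of hyperparameters."""
    return {"max_depth": rng.randint(1, 12), "min_samples_leaf": rng.randint(1, 20)}


def fitness(ind):
    """Mean 3-fold CV accuracy of a decision tree with these hyperparameters."""
    model = DecisionTreeClassifier(random_state=0, **ind)
    return cross_val_score(model, X, y, cv=3).mean()


def crossover(a, b):
    # The child takes each hyperparameter from one of the two parents.
    return {key: rng.choice([a[key], b[key]]) for key in a}


def mutate(ind):
    # Occasionally nudge a value, staying inside the allowed range.
    child = dict(ind)
    if rng.random() < 0.3:
        child["max_depth"] = min(12, max(1, child["max_depth"] + rng.choice([-1, 1])))
    if rng.random() < 0.3:
        child["min_samples_leaf"] = min(20, max(1, child["min_samples_leaf"] + rng.choice([-2, 2])))
    return child


population = [random_individual() for _ in range(8)]
for generation in range(5):
    ranked = sorted(population, key=fitness, reverse=True)
    survivors = ranked[:4]  # selection: the best half survives
    children = [mutate(crossover(*rng.sample(survivors, 2))) for _ in range(4)]
    population = survivors + children

best = max(population, key=fitness)
print("Best hyperparameters:", best, f"CV accuracy = {fitness(best):.3f}")
```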

---

**Summary Table**

| Method                | Speed        | Best For                      |
| --------------------- | ------------ | ----------------------------- |
| Grid Search           | Slow         | Small hyperparameter sets     |
| Random Search         | Medium       | Large search spaces           |
| Bayesian Optimization | Fast & Smart | ML models needing efficiency  |
| Hyperband             | Very Fast    | Deep Learning models          |
| Genetic Algorithms    | Medium       | Complex hyperparameter spaces |

---


**Importance of Hyperparameter Tuning for Model Performance**

- Without tuning, the model might not reach its optimal performance, leading to:

  - Underfitting: Model fails to capture the underlying patterns

  - Overfitting: Model memorizes the training data and performs poorly on unseen data
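
Both failure modes can be observed by changing a single hyperparameter. The sketch below assumes a decision tree on synthetic data with deliberately noisy labels, and compares training and test accuracy for a very shallow tree, a moderate one, and an unrestricted one.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with label noise (flip_y), so memorizing the training set hurts.
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

results = {}
for depth in [1, 4, None]:  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    results[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f"max_depth={depth}: train={results[depth][0]:.2f}, test={results[depth][1]:.2f}")
```

A depth of 1 is usually too simple to capture the pattern (underfitting), while the unrestricted tree fits the training set perfectly but scores lower on the test set (overfitting).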

**Feature Engineering and Model Selection**

- Objective

  - Perform end-to-end feature engineering, model evaluation, and hyperparameter tuning on a dataset

- Tasks

  - Task 1: Perform feature engineering

  - Task 2: Train and evaluate models

  - Task 3: Apply Grid Search for hyperparameter tuning

In [1]:
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Load Titanic Dataset
from google.colab import files
uploaded = files.upload()

Saving titanic.csv to titanic.csv


In [2]:
df = pd.read_csv("titanic.csv")
df.head()

| Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


In [4]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

# Select relevant features (copy so later assignments do not act on a view of the full frame)
df = df[["Pclass", "Sex", "Age", "Fare", "Embarked", "Survived"]].copy()

# Handle missing values: median for Age, most frequent value for Embarked
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Define features and target
X = df.drop(columns=["Survived"])
y = df["Survived"]

# Apply feature scaling and encoding
# Note: the preprocessor is fitted on the full dataset before cross-validation,
# so every validation fold contributes to the scaling statistics.
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["Age", "Fare"]),
        ("cat", OneHotEncoder(), ["Pclass", "Sex", "Embarked"])
    ]
)

X_preprocessed = preprocessor.fit_transform(X)

# Train and evaluate Logistic Regression
# cross_val_score returns one accuracy per fold. Averaging the labels returned by
# cross_val_predict would instead give the fraction of passengers predicted to
# survive (0.36 on this data), which is not an accuracy.
log_model = LogisticRegression()
log_scores = cross_val_score(log_model, X_preprocessed, y, cv=5, scoring="accuracy")
print(f"Logistic Regression Accuracy: {log_scores.mean():.2f}")

# Train and evaluate Random Forest
rf_model = RandomForestClassifier(random_state=42)
rf_scores = cross_val_score(rf_model, X_preprocessed, y, cv=5, scoring="accuracy")
print(f"Random forest Accuracy: {rf_scores.mean():.2f}")

# Define hyperparameter grid (3 x 3 x 3 = 27 combinations)
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10]
}

# Define grid search: every combination is evaluated with 5-fold cross-validation
grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,
    scoring="accuracy",
    n_jobs=-1
)
grid_search.fit(X_preprocessed, y)

# Display best hyperparameters and the corresponding mean cross-validated accuracy
print(f"Best hyperparameters: {grid_search.best_params_}")
print(f"Best Accuracy: {grid_search.best_score_:.2f}")

Logistic Regression Accuracy: (value not recorded for this run)
Random forest Accuracy: 0.81
Best hyperparameters: {'max_depth': 10, 'min_samples_split': 5, 'n_estimators': 100}
Best Accuracy: 0.83
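
One possible refinement is to wrap the preprocessing and the model in a scikit-learn `Pipeline`, so that scaling and encoding are re-fitted on the training folds only inside each cross-validation split. The sketch below is an assumption, not the notebook's method: it runs on a randomly generated stand-in table (`X_demo`, `y_demo`, hypothetical names) that only reuses the column names from above, so its scores carry no meaning.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Randomly generated stand-in with the notebook's column names (NOT the real Titanic data).
rng = np.random.default_rng(0)
n = 120
X_demo = pd.DataFrame({
    "Pclass": rng.integers(1, 4, n),
    "Sex": rng.choice(["male", "female"], n),
    "Age": rng.uniform(1, 80, n),
    "Fare": rng.uniform(5, 100, n),
    "Embarked": rng.choice(["S", "C", "Q"], n),
})
y_demo = rng.integers(0, 2, n)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["Age", "Fare"]),
        # handle_unknown="ignore" guards against a category missing from a training fold
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Pclass", "Sex", "Embarked"]),
    ]
)

pipeline = Pipeline([
    ("preprocess", preprocessor),
    ("model", RandomForestClassifier(random_state=42)),
])

# Hyperparameters of a pipeline step are addressed as "<step name>__<parameter>".
param_grid = {
    "model__n_estimators": [50, 100],
    "model__max_depth": [None, 10],
}

grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv=3, scoring="accuracy")
grid_search.fit(X_demo, y_demo)
print(f"Best hyperparameters: {grid_search.best_params_}")
```

Because the whole pipeline is the estimator, `GridSearchCV` re-fits the scaler and encoder on each training split before scoring the held-out fold.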
