# PHASE 4

## PART 1: K-NEAREST NEIGHBORS (KNN) ALGO

* KNN is used for classification & regression by leveraging nearby data points(neighbors).  
__Steps__:
> 1. Choose a point to predict.
> 2. Find the K-nearest points (K is a predefined constant like 1,  3, 5).
> 3. For **classification**: Predict by taking the most common class among the K neighbors.
> 4. For **regression**: Predict by averaging the target values of the K neighbors.
> 5. **Weighted Prediction**: KNN can also weigh the neighbors' influence based on their distance from the point being predicted.
   
> Choosing the right distance metric is crucial to the success of KNN.
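
Steps 4 and 5 above (regression and weighted prediction) can be sketched in a few lines. This is only an illustration under stated assumptions: the helper name `knn_regress`, the toy data and the inverse-distance weighting scheme are made up for this example, and other weighting schemes are possible.

```python
import numpy as np

def knn_regress(X_train, y_train, x, k=3, weighted=False):
    """Predict a numeric target for x from its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    # Euclidean distance from x to every training point
    dists = np.sqrt(((X_train - np.asarray(x, dtype=float)) ** 2).sum(axis=1))
    # indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    if not weighted:
        # plain KNN regression: simple average of the neighbors' targets
        return y_train[nearest].mean()
    # inverse-distance weights (an assumed scheme); the constant avoids division by zero
    weights = 1.0 / (dists[nearest] + 1e-9)
    return np.sum(weights * y_train[nearest]) / np.sum(weights)

# hypothetical toy data: one feature, one numeric target
X_toy = [[1.0], [2.0], [3.0], [10.0]]
y_toy = [10.0, 20.0, 30.0, 100.0]

plain = knn_regress(X_toy, y_toy, [2.2], k=3)
weighted = knn_regress(X_toy, y_toy, [2.2], k=3, weighted=True)
print(plain, weighted)
```

The weighted prediction is pulled towards the target of the closest neighbor (the point at 2.0), while the plain average treats all three neighbors equally.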

* K-means is a related algorithm but it's used for unsupervised learning and clustering.
* In K-means, K represents the number of clusters, not neighbors.
* It's an iterative algorithm that groups points based on a distance metric until convergence.
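
To make the contrast with KNN concrete, here is a minimal K-means sketch using scikit-learn's `KMeans`. The two synthetic blobs are an assumption made purely for illustration; note that no labels are passed to the model.

```python
import numpy as np
from sklearn.cluster import KMeans

# two well-separated synthetic blobs (illustrative data, no labels involved)
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(20, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(20, 2))
X_blobs = np.vstack([blob_a, blob_b])

# here K is the number of clusters, not the number of neighbors
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X_blobs)

print(cluster_ids)
print(kmeans.cluster_centers_)
```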

### **A}** DISTANCE METRICS

> * Distance metrics quantify how similar two data points are; algos like KNN rely on them.   
> * Data points that are closer in distance are more likely to belong to the same class.   
> * **Application**: Each dataset column represents a dimension, allowing distance measurement between points in a multi-dimensional space.

> MANHATTAN DISTANCE {c=1}  
* It measures the distance by traveling along grid axes. {Walking through a city block by block}. 
* Extends to higher dimensions (e.g. 3D space) by summing the absolute differences along every axis.
* **Best for grid-based problems & often works well in high-dimensional spaces**
> EUCLIDEAN DISTANCE {c=2}  
* It measures the straight-line distance between 2 points using the Pythagorean theorem.
* **Most Common. Best when shortest path is needed** 
> MINKOWSKI DISTANCE {general c, e.g. c=3 cubic, c=5 fifth}  
* Generalization of Manhattan & Euclidean Distance
* Defined by a parameter c: each absolute difference is raised to the power c, the results are summed, and the c-th root of the sum is taken.
* Applied in ML {e.g. in KNN, where c is a tunable choice}
* **Flexible because it reduces to Manhattan (c=1) or Euclidean (c=2) by changing the parameter c**
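
The three metrics above can be written as a single formula. The notation below keeps the parameter name $c$ used in these notes (many references call it $p$); $a_i$ and $b_i$ are the coordinates of points $A$ and $B$ in $n$ dimensions.

$$
d_c(A, B) = \left( \sum_{i=1}^{n} \lvert a_i - b_i \rvert^{c} \right)^{1/c}
$$

For $A = (2, 3, 5)$ and $B = (1, -1, 3)$ the absolute differences are $1, 4, 2$, so:

$$
d_1 = 1 + 4 + 2 = 7, \qquad
d_2 = \sqrt{1 + 16 + 4} = \sqrt{21} \approx 4.58, \qquad
d_3 = \sqrt[3]{1 + 64 + 8} = \sqrt[3]{73} \approx 4.18
$$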

In [2]:
# Manhattan Distance
A = (2,3,5)
B = (1,-1,3)
manhattan_dist = sum(abs(A[i] - B[i]) for i in range(3))
manhattan_dist

7

In [3]:
# Euclidean Distance
from math import sqrt
A = (2,3,5)
B = (1,-1,3)
euclidean_distance = sqrt(sum((A[i] -B[i])**2 for i in range(3)))
print(euclidean_distance)
print(f"{euclidean_distance:.2f}")

4.58257569495584
4.58


In [4]:
# Minkowski Distance
import numpy as np
A = (2,3,5)
B = (1,-1,3)
c = 3
# Calculate the Minkowski distance btn points A and B
# 1. For each dimension, calculate the absolute difference between A[i] and B[i].
# 2. Raise that difference to the power of c.
# 3. Sum these powered differences.
# 4. Take the c-th root of the sum (raise the sum to the power of 1/c).
minkowski_distance = np.power(sum(np.abs(A[i] - B[i])**c for i in range(3)),1/c)
print(minkowski_distance)
print(f"{minkowski_distance:.2f}")

4.179339196381232
4.18


### **B}** K- NEAREST NEIGHBORS

> OVERVIEW 
* It's a supervised learning algo for classification & regression tasks.
* Its principle is that similar data points are close together, and distance metrics help identify similarity.
> FIT STAGE
* It stores training data with labels without calculating distances.
> PREDICTION STAGE
* Calculates the distances between the new data pt and every data pt in the training set.
* Identifies the K closest points (neighbors) and assigns a class based on the majority vote among those neighbors.
> DISTANCE METRICS
* Uses distance metrics like Manhattan, Euclidean or Minkowski depending on the problem
> EVALUATING PERFORMANCE
* **FOR CLASSIFICATION**: performance is measured using Accuracy, Precision, Recall and F1-Score.
* **FOR REGRESSION**: predictions are the average of the K nearest neighbors' target values, so performance is measured with error metrics such as MSE, RMSE or R².

### **C}** K- NEAREST NEIGHBORS Classifier - {*Used iris dataset*}

> **Fit Method**
* Stores training data for later use
> **_get_distances Method**
* Calculates Euclidean distance btn test point and every training pt.
> **_get_k_nearest Method**
* Sorts the distances and returns the indices of the k-nearest neighbors
> **_get_label_prediction Method**
* Finds the most common label among the k-nearest neighbors
> **Predict method**
* Generates predictions for all test points using above methods
> **Testing Method**
* Model is tested on Iris dataset and should have an output score around 97%

In [5]:
# Import relevant libraries
from scipy.spatial.distance import euclidean
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Define KNN Class
class KNN:
    def __init__(self):
        pass
    def fit(self,X_train, y_train):
        # store training data and labels
        self.X_train = X_train
        self.y_train = y_train
    def _get_distances(self,x):
        # Create an empty list to store distances
        distances = []
        # enumerate through the training data to calculate distances
        for idx,point in enumerate(self.X_train):
            dist = euclidean(x,point)
            distances.append((idx,dist))
        return distances
    def _get_k_nearest(self,dists,k):
        # sort the distances by the second value in each tuple(distance)
        sorted_dists = sorted(dists,key=lambda x:x[1])
        # return the first k tuples (the k nearest neighbors)
        return sorted_dists[:k]
    def _get_label_prediction(self,k_nearest):
        # get the labels for the k_nearest neighbors
        labels = [self.y_train[idx] for idx,_ in k_nearest]
        # count the frequency of each label
        counts = np.bincount(labels)
        # return the label with the highest frequency
        return np.argmax(counts)
    def predict(self,X_test,k=3):
        # A list to store the predictions
        preds = []
        # Iterate through all the test points
        for x in X_test:
            # get distances to all training pts
            distances = self._get_distances(x)
            # get k_nearest pts
            k_nearest = self._get_k_nearest(distances,k)
            # predict label based on the nearest neighbors
            pred = self._get_label_prediction(k_nearest)
            # Append the prediction to the list
            preds.append(pred)
        # return the predictions for all the test pts
        return preds
# load Iris dataset
iris = load_iris()
data = iris.data
target = iris.target

# split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size = 0.25, random_state = 0)

# Instantiate and fit the KNN model
knn = KNN()
knn.fit(X_train,y_train)

# generate predictions for the test set
preds = knn.predict(X_test, k=3)

#calculate & print the accuracy
print(f"Accuracy Score: {accuracy_score(y_test,preds)}")

Accuracy Score: 0.9736842105263158


### **D}** Finding Best value for K in KNN

> Optimal K value in KNN: 
* **A small K (k=1)**: Can lead to *overfitting* whereby the model is too sensitive to small variations
* **A large K**: Can lead to *underfitting* whereby the model oversimplifies and misses important patterns
* **Odd values for K (k=3,k=5)** help avoid ties in binary classification
* Generally there is no universally best value for K
> Iterating to find best K:
* Try different values of K, especially the odd numbers
* Plot the error for each K value.
* Choose the K where the error is lowest or it has the highest performance
> KNN & Curse of Dimensionality
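* Due to the curse of dimensionality, KNN struggles with high-dimensional data that has many columns (features): as dimensions are added, distances between points become less informative.
* It's also inefficient for very large datasets (thousands of columns, millions of rows): every prediction computes the distance to every training point, so prediction time grows with both the number of rows and the number of columns.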

<img src="Images/1. Best K value.webp" alt="Best K Value" width="300" height="300">

> A smaller K (like K=1) leads to overfitting, while a larger K may lead to underfitting.    
> The optimal K value is where error is lowest as shown by K=3
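
A rough way to see the curse of dimensionality mentioned above is to measure how much the nearest and farthest points differ as features are added. The sketch below uses uniformly random data and a made-up helper `distance_contrast`; both are assumptions for illustration only.

```python
import numpy as np

def distance_contrast(n_features, n_points=500, seed=0):
    """Relative gap between the farthest and nearest point from a random query."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_features))
    query = rng.random(n_features)
    # Euclidean distance from the query to every point
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

for n_features in (2, 10, 100, 1000):
    print(n_features, round(distance_contrast(n_features), 3))
```

As the number of features grows, the contrast shrinks: the "nearest" neighbor is barely nearer than any other point, which weakens the assumption KNN relies on.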

### **E}** KNN with Scikit-Learn -Titanic Dataset Complete Summary

> 1. Import Data & Libraries

In [6]:
# import relevant libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_score,accuracy_score,recall_score,f1_score

raw_df = pd.read_csv(r"C:\Users\User\Documents\Moringa_labs\PHASE 4\1.-Phase-4-SUMMARY-\DATA\1. titanic.csv")
raw_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [7]:
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


> 2. Data Preprocessing

In [8]:
# Dropping unnecessary columns
df = raw_df.drop(columns=["PassengerId","Name","Ticket","Cabin"])
# Convert "sex column" to binary encoding 
df['Sex'] = df['Sex'].replace({"male":0 , "female":1})
# Handle missing values in "Age column" - fill with median
df["Age"] = df["Age"].fillna(df["Age"].median())
# drop rows with missing "Embarked" values
df = df.dropna(subset=["Embarked"])
# One hot encode Embarked Column
df = pd.get_dummies(df, columns= ["Embarked"], drop_first=True)
# Separate target("Survived") from features
labels = df["Survived"]
df = df.drop(columns= ["Survived"])

df.head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked_Q,Embarked_S
0,3,0,22.0,1,0,7.25,0,1
1,1,1,38.0,1,0,71.2833,0,0
2,3,1,26.0,0,0,7.925,0,1
3,1,1,35.0,1,0,53.1,0,1
4,3,0,35.0,0,0,8.05,0,1


> Split Data into Training & Testing Sets

In [9]:
X_train, X_test, y_train, y_test = train_test_split(df, labels, test_size=0.25, random_state=42)

> Normalize the Data

In [10]:
# normalize using stdscaler
scaler = StandardScaler()
# fit the scaler on training data then transform both train and test data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Convert into a dataframe
scaled_df_train = pd.DataFrame(X_train_scaled, columns = df.columns)
scaled_df_train.head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked_Q,Embarked_S
0,0.815528,1.390655,-0.575676,-0.474917,-0.480663,-0.500108,-0.311768,0.620174
1,-0.386113,1.390655,1.550175,-0.474917,-0.480663,-0.435393,-0.311768,0.620174
2,-0.386113,-0.719086,-0.120137,-0.474917,-0.480663,-0.644473,-0.311768,0.620174
3,-1.587755,-0.719086,-0.120137,-0.474917,-0.480663,-0.115799,-0.311768,0.620174
4,0.815528,1.390655,-1.107139,0.413551,-0.480663,-0.356656,-0.311768,-1.612452


> Train KNN Classifier

In [11]:
# Instantiate the classifier with default parameter
clf = KNeighborsClassifier()
# fit classifier to training data
clf.fit(X_train_scaled, y_train)
# Use trained model to make predictions on the test set
test_preds = clf.predict(X_test_scaled)

> Evaluate the model

In [12]:
# A function to print key evaluation metrics
def print_metrics(y_true, y_pred):
    print(f"Precision Score: {precision_score(y_true,y_pred):.2f}")
    print(f"Recall Score: {recall_score(y_true,y_pred):.2f}")
    print(f"Accuracy Score: {accuracy_score(y_true,y_pred):.2f}")
    print(f"F1 Score: {f1_score(y_true,y_pred):.2f}")

#Evaluate model performance on test set
print_metrics(y_test, test_preds )

Precision Score: 0.71
Recall Score: 0.73
Accuracy Score: 0.79
F1 Score: 0.72


> Tune K value (Hyperparameter Tuning)

In [13]:
# Function to find the best k value by iterating over odd values
def find_best_k(X_train, y_train, X_test, y_test, min_k=1, max_k=25):
    best_k = min_k
    best_score = 0

    #iterating over odd values of K
    for k in range(min_k, max_k + 1, 2):
        clf = KNeighborsClassifier(n_neighbors = k)
        clf.fit(X_train, y_train)
        preds = clf.predict(X_test)
        f1 = f1_score(y_test, preds)

        if f1 > best_score: 
            best_score = f1
            best_k = k

    print(f"The Best K {best_k} with F1 Score of {best_score}")

# Find Best K value
find_best_k(X_train_scaled, y_train, X_test_scaled, y_test)

The Best K 17 with F1 Score of 0.7468354430379746
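
The function above selects K by scoring on the test set, so the reported test score is likely to be optimistic. A common alternative is to choose K with cross-validation on the training data only and keep the test set for one final check. The sketch below assumes a synthetic dataset from `make_classification` rather than the Titanic data, and the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic binary classification data (stand-in for a real dataset)
X_syn, y_syn = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=42
)

# mean 5-fold F1 score for each odd K, computed on the training split only
cv_scores = {}
for k in range(1, 26, 2):
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_scores[k] = cross_val_score(model, X_tr, y_tr, cv=5, scoring="f1").mean()

best_k_cv = max(cv_scores, key=cv_scores.get)

# refit with the chosen K and evaluate once on the untouched test split
final_model = make_pipeline(
    StandardScaler(), KNeighborsClassifier(n_neighbors=best_k_cv)
)
final_model.fit(X_tr, y_tr)
print(best_k_cv, cv_scores[best_k_cv], final_model.score(X_te, y_te))
```

Putting the scaler inside the pipeline means it is refit on each training fold, so no information from the validation fold leaks into the scaling.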


## PART 2: Model Tuning, Pipelines, & Ensemble Methods

### **A}** GridSearchCV Intro

> Advanced Supervised Learning Techniques such as:
   1. Hyperparameter tuning (GridSearchCV)  
   2. Machine Learning Pipelines  
   3. Model Persistence (saving the model)  
> 1. Tuning model hyperparameters with GridSearchCV
* **A} Hyperparameters**: values set before training a model, e.g.
    1. Maximum tree depth in decision trees
    2. Number of neighbors in K-nearest neighbors
    * They help balance the bias-variance trade-off:
        1. high bias {underfitting}
        2. high variance {overfitting}
* **B} GridSearchCV**: It automates searching thru combinations of hyperparameters to optimize model performance.  
    * It does an exhaustive search and evaluates each model using K-Fold Cross Validation.
    * Tuning parameters include: criterion, max_depth, min_samples_split for decision trees
    * **Disadvantage**: Time consuming & computationally expensive, so in practice we tune only the most important hyperparameters.
> 2. Machine Learning Pipelines  
   
  * Pipelines help streamline the ML workflow.  
  * A single pipeline can handle data transformation (cleaning & encoding), model training and prediction.  
  * It helps prevent data leakage {where info from test data influences model training}.
  * Integrating GridSearchCV into a pipeline automates hyperparameter tuning within the workflow.
> 3. Save & Reuse model with Pickle  
  * Pickling a model saves it to disk, so you don't have to retrain it each time.
  * Pickled models can be reused for future predictions and are often deployed in production, e.g. behind APIs.
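
A minimal sketch of saving and reloading a fitted model with `pickle`. The file name `knn_model.pkl` and the use of a temporary directory are assumptions made so the example cleans up after itself; a real project would choose its own path.

```python
import os
import pickle
import tempfile

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=3).fit(X_iris, y_iris)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "knn_model.pkl")  # hypothetical file name
    # save the fitted model to disk
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    # load it back without retraining
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

# the restored model gives the same predictions as the original
same_predictions = bool((restored.predict(X_iris) == model.predict(X_iris)).all())
print(same_predictions)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code. A pickled model should also be loaded with the same library version it was saved with.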
> Parameter Tuning with GridSearchCV
  * Grid search allows tuning multiple parameters at once while searching for the best combination.
  * It can be resource-intensive, esp. with complex models and large datasets.
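
The grid search described above could look roughly like this for a decision tree, using the three hyperparameters named earlier. The Iris dataset and the specific grid values are assumptions chosen to keep the example small.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.25, random_state=0
)

# 2 * 4 * 3 = 24 combinations, each evaluated with 5-fold cross-validation
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 3, 5, None],
    "min_samples_split": [2, 5, 10],
}
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0), param_grid, cv=5, scoring="accuracy"
)
grid.fit(X_tr, y_tr)

print(grid.best_params_)
print(grid.best_score_)        # mean cross-validated accuracy of the best combination
print(grid.score(X_te, y_te))  # accuracy of the refit best model on the test split
```

With 24 combinations and 5 folds the search fits 120 models plus one final refit, which shows why the cost grows quickly as more parameters are added to the grid.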


### **B}** PIPELINES INTRO

> Pipelines allow you to combine multiple steps into a single manageable workflow. Steps include:
  * data preprocessing
  * feature selection
  * model training 
> Pipelines support integration with grid search & cross-validation, streamlining hyperparameter tuning.
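
A short sketch of a pipeline combined with grid search, assuming the Iris dataset and a scaler + KNN pipeline. The step names `scale` and `knn` are arbitrary labels; the `knn__` prefix in the grid routes each parameter to that step.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)

pipe = Pipeline(steps=[
    ("scale", StandardScaler()),      # preprocessing step
    ("knn", KNeighborsClassifier()),  # model step
])

# p is the Minkowski power parameter: p=1 is Manhattan, p=2 is Euclidean
param_grid = {
    "knn__n_neighbors": [1, 3, 5, 7, 9],
    "knn__p": [1, 2],
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_iris, y_iris)

print(search.best_params_, round(search.best_score_, 3))
```

Because the scaler lives inside the pipeline, it is fit only on the training portion of each fold, which is how pipelines help avoid data leakage during tuning.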

### **C}** REFACTORING CODE TO USE PIPELINES 

##### **1.** CODE WITHOUT PIPELINES

* **PREPROCESSING**: Transform data into a form suitable for ML models
* **APPROACH**: The code manually applies the preprocessing steps in sequence.
* **TRANSFORMER**: {OneHotEncoder, StandardScaler} transform specific features.

In [14]:
# dataset has both categorical & numerical columns
example_data = pd.DataFrame([
    {"category": "A", "number": 7, "target": 1},
    {"category": "A", "number": 8, "target": 1},
    {"category": "B", "number": 9, "target": 0},
    {"category": "B", "number": 7, "target": 1},
    {"category": "C", "number": 4, "target": 0}
])
# separate features and target variable
example_X = example_data.drop("target", axis=1)
example_y = example_data["target"]

# one hot encode the category column to convert it to numeric format
from sklearn.preprocessing import OneHotEncoder
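# sparse_output replaces the older `sparse` argument (scikit-learn >= 1.2; use sparse=False on older versions)
ohe = OneHotEncoder(categories= "auto", handle_unknown= "ignore", sparse_output= False)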
category_encoded = ohe.fit_transform(example_X[["category"]])

# Convert the one-hot encoded array to a DataFrame
category_encoded_df = pd.DataFrame(category_encoded, columns=ohe.categories_[0], index=example_X.index)

# Feature engineering: {number_odd: indicates if value is odd(1) or even(0)}
from sklearn.preprocessing import FunctionTransformer
func_transformer = FunctionTransformer(lambda x: x%2)
number_odd = func_transformer.fit_transform(example_X["number"])
example_X["number_odd"] = number_odd

# Combine the encoded category DataFrame with the original DataFrame (without the category column)
example_X = example_X.drop("category", axis=1)  # Drop original 'category' column
example_X = pd.concat([category_encoded_df, example_X], axis=1)  #concatenate encoded features with numeric features

# Standardscaler to normalize numerical data so that all features have a mean 0 and stdev 1
scaler = StandardScaler()
data_scaled = scaler.fit_transform(example_X)
# convert scaled data back into a dataframe
scaled_df = pd.DataFrame(data_scaled, columns=example_X.columns, index=example_X.index)

# Display the transformed DataFrame
print(scaled_df)

          A         B    C    number  number_odd
0  1.224745 -0.816497 -0.5  0.000000    0.816497
1  1.224745 -0.816497 -0.5  0.597614   -1.224745
2 -0.816497  1.224745 -0.5  1.195229    0.816497
3 -0.816497  1.224745 -0.5  0.000000    0.816497
4 -0.816497 -0.816497  2.0 -1.792843   -1.224745


In [15]:
# FULL CODE FROM GPT FOR DEEP UNDERSTANDING
def preprocess_data_without_pipeline(X):
    """
    Manually preprocess data without using a pipeline.
    Steps include:
    1. Encoding categorical data
    2. Feature engineering (adding new features)
    3. Scaling all features
    
    Args:
    X (DataFrame): Input data with categorical and numerical columns.

    Returns:
    DataFrame: Transformed dataset.
    List: A list of transformers used during the process.
    """
    
    # List to store transformers (e.g., encoders, scalers)
    transformers = []

    ### 1. Encoding Categorical Data ###
    
    # Step 1: Instantiate OneHotEncoder for the 'category' column.
    # sparse_output replaces the older `sparse` argument (scikit-learn >= 1.2)
    ohe = OneHotEncoder(categories="auto", handle_unknown="ignore", sparse_output=False)
    
    # Step 2: Apply OneHotEncoder to the 'category' column.
    category_encoded = ohe.fit_transform(X[["category"]])
    
    # Step 3: Convert the encoded array into a DataFrame for readability.
    category_encoded = pd.DataFrame(
        category_encoded,  # Encoded values
        columns=ohe.categories_[0],  # Categorical labels from the encoder
        index=X.index  # Keep the original index
    )
    
    # Step 4: Store the encoder in the list of transformers for later use.
    transformers.append(ohe)
    
    # Step 5: Drop the original 'category' column and append the encoded version.
    X.drop("category", axis=1, inplace=True)
    X = pd.concat([category_encoded, X], axis=1)

    ### 2. Feature Engineering ###
    
    # Step 6: Define a helper function to flag odd numbers (returns 1 if odd, 0 if even).
    def is_odd(data):
        return data % 2

    # Step 7: Create a transformer using FunctionTransformer.
    func_transformer = FunctionTransformer(is_odd)

    # Step 8: Apply the transformer to the 'number' column to create the 'number_odd' feature.
    number_odd = func_transformer.fit_transform(X["number"])
    
    # Step 9: Store the transformer.
    transformers.append(func_transformer)
    
    # Step 10: Add the new 'number_odd' feature to the dataset.
    X["number_odd"] = number_odd

    ### 3. Scaling the Data ###
    
    # Step 11: Instantiate a StandardScaler to standardize the features.
    scaler = StandardScaler()

    # Step 12: Fit and apply the scaler to the entire dataset.
    data_scaled = scaler.fit_transform(X)
    
    # Step 13: Store the scaler for future use (e.g., inverse transform).
    transformers.append(scaler)

    # Step 14: Replace the original data with the scaled version, converting back to DataFrame.
    X = pd.DataFrame(
        data_scaled,  # Scaled values
        columns=X.columns,  # Preserve the original column names
        index=X.index  # Keep the same index
    )

    # Return the transformed dataset and the list of transformers used.
    return X, transformers

# Example usage
# Remove target column from the example data for preprocessing
example_X = example_data.drop("target", axis=1)

# Apply the preprocessing function to the dataset
result, transformers = preprocess_data_without_pipeline(example_X)

# View the transformed dataset
result


Unnamed: 0,A,B,C,number,number_odd
0,1.224745,-0.816497,-0.5,0.0,0.816497
1,1.224745,-0.816497,-0.5,0.597614,-1.224745
2,-0.816497,1.224745,-0.5,1.195229,0.816497
3,-0.816497,1.224745,-0.5,0.0,0.816497
4,-0.816497,-0.816497,2.0,-1.792843,-1.224745


##### **2.** CODE WITH PIPELINES

> 1. PIPELINE CLASS
* Helps streamline preprocessing & model steps, keeping transformations consistent
* It standardizes features & feeds them to a logistic regression model.
> 2. ColumnTransformer Class
* Allows application of distinct transformations to different columns.
* e.g. Apply OHE to categorical data only and not to the numeric data
* After transformation, pipelines return a NumPy array, which is easily converted to a DataFrame.
> 3. FeatureUnion Class
* It applies multiple transformations & concatenates their results 
* e.g. A feature to flag whether a number is odd while also encoding categorical data
> A complete pipeline would include all the 3.

In [16]:
import pandas as pd
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer, StandardScaler


def preprocess_data_with_pipeline(X):
    # 1. define the columntransformer for original features encoding
    original_features_encoded = ColumnTransformer(
        transformers = [
            ("ohe", OneHotEncoder(categories= "auto", handle_unknown= "ignore"), ["category"])
        ],
        remainder = "passthrough"
    )
    # 2. create a fn for feature engineering to identify odd and even numbers
    def is_odd(data):
        return data % 2 # 1 for odd nbs & 0 for even numbers
    # 3. Create a column transformer for feature engineering
    feature_eng = ColumnTransformer(
        transformers = [
         # Apply Function transformer to create a new number odd feature
            ("add_number_odd", FunctionTransformer(is_odd), ["number"]) 
        ],
        # drop unspecified columns in the transformer
        remainder ="drop" 
    )
    # 4. Combine transformations into a FeatureUnion
    feature_union = FeatureUnion(transformer_list = [
        # encoded features from the original features transformer
        ("encoded_features", original_features_encoded),
        # Engineered features from feature engineering transformer
        ("engineered_features", feature_eng)
    ])
     # 5. Create a pipeline to manage the transformation and scaling
    pipe = Pipeline(steps=[
        # Apply feature union transformations
        ("feature_union", feature_union),
        # scale the resulting features using standard scaler
        ("scale", StandardScaler())
     ])
    # 6. Fit the pipeline to data and transform it
    transformed_data = pipe.fit_transform(X)
    # 7. Extract category labels for final dataframe creation
    encoder = original_features_encoded.named_transformers_["ohe"]
    category_labels = encoder.categories_[0]
    # 8. Combine transformed data into a dataframe with appropriate column names
    all_cols = list(category_labels) + ["number", "number_odd"]
    return pd.DataFrame(transformed_data, columns= all_cols, index = X.index), pipe
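
The notes above mention feeding the transformed features to a logistic regression model. A minimal sketch of that end-to-end idea follows, assuming the same five-row toy data and a simpler layout (one `ColumnTransformer` plus a model, without the `FeatureUnion` step). The names `toy`, `preprocess` and `model` are illustrative.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy = pd.DataFrame([
    {"category": "A", "number": 7, "target": 1},
    {"category": "A", "number": 8, "target": 1},
    {"category": "B", "number": 9, "target": 0},
    {"category": "B", "number": 7, "target": 1},
    {"category": "C", "number": 4, "target": 0},
])
toy_X = toy.drop("target", axis=1)
toy_y = toy["target"]

# one-hot encode the categorical column, scale the numeric column
preprocess = ColumnTransformer(
    transformers=[
        ("ohe", OneHotEncoder(handle_unknown="ignore"), ["category"]),
        ("scale", StandardScaler(), ["number"]),
    ]
)

# preprocessing and model in a single estimator
model = Pipeline(steps=[
    ("preprocess", preprocess),
    ("logreg", LogisticRegression()),
])
model.fit(toy_X, toy_y)

# "D" was never seen during fit; handle_unknown="ignore" encodes it as all zeros
new_rows = pd.DataFrame([{"category": "D", "number": 6}])
print(model.predict(toy_X), model.predict(new_rows))
```

Calling `predict` on new rows reuses the encoder and scaler fitted on the training data, so the same transformations are applied consistently.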

### **D}** ENSEMBLE MODELS

* They combine multiple models to improve prediction accuracy, often outperforming individual models.
* *Ensemble Models* leverage the wisdom of crowds {averaging multiple individual estimates}, which often yields better results. 
    * Bootstrap Aggregation {bagging} is introduced here
    * Boosting focuses on improving weak models
    * Bagging reduces variance by creating multiple models from random subsets of data
* *Random Forests* are an ensemble method for decision trees. They use bagging & subspace sampling to create a forest of decision trees. This improves accuracy over a single tree
* *GridSearchCV* tunes decision trees, allowing an exhaustive parameter search to optimize model performance.
* *Gradient Boosting* introduces the concept of weak learners & powerful ensemble methods like AdaBoost and Gradient Boosted Trees.
* *XGBOOST* {eXtreme Gradient Boosting} is an effective boosting algorithm.   

* Ensembles are key to winning Kaggle competitions.
* **Bagging (Bootstrap Aggregation)**: Creates multiple models using different subsets of data & aggregates their predictions, thus reducing variance.
* **Tree-based models**: e.g. Random Forests & Gradient Boosted Trees are resilient to high variance.


### **E}** RANDOM FORESTS

> INTRODUCTION
* Random forests use decision trees for both classification and regression tasks.
* Random forests provide a balance between accuracy, interpretability & robustness. They help tackle complex real-world problems.
* **Main Idea** is to create multiple decision trees with varying inputs to improve prediction accuracy and robustness of the model.
* They mitigate overfitting and handle noisy data much better.
> MAIN CONCEPTS
* **Bagging (Bootstrap Aggregation)**: Train each tree on a different bootstrap sample of the data (drawn with replacement). About 2/3 of the rows end up in each tree's sample, while the rest are left out for out-of-bag {OOB} error estimation.
* **Subspace sampling method**: For each tree, a random subset of features is chosen at each node, ensuring no single tree over-relies on any one feature.
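
The "about 2/3" figure for bagging can be checked with a quick simulation: draw one bootstrap sample (same size as the data, with replacement) and count how many distinct rows it contains. The row count of 10,000 is an arbitrary choice for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 10_000

# one bootstrap sample: n_rows draws with replacement from the row indices
bootstrap_idx = rng.integers(0, n_rows, size=n_rows)

# fraction of distinct rows that made it into the sample
in_bag_fraction = np.unique(bootstrap_idx).size / n_rows
# rows never drawn are "out of bag" for this tree
oob_fraction = 1 - in_bag_fraction

print(round(in_bag_fraction, 3), round(oob_fraction, 3))
```

The in-bag fraction settles near 1 - 1/e (roughly 0.63), which is where the rule of thumb of two thirds in, one third out comes from.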
> RESILIENCE TO OVERFITTING
* Randomness in data & feature sampling ensures the model is resilient to overfitting.
* Noisy data in one tree does not affect the overall model's performance because other trees rely on different features.
> PREDICTION PROCESS
* Once all trees have been trained, the algo aggregates the predictions from all trees {often using majority voting for classification or averaging for regression}
> BENEFITS
1. **Strong Performance**: Works well on a wide range of Data Science problems
2. **Interpretability**: Individual trees are interpretable and it's possible to examine feature importance.
> DISADVANTAGES
1. **Computational Complexity**: Training many trees is time consuming and the computational cost is high.
2. **Memory usage**: Random Forests require significant memory since they need to store every tree.
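
A small sketch of a random forest with scikit-learn, showing the out-of-bag score and feature importances discussed above. The Iris dataset and the setting of 200 trees are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()

forest = RandomForestClassifier(
    n_estimators=200,     # number of trees in the forest
    max_features="sqrt",  # subspace sampling: features considered at each split
    oob_score=True,       # estimate accuracy from the rows each tree did not see
    random_state=0,
)
forest.fit(iris.data, iris.target)

print(round(forest.oob_score_, 3))
for name, importance in zip(iris.feature_names, forest.feature_importances_):
    print(name, round(importance, 3))
```

The OOB score gives a validation-style estimate without a separate hold-out set, and the importances (which sum to 1) indicate which features the trees relied on most.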

### **F}** GRADIENT BOOSTING

> INTRODUCTION
* Gradient boosting helps improve accuracy.
* It builds models iteratively with each new model correcting the errors of the previous model.
* Iteratively improves model performance by focusing on the hardest examples. 
* **Adaboost** uses weighted data to adjust training
* **Gradient Boosting** leverages gradient descent to minimize errors
> WEAK LEARNERS & BOOSTING
* **Weak Learners**: Simple models that perform slightly better than random guessing. 
* Boosting uses weak learners to improve prediction iteratively.  
      **Boosting Process**  
      1. Train a weak learner  
      2. Identify the errors it makes  
      3. Train the next learner to focus on those errors  
      4. Repeat to create a strong model from many weak ones
> UNDERSTANDING Adaboost
* **AdaBoost**: Trains weak learners on different subsamples of data, adjusting the weights of examples after each iteration so that misclassified examples get more attention.
> UNDERSTANDING GRADIENT BOOSTING  
* **Gradient Descent**: In gradient boosting, models are trained using gradient descent to minimize loss.
* The algo identifies where predictions are wrong, calculates residuals & adjusts parameters.
> LEARNING RATE
* **Learning Rate (γ)**: A scalar that controls how much the model's parameters are adjusted at each step.
* A lower learning rate makes slower but more precise adjustments.
* A higher learning rate can overshoot optimal values
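
The boosting loop and the role of the learning rate can be sketched from scratch for a regression problem. This is a simplified illustration, not how library implementations are written: the noisy sine data, the use of depth-1 trees (stumps) as weak learners, and the values of `learning_rate` and `n_rounds` are all assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# synthetic regression data: a noisy sine wave
rng = np.random.default_rng(0)
X_gb = np.linspace(0, 6, 200).reshape(-1, 1)
y_gb = np.sin(X_gb).ravel() + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_rounds = 100

# start from a constant prediction (the mean of the targets)
prediction = np.full_like(y_gb, y_gb.mean())
mse_history = [np.mean((y_gb - prediction) ** 2)]
trees = []

for _ in range(n_rounds):
    # residuals are the negative gradient of the loss 0.5 * (y - prediction)^2
    residuals = y_gb - prediction
    # weak learner: a depth-1 tree fit to the current errors
    stump = DecisionTreeRegressor(max_depth=1).fit(X_gb, residuals)
    # take a small step in the direction that reduces the errors
    prediction += learning_rate * stump.predict(X_gb)
    trees.append(stump)
    mse_history.append(np.mean((y_gb - prediction) ** 2))

print(mse_history[0], mse_history[-1])
```

Each round corrects a fraction of the remaining error, so the training error falls gradually; a smaller `learning_rate` would need more rounds to reach the same fit.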

> Boosting vs Random Forests  

|SIMILARITIES | DIFFERENCES|
|---------|-----------|
| Both methods are ensembles that aggregate the predictions of multiple models | i) Boosting trees are weak learners ii) Random forest trees are strong learners |
| - | i) Boosting uses a weighted aggregation of predictions ii) Random forests use majority voting (or averaging for regression) |

### **G}** XGBOOST (eXtreme Gradient Boosting)

> INTRODUCTION
* It's the leading gradient boosting algo.
* Widely recognized for its speed & performance in ML Tasks esp in Kaggle
> WHY IS XGBOOST SPECIAL?
* It outperforms other gradient boosting algos thru multiple optimizations:
  * *Parallelization*: During tree construction, it uses all CPU cores making it **faster**.
  * *Distributed Training*: Can be run across multiple computers allowing for **Scalability**
  * *Handling Missing Data* : Automatically manages missing values reducing the need for preprocessing.
* These optimizations make it one of the fastest, most efficient gradient boosting tools available, and it is widely used in real-world applications.
> APPLICATION OF XGBOOST
1. **Competitions**: Dominant in Data Science competitions e.g. Kaggle due to its top-tier performance in classification tasks.
2. **Best-in Class**: Outperforms most algos{except deep learning in some cases} thus making it a go-to choice for many practical applications