
# Bagging and Random Forests

**Understanding Bagging (Bootstrap Aggregating)**

- What is Bagging?

  - Ensemble learning technique that trains multiple models on different subsets of the data, created by random sampling with replacement, and combines their predictions

  - Regression: Average the predictions of individual models

  - Classification: Use majority voting to determine the final class

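- Why use Bagging?

  - Reduces variance: averaging over models trained on different bootstrap samples smooths out the fluctuations of any single model

  - Improves robustness: the ensemble is less sensitive to noise and outliers in the training data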

- Applications

  - Bagging is commonly used with decision trees, which are prone to high variance
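The bagging procedure described above can be sketched by hand: draw bootstrap samples, fit one decision tree per sample, and combine the trees by majority vote. This is a minimal illustration, not a reference implementation. The breast cancer dataset, the choice of 25 trees, the random seeds, and all variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rng = np.random.default_rng(42)
n_models = 25  # illustrative; an odd count avoids tied votes on binary labels
n_train = len(X_train)

trees = []
for _ in range(n_models):
    # Bootstrap sample: draw n_train row indices with replacement
    idx = rng.integers(0, n_train, size=n_train)
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Stack per-tree predictions: shape (n_models, n_test)
all_preds = np.stack([t.predict(X_test) for t in trees])

# Majority vote for 0/1 labels: predict 1 when at least half the trees say 1
bagged_pred = (all_preds.mean(axis=0) >= 0.5).astype(int)

# Baseline: one tree trained on the full training set
single_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

print("Single tree accuracy:", single_tree.score(X_test, y_test))
print("Bagged trees accuracy:", (bagged_pred == y_test).mean())
```

For regression, the vote would be replaced by the mean of the per-model predictions. In practice, scikit-learn's `BaggingClassifier` and `BaggingRegressor` wrap this loop.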

**Introduction to Random Forests**

- What is a Random Forest?

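  - Ensemble learning method that builds multiple decision trees using bagging and adds random feature selection at each split

  - Key features of Random Forests

    - Bootstrap Sampling: each tree is trained on a random sample of the training data drawn with replacement

    - Feature Randomness: each split considers only a random subset of the features

    - Prediction Aggregation: the trees' outputs are combined by voting (classification) or averaging (regression)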

  - Advantages

    - Handles both regression and classification tasks efficiently

    - Works well with high-dimensional data

    - Reduces overfitting compared to single decision trees
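To make the aggregation step concrete, the sketch below fits a small forest and rebuilds its prediction from the individual trees. It relies on the fact that scikit-learn's `RandomForestClassifier` averages the class probabilities of its trees rather than counting hard votes. The dataset, the choice of 20 trees, and the variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(
    n_estimators=20,      # illustrative number of trees
    bootstrap=True,       # each tree sees a bootstrap sample of the rows
    max_features="sqrt",  # each split considers a random subset of features
    random_state=42,
)
rf.fit(X, y)

# Prediction aggregation: average the per-tree class probabilities
tree_probas = np.stack([tree.predict_proba(X) for tree in rf.estimators_])
manual_proba = tree_probas.mean(axis=0)  # shape (n_samples, n_classes)

print("Number of trees:", len(rf.estimators_))
print("Matches rf.predict_proba:", np.allclose(manual_proba, rf.predict_proba(X)))
```

The fitted trees are exposed through `rf.estimators_`, which makes it possible to inspect any single tree when debugging a forest.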

**Key Parameters in Random Forests**

- Number of Trees (n_estimators)

  - The number of decision trees in the forest

  - Larger values reduce variance but increase computational cost

- Maximum Depth (max_depth)

  - Limits the depth of each tree to prevent overfitting

  - Shallower trees are less prone to overfitting but may underfit

- Feature Selection (max_features)

  - Number of features to consider when looking for the best split

  - Options

    - "sqrt" | "log2" | None (None considers all features)

- Minimum Samples per Leaf (min_samples_leaf)

  - Minimum number of samples required in a leaf node

  - Prevents overly complex trees by ensuring each leaf contains enough samples
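As a rough illustration of how these parameters are passed and what they constrain, the sketch below fits a forest with explicit settings and inspects the resulting trees. The specific values (50 trees, depth 3, 5 samples per leaf), the dataset, and the variable names are assumptions chosen for the example, not tuned recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)  # 30 features

rf = RandomForestClassifier(
    n_estimators=50,       # number of trees in the forest
    max_depth=3,           # cap on the depth of every tree
    max_features="sqrt",   # features considered at each split
    min_samples_leaf=5,    # minimum samples required in a leaf
    random_state=42,
)
rf.fit(X, y)

# Inspect the fitted trees to see the constraints in effect
depths = [tree.get_depth() for tree in rf.estimators_]
features_per_split = rf.estimators_[0].max_features_

print("Trees built:", len(rf.estimators_))
print("Deepest tree:", max(depths))
print("Features considered per split:", features_per_split)
```

With 30 input features, `max_features="sqrt"` resolves to 5 candidate features per split, and no tree can grow deeper than `max_depth`.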

**1. Train a Random Forest Classifier on a dataset, tune its parameters, and evaluate its performance**

In [5]:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV, train_test_split

# Load Dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split Dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display dataset info
print("Features:", data.feature_names)
print("Labels:", data.target_names)

# Train a baseline Random Forest with default hyperparameters
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)

# Predict on the held-out test set
y_pred = rf_model.predict(X_test)

# Evaluate performance
accuracy = accuracy_score(y_test, y_pred)
print("Random Forest Accuracy: \n", accuracy)
print("Classification report: \n", classification_report(y_test, y_pred))

# Define hyperparameter grid
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20],
    "max_features": ["sqrt", "log2"],
}

grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,
    scoring="accuracy",
    n_jobs=-1
)
grid_search.fit(X_train, y_train)

# Display best parameters and cross-validation score
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best Cross-Validation Accuracy: {grid_search.best_score_}")

Features: ['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']
Labels: ['malignant' 'benign']
Random Forest Accuracy: 
 0.9649122807017544
Classification report: 
               precision    recall  f1-score   support

           0       0.98      0.93      0.95        43
           1       0.96      0.99      0.97        71

    accuracy                           0.96       114
   macro avg       0.97      0.96      0.96       114
weighted avg       0.97      0.96      0.96       114

Best Pa