## Ensemble Learning (Bagging)
- Ensemble learning is a machine learning technique in which multiple models (learners) are combined to improve the overall performance and generalization of the system.
- The idea behind ensemble methods is that aggregating the predictions of several models lets the errors of individual models partly cancel out, which typically reduces variance and improves overall performance.
- Bagging (bootstrap aggregating): multiple models are trained independently on different random samples of the training data, typically drawn with replacement. Each model gets a vote, and the final prediction is the majority vote (classification) or the average (regression). Example: Random Forest.
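
The two ingredients of bagging described above, bootstrap sampling and vote aggregation, can be sketched in plain Python. This is only an illustration; the helper names `bootstrap_indices` and `majority_vote` are hypothetical and do not come from any library.

```python
import random
from collections import Counter

def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def majority_vote(predictions):
    """Return the most common label among the individual model predictions."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
n = 1000
idx = bootstrap_indices(n, rng)

# Because sampling is with replacement, only part of the rows appear in each
# bootstrap sample (about 63% for large n); the rest are "out-of-bag".
unique_fraction = len(set(idx)) / n

# Hypothetical votes from five models for one test row
votes = [1, 0, 1, 1, 0]
final_label = majority_vote(votes)
```

Each model in the ensemble would be trained on the rows selected by its own `idx`, and `majority_vote` would be applied per test row.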

## Approach
#### These steps outline the process to be followed when working on a predictive model: 
- Problem Definition
- Data Collection
- Data Preprocessing
- Feature Selection/Engineering
- Data Splitting
- Model Selection
- Model Training
- Prediction
- Hyperparameter Tuning
- Model Evaluation



## Problem Definition

### *Clearly state the problem you want to solve, as well as the outcome you want to predict.*


Here we have to predict whether a person is diabetic or not using ensemble learning.

In [1]:
# Importing libraries 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

## Data Collection

### *Gather relevant data that will be used to train and test the prediction model.*


In [2]:
df = pd.read_csv('pima-indians-diabetes.csv')

In [3]:
df.head()

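| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 1 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 2 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 3 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| 4 | 5 | 116 | 74 | 0 | 0 | 25.6 | 0.201 | 30 | 0 |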


## Data Preprocessing


### *Clean the data by handling missing values, dealing with outliers, data visualization, normalizing features, and encoding categorical variables.*


In [4]:
# Check Null values
df.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [5]:
df.describe()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 767.0 | 767.0 | 767.0 | 767.0 | 767.0 | 767.0 | 767.0 | 767.0 | 767.0 |
| mean | 3.842243 | 120.859192 | 69.101695 | 20.517601 | 79.90352 | 31.990482 | 0.471674 | 33.219035 | 0.34811 |
| std | 3.370877 | 31.978468 | 19.368155 | 15.954059 | 115.283105 | 7.889091 | 0.331497 | 11.752296 | 0.476682 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.2435 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 32.0 | 32.0 | 0.371 | 29.0 | 0.0 |
| 75% | 6.0 | 140.0 | 80.0 | 32.0 | 127.5 | 36.6 | 0.625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |
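
The `isnull()` check reports no missing values, yet the `min` row of `describe()` is 0 for Glucose, BloodPressure, SkinThickness, Insulin and BMI. Zeros in columns such as Glucose or BMI are not plausible measurements and may stand for missing readings; this is an assumption, as the notebook does not say so. A minimal sketch of treating them as missing, on a tiny made-up frame rather than the real data:

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame that reuses a few column names from this notebook
demo = pd.DataFrame({
    "Glucose": [85, 0, 137],
    "BloodPressure": [66, 64, 0],
    "BMI": [26.6, 23.3, 0.0],
})

# Assumption: a zero in these columns means "not recorded"
zero_as_missing = ["Glucose", "BloodPressure", "BMI"]
demo_clean = demo.copy()
demo_clean[zero_as_missing] = demo_clean[zero_as_missing].replace(0, np.nan)

# Count how many values are now flagged as missing in each column
missing_counts = demo_clean.isna().sum()

# One simple option: fill each gap with the column median
demo_filled = demo_clean.fillna(demo_clean.median())
```

Whether imputation helps here is an open question; the rest of the notebook uses the data as loaded.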


In [6]:
df.Outcome.value_counts()

0    500
1    267
Name: Outcome, dtype: int64
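
The classes are imbalanced (500 vs 267), so it helps to know what accuracy a trivial model would reach. A small sketch using the counts printed above; the variable names are illustrative only:

```python
# Class counts taken from the value_counts() output above
counts = {0: 500, 1: 267}

# A model that always predicts the majority class gets this accuracy
baseline_accuracy = max(counts.values()) / sum(counts.values())

print(round(baseline_accuracy, 3))  # 0.652
```

Any accuracy reported later in the notebook can be read against this baseline.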

## Feature Selection/Engineering

### *Identify which features are important for the prediction task and create new features if needed.*


In [7]:
X = df.drop('Outcome', axis = 'columns')
y = df.Outcome

In [8]:
from sklearn.preprocessing import StandardScaler

In [9]:
scaler = StandardScaler()

In [10]:
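# Note: the scaler is fitted on the full dataset before the train/test split,
# so test-set statistics influence the preprocessing (tree models are insensitive to scaling, so the effect here is small)
X_scaled = scaler.fit_transform(X)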

## Data Splitting

### *Divide the dataset into a training set and a testing set to evaluate your model's performance.*

In [11]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, stratify=y, random_state=10)

In [12]:
X_train.shape

(575, 8)
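
`train_test_split` holds out 25% of the rows by default, which is why 767 rows leave 575 for training. The sketch below shows an alternative ordering in which the data is split first and the scaler is fitted on the training rows only. It uses synthetic data from `make_classification` as a stand-in for the diabetes CSV, and the `*_demo` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataset (8 features, binary target)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Split first; stratify keeps the class ratio similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=10
)

# Fit the scaler on the training rows only, then reuse it on the test rows
scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)
```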

## Model Selection

### *Choose an appropriate machine learning algorithm based on the type of problem (classification, regression, etc.) and the characteristics of the data.*

In [13]:
# Using a single model
from sklearn.tree import DecisionTreeClassifier

In [14]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=5)
scores

array([0.71428571, 0.67532468, 0.68627451, 0.77124183, 0.7124183 ])

In [15]:
scores.mean()

0.7119090060266531
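
The tree above has no `random_state`, so the scores can change slightly between runs. A sketch of a repeatable version, assuming synthetic data in place of the CSV; for a classifier, `cv=5` already uses stratified folds, and here the folds are additionally shuffled with a fixed seed:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Fixed seeds for both the fold assignment and the tree itself
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
tree = DecisionTreeClassifier(random_state=0)

scores_a = cross_val_score(tree, X_demo, y_demo, cv=cv)
scores_b = cross_val_score(tree, X_demo, y_demo, cv=cv)

# With every source of randomness seeded, both runs give the same scores
same_scores = bool((scores_a == scores_b).all())
```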

In [16]:
# Using Bagging

from sklearn.ensemble import BaggingClassifier

bag_model = BaggingClassifier(
    base_estimator=DecisionTreeClassifier(),  # renamed to `estimator` in scikit-learn 1.2; `base_estimator` was removed in 1.4
    n_estimators=100, 
    max_samples=0.8, 
    oob_score=True,
    random_state=0
)

## Model Training

### *Use the training data to train the selected model by adjusting its parameters to minimize the prediction error.*

In [17]:
bag_model.fit(X_train, y_train)

BaggingClassifier(base_estimator=DecisionTreeClassifier(), max_samples=0.8,
                  n_estimators=100, oob_score=True, random_state=0)

In [18]:
# The OOB (out-of-bag) score evaluates each training sample using only the trees that did not see it during fitting,
# which gives an estimate of performance on unseen data without a separate validation set

bag_model.oob_score_

0.7634782608695653

In [19]:
bag_model.score(X_test, y_test)

0.7552083333333334
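
The OOB score and the held-out test score above are close, which is what one would hope for if OOB is a usable estimate. The sketch below repeats the fit on synthetic data and picks the base-learner keyword by inspecting the installed scikit-learn, since the keyword was renamed between versions. The `*_demo` names and the `param_name` lookup are illustrative choices, not part of the original notebook.

```python
import inspect

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real dataset
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=10
)

# Newer scikit-learn uses `estimator`, older releases use `base_estimator`
init_params = inspect.signature(BaggingClassifier).parameters
param_name = "estimator" if "estimator" in init_params else "base_estimator"

bag_demo = BaggingClassifier(
    n_estimators=100,
    max_samples=0.8,
    oob_score=True,
    random_state=0,
    **{param_name: DecisionTreeClassifier()},
)
bag_demo.fit(X_tr, y_tr)

oob_accuracy = bag_demo.oob_score_           # estimated from out-of-bag rows
test_accuracy = bag_demo.score(X_te, y_te)   # measured on held-out rows
```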

## Prediction

### *Once the model is trained and validated, it can be used to make predictions on new, unseen data.*


In [20]:
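# Re-create the bagging model and estimate its accuracy with 5-fold cross-validation on the full dataset
bag_model = BaggingClassifier(
    base_estimator=DecisionTreeClassifier(),  # renamed to `estimator` in scikit-learn 1.2; `base_estimator` was removed in 1.4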
    n_estimators=100, 
    max_samples=0.8, 
    oob_score=True,
    random_state=0
)
scores = cross_val_score(bag_model, X, y, cv=5)
scores

array([0.75324675, 0.73376623, 0.75816993, 0.81699346, 0.73202614])

In [21]:
scores.mean()

0.7588405058993294
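
The cell above measures cross-validated accuracy; making predictions for new rows is done with `predict` (labels) and `predict_proba` (class probabilities) on a fitted model. A minimal sketch on synthetic data, where the held-out rows play the role of "new" patients:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_new, y_tr, _ = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=10
)

# The default base learner of BaggingClassifier is a decision tree
model = BaggingClassifier(n_estimators=50, random_state=0)
model.fit(X_tr, y_tr)

new_rows = X_new[:3]                             # three unseen rows
predicted_labels = model.predict(new_rows)       # one class label per row
predicted_proba = model.predict_proba(new_rows)  # one probability per class
```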

## Hyperparameter Tuning

### *Fine-tune the model's hyperparameters to optimize its performance.*


No hyperparameter tuning is performed here; the bagging model is used with the settings chosen above.
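
If tuning were wanted, one possible approach is a small grid search over the bagging settings. This is a sketch under assumptions: synthetic data instead of the CSV, and an arbitrary, deliberately tiny grid chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Illustrative grid; the values are arbitrary examples
param_grid = {
    "n_estimators": [10, 50],
    "max_samples": [0.5, 0.8],
}

# Try every combination with 3-fold cross-validation
search = GridSearchCV(BaggingClassifier(random_state=0), param_grid, cv=3)
search.fit(X_demo, y_demo)

best_params = search.best_params_      # combination with the highest mean CV score
best_cv_accuracy = search.best_score_  # that mean CV accuracy
```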

## Model Evaluation

### *Assess the model's performance on a separate set of data not used during training to understand its predictive power and generalization capability.*



In [22]:
from sklearn.ensemble import RandomForestClassifier

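# Compare with a Random Forest, which combines bagging with a random subset of features at each split
scores = cross_val_score(RandomForestClassifier(n_estimators=50), X, y, cv=5)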
scores.mean()

0.7510482981071217
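
Accuracy alone can hide how a model treats the minority class, which matters with a 500 vs 267 split. A sketch of a per-class evaluation on a held-out set, again on synthetic data with a roughly similar class ratio (the `weights` values are an assumption made for the example):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced stand-in for the real dataset
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.65, 0.35], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=10
)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_tr, y_tr)
y_pred = forest.predict(X_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, y_pred)

# Precision and recall for the positive class (label 1)
precision = precision_score(y_te, y_pred)
recall = recall_score(y_te, y_pred)
```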

## Thank You !!!