
# Ensemble Learning Project: Comparing Models on a Real Dataset

**Building and Evaluating Multiple Ensemble Models**

- Why Compare Ensemble Models?

  - Different ensemble methods excel in different scenarios

  - A side-by-side comparison helps identify the most effective model for a specific dataset or problem

- Ensemble Methods to consider:

  - Bagging (e.g., Random Forest)

    - Reduces variance by averaging predictions from multiple models trained independently on bootstrap samples

    - Works well with high-variance models like decision trees

  - Boosting (e.g., Gradient Boosting, XGBoost, LightGBM)

    - Reduces bias by sequentially correcting errors from previous models
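
The variance-reduction claim for bagging can be checked numerically. The sketch below is a simplified illustration and not part of the project code: it assumes each model's prediction is the true value plus independent unit-variance noise, and the names `true_value`, `n_models`, and `n_trials` are hypothetical. Bagged trees in practice are correlated, so the real reduction is smaller than the ideal 1/n seen here.

```python
import numpy as np

rng = np.random.default_rng(0)

true_value = 3.0   # hypothetical target every model tries to predict
n_models = 25      # size of the ensemble
n_trials = 10_000  # repeated experiments used to estimate the variance

# Each row is one experiment; each column is one model's noisy prediction.
predictions = true_value + rng.normal(0.0, 1.0, size=(n_trials, n_models))

# Variance across experiments for one model vs. the average of all models.
single_model_var = predictions[:, 0].var()
ensemble_var = predictions.mean(axis=1).var()

print(f"Variance of a single model:        {single_model_var:.3f}")
print(f"Variance of the {n_models}-model average:  {ensemble_var:.3f}")
print(f"Ideal value for independent models: {1 / n_models:.3f}")
```

Averaging leaves the expected prediction unchanged while shrinking its spread, which is why bagging pairs well with high-variance learners such as deep decision trees.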

**Comparing Bagging and Boosting**

- Bagging

  - Builds models independently using random subsets of data

  - Robust against overfitting, even with strong (low-bias, high-variance) base learners

- Boosting

  - Builds models sequentially, focusing on hard-to-predict samples

  - Requires careful tuning to prevent overfitting
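
The contrast between the two strategies can be sketched with scikit-learn on synthetic data. This is a minimal illustration under assumed settings: the dataset from `make_classification`, the ensemble sizes, and the seeds are arbitrary choices and are unrelated to the churn dataset used later.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; sizes and seeds are arbitrary.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# Bagging: every tree is fit independently on its own bootstrap sample of the rows.
bagging = BaggingClassifier(DecisionTreeClassifier(random_state=0),
                            n_estimators=50, random_state=0)
bagging.fit(X_tr, y_tr)

# Boosting: shallow trees are added one at a time, each fit to the loss gradient
# (the remaining errors) of the ensemble built so far.
boosting = GradientBoostingClassifier(n_estimators=50, max_depth=2,
                                      learning_rate=0.1, random_state=0)
boosting.fit(X_tr, y_tr)

bagging_acc = bagging.score(X_te, y_te)
boosting_acc = boosting.score(X_te, y_te)
print(f"Bagging test accuracy:  {bagging_acc:.3f}")
print(f"Boosting test accuracy: {boosting_acc:.3f}")
```

Which method scores higher depends on the data and the tuning, so a single run like this one should not be read as a general ranking.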

**Model Performance on Balanced vs Imbalanced Data**

- Challenges with Imbalanced Data

  - Models may prioritize the majority class, leading to poor performance on the minority class

- Evaluation Metrics

  - Accuracy

    - May not reflect true performance on imbalanced datasets, since always predicting the majority class already scores high

  - F1-Score

    - Balances precision and recall, focusing on the minority (positive) class

  - ROC-AUC

    - Evaluates the model's ability to distinguish between classes across thresholds
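
A small worked example shows why accuracy alone can mislead. The labels below are hypothetical (95 negatives, 5 positives), and the "model" simply predicts the majority class for every sample; it is a sketch to contrast the three metrics, not a result from the churn dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Hypothetical imbalanced labels: 95 negatives (majority), 5 positives (minority).
y_true = np.array([0] * 95 + [1] * 5)

# A "model" that ignores its input and always predicts the majority class.
y_pred = np.zeros_like(y_true)
y_score = np.full(len(y_true), 0.1)  # identical score for every sample

accuracy = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, zero_division=0)  # no positives predicted
auc = roc_auc_score(y_true, y_score)

print(f"Accuracy: {accuracy:.2f}")  # 0.95
print(f"F1-Score: {f1:.2f}")        # 0.00
print(f"ROC-AUC:  {auc:.2f}")       # 0.50
```

The 95% accuracy hides the fact that no minority sample is ever detected, while the F1-score of 0 and the ROC-AUC of 0.5 (no better than chance) expose it.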

**1. Ensemble Learning and Model Comparison**

- Train and compare multiple ensemble methods on a real-world dataset, analyzing their performance under balanced and imbalanced conditions

In [5]:
import pandas as pd

# Load Dataset
from google.colab import files
uploaded = files.upload()

Saving churns.txt to churns.txt


In [12]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.ensemble import RandomForestClassifier

# Load Dataset
df = pd.read_csv("churns.txt")

# Display dataset info and preview
# df.info() prints the summary itself and returns None, so it needs no print() wrapper
df.info()
print("Class Distribution: \n")
print(df["Churn"].value_counts())
print("Sample Data: \n", df.head())

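# Handle Missing Values
# Non-numeric TotalCharges entries become NaN and are then filled with the column median
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
df["TotalCharges"] = df["TotalCharges"].fillna(df["TotalCharges"].median())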

# Encode categorical variables (customerID is only an identifier and is dropped before training)
label_encoder = LabelEncoder()
for column in df.select_dtypes(include=["object"]).columns:
  if column not in ("Churn", "customerID"):
    df[column] = label_encoder.fit_transform(df[column])

# Encode target variable into a separate numeric column
df["churn"] = label_encoder.fit_transform(df["Churn"])

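# Define X and y for splitting
X = df.drop(["Churn", "customerID", "churn"], axis=1)
y = df["churn"]

# Split Dataset (stratify keeps the churn ratio the same in the train and test sets)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Scale Numerical features (fit on the training split only, so no test statistics leak in)
scaler = StandardScaler()
numerical_features = ["tenure", "MonthlyCharges", "TotalCharges"]
X_train, X_test = X_train.copy(), X_test.copy()
X_train[numerical_features] = scaler.fit_transform(X_train[numerical_features])
X_test[numerical_features] = scaler.transform(X_test[numerical_features])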

# Apply SMOTE
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

# Display class distribution after SMOTE
print("Class Distribution after SMOTE: \n")
print(pd.Series(y_train_resampled).value_counts())

# Train Random Forest
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train_resampled, y_train_resampled)
y_pred_rf = rf_model.predict(X_test)
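# Score with predicted probabilities, as for the other models, so the ROC-AUC values are comparable
roc_auc_rf = roc_auc_score(y_test, rf_model.predict_proba(X_test)[:, 1])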

# Train XGBoost
xgb_model = XGBClassifier(eval_metric="logloss", random_state=42)
xgb_model.fit(X_train_resampled, y_train_resampled)
y_pred_xgb = xgb_model.predict(X_test)
roc_auc_xgb = roc_auc_score(y_test, xgb_model.predict_proba(X_test)[:, 1])

# Train LightGBM
lgb_model = LGBMClassifier(random_state=42)
lgb_model.fit(X_train_resampled, y_train_resampled)
y_pred_lgb = lgb_model.predict(X_test)
roc_auc_lgb = roc_auc_score(y_test, lgb_model.predict_proba(X_test)[:, 1])

# Classification reports
print("Classification Report for Random Forest: \n")
print(classification_report(y_test, y_pred_rf))
print("Classification Report for XGBoost: \n")
print(classification_report(y_test, y_pred_xgb))
print("Classification Report for LightGBM: \n")
print(classification_report(y_test, y_pred_lgb))

# ROC-AUC comparison
print("ROC-AUC Comparison: \n")
print(f"Random Forest ROC-AUC: {roc_auc_rf:.2f}")
print(f"XGBoost ROC-AUC: {roc_auc_xgb:.2f}")
print(f"LightGBM ROC-AUC: {roc_auc_lgb:.2f}")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 
