# Easy Ensemble AdaBoost Classifier

AdaBoost trains a sequence of weak models. The first model is trained and evaluated, and its errors determine how the next model is trained.

Each new model gives extra weight to the samples that the previous model misclassified, so that later models concentrate on the cases earlier ones got wrong. The errors from the second model are in turn given extra weight for the third model, and so on. The process repeats for a set number of rounds, or stops early if a model fits the training data perfectly.

However, AdaBoost is more sensitive to overfitting than Random Forests, which is why the balanced accuracy is computed on both the test set and the training set below.
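
The reweighting step described above can be sketched in a few lines. This is a minimal illustration of one round of the textbook binary (discrete) AdaBoost update, not the exact procedure used inside scikit-learn or imbalanced-learn; the function name `adaboost_reweight` and the toy labels are assumptions made for the example.

```python
import math

def adaboost_reweight(weights, y_true, y_pred):
    """One round of textbook discrete AdaBoost: return (alpha, new_weights)."""
    # Weighted error rate of the current weak learner
    total = sum(weights)
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p) / total
    # Clip so the logarithm below stays finite
    err = min(max(err, 1e-10), 1 - 1e-10)
    # How much say this learner gets in the final weighted vote
    alpha = 0.5 * math.log((1 - err) / err)
    # Raise the weight of misclassified samples, lower the weight of the rest
    raw = [w * math.exp(alpha if t != p else -alpha)
           for w, t, p in zip(weights, y_true, y_pred)]
    norm = sum(raw)
    return alpha, [w / norm for w in raw]

# Toy example (hypothetical data): five samples, the second one is misclassified
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1]
weights = [0.2] * 5

alpha, new_weights = adaboost_reweight(weights, y_true, y_pred)
print(alpha, new_weights)
```

After this round the single misclassified sample carries half of the total weight, so the next weak learner is pushed to classify it correctly.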

In [84]:
import pandas as pd
import numpy as np

from sqlalchemy import create_engine
# from config import db_password
from sqlalchemy import inspect
import matplotlib.pyplot as plt

business_df = pd.read_csv("../../../Data/business_census_merged_dataset.csv")

# Categorize restaurants by star rating: (0.9, 3] is lower, (3, 5] is higher
business_df["Category"] = pd.cut(business_df["Stars_Rating"], bins=[0.9, 3, 5],
                                 labels=["Lower Performance", "Higher Performance"])

def changeStatus(status):
    if status == "Lower Performance":
        return 0
    else:
        return 1
    
business_df['Category_Encoded'] = business_df["Category"].apply(changeStatus)
business_df["Category_Encoded"] = pd.to_numeric(business_df["Category_Encoded"])

business_df["Median_Income(dollars)"] = pd.to_numeric(business_df["Median_Income(dollars)"], errors='coerce')
business_df["Mean_Income(dollars)"] = pd.to_numeric(business_df["Mean_Income(dollars)"], errors='coerce')

business_df = business_df.dropna(subset=['Median_Income(dollars)'])
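
To make the bin edges in the `pd.cut` call concrete, the sketch below applies the same bins and labels to a few hypothetical ratings (the `sample_stars` values are made up for illustration and are not taken from the dataset). By default `pd.cut` uses right-closed intervals, so a rating of exactly 3.0 lands in the lower category.

```python
import pandas as pd

# Hypothetical ratings chosen to sit on and around the bin edges
sample_stars = pd.Series([1.0, 3.0, 3.5, 5.0])

# Same bins and labels as in the cell above
sample_category = pd.cut(sample_stars, bins=[0.9, 3, 5],
                         labels=["Lower Performance", "Higher Performance"])

# Intervals are (0.9, 3] and (3, 5], so 3.0 is "Lower" and 3.5 is "Higher"
sample_labels = sample_category.astype(str).tolist()
print(sample_labels)
```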

In [85]:
# # Creating engine and connection to the SQL database
# db_string = f"postgresql://postgres:{db_password}@127.0.0.1:5432/yelp_DB"
# engine = create_engine(db_string)
# df = pd.read_sql_table('business', engine)

In [86]:
business_df.columns

Index(['Restaurant_ID', 'Restaurants_Name', 'Address', 'City', 'State',
       'Postal_Code', 'Latitude', 'Longitude', 'Stars_Rating', 'Review_Count',
       'Restaurants_Delivery', 'Outdoor_Seating', 'Accepts_CreditCards',
       'Price_Range', 'Alcohol', 'Good_For_Kids', 'Reservations',
       'Restaurants_TakeOut', 'WiFi', 'Good_For_Groups',
       'Wheelchair_Accessible', 'Happy_Hour', 'Noise_Level',
       'Dietary_Restrictions', 'Total_Estimate_Households_per_Zip',
       'Total_Estimate_Married-couple_Family_households',
       'Total_Estimate_Nonfamily_households', 'Median_Income(dollars)',
       'Mean_Income(dollars)', 'Category', 'Category_Encoded'],
      dtype='object')

In [87]:
# Define the feature set
X = business_df[['Review_Count', 'Restaurants_Delivery', 'Outdoor_Seating',
                 'Accepts_CreditCards', 'Price_Range', 'Alcohol', 'Good_For_Kids',
                 'Reservations', 'Restaurants_TakeOut', 'WiFi', 'Good_For_Groups',
                 'Wheelchair_Accessible', 'Happy_Hour', 'Noise_Level',
                 'Dietary_Restrictions',
                 'Total_Estimate_Married-couple_Family_households',
                 'Total_Estimate_Nonfamily_households',
                 'Median_Income(dollars)', 'Total_Estimate_Households_per_Zip']]

In [88]:
# Define the target set
y = business_df["Category_Encoded"]

In [89]:
# Split the data into training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    random_state=1,
                                                    stratify=y)

In [90]:
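# Train the EasyEnsembleClassifier: an ensemble of AdaBoost learners,
# each fitted on a class-balanced, randomly undersampled subset of the training data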
from imblearn.ensemble import EasyEnsembleClassifier

model = EasyEnsembleClassifier(n_estimators=100, random_state=1)

model.fit(X_train, y_train)

EasyEnsembleClassifier(n_estimators=100, random_state=1)
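
EasyEnsemble addresses class imbalance by fitting each AdaBoost learner on a balanced subset obtained through random undersampling of the majority class. The sketch below illustrates only that balancing idea with the standard library; it is not imbalanced-learn's implementation, and the helper name `balanced_undersample` and the toy labels are assumptions for the example.

```python
import random
from collections import Counter

def balanced_undersample(labels, rng):
    """Return indices of a class-balanced subset via random undersampling."""
    # Group sample indices by class label
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    # Every class is cut down to the size of the smallest class
    n_min = min(len(indices) for indices in by_class.values())
    chosen = []
    for indices in by_class.values():
        chosen.extend(rng.sample(indices, n_min))
    return sorted(chosen)

# Toy labels (hypothetical): 3 samples of class 0, 9 samples of class 1
labels = [0] * 3 + [1] * 9
rng = random.Random(1)

# Each ensemble member would be trained on its own balanced subset
subsets = [balanced_undersample(labels, rng) for _ in range(4)]
counts = [Counter(labels[i] for i in subset) for subset in subsets]
print(counts)
```

Because each subset draws a different random part of the majority class, the ensemble as a whole still sees most of the majority samples even though every individual learner trains on balanced data.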

In [91]:
# Calculate the balanced accuracy score on the test set
from sklearn.metrics import balanced_accuracy_score
y_pred = model.predict(X_test)
balanced_accuracy_score(y_test, y_pred)

0.6636770412722759

In [92]:
# Balanced accuracy on the training set, for comparison with the test score
y_pred_train = model.predict(X_train)
balanced_accuracy_score(y_train, y_pred_train)

0.6867554794313522

In [93]:
# Display the confusion matrix
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
cm

array([[1012,  447],
       [1731, 2995]])
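
The balanced accuracy reported earlier can be reproduced by hand from this confusion matrix, since balanced accuracy is the mean of the per-class recalls. A small sketch, with the matrix values copied from the output above:

```python
# Confusion matrix from the output above: rows are true classes, columns are predictions
cm = [[1012, 447],
      [1731, 2995]]

# Recall for each class = correct predictions on the diagonal / row total
recall_per_class = [row[i] / sum(row) for i, row in enumerate(cm)]

# Balanced accuracy = unweighted mean of the per-class recalls
balanced_acc = sum(recall_per_class) / len(recall_per_class)
print(recall_per_class, balanced_acc)
```

The result agrees with the test-set score of roughly 0.664 computed with `balanced_accuracy_score`.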

In [94]:
# Print the imbalanced classification report
from imblearn.metrics import classification_report_imbalanced

print(classification_report_imbalanced(y_test, y_pred))

                   pre       rec       spe        f1       geo       iba       sup

          0       0.37      0.69      0.63      0.48      0.66      0.44      1459
          1       0.87      0.63      0.69      0.73      0.66      0.44      4726

avg / total       0.75      0.65      0.68      0.67      0.66      0.44      6185
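
Most columns in this report can be traced back to the confusion matrix. The sketch below recomputes precision (`pre`), recall (`rec`), specificity (`spe`), F1 (`f1`), and the geometric mean of recall and specificity (`geo`) for each class; the `iba` column is left out. The helper `report_row` is a hypothetical name introduced for this example.

```python
import math

# Confusion matrix from above: rows are true classes, columns are predictions
cm = [[1012, 447],
      [1731, 2995]]
tn, fp = cm[0]   # counts named with class 1 treated as the positive class
fn, tp = cm[1]

def report_row(tp, fp, fn, tn):
    """Compute the per-class metrics, rounded to two decimals like the report."""
    pre = tp / (tp + fp)            # precision
    rec = tp / (tp + fn)            # recall (sensitivity)
    spe = tn / (tn + fp)            # specificity
    f1 = 2 * pre * rec / (pre + rec)
    geo = math.sqrt(rec * spe)      # geometric mean of recall and specificity
    values = {"pre": pre, "rec": rec, "spe": spe, "f1": f1, "geo": geo}
    return {name: round(value, 2) for name, value in values.items()}

# Class 1 as positive
row_1 = report_row(tp=tp, fp=fp, fn=fn, tn=tn)
# Class 0 as positive: the roles of the four counts swap
row_0 = report_row(tp=tn, fp=fn, fn=fp, tn=tp)
print(row_0)
print(row_1)
```

The low precision for class 0 (0.37) comes from the 1731 higher-performance restaurants predicted as lower performance, which outnumber the 1012 correct class 0 predictions.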

