# Balanced Random Forest Classifier

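A random forest is an ensemble model that combines the predictions of many decision trees. Random forest models:
- Are robust against overfitting because each tree is a weak learner trained on a different random sample of the data, and the forest aggregates their votes.
- Can be used to rank the importance of input variables in a natural way.
- Can handle thousands of input variables without variable deletion.
- Are robust to outliers and can capture nonlinear relationships.
- Run efficiently on large datasets.

A balanced random forest additionally undersamples each bootstrap sample so that every tree is trained on a balanced class distribution.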

In [1]:
import matplotlib.pyplot as plt
import pandas as pd

business_df = pd.read_csv("Clean_Business_Data.csv")

In [2]:
# Categorize restaurants by star rating
# Bins are right-closed: (0.9, 2] Poor, (2, 3] Average, (3, 4] Good, (4, 5] Successful
business_df["Category"] = pd.cut(
    business_df["Stars_Rating"],
    bins=[0.9, 2, 3, 4, 5],
    labels=["Poor", "Average", "Good", "Successful"],
)
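To make the bin edges concrete, the following minimal sketch applies the same `pd.cut` call to a handful of made-up ratings (the `sample_ratings` values are illustrative and not taken from the dataset). Because `pd.cut` uses right-closed intervals by default, a rating of exactly 2.0 falls in "Poor" and a rating of exactly 4.0 falls in "Good".

```python
import pandas as pd

# Hypothetical ratings chosen to sit on and between the bin edges
sample_ratings = pd.Series([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0])

# Same bins and labels as the notebook cell:
# (0.9, 2] Poor, (2, 3] Average, (3, 4] Good, (4, 5] Successful
sample_categories = pd.cut(
    sample_ratings,
    bins=[0.9, 2, 3, 4, 5],
    labels=["Poor", "Average", "Good", "Successful"],
)

for rating, category in zip(sample_ratings, sample_categories):
    print(rating, category)
```

If a rating of 2.0 should count as "Average" instead, passing `right=False` with adjusted edges would be one way to do it.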

In [3]:
# Define the feature set by dropping identifiers, location fields, and the target
X = business_df.drop(columns=[
    'Unnamed: 0', 'Restaurant_ID', 'Restaurants_Name', 'Address', 'City',
    'State', 'Postal_Code', 'Latitude', 'Longitude', 'Stars_Rating', 'Category',
])
X

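| | Review_Count | Restaurants_Delivery | Outdoor_Seating | Restaurants_TakeOut | WiFi | Restaurants_Reservations | Good_For_Groups | Wheelchair_Accessible | Happy_Hour | Dietary_Restrictions |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 80 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 6 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
| 2 | 19 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 |
| 3 | 10 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 10 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 49852 | 998 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 |
| 49853 | 11 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 49854 | 33 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 49855 | 35 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 |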


In [4]:
# Define the target
y = business_df["Category"]
y

0              Good
1              Poor
2           Average
3              Poor
4              Good
            ...    
49852    Successful
49853       Average
49854          Good
49855    Successful
49856    Successful
Name: Category, Length: 49857, dtype: category
Categories (4, object): ['Poor' < 'Average' < 'Good' < 'Successful']

In [5]:
# Split the data into training and testing sets, preserving class proportions
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    random_state=1,
    stratify=y,
)
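The effect of `stratify` can be checked on a small example. The sketch below uses made-up labels (`toy_y`) and a placeholder feature column (`toy_X`), both hypothetical, and relies on the default `test_size` of 0.25. With stratification, each class should make up the same share of the training and testing sets.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Made-up imbalanced labels (400 samples) and a placeholder feature column
toy_y = ["Good"] * 200 + ["Average"] * 100 + ["Successful"] * 60 + ["Poor"] * 40
toy_X = [[i] for i in range(len(toy_y))]

# Same call pattern as the notebook: default 75/25 split, stratified on the labels
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, random_state=1, stratify=toy_y
)

train_counts = Counter(toy_y_train)
test_counts = Counter(toy_y_test)

# Print each class's share of the training set and of the testing set
for label in ["Good", "Average", "Successful", "Poor"]:
    train_share = train_counts[label] / len(toy_y_train)
    test_share = test_counts[label] / len(toy_y_test)
    print(f"{label}: train={train_share:.2f}, test={test_share:.2f}")
```

Stratifying keeps the test set as imbalanced as the full data, which is why a class-aware metric such as balanced accuracy is a reasonable choice for evaluation.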

In [6]:
# Fit a BalancedRandomForestClassifier, which rebalances the classes internally by undersampling each bootstrap sample

from imblearn.ensemble import BalancedRandomForestClassifier

model = BalancedRandomForestClassifier(n_estimators=100, random_state=1)

model.fit(X_train, y_train)  

BalancedRandomForestClassifier(random_state=1)
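The resampling happens inside the classifier, once per tree. As a rough illustration of the idea only (not imbalanced-learn's actual implementation, whose sampling strategy and replacement defaults vary by version), the sketch below draws one balanced sample by taking the minority-class count from every class. The helper `balanced_bootstrap_indices` and the toy labels are hypothetical.

```python
import numpy as np


def balanced_bootstrap_indices(y, rng):
    """Return row indices with the same number of draws from every class.

    Hypothetical helper: each class is sampled (with replacement) down to
    the size of the smallest class.
    """
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    picks = [
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=True)
        for c in classes
    ]
    return np.concatenate(picks)


# Made-up imbalanced labels, not the notebook's data
y_demo = np.array(["Good"] * 60 + ["Average"] * 30 + ["Successful"] * 24 + ["Poor"] * 11)

rng = np.random.default_rng(1)
idx = balanced_bootstrap_indices(y_demo, rng)

# Every class now appears the same number of times in the sample
sampled_labels, sampled_counts = np.unique(y_demo[idx], return_counts=True)
print(dict(zip(sampled_labels.tolist(), sampled_counts.tolist())))
```

Each tree in the forest would be trained on a different sample of this kind, so majority-class rows that one tree never sees can still be used by others.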

In [9]:
# Calculate the balanced accuracy score
from sklearn.metrics import balanced_accuracy_score

y_pred = model.predict(X_test)
balanced_accuracy_score(y_test, y_pred)

0.4078382825547869

In [10]:
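# Display the confusion matrix for the predictions computed above
# Rows are true labels and columns are predicted labels,
# both in sorted label order: Average, Good, Poor, Successful
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred)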

array([[ 789,  511, 1167,  495],
       [1286, 1666, 1367, 1683],
       [ 176,   68,  740,  133],
       [ 369,  477,  525, 1013]])
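Balanced accuracy is the unweighted mean of the per-class recall, so it can be recovered from the confusion matrix. The sketch below copies the matrix printed above and assumes its rows and columns follow sorted label order (Average, Good, Poor, Successful), which is consistent with the row totals matching the class supports in the classification report.

```python
import numpy as np

# Assumed label order: sorted alphabetically
labels = ["Average", "Good", "Poor", "Successful"]

# Confusion matrix copied from the notebook output (rows = true, columns = predicted)
cm = np.array([
    [789, 511, 1167, 495],
    [1286, 1666, 1367, 1683],
    [176, 68, 740, 133],
    [369, 477, 525, 1013],
])

# Recall per class: correct predictions divided by the number of true samples
recall = cm.diagonal() / cm.sum(axis=1)

# Balanced accuracy: every class counts equally, regardless of its size
balanced_accuracy = recall.mean()

for label, r in zip(labels, recall):
    print(f"{label}: recall={r:.2f}")
print(f"balanced accuracy={balanced_accuracy:.4f}")
```

The result agrees with the `balanced_accuracy_score` output of about 0.408. For reference, always guessing a single class would score 0.25 on a four-class problem.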

In [11]:
# Print the imbalanced classification report
from imblearn.metrics import classification_report_imbalanced

print(classification_report_imbalanced(y_test, y_pred))

                   pre       rec       spe        f1       geo       iba       sup

    Average       0.30      0.27      0.81      0.28      0.46      0.20      2962
       Good       0.61      0.28      0.84      0.38      0.48      0.22      6002
       Poor       0.19      0.66      0.73      0.30      0.70      0.48      1117
 Successful       0.30      0.42      0.77      0.35      0.57      0.32      2384

avg / total       0.44      0.34      0.81      0.35      0.51      0.26     12465
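The `spe` and `geo` columns are less familiar than precision and recall. The sketch below recomputes them from the confusion matrix reported earlier, assuming sorted label order and assuming that `geo` is the geometric mean of recall and specificity. The rounded results match the report, which supports both assumptions.

```python
import numpy as np

# Assumed label order: sorted alphabetically
labels = ["Average", "Good", "Poor", "Successful"]

# Confusion matrix copied from the notebook output (rows = true, columns = predicted)
cm = np.array([
    [789, 511, 1167, 495],
    [1286, 1666, 1367, 1683],
    [176, 68, 740, 133],
    [369, 477, 525, 1013],
])

# One-vs-rest counts for each class
tp = cm.diagonal()
fn = cm.sum(axis=1) - tp
fp = cm.sum(axis=0) - tp
tn = cm.sum() - tp - fn - fp

precision = tp / (tp + fp)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)       # share of other-class samples correctly rejected
geo = np.sqrt(recall * specificity)  # geometric mean of recall and specificity

for i, label in enumerate(labels):
    print(
        f"{label}: pre={precision[i]:.2f} rec={recall[i]:.2f} "
        f"spe={specificity[i]:.2f} geo={geo[i]:.2f}"
    )
```

Read this way, the "Poor" row shows the trade-off most clearly: the model recovers about two thirds of the poorly rated restaurants, but fewer than one in five of its "Poor" predictions are correct.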



In [12]:
# List the features sorted in descending order by feature importance
feature_importance = sorted(zip(model.feature_importances_, X.columns), reverse=True)

for importance, feature in feature_importance:
    print(f'{feature} : ({importance})')

Review_Count : (0.7364963017104491)
Wheelchair_Accessible : (0.05313788141439246)
Restaurants_Delivery : (0.042971487057914134)
Good_For_Groups : (0.0348208320025486)
Restaurants_Reservations : (0.03033889139861774)
Outdoor_Seating : (0.029563122334321824)
WiFi : (0.027785774907790495)
Happy_Hour : (0.02359546471731503)
Restaurants_TakeOut : (0.020909415904844936)
Dietary_Restrictions : (0.00038082855180569727)
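Because these importances are normalized to sum to 1, each value can be read as a share of the total. The sketch below copies the printed values into a pandas Series (the variable names are illustrative) to show how much of the total is carried by `Review_Count` compared with all of the amenity flags combined.

```python
import pandas as pd

# Importances copied from the output above
importances = pd.Series({
    "Review_Count": 0.7364963017104491,
    "Wheelchair_Accessible": 0.05313788141439246,
    "Restaurants_Delivery": 0.042971487057914134,
    "Good_For_Groups": 0.0348208320025486,
    "Restaurants_Reservations": 0.03033889139861774,
    "Outdoor_Seating": 0.029563122334321824,
    "WiFi": 0.027785774907790495,
    "Happy_Hour": 0.02359546471731503,
    "Restaurants_TakeOut": 0.020909415904844936,
    "Dietary_Restrictions": 0.00038082855180569727,
})

total = importances.sum()
top_feature = importances.idxmax()
amenity_share = importances.drop("Review_Count").sum()

print(f"total={total:.4f}")
print(f"top feature={top_feature} ({importances[top_feature]:.1%})")
print(f"all amenity flags combined={amenity_share:.1%}")
```

One caveat worth keeping in mind: impurity-based importances tend to favor features with many distinct values, so a count such as `Review_Count` may rank higher than binary flags partly for that reason.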
