# Balanced Random Forest Classifier

A random forest model combines many decision trees, each trained on a random sample of the data, and aggregates their predictions. Random forest models:
- Are less prone to overfitting than a single decision tree because each tree is trained on a different bootstrap sample of the data and the trees' predictions are aggregated.
- Can be used to rank the importance of input variables in a natural way.
- Can handle thousands of input variables without variable deletion.
- Are robust to outliers and can capture nonlinear relationships.
- Run efficiently on large datasets.
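
Each tree in a random forest is typically trained on a bootstrap sample, that is, rows drawn with replacement from the training set. As a rough illustration using only the standard library (`bootstrap_indices` is a hypothetical helper, not part of scikit-learn or imbalanced-learn), the sketch below shows that a single bootstrap sample contains only about 63% of the distinct rows, so every tree ends up seeing a somewhat different subset:

```python
import random

def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

rng = random.Random(1)
n_rows = 10_000
sample = bootstrap_indices(n_rows, rng)

# Fraction of distinct rows that made it into this sample
unique_fraction = len(set(sample)) / n_rows
print(f"distinct rows in one bootstrap sample: {unique_fraction:.1%}")
```

The rows a tree never sees differ from tree to tree, which is part of why the aggregated forest generalizes better than any single tree.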

In [6]:
import pandas as pd

business_df = pd.read_csv("../../Data/02_Clean_Business_Data_Add_Attrs.csv")

# Categorize restaurants by star rating (bins are right-inclusive, e.g. (0.9, 2] -> Poor)
business_df["Category"] = pd.cut(business_df["Stars_Rating"], bins=[0.9, 2, 3, 4, 5],
                                 labels=["Poor", "Average", "Good", "Successful"])

# Encode the category labels as integers: Poor=0, Average=1, Good=2, Successful=3
def changeStatus(status):
    if status == "Poor":
        return 0
    elif status == "Average":
        return 1
    elif status == "Good":
        return 2
    else:
        return 3

business_df['Category_Encoded'] = business_df["Category"].apply(changeStatus)
business_df["Category_Encoded"] = pd.to_numeric(business_df["Category_Encoded"])
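
Because `pd.cut` returns an ordered categorical whose categories follow the order of `labels`, the same integer encoding can also be read from `.cat.codes`. The sketch below uses a small made-up Series (`stars` is illustrative, not the business data) to show how the bins map ratings to classes. One difference to keep in mind: with `.cat.codes`, ratings that fall outside the bins get the code -1.

```python
import pandas as pd

# Made-up ratings for illustration only
stars = pd.Series([1.0, 2.0, 2.5, 4.0, 4.5])

category = pd.cut(stars, bins=[0.9, 2, 3, 4, 5],
                  labels=["Poor", "Average", "Good", "Successful"])

# Codes follow the label order: Poor=0, Average=1, Good=2, Successful=3
encoded = category.cat.codes

print(pd.DataFrame({"stars": stars, "category": category, "encoded": encoded}))
```

Since the bins are right-inclusive, a rating of exactly 2.0 lands in "Poor" and a rating of exactly 4.0 lands in "Good".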

In [7]:
business_df.columns

Index(['Unnamed: 0', 'Restaurant_ID', 'Restaurants_Name', 'Address', 'City',
       'State', 'Postal_Code', 'Latitude', 'Longitude', 'Stars_Rating',
       'Review_Count', 'Restaurants_Delivery', 'Outdoor_Seating',
       'Accepts_CreditCards', 'Price_Range', 'Alcohol', 'Good_For_Kids',
       'Reservations', 'Restaurants_TakeOut', 'WiFi', 'Good_For_Groups',
       'Wheelchair_Accessible', 'Happy_Hour', 'Noise_Level',
       'Dietary_Restrictions', 'Category', 'Category_Encoded'],
      dtype='object')

In [8]:
# Define features set
X = business_df[['Review_Count', 'Restaurants_Delivery', 'Outdoor_Seating',
       'Accepts_CreditCards', 'Price_Range', 'Alcohol', 'Good_For_Kids',
       'Reservations', 'Restaurants_TakeOut', 'WiFi', 'Good_For_Groups',
       'Wheelchair_Accessible', 'Happy_Hour', 'Noise_Level',
       'Dietary_Restrictions']]

In [9]:
# Define the target
y = business_df["Category_Encoded"]
y

0        2
1        3
2        2
3        3
4        1
        ..
27202    1
27203    2
27204    3
27205    2
27206    3
Name: Category_Encoded, Length: 27207, dtype: int64

In [10]:
# Split the data into training and testing sets (default test_size=0.25),
# stratified on y so both sets keep the same class proportions
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    random_state=1,
                                                    stratify=y)
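
Passing `stratify=y` keeps the class proportions the same in both splits, which matters here because the rating classes are imbalanced. A minimal sketch on made-up labels (`X_toy` and `y_toy` are assumptions for illustration, not the business data):

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 20% class 0, 80% class 1
y_toy = [0] * 40 + [1] * 160
X_toy = [[i] for i in range(len(y_toy))]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, random_state=1, stratify=y_toy
)

print(Counter(y_tr))  # class counts in the training split
print(Counter(y_te))  # class counts in the test split
```

Both splits keep the 20/80 ratio of the toy labels; without `stratify`, the minority class share could drift between the training and test sets.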

In [11]:
# Fit a BalancedRandomForestClassifier, which balances the classes in the sample used to train each tree

from imblearn.ensemble import BalancedRandomForestClassifier

model = BalancedRandomForestClassifier(n_estimators=100, random_state=1)

model.fit(X_train, y_train)  

BalancedRandomForestClassifier(random_state=1)
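
The idea behind a balanced random forest is that each tree is trained on a sample in which the classes are equally represented, typically by undersampling the larger classes down to the size of the smallest one. The following is a simplified standard-library illustration of that idea on made-up labels; `balanced_sample_indices` is a hypothetical helper and is not how imbalanced-learn implements it internally:

```python
import random
from collections import Counter, defaultdict

def balanced_sample_indices(labels, rng):
    """Draw, with replacement, as many rows from every class as the smallest class has."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    n_min = min(len(rows) for rows in by_class.values())
    sample = []
    for rows in by_class.values():
        sample.extend(rng.choices(rows, k=n_min))
    return sample

# Made-up imbalanced labels for four classes
labels = [0] * 5 + [1] * 20 + [2] * 60 + [3] * 15
rng = random.Random(1)
sample = balanced_sample_indices(labels, rng)

sample_counts = Counter(labels[i] for i in sample)
print(sample_counts)
```

Every class contributes the same number of rows to the sample, so a tree trained on it cannot simply favor the majority class.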

In [13]:
# Calculate the balanced accuracy score on the test set
from sklearn.metrics import balanced_accuracy_score

y_pred = model.predict(X_test)
balanced_accuracy_score(y_test, y_pred)

0.4714710441759943

In [15]:
# Training balanced accuracy
y_pred_train = model.predict(X_train)
balanced_accuracy_score(y_train, y_pred_train)

0.6724106784258966

In [16]:
# Display the confusion matrix
from sklearn.metrics import confusion_matrix

y_pred = model.predict(X_test)

confusion_matrix(y_test, y_pred)

array([[ 185,   64,   14,   24],
       [ 367,  499,  273,  191],
       [ 548,  954, 1202, 1138],
       [ 150,  139,  311,  743]])
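
Balanced accuracy is the mean of the per-class recalls, so the test score reported earlier can be recomputed from this confusion matrix (rows are true classes, columns are predicted classes). A small check in plain Python, with the values copied from the output above:

```python
cm = [
    [185,  64,   14,   24],
    [367, 499,  273,  191],
    [548, 954, 1202, 1138],
    [150, 139,  311,  743],
]

# Recall for class k = correct predictions on the diagonal / all true members of class k
recalls = [row[k] / sum(row) for k, row in enumerate(cm)]
balanced_accuracy = sum(recalls) / len(recalls)

print([round(r, 2) for r in recalls])
print(round(balanced_accuracy, 4))
```

The mean of the four recalls reproduces the test balanced accuracy of about 0.4715.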

In [17]:
# Print the imbalanced classification report
from imblearn.metrics import classification_report_imbalanced

print(classification_report_imbalanced(y_test, y_pred))

                   pre       rec       spe        f1       geo       iba       sup

          0       0.15      0.64      0.84      0.24      0.73      0.53       287
          1       0.30      0.38      0.79      0.33      0.54      0.28      1330
          2       0.67      0.31      0.80      0.43      0.50      0.24      3842
          3       0.35      0.55      0.75      0.43      0.65      0.41      1343

avg / total       0.51      0.39      0.79      0.40      0.55      0.29      6802
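
The `spe` and `geo` columns can be derived from the confusion matrix as well: specificity is the fraction of rows outside a class that were correctly kept out of it, and `geo` is the geometric mean of recall and specificity. A plain-Python sketch, with the matrix values copied from the earlier cell's output:

```python
from math import sqrt

cm = [
    [185,  64,   14,   24],
    [367, 499,  273,  191],
    [548, 954, 1202, 1138],
    [150, 139,  311,  743],
]
n_classes = len(cm)
total = sum(sum(row) for row in cm)

specificities, geo_means = [], []
for k in range(n_classes):
    tp = cm[k][k]                                       # true class k, predicted k
    support = sum(cm[k])                                # all rows whose true class is k
    fp = sum(cm[i][k] for i in range(n_classes)) - tp   # predicted k, true class differs
    tn = total - support - fp                           # not k and not predicted k
    recall = tp / support
    specificity = tn / (tn + fp)
    specificities.append(specificity)
    geo_means.append(sqrt(recall * specificity))

print([round(s, 2) for s in specificities])
print([round(g, 2) for g in geo_means])
```

The rounded values agree with the `spe` and `geo` columns of the report for all four classes.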



In [61]:
# List the features sorted in descending order by feature importance
feature_importance = sorted(zip(model.feature_importances_, X.columns), reverse=True)

for importance, feature in feature_importance:
    print('{} : ({})'.format(feature, importance))

Review_Count : (0.5175181357539319)
Noise_Level : (0.08191126703995406)
Price_Range : (0.0521487776554046)
Restaurants_Delivery : (0.046852773149190965)
Wheelchair_Accessible : (0.04664099891117743)
WiFi : (0.04512343624949573)
Outdoor_Seating : (0.03970376791535502)
Good_For_Kids : (0.03405568241334003)
Reservations : (0.034037312394555326)
Happy_Hour : (0.03187377728904784)
Good_For_Groups : (0.03152922938415627)
Alcohol : (0.020460340797408022)
Accepts_CreditCards : (0.018144501046982703)
