# Methodology

## Importing Libraries and Datasets

In [55]:
!pip install catboost

Collecting catboost
  Downloading catboost-0.26.1-cp37-none-manylinux1_x86_64.whl (67.4 MB)
     |████████████████████████████████| 67.4 MB 28 kB/s
Installing collected packages: catboost
Successfully installed catboost-0.26.1


In [56]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from catboost import CatBoostClassifier
from xgboost import XGBClassifier
from sklearn.metrics import confusion_matrix, classification_report, f1_score

In [43]:
df = pd.read_csv("/content/train.csv")

In [44]:
X = df.drop('satisfaction', axis=1)
y = df['satisfaction']

## Splitting Data into Train and Test Sets

In [45]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

## Baseline Model
- A baseline model is an important first step in modelling. It is a simple model whose score serves as a reference point: every model we try afterwards is judged by how much it improves on that score.

In [46]:
baseline_model = LogisticRegression(solver='liblinear')
baseline_model.fit(X_train, y_train)
baseline_model.score(X_test, y_test)

0.8767086261487373

- The baseline model reaches a test accuracy of about 0.877 (`score` returns mean accuracy for scikit-learn classifiers). We will compare the other models against this value to find the ***Best Model***.

In [47]:
models = {'KNN': KNeighborsClassifier(),
          'Decision Tree': DecisionTreeClassifier(),
          'Random Forest': RandomForestClassifier(),
          'Ada Boost': AdaBoostClassifier(),
          'SGDClassifier': SGDClassifier(),
          'Support Vector Machine': SVC()}

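def fit_and_score(models, X_train, X_test, y_train, y_test):
    """Fit each model on the training set and return its test-set accuracy.

    models: dict mapping a display name to an unfitted scikit-learn estimator.
    """
    # Seed NumPy's global RNG so estimators without random_state are reproducible
    np.random.seed(42)

    model_scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        # .score() returns mean accuracy for classifiers
        model_scores[name] = model.score(X_test, y_test)
    return model_scores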

In [48]:
model_scores = fit_and_score(models=models,
                             X_train=X_train,
                             X_test=X_test,
                             y_train=y_train,
                             y_test=y_test)
model_scores

{'Ada Boost': 0.9308826936442969,
 'Decision Tree': 0.9462506757278555,
 'KNN': 0.7423739284886863,
 'Random Forest': 0.9630473395629006,
 'SGDClassifier': 0.6455710865703915,
 'Support Vector Machine': 0.6691636419800757}
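
The scores above are easier to read when set against the baseline accuracy. A minimal sketch follows, using the values copied from the outputs in this notebook; the helper name `compare_to_baseline` is hypothetical and not part of the original code.

```python
# Accuracy values copied from the notebook outputs above.
baseline_score = 0.8767086261487373
model_scores = {
    'Ada Boost': 0.9308826936442969,
    'Decision Tree': 0.9462506757278555,
    'KNN': 0.7423739284886863,
    'Random Forest': 0.9630473395629006,
    'SGDClassifier': 0.6455710865703915,
    'Support Vector Machine': 0.6691636419800757,
}


def compare_to_baseline(scores, baseline):
    """Return (name, score, delta) tuples sorted from best to worst score."""
    rows = [(name, score, score - baseline) for name, score in scores.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


for name, score, delta in compare_to_baseline(model_scores, baseline_score):
    verdict = "above" if delta > 0 else "below"
    print(f"{name:<24} {score:.4f}  {delta:+.4f}  ({verdict} baseline)")
```

Sorting by score puts the strongest candidate first, and the signed difference shows at a glance which models actually improve on the logistic regression baseline.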

- As we can see, SGDClassifier is the weakest model on this dataset (about 0.65 accuracy), and Support Vector Machine and KNN also score below the baseline. Random Forest performs best so far, at about 0.963.
- Next, let's try XGBoost's `XGBClassifier` and other models.
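
One possible reason for the weak SGDClassifier, SVC and KNN scores, which this notebook does not verify, is that these models are sensitive to feature scale and were fitted on unscaled features. The sketch below shows how scaling could be tested with a `Pipeline`. It runs on synthetic data from `make_classification` (an assumption, not the notebook's `train.csv`), with one column deliberately inflated, so its numbers say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the notebook's dataset).
X_demo, y_demo = make_classification(n_samples=2000, n_features=10, random_state=42)
# Inflate one column so the features sit on very different scales.
X_demo[:, 0] *= 1000

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Same estimator, once on raw features and once behind a StandardScaler.
raw_sgd = SGDClassifier(random_state=42).fit(X_tr, y_tr)
scaled_sgd = make_pipeline(StandardScaler(), SGDClassifier(random_state=42)).fit(X_tr, y_tr)

demo_scores = {
    "raw": raw_sgd.score(X_te, y_te),
    "scaled": scaled_sgd.score(X_te, y_te),
}
print(demo_scores)
```

Wrapping the scaler and the model in one pipeline means the scaler is fitted on the training split only, which avoids leaking test-set statistics into training.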

## XGBClassifier

In [49]:
xgb = XGBClassifier(n_estimators=1000,
                    learning_rate=0.8,
                    subsample=0.8,
                    colsample_bytree=0.7,
                    eval_metric='logloss',
                    use_label_encoder=False)

xgb.fit(X_train, y_train)
xgb_pred = xgb.predict(X_test)
xgb.score(X_test, y_test)

0.9596107807552707

In [51]:
params_xgb = {'n_estimators': [50, 100, 250, 400, 600, 800, 1000],
              'learning_rate': [0.2, 0.5, 0.8, 1]}
rs_xgb = RandomizedSearchCV(xgb, param_distributions=params_xgb, cv=5)
rs_xgb.fit(X_train, y_train)
xgb_pred_2 = rs_xgb.predict(X_test)
rs_xgb.score(X_test, y_test)

0.9618503359332767

- Even after tuning, XGBoost (0.9619) scores slightly below Random Forest (0.9630).
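
Note that `RandomizedSearchCV` samples only `n_iter` candidates (10 by default), so the 7 × 4 = 28 combinations in the grid above are only partly explored, and without a `random_state` the sampled candidates can differ between runs. The sketch below shows how a search could be made reproducible and how the winning parameters could be inspected. To stay lightweight it swaps in scikit-learn's `GradientBoostingClassifier` and synthetic data, and the small grid and `scoring="f1"` are illustrative assumptions, not settings from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (an assumption; not the notebook's dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# Illustrative grid, much smaller than the one used for XGBoost above.
param_grid = {
    "n_estimators": [20, 50, 100],
    "learning_rate": [0.05, 0.1, 0.2, 0.5],
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions=param_grid,
    n_iter=5,          # number of sampled candidates
    cv=3,
    scoring="f1",      # optimise the metric used in the final evaluation
    random_state=42,   # makes the sampled candidates reproducible
)
search.fit(X_demo, y_demo)

print(search.best_params_)          # winning parameter combination
print(round(search.best_score_, 4)) # its mean cross-validated f1
```

Printing `best_params_` records which configuration produced the reported score, which the tuned XGBoost cell above does not show.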

## CatBoostClassifier

In [57]:
models_cat = CatBoostClassifier()
models_cat.fit(X_train, y_train)
models_cat.score(X_test, y_test)

Learning rate set to 0.066089
0:	learn: 0.6041666	total: 81.4ms	remaining: 1m 21s
1:	learn: 0.5336533	total: 115ms	remaining: 57.4s
2:	learn: 0.4528259	total: 147ms	remaining: 48.7s
3:	learn: 0.4100859	total: 180ms	remaining: 44.8s
4:	learn: 0.3639059	total: 216ms	remaining: 43s
5:	learn: 0.3398799	total: 247ms	remaining: 41s
6:	learn: 0.3131295	total: 279ms	remaining: 39.6s
7:	learn: 0.2946009	total: 313ms	remaining: 38.8s
8:	learn: 0.2794128	total: 345ms	remaining: 37.9s
9:	learn: 0.2681894	total: 377ms	remaining: 37.3s
10:	learn: 0.2554388	total: 410ms	remaining: 36.8s
11:	learn: 0.2448181	total: 446ms	remaining: 36.7s
12:	learn: 0.2323639	total: 480ms	remaining: 36.4s
13:	learn: 0.2249306	total: 514ms	remaining: 36.2s
14:	learn: 0.2165094	total: 545ms	remaining: 35.8s
15:	learn: 0.2076019	total: 577ms	remaining: 35.5s
16:	learn: 0.1995636	total: 607ms	remaining: 35.1s
17:	learn: 0.1929811	total: 639ms	remaining: 34.9s
18:	learn: 0.1864053	total: 679ms	remaining: 35s
19:	learn: 0.18

0.9654413468221484

- With a score of 0.9654, **CatBoostClassifier** performs best of all the models tried so far, so we will use it for the final predictions.

In [61]:
pred = models_cat.predict(X_test)

## Model Evaluation
- For this evaluation, we look at the classification report, focusing on the **f1 score**.

In [62]:
print(classification_report(y_test, pred))

              precision    recall  f1-score   support

           0       0.96      0.98      0.97     14767
           1       0.97      0.95      0.96     11131

    accuracy                           0.97     25898
   macro avg       0.97      0.96      0.96     25898
weighted avg       0.97      0.97      0.97     25898



- We achieved a weighted-average f1-score of 0.97 (0.97 for class 0 and 0.96 for class 1), which is a strong result.
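
The f1-score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). As a quick check, the sketch below recomputes the per-class and weighted f1 from the rounded precision, recall and support values in the report above. Because the inputs are already rounded to two decimals, the results are approximate; the helper name `f1` and the `report` dictionary are introduced here for illustration.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, support) per class, copied from the classification report.
report = {
    0: (0.96, 0.98, 14767),
    1: (0.97, 0.95, 11131),
}

per_class_f1 = {label: f1(p, r) for label, (p, r, _) in report.items()}

# Weighted average: each class contributes in proportion to its support.
total = sum(n for _, _, n in report.values())
weighted_f1 = sum(per_class_f1[label] * n for label, (_, _, n) in report.items()) / total

for label, value in per_class_f1.items():
    print(f"class {label}: f1 = {value:.2f}")
print(f"weighted f1 = {weighted_f1:.2f}")
```

The recomputed values round to 0.97 for class 0, 0.96 for class 1 and 0.97 for the weighted average, matching the report.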