<div style="border:solid green 2px; padding: 20px">
<b>Reviewer’s comments – Iteration 1:</b><br>

  Hello Michael!
  
I am Alexangel, your reviewer.
  
Another project successfully completed - well done! 🏆 Your consistent effort and progress are truly commendable.

Our team is here to help you keep pushing forward and honing your skills as you advance through the program.

My comments are marked as `Reviewer's comment`. You can contact me via Tripleten Hub for further feedback. This information is described below.

**What Was Great**:
- Excellent job on following the structure of the project.
- You’ve shown strong skills in testing the models in this project.

**Tips for Future Projects**:
- Consider adding a brief comment after each dataframe analysis or graph to make your work more complete.

Congratulations again on your accomplishment! Each project you complete adds to your growing expertise, and it’s exciting to see you make such great strides. Keep up the great work! 🎯

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

***Name of the reviewer***: Alexangel Bracho

***Reviewer's Tripleten Hub  link*** : [reviewer's link](https://hub.tripleten.com/u/6b1cbe37)
</div>


# Final Project: Model Testing and Analysis

## Loading Libraries and Preparing Data

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.dummy import DummyClassifier

from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score

from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

In [2]:
contract_df = pd.read_csv('/datasets/final_provider/contract.csv')
personal_df = pd.read_csv('/datasets/final_provider/personal.csv')
internet_df = pd.read_csv('/datasets/final_provider/internet.csv')
phone_df = pd.read_csv('/datasets/final_provider/phone.csv')

In [3]:
contract_df = contract_df.rename(columns = {'customerID':'customer_id', 'BeginDate':'begin_date', 'EndDate':'end_date',
                                            'Type':'type', 'PaperlessBilling':'paperless_billing', 'PaymentMethod':'payment_method',
                                           'MonthlyCharges':'monthly_charges', 'TotalCharges':'total_charges'})
personal_df = personal_df.rename(columns = {'customerID':'customer_id', 'SeniorCitizen':'senior_citizen', 'Partner':'partner',
                                           'Dependents':'dependents'})
internet_df = internet_df.rename(columns = {'customerID':'customer_id', 'InternetService':'internet_service', 
                                            'OnlineSecurity':'online_security', 'OnlineBackup':'online_backup',
                                           'DeviceProtection':'device_protection', 'TechSupport':'tech_support',
                                           'StreamingTV':'streaming_tv', 'StreamingMovies':'streaming_movies'})
phone_df = phone_df.rename(columns = {'customerID':'customer_id', 'MultipleLines':'multiple_lines'})

In [4]:
merge1 = contract_df.merge(personal_df, on = 'customer_id', how = 'left')
merge2 = internet_df.merge(phone_df, on = 'customer_id', how = 'left')
df = merge1.merge(merge2, on = 'customer_id', how = 'left')
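
One thing worth checking about the merge order above: because `phone_df` is left-joined onto `internet_df` first, any customer who has a phone plan but no internet plan would lose their `multiple_lines` value. Whether such customers exist in the real files is not verified in this notebook, so the sketch below uses made-up toy tables (the IDs and values are purely illustrative) to show the difference between the nested merge and joining each table onto the contract table in turn.

```python
import pandas as pd

# Toy stand-ins for the four tables; IDs and values are invented for illustration.
contract = pd.DataFrame({'customer_id': ['a', 'b', 'c']})
personal = pd.DataFrame({'customer_id': ['a', 'b', 'c'], 'gender': ['F', 'M', 'F']})
internet = pd.DataFrame({'customer_id': ['a', 'b'], 'internet_service': ['DSL', 'Fiber optic']})
phone = pd.DataFrame({'customer_id': ['a', 'c'], 'multiple_lines': ['Yes', 'No']})

# Same order as the cell above: phone is left-joined onto internet first.
nested = contract.merge(personal, on='customer_id', how='left').merge(
    internet.merge(phone, on='customer_id', how='left'),
    on='customer_id', how='left')

# Alternative: left-join every table onto the contract table in turn.
chained = contract.copy()
for table in (personal, internet, phone):
    chained = chained.merge(table, on='customer_id', how='left')

# Customer 'c' has a phone plan but no internet plan.
nested_c = nested.loc[nested['customer_id'] == 'c', 'multiple_lines'].iloc[0]
chained_c = chained.loc[chained['customer_id'] == 'c', 'multiple_lines'].iloc[0]
print(nested_c, chained_c)
```

In the nested version customer `c` ends up with a missing `multiple_lines`, which a later `fillna('No phone service')` would mislabel; the chained version keeps the original value.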

In [5]:
internet_columns = ['internet_service', 'online_security', 'online_backup', 'device_protection', 'tech_support', 'streaming_tv',
                   'streaming_movies']
df[internet_columns] = df[internet_columns].fillna('No internet service')

df['multiple_lines'] = df['multiple_lines'].fillna('No phone service')

In [6]:
df['total_charges'] = pd.to_numeric(df['total_charges'], errors = 'coerce')
df['total_charges'] = df['total_charges'].fillna(0)
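
As a small illustration of the conversion above, here is how `errors = 'coerce'` behaves on a toy series. The blank-string entry is an assumption about what a non-numeric value might look like; the notebook does not show which raw values failed to parse.

```python
import pandas as pd

# Toy values; the blank string stands in for a hypothetical non-numeric entry.
raw = pd.Series(['29.85', ' ', '108.15'])

# Unparseable entries become NaN, then NaN is replaced with 0.
coerced = pd.to_numeric(raw, errors='coerce')
clean = coerced.fillna(0)

print(coerced.isna().sum())
print(clean.tolist())
```

Counting the NaN values before filling them is a quick way to see how many rows the `fillna(0)` step actually touches.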

In [7]:
df['begin_date'] = pd.to_datetime(df['begin_date'])
df['end_date'] = pd.to_datetime(df['end_date'].replace("No", pd.NaT))

In [8]:
# Encode the Yes/No columns as 1/0 in a single pass
for col in ['partner', 'paperless_billing', 'dependents']:
    df[col] = df[col].replace({'Yes': 1, 'No': 0})

In [9]:
cutoff_date = pd.to_datetime("2020-02-01")
df['tenure_days'] = (df['end_date'].fillna(cutoff_date) - df['begin_date']).dt.days
df['tenure_months'] = df['tenure_days'] // 30
df['churn'] = df['end_date'].notna().astype(int)
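
To make the derived columns concrete, here is the same arithmetic on two made-up customers: one who left on 2019-12-01 and one who is still active at the cutoff date. The dates are invented for illustration.

```python
import pandas as pd

cutoff_date = pd.to_datetime('2020-02-01')

# Two hypothetical customers: the first churned, the second is still active.
toy = pd.DataFrame({
    'begin_date': pd.to_datetime(['2019-02-01', '2019-11-01']),
    'end_date': pd.to_datetime(['2019-12-01', None]),
})

# Active customers are measured up to the cutoff date.
toy['tenure_days'] = (toy['end_date'].fillna(cutoff_date) - toy['begin_date']).dt.days
toy['tenure_months'] = toy['tenure_days'] // 30  # approximate 30-day months
toy['churn'] = toy['end_date'].notna().astype(int)

print(toy[['tenure_days', 'tenure_months', 'churn']])
```

The first customer gets 303 days (10 approximate months) and `churn = 1`; the second gets 92 days (3 approximate months) and `churn = 0`.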

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7043 entries, 0 to 7042
Data columns (total 23 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   customer_id        7043 non-null   object        
 1   begin_date         7043 non-null   datetime64[ns]
 2   end_date           1869 non-null   datetime64[ns]
 3   type               7043 non-null   object        
 4   paperless_billing  7043 non-null   int64         
 5   payment_method     7043 non-null   object        
 6   monthly_charges    7043 non-null   float64       
 7   total_charges      7043 non-null   float64       
 8   gender             7043 non-null   object        
 9   senior_citizen     7043 non-null   int64         
 10  partner            7043 non-null   int64         
 11  dependents         7043 non-null   int64         
 12  internet_service   7043 non-null   object        
 13  online_security    7043 non-null   object        
 14  online_b

In [11]:
customer_ids = df['customer_id']
features = df.drop(['customer_id', 'begin_date', 'end_date', 'churn'], axis = 1)
target = df['churn']
print(features.shape, target.shape)

(7043, 19) (7043,)


In [12]:
cat_cols = features.select_dtypes(include = ['object']).columns
num_cols = ['monthly_charges', 'total_charges', 'tenure_months', 'tenure_days']

In [13]:
features_train, features_temp, target_train, target_temp = train_test_split(features, target, test_size = 0.4, random_state = 2356)
features_test, features_valid, target_test, target_valid = train_test_split(features_temp, target_temp, test_size = 0.5, random_state = 2356)
print(features_train.shape, target_train.shape)
print()
print(features_valid.shape, target_valid.shape, features_test.shape, target_test.shape)

(4225, 19) (4225,)

(1409, 19) (1409,) (1409, 19) (1409,)


Splitting our data into training, validation, and test sets at a typical 3:1:1 ratio (60% / 20% / 20%) gives 4,225 training rows and 1,409 rows each for validation and test.
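
Since the classes are imbalanced, one option worth considering is passing `stratify` so that each subset keeps the same churn rate. This is a suggestion rather than what the cell above does; the sketch below uses synthetic data with a 25% positive rate, and the variable names are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 rows, 25% positives (placeholder for features/target).
X = np.arange(1000).reshape(-1, 1)
y = np.array([1] * 250 + [0] * 750)

# First split off 40%, then halve it, stratifying on the target both times.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=0)

print(len(X_train), len(X_valid), len(X_test))
print(y_train.mean(), y_valid.mean(), y_test.mean())
```

With stratification, all three subsets keep the 25% positive rate of the full synthetic set.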

In [14]:
preprocessor = ColumnTransformer(
    transformers = [('num', StandardScaler(), num_cols), ('cat', OneHotEncoder(drop = 'first'), cat_cols)
                   ]
)
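
A detail to be aware of with the preprocessor above: `ColumnTransformer` defaults to `remainder = 'drop'`, so columns that appear in neither `num_cols` nor `cat_cols` (here the integer-encoded `paperless_billing`, `senior_citizen`, `partner`, and `dependents`) are not passed to the models. The toy frame below is invented for illustration and compares the default with `remainder = 'passthrough'`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame: one numeric, one categorical, one already-encoded binary column.
toy = pd.DataFrame({
    'monthly_charges': [20.0, 50.0, 80.0, 110.0],
    'type': ['Month-to-month', 'One year', 'Two year', 'One year'],
    'partner': [1, 0, 1, 0],
})

def build(remainder):
    """Build a transformer that only names the numeric and categorical columns."""
    return ColumnTransformer(
        transformers=[('num', StandardScaler(), ['monthly_charges']),
                      ('cat', OneHotEncoder(drop='first'), ['type'])],
        remainder=remainder,
    )

dropped = build('drop').fit_transform(toy)        # default behaviour
kept = build('passthrough').fit_transform(toy)    # keeps 'partner' as-is

print(dropped.shape, kept.shape)
```

The default output has 3 columns (one scaled, two one-hot), while the passthrough output has 4 because `partner` is retained.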

In [15]:
#determining class weights and counts for scale_pos_weight use in XGB and LGBM
class_counts = df['end_date'].notna().value_counts()
class_counts

False    5174
True     1869
Name: end_date, dtype: int64

In [16]:
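# class_counts is indexed by False/True, so look the counts up by label
scale_pos_weight = class_counts.loc[False] / class_counts.loc[True]

In [17]:
# 'balanced'-style weights: n_samples / (n_classes * class_count)
cat_boost_weights = [(len(df) / (2 * class_counts.loc[False])), (len(df) / (2 * class_counts.loc[True]))]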
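
For reference, the weight arithmetic can be worked through with the class counts printed above (5174 non-churned, 1869 churned). The variable names `n_neg` and `n_pos` are just shorthand for this sketch.

```python
# Class counts taken from the value_counts output above.
n_neg, n_pos = 5174, 1869
n_total = n_neg + n_pos

# Ratio of negatives to positives, used by XGBoost and LightGBM.
scale_pos_weight = n_neg / n_pos

# Per-class weights n_samples / (n_classes * class_count), used by CatBoost.
cat_boost_weights = [n_total / (2 * n_neg), n_total / (2 * n_pos)]

print(round(scale_pos_weight, 3))
print([round(w, 3) for w in cat_boost_weights])
```

The two schemes are consistent with each other: the ratio of the two CatBoost weights equals `scale_pos_weight`, so all three boosted models upweight the churn class by roughly the same factor of about 2.77.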

## Analysis and Evaluation of Training and Validation Data

In [18]:
models = {'Logistic Regression': LogisticRegression(class_weight = 'balanced',
                                                                    solver = 'liblinear',
                                                                    random_state = 2356
                                                   ),
          'Decision Tree': DecisionTreeClassifier(class_weight = 'balanced',
                                                  random_state = 2356
                                                 ),
          'Random Forest': RandomForestClassifier(class_weight = 'balanced',
                                                  random_state = 2356
                                                 ),
          'XGBoost': XGBClassifier(scale_pos_weight = scale_pos_weight,
                                   use_label_encoder = False,
                                   eval_metric = 'auc',
                                   random_state = 2356
                                  ),
          'LightGBM': LGBMClassifier(scale_pos_weight = scale_pos_weight,
                                     random_state = 2356
                                    ),
          'CatBoost': CatBoostClassifier(class_weights = cat_boost_weights,
                                         random_state = 2356,
                                         verbose = 0
                                        )
         }
results = {}

for name, model in models.items():
    pipeline = Pipeline([
        ('preprocessor', preprocessor), ('classifier', model)
    ])
    pipeline.fit(features_train, target_train)
    pred_prob = pipeline.predict_proba(features_valid)[:, 1]
    auc = roc_auc_score(target_valid, pred_prob)
    results[name] = auc
# Build the summary table once, after every model has been scored
results_df = pd.DataFrame.from_dict(results, orient = 'index', columns = ['ROC-AUC'])
results_df.index.name = 'Model'
results_df = results_df.sort_values('ROC-AUC', ascending = False)
results_df

| Model | ROC-AUC |
| --- | --- |
| LightGBM | 0.892073 |
| XGBoost | 0.890113 |
| CatBoost | 0.886889 |
| Random Forest | 0.854115 |
| Logistic Regression | 0.845545 |
| Decision Tree | 0.724422 |


Comparing the validation ROC-AUC scores of our models before cross-validation gives me a good idea of what to expect afterwards. The scores may not match exactly, but this at least shows which models are separating themselves from the rest: the three boosted models sit near 0.89, ahead of Random Forest and Logistic Regression (about 0.85) and the single Decision Tree (0.72).

In [19]:
models = {'Logistic Regression': LogisticRegression(class_weight = 'balanced',
                                                                    solver = 'liblinear',
                                                                    random_state = 2356
                                                   ),
          'Decision Tree': DecisionTreeClassifier(class_weight = 'balanced',
                                                  random_state = 2356
                                                 ),
          'Random Forest': RandomForestClassifier(class_weight = 'balanced',
                                                  random_state = 2356
                                                 ),
          'XGBoost': XGBClassifier(scale_pos_weight = scale_pos_weight,
                                   use_label_encoder = False,
                                   eval_metric = 'auc',
                                   random_state = 2356
                                  ),
          'LightGBM': LGBMClassifier(scale_pos_weight = scale_pos_weight,
                                     random_state = 2356
                                    ),
          'CatBoost': CatBoostClassifier(class_weights = cat_boost_weights,
                                         random_state = 2356,
                                         verbose = 0
                                        )
         }
cross_val_results = {}

for name, model in models.items():
    pipeline = Pipeline([
        ('preprocessor', preprocessor), ('classifier', model)
    ])
    scores = cross_val_score(
        pipeline,
        features_train,
        target_train,
        cv = 5,
        scoring = 'roc_auc'
    )

    cross_val_results[name] = {
        'mean_auc': scores.mean(),
        'std_auc': scores.std()
    }
cross_val_df = pd.DataFrame(cross_val_results).T
cross_val_df = cross_val_df[['mean_auc', 'std_auc']]
cross_val_df = cross_val_df.sort_values('mean_auc', ascending = False)
cross_val_df

| Model | mean_auc | std_auc |
| --- | --- | --- |
| XGBoost | 0.885597 | 0.007435 |
| CatBoost | 0.883571 | 0.009063 |
| LightGBM | 0.883344 | 0.007506 |
| Logistic Regression | 0.851195 | 0.014797 |
| Random Forest | 0.84632 | 0.013522 |
| Decision Tree | 0.696199 | 0.014131 |


After running 5-fold cross-validation on the chosen models, the boosted models still performed the best, all showing mean ROC-AUC scores just over 0.88 with standard deviations below 0.01. The simpler models, Logistic Regression and Random Forest, performed reasonably well at about 0.85, and the Decision Tree Classifier was the poorest of the bunch, not even reaching 0.7. For evaluating with the test set, I'll be using the 3 boosted models and comparing their performance against each other and the baseline.
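
For the baseline comparison, `DummyClassifier` is already imported, so a minimal sketch of how a baseline ROC-AUC could be obtained is shown below. It runs on synthetic data rather than the project data, and the `'prior'` strategy is an assumption about which baseline is intended.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import roc_auc_score

# Synthetic stand-in data: 200 rows, 25% positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.array([1] * 50 + [0] * 150)

# The 'prior' strategy predicts the same class probabilities for every row.
baseline = DummyClassifier(strategy='prior').fit(X, y)
baseline_auc = roc_auc_score(y, baseline.predict_proba(X)[:, 1])

print(baseline_auc)
```

Because every row receives the same score, the baseline cannot rank churners above non-churners and its ROC-AUC is 0.5, which is the reference point for the scores in the tables above.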

## Evaluation and Analysis of Test Data

In [20]:
models = {'XGBoost': XGBClassifier(scale_pos_weight = scale_pos_weight,
                                   use_label_encoder = False,
                                   eval_metric = 'auc',
                                   random_state = 2356
                                  ),
          'LightGBM': LGBMClassifier(scale_pos_weight = scale_pos_weight,
                                     random_state = 2356
                                    ),
          'CatBoost': CatBoostClassifier(class_weights = cat_boost_weights,
                                         random_state = 2356,
                                         verbose = 0
                                        )
         }
test_results = {}

for name, model in models.items():
    pipeline = Pipeline([
        ('preprocessor', preprocessor), ('classifier', model)
    ])
    pipeline.fit(features_train, target_train)
    pred_prob = pipeline.predict_proba(features_test)[:, 1]
    auc = roc_auc_score(target_test, pred_prob)
    test_results[name] = auc
# Build the summary table once, after every model has been scored
test_results_df = pd.DataFrame.from_dict(test_results, orient = 'index', columns = ['ROC-AUC'])
test_results_df.index.name = 'Model'
test_results_df = test_results_df.sort_values('ROC-AUC', ascending = False)
test_results_df

| Model | ROC-AUC |
| --- | --- |
| CatBoost | 0.867685 |
| LightGBM | 0.866738 |
| XGBoost | 0.864549 |


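After scoring the models on the test data, the results are very promising! All three boosted models land between 0.864 and 0.868. The CatBoost model gets a slight edge over LGBM, with XGB not too far behind, but the spread of about 0.003 is smaller than the cross-validation standard deviations reported earlier, so the ranking between them should not be over-read. Knowing that a score closer to 1 is indicative of a stronger model, we have 3 strong choices on our hands with these boosted models.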
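
As a reminder of what the metric measures, ROC-AUC can be read as the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner. The small hand-made example below (four invented scores) checks that pairwise reading against `roc_auc_score`.

```python
from itertools import product
from sklearn.metrics import roc_auc_score

# Four hypothetical customers: true labels and predicted churn scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

pos = [s for s, y in zip(scores, y_true) if y == 1]
neg = [s for s, y in zip(scores, y_true) if y == 0]

# Count positive/negative pairs ranked correctly (ties count as half).
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg))
pairwise_auc = wins / (len(pos) * len(neg))

sklearn_auc = roc_auc_score(y_true, scores)
print(pairwise_auc, sklearn_auc)
```

Three of the four positive/negative pairs are ordered correctly, so both calculations give 0.75. By the same reading, a test score near 0.87 means the model ranks a churner above a non-churner in roughly 87% of such pairs.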

<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Good work so far with the `Solution Code`.

</div>