# General

## Dataset
https://www.kaggle.com/datasets/blastchar/telco-customer-churn

# TODO:
### David:
- pre-processing
  - feature scaling (standardize or normalize?)
- feature selection
- performance visualization
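
For the open "standardize or normalize?" question, a minimal sketch of the two options follows. It assumes scikit-learn's `StandardScaler` and `MinMaxScaler`; `X_demo` is a hypothetical stand-in built from a few `tenure` / `MonthlyCharges` values, not the real preprocessing pipeline.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical stand-in for two numeric columns (tenure, MonthlyCharges).
X_demo = np.array([[1.0, 29.85], [34.0, 56.95], [2.0, 53.85], [45.0, 42.30]])

# Standardization: each column is shifted and scaled to zero mean, unit variance.
X_std = StandardScaler().fit_transform(X_demo)

# Normalization (min-max): each column is rescaled to the range [0, 1].
X_norm = MinMaxScaler().fit_transform(X_demo)

print("standardized mean/std:", X_std.mean(axis=0), X_std.std(axis=0))
print("normalized min/max:   ", X_norm.min(axis=0), X_norm.max(axis=0))
```

In practice the scaler would be fitted on the training split only and then applied to the test split. Tree-based models such as random forests split on thresholds, so scaling is likely to matter more for any comparison models than for `RandomForestClassifier` itself.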


### Ideas for the final report: what is the goal of our final project?

Primary

* Make RandomForestClassifier work and figure out how to optimize its predictive performance.
* Show that it performs better than other algorithms.

Secondary

* Focus more on understanding why one ML algorithm performs better than another.
* Possible angles: overfitting, highly correlated features, other pitfalls of ML algorithms?
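
One way to approach the "performs better than other algorithms" goal is to score every candidate with the same cross-validation folds and the same metric. The sketch below is only an illustration: the synthetic data from `make_classification`, the choice of `LogisticRegression` as the comparison model, and the `roc_auc` metric are all assumptions, not decisions made in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic, mildly imbalanced stand-in data; swap in the notebook's X and y.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.7, 0.3], random_state=100
)

# Candidate models (the comparison model is a hypothetical choice).
models = {
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=100),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

# Score each model with 5-fold cross-validation on the same metric.
scores = {}
for name, model in models.items():
    cv_auc = cross_val_score(model, X_demo, y_demo, cv=5, scoring="roc_auc")
    scores[name] = cv_auc.mean()
    print(f"{name}: mean ROC AUC = {cv_auc.mean():.3f} (+/- {cv_auc.std():.3f})")
```

Reporting the spread across folds alongside the mean helps judge whether a difference between models is larger than the fold-to-fold noise.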

In [5]:
## Imports & Settings
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
import sklearn.metrics as skm

# display all columns and rows:
# pd.set_option('display.max_columns', None)
# pd.set_option('display.max_rows', None)

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=UserWarning)

In [6]:
## Dataset & Preprocessing

# Dataset
df_original = pd.read_csv('dataset/telco-customer-churn.csv', index_col=0)
df = df_original.copy()

# Label Encoding (converting categorical to numerical)
categorical_columns = [
    'gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines','InternetService',
    'OnlineSecurity', 'OnlineBackup','DeviceProtection','TechSupport','StreamingTV',
    'StreamingMovies','Contract','PaperlessBilling','PaymentMethod', 'Churn'
]

# convert categorical columns to numerical codes, replacing them in place
# codes follow the alphabetical order of the category labels, so ['c','z','a'] always encodes to [1,2,0]
for col in categorical_columns:
    df[col] = df[col].astype('category')
    df[col] = df[col].cat.codes

# TotalCharges cleanup: blank entries ' ' are replaced with 0, then the column is cast to float
df['TotalCharges'] = df['TotalCharges'].replace(" ", 0).astype(float)
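
The label-encoding rule described in the comment above (codes follow the sorted category labels) can be checked on a toy series. This is a small sketch using only pandas; the series `s` is the illustrative `['c','z','a']` example from the comment, not a column of the dataset.

```python
import pandas as pd

# Toy series matching the example in the comment.
s = pd.Series(['c', 'z', 'a']).astype('category')

# Codes are positions in the sorted category list: a -> 0, c -> 1, z -> 2.
codes = s.cat.codes.tolist()

# Keep the code -> label mapping so encoded values can be decoded later.
mapping = dict(enumerate(s.cat.categories))

print(codes)    # [1, 2, 0]
print(mapping)  # {0: 'a', 1: 'c', 2: 'z'}
```

Storing such a mapping per column before overwriting it makes the encoded DataFrame easier to interpret in plots and reports. Note that missing values are encoded as -1 by `cat.codes`.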

In [7]:
## Model
# Writing a Class that takes in the dataset and spits out a classifier.
# Define some methods that calculate and display the different performance measures (Accuracy, AUC, ROC)
# Implement Hyper-Parameter comparison with GridSearchCV
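X = df.drop("Churn", axis=1)
y = df["Churn"]

# Hold out 20% of the rows for testing (fixed seed for reproducibility)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100)

# Classifier
clf = RandomForestClassifier(random_state=100)
clf.fit(X_train, y_train)

# Y-hat Prediction
y_pred = clf.predict(X_test)

# Display the feature matrix
X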

customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges
7590-VHVEG,0,0,1,0,1,0,1,0,0,2,0,0,0,0,0,1,2,29.85,29.85
5575-GNVDE,1,0,0,0,34,1,0,0,2,0,2,0,0,0,1,0,3,56.95,1889.50
3668-QPYBK,1,0,0,0,2,1,0,0,2,2,0,0,0,0,0,1,3,53.85,108.15
7795-CFOCW,1,0,0,0,45,0,1,0,2,0,2,2,0,0,1,0,0,42.30,1840.75
9237-HQITU,0,0,0,0,2,1,0,1,0,0,0,0,0,0,0,1,2,70.70,151.65
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6840-RESVB,1,0,1,1,24,1,2,0,2,0,2,2,2,2,1,1,3,84.80,1990.50
2234-XADUH,0,0,1,1,72,1,2,1,0,2,2,0,2,2,1,1,1,103.20,7362.90
4801-JZAZL,0,0,1,1,11,0,1,0,2,0,0,0,0,0,0,1,2,29.60,346.45
8361-LTMKD,1,1,1,0,4,1,2,1,0,0,0,0,0,0,0,1,3,74.40,306.60
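
The hyper-parameter comparison with `GridSearchCV` planned in the Model cell could look roughly like the sketch below. The parameter grid, the `roc_auc` scoring choice, and the synthetic stand-in data are assumptions for illustration; the real search would use the notebook's `X` and `y` and a grid chosen by the team.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data; replace with the notebook's X and y.
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=100)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=100, stratify=y_demo
)

# Hypothetical, deliberately small grid to keep the search fast.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [4, None],
    "min_samples_leaf": [1, 5],
}

# 3-fold cross-validated search over every combination in the grid.
search = GridSearchCV(
    RandomForestClassifier(random_state=100),
    param_grid,
    cv=3,
    scoring="roc_auc",
)
search.fit(X_tr, y_tr)

print("best parameters:", search.best_params_)
print("best CV ROC AUC:", round(search.best_score_, 3))
print("held-out ROC AUC:", round(search.score(X_te, y_te), 3))
```

The search is fitted on the training split only, so the held-out score stays an honest estimate for the selected parameters.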


In [None]:
## Metrics

# Confusion Matrix (rows: true labels, columns: predicted labels)
cf = skm.confusion_matrix(y_test, y_pred)
sns.heatmap(cf/np.sum(cf), fmt='.2%', annot=True, cmap='Blues')

# Classification Report
print(skm.classification_report(y_test, y_pred))
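
The Model cell also lists Accuracy, AUC, and ROC as measures to compute. A minimal sketch of those three follows, assuming a fitted classifier that exposes `predict_proba`; `clf_demo` and the synthetic data are hypothetical stand-ins for the notebook's `clf`, `X_test`, and `y_test`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data and model; replace with the notebook's objects.
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=100)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=100
)
clf_demo = RandomForestClassifier(n_estimators=50, random_state=100).fit(X_tr, y_tr)

# Hard labels for accuracy; positive-class probabilities for ROC and AUC.
y_hat = clf_demo.predict(X_te)
y_score = clf_demo.predict_proba(X_te)[:, 1]

accuracy = accuracy_score(y_te, y_hat)
auc = roc_auc_score(y_te, y_score)

# Points of the ROC curve: false/true positive rate at each threshold.
fpr, tpr, thresholds = roc_curve(y_te, y_score)

print(f"accuracy = {accuracy:.3f}, ROC AUC = {auc:.3f}, ROC points = {len(fpr)}")
```

The `fpr` and `tpr` arrays can be passed straight to `plt.plot(fpr, tpr)` to draw the curve. AUC is computed from probabilities rather than hard labels because the ROC curve sweeps over decision thresholds.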