# Decision Trees - Class Exercise 2

## Introduction

Customer churn is a major concern for large companies because it directly reduces revenue, especially in telecommunications. Companies therefore want to predict which customers are likely to churn. Identifying the factors that correlate with churn helps them take targeted action to reduce it.

The data set includes information about:
* Customers who left within the last month – this column is called Churn
* Services that each customer signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
* Customer account information – how long they have been a customer, contract, payment method, paperless billing, monthly charges, and total charges
* Demographic information about customers – gender, age range, and if they have partners and dependents

Our goal is to build a decision tree model that predicts customer churn, so that the company can act to retain customers who are at risk of leaving.

## Metadata

| Variables          | Description                                                     |
|--------------------|-----------------------------------------------------------------|
| customerID         | Customer ID                                                     |
| gender             | Whether the customer is a male or a female                      |
| SeniorCitizen      | Whether the customer is a senior citizen or not (1 = yes, 0 = no) |
| Partner            | Whether the customer has a partner or not (Yes, No)             |
| Dependents         | Whether the customer has dependents or not (Yes, No)            |
| tenure             | Number of months the customer has stayed with the company       |
| PhoneService       | Whether the customer has a phone service or not (Yes, No)       |
| MultipleLines      | Whether the customer has multiple lines or not (Yes, No, No phone service) |
| InternetService    | Customer’s internet service provider (DSL, Fiber optic, No)     |
| OnlineSecurity     | Whether the customer has online security or not (Yes, No, No internet service) |
| OnlineBackup       | Whether the customer has online backup or not (Yes, No, No internet service) |
| DeviceProtection   | Whether the customer has device protection or not (Yes, No, No internet service) |
| TechSupport        | Whether the customer has tech support or not (Yes, No, No internet service) |
| StreamingTV        | Whether the customer has streaming TV or not (Yes, No, No internet service) |
| StreamingMovies    | Whether the customer has streaming movies or not (Yes, No, No internet service) |
| Contract           | The contract term of the customer (Month-to-month, One year, Two year) |
| PaperlessBilling   | Whether the customer has paperless billing or not (Yes, No)     |
| PaymentMethod      | The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic)) |
| MonthlyCharges     | The amount charged to the customer monthly                      |
| TotalCharges       | The total amount charged to the customer                        |
| Churn              | Whether the customer churned or not (Yes or No)                 |


## Import Necessary Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay

## Import Data

In [None]:
df = pd.read_csv(_)
df

## Handling Missing Values

Some missing values in this dataset are represented by whitespace characters. We will begin by replacing these whitespace-only entries with ```NaN```, then impute all the ```NaN``` values afterwards.

In [None]:
# Replace whitespace characters with missing values
df.replace(r'^\s*$', np.nan, regex=True, inplace=True)
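To see what the regular expression above matches, here is a minimal sketch on a toy frame. The values are made up for illustration and are not taken from the dataset. Only entries that are empty or consist entirely of whitespace are replaced, so a value such as `One year`, which merely contains a space, is left alone.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values: blank and whitespace-only strings stand in for missing data
toy = pd.DataFrame({
    "TotalCharges": ["29.85", " ", "108.15", ""],
    "Contract": ["One year", "Two year", "   ", "One year"],
})

# ^\s*$ matches strings that are empty or contain only whitespace
toy_clean = toy.replace(r"^\s*$", np.nan, regex=True)

print(toy_clean)
```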

In [None]:
# Summarize counts of missing values
missing_values = df._
missing_values

We shall impute these missing values with the:
* **mean** for **floating-point** type variables
* **rounded mean** for **integer** type variables
* **mode** for **string** type variables

In [None]:
# Ensure that each variable is set to the correct data type
for col in df.columns:
    if df[col].dtype == 'object':
        try:
            df[col] = pd.to_numeric(df[col], errors='raise')
        except ValueError:
            pass

In [None]:
df.dtypes

In [None]:
# Impute missing values
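for col in df.columns:
    if df[col].isnull().any():
        # Assign the result back: calling fillna(inplace=True) on df[col]
        # may act on a temporary copy and leave df unchanged in recent pandas
        if df[col].dtype == 'float64':
            df[col] = df[col].fillna(df[col].mean())
        elif df[col].dtype == 'int64':
            # NumPy int64 columns cannot hold NaN, so this branch is only a safeguard
            df[col] = df[col].fillna(round(df[col].mean())).astype('int64')
        elif df[col].dtype == 'object':
            df[col] = df[col].fillna(df[col].mode()[0])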
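The three imputation rules can be illustrated on a toy frame. This is a sketch with made-up values, not the dataset itself. One detail worth noticing is that a whole-number column containing `NaN` is stored by pandas as `float64`, so applying the rounded-mean rule to it here requires an explicit cast back to `int64`. Treating `tenure` as the integer-type column is an assumption made for the example.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values and one missing entry per column
toy = pd.DataFrame({
    "MonthlyCharges": [20.0, np.nan, 40.0, 30.0],              # floating-point variable
    "tenure": [2, np.nan, 5, 6],                                # whole numbers, stored as float64 due to NaN
    "Contract": ["One year", None, "One year", "Two year"],    # string variable
})
print(toy.dtypes)

# Mean for the floating-point variable
toy["MonthlyCharges"] = toy["MonthlyCharges"].fillna(toy["MonthlyCharges"].mean())

# Rounded mean for the integer-valued variable, then cast back to int64
toy["tenure"] = toy["tenure"].fillna(round(toy["tenure"].mean())).astype("int64")

# Mode (most frequent value) for the string variable
toy["Contract"] = toy["Contract"].fillna(toy["Contract"].mode()[0])

print(toy)
```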

In [None]:
# Verify that there are no missing values after imputation
missing_values = df._
missing_values

We observe that there are no more missing values in the dataset.

## Dropping Irrelevant Features

In [None]:
# Drop customerID: it is a unique identifier and carries no predictive information
df.drop('customerID', axis=1, inplace=True)
df.shape

## One-hot Encoding for String-type Variables

The decision tree model in ```scikit-learn``` only works with numerically encoded variables, so we must encode all string-type variables numerically before training. **One-hot encoding** converts a string-type variable with ```K``` classes into ```K``` binary indicator columns, one per class. Here we drop the first class of each variable (```drop='first'```), leaving ```K-1``` columns, because the dropped class is implied when all remaining indicators are 0.

In [None]:
# Initialize encoder
encoder = OneHotEncoder(drop='first', sparse_output=False)

# Encode string-type variables
for column in df.select_dtypes(include=['object']).columns:
    encoded_result = encoder.fit_transform(df[[column]])
    # Reuse df's index so the concatenation below aligns rows correctly
    encoded_df = pd.DataFrame(encoded_result, columns=encoder.get_feature_names_out([column]), index=df.index)
    # Drop original column and concatenate the new one-hot encoded DataFrame
    df.drop(column, axis=1, inplace=True)
    df = pd.concat([df, encoded_df], axis=1)

df.head()

## Train-Test Split

In [None]:
label = _
excluded_columns = [label]
features = [feature for feature in list(df) if feature not in excluded_columns]

In [None]:
X = df[features]
y = df[label]

In [None]:
# Specify split parameters
random_seed = 9002
test_size = 0.2

# Perform train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_seed)

In [None]:
print('Size of train set: ', len(X_train))
print('Size of test set: ', len(X_test))

## Train Model

In [None]:
# Specify model parameters
criterion = 'gini'
min_samples_leaf = 1000

# Build model
model = DecisionTreeClassifier(_)

# Fit model on training data
model.fit(_, _)

# Visualize the decision tree
feature_names = X_train.columns.tolist()
plt.figure(figsize=(10, 4))
plot_tree(model, filled=True, feature_names=feature_names)
plt.show()

## Evaluate Model

In [None]:
# Predict test data
y_pred = model.predict(_)

In [None]:
# Generate confusion matrix
cm = confusion_matrix(_, _)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot(cmap=plt.cm.Blues)
plt.show()

In [None]:
# Check test accuracy
accuracy_test = accuracy_score(_, _)
print(f"Test accuracy: {accuracy_test:.2f}")
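Accuracy can also be read directly off a confusion matrix: it is the sum of the diagonal (correct predictions) divided by the total count. The sketch below checks this on a handful of made-up labels. The arrays `y_true` and `y_hat` are hypothetical and unrelated to the exercise data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels (1 = churned, 0 = stayed)
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_hat = np.array([0, 0, 1, 1, 0, 0, 1, 0])

# Rows are true classes, columns are predicted classes
toy_cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = toy_cm.ravel()

# Accuracy = correct predictions / all predictions
accuracy_from_cm = (tn + tp) / toy_cm.sum()
print(toy_cm)
print(accuracy_from_cm, accuracy_score(y_true, y_hat))
```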

## Model Improvement

To improve the model, we will experiment with two hyperparameters:
* ```criterion```
* ```min_samples_leaf```
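To get a feel for these two hyperparameters, the sketch below fits trees on synthetic data generated with `make_classification`. This is an assumption for illustration only, not the churn dataset. `criterion` chooses the impurity measure used to score splits, while `min_samples_leaf` sets the minimum number of training samples each leaf must contain, so larger values force a smaller, simpler tree.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the churn dataset)
X_toy, y_toy = make_classification(n_samples=500, n_features=6, random_state=0)

leaf_counts = {}
for criterion in ["gini", "entropy"]:
    for min_samples_leaf in [1, 100]:
        tree = DecisionTreeClassifier(
            criterion=criterion,
            min_samples_leaf=min_samples_leaf,
            random_state=0,
        )
        tree.fit(X_toy, y_toy)
        # Record how many leaves the fitted tree ends up with
        leaf_counts[(criterion, min_samples_leaf)] = tree.get_n_leaves()

print(leaf_counts)
```

With 500 samples and `min_samples_leaf=100`, a tree can have at most 5 leaves, whereas the unconstrained setting is free to grow much deeper.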

We first define a hyperparameter grid listing the candidate values for ```criterion``` and ```min_samples_leaf```. ```GridSearchCV``` then evaluates every combination in the grid using cross-validation and selects the combination with the best average validation score.
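Conceptually, a grid search is a loop over every combination of hyperparameter values, scoring each one by cross-validation. The sketch below spells that loop out by hand with `ParameterGrid` and `cross_val_score` on synthetic data. The data, the small grid, and the 5-fold split are assumptions chosen to keep the example quick, and this is not how `GridSearchCV` is implemented internally.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, ParameterGrid, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data and a deliberately small grid (both hypothetical)
X_toy, y_toy = make_classification(n_samples=500, n_features=6, random_state=0)
toy_grid = {"criterion": ["gini", "entropy"], "min_samples_leaf": [5, 50]}
toy_cv = KFold(n_splits=5, shuffle=True, random_state=0)

results = []
for params in ParameterGrid(toy_grid):
    tree = DecisionTreeClassifier(random_state=0, **params)
    # One accuracy score per fold; the mean is the candidate's validation score
    scores = cross_val_score(tree, X_toy, y_toy, cv=toy_cv, scoring="accuracy")
    results.append((scores.mean(), params))

# Keep the combination with the highest mean validation accuracy
best_score, best_params = max(results, key=lambda item: item[0])
print(best_params, round(best_score, 2))
```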

In [None]:
# Define hyperparameter grid
param_grid = {
    'criterion': ['gini', 'entropy'],
    'min_samples_leaf': [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100, 1200, 1300, 1400, 1500]
}

In [None]:
# Perform grid search with 10-fold cross-validation
model = DecisionTreeClassifier()
cv = KFold(n_splits=10, shuffle=True, random_state=9002)
grid_search = GridSearchCV(_)
grid_search.fit(X_train, y_train)

In [None]:
# Display best params and best validation score
print("Best parameters:", grid_search.best_params_)
print(f"Best average cross-validation score: {grid_search.best_score_:.2f}")

In [None]:
# Retrieve the optimal model; GridSearchCV has already refit it on the full training set with the best params
optimal_model = grid_search.best_estimator_

# Visualize the optimal decision tree
plt.figure(figsize=(20, 8))
plot_tree(optimal_model, filled=True, feature_names=feature_names)
plt.show()

In [None]:
# Apply the optimal model on the test data
y_test_pred = optimal_model.predict(_)
test_accuracy = accuracy_score(_, _)
print(f"Test accuracy: {test_accuracy:.2f}")

In [None]:
# Generate confusion matrix for optimal model
cm = confusion_matrix(_, _)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot(cmap=plt.cm.Blues)
plt.show()