# 1 - Design Phase

In [1]:
# Author information
__author__ = "Troy Reynolds"
__email__ = "Troy.Lloyd.Reynolds@gmail.com"

# Business Problem

#### <u>Goal</u>: 
Create a churn model to identify clients that are likely to terminate services.

#### <u>Error Metric</u>: 
1. Accuracy if classes are balanced
2. Recall if classes are imbalanced

**Assumption for Recall**: The opportunity loss from a false negative outweighs the profit loss from a false positive.

**Reasoning**: Losing a false negative client (a churner the model missed) eliminates all profit from that client, whereas the promotions sent to a false positive client only reduce some profit.
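The metric rule above can be sketched in plain Python. The helper names (`accuracy`, `recall`, `choose_metric`), the 10-point balance tolerance, and the toy labels are illustrative assumptions, not part of the original notebook.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Share of actual positives (churners) that were predicted positive."""
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    return true_pos / actual_pos if actual_pos else 0.0


def choose_metric(y_true, positive=1, tolerance=0.10):
    """Return "accuracy" when the positive share is within `tolerance` of 50%,
    otherwise "recall". The tolerance is an assumed, illustrative threshold."""
    positive_share = sum(t == positive for t in y_true) / len(y_true)
    return "accuracy" if abs(positive_share - 0.5) <= tolerance else "recall"


# Made-up toy labels: 1 = client exited, 0 = client stayed
y_true = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]

print("Metric to report:", choose_metric(y_true))
print("Accuracy:", accuracy(y_true, y_pred))
print("Recall:  ", recall(y_true, y_pred))
```

On these toy labels the accuracy looks high even though half of the churners are missed, which is the situation the recall assumption is meant to guard against.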

#### <u>Baseline Model</u>: 
As a baseline, we use a model that predicts no client will terminate services.
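One possible way to express this baseline is scikit-learn's `DummyClassifier` with a constant prediction of 0. This is a sketch under that assumption; the notebook does not show how the baseline is implemented, and the labels below are made up.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Made-up toy data: the features are irrelevant to a constant classifier
X = np.zeros((10, 1))
y = np.array([1, 0, 0, 0, 1, 0, 0, 0, 0, 0])  # 1 = exited, 0 = stayed

# Always predict class 0 ("no client terminates services")
baseline = DummyClassifier(strategy="constant", constant=0)
baseline.fit(X, y)
pred = baseline.predict(X)

baseline_accuracy = accuracy_score(y, pred)
baseline_recall = recall_score(y, pred)

print("Baseline accuracy:", baseline_accuracy)
print("Baseline recall:  ", baseline_recall)
```

Because the baseline never predicts a churner, its recall is zero by construction, so any model that identifies at least one true churner improves on it under the recall metric.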

#### <u>Proposed Solutions</u>: 
1. Logistic Regression: Linearly separating model
2. KNN Model: Distance-based model
3. Random Forest: Node purity based model
4. XGBoost: Gradient-boosted ensemble model

# Split data into train/test data

#### <u>Load Necessary Files</u>:

In [2]:
# Libraries
import pandas as pd
import sys
import os
from sklearn.model_selection import train_test_split

# Add data_storage and helper_functions to the module search path
sys.path.insert(0, "./data_storage")
sys.path.insert(0, "./helper_functions")

from reporting import dataset_glimpse, clean_data_report

#### <u>Load and Describe Data</u>:
1. 3 ID Features (RowNumber, CustomerId, Surname)
    - **Assumption**: The ID features do not aid in model prediction
    - Surname might carry predictive signal, but using it could expose the model to profiling
    
2. 2 Categorical Features (Geography, Gender)
3. 2 Binary Features (HasChckng, IsActiveMember)
4. 6 Numeric Features (CreditScore, Age, Tenure, Balance, NumOfProducts, EstimatedSalary)
5. The target feature (Exited) is binary
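The grouping above can be written down as a small lookup and checked against the 14 columns the dataset reports. The dictionary name and group keys are illustrative assumptions; the column names are taken from the dtypes listing in this notebook.

```python
# Feature groups as described in the list above (group keys are hypothetical)
FEATURE_GROUPS = {
    "id": ["RowNumber", "CustomerId", "Surname"],
    "categorical": ["Geography", "Gender"],
    "binary": ["HasChckng", "IsActiveMember"],
    "numeric": [
        "CreditScore", "Age", "Tenure",
        "Balance", "NumOfProducts", "EstimatedSalary",
    ],
    "target": ["Exited"],
}

# Flatten all groups to confirm every column is accounted for exactly once
all_columns = [col for cols in FEATURE_GROUPS.values() for col in cols]

# Columns that remain as model inputs once ID features and the target are set aside
model_features = [
    col
    for group, cols in FEATURE_GROUPS.items()
    if group not in ("id", "target")
    for col in cols
]

print("Total columns: ", len(all_columns))
print("Model features:", len(model_features))
```

With a loaded DataFrame, the ID columns could then be dropped with `data.drop(columns=FEATURE_GROUPS["id"])`.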

In [3]:
# Read in data (head, dtypes)
data = pd.read_csv("data_storage/Churn Modeling.csv")
dataset_glimpse(data)

*********************** Characteristics of Churn Dataset ***********************
Dimensions of data: 10000 observations, 14 features

**************************** Feature and data types ****************************
RowNumber            int64
CustomerId           int64
Surname             object
CreditScore        float64
Geography           object
Gender              object
Age                  int64
Tenure               int64
Balance            float64
NumOfProducts        int64
HasChckng            int64
IsActiveMember       int64
EstimatedSalary    float64
Exited               int64
dtype: object

***************************** First 6 Observations *****************************


| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasChckng | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619.0 | West | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608.0 | Central | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502.0 | West | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699.0 | West | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850.0 | Central | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |


#### <u>Data Cleanliness Report</u>:
The data is assessed against the business logic expected of each feature (for example, non-negative values and strictly binary flags) rather than on patterns in the data itself.

NOTE: CreditScore has 3 missing values.

In [4]:
clean_data_report(data)

******************************** Missing Values ********************************
Features           Missing Values
CreditScore        3
RowNumber          0
CustomerId         0
Surname            0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasChckng          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64

********************************** Duplicates **********************************
There are no duplicates in the data.

*************************** Businesss Logic Defiance ***************************
Negative Value Assessment

Feature            Negative Values
RowNumber          0
CustomerId         0
CreditScore        0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasChckng          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64

Expected Binary Feature Evauation:

Feature		Is Binary?
HasChckng:	B
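The `clean_data_report` helper itself is not shown in this notebook. The following is a minimal sketch of the kinds of checks its output suggests (missing values, duplicates, negative values, binary validation), written with pandas. The function name `basic_cleanliness_checks` and the toy frame are hypothetical, and the real helper may work differently.

```python
import numpy as np
import pandas as pd


def basic_cleanliness_checks(df, binary_cols):
    """Hypothetical stand-in for a cleanliness report; returns a dict of checks."""
    numeric = df.select_dtypes(include="number")
    return {
        # Count of missing values per column
        "missing": df.isna().sum(),
        # Number of fully duplicated rows
        "n_duplicates": int(df.duplicated().sum()),
        # Count of negative values per numeric column (NaN compares as False)
        "negatives": (numeric < 0).sum(),
        # True when every non-missing value in the column is 0 or 1
        "is_binary": {
            col: bool(df[col].dropna().isin([0, 1]).all()) for col in binary_cols
        },
    }


# Made-up toy frame with one missing CreditScore
toy = pd.DataFrame(
    {
        "CreditScore": [619.0, np.nan, 502.0],
        "Balance": [0.0, 83807.86, 159660.8],
        "HasChckng": [1, 0, 1],
    }
)

report = basic_cleanliness_checks(toy, binary_cols=["HasChckng"])
print(report["missing"])
print("Duplicates:", report["n_duplicates"])
print(report["negatives"])
print(report["is_binary"])
```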

#### <u>Data Split</u>:
The data will be split into training and test sets at an 80/20 ratio, stratified on `Exited` so that both sets keep the same churn rate. No separate validation set is held out because cross-validation will be performed.

In [5]:
# Split data into train/test (80/20), stratified on the target
train_data, test_data = train_test_split(
    data, test_size=0.2, random_state=42, stratify=data["Exited"]
)

# print dataset dimensions
print("New dataset Dimensions: ")
print("Train: ", train_data.shape)
print("Test:  ", test_data.shape)

# Save dataframes as .pkl file
train_data.to_pickle("data_storage/train_data.pkl")
test_data.to_pickle("data_storage/test_data.pkl")
print("\nDataset Saved")

New dataset Dimensions: 
Train:  (8000, 14)
Test:   (2000, 14)

Dataset Saved
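The Data Split section notes that cross-validation takes the place of a validation set. A minimal sketch of that idea follows, assuming stratified 5-fold cross-validation on the training portion with recall as the score. The synthetic data, the fold count, and the hyperparameters are assumptions for illustration. XGBoost is left out only to avoid an extra dependency.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced stand-in for the churn data (roughly 20% positives)
X, y = make_classification(
    n_samples=500, n_features=6, weights=[0.8, 0.2], random_state=42
)

# Same 80/20 stratified split as in the notebook; the test set stays untouched
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Three of the proposed model families, with assumed settings
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

# Each fold serves once as the validation set, so no separate hold-out is needed
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="recall")
    results[name] = scores.mean()

for name, mean_recall in results.items():
    print(f"{name}: mean recall = {mean_recall:.3f}")
```

The test set is only scored once, after a model has been chosen from the cross-validated results.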
