_MBD @AP2022_

_CAPSTONE PROJECT_

_Group A_

# Capstone - Churn Prediction for ClientCo®



<img width="600" style="float:left" 
src="https://images.unsplash.com/photo-1509909756405-be0199881695?ixlib=rb-4.0.3&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=2070&q=80" />



## 0. Context


The aim of this project is to predict churn from retail data. The target is a score of 80%, using the F1-Score as the metric.

_+ Info:_ <https://blackboard.ie.edu/ultra/courses/_46938_1/outline/file/_683290_1>


## 1. Framing the Problem

### Type of ML System

* Supervised Learning: the data set includes the target’s labels [CHURN]
* Batch: the data provided comes in `.csv` format.
* Model based: the intended approach is to generalize from the training set, with a ML model.

### Kind of ML Problem

* Binary Classification: given the objective of the project and the nature of the data, we can assume the goal is to build a Binary Classifier.

### Performance Metric

    
<img width="200" style="float:center" 
 src="https://miro.medium.com/max/1400/1*9uo7HN1pdMlMwTbNSdyO3A.png" />
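
As a reference for the metric, here is a minimal sketch of how the F1-Score can be computed from confusion-matrix counts. The function name `f1_from_counts` and the sample counts are illustrative assumptions, not part of the project code.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, from confusion-matrix counts."""
    if tp == 0:
        # No true positives: precision and recall are both 0 (or undefined)
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 80 churners caught, 20 false alarms, 20 churners missed
score = f1_from_counts(tp=80, fp=20, fn=20)
print(round(score, 2))
```

Because it combines precision and recall, the F1-Score penalizes a model that simply predicts the majority class, which plain accuracy would not do.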

--- 

### Summary

* Goal: predict **Churn**
* Type of ML System: **Supervised**
* Type of ML Problem: **Binary Classification**
* Performance Metric: **F1-Score**

## 2. Obtaining the Data

Let's start off with the data. 

First of all, we have a single dataset:

* `train.csv`

Our data is not split for us. We will use `80%` of the complete dataset as the `train` set, with a subsequent `K-Fold CV` on it to tune the hyperparameters. The remaining `20%` will be used as the `test` set, to measure how well our model generalizes. This is the strategy we will apply.
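
The strategy can be sketched with scikit-learn as follows. This is only an illustration under assumptions: the data is synthetic, and `LogisticRegression` with its `C` grid is a placeholder, not the model chosen for this project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

SEED = 42

# Synthetic stand-in for the real data (assumption: imbalanced binary target)
X, y = make_classification(
    n_samples=500, n_features=6, weights=[0.8, 0.2], random_state=SEED
)

# 80% train / 20% test, keeping the class ratio in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=SEED
)

# K-Fold CV on the training part only, scored with F1
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
search = GridSearchCV(
    LogisticRegression(max_iter=1000),  # placeholder model
    param_grid={"C": [0.1, 1.0, 10.0]},
    scoring="f1",
    cv=cv,
)
search.fit(X_train, y_train)

print(search.best_params_)
# The held-out 20% is touched only once, to estimate generalization
print(search.score(X_test, y_test))
```

Keeping the `test` set out of the cross-validation loop is what makes the final score an honest estimate.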



We first create a function to import our batch training and test data:

In [None]:
SEED = 42
TEST_SIZE = 0.2

In [1]:
%matplotlib inline
from mlTools import dataLoader, dataSplitter, dataExplorer, dataProcessor

In [2]:
loaderObj = dataLoader()
data = loaderObj.batch_loader("./data/data.parquet",False)

Now that the whole `data` is imported, let's perform the fixed split we mentioned before:

In [None]:
splitterObj = dataSplitter(data)
train,test = splitterObj.train_splitter("Churn", TEST_SIZE, SEED, True)

Our data is now imported and split as `train` and `test`, both as a _pandas DataFrame_. 

Now, we will perform an Exploratory Data Analysis on the `train`.

## 3. EDA

In this step of the Pipeline we will get a first feel for our data: we will learn what features are present in our `train` set, check their types and see how their values are distributed.

This step is fundamental to set a strategy around the next step, Data Processing.

We will start with the basics. It's always useful to use the `head()` method to see what kind of features and values to expect. 

In [None]:
categorical = ["order_channel","branch_id"]
numerical = ["product_id", "client_id", "sales_net", "quantity"]

In [None]:
explorerObj = dataExplorer(train, categorical, numerical)
explorerObj.basic_explorer("Churn")

In [None]:
# Preview the first rows of the training set
train.head()

First, we see that we only have 8 features. Some are Categorical, others Numerical. We must cast both date columns (`date_order` and `date_invoice`) to a datetime type, and from them we might derive interesting new features.

On their own, 8 features are unlikely to be sufficient for a robust Machine Learning model, which is why new features will be generated later on.

Now, let's take a first look at some basic statistics of how the data is distributed. 

For that, we will use the `describe()` method. 

In [None]:
train.describe()

The method returns some of the statistics behind a Boxplot, such as the quartiles. It also gives the mean and standard deviation of each numerical feature.

From this we can get a sense of the range of values each feature takes. We can also check for possible Outliers. 

Possible outliers may reside in the features `sales_net` and `quantity`, as their maximum values fall far outside the IQR. In the following steps, we will analyze this possibility further.

In [None]:
import matplotlib.pyplot as plt

SMALL_SIZE = 24
MEDIUM_SIZE = 32
BIGGER_SIZE = 48

plt.rc('font', size=SMALL_SIZE)          # controls default text sizes
plt.rc('axes', titlesize=SMALL_SIZE)     # fontsize of the axes title
plt.rc('axes', labelsize=MEDIUM_SIZE)    # fontsize of the x and y labels
plt.rc('xtick', labelsize=SMALL_SIZE)    # fontsize of the tick labels
plt.rc('ytick', labelsize=SMALL_SIZE)    # fontsize of the tick labels
plt.rc('legend', fontsize=SMALL_SIZE)    # legend fontsize
plt.rc('figure', titlesize=BIGGER_SIZE)  # fontsize of the figure title


fig = plt.figure(figsize =(20, 20))
ax = fig.add_subplot()
ax.boxplot([train["sales_net"],train["quantity"]], labels =["sales_net","quantity"])
ax.set_title('Outliers')
ax.set_xlabel('Feature')
ax.set_ylabel('Value')
plt.show()
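
To complement the boxplot, the IQR rule it draws can also be applied numerically. The sketch below uses made-up values in place of `sales_net`; the helper `iqr_outlier_mask` and the factor `k = 1.5` are illustrative assumptions.

```python
import pandas as pd


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Toy values standing in for `sales_net` (assumption, not real data)
toy = pd.Series([10.0, 12.0, 11.0, 13.0, 12.5, 11.5, 500.0])
mask = iqr_outlier_mask(toy)
print(toy[mask])
```

Counting `mask.sum()` per feature gives a quick idea of how many rows an outlier treatment would affect.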

To check the number of rows per attribute, as well as the type of each one, the `info()` method comes in quite handy. 

In [None]:
train.info()

As the size of the `dataset` is significant, we can apply some quick optimizations to the types assigned to each column. This makes loading and processing faster and more memory-efficient, without losing data as long as the values fit in the smaller types.

In [None]:
import numpy as np
import pandas as pd

def mem_Optimizer(df):
    # Downcast the numeric columns to smaller types to save memory
    df.product_id = df.product_id.astype(np.int32)
    df.client_id = df.client_id.astype(np.int32)
    df.quantity = df.quantity.astype(np.int32)
    df.branch_id = df.branch_id.astype(np.int16)
    df.sales_net = df.sales_net.astype(np.float32)
    # Parse the date columns
    df['date_order'] = pd.to_datetime(df['date_order'], format='%Y-%m-%d')
    df['date_invoice'] = pd.to_datetime(df['date_invoice'], format='%Y-%m-%d')
    # Store the low-cardinality text column as a category
    df.order_channel = df.order_channel.astype('category')
    # Derive calendar features from the order date
    df['data_order_year'] = df['date_order'].dt.year
    df['data_order_month'] = df['date_order'].dt.month_name()
    df['data_order_dayofmonth'] = df['date_order'].dt.day
    df['data_order_dayofweek'] = df['date_order'].dt.day_name()
    return df

In [None]:
train = mem_Optimizer(train)

Let's see the effect of those changes:

In [None]:
train.info()

Nice! That's more than a 40% decrease in memory usage! This will make the subsequent analysis far more efficient.
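
The saving can also be measured directly rather than read off `info()`. A minimal sketch on a toy frame (the columns and sizes are assumptions standing in for `train`):

```python
import numpy as np
import pandas as pd

# Toy frame with default 64-bit types (assumption: stands in for `train`)
toy = pd.DataFrame({
    "client_id": np.arange(1000, dtype=np.int64),
    "sales_net": np.linspace(0.0, 100.0, 1000),
})

before = toy.memory_usage(deep=True).sum()
toy = toy.astype({"client_id": np.int32, "sales_net": np.float32})
after = toy.memory_usage(deep=True).sum()

print(f"{before} -> {after} bytes ({1 - after / before:.0%} saved)")
```

`deep=True` also accounts for the contents of object columns, which is where string data usually hides most of its memory.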



In [None]:
def extract_dates(df):
    # Derive calendar features from the order date
    df['data_order_year'] = df['date_order'].dt.year
    df['data_order_month'] = df['date_order'].dt.month_name()
    df['data_order_dayofmonth'] = df['date_order'].dt.day
    df['data_order_dayofweek'] = df['date_order'].dt.day_name()
    return df

def data_optimize(df):
    # Downcast the numeric columns and parse the dates to reduce memory usage
    df.product_id = df.product_id.astype(np.int32)
    df.client_id = df.client_id.astype(np.int32)
    df.quantity = df.quantity.astype(np.int32)
    df.branch_id = df.branch_id.astype(np.int16)
    df.sales_net = df.sales_net.astype(np.float32)
    df['date_order'] = pd.to_datetime(df['date_order'], format='%Y-%m-%d')
    df['date_invoice'] = pd.to_datetime(df['date_invoice'], format='%Y-%m-%d')
    df.order_channel = df.order_channel.astype('category')
    return df

date_features = ['data_order_month', 'data_order_year', 'data_order_dayofweek', 'data_order_dayofmonth']

train = (
    train
    .pipe(data_optimize)
    .pipe(extract_dates)
)
# Store the derived calendar features as categories
train[date_features] = train[date_features].astype('category')

In [None]:
train.info()

This confirms the types we analyzed before.

But wait a second...


60+ million rows?

That's a massive dataset! 


We may need to prepare ourselves to perform some sampling in the next steps of the EDA, or find more efficient ways to import such a dataset.
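
Two ways such sampling could look, sketched on a toy frame (the data and the 10% fraction are assumptions): sampling rows directly, or sampling clients and keeping all of their rows.

```python
import numpy as np
import pandas as pd

SEED = 42

# Toy frame standing in for `train` (assumption): 100 clients, 10 rows each
toy = pd.DataFrame({
    "client_id": np.repeat(np.arange(100), 10),
    "quantity": np.arange(1000),
})

# Row-level sample: 10% of the rows, reproducible through the seed
row_sample = toy.sample(frac=0.1, random_state=SEED)

# Client-level sample: keep every row of a random 10% of the clients,
# so that each sampled client's purchase history stays complete
clients = pd.Series(toy["client_id"].unique()).sample(frac=0.1, random_state=SEED)
client_sample = toy[toy["client_id"].isin(clients)]

print(len(row_sample), client_sample["client_id"].nunique())
```

For churn, the client-level variant is likely the safer choice, since features built per client need the full history of each one.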

The dataset is so large that `info()` doesn't report the non-null counts, so we cannot read the missing values from it. 

Let's explore them with `isnull()`:

In [None]:
print(train.isnull().value_counts())
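
An alternative view, sketched on a toy frame (made-up values standing in for `train`), is to count missing values per column, or to force `info()` to print the non-null counts with `show_counts=True`:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `train` (assumption)
toy = pd.DataFrame({
    "date_order": ["2019-01-02", "2019-01-03", "2019-01-04"],
    "date_invoice": ["2019-01-02", np.nan, "2019-01-05"],
})

# Missing values per column
missing = toy.isnull().sum()
print(missing)

# Rows that contain at least one missing value
print(toy[toy.isnull().any(axis=1)])

# Force `info()` to print non-null counts, even on very large frames
toy.info(show_counts=True)
```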

Fortunately, there is only a single `NaN` value, located in `date_invoice`. This could have several causes:

* Invalid billing information
* Payment blocked by the bank
* Others

However, we won't dwell on it. We can safely drop this single missing value: one row out of the whole dataset is not relevant enough to matter.

---

_Summary_

* Missing Values: `date_invoice` -> `Drop`





We can also take a look at the categorical features, to check the different values we have.

This step is important to plan a strategy for the later `Encoding`.

In [None]:
print(train["order_channel"].value_counts(), train["branch_id"].value_counts())

We can easily see that `One Hot Encoding` should be the way to go for `order_channel`, as the increase in the number of features won't be too heavy.

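This is not the case for `branch_id`, where one column per distinct value would be too heavy. We might be tempted to implement `Target Encoding` instead; since the data set includes the `Churn` labels, it would have to be fitted on `train` only to avoid leaking the target.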
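
To make the cardinality argument concrete, the number of dummy columns each feature would add can be counted before choosing an encoder. A minimal sketch on toy data (all values are made up):

```python
import pandas as pd

# Toy frame standing in for `train` (values are made up)
toy = pd.DataFrame({
    "order_channel": ["online", "phone", "store", "online", "store", "phone"],
    "branch_id": [101, 102, 103, 104, 105, 106],
})

# Number of distinct values = number of dummy columns One Hot Encoding would add
cardinality = toy[["order_channel", "branch_id"]].nunique()
print(cardinality)

# Encoding only the low-cardinality column keeps the frame narrow
encoded = pd.get_dummies(toy, columns=["order_channel"])
print(encoded.columns.tolist())
```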

Next, let's have a look at the correlations. 

As we currently have both `numerical` and `categorical` features, we can analyze the correlations (both linear and non-linear) with the `phik_matrix()` method:

In [None]:
import phik
corr_matrix = train.phik_matrix()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(figsize=(50, 50))
sns.heatmap(corr_matrix, annot=True, linewidths=.5, ax=ax)

We shouldn't fool ourselves: the correlations don't look promising for establishing a strategy. Little to no correlation is observed between the default features. With `Feature Generation` we might enrich the set, and the correlations.

NOTE: These are correlations between features only; they have not yet been compared against the `Churn` target!

We can, at least, see if there is any interesting interaction between the `numerical` features we described previously:

In [None]:
from pandas.plotting import scatter_matrix

numerical = ["product_id", "client_id", "sales_net", "quantity"]
scatter_matrix(train[numerical], figsize=(24, 24));

From the `scatter_matrix()` we can see not only the distribution of each feature, but also the interactions between them.

To point out whether a customer may churn or not, we may rely on distance-based algorithms, such as `Clustering`. 

As those algorithms are based on the distance between instances, we must scale our data. There are no special requirements regarding `anomalous` data and, therefore, a simple `StandardScaler()` should be enough.


---

_Summary_

* Numerical Features: `["product_id", "client_id", "sales_net", "quantity"]`
* Categorical Features: `["order_channel", "branch_id"]`
* Irrelevant Features: `NaN`
* Missing Values: `["date_invoice"]` -> `Drop instance`
* Possible Outliers: `["sales_net", "quantity"]`
* One Hot Encoding: `["order_channel"]`
* Feature Extraction: 
    + **`Numerical`**: `["date_order", "date_invoice"]`
    + **`Categorical`**
* Feature Generation: 
    + **`Numerical`**: `["day", "month", "year", "week", "payment_delay"]`
    + **`Categorical`**: `["is_weekend", "is_bfriday", "is_christmas"]`
* Scaling: `StandardScaler()`

## 4. Data Preprocessing

So, we've seen a lot in this `train` set. We saw the number of features, how they are distributed and the possible combinations that might be useful later on.

Now it's time to take action, and implement the strategies we planned in the previous step!

Let's go on with the Data Cleaning checklist:

### _Missing Values_



In [None]:
# Drop the rows with a missing invoice date, keeping every column
train = train.dropna(subset=["date_invoice"])

### _Outliers_



### _One Hot Encoding_



In [None]:
# get_dummies replaces `order_channel` with its dummy columns, so no extra drop is needed
train = pd.get_dummies(train, columns=["order_channel"])

In [None]:
train.head()

### _Feature Extraction_


### _Feature Generation_


In [None]:
import re

# Matches a date in YYYY-MM-DD format
date_pattern = r'([0-9]{4}-[0-9]{2}-[0-9]{2})'

In [None]:
# Keep only the YYYY-MM-DD part of each invoice date (vectorised, no row loop)
train["date_invoice"] = train["date_invoice"].astype(str).str.extract(date_pattern, expand=False)
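
Some of the generated features listed in the EDA summary could be derived along these lines. This is a sketch under assumptions: `generate_date_features` is a hypothetical helper, `payment_delay` is taken to be the number of days between order and invoice, and both date columns are assumed to be already parsed as datetimes.

```python
import pandas as pd


def generate_date_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add date-derived features; expects datetime `date_order` and `date_invoice`."""
    out = df.copy()
    # ISO week number of the order date
    out["week"] = out["date_order"].dt.isocalendar().week.astype("int32")
    # Days elapsed between the order and its invoice (assumed definition)
    out["payment_delay"] = (out["date_invoice"] - out["date_order"]).dt.days
    # Monday=0 ... Sunday=6, so 5 and 6 are the weekend
    out["is_weekend"] = out["date_order"].dt.dayofweek >= 5
    return out


# Toy rows (made up): a Saturday order and a Monday order
toy = pd.DataFrame({
    "date_order": pd.to_datetime(["2019-01-05", "2019-01-07"]),
    "date_invoice": pd.to_datetime(["2019-01-08", "2019-01-07"]),
})
features = generate_date_features(toy)
print(features)
```

Flags such as `is_bfriday` or `is_christmas` would follow the same pattern, once the exact date ranges they cover are decided.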

### _Scaling_

In [None]:
numerical = ["product_id", "client_id", "sales_net", "quantity"]

In [None]:
from sklearn.preprocessing import StandardScaler

def numericalScaler(train, numerical):
    '''
    Scales the numerical features with a StandardScaler()
    '''
    scaler = StandardScaler()
    train[numerical] = scaler.fit_transform(train[numerical])
    return train

In [None]:
train = numericalScaler(train, numerical) 
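
When the `test` set is scaled later, the statistics learned on `train` should be reused rather than refitted. A minimal sketch on toy frames (the values and the names `train_toy` and `test_toy` are assumptions):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

numerical = ["sales_net", "quantity"]

# Toy frames standing in for `train` and `test` (values are made up)
train_toy = pd.DataFrame({"sales_net": [10.0, 20.0, 30.0], "quantity": [1.0, 2.0, 3.0]})
test_toy = pd.DataFrame({"sales_net": [20.0, 40.0], "quantity": [2.0, 4.0]})

scaler = StandardScaler()
# Learn mean and standard deviation from the training data only
train_toy[numerical] = scaler.fit_transform(train_toy[numerical])
# Reuse the fitted statistics on the test data (no refit)
test_toy[numerical] = scaler.transform(test_toy[numerical])

print(test_toy)
```

Returning the fitted scaler from `numericalScaler` alongside the frame would be one way to make this reuse possible.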

## 5. Model Training

## 6. Model Testing

## 7. Model Deployment