# Predictive Model
## Import libraries

In [107]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder, Normalizer, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score

In [114]:
df = pd.read_csv('cleaned_and_engineered_data.csv')
df.head()

Unnamed: 0,id,channel_sales,cons_12m,cons_gas_12m,cons_last_month,date_activ,date_end,date_modif_prod,date_renewal,forecast_cons_12m,...,peak_mid_peak_var_max_monthly_diff,off_peak_mid_peak_var_max_monthly_diff,off_peak_peak_fix_max_monthly_diff,peak_mid_peak_fix_max_monthly_diff,off_peak_mid_peak_fix_max_monthly_diff,tenure,months_activ,months_to_end,months_modif_prod,months_renewal
0,24011ae4ebbe3035111d65fa7c15bc57,foosdfpfkusacimwkcsosbicdxkicaua,0.0,4.739944,0.0,2013-06-15,2016-06-15,2015-11-01,2015-06-23,0.0,...,0.034219,0.058257,18.590255,7.45067,26.040925,3,30,5,2,6
1,d29c2c54acc38ff3c0614d0a653813dd,MISSING,3.668479,0.0,0.0,2009-08-21,2016-08-30,2009-08-21,2015-08-31,2.28092,...,0.007124,0.149609,44.311375,0.0,44.311375,7,76,7,76,4
2,764c75f661154dac3a6c254cd082ea7d,foosdfpfkusacimwkcsosbicdxkicaua,2.736397,0.0,0.0,2010-04-16,2016-04-16,2010-04-16,2015-04-17,1.689841,...,0.088421,0.170512,44.38545,0.0,44.38545,6,68,3,68,8
3,bba03439a292a1e166f80264c16191cb,lmkebamcaaclubfxadlmueccxoimlema,3.200029,0.0,0.0,2010-03-30,2016-03-30,2010-03-30,2015-03-31,2.382089,...,0.0,0.15121,44.400265,0.0,44.400265,6,69,2,69,9
4,149d57cf92fc41cf94415803a877cb4b,MISSING,3.646011,0.0,2.721811,2010-01-13,2016-03-07,2010-01-13,2015-03-09,2.650065,...,0.030773,0.051309,16.275263,8.137629,24.412893,6,71,2,71,9


In [116]:
df.columns

Index(['id', 'channel_sales', 'cons_12m', 'cons_gas_12m', 'cons_last_month',
       'date_activ', 'date_end', 'date_modif_prod', 'date_renewal',
       'forecast_cons_12m', 'forecast_cons_year', 'forecast_discount_energy',
       'forecast_meter_rent_12m', 'forecast_price_energy_off_peak',
       'forecast_price_energy_peak', 'forecast_price_pow_off_peak', 'has_gas',
       'imp_cons', 'margin_gross_pow_ele', 'margin_net_pow_ele', 'nb_prod_act',
       'net_margin', 'num_years_antig', 'origin_up', 'pow_max',
       'mean_year_price_p1_var', 'mean_year_price_p2_var',
       'mean_year_price_p3_var', 'mean_year_price_p1_fix',
       'mean_year_price_p2_fix', 'mean_year_price_p3_fix',
       'mean_year_price_p1', 'mean_year_price_p2', 'mean_year_price_p3',
       'mean_6m_price_p1_var', 'mean_6m_price_p2_var', 'mean_6m_price_p3_var',
       'mean_6m_price_p1_fix', 'mean_6m_price_p2_fix', 'mean_6m_price_p3_fix',
       'mean_6m_price_p1', 'mean_6m_price_p2', 'mean_6m_price_p3',
       'mea

In [117]:
# Features: drop the raw date columns, the client id and the target
X = df.drop(["date_activ", "date_end", "date_modif_prod", "date_renewal", "id", "churn"], axis=1)
# Target: the churn label
Y = df['churn']
# Numeric columns that will be rescaled before training
cols_norm = ['cons_12m', 'cons_gas_12m', 'cons_last_month', 'forecast_cons_12m', 'forecast_cons_year',
             'forecast_discount_energy', 'forecast_meter_rent_12m', 'forecast_price_pow_off_peak', 'imp_cons',
             'margin_gross_pow_ele', 'margin_net_pow_ele', 'net_margin', 'pow_max', 'modif_after_activ']

## Model Design
I created a function that trains the model and prints out its performance. To measure that performance I'll be using the precision, recall, F1 and accuracy scores.

For our prediction model, I'll be using the Random Forest Classifier, which is an `ensemble` algorithm: the `Forest` refers to a collection of `Decision Trees`, which are tree-based learning algorithms.
Some advantages of the random forest classifier include:

- The random forest uses a rule-based approach instead of a distance calculation, so features do not need to be scaled
- It is able to handle non-linear relationships better than linear models

On the flip side, some disadvantages of the random forest classifier include:

- The computational power needed to train a random forest on a large dataset is high, since we need to build a whole ensemble of estimators.
- Training time can be longer due to the increased complexity and size of the ensemble.

The first thing we want to do is split our dataset into training and test samples. We do this so that we can simulate a real-life situation: the model generates predictions for the test sample without ever having seen those data points during training. This lets us see how well the model generalises to new data, which is critical.

A typical share of the data to dedicate to testing is 20-30%. For this example we will use an 80-20 split between train and test respectively.
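As a small illustration of what an 80-20 split produces, here is a sketch on made-up data (the arrays `X_demo` and `y_demo` are hypothetical and are not the churn dataset). It also passes `stratify`, an optional `train_test_split` argument that the experiment in this notebook does not use, to show how the share of positive labels can be kept the same in both samples.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 2 features, 10% positive labels
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 10 + [0] * 90)

# 80-20 split; stratify keeps the positive rate equal in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)    # row counts of the two samples
print(y_tr.mean(), y_te.mean())  # share of positives in each sample
```

Whether stratification changes the results on the real data is not something this sketch can show; it only demonstrates the mechanics of the split.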


In [118]:
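def run_experiment(X, Y):
    # Hold out 20% of the rows as a test set; the fixed seed makes the split reproducible
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.20, random_state=42)
    # Random forest with 1000 trees
    model = RandomForestClassifier(n_estimators=1000)
    model.fit(X_train, y_train)
    # Predict on rows the model has not seen during training
    y_pred = model.predict(X_test)
    # Precision, recall and F1 are reported for the positive class (label 1)
    print('Precision: %.3f' % precision_score(y_test, y_pred))
    print('Recall: %.3f' % recall_score(y_test, y_pred))
    print('F1: %.3f' % f1_score(y_test, y_pred))
    print('Accuracy: %.3f' % accuracy_score(y_test, y_pred))
    print('-' * 20)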

I'll also encode categorical features as numbers, since our model cannot accept categorical or `string` values.

In [122]:
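def transform_categorical(X):
    '''
    Encodes every object-typed (string) column as integer codes.
    Returns a copy; the input frame is left unchanged.
    '''
    X_copy = X.copy()
    # Boolean mask of the columns whose dtype is object
    categories = (X_copy.dtypes == "object")
    cat_cols = list(categories[categories].index)
    label_encoder = LabelEncoder()
    for col in cat_cols:
        # fit_transform refits the encoder for each column
        X_copy[col] = label_encoder.fit_transform(X[col])
    return X_copy

def normalize_data(X, cols_to_norm):
    '''
    Rescales the given columns to the [0, 1] range and returns a copy.
    '''
    scaler = MinMaxScaler()
    X_copy = X.copy()
    # Note: the scaler is fit on the full dataset, before the train/test split
    X_copy[cols_to_norm] = scaler.fit_transform(X[cols_to_norm])
    return X_copy

# Encode, rescale, then train and evaluate
x_2 = transform_categorical(X)
x_2 = normalize_data(x_2, cols_norm)
run_experiment(x_2, Y)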

Precision: 0.895
Recall: 0.057
F1: 0.107
Accuracy: 0.903
--------------------
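
To make the encoding step in `transform_categorical` more concrete, here is a minimal sketch of what `LabelEncoder` does to a single column. The values in `channel` are hypothetical and shortened for readability; they are not real `channel_sales` codes.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical category values (assumption, not taken from the dataset)
channel = ["web", "MISSING", "phone", "web"]

encoder = LabelEncoder()
codes = encoder.fit_transform(channel)

# classes_ holds the sorted unique values; each code is an index into it
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(codes.tolist())
```

The integer codes follow the sorted order of the labels, so they imply an ordering between categories that does not really exist. The scikit-learn documentation describes `LabelEncoder` as intended for target values, and `OrdinalEncoder` or `OneHotEncoder` are the usual alternatives for input features; whether switching would change the results here is untested.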


### Evaluation:

We are going to use 3 metrics to evaluate performance:

- Accuracy = the ratio of correctly predicted observations to the total observations
- Precision = the ability of the classifier to not label a negative sample as positive
- Recall = the ability of the classifier to find all the positive samples
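
As a worked example of these definitions, the sketch below computes the three metrics by hand from the counts of true/false positives and negatives. The labels are made up for illustration, and `classification_metrics` is a hypothetical helper, not part of scikit-learn.

```python
def classification_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # churners correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-churners wrongly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # churners missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-churners correctly passed
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Made-up labels: 4 churners, of which the model finds only 1
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

accuracy, precision, recall = classification_metrics(y_true, y_pred)
print(accuracy, precision, recall)  # 0.7 1.0 0.25
```

In this toy case the precision is perfect because the single positive prediction is correct, while the recall is low because three of the four churners are missed.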

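We use these three metrics because accuracy alone is not always a good measure, particularly when one class is much more common than the other. The F1 score printed above is the harmonic mean of precision and recall.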

- The accuracy score is high, but on its own it is very misleading and does not tell us the whole story. Hence the use of precision and recall is important.
- The precision score is 0.895, which is not bad but could be improved.
- However, the recall of 0.057 shows that the classifier has a very poor ability to identify positive samples. Improving this would be the main concern for this model!
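
To see why a high accuracy can be misleading, consider a trivial baseline that predicts "no churn" for every client. The sketch below assumes a hypothetical 10% churn rate; the real rate in this dataset is not stated here.

```python
n_clients = 1000
n_churners = 100  # assumption: 10% of clients churn

y_true = [1] * n_churners + [0] * (n_clients - n_churners)
y_pred = [0] * n_clients  # baseline: never predict churn

# Accuracy counts every correct prediction, churner or not
accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n_clients

# Recall only looks at the actual churners
churners_found = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = churners_found / n_churners

print(accuracy, recall)  # 0.9 0.0
```

Under that assumption, a model that never identifies a single churner still reaches 90% accuracy, which is why recall is the more informative number for this problem.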

So overall, we're able to identify clients that do not churn very accurately, but we are largely unable to predict the cases where clients do churn! A recall of 0.057 means that only about 6% of the actual churners in the test set are flagged, so most churners are being identified as not churning. This suggests that the current set of features is not discriminative enough to clearly distinguish between churners and non-churners.