# BCG & PowerCo - Churn Model
## RandomForest Model w/ TuriCreate

**Author:** Ingrid Cadu<br>
**Last update:** May 26, 2022<br>
<br>
This notebook contains the model that evaluates churn among PowerCo customers. As requested, it also checks the hypothesis that a 20% discount affects customers predicted as likely to churn.<br>

**Background**<br>
- The SME team suggestions:
    1. Feature engineering is one of the keys to unlocking predictive insight through mathematical modeling. Based on the data that is available and was cleaned, identify what you think could be drivers of churn for our client and build those features to later use in your model.

    2. First focus on building on top of the feature that your colleague has already investigated: “the difference between off-peak prices in December and January the preceding year”. After this, if you have time, feel free to get creative with making any other features that you feel are worthwhile.

    3. Once you have a set of features, you must train a Random Forest classifier to predict customer churn and evaluate the performance of the model with suitable evaluation metrics. Be rigorous with your approach and give full justification for any decisions made by yourself as the intern data scientist.

- **Extra Task:**<br>
    Recall that the hypothesis under consideration is that churn is driven by the customers’ price sensitivities and that it would be possible to predict customers likely to churn using a predictive model. If you’re eager to go the extra mile for the client, when you have a trained predictive model, remember to investigate the client’s proposed discounting strategy, with the head of the SME division suggesting that offering customers at high propensity to churn a 20% discount might be effective.

**Prior Objective**<br>
Build your models and test them while keeping in mind you would need data to prove/disprove the hypotheses, as well as to test the effect of a 20% discount on customers at high propensity to churn.

# Imports and Data

In [83]:
import turicreate

In [84]:
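# Load the cleaned client data and the off-peak price features; read `churn` as an integer label.
data = turicreate.SFrame.read_csv("./client_data.csv", column_type_hints={'churn':int})
price = turicreate.SFrame.read_csv("./Price_off_Peak.csv", column_type_hints={'churn':int})
# `id` is kept because it is the join key used later. Note that models created
# without an explicit `features=` list will therefore also use `id` as a feature.
#del data['id']
# Drop the label from the price table so that `churn` comes only from `data`.
del price['churn']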

Data will not be explored here.
For more details, see the notebook <u>'Churn Analysis.ipynb'</u>.

# Class Weight Calculation

In [85]:
# Balanced weight for class 0 (no churn): n_samples / (n_classes * n_class_0)
len(data)/(2*len(data[data['churn']==0]))

0.5538029877910063
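
The value above follows the common "balanced" weighting rule, weight_c = n_samples / (n_classes * n_c). A minimal sketch in plain Python, assuming the class counts implied by the full-data confusion matrix reported later in this notebook (13,187 non-churn and 1,419 churn customers); `balanced_class_weights` is a hypothetical helper, not a TuriCreate function:

```python
from collections import Counter

def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * n_class)} for each class in labels."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

# Class counts assumed from the full-data confusion matrix in this notebook
labels = [0] * 13187 + [1] * 1419
weights = balanced_class_weights(labels)
print(weights)
```

Under this rule the churn class would receive a weight of about 5.15 against 0.55 for non-churn, a much stronger correction than the `{0:0.4, 1:0.6}` weights passed to the models in this notebook.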

In [86]:
data['churn'][:2]

dtype: int
Rows: 2
[1, 0]

# Model Auto-Selection - Classifier

In [87]:
# 80/20 train/test split. No seed is passed, so the split and all metrics below vary between runs.
train, test = data.random_split(0.8)

In [88]:
# Selects the best model based on your data.
model = turicreate.classifier.create(train, target='churn')

# Make predictions and evaluate results.
predictions = model.classify(test)
results = model.evaluate(test)
results

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: The following methods are available for this type of problem.
PROGRESS: LogisticClassifier, SVMClassifier
PROGRESS: The returned model will be chosen according to validation accuracy.


PROGRESS: Model selection based on validation accuracy:
PROGRESS: ---------------------------------------------
PROGRESS: LogisticClassifier              : 0.8826530612244898
PROGRESS: SVMClassifier                   : 0.8775510204081632
PROGRESS: ---------------------------------------------
PROGRESS: Selecting LogisticClassifier based on validation set performance.


{'accuracy': 0.8904494382022472,
 'auc': 0.5388135569648422,
 'confusion_matrix': Columns:
 	target_label	int
 	predicted_label	int
 	count	int
 
 Rows: 4
 
 Data:
 +--------------+-----------------+-------+
 | target_label | predicted_label | count |
 +--------------+-----------------+-------+
 |      0       |        1        |   38  |
 |      1       |        1        |   6   |
 |      1       |        0        |  274  |
 |      0       |        0        |  2530 |
 +--------------+-----------------+-------+
 [4 rows x 3 columns],
 'f1_score': 0.037037037037037035,
 'log_loss': 0.8012956238949493,
 'precision': 0.13636363636363635,
 'recall': 0.02142857142857143,
 'roc_curve': Columns:
 	threshold	float
 	fpr	float
 	tpr	float
 	p	int
 	n	int
 
 Rows: 1001
 
 Data:
 +-----------+---------------------+---------------------+-----+------+
 | threshold |         fpr         |         tpr         |  p  |  n   |
 +-----------+---------------------+---------------------+-----+------+
 |    
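
The headline accuracy above hides how few churners are caught. A minimal sketch that recomputes the reported metrics from the confusion-matrix counts and compares accuracy with a majority-class baseline; `binary_metrics` is a hypothetical helper written for illustration, not part of TuriCreate:

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1_score": f1,
        # Accuracy obtained by always predicting the majority class
        "majority_baseline": max(tp + fn, fp + tn) / total,
    }

# Counts from the confusion matrix above (positive class = churn = 1)
m = binary_metrics(tp=6, fp=38, fn=274, tn=2530)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

With these counts, always predicting "no churn" would score about 90.2% accuracy, slightly above the model's 89.0%, which is why recall, F1 and AUC are more informative than accuracy for this imbalanced target.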

# Logistic Classifier

In [89]:
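# Build the model; class 1 (churn) gets a higher weight than class 0
model_LC = turicreate.logistic_classifier.create(train, target='churn', 
                                                 class_weights={0:0.4,1:0.6})
# Make predictions and evaluate results.
# Note: this scores the full dataset, which includes the training rows,
# so the metrics below are optimistic compared with a held-out evaluation.
pred_LC = model_LC.classify(data)
res_LC = model_LC.evaluate(data)
res_LC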

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.



{'accuracy': 0.974736409694646,
 'auc': 0.9197153880113312,
 'confusion_matrix': Columns:
 	target_label	int
 	predicted_label	int
 	count	int
 
 Rows: 4
 
 Data:
 +--------------+-----------------+-------+
 | target_label | predicted_label | count |
 +--------------+-----------------+-------+
 |      0       |        1        |   35  |
 |      1       |        1        |  1085 |
 |      0       |        0        | 13152 |
 |      1       |        0        |  334  |
 +--------------+-----------------+-------+
 [4 rows x 3 columns],
 'f1_score': 0.8546671918077984,
 'log_loss': 0.18081756825173828,
 'precision': 0.96875,
 'recall': 0.7646229739252995,
 'roc_curve': Columns:
 	threshold	float
 	fpr	float
 	tpr	float
 	p	int
 	n	int
 
 Rows: 1001
 
 Data:
 +-----------+----------------------+--------------------+------+-------+
 | threshold |         fpr          |        tpr         |  p   |   n   |
 +-----------+----------------------+--------------------+------+-------+
 |    0.0    | 

In [90]:
model_LC.summary()

Class                          : LogisticClassifier

Schema
------
Number of coefficients         : 15591
Number of examples             : 11170
Number of classes              : 2
Number of feature columns      : 25
Number of unpacked features    : 25

Hyperparameters
---------------
L1 penalty                     : 0.0
L2 penalty                     : 0.01

Training Summary
----------------
Solver                         : lbfgs
Solver iterations              : 10
Solver status                  : Completed (Iteration limit reached).
Training time (sec)            : 1.1503

Settings
--------
Log-likelihood                 : 2.7824

Highest Positive Coefficients
-----------------------------
id[250962d5c410844048e0155a6e16c1d1] : 29.3328
date_end[2017-01-28]           : 29.3328
date_modif_prod[2016-01-28]    : 29.3328
id[5c861467b8e977066bc0b286f9e789a2] : 28.5068
id[16191ae9f4a669d7ed19f8263e0c7e66] : 26.3958

Lowest Negative Coefficients
----------------------------
id[66abcf2905cd0d9
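
The largest coefficients above belong to individual `id` values and specific dates, which suggests the model may be memorising customers rather than learning general churn drivers. A minimal sketch of one way to exclude such columns, assuming the `features` argument of `turicreate.logistic_classifier.create`; `select_features` and the short column list are hypothetical and for illustration only:

```python
def select_features(columns, target, exclude=()):
    """Return the columns usable as features: everything except the target and excluded names."""
    blocked = set(exclude) | {target}
    return [c for c in columns if c not in blocked]

# Illustrative subset of column names seen in this notebook;
# in practice the full list would come from data.column_names()
columns = ["id", "date_end", "date_modif_prod", "cons_last_month", "cons_gas_12m", "churn"]
features = select_features(columns, target="churn", exclude=["id"])
print(features)

# Assumed usage with TuriCreate (not executed here):
# model_LC = turicreate.logistic_classifier.create(
#     train, target="churn", features=features, class_weights={0: 0.4, 1: 0.6})
```

Keeping `id` in the SFrame but out of the feature list preserves it as a join key for the price table.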

In [91]:
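# Predicted churn probability from the logistic classifier, stored as a new feature

LC = model_LC.predict(data, output_type='probability')
# Dividing by 0.001 rescales the probability from [0, 1] to [0, 1000]
data['prob_LC'] = LC/0.001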
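
Because `prob_LC` comes from a model that was fitted on most of the rows it scores, a downstream model may receive an over-confident feature. One common alternative, not used in this notebook, is to build the feature from out-of-fold predictions. A minimal sketch in plain Python of the fold bookkeeping only; `k_fold_indices` is a hypothetical helper, and the model fitting is left as comments:

```python
import random

def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_idx, holdout_idx) pairs that partition range(n_rows) into k folds."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, holdout in enumerate(folds):
        train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train, holdout

n_rows = 10
covered = []
for train_idx, holdout_idx in k_fold_indices(n_rows, k=5, seed=0):
    # Fit the first-stage model on train_idx only, then predict the churn
    # probability for holdout_idx and store it as the stacked feature.
    covered.extend(holdout_idx)

# Every row receives exactly one out-of-fold prediction
print(sorted(covered) == list(range(n_rows)))
```

Each row's feature then comes from a model that never saw that row during training.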

# SVM Classifier

In [92]:
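# New train and test sets (now including `prob_LC`). This split is independent of the one
# used to fit model_LC, so some test rows were seen by model_LC during training.
train, test = data.random_split(0.8)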

#Building the model
model_SVM = turicreate.svm_classifier.create(train, target='churn',
                                            features=['prob_LC','cons_last_month','cons_gas_12m'],
                                            class_weights={0:0.4, 1:0.6})

# Make predictions and evaluate results.
pred_SVM = model_SVM.classify(test)
res_SVM = model_SVM.evaluate(test)
res_SVM

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.



{'accuracy': 0.974914089347079, 'confusion_matrix': Columns:
 	target_label	int
 	predicted_label	int
 	count	int
 
 Rows: 4
 
 Data:
 +--------------+-----------------+-------+
 | target_label | predicted_label | count |
 +--------------+-----------------+-------+
 |      0       |        1        |   5   |
 |      1       |        1        |  207  |
 |      0       |        0        |  2630 |
 |      1       |        0        |   68  |
 +--------------+-----------------+-------+
 [4 rows x 3 columns], 'f1_score': 0.8501026694045174, 'precision': 0.9764150943396226, 'recall': 0.7527272727272727}

In [93]:
model_SVM.summary()

Class                          : SVMClassifier

Schema
------
Number of coefficients         : 4
Number of examples             : 11111
Number of classes              : 2
Number of feature columns      : 3
Number of unpacked features    : 3

Hyperparameters
---------------
Mis-classification penalty     : 1.0

Training Summary
----------------
Solver                         : lbfgs
Solver iterations              : 10
Solver status                  : Completed (Iteration limit reached).
Training time (sec)            : 0.2578

Settings
--------
Train Loss                     : 349.1326

Highest Positive Coefficients
-----------------------------
prob_LC                        : 0.0023

Lowest Negative Coefficients
----------------------------
(intercept)                    : -1.0722
cons_last_month                : -0.0
cons_gas_12m                   : -0.0



# Predict SVM

In [95]:
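# Keep only the customers who have not churned
basic = data[data['churn']==0]

# Predict the class for each of them with the SVM
basic['predict_svm'] = model_SVM.predict(basic, output_type='class')

# Select the non-churned customers that the SVM flags as likely to churn
churn_prob_yes = basic[basic['predict_svm']==1]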

In [101]:
# Left-join the off-peak price features onto the flagged customers
df = churn_prob_yes.join(price, on='id', how='left')

len(df)

35

# Conclusion

***Observations:***<br>
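- The automatic classifier selection compared two algorithms:
    - Logistic Classifier
    - SVM
- In the run shown above, the auto-selected Logistic Classifier achieved an accuracy of 89% and an AUC of 0.54 on the test split. Refit with class weights and evaluated on the full dataset, it achieved an accuracy of 97% and an AUC of 0.92.
- The SVM achieved an accuracy of 97% on the test split.
- Because the splits are unseeded, these values change between runs.
- My approach: use the probabilities predicted by the Logistic Classifier as a feature in the SVM Classifier.
- The SVM model has:
    - 'f1_score': 0.8501026694045174, 
    - 'precision': 0.9764150943396226, 
    - 'recall': 0.7527272727272727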


**Insight**:<br>
- Consumption variation carried considerable weight in the predictions, rather than price as the managers expected. The next analysis should request more historical consumption data.
- The SVM flagged 35 customers who have not yet churned as likely to churn.
- Inferential statistics identified almost 4x more people.