# BCG & PowerCo - Churn Model
## RandomForest Model w/ TuriCreate

**Author:** Ingrid Cadu<br>
**Last update:** May 26, 2022<br>
<br>
This notebook contains the model that evaluates churn among PowerCo customers. As requested, it also checks the hypothesis about the 20% discount and its effect on customers predicted as likely to churn.<br>

**Background**<br>
- The SME team suggestions:
        1. Feature engineering is one of the keys to unlocking predictive insight through mathematical modeling. Based on the data that is available and was cleaned, identify what you think could be drivers of churn for our client and build those features to later use in your model.

        2. First focus on building on top of the feature that your colleague has already investigated: “the difference between off-peak prices in December and January the preceding year”. After this, if you have time, feel free to get creative with making any other features that you feel are worthwhile.

        3. Once you have a set of features, you must train a Random Forest classifier to predict customer churn and evaluate the performance of the model with suitable evaluation metrics. Be rigorous with your approach and give full justification for any decisions made by yourself as the intern data scientist. 

- **Extra Task:**<br>
    Recall that the hypothesis under consideration is that churn is driven by the customers’ price sensitivities and that it would be possible to predict customers likely to churn using a predictive model. If you’re eager to go the extra mile for the client, when you have a trained predictive model, remember to investigate the client’s proposed discounting strategy, with the head of the SME division suggesting that offering customers at high propensity to churn a 20% discount might be effective.
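
The feature named in suggestion 2 (the off-peak price difference between December and January) could be built along these lines. This is only a minimal sketch in plain Python: the record layout and the field names `id`, `price_date` and `price_off_peak` are assumptions for illustration, not the real PowerCo price schema, and all rows are assumed to belong to the same year.

```python
from collections import defaultdict

# Hypothetical monthly price records; field names are assumptions, not the real schema.
price_rows = [
    {"id": "a", "price_date": "2015-01-01", "price_off_peak": 0.150},
    {"id": "a", "price_date": "2015-12-01", "price_off_peak": 0.146},
    {"id": "b", "price_date": "2015-01-01", "price_off_peak": 0.120},
    {"id": "b", "price_date": "2015-12-01", "price_off_peak": 0.125},
]

def offpeak_dec_jan_diff(rows):
    """Return {customer id: December off-peak price minus January off-peak price}."""
    by_customer = defaultdict(dict)
    for row in rows:
        month = int(row["price_date"][5:7])  # "YYYY-MM-DD" -> month number
        by_customer[row["id"]][month] = row["price_off_peak"]
    return {
        cid: months[12] - months[1]
        for cid, months in by_customer.items()
        if 1 in months and 12 in months  # skip customers missing either month
    }

offpeak_diff = offpeak_dec_jan_diff(price_rows)
```

The resulting dictionary could then be joined to the client data by customer id before training.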

**Prior Objective**<br>
Build your models and test them while keeping in mind you would need data to prove/disprove the hypotheses, as well as to test the effect of a 20% discount on customers at high propensity to churn.
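
One possible way to frame the discount test is as an expected-revenue comparison. The sketch below is illustrative only: the customer values are made up, and `retention_uplift` (how much the discount is assumed to lower a customer's churn probability) is a hypothetical parameter that would have to be estimated separately, since nothing in this notebook measures it.

```python
def expected_revenue(customers, threshold=0.5, discount=0.20, retention_uplift=0.0):
    """Expected revenue when customers at or above `threshold` receive `discount`.

    `retention_uplift` is the assumed relative drop in churn probability for
    discounted customers (0.0 means the discount changes nothing).
    """
    total = 0.0
    for churn_prob, revenue in customers:
        if churn_prob >= threshold:
            churn_prob = churn_prob * (1.0 - retention_uplift)
            revenue = revenue * (1.0 - discount)
        total += (1.0 - churn_prob) * revenue
    return total

# Hypothetical (churn probability, yearly revenue) pairs, for illustration only.
customers = [(0.9, 1000.0), (0.6, 500.0), (0.1, 800.0)]

baseline = expected_revenue(customers, discount=0.0)
with_discount = expected_revenue(customers, retention_uplift=0.5)
```

Under this framing, a discount with no retention effect can only lower expected revenue, so the strategy is worthwhile only if the discount retains enough of the targeted customers.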

# Imports and Data

In [32]:
import turicreate

In [33]:
data = turicreate.SFrame.read_csv("./client_data.csv", column_type_hints={'churn':int})
del data['id']

Data will not be explored here.
For more details, check the notebook named <u>'Churn Analysis.ipynb'</u>.

# Model Auto-Selection - Classifier

In [34]:
train, test = data.random_split(0.8)

In [35]:
# Selects the best model based on your data.
model = turicreate.classifier.create(train, target='churn')

# Make predictions and evaluate results.
predictions = model.classify(test)
results = model.evaluate(test)
results

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: The following methods are available for this type of problem.
PROGRESS: LogisticClassifier, SVMClassifier
PROGRESS: The returned model will be chosen according to validation accuracy.


PROGRESS: Model selection based on validation accuracy:
PROGRESS: ---------------------------------------------
PROGRESS: LogisticClassifier              : 0.8839590443686007
PROGRESS: SVMClassifier                   : 0.8924914675767918
PROGRESS: ---------------------------------------------
PROGRESS: Selecting SVMClassifier based on validation set performance.


{'accuracy': 0.87512971290211, 'confusion_matrix': Columns:
 	target_label	int
 	predicted_label	int
 	count	int
 
 Rows: 4
 
 Data:
 +--------------+-----------------+-------+
 | target_label | predicted_label | count |
 +--------------+-----------------+-------+
 |      1       |        1        |   18  |
 |      0       |        1        |   85  |
 |      1       |        0        |  276  |
 |      0       |        0        |  2512 |
 +--------------+-----------------+-------+
 [4 rows x 3 columns], 'f1_score': 0.09068010075566751, 'precision': 0.17475728155339806, 'recall': 0.061224489795918366}
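
The reported metrics follow directly from the confusion matrix counts above. As a quick check, this short sketch recomputes them by hand, treating churn (label 1) as the positive class:

```python
# Counts taken from the confusion matrix above (positive class = churn = 1).
tp, fp, fn, tn = 18, 85, 276, 2512

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_score = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1_score:.4f}")
```

The high accuracy comes almost entirely from the 2512 correctly predicted non-churners; only 18 of the 294 churners in the test split are caught.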

# Logistic Classifier

In [42]:
# Build the model
model_LC = turicreate.logistic_classifier.create(train, target='churn')

# Make predictions and evaluate results.
# Note: `data` also contains the training rows, so these metrics are not a held-out estimate.
pred_LC = model_LC.classify(data)
res_LC = model_LC.evaluate(data)
res_LC

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.



{'accuracy': 0.9257839244146241,
 'auc': 0.8853397539048136,
 'confusion_matrix': Columns:
 	target_label	int
 	predicted_label	int
 	count	int
 
 Rows: 4
 
 Data:
 +--------------+-----------------+-------+
 | target_label | predicted_label | count |
 +--------------+-----------------+-------+
 |      0       |        1        |  299  |
 |      1       |        1        |  634  |
 |      0       |        0        | 12888 |
 |      1       |        0        |  785  |
 +--------------+-----------------+-------+
 [4 rows x 3 columns],
 'f1_score': 0.5391156462585034,
 'log_loss': 0.24049069551125488,
 'precision': 0.6795284030010718,
 'recall': 0.44679351656095845,
 'roc_curve': Columns:
 	threshold	float
 	fpr	float
 	tpr	float
 	p	int
 	n	int
 
 Rows: 1001
 
 Data:
 +-----------+---------------------+--------------------+------+-------+
 | threshold |         fpr         |        tpr         |  p   |   n   |
 +-----------+---------------------+--------------------+------+-------+
 |   

In [37]:
model_LC.summary()

Class                          : LogisticClassifier

Schema
------
Number of coefficients         : 4376
Number of examples             : 11129
Number of classes              : 2
Number of feature columns      : 24
Number of unpacked features    : 24

Hyperparameters
---------------
L1 penalty                     : 0.0
L2 penalty                     : 0.01

Training Summary
----------------
Solver                         : lbfgs
Solver iterations              : 10
Solver status                  : Completed (Iteration limit reached).
Training time (sec)            : 1.1749

Settings
--------
Log-likelihood                 : 1690.5308

Highest Positive Coefficients
-----------------------------
date_activ[2003-12-12]         : 22.5235
date_activ[2004-10-15]         : 22.2767
date_activ[2004-09-08]         : 21.3325
date_modif_prod[2012-08-18]    : 18.9941
date_activ[2009-03-20]         : 18.9437

Lowest Negative Coefficients
----------------------------
date_activ[2009-06-06]         : -

In [43]:
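# Churn probabilities predicted by the Logistic Classifier, stored as a new feature

LC = model_LC.predict(data, output_type='probability')
data['prob_LC'] = LC/0.001  # dividing by 0.001 rescales probabilities from [0, 1] to [0, 1000]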

# SVM Classifier

In [44]:
# Re-split so that train and test include the new prob_LC column
train, test = data.random_split(0.8)

# Build the model
model_SVM = turicreate.svm_classifier.create(train, target='churn',
                                            features=['prob_LC','cons_gas_12m'])

# Make predictions and evaluate results.
pred_SVM = model_SVM.classify(test)
res_SVM = model_SVM.evaluate(test)
res_SVM

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.



{'accuracy': 0.9256914361879374, 'confusion_matrix': Columns:
 	target_label	int
 	predicted_label	int
 	count	int
 
 Rows: 4
 
 Data:
 +--------------+-----------------+-------+
 | target_label | predicted_label | count |
 +--------------+-----------------+-------+
 |      0       |        1        |   43  |
 |      1       |        1        |  112  |
 |      1       |        0        |  180  |
 |      0       |        0        |  2666 |
 +--------------+-----------------+-------+
 [4 rows x 3 columns], 'f1_score': 0.5011185682326622, 'precision': 0.7225806451612903, 'recall': 0.3835616438356164}

In [40]:
model_SVM.summary()

Class                          : SVMClassifier

Schema
------
Number of coefficients         : 3
Number of examples             : 11092
Number of classes              : 2
Number of feature columns      : 2
Number of unpacked features    : 2

Hyperparameters
---------------
Mis-classification penalty     : 1.0

Training Summary
----------------
Solver                         : lbfgs
Solver iterations              : 10
Solver status                  : Completed (Iteration limit reached).
Training time (sec)            : 0.2033

Settings
--------
Train Loss                     : 2090.0412

Highest Positive Coefficients
-----------------------------
prob_LC                        : 0.0025

Lowest Negative Coefficients
----------------------------
(intercept)                    : -1.4834
cons_gas_12m                   : -0.0



# Conclusion

***Observations:***<br>
- The automatic classifier selection considered two candidate algorithms:
    - Logistic Classifier
    - SVM
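- In the run shown above, the LC reached an accuracy of about 92.6% and an AUC of about 0.89 when evaluated on the full dataset (training rows included).
- In the run shown above, the auto-selected SVM reached an accuracy of about 87.5% on the test split, with a recall of only about 6%.
- My approach: use the probabilities predicted by the Logistic Classifier as a feature in the SVM Classifier.
- The accuracy of this combined model is about 92.6% on the test split, with a recall of about 38%.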


**Insight**:<br>
- Consumption variation carried considerable weight in the predictions, rather than price as the managers expected. The next analysis should request more historical consumption data.
- The data has too many gaps in the date fields, the proportion of churned customers is too small to build an unbiased model, and the current data shows no feature that clearly distinguishes churned customers from the rest.
- Inferential statistics is therefore the most appropriate approach.
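
The point about the churn proportion can be made concrete with the counts from the Logistic Classifier confusion matrix. The sketch below computes the churn rate and the accuracy of a trivial model that always predicts "no churn", which is a useful baseline when reading the accuracy figures in this notebook:

```python
# Counts from the Logistic Classifier confusion matrix on the full dataset.
churned = 634 + 785        # rows with target_label == 1
retained = 299 + 12888     # rows with target_label == 0
total = churned + retained

churn_rate = churned / total
majority_baseline_accuracy = retained / total  # always predicting "no churn"

print(f"churn rate: {churn_rate:.3f}")
print(f"majority-class accuracy: {majority_baseline_accuracy:.3f}")
```

Because roughly nine in ten customers do not churn, the trivial baseline already reaches about 90% accuracy, so recall and precision on the churn class are more informative than accuracy alone.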