# Notebook for Modelling

Creation date: 12/09/2020

Created by: Lucas Rodrigues

Summary:

<ul>
<li>Problem Description and Objectives</li>
<li>Solution Description</li>
<li>Code and Comments</li>
<li>Limitations & What Could Be Done with Real Life Data</li>
<li>Conclusions and Improvements</li>
<li>References</li>
</ul>

## Problem Description and Objectives
In summary, the main question this demonstration tries to answer is:

> Can we determine a client's investor profile from their daily expenses?

The main objective of this notebook is to build a classifier that predicts the investment profile, in order to answer the question above.

## Solution Description

The model was kept simple because of time constraints and because the data was generated specifically to validate the main hypothesis.

The model uses a table of spend ranges per customer. Each row is one customer, with their expenses aggregated into the share of transactions per spend range, separately for credit and debit.

Put simply, each column of the spend range table holds the fraction of the customer's transactions that falls within that range. With these features, a Random Forest classifier was trained using the scikit-learn package.
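As an illustration of how such a table could be derived from raw transactions, here is a minimal sketch. The raw column names (`customer_id`, `kind`, `amount`) and the range boundaries in `bins` are assumptions made for this example; the notebook only receives the already aggregated table.

```python
import pandas as pd

# Hypothetical raw transactions: one row per purchase.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 1, 1, 2, 2],
    "kind": ["Credito", "Debito", "Debito", "Debito", "Credito", "Debito"],
    "amount": [30.0, 12.0, 250.0, 8.0, 900.0, 45.0],
})

# Assumed range boundaries (not taken from the original data).
bins = [0, 50, 200, 1000, float("inf")]
range_names = ["Baixa", "Media", "Alta", "Muito Alta"]

# Assign each transaction to a spend range, then build the feature name.
transactions["range"] = pd.cut(transactions["amount"], bins=bins, labels=range_names)
transactions["feature"] = (
    "Faixa de Gasto " + transactions["range"].astype(str) + " " + transactions["kind"]
)

# Fraction of each customer's transactions per (range, kind) bucket.
# normalize="index" makes every row sum to 1.
spend_ranges = pd.crosstab(
    transactions["customer_id"], transactions["feature"], normalize="index"
)
print(spend_ranges)
```

Buckets that never occur in the sample simply do not appear as columns here; a real feature table would need to reindex to the full list of eight columns and fill the gaps with 0.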

Because the problem is fairly simple, a plain Random Forest was trained. In the pre-processing stage, the labels were converted from strings to integers and the data was split into training and test sets.

For simplicity, accuracy was the only evaluation metric. After training, the model was applied to the test set and its accuracy was measured.

Finally, the trained model was serialized ("pickled") using the pickle package. This pickle file will be incorporated into the backend of the main solution.

## Code and Comments

The code is very simple: read the data; transform the labels; separate the features (X) from the labels (y); split into training and test sets; initialize and train the model; finally, pickle the model.

In [63]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import preprocessing
from sklearn.metrics import accuracy_score
import pickle

In [48]:
## Read the data
database = pd.read_csv('data_perfil_consumo_agrupado_completo.csv', sep = ',')
database[0:10]

Unnamed: 0.1,Unnamed: 0,Faixa de Gasto Baixa Credito,Faixa de Gasto Media Credito,Faixa de Gasto Alta Credito,Faixa de Gasto Muito Alta Credito,Faixa de Gasto Baixa Debito,Faixa de Gasto Media Debito,Faixa de Gasto Alta Debito,Faixa de Gasto Muito Alta Debito,Perfil de Investimento
0,818371,0.009674,0.039903,0.019347,0.009674,0.532044,0.180169,0.189843,0.019347,ultraconservador
1,125630,0.019031,0.039792,0.029412,0.039792,0.472318,0.261246,0.129758,0.008651,ultraconservador
2,830798,0.050119,0.029833,0.050119,0.029833,0.310263,0.470167,0.059666,0.0,ultraconservador
3,673279,0.029777,0.049628,0.019851,0.009926,0.390819,0.430521,0.019851,0.049628,ultraconservador
4,823930,0.019582,0.009138,0.019582,0.030026,0.370757,0.460836,0.090078,0.0,ultraconservador
5,851336,0.039835,0.009615,0.028846,0.049451,0.482143,0.331044,0.039835,0.019231,ultraconservador
6,291438,0.009563,0.019126,0.030055,0.019126,0.632514,0.240437,0.030055,0.019126,ultraconservador
7,637559,0.009132,0.028919,0.019787,0.009132,0.351598,0.482496,0.059361,0.039574,ultraconservador
8,147150,0.019469,0.019469,0.019469,0.019469,0.511504,0.2,0.180531,0.030088,ultraconservador
9,372984,0.030142,0.008865,0.030142,0.019504,0.62234,0.049645,0.230496,0.008865,ultraconservador


In [49]:
## Map to transform label
map_dict = {
    'ultraconservador': 1,
    'conservador': 2,
    'moderado': 3,
    'dinamico': 4
}
map_dict

{'ultraconservador': 1, 'conservador': 2, 'moderado': 3, 'dinamico': 4}

In [52]:
## Transform labels
database['Perfil de Investimento'] = database['Perfil de Investimento'].map(map_dict)
database[0:10]

Unnamed: 0.1,Unnamed: 0,Faixa de Gasto Baixa Credito,Faixa de Gasto Media Credito,Faixa de Gasto Alta Credito,Faixa de Gasto Muito Alta Credito,Faixa de Gasto Baixa Debito,Faixa de Gasto Media Debito,Faixa de Gasto Alta Debito,Faixa de Gasto Muito Alta Debito,Perfil de Investimento
0,818371,0.009674,0.039903,0.019347,0.009674,0.532044,0.180169,0.189843,0.019347,1
1,125630,0.019031,0.039792,0.029412,0.039792,0.472318,0.261246,0.129758,0.008651,1
2,830798,0.050119,0.029833,0.050119,0.029833,0.310263,0.470167,0.059666,0.0,1
3,673279,0.029777,0.049628,0.019851,0.009926,0.390819,0.430521,0.019851,0.049628,1
4,823930,0.019582,0.009138,0.019582,0.030026,0.370757,0.460836,0.090078,0.0,1
5,851336,0.039835,0.009615,0.028846,0.049451,0.482143,0.331044,0.039835,0.019231,1
6,291438,0.009563,0.019126,0.030055,0.019126,0.632514,0.240437,0.030055,0.019126,1
7,637559,0.009132,0.028919,0.019787,0.009132,0.351598,0.482496,0.059361,0.039574,1
8,147150,0.019469,0.019469,0.019469,0.019469,0.511504,0.2,0.180531,0.030088,1
9,372984,0.030142,0.008865,0.030142,0.019504,0.62234,0.049645,0.230496,0.008865,1
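One detail worth checking at this step: `Series.map` with a dictionary returns `NaN` for any value that is not a key, so a misspelled or unexpected profile name would silently become a missing label. The sketch below uses a hypothetical label series with one deliberate typo to show a simple check.

```python
import pandas as pd

map_dict = {'ultraconservador': 1, 'conservador': 2, 'moderado': 3, 'dinamico': 4}

# Hypothetical labels, including one misspelled value.
labels = pd.Series(['moderado', 'dinamico', 'conservadr'])
encoded = labels.map(map_dict)

# Values missing from map_dict come back as NaN; list them before training.
unmapped = labels[encoded.isna()].unique().tolist()
print(unmapped)
```

An empty list means every label was encoded.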


In [53]:
## Separate features from label
X = database[['Faixa de Gasto Baixa Credito',
                   'Faixa de Gasto Media Credito',
                   'Faixa de Gasto Alta Credito',
                   'Faixa de Gasto Muito Alta Credito',
                   'Faixa de Gasto Baixa Debito',
                   'Faixa de Gasto Media Debito',
                   'Faixa de Gasto Alta Debito',
                   'Faixa de Gasto Muito Alta Debito']]
y = database['Perfil de Investimento']

In [54]:
## Separate train and test
train_features, test_features, train_labels, test_labels = train_test_split(X, y, test_size=0.33, random_state=42)
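The split above is purely random. When the classes are imbalanced, one optional variation is to pass `stratify` so that both subsets keep roughly the same class proportions. The sketch below uses synthetic stand-in arrays (`X_demo`, `y_demo`), not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 200 customers, 8 features, imbalanced labels 1..4.
rng = np.random.default_rng(42)
X_demo = rng.random((200, 8))
y_demo = np.repeat([1, 2, 3, 4], [120, 40, 30, 10])

# stratify=y_demo keeps the class proportions similar in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo
)

print(np.unique(y_tr, return_counts=True))
print(np.unique(y_te, return_counts=True))
```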

In [55]:
# Instantiate model with 1000 decision trees
rf = RandomForestClassifier(n_estimators = 1000, random_state = 42)
# Train the model on training data
rf.fit(train_features, train_labels)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=1000,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

In [56]:
rf.score(train_features, train_labels)

1.0

In [57]:
# Use the forest's predict method on the test data
predictions = rf.predict(test_features)
# Calculate the accuracy
accuracy = accuracy_score(y_true = test_labels, y_pred = predictions)
print('Accuracy:', accuracy)

Accuracy: 1.0
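Accuracy is a single number and does not show which profiles get confused with each other. As a possible complement, the sketch below computes a confusion matrix and per-class report on hypothetical label arrays (using the 1 to 4 encoding from `map_dict`); with the notebook's variables, `test_labels` and `predictions` would take their place.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical true and predicted labels, for illustration only.
y_true = [1, 1, 2, 2, 3, 3, 4, 4]
y_pred = [1, 1, 2, 3, 3, 3, 4, 2]

acc = accuracy_score(y_true, y_pred)
print('Accuracy:', acc)

# Rows are true classes, columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_true, y_pred, labels=[1, 2, 3, 4])
print(cm)

# Precision, recall and F1 per class.
print(classification_report(y_true, y_pred, labels=[1, 2, 3, 4], zero_division=0))
```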


In [59]:
probs = rf.predict_proba(test_features)
probs

array([[0.   , 1.   , 0.   , 0.   ],
       [0.   , 0.001, 0.971, 0.028],
       [0.   , 0.   , 0.996, 0.004],
       ...,
       [0.027, 0.97 , 0.001, 0.002],
       [0.   , 0.   , 0.991, 0.009],
       [0.998, 0.002, 0.   , 0.   ]])
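The columns of `predict_proba` follow the order of the fitted model's `classes_` attribute, so they can be mapped back to profile names. The sketch below trains a small stand-in forest on synthetic, separable data (`X_demo`, `y_demo` and `rf_demo` are illustrative names) and labels the probability columns.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

map_dict = {'ultraconservador': 1, 'conservador': 2, 'moderado': 3, 'dinamico': 4}
inverse_map = {code: name for name, code in map_dict.items()}

# Synthetic stand-in data, separable by construction.
rng = np.random.default_rng(0)
y_demo = np.repeat([1, 2, 3, 4], 25)
X_demo = rng.random((100, 8)) + 2 * y_demo[:, None]

rf_demo = RandomForestClassifier(n_estimators=50, random_state=42)
rf_demo.fit(X_demo, y_demo)

# Column i of predict_proba corresponds to rf_demo.classes_[i].
probs_demo = pd.DataFrame(
    rf_demo.predict_proba(X_demo[:5]),
    columns=[inverse_map[c] for c in rf_demo.classes_],
)
print(probs_demo)
```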

In [76]:
## Finally, pickle the model (the context manager closes the file after writing)
with open('model_sklearn.pickle', 'wb') as f:
    pickle.dump(rf, f)

## Limitations & What Could Be Done with Real Life Data

Because the data was generated in a structured way and with separable classes, the main limitation is that real-life data would require considerably more processing. The separable classes likely also explain the accuracy of 1.0 on both the training and test sets.

In addition, the modeling may require additional tuning steps, along with testing other classification techniques or even other modeling strategies.
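As one possible shape for such a tuning step, the sketch below runs a small cross-validated grid search with scikit-learn's `GridSearchCV`. The data is synthetic and the grid values are arbitrary assumptions; a real search would use the customer features and a grid chosen for that data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data with overlapping classes (labels 1..4).
rng = np.random.default_rng(0)
y_demo = np.repeat([1, 2, 3, 4], 30)
X_demo = rng.random((120, 8)) + 0.3 * y_demo[:, None]

# Assumed, deliberately small grid to keep the example fast.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,                # 3-fold cross-validation
    scoring="accuracy",  # same metric as the notebook
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```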

## Conclusions and Improvements

The main improvements are tuning the model, building a Machine Learning pipeline, and improving the code.

Among the observed results, exporting and importing the pickled model was quick and simple. This way of taking a model to production environments can be used in PoCs, MVPs, or other simplified scenarios.
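The export and import round trip can be sketched as follows. The stand-in model and the temporary directory are assumptions made so the example is self-contained; in the notebook the file is `model_sklearn.pickle` and the model is `rf`.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Stand-in model trained on synthetic, separable data.
rng = np.random.default_rng(0)
y_demo = np.repeat([1, 2, 3, 4], 25)
X_demo = rng.random((100, 8)) + 2 * y_demo[:, None]
model = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model_sklearn.pickle"
    # Export: what the notebook does at the end.
    with open(path, "wb") as f:
        pickle.dump(model, f)
    # Import: what a backend would do at startup.
    with open(path, "rb") as f:
        loaded = pickle.load(f)

# The loaded model should reproduce the original predictions.
same = bool((loaded.predict(X_demo) == model.predict(X_demo)).all())
print(same)
```

Two caveats apply to this approach: pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code, and the scikit-learn version used to load the model should match the one used to save it.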

## References

The code was built using these references:

- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
- https://www.datacamp.com/community/tutorials/random-forests-classifier-python
- https://towardsdatascience.com/random-forest-in-python-24d0893d51c0
- https://www.datacamp.com/community/tutorials/pickle-python-tutorial