
## Import libraries, import and manipulate data

Code source: [Alura course](https://cursos.alura.com.br/course/machine-learning-validando-modelos) about machine learning and cross-validation

In [36]:
import pandas as pd
import numpy as np


In [37]:
uri = "https://gist.githubusercontent.com/guilhermesilveira/e99a526b2e7ccc6c3b70f53db43a87d2/raw/1605fc74aa778066bf2e6695e24d53cf65f2f447/machine-learning-carros-simulacao.csv"
dados = pd.read_csv(uri).drop(columns=["Unnamed: 0"])
dados.head()

Unnamed: 0,preco,vendido,idade_do_modelo,km_por_ano
0,30941.02,1,18,35085.22134
1,40557.96,1,20,12622.05362
2,89627.5,0,12,11440.79806
3,95276.14,0,3,43167.32682
4,117384.68,1,4,12770.1129


Separate the data into X (input) and y (output), then hold out 25% of the rows for testing and train on the remaining 75%.

In [38]:
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

x = dados[["preco", "idade_do_modelo","km_por_ano"]]
y = dados["vendido"]

SEED = 158020
np.random.seed(SEED)
treino_x, teste_x, treino_y, teste_y = train_test_split(x, y, test_size = 0.25,
                                                         stratify = y)
print("We will train with %d elements and test with %d elements" % (len(treino_x), len(teste_x)))

We will train with 7500 elements and test with 2500 elements
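As a small illustration of what `stratify` does in the split above, the sketch below uses a synthetic label vector (the 58% share of positives and the `*_demo` names are assumptions, not taken from the car data) and checks that both subsets keep roughly the same class proportion as the full set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1000 rows, 58% positives (assumed ratio)
y_demo = np.array([1] * 580 + [0] * 420)
x_demo = np.arange(1000).reshape(-1, 1)

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# With stratify, the share of positives is preserved in both subsets
print("positives overall:  %.3f" % y_demo.mean())
print("positives in train: %.3f" % y_tr.mean())
print("positives in test:  %.3f" % y_te.mean())
```

Here `random_state=0` is used instead of `np.random.seed` only to keep the sketch self-contained; either approach makes the split reproducible.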


## Add baseline accuracy rate of the algorithm

In [39]:
from sklearn.dummy import DummyClassifier

dummy_stratified = DummyClassifier(strategy='stratified')
dummy_stratified.fit(treino_x, treino_y)
acuracia = dummy_stratified.score(teste_x, teste_y) * 100

print("The accuracy of the stratified dummy was %.2f%%" % acuracia)

The accuracy of the stratified dummy was 50.96%


If you have another baseline in mind, you can use it instead.
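One possible alternative baseline, sketched here on synthetic labels (the 58/42 class mix and the `*_demo` names are assumptions), is `DummyClassifier(strategy="most_frequent")`, which always predicts the majority class. Its accuracy equals the share of the majority class, so a real model should at least beat that number.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic labels: 58% of class 1, 42% of class 0 (assumed mix)
y_demo = np.array([1] * 580 + [0] * 420)
x_demo = np.zeros((len(y_demo), 1))  # DummyClassifier ignores the features

dummy_most_frequent = DummyClassifier(strategy="most_frequent")
dummy_most_frequent.fit(x_demo, y_demo)

# Always predicting the majority class gives the majority share as accuracy
baseline_accuracy = dummy_most_frequent.score(x_demo, y_demo)
print("Most-frequent baseline accuracy: %.2f%%" % (baseline_accuracy * 100))
```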

## First ML algorithm (decision tree)

In [40]:
from sklearn.tree import DecisionTreeClassifier

SEED = 158020
np.random.seed(SEED)
modelo = DecisionTreeClassifier(max_depth=2)
modelo.fit(treino_x, treino_y)
previsoes = modelo.predict(teste_x)

acuracia = accuracy_score(teste_y, previsoes) * 100
print("The accuracy was %.2f%%" % acuracia)

The accuracy was 71.92%


Change the SEED to see how sensitive the result is.

In [41]:
x = dados[["preco", "idade_do_modelo","km_por_ano"]]
y = dados["vendido"]

SEED = 5 # you can play with SEED here (SEED=5 will give for example 76%)
np.random.seed(SEED)
treino_x, teste_x, treino_y, teste_y = train_test_split(x, y, test_size = 0.25,
                                                         stratify = y)
print("We will train with %d elements and test with %d elements" % (len(treino_x), len(teste_x)))

modelo = DecisionTreeClassifier(max_depth=2)
modelo.fit(treino_x, treino_y)
previsoes = modelo.predict(teste_x)

acuracia = accuracy_score(teste_y, previsoes) * 100
print("The accuracy was %.2f%%" % acuracia)

We will train with 7500 elements and test with 2500 elements
The accuracy was 76.84%


Notice how sensitive the result is to the SEED, that is, to randomness. It is dangerous for an evaluation to depend so heavily on chance. This happens because we split into train and test only once, so a single unlucky split can distort the result. We therefore need to train and test several times to obtain several accuracy values, which makes the estimate more robust to the SEED and gives us an interval for the accuracy instead of a single number.
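The sensitivity to the split can be made visible by repeating the holdout with several seeds. The sketch below does this on a synthetic dataset from `make_classification` (an assumption, standing in for the car data), so the numbers it prints are not comparable to the ones above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary dataset with 3 features (stand-in for the real data)
x_demo, y_demo = make_classification(n_samples=1000, n_features=3, n_informative=3,
                                     n_redundant=0, random_state=0)

scores = []
for seed in range(10):
    # A different seed gives a different train/test split
    x_tr, x_te, y_tr, y_te = train_test_split(
        x_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=seed
    )
    model = DecisionTreeClassifier(max_depth=2, random_state=0)
    model.fit(x_tr, y_tr)
    scores.append(model.score(x_te, y_te))

scores = np.array(scores)
print("min %.3f, max %.3f, mean %.3f" % (scores.min(), scores.max(), scores.mean()))
```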

## Cross validation

For example, with a single holdout split in which 70% of your dataset is randomly selected for training and the rest for testing, you might reach an accuracy of 35%. With a different random choice of subsets you could reach 65%, with another 72%, and so on.

**Cross-validation** means that instead of splitting the dataset into train and test once, we split it k times, each time using different train and test subsets, computing the accuracy for each split and then taking the average accuracy. This is called **k-fold**. The larger k is, the longer the procedure takes to run, so choosing k is important.
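To make the k-fold idea concrete, here is a minimal sketch of the loop written by hand on synthetic data (the dataset and the `*_demo` names are assumptions). It also compares the result with `cross_val_score`: for a classifier, an integer `cv` makes scikit-learn use `StratifiedKFold` without shuffling, so the two sets of scores should match.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary dataset (stand-in for the car data)
x_demo, y_demo = make_classification(n_samples=600, n_features=3, n_informative=3,
                                     n_redundant=0, random_state=0)

k = 3
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=k).split(x_demo, y_demo):
    model = DecisionTreeClassifier(max_depth=2, random_state=0)
    model.fit(x_demo[train_idx], y_demo[train_idx])  # train on k-1 folds
    # evaluate on the single held-out fold
    manual_scores.append(model.score(x_demo[test_idx], y_demo[test_idx]))
manual_scores = np.array(manual_scores)

# Same procedure through the library helper
library_scores = cross_val_score(DecisionTreeClassifier(max_depth=2, random_state=0),
                                 x_demo, y_demo, cv=k)

print("manual :", manual_scores)
print("library:", library_scores)
print("mean accuracy: %.3f" % manual_scores.mean())
```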

In [42]:
from sklearn.model_selection import cross_validate
SEED = 158020
np.random.seed(SEED)

modelo = DecisionTreeClassifier(max_depth=2) # that is my tree algorithm
# cross_validate splits the data into cv=3 folds; return_train_score=False omits the training scores, which we do not need here
results = cross_validate(modelo, x, y, cv = 3, return_train_score=False)
results

{'fit_time': array([0.00959802, 0.00725603, 0.00709939]),
 'score_time': array([0.00223756, 0.00235367, 0.00206685]),
 'test_score': array([0.75704859, 0.7629763 , 0.75337534])}

We are interested in the accuracy on the held-out fold of each of the cv=3 splits, the *test score*.

In [43]:
results['test_score']

array([0.75704859, 0.7629763 , 0.75337534])

We can compute the mean of these scores.

In [44]:
media = results['test_score'].mean()
media

0.7578000751484867

We can also compute the standard deviation and report the interval mean ± 2 standard deviations, within which most accuracy values are expected to lie.

In [45]:
media = results['test_score'].mean()
desvio_padrao = results['test_score'].std()
print("Accuracy with cross validation, 3 = [%.2f, %.2f]" % ((media - 2 * desvio_padrao) * 100, (media + 2 * desvio_padrao) * 100))

Accuracy with cross validation, 3 = [74.99, 76.57]


We can test whether we are still susceptible to the randomness by changing our SEED

In [46]:
from sklearn.model_selection import cross_validate

SEED = 301
np.random.seed(SEED)

modelo = DecisionTreeClassifier(max_depth=2)
results = cross_validate(modelo, x, y, cv = 3, return_train_score=False)
media = results['test_score'].mean()
desvio_padrao = results['test_score'].std()
print("Accuracy with cross validation and different SEED, 3 = [%.2f, %.2f]" % ((media - 2 * desvio_padrao) * 100, (media + 2 * desvio_padrao) * 100))

Accuracy with cross validation and different SEED, 3 = [74.99, 76.57]


Note that the interval is exactly the same.

However, it does change if we set k (cv) to 5, for example.

In [47]:
from sklearn.model_selection import cross_validate

SEED = 301
np.random.seed(SEED)

modelo = DecisionTreeClassifier(max_depth=2)
results = cross_validate(modelo, x, y, cv = 5, return_train_score=False)
media = results['test_score'].mean()
desvio_padrao = results['test_score'].std()
print("Accuracy with cross validation, 5 = [%.2f, %.2f]" % ((media - 2 * desvio_padrao) * 100, (media + 2 * desvio_padrao) * 100))

Accuracy with cross validation, 5 = [75.21, 76.35]


How to choose k? The literature suggests that a k between 5 and 10 is usually enough.

### Randomness in the cross-validate

The current code is not actually exploring randomness. When cv is an integer, cross_validate splits the dataset into folds without shuffling, so the split itself is deterministic. So far the SEED only affects the DecisionTree algorithm, and only marginally. If we play with SEED we still get similar results.

To really explore the randomness, we can shuffle the data before cross_validate splits it. We do that by passing as cv a splitter with shuffle enabled.
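The difference between the two modes can be seen on a tiny array. The sketch below (toy data, `random_state=301` chosen arbitrarily) lists the test indices of each fold: without shuffling the folds are consecutive blocks, and with shuffling they depend on the seed but still cover every row exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

x_demo = np.arange(10).reshape(-1, 1)  # 10 toy rows

# Without shuffle: consecutive blocks, identical on every run
no_shuffle = [test_idx.tolist() for _, test_idx in KFold(n_splits=5).split(x_demo)]

# With shuffle: rows are permuted first; a fixed random_state makes it reproducible
shuffled_a = [test_idx.tolist()
              for _, test_idx in KFold(n_splits=5, shuffle=True, random_state=301).split(x_demo)]
shuffled_b = [test_idx.tolist()
              for _, test_idx in KFold(n_splits=5, shuffle=True, random_state=301).split(x_demo)]

print("no shuffle:", no_shuffle)
print("shuffled  :", shuffled_a)
```

When `random_state` is left unset, `KFold(shuffle=True)` draws from NumPy's global random state, which is why calling `np.random.seed(SEED)` beforehand also makes the cells in this notebook reproducible.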

CV=10

In [48]:
def imprime_resultados(results):
    media = results['test_score'].mean()
    desvio_padrao = results['test_score'].std()
    print("Accuracy avg: %.2f" % (media * 100))
    print("Accuracy with cross validation interval = [%.2f, %.2f]" % ((media - 2 * desvio_padrao) * 100, (media + 2 * desvio_padrao) * 100))

In [49]:
from sklearn.model_selection import KFold

SEED = 301
np.random.seed(SEED)

cv = KFold(n_splits = 10)
modelo = DecisionTreeClassifier(max_depth=2)
results = cross_validate(modelo, x, y, cv = cv, return_train_score=False)
imprime_resultados(results)

Accuracy avg: 75.78
Accuracy with cross validation interval = [74.37, 77.19]


CV=10 + **SHUFFLE**

In [50]:
SEED = 301
np.random.seed(SEED)

cv = KFold(n_splits = 10, shuffle=True)
modelo = DecisionTreeClassifier(max_depth=2)
results = cross_validate(modelo, x, y, cv = cv, return_train_score=False)
imprime_resultados(results)

Accuracy avg: 75.76
Accuracy with cross validation interval = [73.26, 78.26]


Note that the interval changes a bit when shuffling is introduced. Here we used KFold to split our dataset, but **scikit-learn offers other splitter classes**.
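As a brief illustration, two of those other splitters are sketched below on toy data (the sizes and seeds are arbitrary assumptions): `ShuffleSplit` draws independent random train/test splits, and `RepeatedKFold` runs k-fold several times with a different shuffle each time.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold, ShuffleSplit

x_demo = np.arange(100).reshape(-1, 1)  # 100 toy rows

# ShuffleSplit: independent random splits; test sets may overlap across splits
shuffle_split = ShuffleSplit(n_splits=4, test_size=0.25, random_state=0)
test_sizes = [len(test_idx) for _, test_idx in shuffle_split.split(x_demo)]

# RepeatedKFold: 5-fold repeated twice yields 10 train/test splits in total
repeated = RepeatedKFold(n_splits=5, n_repeats=2, random_state=0)
n_repeated_splits = repeated.get_n_splits()

print("ShuffleSplit test sizes:", test_sizes)
print("RepeatedKFold number of splits:", n_repeated_splits)
```

Either object can be passed as `cv` to `cross_validate` in the same way as `KFold`.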

### Balance data in cross-validate

Another aspect we need to care about is the balance of our data. For example, our 0/1 output (dependent variable) may be unbalanced between the train and test subsets, either by chance or because of how the data is ordered. In that case we risk having mostly 0s in the train subset and mostly 1s in the test subset (or vice versa), an imbalance that will affect our estimator. With a single split we dealt with this through the **stratify parameter of train_test_split**.

We need the same behaviour in our k-fold code.

We can first simulate a scenario where this imbalance is present. For example, imagine our data is sorted, so that it starts with only 0s and ends with only 1s.

In [51]:
dados.sort_values("vendido", ascending=True) # see column vendido

Unnamed: 0,preco,vendido,idade_do_modelo,km_por_ano
4999,74023.29,0,12,24812.80412
5322,84843.49,0,13,23095.63834
5319,83100.27,0,19,36240.72746
5316,87932.13,0,16,32249.56426
5315,77937.01,0,15,28414.50704
...,...,...,...,...
5491,71910.43,1,9,25778.40812
1873,30456.53,1,6,15468.97608
1874,69342.41,1,11,16909.33538
5499,70520.39,1,16,19622.68262


Separate the features (x) and the target (y) of the sorted data.

In [52]:
dados_azar = dados.sort_values("vendido", ascending=True)
x_azar = dados_azar[["preco", "idade_do_modelo", "km_por_ano"]]
y_azar = dados_azar["vendido"]
dados_azar.head()

Unnamed: 0,preco,vendido,idade_do_modelo,km_por_ano
4999,74023.29,0,12,24812.80412
5322,84843.49,0,13,23095.63834
5319,83100.27,0,19,36240.72746
5316,87932.13,0,16,32249.56426
5315,77937.01,0,15,28414.50704


Run k-fold without shuffle (see how bad the result is).

In [53]:
from sklearn.model_selection import KFold

SEED = 301
np.random.seed(SEED)

cv = KFold(n_splits = 10)
modelo = DecisionTreeClassifier(max_depth=2)
results = cross_validate(modelo, x_azar, y_azar, cv = cv, return_train_score=False)
imprime_resultados(results)

Accuracy avg: 57.84
Accuracy with cross validation interval = [34.29, 81.39]


With shuffle added, the results are much better and similar to the earlier ones.

In [54]:
from sklearn.model_selection import KFold

SEED = 301
np.random.seed(SEED)

cv = KFold(n_splits = 10, shuffle=True)
modelo = DecisionTreeClassifier(max_depth=2)
results = cross_validate(modelo, x_azar, y_azar, cv = cv, return_train_score=False)
imprime_resultados(results)

Accuracy avg: 75.78
Accuracy with cross validation interval = [72.30, 79.26]


However, the right thing to do is to use a stratified splitter.

In [55]:
from sklearn.model_selection import StratifiedKFold

SEED = 301
np.random.seed(SEED)

cv = StratifiedKFold(n_splits = 10, shuffle=True)
modelo = DecisionTreeClassifier(max_depth=2)
results = cross_validate(modelo, x_azar, y_azar, cv = cv, return_train_score=False)
imprime_resultados(results)

Accuracy avg: 75.78
Accuracy with cross validation interval = [73.55, 78.01]
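To see why the unshuffled `KFold` does so badly on sorted data, the sketch below looks only at the labels that land in each test fold. It uses toy labels sorted like `dados_azar` (the 40/60 class mix and the variable names are assumptions) and prints the share of 1s per fold for both splitters.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy labels sorted like dados_azar: all 0s first, then all 1s (assumed 40/60 mix)
y_sorted = np.array([0] * 40 + [1] * 60)
x_dummy = np.zeros((len(y_sorted), 1))  # the splitters only need the number of rows

# Share of 1s in each test fold
kfold_share = [float(y_sorted[test_idx].mean())
               for _, test_idx in KFold(n_splits=10).split(x_dummy)]
strat_share = [float(y_sorted[test_idx].mean())
               for _, test_idx in StratifiedKFold(n_splits=10).split(x_dummy, y_sorted)]

print("KFold          :", kfold_share)   # each fold contains a single class
print("StratifiedKFold:", strat_share)   # each fold keeps the overall 60% share
```

With plain `KFold`, every test fold contains only one class, so the model is evaluated on a distribution very different from the one it was trained on; `StratifiedKFold` keeps the class mix constant across folds.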


### Create categories of our data to make the algorithm robust to new data

So far we have worked only with our own dataset, but our algorithm should also work with new data. In this example we build a model that predicts whether a car will be sold based on the car's characteristics. Our train and test subsets both contain all car models, whereas in real life we would train on some car models and the new data would contain newer ones.

Currently our evaluation does not account for that (which is also part of why the accuracy looks so good): a realistic test subset should contain car models that are unknown to the trained model.

Here we will simply create an extra column indicating the model of each car. One could instead create a column that clusters the cars according to their characteristics, which could also improve the accuracy of the estimator.

#### Create a column model to our dataset

Our data has no column with the car model, so we will create one based on the car's age. Cars that are 20 years old are not necessarily the same model, but they should have similar characteristics.

To each car's age we add a random integer between -2 and 2.

In [56]:
np.random.seed(SEED) # fix the seed so the draw is reproducible
np.random.randint(-2, 3) # draws an integer from -2 to 2 (the upper bound 3 is exclusive)

-2

In [57]:
np.random.seed(SEED)
dados["modelo"] = dados.idade_do_modelo + np.random.randint(-2, 3, size=len(dados)) # one random offset per row of our dataframe
dados.head()

Unnamed: 0,preco,vendido,idade_do_modelo,km_por_ano,modelo
0,30941.02,1,18,35085.22134,16
1,40557.96,1,20,12622.05362,22
2,89627.5,0,12,11440.79806,12
3,95276.14,0,3,43167.32682,4
4,117384.68,1,4,12770.1129,3


The resulting models are:

In [58]:
dados.modelo.unique()

array([16, 22, 12,  4,  3, 11, 18, 17, 13,  0, 15, 10,  9, 14,  1,  5, 19,
       21,  8,  7, 20,  6,  2, -1])

Note, however, that there is a model -1 (probably the result of idade_do_modelo=0 plus a random -1). We can add 1 to the whole column so that the minimum becomes 0.

In [59]:
dados.modelo = dados.modelo + abs(dados.modelo.min()) 
dados.head()

Unnamed: 0,preco,vendido,idade_do_modelo,km_por_ano,modelo
0,30941.02,1,18,35085.22134,17
1,40557.96,1,20,12622.05362,23
2,89627.5,0,12,11440.79806,13
3,95276.14,0,3,43167.32682,5
4,117384.68,1,4,12770.1129,4


In [60]:
dados.modelo.min()

0

We can add 1 again to avoid having a 0 in the modelo column.

In [61]:
dados.modelo = dados.modelo + abs(dados.modelo.min()) + 1
dados.head()

Unnamed: 0,preco,vendido,idade_do_modelo,km_por_ano,modelo
0,30941.02,1,18,35085.22134,18
1,40557.96,1,20,12622.05362,24
2,89627.5,0,12,11440.79806,14
3,95276.14,0,3,43167.32682,6
4,117384.68,1,4,12770.1129,5
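The construction of the synthetic model label can be condensed into a few lines. The sketch below uses a local `RandomState` and a handful of made-up ages (both assumptions) to show that `randint(-2, 3)` only produces values from -2 to 2, and that the two shifting steps can be written as a single expression.

```python
import numpy as np

rng = np.random.RandomState(301)  # local generator; the seed value is arbitrary

# low is inclusive and high is exclusive, so the offsets lie in {-2, -1, 0, 1, 2}
offsets = rng.randint(-2, 3, size=1000)
print("offset range:", offsets.min(), "to", offsets.max())

# Made-up ages, only for illustration
ages = np.array([0, 1, 18, 20])
raw_model = ages + rng.randint(-2, 3, size=len(ages))

# Shift so that the smallest label becomes 1, in one step
shifted_model = raw_model - raw_model.min() + 1
print("raw labels    :", raw_model)
print("shifted labels:", shifted_model)
```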


In [62]:
dados.modelo.value_counts() # how many cars we have for each model

20    901
19    798
18    771
21    723
17    709
16    668
14    621
22    575
15    573
13    557
12    511
11    401
10    371
23    370
9     336
8     278
7     206
24    199
6     181
5     108
4      76
3      44
2      17
1       6
Name: modelo, dtype: int64

#### Estimate model robust to new data

In [63]:
from sklearn.model_selection import GroupKFold

SEED = 301
np.random.seed(SEED)

cv = GroupKFold(n_splits = 10)
modelo = DecisionTreeClassifier(max_depth=2)
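# groups is the category column used to keep each car model entirely in train or entirely in test; it could be another column of our dataset
# groups are matched to rows by position, so dados.modelo.loc[x_azar.index] would keep them aligned with the sorted x_azar
results = cross_validate(modelo, x_azar, y_azar, cv = cv, groups = dados.modelo, return_train_score=False)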
imprime_resultados(results)

Accuracy avg: 75.78
Accuracy with cross validation interval = [73.67, 77.90]


In this case the estimate that is robust to car models is close to our initial estimate, but that was a matter of luck: here car sales behave roughly independently of the car model, which is why the result is similar to the previous ones. If a totally new type of car appeared (say, a flying car), the estimator would likely be inaccurate.
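What `GroupKFold` guarantees can be checked directly on a toy dataframe. In the sketch below (all column names and values are made up for illustration), the rows are sorted by target, the group labels are reindexed so they follow the same row order, and each split is checked to confirm that no group appears in both train and test.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

# Toy data: 12 rows, 6 groups of 2 rows each (hypothetical columns)
df = pd.DataFrame({
    "feature": np.arange(12),
    "target": [0, 1] * 6,
    "group": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
})

# Sort by target, as was done for dados_azar
df_sorted = df.sort_values("target")
x_s = df_sorted[["feature"]]
y_s = df_sorted["target"]

# Groups are matched to rows by position, so put them in the same row order as x_s
groups_aligned = df["group"].loc[x_s.index]

overlaps = []
for train_idx, test_idx in GroupKFold(n_splits=3).split(x_s, y_s, groups=groups_aligned):
    train_groups = set(groups_aligned.iloc[train_idx])
    test_groups = set(groups_aligned.iloc[test_idx])
    overlaps.append(train_groups & test_groups)  # should be empty for every split

print("groups shared between train and test:", overlaps)
```

If the group column were passed in a different row order than the features, the splitter would still run, but the group labels would no longer describe the rows they are attached to.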

## Scaled ML algorithm (SVC)

So far we have used a decision tree with cross-validation. Other algorithms, such as SVC, are sensitive to the scale of the features. For example, if X1 ranges from 0 to 20 and X3 from 0 to 10,000, they will have very different influence on the algorithm. Therefore, rescaling them is necessary.
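A minimal sketch of what the rescaling does, on a made-up array with one column in the 0 to 20 range and another in the 0 to 10,000 range: after `StandardScaler`, each column has mean 0 and standard deviation 1, so both features end up on a comparable scale.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up features on very different scales
x_demo = np.array([[0.0, 0.0],
                   [10.0, 5000.0],
                   [20.0, 10000.0]])

scaled = StandardScaler().fit_transform(x_demo)

print(scaled)
print("column means:", scaled.mean(axis=0))
print("column stds :", scaled.std(axis=0))
```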

In [69]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

SEED = 301
np.random.seed(SEED)

scaler = StandardScaler() # scaler
modelo = SVC() # algorithm

# the pipeline scales the data and then fits the estimator; cross_validate repeats both steps on the training part of each of the cv splits
pipeline = Pipeline([('transformacao', scaler), ('estimador', modelo)])

cv = GroupKFold(n_splits = 10)
results = cross_validate(pipeline, x_azar, y_azar, cv = cv, groups = dados.modelo, return_train_score=False)
imprime_resultados(results)

Accuracy avg: 77.27
Accuracy with cross validation interval = [74.35, 80.20]
