## 1st Class: Cross validation and initial randomness

> Date: June 30, 2020

- **Holdout:** Technique where we separate train and test data.
- **Cross validation:** Technique used to enrich the data usage, seizing all data in both, train and test.

### Fourth project of the first class ([1_introduction_to_the_classification_with_SKLearn](https://github.com/BrunaMS/machine_learning_course/blob/master/1_introduction_to_the_classification_with_SKLearn/5.%20The%20fourth%20project.ipynb))

In this project, I made a project to verify if, with a defined price, the car model and its age, one person would sell his/her automobile, making a prediction about the sale and if is or not likely that it'll be sold.

In [79]:
import pandas as pd
from datetime import datetime

uri = "https://gist.githubusercontent.com/guilhermesilveira/4d1d4a16ccbf6ea4e0a64a38a24ec884/raw/afd05cb0c796d18f3f5a6537053ded308ba94bf7/car-prices.csv"
data = pd.read_csv(uri)

sold_change = {
    'yes' : 1,
    'no'  : 0
}

current_year = datetime.today().year
data['model_age'] = current_year - data.model_year 
data['km_per_year'] = data.mileage_per_year * 1.60934 
data.sold = data.sold.map(sold_change)
data = data.drop(columns = ['Unnamed: 0', 'mileage_per_year', 'model_year'], axis=1) # Axis: 1 - Column, 0 - Row
data.head()

Unnamed: 0,price,sold,model_age,km_per_year
0,30941.02,1,20,35085.22134
1,40557.96,1,22,12622.05362
2,89627.5,0,14,11440.79806
3,95276.14,0,5,43167.32682
4,117384.68,1,6,12770.1129


In [80]:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import  StandardScaler
from sklearn.dummy import DummyClassifier
import numpy as np

x = data[['model_age', 'price', 'km_per_year']]
y = data['sold']

SEED = 15
np.random.seed(SEED)
raw_train_x, raw_test_x, train_y, test_y = train_test_split(x, y, random_state = SEED, test_size = 0.25, stratify = y)

# Baseline (default - stratified)
dummy = DummyClassifier()
dummy.fit(raw_train_x, train_y)
predictions = dummy.predict(raw_test_x)
accuracy = dummy.score(test_y, predictions)
print("The accuracy of Dummy algorithm is %.2f%% " % (accuracy*100))

The accuracy of Dummy algorithm is 51.68% 


In [81]:
scaler = StandardScaler()
scaler.fit(raw_train_x)
train_x = scaler.transform(raw_train_x)
test_x = scaler.transform(raw_test_x)

# Using SVC
model = SVC()
model.fit(train_x, train_y)
predictions = model.predict(test_x)

accuracy = accuracy_score(test_y, predictions)
print("The accuracy of this algorithm is %.2f%% " % (accuracy*100))

The accuracy of this algorithm is 77.20% 


In [82]:
## Decision trees
## Classifiers that can show to us reasons to the taken decisions.

from sklearn.tree import DecisionTreeClassifier

raw_train_x, raw_test_x, train_y, test_y = train_test_split(x, y, random_state = SEED, test_size = 0.25, stratify = y)

# Using a decision tree machine  
# The argument max_depth is used to define de max size of the thee
tree_model = DecisionTreeClassifier(max_depth = 3)
tree_model.fit(raw_train_x, train_y)
predictions = tree_model.predict(raw_test_x)

accuracy = accuracy_score(test_y, predictions)
print("The accuracy of this algorithm is %.2f%% " % (accuracy*100))

The accuracy of this algorithm is 79.68% 


### Continuation of the project with cross validation

> There are same pappers saying that a cross validation between 5 and 10  is the best way to use this tool.

In [83]:
# Import cross_validate
# Define the model
# Use cross validate with cv = 3
# Print results
# Remove train score of results
# Print test_score
# Remember to define always the same SEED
# Print mean with +/- standard deviation (std)

In [84]:
import pandas as pd
from datetime import datetime


uri = "https://gist.githubusercontent.com/guilhermesilveira/4d1d4a16ccbf6ea4e0a64a38a24ec884/raw/afd05cb0c796d18f3f5a6537053ded308ba94bf7/car-prices.csv"
data = pd.read_csv(uri)

sold_change = {
    'yes' : 1,
    'no'  : 0
}

current_year = datetime.today().year
data['model_age'] = current_year - data.model_year 
data['km_per_year'] = data.mileage_per_year * 1.60934 
data.sold = data.sold.map(sold_change)
data = data.drop(columns = ['Unnamed: 0', 'mileage_per_year'], axis=1) # Axis: 1 - Column, 0 - Row

In [85]:
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

SEED = 25
np.random.seed(SEED)

x = data[['model_age', 'price', 'km_per_year']]
y = data['sold']


tree_model = DecisionTreeClassifier(max_depth = 2)
result = cross_validate(tree_model, x, y, cv = 5)
result

{'fit_time': array([0.00913358, 0.00799227, 0.00769711, 0.00772214, 0.00771403]),
 'score_time': array([0.00163722, 0.00182605, 0.0014379 , 0.00143409, 0.0014379 ]),
 'test_score': array([0.756 , 0.7565, 0.7625, 0.7545, 0.7595])}

In [86]:
print('Test Score: ', result['test_score'])
accuracy_mean = result['test_score'].mean()
accuracy_std = result['test_score'].std()
print('Test accuracy gap: [%.2f%% - %.2f%%]' % (100 * (accuracy_mean - (2 * accuracy_std)), 100 * (accuracy_mean + (2 * accuracy_std))))

Test Score:  [0.756  0.7565 0.7625 0.7545 0.7595]
Test accuracy gap: [75.21% - 76.35%]


> At this moment, the seed is not influencing the test score, only times are changing when run the code again

## 2nd Class: Kfold with randomization

> Date: July 01, 2020

Now, we can see that your data is been better used, sharing data between train and test and using all results to do an mean of the results, what give us more accurated and reliable outcome.

However, if we considerate that the data can be sorted by an caracteristic (like price, size, year etc.), maybe the result would be better if we mix before, ensuring that the result won't be influenced by anything.

### Splitter Classes
- **[KFold:](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html)** Provides train/test indices to split data in train/test sets. Split dataset into k consecutive folds (without shuffling by default). Each fold is then used once as a validation while the k - 1 remaining folds form the training set.

In [87]:
# Create a cv with KFold (sklearn.model_selection)
# Create a function to print results
    # Print result, average and interval with +/- standard validation
# Set shuffle = True

In [88]:
def print_results(result):
    print('Raw result: ', result)
    print('\n----------------------------------------------------------------------------\nTest Score mean: %.2f%% ' % (result['test_score'].mean() * 100))
    accuracy_mean = result['test_score'].mean()
    accuracy_std = result['test_score'].std()
    print('Test accuracy gap: [%.2f%% - %.2f%%]' % (100 * (accuracy_mean - (2 * accuracy_std)), 100 * (accuracy_mean + (2 * accuracy_std))))

In [89]:
from sklearn.model_selection import KFold

SEED = 20
cross_val = KFold(n_splits = 10, shuffle = True, random_state = SEED)
result = cross_validate(tree_model, x, y, cv = cross_val)
print_results(result)

Raw result:  {'fit_time': array([0.01426721, 0.00905824, 0.00856137, 0.00862384, 0.00871229,
       0.00910783, 0.01672864, 0.01226997, 0.00870705, 0.00871038]), 'score_time': array([0.00171113, 0.0014255 , 0.00136924, 0.00135231, 0.00150537,
       0.00293922, 0.00272298, 0.00168777, 0.00141191, 0.00137806]), 'test_score': array([0.738, 0.753, 0.766, 0.752, 0.75 , 0.783, 0.77 , 0.757, 0.748,
       0.761])}

----------------------------------------------------------------------------
Test Score mean: 75.78% 
Test accuracy gap: [73.36% - 78.20%]


## 3rd Class: Stratification

> Date: July 01, 2020

Considering the same situation cited above, we can have a problem with something like all the data of a given state is into the train split and all the data with another characteristic in test, for example.

Therefore, we'll run a simulation of this situation to verify what happens with our classifier in this case. 

**[Stratified KFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html#sklearn.model_selection.StratifiedKFold):** A splitter class that balance/stratify the data to train and test.

In [90]:
sorted_data = data.sort_values('sold')
sorted_data.head()

Unnamed: 0,model_year,price,sold,model_age,km_per_year
4999,2006,74023.29,0,14,24812.80412
5322,2005,84843.49,0,15,23095.63834
5319,1999,83100.27,0,21,36240.72746
5316,2002,87932.13,0,18,32249.56426
5315,2003,77937.01,0,17,28414.50704


In [91]:
x = sorted_data[['model_age', 'price', 'km_per_year']]
y = sorted_data['sold']

### 1. With KFold, without shuffle

In [92]:
cross_val = KFold(n_splits = 10, shuffle = False, random_state = SEED)
result = cross_validate(tree_model, x, y, cv = cross_val)
print_results(result)

Raw result:  {'fit_time': array([0.01102853, 0.00928283, 0.00832105, 0.00818372, 0.00878716,
       0.00841427, 0.00830317, 0.00825119, 0.00840187, 0.00828242]), 'score_time': array([0.00160909, 0.00136924, 0.00131154, 0.00131202, 0.00163126,
       0.0013082 , 0.00131035, 0.0013206 , 0.00131798, 0.00135112]), 'test_score': array([0.447, 0.409, 0.438, 0.446, 0.694, 0.663, 0.668, 0.673, 0.67 ,
       0.676])}

----------------------------------------------------------------------------
Test Score mean: 57.84% 
Test accuracy gap: [34.29% - 81.39%]


### 2. With KFold and shuffle

In [93]:
cross_val = KFold(n_splits = 10, shuffle = True, random_state = SEED)
result = cross_validate(tree_model, x, y, cv = cross_val)
print_results(result)

Raw result:  {'fit_time': array([0.00945544, 0.00888348, 0.00859284, 0.00884867, 0.00845504,
       0.00825787, 0.00829768, 0.0083406 , 0.00837779, 0.00838399]), 'score_time': array([0.00144076, 0.0012958 , 0.00136185, 0.00145435, 0.0013001 ,
       0.00131059, 0.00135016, 0.00135803, 0.00135255, 0.00135255]), 'test_score': array([0.749, 0.785, 0.749, 0.761, 0.757, 0.763, 0.758, 0.743, 0.761,
       0.752])}

----------------------------------------------------------------------------
Test Score mean: 75.78% 
Test accuracy gap: [73.59% - 77.97%]


### 3. With Stratified KFold and without shuffle

In [94]:
from sklearn.model_selection import StratifiedKFold

cross_val = StratifiedKFold(n_splits = 10, shuffle = False, random_state = SEED)
result = cross_validate(tree_model, x, y, cv = cross_val)
print_results(result)

Raw result:  {'fit_time': array([0.00987673, 0.00830221, 0.00846028, 0.00827575, 0.00832272,
       0.00852108, 0.00867724, 0.00913239, 0.00884986, 0.00911117]), 'score_time': array([0.00134897, 0.00136399, 0.00131583, 0.00131178, 0.00132203,
       0.0015347 , 0.00159645, 0.00148916, 0.00154877, 0.0015254 ]), 'test_score': array([0.744, 0.759, 0.763, 0.765, 0.754, 0.742, 0.771, 0.748, 0.764,
       0.768])}

----------------------------------------------------------------------------
Test Score mean: 75.78% 
Test accuracy gap: [73.83% - 77.73%]


### 4. With Stratified KFold and shuffle

In [95]:
cross_val = StratifiedKFold(n_splits = 10, shuffle = True, random_state = SEED)
result = cross_validate(tree_model, x, y, cv = cross_val)
print_results(result)

Raw result:  {'fit_time': array([0.01354766, 0.01373267, 0.01405978, 0.00859761, 0.01174855,
       0.00871611, 0.00848222, 0.00857401, 0.00873303, 0.00862336]), 'score_time': array([0.00246692, 0.00267506, 0.00180364, 0.00153065, 0.00161409,
       0.00147033, 0.00150442, 0.0013752 , 0.00158191, 0.0013752 ]), 'test_score': array([0.759, 0.731, 0.761, 0.76 , 0.757, 0.764, 0.767, 0.75 , 0.753,
       0.776])}

----------------------------------------------------------------------------
Test Score mean: 75.78% 
Test accuracy gap: [73.52% - 78.04%]


### With these tests, we can conclude that:

- The shuffle is very important when there isn't any way to mix the data

- The KFold with shuffle makes a good job in this case and, if the coder prefer, appears that it wouldn't make much difference in the results

- The stratification is very good to balance/equilibrate data, mainly when we have so much data of one case and least of other.

In [96]:
sold_1 = len(data.query('sold == 1'))
sold_0 = len(data.query('sold == 0'))

print('Quantity of sold cars: %d' % sold_1)
print('Quantity of not sold cars: %d' % sold_0)

Quantity of sold cars: 5800
Quantity of not sold cars: 4200


## 4th Class: Groupable data

> Date: July 02, 2020

Now, thinking about the solution of this project, we have to consent that, if we want a useful algorithm, we have to be able to add new cars or users or anything to our dataset, right? moreover, we have to be able to apply your model in new cases, because this is our main target, making a code really useful in the real world. 


Thinking in it, we'll create a simulated column that will be group and defined as a random model, based on the car ages (cars of the same year are probabily more similar)

In [97]:
# Create a random column into a range between [-2, 3] and sum that with the model age
# Place this column into the dataframe 
# Verify if there is any value less than 0 and modify to be greater than 0

In [98]:
np.random.seed(SEED)
random_model = np.random.randint(-2, 3, size = len(data))
data['random_model'] = random_model + data['model_age']
data.head()

Unnamed: 0,model_year,price,sold,model_age,km_per_year,random_model
0,2000,30941.02,1,20,35085.22134,21
1,1998,40557.96,1,22,12622.05362,22
2,2006,89627.5,0,14,11440.79806,16
3,2015,95276.14,0,5,43167.32682,5
4,2014,117384.68,1,6,12770.1129,5


In [99]:
print(np.sort(data['random_model'].unique()))
print(data['random_model'].value_counts())

[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24]
20    895
19    817
18    771
21    706
16    692
17    691
15    659
14    588
22    559
13    554
12    470
11    414
10    379
23    370
9     339
8     280
7     205
24    182
6     164
5     112
4      81
3      49
2      16
1       7
Name: random_model, dtype: int64


- Now, we have to make a model that can agroup the data of tran and test according to the model. Going to the sklearn documentation, we can see the [GroupKFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GroupKFold.html), that is completely able to do this task.

In [100]:
from sklearn.model_selection import GroupKFold

cross_val = GroupKFold(n_splits = 10)
result = cross_validate(tree_model, x, y, cv = cross_val, groups = data.random_model)
print_results(result)

Raw result:  {'fit_time': array([0.01312661, 0.0086813 , 0.00857949, 0.00847578, 0.00848532,
       0.00833583, 0.00837135, 0.00853181, 0.00862813, 0.00867462]), 'score_time': array([0.00138545, 0.00139761, 0.00148487, 0.00135422, 0.00142646,
       0.00132918, 0.00132728, 0.00139737, 0.00139189, 0.00150514]), 'test_score': array([0.76961271, 0.76860347, 0.76646707, 0.74596774, 0.7671093 ,
       0.75145631, 0.75218659, 0.73041709, 0.75231244, 0.7734375 ])}

----------------------------------------------------------------------------
Test Score mean: 75.78% 
Test accuracy gap: [73.20% - 78.35%]


## 5th Class: Pipeline

> Date: July 02, 2020

To finish the project, we can test another model like, for example, **SVC**. However, we have to remember that this classifier is very affected by unbalanced values (some columns very large and others very small). In this case, we can use **StandardScaler** again. But, now, we have to remeber that ir need to be used always before the use of the model. Thus, to use with cross validation, that is setted to test and train 10 times, we would need to run that 10 times too, **if** we didn't have the [**Pipeline**](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html), that was made to be used as a "to assemble several steps that can be cross-validated together while setting different parameters."

In [101]:
# Import standard scaler
# Use standard scaler (define, fit, transform train and test) //
# Import and use SVC (with x and y scalled) //
# Use accuracy score //
# Test SVC with cross validation //
# Scale sorted values, used on the cross validate 
# Pipeline - Line of proccess
# Create a pipeline with the scaler and the model
# Define cv with GroupKFold
# Use pipeline into cross validate

In [102]:
from sklearn.pipeline import Pipeline

x = sorted_data[['model_age', 'price', 'km_per_year']]
y = sorted_data['sold']

scaler = StandardScaler()
model = SVC()
pipeline = Pipeline([('StandardScaler', scaler), ('SVC', model)])

cross_val = GroupKFold(n_splits = 10)
result = cross_validate(pipeline, x, y, cv = cross_val, groups = data.random_model)
print_results(result)

Raw result:  {'fit_time': array([2.42453885, 2.38989568, 2.37032509, 2.48003626, 2.44237256,
       2.43157697, 2.43290687, 2.44072819, 2.48474479, 2.42990232]), 'score_time': array([0.14068985, 0.13901258, 0.1942997 , 0.13774228, 0.14524531,
       0.1428957 , 0.15274239, 0.15115571, 0.13665795, 0.17064285]), 'test_score': array([0.77656405, 0.78593272, 0.78542914, 0.75302419, 0.7773238 ,
       0.76019417, 0.74829932, 0.73753815, 0.77081192, 0.78515625])}

----------------------------------------------------------------------------
Test Score mean: 76.80% 
Test accuracy gap: [73.52% - 80.08%]
