## 1st Class: Cross validation and initial randomness

> Date: June 30, 2020

- **Holdout:** Technique where we split the data once into a training set and a test set.
- **Cross validation:** Technique that makes better use of the data by running several splits, so that every sample is used for both training and testing.
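
To make the difference between the two techniques concrete, here is a minimal sketch that scores the same kind of model once with a holdout split and once with 5-fold cross validation. The synthetic data from `make_classification` and the chosen parameters are assumptions for illustration; they are not the car-prices data used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the car-prices dataset).
X, y = make_classification(n_samples=1000, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

# Holdout: a single split, so 25% of the rows are only ever used for testing.
train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)
holdout_model = DecisionTreeClassifier(max_depth=3, random_state=0)
holdout_model.fit(train_X, train_y)
holdout_score = holdout_model.score(test_X, test_y)

# Cross validation: 5 splits, every row is used for testing exactly once.
cv_result = cross_validate(DecisionTreeClassifier(max_depth=3, random_state=0),
                           X, y, cv=5)
cv_scores = cv_result['test_score']

print("Holdout accuracy: %.2f%%" % (holdout_score * 100))
print("Cross validation accuracies:", cv_scores)
print("Cross validation mean: %.2f%%" % (cv_scores.mean() * 100))
```

The holdout score depends on which rows happened to land in the single test set, while the cross validation mean averages over five different test sets.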

### Fourth project of the first class ([1_introduction_to_the_classification_with_SKLearn](https://github.com/BrunaMS/machine_learning_course/blob/master/1_introduction_to_the_classification_with_SKLearn/5.%20The%20fourth%20project.ipynb))

In this project, I built a model to predict whether a car is likely to be sold, given its price, its model age, and the distance it is driven per year.

In [2]:
import pandas as pd
from datetime import datetime
uri = "https://gist.githubusercontent.com/guilhermesilveira/4d1d4a16ccbf6ea4e0a64a38a24ec884/raw/afd05cb0c796d18f3f5a6537053ded308ba94bf7/car-prices.csv"
data = pd.read_csv(uri)

sold_change = {
    'yes' : 1,
    'no'  : 0
}

current_year = datetime.today().year
data['model_age'] = current_year - data.model_year 
data['km_per_year'] = data.mileage_per_year * 1.60934 
data.sold = data.sold.map(sold_change)
data = data.drop(columns = ['Unnamed: 0', 'mileage_per_year', 'model_year'], axis=1) # Axis: 1 - Column, 0 - Row
data.head()

|   | price | sold | model_age | km_per_year |
|---|---|---|---|---|
| 0 | 30941.02 | 1 | 20 | 35085.22134 |
| 1 | 40557.96 | 1 | 22 | 12622.05362 |
| 2 | 89627.5 | 0 | 14 | 11440.79806 |
| 3 | 95276.14 | 0 | 5 | 43167.32682 |
| 4 | 117384.68 | 1 | 6 | 12770.1129 |


In [3]:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import  StandardScaler
from sklearn.dummy import DummyClassifier
import numpy as np

x = data[['model_age', 'price', 'km_per_year']]
y = data['sold']

SEED = 15
np.random.seed(SEED)
raw_train_x, raw_test_x, train_y, test_y = train_test_split(x, y, random_state = SEED, test_size = 0.25, stratify = y)

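# Baseline: stratified dummy classifier (random guesses that follow the class proportions)
dummy = DummyClassifier(strategy = 'stratified')
dummy.fit(raw_train_x, train_y)
predictions = dummy.predict(raw_test_x)
# Compare the true labels with the predictions; note that score() expects (X, y), not (y_true, y_pred)
accuracy = accuracy_score(test_y, predictions)
print("The accuracy of Dummy algorithm is %.2f%% " % (accuracy*100))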

The accuracy of Dummy algorithm is 51.68% 


In [4]:
scaler = StandardScaler()
scaler.fit(raw_train_x)
train_x = scaler.transform(raw_train_x)
test_x = scaler.transform(raw_test_x)

# Using SVC
model = SVC()
model.fit(train_x, train_y)
predictions = model.predict(test_x)

accuracy = accuracy_score(test_y, predictions)
print("The accuracy of this algorithm is %.2f%% " % (accuracy*100))

The accuracy of this algorithm is 77.20% 


In [5]:
## Decision trees
## Classifiers that can show us the reasons behind the decisions they take.

from sklearn.tree import DecisionTreeClassifier

raw_train_x, raw_test_x, train_y, test_y = train_test_split(x, y, random_state = SEED, test_size = 0.25, stratify = y)

# Using a decision tree classifier
# The argument max_depth defines the maximum depth of the tree
tree_model = DecisionTreeClassifier(max_depth = 3)
tree_model.fit(raw_train_x, train_y)
predictions = tree_model.predict(raw_test_x)

accuracy = accuracy_score(test_y, predictions)
print("The accuracy of this algorithm is %.2f%% " % (accuracy*100))

The accuracy of this algorithm is 79.68% 
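
The cell above notes that decision trees can show the reasons behind their decisions. One way to inspect those reasons is `sklearn.tree.export_text`, which prints the learned rules. The sketch below is only an illustration: it trains on synthetic data, and the feature names are borrowed from this notebook purely as labels (an assumption), so the thresholds it prints say nothing about real cars.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data; the feature names mirror the notebook's columns
# but the values are random (an assumption for illustration only).
X, y = make_classification(n_samples=500, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)
feature_names = ['model_age', 'price', 'km_per_year']

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)

# export_text renders the learned rules as indented if/else conditions.
rules = export_text(tree, feature_names=feature_names)
print(rules)
```

Each indented line is a threshold test on one feature, and each leaf ends with the predicted class.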


### Continuation of the project with cross validation

> Some papers suggest that cross validation works best with between 5 and 10 folds.

In [6]:
# Import cross_validate
# Define the model
# Use cross validate with cv = 3
# Print results
# Remove the train score from the results
# Print test_score
# Remember to always define the same SEED
# Print mean with +/- standard deviation (std)

In [7]:
import pandas as pd
from datetime import datetime


uri = "https://gist.githubusercontent.com/guilhermesilveira/4d1d4a16ccbf6ea4e0a64a38a24ec884/raw/afd05cb0c796d18f3f5a6537053ded308ba94bf7/car-prices.csv"
data = pd.read_csv(uri)

sold_change = {
    'yes' : 1,
    'no'  : 0
}

current_year = datetime.today().year
data['model_age'] = current_year - data.model_year 
data['km_per_year'] = data.mileage_per_year * 1.60934 
data.sold = data.sold.map(sold_change)
data = data.drop(columns = ['Unnamed: 0', 'mileage_per_year'], axis=1) # Axis: 1 - Column, 0 - Row

In [8]:
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

SEED = 25
np.random.seed(SEED)

x = data[['model_age', 'price', 'km_per_year']]
y = data['sold']


tree_model = DecisionTreeClassifier(max_depth = 2)
result = cross_validate(tree_model, x, y, cv = 5)
result

{'fit_time': array([0.00736451, 0.0065515 , 0.00662231, 0.00627613, 0.00665045]),
 'score_time': array([0.00132871, 0.00132298, 0.00121951, 0.00121045, 0.0012846 ]),
 'test_score': array([0.756 , 0.7565, 0.7625, 0.7545, 0.7595])}

In [9]:
print('Test Score: ', result['test_score'])
accuracy_mean = result['test_score'].mean()
accuracy_std = result['test_score'].std()
print('Test accuracy gap: between %.2f - %.2f' % (100 * (accuracy_mean - accuracy_std), 100 * (accuracy_mean + accuracy_std)))

Test Score:  [0.756  0.7565 0.7625 0.7545 0.7595]
Test accuracy gap: between 75.49 - 76.07
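
The interval printed above is the mean accuracy plus or minus one standard deviation. As a worked check, this short sketch redoes the arithmetic from the five fold scores copied out of the output; the variable names are only illustrative.

```python
import numpy as np

# Fold accuracies copied from the output above.
test_scores = np.array([0.756, 0.7565, 0.7625, 0.7545, 0.7595])

mean = test_scores.mean()
std = test_scores.std()  # population standard deviation (ddof=0), NumPy's default

lower = 100 * (mean - std)
upper = 100 * (mean + std)
print("Mean accuracy: %.2f%%" % (100 * mean))
print("Interval: [%.2f%% - %.2f%%]" % (lower, upper))
```

A narrow interval like this one means the five folds agree closely with each other.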


> At this point, the seed does not influence the test scores: with an integer `cv`, `cross_validate` splits the data without shuffling, so only the timings change when the code is run again.
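
A small sketch of why the seed has no effect here. For a classifier, an integer `cv` is handled like `StratifiedKFold` without shuffling, so the folds are the same on every run; only a splitter created with `shuffle=True` depends on its `random_state`. The toy labels and the helper `first_test_fold` are assumptions made for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

y = np.array([0, 1] * 10)          # 20 toy labels (an assumption)
X = np.arange(20).reshape(-1, 1)   # one dummy feature

def first_test_fold(splitter):
    # Return the test indices of the first split only.
    _, test_idx = next(splitter.split(X, y))
    return test_idx.tolist()

# Without shuffling, changing the global NumPy seed has no effect on the folds.
np.random.seed(1)
fold_seed_1 = first_test_fold(StratifiedKFold(n_splits=5))
np.random.seed(2)
fold_seed_2 = first_test_fold(StratifiedKFold(n_splits=5))
print("Same folds without shuffle:", fold_seed_1 == fold_seed_2)

# With shuffle=True the folds depend on random_state.
shuffled_a = first_test_fold(KFold(n_splits=5, shuffle=True, random_state=1))
shuffled_a_again = first_test_fold(KFold(n_splits=5, shuffle=True, random_state=1))
shuffled_b = first_test_fold(KFold(n_splits=5, shuffle=True, random_state=2))
print("Same random_state, same fold:", shuffled_a == shuffled_a_again)
print("Different random_state, same fold:", shuffled_a == shuffled_b)
```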

## 2nd Class: Kfold with randomization

> Date: July 01, 2020

Now we can see that the data is being put to better use: every sample takes part in both training and testing, and the scores of all folds are averaged, which gives a more accurate and reliable outcome.

However, if the data is sorted by some characteristic (such as price, size, or year), the result may be better if we shuffle the data first, so that the ordering does not influence the outcome.

### Splitter Classes
- **[KFold:](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html)** Provides train/test indices to split data in train/test sets. Split dataset into k consecutive folds (without shuffling by default). Each fold is then used once as a validation while the k - 1 remaining folds form the training set.
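
To see what "consecutive folds" means in practice, the sketch below splits six toy samples (an assumption) into three folds, first without and then with shuffling, and prints the test indices of each fold.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(6).reshape(-1, 1)  # six toy samples (an assumption)

# Without shuffling, each test fold is a consecutive block of rows.
consecutive = [test.tolist() for _, test in KFold(n_splits=3).split(X)]
print("No shuffle:", consecutive)

# With shuffling, rows are mixed before being cut into folds;
# random_state makes the mixing reproducible.
shuffled = [test.tolist()
            for _, test in KFold(n_splits=3, shuffle=True, random_state=20).split(X)]
print("Shuffled:  ", shuffled)
```

Without shuffling the test folds are `[0, 1]`, `[2, 3]` and `[4, 5]`, which is why sorted data can put very different rows in each fold.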

In [10]:
# Create a cv with KFold (sklearn.model_selection)
# Create a function to print results
    # Print result, average and interval with +/- standard deviation
# Set shuffle = True

In [11]:
def print_results(result):
    print('Raw result: ', result)
    print('\n----------------------------------------------------------------------------\nTest Score mean: %.2f%% ' % (result['test_score'].mean() * 100))
    accuracy_mean = result['test_score'].mean()
    accuracy_std = result['test_score'].std()
    print('Test accuracy gap: [%.2f%% - %.2f%%]' % (100 * (accuracy_mean - accuracy_std), 100 * (accuracy_mean + accuracy_std)))

In [12]:
from sklearn.model_selection import KFold
SEED = 20
cross_val = KFold(n_splits = 10, shuffle = True, random_state = SEED)
result = cross_validate(tree_model, x, y, cv = cross_val)
print_results(result)

Raw result:  {'fit_time': array([0.01001835, 0.01186156, 0.00824761, 0.00726461, 0.00702739,
       0.00723958, 0.00712395, 0.0071559 , 0.00703311, 0.00700378]), 'score_time': array([0.00150442, 0.00150633, 0.0014236 , 0.00155401, 0.00144553,
       0.00132871, 0.00112653, 0.00128675, 0.00110865, 0.00129819]), 'test_score': array([0.738, 0.753, 0.766, 0.752, 0.75 , 0.783, 0.77 , 0.757, 0.748,
       0.761])}

----------------------------------------------------------------------------
Test Score mean: 75.78% 
Test accuracy gap: [74.57% - 76.99%]


## 3rd Class: Stratification

> Date: July 01, 2020

Continuing with the situation described above, a problem arises when, for example, all the rows with one characteristic (such as a given state) end up in the training split and all the rows with another characteristic end up in the test split.

Therefore, we'll run a simulation of this situation to verify what happens with our classifier in this case. 

**[Stratified KFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html#sklearn.model_selection.StratifiedKFold):** A splitter class that stratifies the folds, so that each train/test split preserves the proportion of each class.
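
The effect of stratification is easiest to see on labels that are sorted by class. The following sketch uses 100 toy labels (40 zeros followed by 60 ones, an assumption) and compares the share of class 1 in each test fold for plain `KFold` and for `StratifiedKFold`, both without shuffling. The helper `positive_rate_per_fold` is hypothetical.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy labels sorted by class, 40% zeros then 60% ones (an assumption).
y = np.array([0] * 40 + [1] * 60)
X = np.zeros((100, 1))  # features are irrelevant to how the folds are built

def positive_rate_per_fold(splitter):
    # Fraction of class 1 in each test fold.
    return [float(y[test].mean()) for _, test in splitter.split(X, y)]

kfold_rates = positive_rate_per_fold(KFold(n_splits=5))
stratified_rates = positive_rate_per_fold(StratifiedKFold(n_splits=5))

print("KFold, no shuffle:          ", kfold_rates)
print("StratifiedKFold, no shuffle:", stratified_rates)
```

With plain `KFold`, every test fold contains a single class, whereas `StratifiedKFold` keeps the 60% share of class 1 in every fold even though the data is sorted.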

In [13]:
sorted_data = data.sort_values('sold')
sorted_data.head()

|   | model_year | price | sold | model_age | km_per_year |
|---|---|---|---|---|---|
| 4999 | 2006 | 74023.29 | 0 | 14 | 24812.80412 |
| 5322 | 2005 | 84843.49 | 0 | 15 | 23095.63834 |
| 5319 | 1999 | 83100.27 | 0 | 21 | 36240.72746 |
| 5316 | 2002 | 87932.13 | 0 | 18 | 32249.56426 |
| 5315 | 2003 | 77937.01 | 0 | 17 | 28414.50704 |


In [14]:
x = sorted_data[['model_age', 'price', 'km_per_year']]
y = sorted_data['sold']

### 1. With KFold, without shuffle

In [15]:
cross_val = KFold(n_splits = 10, shuffle = False) # random_state only applies when shuffle = True
result = cross_validate(tree_model, x, y, cv = cross_val)
print_results(result)

Raw result:  {'fit_time': array([0.01005673, 0.00878739, 0.00731444, 0.00691891, 0.00720501,
       0.00692582, 0.00700688, 0.00680923, 0.00699711, 0.00690484]), 'score_time': array([0.00276041, 0.00131202, 0.00141144, 0.00117922, 0.00127935,
       0.00109339, 0.0012567 , 0.00109577, 0.00130105, 0.00122452]), 'test_score': array([0.447, 0.409, 0.438, 0.446, 0.694, 0.663, 0.668, 0.673, 0.67 ,
       0.676])}

----------------------------------------------------------------------------
Test Score mean: 57.84% 
Test accuracy gap: [46.07% - 69.61%]


### 2. With KFold and shuffle

In [16]:
cross_val = KFold(n_splits = 10, shuffle = True, random_state = SEED)
result = cross_validate(tree_model, x, y, cv = cross_val)
print_results(result)

Raw result:  {'fit_time': array([0.00979042, 0.01025629, 0.00809169, 0.00696421, 0.00819921,
       0.00754452, 0.00772619, 0.00811768, 0.01080966, 0.00805402]), 'score_time': array([0.00239611, 0.00114155, 0.00192857, 0.001266  , 0.00141144,
       0.00145435, 0.00133014, 0.0013268 , 0.00165129, 0.00116134]), 'test_score': array([0.749, 0.785, 0.749, 0.761, 0.757, 0.763, 0.758, 0.743, 0.761,
       0.752])}

----------------------------------------------------------------------------
Test Score mean: 75.78% 
Test accuracy gap: [74.69% - 76.87%]


### 3. With Stratified KFold and without shuffle

In [17]:
from sklearn.model_selection import StratifiedKFold

cross_val = StratifiedKFold(n_splits = 10, shuffle = False) # random_state only applies when shuffle = True
result = cross_validate(tree_model, x, y, cv = cross_val)
print_results(result)

Raw result:  {'fit_time': array([0.01107407, 0.00850868, 0.00714111, 0.00763988, 0.00829053,
       0.00698519, 0.00789618, 0.00717402, 0.00677824, 0.00692368]), 'score_time': array([0.00157619, 0.00113893, 0.00133538, 0.00128317, 0.00145578,
       0.00134993, 0.00122547, 0.0012238 , 0.00106978, 0.00118876]), 'test_score': array([0.744, 0.759, 0.763, 0.765, 0.754, 0.742, 0.771, 0.748, 0.764,
       0.768])}

----------------------------------------------------------------------------
Test Score mean: 75.78% 
Test accuracy gap: [74.81% - 76.75%]


### 4. With Stratified KFold and shuffle

In [18]:
cross_val = StratifiedKFold(n_splits = 10, shuffle = True, random_state = SEED)
result = cross_validate(tree_model, x, y, cv = cross_val)
print_results(result)

Raw result:  {'fit_time': array([0.00820899, 0.01231766, 0.00777626, 0.00687766, 0.00732684,
       0.00685787, 0.00695109, 0.00727725, 0.00714374, 0.00684524]), 'score_time': array([0.00242758, 0.00247693, 0.00182939, 0.00114346, 0.00124669,
       0.00108266, 0.00199628, 0.00128388, 0.00114989, 0.0010941 ]), 'test_score': array([0.759, 0.731, 0.761, 0.76 , 0.757, 0.764, 0.767, 0.75 , 0.753,
       0.776])}

----------------------------------------------------------------------------
Test Score mean: 75.78% 
Test accuracy gap: [74.65% - 76.91%]


### With these tests, we conclude that:

- Shuffling is very important when the data may be sorted and has not been mixed in any other way before splitting.

- KFold with shuffling does a good job in this case; choosing it over Stratified KFold does not appear to make much difference in the results.

- Stratification is very good for keeping the classes balanced across the folds, especially when there is far more data for one class than for the other.

In [21]:
sold_1 = len(data.query('sold == 1'))
sold_0 = len(data.query('sold == 0'))

print('Quantity of sold cars: %d' % sold_1)
print('Quantity of not sold cars: %d' % sold_0)

Quantity of sold cars: 5800
Quantity of not sold cars: 4200
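
The same counts can be turned into class proportions, which shows how unbalanced the target is. This is a minimal sketch that rebuilds the `sold` column from the two counts printed above instead of downloading the data; `majority_baseline` is a name introduced only for this example.

```python
import pandas as pd

# Rebuild the label column from the counts printed above.
sold = pd.Series([1] * 5800 + [0] * 4200, name='sold')

# normalize=True returns fractions instead of absolute counts.
proportions = sold.value_counts(normalize=True)
print(proportions)

# Always predicting the majority class would already reach this accuracy.
majority_baseline = proportions.max()
print("Majority-class accuracy: %.2f%%" % (majority_baseline * 100))
```

With 58% of the cars sold, stratification keeps roughly that same share in every fold.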
