## 1st Class: Cross validation and initial randomness

> Date: June 30, 2020

- **Holdout:** Technique that splits the data once into a training set and a test set.
- **Cross validation:** Technique that makes fuller use of the data by rotating the split, so that every sample is used for both training and testing across the folds.
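
To make the difference between the two techniques concrete, here is a minimal sketch in plain Python that only produces index lists. The helper names `holdout_split` and `k_fold_splits` are hypothetical (they are not scikit-learn functions), and no shuffling or stratification is applied.

```python
def holdout_split(n_samples, test_fraction=0.25):
    """Single split: the last `test_fraction` of the indices become the test set."""
    n_test = int(n_samples * test_fraction)
    indices = list(range(n_samples))
    return indices[:n_samples - n_test], indices[n_samples - n_test:]


def k_fold_splits(n_samples, k=5):
    """Yield (train, test) index lists; every sample is tested exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


train_idx, test_idx = holdout_split(10)
print("holdout test indices:", test_idx)

tested = []
for fold, (train, test) in enumerate(k_fold_splits(10, k=5)):
    print("fold", fold, "test indices:", test)
    tested.extend(test)
```

With holdout, only the samples in `test_idx` are ever used for evaluation; with k-fold, `tested` ends up containing every index once.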

### Fourth project of the first class ([1_introduction_to_the_classification_with_SKLearn](https://github.com/BrunaMS/machine_learning_course/blob/master/1_introduction_to_the_classification_with_SKLearn/5.%20The%20fourth%20project.ipynb))

In this project, I built a model that predicts whether a person is likely to sell their car, given its price, its model age, and the kilometers driven per year.

In [3]:
import pandas as pd
from datetime import datetime
uri = "https://gist.githubusercontent.com/guilhermesilveira/4d1d4a16ccbf6ea4e0a64a38a24ec884/raw/afd05cb0c796d18f3f5a6537053ded308ba94bf7/car-prices.csv"
data = pd.read_csv(uri)

sold_change = {
    'yes' : 1,
    'no'  : 0
}

current_year = datetime.today().year
data['model_age'] = current_year - data.model_year 
data['km_per_year'] = data.mileage_per_year * 1.60934 
data.sold = data.sold.map(sold_change)
data = data.drop(columns = ['Unnamed: 0', 'mileage_per_year', 'model_year'], axis=1) # Axis: 1 - Column, 0 - Row
data.head()

| | price | sold | model_age | km_per_year |
|---|---|---|---|---|
| 0 | 30941.02 | 1 | 20 | 35085.22134 |
| 1 | 40557.96 | 1 | 22 | 12622.05362 |
| 2 | 89627.5 | 0 | 14 | 11440.79806 |
| 3 | 95276.14 | 0 | 5 | 43167.32682 |
| 4 | 117384.68 | 1 | 6 | 12770.1129 |


In [4]:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import  StandardScaler
from sklearn.dummy import DummyClassifier
import numpy as np

x = data[['model_age', 'price', 'km_per_year']]
y = data['sold']

SEED = 15
np.random.seed(SEED)
raw_train_x, raw_test_x, train_y, test_y = train_test_split(x, y, random_state = SEED, test_size = 0.25, stratify = y)

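# Baseline: dummy classifier that guesses according to the class distribution.
# The strategy is set explicitly because the default changed from 'stratified' to 'prior' in scikit-learn 0.24.
dummy = DummyClassifier(strategy = 'stratified')
dummy.fit(raw_train_x, train_y)
predictions = dummy.predict(raw_test_x)
accuracy = accuracy_score(test_y, predictions) # Compare the true labels with the predictions
print("The accuracy of Dummy algorithm is %.2f%% " % (accuracy*100))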

The accuracy of Dummy algorithm is 51.68% 
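
As a rough check on why a baseline like this lands near 50%, the sketch below computes the expected accuracy of two simple baselines from the share of positive samples alone. It assumes the guesses are independent of the features, and the class shares used are hypothetical values, not measured from the car dataset.

```python
def stratified_baseline_accuracy(positive_fraction):
    """Expected accuracy when guessing 1 with probability p and 0 with probability 1 - p."""
    p = positive_fraction
    # Correct when guess and truth are both 1 (p * p) or both 0 ((1 - p) * (1 - p))
    return p * p + (1 - p) * (1 - p)


def majority_baseline_accuracy(positive_fraction):
    """Expected accuracy when always predicting the most frequent class."""
    return max(positive_fraction, 1 - positive_fraction)


# Hypothetical class shares, for illustration only
for p in (0.5, 0.6, 0.9):
    print("p = %.1f  stratified: %.2f  majority: %.2f" % (
        p, stratified_baseline_accuracy(p), majority_baseline_accuracy(p)))
```

Any real model should beat both numbers before its accuracy is considered meaningful.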


In [5]:
scaler = StandardScaler()
scaler.fit(raw_train_x)
train_x = scaler.transform(raw_train_x)
test_x = scaler.transform(raw_test_x)

# Using SVC
model = SVC()
model.fit(train_x, train_y)
predictions = model.predict(test_x)

accuracy = accuracy_score(test_y, predictions)
print("The accuracy of this algorithm is %.2f%% " % (accuracy*100))

The accuracy of this algorithm is 77.20% 
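
The scaling step above fits the scaler on the training data only and then applies the same transformation to the test data. The sketch below reproduces that idea in plain Python for a single column; `fit_scaler` and `transform` are hypothetical helpers and the prices are made-up values.

```python
from statistics import mean, pstdev


def fit_scaler(column):
    """Learn the mean and (population) standard deviation from the training column."""
    return mean(column), pstdev(column)


def transform(column, mu, sigma):
    """Standardize values with statistics learned elsewhere."""
    return [(value - mu) / sigma for value in column]


# Hypothetical prices, for illustration only
train_prices = [30000.0, 40000.0, 50000.0]
test_prices = [45000.0]

mu, sigma = fit_scaler(train_prices)                # statistics come from the training set only
scaled_train = transform(train_prices, mu, sigma)
scaled_test = transform(test_prices, mu, sigma)     # test data reuses the training statistics

print("scaled train:", scaled_train)
print("scaled test:", scaled_test)
```

Reusing the training statistics keeps information about the test set from leaking into the model.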


In [6]:
## Decision trees
## Classifiers that can show us the reasons behind their decisions.

from sklearn.tree import DecisionTreeClassifier

raw_train_x, raw_test_x, train_y, test_y = train_test_split(x, y, random_state = SEED, test_size = 0.25, stratify = y)

# Using a decision tree
# The max_depth argument sets the maximum depth of the tree
tree_model = DecisionTreeClassifier(max_depth = 3)
tree_model.fit(raw_train_x, train_y)
predictions = tree_model.predict(raw_test_x)

accuracy = accuracy_score(test_y, predictions)
print("The accuracy of this algorithm is %.2f%% " % (accuracy*100))

The accuracy of this algorithm is 79.68% 
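
One way to see the reasons behind a tree's decisions is to print its learned rules with `sklearn.tree.export_text`. The sketch below uses a tiny hypothetical dataset (not the car-prices data) in which a car is sold only when its price is below 50000, so the printed rule should split on `price`.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: each row is [model_age, price]
toy_x = [[5, 20000], [12, 30000], [3, 40000], [15, 45000],
         [4, 60000], [14, 70000], [6, 80000], [11, 90000]]
# Sold (1) only when the price is below 50000
toy_y = [1, 1, 1, 1, 0, 0, 0, 0]

toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
toy_tree.fit(toy_x, toy_y)

# Text rendering of the decision rules learned by the tree
rules = export_text(toy_tree, feature_names=["model_age", "price"])
print(rules)
```

The same call should work on `tree_model` from the cell above by passing the real column names as `feature_names`.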


### Continuation of the project with cross validation

> Some papers suggest that using between 5 and 10 folds is the best way to apply cross validation.

In [7]:
# Import cross_validate
# Define the model
# Use cross validate with cv = 3
# Print results
# Remove the train score from the results
# Print test_score
# Remember to always set the same SEED
# Print the mean +/- the standard deviation (std)

In [10]:
import pandas as pd
from datetime import datetime


uri = "https://gist.githubusercontent.com/guilhermesilveira/4d1d4a16ccbf6ea4e0a64a38a24ec884/raw/afd05cb0c796d18f3f5a6537053ded308ba94bf7/car-prices.csv"
data = pd.read_csv(uri)

sold_change = {
    'yes' : 1,
    'no'  : 0
}

current_year = datetime.today().year
data['model_age'] = current_year - data.model_year 
data['km_per_year'] = data.mileage_per_year * 1.60934 
data.sold = data.sold.map(sold_change)
data = data.drop(columns = ['Unnamed: 0', 'mileage_per_year'], axis=1) # Axis: 1 - Column, 0 - Row

In [43]:
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

SEED = 52
np.random.seed(SEED)

x = data[['model_age', 'price', 'km_per_year']]
y = data['sold']


tree_model = DecisionTreeClassifier(max_depth = 2)
result = cross_validate(tree_model, x, y, cv = 5)
result

{'fit_time': array([0.00909925, 0.00901389, 0.0105176 , 0.00750256, 0.00760198]),
 'score_time': array([0.00247097, 0.00224519, 0.00381517, 0.00132179, 0.0013504 ]),
 'test_score': array([0.756 , 0.7565, 0.7625, 0.7545, 0.7595])}

In [44]:
print('Test Score: ', result['test_score'])
accuracy_mean = result['test_score'].mean()
accuracy_std = result['test_score'].std()
print('Test accuracy interval: between %.2f%% and %.2f%%' % (100 * (accuracy_mean - accuracy_std), 100 * (accuracy_mean + accuracy_std)))

Test Score:  [0.756  0.7565 0.7625 0.7545 0.7595]
Test accuracy interval: between 75.49% and 76.07%
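
The range printed above is the mean plus or minus one standard deviation. The sketch below recomputes it from the five reported fold scores using only the standard library, and also shows a wider range of two standard deviations, which is a common convention under the assumption that the fold scores are roughly normally distributed.

```python
from statistics import mean, pstdev

# Fold scores copied from the cross_validate output above
test_scores = [0.756, 0.7565, 0.7625, 0.7545, 0.7595]

score_mean = mean(test_scores)
score_std = pstdev(test_scores)  # population std, the same default as numpy's .std()

one_std = (100 * (score_mean - score_std), 100 * (score_mean + score_std))
two_std = (100 * (score_mean - 2 * score_std), 100 * (score_mean + 2 * score_std))

print("mean accuracy: %.2f%%" % (100 * score_mean))
print("mean +/- 1 std: %.2f%% to %.2f%%" % one_std)
print("mean +/- 2 std: %.2f%% to %.2f%%" % two_std)
```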


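> At this point, the seed does not influence the test scores; only the timings change when the code is run again. With an integer `cv`, `cross_validate` splits the data into consecutive stratified folds without shuffling, so the folds are identical on every run.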
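
For the seed to affect the folds, the splitter has to shuffle. A minimal sketch, assuming synthetic data in place of the car dataset: `KFold(shuffle=True, random_state=...)` is passed as `cv`, so the same seed reproduces the same folds and a different seed usually produces different ones. The names `toy_x`, `toy_y` and `shuffled_cv_scores` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

SEED = 52

# Synthetic stand-in data: 200 samples, 3 features, label driven mostly by the second feature
rng = np.random.RandomState(SEED)
toy_x = rng.rand(200, 3)
toy_y = (toy_x[:, 1] + 0.3 * rng.rand(200) > 0.6).astype(int)


def shuffled_cv_scores(seed):
    """Run 5-fold cross validation with folds shuffled according to `seed`."""
    cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    model = DecisionTreeClassifier(max_depth=2, random_state=seed)
    return cross_validate(model, toy_x, toy_y, cv=cv)["test_score"]


scores_a = shuffled_cv_scores(SEED)
scores_b = shuffled_cv_scores(SEED)      # same seed: same folds, same scores
scores_c = shuffled_cv_scores(SEED + 1)  # different seed: folds are usually different

print("seed %d:      " % SEED, scores_a)
print("seed %d again:" % SEED, scores_b)
print("seed %d:      " % (SEED + 1), scores_c)
```

Passing `random_state` to the splitter is more explicit than relying on `np.random.seed`, because the seed travels with the object that actually does the shuffling.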