# Extra Practice for Machine Learning
For this practice you will be working with the `cars` dataset from `vega_datasets`. You may have seen it before; if not, it is described below.

In [82]:
import pandas as pd

cars = pd.read_csv('cars.csv')

The `cars` dataset lists many different car models along with several statistics about each one, including horsepower, acceleration, the year of release, and the country of origin. Here's what it looks like:

In [83]:
cars.head()

Unnamed: 0,Name,Miles_per_Gallon,Cylinders,Displacement,Horsepower,Weight_in_lbs,Acceleration,Year,Origin
0,chevrolet chevelle malibu,18.0,8,307.0,130.0,3504,12.0,1970-01-01,USA
1,buick skylark 320,15.0,8,350.0,165.0,3693,11.5,1970-01-01,USA
2,plymouth satellite,18.0,8,318.0,150.0,3436,11.0,1970-01-01,USA
3,amc rebel sst,16.0,8,304.0,150.0,3433,12.0,1970-01-01,USA
4,ford torino,17.0,8,302.0,140.0,3449,10.5,1970-01-01,USA


## ML with Quantitative Data

Create and train a model that, given the `cars` dataset, will predict the Horsepower of a car. Think about the type of data you are trying to predict - what model (of the ones we have already seen) should you use to predict quantitative data? Make sure to split training and testing data, and check the mean squared error of your model.

In [84]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Drop rows where the target (Horsepower) is missing.
cars = cars.dropna(subset=["Horsepower"])

# Features (x) and the quantitative target (y).
x = cars[["Miles_per_Gallon", "Cylinders", "Displacement", "Weight_in_lbs", "Acceleration"]]
y = cars["Horsepower"]

# Hold out half of the rows for testing.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.5, random_state=42
)

# A regressor, since Horsepower is a number rather than a category.
model = DecisionTreeRegressor()
model.fit(x_train, y_train)

# Evaluate on rows the model did not see during training.
y_pred = model.predict(x_test)
mse = mean_squared_error(y_test, y_pred)
print("Mean squared error is ", mse)

Mean squared error is  187.175
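Mean squared error is measured in squared units (here, horsepower squared), which can make a value like the one above hard to read. The sketch below computes MSE by hand and takes its square root (RMSE) to get back to horsepower. The function name `mean_squared_error_by_hand` and the sample values are made up for illustration and are not taken from `cars.csv`.

```python
import math


def mean_squared_error_by_hand(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)


# Made-up horsepower values, for illustration only.
actual = [130.0, 165.0, 150.0, 140.0]
predicted = [120.0, 170.0, 150.0, 155.0]

mse = mean_squared_error_by_hand(actual, predicted)
rmse = math.sqrt(mse)  # back in horsepower units

print("MSE:", mse)  # (100 + 25 + 0 + 225) / 4 = 87.5
print("RMSE:", rmse)
```

Read the same way, an MSE of 187.175 corresponds to a typical error of roughly 13.7 horsepower.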


Bonus: How does accuracy or mean squared error change with the split between training and testing data? Do at least three different splits with this data to see. 


In [85]:
# Same features and target as in the previous cell.
cars = cars.dropna(subset=["Horsepower"])
x = cars[["Miles_per_Gallon", "Cylinders", "Displacement", "Weight_in_lbs", "Acceleration"]]
y = cars["Horsepower"]

# Each value is the fraction of rows held out for testing.
splits = [0.8, 0.5, 0.3]
result = []
for split in splits:
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=split, random_state=42
    )
    # Train a fresh tree for every split so the results are independent.
    model = DecisionTreeRegressor()
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    mse = mean_squared_error(y_test, y_pred)
    result.append((split, mse))

for split, mse in result:
    print("Split: ",split, " MSE: ",mse)

Split:  0.8  MSE:  279.778125
Split:  0.5  MSE:  187.295
Split:  0.3  MSE:  275.6333333333333
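The three numbers above come from a single random split each, so some of the difference between them may be noise rather than an effect of the split size. One way to check is to repeat each split with several seeds and look at the mean and spread. This is a sketch under assumptions: the helper `mse_over_seeds` is a hypothetical name, and synthetic data from `make_regression` stands in for the cars data so the block runs on its own.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def mse_over_seeds(x, y, test_size, seeds):
    """Return the test MSE for each seed, using a fresh split and tree each time."""
    scores = []
    for seed in seeds:
        x_train, x_test, y_train, y_test = train_test_split(
            x, y, test_size=test_size, random_state=seed
        )
        model = DecisionTreeRegressor(random_state=seed)
        model.fit(x_train, y_train)
        scores.append(mean_squared_error(y_test, model.predict(x_test)))
    return np.array(scores)


# Synthetic stand-in data (an assumption, not the cars dataset).
x_demo, y_demo = make_regression(
    n_samples=400, n_features=5, noise=10.0, random_state=0
)

for test_size in [0.8, 0.5, 0.3]:
    scores = mse_over_seeds(x_demo, y_demo, test_size, seeds=range(10))
    print(f"test_size={test_size}: mean MSE={scores.mean():.1f}, std={scores.std():.1f}")
```

To try it on the notebook's data, pass the `x` and `y` defined in the cell above in place of `x_demo` and `y_demo`.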


More Bonus: Testing hyperparameters. Which maximum depth gives the lowest mean squared error? If we want to answer this without making decisions based on our testing dataset, we need to split our data into three sets: train, development, and test. Then we can compare how each change affects the development dataset rather than the testing dataset, and keep the testing dataset for a final evaluation of our final model.
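The three-way split can be sketched with two calls to `train_test_split`: the first sets aside the test set, the second divides the remainder into train and development sets. The 60/20/20 proportions, the list of candidate depths, and the synthetic `make_regression` data are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption, not the cars dataset).
x_all, y_all = make_regression(
    n_samples=500, n_features=5, noise=10.0, random_state=0
)

# Step 1: set aside 20% of all rows as the final test set.
x_rest, x_test, y_rest, y_test = train_test_split(
    x_all, y_all, test_size=0.2, random_state=42
)
# Step 2: split the remaining 80% into train (60% of all rows)
# and development (20% of all rows); 0.25 * 0.8 = 0.2.
x_train, x_dev, y_train, y_dev = train_test_split(
    x_rest, y_rest, test_size=0.25, random_state=42
)

# Compare candidate depths on the development set only.
dev_mse_by_depth = {}
for depth in [2, 4, 6, 8, 10]:
    model = DecisionTreeRegressor(max_depth=depth, random_state=42)
    model.fit(x_train, y_train)
    dev_mse_by_depth[depth] = mean_squared_error(y_dev, model.predict(x_dev))

best_depth = min(dev_mse_by_depth, key=dev_mse_by_depth.get)

# Touch the test set once, with the chosen depth.
final_model = DecisionTreeRegressor(max_depth=best_depth, random_state=42)
final_model.fit(x_train, y_train)
test_mse = mean_squared_error(y_test, final_model.predict(x_test))

print("Development MSE by depth:", dev_mse_by_depth)
print("Best depth:", best_depth, "Test MSE:", test_mse)
```

Because the test set plays no part in choosing `best_depth`, the final test MSE is a fairer estimate of how the chosen model behaves on unseen data.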

## ML with Categorical Data

Create, train, and test a model that will predict the country of origin for the `cars` dataset. Remember, this is categorical data, so you will need to use a different type of model (of the ones we have already seen) than you did for the `Horsepower` model.

In [86]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
cars = cars.dropna(subset = ["Miles_per_Gallon" ,"Cylinders","Displacement","Horsepower","Weight_in_lbs","Acceleration","Origin"])
x = cars[["Miles_per_Gallon" ,"Cylinders","Displacement","Horsepower","Weight_in_lbs","Acceleration"]]
y = cars["Origin"]
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=42, test_size = 0.2)
model = DecisionTreeClassifier(random_state = 42)
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
acc = accuracy_score(y_test, y_pred)
print("Acc score: ", acc)

Acc score:  0.7721518987341772


Bonus: How does accuracy or mean squared error change with the split between training and testing data? Do at least three different splits with this data to see. 

In [87]:
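from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Note: this cell fits the Horsepower regressor, not the Origin classifier.
# Horsepower also appears among the features in x, so the tree can read the
# target directly from its input (target leakage). The MSE values below are
# therefore not comparable to the earlier Horsepower results.
cars = cars.dropna(subset=["Horsepower"])
x = cars[["Miles_per_Gallon", "Cylinders", "Displacement", "Horsepower", "Weight_in_lbs", "Acceleration"]]
y = cars["Horsepower"]

# Each value is the fraction of rows held out for testing.
splits = [0.9, 0.5, 0.2]
results = []
for split in splits:
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=split, random_state=42)
    model = DecisionTreeRegressor()
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    mse = mean_squared_error(y_test, y_pred)
    results.append((split, mse))
results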

[(0.9, 203.28045325779036),
 (0.5, 12.918367346938776),
 (0.2, 0.9113924050632911)]
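For the categorical task, the same comparison would use a classifier and accuracy instead of a regressor and MSE. The following is a minimal sketch, assuming a hypothetical helper `accuracy_by_split` and synthetic three-class data from `make_classification` in place of the cars data, so that the block runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def accuracy_by_split(x, y, test_sizes, seed=42):
    """Return (test_size, accuracy) pairs, training a fresh classifier per split."""
    results = []
    for test_size in test_sizes:
        x_train, x_test, y_train, y_test = train_test_split(
            x, y, test_size=test_size, random_state=seed
        )
        model = DecisionTreeClassifier(random_state=seed)
        model.fit(x_train, y_train)
        acc = accuracy_score(y_test, model.predict(x_test))
        results.append((test_size, acc))
    return results


# Synthetic stand-in data with three classes (an assumption, not the cars dataset).
x_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=4, n_classes=3, random_state=0
)

for test_size, acc in accuracy_by_split(x_demo, y_demo, [0.9, 0.5, 0.2]):
    print("Split: ", test_size, " Accuracy: ", acc)
```

On the notebook's data, the call would take the six numeric feature columns as `x` and `cars["Origin"]` as `y`, after dropping rows with missing values as in the classifier cell above.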

More Bonus: Testing hyperparameters. Which maximum depth (or other hyperparameter) gives the greatest accuracy? If we want to answer this without making decisions based on our testing dataset, we need to split our data into three sets: train, development, and test. Then we can compare how each change affects the development dataset rather than the testing dataset, and keep the testing dataset for a final evaluation of our final model.