# Extra Practice for Machine Learning
For this practice, you will be working with the `cars` dataset from `vega_datasets`. You may have seen it before, but if not, it is described below.

In [21]:
import pandas as pd

cars = pd.read_csv('cars.csv').dropna()

The `cars` dataset lists a number of car models along with several statistics about each one, including horsepower, acceleration, the year of release, and the country of origin. Here's what it looks like:

In [22]:
cars.head()

Unnamed: 0,Name,Miles_per_Gallon,Cylinders,Displacement,Horsepower,Weight_in_lbs,Acceleration,Year,Origin
0,chevrolet chevelle malibu,18.0,8,307.0,130.0,3504,12.0,1970-01-01,USA
1,buick skylark 320,15.0,8,350.0,165.0,3693,11.5,1970-01-01,USA
2,plymouth satellite,18.0,8,318.0,150.0,3436,11.0,1970-01-01,USA
3,amc rebel sst,16.0,8,304.0,150.0,3433,12.0,1970-01-01,USA
4,ford torino,17.0,8,302.0,140.0,3449,10.5,1970-01-01,USA


## ML with Quantitative Data

Create and train a model that, given the `cars` dataset, will predict the Horsepower of a car. Think about the type of data you are trying to predict - what model (of the ones we have already seen) should you use to predict quantitative data? Make sure to split training and testing data, and check the mean squared error of your model.
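
As a reminder of what the metric measures, here is a minimal sketch that computes mean squared error by hand. The horsepower values are made up for illustration and do not come from the `cars` dataset.

```python
import numpy as np

# Hypothetical true and predicted horsepower values (illustration only).
y_true = np.array([130.0, 165.0, 150.0, 140.0])
y_pred = np.array([120.0, 170.0, 150.0, 150.0])

# Mean squared error: the average of the squared differences.
squared_errors = (y_true - y_pred) ** 2
mse = squared_errors.mean()

# The square root (RMSE) is in the same units as the target, here horsepower.
rmse = np.sqrt(mse)

print(mse)
print(rmse)
```

Because the errors are squared, one prediction that is off by 10 contributes as much as four predictions that are each off by 5.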

In [51]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Features and target
x = cars[["Weight_in_lbs", "Acceleration", "Displacement"]]
y = cars[["Horsepower"]]

# Four splits of the same rows: two test sizes, each with two random seeds
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)
x_train2, x_test2, y_train2, y_test2 = train_test_split(x, y, test_size=0.33, random_state=86)
x_train3, x_test3, y_train3, y_test3 = train_test_split(x, y, test_size=0.86, random_state=42)
x_train4, x_test4, y_train4, y_test4 = train_test_split(x, y, test_size=0.86, random_state=86)

# The model is fit once, on the first split's training rows only
reg = DecisionTreeRegressor(random_state=1).fit(x_train, y_train)
pred = reg.predict(x_test)

# mean_squared_error takes (y_true, y_pred)
mse1 = mean_squared_error(y_test, pred)
# Caution: x_test2, x_test3 and x_test4 overlap x_train, so these three
# scores include rows the tree has already seen during training
mse2 = mean_squared_error(y_test2, reg.predict(x_test2))
mse3 = mean_squared_error(y_test3, reg.predict(x_test3))
mse4 = mean_squared_error(y_test4, reg.predict(x_test4))
print(mse1)
print(mse2)
print(mse3)
print(mse4)

171.95384615384614
53.63846153846154
66.13609467455622
56.63609467455621


Bonus: How does accuracy or mean squared error change with the split between training and testing data? Do at least three different splits with this data to see.
Taken at face value, the numbers suggest that a larger test set and a different random seed both lower the error: the first split (test size 0.33, seed 42) gives a mean squared error of about 172, while the other three give between 54 and 66. That comparison is misleading, though. `reg` is fit only once, on `x_train` from the first split, and the other three test sets are drawn from the same rows, so they contain cars the tree has already seen during training. An unpruned decision tree can memorize its training rows, so those cars are predicted almost perfectly and pull the error down. I checked this by commenting out the original split and putting one of the others in its place, and the mean squared error was then very high. Only `mse1` is measured entirely on unseen data.

To compare splits fairly, a separate model has to be fit on each split's own training set. In general, a larger training share gives the model more examples to learn from, while a larger test share gives a more stable estimate of the error. `random_state` does not add more or less shuffling; it is a seed that decides which rows land in each set, so differences between seeds show how much the error estimate varies from one split to another.
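
The overlap between splits can be checked directly. The sketch below uses synthetic data as a stand-in for `cars.csv` (an assumption, as are the coefficients and noise level), counts how many rows of a second test set also appear in the first training set, and then compares the leaky score with one from a model fit on the second split's own training rows.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the cars data (assumption: not the real dataset).
rng = np.random.default_rng(0)
n = 300
weight = rng.uniform(1600, 5000, size=n)
x = pd.DataFrame({"Weight_in_lbs": weight})
y = pd.Series(0.04 * weight + rng.normal(0, 10, size=n), name="Horsepower")

# Two splits of the same rows with different seeds.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.33, random_state=42)
x_train2, x_test2, y_train2, y_test2 = train_test_split(
    x, y, test_size=0.33, random_state=86)

# Rows of the second test set that the first model saw during training.
leaked = x_test2.index.intersection(x_train.index)
print(f"{len(leaked)} of {len(x_test2)} rows in x_test2 are also in x_train")

model = DecisionTreeRegressor(random_state=1).fit(x_train, y_train)

# Leaky: the first model is scored on a test set that overlaps its training rows.
leaky_mse = mean_squared_error(y_test2, model.predict(x_test2))

# Clean: a fresh model is fit on the second split's own training rows.
model2 = DecisionTreeRegressor(random_state=1).fit(x_train2, y_train2)
clean_mse = mean_squared_error(y_test2, model2.predict(x_test2))

print(leaky_mse, clean_mse)
```

With a test share of 0.33, roughly two thirds of any other test set would be expected to come from the first training set, which is why the leaky score looks so much better.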

More Bonus: Testing hyperparameters. What maximum depth gives the best performance? If we want to answer this without making decisions based on our test dataset, we need to split our data into three sets: train, development, and test. Then we can compare how each change affects performance on the development dataset rather than the test dataset, and save the test dataset for a single final evaluation of our final model.

In [84]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

x = cars[["Weight_in_lbs", "Acceleration", "Displacement"]]
y = cars[["Horsepower"]]

# Hold out a test set, then split the remainder into training and development sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)
x_train, x_dev, y_train, y_dev = train_test_split(x_train, y_train, test_size=0.33, random_state=85)
# Set aside 1% of the development rows (x_devt and y_devt are not used below)
x_dev, x_devt, y_dev, y_devt = train_test_split(x_dev, y_dev, test_size=0.01, random_state=99999)

# Note: this fits on the development rows with the default (unlimited) max_depth.
# A depth search would fit on x_train for each candidate depth and score on x_dev.
reg = DecisionTreeRegressor(random_state=1).fit(x_dev, y_dev)
pred = reg.predict(x_test)
mse1 = mean_squared_error(y_test, pred)  # (y_true, y_pred)
print(mse1)

158.15384615384616
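
The cell above fits a single tree on the development rows and does not vary `max_depth`. A depth search along the lines the prompt describes might look like the sketch below. It runs on synthetic data rather than `cars.csv`, and the split sizes and depth range are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: not the real cars dataset).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(400, 3))
y = 20 * x[:, 0] + 5 * x[:, 1] ** 2 + rng.normal(0, 15, size=400)

# First hold out the test set, then carve a development set from the rest.
x_rest, x_test, y_rest, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42)
x_train, x_dev, y_train, y_dev = train_test_split(
    x_rest, y_rest, test_size=0.25, random_state=42)

# Fit on the training set and score on the development set for each depth.
dev_mse = {}
for depth in range(1, 11):
    model = DecisionTreeRegressor(max_depth=depth, random_state=1)
    model.fit(x_train, y_train)
    dev_mse[depth] = mean_squared_error(y_dev, model.predict(x_dev))

# Pick the depth with the lowest development error.
best_depth = min(dev_mse, key=dev_mse.get)

# Use the test set once, at the very end, with the chosen depth.
final = DecisionTreeRegressor(max_depth=best_depth, random_state=1)
final.fit(x_train, y_train)
test_mse = mean_squared_error(y_test, final.predict(x_test))

print(best_depth, test_mse)
```

The test set plays no part in choosing `best_depth`, so `test_mse` remains an honest estimate of how the chosen model does on unseen data.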


## ML with Categorical Data

Create, train, and test a model that will predict the country of origin for the `cars` dataset. Remember, this is categorical data, so you will need to use a different type of model (of the ones we have already seen) than you did for the `Horsepower` model.
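
For categorical targets, the usual counterpart to mean squared error is accuracy: the fraction of predictions that match the true label. A minimal sketch with made-up labels (not taken from the dataset):

```python
import numpy as np

# Hypothetical true and predicted origins (illustration only).
y_true = np.array(["USA", "Japan", "Europe", "USA", "Japan"])
y_pred = np.array(["USA", "USA", "Europe", "USA", "Japan"])

# Accuracy: the share of predictions that exactly match the true label.
matches = y_true == y_pred
accuracy = matches.mean()

print(accuracy)
```

A squared difference has no meaning here, since there is no numeric distance between "USA" and "Japan"; a prediction is simply right or wrong.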

In [88]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Features and categorical target
x = cars[["Cylinders", "Miles_per_Gallon", "Displacement"]]
y = cars[["Origin"]]

# Four splits of the same rows: two test sizes, each with two random seeds
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)
x_train2, x_test2, y_train2, y_test2 = train_test_split(x, y, test_size=0.33, random_state=86)
x_train3, x_test3, y_train3, y_test3 = train_test_split(x, y, test_size=0.86, random_state=42)
x_train4, x_test4, y_train4, y_test4 = train_test_split(x, y, test_size=0.86, random_state=86)

# The classifier is fit once, on the first split's training rows only
clf = DecisionTreeClassifier(random_state=1).fit(x_train, y_train)
pred = clf.predict(x_test)

# accuracy_score takes (y_true, y_pred)
ac1 = accuracy_score(y_test, pred)
# Caution: x_test2, x_test3 and x_test4 overlap x_train, so these three
# scores include rows the tree has already seen during training
ac2 = accuracy_score(y_test2, clf.predict(x_test2))
ac3 = accuracy_score(y_test3, clf.predict(x_test3))
ac4 = accuracy_score(y_test4, clf.predict(x_test4))
print(ac1)
print(ac2)
print(ac3)
print(ac4)

0.7538461538461538
0.9384615384615385
0.9023668639053254
0.9201183431952663


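Bonus: How does accuracy or mean squared error change with the split between training and testing data? Do at least three different splits with this data to see.
At face value, changing the random seed made the biggest difference here: at a test size of 0.33, accuracy rose from about 0.75 (seed 42) to about 0.94 (seed 86), while the two 0.86 splits scored about 0.90 and 0.92. As with the regression, though, the classifier is fit only on `x_train` from the first split, so the other three test sets include cars it was trained on, which inflates their accuracy. Only the first score, about 0.75, is measured on fully unseen data. A fair comparison would fit a new classifier on each split's own training set. In that setup, a very large test share such as 0.86 leaves the model few rows to learn from and would be expected to lower accuracy, while a different `random_state` only changes which rows end up in each set; it does not add randomness or remove bias.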
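
To see what `random_state` controls, the sketch below splits a plain array of row numbers three times. The array and the seeds are illustrative assumptions; the point is only that the seed selects which rows go to the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Twenty row numbers standing in for the rows of a dataset.
rows = np.arange(20)

# Same seed twice, then a different seed.
_, test_a = train_test_split(rows, test_size=0.25, random_state=42)
_, test_b = train_test_split(rows, test_size=0.25, random_state=42)
_, test_c = train_test_split(rows, test_size=0.25, random_state=86)

# The same seed reproduces the same split exactly.
same_seed_matches = np.array_equal(test_a, test_b)
print(same_seed_matches)

# A different seed picks a different set of rows, but the same number of them.
print(sorted(test_a.tolist()), sorted(test_c.tolist()))
```

A larger seed value is not "more random"; every seed produces one fixed shuffle, and the test share stays the same.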

More Bonus: Testing hyperparameters. What maximum depth (or other hyperparameter) gives the greatest accuracy? If we want to answer this without making decisions based on our test dataset, we need to split our data into three sets: train, development, and test. Then we can compare how each change affects accuracy on the development dataset rather than the test dataset, and save the test dataset for a single final evaluation of our final model.

In [92]:
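from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Note: this cell uses Weight_in_lbs where the previous classifier used Cylinders
x = cars[["Weight_in_lbs", "Miles_per_Gallon", "Displacement"]]
y = cars[["Origin"]]

# Hold out a test set, then split the remainder into training and development sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)
x_train, x_dev, y_train, y_dev = train_test_split(x_train, y_train, test_size=0.33, random_state=85)
# Set aside 1% of the development rows (x_devt and y_devt are not used below)
x_dev, x_devt, y_dev, y_devt = train_test_split(x_dev, y_dev, test_size=0.01, random_state=99999)

# Note: this fits on the development rows with default hyperparameters.
# A hyperparameter search would fit on x_train for each candidate and score on x_dev.
reg = DecisionTreeClassifier(random_state=1).fit(x_dev, y_dev)
pred = reg.predict(x_test)
ac1 = accuracy_score(y_test, pred)  # (y_true, y_pred)
print(ac1)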

0.8076923076923077
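
Since the prompt allows other hyperparameters, here is a sketch that tunes `min_samples_leaf` for a classifier on the development set. It uses three synthetic clusters from `make_blobs` in place of the three origins (an assumption), and passes `stratify` so each split keeps the same class proportions.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data (assumption: a stand-in for the cars origins).
x, y = make_blobs(n_samples=300, centers=3, n_features=3,
                  cluster_std=4.0, random_state=0)

# Hold out the test set, then carve a development set from the rest.
x_rest, x_test, y_rest, y_test = train_test_split(
    x, y, test_size=0.2, stratify=y, random_state=42)
x_train, x_dev, y_train, y_dev = train_test_split(
    x_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42)

# Fit on the training set and score on the development set for each candidate.
dev_accuracy = {}
for leaf in (1, 2, 5, 10, 20):
    clf = DecisionTreeClassifier(min_samples_leaf=leaf, random_state=1)
    clf.fit(x_train, y_train)
    dev_accuracy[leaf] = accuracy_score(y_dev, clf.predict(x_dev))

# Pick the candidate with the highest development accuracy.
best_leaf = max(dev_accuracy, key=dev_accuracy.get)

# Evaluate once on the test set with the chosen setting.
final = DecisionTreeClassifier(min_samples_leaf=best_leaf, random_state=1)
final.fit(x_train, y_train)
test_accuracy = accuracy_score(y_test, final.predict(x_test))

print(best_leaf, test_accuracy)
```

Larger values of `min_samples_leaf` force each leaf to cover more rows, which limits how closely the tree can memorize its training data.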
