# Extra Practice for Machine Learning
For this practice you will work with the `cars` dataset from `vega_datasets`. You may have seen it before; if not, it is described below.

In [13]:
import pandas as pd

cars = pd.read_csv('cars.csv').dropna()

The `cars` dataset lists many car models along with several statistics about each one, including horsepower, acceleration, release year, and country of origin. Here's what it looks like:

In [14]:
cars.head()

| Unnamed: 0 | Name | Miles_per_Gallon | Cylinders | Displacement | Horsepower | Weight_in_lbs | Acceleration | Year | Origin |
|---|---|---|---|---|---|---|---|---|---|
| 0 | chevrolet chevelle malibu | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 1970-01-01 | USA |
| 1 | buick skylark 320 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 1970-01-01 | USA |
| 2 | plymouth satellite | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 1970-01-01 | USA |
| 3 | amc rebel sst | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 1970-01-01 | USA |
| 4 | ford torino | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 1970-01-01 | USA |


## ML with Quantitative Data

Create and train a model that, given the `cars` dataset, will predict the Horsepower of a car. Think about the type of data you are trying to predict - what model (of the ones we have already seen) should you use to predict quantitative data? Make sure to split training and testing data, and check the mean squared error of your model.
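
To make the metric concrete, here is a minimal sketch of what mean squared error measures, written in plain Python. The helper name `mean_squared_error_by_hand` and the three horsepower values are made up for illustration; in the exercise itself you would call `sklearn.metrics.mean_squared_error`.

```python
def mean_squared_error_by_hand(y_true, y_pred):
    """Average of the squared differences between true and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical true horsepower values and model predictions (illustration only)
true_hp = [130.0, 165.0, 150.0]
predicted_hp = [120.0, 170.0, 150.0]

# Errors are 10, -5 and 0, so the squared errors are 100, 25 and 0
mse = mean_squared_error_by_hand(true_hp, predicted_hp)
print(mse)  # (100 + 25 + 0) / 3
```

Because the errors are squared, the result is in squared units (horsepower squared here), and large misses count much more than small ones.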

In [15]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Features and target
x = cars[["Weight_in_lbs","Acceleration","Displacement"]]
y = cars[["Horsepower"]]

# Four splits: two test sizes (0.33, 0.86) crossed with two shuffling seeds (42, 86)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)
x_train2, x_test2, y_train2, y_test2 = train_test_split(x, y, test_size=0.33, random_state=86)
x_train3, x_test3, y_train3, y_test3 = train_test_split(x, y, test_size=0.86, random_state=42)
x_train4, x_test4, y_train4, y_test4 = train_test_split(x, y, test_size=0.86, random_state=86)

# The model is fit once, on the first split's training rows only
reg = DecisionTreeRegressor(random_state = 1).fit(x_train,y_train)
pred = reg.predict(x_test)

# mean_squared_error expects (y_true, y_pred); MSE is symmetric, so the values are unchanged
mse1 = mean_squared_error(y_test, pred)
# Caution: test sets 2-4 come from other splits, so they overlap with the rows `reg` was trained on
mse2 = mean_squared_error(y_test2, reg.predict(x_test2))
mse3 = mean_squared_error(y_test3, reg.predict(x_test3))
mse4 = mean_squared_error(y_test4, reg.predict(x_test4))
print(mse1)
print(mse2)
print(mse3)
print(mse4)
# Enter the rest of your solution here!

171.95384615384614
53.63846153846154
66.13609467455622
56.63609467455621


Bonus: How does accuracy or mean squared error change with the split between training and testing data? Do at least three different splits with this data to see.
In these results the first split (test size 0.33, `random_state=42`) has by far the highest mean squared error (about 172), while the other three splits score between about 54 and 66. At first glance it looks as if a larger test set or a different shuffle makes the model more accurate, but I think that is mostly an artifact of how I ran the test: `reg` was trained only on the first split's training set, and the test sets from the other splits can contain rows from that training set, so the model has already seen the answers. A fully grown decision tree memorizes its training rows, which pulls the error on those test sets down. I checked this by commenting out the original split and replacing it with one of the other ones, and the mean squared error was much higher. So the first number is the only honest estimate here. To really compare splits, each one needs its own model trained on its own training set; in general I would expect a very large test size such as 0.86 to hurt, because the model then has only 14% of the data to learn from.
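
The overlap effect described above can be reproduced on purpose. The following is a minimal sketch on synthetic data (not the `cars` dataset; the data, sizes, and variable names are assumptions for illustration) that compares an honest test set against a "leaky" one that also contains training rows.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (NOT the cars dataset): one feature, noisy linear target.
# Every row has a distinct feature value, so a fully grown tree can isolate each row.
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 200).reshape(-1, 1)
y = 3.0 * X[:, 0] + rng.normal(0, 2.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)
model = DecisionTreeRegressor(random_state=1).fit(X_train, y_train)

# Honest estimate: rows the model has never seen
honest_mse = mean_squared_error(y_test, model.predict(X_test))

# Rows the model was trained on: the unpruned tree reproduces them exactly
train_mse = mean_squared_error(y_train, model.predict(X_train))

# Leaky "test" set: the unseen rows mixed with rows from the training set
X_leaky = np.vstack([X_test, X_train])
y_leaky = np.concatenate([y_test, y_train])
leaky_mse = mean_squared_error(y_leaky, model.predict(X_leaky))

print(f"honest test MSE: {honest_mse:.2f}")
print(f"training MSE:    {train_mse:.2f}")
print(f"leaky test MSE:  {leaky_mse:.2f}")
```

Because the training rows contribute zero error, the leaky score is just the honest score diluted by the number of memorized rows, which is why it looks better without the model being any better.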

More Bonus: Testing hyperparameters. What maximum depth has the greatest accuracy in our testing set? If we want to answer this without making decisions based on our test dataset, we need to split our data into three categories: train, test, and development. Then we can compare how each change affects the development dataset rather than the test dataset, and save the test dataset for a final evaluation of our final model.
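
As a rough sketch of what the three-way split looks like in numbers, the snippet below applies two successive 0.33 hold-outs, the same fractions used in the solution cell. The total of 392 rows is an assumption about the size of `cars` after `dropna()`, and `holdout_sizes` is a hypothetical helper that mirrors how `train_test_split` rounds a fractional `test_size` up.

```python
import math


def holdout_sizes(n_rows, test_size):
    """Return (n_train, n_test) for a fractional hold-out, rounding the test part up."""
    n_test = math.ceil(test_size * n_rows)
    return n_rows - n_test, n_test


n_rows = 392  # assumed number of rows in `cars` after dropna()

# First split: full data -> train + test
n_train, n_test = holdout_sizes(n_rows, 0.33)
# Second split: train -> final training set + development set
n_train_final, n_dev = holdout_sizes(n_train, 0.33)

print(f"train: {n_train_final}, dev: {n_dev}, test: {n_test}")
```

Under that assumption fewer than half of the rows are used to fit each candidate tree and the dev set has fewer than 100 rows, so small differences in dev scores between depths should be read with caution.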

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

x = cars[["Weight_in_lbs","Acceleration","Displacement"]]
y = cars[["Horsepower"]]
# Hyperparameter tuning plan:
# 1. Split the data into train / dev / test (the dev set is carved out of the training data).
# 2. Try max_depth values 1..20 and None, fitting each tree on the reduced training set.
# 3. Measure MSE on the dev set and keep the depth with the lowest dev MSE.
# 4. Retrain on the full training set (train + dev combined) with the chosen depth.
# 5. Evaluate on the test set and compare with the default tree (no max_depth).
#
# Why: tuning on the dev set keeps the test set untouched, so the final test MSE is an
# honest estimate of generalization. max_depth controls the bias/variance trade-off:
# a small depth gives a simpler model (higher bias, lower variance), while a large
# depth or None gives a more complex model (lower bias, higher variance).
# Retraining on train + dev gives the final model more data to learn from.

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)
# Hold out a development set from the training data for hyperparameter tuning
x_train_final, x_dev, y_train_final, y_dev = train_test_split(x_train, y_train, test_size=0.33, random_state=85)

# Grid-search over possible `max_depth` values using the development set (lower MSE is better)
depths = list(range(1, 21)) + [None]
best_depth = None
best_mse = float('inf')
results = []
for d in depths:
    reg = DecisionTreeRegressor(max_depth=d, random_state=1)
    # fit on the training portion (not the dev set)
    reg.fit(x_train_final, y_train_final.values.ravel())
    y_dev_pred = reg.predict(x_dev)
    mse = mean_squared_error(y_dev, y_dev_pred)
    results.append((d, mse))
    if mse < best_mse:
        best_mse = mse
        best_depth = d

# Show development MSEs for each depth
print('Dev set MSE (depth, mse):')
for d, mse in results:
    print(d, round(mse, 4))

print(f"\nBest depth on dev set: {best_depth} with MSE {best_mse:.4f}")

# Retrain on the full training data (train + dev = x_train) with the chosen hyperparameter
final_reg = DecisionTreeRegressor(max_depth=best_depth, random_state=1)
final_reg.fit(x_train, y_train.values.ravel())
y_test_pred = final_reg.predict(x_test)
test_mse = mean_squared_error(y_test, y_test_pred)
print(f"Test set MSE with best depth: {test_mse:.4f}")

# Compare to default (no max_depth) to see whether limiting depth helped
default_reg = DecisionTreeRegressor(random_state=1)
default_reg.fit(x_train, y_train.values.ravel())
default_mse = mean_squared_error(y_test, default_reg.predict(x_test))
print(f"Default (no max_depth) test MSE: {default_mse:.4f}")

Dev set MSE (depth, mse):
1 370.6203
2 315.7442
3 333.1648
4 340.2377
5 278.9287
6 261.0662
7 319.477
8 256.1361
9 236.3006
10 316.8642
11 241.314
12 254.8966
13 304.069
14 304.069
15 304.069
16 304.069
17 304.069
18 304.069
19 304.069
20 304.069
None 304.069

Best depth on dev set: 9 with MSE 236.3006
Test set MSE with best depth: 191.3629
Default (no max_depth) test MSE: 171.9538


## ML with Categorical Data

Create, train, and test a model that will predict the country of origin for the `cars` dataset. Remember, this is categorical data, so you will need to use a different type of model (of the ones we have already seen) than you did for the `Horsepower` model.
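
Categorical predictions are scored differently from quantitative ones: a label is either right or wrong, so there is no distance to square. A minimal sketch of accuracy in plain Python follows; `accuracy_by_hand` and the five labels are invented for illustration, and in the exercise you would use `sklearn.metrics.accuracy_score`.

```python
def accuracy_by_hand(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical true origins and model predictions (illustration only)
true_origin = ["USA", "Japan", "Europe", "USA", "Japan"]
predicted_origin = ["USA", "Japan", "USA", "USA", "Europe"]

# Positions 0, 1 and 3 match, so 3 of the 5 predictions are correct
acc = accuracy_by_hand(true_origin, predicted_origin)
print(acc)
```

Unlike mean squared error, higher is better here, and every wrong label costs the same no matter which wrong country was predicted.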

In [17]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Features and categorical target
x = cars[["Cylinders","Miles_per_Gallon","Displacement"]]
y = cars[["Origin"]]

# Four splits: two test sizes (0.33, 0.86) crossed with two shuffling seeds (42, 86)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)
x_train2, x_test2, y_train2, y_test2 = train_test_split(x, y, test_size=0.33, random_state=86)
x_train3, x_test3, y_train3, y_test3 = train_test_split(x, y, test_size=0.86, random_state=42)
x_train4, x_test4, y_train4, y_test4 = train_test_split(x, y, test_size=0.86, random_state=86)

# The classifier is fit once, on the first split's training rows only
clf = DecisionTreeClassifier(random_state = 1).fit(x_train,y_train)
pred = clf.predict(x_test)

# accuracy_score expects (y_true, y_pred); accuracy is symmetric, so the values are unchanged
ac1 = accuracy_score(y_test, pred)
# Caution: test sets 2-4 come from other splits, so they overlap with the rows `clf` was trained on
ac2 = accuracy_score(y_test2, clf.predict(x_test2))
ac3 = accuracy_score(y_test3, clf.predict(x_test3))
ac4 = accuracy_score(y_test4, clf.predict(x_test4))
print(ac1)
print(ac2)
print(ac3)
print(ac4)
# Enter the rest of your solution here!

0.7538461538461538
0.9384615384615385
0.9023668639053254
0.9201183431952663


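Bonus: How does accuracy or mean squared error change with the split between training and testing data? Do at least three different splits with this data to see.
The first split (test size 0.33, `random_state=42`) is the only one whose test set the model has never seen, and it has the lowest accuracy (about 0.75). The other three splits score between about 0.90 and 0.94, which makes it look as if changing the shuffle or the test size improves the predictions. The more likely explanation is overlap: the classifier was trained only on the first split's training set, so the other test sets contain rows it has already memorized. Within the two `random_state=86` splits, accuracy does drop slightly when the test size grows from 0.33 to 0.86 (0.938 to 0.920), but to measure the real effect of the split, each split needs its own classifier trained on its own training set. I would expect accuracy to fall with a very large test size, because the model then has too little data to learn from.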


More Bonus: Testing hyperparameters. What maximum depth (or other hyperparameter) has the greatest accuracy in our testing set? If we want to answer this without making decisions based on our test dataset, we need to split our data into three categories: train, test, and development. Then we can compare how each change affects the development dataset rather than the test dataset, and save the test dataset for a final evaluation of our final model.

In [25]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
x = cars[["Weight_in_lbs","Miles_per_Gallon","Displacement"]]
y = cars[["Origin"]]

# Hyperparameter tuning plan:
# 1. Split the training data into a reduced training set and a development (dev) set.
# 2. Train DecisionTreeClassifier models across a range of max_depth values.
# 3. Select the max_depth with the highest dev accuracy.
# 4. Retrain on the full training set (the original x_train, which contains train + dev).
# 5. Evaluate accuracy on the held-out test set and compare with the default tree.
#
# Why: choosing hyperparameters on the dev set only keeps the test set untouched for an
# honest final estimate. Searching over max_depth balances bias against variance
# (small depth gives high bias, large depth gives high variance). The comparison with
# the default tree (no max_depth) shows whether limiting depth improved generalization.

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)
# Hold out a development set from the training data for hyperparameter tuning
x_train_final, x_dev, y_train_final, y_dev = train_test_split(x_train, y_train, test_size=0.33, random_state=85)

# Grid-search over possible `max_depth` values using the development set
depths = list(range(1, 21)) + [None]
best_depth = None
best_acc = -1.0
results = []
for d in depths:
    clf = DecisionTreeClassifier(max_depth=d, random_state=1)
    # fit on the training portion (not the dev set)
    clf.fit(x_train_final, y_train_final.values.ravel())
    y_dev_pred = clf.predict(x_dev)
    acc = accuracy_score(y_dev, y_dev_pred)
    results.append((d, acc))
    if acc > best_acc:
        best_acc = acc
        best_depth = d

# Show development accuracies for each depth
print('Dev set accuracies (depth, accuracy):')
for d, acc in results:
    print(d, round(acc, 4))

print(f'\nBest depth on dev set: {best_depth} with accuracy {best_acc:.4f}')

# Retrain on the full training data (train + dev = x_train) with the chosen hyperparameter
final_clf = DecisionTreeClassifier(max_depth=best_depth, random_state=1)
final_clf.fit(x_train, y_train.values.ravel())
y_test_pred = final_clf.predict(x_test)
test_acc = accuracy_score(y_test, y_test_pred)
print(f'Test set accuracy with best depth: {test_acc:.4f}')

# Compare to default (no max_depth) to see whether limiting depth helped
default_clf = DecisionTreeClassifier(random_state=1)
default_clf.fit(x_train, y_train.values.ravel())
default_acc = accuracy_score(y_test, default_clf.predict(x_test))
print(f'Default (no max_depth) test accuracy: {default_acc:.4f}')

Dev set accuracies (depth, accuracy):
1 0.6897
2 0.7471
3 0.7586
4 0.6667
5 0.7586
6 0.7931
7 0.7931
8 0.7931
9 0.8046
10 0.7931
11 0.8161
12 0.8161
13 0.8161
14 0.8161
15 0.8161
16 0.8161
17 0.8161
18 0.8161
19 0.8161
20 0.8161
None 0.8161

Best depth on dev set: 11 with accuracy 0.8161
Test set accuracy with best depth: 0.7923
Default (no max_depth) test accuracy: 0.7923
