<h1><u>Auto MPG Data Set</u></h1>
<blockquote><a href="https://archive.ics.uci.edu/ml/datasets/Auto+MPG">UCI ML Repo</a></blockquote>

In [1]:
import os
import pandas as pd
import numpy as np

df = pd.read_csv("https://data.heatonresearch.com/data/t81-558/auto-mpg.csv",na_values=['NA', '?'])
display(df[0:10])

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
0,18.0,8,307.0,130.0,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150.0,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150.0,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140.0,3449,10.5,70,1,ford torino
5,15.0,8,429.0,198.0,4341,10.0,70,1,ford galaxie 500
6,14.0,8,454.0,220.0,4354,9.0,70,1,chevrolet impala
7,14.0,8,440.0,215.0,4312,8.5,70,1,plymouth fury iii
8,14.0,8,455.0,225.0,4425,10.0,70,1,pontiac catalina
9,15.0,8,390.0,190.0,3850,8.5,70,1,amc ambassador dpl


<br><h3>Shuffling dataset</h3>
To see why shuffling is important, <a href="https://stats.stackexchange.com/questions/245502/why-should-we-shuffle-data-while-training-a-neural-network">read this Cross Validated discussion</a>.

In [2]:
#np.random.seed(42) # Uncomment this line to get the same shuffle each time
df = df.reindex(np.random.permutation(df.index))
df.reset_index(inplace=True, drop=True)
display(df[0:10])

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
0,31.0,4,119.0,82.0,2720,19.4,82,1,chevy s-10
1,34.2,4,105.0,70.0,2200,13.2,79,1,plymouth horizon
2,25.4,6,168.0,116.0,2900,12.6,81,3,toyota cressida
3,25.5,4,122.0,96.0,2300,15.5,77,1,plymouth arrow gs
4,15.0,8,304.0,150.0,3892,12.5,72,1,amc matador (sw)
5,33.0,4,91.0,53.0,1795,17.5,75,3,honda civic cvcc
6,16.9,8,350.0,155.0,4360,14.9,79,1,buick estate wagon (sw)
7,31.3,4,120.0,75.0,2542,17.5,80,3,mazda 626
8,26.0,4,79.0,67.0,1963,15.5,74,2,volkswagen dasher
9,14.0,8,455.0,225.0,3086,10.0,70,1,buick estate wagon (sw)
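The seed comment in the cell above can be illustrated with a minimal sketch. It assumes a small stand-in frame named toy (its values are illustrative, not taken from the data set) and a hypothetical helper shuffle_rows. It uses a local NumPy generator instead of the global np.random.seed, which is a swapped-in choice and not what the cell above does.

```python
import numpy as np
import pandas as pd

# Small stand-in frame; the values are illustrative, not read from auto-mpg.csv
toy = pd.DataFrame({"mpg": [18.0, 15.0, 31.0, 25.4, 14.0],
                    "cylinders": [8, 8, 4, 6, 8]})

def shuffle_rows(frame, seed):
    """Hypothetical helper: return a shuffled copy with a fresh 0..n-1 index."""
    rng = np.random.default_rng(seed)          # local generator, no global state
    order = rng.permutation(len(frame))        # random ordering of row positions
    return frame.iloc[order].reset_index(drop=True)

first = shuffle_rows(toy, seed=42)
second = shuffle_rows(toy, seed=42)

# The same seed gives the same order; rows are only reordered, never altered
print(first.equals(second))
print(sorted(first["mpg"]) == sorted(toy["mpg"]))
```

Fixing the seed is mainly useful when you want to compare runs; leaving it unset gives a different order each time.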


<br><h3>Grouping</h3>

In [3]:
g = df.groupby('cylinders')['mpg'].mean().to_dict()
g

{3: 20.55,
 4: 29.28676470588235,
 5: 27.366666666666664,
 6: 19.985714285714288,
 8: 14.963106796116502}

In addition to mean, you can use other aggregating functions, such as <b>sum</b> or <b>count</b>.

In [4]:
df.groupby('cylinders')['mpg'].count().to_dict()

{3: 4, 4: 204, 5: 3, 6: 84, 8: 103}
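Several aggregations can also be computed in one pass with agg. The sketch below assumes a small stand-in frame named toy with made-up values, so the numbers do not match the data set.

```python
import pandas as pd

# Stand-in data; values are illustrative only
toy = pd.DataFrame({"cylinders": [4, 4, 6, 8, 8, 8],
                    "mpg": [30.0, 28.0, 20.0, 15.0, 14.0, 16.0]})

# One row per cylinder count, one column per aggregation
summary = toy.groupby("cylinders")["mpg"].agg(["mean", "count", "sum"])
print(summary)
```

The result is a DataFrame indexed by cylinders, so a single value can be read with summary.loc[8, "mean"].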

<br><h3>Mapping</h3>
 You pass <b>map</b> a dictionary that describes how to transform the target column. Each key is a value that may appear in the column, and the value stored under that key is what replaces it. The following code shows how the map function can transform the numeric values 1, 2, and 3 into the string values North America, Europe, and Asia.

In [5]:
df['origin_name'] = df['origin'].map({1: 'North America', 2: 'Europe', 3: 'Asia'})
display(df[0:10])

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name,origin_name
0,31.0,4,119.0,82.0,2720,19.4,82,1,chevy s-10,North America
1,34.2,4,105.0,70.0,2200,13.2,79,1,plymouth horizon,North America
2,25.4,6,168.0,116.0,2900,12.6,81,3,toyota cressida,Asia
3,25.5,4,122.0,96.0,2300,15.5,77,1,plymouth arrow gs,North America
4,15.0,8,304.0,150.0,3892,12.5,72,1,amc matador (sw),North America
5,33.0,4,91.0,53.0,1795,17.5,75,3,honda civic cvcc,Asia
6,16.9,8,350.0,155.0,4360,14.9,79,1,buick estate wagon (sw),North America
7,31.3,4,120.0,75.0,2542,17.5,80,3,mazda 626,Asia
8,26.0,4,79.0,67.0,1963,15.5,74,2,volkswagen dasher,Europe
9,14.0,8,455.0,225.0,3086,10.0,70,1,buick estate wagon (sw),North America
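One detail of map is worth knowing: a value with no matching key in the dictionary becomes NaN instead of raising an error. The sketch below assumes a hypothetical origin code 4 that has no label, purely to show that behavior.

```python
import pandas as pd

# 4 is a hypothetical code that the dictionary does not cover
origin = pd.Series([1, 2, 3, 4])
labels = {1: 'North America', 2: 'Europe', 3: 'Asia'}

mapped = origin.map(labels)
print(mapped.isna().sum())          # number of values that had no label

# One way to handle the gap: substitute an explicit placeholder
filled = mapped.fillna('Unknown')
print(filled.tolist())
```

Checking isna() after a map is a cheap way to catch codes you did not expect.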


<hr>

In [6]:
df['horsepower'] = df['horsepower'].fillna(df['horsepower'].median())

# Pandas to Numpy
x = df[['cylinders', 'displacement', 'horsepower', 'weight','acceleration', 'year', 'origin']].values
y = df['mpg'].values # regression
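The first line of the cell above replaces missing horsepower values with the column median. A minimal sketch of that step on a stand-in Series (the values are illustrative) looks like this:

```python
import numpy as np
import pandas as pd

# Stand-in column with two missing entries
hp = pd.Series([130.0, np.nan, 150.0, 70.0, np.nan])

median_hp = hp.median()            # NaN entries are skipped by default
filled = hp.fillna(median_hp)      # every NaN is replaced by the median

print(median_hp)
print(filled.isna().sum())
```

The median is less sensitive to extreme values than the mean, which is one common reason to prefer it for filling gaps.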

In [7]:
from sklearn import metrics
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import EarlyStopping

In [8]:
# Split into validation and training sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)

In [10]:
# Build the neural network
model = Sequential()
model.add(Dense(25, input_dim=x.shape[1], activation='relu')) # Hidden 1
model.add(Dense(10, activation='relu')) # Hidden 2
model.add(Dense(1)) # Output
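# Note: 'accuracy' is a classification metric and carries no meaning for this
# regression target, which is why it reads 0.0000e+00 in the log below.
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['accuracy'])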

monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3, patience=5, verbose=1, mode='auto',restore_best_weights=True)

model.fit(x_train,y_train,validation_data=(x_test,y_test),callbacks=[monitor],verbose=2,epochs=1000)

Train on 298 samples, validate on 100 samples
Epoch 1/1000
298/298 - 0s - loss: 367278.0468 - accuracy: 0.0000e+00 - val_loss: 204025.3788 - val_accuracy: 0.0000e+00
Epoch 2/1000
298/298 - 0s - loss: 134762.1861 - accuracy: 0.0000e+00 - val_loss: 51379.4238 - val_accuracy: 0.0000e+00
Epoch 3/1000
298/298 - 0s - loss: 25611.9535 - accuracy: 0.0000e+00 - val_loss: 3669.5909 - val_accuracy: 0.0000e+00
Epoch 4/1000
298/298 - 0s - loss: 1112.1547 - accuracy: 0.0000e+00 - val_loss: 1374.3971 - val_accuracy: 0.0000e+00
Epoch 5/1000
298/298 - 0s - loss: 2369.6011 - accuracy: 0.0000e+00 - val_loss: 2929.2992 - val_accuracy: 0.0000e+00
Epoch 6/1000
298/298 - 0s - loss: 2069.0944 - accuracy: 0.0000e+00 - val_loss: 1042.1761 - val_accuracy: 0.0000e+00
Epoch 7/1000
298/298 - 0s - loss: 467.6237 - accuracy: 0.0000e+00 - val_loss: 221.4131 - val_accuracy: 0.0000e+00
Epoch 8/1000
298/298 - 0s - loss: 188.7004 - accuracy: 0.0000e+00 - val_loss: 277.1480 - val_accuracy: 0.0000e+00
Epoch 9/1000
298/298 -

<tensorflow.python.keras.callbacks.History at 0x272ff3b35c8>

In [11]:
# Measure RMSE, a common error metric for regression.
pred = model.predict(x_test)
score = np.sqrt(metrics.mean_squared_error(y_test, pred))  # arguments are (y_true, y_pred)
print(f"Final score (RMSE): {score}")

Final score (RMSE): 4.829259815759482
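RMSE is the square root of the mean squared difference between predictions and targets. The sketch below computes it by hand with NumPy on made-up numbers, assuming the predictions arrive as a column of shape (n, 1) the way a single-output Keras model returns them. It also shows why the column should be flattened before subtracting.

```python
import numpy as np

# Illustrative targets and predictions, not results from the model above
y_true = np.array([18.0, 15.0, 31.0, 25.0])
y_pred = np.array([[20.0], [15.0], [28.0], [24.0]])   # shape (4, 1)

# Pitfall: (4, 1) minus (4,) broadcasts to a (4, 4) matrix of all pairwise gaps
print((y_pred - y_true).shape)

# Flatten first so the subtraction is element by element
errors = y_pred.flatten() - y_true
rmse = np.sqrt(np.mean(errors ** 2))
print(rmse)
```

Because the errors are squared before averaging, RMSE is in the same units as the target (here, miles per gallon) and penalizes large misses more than small ones.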
