# Automobile Fuel Efficiency Prediction using PyTorch

This notebook presents an end-to-end process for predicting the fuel efficiency (highway MPG) of Toyota cars 
using a neural network built with PyTorch. The steps include:

1. Loading and inspecting the dataset.
2. Preprocessing the data, which involves filtering for Toyota cars, selecting features, and normalizing them.
3. Defining a neural network model architecture.
4. Training the model with the prepared data.
5. Evaluating the model's performance on test data.
6. Saving the trained model for future predictions.

We aim to predict `highway-mpg`, the distance in miles a car covers per gallon of fuel in highway driving, for the Toyota cars in the dataset.

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import torch
import torch.nn as nn
import torch.optim as optim

In [4]:
# Load the dataset
data_path = 'cleaned_automobile.csv'
automobile_data = pd.read_csv(data_path)
automobile_data.head()

Unnamed: 0,symboling,normalized-losses,make,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,length,...,peak-rpm,city-mpg,highway-mpg,price,city-L/100km,highway-L/100km,horsepower-binned,fuel-type-diesel,fuel-type-gas,Car Size
0,3,121,alfa-romero,std,two,convertible,rwd,front,88.6,0.811148,...,5000.0,21,27,13495,11.190476,9.37037,low,0,1,0.589311
1,3,121,alfa-romero,std,two,convertible,rwd,front,88.6,0.811148,...,5000.0,21,27,16500,11.190476,9.37037,low,0,1,0.589311
2,1,121,alfa-romero,std,two,hatchback,rwd,front,94.5,0.822681,...,5000.0,19,26,16500,12.368421,9.730769,medium,0,1,0.655799
3,2,164,audi,std,four,sedan,fwd,front,99.8,0.84863,...,5500.0,24,30,13950,9.791667,8.433333,low,0,1,0.708505
4,2,164,audi,std,four,sedan,4wd,front,99.4,0.84863,...,5500.0,18,22,17450,13.055556,11.5,low,0,1,0.710645


In [5]:
# Filter the dataset for Toyota vehicles
toyota_data = automobile_data[automobile_data['make'] == 'toyota']

In [6]:
# Select relevant features for the model
features = ['engine-size', 'curb-weight', 'horsepower', 'peak-rpm', 'city-mpg']
target = 'highway-mpg'

In [7]:
# Prepare the feature matrix and target vector
X = toyota_data[features]
y = toyota_data[target]

In [8]:

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
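
Because only Toyota rows remain, an 80/20 split leaves very few rows for testing, and metrics computed on so few rows can swing noticeably with the choice of `random_state`. The sketch below shows the split sizes for a hypothetical frame of 32 rows; the row count and the column values are assumptions for illustration, not figures taken from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for toyota_data: 32 rows of made-up values.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'city-mpg': rng.integers(20, 40, size=32),
    'highway-mpg': rng.integers(25, 50, size=32),
})

X_demo = demo[['city-mpg']]
y_demo = demo['highway-mpg']
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# scikit-learn rounds the test share up: ceil(0.2 * 32) = 7 test rows.
print(len(X_tr), len(X_te))  # 25 7
```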

In [9]:
# Define a data preprocessing pipeline
numerical_features = features
numerical_transformer = StandardScaler()

In [10]:
# Combine transformers into a preprocessor step
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
    ])

In [11]:
# Wrap the preprocessor in a pipeline (preprocessing only; the network is trained separately)
pipeline = Pipeline(steps=[('preprocessor', preprocessor)])

In [12]:
# Fit the preprocessing pipeline on the training data only, then apply it to both sets
X_train_prepared = pipeline.fit_transform(X_train)
X_test_prepared = pipeline.transform(X_test)
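
The order of `fit_transform` and `transform` above matters: the scaler's mean and standard deviation come from the training rows only, and the test rows are scaled with those same statistics. A small sketch on a single made-up feature column (the numbers are assumptions chosen for easy arithmetic):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one feature column (illustrative values only).
train = np.array([[90.0], [110.0], [130.0]])
test = np.array([[150.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler.transform(test)        # reuses the training statistics

print(scaler.mean_)         # [110.]
print(train_scaled.mean())  # approximately 0 by construction
print(test_scaled)          # about 2.45: the test row lies above the training mean
```

Scaled test values are therefore not guaranteed to have zero mean or unit variance, which is expected.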

In [13]:
# Define the neural network architecture
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(5, 64)   # 5 input features -> 64 hidden units
        self.fc2 = nn.Linear(64, 64)  # second hidden layer
        self.fc3 = nn.Linear(64, 1)   # single output: predicted highway-mpg

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)


# Initialize the network
net = Net()
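
The layer sizes determine how many weights the network has to fit. As a rough sketch, the same architecture can be rebuilt and its parameters counted; the dummy batch of zeros is an assumption used only to confirm the output shape.

```python
import torch
import torch.nn as nn


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(5, 64)
        self.fc2 = nn.Linear(64, 64)
        self.fc3 = nn.Linear(64, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)


net = Net()

# Weights plus biases: (5*64 + 64) + (64*64 + 64) + (64*1 + 1) = 4609
n_params = sum(p.numel() for p in net.parameters())
print(n_params)  # 4609

# A dummy batch of 3 cars with 5 features each yields one prediction per car.
dummy_batch = torch.zeros(3, 5)
print(net(dummy_batch).shape)  # torch.Size([3, 1])
```

With several thousand parameters and only one make's worth of rows, a network like this can fit the training set very closely, which is worth keeping in mind when reading the training log.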

In [14]:
# Define the loss function and optimizer
criterion = nn.MSELoss()
optimizer = optim.Adam(net.parameters(), lr=0.001)

In [15]:
# Convert numpy arrays to PyTorch tensors
X_train_tensor = torch.tensor(X_train_prepared.astype(np.float32))
y_train_tensor = torch.tensor(y_train.values.astype(np.float32)).view(-1, 1)
X_test_tensor = torch.tensor(X_test_prepared.astype(np.float32))
y_test_tensor = torch.tensor(y_test.values.astype(np.float32)).view(-1, 1)

In [16]:
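# Training loop (full-batch updates for 1000 epochs)
for epoch in range(1000):
    net.train()
    optimizer.zero_grad()                     # clear gradients from the previous step
    output = net(X_train_tensor)              # forward pass on the whole training set
    loss = criterion(output, y_train_tensor)
    loss.backward()                           # backpropagate
    optimizer.step()                          # update the weights

    # Every 100 epochs, report the loss on the test set for monitoring
    if epoch % 100 == 0:
        net.eval()
        with torch.no_grad():
            test_loss = criterion(net(X_test_tensor), y_test_tensor)
            print(f'Epoch {epoch}, Training loss {loss.item()}, Test loss {test_loss.item()}')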

Epoch 0, Training loss 1134.728271484375, Test loss 1079.02001953125
Epoch 100, Training loss 211.02523803710938, Test loss 204.09706115722656
Epoch 200, Training loss 27.352506637573242, Test loss 54.114864349365234
Epoch 300, Training loss 16.935937881469727, Test loss 33.133880615234375
Epoch 400, Training loss 12.158271789550781, Test loss 26.2406063079834
Epoch 500, Training loss 8.476327896118164, Test loss 21.687862396240234
Epoch 600, Training loss 5.348508834838867, Test loss 19.0009765625
Epoch 700, Training loss 3.061279535293579, Test loss 18.992721557617188
Epoch 800, Training loss 1.6207307577133179, Test loss 22.119876861572266
Epoch 900, Training loss 0.8390568494796753, Test loss 28.499038696289062
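
In the log above, test loss reaches its lowest value (about 19) around epochs 600 to 700 and then rises, while training loss keeps falling. One common response is early stopping: track the loss on a held-out validation set and keep the weights from the best epoch. The sketch below is a minimal version on synthetic data; the tensors, the `patience` value, and the use of a separate validation split (rather than the test set) are assumptions, not part of the original notebook.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic stand-in data (assumption): 24 training and 8 validation rows, 5 features.
X_tr, X_val = torch.randn(24, 5), torch.randn(8, 5)
true_w = torch.tensor([[2.0], [-1.0], [0.5], [0.0], [3.0]])
y_tr = X_tr @ true_w + 0.5 * torch.randn(24, 1)
y_val = X_val @ true_w + 0.5 * torch.randn(8, 1)

model = nn.Sequential(
    nn.Linear(5, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1),
)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

patience = 50  # epochs to wait for an improvement before stopping
best_val, best_epoch, best_state = float('inf'), -1, None

for epoch in range(1000):
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(X_tr), y_tr)
    loss.backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = criterion(model(X_val), y_val).item()

    if val_loss < best_val:
        # New best: remember the epoch and snapshot the weights.
        best_val, best_epoch = val_loss, epoch
        best_state = copy.deepcopy(model.state_dict())
    elif epoch - best_epoch >= patience:
        break  # no improvement for `patience` epochs

# Restore the weights from the best validation epoch.
model.load_state_dict(best_state)
print(f'Stopped at epoch {epoch}, best epoch {best_epoch}, best validation loss {best_val:.3f}')
```

Selecting the stopping point on the test set would make the final test metrics optimistic, which is why the sketch uses a separate validation split.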


In [17]:
# Save the trained model
torch.save(net.state_dict(), 'toyota_mpg_prediction_model.pth')
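
A `state_dict` holds only the weights, so loading it later requires rebuilding the same `Net` class first. The round trip might look like the sketch below; the temporary directory and the random input batch are assumptions used to keep the example self-contained.

```python
import os
import tempfile
import torch
import torch.nn as nn


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(5, 64)
        self.fc2 = nn.Linear(64, 64)
        self.fc3 = nn.Linear(64, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)


torch.manual_seed(0)
net = Net()

# Save the weights to a temporary location (assumption: any writable path works).
path = os.path.join(tempfile.mkdtemp(), 'toyota_mpg_prediction_model.pth')
torch.save(net.state_dict(), path)

# Rebuild the architecture, then load the saved weights into it.
restored = Net()
restored.load_state_dict(torch.load(path))
restored.eval()

# Both networks should now give identical predictions.
x = torch.randn(4, 5)
with torch.no_grad():
    same = torch.allclose(net(x), restored(x))
print(same)  # True
```

New inputs also have to pass through the same fitted `pipeline` before reaching the network, so the scaler needs to be persisted alongside the weights (for example with `joblib.dump`).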

In [18]:
# Making a prediction with the trained model
# For demonstration, let's predict the highway-mpg for the first car in the test set
net.eval()
with torch.no_grad():
    sample_car = X_test_tensor[0]
    predicted_mpg = net(sample_car)
    print(f'Predicted highway-mpg for the sample Toyota car: {predicted_mpg.item()}')

Predicted highway-mpg for the sample Toyota car: 25.20076560974121


In [20]:
from sklearn.metrics import mean_squared_error, r2_score

net.eval()
with torch.no_grad():
    predictions = net(X_test_tensor)
    mse = mean_squared_error(y_test_tensor.numpy(), predictions.numpy())
    r2 = r2_score(y_test_tensor.numpy(), predictions.numpy())
    print(f'Mean Squared Error on test set: {mse}')
    print(f'R^2 Score on test set: {r2}')

Mean Squared Error on test set: 34.349586486816406
R^2 Score on test set: 0.37196656006932516


## Model Evaluation and Analysis

### Prediction Results
The trained network was applied to the first Toyota car in the test set and predicted a `highway-mpg` of approximately 25.20. On its own, a single prediction says little about accuracy, since the notebook does not print the car's actual `highway-mpg` alongside it; the aggregate metrics below are more informative.

### Model Performance
The model's performance was quantitatively assessed using two common regression metrics:

- **Mean Squared Error (MSE):** The MSE on the test set is about 34.35. MSE is the average squared difference between the actual and predicted values, so its square root (the RMSE, about 5.86 mpg) gives the typical size of an error in the original units. Predictions are therefore, on average, a considerable distance from the actual values.

- **R^2 Score:** The R^2 score, or coefficient of determination, is approximately 0.37, meaning the model explains about 37% of the variance in `highway-mpg` on the test set. This is a low score: most of the variation in fuel efficiency is not captured by the model with the features it was given.
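
The metrics above can be reproduced by hand, which also makes the units explicit. The sketch below converts the reported MSE to an RMSE in mpg and computes MSE, MAE, and R^2 for four hypothetical cars; the `y_true` and `y_pred` values are made up for illustration.

```python
import math
import numpy as np

# MSE printed by the evaluation cell; its square root is in mpg.
reported_mse = 34.349586486816406
rmse = math.sqrt(reported_mse)
print(f'RMSE: {rmse:.2f} mpg')  # RMSE: 5.86 mpg

# Hypothetical actual and predicted highway-mpg for four cars.
y_true = np.array([30.0, 34.0, 27.0, 38.0])
y_pred = np.array([28.0, 35.0, 31.0, 33.0])

errors = y_pred - y_true        # [-2, 1, 4, -5]
mse = np.mean(errors ** 2)      # (4 + 1 + 16 + 25) / 4 = 11.5
mae = np.mean(np.abs(errors))   # (2 + 1 + 4 + 5) / 4 = 3.0

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mse, mae)  # 11.5 3.0
```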

### Insights and Suggested Improvements

Given the low R^2 score and high MSE, the model's performance is suboptimal. Several factors may contribute:

- **Model Capacity:** The training log points to overfitting rather than an overly simple architecture: training loss falls to about 0.84 by epoch 900, while test loss reaches its minimum of roughly 19 around epochs 600 to 700 and then climbs to 28.5. Stopping training earlier, adding regularization such as weight decay or dropout, or using a smaller network is therefore more likely to help than adding capacity.
  
- **Data Quality:** The quality of the dataset, including its accuracy and granularity, can significantly affect model performance, so the data should be well curated and representative of the population. The most notable limitation here is the small amount of data left after filtering for Toyota cars only; more observations are needed for more accurate and more stable results.
  
- **Hyperparameter Tuning:** Adjusting the model's hyperparameters, such as the learning rate, the number of epochs, and the number of neurons in each layer, can lead to better performance.
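
A hyperparameter search can be sketched as a small grid evaluated on a validation split. The example below varies the hidden width and the weight decay on synthetic data; the data, the grid values, the single-hidden-layer network, and the helper `fit_and_score` are all assumptions for illustration rather than the notebook's own setup.

```python
import itertools
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic stand-in data (assumption): 24 training and 8 validation rows.
X_tr, X_val = torch.randn(24, 5), torch.randn(8, 5)
true_w = torch.tensor([[2.0], [-1.0], [0.5], [0.0], [3.0]])
y_tr = X_tr @ true_w + 0.5 * torch.randn(24, 1)
y_val = X_val @ true_w + 0.5 * torch.randn(8, 1)


def fit_and_score(hidden, weight_decay, epochs=300):
    """Train a one-hidden-layer MLP and return its final validation MSE."""
    model = nn.Sequential(
        nn.Linear(5, hidden), nn.ReLU(),
        nn.Linear(hidden, 1),
    )
    optimizer = torch.optim.Adam(
        model.parameters(), lr=0.01, weight_decay=weight_decay
    )
    criterion = nn.MSELoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = criterion(model(X_tr), y_tr)
        loss.backward()
        optimizer.step()
    with torch.no_grad():
        return criterion(model(X_val), y_val).item()


# Try every combination of hidden width and weight decay.
grid = list(itertools.product([8, 64], [0.0, 1e-2]))
results = {(h, wd): fit_and_score(h, wd) for h, wd in grid}
best = min(results, key=results.get)

for config, val_mse in results.items():
    print(config, round(val_mse, 3))
print('best (hidden, weight_decay):', best)
```

On a dataset this small, differences between configurations can be within the noise of a single split, so the ranking is best treated as a hint rather than a conclusion.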

Other adjustments worth trying:

- Exploring alternative models, such as Random Forest or Gradient Boosting, which often perform well on small tabular datasets.
- Increasing the dataset size, if possible, to provide the model with more training examples.
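
As a starting point for the first suggestion, a Random Forest baseline can be scored with k-fold cross-validation, which uses every row for testing once and so gives a steadier estimate than a single small test split. The sketch below runs on 32 synthetic rows; the data, the number of trees, and the fold count are assumptions, not results from the automobile dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption): 32 rows, 5 features, noisy linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 5))
y = 30 + 3 * X[:, 4] - 2 * X[:, 1] + rng.normal(scale=1.0, size=32)

model = RandomForestRegressor(n_estimators=200, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# scikit-learn reports negated MSE so that higher is always better; flip the sign.
scores = cross_val_score(model, X, y, cv=cv, scoring='neg_mean_squared_error')
cv_mse = -scores

print(f'MSE per fold: {np.round(cv_mse, 2)}')
print(f'Mean MSE: {cv_mse.mean():.2f} (std {cv_mse.std():.2f})')
```

The spread across folds gives a sense of how much a single train/test split could mislead on a dataset of this size.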