
**1. Business Understanding:**

-   **Objective:** Predict the miles per gallon (MPG) of cars based on various attributes.
-   **Business Goals:** Improve fuel efficiency predictions for better decision-making regarding car models and environmental impact.

**2. Data Understanding:**

-   **Data Collection:** Use the Auto MPG dataset, containing information about car attributes and MPG.
-   **Exploratory Data Analysis (EDA):**
    -   Examine the structure of the dataset.
    -   Identify features and their data types.
    -   Check for missing values and outliers.
    -   Understand the distribution of the target variable (MPG).
    -   Explore relationships between features and the target variable.
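
The EDA checks listed above can be sketched with pandas roughly as follows. To stay self-contained, the snippet uses a small synthetic stand-in frame (`sample_df`, with made-up values) rather than the real Auto MPG data; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT real Auto MPG values), for illustration only.
rng = np.random.default_rng(0)
n = 50
weight = rng.uniform(1800, 4500, n)
sample_df = pd.DataFrame({
    "mpg": 45 - 0.007 * weight + rng.normal(0, 1.5, n),
    "cylinders": rng.choice([4, 6, 8], n),
    "horsepower": rng.uniform(50, 220, n),
    "weight": weight,
})
# Simulate two missing horsepower readings.
sample_df.loc[[3, 17], "horsepower"] = np.nan

# Structure and data types.
print(sample_df.shape)
print(sample_df.dtypes)

# Missing values per column.
missing = sample_df.isna().sum()
print(missing)

# Distribution of the target variable.
print(sample_df["mpg"].describe())

# Simple outlier check on the target using the 1.5 * IQR rule.
q1, q3 = sample_df["mpg"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (sample_df["mpg"] < q1 - 1.5 * iqr) | (sample_df["mpg"] > q3 + 1.5 * iqr)
print(int(is_outlier.sum()))

# Linear correlation between each feature and the target.
correlations = sample_df.corr()["mpg"].drop("mpg")
print(correlations.sort_values())
```

The same calls should work unchanged on the real `df` once it is loaded; the 1.5 * IQR rule is just one common convention for flagging outliers, not a requirement of the process.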

**3. Data Preparation:**

-   **Cleaning:**
    -   Handle missing values (if any).
    -   Convert relevant columns to the correct data types.
-   **Feature Engineering:**
    -   Create derived attributes or transform existing features.
-   **Scaling:**
    -   Standardize numerical features using `StandardScaler`.
-   **Train-Test Split:**
    -   Split the dataset into training and testing sets.
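
One detail worth making explicit is the order of these steps: the scaler should be fitted on the training split only and then applied to the test split, so that no information from the test set leaks into preprocessing. A minimal sketch on synthetic data (the arrays `X_demo` and `y_demo` are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (illustrative values only).
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=[100.0, 3000.0], scale=[30.0, 800.0], size=(200, 2))
y_demo = rng.normal(size=200)

# Split first ...
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# ... then fit the scaler on the training split only.
demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)  # reuses the training mean and std

print(X_tr_scaled.mean(axis=0))  # approximately 0 per column
print(X_tr_scaled.std(axis=0))   # approximately 1 per column
print(X_te_scaled.mean(axis=0))  # close to 0, but not exactly
```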

**4. Modeling:**

-   **Model Selection:**
    -   Compare several candidate models. We tested three approaches: Linear Regression, a Neural Network, and a Gradient Boosting Regressor.
    -   Choose the Gradient Boosting Regressor as the modeling technique, based on the analysis presented later in this document.
-   **Hyperparameter Tuning:**
    -   Experiment with hyperparameter values (e.g., learning rate, number of estimators) to optimize model performance.
-   **Training:**
    -   Train the Gradient Boosting Regressor on the training set.
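
The tuning step could be automated with a cross-validated grid search. The sketch below is one possible setup, not the procedure used in this project: the parameter grid is an arbitrary example, and `X_demo` / `y_demo` are synthetic stand-ins for the scaled training data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic non-linear regression data (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.uniform(-2, 2, size=(150, 3))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] ** 2 + rng.normal(0, 0.1, 150)

# Hypothetical grid; the values are examples, not recommendations.
param_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",  # scikit-learn maximizes scores, hence the sign
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(-search.best_score_)  # cross-validated MSE of the best combination
```

Because the search only uses cross-validation folds of the training data, the test set stays untouched for the final evaluation.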

**5. Evaluation:**

-   **Model Evaluation:**
    -   Evaluate the model's performance on the test set using metrics like Mean Squared Error (MSE).
    -   Assess how well the model aligns with business goals.
-   **Visualizations:**
    -   Create visualizations, such as scatter plots for true vs. predicted values and feature importance plots.
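
Since MSE is expressed in squared units (MPG squared), it can help to report a few complementary metrics when relating the error to business goals. A small sketch with hand-picked illustrative values (`y_true_demo` and `y_pred_demo` are not project results):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Illustrative true and predicted MPG values (made up).
y_true_demo = np.array([20.0, 25.0, 30.0, 35.0])
y_pred_demo = np.array([22.0, 24.0, 29.0, 38.0])

mse = mean_squared_error(y_true_demo, y_pred_demo)
rmse = np.sqrt(mse)                                  # same unit as the target (MPG)
mae = mean_absolute_error(y_true_demo, y_pred_demo)  # average absolute miss in MPG
r2 = r2_score(y_true_demo, y_pred_demo)              # share of variance explained

print(f"MSE:  {mse:.2f}")   # 3.75
print(f"RMSE: {rmse:.2f}")  # 1.94
print(f"MAE:  {mae:.2f}")   # 1.75
print(f"R^2:  {r2:.2f}")    # 0.88
```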

**6. Deployment:**

-   **Deployment Plan:**
    -   Plan the deployment of the trained model into a production environment by wrapping it in an application built with a web framework such as Django or Flask and exposing it through a REST API.
-   **Monitoring:**
    -   Implement monitoring to track the model's performance in real-world scenarios using Grafana or other tools.
-   **Documentation:**
    -   Create documentation for end-users and stakeholders. Since this project is mostly a back-end application that users won't see directly, the documentation will focus on the DevOps and support teams, so they know how to debug and restart the service when needed.
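
Whatever web framework is chosen, the service needs the fitted scaler and the model together. One way to handle this, sketched below under assumptions, is to bundle both in a scikit-learn pipeline and persist it with `joblib`. The file name `mpg_model.joblib` and the synthetic training data are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the seven training features (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(100, 7))
y_demo = X_demo @ np.arange(1, 8) + rng.normal(0, 0.1, 100)

# Bundle scaling and model so the service cannot forget to scale its inputs.
pipeline = make_pipeline(StandardScaler(), GradientBoostingRegressor(random_state=42))
pipeline.fit(X_demo, y_demo)

# Save at training time, load at service start-up (hypothetical file name).
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "mpg_model.joblib")
    joblib.dump(pipeline, model_path)
    restored = joblib.load(model_path)

# The restored pipeline should reproduce the original predictions.
same_predictions = np.allclose(pipeline.predict(X_demo[:5]), restored.predict(X_demo[:5]))
print(same_predictions)
```

A request handler in Django or Flask would then only need to call `restored.predict(...)` on the incoming feature values.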

**7. Iteration:**

-   **Feedback Loop:**
    -   Gather feedback from stakeholders.
    -   Consider improvements, additional features, or model retraining based on feedback and changing requirements.
    -   Iterate through the CRISP-DM process as needed.

**Choosing a model**

We will analyse three separate approaches, each with its own advantages and drawbacks. We will first implement all of them, run a quick test based on the mean squared error, and then determine which one is the most accurate for our business case.

The 3 models which will be analysed are:

- Linear Regression
- Neural Network
- Gradient Boosting Regressor

We will now analyse each one of them:


**Before we begin**

Before working with each model, please run the following code to import the required libraries and to load and preprocess the training data, so that we don't have to repeat the imports and transformations for each model.

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression

import tensorflow as tf
# workaround for a Jupyter notebook error when importing tf.keras.models
from tensorflow import keras
from keras.models import Sequential
from keras.layers import Dense


# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"
column_names = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model_year', 'origin', 'car_name']
# sep=r'\s+' splits on runs of whitespace (delim_whitespace is deprecated in recent pandas versions)
df = pd.read_csv(url, sep=r'\s+', names=column_names)

# Preprocess the data
df.drop('car_name', axis=1, inplace=True)
df['horsepower'] = pd.to_numeric(df['horsepower'], errors='coerce')
df.dropna(inplace=True)

# Split the data into features (X) and target variable (y)
X = df.drop('mpg', axis=1)
y = df['mpg']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


**About the data preprocessing**

to be continued

**1st approach: Linear Regression:**

Linear regression is a statistical method used for modeling the relationship between a dependent variable (or target) and one or more independent variables (or features). The goal of linear regression is to find the linear relationship that best fits the data. The simplest form is simple linear regression, which involves a single independent variable, while multiple linear regression deals with multiple independent variables.

A simple linear regression model takes the form:

    y = mx + b

where:

-   y is the dependent variable (target),
-   x is the independent variable (feature),
-   m is the slope of the line, and
-   b is the y-intercept.
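
For a single feature, the least-squares values of m and b have a closed form: m is the covariance of x and y divided by the variance of x, and b follows from the means. A minimal sketch, using made-up points that lie exactly on the line y = 2x + 1:

```python
import numpy as np

# Illustrative points lying exactly on y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0

x_mean, y_mean = x.mean(), y.mean()

# Slope: covariance of x and y divided by the variance of x.
m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
# Intercept: the fitted line passes through the point of means.
b = y_mean - m * x_mean

print(m, b)  # 2.0 1.0
```

With several features, as in this project, the same idea generalizes to one coefficient per feature plus an intercept, which is what `LinearRegression` estimates.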


**Advantages of Linear Regression:**

1.  **Interpretability:** Linear regression models are simple and easy to interpret. The coefficients in the equation provide insights into the relationship between variables.
    
2.  **Computational Efficiency:** Linear regression is computationally efficient, making it suitable for large datasets.
    
3.  **Quick Implementation:** It is quick to implement and serves as a good baseline model.
    
4.  **Well-understood:** Linear regression is a well-understood and extensively studied statistical method.
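
To illustrate the interpretability point: when features are standardized, each coefficient can be read as the expected change in the target for a one-standard-deviation increase in that feature, holding the others fixed. The sketch below uses synthetic data (the frame `demo` and the series `target` are made up), so the numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: heavier cars get lower MPG, quicker-accelerating ones slightly higher.
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "weight": rng.uniform(1800, 4500, 200),
    "acceleration": rng.uniform(8, 24, 200),
})
target = 45 - 0.007 * demo["weight"] + 0.3 * demo["acceleration"] + rng.normal(0, 1, 200)

scaled = StandardScaler().fit_transform(demo)
lin = LinearRegression().fit(scaled, target)

# One coefficient per feature, labelled with the column name.
coef_table = pd.Series(lin.coef_, index=demo.columns).sort_values()
print(coef_table)
# With centered features, the intercept equals the mean of the target.
print(lin.intercept_)
```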
    

**Drawbacks of Linear Regression:**

1.  **Assumption of Linearity:** Linear regression assumes that the relationship between variables is linear. If the true relationship is non-linear, linear regression may provide inaccurate predictions.
    
2.  **Sensitive to Outliers:** Outliers can significantly influence the model's coefficients and predictions.
    
3.  **Assumption of Independence:** Linear regression assumes that the errors are independent. Violation of this assumption can affect the model's accuracy.
    
4.  **Multicollinearity:** When independent variables are highly correlated, it can lead to unstable coefficient estimates.
    
5.  **Limited Complexity:** Linear regression is not suitable for capturing complex relationships with intricate patterns in the data.


**Linear regression implementation**

We implemented the model using multiple linear regression over all seven features and evaluated it with the mean squared error. In the example below, across multiple runs, we obtained a mean squared error of about **10.7**.

In [7]:
# Train a linear regression model
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test_scaled)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')

# Example prediction
example_input = [[6, 225, 100, 3233, 15.4, 76, 1]]  # Example input data for prediction
example_input_scaled = scaler.transform(example_input)
predicted_mpg = model.predict(example_input_scaled)
print(f'Predicted MPG for the example input: {predicted_mpg[0]}')

Mean Squared Error: 10.710864418838396
Predicted MPG for the example input: 21.385639953053804




**2nd approach: Neural Networks**

Neural networks are a class of machine learning models inspired by the structure and functioning of the human brain. They consist of interconnected nodes (neurons) organized into layers. Neural networks are capable of learning complex patterns and relationships from data through a process called training. The training involves adjusting the weights and biases of the connections between neurons to minimize the difference between predicted and actual outcomes.
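
To make the idea of "adjusting weights and biases" concrete, here is a minimal sketch of gradient descent for a single linear neuron in plain NumPy. It is a deliberately simplified illustration, not the Keras training loop used later; the data, learning rate, and step count are arbitrary choices.

```python
import numpy as np

# Synthetic, noiseless data generated from y = 3x + 0.5 (illustrative only).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x + 0.5

w, b = 0.0, 0.0        # weight and bias of the single neuron
learning_rate = 0.1
losses = []

for _ in range(500):
    y_hat = w * x + b                   # forward pass
    error = y_hat - y
    losses.append(np.mean(error ** 2))  # mean squared error
    grad_w = 2.0 * np.mean(error * x)   # d(MSE)/dw
    grad_b = 2.0 * np.mean(error)       # d(MSE)/db
    w -= learning_rate * grad_w         # step against the gradient
    b -= learning_rate * grad_b

print(w, b)                   # approaches 3.0 and 0.5
print(losses[0], losses[-1])  # the loss shrinks over training
```

A real network stacks many such units with non-linear activations and computes the gradients by backpropagation, but the update rule follows the same pattern.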

**Advantages of Neural Networks:**

1.  **Non-Linearity:** Neural networks can model complex, non-linear relationships in data, making them suitable for a wide range of tasks.
    
2.  **Feature Learning:** Neural networks can automatically learn relevant features from the data, reducing the need for manual feature engineering.
    
3.  **Versatility:** Neural networks are versatile and can be applied to various types of data, including images, text, and sequences.
    
4.  **Parallel Processing:** The parallel processing capability of neural networks enables them to handle large amounts of data efficiently.
    
5.  **Adaptability:** Neural networks can adapt and generalize well to different types of problems, making them applicable to a diverse set of tasks.
    

**Drawbacks of Neural Networks:**

1.  **Complexity:** The architecture and hyperparameter tuning of neural networks can be complex and time-consuming.
    
2.  **Data Requirements:** Neural networks often require large amounts of data to generalize well and avoid overfitting.
    
3.  **Computationally Intensive:** Training deep neural networks can be computationally intensive and may require specialized hardware.
    
4.  **Black Box Nature:** The inner workings of complex neural networks can be challenging to interpret, leading to a "black box" problem.
    
5.  **Vulnerability to Noisy Data:** Neural networks may be sensitive to noisy data, outliers, or irrelevant features.
    
6.  **Hyperparameter Sensitivity:** Neural networks are sensitive to the choice of hyperparameters, and finding the optimal configuration may involve extensive experimentation.

**Neural Networks implementation**

As with Linear Regression, we implemented the model using a neural network and evaluated it with the mean squared error.
In the example below, across multiple runs, we obtained a mean squared error of about **10.2** after 50 epochs.

In [6]:
# Build the neural network model
model = Sequential()
model.add(Dense(64, input_dim=X_train_scaled.shape[1], activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='linear'))

# Compile the model
model.compile(loss='mean_squared_error', optimizer='adam')

# Train the model
model.fit(X_train_scaled, y_train, epochs=50, batch_size=32, validation_data=(X_test_scaled, y_test), verbose=2)

# Make predictions on the test set
y_pred = model.predict(X_test_scaled)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')

# Example prediction
example_input = np.array([[6, 225, 100, 3233, 15.4, 76, 1]])  # Example input data for prediction
example_input_scaled = scaler.transform(example_input)
predicted_mpg = model.predict(example_input_scaled)
print(f'Predicted MPG for the example input: {predicted_mpg[0][0]}')

Epoch 1/50
10/10 - 0s - loss: 592.0648 - val_loss: 532.1111 - 437ms/epoch - 44ms/step
Epoch 2/50
10/10 - 0s - loss: 566.5259 - val_loss: 505.7727 - 40ms/epoch - 4ms/step
Epoch 3/50
10/10 - 0s - loss: 538.0164 - val_loss: 474.8748 - 40ms/epoch - 4ms/step
Epoch 4/50
10/10 - 0s - loss: 502.6318 - val_loss: 438.0248 - 38ms/epoch - 4ms/step
Epoch 5/50
10/10 - 0s - loss: 461.5247 - val_loss: 393.2238 - 43ms/epoch - 4ms/step
Epoch 6/50
10/10 - 0s - loss: 411.1219 - val_loss: 341.0549 - 38ms/epoch - 4ms/step
Epoch 7/50
10/10 - 0s - loss: 351.9496 - val_loss: 282.9438 - 38ms/epoch - 4ms/step
Epoch 8/50
10/10 - 0s - loss: 288.0170 - val_loss: 221.3045 - 38ms/epoch - 4ms/step
Epoch 9/50
10/10 - 0s - loss: 221.2415 - val_loss: 161.3786 - 42ms/epoch - 4ms/step
Epoch 10/50
10/10 - 0s - loss: 157.9914 - val_loss: 108.2342 - 35ms/epoch - 3ms/step
Epoch 11/50
10/10 - 0s - loss: 105.2013 - val_loss: 68.7048 - 36ms/epoch - 4ms/step
Epoch 12/50
10/10 - 0s - loss: 66.7083 - val_loss: 47.6189 - 39ms/epoch -



**3rd approach: Gradient Boosting Regressor**


Gradient Boosting is an ensemble learning technique that builds a series of weak learners, typically decision trees, and combines their predictions to create a strong predictive model. Gradient Boosting Regressor specifically focuses on regression tasks, aiming to predict a continuous target variable.
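
The core loop can be sketched by hand for the squared-error case: start from a constant prediction, then repeatedly fit a shallow tree to the current residuals and add a damped version of its output. This is a simplified illustration on synthetic data, not how `GradientBoostingRegressor` is implemented internally; the depth, learning rate, and number of rounds are arbitrary.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear data (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.1, 200)

learning_rate = 0.1
n_rounds = 50

# Start from the simplest possible model: the mean of the target.
prediction = np.full_like(y_demo, y_demo.mean())
initial_mse = np.mean((y_demo - prediction) ** 2)

trees = []
for _ in range(n_rounds):
    # For squared error, the residuals are the direction that reduces the loss.
    residuals = y_demo - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)  # weak learner
    tree.fit(X_demo, residuals)
    # Add a damped correction; the learning rate limits each tree's influence.
    prediction += learning_rate * tree.predict(X_demo)
    trees.append(tree)

final_mse = np.mean((y_demo - prediction) ** 2)
print(initial_mse, final_mse)  # training error drops as trees are added
```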

**Advantages of Gradient Boosting Regressor:**

1.  **High Predictive Accuracy:** Gradient Boosting Regressor often achieves high predictive accuracy and outperforms many other algorithms.
    
2.  **Handles Non-Linearity and Interactions:** The ensemble nature of gradient boosting allows it to capture non-linear relationships and interactions between features.
    
3.  **Robustness to Outliers:** With a robust loss function (for example, the Huber loss), Gradient Boosting can be made less sensitive to outliers than models trained with plain squared error.
    
4.  **Feature Importance:** Provides a feature importance ranking, helping to identify the most influential variables in making predictions.
    
5.  **Flexible with Different Loss Functions:** Can be used with various loss functions, making it adaptable to different types of regression problems.
    

**Drawbacks of Gradient Boosting Regressor:**

1.  **Computationally Intensive:** Training a large number of trees in the ensemble can be computationally intensive, especially for deep trees.
    
2.  **Potential for Overfitting:** Gradient Boosting Regressor can be prone to overfitting, especially if the model is too complex or the learning rate is too high.
    
3.  **Need for Careful Hyperparameter Tuning:** Effective use of Gradient Boosting Regressor often requires careful tuning of hyperparameters, such as the number of trees and learning rate.
    
4.  **Less Interpretability:** While it provides feature importance, the interpretability of the model is lower compared to simpler models like linear regression.
    
5.  **Sensitive to Noisy Data:** With the default squared-error loss, Gradient Boosting Regressor can be sensitive to noisy data and outliers, because later trees keep trying to fit the large residuals these points produce.
    
6.  **Potential for Bias:** If some ranges of the target variable are under-represented in the training dataset, the model may be biased toward the more common values and predict the rare ranges poorly.



**Gradient Boosting Regressor implementation**

As with Linear Regression and the Neural Network, we implemented the model using a Gradient Boosting Regressor and evaluated it with the mean squared error.
In the example below, across multiple runs, we obtained a mean squared error of about **6.2**.

In [5]:
# Build and train the Gradient Boosting Regressor model
model_gb = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
model_gb.fit(X_train_scaled, y_train)

# Make predictions on the test set
y_pred_gb = model_gb.predict(X_test_scaled)

# Evaluate the model
mse_gb = mean_squared_error(y_test, y_pred_gb)
print(f'Mean Squared Error (Gradient Boosting): {mse_gb}')

# Example prediction
example_input_gb = [[6, 225, 100, 3233, 15.4, 76, 1]]  # Example input data for prediction
example_input_gb_scaled = scaler.transform(example_input_gb)
predicted_mpg_gb = model_gb.predict(example_input_gb_scaled)
print(f'Predicted MPG for the example input (Gradient Boosting): {predicted_mpg_gb[0]}')

Mean Squared Error (Gradient Boosting): 6.232502768358345
Predicted MPG for the example input (Gradient Boosting): 19.23407630490937




**Error analysis and model selection**



Based on the experiments above, the Gradient Boosting Regressor is the best choice for our business case: it achieved the lowest mean squared error on the test set (about 6.2, compared with about 10.7 for Linear Regression and about 10.2 for the Neural Network), and it trained in a reasonable amount of time.
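
Because a single train-test split of a small dataset can give a noisy error estimate, a comparison like this could be made more robust with k-fold cross-validation. The sketch below shows the mechanics on synthetic, deliberately non-linear data, so its numbers do not transfer to the Auto MPG results; the fold count and the `candidates` dictionary are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic, deliberately non-linear data (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.uniform(-2, 2, size=(300, 3))
y_demo = X_demo[:, 0] ** 2 + np.sin(2 * X_demo[:, 1]) + rng.normal(0, 0.2, 300)

candidates = {
    "linear_regression": LinearRegression(),
    "gradient_boosting": GradientBoostingRegressor(random_state=42),
}

# The same shuffled folds are used for every model, so the comparison is fair.
folds = KFold(n_splits=5, shuffle=True, random_state=42)
cv_mse = {}
for name, estimator in candidates.items():
    scores = cross_val_score(
        estimator, X_demo, y_demo, scoring="neg_mean_squared_error", cv=folds
    )
    cv_mse[name] = -scores.mean()
    print(f"{name}: mean MSE = {cv_mse[name]:.3f} (std = {scores.std():.3f})")

# Pick the candidate with the lowest average error across folds.
best_model = min(cv_mse, key=cv_mse.get)
print(best_model)
```

Reporting the spread across folds alongside the mean also shows whether the gap between two models is larger than the run-to-run variation.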

