# Tutorial 8: Model Development

## Objectives

After this tutorial you will be able to:

*   Use `scikit-learn` to perform simple linear regression to predict outputs
*   Use `scikit-learn` to perform multiple linear regression to predict outputs
*   Use `scikit-learn` to perform non-linear regression to predict outputs
*   Evaluate the developed models and select the appropriate model

<h2>Table of Contents</h2>

<ol>
    <li>
        <a href="#import">Import dataset</a>
    </li>
    <br>
    <li>
        <a href="#reg">Regression Overview</a>
    </li>
    <br>
    <li>
        <a href="#slr">Simple Linear Regression</a>
    </li>
    <br>
    <li>
        <a href="#mlr">Multiple Linear Regression</a>
    </li>
    <br>
    <li>
        <a href="#nlr">Non-Linear Regression</a>
    </li>
    <br>
    <li>
        <a href="#pipe">Pipelines and Grid Search</a>
    </li>
    <br>
    <li>
        <a href="#eval">Visual Evaluation of Higher Dimensional Models</a>
    </li>
    <br>
    <li>
        <a href="#class">Classification</a>
    </li>
    <br>
    <li>
        <a href="#log">Logistic Regression</a>
    </li>
</ol>


<hr id="import">

<h2>1. Import the dataset</h2>

Import the libraries

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from scipy import stats

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.pipeline import Pipeline

Read the data from the CSV file into a pandas `DataFrame`

### Understanding the Data

This dataset contains information about vehicle specifications, fuel consumption, and CO₂ emissions. Each row represents a unique vehicle model with various attributes related to its design, fuel efficiency, and environmental impact.

#### Column Descriptions

- **Make**: The brand or manufacturer of the vehicle (e.g., ACURA, BMW).
- **Model**: The specific model of the vehicle, with additional descriptors:
  - **4WD/4X4**: Four-wheel drive
  - **AWD**: All-wheel drive
  - **FFV**: Flexible-fuel vehicle
  - **SWB**: Short wheelbase
  - **LWB**: Long wheelbase
  - **EWB**: Extended wheelbase

- **Vehicle Class**: The category of the vehicle based on size and design (e.g., Compact, SUV - Small).
- **Engine Size [L]**: The engine size in liters, indicating the displacement capacity of the engine.
- **Cylinders**: The number of cylinders in the engine, impacting performance and fuel efficiency.
- **Transmission**: The type of transmission and its characteristics:
  - **A**: Automatic
  - **AM**: Automated manual
  - **AS**: Automatic with select shift
  - **AV**: Continuously variable
  - **M**: Manual
  - **3-10**: Number of gears in the transmission

- **Fuel Type**: The type of fuel the vehicle uses:
  - **X**: Regular gasoline
  - **Z**: Premium gasoline
  - **D**: Diesel
  - **E**: Ethanol (E85)
  - **N**: Natural gas

- **Fuel Consumption**:
  - **Fuel Consumption City [L/100 km]**: Fuel consumption rating for city driving in liters per 100 kilometers.
  - **Fuel Consumption Hwy [L/100 km]**: Fuel consumption rating for highway driving in liters per 100 kilometers.
  - **Fuel Consumption Comb [L/100 km]**: Combined fuel consumption rating (55% city, 45% highway) in liters per 100 kilometers.
  - **Fuel Consumption Comb [mpg]**: Combined fuel consumption rating in miles per imperial gallon (mpg).

- **CO₂ Emissions [g/km]**: Tailpipe emissions of carbon dioxide in grams per kilometer for combined city and highway driving.

#### Dataset Purpose and Use
This dataset provides insights into vehicle fuel efficiency and emissions, which can be used for:
- **Comparing vehicle fuel efficiency** across different makes and models.
- **Evaluating CO₂ emissions** based on fuel type and vehicle class.
- **Analyzing fuel consumption patterns** for city, highway, and combined driving conditions.
- **Supporting regulatory compliance** and environmental impact assessments by tracking emission levels.

This dataset can be visualized to identify trends in fuel efficiency, highlight differences across vehicle classes, and understand how vehicle specifications impact emissions.

In [None]:
df = pd.read_csv('CO2_Emissions_Canada.csv')
df.head()

Get information about the columns of the `DataFrame`

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df['Fuel Type'].value_counts()

In [None]:
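# drop natural gas ('N') rows; .copy() avoids SettingWithCopyWarning when columns are modified later
df = df[df['Fuel Type'] != 'N'].copy()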

<hr id="reg">

<h2>2. Regression Overview</h2>

#### Regression
Regression is a statistical technique used to model the relationship between one or more independent variables (also known as features or predictors) and a dependent variable (also known as the target or response variable).  
The goal of regression is to understand how changes in the independent variables affect the dependent variable. Regression is a powerful tool for prediction, forecasting, and understanding complex relationships in data.  

#### Goodness of fit
Goodness of fit describes how well a model captures the underlying pattern in the data without being either too simple or too closely tailored to the training set. It is a trade-off between complexity and generalizability: the model should perform well not only on the data it was trained on, but also on new data.

**Underfitting:**  
The model is too simple and captures only the basic trends, underestimating the true complexity of the data. This leads to inaccurate predictions for both training and unseen data.

**Good Fitting:**  
The model captures the main trends and patterns in the data, leading to accurate predictions on unseen examples. It balances complexity and flexibility without overfitting.

**Overfitting:**  
The model memorizes the training data too closely, capturing even the noise and random fluctuations. This leads to excellent performance on the training data but poor generalizability to new examples.
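
The three regimes can be reproduced on synthetic data. The sketch below is illustrative only: the sine-shaped target, the noise level, and the polynomial degrees 1, 3, and 9 are assumptions chosen for the demonstration and are not part of the tutorial dataset.

```python
import numpy as np
from numpy.polynomial import Polynomial

# synthetic data (assumed for illustration): noisy samples of a sine curve
rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(-1.0, 1.0, 20))
x_test = np.sort(rng.uniform(-1.0, 1.0, 200))
y_train = np.sin(np.pi * x_train) + rng.normal(0.0, 0.2, x_train.size)
y_test = np.sin(np.pi * x_test) + rng.normal(0.0, 0.2, x_test.size)


def mse(y_true, y_pred):
    """Mean squared error between two arrays."""
    return float(np.mean((y_true - y_pred) ** 2))


train_mse = {}
test_mse = {}
for degree in (1, 3, 9):
    # least-squares polynomial fit of the given degree
    model = Polynomial.fit(x_train, y_train, deg=degree)
    train_mse[degree] = mse(y_train, model(x_train))
    test_mse[degree] = mse(y_test, model(x_test))
    print('degree {}: train MSE = {:.3f}, test MSE = {:.3f}'.format(
        degree, train_mse[degree], test_mse[degree]))
```

The training error can only go down as the degree grows, because each higher-degree polynomial contains the lower-degree ones as special cases. The test error is the number to watch: a widening gap between training and test error is the usual sign of overfitting.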

<div style="text-align: center;">
    <img src="fitting.webp">
</div>

#### Steps
General steps to perform regression using `scikit-learn`:


1. **Data Loading**: extract input features (x) and output target (y)

2. **Data Preprocessing (if necessary)**: prepare input features for the selected regression model

3. **Data Splitting**: split data into training data and testing data for model evaluation

4. **Model Training**: train/fit the selected regression model on the *training* data

5. **Predictions**: make predictions (y_hat) using the trained model

6. **Evaluation**: evaluate the performance of the model using appropriate metrics (e.g. MSE, R-squared)
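
The six steps can be sketched end to end on synthetic data before applying them to the emissions dataset. Everything about the data here (the line `y = 3x + 2`, the noise level, the sample size) is an assumption made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# 1. data loading: synthetic data (assumed), y = 3x + 2 plus noise
rng = np.random.default_rng(42)
x = rng.uniform(0.0, 10.0, size=(200, 1))
y = 3.0 * x[:, 0] + 2.0 + rng.normal(0.0, 0.5, size=200)

# 2. data preprocessing: nothing to do for a single numeric feature

# 3. data splitting (random_state makes the split reproducible)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# 4. model training on the training data only
model = LinearRegression()
model.fit(x_train, y_train)

# 5. predictions on data the model has not seen
y_hat = model.predict(x_test)

# 6. evaluation
mse = mean_squared_error(y_test, y_hat)
r2 = r2_score(y_test, y_hat)
print('slope: {:.2f}, intercept: {:.2f}'.format(model.coef_[0], model.intercept_))
print('MSE: {:.2f}, R2: {:.3f}'.format(mse, r2))
```

The fitted slope and intercept should land close to the values used to generate the data, which is a quick sanity check that the workflow is wired correctly.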

<hr id="slr">

<h2>3. Simple Linear Regression</h2>

Simple linear regression is a statistical method that models the linear relationship between a single independent variable and a dependent variable.

The model has the form:
$$
y = ax + b
$$

where
- `a` is the slope (the coefficient of the independent variable)
- `b` is the intercept with the Y axis
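
For a single feature, the least-squares values of `a` and `b` have a closed form. The sketch below computes them directly with NumPy on a handful of made-up points that lie exactly on `y = 2x + 1`. It is meant to show what `LinearRegression` estimates, not how scikit-learn implements it internally.

```python
import numpy as np

# toy data (assumed): points that lie exactly on y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

# least-squares estimates for a single feature:
#   a = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
#   b = mean(y) - a * mean(x)
x_mean, y_mean = x.mean(), y.mean()
a = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b = y_mean - a * x_mean

print('a =', a)  # slope
print('b =', b)  # intercept
```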

In [None]:
# create linear regression model object
lr = LinearRegression()

# get the input and output variables
x = df[['Engine Size [L]']]
y = df['CO2 Emissions [g/km]']

# split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
print('Training set: ', x_train.shape, y_train.shape)
print('Testing set: ', x_test.shape, y_test.shape)

In [None]:
# fit the model
lr.fit(x_train, y_train)


print('Coefficients: ', lr.coef_)
print('Intercept: ', lr.intercept_)

In [None]:
# make predictions
y_pred_train = lr.predict(x_train)
y_pred_test = lr.predict(x_test)
print('Predictions: ', y_pred_test.shape)

In [None]:
# evaluate the model on training and testing sets
print('Training set:')
print('MSE: {:.2f}'.format(mean_squared_error(y_train, y_pred_train)))
print('RMSE: {:.2f}'.format(np.sqrt(mean_squared_error(y_train, y_pred_train))))
print('R2: {:.2f}'.format(r2_score(y_train, y_pred_train)))
print()
print('Testing set:')
print('MSE: {:.2f}'.format(mean_squared_error(y_test, y_pred_test)))
print('RMSE: {:.2f}'.format(np.sqrt(mean_squared_error(y_test, y_pred_test))))
print('R2: {:.2f}'.format(r2_score(y_test, y_pred_test)))
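
For reference, the three metrics printed above can be computed by hand. The values below are toy numbers chosen so that the arithmetic is easy to follow; they are not results from the emissions dataset.

```python
import numpy as np

# toy values (assumed for illustration)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

# MSE: average of the squared errors
mse = np.mean((y_true - y_pred) ** 2)

# RMSE: square root of the MSE, expressed in the same units as y
rmse = np.sqrt(mse)

# R2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print('MSE: {:.2f}'.format(mse))    # 0.50
print('RMSE: {:.2f}'.format(rmse))  # 0.71
print('R2: {:.2f}'.format(r2))      # 0.90
```

An R2 of 1 means the predictions match the targets exactly, while 0 means the model does no better than always predicting the mean of `y`.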

In [None]:
# visualize the model fit
plt.figure(figsize=(10, 6))
plt.scatter(x_test, y_test)
plt.plot(x, lr.predict(x), color='red')
plt.title('Linear Regression')
plt.xlabel('Engine Size [L]')
plt.ylabel('CO2 Emissions [g/km]')
plt.show()

Visualize the effect of other parameters

In [None]:
plt.figure(figsize=(15, 10))
sns.scatterplot(x='Engine Size [L]', y='CO2 Emissions [g/km]', size='Fuel Consumption Comb [L/100 km]', hue='Fuel Type', data=df)

In [None]:
# create linear regression model object
lr = LinearRegression()

# get the input and output variables
x = df[['Fuel Consumption Comb [L/100 km]']]
y = df['CO2 Emissions [g/km]']

# split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
print('Training set: ', x_train.shape, y_train.shape)
print('Testing set: ', x_test.shape, y_test.shape)

In [None]:
# fit the model
lr.fit(x_train, y_train)
print('Coefficients: ', lr.coef_)
print('Intercept: ', lr.intercept_)

In [None]:
# make predictions
y_pred_train = lr.predict(x_train)
y_pred_test = lr.predict(x_test)
print('Predictions: ', y_pred_test.shape)

In [None]:
# evaluate the model on training and testing sets
print('Training set:')
print('MSE: {:.2f}'.format(mean_squared_error(y_train, y_pred_train)))
print('RMSE: {:.2f}'.format(np.sqrt(mean_squared_error(y_train, y_pred_train))))
print('R2: {:.2f}'.format(r2_score(y_train, y_pred_train)))
print()
print('Testing set:')
print('MSE: {:.2f}'.format(mean_squared_error(y_test, y_pred_test)))
print('RMSE: {:.2f}'.format(np.sqrt(mean_squared_error(y_test, y_pred_test))))
print('R2: {:.2f}'.format(r2_score(y_test, y_pred_test)))

In [None]:
# visualize the model fit
plt.figure(figsize=(10, 6))
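# labels are required for plt.legend() below to have something to show
plt.scatter(x_train, y_train, label='Training data')
plt.plot(x, lr.predict(x), color='red', label='Linear fit')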
plt.title('Linear Regression')
plt.xlabel('Fuel Consumption Comb [L/100 km]')
plt.ylabel('CO2 Emissions [g/km]')
plt.legend()
plt.show()

Visualize the effect of other parameters

In [None]:
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='Fuel Consumption Comb [L/100 km]', y='CO2 Emissions [g/km]', hue='Fuel Type', size='Engine Size [L]')
plt.title('CO2 Emissions vs Fuel Consumption')
plt.show()

<hr id="mlr">

<h2>4. Multiple Linear Regression</h2>

Multiple linear regression is an extension of simple linear regression that allows for the inclusion of multiple independent variables. It assumes that the relationship between the dependent variable and the independent variables is linear and additive, meaning that the effect of each independent variable on the dependent variable is independent of the other independent variables.

The model has the form:
$$
y = a_0 +  a_1x_1 + a_2x_2 + ...
$$

where
- `a_0` is the intercept with the Y axis
- `a_1, a_2, ...` are the coefficients (slopes) of the independent variables
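
The coefficients of this model can be found by solving a least-squares problem on a design matrix. The sketch below does so with NumPy on toy data generated exactly from `y = 1 + 2*x1 + 3*x2` (an assumption made for the example), so the recovered coefficients are known in advance.

```python
import numpy as np

# toy data (assumed): generated exactly from y = 1 + 2*x1 + 3*x2
x1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
x2 = np.array([1.0, 0.0, 2.0, 1.0, 3.0])
y = 1.0 + 2.0 * x1 + 3.0 * x2

# design matrix: a column of ones for the intercept a_0, then one column per feature
X = np.column_stack([np.ones_like(x1), x1, x2])

# least-squares solution of X @ a = y
a, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print('a_0, a_1, a_2 =', np.round(a, 6))
```

`LinearRegression` handles the intercept column itself (`fit_intercept=True` by default), which is why the notebook cells pass only the feature columns.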

In [None]:
# create linear regression model object
lr = LinearRegression()

# get the input and output variables
x = df[['Engine Size [L]', 'Fuel Consumption Comb [L/100 km]']]
y = df['CO2 Emissions [g/km]']

# split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
print('Training set: ', x_train.shape, y_train.shape)
print('Testing set: ', x_test.shape, y_test.shape)

In [None]:
# fit the model
lr.fit(x_train, y_train)
print('Coefficients: ', lr.coef_)
print('Intercept: ', lr.intercept_)

In [None]:
# make predictions
y_pred_train = lr.predict(x_train)
y_pred_test = lr.predict(x_test)
print('Predictions: ', y_pred_test.shape)

In [None]:
# evaluate the model on training and testing sets
print('Training set:')
print('MSE: {:.2f}'.format(mean_squared_error(y_train, y_pred_train)))
print('RMSE: {:.2f}'.format(np.sqrt(mean_squared_error(y_train, y_pred_train))))
print('R2: {:.2f}'.format(r2_score(y_train, y_pred_train)))
print()
print('Testing set:')
print('MSE: {:.2f}'.format(mean_squared_error(y_test, y_pred_test)))
print('RMSE: {:.2f}'.format(np.sqrt(mean_squared_error(y_test, y_pred_test))))
print('R2: {:.2f}'.format(r2_score(y_test, y_pred_test)))

In [None]:
# use plotly to create an interactive 3D plot
import plotly.express as px
import plotly.graph_objects as go

# plot scatter plot in 3D
fig = px.scatter_3d(df, x='Engine Size [L]', y='Fuel Consumption Comb [L/100 km]', z='CO2 Emissions [g/km]', color='Fuel Type')
fig.update_traces(marker=dict(size=3))

# add linear regression surface
x_grid = np.arange(0.0, 10.0, 0.1)
y_grid = np.arange(0.0, 30.0, 0.1)
x_grid, y_grid = np.meshgrid(x_grid, y_grid)
x_lr = np.stack((x_grid.flatten(), y_grid.flatten()), axis=1)
z_grid = lr.predict(x_lr).reshape(x_grid.shape)

fig.add_trace(go.Surface(
    x=x_grid,
    y=y_grid,
    z=z_grid,
    opacity=0.2,
    showscale=False,    
    surfacecolor=z_grid-z_grid,
))

# update figure size
fig.update_layout(
    height=700,
)
fig.show()

<hr id="nlr">

<h2>5. Non-Linear Regression</h2>

Non-linear regression is a statistical method that models the relationship between a single or multiple independent variables and a dependent variable when the relationship is not linear.  
Non-linear regression techniques can capture more complex relationships between variables than linear regression methods.  
Some common non-linear regression techniques include polynomial regression, support vector regression (SVR), decision tree regression, and neural networks.

Many non-linear equations (e.g. polynomial or exponential) can be linearized by transforming the variables. Linear regression is then performed on the linearized equation using the `LinearRegression()` model.

For example, for a polynomial of the second degree:
$$
y = a_0 + a_1x + a_2x^2
$$
This can be linearized as follows:
$$
y = a_0 + a_1x_1 + a_2x_2
$$
Where,
$$
x_1 = x, \qquad x_2 = x^2
$$
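
The same idea applies to the exponential case mentioned earlier in this section. A model of the form `y = A * exp(k * x)` becomes a straight line after taking logarithms: `ln(y) = ln(A) + k * x`. The sketch below uses toy data generated exactly from `A = 2` and `k = 0.5`; both values are assumptions for the example.

```python
import numpy as np

# toy data (assumed): generated exactly from y = 2 * exp(0.5 * x)
x = np.linspace(0.0, 4.0, 9)
y = 2.0 * np.exp(0.5 * x)

# linearize: ln(y) = ln(A) + k * x, then fit a straight line to (x, ln y)
# np.polyfit returns the coefficients from the highest degree down: [k, ln(A)]
k, ln_A = np.polyfit(x, np.log(y), deg=1)
A = np.exp(ln_A)

print('k = {:.3f}'.format(k))
print('A = {:.3f}'.format(A))
```

This transformation only works when every `y` value is positive. Because the fit minimizes the error in `ln(y)` rather than in `y`, noisy data may give slightly different parameters than a direct non-linear fit would.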

In [None]:
# create linear regression model object
lr = LinearRegression()

# get the input and output variables
x = df[['Engine Size [L]']]
y = df['CO2 Emissions [g/km]']

# create polynomial features
# automatically creates new columns for x^2 and x^3 (up to degree=3), calculated from the input variable x
poly = PolynomialFeatures(degree=3, include_bias=False)
x_poly = poly.fit_transform(x)
print('Polynomial features: ', x_poly.shape)
print()
print(x[:5])
print()
print(x_poly[:5])

# split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x_poly, y, test_size=0.2)
print('Training set: ', x_train.shape, y_train.shape)
print('Testing set: ', x_test.shape, y_test.shape)

In [None]:
# fit the model
lr.fit(x_train, y_train)
print('Coefficients: ', lr.coef_)
print('Intercept: ', lr.intercept_)

In [None]:
# make predictions
y_pred_train = lr.predict(x_train)
y_pred_test = lr.predict(x_test)
print('Predictions: ', y_pred_test.shape)

In [None]:
# evaluate the model on training and testing sets
print('Training set:')
print('MSE: {:.2f}'.format(mean_squared_error(y_train, y_pred_train)))
print('RMSE: {:.2f}'.format(np.sqrt(mean_squared_error(y_train, y_pred_train))))
print('R2: {:.2f}'.format(r2_score(y_train, y_pred_train)))
print()
print('Testing set:')
print('MSE: {:.2f}'.format(mean_squared_error(y_test, y_pred_test)))
print('RMSE: {:.2f}'.format(np.sqrt(mean_squared_error(y_test, y_pred_test))))
print('R2: {:.2f}'.format(r2_score(y_test, y_pred_test)))


In [None]:
# visualize the model fit
plt.figure(figsize=(10, 6))
plt.scatter(x_test[:, 0], y_test, color='blue')
x_fit = np.arange(0.0, 10.0, 0.1).reshape(-1, 1)
x_fit_poly = poly.transform(pd.DataFrame(x_fit, columns=x.columns))  # reuse the fitted transformer instead of re-fitting it
y_fit = lr.predict(x_fit_poly)
plt.plot(x_fit, y_fit, color='red')
plt.title('Polynomial Regression')
plt.xlabel('Engine Size [L]')
plt.ylabel('CO2 Emissions [g/km]')
plt.show()

### Multiple Non-Linear Regression

In [None]:
# create linear regression model object
lr = LinearRegression()

# get the input and output variables
x = df[['Engine Size [L]', 'Fuel Consumption Comb [L/100 km]']]
y = df['CO2 Emissions [g/km]']

# create polynomial features
poly = PolynomialFeatures(degree=3, include_bias=False)
x_poly = poly.fit_transform(x)
print('Polynomial features: ', x_poly.shape)
print()
print(x[:5])
print()
print(x_poly[:5])
print()

# split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x_poly, y, test_size=0.2)
print('Training set: ', x_train.shape, y_train.shape)
print('Testing set: ', x_test.shape, y_test.shape)

In [None]:
# fit the model
lr.fit(x_train, y_train)
print('Coefficients: ', lr.coef_)
print('Intercept: ', lr.intercept_)

In [None]:
# make predictions
y_pred_train = lr.predict(x_train)
y_pred_test = lr.predict(x_test)
print('Predictions: ', y_pred_test.shape)

In [None]:
# evaluate the model on training and testing sets
print('Training set:')
print('MSE: {:.2f}'.format(mean_squared_error(y_train, y_pred_train)))
print('RMSE: {:.2f}'.format(np.sqrt(mean_squared_error(y_train, y_pred_train))))
print('R2: {:.2f}'.format(r2_score(y_train, y_pred_train)))
print()
print('Testing set:')
print('MSE: {:.2f}'.format(mean_squared_error(y_test, y_pred_test)))
print('RMSE: {:.2f}'.format(np.sqrt(mean_squared_error(y_test, y_pred_test))))
print('R2: {:.2f}'.format(r2_score(y_test, y_pred_test)))

In [None]:
# use plotly to create an interactive 3D plot

# plot scatter plot in 3D
fig = px.scatter_3d(df, x='Engine Size [L]', y='Fuel Consumption Comb [L/100 km]', z='CO2 Emissions [g/km]', color='Fuel Type')
fig.update_traces(marker=dict(size=3))

# add polynomial regression surface
x_grid = np.arange(0.0, 10.0, 0.1)
y_grid = np.arange(0.0, 30.0, 0.1)
x_grid, y_grid = np.meshgrid(x_grid, y_grid)

# create polynomial features
x_poly_grid = np.stack((x_grid.flatten(), y_grid.flatten()), axis=1)
x_poly_grid = poly.transform(pd.DataFrame(x_poly_grid, columns=x.columns))  # reuse the fitted transformer instead of re-fitting it
z_grid = lr.predict(x_poly_grid).reshape(x_grid.shape)

fig.add_trace(go.Surface(
    x=x_grid,
    y=y_grid,
    z=z_grid,
    opacity=0.2,
    showscale=False,    
    surfacecolor=z_grid-z_grid,
))

# update figure size
fig.update_layout(
    height=700,
)
fig.show()

<hr id="pipe">

<h2>6. Pipelines and Grid Search</h2>

### Pipelines

Scikit-learn pipelines chain together multiple preprocessing, transformation, and modeling steps in a sequential workflow. This simplifies machine learning workflows and improves code organization and readability.

Benefits:

- Improved code organization and readability: Easier to understand and maintain workflows.
- Reduced boilerplate code: Single call to fit and predict the entire pipeline.
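- Safer cross-validation: Because a pipeline behaves like a single estimator, tools such as `cross_val_score` and `GridSearchCV` re-fit every preprocessing step inside each fold, so no information from the validation data leaks into the transformers.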
- Streamlined workflow: Automates data processing and model training steps.
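
Because a pipeline acts as one estimator, it can be handed directly to scikit-learn's cross-validation helpers. The sketch below uses `cross_val_score` on synthetic data; the data-generating formula and the noise level are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# synthetic data (assumed): quadratic relationship with two features
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 5.0, size=(300, 2))
y = 1.0 + 2.0 * X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(0.0, 0.3, size=300)

pipe = Pipeline([
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('scaler', StandardScaler()),
    ('lr', LinearRegression()),
])

# the whole pipeline is cloned and re-fitted inside each fold,
# so the scaler never sees the validation rows of that fold
scores = cross_val_score(pipe, X, y, cv=5, scoring='r2')
print('R2 per fold:', np.round(scores, 3))
print('mean R2: {:.3f}'.format(scores.mean()))
```

Scaling the full dataset first and cross-validating afterwards would let statistics from the validation rows influence the scaler. Wrapping the steps in a pipeline avoids that by construction.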

In [None]:
# get the input and output variables
x = df[['Engine Size [L]', 'Fuel Consumption Comb [L/100 km]']]
y = df['CO2 Emissions [g/km]']

# create pipeline
pipe = Pipeline([
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('scaler', StandardScaler()),
    ('lr', LinearRegression())
])

# split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
print('Training set: ', x_train.shape, y_train.shape)
print('Testing set: ', x_test.shape, y_test.shape)

In [None]:
# fit the model
pipe.fit(x_train, y_train)
print('Coefficients: ', pipe['lr'].coef_)
print('Intercept: ', pipe['lr'].intercept_)

In [None]:
# make predictions
y_pred_train = pipe.predict(x_train)
y_pred_test = pipe.predict(x_test)
print('Predictions: ', y_pred_test.shape)

In [None]:
# evaluate the model on training and testing sets
print('Training set:')
print('MSE: {:.2f}'.format(mean_squared_error(y_train, y_pred_train)))
print('RMSE: {:.2f}'.format(np.sqrt(mean_squared_error(y_train, y_pred_train))))
print('R2: {:.2f}'.format(r2_score(y_train, y_pred_train)))
print()
print('Testing set:')
print('MSE: {:.2f}'.format(mean_squared_error(y_test, y_pred_test)))
print('RMSE: {:.2f}'.format(np.sqrt(mean_squared_error(y_test, y_pred_test))))
print('R2: {:.2f}'.format(r2_score(y_test, y_pred_test)))

### Grid Search

Scikit-learn's `GridSearchCV` (grid search with cross-validation) is a hyperparameter tuning method that systematically evaluates model performance across a predefined grid of parameter values. It automates the exploration of different training configurations and identifies the combination with the best cross-validated score, which serves as an estimate of performance on unseen data.

Algorithm:

1. Generate parameter combinations: GridSearchCV iterates through all possible combinations of the specified parameter values.
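2. Train and evaluate models: For each combination, the data is split into k folds. The model is trained on k-1 folds and evaluated on the remaining held-out fold, and this is repeated so that each fold serves once as the validation set.
3. Performance scoring: A chosen metric (e.g., R-squared or mean squared error for regression, accuracy or F1-score for classification) is used to quantify the model's performance on each held-out fold.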
4. Aggregate results: Scores are averaged across folds for each parameter combination.
5. Identify best model: The combination with the highest average score is identified as the optimal configuration.
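
The steps above can be written out as an explicit loop and compared with `GridSearchCV`. This is a sketch of the idea on synthetic quadratic data (an assumption for the example), not a description of how scikit-learn implements the search internally.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# synthetic data (assumed): quadratic relationship in one feature
rng = np.random.default_rng(1)
X = rng.uniform(-3.0, 3.0, size=(150, 1))
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 0] ** 2 + rng.normal(0.0, 1.0, size=150)

pipe = Pipeline([
    ('poly', PolynomialFeatures(include_bias=False)),
    ('lr', LinearRegression()),
])
degrees = [1, 2, 3]
folds = KFold(n_splits=5, shuffle=True, random_state=0)

# steps 1-4 by hand: for each candidate, train on k-1 folds and score on the held-out fold
manual_scores = []
for degree in degrees:
    fold_scores = []
    for train_idx, val_idx in folds.split(X):
        model = clone(pipe).set_params(poly__degree=degree)
        model.fit(X[train_idx], y[train_idx])
        fold_scores.append(r2_score(y[val_idx], model.predict(X[val_idx])))
    manual_scores.append(np.mean(fold_scores))

# step 5: pick the candidate with the highest mean score
best_degree = degrees[int(np.argmax(manual_scores))]
print('manual mean R2 per degree:', np.round(manual_scores, 4))
print('manual best degree:', best_degree)

# the same search with GridSearchCV, using identical folds
grid = GridSearchCV(pipe, {'poly__degree': degrees}, cv=folds, scoring='r2')
grid.fit(X, y)
print('GridSearchCV best params:', grid.best_params_)
```

After the search, `GridSearchCV` re-fits the best configuration on all the data passed to `fit` (its default `refit=True` behaviour), which is why `grid.predict` can be called directly afterwards.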

Benefits:

- Improved model performance: By exploring various configurations, GridSearchCV optimizes hyperparameters, potentially leading to significantly better model performance.
- Reduced manual effort: Automates the hyperparameter tuning process, saving time and effort compared to manual exploration.
- Data-driven insights: Provides data-driven insights into how different parameter values influence model performance.

In [None]:
# grid search
from sklearn.model_selection import GridSearchCV

# get the input and output variables
x = df[['Engine Size [L]', 'Fuel Consumption Comb [L/100 km]']]
y = df['CO2 Emissions [g/km]']

# create pipeline
pipe = Pipeline([
    ('poly', PolynomialFeatures(include_bias=False)),
    ('scaler', StandardScaler()),
    ('lr', LinearRegression())
])

# create grid search
param_grid = {
    'poly__degree': [1, 2, 3],
}
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5, scoring='r2')

# split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
print('Training set: ', x_train.shape, y_train.shape)
print('Testing set: ', x_test.shape, y_test.shape)

In [None]:
# fit the model
grid.fit(x_train, y_train)

# get the best parameters
print('Best parameters: ', grid.best_params_)
print('Best score: ', grid.best_score_)


In [None]:
# make predictions
y_pred_train = grid.predict(x_train)
y_pred_test = grid.predict(x_test)
print('Predictions: ', y_pred_test.shape)

In [None]:
# evaluate the model on training and testing sets
print('Training set:')
print('MSE: {:.2f}'.format(mean_squared_error(y_train, y_pred_train)))
print('RMSE: {:.2f}'.format(np.sqrt(mean_squared_error(y_train, y_pred_train))))
print('R2: {:.2f}'.format(r2_score(y_train, y_pred_train)))
print()
print('Testing set:')
print('MSE: {:.2f}'.format(mean_squared_error(y_test, y_pred_test)))
print('RMSE: {:.2f}'.format(np.sqrt(mean_squared_error(y_test, y_pred_test))))
print('R2: {:.2f}'.format(r2_score(y_test, y_pred_test)))

<hr id="eval">

<h2>7. Visual Evaluation of Higher Dimensional Models</h2>

Fit a model with more than 2 input parameters (total dimensions > 3)

### Label Encoding Categorical Features

In [None]:
# label encode "fuel type" (note: regular 'X' and premium 'Z' gasoline share code 2)
types = {
    'E': 1,
    'X': 2,
    'Z': 2,
    'D': 3,
}
df['Fuel Type'] = df['Fuel Type'].map(types)

df['Fuel Type'].value_counts()


In [None]:
# get the input and output variables
x = df[['Engine Size [L]', 'Fuel Consumption Comb [L/100 km]', 'Fuel Type']]
y = df['CO2 Emissions [g/km]']

# split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
print('Training set: ', x_train.shape, y_train.shape)
print('Testing set: ', x_test.shape, y_test.shape)

In [None]:
# create model object
from sklearn.ensemble import RandomForestRegressor
model = LinearRegression()
# model = RandomForestRegressor(n_estimators=100)

# fit the model
model.fit(x_train, y_train)
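# report the model fitted in this cell (not the earlier `lr` object);
# coefficients exist only for linear models, not for RandomForestRegressor
if hasattr(model, 'coef_'):
    print('Coefficients: ', model.coef_)
    print('Intercept: ', model.intercept_)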

In [None]:
# make predictions
y_pred_train = model.predict(x_train)
y_pred_test = model.predict(x_test)
print('Predictions: ', y_pred_test.shape)

In [None]:
# evaluate the model on training and testing sets
print('Training set:')
print('MSE: {:.2f}'.format(mean_squared_error(y_train, y_pred_train)))
print('RMSE: {:.2f}'.format(np.sqrt(mean_squared_error(y_train, y_pred_train))))
print('R2: {:.2f}'.format(r2_score(y_train, y_pred_train)))
print()
print('Testing set:')
print('MSE: {:.2f}'.format(mean_squared_error(y_test, y_pred_test)))
print('RMSE: {:.2f}'.format(np.sqrt(mean_squared_error(y_test, y_pred_test))))
print('R2: {:.2f}'.format(r2_score(y_test, y_pred_test)))

**1. Predicted vs Actual Plot**

In [None]:
# plot the predictions vs. the actual values for training and testing sets
plt.figure(figsize=(10, 6))
plt.scatter(y_train, y_pred_train, label='Training')
plt.scatter(y_test, y_pred_test, label='Testing')

# plot 45 degree line
plt.plot(y, y, color='red')

plt.title('Predictions vs. Actual Values')
plt.xlabel('Actual Values')
plt.ylabel('Predictions')
plt.legend()
plt.show()

**2. Residuals Plot**

In [None]:
# plot the residuals (actual - predicted) vs. the predictions for training and testing sets
plt.figure(figsize=(10, 6))
plt.scatter(y_pred_train, y_train - y_pred_train, label='Training')
plt.scatter(y_pred_test, y_test - y_pred_test, label='Testing')

# plot 0 line
plt.axhline(y=0, color='red')

plt.title('Residuals vs. Predictions')
plt.xlabel('Predictions')
plt.ylabel('Residuals')
plt.legend()
plt.show()

**3. Feature Importances**

In [None]:
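# plot relative feature importance
# tree-based models expose feature_importances_; a linear model does not, so fall back
# to absolute coefficients (a rough guide only, since the features are not scaled)
importances = getattr(model, 'feature_importances_', None)
if importances is None:
    importances = np.abs(model.coef_)
features = pd.Series(importances, index=x.columns).sort_values(ascending=False)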

plt.figure(figsize=(10, 6))
sns.barplot(x=features, y=features.index)
plt.title('Feature Importance')
plt.xlabel('Relative Importance')
plt.ylabel('Feature')
plt.grid(axis='x')
plt.show()

In [None]:
# export the trained model to a file to use later without retraining
import joblib

# save the model to disk
joblib.dump(model, 'tree.joblib')
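
Loading the saved file back is the other half of this step. The sketch below shows a save and load round trip with a small stand-in model. The training points and the file name `model.joblib` are hypothetical and chosen only for the example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# small stand-in model (assumed) trained on points from y = 2x + 1
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])
model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model.joblib')  # hypothetical file name
    joblib.dump(model, path)       # save the fitted model to disk
    restored = joblib.load(path)   # load it back without retraining

X_new = np.array([[5.0]])
print('original model: ', model.predict(X_new))
print('restored model: ', restored.predict(X_new))
```

Files written by `joblib` are pickle-based, so they should only be loaded from trusted sources, and ideally with the same scikit-learn version that created them.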

<hr id="class">

<h2>8. Classification Overview</h2>


#### Classification  
Classification is a statistical and machine learning technique used to assign labels or categories to data points based on one or more features (independent variables).  
The goal of classification is to determine the category or class of a data point by learning from labeled training data. This makes classification a cornerstone of predictive modeling, widely used in tasks like spam detection, medical diagnosis, and image recognition.  

#### Goodness of Classification  
Goodness of classification refers to how well the model assigns the correct class labels to new, unseen data. A good classification model balances capturing patterns in the data without overfitting to noise.
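
Classification quality is usually summarized from four counts: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The sketch below derives accuracy, precision, and recall from those counts using made-up labels, so the numbers are illustrative only.

```python
# toy labels (assumed for illustration): 1 = positive class, 0 = negative class
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly predicted positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correctly predicted negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # negatives predicted as positive
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # positives predicted as negative

accuracy = (tp + tn) / len(pairs)   # share of all predictions that are correct
precision = tp / (tp + fp)          # share of predicted positives that are correct
recall = tp / (tp + fn)             # share of actual positives that were found

print('TP={}, TN={}, FP={}, FN={}'.format(tp, tn, fp, fn))
print('accuracy={:.2f}, precision={:.2f}, recall={:.2f}'.format(accuracy, precision, recall))
```

Accuracy alone can be misleading when one class is much more common than the other, which is why precision and recall are typically reported alongside it.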

---

#### Logistic Regression  
Logistic regression is a classification algorithm used to predict a binary outcome (e.g., 0 or 1, True or False, Yes or No) based on one or more independent variables. Despite its name, logistic regression is not used for regression tasks but for classification.

The logistic regression model predicts the probability that an observation belongs to a particular class. The relationship between the independent variables and the predicted probability is modeled using the **logistic (sigmoid) function**:

$$
P(y=1|X) = \frac{1}{1 + e^{-(a_0 + a_1x_1 + a_2x_2 + \dots)}}
$$

Where:  

- $P(y=1|X)$: The predicted probability of the positive class (class 1).  

- $a_0$: The intercept (bias).  

- $a_1, a_2, \dots$: The coefficients of the independent variables.  

The model output is a probability value between 0 and 1, which is thresholded (e.g., at 0.5) to decide the predicted class. Logistic regression works well for binary classification problems and is interpretable and computationally efficient.
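
The probability and thresholding steps can be sketched in a few lines of plain Python. The coefficients below are hypothetical values picked for the example; they are not fitted to the tutorial data.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# hypothetical coefficients for two features (not fitted to any data)
a0, a1, a2 = -1.0, 0.8, -0.5


def predict_proba(x1, x2):
    """Probability of the positive class for one observation."""
    return sigmoid(a0 + a1 * x1 + a2 * x2)


def predict_class(x1, x2, threshold=0.5):
    """Predicted class: 1 if the probability reaches the threshold, else 0."""
    return int(predict_proba(x1, x2) >= threshold)


for x1, x2 in [(0.0, 0.0), (5.0, 1.0), (1.0, 4.0)]:
    p = predict_proba(x1, x2)
    print('x = ({}, {}): P(y=1) = {:.3f}, class = {}'.format(x1, x2, p, predict_class(x1, x2)))
```

Raising the threshold above 0.5 makes the model more reluctant to predict the positive class, and lowering it has the opposite effect.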

---

#### How Logistic Regression Relates to Linear Regression  
Logistic regression is closely related to linear regression, as it starts with the same linear equation to model the relationship between independent variables and the target variable:

$$
z = a_0 + a_1x_1 + a_2x_2 + \dots
$$

In linear regression, this equation directly predicts the dependent variable $y$. However, in logistic regression, this linear combination $z$ is transformed using the logistic (sigmoid) function:

$$
P(y=1|X) = \frac{1}{1 + e^{-z}}
$$

This transformation ensures that the output is always between 0 and 1, making it suitable for probability estimation. Logistic regression is called "regression" because it models the underlying relationship between the independent variables and the log-odds of the outcome:

$$
\text{log-odds} = \ln\left(\frac{P(y=1|X)}{1 - P(y=1|X)}\right) = z
$$

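The log-odds (the natural logarithm of the odds, i.e. the ratio of the probability of the positive class to that of the negative class) is a linear function of the independent variables, just like the prediction in linear regression.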
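
A quick numerical check makes this relationship concrete: applying the log-odds to the output of the sigmoid recovers the original linear value `z`. The `z` values below are arbitrary examples.

```python
import math


def sigmoid(z):
    """Logistic function: probability of the positive class for a linear score z."""
    return 1.0 / (1.0 + math.exp(-z))


def log_odds(p):
    """Natural logarithm of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))


for z in (-2.0, 0.0, 0.5, 3.0):
    p = sigmoid(z)
    print('z = {:+.1f} -> P = {:.4f} -> log-odds = {:+.4f}'.format(z, p, log_odds(p)))
```

In other words, each coefficient describes how much the log-odds change when its variable increases by one unit, with the other variables held fixed.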

---

#### Why is Logistic Regression Called Regression Despite Being a Classification Algorithm?  
Logistic regression retains the "regression" name because:  
1. It uses a linear regression-like equation as its foundation.  
2. It estimates parameters (coefficients) using a regression-like approach (maximum likelihood estimation).  
3. It predicts a continuous value (probability) before applying a threshold for classification.  

Thus, logistic regression combines the interpretability and simplicity of linear regression with the ability to handle classification problems, making it a versatile and widely used algorithm.

In [None]:
# create a binary feature for low-emission
df['low-emission'] = df['CO2 Emissions [g/km]'] <= 200

df['low-emission'].value_counts()

In [None]:
# create logistic regression model
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Initialize logistic regression model
log_reg = LogisticRegression()

# get input and output variables
x = df[['Engine Size [L]', 'Fuel Consumption Comb [L/100 km]', 'Fuel Type']]
y = df['low-emission']

# split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
print('Training set: ', x_train.shape, y_train.shape)
print('Testing set: ', x_test.shape, y_test.shape)


In [None]:
# fit the model
log_reg.fit(x_train, y_train)

# print the coefficients and intercept
print('Coefficients: ', log_reg.coef_)
print('Intercept: ', log_reg.intercept_)


In [None]:
# make predictions
y_pred_train = log_reg.predict(x_train)
y_pred_test = log_reg.predict(x_test)
print('Predictions: ', y_pred_test.shape)

In [None]:
# evaluate the model on training and testing sets
print('Training set:')
print('Accuracy: {:.2f}'.format(accuracy_score(y_train, y_pred_train)))
print('Confusion Matrix:')
print(confusion_matrix(y_train, y_pred_train))
print('Classification Report:')
print(classification_report(y_train, y_pred_train))
print()
print('Testing set:')
print('Accuracy: {:.2f}'.format(accuracy_score(y_test, y_pred_test)))
print('Confusion Matrix:')
print(confusion_matrix(y_test, y_pred_test))
print('Classification Report:')
print(classification_report(y_test, y_pred_test))

In [None]:
# use plotly to create an interactive 3D plot

# create a column for true positive and false positive and true negative and false negative
df['y_pred'] = log_reg.predict(x)
df['result'] = ''
# use .loc instead of chained indexing, which pandas may apply to a temporary copy
df.loc[(df['low-emission'] == True) & (df['y_pred'] == True), 'result'] = 'TP'
df.loc[(df['low-emission'] == False) & (df['y_pred'] == False), 'result'] = 'TN'
df.loc[(df['low-emission'] == True) & (df['y_pred'] == False), 'result'] = 'FN'
df.loc[(df['low-emission'] == False) & (df['y_pred'] == True), 'result'] = 'FP'

# plot scatter plot in 3D
fig = px.scatter_3d(df, x='Engine Size [L]', y='Fuel Consumption Comb [L/100 km]', z='Fuel Type', color='result')
fig.update_traces(marker=dict(size=3))

# get the coefficients and intercept
coefficients = log_reg.coef_[0]
intercept = log_reg.intercept_[0]

# plot the decision boundary
# x3 = -(b + w1*x1 + w2*x2) / w3
x1 = np.arange(0.0, 10.0, 0.1)
x2 = np.arange(0.0, 30.0, 0.1)
x1, x2 = np.meshgrid(x1, x2)
x3 = -(intercept + coefficients[0]*x1 + coefficients[1]*x2) / coefficients[2]

fig.add_trace(go.Surface(
    x=x1,
    y=x2,
    z=x3,
    opacity=0.2,
    showscale=False,    
    surfacecolor=x3-x3,
))

# update figure size
fig.update_layout(
    height=700,
)
fig.show()
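
The surface drawn above is the set of points where the linear score is zero, which is exactly where the predicted probability equals 0.5. The sketch below checks this numerically with hypothetical coefficients and an intercept that are not taken from the fitted model.

```python
import numpy as np

# hypothetical coefficients and intercept (not taken from the fitted model)
w = np.array([1.5, -0.4, 0.9])
b = -2.0


def sigmoid(z):
    """Logistic function applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))


# choose x1 and x2 freely, then solve b + w1*x1 + w2*x2 + w3*x3 = 0 for x3
x1 = np.array([0.0, 1.0, 2.5])
x2 = np.array([5.0, 10.0, 7.5])
x3 = -(b + w[0] * x1 + w[1] * x2) / w[2]

# the linear score on the boundary is zero, so the probability is 0.5
z = b + w[0] * x1 + w[1] * x2 + w[2] * x3
p = sigmoid(z)
print('x3 on the boundary:', np.round(x3, 3))
print('P(y=1) on the boundary:', p)
```

Points on one side of this plane get a probability above 0.5 and are assigned to the positive class. Points on the other side are assigned to the negative class.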

---

#### NOTE
Model development and evaluation is an iterative process.  
We typically try multiple models and different combinations of parameters and select the most accurate based on the evaluation metrics.

<hr style="margin-top: 4rem;">
<h2>Author</h2>

<a href="https://github.com/SamerHany">Samer Hany</a>

<h2>References</h2>
<a href="https://www.w3schools.com/python/default.asp">w3schools.com</a>
<br>
<a href="https://www.geeksforgeeks.org/underfitting-and-overfitting-in-machine-learning/">geeksforgeeks.com</a>
<br>
<a href="https://www.kaggle.com/datasets/mrmorj/car-fuel-emissions">CO2 emissions dataset (kaggle.com)</a>