#  Simple Linear Regression

### Installs:

In [0]:
%%capture
%pip install numpy==2.4.0
%pip install pandas==2.3.3
%pip install scikit-learn==1.8.0
%pip install matplotlib==3.10.8
%pip seaborn==0.13.0

In [0]:
# Command to restart the kernel and update the installed libraries
%restart_python

### Imports:

In [0]:
# Data Analize
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Data Modeling / Model Linear / Metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, root_mean_squared_error, r2_score


### Load the data

In [0]:
df = pd.read_csv('./data/FuelConsumptionCo2.csv')

### verify successful load with some randomly selected records


In [0]:
df.sample(9)

### Understand the data

#### `FuelConsumption.csv`:
You will use a fuel consumption dataset, **`FuelConsumption.csv`**, which contains model-specific fuel consumption ratings and estimated carbon dioxide emissions for new light-duty vehicles for retail sale in Canada. [Dataset source](http://open.canada.ca/data/en/dataset/98f1a129-f628-4ce4-b24d-6f16bf24dd64).

- **MODEL YEAR** e.g. 2014
- **MAKE** e.g. VOLVO
- **MODEL** e.g. S60 AWD
- **VEHICLE CLASS** e.g. COMPACT
- **ENGINE SIZE** e.g. 3.0
- **CYLINDERS** e.g 6
- **TRANSMISSION** e.g. AS6
- **FUEL TYPE** e.g. Z
- **FUEL CONSUMPTION in CITY(L/100 km)** e.g. 13.2
- **FUEL CONSUMPTION in HWY (L/100 km)** e.g. 9.5
- **FUEL CONSUMPTION COMBINED (L/100 km)** e.g. 11.5
- **FUEL CONSUMPTION COMBINED MPG (MPG)** e.g. 25
- **CO2 EMISSIONS (g/km)** e.g. 182

The objective will be to create a simple linear regression model from one of these characteristics to predict CO2 emissions from unobserved cars based on that characteristic.


### Explore the data
First, consider a statistical summary of the data.

In [0]:
df.describe()

In [0]:
df.info()

We can observe from the statistics here that 75% of cars have a combined fuel consumption that is almost three times higher than that of the most efficient car, with respective values ​​of 31 MPG and 11 MPG.

The highest consumption, of 60 MPG, is suspiciously high, but may be legitimate.

The MODELYEAR has a standard deviation of 0 and therefore contains no relevant information.

### Select Features
I will be selecting some resources that may be indicative of CO2 emissions to explore further.

In [0]:
cdf = df[['ENGINESIZE', 'CYLINDERS', 'FUELCONSUMPTION_COMB', 'CO2EMISSIONS']]
cdf.sample(10)

### Visualize features
Consider the histograms for each of these features.


In [0]:
# Data collect
data_ax = cdf.copy()

sns.set_style('whitegrid')

# Define Figure
plt.figure(figsize  = (15, 10))

for i, col in enumerate(data_ax.columns):
    plt.subplot(2, 2, i + 1) # Create a grid 2x2

    # Histogram
    sns.histplot(
        data = data_ax[col],
        kde = True,
        color = 'teal'
    )

    plt.title(f'Distribution of: {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

Most engines have 4, 6, or 8 cylinders and engine sizes between 2 and 4 inches.
As expected, combined fuel consumption and CO2 emissions have very similar distributions.

Display some scatter plots of these characteristics in relation to CO2 emissions to see how their relationships are linear.

In [0]:
# Data collect
data_ax = cdf.copy()

sns.set_style('whitegrid')

# Define Figure
plt.figure(figsize  = (8, 4))

# scatter_kws={'alpha':0.5}: Makes the points transparent
# line_kws={'color':'red'}: Highlights the trendline in red

sns.regplot(
    x = 'FUELCONSUMPTION_COMB', 
    y = 'CO2EMISSIONS', 
    data = data_ax,
    color = 'teal',
    scatter_kws = {'alpha': 0.4},
    line_kws = {'color': 'red'}
)

plt.title('Linear Correlation with Regression Line: FUELCONSUMPTION_COMB x CO2EMISSIONS')
plt.tight_layout()
plt.show()

This is an informative result. Three car groups each have a strong linear relationship between their combined fuel consumption and their CO2 emissions.
Their intercepts are similar, while they noticeably differ in their slopes.


In [0]:
# Data collect
data_ax = cdf.copy()

sns.set_style('whitegrid')

# Define Figure
plt.figure(figsize  = (8, 4))

# scatter_kws={'alpha':0.5}: Makes the points transparent
# line_kws={'color':'red'}: Highlights the trendline in red

sns.regplot(
    x = 'ENGINESIZE', 
    y = 'CO2EMISSIONS', 
    data = data_ax,
    scatter_kws = {'alpha': 0.4, 'color': 'darkred'},
    line_kws = {'color': 'black', 'linewidth':2}
)

plt.title('Linear Correlation with Regression Line: ENGINESIZE x CO2EMISSIONS')
plt.xlim(0, 10)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()


Although the relationship between engine size and CO2 emissions is fairly linear, it can be observed that the correlation between them is weaker than that observed for each of the three fuel consumption groups. 

In [0]:
# Data collect
data_ax = cdf.copy()

sns.set_style('whitegrid')

# Define Figure
plt.figure(figsize  = (8, 4))

# scatter_kws={'alpha':0.5}: Makes the points transparent
# line_kws={'color':'red'}: Highlights the trendline in red

sns.regplot(
    x = 'CYLINDERS', 
    y = 'CO2EMISSIONS', 
    data = data_ax,
    scatter_kws = {'alpha': 0.4, 'color': 'darkgreen'},
    line_kws = {'color': 'black', 'linewidth':2}
)

plt.title('Linear Correlation with Regression Line: CYLINDERS x CO2EMISSIONS')
plt.xlim(0, 15)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

The number of cylinders directly impacts the amount of CO2 emissions. However, 12-cylinder cars consume less fuel than 10-cylinder cars. There is a considerable correlation between these two characteristics, but it is not a perfectly linear relationship.


### Define feature for Linear Regression

I selected the variable FUELCONSUMPTION_COMB as the main feature for the Simple Linear Regression. This choice is due to the fact that it presents the greatest clarity with the target variable compared to many variations observed.

In [0]:
X = cdf['FUELCONSUMPTION_COMB'].to_numpy()
y = cdf['CO2EMISSIONS'].to_numpy()

print(f'The shape of X is: {X.shape}')
print(f'\nThe shape of y is: {y.shape}')

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, shuffle = True,random_state = 33)

type(X_train), np.shape(X_train), np.shape(X_test)

### Creating the linear regression model

In [0]:
# Create Model
regressor = LinearRegression()
regressor.fit(X_train.reshape(-1, 1), y_train)

print(f'Coefficients: {regressor.coef_[0]}')
print(f'Intercept: {regressor.intercept_}')

The model indicates a marginal impact of 16.14: each additional unit of fuel increases emissions by ~16g of CO2. The intercept of 69.31 acts only as a mathematical adjustment (bias) to align the line with the observed data, having no real physical meaning, since zero consumption does not generate emissions.

### Viewing model outputs

In [0]:
# Data collect
data_ax = cdf.copy()

sns.set_style('whitegrid')

# Define Figure
plt.figure(figsize  = (8, 4))

sns.regplot(
    x = X_train, 
    y = y_train,
    fit_reg = False,
    scatter_kws = {'alpha': 0.5, 'color': 'royalblue'},
)

plt.plot(X_train, 
    regressor.predict(X_train.reshape(-1, 1)),
    color='firebrick', 
    linewidth=2.5
)

plt.title('Linear Regression of Training: FUELCONSUMPTION_COMB x CO2EMISSIONS', fontsize = 14)
plt.xlabel('FUELCONSUMPTION_COMB')
plt.ylabel('CO2EMISSIONS')

plt.tight_layout()
plt.show()

### Model Evaluation

In [0]:
# Use the predict method to make test predictions
y_test_  = regressor.predict(X_test.reshape(-1, 1))

print(f'Mean Absolute error: {mean_absolute_error(y_test_, y_test):.2f}')
print(f'Mean Squared error: {mean_squared_error(y_test_, y_test):.2f}')
print(f'Root men squared error: {root_mean_squared_error(y_test_, y_test):.2f}')
print(f'R2-score: {r2_score(y_test_, y_test):.2f}')

In [0]:
# Data collect
data_ax = cdf.copy()

sns.set_style('whitegrid')

# Define Figure
plt.figure(figsize  = (8, 4))

sns.regplot(
    x = X_test, 
    y = y_test,
    fit_reg = False,
    scatter_kws = {'alpha': 0.5, 'color': 'royalblue'},
)

plt.plot(X_test, 
    y_test_,
    color='firebrick', 
    linewidth=2.5
)

plt.title('Linear Regression of Test: FUELCONSUMPTION_COMB x CO2EMISSIONS', fontsize = 14)
plt.xlabel('FUELCONSUMPTION_COMB')
plt.ylabel('CO2EMISSIONS')

plt.tight_layout()
plt.show()

### Conclusion

The model is underfitted.

R² (0.74): Acceptable for a first test, but weak for physical data. The model did not capture 26% of the cars' behavior well,
suggesting that the real relationship is a curve, not a straight line.

MAE (19.17) vs RMSE (27.14): There is a large difference between them, which raises a red flag. The high RMSE indicates that the model is making significant errors for some specific cars and penalizing the overall average.

It is possible to conclude that a straight line for this type of data would not be the most appropriate solution. Perhaps exploring new model options with Polynomial Regression or Multilinear Regression would be helpful.