# <font color=black>Simple Linear Regression</font>

In this notebook, we use scikit-learn to implement simple linear regression. We load a dataset on the fuel consumption and carbon dioxide emissions of cars, split it into training and test sets, fit a model on the training set, evaluate the model on the test set, and finally use the model to predict unknown values.

Our goal is to understand the relationship between engine size and the carbon dioxide emissions of cars.

<h2>Understanding the Fuel Consumption dataset</h2>
    
### `FuelConsumption.csv`:
We have downloaded a fuel consumption dataset, **`FuelConsumption.csv`**, which contains model-specific fuel consumption ratings and estimated carbon dioxide emissions for new light-duty vehicles for retail sale in Canada. [Dataset source](http://open.canada.ca/data/en/dataset/98f1a129-f628-4ce4-b24d-6f16bf24dd64)

- **MODELYEAR** e.g. 2014
- **MAKE** e.g. Acura
- **MODEL** e.g. ILX
- **VEHICLE CLASS** e.g. SUV
- **ENGINE SIZE** e.g. 4.7
- **CYLINDERS** e.g. 6
- **TRANSMISSION** e.g. A6
- **FUEL CONSUMPTION in CITY(L/100 km)** e.g. 9.9
- **FUEL CONSUMPTION in HWY (L/100 km)** e.g. 8.9
- **FUEL CONSUMPTION COMB (L/100 km)** e.g. 9.2
- **CO2 EMISSIONS (g/km)** e.g. 182

## <font color=black>Importing Needed Packages</font>

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

### Check out the Data

In [None]:
#Load the dataset
FuelConsumption = pd.read_csv("FuelConsumption.csv")

In [None]:
# take a look at the dataset
FuelConsumption.head()

In [None]:
# Take a look at the dataframe containing fuel data
FuelConsumption.info()

In [None]:
# Take a look at descriptive statistics of the dataset
FuelConsumption.describe()

In [None]:
# Take a look at the column names
FuelConsumption.columns

# EDA - exploratory data analysis

Let's create some simple plots to check out the data!

In [None]:
# Distribution of the CO2 emissions column
sns.displot(FuelConsumption['CO2EMISSIONS'])

In [None]:
viz = FuelConsumption[['ENGINESIZE','CO2EMISSIONS']]
viz.hist()
plt.show()

Now, let's plot engine size against emissions to see how linear their relationship is:

In [None]:
plt.scatter(FuelConsumption['ENGINESIZE'], FuelConsumption['CO2EMISSIONS'],  color='blue')
plt.xlabel("Engine size")
plt.ylabel("Emission")
plt.show()

## Training a Linear Regression Model

Let's now begin to train our regression model! We first need to split our data into an X array that contains the engine size to train on, and a y array with the target variable, in this case the CO2 emissions column.

### X and y arrays

In [None]:
X = FuelConsumption[['ENGINESIZE']]
y = FuelConsumption['CO2EMISSIONS']

## Train Test Split

Now let's split the data into a training set and a testing set. We will train our model on the training set and then use the test set to evaluate the model.

A train/test split divides the dataset into a training set and a testing set that are mutually exclusive. You then train with the training set and test with the testing set.
This gives a more realistic estimate of out-of-sample accuracy, because the testing set is not part of the data that was used to train the model. It is closer to how the model will be used on real-world problems.

Let's split our dataset into train and test sets: 80% of the data for training and the remaining 20% for testing.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=101)

train_test_split randomly picks samples from our data to form the training and test set.

- <b> test_size </b> sets the fraction of our data reserved for the test set. In our case, 20% is dedicated to testing and 80% to training.

- <b> random_state </b> seeds the random number generator used for the split. With the default value `None`, each run produces a different split; passing a fixed integer makes the split reproducible. (If you want the same results as me, you need to use the same random_state.)

- <b> shuffle </b> (default = True) shuffles the samples in the dataset to ensure randomness of the train_test selection.
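
The effect of `test_size` and `random_state` can be checked on a tiny example. This is a minimal sketch on made-up arrays (ten synthetic rows, not the fuel consumption data); the variable names with `_a` and `_b` suffixes are only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten synthetic samples (not the real dataset): one feature column and a target
X = np.arange(10).reshape(-1, 1)
y = np.arange(10) * 10

# Two splits with the same random_state should select exactly the same rows
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X, y, test_size=0.20, random_state=101)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X, y, test_size=0.20, random_state=101)

print("train rows:", len(X_train_a), "test rows:", len(X_test_a))
print("same test rows on both runs:", np.array_equal(X_test_a, X_test_b))
```

With ten rows and `test_size=0.20`, two rows go to the test set and eight to the training set, and no row appears in both.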

## Creating and Training the Model

#### Training data distribution (Enginesize vs. CO2Emission)

In [None]:
plt.scatter(X_train, y_train,  color='blue')
plt.xlabel("Engine size")
plt.ylabel("Emission")
plt.show()

#### Testing data distribution (Enginesize vs. CO2Emission)

In [None]:
plt.scatter(X_test, y_test,  color='blue')
plt.xlabel("Engine size")
plt.ylabel("Emission")
plt.show()

#### Modeling
We use the sklearn package to model the data.

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
lm = LinearRegression()

In [None]:
lm.fit(X_train, y_train)  # Train the model
# Notice that we don't assign the result to a new variable: fit() trains the
# model object we created in the cell above in place.

## Model Evaluation

Let's evaluate the model by checking out its coefficients and how we can interpret them.

The __coefficient__ and __intercept__ of a simple linear regression are the parameters of the fitted line.
Because the model has only these two parameters, the slope and the intercept of the line, sklearn can estimate them directly from our data.
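
To make "estimated directly from the data" concrete, here is a small sketch of the textbook least-squares formulas for one feature: the slope is the sum of products of deviations divided by the sum of squared deviations of x, and the intercept follows from the means. The data is synthetic and chosen to lie exactly on a line; this illustrates the formulas, not sklearn's internal code.

```python
import numpy as np

# Synthetic points that lie exactly on y = 2x + 1 (illustrative values only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

x_mean, y_mean = x.mean(), y.mean()

# Least-squares slope: sum of products of deviations / sum of squared x deviations
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)

# The fitted line passes through the point (x_mean, y_mean)
intercept = y_mean - slope * x_mean

print("slope:", slope, "intercept:", intercept)
```

On real, noisy data the same two formulas give the line that minimizes the sum of squared residuals.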

In [None]:
# print the intercept
print('Intercept: ',lm.intercept_)

In [None]:
coeff_df = pd.DataFrame(lm.coef_,X.columns,columns=['Coefficient'])
coeff_df
# Alternatively you can also just print the coefficients as below
#print ('Coefficients: ', lm.coef_)

Interpreting the coefficient:

A 1 unit increase in **Engine Size** is associated with an **increase of ~39 g/km in CO2 Emissions**.

#### Plot outputs

We can plot the fit line over the data:

In [None]:
plt.scatter(X_train, y_train,  color='blue')
plt.plot(X_train,lm.coef_*X_train + lm.intercept_, color="red")
#plt.plot(X_train,lm.predict(X_train),"-r")
plt.xlabel("Engine size")
plt.ylabel("Emission")

## Predictions from our Model

Now that we have trained our model, it’s time to make some predictions. To do so, we will use our test data and see how accurately our model predicts the CO2 emissions.

In [None]:
predictions = lm.predict(X_test)
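
The same `predict` call also works for engine sizes that are not in the dataset at all. The sketch below is self-contained, so it fits a separate model on made-up numbers (a hypothetical slope of 40 and intercept of 125, not the values from the real data); the names `train`, `model`, and `new_cars` are assumptions for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up training data that follows CO2 = 40 * size + 125 exactly
train = pd.DataFrame({"ENGINESIZE": [1.0, 2.0, 3.0, 4.0, 5.0]})
train["CO2EMISSIONS"] = 40.0 * train["ENGINESIZE"] + 125.0

model = LinearRegression().fit(train[["ENGINESIZE"]], train["CO2EMISSIONS"])

# New, unseen engine sizes; the column name must match the one used for fitting
new_cars = pd.DataFrame({"ENGINESIZE": [2.4, 3.5]})
predicted = model.predict(new_cars)

print(predicted)
```

Passing a DataFrame with the same column name as the training features keeps the input shape two-dimensional, which is what `predict` expects.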

Let's plot our straight line with the test data:

In [None]:
plt.scatter(X_test,y_test)
plt.plot(X_train,lm.coef_*X_train + lm.intercept_, color="red")
plt.xlabel("Engine size")
plt.ylabel("Emission")

**Residual Histogram**

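Approximately normal residuals are an assumption behind the standard statistical inference for a linear model. If the histogram of the residuals looks roughly bell-shaped and centered on zero, that assumption is plausible for this data; strong skew or several peaks would suggest that a straight line is missing part of the relationship.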
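
A histogram can be complemented with a couple of summary numbers. The following is a rough sketch on synthetic data with Gaussian noise (not the fuel consumption data), using `np.polyfit` as a stand-in for the fitted model; it computes the residual mean and a simple sample skewness, both of which should be near zero when the residuals are symmetric around zero.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Synthetic data: a linear signal plus Gaussian noise (all values are made up)
x = rng.uniform(1.0, 8.0, size=200)
y = 40.0 * x + 125.0 + rng.normal(0.0, 30.0, size=200)

# Degree-1 least-squares fit; polyfit returns the slope first, then the intercept
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

resid_mean = residuals.mean()
# Sample skewness: mean of the cubed standardized residuals
resid_skew = np.mean(((residuals - resid_mean) / residuals.std()) ** 3)

print(f"residual mean = {resid_mean:.3f}, skewness = {resid_skew:.3f}")
```

Note that these residuals come from the same data used for the fit, where the mean is zero by construction for a line with an intercept; on a held-out test set the mean can drift away from zero.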

In [None]:
sns.displot((y_test-predictions),bins=50)

#### Evaluation

The final step is to evaluate the performance of the model. We compare the actual values with the predicted values to measure how accurate the regression model is. Evaluation metrics play a key role in the development of a model, as they provide insight into areas that require improvement.

Here are three common evaluation metrics for regression problems:

**Mean Absolute Error** (MAE) is the mean of the absolute value of the errors. This is the easiest of the metrics to understand since it’s just the average absolute error:

$$\frac 1n\sum_{i=1}^n|y_i-\hat{y}_i|$$

**Mean Squared Error** (MSE) is the mean of the squared errors. It’s more popular than mean absolute error because it puts more weight on large errors: squaring makes an error's contribution grow quadratically, so larger errors count disproportionately more than smaller ones:

$$\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2$$

**Root Mean Squared Error** (RMSE) is the square root of the mean of the squared errors:

$$\sqrt{\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2}$$

Comparing these metrics:

- **MAE** is the easiest to understand, because it's the average error.
- **MSE** is more popular than MAE, because MSE "punishes" larger errors, which tends to be useful in the real world.
- **RMSE** is even more popular than MSE, because RMSE is interpretable in the "y" units.

All of these are **loss functions**, because we want to minimize them.
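
The three formulas can be worked through by hand on a tiny example. This sketch uses four made-up actual and predicted values, one of which is badly off, to show how much more a single large error weighs in MSE than in MAE.

```python
import numpy as np

# Four made-up actual and predicted values; the last prediction is far off
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 13.0])

errors = y_true - y_pred            # 1, 0, -1, -4

mae = np.mean(np.abs(errors))       # (1 + 0 + 1 + 4) / 4 = 1.5
mse = np.mean(errors ** 2)          # (1 + 0 + 1 + 16) / 4 = 4.5
rmse = np.sqrt(mse)                 # back in the units of y

# Share of each total that comes from the single largest error
abs_share = np.abs(errors).max() / np.abs(errors).sum()
squared_share = (errors ** 2).max() / (errors ** 2).sum()

print("MAE:", mae, "MSE:", mse, "RMSE:", rmse)
print("largest error's share of absolute total:", abs_share)
print("largest error's share of squared total:", squared_share)
```

The largest error accounts for two thirds of the absolute total but almost nine tenths of the squared total, which is what "punishing" larger errors means in practice.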

In [None]:
from sklearn import metrics

In [None]:
print('Mean Absolute Error (MAE):', metrics.mean_absolute_error(y_test, predictions))  
print('Mean Squared Error (MSE):', metrics.mean_squared_error(y_test, predictions))  
print('Root Mean Squared Error (RMSE):', np.sqrt(metrics.mean_squared_error(y_test, predictions)))

R-squared (R2) is not an error, but it is a popular metric for how well your model fits the data. It measures how close the data are to the fitted regression line: the higher the R2, the better the fit. The best possible score is 1.0. R2 compares the chosen model with a horizontal line at the mean of the observed values, so a model that fits worse than that baseline gets a negative score, and the score can be arbitrarily negative.
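
The definition can be written out as one minus the ratio of the residual sum of squares to the total sum of squares around the mean. Below is a minimal sketch with a hypothetical helper `r2` on made-up numbers; it shows the baseline score of zero, a negative score, and that the result depends on which argument is treated as the true values.

```python
import numpy as np

def r2(y_true, y_pred):
    """R-squared as 1 - SS_res / SS_tot (hypothetical helper for illustration)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of the mean baseline
    return 1.0 - ss_res / ss_tot

actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 13.0]

score = r2(actual, predicted)                    # 1 - 18/20
swapped = r2(predicted, actual)                  # 1 - 18/66, a different number
baseline = r2(actual, [6.0, 6.0, 6.0, 6.0])      # always predicting the mean
reversed_fit = r2(actual, [9.0, 7.0, 5.0, 3.0])  # worse than the mean baseline

print(score, swapped, baseline, reversed_fit)
```

Because the total sum of squares is computed from the first argument only, the actual values have to be passed as the true values for the score to mean what the definition says.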

In [None]:
print("R2-score:", metrics.r2_score(y_test, predictions))  # r2_score expects (y_true, y_pred)