# Borrowing from W3 EDA session - Linear Regression on Auto-MPG data
## Skip ahead to the Regression Analysis section, but first run the setup cells below before developing the machine learning model.
## These include the cells that import libraries, read the data, and remove inconsistencies.
https://www.kaggle.com/code/devanshbesain/exploration-and-analysis-auto-mpg

First, complete the data preprocessing and EDA steps below:

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy.stats import norm
from scipy import stats
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

We have imported all the packages and libraries we will use for the initial exploration of the data. This part of the notebook focuses on exploration and visualization with the pandas and seaborn packages.

Let us load the data to explore for hidden treasures

In [None]:
data = pd.read_csv('/content/drive/MyDrive/Colab-Notebooks/auto-mpg.csv',index_col='car name')

Let's have a look at the data.

In [None]:
print(data.head())
print(data.index)
print(data.columns)

In [None]:
data.shape

In [None]:
data.isnull().any()

No column reports missing values, at least none encoded as NaN.

In [None]:
#data.dtypes
data.info(verbose=True)

Interestingly, horsepower has dtype object rather than float, even though the values we saw above were clearly numbers. Before converting the column with astype(), let's look at the unique elements of horsepower to check for discrepancies.

In [None]:
data.horsepower.unique()

When we print out all the unique values in horsepower, we find a '?' that was used as a placeholder for missing values. Let's remove these entries.

In [None]:
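# Keep only rows with a real horsepower value; .copy() gives an independent frame to modify later
data = data[data.horsepower != '?'].copy()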

In [None]:
# `in` on a Series checks the index labels, so compare the values instead
print((data.horsepower == '?').any())
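
A small aside on checking for the placeholder: on a pandas Series the `in` operator tests the index labels, not the values, so it can report False even when the value is present. The sketch below uses a hypothetical toy Series (`hp`, with made-up labels) rather than the real dataset to show the difference.

```python
import pandas as pd

# Hypothetical toy Series standing in for the horsepower column
hp = pd.Series(['130', '?', '95'], index=['car a', 'car b', 'car c'])

label_check = '?' in hp            # tests the index labels ('car a', ...)
value_check = (hp == '?').any()    # tests the values themselves

print(label_check, value_check)
```

Comparing the values element-wise and reducing with `.any()` is one way to confirm whether any placeholder remains.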

In [None]:
data.shape

In [None]:
data.dtypes

So all entries with '?' as a placeholder have been removed. However, the horsepower column is still of object type and not float. That is because pandas read the entire column as object when we imported the dataset, due to the '?' entries. Let's change that column to float.

In [None]:
data.horsepower = data.horsepower.astype('float')
data.dtypes
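
As an alternative to filtering and then casting, the placeholder can be turned into NaN in one step with `pd.to_numeric(..., errors='coerce')`, after which `isnull()` sees it. This is a minimal sketch on a hypothetical toy frame (`raw` and its values are made up for illustration), not the approach used in this notebook.

```python
import pandas as pd

# Hypothetical toy frame with the same '?' placeholder as the real data
raw = pd.DataFrame({'horsepower': ['130', '?', '95'],
                    'mpg': [18.0, 25.0, 24.0]})

# Non-numeric strings such as '?' become NaN; the column becomes float64
raw['horsepower'] = pd.to_numeric(raw['horsepower'], errors='coerce')

n_missing = int(raw['horsepower'].isnull().sum())   # placeholders now count as missing
clean = raw.dropna(subset=['horsepower'])           # drop the rows that had '?'

print(n_missing, clean.dtypes['horsepower'])
```

Passing `na_values='?'` to `pd.read_csv` is another option that handles the placeholder at load time.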

Now everything looks in order, so let's continue and describe the dataset.

In [None]:
data.describe()

- The first quartile (Q1), 17 MPG, is the value below which 25% of all MPG observations fall; the other 75% are larger.
- Q2, 22.75 MPG, is the median: 50% of MPG observations are smaller than Q2 and 50% are larger.
- Only 25% of the observations are greater than the third quartile (Q3), 29 MPG.
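
To make the quartile reading concrete, here is a small sketch on a hypothetical toy Series (`mpg_toy`, made-up values, not the Auto-MPG data). It shows that the `25%`, `50%`, and `75%` rows of `describe()` are the same numbers returned by `quantile()`.

```python
import pandas as pd

# Hypothetical toy values, evenly spaced so the quartiles are easy to verify by eye
mpg_toy = pd.Series([10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 45.0, 50.0])

q1, q2, q3 = mpg_toy.quantile([0.25, 0.5, 0.75])
summary = mpg_toy.describe()

print(q1, q2, q3)
print(summary['25%'], summary['50%'], summary['75%'])
```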

In [None]:
data.head()

## Regression Analysis

Let us use linear regression to predict MPG from a set of features that are correlated with MPG.

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from math import sqrt
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold

In [None]:
factors = ['cylinders','displacement','horsepower','acceleration','weight','origin','model year']
X = pd.DataFrame(data[factors].copy())
y = data['mpg'].copy()
y

In [None]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size = 0.2,random_state=324)
#X_train.shape[0] == y_train.shape[0]
# Always split the data into train and test subsets first, particularly before any preprocessing steps.
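
The reason for splitting first is that any statistic learned during preprocessing should come from the training rows only. The sketch below illustrates this with `StandardScaler` on hypothetical random data (`X_toy`, `y_toy`); scaling is used purely as an example of a preprocessing step and is not applied to the Auto-MPG features in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical random data standing in for a feature matrix and target
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=100.0, scale=20.0, size=(50, 2))
y_toy = rng.normal(size=50)

# Split first ...
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=324)

# ... then learn the scaling statistics from the training rows only
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)   # fit + transform on the training subset
X_te_scaled = scaler.transform(X_te)       # test subset reuses the training mean and std

print(X_tr.shape, X_te.shape)
```

Fitting the scaler on the full data before splitting would let information from the test rows leak into the training step.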

In [None]:
reg_model = LinearRegression()
# Selecting linear regression

In [None]:
reg_model.fit(X_train,y_train)
# Training
# Fitting the model to the training data is the training part of the modeling process.
# It finds the coefficients (beta weights) of the equation defined by the chosen algorithm.

In [None]:
y_predicted = reg_model.predict(X_test)
# The beta weights (coefficients) are then used to calculate predictions for the unseen input data X_test.
# Note that the most important part of this process is finding the coefficients that fit your training data.
# y = b0 + b1*x1 + b2*x2 + ... + bn*xn: fit() fits the model to the training data and finds the coefficients b0, ..., bn. Suppose, for illustration with n = 10, that they are [1, 2, 3, ..., 11].
# Once unseen test input [x1, ..., xn] is provided, the prediction is easy to calculate with the learned coefficients (weights): y = 1 + 2*x1 + 3*x2 + ... + 11*x10.
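
The relationship between the learned coefficients and `predict()` can be checked directly. This is a minimal sketch on hypothetical noise-free toy data generated from y = 1 + 2*x1 + 3*x2 (the names `X_toy`, `y_toy`, `toy_model`, and `X_new` are illustrative only); it rebuilds a prediction by hand from `intercept_` and `coef_`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data generated exactly from y = 1 + 2*x1 + 3*x2
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y_toy = 1 + 2 * X_toy[:, 0] + 3 * X_toy[:, 1]

toy_model = LinearRegression().fit(X_toy, y_toy)

# New, unseen input
X_new = np.array([[3.0, 2.0]])

# b0 + b1*x1 + b2*x2, computed by hand from the fitted attributes
manual = toy_model.intercept_ + X_new @ toy_model.coef_
auto = toy_model.predict(X_new)

print(toy_model.intercept_, toy_model.coef_)
print(manual, auto)
```

The same check should work for `reg_model` above, using `reg_model.intercept_`, `reg_model.coef_`, and `X_test`.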

In [None]:
# Evaluation metric: mean absolute error (MAE), in MPG units; closer to zero means smaller prediction errors
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test,y_predicted)
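
For intuition, MAE is simply the average of the absolute differences between true and predicted values. The sketch below verifies this on hypothetical toy arrays (`y_true_toy`, `y_pred_toy`) and also computes RMSE, assuming that is what the earlier `mean_squared_error` import was intended for.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical toy values, not taken from the Auto-MPG results
y_true_toy = np.array([20.0, 25.0, 30.0, 35.0])
y_pred_toy = np.array([22.0, 24.0, 27.0, 35.0])

mae = mean_absolute_error(y_true_toy, y_pred_toy)
mae_by_hand = np.mean(np.abs(y_true_toy - y_pred_toy))   # same quantity, computed directly

# RMSE penalizes large errors more heavily than MAE
rmse = np.sqrt(mean_squared_error(y_true_toy, y_pred_toy))

print(mae, mae_by_hand, rmse)
```

Both metrics are in the units of the target, so on the real data they would read as an error in MPG.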