# Multiple Linear Regression: Cars Data
## Project Notes
Reference Site: https://www.w3schools.com/python/python_ml_multiple_regression.asp
Workspace: S:\Matt\Data Science\Kaggle\Multiple Linear Regression\Cars (Basic)

**Objective:** To predict CO2 emissions based on volume and weight of vehicle.

## Loading the Data
Steps:
* Read data from CSV into a dataframe
* Describe the data to get an initial view of each variable's distribution

In [5]:
# load libraries
import pandas as pd

# load data
df = pd.read_csv('../Data/cars.csv')

# describe data
df.describe()

| | Volume | Weight | CO2 |
|---|---|---|---|
| count | 36.0 | 36.0 | 36.0 |
| mean | 1611.111111 | 1292.277778 | 102.027778 |
| std | 388.975047 | 242.123889 | 7.454571 |
| min | 900.0 | 790.0 | 90.0 |
| 25% | 1475.0 | 1117.25 | 97.75 |
| 50% | 1600.0 | 1329.0 | 99.0 |
| 75% | 2000.0 | 1418.25 | 105.0 |
| max | 2500.0 | 1746.0 | 120.0 |


## Feature Scaling
Notes:
* Volume and weight have broadly similar scales and distributions, whilst CO2 is on a much lower scale. CO2 is our label (output, y) variable, so its different scale doesn't matter here, and because the two features are on similar scales we shouldn't need feature scaling for this exercise.
* However, for thoroughness, I will apply feature scaling to the two feature variables and compare the results to see if it has any effect.
* There are two options when altering feature scales:
    * Normalization
    * Standardization
    * Source: https://machinelearningmastery.com/rescaling-data-for-machine-learning-in-python-with-scikit-learn/    
* Normalization rescales each feature to the range 0 to 1. This is good for models which look at the magnitude of values (e.g. distance measures in k-nearest neighbours or coefficients in regression).
* Standardization shifts each feature's distribution to have a mean of 0 and a standard deviation of 1 (unit variance). This is better for models that rely on the distribution of the inputs (e.g. methods that assume Gaussian-distributed inputs).
* We will try all three (raw distribution, normalized and standardized) and compare the results to see which is most effective.
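
To make the two rescaling options concrete, here is a minimal sketch that computes both by hand with NumPy and cross-checks them against scikit-learn. The array `X_demo` and the names `normalized_demo` and `standardized_demo` are made up for illustration and are not taken from cars.csv.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical (Volume, Weight) rows for illustration; not the cars.csv data.
X_demo = np.array([
    [1000.0, 800.0],
    [1500.0, 1100.0],
    [2000.0, 1400.0],
    [2500.0, 1700.0],
])

# Normalization (min-max): (x - min) / (max - min), computed per column.
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)
normalized_demo = (X_demo - col_min) / (col_max - col_min)

# Standardization: (x - mean) / std, computed per column.
# NumPy's std defaults to the population form (ddof=0), which is what
# preprocessing.scale uses; pandas' describe() reports the sample form (ddof=1).
standardized_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Cross-check the hand-written formulas against scikit-learn.
assert np.allclose(normalized_demo, preprocessing.minmax_scale(X_demo))
assert np.allclose(standardized_demo, preprocessing.scale(X_demo))

print(normalized_demo)
print(standardized_demo)
```

Both transformations are applied column by column, so each feature is rescaled independently of the other.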

Steps:
* Run standardization and normalization methods to create 3 separate sets of feature variables (raw, standardized and normalized).

In [18]:
# load libraries
from sklearn import preprocessing

# extract features and labels
X = df[['Volume', 'Weight']]
y = df['CO2']

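# normalize features to the 0-1 range, column by column
# (preprocessing.normalize rescales each row to unit norm, which is not the 0-1 scaling described in the notes)
normalized_X = preprocessing.minmax_scale(X)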

# standardize features
standardized_X = preprocessing.scale(X)
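
One thing worth checking when picking a scikit-learn function for normalization: `preprocessing.normalize` works per sample (row), whereas the 0 to 1 scaling described in the notes works per feature (column). The sketch below contrasts the two on a small made-up array (`X_demo` is hypothetical, not the cars data).

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical (Volume, Weight) rows for illustration only.
X_demo = np.array([
    [1000.0, 800.0],
    [2000.0, 1400.0],
    [2500.0, 1700.0],
])

# normalize() divides each row by its own L2 norm, so every row ends up
# with length 1 and the columns are NOT mapped onto the 0-1 range.
row_scaled = preprocessing.normalize(X_demo)
row_norms = np.linalg.norm(row_scaled, axis=1)

# minmax_scale() rescales each column so its minimum is 0 and its maximum is 1.
col_scaled = preprocessing.minmax_scale(X_demo)

print(row_norms)
print(col_scaled.min(axis=0), col_scaled.max(axis=0))
```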

## Linear Model
Notes:
* Now that our data has been processed, we can load it into our model.
* We will use a linear regression model from sklearn for this.
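
Written out, the model for this dataset has the form below. The symbols $\beta_0$, $\beta_1$ and $\beta_2$ are notation introduced here for the intercept and the two coefficients; sklearn's `LinearRegression` estimates them by ordinary least squares.

$$
\widehat{\mathrm{CO2}} = \beta_0 + \beta_1 \cdot \mathrm{Volume} + \beta_2 \cdot \mathrm{Weight}
$$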

Steps:
* Build linear regression object.
* Fit model to our data.

In [20]:
# load libraries
from sklearn import linear_model

# instantiate model object
regr = linear_model.LinearRegression()

# fit model to data
regr.fit(X, y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
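
Once a model is fitted, its coefficients and intercept can be inspected and it can be used to predict new values. The sketch below shows this on synthetic data generated from an exact linear rule, so the recovered numbers are known in advance. The `demo` dataframe, the rule used to build it, and the example car are all assumptions for illustration, not results from cars.csv.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model

# Synthetic stand-in for cars.csv (hypothetical values, no noise):
# CO2 = 80 + 0.005 * Volume + 0.01 * Weight
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'Volume': rng.uniform(900, 2500, size=20),
    'Weight': rng.uniform(790, 1746, size=20),
})
demo['CO2'] = 80.0 + 0.005 * demo['Volume'] + 0.01 * demo['Weight']

# Fit the same kind of model as in the cell above.
demo_regr = linear_model.LinearRegression()
demo_regr.fit(demo[['Volume', 'Weight']], demo['CO2'])

# One coefficient per feature, in column order (Volume, Weight), plus the intercept.
print(demo_regr.coef_)       # approximately [0.005, 0.01]
print(demo_regr.intercept_)  # approximately 80.0

# Predict for one hypothetical car; columns must match the training features.
new_car = pd.DataFrame({'Volume': [1600.0], 'Weight': [1300.0]})
predicted = demo_regr.predict(new_car)
print(predicted)             # approximately [101.0]
```

On the real data, the same attributes (`regr.coef_`, `regr.intercept_`) and `regr.predict` would apply to the `regr` object fitted above.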

## To Do
To Do:
* Finish tutorial steps on linked site (see top).
* Visualise our model (use matplotlib/seaborn).
* Bring in a larger dataset, use train/test splitting, and plot the linear model line against the actual and predicted values.
* Analyse coefficients and accuracy of model against raw, normalized and standardized data to see which is best (i.e. validation).
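
As a starting point for the train/test and comparison items, here is a minimal sketch, assuming synthetic data in place of a larger real dataset. The data-generating rule, the noise level and the variable names are all hypothetical. Each scaler is wrapped in a pipeline so that it is fitted on the training split only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in for a larger dataset (hypothetical rule plus noise).
rng = np.random.default_rng(42)
X_demo = np.column_stack([
    rng.uniform(900, 2500, size=200),   # Volume
    rng.uniform(790, 1746, size=200),   # Weight
])
y_demo = (80.0 + 0.005 * X_demo[:, 0] + 0.01 * X_demo[:, 1]
          + rng.normal(0, 2, size=200))

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

# One pipeline per feature treatment: raw, normalized, standardized.
models = {
    'raw': make_pipeline(LinearRegression()),
    'normalized': make_pipeline(MinMaxScaler(), LinearRegression()),
    'standardized': make_pipeline(StandardScaler(), LinearRegression()),
}

# Fit on the training split and record R^2 on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(name, scores[name])
```

For ordinary least squares with an intercept, the three scores should agree up to floating-point error, because min-max scaling and standardization are linear rescalings of each feature. The coefficients change with the scaling, but the predictions do not.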