## Multiple Linear Regression ##

- Linear regression involving more than two variables

The major difference between simple and multiple linear regression lies in the evaluation.

You can use it to find out which factor has the highest impact on the predicted output and how different variables relate to each other.
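For reference, multiple linear regression is usually written in the general form below. The symbols are notation assumed here, not names from the dataset: $y$ is the predicted output, $x_1, \dots, x_n$ are the input variables, $b_0$ is the intercept, and $b_1, \dots, b_n$ are the coefficients the model learns.

$$
y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n
$$

Each coefficient $b_i$ describes how much $y$ changes for a one-unit change in $x_i$ while the other variables are held fixed, which is what makes it possible to compare the impact of different factors.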

## 1. Importing the Libraries ##
The following script imports the necessary libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline




## 2. Dataset ##

The dataset for this example is available at:

https://drive.google.com/open?id=1mVmGNx6cbfvRHC_DvF12ZL3wGLSHD9f_

This example uses multiple linear regression to predict gas consumption in 48 US states based on gas taxes, per capita income, paved highways, and the proportion of the population that has a driver's license.

The following command imports the dataset from the file you downloaded via the link above:

# Raw string so the backslashes in the Windows path are not treated as escape sequences
dataset = pd.read_csv(r'D:\Datasets\petrol_consumption.csv')

Just like last time, let's take a look at what our dataset actually looks like. Execute the head() command:

dataset.head()

To see statistical details of the dataset, we'll use the describe() command again:

dataset.describe()

## 3. Preparing the Data ##
The next step is to divide the data into attributes and labels as we did previously. Unlike last time, however, we are going to use column names to create the attribute set and the label. Execute the following script:

X = dataset[['Petrol_tax', 'Average_income', 'Paved_Highways',
       'Population_Driver_licence(%)']]
y = dataset['Petrol_Consumption']

Execute the following code to divide our data into training and test sets:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
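
To make the split concrete, here is a minimal NumPy sketch of what a shuffled 80/20 hold-out split does. It is an illustration only, not scikit-learn's implementation; the helper name `holdout_split` and the stand-in arrays are hypothetical. With 48 rows and `test_size=0.2`, the test set size is rounded up to 10 rows, which leaves 38 rows for training.

```python
import math
import numpy as np

def holdout_split(X, y, test_size=0.2, seed=0):
    """Shuffle row indices, then hold out a fraction of rows for testing.

    Hypothetical helper for illustration; it will not reproduce the exact
    rows that train_test_split(..., random_state=0) selects.
    """
    n_samples = len(X)
    n_test = math.ceil(test_size * n_samples)  # round the test set size up
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Stand-in arrays with the same shape as the petrol data: 48 rows, 4 attributes
X_demo = np.arange(48 * 4, dtype=float).reshape(48, 4)
y_demo = np.arange(48, dtype=float)

X_tr, X_te, y_tr, y_te = holdout_split(X_demo, y_demo)
print(X_tr.shape, X_te.shape)  # (38, 4) (10, 4)
```

Every row lands in exactly one of the two sets, so the model is later evaluated only on rows it never saw during training.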

## 4. Training the Algorithm ##
And finally, to train the algorithm we execute the same code as before, using the fit() method of the LinearRegression class:

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

As said earlier, in multiple linear regression the model has to find the optimal coefficients for all the attributes. To see which coefficients our regression model has chosen, execute the following script:

coeff_df = pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient'])
coeff_df

The result should look something like this:

| | Coefficient |
|---|---|
| Petrol_tax | -24.196784 |
| Average_income | -0.81680 |
| Paved_Highways | -0.000522 |
| Population_Driver_licence(%) | 1324.675464 |

This means that, with the other attributes held constant, a unit increase in "Petrol_tax" corresponds to a decrease of 24.19 million gallons in gas consumption. Similarly, a unit increase in the proportion of the population with a driver's license corresponds to an increase of 1.324 billion gallons of gas consumption. Per unit, "Average_income" and "Paved_Highways" have very little effect on gas consumption, although each coefficient is measured per unit of its own attribute, so coefficients of attributes on different scales are not directly comparable.
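
The "unit increase" reading of a coefficient can be checked on a small example. The sketch below fits ordinary least squares with `np.linalg.lstsq` on made-up data (the numbers and variable names are hypothetical, not the petrol dataset) and confirms that raising one attribute by one unit, with the other held fixed, moves the prediction by that attribute's coefficient.

```python
import numpy as np

# Made-up data: y = 5 + 2*x1 - 3*x2 exactly (no noise), for illustration only
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(20, 2))
y_demo = 5 + 2 * X_demo[:, 0] - 3 * X_demo[:, 1]

# Add a column of ones so the first fitted weight is the intercept
A = np.column_stack([np.ones(len(X_demo)), X_demo])
weights, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
intercept, coefs = weights[0], weights[1:]

def predict(row):
    """Linear prediction: intercept plus the weighted sum of the attributes."""
    return intercept + row @ coefs

base = np.array([4.0, 7.0])
bumped = base + np.array([1.0, 0.0])   # +1 unit in the first attribute only

print(coefs)                            # close to [ 2. -3.]
print(predict(bumped) - predict(base))  # equals coefs[0]
```

The same reasoning applies to the fitted `regressor`: its `coef_` values are per-unit changes, so an attribute measured in large units will tend to get a small coefficient even when it matters.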

## 5. Making Predictions ##
To make predictions on the test data, execute the following script:

y_pred = regressor.predict(X_test)

To compare the actual output values for X_test with the predicted values, execute the following script:

df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df

The output looks like this:

| | Actual | Predicted |
|---|---|---|
| 36 | 640 | 643.176639 |
| 22 | 464 | 411.950913 |
| 20 | 649 | 683.712762 |
| 38 | 648 | 728.049522 |
| 18 | 865 | 755.473801 |
| 1 | 524 | 559.135132 |
| 44 | 782 | 671.916474 |
| 21 | 540 | 550.633557 |
| 16 | 603 | 594.425464 |
| 45 | 510 | 525.038883 |

## 6. Evaluating the Algorithm ##
The final step is to evaluate the performance of the algorithm. We'll do this by finding the values for the mean absolute error (MAE), mean squared error (MSE), and root mean squared error (RMSE). Execute the following script:

from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

The output will look similar to this:

Mean Absolute Error: 45.8979842541
Mean Squared Error: 3609.37119141
Root Mean Squared Error: 60.0780425065

You can see that the root mean squared error is 60.07, which is slightly greater than 10% of the mean gas consumption across all states. This means that our algorithm was not very accurate, but it can still make reasonably good predictions.
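
The three metrics are simple to compute by hand, which makes it easier to see what each one measures. The sketch below recomputes them with NumPy from the ten actual and predicted values listed in the comparison output above; the array names are chosen here for illustration.

```python
import numpy as np

# Actual and predicted values copied from the comparison output
actual = np.array([640, 464, 649, 648, 865, 524, 782, 540, 603, 510], dtype=float)
predicted = np.array([643.176639, 411.950913, 683.712762, 728.049522, 755.473801,
                      559.135132, 671.916474, 550.633557, 594.425464, 525.038883])

errors = actual - predicted
mae = np.mean(np.abs(errors))   # average size of the errors
mse = np.mean(errors ** 2)      # average squared error; penalizes large misses
rmse = np.sqrt(mse)             # back in the same units as the target

print(f"MAE:  {mae:.2f}")   # 45.90
print(f"MSE:  {mse:.2f}")   # 3609.37
print(f"RMSE: {rmse:.2f}")  # 60.08
```

Because the errors are squared before averaging, the two misses of roughly 110 (rows 18 and 44) dominate the MSE, which is why the RMSE comes out noticeably larger than the MAE.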


## Factors affecting inaccuracy ##
There are many factors that may have contributed to this inaccuracy, a few of which are listed here:

- Need more data: Only one year's worth of data isn't that much, whereas having multiple years' worth could have helped us improve the accuracy quite a bit.
- Bad assumptions: We assumed that this data has a linear relationship, but that might not be the case. Visualizing the data may help you determine that.
- Poor features: The features we used may not have had a high enough correlation with the values we were trying to predict.
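
One quick way to check the "poor features" point is to look at how strongly each attribute correlates with the target. The sketch below does this with `np.corrcoef` on made-up data; the attribute names and numbers are hypothetical stand-ins, so substitute the real columns to check the petrol dataset. Keep in mind that this kind of correlation only captures linear association.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical attributes: one drives the target, one is pure noise
strong = rng.normal(size=n)
noise = rng.normal(size=n)
target = 3.0 * strong + rng.normal(scale=0.5, size=n)

features = {"strong_feature": strong, "noise_feature": noise}
correlations = {
    name: np.corrcoef(values, target)[0, 1]  # Pearson correlation with the target
    for name, values in features.items()
}

for name, r in correlations.items():
    print(f"{name}: r = {r:+.2f}")
```

With the DataFrame from this section, and assuming all of its columns are numeric, `dataset.corr()['Petrol_Consumption']` should give the same kind of summary in one line.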

