We always start a project by importing the libraries we need, which gives us access to pre-written functions that make our lives easier. Here, we import NumPy, pandas, and Matplotlib.

NumPy: adds support for high-level mathematical functions, especially when working with arrays and matrices.

Pandas: adds support for data manipulation and analysis. It is especially helpful because it offers data structures and operations for manipulating numerical tables.

Matplotlib: adds support for creating static, animated, and interactive visualizations (e.g. graphs).

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

At this point, we will import our dataset using the read_csv function. It reads the csv file and converts it to a dataframe that we can manipulate and work with through pandas. We first give the link to our dataset (which for now is on GitHub). We then pass header=None, which tells pandas that the file has no header row: the first line is treated as data, and the columns are labelled 0 and 1 until we name them ourselves.

In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/SiP-AI-ML/LessonMaterials/master/sales_data.csv", header=None)
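
To see what header=None changes, here is a minimal sketch that reads a small piece of CSV text from memory instead of from the link. The text in csv_text and the names demo and demo_default are made up for illustration; they are not the real sales_data.csv.

```python
import io

import pandas as pd

# Hypothetical two-column CSV text with no header row (not the real dataset).
csv_text = "12.0,15.0\n20.5,16.0\n21.0,18.0\n"

# header=None: every line is data, and the columns are labelled 0, 1, ...
demo = pd.read_csv(io.StringIO(csv_text), header=None)
print(demo.shape)          # (3, 2)
print(list(demo.columns))  # [0, 1]

# Without header=None, the first line is used up as the column names,
# so one row of data is lost.
demo_default = pd.read_csv(io.StringIO(csv_text))
print(demo_default.shape)  # (2, 2)
```

This is why we pass header=None here: our file starts directly with numbers, and we do not want the first observation to become the column names.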

The shape attribute gives the dimensions of your dataframe as (rows, columns), just like the dimensions of a matrix, which you will learn more about in algebra.

In [3]:
print(df.shape)

(36, 2)


The head() method allows you to view the first five rows of your pandas dataframe and confirm that it looks reasonable.

In [4]:
print(df.head())

      0     1
0  12.0  15.0
1  20.5  16.0
2  21.0  18.0
3  15.5  27.0
4  15.3  21.0


Here, we will rename the columns of our pandas dataframe to Sales and Advertising.

In [5]:
df.columns = ['Sales', 'Advertising']

We will call the head() method again to make sure that our columns were properly renamed.

In [6]:
print(df.head())

   Sales  Advertising
0   12.0         15.0
1   20.5         16.0
2   21.0         18.0
3   15.5         27.0
4   15.3         21.0


## Denoting variables

First, we will declare our X value, which is our independent variable. We will take all the values from the Sales column and place them into our new variable X, which allows us to specifically access the independent variable.

In [None]:
X = df['Sales'].values

Next, we will do the same for y. We will take all the values from the advertising column and place them into our new variable y, which allows us to specifically access the dependent variable. 

In [None]:
y = df['Advertising'].values

Now, we will plot a simple scatter plot of the X and y variables. This will allow us to see what the relationship between the independent and dependent variables looks like.

In [None]:
plt.scatter(X, y, color = 'blue', label='Scatter Plot')
plt.title('Relationship between Sales and Advertising')
plt.xlabel('Sales')
plt.ylabel('Advertising')
plt.legend(loc=4)
plt.show()

## Reformatting data

First, we want to see the dimensions of the X and y variables. They need a very specific format to be used for machine learning with scikit-learn (sklearn): the features must be a two-dimensional array with one row per sample and one column per feature.

In [None]:
print(X.shape)
print(y.shape)

## Reshaping X and y

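We will reshape both arrays using the NumPy reshape method. We will specify our first dimension to be -1, which means "unspecified": NumPy infers this value from the length of the array and the remaining dimensions. So reshape(-1, 1) turns a one-dimensional array of n values into a column with n rows and 1 column.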
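
As a small illustration of what -1 does, the sketch below reshapes a short made-up array. The name values and the numbers in it are hypothetical and separate from our dataset.

```python
import numpy as np

# Hypothetical one-dimensional array of four values.
values = np.array([12.0, 20.5, 21.0, 15.5])
print(values.shape)  # (4,)

# -1 lets NumPy work out the number of rows; 1 fixes the number of columns.
column = values.reshape(-1, 1)
print(column.shape)  # (4, 1)
```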


In [None]:
X = X.reshape(-1,1)
y = y.reshape(-1,1)


Now, we want to confirm that our variable shapes are the right format. We will print these shapes to confirm this. 

In [None]:
print(X.shape)
print(y.shape)


## Train test split

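Splitting the data lets us estimate how well the model performs on data it has not seen. If we evaluated the model on the same rows it was trained on, the score could overstate how well it generalizes. So we hold out part of the data (here 33%) as a test set and fit the model only on the remaining training set. Setting random_state fixes the shuffle, so the split is the same every time we run the cell.

We will now split the dataset.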

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X, y, test_size=0.33, random_state=42)
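
To check what the split produces for 36 rows, here is a sketch on made-up data of the same length. The arrays X_demo and y_demo and the Xd_/yd_ names are assumptions chosen so that they do not overwrite the variables above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data with 36 rows, the same length as our dataframe.
X_demo = np.arange(36, dtype=float).reshape(-1, 1)
y_demo = 2.0 * X_demo + 1.0

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

# With test_size=0.33, 12 of the 36 rows go to the test set and 24 to training.
print(Xd_train.shape, Xd_test.shape)  # (24, 1) (12, 1)

# No row appears in both sets.
overlap = set(Xd_train.ravel()) & set(Xd_test.ravel())
print(len(overlap))  # 0
```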


## Mechanics of the model


I have split the dataset into two sets: the training set and the test set. Next, I instantiate the regressor lm and fit it on the training set with the fit method.

In this step, the model learns the relationship between X_train and y_train.

After that, the model is ready to make predictions on the test data (X_test), which I do using the predict method.


In [None]:
# Fit the linear model

# Instantiate the linear regression object lm
from sklearn.linear_model import LinearRegression
lm = LinearRegression()


# Train the model using training data sets
lm.fit(X_train,y_train)


# Predict on the test data
y_pred=lm.predict(X_test)

## Model slope and intercept term

The model slope is given by lm.coef_ and model intercept term is given by lm.intercept_. 

The estimated model slope and intercept values are 1.60509347 and  -11.16003616.

So, the equation of the fitted regression line is

y = 1.60509347 * x - 11.16003616  
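
For a single feature, the least-squares slope and intercept can also be computed by hand: the slope is the sum of (x - mean of x) * (y - mean of y) divided by the sum of (x - mean of x) squared, and the intercept is mean of y - slope * mean of x. The sketch below applies this to made-up points that lie exactly on y = 2x + 1; the names x_demo and y_demo are hypothetical, and the numbers are unrelated to our sales data.

```python
import numpy as np

# Hypothetical points that lie exactly on the line y = 2x + 1.
x_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_demo = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

x_mean, y_mean = x_demo.mean(), y_demo.mean()

# Closed-form least-squares estimates for one feature.
slope = np.sum((x_demo - x_mean) * (y_demo - y_mean)) / np.sum((x_demo - x_mean) ** 2)
intercept = y_mean - slope * x_mean

print(slope, intercept)  # 2.0 1.0
```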


In [None]:
# Compute model slope and intercept

a = lm.coef_
b = lm.intercept_  # no trailing comma, otherwise b becomes a one-element tuple
print("Estimated model slope, a:" , a)
print("Estimated model intercept, b:" , b) 


In [None]:
# So, our fitted regression line is 

# y = 1.60509347 * x - 11.16003616 

# That is our linear model.

## Making predictions


I have predicted the Advertising values for the first five Sales values by writing


		lm.predict(X)[0:5]
        

If I remove [0:5], then I will get predicted Advertising values for the whole Sales dataset.


To make a prediction on an individual Sales value, I write


		lm.predict(Xi)
        

where Xi is the Sales value of the ith observation, passed as a two-dimensional array with one row and one column, because predict expects the same layout as the training data.
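
The sketch below shows the shape that predict expects for a single value. It fits a separate model, demo_model, on made-up points from y = 3x + 2; these names and numbers are assumptions for illustration and are not our fitted lm.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data on the line y = 3x + 2, stored as columns.
X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = 3.0 * X_demo + 2.0

demo_model = LinearRegression().fit(X_demo, y_demo)

# One sample with one feature must have shape (1, 1), not be a bare number.
single = np.array([[24.0]])
prediction = demo_model.predict(single)

print(prediction.shape)  # (1, 1)
print(prediction[0, 0])  # approximately 74.0, since 3 * 24 + 2 = 74
```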



In [None]:
# Predicting Advertising values

lm.predict(X)[0:5]

# Predicting Advertising values on first five Sales values.

In [None]:
# To make an individual prediction using the linear regression model.
# predict expects a 2D array of shape (n_samples, n_features), so we wrap the value.

print(lm.predict(np.array([[24]])))

## Regression metrics for model performance


Now it is time to evaluate model performance.

For regression problems, two common ways to measure model performance are RMSE (Root Mean Square Error) and the R-Squared value. Both are explained below.


### RMSE

RMSE is calculated by taking the square root of the Mean Squared Error. It can be read as the standard deviation of the residuals, that is, the typical size of the prediction error, in the same units as the dependent variable.
RMSE is an absolute measure of fit. It tells us how spread out the residuals are. The more concentrated the data is around the regression line, the smaller the residuals and hence the lower the RMSE. So, lower values of RMSE indicate a better fit.
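
To make the calculation concrete, here is a sketch that computes RMSE step by step with NumPy. The arrays y_true_demo and y_pred_demo are made-up numbers, not our model's output.

```python
import numpy as np

# Hypothetical observed and predicted values.
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.0, 5.0, 8.0, 11.0])

residuals = y_true_demo - y_pred_demo   # 1, 0, -1, -2
mse_demo = np.mean(residuals ** 2)      # (1 + 0 + 1 + 4) / 4 = 1.5
rmse_demo = np.sqrt(mse_demo)

print("MSE: {:.4f}".format(mse_demo))    # MSE: 1.5000
print("RMSE: {:.4f}".format(rmse_demo))  # RMSE: 1.2247
```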


In [None]:
# Calculate and print Root Mean Square Error(RMSE)

from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print("RMSE value: {:.4f}".format(rmse))


###  R2 Score


R2 Score is another metric for evaluating the performance of a regression model. It is also called the coefficient of determination. It gives us an idea of the goodness of fit of a linear regression model by indicating the proportion of the variance in the dependent variable that is explained by the model.


Mathematically, 


R2 Score = Explained Variation/Total Variation


In general, the higher the R2 Score value, the better the model fits the data. Usually, its value ranges from 0 to 1, and we want it to be as close to 1 as possible. Its value can become negative when the model fits worse than simply predicting the mean of y for every observation.
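
The sketch below computes the R2 score by hand as 1 - (residual sum of squares / total sum of squares), which is the form scikit-learn's r2_score uses. The helper r2_by_hand and all the arrays are hypothetical names and numbers chosen for illustration.

```python
import numpy as np

def r2_by_hand(y_obs, y_hat):
    """R2 as 1 - (residual sum of squares / total sum of squares)."""
    ss_res = np.sum((y_obs - y_hat) ** 2)
    ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical observed values and two sets of predictions.
y_obs_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_close = np.array([2.0, 5.0, 8.0, 11.0])  # fairly close to the observations
y_wrong = np.array([9.0, 7.0, 5.0, 3.0])   # trend in the wrong direction

r2_close = r2_by_hand(y_obs_demo, y_close)
r2_wrong = r2_by_hand(y_obs_demo, y_wrong)

print("{:.4f}".format(r2_close))  # 0.7000
print("{:.4f}".format(r2_wrong))  # -3.0000
```

The second result shows how the score becomes negative: those predictions are further from the observations than the mean of y would be.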



In [None]:
# Calculate and print r2_score

from sklearn.metrics import r2_score
print("R2 Score value: {:.4f}".format(r2_score(y_test, y_pred)))


## Interpretation and Conclusion


The RMSE value has been found to be 11.2273. It means the standard deviation of our prediction errors is 11.2273, so we expect some predictions to be off by more than 11.2273 and others by less. The Advertising values in the first five rows lie between 15 and 27, so an error of this size is large by comparison. So, the model is not a good fit to the data.


In business decisions, a commonly used benchmark for the R2 score value is 0.7. It means if the R2 score value >= 0.7, then the model is considered good enough to deploy on unseen data, whereas if the R2 score value < 0.7, then the model is not good enough to deploy. Our R2 score value has been found to be 0.5789. It means that this model explains 57.89% of the variance in our dependent variable. So, the R2 score value confirms that the model is not good enough to deploy because it does not provide a good fit to the data.


In [None]:
# Plot the Regression Line


plt.scatter(X, y, color = 'blue', label='Scatter Plot')
plt.plot(X_test, y_pred, color = 'black', linewidth=3, label = 'Regression Line')
plt.title('Relationship between Sales and Advertising')
plt.xlabel('Sales')
plt.ylabel('Advertising')
plt.legend(loc=4)
plt.show()


## Residual analysis



A linear regression model may not represent the data appropriately. The model may be a poor fit to the data. So, we should validate our model by plotting and examining the residuals.

The difference between the observed value of the dependent variable (yi) and the predicted value (ŷi) is called the residual and is denoted by ei. A scatter plot of these residuals is called a residual plot.

If the data points in a residual plot are randomly dispersed around the horizontal axis, with a mean residual of approximately zero, a linear regression model may be appropriate for the data. Otherwise, a non-linear model may be more appropriate.

If we take a look at the generated ‘Residual errors’ plot, we can see that the train data plot pattern is non-random. The same is the case with the test data plot pattern.
So, it suggests that a non-linear model would be a better fit.
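
To show what a non-random residual pattern looks like in numbers, the sketch below fits a straight line to made-up curved data and inspects the residuals. It uses np.polyfit instead of scikit-learn to stay short; the data (y = x squared) and the _demo names are assumptions for illustration.

```python
import numpy as np

# Hypothetical curved data: y = x^2 for x = 0, 1, ..., 10.
x_demo = np.arange(11, dtype=float)
y_demo = x_demo ** 2

# Fit a straight line by least squares.
slope_demo, intercept_demo = np.polyfit(x_demo, y_demo, 1)

# Residual = observed value minus predicted value.
residuals_demo = y_demo - (slope_demo * x_demo + intercept_demo)

# The mean residual is (numerically) zero, but the signs are not random:
# positive at both ends and negative in the middle.
print(abs(residuals_demo.mean()) < 1e-8)
print(residuals_demo[0] > 0, residuals_demo[5] < 0, residuals_demo[-1] > 0)
```

A mean of zero alone is therefore not enough; the U-shaped pattern in the signs is what points to a non-linear relationship.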



In [None]:
# Plotting residual errors (residual = observed value - predicted value)

plt.scatter(lm.predict(X_train), y_train - lm.predict(X_train), color = 'red', label = 'Train data')
plt.scatter(lm.predict(X_test), y_test - lm.predict(X_test), color = 'blue', label = 'Test data')
plt.hlines(xmin = 0, xmax = 50, y = 0, linewidth = 3)
plt.title('Residual errors')
plt.legend(loc = 4)
plt.show()

## Checking for Overfitting and Underfitting


I calculate the training set score as 0.2861. Similarly, I calculate the test set score as 0.5789. Both are R2 scores, since that is what the score method of LinearRegression returns.
The training set score is very poor. So, the model does not learn the relationship in the training data appropriately, and it performs poorly even on the data it was trained on. This is a sign of underfitting. Hence, it supports my finding that the linear regression model does not provide a good fit to the data.


Underfitting means our model performs poorly on the training data. It means the model does not capture the relationship in the training data. This problem can often be addressed by increasing model complexity, for example by using a more flexible model such as polynomial regression.
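
As a rough illustration of how extra model complexity can fix underfitting, the sketch below fits a degree-1 and a degree-2 polynomial to the same made-up curved data and compares their R2 scores on that data. It uses np.polyfit, not scikit-learn, and the data and the helper r2_of_polyfit are hypothetical.

```python
import numpy as np

# Hypothetical curved data, symmetric around zero.
x_demo = np.linspace(-3.0, 3.0, 25)
y_demo = x_demo ** 2

def r2_of_polyfit(degree):
    """Fit a polynomial of the given degree and return its R2 on the same data."""
    coeffs = np.polyfit(x_demo, y_demo, degree)
    y_hat = np.polyval(coeffs, x_demo)
    ss_res = np.sum((y_demo - y_hat) ** 2)
    ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

r2_linear = r2_of_polyfit(1)     # a straight line explains almost nothing here
r2_quadratic = r2_of_polyfit(2)  # the quadratic matches the curve

print("Linear R2: {:.4f}".format(r2_linear))
print("Quadratic R2: {:.4f}".format(r2_quadratic))
```

Whether a polynomial model helps on our sales data is a separate question that would need to be checked on the training and test sets.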


In [None]:
# Checking for Overfitting or Underfitting the data

print("Training set score: {:.4f}".format(lm.score(X_train,y_train)))

print("Test set score: {:.4f}".format(lm.score(X_test,y_test)))

In [None]:
# Save model for future use

import joblib  # sklearn.externals.joblib was removed in newer scikit-learn releases
joblib.dump(lm, 'lm_regressor.pkl')

# To load the model

# lm2=joblib.load('lm_regressor.pkl')

## Simple Linear Regression - Model Assumptions



The Linear Regression Model is based on several assumptions, which are listed below:

i.	Linear relationship
ii.	Multivariate normality
iii.	No or little multicollinearity
iv.	No auto-correlation
v.	Homoscedasticity


### i.	Linear relationship


The relationship between response and feature variables should be linear. This linear relationship assumption can be tested by plotting a scatter-plot between response and feature variables.


### ii.	Multivariate normality

The linear regression model requires all variables to be multivariate normal. A multivariate normal distribution describes a vector of normally distributed variables in which any linear combination of the variables is also normally distributed.


### iii.	No or little multicollinearity

It is assumed that there is little or no multicollinearity in the data. Multicollinearity occurs when the features (or independent variables) are highly correlated.


### iv.	No auto-correlation

Also, it is assumed that there is little or no auto-correlation in the data. Autocorrelation occurs when the residual errors are not independent of each other.
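
One common numerical check for this, which is not used elsewhere in this project, is the Durbin-Watson statistic: the sum of squared differences between neighbouring residuals divided by the sum of squared residuals. Values near 2 suggest little first-order autocorrelation, values toward 0 suggest positive autocorrelation, and values toward 4 suggest negative autocorrelation. The sketch below is a minimal version; the function name and the residual lists are made up for illustration.

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson statistic for a sequence of residuals in observation order."""
    residuals = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

# Hypothetical residuals where neighbours tend to share the same sign.
trending = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]
# Hypothetical residuals where neighbours always have opposite signs.
alternating = [1.0, -1.0, 1.0, -1.0]

dw_trending = durbin_watson(trending)
dw_alternating = durbin_watson(alternating)

print("{:.4f}".format(dw_trending))     # 0.6667
print("{:.4f}".format(dw_alternating))  # 3.0000
```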


### v.	Homoscedasticity

Homoscedasticity describes a situation in which the variance of the error term (that is, the noise in the model) is the same across all values of the independent variables. It means the residuals have a similar spread all along the regression line. It can be checked by looking at a scatter plot of the residuals.


## References


The concepts and ideas in this project have been taken from the following websites and books:

i.	Machine learning notes by Andrew Ng

ii.	https://en.wikipedia.org/wiki/Linear_regression

iii.	https://en.wikipedia.org/wiki/Simple_linear_regression

iv.	https://en.wikipedia.org/wiki/Ordinary_least_squares

v.	https://en.wikipedia.org/wiki/Root-mean-square_deviation

vi.	https://en.wikipedia.org/wiki/Coefficient_of_determination

vii.	https://www.statisticssolutions.com/assumptions-of-linear-regression/

viii.	Python Data Science Handbook by Jake VanderPlas

ix.	Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron

x.	Introduction to Machine Learning with Python by Andreas C. Müller and Sarah Guido
