# Introduction to Machine Learning

## Linear Regression
Previously, we presented the preprocessing of data, removing nulls, and categorical variables encoding. In this section we are presenting an introduction to linear regression. We use the Advertising dataset, it has 4 variables; TV, Radio, Newspaper and sales. The dataset represents the sales after making advertisments into the three kinds of media; TV, Radio, and Newspaper. Our target is to predict the value of Sales based on the investment value in TV, Radio and Newspaper.


In [10]:
import pandas as pd
import numpy as np 
data=pd.read_csv("Advertising.csv")
data.head()

Unnamed: 0.1,Unnamed: 0,TV,radio,newspaper,sales
0,1,230.1,37.8,69.2,22.1
1,2,44.5,39.3,45.1,10.4
2,3,17.2,45.9,69.3,9.3
3,4,151.5,41.3,58.5,18.5
4,5,180.8,10.8,58.4,12.9


For simplicity, we divide the dataframe into two dataframes, one for the input features and another series object for the output variable.

In [4]:
X=data.iloc[:, 0:3]
Y= data.iloc[:,3]

In this section we use the scikit-learn package. It contains all the needed functions to perform machine learning effectively. Linear Regression model class can be found in the module linear_model.

In [3]:
from sklearn.linear_model import LinearRegression

To use the LinearRegression, we first make an object and perform training and prediction of the model.

In [8]:
linReg= LinearRegression()
linReg.fit(X,Y)
result=linReg.predict(X)


### Root Mean Squared Error
To evaluate the prediction we need to compare predicted value with the actual value. We ues the root mean squared error. We can find all the required metrics in the module metrics of sklearn.

In [11]:
from sklearn.metrics import mean_squared_error

print( np.sqrt( mean_squared_error(Y, result)))

20.1395234493


### Training and Testing 
To evaluate the model, we need to measure the error on a data that differs from the training data that is used to build the model. We split the data into training and testing datasets randomly. We use the function train_test_split that can be found in the module model_selection.

In [12]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split( X,Y, test_size=.33, random_state=42)
linReg.fit(x_train, y_train)
result= linReg.predict(x_test)
print( np.sqrt( mean_squared_error(y_test, result)))


21.6684298636


### Model Parameters
To access the parameters of the model we can perform the following:

In [13]:
linReg.coef_

array([-0.03131105,  0.01222614,  0.44408598])

In [14]:
linReg.intercept_

21.254997001365847

## References
* https://github.com/justmarkham/scikit-learn-videos
* http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
* https://www.youtube.com/watch?v=OGxgnH8y2NM&list=PLQVvvaa0QuDfKTOs3Keq_kaG2P55YRn5v
* https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-learn-data-science-python-scratch-2/