## Agenda
This notebook applies a linear regression model to the *Advertising* dataset from the book *An Introduction to Statistical Learning*.

We will mainly use pandas, NumPy, and scikit-learn.

##### Reference: http://www-bcf.usc.edu/~gareth/ISL/

As discussed in the previous lecture:
#### Types of supervised learning
- Classification: Predict a categorical response
- Regression: Predict a continuous response

In [1]:
# conventional way to import pandas
import pandas as pd

In [2]:
# read CSV file directly from a URL and save the results
data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)

# display the first 5 rows
data.head()

Unnamed: 0,TV,radio,newspaper,sales
1,230.1,37.8,69.2,22.1
2,44.5,39.3,45.1,10.4
3,17.2,45.9,69.3,9.3
4,151.5,41.3,58.5,18.5
5,180.8,10.8,58.4,12.9


In [3]:
# check the shape of the DataFrame (rows, columns)
data.shape

(200, 4)


### Linear regression
Pros: fast, no tuning required, highly interpretable, well-understood

Cons: unlikely to produce the best predictive accuracy (presumes a linear relationship between the features and response)

Form of linear regression:

$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n$

- $y$ is the response
- $\beta_0$ is the intercept
- $\beta_1$ is the coefficient for $x_1$ (the first feature)
- $\beta_n$ is the coefficient for $x_n$ (the nth feature)

In this case:

$y = \beta_0 + \beta_1 \times TV + \beta_2 \times Radio + \beta_3 \times Newspaper$

The $\beta$ values are called the model coefficients. These values are "learned" during the model fitting step using the "least squares" criterion. Then, the fitted model can be used to make predictions!
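
To make the "least squares" criterion concrete, here is a minimal sketch using NumPy only. The data is synthetic (an assumption for illustration, not the Advertising dataset), and the names `true_beta`, `A`, and `beta_hat` are hypothetical. It estimates the intercept and coefficients by minimizing the sum of squared residuals, then applies the linear form above to produce predictions.

```python
import numpy as np

# Synthetic data, invented for illustration only (not the Advertising dataset)
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(50, 3))
true_beta = np.array([3.0, 0.05, 0.2, 0.0])  # intercept, then three coefficients
y = true_beta[0] + X @ true_beta[1:] + rng.normal(0, 0.1, size=50)

# Prepend a column of ones so the intercept beta_0 is estimated with the rest
A = np.column_stack([np.ones(len(X)), X])

# Least squares: pick beta that minimizes the sum of squared residuals
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

# Predictions follow y = beta_0 + beta_1*x_1 + beta_2*x_2 + beta_3*x_3
y_hat = A @ beta_hat

print("estimated intercept and coefficients:", beta_hat)
```

The estimates should land close to `true_beta` because the synthetic noise is small. On real data the true coefficients are unknown, which is why the notebook evaluates predictions on held-out data instead.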

##### To build the model, we need to split our dataset into training and test sets. Before that, we need to separate X and y from the data. Let's do that.

### Preparing X and y using pandas


In [13]:
# create a Python list of feature names
feature_cols = ['TV', 'radio', 'newspaper']

# use the list to select a subset of the original DataFrame
X = data[feature_cols]

# print the first 5 rows
print(X.head())
print('------------------------------')


# select the response as a Series (attribute access works because the column name has no spaces)
y = data.sales

# print the first 5 values
print(y.head())

      TV  radio  newspaper
1  230.1   37.8       69.2
2   44.5   39.3       45.1
3   17.2   45.9       69.3
4  151.5   41.3       58.5
5  180.8   10.8       58.4
------------------------------
1    22.1
2    10.4
3     9.3
4    18.5
5    12.9
Name: sales, dtype: float64


In [37]:
# See Ref: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

In [38]:
X_train.shape

(140, 3)

In [39]:
X_test.shape

(60, 3)

In [40]:
y_train.shape

(140,)

In [41]:
y_test.shape

(60,)
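
The shapes above follow from `test_size=0.3` applied to 200 rows: 60 rows are held out for testing and 140 remain for training. The sketch below shows the idea of a shuffled hold-out split using only the standard library. The function `simple_train_test_split` is hypothetical and is not scikit-learn's implementation; in particular, its shuffling will not reproduce the same rows that `train_test_split` selects.

```python
import random

def simple_train_test_split(rows, test_size=0.3, seed=1):
    """Shuffle row positions and hold out a fraction for testing (illustration only)."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)      # seeded shuffle so the split is repeatable
    n_test = round(test_size * len(rows))     # 0.3 * 200 = 60 rows held out
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(200))                       # stand-in for the 200 observations
train, test = simple_train_test_split(rows)
print(len(train), len(test))
```

Fixing the seed plays the same role as `random_state=1` in the cell above: rerunning the split yields the same partition, so results are comparable between runs.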

### Linear regression in scikit-learn


In [42]:
# import model
from sklearn.linear_model import LinearRegression

# instantiate
linreg = LinearRegression()

# fit the model to the training data (learn the coefficients)
linreg.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

#### Congratulations, you have built your first model. We are not done yet, but it is a good start.

### Making predictions


In [43]:
# make predictions on the testing set
y_pred = linreg.predict(X_test)

# So we fit the model on the training set and predict on the test set

#### We need evaluation metrics in order to compare our predictions with the actual values!



#### The following are a few metrics for evaluating a linear model:

- MAE is the easiest to understand, because it's the average error.
- - Mean Absolute Error (MAE) is the mean of the absolute value of the errors:

$$\frac 1n\sum_{i=1}^n|y_i-\hat{y}_i|$$
- MSE is more popular than MAE, because MSE "punishes" larger errors.

- - Mean Squared Error (MSE) is the mean of the squared errors:

$$\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2$$
- RMSE is even more popular than MSE, because RMSE is interpretable in the "y" units.

- - Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors:

$$\sqrt{\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2}$$
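
As a small worked example, the three metrics can be computed by hand with the standard library. The true values are the first five sales figures shown earlier. The predictions in `y_hat` are invented for illustration and are not model output.

```python
import math

# First five observed sales values from the table at the top of the notebook
y_true = [22.1, 10.4, 9.3, 18.5, 12.9]
# Hypothetical predictions, invented for illustration only
y_hat = [21.0, 11.0, 10.0, 17.5, 13.0]

errors = [t - p for t, p in zip(y_true, y_hat)]
n = len(errors)

mae = sum(abs(e) for e in errors) / n   # mean of the absolute errors
mse = sum(e ** 2 for e in errors) / n   # mean of the squared errors
rmse = math.sqrt(mse)                   # square root of MSE, back in the units of y

print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}")
```

RMSE is never smaller than MAE on the same errors, and the gap grows when a few errors are much larger than the rest, which is the sense in which squaring "punishes" large errors.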

In [44]:

# Remember: 'y_test' holds the true values
# 'y_pred' holds the predicted values
# Time to compare the two

# calculate MAE using scikit-learn

from sklearn import metrics
print(metrics.mean_absolute_error(y_test, y_pred))

1.0548328405073326


In [45]:
import numpy as np
# Computing the RMSE for our Sales predictions

print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

1.388857410775697


#### R^2 (coefficient of determination) regression score function.

The best possible score is 1.0, and the score can be negative because a model can be arbitrarily worse than a constant prediction. A constant model that always predicts the expected value of y, disregarding the input features, would get an R^2 score of 0.0.
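
The score is defined as $R^2 = 1 - SS_{res}/SS_{tot}$, where $SS_{res}$ is the sum of squared prediction errors and $SS_{tot}$ is the sum of squared deviations of y from its mean. The sketch below computes it by hand to show the three cases: a reasonable model, a constant mean prediction, and a model worse than the mean. The helper `r2` and all prediction lists are hypothetical values chosen for illustration.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# First five observed sales values from the table at the top of the notebook
y_true = [22.1, 10.4, 9.3, 18.5, 12.9]

y_model = [21.0, 11.0, 10.0, 17.5, 13.0]        # hypothetical reasonable predictions
y_const = [sum(y_true) / len(y_true)] * 5       # always predict the mean of y
y_bad = [0.0, 30.0, 30.0, 0.0, 30.0]            # hypothetical predictions worse than the mean

r2_model = r2(y_true, y_model)
r2_const = r2(y_true, y_const)
r2_bad = r2(y_true, y_bad)

print(r2_model, r2_const, r2_bad)
```

The constant prediction gives a score of zero (up to floating-point rounding) because its squared error equals $SS_{tot}$, and the deliberately poor predictions give a negative score.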

In [46]:
metrics.r2_score(y_test,y_pred)

0.9224605706201434

One important question:


Does Newspaper "belong" in our model? In other words, does it improve the quality of our predictions?

Let's remove it from the model and check the RMSE and R^2.

In [51]:
# create a Python list of feature names
feature_cols = ['TV', 'radio']

# use the list to select a subset of the original DataFrame
X = data[feature_cols]

# select a Series from the DataFrame
y = data.sales

# split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# fit the model to the training data (learn the coefficients)
linreg.fit(X_train, y_train)

# make predictions on the testing set
y_pred = linreg.predict(X_test)

# compute the RMSE of our predictions
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

print("R^2: ", metrics.r2_score(y_test, y_pred))

1.383728668840889
R^2:  0.9230321850256801


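*The RMSE decreased slightly (from about 1.389 to 1.384) and R^2 rose slightly (from about 0.9225 to 0.9230) when we removed Newspaper from the model. Error is something we want to minimize, so a lower RMSE is better. This suggests that Newspaper is unlikely to be useful for predicting Sales and can be removed from the model, although the difference is small and comes from a single train/test split.*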
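
Because the comparison above rests on one split, a quick sanity check is to repeat it over several random splits and see whether the sign of the difference is stable. The sketch below does this with NumPy on synthetic data in which the third feature is irrelevant by construction. The data, the helper `holdout_rmse`, and the choice of ten seeds are all assumptions for illustration and say nothing about the Advertising data itself.

```python
import numpy as np

# Synthetic data, invented for illustration: y depends on columns 0 and 1 only
rng = np.random.default_rng(0)
n = 200
X = rng.uniform(0, 100, size=(n, 3))
y = 3.0 + 0.05 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 1.5, size=n)

def holdout_rmse(X, y, seed, test_size=0.3):
    """Fit least squares on a shuffled training split, return RMSE on the held-out rows."""
    order = np.random.default_rng(seed).permutation(len(y))
    n_test = int(round(test_size * len(y)))
    test, train = order[:n_test], order[n_test:]
    A_train = np.column_stack([np.ones(len(train)), X[train]])
    A_test = np.column_stack([np.ones(len(test)), X[test]])
    beta, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
    return float(np.sqrt(np.mean((A_test @ beta - y[test]) ** 2)))

# RMSE with all three features minus RMSE with the first two, for each split
diffs = []
for seed in range(10):
    rmse_all = holdout_rmse(X, y, seed)
    rmse_two = holdout_rmse(X[:, :2], y, seed)
    diffs.append(rmse_all - rmse_two)

print(np.round(diffs, 4))
```

If the differences are small and change sign from split to split, a single-split improvement of a few thousandths is weak evidence either way. Applying the same loop to the real `data` DataFrame with different `random_state` values would be the natural next step.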