### Supervised Learning
In supervised learning, the training data you feed to the algorithm includes the desired solutions, called *labels*.

A typical supervised learning task is *classification*. The spam filter is a good example of this: it is trained with many example emails along with their class (spam or ham), and it must learn how to classify new emails.

Another typical task is to predict a target numeric value, such as the price of a car, given a set of features (mileage, age, brand, etc.) called predictors. This sort of task is called *regression*. To train the system, you need to give it many examples of cars, including both their predictors and their labels (i.e., their prices).

### Linear Regression
Linear regression is one of the simplest supervised learning algorithms in our toolkit. Linear regression—and its extensions—continues to be a common and useful method of making predictions when the target vector is a quantitative value (e.g., home price, age).

Linear regression models a linear relationship between variables. An example of a linear relationship would be the number of stories a building has and the building’s height. In linear regression, we assume the effect of the number of stories on building height is approximately constant, meaning a 20-story building will be roughly twice as high as a 10-story building, which will be roughly twice as high as a 5-story building.

More generally, a linear model makes a prediction by computing a weighted sum of the input features, plus a constant called the bias term (also called the intercept term). With a single feature, this reduces to:

$$Y = MX + C$$

where

* Y is the dependent variable (the value being predicted)
* X is the independent variable (the input feature)
* M is the slope (the weight applied to X)
* C is the intercept (the bias term)
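
As a minimal sketch of the formula above, the snippet below computes predictions for the building example. The slope of 3.3 meters per story and the 2 meter intercept are hypothetical values chosen only for illustration, and `predict` is an illustrative helper, not a library function.

```python
import numpy as np

def predict(x, m, c):
    """Return the linear prediction Y = M*X + C for each value in x."""
    return m * np.asarray(x, dtype=float) + c

# Hypothetical parameters: 3.3 m per story (slope), 2 m base height (intercept)
stories = [5, 10, 20]
heights = predict(stories, m=3.3, c=2.0)
print(heights)
```

Training a linear regression model amounts to choosing M and C so that predictions like these are as close as possible to the observed values.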

Why are we learning linear regression?
- widely used
- runs fast
- easy to use (not a lot of tuning required)
- highly interpretable
- basis for many other methods


# Simple Linear Regression
Linear regression with a single variable or feature is called **simple linear regression**.
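
With a single feature, the least-squares slope and intercept have a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. The sketch below applies these formulas to synthetic data (an assumption, since the real dataset is loaded later) and cross-checks the result against `np.polyfit`.

```python
import numpy as np

# Synthetic data for illustration only: y depends linearly on x plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 300, size=200)
y = 7.0 + 0.05 * x + rng.normal(0, 3, size=200)

# Closed-form least-squares estimates for one feature
x_mean, y_mean = x.mean(), y.mean()
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
intercept = y_mean - slope * x_mean

# Cross-check against NumPy's degree-1 polynomial fit
ref_slope, ref_intercept = np.polyfit(x, y, deg=1)
print("slope:", slope, "intercept:", intercept)
```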

In [1]:
import numpy as np
import pandas as pd

# import model related libraries
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# import module to calculate model performance metrics
from sklearn import metrics

In [2]:
# Step 1: load the advertising dataset into a DataFrame
data = pd.read_csv('Advertising.csv')


In [3]:
data.head()

      TV  radio  newspaper  sales
0  230.1   37.8       69.2   22.1
1   44.5   39.3       45.1   10.4
2   17.2   45.9       69.3    9.3
3  151.5   41.3       58.5   18.5
4  180.8   10.8       58.4   12.9


What are the **features**?
- TV: advertising dollars spent on TV for a single product in a given market (in thousands of dollars)
- Radio: advertising dollars spent on Radio
- Newspaper: advertising dollars spent on Newspaper

What is the **response**?
- Sales: sales of a single product in a given market (in thousands of widgets)

In [4]:
# print the shape of the DataFrame
data.shape

(200, 4)

There are 200 **observations**, and thus 200 markets in the dataset.

In [5]:
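# Step 2: select the independent variable (X) and the dependent variable (y)
# create a Python list of feature names
feature_names = ['TV']

# use the list to select a subset of the original DataFrame
# (a list of names keeps X two-dimensional, as scikit-learn expects)
X = data[feature_names]

# select the response column (sales) as a Series
y = data.sales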

In [6]:
X.head()

      TV
0  230.1
1   44.5
2   17.2
3  151.5
4  180.8


In [7]:
y.head()

0    22.1
1    10.4
2     9.3
3    18.5
4    12.9
Name: sales, dtype: float64

In [8]:
# Step 3: Split X and y into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)
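
To see what the split produces, here is a small sketch on a synthetic stand-in for the advertising data (the `demo` DataFrame and its values are made up for illustration). With 200 rows and `test_size=0.20`, 160 rows go to training and 40 to testing, and the features and labels stay aligned by index.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the dataset above
rng = np.random.default_rng(0)
demo = pd.DataFrame({"TV": rng.uniform(0, 300, size=200)})
demo["sales"] = 7.0 + 0.05 * demo["TV"] + rng.normal(0, 3, size=200)

X_demo = demo[["TV"]]
y_demo = demo["sales"]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=1
)
print(X_tr.shape, X_te.shape)  # (160, 1) (40, 1)

# Rows are shuffled, but features and labels keep matching index labels
print(X_tr.index.equals(y_tr.index))
```

Fixing `random_state` makes the shuffle reproducible, so the same rows land in the test set on every run.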

In [9]:
# Step 4: Fit a linear regression model to the training set
# instantiate the linear regression model
linreg = LinearRegression()

# fit the model to the training data (learn the coefficients)
linreg.fit(X_train, y_train)

LinearRegression()
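
After fitting, the learned slope and intercept are available as the `coef_` and `intercept_` attributes of the estimator. The sketch below uses synthetic data (an assumption, so that it runs without `Advertising.csv`) and shows that a prediction is just the intercept plus the slope times the feature value.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic single-feature data for illustration only
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 300, size=(200, 1))
y_demo = 7.0 + 0.05 * X_demo[:, 0] + rng.normal(0, 3, size=200)

model = LinearRegression().fit(X_demo, y_demo)
print("intercept (C):", model.intercept_)
print("slope (M):", model.coef_[0])

# Reproduce predict() by hand: Y = M*X + C
x_new = np.array([[100.0]])
manual = model.intercept_ + model.coef_[0] * x_new[0, 0]
print(manual, model.predict(x_new)[0])
```

On the notebook's own model, the same attributes would be read from `linreg.coef_` and `linreg.intercept_`.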

## Hypothesis Testing and p-values

Generally speaking, you start with a **null hypothesis** and an **alternative hypothesis** (that is opposite the null). Then, you check whether the data supports **rejecting the null hypothesis** or **failing to reject the null hypothesis**.

(Note that "failing to reject" the null is not the same as "accepting" the null hypothesis. The alternative hypothesis may indeed be true, except that you just don't have enough data to show that.)

As it relates to model coefficients, here is the conventional hypothesis test:
- **null hypothesis:** There is no relationship between TV ads and Sales 
- **alternative hypothesis:** There is a relationship between TV ads and Sales 

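How do we test this hypothesis? Intuitively, we reject the null (and thus believe the alternative) if the 95% confidence interval for the coefficient **does not include zero**. Equivalently, we can look at the **p-value**: the probability of observing a relationship at least as strong as the one in the data if the coefficient were actually zero. It is not the probability that the coefficient is zero. The test below computes it for the TV feature: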

In [10]:
# f_regression runs a univariate linear regression test for each feature.
# It tests the null hypothesis that there is no linear relationship between
# the regressor and the outcome/label.

from sklearn.feature_selection import f_regression
fregression = f_regression(X_train, y_train)  # returns (F-statistics, p-values), one entry per feature
fregression[1]  # p-values


array([8.01212052e-37])
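
For a single feature, the F-test from `f_regression` is equivalent to the usual t-test on the slope: the F-statistic is the squared t-statistic and the p-values coincide. The sketch below checks this on synthetic data, using `scipy.stats.linregress` as the reference; both the data and the use of `linregress` are assumptions made for illustration and are not part of the notebook above.

```python
import numpy as np
from scipy import stats
from sklearn.feature_selection import f_regression

# Synthetic data with a weak linear signal, for illustration only
rng = np.random.default_rng(0)
x = rng.uniform(0, 300, size=50)
y = 7.0 + 0.01 * x + rng.normal(0, 3, size=50)

# f_regression expects a 2-D feature matrix and returns two arrays
f_values, p_values = f_regression(x.reshape(-1, 1), y)

# t-test on the slope from an ordinary least-squares fit
result = stats.linregress(x, y)
t_stat = result.slope / result.stderr

print("F:", f_values[0], "t^2:", t_stat ** 2)
print("p (F-test):", p_values[0], "p (t-test):", result.pvalue)
```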

In [11]:
p = fregression[1]

print("p-value for significance is: ", p)

# p is a one-element array because the model has a single feature
if p < 0.05:
    print("reject null hypothesis")
else:
    print("fail to reject null hypothesis")

p-value for significance is:  [8.01212052e-37]
reject null hypothesis


A p-value less than 0.05 is one common way to decide whether there is likely a relationship between the feature and the response. (Using 0.05 as the cutoff is just a convention.)

In this case, the p-value for TV is far less than 0.05, and so we **believe** that there is a relationship between TV ads and Sales.

Note that we generally ignore the p-value for the intercept.
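
The confidence-interval criterion described in this section can be sketched as follows. The example builds a 95% confidence interval for the slope from its standard error and checks whether it excludes zero, which agrees with the `p < 0.05` rule. The data is synthetic and `scipy.stats.linregress` is used here as an assumption for illustration, since scikit-learn's `LinearRegression` does not report standard errors.

```python
import numpy as np
from scipy import stats

# Synthetic data for illustration only
rng = np.random.default_rng(0)
x = rng.uniform(0, 300, size=200)
y = 7.0 + 0.05 * x + rng.normal(0, 3, size=200)

result = stats.linregress(x, y)

# 95% confidence interval: slope +/- t_crit * standard error,
# with n - 2 degrees of freedom (slope and intercept are estimated)
dof = len(x) - 2
t_crit = stats.t.ppf(0.975, dof)
ci_low = result.slope - t_crit * result.stderr
ci_high = result.slope + t_crit * result.stderr

excludes_zero = not (ci_low <= 0.0 <= ci_high)
print("95% CI for slope:", (ci_low, ci_high))
print("CI excludes zero:", excludes_zero, "| p < 0.05:", result.pvalue < 0.05)
```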

In [12]:
# Step 5: Test the model's generalization ability using the test set
# make predictions on the testing set
y_pred = linreg.predict(X_test)

In [13]:
y_pred

array([17.18696362, 16.77798032, 11.51540011, 20.60665526, 19.30579273,
       20.77419058, 14.84639657, 15.70871075, 10.2785952 , 17.41362906,
       14.90552669, 10.20961007, 17.37913649, 12.21017895, 17.92609005,
       12.99365298, 13.28930355, 21.12404376,  8.0612159 , 17.18203611,
       11.74699305, 10.14062494,  8.03657835, 12.09191872, 12.36293175,
       16.08320147,  8.92353007, 19.05941725, 15.01885941, 18.63072392,
       18.6208689 , 18.35478338, 14.17625527, 15.18639473, 19.03970721,
       15.91073864, 17.75855473, 13.17597083, 17.48261419,  7.76556532])

In [14]:
# Step 6: Compute the performance of the model using metrics
print("RMSE=", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))


RMSE= 3.2953520791575928
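
RMSE is the square root of the mean squared difference between actual and predicted values, so it is expressed in the same units as the target (here, thousands of widgets). As a small sketch of what the metric computes, the snippet below evaluates it by hand on three hypothetical value pairs and compares the result with `metrics.mean_squared_error`.

```python
import numpy as np
from sklearn import metrics

# Hypothetical actual and predicted values, for illustration only
y_true = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 5.0, 9.0])

# Manual computation: errors are 1, 0, -2, so squared errors are 1, 0, 4
errors = y_true - y_hat
rmse_manual = np.sqrt(np.mean(errors ** 2))

# Same quantity via scikit-learn
rmse_sklearn = np.sqrt(metrics.mean_squared_error(y_true, y_hat))
print(rmse_manual, rmse_sklearn)
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.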


In [15]:
# compare actual and predicted sales side by side (index comes from y_test)
df_predicted = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df_predicted.head(10)

     Actual  Predicted
58     23.8  17.186964
40     16.6  16.777980
34      9.5  11.515400
102    14.8  20.606655
184    17.6  19.305793
198    25.5  20.774191
95     16.9  14.846397
4      12.9  15.708711
29     10.5  10.278595
168    17.1  17.413629
