### Forecasting Elantra Sales
An important application of linear regression is understanding sales. Consider a company that produces and sells a product. In a given period, if the company produces more units than consumers will buy, it earns nothing on the unsold units and incurs additional costs to store them in inventory until they can be sold. If it produces fewer units than consumers will buy, it earns less than it could have. Being able to predict consumer sales is therefore of first-order importance to the company.
In this problem, we will try to predict monthly sales of the Hyundai Elantra in the United States. The Hyundai Motor Company is a major automobile manufacturer based in South Korea. The Elantra is a car model that has been produced by Hyundai since 1990 and is sold all over the world, including the United States. We will build a linear regression model to predict monthly sales using economic indicators of the United States as well as Google search queries.

In [1]:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20

In [2]:
elantra=pd.read_csv("elantra.csv")

In [3]:
elantra.head()

Unnamed: 0,Month,Year,ElantraSales,Unemployment,Queries,CPI_energy,CPI_all
0,1,2010,7690,9.7,153,213.377,217.466
1,1,2011,9659,9.1,259,229.353,221.082
2,1,2012,10900,8.2,354,244.178,227.666
3,1,2013,12174,7.9,230,242.56,231.321
4,1,2014,15326,6.6,232,247.575,234.933


#### Load the data set. Split the data set into training and testing sets as follows: place all observations for 2012 and earlier in the training set, and all observations for 2013 and 2014 into the testing set.

In [4]:
train=elantra[elantra["Year"] <=2012]
test=elantra[elantra["Year"] >2012]
train.head()

Unnamed: 0,Month,Year,ElantraSales,Unemployment,Queries,CPI_energy,CPI_all
0,1,2010,7690,9.7,153,213.377,217.466
1,1,2011,9659,9.1,259,229.353,221.082
2,1,2012,10900,8.2,354,244.178,227.666
5,2,2010,7966,9.8,130,209.924,217.251
6,2,2011,12289,9.0,266,232.188,221.816


In [5]:
train.shape

(36, 7)

#### 2.1 Build a linear regression model to predict monthly Elantra sales using Unemployment, CPI_all, CPI_energy and Queries as the independent variables. Use all of the training set data to do this. What is the model R-squared? Note: In this problem, we will always be asking for the "Multiple R-Squared" of the model.

In [6]:
model=smf.ols(formula='ElantraSales ~ Unemployment + Queries + CPI_energy + CPI_all',data=train)
fitted=model.fit()
fitted.summary()

0,1,2,3
Dep. Variable:,ElantraSales,R-squared:,0.428
Model:,OLS,Adj. R-squared:,0.354
Method:,Least Squares,F-statistic:,5.803
Date:,"Thu, 08 Sep 2016",Prob (F-statistic):,0.00132
Time:,18:04:38,Log-Likelihood:,-339.99
No. Observations:,36,AIC:,690.0
Df Residuals:,31,BIC:,697.9
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5
,coef,std err,t,P>|t|,[95.0% Conf. Int.]
Intercept,9.539e+04,1.71e+05,0.559,0.580,-2.53e+05 4.43e+05
Unemployment,-3179.8996,3610.262,-0.881,0.385,-1.05e+04 4183.279
Queries,19.0297,11.259,1.690,0.101,-3.933 41.992
CPI_energy,38.5060,109.601,0.351,0.728,-185.027 262.039
CPI_all,-297.6456,704.837,-0.422,0.676,-1735.169 1139.878

0,1,2,3
Omnibus:,1.21,Durbin-Watson:,1.19
Prob(Omnibus):,0.546,Jarque-Bera (JB):,0.947
Skew:,0.39,Prob(JB):,0.623
Kurtosis:,2.845,Cond. No.,132000.0


#### 2.2 How many variables are significant, or have levels that are significant? Use 0.10 as your p-value cutoff.

In [7]:
# Number of variables with p < 0.10: 0 (the smallest p-value is 0.101, for Queries)

#### 2.3-4 What is the coefficient of the Unemployment variable? What is the interpretation?

In [8]:
# Coefficient: -3179.90. Holding the other variables fixed, a one-point increase in Unemployment
# is associated with a decrease of about 3180 in predicted monthly sales.
3180

3180
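To make the interpretation concrete, the sketch below scales the fitted coefficient by a hypothetical change in the unemployment rate, holding the other predictors fixed. The helper name `predicted_sales_change` and the 0.5-point scenario are illustrative assumptions, not part of the original analysis.

```python
# Coefficient of Unemployment taken from the regression summary above.
UNEMPLOYMENT_COEF = -3179.8996

def predicted_sales_change(delta_unemployment, coef=UNEMPLOYMENT_COEF):
    """Change in predicted monthly sales for a given change in the
    unemployment rate, with all other predictors held constant."""
    return coef * delta_unemployment

# Hypothetical scenario: unemployment rises by half a percentage point.
change = predicted_sales_change(0.5)
print(round(change, 2))
```

Because the p-value of this coefficient is 0.385, the estimate is imprecise, so the number is best read as the model's point estimate rather than a reliable effect size.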

#### 3.1: Our model R-Squared is relatively low, so we would now like to improve our model. In modeling demand and sales, it is often useful to model seasonality. Seasonality refers to the fact that demand is often cyclical/periodic in time. For example, in countries with different seasons, demand for warm outerwear (like jackets and coats) is higher in fall/autumn and winter (due to the colder weather) than in spring and summer. (In contrast, demand for swimsuits and sunscreen is higher in the summer than in the other seasons.) Another example is the "back to school" period in North America: demand for stationery (pencils, notebooks and so on) in late July and all of August is higher than the rest of the year due to the start of the school year in September.

#### In our problem, since our data includes the month of the year in which the units were sold, it is feasible for us to incorporate monthly seasonality. From a modeling point of view, it is reasonable to expect that the month has an effect on how many Elantra units are sold.

#### To incorporate the seasonal effect due to the month, build a new linear regression model that predicts monthly Elantra sales using Month as well as Unemployment, CPI_all, CPI_energy and Queries. Do not modify the training and testing data frames before building the model. What is the model R-Squared?

In [9]:
model2=smf.ols(formula='ElantraSales ~ Unemployment + Queries + CPI_energy + CPI_all + Month',data=train)
fitted2=model2.fit()
fitted2.summary()


#### 3.2 What is the effect of adding Month?

#### 3.3 In the new model, given two monthly periods that are otherwise identical in Unemployment, CPI_all, CPI_energy and Queries, what is the absolute difference in predicted Elantra sales given that one period is in January and one is in March?

In [10]:
# 110.6853 is the Month coefficient from fitted2
diff=110.6853*3 -110.6853*1
diff

221.37060000000002

#### In the new model, given two monthly periods that are otherwise identical in Unemployment, CPI_all, CPI_energy and Queries, what is the absolute difference in predicted Elantra sales given that one period is in January and one is in May?

In [11]:
diff=110.6853*5 -110.6853*1
diff

442.74120000000005
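The two calculations above follow one pattern: with Month entered as a single numeric variable, the predicted gap between two months is the coefficient times the difference in month numbers. The sketch below, with the hypothetical helper `numeric_month_gap`, makes that constraint explicit.

```python
# Month coefficient used in the two cells above (numeric coding of Month).
MONTH_COEF = 110.6853

def numeric_month_gap(month_a, month_b, coef=MONTH_COEF):
    """Absolute difference in predicted sales between two months when Month
    enters the model as a single numeric variable."""
    return abs(coef * (month_b - month_a))

jan_mar = numeric_month_gap(1, 3)
jan_may = numeric_month_gap(1, 5)
print(jan_mar, jan_may)

# The gap depends only on how far apart the month numbers are, so the
# January-to-May gap is forced to be twice the January-to-March gap.
print(jan_may / jan_mar)
```

This forced proportionality is what makes the numeric coding questionable: nothing about car sales suggests that May must differ from January by exactly twice as much as March does.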

#### 3.4 You may be experiencing an uneasy feeling that there is something not quite right in how we have modeled the effect of the calendar month on the monthly sales of Elantras. If so, you are right. In particular, we added Month as a variable, but Month is an ordinary numeric variable. In fact, we must convert Month to a factor variable before adding it to the model. Why?


 By modeling Month as a factor variable, the effect of each calendar month is not restricted to be linear in the numerical coding of the month. 
 There are several possible approaches to encode categorical values, and statsmodels has built-in support for many of them. In general these work by splitting a categorical variable into many different binary variables. The simplest way to encode categoricals is “dummy-encoding” which encodes a k-level categorical variable into k-1 binary variables.

In statsmodels this is done easily by wrapping the variable in C() inside the model formula. This plays the same role as R's as.factor().
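As a rough illustration of dummy encoding outside the formula interface, the sketch below applies `pandas.get_dummies` with `drop_first=True` to a small made-up Month column. The toy data frame is an assumption for illustration only; `C(Month)` in a formula builds a comparable set of indicator columns, with the first level acting as the reference.

```python
import pandas as pd

# Toy data: four observations covering three distinct months (made up).
toy = pd.DataFrame({"Month": [1, 2, 3, 1]})

# k = 3 levels become k - 1 = 2 indicator columns; Month 1 is the baseline
# and is represented by a row of zeros.
dummies = pd.get_dummies(toy["Month"], prefix="Month", drop_first=True)
print(dummies.columns.tolist())
print(dummies.astype(int).values.tolist())
```

Each indicator gets its own coefficient, which is why the fitted effect of a month is no longer tied to its numeric code.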

#### 4.1 Re-run the regression with the Month variable modeled as a factor variable 

In [12]:
model3=smf.ols(formula='ElantraSales ~ Unemployment +Queries + CPI_energy + CPI_all + C(Month)',data=train)
fitted3=model3.fit()
fitted3.summary()

0,1,2,3
Dep. Variable:,ElantraSales,R-squared:,0.819
Model:,OLS,Adj. R-squared:,0.684
Method:,Least Squares,F-statistic:,6.044
Date:,"Thu, 08 Sep 2016",Prob (F-statistic):,0.000147
Time:,18:04:39,Log-Likelihood:,-319.26
No. Observations:,36,AIC:,670.5
Df Residuals:,20,BIC:,695.9
Df Model:,15,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5
,coef,std err,t,P>|t|,[95.0% Conf. Int.]
Intercept,3.125e+05,1.44e+05,2.169,0.042,1.2e+04 6.13e+05
C(Month)[T.2],2254.9978,1943.249,1.160,0.260,-1798.548 6308.543
C(Month)[T.3],6696.5568,1991.635,3.362,0.003,2542.080 1.09e+04
C(Month)[T.4],7556.6074,2038.022,3.708,0.001,3305.368 1.18e+04
C(Month)[T.5],7420.2490,1950.139,3.805,0.001,3352.331 1.15e+04
C(Month)[T.6],9215.8326,1995.230,4.619,0.000,5053.856 1.34e+04
C(Month)[T.7],9929.4644,2238.800,4.435,0.000,5259.409 1.46e+04
C(Month)[T.8],7939.4474,2064.629,3.845,0.001,3632.706 1.22e+04
C(Month)[T.9],5013.2866,2010.745,2.493,0.022,818.946 9207.627

0,1,2,3
Omnibus:,0.047,Durbin-Watson:,2.795
Prob(Omnibus):,0.977,Jarque-Bera (JB):,0.246
Skew:,-0.032,Prob(JB):,0.884
Kurtosis:,2.6,Cond. No.,159000.0


#### 4.2 Which variables are significant, or have levels that are significant? Use 0.10 as your p-value cutoff.

#### Note: unlike R's summary output, the statsmodels summary has no significance stars or periods, so we need to check for which variables the p-value (P>|t|) is below 0.10.

* CPI_energy, p-value: 0.008
* CPI_all, p-value: 0.035
* Unemployment, p-value: 0.017
* Month (as factor): March, April, May, June, July, August, September, December


#### 5.1 Multicollinearity
Another peculiar observation about the regression is that the sign of the Queries variable has changed. In particular, when we naively modeled Month as a numeric variable, Queries had a positive coefficient. Now, Queries has a negative coefficient. Furthermore, CPI_energy has a positive coefficient -- as the overall price of energy increases, we expect Elantra sales to increase, which seems counter-intuitive (if the price of energy increases, we'd expect consumers to have less funds to purchase automobiles, leading to lower Elantra sales).

As we have seen before, changes in coefficient signs and signs that are counter to our intuition may be due to a multicollinearity problem. To check, compute the correlations of the variables in the training set.

#### Which of the following variables is CPI_energy highly correlated with? Include only variables where the absolute value of the correlation exceeds 0.6. For the purpose of this question, treat Month as a numeric variable, not a factor variable.


In [13]:
# Full correlation matrix of the training set (Month is treated as numeric here)
train.corr()
# Correlation between two individual columns:
train.CPI_energy.corr(train.CPI_all)

0.91322590900823353

CPI_energy is highly correlated with:
* Year
* Unemployment 
* Queries
* CPI_all

#### 5.2 Correlations
#### Which of the following variables is Queries highly correlated with? Again, compute the correlations on the training set. Include only variables where the absolute value of the correlation exceeds 0.6 and treat Month as a numeric variable, not a factor variable.

From the correlation matrix above, Queries is highly correlated with:
* Year
* ElantraSales
* Unemployment
* CPI_energy
* CPI_all
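Reading these lists off the full correlation matrix by eye is error-prone, so a small filter can help. The sketch below is one possible way to do it; the helper name `highly_correlated` and the toy data are assumptions for illustration, and in this notebook it would be called as `highly_correlated(train, "CPI_energy")` or `highly_correlated(train, "Queries")`.

```python
import pandas as pd

def highly_correlated(df, column, threshold=0.6):
    """Names of the other columns whose absolute correlation with
    `column` exceeds `threshold`."""
    corr = df.corr()[column].drop(column)
    return corr[corr.abs() > threshold].index.tolist()

# Made-up data: b moves with a, c moves against a, d is weakly related.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, 12],
    "c": [6, 5, 4, 3, 2, 1],
    "d": [1, -1, 1, -1, 1, -1],
})
print(highly_correlated(toy, "a"))
```

Taking the absolute value matters: strongly negative correlations (such as `c` here) signal collinearity just as much as strongly positive ones.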

#### 6.1 A Reduced Model
Simplify our model (the model using the factor version of the Month variable). We will do this by iteratively removing variables, one at a time. Remove the variable with the highest p-value (i.e., the least statistically significant variable) from the model. Repeat this until there are no variables that are insignificant or variables for which all of the factor levels are insignificant. Use a threshold of 0.10 to determine whether a variable is significant.
Looking at the summary output from model3, Queries has the highest p-value (0.717), so we re-run the model with the Queries variable removed.
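The iterative procedure described above can be sketched generically. In the sketch below, `backward_eliminate` and the `fit_pvalues` callback are hypothetical names, not part of statsmodels; in this notebook the callback could refit with `smf.ols(...).fit()` and read the `pvalues` attribute of the result, reporting the smallest p-value across the `C(Month)` levels as the p-value of the factor.

```python
def backward_eliminate(variables, fit_pvalues, threshold=0.10):
    """Drop the least significant variable one at a time.

    `fit_pvalues` is a caller-supplied function (hypothetical here) that
    refits the model on the given variables and returns {variable: p_value}.
    For a factor, the caller should report the smallest p-value across its
    levels, so the factor is dropped only when every level is insignificant.
    """
    remaining = list(variables)
    while remaining:
        pvalues = fit_pvalues(remaining)
        worst = max(remaining, key=lambda v: pvalues[v])
        if pvalues[worst] < threshold:
            break  # every remaining variable is significant
        remaining.remove(worst)
    return remaining

def fake_fit_pvalues(variables):
    # Stand-in for a real refit: made-up p-values that change once x3 is dropped.
    if "x3" in variables:
        return {"x1": 0.01, "x2": 0.15, "x3": 0.70}
    return {"x1": 0.01, "x2": 0.04}

print(backward_eliminate(["x1", "x2", "x3"], fake_fit_pvalues))
```

The stand-in shows why the model is refit after every removal: a variable that looks insignificant alongside a collinear predictor can become significant once that predictor is gone.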

In [14]:
model3=smf.ols(formula='ElantraSales~ C(Month)+Unemployment+CPI_energy+CPI_all',data=train)
fitted3=model3.fit()
fitted3.summary()

0,1,2,3
Dep. Variable:,ElantraSales,R-squared:,0.818
Model:,OLS,Adj. R-squared:,0.697
Method:,Least Squares,F-statistic:,6.744
Date:,"Thu, 08 Sep 2016",Prob (F-statistic):,5.73e-05
Time:,18:04:39,Log-Likelihood:,-319.38
No. Observations:,36,AIC:,668.8
Df Residuals:,21,BIC:,692.5
Df Model:,14,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5
,coef,std err,t,P>|t|,[95.0% Conf. Int.]
Intercept,3.257e+05,1.37e+05,2.384,0.027,4.16e+04 6.1e+05
C(Month)[T.2],2410.9137,1857.103,1.298,0.208,-1451.144 6272.972
C(Month)[T.3],6880.0868,1888.145,3.644,0.002,2953.474 1.08e+04
C(Month)[T.4],7697.3580,1960.214,3.927,0.001,3620.869 1.18e+04
C(Month)[T.5],7444.6447,1908.477,3.901,0.001,3475.749 1.14e+04
C(Month)[T.6],9223.1343,1953.636,4.721,0.000,5160.325 1.33e+04
C(Month)[T.7],9602.7221,2012.661,4.771,0.000,5417.165 1.38e+04
C(Month)[T.8],7919.4990,2020.993,3.919,0.001,3716.614 1.21e+04
C(Month)[T.9],5074.2910,1962.230,2.586,0.017,993.611 9154.971

0,1,2,3
Omnibus:,0.0,Durbin-Watson:,2.771
Prob(Omnibus):,1.0,Jarque-Bera (JB):,0.137
Skew:,-0.001,Prob(JB):,0.934
Kurtosis:,2.698,Cond. No.,118000.0


#### 6.2 Using the model from Problem 6.1, make predictions on the test set. What is the sum of squared errors of the model on the test set?

In [15]:
# Work on an explicit copy so the column assignment does not trigger SettingWithCopyWarning
test=test.copy()
test['Predicted_Sales']=fitted3.predict(test)
SSE=((test.ElantraSales - test.Predicted_Sales)**2).sum()
SSE


190757747.4443853

SSE ≈ 190,757,747.44

#### 6.3 - Comparing to a Baseline
What would the baseline method predict for all observations in the test set? Remember that the baseline method we use predicts the average outcome of all observations in the training set.

In [16]:
avg_Sales=train.ElantraSales.mean()
avg_Sales

14462.25

The baseline method would predict sales of 14,462.25 units for every observation in the test set.

#### 6.4 - Test Set R-squared
What is the test set R-Squared?

In [17]:
# R-squared = 1 - SSE/SST, with SST computed against the training-set mean (the baseline prediction)
SST= ((test.ElantraSales -  train.ElantraSales.mean())**2).sum()
R_sq_test= 1- float(SSE/SST)
R_sq_test

0.7280232276290254
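The calculation above compares the model's squared error against a baseline that always predicts the training-set mean. A minimal sketch of that formula in plain Python follows; the function name `out_of_sample_r2` and the numbers are made up for illustration.

```python
def out_of_sample_r2(actual, predicted, train_mean):
    """1 - SSE/SST, where SST is the squared error of the baseline that
    always predicts the training-set mean."""
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - train_mean) ** 2 for a in actual)
    return 1 - sse / sst

# Made-up numbers, for illustration only.
actual = [10.0, 12.0, 14.0]
predicted = [11.0, 12.0, 13.0]
r2 = out_of_sample_r2(actual, predicted, train_mean=10.0)
print(r2)
```

Unlike the training R-squared, this quantity can be negative: that happens whenever the model's test-set error exceeds the baseline's.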

#### 6.5 - Absolute Errors
What is the largest absolute error that we make in our test set predictions?

In [18]:
max(abs(test.ElantraSales - test.Predicted_Sales))

7491.4876927119039

In [19]:
test[abs(test.ElantraSales - test.Predicted_Sales)>7491.0]

Unnamed: 0,Month,Year,ElantraSales,Unemployment,Queries,CPI_energy,CPI_all,Predicted_Sales
13,3,2013,26153,7.5,313,244.598,232.075,18661.512307
