# Subject: Classical Data Analysis

## Session 1 - Regression

### Exercise 2 Solution


This solution uses the Python StatsModels package with Quandl integration to fit a linear regression with one variable.

Considering the OLS presented in Demo 3 (Outbound tourism statistics for Spain), develop a new regression analysis based on the Quandl dataset “Inbound tourism statistics for Spain”. This dataset is available at https://www.quandl.com/data/UTOR/INTUR_ESP

Use the field “Tourist arrivals at national borders – Thousands” as independent variable and the field “Travel - US$ Mn” as dependent variable.

- Interpret and discuss the OLS Regression Results. 
- Commit scripts in your GitHub account. You should export your solution code (.ipynb notebook) and push it to your repository “ClassicalDataAnalysis”.

The following are the tasks that you should complete and synchronize with your repository “ClassicalDataAnalysis” by October 13. Please note that none of these tasks is graded; however, it is important that you understand and complete them correctly so that you do not run into problems with later assignments.


In [1]:
import quandl
quandl.ApiConfig.api_key = 'wagAy5tFsmUZ84CH3Ng8'  # A valid API key is required to retrieve data; you can find yours under your Quandl account settings.

In [2]:
data1 = quandl.get("UTOR/INTUR_ESP", authtoken="wagAy5tFsmUZ84CH3Ng8")

In [3]:
data1.head()

| Date | Tourist arrivals at national borders - Thousands | Tourism expenditure in the country - US$ Mn | Travel - US$ Mn | Passenger transport - US$ Mn |
|---|---|---|---|---|
| 1995-12-31 | 34920.0 | 27369.0 | 25368.0 | 2001.0 |
| 1996-12-31 | 36221.0 | 29751.0 | 27168.0 | 2583.0 |
| 1997-12-31 | 39553.0 | 28649.0 | 26185.0 | 2464.0 |
| 1998-12-31 | 41892.0 | 31592.0 | 29117.0 | 2475.0 |
| 1999-12-31 | 45440.0 | 33784.0 | 31214.0 | 2570.0 |


### Regression model with Statsmodels and without a constant

In [4]:
import statsmodels.api as sm

X = data1["Tourist arrivals at national borders - Thousands"]
y = data1["Travel - US$ Mn"]

In [5]:
model = sm.OLS(y, X).fit() ## sm.OLS(output, input)


In [6]:
predictions = model.predict(X) 

In [7]:
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Travel - US$ Mn | R-squared: | 0.967 |
| Model: | OLS | Adj. R-squared: | 0.965 |
| Method: | Least Squares | F-statistic: | 551.8 |
| Date: | Thu, 28 Sep 2017 | Prob (F-statistic): | 1.68e-15 |
| Time: | 11:36:37 | Log-Likelihood: | -209.65 |
| No. Observations: | 20 | AIC: | 421.3 |
| Df Residuals: | 19 | BIC: | 422.3 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Tourist arrivals at national borders - Thousands | 0.9010 | 0.038 | 23.491 | 0.000 | 0.821 0.981 |

| | | | |
|---|---|---|---|
| Omnibus: | 6.516 | Durbin-Watson: | 0.171 |
| Prob(Omnibus): | 0.038 | Jarque-Bera (JB): | 1.723 |
| Skew: | 0.056 | Prob(JB): | 0.422 |
| Kurtosis: | 1.566 | Cond. No. | 1.0 |


### Interpreting the Table 


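The coefficient of 0.9010 means that when Tourist arrivals at national borders - Thousands increases by 1 (that is, by one thousand arrivals), the predicted value of Travel - US$ Mn increases by 0.9010 (about US$ 0.9 million). Another important value is the high R-squared of 0.967, the share of variance the model explains. Because this model has no constant, statsmodels reports the uncentered R-squared, which measures variation around zero rather than around the mean of the dependent variable.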
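
To make the coefficient concrete, the following sketch applies the reported slope of 0.9010 to the five rows shown by `data1.head()` and compares each prediction with the observed value. It is only an illustration: the slope is copied from the summary table rather than re-estimated, and `SLOPE`, `rows` and `predict_travel` are placeholder names that do not appear in the notebook.

```python
# Slope reported in the no-constant summary table (copied, not re-estimated).
SLOPE = 0.9010

# First five rows shown by data1.head():
# (year, arrivals in thousands, travel receipts in US$ Mn).
rows = [
    (1995, 34920.0, 25368.0),
    (1996, 36221.0, 27168.0),
    (1997, 39553.0, 26185.0),
    (1998, 41892.0, 29117.0),
    (1999, 45440.0, 31214.0),
]


def predict_travel(arrivals_thousands, slope=SLOPE):
    """Predicted 'Travel - US$ Mn' for a line forced through the origin."""
    return slope * arrivals_thousands


for year, arrivals, travel in rows:
    predicted = predict_travel(arrivals)
    residual = travel - predicted  # observed minus predicted
    print(f"{year}: predicted {predicted:,.1f}, observed {travel:,.1f}, "
          f"residual {residual:,.1f}")
```

For these five early years every prediction is higher than the observed value, which is one hint that a line forced through the origin does not describe this part of the series well.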

### Regression model with Statsmodels and with a constant

In [8]:
import statsmodels.api as sm

X = data1["Tourist arrivals at national borders - Thousands"]
y = data1["Travel - US$ Mn"]

In [9]:
X = sm.add_constant(X) 

In [10]:
model = sm.OLS(y, X).fit() ## sm.OLS(output, input)

In [11]:
predictions = model.predict(X)

In [13]:
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Travel - US$ Mn | R-squared: | 0.814 |
| Model: | OLS | Adj. R-squared: | 0.803 |
| Method: | Least Squares | F-statistic: | 78.61 |
| Date: | Thu, 28 Sep 2017 | Prob (F-statistic): | 5.5e-08 |
| Time: | 12:00:02 | Log-Likelihood: | -203.23 |
| No. Observations: | 20 | AIC: | 410.5 |
| Df Residuals: | 18 | BIC: | 412.5 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| const | -3.832e+04 | 9518.711 | -4.026 | 0.001 | -5.83e+04 -1.83e+04 |
| Tourist arrivals at national borders - Thousands | 1.6339 | 0.184 | 8.866 | 0.000 | 1.247 2.021 |

| | | | |
|---|---|---|---|
| Omnibus: | 1.016 | Durbin-Watson: | 0.368 |
| Prob(Omnibus): | 0.602 | Jarque-Bera (JB): | 0.788 |
| Skew: | -0.13 | Prob(JB): | 0.674 |
| Kurtosis: | 2.063 | Cond. No. | 333000.0 |


### Interpreting the Table 
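With the constant term the coefficients are different. Without a constant we force the model to go through the origin; with it, the model has a y-intercept of -3.832e+04. The slope of the Tourist arrivals at national borders - Thousands predictor also changes, from 0.9010 to 1.6339.
The R-squared changes as well, from 0.967 to 0.814. The two values are not directly comparable: statsmodels computes R-squared against the uncentered total sum of squares when the constant is omitted and against the centered one when it is included, so the drop does not by itself mean that the second model fits worse. The log-likelihood rises from -209.65 to -203.23 and the AIC falls from 421.3 to 410.5, which points to a better fit with the constant.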
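
The effect of the constant on the slope and on R-squared can be reproduced by hand. The sketch below fits both lines in plain Python and scores them against the uncentered and centered total sums of squares. It assumes only the five rows shown by `data1.head()`, so its numbers will not match the summary tables, which are based on all 20 observations; all variable names are illustrative.

```python
# First five rows shown by data1.head(); the full dataset has 20 observations,
# so these results differ from the summary tables.
x = [34920.0, 36221.0, 39553.0, 41892.0, 45440.0]  # arrivals, thousands
y = [25368.0, 27168.0, 26185.0, 29117.0, 31214.0]  # travel, US$ Mn

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Fit 1: line through the origin, y = b * x.
slope_origin = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
ssr_origin = sum((yi - slope_origin * xi) ** 2 for xi, yi in zip(x, y))

# Fit 2: line with an intercept, y = a + b * x.
sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
sxx = sum((xi - x_mean) ** 2 for xi in x)
slope_const = sxy / sxx
intercept = y_mean - slope_const * x_mean
ssr_const = sum((yi - intercept - slope_const * xi) ** 2 for xi, yi in zip(x, y))

# Total sums of squares: around zero (uncentered) and around the mean (centered).
tss_uncentered = sum(yi ** 2 for yi in y)
tss_centered = sum((yi - y_mean) ** 2 for yi in y)

r2_origin_uncentered = 1 - ssr_origin / tss_uncentered  # convention without a constant
r2_origin_centered = 1 - ssr_origin / tss_centered      # same fit, centered baseline
r2_const_centered = 1 - ssr_const / tss_centered        # convention with a constant

print(f"through origin: slope={slope_origin:.4f}")
print(f"with constant:  intercept={intercept:.1f}, slope={slope_const:.4f}")
print(f"R2 origin (uncentered): {r2_origin_uncentered:.4f}")
print(f"R2 origin (centered):   {r2_origin_centered:.4f}")
print(f"R2 constant (centered): {r2_const_centered:.4f}")
```

On these five rows the uncentered figure for the origin fit is the highest of the three, while the same fit scored against the centered baseline is the lowest. Comparing the two centered values is the like-for-like comparison between the models.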