# Subject: Classical Data Analysis

## Session 1 - Regression

### Exercise 1 Solution



Considering the OLS model presented in Demo 2, develop a new regression analysis based on the independent variable “LSTAT” (percentage of lower status of the population).

- Interpret and discuss the OLS Regression Results. 
- Commit your scripts to your GitHub account. You should export your solution code (.ipynb notebook) and push it to your repository “ClassicalDataAnalysis”.


The following are the tasks that you should complete and synchronize with your repository “ClassicalDataAnalysis” by October 13. Please note that none of these tasks is graded. However, it is important that you understand and complete them correctly so that you do not run into problems with later assignments.



# 1 - Linear Regression in Statsmodels
Statsmodels is “a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration.” (from the documentation)
As with Pandas and NumPy, the easiest way to get or install Statsmodels is through the Anaconda distribution.

After installing it, you will need to import it every time you want to use it:

In [1]:
import statsmodels.api as sm

Let’s see how to actually use Statsmodels for linear regression.

First, we import a dataset from sklearn (the other library I’ve mentioned):

Imports datasets from scikit-learn (check here http://scikit-learn.org/stable/datasets/index.html):

In [2]:
from sklearn import datasets 

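Load the Boston dataset from the datasets library. Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this cell needs an older scikit-learn version: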

In [3]:
data = datasets.load_boston() 

This is a dataset of the Boston house prices. Because it is a dataset designated for testing and learning machine learning tools, it comes with a description of the dataset, and we can see it by using the command print (data.DESCR).

In [4]:
print (data.DESCR)

Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
      

Running data.feature_names prints the column names of the independent variables, and data.target prints the values of the dependent variable. In other words, scikit-learn has already set the house value/price data as the target variable, and the 13 other variables are set as predictors. Let’s see how to run a linear regression on this dataset.

In [5]:
data.feature_names

array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT'], 
      dtype='<U7')

In [6]:
data.target

array([ 24. ,  21.6,  34.7,  33.4,  36.2,  28.7,  22.9,  27.1,  16.5,
        18.9,  15. ,  18.9,  21.7,  20.4,  18.2,  19.9,  23.1,  17.5,
        20.2,  18.2,  13.6,  19.6,  15.2,  14.5,  15.6,  13.9,  16.6,
        14.8,  18.4,  21. ,  12.7,  14.5,  13.2,  13.1,  13.5,  18.9,
        20. ,  21. ,  24.7,  30.8,  34.9,  26.6,  25.3,  24.7,  21.2,
        19.3,  20. ,  16.6,  14.4,  19.4,  19.7,  20.5,  25. ,  23.4,
        18.9,  35.4,  24.7,  31.6,  23.3,  19.6,  18.7,  16. ,  22.2,
        25. ,  33. ,  23.5,  19.4,  22. ,  17.4,  20.9,  24.2,  21.7,
        22.8,  23.4,  24.1,  21.4,  20. ,  20.8,  21.2,  20.3,  28. ,
        23.9,  24.8,  22.9,  23.9,  26.6,  22.5,  22.2,  23.6,  28.7,
        22.6,  22. ,  22.9,  25. ,  20.6,  28.4,  21.4,  38.7,  43.8,
        33.2,  27.5,  26.5,  18.6,  19.3,  20.1,  19.5,  19.5,  20.4,
        19.8,  19.4,  21.7,  22.8,  18.8,  18.7,  18.5,  18.3,  21.2,
        19.2,  20.4,  19.3,  22. ,  20.3,  20.5,  17.3,  18.8,  21.4,
        15.7,  16.2,

First, we should load the data as a pandas data frame for easier analysis and set the median home value as our target variable:

In [7]:
import numpy as np

In [8]:
import pandas as pd

Define the data/predictors as the pre-set feature names:

In [9]:
df = pd.DataFrame(data.data, columns=data.feature_names) 

Show Pandas data frame `df` as a table:

In [10]:
df 

|  | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 |
| 5 | 0.02985 | 0.0 | 2.18 | 0.0 | 0.458 | 6.430 | 58.7 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.12 | 5.21 |
| 6 | 0.08829 | 12.5 | 7.87 | 0.0 | 0.524 | 6.012 | 66.6 | 5.5605 | 5.0 | 311.0 | 15.2 | 395.60 | 12.43 |
| 7 | 0.14455 | 12.5 | 7.87 | 0.0 | 0.524 | 6.172 | 96.1 | 5.9505 | 5.0 | 311.0 | 15.2 | 396.90 | 19.15 |
| 8 | 0.21124 | 12.5 | 7.87 | 0.0 | 0.524 | 5.631 | 100.0 | 6.0821 | 5.0 | 311.0 | 15.2 | 386.63 | 29.93 |
| 9 | 0.17004 | 12.5 | 7.87 | 0.0 | 0.524 | 6.004 | 85.9 | 6.5921 | 5.0 | 311.0 | 15.2 | 386.71 | 17.10 |


To show only the top few rows, you can also use head:

In [11]:
df.head()

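|  | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |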


To show the first few rows of the first few columns, you can use:

In [12]:
df.head(5)[df.columns[0:4]] 

|  | CRIM | ZN | INDUS | CHAS |
|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 |


Put the target (housing value -- MEDV) in another DataFrame:

In [13]:
target = pd.DataFrame(data.target, columns=["MEDV"]) 

Show Pandas data frame `target` as a table:

In [14]:
target

|  | MEDV |
|---|---|
| 0 | 24.0 |
| 1 | 21.6 |
| 2 | 34.7 |
| 3 | 33.4 |
| 4 | 36.2 |
| 5 | 28.7 |
| 6 | 22.9 |
| 7 | 27.1 |
| 8 | 16.5 |
| 9 | 18.9 |


What we’ve done here is take the dataset and load it as a pandas data frame. After that, we set the predictors (as df), the independent variables that are pre-set in the dataset. We also set the target, that is, the dependent variable we’re trying to predict/estimate.

Next we’ll want to fit a linear regression model. We need to choose variables that we think will be good predictors for the dependent variable. That can be done by checking the correlation(s) between variables, by plotting the data and searching visually for a relationship, by conducting preliminary research on what variables are good predictors of y, etc. For this first example, let’s take LSTAT. It’s important to note that Statsmodels does not add a constant by default. Let’s first see the regression model without a constant.
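As a quick illustration of the correlation check mentioned above, the sketch below computes the Pearson correlation between LSTAT and MEDV in plain Python. It uses only the ten rows displayed in the tables above, not the full 506-row dataset, so the value is indicative only; the helper name `pearson_r` is illustrative.

```python
import math

# First ten rows shown in the tables above (not the full 506-row dataset).
lstat = [4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10]
medv = [24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9]


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


r = pearson_r(lstat, medv)
print(f"correlation(LSTAT, MEDV) on the 10-row sample: {r:.3f}")
```

A clearly negative value on this sample is consistent with choosing LSTAT as a predictor, although the full data frame should be used for a real check (for example with `df["LSTAT"].corr(target["MEDV"])`).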

### 1.1. Regression model with Statsmodels and without a constant:

In [15]:
import statsmodels.api as sm

X = df["LSTAT"]
y = target["MEDV"]

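Note the argument order: Statsmodels takes the dependent variable first and the predictors second, the reverse of scikit-learn's fit(X, y):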

In [16]:
model = sm.OLS(y, X).fit() ## sm.OLS(output, input)


Make the predictions by the model:

In [17]:
predictions = model.predict(X) 

Print out the statistics:

In [18]:
model.summary()

| Dep. Variable: | MEDV | R-squared: | 0.449 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.448 |
| Method: | Least Squares | F-statistic: | 410.9 |
| Date: | Wed, 27 Sep 2017 | Prob (F-statistic): | 2.71e-67 |
| Time: | 13:32:50 | Log-Likelihood: | -2182.4 |
| No. Observations: | 506 | AIC: | 4367.0 |
| Df Residuals: | 505 | BIC: | 4371.0 |
| Df Model: | 1 |  |  |
| Covariance Type: | nonrobust |  |  |

|  | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| LSTAT | 1.1221 | 0.055 | 20.271 | 0.000 | 1.013 1.231 |

| Omnibus: | 1.113 | Durbin-Watson: | 0.369 |
|---|---|---|---|
| Prob(Omnibus): | 0.573 | Jarque-Bera (JB): | 1.051 |
| Skew: | 0.112 | Prob(JB): | 0.591 |
| Kurtosis: | 3.009 | Cond. No. | 1.0 |


### Interpreting the Table 

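The coefficient of 1.1221 means that as the LSTAT variable increases by 1 percentage point, the predicted value of MEDV increases by 1.1221 (MEDV is measured in $1000s). This positive sign is misleading. LSTAT is the percentage of lower status of the population, so we would expect a higher LSTAT to lower the median value of houses. The sign is an artifact of omitting the constant, which forces the fitted line through the origin even though MEDV is far from zero when LSTAT is close to zero. The R-squared value is 0.449, but it should be read with care: for a model without a constant, Statsmodels reports the uncentered R-squared, which is not directly comparable with the R-squared of a model that includes an intercept.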
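To see why a regression through the origin can produce a positive slope here, the following sketch fits the no-constant model by hand, using the closed form slope = Σxy / Σx². It is a minimal illustration on the ten rows displayed earlier (not the full dataset), so the numbers will differ from the summary table; all variable names are illustrative. It also contrasts the uncentered R² (the definition Statsmodels uses when no constant is present) with the centered one.

```python
# First ten rows shown earlier in the document (not the full dataset).
lstat = [4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10]
medv = [24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9]

# Least squares through the origin: minimising sum((y - b*x)^2) gives
# b = sum(x*y) / sum(x*x).
slope_origin = sum(x * y for x, y in zip(lstat, medv)) / sum(x * x for x in lstat)

# Residual sum of squares of the through-origin fit.
ss_res = sum((y - slope_origin * x) ** 2 for x, y in zip(lstat, medv))

# Uncentered total sum of squares (no mean subtracted) versus centered.
mean_y = sum(medv) / len(medv)
ss_tot_uncentered = sum(y * y for y in medv)
ss_tot_centered = sum((y - mean_y) ** 2 for y in medv)

r2_uncentered = 1 - ss_res / ss_tot_uncentered
r2_centered = 1 - ss_res / ss_tot_centered

print(f"slope through origin: {slope_origin:.4f}")
print(f"uncentered R^2:       {r2_uncentered:.4f}")
print(f"centered R^2:         {r2_centered:.4f}")
```

Because both LSTAT and MEDV are positive, Σxy is positive and the through-origin slope must be positive, whatever the true direction of the relationship. The centered R² can even be negative for such a fit, which is one reason the two definitions should not be compared directly.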

### 1.2. Regression model with Statsmodels and with a constant:

If we do want to add a constant to our model — we have to set it by using the command X = sm.add_constant(X) where X is the name of your data frame containing your input (independent) variables.

Import statsmodels: 

In [19]:
import statsmodels.api as sm 

X usually means our input variables (or independent variables):

In [20]:
X = df["LSTAT"] 

y usually means our output/dependent variable:

In [21]:
y = target["MEDV"] 

Let's add an intercept (β0) to our model.

As a reminder, here is the interpretation of the model parameters:

- Each β coefficient represents the change in the mean response, E(y), per unit increase in the associated predictor variable when all the other predictors are held constant.
- For example, β1 represents the change in the mean response, E(y), per unit increase in x1 when x2, x3, ..., xp−1 are held constant.
- The intercept term, β0, represents the mean response, E(y), when the predictors x1, x2, ..., xp−1 are all zero (which may or may not have any practical meaning).

In [22]:
X = sm.add_constant(X) 
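Conceptually, adding a constant means prepending a column of ones to the predictors, so that the coefficient of that column becomes the intercept. The sketch below mimics this in plain Python on the ten rows displayed earlier (not the full dataset) and solves the 2×2 normal equations (XᵀX)β = Xᵀy directly. The variable names are illustrative, and this is not meant to describe how Statsmodels computes the fit internally.

```python
# First ten rows shown earlier in the document (not the full dataset).
lstat = [4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10]
medv = [24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9]

# Design matrix with a leading column of ones (the "constant").
design = [[1.0, x] for x in lstat]

# Entries of X'X = [[n, sum_x], [sum_x, sum_xx]] and X'y = [sum_y, sum_xy].
n = len(design)
sum_x = sum(row[1] for row in design)
sum_xx = sum(row[1] ** 2 for row in design)
sum_y = sum(medv)
sum_xy = sum(row[1] * y for row, y in zip(design, medv))

# Solve the 2x2 system with the explicit inverse of X'X.
det = n * sum_xx - sum_x ** 2
intercept = (sum_xx * sum_y - sum_x * sum_xy) / det
slope = (n * sum_xy - sum_x * sum_y) / det

print(f"intercept (beta_0): {intercept:.4f}")
print(f"slope (beta_1):     {slope:.4f}")
```

With the column of ones in place the line is no longer forced through the origin, and on this sample the slope comes out negative, in line with the expectation that a higher LSTAT goes with a lower MEDV.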

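Note the argument order: Statsmodels takes the dependent variable first and the predictors second, the reverse of scikit-learn's fit(X, y):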

In [23]:
model = sm.OLS(y, X).fit() ## sm.OLS(output, input)

Make the predictions by the model:

In [24]:
predictions = model.predict(X)

Print out the statistics:

In [25]:
model.summary()

| Dep. Variable: | MEDV | R-squared: | 0.544 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.543 |
| Method: | Least Squares | F-statistic: | 601.6 |
| Date: | Wed, 27 Sep 2017 | Prob (F-statistic): | 5.08e-88 |
| Time: | 15:02:06 | Log-Likelihood: | -1641.5 |
| No. Observations: | 506 | AIC: | 3287.0 |
| Df Residuals: | 504 | BIC: | 3295.0 |
| Df Model: | 1 |  |  |
| Covariance Type: | nonrobust |  |  |

|  | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| const | 34.5538 | 0.563 | 61.415 | 0.000 | 33.448 35.659 |
| LSTAT | -0.9500 | 0.039 | -24.528 | 0.000 | -1.026 -0.874 |

| Omnibus: | 137.043 | Durbin-Watson: | 0.892 |
|---|---|---|---|
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 291.373 |
| Skew: | 1.453 | Prob(JB): | 5.36e-64 |
| Kurtosis: | 5.319 | Cond. No. | 29.7 |


### Interpreting the Table 
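With the constant term the coefficients are different. Without a constant we forced the model to go through the origin, but now we have a y-intercept at 34.55, and the slope of the LSTAT predictor changed from 1.1221 to -0.9500. The negative slope matches the expectation: each additional percentage point of LSTAT is associated with a decrease of about 0.95 in the predicted MEDV. The coefficient is statistically significant (P>|t| is reported as 0.000, with a 95% confidence interval from -1.026 to -0.874), and the R-squared of 0.544 indicates that LSTAT alone explains about 54% of the variance in MEDV. The Omnibus and Jarque-Bera tests (both with p-values close to zero) and the skew of 1.453 suggest that the residuals are not normally distributed, and the Durbin-Watson statistic of 0.892 points to positively autocorrelated residuals, so the inference should be treated with some caution.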
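The t statistic and the confidence interval in the LSTAT row of the table can be approximately reproduced from the coef and std err columns. The sketch below does this arithmetic with the rounded values copied from the table. As a simplifying assumption it uses the normal quantile 1.96 in place of the exact t critical value for 504 residual degrees of freedom, so small differences from the table are expected.

```python
# Values copied (already rounded) from the LSTAT row of the summary table.
coef = -0.9500
std_err = 0.039

# t statistic: estimated coefficient divided by its standard error.
t_stat = coef / std_err

# Approximate 95% confidence interval. Assumption: 1.96 (normal quantile)
# stands in for the t critical value with 504 degrees of freedom.
z = 1.96
ci_low = coef - z * std_err
ci_high = coef + z * std_err

print(f"t statistic: {t_stat:.3f}  (table: -24.528)")
print(f"95% CI:      [{ci_low:.3f}, {ci_high:.3f}]  (table: [-1.026, -0.874])")
```

The hand-computed t statistic differs slightly from the table because std err is shown rounded to three decimals. Since the interval stays well away from zero, the negative association between LSTAT and MEDV is unlikely to be due to chance alone.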

# 2 - Linear Regression in SKLearn 
SKLearn is pretty much the gold standard when it comes to machine learning in Python. It has many learning algorithms for regression, classification, clustering and dimensionality reduction. In order to use linear regression, we need to import it:

In [26]:
from sklearn import linear_model

Let’s use the same dataset we used before, the Boston housing prices. The process would be the same in the beginning — importing the datasets from SKLearn and loading in the Boston dataset:

Imports datasets from scikit-learn (check here http://scikit-learn.org/stable/datasets/index.html):

In [27]:
from sklearn import datasets 

Loads Boston dataset from datasets library:

In [28]:
data = datasets.load_boston() 

Next, we’ll load the data to Pandas (same as before).

Define the data/predictors as the pre-set feature names: 

In [29]:
df = pd.DataFrame(data.data, columns=data.feature_names)

Show Pandas data frame `df` as a table:

In [30]:
df

|  | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 |
| 5 | 0.02985 | 0.0 | 2.18 | 0.0 | 0.458 | 6.430 | 58.7 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.12 | 5.21 |
| 6 | 0.08829 | 12.5 | 7.87 | 0.0 | 0.524 | 6.012 | 66.6 | 5.5605 | 5.0 | 311.0 | 15.2 | 395.60 | 12.43 |
| 7 | 0.14455 | 12.5 | 7.87 | 0.0 | 0.524 | 6.172 | 96.1 | 5.9505 | 5.0 | 311.0 | 15.2 | 396.90 | 19.15 |
| 8 | 0.21124 | 12.5 | 7.87 | 0.0 | 0.524 | 5.631 | 100.0 | 6.0821 | 5.0 | 311.0 | 15.2 | 386.63 | 29.93 |
| 9 | 0.17004 | 12.5 | 7.87 | 0.0 | 0.524 | 6.004 | 85.9 | 6.5921 | 5.0 | 311.0 | 15.2 | 386.71 | 17.10 |


Create a new Pandas data frame `df2` containing only the LSTAT variable:

In [38]:
df2 = pd.DataFrame(df, columns=["LSTAT"])

Show Pandas data frame `df2` as a table:

In [51]:
df2

|  | LSTAT |
|---|---|
| 0 | 4.98 |
| 1 | 9.14 |
| 2 | 4.03 |
| 3 | 2.94 |
| 4 | 5.33 |
| 5 | 5.21 |
| 6 | 12.43 |
| 7 | 19.15 |
| 8 | 29.93 |
| 9 | 17.10 |


Put the target (housing value -- MEDV) in another DataFrame:

In [52]:
target = pd.DataFrame(data.target, columns=["MEDV"])

Show Pandas data frame `target` as a table:

In [41]:
target

|  | MEDV |
|---|---|
| 0 | 24.0 |
| 1 | 21.6 |
| 2 | 34.7 |
| 3 | 33.4 |
| 4 | 36.2 |
| 5 | 28.7 |
| 6 | 22.9 |
| 7 | 27.1 |
| 8 | 16.5 |
| 9 | 18.9 |


So now, as before, we have the data frame that contains the independent variable (“df2”, which holds only LSTAT) and the data frame with the dependent variable (“target”). Let’s fit a regression model using SKLearn. First we’ll define our X and y. To match the Statsmodels analysis, I’ll use only LSTAT to predict the housing price:

In [42]:
X = df2
y = target["MEDV"]

And then I’ll fit a model:

In [43]:
lm = linear_model.LinearRegression()
model = lm.fit(X,y)

The lm.fit() function fits a linear model. We want to use the model to make predictions (that’s what we’re here for!), so we’ll use lm.predict():

In [44]:
predictions = lm.predict(X)
print(predictions[0:5])

[ 29.8225951   25.87038979  30.72514198  31.76069578  29.49007782]

lm.predict() returns a NumPy array:

In [45]:
type(predictions)

numpy.ndarray

The print call above shows only the first 5 predictions for y (I didn’t print the entire array to “save room”). Removing [0:5] prints the entire array:

In [46]:
print(predictions)

[ 29.8225951   25.87038979  30.72514198  31.76069578  29.49007782
  29.60408375  22.74472741  16.36039575   6.11886372  18.30799693
  15.1253316   21.94668596  19.62856553  26.70643322  24.80633451
  26.50692285  28.30251613  20.61661686  23.44776393  23.83728417
  14.58380346  21.41465832  16.76891698  15.66685973  19.06803641
  18.86852605  20.48360995  18.13698805  22.39320915  23.17224962
  13.08272548  22.16519731   8.22797329  17.12043524  15.22983702
  25.35736314  23.71377775  26.22190805  24.92984093  30.44962767
  32.67274316  29.95560201  29.03405413  27.48547369  25.48086955
  24.85383698  21.11064252  16.69291303   5.28282029  19.16304135
  21.77567707  25.59487547  29.53758029  26.54492483  20.49311044
  29.98410349  29.07205611  30.80114593  28.03650231  25.79438584
  22.06069188  20.83512821  28.16000873  25.52837202  26.90594358
  30.1171104   24.8253355   26.85844111  22.11769484  26.20290706
  28.16950922  25.16735326  29.30956845  27.39046875  28.11250626
  26.06039

Remember, lm.predict() predicts the y (dependent variable) using the linear model we fitted. You must have noticed that when we run a linear regression with SKLearn, we don’t get a pretty table (okay, it’s not that pretty… but it’s pretty useful) like in Statsmodels. What we can do is use built-in functions to return the score, the coefficients and the estimated intercepts. Let’s see how it works:

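This is the R² score of our model. As you probably remember, this is the proportion of the variance in the dependent variable that the model explains. It matches the R-squared of 0.544 reported by Statsmodels for the model with a constant.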

In [47]:
lm.score(X,y) 

0.54414629758647992
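As a rough illustration of what the score measures, the sketch below computes R² = 1 − SS_res / SS_tot by hand for a simple linear fit. It uses only the ten rows displayed earlier (not the full dataset), so the result is not expected to equal the score above; the variable names are illustrative.

```python
# First ten rows shown earlier in the document (not the full dataset).
lstat = [4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10]
medv = [24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9]

n = len(lstat)
mean_x = sum(lstat) / n
mean_y = sum(medv) / n

# Simple linear regression with intercept: slope = S_xy / S_xx.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lstat, medv))
s_xx = sum((x - mean_x) ** 2 for x in lstat)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

# Fitted values, residual sum of squares, and total sum of squares.
fitted = [intercept + slope * x for x in lstat]
ss_res = sum((y - f) ** 2 for y, f in zip(medv, fitted))
ss_tot = sum((y - mean_y) ** 2 for y in medv)

r_squared = 1 - ss_res / ss_tot
print(f"R^2 on the 10-row sample: {r_squared:.4f}")
```

A value of 1 would mean the line reproduces every observation exactly, and a value of 0 would mean it does no better than always predicting the mean of MEDV.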

Next, let’s check out the coefficients for the predictors:

In [48]:
lm.coef_ ## will give this output:

array([-0.95004935])

and the intercept:

In [49]:
lm.intercept_ ## that will give this output:

34.55384087938311

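These are all estimated parts of the regression equation I’ve mentioned earlier: MEDV ≈ 34.55 − 0.95 × LSTAT. They match the const and LSTAT coefficients of the Statsmodels model with a constant, because scikit-learn's LinearRegression fits an intercept by default.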
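To connect the coefficient and intercept with the predictions printed earlier, the sketch below recomputes the first five predictions by hand as intercept + slope × LSTAT. The numbers are copied from the outputs above; the variable names are illustrative.

```python
# Values copied from the outputs of lm.intercept_ and lm.coef_ above.
intercept = 34.55384087938311
slope = -0.95004935

# LSTAT values of the first five rows, and the first five printed predictions.
lstat_first_five = [4.98, 9.14, 4.03, 2.94, 5.33]
reported = [29.8225951, 25.87038979, 30.72514198, 31.76069578, 29.49007782]

# Apply the fitted regression equation to each LSTAT value.
by_hand = [intercept + slope * x for x in lstat_first_five]

for x, ours, theirs in zip(lstat_first_five, by_hand, reported):
    print(f"LSTAT={x:5.2f}  by hand={ours:.6f}  lm.predict={theirs:.6f}")
```

The two columns agree up to the rounding of the printed coefficient, which shows that lm.predict() simply evaluates the fitted line.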