# Subject: Classical Data Analysis

## Session 1 - Regression

### Individual assignment 1

Develop a regression analysis in Statsmodels (with and without a constant) and SKLearn, based on the Iris dataset bundled with sklearn. This dataset contains sepal and petal measurements (length and width) for three types of iris: Setosa, Versicolour, and Virginica.

See here for more information on this dataset: https://en.wikipedia.org/wiki/Iris_flower_data_set 

Use the field “sepal width (cm)” as independent variable and the field “sepal length (cm)” as dependent variable.

- Interpret and discuss the OLS Regression Results.
- Commit your scripts to your GitHub account. You should export your solution code (.ipynb notebook) and push it to your repository “ClassicalDataAnalysis”.

The following are the tasks that you should complete and synchronize with your repository “ClassicalDataAnalysis” by October 13. Please note that none of these tasks is graded; however, it is important that you understand and complete them correctly so that you do not have problems with later assignments.

# Linear Regression in Statsmodels

## Load the iris dataset

> Put your code here

In [46]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

In [47]:
iris = datasets.load_iris()

In [48]:
print(iris.DESCR)

Iris Plants Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML iris d

In [49]:
iris.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [50]:
iris

 'data': array([[ 5.1,  3.5,  1.4,  0.2],
        [ 4.9,  3. ,  1.4,  0.2],
        [ 4.7,  3.2,  1.3,  0.2],
        [ 4.6,  3.1,  1.5,  0.2],
        [ 5. ,  3.6,  1.4,  0.2],
        [ 5.4,  3.9,  1.7,  0.4],
        [ 4.6,  3.4,  1.4,  0.3],
        [ 5. ,  3.4,  1.5,  0.2],
        [ 4.4,  2.9,  1.4,  0.2],
        [ 4.9,  3.1,  1.5,  0.1],
        [ 5.4,  3.7,  1.5,  0.2],
        [ 4.8,  3.4,  1.6,  0.2],
        [ 4.8,  3. ,  1.4,  0.1],
        [ 4.3,  3. ,  1.1,  0.1],
        [ 5.8,  4. ,  1.2,  0.2],
        [ 5.7,  4.4,  1.5,  0.4],
        [ 5.4,  3.9,  1.3,  0.4],
        [ 5.1,  3.5,  1.4,  0.3],
        [ 5.7,  3.8,  1.7,  0.3],
        [ 5.1,  3.8,  1.5,  0.3],
        [ 5.4,  3.4,  1.7,  0.2],
        [ 5.1,  3.7,  1.5,  0.4],
        [ 4.6,  3.6,  1. ,  0.2],
        [ 5.1,  3.3,  1.7,  0.5],
        [ 4.8,  3.4,  1.9,  0.2],
        [ 5. ,  3. ,  1.6,  0.2],
        [ 5. ,  3.4,  1.6,  0.4],
        [ 5.2,  3.5,  1.5,  0.2],
        [ 5.2,  3.4,  1.4,  0.2],
      

In [51]:
iris_y = iris.data[:, np.newaxis, 0]

In [52]:
iris_y

array([[ 5.1],
       [ 4.9],
       [ 4.7],
       [ 4.6],
       [ 5. ],
       [ 5.4],
       [ 4.6],
       [ 5. ],
       [ 4.4],
       [ 4.9],
       [ 5.4],
       [ 4.8],
       [ 4.8],
       [ 4.3],
       [ 5.8],
       [ 5.7],
       [ 5.4],
       [ 5.1],
       [ 5.7],
       [ 5.1],
       [ 5.4],
       [ 5.1],
       [ 4.6],
       [ 5.1],
       [ 4.8],
       [ 5. ],
       [ 5. ],
       [ 5.2],
       [ 5.2],
       [ 4.7],
       [ 4.8],
       [ 5.4],
       [ 5.2],
       [ 5.5],
       [ 4.9],
       [ 5. ],
       [ 5.5],
       [ 4.9],
       [ 4.4],
       [ 5.1],
       [ 5. ],
       [ 4.5],
       [ 4.4],
       [ 5. ],
       [ 5.1],
       [ 4.8],
       [ 5.1],
       [ 4.6],
       [ 5.3],
       [ 5. ],
       [ 7. ],
       [ 6.4],
       [ 6.9],
       [ 5.5],
       [ 6.5],
       [ 5.7],
       [ 6.3],
       [ 4.9],
       [ 6.6],
       [ 5.2],
       [ 5. ],
       [ 5.9],
       [ 6. ],
       [ 6.1],
       [ 5.6],
       [ 6.7],
       [ 5

In [53]:
iris_y_train = iris_y[:-20]
iris_y_test = iris_y[-20:]

In [54]:
iris_y_train

array([[ 5.1],
       [ 4.9],
       [ 4.7],
       [ 4.6],
       [ 5. ],
       [ 5.4],
       [ 4.6],
       [ 5. ],
       [ 4.4],
       [ 4.9],
       [ 5.4],
       [ 4.8],
       [ 4.8],
       [ 4.3],
       [ 5.8],
       [ 5.7],
       [ 5.4],
       [ 5.1],
       [ 5.7],
       [ 5.1],
       [ 5.4],
       [ 5.1],
       [ 4.6],
       [ 5.1],
       [ 4.8],
       [ 5. ],
       [ 5. ],
       [ 5.2],
       [ 5.2],
       [ 4.7],
       [ 4.8],
       [ 5.4],
       [ 5.2],
       [ 5.5],
       [ 4.9],
       [ 5. ],
       [ 5.5],
       [ 4.9],
       [ 4.4],
       [ 5.1],
       [ 5. ],
       [ 4.5],
       [ 4.4],
       [ 5. ],
       [ 5.1],
       [ 4.8],
       [ 5.1],
       [ 4.6],
       [ 5.3],
       [ 5. ],
       [ 7. ],
       [ 6.4],
       [ 6.9],
       [ 5.5],
       [ 6.5],
       [ 5.7],
       [ 6.3],
       [ 4.9],
       [ 6.6],
       [ 5.2],
       [ 5. ],
       [ 5.9],
       [ 6. ],
       [ 6.1],
       [ 5.6],
       [ 6.7],
       [ 5

In [55]:
iris_y_test

array([[ 7.4],
       [ 7.9],
       [ 6.4],
       [ 6.3],
       [ 6.1],
       [ 7.7],
       [ 6.3],
       [ 6.4],
       [ 6. ],
       [ 6.9],
       [ 6.7],
       [ 6.9],
       [ 5.8],
       [ 6.8],
       [ 6.7],
       [ 6.7],
       [ 6.3],
       [ 6.5],
       [ 6.2],
       [ 5.9]])

In [56]:
iris_X = iris.data[:, np.newaxis, 1]

In [57]:
iris_X

array([[ 3.5],
       [ 3. ],
       [ 3.2],
       [ 3.1],
       [ 3.6],
       [ 3.9],
       [ 3.4],
       [ 3.4],
       [ 2.9],
       [ 3.1],
       [ 3.7],
       [ 3.4],
       [ 3. ],
       [ 3. ],
       [ 4. ],
       [ 4.4],
       [ 3.9],
       [ 3.5],
       [ 3.8],
       [ 3.8],
       [ 3.4],
       [ 3.7],
       [ 3.6],
       [ 3.3],
       [ 3.4],
       [ 3. ],
       [ 3.4],
       [ 3.5],
       [ 3.4],
       [ 3.2],
       [ 3.1],
       [ 3.4],
       [ 4.1],
       [ 4.2],
       [ 3.1],
       [ 3.2],
       [ 3.5],
       [ 3.1],
       [ 3. ],
       [ 3.4],
       [ 3.5],
       [ 2.3],
       [ 3.2],
       [ 3.5],
       [ 3.8],
       [ 3. ],
       [ 3.8],
       [ 3.2],
       [ 3.7],
       [ 3.3],
       [ 3.2],
       [ 3.2],
       [ 3.1],
       [ 2.3],
       [ 2.8],
       [ 2.8],
       [ 3.3],
       [ 2.4],
       [ 2.9],
       [ 2.7],
       [ 2. ],
       [ 3. ],
       [ 2.2],
       [ 2.9],
       [ 2.9],
       [ 3.1],
       [ 3

In [109]:
iris_X_train = iris_X[:-20]
iris_X_test = iris_X[-20:]

In [111]:
iris_X_train

array([[ 3.5],
       [ 3. ],
       [ 3.2],
       [ 3.1],
       [ 3.6],
       [ 3.9],
       [ 3.4],
       [ 3.4],
       [ 2.9],
       [ 3.1],
       [ 3.7],
       [ 3.4],
       [ 3. ],
       [ 3. ],
       [ 4. ],
       [ 4.4],
       [ 3.9],
       [ 3.5],
       [ 3.8],
       [ 3.8],
       [ 3.4],
       [ 3.7],
       [ 3.6],
       [ 3.3],
       [ 3.4],
       [ 3. ],
       [ 3.4],
       [ 3.5],
       [ 3.4],
       [ 3.2],
       [ 3.1],
       [ 3.4],
       [ 4.1],
       [ 4.2],
       [ 3.1],
       [ 3.2],
       [ 3.5],
       [ 3.1],
       [ 3. ],
       [ 3.4],
       [ 3.5],
       [ 2.3],
       [ 3.2],
       [ 3.5],
       [ 3.8],
       [ 3. ],
       [ 3.8],
       [ 3.2],
       [ 3.7],
       [ 3.3],
       [ 3.2],
       [ 3.2],
       [ 3.1],
       [ 2.3],
       [ 2.8],
       [ 2.8],
       [ 3.3],
       [ 2.4],
       [ 2.9],
       [ 2.7],
       [ 2. ],
       [ 3. ],
       [ 2.2],
       [ 2.9],
       [ 2.9],
       [ 3.1],
       [ 3

In [112]:
iris_X_test

array([[ 2.8],
       [ 3.8],
       [ 2.8],
       [ 2.8],
       [ 2.6],
       [ 3. ],
       [ 3.4],
       [ 3.1],
       [ 3. ],
       [ 3.1],
       [ 3.1],
       [ 3.1],
       [ 2.7],
       [ 3.2],
       [ 3.3],
       [ 3. ],
       [ 2.5],
       [ 3. ],
       [ 3.4],
       [ 3. ]])

### Regression model with Statsmodels and without a constant:

> Put your code here

In [113]:
import statsmodels.api as sm

In [114]:
import pandas as pd

In [115]:
df = pd.DataFrame(iris.data, columns=iris.feature_names) 

In [116]:
df

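| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 |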


In [117]:
df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


In [119]:
X = df["sepal width (cm)"]
y = df["sepal length (cm)"]

In [120]:
model = sm.OLS(y, X).fit()

In [121]:
predictions = model.predict(X)

In [122]:
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | sepal length (cm) | R-squared: | 0.957 |
| Model: | OLS | Adj. R-squared: | 0.957 |
| Method: | Least Squares | F-statistic: | 3316.0 |
| Date: | Tue, 17 Oct 2017 | Prob (F-statistic): | 1.04e-103 |
| Time: | 23:06:03 | Log-Likelihood: | -243.13 |
| No. Observations: | 150 | AIC: | 488.3 |
| Df Residuals: | 149 | BIC: | 491.3 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| sepal width (cm) | 1.8717 | 0.033 | 57.585 | 0.000 | 1.807 | 1.936 |

| | | | |
|---|---|---|---|
| Omnibus: | 16.884 | Durbin-Watson: | 0.429 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 7.669 |
| Skew: | -0.336 | Prob(JB): | 0.0216 |
| Kurtosis: | 2.12 | Cond. No. | 1.0 |


### Interpreting the Table

> The coefficient of 1.8717 means that when sepal width (cm) increases by 1, the predicted sepal length (cm) increases by 1.8717. A few other important values are: R-squared, the share of variance the model explains (without a constant, statsmodels reports the uncentered R-squared, which measures variation around zero rather than around the mean of y, and this is why 0.957 looks so high); the standard error, which is the standard deviation of the sampling distribution of the coefficient estimate; and the t score and p-value for the hypothesis test that the coefficient is zero. Here sepal width (cm) has a statistically significant p-value (0.000). The 95% confidence interval for the sepal width (cm) coefficient runs from 1.807 to 1.936, meaning we estimate with 95% confidence that the true coefficient lies in that range.
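
The standard error and t score can be reproduced by hand. The following is a minimal sketch in plain Python, assuming only the ten rows displayed by `df` above rather than the full 150-row dataset, so its numbers differ from the summary table. The variable names are illustrative. It fits a line through the origin and derives the slope, its standard error, and the t score.

```python
import math

# First ten rows shown in the df output above (assumption: a small sample,
# not the full dataset, so results differ from the summary table).
sepal_width = [3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1]
sepal_length = [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9]

n = len(sepal_width)
sum_xx = sum(x * x for x in sepal_width)
sum_xy = sum(x * y for x, y in zip(sepal_width, sepal_length))

# OLS slope for a model without a constant: b = sum(x*y) / sum(x^2)
slope = sum_xy / sum_xx

# Residual sum of squares and residual variance
# (n - 1 degrees of freedom, because one parameter is estimated)
ssr = sum((y - slope * x) ** 2 for x, y in zip(sepal_width, sepal_length))
residual_variance = ssr / (n - 1)

# Standard error of the slope and the t score for H0: slope = 0
std_err = math.sqrt(residual_variance / sum_xx)
t_stat = slope / std_err

print(f"slope={slope:.4f}  std err={std_err:.4f}  t={t_stat:.2f}")
```

The t score is simply the coefficient divided by its standard error, which is why a small standard error relative to the coefficient yields a very small p-value.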

### Regression model with Statsmodels and with a constant:

> Put your code here

In [123]:
import statsmodels.api as sm

In [124]:
X = df["sepal width (cm)"]
y = df["sepal length (cm)"]

In [125]:
X = sm.add_constant(X)

In [126]:
model = sm.OLS(y, X).fit()

In [127]:
predictions = model.predict(X)

In [128]:
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | sepal length (cm) | R-squared: | 0.012 |
| Model: | OLS | Adj. R-squared: | 0.005 |
| Method: | Least Squares | F-statistic: | 1.792 |
| Date: | Tue, 17 Oct 2017 | Prob (F-statistic): | 0.183 |
| Time: | 23:06:10 | Log-Likelihood: | -183.14 |
| No. Observations: | 150 | AIC: | 370.3 |
| Df Residuals: | 148 | BIC: | 376.3 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 6.4812 | 0.481 | 13.466 | 0.000 | 5.530 | 7.432 |
| sepal width (cm) | -0.2089 | 0.156 | -1.339 | 0.183 | -0.517 | 0.099 |

| | | | |
|---|---|---|---|
| Omnibus: | 4.455 | Durbin-Watson: | 0.941 |
| Prob(Omnibus): | 0.108 | Jarque-Bera (JB): | 4.252 |
| Skew: | 0.356 | Prob(JB): | 0.119 |
| Kurtosis: | 2.585 | Cond. No. | 24.3 |


### Interpreting the Table 

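> With the constant term the coefficients are different. Without a constant we force the fitted line through the origin; now there is a y-intercept of 6.4812, and the slope of the sepal width (cm) predictor changes from 1.8717 to -0.2089.
> R-squared drops to 0.012, but the two values are not directly comparable: without a constant statsmodels reports the uncentered R-squared (variation around zero), while with a constant it reports the centered R-squared (variation around the mean of y). The model with a constant is the meaningful one here. It shows that sepal width alone explains about 1% of the variance in sepal length, and its slope is not statistically significant (p = 0.183), so sepal width is not a reliable predictor of sepal length.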
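
The effect of the two R-squared baselines can be checked with a few lines of arithmetic. This is a minimal sketch in plain Python, assuming only the ten rows displayed by `df` above (all Setosa), so the fitted values differ from the full-data summaries; the variable names are illustrative. It fits the line with and without a constant and computes the R-squared each kind of model reports.

```python
# First ten rows shown in the df output above (assumption: a small sample,
# not the full dataset, so results differ from the summary tables).
x = [3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1]  # sepal width (cm)
y = [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9]  # sepal length (cm)

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Fit without a constant (line forced through the origin)
slope_origin = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
ssr_origin = sum((b - slope_origin * a) ** 2 for a, b in zip(x, y))

# Fit with a constant (ordinary simple linear regression)
sxx = sum((a - mean_x) ** 2 for a in x)
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
slope_const = sxy / sxx
intercept = mean_y - slope_const * mean_x
ssr_const = sum((b - intercept - slope_const * a) ** 2 for a, b in zip(x, y))

# The two baselines: variation around zero vs. variation around the mean
sst_uncentered = sum(b * b for b in y)
sst_centered = sum((b - mean_y) ** 2 for b in y)

r2_uncentered = 1 - ssr_origin / sst_uncentered  # model without a constant
r2_centered = 1 - ssr_const / sst_centered       # model with a constant

print(f"no constant:   SSR={ssr_origin:.3f}  R2 (uncentered)={r2_uncentered:.3f}")
print(f"with constant: SSR={ssr_const:.3f}  R2 (centered)={r2_centered:.3f}")
```

In this sample the model with a constant has the smaller residual sum of squares, yet it reports the lower R-squared, because its baseline (the mean of y) is much harder to beat than a baseline of zero.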

# Linear Regression in SKLearn 

> Put your code here

In [137]:
from sklearn import linear_model

In [140]:
# Double brackets select a one-column DataFrame, the 2-D shape scikit-learn expects
X = df[["sepal width (cm)"]]
y = df["sepal length (cm)"]

In [141]:
lm = linear_model.LinearRegression()
model = lm.fit(X,y)

In [142]:
type(predictions)

pandas.core.series.Series

In [143]:
predictions = lm.predict(X)
print(predictions[0:5,])

[ 5.75017718  5.85461233  5.81283827  5.8337253   5.72929015]


In [144]:
print(predictions)

[ 5.75017718  5.85461233  5.81283827  5.8337253   5.72929015  5.66662906
  5.77106421  5.77106421  5.87549936  5.8337253   5.70840312  5.77106421
  5.85461233  5.85461233  5.64574204  5.56219392  5.66662906  5.75017718
  5.68751609  5.68751609  5.77106421  5.70840312  5.72929015  5.79195124
  5.77106421  5.85461233  5.77106421  5.75017718  5.77106421  5.81283827
  5.8337253   5.77106421  5.62485501  5.60396798  5.8337253   5.81283827
  5.75017718  5.8337253   5.85461233  5.77106421  5.75017718  6.00082154
  5.81283827  5.75017718  5.68751609  5.85461233  5.68751609  5.81283827
  5.70840312  5.79195124  5.81283827  5.81283827  5.8337253   6.00082154
  5.89638639  5.89638639  5.79195124  5.97993451  5.87549936  5.91727342
  6.06348262  5.85461233  6.02170856  5.87549936  5.87549936  5.8337253
  5.85461233  5.91727342  6.02170856  5.95904748  5.81283827  5.89638639
  5.95904748  5.89638639  5.87549936  5.85461233  5.89638639  5.85461233
  5.87549936  5.93816045  5.97993451  5.97993451  5.

In [145]:
lm.score(X,y) 

0.011961632834767699

In [146]:
lm.coef_ 

array([-0.20887029])

In [147]:
lm.intercept_

6.4812232114596053
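
The SKLearn intercept and coefficient match the Statsmodels fit with a constant (6.4812 and -0.2089), and `lm.score(X,y)` matches its R-squared of 0.012. As a quick sanity check, the sketch below recomputes the first five predictions from the values printed above, using prediction = intercept + coefficient × sepal width. All numbers are copied from the notebook outputs; the variable names are illustrative.

```python
# Values copied from the lm.intercept_ and lm.coef_ outputs above
intercept = 6.4812232114596053
coef = -0.20887029

# First five sepal widths (cm) from the iris_X output
sepal_widths = [3.5, 3.0, 3.2, 3.1, 3.6]

# First five values printed by lm.predict(X)
reported = [5.75017718, 5.85461233, 5.81283827, 5.8337253, 5.72929015]

# A fitted simple linear regression predicts intercept + coef * x
recomputed = [intercept + coef * w for w in sepal_widths]

for width, by_hand, from_model in zip(sepal_widths, recomputed, reported):
    print(f"width={width}: by hand={by_hand:.6f}  lm.predict={from_model:.6f}")
```

The recomputed values agree with the reported ones up to the rounding of the printed coefficient.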