# Subject: Classical Data Analysis

## Session 1 - Regression

### Individual assignment 1

Develop a regression analysis in Statsmodels (with and without a constant) and SKLearn, based on the Iris dataset that ships with sklearn. This dataset contains sepal and petal measurements (length and width, in cm) for three species of iris (Setosa, Versicolour, and Virginica).

See here for more information on this dataset: https://en.wikipedia.org/wiki/Iris_flower_data_set 

Use the field “sepal width (cm)” as the independent variable and the field “sepal length (cm)” as the dependent variable.

- Interpret and discuss the OLS Regression Results.
- Commit your scripts to your GitHub account. Export your solution code (.ipynb notebook) and push it to your repository “ClassicalDataAnalysis”.

The following are the tasks that you should complete and synchronize with your repository “ClassicalDataAnalysis” by October 13. Please note that none of these tasks is graded; however, it is important that you understand and complete them correctly so that you do not run into problems with later assignments.

# Linear Regression in Statsmodels

## Load the iris dataset

In [9]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
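
# Load the iris dataset bundled with scikit-learn and print its description
iris = datasets.load_iris()
print(iris.DESCR)

# Put the four measurements into a DataFrame with named columns
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df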

Iris Plants Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML iris d

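| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 |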


### Regression model with Statsmodels and without a constant:

In [12]:
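import statsmodels.api as sm

# NOTE: iris.target holds the class labels (0, 1, 2), not a measurement, so the
# column named "sepal length (cm)" here actually contains the species code.
# The measured sepal length is df["sepal length (cm)"].
target = pd.DataFrame(iris.target, columns=["sepal length (cm)"])
X = df["sepal width (cm)"]
y = target["sepal length (cm)"]

# Fit OLS without an intercept (no constant column is added to X)
model = sm.OLS(y, X).fit()
predictions = model.predict(X)
model.summary()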

OLS Regression Results

| | | | |
|---|---|---|---|
| Dep. Variable: | sepal length (cm) | R-squared: | 0.533 |
| Model: | OLS | Adj. R-squared: | 0.529 |
| Method: | Least Squares | F-statistic: | 169.8 |
| Date: | Tue, 17 Oct 2017 | Prob (F-statistic): | 2.19e-26 |
| Time: | 21:32:19 | Log-Likelihood: | -194.11 |
| No. Observations: | 150 | AIC: | 390.2 |
| Df Residuals: | 149 | BIC: | 393.2 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| sepal width (cm) | 0.3055 | 0.023 | 13.030 | 0.000 | 0.259 | 0.352 |

| | | | |
|---|---|---|---|
| Omnibus: | 408.899 | Durbin-Watson: | 0.04 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 13.753 |
| Skew: | -0.151 | Prob(JB): | 0.00103 |
| Kurtosis: | 1.548 | Cond. No. | 1.0 |


### Interpreting the Table 
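
One point worth checking when reading this table: for a model fitted without a constant, Statsmodels computes R-squared from the uncentered total sum of squares (taken around zero rather than around the mean of y), so the 0.533 reported here is not directly comparable with the R-squared of a model that includes a constant. The sketch below recomputes both variants with NumPy on the same X and y as the cell above (so y is `iris.target`, the class label). The variable names are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
x = iris.data[:, 1]              # sepal width (cm)
y = iris.target.astype(float)    # same y as the notebook cell (class labels)

# Least-squares slope for a line through the origin: b = sum(x*y) / sum(x*x)
slope = np.sum(x * y) / np.sum(x * x)
ssr = np.sum((y - slope * x) ** 2)           # residual sum of squares

tss_uncentered = np.sum(y ** 2)              # total sum of squares around zero
tss_centered = np.sum((y - y.mean()) ** 2)   # total sum of squares around the mean

r2_uncentered = 1 - ssr / tss_uncentered     # convention used when no constant is fitted
r2_centered = 1 - ssr / tss_centered         # convention used when a constant is fitted

print(f"slope          = {slope:.4f}")
print(f"uncentered R^2 = {r2_uncentered:.3f}")
print(f"centered R^2   = {r2_centered:.3f}")
```

The uncentered value should land close to the 0.533 in the table, while the centered value comes out negative: measured against the mean of y, the line through the origin fits worse than simply predicting the mean.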

### Regression model with Statsmodels and with a constant:

In [15]:
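import statsmodels.api as sm

X = df["sepal width (cm)"]
y = target["sepal length (cm)"]  # built from iris.target in the previous cell

# add_constant prepends a column of ones so that OLS also estimates an intercept
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
predictions = model.predict(X)
model.summary()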


OLS Regression Results

| | | | |
|---|---|---|---|
| Dep. Variable: | sepal length (cm) | R-squared: | 0.176 |
| Model: | OLS | Adj. R-squared: | 0.17 |
| Method: | Least Squares | F-statistic: | 31.6 |
| Date: | Tue, 17 Oct 2017 | Prob (F-statistic): | 9.16e-08 |
| Time: | 21:44:21 | Log-Likelihood: | -167.92 |
| No. Observations: | 150 | AIC: | 339.8 |
| Df Residuals: | 148 | BIC: | 345.9 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 3.4203 | 0.435 | 7.865 | 0.000 | 2.561 | 4.280 |
| sepal width (cm) | -0.7925 | 0.141 | -5.621 | 0.000 | -1.071 | -0.514 |

| | | | |
|---|---|---|---|
| Omnibus: | 38.964 | Durbin-Watson: | 0.272 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 10.165 |
| Skew: | 0.329 | Prob(JB): | 0.0062 |
| Kurtosis: | 1.907 | Cond. No. | 24.3 |


### Interpreting the Table 
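
The columns of the coefficient table are linked by simple arithmetic: t is the coefficient divided by its standard error, and the 95% interval is the coefficient plus or minus a critical t value times the standard error. As a rough check, the sketch below redoes this for the `sepal width (cm)` row, using the rounded values printed in the table and `scipy.stats.t` with the 148 residual degrees of freedom. Because the inputs are rounded, the results should agree with the table only to about two or three decimals.

```python
from scipy import stats

# Rounded values copied from the "sepal width (cm)" row of the table
coef = -0.7925
std_err = 0.141
df_resid = 148                      # "Df Residuals" in the summary

t_stat = coef / std_err                          # t column
t_crit = stats.t.ppf(0.975, df_resid)            # two-sided 95% critical value
ci_low = coef - t_crit * std_err                 # [0.025 column
ci_high = coef + t_crit * std_err                # 0.975] column
p_value = 2 * stats.t.sf(abs(t_stat), df_resid)  # two-sided P>|t|

print(f"t = {t_stat:.3f}")
print(f"95% CI = ({ci_low:.3f}, {ci_high:.3f})")
print(f"p = {p_value:.2e}")
```

The interval excludes zero, which is the same statement as the p-value being far below 0.05: for this y, the slope on sepal width is statistically distinguishable from zero.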

# Linear Regression in SKLearn 

In [16]:
from sklearn import linear_model
from sklearn import datasets 
iris = datasets.load_iris() 
df = pd.DataFrame(iris.data, columns=iris.feature_names)


In [18]:
df

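| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 |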


In [31]:
df2 = pd.DataFrame(df, columns=["sepal width (cm)"])

In [32]:
# NOTE: iris.target holds the class labels (0, 1, 2), not the measured sepal length
target = pd.DataFrame(iris.target, columns=["sepal length (cm)"])

In [33]:
X = df2
y = target["sepal length (cm)"]

In [34]:
lm = linear_model.LinearRegression()
model = lm.fit(X,y)

In [35]:
type(predictions)

numpy.ndarray

In [36]:
predictions = lm.predict(X)
print(predictions[:5])

[ 0.64654477  1.04279503  0.88429492  0.96354498  0.56729472]
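
Each prediction is the fitted intercept plus the fitted slope times the sepal width of that row. The small sketch below refits the same model as the cells above (same X and y, including the use of `iris.target` as y) and rebuilds the predictions by hand; the name `manual` is illustrative.

```python
import numpy as np
from sklearn import datasets, linear_model

iris = datasets.load_iris()
X = iris.data[:, [1]]   # sepal width (cm), kept 2-D as scikit-learn expects
y = iris.target         # same y as the notebook cells (class labels)

lm = linear_model.LinearRegression().fit(X, y)

# prediction_i = intercept + slope * x_i
manual = lm.intercept_ + lm.coef_[0] * X[:, 0]

print(manual[:5])
print(np.allclose(manual, lm.predict(X)))
```

For the first row (sepal width 3.5), with the coefficients reported by the notebook, this is 3.4203 - 0.7925 × 3.5 ≈ 0.6465, which matches the first printed prediction.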


In [37]:
print(predictions)

[ 0.64654477  1.04279503  0.88429492  0.96354498  0.56729472  0.32954456
  0.72579482  0.72579482  1.12204508  0.96354498  0.48804467  0.72579482
  1.04279503  1.04279503  0.25029451 -0.0667057   0.32954456  0.64654477
  0.40879461  0.40879461  0.72579482  0.48804467  0.56729472  0.80504487
  0.72579482  1.04279503  0.72579482  0.64654477  0.72579482  0.88429492
  0.96354498  0.72579482  0.17104446  0.09179441  0.96354498  0.88429492
  0.64654477  0.96354498  1.04279503  0.72579482  0.64654477  1.59754539
  0.88429492  0.64654477  0.40879461  1.04279503  0.40879461  0.88429492
  0.48804467  0.80504487  0.88429492  0.88429492  0.96354498  1.59754539
  1.20129513  1.20129513  0.80504487  1.51829534  1.12204508  1.28054518
  1.83529555  1.04279503  1.67679544  1.12204508  1.12204508  0.96354498
  1.04279503  1.28054518  1.67679544  1.43904529  0.88429492  1.20129513
  1.43904529  1.20129513  1.12204508  1.04279503  1.20129513  1.04279503
  1.12204508  1.35979524  1.51829534  1.51829534  1

In [38]:
lm.score(X,y) 

0.17593511491257519
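
`lm.score(X, y)` returns the coefficient of determination R², which is why it agrees with the R-squared of the Statsmodels fit with a constant (0.176). A minimal sketch that recomputes it from the residuals, using the same X and y as the cells above (variable names are illustrative):

```python
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import r2_score

iris = datasets.load_iris()
X = iris.data[:, [1]]   # sepal width (cm)
y = iris.target         # same y as the notebook cells (class labels)

lm = linear_model.LinearRegression().fit(X, y)
pred = lm.predict(X)

ss_res = np.sum((y - pred) ** 2)        # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)    # total sum of squares around the mean
r2_manual = 1 - ss_res / ss_tot

# All three values should coincide up to floating-point error
print(r2_manual, r2_score(y, pred), lm.score(X, y))
```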

In [39]:
lm.coef_ 

array([-0.79250052])

In [40]:
lm.intercept_ 

3.4202965808243428
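
Note that the cells above build `y` from `iris.target`, which holds the class labels (0, 1, 2), so the fitted models predict the species code rather than the sepal length. The sketch below shows one possible way to run the regression described in the assignment, with the measured `sepal length (cm)` column as the dependent variable, both with and without a constant. It assumes the same `load_iris` data, uses illustrative variable names (`with_const`, `no_const`), and is not the notebook's original code.

```python
import pandas as pd
from sklearn import datasets, linear_model

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

X = df[["sepal width (cm)"]]     # independent variable (2-D)
y = df["sepal length (cm)"]      # dependent variable: the measured sepal length

# With a constant (scikit-learn fits an intercept by default)
with_const = linear_model.LinearRegression().fit(X, y)

# Without a constant: the line is forced through the origin
no_const = linear_model.LinearRegression(fit_intercept=False).fit(X, y)

print("with constant:   ", with_const.intercept_, with_const.coef_[0],
      with_const.score(X, y))
print("without constant:", no_const.intercept_, no_const.coef_[0])
```

With all three species pooled, the slope with a constant should come out weakly negative and the R² close to zero, so these results would need their own interpretation rather than reusing the one for the tables above.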