# Subject: Classical Data Analysis

## Session 1 - Regression

### Individual assignment 1

Develop a regression analysis in Statsmodels (with and without a constant) and SKLearn, based on the Iris dataset shipped with sklearn. This dataset contains sepal and petal measurements (length and width) for three iris species (Setosa, Versicolour, and Virginica).

See here for more information on this dataset: https://en.wikipedia.org/wiki/Iris_flower_data_set 

Use the field “sepal width (cm)” as independent variable and the field “sepal length (cm)” as dependent variable.

- Interpret and discuss the OLS Regression Results.
- Commit your scripts to your GitHub account. Export your solution code (.ipynb notebook) and push it to your repository “ClassicalDataAnalysis”.

The following tasks should be completed and synchronized with your repository “ClassicalDataAnalysis” by October 13. Please note that none of these tasks is graded; however, it is important that you understand and complete them correctly so that you do not run into problems with later assignments.

# Linear Regression in Statsmodels

## Load the iris dataset

In [42]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn import datasets
import numpy as np
import statsmodels.api as sm

In [43]:
iris = datasets.load_iris()
iris.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [44]:
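# Column 0 is sepal length (dependent variable); column 1 is sepal width (independent variable).
# np.newaxis keeps each selection as a 2-D column of shape (150, 1).
iris_y = iris.data[:, np.newaxis, 0]
iris_x = iris.data[:, np.newaxis, 1]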
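Selecting the columns by name rather than by position makes the choice of variables explicit. The following is a minimal sketch that assumes only `load_iris` and the `feature_names` list shown above; `x_col` and `y_col` are illustrative names that do not appear in the original notebook.

```python
from sklearn.datasets import load_iris

iris = load_iris()
names = list(iris.feature_names)

# Look up the column positions from the feature names instead of hard-coding 0 and 1.
x_col = names.index("sepal width (cm)")   # independent variable
y_col = names.index("sepal length (cm)")  # dependent variable

# Indexing with a one-element list keeps a 2-D column of shape (n_samples, 1).
iris_x = iris.data[:, [x_col]]
iris_y = iris.data[:, [y_col]]
print(iris_x.shape, iris_y.shape)
```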

### Regression model with Statsmodels and without a constant:

In [45]:
model = sm.OLS(iris_y, iris_x).fit()

In [46]:
predictions = model.predict(iris_x)

### Interpreting the Table 

In [47]:
model.summary()
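# The reported R-squared is 0.957. For a model without a constant, statsmodels reports the
# uncentered R-squared, which compares the fit against predicting zero rather than against
# the mean of sepal length. It cannot be compared with the R-squared of a model that has a
# constant, and on its own it does not show that sepal width predicts sepal length well.
# The coefficient is 1.8717: with the line forced through the origin, the predicted sepal
# length is 1.8717 cm per 1 cm of sepal width. The p-value (0.000) rejects the null
# hypothesis that this coefficient is zero.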

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.957 |
| Model: | OLS | Adj. R-squared: | 0.957 |
| Method: | Least Squares | F-statistic: | 3316.0 |
| Date: | Fri, 13 Oct 2017 | Prob (F-statistic): | 1.04e-103 |
| Time: | 20:46:45 | Log-Likelihood: | -243.13 |
| No. Observations: | 150 | AIC: | 488.3 |
| Df Residuals: | 149 | BIC: | 491.3 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| x1 | 1.8717 | 0.033 | 57.585 | 0.000 | 1.807 | 1.936 |

| | | | |
|---|---|---|---|
| Omnibus: | 16.884 | Durbin-Watson: | 0.429 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 7.669 |
| Skew: | -0.336 | Prob(JB): | 0.0216 |
| Kurtosis: | 2.12 | Cond. No. | 1.0 |
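The R-squared of a model without a constant is measured against a different baseline than usual, which a hand calculation makes visible. The sketch below recomputes the through-origin slope with NumPy and derives both the uncentered and the centered R-squared. The variable names are illustrative, and the printed figures may differ slightly from the table above if the installed scikit-learn ships a revised copy of the dataset.

```python
import numpy as np
from sklearn.datasets import load_iris

data = load_iris().data
y = data[:, 0]  # sepal length (cm)
x = data[:, 1]  # sepal width (cm)

# Least-squares slope for a line forced through the origin: sum(x*y) / sum(x^2).
slope = np.sum(x * y) / np.sum(x ** 2)
ssr = np.sum((y - slope * x) ** 2)  # sum of squared residuals

# Uncentered R-squared: the baseline is predicting 0 for every flower.
r2_uncentered = 1.0 - ssr / np.sum(y ** 2)
# Centered R-squared: the baseline is predicting the mean sepal length.
r2_centered = 1.0 - ssr / np.sum((y - y.mean()) ** 2)

print(f"slope={slope:.4f}  uncentered R2={r2_uncentered:.3f}  centered R2={r2_centered:.3f}")
```

A negative centered value means the line through the origin fits worse than simply predicting the mean, even though the uncentered value is high.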


### Regression model with Statsmodels and with a constant:

In [48]:
iris_x = sm.add_constant(iris_x) 

In [49]:
model = sm.OLS(iris_y, iris_x).fit()

In [50]:
predictions = model.predict(iris_x)

### Interpreting the Table 

In [51]:
model.summary()
# The R-squared in this case is 0.012, which means the model explains only about 1% of the
# variance in sepal length. The p-value of the slope (0.183) is above 0.05, so we fail to
# reject the null hypothesis that sepal length does not depend linearly on sepal width.

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.012 |
| Model: | OLS | Adj. R-squared: | 0.005 |
| Method: | Least Squares | F-statistic: | 1.792 |
| Date: | Fri, 13 Oct 2017 | Prob (F-statistic): | 0.183 |
| Time: | 20:46:48 | Log-Likelihood: | -183.14 |
| No. Observations: | 150 | AIC: | 370.3 |
| Df Residuals: | 148 | BIC: | 376.3 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 6.4812 | 0.481 | 13.466 | 0.000 | 5.530 | 7.432 |
| x1 | -0.2089 | 0.156 | -1.339 | 0.183 | -0.517 | 0.099 |

| | | | |
|---|---|---|---|
| Omnibus: | 4.455 | Durbin-Watson: | 0.941 |
| Prob(Omnibus): | 0.108 | Jarque-Bera (JB): | 4.252 |
| Skew: | 0.356 | Prob(JB): | 0.119 |
| Kurtosis: | 2.585 | Cond. No. | 24.3 |
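The `coef`, `std err`, and `t` columns of a one-predictor model with a constant can be reproduced with the closed-form formulas for simple linear regression. The following sketch uses NumPy only for the arithmetic; the variable names are illustrative, and the printed values may differ slightly from the table above depending on the copy of the dataset bundled with the installed scikit-learn.

```python
import numpy as np
from sklearn.datasets import load_iris

data = load_iris().data
y = data[:, 0]  # sepal length (cm)
x = data[:, 1]  # sepal width (cm)
n = len(x)

# Closed-form simple regression: slope = Sxy / Sxx, intercept from the means.
sxx = np.sum((x - x.mean()) ** 2)
sxy = np.sum((x - x.mean()) * (y - y.mean()))
slope = sxy / sxx
intercept = y.mean() - slope * x.mean()

# Residual variance uses n - 2 degrees of freedom (two estimated parameters).
residuals = y - (intercept + slope * x)
ssr = np.sum(residuals ** 2)
std_err = np.sqrt(ssr / (n - 2) / sxx)
t_stat = slope / std_err

# With an intercept, R-squared equals the squared correlation between x and y.
r2 = 1.0 - ssr / np.sum((y - y.mean()) ** 2)

print(f"const={intercept:.4f}  x1={slope:.4f}  std err={std_err:.3f}  t={t_stat:.3f}  R2={r2:.3f}")
```

An absolute t value below roughly 1.98, the two-sided 5% critical value for 148 degrees of freedom, corresponds to a p-value above 0.05, which is why the null hypothesis for the slope is not rejected.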


# Linear Regression in SKLearn 

In [52]:
l = linear_model.LinearRegression()

In [53]:
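# iris_x still holds the constant column added by sm.add_constant above.
# LinearRegression fits its own intercept, so that column gets a coefficient of 0.
model = l.fit(iris_x, iris_y)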

In [54]:
model.coef_

array([[ 0.        , -0.20887029]])

In [55]:
model.intercept_

array([ 6.48122321])

In [56]:
model.score(iris_x, iris_y)

0.011961632834767809
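Because `LinearRegression` fits an intercept by default, it can be given the sepal width column on its own, without the constant column that statsmodels needs. The sketch below shows that variant together with `fit_intercept=False`, which corresponds to the statsmodels model without a constant. The names `with_const` and `no_const` are illustrative, and the output is not shown because it can vary slightly with the scikit-learn version.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

data = load_iris().data
X = data[:, [1]]  # sepal width (cm), shape (150, 1), no constant column
y = data[:, 0]    # sepal length (cm)

# Default fit_intercept=True: comparable to the statsmodels model with a constant.
with_const = LinearRegression().fit(X, y)
pred = with_const.predict(X)
print("slope:", with_const.coef_[0], "intercept:", with_const.intercept_)
print("R2:", r2_score(y, pred), "MSE:", mean_squared_error(y, pred))

# fit_intercept=False: comparable to the statsmodels model without a constant.
no_const = LinearRegression(fit_intercept=False).fit(X, y)
print("slope through origin:", no_const.coef_[0])
```

Note that `score` and `r2_score` always compute the centered R-squared, so for the model without an intercept they will not match the uncentered value in the statsmodels summary.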