## Statistical Model 
- A mathematical equation that describes the relationship between the dependent variable (Y) and the independent variable(s) (X).
          
          Y = f(X)
          
- Because real observations contain uncertainty and noise, an error term e is added to the equation:
            
          Y = f(X) + e
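
The role of the error term can be illustrated with simulated data. This is only a sketch: the function `f`, its coefficients, and the noise scale are made-up values chosen for illustration, not something taken from a real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable


def f(x):
    # Hypothetical "true" relationship, chosen only for illustration
    return 2.0 + 3.0 * x


X = np.linspace(0.0, 1.0, 50)
e = rng.normal(loc=0.0, scale=0.1, size=X.shape)  # random noise term
Y = f(X) + e  # what we would actually observe

# The gap between the observations and f(X) is exactly the noise term
print(np.allclose(Y - f(X), e))
print(Y.shape)
```

In practice f and e are unknown, and fitting a model means estimating f from the noisy Y.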
          
          


## Linear Models
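- A model is linear when it is linear in the coefficients θ, even if the terms themselves are non-linear functions of X. All of the following are linear models:
- Y = θ0 + θ1 * X1 + θ2 * X2 + ... + θn * Xn
- Y = θ0 + θ1 * X + θ2 * X^2 + ... + θn * X^n
- Y = θ0 + θ1 * sin(X1) + θ2 * cos(X2)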
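
To see why the sin/cos form still counts as linear, the sketch below builds the columns 1, sin(X1), cos(X2) and recovers the coefficients with an ordinary linear least-squares solve. The coefficient values and input ranges are assumptions made up for the example, and the data is noise-free so the recovery is exact up to rounding.

```python
import numpy as np

x1 = np.linspace(0.0, 2.0 * np.pi, 40)
x2 = np.linspace(0.0, np.pi, 40)

theta_true = np.array([1.0, 2.0, -0.5])  # illustrative coefficients

# Each column is a (possibly non-linear) function of the inputs,
# but Y is a plain linear combination of the columns.
A = np.column_stack([np.ones_like(x1), np.sin(x1), np.cos(x2)])
Y = A @ theta_true  # noise-free response

# A linear solver is enough to recover theta
theta_hat, *_ = np.linalg.lstsq(A, Y, rcond=None)
print(np.round(theta_hat, 6))
```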

## Linear Regression

    Y = θ0 + θ1 * X + e

    θ0, θ1 = coefficients (intercept and slope)
    e = normally distributed residual error
    
- The linear regression model assumes that the residuals are independent and normally distributed.
- The model is fitted to the data using the ordinary least squares (OLS) approach, which chooses the coefficients that minimize the sum of squared residuals.
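
For the single-variable case the least-squares solution has a closed form: θ1 = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² and θ0 = ȳ - θ1 * x̄. The sketch below applies it to simulated data; the true coefficients (1.5 and 0.8) and the noise level are assumptions picked for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 100)
y = 1.5 + 0.8 * x + rng.normal(scale=0.5, size=x.size)  # simulated data

# Closed-form ordinary least squares for one regressor
x_mean, y_mean = x.mean(), y.mean()
theta1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
theta0 = y_mean - theta1 * x_mean

residuals = y - (theta0 + theta1 * x)

print(theta0, theta1)
print(residuals.sum())  # numerically zero when an intercept is fitted
```

The estimates land near the simulated values but not exactly on them, because the noise term is part of the data.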

## Non-Linear Models
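- In many cases, a non-linear model can be handled as a generalized linear model (GLM), in which a link function relates the mean of the response to a linear combination of the predictors.
- Examples: Binomial Regression, Poisson Regression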
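
The idea of a link function can be sketched with the log link used in Poisson regression: the mean of the response is exp(θ0 + θ1 * x), which is non-linear in x, but its logarithm is linear. The coefficients below are illustrative assumptions, and the example works on the noise-free mean only. A real GLM fit estimates the coefficients by maximum likelihood on the observed counts rather than by regressing on the log of the data.

```python
import numpy as np

x = np.linspace(0.0, 5.0, 20)
theta0, theta1 = 0.5, 0.3  # illustrative coefficients

# Mean of the response: non-linear in x
mu = np.exp(theta0 + theta1 * x)

# Applying the log link gives the linear predictor eta = theta0 + theta1 * x
eta = np.log(mu)

A = np.column_stack([np.ones_like(x), x])
theta_hat, *_ = np.linalg.lstsq(A, eta, rcond=None)
print(np.round(theta_hat, 6))
```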


## Design Matrices 
- Once the model is chosen, the design matrices are constructed:

          Y = XB + e

| Variable | Description | 
| -------- | ----------- |
| Y | vector/matrix of the dependent variable | 
| X | vector/matrix of the independent variables (one column per model term) | 
| B | vector/matrix of coefficients | 
| e | vector of residual errors | 
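
With the model in matrix form, the least-squares estimate of B solves the normal equations (XᵀX) B = XᵀY. The sketch below does this on simulated data and compares the result with `np.linalg.lstsq`; the coefficient values, sample size, and noise level are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x1 = rng.uniform(0.0, 1.0, n)
x2 = rng.uniform(0.0, 1.0, n)

# Design matrix: a column of ones for the intercept, then one column per variable
X = np.column_stack([np.ones(n), x1, x2])
B_true = np.array([1.0, 2.0, -3.0])  # illustrative coefficients
Y = X @ B_true + rng.normal(scale=0.1, size=n)

# Solve the normal equations (X^T X) B = X^T Y
B_normal = np.linalg.solve(X.T @ X, X.T @ Y)

# Same estimate from the library least-squares routine
B_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(np.round(B_normal, 3))
print(np.round(B_lstsq, 3))
```

Solving the normal equations directly is fine for a small, well-conditioned X like this one; `lstsq` is generally the more numerically robust choice.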


## Creating a Model 

### Using the statsmodels library
- OLS (ordinary least squares)
- GLM (generalized linear model)
- WLS (weighted least squares)
- ols
- glm 
- wls

        Uppercase names (statsmodels.api) take design matrices as arguments 
        Lowercase names (statsmodels.formula.api) take Patsy formulas and DataFrames as arguments 
        
## Fitting a Model 
- The fit method returns a results object that holds the estimated coefficients and provides further methods and attributes for analysis.

## View Model Summary
- The summary method of the fitted results returns a text description of the fit (coefficients, R-squared, F-statistic, AIC/BIC and so on).
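
Two of the headline numbers in a summary, R-squared and adjusted R-squared, can be reproduced by hand. The sketch below uses NumPy on simulated data (the coefficients and noise level are assumptions) and the standard definitions R² = 1 - SS_res / SS_tot and adjusted R² = 1 - (1 - R²)(n - 1) / (n - k - 1), where k is the number of regressors excluding the intercept.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 30
x = rng.uniform(0.0, 10.0, n)
y = 0.5 + 0.2 * x + rng.normal(scale=0.3, size=n)  # simulated data

# Fit y = theta0 + theta1 * x by least squares
X = np.column_stack([np.ones(n), x])
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ theta

ss_res = np.sum(resid ** 2)            # unexplained variation
ss_tot = np.sum((y - y.mean()) ** 2)   # total variation around the mean
k = X.shape[1] - 1                     # regressors excluding the intercept

r2 = 1.0 - ss_res / ss_tot
adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

print(r2, adj_r2)
```

Adjusted R-squared penalizes extra regressors, which is why it is never larger than R-squared.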

<hr>


## Construct Design Matrices 

Y = θ0 + θ1 * X1 + θ2 * X2 + θ3 * X1 * X2

## Design Matrix with Numpy

In [32]:
import numpy as np 

Y = np.array([1,2,3,4,5]).reshape(-1,1)

x1 = np.array([6,7,8,9,10])
x2 = np.array([11,12,13,14,15])

X = np.vstack([np.ones(5), x1, x2, x1*x2]).T 

print(Y)
print(X)

[[1]
 [2]
 [3]
 [4]
 [5]]
[[  1.   6.  11.  66.]
 [  1.   7.  12.  84.]
 [  1.   8.  13. 104.]
 [  1.   9.  14. 126.]
 [  1.  10.  15. 150.]]
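
One property of this toy data is worth checking before fitting anything with it: every x2 value equals the matching x1 value plus 5, so the columns of X are linearly dependent. The sketch below confirms this with `np.linalg.matrix_rank`; the explicit tolerance of 1e-8 is an arbitrary choice for the example.

```python
import numpy as np

x1 = np.array([6, 7, 8, 9, 10])
x2 = np.array([11, 12, 13, 14, 15])

X = np.vstack([np.ones(5), x1, x2, x1 * x2]).T

# x2 is x1 shifted by a constant, so it is a combination of the
# intercept column and the x1 column.
print(np.array_equal(x2, x1 + 5))

# The matrix has 4 columns but fewer independent directions
rank = np.linalg.matrix_rank(X, tol=1e-8)
print(X.shape[1], rank)
```

When the rank is lower than the number of columns, the least-squares coefficients are not unique, so this matrix is fine for demonstrating construction but not for estimating all four θ values.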


## Design Matrix with patsy 

- allows a model to be defined with a short formula string
- constructs the relevant design matrices (patsy.dmatrices)
- takes a formula in string form and a dictionary-like object holding the data arrays for the response and explanatory variables 

![image](https://patsy.readthedocs.io/en/v0.1.0/_images/formula-structure.png)

                             ~
                          /    \ 
                         Y     +
                             /   \
                            1     +
                                /   \
                               x1    +
                                   /   \
                                 x2     *
                                      /   \
                                    x1    x2
                                    
                                    
- 'y ~ np.log(x1)': NumPy functions can be used to transform terms in the expression.
- 'y ~ I(x1 + x2)': I is the identity function; it escapes an arithmetic expression so that it is evaluated as ordinary arithmetic (here x1 + x2 becomes a single summed term) instead of being read as formula syntax.
- 'y ~ C(x1)': Treats the variable x1 as a categorical variable.
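
To make these three forms concrete, the sketch below builds by hand, in plain NumPy, the kind of design-matrix columns each formula describes. It does not call Patsy. The `group` array is a hypothetical categorical variable added for the example, and the categorical case assumes treatment (dummy) coding with the first level as the reference, which is Patsy's default when an intercept is present.

```python
import numpy as np

x1 = np.array([6, 7, 8, 9, 10])
x2 = np.array([11, 12, 13, 14, 15])
group = np.array(["a", "b", "a", "c", "b"])  # hypothetical categorical variable

intercept = np.ones(len(x1))

# 'y ~ np.log(x1)': intercept plus one transformed column
X_log = np.column_stack([intercept, np.log(x1)])

# 'y ~ I(x1 + x2)': the sum is evaluated first, giving a single column
X_sum = np.column_stack([intercept, x1 + x2])

# 'y ~ C(group)': one 0/1 indicator column per level except the
# first (reference) level
levels = np.unique(group)  # sorted levels: a, b, c
dummies = (group[:, None] == levels[None, 1:]).astype(float)
X_cat = np.column_stack([intercept, dummies])

print(X_log.shape, X_sum.shape, X_cat.shape)
print(X_cat)
```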

In [33]:
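import numpy as np
import patsy 

y = np.array([1, 2, 3, 4, 5])
x1 = np.array([6, 7, 8, 9, 10])
x2 = np.array([11, 12, 13, 14, 15])
# keys must match the names used in the formula
data = {
    'Y' : y,
    'x1' : x1,
    'x2' : x2,
}

# in Patsy, x1*x2 is shorthand for x1 + x2 + x1:x2; repeated terms are only included once
equation = 'Y ~ 1 + x1 + x2 + x1*x2'

# returns the response matrix Y and the design matrix X
Y, X = patsy.dmatrices(equation, data)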

print(Y)
print(X)

[[1.]
 [2.]
 [3.]
 [4.]
 [5.]]
[[  1.   6.  11.  66.]
 [  1.   7.  12.  84.]
 [  1.   8.  13. 104.]
 [  1.   9.  14. 126.]
 [  1.  10.  15. 150.]]


### Load popular datasets from statsmodels 

In [34]:
import statsmodels.api as sm 
dataset = sm.datasets.cancer.load()
# dataset = sm.datasets.cancer.load_pandas()
dataset

<class 'statsmodels.datasets.utils.Dataset'>

<hr/> 

## Linear Model Creation using statsmodels
### Using the <b> Icecream </b> dataset (fetched from the R package Ecdat with get_rdataset)

In [39]:
import statsmodels.api as sm

icecream = sm.datasets.get_rdataset("Icecream","Ecdat")

dataset = icecream.data
dataset.head()

| | cons | income | price | temp |
| - | ---- | ------ | ----- | ---- |
| 0 | 0.386 | 78 | 0.27 | 41 |
| 1 | 0.374 | 79 | 0.282 | 56 |
| 2 | 0.393 | 81 | 0.277 | 63 |
| 3 | 0.425 | 80 | 0.28 | 68 |
| 4 | 0.406 | 76 | 0.272 | 69 |


In [36]:
import statsmodels.formula.api as smf 

linearModel1 = smf.ols('cons ~ price + temp',dataset)

fitModel1 = linearModel1.fit()

print(fitModel1.summary())

                            OLS Regression Results                            
Dep. Variable:                   cons   R-squared:                       0.633
Model:                            OLS   Adj. R-squared:                  0.606
Method:                 Least Squares   F-statistic:                     23.27
Date:                Thu, 15 Oct 2020   Prob (F-statistic):           1.34e-06
Time:                        01:54:07   Log-Likelihood:                 54.607
No. Observations:                  30   AIC:                            -103.2
Df Residuals:                      27   BIC:                            -99.01
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.5966      0.258      2.309      0.0

In [37]:
linearModel2 = smf.ols('cons ~ income + temp',dataset)

fitModel2 = linearModel2.fit()

print(fitModel2.summary())

                            OLS Regression Results                            
Dep. Variable:                   cons   R-squared:                       0.702
Model:                            OLS   Adj. R-squared:                  0.680
Method:                 Least Squares   F-statistic:                     31.81
Date:                Thu, 15 Oct 2020   Prob (F-statistic):           7.96e-08
Time:                        01:54:07   Log-Likelihood:                 57.742
No. Observations:                  30   AIC:                            -109.5
Df Residuals:                      27   BIC:                            -105.3
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.1132      0.108     -1.045      0.3

In [38]:
linearModel3 = smf.ols('cons ~ -1 + income + temp',dataset)

fitModel3 = linearModel3.fit()

print(fitModel3.summary())

                                 OLS Regression Results                                
Dep. Variable:                   cons   R-squared (uncentered):                   0.990
Model:                            OLS   Adj. R-squared (uncentered):              0.990
Method:                 Least Squares   F-statistic:                              1426.
Date:                Thu, 15 Oct 2020   Prob (F-statistic):                    6.77e-29
Time:                        01:54:07   Log-Likelihood:                          57.146
No. Observations:                  30   AIC:                                     -110.3
Df Residuals:                      28   BIC:                                     -107.5
Df Model:                           2                                                  
Covariance Type:            nonrobust                                                  
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------
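
A note on the third model: removing the intercept with -1 makes statsmodels report an uncentered R-squared, as the header of that summary shows. The uncentered version measures the fit against a baseline of predicting zero rather than predicting the mean of the response, so the value 0.990 is not comparable with the 0.633 and 0.702 of the first two models. The sketch below shows the difference on simulated data; the numbers are assumptions chosen so that the regressor sits far from zero, and they are not the Icecream values.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 30
x = rng.uniform(70.0, 90.0, n)  # regressor far from zero
y = 0.35 + 0.001 * x + rng.normal(scale=0.05, size=n)

# Fit without an intercept: y = theta * x
X = x.reshape(-1, 1)
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ theta
ss_res = np.sum(resid ** 2)

# Uncentered: baseline is "predict 0"
r2_uncentered = 1.0 - ss_res / np.sum(y ** 2)

# Centered: baseline is "predict mean(y)"
r2_centered = 1.0 - ss_res / np.sum((y - y.mean()) ** 2)

print(r2_uncentered, r2_centered)
```

For comparing models with and without an intercept, criteria that are defined the same way for both, such as the AIC and BIC values in the summaries, are a safer guide than R-squared.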