# Discussion 2

Jupyter has lots of cool features, in this discussion, we are get familiar with Python language.

Please check this [post](http://arogozhnikov.github.io/2016/09/10/jupyter-features.html?utm_content=bufferb0c6b&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer) about features of Jupyter notebook. You can download from [here](https://github.com/arogozhnikov/arogozhnikov.github.io/tree/master/notebooks)

## Library or Module

In R we call `library` package, in python we call `import` module

| R                            | Python                            |
|------------------------------|-----------------------------------|
| install.packages()           | pip install (bash)                |
| library(ggplot2)             | import pandas import pandas as pd |
| ggplot2::ggplot              | from pandas import DataFrame      |
| source('script.R')           | import script                     |
| ...                          | ...                               |


The more you tried, you more you learn.


In [2]:
import numpy as np

ModuleNotFoundError: No module named 'numpy'

In the above cell, the error shows no module namsed 'numpy', similary to R, if you did not download `glmnet` you can not `library` it.

Jupyter notebook has magics, we can use pip inside 

In [3]:
%lsmagic

Available line magics:
%alias  %alias_magic  %autoawait  %autocall  %automagic  %autosave  %bookmark  %cat  %cd  %clear  %colors  %config  %connect_info  %cp  %debug  %dhist  %dirs  %doctest_mode  %ed  %edit  %env  %gui  %hist  %history  %killbgscripts  %ldir  %less  %lf  %lk  %ll  %load  %load_ext  %loadpy  %logoff  %logon  %logstart  %logstate  %logstop  %ls  %lsmagic  %lx  %macro  %magic  %man  %matplotlib  %mkdir  %more  %mv  %notebook  %page  %pastebin  %pdb  %pdef  %pdoc  %pfile  %pinfo  %pinfo2  %popd  %pprint  %precision  %prun  %psearch  %psource  %pushd  %pwd  %pycat  %pylab  %qtconsole  %quickref  %recall  %rehashx  %reload_ext  %rep  %rerun  %reset  %reset_selective  %rm  %rmdir  %run  %save  %sc  %set_env  %store  %sx  %system  %tb  %time  %timeit  %unalias  %unload_ext  %who  %who_ls  %whos  %xdel  %xmode

Available cell magics:
%%!  %%HTML  %%SVG  %%bash  %%capture  %%debug  %%file  %%html  %%javascript  %%js  %%latex  %%markdown  %%perl  %%prun  %%pypy  %%python  %%pyth

In [4]:
!pip install numpy

Collecting numpy
[?25l  Downloading https://files.pythonhosted.org/packages/83/0d/1dd2f96eff7f5df22166066f7dbd213428d46f78f8ed9dea8345ca1a1f51/numpy-1.16.0-cp37-cp37m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (13.9MB)
[K    100% |████████████████████████████████| 13.9MB 1.8MB/s 
[?25hInstalling collected packages: numpy
Successfully installed numpy-1.16.0


In [8]:
import numpy as np # here np is a short name for numpy, you can choose any name you like

# Linear regression

$$y \sim N(X\beta, \sigma^2) $$
$$\hat{\beta} = argmin_{\beta} ||Y - X\beta||^2$$

**Questions**:
    
1. What are the assumptions for linear regression
2. What is the consistent estimator for $\beta$ and $\sigma^2$
    
**Practice**:
    
1. Generate a dataset $X\sim N(0, 1)$, with 300 observations, dimensions is 3.   
    Use `numpy.random.randn` or call `np.random.randn`. [More distributions](https://docs.scipy.org/doc/numpy-1.15.0/reference/routines.random.html)
2. Set $\beta = (1, 2, 3)$ and set $\sigma^2 = 1$
3. Generate $y \sim N(X\beta, \sigma^2)$
4. Create a function `lm`, and give the estimates $\hat{\beta}_{lm}$, $var(\hat{\beta}_{lm})$ and $\hat{\sigma}$


In [9]:
N, P = 300, 3 # dimension information
x = np.random.randn(N, P) # X from standard normal distribution
beta = np.array([1, 2, 3]) # Create beta
sigma = 1
y = x @ beta + np.random.randn(N) * sigma

In [34]:
## Def a function, here the indent is import for both 
def lm(x, y):
    n, p = x.shape
    beta = np.linalg.inv((x.transpose() @ x)) @ x.transpose() @ y
    yhat = x @ beta
    residuals = y - yhat
    sse = residuals.transpose() @ residuals
    sigma = sse / (n - p)
    return beta, sigma

In [35]:
beta_hat, sigma_hat = lm(x, y)

In [55]:
print("Linear regression result.\n\
Beta: {}\n\
Sigma: {}".format(beta_hat, sigma_hat))

Linear regression result.
Beta: [1.01418659 2.03504284 3.09193549]
Sigma: 1.0162753210780382


# **PRACTICE**

**Questions**:


1. If you have too many predictors, what are the problems?
2. How do you solve it? List at least 3 methods.
3. What are the properties these methods?

**Programming**:

1. Implement a function ridge, which is used to solve ridge regression problem

$$\hat{\beta}_{ridge} = argmin_{\beta} ||Y - X\beta||^2 + \lambda ||\beta||^2$$

2. Give the estimates of $\hat{\beta}_{ridge}$ and $var(\hat{\beta}_{ridge})$
3. Consider what is the relationship between $\beta_{lm}$ and $\beta_{ridge}$
4. What if $X'X$ is an orthogonal matrix?
5. What is the equivalent Bayesian formulation?


In [41]:
def ridge(x, y, lamb):
    n, p = x.shape
    XtX = x.transpose() @ x
    lambInv = np.linalg.inv(lamb * np.eye(p) + XtX)
    beta_hat = lambInv @ x.transpose() @ y
    yhat = x @ beta
    residuals = y - yhat
    sse = residuals.transpose() @ residuals
    sigma = sse / (n - p)
    beta_se = np.diag(sse * lambInv @ XtX @ lambInv)
    return beta_hat, sigma, beta_se

In [53]:
beta_hat, sigma_hat, beta_se = ridge(x, y, 1)

In [54]:
print("Linear ridge regression result.\n\n\
Beta: {}\n\
Beta SE: {}\n\
Sigma: {}".format(beta_hat, beta_se, sigma_hat))

Linear ridge regression result.

Beta: [1.01418659 2.03504284 3.09193549]
Beta SE: [0.88287142 1.14044394 0.9038335 ]
Sigma: 1.0162753210780382
