import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import scipy as sp
import statsmodels.api as sm

from scipy.optimize import curve_fit
from sklearn.linear_model import LinearRegression

coeffs = ['IP', 'BT', 'NEL', 'PLTH', 'RGEO', 'KAREA', 'EPS', 'MEFF']

In [6]:
DB2P8 = pd.read_csv("data/DB2P8.csv")
DB5 = pd.read_csv("data/DB5.csv")

DB2P8 = DB2P8[DB5.columns]

# How was this chosen? Is this a form of removing outliers or noise to the new regression?
# Why not simply use the whole DB5?
new_ids = pd.read_csv("data/new_point_ids.csv")

data = pd.read_csv("data/data.csv")
                  
r = pd.read_csv("data/R.csv")  # DB5[DB5.id.isin(new_ids.id.values)]  # reintroduce dataset

In [25]:
Y = DB2P8[["TAUTH"]].apply(np.log).to_numpy()
X = DB2P8[coeffs].apply(np.abs).apply(np.log).to_numpy()

n, p = X.shape

$\hat{\beta} = (X^TX)^{-1}X^TY$;  $\qquad H = X(X^TX)^{-1}X^T $

In [18]:
# Invert X^T X once and reuse it for both the coefficients and the hat matrix.
XtX_inv = np.linalg.inv(X.T @ X)
B = XtX_inv @ X.T @ Y
H = X @ XtX_inv @ X.T

H.shape

(1310, 1310)

The leverage of the $i$-th case is the $i$-th diagonal element of the hat matrix, where $x_i^T$ denotes the $i$-th row of $X$:

$h_{ii} = x^T_i(X^TX)^{-1}x_i$ 
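
The leverage formula can be evaluated row by row without forming the full $n \times n$ hat matrix. The following is a minimal sketch on synthetic data; `X_demo`, `XtX_inv` and `leverage` are illustrative names, not part of the notebook, and the random design only stands in for the log-transformed `X`.

```python
import numpy as np

# Synthetic stand-in for the design matrix (assumption: random values,
# n = 50 rows and p = 3 columns; the notebook's X comes from DB2P8[coeffs]).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))

# (X^T X)^{-1}, computed once.
XtX_inv = np.linalg.inv(X_demo.T @ X_demo)

# h_ii = x_i^T (X^T X)^{-1} x_i for every row i, without building H.
leverage = np.einsum("ij,jk,ik->i", X_demo, XtX_inv, X_demo)

# Cross-check against the diagonal of the explicit hat matrix.
H_demo = X_demo @ XtX_inv @ X_demo.T
assert np.allclose(leverage, np.diag(H_demo))

# The leverages sum to p, since trace(H) equals the number of columns of X.
assert np.isclose(leverage.sum(), X_demo.shape[1])
```

For the notebook's `H` of shape (1310, 1310) the memory saving is modest, but the same expression scales to much larger $n$.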

The residuals: $E = Y - X\hat{\beta}$

MSE: $s^2 = \sum^n_{i=1}\frac{E^2_i}{n-p}$

Studentized residual $r_i = E_i/s_{e_i}$

$$
    s_{e_i} = \sqrt{s^2(1-h_{ii})} \:\: \rightarrow \:\: r^*_i = \frac{E_i}{s(i)\sqrt{1-h_{ii}}}
$$

Here $s^2(i)$ is the mean squared error when the $i$-th case is omitted in fitting the regression function. Under the usual model assumptions, $r^*_i$ follows the t-distribution with $n-p-1$ degrees of freedom.
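
Both kinds of studentized residual can be computed from a single fit. The sketch below assumes synthetic data and uses the standard identity $(n-p-1)\,s^2(i) = (n-p)\,s^2 - E_i^2/(1-h_{ii})$ to avoid refitting $n$ times; all names (`X_demo`, `y_demo`, `r_int`, `r_ext`, ...) are illustrative and not taken from the notebook. Note that `y_demo` is one-dimensional here, whereas the notebook's `Y` has shape `(n, 1)`.

```python
import numpy as np

# Synthetic regression problem (assumption: made-up coefficients and noise).
rng = np.random.default_rng(1)
n_demo, p_demo = 40, 3
X_demo = rng.normal(size=(n_demo, p_demo))
y_demo = X_demo @ np.array([1.0, -0.5, 2.0]) + rng.normal(scale=0.3, size=n_demo)

# Ordinary least squares fit, residuals and leverages.
XtX_inv = np.linalg.inv(X_demo.T @ X_demo)
beta_hat = XtX_inv @ X_demo.T @ y_demo
resid = y_demo - X_demo @ beta_hat
h = np.einsum("ij,jk,ik->i", X_demo, XtX_inv, X_demo)

# Mean squared error s^2 and internally studentized residuals r_i.
s2 = resid @ resid / (n_demo - p_demo)
r_int = resid / np.sqrt(s2 * (1 - h))

# Delete-one variance s^2(i) from the closed-form identity, then r*_i.
s2_del = ((n_demo - p_demo) * s2 - resid**2 / (1 - h)) / (n_demo - p_demo - 1)
r_ext = resid / np.sqrt(s2_del * (1 - h))

# Brute-force check for one case: refit without row 0 and compare s^2(0).
keep = np.arange(n_demo) != 0
b_0, *_ = np.linalg.lstsq(X_demo[keep], y_demo[keep], rcond=None)
res_0 = y_demo[keep] - X_demo[keep] @ b_0
s2_0 = res_0 @ res_0 / (n_demo - 1 - p_demo)
assert np.isclose(s2_0, s2_del[0])
```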

In [26]:
E = Y - np.matmul(X,B)

## DFBETA

DFBETA is the change in the parameter estimate when the $i$-th observation is deleted; namely

$$
    \text{DFBETA} = \hat{\beta} - \hat{\beta}_i = \frac{(X^TX)^{-1}x_iE_i}{1 - h_{ii}}
$$

Let $C = (X^TX)^{-1}X^T$, a $p \times n$ matrix whose $i$-th column is $(X^TX)^{-1}x_i$. If the x's are uniformly distributed then $c_{ji} = \mathcal{O}(n^{-1})$. The DFBETA$_j$ vector is 

$$
    \text{DFBETA}_j = b_j - b_{ji} = \frac{c_{ji}E_i}{1 - h_{ii}}
$$
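
The closed form above can be checked against an explicit leave-one-out refit. This is a minimal sketch on synthetic data; `dfbeta`, `dfbeta_brute` and the other names are illustrative, and the brute-force loop is included only to confirm the identity.

```python
import numpy as np

# Synthetic regression problem (assumption: made-up coefficients and noise).
rng = np.random.default_rng(2)
n_demo, p_demo = 30, 3
X_demo = rng.normal(size=(n_demo, p_demo))
y_demo = X_demo @ np.array([1.0, -0.5, 2.0]) + rng.normal(scale=0.3, size=n_demo)

# C = (X^T X)^{-1} X^T has shape (p, n); column i is (X^T X)^{-1} x_i.
XtX_inv = np.linalg.inv(X_demo.T @ X_demo)
C = XtX_inv @ X_demo.T
beta_hat = C @ y_demo
resid = y_demo - X_demo @ beta_hat

# h_ii = x_i^T (X^T X)^{-1} x_i = sum_j X[i, j] * C[j, i].
h = np.sum(X_demo * C.T, axis=1)

# dfbeta[i, j] = c_ji * E_i / (1 - h_ii), one row per deleted case.
dfbeta = (C * (resid / (1 - h))).T

# Brute force: refit without case i and take the difference of estimates.
dfbeta_brute = np.empty_like(dfbeta)
for i in range(n_demo):
    keep = np.arange(n_demo) != i
    b_i, *_ = np.linalg.lstsq(X_demo[keep], y_demo[keep], rcond=None)
    dfbeta_brute[i] = beta_hat - b_i

assert np.allclose(dfbeta, dfbeta_brute)
```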


To judge the change relative to the variability of each parameter estimate, a scaled measure is

$$
    \text{DFBETAS}_{ij} = \frac{b_j - b_{ji}}{s(i)\sqrt{(X^TX)^{-1}_{jj}}} = \frac{c_{ji}}{\sqrt{\sum_{k=1}^n c^2_{jk}}}\cdot\frac{r^*_i}{\sqrt{1-h_{ii}}}
$$

Here $j=1,\ldots,p$ and $b_j$ is the $j$-th element of $\hat{\beta}$. The denominator of DFBETAS$_{ij}$ is the estimated standard deviation of $b_j$ with the sample standard error $s$ replaced by the delete-one version $s(i)$. BKW (Belsley, Kuh and Welsch) proposed flagging cases with $|\text{DFBETAS}_{ij}| > 2\cdot n ^{-1/2}$.
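
The two expressions for DFBETAS and the BKW cutoff can be sketched as follows. The data are synthetic, with one artificially shifted response so that at least one case is likely to exceed the cutoff; every name here (`dfbetas`, `alt`, `flagged`, ...) is illustrative and not part of the notebook.

```python
import numpy as np

# Synthetic regression problem (assumption: made-up coefficients and noise).
rng = np.random.default_rng(3)
n_demo, p_demo = 60, 3
X_demo = rng.normal(size=(n_demo, p_demo))
y_demo = X_demo @ np.array([1.0, -0.5, 2.0]) + rng.normal(scale=0.3, size=n_demo)
y_demo[0] += 5.0  # hypothetical gross outlier in the response of case 0

# Fit, residuals, leverages.
XtX_inv = np.linalg.inv(X_demo.T @ X_demo)
C = XtX_inv @ X_demo.T                      # shape (p, n)
resid = y_demo - X_demo @ (C @ y_demo)
h = np.sum(X_demo * C.T, axis=1)

# s^2 and the delete-one standard error s(i).
s2 = resid @ resid / (n_demo - p_demo)
s_del = np.sqrt(((n_demo - p_demo) * s2 - resid**2 / (1 - h)) / (n_demo - p_demo - 1))

# First form: DFBETA scaled by s(i) * sqrt((X^T X)^{-1}_jj).
dfbeta = (C * (resid / (1 - h))).T          # shape (n, p)
dfbetas = dfbeta / (s_del[:, None] * np.sqrt(np.diag(XtX_inv))[None, :])

# Second form: normalized rows of C times r*_i / sqrt(1 - h_ii).
r_ext = resid / (s_del * np.sqrt(1 - h))
C_unit = C / np.sqrt((C**2).sum(axis=1, keepdims=True))
alt = C_unit.T * (r_ext / np.sqrt(1 - h))[:, None]
assert np.allclose(dfbetas, alt)

# BKW size-adjusted cutoff: flag case i if any |DFBETAS_ij| exceeds 2 / sqrt(n).
cutoff = 2 / np.sqrt(n_demo)
flagged = np.flatnonzero((np.abs(dfbetas) > cutoff).any(axis=1))
print(flagged)
```

With the notebook's $n = 1310$ the same cutoff evaluates to roughly $0.055$.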
