This notebook implements best subset selection for linear regression using several selection criteria.

In [1]:
import numpy as np
import pandas as pd  # needed for the DataFrames created later in the notebook
from sklearn.linear_model import LinearRegression
from itertools import combinations

Here are the formulas for various selection criteria for the best subset in a linear regression model.

Total sum of squares, `SSTO`:
$$
SSTO = \sum (Y_i - \bar{Y})^2
$$

Error sum of squares, `SSE`:
$$
SSE = \sum (Y_i - \hat{Y_i})^2
$$

Regression sum of squares, `SSR` (not used in calculations, but included for reference):  
$$
SSR = \sum (\hat{Y_i} - \bar{Y})^2
$$

Relationship between `SSTO`, `SSE`, and `SSR`:
$$
SSTO = SSE + SSR
$$
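
The decomposition above can be checked numerically. The following is a minimal sketch on synthetic data (the values and the `_demo` names are assumptions made for illustration); it fits ordinary least squares with an intercept, which is the case in which the identity holds.

```python
import numpy as np

# Small synthetic data set (hypothetical values, for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 2))
y_demo = 1.0 + X_demo @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=20)

# Ordinary least squares with an explicit intercept column
A_demo = np.column_stack([np.ones(len(y_demo)), X_demo])
beta_demo, *_ = np.linalg.lstsq(A_demo, y_demo, rcond=None)
y_hat_demo = A_demo @ beta_demo

ssto_demo = np.sum((y_demo - y_demo.mean()) ** 2)      # total sum of squares
sse_demo = np.sum((y_demo - y_hat_demo) ** 2)          # error sum of squares
ssr_demo = np.sum((y_hat_demo - y_demo.mean()) ** 2)   # regression sum of squares

# SSTO should equal SSE + SSR up to floating-point error
print(np.isclose(ssto_demo, sse_demo + ssr_demo))
```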

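Mean squared error, `MSE`, for a model with `p` estimated parameters (in simple linear regression `p = 2`, which gives the familiar `n - 2`):
$$
MSE = \frac{SSE}{n-p}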
$$

-----

The `p` in the following formulas refers to a subset of `p` variables drawn from the original set of independent variables. For example, if the original `X` has variables `x1`, `x2`, and `x3`, then for `p=2` the candidate subsets `Xp` are `{x1, x2}`, `{x1, x3}`, and `{x2, x3}`, and the criteria are computed for each of those subsets.
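
The subsets in this example can be enumerated with `itertools.combinations`. This is only an illustration using the variable names from the text as strings:

```python
from itertools import combinations

# Variable names taken from the example in the text
names = ['x1', 'x2', 'x3']

# All subsets of size 2, in the order combinations() generates them
subsets_p2 = list(combinations(names, 2))
print(subsets_p2)
# [('x1', 'x2'), ('x1', 'x3'), ('x2', 'x3')]
```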

Coefficient of multiple determination, `R2`:
$$
R^{2}_p = 1 - \frac{SSE_p}{SSTO}
$$

Adjusted coefficient of multiple determination, `adj_R2`:  

$$
R^2_{a,p} = 1 - \left(\frac{n-1}{n-p}\right)\frac{SSE_p}{SSTO} = 1 - \frac{MSE_p}{\frac{SSTO}{n-1}}
$$

Mallows's `Cp`, where `P` is the number of parameters in the full model that contains all candidate variables:
$$
C_p = \frac{SSE_p}{MSE(X_{1},...,X_{P-1})} - (n-2p)
$$
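
The `AIC`, `SBC`, and `PRESS` functions defined later in this notebook have no formulas listed here. Written in the same notation, and read directly off that code, they appear to correspond to the following, where $\hat{Y}_{i(i)}$ (a symbol introduced here, not in the original) denotes the prediction for observation $i$ from a model fitted without that observation.

Akaike information criterion, `AIC`:
$$
AIC_p = n \ln SSE_p - n \ln n + 2p
$$

Schwarz Bayesian criterion, `SBC`:
$$
SBC_p = n \ln SSE_p - n \ln n + (\ln n)\,p
$$

Prediction sum of squares, `PRESS`:
$$
PRESS_p = \sum (Y_i - \hat{Y}_{i(i)})^2
$$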

The following functions will calculate different statistical values.

In [None]:
def SSTO(y):
    
    '''Calculates sum of squares from the mean.'''
    
    y_mean = np.mean(y)
    squared_errors = (y - y_mean)**2
    
    return np.sum(squared_errors)

In [None]:
def SSE(y, predictions):
    
    '''Calculates sum of squared errors between predictions and actual values.'''
    
    squared_errors = (y - predictions)**2
    
    return np.sum(squared_errors)

In [None]:
def adj_R2(_sse, _ssto, n, p):
    
    '''Calculates the adjusted R^2.'''
    
    return 1 - (n-1)/(n-p) * _sse/_ssto

In [None]:
def Cp(sse_p, sse_P, n, p, P):
    
    '''Calculates Mallows's Cp value. Needs sse_p and sse_P to be pre-calculated.'''
    
    return sse_p / (sse_P/(n-P)) - (n - 2*p)
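
A quick sanity check on this expression: when the subset is the full model (`p = P` and `sse_p = sse_P`), the first term reduces to `n - P`, so the result should equal `P`. The sketch below repeats the formula under a hypothetical name, `mallows_cp`, so that it runs on its own; the numbers are made up.

```python
def mallows_cp(sse_p, sse_P, n, p, P):
    # Same expression as the Cp function above, repeated so this cell is self-contained
    return sse_p / (sse_P / (n - P)) - (n - 2 * p)

n_demo, P_demo = 30, 4    # hypothetical sample size and full-model parameter count
sse_full_demo = 12.5      # hypothetical SSE of the full model

# For the full model, Cp should come out equal to P (up to rounding)
cp_full_demo = mallows_cp(sse_full_demo, sse_full_demo, n_demo, p=P_demo, P=P_demo)
print(cp_full_demo)
```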

In [None]:
def AIC(_sse, n, p):
    
    '''Calculates the Akaike information criterion'''
    
    return n * np.log(_sse) - n * np.log(n) + 2*p

In [None]:
def SBC(_sse, n, p):
    
    '''Calculates Schwarz Bayesian criterion'''
    
    return n * np.log(_sse) - n * np.log(n) + np.log(n) * p
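
The two functions above differ only in the penalty term: `2*p` for AIC versus `np.log(n) * p` for SBC. A small sketch of when SBC penalizes each extra parameter more heavily (the sample sizes are arbitrary choices for illustration):

```python
import numpy as np

# Per-parameter penalty is 2 for AIC and ln(n) for SBC,
# so SBC is the stricter criterion whenever ln(n) > 2.
sbc_is_stricter = {n: bool(np.log(n) > 2) for n in (5, 8, 100)}
print(sbc_is_stricter)
# {5: False, 8: True, 100: True}
```

Since `ln(8)` is about 2.08, SBC favors smaller subsets than AIC for any data set with at least 8 observations.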

In [None]:
def PRESS(X, y):
    
    '''Calculates PRESS criterion.'''
    
    lr = LinearRegression()
    pred = np.zeros(y.shape)
    
    for i in range(X.shape[0]):
        y_mod = np.delete(y, i, 0)
        X_mod = np.delete(X, i, 0)
        lr.fit(X_mod, y_mod)
        # predict() returns one prediction per row; take the value for the single held-out row
        pred[i] = lr.predict(X[i].reshape(1, -1))[0]
        
    return SSE(y, pred)
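
The loop above refits the model `n` times. For ordinary least squares there is a known shortcut: the leave-one-out residual equals the ordinary residual divided by `1 - h_ii`, where `h_ii` is the i-th diagonal element of the hat matrix. The sketch below is not part of the original notebook; it uses synthetic data and hypothetical `_demo` names to compare the shortcut with a brute-force loop.

```python
import numpy as np

# Synthetic data (hypothetical values, for illustration only)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(15, 2))
y_demo = 0.5 + X_demo @ np.array([1.5, -2.0]) + rng.normal(scale=0.3, size=15)

# Design matrix with an intercept column, and its hat matrix H = A (A'A)^-1 A'
A_demo = np.column_stack([np.ones(len(y_demo)), X_demo])
H_demo = A_demo @ np.linalg.solve(A_demo.T @ A_demo, A_demo.T)

# Shortcut: scale each ordinary residual by 1 / (1 - h_ii)
resid_demo = y_demo - H_demo @ y_demo
press_shortcut = np.sum((resid_demo / (1.0 - np.diag(H_demo))) ** 2)

# Brute force: refit without observation i, then predict it
press_loop = 0.0
for i in range(len(y_demo)):
    keep = np.arange(len(y_demo)) != i
    beta_i, *_ = np.linalg.lstsq(A_demo[keep], y_demo[keep], rcond=None)
    press_loop += (y_demo[i] - A_demo[i] @ beta_i) ** 2

print(np.isclose(press_shortcut, press_loop))
```

The shortcut needs only one fit, which matters when PRESS is evaluated for every candidate subset.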

Define some objects that will be needed in the main function.

In [None]:
# Scikit-learn linear regression used in calculations
lin_reg = LinearRegression()

In [None]:
# DataFrames that will store the best subset related information
best_values_df = pd.DataFrame(columns = ['p', 'SSEp', 'R^2_p', 'Adj. R^2_p',
                                         'Cp', 'AICp', 'SBCp', 'PRESSp'])
best_subsets_df = pd.DataFrame(columns = ['p', 'SSEp', 'R^2_p', 'Adj. R^2_p',
                                         'Cp', 'AICp', 'SBCp', 'PRESSp'])

In [None]:
# the main function that will use the criterion calculations
# to determine the best subsets for regression
def get_subsets(X, y, P):
    # make sure that both X and y are numpy arrays
    if not isinstance(X, np.ndarray) or not isinstance(y, np.ndarray):
        raise TypeError('X and y must be numpy arrays')
    
    # check to make sure we have the same number of rows in X and y
    if X.shape[0] != y.shape[0]:
        raise ValueError('X and y must have the same number of rows')
        
    # set n as the number of observations
    n = X.shape[0]
    
    # create a range of values 1 through P for the numbers of variables in the subsets
    P_range = range(1, P+1)
    
    # for both dataframes best_values_df and best_subsets_df,
    # set values in the 'p' column to P_range values, and set that column as the index
    best_values_df['p'] = P_range
    best_values_df.set_index('p', inplace = True)
    best_subsets_df['p'] = P_range
    best_subsets_df.set_index('p', inplace = True)
    
    # create subsets of X consisting of 1 through P variables
    # first, create an empty list to hold the tuples of subsets
    X_subsets = []
    
    # create combinations of subsets using the 'combinations' function
    # and iterating over values in range equal to the number of X variables
    for i in range(1, P):
        combs = combinations(range(X.shape[1]), i)
        for item in combs:
            X_subsets.append(item)
            
    # create intermediate dataframes to hold criterion values
    SSE_df = pd.DataFrame(columns=['X_var', 'p', 'SSEp'])
    SSE_df.set_index('X_var')
    R2_df = pd.DataFrame(columns=['X_var', 'p', 'R2p'])
    adjR2_df = pd.DataFrame(columns=['X_var', 'p', 'adjR2p'])
    C_df = pd.DataFrame(columns=['X_var', 'p', 'Cp', 'Abs_Cp'])
    AIC_df = pd.DataFrame(columns=['X_var', 'p', 'AICp'])
    SBC_df = pd.DataFrame(columns=['X_var', 'p', 'SBCp'])
    PRESS_df = pd.DataFrame(columns=['X_var', 'p', 'PRESSp'])