 ### Regularized Regression
 
#### LASSO:
       -LASSO stands for Least Absolute Shrinkage and Selection Operator. It performs L1 regularization and penalizes large weights by adding their absolute value of magnitude to the cost function. The mathematical representation of cost function for L1 regularization is
       
$$ Cost(B) = |y- XB|^2 + \lambda|B|_1 $$

$$Where $$

$$\lambda|B|_1  = \sum^{p}_{j=1}|B_j|$$



#### Ridge Regression

Ridge Regression
-Ridge Regression is almost identical to Linear Regression except that we introduce a small amount 
of bias. In return, we get a large drop in variance. Ultimately, by starting off with a slightly worse
fit, Ridge Regression performs better against data that doesn't exactly follow the same pattern
as the data the model was trained on. 

-Ridge Regression is sometimes referred to as L2 regression. This term is introduced to the loss function
of a least squares regression model. The goal is to seek coefficients that fit the data well resulting in a low RSS. However, we introduct the term...

$$ \lambda \sum_jB^2_j$$

$$ The cost function changes to $$

$$ Cost(B) = |y- XB|^2 + \lambda B^2_1$$
$$Where $$

$$\lambda B^2_1  = \sum^{p}_{j=1}B_j^2$$

which is referred to as a shrinkage penatly. The penalty is small when the coefficients are close to zero. It has the effect of shrinking the coefficient estimates towards zero. The lambda value serves to control the relative impact of the term on the coefficient estimates. When lambda equals 0, then it has no effect, and ridge
regression will produce the same output as the linear regression.


In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import california_housing
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
import seaborn as sns

In [3]:
data = pd.read_csv('/Users/Matt/Documents/Intro To Stat Learning/MachineLearningFromScratch/data/Advertising.csv')
#df = data[['Income','Rating','Balance']]


data.drop('Unnamed: 0', axis=1, inplace=True)
data

Unnamed: 0,TV,radio,newspaper,sales
0,230.1,37.8,69.2,22.1
1,44.5,39.3,45.1,10.4
2,17.2,45.9,69.3,9.3
3,151.5,41.3,58.5,18.5
4,180.8,10.8,58.4,12.9
5,8.7,48.9,75.0,7.2
6,57.5,32.8,23.5,11.8
7,120.2,19.6,11.6,13.2
8,8.6,2.1,1.0,4.8
9,199.8,2.6,21.2,10.6


In [4]:
y = data['sales']
X = data.iloc[:,:-1]
standsc = StandardScaler()
standsc.fit(X)
X_norm = pd.DataFrame(standsc.transform(X))
X_norm.columns = X.columns

In [89]:
def addConstantFunc(X):

    x0 = np.ones(len(X))

    #add a constant
    X['constant'] = x0

    return X


def calculate_cost_function(X,y, coefficients):



    #add a constant
    m = len(y)
    predictions = X.dot(coefficients)

    square_err = (predictions - y) **2
    rmse = 1/(2*m) * np.sum(square_err)
    return rmse


def RegularizedRegression(X,y, alpha, n_iterations, LAMBDA, R_type):

    """
    X: predictions
    y: target variable
    alpha: learning rate
    n_iterations:The number of jobs to use for the computation to minimize the cost function
    LAMBDA: The regularization term for shirkage. 
    R_type: parameter value that holds the regularization type (L1, L2, and Elasitc Net)


    """
    if R_type == 'L2':
        X_ = addConstantFunc(X)

        gradient_preds = []

        #create base intercept
        coefficients = np.array(np.zeros(X_.shape[1]))

        cost_history = [0] * n_iterations

        for i in range(n_iterations):

            h = X_.dot(coefficients)

            loss = h - y

            gradient = (X_.T.dot(loss)/ len(y)) + ( LAMBDA* np.sign(coefficients))**2
            

            coefficients = coefficients - (alpha *gradient)

            cost = calculate_cost_function(X_,y, coefficients)

            cost_history[i] = cost
            
        
        return coefficients
    
    
    elif R_type == "L1":
        
        X_ = addConstantFunc(X)

        gradient_preds = []

        #create base intercept
        coefficients = np.array(np.zeros(X_.shape[1]))

        cost_history = [0] * n_iterations

        for i in range(n_iterations):

            h = X_.dot(coefficients)

            loss = h - y

            gradient = (X_.T.dot(loss)/ len(y)) +  (LAMBDA*np.sign(coefficients))

            coefficients = coefficients - alpha *gradient 
            cost = calculate_cost_function(X_,y, coefficients)

            cost_history[i] = cost
            
            
        
        
        return coefficients
    
    elif R_type == "Elastic":
        
        
        X_ = addConstantFunc(X)

        gradient_preds = []

        #create base intercept
        coefficients = np.array(np.zeros(X_.shape[1]))

        cost_history = [0] * n_iterations

        for i in range(n_iterations):

            h = X_.dot(coefficients)

            loss = h - y

            gradient = (X_.T.dot(loss)/ len(y)) +  (LAMBDA*np.sign(coefficients)) +  ( LAMBDA* np.sign(coefficients))**2

            coefficients = coefficients - (alpha *gradient)

            cost = calculate_cost_function(X_,y, coefficients)

            cost_history[i] = cost
        return coefficients



  
    


def predict(X, coefficients, y):

    X = addConstantFunc(X)
    prediction = X.dot(coefficients)
    
    prediction = pd.DataFrame({'Prediction':prediction, 'Actual':y})
    return prediction



## L1 Test against sklearn

In [85]:
from sklearn.linear_model import Lasso, LinearRegression, Ridge, ElasticNet

In [83]:
theta_l1 = RegularizedRegression(X_norm,y,0.1,1001,0.7,R_type='L1')

lasso = Lasso(alpha=0.7,fit_intercept=False)

lasso.fit(X_norm,y)


LassoDict = {'TV':[lasso.coef_[0]], 'radio':[lasso.coef_[1]], 'newspaper':[lasso.coef_[2]],'constant':[lasso.coef_[3]]}
Lassodf = pd.DataFrame.from_dict(LassoDict)

print(theta_l1)


print(Lassodf.T)

TV            3.253860
radio         2.112066
newspaper    -0.025801
constant     13.322500
dtype: float64
                   0
TV          3.254785
radio       2.120500
newspaper   0.000000
constant   13.322500


In [88]:
theta_l2 = RegularizedRegression(X_norm,y,0.1,1000,0.7,R_type='L2')

ridge = Ridge(alpha=0.7,fit_intercept=False,max_iter=1000,solver='auto')

ridge.fit(X_norm,y)


ridgeDict = {'TV':[ridge.coef_[0]], 'radio':[ridge.coef_[1]], 'newspaper':[ridge.coef_[2]],'constant':[ridge.coef_[3]]}
ridgedf = pd.DataFrame.from_dict(ridgeDict)

print(theta_l2)


print(ridgedf.T)

TV            3.467512
radio         2.448148
newspaper    -0.365167
constant     13.532500
dtype: float64
                   0
TV          3.905906
radio       2.781437
newspaper  -0.017957
constant   13.973592


In [90]:
theta_elastic = RegularizedRegression(X_norm,y,0.1,1000,0.7,R_type='Elastic')

elastic = ElasticNet(alpha=0.7,fit_intercept=False)

elastic.fit(X_norm,y)



elasticDict = {'TV':[elastic.coef_[0]], 'radio':[elastic.coef_[1]], 'newspaper':[elastic.coef_[2]],'constant':[elastic.coef_[3]]}
elasticdf = pd.DataFrame.from_dict(elasticDict)

print(theta_elastic)


print(elasticdf.T)

TV            2.790684
radio         1.660111
newspaper     0.015092
constant     12.832500
dtype: float64
                   0
TV          2.680278
radio       1.847126
newspaper   0.023891
constant   10.127778
