# Lab Three - Extending Logistic Regression

In this lab, you will compare the performance of logistic regression optimization programmed in scikit-learn and via your own implementation. You will also modify the optimization procedure for logistic regression. 

This report is worth 10% of the final grade. Please upload a report (<b>one per team</b>) with all code used, visualizations, and text in a rendered Jupyter notebook. For any visualizations that cannot be embedded in the notebook, please provide screenshots of the output. The results should be reproducible using your report. Please carefully describe every assumption and every step in your report.

<b>Dataset Selection</b>

Select a dataset the same way you did for lab one (i.e., table data). You are not required to use the same dataset that you used in the past, but you are encouraged to. You must identify a classification task from the dataset that contains <b>three or more classes to predict</b>. That is, it cannot be a binary classification; it must be multi-class prediction. 

## Preparation and Overview (3pt)

<ul>
    <li>[<b>2 points</b>] Explain the task and what business-case or use-case it is designed to solve (or designed to investigate). Detail exactly what the classification task is and what parties would be interested in the results. For example, would the model be deployed or used mostly for offline analysis? </li>
    <li>[<b>.5 points</b>] (<i>mostly the same processes as from previous labs</i>) Define and prepare your class variables. Use proper variable representations (int, float, one-hot, etc.). Use pre-processing methods (as needed) for dimensionality reduction, scaling, etc. Remove variables that are not needed/useful for the analysis. Describe the final dataset that is used for classification/regression (include a description of any newly formed variables you created). </li>
    <li>[<b>.5 points</b>] Divide your data into training and testing data using an 80% training and 20% testing split. Use the cross validation modules that are part of scikit-learn. <b>Argue "for" or "against" splitting your data using an 80/20 split. That is, why is the 80/20 split appropriate (or not) for your dataset?</b></li>
</ul>

### Use Case

Our task is to look at a patient's information and determine whether they are likely to have a stroke, heart disease, or hypertension. The use case for this classifier is to flag at-risk patients so that some response can be made, either to prevent the serious medical emergencies these conditions can cause or to prevent the conditions in the first place.

For example, if a person were to be flagged as very likely to have a stroke, the doctor could contact the patient in an attempt to prevent the stroke by prescribing them medication or alerting the patient's family to monitor them in case they were to have a stroke. Similar actions could be taken for hypertension and heart disease.

Alternatively, an application could let people enter their own information and see how at risk they might be for these conditions, giving them clearer information about their health and the issues likely to affect them.

### Data Preparation

In [1]:
# Importing packages and reading in dataset
import numpy as np
import pandas as pd

print('Pandas:', pd.__version__)
print('Numpy:',  np.__version__)

raw_data = pd.read_csv('healthcare-dataset-stroke-data.csv')
raw_data.head()

Pandas: 1.1.3
Numpy: 1.19.2


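|   | id | gender | age | hypertension | heart_disease | ever_married | work_type | residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|-------|--------|------|---|---|-----|---------------|-------|--------|------|-----------------|---|
| 0 | 9046  | Male   | 67.0 | 0 | 1 | Yes | Private       | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| 1 | 51676 | Female | 61.0 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | NaN  | never smoked    | 1 |
| 2 | 31112 | Male   | 80.0 | 0 | 1 | Yes | Private       | Rural | 105.92 | 32.5 | never smoked    | 1 |
| 3 | 60182 | Female | 49.0 | 0 | 0 | Yes | Private       | Urban | 171.23 | 34.4 | smokes          | 1 |
| 4 | 1665  | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked    | 1 |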


In [2]:
# Dropping categorical column 'work_type'; not very useful and
# doesn't translate nicely into ordinal numbers
df = raw_data.drop('work_type', axis = 1)

# Dropping 1 observation of person with gender 'Other' to simplify
# using the gender column to calculate, impute, or visualize
df.drop(df[df.gender == 'Other'].index, inplace=True)

# Making values' format consistent
for c in df.columns:
    if df[c].dtype == 'object':
        df[c] = df[c].str.lower()

# Adding numbers to smoking_status values to order them properly
# when they will get passed through the SKLearn LabelEncoder.
# The values were lowercased above, so the keys are lowercase too.
df['smoking_status'] = df['smoking_status'].replace({
    'never smoked':    '0_never_smoked',
    'formerly smoked': '1_formerly_smoked',
    'smokes':          '2_smokes',
    'unknown':         '3_unknown',
})

In [3]:
from sklearn.preprocessing import LabelEncoder

# Encoding all of the non-numeric columns
le = {}

for col in df.columns:
    if df[col].dtype == 'object':
        le[col] = LabelEncoder()
        df[col] = le[col].fit_transform(df[col])

# Call le[col].inverse_transform(df[col]) for any column name
# to convert numbers back to their labels

# Converting all 'Unknown' values in smoking status to NaN so
# that we can impute the missing values.
df.smoking_status.mask(df.smoking_status == 3, np.nan, inplace=True)
               
df.head()

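|   | id | gender | age | hypertension | heart_disease | ever_married | residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|-------|---|------|---|---|---|---|--------|------|-----|---|
| 0 | 9046  | 1 | 67.0 | 0 | 1 | 1 | 1 | 228.69 | 36.6 | 1.0 | 1 |
| 1 | 51676 | 0 | 61.0 | 0 | 0 | 1 | 0 | 202.21 | NaN  | 0.0 | 1 |
| 2 | 31112 | 1 | 80.0 | 0 | 1 | 1 | 0 | 105.92 | 32.5 | 0.0 | 1 |
| 3 | 60182 | 0 | 49.0 | 0 | 0 | 1 | 1 | 171.23 | 34.4 | 2.0 | 1 |
| 4 | 1665  | 0 | 79.0 | 1 | 0 | 1 | 0 | 174.12 | 24.0 | 0.0 | 1 |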


In [4]:
# Imputing missing values
from sklearn.impute import KNNImputer
import copy

knn = KNNImputer(n_neighbors=3)

# Imputing on all columns except id
columns = list(df.columns)
columns.remove('id')

df_imputed = copy.deepcopy(df)
df_imputed[columns] = knn.fit_transform(df[columns])

# Rounding imputed values to be compatible with LabelEncoder
# for smoking_status and to match the format of other values
# for bmi
df_imputed.smoking_status = df_imputed.smoking_status.apply(lambda x: round(x, 0))
df_imputed.bmi = df_imputed.bmi.apply(lambda x: round(x, 1))

In [5]:
# Using df_imputed as the primary dataset
df = df_imputed

# Changing columns modified by KNN Imputer back to integers from floats
columns = [
    'gender',
    'hypertension',
    'heart_disease',
    'ever_married',
    'residence_type',
    'smoking_status',
    'stroke'
]

for col in columns:
    df[col] = df[col].astype(int)

To prepare this dataset, we removed one attribute (work_type) because we judged it relatively unimportant and it does not encode cleanly into an ordinal set of integers. All remaining categorical variables were converted to numeric data using SKLearn's LabelEncoder class. Missing values for bmi and smoking_status were imputed using KNNImputer with three neighbors. We also dropped the single record with gender 'Other'. Removing it keeps the gender column binary, which simplifies encoding and visualization, and dropping one record should have little impact on training.

Here is a table of the LabelEncoder encoded variables.

| value | gender | ever_married | residence_type | smoking_status    |
|-------|--------|--------------|----------------|-------------------|
| 0     | female | no           | rural          | 0_never_smoked    |
| 1     | male   | yes          | urban          | 1_formerly_smoked |
| 2     |   -    |      -       |       -        | 2_smokes          |
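The numeric prefixes matter because LabelEncoder assigns integer codes in sorted order of the unique labels. A minimal sketch of that ordering, using `np.unique` as a stand-in so the example needs only NumPy; the sample labels are made up for illustration:

```python
import numpy as np

# Made-up sample of prefixed smoking_status labels (not the real column).
labels = np.array(['2_smokes', '0_never_smoked', '3_unknown',
                   '1_formerly_smoked', '0_never_smoked'])

# np.unique returns the sorted unique labels and, for each element,
# the position of its label in that sorted array (its integer code).
classes, codes = np.unique(labels, return_inverse=True)

code_of = {str(label): i for i, label in enumerate(classes)}
print(code_of)
print(codes.tolist())
```

Without the prefixes, alphabetical sorting would place 'formerly smoked' before 'never smoked', so the integer codes would no longer follow the intended ordering.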


In [6]:
df.head()

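|   | id | gender | age | hypertension | heart_disease | ever_married | residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|-------|---|------|---|---|---|---|--------|------|---|---|
| 0 | 9046  | 1 | 67.0 | 0 | 1 | 1 | 1 | 228.69 | 36.6 | 1 | 1 |
| 1 | 51676 | 0 | 61.0 | 0 | 0 | 1 | 0 | 202.21 | 30.9 | 0 | 1 |
| 2 | 31112 | 1 | 80.0 | 0 | 1 | 1 | 0 | 105.92 | 32.5 | 0 | 1 |
| 3 | 60182 | 0 | 49.0 | 0 | 0 | 1 | 1 | 171.23 | 34.4 | 2 | 1 |
| 4 | 1665  | 0 | 79.0 | 1 | 0 | 1 | 0 | 174.12 | 24.0 | 0 | 1 |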


### Dataset Division

We decided to use an 80/20 split for our training and testing data. With a dataset of around 5,000 entries, the 20% test portion contains about 1,000 entries, which we consider a large enough sample for testing our classifier. We also considered the Pareto Principle (which says that roughly 20 percent of inputs typically account for 80 percent of outputs) when choosing how to split our data. The 80/20 split is the standard we have used so far in this coursework, and although its connection to the Pareto Principle is loose, we feel the principle lends some additional support to the choice.

Our runner-up method was cross-validation. We chose not to pursue it in case it affected our implementation of the classifier in some non-trivial way. We wanted to avoid adding work beyond what the assignment already requires so that we could focus on the quality of the required work.

In [7]:
columns = list(df.columns)
columns.remove('id')
targets = ['stroke', 'heart_disease', 'hypertension']

for col in targets:
    columns.remove(col)

#splitting into train and test data
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=.20, random_state=42)

X_test  = test[columns].to_numpy()
X_train = train[columns].to_numpy()
y_test  = {}
y_train = {}

for col in targets:
    y_test[col]  = test[col].to_numpy()
    y_train[col] = train[col].to_numpy()
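When a target is imbalanced, one consideration for any 80/20 split is whether both portions keep the same class proportions. A minimal NumPy sketch of a stratified split follows; the helper name `stratified_split` and the synthetic labels are assumptions for illustration, not part of the notebook. scikit-learn's `train_test_split` also accepts a `stratify` argument for a single label array.

```python
import numpy as np

def stratified_split(y, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), holding out ~test_size of each class."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for cls in np.unique(y):
        # Shuffle the row positions belonging to this class only.
        idx = rng.permutation(np.flatnonzero(y == cls))
        n_test = int(round(test_size * len(idx)))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.array(train_idx), np.array(test_idx)

# Synthetic, imbalanced binary target (5% positives); not the real data.
y = np.array([1] * 50 + [0] * 950)
train_idx, test_idx = stratified_split(y)

print(len(train_idx), len(test_idx))
print(y[train_idx].mean(), y[test_idx].mean())
```

With three separate binary targets, as in this notebook, stratifying on all of them at once would need a combined label, which is a design choice this sketch does not make.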

## Modeling (5pt)

<ul>
    <li>The implementation of logistic regression must be written only from the examples given to you by the instructor. No credit will be assigned to teams that copy implementations from another source, regardless of if the code is properly cited.</li>
    <li>[<b>2 points</b>] Create a custom, one-versus-all logistic regression classifier using numpy and scipy to optimize. Use object oriented conventions identical to scikit-learn. You should start with the template developed by the instructor in the course. You should add the following functionality to the logistic regression classifier:
    <ul>
        <li>Ability to choose optimization technique when class is instantiated: either steepest descent, stochastic gradient descent, or Newton's method. </li>
        <li>Update the gradient calculation to include a customizable regularization term (either using no regularization, L1 regularization, L2 regularization, or both L1 and L2 regularization). Associate a cost with the regularization term, "C", that can be adjusted when the class is instantiated.  </li>
    </ul>
    </li>
    <li>[<b>1.5 points</b>] Train your classifier to achieve good generalization performance. That is, adjust the <b>optimization technique</b> and the value of the <b>regularization term "C"</b> to achieve the best performance on your test set. Visualize the performance of the classifier versus the parameters you investigated. Is your method of selecting parameters justified? That is, do you think there is any "data snooping" involved with this method of selecting parameters?</li>
    <li>[<b>1.5 points</b>] Compare the performance of your "best" logistic regression optimization procedure to the procedure used in scikit-learn. Visualize the performance differences in terms of training time and classification performance. <b>Discuss the results</b>. </li>
</ul>

In [8]:
class BinaryLogisticRegressionBase:
    # private:
    def __init__(self, optimization='tgd', eta = 0.01, iterations=20, regularization='none', c=0):
        self.eta = eta
        self.iters = iterations
        self.opt = optimization
        self.reg = regularization
        self.c = c
        # internally we will store the weights as self.w_ to keep with sklearn conventions
    
    def __str__(self):
        return 'Base Binary Logistic Regression Object, Not Trainable'
    
    # convenience, private and static:
    @staticmethod
    def _sigmoid(theta):
        return 1/(1+np.exp(-theta)) 
    
    @staticmethod
    def _add_bias(X):
        return np.hstack((np.ones((X.shape[0],1)),X)) # add bias term
    
    # public:
    def predict_proba(self,X,add_bias=True):
        # add bias term if requested
        Xb = self._add_bias(X) if add_bias else X
        return self._sigmoid(Xb @ self.w_) # return the probability y=1
    
    def predict(self,X):
        return (self.predict_proba(X)>0.5) #return the actual prediction
    

In [9]:
from scipy.special import expit
from numpy.linalg import pinv

class BinaryLogisticRegression(BinaryLogisticRegressionBase):
    #private:
    def __str__(self):
        if(hasattr(self,'w_')):
            return 'Binary Logistic Regression Object with coefficients:\n'+ str(self.w_) # if we have trained the object
        else:
            return 'Untrained Binary Logistic Regression Object'
        
    #optimization methods
    def _get_gradient(self, X, y):
        # dispatch to the update rule selected at construction time
        if self.opt == 'tgd': gradient = self.steepest_descent
        elif self.opt == 'sgd': gradient = self.stochastic_gradient_descent
        elif self.opt == 'newton': gradient = self.newton
        else: raise ValueError("optimization must be 'tgd', 'sgd', or 'newton'")
        return gradient(X,y)
    
    def steepest_descent(self,X,y):
        ydiff = y-self.predict_proba(X,add_bias=False).ravel() # get y difference
        gradient = np.mean(X * ydiff[:,np.newaxis], axis=0) # make ydiff a column vector and multiply through
        gradient = gradient.reshape(self.w_.shape)
        gradient[1:] += self.c * self._get_reg_gradient()
        
        return gradient
    
    def stochastic_gradient_descent(self,X,y):
        idx = np.random.randint(len(y))
        ydiff = y[idx]-self.predict_proba(X[idx],add_bias=False) # get y difference (now scalar)
        gradient = X[idx] * ydiff[:,np.newaxis] # make ydiff a column vector and multiply through
        
        gradient = gradient.reshape(self.w_.shape)
        gradient[1:] += self.c * self._get_reg_gradient()
        
        return gradient
    
    def newton(self, X, y):
        g = self.predict_proba(X,add_bias=False).ravel() # get sigmoid value for all classes
        hessian = X.T @ np.diag(g*(1-g)) @ X - 2 * self.c  # calculate the hessian

        ydiff = y-g # get y difference
        gradient = np.sum(X * ydiff[:,np.newaxis], axis=0) # make ydiff a column vector and multiply through
        gradient = gradient.reshape(self.w_.shape)
        gradient[1:] +=  self.c * self._get_reg_gradient()
        
        return pinv(hessian) @ gradient
    
    @staticmethod
    def _sigmoid(theta):
        # increase stability, redefine sigmoid operation
        return expit(theta) #1/(1+np.exp(-theta))
    
    #regularization methods
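    def _get_reg_gradient(self):
        # returns the term that each optimizer scales by c and adds to the
        # update for the non-bias weights
        if self.reg == 'none':
            # returns the weights themselves, so any c > 0 still adds c * w
            return self.w_[1:]
        elif self.reg == 'ridge':
            # derivative of -sum(w**2), the L2 penalty under gradient ascent
            return -2 * self.w_[1:]
        elif self.reg == 'lasso':
            # sign(w) is the derivative of sum(|w|), the L1 penalty
            return np.sign(self.w_[1:])
        elif self.reg == 'elastic_net':
            # the ridge and lasso terms above, summed
            return -2 * self.w_[1:] + np.sign(self.w_[1:])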
    
    # public:
    def fit(self, X, y):
        Xb = self._add_bias(X) # add bias term
        num_samples, num_features = Xb.shape
        
        self.w_ = np.zeros((num_features,1)) # init weight vector to zeros
        
        # for as many as the max iterations
        for _ in range(self.iters):
            gradient = self._get_gradient(Xb,y)
            self.w_ += gradient*self.eta # multiply by learning rate 
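For reference, here is a sketch of how the four regularization settings could enter a gradient-ascent update when the objective is the log-likelihood minus `c` times the penalty, with the bias left unpenalized. The function names `reg_gradient` and `penalty` are hypothetical, and this illustrates that assumed objective rather than the class above. A finite-difference check confirms the signs.

```python
import numpy as np

def reg_gradient(w, reg='none', c=0.0):
    """Gradient of -c * penalty(w) for an ascent update; w[0] is the bias."""
    grad = np.zeros_like(w, dtype=float)
    if reg in ('ridge', 'elastic_net'):       # L2 penalty: sum(w**2)
        grad[1:] += -2.0 * c * w[1:]
    if reg in ('lasso', 'elastic_net'):       # L1 penalty: sum(|w|)
        grad[1:] += -c * np.sign(w[1:])
    return grad

def penalty(w, reg):
    """Penalty value on the non-bias weights for the chosen setting."""
    l2 = np.sum(w[1:] ** 2) if reg in ('ridge', 'elastic_net') else 0.0
    l1 = np.sum(np.abs(w[1:])) if reg in ('lasso', 'elastic_net') else 0.0
    return l1 + l2

# Central finite-difference check on arbitrary illustrative weights.
w = np.array([0.5, -1.0, 2.0, 0.25])
c, eps = 0.1, 1e-6
checks = {}
for reg in ('none', 'ridge', 'lasso', 'elastic_net'):
    numeric = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = eps
        numeric[i] = (-c * penalty(w + step, reg) + c * penalty(w - step, reg)) / (2 * eps)
    checks[reg] = bool(np.allclose(numeric, reg_gradient(w, reg, c), atol=1e-6))
print(checks)
```

Under this objective, the 'none' setting contributes a zero vector, and both penalty terms pull the weights toward zero.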

### Logistic Regression Class

In [10]:
class LogisticRegression:
    
    def __init__(self, optimization, eta, iterations, regularization, c=0):
    
        self.eta = eta
        self.iters = iterations
        self.opt = optimization
        self.reg = regularization
        self.encodings = {}
        self.c = c
        
    
    def __str__(self):
        if(hasattr(self,'w_')):
            return 'MultiClass Logistic Regression Object with coefficients:\n'+ str(self.w_) # if we have trained the object
        else:
            return 'Untrained MultiClass Logistic Regression Object'
    
    def fit(self,X,y): # y is a dict mapping each target name to its label array
        self.classifiers_ = [] # will fill this array with binary classifiers
        
        for name, target in y.items():
            blr = BinaryLogisticRegression(self.opt, self.eta, self.iters, self.reg, self.c )
            blr.fit(X,target)
            # add the trained classifier to the list
            self.classifiers_.append(blr)
            
        # save all the weights into one matrix, separate column for each class
        self.w_ = np.hstack([x.w_ for x in self.classifiers_]).T
        
    def predict_proba(self,X):
        probs = []
        for blr in self.classifiers_:
            probs.append(blr.predict_proba(X)) # get probability for each classifier
            
        return np.hstack(probs) # make into single matrix
    
    def predict(self,X):
        return np.argmax(self.predict_proba(X),axis=1) # take argmax along row
    
    
lr = LogisticRegression('tgd',0.01, 100, 'ridge')
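As a point of comparison, the usual one-versus-all construction starts from a single multi-class label vector, trains one binary classifier per class, and maps the argmax column back to a class label. A minimal sketch with made-up labels and made-up probabilities (no model is trained here):

```python
import numpy as np

# Made-up multi-class labels with three classes.
y = np.array([0, 2, 1, 2, 0])
classes = np.unique(y)

# One binary target per class: 1 where the sample belongs to that class.
binary_targets = {int(c): (y == c).astype(int) for c in classes}
print(binary_targets[2].tolist())

# Made-up probabilities, one column per binary classifier, in class order.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.8],
                  [0.2, 0.6, 0.3],
                  [0.3, 0.1, 0.9],
                  [0.6, 0.5, 0.2]])

# argmax gives a column index; indexing into classes turns it into a label.
yhat = classes[np.argmax(probs, axis=1)]
print(yhat.tolist())
```

Because `argmax` returns the position of the winning classifier, its output is only comparable to label arrays that use the same class indexing.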

In [11]:
# evaluate on the test dataset
from sklearn.metrics import accuracy_score
lr = LogisticRegression(optimization='tgd',eta=0.9, regularization='none', iterations=10, c=0.001)
for col in targets:
    lr.fit(X_train, y_train)
    yhat = lr.predict(X_test)
    print("Accuracy of Testing Dataset: ", accuracy_score(y_test[col],yhat), " - ", col)

Accuracy of Testing Dataset:  0.9393346379647749  -  stroke
Accuracy of Testing Dataset:  0.9315068493150684  -  heart_disease
Accuracy of Testing Dataset:  0.8845401174168297  -  hypertension


In [12]:
# trying a larger number of iterations (accuracy_score is already imported above)
lr = LogisticRegression(optimization='tgd',eta=0.9, regularization='none', iterations=500, c=0.001)
for col in targets:
    lr.fit(X_train, y_train)
    yhat = lr.predict(X_test)
    print("Accuracy of Testing Dataset: ", accuracy_score(y_test[col],yhat), " - ", col)

Accuracy of Testing Dataset:  0.4070450097847358  -  stroke
Accuracy of Testing Dataset:  0.4090019569471624  -  heart_disease
Accuracy of Testing Dataset:  0.410958904109589  -  hypertension


As we can see, with steepest descent and the regularization option set to 'none', fewer iterations yielded higher accuracy: 10 iterations scored between 0.88 and 0.94 across the three targets, while 500 iterations scored about 0.41. This will be interesting to analyze as we vary the number of iterations and evaluate different combinations of regularization and optimization.
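When classes are imbalanced, accuracy can look strong for a model that never predicts the positive class, so a constant baseline is a useful reference point when reading the numbers above. A minimal sketch on synthetic labels; the helper `accuracy` and the 6% positive rate are assumptions for illustration:

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of positions where prediction and label agree."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

# Synthetic imbalanced labels: 6 positives out of 100 (not the real data).
y_true = np.array([1] * 6 + [0] * 94)

# A "classifier" that always predicts the negative class.
always_zero = np.zeros_like(y_true)

baseline = accuracy(y_true, always_zero)
print(baseline)
```

A model is only informative on such a target if it beats this baseline, which is why per-class recall or a confusion matrix is often reported alongside accuracy.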

### Steepest Descent

In [13]:
import operator
max_eta = -1
# Maps accuracy -> eta; equal accuracies overwrite earlier eta values
accuracy_eta_list = dict()
for val in range(58):
    lr = LogisticRegression(optimization='tgd',eta=val/100, regularization='none', iterations=10, c=0.001)
    for col in targets:
        lr.fit(X_train, y_train)
        yhat = lr.predict(X_test)
        accuracy = accuracy_score(y_test[col],yhat)
        accuracy_eta_list[accuracy] = val/100
# Sort the (accuracy, eta) pairs by eta, largest first, and keep the first ten
new_list = dict(sorted(accuracy_eta_list.items(), key=operator.itemgetter(1), reverse=True)[:10])
print('Max Accuracy          Max Eta')
for key in new_list:
    print(key, '  ', new_list[key])
# Take the eta of the first entry in that ordering
max_accuracy = list(new_list.keys())[0] 
max_eta = new_list[max_accuracy]
    
max_accuracy = -1
print('\nMax Accuracy       Max c')

# Sweep c from 0 to 0.099 at the chosen eta
for val in range(100):
    lr = LogisticRegression(optimization='tgd', eta=max_eta, regularization='none', iterations=500, c=val/1000.0)
    for col in targets:
        lr.fit(X_train, y_train)
        yhat = lr.predict(X_test)
        accuracy = accuracy_score(y_test[col],yhat)
    
    # accuracy here holds the value for the last target in the loop (hypertension)
    if accuracy > max_accuracy:
        max_accuracy = accuracy
        max_c = val/1000.0
        print(max_accuracy, max_c)
        
print("\nFinal eta: ", max_eta)
print("Final c: ", max_c)

Max Accuracy          Max Eta
0.9393346379647749    0.57
0.9315068493150684    0.57
0.8845401174168297    0.57
0.9354207436399217    0.31
0.9275929549902152    0.31
0.8806262230919765    0.31
0.9305283757338552    0.3
0.9227005870841487    0.3
0.87573385518591    0.3
0.9295499021526419    0.29

Max Accuracy       Max c
0.14090019569471623 0.0
0.8845401174168297 0.001

Final eta:  0.57
Final c:  0.001
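One way to limit data snooping when tuning eta and c is to score each candidate pair on a validation split carved out of the training data and keep the test set for a single final evaluation. The sketch below shows only the search loop. `grid_search` and `toy_score` are hypothetical names, and `toy_score` is a stand-in for "fit on the training portion, return accuracy on the validation portion".

```python
import itertools

def grid_search(score_fn, etas, cs):
    """Return (best_score, best_eta, best_c) over every (eta, c) pair."""
    best = (float('-inf'), None, None)
    for eta, c in itertools.product(etas, cs):
        score = score_fn(eta, c)
        if score > best[0]:          # keep the first pair with the highest score
            best = (score, eta, c)
    return best

# Hypothetical stand-in scorer with a known optimum at eta=0.3, c=0.01.
def toy_score(eta, c):
    return 1.0 - (eta - 0.3) ** 2 - (c - 0.01) ** 2

etas = [i / 100 for i in range(1, 58)]
cs = [i / 1000 for i in range(100)]

best_score, best_eta, best_c = grid_search(toy_score, etas, cs)
print(best_score, best_eta, best_c)
```

Searching the two parameters jointly also avoids fixing eta before c has been explored, at the cost of more fits.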


In [14]:
import time

lr_s0 = LogisticRegression(optimization="tgd", eta=max_eta, regularization='none', iterations=500, c=max_c)
start_time = time.time()
lr_s0.fit(X_train,y_train)
yhat_s0 = lr_s0.predict(X_test)
time1 = (time.time() - start_time)
print("Accuracy of Steepest Gradient: ", accuracy_score(y_test['stroke'],yhat_s0), ' - stroke')

lr_hd0 = LogisticRegression(optimization="tgd", eta=max_eta, regularization='none', iterations=500, c=max_c)
start_time = time.time()
lr_hd0.fit(X_train,y_train)
yhat_hd0 = lr_hd0.predict(X_test)
time2 = (time.time() - start_time)
print("Accuracy of Steepest Gradient: ", accuracy_score(y_test['heart_disease'],yhat_hd0), ' - heart_disease')

lr_h0 = LogisticRegression(optimization="tgd", eta=max_eta, regularization='none', iterations=500, c=max_c)
start_time = time.time()
lr_h0.fit(X_train,y_train)
yhat_h0 = lr_h0.predict(X_test)
time3 = (time.time() - start_time)
print("Accuracy of Steepest Gradient: ", accuracy_score(y_test['hypertension'],yhat_h0), ' - hypertension')

print("--- {} seconds ---".format(float(time1+time2+time3)/3))
print("\n")

lr_s1 = LogisticRegression(optimization="tgd", eta=max_eta, regularization='ridge', iterations=500, c=max_c)
start_time = time.time()
lr_s1.fit(X_train,y_train)
yhat_s1 = lr_s1.predict(X_test)
time1 = (time.time() - start_time)
print("Accuracy of Steepest Gradient L2 regularization: ", accuracy_score(y_test['stroke'],yhat_s1), ' - stroke')

lr_hd1 = LogisticRegression(optimization="tgd", eta=max_eta, regularization='ridge', iterations=500, c=max_c)
start_time = time.time()
lr_hd1.fit(X_train,y_train)
yhat_hd1 = lr_hd1.predict(X_test)
time2 = (time.time() - start_time)
print("Accuracy of Steepest Gradient L2 regularization: ", accuracy_score(y_test['heart_disease'],yhat_hd1), ' - heart_disease')

lr_h1 = LogisticRegression(optimization="tgd", eta=max_eta, regularization='ridge', iterations=500, c=max_c)
start_time = time.time()
lr_h1.fit(X_train,y_train)
yhat_h1 = lr_h1.predict(X_test)
time3 = (time.time() - start_time)
print("Accuracy of Steepest Gradient L2 regularization: ", accuracy_score(y_test['hypertension'],yhat_h1), ' - hypertension')

print("--- {} seconds ---".format(float(time1+time2+time3)/3))
print("\n")

lr_s2 = LogisticRegression(optimization="tgd", eta=max_eta, regularization='lasso', iterations=500, c=max_c)
start_time = time.time()
lr_s2.fit(X_train,y_train)
yhat_s2 = lr_s2.predict(X_test)
time1 = (time.time() - start_time)
print("Accuracy of Steepest Gradient L1 regularization: ", accuracy_score(y_test['stroke'],yhat_s2), ' - stroke')

lr_hd2 = LogisticRegression(optimization="tgd", eta=max_eta, regularization='lasso', iterations=500, c=max_c)
start_time = time.time()
lr_hd2.fit(X_train,y_train)
yhat_hd2 = lr_hd2.predict(X_test)
time2 = (time.time() - start_time)
print("Accuracy of Steepest Gradient L1 regularization: ", accuracy_score(y_test['heart_disease'],yhat_hd2), ' - heart_disease')

lr_h2 = LogisticRegression(optimization="tgd", eta=max_eta, regularization='lasso', iterations=500, c=max_c)
start_time = time.time()
lr_h2.fit(X_train,y_train)
yhat_h2 = lr_h2.predict(X_test)
time3 = (time.time() - start_time)
print("Accuracy of Steepest Gradient L1 regularization: ", accuracy_score(y_test['hypertension'],yhat_h2), ' - hypertension')

print("--- {} seconds ---".format(float(time1+time2+time3)/3))
print("\n")

lr_s3 = LogisticRegression(optimization="tgd", eta=max_eta, regularization='elastic_net', iterations=500, c=max_c)
start_time = time.time()
lr_s3.fit(X_train,y_train)
yhat_s3 = lr_s3.predict(X_test)
time1 = (time.time() - start_time)
print("Accuracy of Steepest Gradient L12 regularization: ", accuracy_score(y_test['stroke'],yhat_s3), ' - stroke')

lr_hd3 = LogisticRegression(optimization="tgd", eta=max_eta, regularization='elastic_net', iterations=500, c=max_c)
start_time = time.time()
lr_hd3.fit(X_train,y_train)
yhat_hd3 = lr_hd3.predict(X_test)
time2 = (time.time() - start_time)
print("Accuracy of Steepest Gradient L12 regularization: ", accuracy_score(y_test['heart_disease'],yhat_hd3), ' - heart_disease')

lr_h3 = LogisticRegression(optimization="tgd", eta=max_eta, regularization='elastic_net', iterations=500, c=max_c)
start_time = time.time()
lr_h3.fit(X_train,y_train)
yhat_h3 = lr_h3.predict(X_test)
time3 = (time.time() - start_time)
print("Accuracy of Steepest Gradient L12 regularization: ", accuracy_score(y_test['hypertension'],yhat_h3), ' - hypertension')

print("--- {} seconds ---".format(float(time1+time2+time3)/3))

Accuracy of Steepest Gradient:  0.9393346379647749  - stroke
Accuracy of Steepest Gradient:  0.9315068493150684  - heart_disease
Accuracy of Steepest Gradient:  0.8845401174168297  - hypertension
--- 0.1658796469370524 seconds ---


Accuracy of Steepest Gradient L2 regularization:  0.0821917808219178  - stroke
Accuracy of Steepest Gradient L2 regularization:  0.09001956947162426  - heart_disease
Accuracy of Steepest Gradient L2 regularization:  0.13307240704500978  - hypertension
--- 0.17160797119140625 seconds ---


Accuracy of Steepest Gradient L1 regularization:  0.9393346379647749  - stroke
Accuracy of Steepest Gradient L1 regularization:  0.9315068493150684  - heart_disease
Accuracy of Steepest Gradient L1 regularization:  0.8845401174168297  - hypertension
--- 0.17215474446614584 seconds ---


Accuracy of Steepest Gradient L12 regularization:  0.541095890410959  - stroke
Accuracy of Steepest Gradient L12 regularization:  0.5303326810176126  - heart_disease
Accuracy of Steepest Gr
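The repeated start/stop timing in the cell above could be factored into a small helper. A minimal sketch, assuming a hypothetical `timed` function and using `time.perf_counter`, which is intended for measuring short durations:

```python
import time

def timed(fn, *args, repeats=3, **kwargs):
    """Call fn `repeats` times; return (last_result, mean_seconds)."""
    elapsed = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        elapsed.append(time.perf_counter() - start)
    return result, sum(elapsed) / len(elapsed)

# Stand-in workload; in the notebook this would be a fit-and-predict call.
result, mean_seconds = timed(sum, range(100_000))
print(result, mean_seconds >= 0.0)
```

Averaging over repeated runs of the same call reduces the effect of one-off delays on the reported time.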

### Stochastic Gradient Descent

In [15]:
import operator
max_eta = -1
accuracy_eta_list = dict()
for val in range(58):
    lr = LogisticRegression(optimization='sgd',eta=val/100, regularization='none', iterations=10, c=0.001)
    for col in targets:
        lr.fit(X_train, y_train)
        yhat = lr.predict(X_test)
        accuracy = accuracy_score(y_test[col],yhat)
        accuracy_eta_list[accuracy] = val/100
new_list = dict(sorted(accuracy_eta_list.items(), key=operator.itemgetter(1), reverse=True)[:10])
print('Max Accuracy          Max Eta')
for key in new_list:
    print(key, '  ', new_list[key])
max_accuracy = list(new_list.keys())[0] 
max_eta = new_list[max_accuracy]

max_accuracy = -1
print('\nMax Accuracy       Max c')

for val in range(100):
    lr = LogisticRegression(optimization='sgd', eta=max_eta, regularization='none', iterations=500, c=val/1000.0)
    for col in targets:
        lr.fit(X_train, y_train)
        yhat = lr.predict(X_test)
        accuracy = accuracy_score(y_test[col],yhat)
    
    if accuracy > max_accuracy:
        max_accuracy = accuracy
        max_c = val/1000.0
        print(max_accuracy, max_c)
        
print("\nFinal eta: ", max_eta)
print("Final c: ", max_c)

Max Accuracy          Max Eta
0.9393346379647749    0.57
0.8845401174168297    0.57
0.0    0.57
0.9315068493150684    0.56
0.060665362035225046    0.56
0.8816046966731899    0.55
0.0684931506849315    0.54
0.8072407045009785    0.53
0.49608610567514677    0.52
0.8679060665362035    0.52

Max Accuracy       Max c
0.8062622309197651 0.0
0.8855185909980431 0.002

Final eta:  0.57
Final c:  0.002


In [16]:
lr_s0 = LogisticRegression(optimization="sgd", eta=max_eta, regularization='none', iterations=500, c=max_c)
start_time = time.time()
lr_s0.fit(X_train,y_train)
yhat_s0 = lr_s0.predict(X_test)
time1 = (time.time() - start_time)
print("Accuracy of Stochastic Gradient: ", accuracy_score(y_test['stroke'],yhat_s0), ' - stroke')

lr_hd0 = LogisticRegression(optimization="sgd", eta=max_eta, regularization='none', iterations=500, c=max_c)
start_time = time.time()
lr_hd0.fit(X_train,y_train)
yhat_hd0 = lr_hd0.predict(X_test)
time2 = (time.time() - start_time)
print("Accuracy of Stochastic Gradient: ", accuracy_score(y_test['heart_disease'],yhat_hd0), ' - heart_disease')

lr_h0 = LogisticRegression(optimization="sgd", eta=max_eta, regularization='none', iterations=500, c=max_c)
start_time = time.time()
lr_h0.fit(X_train,y_train)
yhat_h0 = lr_h0.predict(X_test)
time3 = (time.time() - start_time)
print("Accuracy of Stochastic Gradient: ", accuracy_score(y_test['hypertension'],yhat_h0), ' - hypertension')

print("--- {} seconds ---".format(float(time1+time2+time3)/3))
print("\n")

lr_s1 = LogisticRegression(optimization="sgd", eta=max_eta, regularization='ridge', iterations=500, c=max_c)
start_time = time.time()
lr_s1.fit(X_train,y_train)
yhat_s1 = lr_s1.predict(X_test)
time1 = (time.time() - start_time)
print("Accuracy of Stochastic Gradient L2 regularization: ", accuracy_score(y_test['stroke'],yhat_s1), ' - stroke')

lr_hd1 = LogisticRegression(optimization="sgd", eta=max_eta, regularization='ridge', iterations=500, c=max_c)
start_time = time.time()
lr_hd1.fit(X_train,y_train)
yhat_hd1 = lr_hd1.predict(X_test)
time2 = (time.time() - start_time)
print("Accuracy of Stochastic Gradient L2 regularization: ", accuracy_score(y_test['heart_disease'],yhat_hd1), ' - heart_disease')

lr_h1 = LogisticRegression(optimization="sgd", eta=max_eta, regularization='ridge', iterations=500, c=max_c)
start_time = time.time()
lr_h1.fit(X_train,y_train)
yhat_h1 = lr_h1.predict(X_test)
time3 = (time.time() - start_time)
print("Accuracy of Stochastic Gradient L2 regularization: ", accuracy_score(y_test['hypertension'],yhat_h1), ' - hypertension')

print("--- {} seconds ---".format(float(time1+time2+time3)/3))
print("\n")

lr_s2 = LogisticRegression(optimization="sgd", eta=max_eta, regularization='lasso', iterations=500, c=max_c)
start_time = time.time()
lr_s2.fit(X_train,y_train)
yhat_s2 = lr_s2.predict(X_test)
time1 = (time.time() - start_time)
print("Accuracy of Stochastic Gradient L1 regularization: ", accuracy_score(y_test['stroke'],yhat_s2), ' - stroke')

lr_hd2 = LogisticRegression(optimization="sgd", eta=max_eta, regularization='lasso', iterations=500, c=max_c)
start_time = time.time()
lr_hd2.fit(X_train,y_train)
yhat_hd2 = lr_hd2.predict(X_test)
time2 = (time.time() - start_time)
print("Accuracy of Stochastic Gradient L1 regularization: ", accuracy_score(y_test['heart_disease'],yhat_hd2), ' - heart_disease')

lr_h2 = LogisticRegression(optimization="sgd", eta=max_eta, regularization='lasso', iterations=500, c=max_c)
start_time = time.time()
lr_h2.fit(X_train,y_train)
yhat_h2 = lr_h2.predict(X_test)
time3 = (time.time() - start_time)
print("Accuracy of Stochastic Gradient L1 regularization: ", accuracy_score(y_test['hypertension'],yhat_h2), ' - hypertension')

print("--- {} seconds ---".format(float(time1+time2+time3)/3))
print("\n")

lr_s3 = LogisticRegression(optimization="sgd", eta=max_eta, regularization='elastic_net', iterations=500, c=max_c)
start_time = time.time()
lr_s3.fit(X_train,y_train)
yhat_s3 = lr_s3.predict(X_test)
time1 = (time.time() - start_time)
print("Accuracy of Stochastic Gradient L12 regularization: ", accuracy_score(y_test['stroke'],yhat_s3), ' - stroke')

lr_hd3 = LogisticRegression(optimization="sgd", eta=max_eta, regularization='elastic_net', iterations=500, c=max_c)
start_time = time.time()
lr_hd3.fit(X_train,y_train)
yhat_hd3 = lr_hd3.predict(X_test)
time2 = (time.time() - start_time)
print("Accuracy of Stochastic Gradient L12 regularization: ", accuracy_score(y_test['heart_disease'],yhat_hd3), ' - heart_disease')

lr_h3 = LogisticRegression(optimization="sgd", eta=max_eta, regularization='elastic_net', iterations=500, c=max_c)
start_time = time.time()
lr_h3.fit(X_train,y_train)
yhat_h3 = lr_h3.predict(X_test)
time3 = (time.time() - start_time)
print("Accuracy of Stochastic Gradient L12 regularization: ", accuracy_score(y_test['hypertension'],yhat_h3), ' - hypertension')

print("--- {} seconds ---".format(float(time1+time2+time3)/3))


Accuracy of Stochastic Gradient:  0.7915851272015656  - stroke
Accuracy of Stochastic Gradient:  0.9315068493150684  - heart_disease
Accuracy of Stochastic Gradient:  0.5205479452054794  - hypertension
--- 0.03335269292195638 seconds ---


Accuracy of Stochastic Gradient L2 regularization:  0.9393346379647749  - stroke
Accuracy of Stochastic Gradient L2 regularization:  0.9315068493150684  - heart_disease
Accuracy of Stochastic Gradient L2 regularization:  0.8786692759295499  - hypertension
--- 0.03061366081237793 seconds ---


Accuracy of Stochastic Gradient L1 regularization:  0.2827788649706458  - stroke
Accuracy of Stochastic Gradient L1 regularization:  0.9315068493150684  - heart_disease
Accuracy of Stochastic Gradient L1 regularization:  0.8845401174168297  - hypertension
--- 0.03398569424947103 seconds ---


Accuracy of Stochastic Gradient L12 regularization:  0.9393346379647749  - stroke
Accuracy of Stochastic Gradient L12 regularization:  0.9315068493150684  - heart_disease
A

### Newton's Method

In [17]:
import operator
max_eta = -1
accuracy_eta_list = dict()
for val in range(58):
    lr = LogisticRegression(optimization='newton',eta=val/100, regularization='none', iterations=10, c=0.001)
    for col in targets:
        lr.fit(X_train, y_train)
        yhat = lr.predict(X_test)
        accuracy = accuracy_score(y_test[col],yhat)
        accuracy_eta_list[accuracy] = val/100
new_list = dict(sorted(accuracy_eta_list.items(), key=operator.itemgetter(1), reverse=True)[:10])
print('Max Accuracy          Max Eta')
for key in new_list:
    print(key, '  ', new_list[key])
max_accuracy = list(new_list.keys())[0] 
max_eta = new_list[max_accuracy]
    
max_accuracy = -1
print('\nMax Accuracy       Max c')

for val in range(100):
    lr = LogisticRegression(optimization='newton', eta=max_eta, regularization='none', iterations=10, c=val/1000.0)
    for col in targets:
        lr.fit(X_train, y_train)
        yhat = lr.predict(X_test)
        accuracy = accuracy_score(y_test[col],yhat)
    
    if accuracy > max_accuracy:
        max_accuracy = accuracy
        max_c = val/1000.0
        print(max_accuracy, max_c)
        
print("\nFinal eta: ", max_eta)
print("Final c: ", max_c)

Max Accuracy          Max Eta
0.026418786692759294    0.57
0.021526418786692758    0.57
0.022504892367906065    0.57
0.029354207436399216    0.55
0.025440313111545987    0.55
0.02446183953033268    0.55
0.030332681017612523    0.54
0.03131115459882583    0.5
0.03424657534246575    0.47
0.033268101761252444    0.45

Max Accuracy       Max c
0.022504892367906065 0.0

Final eta:  0.57
Final c:  0.0


In [18]:
lr_s0 = LogisticRegression(optimization="newton", eta=max_eta, regularization='none', iterations=10, c=max_c)
start_time = time.time()
lr_s0.fit(X_train,y_train)
yhat_s0 = lr_s0.predict(X_test)
time1 = (time.time() - start_time)
print("Accuracy of Newton's Method: ", accuracy_score(y_test['stroke'],yhat_s0), ' - stroke')

lr_hd0 = LogisticRegression(optimization="newton", eta=max_eta, regularization='none', iterations=10, c=max_c)
start_time = time.time()
lr_hd0.fit(X_train,y_train)
yhat_hd0 = lr_hd0.predict(X_test)
time2 = (time.time() - start_time)
print("Accuracy of Newton's Method: ", accuracy_score(y_test['heart_disease'],yhat_hd0), ' - heart_disease')

lr_h0 = LogisticRegression(optimization="newton", eta=max_eta, regularization='none', iterations=10, c=max_c)
start_time = time.time()
lr_h0.fit(X_train,y_train)
yhat_h0 = lr_h0.predict(X_test)
time3 = (time.time() - start_time)
print("Accuracy of Newton's Method: ", accuracy_score(y_test['hypertension'],yhat_h0), ' - hypertension')

print("--- {} seconds ---".format(float(time1+time2+time3)/3))
print("\n")

lr_s1 = LogisticRegression(optimization="newton", eta=max_eta, regularization='ridge', iterations=10, c=max_c)
start_time = time.time()
lr_s1.fit(X_train,y_train)
yhat_s1 = lr_s1.predict(X_test)
time1 = (time.time() - start_time)
print("Accuracy of Newton's Method L1 regularization: ", accuracy_score(y_test['stroke'],yhat_s1), ' - stroke')

lr_hd1 = LogisticRegression(optimization="newton", eta=max_eta, regularization='ridge', iterations=10, c=max_c)
start_time = time.time()
lr_hd1.fit(X_train,y_train)
yhat_hd1 = lr_hd1.predict(X_test)
time2 = (time.time() - start_time)
print("Accuracy of Newton's Method L1 regularization: ", accuracy_score(y_test['heart_disease'],yhat_hd1), ' - heart_disease')

lr_h1 = LogisticRegression(optimization="newton", eta=max_eta, regularization='ridge', iterations=10, c=max_c)
start_time = time.time()
lr_h1.fit(X_train,y_train)
yhat_h1 = lr_h1.predict(X_test)
time3 = (time.time() - start_time)
print("Accuracy of Newton's Method L1 regularization: ", accuracy_score(y_test['hypertension'],yhat_h1), ' - hypertension')

print("--- {} seconds ---".format(float(time1+time2+time3)/3))
print("\n")

lr_s2 = LogisticRegression(optimization="newton", eta=max_eta, regularization='lasso', iterations=10, c=max_c)
start_time = time.time()
lr_s2.fit(X_train,y_train)
yhat_s2 = lr_s2.predict(X_test)
time1 = (time.time() - start_time)
print("Accuracy of Newton's Method L2 regularization: ", accuracy_score(y_test['stroke'],yhat_s2), ' - stroke')

lr_hd2 = LogisticRegression(optimization="newton", eta=max_eta, regularization='lasso', iterations=10, c=max_c)
start_time = time.time()
lr_hd2.fit(X_train,y_train)
yhat_hd2 = lr_hd2.predict(X_test)
time2 = (time.time() - start_time)
print("Accuracy of Newton's Method L2 regularization: ", accuracy_score(y_test['heart_disease'],yhat_hd2), ' - heart_disease')

lr_h2 = LogisticRegression(optimization="newton", eta=max_eta, regularization='lasso', iterations=10, c=max_c)
start_time = time.time()
lr_h2.fit(X_train,y_train)
yhat_h2 = lr_h2.predict(X_test)
time3 = (time.time() - start_time)
print("Accuracy of Newton's Method L2 regularization: ", accuracy_score(y_test['hypertension'],yhat_h2), ' - hypertension')

print("--- {} seconds ---".format(float(time1+time2+time3)/3))
print("\n")

lr_s3 = LogisticRegression(optimization="newton", eta=max_eta, regularization='elastic_net', iterations=10, c=max_c)
start_time = time.time()
lr_s3.fit(X_train,y_train)
yhat_s3 = lr_s3.predict(X_test)
time1 = (time.time() - start_time)
print("Accuracy of Newton's Method L12 regularization: ", accuracy_score(y_test['stroke'],yhat_s3), ' - stroke')

lr_hd3 = LogisticRegression(optimization="newton", eta=max_eta, regularization='elastic_net', iterations=10, c=max_c)
start_time = time.time()
lr_hd3.fit(X_train,y_train)
yhat_hd3 = lr_hd3.predict(X_test)
time2 = (time.time() - start_time)
print("Accuracy of Newton's Method L12 regularization: ", accuracy_score(y_test['heart_disease'],yhat_hd3), ' - heart_disease')

lr_h3 = LogisticRegression(optimization="newton", eta=max_eta, regularization='elastic_net', iterations=10, c=max_c)
start_time = time.time()
lr_h3.fit(X_train,y_train)
yhat_h3 = lr_h3.predict(X_test)
time3 = (time.time() - start_time)
print("Accuracy of Newton's Method L12 regularization: ", accuracy_score(y_test['hypertension'],yhat_h3), ' - hypertension')

print("--- {} seconds ---".format(float(time1+time2+time3)/3))


Accuracy of Newton's Method:  0.021526418786692758  - stroke
Accuracy of Newton's Method:  0.026418786692759294  - heart_disease
Accuracy of Newton's Method:  0.022504892367906065  - hypertension
--- 0.9281144936879476 seconds ---


Accuracy of Newton's Method L1 regularization:  0.021526418786692758  - stroke
Accuracy of Newton's Method L1 regularization:  0.026418786692759294  - heart_disease
Accuracy of Newton's Method L1 regularization:  0.022504892367906065  - hypertension
--- 0.9562834103902181 seconds ---


Accuracy of Newton's Method L2 regularization:  0.021526418786692758  - stroke
Accuracy of Newton's Method L2 regularization:  0.026418786692759294  - heart_disease
Accuracy of Newton's Method L2 regularization:  0.022504892367906065  - hypertension
--- 0.9530626932779948 seconds ---


Accuracy of Newton's Method L12 regularization:  0.021526418786692758  - stroke
Accuracy of Newton's Method L12 regularization:  0.026418786692759294  - heart_disease
Accuracy of Newton's Metho

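We do not believe data snooping occurred, because the parameters were chosen systematically rather than by hand: we loop over a range of values for c and eta and keep the values selected by the search. Two caveats apply. First, each candidate is scored on the test split (`X_test`, `y_test`), so the reported accuracies may be optimistic; scoring candidates on a separate validation split would make this argument stronger. Second, the eta search sorts its candidates by eta rather than by accuracy, so the final eta of 0.57 is the largest value tried rather than the most accurate one (0.47 scored highest among the values printed).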
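
A common safeguard against data snooping is to score candidate parameters on a validation split carved out of the training data, so the test split is used only once at the end. The following is a minimal sketch of that pattern. The synthetic data, the `fit_logistic` and `accuracy` helpers, and the candidate eta values are all assumptions standing in for the report's `X_train`/`y_train` and `LogisticRegression` class.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, eta, iterations=200):
    """Plain gradient ascent on the log-likelihood (hypothetical stand-in classifier)."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # add bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(iterations):
        w += eta * Xb.T @ (y - sigmoid(Xb @ w)) / len(y)
    return w

def accuracy(w, X, y):
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return float(np.mean((sigmoid(Xb @ w) > 0.5) == y))

# Synthetic stand-in data (an assumption, not the report's dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 3))
y = (X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=600) > 0).astype(float)

# 60/20/20 split into train / validation / test indices
idx = rng.permutation(len(y))
train, val, test = idx[:360], idx[360:480], idx[480:]

# Score every candidate eta on the validation split only
val_scores = {}
for eta in [0.01, 0.05, 0.1, 0.5, 1.0]:
    w = fit_logistic(X[train], y[train], eta)
    val_scores[eta] = accuracy(w, X[val], y[val])

best_eta = max(val_scores, key=val_scores.get)  # candidate with the highest validation accuracy

# The test split is touched once, after the choice has been made
w_final = fit_logistic(X[train], y[train], best_eta)
test_accuracy = accuracy(w_final, X[test], y[test])
print("best eta:", best_eta, "test accuracy:", round(test_accuracy, 3))
```

Keying the scores by the parameter value (rather than by accuracy) also avoids losing candidates that happen to share the same accuracy.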

### Performance Comparison

In [None]:
#whenever I tried to run and visualize the graph, my computer would freeze so I could not graph it unfortunately :(

import plotly.graph_objects as go

titles = ['Optimization/Regularization', 'Training Time(s)', 'Accuracy']
columns = [
    ['Steepest Gradient Descent', 'Steepest Gradient Descent with L1 Regularization', 'Steepest Gradient Descent with L2 Regularization', 'Steepest Gradient Descent with L1 and L2 Regularization',
     'Stochastic Gradient Descent', 'Stochastic Gradient Descent with L1 Regularization', 'Stochastic Gradient Descent with L2 Regularization', 'Stochastic Gradient Descent with L1 and L2 Regularization',
     'Newton\'s Method', 'Newton\'s Method with L1 Regularization', 'Newton\'s Method with L2 Regularization', 'Newton\'s Method with L1 and L2 Regularization'],
    # one training time per row above (averaged over the three targets)
    [0.16378935178120932, 0.15881760915120444, 0.161603053410848, 0.15591899553934732, 0.03335269292195638, 0.03061366081237793, 0.03398569424947103, 0.0348658561706543, 0.9281144936879476, 0.9562834103902181, 0.9530626932779948, 0.975908358891805],
    # one stroke accuracy per row above
    [.9393346379647749, .0821917808219178, .9393346379647749, .9393346379647749, .7915851272015656, .9393346379647749, .2827788649706458, .9393346379647749, .021526418786692758, .021526418786692758, .021526418786692758, .021526418786692758],
]

fig = go.Figure(data=[go.Table(header=dict(values=titles),
                 cells=dict(values=columns))
                     ])
fig = fig.update_layout(title="Stroke")
fig.show()

#######################################################################################################

# Training Time of Steepest Gradient: 0.16378935178120932 seconds
# Training Time of Steepest Gradient L1 regularization: 0.15881760915120444 seconds
# Training Time of Steepest Gradient L2 regularization: 0.161603053410848 seconds
# Training Time of Steepest Gradient L12 regularization:  0.15591899553934732 seconds

# Training Time of Stochastic Gradient: 0.03335269292195638 seconds
# Training Time of Stochastic Gradient L1 regularization: 0.03061366081237793 seconds
# Training Time of Stochastic Gradient L2 regularization: 0.03398569424947103 seconds
# Training Time of Stochastic Gradient L12 regularization: 0.0348658561706543 seconds

# Training Time of Newton's Method: 0.9281144936879476 seconds
# Training Time of Newton's Method L1 regularization: 0.9562834103902181 seconds
# Training Time of Newton's Method L2 regularization: 0.9530626932779948 seconds
# Training Time of Newton's Method L12 regularization: 0.975908358891805 seconds

# Training Time of SKL (stroke): 0.04746890068054199 seconds
# Training Time of SKL (heart_disease): 0.0448000431060791 seconds
# Training Time of SKL (hypertension): 0.04042816162109375 seconds

#######################################################################################################


# Accuracy of Steepest Gradient:  0.9393346379647749  - stroke
# Accuracy of Steepest Gradient L1 regularization:  0.0821917808219178  - stroke
# Accuracy of Steepest Gradient L2 regularization:  0.9393346379647749  - stroke
# Accuracy of Steepest Gradient L12 regularization:  0.9393346379647749  - stroke

# Accuracy of Stochastic Gradient:  0.7915851272015656  - stroke
# Accuracy of Stochastic Gradient L1 regularization:  0.9393346379647749  - stroke
# Accuracy of Stochastic Gradient L2 regularization:  0.2827788649706458  - stroke
# Accuracy of Stochastic Gradient L12 regularization:  0.9393346379647749  - stroke

# Accuracy of Newton's Method:  0.021526418786692758  - stroke
# Accuracy of Newton's Method L1 regularization:  0.021526418786692758  - stroke
# Accuracy of Newton's Method L2 regularization:  0.021526418786692758  - stroke
# Accuracy of Newton's Method L12 regularization:  0.021526418786692758  - stroke


#######################################################################################################

# Accuracy of Steepest Gradient:  0.9315068493150684  - heart_disease
# Accuracy of Steepest Gradient L1 regularization:  0.09001956947162426  - heart_disease
# Accuracy of Steepest Gradient L2 regularization:  0.9315068493150684  - heart_disease
# Accuracy of Steepest Gradient L12 regularization:  0.5303326810176126  - heart_disease

# Accuracy of Stochastic Gradient:  0.9315068493150684  - heart_disease
# Accuracy of Stochastic Gradient L1 regularization:  0.9315068493150684  - heart_disease
# Accuracy of Stochastic Gradient L2 regularization:  0.9315068493150684  - heart_disease
# Accuracy of Stochastic Gradient L12 regularization:  0.9315068493150684  - heart_disease

# Accuracy of Newton's Method:  0.026418786692759294  - heart_disease
# Accuracy of Newton's Method L1 regularization:  0.026418786692759294  - heart_disease
# Accuracy of Newton's Method L2 regularization:  0.026418786692759294  - heart_disease
# Accuracy of Newton's Method L12 regularization:  0.026418786692759294  - heart_disease

########################################################################################################

# Accuracy of Steepest Gradient:  0.8845401174168297  - hypertension
# Accuracy of Steepest Gradient L1 regularization:  0.13307240704500978  - hypertension
# Accuracy of Steepest Gradient L2 regularization:  0.8845401174168297  - hypertension
# Accuracy of Steepest Gradient L12 regularization:  0.49902152641878667  - hypertension


# Accuracy of Stochastic Gradient:  0.5205479452054794  - hypertension
# Accuracy of Stochastic Gradient L1 regularization:  0.8786692759295499  - hypertension
# Accuracy of Stochastic Gradient L2 regularization:  0.8845401174168297  - hypertension
# Accuracy of Stochastic Gradient L12 regularization:  0.7925636007827789  - hypertension


# Accuracy of Newton's Method:  0.022504892367906065  - hypertension
# Accuracy of Newton's Method L1 regularization:  0.022504892367906065  - hypertension
# Accuracy of Newton's Method L2 regularization:  0.022504892367906065  - hypertension
# Accuracy of Newton's Method L12 regularization:  0.022504892367906065  - hypertension


########################################################################################################

# Accuracy of SKL (stroke): 0.9393346379647749
# Accuracy of SKL (heart_disease): 0.9305283757338552
# Accuracy of SKL (hypertension): 0.8835616438356164

Looking at the results, Stochastic Gradient Descent had the fastest training time (about 0.03 seconds, averaged over the three targets), slightly faster than SKL (0.04 to 0.05 seconds), whereas Newton's method was the slowest (0.93 to 0.98 seconds). Newton's method also had the lowest accuracy on every target (about 2 to 3%). This could mean that Newton's method is the worst optimization for our dataset, or the accuracy could improve with a different implementation. Because of how long Newton's method took to run, we could only perform a few tests compared to the other optimizations.

Without regularization, Steepest Gradient was more accurate than Stochastic Gradient Descent on the stroke target (0.939 vs. 0.792) and the hypertension target (0.885 vs. 0.521), and the two tied on heart disease (0.932). Stochastic Gradient Descent was less sensitive to the choice of regularization: with L12 regularization it reached 0.932 on heart disease and 0.793 on hypertension, where Steepest Gradient dropped to 0.530 and 0.499. Among the gradient-based runs, Steepest Gradient with L1 regularization was consistently the least accurate (0.08 to 0.13 on all three targets). The runs labeled L1 use `regularization='ridge'` in our code (the reverse of the usual convention, in which ridge is the L2 penalty and lasso is the L1 penalty), so this could mean that the ridge penalty does not suit this optimizer as well as the other regularizations. Overall, unregularized Steepest Gradient and Steepest Gradient with L2 regularization were the most accurate of our implementations on every target, matching or slightly exceeding SKL.

The highest accuracy was ~94%, on the stroke target, and was reached by:
* Steepest Gradient (no regularization)
* Steepest Gradient (L2 regularization)
* Steepest Gradient (L12 regularization)
* Stochastic Gradient (L1 regularization)
* Stochastic Gradient (L12 regularization)
* SKL
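
Since the plotly table could not be rendered, the same comparison can be printed as plain text. This is a minimal sketch using the stroke numbers reported above (the training times are the averages over the three targets, as printed by the timing code). The row labels follow the report's own L1/L2/L12 naming, and the column widths are arbitrary choices.

```python
# Stroke-target results copied from the numbers reported in this section
rows = [
    ("Steepest Gradient",       0.16378935178120932, 0.9393346379647749),
    ("Steepest Gradient L1",    0.15881760915120444, 0.0821917808219178),
    ("Steepest Gradient L2",    0.161603053410848,   0.9393346379647749),
    ("Steepest Gradient L12",   0.15591899553934732, 0.9393346379647749),
    ("Stochastic Gradient",     0.03335269292195638, 0.7915851272015656),
    ("Stochastic Gradient L1",  0.03061366081237793, 0.9393346379647749),
    ("Stochastic Gradient L2",  0.03398569424947103, 0.2827788649706458),
    ("Stochastic Gradient L12", 0.0348658561706543,  0.9393346379647749),
    ("Newton's Method",         0.9281144936879476,  0.021526418786692758),
    ("Newton's Method L1",      0.9562834103902181,  0.021526418786692758),
    ("Newton's Method L2",      0.9530626932779948,  0.021526418786692758),
    ("Newton's Method L12",     0.975908358891805,   0.021526418786692758),
]

# Fixed-width columns: name (28 chars), time (10 chars), accuracy (10 chars)
header = f"{'Optimization/Regularization':<28}{'Time (s)':>10}{'Accuracy':>10}"
lines = [header, "-" * len(header)]
for name, seconds, acc in rows:
    lines.append(f"{name:<28}{seconds:>10.4f}{acc:>10.4f}")

table = "\n".join(lines)
print(table)
```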

### Scikit-learn

In [20]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
for col in targets:
    start_time = time.time()
    clf.fit(X_train, y_train[col])
    clf.predict(X_test)
    clf.predict_proba(X_test)
    print(clf.score(X_test,y_test[col]), ' - ', col)
    print("--- %s seconds ---" % (time.time() - start_time), "\n")

0.9393346379647749  -  stroke
--- 0.04746890068054199 seconds --- 

0.9305283757338552  -  heart_disease
--- 0.0448000431060791 seconds --- 

0.8835616438356164  -  hypertension
--- 0.04042816162109375 seconds --- 




lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
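
The warning above suggests scaling the data or raising `max_iter`. The following is a minimal sketch of doing both with a scikit-learn pipeline. The synthetic data (with one deliberately large-scale feature) and the value `max_iter=1000` are assumptions standing in for the report's `X_train`, `y_train[col]`, and whatever limit suits the real data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with one feature on a much larger scale than the rest
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X[:, 0] *= 1000.0
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Standardize the features, then fit with a higher iteration limit
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)

score = clf.score(X_te, y_te)
n_iter = clf.named_steps["logisticregression"].n_iter_  # iterations actually used
print("accuracy:", score, "iterations used:", n_iter)
```

Because the scaler sits inside the pipeline, it is fit on the training split only and then applied unchanged to the test split.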



## Deployment (1pt)

<ul>
    <li>Which implementation of logistic regression would you advise be used in a deployed machine learning model, your implementation or scikit-learn (or other third party)? Why?</li>
</ul>

We believe Scikit-learn's logistic regression implementation should be used over our implementation. Its solvers run in compiled code rather than pure Python, so it is better optimized. It is also more reliable because it is a widely used open-source library that is updated frequently and maintained by its community. New optimizations added to the library can be adopted with minimal work on our side to improve the efficiency and accuracy of the model.

Scikit-learn's implementation was also much faster than our Steepest Gradient (about 0.04 seconds versus 0.16 seconds) and Newton's Method (about 0.95 seconds) optimizations, although our Stochastic Gradient Descent was slightly faster (about 0.03 seconds). This is important because the faster the implementation can train, the faster a model can be updated, making it cheaper to scale.

Overall, the SKL implementation was accurate for all targets (0.939 for stroke, 0.931 for heart disease, and 0.884 for hypertension). With the combination of speed, accuracy, and maintainability, we believe their implementation would be the better choice in a deployed ML model.
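
The timings in this report come from single runs measured with `time.time()`, which can be noisy at the scale of hundredths of a second. The following is a minimal sketch of a more stable measurement that repeats the call and takes the median. The `time_call` helper and the `toy_workload` function are hypothetical; in the report, the timed call would be something like `clf.fit(X_train, y_train[col])`.

```python
import statistics
import time

def time_call(fn, repeats=5):
    """Return the median wall-clock duration of fn() over several runs."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        fn()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)

def toy_workload():
    # Hypothetical stand-in for a model's fit call
    return sum(i * i for i in range(50_000))

median_seconds = time_call(toy_workload)
print(f"median over 5 runs: {median_seconds:.6f} s")
```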

## Exceptional Work (1pt)

<ul>
    <li>You have free rein to provide additional analyses. <b>One idea</b>: Update the code to use either "one-versus-all" or "one-versus-one" extensions of binary to multi-class classification. </li>
    <li><b>Required for 7000 level students</b>: Choose ONE of the following:
    <ul>
        <li><b>Option One</b>: Implement an optimization technique for logistic regression using <b>mean square error</b> as your objective function (instead of binary cross entropy). Derive the gradient updates for the Hessian and use Newton's method to update the values of "w". Then answer, is this process better than using binary cross entropy? </li>
        <li><b>Option Two</b>: Implement the BFGS algorithm from scratch to optimize logistic regression. That is, use BFGS without the use of an external package (for example, do not use SciPy). Compare your performance accuracy and runtime to the BFGS implementation in SciPy (that we used in lecture). </li>
    </ul>
    </li>
</ul>

In [None]:
class BinaryLogisticRegressionBase:
    # private:
    def __init__(self, optimization='bgd', eta = 0.01, iterations=20, regularization='none', c=0):
        self.eta = eta
        self.iters = iterations
        self.opt = optimization
        self.reg = regularization
        self.c = c
        # internally we will store the weights as self.w_ to keep with sklearn conventions
    
    def __str__(self):
        return 'Base Binary Logistic Regression Object, Not Trainable'
    
    # convenience, private and static:
    @staticmethod
    def _sigmoid(theta):
        return 1/(1+np.exp(-theta)) 
    
    @staticmethod
    def _add_bias(X):
        return np.hstack((np.ones((X.shape[0],1)),X)) # add bias term
    
    # public:
    def predict_proba(self,X,add_bias=True):
        # add bias term if requested
        Xb = self._add_bias(X) if add_bias else X
        return self._sigmoid(Xb @ self.w_) # return the probability y=1
    
    def predict(self,X):
        return (self.predict_proba(X)>0.5) #return the actual prediction

In [None]:
from scipy.special import expit
from numpy.linalg import pinv
from sklearn.metrics import mean_squared_error

class BinaryLogisticRegression(BinaryLogisticRegressionBase):
    #private:
    def __str__(self):
        if(hasattr(self,'w_')):
            return 'Binary Logistic Regression Object with coefficients:\n'+ str(self.w_) # if we have trained the object
        else:
            return 'Untrained Binary Logistic Regression Object'
        
    #optimization methods
    def _get_gradient(self, X, y):
        # pick the update rule that matches the requested optimization
        if self.opt == 'tgd': gradient = self.steepest_descent
        elif self.opt == 'sgd': gradient = self.stochastic_gradient_descent
        elif self.opt == 'mse_newton': gradient = self.mse_newton
        else: raise ValueError('Unknown optimization: ' + str(self.opt))
        return gradient(X,y)
    
    def steepest_descent(self,X,y):
        ydiff = y-self.predict_proba(X,add_bias=False).ravel() # get y difference
        gradient = np.mean(X * ydiff[:,np.newaxis], axis=0) # make ydiff a column vector and multiply through
        gradient = gradient.reshape(self.w_.shape)
        gradient[1:] += self.c * self._get_reg_gradient()
        
        return gradient
    
    def stochastic_gradient_descent(self,X,y):
        idx = np.random.randint(len(y)) # grab random instance
        ydiff = y[idx]-self.predict_proba(X[idx],add_bias=False) # get y difference (now scalar)
        gradient = X[idx] * ydiff[:,np.newaxis] # make ydiff a column vector and multiply through
        
        gradient = gradient.reshape(self.w_.shape)
        gradient[1:] += self.c * self._get_reg_gradient()
        
        return gradient
    
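    def mse_newton(self, X, y):
        # Newton step for the mean squared error J(w) = mean((y - g)^2), with g = sigmoid(Xw)
        y = np.asarray(y).ravel()
        g = self.predict_proba(X,add_bias=False).ravel() # get sigmoid value for all instances
        s = g*(1-g) # derivative of the sigmoid
        ydiff = y-g # get y difference

        # ascent direction (negative gradient of J): (2/N) * sum((y-g) * g(1-g) * x)
        gradient = 2 * np.mean(X * (ydiff*s)[:,np.newaxis], axis=0)
        gradient = gradient.reshape(self.w_.shape)
        gradient[1:] += self.c * self._get_reg_gradient()

        # hessian of J: (2/N) * X^T diag(s^2 - (y-g)*s*(1-2g)) X
        # note: J is not convex in w, so this hessian can be indefinite
        hessian = 2 * (X.T * (s**2 - ydiff*s*(1-2*g))) @ X / len(y)
        if self.reg in ('ridge', 'elastic_net'):
            hessian[1:,1:] += 2 * self.c * np.eye(len(self.w_)-1) # curvature of the L2 penalty, bias excluded

        return pinv(hessian) @ gradient # fit() adds eta times this step to w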
    
    @staticmethod
    def _sigmoid(theta):
        # increase stability, redefine sigmoid operation
        return expit(theta) #1/(1+np.exp(-theta))
    
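    #regularization methods
    def _get_reg_gradient(self):
        # gradient of the negated penalty on the non-bias weights (fit() performs gradient ascent)
        if self.reg == 'none':
            return np.zeros_like(self.w_[1:]) # no penalty, so no contribution
        elif self.reg == 'ridge':
            return -2 * self.w_[1:] # L2 penalty
        elif self.reg == 'lasso':
            return -np.sign(self.w_[1:]) # L1 penalty
        elif self.reg == 'elastic_net':
            return -2 * self.w_[1:] - np.sign(self.w_[1:]) # L1 and L2 penalties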
    
    # public:
    def fit(self, X, y):
        Xb = self._add_bias(X) # add bias term
        num_samples, num_features = Xb.shape
        
        self.w_ = np.zeros((num_features,1)) # init weight vector to zeros
        
        # for as many as the max iterations
        for _ in range(self.iters):
            gradient = self._get_gradient(Xb,y)
            self.w_ += gradient*self.eta # multiply by learning rate 
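
For the mean-squared-error objective targeted by `mse_newton`, the gradient and Hessian can be derived with the chain rule and then checked numerically before they are trusted inside an optimizer. The following is a minimal sketch of such a check, assuming the objective is the mean of `(y - sigmoid(Xw))**2` with no regularization. The function names and the synthetic data are hypothetical and are not part of the classes above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mse(w, X, y):
    return np.mean((y - sigmoid(X @ w)) ** 2)

def mse_gradient(w, X, y):
    # dJ/dw = -(2/N) * sum_i (y_i - g_i) * g_i * (1 - g_i) * x_i
    g = sigmoid(X @ w)
    return -2.0 * X.T @ ((y - g) * g * (1 - g)) / len(y)

def mse_hessian(w, X, y):
    # d2J/dw2 = (2/N) * X^T diag(s^2 - (y - g) * s * (1 - 2g)) X, with s = g * (1 - g)
    g = sigmoid(X @ w)
    s = g * (1 - g)
    weights = s ** 2 - (y - g) * s * (1 - 2 * g)
    return 2.0 * (X.T * weights) @ X / len(y)

def numerical_gradient(f, w, eps=1e-6):
    # central finite differences, one coordinate at a time
    grad = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = eps
        grad[i] = (f(w + step) - f(w - step)) / (2 * eps)
    return grad

# Synthetic stand-in data: bias column plus three features, random 0/1 labels
rng = np.random.default_rng(0)
X = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 3))])
y = rng.integers(0, 2, size=50).astype(float)
w = rng.normal(scale=0.1, size=4)

grad = mse_gradient(w, X, y)
H = mse_hessian(w, X, y)

grad_error = np.max(np.abs(grad - numerical_gradient(lambda v: mse(v, X, y), w)))

# Differentiate each gradient component numerically; the Hessian is symmetric,
# so stacking these derivatives as columns reproduces it
num_hess = np.column_stack([
    numerical_gradient(lambda v, i=i: mse_gradient(v, X, y)[i], w) for i in range(len(w))
])
hess_error = np.max(np.abs(H - num_hess))

# One Newton step on the MSE objective
w_new = w - np.linalg.pinv(H) @ grad
print("gradient error:", grad_error, "hessian error:", hess_error)
print("MSE before:", mse(w, X, y), "MSE after one step:", mse(w_new, X, y))
```

Unlike binary cross entropy, this objective is not convex in `w`, so the Hessian can be indefinite away from the starting point and a Newton step is not guaranteed to reduce the error.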

In [None]:
class LogisticRegression:
    
    def __init__(self, optimization, eta, iterations, regularization, c=0):
    
        self.eta = eta
        self.iters = iterations
        self.opt = optimization
        self.reg = regularization
        self.encodings = {}
        self.c = c
        
    
    def __str__(self):
        if(hasattr(self,'w_')):
            return 'MultiClass Logistic Regression Object with coefficients:\n'+ str(self.w_) # if we have trained the object
        else:
            return 'Untrained MultiClass Logistic Regression Object'
    
    def fit(self,X,y): # y is a hash of target columns
        self.classifiers_ = [] # will fill this array with binary classifiers
        
        for name, target in y.items():
            blr = BinaryLogisticRegression(self.opt, self.eta, self.iters, self.reg, self.c )
            blr.fit(X,target)
            # add the trained classifier to the list
            self.classifiers_.append(blr)
            
        # save all the weights into one matrix, separate column for each class
        self.w_ = np.hstack([x.w_ for x in self.classifiers_]).T
        
    def predict_proba(self,X):
        probs = []
        for blr in self.classifiers_:
#             if np.count_nonzero(blr.predict_proba(X)) > 0:
#                 print("Not zero")
            probs.append(blr.predict_proba(X)) # get probability for each classifier
            
        return np.hstack(probs) # make into single matrix
    
    def predict(self,X):
        return np.argmax(self.predict_proba(X),axis=1) # take argmax along row
    
    
lr = LogisticRegression('tgd',0.01, 100, 'ridge')
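
The `predict` method above takes an argmax across the per-target classifiers, so it returns a column index (0, 1, or 2) rather than a 0/1 label for any single target. If the three targets are meant to be scored independently (an assumption about intent), thresholding each column of `predict_proba` gives one binary prediction per target instead. The following is a minimal sketch of the difference, using made-up probabilities.

```python
import numpy as np

# Hypothetical probabilities for 4 patients from three independent binary classifiers
# (columns: stroke, heart_disease, hypertension)
probs = np.array([
    [0.10, 0.20, 0.70],
    [0.05, 0.02, 0.03],
    [0.60, 0.55, 0.90],
    [0.40, 0.80, 0.30],
])

argmax_pred = np.argmax(probs, axis=1)       # index of the most probable column
per_target_pred = (probs > 0.5).astype(int)  # one 0/1 prediction per target

print(argmax_pred)      # column indices, not 0/1 labels
print(per_target_pred)  # rows are patients, columns are targets
```

With the thresholded form, column `j` of the prediction matrix can be compared directly against the matching target column of `y_test`.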

In [None]:
#evaluate on test dataset
from sklearn.metrics import accuracy_score
lr = LogisticRegression(optimization='mse_newton',eta=0.9, regularization='none', iterations=100, c=0)
lr.fit(X_train, y_train) # fit once; every target column is trained inside fit
yhat = lr.predict(X_test)
for col in targets:
    print(col)
    print("Accuracy of Testing Dataset (100 iterations): ", accuracy_score(y_test[col],yhat), " - ", col)

## Results Discussion