# Lab Three - Extending Logistic Regression

In this lab, you will compare the performance of logistic regression optimization programmed in scikit-learn and via your own implementation. You will also modify the optimization procedure for logistic regression. 

This report is worth 10% of the final grade. Please upload a report (<b>one per team</b>) with all code used, visualizations, and text in a rendered Jupyter notebook. For any visualizations that cannot be embedded in the notebook, please provide screenshots of the output. The results should be reproducible using your report. Please carefully describe every assumption and every step in your report.

<b>Dataset Selection</b>

Select a dataset the same way you selected one for lab one (i.e., table data). You are not required to use the same dataset that you used in the past, but you are encouraged to. You must identify a classification task from the dataset that contains <b>three or more classes to predict</b>. That is, it cannot be a binary classification; it must be a multi-class prediction.

## Preparation and Overview (3pt)

<ul>
    <li>[<b>2 points</b>] Explain the task and what business-case or use-case it is designed to solve (or designed to investigate). Detail exactly what the classification task is and what parties would be interested in the results. For example, would the model be deployed or used mostly for offline analysis? </li>
    <li>[<b>.5 points</b>] (<i>mostly the same processes as from previous labs</i>) Define and prepare your class variables. Use proper variable representations (int, float, one-hot, etc.). Use pre-processing methods (as needed) for dimensionality reduction, scaling, etc. Remove variables that are not needed/useful for the analysis. Describe the final dataset that is used for classification/regression (include a description of any newly formed variables you created). </li>
    <li>[<b>.5 points</b>] Divide your data into training and testing data using an 80% training and 20% testing split. Use the cross validation modules that are part of scikit-learn. <b>Argue "for" or "against" splitting your data using an 80/20 split. That is, why is the 80/20 split appropriate (or not) for your dataset?</b></li>
</ul>

### Use Case

Our task is to look at a patient's information and determine whether they are likely to have a stroke, heart disease, or hypertension. The use case for this classifier is to flag at-risk patients so that a response can be made, either to prevent the conditions in the first place or to prevent the serious medical emergencies they might cause.

For example, if a person were to be flagged as very likely to have a stroke, the doctor could contact the patient in an attempt to prevent the stroke by prescribing them medication or alerting the patient's family to monitor them in case they were to have a stroke. Similar actions could be taken for hypertension and heart disease.

Alternatively, an application could let people enter their own information and see how at risk they might be for these conditions, giving them clearer information about their health and the issues most likely to affect them.

### Data Preparation

In [None]:
# Importing packages and reading in dataset
import numpy as np
import pandas as pd

print('Pandas:', pd.__version__)
print('Numpy:',  np.__version__)

raw_data = pd.read_csv('healthcare-dataset-stroke-data.csv')
raw_data.head()

In [None]:
# Dropping categorical column 'work_type'; not very useful and
# doesn't translate nicely into ordinal numbers
df = raw_data.drop('work_type', axis=1)

# Lowercasing column names so later cells can refer to
# 'residence_type' however the CSV capitalizes it
df.columns = df.columns.str.lower()

# Dropping 1 observation of person with gender 'Other' to simplify
# using the gender column to calculate, impute, or visualize
df.drop(df[df.gender == 'Other'].index, inplace=True)

# Making values' format consistent
for c in df.columns:
    if df[c].dtype == 'object':
        df[c] = df[c].str.lower()

# Adding numbers to smoking_status values to order them properly
# when they will get passed through the SKLearn LabelEncoder.
# Values were lowercased above, so the key is 'unknown', not 'Unknown'.
df['smoking_status'] = df['smoking_status'].replace({
    'never smoked':    '0_never_smoked',
    'formerly smoked': '1_formerly_smoked',
    'smokes':          '2_smokes',
    'unknown':         '3_unknown',
})

In [None]:
from sklearn.preprocessing import LabelEncoder

# Encoding all of the non-numeric columns
le = {}

for col in df.columns:
    if df[col].dtype == 'object':
        le[col] = LabelEncoder()
        df[col] = le[col].fit_transform(df[col])

# Call le[col].inverse_transform(df[col]) for any column name
# to convert numbers back to their labels

# Converting all 'Unknown' values in smoking status (encoded as 3)
# to NaN so that we can impute the missing values. Assigning the
# result back avoids relying on a chained in-place update.
df['smoking_status'] = df['smoking_status'].mask(df['smoking_status'] == 3)
               
df.head()

In [None]:
# Imputing missing values
from sklearn.impute import KNNImputer

knn = KNNImputer(n_neighbors=3)

# Imputing on all columns except id
columns = list(df.columns)
columns.remove('id')

df_imputed = df.copy()
df_imputed[columns] = knn.fit_transform(df[columns])

# Rounding imputed values to be compatible with LabelEncoder
# for smoking_status and to match the format of other values
# for bmi
df_imputed.smoking_status = df_imputed.smoking_status.round(0)
df_imputed.bmi = df_imputed.bmi.round(1)

In [None]:
# Using df_imputed as the primary dataset
df = df_imputed

# Changing columns modified by KNN Imputer back to integers from floats
columns = [
    'gender',
    'hypertension',
    'heart_disease',
    'ever_married',
    'residence_type',
    'smoking_status',
    'stroke'
]

for col in columns:
    df[col] = df[col].astype(int)

To prep this dataset, one attribute (work_type) was removed because it seemed relatively unimportant and does not encode cleanly into an ordinal set of integers. All categorical variables were converted to numeric data using scikit-learn's LabelEncoder class. Missing values for bmi and smoking_status were imputed using KNNImputer. One record was dropped for being the only entry with gender 'Other'. Removing this record makes the gender data simpler to visualize and should have little impact on training, since a category with a single observation cannot support a reliable estimate.

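The rounding step in the imputation cell exists because KNNImputer fills a gap with the mean of its neighbors' values, which for an integer-coded category is usually not a whole number. The sketch below shows this on toy data; the array `toy` and its values are made up for illustration and are not taken from the stroke dataset.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data (hypothetical): column 0 is a numeric feature,
# column 1 is an integer-coded category with one missing entry
toy = np.array([
    [1.0, 0.0],
    [1.1, np.nan],   # value to impute
    [0.9, 1.0],
    [1.2, 1.0],
    [9.0, 2.0],
])

imputer = KNNImputer(n_neighbors=3)
filled = imputer.fit_transform(toy)

# The imputed value is the mean of the 3 nearest rows' codes (0, 1, 1),
# so it falls between two valid codes
raw_value = filled[1, 1]

# Rounding snaps it back to a valid integer code
code = int(round(raw_value))
```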
Here is a table of the LabelEncoder encoded variables.

| value | gender | ever_married | residence_type | smoking_status    |
|-------|--------|--------------|----------------|-------------------|
| 0     | female | no           | rural          | 0_never_smoked    |
| 1     | male   | yes          | urban          | 1_formerly_smoked |
| 2     |   -    |      -       |       -        | 2_smokes          |

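The numeric prefixes on smoking_status matter because LabelEncoder assigns integer codes in sorted order of the labels. A minimal sketch on a hand-written list (the list `labels` is illustrative, not read from the dataset):

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative labels, deliberately out of order
labels = ['2_smokes', '0_never_smoked', '1_formerly_smoked', '0_never_smoked']

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ holds the sorted unique labels; a label's position is its code
classes = list(encoder.classes_)

# inverse_transform maps the integer codes back to the original strings
decoded = list(encoder.inverse_transform(codes))
```

Without the prefixes, alphabetical sorting would place 'formerly smoked' before 'never smoked', so the codes would not follow the intended order.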

In [None]:
df.head()

### Dataset Division

In [None]:
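# THIS IS NOT FINAL, LOOKING INTO CROSS VALIDATION AS METHOD

from sklearn.model_selection import cross_validate
from sklearn.linear_model import LogisticRegression as SKLogisticRegression

# PLACEHOLDER (assumption): the final multi-class target is not built yet,
# so 'stroke' stands in here only to make this cell runnable
target = 'stroke'

# list.remove() returns None, so build the feature list with a comprehension
columns = [c for c in df.columns if c not in ('id', target)]
X, y = df[columns], df[target]

# PLACEHOLDER (assumption): accuracy as the only metric for now
scoring = ['accuracy']

clf = SKLogisticRegression(solver='liblinear')
scores = cross_validate(clf, X, y, scoring=scoring)
sorted(scores.keys())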

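One way the 80/20 split could be done is with scikit-learn's `train_test_split`. The sketch below runs on synthetic stand-in data (`X_demo` and `y_demo` are hypothetical, not the stroke dataset) and assumes stratification is wanted so that rare classes appear in both parts in the same proportion.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): 100 rows, 3 imbalanced classes
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = np.array([0] * 70 + [1] * 20 + [2] * 10)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo,
    test_size=0.2,      # 80% train / 20% test
    stratify=y_demo,    # keep class proportions the same in both parts
    random_state=0,     # fixed seed so the split is reproducible
)

# Per-class counts in each part
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
```

With an imbalanced target, an unstratified 20% test set can end up with very few examples of the smallest class, which is one point to weigh when arguing for or against the 80/20 split.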
## Modeling (5pt)

<ul>
    <li>The implementation of logistic regression must be written only from the examples given to you by the instructor. No credit will be assigned to teams that copy implementations from another source, regardless of whether the code is properly cited.</li>
    <li>[<b>2 points</b>] Create a custom, one-versus-all logistic regression classifier using numpy and scipy to optimize. Use object oriented conventions identical to scikit-learn. You should start with the template developed by the instructor in the course. You should add the following functionality to the logistic regression classifier:
    <ul>
        <li>Ability to choose optimization technique when class is instantiated: either steepest descent, stochastic gradient descent, or Newton's method. </li>
        <li>Update the gradient calculation to include a customizable regularization term (either using no regularization, L1 regularization, L2 regularization, or both L1 and L2 regularization). Associate a cost with the regularization term, "C", that can be adjusted when the class is instantiated.  </li>
    </ul>
    </li>
    <li>[<b>1.5 points</b>] Train your classifier to achieve good generalization performance. That is, adjust the <b>optimization technique</b> and the value of the <b>regularization term "C"</b> to achieve the best performance on your test set. Visualize the performance of the classifier versus the parameters you investigated. Is your method of selecting parameters justified? That is, do you think there is any "data snooping" involved with this method of selecting parameters?</li>
    <li>[<b>1.5 points</b>] Compare the performance of your "best" logistic regression optimization procedure to the procedure used in scikit-learn. Visualize the performance differences in terms of training time and classification performance. <b>Discuss the results</b>. </li>
</ul>

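The customizable regularization term described above can be sketched as a small helper that returns only the penalty's contribution to the gradient. This is an illustrative sketch, not the instructor's template: the function name `regularization_gradient`, the `penalty` strings, and the convention that `w[0]` is an unpenalized bias are all assumptions. It also assumes the objective is minimized as loss plus `C` times the penalty; if the template maximizes log-likelihood instead, the term would be subtracted.

```python
import numpy as np

def regularization_gradient(w, C, penalty='l2'):
    """Gradient of C * penalty(w) with respect to w.

    Hypothetical helper. Assumes w[0] is the bias and is left unpenalized.
    """
    if penalty not in ('none', 'l1', 'l2', 'both'):
        raise ValueError(f"unknown penalty: {penalty!r}")
    grad = np.zeros_like(w, dtype=float)
    if penalty in ('l1', 'both'):
        grad[1:] += C * np.sign(w[1:])   # subgradient of C * sum(|w_j|)
    if penalty in ('l2', 'both'):
        grad[1:] += 2.0 * C * w[1:]      # gradient of C * sum(w_j ** 2)
    return grad

# Small demonstration on made-up weights
w_demo = np.array([0.5, -2.0, 0.0, 3.0])
g_none = regularization_gradient(w_demo, C=0.1, penalty='none')
g_l1 = regularization_gradient(w_demo, C=0.1, penalty='l1')
g_l2 = regularization_gradient(w_demo, C=0.1, penalty='l2')
g_both = regularization_gradient(w_demo, C=0.1, penalty='both')
```

Keeping the penalty separate from the data-dependent gradient means the same helper could be reused by steepest descent, stochastic gradient descent, and Newton's method (which would additionally need the penalty's Hessian).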
## Deployment (1pt)

<ul>
    <li>Which implementation of logistic regression would you advise be used in a deployed machine learning model, your implementation or scikit-learn (or other third party)? Why?</li>
</ul>

## Exceptional Work (1pt)

<ul>
    <li>You have free rein to provide additional analyses. <b>One idea</b>: Update the code to use either "one-versus-all" or "one-versus-one" extensions of binary to multi-class classification. </li>
    <li><b>Required for 7000 level students</b>: Choose ONE of the following:
    <ul>
        <li><b>Option One</b>: Implement an optimization technique for logistic regression using <b>mean square error</b> as your objective function (instead of binary cross entropy). Derive the gradient updates for the Hessian and use Newton's method to update the values of "w". Then answer, is this process better than using binary cross entropy? </li>
        <li><b>Option Two</b>: Implement the BFGS algorithm from scratch to optimize logistic regression. That is, use BFGS without the use of an external package (for example, do not use SciPy). Compare your performance accuracy and runtime to the BFGS implementation in SciPy (that we used in lecture). </li>
    </ul>
    </li>
</ul>
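
For Option Two, a SciPy baseline is needed to compare against. The sketch below shows one possible baseline for a single binary (one-versus-all) problem using `scipy.optimize.minimize` with `method='BFGS'`. The objective function, the synthetic data, and the value `C = 0.01` are assumptions made for illustration; the lecture code may call SciPy differently.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def objective(w, X, y, C):
    """Mean binary cross entropy plus an L2 penalty, with its gradient."""
    z = X @ w
    # log(1 + exp(z)) - y * z is the cross entropy, computed stably
    loss = np.mean(np.logaddexp(0.0, z) - y * z) + C * np.sum(w[1:] ** 2)
    grad = X.T @ (expit(z) - y) / len(y)
    grad[1:] += 2.0 * C * w[1:]          # bias w[0] is not penalized
    return loss, grad

# Synthetic binary problem (hypothetical data, not the stroke dataset)
rng = np.random.default_rng(0)
n = 200
features = rng.normal(size=(n, 2))
X_demo = np.hstack([np.ones((n, 1)), features])   # bias column first
y_demo = (features[:, 0] + 0.5 * features[:, 1] > 0).astype(float)

# jac=True tells SciPy that objective returns (value, gradient)
result = minimize(objective, x0=np.zeros(3), args=(X_demo, y_demo, 0.01),
                  jac=True, method='BFGS')
w_bfgs = result.x
accuracy = np.mean((expit(X_demo @ w_bfgs) > 0.5) == y_demo)
```

A from-scratch BFGS could be run on the same `objective` and starting point, so that accuracy and runtime are compared on equal terms.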