# Logistic Regression Tutorial

This notebook is a brief tutorial on fitting and interpreting logistic regression in Python. We use data from the Climate and Economic Justice Screening Tool (CEJST). Download the dataset before running the cells below; the code expects it at `data/CEJST-v1.csv`.

In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score, f1_score

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

## Data Pre-processing 

In [None]:
df = pd.read_csv('data/CEJST-v1.csv')

The island territories have missing data, so we remove them.

In [None]:
islands = ['American Samoa', 'Guam', 'Northern Mariana Islands', 'Puerto Rico', 'Virgin Islands']
df = df[~df['State/Territory'].isin(islands)]

In [None]:
len(df)

In [None]:
features = ['Percent Black or African American alone', 'Percent American Indian / Alaska Native', 'Percent Asian',
            'Percent Native Hawaiian or Pacific', 'Percent White', 'Percent Hispanic or Latino', 
            'Percent age under 10', 'Percent age 10 to 64', 'Percent age over 64', 'Unemployment (percent)', 
            'Percent of individuals < 100% Federal Poverty Line', 
            'Percent of individuals below 200% Federal Poverty Line',
            'Percent individuals age 25 or over with less than high school degree', 'Linguistic isolation (percent)', 
            'Housing burden (percent)', 'Percent pre-1960s housing (lead paint indicator)', 
            'Median value ($) of owner-occupied housing units', 
            "Share of the tract's land area that is covered by impervious surface or cropland as a percent"]

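The historic underinvestment indicator has missing values. Those tracts did not experience redlining, so we can reflect that by imputing the missing values with 0. Note that `fillna(0)` below fills missing values in every selected feature, not only in this indicator.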

In [None]:
data = df[features]
data = data.fillna(0)

## Modeling

Let's perform hyperparameter tuning. Experiment with the solver and the penalty parameter `C` (the inverse of the regularization strength, so smaller values regularize more strongly). Keep the L2 penalty fixed as a default so that all of the features stay in the model. Another useful default is a large maximum number of iterations (e.g., 50000 or more); the search will take longer to run, but that is okay. Use AUROC as the scoring metric that selects the refit model, to account for our imbalanced labels.

In [None]:
def find_best_model(X,y): 
    ## YOUR CODE HERE
    
    search = ## YOUR CODE HERE
    
    print("Best parameter (CV AUROC score=%0.3f):" % search.best_score_)
    print(search.best_params_)
    return search.best_estimator_, search.best_params_, pd.DataFrame(search.cv_results_)
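One possible way to fill in the exercise above is sketched here. It is an illustration, not the intended solution: the function name `find_best_model_sketch`, the grid values, the 5-fold setting, and the synthetic demo data are all assumptions made for this example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV


def find_best_model_sketch(X, y):
    # L2 is scikit-learn's default penalty for LogisticRegression, so it is
    # not set explicitly here; only the iteration cap is raised.
    base = LogisticRegression(max_iter=50000)

    # Hypothetical grid: both solvers support the L2 penalty.
    param_grid = {
        'solver': ['lbfgs', 'liblinear'],
        'C': [0.01, 0.1, 1, 10],
    }

    # Score every candidate by AUROC and refit the best one on all the data.
    search = GridSearchCV(base, param_grid, scoring='roc_auc', cv=5, refit=True)
    search.fit(X, y)

    print("Best parameter (CV AUROC score=%0.3f):" % search.best_score_)
    print(search.best_params_)
    return search.best_estimator_, search.best_params_, pd.DataFrame(search.cv_results_)


# Synthetic stand-in with roughly a 9:1 class ratio (not the CEJST data).
X_demo, y_demo = make_classification(n_samples=500, n_features=5,
                                     weights=[0.9, 0.1], random_state=0)
best_estimator, best_params, cv_results = find_best_model_sketch(X_demo, y_demo)
```

With a single scoring metric, `refit=True` refits the candidate with the best mean cross-validated AUROC, which is what the instructions above ask for.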

In [None]:
def get_summary(X, y, params=None):
    
    # fall back to scikit-learn defaults when no tuned parameters are given
    params = dict(params or {})
    # fixed settings replace any duplicates in params to avoid a TypeError
    params.update(max_iter=500000, penalty='l2')
    
    # fit model
    clf = LogisticRegression(**params).fit(X, y)
    
    # get model coefficients, with the intercept appended last
    coef = clf.coef_.tolist()[0] + [clf.intercept_[0]]
    
    # Create coefficient and intercept table, labeled by the columns of X
    summary_df = pd.DataFrame({'Coefficient': coef})
    summary_df.index = list(X.columns) + ['Intercept']
    summary_df['Odds Ratio'] = np.exp(summary_df['Coefficient'])
    summary_df['Percentage Effect'] = 100 * (summary_df['Odds Ratio'] - 1)
    
    display(summary_df)

We estimate the log odds of a census tract simultaneously being at or above the 90th percentile for share of properties at risk of flood in 30 years and being low income. The labels are imbalanced at roughly 9:1, so we measure performance with AUROC, which, unlike accuracy, is not inflated by always predicting the majority class. By that measure, the model performs reasonably well.
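To see why AUROC is preferred over accuracy for a 9:1 label ratio, here is a small sketch on synthetic labels (not the CEJST data). A hypothetical model that gives every tract the same score, and therefore always predicts the majority class, looks strong on accuracy but is no better than chance on AUROC.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic labels with a 9:1 ratio of negatives to positives.
y_true = np.array([0] * 900 + [1] * 100)

# A degenerate model: the same score for every row, so it always predicts 0.
constant_scores = np.zeros(len(y_true))
constant_preds = (constant_scores >= 0.5).astype(int)

acc = accuracy_score(y_true, constant_preds)   # rewards the majority-class guess
auc = roc_auc_score(y_true, constant_scores)   # measures ranking quality instead
print("accuracy:", acc, "AUROC:", auc)
```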

In [None]:
label = 'Greater than or equal to the 90th percentile for share of properties at risk of flood in 30 years and is low income?'
X, y = data, df[label]
estimator, params, results = find_best_model(X,y)

If the fit emits a line search warning, try a different solver or raise `max_iter`.

In [None]:
get_summary(X, y, params)

## Interpreting the Results

Coefficients for continuous, binary, and categorical features have different interpretations. 

Let's look at the continuous feature of the percent of white residents in a tract. Its coefficient, which is the change in log odds per one-unit increase in the feature, is 0.92. The odds ratio is exp(0.92) ≈ 2.51, so, holding the other features fixed, a one-unit increase in the percent of white residents multiplies the odds of a tract being in the 90th percentile for share of properties at risk of flood in 30 years and low income by 2.51. In other words, the odds are 151 percent **higher**. Note that this is a statement about odds, not about probabilities.
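The arithmetic behind these numbers mirrors the `Odds Ratio` and `Percentage Effect` columns computed in `get_summary`. A minimal sketch, using only the 0.92 coefficient quoted above:

```python
import numpy as np

coefficient = 0.92                           # log-odds coefficient quoted in the text
odds_ratio = np.exp(coefficient)             # multiplicative change in the odds
percentage_effect = 100 * (odds_ratio - 1)   # same formula as in get_summary

print(round(float(odds_ratio), 2))           # 2.51
print(round(float(percentage_effect)))       # 151
```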

Let's look at a binary feature. Holding the other features fixed, tracts that experienced historic underinvestment and remain low income have 21.1 percent **higher** odds of being in the 90th percentile for share of properties at risk of flood in 30 years and low income than tracts that did not experience historic underinvestment.

Recall that the intercept is the log odds the model predicts when every input feature is set to 0. Sometimes it is interesting to interpret, and sometimes it is simply a statistical formality with no meaningful application to the domain.
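When the intercept is worth interpreting, it can be turned into a baseline probability with the logistic (sigmoid) function. The sketch below uses a made-up intercept of -2.0 purely for illustration; it is not the value from the fitted model, and the helper name `log_odds_to_probability` is hypothetical.

```python
import numpy as np


def log_odds_to_probability(log_odds):
    # Logistic (sigmoid) function: maps log odds to a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-log_odds))


hypothetical_intercept = -2.0  # illustrative value, NOT from the fitted model
baseline_probability = log_odds_to_probability(hypothetical_intercept)
print(baseline_probability)
```

A negative intercept therefore corresponds to a baseline probability below 0.5, and an intercept of 0 corresponds to exactly 0.5.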