# Logistic Regression

In the previous notebook, I used linear regression to express our response variable as a linear function of the explanatory variable. For the case where the response variable is categorical, logistic regression provides a more natural modeling framework.

I will focus specifically on the case of a binary-valued (e.g., 0 or 1) response variable. In logistic regression, instead of expressing the response variable directly as a function of the explanatory variable, we express the probability of the response variable being equal to 1 (or 0) as a function of the explanatory variable. Specifically, the mapping is obtained by applying the standard logistic function to a linear function of the explanatory variable.

The relationship is mathematically expressed as:

$$P(\text{response} = 1 \mid X) = \frac{1}{1 + e^{-y}}, \qquad \text{where } y = aX + b$$
     
Since the logistic function only takes values in the open interval (0, 1), the result of the mapping can be interpreted as a probability.
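The mapping described above can be sketched in a few lines of NumPy. This is only an illustration: the helper names `logistic` and `win_probability` and the coefficient values `a` and `b` are hypothetical, chosen to show the shape of the curve rather than taken from any fitted model.

```python
import numpy as np

def logistic(y):
    """Standard logistic function: maps any real y into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-y))

def win_probability(x, a, b):
    """P(response = 1 | x) for a single explanatory variable x."""
    return logistic(a * x + b)

# Hypothetical coefficients, for illustration only (not fitted values)
a, b = 0.2, -1.7

hits = np.array([0, 5, 10, 15])
probs = win_probability(hits, a, b)
print(probs)
```

With a positive `a`, the predicted probability increases with the explanatory variable but never reaches 0 or 1, which is what makes the output usable as a probability.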

In [2]:
# Importing libraries

import numpy as np
import pandas as pd
from sklearn import linear_model
from scipy import stats
import matplotlib.pyplot as plt
%matplotlib inline

# Training Set & Model Construction

As in the linear regression case, we will use Retrosheet game logs from five MLB seasons (2011-2015) to construct our training set. The explanatory variable in this case, the number of hits by the home team, is readily available in the raw data, so no additional processing for feature extraction is required.
We will construct a Boolean-valued column to indicate whether the home team won the game (1/True) or not (0/False), just like we did in the linear regression example.

In [3]:
# Get training data from 2011-2015 to train the logistic regression model

# Initialize arrays to hold training data
train_num_hits = np.empty([0,1])
train_win_label = np.empty([0,1])

# Loop over the 2011-2015 seasons
for year in range(2011,2016):
    # Construct log file name
    file = "GL" + str(year) + ".TXT"
    log_file = "./data/train/" + file
    
    # Read log into a dataframe
    df = pd.read_csv(log_file, header=None)
    
    # Rename columns for readability
    df.rename(columns = {6: 'Home Team', 9: 'Runs Visitor', 10: 'Runs Home', 50: 'Hits Home'}, inplace=True)
    
    # Add a Boolean column indicating whether the home team won the game
    df['Home Win'] = (df['Runs Home'] > df['Runs Visitor'])
    
    # Add to training set
    train_num_hits = np.vstack([train_num_hits, df['Hits Home'].values.reshape([-1,1])])
    train_win_label = np.vstack([train_win_label, df['Home Win'].values.reshape([-1,1])])

FileNotFoundError: File b'./data/train/GL2011.TXT' does not exist
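The error above means the game log was not found at the path the loop constructs. A minimal sketch for checking which season files are present before loading is shown here; the helper name `missing_logs` is hypothetical, and it assumes the same `./data/train/GL<year>.TXT` layout used in the cell above.

```python
from pathlib import Path

def missing_logs(data_dir, years):
    """Return the expected game-log paths that do not exist under data_dir."""
    expected = [Path(data_dir) / f"GL{year}.TXT" for year in years]
    return [p for p in expected if not p.is_file()]

# Same directory and seasons as the training loop above
missing = missing_logs("./data/train", range(2011, 2016))
if missing:
    print("Missing game logs:", ", ".join(str(p) for p in missing))
```

Running this check first makes it clear whether the problem is a missing download or a wrong working directory, without reading a traceback.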

In [None]:
# Instantiate lo