# Linear Regression

In this section, I will build a simple **linear regression model** to relate the win percentage of a baseball team to its run differential (i.e., the difference between runs scored and runs allowed). I will use data from five MLB seasons (2011-2015) to train the model and then evaluate its performance on test data from 2016. The game logs used in this segment are sourced from [Retrosheet](https://www.retrosheet.org/).

```Train Data```: Game logs from 2011-2015

```Test Data```: Game logs from 2016
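As a rough illustration of the model form, a straight line `win_pct = intercept + slope * run_diff` can be fit by least squares. The sketch below is only that: the four data points are invented for illustration (they are not Retrosheet data), and `np.polyfit` is used here as a stand-in for the `sklearn` estimator imported in the next cell.

```python
import numpy as np

# Made-up season totals for four hypothetical teams (NOT Retrosheet data)
run_diff = np.array([-120.0, -30.0, 40.0, 110.0])   # runs scored minus runs allowed
win_pct = np.array([0.420, 0.480, 0.525, 0.570])    # wins divided by games played

# Degree-1 least-squares fit; polyfit returns the highest-order coefficient first
slope, intercept = np.polyfit(run_diff, win_pct, deg=1)

# Predicted win percentage for each team under the fitted line
predicted = intercept + slope * run_diff
```

With toy numbers like these, the slope is positive (a better run differential goes with a higher win percentage) and the intercept sits near .500, the win percentage one would expect from a team that scores as many runs as it allows.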

In [1]:
# Import libraries

import pandas as pd
import numpy as np
from sklearn import linear_model
from scipy import stats
import matplotlib.pyplot as plt

%matplotlib inline

# Data Preprocessing

The first step in most machine learning projects is data preprocessing: extracting the relevant "features" from the raw input data and shaping them into the format the model expects.

I will use game logs from the 2015 MLB season as sample input to build up a sequence of functions, which I will later reuse to process game logs from multiple seasons and construct the training and test sets.

In [9]:
# Read game logs from the 2015 season into a dataframe
# (the file is comma-separated and has no header row)
input_df = pd.read_csv("./data/train/GL2015.TXT", header=None)

In [10]:
input_df.head()

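| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 151 | 152 | 153 | 154 | 155 | 156 | 157 | 158 | 159 | 160 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20150405 | 0 | Sun | SLN | NL | 1 | CHN | NL | 1 | 3 | ... | David Ross | 2 | lestj001 | Jon Lester | 1 | lastt001 | Tommy La Stella | 4 | | Y |
| 1 | 20150406 | 0 | Mon | MIN | AL | 1 | DET | AL | 1 | 0 | ... | Nick Castellanos | 5 | avila001 | Alex Avila | 2 | iglej001 | Jose Iglesias | 6 | | Y |
| 2 | 20150406 | 0 | Mon | CLE | AL | 1 | HOU | AL | 1 | 0 | ... | Jed Lowrie | 6 | rasmc001 | Colby Rasmus | 7 | marij002 | Jake Marisnick | 8 | | Y |
| 3 | 20150406 | 0 | Mon | CHA | AL | 1 | KCA | AL | 1 | 1 | ... | Alex Rios | 9 | peres002 | Salvador Perez | 2 | infao001 | Omar Infante | 4 | | Y |
| 4 | 20150406 | 0 | Mon | TOR | AL | 1 | NYA | AL | 1 | 6 | ... | Alex Rodriguez | 10 | drews001 | Stephen Drew | 4 | gregd001 | Didi Gregorius | 6 | | Y |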


The dataset comes with unlabeled columns, so I will label the columns of interest to make subsequent code more readable. This [webpage](https://www.retrosheet.org/gamelogs/glfields.txt) provides a key to what each column of data represents, and is a must-have reference when working with Retrosheet game logs.
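One detail worth keeping in mind when reading that key: it numbers fields starting from 1, while a dataframe read with `header=None` labels its columns starting from 0. The small sketch below shows the off-by-one conversion for the four fields used in this section. The dictionary name `GLFIELDS` and the descriptive labels are chosen here for illustration.

```python
# Field numbers as listed in the Retrosheet key (1-based).
# The names are illustrative labels, not text taken from the key itself.
GLFIELDS = {
    "Visiting Team": 4,
    "Home team": 7,
    "Runs Visitor": 10,
    "Runs Home": 11,
}

# pandas labels unnamed columns 0, 1, 2, ..., so subtract 1 from each field number
column_labels = {field - 1: name for name, field in GLFIELDS.items()}

print(column_labels)
```

The resulting keys (3, 6, 9, 10) are the column positions to use when renaming.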

In [14]:
# Function to rename the columns of interest in an input dataframe (for readability)
# Input type: dataframe
# Output type: dataframe

def rename_columns(input_df):
    # Keys are the zero-based column positions in the Retrosheet game log
    input_df.rename(columns={3: 'Visiting Team',
                             6: 'Home team',
                             9: 'Runs Visitor',
                             10: 'Runs Home'}, inplace=True)

    return input_df

# Invoke function to rename columns
input_df = rename_columns(input_df)

# Display
input_df.head()

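| | 0 | 1 | 2 | Visiting Team | 4 | 5 | Home team | 7 | 8 | Runs Visitor | ... | 151 | 152 | 153 | 154 | 155 | 156 | 157 | 158 | 159 | 160 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20150405 | 0 | Sun | SLN | NL | 1 | CHN | NL | 1 | 3 | ... | David Ross | 2 | lestj001 | Jon Lester | 1 | lastt001 | Tommy La Stella | 4 | | Y |
| 1 | 20150406 | 0 | Mon | MIN | AL | 1 | DET | AL | 1 | 0 | ... | Nick Castellanos | 5 | avila001 | Alex Avila | 2 | iglej001 | Jose Iglesias | 6 | | Y |
| 2 | 20150406 | 0 | Mon | CLE | AL | 1 | HOU | AL | 1 | 0 | ... | Jed Lowrie | 6 | rasmc001 | Colby Rasmus | 7 | marij002 | Jake Marisnick | 8 | | Y |
| 3 | 20150406 | 0 | Mon | CHA | AL | 1 | KCA | AL | 1 | 1 | ... | Alex Rios | 9 | peres002 | Salvador Perez | 2 | infao001 | Omar Infante | 4 | | Y |
| 4 | 20150406 | 0 | Mon | TOR | AL | 1 | NYA | AL | 1 | 6 | ... | Alex Rodriguez | 10 | drews001 | Stephen Drew | 4 | gregd001 | Didi Gregorius | 6 | | Y |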
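The four renamed columns are enough to derive both quantities named in the introduction: run differential and win percentage. The sketch below shows one way this could be done, on three made-up games between hypothetical teams `AAA`, `BBB`, and `CCC`. The helper name `team_summary` and the intermediate column names are assumptions introduced here for illustration, not necessarily the preprocessing used later in this notebook.

```python
import pandas as pd

# Three made-up games between hypothetical teams (NOT Retrosheet data)
games = pd.DataFrame({
    "Visiting Team": ["AAA", "BBB", "AAA"],
    "Home team":     ["BBB", "AAA", "CCC"],
    "Runs Visitor":  [3, 2, 1],
    "Runs Home":     [0, 5, 4],
})

def team_summary(games):
    """Return total run differential and win percentage per team (hypothetical helper)."""
    # Each game contributes two rows: one from the visitor's view, one from the home team's
    visitors = pd.DataFrame({
        "Team": games["Visiting Team"],
        "Runs For": games["Runs Visitor"],
        "Runs Against": games["Runs Home"],
    })
    hosts = pd.DataFrame({
        "Team": games["Home team"],
        "Runs For": games["Runs Home"],
        "Runs Against": games["Runs Visitor"],
    })
    per_game = pd.concat([visitors, hosts], ignore_index=True)

    per_game["Run Diff"] = per_game["Runs For"] - per_game["Runs Against"]
    # A game counts as a win when the team outscored its opponent (a tie counts as a non-win)
    per_game["Win"] = (per_game["Run Diff"] > 0).astype(int)

    grouped = per_game.groupby("Team")
    return pd.DataFrame({
        "Run Diff": grouped["Run Diff"].sum(),   # season-level run differential
        "Win Pct": grouped["Win"].mean(),        # wins divided by games played
    })

summary = team_summary(games)
print(summary)
```

Because every run scored by one team is a run allowed by another, the `Run Diff` column sums to zero across teams, which makes a quick sanity check on real data.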
