## Executive Summary

My goal with this project was to use modeling to gain an edge in DraftKings daily fantasy baseball. Initially I had hoped to combine each player's stats with the stats of the opposing pitcher or batting lineup in regression models that predict the number of points batters and pitchers would score each game. Points are the scoring method used by DraftKings, which assigns point values to different events in each game. The scoring system is described here: https://www.draftkings.com/help/rules/mlb?wpsrc=Organic%20Search&wpaffn=Google&wpkw=https%3A%2F%2Fwww.draftkings.com%2Fhelp%2Frules%2Fmlb&wpcn=help&wpscn=rules%2Fmlb.
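
As a rough illustration of how game events turn into points, the sketch below scores one hitter's box-score line. The point values are assumed from DraftKings' classic MLB hitter scoring and should be checked against the rules page linked above; the name `hitter_points` and the dictionary layout are hypothetical and not part of this project's code.

```python
# Assumed DraftKings classic MLB hitter point values; verify against the rules page.
HITTER_POINTS = {
    "single": 3, "double": 5, "triple": 8, "home_run": 10,
    "rbi": 2, "run": 2, "walk": 2, "hit_by_pitch": 2, "stolen_base": 5,
}


def hitter_points(events):
    """Sum fantasy points for a dict of {event name: count}."""
    return sum(HITTER_POINTS[event] * count for event, count in events.items())


# A solo home run plus a walk: 10 (HR) + 2 (RBI) + 2 (run) + 2 (walk)
print(hitter_points({"home_run": 1, "rbi": 1, "run": 1, "walk": 1}))  # 16
# A game with none of the scored events is worth nothing
print(hitter_points({}))  # 0
```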

I expected batters to be difficult to predict because of the high variance a batter can have from game to game. It is not out of the ordinary for the best batter in the league to have zero hits in a game and score 0 fantasy points. This concern was confirmed by my EDA, where I saw very little correlation between DraftKings points and any of the features I had collected. I attempted a few models, but none of them effectively predicted the test dataset. At this point I pivoted to a time series model for batters, because I wanted a model that predicts close to a player's mean score when it finds no trends in the data. I became interested in time-based trends after reading this article on FiveThirtyEight: https://fivethirtyeight.com/features/baseballs-hot-hand-is-real/
 
Normally the hot hand is seen as a logical fallacy, but this article finds that pitchers have hot and cold streaks. It also provides evidence that hot hands can exist among other players. I hope to create a model that predicts close to each player's mean while giving some weight to whether they are on a hot or cold streak. Time series modeling does not currently work well for pitchers because starting pitchers play very few games in a season, so the sample size is too small. The metric I am using to gauge my time series models is AIC. It is not a very interpretable metric, but I am selecting the ARIMA model with the lowest AIC.
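
For reference, AIC is defined as 2k - 2 ln(L), where k is the number of estimated parameters and L is the maximized likelihood, so it rewards fit and penalizes extra parameters. The sketch below only shows the arithmetic; the log-likelihood values are made up, and the helper name `aic` is hypothetical.

```python
def aic(log_likelihood, num_params):
    """Akaike information criterion: 2k - 2*ln(L). Lower is better."""
    return 2 * num_params - 2 * log_likelihood


# Two hypothetical models fit to the SAME series (numbers are made up).
simple_model = aic(log_likelihood=-155.0, num_params=2)
larger_model = aic(log_likelihood=-154.5, num_params=4)
print(simple_model, larger_model)  # 314.0 317.0
```

Here the larger model fits slightly better but not by enough to pay for its two extra parameters, so the simpler model would be selected. The comparison is only meaningful between models fit to the same data.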

I also attempted to write a regression model to predict pitcher performance. Pitchers tend to be more consistent, so I hoped it would work, and my EDA showed somewhat more correlation among pitching stats. I built Linear Regression, Random Forest and Neural Network models. Initially I was getting strong scores, but I discovered that this was due to the specific random state I used for the train/test split; with other random states the models turned out to be overfit. I am still going to try to optimize this model, but for the time being I am just using the mean score for each pitcher. My current neural network is at: https://colab.research.google.com/drive/1D06qTKDn7v8ko2yxvzFpABbMx7TWUKrS#scrollTo=G1j2-ellRKY4

The next step was to take my predictions and create my top lineups for submission. I used a genetic algorithm to do this. It creates random lineups and then resamples from the top lineups to keep making more. This allowed me to make my top 150 lineups in a reasonable amount of time.
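
The idea can be sketched on a toy problem. This is a simplified illustration and not the class used later in the notebook: the three-slot roster, the cap of 18, the player pool, and every function name here are hypothetical.

```python
import random


def is_valid(lineup, cap):
    """A lineup is valid if no player repeats and total salary fits under the cap."""
    names = {p["name"] for p in lineup}
    return len(names) == len(lineup) and sum(p["salary"] for p in lineup) <= cap


def score(lineup):
    return sum(p["pred"] for p in lineup)


def random_lineup(pool, slots, cap, rng):
    """Draw one random player per slot until the result is valid."""
    while True:
        lineup = [rng.choice(pool[pos]) for pos in slots]
        if is_valid(lineup, cap):
            return lineup


def mate(a, b, pool, slots, cap, rng):
    """Fill each slot from parent a, parent b, or one fresh random player."""
    while True:
        child = [rng.choice([a[i], b[i], rng.choice(pool[pos])])
                 for i, pos in enumerate(slots)]
        if is_valid(child, cap):
            return child


def evolve(pool, slots, cap, keep=5, generations=50, seed=0):
    rng = random.Random(seed)
    best = [random_lineup(pool, slots, cap, rng) for _ in range(keep)]
    for _ in range(generations):
        fresh = [random_lineup(pool, slots, cap, rng) for _ in range(keep)]
        children = [mate(rng.choice(best), f, pool, slots, cap, rng) for f in fresh]
        # Keep only the highest-scoring lineups for the next round
        best = sorted(best + fresh + children, key=score, reverse=True)[:keep]
    return best


# Toy player pool: salaries and predicted scores are made up
pool = {
    "P": [{"name": f"P{i}", "salary": 6 + i, "pred": 10 + 2 * i} for i in range(4)],
    "OF": [{"name": f"OF{i}", "salary": 3 + i, "pred": 5 + 1.5 * i} for i in range(6)],
}
slots = ["P", "OF", "OF"]
top = evolve(pool, slots, cap=18)
print([p["name"] for p in top[0]], score(top[0]))
```

Because only valid lineups are ever kept, the salary cap acts as a hard constraint while the sort-and-truncate step supplies the selection pressure.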

One of my limitations was the small amount of data I had. I only had stats from this season, which definitely hurt my findings for pitcher data. There is also the natural difficulty of predicting baseball: a single game is such a small sample that it is very difficult to predict.

In [49]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
    
from sklearn.model_selection import train_test_split
from statsmodels.tsa.arima.model import ARIMA

In [50]:
targets = [file for file in os.listdir('./Datasets/') if 'target' in file]

In [51]:
li = []
for file in targets:
    li.append(pd.read_csv('./Datasets/' + file, delimiter = ';'))
df = pd.concat(li, axis = 0, ignore_index =True)

In [52]:
def format_date(date):
    date = str(date)
    year = date[0:4]
    month = date[4:6]
    day = date[6:8]
    return '-'.join([year, month, day])

In [53]:
def clean_names(name):
    name = name.replace(',','')
    #name = name.replace('.','')
    name = name.replace('-hitter','')
    name = name.replace('-pitcher','')
    names = name.split()
    first = names.pop(-1)
    names.insert(0, first)
    name = ' '.join(names)
    name = name.replace('James Happ', 'JA Happ')
    return name

In [54]:
# Putting all names into the same format
df['Name'] = [clean_names(player) for player in df['Name']]

In [55]:
# Formatting date for time series modeling
df['Date'] = pd.to_datetime([format_date(date) for date in df['Date']])
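
Since `format_date` slices the raw value as year, month and day, the dates appear to arrive as eight-digit values such as 20210630. Under that assumption, pandas can also parse them directly with an explicit format string. A minimal sketch on toy values:

```python
import pandas as pd

# Toy values in the YYYYMMDD layout assumed above
raw_dates = pd.Series([20210630, 20210701])
parsed = pd.to_datetime(raw_dates.astype(str), format="%Y%m%d")
print(parsed.dt.strftime("%Y-%m-%d").tolist())  # ['2021-06-30', '2021-07-01']
```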

In [56]:
# Shohei Ohtani is both a pitcher and batter so stats need to be separated
new_names = []
for i, row in df.iterrows():
    if row['Name'] == 'Shohei Ohtani':
        if row['DK posn'] == 1.0:
            new_names.append('Shohei (P) Ohtani')
        else:
            new_names.append('Shohei (H) Ohtani')
    else:
        new_names.append(row['Name'])
df['Name'] = new_names

Shohei Ohtani was difficult to model throughout the project as he is both a pitcher and batter. I needed to make sure I wasn't accidentally using his batting prediction to predict his pitching.
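
The same relabeling could likely be done without looping over rows. The sketch below runs on a toy frame; `1.0` marks a pitching row as in the loop above, while the other position codes are made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Name": ["Shohei Ohtani", "Shohei Ohtani", "Mike Trout"],
    "DK posn": [1.0, 10.0, 8.0],  # 1.0 = pitcher; the other codes are placeholders
})

is_ohtani = toy["Name"] == "Shohei Ohtani"
# Pitching rows get the (P) label, every other Ohtani row gets the (H) label
ohtani_label = np.where(toy["DK posn"] == 1.0, "Shohei (P) Ohtani", "Shohei (H) Ohtani")
toy["Name"] = np.where(is_ohtani, ohtani_label, toy["Name"])
print(toy["Name"].tolist())
# ['Shohei (P) Ohtani', 'Shohei (H) Ohtani', 'Mike Trout']
```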

In [57]:
df = df[df['DK posn'] != 1.0]

In [58]:
dfs = {}
removed = []
for player in set(df['Name']):
    dfs[player] = df[df['Name']== player].set_index('Date').sort_index()['DK pts']
    if len(dfs[player]) < 10:
        removed.append((player, dfs.pop(player)))

In [59]:
preds = []
aic = []
for player in dfs.keys():
    # Exponential smoothing
    arima = ARIMA(endog = dfs[player], order = (0,1,1))
    model = arima.fit()
    preds.append([player, model.forecast(1).values[0]])
    aic.append(model.aic)

In the for loop I fit an ARIMA model for each player and save the forecast for their next game.
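
The `(0,1,1)` order is labeled "Exponential smoothing" in the loop because an ARIMA(0,1,1) model without a constant produces the same one-step forecast as simple exponential smoothing: each new forecast is a weighted blend of the latest score and the previous forecast. A plain-Python sketch of that recursion follows; the scores and the `alpha` values are made up, whereas in the fitted model the weight comes from the estimated MA coefficient.

```python
def ses_forecast(scores, alpha):
    """One-step-ahead simple exponential smoothing forecast."""
    level = scores[0]
    for y in scores[1:]:
        # Blend the newest observation with the running level
        level = alpha * y + (1 - alpha) * level
    return level


scores = [8.0, 0.0, 12.0, 5.0, 20.0]  # toy game-by-game fantasy points
print(ses_forecast(scores, alpha=0.0))  # 8.0, never moves off the starting level
print(ses_forecast(scores, alpha=1.0))  # 20.0, simply repeats the latest game
print(ses_forecast(scores, alpha=0.2))  # in between: leans on history, nudged by recent games
```

A small weight keeps the forecast slow-moving, while a large one chases a hot or cold streak.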

In [60]:
print(np.mean(aic))

315.52688106826


The mean AIC across players is about 315.5. AIC is only meaningful relative to other models fit to the same data, so I plan to grid search over ARIMA orders to determine the one that works best.

In [61]:
sals = pd.read_csv('./DKSalaries.csv')

In [62]:
sals.head()

Unnamed: 0,Position,Name + ID,Name,ID,Roster Position,Salary,Game Info,TeamAbbrev,AvgPointsPerGame
0,SP,Trevor Williams (18269392),Trevor Williams,18269392,P,12700,CHC@CIN 07/02/2021 07:10PM ET,CHC,11.3
1,SP,Tyler Glasnow (18269393),Tyler Glasnow,18269393,P,12600,TB@TOR 07/02/2021 07:07PM ET,TB,25.91
2,RP,Shane McClanahan (18269394),Shane McClanahan,18269394,P,12400,TB@TOR 07/02/2021 07:07PM ET,TB,14.87
3,SP,Kyle Hendricks (18269395),Kyle Hendricks,18269395,P,12200,CHC@CIN 07/02/2021 07:10PM ET,CHC,15.28
4,SP,Yonny Chirinos (18269396),Yonny Chirinos,18269396,P,11900,TB@TOR 07/02/2021 07:07PM ET,TB,0.0


In [63]:
pitchers = pd.read_csv('./merged_data/pitch_predictions.csv')
pitchers.head()

Unnamed: 0.1,Unnamed: 0,preds,Name
0,0,17.565609,Julio Urias
1,1,15.963415,Max Scherzer
2,2,11.838888,Taijuan Walker
3,3,16.47717,Jordan Montgomery
4,4,13.581485,Adrian Houser


In [64]:
pitch_preds = pd.merge(left = sals, right = pitchers[['Name','preds']])

In [65]:
pred_df = pd.DataFrame(preds, columns = ['Name', 'pred_score'])

In [66]:
merge_df = pd.merge(left = sals, right = pred_df)

In [67]:
# import in lineups of known starters
starters = pd.read_csv('./Datasets/Lineups_2021_06_30.csv')
starters.head()

Unnamed: 0,team code,game_date,game_number,mlb id,player name,batting order,confirmed,position
0,ARI,6/30/2021,1,668942.0,Josh Rojas,1,Y,LF
1,ARI,6/30/2021,1,641796.0,Tim Locastro,2,Y,CF
2,ARI,6/30/2021,1,500871.0,Eduardo Escobar,3,Y,2B
3,ARI,6/30/2021,1,572233.0,Christian Walker,4,Y,1B
4,ARI,6/30/2021,1,452678.0,Asdrubal Cabrera,5,Y,3B


I use the list of starters to make sure I am not adding bench players to my lineup.

In [68]:
#starters = starters[starters['team code'] != 'PIT']
#starters = starters[starters['team code'] != 'CIN']
#starters = starters[starters['team code'] != 'COL']
#starters = starters[starters['team code'] != 'PHI']

Some DraftKings contests only include a subset of the games being played that day, so I use this code when that is the case.
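
The chained filters can also be written as a single `isin` mask, which keeps the excluded teams in one list. A sketch on a toy frame, where the variable names and the player labels are hypothetical:

```python
import pandas as pd

starters_demo = pd.DataFrame({
    "team code": ["ARI", "PIT", "CIN", "NYY"],
    "player name": ["Player A", "Player B", "Player C", "Player D"],  # placeholder names
})

# Teams whose games are not part of the contest slate
excluded_teams = ["PIT", "CIN", "COL", "PHI"]
slate = starters_demo[~starters_demo["team code"].isin(excluded_teams)]
print(slate["team code"].tolist())  # ['ARI', 'NYY']
```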

In [69]:
starters.columns

Index(['team code', ' game_date', ' game_number', ' mlb id', ' player name',
       ' batting order', ' confirmed', ' position'],
      dtype='object')
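
Every column name after the first carries a leading space, which is why later cells refer to `' player name'`. Stripping the names once after loading is one way to avoid that. A sketch on a toy frame that mimics the header above:

```python
import pandas as pd

starters_demo = pd.DataFrame({
    "team code": ["ARI"],
    " player name": ["Josh Rojas"],
    " position": ["LF"],
})

# Remove leading and trailing whitespace from every column name
starters_demo.columns = starters_demo.columns.str.strip()
print(list(starters_demo.columns))  # ['team code', 'player name', 'position']
```

If this were applied to `starters`, the later merge would need to reference `'player name'` without the space.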

In [70]:
# .copy() gives an independent frame, so the assignments below do not raise SettingWithCopyWarning
ohtani = sals[sals['Name'] == 'Shohei Ohtani'].copy()
# Pick the pitching or hitting forecast to match how DraftKings lists him on this slate
if ohtani['Roster Position'].values[0] == 'P':
    ohtani['pred_score'] = pred_df.loc[pred_df['Name'] == 'Shohei (P) Ohtani', 'pred_score'].values[0]
    ohtani['Name'] = 'Shohei (P) Ohtani'
else:
    ohtani['pred_score'] = pred_df.loc[pred_df['Name'] == 'Shohei (H) Ohtani', 'pred_score'].values[0]
    ohtani['Name'] = 'Shohei (H) Ohtani'


In [71]:
merge_df = pd.concat([merge_df, ohtani])

When Ohtani is playing, I need to include this code to get the correct prediction in the dataframe.

In [72]:
merge_df

Unnamed: 0,Position,Name + ID,Name,ID,Roster Position,Salary,Game Info,TeamAbbrev,AvgPointsPerGame,pred_score
0,RP,Luis Garcia (18269452),Luis Garcia,18269452,P,8100,HOU@CLE 07/02/2021 07:10PM ET,HOU,17.06,4.568105
1,RP,Luis Garcia (18269662),Luis Garcia,18269662,P,4000,NYM@NYY 07/02/2021 07:05PM ET,NYY,0.00,4.568105
2,2B/SS,Luis Garcia (18268899),Luis Garcia,18268899,2B/SS,2800,LAD@WAS 07/02/2021 07:05PM ET,WAS,2.46,4.568105
3,OF,Giancarlo Stanton (18268423),Giancarlo Stanton,18268423,OF,6000,NYM@NYY 07/02/2021 07:05PM ET,NYY,8.12,8.116628
4,SS,Trea Turner (18268426),Trea Turner,18268426,SS,5900,LAD@WAS 07/02/2021 07:05PM ET,WAS,9.67,9.513323
...,...,...,...,...,...,...,...,...,...,...
444,OF,Leody Taveras (18269022),Leody Taveras,18269022,OF,2000,TEX@SEA 07/02/2021 10:10PM ET,TEX,2.40,2.399999
445,1B/OF,Jose Marmolejos (18269046),Jose Marmolejos,18269046,1B/OF,2000,TEX@SEA 07/02/2021 10:10PM ET,SEA,4.00,2.830914
446,1B,Evan White (18269051),Evan White,18269051,1B,2000,TEX@SEA 07/02/2021 10:10PM ET,SEA,3.66,3.666647
447,C,Jose Godoy (18269085),Jose Godoy,18269085,C,2000,TEX@SEA 07/02/2021 10:10PM ET,SEA,2.25,1.849429
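
In the output above, three different players named Luis Garcia (two pitchers and an infielder, on three teams) all receive the same `pred_score`, because the merge matches on `Name` alone. A sketch for flagging such ambiguous names before merging follows; the rows are copied from the output above and the variable names are hypothetical.

```python
import pandas as pd

sals_demo = pd.DataFrame({
    "Name": ["Luis Garcia", "Luis Garcia", "Luis Garcia", "Trea Turner"],
    "TeamAbbrev": ["HOU", "NYY", "WAS", "WAS"],
    "ID": [18269452, 18269662, 18268899, 18268426],
})

# keep=False marks every row whose name appears more than once
ambiguous = sals_demo[sals_demo["Name"].duplicated(keep=False)]
print(ambiguous[["Name", "TeamAbbrev", "ID"]])
```

Flagged names could then be matched on an extra key such as team, assuming the prediction data carries one.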


In [73]:
final = pd.merge(merge_df,starters[[' player name']], left_on = 'Name',
                 right_on = ' player name').drop(columns = [' player name'])

In [74]:
final = final[final['Roster Position'] != 'P']

In [75]:
final

Unnamed: 0,Position,Name + ID,Name,ID,Roster Position,Salary,Game Info,TeamAbbrev,AvgPointsPerGame,pred_score
2,2B/SS,Luis Garcia (18268899),Luis Garcia,18268899,2B/SS,2800,LAD@WAS 07/02/2021 07:05PM ET,WAS,2.46,4.568105
3,OF,Giancarlo Stanton (18268423),Giancarlo Stanton,18268423,OF,6000,NYM@NYY 07/02/2021 07:05PM ET,NYY,8.12,8.116628
4,SS,Trea Turner (18268426),Trea Turner,18268426,SS,5900,LAD@WAS 07/02/2021 07:05PM ET,WAS,9.67,9.513323
5,3B,Nolan Arenado (18268430),Nolan Arenado,18268430,3B,5800,STL@COL 07/02/2021 08:10PM ET,STL,7.89,7.912380
6,OF,Aaron Judge (18268428),Aaron Judge,18268428,OF,5800,NYM@NYY 07/02/2021 07:05PM ET,NYY,8.52,8.560048
...,...,...,...,...,...,...,...,...,...,...
217,OF,Bradley Zimmer (18268817),Bradley Zimmer,18268817,OF,2000,HOU@CLE 07/02/2021 07:10PM ET,CLE,5.07,5.137928
218,OF,Daz Cameron (18268828),Daz Cameron,18268828,OF,2000,CWS@DET 07/02/2021 07:10PM ET,DET,7.67,1.756267
219,OF,Skye Bolt (18268822),Skye Bolt,18268822,OF,2000,BOS@OAK 07/02/2021 09:40PM ET,OAK,1.79,1.789474
220,OF,Skye Bolt (18268822),Skye Bolt,18268822,OF,2000,BOS@OAK 07/02/2021 09:40PM ET,OAK,1.79,1.789474


In [76]:
# Averaging my predictions with average score to hedge
pitch_preds['pred_score'] = (pitch_preds['AvgPointsPerGame'] + pitch_preds['preds'])/2

In [77]:
pitch_preds.drop(columns = ['preds'], inplace= True)

In [78]:
final = pd.concat([final, pitch_preds])

In [79]:
set(final['TeamAbbrev'])

{'ARI',
 'ATL',
 'BAL',
 'BOS',
 'CHC',
 'CIN',
 'CLE',
 'COL',
 'CWS',
 'DET',
 'HOU',
 'KC',
 'LAA',
 'LAD',
 'MIA',
 'MIL',
 'MIN',
 'NYM',
 'NYY',
 'OAK',
 'PIT',
 'SEA',
 'SF',
 'STL',
 'TB',
 'TEX',
 'TOR',
 'WAS'}

In [80]:
final = final[final['TeamAbbrev'] != 'SD']
final = final[final['TeamAbbrev'] != 'PHI']

The DraftKings ID is needed to submit lineups to DraftKings, so I merged the predictions with the DraftKings salary CSV.

In [81]:
# Lineup creation using genetic algorithm
# Used https://medium.com/@jarvisnederlof/building-a-genetic-algorithm-in-python-for-daily-fantasy-sports-9f497d378e34

import csv
import time 
import random
import copy

In [82]:
class GeneticMLB(object):
    
    def __init__(self, num_lineups, duration = 60):
        self.num_lineups = num_lineups
        self.duration = duration
        self.pitcher = []
        self.catcher = []
        self.first = []
        self.second = []
        self.short = []
        self.third = []
        self.of = []
        self.top_150 = []
        
        
        
    def load_roster(self):
        for i, row in final.iterrows():
            player = {}
            player['name'] = row['Name + ID']
            player['id'] = row['ID']
            player['pos'] = row['Roster Position']
            player['salary'] = row['Salary']
            player['pred_score'] = row['pred_score']


            if 'P' in player['pos']:
                self.pitcher.append(player)
            if 'C' in player['pos']:
                self.catcher.append(player)
            if '1B' in player['pos']:
                self.first.append(player)
            if '2B' in player['pos']:
                self.second.append(player)
            if 'SS' in player['pos']:
                self.short.append(player)
            if '3B' in player['pos']:
                self.third.append(player)
            if 'OF' in player['pos']:
                self.of.append(player)
                
    def generate_lineup(self):
        
        while True:
            lineup = []
            lineup.append(self.pitcher[random.randint(0, len(self.pitcher)-1)])
            lineup.append(self.pitcher[random.randint(0, len(self.pitcher)-1)])
            lineup.append(self.catcher[random.randint(0, len(self.catcher)-1)])
            lineup.append(self.first[random.randint(0, len(self.first)-1)])
            lineup.append(self.second[random.randint(0, len(self.second)-1)])
            lineup.append(self.third[random.randint(0, len(self.third)-1)])
            lineup.append(self.short[random.randint(0, len(self.short)-1)])
            lineup.append(self.of[random.randint(0, len(self.of)-1)])
            lineup.append(self.of[random.randint(0, len(self.of)-1)])
            lineup.append(self.of[random.randint(0, len(self.of)-1)])
            
            lineup = self.check_valid(lineup)
            if lineup:
                return lineup
    # checking if lineups are valid and adding projected score
    def check_valid(self, lineup):
        
        projection = sum(player['pred_score'] for player in lineup)
        
        salary = sum(player['salary'] for player in lineup)
        
        num_players = len(set(player['name'] for player in lineup))
        
        if salary <= 50_000 and num_players == 10:
            lineup.extend((salary, projection))
            return lineup
        return False
    
    def mate_lineups(self, lineup1, lineup2):
        # List all players per position of both lineups plus a random player
        pitcher = [lineup1[0], lineup1[1], lineup2[0], lineup2[1], self.pitcher[random.randint(0, len(self.pitcher) -1)]]
        catcher = [lineup1[2], lineup2[2], self.catcher[random.randint(0, len(self.catcher) -1)]]
        first = [lineup1[3], lineup2[3], self.first[random.randint(0, len(self.first) -1)]]
        second = [lineup1[4], lineup2[4], self.second[random.randint(0, len(self.second) -1)]]
        third = [lineup1[5], lineup2[5], self.third[random.randint(0, len(self.third) -1)]]
        short = [lineup1[6], lineup2[6], self.short[random.randint(0, len(self.short) -1)]]
        of = [lineup1[7], lineup1[8], lineup1[9], lineup2[7], lineup2[8], lineup2[9], self.of[random.randint(0, len(self.of) -1)]]
        
        # Randomly grab players from each position to fill out new lineups
        def grab_players(players, num):
            avail_players = copy.deepcopy(players)
            selected_players = []
            while len(selected_players) < num:
                i = random.randint(0, len(avail_players) - 1)
                selected_players.append(avail_players[i])
                del avail_players[i]
            return selected_players
        
        # create new lineup by selecting players from lists
        while True:
            
            lineup = []
            lineup.extend(grab_players(pitcher, 2))
            lineup.extend(grab_players(catcher, 1))
            lineup.extend(grab_players(first, 1))
            lineup.extend(grab_players(second, 1))
            lineup.extend(grab_players(third, 1))
            lineup.extend(grab_players(short, 1))
            lineup.extend(grab_players(of, 3))
            
            lineup = self.check_valid(lineup)
            
            if lineup:
                return lineup
            
    def get_lineups(self):
        # make 10 lineups
        new_lineups = [self.generate_lineup() for _ in range(10)]
        # sort lineups by projected score (last element), best first
        new_lineups.sort(key = lambda x: x[-1], reverse = True)
        # add to top 150
        self.top_150.extend(new_lineups)
        # mate top 3 lineups
        offspring_1 = self.mate_lineups(new_lineups[0], new_lineups[1])
        offspring_2 = self.mate_lineups(new_lineups[0], new_lineups[2])
        offspring_3 = self.mate_lineups(new_lineups[1], new_lineups[2])
        # mate each offspring with a random lineup in the top 150 and add the result to the top 150
        self.top_150.append(self.mate_lineups(offspring_1, self.top_150[random.randint(0, len(self.top_150)-1)]))
        self.top_150.append(self.mate_lineups(offspring_2, self.top_150[random.randint(0, len(self.top_150)-1)]))
        self.top_150.append(self.mate_lineups(offspring_3, self.top_150[random.randint(0, len(self.top_150)-1)]))
        # add offspring to top 150
        self.top_150.append(offspring_1)
        self.top_150.append(offspring_2)
        self.top_150.append(offspring_3)
        
        
    def run(self):
        
        runtime = time.time() + self.duration
        while time.time() < runtime:
            self.get_lineups()
            self.top_150.sort(key = lambda x: x[-1], reverse = True)
            
            self.top_150 = self.top_150[:150]
            
    def save_file(self):
        
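        # Swap each player dict for its "Name + ID" string; salary and projection pass through unchanged
        lineups = [[player['name'] if isinstance(player, dict) else player for player in lineup] for lineup in self.top_150]
        
        # top_150 is sorted by projection, so repeated lineups usually sit next to each other;
        # keep the first of each run without indexing past the end of the list
        lineups = lineups[:self.num_lineups]
        lineups = [lineup for i, lineup in enumerate(lineups) if i == 0 or lineup != lineups[i-1]]
        
        lineups_for_upload = [lineup[:10] for lineup in lineups]
        
        # 10 roster slots, matching generate_lineup: 2 P, C, 1B, 2B, 3B, SS and 3 OF
        header_for_upload = ['P','P','C','1B','2B','3B','SS','OF','OF','OF']
        
        header = header_for_upload + ['Salary','Proj']
        
        # newline='' stops the csv module from writing blank rows on Windows
        with open("./lineups.csv", 'w', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(lineups)

        with open("./lineups_for_upload.csv", 'w', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(header_for_upload)
            writer.writerows(lineups_for_upload)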
        
        
        

In [83]:
g = GeneticMLB(num_lineups = 150, duration = 120)
g.load_roster()
g.run()
g.save_file()

This is the genetic algorithm that I adapted from the Medium article to generate my lineups. It first creates lineups by randomly selecting players and sorts them by predicted score. It then takes the top 3 lineups and mates them, adding one random candidate at each position. This adds some randomness while still biasing towards the top. Every lineup that is generated is checked to ensure that it meets the criteria set by DraftKings (a total salary of at most 50,000 and 10 distinct players). The program runs for the duration given as input, with a default of 1 minute, and then saves the top 150 lineups in a format that can be uploaded directly to DraftKings.