In [None]:
%%html
<style>
  table {margin-left: 0 !important;}
</style>

# Hitting . . . the sweet science

## Why batting average is not good enough, part 1

Traditionally, batting average has been **the measure** of a hitter's productivity. But as the only measure, it falls short for the following reasons:

### Reason 1 -- It doesn't take into account walks

     He gets on base a lot.  Why do I care how he gets there?  
     Billy Beane in Moneyball


If you recall this scene from [Moneyball](https://www.imdb.com/title/tt1210166/?ref_=fn_al_tt_1), Brad Pitt's character explains why they are trading for a player with a lower batting average: "He gets on base." And every time the scouts questioned something, he would point to the scout and say the following:

![base](https://c.tenor.com/i8OMlZpe19AAAAAd/moneyball-gets-on-base.gif)

If you used batting average as the **only** measure, you would completely disregard walks, which is silly because a player on first base is valuable whether he singled or walked.

### Stat: On Base Percentage

On base percentage is simply the percentage of the time a hitter reaches base:

     OBP = (Hits + Walks + Hit by Pitch) / (At Bats + Walks + Hit by Pitch + Sacrifice Flies)
     
#### Define Formula

In [7]:
def onBasePercentage(hits, walks, hbp, ab, sf):
    return round((hits + walks + hbp) / (ab + walks + hbp + sf), 3)
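
To see what batting average hides, here is a minimal sketch comparing two hitters. Both players and all of their stat lines are hypothetical, made up for illustration; `battingAverage` is a helper introduced here (hits divided by at bats), and `onBasePercentage` is repeated from above only so the block runs on its own.

```python
def battingAverage(hits, ab):
    # Batting average only looks at hits and at bats
    return round(hits / ab, 3)

def onBasePercentage(hits, walks, hbp, ab, sf):
    # Same formula as above: times on base / plate appearances that count
    return round((hits + walks + hbp) / (ab + walks + hbp + sf), 3)

# Hypothetical hitters: identical hits and at bats, very different walk totals
freeSwingerAvg = battingAverage(150, 600)
freeSwingerObp = onBasePercentage(150, 20, 0, 600, 0)

patientHitterAvg = battingAverage(150, 600)
patientHitterObp = onBasePercentage(150, 100, 0, 600, 0)

print("Free swinger:  ", freeSwingerAvg, freeSwingerObp)
print("Patient hitter:", patientHitterAvg, patientHitterObp)
```

Both hitters show the same batting average, but the patient hitter reaches base far more often, which is exactly the difference batting average cannot see.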

In [38]:
### Load the Data Again

## Give me the batting data into a dataframe
import pandas as pd

# List of Batting Stats
battingUrl="https://raw.githubusercontent.com/chadwickbureau/baseballdatabank/master/core/Batting.csv"
#dfBatting=pd.read_csv(battingUrl, usecols = ['playerID','yearID', 'G', 'teamID', 'AB', 'H', 'BB', 'HBP', 'SF'])
dfBatting=pd.read_csv(battingUrl)
dfModernBatting = dfBatting.query("yearID > 1976 and AB > 500 and G > 50")

peopleUrl="https://raw.githubusercontent.com/chadwickbureau/baseballdatabank/master/core/People.csv"
dfPlayers=pd.read_csv(peopleUrl, usecols = ['playerID','nameFirst', 'nameLast'])

### Now we need to join the batting and players
dfPlayersAndBatting = dfPlayers.merge(dfModernBatting, on='playerID', how='inner')
dfPlayersAndBatting

Unnamed: 0,playerID,nameFirst,nameLast,yearID,stint,teamID,lgID,G,AB,R,...,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP
0,abreubo01,Bobby,Abreu,1999,1,PHI,NL,152,546,118,...,93.0,27.0,9.0,109,113.0,8.0,3.0,0.0,4.0,13.0
1,abreubo01,Bobby,Abreu,2000,1,PHI,NL,154,576,103,...,79.0,28.0,8.0,100,116.0,9.0,1.0,0.0,3.0,12.0
2,abreubo01,Bobby,Abreu,2001,1,PHI,NL,162,588,118,...,110.0,36.0,14.0,106,137.0,11.0,1.0,0.0,9.0,13.0
3,abreubo01,Bobby,Abreu,2002,1,PHI,NL,157,572,102,...,85.0,31.0,12.0,104,117.0,9.0,3.0,0.0,6.0,11.0
4,abreubo01,Bobby,Abreu,2003,1,PHI,NL,158,577,99,...,101.0,22.0,9.0,109,126.0,13.0,2.0,0.0,7.0,13.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4104,zobribe01,Ben,Zobrist,2011,1,TBA,AL,156,588,99,...,91.0,19.0,6.0,77,128.0,1.0,2.0,2.0,5.0,9.0
4105,zobribe01,Ben,Zobrist,2012,1,TBA,AL,157,560,88,...,74.0,14.0,9.0,97,103.0,7.0,3.0,2.0,6.0,13.0
4106,zobribe01,Ben,Zobrist,2013,1,TBA,AL,157,612,77,...,71.0,11.0,3.0,72,91.0,4.0,7.0,1.0,6.0,18.0
4107,zobribe01,Ben,Zobrist,2014,1,TBA,AL,146,570,83,...,52.0,10.0,5.0,75,84.0,4.0,1.0,2.0,6.0,8.0


In [43]:
#### Calculate Mike Trout OBP for 2016

dfTrout = dfPlayersAndBatting.query("nameLast == 'Trout' and yearID==2016", inplace=False)
m = dfTrout.iloc[0]

troutObp = onBasePercentage(m.H, m.BB, m.HBP, m.AB, m.SF)
m

playerID     troutmi01
nameFirst         Mike
nameLast         Trout
yearID            2016
stint                1
teamID             LAA
lgID                AL
G                  159
AB                 549
R                  123
H                  173
2B                  32
3B                   5
HR                  29
RBI              100.0
SB                30.0
CS                 7.0
BB                 116
SO               137.0
IBB               12.0
HBP               11.0
SH                 0.0
SF                 5.0
GIDP               5.0
Name: 3717, dtype: object
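
The cell above stores `troutObp` but only displays the row itself. As a quick check, the sketch below plugs the values from that row (H, BB, HBP, AB, SF) into the same formula by hand. The literals are copied from the output above; the variable name `troutObpCheck` is just for this example.

```python
def onBasePercentage(hits, walks, hbp, ab, sf):
    # Repeated here so the block runs on its own
    return round((hits + walks + hbp) / (ab + walks + hbp + sf), 3)

# Mike Trout, 2016, values copied from the row shown above
hits, walks, hbp, ab, sf = 173, 116, 11.0, 549, 5.0

# (173 + 116 + 11) / (549 + 116 + 11 + 5) = 300 / 681
troutObpCheck = onBasePercentage(hits, walks, hbp, ab, sf)
print(troutObpCheck)
```

That works out to 300 times on base in 681 qualifying plate appearances, or roughly 0.441.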

## Why batting average is not good enough, part 2

    All hits are not created equal
    
### Hits are not equal

It's obvious to anyone who watches baseball that hits are not created equal. A home run is always preferred to a single. There is no time when a manager would say, "Jeez, I wish Trout had not hit that home run."

Yet batting average and on base percentage treat them equally.  

#### Example 1 -- Single Sam vs Home Run Harry
- Single Sam has a batting average of 0.250. Every one of his hits is a single.
- Home Run Harry has a batting average of 0.250. Improbably, every one of his hits is a home run.

Which would you rather have? Obviously Home Run Harry.

### Enter Slugging Percentage

Slugging Percentage is a type of weighted average. It assigns the following weights to each kind of hit:
- Singles 1
- Doubles 2
- Triples 3
- Home Runs 4

So that would mean that Single Sam has a slugging percentage of 0.250, while Home Run Harry has a slugging percentage of 1.000.
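
The Sam and Harry comparison can be sketched in a few lines. The text only gives their 0.250 batting averages, so the 25 hits in 100 at bats used here are an assumption chosen to produce that average; any other totals with the same ratio give the same result.

```python
def slugging(singles, doubles, triples, homeruns, atBats):
    # Weighted hits (1, 2, 3, 4 bases) divided by at bats
    return round((singles + doubles*2.0 + triples*3.0 + homeruns*4.0) / atBats, 3)

# Assumed totals: 25 hits in 100 at bats, a 0.250 batting average for both
atBats = 100

samSlugging = slugging(25, 0, 0, 0, atBats)    # every hit is a single
harrySlugging = slugging(0, 0, 0, 25, atBats)  # every hit is a home run

print("Single Sam:    ", samSlugging)
print("Home Run Harry:", harrySlugging)
```

Same batting average, but Harry's slugging percentage is four times Sam's, because each of his hits is worth four bases instead of one.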


In [51]:
def slugging(singles, doubles, triples, homeruns, atBats):
    return round((singles + doubles*2.0 + triples*3.0 + homeruns*4.0) / (atBats), 3)

In [53]:

troutSingles = m.H - m['2B'] - m['3B'] - m['HR']
troutSlugging = slugging(troutSingles, m['2B'], m['3B'], m.HR, m.AB)
troutSlugging

0.55

## Combining Forces

### What if we could take the best of On Base Percentage and combine it with Slugging?

We would then get . . . On Base Plus Slugging (often abbreviated as OPS)


In [56]:
# OPS is simply slugging percentage plus on base percentage
def onBasePlusSlugging(hits, walks, hbp, ab, sf, singles, doubles, triples, homeruns):
    return slugging(singles, doubles, triples, homeruns, ab) + onBasePercentage(hits, walks, hbp, ab, sf)

# Arguments follow the signature order: hits, walks, hbp, ab, sf, then each hit type
troutOnBasePlusSlugging = onBasePlusSlugging(m.H, m.BB, m.HBP, m.AB, m.SF, troutSingles, m['2B'], m['3B'], m['HR'])
troutOnBasePlusSlugging