# Building The Heisman Dataset
***Sean Steinle***


In this notebook, we'll be pulling together the data we need to answer our questions about the Heisman award.

## Table of Contents

1. [Data Cleaning and Preparation](#1)
    1. [Preparing Votes Data](#1a)
    2. [Setting Up Player and Game APIs](#1b)
    3. [Pulling Stats for Past Heisman Winners](#1c)
2. [Exploratory Data Analysis](#2)
3. [Modeling and Machine Learning](#3)

In [1]:
#import libraries
import cfbd
import pandas as pd
import time

<a id='1'></a><a id='1a'></a>

## Preparing Votes Data

Heisman voting results for the past several decades are available from Sports Reference: https://www.sports-reference.com/cfb/awards/heisman.html. To import them into this notebook, I created a .csv file by hand. The cell below loads that file and keeps only the seasons from 2004 onward.

In [2]:
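votes_df = pd.read_csv("../data/heisman_votes.csv")
#keep 2004 onward; .copy() avoids a SettingWithCopyWarning when the Player column is overwritten later
votes_df = votes_df[votes_df["Year"] > 2003].copy()
votes_df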

Unnamed: 0,Rk,Player,School,Class,Pos,1st,2nd,3rd,Tot,Summary,Year
0,1,DeVonta Smith*\devonta-smith-1,Alabama,SR,WR,447,221,73,1856,117 Rec 1856 Yds 15.9 Avg 23 TD,2020
1,2,Trevor Lawrence*\trevor-lawrence-1,Clemson,JR,QB,222,176,169,1187,231 Cmp 334 Att 3153 Yds 24 TD 5 Int,2020
2,3,Mac Jones*\mac-jones-1,Alabama,JR,QB,138,248,220,1130,311 Cmp 402 Att 4500 Yds 41 TD 4 Int,2020
3,4,Kyle Trask*\kyle-trask-1,Florida,SR,QB,61,164,226,737,301 Cmp 437 Att 4283 Yds 43 TD 8 Int,2020
4,5,Najee Harris*\najee-harris-1,Alabama,SR,RB,16,47,74,216,251 Att 1466 Yds 5.8 Avg 26 TD,2020
...,...,...,...,...,...,...,...,...,...,...,...
166,6,Cedric Benson*\cedric-benson-1,Texas,SR,RB,12,41,69,187,326 Att 1834 Yds 5.6 Avg 19 TD,2004
167,7,Jason Campbell*\jason-campbell-1,Auburn,SR,QB,21,24,51,162,188 Cmp 270 Att 2700 Yds 20 TD 7 Int,2004
168,8,J.J. Arrington*\jj-arrington-1,California,SR,RB,10,33,19,115,289 Att 2018 Yds 7.0 Avg 15 TD,2004
169,9,Aaron Rodgers*\aaron-rodgers-1,California,JR,QB,8,14,15,67,209 Cmp 316 Att 2566 Yds 24 TD 8 Int,2004


In [3]:
players = votes_df['Player']
players

0          DeVonta Smith*\devonta-smith-1
1      Trevor Lawrence*\trevor-lawrence-1
2                  Mac Jones*\mac-jones-1
3                Kyle Trask*\kyle-trask-1
4            Najee Harris*\najee-harris-1
                      ...                
166        Cedric Benson*\cedric-benson-1
167      Jason Campbell*\jason-campbell-1
168        J.J. Arrington*\jj-arrington-1
169        Aaron Rodgers*\aaron-rodgers-1
170    Braylon Edwards*\braylon-edwards-1
Name: Player, Length: 171, dtype: object

In [4]:
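#clean names: keep the text before the backslash and drop the trailing asterisk
names = []
for player in players:
    name = player.split('\\')[0]
    if name.endswith("*"): #some names carry a trailing asterisk in the source table
        names.append(name[:-1])
    else:
        names.append(name)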
votes_df["Player"] = names
votes_df["Player"]

0        DeVonta Smith
1      Trevor Lawrence
2            Mac Jones
3           Kyle Trask
4         Najee Harris
            ...       
166      Cedric Benson
167     Jason Campbell
168     J.J. Arrington
169      Aaron Rodgers
170    Braylon Edwards
Name: Player, Length: 171, dtype: object

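The same cleaning step can be written as a small standalone function, which makes it easy to test on a couple of values before applying it to the whole column. This is a minimal sketch; `clean_name` is a hypothetical helper name, and the sample strings are copied from the `Player` column above.

```python
def clean_name(raw: str) -> str:
    """Return the display name from a value like 'Mac Jones*\\mac-jones-1'."""
    # Everything before the first backslash is the display name.
    display, _, _slug = raw.partition("\\")
    # Drop the trailing asterisk marker (and stray whitespace), if present.
    return display.rstrip("*").strip()

samples = ["DeVonta Smith*\\devonta-smith-1", "J.J. Arrington*\\jj-arrington-1"]
print([clean_name(s) for s in samples])  # ['DeVonta Smith', 'J.J. Arrington']
```

Applied to the DataFrame, this would look like `votes_df["Player"].map(clean_name)`. Unlike the loop above, `rstrip("*")` removes every trailing asterisk rather than just one, which should not matter for this data.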
<a id='1b'></a>

## Setting Up Player and Game APIs

<a id='1bi'></a>

**Configuring the API and Writing Functions**

In [5]:
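#API Configuration
import os
configuration = cfbd.Configuration()
#read the key from the environment instead of hard-coding it in the notebook
#(the variable name CFBD_API_KEY is an assumption; use whatever name you export)
configuration.api_key['Authorization'] = os.environ['CFBD_API_KEY']
configuration.api_key_prefix['Authorization'] = 'Bearer'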

players_api = cfbd.PlayersApi(cfbd.ApiClient(configuration))
games_api = cfbd.GamesApi(cfbd.ApiClient(configuration))

In [27]:
#these functions search a team's season stats for the given player and collect the categories relevant to that position
def pullStatsQB(name, season, team):
    qb = {}
    qb['NAME'] = name
    players = players_api.get_player_season_stats(year=season,team=team)
    for player in players: #more accurately, this would be "for stat in stats"
        if player.player == name:
            if player.category == 'passing': 
                qb["Pass"+player.stat_type] = player.stat
            if player.category == 'rushing':
                qb["Rush"+player.stat_type] = player.stat
    return qb

def pullStatsRB(name, season, team):
    rb = {}
    rb['NAME'] = name
    players = players_api.get_player_season_stats(year=season,team=team)
    for player in players:
        if player.player == name:
            if player.category == 'rushing':
                rb["Rush"+player.stat_type] = player.stat
            if player.category == 'receiving':
                rb["Rec"+player.stat_type] = player.stat
    return rb

def pullStatsWR(name, season, team):
    wr = {}
    wr['NAME'] = name
    players = players_api.get_player_season_stats(year=season,team=team)
    for player in players:
        if player.player == name:
            if player.category == 'rushing':
                wr["Rush"+player.stat_type] = player.stat
            if player.category == 'receiving':
                wr["Rec"+player.stat_type] = player.stat
    return wr

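The three functions above differ only in which stat categories they keep (and the RB and WR versions keep the same two), so they could be folded into one helper that takes the categories as an argument. The sketch below is one way to do that, not what the notebook actually runs. `collect_stats` and `PREFIXES` are hypothetical names, and the `SimpleNamespace` records are made-up stand-ins that only mimic the four attributes the loops above read (`player`, `category`, `stat_type`, `stat`).

```python
from types import SimpleNamespace

# Key prefix for each stat category, mirroring the key names built above.
PREFIXES = {"passing": "Pass", "rushing": "Rush", "receiving": "Rec"}

def collect_stats(stats, name, categories):
    """Build one flat dict of `name`'s stats, keeping only `categories`."""
    out = {"NAME": name}
    for stat in stats:
        if stat.player == name and stat.category in categories:
            out[PREFIXES[stat.category] + stat.stat_type] = stat.stat
    return out

# Made-up records for illustration only; real ones come from the API.
fake_stats = [
    SimpleNamespace(player="Player A", category="rushing", stat_type="YDS", stat=100.0),
    SimpleNamespace(player="Player A", category="receiving", stat_type="YDS", stat=20.0),
    SimpleNamespace(player="Player A", category="passing", stat_type="ATT", stat=1.0),
    SimpleNamespace(player="Player B", category="rushing", stat_type="YDS", stat=10.0),
]
rb = collect_stats(fake_stats, "Player A", {"rushing", "receiving"})
print(rb)
```

With a helper like this, the quarterback case would pass `{"passing", "rushing"}` and the other two positions `{"rushing", "receiving"}`.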
In [24]:
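#let's make an additional function to add our heisman stats to these dicts
#note: this reads the global `row` set by the iterrows() loop further down, so it only works when called inside that loop
def addVoteStats(d):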
    d['Year'] = row['Year']
    d['School'] = row['School']
    d['Class'] = row['Class']
    d['1stVotes'] = row['1st']
    d['2ndVotes'] = row['2nd']
    d['3rdVotes'] = row['3rd']
    d['TotalVotes'] = row['Tot']
    return d

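Because `addVoteStats` depends on a loop variable defined elsewhere, a variant that receives the row as an argument is easier to test on its own. The following is a sketch under that assumption; `add_vote_stats` and `VOTE_COLUMNS` are hypothetical names, and a plain dict stands in for a DataFrame row (both support `row[column]` lookups). The sample values are DeVonta Smith's 2020 row from the votes table.

```python
# Maps each output key to the column it is copied from in the votes table.
VOTE_COLUMNS = {
    "Year": "Year", "School": "School", "Class": "Class",
    "1stVotes": "1st", "2ndVotes": "2nd", "3rdVotes": "3rd", "TotalVotes": "Tot",
}

def add_vote_stats(d, row):
    """Copy the voting columns from `row` into `d` and return `d`."""
    for key, column in VOTE_COLUMNS.items():
        d[key] = row[column]
    return d

# A plain dict standing in for one row of votes_df.
row = {"Year": 2020, "School": "Alabama", "Class": "SR",
       "1st": 447, "2nd": 221, "3rd": 73, "Tot": 1856}
result = add_vote_stats({"NAME": "DeVonta Smith"}, row)
print(result)
```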
In [25]:
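#let's make one final function that adds the stats of a player's team in
def addTeamStats(d, name, season, team): #name is unused but kept so the call sites stay the same
    records = games_api.get_team_records(year=season, team=team)
    try:
        d['win_percent'] = records[0].total['wins']/records[0].total['games']
    except Exception: #no record returned, missing fields, or zero games
        d['win_percent'] = -1 #sentinel meaning "unknown"
    return d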

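One thing to keep in mind with the `-1` sentinel is that it will be treated as a real value by any later average or model. An alternative, sketched below, is to return NaN for missing records, which pandas skips by default in aggregations such as `mean()`. The `win_percent` helper is hypothetical and takes a plain mapping with `wins` and `games` keys, the same two fields the function above reads; the numbers in the example are made up.

```python
import math

def win_percent(total):
    """Return wins/games from a mapping, or NaN when it cannot be computed."""
    try:
        return total["wins"] / total["games"]
    except (KeyError, TypeError, ZeroDivisionError):
        # Missing keys, a missing record (None), or a team with zero games.
        return math.nan

print(win_percent({"wins": 10, "games": 12}))
print(math.isnan(win_percent({"wins": 0, "games": 0})))  # True
```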
<a id='1bii'></a>

**Testing Functions**

In [28]:
kp = pullStatsQB("Kenny Pickett", 2020, "Pittsburgh")
mi = pullStatsRB("Mark Ingram", 2009, "Alabama")
wr = pullStatsWR("Artavis Scott", 2015, "Clemson")

In [29]:
kp

{'NAME': 'Kenny Pickett',
 'PassYPA': 7.2,
 'PassATT': 333.0,
 'RushYPC': 2.0,
 'RushCAR': 81.0,
 'RushLONG': 18.0,
 'PassCOMPLETIONS': 204.0,
 'RushTD': 8.0,
 'PassPCT': 0.613,
 'PassINT': 9.0,
 'PassYDS': 2408.0,
 'PassTD': 13.0,
 'RushYDS': 162.0}

In [30]:
mi

{'NAME': 'Mark Ingram',
 'RushLONG': 70.0,
 'RecYPR': 10.4,
 'RecTD': 3.0,
 'RecREC': 32.0,
 'RecYDS': 334.0,
 'RecLONG': 69.0,
 'RushCAR': 271.0,
 'RushYPC': 6.1,
 'RushYDS': 1658.0,
 'RushTD': 17.0}

In [31]:
wr

{'NAME': 'Artavis Scott',
 'RushLONG': 10.0,
 'RushYDS': 20.0,
 'RushTD': 1.0,
 'RushCAR': 6.0,
 'RecREC': 93.0,
 'RushYPC': 3.3,
 'RecYDS': 901.0,
 'RecYPR': 9.7,
 'RecTD': 6.0,
 'RecLONG': 51.0}

<a id='1biii'></a>

**Potential Problems**

There were a few cases where the school names from Sports Reference did not match the names used by the CFBD API, so I changed them manually (Brigham Young -> BYU, Texas Christian -> TCU, etc.). There were also a few key errors that I corrected by hand after exporting the .csv files below. Finally, I dropped fumbles as a statistic because the data was inconsistent: some players had NaN even though other sources recorded fumbles for them.

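The school-name corrections could also be kept in code, so they are applied the same way every time the notebook is rerun. A minimal sketch follows; `SCHOOL_ALIASES` and `to_cfbd_school` are hypothetical names, and only the two pairs mentioned above are included because the full list of manual changes is not recorded here.

```python
# Sports Reference name -> CFBD name. Only the two pairs named in the text
# are listed; any further mismatches would need to be added by hand.
SCHOOL_ALIASES = {
    "Brigham Young": "BYU",
    "Texas Christian": "TCU",
}

def to_cfbd_school(school: str) -> str:
    """Return the CFBD spelling of a school, or the input if no alias is known."""
    return SCHOOL_ALIASES.get(school, school)

print(to_cfbd_school("Brigham Young"))  # BYU
print(to_cfbd_school("Alabama"))        # Alabama
```

On the votes table this would be applied as `votes_df["School"].map(to_cfbd_school)` before any API calls are made.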
<a id='1c'></a>

## Pulling Stats for Past Heisman Winners

In [32]:
qbs = []
rbs = []
wrs = []
total_time = time.time()
#map each position to its stat-pulling function and the list that collects its rows
handlers = {"QB": (pullStatsQB, qbs), "RB": (pullStatsRB, rbs), "WR": (pullStatsWR, wrs)}
for i,row in votes_df.iterrows():
    player_time = time.time()
    if row["Pos"] not in handlers: #players at other positions are skipped
        continue
    pull, bucket = handlers[row["Pos"]]
    #gather position specific data
    d = pull(row["Player"], row["Year"], row["School"])
    d = addVoteStats(d)
    d = addTeamStats(d, row["Player"], row["Year"], row["School"])
    bucket.append(d)
    #print("Player processed in: ", time.time()-player_time)

df_qb = pd.DataFrame(qbs)
df_rb = pd.DataFrame(rbs)
df_wr = pd.DataFrame(wrs)
    
print("Finished processing in ", (time.time()-total_time)/60, "mins")

Finished processing in  9.01957439184189 mins

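Several candidates share a team and season (Alabama had three in 2020 alone), and the loop above requests that team's stats once per candidate. If the API round trips account for most of the nine minutes, which I have not measured, caching by `(year, team)` should cut the run time. The sketch below shows the idea with `functools.lru_cache`; `fetch_team_stats` is a hypothetical stand-in for the real `players_api.get_player_season_stats` call so that the example runs offline.

```python
from functools import lru_cache

calls = []  # records each simulated API request so the saving is visible

def fetch_team_stats(year, team):
    """Hypothetical stand-in for players_api.get_player_season_stats."""
    calls.append((year, team))
    return ()

@lru_cache(maxsize=None)
def cached_team_stats(year, team):
    # Repeated (year, team) arguments return the stored result without a new request.
    return fetch_team_stats(year, team)

# Three 2020 Alabama candidates and one Clemson candidate, as in the votes table.
for year, team in [(2020, "Alabama"), (2020, "Clemson"), (2020, "Alabama"), (2020, "Alabama")]:
    cached_team_stats(year, team)

print(len(calls))  # 2
```

The same decorator could wrap the team-records lookup, since it is also keyed by season and team.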

In [34]:
#df_qb.to_csv("../data/heisman_QBs_copy.csv")
#df_rb.to_csv("../data/heisman_RBs_copy.csv")
#df_wr.to_csv("../data/heisman_WRs_copy.csv")

In [35]:
df_qb.head(50)

Unnamed: 0,NAME,PassPCT,PassYPA,PassCOMPLETIONS,PassINT,RushYPC,RushCAR,RushLONG,PassATT,PassTD,...,RushTD,RushYDS,Year,School,Class,1stVotes,2ndVotes,3rdVotes,TotalVotes,win_percent
0,Trevor Lawrence,0.692,9.4,231.0,5.0,3.0,68.0,34.0,334.0,24.0,...,8.0,203.0,2020,Clemson,JR,222,176,169,1187,0.833333
1,Mac Jones,0.774,11.2,311.0,4.0,0.5,35.0,14.0,402.0,41.0,...,1.0,16.0,2020,Alabama,JR,138,248,220,1130,1.0
2,Kyle Trask,0.689,9.8,301.0,8.0,0.8,64.0,26.0,437.0,43.0,...,3.0,50.0,2020,Florida,SR,61,164,226,737,0.666667
3,Justin Fields,0.702,9.3,158.0,6.0,4.7,81.0,44.0,225.0,22.0,...,5.0,383.0,2020,Ohio State,JR,5,6,21,48,0.875
4,Zach Wilson,0.733,11.0,244.0,3.0,3.8,69.0,33.0,333.0,32.0,...,10.0,265.0,2020,BYU,JR,3,6,21,42,0.916667
5,Ian Book,0.646,8.0,228.0,3.0,4.2,117.0,33.0,353.0,15.0,...,9.0,486.0,2020,Notre Dame,SR,5,5,13,38,0.833333
6,Joe Burrow,0.763,10.8,402.0,6.0,3.2,115.0,29.0,527.0,60.0,...,5.0,369.0,2019,LSU,SR,841,41,3,2608,1.0
7,Jalen Hurts,0.695,11.3,237.0,8.0,5.6,233.0,52.0,341.0,32.0,...,20.0,1298.0,2019,Oklahoma,SR,12,231,264,762,0.857143
8,Justin Fields,0.672,9.2,238.0,3.0,3.5,137.0,51.0,354.0,41.0,...,10.0,484.0,2019,Ohio State,SO,6,271,187,747,0.928571
9,Trevor Lawrence,0.658,9.0,268.0,8.0,5.5,103.0,67.0,407.0,36.0,...,9.0,563.0,2019,Clemson,SO,3,25,29,88,0.933333


In [36]:
df_rb

Unnamed: 0,NAME,RecLONG,RecYDS,RecTD,RecYPR,RushTD,RushYPC,RushYDS,RecREC,RushLONG,RushCAR,Year,School,Class,1stVotes,2ndVotes,3rdVotes,TotalVotes,win_percent
0,Najee Harris,26.0,425.0,4.0,9.9,26.0,5.8,1466.0,43.0,53.0,251.0,2020,Alabama,SR,16,47,74,216,1.0
1,Breece Hall,28.0,180.0,2.0,7.8,21.0,5.6,1573.0,23.0,75.0,279.0,2020,Iowa State,SO,6,10,26,64,0.75
2,Jonathan Taylor,36.0,252.0,5.0,9.7,21.0,6.3,2003.0,26.0,72.0,320.0,2019,Wisconsin,JR,6,44,83,189,0.714286
3,J.K. Dobbins,28.0,247.0,2.0,10.7,21.0,6.7,2003.0,23.0,68.0,301.0,2019,Ohio State,JR,2,36,36,114,0.928571
4,Chuba Hubbard,46.0,198.0,0.0,8.6,21.0,6.4,2094.0,23.0,92.0,328.0,2019,Oklahoma State,SO,0,11,46,68,0.615385
5,Travis Etienne,53.0,432.0,4.0,11.7,19.0,7.8,1614.0,37.0,90.0,207.0,2019,Clemson,JR,0,7,11,25,0.933333
6,Travis Etienne,24.0,78.0,2.0,6.5,23.0,8.3,1595.0,12.0,75.0,193.0,2018,Clemson,SO,0,6,17,29,1.0
7,Jonathan Taylor,30.0,60.0,0.0,7.5,15.0,7.0,2009.0,8.0,88.0,287.0,2018,Wisconsin,SO,1,2,19,26,0.615385
8,Darrell Henderson,71.0,295.0,3.0,15.5,22.0,8.9,1909.0,19.0,82.0,214.0,2018,Memphis,JR,0,3,15,21,0.571429
9,Bryce Love,12.0,33.0,0.0,5.5,19.0,8.1,2118.0,6.0,75.0,263.0,2017,Stanford,JR,75,421,233,1300,0.642857


In [37]:
df_wr

Unnamed: 0,NAME,RecYPR,RushTD,RecLONG,RecYDS,RushYDS,RushCAR,RecREC,RecTD,RushYPC,RushLONG,Year,School,Class,1stVotes,2ndVotes,3rdVotes,TotalVotes,win_percent
0,DeVonta Smith,15.9,1.0,66.0,1856.0,6.0,4.0,117.0,23.0,1.5,14.0,2020,Alabama,SR,447,221,73,1856,1.0
1,Dede Westbrook,19.1,0.0,88.0,1524.0,101.0,10.0,80.0,17.0,10.1,35.0,2016,Oklahoma,SR,7,49,90,209,0.846154
2,Amari Cooper,13.9,0.0,80.0,1727.0,23.0,5.0,124.0,16.0,4.6,20.0,2014,Alabama,JR,49,280,316,1023,0.857143
3,Marqise Lee,14.6,0.0,83.0,1721.0,106.0,13.0,118.0,14.0,8.2,38.0,2012,USC,SO,19,33,84,207,0.538462
4,Tavon Austin,11.3,3.0,75.0,1289.0,643.0,72.0,114.0,12.0,8.9,74.0,2012,West Virginia,SR,6,4,21,47,0.538462
5,Justin Blackmon,16.1,1.0,81.0,1782.0,77.0,4.0,111.0,20.0,19.3,69.0,2010,Oklahoma State,SO,1,23,56,105,0.846154
6,Mardy Gilyard,13.7,1.0,68.0,1191.0,16.0,5.0,87.0,11.0,3.2,5.0,2009,Cincinnati,SR,2,2,13,23,0.923077
7,Golden Tate,16.1,2.0,78.0,1496.0,186.0,25.0,93.0,15.0,7.4,33.0,2009,Notre Dame,JR,2,3,9,21,0.5
8,Michael Crabtree,12.0,0.0,82.0,1165.0,1.0,2.0,97.0,19.0,0.5,3.0,2008,Texas Tech,SO,3,27,53,116,0.846154
9,Dwayne Jarrett,14.5,0.0,62.0,1015.0,-3.0,1.0,70.0,12.0,-3.0,0.0,2006,USC,JR,1,11,22,47,0.846154


<a id='2'></a>
    
## Exploratory Data Analysis

As I mentioned before, I did some manual correction of a few players, so we'll reload the data and begin to summarize from there.

<a id='3'></a>
## Modeling and Machine Learning