# Data Pipeline for Calculating RAPM

To run the following code, you will need this software:

## Software Prerequisites


Option 1 (Running with Docker): 
- [Docker](https://www.docker.com/) and docker-compose

Option 2 (Running locally): 
- [Anaconda](https://anaconda.org/anaconda/python)
- [MongoDB](https://www.mongodb.com/)

Option 3 (Mixed):
- You can also run Anaconda locally and MongoDB in a Docker container, or vice versa

Once you have the above installed, you can proceed with these steps:

## Table of Contents

1. Scrape Data
2. Import the Data into a Database
3. Process Play-by-Play Data
4. Convert Play-by-Play/Possession Data into Matrix
5. Run Regression Model with Matrix

## 1. Scrape Data

We use scripts to scrape the following for a given season:
- List of Game IDs and associated information (from nba.com)
- Play-By-Play Data for each game (from nba.com)

The following are also scraped for comparison against other advanced stats:
- Player Advanced Stats (from basketball-reference.com)
- Team Advanced Stats (from basketball-reference.com)
- RPM data (from espn.com)

The ```collect_all_season_data()``` function will finish by creating a .tar.gz file with all the above contents.
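As a rough illustration of that final archiving step, the sketch below packs a season directory into a `.tar.gz` file with the standard-library `tarfile` module. The helper name `archive_directory` is hypothetical and is not taken from `collect_nba_data`; the project's own function may work differently.

```python
import os
import tarfile


def archive_directory(dir_path):
    """Pack dir_path into '<dir_path>.tar.gz' and return the archive path.

    Hypothetical helper for illustration only; not the project's own code.
    """
    dir_path = dir_path.rstrip(os.sep)
    archive_path = dir_path + ".tar.gz"
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps only the directory name (e.g. "2015-16") inside the archive
        tar.add(dir_path, arcname=os.path.basename(dir_path))
    return archive_path
```

Calling `archive_directory("2015-16")` from inside `data_collector` would, under these assumptions, write `2015-16.tar.gz` next to the directory.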

In [1]:
# move into the data_collector directory
import os
os.chdir("data_collector")

In [2]:
import collect_nba_data
year = 2016
collect_nba_data.collect_all_season_data(year)

Found /home/aidan/rapm-model/data_collector/2015-16/games_preseason_2016.json
Found /home/aidan/rapm-model/data_collector/2015-16/preseason/0011500001.json
Found /home/aidan/rapm-model/data_collector/2015-16/preseason/0011500002.json
Found /home/aidan/rapm-model/data_collector/2015-16/preseason/0011500003.json
Found /home/aidan/rapm-model/data_collector/2015-16/preseason/0011500004.json
Found /home/aidan/rapm-model/data_collector/2015-16/preseason/0011500005.json
Found /home/aidan/rapm-model/data_collector/2015-16/preseason/0011500006.json
Found /home/aidan/rapm-model/data_collector/2015-16/preseason/0011500007.json
Found /home/aidan/rapm-model/data_collector/2015-16/preseason/0011500008.json
Found /home/aidan/rapm-model/data_collector/2015-16/preseason/0011500009.json
Found /home/aidan/rapm-model/data_collector/2015-16/preseason/0011500010.json
Found /home/aidan/rapm-model/data_collector/2015-16/preseason/0011500011.json
Found /home/aidan/rapm-model/data_collector/2015-16/preseason/00

Found /home/aidan/rapm-model/data_collector/2015-16/regular_season/0021500249.json
Found /home/aidan/rapm-model/data_collector/2015-16/regular_season/0021500250.json
Found /home/aidan/rapm-model/data_collector/2015-16/regular_season/0021500251.json
Found /home/aidan/rapm-model/data_collector/2015-16/regular_season/0021500252.json
Found /home/aidan/rapm-model/data_collector/2015-16/regular_season/0021500253.json
Found /home/aidan/rapm-model/data_collector/2015-16/regular_season/0021500254.json
Found /home/aidan/rapm-model/data_collector/2015-16/regular_season/0021500255.json
Found /home/aidan/rapm-model/data_collector/2015-16/regular_season/0021500256.json
Found /home/aidan/rapm-model/data_collector/2015-16/regular_season/0021500257.json
Found /home/aidan/rapm-model/data_collector/2015-16/regular_season/0021500258.json
Found /home/aidan/rapm-model/data_collector/2015-16/regular_season/0021500259.json
Found /home/aidan/rapm-model/data_collector/2015-16/regular_season/0021500260.json
Foun

Scraped ESPN's 2015-16 RPM Page 1/11
Scraped ESPN's 2015-16 RPM Page 2/11
Scraped ESPN's 2015-16 RPM Page 3/11
Scraped ESPN's 2015-16 RPM Page 4/11
Scraped ESPN's 2015-16 RPM Page 5/11
Scraped ESPN's 2015-16 RPM Page 6/11
Scraped ESPN's 2015-16 RPM Page 7/11
Scraped ESPN's 2015-16 RPM Page 8/11
Scraped ESPN's 2015-16 RPM Page 9/11
Scraped ESPN's 2015-16 RPM Page 10/11
Scraped ESPN's 2015-16 RPM Page 11/11
Table concatenated
Saved 2015-16/player_rpms_2016.csv
Created 2015-16.tar.gz
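The output above shows that `year = 2016` corresponds to the `2015-16` season directory, so the `year` argument is the calendar year in which the season ends. A minimal sketch of that mapping follows; `season_string` is a hypothetical name used only for illustration and is not the project's own helper.

```python
def season_string(end_year):
    """Map the year a season ends in to a label such as '2015-16'.

    Hypothetical helper; it only mirrors the naming seen in the log output.
    """
    return "{}-{:02d}".format(end_year - 1, end_year % 100)


print(season_string(2016))  # 2015-16
print(season_string(2018))  # 2017-18
```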


## 2. Import the Data into a Database

This project uses a MongoDB database to store scraped and processed data. The preferred way to start a MongoDB instance with this project is using [Docker](https://docs.docker.com/install/). 

1. [Install Docker](https://docs.docker.com/install/)
2. From the command line in the rapm-model directory:
   
   ```docker-compose up```

3. The above command will start an instance of MongoDB running at localhost:27017 with no password by default. You can use a GUI tool like MongoDB Compass or Robo 3T to view the contents of the database. 
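Before running the import cell, it can help to confirm that something is actually listening on that port. The sketch below uses only the standard-library `socket` module; `is_port_open` is a hypothetical helper, and a successful connection only shows that the port accepts TCP connections, not that the service behind it is MongoDB.

```python
import socket


def is_port_open(host="localhost", port=27017, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper; the defaults match the address mentioned above.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name-resolution failures
        return False
```

If `is_port_open()` returns `False`, the container is probably not running yet, or it is bound to a different host or port.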

In [4]:
import importlib
import data_importer
#importlib.reload(data_importer)

year = 2018
# if we delete the original files but still have the archive file we unzip it first
# data_importer.unarchive_data(year)

# Creates a games.[season] collection and each row of the table is a game object
# with following keys: {"game_index", "pbp"}
data_importer.import_games_pbp_into_mongo(year, "playoffs")

# Does some extra processing to update the rows in the games.[season] table 
# with the following keys: {"away", "home", "date"}
data_importer.import_game_info_into_mongo(year, "playoffs")

# Reads the scraped basketball-reference players table and imports it into
# the "players" collection in Mongo. 
data_importer.import_players_into_mongo(year)

# Reads the scraped basketball-reference teams table and imports it into
# the "teams" collection in Mongo. 
data_importer.import_teams_into_mongo(year)

# Reads the scraped RPM data and adds it to the previously created players table for each player
data_importer.add_rpm_to_player_table(year)


creating database index
database index created (or already existed)
Importing game_ids... (only showing every 100)
Inserted game_id 0041700101
creating index
{u'_id_': {u'key': [(u'_id', 1)], u'ns': u'nba.players', u'v': 2},
 u'player_index.name_stub_1_player_index.season_1_player_index.team_1': {u'key': [(u'player_index.name_stub',
                                                                                   1),
                                                                                  (u'player_index.season',
                                                                                   1),
                                                                                  (u'player_index.team',
                                                                                   1)],
                                                                         u'ns': u'nba.players',
                                                                         u'unique': True,
     

import rondaehollisjefferson_BRK_2018
import richaunholmes_PHI_2018
import rodneyhood_TOT_2018
import rodneyhood_UTA_2018
import rodneyhood_CLE_2018
import scottyhopson_DAL_2018
import alhorford_BOS_2018
import danuelhouse_PHO_2018
import dwighthoward_CHO_2018
import joshhuestis_OKC_2018
import rjhunter_HOU_2018
import vincehunter_MEM_2018
import sergeibaka_TOR_2018
import andreiguodala_GSW_2018
import ersanilyasova_TOT_2018
import ersanilyasova_ATL_2018
import ersanilyasova_PHI_2018
import joeingles_UTA_2018
import andreingram_LAL_2018
import brandoningram_LAL_2018
import kyrieirving_BOS_2018
import jonathanisaac_ORL_2018
import wesleyiwundu_ORL_2018
import jarrettjack_NYK_2018
import aaronjackson_HOU_2018
import demetriusjackson_TOT_2018
import demetriusjackson_HOU_2018
import demetriusjackson_PHI_2018
import joshjackson_PHO_2018
import justinjackson_SAC_2018
import reggiejackson_DET_2018
import lebronjames_CLE_2018
import mikejames_TOT_2018
import mikejames_PHO_2018
import mikejames

import joshsmith_NOP_2018
import tonysnell_MIL_2018
import marreesespeights_ORL_2018
import nikstauskas_TOT_2018
import nikstauskas_PHI_2018
import nikstauskas_BRK_2018
import lancestephenson_IND_2018
import davidstockton_UTA_2018
import julyanstone_CHO_2018
import edmondsumner_IND_2018
import calebswanigan_POR_2018
import jaysontatum_BOS_2018
import isaiahtaylor_ATL_2018
import jeffteague_MIN_2018
import marquisteague_MEM_2018
import mirzateletovic_MIL_2018
import garretttemple_SAC_2018
import milosteodosic_LAC_2018
import jasonterry_MIL_2018
import danieltheis_BOS_2018
import isaiahthomas_TOT_2018
import isaiahthomas_CLE_2018
import isaiahthomas_LAL_2018
import lancethomas_NYK_2018
import klaythompson_GSW_2018
import tristanthompson_CLE_2018
import sindariusthornwell_LAC_2018
import anthonytolliver_DET_2018
import karlanthonytowns_MIN_2018
import pjtucker_HOU_2018
import evanturner_POR_2018
import mylesturner_IND_2018
import ekpeudoh_UTA_2018
import tylerulis_PHO_2018
import jonasval

Added RPM for Rudy Gobert
Added RPM for Kemba Walker
Added RPM for Jrue Holiday
Added RPM for DeMarcus Cousins
Added RPM for Kevin Durant
Added RPM for Joe Ingles
Added RPM for Tyreke Evans
Added RPM for Kevin Love
Added RPM for Fred VanVleet
Added RPM for Kelly Olynyk
Added RPM for Nikola Mirotic
Added RPM for Nikola Mirotic
Added RPM for LaMarcus Aldridge
Added RPM for Kyle Anderson
Added RPM for Jayson Tatum
Added RPM for Paul George
Added RPM for Ben Simmons
Added RPM for Nene Hilario
Added RPM for Jordan Bell
Added RPM for Kyle Korver
Added RPM for Kristaps Porzingis
[CLE] Larry Nance Jr.'s name doesn't match bball-ref name. Attempting to resolve...
	stubname is larrynancejr
	Converted Larry Nance Jr. to Larry Nance
Added RPM for Larry Nance
Added RPM for Larry Nance
Added RPM for Eric Bledsoe
Added RPM for Eric Bledsoe
Added RPM for Dwight Powell
Added RPM for Spencer Dinwiddie
Added RPM for Ricky Rubio
Added RPM for Kyrie Irving
Added RPM for David West
Added RPM for Ekpe Udoh
A

Added RPM for Pat Connaughton
[WAS] Kelly Oubre Jr.'s name doesn't match bball-ref name. Attempting to resolve...
	stubname is kellyoubrejr
	Converted Kelly Oubre Jr. to Kelly Oubre
Added RPM for Kelly Oubre
Added RPM for Caleb Swanigan
Added RPM for Antonius Cleveland
Added RPM for Antonius Cleveland
Added RPM for Kyle Kuzma
Added RPM for Nicolas Brussino
Added RPM for Raymond Felton
Added RPM for Luke Babbitt
Added RPM for Luke Babbitt
Added RPM for Kenneth Faried
Added RPM for Damian Jones
Added RPM for Dante Exum
Added RPM for Cameron Payne
Added RPM for Erik McCree
Added RPM for Reggie Hearn
Added RPM for Lauri Markkanen
Added RPM for Xavier Silas
Added RPM for Brandon Ingram
Added RPM for Jarrett Allen
Added RPM for Dwyane Wade
Added RPM for Dwyane Wade
Added RPM for Michael Kidd-Gilchrist
Added RPM for Isaiah Whitehead
Added RPM for Andrew Wiggins
Added RPM for Ike Anigbogu
Added RPM for Darrell Arthur
Added RPM for T.J. McConnell
Added RPM for Marquese Chriss
Added RPM for Tyle

Added RPM for Jameer Nelson
Added RPM for Abdel Nader
Added RPM for Paul Zipser
Added RPM for Jamal Crawford
Added RPM for Patrick McCaw
Added RPM for Malik Monk
Added RPM for Semi Ojeleye
Added RPM for Emmanuel Mudiay
Added RPM for Emmanuel Mudiay
Added RPM for Kobi Simmons
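The name-resolution messages in the log above (for example `Larry Nance Jr.` becoming `Larry Nance`, with stub `larrynancejr`) suggest a normalization step between ESPN and basketball-reference names. The sketch below is one plausible way to do this and is not the project's actual code: `name_stub`, `strip_suffix`, and the suffix list are assumptions made for illustration.

```python
import re

# Assumption: generational suffixes that one source includes and the other omits
SUFFIXES = ("jr", "sr", "ii", "iii", "iv")


def name_stub(name):
    """Lowercase the name and drop everything except the letters a-z."""
    return re.sub(r"[^a-z]", "", name.lower())


def strip_suffix(name):
    """Remove a trailing generational suffix such as 'Jr.' if one is present."""
    parts = name.split()
    if len(parts) > 2 and parts[-1].lower().rstrip(".") in SUFFIXES:
        parts = parts[:-1]
    return " ".join(parts)


print(name_stub("Larry Nance Jr."))     # larrynancejr
print(strip_suffix("Larry Nance Jr."))  # Larry Nance
```

Note that this stub drops accented characters entirely, so names would need to be transliterated to ASCII first for the match to work.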


## 3. Process Play-by-Play Data

Here we convert the raw play-by-play data from nba.com to a version with lineup information, which we need to do the RAPM calculation. More precisely, we parse the play-by-play data into this form: 

```json
[
  {
    "home_lineup": [playerid1, ..., playerid5],
    "away_lineup": [playerid6, ..., playerid10],
    "score_margin_update": 2,
    "home_team_is_on_offense": true,
    "possession_metadata": 3
  },
  {...},
  ...
]
```

- `score_margin_update`: the points scored on the possession, in the range 0-4 (not accounting for weirdness that can happen with technicals and flagrants)
- `home_team_is_on_offense`: `true` or `false`
- `possession_metadata`: any data we want to keep about what the event was, at least the event number for reference

Each object in the list denotes the end of a possession: a turnover, a made shot, a defensive rebound, or the last of a series of free throws.

We store this in a collection called possessions.[season] for each season.
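To make the structure concrete, here is a minimal sketch that tallies points per side from a list of possession objects. It assumes `score_margin_update` holds the points scored by the team on offense, which is how the field reads above but is not confirmed by the parser itself; `points_by_side` is a hypothetical helper.

```python
def points_by_side(possessions):
    """Sum the points scored by the home and away teams over a list of possessions.

    Assumption: score_margin_update is the points scored by the offense.
    """
    totals = {"home": 0, "away": 0}
    for possession in possessions:
        side = "home" if possession["home_team_is_on_offense"] else "away"
        totals[side] += possession["score_margin_update"]
    return totals


# Made-up possessions with shortened two-player lineups, for illustration only
sample = [
    {"home_lineup": [1, 2], "away_lineup": [3, 4],
     "score_margin_update": 2, "home_team_is_on_offense": True,
     "possession_metadata": 3},
    {"home_lineup": [1, 2], "away_lineup": [3, 4],
     "score_margin_update": 3, "home_team_is_on_offense": False,
     "possession_metadata": 7},
]
print(points_by_side(sample))  # {'home': 2, 'away': 3}
```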

In [5]:
# get back into main rapm-model directory
# only run this once, or just put in an absolute path e.g. /home/dev/rapm-model
os.chdir("..")

In [7]:
import parse_pbp
import importlib
import common_utils
#importlib.reload(parse_pbp)
#importlib.reload(common_utils)

year = 2018
parse_pbp.save_lineup_data_for_season(year, "playoffs")

creating index
index created (or already existed)


AttributeError: 'module' object has no attribute 'perf_counter'

## 4. Convert Play-by-Play/Possession Data into Matrix

We convert the possessions to a matrix as described in [this talk by Jeremias Engelmann](https://www.youtube.com/watch?v=OuC0YZTADcE) and summarized in the accompanying paper. These matrices were too big for the MongoDB database, so they are stored as Python pickles.
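As a rough sketch of what such a matrix can look like, the code below builds one row per possession with two columns per player: an offense column and a defense column. The +1/-1 encoding, the column layout, and the use of `score_margin_update` as the response are assumptions based on a common RAPM formulation, not a description of what `calculate_rapm.build_matrix` actually does. Plain lists are used to keep the example dependency-free; a real season would call for sparse matrices.

```python
def build_design_matrix(possessions, player_ids):
    """Return (X, y) with one row per possession.

    Assumed layout: columns 0..n-1 are offense indicators and columns
    n..2n-1 are defense indicators, where n = len(player_ids).
    """
    index = {pid: i for i, pid in enumerate(player_ids)}
    n = len(player_ids)
    X, y = [], []
    for possession in possessions:
        if possession["home_team_is_on_offense"]:
            offense, defense = possession["home_lineup"], possession["away_lineup"]
        else:
            offense, defense = possession["away_lineup"], possession["home_lineup"]
        row = [0] * (2 * n)
        for pid in offense:
            row[index[pid]] = 1        # player is on the floor on offense
        for pid in defense:
            row[n + index[pid]] = -1   # player is on the floor on defense
        X.append(row)
        y.append(possession["score_margin_update"])
    return X, y
```

With this sign convention, a positive defensive coefficient means the player is associated with fewer points allowed.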


In [9]:
import calculate_rapm
import importlib

#importlib.reload(calculate_rapm)

# Creates a seasons table in MongoDB where each row is an NBA season with the following attributes:
# {"year_string", "games_data"}
# "games_data" is an object where the keys are the game_ids and the information stored is the home and away team for that game
#     This is totally redundant with the games collection, but stored in a way for faster lookups
calculate_rapm.store_games_data(year, "playoffs")

# This does a pass through the possessions collection and calculates the possession count for each player.
# It updates the "seasons" collection with an additional "player_info" attribute:
# an object keyed by player name and team, whose values hold a numerical index,
# a possession count, and a player stub name.
# It also updates the players table with the possession counts.
calculate_rapm.store_player_and_possession_data_for_matrix(year, "playoffs")

# This does a pass through the possessions data and creates the player/possession matrix for calculating RAPM
X, Y = calculate_rapm.build_matrix(year, "playoffs")


storing games data...
stored games data


KeyError: 'player_info'

## 5. Run Regression Model with Matrix
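RAPM is a regularized (ridge) regression of possession outcomes on the player matrix. The sketch below solves the ridge normal equations, (XᵀX + λI)β = Xᵀy, in plain Python so that it runs without dependencies. It is an illustration of the technique only: `ridge_fit` is a hypothetical name, the penalty `lam` is an arbitrary input, and the project's `calculate_rapm.calculate_rapm` may use a different solver and penalty. For season-sized matrices a linear algebra library would be the practical choice.

```python
def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y by Gaussian elimination.

    X is a list of rows, y a list of responses, lam > 0 the ridge penalty.
    """
    n_rows, n_cols = len(X), len(X[0])
    # Build the normal equations A beta = b
    A = [[sum(X[r][i] * X[r][j] for r in range(n_rows)) + (lam if i == j else 0.0)
          for j in range(n_cols)] for i in range(n_cols)]
    b = [float(sum(X[r][i] * y[r] for r in range(n_rows))) for i in range(n_cols)]
    # Forward elimination with partial pivoting
    for col in range(n_cols):
        pivot = max(range(col, n_cols), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n_cols):
            factor = A[r][col] / A[col][col]
            for c in range(col, n_cols):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    # Back substitution
    beta = [0.0] * n_cols
    for i in range(n_cols - 1, -1, -1):
        tail = sum(A[i][j] * beta[j] for j in range(i + 1, n_cols))
        beta[i] = (b[i] - tail) / A[i][i]
    return beta
```

The penalty shrinks every coefficient toward zero, which keeps estimates stable for players who almost always share the floor with the same teammates.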

In [None]:
import pickle

import calculate_rapm
from common_utils import construct_year_string
year = 2018
year_string = construct_year_string(year)
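# Load the pickled matrices, closing each file once it has been read
with open("matrices/{}-X-playoffs.indicator.pickle".format(year_string), "rb") as x_file:
    X = pickle.load(x_file)
with open("matrices/{}-Y-playoffs.pickle".format(year_string), "rb") as y_file:
    Y = pickle.load(y_file)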
# The above steps should only need to be done once so long as no new games are added
# This step runs the model, stores the values to the players table and prints out the top 50 in ORAPM
calculate_rapm.calculate_rapm(year, X, Y, season_type="playoffs")

# RAPMs are calculated for each "player-team" pair
# For traded players, this calculates a weighted average for their whole season
calculate_rapm.deal_with_traded_players(year)
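The weighted average for traded players can be pictured as follows. This sketch weights each team stint by its possession count, which is an assumption: the weights used by `deal_with_traded_players` are not shown here, and `combine_stints` is a hypothetical helper.

```python
def combine_stints(stints):
    """Possession-weighted average of per-team RAPM values.

    stints is a list of (rapm, possessions) tuples, one per team the player
    appeared for. Weighting by possessions is an assumption.
    """
    total_possessions = sum(possessions for _, possessions in stints)
    if total_possessions == 0:
        return 0.0
    weighted_sum = sum(rapm * possessions for rapm, possessions in stints)
    return weighted_sum / total_possessions


# Made-up numbers: 300 possessions at +2.0 and 100 possessions at -1.0
print(combine_stints([(2.0, 300), (-1.0, 100)]))  # 1.25
```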

In [None]:
import json

from pymongo import MongoClient

import mongo_config

client = MongoClient(mongo_config.host, mongo_config.port)
db = client.nba

players = db.players.find(
    filter={
        "player_index.season": 2018
    },
    projection={
        "player": 1,
        "team_id": 1,
        "orapm_playoffs": 1,
        "drapm_playoffs": 1,
        "rapm_playoffs": 1,
        "playoffs_possessions": 1
    },
    sort=[("rapm_playoffs", -1)]
)
result = []
index = 1
for player in players:
    if "playoffs_possessions" in player:
        result.append([
            index,
            player["player"],
            player["team_id"],
            player["playoffs_possessions"],
            round(player["orapm_playoffs"], 4),
            round(player["drapm_playoffs"], 4),
            round(player["rapm_playoffs"], 4)
        ])
        index += 1
#print(result)
with open("2017-18-playoffs-rapm.json", "w") as jsonfile:
    json.dump({
        "data": result
    }, jsonfile)