# Data Pipeline for Calculating RAPM

## Software Prerequisites

To run the code in this notebook, you need one of the following setups:


Option 1 (Running with Docker): 
- [Docker](https://www.docker.com/) and docker-compose

Option 2 (Running locally): 
- [Anaconda](https://anaconda.org/anaconda/python)
- [MongoDB](https://www.mongodb.com/)

Option 3 (Mixed):
- Run Anaconda locally and MongoDB in a Docker container, or vice versa

Once you have the above installed, you can proceed with these steps:

## Table of Contents

1. Scrape Data
2. Import the Data into a Database
3. Process Play-by-Play Data
4. Convert Play-by-Play/Possession Data into Matrix
5. Run Regression Model with Matrix

## 1. Scrape Data

We use scripts to scrape the following for a given season:
- List of Game IDs and associated information (from nba.com)
- Play-By-Play Data for each game (from nba.com)

The following are also scraped for comparison against other advanced stats:
- Player Advanced Stats (from basketball-reference.com)
- Team Advanced Stats (from basketball-reference.com)
- RPM data (from espn.com)

The ```collect_all_season_data()``` function will finish by creating a .tar.gz file with all the above contents.
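
The internals of `collect_all_season_data()` are not shown in this notebook. As a rough sketch of the archiving step only, assuming the scraped files for a season sit under a single directory such as `2014-15/`, the standard-library `tarfile` module can produce a `.tar.gz`. The helper names `archive_season_dir` and `list_archive` are hypothetical and not part of `collect_nba_data`.

```python
import tarfile
from pathlib import Path


def archive_season_dir(season_dir, archive_path):
    """Pack every file under season_dir into a gzip-compressed tarball."""
    season_dir = Path(season_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps paths inside the archive relative to the season folder
        tar.add(season_dir, arcname=season_dir.name)
    return archive_path


def list_archive(archive_path):
    """Return the sorted member names stored in the archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(tar.getnames())
```

Listing the archive before deleting the original files is a cheap way to confirm that every game file made it in.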

In [3]:
import os
os.chdir("data_collector")

OSError: [Errno 2] No such file or directory: 'data_collector'

In [7]:

# scrape every data source for one season
# (assumes the working directory is already data_collector)
import collect_nba_data
year = 2015
collect_nba_data.collect_all_season_data(year)

Found /home/aidan/rapm-model/data_collector/2014-15/games_preseason_2015.json
Found /home/aidan/rapm-model/data_collector/2014-15/preseason/0011400001.json
Found /home/aidan/rapm-model/data_collector/2014-15/preseason/0011400002.json
Found /home/aidan/rapm-model/data_collector/2014-15/preseason/0011400003.json
Found /home/aidan/rapm-model/data_collector/2014-15/preseason/0011400004.json
Found /home/aidan/rapm-model/data_collector/2014-15/preseason/0011400005.json
Found /home/aidan/rapm-model/data_collector/2014-15/preseason/0011400006.json
Found /home/aidan/rapm-model/data_collector/2014-15/preseason/0011400007.json
Found /home/aidan/rapm-model/data_collector/2014-15/preseason/0011400008.json
Found /home/aidan/rapm-model/data_collector/2014-15/preseason/0011400009.json
Found /home/aidan/rapm-model/data_collector/2014-15/preseason/0011400010.json
Found /home/aidan/rapm-model/data_collector/2014-15/preseason/0011400011.json
Found /home/aidan/rapm-model/data_collector/2014-15/preseason/00

Write /home/aidan/rapm-model/data_collector/2014-15/regular_season/0021400435.json
Write /home/aidan/rapm-model/data_collector/2014-15/regular_season/0021400436.json
Write /home/aidan/rapm-model/data_collector/2014-15/regular_season/0021400437.json
Write /home/aidan/rapm-model/data_collector/2014-15/regular_season/0021400438.json
Write /home/aidan/rapm-model/data_collector/2014-15/regular_season/0021400439.json
Write /home/aidan/rapm-model/data_collector/2014-15/regular_season/0021400440.json
Write /home/aidan/rapm-model/data_collector/2014-15/regular_season/0021400441.json
Write /home/aidan/rapm-model/data_collector/2014-15/regular_season/0021400442.json
Write /home/aidan/rapm-model/data_collector/2014-15/regular_season/0021400443.json
Write /home/aidan/rapm-model/data_collector/2014-15/regular_season/0021400444.json
Write /home/aidan/rapm-model/data_collector/2014-15/regular_season/0021400445.json
Write /home/aidan/rapm-model/data_collector/2014-15/regular_season/0021400446.json
Writ

Write /home/aidan/rapm-model/data_collector/2014-15/regular_season/0021400534.json
Write /home/aidan/rapm-model/data_collector/2014-15/regular_season/0021400535.json
Write /home/aidan/rapm-model/data_collector/2014-15/regular_season/0021400536.json
Write /home/aidan/rapm-model/data_collector/2014-15/regular_season/0021400537.json
Write /home/aidan/rapm-model/data_collector/2014-15/regular_season/0021400538.json
Write /home/aidan/rapm-model/data_collector/2014-15/regular_season/0021400539.json
Write /home/aidan/rapm-model/data_collector/2014-15/regular_season/0021400540.json
Write /home/aidan/rapm-model/data_collector/2014-15/regular_season/0021400541.json
Write /home/aidan/rapm-model/data_collector/2014-15/regular_season/0021400542.json
Write /home/aidan/rapm-model/data_collector/2014-15/regular_season/0021400543.json
Write /home/aidan/rapm-model/data_collector/2014-15/regular_season/0021400544.json
Write /home/aidan/rapm-model/data_collector/2014-15/regular_season/0021400545.json
Writ

Write /home/aidan/rapm-model/data_collector/2014-15/regular_season/0021400633.json
Write /home/aidan/rapm-model/data_collector/2014-15/regular_season/0021400634.json
Write /home/aidan/rapm-model/data_collector/2014-15/regular_season/0021400635.json
Write /home/aidan/rapm-model/data_collector/2014-15/regular_season/0021400636.json
Write /home/aidan/rapm-model/data_collector/2014-15/regular_season/0021400637.json
Write /home/aidan/rapm-model/data_collector/2014-15/regular_season/0021400638.json
Write /home/aidan/rapm-model/data_collector/2014-15/regular_season/0021400639.json
Write /home/aidan/rapm-model/data_collector/2014-15/regular_season/0021400640.json
Write /home/aidan/rapm-model/data_collector/2014-15/regular_season/0021400641.json
Write /home/aidan/rapm-model/data_collector/2014-15/regular_season/0021400642.json
Write /home/aidan/rapm-model/data_collector/2014-15/regular_season/0021400643.json
Write /home/aidan/rapm-model/data_collector/2014-15/regular_season/0021400644.json
Writ

ConnectionError: ('Connection aborted.', error("(104, 'ECONNRESET')",))
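
The run above stopped with a `ConnectionError` partway through the regular season. One way to make a long scrape more tolerant of dropped connections is to wrap the call in a retry loop with exponential backoff. The sketch below is an assumption about how that could look, not something `collect_nba_data` provides: `call_with_retries` is a hypothetical helper, and the default of retrying on `OSError` is a guess that should be adjusted to whatever exception the HTTP library actually raises.

```python
import time


def call_with_retries(func, *args, retries=3, base_delay=1.0,
                      retry_on=(OSError,), sleep=time.sleep, **kwargs):
    """Call func, retrying with exponential backoff on transient errors.

    OSError covers the built-in ConnectionError family; pass a different
    tuple in retry_on if the HTTP library raises its own exception types.
    """
    for attempt in range(retries + 1):
        try:
            return func(*args, **kwargs)
        except retry_on:
            if attempt == retries:
                raise  # out of attempts: surface the original error
            # wait 1x, 2x, 4x, ... base_delay before the next attempt
            sleep(base_delay * 2 ** attempt)
```

It could be used as `call_with_retries(collect_nba_data.collect_all_season_data, 2015)`. The `Found ...` lines in the output suggest that files already on disk are detected on a rerun, so a retry should not need to start from scratch, though that behaviour is inferred from the log rather than confirmed.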

## 2. Import the Data into a Database

This project uses a MongoDB database to store scraped and processed data. The preferred way to start a MongoDB instance with this project is using [Docker](https://docs.docker.com/install/). 

1. [Install Docker](https://docs.docker.com/install/)
2. From the command line in the rapm-model directory:
   
   ```docker-compose up```

3. The above command will start an instance of MongoDB running at localhost:27017 with no password by default. You can use a GUI tool like MongoDB Compass or Robo 3T to view the contents of the database. 
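
Before running the import cell, it can help to confirm that something is actually listening on the MongoDB port. The following is a minimal sketch using only the standard library; `is_port_open` is a hypothetical helper, and it only checks that a TCP connection succeeds, not that the listener is MongoDB.

```python
import socket


def is_port_open(host="localhost", port=27017, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # refused, unreachable, or timed out
        return False
```

Calling `is_port_open()` with the defaults targets `localhost:27017`, the address mentioned in step 3.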

In [None]:
import importlib
import data_importer
#importlib.reload(data_importer)

year = 2018
# if we delete the original files but still have the archive file we unzip it first
# data_importer.unarchive_data(year)

# Creates a games.[season] collection and each row of the table is a game object
# with following keys: {"game_index", "pbp"}
data_importer.import_games_pbp_into_mongo(year, "playoffs")

# Does some extra processing to update the rows in the games.[season] table 
# with the following keys: {"away", "home", "date"}
data_importer.import_game_info_into_mongo(year, "playoffs")

# Reads the scraped basketball-reference players table and imports it into
# the "players" collection in Mongo. 
data_importer.import_players_into_mongo(year)

# Reads the scraped basketball-reference teams table and imports it into
# the "teams" collection in Mongo. 
data_importer.import_teams_into_mongo(year)

# Reads the scraped RPM data and adds it to the previously created players table for each player
data_importer.add_rpm_to_player_table(year)


## 3. Process Play-by-Play Data

Here we convert the raw play-by-play data from nba.com into a version with lineup information, which the RAPM calculation requires. More precisely, we parse the play-by-play data into this form:

```json
[
  {
    "home_lineup": [playerid1, ..., playerid5],
    "away_lineup": [playerid6, ..., playerid10],
    "score_margin_update": 2,
    "home_team_is_on_offense": true,
    "possession_metadata": 3
  },
  {...},
  ...
]
```

- `score_margin_update`: points scored on the possession, normally in the range 0-4; this does not account for the unusual totals that technicals and flagrant fouls can produce.
- `home_team_is_on_offense`: `true` or `false`.
- `possession_metadata`: any data we want to keep about the event, at minimum the event number for reference.

Each object in the list denotes the end of a possession: a turnover, a made shot, a defensive rebound, or the last of a series of free throws.

We store this in a collection named `possessions.[season]`, one per season.
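
A quick sanity check on parsed possessions can catch lineup-tracking bugs early. The sketch below assumes each record is a Python dict with the keys shown above; `validate_possession` is a hypothetical helper that is not part of `parse_pbp`. It deliberately does not enforce the 0-4 scoring range, since technicals and flagrants can fall outside it.

```python
def validate_possession(possession):
    """Return a list of problems found in one possession record (empty if none)."""
    problems = []
    for key in ("home_lineup", "away_lineup"):
        lineup = possession.get(key, [])
        if len(lineup) != 5:
            problems.append("{} has {} players, expected 5".format(key, len(lineup)))
        if len(set(lineup)) != len(lineup):
            problems.append("{} contains a duplicate player".format(key))
    if not isinstance(possession.get("home_team_is_on_offense"), bool):
        problems.append("home_team_is_on_offense must be a boolean")
    if not isinstance(possession.get("score_margin_update"), int):
        problems.append("score_margin_update must be an integer")
    return problems
```

Running this over a season and printing only the non-empty results gives a short list of possessions worth inspecting by hand.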

In [None]:
# get back into main rapm-model directory
# only run this once, or just put in an absolute path e.g. /home/dev/rapm-model
os.chdir("..")

In [None]:
import parse_pbp
import importlib
import common_utils
importlib.reload(parse_pbp)
importlib.reload(common_utils)

year = 2018
parse_pbp.save_lineup_data_for_season(year, "playoffs")

## 4. Convert Play-by-Play/Possession Data into Matrix

We convert the possessions into a matrix as described in [this talk by Jeremias Engelmann](https://www.youtube.com/watch?v=OuC0YZTADcE) and summarized in the accompanying paper. These matrices were too big for the MongoDB database, so they are stored as Python pickle files instead.
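
To make the shape of the matrix concrete, here is a toy sketch of one common layout: each possession is a row, and every player gets two indicator columns, one for being on offense and one for being on defense. This layout is an assumption for illustration; the exact encoding used by `calculate_rapm.build_matrix` (for example the sign of the defensive entries) is not shown here. `build_design_matrix` and `player_index` are hypothetical names.

```python
def build_design_matrix(possessions, player_index):
    """Build (X, Y) from possession records.

    X has 2 * n_players columns: column i marks player i on offense,
    column n_players + i marks player i on defense. Y holds the points
    scored on each possession.
    """
    n = len(player_index)
    X, Y = [], []
    for p in possessions:
        if p["home_team_is_on_offense"]:
            offense, defense = p["home_lineup"], p["away_lineup"]
        else:
            offense, defense = p["away_lineup"], p["home_lineup"]
        row = [0] * (2 * n)
        for pid in offense:
            row[player_index[pid]] = 1  # offensive indicator
        for pid in defense:
            row[n + player_index[pid]] = 1  # defensive indicator
        X.append(row)
        Y.append(p["score_margin_update"])
    return X, Y
```

Every row has exactly ten non-zero entries, so for a full season a sparse representation would likely be far more memory-efficient than the dense lists used in this toy version.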


In [None]:
import calculate_rapm
import importlib

importlib.reload(calculate_rapm)

# Creates a seasons table in MongoDB where each row is an NBA season with the following attributes:
# {"year_string", "games_data"}
# "games_data" is an object where the keys are the game_ids and the information stored is the home and away team for that game
#     This is totally redundant with the games collection, but stored in a way for faster lookups
calculate_rapm.store_games_data(year, "playoffs")

# This does a pass through the possessions collection and calculates possession count for each player
# It updates the "seasons" collection with an additional "player_info" attribute
# which is an object keyed by player name and team; each value holds a numerical index, a possession count, and a player stub name
# It also updates the players table with the possession counts
calculate_rapm.store_player_and_possession_data_for_matrix(year, "playoffs")

# This does a pass through the possessions data and creates the player/possession matrix for calculating RAPM
X, Y = calculate_rapm.build_matrix(year, "playoffs")


## 5. Run Regression Model with Matrix
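
RAPM is conventionally fit as a ridge (L2-regularized) regression of possession outcomes on the player indicator matrix. How `calculate_rapm.calculate_rapm` fits its model, and which penalty it uses, is not shown in this notebook, so the following is only a small pure-Python sketch of the closed-form ridge solution, beta = (X^T X + lam * I)^-1 X^T y. `ridge_fit` and `lam` are hypothetical names; a real run would more likely rely on a numerical library.

```python
def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y by Gaussian elimination."""
    n_rows, n_cols = len(X), len(X[0])
    # Build the augmented system [X^T X + lam * I | X^T y]
    A = [[sum(X[r][i] * X[r][j] for r in range(n_rows)) + (lam if i == j else 0.0)
          for j in range(n_cols)]
         + [sum(X[r][i] * y[r] for r in range(n_rows))]
         for i in range(n_cols)]
    # Forward elimination with partial pivoting
    for col in range(n_cols):
        pivot = max(range(col, n_cols), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, n_cols):
            factor = A[r][col] / A[col][col]
            for c in range(col, n_cols + 1):
                A[r][c] -= factor * A[col][c]
    # Back substitution
    beta = [0.0] * n_cols
    for i in reversed(range(n_cols)):
        rest = sum(A[i][j] * beta[j] for j in range(i + 1, n_cols))
        beta[i] = (A[i][n_cols] - rest) / A[i][i]
    return beta
```

The penalty is what makes the problem solvable when players almost always share the floor: with two identical columns, ordinary least squares has no unique solution, while ridge splits the credit evenly between them.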

In [None]:
import pickle

import calculate_rapm
from common_utils import construct_year_string
year = 2018
year_string = construct_year_string(year)
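# load the pickled matrices, closing each file once it has been read
with open("matrices/{}-X-playoffs.indicator.pickle".format(year_string), "rb") as x_file:
    X = pickle.load(x_file)
with open("matrices/{}-Y-playoffs.pickle".format(year_string), "rb") as y_file:
    Y = pickle.load(y_file)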
# The above steps should only need to be done once so long as no new games are added
# This step runs the model, stores the values to the players table and prints out the top 50 in ORAPM
calculate_rapm.calculate_rapm(year, X, Y, season_type="playoffs")

# RAPMs are calculated for each "player-team" pair
# For traded players, this calculates a weighted average for their whole season
calculate_rapm.deal_with_traded_players(year)
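
The comment above says traded players receive a weighted average across their teams but does not state the weights. Assuming the weights are possession counts, which is a guess rather than something confirmed by `deal_with_traded_players`, the combination could be sketched as follows. `possession_weighted_average` is a hypothetical helper.

```python
def possession_weighted_average(stints):
    """Combine per-team ratings into one season rating.

    stints is a list of (rating, possessions) pairs, one per team the
    player appeared for. Returns None if the player logged no possessions.
    """
    total = sum(poss for _, poss in stints)
    if total == 0:
        return None
    return sum(rating * poss for rating, poss in stints) / total
```

For example, a player rated 2.0 over 300 possessions with one team and -1.0 over 100 possessions with another would come out at (2.0 * 300 - 1.0 * 100) / 400 = 1.25.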

In [None]:
import json

from pymongo import MongoClient

import mongo_config

client = MongoClient(mongo_config.host, mongo_config.port)
db = client.nba

players = db.players.find(
    filter = {
        "player_index.season": 2018
    },
    projection= {
        "player": 1,
        "team_id": 1,
        "orapm_playoffs": 1,
        "drapm_playoffs": 1,
        "rapm_playoffs": 1,
        "playoffs_possessions": 1
    }, 
    sort= [("rapm_playoffs", -1)]
)
result = []
index = 1
for player in players:
    if "playoffs_possessions" in player:
        result.append([
            index,
            player["player"],
            player["team_id"],
            player["playoffs_possessions"],
            round(player["orapm_playoffs"], 4),
            round(player["drapm_playoffs"], 4),
            round(player["rapm_playoffs"], 4)
        ])
        index += 1
#print(result)
with open("2017-18-playoffs-rapm.json", "w") as jsonfile:
    json.dump({
        "data": result
    }, jsonfile)