## Dataframe Cleaning/Merging

The objective of this notebook is to clean the data of missing values and to engineer additional features that I can calculate from the box score. I think these features will be useful for capturing additional variance in my data. I also merge the game data with the corresponding gambling lines in this notebook.

In [None]:
import json
import requests
from bs4 import BeautifulSoup
import time
import csv
import pandas as pd
import matplotlib.pyplot as plt
import re

%matplotlib inline

In [None]:
team_handles_dict = {'Toronto Raptors': 'TOR',
                     'Boston Celtics': 'BOS',
                     'Philadelphia 76ers': 'PHI',
                     'Cleveland Cavaliers': 'CLE',
                     'Indiana Pacers': 'IND',
                     'Miami Heat': 'MIA',
                     'Milwaukee Bucks': 'MIL',
                     'Washington Wizards': 'WAS',
                     'Detroit Pistons': 'DET',
                     #'Charlotte Hornets': 'CHO',
                     #'Charlotte Bobcats': 'CHA',
                     'New York Knicks': 'NYK',
                     'Brooklyn Nets': 'BRK',
                     'Chicago Bulls': 'CHI',
                     'Orlando Magic': 'ORL',
                     'Atlanta Hawks': 'ATL',
                     'Houston Rockets': 'HOU',
                     'Golden State Warriors': 'GSW',
                     'Portland Trail Blazers': 'POR',
                     'Oklahoma City Thunder': 'OKC',
                     'Utah Jazz': 'UTA',
                     'New Orleans Pelicans': 'NOP',
                     'San Antonio Spurs': 'SAS',
                     'Minnesota Timberwolves': 'MIN',
                     'Denver Nuggets': 'DEN',
                     'L.A. Clippers': 'LAC',
                     'L.A. Lakers': 'LAL',
                     'Sacramento Kings': 'SAC',
                     'Dallas Mavericks': 'DAL',
                     'Memphis Grizzlies': 'MEM',
                     'Phoenix Suns': 'PHO'}

#### dataframe_loader:
- loads data from repository into a pandas dataframe

In [None]:
def dataframe_loader(years_games):
    """Load one season's box-score JSON files into a dataframe with one row per team per game."""
    years_stats = []
    for path in years_games:
        with open(path) as g:
            years_stats.append(json.load(g))
    # Each file holds a list of games and each game holds one row per team; flatten to team rows
    all_games_year = [team for game_list in years_stats for game in game_list for team in game]
    df_year = pd.DataFrame(all_games_year, columns=['gid', 'team_slug', 'away_home', 'mp', 'fg', 'fga',
                                                    'fg%', '3p', '3pa', '3p%', 'ft', 'fta', 'ft%', 'orb',
                                                    'drb', 'trb', 'ast', 'stl', 'blk', 'tov', 'pf', 'pts'])
    return df_year
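
The flattening step in `dataframe_loader` can be tried on a small in-memory stand-in for the JSON files. This is only a sketch: the nested layout (file, then game, then team row) is inferred from the list comprehension, and the shortened rows and their values are invented for illustration.

```python
import pandas as pd

# Invented, shortened rows for illustration: [gid, team_slug, away_home, pts]
file_a = [  # one file = a list of games, each game = a list of team rows
    [['201503060WAS', 'MIA', 'away', 90], ['201503060WAS', 'WAS', 'home', 95]],
]
file_b = [
    [['201503070BOS', 'ORL', 'away', 88], ['201503070BOS', 'BOS', 'home', 92]],
]
years_stats = [file_a, file_b]

# Same flattening as dataframe_loader: files -> games -> team rows
rows = [team for game_list in years_stats for game in game_list for team in game]
toy_df = pd.DataFrame(rows, columns=['gid', 'team_slug', 'away_home', 'pts'])
print(toy_df)
```

Each game contributes two rows, so two single-game files produce a four-row dataframe.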

In [None]:
gl_2014 = !ls ../raw_data_files/*_2014.json
gl_2015 = !ls ../raw_data_files/*_2015.json
gl_2016 = !ls ../raw_data_files/*_2016.json
gl_2017 = !ls ../raw_data_files/*_2017.json
gl_2018 = !ls ../raw_data_files/*_2018.json

df_2014 = dataframe_loader(gl_2014)
df_2015 = dataframe_loader(gl_2015)
df_2016 = dataframe_loader(gl_2016)
df_2017 = dataframe_loader(gl_2017)
df_2018 = dataframe_loader(gl_2018)

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the seasons the same way
df_all = pd.concat([df_2014, df_2015, df_2016, df_2017, df_2018], ignore_index=True)

In [None]:
df_all['date'] = df_all['gid'].map(lambda x: x[:8])

In [None]:
df_all.head()

In [None]:
with open('../raw_data_files/all_gambling_lines.json') as g:
    all_lines = json.load(g)

In [None]:
# Each element of all_lines holds one day's rows; build one frame per day and stack them in order
df_lines = pd.concat([pd.DataFrame(day_line) for day_line in all_lines], ignore_index=True)

In [None]:
df_lines.head()

I wrote this cell to correct data points where I had to enter the line manually because it was missing from the website I scraped.

In [None]:
df_lines.loc[384, 2] = '-1 -105'
df_lines.loc[385, 2] = '194.5 -105'
df_lines.loc[2466, 2] = '203.5 -105'
df_lines.loc[2467, 2] = '-1 -105'
df_lines.loc[5677, 2] = '0 -105'

I wrote this cell to append missing lines to my dataframe.

In [None]:
missing_lines = pd.DataFrame([['20150306', 'Miami', '+6 -110'], 
                             ['20150306', 'Washington','193 -110']])

In [None]:
df_lines = pd.concat([df_lines, missing_lines], ignore_index=True)

In [None]:
df_lines.rename({0:'date', 1:'team', 2:'full_line'}, axis=1, inplace=True)

In [None]:
df_lines['team_slug'] = df_lines['team']

I wrote this cell to get the team_slug from a team name. Because Charlotte's slug changed over time, I give every game before October 2014 (the 2013-14 season) the Charlotte Bobcats slug, and every game in the following seasons the slug of the Charlotte Hornets.

In [None]:
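for i, team in enumerate(df_lines['team']):
    if team == "Charlotte":
        # Bobcats (CHA) before October 2014, Hornets (CHO) from then on
        if int(df_lines.loc[i, 'date']) < 20141001:
            df_lines.loc[i, 'team_slug'] = 'CHA'
        else:
            df_lines.loc[i, 'team_slug'] = 'CHO'
    else:
        # The lines data uses a shortened name, so match it as a substring of the full team name
        for key in team_handles_dict:
            if team in key:
                # .loc avoids chained assignment, which may silently write to a copy
                df_lines.loc[i, 'team_slug'] = team_handles_dict[key]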
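
The same lookup can be written as a small function, which makes it easier to check on individual names. This is a minimal sketch, not the code used in this notebook: `name_to_slug` and `TEAM_HANDLES` are hypothetical names, the dictionary is a three-team subset of `team_handles_dict`, and returning `None` for an unmatched name (and stopping at the first match) are my own choices here.

```python
# Hypothetical subset of team_handles_dict, kept short for the example
TEAM_HANDLES = {'Boston Celtics': 'BOS', 'Miami Heat': 'MIA', 'Washington Wizards': 'WAS'}

def name_to_slug(team, date, handles=TEAM_HANDLES):
    """Return the slug for a team name as it appears in the lines data."""
    if team == 'Charlotte':
        # Bobcats slug before October 2014, Hornets slug from then on
        return 'CHA' if int(date) < 20141001 else 'CHO'
    for full_name, slug in handles.items():
        if team in full_name:  # substring match, e.g. 'Miami' in 'Miami Heat'
            return slug
    return None  # no match: flag the row for manual inspection

print(name_to_slug('Miami', '20150306'))
print(name_to_slug('Charlotte', '20140115'))
```

With a function like this, the column could be filled with a single `apply` over the `team` and `date` columns instead of an index loop.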
            

In [None]:
df_lines['gid'] = df_lines['date'] + '0' + df_lines['team_slug']

In [None]:
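# Rows come in pairs; give each even-indexed row the gid of the row after it so both teams share one game ID
for i, entry in enumerate(df_lines['gid']):
    if i % 2 == 0:
        df_lines.loc[i, 'gid'] = df_lines.loc[i+1, 'gid']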
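
The game ID construction is easier to see on a single pair of rows. The sketch below assumes the lines are listed in pairs with the team whose slug appears in the game ID second; the toy frame is illustrative only.

```python
import pandas as pd

# One pair of rows in the assumed order: first team, then the team whose slug ends the game ID
lines = pd.DataFrame({
    'date': ['20150306', '20150306'],
    'team_slug': ['MIA', 'WAS'],
})

# Build a candidate gid for every row: date + '0' + slug
lines['gid'] = lines['date'] + '0' + lines['team_slug']

# Copy each odd-positioned row's gid onto the even-positioned row before it
lines.loc[lines.index[::2], 'gid'] = lines['gid'].iloc[1::2].to_numpy()
print(lines['gid'].tolist())
```

After this step both rows of a pair carry the same `gid`, which is what lets the later merge match each line to its game.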

In the cell below I merge my game data with my gambling lines data, using both the game ID and the team slug as the join keys.

In [None]:
df = pd.merge(df_all, df_lines, how='inner', on=['gid', 'team_slug'])
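
An inner join silently drops any game that has no matching line. One way to see what would be dropped is a left join with `indicator=True`, sketched here on invented toy frames (`games`, `lines`, and their values are not from my data).

```python
import pandas as pd

# Toy stand-ins for df_all and df_lines; g2 has no betting line
games = pd.DataFrame({'gid': ['g1', 'g1', 'g2'],
                      'team_slug': ['MIA', 'WAS', 'BOS'],
                      'pts': [90, 95, 101]})
lines = pd.DataFrame({'gid': ['g1', 'g1'],
                      'team_slug': ['MIA', 'WAS'],
                      'full_line': ['193 -110', '-6 -110']})

# indicator=True adds a _merge column: 'both' or 'left_only' for a left join
check = games.merge(lines, how='left', on=['gid', 'team_slug'], indicator=True)
unmatched = check[check['_merge'] == 'left_only']
print(unmatched[['gid', 'team_slug']])
```

Rows flagged `left_only` are the ones an inner join would discard, so they are candidates for a manually entered line.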

In [None]:
df.head()

In [None]:
len(df)

In [None]:
home_away_dict = {'away': 0, 'home': 1}

df['home'] = df.away_home.map(lambda x: home_away_dict[x]).copy()

In [None]:
df['betting_line'] = df['full_line'].map(lambda x: str.split(x)[0])
df['bet_terms'] = df['full_line'].map(lambda x: str.split(x)[1])

The following cells use a regular expression to remove the 1/2 symbol (½) from the betting lines and substitute .5 in its place. I did this because the ½ symbol is read as a special character and not a number, so the value could not otherwise be converted to a float.

In [None]:
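# Match anything that is not a digit, sign, or decimal point, so existing decimals such as 194.5 stay intact
p = re.compile(r'[^0-9.+-]+')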

In [None]:
for i, entry in enumerate(df['betting_line']):
    if p.findall(entry):
        df.loc[i, 'betting_line'] = entry.replace(p.findall(entry)[0], '.5')
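
The substitution can be checked on a few sample strings before running it on the whole column. This sketch uses a pattern that treats the decimal point as numeric, an assumption on my part that keeps values already written as decimals unchanged; `clean_line` is a hypothetical helper name.

```python
import re

# Anything that is not a digit, decimal point, or sign is treated as the half symbol
non_numeric = re.compile(r'[^0-9.+-]+')

def clean_line(entry):
    """Replace any non-numeric run (such as the half symbol) with '.5'."""
    return non_numeric.sub('.5', entry)

for raw in ['194½', '-1½', '203.5', '-6']:
    print(raw, '->', clean_line(raw))
```

Every cleaned value can then be passed to `float` without error.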

In [None]:
df['date'] = pd.to_datetime(df['date_x'], format='%Y%m%d')

In [None]:
df.drop(['date_x', 'team', 'mp', 'away_home', 'full_line', 'date_y'], axis=1, inplace=True)

In [None]:
df.sort_values(by=['team_slug', 'date'], inplace=True)

In [None]:
int_cols = ['fg', 'fga', '3p', '3pa', 'ft', 'fta',
            'orb', 'drb', 'trb', 'ast', 'stl',
            'blk', 'tov', 'pf', 'pts']
float_cols = ['fg%', '3p%', 'ft%', 'betting_line', 'bet_terms']

# Counting stats become integers; percentages and betting values become floats
df[int_cols] = df[int_cols].astype('int64')
df[float_cols] = df[float_cols].astype('float64')

By merging the dataframe with itself on the game ID and then removing the rows where a team was paired with itself, I get a dataframe in which each row represents one game as both the target team's box score and its opponent's box score. Every game is represented twice, once for each team participating in the game.

In [None]:
doubled_df = df.merge(df, on='gid',  suffixes=['_1', '_2'])

In [None]:
merged_df = doubled_df[doubled_df['team_slug_1'] != doubled_df['team_slug_2']].copy()
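
The self-merge is easiest to follow on one game. In this toy sketch (invented values), two team rows become four rows after the merge, and filtering out the self-pairs leaves two.

```python
import pandas as pd

# One game, two team rows
toy = pd.DataFrame({'gid': ['g1', 'g1'],
                    'team_slug': ['MIA', 'WAS'],
                    'pts': [90, 95]})

# Every row pairs with every row sharing its gid, including itself: 2 x 2 = 4 rows
doubled = toy.merge(toy, on='gid', suffixes=('_1', '_2'))

# Drop the rows where a team was paired with itself
paired = doubled[doubled['team_slug_1'] != doubled['team_slug_2']]
print(paired)
```

The `_1` columns hold the target team and the `_2` columns hold its opponent, so each game appears once from each side.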

In [None]:
merged_df.head()

Equation for Offensive Rating: 

Offensive Rating = 100 * Pts / (0.5 * ((Tm FGA + 0.4 * Tm FTA - 1.07 * (Tm ORB / (Tm ORB + Opp DRB)) * (Tm FGA - Tm FG) + Tm TOV) + (Opp FGA + 0.4 * Opp FTA - 1.07 * (Opp ORB / (Opp ORB + Tm DRB)) * (Opp FGA - Opp FG) + Opp TOV)))

In [None]:
merged_df['off_rating_1'] = merged_df.apply((lambda x: 100 * x['pts_1'] / 
                (0.5*((x['fga_1'] + 0.4*(x['fta_1']) - 1.07*(x['orb_1'] / (x['orb_1'] + x['drb_2']))
                 * (x['fga_1'] - x['fg_1']) + x['tov_1']) +
                (x['fga_2'] + 0.4*(x['fta_2']) - 1.07*(x['orb_2'] / (x['orb_2'] + x['drb_1']))
                 * (x['fga_2'] - x['fg_2']) + x['tov_2'])))), 1)

In [None]:
merged_df['off_rating_2'] = merged_df.apply((lambda x: 100 * x['pts_2'] / 
                (0.5*((x['fga_2'] + 0.4*(x['fta_2']) - 1.07*(x['orb_2'] / (x['orb_2'] + x['drb_1']))
                 * (x['fga_2'] - x['fg_2']) + x['tov_2']) +
                (x['fga_1'] + 0.4*(x['fta_1']) - 1.07*(x['orb_1'] / (x['orb_1'] + x['drb_2']))
                 * (x['fga_1'] - x['fg_1']) + x['tov_1'])))), 1)

In [None]:
merged_df['off_rating_1'] = round(merged_df['off_rating_1'], 2)
merged_df['off_rating_2'] = round(merged_df['off_rating_2'], 2)
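
Both ratings divide by the same possession estimate, so the equation above can also be computed once per row with column arithmetic instead of `apply`. The following is a sketch under that reading of the formula; `possessions` is a hypothetical helper and the box-score totals are invented.

```python
import pandas as pd

def possessions(d):
    """Possession estimate per row; d needs the *_1 and *_2 box-score columns."""
    def side(a, b):
        # Team a's offensive rebound share against team b's defensive rebounds
        orb_share = d[f'orb_{a}'] / (d[f'orb_{a}'] + d[f'drb_{b}'])
        return (d[f'fga_{a}'] + 0.4 * d[f'fta_{a}']
                - 1.07 * orb_share * (d[f'fga_{a}'] - d[f'fg_{a}'])
                + d[f'tov_{a}'])
    # Average of the two teams' estimates
    return 0.5 * (side(1, 2) + side(2, 1))

# Invented box-score totals for a single game row
toy = pd.DataFrame({
    'pts_1': [100], 'fg_1': [40], 'fga_1': [85], 'fta_1': [20],
    'orb_1': [10], 'drb_1': [30], 'tov_1': [12],
    'pts_2': [95], 'fg_2': [38], 'fga_2': [90], 'fta_2': [25],
    'orb_2': [12], 'drb_2': [32], 'tov_2': [14],
})

poss = possessions(toy)  # shared by both teams
toy['off_rating_1'] = (100 * toy['pts_1'] / poss).round(2)
toy['off_rating_2'] = (100 * toy['pts_2'] / poss).round(2)
print(toy[['off_rating_1', 'off_rating_2']])
```

Because the denominator is shared, the two ratings always stand in the same ratio as the two point totals.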

In [None]:
merged_df.reset_index(inplace=True)

In [None]:
# The over/under total is the larger of the two scraped values in each row (the other is the point spread)
over_under_list = merged_df[['betting_line_1', 'betting_line_2']].max(axis=1).tolist()

In [None]:
merged_df['over_under'] = pd.Series(over_under_list, merged_df.index)

In [None]:
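# One betting_line column holds the point spread and the other holds the over/under total.
# Overwrite the total with the negated spread so that each column holds that team's spread.
total_in_1 = merged_df['betting_line_1'] == merged_df['over_under']
merged_df.loc[total_in_1, 'betting_line_1'] = -merged_df.loc[total_in_1, 'betting_line_2']
merged_df.loc[~total_in_1, 'betting_line_2'] = -merged_df.loc[~total_in_1, 'betting_line_1']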
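
The cells above appear to split each pair of scraped values into a game total and a spread for each team. A toy version, assuming one column holds the spread and the other the total (values invented, `spread_1` and `spread_2` hypothetical column names), looks like this:

```python
import pandas as pd

# The same game seen from each team's side: one value is the spread, the other the total
toy = pd.DataFrame({'betting_line_1': [-6.0, 193.0],
                    'betting_line_2': [193.0, -6.0]})

# The total is the larger value in each row
toy['over_under'] = toy[['betting_line_1', 'betting_line_2']].max(axis=1)

# Where a column holds the total, use the negated spread from the other column instead
total_in_1 = toy['betting_line_1'] == toy['over_under']
toy['spread_1'] = toy['betting_line_1'].where(~total_in_1, -toy['betting_line_2'])
toy['spread_2'] = toy['betting_line_2'].where(total_in_1, -toy['betting_line_1'])
print(toy[['spread_1', 'spread_2', 'over_under']])
```

The favorite ends up with a negative spread and its opponent with the matching positive one, while the total is kept in its own column.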
        

In [None]:
merged_df.set_index('date_2', inplace=True)

In [None]:
merged_df.drop(['bet_terms_1', 'bet_terms_2', 'date_1', 'index'], axis=1, inplace=True)

In [None]:
merged_df.head()

In [None]:
merged_df.to_csv('clean_nba_betting_dataframe_full.csv', columns=merged_df.columns)