# DATA CLEANING AND CONDITIONING

# Introduction

<b><u>Module 3 project</u>:</b> <b>NBA prediction using classification models</b>

The goal of this project is to predict which NBA team will win a game. For this purpose I have framed each NBA game as a binary classification: the home team either wins or loses.

Several models will be tested and compared to define which gives the best forecast. The outputs will be tested with recent games from the 2019-2020 NBA season.

This notebook explains the datasets that will be used, followed by the cleaning and conditioning needed to prepare them for a baseline model.

The different sources for the dataset have been explained in the README. 

# List of Libraries for Data Cleaning and Conditioning

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', None)

Two different datasets will be used: a first one (loaded below) with the match results, and a second one, explained later on, with the statistics for each team per month of the NBA season.

The first dataset was scraped from the stats.nba.com webpage. The original format is an Excel file. The dataset is relatively clean; however, minor modifications were made in Excel before loading it into our first dataframe, called df1.

In [3]:
df1 = pd.read_excel('NBA_2018_Game_Results.xlsx')
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
pd.concat([df1.head(), df1.tail()])

Unnamed: 0,Date,Team1,Team1Score,Team2,Team2Score
0,"Tue, Oct 16, 2018",Philadelphia 76ers,87,Boston Celtics,105
1,"Tue, Oct 16, 2018",Oklahoma City Thunder,100,Golden State Warriors,108
2,"Wed, Oct 17, 2018",Milwaukee Bucks,113,Charlotte Hornets,112
3,"Wed, Oct 17, 2018",Brooklyn Nets,100,Detroit Pistons,103
4,"Wed, Oct 17, 2018",Memphis Grizzlies,83,Indiana Pacers,111
1273,"Sun, Apr 28, 2019",Houston Rockets,100,Golden State Warriors,104
1274,"Mon, Apr 29, 2019",Philadelphia 76ers,94,Toronto Raptors,89
1275,"Mon, Apr 29, 2019",Portland Trail Blazers,113,Denver Nuggets,121
1276,"Tue, Apr 30, 2019",Boston Celtics,102,Milwaukee Bucks,123
1277,"Tue, Apr 30, 2019",Houston Rockets,109,Golden State Warriors,115


In principle there is not much more to clean in this dataset; it only needs a bit of engineering so that it can later be merged with the statistics dataset.

The first thing that I will do is modify the Date column. The current text format won't be of much use, so a conversion with to_datetime should be enough to then extract the year and the month from it.

In [4]:
df1['Date'] = pd.to_datetime(df1['Date'])
df1.head()

Unnamed: 0,Date,Team1,Team1Score,Team2,Team2Score
0,2018-10-16,Philadelphia 76ers,87,Boston Celtics,105
1,2018-10-16,Oklahoma City Thunder,100,Golden State Warriors,108
2,2018-10-17,Milwaukee Bucks,113,Charlotte Hornets,112
3,2018-10-17,Brooklyn Nets,100,Detroit Pistons,103
4,2018-10-17,Memphis Grizzlies,83,Indiana Pacers,111


In [5]:
df1['year'] = pd.DatetimeIndex(df1['Date']).year
df1['month'] = pd.DatetimeIndex(df1['Date']).month
df1.head()

Unnamed: 0,Date,Team1,Team1Score,Team2,Team2Score,year,month
0,2018-10-16,Philadelphia 76ers,87,Boston Celtics,105,2018,10
1,2018-10-16,Oklahoma City Thunder,100,Golden State Warriors,108,2018,10
2,2018-10-17,Milwaukee Bucks,113,Charlotte Hornets,112,2018,10
3,2018-10-17,Brooklyn Nets,100,Detroit Pistons,103,2018,10
4,2018-10-17,Memphis Grizzlies,83,Indiana Pacers,111,2018,10
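
The same conversion can be written with an explicit date format and the `.dt` accessor. This is a minimal sketch on two sample rows copied from the table above; the `games` dataframe is a stand-in for df1, and the format string assumes every date looks like "Tue, Oct 16, 2018".

```python
import pandas as pd

# Two sample rows in the same text format as the raw 'Date' column.
games = pd.DataFrame({"Date": ["Tue, Oct 16, 2018", "Tue, Apr 30, 2019"]})

# An explicit format avoids format guessing and fails loudly on unexpected strings.
games["Date"] = pd.to_datetime(games["Date"], format="%a, %b %d, %Y")

# The .dt accessor exposes the date parts of a datetime column.
games["year"] = games["Date"].dt.year
games["month"] = games["Date"].dt.month
print(games)
```

Keeping the month as an integer here matters later, because the month is part of the key used to join the results with the monthly statistics.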


Now I will drop the 'Date' column and follow it with a .shape command to better understand the size of the dataset. Keep in mind that this is a preliminary exercise done with data from only 1 season as a test. If this works, 20 years of data will be added to it. I am constraining the study to that time frame because shortly before it the league went through an expansion with the addition of the Toronto Raptors and what were then the Vancouver Grizzlies.

In [6]:
df1=df1.drop(['Date'], axis=1)

In [7]:
df1.shape

(1278, 6)

As mentioned at the beginning, the game prediction will be based on the home team winning or losing the game. For this purpose I will create a new column called 'Game_Result' and assign a 1 when Team2Score (home team) is higher than Team1Score (visiting team), and a 0 otherwise:

In [8]:
df1['Game_Result'] = np.where(df1['Team2Score'] > df1['Team1Score'], 1, 0)
df1.head()

Unnamed: 0,Team1,Team1Score,Team2,Team2Score,year,month,Game_Result
0,Philadelphia 76ers,87,Boston Celtics,105,2018,10,1
1,Oklahoma City Thunder,100,Golden State Warriors,108,2018,10,1
2,Milwaukee Bucks,113,Charlotte Hornets,112,2018,10,0
3,Brooklyn Nets,100,Detroit Pistons,103,2018,10,1
4,Memphis Grizzlies,83,Indiana Pacers,111,2018,10,1


This is more or less the first dataset that I will use for the later merge. Most of these values will be dropped; however, Team1 (visiting team) and Team2 (home team) will be used to merge in the monthly averages of each of these teams. These averages are calculated by month so that the monthly variations in team performance can be accounted for.

At the moment, the columns available are described in the legend below.

<b>df1 column legend</b>:

* Team1 = Visiting Team

* Team1Score = Points scored by T1

* Team2 = Home Team

* Team2Score = Points scored by T2

* Game_Result = 1: Team2 (home team) won

* Game_Result = 0: Team2 (home team) lost


A quick .value_counts will give an idea of the results obtained and maybe some insight into the home court advantage:

In [9]:
df1.Game_Result.value_counts()

1    756
0    522
Name: Game_Result, dtype: int64

An interesting value revealed by the quick binary transformation in the 'Game_Result' column suggests that the home team won 59% of the games (756 of 1,278). This is probably thanks to a combination of factors, which might include the confidence of playing at home, no travel (meaning less tiredness) and of course the 6th player on the court: the crowd.

Now to the second dataset. This is a larger dataset in terms of number of columns. It contains a series of statistical values that, according to the top NBA analysts, result in win shares for each team. To clarify, this doesn't mean that the team with the best statistics will always win, but a combination of all of these is what will result in wins.
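
The 59% figure can be checked directly from the counts. The sketch below rebuilds a 0/1 column from the two counts reported above (756 and 522) rather than from the real file, so `game_result` is only a stand-in for `df1['Game_Result']`.

```python
import pandas as pd

# Rebuild the target column from the reported counts: 756 home wins, 522 home losses.
game_result = pd.Series([1] * 756 + [0] * 522, name="Game_Result")

# normalize=True returns shares instead of raw counts.
shares = game_result.value_counts(normalize=True)

# The mean of a 0/1 column is the share of ones, i.e. the home win rate.
home_win_rate = game_result.mean()

print(shares.round(4))
print(f"Home win rate: {home_win_rate:.2%}")
```

This share is also the accuracy that a model would reach by always predicting a home win, which makes it a useful baseline for the classifiers tested later.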

In [10]:
df2 = pd.read_excel('NBA_Monthly_Stats_2018-2019.xlsx')
df2.head()

Unnamed: 0,TEAM,Month,GP,W,L,MIN,OFFRTG,DEFRTG,NETRTG,AST_P,AST/TO,AST_RATIO,OREB_P,DREB_P,REB_P,TOV_P,EFG_P,TS_P,PACE,PIE
0,Golden State Warriors,October,9,8,1,432,120.3,107.5,12.9,65.7,2.0,20.9,26.6,71.8,51.1,14.8,59.9,63.1,104.0,58.5
1,Milwaukee Bucks,October,7,7,0,336,113.2,98.2,15.0,61.2,1.57,18.5,26.4,74.7,53.4,16.2,56.5,59.6,106.0,59.9
2,Toronto Raptors,October,8,7,1,384,114.5,106.7,7.8,58.1,1.85,18.1,28.4,68.3,49.5,13.5,54.5,57.8,102.31,55.9
3,Denver Nuggets,October,7,6,1,341,109.3,102.1,7.2,59.0,1.76,17.2,31.3,77.8,54.0,13.7,50.5,54.5,101.77,55.0
4,Boston Celtics,October,7,5,2,336,100.8,96.2,4.6,61.7,1.66,16.7,24.9,75.8,50.6,13.6,47.9,51.7,101.5,53.2


The NBA regular season is roughly 7 months long, starting in October and ending in April, with the playoffs following and normally running until the end of June. Due to Covid-19 these dates have all been modified. For this study I will only consider regular season games, and hopefully these results can later be extended to the playoffs.

I made the decision to do it this way because team and player performance changes dramatically during the playoffs, some for better and some for worse. Therefore this will be something added to the model results if time permits.

For the time being, let's check that the data does indeed reflect the 7 regular season months:

In [11]:
df2.Month.value_counts()

January     30
February    30
December    30
November    30
March       30
April       30
October     30
Name: Month, dtype: int64

In [12]:
df2.shape

(210, 20)

I will now run a quick test to see if the numbers make sense for one single team, in this case Golden State Warriors:

In [13]:
df_GSW= df2[df2['TEAM'] == 'Golden State Warriors']
df_GSW.head(10)

Unnamed: 0,TEAM,Month,GP,W,L,MIN,OFFRTG,DEFRTG,NETRTG,AST_P,AST/TO,AST_RATIO,OREB_P,DREB_P,REB_P,TOV_P,EFG_P,TS_P,PACE,PIE
0,Golden State Warriors,October,9,8,1,432,120.3,107.5,12.9,65.7,2.0,20.9,26.6,71.8,51.1,14.8,59.9,63.1,104.0,58.5
46,Golden State Warriors,November,14,7,7,682,110.7,111.1,-0.4,62.8,1.78,18.6,27.9,72.4,51.3,14.6,53.3,57.3,98.74,51.2
63,Golden State Warriors,December,15,10,5,725,110.6,106.7,3.9,67.4,1.97,19.5,23.7,75.1,49.9,13.7,53.7,57.6,103.15,53.2
94,Golden State Warriors,January,13,11,2,629,123.8,110.4,13.4,69.2,2.58,22.2,29.1,73.8,52.7,12.6,59.5,62.2,101.99,57.9
127,Golden State Warriors,February,11,7,4,528,115.2,111.0,4.3,65.6,2.29,20.0,25.7,70.4,49.0,12.3,55.7,58.6,102.05,53.7
159,Golden State Warriors,March,14,9,5,677,112.9,107.3,5.6,69.7,1.95,21.1,23.4,72.4,49.7,15.1,57.2,59.8,100.57,53.5
180,Golden State Warriors,April,6,5,1,288,113.7,103.7,10.0,66.2,2.02,21.4,22.9,71.2,49.4,14.9,58.6,60.0,103.42,55.5
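
The same sanity check can be run for every team at once with a groupby. This is a minimal sketch: the `gsw` dataframe only holds the Golden State rows copied from the table above, and the names `check` and `problems` are illustrative. On the full df2 the same two lines would cover all 30 teams.

```python
import pandas as pd

# Games played per month for Golden State, copied from the table above.
gsw = pd.DataFrame({
    "TEAM": ["Golden State Warriors"] * 7,
    "Month": ["October", "November", "December", "January",
              "February", "March", "April"],
    "GP": [9, 14, 15, 13, 11, 14, 6],
})

# Total games and number of distinct months per team.
check = gsw.groupby("TEAM").agg(games=("GP", "sum"), months=("Month", "nunique"))
print(check)

# Teams whose totals do not match an 82-game, 7-month regular season.
problems = check[(check["games"] != 82) | (check["months"] != 7)]
print(problems)
```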


Luckily yes, the total number of games adds up to 82, corresponding to a complete NBA regular season. Now, to be able to link this dataframe to the first one, I will add a new column where the month is a numeric value:

In [14]:
# Map month names to the numeric month (kept as strings, as before)
month_map = {
    'October': '10', 'November': '11', 'December': '12', 'January': '1',
    'February': '2', 'March': '3', 'April': '4',
}

# Unknown month names become NaN instead of silently being treated as April
df2['month'] = df2['Month'].map(month_map)
df2.head()

Unnamed: 0,TEAM,Month,GP,W,L,MIN,OFFRTG,DEFRTG,NETRTG,AST_P,AST/TO,AST_RATIO,OREB_P,DREB_P,REB_P,TOV_P,EFG_P,TS_P,PACE,PIE,month
0,Golden State Warriors,October,9,8,1,432,120.3,107.5,12.9,65.7,2.0,20.9,26.6,71.8,51.1,14.8,59.9,63.1,104.0,58.5,10
1,Milwaukee Bucks,October,7,7,0,336,113.2,98.2,15.0,61.2,1.57,18.5,26.4,74.7,53.4,16.2,56.5,59.6,106.0,59.9,10
2,Toronto Raptors,October,8,7,1,384,114.5,106.7,7.8,58.1,1.85,18.1,28.4,68.3,49.5,13.5,54.5,57.8,102.31,55.9,10
3,Denver Nuggets,October,7,6,1,341,109.3,102.1,7.2,59.0,1.76,17.2,31.3,77.8,54.0,13.7,50.5,54.5,101.77,55.0,10
4,Boston Celtics,October,7,5,2,336,100.8,96.2,4.6,61.7,1.66,16.7,24.9,75.8,50.6,13.6,47.9,51.7,101.5,53.2,10


Now a quick drop of the original 'Month' column:

In [15]:
df2 = df2.drop('Month', axis=1)
df2.head(2)

Unnamed: 0,TEAM,GP,W,L,MIN,OFFRTG,DEFRTG,NETRTG,AST_P,AST/TO,AST_RATIO,OREB_P,DREB_P,REB_P,TOV_P,EFG_P,TS_P,PACE,PIE,month
0,Golden State Warriors,9,8,1,432,120.3,107.5,12.9,65.7,2.0,20.9,26.6,71.8,51.1,14.8,59.9,63.1,104.0,58.5,10
1,Milwaukee Bucks,7,7,0,336,113.2,98.2,15.0,61.2,1.57,18.5,26.4,74.7,53.4,16.2,56.5,59.6,106.0,59.9,10


At this point I have a good chunk of the data I will need to predict the game winner, but I am still missing some of the basic basketball statistics needed:

* FG_P = Field Goal Percentage

* FGA = Field Goal Attempts

* 3PA = Three Point Attempts

* 3P_P = Three Point Percentage

Those are 4 of the 9 most important categories that, according to analysts, result in games won. So this is where the third dataset comes into play:

In [16]:
df3 = pd.read_excel('2018-2019_NBA_GEN_STATS.xlsx')
df3.head(2)

Unnamed: 0,TEAM,month,PTS,FGM,FGA,FG_P,3PM,3PA,3P_P
0,Milwaukee Bucks,10,120.0,43.9,91.4,48.0,15.6,40.6,38.4
1,Golden State Warriors,10,125.0,46.7,89.1,52.4,13.3,31.9,41.8


These two dataframes (df2 and df3) need to be merged with each other before going into the larger merge with df1. For this merge I will create a unique identifier column called 'TEAM_id', which is simply the team's name concatenated with the month of the statistics.

In [17]:
df2['TEAM_id'] = df2['TEAM'].astype(str)+'_'+df2['month'].astype(str)
df3['TEAM_id'] = df3['TEAM'].astype(str)+'_'+df3['month'].astype(str)

In [18]:
df2_a = df2.merge(df3, on="TEAM_id", how = 'inner')
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
pd.concat([df2_a.head(), df2_a.tail()])

Unnamed: 0,TEAM_x,GP,W,L,MIN,OFFRTG,DEFRTG,NETRTG,AST_P,AST/TO,AST_RATIO,OREB_P,DREB_P,REB_P,TOV_P,EFG_P,TS_P,PACE,PIE,month_x,TEAM_id,TEAM_y,month_y,PTS,FGM,FGA,FG_P,3PM,3PA,3P_P
0,Golden State Warriors,9,8,1,432,120.3,107.5,12.9,65.7,2.0,20.9,26.6,71.8,51.1,14.8,59.9,63.1,104.0,58.5,10,Golden State Warriors_10,Golden State Warriors,10,125.0,46.7,89.1,52.4,13.3,31.9,41.8
1,Milwaukee Bucks,7,7,0,336,113.2,98.2,15.0,61.2,1.57,18.5,26.4,74.7,53.4,16.2,56.5,59.6,106.0,59.9,10,Milwaukee Bucks_10,Milwaukee Bucks,10,120.0,43.9,91.4,48.0,15.6,40.6,38.4
2,Toronto Raptors,8,7,1,384,114.5,106.7,7.8,58.1,1.85,18.1,28.4,68.3,49.5,13.5,54.5,57.8,102.31,55.9,10,Toronto Raptors_10,Toronto Raptors,10,117.4,44.1,91.8,48.1,11.8,33.4,35.2
3,Denver Nuggets,7,6,1,341,109.3,102.1,7.2,59.0,1.76,17.2,31.3,77.8,54.0,13.7,50.5,54.5,101.77,55.0,10,Denver Nuggets_10,Denver Nuggets,10,112.9,42.1,91.6,46.0,8.3,28.3,29.3
4,Boston Celtics,7,5,2,336,100.8,96.2,4.6,61.7,1.66,16.7,24.9,75.8,50.6,13.6,47.9,51.7,101.5,53.2,10,Boston Celtics_10,Boston Celtics,10,102.4,37.3,89.3,41.8,11.0,33.6,32.8
205,Miami Heat,6,1,5,293,103.9,106.9,-3.1,61.8,1.73,18.1,25.3,72.9,49.2,14.4,50.7,53.1,101.24,48.2,4,Miami Heat_4,Miami Heat,4,106.8,41.5,92.7,44.8,11.0,34.8,31.6
206,New Orleans Pelicans,4,1,3,197,112.9,115.9,-2.9,59.5,1.83,18.8,27.5,72.3,50.3,14.4,54.5,56.7,102.21,47.4,4,New Orleans Pelicans_4,New Orleans Pelicans,4,117.8,46.3,96.5,47.9,12.8,34.8,36.7
207,Sacramento Kings,5,1,4,240,113.7,122.2,-8.5,50.9,2.23,16.5,29.4,72.9,49.1,10.2,52.8,55.0,101.9,45.8,4,Sacramento Kings_4,Sacramento Kings,4,116.0,45.6,97.8,46.6,12.0,33.0,36.4
208,Cleveland Cavaliers,5,0,5,240,106.1,122.7,-16.5,48.3,1.4,14.8,30.4,80.0,52.9,14.3,50.7,53.5,97.3,41.3,4,Cleveland Cavaliers_4,Cleveland Cavaliers,4,103.6,40.6,89.0,45.6,9.0,27.2,33.1
209,Washington Wizards,4,0,4,192,110.4,117.4,-7.0,57.3,2.13,17.2,31.4,73.9,50.7,11.4,49.4,52.8,100.88,45.7,4,Washington Wizards_4,Washington Wizards,4,111.5,42.8,96.3,44.4,9.5,33.0,28.8
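
An alternative to building a concatenated identifier is to merge on the two key columns directly, which also avoids the duplicated `TEAM_x`/`TEAM_y` and `month_x`/`month_y` columns. This is a minimal sketch with toy stand-ins for df2 and df3 (`adv` and `gen` are illustrative names, values copied from the tables above). It assumes the month is stored as a string in one frame and as an integer in the other, so the key is cast to a common dtype first.

```python
import pandas as pd

# Toy stand-in for df2 (advanced stats); month stored as a string.
adv = pd.DataFrame({
    "TEAM": ["Golden State Warriors", "Milwaukee Bucks"],
    "month": ["10", "10"],
    "OFFRTG": [120.3, 113.2],
})

# Toy stand-in for df3 (general stats); month assumed to be an integer.
gen = pd.DataFrame({
    "TEAM": ["Milwaukee Bucks", "Golden State Warriors"],
    "month": [10, 10],
    "FG_P": [48.0, 52.4],
})

# Key columns must share a dtype before merging.
adv["month"] = adv["month"].astype(int)

# validate raises an error if a (TEAM, month) pair appears more than once on either side.
merged = adv.merge(gen, on=["TEAM", "month"], how="inner", validate="one_to_one")
print(merged)
```

With this approach there is nothing to drop or rename afterwards, since the key columns appear only once in the result.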


That's a good step forward now that I have one of my two preliminary dataframes. There are, however, a few columns that I won't be needing, so I will proceed to drop them:

In [19]:
df2_a = df2_a.drop(['3PM', 'FGM', 'PTS', 'month_y', 'TEAM_y', 'TEAM_id'], axis=1)
df2_a.head(2)

Unnamed: 0,TEAM_x,GP,W,L,MIN,OFFRTG,DEFRTG,NETRTG,AST_P,AST/TO,AST_RATIO,OREB_P,DREB_P,REB_P,TOV_P,EFG_P,TS_P,PACE,PIE,month_x,FGA,FG_P,3PA,3P_P
0,Golden State Warriors,9,8,1,432,120.3,107.5,12.9,65.7,2.0,20.9,26.6,71.8,51.1,14.8,59.9,63.1,104.0,58.5,10,89.1,52.4,31.9,41.8
1,Milwaukee Bucks,7,7,0,336,113.2,98.2,15.0,61.2,1.57,18.5,26.4,74.7,53.4,16.2,56.5,59.6,106.0,59.9,10,91.4,48.0,40.6,38.4


After the previous merge, the two columns of interest for the next dataframe merge ('TEAM' and 'month') were renamed with an '_x' suffix, so I will revert them to their original names:

In [20]:
df2_a.rename({'TEAM_x': 'TEAM', 'month_x': 'month'}, axis=1, inplace=True)
df2_a.head(2)

Unnamed: 0,TEAM,GP,W,L,MIN,OFFRTG,DEFRTG,NETRTG,AST_P,AST/TO,AST_RATIO,OREB_P,DREB_P,REB_P,TOV_P,EFG_P,TS_P,PACE,PIE,month,FGA,FG_P,3PA,3P_P
0,Golden State Warriors,9,8,1,432,120.3,107.5,12.9,65.7,2.0,20.9,26.6,71.8,51.1,14.8,59.9,63.1,104.0,58.5,10,89.1,52.4,31.9,41.8
1,Milwaukee Bucks,7,7,0,336,113.2,98.2,15.0,61.2,1.57,18.5,26.4,74.7,53.4,16.2,56.5,59.6,106.0,59.9,10,91.4,48.0,40.6,38.4


Now I have a main dataframe, df2_a, with all the data that I will eventually be looking at. From it I will create a new dataframe called df2_ready, ready for the final merge or just for analyzing team performance during this one season (2018-2019). I will then save it as a .csv file for future analysis and work.

In [104]:
# .copy() makes df2_ready independent of df2_a, so columns can be added to it later
df2_ready = df2_a[['TEAM', 'month', 'OFFRTG', 'DEFRTG', 'AST/TO', 'REB_P', 'FG_P', 'FGA', 'PACE', '3PA', '3P_P']].copy()
df2_ready.head(5)

Unnamed: 0,TEAM,month,OFFRTG,DEFRTG,AST/TO,REB_P,FG_P,FGA,PACE,3PA,3P_P
0,Golden State Warriors,10,120.3,107.5,2.0,51.1,52.4,89.1,104.0,31.9,41.8
1,Milwaukee Bucks,10,113.2,98.2,1.57,53.4,48.0,91.4,106.0,40.6,38.4
2,Toronto Raptors,10,114.5,106.7,1.85,49.5,48.1,91.8,102.31,33.4,35.2
3,Denver Nuggets,10,109.3,102.1,1.76,54.0,46.0,91.6,101.77,28.3,29.3
4,Boston Celtics,10,100.8,96.2,1.66,50.6,41.8,89.3,101.5,33.6,32.8


In [105]:
df2_ready.to_csv('2018-2019_Monthly_Team_9MStats.csv')

<b>THESE ARE ACCORDING TO NBA ANALYSTS THE TOP STATISTICS TO DETERMINE WINS. SUCCESSFULLY TESTED WITH THE 2018-2019 SEASON DATA BY THE NBA WITH ABOVE 90% ACCURACY</b>

==========================================================================================================

==========================================================================================================

Now I have my two main dataframes: an initial one (df1), which defines the TARGET of my models ('Game_Result', based on the home team winning or losing the game), and a second one (df2_ready), which has the most relevant statistics that translate into wins according to expert NBA analysts.

The next step is to find a way to assign those monthly averages to both teams in each game, putting a label (prefix) of Team1 or Team2 in front of each one.

I will temporarily rename my dataframes for practicality. df1 will become results, whereas df2_ready will become stats:

In [23]:
results = df1
results.head(2)

Unnamed: 0,Team1,Team1Score,Team2,Team2Score,year,month,Game_Result
0,Philadelphia 76ers,87,Boston Celtics,105,2018,10,1
1,Oklahoma City Thunder,100,Golden State Warriors,108,2018,10,1


In [24]:
stats = df2_ready
stats.head(2)

Unnamed: 0,TEAM,month,OFFRTG,DEFRTG,AST/TO,REB_P,FG_P,FGA,PACE,3PA,3P_P
0,Golden State Warriors,10,120.3,107.5,2.0,51.1,52.4,89.1,104.0,31.9,41.8
1,Milwaukee Bucks,10,113.2,98.2,1.57,53.4,48.0,91.4,106.0,40.6,38.4


At this point I need a way to merge both dataframes and be able to add the statistics for both teams playing (Team1 and Team2) on each row. The solution that I came up with was to engineer a new column for each dataframe and relate (add) both columns with the team's name and corresponding game month, both as strings. That will be the link between both dataframes.

In [25]:
results['Team1_id'] = results['Team1'].astype(str)+'_'+results['month'].astype(str)
results['Team2_id'] = results['Team2'].astype(str)+'_'+results['month'].astype(str)

stats['Team_id'] = stats['TEAM'].astype(str)+'_'+stats['month'].astype(str)

Now I will display both new dataframes to QC the previous column manipulation/engineering:

In [26]:
display(results.head(2))
display(stats.head(2))

Unnamed: 0,Team1,Team1Score,Team2,Team2Score,year,month,Game_Result,Team1_id,Team2_id
0,Philadelphia 76ers,87,Boston Celtics,105,2018,10,1,Philadelphia 76ers_10,Boston Celtics_10
1,Oklahoma City Thunder,100,Golden State Warriors,108,2018,10,1,Oklahoma City Thunder_10,Golden State Warriors_10


Unnamed: 0,TEAM,month,OFFRTG,DEFRTG,AST/TO,REB_P,FG_P,FGA,PACE,3PA,3P_P,Team_id
0,Golden State Warriors,10,120.3,107.5,2.0,51.1,52.4,89.1,104.0,31.9,41.8,Golden State Warriors_10
1,Milwaukee Bucks,10,113.2,98.2,1.57,53.4,48.0,91.4,106.0,40.6,38.4,Milwaukee Bucks_10


The most important part of this process is the merge between both dataframes, with prefixes assigned to each team's statistics so that they can be identified and properly modeled. The final dataframe will have a larger number of columns, but some will be dropped as we might not need them.

In [27]:
df_merge = results.merge(
    stats.add_prefix('Team1_'), left_on='Team1_id', right_on='Team1_Team_id',how='left').merge(
    stats.add_prefix('Team2_'), left_on='Team2_id', right_on='Team2_Team_id',how='left').drop(
    columns= ['Team1_id','Team2_id','Team1_Team_id','Team2_Team_id','Team1_TEAM','Team2_TEAM'])

Following the merge, a quick display of the dataframe for QC purposes is always useful.

In [28]:
df_merge.head()

Unnamed: 0,Team1,Team1Score,Team2,Team2Score,year,month,Game_Result,Team1_month,Team1_OFFRTG,Team1_DEFRTG,Team1_AST/TO,Team1_REB_P,Team1_FG_P,Team1_FGA,Team1_PACE,Team1_3PA,Team1_3P_P,Team2_month,Team2_OFFRTG,Team2_DEFRTG,Team2_AST/TO,Team2_REB_P,Team2_FG_P,Team2_FGA,Team2_PACE,Team2_3PA,Team2_3P_P
0,Philadelphia 76ers,87,Boston Celtics,105,2018,10,1,10,106.3,107.5,1.83,51.1,43.5,93.1,104.39,35.9,33.8,10,100.8,96.2,1.66,50.6,41.8,89.3,101.5,33.6,32.8
1,Oklahoma City Thunder,100,Golden State Warriors,108,2018,10,1,10,102.4,105.0,1.3,50.8,42.8,93.8,106.08,29.7,27.5,10,120.3,107.5,2.0,51.1,52.4,89.1,104.0,31.9,41.8
2,Milwaukee Bucks,113,Charlotte Hornets,112,2018,10,0,10,113.2,98.2,1.57,53.4,48.0,91.4,106.0,40.6,38.4,10,114.9,109.1,2.09,49.2,46.3,91.0,100.44,34.5,37.7
3,Brooklyn Nets,100,Detroit Pistons,103,2018,10,1,10,108.0,111.4,1.46,48.7,45.4,87.5,99.15,35.8,38.1,10,107.6,108.7,1.34,52.3,43.3,92.7,100.37,32.3,32.3
4,Memphis Grizzlies,83,Indiana Pacers,111,2018,10,1,10,104.9,101.5,1.73,47.9,43.5,82.7,98.67,29.0,36.8,10,112.1,105.5,1.67,49.2,50.6,86.0,97.13,22.3,41.6
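
The double merge can also be written as a loop over the two sides, joining on the team and month columns directly. This is a minimal sketch on toy data (two games and one statistic copied from the tables above), not the notebook's exact code; `validate="many_to_one"` is added as a guard and raises an error if a team/month pair is duplicated in the statistics.

```python
import pandas as pd

# Two games from the results table.
results = pd.DataFrame({
    "Team1": ["Philadelphia 76ers", "Milwaukee Bucks"],
    "Team2": ["Boston Celtics", "Charlotte Hornets"],
    "month": [10, 10],
    "Game_Result": [1, 0],
})

# Monthly statistics for the four teams involved (one statistic for brevity).
stats = pd.DataFrame({
    "TEAM": ["Philadelphia 76ers", "Boston Celtics",
             "Milwaukee Bucks", "Charlotte Hornets"],
    "month": [10, 10, 10, 10],
    "OFFRTG": [106.3, 100.8, 113.2, 114.9],
})

feature_cols = [c for c in stats.columns if c not in ("TEAM", "month")]

merged = results
for side in ("Team1", "Team2"):
    # Rename TEAM to the side's column and prefix only the feature columns.
    renames = {"TEAM": side, **{c: f"{side}_{c}" for c in feature_cols}}
    side_stats = stats.rename(columns=renames)
    merged = merged.merge(side_stats, on=[side, "month"],
                          how="left", validate="many_to_one")

print(merged)
```

A left merge with unique keys on the right keeps exactly one row per game, so the merged frame should have the same number of rows as the results.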


In [29]:
df_merge.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1278 entries, 0 to 1277
Data columns (total 27 columns):
Team1           1278 non-null object
Team1Score      1278 non-null int64
Team2           1278 non-null object
Team2Score      1278 non-null int64
year            1278 non-null int64
month           1278 non-null int64
Game_Result     1278 non-null int64
Team1_month     1234 non-null object
Team1_OFFRTG    1234 non-null float64
Team1_DEFRTG    1234 non-null float64
Team1_AST/TO    1234 non-null float64
Team1_REB_P     1234 non-null float64
Team1_FG_P      1234 non-null float64
Team1_FGA       1234 non-null float64
Team1_PACE      1234 non-null float64
Team1_3PA       1234 non-null float64
Team1_3P_P      1234 non-null float64
Team2_month     1234 non-null object
Team2_OFFRTG    1234 non-null float64
Team2_DEFRTG    1234 non-null float64
Team2_AST/TO    1234 non-null float64
Team2_REB_P     1234 non-null float64
Team2_FG_P      1234 non-null float64
Team2_FGA       1234 non-null flo

In [88]:
df_merge.shape

(1278, 27)

In [103]:
df_merge.to_csv('df_merge.csv')

Before I continue, I will explain which columns will be used for the different models and which will be dropped. Our TARGET, as previously mentioned, will be 'Game_Result', as we are trying to predict whether the home team (Team2) wins or loses the game.

From that long list of basketball statistics I will create a smaller dataset for the predictive models, based on what are known to be the most important values for game result prediction. If you are interested in reading more about them, you can do so at this link:
https://picks.nba.com/primetime-picks

In [30]:
df_model = df_merge.drop(['Team1', 'Team1Score', 'Team2', 'Team2Score', 'year', 'month', 'Team1_month', 'Team2_month'], axis=1)
df_model.head(2)

Unnamed: 0,Game_Result,Team1_OFFRTG,Team1_DEFRTG,Team1_AST/TO,Team1_REB_P,Team1_FG_P,Team1_FGA,Team1_PACE,Team1_3PA,Team1_3P_P,Team2_OFFRTG,Team2_DEFRTG,Team2_AST/TO,Team2_REB_P,Team2_FG_P,Team2_FGA,Team2_PACE,Team2_3PA,Team2_3P_P
0,1,106.3,107.5,1.83,51.1,43.5,93.1,104.39,35.9,33.8,100.8,96.2,1.66,50.6,41.8,89.3,101.5,33.6,32.8
1,1,102.4,105.0,1.3,50.8,42.8,93.8,106.08,29.7,27.5,120.3,107.5,2.0,51.1,52.4,89.1,104.0,31.9,41.8


The merge is successful with the drop of non-important columns and now I will proceed to check the dimensions of the df and the presence of missing values:

In [31]:
df_model.shape

(1278, 19)

In [32]:
df_model.isna().sum()

Game_Result      0
Team1_OFFRTG    44
Team1_DEFRTG    44
Team1_AST/TO    44
Team1_REB_P     44
Team1_FG_P      44
Team1_FGA       44
Team1_PACE      44
Team1_3PA       44
Team1_3P_P      44
Team2_OFFRTG    44
Team2_DEFRTG    44
Team2_AST/TO    44
Team2_REB_P     44
Team2_FG_P      44
Team2_FGA       44
Team2_PACE      44
Team2_3PA       44
Team2_3P_P      44
dtype: int64
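
Before deciding what to do with the missing values, it can help to see which team/month keys failed to match. The sketch below uses `indicator=True` on a left merge to list them. The data is a toy example and the spelling mismatch between "LA Clippers" and "Los Angeles Clippers" is purely hypothetical, chosen to show the kind of problem this check would reveal; the OFFRTG value for that row is invented.

```python
import pandas as pd

# Toy results: one team name matches the stats, one is spelled differently.
results = pd.DataFrame({
    "Team1": ["Boston Celtics", "LA Clippers"],
    "month": [10, 10],
})

# Toy stats; the second row's spelling and value are hypothetical.
stats = pd.DataFrame({
    "TEAM": ["Boston Celtics", "Los Angeles Clippers"],
    "month": [10, 10],
    "OFFRTG": [100.8, 110.0],
})

# indicator=True adds a '_merge' column telling where each row was found.
check = results.merge(
    stats.rename(columns={"TEAM": "Team1"}),
    on=["Team1", "month"], how="left", indicator=True,
)

# Rows present only in the results have no matching statistics.
unmatched = check.loc[check["_merge"] == "left_only", ["Team1", "month"]]
print(unmatched.drop_duplicates())
```

If the unmatched keys all belong to one team, the cause is more likely a naming difference between the sources than missing games.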

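I attribute these missing values to canceled games, so I will proceed to drop all of them. Note that exactly 44 missing rows on each side would also be consistent with one team's name being spelled differently in the two sources, so this is worth verifying before the final model.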

In [33]:
df_model = df_model.dropna()

In [34]:
df_model.isna().sum()

Game_Result     0
Team1_OFFRTG    0
Team1_DEFRTG    0
Team1_AST/TO    0
Team1_REB_P     0
Team1_FG_P      0
Team1_FGA       0
Team1_PACE      0
Team1_3PA       0
Team1_3P_P      0
Team2_OFFRTG    0
Team2_DEFRTG    0
Team2_AST/TO    0
Team2_REB_P     0
Team2_FG_P      0
Team2_FGA       0
Team2_PACE      0
Team2_3PA       0
Team2_3P_P      0
dtype: int64

Below is the final dataframe for the modeling of the one-season (2018-2019) test:

In [35]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
pd.concat([df_model.head(3), df_model.tail(3)])

Unnamed: 0,Game_Result,Team1_OFFRTG,Team1_DEFRTG,Team1_AST/TO,Team1_REB_P,Team1_FG_P,Team1_FGA,Team1_PACE,Team1_3PA,Team1_3P_P,Team2_OFFRTG,Team2_DEFRTG,Team2_AST/TO,Team2_REB_P,Team2_FG_P,Team2_FGA,Team2_PACE,Team2_3PA,Team2_3P_P
0,1,106.3,107.5,1.83,51.1,43.5,93.1,104.39,35.9,33.8,100.8,96.2,1.66,50.6,41.8,89.3,101.5,33.6,32.8
1,1,102.4,105.0,1.3,50.8,42.8,93.8,106.08,29.7,27.5,120.3,107.5,2.0,51.1,52.4,89.1,104.0,31.9,41.8
2,0,113.2,98.2,1.57,53.4,48.0,91.4,106.0,40.6,38.4,114.9,109.1,2.09,49.2,46.3,91.0,100.44,34.5,37.7
1275,1,120.4,113.0,1.65,53.9,50.1,90.2,98.75,29.8,38.5,110.0,108.1,2.25,48.2,46.2,89.5,98.42,30.2,33.7
1276,1,111.5,105.4,2.36,48.3,46.3,90.2,100.8,33.2,36.1,113.0,110.9,1.86,50.7,47.4,93.2,109.7,37.8,33.3
1277,1,126.0,103.9,2.65,49.0,48.2,94.2,102.1,53.6,41.4,113.7,103.7,2.02,49.4,50.6,92.5,103.42,35.7,41.1


In [36]:
display(df_model.shape)
display(df_model.columns)

(1190, 19)

Index(['Game_Result', 'Team1_OFFRTG', 'Team1_DEFRTG', 'Team1_AST/TO',
       'Team1_REB_P', 'Team1_FG_P', 'Team1_FGA', 'Team1_PACE', 'Team1_3PA',
       'Team1_3P_P', 'Team2_OFFRTG', 'Team2_DEFRTG', 'Team2_AST/TO',
       'Team2_REB_P', 'Team2_FG_P', 'Team2_FGA', 'Team2_PACE', 'Team2_3PA',
       'Team2_3P_P'],
      dtype='object')

I have successfully finished preparing the first dataset, which probably represents about 10 percent of the entire data that will be used for the final model fitting. I will now proceed to save it so that I can concentrate on the models in a different Jupyter Notebook.

In [37]:
df_model.to_csv('2018-2019_Model_Ready.csv')

=================================================================================================

# Bonus Material

In [38]:
df_merge.head()

Unnamed: 0,Team1,Team1Score,Team2,Team2Score,year,month,Game_Result,Team1_month,Team1_OFFRTG,Team1_DEFRTG,Team1_AST/TO,Team1_REB_P,Team1_FG_P,Team1_FGA,Team1_PACE,Team1_3PA,Team1_3P_P,Team2_month,Team2_OFFRTG,Team2_DEFRTG,Team2_AST/TO,Team2_REB_P,Team2_FG_P,Team2_FGA,Team2_PACE,Team2_3PA,Team2_3P_P
0,Philadelphia 76ers,87,Boston Celtics,105,2018,10,1,10,106.3,107.5,1.83,51.1,43.5,93.1,104.39,35.9,33.8,10,100.8,96.2,1.66,50.6,41.8,89.3,101.5,33.6,32.8
1,Oklahoma City Thunder,100,Golden State Warriors,108,2018,10,1,10,102.4,105.0,1.3,50.8,42.8,93.8,106.08,29.7,27.5,10,120.3,107.5,2.0,51.1,52.4,89.1,104.0,31.9,41.8
2,Milwaukee Bucks,113,Charlotte Hornets,112,2018,10,0,10,113.2,98.2,1.57,53.4,48.0,91.4,106.0,40.6,38.4,10,114.9,109.1,2.09,49.2,46.3,91.0,100.44,34.5,37.7
3,Brooklyn Nets,100,Detroit Pistons,103,2018,10,1,10,108.0,111.4,1.46,48.7,45.4,87.5,99.15,35.8,38.1,10,107.6,108.7,1.34,52.3,43.3,92.7,100.37,32.3,32.3
4,Memphis Grizzlies,83,Indiana Pacers,111,2018,10,1,10,104.9,101.5,1.73,47.9,43.5,82.7,98.67,29.0,36.8,10,112.1,105.5,1.67,49.2,50.6,86.0,97.13,22.3,41.6


In [39]:
df_GSW = df_merge[df_merge['Team2'] == 'Golden State Warriors']
df_GSW.head()

Unnamed: 0,Team1,Team1Score,Team2,Team2Score,year,month,Game_Result,Team1_month,Team1_OFFRTG,Team1_DEFRTG,Team1_AST/TO,Team1_REB_P,Team1_FG_P,Team1_FGA,Team1_PACE,Team1_3PA,Team1_3P_P,Team2_month,Team2_OFFRTG,Team2_DEFRTG,Team2_AST/TO,Team2_REB_P,Team2_FG_P,Team2_FGA,Team2_PACE,Team2_3PA,Team2_3P_P
1,Oklahoma City Thunder,100,Golden State Warriors,108,2018,10,1,10,102.4,105.0,1.3,50.8,42.8,93.8,106.08,29.7,27.5,10,120.3,107.5,2.0,51.1,52.4,89.1,104.0,31.9,41.8
46,Phoenix Suns,103,Golden State Warriors,123,2018,10,1,10,101.4,115.5,1.38,48.1,45.5,83.1,102.14,31.7,32.4,10,120.3,107.5,2.0,51.1,52.4,89.1,104.0,31.9,41.8
61,Washington Wizards,122,Golden State Warriors,144,2018,10,1,10,104.4,114.3,1.45,43.3,43.3,90.0,105.57,35.4,32.3,10,120.3,107.5,2.0,51.1,52.4,89.1,104.0,31.9,41.8
107,New Orleans Pelicans,121,Golden State Warriors,131,2018,10,1,10,114.3,113.3,1.88,51.3,48.9,93.3,106.71,28.0,37.2,10,120.3,107.5,2.0,51.1,52.4,89.1,104.0,31.9,41.8
123,Minnesota Timberwolves,99,Golden State Warriors,116,2018,11,1,11,105.9,105.1,1.76,49.4,43.8,90.4,100.36,30.8,35.0,11,110.7,111.1,1.78,51.3,47.4,87.4,98.74,28.4,36.0


In [40]:
df_GSW.Game_Result.value_counts()

1    33
0    13
Name: Game_Result, dtype: int64

<b>This means that GSW played 46 home games and won 71.74% of the games</b>

I'm going to run a quick KNN Classification Model and see what I get for just 1 team. I believe at this point that it is better to do team by team as their performance is always different depending on many factors which I will describe on the README if this works...

## Test 1 -  KNN Classification model

In [41]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score

I will need to quickly clean this dataframe and keep only the target and the 9 statistics of interest for each team (18 feature columns), so I will do that right away:

In [42]:
df_GSW = df_GSW.drop(['Team1', 'Team1Score', 'Team2', 'Team2Score', 'year', 'month', 'Team1_month', 'Team2_month'], axis=1)
df_GSW.head()

Unnamed: 0,Game_Result,Team1_OFFRTG,Team1_DEFRTG,Team1_AST/TO,Team1_REB_P,Team1_FG_P,Team1_FGA,Team1_PACE,Team1_3PA,Team1_3P_P,Team2_OFFRTG,Team2_DEFRTG,Team2_AST/TO,Team2_REB_P,Team2_FG_P,Team2_FGA,Team2_PACE,Team2_3PA,Team2_3P_P
1,1,102.4,105.0,1.3,50.8,42.8,93.8,106.08,29.7,27.5,120.3,107.5,2.0,51.1,52.4,89.1,104.0,31.9,41.8
46,1,101.4,115.5,1.38,48.1,45.5,83.1,102.14,31.7,32.4,120.3,107.5,2.0,51.1,52.4,89.1,104.0,31.9,41.8
61,1,104.4,114.3,1.45,43.3,43.3,90.0,105.57,35.4,32.3,120.3,107.5,2.0,51.1,52.4,89.1,104.0,31.9,41.8
107,1,114.3,113.3,1.88,51.3,48.9,93.3,106.71,28.0,37.2,120.3,107.5,2.0,51.1,52.4,89.1,104.0,31.9,41.8
123,1,105.9,105.1,1.76,49.4,43.8,90.4,100.36,30.8,35.0,110.7,111.1,1.78,51.3,47.4,87.4,98.74,28.4,36.0


In [43]:
df_GSW.shape

(46, 19)

In [44]:
df_GSW = df_GSW.dropna()

In [45]:
y = df_GSW['Game_Result']
X = df_GSW.drop(['Game_Result'], axis=1)

In [46]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

Since KNN is a distance-based classifier, features on larger scales have a larger impact on the distance between points, so the data needs to be standardized before fitting.

In [47]:
# Instantiate StandardScaler
scaler = StandardScaler()

# Transform the training and test sets
scaled_data_train = scaler.fit_transform(X_train) 
scaled_data_test = scaler.transform(X_test)

# Convert into a DataFrame
scaled_df_train = pd.DataFrame(scaled_data_train, columns=X.columns)
scaled_df_train.head()

Unnamed: 0,Team1_OFFRTG,Team1_DEFRTG,Team1_AST/TO,Team1_REB_P,Team1_FG_P,Team1_FGA,Team1_PACE,Team1_3PA,Team1_3P_P,Team2_OFFRTG,Team2_DEFRTG,Team2_AST/TO,Team2_REB_P,Team2_FG_P,Team2_FGA,Team2_PACE,Team2_3PA,Team2_3P_P
0,0.239811,-0.5478,0.962644,-0.603084,1.061247,-0.535167,-0.529805,-0.201876,1.008318,-0.85655,1.23105,-1.054909,0.648408,-0.84321,-0.78989,-1.55619,-1.650698,-1.272297
1,0.403941,-0.885155,-0.078774,0.644676,1.504973,0.283323,1.076405,0.835542,0.348028,-0.877839,-0.780071,-0.276025,-0.600818,-1.337278,0.167973,0.869398,0.191239,-0.469238
2,0.486005,-2.427348,-0.198938,1.073593,0.007395,1.290696,0.261086,0.52712,-0.683675,-0.85655,1.23105,-1.054909,0.648408,-0.84321,-0.78989,-1.55619,-1.650698,-1.272297
3,-1.592967,-1.077929,-0.559429,-1.343941,-0.103536,-2.738794,-1.671864,-1.379485,-1.01382,-0.877839,-0.780071,-0.276025,-0.600818,-1.337278,0.167973,0.869398,0.191239,-0.469238
4,-0.361997,0.560651,-1.160247,-1.226964,0.284725,-0.84997,0.581717,-1.014987,-0.766212,-0.388179,-0.505827,-0.358013,-0.779279,0.046113,-1.039767,-0.549653,0.312022,0.381059
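
To illustrate why scaling matters for a distance-based model, here is a small sketch with three hypothetical games described by two features on very different scales (PACE around 100, AST/TO around 2). The numbers are invented for the illustration. The standardization is done by hand with the population standard deviation, which is the same formula StandardScaler applies.

```python
import numpy as np

# Hypothetical games: columns are PACE and AST/TO.
X = np.array([
    [100.0, 1.3],   # game A
    [103.0, 2.6],   # game B: very different AST/TO from A
    [104.0, 1.3],   # game C: same AST/TO as A, faster pace
])

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

# On the raw data, PACE dominates the distance simply because its values are larger.
raw_ab, raw_ac = dist(X[0], X[1]), dist(X[0], X[2])

# Standardize each column to zero mean and unit variance.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
scaled_ab, scaled_ac = dist(Z[0], Z[1]), dist(Z[0], Z[2])

print(f"raw:    A-B = {raw_ab:.2f}, A-C = {raw_ac:.2f}")
print(f"scaled: A-B = {scaled_ab:.2f}, A-C = {scaled_ac:.2f}")
```

On the raw features game B is the nearest neighbor of game A, while after standardization game C is, so the choice of scaling changes which neighbors KNN votes with.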


<b>FITTING THE KNN MODEL</b>

In [48]:
# Instantiate KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=11)

# Fit the classifier
clf.fit(scaled_data_train, y_train)

# Predict on the test set
test_preds = clf.predict(scaled_data_test)

<b>EVALUATING THE MODEL</b>

In [49]:
def print_metrics(y, preds):
    print("Precision Score: {}".format(precision_score(y, preds)))
    print("Recall Score: {}".format(recall_score(y, preds)))
    print("Accuracy Score: {}".format(accuracy_score(y, preds)))
    print("F1 Score: {}".format(f1_score(y, preds)))
    
print_metrics(y_test, test_preds)

Precision Score: 0.7272727272727273
Recall Score: 1.0
Accuracy Score: 0.7272727272727273
F1 Score: 0.8421052631578948


<b>If my interpretation is correct, an accuracy of 72.73% means the model called the result correctly in 8 of the 11 GSW (Golden State Warriors) home games in the test set. Because recall is 1.0 and precision equals accuracy, the model predicted a home win in every one of those games.</b>

In reality, GSW won 30 and lost 11 home games, that is, 73.17% of their home games.

This is interesting because it is close to the model's accuracy, but it would probably be more informative to run the same test on a team that has not suffered injuries and compare the result with the 2019-2020 season.

I will try with the Milwaukee Bucks, whose roster is almost exactly the same.
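As a rough sanity check, the test accuracy can be compared with a naive rule that predicts a home win for every game. The sketch below uses only the figures quoted above (30 home wins, 11 home losses, test accuracy 0.7273); the variable names are mine and the comparison is an assumption about what a fair baseline would be.

```python
# Figures quoted in the text above
home_wins, home_losses = 30, 11
knn_test_accuracy = 0.7272727272727273

# Accuracy of a naive rule that predicts "home win" for every game
baseline_accuracy = home_wins / (home_wins + home_losses)

print(f"Always-home-win baseline: {baseline_accuracy:.2%}")
print(f"KNN test accuracy:        {knn_test_accuracy:.2%}")
print(f"Difference:               {knn_test_accuracy - baseline_accuracy:+.2%}")
```

The baseline comes out at about 73.17%, so on this test set the KNN model does not do better than always betting on the home team.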

## Test 2 -  KNN Classification model

In [50]:
df_MIL = df_merge[df_merge['Team2'] == 'Milwaukee Bucks']
df_MIL.head()

Unnamed: 0,Team1,Team1Score,Team2,Team2Score,year,month,Game_Result,Team1_month,Team1_OFFRTG,Team1_DEFRTG,Team1_AST/TO,Team1_REB_P,Team1_FG_P,Team1_FGA,Team1_PACE,Team1_3PA,Team1_3P_P,Team2_month,Team2_OFFRTG,Team2_DEFRTG,Team2_AST/TO,Team2_REB_P,Team2_FG_P,Team2_FGA,Team2_PACE,Team2_3PA,Team2_3P_P
22,Indiana Pacers,101,Milwaukee Bucks,118,2018,10,1,10,112.1,105.5,1.67,49.2,50.6,86.0,97.13,22.3,41.6,10,113.2,98.2,1.57,53.4,48.0,91.4,106.0,40.6,38.4
41,New York Knicks,113,Milwaukee Bucks,124,2018,10,1,10,105.2,109.8,1.39,51.0,43.1,91.8,100.56,30.8,36.2,10,113.2,98.2,1.57,53.4,48.0,91.4,106.0,40.6,38.4
58,Philadelphia 76ers,108,Milwaukee Bucks,123,2018,10,1,10,106.3,107.5,1.83,51.1,43.5,93.1,104.39,35.9,33.8,10,113.2,98.2,1.57,53.4,48.0,91.4,106.0,40.6,38.4
80,Orlando Magic,91,Milwaukee Bucks,113,2018,10,1,10,99.4,109.6,1.8,48.9,41.0,92.7,101.07,33.7,30.5,10,113.2,98.2,1.57,53.4,48.0,91.4,106.0,40.6,38.4
91,Toronto Raptors,109,Milwaukee Bucks,124,2018,10,1,10,114.5,106.7,1.85,49.5,48.1,91.8,102.31,33.4,35.2,10,113.2,98.2,1.57,53.4,48.0,91.4,106.0,40.6,38.4


In [51]:
df_MIL = df_MIL.drop(['Team1', 'Team1Score', 'Team2', 'Team2Score', 'year', 'month', 'Team1_month', 'Team2_month'], axis=1)
df_MIL.head()

Unnamed: 0,Game_Result,Team1_OFFRTG,Team1_DEFRTG,Team1_AST/TO,Team1_REB_P,Team1_FG_P,Team1_FGA,Team1_PACE,Team1_3PA,Team1_3P_P,Team2_OFFRTG,Team2_DEFRTG,Team2_AST/TO,Team2_REB_P,Team2_FG_P,Team2_FGA,Team2_PACE,Team2_3PA,Team2_3P_P
22,1,112.1,105.5,1.67,49.2,50.6,86.0,97.13,22.3,41.6,113.2,98.2,1.57,53.4,48.0,91.4,106.0,40.6,38.4
41,1,105.2,109.8,1.39,51.0,43.1,91.8,100.56,30.8,36.2,113.2,98.2,1.57,53.4,48.0,91.4,106.0,40.6,38.4
58,1,106.3,107.5,1.83,51.1,43.5,93.1,104.39,35.9,33.8,113.2,98.2,1.57,53.4,48.0,91.4,106.0,40.6,38.4
80,1,99.4,109.6,1.8,48.9,41.0,92.7,101.07,33.7,30.5,113.2,98.2,1.57,53.4,48.0,91.4,106.0,40.6,38.4
91,1,114.5,106.7,1.85,49.5,48.1,91.8,102.31,33.4,35.2,113.2,98.2,1.57,53.4,48.0,91.4,106.0,40.6,38.4


In [52]:
df_MIL.isna().sum().sum()

9

In [53]:
df_MIL = df_MIL.dropna()

In [54]:
y = df_MIL['Game_Result']
X = df_MIL.drop(['Game_Result'], axis=1)

In [55]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [56]:
# Instantiate StandardScaler
scaler = StandardScaler()

# Fit the scaler on the training set, then transform both sets
scaled_data_train = scaler.fit_transform(X_train) 
scaled_data_test = scaler.transform(X_test)

# Convert into a DataFrame
scaled_df_train = pd.DataFrame(scaled_data_train, columns=X.columns)
scaled_df_train.head()

Unnamed: 0,Team1_OFFRTG,Team1_DEFRTG,Team1_AST/TO,Team1_REB_P,Team1_FG_P,Team1_FGA,Team1_PACE,Team1_3PA,Team1_3P_P,Team2_OFFRTG,Team2_DEFRTG,Team2_AST/TO,Team2_REB_P,Team2_FG_P,Team2_FGA,Team2_PACE,Team2_3PA,Team2_3P_P
0,0.723742,-2.494933,1.176244,0.728035,0.836831,0.117215,0.21576,0.101493,1.070368,0.397438,0.213523,1.858447,-0.703012,-0.123927,1.160182,-0.36966,0.109501,0.217371
1,0.188129,1.12151,0.590211,-0.849888,-1.135699,0.48088,-0.702019,0.870582,0.304576,0.358519,0.213523,0.317524,0.591838,-0.699027,0.554391,-0.17609,0.63194,0.820166
2,-1.077866,-0.26401,-0.650799,-0.035476,-1.186277,-1.370509,-1.667144,0.696916,-0.509078,0.358519,-0.683753,0.440797,-0.703012,1.473573,-1.505299,-0.512902,-2.154402,0.639327
3,-1.004828,1.262411,-0.719744,-1.206193,-1.236855,0.679243,0.190266,-0.295456,0.831058,-1.898738,-0.842096,-1.038489,-0.703012,-1.529727,-1.141825,-0.946499,0.065965,-1.229335
4,-1.589134,-0.498844,-1.409194,0.626233,-1.439165,-0.213391,-0.221278,0.250349,-1.753489,-1.898738,-0.842096,-1.038489,-0.703012,-1.529727,-1.141825,-0.946499,0.065965,-1.229335


In [57]:
# Instantiate KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=11)

# Fit the classifier
clf.fit(scaled_data_train, y_train)

# Predict on the test set
test_preds = clf.predict(scaled_data_test)

In [58]:
def print_metrics(y, preds):
    print("Precision Score: {}".format(precision_score(y, preds)))
    print("Recall Score: {}".format(recall_score(y, preds)))
    print("Accuracy Score: {}".format(accuracy_score(y, preds)))
    print("F1 Score: {}".format(f1_score(y, preds)))
    
print_metrics(y_test, test_preds)

Precision Score: 0.8181818181818182
Recall Score: 1.0
Accuracy Score: 0.8181818181818182
F1 Score: 0.9


I'm not sure whether a recall of exactly 1 is acceptable or whether it means that my model is not correct. I need to discuss this with Abhineet.
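One way to check is to look at the confusion matrix. The sketch below uses hypothetical labels, chosen so that they reproduce the four scores printed above (9 home wins and 2 home losses in an 11-game test set, with a home win predicted every time); they are not the actual `y_test` and `test_preds`. In the notebook the equivalent call would be `confusion_matrix(y_test, test_preds)`.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical labels (assumption): 9 wins, 2 losses,
# and every game predicted as a home win
y_true = [1] * 9 + [0] * 2
y_pred = [1] * 11

# Rows are the true classes (0, 1), columns are the predicted classes (0, 1)
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()
print(cm)
print("TN, FP, FN, TP:", tn, fp, fn, tp)

print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("Accuracy: ", accuracy_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))
```

A recall of 1.0 together with precision equal to accuracy is what a classifier produces when it never predicts the negative class, so an empty first column in the real confusion matrix would confirm that the model is simply predicting a home win every time.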

Assuming it is correct, I want to compare these predictions with the 2019-2020 results for the Milwaukee Bucks when playing at home. In total they played 35 home games, of which they won 30. That is a home win rate of 85.71%, compared with the 81.82% obtained from my model.

The Milwaukee Bucks are a good example of a winning team. Unfortunately, I had not considered the Golden State Warriors' situation: they lost 4 of their 5 starting players to injuries and departures, plus all 7 of their bench players, so an entirely new team is playing the 2019-2020 season.

I will run a third model with the Brooklyn Nets (BKN). They were a mid-tier team with no realistic chance of winning the championship, which may make them interesting for betting.
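Since the same steps are repeated for every team, they could be wrapped in a single function. This is only a sketch: the function name `knn_home_accuracy` and the constant `ID_COLS` are hypothetical, it assumes `df_merge` has the columns shown in the outputs above, and the demo frame at the end is random data used purely to show that the function runs.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Columns that identify the game rather than describe team form
ID_COLS = ['Team1', 'Team1Score', 'Team2', 'Team2Score',
           'year', 'month', 'Team1_month', 'Team2_month']

def knn_home_accuracy(df_merge, team, n_neighbors=11):
    """Hypothetical helper: repeat the per-team steps for one home team."""
    df = df_merge[df_merge['Team2'] == team]
    df = df.drop(columns=[c for c in ID_COLS if c in df.columns]).dropna()
    y = df['Game_Result']
    X = df.drop(columns=['Game_Result'])
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42)
    # The pipeline fits the scaler on the training data only
    model = make_pipeline(StandardScaler(),
                          KNeighborsClassifier(n_neighbors=n_neighbors))
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))

# Synthetic stand-in for df_merge (random numbers, for demonstration only)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'Team1': 'Visitor', 'Team2': 'Home Team',
    'Game_Result': rng.integers(0, 2, size=40),
    'Team1_OFFRTG': rng.normal(110, 4, size=40),
    'Team2_OFFRTG': rng.normal(110, 4, size=40),
})
print(knn_home_accuracy(demo, 'Home Team'))
```

With the real data the call would look like `knn_home_accuracy(df_merge, 'Brooklyn Nets')`, which keeps the split, scaling and model settings identical across teams.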

## Test 3 -  KNN Classification model

In [89]:
df_BKN = df_merge[df_merge['Team2'] == 'Brooklyn Nets']
df_BKN.head()

Unnamed: 0,Team1,Team1Score,Team2,Team2Score,year,month,Game_Result,Team1_month,Team1_OFFRTG,Team1_DEFRTG,Team1_AST/TO,Team1_REB_P,Team1_FG_P,Team1_FGA,Team1_PACE,Team1_3PA,Team1_3P_P,Team2_month,Team2_OFFRTG,Team2_DEFRTG,Team2_AST/TO,Team2_REB_P,Team2_FG_P,Team2_FGA,Team2_PACE,Team2_3PA,Team2_3P_P
17,New York Knicks,105,Brooklyn Nets,107,2018,10,1,10,105.2,109.8,1.39,51.0,43.1,91.8,100.56,30.8,36.2,10,108.0,111.4,1.46,48.7,45.4,87.5,99.15,35.8,38.1
82,Golden State Warriors,120,Brooklyn Nets,114,2018,10,0,10,120.3,107.5,2.0,51.1,52.4,89.1,104.0,31.9,41.8,10,108.0,111.4,1.46,48.7,45.4,87.5,99.15,35.8,38.1
103,Detroit Pistons,119,Brooklyn Nets,120,2018,10,1,10,107.6,108.7,1.34,52.3,43.3,92.7,100.37,32.3,32.3,10,108.0,111.4,1.46,48.7,45.4,87.5,99.15,35.8,38.1
117,Houston Rockets,119,Brooklyn Nets,111,2018,11,0,11,115.5,111.9,1.45,49.2,46.3,83.5,94.72,42.9,34.6,11,109.6,110.3,1.58,49.4,44.2,90.4,99.88,34.1,33.9
133,Philadelphia 76ers,97,Brooklyn Nets,122,2018,11,1,11,110.8,107.4,1.6,51.3,47.5,85.1,102.34,29.7,36.0,11,109.6,110.3,1.58,49.4,44.2,90.4,99.88,34.1,33.9


In [90]:
df_BKN = df_BKN.drop(['Team1', 'Team1Score', 'Team2', 'Team2Score', 'year', 'month', 'Team1_month', 'Team2_month'], axis=1)
df_BKN.head()

Unnamed: 0,Game_Result,Team1_OFFRTG,Team1_DEFRTG,Team1_AST/TO,Team1_REB_P,Team1_FG_P,Team1_FGA,Team1_PACE,Team1_3PA,Team1_3P_P,Team2_OFFRTG,Team2_DEFRTG,Team2_AST/TO,Team2_REB_P,Team2_FG_P,Team2_FGA,Team2_PACE,Team2_3PA,Team2_3P_P
17,1,105.2,109.8,1.39,51.0,43.1,91.8,100.56,30.8,36.2,108.0,111.4,1.46,48.7,45.4,87.5,99.15,35.8,38.1
82,0,120.3,107.5,2.0,51.1,52.4,89.1,104.0,31.9,41.8,108.0,111.4,1.46,48.7,45.4,87.5,99.15,35.8,38.1
103,1,107.6,108.7,1.34,52.3,43.3,92.7,100.37,32.3,32.3,108.0,111.4,1.46,48.7,45.4,87.5,99.15,35.8,38.1
117,0,115.5,111.9,1.45,49.2,46.3,83.5,94.72,42.9,34.6,109.6,110.3,1.58,49.4,44.2,90.4,99.88,34.1,33.9
133,1,110.8,107.4,1.6,51.3,47.5,85.1,102.34,29.7,36.0,109.6,110.3,1.58,49.4,44.2,90.4,99.88,34.1,33.9


In [91]:
df_BKN.Game_Result.value_counts()

1    23
0    20
Name: Game_Result, dtype: int64

In [92]:
df_BKN.isna().sum()

Game_Result     0
Team1_OFFRTG    1
Team1_DEFRTG    1
Team1_AST/TO    1
Team1_REB_P     1
Team1_FG_P      1
Team1_FGA       1
Team1_PACE      1
Team1_3PA       1
Team1_3P_P      1
Team2_OFFRTG    0
Team2_DEFRTG    0
Team2_AST/TO    0
Team2_REB_P     0
Team2_FG_P      0
Team2_FGA       0
Team2_PACE      0
Team2_3PA       0
Team2_3P_P      0
dtype: int64

In [93]:
df_BKN = df_BKN.dropna()

In [94]:
df_BKN.Game_Result.value_counts()

1    23
0    19
Name: Game_Result, dtype: int64

In [95]:
y = df_BKN['Game_Result']
X = df_BKN.drop(['Game_Result'], axis=1)

In [96]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [97]:
# Instantiate StandardScaler
scaler = StandardScaler()

# Fit the scaler on the training set, then transform both sets
scaled_data_train = scaler.fit_transform(X_train) 
scaled_data_test = scaler.transform(X_test)

# Convert into a DataFrame
scaled_df_train = pd.DataFrame(scaled_data_train, columns=X.columns)
scaled_df_train.head()

Unnamed: 0,Team1_OFFRTG,Team1_DEFRTG,Team1_AST/TO,Team1_REB_P,Team1_FG_P,Team1_FGA,Team1_PACE,Team1_3PA,Team1_3P_P,Team2_OFFRTG,Team2_DEFRTG,Team2_AST/TO,Team2_REB_P,Team2_FG_P,Team2_FGA,Team2_PACE,Team2_3PA,Team2_3P_P
0,-0.833261,-0.975524,-0.204156,-0.333062,-1.058073,0.109708,-0.772298,0.214962,-1.345415,0.551233,0.806888,0.550906,0.434486,1.282159,-1.242702,-1.081039,-1.057005,0.970895
1,-1.294074,0.444461,-0.737484,-0.937181,-0.331197,0.627243,-0.226809,-0.925927,-0.650284,0.551233,0.806888,0.550906,0.434486,1.282159,-1.242702,-1.081039,-1.057005,0.970895
2,-0.577254,-0.908962,-0.02638,-1.541299,0.259389,-2.071333,-1.616148,-1.263008,0.453748,0.239404,0.751671,0.044889,-0.6534,-0.880238,0.243847,-0.752624,-0.636913,-1.268937
3,0.011562,-0.820213,-0.275266,0.600575,1.031695,0.035774,1.051021,0.448326,-0.036933,0.551233,0.806888,0.550906,0.434486,1.282159,-1.242702,-1.081039,-1.057005,0.970895
4,-0.218845,0.932581,0.969167,-1.15686,-0.967214,-0.851429,-0.823532,1.874438,-1.140964,-1.506838,-1.291377,0.424401,-1.067834,-0.571325,0.022446,0.676675,0.29329,-0.918963


In [98]:
# Instantiate KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=11)

# Fit the classifier
clf.fit(scaled_data_train, y_train)

# Predict on the test set
test_preds = clf.predict(scaled_data_test)

In [99]:
def print_metrics(y, preds):
    print("Precision Score: {}".format(precision_score(y, preds)))
    print("Recall Score: {}".format(recall_score(y, preds)))
    print("Accuracy Score: {}".format(accuracy_score(y, preds)))
    print("F1 Score: {}".format(f1_score(y, preds)))
    
print_metrics(y_test, test_preds)

Precision Score: 0.6666666666666666
Recall Score: 0.5714285714285714
Accuracy Score: 0.5454545454545454
F1 Score: 0.6153846153846153


The Brooklyn Nets are a mid-tier team with no realistic chance of winning the championship. They won 55.6% of their home games this year. Is that related to my accuracy? The two values are very close for all three teams I tested.
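Part of the answer may be that the test sets are very small. The accuracy of 0.5455 corresponds to 6 correct predictions out of 11 test games, so a single game moves it by about 9 percentage points. The sketch below computes a Wilson score interval for that proportion; the function name and the choice of a 95% level are my assumptions, not something used elsewhere in this notebook.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width

# 6 correct predictions out of 11 test games gives the reported accuracy
low, high = wilson_interval(6, 11)
print(f"accuracy = {6 / 11:.3f}, 95% interval = ({low:.3f}, {high:.3f})")
```

The interval runs from roughly 0.28 to 0.79, so with 11 test games the match between accuracy and home win rate could easily be a coincidence.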

The only filter I have applied is to look at each team's roster: if it is similar to last year's, I should be able to use the previous year's parameters to predict how the team will do in this year's home games. Of course, upsets and roster changes on other teams might affect my results, but those are almost impossible to account for.

If this is working, one question I have is how far back I should go for data. If I go back 20 years, the teams might have changed completely, which would add noise to my model. With 10 years it seems more realistic that a team performs in a similar way, although changes may still have occurred.