# CSCI 4622 Machine Learning Final Project

### Holden Kjerland-Nicoletti, Brian Nguyen

For this project we use a Kaggle dataset of NBA regular season team statistics dating back to 2000 to predict the number of playoff wins a team will have. We chose this target over simply picking the NBA champion because only 1 of the 30 teams wins the championship each year, which makes the champion hard to predict.

*NOTE: Our project has changed slightly from our project proposal: we decided to use team data rather than individual player statistics.

In [1]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
import copy
%matplotlib inline

## Data

The data for this project was acquired in two different ways:

* Regular season team statistics were acquired from the following Kaggle dataset: https://www.kaggle.com/mharvnek/nba-team-stats-00-to-18

* Playoff wins were scraped from the NBA's playoff team statistics at https://stats.nba.com/teams/traditional. This was difficult, as neither of us had any prior experience scraping websites. The Jupyter notebook used for scraping is 'web_scrape_playoff_wins.ipynb' in the GitHub repository.

### Playoff wins data

This data has already been mostly cleaned in the 'web_scrape_playoff_wins.ipynb' notebook. We now just need to drop three columns: the leftover index column ('Unnamed: 0'), games played, and losses. We also need to drop rows before 2000, as the regular season statistics only go back that far.

In [2]:
playoffs = pd.read_csv('data/playoff_wins.csv')
playoffs.head()

Unnamed: 0.1,Unnamed: 0,Season,Team,gp,W,L
0,0,2018,Milwaukee Bucks,15.0,10.0,5.0
1,1,2018,Toronto Raptors,24.0,16.0,8.0
2,2,2018,Golden State Warriors,22.0,14.0,8.0
3,3,2018,Philadelphia 76ers,12.0,7.0,5.0
4,4,2018,Boston Celtics,9.0,5.0,4.0


In [3]:
# we're removing these columns since they aren't necessary for what we are trying to accomplish
playoff_wins = playoffs.drop(['gp', 'Unnamed: 0', 'L'], axis = 1) 
playoff_wins.head(10)

Unnamed: 0,Season,Team,W
0,2018,Milwaukee Bucks,10.0
1,2018,Toronto Raptors,16.0
2,2018,Golden State Warriors,14.0
3,2018,Philadelphia 76ers,7.0
4,2018,Boston Celtics,5.0
5,2018,Houston Rockets,6.0
6,2018,Denver Nuggets,7.0
7,2018,Portland Trail Blazers,8.0
8,2018,San Antonio Spurs,3.0
9,2018,LA Clippers,2.0
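
The `Unnamed: 0` columns seen above are typically what pandas produces when a DataFrame is written to CSV together with its index and then read back without declaring it. A minimal sketch of avoiding them, using a small in-memory CSV as a stand-in for `data/playoff_wins.csv` (the sample text and the names `csv_text`, `raw`, and `clean` are illustrative assumptions):

```python
import io
import pandas as pd

# Toy stand-in for a CSV that was saved with its index (note the empty first header).
csv_text = """,Season,Team,gp,W,L
0,2018,Milwaukee Bucks,15.0,10.0,5.0
1,2018,Toronto Raptors,24.0,16.0,8.0
"""

# Default read: the saved index comes back as a column named 'Unnamed: 0'.
raw = pd.read_csv(io.StringIO(csv_text))

# Declaring the first column as the index avoids the stray column.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))
print(list(clean.columns))
```

Writing the file with `to_csv(..., index=False)` in the scraping notebook would avoid the extra column at the source.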


In [4]:
# we're only keeping the data from year 2000 and above since we want to merge our two datasets together
# and the other dataset only contains years of 2000 and above
playoff_wins = playoff_wins[playoff_wins.Season >= 2000]
playoff_wins.tail()

Unnamed: 0,Season,Team,W
299,2000,Minnesota Timberwolves,1.0
300,2000,Orlando Magic,1.0
301,2000,Phoenix Suns,1.0
302,2000,Miami Heat,0.0
303,2000,Portland Trail Blazers,0.0


In [5]:
# checking to see if there are any NaN values
numRowsOrig = len(playoff_wins)
copyPOWData = copy.deepcopy(playoff_wins)
playOffWinsClean = copyPOWData.dropna()
numRowsNew = len(playOffWinsClean)

# luckily, there weren't any missing or NaN values in the data set, so we didn't need to impute any data
if numRowsOrig == numRowsNew:
    print(f'The old and new playoffs data have the same number of rows of {numRowsOrig}')
else:
    print(f'The old playoffs data has this many rows {numRowsOrig} and the new playoffs data {numRowsNew}')

The old and new playoffs data have the same number of rows of 304
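
A more direct way to run the same check, sketched on a toy frame (the data below is made up for illustration): `isna()` counts missing values per column without copying the frame first.

```python
import numpy as np
import pandas as pd

# Toy frame with one deliberately missing value (illustrative data only).
toy = pd.DataFrame({
    'Season': [2018, 2018, 2017],
    'Team': ['Milwaukee Bucks', 'Toronto Raptors', 'Boston Celtics'],
    'W': [10.0, np.nan, 11.0],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Number of rows that contain at least one missing value.
rows_with_missing = int(toy.isna().any(axis=1).sum())
print(rows_with_missing)
```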


In [6]:
allPFTeams = set(playOffWinsClean['Team'])
print(f'There are a total of {len(allPFTeams)}')

# from this, you can see that there are old team names
# thus, we'll replace the old team names with the new/current ones
for i in allPFTeams:
    print(i)

There are a total of 35
Minnesota Timberwolves
Golden State Warriors
Toronto Raptors
Charlotte Hornets
Seattle SuperSonics
New Jersey Nets
Detroit Pistons
New Orleans Pelicans
Los Angeles Lakers
Los Angeles Clippers
Sacramento Kings
Chicago Bulls
Phoenix Suns
Atlanta Hawks
Dallas Mavericks
Portland Trail Blazers
Indiana Pacers
Houston Rockets
Charlotte Bobcats
LA Clippers
Denver Nuggets
San Antonio Spurs
Philadelphia 76ers
Oklahoma City Thunder
New York Knicks
Washington Wizards
Cleveland Cavaliers
New Orleans Hornets
Orlando Magic
Brooklyn Nets
Miami Heat
Utah Jazz
Milwaukee Bucks
Memphis Grizzlies
Boston Celtics


In [7]:
mappingOldToNew = [
    ['Charlotte Bobcats', 'Charlotte Hornets'],
    ['New Orleans Hornets', 'New Orleans Pelicans'],
    ['New Orleans/Oklahoma City Hornets', 'New Orleans Pelicans'],
    ['Seattle SuperSonics', 'Oklahoma City Thunder'],
    ['Vancouver Grizzlies', 'Memphis Grizzlies'],
    ['New Jersey Nets', 'Brooklyn Nets'],
    ['LA Clippers', 'Los Angeles Clippers']
]

for oldName, newName in mappingOldToNew:
    playOffWinsClean = playOffWinsClean.replace(to_replace = oldName, value = newName)

In [8]:
# to verify if the mapping worked
allPFTeams = set(playOffWinsClean['Team'])
numTeams = len(allPFTeams)
print(f'There are a total of {numTeams}')

# it worked
for i in allPFTeams:
    print(i)

There are a total of 30
Minnesota Timberwolves
Golden State Warriors
Toronto Raptors
Charlotte Hornets
Detroit Pistons
New Orleans Pelicans
Los Angeles Lakers
Los Angeles Clippers
Sacramento Kings
Chicago Bulls
Phoenix Suns
Atlanta Hawks
Dallas Mavericks
Portland Trail Blazers
Indiana Pacers
Houston Rockets
Denver Nuggets
San Antonio Spurs
Philadelphia 76ers
Oklahoma City Thunder
New York Knicks
Washington Wizards
Cleveland Cavaliers
Orlando Magic
Brooklyn Nets
Miami Heat
Utah Jazz
Milwaukee Bucks
Memphis Grizzlies
Boston Celtics
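
The same renaming can also be sketched with a dictionary applied to the `Team` column only, which avoids touching other columns by accident. The frame below is a toy example, and `old_to_new` is a hypothetical name holding a subset of the mapping used above.

```python
import pandas as pd

# Subset of the old-name -> current-name mapping used above.
old_to_new = {
    'Charlotte Bobcats': 'Charlotte Hornets',
    'Seattle SuperSonics': 'Oklahoma City Thunder',
    'New Jersey Nets': 'Brooklyn Nets',
    'LA Clippers': 'Los Angeles Clippers',
}

# Toy rows using old team names (illustrative data only).
toy = pd.DataFrame({
    'Season': [2004, 2006, 2018],
    'Team': ['Seattle SuperSonics', 'New Jersey Nets', 'LA Clippers'],
})

# Series.replace with a dict leaves names that are not keys unchanged.
toy['Team'] = toy['Team'].replace(old_to_new)
print(sorted(toy['Team'].unique()))
```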


### Regular Season Statistics

The first cleaning step here is to change the season label from spanning two years (like '2018-19') to just the starting year ('2018'). This matches how seasons are labeled in the playoff data, where the 2018 rows are the playoffs played at the end of the 2018-19 season.
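
As a small illustration of the season labels, the hypothetical helper below (`season_years` is not part of the notebook) splits a label such as '2018-19' into its starting and ending years. Keeping the first four characters of the label corresponds to the starting year.

```python
from typing import Tuple


def season_years(label: str) -> Tuple[int, int]:
    """Split a label like '2018-19' into (start_year, end_year)."""
    start = int(label[:4])
    # The ending year is always the year after the starting year,
    # which also handles century rollovers such as '1999-00'.
    return start, start + 1


print(season_years('2018-19'))
print(season_years('1999-00'))
```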

In [9]:
team_stats = pd.read_csv('data/nba_team_stats_00_to_18.csv')
team_stats = team_stats.drop(['Unnamed: 0'], axis = 1)
team_stats.head()

Unnamed: 0,TEAM,GP,W,L,WIN%,MIN,PTS,FGM,FGA,FG%,...,REB,AST,TOV,STL,BLK,BLKA,PF,PFD,+/-,SEASON
0,Atlanta Hawks,82,29,53,0.354,48.4,113.3,41.4,91.8,45.1,...,46.1,25.8,17.0,8.2,5.1,5.5,23.6,22.2,-6.0,2018-19
1,Boston Celtics,82,49,33,0.598,48.2,112.4,42.1,90.5,46.5,...,44.5,26.3,12.8,8.6,5.3,3.9,20.4,19.5,4.4,2018-19
2,Brooklyn Nets,82,42,40,0.512,48.7,112.2,40.3,89.7,44.9,...,46.6,23.8,15.1,6.6,4.1,5.3,21.5,22.0,-0.1,2018-19
3,Charlotte Hornets,82,39,43,0.476,48.4,110.7,40.2,89.8,44.8,...,43.8,23.2,12.2,7.2,4.9,6.0,18.9,20.6,-1.1,2018-19
4,Chicago Bulls,82,22,60,0.268,48.5,104.9,39.8,87.9,45.3,...,42.9,21.9,14.1,7.4,4.3,5.8,20.3,18.7,-8.4,2018-19


In [10]:
# keep only the starting year of each season ('2018-19' -> '2018')
# assigning the whole column at once avoids pandas' SettingWithCopyWarning
team_stats['SEASON'] = team_stats['SEASON'].str[:4]

team_stats.head()


Unnamed: 0,TEAM,GP,W,L,WIN%,MIN,PTS,FGM,FGA,FG%,...,REB,AST,TOV,STL,BLK,BLKA,PF,PFD,+/-,SEASON
0,Atlanta Hawks,82,29,53,0.354,48.4,113.3,41.4,91.8,45.1,...,46.1,25.8,17.0,8.2,5.1,5.5,23.6,22.2,-6.0,2018
1,Boston Celtics,82,49,33,0.598,48.2,112.4,42.1,90.5,46.5,...,44.5,26.3,12.8,8.6,5.3,3.9,20.4,19.5,4.4,2018
2,Brooklyn Nets,82,42,40,0.512,48.7,112.2,40.3,89.7,44.9,...,46.6,23.8,15.1,6.6,4.1,5.3,21.5,22.0,-0.1,2018
3,Charlotte Hornets,82,39,43,0.476,48.4,110.7,40.2,89.8,44.8,...,43.8,23.2,12.2,7.2,4.9,6.0,18.9,20.6,-1.1,2018
4,Chicago Bulls,82,22,60,0.268,48.5,104.9,39.8,87.9,45.3,...,42.9,21.9,14.1,7.4,4.3,5.8,20.3,18.7,-8.4,2018


In [11]:
# checking to see if there are any NaN values
numRowsOrig = len(team_stats)
copyTSData = copy.deepcopy(team_stats)
teamStatsClean = copyTSData.dropna()
numRowsNew = len(teamStatsClean)

# luckily, there weren't any missing or NaN values in the data set, so we didn't need to impute any data
if numRowsOrig == numRowsNew:
    print(f'The old and new team stats data have the same number of rows of {numRowsOrig}')
else:
    print(f'The old team stats data has this many rows {numRowsOrig} and the new team stats data {numRowsNew}')

The old and new team stats data have the same number of rows of 566


In [12]:
print(team_stats.columns)

Index(['TEAM', 'GP', 'W', 'L', 'WIN%', 'MIN', 'PTS', 'FGM', 'FGA', 'FG%',
       '3PM', '3PA', '3P%', 'FTM', 'FTA', 'FT%', 'OREB', 'DREB', 'REB', 'AST',
       'TOV', 'STL', 'BLK', 'BLKA', 'PF', 'PFD', '+/-', 'SEASON'],
      dtype='object')


In [13]:
allTeams = set(teamStatsClean['TEAM'])
numTeams = len(allTeams)
print(f'There are a total of {numTeams}')

# from this, you can see that there are old team names
# thus, we'll replace the old team names with the new/current ones
for i in allTeams:
    print(i)

There are a total of 37
Minnesota Timberwolves
Golden State Warriors
Toronto Raptors
Charlotte Hornets
Seattle SuperSonics
New Jersey Nets
Detroit Pistons
Vancouver Grizzlies
Los Angeles Lakers
New Orleans Pelicans
Los Angeles Clippers
Sacramento Kings
Chicago Bulls
Phoenix Suns
Dallas Mavericks
Atlanta Hawks
Portland Trail Blazers
Indiana Pacers
Houston Rockets
Charlotte Bobcats
LA Clippers
Denver Nuggets
San Antonio Spurs
Philadelphia 76ers
Oklahoma City Thunder
New York Knicks
Washington Wizards
Cleveland Cavaliers
New Orleans/Oklahoma City Hornets
New Orleans Hornets
Orlando Magic
Brooklyn Nets
Memphis Grizzlies
Milwaukee Bucks
Utah Jazz
Boston Celtics
Miami Heat


In [14]:
mappingOldToNew = [
    ['Charlotte Bobcats', 'Charlotte Hornets'],
    ['New Orleans Hornets', 'New Orleans Pelicans'],
    ['New Orleans/Oklahoma City Hornets', 'New Orleans Pelicans'],
    ['Seattle SuperSonics', 'Oklahoma City Thunder'],
    ['Vancouver Grizzlies', 'Memphis Grizzlies'],
    ['New Jersey Nets', 'Brooklyn Nets'],
    ['LA Clippers', 'Los Angeles Clippers']
]

for oldName, newName in mappingOldToNew:
    teamStatsClean = teamStatsClean.replace(to_replace = oldName, value = newName)

In [15]:
# to verify if the mapping worked
allTeams = set(teamStatsClean['TEAM'])
numTeams = len(allTeams)
print(f'There are a total of {numTeams}')

for i in allTeams:
    print(i)

There are a total of 30
Minnesota Timberwolves
Golden State Warriors
Toronto Raptors
Charlotte Hornets
Detroit Pistons
Los Angeles Lakers
New Orleans Pelicans
Los Angeles Clippers
Sacramento Kings
Chicago Bulls
Phoenix Suns
Dallas Mavericks
Atlanta Hawks
Portland Trail Blazers
Indiana Pacers
Houston Rockets
Denver Nuggets
San Antonio Spurs
Philadelphia 76ers
Oklahoma City Thunder
New York Knicks
Washington Wizards
Cleveland Cavaliers
Orlando Magic
Brooklyn Nets
Memphis Grizzlies
Milwaukee Bucks
Utah Jazz
Boston Celtics
Miami Heat


### Now we need to merge the two datasets together

In [16]:
# unique seasons in each dataset, sorted; SEASON is stored as a string, so convert it to int
year1 = sorted(set(playOffWinsClean['Season']))
year2 = sorted(int(i) for i in set(teamStatsClean['SEASON']))

# verifying whether both datasets cover the same seasons
print(year1)
print(year2)

[2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018]
[2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018]
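
Comparing the two lists by eye works for 19 seasons; a set comparison makes the same check explicit. A sketch with hypothetical variable names and toy values:

```python
# Toy season lists standing in for year1 and year2 above.
playoff_seasons = [2000, 2001, 2002]
regular_seasons = ['2000', '2001', '2002']  # stored as strings after slicing

# Normalize types before comparing, since '2000' != 2000.
regular_as_int = {int(s) for s in regular_seasons}

# Seasons present in one dataset but not the other (an empty set means they agree).
mismatch = set(playoff_seasons) ^ regular_as_int
print(mismatch)
```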


In [17]:
# sanity check of the lookup logic: the Hawks missed the playoffs in season 2018, the Bucks made them
query1 = playOffWinsClean[(playOffWinsClean['Team'] == 'Atlanta Hawks') & (playOffWinsClean['Season'] == 2018)]
# print(query1) # empty as it should be
if len(query1) == 0:
    print("not found")
elif len(query1) == 1:
    print("found")
else:
    print("found multiple copies")

query2 = playOffWinsClean[(playOffWinsClean['Team'] == 'Milwaukee Bucks') & (playOffWinsClean['Season'] == 2018)]
# print(query2) # returns the correct row
if len(query2) == 0:
    print("not found")
elif len(query2) == 1:
    print("found")
else:
    print("found multiple copies")    

not found
found
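
The "found multiple copies" branch guards against duplicate (team, season) rows. A sketch of checking that for a whole table at once with `DataFrame.duplicated`, on toy data (the duplicated row is inserted deliberately for illustration):

```python
import pandas as pd

# Toy playoff table with one deliberately repeated (Team, Season) pair.
toy = pd.DataFrame({
    'Season': [2018, 2018, 2018],
    'Team': ['Milwaukee Bucks', 'Toronto Raptors', 'Milwaukee Bucks'],
    'W': [10.0, 16.0, 10.0],
})

# keep=False marks every row whose (Team, Season) pair occurs more than once.
dupes = toy[toy.duplicated(subset=['Team', 'Season'], keep=False)]
print(len(dupes))
```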


In [18]:
print(playOffWinsClean.columns)

Index(['Season', 'Team', 'W'], dtype='object')


In [19]:
madeToPlayOffArr = []
playOffWinsArr = []

# for every regular-season row, look up the matching (team, season) row in the playoff data
for i in range(len(teamStatsClean)):
    TSRowData = teamStatsClean.iloc[i]
    nextTeam = TSRowData['TEAM']
    nextSeason = TSRowData['SEASON']
    query = playOffWinsClean[(playOffWinsClean['Team'] == nextTeam) & (playOffWinsClean['Season'] == int(nextSeason))]
    if len(query) == 0:
        # no playoff row: the team missed the playoffs that season
        madeToPlayOffArr.append(0)
        playOffWinsArr.append(0)
    elif len(query) == 1:
        madeToPlayOffArr.append(1)
        # take the single matching value explicitly instead of calling float() on a Series
        playOffWinsArr.append(float(query['W'].iloc[0]))
    else:
        raise Exception("found multiple copies")

In [20]:
print(len(madeToPlayOffArr))
print(len(playOffWinsArr))
print(len(teamStatsClean))

566
566
566
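
The row-by-row lookup above amounts to a left join. A minimal sketch of the same idea with `DataFrame.merge`, using toy frames and hypothetical names (`stats`, `wins`, `merged`); it is an alternative formulation, not the code that produced the results in this notebook.

```python
import pandas as pd

# Toy regular-season table (SEASON stored as a string, as in teamStatsClean).
stats = pd.DataFrame({
    'TEAM': ['Atlanta Hawks', 'Boston Celtics', 'Detroit Pistons'],
    'SEASON': ['2018', '2018', '2018'],
    'WIN%': [0.354, 0.598, 0.500],
})

# Toy playoff table (Season stored as an integer, as in playOffWinsClean).
wins = pd.DataFrame({
    'Team': ['Boston Celtics', 'Detroit Pistons'],
    'Season': [2018, 2018],
    'W': [5.0, 0.0],
})

# Align key names and dtypes before joining.
stats['SEASON'] = stats['SEASON'].astype(int)
wins = wins.rename(columns={'Team': 'TEAM', 'Season': 'SEASON', 'W': 'PO Wins'})

# Left join keeps every regular-season row; indicator records whether a match was found.
merged = stats.merge(wins, on=['TEAM', 'SEASON'], how='left',
                     validate='one_to_one', indicator=True)

# A team made the playoffs if its row matched, even with zero playoff wins.
merged['Made PO'] = (merged['_merge'] == 'both').astype(int)
merged['PO Wins'] = merged['PO Wins'].fillna(0.0)
merged = merged.drop(columns='_merge')
print(merged)
```

Using the merge indicator rather than `PO Wins > 0` matters for teams that were swept: they made the playoffs but have zero wins.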


In [21]:
teamStatsClean.insert(len(teamStatsClean.columns)-1, "Made PO", madeToPlayOffArr, False)
teamStatsClean.insert(len(teamStatsClean.columns)-1, "PO Wins", playOffWinsArr, False)

finalDataClean = copy.deepcopy(teamStatsClean)

In [22]:
# you can check with the playoffs data set to see if we correctly added the data to the team stats dataset
finalDataClean.head(10)

Unnamed: 0,TEAM,GP,W,L,WIN%,MIN,PTS,FGM,FGA,FG%,...,TOV,STL,BLK,BLKA,PF,PFD,+/-,Made PO,PO Wins,SEASON
0,Atlanta Hawks,82,29,53,0.354,48.4,113.3,41.4,91.8,45.1,...,17.0,8.2,5.1,5.5,23.6,22.2,-6.0,0,0.0,2018
1,Boston Celtics,82,49,33,0.598,48.2,112.4,42.1,90.5,46.5,...,12.8,8.6,5.3,3.9,20.4,19.5,4.4,1,5.0,2018
2,Brooklyn Nets,82,42,40,0.512,48.7,112.2,40.3,89.7,44.9,...,15.1,6.6,4.1,5.3,21.5,22.0,-0.1,1,1.0,2018
3,Charlotte Hornets,82,39,43,0.476,48.4,110.7,40.2,89.8,44.8,...,12.2,7.2,4.9,6.0,18.9,20.6,-1.1,0,0.0,2018
4,Chicago Bulls,82,22,60,0.268,48.5,104.9,39.8,87.9,45.3,...,14.1,7.4,4.3,5.8,20.3,18.7,-8.4,0,0.0,2018
5,Cleveland Cavaliers,82,19,63,0.232,48.2,104.5,38.9,87.6,44.4,...,13.5,6.5,2.4,5.6,20.0,19.4,-9.6,0,0.0,2018
6,Dallas Mavericks,82,33,49,0.402,48.2,108.9,38.8,86.9,44.7,...,14.2,6.5,4.3,4.5,20.1,23.2,-1.3,0,0.0,2018
7,Denver Nuggets,82,54,28,0.659,48.1,110.7,41.9,90.0,46.6,...,13.4,7.7,4.4,5.0,20.0,20.4,4.0,1,7.0,2018
8,Detroit Pistons,82,41,41,0.5,48.4,107.0,38.8,88.3,44.0,...,13.8,6.9,4.0,5.1,22.1,21.3,-0.2,1,0.0,2018
9,Golden State Warriors,82,57,25,0.695,48.3,117.7,44.0,89.8,49.1,...,14.3,7.6,6.4,3.6,21.4,19.5,6.5,1,14.0,2018
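
One further sanity check is possible because the NBA playoff field in these seasons is 16 teams: the `Made PO` flags should sum to 16 for every season. A sketch on toy data, where the expected field size is set to 2 purely to keep the example small (`EXPECTED_FIELD`, `per_season`, and `bad_seasons` are hypothetical names):

```python
import pandas as pd

# Toy merged table; EXPECTED_FIELD is 2 here only to keep the example small
# (it would be 16 for real NBA seasons).
EXPECTED_FIELD = 2
toy = pd.DataFrame({
    'TEAM': ['A', 'B', 'C', 'A', 'B', 'C'],
    'SEASON': [2017, 2017, 2017, 2018, 2018, 2018],
    'Made PO': [1, 1, 0, 1, 0, 0],
})

# Count playoff teams per season and keep the seasons that deviate from the expected size.
per_season = toy.groupby('SEASON')['Made PO'].sum()
bad_seasons = per_season[per_season != EXPECTED_FIELD]
print(bad_seasons)
```

An empty `bad_seasons` on the real data would suggest that no playoff team was lost during the name mapping or the merge.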
