# CSCI 4622 Machine Learning Final Project

### Holden Kjerland-Nicoletti, Brian Nguyen

For this project we use a Kaggle dataset containing NBA regular-season team statistics dating back to 2000 to predict the number of playoff wins a team will have. This is a better target than simply picking the NBA champion: only 1 of the 30 teams wins the championship each year, so that outcome is hard to predict.

*NOTE: Our project has changed slightly from our project proposal: we have decided to use team data rather than individual player statistics.

In [4]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
import copy
%matplotlib inline

## Data

The data for this project was acquired in two different ways:

* Regular season team statistics were taken from the following Kaggle dataset: https://www.kaggle.com/mharvnek/nba-team-stats-00-to-18

* Playoff wins were scraped from the NBA's playoff team stats page, https://stats.nba.com/teams/traditional. This was very difficult, as neither of us had any experience scraping websites. The Jupyter notebook used for scraping is 'web_scrape_playoff_wins.ipynb' in the GitHub repository.

### Playoff wins data

This data has already been mostly cleaned in the "web_scrape_playoff_wins.ipynb" notebook. We now just need to drop the games played ('gp') and losses ('L') columns, along with the leftover index column ('Unnamed: 0'). We also need to drop rows before 2001, as the regular season statistics only go back that far.

In [102]:
playoffs = pd.read_csv('data/playoff_wins.csv')
playoffs.head()

Unnamed: 0.1,Unnamed: 0,Season,Team,gp,W,L
0,0,2018,Milwaukee Bucks,15.0,10.0,5.0
1,1,2018,Toronto Raptors,24.0,16.0,8.0
2,2,2018,Golden State Warriors,22.0,14.0,8.0
3,3,2018,Philadelphia 76ers,12.0,7.0,5.0
4,4,2018,Boston Celtics,9.0,5.0,4.0


In [103]:
playoff_wins = playoffs.drop(['gp', 'Unnamed: 0', 'L'], axis = 1)
playoff_wins.head()

Unnamed: 0,Season,Team,W
0,2018,Milwaukee Bucks,10.0
1,2018,Toronto Raptors,16.0
2,2018,Golden State Warriors,14.0
3,2018,Philadelphia 76ers,7.0
4,2018,Boston Celtics,5.0


In [104]:
playoff_wins = playoff_wins[playoff_wins.Season >= 2001]
playoff_wins.tail()

Unnamed: 0,Season,Team,W
283,2001,Toronto Raptors,2.0
284,2001,Orlando Magic,1.0
285,2001,Utah Jazz,1.0
286,2001,Minnesota Timberwolves,0.0
287,2001,Portland Trail Blazers,0.0


### Regular Season Statistics

The only cleaning needed here is to change the season label from a two-year span (like '2018-19') to just the later year ('2019'), since that is when the playoffs take place.
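
As a minimal sketch of this conversion, the hypothetical helper `season_end_year` below (not part of the original notebook) turns a label such as '2018-19' into the integer 2019. It assumes every label has the form 'YYYY-YY' and derives the result from the start year, so a century rollover like '1999-00' would also map correctly.

```python
import pandas as pd

def season_end_year(label: str) -> int:
    """Return the calendar year in which a 'YYYY-YY' season ends."""
    # The part before the hyphen is the four-digit starting year.
    start_year = int(label.split('-')[0])
    return start_year + 1

# Made-up sample labels, including a century rollover.
seasons = pd.Series(['2018-19', '2000-01', '1999-00'])
end_years = seasons.map(season_end_year)
print(end_years.tolist())  # [2019, 2001, 2000]
```

Returning an integer rather than a string would also let the label be compared directly with a numeric season column.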

In [105]:
team_stats = pd.read_csv('data/nba_team_stats_00_to_18.csv')
team_stats = team_stats.drop(['Unnamed: 0'], axis = 1)
team_stats.head()

Unnamed: 0,TEAM,GP,W,L,WIN%,MIN,PTS,FGM,FGA,FG%,...,REB,AST,TOV,STL,BLK,BLKA,PF,PFD,+/-,SEASON
0,Atlanta Hawks,82,29,53,0.354,48.4,113.3,41.4,91.8,45.1,...,46.1,25.8,17.0,8.2,5.1,5.5,23.6,22.2,-6.0,2018-19
1,Boston Celtics,82,49,33,0.598,48.2,112.4,42.1,90.5,46.5,...,44.5,26.3,12.8,8.6,5.3,3.9,20.4,19.5,4.4,2018-19
2,Brooklyn Nets,82,42,40,0.512,48.7,112.2,40.3,89.7,44.9,...,46.6,23.8,15.1,6.6,4.1,5.3,21.5,22.0,-0.1,2018-19
3,Charlotte Hornets,82,39,43,0.476,48.4,110.7,40.2,89.8,44.8,...,43.8,23.2,12.2,7.2,4.9,6.0,18.9,20.6,-1.1,2018-19
4,Chicago Bulls,82,22,60,0.268,48.5,104.9,39.8,87.9,45.3,...,42.9,21.9,14.1,7.4,4.3,5.8,20.3,18.7,-8.4,2018-19


In [106]:
# Build the end-year label from the first two and last two characters,
# e.g. '2018-19' -> '2019'. A vectorized column assignment avoids the
# chained indexing that triggers pandas' SettingWithCopyWarning.
team_stats['SEASON'] = team_stats['SEASON'].str[:2] + team_stats['SEASON'].str[-2:]

team_stats.tail()

Unnamed: 0,TEAM,GP,W,L,WIN%,MIN,PTS,FGM,FGA,FG%,...,REB,AST,TOV,STL,BLK,BLKA,PF,PFD,+/-,SEASON
561,Seattle SuperSonics,82,44,38,0.537,48.3,97.3,36.9,81.1,45.6,...,41.7,21.9,15.3,8.0,5.0,6.2,21.1,0.1,0.0,2001
562,Toronto Raptors,82,47,35,0.573,48.7,97.6,37.2,85.0,43.7,...,44.5,24.4,13.2,7.3,6.3,5.4,21.3,0.1,2.3,2001
563,Utah Jazz,82,53,29,0.646,48.2,97.1,36.1,76.7,47.1,...,40.6,25.7,15.8,8.1,5.6,5.5,25.7,0.1,4.7,2001
564,Vancouver Grizzlies,82,23,59,0.28,48.2,91.7,35.0,79.7,43.9,...,40.5,23.2,15.7,7.1,4.4,5.8,21.1,0.1,-5.7,2001
565,Washington Wizards,82,19,63,0.232,48.0,93.2,34.5,78.7,43.9,...,41.3,20.1,17.0,7.7,4.7,6.2,23.3,0.1,-6.7,2001
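
With both tables cleaned, a natural next step is to join them on team and season so that each regular-season row carries its playoff win count. The sketch below is an assumption about that step, not code from the original notebook: it uses two tiny hand-built frames with values from the previews above, an integer `SEASON` column, a left join, and a hypothetical `PLAYOFF_W` column in which teams that missed the playoffs get 0 wins. The join only lines up if both tables label a season the same way. The 2018 rows of the playoff table (16 wins for Toronto) look like the 2018-19 season, which the regular-season table labels 2019, so the sketch shifts the playoff label by one year; that shift is worth checking against the real data.

```python
import pandas as pd

# Tiny stand-ins for the cleaned tables.
team_stats_demo = pd.DataFrame({
    'TEAM': ['Boston Celtics', 'Chicago Bulls'],
    'WIN%': [0.598, 0.268],
    'SEASON': [2019, 2019],
})
playoff_demo = pd.DataFrame({
    'Season': [2018],
    'Team': ['Boston Celtics'],
    'W': [5.0],
})

# Assumption: the playoff table labels a season by its starting year,
# so add 1 to match the end-year label used for the regular season.
playoff_demo = playoff_demo.assign(SEASON=playoff_demo['Season'] + 1)
playoff_demo = playoff_demo.rename(columns={'Team': 'TEAM', 'W': 'PLAYOFF_W'})

merged = team_stats_demo.merge(
    playoff_demo[['TEAM', 'SEASON', 'PLAYOFF_W']],
    on=['TEAM', 'SEASON'],
    how='left',
)
# Teams without a playoff row did not qualify, so count them as 0 wins.
merged['PLAYOFF_W'] = merged['PLAYOFF_W'].fillna(0)
print(merged)
```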




In [26]:
# dropping the columns with blank data
data = data.drop(columns=['blanl', 'blank2'])

In [27]:
print(data.columns)

Index(['#', 'Season Start', 'Player Name', ' Player Salary in $ ', 'Pos',
       'Age', 'Tm', 'G', 'GS', 'MP', 'PER', 'TS%', '3PAr', 'FTr', 'ORB%',
       'DRB%', 'TRB%', 'AST%', 'STL%', 'BLK%', 'TOV%', 'USG%', 'OWS', 'DWS',
       'WS', 'WS/48', 'OBPM', 'DBPM', 'BPM', 'VORP', 'FG', 'FGA', 'FG%', '3P',
       '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%', 'ORB',
       'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS'],
      dtype='object')


In [28]:
copyData = data.copy()  # DataFrame.copy() makes a deep copy by default

In [64]:
totalOrigRows = len(copyData)
print(totalOrigRows)

24625


In [65]:
numRemainCols = len(copyData.columns)
print(numRemainCols)

52


In [68]:
data = copyData.dropna(thresh=numRemainCols-1)
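
The `thresh` argument of `dropna` is the minimum number of non-missing values a row needs in order to be kept, so `thresh=numRemainCols-1` keeps only rows with at most one missing value. A small sketch on a made-up three-column frame (the frame `toy` and its values are illustrative only):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'a': [1.0, np.nan, np.nan],
    'b': [2.0, 5.0, np.nan],
    'c': [3.0, 6.0, 9.0],
})

# Keep rows with at least (number of columns - 1) non-missing values,
# i.e. rows that have at most one NaN.
kept = toy.dropna(thresh=len(toy.columns) - 1)
print(kept.index.tolist())  # [0, 1]
```

Row 2 has two missing values and is dropped, while row 1 has only one and survives.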

In [69]:
data

Unnamed: 0,#,Season Start,Player Name,Player Salary in $,Pos,Age,Tm,G,GS,MP,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,8035.0,1986.0,A.C. Green,,PF,22.0,LAL,82.0,1.0,1542.0,...,61.1%,160.0,221.0,381.0,54.0,49.0,49.0,99.0,229.0,521.0
1,8420.0,1987.0,A.C. Green,,PF,23.0,LAL,79.0,72.0,2240.0,...,78.0%,210.0,405.0,615.0,84.0,70.0,80.0,102.0,171.0,852.0
2,8807.0,1988.0,A.C. Green,,PF,24.0,LAL,82.0,64.0,2636.0,...,77.3%,245.0,465.0,710.0,93.0,87.0,45.0,120.0,204.0,937.0
3,9242.0,1989.0,A.C. Green,,PF,25.0,LAL,82.0,82.0,2510.0,...,78.6%,258.0,481.0,739.0,103.0,94.0,55.0,119.0,172.0,1088.0
4,9688.0,1990.0,A.C. Green,"$1,750,000.00",PF,26.0,LAL,82.0,82.0,2709.0,...,75.1%,262.0,450.0,712.0,90.0,66.0,50.0,116.0,207.0,1061.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24619,18442.0,2007.0,Zydrunas Ilgauskas,"$10,142,156.00",C,31.0,CLE,78.0,78.0,2130.0,...,80.7%,242.0,357.0,599.0,123.0,48.0,98.0,141.0,257.0,925.0
24620,19003.0,2008.0,Zydrunas Ilgauskas,"$10,841,615.00",C,32.0,CLE,73.0,73.0,2222.0,...,80.2%,263.0,419.0,682.0,104.0,34.0,120.0,135.0,247.0,1029.0
24621,19600.0,2009.0,Zydrunas Ilgauskas,"$11,541,074.00",C,33.0,CLE,65.0,65.0,1765.0,...,79.9%,157.0,333.0,490.0,64.0,28.0,84.0,90.0,183.0,838.0
24622,20187.0,2010.0,Zydrunas Ilgauskas,,C,34.0,CLE,64.0,6.0,1339.0,...,74.3%,114.0,231.0,345.0,48.0,14.0,50.0,63.0,183.0,474.0


In [70]:
percOrigData = len(data)/len(copyData) * 100
print(f'The percentage of remaining rows left of original data is {percOrigData}')

The percentage of remaining rows left of original data is 65.48629441624365


In [73]:
# removing the rows from seasons before 1979, since the 3pt shot wasn't introduced
# until the 1979-80 season
data = data.drop(data[data['Season Start'] < 1979.0].index)
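
One detail of dropping by index is how missing years are handled: a comparison with NaN is always False, so rows with a missing 'Season Start' are kept by the `drop` above, whereas a boolean filter on `>= 1979` would discard them. A sketch on a made-up frame (the values are illustrative only):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Season Start': [1975.0, 1990.0, np.nan]})

# Drop-by-index: only rows that compare as < 1979 are removed; the NaN row stays.
dropped = toy.drop(toy[toy['Season Start'] < 1979.0].index)

# Boolean filter: only rows that compare as >= 1979 are kept; the NaN row goes.
filtered = toy[toy['Season Start'] >= 1979.0]

print(len(dropped), len(filtered))  # 2 1
```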

In [75]:
print(set(data['Season Start'])) # to verify that we removed the rows that had a year that is less than 1979

{1980.0, 1981.0, 1982.0, 1983.0, 1984.0, 1985.0, 1986.0, 1987.0, 1988.0, 1989.0, 1990.0, 1991.0, 1992.0, 1993.0, 1994.0, 1995.0, 1996.0, 1997.0, 1998.0, 1999.0, 2000.0, 2001.0, 2002.0, 2003.0, 2004.0, 2005.0, 2006.0, 2007.0, 2008.0, 2009.0, 2010.0, 2011.0, 2012.0, 2013.0, 2014.0, 2015.0, 2016.0, 2017.0}


In [76]:
print(len(data)) # luckily, we didn't remove any more rows from our data

16126


In [77]:
# now we need to one-hot encode some of the columns, since they contain non-numerical data
dummyPlayName = pd.get_dummies(data['Player Name'])

In [78]:
dummyPlayName.head()

Unnamed: 0,A.C. Green,A.J. English,A.J. Guyton,A.J. Hammons,A.J. Price,A.J. Wynder,Aaron Brooks,Aaron Gordon,Aaron Gray,Aaron Harrison,...,Zach Randolph,Zan Tabak,Zarko Cabarkapa,Zarko Paspalj,Zaza Pachulia,Zeljko Rebraca,Zendon Hamilton,Zoran Dragic,Zoran Planinic,Zydrunas Ilgauskas
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [80]:
data = pd.concat([data, dummyPlayName], axis=1)
data = data.drop(columns = ['Player Name'])
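
The encode, concatenate, and drop steps can also be done in a single call by passing `columns=` to `pd.get_dummies`, which replaces the named column with its indicator columns. The sketch below uses a made-up two-column frame; the empty `prefix` and `prefix_sep` are illustrative choices that name the new columns after the players, as in the cells above, and `dtype=int` is an assumption made to get 0/1 values regardless of pandas version.

```python
import pandas as pd

toy = pd.DataFrame({
    'Player Name': ['A.C. Green', 'Zydrunas Ilgauskas', 'A.C. Green'],
    'PTS': [521.0, 925.0, 852.0],
})

# Replace 'Player Name' with one indicator column per distinct player.
encoded = pd.get_dummies(
    toy,
    columns=['Player Name'],
    prefix='',
    prefix_sep='',
    dtype=int,
)
print(encoded.columns.tolist())
```

Either way, the result has one column per distinct player, so the encoded frame becomes very wide on the full dataset.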
