## SlamStats Analytics <br>

Vincent Funtanilla - u1282199@utah.edu, u1282199 <br>

Jacob Layton - jake.layton@utah.edu, u1312858 <br>

John Chae- u1285738@utah.edu, u1285738 <br>

## Background and Motivation <br>
 
Our motivation for undertaking this  project stems from a passion and curiosity for the game of basketball. By delving into player performances, strategies, and game outcomes, we hope to provide a valuable resource for basketball enthusiasts and enhance the overall fan experience. We want to contribute positively to the basketball community, offering valuable insights for analysts, coaches, and fans alike. We also see this project as an opportunity for skill development, allowing us to apply and refine our data science skills in a real-world context. Ultimately, our goal is to bring innovation to sports analytics, fostering a deeper appreciation for the game we love.

## Project Objectives <br>

There are three questions we want to answer;<br>
- Who will be the 2024 MVP?<br>
- Who will be the 2024 defensive player of the year?<br>
- Which team will win the 2024 NBA Finals?<br>
    
We want to create a model that can predict not only this years finals and award winners, but also identify what the most important statistics are in prediciting them. This will help us gain a lot of insight into what these statistics mean and give us a better understanding of categorical models. 
    
## Data Description and Acquisition

We will be collecting data from https://www.basketball-reference.com/. Basketball Reference is an online basketball encyclopedia that contains all relevant basketball statistics. Basketball Reference does not allow data scraping but all data is available to download as Excel spreadsheets. To obtain the data we will download the Excel spreadsheet containing the data of interest from Basketball Reference and then we will save the spreadsheet as a CSV file which we can open in Jupyter Notebook. We will collect player and team data from the past 23 seasons to ensure that we have enough information to build our models. For player data we will collect season total data and per game data. Season total data refers to cumulative statistics such as total number of points scored. Per game data refers to average stats per game such as average points scored per game. We will collect four types of team data, offensive, defensive, advanced, and playoff. The offensive and defensive data will tell us about the offensive and defense performance of each team. The advanced team data will tell us how the team performed throughout the season such as their win/loss record. From the playoff data we are intereseted in teams playoff record. 


## Ethical Considerations

One main ethical implication of our analyses is related to gambling. While our project aims to provide valuable insights into basketball dynamics, we acknowledge the potential risk of individuals excessively gambling based on our findings. It is important that we approach our work with a sense of responsibility, emphasizing the importance of using data-driven insights for informed decision-making rather than irresponsible gambling behavior. We can also promote responsible engagement with our analyses and advocate for measures to mitigate the risks associated with problem gambling within the context of sports betting.<br>

It’s also crucial to consider the potential impact of our analyses on fan engagement and player welfare. While we aim to foster constructive dialogue among fans, we should avoid promoting negative behaviors or attitudes that could harm the mental well-being of players. We want to uphold the integrity of the game and the welfare of those involved, including players, teams, and fans.

## Data Cleaning and Processing

The player and team data will be processed and cleaned seperately. As both the player and team data are stored in excel files the data will need to be read in and then combined into one large data frame. Most of the data is already fairly presentable but there are a few minor issues that will need to be resolved. One issue in the player data is that the column titles appear in the first row of the data frame instead of the header column. Another issue in the player data to be resolved are the NaN values which appear in columns with numerical values. For the team data there are multiple overlapping columns from the different data sets, these will need to be removed. Finally some data will need to be added to both data frames as they are processed such as the season that the data pertains.

In [1]:
import os
import glob
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns
from pandas.plotting import scatter_matrix

from sklearn import tree, svm, metrics
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.datasets import load_digits
from sklearn.preprocessing import scale

plt.style.use('ggplot')
%matplotlib inline  
plt.rcParams['figure.figsize'] = (10, 6) 

Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
        
  import pandas as pd


The cell below extracts the names of all excel files containg our player data. The names of the files are then split up by the type of data they store (season totals or averages per game) and are stored in two seperate lists. These two lists of file names will be used below to extract all the data into the Jupyter Notebook.

## Data importing

In [2]:
#Get local directory and get csv file names
main_directory = os.path.normpath(os.getcwd())
data_directory = os.path.join(main_directory, 'data')
file_names = [f for f in os.listdir(data_directory) if os.path.isfile(os.path.join(data_directory, f))]
print(file_names)

['00-01 player stats per game.csv', '00-01 player stats.csv', '01-02 player stats per game.csv', '01-02 player stats.csv', '02-03 player stats per game.csv', '02-03 player stats.csv', '03-04 player stats per game.csv', '03-04 player stats.csv', '04-05 player stats per game.csv', '04-05 player stats.csv', '05-06 player stats per game.csv', '05-06 player stats.csv', '06-07 player stats per game.csv', '06-07 player stats.csv', '07-08 player stats per game.csv', '07-08 player stats.csv', '08-09 player stats per game.csv', '08-09 player stats.csv', '09-10 player stats per game.csv', '09-10 player stats.csv', '10-11 player stats per game.csv', '10-11 player stats.csv', '11-12 player stats per game.csv', '11-12 player stats.csv', '12-13 player stats per game.csv', '12-13 player stats.csv', '13-14 player stats per game.csv', '13-14 player stats.csv', '14-15 player stats per game.csv', '14-15 player stats.csv', '15-16 player stats per game.csv', '15-16 player stats.csv', '16-17 player stats per

Lets organize these file names into a list and only get the file we want

In [3]:
#filtering file names to have list of only wanted csv files
filtered_file_list = [filename for filename in file_names if 'MVP' not in filename and 'Defensive Player of the Year' not in filename]

files = []
for filename in filtered_file_list:
    if 'stats.csv' in filename:
        files.append(filename)

print(files)

['00-01 player stats.csv', '01-02 player stats.csv', '02-03 player stats.csv', '03-04 player stats.csv', '04-05 player stats.csv', '05-06 player stats.csv', '06-07 player stats.csv', '07-08 player stats.csv', '08-09 player stats.csv', '09-10 player stats.csv', '10-11 player stats.csv', '11-12 player stats.csv', '12-13 player stats.csv', '13-14 player stats.csv', '14-15 player stats.csv', '15-16 player stats.csv', '16-17 player stats.csv', '17-18 player stats.csv', '18-19 player stats.csv', '19-20 player stats.csv', '20-21 player stats.csv', '21-22 player stats.csv', '22-23 player stats.csv']


Now lets read into one the of the csv file and investigate somethings

In [4]:
#Read into one of the csv file and investigate the data
df_01_02 = pd.read_csv(os.path.join(data_directory,files[0]),header=0)
df_01_02

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Player-additional
0,1,Mahmoud Abdul-Rauf,PG,31,VAN,41,0,486,120,246,...,5,20,25,76,9,1,26,50,266,abdulma02
1,2,Tariq Abdul-Wahad,SG,26,DEN,29,12,420,43,111,...,14,45,59,22,14,13,34,54,111,abdulta01
2,3,Shareef Abdur-Rahim,SF,24,VAN,81,81,3241,604,1280,...,175,560,735,250,90,77,231,238,1663,abdursh01
3,4,Cory Alexander,PG,27,ORL,26,0,227,18,56,...,0,25,25,36,16,0,25,29,52,alexaco01
4,5,Courtney Alexander,PG,23,TOT,65,24,1382,239,573,...,42,101,143,62,45,5,75,139,618,alexaco02
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
532,437,David Wingate,SG,37,SEA,1,0,9,3,3,...,0,0,0,2,0,0,0,1,6,wingada01
533,438,Rubén Wolkowyski,PF,27,SEA,34,1,305,25,79,...,12,34,46,3,6,18,12,38,75,wolkoru01
534,439,Metta World Peace,SF,21,CHI,76,74,2363,327,815,...,59,235,294,228,152,45,159,254,907,artesro01
535,440,Lorenzen Wright,C,25,ATL,71,46,1988,363,811,...,180,355,535,87,42,63,125,232,881,wrighlo02


In [5]:
#Check for duplicate players, this is due to transfers during seasons
pd.set_option('display.max_rows', None)
player_duplicate = df_01_02['Player'].value_counts()
player_duplicate

Player
Doug Overton             4
Anthony Miller           4
Garth Joseph             3
Rick Brunson             3
Kevin Ollie              3
Juwan Howard             3
Cherokee Parks           3
Kornél Dávid             3
Nazr Mohammed            3
Loy Vaught               3
Corie Blount             3
Eric Montross            3
Roshown McLeod           3
Pepe Sánchez             3
Calvin Booth             3
Paul McPherson           3
Larry Robinson           3
Erick Strickland         3
Mark Strickland          3
Rod Strickland           3
Vinny Del Negro          3
Anthony Johnson          3
Chucky Brown             3
Brevin Knight            3
Jim Jackson              3
Toni Kukoč*              3
Rubén Garcés             3
Tyrone Nesby             3
Sean Colson              3
Courtney Alexander       3
Hubert Davis             3
Mark Jackson             3
Kevin Willis             3
Corliss Williamson       3
Bill Curley              3
Felipe López             3
Othella Harrington   

In [6]:
#Investigate player named Doug Overton
duplicate_data = df_01_02[df_01_02['Player'] == 'Doug Overton']
duplicate_data

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Player-additional
373,303,Doug Overton,PG,31,TOT,21,10,475,52,139,...,4,35,39,72,8,0,31,37,125,overtdo01
374,303,Doug Overton,PG,31,BOS,7,1,144,15,44,...,3,12,15,19,4,0,13,15,38,overtdo01
375,303,Doug Overton,PG,31,CHH,2,0,15,2,2,...,0,0,0,0,1,0,3,2,4,overtdo01
376,303,Doug Overton,PG,31,NJN,12,9,316,35,93,...,1,23,24,53,3,0,15,20,83,overtdo01


Doug Overton played in three different team in the seaosn 2001-02. Boston Celtics, Charlotte Hornets, and Brooklyn Nets and the total statistical combine is represented as ```TOT``` in the the ```Team``` column. Looks like if there is an instance of player playing for multiple team ```TOT``` appears very first. We can later on use this information in data cleaning

Looks like some players some just have random astrix sign on them. we'll handle them later on

For now lets get all the MVP player data and defensive data

In [7]:
MVP_df = pd.read_csv(os.path.join(data_directory,'MVP player stats.csv'),header=1)
MVP_df

Unnamed: 0,Season,Lg,Player,Voting,Age,Tm,G,MP,PTS,TRB,AST,STL,BLK,FG%,3P%,FT%,WS,WS/48,-9999
0,2022-23,NBA,Joel Embiid,(V),28,PHI,66,34.6,33.1,10.2,4.2,1.0,1.7,0.548,0.33,0.857,12.3,0.259,embiijo01
1,2021-22,NBA,Nikola Jokić,(V),26,DEN,74,33.5,27.1,13.8,7.9,1.5,0.9,0.583,0.337,0.81,15.2,0.296,jokicni01
2,2020-21,NBA,Nikola Jokić,(V),25,DEN,72,34.6,26.4,10.8,8.3,1.3,0.7,0.566,0.388,0.868,15.6,0.301,jokicni01
3,2019-20,NBA,Giannis Antetokounmpo,(V),25,MIL,63,30.4,29.5,13.6,5.6,1.0,1.0,0.553,0.304,0.633,11.1,0.279,antetgi01
4,2018-19,NBA,Giannis Antetokounmpo,(V),24,MIL,72,32.8,27.7,12.5,5.9,1.3,1.5,0.578,0.256,0.729,14.4,0.292,antetgi01
5,2017-18,NBA,James Harden,(V),28,HOU,72,35.4,30.4,5.4,8.8,1.8,0.7,0.449,0.367,0.858,15.4,0.289,hardeja01
6,2016-17,NBA,Russell Westbrook,(V),28,OKC,81,34.6,31.6,10.7,10.4,1.6,0.4,0.425,0.343,0.845,13.1,0.224,westbru01
7,2015-16,NBA,Stephen Curry,(V),27,GSW,79,34.2,30.1,5.4,6.7,2.1,0.2,0.504,0.454,0.908,17.9,0.318,curryst01
8,2014-15,NBA,Stephen Curry,(V),26,GSW,80,32.7,23.8,4.3,7.7,2.0,0.2,0.487,0.443,0.914,15.7,0.288,curryst01
9,2013-14,NBA,Kevin Durant,(V),25,OKC,81,38.5,32.0,7.4,5.5,1.3,0.7,0.503,0.391,0.873,19.2,0.295,duranke01


Lets grab each player name and their MVP winning season and store them into an dictionary

In [8]:
#Grabbing MVP award winner for each season
season_mvp_dict = {}
for index,row in MVP_df.iterrows():
    season = row['Season']
    player = row['Player']
    season_short = season[2:]
    season_mvp_dict[season_short] = player

print(season_mvp_dict)

{'22-23': 'Joel Embiid', '21-22': 'Nikola Jokić', '20-21': 'Nikola Jokić', '19-20': 'Giannis Antetokounmpo', '18-19': 'Giannis Antetokounmpo', '17-18': 'James Harden', '16-17': 'Russell Westbrook', '15-16': 'Stephen Curry', '14-15': 'Stephen Curry', '13-14': 'Kevin Durant', '12-13': 'LeBron James', '11-12': 'LeBron James', '10-11': 'Derrick Rose', '09-10': 'LeBron James', '08-09': 'LeBron James', '07-08': 'Kobe Bryant', '06-07': 'Dirk Nowitzki', '05-06': 'Steve Nash', '04-05': 'Steve Nash', '03-04': 'Kevin Garnett', '02-03': 'Tim Duncan', '01-02': 'Tim Duncan', '00-01': 'Allen Iverson', '99-00': "Shaquille O'Neal", '98-99': 'Karl Malone', '97-98': 'Michael Jordan', '96-97': 'Karl Malone', '95-96': 'Michael Jordan', '94-95': 'David Robinson', '93-94': 'Hakeem Olajuwon', '92-93': 'Charles Barkley', '91-92': 'Michael Jordan', '90-91': 'Michael Jordan', '89-90': 'Magic Johnson', '88-89': 'Magic Johnson', '87-88': 'Michael Jordan', '86-87': 'Magic Johnson', '85-86': 'Larry Bird', '84-85': '

In [9]:
#Doing the same thing we did with the MVP to Denfesive Player of the Year
defensive_df = pd.read_csv(os.path.join(data_directory,'Defensive Player of the Year player stats.csv'),header=1)
season_defensive_dict = {}
for index,row in defensive_df.iterrows():
    season = row['Season']
    player = row['Player']
    season_short = season[2:]
    season_defensive_dict[season_short] = player

Now lets loop through all the season data we have and combine them while making a new column named ```MVP``` and ```Defensive player of the Year```. We will assign zeros and ones, one denoting the player won the MVP award or Defensive Player of the Year award

In [10]:
#Read all the csv file and clean it up alittle before concatnating into one big dataframe
combine_clean_df = pd.DataFrame()
for csv_file in files:
    #Read in the csv
    df = pd.read_csv(os.path.join(data_directory,csv_file),header=0)
    #Some cleaning for Player column
    df['Player'] = df['Player'].str.replace('*','')
    #Drop unnecessary column
    df = df.drop(columns = ['Rk','Player-additional'])
    #Grab the season string to later on use it to assign MVP or Defensive Player of the Year winners
    season = (csv_file[0:2] + csv_file[3:5])
    # add the year to the data frame
    df['Year'] = season
    #Using the agg function, We grab the first rows of each unique players which have their Total stats from
    #multiple team if any
    combined_df = df.groupby('Player').agg({
    'Age': 'first',
    'Pos':'first',
    'Tm':'first',
    'G':'first',
    'GS':'first',
    'MP':'first',
    'FG':'first',
    'FGA':'first',
    'FG%':'first',
    '3P':'first',
    '3PA':'first',
    '3P%' : 'first',
    '2P':'first',
    '2PA':'first',
    '2P%':'first',
    'FT':'first',
    'FTA':'first',
    'FT%':'first',
    'eFG%':'first',
    'ORB':'first',
    'DRB':'first',
    'TRB':'first',
    'AST':'first',
    'STL':'first',
    'BLK':'first',
    'TOV':'first',
    'PF':'first',
    'PTS':'first',
    'Year':'first'
    }).reset_index()
    
    '''combined_df['FG%'] = combined_df['FG'] / combined_df['FGA']
    combined_df['3P%'] = combined_df['3P'] / combined_df['3PA']
    combined_df['2P%'] = combined_df['2P'] / combined_df['2PA']
    combined_df['FT%'] = combined_df['FT'] / combined_df['FTA']
    combined_df['eFG%'] = (combined_df['FG'] + 0.5 * combined_df['3P'])/ combined_df['FGA']'''
    
    season = csv_file[:5]
    
    #Creating a column of MVP and Defensive Player of the year, zeros and ones
    if season in season_mvp_dict:
        mvp_player = season_mvp_dict[season]
        combined_df['MVP'] = (combined_df['Player'] == mvp_player).astype(int)
    if season in season_defensive_dict:
        defensive_player = season_defensive_dict[season]
        combined_df['Defensive Player of the Year'] = (combined_df['Player'] == defensive_player).astype(int)
    combine_clean_df = pd.concat([combine_clean_df,combined_df])

#Lets check if we assigned everything correclty. There should be only 23 MVP and Defensive winners
mvp_count = (combine_clean_df['MVP'] == 1).sum()
denfensive_count = (combine_clean_df['Defensive Player of the Year'] == 1).sum()

print('The count for MVP award winners are : ', mvp_count)
print('\nThe count for Defensive Player of the Year award winners are : ',denfensive_count)

The count for MVP award winners are :  23

The count for Defensive Player of the Year award winners are :  23


In [11]:
#Lets look at the combine data set we created
pd.set_option('display.max_columns', None)
combine_clean_df.head(10)

Unnamed: 0,Player,Age,Pos,Tm,G,GS,MP,FG,FGA,FG%,3P,3PA,3P%,2P,2PA,2P%,FT,FTA,FT%,eFG%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Year,MVP,Defensive Player of the Year
0,A.C. Green,37,PF,MIA,82,1,1411,144,324,0.444,0,6,0.0,144,318,0.453,79,111,0.712,0.444,107,206,313,39,30,8,45,119,367,1,0,0
1,A.J. Guyton,22,PG,CHI,33,8,630,78,192,0.406,27,69,0.391,51,123,0.415,15,18,0.833,0.477,10,26,36,64,9,5,24,35,198,1,0,0
2,Aaron McKie,28,SG,PHI,76,33,2394,338,714,0.473,53,170,0.312,285,544,0.524,149,194,0.768,0.511,33,278,311,377,106,8,203,178,878,1,0,0
3,Aaron Williams,29,PF,NJN,82,25,2336,297,650,0.457,0,2,0.0,297,648,0.458,244,310,0.787,0.457,211,379,590,88,59,113,132,319,838,1,0,0
4,Adam Keefe,30,PF,GSW,67,13,836,64,159,0.403,1,3,0.333,63,156,0.404,39,63,0.619,0.406,90,119,209,36,28,20,40,102,168,1,0,0
5,Adonal Foyle,25,C,GSW,58,37,1457,156,375,0.416,0,0,,156,375,0.416,30,68,0.441,0.416,156,249,405,48,31,156,79,136,342,1,0,0
6,Adrian Griffin,26,SF,BOS,44,0,377,33,97,0.34,9,26,0.346,24,71,0.338,18,24,0.75,0.387,27,60,87,27,18,5,18,45,93,1,0,0
7,Al Harrington,20,PF,IND,78,38,1892,241,543,0.444,1,7,0.143,240,536,0.448,103,157,0.656,0.445,119,262,381,130,63,18,148,223,586,1,0,0
8,Alan Henderson,28,PF,ATL,73,42,1810,298,671,0.444,0,1,0.0,298,670,0.445,173,271,0.638,0.444,180,226,406,50,51,29,126,164,769,1,0,0
9,Allan Houston,29,SG,NYK,78,78,2858,542,1208,0.449,96,252,0.381,446,956,0.467,279,307,0.909,0.488,20,263,283,173,52,10,161,190,1459,1,0,0


In [12]:
combine_clean_df.tail(10)

Unnamed: 0,Player,Age,Pos,Tm,G,GS,MP,FG,FGA,FG%,3P,3PA,3P%,2P,2PA,2P%,FT,FTA,FT%,eFG%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Year,MVP,Defensive Player of the Year
529,Xavier Cooks,27,PF,WAS,10,1,126,17,28,0.607,0,1,0.0,17,27,0.63,4,10,0.4,0.607,16,22,38,6,6,4,8,13,38,2223,0,0
530,Xavier Moon,28,SG,LAC,4,0,20,3,9,0.333,1,3,0.333,2,6,0.333,0,0,,0.389,0,3,3,5,0,0,1,2,7,2223,0,0
531,Xavier Sneed,25,SF,CHO,4,0,48,5,10,0.5,3,6,0.5,2,4,0.5,4,4,1.0,0.65,0,5,5,5,0,1,3,7,17,2223,0,0
532,Xavier Tillman Sr.,24,C,MEM,61,29,1180,188,306,0.614,4,15,0.267,184,291,0.632,49,89,0.551,0.621,121,186,307,96,58,29,44,97,429,2223,0,0
533,Yuta Watanabe,28,SF,BRK,58,1,928,114,232,0.491,60,135,0.444,54,97,0.557,34,47,0.723,0.621,30,111,141,48,25,17,22,80,322,2223,0,0
534,Zach Collins,25,C,SAS,63,26,1441,284,548,0.518,55,147,0.374,229,401,0.571,108,142,0.761,0.568,116,286,402,180,37,49,129,199,731,2223,0,0
535,Zach LaVine,27,SG,CHI,77,77,2768,673,1388,0.485,204,544,0.375,469,844,0.556,363,428,0.848,0.558,42,303,345,327,69,18,194,159,1913,2223,0,0
536,Zeke Nnaji,22,PF,DEN,53,5,728,110,196,0.561,17,65,0.262,93,131,0.71,40,62,0.645,0.605,65,73,138,18,17,23,31,105,277,2223,0,0
537,Ziaire Williams,21,SF,MEM,37,4,561,84,196,0.429,25,97,0.258,59,99,0.596,17,22,0.773,0.492,16,63,79,35,14,6,37,58,210,2223,0,0
538,Zion Williamson,22,PF,NOP,29,29,956,285,469,0.608,7,19,0.368,278,450,0.618,177,248,0.714,0.615,58,144,202,133,32,16,99,65,754,2223,0,0


23 MVP's and 23 Defensive player of the Year, which is correct since we are only looking at season from 01 to 23
Lets save this into an csv

## Now lets clean it

In [13]:
#Checking for NaN
combine_clean_df.isnull().sum()

Player                             0
Age                                0
Pos                                0
Tm                                 0
G                                  0
GS                                 0
MP                                 0
FG                                 0
FGA                                0
FG%                               44
3P                                 0
3PA                                0
3P%                             1398
2P                                 0
2PA                                0
2P%                               85
FT                                 0
FTA                                0
FT%                              428
eFG%                              44
ORB                                0
DRB                                0
TRB                                0
AST                                0
STL                                0
BLK                                0
TOV                                0
P

There is some null values lets look into that

In [14]:
combine_clean_df.shape

(11083, 32)

In [15]:
data = combine_clean_df[combine_clean_df['FG%'].isna()]
data

Unnamed: 0,Player,Age,Pos,Tm,G,GS,MP,FG,FGA,FG%,3P,3PA,3P%,2P,2PA,2P%,FT,FTA,FT%,eFG%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Year,MVP,Defensive Player of the Year
15,Andy Panko,23,SF,ATL,1,0,1,0,0,,0,0,,0,0,,0,0,,,0,0,0,0,0,0,0,0,0,1,0,0
254,Lari Ketner,23,PF,IND,3,0,7,0,0,,0,0,,0,0,,0,0,,,0,0,0,1,0,0,2,0,0,1,0,0
120,Dickey Simpkins,29,PF,ATL,1,0,3,0,0,,0,0,,0,0,,0,0,,,0,0,0,1,0,0,0,0,0,102,0,0
162,Guy Rucker,25,PF,GSW,3,0,4,0,0,,0,0,,0,0,,0,0,,,0,1,1,1,0,0,0,1,0,203,0,0
196,Jelani McCoy,26,C,CLE,2,0,12,0,0,,0,0,,0,0,,0,0,,,1,3,4,0,0,0,1,4,0,304,0,0
310,Olden Polynice,39,C,LAC,2,0,12,0,0,,0,0,,0,0,,0,0,,,1,1,2,1,1,0,4,2,0,304,0,0
346,Pavel Podkolzin,20,C,DAL,5,0,10,0,0,,0,0,,0,0,,1,2,0.5,,0,2,2,0,0,0,2,4,1,405,0,0
10,Alex Scales,27,SG,SAS,1,0,0,0,0,,0,0,,0,0,,0,0,,,0,0,0,0,0,0,0,0,0,506,0,0
68,Bryon Russell,35,SF,DEN,1,0,3,0,0,,0,0,,0,0,,0,0,,,0,1,1,1,0,0,0,0,0,506,0,0
120,Deng Gai,23,C,PHI,2,0,5,0,0,,0,0,,0,0,,0,0,,,0,0,0,0,0,0,0,0,0,506,0,0


Looks like its NaN because it is dividing by zero, lets replace them to zeros

In [16]:
#Checking for NaNs
combine_clean_df.fillna(0, inplace=True)
combine_clean_df.isnull().sum()

Player                          0
Age                             0
Pos                             0
Tm                              0
G                               0
GS                              0
MP                              0
FG                              0
FGA                             0
FG%                             0
3P                              0
3PA                             0
3P%                             0
2P                              0
2PA                             0
2P%                             0
FT                              0
FTA                             0
FT%                             0
eFG%                            0
ORB                             0
DRB                             0
TRB                             0
AST                             0
STL                             0
BLK                             0
TOV                             0
PF                              0
PTS                             0
Year          

In [17]:
#Lets give the column header more indepth name 
name_map = {'Rk':'Rank','Pos':'Position','Tm':'Team','G':'Games','GS':'Games started','MP':'Minutes played per game'
             ,'FG':'Field goals per game','FGA':'Field goals attempt per game','FG%':'Field goal percent','3P':'3 point field goal per game',
             '3PA':'3 point field goal attempt per game', '3P%':'3 point field goal percentage','2P':'2 point field goal per game', '2PA':'2 point field goal attempt per game',
             '2P%':'2 point field goal percentage', 'eFG%':'Effective field goal percentage', 'FT':'Free throws per game',
             'FTA':'Free throw attempt per game','FT%':'Free throw percentage','ORB':'Offensive rebounds per game','DRB':'Defensive rebounds per game',
             'TRB':'Total rebounds per game','AST':'Assist per game','STL':'Steals per game','BLK':'Blocks per game','TOV':'Turn overs per game',
             'PF':'Personal fouls per game','PTS':'Points per game'}

renamed_df = combine_clean_df.rename(columns=name_map)
renamed_df.head(10)

Unnamed: 0,Player,Age,Position,Team,Games,Games started,Minutes played per game,Field goals per game,Field goals attempt per game,Field goal percent,3 point field goal per game,3 point field goal attempt per game,3 point field goal percentage,2 point field goal per game,2 point field goal attempt per game,2 point field goal percentage,Free throws per game,Free throw attempt per game,Free throw percentage,Effective field goal percentage,Offensive rebounds per game,Defensive rebounds per game,Total rebounds per game,Assist per game,Steals per game,Blocks per game,Turn overs per game,Personal fouls per game,Points per game,Year,MVP,Defensive Player of the Year
0,A.C. Green,37,PF,MIA,82,1,1411,144,324,0.444,0,6,0.0,144,318,0.453,79,111,0.712,0.444,107,206,313,39,30,8,45,119,367,1,0,0
1,A.J. Guyton,22,PG,CHI,33,8,630,78,192,0.406,27,69,0.391,51,123,0.415,15,18,0.833,0.477,10,26,36,64,9,5,24,35,198,1,0,0
2,Aaron McKie,28,SG,PHI,76,33,2394,338,714,0.473,53,170,0.312,285,544,0.524,149,194,0.768,0.511,33,278,311,377,106,8,203,178,878,1,0,0
3,Aaron Williams,29,PF,NJN,82,25,2336,297,650,0.457,0,2,0.0,297,648,0.458,244,310,0.787,0.457,211,379,590,88,59,113,132,319,838,1,0,0
4,Adam Keefe,30,PF,GSW,67,13,836,64,159,0.403,1,3,0.333,63,156,0.404,39,63,0.619,0.406,90,119,209,36,28,20,40,102,168,1,0,0
5,Adonal Foyle,25,C,GSW,58,37,1457,156,375,0.416,0,0,0.0,156,375,0.416,30,68,0.441,0.416,156,249,405,48,31,156,79,136,342,1,0,0
6,Adrian Griffin,26,SF,BOS,44,0,377,33,97,0.34,9,26,0.346,24,71,0.338,18,24,0.75,0.387,27,60,87,27,18,5,18,45,93,1,0,0
7,Al Harrington,20,PF,IND,78,38,1892,241,543,0.444,1,7,0.143,240,536,0.448,103,157,0.656,0.445,119,262,381,130,63,18,148,223,586,1,0,0
8,Alan Henderson,28,PF,ATL,73,42,1810,298,671,0.444,0,1,0.0,298,670,0.445,173,271,0.638,0.444,180,226,406,50,51,29,126,164,769,1,0,0
9,Allan Houston,29,SG,NYK,78,78,2858,542,1208,0.449,96,252,0.381,446,956,0.467,279,307,0.909,0.488,20,263,283,173,52,10,161,190,1459,1,0,0


In [18]:
#Rechecking MVP count and defensive count
mvp_count = (renamed_df['MVP'] == 1).sum()
mvp_count

23

In [19]:
defensve_counnt = (renamed_df['Defensive Player of the Year'] == 1).sum()
defensve_counnt

23

In [20]:
#Save them into csv
renamed_df.to_csv('cleaned_combine_player_stat_w_MVP_defensive_player.csv',header=True,index=False)

## Team Data Import

In [21]:
team_data_directory = os.path.join(main_directory, 'team_data')
file_names = [f for f in os.listdir(team_data_directory) if os.path.isfile(os.path.join(team_data_directory, f))]
print(file_names)

['.DS_Store', '2000Adv.csv', '2000Def.csv', '2000Off.csv', '2000Playoffs.csv', '2001Adv.csv', '2001Def.csv', '2001Off.csv', '2001Playoffs.csv', '2002Adv.csv', '2002Def.csv', '2002Off.csv', '2002Playoffs.csv', '2003Adv.csv', '2003Def.csv', '2003Off.csv', '2003Playoffs.csv', '2004Adv.csv', '2004Def.csv', '2004Off.csv', '2004Playoffs.csv', '2005Adv.csv', '2005Def.csv', '2005Off.csv', '2005Playoffs.csv', '2006Adv.csv', '2006Def.csv', '2006Off.csv', '2006Playoffs.csv', '2007Adv.csv', '2007Def.csv', '2007Off.csv', '2007Playoffs.csv', '2008Adv.csv', '2008Def.csv', '2008Off.csv', '2008Playoffs.csv', '2009Adv.csv', '2009Def.csv', '2009Off.csv', '2009Playoffs.csv', '2010Adv.csv', '2010Def.csv', '2010Off.csv', '2010Playoffs.csv', '2011Adv.csv', '2011Def.csv', '2011Off.csv', '2011Playoffs.csv', '2012Adv.csv', '2012Def.csv', '2012Off.csv', '2012Playoffs.csv', '2013Adv.csv', '2013Def.csv', '2013Off.csv', '2013Playoffs.csv', '2014Adv.csv', '2014Def.csv', '2014Off.csv', '2014Playoffs.csv', '2015Adv.cs

In [22]:
off_files = [file for file in file_names if 'Off.csv' in file]
def_files = [file for file in file_names if 'Def.csv' in file]
adv_files = [file for file in file_names if 'Adv.csv' in file]
playoff_files = [file for file in file_names if 'Playoffs.csv' in file]

off_files = sorted(off_files)
def_files = sorted(def_files)
adv_files = sorted(adv_files)
playoff_files = sorted(playoff_files)
print(adv_files)


['2000Adv.csv', '2001Adv.csv', '2002Adv.csv', '2003Adv.csv', '2004Adv.csv', '2005Adv.csv', '2006Adv.csv', '2007Adv.csv', '2008Adv.csv', '2009Adv.csv', '2010Adv.csv', '2011Adv.csv', '2012Adv.csv', '2013Adv.csv', '2014Adv.csv', '2015Adv.csv', '2016Adv.csv', '2017Adv.csv', '2018Adv.csv', '2019Adv.csv', '2020Adv.csv', '2021Adv.csv', '2022Adv.csv', '2023Adv.csv', '2024Adv.csv']


This block of code is a big for loop to individually clean each season data set and create an array of them. One data frame is the raw stats, while the other is normalized to the season's league average. See comments for details.

In [23]:
# Dictionary to store DataFrames
team_stats_dfs = {}
normalized_team_dfs = {}

for i in range(25):

    # Read and Cleanup defensive stats
    def_stats = pd.read_csv(os.path.join(team_data_directory,def_files[i]))
    def_stats = def_stats.drop(columns=['Rk','Team▲','G','MP'])
    def_stats = def_stats.add_prefix('Opp ')
   
    # Read and Cleanup advanced stats
    adv_stats = pd.read_csv(os.path.join(team_data_directory,adv_files[i]),header=1)
    adv_stats = adv_stats.drop(columns=['Rk','Team▲','Unnamed: 17','Unnamed: 22','Unnamed: 27','Arena','Attend.','Attend./G'])
    adv_stats = adv_stats.rename(columns={'eFG%.1': 'Opp eFG%', 'TOV%.1': 'Opp TOV%', 'FT/FGA.1': 'Opp FT/FGA'})
    
    # Read and Cleanup main team stats (offensive)
    off_stats = pd.read_csv(os.path.join(team_data_directory,off_files[i]))
    off_stats = off_stats.drop(columns=['Rk','G','MP'])
    off_stats['Team▲'] = off_stats['Team▲'].str.replace('*', '')
    off_stats = off_stats.rename(columns={'Team▲': 'Team'}) 
    #the supersonics was spelled different for some years
    off_stats['Team'] = off_stats['Team'].replace('Seattle SuperSonics', 'Seattle Supersonics')
    #in 2014 Charlotte switched mascots midseason
    if i == 14:
        off_stats['Team'] = off_stats['Team'].replace('Charlotte Bobcats', 'Charlotte Hornets')
    
    # Make one big df of every stat
    team_stats = pd.concat([off_stats, def_stats, adv_stats], axis=1)

    #Add the year to the name
    year = 2000 + i
    team_stats['Team'] = team_stats['Team'] + ' ' + str(year)

    #replace win and loss columns with winning percentage (2020 and 2021 had shortened seasons)
    winrate = team_stats['W']/(team_stats['W'] + team_stats['L']) 
    pwinrate = team_stats['PW']/(team_stats['PW'] + team_stats['PL'])
    team_stats.insert(44, 'W%', winrate)
    team_stats.insert(47, 'PW%', pwinrate)
    team_stats = team_stats.drop(columns=['W','L','PW','PL'])

    ##################################################
    #create copy to normalize by year
    normalized_team = team_stats.copy()

    #get column of team names
    team_col = normalized_team['Team'].copy()
    #drop team name column (can't divide with strings)
    normalized_team = normalized_team.drop(normalized_team.columns[0], axis=1)

    #row index for the league average row
    avg_i = len(normalized_team) - 1
        
    #these stats are already normalized and don't appear in the league average row
    normalized_team.at[avg_i,'NRtg'] = 1
    normalized_team.at[avg_i,'MOV'] = 1
    normalized_team.at[avg_i,'SOS'] = 1
    normalized_team.at[avg_i,'SRS'] = 1
    normalized_team.at[avg_i,'W%'] = 1

    #get the row of averages
    divisor_row = normalized_team.iloc[avg_i]
    
    #divide every row by the average to normalize
    normalized_team = normalized_team.div(divisor_row, axis=1)
    
    #add the team column back
    normalized_team.insert(0, 'Team', team_col)
    
    #drop the league average row in both dataframes
    normalized_team.drop(normalized_team.tail(1).index, inplace=True)
    team_stats.drop(team_stats.tail(1).index, inplace=True)
        
    ##################################################

    #put this years stats into it's own file
    if i == 24:
        team_stats2024 = team_stats.copy()
        normal_stats2024 = normalized_team.copy()

    else:
    
        # Read in playoff stats
        playoff = pd.read_csv(os.path.join(team_data_directory,playoff_files[i]),header=1)
        #This csv is only used to get the number of playoff wins
        playoff = playoff[['Team', 'W', 'L']]
        playoff.drop(playoff.tail(1).index, inplace=True)
        year = 2000 + i
        playoff['Team'] = playoff['Team'] + ' ' + str(year)

        # New DataFrame for just team names of all the teams
        all_team_names = team_stats['Team']
        all_team_names_df = pd.DataFrame(all_team_names, columns=['Team'])

        # Add the teams that missed the playoffs
        all_teams = pd.concat([playoff, all_team_names_df]).drop_duplicates(subset=['Team'])
        # Replace NaN with -1 to indicate that these teams missed the playoffs
        all_teams = all_teams.fillna(-1)

        # Sort by team alphabetically
        all_teams.sort_values(by='Team', inplace=True)
        # Reset index
        all_teams.reset_index(drop=True, inplace=True)

        #drop team name and losses
        playoff_wins = all_teams.drop(columns=['Team','L'])

        #add playoff win column to main df and normal df
        team_stats = pd.concat([team_stats, playoff_wins], axis=1)
        normalized_team = pd.concat([normalized_team, playoff_wins], axis=1)       

        #create big array of all years (200i)
        team_stats_dfs[i] = team_stats.copy()
        normalized_team_dfs[i] = normalized_team.copy()

display(team_stats_dfs[23])
display(normalized_team_dfs[23])

Unnamed: 0,Team,FG,FGA,FG%,3P,3PA,3P%,2P,2PA,2P%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Opp FG,Opp FGA,Opp FG%,Opp 3P,Opp 3PA,Opp 3P%,Opp 2P,Opp 2PA,Opp 2P%,Opp FT,Opp FTA,Opp FT%,Opp ORB,Opp DRB,Opp TRB,Opp AST,Opp STL,Opp BLK,Opp TOV,Opp PF,Opp PTS,Age,W%,PW%,MOV,SOS,SRS,ORtg,DRtg,NRtg,Pace,FTr,3PAr,TS%,eFG%,TOV%,ORB%,FT/FGA,Opp eFG%,Opp TOV%,DRB%,Opp FT/FGA,W
0,Atlanta Hawks 2023,44.6,92.4,0.483,10.8,30.5,0.352,33.9,61.8,0.548,18.5,22.6,0.818,11.2,33.2,44.4,25.0,7.1,4.9,12.9,18.8,118.4,43.8,90.2,0.486,11.9,33.5,0.356,31.9,56.7,0.562,18.6,23.2,0.803,10.6,33.5,44.1,26.0,7.5,5.0,14.2,19.7,118.1,24.9,0.5,0.512195,0.29,0.02,0.32,116.6,116.3,0.3,100.7,0.244,0.331,0.579,0.541,11.2,25.1,0.2,0.552,12.4,75.8,0.206,2.0
1,Boston Celtics 2023,42.2,88.8,0.475,16.0,42.6,0.377,26.2,46.2,0.567,17.5,21.6,0.812,9.7,35.6,45.3,26.7,6.4,5.2,13.4,18.8,117.9,41.8,90.2,0.463,11.6,33.7,0.345,30.1,56.5,0.534,16.2,21.1,0.769,9.7,34.3,44.0,23.1,6.6,3.9,12.7,19.1,111.4,27.4,0.695122,0.695122,6.52,-0.15,6.38,118.0,111.5,6.5,98.5,0.243,0.48,0.6,0.566,12.0,22.1,0.197,0.528,11.3,78.5,0.18,11.0
2,Brooklyn Nets 2023,41.5,85.1,0.487,12.8,33.8,0.378,28.7,51.3,0.559,17.7,22.1,0.8,8.2,32.3,40.5,25.5,7.1,6.2,13.7,21.1,113.4,41.0,88.5,0.463,11.8,32.2,0.367,29.2,56.3,0.518,18.7,24.4,0.767,11.5,33.6,45.1,23.4,7.0,3.9,13.7,18.5,112.5,28.0,0.54878,0.52439,0.85,0.18,1.03,115.0,114.1,0.9,98.3,0.26,0.397,0.598,0.562,12.7,19.6,0.208,0.53,12.2,73.7,0.212,0.0
3,Charlotte Hornets 2023,41.3,90.4,0.457,10.7,32.5,0.33,30.5,57.9,0.528,17.6,23.6,0.749,11.0,33.5,44.5,25.1,7.7,5.2,14.2,20.3,111.0,43.0,90.1,0.477,12.2,34.3,0.357,30.7,55.8,0.55,19.0,24.0,0.795,10.9,35.3,46.2,25.9,7.0,5.7,14.4,20.3,117.2,25.3,0.329268,0.317073,-6.24,0.35,-5.89,109.2,115.3,-6.1,100.8,0.261,0.36,0.55,0.516,12.3,23.8,0.195,0.544,12.5,75.5,0.211,-1.0
4,Chicago Bulls 2023,42.5,86.8,0.49,10.4,28.9,0.361,32.1,57.9,0.555,17.6,21.8,0.809,8.5,33.9,42.4,24.5,7.9,4.5,13.4,18.9,113.1,40.7,87.1,0.468,13.2,37.1,0.357,27.5,50.0,0.55,17.1,22.0,0.779,9.6,33.7,43.3,26.0,6.7,4.7,15.0,18.7,111.8,27.5,0.487805,0.536585,1.29,0.07,1.37,113.5,112.2,1.3,98.5,0.251,0.333,0.587,0.55,12.2,20.1,0.203,0.544,13.5,77.8,0.197,0.0
5,Cleveland Cavaliers 2023,41.6,85.2,0.488,11.6,31.6,0.367,30.0,53.6,0.559,17.5,22.5,0.78,9.7,31.4,41.1,24.9,7.1,4.7,13.3,19.0,112.3,39.0,83.5,0.468,11.3,30.6,0.368,27.8,52.9,0.525,17.5,22.4,0.782,9.8,31.5,41.2,23.0,7.0,4.4,15.7,20.4,106.9,25.4,0.621951,0.670732,5.38,-0.15,5.23,116.1,110.6,5.5,95.7,0.264,0.371,0.59,0.556,12.3,23.6,0.206,0.535,14.4,76.3,0.21,1.0
6,Dallas Mavericks 2023,40.0,84.3,0.475,15.2,41.0,0.371,24.8,43.3,0.574,19.0,25.1,0.755,7.6,31.2,38.8,22.9,6.3,3.7,12.2,20.7,114.2,41.8,86.2,0.485,11.1,31.7,0.352,30.6,54.5,0.562,19.5,25.0,0.781,10.1,34.6,44.7,24.9,6.4,3.8,13.1,21.8,114.1,27.8,0.463415,0.5,0.07,-0.22,-0.14,116.8,116.7,0.1,96.6,0.298,0.487,0.599,0.565,11.4,18.0,0.225,0.549,11.9,75.5,0.226,-1.0
7,Denver Nuggets 2023,43.6,86.4,0.504,11.8,31.2,0.379,31.8,55.2,0.575,16.8,22.4,0.751,10.1,32.9,43.0,28.9,7.5,4.5,14.5,18.6,115.8,41.7,87.4,0.478,11.4,33.1,0.344,30.4,54.3,0.559,17.6,22.7,0.775,10.1,30.7,40.8,25.7,7.9,4.2,13.5,19.5,112.5,26.6,0.646341,0.597561,3.33,-0.29,3.04,117.6,114.2,3.4,98.1,0.259,0.361,0.601,0.573,13.1,24.8,0.194,0.543,12.2,76.4,0.201,16.0
8,Detroit Pistons 2023,39.6,87.1,0.454,11.4,32.4,0.351,28.2,54.6,0.516,19.8,25.7,0.771,11.2,31.3,42.4,23.0,7.0,3.8,15.1,22.1,110.3,43.1,88.1,0.489,12.0,33.3,0.36,31.1,54.8,0.568,20.4,26.2,0.777,11.0,33.7,44.7,25.8,7.7,5.5,13.5,21.0,118.5,24.1,0.207317,0.268293,-8.22,0.49,-7.73,110.7,118.9,-8.2,99.0,0.295,0.372,0.561,0.52,13.3,24.9,0.227,0.557,11.9,74.0,0.231,-1.0
9,Golden State Warriors 2023,43.1,90.2,0.479,16.6,43.2,0.385,26.5,47.0,0.564,16.0,20.2,0.794,10.5,34.1,44.6,29.8,7.2,3.9,16.3,21.4,118.9,42.4,90.5,0.469,12.9,35.5,0.364,29.5,55.0,0.536,19.4,25.2,0.769,10.7,32.6,43.3,25.7,7.9,4.0,14.3,18.4,117.1,27.3,0.536585,0.54878,1.8,-0.15,1.66,116.1,114.4,1.7,101.6,0.224,0.479,0.6,0.571,14.1,24.4,0.178,0.54,12.3,76.0,0.214,6.0


Unnamed: 0,Team,FG,FGA,FG%,3P,3PA,3P%,2P,2PA,2P%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Opp FG,Opp FGA,Opp FG%,Opp 3P,Opp 3PA,Opp 3P%,Opp 2P,Opp 2PA,Opp 2P%,Opp FT,Opp FTA,Opp FT%,Opp ORB,Opp DRB,Opp TRB,Opp AST,Opp STL,Opp BLK,Opp TOV,Opp PF,Opp PTS,Age,W%,PW%,MOV,SOS,SRS,ORtg,DRtg,NRtg,Pace,FTr,3PAr,TS%,eFG%,TOV%,ORB%,FT/FGA,Opp eFG%,Opp TOV%,DRB%,Opp FT/FGA,W
0,Atlanta Hawks 2023,1.061905,1.046433,1.016842,0.878049,0.891813,0.975069,1.14527,1.142329,1.0,1.005435,0.961702,1.046036,1.076923,1.006061,1.023041,0.988142,0.972603,1.042553,0.914894,0.94,1.032258,1.042857,1.021518,1.023158,0.96748,0.979532,0.98615,1.077703,1.048059,1.025547,1.01087,0.987234,1.026854,1.019231,1.015152,1.016129,1.027668,1.027397,1.06383,1.007092,0.985,1.029643,0.946768,0.5,1.02439,0.29,0.02,0.32,1.015679,1.013066,0.3,1.016145,0.917293,0.855297,0.996558,0.992661,0.896,1.045833,0.961538,1.012844,0.992,0.997368,0.990385,2.0
1,Boston Celtics 2023,1.004762,1.005663,1.0,1.300813,1.245614,1.044321,0.885135,0.853974,1.034672,0.951087,0.919149,1.038363,0.932692,1.078788,1.043779,1.055336,0.876712,1.106383,0.950355,0.94,1.027899,0.995238,1.021518,0.974737,0.943089,0.98538,0.955679,1.016892,1.044362,0.974453,0.880435,0.897872,0.983376,0.932692,1.039394,1.013825,0.913043,0.90411,0.829787,0.900709,0.955,0.971229,1.041825,0.695122,1.390244,6.52,-0.15,6.38,1.027875,0.971254,6.5,0.993946,0.913534,1.24031,1.032702,1.038532,0.96,0.920833,0.947115,0.968807,0.904,1.032895,0.865385,11.0
2,Brooklyn Nets 2023,0.988095,0.96376,1.025263,1.04065,0.988304,1.047091,0.969595,0.948244,1.020073,0.961957,0.940426,1.023018,0.788462,0.978788,0.93318,1.007905,0.972603,1.319149,0.971631,1.055,0.988666,0.97619,1.002265,0.974737,0.95935,0.94152,1.01662,0.986486,1.040665,0.945255,1.016304,1.038298,0.980818,1.105769,1.018182,1.039171,0.924901,0.958904,0.829787,0.971631,0.925,0.98082,1.064639,0.54878,1.04878,0.85,0.18,1.03,1.001742,0.993902,0.9,0.991927,0.977444,1.02584,1.02926,1.031193,1.016,0.816667,1.0,0.972477,0.976,0.969737,1.019231,0.0
3,Charlotte Hornets 2023,0.983333,1.023783,0.962105,0.869919,0.950292,0.914127,1.030405,1.07024,0.963504,0.956522,1.004255,0.957801,1.057692,1.015152,1.025346,0.992095,1.054795,1.106383,1.007092,1.015,0.967742,1.02381,1.020385,1.004211,0.99187,1.002924,0.98892,1.037162,1.031423,1.00365,1.032609,1.021277,1.016624,1.048077,1.069697,1.064516,1.023715,0.958904,1.212766,1.021277,1.015,1.021796,0.961977,0.329268,0.634146,-6.24,0.35,-5.89,0.95122,1.004355,-6.1,1.017154,0.981203,0.930233,0.946644,0.946789,0.984,0.991667,0.9375,0.998165,1.0,0.993421,1.014423,-1.0
4,Chicago Bulls 2023,1.011905,0.983012,1.031579,0.845528,0.845029,1.0,1.084459,1.07024,1.012774,0.956522,0.92766,1.034527,0.817308,1.027273,0.976959,0.968379,1.082192,0.957447,0.950355,0.945,0.986051,0.969048,0.98641,0.985263,1.073171,1.084795,0.98892,0.929054,0.924214,1.00365,0.929348,0.93617,0.996164,0.923077,1.021212,0.997696,1.027668,0.917808,1.0,1.06383,0.935,0.974717,1.045627,0.487805,1.073171,1.29,0.07,1.37,0.988676,0.977352,1.3,0.993946,0.943609,0.860465,1.010327,1.009174,0.976,0.8375,0.975962,0.998165,1.08,1.023684,0.947115,0.0
5,Cleveland Cavaliers 2023,0.990476,0.964892,1.027368,0.943089,0.923977,1.01662,1.013514,0.990758,1.020073,0.951087,0.957447,0.997442,0.932692,0.951515,0.947005,0.98419,0.972603,1.0,0.943262,0.95,0.979076,0.928571,0.94564,0.985263,0.918699,0.894737,1.019391,0.939189,0.977819,0.958029,0.951087,0.953191,1.0,0.942308,0.954545,0.949309,0.909091,0.958904,0.93617,1.113475,1.02,0.931997,0.965779,0.621951,1.341463,5.38,-0.15,5.23,1.011324,0.963415,5.5,0.965691,0.992481,0.958656,1.015491,1.020183,0.984,0.983333,0.990385,0.981651,1.152,1.003947,1.009615,1.0
6,Dallas Mavericks 2023,0.952381,0.9547,1.0,1.235772,1.19883,1.027701,0.837838,0.80037,1.047445,1.032609,1.068085,0.965473,0.730769,0.945455,0.894009,0.905138,0.863014,0.787234,0.865248,1.035,0.995641,0.995238,0.976217,1.021053,0.902439,0.926901,0.975069,1.033784,1.007394,1.025547,1.059783,1.06383,0.998721,0.971154,1.048485,1.029954,0.98419,0.876712,0.808511,0.929078,1.09,0.994769,1.057034,0.463415,1.0,0.07,-0.22,-0.14,1.017422,1.016551,0.1,0.974773,1.120301,1.258398,1.030981,1.036697,0.912,0.75,1.081731,1.007339,0.952,0.993421,1.086538,-1.0
7,Denver Nuggets 2023,1.038095,0.978482,1.061053,0.95935,0.912281,1.049861,1.074324,1.020333,1.04927,0.913043,0.953191,0.960358,0.971154,0.99697,0.990783,1.142292,1.027397,0.957447,1.028369,0.93,1.00959,0.992857,0.989807,1.006316,0.926829,0.967836,0.952909,1.027027,1.003697,1.020073,0.956522,0.965957,0.991049,0.971154,0.930303,0.940092,1.01581,1.082192,0.893617,0.957447,0.975,0.98082,1.011407,0.646341,1.195122,3.33,-0.29,3.04,1.02439,0.994774,3.4,0.989909,0.973684,0.932817,1.034423,1.051376,1.048,1.033333,0.932692,0.99633,0.976,1.005263,0.966346,16.0
8,Detroit Pistons 2023,0.942857,0.98641,0.955789,0.926829,0.947368,0.972299,0.952703,1.009242,0.941606,1.076087,1.093617,0.985934,1.076923,0.948485,0.976959,0.909091,0.958904,0.808511,1.070922,1.105,0.961639,1.02619,0.997735,1.029474,0.97561,0.973684,0.99723,1.050676,1.012939,1.036496,1.108696,1.114894,0.993606,1.057692,1.021212,1.029954,1.019763,1.054795,1.170213,0.957447,1.05,1.03313,0.91635,0.207317,0.536585,-8.22,0.49,-7.73,0.964286,1.035714,-8.2,0.998991,1.109023,0.96124,0.965577,0.954128,1.064,1.0375,1.091346,1.022018,0.952,0.973684,1.110577,-1.0
9,Golden State Warriors 2023,1.02619,1.021518,1.008421,1.349593,1.263158,1.066482,0.89527,0.868762,1.029197,0.869565,0.859574,1.015345,1.009615,1.033333,1.02765,1.177866,0.986301,0.829787,1.156028,1.07,1.036617,1.009524,1.024915,0.987368,1.04878,1.038012,1.00831,0.996622,1.016636,0.978102,1.054348,1.07234,0.983376,1.028846,0.987879,0.997696,1.01581,1.082192,0.851064,1.014184,0.92,1.020924,1.038023,0.536585,1.097561,1.8,-0.15,1.66,1.011324,0.996516,1.7,1.025227,0.842105,1.237726,1.032702,1.047706,1.128,1.016667,0.855769,0.990826,0.984,1.0,1.028846,6.0


Next, we take the big array of individual season dataframes and combine them into one big dataframe to write to a csv file. The to_csv lines are commented out to prevent them from being over written

In [24]:
# Combine all DataFrames into one giant DataFrame
all_data = pd.concat(team_stats_dfs.values(), ignore_index=True)
all_data.set_index('Team', inplace=True)
#write it into a csv file
all_data.to_csv('all_team_data.csv')

# Combine all normalized DataFrames into one giant DataFrame
all_data = pd.concat(normalized_team_dfs.values(), ignore_index=True)
all_data.set_index('Team', inplace=True)
#write it to a csv file
#all_data.to_csv('normal_team_data.csv')

team_stats2024.set_index('Team', inplace=True)
#team_stats2024.to_csv('2024_team_data.csv')
normal_stats2024.set_index('Team', inplace=True)
#normal_stats2024.to_csv('2024_normal_data.csv')

## Modeling

Lets create an SVM model for predicting MVP award winners

In [25]:
# determine the average stats for each season
# create new player data data frame that only includes columns which we would like to get the seasonal averages of
player_stat_game_df2 = renamed_df.drop(columns = ['Player', 'Position', 'Team', 'MVP', 'Defensive Player of the Year', 'Age'])
# create new data frame that contains the seasonal averages of relevant statistics
average_stat_game_df = player_stat_game_df2.groupby('Year').mean()
# merge original data with season average data
merged_data = pd.merge(renamed_df, average_stat_game_df, on='Year', suffixes=('', '_avg'))
# divide each statisitic by the corresponding season average
for stat in average_stat_game_df.columns:
    merged_data[stat] = merged_data[stat] / merged_data[f'{stat}_avg']
    merged_data.drop(columns=[f'{stat}_avg'], inplace=True)

player_average_game_df = merged_data
player_average_game_df.head(10)

Unnamed: 0,Player,Age,Position,Team,Games,Games started,Minutes played per game,Field goals per game,Field goals attempt per game,Field goal percent,3 point field goal per game,3 point field goal attempt per game,3 point field goal percentage,2 point field goal per game,2 point field goal attempt per game,2 point field goal percentage,Free throws per game,Free throw attempt per game,Free throw percentage,Effective field goal percentage,Offensive rebounds per game,Defensive rebounds per game,Total rebounds per game,Assist per game,Steals per game,Blocks per game,Turn overs per game,Personal fouls per game,Points per game,Year,MVP,Defensive Player of the Year
0,A.C. Green,37,PF,MIA,1.50851,0.03709,1.081198,0.748294,0.745492,1.050394,0.0,0.081173,0.0,0.865873,0.881628,1.040479,0.788124,0.827756,1.037054,0.989554,1.654581,1.253584,1.366826,0.332117,0.71129,0.282127,0.579551,0.987505,0.717856,1,0,0
1,A.J. Guyton,22,PG,CHI,0.607083,0.29672,0.482746,0.405326,0.441773,0.960496,1.033235,0.933491,1.768704,0.306663,0.341007,0.953198,0.149644,0.134231,1.213295,1.063102,0.154634,0.158219,0.157207,0.545012,0.213387,0.176329,0.309094,0.290443,0.38729,1,0,0
2,Aaron McKie,28,SG,PHI,1.398131,1.22397,1.834435,1.756413,1.642844,1.119001,2.028202,2.299905,1.411345,1.713707,1.508195,1.203556,1.486461,1.446708,1.11862,1.138878,0.510291,1.69173,1.358092,3.210462,2.513226,0.282127,2.614421,1.477109,1.717377,1,0,0
3,Aaron Williams,29,PF,NJN,1.50851,0.92725,1.789992,1.543357,1.495586,1.081149,0.0,0.027058,0.0,1.785863,1.796526,1.051963,2.434204,2.311751,1.146294,1.018527,3.262772,2.306352,2.576445,0.749392,1.398871,3.985046,1.700018,2.647178,1.639136,1,0,0
4,Adam Keefe,30,PF,GSW,1.232563,0.48217,0.640596,0.332575,0.365843,0.953398,0.038268,0.040587,1.506339,0.378819,0.432497,0.927933,0.389074,0.469807,0.901596,0.904862,1.391704,0.724158,0.912673,0.306569,0.663871,0.705318,0.515157,0.846433,0.32861,1,0,0
5,Adonal Foyle,25,C,GSW,1.066995,1.37233,1.116446,0.810652,0.862838,0.984153,0.0,0.0,0.0,0.938029,1.039656,0.955495,0.299287,0.507094,0.642333,0.92715,2.412287,1.515255,1.768576,0.408759,0.735,5.501479,1.017435,1.128578,0.668955,1,0,0
6,Adrian Griffin,26,SF,BOS,0.809444,0.0,0.288881,0.171484,0.223187,0.804356,0.344412,0.35175,1.565145,0.144312,0.196842,0.77634,0.179572,0.178974,1.092403,0.862516,0.417511,0.365122,0.379916,0.229927,0.426774,0.176329,0.231821,0.373426,0.181909,1,0,0
7,Al Harrington,20,PF,IND,1.434924,1.40942,1.449771,1.252354,1.24939,1.050394,0.038268,0.094702,0.646866,1.443122,1.486015,1.028995,1.027553,1.17079,0.955488,0.991783,1.840142,1.594364,1.663772,1.107056,1.49371,0.634786,1.90608,1.850535,1.146222,1,0,0
8,Alan Henderson,28,PF,ATL,1.342942,1.55778,1.386937,1.548554,1.543905,1.050394,0.0,0.013529,0.0,1.791876,1.857519,1.022104,1.725891,2.020918,0.92927,0.989554,2.783408,1.375292,1.772943,0.425791,1.209194,1.022711,1.622744,1.360932,1.504171,1,0,0
9,Allan Houston,29,SG,NYK,1.434924,2.893019,2.189981,2.816497,2.779489,1.062223,3.673724,3.409271,1.723469,2.681801,2.65043,1.072635,2.783373,2.289379,1.323992,1.087618,0.309268,1.60045,1.23582,1.473236,1.232903,0.352659,2.073506,1.576689,2.853818,1,0,0


In [26]:
# create new data frame with predictor data
player_average_game_df2 = player_average_game_df.drop(columns = ['Player', 'Position', 'Team', 'MVP', 'Defensive Player of the Year'])
# create data filters for splitting up testing and training data
train_years = ['0001', '0102', '0203', '0304', '0405', '0506', '0607', '0708', '0809', '0910', '1011', '1112', '1213', '1314',
              '1415', '1516', '1617', '1718', '1819', '1920', '2021']
test_year = ['2122']
# Filter both x_train and y_train based on train_years
x_train = player_average_game_df2[player_average_game_df2['Year'].isin(train_years)].to_numpy()
y_train = player_average_game_df[player_average_game_df['Year'].isin(train_years)]['MVP'].to_numpy()
# Filter both x_test and y_test based on train_years
x_test_first = player_average_game_df2[player_average_game_df2['Year'].isin(test_year)].to_numpy()
y_test_first = player_average_game_df[player_average_game_df['Year'].isin(test_year)]['MVP'].to_numpy()

# initialize old confusion matrix and model accuracy
confusion_matrix_old = [[0, 100],[0, 0]]
model_accuracy = []

# for loop used to determine the optimal class weights
for i in range(500, 1000, 100):
    for j in range(10, 100, 10):
        # Define class weights
        class_weights = {0: 1, 1: i}
        # create svm model
        svm_model = svm.SVC(kernel = 'rbf', C = j, gamma = 'scale', class_weight = class_weights)
        # train the nearest neighbor model with the training data 
        svm_model.fit(x_train, y_train)
        # get model predictions
        y_prediction = svm_model.predict(x_test_first)
        # determine the confusion matrix with the confusion_matrix function
        confusion_matrix = metrics.confusion_matrix(y_test_first, y_prediction)
        # determine ideal parameters
        if confusion_matrix[1][1] == 1 and confusion_matrix[0][1] < confusion_matrix_old[0][1]:
            best_weight = i
            best_c = j
            confusion_matrix_old = confusion_matrix

In [27]:
# display results from previous cell
print('Best Weight: ')
print(best_weight)
print('Best C Value: ')
print(best_c)

# develop svm model using the ideal parameters
ideal_class_weights = {0:1, 1:best_weight}

# develop the ideal svm model
svm_model_first = svm.SVC(kernel = 'rbf', C = best_c, gamma = 'scale', class_weight = ideal_class_weights)
# train the nearest neighbor model with the training data 
svm_model_first.fit(x_train, y_train)
# get model predictions
y_prediction = svm_model_first.predict(x_test_first)
# assess accuracy using the accuracy_score function
model_accuracy = metrics.accuracy_score(y_test_first, y_prediction)
# determine the confusion matrix with the confusion_matrix function
confusion_matrix = metrics.confusion_matrix(y_test_first, y_prediction)
print('Model Accuracy: ')
print(model_accuracy)
print('Confusion Matrix: ')
print(confusion_matrix)

# create placeholder data frame that will be used later
x_test1 = player_average_game_df2[player_average_game_df2['Year'].isin(test_year)]

# Create a DataFrame to store the predictions and corresponding players
predictions_df = player_average_game_df.loc[x_test1.index].copy()
predictions_df['First Prediction'] = y_prediction

# Filter the DataFrame to get the rows where the model predicted MVPs
predicted_mvp_df = predictions_df[predictions_df['First Prediction'] == 1]

# Print or display the predicted MVP players
display(predicted_mvp_df)

Best Weight: 
500
Best C Value: 
90
Model Accuracy: 
0.9636363636363636
Confusion Matrix: 
[[582  22]
 [  0   1]]


Unnamed: 0,Player,Age,Position,Team,Games,Games started,Minutes played per game,Field goals per game,Field goals attempt per game,Field goal percent,3 point field goal per game,3 point field goal attempt per game,3 point field goal percentage,2 point field goal per game,2 point field goal attempt per game,2 point field goal percentage,Free throws per game,Free throw attempt per game,Free throw percentage,Effective field goal percentage,Offensive rebounds per game,Defensive rebounds per game,Total rebounds per game,Assist per game,Steals per game,Blocks per game,Turn overs per game,Personal fouls per game,Points per game,Year,MVP,Defensive Player of the Year,First Prediction
9963,Anthony Edwards,20,SG,MIN,1.672875,3.541463,2.512733,3.323777,3.475535,1.021005,4.251095,4.208817,1.263514,2.914527,2.988125,1.043644,3.195141,3.149811,1.164932,1.065949,1.475494,2.025678,1.897766,2.743832,3.384029,2.40038,3.579101,2.053989,3.408357,2122,0,0,1
10057,Darius Garland,22,PG,CLE,1.579938,3.344715,2.47605,3.281397,3.274541,1.069625,3.440421,3.174091,1.355534,3.211216,3.341309,1.029568,3.122524,2.711088,1.322035,1.084153,0.928133,1.333631,1.239357,5.816924,2.868368,0.365275,4.671669,1.440297,3.274957,2122,0,0,1
10071,DeMar DeRozan,32,PF,CHI,1.765813,3.738211,2.794982,4.68598,4.285098,1.166863,0.988627,0.992777,1.245817,6.317718,6.473496,1.045655,7.552152,6.67085,1.299804,1.053813,1.332704,2.422163,2.168875,3.731612,2.191562,1.252372,3.409565,2.22933,4.709002,2122,0,0,1
10076,Dejounte Murray,25,PG,SAS,1.579938,3.344715,2.410838,3.469078,3.464369,1.069625,1.898163,2.055469,1.157336,4.162364,4.400862,1.013481,2.84658,2.778584,1.176789,1.011337,1.903863,3.481859,3.114992,6.255937,4.447582,1.20019,3.390728,1.728357,3.197141,2122,0,0,1
10088,Devin Booker,25,SG,PHO,1.579938,3.344715,2.38944,4.007906,3.966856,1.078885,3.618374,3.341885,1.355534,4.179816,4.382273,1.021525,4.574861,4.083505,1.286465,1.072017,1.070923,2.141019,1.892233,3.282621,2.481622,1.356736,3.051655,2.254378,4.050898,2122,0,0,1
10098,Donovan Mitchell,25,SG,UTA,1.556703,3.295528,2.308943,3.735465,3.841234,1.037212,4.587228,4.57237,1.256435,3.35956,3.355251,1.071797,3.87774,3.521039,1.264233,1.078085,1.308906,1.636402,1.560262,3.57197,3.190656,0.626186,3.767475,2.053989,3.853022,2122,0,0,1
10145,Giannis Antetokounmpo,27,PF,MIL,1.556703,3.295528,2.245768,4.17137,3.475535,1.280308,1.40385,1.691917,1.037001,5.392748,4.661103,1.238699,8.031423,8.616984,1.070078,1.177196,3.18897,4.642478,4.304553,3.871298,2.320477,4.748577,4.125385,2.655157,4.451096,2122,0,0,1
10176,Ja Morant,22,PG,MEM,1.32436,2.803659,1.924798,3.511458,3.285707,1.141396,1.739983,1.789796,1.217503,4.293256,4.280036,1.073807,4.589385,4.66847,1.12788,1.072017,1.832468,1.787787,1.798175,3.831387,2.127104,1.148008,3.692126,1.077092,3.47728,2122,0,0,1
10197,James Harden,32,PG,TOT,1.510235,3.197154,2.465861,2.464075,2.769262,0.949234,2.926335,3.132143,1.167954,2.260068,2.528056,0.957177,6.825984,6.029639,1.299804,0.980997,1.308906,3.207924,2.766422,6.65504,2.642766,1.878558,5.349815,1.916222,3.183801,2122,0,0,1
10218,Jayson Tatum,23,SF,BOS,1.765813,3.738211,2.782755,4.2864,4.366054,1.048788,4.547683,4.551395,1.249356,4.17109,4.242858,1.053699,5.809348,5.275934,1.264233,1.063927,2.022854,3.77742,3.369503,3.332509,2.417164,2.556926,4.087711,2.179232,4.548922,2122,0,0,1


In [28]:
# this cell develops a second svm model which is used to predict the MVPs from the pool of previously predicted MVPs
# create new test and train data from the results of previous run do this in an effort to decrease number of predicted MVPs
x_test_second = predicted_mvp_df.drop(columns = ['First Prediction', 'Player', 'Position', 'Team', 'MVP', 'Defensive Player of the Year']).to_numpy()
y_test_second = predicted_mvp_df['MVP'].to_numpy()

# initialize old confusion matrix and model accuracy
confusion_matrix_old = [[0, 100],[0, 0]]
model_accuracy = []

# for loop used to determine the optimal class weights
for i in range(10, 100, 10):
    for j in range(10, 100, 10):
        # Define class weights
        class_weights = {0: 1, 1: i}
        # create svm model
        svm_model = svm.SVC(kernel = 'rbf', C = j, gamma = 'scale', class_weight = class_weights)
        # train the nearest neighbor model with the training data 
        svm_model.fit(x_train, y_train)
        # get model predictions
        y_prediction = svm_model.predict(x_test_second)
        # determine the confusion matrix with the confusion_matrix function
        confusion_matrix = metrics.confusion_matrix(y_test_second, y_prediction)
        # determine ideal parameters
        if confusion_matrix[1][1] == 1 and confusion_matrix[0][1] < confusion_matrix_old[0][1]:
            best_weight2 = i
            best_c2 = j
            confusion_matrix_old = confusion_matrix

In [29]:
# display results from previous cell
print('Best Weight: ')
print(best_weight2)
print('Best C Value: ')
print(best_c2)

# develop svm model using the ideal parameters
ideal_class_weights = {0:1, 1:best_weight2}

# develop the ideal svm model
svm_model_second = svm.SVC(kernel = 'rbf', C = best_c2, gamma = 'scale', class_weight = ideal_class_weights)
# train the nearest neighbor model with the training data 
svm_model_second.fit(x_train, y_train)
# get model predictions
y_prediction = svm_model_second.predict(x_test_second)
# assess accuracy using the accuracy_score function
model_accuracy = metrics.accuracy_score(y_test_second, y_prediction)
# determine the confusion matrix with the confusion_matrix function
confusion_matrix = metrics.confusion_matrix(y_test_second, y_prediction)
print('Model Accuracy: ')
print(model_accuracy)
print('Confusion Matrix: ')
print(confusion_matrix)

# add final predicted winners to dataframe
predicted_mvp_df.loc[:, 'Second Prediction'] = y_prediction
display(predicted_mvp_df)

Best Weight: 
30
Best C Value: 
90
Model Accuracy: 
0.8695652173913043
Confusion Matrix: 
[[19  3]
 [ 0  1]]


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  predicted_mvp_df.loc[:, 'Second Prediction'] = y_prediction


Unnamed: 0,Player,Age,Position,Team,Games,Games started,Minutes played per game,Field goals per game,Field goals attempt per game,Field goal percent,3 point field goal per game,3 point field goal attempt per game,3 point field goal percentage,2 point field goal per game,2 point field goal attempt per game,2 point field goal percentage,Free throws per game,Free throw attempt per game,Free throw percentage,Effective field goal percentage,Offensive rebounds per game,Defensive rebounds per game,Total rebounds per game,Assist per game,Steals per game,Blocks per game,Turn overs per game,Personal fouls per game,Points per game,Year,MVP,Defensive Player of the Year,First Prediction,Second Prediction
9963,Anthony Edwards,20,SG,MIN,1.672875,3.541463,2.512733,3.323777,3.475535,1.021005,4.251095,4.208817,1.263514,2.914527,2.988125,1.043644,3.195141,3.149811,1.164932,1.065949,1.475494,2.025678,1.897766,2.743832,3.384029,2.40038,3.579101,2.053989,3.408357,2122,0,0,1,0
10057,Darius Garland,22,PG,CLE,1.579938,3.344715,2.47605,3.281397,3.274541,1.069625,3.440421,3.174091,1.355534,3.211216,3.341309,1.029568,3.122524,2.711088,1.322035,1.084153,0.928133,1.333631,1.239357,5.816924,2.868368,0.365275,4.671669,1.440297,3.274957,2122,0,0,1,0
10071,DeMar DeRozan,32,PF,CHI,1.765813,3.738211,2.794982,4.68598,4.285098,1.166863,0.988627,0.992777,1.245817,6.317718,6.473496,1.045655,7.552152,6.67085,1.299804,1.053813,1.332704,2.422163,2.168875,3.731612,2.191562,1.252372,3.409565,2.22933,4.709002,2122,0,0,1,0
10076,Dejounte Murray,25,PG,SAS,1.579938,3.344715,2.410838,3.469078,3.464369,1.069625,1.898163,2.055469,1.157336,4.162364,4.400862,1.013481,2.84658,2.778584,1.176789,1.011337,1.903863,3.481859,3.114992,6.255937,4.447582,1.20019,3.390728,1.728357,3.197141,2122,0,0,1,0
10088,Devin Booker,25,SG,PHO,1.579938,3.344715,2.38944,4.007906,3.966856,1.078885,3.618374,3.341885,1.355534,4.179816,4.382273,1.021525,4.574861,4.083505,1.286465,1.072017,1.070923,2.141019,1.892233,3.282621,2.481622,1.356736,3.051655,2.254378,4.050898,2122,0,0,1,0
10098,Donovan Mitchell,25,SG,UTA,1.556703,3.295528,2.308943,3.735465,3.841234,1.037212,4.587228,4.57237,1.256435,3.35956,3.355251,1.071797,3.87774,3.521039,1.264233,1.078085,1.308906,1.636402,1.560262,3.57197,3.190656,0.626186,3.767475,2.053989,3.853022,2122,0,0,1,0
10145,Giannis Antetokounmpo,27,PF,MIL,1.556703,3.295528,2.245768,4.17137,3.475535,1.280308,1.40385,1.691917,1.037001,5.392748,4.661103,1.238699,8.031423,8.616984,1.070078,1.177196,3.18897,4.642478,4.304553,3.871298,2.320477,4.748577,4.125385,2.655157,4.451096,2122,0,0,1,1
10176,Ja Morant,22,PG,MEM,1.32436,2.803659,1.924798,3.511458,3.285707,1.141396,1.739983,1.789796,1.217503,4.293256,4.280036,1.073807,4.589385,4.66847,1.12788,1.072017,1.832468,1.787787,1.798175,3.831387,2.127104,1.148008,3.692126,1.077092,3.47728,2122,0,0,1,0
10197,James Harden,32,PG,TOT,1.510235,3.197154,2.465861,2.464075,2.769262,0.949234,2.926335,3.132143,1.167954,2.260068,2.528056,0.957177,6.825984,6.029639,1.299804,0.980997,1.308906,3.207924,2.766422,6.65504,2.642766,1.878558,5.349815,1.916222,3.183801,2122,0,0,1,0
10218,Jayson Tatum,23,SF,BOS,1.765813,3.738211,2.782755,4.2864,4.366054,1.048788,4.547683,4.551395,1.249356,4.17109,4.242858,1.053699,5.809348,5.275934,1.264233,1.063927,2.022854,3.77742,3.369503,3.332509,2.417164,2.556926,4.087711,2.179232,4.548922,2122,0,0,1,0


In [30]:
years = ['0001', '0102', '0203', '0304', '0405', '0506', '0607', '0708', '0809', '0910', '1011', '1112', '1213', '1314','1415',
         '1516', '1617', '1718', '1819', '1920', '2021', '2122']

complete_predicted_mvp_df = pd.DataFrame()
complete_predicted_mvp_df = complete_predicted_mvp_df.reindex(columns = predicted_mvp_df.columns)

for year in years:
    # test model to see how it predicts the winner of each year
    x_test_final_first = player_average_game_df2[player_average_game_df2['Year'].isin([year])].to_numpy()
    y_test_final_first = player_average_game_df[player_average_game_df['Year'].isin([year])]['MVP'].to_numpy()
    
    # get the prediction from the first model
    y_prediction_first = svm_model_first.predict(x_test_final_first)
    
    # create placeholder data frame that will be used later
    x_test1 = player_average_game_df2[player_average_game_df2['Year'].isin([year])]
    
    # Create a DataFrame to store the predictions and corresponding players
    predictions_df = player_average_game_df.loc[x_test1.index].copy()
    predictions_df['First Prediction'] = y_prediction_first
    
    # Filter the DataFrame to get the rows where the model predicted MVPs
    predicted_mvp_df = predictions_df[predictions_df['First Prediction'] == 1]
    
    # create the second set of test data based on results from the first model
    x_test_final_second = predicted_mvp_df.drop(columns = ['First Prediction', 'Player', 'Position', 'Team', 'MVP', 'Defensive Player of the Year']).to_numpy()
    y_test_final_second = predicted_mvp_df['MVP'].to_numpy()
    
    # get the prediction from the second model
    y_prediction_second = svm_model_second.predict(x_test_final_second)
    
    # add final predicted winners to dataframe
    predicted_mvp_df.loc[:, 'Second Prediction'] = y_prediction_second
    
    # fill out data frame with all the predicted winners
    complete_predicted_mvp_df = pd.concat([complete_predicted_mvp_df, predicted_mvp_df], ignore_index = True)
    
display(complete_predicted_mvp_df)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  predicted_mvp_df.loc[:, 'Second Prediction'] = y_prediction_second
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  predicted_mvp_df.loc[:, 'Second Prediction'] = y_prediction_second
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  predicted_mvp_df.loc[:, 'Second Prediction'] = y_prediction_second
A va

Unnamed: 0,Player,Age,Position,Team,Games,Games started,Minutes played per game,Field goals per game,Field goals attempt per game,Field goal percent,3 point field goal per game,3 point field goal attempt per game,3 point field goal percentage,2 point field goal per game,2 point field goal attempt per game,2 point field goal percentage,Free throws per game,Free throw attempt per game,Free throw percentage,Effective field goal percentage,Offensive rebounds per game,Defensive rebounds per game,Total rebounds per game,Assist per game,Steals per game,Blocks per game,Turn overs per game,Personal fouls per game,Points per game,Year,MVP,Defensive Player of the Year,First Prediction,Second Prediction
0,Allen Iverson,25.0,SG,PHI,1.306149,2.633389,2.282699,3.959724,4.171535,0.993616,3.75026,4.139829,1.447533,3.992637,4.178032,1.012917,5.836105,5.36177,1.185621,0.99624,0.773169,1.357035,1.192152,2.76764,4.220323,0.705318,3.052304,1.21986,4.316913,1,1.0,0.0,1.0,0.0
1,Andre Miller,24.0,PG,CLE,1.50851,3.041379,2.182319,2.348813,2.298601,1.06932,0.650555,0.865847,1.203262,2.615658,2.59221,1.068041,3.741093,3.355767,1.213295,1.027442,1.453557,1.618706,1.572068,5.594891,2.821452,0.987445,3.412914,1.900326,2.534989,1,0.0,0.0,1.0,0.0
2,Antawn Jamison,24.0,SF,GSW,1.50851,3.041379,2.600698,4.157191,4.169234,1.045663,2.372614,2.773415,1.366109,4.4376,4.455274,1.05426,3.810926,3.982177,1.041424,1.022985,4.329745,2.647132,3.122302,1.396594,2.702903,0.987445,2.562905,1.867132,3.998084,1,0.0,0.0,1.0,0.0
3,Antoine Walker,24.0,PF,BOS,1.490113,3.004289,2.602231,3.694703,3.957551,0.977056,8.45722,8.157898,1.66014,2.946374,3.096789,1.008323,2.484086,2.595127,1.04288,1.06533,2.33497,3.456485,3.139769,3.789538,3.271935,1.728029,3.876555,2.08289,3.70077,1,0.0,0.0,1.0,0.0
4,Chris Webber,27.0,PF,SAC,1.287752,2.596299,2.173123,4.08444,3.761974,1.137927,0.076536,0.378808,0.321171,4.714198,4.455274,1.120869,3.232304,3.437797,1.023945,1.072017,2.767944,3.639046,3.393047,2.50365,2.205,4.161375,2.51139,1.87543,3.712506,1,0.0,0.0,1.0,0.0
5,Dirk Nowitzki,22.0,PF,DAL,1.50851,3.041379,2.394574,3.071125,2.869224,1.121367,5.778462,5.276252,1.75061,2.645723,2.375961,1.178291,4.499287,4.012006,1.220578,1.190139,1.840142,3.864204,3.292609,1.473236,1.873065,3.561855,2.009112,2.033099,3.489521,1,0.0,0.0,1.0,0.0
6,Gary Payton,32.0,PG,SEA,1.453321,2.930109,2.485759,3.767454,3.660734,1.078783,3.903332,3.679848,1.696328,3.746104,3.656818,1.084119,2.703563,2.63987,1.115707,1.087618,1.128826,1.752584,1.576435,5.467153,3.011129,0.916913,2.691694,1.526899,3.565806,1,0.0,0.0,1.0,0.0
7,Glenn Robinson,28.0,SF,MIL,1.398131,2.744659,2.155499,3.554398,3.359316,1.107172,2.104738,2.489309,1.352539,3.782182,3.537604,1.132353,2.504038,2.281922,1.19436,1.085389,1.917459,2.446315,2.296966,2.145985,2.039032,2.186485,2.820484,1.584988,3.27436,1,0.0,0.0,1.0,0.0
8,Jalen Rose,28.0,SF,IND,1.324545,2.670479,2.255114,2.946409,2.85772,1.081149,2.25781,2.35402,1.53348,3.054608,2.960941,1.093307,2.84323,2.565298,1.206012,1.069788,0.572145,1.959486,1.567701,3.70438,1.541129,1.516433,2.717452,1.908624,2.890982,1,0.0,0.0,1.0,0.0
9,Jamal Mashburn,28.0,SF,CHH,1.398131,2.818839,2.290362,2.977588,3.193651,0.977056,3.9416,3.909838,1.610381,2.826114,3.046886,0.983057,2.783373,2.714443,1.115707,1.002926,1.422631,2.945315,2.515309,3.5,2.015323,0.458457,2.717452,1.535197,2.988783,1,0.0,0.0,1.0,0.0


In [31]:
# determine how well the two models are able to predict the winner of the MVP award
predicted_mvp_winners = complete_predicted_mvp_df[complete_predicted_mvp_df['Second Prediction'] == 1]
display(predicted_mvp_winners)
num_true_winners = len(years)
pred_true_winners = predicted_mvp_winners['MVP'].sum()
percent_picked_true_winners = (pred_true_winners/num_true_winners)*100
print('Percentage that model predicts the true winner: ')
print(percent_picked_true_winners)

Unnamed: 0,Player,Age,Position,Team,Games,Games started,Minutes played per game,Field goals per game,Field goals attempt per game,Field goal percent,3 point field goal per game,3 point field goal attempt per game,3 point field goal percentage,2 point field goal per game,2 point field goal attempt per game,2 point field goal percentage,Free throws per game,Free throw attempt per game,Free throw percentage,Effective field goal percentage,Offensive rebounds per game,Defensive rebounds per game,Total rebounds per game,Assist per game,Steals per game,Blocks per game,Turn overs per game,Personal fouls per game,Points per game,Year,MVP,Defensive Player of the Year,First Prediction,Second Prediction
189,Kobe Bryant,27.0,SG,LAL,1.470541,2.978862,2.520109,5.080462,5.121756,1.029887,5.85262,6.034747,1.559168,4.933639,4.890195,1.058695,6.61029,5.797558,1.227092,1.052376,1.18286,2.213013,1.931933,3.251366,3.818398,1.188581,3.379974,1.906084,5.435177,506,0.0,0.0,1.0,1.0
191,LeBron James,21.0,SF,CLE,1.452159,2.941626,2.584707,4.545403,4.296807,1.098546,4.129348,4.415384,1.505249,4.624514,4.266732,1.137768,5.708023,5.762164,1.065405,1.103816,1.2495,3.006948,2.527423,4.705449,3.194986,2.614879,3.515173,1.480692,4.75578,506,0.0,0.0,1.0,1.0
265,Kobe Bryant,29.0,SG,LAL,1.483417,3.0,2.417832,3.804656,3.793377,1.039397,4.186306,4.192484,1.600745,3.723193,3.679371,1.059251,6.048674,5.44139,1.19389,1.05411,1.535557,2.514133,2.253072,3.709138,3.796938,1.545064,3.474121,1.975593,4.252831,708,1.0,0.0,1.0,1.0
288,Dwyane Wade,27.0,SG,MIA,1.42429,2.851707,2.275782,4.152714,3.878966,1.100991,2.389432,2.768589,1.2911,4.537382,4.199445,1.126429,5.584668,5.624871,1.077277,1.06938,1.455094,1.843314,1.739531,5.067648,4.295252,3.982737,3.654653,1.526775,4.308628,809,0.0,0.0,1.0,1.0
292,LeBron James,24.0,SF,CLE,1.460348,2.923902,2.280262,3.836641,3.597914,1.096507,3.584149,3.824238,1.401068,3.891723,3.532592,1.150075,5.62253,5.559211,1.0984,1.098394,1.733034,3.024466,2.679227,5.050441,3.401443,3.494288,3.23813,1.192257,4.160553,809,1.0,0.0,1.0,1.0
300,Kevin Durant,21.0,SF,OKC,1.460686,2.946667,2.408106,3.784622,3.668141,1.054375,3.575781,3.476805,1.564564,3.827586,3.722742,1.058998,7.293029,6.149362,1.277096,1.054917,1.721758,3.025117,2.682833,1.953806,2.787544,3.107726,3.591018,1.473075,4.421789,910,0.0,0.0,1.0,1.0
302,LeBron James,25.0,SF,CLE,1.353806,2.731057,2.205138,3.660692,3.360263,1.114181,3.603716,3.833401,1.427397,3.672414,3.225246,1.172014,5.72059,5.658877,1.088369,1.11854,1.164237,2.820717,2.385698,5.506181,3.111099,2.848749,3.458508,1.025122,4.038996,910,1.0,0.0,1.0,1.0
315,LeBron James,26.0,SF,MIA,1.419632,2.903089,2.326612,3.73937,3.359628,1.15447,2.617651,2.845847,1.426659,3.974649,3.506062,1.188058,4.968444,4.99985,1.085576,1.133252,1.347042,3.074748,2.619235,4.734774,3.109804,1.888685,3.839789,1.44596,3.896265,1011,0.0,0.0,1.0,1.0
323,Kevin Durant,23.0,SF,OKC,1.5198,3.186667,2.540361,4.25592,3.845347,1.157368,5.008587,4.517983,1.686105,4.095422,3.649235,1.175651,6.156223,5.384916,1.236483,1.176715,0.849325,3.816101,3.016369,2.658561,2.767368,3.649217,4.291962,1.640873,4.639705,1112,0.0,0.0,1.0,1.0
326,LeBron James,27.0,SF,MIA,1.427691,2.993535,2.320848,4.110305,3.465852,1.239037,2.033562,1.956917,1.577184,4.553146,3.905792,1.221798,5.527745,5.395665,1.108522,1.191773,1.995913,3.118703,2.816041,4.453952,3.616447,2.369621,3.686242,1.18439,4.220878,1112,1.0,0.0,1.0,1.0


Percentage that model predicts the true winner: 
50.0


## Defensive Player of the Year Prediction

In [32]:
# determine the average stats for each season
# create new player data data frame that only includes columns which we would like to get the seasonal averages of
player_stat_game_df2 = renamed_df.drop(columns = ['Player', 'Position', 'Team', 'MVP', 'Defensive Player of the Year', 'Age'])
# create new data frame that contains the seasonal averages of relevant statistics
average_stat_game_df = player_stat_game_df2.groupby('Year').mean()
# merge original data with season average data
merged_data = pd.merge(renamed_df, average_stat_game_df, on='Year', suffixes=('', '_avg'))
# divide each statisitic by the corresponding season average
for stat in average_stat_game_df.columns:
    merged_data[stat] = merged_data[stat] / merged_data[f'{stat}_avg']
    merged_data.drop(columns=[f'{stat}_avg'], inplace=True)

player_average_game_df = merged_data
player_average_game_df.head(10)

Unnamed: 0,Player,Age,Position,Team,Games,Games started,Minutes played per game,Field goals per game,Field goals attempt per game,Field goal percent,3 point field goal per game,3 point field goal attempt per game,3 point field goal percentage,2 point field goal per game,2 point field goal attempt per game,2 point field goal percentage,Free throws per game,Free throw attempt per game,Free throw percentage,Effective field goal percentage,Offensive rebounds per game,Defensive rebounds per game,Total rebounds per game,Assist per game,Steals per game,Blocks per game,Turn overs per game,Personal fouls per game,Points per game,Year,MVP,Defensive Player of the Year
0,A.C. Green,37,PF,MIA,1.50851,0.03709,1.081198,0.748294,0.745492,1.050394,0.0,0.081173,0.0,0.865873,0.881628,1.040479,0.788124,0.827756,1.037054,0.989554,1.654581,1.253584,1.366826,0.332117,0.71129,0.282127,0.579551,0.987505,0.717856,1,0,0
1,A.J. Guyton,22,PG,CHI,0.607083,0.29672,0.482746,0.405326,0.441773,0.960496,1.033235,0.933491,1.768704,0.306663,0.341007,0.953198,0.149644,0.134231,1.213295,1.063102,0.154634,0.158219,0.157207,0.545012,0.213387,0.176329,0.309094,0.290443,0.38729,1,0,0
2,Aaron McKie,28,SG,PHI,1.398131,1.22397,1.834435,1.756413,1.642844,1.119001,2.028202,2.299905,1.411345,1.713707,1.508195,1.203556,1.486461,1.446708,1.11862,1.138878,0.510291,1.69173,1.358092,3.210462,2.513226,0.282127,2.614421,1.477109,1.717377,1,0,0
3,Aaron Williams,29,PF,NJN,1.50851,0.92725,1.789992,1.543357,1.495586,1.081149,0.0,0.027058,0.0,1.785863,1.796526,1.051963,2.434204,2.311751,1.146294,1.018527,3.262772,2.306352,2.576445,0.749392,1.398871,3.985046,1.700018,2.647178,1.639136,1,0,0
4,Adam Keefe,30,PF,GSW,1.232563,0.48217,0.640596,0.332575,0.365843,0.953398,0.038268,0.040587,1.506339,0.378819,0.432497,0.927933,0.389074,0.469807,0.901596,0.904862,1.391704,0.724158,0.912673,0.306569,0.663871,0.705318,0.515157,0.846433,0.32861,1,0,0
5,Adonal Foyle,25,C,GSW,1.066995,1.37233,1.116446,0.810652,0.862838,0.984153,0.0,0.0,0.0,0.938029,1.039656,0.955495,0.299287,0.507094,0.642333,0.92715,2.412287,1.515255,1.768576,0.408759,0.735,5.501479,1.017435,1.128578,0.668955,1,0,0
6,Adrian Griffin,26,SF,BOS,0.809444,0.0,0.288881,0.171484,0.223187,0.804356,0.344412,0.35175,1.565145,0.144312,0.196842,0.77634,0.179572,0.178974,1.092403,0.862516,0.417511,0.365122,0.379916,0.229927,0.426774,0.176329,0.231821,0.373426,0.181909,1,0,0
7,Al Harrington,20,PF,IND,1.434924,1.40942,1.449771,1.252354,1.24939,1.050394,0.038268,0.094702,0.646866,1.443122,1.486015,1.028995,1.027553,1.17079,0.955488,0.991783,1.840142,1.594364,1.663772,1.107056,1.49371,0.634786,1.90608,1.850535,1.146222,1,0,0
8,Alan Henderson,28,PF,ATL,1.342942,1.55778,1.386937,1.548554,1.543905,1.050394,0.0,0.013529,0.0,1.791876,1.857519,1.022104,1.725891,2.020918,0.92927,0.989554,2.783408,1.375292,1.772943,0.425791,1.209194,1.022711,1.622744,1.360932,1.504171,1,0,0
9,Allan Houston,29,SG,NYK,1.434924,2.893019,2.189981,2.816497,2.779489,1.062223,3.673724,3.409271,1.723469,2.681801,2.65043,1.072635,2.783373,2.289379,1.323992,1.087618,0.309268,1.60045,1.23582,1.473236,1.232903,0.352659,2.073506,1.576689,2.853818,1,0,0


In [33]:
# create new data frame with predictor data
player_average_game_df2 = player_average_game_df.drop(columns = ['Player', 'Position', 'Team', 'MVP', 'Defensive Player of the Year'])
# create data filters for splitting up testing and training data
train_years = ['0001', '0102', '0203', '0304', '0405', '0506', '0607', '0708', '0809', '0910', '1011', '1112', '1213', '1314',
              '1415', '1516', '1617', '1718', '1819', '1920', '2021']
test_year = ['2122']
# Filter both x_train and y_train based on train_years
x_train = player_average_game_df2[player_average_game_df2['Year'].isin(train_years)].to_numpy()
y_train = player_average_game_df[player_average_game_df['Year'].isin(train_years)]['Defensive Player of the Year'].to_numpy()
# Filter both x_test and y_test based on train_years
x_test_first = player_average_game_df2[player_average_game_df2['Year'].isin(test_year)].to_numpy()
y_test_first = player_average_game_df[player_average_game_df['Year'].isin(test_year)]['Defensive Player of the Year'].to_numpy()

# initialize old confusion matrix and model accuracy
confusion_matrix_old = [[0, 100],[0, 0]]
model_accuracy = []

# for loop used to determine the optimal class weights
for i in range(500, 1000, 100):
    for j in range(10, 100, 10):
        # Define class weights
        class_weights = {0: 1, 1: i}
        # create svm model
        svm_model = svm.SVC(kernel = 'rbf', C = j, gamma = 'scale', class_weight = class_weights)
        # train the nearest neighbor model with the training data 
        svm_model.fit(x_train, y_train)
        # get model predictions
        y_prediction = svm_model.predict(x_test_first)
        # determine the confusion matrix with the confusion_matrix function
        confusion_matrix = metrics.confusion_matrix(y_test_first, y_prediction)
        # determine ideal parameters
        if confusion_matrix[1][1] == 1 and confusion_matrix[0][1] < confusion_matrix_old[0][1]:
            best_weight = i
            best_c = j
            confusion_matrix_old = confusion_matrix

In [34]:
# display results from previous cell
print('Best Weight: ')
print(best_weight)
print('Best C Value: ')
print(best_c)

# develop svm model using the ideal parameters
ideal_class_weights = {0:1, 1:best_weight}

# develop the ideal svm model
svm_model_first = svm.SVC(kernel = 'rbf', C = best_c, gamma = 'scale', class_weight = ideal_class_weights)
# train the nearest neighbor model with the training data 
svm_model_first.fit(x_train, y_train)
# get model predictions
y_prediction = svm_model_first.predict(x_test_first)
# assess accuracy using the accuracy_score function
model_accuracy = metrics.accuracy_score(y_test_first, y_prediction)
# determine the confusion matrix with the confusion_matrix function
confusion_matrix = metrics.confusion_matrix(y_test_first, y_prediction)
print('Model Accuracy: ')
print(model_accuracy)
print('Confusion Matrix: ')
print(confusion_matrix)

# create placeholder data frame that will be used later
x_test1 = player_average_game_df2[player_average_game_df2['Year'].isin(test_year)]

# Create a DataFrame to store the predictions and corresponding players
predictions_df = player_average_game_df.loc[x_test1.index].copy()
predictions_df['First Prediction'] = y_prediction

# Filter the DataFrame to get the rows where the model predicted MVPs
predicted_mvp_df = predictions_df[predictions_df['First Prediction'] == 1]

# Print or display the predicted MVP players
display(predicted_mvp_df)

Best Weight: 
500
Best C Value: 
90
Model Accuracy: 
0.9884297520661157
Confusion Matrix: 
[[598   6]
 [  1   0]]


Unnamed: 0,Player,Age,Position,Team,Games,Games started,Minutes played per game,Field goals per game,Field goals attempt per game,Field goal percent,3 point field goal per game,3 point field goal attempt per game,3 point field goal percentage,2 point field goal per game,2 point field goal attempt per game,2 point field goal percentage,Free throws per game,Free throw attempt per game,Free throw percentage,Effective field goal percentage,Offensive rebounds per game,Defensive rebounds per game,Total rebounds per game,Assist per game,Steals per game,Blocks per game,Turn overs per game,Personal fouls per game,Points per game,Year,MVP,Defensive Player of the Year,First Prediction
10145,Giannis Antetokounmpo,27,PF,MIL,1.556703,3.295528,2.245768,4.17137,3.475535,1.280308,1.40385,1.691917,1.037001,5.392748,4.661103,1.238699,8.031423,8.616984,1.070078,1.177196,3.18897,4.642478,4.304553,3.871298,2.320477,4.748577,4.125385,2.655157,4.451096,2122,0,0,1
10202,Jaren Jackson Jr.,22,PF,MEM,1.812282,3.836585,2.166289,2.603322,2.892092,0.96081,2.530884,2.803548,1.129022,2.635291,2.950947,0.957177,4.124637,3.881017,1.21977,0.964816,2.831996,2.414954,2.511912,0.858071,2.352706,9.236243,2.448859,3.406616,2.828069,2122,0,0,1
10234,Joel Embiid,27,C,PHI,1.579938,3.344715,2.34053,4.032122,3.723987,1.155287,1.838846,1.754839,1.313063,5.000072,5.032876,1.063753,9.498284,9.033209,1.206431,1.080108,3.47455,4.685731,4.404145,2.83363,2.481622,5.166034,4.031198,2.266903,4.622292,2122,0,0,1
10271,Karl-Anthony Towns,26,C,MIN,1.719344,3.639837,2.522922,3.886821,3.388996,1.224743,2.96588,2.558849,1.451094,4.293256,3.940793,1.166308,5.576974,5.253435,1.218288,1.1954,4.616867,3.8423,4.022378,2.683967,2.320477,4.33112,4.257247,3.343995,4.042004,2122,0,0,1
10388,Nikola Jokić,26,C,DEN,1.719344,3.639837,2.522922,4.625438,3.659781,1.349764,1.917936,2.013521,1.192728,5.820328,4.754046,1.311091,5.504357,5.264685,1.200503,1.254058,4.902447,5.860769,5.637969,5.826902,3.512945,3.287476,5.293303,2.392146,4.455543,2122,1,0,1
10438,Rudy Gobert,29,C,UTA,1.533469,3.246341,2.160176,2.191634,1.41813,1.650741,0.0,0.027966,0.0,3.158859,2.342169,1.443808,4.400581,4.938454,1.02265,1.442167,5.735387,5.24081,5.355794,0.718385,1.450298,7.148956,2.241648,2.204281,2.283354,2122,0,0,1


In [51]:
# this cell develops a second svm model which is used to predict the MVPs from the pool of previously predicted MVPs
# create new test and train data from the results of previous run do this in an effort to decrease number of predicted MVPs
x_test_second = predicted_mvp_df.drop(columns = ['First Prediction', 'Player', 'Position', 'Team', 'MVP', 'Defensive Player of the Year']).to_numpy()
y_test_second = predicted_mvp_df['Defensive Player of the Year'].to_numpy()

param_grid = {
    'C': list(range(10, 100, 10)),
    'class_weight': [{0: 1, 1: i} for i in range(500, 1000, 100)],
    'gamma': ['scale']
}

# Create the SVM model
svm_model = svm.SVC(kernel='rbf')

# Instantiate the GridSearchCV object
grid_search = GridSearchCV(estimator=svm_model, param_grid=param_grid, scoring='accuracy', cv=5)

# Fit the grid search to the data
grid_search.fit(x_train, y_train)

# Get the best parameters
best_params = grid_search.best_params_

# Get the best estimator
best_estimator = grid_search.best_estimator_

# Print the best parameters
print("Best Parameters:", best_params)

# Print the best estimator
print("Best Estimator:", best_estimator)

Best Parameters: {'C': 90, 'class_weight': {0: 1, 1: 500}, 'gamma': 'scale'}
Best Estimator: SVC(C=90, class_weight={0: 1, 1: 500})


In [52]:
# display results from previous cell
print('Best Weight: ')
print(best_params['class_weight'])
print('Best C Value: ')
print(best_params['C'])

# develop svm model using the ideal parameters
ideal_class_weights = {0:1, 1:best_params['class_weight']}

# develop the ideal svm model
svm_model_second = svm.SVC(kernel = 'rbf', C = best_params['C'], gamma = 'scale', class_weight = best_params['class_weight'])
# train the nearest neighbor model with the training data 
svm_model_second.fit(x_train, y_train)
# get model predictions
y_prediction = svm_model_second.predict(x_test_second)
# assess accuracy using the accuracy_score function
model_accuracy = metrics.accuracy_score(y_test_second, y_prediction)
# determine the confusion matrix with the confusion_matrix function
confusion_matrix = metrics.confusion_matrix(y_test_second, y_prediction)
print('Model Accuracy: ')
print(model_accuracy)
print('Confusion Matrix: ')
print(confusion_matrix)

# add final predicted winners to dataframe
predicted_mvp_df.loc[:, 'Second Prediction'] = y_prediction
display(predicted_mvp_df)

Best Weight: 
{0: 1, 1: 500}
Best C Value: 
90


ValueError: X has 28 features, but SVC is expecting 27 features as input.

In [None]:
years = ['0001', '0102', '0203', '0304', '0405', '0506', '0607', '0708', '0809', '0910', '1011', '1112', '1213', '1314','1415',
         '1516', '1617', '1718', '1819', '1920', '2021', '2122']

complete_predicted_mvp_df = pd.DataFrame()
complete_predicted_mvp_df = complete_predicted_mvp_df.reindex(columns = predicted_mvp_df.columns)

for year in years:
    # test model to see how it predicts the winner of each year
    x_test_final_first = player_average_game_df2[player_average_game_df2['Year'].isin([year])].to_numpy()
    y_test_final_first = player_average_game_df[player_average_game_df['Year'].isin([year])]['Defensive Player of the Year'].to_numpy()
    
    # get the prediction from the first model
    y_prediction_first = svm_model_first.predict(x_test_final_first)
    
    # create placeholder data frame that will be used later
    x_test1 = player_average_game_df2[player_average_game_df2['Year'].isin([year])]
    
    # Create a DataFrame to store the predictions and corresponding players
    predictions_df = player_average_game_df.loc[x_test1.index].copy()
    predictions_df['First Prediction'] = y_prediction_first
    
    # Filter the DataFrame to get the rows where the model predicted MVPs
    predicted_mvp_df = predictions_df[predictions_df['First Prediction'] == 1]
    
    # create the second set of test data based on results from the first model
    x_test_final_second = predicted_mvp_df.drop(columns = ['First Prediction', 'Player', 'Position', 'Team', 'MVP', 'Defensive Player of the Year']).to_numpy()
    y_test_final_second = predicted_mvp_df['Defensive Player of the Year'].to_numpy()
    
    # get the prediction from the second model
    y_prediction_second = svm_model_second.predict(x_test_final_second)
    
    # add final predicted winners to dataframe
    predicted_mvp_df.loc[:, 'Second Prediction'] = y_prediction_second
    
    # fill out data frame with all the predicted winners
    complete_predicted_mvp_df = pd.concat([complete_predicted_mvp_df, predicted_mvp_df], ignore_index = True)
    
display(complete_predicted_mvp_df)

In [None]:
# determine how well the two models are able to predict the winner of the MVP award
predicted_mvp_winners = complete_predicted_mvp_df[complete_predicted_mvp_df['Second Prediction'] == 1]
display(predicted_mvp_winners)
num_true_winners = len(years)
pred_true_winners = predicted_mvp_winners['Defensive Player of the Year'].sum()
percent_picked_true_winners = (pred_true_winners/num_true_winners)*100
print('Percentage that model predicts the true winner: ')
print(percent_picked_true_winners)

## Championship Prediction