## SlamStats Analytics <br>

Vincent Funtanilla - u1282199@utah.edu, u1282199 <br>

Jacob Layton - jake.layton@utah.edu, u1312858 <br>

John Chae - u1285738@utah.edu, u1285738 <br>

## Background and Motivation <br>
 
Our motivation for undertaking this project stems from a passion and curiosity for the game of basketball. By delving into player performances, strategies, and game outcomes, we hope to provide a valuable resource for basketball enthusiasts and enhance the overall fan experience. We want to contribute positively to the basketball community, offering valuable insights for analysts, coaches, and fans alike. We also see this project as an opportunity for skill development, allowing us to apply and refine our data science skills in a real-world context. Ultimately, our goal is to bring innovation to sports analytics, fostering a deeper appreciation for the game we love.

## Project Objectives <br>

There are three questions we want to answer:<br>
- Who will be the 2024 MVP?<br>
- Who will be the 2024 defensive player of the year?<br>
- Which team will win the 2024 NBA Finals?<br>
    
We want to create a model that not only predicts this year's Finals and award winners but also identifies which statistics matter most in predicting them. This will give us insight into what these statistics mean and a better understanding of classification models.
    
## Data Description and Acquisition

We will collect data from https://www.basketball-reference.com/. Basketball Reference is an online basketball encyclopedia that contains a wide range of basketball statistics. Basketball Reference does not allow data scraping, but all data is available to download as Excel spreadsheets. To obtain the data, we will download the spreadsheet containing the data of interest, save it as a CSV file, and open it in Jupyter Notebook. We will collect player and team data from the past 23 seasons to ensure that we have enough information to build our models. For player data, we will collect season-total data and per-game data. Season-total data refers to cumulative statistics, such as total points scored. Per-game data refers to averages per game, such as average points scored per game. We will collect four types of team data: offensive, defensive, advanced, and playoff. The offensive and defensive data describe the offensive and defensive performance of each team. The advanced team data describe how the team performed throughout the season, such as its win/loss record. From the playoff data, we are interested in each team's playoff record.
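To make the difference between season totals and per-game data concrete, here is a minimal sketch that converts one into the other. The numbers are Shareef Abdur-Rahim's 2000-01 games and points as they appear in the player table in this notebook; the variable names are our own.

```python
# Season totals for one player (2000-01 season)
games_played = 81
total_points = 1663

# A per-game statistic is the season total divided by games played
points_per_game = total_points / games_played

print(f"{points_per_game:.1f} points per game")
```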


## Ethical Considerations

One main ethical implication of our analyses is related to gambling. While our project aims to provide valuable insights into basketball dynamics, we acknowledge the potential risk of individuals excessively gambling based on our findings. It is important that we approach our work with a sense of responsibility, emphasizing the importance of using data-driven insights for informed decision-making rather than irresponsible gambling behavior. We can also promote responsible engagement with our analyses and advocate for measures to mitigate the risks associated with problem gambling within the context of sports betting.<br>

It’s also crucial to consider the potential impact of our analyses on fan engagement and player welfare. While we aim to foster constructive dialogue among fans, we should avoid promoting negative behaviors or attitudes that could harm the mental well-being of players. We want to uphold the integrity of the game and the welfare of those involved, including players, teams, and fans.

## Data Cleaning and Processing

The player and team data will be processed and cleaned separately. As both the player and team data are stored in multiple spreadsheet files, the data will need to be read in and then combined into one large data frame. Most of the data is already fairly presentable, but there are a few minor issues that will need to be resolved. One issue in the player data is that the column titles appear in the first row of the data frame instead of the header row. Another issue in the player data is the NaN values that appear in columns with numerical values. For the team data, the different data sets contain multiple overlapping columns, which will need to be removed. Finally, some data will need to be added to both data frames as they are processed, such as the season that the data pertains to.
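The cleaning steps described above can be sketched on a toy frame. This is only an illustration under our own assumptions: the frame, the column subset, and the season label are made up, and the real files may need different handling.

```python
import pandas as pd

# Toy frame standing in for a sheet whose real column titles landed in row 0
raw = pd.DataFrame([
    ["Player", "PTS", "AST"],
    ["Player A", "20", None],
    ["Player B", "11", "4"],
])

# Promote the first row to the header and drop it from the body
clean = raw.iloc[1:].reset_index(drop=True)
clean.columns = raw.iloc[0].tolist()

# Convert the numeric columns; missing entries become NaN
for col in ["PTS", "AST"]:
    clean[col] = pd.to_numeric(clean[col], errors="coerce")

# Tag every row with the season the file covers (hypothetical label)
clean["Year"] = "0001"

print(clean)
```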

In [1]:
import os
import glob
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns
from pandas.plotting import scatter_matrix

from sklearn import tree, svm, metrics
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.datasets import load_digits
from sklearn.preprocessing import scale

plt.style.use('ggplot')
%matplotlib inline  
plt.rcParams['figure.figsize'] = (10, 6) 

The cells below extract the names of all CSV files containing our player data. The file names differ by the type of data they store (season totals or per-game averages); here we keep the season-total files in a list, which is used below to load the data into the Jupyter Notebook.

## Data importing

In [2]:
main_directory = os.path.normpath(os.getcwd())
data_directory = os.path.join(main_directory, 'data')
file_names = [f for f in os.listdir(data_directory) if os.path.isfile(os.path.join(data_directory, f))]
print(file_names)

['00-01 player stats per game.csv', '00-01 player stats.csv', '01-02 player stats per game.csv', '01-02 player stats.csv', '02-03 player stats per game.csv', '02-03 player stats.csv', '03-04 player stats per game.csv', '03-04 player stats.csv', '04-05 player stats per game.csv', '04-05 player stats.csv', '05-06 player stats per game.csv', '05-06 player stats.csv', '06-07 player stats per game.csv', '06-07 player stats.csv', '07-08 player stats per game.csv', '07-08 player stats.csv', '08-09 player stats per game.csv', '08-09 player stats.csv', '09-10 player stats per game.csv', '09-10 player stats.csv', '10-11 player stats per game.csv', '10-11 player stats.csv', '11-12 player stats per game.csv', '11-12 player stats.csv', '12-13 player stats per game.csv', '12-13 player stats.csv', '13-14 player stats per game.csv', '13-14 player stats.csv', '14-15 player stats per game.csv', '14-15 player stats.csv', '15-16 player stats per game.csv', '15-16 player stats.csv', '16-17 player stats per

Let's filter these file names and keep only the files we want.

In [3]:
filtered_file_list = [filename for filename in file_names if 'MVP' not in filename and 'Defensive Player of the Year' not in filename]

files = []
for filename in filtered_file_list:
    if 'stats.csv' in filename:
        files.append(filename)

print(files)

['00-01 player stats.csv', '01-02 player stats.csv', '02-03 player stats.csv', '03-04 player stats.csv', '04-05 player stats.csv', '05-06 player stats.csv', '06-07 player stats.csv', '07-08 player stats.csv', '08-09 player stats.csv', '09-10 player stats.csv', '10-11 player stats.csv', '11-12 player stats.csv', '12-13 player stats.csv', '13-14 player stats.csv', '14-15 player stats.csv', '15-16 player stats.csv', '16-17 player stats.csv', '17-18 player stats.csv', '18-19 player stats.csv', '19-20 player stats.csv', '20-21 player stats.csv', '21-22 player stats.csv', '22-23 player stats.csv']
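As an alternative to substring checks, the season-total files could be selected with a regular expression that also pulls the season out of the name. This is a sketch only: the sample list and the `season_files` dictionary are hypothetical, modeled on the file names printed above.

```python
import re

# Hypothetical sample of names, in the same pattern as the listing above
names = [
    "00-01 player stats per game.csv",
    "00-01 player stats.csv",
    "MVP player stats.csv",
    "22-23 player stats.csv",
]

# Two digits, a dash, two digits, then exactly " player stats.csv"
pattern = re.compile(r"(\d{2})-(\d{2}) player stats\.csv")

season_files = {}
for name in names:
    match = pattern.fullmatch(name)
    if match:
        # Key by a compact season label such as '0001'
        season_files[match.group(1) + match.group(2)] = name

print(season_files)
```

Because `fullmatch` must consume the whole name, the per-game and award files are skipped without a separate filter.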


Now let's read in one of the CSV files and investigate a few things.

In [4]:
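# files[0] is '00-01 player stats.csv', so this frame holds the 2000-01 season despite the variable name
df_01_02 = pd.read_csv(os.path.join(data_directory,files[0]),header=0)
df_01_02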

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Player-additional
0,1,Mahmoud Abdul-Rauf,PG,31,VAN,41,0,486,120,246,...,5,20,25,76,9,1,26,50,266,abdulma02
1,2,Tariq Abdul-Wahad,SG,26,DEN,29,12,420,43,111,...,14,45,59,22,14,13,34,54,111,abdulta01
2,3,Shareef Abdur-Rahim,SF,24,VAN,81,81,3241,604,1280,...,175,560,735,250,90,77,231,238,1663,abdursh01
3,4,Cory Alexander,PG,27,ORL,26,0,227,18,56,...,0,25,25,36,16,0,25,29,52,alexaco01
4,5,Courtney Alexander,PG,23,TOT,65,24,1382,239,573,...,42,101,143,62,45,5,75,139,618,alexaco02
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
532,437,David Wingate,SG,37,SEA,1,0,9,3,3,...,0,0,0,2,0,0,0,1,6,wingada01
533,438,Rubén Wolkowyski,PF,27,SEA,34,1,305,25,79,...,12,34,46,3,6,18,12,38,75,wolkoru01
534,439,Metta World Peace,SF,21,CHI,76,74,2363,327,815,...,59,235,294,228,152,45,159,254,907,artesro01
535,440,Lorenzen Wright,C,25,ATL,71,46,1988,363,811,...,180,355,535,87,42,63,125,232,881,wrighlo02


In [5]:
# Check for duplicate players; these arise when a player changes teams mid-season
player_duplicate = df_01_02['Player'].value_counts()
player_duplicate

Player
Doug Overton      4
Anthony Miller    4
Garth Joseph      3
Rick Brunson      3
Kevin Ollie       3
                 ..
Horace Grant      1
Gary Grant        1
Brian Grant       1
Steve Goodrich    1
Wang Zhizhi       1
Name: count, Length: 441, dtype: int64

It looks like some players appear more than once because they changed teams during the season, and some names carry a trailing asterisk (Basketball Reference uses it to mark Hall of Fame members). We'll handle both later on.
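One way to collapse the duplicate rows is to keep the combined `TOT` row whenever a player has one. The sketch below uses a made-up frame and a helper column `is_total` of our own; it does not depend on the order of rows in the file.

```python
import pandas as pd

# Toy data: Player A changed teams, so there is a TOT row plus one row per team
df = pd.DataFrame({
    "Player": ["Player A", "Player A", "Player A", "Player B"],
    "Tm": ["VAN", "TOT", "DEN", "CHI"],
    "PTS": [300, 618, 318, 907],
})

# Flag the combined rows, sort them to the front within each player, keep one row each
df["is_total"] = df["Tm"].eq("TOT")
one_row_per_player = (
    df.sort_values(["Player", "is_total"], ascending=[True, False])
      .drop_duplicates("Player", keep="first")
      .drop(columns="is_total")
      .reset_index(drop=True)
)

print(one_row_per_player)
```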

For now, let's load the MVP and Defensive Player of the Year award data.

In [6]:
MVP_df = pd.read_csv(os.path.join(data_directory,'MVP player stats.csv'),header=1)
MVP_df

Unnamed: 0,Season,Lg,Player,Voting,Age,Tm,G,MP,PTS,TRB,AST,STL,BLK,FG%,3P%,FT%,WS,WS/48,-9999
0,2022-23,NBA,Joel Embiid,(V),28,PHI,66,34.6,33.1,10.2,4.2,1.0,1.7,0.548,0.330,0.857,12.3,0.259,embiijo01
1,2021-22,NBA,Nikola Jokić,(V),26,DEN,74,33.5,27.1,13.8,7.9,1.5,0.9,0.583,0.337,0.810,15.2,0.296,jokicni01
2,2020-21,NBA,Nikola Jokić,(V),25,DEN,72,34.6,26.4,10.8,8.3,1.3,0.7,0.566,0.388,0.868,15.6,0.301,jokicni01
3,2019-20,NBA,Giannis Antetokounmpo,(V),25,MIL,63,30.4,29.5,13.6,5.6,1.0,1.0,0.553,0.304,0.633,11.1,0.279,antetgi01
4,2018-19,NBA,Giannis Antetokounmpo,(V),24,MIL,72,32.8,27.7,12.5,5.9,1.3,1.5,0.578,0.256,0.729,14.4,0.292,antetgi01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
63,1959-60,NBA,Wilt Chamberlain,(V),23,PHW,72,46.4,37.6,27.0,2.3,,,0.461,,0.582,17.0,0.245,chambwi01
64,1958-59,NBA,Bob Pettit,(V),26,STL,72,39.9,29.2,16.4,3.1,,,0.438,,0.759,14.8,0.246,pettibo01
65,1957-58,NBA,Bill Russell,(V),23,BOS,69,38.3,16.6,22.7,2.9,,,0.442,,0.519,11.3,0.206,russebi01
66,1956-57,NBA,Bob Cousy,(V),28,BOS,64,36.9,20.6,4.8,7.5,,,0.378,,0.821,8.8,0.178,cousybo01


In [7]:
# Map each season (e.g. '22-23') to its MVP award winner
season_mvp_dict = {}
for index,row in MVP_df.iterrows():
    season = row['Season']
    player = row['Player']
    season_short = season[2:]
    season_mvp_dict[season_short] = player

print(season_mvp_dict)

{'22-23': 'Joel Embiid', '21-22': 'Nikola Jokić', '20-21': 'Nikola Jokić', '19-20': 'Giannis Antetokounmpo', '18-19': 'Giannis Antetokounmpo', '17-18': 'James Harden', '16-17': 'Russell Westbrook', '15-16': 'Stephen Curry', '14-15': 'Stephen Curry', '13-14': 'Kevin Durant', '12-13': 'LeBron James', '11-12': 'LeBron James', '10-11': 'Derrick Rose', '09-10': 'LeBron James', '08-09': 'LeBron James', '07-08': 'Kobe Bryant', '06-07': 'Dirk Nowitzki', '05-06': 'Steve Nash', '04-05': 'Steve Nash', '03-04': 'Kevin Garnett', '02-03': 'Tim Duncan', '01-02': 'Tim Duncan', '00-01': 'Allen Iverson', '99-00': "Shaquille O'Neal", '98-99': 'Karl Malone', '97-98': 'Michael Jordan', '96-97': 'Karl Malone', '95-96': 'Michael Jordan', '94-95': 'David Robinson', '93-94': 'Hakeem Olajuwon', '92-93': 'Charles Barkley', '91-92': 'Michael Jordan', '90-91': 'Michael Jordan', '89-90': 'Magic Johnson', '88-89': 'Magic Johnson', '87-88': 'Michael Jordan', '86-87': 'Magic Johnson', '85-86': 'Larry Bird', '84-85': '

In [8]:
# Map each season (e.g. '22-23') to its Defensive Player of the Year award winner
defensive_df = pd.read_csv(os.path.join(data_directory,'Defensive Player of the Year player stats.csv'),header=1)
season_defensive_dict = {}
for index,row in defensive_df.iterrows():
    season = row['Season']
    player = row['Player']
    season_short = season[2:]
    season_defensive_dict[season_short] = player
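The two lookup loops above follow the same pattern, so they could share a small helper. This is a sketch with a hypothetical function name, `winners_by_season`, and a two-row toy frame whose values come from the MVP table shown earlier.

```python
import pandas as pd

def winners_by_season(awards: pd.DataFrame) -> dict:
    """Map a short season label such as '22-23' to the winner's name."""
    # Drop the century from labels like '2022-23'
    short_season = awards["Season"].str[2:]
    return dict(zip(short_season, awards["Player"]))

toy_awards = pd.DataFrame({
    "Season": ["2022-23", "2021-22"],
    "Player": ["Joel Embiid", "Nikola Jokić"],
})

lookup = winners_by_season(toy_awards)
print(lookup)
```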

Now let's loop through all the season data we have and combine it.

In [9]:
# Read every season CSV and clean it up a little before concatenating into one big DataFrame
combine_clean_df = pd.DataFrame()
for csv_file in files:
    df = pd.read_csv(os.path.join(data_directory,csv_file),header=0)
    # Strip the literal asterisk that follows some player names
    df['Player'] = df['Player'].str.replace('*','',regex=False)
    df = df.drop(columns = ['Rk','Player-additional'])
    season = (csv_file[0:2] + csv_file[3:5])
    # add the season label (e.g. '0001') to the data frame
    df['Year'] = season
    # Using the agg function, we grab the first row for each unique player, which holds their
    # total stats across multiple teams, if any
    combined_df = df.groupby('Player').agg({
    'Age': 'first',
    'Pos':'first',
    'Tm':'first',
    'G':'first',
    'GS':'first',
    'MP':'first',
    'FG':'first',
    'FGA':'first',
    '3P':'first',
    '3PA':'first',
    '2P':'first',
    '2PA':'first',
    'FT':'first',
    'FTA':'first',
    'ORB':'first',
    'DRB':'first',
    'TRB':'first',
    'AST':'first',
    'STL':'first',
    'BLK':'first',
    'TOV':'first',
    'PF':'first',
    'PTS':'first',
    'Year':'first'
    }).reset_index()
    combined_df['FG%'] = combined_df['FG'] / combined_df['FGA']
    combined_df['3P%'] = combined_df['3P'] / combined_df['3PA']
    combined_df['2P%'] = combined_df['2P'] / combined_df['2PA']
    combined_df['FT%'] = combined_df['FT'] / combined_df['FTA']
    combined_df['eFG%'] = (combined_df['FG'] + 0.5 * combined_df['3P'])/ combined_df['FGA']
    
    season = csv_file[:5]
    
    #Creating a column of MVP and Defensive Player of the year, zeros and ones
    if season in season_mvp_dict:
        mvp_player = season_mvp_dict[season]
        combined_df['MVP'] = (combined_df['Player'] == mvp_player).astype(int)
    if season in season_defensive_dict:
        defensive_player = season_defensive_dict[season]
        combined_df['Defensive Player of the Year'] = (combined_df['Player'] == defensive_player).astype(int)
    combine_clean_df = pd.concat([combine_clean_df,combined_df])
    
mvp_count = (combine_clean_df['MVP'] == 1).sum()
defensive_count = (combine_clean_df['Defensive Player of the Year'] == 1).sum()

print(mvp_count)
print(defensive_count)

23
23
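The percentage columns computed in the loop can be checked by hand. Here is a small worked example using A.J. Guyton's 2000-01 totals from the combined table (FG = 78, FGA = 192, 3P = 27); the variable names are our own.

```python
# Season totals for one player
field_goals = 78
field_goal_attempts = 192
three_pointers = 27

# Plain field goal percentage: makes divided by attempts
fg_pct = field_goals / field_goal_attempts

# Effective field goal percentage gives each made three an extra half credit
efg_pct = (field_goals + 0.5 * three_pointers) / field_goal_attempts

print(f"FG%  = {fg_pct:.4f}")
print(f"eFG% = {efg_pct:.4f}")
```

Both values agree with the `FG%` and `eFG%` entries for this player in the combined table.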


In [10]:
pd.set_option('display.max_columns', None)


combine_clean_df

Unnamed: 0,Player,Age,Pos,Tm,G,GS,MP,FG,FGA,3P,3PA,2P,2PA,FT,FTA,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Year,FG%,3P%,2P%,FT%,eFG%,MVP,Defensive Player of the Year
0,A.C. Green,37,PF,MIA,82,1,1411,144,324,0,6,144,318,79,111,107,206,313,39,30,8,45,119,367,0001,0.444444,0.000000,0.452830,0.711712,0.444444,0,0
1,A.J. Guyton,22,PG,CHI,33,8,630,78,192,27,69,51,123,15,18,10,26,36,64,9,5,24,35,198,0001,0.406250,0.391304,0.414634,0.833333,0.476562,0,0
2,Aaron McKie,28,SG,PHI,76,33,2394,338,714,53,170,285,544,149,194,33,278,311,377,106,8,203,178,878,0001,0.473389,0.311765,0.523897,0.768041,0.510504,0,0
3,Aaron Williams,29,PF,NJN,82,25,2336,297,650,0,2,297,648,244,310,211,379,590,88,59,113,132,319,838,0001,0.456923,0.000000,0.458333,0.787097,0.456923,0,0
4,Adam Keefe,30,PF,GSW,67,13,836,64,159,1,3,63,156,39,63,90,119,209,36,28,20,40,102,168,0001,0.402516,0.333333,0.403846,0.619048,0.405660,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
534,Zach Collins,25,C,SAS,63,26,1441,284,548,55,147,229,401,108,142,116,286,402,180,37,49,129,199,731,2223,0.518248,0.374150,0.571072,0.760563,0.568431,0,0
535,Zach LaVine,27,SG,CHI,77,77,2768,673,1388,204,544,469,844,363,428,42,303,345,327,69,18,194,159,1913,2223,0.484870,0.375000,0.555687,0.848131,0.558357,0,0
536,Zeke Nnaji,22,PF,DEN,53,5,728,110,196,17,65,93,131,40,62,65,73,138,18,17,23,31,105,277,2223,0.561224,0.261538,0.709924,0.645161,0.604592,0,0
537,Ziaire Williams,21,SF,MEM,37,4,561,84,196,25,97,59,99,17,22,16,63,79,35,14,6,37,58,210,2223,0.428571,0.257732,0.595960,0.772727,0.492347,0,0


There are 23 MVPs and 23 Defensive Players of the Year, which is correct since we are only looking at the 23 seasons from 2000-01 to 2022-23.
Let's save this to a CSV file.

In [11]:
combine_clean_df.to_csv('combine_player_stat_w_MVP_defensive_player.csv',header=True,index=False)

## Now let's clean it

In [12]:
combine_clean_df

Unnamed: 0,Player,Age,Pos,Tm,G,GS,MP,FG,FGA,3P,3PA,2P,2PA,FT,FTA,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Year,FG%,3P%,2P%,FT%,eFG%,MVP,Defensive Player of the Year
0,A.C. Green,37,PF,MIA,82,1,1411,144,324,0,6,144,318,79,111,107,206,313,39,30,8,45,119,367,0001,0.444444,0.000000,0.452830,0.711712,0.444444,0,0
1,A.J. Guyton,22,PG,CHI,33,8,630,78,192,27,69,51,123,15,18,10,26,36,64,9,5,24,35,198,0001,0.406250,0.391304,0.414634,0.833333,0.476562,0,0
2,Aaron McKie,28,SG,PHI,76,33,2394,338,714,53,170,285,544,149,194,33,278,311,377,106,8,203,178,878,0001,0.473389,0.311765,0.523897,0.768041,0.510504,0,0
3,Aaron Williams,29,PF,NJN,82,25,2336,297,650,0,2,297,648,244,310,211,379,590,88,59,113,132,319,838,0001,0.456923,0.000000,0.458333,0.787097,0.456923,0,0
4,Adam Keefe,30,PF,GSW,67,13,836,64,159,1,3,63,156,39,63,90,119,209,36,28,20,40,102,168,0001,0.402516,0.333333,0.403846,0.619048,0.405660,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
534,Zach Collins,25,C,SAS,63,26,1441,284,548,55,147,229,401,108,142,116,286,402,180,37,49,129,199,731,2223,0.518248,0.374150,0.571072,0.760563,0.568431,0,0
535,Zach LaVine,27,SG,CHI,77,77,2768,673,1388,204,544,469,844,363,428,42,303,345,327,69,18,194,159,1913,2223,0.484870,0.375000,0.555687,0.848131,0.558357,0,0
536,Zeke Nnaji,22,PF,DEN,53,5,728,110,196,17,65,93,131,40,62,65,73,138,18,17,23,31,105,277,2223,0.561224,0.261538,0.709924,0.645161,0.604592,0,0
537,Ziaire Williams,21,SF,MEM,37,4,561,84,196,25,97,59,99,17,22,16,63,79,35,14,6,37,58,210,2223,0.428571,0.257732,0.595960,0.772727,0.492347,0,0


In [13]:
#Checking for NaN
combine_clean_df.isnull().sum()

Player                             0
Age                                0
Pos                                0
Tm                                 0
G                                  0
GS                                 0
MP                                 0
FG                                 0
FGA                                0
3P                                 0
3PA                                0
2P                                 0
2PA                                0
FT                                 0
FTA                                0
ORB                                0
DRB                                0
TRB                                0
AST                                0
STL                                0
BLK                                0
TOV                                0
PF                                 0
PTS                                0
Year                               0
FG%                               44
3P%                             1399
2

There are some null values; let's look into them.

In [14]:
combine_clean_df.shape

(11083, 32)

In [15]:
data = combine_clean_df[combine_clean_df['FG%'].isna()]
data

Unnamed: 0,Player,Age,Pos,Tm,G,GS,MP,FG,FGA,3P,3PA,2P,2PA,FT,FTA,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Year,FG%,3P%,2P%,FT%,eFG%,MVP,Defensive Player of the Year
15,Andy Panko,23,SF,ATL,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,,,,,,0,0
254,Lari Ketner,23,PF,IND,3,0,7,0,0,0,0,0,0,0,0,0,0,0,1,0,0,2,0,0,1,,,,,,0,0
120,Dickey Simpkins,29,PF,ATL,1,0,3,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,102,,,,,,0,0
162,Guy Rucker,25,PF,GSW,3,0,4,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,1,0,203,,,,,,0,0
196,Jelani McCoy,26,C,CLE,2,0,12,0,0,0,0,0,0,0,0,1,3,4,0,0,0,1,4,0,304,,,,,,0,0
310,Olden Polynice,39,C,LAC,2,0,12,0,0,0,0,0,0,0,0,1,1,2,1,1,0,4,2,0,304,,,,,,0,0
346,Pavel Podkolzin,20,C,DAL,5,0,10,0,0,0,0,0,0,1,2,0,2,2,0,0,0,2,4,1,405,,,,0.5,,0,0
10,Alex Scales,27,SG,SAS,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,506,,,,,,0,0
68,Bryon Russell,35,SF,DEN,1,0,3,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,506,,,,,,0,0
120,Deng Gai,23,C,PHI,2,0,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,506,,,,,,0,0


It looks like these values are NaN because the percentage is computed as zero divided by zero (the player had no attempts). Let's replace them with zeros.

In [16]:
# Replace NaNs with 0, then confirm that none remain
combine_clean_df.fillna(0, inplace=True)
combine_clean_df.isnull().sum()

Player                          0
Age                             0
Pos                             0
Tm                              0
G                               0
GS                              0
MP                              0
FG                              0
FGA                             0
3P                              0
3PA                             0
2P                              0
2PA                             0
FT                              0
FTA                             0
ORB                             0
DRB                             0
TRB                             0
AST                             0
STL                             0
BLK                             0
TOV                             0
PF                              0
PTS                             0
Year                            0
FG%                             0
3P%                             0
2P%                             0
FT%                             0
eFG%          
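The cell above fills every NaN in the frame. A narrower variant, sketched here on a toy frame with our own column subset, shows where the NaN comes from and fills only the percentage column, so that a missing value elsewhere would not be silently turned into zero.

```python
import pandas as pd

toy = pd.DataFrame({"FT": [0, 1, 15], "FTA": [0, 2, 18]})

# 0 / 0 yields NaN: a player with no attempts has an undefined percentage
toy["FT%"] = toy["FT"] / toy["FTA"]
undefined_rows = toy["FT%"].isna().tolist()

# Fill only the percentage column, leaving every other column untouched
toy["FT%"] = toy["FT%"].fillna(0)

print(undefined_rows)
print(toy)
```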

In [17]:
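# Give the column headers more descriptive names.
# Note: these frames were built from the season-total files, so columns labeled "per game" hold season totals.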
name_map = {'Rk':'Rank','Pos':'Position','Tm':'Team','G':'Games','GS':'Games started','MP':'Minutes played per game'
             ,'FG':'Field goals per game','FGA':'Field goals attempt per game','FG%':'Field goal percent','3P':'3 point field goal per game',
             '3PA':'3 point field goal attempt per game', '3P%':'3 point field goal percentage','2P':'2 point field goal per game', '2PA':'2 point field goal attempt per game',
             '2P%':'2 point field goal percentage', 'eFG%':'Effective field goal percentage', 'FT':'Free throws per game',
             'FTA':'Free throw attempt per game','FT%':'Free throw percentage','ORB':'Offensive rebounds per game','DRB':'Defensive rebounds per game',
             'TRB':'Total rebounds per game','AST':'Assist per game','STL':'Steals per game','BLK':'Blocks per game','TOV':'Turn overs per game',
             'PF':'Personal fouls per game','PTS':'Points per game'}

renamed_df = combine_clean_df.rename(columns=name_map)
renamed_df

Unnamed: 0,Player,Age,Position,Team,Games,Games started,Minutes played per game,Field goals per game,Field goals attempt per game,3 point field goal per game,3 point field goal attempt per game,2 point field goal per game,2 point field goal attempt per game,Free throws per game,Free throw attempt per game,Offensive rebounds per game,Defensive rebounds per game,Total rebounds per game,Assist per game,Steals per game,Blocks per game,Turn overs per game,Personal fouls per game,Points per game,Year,Field goal percent,3 point field goal percentage,2 point field goal percentage,Free throw percentage,Effective field goal percentage,MVP,Defensive Player of the Year
0,A.C. Green,37,PF,MIA,82,1,1411,144,324,0,6,144,318,79,111,107,206,313,39,30,8,45,119,367,0001,0.444444,0.000000,0.452830,0.711712,0.444444,0,0
1,A.J. Guyton,22,PG,CHI,33,8,630,78,192,27,69,51,123,15,18,10,26,36,64,9,5,24,35,198,0001,0.406250,0.391304,0.414634,0.833333,0.476562,0,0
2,Aaron McKie,28,SG,PHI,76,33,2394,338,714,53,170,285,544,149,194,33,278,311,377,106,8,203,178,878,0001,0.473389,0.311765,0.523897,0.768041,0.510504,0,0
3,Aaron Williams,29,PF,NJN,82,25,2336,297,650,0,2,297,648,244,310,211,379,590,88,59,113,132,319,838,0001,0.456923,0.000000,0.458333,0.787097,0.456923,0,0
4,Adam Keefe,30,PF,GSW,67,13,836,64,159,1,3,63,156,39,63,90,119,209,36,28,20,40,102,168,0001,0.402516,0.333333,0.403846,0.619048,0.405660,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
534,Zach Collins,25,C,SAS,63,26,1441,284,548,55,147,229,401,108,142,116,286,402,180,37,49,129,199,731,2223,0.518248,0.374150,0.571072,0.760563,0.568431,0,0
535,Zach LaVine,27,SG,CHI,77,77,2768,673,1388,204,544,469,844,363,428,42,303,345,327,69,18,194,159,1913,2223,0.484870,0.375000,0.555687,0.848131,0.558357,0,0
536,Zeke Nnaji,22,PF,DEN,53,5,728,110,196,17,65,93,131,40,62,65,73,138,18,17,23,31,105,277,2223,0.561224,0.261538,0.709924,0.645161,0.604592,0,0
537,Ziaire Williams,21,SF,MEM,37,4,561,84,196,25,97,59,99,17,22,16,63,79,35,14,6,37,58,210,2223,0.428571,0.257732,0.595960,0.772727,0.492347,0,0


In [18]:
#Rechecking MVP count and defensive count
mvp_count = (renamed_df['MVP'] == 1).sum()
mvp_count

23

In [19]:
defensive_count = (renamed_df['Defensive Player of the Year'] == 1).sum()
defensive_count

23

In [20]:
# Save the renamed data to CSV (this overwrites the file written earlier)
renamed_df.to_csv('combine_player_stat_w_MVP_defensive_player.csv',header=True,index=False)

## Modeling

In [21]:
# determine the average stats for each season
# create new player data data frame that only includes columns which we would like to get the seasonal averages of
player_stat_game_df2 = renamed_df.drop(columns = ['Player', 'Position', 'Team', 'MVP', 'Defensive Player of the Year', 'Age'])
# create new data frame that contains the seasonal averages of relevant statistics
average_stat_game_df = player_stat_game_df2.groupby('Year').mean()
# merge original data with season average data
merged_data = pd.merge(renamed_df, average_stat_game_df, on='Year', suffixes=('', '_avg'))
# divide each statistic by the corresponding season average
for stat in average_stat_game_df.columns:
    merged_data[stat] = merged_data[stat] / merged_data[f'{stat}_avg']
    merged_data.drop(columns=[f'{stat}_avg'], inplace=True)

player_average_game_df = merged_data
display(player_average_game_df)

Unnamed: 0,Player,Age,Position,Team,Games,Games started,Minutes played per game,Field goals per game,Field goals attempt per game,3 point field goal per game,3 point field goal attempt per game,2 point field goal per game,2 point field goal attempt per game,Free throws per game,Free throw attempt per game,Offensive rebounds per game,Defensive rebounds per game,Total rebounds per game,Assist per game,Steals per game,Blocks per game,Turn overs per game,Personal fouls per game,Points per game,Year,Field goal percent,3 point field goal percentage,2 point field goal percentage,Free throw percentage,Effective field goal percentage,MVP,Defensive Player of the Year
0,A.C. Green,37,PF,MIA,1.508510,0.037090,1.081198,0.748294,0.745492,0.000000,0.081173,0.865873,0.881628,0.788124,0.827756,1.654581,1.253584,1.366826,0.332117,0.711290,0.282127,0.579551,0.987505,0.717856,0001,1.051507,0.000000,1.040086,1.036604,0.990560,0,0
1,A.J. Guyton,22,PG,CHI,0.607083,0.296720,0.482746,0.405326,0.441773,1.033235,0.933491,0.306663,0.341007,0.149644,0.134231,0.154634,0.158219,0.157207,0.545012,0.213387,0.176329,0.309094,0.290443,0.387290,0001,0.961143,1.770052,0.952355,1.213746,1.062143,0,0
2,Aaron McKie,28,SG,PHI,1.398131,1.223970,1.834435,1.756413,1.642844,2.028202,2.299905,1.713707,1.508195,1.486461,1.446708,0.510291,1.691730,1.358092,3.210462,2.513226,0.282127,2.614421,1.477109,1.717377,0001,1.119988,1.410257,1.203317,1.118648,1.137791,0,0
3,Aaron Williams,29,PF,NJN,1.508510,0.927250,1.789992,1.543357,1.495586,0.000000,0.027058,1.785863,1.796526,2.434204,2.311751,3.262772,2.306352,2.576445,0.749392,1.398871,3.985046,1.700018,2.647178,1.639136,0001,1.081030,0.000000,1.052726,1.146402,1.018372,0,0
4,Adam Keefe,30,PF,GSW,1.232563,0.482170,0.640596,0.332575,0.365843,0.038268,0.040587,0.378819,0.432497,0.389074,0.469807,1.391704,0.724158,0.912673,0.306569,0.663871,0.705318,0.515157,0.846433,0.328610,0001,0.952309,1.507822,0.927577,0.901640,0.904120,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11078,Zach Collins,25,C,SAS,1.311385,1.139350,1.305734,1.482433,1.359789,0.976481,0.941401,1.693132,1.624446,1.286510,1.322334,2.435779,1.898705,2.027719,1.557828,1.112084,2.307041,2.104449,2.182364,1.396566,2223,1.118655,1.186575,1.077723,1.056479,1.068728,0,0
11079,Zach LaVine,27,SG,CHI,1.602804,3.374228,2.508169,3.512948,3.444137,3.621858,3.483823,3.467593,3.419033,4.324103,3.985626,0.881920,2.011566,1.740207,2.830055,2.073886,0.847484,3.164831,1.743698,3.654762,2223,1.046607,1.189272,1.048689,1.178117,1.049789,0,0
11080,Zeke Nnaji,22,PF,DEN,1.103229,0.219106,0.659663,0.574182,0.486348,0.301822,0.416266,0.687604,0.530679,0.476485,0.577357,1.364876,0.484635,0.696083,0.155783,0.510957,1.082897,0.505720,1.151499,0.529205,2223,1.211420,0.829441,1.339762,0.896177,1.136716,0,0
11081,Ziaire Williams,21,SF,MEM,0.770178,0.175285,0.508339,0.438466,0.486348,0.443855,0.621196,0.436222,0.401048,0.202506,0.204869,0.335969,0.418246,0.398482,0.302911,0.420788,0.282495,0.603602,0.636066,0.401202,2223,0.925085,0.817369,1.124690,1.073376,0.925680,0,0
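The same season-relative scaling can be written without a merge by using `groupby(...).transform('mean')`, which returns the season average aligned to each row. This is a sketch on toy data with made-up numbers, not the notebook's actual frame.

```python
import pandas as pd

toy = pd.DataFrame({
    "Year": ["0001", "0001", "0102", "0102"],
    "PTS": [100, 300, 50, 150],
    "AST": [10, 30, 40, 40],
})

stats = ["PTS", "AST"]

# One season average per row, aligned with the original index
season_avg = toy.groupby("Year")[stats].transform("mean")

# A value of 1.0 means "exactly the season average" for that statistic
relative = toy.copy()
relative[stats] = toy[stats] / season_avg

print(relative)
```

Dividing by the season average puts every season on a common scale, so a player is compared with his own season's peers rather than with raw totals from other seasons.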


In [22]:
# create new data frame with predictor data
player_average_game_df2 = player_average_game_df.drop(columns = ['Player', 'Position', 'Team', 'MVP', 'Defensive Player of the Year'])
# create data filters for splitting up testing and training data
train_years = ['0001', '0102', '0203', '0304', '0405', '0506', '0607', '0708', '0809', '0910', '1011', '1112', '1213', '1314',
              '1415', '1516', '1617', '1718', '1819', '1920', '2021']
test_year = ['2122']
# Filter both x_train and y_train based on train_years
x_train = player_average_game_df2[player_average_game_df2['Year'].isin(train_years)].to_numpy()
y_train = player_average_game_df[player_average_game_df['Year'].isin(train_years)]['MVP'].to_numpy()
# Filter both x_test and y_test based on test_year
x_test_first = player_average_game_df2[player_average_game_df2['Year'].isin(test_year)].to_numpy()
y_test_first = player_average_game_df[player_average_game_df['Year'].isin(test_year)]['MVP'].to_numpy()

# initialize old confusion matrix and model accuracy
confusion_matrix_old = [[0, 100],[0, 0]]
model_accuracy = []

# grid search over the MVP class weight (i) and C (j);
# note that each candidate is scored on the 2021-22 test season itself
for i in range(500, 1000, 100):
    for j in range(10, 100, 10):
        # Define class weights
        class_weights = {0: 1, 1: i}
        # create svm model
        svm_model = svm.SVC(kernel = 'rbf', C = j, gamma = 'scale', class_weight = class_weights)
        # train the SVM model with the training data 
        svm_model.fit(x_train, y_train)
        # get model predictions
        y_prediction = svm_model.predict(x_test_first)
        # determine the confusion matrix with the confusion_matrix function
        confusion_matrix = metrics.confusion_matrix(y_test_first, y_prediction)
        # determine ideal parameters
        if confusion_matrix[1][1] == 1 and confusion_matrix[0][1] < confusion_matrix_old[0][1]:
            best_weight = i
            best_c = j
            confusion_matrix_old = confusion_matrix

In [23]:
# display results from previous cell
print('Best Weight: ')
print(best_weight)
print('Best C Value: ')
print(best_c)

# develop svm model using the ideal parameters
ideal_class_weights = {0:1, 1:best_weight}

# develop the ideal svm model
svm_model_first = svm.SVC(kernel = 'rbf', C = best_c, gamma = 'scale', class_weight = ideal_class_weights)
# train the SVM model with the training data 
svm_model_first.fit(x_train, y_train)
# get model predictions
y_prediction = svm_model_first.predict(x_test_first)
# assess accuracy using the accuracy_score function
model_accuracy = metrics.accuracy_score(y_test_first, y_prediction)
# determine the confusion matrix with the confusion_matrix function
confusion_matrix = metrics.confusion_matrix(y_test_first, y_prediction)
print('Model Accuracy: ')
print(model_accuracy)
print('Confusion Matrix: ')
print(confusion_matrix)

# create placeholder data frame that will be used later
x_test1 = player_average_game_df2[player_average_game_df2['Year'].isin(test_year)]

# Create a DataFrame to store the predictions and corresponding players
predictions_df = player_average_game_df.loc[x_test1.index].copy()
predictions_df['First Prediction'] = y_prediction

# Filter the DataFrame to get the rows where the model predicted MVPs
# (.copy() makes this an independent DataFrame so columns can be added to it later)
predicted_mvp_df = predictions_df[predictions_df['First Prediction'] == 1].copy()

# Print or display the predicted MVP players
display(predicted_mvp_df)

Best Weight: 
500
Best C Value: 
90
Model Accuracy: 
0.9636363636363636
Confusion Matrix: 
[[582  22]
 [  0   1]]


Unnamed: 0,Player,Age,Position,Team,Games,Games started,Minutes played per game,Field goals per game,Field goals attempt per game,3 point field goal per game,3 point field goal attempt per game,2 point field goal per game,2 point field goal attempt per game,Free throws per game,Free throw attempt per game,Offensive rebounds per game,Defensive rebounds per game,Total rebounds per game,Assist per game,Steals per game,Blocks per game,Turn overs per game,Personal fouls per game,Points per game,Year,Field goal percent,3 point field goal percentage,2 point field goal percentage,Free throw percentage,Effective field goal percentage,MVP,Defensive Player of the Year,First Prediction
9963,Anthony Edwards,20,SG,MIN,1.672875,3.541463,2.512733,3.323777,3.475535,4.251095,4.208817,2.914527,2.988125,3.195141,3.149811,1.475494,2.025678,1.897766,2.743832,3.384029,2.40038,3.579101,2.053989,3.408357,2122,1.02089,1.263935,1.044558,1.164501,1.066549,0,0,1
10057,Darius Garland,22,PG,CLE,1.579938,3.344715,2.47605,3.281397,3.274541,3.440421,3.174091,3.211216,3.341309,3.122524,2.711088,0.928133,1.333631,1.239357,5.816924,2.868368,0.365275,4.671669,1.440297,3.274957,2122,1.069737,1.356363,1.029238,1.322198,1.084596,0,0,1
10071,DeMar DeRozan,32,PF,CHI,1.765813,3.738211,2.794982,4.68598,4.285098,0.988627,0.992777,6.317718,6.473496,7.552152,6.67085,1.332704,2.422163,2.168875,3.731612,2.191562,1.252372,3.409565,2.22933,4.709002,2122,1.16737,1.246133,1.045164,1.299642,1.052819,0,0,1
10076,Dejounte Murray,25,PG,SAS,1.579938,3.344715,2.410838,3.469078,3.464369,1.898163,2.055469,4.162364,4.400862,2.84658,2.778584,1.903863,3.481859,3.114992,6.255937,4.447582,1.20019,3.390728,1.728357,3.197141,2122,1.068953,1.155597,1.012897,1.176073,1.012128,0,0,1
10088,Devin Booker,25,SG,PHO,1.579938,3.344715,2.38944,4.007906,3.966856,3.618374,3.341885,4.179816,4.382273,4.574861,4.083505,1.070923,2.141019,1.892233,3.282621,2.481622,1.356736,3.051655,2.254378,4.050898,2122,1.078549,1.354896,1.021459,1.286113,1.072518,0,0,1
10098,Donovan Mitchell,25,SG,UTA,1.556703,3.295528,2.308943,3.735465,3.841234,4.587228,4.57237,3.35956,3.355251,3.87774,3.521039,1.308906,1.636402,1.560262,3.57197,3.190656,0.626186,3.767475,2.053989,3.853022,2122,1.038108,1.255431,1.07231,1.264277,1.07746,0,0,1
10145,Giannis Antetokounmpo,27,PF,MIL,1.556703,3.295528,2.245768,4.17137,3.475535,1.40385,1.691917,5.392748,4.661103,8.031423,8.616984,3.18897,4.642478,4.304553,3.871298,2.320477,4.748577,4.125385,2.655157,4.451096,2122,1.281226,1.038307,1.239038,1.06997,1.177022,0,0,1
10176,Ja Morant,22,PG,MEM,1.32436,2.803659,1.924798,3.511458,3.285707,1.739983,1.789796,4.293256,4.280036,4.589385,4.66847,1.832468,1.787787,1.798175,3.831387,2.127104,1.148008,3.692126,1.077092,3.47728,2122,1.140847,1.216537,1.074243,1.128533,1.072318,0,0,1
10197,James Harden,32,PG,TOT,1.510235,3.197154,2.465861,2.464075,2.769262,2.926335,3.132143,2.260068,2.528056,6.825984,6.029639,1.308906,3.207924,2.766422,6.65504,2.642766,1.878558,5.349815,1.916222,3.183801,2122,0.949858,1.16914,0.95741,1.299596,0.980729,0,0,1
10218,Jayson Tatum,23,SF,BOS,1.765813,3.738211,2.782755,4.2864,4.366054,4.547683,4.551395,4.17109,4.242858,5.809348,5.275934,2.022854,3.77742,3.369503,3.332509,2.417164,2.556926,4.087711,2.179232,4.548922,2122,1.048027,1.250344,1.05282,1.264044,1.064335,0,0,1
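
The accuracy printed above is high mainly because almost every player in the test season is a non-MVP. As a rough check, the same confusion matrix can be turned into precision and recall, which describe the MVP class directly. This is a minimal sketch in plain Python; the helper name `summarize` is hypothetical, and it assumes the scikit-learn layout in which rows are true labels and columns are predicted labels.

```python
def summarize(cm):
    """Compute accuracy, precision, and recall for the positive (MVP) class
    from a 2x2 confusion matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Confusion matrix printed by the first model above
first_stage = [[582, 22], [0, 1]]
accuracy, precision, recall = summarize(first_stage)
print(round(accuracy, 4), round(precision, 4), round(recall, 4))
# prints: 0.9636 0.0435 1.0
```

The accuracy matches the value reported above (583 of 605 players classified correctly), while the precision shows that only 1 of the 23 players flagged as MVP actually won the award.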


In [24]:
# this cell develops a second svm model, which is used to predict the MVP from the pool of previously predicted MVPs
# build new test data from the first model's predicted MVPs in an effort to decrease the number of predicted MVPs
x_test_second = predicted_mvp_df.drop(columns = ['First Prediction', 'Player', 'Position', 'Team', 'MVP', 'Defensive Player of the Year']).to_numpy()
y_test_second = predicted_mvp_df['MVP'].to_numpy()

# initialize old confusion matrix and model accuracy
confusion_matrix_old = [[0, 100],[0, 0]]
model_accuracy = []

# grid search over the MVP class weight (i) and the SVM regularization parameter C (j)
for i in range(10, 100, 10):
    for j in range(10, 100, 10):
        # Define class weights
        class_weights = {0: 1, 1: i}
        # create svm model
        svm_model = svm.SVC(kernel = 'rbf', C = j, gamma = 'scale', class_weight = class_weights)
        # train the SVM model with the training data
        svm_model.fit(x_train, y_train)
        # get model predictions
        y_prediction = svm_model.predict(x_test_second)
        # determine the confusion matrix with the confusion_matrix function
        confusion_matrix = metrics.confusion_matrix(y_test_second, y_prediction)
        # determine ideal parameters
        if confusion_matrix[1][1] == 1 and confusion_matrix[0][1] < confusion_matrix_old[0][1]:
            best_weight2 = i
            best_c2 = j
            confusion_matrix_old = confusion_matrix

In [25]:
# display results from previous cell
print('Best Weight: ')
print(best_weight2)
print('Best C Value: ')
print(best_c2)

# develop svm model using the ideal parameters
ideal_class_weights = {0:1, 1:best_weight2}

# develop the ideal svm model
svm_model_second = svm.SVC(kernel = 'rbf', C = best_c2, gamma = 'scale', class_weight = ideal_class_weights)
# train the SVM model with the training data
svm_model_second.fit(x_train, y_train)
# get model predictions
y_prediction = svm_model_second.predict(x_test_second)
# assess accuracy using the accuracy_score function
model_accuracy = metrics.accuracy_score(y_test_second, y_prediction)
# determine the confusion matrix with the confusion_matrix function
confusion_matrix = metrics.confusion_matrix(y_test_second, y_prediction)
print('Model Accuracy: ')
print(model_accuracy)
print('Confusion Matrix: ')
print(confusion_matrix)

# add final predicted winners to dataframe
predicted_mvp_df.loc[:, 'Second Prediction'] = y_prediction
display(predicted_mvp_df)

Best Weight: 
30
Best C Value: 
90
Model Accuracy: 
0.8695652173913043
Confusion Matrix: 
[[19  3]
 [ 0  1]]


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  predicted_mvp_df.loc[:, 'Second Prediction'] = y_prediction


Unnamed: 0,Player,Age,Position,Team,Games,Games started,Minutes played per game,Field goals per game,Field goals attempt per game,3 point field goal per game,3 point field goal attempt per game,2 point field goal per game,2 point field goal attempt per game,Free throws per game,Free throw attempt per game,Offensive rebounds per game,Defensive rebounds per game,Total rebounds per game,Assist per game,Steals per game,Blocks per game,Turn overs per game,Personal fouls per game,Points per game,Year,Field goal percent,3 point field goal percentage,2 point field goal percentage,Free throw percentage,Effective field goal percentage,MVP,Defensive Player of the Year,First Prediction,Second Prediction
9963,Anthony Edwards,20,SG,MIN,1.672875,3.541463,2.512733,3.323777,3.475535,4.251095,4.208817,2.914527,2.988125,3.195141,3.149811,1.475494,2.025678,1.897766,2.743832,3.384029,2.40038,3.579101,2.053989,3.408357,2122,1.02089,1.263935,1.044558,1.164501,1.066549,0,0,1,0
10057,Darius Garland,22,PG,CLE,1.579938,3.344715,2.47605,3.281397,3.274541,3.440421,3.174091,3.211216,3.341309,3.122524,2.711088,0.928133,1.333631,1.239357,5.816924,2.868368,0.365275,4.671669,1.440297,3.274957,2122,1.069737,1.356363,1.029238,1.322198,1.084596,0,0,1,0
10071,DeMar DeRozan,32,PF,CHI,1.765813,3.738211,2.794982,4.68598,4.285098,0.988627,0.992777,6.317718,6.473496,7.552152,6.67085,1.332704,2.422163,2.168875,3.731612,2.191562,1.252372,3.409565,2.22933,4.709002,2122,1.16737,1.246133,1.045164,1.299642,1.052819,0,0,1,0
10076,Dejounte Murray,25,PG,SAS,1.579938,3.344715,2.410838,3.469078,3.464369,1.898163,2.055469,4.162364,4.400862,2.84658,2.778584,1.903863,3.481859,3.114992,6.255937,4.447582,1.20019,3.390728,1.728357,3.197141,2122,1.068953,1.155597,1.012897,1.176073,1.012128,0,0,1,0
10088,Devin Booker,25,SG,PHO,1.579938,3.344715,2.38944,4.007906,3.966856,3.618374,3.341885,4.179816,4.382273,4.574861,4.083505,1.070923,2.141019,1.892233,3.282621,2.481622,1.356736,3.051655,2.254378,4.050898,2122,1.078549,1.354896,1.021459,1.286113,1.072518,0,0,1,0
10098,Donovan Mitchell,25,SG,UTA,1.556703,3.295528,2.308943,3.735465,3.841234,4.587228,4.57237,3.35956,3.355251,3.87774,3.521039,1.308906,1.636402,1.560262,3.57197,3.190656,0.626186,3.767475,2.053989,3.853022,2122,1.038108,1.255431,1.07231,1.264277,1.07746,0,0,1,0
10145,Giannis Antetokounmpo,27,PF,MIL,1.556703,3.295528,2.245768,4.17137,3.475535,1.40385,1.691917,5.392748,4.661103,8.031423,8.616984,3.18897,4.642478,4.304553,3.871298,2.320477,4.748577,4.125385,2.655157,4.451096,2122,1.281226,1.038307,1.239038,1.06997,1.177022,0,0,1,1
10176,Ja Morant,22,PG,MEM,1.32436,2.803659,1.924798,3.511458,3.285707,1.739983,1.789796,4.293256,4.280036,4.589385,4.66847,1.832468,1.787787,1.798175,3.831387,2.127104,1.148008,3.692126,1.077092,3.47728,2122,1.140847,1.216537,1.074243,1.128533,1.072318,0,0,1,0
10197,James Harden,32,PG,TOT,1.510235,3.197154,2.465861,2.464075,2.769262,2.926335,3.132143,2.260068,2.528056,6.825984,6.029639,1.308906,3.207924,2.766422,6.65504,2.642766,1.878558,5.349815,1.916222,3.183801,2122,0.949858,1.16914,0.95741,1.299596,0.980729,0,0,1,0
10218,Jayson Tatum,23,SF,BOS,1.765813,3.738211,2.782755,4.2864,4.366054,4.547683,4.551395,4.17109,4.242858,5.809348,5.275934,2.022854,3.77742,3.369503,3.332509,2.417164,2.556926,4.087711,2.179232,4.548922,2122,1.048027,1.250344,1.05282,1.264044,1.064335,0,0,1,0


In [27]:
years = ['0001', '0102', '0203', '0304', '0405', '0506', '0607', '0708', '0809', '0910', '1011', '1112', '1213', '1314','1415',
         '1516', '1617', '1718', '1819', '1920', '2021', '2122']

complete_predicted_mvp_df = pd.DataFrame()
complete_predicted_mvp_df = complete_predicted_mvp_df.reindex(columns = predicted_mvp_df.columns)

for year in years:
    # test model to see how it predicts the winner of each year
    x_test_final_first = player_average_game_df2[player_average_game_df2['Year'].isin([year])].to_numpy()
    y_test_final_first = player_average_game_df[player_average_game_df['Year'].isin([year])]['MVP'].to_numpy()
    
    # get the prediction from the first model
    y_prediction_first = svm_model_first.predict(x_test_final_first)
    
    # create placeholder data frame that will be used later
    x_test1 = player_average_game_df2[player_average_game_df2['Year'].isin([year])]
    
    # Create a DataFrame to store the predictions and corresponding players
    predictions_df = player_average_game_df.loc[x_test1.index].copy()
    predictions_df['First Prediction'] = y_prediction_first
    
    # Filter the DataFrame to get the rows where the model predicted MVPs
    # (.copy() makes this an independent DataFrame so columns can be added to it later)
    predicted_mvp_df = predictions_df[predictions_df['First Prediction'] == 1].copy()
    
    # create the second set of test data based on results from the first model
    x_test_final_second = predicted_mvp_df.drop(columns = ['First Prediction', 'Player', 'Position', 'Team', 'MVP', 'Defensive Player of the Year']).to_numpy()
    y_test_final_second = predicted_mvp_df['MVP'].to_numpy()
    
    # get the prediction from the second model
    y_prediction_second = svm_model_second.predict(x_test_final_second)
    
    # add final predicted winners to dataframe
    predicted_mvp_df.loc[:, 'Second Prediction'] = y_prediction_second
    
    # fill out data frame with all the predicted winners
    complete_predicted_mvp_df = pd.concat([complete_predicted_mvp_df, predicted_mvp_df], ignore_index = True)
    
display(complete_predicted_mvp_df)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  predicted_mvp_df.loc[:, 'Second Prediction'] = y_prediction_second

Unnamed: 0,Player,Age,Position,Team,Games,Games started,Minutes played per game,Field goals per game,Field goals attempt per game,3 point field goal per game,3 point field goal attempt per game,2 point field goal per game,2 point field goal attempt per game,Free throws per game,Free throw attempt per game,Offensive rebounds per game,Defensive rebounds per game,Total rebounds per game,Assist per game,Steals per game,Blocks per game,Turn overs per game,Personal fouls per game,Points per game,Year,Field goal percent,3 point field goal percentage,2 point field goal percentage,Free throw percentage,Effective field goal percentage,MVP,Defensive Player of the Year,First Prediction,Second Prediction
0,Allen Iverson,25.0,SG,PHI,1.306149,2.633389,2.282699,3.959724,4.171535,3.750260,4.139829,3.992637,4.178032,5.836105,5.361770,0.773169,1.357035,1.192152,2.767640,4.220323,0.705318,3.052304,1.219860,4.316913,0001,0.994379,1.448692,1.012019,1.185048,0.996980,1.0,0.0,1.0,0.0
1,Andre Miller,24.0,PG,CLE,1.508510,3.041379,2.182319,2.348813,2.298601,0.650555,0.865847,2.615658,2.592210,3.741093,3.355767,1.453557,1.618706,1.572068,5.594891,2.821452,0.987445,3.412914,1.900326,2.534989,0001,1.070453,1.201545,1.068591,1.213746,1.027371,0.0,0.0,1.0,0.0
2,Antawn Jamison,24.0,SF,GSW,1.508510,3.041379,2.600698,4.157191,4.169234,2.372614,2.773415,4.437600,4.455274,3.810926,3.982177,4.329745,2.647132,3.122302,1.396594,2.702903,0.987445,2.562905,1.867132,3.998084,0001,1.044544,1.368072,1.054810,1.041912,1.022130,0.0,0.0,1.0,0.0
3,Antoine Walker,24.0,PF,BOS,1.490113,3.004289,2.602231,3.694703,3.957551,8.457220,8.157898,2.946374,3.096789,2.484086,2.595127,2.334970,3.456485,3.139769,3.789538,3.271935,1.728029,3.876555,2.082890,3.700770,0001,0.977994,1.657854,1.007574,1.042147,1.064492,0.0,0.0,1.0,0.0
4,Chris Webber,27.0,PF,SAC,1.287752,2.596299,2.173123,4.084440,3.761974,0.076536,0.378808,4.714198,4.455274,3.232304,3.437797,2.767944,3.639046,3.393047,2.503650,2.205000,4.161375,2.511390,1.875430,3.712506,0001,1.137364,0.323105,1.120557,1.023654,1.072804,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
471,Pascal Siakam,27.0,PF,TOR,1.579938,3.344715,2.626855,3.608326,3.369455,1.482940,1.524123,4.546313,4.596043,4.124637,4.263495,3.046180,3.258385,3.209050,3.591926,2.739452,2.191651,3.409565,2.817973,3.448377,2122,1.143180,1.217552,1.059347,1.110591,1.061585,0.0,0.0,1.0,0.0
472,Russell Westbrook,33.0,PG,LAL,1.812282,3.836585,2.728750,3.317722,3.442036,1.562030,1.852719,4.092555,4.498452,3.863216,4.488481,2.617811,3.388144,3.209050,5.487664,2.417164,1.043643,5.557026,2.943216,3.203811,2122,1.028948,1.055028,0.974304,0.988061,0.963741,0.0,0.0,1.0,0.0
473,Trae Young,23.0,PG,ATL,1.765813,3.738211,2.702258,4.304563,4.310222,4.607000,4.264748,4.171090,4.340449,7.261685,6.220877,1.189914,1.686863,1.571328,7.353470,2.320477,0.365275,5.707725,1.603113,4.791265,2122,1.066100,1.351788,1.029148,1.340047,1.084017,0.0,0.0,1.0,1.0
474,Tyrese Haliburton,21.0,SG-PG,TOT,1.789047,3.787398,2.745053,2.603322,2.537560,3.183378,2.719651,2.347329,2.416524,2.323739,2.137372,1.451695,1.802204,1.720715,6.265915,4.318666,2.556926,3.748638,1.515443,2.625746,2122,1.095167,1.464735,1.040270,1.248078,1.135919,0.0,0.0,1.0,0.0


In [29]:
# determine how well the two models are able to predict the winner of the MVP award
predicted_mvp_winners = complete_predicted_mvp_df[complete_predicted_mvp_df['Second Prediction'] == 1]
display(predicted_mvp_winners)
num_true_winners = len(years)
pred_true_winners = predicted_mvp_winners['MVP'].sum()
percent_picked_true_winners = (pred_true_winners/num_true_winners)*100
print('Percentage that model predicts the true winner: ')
print(percent_picked_true_winners)

Unnamed: 0,Player,Age,Position,Team,Games,Games started,Minutes played per game,Field goals per game,Field goals attempt per game,3 point field goal per game,3 point field goal attempt per game,2 point field goal per game,2 point field goal attempt per game,Free throws per game,Free throw attempt per game,Offensive rebounds per game,Defensive rebounds per game,Total rebounds per game,Assist per game,Steals per game,Blocks per game,Turn overs per game,Personal fouls per game,Points per game,Year,Field goal percent,3 point field goal percentage,2 point field goal percentage,Free throw percentage,Effective field goal percentage,MVP,Defensive Player of the Year,First Prediction,Second Prediction
189,Kobe Bryant,27.0,SG,LAL,1.470541,2.978862,2.520109,5.080462,5.121756,5.85262,6.034747,4.933639,4.890195,6.61029,5.797558,1.18286,2.213013,1.931933,3.251366,3.818398,1.188581,3.379974,1.906084,5.435177,506,1.03001,1.561496,1.059061,1.226849,1.053448,0.0,0.0,1.0,1.0
191,LeBron James,21.0,SF,CLE,1.452159,2.941626,2.584707,4.545403,4.296807,4.129348,4.415384,4.624514,4.266732,5.708023,5.762164,1.2495,3.006948,2.527423,4.705449,3.194986,2.614879,3.515173,1.480692,4.75578,506,1.098459,1.505783,1.137759,1.065898,1.103441,0.0,0.0,1.0,1.0
265,Kobe Bryant,29.0,SG,LAL,1.483417,3.0,2.417832,3.804656,3.793377,4.186306,4.192484,3.723193,3.679371,6.048674,5.44139,1.535557,2.514133,2.253072,3.709138,3.796938,1.545064,3.474121,1.975593,4.252831,708,1.038463,1.602826,1.059683,1.193425,1.05403,1.0,0.0,1.0,1.0
288,Dwyane Wade,27.0,SG,MIA,1.42429,2.851707,2.275782,4.152714,3.878966,2.389432,2.768589,4.537382,4.199445,5.584668,5.624871,1.455094,1.843314,1.739531,5.067648,4.295252,3.982737,3.654653,1.526775,4.308628,809,1.101194,1.289238,1.127064,1.077614,1.070165,0.0,0.0,1.0,1.0
292,LeBron James,24.0,SF,CLE,1.460348,2.923902,2.280262,3.836641,3.597914,3.584149,3.824238,3.891723,3.532592,5.62253,5.559211,1.733034,3.024466,2.679227,5.050441,3.401443,3.494288,3.23813,1.192257,4.160553,809,1.096853,1.400032,1.149168,1.097734,1.098515,1.0,0.0,1.0,1.0
300,Kevin Durant,21.0,SF,OKC,1.460686,2.946667,2.408106,3.784622,3.668141,3.575781,3.476805,3.827586,3.722742,7.293029,6.149362,1.721758,3.025117,2.682833,1.953806,2.787544,3.107726,3.591018,1.473075,4.421789,910,1.054401,1.563259,1.058359,1.277088,1.055714,0.0,0.0,1.0,1.0
302,LeBron James,25.0,SF,CLE,1.353806,2.731057,2.205138,3.660692,3.360263,3.603716,3.833401,3.672414,3.225246,5.72059,5.658877,1.164237,2.820717,2.385698,5.506181,3.111099,2.848749,3.458508,1.025122,4.038996,910,1.113318,1.428916,1.172087,1.088563,1.118191,1.0,0.0,1.0,1.0
315,LeBron James,26.0,SF,MIA,1.419632,2.903089,2.326612,3.73937,3.359628,2.617651,2.845847,3.974649,3.506062,4.968444,4.99985,1.347042,3.074748,2.619235,4.734774,3.109804,1.888685,3.839789,1.44596,3.896265,1011,1.1555,1.425688,1.188596,1.085095,1.134129,0.0,0.0,1.0,1.0
323,Kevin Durant,23.0,SF,OKC,1.5198,3.186667,2.540361,4.25592,3.845347,5.008587,4.517983,4.095422,3.649235,6.156223,5.384916,0.849325,3.816101,3.016369,2.658561,2.767368,3.649217,4.291962,1.640873,4.639705,1112,1.156864,1.684365,1.176025,1.236904,1.176791,0.0,0.0,1.0,1.0
326,LeBron James,27.0,SF,MIA,1.427691,2.993535,2.320848,4.110305,3.465852,2.033562,1.956917,4.553146,3.905792,5.527745,5.395665,1.995913,3.118703,2.816041,4.453952,3.616447,2.369621,3.686242,1.18439,4.220878,1112,1.23962,1.578885,1.221581,1.108419,1.19247,1.0,0.0,1.0,1.0


Percentage that model predicts the true winner: 
50.0
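
The percentage above counts every true MVP that the second model flags, but the model can flag more than one player in the same season (in the rows displayed above, two players are flagged for 2005-06). A per-season breakdown makes that visible. The sketch below is illustrative only: it uses plain Python, the `rows` list is typed in by hand from the rows visible in the table above (a subset of the full results, so it will not reproduce the 50.0 figure), and the season labels follow the format of the `years` list.

```python
from collections import defaultdict

# (season, player, true MVP flag) copied from the rows displayed above
rows = [
    ("0506", "Kobe Bryant", 0),
    ("0506", "LeBron James", 0),
    ("0708", "Kobe Bryant", 1),
    ("0809", "Dwyane Wade", 0),
    ("0809", "LeBron James", 1),
    ("0910", "Kevin Durant", 0),
    ("0910", "LeBron James", 1),
    ("1011", "LeBron James", 0),
    ("1112", "Kevin Durant", 0),
    ("1112", "LeBron James", 1),
]

# group the flagged players by season
picks = defaultdict(list)
for year, player, is_mvp in rows:
    picks[year].append((player, is_mvp))

# report how many players were flagged per season and whether the true MVP is among them
for year in sorted(picks):
    flagged = picks[year]
    hit = any(is_mvp for _, is_mvp in flagged)
    print(year, "flagged:", len(flagged), "true MVP among them:", hit)

# seasons where the true MVP was flagged, and seasons where it was the only pick
seasons_with_hit = sum(1 for f in picks.values() if any(m for _, m in f))
unique_hits = sum(1 for f in picks.values() if len(f) == 1 and f[0][1] == 1)
print("seasons with the true MVP flagged:", seasons_with_hit)
print("seasons where the true MVP was the only pick:", unique_hits)
```

Reporting both counts separates "the true MVP was among the flagged players" from "the model picked the MVP and nobody else", which the single percentage above does not distinguish.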
