# NBA Awards Predictor - Part 1

The following notebook is Part 1 of the NBA awards predictor project. It covers extensive **data cleaning and manipulation** with Pandas. The goal is to take the raw, disconnected data collected through web scraping (see webscraping.py) and **construct a dataset** for each NBA award. Constructing these datasets involves concatenation, merging, and **feature engineering** specific to each award. Part 2 will analyze these datasets in depth through *exploratory data analysis*, in preparation for model building.

Below is a more detailed table of contents for this notebook:

**Table of Contents**
1. [Constructing Datasets]
    - [Constructing Datasets for Separate Awards]
    - [Inspecting Datasets]
2. [Feature Engineering]
    - [MVP]
    - [DPOY]
    - [MIP]
    - [SMOY]
    - [ROY]
3. [Saving the Datasets]

## Importing Necessary Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from itertools import product
import requests as rq 
from bs4 import BeautifulSoup 
%matplotlib inline

## 1. Constructing Datasets

Utilizing the several datasets retrieved from webscraping.py, I will concatenate the per 36 min. data and advanced datasets to create an aggregated dataframe of all the data aligned with each respective player.

In [4]:
#specifying features to avoid overlapping data and filter for relevant data (focusing on per 36 min. data, not per game data)
adv_features = ["Player", "PER", "TS%", "USG%", "OWS", "DWS", "WS", "WS/48", "OBPM", "DBPM", "BPM", "VORP"]
per_features = ["Player", "Pos", "Age", "Tm", "G", "GS", "MP", "FG", "FGA", "FG%", "3P", "3PA", "3P%", 
                       "2P", "2PA", "2P%", "FT", "FTA", "FT%", "ORB", "DRB", "TRB", "AST", "STL", "BLK", "TOV", 
                       "PF", "PTS", "Year"]
mvp_features = ["Player", "Share"]

data = pd.DataFrame()
for year in range(2000, 2023):
    #reading in data from files
    adv = pd.read_csv("../data/player_data/advanced_{}".format(year))[adv_features]
    per = pd.read_csv("../data/player_data/per_minute_{}".format(year))[per_features]
    
    #merging datasets
    merge_method = "left" if len(per) > len(adv) else "right"
    df = pd.merge(per, adv, on = "Player", how = merge_method)
    data = pd.concat([data, df], axis = 0)
data

Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,...,TS%,USG%,OWS,DWS,WS,WS/48,OBPM,DBPM,BPM,VORP
0,Tariq Abdul-Wahad,SG,25,TOT,61,56,1578,6.3,14.7,0.424,...,0.477,22.5,0.4,1.8,2.2,0.068,-1.2,-0.1,-1.2,0.3
1,Shareef Abdur-Rahim,SF,23,VAN,82,82,3223,6.6,14.3,0.465,...,0.547,25.0,6.2,2.6,8.8,0.132,2.6,-0.4,2.2,3.4
2,Ray Allen*,SG,24,MIL,82,82,3070,7.5,16.5,0.455,...,0.570,25.6,9.0,1.0,10.1,0.157,4.7,-1.1,3.6,4.3
3,John Amaechi,C,29,ORL,80,53,1684,6.5,15.0,0.437,...,0.505,24.1,0.6,1.8,2.4,0.067,-1.8,-0.8,-2.5,-0.2
4,Derek Anderson,SG,25,LAC,64,58,2201,6.2,14.1,0.438,...,0.542,23.4,3.1,0.3,3.3,0.073,1.2,-1.2,-0.1,1.1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
342,Delon Wright,SG,29,ATL,77,8,1452,3.0,6.7,0.454,...,0.576,10.3,2.2,1.5,3.6,0.121,-0.2,2.4,2.2,1.6
343,Thaddeus Young,PF,33,TOT,52,1,845,6.0,11.6,0.518,...,0.548,17.4,0.9,1.3,2.2,0.126,0.1,2.1,2.2,0.9
344,Trae Young,PG,23,ATL,76,76,2652,9.7,21.0,0.460,...,0.603,34.4,9.0,1.0,10.0,0.181,7.1,-2.0,5.2,4.8
345,Omer Yurtseven,C,23,MIA,56,12,706,6.6,12.6,0.526,...,0.546,19.9,0.8,1.4,2.1,0.145,-1.4,0.4,-1.0,0.2
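
The merge above joins on `"Player"` alone, so it assumes each name appears once per season file. The output suggests that holds here, but the following is a minimal sketch (toy frames with made-up values, not the scraped data) of what happens when it does not, and how the `validate` argument of `pd.merge` can surface the problem early.

```python
import pandas as pd

# Toy stand-ins for one season of per-36 and advanced data (values are made up).
per = pd.DataFrame({"Player": ["A", "B", "B"], "Tm": ["X", "Y", "Z"], "PTS": [20.1, 15.3, 12.0]})
adv = pd.DataFrame({"Player": ["A", "B", "B"], "PER": [22.0, 14.5, 13.0]})

# Merging on "Player" alone pairs every "B" row on the left with every "B" row on the right.
merged = pd.merge(per, adv, on="Player", how="left")
print(len(merged))  # 1 row for A plus 2 x 2 rows for B

# validate="one_to_one" makes pandas raise instead of silently duplicating rows.
try:
    pd.merge(per, adv, on="Player", how="left", validate="one_to_one")
    duplicates_detected = False
except pd.errors.MergeError:
    duplicates_detected = True
print(duplicates_detected)
```

If the check ever fails on real data, adding a second key such as `"Tm"` to both feature lists would be one possible remedy.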


In [5]:
#removing the asterisks in player name to prevent problems when merging datasets
data["Player"] = data["Player"].apply(lambda x:x.strip("*"))

### Constructing Datasets for Separate Awards

For the Sixth Man of the Year and Rookie of the Year awards, I also filter out players who do not qualify for the award.

In [8]:
#mvp
mvp = pd.read_csv("../data/awards_data/mvp_data")
mvp_df = pd.merge(data, mvp, on = ["Player", "Year"], how = "left") #merges MVP award voting share data

#dpoy
dpoy = pd.read_csv("../data/awards_data/dpoy_data")
dpoy_df = pd.merge(data, dpoy, on = ["Player", "Year"], how = "left") #merges DPOY award voting share data

#mip
mip = pd.read_csv("../data/awards_data/mip_data")
mip_df = pd.merge(data, mip, on = ["Player", "Year"], how = "left") #merges MIP award voting share data

#smoy
qualified_smoy = pd.DataFrame() 
for year in range(2000, 2023): #finds the players that are qualified for the SMOY award (Started less than half of games)
    x = pd.read_csv("../data/player_data/per_minute_{}".format(str(year)))
    x["Year"] = year
    x = x[x["GS"] < (x["G"] / 2)].reset_index()[["Player", "Year"]]
    qualified_smoy = pd.concat([qualified_smoy, x], axis = 0)
    
qualified_for_smoy = pd.merge(data, qualified_smoy, on = ["Player", "Year"]) #filters out data with qualified players
smoy_data = pd.read_csv("../data/awards_data/smoy_data")
smoy_df = pd.merge(qualified_for_smoy, smoy_data, on = ["Player", "Year"], how = "left") #merges SMOY award voting share data 

#roy
from io import StringIO #newer pandas versions deprecate passing literal HTML strings to read_html

qualified_rookies = pd.DataFrame()
for year in range(2000, 2023): #finds the players that are qualified for ROY award (first year in league)
    url = "https://www.basketball-reference.com/leagues/NBA_{}_rookies-season-stats.html".format(year)
    parse = BeautifulSoup(rq.get(url).text, "html.parser")
    parse.find("tr", class_ = "over_header").decompose() #drops the grouped header row so the table has a single header
    rookie_data = parse.find_all(id = "rookies")
    x = pd.read_html(StringIO(str(rookie_data)))[0]
    x["Year"] = year
    x = x[["Player", "Year"]]
    qualified_rookies = pd.concat([qualified_rookies, x], axis = 0)

qualified_for_roy = pd.merge(data, qualified_rookies, on = ["Player", "Year"]) #filters out data with qualified players
roy_data = pd.read_csv("../data/awards_data/roy_data")
roy_df = pd.merge(qualified_for_roy, roy_data, on = ["Player", "Year"], how = "left") #merges ROY award voting share data
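
Both eligibility filters above follow the same pattern: build a frame of qualifying `(Player, Year)` pairs, then inner-merge it against the full data. Here is a minimal sketch of that pattern on toy data (the rows are invented for illustration), using the SMOY rule of starting fewer than half of the games played.

```python
import pandas as pd

# Toy season rows; values are made up.
stats = pd.DataFrame({
    "Player": ["A", "B", "C", "A"],
    "Year": [2021, 2021, 2021, 2022],
    "G": [70, 60, 82, 75],
    "GS": [10, 45, 41, 60],
})

# SMOY-style rule: started strictly fewer than half of games played.
eligible_keys = stats.loc[stats["GS"] < stats["G"] / 2, ["Player", "Year"]]

# The default inner merge keeps only rows whose (Player, Year) pair is eligible.
eligible = pd.merge(stats, eligible_keys, on=["Player", "Year"])
print(eligible)
```

Player C starts exactly half of the games (41 of 82) and is excluded because the comparison is strict, and player A qualifies in 2021 but not in 2022 because eligibility is keyed on the season as well as the name.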

### Inspecting Datasets

Validating that the constructed datasets include all necessary data and are accurate.

In [14]:
print("MVP: ", mvp_df.shape)
print("DPOY: ", dpoy_df.shape)
print("MIP: ", mip_df.shape)
print("SMOY: ", smoy_df.shape)
print("ROY: ", roy_df.shape)

MVP:  (7423, 41)
DPOY:  (7423, 41)
MIP:  (7423, 41)
SMOY:  (3835, 41)
ROY:  (793, 41)


In [15]:
#checking accuracy by inspecting specific player data
columns = ["Player", "Year", "Share"]
print("MVP:")
print((mvp_df[(mvp_df["Player"] == "Giannis Antetokounmpo") & (mvp_df["Year"] > 2018)][columns]), "\n")
print("DPOY:")
print((dpoy_df[(dpoy_df["Player"] == "Rudy Gobert") & (dpoy_df["Year"] > 2018)][columns]), "\n")
print("MIP:")
print((mip_df[(mip_df["Player"] == "Pascal Siakam") & (mip_df["Year"] > 2018)][columns]), "\n")
print("SMOY:")
print((smoy_df[(smoy_df["Player"] == "Lou Williams") & (smoy_df["Year"] > 2018)][columns]), "\n")
print("ROY:")
print((roy_df[(roy_df["Player"] == "LaMelo Ball") & (roy_df["Year"] > 2018)][columns]), "\n")

MVP:
                     Player  Year  Share
6120  Giannis Antetokounmpo  2019  0.932
6468  Giannis Antetokounmpo  2020  0.952
6774  Giannis Antetokounmpo  2021  0.345
7085  Giannis Antetokounmpo  2022  0.595 

DPOY:
           Player  Year  Share
6234  Rudy Gobert  2019  0.822
6572  Rudy Gobert  2020  0.374
6869  Rudy Gobert  2021  0.928
7189  Rudy Gobert  2022  0.272 

MIP:
             Player  Year  Share
6412  Pascal Siakam  2019  0.938
6726  Pascal Siakam  2020  0.026
7025  Pascal Siakam  2021    NaN
7365  Pascal Siakam  2022    NaN 

SMOY:
            Player  Year  Share
3330  Lou Williams  2019  0.978
3485  Lou Williams  2020  0.254
3641  Lou Williams  2021    NaN
3829  Lou Williams  2022    NaN 

ROY:
          Player  Year  Share
728  LaMelo Ball  2021  0.939 
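
Beyond spot-checking individual players, two automated checks can help here. The sketch below uses toy frames (the names `players` and `votes` are placeholders, not objects from this notebook) and assumes the voting data has unique `(Player, Year)` keys.

```python
import pandas as pd

# Toy stand-ins for the player data and one award's voting data.
players = pd.DataFrame({"Player": ["A", "B", "C"], "Year": [2021, 2021, 2022]})
votes = pd.DataFrame({"Player": ["A", "C"], "Year": [2021, 2022], "Share": [0.9, 0.4]})

merged = pd.merge(players, votes, on=["Player", "Year"], how="left")

# A left merge against unique (Player, Year) keys should not add rows.
rows_preserved = len(merged) == len(players)

# Vote rows with no matching player would hint at a name mismatch between sources.
check = pd.merge(votes, players, on=["Player", "Year"], how="left", indicator=True)
unmatched = check[check["_merge"] == "left_only"]
print(rows_preserved, len(unmatched))
```

A non-empty `unmatched` frame on the real data would point to spelling differences (accents, suffixes such as "Jr.") that silently turn a vote share into `NaN`.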



## 2. Feature Engineering / Creating New Features

Before proceeding to the exploratory data analysis portion of this project, I will engineer additional features for each dataset based on domain knowledge. The EDA in Part 2 will further explore the potential significance of these features to the model, as well as multicollinearity. These features are engineered from new data collected with webscraping.py and from the existing data.

**NOTE:** Usually feature engineering should come after exploratory data analysis; however, I'd like to explore how these new variables interact with other features and the target variables, which will better support feature selection. For example, through domain knowledge, I already know DPOY and MIP will rely on completely different features than the existing variables, and these should be explored before building the model. Additionally, the features added below are not derived solely from the existing raw data, but also from additional web scraping.

### Most Valuable Player

The new features are the following:
1. Team Seed (Seed)
2. Team Win Percentage (WL)
3. Scoring Standing (Utilizing points *per 36 minutes*) (SS)
4. Minutes Played Per Game (MPG)

In [39]:
mvp_df["Tm"].unique()

array(['TOT', 'VAN', 'MIL', 'ORL', 'LAC', 'BOS', 'SAC', 'HOU', 'CHI',
       'POR', 'WAS', 'MIN', 'PHO', 'SEA', 'IND', 'GSW', 'TOR', 'ATL',
       'DEN', 'DAL', 'MIA', 'LAL', 'CLE', 'DET', 'NJN', 'NYK', 'CHH',
       'SAS', 'UTA', 'PHI', 'MEM', 'NOH', 'CHA', 'NOK', 'OKC', 'BRK',
       'NOP', 'CHO'], dtype=object)

In [16]:
#Dictionary to translate team abbreviations to full team name in order to allow for datasets to interact with one another
abbreviations = {"VAN": "Vancouver Grizzlies", "MIL": "Milwaukee Bucks", "ORL": "Orlando Magic",
                "LAC": "Los Angeles Clippers", "BOS": "Boston Celtics", "SAC": "Sacramento Kings",
                "HOU": "Houston Rockets", "CHI": "Chicago Bulls", "POR": "Portland Trail Blazers",
                "WAS": "Washington Wizards", "MIN": "Minnesota Timberwolves", "PHO": "Phoenix Suns",
                "SEA": "Seattle Supersonics", "IND": "Indiana Pacers", "GSW": "Golden State Warriors", 
                "TOR": "Toronto Raptors", "ATL": "Atlanta Hawks", "DEN": "Denver Nuggets",
                "DAL": "Dallas Mavericks", "MIA": "Miami Heat", "LAL": "Los Angeles Lakers",
                "CLE": "Cleveland Cavaliers", "DET": "Detroit Pistons", "NJN": "New Jersey Nets",
                "NYK": "New York Knicks", "CHH": "Charlotte Hornets", "SAS": "San Antonio Spurs",
                "UTA": "Utah Jazz", "PHI": "Philadelphia 76ers", "MEM": "Memphis Grizzlies",
                "NOH": "New Orleans Hornets", "CHA": "Charlotte Bobcats", "NOK": "New Orleans/Oklahoma City Hornets",
                "OKC": "Oklahoma City Thunder", "BRK": "Brooklyn Nets", "NOP": "New Orleans Pelicans",
                "CHO": "Charlotte Hornets"} 

#functions that return desired feature for a player
def set_seed(df):
    if df["Tm"] == "TOT":
        return np.nan
    else: 
        teams = pd.read_csv("../data/team_data/{}".format(df["Year"]))
        tm = abbreviations[df["Tm"]]
        return teams[teams["Team"] == tm]["Seed"].values[0]

def set_wr(df):
    if df["Tm"] == "TOT":
        return np.nan
    else: 
        teams = pd.read_csv("../data/team_data/{}".format(df["Year"]))
        tm = abbreviations[df["Tm"]]
        return teams[teams["Team"] == tm]["Pct"].values[0]

def set_ss(df):
    players = data[data["Year"] == df["Year"]]
    sorted_ = players.sort_values("PTS", ascending = False).reset_index()
    value = sorted_[sorted_["Player"] == df["Player"]].index + 1
    return value[0]

def set_mpg(df):
    return df["MP"] / df["G"]

#applying the functions to each player in the dataset and merging information into aggregated dataset
mvp_df["Seed"] = list(data[["Year", "Tm"]].apply(set_seed, axis = 1))
mvp_df["WL"] = list(data[["Year", "Tm"]].apply(set_wr, axis = 1))
mvp_df["SS"] = list(data.apply(set_ss, axis = 1))
mvp_df["MPG"] = list(data.apply(set_mpg, axis = 1))
mvp_df[["Player", "G", "MP", "MPG", "Tm", "Seed", "WL", "PTS", "SS"]].tail(5)

Unnamed: 0,Player,G,MP,MPG,Tm,Seed,WL,PTS,SS
7418,Delon Wright,77,1452,18.857143,ATL,8.0,0.524,8.5,343
7419,Thaddeus Young,52,845,16.25,TOT,,,13.7,232
7420,Trae Young,76,2652,34.894737,ATL,8.0,0.524,29.3,4
7421,Omer Yurtseven,56,706,12.607143,MIA,1.0,0.646,15.2,175
7422,Ivica Zubac,76,1852,24.368421,LAC,9.0,0.512,15.3,172
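
The row-wise functions above re-read a team file for every player, which works but is slow. A possible alternative is sketched below: a grouped rank for SS, column arithmetic for MPG, and a single merge for Seed and WL. The `teams` frame here is hypothetical (keyed directly on team abbreviation, with made-up values); the real team files use full team names and would first need the `abbreviations` mapping.

```python
import pandas as pd

# Toy player rows; values are made up.
players = pd.DataFrame({
    "Player": ["A", "B", "C", "D"],
    "Year": [2021, 2021, 2021, 2022],
    "Tm": ["ATL", "MIA", "TOT", "ATL"],
    "G": [70, 60, 50, 80],
    "MP": [2100, 1500, 1000, 2400],
    "PTS": [25.0, 18.0, 21.0, 30.0],
})
# Hypothetical team table keyed on abbreviation (an assumption for this sketch).
teams = pd.DataFrame({
    "Year": [2021, 2021, 2022],
    "Tm": ["ATL", "MIA", "ATL"],
    "Seed": [5, 6, 8],
    "Pct": [0.569, 0.556, 0.524],
})

# Scoring standing: rank per-36 points within each season (1 = highest).
players["SS"] = players.groupby("Year")["PTS"].rank(method="first", ascending=False).astype(int)
players["MPG"] = players["MP"] / players["G"]

# A left merge leaves Seed/WL as NaN for "TOT" rows, like set_seed and set_wr do.
players = players.merge(teams.rename(columns={"Pct": "WL"}), on=["Year", "Tm"], how="left")
print(players[["Player", "Year", "Seed", "WL", "SS", "MPG"]])
```

Ties in PTS may be ordered differently than in `set_ss`, since `rank(method="first")` breaks ties by row order while `sort_values` with its default sort makes no such guarantee.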


### Defensive Player of the Year

The following features will be added:
1. STL%
2. BLK%
3. Defensive Rating
4. OPP Points off TOV
5. OPP Points off 2nd chance
6. OPP Points FB
7. OPP Points Paint

This data will be retrieved through additional web scraping (see the function defensive_stats in webscraping.py).

In [17]:
dpoy_df_temp = pd.DataFrame()
for year in range(2000, 2023):
    temp = pd.read_csv("../data/player_data/defensive_{}".format(year))
    temp.drop(['Unnamed: 0', 'TEAM', 'AGE', 'GP', 'W', 'L', 'MIN', 'STL', 'BLK'], axis = 1, inplace = True)
    df = pd.merge(dpoy_df[dpoy_df["Year"] == year], temp, on = ["Player", "Year"], how = "left")
    dpoy_df_temp = pd.concat([dpoy_df_temp, df], axis = 0)

In [18]:
dpoy_df = dpoy_df_temp.copy()
dpoy_df

Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,...,DREB,DREB%,%DREB,STL%,%BLK,OPP PTSOFF TOV,OPP PTS2ND CHANCE,OPP PTSFB,OPP PTSPAINT,DEFWS
0,Tariq Abdul-Wahad,SG,25,TOT,61,56,1578,6.3,14.7,0.424,...,3.1,11.2,17.6,19.5,12.6,9.2,7.5,9.3,21.2,0.092
1,Shareef Abdur-Rahim,SF,23,VAN,82,82,3223,6.6,14.3,0.465,...,7.4,20.2,32.0,18.4,29.7,15.1,11.1,13.5,36.1,0.051
2,Ray Allen,SG,24,MIL,82,82,3070,7.5,16.5,0.455,...,3.4,9.2,14.9,21.0,6.2,13.0,11.1,10.5,28.5,0.060
3,John Amaechi,C,29,ORL,80,53,1684,6.5,15.0,0.437,...,2.6,11.5,19.7,10.3,16.7,8.4,7.3,6.1,18.2,0.075
4,Derek Anderson,SG,25,LAC,64,58,2201,6.2,14.1,0.438,...,2.8,8.6,13.7,26.9,4.3,13.4,9.8,11.2,35.8,0.030
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
342,Delon Wright,SG,29,ATL,77,8,1452,3.0,6.7,0.454,...,2.2,11.6,17.1,38.1,14.1,5.9,5.1,4.2,18.3,0.064
343,Thaddeus Young,PF,33,TOT,52,1,845,6.0,11.6,0.518,...,2.5,15.2,22.3,34.6,22.8,5.2,4.2,4.3,15.2,0.068
344,Trae Young,PG,23,ATL,76,76,2652,9.7,21.0,0.460,...,3.1,8.7,12.6,18.3,3.0,11.0,9.4,9.1,35.6,0.056
345,Omer Yurtseven,C,23,MIA,56,12,706,6.6,12.6,0.526,...,3.7,29.3,42.6,16.2,33.9,4.7,3.6,3.2,10.6,0.046


### Most Improved Player

The following features will be added:
1. Percent change from the previous season in selected numerical gameplay features, excluding the advanced metrics other than TS% (column names end in "- %")
2. Quantitative (absolute) change from the previous season in gameplay and advanced features, which will be compared to (1.) in EDA (column names end in "- Q")
3. Aggregate change (the sum of all percent-change columns) (AGG)

This adds 23 features in total: 8 percent-change columns, 14 quantitative-change columns, and AGG. This process will also remove a few features using domain knowledge in order to reduce the number of independent variables going into the exploratory data analysis.

In [19]:
#constructing dataframe of percentage change features (to be concatenated to mip_df) and aggregated sum
percent_change_features = ["GS", "MP", "TS%", "TRB", "AST", "STL", "BLK", "PTS"]
def percent_change(df):
    player = df["Player"]
    year = df["Year"]
    prev_year = data[(data["Player"] == player) & (data["Year"] == year - 1)]
    
    if prev_year.empty: #returns NAs if player doesn't exist in previous year (Or previous year doesn't exist)
        return (np.nan,) * (len(percent_change_features) + 1)

    def calc(x, y):
        #pandas/NumPy division by zero returns inf or NaN instead of raising ZeroDivisionError,
        #so the zero case is checked explicitly and falls back to the raw difference
        if y == 0:
            return x - y
        return (x - y) / y
    
    #calculates percent change for each statistic, then sums them for the aggregate (AGG)
    changes = [calc(df[col], prev_year[col].values[0]) for col in percent_change_features]
    return (*changes, sum(changes))

#creates dataframe sorted by the players in mip_df
pct = data.apply(percent_change, axis = 1, result_type = "expand").reset_index()
pct.drop("index", axis = 1, inplace = True)
pct.columns = ["GS - %", "MP - %", "TS% - %", "TRB - %", "AST - %", "STL - %", "BLK - %", "PTS - %", "AGG"]
pct

Unnamed: 0,GS - %,MP - %,TS% - %,TRB - %,AST - %,STL - %,BLK - %,PTS - %,AGG
0,,,,,,,,,
1,,,,,,,,,
2,,,,,,,,,
3,,,,,,,,,
4,,,,,,,,,
...,...,...,...,...,...,...,...,...,...
7418,-0.794872,-0.169336,0.024911,0.000000,-0.175439,0.095238,-0.166667,-0.360902,-1.547067
7419,-0.956522,-0.488499,-0.051903,-0.032609,-0.301587,0.437500,-0.111111,-0.234637,-1.739368
7420,0.206349,0.248000,0.023769,-0.071429,-0.009901,0.111111,-0.500000,0.085185,0.093085
7421,,,,,,,,,


In [20]:
#constructing dataframe for quantitative change features (to be concatenated to mip_df)
quantitative_change_features = ["GS", "MP", "TS%", "TRB", "AST", "STL", "BLK", 
                                "PTS", "PER", "USG%", "WS", "WS/48", "BPM", "VORP"]
def quantitative_change(df):
    player = df["Player"]
    year = df["Year"]
    prev_year = data[(data["Player"] == player) & (data["Year"] == year - 1)]
    
    if prev_year.empty: #returns NAs if player doesn't exist in previous year (Or previous year doesn't exist)
        return np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan
    
    def calc(x, y):
        return (x - y).values[0]
    
    #calculates quantitative change for desired variables
    GS = calc(df["GS"], prev_year["GS"])
    MP = calc(df["MP"], prev_year["MP"])
    TS = calc(df["TS%"], prev_year["TS%"])
    TRB = calc(df["TRB"], prev_year["TRB"])
    AST = calc(df["AST"], prev_year["AST"])
    STL = calc(df["STL"], prev_year["STL"])
    BLK = calc(df["BLK"], prev_year["BLK"])
    PTS = calc(df["PTS"], prev_year["PTS"])
    PER = calc(df["PER"], prev_year["PER"])
    USG = calc(df["USG%"], prev_year["USG%"])
    WS = calc(df["WS"], prev_year["WS"])
    WS_per = calc(df["WS/48"], prev_year["WS/48"])
    BPM = calc(df["BPM"], prev_year["BPM"])
    VORP = calc(df["VORP"], prev_year["VORP"])
    
    return GS, MP, TS, TRB, AST, STL, BLK, PTS, PER, USG, WS, WS_per, BPM, VORP

#creates dataframe sorted by players in mip_df
quant = data.apply(quantitative_change, axis = 1, result_type = "expand").reset_index()
quant.drop("index", axis = 1, inplace = True)
quant.columns = ["GS - Q", "MP - Q", "TS% - Q", "TRB - Q", "AST - Q", "STL - Q", "BLK - Q", 
               "PTS - Q", "PER - Q", "USG% - Q", "WS - Q", "WS/48 - Q", "BPM - Q", "VORP - Q"]
quant

Unnamed: 0,GS - Q,MP - Q,TS% - Q,TRB - Q,AST - Q,STL - Q,BLK - Q,PTS - Q,PER - Q,USG% - Q,WS - Q,WS/48 - Q,BPM - Q,VORP - Q
0,,,,,,,,,,,,,,
1,,,,,,,,,,,,,,
2,,,,,,,,,,,,,,
3,,,,,,,,,,,,,,
4,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7418,-31.0,-296.0,0.014,0.0,-1.0,0.2,-0.1,-4.8,-2.5,-6.0,-0.7,0.004,0.0,-0.2
7419,-22.0,-807.0,-0.030,-0.3,-1.9,0.7,-0.1,-4.2,-3.3,-4.9,-2.9,-0.021,-1.1,-1.3
7420,13.0,527.0,0.014,-0.3,-0.1,0.1,-0.1,2.3,2.4,1.4,2.8,0.018,1.5,1.8
7421,,,,,,,,,,,,,,


In [21]:
#concatenating the dataframes to mip_df and removing unnecessary features (via domain knowledge)
mip_df.drop(["FG%", "3P%", "2P%", "FT", "FTA", "FT%", "ORB", "DRB"], axis = 1, inplace = True)
mip_df = pd.concat([mip_df, pct, quant], axis = 1)
mip_df

Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,3P,...,AST - Q,STL - Q,BLK - Q,PTS - Q,PER - Q,USG% - Q,WS - Q,WS/48 - Q,BPM - Q,VORP - Q
0,Tariq Abdul-Wahad,SG,25,TOT,61,56,1578,6.3,14.7,0.1,...,,,,,,,,,,
1,Shareef Abdur-Rahim,SF,23,VAN,82,82,3223,6.6,14.3,0.3,...,,,,,,,,,,
2,Ray Allen,SG,24,MIL,82,82,3070,7.5,16.5,2.0,...,,,,,,,,,,
3,John Amaechi,C,29,ORL,80,53,1684,6.5,15.0,0.0,...,,,,,,,,,,
4,Derek Anderson,SG,25,LAC,64,58,2201,6.2,14.1,0.9,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7418,Delon Wright,SG,29,ATL,77,8,1452,3.0,6.7,1.1,...,-1.0,0.2,-0.1,-4.8,-2.5,-6.0,-0.7,0.004,0.0,-0.2
7419,Thaddeus Young,PF,33,TOT,52,1,845,6.0,11.6,0.7,...,-1.9,0.7,-0.1,-4.2,-3.3,-4.9,-2.9,-0.021,-1.1,-1.3
7420,Trae Young,PG,23,ATL,76,76,2652,9.7,21.0,3.2,...,-0.1,0.1,-0.1,2.3,2.4,1.4,2.8,0.018,1.5,1.8
7421,Omer Yurtseven,C,23,MIA,56,12,706,6.6,12.6,0.1,...,,,,,,,,,,
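
The year-over-year features above can also be computed without a row-wise `apply`. The sketch below is one possible approach on toy data (made-up values): shift each season forward by one year and self-merge, so every row lines up with the same player's previous season. It keeps the zero-denominator fallback used in `percent_change`.

```python
import numpy as np
import pandas as pd

# Toy season rows; values are made up. Player B skips the 2021 season.
stats = pd.DataFrame({
    "Player": ["A", "A", "B", "B"],
    "Year": [2021, 2022, 2020, 2022],
    "PTS": [10.0, 15.0, 8.0, 12.0],
    "GS": [0, 20, 5, 10],
})

# Shift each season forward one year so it lines up with the following season.
prev = stats.assign(Year=stats["Year"] + 1)
both = stats.merge(prev, on=["Player", "Year"], how="left", suffixes=("", "_prev"))

for col in ["PTS", "GS"]:
    diff = both[col] - both[f"{col}_prev"]
    both[f"{col} - Q"] = diff
    # Fall back to the raw difference when the previous value is 0,
    # mirroring the zero-denominator fallback in percent_change.
    both[f"{col} - %"] = np.where(both[f"{col}_prev"] == 0, diff, diff / both[f"{col}_prev"])

print(both[["Player", "Year", "PTS - Q", "PTS - %", "GS - Q", "GS - %"]])
```

Player B's 2022 row stays `NaN` because there is no 2021 season to compare against, which matches the `year - 1` lookup in the functions above.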


### Sixth Man of the Year

The same features engineered for MVP (Seed, WL, SS, MPG) will be added to this award's dataset.

In [22]:
merged_features = mvp_df[["Player", "Year", "Seed", "WL", "SS", "MPG"]]
smoy_df = pd.merge(smoy_df, merged_features, on = ["Player", "Year"], how = "left")
smoy_df

Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,...,WS/48,OBPM,DBPM,BPM,VORP,Share,Seed,WL,SS,MPG
0,Chris Anstey,C,25,CHI,73,11,1007,5.8,13.0,0.442,...,0.090,-2.1,-0.1,-2.3,-0.1,,15.0,0.207,96,13.794521
1,Greg Anthony,PG,32,POR,82,3,1548,3.9,9.7,0.406,...,0.128,0.4,0.8,1.2,1.2,,3.0,0.720,211,18.878049
2,Chucky Atkins,PG,25,ORL,82,0,1626,7.0,16.4,0.424,...,0.072,0.6,-1.4,-0.8,0.5,,9.0,0.500,56,19.829268
3,Stacey Augmon,SG,31,POR,59,0,692,4.3,9.1,0.474,...,0.102,-1.5,1.0,-0.5,0.3,,3.0,0.720,249,11.728814
4,Isaac Austin,C,30,WAS,59,23,1173,4.6,10.8,0.429,...,0.001,-3.1,-0.1,-3.2,-0.4,,13.0,0.354,205,19.881356
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3830,Dylan Windler,SF,25,CLE,50,0,459,2.9,7.7,0.378,...,0.084,-2.9,1.2,-1.7,0.0,,9.0,0.537,341,9.180000
3831,Justise Winslow,PF,25,TOT,48,11,774,5.1,12.0,0.428,...,0.048,-2.7,1.0,-1.7,0.1,,,,262,16.125000
3832,Delon Wright,SG,29,ATL,77,8,1452,3.0,6.7,0.454,...,0.121,-0.2,2.4,2.2,1.6,,8.0,0.524,343,18.857143
3833,Thaddeus Young,PF,33,TOT,52,1,845,6.0,11.6,0.518,...,0.126,0.1,2.1,2.2,0.9,,,,232,16.250000


### Rookie of the Year

The same features engineered for MVP (Seed, WL, SS, MPG) will be added to this award's dataset.

In [23]:
merged_features = mvp_df[["Player", "Year", "Seed", "WL", "SS", "MPG"]]
roy_df = pd.merge(roy_df, merged_features, on = ["Player", "Year"], how = "left")
roy_df

Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,...,WS/48,OBPM,DBPM,BPM,VORP,Share,Seed,WL,SS,MPG
0,Chucky Atkins,PG,25,ORL,82,0,1626,7.0,16.4,0.424,...,0.072,0.6,-1.4,-0.8,0.5,,9.0,0.500,56,19.829268
1,William Avery,PG,20,MIN,59,1,484,4.2,13.5,0.309,...,-0.046,-4.3,-2.0,-6.3,-0.5,,6.0,0.610,223,8.203390
2,Cal Bowdler,SF,22,ATL,46,0,423,4.2,9.8,0.426,...,0.027,-3.4,-0.3,-3.7,-0.2,,14.0,0.341,251,9.195652
3,Ryan Bowen,SF,24,DEN,52,0,589,2.8,7.2,0.393,...,0.111,-1.7,1.5,-0.2,0.3,,10.0,0.427,290,11.326923
4,Elton Brand,PF,20,CHI,81,80,2999,7.6,15.7,0.482,...,0.121,2.2,-0.9,1.3,2.5,0.479,15.0,0.207,27,37.024691
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
788,Duane Washington Jr.,PG,21,IND,48,7,968,6.4,15.9,0.405,...,-0.002,-1.9,-2.8,-4.8,-0.7,,13.0,0.305,102,20.166667
789,Trendon Watford,SF,21,POR,48,10,869,5.9,11.1,0.532,...,0.104,-1.4,-0.4,-1.7,0.1,,13.0,0.329,178,18.104167
790,Aaron Wiggins,SG,23,OKC,50,35,1209,4.6,10.0,0.463,...,0.048,-3.4,-0.9,-4.3,-0.7,,14.0,0.293,271,24.180000
791,Ziaire Williams,SF,20,MEM,62,31,1346,5.1,11.3,0.450,...,0.080,-2.4,-0.6,-3.0,-0.3,,2.0,0.683,245,21.709677


## 3. Saving the Datasets

In [29]:
mvp_df.to_csv("../data/awards_dfs/MVP.csv")
dpoy_df.to_csv("../data/awards_dfs/DPOY.csv")
mip_df.to_csv("../data/awards_dfs/MIP.csv")
roy_df.to_csv("../data/awards_dfs/ROY.csv")
smoy_df.to_csv("../data/awards_dfs/SMOY.csv")
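
Because `to_csv` writes the index by default, these files will contain an unnamed first column when read back, which is where an `Unnamed: 0` column comes from. A small sketch of the round trip, using an in-memory buffer and a toy frame instead of the real files:

```python
import io
import pandas as pd

# Toy frame standing in for one of the award datasets.
df = pd.DataFrame({"Player": ["A", "B"], "Share": [0.9, None]})

buffer = io.StringIO()
df.to_csv(buffer)  # default index=True writes the index as an unnamed first column

buffer.seek(0)
naive = pd.read_csv(buffer)
print(naive.columns.tolist())  # ['Unnamed: 0', 'Player', 'Share']

buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)
print(restored.columns.tolist())  # ['Player', 'Share']
```

Reading with `index_col=0` in Part 2, or writing with `index=False` here, would both avoid the extra column; which one fits depends on whether the saved index is needed later.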