# Final Project: Predicting NBA's Most Valuable Player (MVP)

## Goal
- Create a **supervised machine learning model to predict the NBA's MVP each year** based on player stats.

### Dataset
- The dataset is [NBA Players stats since 1950](https://www.kaggle.com/drgilermo/nba-players-stats?select=Seasons_Stats.csv) on Kaggle

#### Seasons Stats Columns
- **Year - season**
- **Player - name**
- Pos - position
- Age - age
- Tm - team
- **G - games**
- **GS - games started**
- **MP - minutes played**
- PER - player efficiency rating
- TS% - true shooting %
- 3PAr - 3-point attempt rate
- FTr - free throw rate
- ORB% - offensive rebound percentage
- DRB% - defensive rebound percentage
- TRB% - total rebound percentage
- AST% - assist percentage
- STL% - steal percentage
- BLK% - block percentage
- TOV% - turnover percentage
- USG% - usage percentage
- OWS - offensive win shares
- DWS - defensive win shares
- WS/48 - win shares per 48
- OBPM - offensive box plus/minus
- DBPM - defensive box plus/minus
- BPM - box plus/minus
- VORP - value over replacement
- **FG - field goals**
- **FGA - field goal attempts**
- **FG% - field goal percentage**
- **3P - 3-point field goals**
- 3PA - 3-point field goal attempts
- **3P% - 3-point field goal percentage**
- **2P - 2-point field goals**
- 2PA - 2-point field goal attempts
- **2P% - 2-point field goal percentage**
- **eFG% - effective field goal percentage**
- FT - free throws
- FTA - free throw attempts
- FT% - free throw percentage
- **ORB - offensive rebounds**
- **DRB - defensive rebounds**
- **TRB - total rebounds**
- **AST - assists**
- **STL - steals**
- **BLK - blocks**
- **TOV - turnovers**
- PF - personal fouls
- **PTS - points**

## Import Dataset

In [366]:
import pandas as pd

players = pd.read_csv("datasets/Players.csv")
seasons_stats = pd.read_csv("datasets/Seasons_Stats.csv")
player_data = pd.read_csv("datasets/player_data.csv")

print(seasons_stats.shape)
seasons_stats.head()

(24691, 53)


Unnamed: 0.1,Unnamed: 0,Year,Player,Pos,Age,Tm,G,GS,MP,PER,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,0,1950.0,Curly Armstrong,G-F,31.0,FTW,63.0,,,,...,0.705,,,,176.0,,,,217.0,458.0
1,1,1950.0,Cliff Barker,SG,29.0,INO,49.0,,,,...,0.708,,,,109.0,,,,99.0,279.0
2,2,1950.0,Leo Barnhorst,SF,25.0,CHS,67.0,,,,...,0.698,,,,140.0,,,,192.0,438.0
3,3,1950.0,Ed Bartels,F,24.0,TOT,15.0,,,,...,0.559,,,,20.0,,,,29.0,63.0
4,4,1950.0,Ed Bartels,F,24.0,DNN,13.0,,,,...,0.548,,,,20.0,,,,27.0,59.0


### Testing Datasets with my favorite NBA player, Kobe Bryant

In [367]:
player_data[player_data["name"] == "Kobe Bryant"] # get Kobe Bryant's player data

Unnamed: 0,name,year_start,year_end,position,height,weight,birth_date,college
528,Kobe Bryant,1997,2016,G-F,6-6,212.0,"August 23, 1978",


In [368]:
players[players["Player"] == "Kobe Bryant"] # get Kobe Bryant's player data

Unnamed: 0.1,Unnamed: 0,Player,height,weight,collage,born,birth_city,birth_state
2456,2456,Kobe Bryant,198.0,96.0,,1978.0,Philadelphia,Pennsylvania


In [369]:
seasons_stats[seasons_stats["Player"] == "Kobe Bryant"] # get Kobe Bryant's player data

Unnamed: 0.1,Unnamed: 0,Year,Player,Pos,Age,Tm,G,GS,MP,PER,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
12900,12900,1997.0,Kobe Bryant,SG,18.0,LAL,71.0,6.0,1103.0,14.4,...,0.819,47.0,85.0,132.0,91.0,49.0,23.0,112.0,102.0,539.0
13479,13479,1998.0,Kobe Bryant,SG,19.0,LAL,79.0,1.0,2056.0,18.5,...,0.794,79.0,163.0,242.0,199.0,74.0,40.0,157.0,180.0,1220.0
14021,14021,1999.0,Kobe Bryant,SG,20.0,LAL,50.0,50.0,1896.0,18.9,...,0.839,53.0,211.0,264.0,190.0,72.0,50.0,157.0,153.0,996.0
14537,14537,2000.0,Kobe Bryant,SG,21.0,LAL,66.0,62.0,2524.0,21.7,...,0.821,108.0,308.0,416.0,323.0,106.0,62.0,182.0,220.0,1485.0
15028,15028,2001.0,Kobe Bryant,SG,22.0,LAL,68.0,68.0,2783.0,24.5,...,0.853,104.0,295.0,399.0,338.0,114.0,43.0,220.0,222.0,1938.0
15578,15578,2002.0,Kobe Bryant,SG,23.0,LAL,80.0,80.0,3063.0,23.2,...,0.829,112.0,329.0,441.0,438.0,118.0,35.0,223.0,228.0,2019.0
16070,16070,2003.0,Kobe Bryant,SG,24.0,LAL,82.0,82.0,3401.0,26.2,...,0.843,106.0,458.0,564.0,481.0,181.0,67.0,288.0,218.0,2461.0
16576,16576,2004.0,Kobe Bryant,SG,25.0,LAL,65.0,64.0,2447.0,23.7,...,0.852,103.0,256.0,359.0,330.0,112.0,28.0,171.0,176.0,1557.0
17159,17159,2005.0,Kobe Bryant,SG,26.0,LAL,66.0,66.0,2689.0,23.3,...,0.816,95.0,297.0,392.0,398.0,86.0,53.0,270.0,174.0,1819.0
17742,17742,2006.0,Kobe Bryant,SG,27.0,LAL,80.0,80.0,3277.0,28.0,...,0.85,71.0,354.0,425.0,360.0,147.0,30.0,250.0,233.0,2832.0


## Clean Data

In [370]:
seasons_stats = seasons_stats[~seasons_stats["Player"].isnull()] #remove rows where Player column is null
print(seasons_stats.shape)

(24624, 53)


In [371]:
players = players[~players["Player"].isnull()]
print(seasons_stats.shape)

(24624, 53)
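The `~isnull()` masks used above can also be written with `dropna` restricted to a single column. A minimal sketch on a toy frame (the `toy` DataFrame and its values are invented for illustration and are not part of the dataset):

```python
import pandas as pd

# Toy frame standing in for seasons_stats (values are invented).
toy = pd.DataFrame({"Player": ["A", None, "B"], "PTS": [10.0, 5.0, None]})

# Mask-based filtering, as used in the cells above.
by_mask = toy[~toy["Player"].isnull()]

# Same result: drop a row only when its Player value is missing.
by_dropna = toy.dropna(subset=["Player"])

print(by_mask.equals(by_dropna))
```

Restricting `subset` to `"Player"` matters here: a plain `dropna()` would also drop the many early-era rows that have missing stats.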


### Change column names

In [372]:
seasons_stats = seasons_stats.rename(columns = {"Unnamed: 0": "id"}) #rename 'Unnamed: 0' column to 'id'
seasons_stats.head()

Unnamed: 0,id,Year,Player,Pos,Age,Tm,G,GS,MP,PER,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,0,1950.0,Curly Armstrong,G-F,31.0,FTW,63.0,,,,...,0.705,,,,176.0,,,,217.0,458.0
1,1,1950.0,Cliff Barker,SG,29.0,INO,49.0,,,,...,0.708,,,,109.0,,,,99.0,279.0
2,2,1950.0,Leo Barnhorst,SF,25.0,CHS,67.0,,,,...,0.698,,,,140.0,,,,192.0,438.0
3,3,1950.0,Ed Bartels,F,24.0,TOT,15.0,,,,...,0.559,,,,20.0,,,,29.0,63.0
4,4,1950.0,Ed Bartels,F,24.0,DNN,13.0,,,,...,0.548,,,,20.0,,,,27.0,59.0


In [373]:
players = players.rename(columns = {"Unnamed: 0": "id"}) #rename 'Unnamed: 0' column to 'id'
players.head()

Unnamed: 0,id,Player,height,weight,collage,born,birth_city,birth_state
0,0,Curly Armstrong,180.0,77.0,Indiana University,1918.0,,
1,1,Cliff Barker,188.0,83.0,University of Kentucky,1921.0,Yorktown,Indiana
2,2,Leo Barnhorst,193.0,86.0,University of Notre Dame,1924.0,,
3,3,Ed Bartels,196.0,88.0,North Carolina State University,1925.0,,
4,4,Ralph Beard,178.0,79.0,University of Kentucky,1927.0,Hardinsburg,Kentucky


### Ensure NBA players with the same name are treated as different players
- Group `player_data` by name and count the rows for each name
- Keep a single column (`year_start`) to hold that count; a count above 1 means the name is shared by more than one player

In [374]:
num_players = player_data.groupby("name").count() 
num_players = num_players.iloc[:, :1] # keep only the first column (year_start), which now holds the row count per name
num_players = num_players.reset_index() # reset index

num_players[num_players["name"] == "Charles Smith"] # Example of a player with duplicate names

Unnamed: 0,name,year_start
710,Charles Smith,3


#### Rename columns to match other datasets
`player_data` uses `'name'`, while the `players` and `seasons_stats` datasets use `'Player'`

In [394]:
num_players.columns = ["Player", "count"]
num_players.head()

Unnamed: 0,Player,count
0,A.C. Green,1
1,A.J. Bramlett,1
2,A.J. English,1
3,A.J. Guyton,1
4,A.J. Hammons,1


### Players with duplicated names

In [376]:
duplicated_names = num_players[num_players["count"] > 1]

print(len(duplicated_names), " total of names representing more than 1 player")
duplicated_names.head()

47  total of names representing more than 1 player


Unnamed: 0,Player,count
314,Bill Bradley,2
420,Bob Duffy,2
494,Bobby Jones,2
505,Bobby Wilson,2
680,Cedric Henderson,2
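The same count of shared names can be sketched more directly with `value_counts`. This is an alternative to the `groupby` approach above, shown on a toy frame; the `year_start` values here are invented placeholders, not the real ones from `player_data`.

```python
import pandas as pd

# Toy stand-in for player_data; year_start values are invented.
toy_player_data = pd.DataFrame({
    "name": ["Charles Smith", "Charles Smith", "Charles Smith", "Kobe Bryant"],
    "year_start": [1, 2, 3, 4],
})

# Count how many rows carry each name.
name_counts = toy_player_data["name"].value_counts()

# Names that appear more than once refer to more than one player.
shared_names = name_counts[name_counts > 1]
print(shared_names)
```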


In [377]:
seasons_stats = seasons_stats.iloc[:,1:]
seasons_stats = seasons_stats.drop(["blanl", "blank2"], axis=1) # drop these columns because all their values are null

In [378]:
seasons_stats.columns

Index(['Year', 'Player', 'Pos', 'Age', 'Tm', 'G', 'GS', 'MP', 'PER', 'TS%',
       '3PAr', 'FTr', 'ORB%', 'DRB%', 'TRB%', 'AST%', 'STL%', 'BLK%', 'TOV%',
       'USG%', 'OWS', 'DWS', 'WS', 'WS/48', 'OBPM', 'DBPM', 'BPM', 'VORP',
       'FG', 'FGA', 'FG%', '3P', '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%',
       'FT', 'FTA', 'FT%', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV',
       'PF', 'PTS'],
      dtype='object')

In [379]:
player_data["id"] = player_data.index # assign index as the id

In [380]:
kobe_stats = seasons_stats[seasons_stats["Player"] == 'Kobe Bryant']
kobe_stats["Year"].iloc[0] - kobe_stats["Age"].iloc[0]
# Kobe was born in 1978, so we need to...
# subtract 1 more to match the actual birth year

1979.0

### Create a new column for year born

In [381]:
# remove rows with null values in the Year and Age columns
seasons_stats = seasons_stats[~seasons_stats["Year"].isnull()]
seasons_stats = seasons_stats[~seasons_stats["Age"].isnull()]

# create a born column
seasons_stats["born"] = (seasons_stats["Year"] - seasons_stats["Age"] - 1) #.astype("int16") # subtract 1
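As a quick check of the `Year - Age - 1` rule, the sketch below applies it to two of Kobe Bryant's rows shown earlier (1997 at age 18, 1998 at age 19). The rule is inferred from a single player, so it may be off by one for players whose listed age rolls over at a different point in the season; such rows would simply fail to match on birth year later.

```python
import pandas as pd

# Two of Kobe Bryant's rows, copied from the output earlier in the notebook.
kobe = pd.DataFrame({"Year": [1997.0, 1998.0], "Age": [18.0, 19.0]})

# Same rule as the notebook: season year minus listed age minus one.
kobe["born"] = kobe["Year"] - kobe["Age"] - 1

print(kobe["born"].unique())
```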

In [382]:
players = players[~players["born"].isnull()]       #remove players with no birth year

players_born = players[["Player", "born"]]         #select only player and birth year
players_born.head()

Unnamed: 0,Player,born
0,Curly Armstrong,1918.0
1,Cliff Barker,1921.0
2,Leo Barnhorst,1924.0
3,Ed Bartels,1925.0
4,Ralph Beard,1927.0


In [383]:
player_data = player_data[~player_data["birth_date"].isnull()]  #remove player_data with no birth_date
for i, row in player_data.iterrows():                           #loop through each row
    birth_year = float(row["birth_date"].split(",")[1])         #get the year from birth_date
    player_data.loc[i, "born"] = birth_year                     #assign to born column

In [395]:
player_data_born = player_data[["name", "born"]]                #only get the name and born columns
player_data_born.columns = ["Player", "born"]                   #rename name column to Player
player_data.head()

Unnamed: 0,name,year_start,year_end,position,height,weight,birth_date,college,id,born
0,Alaa Abdelnaby,1991,1995,F-C,6-10,240.0,"June 24, 1968",Duke University,0,1968.0
1,Zaid Abdul-Aziz,1969,1978,C-F,6-9,235.0,"April 7, 1946",Iowa State University,1,1946.0
2,Kareem Abdul-Jabbar,1970,1989,C,7-2,225.0,"April 16, 1947","University of California, Los Angeles",2,1947.0
3,Mahmoud Abdul-Rauf,1991,2001,G,6-1,162.0,"March 9, 1969",Louisiana State University,3,1969.0
4,Tariq Abdul-Wahad,1998,2003,F,6-6,223.0,"November 3, 1974",San Jose State University,4,1974.0
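The row-by-row loop above can also be written with pandas string methods, which avoids `iterrows`. A minimal sketch, assuming every non-null `birth_date` has the `"Month day, year"` shape seen in the output; the `toy` frame is a stand-in that reuses two birth dates from the table above.

```python
import pandas as pd

# Stand-in for player_data, using two birth dates shown above plus one missing value.
toy = pd.DataFrame({"birth_date": ["June 24, 1968", "April 7, 1946", None]})

# Drop rows without a birth date; .copy() avoids assigning into a view.
toy = toy[~toy["birth_date"].isnull()].copy()

# Take the part after the comma, trim the space, and convert to float.
toy["born"] = toy["birth_date"].str.split(",").str[1].str.strip().astype(float)

print(toy)
```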


### Concatenate birth year from 2 datasets

In [385]:
born = pd.concat([players_born, player_data_born]) #concatenate/combine 2 datasets
born = born.drop_duplicates() #remove duplicates
born = born.reset_index() #reset index
born['id'] = born["index"] #assign id as the index
born = born.drop("index", axis=1) #remove index column

born.head()

Unnamed: 0,Player,born,id
0,Curly Armstrong,1918.0,0
1,Cliff Barker,1921.0,1
2,Leo Barnhorst,1924.0,2
3,Ed Bartels,1925.0,3
4,Ralph Beard,1927.0,4


### Merge born into seasons_stats
Merging on both `Player` and `born` gives each row an `id` that identifies one player, so players who share a name stay distinct

In [386]:
data = seasons_stats.merge(born, on=["Player", "born"])
data.head()

Unnamed: 0,Year,Player,Pos,Age,Tm,G,GS,MP,PER,TS%,...,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,born,id
0,1950.0,Curly Armstrong,G-F,31.0,FTW,63.0,,,,0.368,...,,,176.0,,,,217.0,458.0,1918.0,0
1,1951.0,Curly Armstrong,G-F,32.0,FTW,38.0,,,,0.372,...,,89.0,77.0,,,,97.0,202.0,1918.0,0
2,1950.0,Leo Barnhorst,SF,25.0,CHS,67.0,,,,0.394,...,,,140.0,,,,192.0,438.0,1924.0,2
3,1951.0,Leo Barnhorst,SF,26.0,INO,68.0,,,,0.377,...,,296.0,218.0,,,,197.0,546.0,1924.0,2
4,1952.0,Leo Barnhorst,SF,27.0,INO,66.0,,2344.0,15.9,0.419,...,,430.0,255.0,,,,196.0,820.0,1924.0,2
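Because `merge` defaults to an inner join, any season row whose name and birth year do not both match is dropped without warning. One way to see which rows would be lost is a left join with `indicator=True`. A sketch on toy frames (`stats` and `born_toy` are invented stand-ins, and the player names are placeholders):

```python
import pandas as pd

# Invented stand-ins for seasons_stats and born.
stats = pd.DataFrame({"Player": ["P1", "P2"], "born": [1980.0, 1975.0], "PTS": [100.0, 50.0]})
born_toy = pd.DataFrame({"Player": ["P1", "P2"], "born": [1980.0, 1976.0], "id": [0, 1]})

# A left join keeps every stats row; _merge records whether a match was found.
checked = stats.merge(born_toy, on=["Player", "born"], how="left", indicator=True)

# Rows marked left_only would disappear in the default inner merge.
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[["Player", "born"]])
```

Here `P2` is unmatched because its birth year differs by one between the two frames.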


#### Remove the combined rows for players who played for multiple teams in a single season
- These rows have "TOT" as the team and hold the season totals across all teams
- The per-team rows for the same season are kept
- This should be safe because MVP-caliber players are rarely traded mid-season

In [387]:
data = data[data["Tm"] !="TOT"]
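To see what this filter keeps, the sketch below uses Ed Bartels' two 1950 rows from the first preview of `seasons_stats`: one `TOT` row and one per-team row. Only the selected columns are reproduced.

```python
import pandas as pd

# Ed Bartels' 1950 rows as shown in the seasons_stats preview.
bartels = pd.DataFrame({
    "Player": ["Ed Bartels", "Ed Bartels"],
    "Year": [1950.0, 1950.0],
    "Tm": ["TOT", "DNN"],
    "G": [15.0, 13.0],
    "PTS": [63.0, 59.0],
})

# Same filter as the notebook: drop the season-total row.
kept = bartels[bartels["Tm"] != "TOT"]
print(kept)
```

The player stays in the data, but only through the per-team rows, so the stats of a traded player are split across teams rather than summed.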

## 1. Let's begin by adding important features to players
- PPG - points per game
- APG - assists per game
- RPG - rebounds per game
- SPG - steals per game
- BPG - blocks per game
- FPG - fouls per game
- TOVPG - turnovers per game

In [363]:
data["PPG"] = data["PTS"] / data["G"]
data["APG"] = data["AST"] / data["G"]
data["RPG"] = data["TRB"] / data["G"]
data["SPG"] = data["STL"] / data["G"]
data["BPG"] = data["BLK"] / data["G"]
data["FPG"] = data["PF"] / data["G"]
data["TOVPG"] = data["TOV"] / data["G"]
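The per-game columns all follow one pattern: a season total divided by games played. A compact sketch drives that from a mapping; the `per_game` dict is a name introduced here for illustration, and the single row uses Kobe Bryant's 2006 totals from the earlier output.

```python
import pandas as pd

# Hypothetical mapping from new per-game column to its season-total column.
per_game = {"PPG": "PTS", "APG": "AST", "RPG": "TRB"}

# Kobe Bryant's 2006 totals, copied from the output earlier in the notebook.
kobe_2006 = pd.DataFrame({"G": [80.0], "PTS": [2832.0], "AST": [360.0], "TRB": [425.0]})

# Divide each total by games played.
for new_col, total_col in per_game.items():
    kobe_2006[new_col] = kobe_2006[total_col] / kobe_2006["G"]

print(kobe_2006[list(per_game)])
```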

## 2. Adding MVP Players from 1956-2018
- List of [NBA MVP Award Winners](https://www.nba.com/history/awards/mvp)

In [388]:
mvp_players = {
    "Bob Pettit*": [1956, 1959],
    "Bob Cousy*": [1957],
    "Bill Russell*": [1958, 1961, 1962, 1963, 1965],
    "Wilt Chamberlain*": [1960, 1966, 1967, 1968],
    "Oscar Robertson*": [1964],
    "Wes Unseld*": [1969],
    "Willis Reed*": [1970],
    "Kareem Abdul-Jabbar*": [1971, 1972, 1974, 1976, 1977, 1980],
    "Dave Cowens*": [1973],
    "Bob McAdoo*": [1975],
    "Bill Walton*": [1978],
    "Moses Malone*": [1979, 1982, 1983],
    "Julius Erving*": [1981],
    "Larry Bird*": [1984, 1985, 1986],
    "Magic Johnson*": [1987, 1989, 1990],
    "Michael Jordan*": [1988, 1991, 1992, 1996, 1998],
    "Charles Barkley*": [1993],
    "Hakeem Olajuwon*": [1994],
    "David Robinson*": [1995],
    "Karl Malone*": [1997, 1999],
    "Shaquille O'Neal*": [2000],  # apostrophe, not an escaped double quote, so the key can match the Player column
    "Allen Iverson*": [2001],
    "Tim Duncan": [2002, 2003],
    "Kevin Garnett": [2004],
    "Steve Nash": [2005, 2006],
    "Dirk Nowitzki": [2007],
    "Kobe Bryant": [2008],
    "LeBron James": [2009, 2010, 2012, 2013],
    "Derrick Rose": [2011],
    "Kevin Durant": [2014],
    "Stephen Curry": [2015, 2016],
    "Russell Westbrook": [2017],
    "James Harden": [2018],
}

## 3. Assign the MVP on our data

In [389]:
data['MVP'] = 0                                      # make a new column called MVP and have all its values 0
for i, row in data.iterrows():                       # loop through each row in our data
    mvp_years = mvp_players.get(row['Player'], [])   # years this player won MVP (empty list if never)
    if row['Year'] in mvp_years:                     # if we find the player at the same year...
        data.loc[i, 'MVP'] = 1                       # make their MVP value = 1

In [391]:
data[data["MVP"]==1].tail() #show the most recent MVPs

Unnamed: 0,Year,Player,Pos,Age,Tm,G,GS,MP,PER,TS%,...,TRB,AST,STL,BLK,TOV,PF,PTS,born,id,MVP
19572,2014.0,Kevin Durant,SF,25.0,OKC,81.0,81.0,3122.0,29.8,0.635,...,598.0,445.0,103.0,59.0,285.0,174.0,2593.0,1988.0,3220,1
20181,2011.0,Derrick Rose,PG,22.0,CHI,81.0,81.0,3026.0,23.5,0.55,...,330.0,623.0,85.0,51.0,278.0,136.0,2026.0,1988.0,3313,1
20260,2017.0,Russell Westbrook,PG,28.0,OKC,81.0,81.0,2802.0,30.6,0.554,...,864.0,840.0,133.0,31.0,438.0,190.0,2558.0,1988.0,3325,1
20358,2015.0,Stephen Curry,PG,26.0,GSW,80.0,80.0,2613.0,28.0,0.638,...,341.0,619.0,163.0,16.0,249.0,158.0,1900.0,1988.0,3343,1
20359,2016.0,Stephen Curry,PG,27.0,GSW,79.0,79.0,2700.0,31.5,0.669,...,430.0,527.0,169.0,15.0,262.0,161.0,2375.0,1988.0,3343,1
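A name in `mvp_players` that does not exactly match the `Player` column produces no label and no error. Counting MVPs per season is a cheap way to catch that. A sketch on a toy frame, where `toy_mvps` deliberately contains a mistyped key and `toy_data` is an invented three-row stand-in for `data`:

```python
import pandas as pd

# Toy dictionary; the second key has a stray double quote and will not match.
toy_mvps = {"Kobe Bryant": [2008], 'Shaquille O"Neal*': [2000]}

# Invented stand-in for data.
toy_data = pd.DataFrame({
    "Player": ["Kobe Bryant", "Shaquille O'Neal*", "Tim Duncan"],
    "Year": [2008.0, 2000.0, 2000.0],
})

# Label rows whose (player, year) pair is a known MVP season.
winners = {(name, year) for name, years in toy_mvps.items() for year in years}
toy_data["MVP"] = [int((p, y) in winners) for p, y in zip(toy_data["Player"], toy_data["Year"])]

# Every season should have exactly one MVP; list the seasons that have none.
mvps_per_year = toy_data.groupby("Year")["MVP"].sum()
missing = mvps_per_year[mvps_per_year == 0].index.tolist()
print(missing)
```

On the real data, seasons before 1956 are expected to show up as missing because the award list starts there.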


## 4. Add Team Wins because this is an important feature
- The MVP is usually the best player on one of the teams with the most wins

In [393]:
data.sort_values(by="Tm")["Tm"].unique()

array(['AND', 'ATL', 'BAL', 'BLB', 'BOS', 'BRK', 'BUF', 'CAP', 'CHA',
       'CHH', 'CHI', 'CHO', 'CHP', 'CHS', 'CHZ', 'CIN', 'CLE', 'DAL',
       'DEN', 'DET', 'DNN', 'FTW', 'GSW', 'HOU', 'IND', 'INO', 'KCK',
       'KCO', 'LAC', 'LAL', 'MEM', 'MIA', 'MIL', 'MIN', 'MLH', 'MNL',
       'NJN', 'NOH', 'NOJ', 'NOK', 'NOP', 'NYK', 'NYN', 'OKC', 'ORL',
       'PHI', 'PHO', 'PHW', 'POR', 'ROC', 'SAC', 'SAS', 'SDC', 'SDR',
       'SEA', 'SFW', 'SHE', 'STB', 'STL', 'SYR', 'TOR', 'TRI', 'UTA',
       'VAN', 'WAS', 'WAT', 'WSB', 'WSC'], dtype=object)
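Once a table of wins per team and season is available, it can be attached to each player row by joining on `Year` and `Tm`. The sketch below assumes a hypothetical `team_wins` frame with columns `Year`, `Tm`, and `W`; that frame, its column names, and the win totals are placeholders invented here, not real records or part of the dataset.

```python
import pandas as pd

# Invented stand-in for data (placeholder players).
toy_data = pd.DataFrame({
    "Player": ["P1", "P2"],
    "Year": [2008.0, 2008.0],
    "Tm": ["LAL", "BOS"],
})

# Hypothetical wins table; W values are placeholders, not real records.
team_wins = pd.DataFrame({
    "Year": [2008.0, 2008.0],
    "Tm": ["LAL", "BOS"],
    "W": [50, 40],
})

# A left join keeps every player row even if a team has no wins entry.
with_wins = toy_data.merge(team_wins, on=["Year", "Tm"], how="left")
print(with_wins)
```

The team codes in the wins table would need to use the same abbreviations as the `Tm` values listed above for the join to match.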