# Final Project: Predicting NBA's Most Valuable Player (MVP)

## Goal
- Predict likely MVPs each year based on every player's stats that year
- Get the 5 most likely/alternative MVPs
- Identify which stats give a player the highest chance of becoming MVP
- Create a **machine learning model to predict the NBA's MVP each year** based on player stats.

### Dataset
- The dataset is from [NBA Players stats since 1950](https://www.kaggle.com/drgilermo/nba-players-stats?select=Seasons_Stats.csv)

#### Seasons Stats Columns
- **Year - season**
- **Player - name**
- Pos - position
- Age - age
- Tm - team
- **G - games**
- **GS - games started**
- **MP - minutes played**
- PER - player efficiency rating
- TS% - true shooting %
- 3PAr - 3-point attempt rate
- FTr - free throw rate
- ORB% - offensive rebound percentage
- DRB% - defensive rebound percentage
- TRB% - total rebound percentage
- AST% - assist percentage
- STL% - steal percentage
- BLK% - block percentage
- TOV% - turnover percentage
- USG% - usage percentage
- OWS - offensive win shares
- DWS - defensive win shares
- WS/48 - win shares per 48
- OBPM - offensive box plus/minus
- DBPM - defensive box plus/minus
- BPM - box plus/minus
- VORP - value over replacement
- **FG - field goals**
- **FGA - field goal attempts**
- **FG% - field goal percentage**
- **3P - 3-point field goals**
- 3PA - 3-point field goal attempts
- **3P% - 3-point field goal percentage**
- **2P - 2-point field goals**
- 2PA - 2-point field goals attempts
- **2P% - 2-point field goals percentage**
- **eFG% - effective field goal percentage**
- FT - free throws
- FTA - free throw attempts
- FT% - free throw percentage
- **ORB - offensive rebounds**
- **DRB - defensive rebounds**
- **TRB - total rebounds**
- **AST - assists**
- **STL - steals**
- **BLK - blocks**
- **TOV - turnovers**
- PF - personal fouls
- **PTS - points**

## Import Dataset

In [1]:
import pandas as pd

players = pd.read_csv("datasets/Players.csv")
seasons_stats = pd.read_csv("datasets/Seasons_Stats.csv")
player_data = pd.read_csv("datasets/player_data.csv")

print(seasons_stats.shape)
seasons_stats.head()

(24691, 53)


Unnamed: 0.1,Unnamed: 0,Year,Player,Pos,Age,Tm,G,GS,MP,PER,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,0,1950.0,Curly Armstrong,G-F,31.0,FTW,63.0,,,,...,0.705,,,,176.0,,,,217.0,458.0
1,1,1950.0,Cliff Barker,SG,29.0,INO,49.0,,,,...,0.708,,,,109.0,,,,99.0,279.0
2,2,1950.0,Leo Barnhorst,SF,25.0,CHS,67.0,,,,...,0.698,,,,140.0,,,,192.0,438.0
3,3,1950.0,Ed Bartels,F,24.0,TOT,15.0,,,,...,0.559,,,,20.0,,,,29.0,63.0
4,4,1950.0,Ed Bartels,F,24.0,DNN,13.0,,,,...,0.548,,,,20.0,,,,27.0,59.0


### Testing Datasets with my favorite NBA player, Kobe Bryant

In [2]:
player_data[player_data["name"] == "Kobe Bryant"] # get Kobe Bryant's player data

Unnamed: 0,name,year_start,year_end,position,height,weight,birth_date,college
528,Kobe Bryant,1997,2016,G-F,6-6,212.0,"August 23, 1978",


In [3]:
players[players["Player"] == "Kobe Bryant"] # get Kobe Bryant's player data

Unnamed: 0.1,Unnamed: 0,Player,height,weight,collage,born,birth_city,birth_state
2456,2456,Kobe Bryant,198.0,96.0,,1978.0,Philadelphia,Pennsylvania


In [4]:
seasons_stats[seasons_stats["Player"] == "Kobe Bryant"] # get Kobe Bryant's player data

Unnamed: 0.1,Unnamed: 0,Year,Player,Pos,Age,Tm,G,GS,MP,PER,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
12900,12900,1997.0,Kobe Bryant,SG,18.0,LAL,71.0,6.0,1103.0,14.4,...,0.819,47.0,85.0,132.0,91.0,49.0,23.0,112.0,102.0,539.0
13479,13479,1998.0,Kobe Bryant,SG,19.0,LAL,79.0,1.0,2056.0,18.5,...,0.794,79.0,163.0,242.0,199.0,74.0,40.0,157.0,180.0,1220.0
14021,14021,1999.0,Kobe Bryant,SG,20.0,LAL,50.0,50.0,1896.0,18.9,...,0.839,53.0,211.0,264.0,190.0,72.0,50.0,157.0,153.0,996.0
14537,14537,2000.0,Kobe Bryant,SG,21.0,LAL,66.0,62.0,2524.0,21.7,...,0.821,108.0,308.0,416.0,323.0,106.0,62.0,182.0,220.0,1485.0
15028,15028,2001.0,Kobe Bryant,SG,22.0,LAL,68.0,68.0,2783.0,24.5,...,0.853,104.0,295.0,399.0,338.0,114.0,43.0,220.0,222.0,1938.0
15578,15578,2002.0,Kobe Bryant,SG,23.0,LAL,80.0,80.0,3063.0,23.2,...,0.829,112.0,329.0,441.0,438.0,118.0,35.0,223.0,228.0,2019.0
16070,16070,2003.0,Kobe Bryant,SG,24.0,LAL,82.0,82.0,3401.0,26.2,...,0.843,106.0,458.0,564.0,481.0,181.0,67.0,288.0,218.0,2461.0
16576,16576,2004.0,Kobe Bryant,SG,25.0,LAL,65.0,64.0,2447.0,23.7,...,0.852,103.0,256.0,359.0,330.0,112.0,28.0,171.0,176.0,1557.0
17159,17159,2005.0,Kobe Bryant,SG,26.0,LAL,66.0,66.0,2689.0,23.3,...,0.816,95.0,297.0,392.0,398.0,86.0,53.0,270.0,174.0,1819.0
17742,17742,2006.0,Kobe Bryant,SG,27.0,LAL,80.0,80.0,3277.0,28.0,...,0.85,71.0,354.0,425.0,360.0,147.0,30.0,250.0,233.0,2832.0


## Clean Data

In [5]:
seasons_stats = seasons_stats[~seasons_stats["Player"].isnull()] #remove rows where Player column is null
print(seasons_stats.shape)

(24624, 53)


In [6]:
players = players[~players["Player"].isnull()]
print(seasons_stats.shape)

(24624, 53)


### Rename columns

In [7]:
seasons_stats = seasons_stats.rename(columns = {"Unnamed: 0": "id"}) #rename 'Unnamed: 0' column to 'id'
seasons_stats.head()

Unnamed: 0,id,Year,Player,Pos,Age,Tm,G,GS,MP,PER,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,0,1950.0,Curly Armstrong,G-F,31.0,FTW,63.0,,,,...,0.705,,,,176.0,,,,217.0,458.0
1,1,1950.0,Cliff Barker,SG,29.0,INO,49.0,,,,...,0.708,,,,109.0,,,,99.0,279.0
2,2,1950.0,Leo Barnhorst,SF,25.0,CHS,67.0,,,,...,0.698,,,,140.0,,,,192.0,438.0
3,3,1950.0,Ed Bartels,F,24.0,TOT,15.0,,,,...,0.559,,,,20.0,,,,29.0,63.0
4,4,1950.0,Ed Bartels,F,24.0,DNN,13.0,,,,...,0.548,,,,20.0,,,,27.0,59.0


In [8]:
players = players.rename(columns = {"Unnamed: 0": "id"}) #rename 'Unnamed: 0' column to 'id'
players.head()

Unnamed: 0,id,Player,height,weight,collage,born,birth_city,birth_state
0,0,Curly Armstrong,180.0,77.0,Indiana University,1918.0,,
1,1,Cliff Barker,188.0,83.0,University of Kentucky,1921.0,Yorktown,Indiana
2,2,Leo Barnhorst,193.0,86.0,University of Notre Dame,1924.0,,
3,3,Ed Bartels,196.0,88.0,North Carolina State University,1925.0,,
4,4,Ralph Beard,178.0,79.0,University of Kentucky,1927.0,Hardinsburg,Kentucky


### Ensure NBA players with the same name are considered as different players
- Group `player_data` by name and count the rows in each group
- Keep only the first count column (`year_start`), which now holds the number of players sharing each name

In [9]:
num_players = player_data.groupby("name").count() 
num_players = num_players.iloc[:, :1] # keep only the first column (year_start), which now holds the count per name
num_players = num_players.reset_index() # reset index

num_players[num_players["name"] == "Charles Smith"] # Example of a player with duplicate names

Unnamed: 0,name,year_start
710,Charles Smith,3


#### Rename columns to match other datasets
`player_data` uses `'name'`, while the `players` and `seasons_stats` datasets use `'Player'`.

In [10]:
num_players.columns = ["Player", "count"]
num_players.head()

Unnamed: 0,Player,count
0,A.C. Green,1
1,A.J. Bramlett,1
2,A.J. English,1
3,A.J. Guyton,1
4,A.J. Hammons,1


### Players with duplicated names

In [11]:
duplicated_names = num_players[num_players["count"] > 1]

print(len(duplicated_names), " total of names representing more than 1 player")
duplicated_names.head()

47  total of names representing more than 1 player


Unnamed: 0,Player,count
314,Bill Bradley,2
420,Bob Duffy,2
494,Bobby Jones,2
505,Bobby Wilson,2
680,Cedric Henderson,2


In [12]:
seasons_stats = seasons_stats.iloc[:,1:]
seasons_stats = seasons_stats.drop(["blanl", "blank2"], axis=1) # drop these columns because they contain only null values ("blanl" is the column's actual name in the CSV)

In [13]:
seasons_stats.columns

Index(['Year', 'Player', 'Pos', 'Age', 'Tm', 'G', 'GS', 'MP', 'PER', 'TS%',
       '3PAr', 'FTr', 'ORB%', 'DRB%', 'TRB%', 'AST%', 'STL%', 'BLK%', 'TOV%',
       'USG%', 'OWS', 'DWS', 'WS', 'WS/48', 'OBPM', 'DBPM', 'BPM', 'VORP',
       'FG', 'FGA', 'FG%', '3P', '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%',
       'FT', 'FTA', 'FT%', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV',
       'PF', 'PTS'],
      dtype='object')

In [14]:
player_data["id"] = player_data.index # assign index as the id

In [15]:
kobe_stats = seasons_stats[seasons_stats["Player"] == 'Kobe Bryant']
kobe_stats["Year"].iloc[0] - kobe_stats["Age"].iloc[0]
# Kobe was born in 1978, so we need to...
# subtract 1 more to match his actual birth year

1979.0
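
The off-by-one above can be checked by hand. This is a minimal sketch, assuming `Age` is recorded partway through the season, before the player's birthday that calendar year (an assumption; the dataset page is the authority). `estimate_birth_year` is a hypothetical helper, not part of the notebook.

```python
def estimate_birth_year(season_year: float, age: float) -> float:
    """Approximate a birth year from one season row (hypothetical helper)."""
    # Year - Age overshoots by one when the player's birthday falls
    # later in the calendar year than the date Age is measured on.
    return season_year - age - 1


# Kobe Bryant's first row above: Year = 1997, Age = 18
kobe_born = estimate_birth_year(1997.0, 18.0)
print(kobe_born)  # 1978.0, matching `born` in the players dataset
```

Under the same assumption, a player with a birthday very early in the year may come out one year too early, and such rows would then fail to match on `born` later on.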

### Create a new column for year born

In [16]:
# remove rows with null values on column Age and Year
seasons_stats = seasons_stats[~seasons_stats["Year"].isnull()]
seasons_stats = seasons_stats[~seasons_stats["Age"].isnull()]

# create a born column
seasons_stats["born"] = (seasons_stats["Year"] - seasons_stats["Age"] - 1) #.astype("int16") # subtract 1

In [17]:
players = players[~players["born"].isnull()]       #remove players with no birth year

players_born = players[["Player", "born"]]         #select only player and birth year
players_born.head()

Unnamed: 0,Player,born
0,Curly Armstrong,1918.0
1,Cliff Barker,1921.0
2,Leo Barnhorst,1924.0
3,Ed Bartels,1925.0
4,Ralph Beard,1927.0


In [18]:
player_data = player_data[~player_data["birth_date"].isnull()]  #remove player_data with no birth_date
for i, row in player_data.iterrows():                           #loop through each row
    birth_year = float(row["birth_date"].split(",")[1])         #get the year from birth_date
    player_data.loc[i, "born"] = birth_year                     #assign to born column
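
The row-by-row loop above can also be written with pandas string methods. This is a sketch on a small stand-in frame called `demo` (a hypothetical name); the two birth dates are copied from `player_data` rows shown in this notebook and the third row is made up.

```python
import pandas as pd

# Hypothetical three-row stand-in for player_data
demo = pd.DataFrame({
    "name": ["Alaa Abdelnaby", "Zaid Abdul-Aziz", "No Birthdate"],
    "birth_date": ["June 24, 1968", "April 7, 1946", None],
})

# Same filter as the cell above: drop rows without a birth date
demo = demo.dropna(subset=["birth_date"]).copy()

# "June 24, 1968" -> ["June 24", " 1968"] -> "1968" -> 1968.0
demo["born"] = demo["birth_date"].str.split(",").str[1].str.strip().astype(float)
print(demo[["name", "born"]])
```

The `.copy()` call avoids assigning into a filtered view, which is what the loop version does when it writes to `player_data.loc[i, "born"]`.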

In [19]:
player_data_born = player_data[["name", "born"]]                #only get the name and born columns
player_data_born.columns = ["Player", "born"]                   #rename name column to Player
player_data.head()

Unnamed: 0,name,year_start,year_end,position,height,weight,birth_date,college,id,born
0,Alaa Abdelnaby,1991,1995,F-C,6-10,240.0,"June 24, 1968",Duke University,0,1968.0
1,Zaid Abdul-Aziz,1969,1978,C-F,6-9,235.0,"April 7, 1946",Iowa State University,1,1946.0
2,Kareem Abdul-Jabbar,1970,1989,C,7-2,225.0,"April 16, 1947","University of California, Los Angeles",2,1947.0
3,Mahmoud Abdul-Rauf,1991,2001,G,6-1,162.0,"March 9, 1969",Louisiana State University,3,1969.0
4,Tariq Abdul-Wahad,1998,2003,F,6-6,223.0,"November 3, 1974",San Jose State University,4,1974.0


### Concatenate birth year from 2 datasets

In [20]:
born = pd.concat([players_born, player_data_born]) #concatenate 2 datasets
born = born.drop_duplicates()                      #remove duplicates
born = born.reset_index()                          #reset index
born['id'] = born["index"]                         #assign id as the index
born = born.drop("index", axis=1)                  #remove index column

born.head()

Unnamed: 0,Player,born,id
0,Curly Armstrong,1918.0,0
1,Cliff Barker,1921.0,1
2,Leo Barnhorst,1924.0,2
3,Ed Bartels,1925.0,3
4,Ralph Beard,1927.0,4


### Merge born into seasons_stats
This gives every season row an `id` that always refers to the same player, even when two players share a name

In [21]:
data = seasons_stats.merge(born, on=["Player", "born"]) #merge born dataset's Player and born columns to seasons_stats
data.head()

Unnamed: 0,Year,Player,Pos,Age,Tm,G,GS,MP,PER,TS%,...,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,born,id
0,1950.0,Curly Armstrong,G-F,31.0,FTW,63.0,,,,0.368,...,,,176.0,,,,217.0,458.0,1918.0,0
1,1951.0,Curly Armstrong,G-F,32.0,FTW,38.0,,,,0.372,...,,89.0,77.0,,,,97.0,202.0,1918.0,0
2,1950.0,Leo Barnhorst,SF,25.0,CHS,67.0,,,,0.394,...,,,140.0,,,,192.0,438.0,1924.0,2
3,1951.0,Leo Barnhorst,SF,26.0,INO,68.0,,,,0.377,...,,296.0,218.0,,,,197.0,546.0,1924.0,2
4,1952.0,Leo Barnhorst,SF,27.0,INO,66.0,,2344.0,15.9,0.419,...,,430.0,255.0,,,,196.0,820.0,1924.0,2
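
To see why the merge uses both `Player` and `born`, here is a small sketch with made-up rows for two different players who share one name. All names and values in it are hypothetical.

```python
import pandas as pd

# Made-up rows: two different players with the same name
stats_demo = pd.DataFrame({
    "Player": ["Sam Example", "Sam Example"],
    "Year": [1985.0, 2005.0],
    "born": [1960.0, 1980.0],
})
born_demo = pd.DataFrame({
    "Player": ["Sam Example", "Sam Example"],
    "born": [1960.0, 1980.0],
    "id": [10, 11],
})

# Merging on both keys keeps the two players apart
merged = stats_demo.merge(born_demo, on=["Player", "born"])
print(merged[["Player", "Year", "id"]])

# Merging on the name alone pairs every season with every namesake
by_name_only = stats_demo.merge(born_demo, on="Player")
print(len(merged), len(by_name_only))  # 2 4
```

Because the merge is an inner join, any season row whose estimated `born` does not match a known birth year is dropped.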


#### Remove players that played with multiple teams in a single season
- These players are marked as "TOT" as their team
- We assume this is safe because MVP-caliber players are rarely traded mid-season

In [22]:
data = data[data["Tm"] !="TOT"]

## 1. Let's begin by adding important features to players
- PPG - points per game
- APG - assists per game
- RPG - rebounds per game
- SPG - steals per game
- BPG - blocks per game
- FPG - fouls per game
- TOVPG - turnovers per game

In [23]:
data["PPG"] = data["PTS"] / data["G"]
data["APG"] = data["AST"] / data["G"]
data["RPG"] = data["TRB"] / data["G"]
data["SPG"] = data["STL"] / data["G"]
data["BPG"] = data["BLK"] / data["G"]
data["FPG"] = data["PF"] / data["G"]
data["TOVPG"] = data["TOV"] / data["G"]
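
The seven assignments above all follow the same pattern, so they could also be driven by a mapping. This is a sketch on made-up totals; `per_game` and `demo` are hypothetical names.

```python
import pandas as pd

# Made-up season totals for two players
demo = pd.DataFrame({
    "G": [80.0, 50.0],
    "PTS": [2000.0, 500.0],
    "AST": [400.0, 100.0],
})

# new per-game column -> season-total column it is derived from
per_game = {"PPG": "PTS", "APG": "AST"}
for new_col, total_col in per_game.items():
    demo[new_col] = demo[total_col] / demo["G"]

print(demo[["PPG", "APG"]])
```

Adding another per-game stat then only needs one more entry in the mapping.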

## 2. Adding MVP Players from 1956-2018
- List of [NBA MVP Award Winners](https://www.nba.com/history/awards/mvp)

In [24]:
mvp_players = {"Bob Pettit*": [1956, 1959],
                  "Bob Cousy*": [1957],
                  "Bill Russell*": [1958, 1961, 1962, 1963, 1965],
                  "Wilt Chamberlain*": [1960, 1966, 1967, 1968],
                  "Oscar Robertson*": [1964],
                  "Wes Unseld*": [1969],
                  "Willis Reed*": [1970],
                  "Kareem Abdul-Jabbar*": [1971, 1972, 1974, 1976, 1977, 1980],
                  "Dave Cowens*": [1973],
                  "Bob McAdoo*": [1975],
                  "Bill Walton*": [1978],
                  "Moses Malone*": [1979, 1982, 1983],
                  "Julius Erving*": [1981],
                  "Larry Bird*": [1984, 1985, 1986],
                  "Magic Johnson*": [1987, 1989, 1990],
                  "Michael Jordan*": [1988, 1991, 1992, 1996, 1998],
                  "Charles Barkley*": [1993],
                  "Hakeem Olajuwon*": [1994],
                  "David Robinson*": [1995],
                  "Karl Malone*": [1997, 1999],
                  "Shaquille O'Neal*": [2000],  # apostrophe, not an escaped double quote, so the name matches the dataset
                  "Allen Iverson*": [2001],
                  "Tim Duncan": [2002, 2003],
                  "Kevin Garnett": [2004],
                  "Steve Nash": [2005, 2006],
                  "Dirk Nowitzki": [2007],
                  "Kobe Bryant": [2008],
                  "LeBron James": [2009, 2010, 2012, 2013],
                  "Derrick Rose": [2011],
                  "Kevin Durant": [2014],
                  "Stephen Curry": [2015, 2016],
                  "Russell Westbrook": [2017],
                  "James Harden": [2018]
              }

## 3. Assign the MVP on our data

In [25]:
data['MVP'] = 0                       # make a new column called MVP and have all its values 0
for i, row in data.iterrows():        # loop through each row in our data
    for k, v in mvp_players.items():  # loop through each MVP
        for year in v:                # loop through each year the player was an MVP
            if row['Player'] != k:
                break
            elif(row['Year'] == year) & (row['Player'] == k):   # if we find the player at the same year...
                data.loc[i, 'MVP'] = 1                          # make their MVP value = 1
                break
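
The triple loop above visits every MVP entry for every row. As an alternative, the dictionary can be flattened into a set of (player, year) pairs and each row checked once. This is a sketch; `mvp_demo` copies two entries from `mvp_players`, and the demo rows are made up.

```python
import pandas as pd

# Two entries copied from mvp_players, plus made-up demo rows
mvp_demo = {"Kobe Bryant": [2008], "Steve Nash": [2005, 2006]}
demo = pd.DataFrame({
    "Player": ["Kobe Bryant", "Steve Nash", "Kobe Bryant"],
    "Year": [2008.0, 2006.0, 2006.0],
})

# Flatten the dict into a set of (player, year) pairs for fast lookups
mvp_pairs = {
    (player, year)
    for player, years in mvp_demo.items()
    for year in years
}

# 1 if the row's (player, year) pair is an MVP season, else 0
demo["MVP"] = [
    int((player, int(year)) in mvp_pairs)
    for player, year in zip(demo["Player"], demo["Year"])
]
print(demo)
```

The labels should be the same as the loop's, since both mark a row only when the name and the year match.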

In [26]:
data[data["MVP"]==1].tail() #show the most recent MVPs

Unnamed: 0,Year,Player,Pos,Age,Tm,G,GS,MP,PER,TS%,...,born,id,PPG,APG,RPG,SPG,BPG,FPG,TOVPG,MVP
19572,2014.0,Kevin Durant,SF,25.0,OKC,81.0,81.0,3122.0,29.8,0.635,...,1988.0,3220,32.012346,5.493827,7.382716,1.271605,0.728395,2.148148,3.518519,1
20181,2011.0,Derrick Rose,PG,22.0,CHI,81.0,81.0,3026.0,23.5,0.55,...,1988.0,3313,25.012346,7.691358,4.074074,1.049383,0.62963,1.679012,3.432099,1
20260,2017.0,Russell Westbrook,PG,28.0,OKC,81.0,81.0,2802.0,30.6,0.554,...,1988.0,3325,31.580247,10.37037,10.666667,1.641975,0.382716,2.345679,5.407407,1
20358,2015.0,Stephen Curry,PG,26.0,GSW,80.0,80.0,2613.0,28.0,0.638,...,1988.0,3343,23.75,7.7375,4.2625,2.0375,0.2,1.975,3.1125,1
20359,2016.0,Stephen Curry,PG,27.0,GSW,79.0,79.0,2700.0,31.5,0.669,...,1988.0,3343,30.063291,6.670886,5.443038,2.139241,0.189873,2.037975,3.316456,1


## 4. Add Team Wins (1980-2018) because this is an important parameter
- The best player on the team with the most wins often becomes the MVP
- Source: [Land of Basketball](https://www.landofbasketball.com/nba_teams_year_by_year.htm)

In [27]:
data = data[data.Year >= 1980]
data.sort_values(by="Tm")["Tm"].unique()

array(['ATL', 'BOS', 'BRK', 'CHA', 'CHH', 'CHI', 'CHO', 'CLE', 'DAL',
       'DEN', 'DET', 'GSW', 'HOU', 'IND', 'KCK', 'LAC', 'LAL', 'MEM',
       'MIA', 'MIL', 'MIN', 'NJN', 'NOH', 'NOK', 'NOP', 'NYK', 'OKC',
       'ORL', 'PHI', 'PHO', 'POR', 'SAC', 'SAS', 'SDC', 'SEA', 'TOR',
       'UTA', 'VAN', 'WAS', 'WSB'], dtype=object)

In [28]:
teams_wins = {"ATL": {1980:50, 1981:31, 1982:42, 1983:43, 1984:40, 1985:34, 1986:50, 1987:57, 1988:50, 1989:52, 1990:41, 1991:43, 1992:38, 1993:43, 1994:57, 1995:42, 1996:46, 1997:56, 1998:50, 1999:31, 2000:28, 2001:25, 2002:33, 2003:35, 2004:28, 2005:13, 2006:26, 2007:30, 2008:37, 2009:47, 2010:53, 2011:44, 2012:40, 2013:44, 2014:38, 2015:60, 2016:48, 2017:43},
             "BOS": {1980:61, 1981:62, 1982:63, 1983:56, 1984:62, 1985:63, 1986:67, 1987:59, 1988:57, 1989:42, 1990:52, 1991:56, 1992:51, 1993:48, 1994:32, 1995:35, 1996:33, 1997:15, 1998:36, 1999:19, 2000:35, 2001:36, 2002:49, 2003:44, 2004:36, 2005:45, 2006:33, 2007:24, 2008:66, 2009:62, 2010:50, 2011:56, 2012:39, 2013:41, 2014:25, 2015:40, 2016:48, 2017:53},
             "BRK": {2013:49, 2014:44, 2015:38, 2016:21, 2017:20},
             "CHA": {1989:20, 1990:19, 1991:26, 1992:31, 1993:44, 1994:41, 1995:50, 1996:41, 1997:54, 1998:51, 1999:26, 2000:49, 2001:46, 2002:44, 2005:18, 2006:26, 2007:33, 2008:32, 2009:35, 2010:44, 2011:34, 2012:7, 2013:21, 2014:43},
             "NJN": {2000:31, 2001:26, 2002:52, 2003:49, 2004:47, 2005:42, 2006:49, 2007:41, 2008:34, 2009:34, 2010:12, 2011:24, 2012:22},
             "CHH": {2000:49, 2001:46, 2002:44},
             "CHI": {1980:30, 1981:45, 1982:34, 1983:28, 1984:27, 1985:38, 1986:30, 1987:40, 1988:50, 1989:47, 1990:55, 1991:61, 1992:67, 1993:57, 1994:55, 1995:47, 1996:72, 1997:69, 1998:62, 1999:13, 2000:17, 2001:15, 2002:21, 2003:30, 2004:23, 2005:47, 2006:41, 2007:49, 2008:33, 2009:41, 2010:41, 2011:62, 2012:50, 2013:45, 2014:48, 2015:50, 2016:42, 2017:41},
             "CHO": {2015:33, 2016:48, 2017:36},
             "CLE": {1980:37, 1981:28, 1982:15, 1983:23, 1984:28, 1985:36, 1986:29, 1987:31, 1988:42, 1989:57, 1990:42, 1991:33, 1992:57, 1993:54, 1994:47, 1995:43, 1996:47, 1997:42, 1998:47, 1999:22, 2000:32, 2001:30, 2002:29, 2003:17, 2004:35, 2005:42, 2006:50, 2007:50, 2008:45, 2009:66, 2010:61, 2011:19, 2012:21, 2013:24, 2014:33, 2015:53, 2016:57, 2017:51},
             "DAL": {1981:15, 1982:28, 1983:38, 1984:43, 1985:44, 1986:44, 1987:55, 1988:53, 1989:38, 1990:47, 1991:28, 1992:22, 1993:11, 1994:13, 1995:36, 1996:26, 1997:24, 1998:20, 1999:19, 2000:40, 2001:53, 2002:57, 2003:60, 2004:52, 2005:58, 2006:60, 2007:67, 2008:51, 2009:50, 2010:55, 2011:57, 2012:36, 2013:41, 2014:49, 2015:50, 2016:42, 2017:33},
             "DEN": {1980:30, 1981:37, 1982:46, 1983:45, 1984:38, 1985:52, 1986:47, 1987:37, 1988:54, 1989:44, 1990:43, 1991:20, 1992:24, 1993:36, 1994:42, 1995:41, 1996:35, 1997:21, 1998:11, 1999:14, 2000:35, 2001:40, 2002:27, 2003:17, 2004:43, 2005:49, 2006:44, 2007:45, 2008:50, 2009:54, 2010:53, 2011:50, 2012:38, 2013:57, 2014:36, 2015:30, 2016:33, 2017:40},
             "DET": {1980:16, 1981:21, 1982:39, 1983:37, 1984:49, 1985:46, 1986:46, 1987:52, 1988:54, 1989:63, 1990:69, 1991:50, 1992:48, 1993:40, 1994:20, 1995:28, 1996:46, 1997:54, 1998:37, 1999:29, 2000:42, 2001:32, 2002:50, 2003:50, 2004:54, 2005:54, 2006:64, 2007:53, 2008:59, 2009:39, 2010:27, 2011:30, 2012:25, 2013:29, 2014:29, 2015:32, 2016:44, 2017:37},
             "GSW": {1980:24, 1981:39, 1982:45, 1983:30, 1984:37, 1985:22, 1986:30, 1987:42, 1988:20, 1989:43, 1990:37, 1991:44, 1992:55, 1993:34, 1994:50, 1995:26, 1996:36, 1997:30, 1998:19, 1999:21, 2000:19, 2001:17, 2002:21, 2003:38, 2004:37, 2005:34, 2006:34, 2007:42, 2008:48, 2009:29, 2010:26, 2011:36, 2012:23, 2013:47, 2014:51, 2015:67, 2016:73, 2017:67},
             "HOU": {1980:41, 1981:40, 1982:46, 1983:14, 1984:29, 1985:48, 1986:51, 1987:42, 1988:46, 1989:45, 1990:41, 1991:52, 1992:42, 1993:55, 1994:58, 1995:47, 1996:48, 1997:57, 1998:41, 1999:31, 2000:34, 2001:45, 2002:28, 2003:43, 2004:45, 2005:51, 2006:34, 2007:52, 2008:55, 2009:53, 2010:42, 2011:43, 2012:34, 2013:45, 2014:54, 2015:56, 2016:41, 2017:55},
             "IND": {1980:37, 1981:44, 1982:35, 1983:20, 1984:26, 1985:22, 1986:26, 1987:41, 1988:38, 1989:28, 1990:42, 1991:41, 1992:40, 1993:41, 1994:47, 1995:52, 1996:52, 1997:39, 1998:58, 1999:33, 2000:56, 2001:41, 2002:42, 2003:48, 2004:61, 2005:44, 2006:41, 2007:35, 2008:36, 2009:36, 2010:32, 2011:37, 2012:42, 2013:49, 2014:56, 2015:38, 2016:45, 2017:42},
             "LAC": {1980:35, 1981:36, 1982:17, 1983:25, 1984:30, 1985:31, 1986:32, 1987:12, 1988:17, 1989:21, 1990:30, 1991:31, 1992:45, 1993:41, 1994:27, 1995:17, 1996:29, 1997:36, 1998:17, 1999:9, 2000:15, 2001:31, 2002:39, 2003:27, 2004:28, 2005:37, 2006:47, 2007:40, 2008:23, 2009:19, 2010:29, 2011:32, 2012:40, 2013:56, 2014:57, 2015:56, 2016:53, 2017:51},
             "LAL": {1980:60, 1981:54, 1982:57, 1983:58, 1984:54, 1985:62, 1986:62, 1987:65, 1988:62, 1989:57, 1990:63, 1991:58, 1992:43, 1993:39, 1994:33, 1995:48, 1996:53, 1997:56, 1998:61, 1999:31, 2000:67, 2001:56, 2002:58, 2003:50, 2004:56, 2005:34, 2006:45, 2007:42, 2008:57, 2009:65, 2010:57, 2011:57, 2012:41, 2013:45, 2014:27, 2015:21, 2016:17, 2017:26},
             "MEM": {1996:15, 1997:14, 1998:19, 1999:8, 2000:22, 2001:23, 2002:23, 2003:28, 2004:50, 2005:45, 2006:49, 2007:22, 2008:22, 2009:24, 2010:40, 2011:46, 2012:41, 2013:56, 2014:50, 2015:55, 2016:42, 2017:43},
             "VAN": {2000:22, 2001:23},
             "MIA": {1989:15, 1990:18, 1991:24, 1992:38, 1993:36, 1994:42, 1995:32, 1996:42, 1997:61, 1998:55, 1999:33, 2000:52, 2001:50, 2002:36, 2003:25, 2004:42, 2005:59, 2006:52, 2007:44, 2008:15, 2009:43, 2010:47, 2011:58, 2012:46, 2013:66, 2014:54, 2015:37, 2016:48, 2017:41},
             "MIL": {1980:49, 1981:60, 1982:55, 1983:51, 1984:50, 1985:59, 1986:57, 1987:50, 1988:42, 1989:49, 1990:44, 1991:48, 1992:31, 1993:28, 1994:20, 1995:34, 1996:25, 1997:33, 1998:36, 1999:28, 2000:42, 2001:52, 2002:41, 2003:42, 2004:41, 2005:30, 2006:40, 2007:28, 2008:26, 2009:34, 2010:46, 2011:35, 2012:31, 2013:38, 2014:15, 2015:41, 2016:33, 2017:42},
             "MIN": {1990:22, 1991:29, 1992:15, 1993:19, 1994:20, 1995:21, 1996:26, 1997:40, 1998:45, 1999:25, 2000:50, 2001:47, 2002:50, 2003:51, 2004:58, 2005:44, 2006:33, 2007:32, 2008:22, 2009:24, 2010:15, 2011:17, 2012:26, 2013:31, 2014:40, 2015:16, 2016:29, 2017:31},
             "NOH": {2003:47, 2004:41, 2005:18, 2008:56, 2009:49, 2010:37, 2011:46, 2012:21, 2013:27},
             "NOK": {2006:38, 2007:39},
             "NOP": {2014:34, 2015:45, 2016:30, 2017:34},
             "NYK": {1980:39, 1981:50, 1982:33, 1983:44, 1984:47, 1985:24, 1986:23, 1987:24, 1988:38, 1989:52, 1990:45, 1991:39, 1992:51, 1993:60, 1994:57, 1995:55, 1996:47, 1997:57, 1998:43, 1999:27, 2000:50, 2001:48, 2002:30, 2003:37, 2004:39, 2005:33, 2006:23, 2007:33, 2008:23, 2009:32, 2010:29, 2011:42, 2012:36, 2013:54, 2014:37, 2015:17, 2016:32, 2017:31},
             "OKC": {1980:56, 1981:34, 1982:52, 1983:48, 1984:42, 1985:31, 1986:31, 1987:39, 1988:44, 1989:47, 1990:41, 1991:41, 1992:47, 1993:55, 1994:63, 1995:57, 1996:64, 1997:57, 1998:61, 1999:25, 2009:23, 2010:50, 2011:55, 2012:47, 2013:60, 2014:59, 2015:45, 2016:55, 2017:47},
             "ORL": {1990:18, 1991:31, 1992:21, 1993:41, 1994:50, 1995:57, 1996:60, 1997:45, 1998:41, 1999:33, 2000:41, 2001:43, 2002:44, 2003:42, 2004:21, 2005:36, 2006:36, 2007:40, 2008:52, 2009:59, 2010:59, 2011:52, 2012:37, 2013:20, 2014:23, 2015:25, 2016:35, 2017:29},
             "PHI": {1980:59, 1981:62, 1982:58, 1983:65, 1984:52, 1985:58, 1986:54, 1987:45, 1988:36, 1989:46, 1990:53, 1991:44, 1992:35, 1993:26, 1994:25, 1995:24, 1996:18, 1997:22, 1998:31, 1999:28, 2000:49, 2001:56, 2002:43, 2003:48, 2004:33, 2005:43, 2006:38, 2007:35, 2008:40, 2009:41, 2010:27, 2011:41, 2012:35, 2013:34, 2014:19, 2015:18, 2016:10, 2017:28},
             "PHO": {1980:55, 1981:57, 1982:46, 1983:53, 1984:41, 1985:36, 1986:32, 1987:36, 1988:28, 1989:55, 1990:54, 1991:55, 1992:53, 1993:62, 1994:56, 1995:59, 1996:41, 1997:40, 1998:56, 1999:27, 2000:53, 2001:51, 2002:36, 2003:44, 2004:29, 2005:62, 2006:54, 2007:61, 2008:55, 2009:46, 2010:54, 2011:40, 2012:33, 2013:25, 2014:48, 2015:39, 2016:23, 2017:24},
             "POR": {1980:38, 1981:45, 1982:42, 1983:46, 1984:48, 1985:42, 1986:40, 1987:49, 1988:53, 1989:39, 1990:59, 1991:63, 1992:57, 1993:51, 1994:47, 1995:44, 1996:44, 1997:49, 1998:46, 1999:35, 2000:59, 2001:50, 2002:49, 2003:50, 2004:41, 2005:27, 2006:21, 2007:32, 2008:41, 2009:54, 2010:50, 2011:48, 2012:28, 2013:33, 2014:54, 2015:51, 2016:44, 2017:41},
             "SAC": {1980:47, 1981:40, 1982:30, 1983:45, 1984:38, 1985:31, 1986:37, 1987:29, 1988:24, 1989:27, 1990:23, 1991:25, 1992:29, 1993:25, 1994:28, 1995:39, 1996:39, 1997:34, 1998:27, 1999:27, 2000:44, 2001:55, 2002:61, 2003:59, 2004:55, 2005:50, 2006:44, 2007:33, 2008:38, 2009:17, 2010:25, 2011:24, 2012:22, 2013:28, 2014:28, 2015:29, 2016:33, 2017:32},
             "SAS": {1980:41, 1981:52, 1982:48, 1983:53, 1984:37, 1985:41, 1986:35, 1987:28, 1988:31, 1989:21, 1990:56, 1991:55, 1992:47, 1993:4, 1994:55, 1995:62, 1996:59, 1997:20, 1998:56, 1999:37, 2000:53, 2001:58, 2002:58, 2003:60, 2004:57, 2005:59, 2006:63, 2007:58, 2008:56, 2009:54, 2010:50, 2011:61, 2012:50, 2013:58, 2014:62, 2015:55, 2016:67, 2017:61},
             "SEA": {2000:45, 2001:44, 2002:45, 2003:40, 2004:37, 2005:52, 2006:35, 2007:31, 2008:20},
             "TOR": {1996:21, 1997:30, 1998:16, 1999:23, 2000:45, 2001:47, 2002:42, 2003:24, 2004:33, 2005:33, 2006:27, 2007:47, 2008:41, 2009:33, 2010:40, 2011:22, 2012:23, 2013:34, 2014:48, 2015:49, 2016:56, 2017:51},
             "UTA": {1980:24, 1981:28, 1982:25, 1983:30, 1984:45, 1985:41, 1986:42, 1987:44, 1988:47, 1989:51, 1990:55, 1991:54, 1992:55, 1993:47, 1994:53, 1995:60, 1996:55, 1997:64, 1998:62, 1999:37, 2000:55, 2001:53, 2002:44, 2003:47, 2004:42, 2005:26, 2006:41, 2007:51, 2008:54, 2009:48, 2010:53, 2011:39, 2012:36, 2013:43, 2014:25, 2015:38, 2016:40, 2017:51},
             "WAS": {1980:39, 1981:39, 1982:43, 1983:42, 1984:35, 1985:40, 1986:39, 1987:42, 1988:38, 1989:40, 1990:31, 1991:30, 1992:25, 1993:22, 1994:24, 1995:21, 1996:39, 1997:44, 1998:42, 1999:18, 2000:29, 2001:19, 2002:37, 2003:37, 2004:25, 2005:45, 2006:42, 2007:41, 2008:43, 2009:19, 2010:26, 2011:23, 2012:20, 2013:29, 2014:44, 2015:46, 2016:41, 2017:49}
             }

## 5. Populate each player's team win that year

In [29]:
for i, row in data.iterrows():                               # loop through each player's row
    for k, v in teams_wins.items():                          # loop through each team
        for year, value in v.items():                        # loop through each win
            if ((row["Tm"] == k) & (row["Year"] == year)):   # if team and year for that player match...
                data.loc[i, "Tm_Wins"] = value               # write the team's win at that year for player

In [30]:
data.head()

Unnamed: 0,Year,Player,Pos,Age,Tm,G,GS,MP,PER,TS%,...,id,PPG,APG,RPG,SPG,BPG,FPG,TOVPG,MVP,Tm_Wins
2522,1980.0,Paul Silas,PF,36.0,SEA,82.0,,1595.0,8.2,0.439,...,644,3.841463,0.804878,5.317073,0.304878,0.060976,1.463415,1.012195,0,
2564,1980.0,Rick Barry*,SF,35.0,HOU,72.0,,1816.0,14.8,0.517,...,648,12.027778,3.722222,3.277778,1.111111,0.388889,2.527778,2.111111,0,41.0
3010,1980.0,Walt Frazier*,PG,34.0,CLE,3.0,,27.0,13.6,0.421,...,714,3.333333,2.666667,1.0,0.666667,0.333333,0.666667,1.333333,0,37.0
3037,1980.0,Phil Jackson*,PF,34.0,NJN,16.0,,194.0,10.7,0.645,...,720,4.0625,0.75,1.5,0.3125,0.25,2.1875,0.5625,0,
3080,1980.0,Earl Monroe*,SG,35.0,NYK,51.0,,633.0,15.3,0.497,...,731,7.411765,1.313725,0.705882,0.411765,0.058824,0.901961,0.54902,0,39.0
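
Some rows above end up with no `Tm_Wins` (for example SEA in 1980), because `teams_wins` has no entry for that team and year. The sketch below does the same lookup without nested loops and then lists the uncovered team-season pairs. `wins_demo` copies two values from `teams_wins`; the demo rows are made up.

```python
import pandas as pd

# Two entries copied from teams_wins; the demo rows are made up
wins_demo = {"HOU": {1980: 41}, "CLE": {1980: 37}}
demo = pd.DataFrame({
    "Tm": ["HOU", "CLE", "SEA"],
    "Year": [1980.0, 1980.0, 1980.0],
})

# Look up wins by (team, year); pairs missing from the dict become NaN
demo["Tm_Wins"] = [
    wins_demo.get(team, {}).get(int(year), float("nan"))
    for team, year in zip(demo["Tm"], demo["Year"])
]

# List the team-season pairs the dictionary does not cover
missing = demo.loc[demo["Tm_Wins"].isna(), ["Tm", "Year"]].drop_duplicates()
print(missing)
```

A similar check could flag implausible totals. For instance, the `"SAS"` entry lists 4 wins for 1993, which may be a truncated value worth re-checking against the source.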


## 6. Select features needed
Id, player, year, player efficiency rating, win shares, box plus/minus, value over replacement, points per game, team wins, MVP

In [31]:
data_mvp = data[['id', 'Player', 'Year', 'PER', 'WS', 'BPM', 'VORP', 'PPG', 'Tm_Wins', 'MVP']]
data_mvp = data_mvp.fillna(0) #make NaN values = 0
data_mvp.head()

Unnamed: 0,id,Player,Year,PER,WS,BPM,VORP,PPG,Tm_Wins,MVP
2522,644,Paul Silas,1980.0,8.2,2.2,-2.3,-0.1,3.841463,0.0,0
2564,648,Rick Barry*,1980.0,14.8,3.4,0.6,1.2,12.027778,41.0,0
3010,714,Walt Frazier*,1980.0,13.6,0.0,-3.4,0.0,3.333333,37.0,0
3037,720,Phil Jackson*,1980.0,10.7,0.5,-1.1,0.0,4.0625,0.0,0
3080,731,Earl Monroe*,1980.0,15.3,1.1,-5.5,-0.6,7.411765,39.0,0


## 7. Begin Classification
- We'll use a random forest classifier because it works well on a dataset with only a few feature columns and thousands of rows

In [50]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier

years = range(1981, 2018)
mvp_years = dict()
results_mvp = pd.DataFrame(columns = ["id", "Year", "MVP"])

for y in years :
    # train : all seasons before year y
    # test : season y
    train = data_mvp[data_mvp["Year"] < y]
    test = data_mvp[data_mvp["Year"] == y]
    X_train = train.drop(["id", "Player", "Year", "MVP"], axis=1)
    y_train = train["MVP"]
    X_test = test.drop(["id", "Player", "Year", "MVP"], axis=1)
    
    # Random Forest

    random_forest = RandomForestClassifier(n_estimators=10)
    random_forest.fit(X_train, y_train)
    y_pred = random_forest.predict(X_test)
    random_forest.score(X_train, y_train)
    acc_random_forest = round(random_forest.score(X_train, y_train) * 100, 2)
    
    pred_proba = random_forest.predict_proba(X_test)
    
    y_pred_proba = pred_proba[:, 1]  # probability of class 1 (MVP) for each test row
    
    mvp_years = pd.DataFrame({
        "id": test["id"],
        "Year": y,
        "MVP": y_pred_proba
        })
    
    results_mvp = pd.concat([results_mvp, mvp_years])

results_mvp["id"] = results_mvp["id"].astype("int")
career_player = data[["id", "Player"]]
results_mvp = results_mvp.merge(career_player, on="id")

results_mvp = results_mvp.drop_duplicates()
results_mvp.tail()

Unnamed: 0,id,Year,MVP,Player
167894,3916,2017,0.0,Isaiah Whitehead
167895,4419,2017,0.0,Troy Williams
167899,3918,2017,0.0,Kyle Wiltjer
167900,3919,2017,0.0,Stephen Zimmerman
167901,3921,2017,0.0,Ivica Zubac


### Most important features/stats

In [51]:
feature_importances = pd.DataFrame(random_forest.feature_importances_, index = X_train.columns,
                                    columns=['importance']).sort_values('importance', ascending=False)
feature_importances

Unnamed: 0,importance
WS,0.353273
Tm_Wins,0.225683
VORP,0.154374
PER,0.099689
PPG,0.095723
BPM,0.071258


### ⭐️⭐️ Higher values in these stats give a player a better chance of becoming MVP, with win shares (WS) and team wins weighing most ⭐️⭐️


## 8. Create MVP Predictions

In [52]:
mvp_list = []

#create a list of dictionaries from mvp_players
for player, years in mvp_players.items():
    for year in years:
        if year > 1980 and year < 2018:
            mvp_dic = {}
            mvp_dic["Year"] = year
            mvp_dic["Actual MVP"] = player
            mvp_list.append(mvp_dic)
            
mvp_predictions = pd.DataFrame(mvp_list) #turn that list into a DataFrame
mvp_predictions= mvp_predictions.sort_values("Year", ascending=False) #sort by year, most recent first

#create predictions columns
mvp_predictions["Prediction 1"] = ""
mvp_predictions["Prediction 2"] = ""
mvp_predictions["Prediction 3"] = ""
mvp_predictions["Prediction 4"] = ""
mvp_predictions["Prediction 5"] = ""

mvp_predictions.head()

Unnamed: 0,Year,Actual MVP,Prediction 1,Prediction 2,Prediction 3,Prediction 4,Prediction 5
36,2017,Russell Westbrook,,,,,
35,2016,Stephen Curry,,,,,
34,2015,Stephen Curry,,,,,
33,2014,Kevin Durant,,,,,
31,2013,LeBron James,,,,,


## 9. Get the top 5 possible MVPs each year
- In case the model's top pick is wrong, I want to see whether one of its next guesses is correct
- The `MVP` column now holds the predicted probability that the player becomes the MVP that year

In [53]:
top_mvp = results_mvp.sort_values("MVP", ascending=False).groupby("Year").head(5)
top_mvp = top_mvp.sort_values("Year", ascending=False)
top_mvp = top_mvp.sort_values(["Year", "MVP"], ascending = (False, False)) #sort by year and MVP chance
top_mvp = top_mvp[["Year", "MVP", "Player"]]
top_mvp.head()

Unnamed: 0,Year,MVP,Player
158589,2017,0.2,Russell Westbrook
159258,2017,0.2,Stephen Curry
161883,2017,0.1,DeMarcus Cousins
164011,2017,0.1,Kawhi Leonard
120925,2017,0.0,Chris Mihm


## 10. Add the top 5 predictions to mvp_predictions

In [54]:
for i, row in mvp_predictions.iterrows():
    prediction_count = 1
    for index, player_row in top_mvp.iterrows():
        year = row["Year"]
        if year == player_row["Year"]: #if mvp_predictions's year == top_mvp's year
            player = player_row["Player"]
            if prediction_count == 1:
                mvp_predictions.loc[i, "Prediction 1"] = player
            elif prediction_count == 2:
                mvp_predictions.loc[i, "Prediction 2"] = player
            elif prediction_count == 3:
                mvp_predictions.loc[i, "Prediction 3"] = player
            elif prediction_count == 4:
                mvp_predictions.loc[i, "Prediction 4"] = player
            elif prediction_count == 5:
                mvp_predictions.loc[i, "Prediction 5"] = player
            prediction_count += 1 #increment prediction_count
mvp_predictions

Unnamed: 0,Year,Actual MVP,Prediction 1,Prediction 2,Prediction 3,Prediction 4,Prediction 5
36,2017,Russell Westbrook,Russell Westbrook,Stephen Curry,DeMarcus Cousins,Kawhi Leonard,Chris Mihm
35,2016,Stephen Curry,Stephen Curry,Kawhi Leonard,Boban Marjanovic,Kevin Durant,Hassan Whiteside
34,2015,Stephen Curry,Stephen Curry,Russell Westbrook,James Harden,Anthony Davis,Kenyon Martin
33,2014,Kevin Durant,Kevin Durant,James Harden,LeBron James,Carmelo Anthony,Chris Mihm
31,2013,LeBron James,LeBron James,Kevin Durant,Kenyon Martin,C.J. Miles,Chris Mihm
30,2012,LeBron James,Greg Monroe,Tony Parker,Chris Mihm,Kenyon Martin,C.J. Miles
32,2011,Derrick Rose,LeBron James,Kevin Love,Pau Gasol,C.J. Miles,Kenyon Martin
29,2010,LeBron James,LeBron James,Kevin Durant,Kobe Bryant,Kenyon Martin,Chris Mihm
28,2009,LeBron James,LeBron James,Kobe Bryant,Dirk Nowitzki,Steven Hill,Chris Paul
27,2008,Kobe Bryant,Kevin Garnett,Chris Paul,Dwight Howard,LeBron James,Desmond Mason


In [55]:
TP = 0
for i, row in mvp_predictions.iterrows():
    if row["Actual MVP"] == row["Prediction 1"]: #if actual MVP is the same as the first prediction...
        TP += 1
print("Accuracy =", TP / len(mvp_predictions))

Accuracy = 0.5405405405405406
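
The accuracy above only counts the first prediction. Since the goal is also to see whether a later guess is right, a top-k hit rate can be computed in the same way. This is a sketch on made-up data; `top_k_hit_rate` is a hypothetical helper and the demo frame only mimics the column layout of `mvp_predictions`.

```python
import pandas as pd

# Made-up stand-in for mvp_predictions with three prediction columns
demo = pd.DataFrame({
    "Actual MVP": ["A", "B", "C", "D"],
    "Prediction 1": ["A", "X", "Y", "D"],
    "Prediction 2": ["Q", "B", "Z", "R"],
    "Prediction 3": ["S", "T", "U", "V"],
})


def top_k_hit_rate(df: pd.DataFrame, k: int) -> float:
    """Share of years whose actual MVP is among the first k predictions."""
    cols = [f"Prediction {i}" for i in range(1, k + 1)]
    # Compare every prediction column with the actual MVP, row by row
    hits = df[cols].eq(df["Actual MVP"], axis=0).any(axis=1)
    return float(hits.mean())


print(top_k_hit_rate(demo, 1))  # 0.5
print(top_k_hit_rate(demo, 2))  # 0.75
```

With `k=1` this is the same quantity as the accuracy printed above.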


## Final Project Options
1. Apply supervised or unsupervised models to a dataset (or problem) you are interested in. Investigate a variety of steps to make the model better, including:
  - Hyper-parameter tuning by Grid-search
  - Check if dataset is balanced or not -> change the threshold
  - Data preprocessing (scaling)
  - Dimensionality reduction (PCA) -> train the model based on X_reduced_train and Y_reduced_train
  - Eliminate unnecessary features -> Feature Engineering
  - Try other models and do the above all steps
2. Read blogs about Feature Engineering and improve your model's performance with a variety of Feature Engineering options:
    - https://towardsdatascience.com/feature-selection-correlation-and-p-value-da8921bfb3cf
    - https://towardsdatascience.com/feature-selection-with-pandas-e3690ad8504b
3. Build a Decision Tree (DT) classifier from scratch (you can use Pandas or any other Python built-in functions) and provide a DT visualization. For any categorical dataset, your function should return the optimal tree with the root and all appropriate leaves, the max_depth of the tree, and the visualized graph. You can follow the steps we explored in class, but the function should work for any dataset, for example the Lens dataset.

Solo projects are for option 1 or 2. Group projects are accepted for option 3.

## Sources
- [DS2.1 Course](https://github.com/Make-School-Courses/DS-2.1-Machine-Learning)
- [Dataset](https://www.kaggle.com/drgilermo/nba-players-stats?select=Seasons_Stats.csv)