<a href="https://colab.research.google.com/github/Ntaylor1027/ML-Project/blob/master/437ML_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Machine Learning Project**

## NBA MVP Predictions

April 15, 2020

Team Members:
- Cole Bennett
- Noah Taylor

---

## Overview

The goal of this project is to create a Machine Learning model that can correctly classify the MVP of NBA seasons.

### Data

The data covers every NBA player from the 1982-2017 seasons along with their advanced statistics (TS%, WS, PER, and more). Each player-season is transformed into a feature vector for use in a Logistic Regression model.

The data sets were found on Kaggle and consist of:
- MVP Names
- All Player Statistics
- All Team Records

The team record for a given year was appended to the vector of every player on that team.
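
As a minimal sketch of how a team record can be appended to a player's feature vector: the player row, the `team_wins` lookup, and the helper name `build_feature_vector` are all hypothetical, and the values are illustrative rather than taken from the Kaggle files.

```python
# Hypothetical advanced stats for one player-season (illustrative values only).
player = {"year": "2010", "team": "CLE", "stats": [0.604, 18.5, 31.1]}  # TS%, WS, PER

# Hypothetical lookup of team records: {year: {team: wins}}.
team_wins = {"2010": {"CLE": 61.0}}

def build_feature_vector(player, team_wins):
    """Return a new list: the player's stats with the team's record as the last feature."""
    record = team_wins[player["year"]][player["team"]]
    return player["stats"] + [record]

features = build_feature_vector(player, team_wins)
print(features)
```

The notebook itself appends the record in place to each player row; returning a new list here just keeps the sketch free of side effects.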

## Method

1. **Gather and Prepare Data**: This step involves creating data structures for our model to use for classifying an MVP for a season. The data is turned into feature vectors and has the team record for that season appended.
2. **Retrieve Testing and Training Data**: Testing and training data is based on the 1982-2017 NBA seasons. Each data point for training is the stats of a player for a particular season. The classes are positive or negative, where positive indicates the player is the MVP of the specific season the stats are in.
3. **Training**: We use logistic regression to train our binary classifier. Because logistic regression maps a linear score through a sigmoid, it yields a probability for how close a player's stats are to MVP status.
4. **Testing**: We test on held-out seasons, each consisting of the stats of every player in that season. The classifier predicts an MVP probability for each player, and we rank the players by that probability. Our primary performance metric is the average position of the actual MVP in this ranking (0 means the actual MVP was ranked first), which shows the margin of error.

To improve accuracy, we apply bagging to the logistic regression classifier, using 10 bags.
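
The ranking metric from step 4 can be sketched on toy numbers. This is an illustration under assumptions, not the notebook's `test` function: `mvp_rank_distance` is a hypothetical helper, the probabilities are invented, and each season is assumed to contain exactly one MVP.

```python
import numpy as np

def mvp_rank_distance(probs, labels):
    """Return the 0-based rank of the true MVP when players are sorted
    by predicted probability in descending order."""
    order = np.argsort(-np.asarray(probs, dtype=float), kind="stable")
    ranked_labels = np.asarray(labels)[order]
    # Assumes exactly one label equals 1 in the season.
    return int(np.argmax(ranked_labels == 1))

# Two hypothetical held-out seasons: (predicted probabilities, true labels).
season_a = ([0.10, 0.90, 0.30], [0, 1, 0])  # true MVP has the highest probability
season_b = ([0.80, 0.20, 0.60], [0, 0, 1])  # true MVP has the second highest

distances = [mvp_rank_distance(p, y) for p, y in (season_a, season_b)]
avg_distance = float(np.mean(distances))
accuracy = sum(d == 0 for d in distances) / len(distances)
print(distances, avg_distance, accuracy)
```

A distance of 0 counts as a correct prediction, so accuracy is the fraction of seasons with distance 0.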


## Mount Drive

In [None]:
from google.colab import drive
import pandas as pd
import numpy as np
import csv
import random

# Mount Drive for CSV Data
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Data Parsing Functions

In [None]:
FILES = ["mvp.csv","Seasons_Stats.csv", "regular-season-stats.csv"]

def read_csv(file_name, header_index, start_index):
  '''
  Read a csv file to gather data
  Input:
    - file_name: name of file to read
    - header_index: index of row describing csv columns
    - start_index: index of first row to record in csv
  Output:
    - instances: list form of csv data 
  '''
  instances = []
  header = []
  with open(f"/content/drive/My Drive/NBA_DATA/{file_name}", "r") as f:
    csv_reader = csv.reader(f, delimiter=",")
    record = False
    for row in csv_reader:
        if record:
          instances.append(row)
          continue
        if not record:
          start_index -= 1
          header_index -= 1
          if header_index == 0:
            header = row
          if start_index == 0:
            record = True
  return header, instances

def get_mvp_name_and_year(mvps):
  '''
  Create a dictionary with key of years and values of mvp winner
  Input: 
    - mvps: list of data collected from read_csv on file "mvp.csv"
  Output:
    - year_dict: dictionary of {year: mvp winner}
  '''
  year_dict = {}
  for mvp in mvps:
    year_dict[mvp[0]] = mvp[2].split("\\")[0]
  return year_dict

def print_player(stats_header, player):
  for entry in range(0,len(stats_header)):
    if "blan" not in stats_header[entry]: # remove blanks from data set
      print(f"{stats_header[entry]}: {player[entry]}")

def create_year_player_dict(players):
  year_players = {}
  for player in players:
    if player[1] not in year_players:
      year_players[player[1]] = [player]
    else:
      if player not in year_players[player[1]]:
        year_players[player[1]].append(player)
  return year_players

def create_TMRec_dict(team_header, team_data):
  '''
  Create a dictionary of every year with a dictionary of every team and records 
  for the values.
  '''
  def get_year(data_entry):
    year = data_entry[1].split('-')[1] # use the argument, not the enclosing loop variable
    if int(year) < 20:
      year = '20' + year
    else:
      year = '19' + year
    return year
  
  TMRec = {}
  
  for entry in team_data:
    year = get_year(entry)
    if int(year) < 1982: # Dont record early data to avoid bottom of csv
      break
    year_dict = {}
    for index in range(3, len(entry)): # The Team index starts at 3
      year_dict[team_header[index]] = entry[index]
    TMRec[year] = year_dict
    
  return TMRec

def append_TMRec(players, team_records, all_teams):
  def convert_team(team):
    '''
    The NBA has had many franchise name and location changes over the years
    To alter our Regular season stats to be labeled under the new franchise
    names we must convert the old franchise label to the current franchise label.
    '''
    teams = {'SEA':'OKC', 'SDC': 'LAC', 'NJN':'BRK',
             'WSB':'WAS', 'KCK':'SAC', 'CHH':'CHO', 
             'VAN':'MEM', 'NOH':'NOP', 'CHA':'CHO', 'NOK':'NOP'}
    return teams[team]
  
  new_player_list = []
  for player in players:
    if player is None: # skip missing rows before indexing into them
      print(player)
      continue
    year = player[1]
    team = player[5]
    if team == 'TOT' or team == '':
      continue
    elif team not in all_teams:
      team = convert_team(team)
    try:
      record = team_records[year][team]
      player.append(record)
      new_player_list.append(player)
    except KeyError: # no record for this year/team; report and drop the player
      print(f"{player[2]}, {team}, {year}")

  return new_player_list

def remove_blank_stats(stats_header, players):
  '''
  Remove every blank column from the header and from each player row (in place).
  Each player is returned exactly once, regardless of how many blank columns exist.
  '''
  entry = 0
  while entry < len(stats_header):
    if "blan" in stats_header[entry]: # remove blank columns from data set
      del stats_header[entry]
      for player in players:
        del player[entry]
    else:
      entry += 1

  return players

def convert_stats_to_float(players):
  new_players = []
  for player in players:
    for entry in range(6, len(player)):
      try:
        player[entry] = float(player[entry])
      except ValueError: # empty or non-numeric cell
        player[entry] = 0.0
    new_players.append(player)
  return new_players

def get_mvps_data(mvps, players):
  def grab_year(entry):
    year = entry.split('-')[1]
    if int(year) < 20:
      year = '20' + year
    else:
      year = '19' + year
    return year
    
  mvps_data = {}
  for key in mvps.keys():
    year = grab_year(key)
    mvp = mvps[key]
    if (int(year) >= 1982 and int(year) <= 2017):
      curr_players = players[year]
      for player in curr_players:
        if mvp in player[2]:
          mvps_data[year] = player  
  return mvps_data
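
Both `get_year` and `grab_year` above convert a season label into the four-digit year in which the season ends. A minimal standalone sketch of that conversion follows. It assumes labels of the form `YYYY-YY`, which is inferred from how the helpers split on `-`; `season_to_year` and `pivot` are hypothetical names.

```python
def season_to_year(season, pivot=20):
    """Map a season label such as '1999-00' to the 4-digit year in which it ends.

    Two-digit suffixes below `pivot` are treated as 20xx, all others as 19xx.
    """
    suffix = season.split("-")[1]
    century = "20" if int(suffix) < pivot else "19"
    return century + suffix

print(season_to_year("1981-82"))
print(season_to_year("1999-00"))
print(season_to_year("2016-17"))
```

The pivot of 20 matches the threshold in the helpers above, so this heuristic only holds for seasons ending before 2020.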

## Developer Notes from Data gathering
Notes: Missing from the data are per-game statistics. These, however, can be derived as total stats / games played.
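
A minimal sketch of that derivation, assuming hypothetical column names (`G`, `PTS`, `AST`) and invented season totals:

```python
def per_game(total, games_played):
    """Return a per-game average, or 0.0 when the player logged no games."""
    return total / games_played if games_played else 0.0

# Hypothetical season totals for one player (illustrative values only).
totals = {"G": 76.0, "PTS": 2258.0, "AST": 651.0}

per_game_stats = {
    name: round(per_game(value, totals["G"]), 1)
    for name, value in totals.items()
    if name != "G"
}
print(per_game_stats)
```

Guarding against zero games matters here because `convert_stats_to_float` replaces missing values with `0.0`.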

**Headers to data:** [mvp_headers, stats_headers] \\
**Data to interact with:** [mvps_data, stats_data] \\
**Data by year dictionary:** [year_players]

Currently we keep track of many details about each player. When analyzing the data, the features that are not important for comparison are name, team, age, and position.

year_players: This variable can be used to get all the players based on the year they played in. Columns are described by stats_headers

mvps_data: This variable is a dictionary of {year : mvp winner} using the data from stats_data.

## Load and Prepare Data

In [None]:
# Get all MVPs and Years
mvp_header, mvp_data = read_csv(FILES[0],1,2)
mvps = get_mvp_name_and_year(mvp_data)

# Get Team records
team_header, team_data = read_csv(FILES[2], 1, 2)
all_teams = team_header[3:]
TMRec = create_TMRec_dict(team_header, team_data)

# Get all player stats for years 1982 - 2017
# start indexes for years
# 1980 : 5728; No GS on all players
# 1982 : 6450; GS starts on all players

stats_header, stats_data = read_csv(FILES[1],1,6450)
stats_header[0] = "Index" # set index column name that is not placed in csv
stats_data = remove_blank_stats(stats_header, stats_data)
stats_header.append('TMRec')

# append team wins to players
stats_data = append_TMRec(stats_data, TMRec, all_teams)
stats_data = convert_stats_to_float(stats_data)

# Grab mvps from stats data
year_players = create_year_player_dict(stats_data)
mvps_data = get_mvps_data(mvps, year_players)

## Split Into Training and Testing Data


In [None]:
def get_class_label(pstats):
  class_label = 0 # negative (not MVP)
  for mvp_stats in mvps_data.values():
    # Check if name and year match for this instance
    if mvp_stats[2].replace("*", "") == pstats[2].replace("*", "") and mvp_stats[1] == pstats[1]: # name and year equal
      class_label = 1 # positive (MVP)
  return class_label


def create_train_test(year_players, train_years, test_years):
  X_train = []
  y_train = []
  # Create training data
  for year in train_years:
    for player in year_players[year]:
      X_train.append(player[6:]) # exclude extra player info
      y_train.append(get_class_label(player))

  # Create testing data
  # Each testing point is a testing year that consists of all players who played in that season
  X_test_player_map = {}
  y_test_player_map = {}
  X_test = {}
  y_test = {}
  for year in test_years:
    X_test_players = []
    y_test_players = []
    x_idx_map = {}
    y_idx_map = {}

    i = 0
    for player in year_players[year]:
      label = get_class_label(player)
      X_test_players.append(player[6:]) # exclude extra player info
      y_test_players.append(label)
      x_idx_map[i] = player # record the full player stats at the current index
      y_idx_map[i] = label
      i += 1

    X_test[year] = X_test_players
    y_test[year] = y_test_players
    X_test_player_map[year] = x_idx_map
    y_test_player_map[year] = y_idx_map
  return X_train, y_train, X_test, y_test, X_test_player_map, y_test_player_map

## Train

### Model

We will use Logistic Regression to predict the probabilities of a given player's stats achieving MVP. We will use our trained classifier to predict individual probabilities for each instance in a set of stats. Then, we will perform sorting on the probabilities to determine which players are closest to MVP status and ultimately determine the MVP.
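
The idea can be illustrated on toy data. The single feature, the labels, and the player names below are invented for illustration. The sketch relies only on the fact that, for a binary problem, scikit-learn's `predict_proba` returns the sigmoid of the linear score in its second column.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training data: one feature per player-season; label 1 marks an MVP season.
X_train = np.array([[1.0], [2.0], [3.0], [8.0], [9.0], [10.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X_train, y_train)

# A hypothetical unseen season with three players.
names = ["Player A", "Player B", "Player C"]
X_season = np.array([[2.5], [9.5], [6.0]])

# Column 1 of predict_proba is the probability of the positive (MVP) class.
mvp_probs = clf.predict_proba(X_season)[:, 1]

# The same probabilities computed by hand: sigmoid(w . x + b).
scores = X_season @ clf.coef_.ravel() + clf.intercept_[0]
sigmoid_probs = 1.0 / (1.0 + np.exp(-scores))

# Rank players from most to least likely MVP.
ranking = [names[i] for i in np.argsort(-mvp_probs)]
print(ranking)
```

Sorting by probability rather than thresholding at 0.5 means exactly one player tops each season's ranking, even when several players, or none, exceed the threshold.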

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier


def train(X_train, y_train):
  log_clf = LogisticRegression(max_iter=1000, C=1.0, class_weight={0: 0.1, 1: 10})
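  # Pass the estimator positionally: newer scikit-learn releases renamed the
  # base_estimator keyword to estimator, and the positional form works with both.
  clf = BaggingClassifier(log_clf, n_estimators=10)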
  clf.fit(X_train, y_train)
  return clf
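
Logistic regression solvers can converge slowly when features sit on very different scales, as season totals and percentages do. One possible variant, sketched here under assumptions, standardizes the features inside each bagged model. `train_scaled`, `n_bags`, and `seed` are hypothetical names, the data is synthetic, and this is not the configuration the notebook was run with.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def train_scaled(X_train, y_train, n_bags=10, seed=0):
    """Bagged logistic regression with per-feature standardization (hypothetical variant)."""
    base = make_pipeline(
        StandardScaler(),
        LogisticRegression(max_iter=1000, C=1.0, class_weight={0: 0.1, 1: 10}),
    )
    # The estimator is passed positionally so the call works regardless of
    # whether the keyword is named base_estimator or estimator.
    clf = BaggingClassifier(base, n_estimators=n_bags, random_state=seed)
    clf.fit(X_train, y_train)
    return clf

# Synthetic data with features on very different scales, for illustration only.
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 1000, 200)])
y = (X[:, 0] + X[:, 1] / 1000 > 1.0).astype(int)

model = train_scaled(X, y)
probs = model.predict_proba(X)[:, 1]
print(probs[:5])
```

Putting the scaler inside the pipeline means each bag fits its own scaling on its own bootstrap sample, so no statistics leak from held-out seasons.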

## Test

In [None]:
debug = True

def test(clf, X_test, y_test, X_test_player_map, y_test_player_map):
  distances = []
  correct = 0
  incorrect = 0

  # For each testing year, predict the MVP
  for year, players in X_test.items():
    if debug:
      print(f"TEST SEASON - {year} - {len(players)} player(s)")
    results = []
    probs = clf.predict_proba(players)

    for i in range(len(probs)):
      pos_prob = probs[i][1] # i.e., is MVP
      full_stats = X_test_player_map[year][i] # get extra info (team, name, etc.)
      mvp_truth = y_test_player_map[year][i] # 1/0 if this stat instance was mvp or not
      results.append((players[i], full_stats, mvp_truth, pos_prob))

    # sort by probability in descending order
    results.sort(key=lambda tup: tup[3], reverse=True)

    pred_mvp_index = None
    for i in range(len(results)):
      if results[i][2]: # if MVP
        pred_mvp_index = i
        break
    
    distances.append(pred_mvp_index)
    if pred_mvp_index == 0: # perfect MVP prediction
      correct += 1
    else:
      incorrect += 1

    if debug:
      n = 10
      print(f"  distance to real MVP: {pred_mvp_index}")
      print(f"  top {n} results")
      for i in range(n):
        mvp = results[i]
        top = ""
        if mvp[2]:
          top = " *** ultimate predicted MVP for the season ***"
        if mvp[2]:
          mvp_label = "yes"
        else:
          mvp_label = "no"
        print(f"  #{i+1:-2} mvp prediction for {year}: {mvp[1][2]:20} - actually mvp: {mvp_label:3} - confidence: {mvp[3]*100:.2f}%{top}")

  avg_distance = np.array(distances).mean()
  return avg_distance, correct / (correct + incorrect)


# ------ RUN -----
years = list(year_players.keys())
random.shuffle(years)
train_size = 0.8 # percentage
train_years = years[:int(len(years)*train_size)]
test_years = years[int(len(years)*train_size):]
X_train, y_train, X_test, y_test, X_test_player_map, y_test_player_map = create_train_test(year_players, train_years, test_years)
if debug:
  print("test_years", test_years)
  print("train_years", train_years)
  print("X_train", len(X_train))
  print("y_train", len(y_train))
  print("X_test", {k: len(v) for k,v in X_test.items()})
  print("y_test", {k: len(v) for k,v in y_test.items()})
clf = train(X_train, y_train)
avg_distance, acc = test(clf, X_test, y_test, X_test_player_map, y_test_player_map)
print("--------------------------------------------")
print(f"avg_distance: {avg_distance:.2f}, accuracy: {acc:.2f}")

test_years ['2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017']
train_years ['1982', '1983', '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009']
X_train 12349
y_train 12349
X_test {'2010': 512, '2011': 542, '2012': 515, '2013': 523, '2014': 548, '2015': 575, '2016': 528, '2017': 542}
y_test {'2010': 512, '2011': 542, '2012': 515, '2013': 523, '2014': 548, '2015': 575, '2016': 528, '2017': 542}


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

TEST SEASON - 2010 - 512 player(s)
  distance to real MVP: 0
  top 10 results
  # 1 mvp prediction for 2010: LeBron James         - actually mvp: yes - confidence: 98.83% *** ultimate predicted MVP for the season ***
  # 2 mvp prediction for 2010: Steve Nash           - actually mvp: no  - confidence: 84.44%
  # 3 mvp prediction for 2010: Kevin Durant         - actually mvp: no  - confidence: 80.77%
  # 4 mvp prediction for 2010: Dwyane Wade          - actually mvp: no  - confidence: 57.39%
  # 5 mvp prediction for 2010: Ryan Bowen           - actually mvp: no  - confidence: 9.43%
  # 6 mvp prediction for 2010: Trey Gilder          - actually mvp: no  - confidence: 8.19%
  # 7 mvp prediction for 2010: Kobe Bryant          - actually mvp: no  - confidence: 5.77%
  # 8 mvp prediction for 2010: Chauncey Billups     - actually mvp: no  - confidence: 4.83%
  # 9 mvp prediction for 2010: Tim Duncan           - actually mvp: no  - confidence: 4.79%
  #10 mvp prediction for 2010: Dwayne Jones 


## Results and Observations

* Average distance to true MVP ranking: **0.5-3**
* Average accuracy (correct #1 ranking of MVP): **60%-75%**

Our classifier did not always rank the correct MVP first, so it is not a fully accurate model, but we believe this may be due to other factors not visible in the data. The accuracy of our model varies from 60% to 75% (i.e., our #1 ranked prediction is the actual MVP), and when our prediction is wrong, the actual MVP is on average 1 to 3 positions away in the ranking. We therefore believe that some of the model's errors are due to these external factors. Given that the actual MVP is only 1 to 3 positions away on average, out of several hundred players in a season, the model appears promising.
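
Because the train/test seasons are chosen by a random shuffle, the reported numbers vary from run to run. One way to get a steadier estimate is to repeat the split with a fixed seed and average the metrics. The sketch below only generates the splits; `season_splits` and its parameters are hypothetical names, and the 0.8 train fraction mirrors `train_size` above.

```python
import random

def season_splits(years, train_size=0.8, n_repeats=5, seed=0):
    """Yield (train_years, test_years) pairs from repeated seeded shuffles of the seasons."""
    rng = random.Random(seed)  # local generator, so results are reproducible
    cut = int(len(years) * train_size)
    for _ in range(n_repeats):
        shuffled = list(years)
        rng.shuffle(shuffled)
        yield shuffled[:cut], shuffled[cut:]

years = [str(y) for y in range(1982, 2018)]
splits = list(season_splits(years))
for train_years, test_years in splits:
    print(len(train_years), len(test_years))
```

Each pair could then be passed through `create_train_test`, `train`, and `test`, with the resulting average distances and accuracies averaged across repeats.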

## Further Research

In “The Book of Basketball”, Bill Simmons discusses the issue of media bias in the awards, dedicating an entire 40-page chapter to correcting the mistakes of MVP votes.

Based on the errors we have found, we think the project could be expanded by applying Natural Language Processing to media outlets to see how players are portrayed in their narratives, as that seems to have an effect on the MVP award.
