# Project Description

DATA 1030

Kaiwen Yang

## Introduction

This project aims to use machine learning methods to predict the offensive win share of an NBA player based on his shooting tendency. Win share is an advanced player stat that indicates how much a player contributes to the success of his team, and it is an important attribute for evaluating a player. Two datasets will be used in this project. "Seasons_Stats.csv" contains the basic stats of all NBA players from 1950 to 2017. "nba_shot_types.csv" contains data about players' shooting tendency (e.g. the ratio of a player's shots inside the three-point line to his shots outside the three-point line). The target variable is "OWS", the offensive win share of a player. It is a continuous variable, so a regression model will be used.

The topic of this project is closely related to the ongoing revolution in the NBA. Many teams have introduced data science into game-play analysis in order to improve offensive and defensive efficiency. They use more complex and detailed datasets to find the most efficient offensive choices, and most of the results suggest that NBA teams should encourage their players to shoot more three-pointers and fewer mid-range jump shots. In this project, we use a simpler dataset to try to reproduce this analysis.

## Dataset

In "nba_shot_types.csv", there are 3007 data points, and each has 23 features. Here are descriptions of the columns that will be used:

"YEAR": NBA season. Category. Will be changed to discrete numerical data for the convenience of merging.

"PLAYER": Player name. Category. 

The following columns are all of type float:

"PCT_FGA_2PT": Percentage of Field Goal Attempts That Were 2 PT Shots

"PCT_FGA_3PT": Percentage of Field Goal Attempts That Were 3 PT Shots

"PCT_PTS_2PT": Percentage of Points That Came From 2 PT Field Goals Made

"PCT_PTS_MR": Percentage of Points That Came From Midrange

"PCT_PTS_3PT": Percentage of Points That Came From 3 PT Field Goals Made

"PCT_PTS_FSTBRK": Percentage of Points That Came on Fast Breaks

"PCT_PTS_FT": Percentage of Points That Came From Free Throws

"PCT_PTS_OFF_TOS": Percentage of Points That Came Off Turnovers

"PCT_PTS_INTHEPT": Percentage of Points That Came In the Paint

In "Seasons_stats_complete.csv", there are about 26.1k data points, and each has 50 features. Only data from the 2013-2014 season through the 2018-2019 season will be used. Here are descriptions of the columns that will be used:

"Year": NBA season. Category. Will be changed to discrete numerical data for the convenience of merging.

"Player": Player name. Category.

"Pos": Position. Category.

"OWS": Offensive win share. Continuous numerical.

"2PA": 2-point attempts. Discrete numerical.

"3PA": 3-point attempts. Discrete numerical.

"FTA": Free throw attempts. Discrete numerical.

## Preprocessing of Datasets

The two tables will be merged based on the "Year" and "Player" columns. After merging, these two columns will be dropped.
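
The merge step can be sketched on toy data. This is only an illustration under stated assumptions: the frames `stat_demo` and `shot_demo` and all of their values are made up, season labels are assumed to always have the form 'YYYY-YYYY', and a season is identified by the year in which it ends. Unlike the notebook code further down, which merges on the player name and then filters on matching years, this sketch merges on both keys at once.

```python
import pandas as pd

# Toy stand-ins for the two tables; all values are made up for illustration.
stat_demo = pd.DataFrame({
    "Year": [2014, 2015, 2015],
    "Player": ["A", "A", "B"],
    "OWS": [1.0, 2.0, 0.5],
})
shot_demo = pd.DataFrame({
    "YEAR": ["2013-2014", "2014-2015", "2015-2016"],
    "PLAYER": ["A", "B", "B"],
    "PCT_FGA_3PT": [0.30, 0.45, 0.50],
})

# '2013-2014' -> 2014: keep the year in which the season ends.
shot_demo["YEAR"] = shot_demo["YEAR"].str.split("-").str[1].astype(int)

# Inner merge on (season, player): only pairs present in both tables survive.
merged_demo = stat_demo.merge(
    shot_demo,
    left_on=["Year", "Player"],
    right_on=["YEAR", "PLAYER"],
    how="inner",
)

# The key columns are no longer needed once the rows are aligned.
features_demo = merged_demo.drop(columns=["Year", "Player", "YEAR", "PLAYER"])
print(features_demo)
```

Only the (2014, A) and (2015, B) rows appear in both toy tables, so the result has two rows.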

A one-hot encoder will be applied to "Pos". The 5 different positions are not ordinal; they only indicate the position the player played in that season.
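
As a small illustration of what one-hot encoding does to a position column, here is a sketch on made-up data. The frame `pos_demo` is hypothetical, and the column names are built by hand from `categories_` so that the snippet does not depend on a particular scikit-learn version.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up positions; the real column has five distinct values.
pos_demo = pd.DataFrame({"Pos": ["PG", "C", "SF", "PG"]})

enc = OneHotEncoder()
# fit_transform returns a sparse matrix by default, so convert it to dense.
encoded = enc.fit_transform(pos_demo).toarray()

# categories_ holds the sorted category labels found for each input column.
columns = [f"Pos_{c}" for c in enc.categories_[0]]
pos_onehot = pd.DataFrame(encoded, columns=columns)
print(pos_onehot)
```

Each row has exactly one 1.0, in the column of its position, so no artificial ordering between positions is introduced.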

MinMaxScaler will be applied to the columns "2PA", "3PA", "FTA" and the columns that start with "PCT_". The "PCT_" columns are bounded by 0 and 100. "2PA", "3PA", and "FTA" are also bounded, because a player can only attempt a limited number of shots in the limited playing time of a season.
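
Min-max scaling maps each column to [0, 1] via (x - min) / (max - min), computed separately per column. The sketch below, on made-up numbers in a hypothetical frame `demo`, checks that `MinMaxScaler` agrees with the formula applied by hand.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up values: one count-like column and one percentage-like column.
demo = pd.DataFrame({
    "2PA": [0.0, 50.0, 200.0],
    "PCT_PTS_FT": [10.0, 20.0, 30.0],
})

# MinMaxScaler fits a separate min and max for every column.
scaled = MinMaxScaler().fit_transform(demo)

# The same transformation written out explicitly.
manual = (demo - demo.min()) / (demo.max() - demo.min())

print(scaled)
print(np.allclose(scaled, manual.to_numpy()))
```

Because every column is scaled independently, passing several columns to one `fit_transform` call gives the same result as scaling them one at a time.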

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import (LabelEncoder, MinMaxScaler, OneHotEncoder,
                                   OrdinalEncoder, StandardScaler)

shot = pd.read_csv('nba_shot_types.csv')
stat = pd.read_csv('Seasons_stats_complete.csv')

# sparse_output replaced the older `sparse` argument in scikit-learn 1.2
ohenc = OneHotEncoder(sparse_output=False)
mmscaler = MinMaxScaler()
sscaler = StandardScaler()

In [16]:
# Keep only the columns used in this project; .copy() avoids SettingWithCopyWarning
stat_cols = ['2PA', '3PA', 'FTA']
stat = stat[['Year', 'Player', 'Pos', 'OWS'] + stat_cols].copy()

shot_cols = ['PCT_FGA_2PT', 'PCT_FGA_3PT', 'PCT_PTS_2PT', 'PCT_PTS_MR',
             'PCT_PTS_3PT', 'PCT_PTS_FSTBRK', 'PCT_PTS_FT', 'PCT_PTS_OFF_TOS',
             'PCT_PTS_INTHEPT']
shot = shot[['YEAR', 'PLAYER'] + shot_cols].copy()

# Convert season labels such as '2013-2014' to the year in which the season ends
season_to_year = {'2018-2019': 2019, '2017-2018': 2018, '2016-2017': 2017,
                  '2015-2016': 2016, '2014-2015': 2015, '2013-2014': 2014}
shot['YEAR'] = shot['YEAR'].map(season_to_year)

# Min-max scale the numeric columns; MinMaxScaler treats every column independently
stat[stat_cols] = mmscaler.fit_transform(stat[stat_cols])
shot[shot_cols] = mmscaler.fit_transform(shot[shot_cols])

# Filter after scaling so that stat1 carries the scaled values
stat1 = stat[stat['Year'] > 2013]

# Merge on player name, then keep only the rows where the seasons match
merged = stat1.merge(shot, left_on='Player', right_on='PLAYER', how='right')
merged1 = merged[merged['Year'] == merged['YEAR']]
feature_cols = ['OWS'] + stat_cols + shot_cols
merged2 = merged1[['Pos'] + feature_cols].reset_index(drop=True)

# One-hot encode the position
one_hot_var = ohenc.fit_transform(merged2[['Pos']])
one_hot_var_name = ohenc.get_feature_names_out(['Pos'])
one_hot_var = pd.DataFrame(one_hot_var, columns=one_hot_var_name)

# Final table: target, scaled features, and one-hot position columns
df = merged2[feature_cols]
df1 = pd.concat([df, one_hot_var], axis=1)
df1

| | OWS | 2PA | 3PA | FTA | PCT_FGA_2PT | PCT_FGA_3PT | PCT_PTS_2PT | PCT_PTS_MR | PCT_PTS_3PT | PCT_PTS_FSTBRK | PCT_PTS_FT | PCT_PTS_OFF_TOS | PCT_PTS_INTHEPT | Pos_C | Pos_PF | Pos_PG | Pos_SF | Pos_SG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.1 | 0.000950 | 0.014591 | 0.007337 | 0.167 | 0.833 | 0.235 | 0.000 | 0.353 | 0.000 | 0.412 | 0.000 | 0.235 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | -0.1 | 0.023109 | 0.284047 | 0.044021 | 0.200 | 0.800 | 0.136 | 0.034 | 0.745 | 0.051 | 0.119 | 0.144 | 0.102 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.5 | 0.025324 | 0.087549 | 0.044021 | 0.471 | 0.529 | 0.297 | 0.054 | 0.500 | 0.027 | 0.203 | 0.171 | 0.243 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 3 | -0.2 | 0.003166 | 0.006809 | 0.002201 | 0.471 | 0.529 | 0.297 | 0.054 | 0.500 | 0.027 | 0.203 | 0.171 | 0.243 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.6 | 0.022159 | 0.080739 | 0.041820 | 0.471 | 0.529 | 0.297 | 0.054 | 0.500 | 0.027 | 0.203 | 0.171 | 0.243 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 5 | 1.8 | 0.052232 | 0.047665 | 0.049890 | 0.771 | 0.229 | 0.651 | 0.124 | 0.186 | 0.153 | 0.163 | 0.192 | 0.528 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 6 | 1.0 | 0.085787 | 0.058366 | 0.071167 | 0.819 | 0.181 | 0.673 | 0.256 | 0.136 | 0.103 | 0.191 | 0.161 | 0.417 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 7 | 0.6 | 0.039886 | 0.014591 | 0.038885 | 0.894 | 0.106 | 0.725 | 0.129 | 0.070 | 0.129 | 0.205 | 0.123 | 0.596 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 8 | 0.1 | 0.002849 | 0.004864 | 0.005869 | 0.894 | 0.106 | 0.725 | 0.129 | 0.070 | 0.129 | 0.205 | 0.123 | 0.596 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 9 | 0.5 | 0.037037 | 0.009728 | 0.033015 | 0.894 | 0.106 | 0.725 | 0.129 | 0.070 | 0.129 | 0.205 | 0.123 | 0.596 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
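
A quick sanity check on the preprocessed table is to confirm that the min-max scaled columns lie in [0, 1] and that every row has exactly one position flag set. The sketch below does this for a hypothetical frame `check`, built by copying a subset of columns from the first three rows of the output above; on the real data the same checks would be run on `df1`.

```python
import pandas as pd

# First three rows of the displayed output (subset of columns), copied by hand.
check = pd.DataFrame({
    "2PA": [0.000950, 0.023109, 0.025324],
    "3PA": [0.014591, 0.284047, 0.087549],
    "FTA": [0.007337, 0.044021, 0.044021],
    "Pos_C": [0.0, 0.0, 0.0],
    "Pos_PF": [1.0, 1.0, 1.0],
    "Pos_PG": [0.0, 0.0, 0.0],
    "Pos_SF": [0.0, 0.0, 0.0],
    "Pos_SG": [0.0, 0.0, 0.0],
})

scaled_cols = ["2PA", "3PA", "FTA"]
pos_cols = [c for c in check.columns if c.startswith("Pos_")]

# Min-max scaled values must fall inside [0, 1].
in_range = bool(check[scaled_cols].ge(0).all().all()
                and check[scaled_cols].le(1).all().all())

# One-hot rows must contain exactly one 1.0.
one_position = bool((check[pos_cols].sum(axis=1) == 1).all())

print(in_range, one_position)
```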


In [27]:
#!pip install xlwt
#import xlwt
#df1.to_excel("my_data.xls")