# Ian Murphy
# BrainStation, Capstone EDA
# 12/10/2020

### Data Dictionary

(included in previous notebook, but adding here as well for ease of use)

This cell contains a Data Dictionary with a description of every feature in the dataframe called 'outcomes_df':

PER: This stands for Player Efficiency Rating. This is a measure of a player's per-minute productivity. It takes into account all of the positive things a player does and creates a metric to measure how productive that player is. It is meant to summarize a player's statistical accomplishments in a single number.

TSP: This stands for True Shooting Percentage. True shooting percentage is a measure of shooting efficiency that takes into account 2-point field goals, 3-point field goals, and free throws.

PER and TSP will be included for the top 12 players on each team (top 12 by minutes played per game). For example, player1_PER is the PER for the player on that team who played the most minutes per game that season.

AVG_PER: This is the average Player Efficiency Rating of the top 12 players on each team. 

AVG_TSP: This is the average True Shooting Percentage of the top 12 players on each team. 

Coaches: I will be using dummy variables for coaches. The column titles will be the first initial and then the last name of the coach for each respective season. If the value is 1, that coach was the coach of that team that season. If the value is 0, that coach was not coaching that team that season.
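
The dummy encoding described above can be sketched as follows. This is a minimal illustration on made-up data: the team codes, the coach names, and the `Coach` column are assumptions, not values from 'outcomes_df'.

```python
import pandas as pd

# Toy data: team codes, years and coach names are made up for illustration.
seasons = pd.DataFrame({
    "Team": ["AAA", "AAA", "BBB"],
    "Year": [2019, 2020, 2020],
    "Coach": ["A. Smith", "B. Jones", "A. Smith"],
})

# One 0/1 column per coach; a 1 marks the coach of that team in that season.
coach_dummies = pd.get_dummies(seasons["Coach"], dtype=int)
encoded = pd.concat([seasons.drop(columns="Coach"), coach_dummies], axis=1)
print(encoded)
```

Each row has exactly one coach column set to 1, which matches the description of the coach features.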

ORtg: This stands for Offensive Rating. It is points scored per 100 possessions by a team. 

Rel ORtg: This is similar to ORtg, but it is relative to the league average.

DRtg: This stands for Defensive Rating. It is the number of points allowed per 100 possessions by a team.

Rel DRtg: This is similar to DRtg, but it is relative to the league average. 
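
As a small worked example of the rating definitions above, the sketch below computes a rating per 100 possessions and its relative version. The point and possession totals and the league average are made-up numbers. Treating "relative" as a simple difference from the league average is an assumption about how the relative columns were built.

```python
def rating_per_100(points: float, possessions: float) -> float:
    """Points per 100 possessions (scored for ORtg, allowed for DRtg)."""
    return 100.0 * points / possessions

# Made-up season totals for one team.
team_ortg = rating_per_100(points=8800, possessions=8000)

# Made-up league-average rating; assumed to be subtracted to get the relative stat.
league_avg_ortg = 108.0
rel_ortg = team_ortg - league_avg_ortg

print(team_ortg, rel_ortg)
```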

SRS: This stands for Simple Rating System. This rating takes into account average point differential and strength of schedule. The rating is denominated in points above or below the average, where zero is average. 

Pace: The Pace factor is an estimate of the number of possessions per 48 minutes by a team. 

Rel Pace: This is similar to Pace, but it is relative to the league average. 

Playoffs: This is the target variable. 0 means the team did not make the playoffs that year. 1 means the team did make the playoffs that year.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
outcomes_df = pd.read_csv('data/outcomes_df.csv')

In [3]:
outcomes_df.head()

Unnamed: 0.1,Unnamed: 0,Team,Year,player1_PER,player2_PER,player3_PER,player4_PER,player5_PER,player6_PER,player7_PER,...,T. Porter,T. Stotts,T. Thibodeau,V. Del Negro,W. Unseld,Division,AVG_PER,AVG_TSP,Location,Playoffs
0,0,NYK,2020,17.5,10.7,23.5,14.6,16.5,16.0,9.8,...,0,0,0,0,0,0,14.6125,0.5365,"New York, NY",0
1,1,NYK,2019,8.7,10.8,13.9,14.6,14.4,12.2,22.0,...,0,0,0,0,0,0,13.425,0.547125,"New York, NY",0
2,2,NYK,2018,13.0,14.4,24.0,7.0,17.8,20.4,11.5,...,0,0,0,0,0,0,16.0125,0.546625,"New York, NY",0
3,3,NYK,2017,17.9,12.2,17.4,17.0,12.7,12.7,19.0,...,0,0,0,0,0,0,16.175,0.542875,"New York, NY",0
4,4,NYK,2016,20.3,10.9,17.6,17.7,11.7,12.3,17.2,...,0,0,0,0,0,0,14.7625,0.540125,"New York, NY",0


### Basic EDA

Showing shape, nulls, duplicates. 

In [4]:
outcomes_df.shape

(876, 191)

In [5]:
outcomes_df.isna().sum().sum()

0

In [6]:
outcomes_df.duplicated().sum()

0

It looks like my data set is ready for more in-depth analysis. I am going to dive into the specific features and examine correlation and multicollinearity.

### Correlation Heat Map

Creating a heat map for the following features: 'AVG_PER', 'AVG_TSP', 'player1_PER', 'Pace', 'Rel Pace', 'SRS', 'ORtg', 'DRtg', 'Rel ORtg', 'Rel DRtg', 'Playoffs'.

Since I have so many features, I am not going to print out every single one, but rather select a few key statistics that I suspect may be correlated with each other.

My data set is not too large, so I will be trying a number of different dimension reduction techniques and testing them out on different models. I will investigate using the heat map, then move on to calculating VIF scores. I will also perform PCA in my modeling notebook.

In [7]:
# Selecting a few key features for the heat map; they need to go into their own dataframe.
# I am using AVG_PER, AVG_TSP, player1_PER, and all of the team stats. I'm leaving out coaches and the other individual player stats.
corr_df = outcomes_df[['AVG_PER', 'AVG_TSP', 'player1_PER', 'Pace', 'Rel Pace', 'SRS', 'ORtg', 'DRtg', 'Rel ORtg', 'Rel DRtg', 'Playoffs']]

In [8]:
# showing the heatmap 
corr_df.corr().style.background_gradient(cmap="coolwarm", vmin=-1, vmax=1)

Unnamed: 0,AVG_PER,AVG_TSP,player1_PER,Pace,Rel Pace,SRS,ORtg,DRtg,Rel ORtg,Rel DRtg,Playoffs
AVG_PER,1.0,0.633479,0.31915,0.062819,0.049702,0.747483,0.741861,-0.212522,0.838896,-0.29425,0.534723
AVG_TSP,0.633479,1.0,0.250229,0.37238,0.08667,0.53492,0.806674,0.134703,0.620433,-0.192468,0.384519
player1_PER,0.31915,0.250229,1.0,-0.032849,-0.019485,0.456657,0.384365,-0.207366,0.467233,-0.234706,0.352877
Pace,0.062819,0.37238,-0.032849,1.0,0.621944,-0.067584,0.34415,0.460413,0.035653,0.157608,-0.074026
Rel Pace,0.049702,0.08667,-0.019485,0.621944,1.0,-0.105251,0.053592,0.209723,0.061094,0.25072,-0.127453
SRS,0.747483,0.53492,0.456657,-0.067584,-0.105251,1.0,0.682132,-0.621437,0.80533,-0.74783,0.76936
ORtg,0.741861,0.806674,0.384365,0.34415,0.053592,0.682132,1.0,0.142312,0.846326,-0.185352,0.494187
DRtg,-0.212522,0.134703,-0.207366,0.460413,0.209723,-0.621437,0.142312,1.0,-0.18318,0.8288,-0.529467
Rel ORtg,0.838896,0.620433,0.467233,0.035653,0.061094,0.80533,0.846326,-0.18318,1.0,-0.218001,0.582572
Rel DRtg,-0.29425,-0.192468,-0.234706,0.157608,0.25072,-0.74783,-0.185352,0.8288,-0.218001,1.0,-0.638299


There are a few features that have strong correlations with each other. Some make sense, some are surprising.
- 'Pace' and 'Rel Pace' (0.62). This makes sense as they are measuring the same basic statistic. 'Pace' is a measure of how fast a team plays in terms of going up and down the court. 'Rel Pace' is just the same stat relative to the league average.
- The same goes for 'ORtg' and 'Rel ORtg' (0.85), and for 'DRtg' and 'Rel DRtg' (0.83). I may need to drop the relative features because they are repetitive.
- 'ORtg' and 'Rel ORtg' have a strong correlation with 'AVG_TSP' (0.81 and 0.62). This makes sense because these are measures of offensive production. The more productive a team is on offense, the higher its shooting percentage (TSP) is likely to be.
- 'SRS' has the strongest correlation with making the playoffs (0.77). This makes a lot of sense because SRS (Simple Rating System) is an overall team stat that measures how well the team is performing on both ends of the floor. SRS takes into account strength of schedule and average point differential.
- 'player1_PER' has a positive correlation with making the playoffs (0.35), but it is not that strong. I would have guessed this would be one of the strongest, but the overall team 'AVG_PER' (0.53) appears to be more important, as shown in this heat map.
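
To read these relationships without scanning the whole matrix, the column of correlations with the target can be pulled out and ordered by absolute value. This is a minimal sketch on synthetic data. The `toy` frame and its values are made up, and only the column names mirror the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; not the notebook's values.
rng = np.random.default_rng(0)
n = 200
srs = rng.normal(0, 4, n)
toy = pd.DataFrame({
    "SRS": srs,
    "Pace": rng.normal(95, 3, n),
    "Playoffs": (srs + rng.normal(0, 2, n) > 0).astype(int),
})

# Correlation of every feature with the target, strongest first (sign kept).
with_target = toy.corr()["Playoffs"].drop("Playoffs")
order = with_target.abs().sort_values(ascending=False).index
target_corr = with_target.reindex(order)
print(target_corr)
```

Applied to the real data, the same pattern would be used with `corr_df` in place of `toy`.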

### Variance Inflation Factor

I am going to use the VIF method to dive deeper into the features. I want to examine the amount of multicollinearity and see if there are any columns that I can drop to reduce the VIF scores of the other variables.

In [9]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

pd.Series([variance_inflation_factor(corr_df.values, i) 
               for i in range(corr_df.shape[1])], 
              index=corr_df.columns)

  vif = 1. / (1. - r_squared_i)


AVG_PER         722.185284
AVG_TSP        1837.658057
player1_PER      25.537727
Pace           2174.130732
Rel Pace          2.482645
SRS             135.155134
ORtg                   inf
DRtg                   inf
Rel ORtg               inf
Rel DRtg               inf
Playoffs          5.831482
dtype: float64
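
For reference, the VIF of feature i is 1 / (1 - R_i^2), where R_i^2 comes from regressing feature i on all the other features. The statsmodels function regresses on exactly the columns it is given, so no intercept is fitted unless a constant column is included. That can inflate the scores of features whose mean is large relative to their spread, which may be contributing to the very large values above. The sketch below is a NumPy version of the definition that always includes an intercept. The helper name `vif_with_intercept` and the toy data are assumptions for illustration, not part of this notebook.

```python
import numpy as np
import pandas as pd

def vif_with_intercept(features: pd.DataFrame) -> pd.Series:
    """VIF_i = 1 / (1 - R_i^2), regressing column i on the others plus an intercept."""
    X = features.to_numpy(dtype=float)
    scores = {}
    for i, name in enumerate(features.columns):
        y = X[:, i]
        others = np.delete(X, i, axis=1)
        design = np.column_stack([np.ones(len(y)), others])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        resid = y - design @ coef
        r_squared = 1.0 - resid.var() / y.var()
        scores[name] = 1.0 / (1.0 - r_squared)
    return pd.Series(scores)

# Independent toy features with very different means and spreads.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "a": rng.normal(100, 1, 500),   # large mean, small spread
    "b": rng.normal(0.5, 0.05, 500),
    "c": rng.normal(0, 1, 500),
})
toy_vifs = vif_with_intercept(toy)
print(toy_vifs)
```

Because the toy features are independent, every score should come out close to 1. It may also be worth leaving the target ('Playoffs') out of the feature matrix, since VIF is meant to describe the predictors only.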

There are a number of very high scores, including infinite values for 'ORtg', 'DRtg', 'Rel ORtg', and 'Rel DRtg'. This is what I expected, because in sports almost every stat is tied to the others in one way or another. My first thought is to remove 'Rel ORtg', 'Rel DRtg', and 'Pace'. I do not want to remove 'ORtg' and 'DRtg', as they are important statistics that measure offensive and defensive performance. My goal is to remove as much collinearity as I can without losing vital information. I will not be able to remove all of the multicollinearity because of the nature of my project.

The relative stats are the same measures expressed relative to the league average, so they are highly collinear with the raw stats, and I think that is what is driving the VIF scores up.
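
One possible explanation for the infinite scores is an exact linear relation among the four rating columns. Assume each relative rating is the raw rating minus that season's league average, and that the league-average offensive and defensive ratings coincide because every point scored is also a point allowed. Then ORtg - Rel ORtg = DRtg - Rel DRtg, and each of the four columns is perfectly predictable from the other three. The sketch below checks this on synthetic data. All values are made up, and the construction of the relative columns is an assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100

# Stand-in for each season's league-average rating (made-up values).
league_avg = rng.normal(105, 3, n)
ortg = league_avg + rng.normal(0, 3, n)
drtg = league_avg + rng.normal(0, 3, n)

# Assumed construction of the relative stats: raw rating minus league average.
rel_ortg = ortg - league_avg
rel_drtg = drtg - league_avg

# (ORtg - Rel ORtg) and (DRtg - Rel DRtg) both equal the league average,
# so this combination of the four columns is zero in every row.
combo = ortg - rel_ortg - drtg + rel_drtg

X = np.column_stack([ortg, drtg, rel_ortg, rel_drtg])
rank = np.linalg.matrix_rank(X)
print(np.allclose(combo, 0.0), rank)
```

Four columns with rank 3 means R^2 = 1 for each of them, which makes 1 / (1 - R^2) a division by zero. Dropping the two relative columns removes this exact dependence.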

In [10]:
corr_df_new = corr_df.drop(['Rel ORtg', 'Rel DRtg', 'Pace'], axis=1)

In [12]:
pd.Series([variance_inflation_factor(corr_df_new.values, i) 
               for i in range(corr_df_new.shape[1])], 
              index=corr_df_new.columns)

AVG_PER          674.485054
AVG_TSP         1714.957794
player1_PER       25.188084
Rel Pace           1.037621
SRS              132.879312
ORtg           64909.757498
DRtg           62758.105581
Playoffs           5.769091
dtype: float64

Every remaining score decreased, although only slightly for most features. 'ORtg' and 'DRtg' are no longer infinite, but their scores are still extremely high. I want to use the entire data set and look at those VIF scores. I will put them in a dictionary and sort them to make them easier to read and plot.

In [13]:
# Drop the identifier columns so that only numeric columns are passed to the VIF calculation
new_outcomes_df = outcomes_df.drop(['Team', 'Year', 'Location'], axis=1)

In [14]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

vifs = pd.Series([variance_inflation_factor(new_outcomes_df.values, i) 
           for i in range(new_outcomes_df.shape[1])], 
          index=new_outcomes_df.columns)

  vif = 1. / (1. - r_squared_i)


In [15]:
vifs_dict = dict(vifs)

In [16]:
# sorted() returns a list of (feature, VIF) tuples in ascending order of VIF
sorted_vifs_dict = sorted(vifs_dict.items(), key=lambda kv: kv[1])

In [17]:
# Show everything except the ten highest VIF scores
sorted_vifs_dict[:-10]

[('playe11_PER', 2.277788249648934),
 ('player10_TSP', 2.2929767943285113),
 ('player9_PER', 2.325077421337637),
 ('player9_TSP', 2.3344878494079517),
 ('player10_PER', 2.3504427030837216),
 ('player11_TSP', 2.3579757780142145),
 ('player12_PER', 2.8352159846556355),
 ('player12_TSP', 3.034993328403722),
 ('Playoffs', 3.383654643422511),
 ('Rel Pace', 4.989669064492156),
 ('Division', 6.22669449796834),
 ('Pace', 6.250857525185122),
 ('Unnamed: 0', 7.508066312614935),
 ('L. Hamilton ', 8.025359655159258),
 ('R. Ayers ', 8.062329211620202),
 ('G. Irvine ', 8.064318912571839),
 ("K. O'Neill ", 8.084473896321276),
 ('G. Heard ', 8.100660077494709),
 ('B. Hanzlik ', 8.397684343431138),
 ('D. Versace ', 8.437212205555149),
 ('Q. Buckner ', 8.493657026393942),
 ('F. Carter ', 8.546692262796006),
 ('S. Vincent ', 8.562171230390225),
 ('S. Jackson ', 8.566612592186473),
 ('G. Littles ', 8.568999517383835),
 ('M. Dunlap ', 8.575493402583232),
 ('M. Curry ', 8.616189199201282),
 ('J. Tarkanian '

Glancing over the scores quickly, they look much better when the data set is viewed as a whole. A number of player stats have infinite VIF scores, which is not ideal, but I do not want to lose that information because I feel it is vital to answering my question (which stats are the most important?). If I drop the player stats, I won't be able to answer questions like 'does player 1 have more impact than player 5?' This was one of my main goals in the project, so I need to keep these in my modeling.
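
Because the list printed above uses `[:-10]`, the ten largest scores (including any infinite ones) are not displayed. A minimal sketch of one way to separate infinite from finite scores is shown below. The feature names and values are placeholders, not results from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical VIF scores; names and values are placeholders.
example_vifs = pd.Series({
    "feature_a": 2.3,
    "feature_b": np.inf,
    "feature_c": 8.5,
    "feature_d": np.inf,
})

# Features that are exact linear combinations of others get an infinite VIF.
infinite_vifs = example_vifs[np.isinf(example_vifs)]

# The remaining scores, largest first, are easier to read and to plot.
finite_vifs = example_vifs[np.isfinite(example_vifs)].sort_values(ascending=False)

print(list(infinite_vifs.index))
print(finite_vifs)
```

With the real data, `vifs` from the earlier cell would take the place of `example_vifs`, and `finite_vifs.plot(kind="barh")` is one option for the visual.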

'SRS' is another stat with a high VIF score, but I don't think I can drop it, as it has the strongest correlation with the outcome.

I will move on to modeling and use PCA for dimension reduction. I don't feel it is worth it to remove key features even if they have a high VIF score. 
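
The modeling notebook is not shown here, so the following is only a sketch of what PCA does with correlated features, not the method used there. It standardizes synthetic data and takes the principal components from a singular value decomposition. All data and variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three strongly correlated columns built from one shared signal, plus one independent column.
base = rng.normal(size=(200, 1))
correlated = [base + 0.1 * rng.normal(size=(200, 1)) for _ in range(3)]
X = np.hstack(correlated + [rng.normal(size=(200, 1))])

# Standardize so each feature contributes on the same scale.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Rows of Vt are the principal directions; squared singular values give the variance shares.
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
explained = S**2 / np.sum(S**2)

# Project onto the first two components.
components = Z @ Vt.T[:, :2]
print(explained.round(3), components.shape)
```

The correlated columns collapse into a single dominant component, which is how PCA can reduce multicollinearity without dropping features by hand. The trade-off is that the components are harder to interpret than the original stats.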