# Basketball analytics: distilling and summarizing information

#### Due: May 11 at 10 pm

When analyzing data, setting a goal is often helpful. In this assignment, the focus is on understanding how NMF behaves, and further analyzing player data.

In all the problems below, take a step back and think about each procedure as a piece in a bigger puzzle of understanding the game of basketball and its players. This goal should guide any decisions we make, and insights we interpret.

## Preparing Data

In the previous notebook `07-Shooting-Pattern-Analysis`, we computed smoothed shot patterns for 362 players who played during the 2016-17 regular season. Save the matrix `X` from the Non-negative matrix factorization (NMF) section.

Create this file by saving the appropriate variable into a pickle file called `allpatterns2016-17.pkl`. After saving the file, you can load it via the following command:

In [None]:
# Import modules
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import subprocess as sp
import pickle # to serialize/unserialize python data objects

import helper_basketball as h
import importlib  # `imp` is deprecated and removed in Python 3.12
importlib.reload(h);

In [None]:
allshots = pickle.load(open('allshots2016-17.pkl', 'rb'))
allmade = allshots.copy()
allmade.head(10)

In [None]:
## bin edge definitions in inches
xedges = (np.linspace(start=-25, stop=25, num=151, dtype=float)) * 12
yedges = (np.linspace(start= -4, stop=31, num=106, dtype=float)) * 12

## 2d histogram containers for binned counts and smoothed binned counts
all_counts = {}
all_smooth = {}

## data matrix: players (row) by vectorized 2-d court locations (column)
for i, one in enumerate(allmade.groupby('PlayerID')):
    
    ## what does this line do?
    pid, pdf = one
    
    ## h.bin_shots: what is this function doing?
    tmp1, xedges, yedges = h.bin_shots(pdf, bin_edges=(xedges, yedges), density=True, sigma=2)
    tmp2, xedges, yedges = h.bin_shots(pdf, bin_edges=(xedges, yedges), density=False)
    
    ## vectorize and store into dictionary
    all_smooth[pid] = tmp1.reshape(-1)
    all_counts[pid] = tmp2.reshape(-1)

In [None]:
with open('allpatterns2016-17.pkl', 'wb') as f:
    pickle.dump(np.stack(list(all_smooth.values())).T, f)

In [None]:
X = pickle.load(open('allpatterns2016-17.pkl', 'rb'))
X

## Non-negative Matrix Factorization (NMF) notation

Non-negative matrix factorization was applied to the smoothed shooting pattern data of around 360 players. The result was useful in two ways:
* Bases: identifying modes of shooting style (the number of modes was determined by the `n_components` argument to the `NMF` function)
* Coefficients: expressing each player's shooting style as a linear combination of these bases (the matrix product of the bases and coefficients achieves this)

Recall the following. Given a $p\times n$ matrix $X$, NMF computes the following factorization:
$$ \begin{aligned} \min_{W,H}\ & \| X - WH \|_F \\
\text{subject to }\ & W\geq 0,\ H\geq 0, \end{aligned} $$
where $W$ is a $p\times r$ matrix and $H$ is an $r\times n$ matrix.
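
To make the notation concrete, here is a minimal sketch of the factorization on toy data. It uses Lee-Seung multiplicative updates written directly in NumPy, which is an illustrative choice and not necessarily the solver that `sklearn`'s `NMF` uses. The function name `nmf_multiplicative` and the toy matrix `X_toy` are assumptions made for this example only.

```python
import numpy as np

def nmf_multiplicative(X, r, n_iter=200, eps=1e-10, seed=0):
    """Approximate a non-negative p x n matrix X by W (p x r) @ H (r x n)."""
    rng = np.random.default_rng(seed)
    p, n = X.shape
    W = rng.random((p, r)) + eps
    H = rng.random((r, n)) + eps
    errors = []
    for _ in range(n_iter):
        # Multiplicative updates: every factor is non-negative,
        # so W and H stay non-negative throughout.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(X - W @ H, "fro"))
    return W, H, errors

# Toy data (hypothetical): 30 "court locations" (rows) by 8 "players" (columns),
# generated from 2 hidden non-negative modes.
rng = np.random.default_rng(1)
X_toy = rng.random((30, 2)) @ rng.random((2, 8))

W_toy, H_toy, errors = nmf_multiplicative(X_toy, r=2)
print(W_toy.shape, H_toy.shape)  # (30, 2) (2, 8)
print("first and last Frobenius error:", errors[0], errors[-1])
```

Each column of `W_toy` plays the role of a basis (a shooting mode), and column `j` of `H_toy` holds the coefficients that combine those bases to approximate column `j` of `X_toy`.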


## Problem 1

__PSTAT 134 and 234__: Experiment with different values of `n_components` to change the number of basis vectors. Visualize the basis vectors.

What value of $r$ seems to be too small? (`r` is too small to represent the diversity of shooting modes.) What value of $r$ seems to be too large? (`r` is too large and some bases seem to be duplicated.) Note that, if a basis were a perfect duplicate of another (they will not be, but could be similar), you would use one basis instead of two.

__PSTAT 234 (optional for 134)__: Choose two different numbers of components, say $r_1=3$ and $r_2=20$. Reconstruct the shooting pattern of at least two players using 3 bases and 20 bases. Is there any difference between the reconstructions?

- For a given player, plot the original shooting frequencies and corresponding reconstruction for $r \in \{3,20\}$.

Compute the approximation error, i.e., the norm of the difference $ \min_{W_r,H_r} \| X - W_rH_r \|_F$. Plot the approximation error as a function of $r$. (Note the subscript $r$ makes the choice of $r$ explicit.) Use at least 10 different values of $r$. Based on this plot, what can you say about choosing $r$?
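
One possible way to organize the error computation is sketched below. It runs on a small synthetic matrix `X_demo` (a stand-in for the real `X`, with 4 hidden modes) so that it executes quickly; the helper name `nmf_error` and the list of ranks are assumptions for illustration, not part of the assignment.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical stand-in for X: 200 locations (rows) by 40 players (columns),
# generated from 4 hidden non-negative modes.
rng = np.random.default_rng(0)
X_demo = rng.random((200, 4)) @ rng.random((4, 40))

def nmf_error(X, r):
    """Fit NMF with r components; return W, H and the Frobenius error."""
    model = NMF(n_components=r, init="nndsvda", max_iter=500, random_state=0)
    W = model.fit_transform(X)
    H = model.components_
    return W, H, np.linalg.norm(X - W @ H, "fro")

ranks = [1, 2, 3, 4, 6, 8]
errors = {}
for r in ranks:
    W_r, H_r, errors[r] = nmf_error(X_demo, r)

# Reconstruction of one player (column j) from the last fitted model
j = 0
recon_j = W_r @ H_r[:, j]

for r in ranks:
    print(r, round(float(errors[r]), 4))
```

With the real data, `plt.plot(ranks, [errors[r] for r in ranks], marker='o')` would give the error curve, and `recon_j` could be passed to `h.plot_shotchart` next to the original column `X[:, j]`. Because NMF solvers find local optima, the curve is not guaranteed to decrease at every step.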

In [None]:
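## Non-negative Matrix Factorization
import sklearn.decomposition as skld

def non_negative_marix_decomp(n_components, train_data):
    """Fit NMF and return (W, H): W holds the bases (columns), H the coefficients."""
    model = skld.NMF(n_components=n_components, init='nndsvda', max_iter=500, random_state=0)
    W = model.fit_transform(train_data)  # shape: (rows of train_data, n_components)
    H = model.components_                # shape: (n_components, columns of train_data)
    return W, H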

In [None]:
X = np.stack(list(all_smooth.values())).T

r = 3
W_3,H_3 = non_negative_marix_decomp(n_components = r,train_data = X)
print("W_3",W_3)
print("---------")
print("H_3",H_3)

In [None]:
r = 5
W_5,H_5 = non_negative_marix_decomp(n_components = r,train_data = X)
print("W_5",W_5)
print("---------")
print("H_5",H_5)

In [None]:
r = 10
W_10,H_10 = non_negative_marix_decomp(n_components = r,train_data = X)
print("W_10",W_10)
print("---------")
print("H_10",H_10)

In [None]:
r = 15
W_15,H_15 = non_negative_marix_decomp(n_components = r,train_data = X)
print("W_15",W_15)
print("---------")
print("H_15",H_15)

In [None]:
fig, ax = plt.subplots(2,2, figsize=(20,40))

axi = ax.flatten()
# Plot the basis vector at column index 1 of W (a basis, not a player) for each r
h.plot_shotchart(W_3[:,1], xedges, yedges, ax=axi[0])
h.plot_shotchart(W_5[:,1], xedges, yedges, ax=axi[1]) 
h.plot_shotchart(W_10[:,1], xedges, yedges, ax=axi[2])
h.plot_shotchart(W_15[:,1], xedges, yedges, ax=axi[3]) 

axi[0].set_title('Estimated Shooting Pattern (r=3)')
axi[1].set_title('Estimated Shooting Pattern (r=5)')
axi[2].set_title('Estimated Shooting Pattern (r=10)')
axi[3].set_title('Estimated Shooting Pattern (r=15)')

__Based on the four plots above, `r` = 5 seems too small to represent the diversity of shooting modes, and `r` = 15 seems too large because some bases appear to be duplicated. Therefore, 10 seems to be the best value for `r` among these four values.__

## Problem 2

__PSTAT 134 and 234__: In the previous question, NMF gave us a set of bases to describe each player. So, the comparison is through a standard set of shooting styles. We may also approach the comparison more directly.

* In this problem, we compare players' shooting styles to each other directly. What we are interested in is the pairwise correlation between shooting patterns. Let $X_i$ represent the column in the smoothed shooting pattern for player $i$. Then, we want to compute   
    $$ R = [\text{Cor} (X_i, X_j)]_{i,j} $$ for all player combinations $i,j\in\{1,2,\dots,362\}$. What is the correct orientation of matrix $X$? What should be the dimension of matrix $R$?   
    _Note: if your command is not running properly, you may be running into the issue of using too much memory, and your notebook session is rebooted by the server as a result._
    
* Visualize matrix $R$ with [seaborn.heatmap](https://seaborn.pydata.org/generated/seaborn.heatmap.html) function.

* Identify 2 pairs of players with highest similarities (positive correlation) and 2 pairs with lowest similarity (negative correlation). Plot their shooting pattern. What do you observe?
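
The orientation question and the search for extreme pairs can be sketched on toy data as follows. `X_demo` is a hypothetical stand-in with one column per player, in which two pairs are deliberately built to be nearly identical and exactly opposite; none of these names come from the assignment.

```python
import numpy as np

# Hypothetical stand-in for X: 50 locations (rows) by 6 players (columns)
rng = np.random.default_rng(0)
X_demo = rng.random((50, 6))
X_demo[:, 1] = X_demo[:, 0] + 0.01 * rng.random(50)  # player 1 nearly copies player 0
X_demo[:, 5] = 1.0 - X_demo[:, 4]                    # player 5 mirrors player 4

# Players are columns, so columns are the variables: rowvar=False
# yields a players-by-players matrix.
R_demo = np.corrcoef(X_demo, rowvar=False)
print(R_demo.shape)  # (6, 6)

# Use the strict upper triangle so each pair appears once
# and the diagonal (self-correlation) is skipped.
iu, ju = np.triu_indices_from(R_demo, k=1)
pair_corr = R_demo[iu, ju]
order = np.argsort(pair_corr)

lowest = [(int(iu[k]), int(ju[k])) for k in order[:2]]
highest = [(int(iu[k]), int(ju[k])) for k in order[::-1][:2]]
print("most similar pairs:", highest)
print("least similar pairs:", lowest)
```

On the real data, the column indices would need to be mapped back to player IDs (for example through the key order of `all_smooth`) before plotting the corresponding shooting patterns.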

__PSTAT 234 (optional for 134)__: Perform hierarchical clustering with matrix $R$, and visualize the clustered matrix.
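
A minimal sketch of one way to cluster with $R$, assuming SciPy is available: convert correlation to the dissimilarity $1 - R$, run average-linkage clustering, and reorder the rows and columns of $R$ by the dendrogram's leaf order. The toy data, the choice of $1 - R$, and the average linkage are all assumptions for illustration; other dissimilarities and linkage methods are equally defensible.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import squareform

# Hypothetical stand-in: two groups of 4 "players", 60 locations each
rng = np.random.default_rng(0)
base_a, base_b = rng.random(60), rng.random(60)
cols = [base_a + 0.05 * rng.random(60) for _ in range(4)]
cols += [base_b + 0.05 * rng.random(60) for _ in range(4)]
perm = rng.permutation(8)                  # shuffle so the groups are interleaved
X_demo = np.column_stack(cols)[:, perm]
R_demo = np.corrcoef(X_demo, rowvar=False)

# Correlation -> dissimilarity: 0 for identical patterns, 2 for opposite ones
D = 1.0 - R_demo
D = (D + D.T) / 2.0        # remove tiny floating-point asymmetries
np.fill_diagonal(D, 0.0)   # exact zeros on the diagonal

# linkage expects the condensed (upper-triangle) form of the distance matrix
Z = linkage(squareform(D, checks=False), method="average")
order = leaves_list(Z)

# Reorder rows and columns so similar players sit next to each other
R_clustered = R_demo[np.ix_(order, order)]
print(order)
```

Passing `R_clustered` to `sns.heatmap` should show the clusters as blocks along the diagonal. `seaborn.clustermap` performs the clustering and plotting in one call, though by default it clusters rows and columns by its own distance computation rather than by $1 - R$.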

In [None]:
# X has one column per player, so columns are the variables (players x players result)
Corr_X = np.corrcoef(X, rowvar=False)
Corr_X

In [None]:
import seaborn as sns; sns.set()
ax = sns.heatmap(Corr_X)

In [None]:
X_test = X[:, 0:12].copy()  # first 12 players; `X_new` was undefined, `X` assumed
R = np.corrcoef(X_test,rowvar=0)
ax = sns.heatmap(R)

## Problem 3

__PSTAT 134 and 234__: How would you use the coefficients matrix $H$ from NMF or the correlation matrix $R$ (computed above) to differentiate between types of players? Consider what the coefficients represent, and how you can use them to discriminate player types.

Give your thought process, reasoning for your chosen method, and the results. Do they look reasonable? Do you expect any of the comparisons to be similar to any of the [figures here](https://fastbreakdata.com/classifying-the-modern-nba-player-with-machine-learning-539da03bb824)? Why, or why not? Can you verify your intuition?
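
As one simple starting point (not the only reasonable method), each player could be labeled by the basis that carries the largest share of their normalized coefficients. The sketch below uses a small hand-made matrix `H_demo` and placeholder labels in `player_ids`; both are hypothetical and stand in for the real `H` and player IDs.

```python
import numpy as np

# Hypothetical coefficient matrix: 3 bases (rows) by 5 players (columns)
H_demo = np.array([
    [4.0, 0.2, 1.0, 0.0, 2.0],
    [0.5, 3.0, 1.0, 0.1, 2.0],
    [0.5, 0.8, 2.0, 0.9, 4.0],
])
player_ids = ["p0", "p1", "p2", "p3", "p4"]  # placeholder labels

# Scale each player's coefficients to sum to 1 so players are compared
# by their mix of bases rather than by the overall scale of the coefficients.
H_norm = H_demo / H_demo.sum(axis=0, keepdims=True)

# Label each player by the basis with the largest share
dominant = H_norm.argmax(axis=0)
for pid, k, share in zip(player_ids, dominant, H_norm.max(axis=0)):
    print(pid, "dominant basis:", int(k), "share:", round(float(share), 2))
```

A dominant-basis label discards information about players with mixed styles (here `p2` and `p4` both put only half of their weight on basis 2), so clustering the full normalized columns is a natural refinement.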

In [None]:
## Hd holds coefficients
Hd = pd.DataFrame(H_10, columns=all_smooth.keys())
Hd.T.head(10)

In [None]:
# Note that these players' coefficients are not scaled to sum to 1.
Hd.sum(0)

In [None]:
# Scale each player to sum to 1.

Hd /= Hd.sum(0)
Hd.T.head(10)

__The __

## Problem 4

__PSTAT 134 and 234__: Suppose you are in charge of a basketball team. How would you use this information? How would you use what you have learned from analyzing the data, and what other questions would you like to answer with further analysis?

__I would devise strategies against those players based on the shot types they prefer. I could also use these data to train my team.__