## ECE 196: NBA Win Classification

**Note: Whenever you see '...', replace with a line of code**

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import balanced_accuracy_score
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv('NBA_Team_Data_18-19.csv')

In [None]:
df.head()

**Data Cleaning**

As you can see above, our dataset is fairly clean in that it is recorded in a consistent format. Let's first check for any null values in our data. This is important because most scikit-learn models cannot be fit on data that contains null values.

Hint: use .isnull() and .sum() to find the sum of null values in a given dataframe.

In [None]:
# Output a series with the column names of our dataframe and the sum of their null values
... 
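
As a rough illustration of the hint above, here is a minimal sketch on a made-up frame (the `toy` DataFrame and its values are assumptions, not the NBA data). Chaining `.isnull()` and `.sum()` yields one null count per column.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are invented)
toy = pd.DataFrame({
    "PTS": [110, 98, 121],
    "REB": [44, np.nan, 39],
    "Season": [np.nan, np.nan, np.nan],
})

# .isnull() marks each cell True/False; .sum() counts the True values per column
null_counts = toy.isnull().sum()
print(null_counts)
```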

The 'Season' column is the only column with null values.  All values are null, so we will drop the column.

In [None]:
df = ...
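
For reference, pandas offers at least two ways to remove a column like this. The sketch below uses an invented `toy` frame, not the assignment data: one option drops the column by name, the other drops every column whose values are all null.

```python
import numpy as np
import pandas as pd

# Toy frame with one entirely-null column (values are invented)
toy = pd.DataFrame({"PTS": [110, 98], "Season": [np.nan, np.nan]})

# Option 1: drop a column by name
dropped_by_name = toy.drop(columns=["Season"])

# Option 2: drop any column in which every value is null
dropped_all_null = toy.dropna(axis=1, how="all")
```

Both calls return a new DataFrame rather than modifying `toy` in place, which is why the result needs to be assigned.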

Next, let's split our dataset into its features, X, and the label, y. Remember that our label is whether or not a team won the game, which is stored in the 'W/L' column.

In [None]:
# X should equal all of the columns that are not 'W/L'
X = ...
# y should equal the 'W/L' column
y = ...
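
A small sketch of the same kind of split on invented data (the `toy`, `X_toy`, and `y_toy` names are hypothetical): the features keep every column except the label, and the label is a single Series.

```python
import pandas as pd

# Toy frame with two features and a label column (values are invented)
toy = pd.DataFrame({
    "PTS": [110, 98, 121],
    "TOV": [12, 15, 9],
    "W/L": ["W", "L", "W"],
})

X_toy = toy.drop(columns=["W/L"])  # every column except the label
y_toy = toy["W/L"]                 # the label as a Series
```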

**Exploratory Data Analysis**

This section will get us more familiar with our dataset, as well as help with feature selection later on. In the cell below, we have found the indices of wins and losses, respectively.

Using these indices, we will compare the distribution of the same variables when a team wins versus when they lose.

In [None]:
wins = y[y=="W"].index
losses = y[y=="L"].index

Generate at least 3 sets of histograms, one set for each of the following columns in X: 'PTS', 'REB', and 'TOV'.  Each set should have two histograms, one filtered by when a team wins, and another filtered by when a team loses.

Note: Your first plot will show up as blue, your second will be orange.

Hint: Use .loc to select certain rows

In [None]:
plt.xlabel("PTS") # Fill this in with the name of column you are plotting on the x-axis
plt.ylabel("Count") # The y-axis of a histogram should always be count, unless you are making a density plot

### Plot two histograms of your target column: one filtered by wins and one filtered by losses ###
...  # Plot histogram of 'PTS' column with data filtered by wins
...  # Plot histogram of 'PTS' column with data filtered by losses
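
The pattern of selecting rows with `.loc` and overlaying two histograms can be sketched on invented data as follows. The `toy_X` and `toy_y` objects are assumptions, and `alpha` and `label` are optional extras that make the overlap easier to read.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy features and labels sharing the same index (values are invented)
toy_X = pd.DataFrame({"PTS": [110, 98, 121, 101, 95, 117]})
toy_y = pd.Series(["W", "L", "W", "L", "L", "W"])

win_idx = toy_y[toy_y == "W"].index
loss_idx = toy_y[toy_y == "L"].index

fig, ax = plt.subplots()
# .loc[rows, column] picks the 'PTS' values for the chosen rows only
ax.hist(toy_X.loc[win_idx, "PTS"], alpha=0.5, label="Wins")
ax.hist(toy_X.loc[loss_idx, "PTS"], alpha=0.5, label="Losses")
ax.set_xlabel("PTS")
ax.set_ylabel("Count")
ax.legend()
```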

In [None]:
X.columns

In [None]:
plt.xlabel("REB") # Fill this in with the name of column you are plotting on the x-axis
plt.ylabel("Count") # The y-axis of a histogram should always be count, unless you are making a density plot

### Plot two histograms of your target column: one filtered by wins and one filtered by losses ###
...  # Plot histogram of 'REB' column with data filtered by wins
...  # Plot histogram of 'REB' column with data filtered by losses

In [None]:
plt.xlabel("TOV") # Fill this in with the name of column you are plotting on the x-axis
plt.ylabel("Count") # The y-axis of a histogram should always be count, unless you are making a density plot

### Plot two histograms of your target column: one filtered by wins and one filtered by losses ###
...  # Plot histogram of 'TOV' column with data filtered by wins
...  # Plot histogram of 'TOV' column with data filtered by losses

Note that the 'TOV' distributions look more alike for wins and losses than those of the other two columns. Let's look at the summary statistics for the 'TOV' column to find any difference in the distributions.

Hint: Use .describe() to generate summary statistics for a column.

In [None]:
...  # Apply .describe on the 'TOV' column filtered by wins
...  # Apply .describe on the 'TOV' column filtered by losses
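
One way to compare the two sets of summary statistics is to place them side by side. The sketch below uses invented turnover numbers; `toy_X`, `toy_y`, and `summary` are hypothetical names.

```python
import pandas as pd

# Toy turnover counts and labels (values are invented)
toy_X = pd.DataFrame({"TOV": [12, 15, 9, 18, 14, 10]})
toy_y = pd.Series(["W", "L", "W", "L", "L", "W"])

# A boolean mask inside .loc keeps only the rows where the mask is True
win_stats = toy_X.loc[toy_y == "W", "TOV"].describe()
loss_stats = toy_X.loc[toy_y == "L", "TOV"].describe()

# Put both summaries in one table for easier comparison
summary = pd.concat([win_stats, loss_stats], axis=1, keys=["Wins", "Losses"])
print(summary)
```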

**Feature Extraction and Engineering**

Now that we have some sense of what our data looks like and how the distributions of columns differ when a team wins versus when it loses, let's try to select the most important features for predicting whether or not a team will win. Additionally, we can create derived features by forming combinations of our columns as we see fit, or by turning categorical variables into vectors.

This part is pretty open-ended, but I do have some tips.  Check out the ['Four Factors of Basketball Success'](https://www.basketball-reference.com/about/factors.html) by Dean Oliver.  He gives weights to the four most important factors that lead to a team winning.  Our model will create weights for us, but the features outlined in his four factors can be derived from our dataset.

As a baseline model, try to input all quantitative columns into a kNN classifier and see what accuracy you are able to get.  Once you apply a model to all quantitative columns, come back to this part and see if you can improve your accuracy.  Machine Learning projects are an iterative process, meaning you should start with a baseline model and work your way up.

Recommendation: Try to standardize all your columns and see if that improves accuracy over non-standardization.
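
Standardization can be sketched with scikit-learn's `StandardScaler`, which is one common choice and not necessarily the approach intended here. The arrays below are invented. Note that the scaler is fit on the training rows only and then reused on the test rows, so the test set does not influence the scaling.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented training and test rows with two features each
train = np.array([[100.0, 40.0], [110.0, 44.0], [120.0, 48.0]])
test = np.array([[115.0, 42.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learn mean/std from train, then scale it
test_scaled = scaler.transform(test)        # reuse the training mean/std
```

Distance-based models such as kNN tend to be sensitive to feature scale, which is why standardizing is worth trying for the baseline.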

Finding Quantitative Columns:

The cell below provides skeleton code for finding all quantitative columns.

Hint: Use .dtype to find the datatype of a given column

In [None]:
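quant_cols = []
for col in X.columns:
    # is_numeric_dtype matches every integer and float width (int32, int64, float64, ...),
    # whereas comparing against 'int' or 'float' can miss some widths on certain platforms
    if pd.api.types.is_numeric_dtype(X[col].dtype):
        quant_cols.append(col)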
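
As an alternative to looping over columns, pandas can filter columns by dtype directly. This is a sketch on an invented frame (`toy` and `quant` are hypothetical names).

```python
import pandas as pd

# Toy frame mixing a text column with integer and float columns (values are invented)
toy = pd.DataFrame({
    "Team": ["LAL", "BOS"],
    "PTS": [110, 98],
    "FG%": [47.1, 44.3],
})

# include="number" keeps integer and float columns and skips text columns
quant = toy.select_dtypes(include="number").columns.tolist()
print(quant)
```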

**Optional**

In the cell below, I will provide skeleton code for deriving a feature from our dataset.  In this example, we will be able to find whether a team is playing at Home or not based on the 'Match Up' column.  There are two types of outputs for this column: Case 1 -- 'Team_1 vs. Team_2' or Case 2 -- 'Team_1 @ Team_2'.  In Case 1, Team_1 is at home, whereas in Case 2, Team_1 is away.  

We will utilize One Hot Encoding to separate 'Home' and 'Away' into two separate columns.

Hint: Create a function to encode 'vs.' and '@' as 'Home' and 'Away' respectively for each row, then use df[column].apply(func) to apply your function to your Series.

Hint: Use .split() to help isolate the 'vs.' and '@' characters.

Hint: Use pd.get_dummies(df[column]) to One Hot Encode a given column.

In [None]:
def find_home_away(row):
    ### Insert code here: function should return either 'Home' or 'Away'
    if ...:  # if we find 'vs.'
        return 'Home'
    if ...:  # if we find '@'
        return 'Away'

In [None]:
home_away = ... # Return a series of values that are either 'Home' or 'Away'. Use .apply
home_away.head()

In [None]:
X['Match Up'].head()
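
The three hints can be combined roughly as shown below. This is only a sketch: `home_or_away` is a hypothetical helper, the match-up strings are invented, and it assumes every value follows one of the two formats described above.

```python
import pandas as pd

def home_or_away(match_up):
    # Hypothetical helper: assumes "TEAM vs. TEAM" or "TEAM @ TEAM"
    tokens = match_up.split()  # split on whitespace
    if "vs." in tokens:
        return "Home"
    if "@" in tokens:
        return "Away"
    return None  # unexpected format

# Invented match-up strings
toy = pd.Series(["LAL vs. BOS", "LAL @ GSW", "BOS vs. NYK"], name="Match Up")

labels = toy.apply(home_or_away)  # Series of 'Home' / 'Away'
dummies = pd.get_dummies(labels)  # one indicator column per category
print(dummies)
```

`pd.get_dummies` orders its columns alphabetically, so 'Away' comes before 'Home' in the result.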

This cell has skeleton code for fitting a classifier (for example, Logistic Regression) to each feature individually against y. Use accuracy as a metric for determining which columns are most predictive of winning. Feel free to reuse this after creating any derived features to check their effectiveness.

In [None]:
X_feat = ... # Feature dataframe containing all the columns we want to use so far
#### Uncomment the line below if you created 'home_away' from above ###
# X_feat[['Away', 'Home']] = ... # use pd.get_dummies to get these values
model = ...  # Initiate a classification model, for example KNN, Logistic Regression, SVC, etc.
             # refer to sklearn documentation for this...
for col in X_feat:
    ...  # Fit your newly initiated model to the column at X_feat and y
    pred = ...  # Generate predictions based on the column at X_feat
    print("{}:".format(col))
    print("Total Accuracy:",sum(pred == y) / len(y))
    print()
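
One detail worth knowing when fitting on a single column: scikit-learn expects a two-dimensional feature array, so the column is selected with double brackets (`X[[col]]`), which returns a one-column DataFrame instead of a Series. A minimal sketch on invented data, assuming Logistic Regression as the model:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Invented data: 'PTS' separates wins from losses, 'TOV' mostly does not
toy_X = pd.DataFrame({
    "PTS": [120, 118, 115, 112, 101, 99, 96, 94],
    "TOV": [12, 15, 11, 16, 13, 14, 12, 15],
})
toy_y = pd.Series(["W", "W", "W", "W", "L", "L", "L", "L"])

scores = {}
for col in toy_X.columns:
    model = LogisticRegression(max_iter=1000)
    model.fit(toy_X[[col]], toy_y)       # double brackets keep the input 2D
    pred = model.predict(toy_X[[col]])
    scores[col] = (pred == toy_y).mean()  # fraction of correct predictions

print(scores)
```

These are training accuracies, so they are useful for ranking features against each other but will overstate performance on unseen games.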

In [None]:
### Add any additional feature engineering code here ###

**Model Selection and Performance**

Once you have extracted all the features you wish to use in your Machine Learning model, it is time to select a model and test its accuracy on unseen data. Because we don't actually have unseen data, we will set aside a sample of our total dataset and hold out its labels during training.

Using sklearn's train_test_split, create a testing and training set with 33% of our data going towards our test set.  Refer to sklearn [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) to implement this.

In [None]:
## Create X_train, X_test, y_train, and y_test with 33% of our total data in our test set
... # Use train_test_split from sklearn to create X_train, X_test, y_train, and y_test
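
A sketch of a 33% hold-out split on synthetic arrays (`X_toy` and `y_toy` are invented). The `random_state` and `stratify` arguments are optional additions, not requirements of the assignment: the first makes the split reproducible, and the second keeps the win/loss ratio similar in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 2 features, balanced labels
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array(["W", "L"] * 50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.33,    # 33% of the rows go to the test set
    random_state=0,    # fixed seed for reproducibility
    stratify=y_toy,    # preserve the class balance in both sets
)
print(len(X_tr), len(X_te))
```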

Once we have our training and testing datasets, we can initialize our classifier, fit it to our training data, and make predictions on our testing data. Pick whichever classifier you want; I have imported kNN, Logistic Regression, and SVM from sklearn. You can find more classifiers on sklearn's website and import them as you wish.

Hint: Use .fit() and .predict() after initializing the classifier

In [None]:
classifier = ... # Initialize a classifier here
## Fit and predict using X_train, y_train, and X_test
... # Fit classifier to your X_train and y_train
pred = ...  # Generate predictions based on X_test

The baseline accuracy (using KNeighborsClassifier with k=5) was 0.73522. See if you can beat this!

In [None]:
sum(pred == y_test) / len(pred)  # Outputs the accuracy of your predictions
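
The manual calculation above matches scikit-learn's `accuracy_score`. The sketch below checks this on invented labels and also computes `balanced_accuracy_score` (imported at the top of this notebook), which averages the recall of each class and can be more informative when wins and losses are not evenly represented.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Invented true labels and predictions
y_true = np.array(["W", "W", "W", "L", "L", "W"])
y_pred = np.array(["W", "W", "L", "L", "W", "W"])

manual = sum(y_pred == y_true) / len(y_pred)   # same formula as the cell above
acc = accuracy_score(y_true, y_pred)           # fraction of correct predictions
bal = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recall
print(manual, acc, bal)
```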

**Hyperparameter Tuning**

Now that we have generated predictions and an accuracy score from the model, change its hyperparameters to improve model performance. This may mean changing the value of k (n_neighbors) for KNeighborsClassifier, C for LogisticRegression and SVC, or other hyperparameters depending on your model.

You could loop through existing hyperparameters and output the score for each one to find the optimal hyperparameter value.

In [None]:
### Your code here: Hyperparameter Tuning ###
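
The loop described above might look like the following sketch for kNN. The data is synthetic, the candidate values of k are arbitrary, and the names (`X_syn`, `y_syn`, `scores`, `best_k`) are hypothetical. To avoid tuning against the final test set, the sketch scores each candidate on a separate validation split. Cross-validation would be another reasonable option.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data: the label depends on the first feature plus some noise
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(300, 2))
y_syn = np.where(X_syn[:, 0] + 0.5 * rng.normal(size=300) > 0, "W", "L")

X_tr, X_val, y_tr, y_val = train_test_split(
    X_syn, y_syn, test_size=0.33, random_state=0
)

scores = {}
for k in [1, 3, 5, 11, 21]:  # arbitrary candidate values
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_tr, y_tr)
    scores[k] = knn.score(X_val, y_val)  # accuracy on the validation split

best_k = max(scores, key=scores.get)
print(scores, best_k)
```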

**Conclusion**

As you will see, it is fairly easy to get a score above 0.70 regardless of how much feature engineering we do or which classifier we choose. However, we are cheating in a sense, because we are using statistics from a game to predict the result of that same game.

An interesting extension of this problem would be to predict who will win a game WITHOUT having any statistics present from that game.  One way to approach this problem would be to find average statistics of the team's last 5 or so games, and use that to predict their performance in the current game.
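
A rough sketch of the rolling-average idea, on invented data and assuming the rows are already sorted by date within each team (the `games` frame and the `PTS_prev_avg` column are hypothetical). The `shift(1)` step is what keeps the current game's own statistics out of its feature.

```python
import pandas as pd

# Invented game log, assumed sorted chronologically within each team
games = pd.DataFrame({
    "Team": ["LAL"] * 4 + ["BOS"] * 3,
    "PTS": [100, 110, 120, 130, 90, 100, 110],
})

# For each team: shift by one game so the current game is excluded,
# then average up to the 5 previous games
games["PTS_prev_avg"] = games.groupby("Team")["PTS"].transform(
    lambda s: s.shift(1).rolling(window=5, min_periods=1).mean()
)
print(games)
```

Each team's first game has no history, so its value is NaN and would need to be dropped or filled before modeling.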

Additionally, we could find outside datasets, such as a dataset of Vegas odds for each game, to see who is favored to win. The possibilities are endless as far as what features you could engineer to make more accurate predictions, as long as you have quality data and an open mind.