# Feature engineering on NCAA data

Domain knowledge is critical to getting the best out of data analysis and machine learning.
In the case of basketball, Dean Oliver identified four factors that are critical to success:
* Shooting
* Turnovers
* Rebounding
* Free Throws

Of course, it is not enough to identify factors; you also need a way to measure them.

Read [this article](https://www.basketball-reference.com/about/factors.html) about the four factors and how they are measured. In this notebook, we will compute them from the box score data.

## Shooting efficiency

Shooting is measured as the fraction of field goal attempts made, with each made 3-pointer counting 1.5 times as much as a made 2-pointer:

$(FG + 0.5 * 3P) / FGA$
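
As a quick sanity check of the formula, here is a minimal Python sketch. The function name and the box score numbers are made up for illustration, and it assumes `fga` counts all field goal attempts, 3-pointers included.

```python
def effective_fg_pct(fgm, fgm3, fga):
    """Shooting factor: (FG + 0.5 * 3P) / FGA, or None when there are no attempts."""
    if not fga:
        return None
    return (fgm + 0.5 * fgm3) / fga

# Hypothetical game: 25 made field goals, 8 of them 3-pointers, on 60 attempts.
efg = effective_fg_pct(25, 8, 60)
print(round(efg, 4))
```

The eight made 3-pointers add four "bonus" makes, so the team is credited with 29 makes on 60 attempts.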

Let's compute the offensive and defensive shooting efficiency and see how correlated they are to winning teams.

In [None]:
%%bigquery df1
SELECT 
  team_code,
  AVG(SAFE_DIVIDE(fgm + 0.5 * fgm3,fga+fga3)) AS offensive_efficiency,
  AVG(SAFE_DIVIDE(opp_fgm + 0.5 * opp_fgm3,opp_fga+opp_fga3)) AS opponents_efficiency,
  AVG(win) AS win_rate,
  COUNT(win) AS num_games
FROM lab_dev.team_box
WHERE fga IS NOT NULL
GROUP BY team_code

Let's keep only the teams that played more than 100 games, and then plot the results.

In [None]:
df1 = df1[df1['num_games'] > 100]

In [None]:
df1.plot(x='offensive_efficiency', y='win_rate', style='o');

In [None]:
df1.plot(x='opponents_efficiency', y='win_rate', style='o');

Does the relationship make sense? Do you think offensive and defensive efficiency are good predictors of a team's performance?

## Turnover Percentage

Turnover percentage is measured as:

$TOV / (FGA + 0.44 * FTA + TOV)$
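
The denominator is commonly described as a rough count of the plays that end in a shot, a trip to the free throw line, or a turnover, with the 0.44 factor discounting free throws because several attempts often belong to one play. A minimal sketch with hypothetical numbers (the function name is an assumption, not part of the dataset):

```python
def turnover_pct(tov, fga, fta):
    """Turnover factor: TOV / (FGA + 0.44 * FTA + TOV), or None if the denominator is 0."""
    plays = fga + 0.44 * fta + tov
    if plays == 0:
        return None
    return tov / plays

# Hypothetical game: 12 turnovers, 60 field goal attempts, 20 free throw attempts.
tov_pct = turnover_pct(12, 60, 20)
print(round(tov_pct, 4))
```

Lower is better here, so we would expect this factor to be negatively related to winning.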

As before, let's compute this and see whether it is a good predictor. For simplicity, we will compute only the offensive turnover percentage, although we should really compute both sides as we did for shooting efficiency.

In [None]:
%%bigquery df2
SELECT 
  team_code,
  AVG(SAFE_DIVIDE(tov,fga+0.44*fta+tov)) AS turnover_percent,
  AVG(win) AS win_rate,
  COUNT(win) AS num_games
FROM lab_dev.team_box
WHERE fga IS NOT NULL
GROUP BY team_code
HAVING num_games > 100

In [None]:
df2.plot(x='turnover_percent', y='win_rate', style='o');

## Rebounding

Again, we should really measure both sides, but for simplicity we'll compute only the offensive rebounding percentage:

$ORB / (ORB + Opp DRB)$
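
The queries in this notebook wrap every ratio in `SAFE_DIVIDE`, which returns NULL instead of raising an error when the denominator is zero. The sketch below mimics that behavior in plain Python for the rebounding factor; both helper names and the numbers are hypothetical.

```python
def safe_divide(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is 0."""
    if denominator == 0:
        return None
    return numerator / denominator

def offensive_rebound_pct(oreb, opp_dreb):
    """Share of available rebounds at the offensive end that the offense collected."""
    return safe_divide(oreb, oreb + opp_dreb)

# Hypothetical game: 10 offensive rebounds, opponent grabbed 25 defensive rebounds.
print(round(offensive_rebound_pct(10, 25), 4))
# A row with no rebounds recorded yields None rather than an exception.
print(offensive_rebound_pct(0, 0))
```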

In [None]:
%%bigquery df3
SELECT 
  team_code,
  AVG(SAFE_DIVIDE(oreb,oreb + opp_dreb)) AS rebounding,
  AVG(win) AS win_rate,
  COUNT(win) AS num_games
FROM lab_dev.team_box
WHERE fga IS NOT NULL
GROUP BY team_code
HAVING num_games > 100

In [None]:
df3.plot(x='rebounding', y='win_rate', style='o');

The relationship doesn't seem all that strong here. One way to quantify the strength of a linear relationship is the correlation coefficient. Values near 0 indicate little correlation, and values near +1 or -1 indicate strong correlation:

In [None]:
df3.corr()['win_rate']

The correlation between `rebounding` and `win_rate` is 0.38. Compare that to the first data frame:

In [None]:
df1.corr()['win_rate']

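Notice that the offensive and opponents' efficiency have correlations of 0.67 and -0.66 with the win rate, both larger in magnitude than the value for rebounding. The turnover data frame can be checked the same way: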

In [None]:
df2.corr()['win_rate']
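
For reference, `corr()` computes the Pearson correlation coefficient by default. The sketch below computes it by hand on a small invented table and compares the result with pandas; the `pearson` helper and the `toy` data are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Invented values, one row per team.
toy = pd.DataFrame({
    "rebounding": [0.25, 0.30, 0.28, 0.35, 0.32],
    "win_rate":   [0.40, 0.55, 0.45, 0.70, 0.50],
})

r_manual = pearson(toy["rebounding"], toy["win_rate"])
r_pandas = toy.corr()["win_rate"]["rebounding"]
print(round(r_manual, 4), round(r_pandas, 4))
```

Keep in mind that this measures only linear association, so a scatter plot is still worth a look.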

## Free throw factor

This is a measure of both how often a team gets to the free throw line and how often it makes its free throws:

$FT / FGA$
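
One way to see why this single ratio captures both effects is to split it into two pieces: free throw attempts per field goal attempt, times the fraction of free throws made. A small sketch with hypothetical numbers (the function name is an assumption):

```python
def free_throw_factor(ftm, fga):
    """Free throw factor: made free throws per field goal attempt, or None if fga is 0."""
    if fga == 0:
        return None
    return ftm / fga

# Hypothetical game: 15 of 20 free throws made, 60 field goal attempts.
ftm, fta, fga = 15, 20, 60

trips_to_line = fta / fga      # how often the team gets to the line
accuracy_at_line = ftm / fta   # how often it converts once there

# The product of the two pieces equals FT / FGA.
print(round(trips_to_line * accuracy_at_line, 4), round(free_throw_factor(ftm, fga), 4))
```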


In [None]:
%%bigquery df3
SELECT 
  team_code,
  AVG(SAFE_DIVIDE(ftm,fga+fga3)) AS freethrows,
  AVG(win) AS win_rate,
  COUNT(win) AS num_games
FROM lab_dev.team_box
WHERE fga IS NOT NULL
GROUP BY team_code
HAVING num_games > 100

In [None]:
df3.plot(x='freethrows', y='win_rate', style='o');

In [None]:
df3.corr()['win_rate']

## Machine Learning

Let's use these factors to create a simple ML model.

In [19]:
%%bigquery
SELECT 
  team_code,
  is_home,
  SAFE_DIVIDE(fgm + 0.5 * fgm3,fga+fga3) AS offensive_efficiency,
  SAFE_DIVIDE(opp_fgm + 0.5 * opp_fgm3,opp_fga+opp_fga3) AS opponents_efficiency,
  SAFE_DIVIDE(tov,fga+0.44*fta+tov) AS turnover_percent,
  SAFE_DIVIDE(opp_tov,opp_fga+0.44*opp_fta+opp_tov) AS opponents_turnover_percent,
  SAFE_DIVIDE(oreb,oreb + opp_dreb) AS rebounding,
  SAFE_DIVIDE(opp_oreb,opp_oreb + dreb) AS opponents_rebounding,
  SAFE_DIVIDE(ftm,fga+fga3) AS freethrows,
  SAFE_DIVIDE(opp_ftm,opp_fga+opp_fga3) AS opponents_freethrows,
  win
FROM lab_dev.team_box
WHERE fga IS NOT NULL and win IS NOT NULL
LIMIT 10

Unnamed: 0,team_code,is_home,offensive_efficiency,opponents_efficiency,turnover_percent,opponents_turnover_percent,rebounding,opponents_rebounding,freethrows,opponents_freethrows,win
0,272,0,0.265823,0.465116,0.090253,0.277778,0.189189,0.235294,0.126582,0.55814,0
1,632,0,0.310345,0.402439,0.186782,0.309406,0.305556,0.434783,0.172414,0.585366,0
2,178,0,0.412698,0.457447,0.178359,0.2658,0.185185,0.35,0.15873,0.617021,0
3,504980,0,0.349315,0.545455,0.165479,0.241838,0.378378,0.176471,0.191781,0.590909,0
4,183,0,0.261111,0.447917,0.123762,0.259463,0.214286,0.227273,0.111111,0.541667,0
5,28,0,0.364198,0.528846,0.088028,0.228311,0.266667,0.166667,0.135802,0.326923,0
6,183,0,0.416667,0.477778,0.341352,0.302013,0.35,0.315789,0.12963,0.244444,0
7,183,0,0.282609,0.446809,0.190728,0.287173,0.176471,0.210526,0.144928,0.276596,0
8,183,0,0.335714,0.413043,0.294482,0.260736,0.208333,0.315789,0.085714,0.565217,0
9,183,0,0.401639,0.39,0.186314,0.147929,0.2,0.304348,0.163934,0.5,0


In [23]:
%%bigquery
CREATE OR REPLACE MODEL lab_dev.four_factors_model
OPTIONS(model_type='logistic_reg', input_label_cols=['win'])
AS

SELECT 
  team_code,
  is_home,
  SAFE_DIVIDE(fgm + 0.5 * fgm3,fga+fga3) AS offensive_efficiency,
  SAFE_DIVIDE(opp_fgm + 0.5 * opp_fgm3,opp_fga+opp_fga3) AS opponents_efficiency,
  SAFE_DIVIDE(tov,fga+0.44*fta+tov) AS turnover_percent,
  SAFE_DIVIDE(opp_tov,opp_fga+0.44*opp_fta+opp_tov) AS opponents_turnover_percent,
  SAFE_DIVIDE(oreb,oreb + opp_dreb) AS rebounding,
  SAFE_DIVIDE(opp_oreb,opp_oreb + dreb) AS opponents_rebounding,
  SAFE_DIVIDE(ftm,fga+fga3) AS freethrows,
  SAFE_DIVIDE(opp_ftm,opp_fga+opp_fga3) AS opponents_freethrows,
  win
FROM lab_dev.team_box
WHERE fga IS NOT NULL and win IS NOT NULL

In [24]:
%%bigquery evalstats
SELECT * FROM ML.EVALUATE(MODEL lab_dev.four_factors_model)

Unnamed: 0,precision,recall,accuracy,f1_score,log_loss,roc_auc
0,0.839405,0.843535,0.841371,0.841465,0.364191,0.919842


84% accuracy isn't bad, but ... there is a *huge* problem with the above approach.
How are we supposed to know Team A's free throw shooting percentage against Team B before the game is played?

What we could do is take Team A's free throw shooting percentage over the 4 games prior to this one and use that instead. This requires analytic (window) functions in SQL. If you are not familiar with these, make a copy of the SELECT statement and modify it in stages until you grasp what is happening.
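
To build intuition for the windowed average, here is a rough pandas analogue on a toy table. The data, the `toy_games` name, and the `n_prev` parameter are invented for illustration. Rows are sorted oldest first so that the shifted window sees only games played before the current one.

```python
import pandas as pd

# Invented per-game values for two teams.
toy_games = pd.DataFrame({
    "team_code": [1, 1, 1, 1, 1, 2, 2],
    "game_date": pd.to_datetime([
        "2019-01-01", "2019-01-05", "2019-01-09", "2019-01-12", "2019-01-16",
        "2019-01-02", "2019-01-06",
    ]),
    "freethrows": [0.20, 0.30, 0.10, 0.40, 0.25, 0.15, 0.35],
})

n_prev = 4  # number of earlier games to average over

toy_games = toy_games.sort_values(["team_code", "game_date"])
toy_games["prev_freethrows"] = (
    toy_games.groupby("team_code")["freethrows"]
    # shift(1) drops the current game; rolling() then averages up to n_prev earlier ones.
    .transform(lambda s: s.shift(1).rolling(window=n_prev, min_periods=1).mean())
)
print(toy_games)
```

A team's first game has no earlier games, so its value is NaN, much like the NULL that an empty window frame produces in SQL. That is why the final SELECT filters out rows where the averaged feature is NULL.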

In [26]:
%%bigquery
CREATE OR REPLACE MODEL lab_dev.four_factors_model
OPTIONS(model_type='logistic_reg', input_label_cols=['win'])
AS

WITH all_games AS (
SELECT 
  game_date,
  team_code,
  is_home,
  SAFE_DIVIDE(fgm + 0.5 * fgm3,fga+fga3) AS offensive_efficiency,
  SAFE_DIVIDE(opp_fgm + 0.5 * opp_fgm3,opp_fga+opp_fga3) AS opponents_efficiency,
  SAFE_DIVIDE(tov,fga+0.44*fta+tov) AS turnover_percent,
  SAFE_DIVIDE(opp_tov,opp_fga+0.44*opp_fta+opp_tov) AS opponents_turnover_percent,
  SAFE_DIVIDE(oreb,oreb + opp_dreb) AS rebounding,
  SAFE_DIVIDE(opp_oreb,opp_oreb + dreb) AS opponents_rebounding,
  SAFE_DIVIDE(ftm,fga+fga3) AS freethrows,
  SAFE_DIVIDE(opp_ftm,opp_fga+opp_fga3) AS opponents_freethrows,
  win
FROM lab_dev.team_box
WHERE fga IS NOT NULL and win IS NOT NULL
)

, prevgames AS (
SELECT 
  is_home,
  AVG(offensive_efficiency) 
       OVER(PARTITION BY team_code ORDER BY game_date DESC ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) AS offensive_efficiency,
  AVG(opponents_efficiency) 
       OVER(PARTITION BY team_code ORDER BY game_date DESC ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING)AS opponents_efficiency,
  AVG(turnover_percent)
       OVER(PARTITION BY team_code ORDER BY game_date DESC ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) AS turnover_percent,
  AVG(opponents_turnover_percent)
       OVER(PARTITION BY team_code ORDER BY game_date DESC ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) AS opponents_turnover_percent,
  AVG(rebounding)
       OVER(PARTITION BY team_code ORDER BY game_date DESC ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) AS rebounding,
  AVG(opponents_rebounding) 
       OVER(PARTITION BY team_code ORDER BY game_date DESC ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) AS opponents_rebounding,
  AVG(freethrows) 
       OVER(PARTITION BY team_code ORDER BY game_date DESC ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) AS freethrows,
  AVG(opponents_freethrows) 
       OVER(PARTITION BY team_code ORDER BY game_date DESC ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) AS opponents_freethrows,
  win
FROM all_games
)

SELECT * FROM prevgames
WHERE offensive_efficiency IS NOT NULL

In [27]:
%%bigquery evalstats
SELECT * FROM ML.EVALUATE(MODEL lab_dev.four_factors_model)

Unnamed: 0,precision,recall,accuracy,f1_score,log_loss,roc_auc
0,0.675065,0.687436,0.680808,0.681195,0.79653,0.748621


Based on just the teams' performance coming in, we can predict the outcome of games with 68% accuracy.

## More complex ML model

We can write a more complex ML model using Keras and a neural network.
The code is not that hard, but you'll have to do a lot more work (feature scaling, hyperparameter tuning)
to get better performance than you did with the BigQuery ML model.
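
As an illustration of the scaling step, the sketch below standardizes each feature column to zero mean and unit variance using statistics from the training rows only. The helper names and the random stand-in data are assumptions; the notebook itself trains on the unscaled features.

```python
import numpy as np

def fit_standardizer(train):
    """Compute per-column mean and standard deviation from the training rows."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    # Guard against constant columns, which would otherwise divide by zero.
    std = np.where(std == 0, 1.0, std)
    return mean, std

def standardize(x, mean, std):
    """Apply previously fitted statistics to any set of rows."""
    return (x - mean) / std

# Random stand-in for three feature columns.
rng = np.random.default_rng(0)
train = rng.uniform(0.1, 0.6, size=(100, 3))
test = rng.uniform(0.1, 0.6, size=(20, 3))

mean, std = fit_standardizer(train)
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)  # reuse the training statistics
print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
```

Fitting the statistics on the training rows and reusing them for the test rows keeps information about the test set out of the preprocessing.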

In [28]:
%%bigquery games
WITH all_games AS (
SELECT 
  game_date,
  team_code,
  is_home,
  SAFE_DIVIDE(fgm + 0.5 * fgm3,fga+fga3) AS offensive_efficiency,
  SAFE_DIVIDE(opp_fgm + 0.5 * opp_fgm3,opp_fga+opp_fga3) AS opponents_efficiency,
  SAFE_DIVIDE(tov,fga+0.44*fta+tov) AS turnover_percent,
  SAFE_DIVIDE(opp_tov,opp_fga+0.44*opp_fta+opp_tov) AS opponents_turnover_percent,
  SAFE_DIVIDE(oreb,oreb + opp_dreb) AS rebounding,
  SAFE_DIVIDE(opp_oreb,opp_oreb + dreb) AS opponents_rebounding,
  SAFE_DIVIDE(ftm,fga+fga3) AS freethrows,
  SAFE_DIVIDE(opp_ftm,opp_fga+opp_fga3) AS opponents_freethrows,
  win
FROM lab_dev.team_box
WHERE fga IS NOT NULL and win IS NOT NULL
)

, prevgames AS (
SELECT 
  is_home,
  AVG(offensive_efficiency) 
       OVER(PARTITION BY team_code ORDER BY game_date DESC ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) AS offensive_efficiency,
  AVG(opponents_efficiency) 
       OVER(PARTITION BY team_code ORDER BY game_date DESC ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING)AS opponents_efficiency,
  AVG(turnover_percent)
       OVER(PARTITION BY team_code ORDER BY game_date DESC ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) AS turnover_percent,
  AVG(opponents_turnover_percent)
       OVER(PARTITION BY team_code ORDER BY game_date DESC ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) AS opponents_turnover_percent,
  AVG(rebounding)
       OVER(PARTITION BY team_code ORDER BY game_date DESC ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) AS rebounding,
  AVG(opponents_rebounding) 
       OVER(PARTITION BY team_code ORDER BY game_date DESC ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) AS opponents_rebounding,
  AVG(freethrows) 
       OVER(PARTITION BY team_code ORDER BY game_date DESC ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) AS freethrows,
  AVG(opponents_freethrows) 
       OVER(PARTITION BY team_code ORDER BY game_date DESC ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) AS opponents_freethrows,
  win
FROM all_games
)

SELECT * FROM prevgames
WHERE offensive_efficiency IS NOT NULL

Unnamed: 0,is_home,offensive_efficiency,opponents_efficiency,turnover_percent,opponents_turnover_percent,rebounding,opponents_rebounding,freethrows,opponents_freethrows,win
0,0,0.393333,0.351852,0.173210,0.291054,0.280000,0.125000,0.120000,0.185185,1
1,0,0.422276,0.379630,0.102631,0.259757,0.223333,0.107955,0.108780,0.185185,1
2,0,0.389361,0.351852,0.164709,0.288645,0.277094,0.123252,0.121540,0.185185,1
3,0,0.374299,0.324744,0.147452,0.252341,0.263376,0.123689,0.129130,0.175073,1
4,0,0.358203,0.318812,0.124754,0.216648,0.266905,0.123689,0.158340,0.214714,1
5,1,0.327715,0.262415,0.143944,0.194255,0.264712,0.132212,0.152243,0.168418,0
6,1,0.363609,0.293279,0.129332,0.161021,0.220283,0.171336,0.161531,0.190023,0
7,1,0.422459,0.314155,0.142350,0.195388,0.189727,0.223420,0.196136,0.278839,0
8,1,0.453630,0.309498,0.162328,0.211090,0.154930,0.256226,0.182568,0.257202,0
9,1,0.555995,0.435167,0.186589,0.239869,0.185632,0.297893,0.279390,0.335633,0


In [29]:
import tensorflow as tf
import tensorflow.keras as keras

In [30]:
nrows = len(games)
ncols = len(games.iloc[0])
ntrain = (nrows * 7) // 10
print(nrows, ncols, ntrain)

242600 10 169820


In [31]:
# 0:ntrain are the training data; remaining rows are testing
# last col is the label
train_x = games.iloc[:ntrain, 0:(ncols-1)]
train_y = games.iloc[:ntrain, ncols-1]
test_x = games.iloc[ntrain:, 0:(ncols-1)]
test_y = games.iloc[ntrain:, ncols-1]
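
A positional split like this implicitly assumes that the rows arrive in no particular order. BigQuery does not guarantee row order without an `ORDER BY`, so it may be safer to shuffle before splitting. The sketch below shows one way to do that; `shuffled_split` and the `toy` frame are hypothetical and are not part of the notebook's pipeline.

```python
import pandas as pd

def shuffled_split(df, train_fraction=0.7, seed=42):
    """Shuffle the rows reproducibly, then split them into train and test parts."""
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    ntrain = round(len(shuffled) * train_fraction)
    return shuffled.iloc[:ntrain], shuffled.iloc[ntrain:]

# Invented frame standing in for the query result.
toy = pd.DataFrame({"id": range(10), "win": [0, 1] * 5})
train, test = shuffled_split(toy)
print(len(train), len(test))
```

Fixing the seed keeps the split reproducible between runs, which makes accuracy numbers easier to compare.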

In [32]:
model = keras.models.Sequential()
model.add(keras.layers.Dense(5, input_dim=ncols-1, activation='relu'))
model.add(keras.layers.Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [33]:
history = model.fit(train_x, train_y, epochs=5, batch_size=32)
score = model.evaluate(test_x, test_y, batch_size=512)
print(score)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
[0.5561181869061972, 0.7034488]


With a small neural network (a single hidden layer of 5 units), we are able to get 70% accuracy on the held-out 30% of rows using the four factors model.

In [None]:
# Copyright 2019 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.