# Does the 2015-16 NBA season obey The Formula?

All fans of basketball should know about this [really cool 2008 Slate article by Bill James about **The Formula**](http://www.slate.com/articles/sports/sports_nut/2008/03/the_lead_is_safe.html) he developed for knowing when a margin is 'Safe', beyond any possibility of a comeback.

**The Formula** (**TF**) boils down to $$(m-3)^2>s$$ where $m$ is the margin, and $s$ is the number of seconds remaining. (The full version of The Formula actually uses 3.5 or 2.5 depending on whether the leading team has possession or not, but that's a detail I'm choosing to ignore.) If the margin is so big that **The Formula** is True, then there's not enough time for the trailing team to come back. Note that the use of -3 means that any margin of 3 or less is never safe -- which makes sense because there is always time for a buzzer-beater 3.
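As a minimal sketch of both versions of the rule: the function name `lead_is_safe` and its `has_possession` parameter are illustrative names of my own, and the mapping (2.5 when the leading team has the ball, 3.5 when it doesn't) is how I read the article. The rest of this notebook only uses the simplified version with 3.

```python
def lead_is_safe(margin, seconds_left, has_possession=None):
    """Return True if The Formula says the lead is safe.

    has_possession=None  -> simplified version (subtract 3)
    has_possession=True  -> leading team has the ball (subtract 2.5)
    has_possession=False -> trailing team has the ball (subtract 3.5)
    """
    if has_possession is None:
        offset = 3.0
    elif has_possession:
        offset = 2.5
    else:
        offset = 3.5

    adjusted = abs(margin) - offset   # the sign of the margin doesn't matter
    if adjusted <= 0:                 # small leads are never safe
        return False
    return adjusted ** 2 > seconds_left

print(lead_is_safe(10, 40))          # (10-3)^2 = 49 vs 40 seconds
print(lead_is_safe(10, 60))          # 49 vs 60 seconds
print(lead_is_safe(10, 50, True))    # (10-2.5)^2 = 56.25 vs 50 seconds
print(lead_is_safe(10, 50, False))   # (10-3.5)^2 = 42.25 vs 50 seconds
```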

To investigate how reliable **TF** is, I looked around for large-scale datasets with full shot sequences of many basketball games. I found what I was looking for on GitHub, posted by user [sealneaward](https://github.com/sealneaward). This ['events' directory](https://github.com/sealneaward/nba-movement-data/tree/master/data/events) includes a separate csv file for each game of the 2015-16 NBA season. (Click on the link to open in a new tab and take a look. The filenames run from 0021500001.csv to 0021500663.csv; those numbers are standard NBA-assigned GAME_IDs. You can also click on any of the files to take a peek, or download a .csv and open it in a spreadsheet program, but we'll also be examining the data in the notebook below.)

Follow along in the notebook below to see how **TF** can be examined for this dataset. Each cell can be executed with **SHIFT-ENTER**, or if you're feeling adventurous, look for suggested places to double-click in a cell, edit it, and re-execute.

**IMPORTANT NOTE**: It is too easy to yield to the temptation to just shift-enter through all the cells and look at the results. For maximum learning, force yourself to go slow, and read through the text cells *and* python cells. The python cells are commented to help guide you as to what's going on, even if you don't know much Python.

In [None]:
# These are the Python libraries we'll need for this analysis
import os                          # The operating system allows us to check if files exist
import random                      # Just for choosing random games in the season
import numpy as np                 # Numpy does all things numerical
import matplotlib.pyplot as plt    # Matplotlib is for making charts and graphs
import pandas as pd                # Pandas is for loading and manipulating csv by columns and rows

## Loading the Data

The next cell fetches the data directly from online (if necessary). 

If you are running this notebook in a browser, it will take a while to download all the separate files. The combined file is saved to the ephemeral virtual machine currently in use, so it gets cleaned up automatically when that machine goes away.

If you are running this notebook from your own computer, it will save the file nba2015scores.csv locally, so it will be available for reuse if you come back again.

Note: the cell also saves the full nba2015.csv file, which has rows for all kinds of events (missed shots, rebounds, blocks, etc.). Only scoring events populate the 'SCORE' column, so I can create a much smaller input file by keeping only the rows that have a 'SCORE'.
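Here is a small sketch of that filtering idea on a made-up four-row table. The 'EVENT' column and its values are invented for illustration; only the 'SCORE' column name comes from the real data.

```python
import pandas as pd

# A tiny made-up play-by-play: only scoring events have a SCORE
toy = pd.DataFrame({
    'EVENT': ['made shot', 'missed shot', 'rebound', 'free throw'],
    'SCORE': ['2 - 0', None, None, '2 - 1'],
})

# notnull() gives True/False per row; indexing with it keeps the True rows
scoring = toy[toy['SCORE'].notnull()]

print(len(toy), 'events in total')
print(len(scoring), 'scoring events')
```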

In [None]:
if os.path.exists('nba2015scores.csv'):    # if the file exists in the same directory as the notebook
    df = pd.read_csv('nba2015scores.csv')  # just let pandas slurp it in

else:  
    # This downloads csv for all the individual games, concatenates and writes to file
    # Running this notebook on your computer means this is only needed once
    # Running online will have to do this every time, unless you download the
    # nba2015scores.csv file and re-upload it for new virtual machines
    
    # All the URLs are like this, with a different last 3 digits
    url = 'https://raw.githubusercontent.com/sealneaward/nba-movement-data/master/data/events/0021500{:03d}.csv'
    
    # Download the first file into a Pandas 'DataFrame' (think 'spreadsheet tab')
    url1 = url.format(1) # insert the number 1 in the {} part of url near the end
    print(url1)          
    nba2015 = pd.read_csv(url1)                 # download the first game into 'nba2015'

    # Now download all the rest of the games and tack them onto the same growing DataFrame
    for gi in range(2,664):
        print(url.format(gi))
        try:                                    # game 0021500006.csv is actually missing
            log = pd.read_csv(url.format(gi))   # download into 'log'
            nba2015 = pd.concat([nba2015,log])  # concatenate log onto nba2015
        except Exception:                       # if the download fails (e.g. missing file)...
            pass                                #...keep going
    
    # Now we have the whole season in a DataFrame called nba2015
    nba2015.to_csv('nba2015.csv')               # write out locally: 300K rows, 45MB
    df = nba2015[ nba2015['SCORE'].notnull() ].copy()  # omit non-scoring events (copy so we can add columns later)
    df.to_csv('nba2015scores.csv')              # only 75K rows, 13MB

## Examining the Data

After running the previous cell, whether the data was downloaded slowly, or locally-loaded quickly, it's sitting in a DataFrame named df (as is the most common convention). Let's see what it looks like!

In a Python notebook, just mentioning a variable name causes it to print itself reasonably.

In [None]:
# This is so big (note at the bottom it says rows X cols), 
# even with scroll bars it still omits some columns with '...'
df

In [None]:
# This shows info about all of the columns, or 'Series'
df.info()

In [None]:
# Of all those columns above, we are interested in only a few. 
# Selecting just those few columns makes for an easier-to-read-preview

# This is called a 'slice' of the DataFrame, we have selected a 'slice' of the available columns.

# The .head(10) at the end shows the first 10 scoring events of the first game.
# You can change that number (how much do you have to head() to get to the 2nd quarter?), 
# or switch to tail(10) to see the last events of the last game.
df[ ['GAME_ID', 'PERIOD', 'PCTIMESTRING', 'SCORE', 'SCOREMARGIN', 'HOMEDESCRIPTION', 'VISITORDESCRIPTION'] ].head(10)

## Augmenting the Data

### The Formula element 1: the score margin

As-is, the SCORE column is not numerical, it is a text string with two numbers separated by a dash and spaces. The SCOREMARGIN column is also not numerical, because sometimes it says 'TIE' instead of 0.

We're going to bust apart the SCORE column, and create new columns to hold the data we need numerically.

NOTE: since the pre-existing column names are ALL_CAPS, I'm going to adopt a convention of lower_case for new columns that we derive from them.
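For comparison, here is a sketch of the route this notebook does not take: making a SCOREMARGIN-style column numeric by turning 'TIE' into 0. The sample values are made up, and this says nothing about the sign convention of the real column.

```python
import pandas as pd

# Made-up values in the style of the SCOREMARGIN column
toy_margin = pd.Series(['2', 'TIE', '-3'])

# Swap the one non-numeric value for '0', then convert the strings to numbers
numeric_margin = pd.to_numeric(toy_margin.replace('TIE', '0'))

print(numeric_margin.tolist())
```

Deriving the margin from SCORE instead, as the next cell does, has the advantage that we control exactly what the sign means (home minus visitor).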

In [None]:
# The right side splits all the SCORE strings at the ' - ', and returns two new Series
# The left side catches those new Series as new columns with names hscore and vscore (h=home, v=visitor)
df[['hscore','vscore']] = df['SCORE'].str.split(' - ', expand=True)

# Since these new hscore/vscore were split from strings, they're still strings.
# Here we convert them to integer values
df['hscore'] = df['hscore'].astype(int)
df['vscore'] = df['vscore'].astype(int)

# Now that hscore and vscore are numbers, we can do arithmetic with them!
# This is just like if we created a new column in a spreadsheet and populated
# it with a formula that subtracts the two other columns
df['margin'] = df['hscore'] - df['vscore']

# Now let's take a look at these columns of interest, make sure it did it all right
df[['SCORE','hscore','vscore','margin']]

### The Formula element 2: remaining seconds

Time is represented in this dataset by two columns: PERIOD says which period we are in, and PCTIMESTRING is the game clock (MM:SS), which counts down to 00:00 each period.

We need to munge that into seconds_elapsed (for graphing the progress of games) and seconds_left (for evaluating **TF**)

In [None]:
# This shows how often each value of PERIOD occurs. 
# 1-4 are the regular 4 quarters. The reason they show up different numbers of times is
# this is a count of how many scoring events happened in Q1-Q4 (across all games in the season).

# And then there's PERIOD 5-8, indicating 1st through quadruple (!!!) overtime
df['PERIOD'].value_counts()

In [None]:
# This gives some information about the PCTIMESTRING (game clock) column.
# Since an NBA quarter is 12 minutes, how many possible different MM:SS could the game clock ever be?
# Does that make sense with the number of 'unique' values reported to be in this column?

# Note also the dtype is 'object', i.e. string (because of the colon).
# We'll have to split and convert to numbers again
df['PCTIMESTRING'].describe()

Here I define two python functions for computing elapsed seconds. The lower-level/inner one acts directly on numerical inputs, the higher-level/outer one acts on Pandas rows, and calls the other one (which is why I call them outer/inner).

Note at this point, the functions are just being *defined*. This makes them available for *use* (or *calling*) later.

In [None]:
# This is a function to compute how many total game seconds have elapsed,
# given the period and the game clock minutes,seconds, all as numbers.
# Note we have to handle overtime periods carefully
def compute_seconds_elapsed(p,m,s):
    
    # how many seconds are left in current period
    secleft = int(m)*60 + int(s) 
    
    if p <= 4:
        # Regular game time
        sec = (p-1) * 12 * 60  # all the previous quarters are complete
        sec += 12*60 - secleft # how many seconds have elapsed in this period
    else:
        # Overtime!
        sec =    4   * 12 * 60 # All 4 quarters are complete
        sec += (p-5) *  5 * 60 # Any previous 5-minute overtimes are complete
        sec +=  5*60 - secleft # how many seconds have elapsed in this period
    
    return sec
        
# This function takes a row of a Pandas DataFrame and obtains the
# period and game clock minutes,seconds as numbers so it can use the other function.
def row_seconds_elapsed(row):
    # PERIOD could be as big as 8 (quadruple overtime)
    p = row['PERIOD']
    # M:SS left in period
    m,s = row['PCTIMESTRING'].split(':') # these are still strings at this point
    return compute_seconds_elapsed(p, m, s)

The reason for separating into outer/inner functions instead of just putting all the logic into the outer one, is that a number-only function is easier to test in isolation.

Use the next cell to test that the numerical function operates correctly. Q1, game clock 11:59 should say that 1 second has elapsed. How many other good test cases can you think of? How about Q1 0:01 (compute_seconds_elapsed(1,0,1))? Or Q1 0:00? Or Q2 12:00? Or Q4 0:01? Or P5 4:59?
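Those test cases can be written down as assertions. The sketch below uses a condensed copy of the function, renamed `elapsed_sketch` (my own name) so the block runs by itself; the expected values are worked out by hand from 12-minute quarters and 5-minute overtimes.

```python
def elapsed_sketch(p, m, s):
    """Condensed copy of compute_seconds_elapsed, for testing in isolation."""
    secleft = int(m) * 60 + int(s)           # seconds left in this period
    if p <= 4:                               # regulation: 12-minute quarters
        return (p - 1) * 12 * 60 + (12 * 60 - secleft)
    # overtime: 4 full quarters, then 5-minute periods
    return 4 * 12 * 60 + (p - 5) * 5 * 60 + (5 * 60 - secleft)

assert elapsed_sketch(1, 11, 59) == 1       # one second into the game
assert elapsed_sketch(1, 0, 1) == 719       # one second before the end of Q1
assert elapsed_sketch(1, 0, 0) == 720       # end of Q1...
assert elapsed_sketch(2, 12, 0) == 720      # ...is the same instant as start of Q2
assert elapsed_sketch(4, 0, 1) == 2879      # one second before the end of regulation
assert elapsed_sketch(5, 4, 59) == 2881     # one second into the first overtime
print('all test cases pass')
```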

In [None]:
# Try out lots of inputs for period and game clock minutes,seconds
# to convince yourself this is working right
compute_seconds_elapsed(1,11,59)

Now that we have a function that can operate on pandas rows, we can apply it to compute elapsed seconds from the PERIOD/PCTIMESTRING of every row, and capture the result in a new column added to the DataFrame.

In [None]:
# The right side applies the function to the DataFrame. 
# axis=1 makes it work on rows, instead of columns (which it would do by default)
# The left side catches the results in a new DataFrame Series with column name seconds_elapsed
df['seconds_elapsed'] = df.apply(row_seconds_elapsed, axis=1)

# Let's take a look at the relevant columns to verify that the calculations are correct
# As before, you can try more than 5 rows, and switch head() to tail()
df[['PERIOD','PCTIMESTRING','SCORE','seconds_elapsed']].head(5)

Now that we have a column with seconds_elapsed for every row, do similar transformations to compute and persist seconds_left for every row.
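As an aside, the same calculation can be done on whole columns at once instead of row by row. This is only a sketch on a made-up four-row table (the variable names are mine); the notebook itself sticks with the row-by-row approach, which is easier to follow.

```python
import pandas as pd

# Made-up rows: one each from Q1, Q3, Q4 and the first overtime
toy = pd.DataFrame({'PERIOD':       [1, 3, 4, 5],
                    'PCTIMESTRING': ['11:59', '3:05', '0:01', '4:59']})

# Split 'MM:SS' into two columns (0 = minutes, 1 = seconds) and make them integers
clock = toy['PCTIMESTRING'].str.split(':', expand=True).astype(int)
period_left = clock[0] * 60 + clock[1]

# Full quarters still to come: 3 in Q1, ..., 0 in Q4 and in any overtime
full_quarters_left = (4 - toy['PERIOD']).clip(lower=0)

toy['seconds_left'] = period_left + full_quarters_left * 12 * 60
print(toy)
```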

In [None]:
# A similar pair of functions, an inner one that acts on numbers, and an outer one that acts on rows

def compute_seconds_left(p,m,s):
    # how many seconds are left in current period
    secleft = int(m)*60 + int(s)
    # This is the right answer for 4th quarter or any overtime period!
    if p>=4:
        return secleft
    
    # Otherwise add in the time for the remaining full quarters
    secleft += (4-p)*12*60
    return secleft

# From columns PERIOD and PCTIMESTRING (like 3rd quarter, 3:05 remaining)
# compute how many total seconds of game time are left
def row_seconds_left(row):
    # PERIOD could be as big as 8 (quadruple overtime)
    p = row['PERIOD']
    # M:SS left in period
    m,s = row['PCTIMESTRING'].split(':') # these are still strings at this point
    return compute_seconds_left(p, m, s)
    

In [None]:
# As before, use this cell to test a good variety of inputs 
compute_seconds_left(3,0,1)

In [None]:
# As before, apply the function and put the results into a new column
df['seconds_left'] = df.apply(row_seconds_left, axis=1)

# Check out the SCOREs vs seconds_left at the end of the last game.
# Can you tell this game ended with 4 free throws?
df[['PERIOD','PCTIMESTRING','SCORE','seconds_left']].tail(5)

## Graphing an individual game

Now we finally have enough data in place to check out some of the data in graphs, using matplotlib (aka 'plt', see the import statement in the first code cell)

First off, since df holds the entire season, we want to filter out just one game, by grabbing only rows with a matching GAME_ID.

In [None]:
# For starters let's look at game 1 (game_id 21500001)
gid = 1

# If you want, you can change gid to a different value and repeat these code cells, 
# or you could uncomment this line to let python choose a random game from the season
#gid = random.randint(1,663)

# Note how (when gid==1), this prints out a list that starts with True, and ends with False.
# This is because the rows at the beginning are the ones where GAME_ID==21500001, the rest are not
df['GAME_ID']== (21500000 + gid)

In [None]:
# The right side takes that same expression that gives Trues and Falses, and feeds it into df
# That 'slices' df in the row direction, selecting all the rows which are True
# The left side catches that slice in a DataFrame variable named 'game'
game = df[ df['GAME_ID']== (21500000 + gid) ]

# game has all the same columns as its parent df 
# (scroll to the right to find the new columns we created)
game

Now we can finally use the graphing library, matplotlib!

A Pandas DataFrame (or slice, same thing) knows how to use matplotlib to .plot() its columns.

In [None]:
plt.figure()     # this creates a new Figure (graph)
a=plt.gca()      # gca() stands for 'get current axes'
                 # we need these to know where the series should plot themselves
                 # (some Figures could have multiple Axes (multiple subplots in a grid))
# Here we tell the game which column names to plot as X and Y, on the ax(es) a
game.plot('seconds_elapsed', 'hscore', ax=a) # plot the home score as a function of time
game.plot('seconds_elapsed', 'vscore', ax=a) # plot the visitor score as a function of time
plt.show()       # always do this last to tell matplotlib to render the graph now

# What can you understand about the progress of this game from the graph?

In [None]:
# Rather than separate curves for home and visitor scores
# we can summarize the progress of the game by just plotting the margin
plt.figure()          # start a new figure
a=plt.gca()           # grab the axes
# Note here we use 'margin' as the Y for our graph
game.plot('seconds_elapsed', 'margin', ax=a)
plt.show()

# Does the view of the game presented by this graph match the previous graph?

## Graphing All the Games

Using the idea of graphing the margin, we can splash every game of the season onto one graph together. 

Building up this complex graph took many iterations/tries. And even when it was 'done', I copied and pasted this below for more graphs and tweaked it further. Copy/paste is a BAD IDEA in software, a sure sign that code should be modularized into a function that can be called repeatedly. This is the final form.

Note: plotting a whole season takes a good few minutes. If you want to try something out quickly, use the default maxplots=20. When you're ready to turn it loose on the whole season, use maxplots=None.

And since plt.show() is always supposed to be last, this function will do that by default. But if you want to do other stuff to the graph after calling this function, send in do_show=False.

When this function is done, it returns the Axes, which you can use to do more stuff to the graph.

In [None]:
# Plot the score margin curves for many games in a DataFrame all on the same plot
# Return the Axes object for further plotting
def plot_margins(adf,           # adf is A DataFrame, could be df, could be a slice
                 maxplots=20,   # Plot a few quickly, use None to spend a few minutes plotting everything
                 do_show=True): # optionally the caller does this
    
    # Keep count how many games, in case we need to stop early
    plots=0
    
    # The standard new figure boilerplate. (20,10) makes it nice and big (and 2x landscape)
    plt.figure(figsize=(20,10))
    a=plt.gca()

    # For every GAME_ID in the DataFrame, add a curve to the plot
    for agid in adf['GAME_ID'].unique():    # use adf,agame,agid, don't confuse with df,game,gid
        agame = adf[ adf['GAME_ID']==agid ] # create a slice for this game id
        agame.plot('seconds_elapsed',       # X = seconds_elapsed
                   'margin',                # Y = score margin
                   ax=a,                    # This is the Axes to put it onto
                   c='b',                   # color = b[lue]
                   alpha=0.1)               # Make it mostly transparent; overwritten
                                            # curves will get more saturated color
        # Stop early for quick tests
        plots += 1
        if maxplots is not None:
            #print(agid)
            if plots>=maxplots:
                break

    # Make the graph look really nice
    # Set the X range to hold 4 quarters plus 4 overtimes
    a.set_xlim(0, 4*12*60 + 4*5*60) # longest game has quadruple(!!!) overtime
    
    # A couple useful values:
    qsec = 12*60   # seconds in a quarter
    gsec = 4*qsec  # seconds in a regulation game
    otsec = 5*60   # seconds in an overtime
    # This is a list of the number of seconds at the end of every period
    times = [qsec, 2*qsec, 3*qsec, 4*qsec, # 4 quarters
             gsec+otsec, gsec+2*otsec, gsec+3*otsec, gsec+4*otsec] # 4 overtimes
    
    # Instead of letting matplotlib decide at which round numbers of seconds
    # to put tick marks, control this directly
    a.set_xticks(times) # a tick at the end of every period
    # Use these labels instead of large numbers of seconds
    a.set_xticklabels(['Q1', 'Q2', 'Q3', 'Q4', 'OT1', 'OT2', 'OT3', 'OT4'])
    # By default the X axis would be labeled seconds_elapsed
    # But our manual control of the x axis ticks/labels makes it clear
    a.set_xlabel('')
    
    # Add a title for the whole graph
    a.set_title('NBA 2015-16 season, scoring margin of all games (Home-Visitor)')
    
    # Mark margin=0 (the win/loss boundary) with a light gray horizontal line
    a.axhline(0, c='#bbbbbb')
    
    # Mark the end of every period with a VERY light gray vertical line
    for t in times:
        a.axvline(t, c='#eeeeee') 

    # Force there to be no legend, otherwise it will want to 
    # make a legend with an entry for every game
    plt.legend([])
    
    if do_show:
        plt.show() # otherwise hold to allow further tweaks

    return a # to allow further tweaks


Here's a cell to allow you to test a little. Try increasing maxplots bit by bit to see when it starts to get slow, or None to let it rip for all 600+ games in the season.

In [None]:
axes = plot_margins(df, maxplots=2) # or change to maxplots=None

## Computing The Formula

Finally! We have margin and seconds_left computed in our DataFrame, so let's use those to apply **TF**. We compute $$(m-3)^2-s$$ If that is positive, then $(m-3)^2>s$: the margin is large enough, and the remaining time short enough, that the lead should be safe (if **TF** is indeed universally valid).

Note that we have to be careful when $m<3$. For instance with a slim 1-point margin at 2 seconds left, the formula **TF**=$(1-3)^2-2=+2$, a positive result, in a situation we should not consider safe.

In [None]:
def compute_the_formula(m,s):
    # Take the absolute value of the margin, don't get confused
    # by a margin of -40 points being less than 3 points
    am = np.abs(m)
    if am<=3: # not safe!!
        return float('NaN') # return Not-a-Number on purpose
    else:
        return (am-3)**2-s
    
def row_the_formula(row):
    return compute_the_formula(row.margin, row.seconds_left)

In [None]:
# Use this cell to test various inputs, make sure the outputs make sense

# How about a 4-point lead at the beginning of Q4?
compute_the_formula(4, 12*60)

In [None]:
# Again this is like making a new spreadsheet column with a formula to other columns
# The right part does the math, the left part catches it in a new column named TheFormula
df['TheFormula'] = df.apply(row_the_formula, axis=1)

## Examining The Formula for one game

Let's look at **TF** for a couple individual games.

Note that when we sliced game before, df didn't have the TheFormula column yet, so we have to slice again.

In [None]:
gid  = 1  # This game ends 106-94, with a healthy 12-pt margin
#gid = 2  # This game ends 95-97, not a safe margin!

# Similar slice command as before
game = df[ df['GAME_ID']== (21500000 + gid) ]

In [None]:
# Take a peek at the columns of interest, at the END of the game
game[ ['SCORE', 'margin', 'seconds_left', 'TheFormula'] ].tail(5)

# Is TF computed correctly?

# In the waning seconds of game 1, TF>0, implying the margin is safe. 
# Modify that tail(5) to look at more of the end of the game.
# At what point does TF say that the lead became safe? 
# (i.e. At what point should the visiting team know they can't come back?)

# How does this compare for game 2 vs game 1?

In [None]:
# Take a peek at the columns of interest, at the BEGINNING of the game (note switching tail-->head)
game[ ['SCORE', 'margin', 'seconds_left', 'TheFormula'] ].head(15)

# Note while the early-game margin is 3 points or less, TF is undefined (NaN).
# And when the margin does slip above 3, there is so much time left, TF is hugely negative
# This early in the game, the leading team cannot assume their lead is safe
# and the trailing team does not need to assume they are hopelessly behind.

# How does this compare for game 2 vs game 1?

Now let's plot the end of the game, so we can see game margin vs TF visually. This uses a Figure with two subplots (two Axes), above and below.

In [None]:
fig = plt.figure() # standard boilerplate

# Set up 2x1 (above and below) subplots
a1 = fig.add_subplot(211)  # 21=2x1, 1=1st (above)
a2 = fig.add_subplot(212)  # 21=2x1, 2=2nd (below)
a1.invert_xaxis()          # seconds_left should run backwards from seconds_elapsed
a2.invert_xaxis()

# Like before, except we're plotting seconds_left instead of seconds_elapsed
game.plot('seconds_left', 'margin', ax=a1)

# We plot TheFormula onto a2 the lower subplot
game.plot('seconds_left', 'TheFormula', ax=a2, c='r')

# Try uncommenting these formatting options one at a time,
# see what each new configuration reveals

#a1.axhline(0,  c='blue', ls=':')  # mark the win/loss crossover
#a1.axhline(3,  c='grey', ls='-.') # mark the +/- 3 margins where TF is undefined
#a1.axhline(-3, c='grey', ls='-.')
#a2.axhline(0,  c='red',  ls=':')  # mark the TF crossover
#a2.set_ylim(-100,100)              # edit this to zoom in on interesting values of TF
#nsecs = 120                   # 120sec for the last 2 minutes
#a1.set_xlim(nsecs,0)
#a2.set_xlim(nsecs,0)

plt.show()

Now go back to the beginning of this 'Examining **The Formula** for one game' section, and replace gid=1 with gid=2, and go through these cells again, to see how **TF** looks for a game with a closer finish. And why not try gid=3,4,...? You might be able to find a situation where a margin crosses from safe (**TF**>0) to unsafe (and maybe back again), but can you find a game where a trailing team comes back from a **TF**-safe margin, and wins?

## Analyzing The Formula across the whole season

Before going on, be sure to follow the instructions and examine **TF** for the endings of both game 1 and 2 (and more?)

Once you are comfortable with how TF is working though, press on!

The crossover between unsafe and safe is when $(m-3)^2-s=0$, i.e. $$m=\sqrt{s}+3$$
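To get a feel for the size of that boundary, here is a small sketch that evaluates it at a few moments of a game. The function name `crossover_margin` is my own; it is the same formula as above.

```python
import math

def crossover_margin(seconds_left):
    """Smallest margin the lead has to exceed for TF to call it safe."""
    return math.sqrt(seconds_left) + 3

for label, s in [('start of Q4',      12 * 60),
                 ('2 minutes left',   120),
                 ('100 seconds left', 100),
                 ('10 seconds left',  10)]:
    print('{:>16}: lead must exceed {:.1f} points'.format(label, crossover_margin(s)))
```

Because of the square root, the required margin shrinks slowly for most of the game and then drops quickly in the last minute.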

In [None]:
# First make a new column in df with this formula,
# computed for every value of seconds_left
df['safe_margin'] = np.sqrt(df['seconds_left'])+3

# Then plot it (positive and negative)
plt.figure()  # standard boilerplate
a=plt.gca()

# These look different, we are giving the axes DataFrame columns to scatterplot, 
# rather than having the DataFrame tell the axes to plot its columns
# The reason is because for one of them we need the arithmetic operation -
# (Another option would be to make a 'safe_margin_negative' column and graph as before)
a.scatter(df['seconds_elapsed'],  df['safe_margin'], c='r')
a.scatter(df['seconds_elapsed'], -df['safe_margin'], c='r')
plt.show()

# Why are there four extra, smaller curves on the right?

# How many free points could one NBA team spot another at the beginning of a game, 
# before TF says the whole game isn't worth playing?

# Test replacing those a.scatter() with a.plot(), that switches to a line graph
# All those lines are from starting each game over from 0 seconds_elapsed
# That's why I used scatter() there.

Remember when we plotted the margin series for all the games of the season onto the same plot? 

Let's do that again, but this time add these safe **TF** boundary curves.

In [None]:
# We choose do_show=False here, so on the inside of the function it skips plt.show()
# the a= catches the Axes so we can plot more stuff
a = plot_margins(df, maxplots=20, do_show=False)

# These are the same scatterplot commands from above
a.scatter(df['seconds_elapsed'],  df['safe_margin'], c='r')
a.scatter(df['seconds_elapsed'], -df['safe_margin'], c='r')    

# Now that we're done, we can plt.show()
plt.show()

# What does it mean when a blue curve crosses the red boundary?

# Add more games with larger maxplots, or set to None (and run to the fridge or bathroom) to plot them all

Here's a couple cells I was toying with.

In [None]:
# safe is a slice that only grabs rows within the season for which TF says the margin is safe
safe = df[ np.abs(df['margin'])>df['safe_margin'] ]

In [None]:
# Send that slice into our plotting function
a = plot_margins(safe, maxplots=20, do_show=False)

# plot the safe_margin boundary as before
a.scatter(safe['seconds_elapsed'],  safe['safe_margin'], c='r')
a.scatter(safe['seconds_elapsed'], -safe['safe_margin'], c='r')    

# mess with the limits to show just the last two minutes of Q4
q4 = 4*12*60
a.set_xlim(q4-120, q4)
plt.show()


## Checking the validity of The Formula across the whole season

Finally! This code goes row-by-row through the whole DataFrame df (every single scoring event of every game of the season), testing for things like:

* Has the Home or Visiting team reached a safe margin at any point?
* Once **TF** says one team's lead is safe, does any trailing team ever come back and win?
* Is there any game where **TF** says both teams had a safe lead (at different times)?

Like other cells, this one starts with max_games=20. When you want to run the analysis across the whole season, change it to max_games=None.
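Before reading the full loop, it may help to see the core check on a tiny made-up example. The sketch below invents two short games (all the numbers are fabricated for illustration, and the function name `violates_tf` is mine): in game 1 the home team builds a safe lead and wins, while in game 2 the visitors have a TF-safe lead and then lose.

```python
import numpy as np
import pandas as pd

# Two made-up games, three scoring events each
toy = pd.DataFrame({
    'GAME_ID':      [1,   1,  1,  2,   2,  2],
    'seconds_left': [100, 30, 5,  100, 30, 5],
    'hscore':       [90,  92, 94, 75,  85, 91],
    'vscore':       [70,  72, 72, 90,  90, 90],
})
toy['margin'] = toy['hscore'] - toy['vscore']
toy['safe_margin'] = np.sqrt(toy['seconds_left']) + 3

def violates_tf(game):
    """True if a team that TF called safe ended up losing this game."""
    live = game[game['seconds_left'] > 0]            # ignore the final buzzer
    home_safe = (live['margin'] >  live['safe_margin']).any()
    vtor_safe = (live['margin'] < -live['safe_margin']).any()
    home_won = game['hscore'].max() > game['vscore'].max()
    return bool((home_safe and not home_won) or (vtor_safe and home_won))

# groupby hands us one slice per GAME_ID, like the slicing we did by hand
results = {gid: violates_tf(game) for gid, game in toy.groupby('GAME_ID')}
print(results)
```

The real cell below does the same job with an explicit row-by-row loop, which also lets it record the moment the lead first became safe.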


In [None]:
max_games = 20 # None to search all the games, or a number to test a few
num_games = 0

game_ids = []       # keep a list of games that violate TF (if we find any)
never_safe_ids = [] # keep a list of games that never reach TF>0
safe_times = []     # list of values of seconds_left at time of first TF>0
for agid in df['GAME_ID'].unique():      # for each game in the season
    agame = df[ df['GAME_ID']==agid ]    # slice out just this game's rows
    ####### As far as we know before looking at THIS game... ########
    homesafe = False      # Home team isn't safe yet
    vtorsafe = False      # Visiting team isn't safe yet (v'tor stands for Visitor)
    hometime = None       # We don't know what time (if ever) Home will get to a safe lead
    vtortime = None       # We don't know what time (if ever) V'tr will get to a safe lead
    
    # Go through every row in this slice (every scoring event in this game)
    for row in agame.itertuples():
        if not homesafe and row.seconds_left > 0 and row.margin > row.safe_margin: 
            homesafe = True              # TF says Home is safe now!
            hometime = row.seconds_left  # This is how many seconds are left
        if not vtorsafe and row.seconds_left > 0 and row.margin < -row.safe_margin:
            vtorsafe = True              # TF says Vtor is safe now!
            vtortime = row.seconds_left  # This is how many seconds are left
    
    # This grabs the final score of the game for both sides
    hfinal = agame['hscore'].max()
    vfinal = agame['vscore'].max()
    
    # In this game, did a trailing team come back and beat a 'safe' lead?
    if homesafe and vfinal > hfinal:
        print('Game {} visitor overcame home safe margin!!!'.format(agid))
        game_ids.append(agid)
    if vtorsafe and hfinal > vfinal:
        print('Game {} home overcame visitor safe margin!!!'.format(agid))
        game_ids.append(agid)
    
    # Did both teams appear safe at some point according to TF?
    if homesafe and vtorsafe: 
        print('Game {} both sides had safe margin!!!'.format(agid))
        game_ids.append(agid)
        
    elif homesafe: # TF said home was safe, and they won
        print('Game {} home was safe with {} seconds left, final margin {}'.format(agid, hometime, hfinal-vfinal))
        safe_times.append(hometime)
        
    elif vtorsafe: # TF said vtor was safe, and they won
        print('Game {} vtor was safe with {} seconds left, final margin {}'.format(agid, vtortime, vfinal-hfinal))
        safe_times.append(vtortime)
        
    else:          # TF never said anybody was safe
        print('Game {} no team was ever safe'.format(agid))
        never_safe_ids.append(agid)
        
    # quit early if we're just testing a few
    num_games += 1
    if max_games is not None and num_games >= max_games:
        break      

In [None]:
if len(game_ids)>0:
    print('There are games in the NBA 2015-16 season that violated The Formula!!')
    print(game_ids)
else:
    print('All {} NBA 2015-16 season games checked obeyed The Formula'.format(num_games))

## Concluding remarks

Well, it looks like **The Formula** is pretty reliable. It was never caught out in any game of this 2015-16 NBA dataset. Maybe it's even too conservative. Maybe **TF** could be tightened up to be bolder about declaring margin-safety sooner, and still no games would violate it? (If you have any good ideas, you can go back to where the 'TheFormula' and 'safe_margin' columns are computed, compute them differently, and see what happens!)

If any of the 1-line game summaries printed in the final analysis look interesting, you can take that game ID back to the 'Examining **The Formula** for one game' section and use that ID to take a closer look.

How about the most (?) famous comeback in NBA history? Google 'Reggie Miller Knicks comeback', and see what happened at the end of Game 1 of the 1995 Eastern Conference Semifinals...