# Predicting NBA Playoff Teams

## Capstone 1: Ronald Musser  
  
  
The full code can be found in the Capstone1_Data2 and Capstone1_Data2.ipynb files in the repository below.  
https://github.com/RM817/NBA-Playoff-Birth-Predicition-Model

### Introduction:

Since 1984, 16 teams in the National Basketball Association (NBA) have qualified each year for a postseason tournament, the playoffs,
which decides the NBA champion. These 16 teams are typically the teams with the best records at the end of the regular season.
However, the NBA season is quite long: the regular season runs from October to April, and the playoffs continue into June.
Is it possible to accurately predict the playoff teams after the first full month of the NBA season?

Predicting the playoff teams requires data to analyze. While manually downloading statistics from a website is a viable option, there are many teams and many seasons to cover. Therefore, it was advantageous to create a web scraping tool that downloads each team's stats for the month of November of each season.

```python
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
import re
import requests
import numpy as np
import pandas as pd
import operator
%matplotlib inline

# Pull the November split stats for each team in one season from
# Basketball Reference and collect them in an array (one row per team).
Team_List = ['ATL','BOS','CHI','CLE','DAL','DEN','DET','GSW','HOU','IND','KCK','LAL','MIL','NJN','NYK','PHI','PHO','POR','SAS','SDC','SEA','UTA','WSB']
Year = '1984'
Stats = np.zeros(shape=(len(Team_List), 31))
for n, team in enumerate(Team_List):
    url = 'https://www.basketball-reference.com/teams/{}/{}/splits/'.format(team, Year)
    page = requests.get(url)
    content = BeautifulSoup(page.content, 'html.parser')
    tdvalues = content.find_all('td')
    # The 31 November values sit at fixed positions among the page's <td> cells.
    Nov_Values = tdvalues[257:288]
    # get_text() returns the cell contents without the surrounding tags.
    Stats[n] = [float(cell.get_text()) for cell in Nov_Values]
```

The code above creates an array of stats from the website *Basketball Reference* for each team during the 1983-84 NBA season (*Basketball Reference* labels a season by the year in which it ends, so `Year = '1984'` refers to 1983-84). The array is useful, but deeper analysis can be conducted by creating a dataframe from these stats.

```python
DF_84 = pd.read_pickle('DF_84')
pd.set_option('display.max_columns', None)
DF_84.head()
```


| | Team | Playoff Birth | G | W | L | Win% | FG | FGA | FG% | 3P | 3PA | 3P% | FT | FTA | FT% | ORB | TRB | AST | STL | BLK | TOV | PF | PTS | OP-FG | OP-FGA | OP_FG Miss% | OP-3P | OP-3PA | OP_3P Miss% | OP-FT | OP-FTA | OP_FT Miss% | OP-ORB | OP-TRB | OP-AST | OP-STL | OP-BLK | OP-TOV | OP-PF | OP-PTS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ATL | True | 14.0 | 7.0 | 7.0 | 0.5 | 39.4 | 80.4 | 0.49005 | 0.6 | 1.4 | 0.428571 | 21.1 | 30.1 | 0.700997 | 13.4 | 42.3 | 21.6 | 7.7 | 8.9 | 17.4 | 27.3 | 100.5 | 38.9 | 84.9 | 0.541814 | 0.4 | 1.9 | 0.789474 | 24.7 | 33.2 | 0.256024 | 15.8 | 44.6 | 23.2 | 7.1 | 4.9 | 15.9 | 25.5 | 102.9 |
| 1 | BOS | True | 15.0 | 11.0 | 4.0 | 0.733333 | 44.2 | 87.7 | 0.503991 | 0.6 | 2.3 | 0.26087 | 26.9 | 33.5 | 0.802985 | 15.4 | 45.9 | 27.6 | 7.6 | 4.8 | 17.3 | 23.9 | 115.9 | 43.6 | 92.4 | 0.528139 | 1.1 | 3.5 | 0.685714 | 20.8 | 27.3 | 0.238095 | 14.4 | 41.1 | 23.8 | 9.0 | 4.1 | 14.8 | 27.5 | 109.1 |
| 2 | CHI | False | 13.0 | 4.0 | 9.0 | 0.307692 | 41.0 | 88.4 | 0.463801 | 0.3 | 1.8 | 0.166667 | 23.6 | 31.5 | 0.749206 | 13.9 | 41.6 | 25.2 | 9.4 | 5.5 | 17.6 | 29.4 | 105.9 | 41.8 | 86.6 | 0.517321 | 0.5 | 2.2 | 0.772727 | 26.8 | 34.0 | 0.211765 | 14.3 | 46.9 | 25.4 | 10.8 | 8.6 | 17.6 | 26.4 | 110.9 |
| 3 | CLE | False | 15.0 | 5.0 | 10.0 | 0.333333 | 41.6 | 86.0 | 0.483721 | 0.8 | 2.6 | 0.307692 | 19.1 | 25.6 | 0.746094 | 14.3 | 44.0 | 23.1 | 7.6 | 5.2 | 17.1 | 28.0 | 103.1 | 40.6 | 85.1 | 0.522914 | 0.5 | 2.1 | 0.761905 | 24.1 | 32.7 | 0.262997 | 12.9 | 41.1 | 26.1 | 7.5 | 3.9 | 14.5 | 21.7 | 105.9 |
| 4 | DAL | True | 13.0 | 10.0 | 3.0 | 0.769231 | 42.4 | 89.3 | 0.474804 | 0.2 | 1.8 | 0.111111 | 25.4 | 33.8 | 0.751479 | 14.9 | 44.5 | 27.9 | 8.3 | 4.5 | 14.7 | 22.8 | 110.4 | 44.2 | 92.2 | 0.520607 | 0.9 | 2.9 | 0.689655 | 19.6 | 25.8 | 0.24031 | 16.6 | 45.2 | 26.7 | 7.8 | 4.8 | 17.6 | 29.0 | 109.0 |


A long list of stats was collected, each with its own definition and importance. While the raw stats are useful, each one needs to be normalized to a percent above or below the league average. This allows for comparison between two categories that typically would not be comparable (such as Win% and AST). Percentage-based statistics were calculated and added to the dataframe. Because there are so many stats, only the more relevant definitions are explained here. A full list of explanations can be found on *Basketball Reference*.  
- Team is the abbreviated team name, and Playoff Birth records whether the team made the playoffs that season.  
- G, W, and L are the number of games played, wins, and losses respectively. Win% is the team's wins divided by its total number of games.  
- FG and FGA are the number of field goals (baskets) made and attempted by a team respectively. FG% is FG divided by FGA.  
- The trio of stats for both 3P and FT follow the FG definitions but for three pointers and free throws respectively.  
- ORB, TRB, AST, and STL are offensive rebounds, total rebounds, assists, and steals.  
- BLK, TOV, and PF are blocks, turnovers, and personal fouls.  
- PTS is the number of points the team has scored.  
- The remaining stats, prefixed with OP, are the stats of the opponents that the team has faced. The OP Miss% stats are the share of the opponents' shots that missed (for example, OP_FG Miss% is 1 minus OP-FG divided by OP-FGA).  
  
With the statistics properly formatted, the analysis can begin.  
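
As a rough illustration of the normalization described above, the sketch below converts raw stats into a percent above or below the league average. The function name `percent_vs_league_average` is hypothetical, and the three-team sample (values copied from the table above) stands in for the full league, so the averages it produces are not real league averages.

```python
import pandas as pd

def percent_vs_league_average(df, stat_columns):
    """Return each stat as a percent above (+) or below (-) the column average."""
    league_avg = df[stat_columns].mean()
    return (df[stat_columns] - league_avg) / league_avg * 100

# Illustrative sample only: three teams and two stats from the table above.
sample = pd.DataFrame(
    {
        "Team": ["ATL", "BOS", "CHI"],
        "Win%": [0.5, 0.733333, 0.307692],
        "AST": [21.6, 27.6, 25.2],
    }
).set_index("Team")

normalized = percent_vs_league_average(sample, ["Win%", "AST"])
print(normalized.round(1))
```

After this step, Win% and AST are on the same scale, which is what makes them comparable.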
  


#### Analytic Questions
  
It is now important to ask questions that, if answered, could matter to a concerned party such as the league, teams, sports news outlets, or fans.  
1. Which stats in the month of November are most indicative of a playoff berth?  
2. Do these stats vary over time in their correlation to a playoff berth?  
3. Can these factors be used to predict the playoff teams for the three most recent seasons?

### Evaluating Statistical Importance  
  
#### Correlation Factor
  
In order to predict which teams will make the playoffs, it was necessary to figure out which stats best correlated with a playoff berth. To do this, the teams were ranked from highest to lowest for each individual stat. The number of playoff teams among the top 16 was tallied and then divided by 16. This created a correlation factor for each statistic that shows how indicative it is of a playoff berth for that season. The top 5 correlation factors were then plotted. This process was repeated for each year, from 1984 to 2016.
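
The correlation factor described above can be sketched as follows. This is a minimal illustration rather than the notebook's actual code: the function name `correlation_factor` and the six-team toy data are made up, and it assumes that a higher value is better for the stat being ranked.

```python
import pandas as pd

def correlation_factor(df, stat, playoff_col="Playoff Birth", n_playoff=16):
    """Share of the top `n_playoff` teams in `stat` that reached the playoffs."""
    top = df.nlargest(n_playoff, stat)
    return top[playoff_col].sum() / n_playoff

# Made-up league of six teams with four playoff spots.
toy = pd.DataFrame(
    {
        "Team": list("ABCDEF"),
        "Win%": [0.8, 0.7, 0.6, 0.5, 0.4, 0.3],
        "AST": [20, 28, 22, 27, 26, 25],
        "Playoff Birth": [True, True, True, True, False, False],
    }
)

win_factor = correlation_factor(toy, "Win%", n_playoff=4)
ast_factor = correlation_factor(toy, "AST", n_playoff=4)
```

In the toy data, all four of the top teams by Win% made the playoffs, while only two of the top four by AST did, so Win% receives the higher factor.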

![title](TopFive_NoW.png)

#### Weighting

As the plot shows, the correlation factors seem to decrease as the years go by. From 1984 to 2000 to 2016, the league appears to become more difficult to predict. The reason is simply that the number of teams has increased over the years. While 16 teams always make the playoffs, the number of teams that miss out has steadily grown. In 1984, only 7 teams did not make the playoffs (23 total teams), so at least 9 of the top 16 teams in any stat were guaranteed to be playoff teams. By 2000, 13 teams did not make the playoffs (29 teams). Currently, there are 30 teams in the league, so 14 of them miss the playoffs.  
  
By themselves, these statistics are useful but not exactly what is needed. Building a predictive model for the three most recent seasons requires comparing the correlation factors over time, so the factors must be placed on equal ground. To do this, a weighting function is introduced. Since the number of teams that miss the playoffs is the variable that changes, the weighting function is based on it: each correlation factor is multiplied by the number of teams that missed the playoffs that year divided by the number of teams that currently miss the playoffs (14). Thus, the weighting factors for 1984, 2000, and 2016 are 7/14, 13/14, and 14/14 (1) respectively. Applying this weighting produces a more consistent set of top 5 correlation factors throughout the years.  
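
The weighting can be written as a short function. The sketch below uses the team counts quoted in the text (23, 29, and 30 teams); the function name `weighting_factor` and the sample correlation factor of 0.875 are illustrative only.

```python
def weighting_factor(n_teams, n_playoff=16, n_teams_current=30):
    """Teams missing the playoffs that year over teams missing them today."""
    missed = n_teams - n_playoff
    missed_current = n_teams_current - n_playoff
    return missed / missed_current

league_size = {1984: 23, 2000: 29, 2016: 30}
weights = {year: weighting_factor(n) for year, n in league_size.items()}

# Applying the weight to a made-up 1984 correlation factor of 0.875.
weighted_1984 = 0.875 * weights[1984]
```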

![title](TopFive_W.png)

#### Top Correlation  
  
To focus on the most important statistics, the number of times each one appeared in the top 5 was graphed.
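
One possible way to tally these appearances is with a `Counter`. The sketch assumes the weighted correlation factors are stored as a dictionary per year; that layout, the function name `top_k_appearances`, and the numbers are all hypothetical.

```python
from collections import Counter

def top_k_appearances(factors_by_year, k=5):
    """Count how often each stat ranks among the k highest factors in a year."""
    counts = Counter()
    for factors in factors_by_year.values():
        top = sorted(factors, key=factors.get, reverse=True)[:k]
        counts.update(top)
    return counts

# Made-up factors for two seasons; k=2 keeps the example small.
factors_by_year = {
    1984: {"Win%": 0.90, "AST": 0.80, "FG%": 0.70},
    1985: {"Win%": 0.85, "AST": 0.60, "FG%": 0.75},
}
appearances = top_k_appearances(factors_by_year, k=2)
```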

![title](App_TopFive.png)

Here, Win% is clearly the most important November factor related to a playoff berth at the end of the season. A team's field goal defense (OP_FG Miss%), number of assists (AST), three point percentage (3P%), and field goal percentage (FG%) also contribute to a team's success. This answers the first question. However, the league changes as the years go by. Therefore, it is important to see whether some factors become more relevant as the current seasons are approached, because those factors could help correctly predict a team's playoff berth.  
  
  
#### Correlation Over Time

![title](Correlation_F_WinPer.png)  
  
Plotting the correlation factor over time along with a linear fit shows how a stat's correlation fluctuates, in this case for Win%. This analysis was performed on each statistic to see its behavior over time.
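
A linear fit of this kind can be obtained with `numpy.polyfit`. The sketch below is only an assumption about how the trend could be computed; the function name `linear_trend` and the factor values are made up.

```python
import numpy as np

def linear_trend(years, factors):
    """Fit factor = slope * year + intercept by least squares."""
    slope, intercept = np.polyfit(years, factors, deg=1)
    return slope, intercept

# Made-up correlation factors that rise by 0.02 per season.
years = np.array([1984, 1985, 1986, 1987])
factors = np.array([0.40, 0.42, 0.44, 0.46])
slope, intercept = linear_trend(years, factors)
```

A positive slope indicates a stat whose correlation with a playoff berth is increasing over time, and a negative slope indicates one that is decreasing.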

![title](Correlation_Over_Time.png)

Of the most frequent factors, Win% and OP_FG Miss% are increasing in their correlation while AST, FG% and 3P% are decreasing. In fact, most of the statistics across the board have decreased by more than 5% since 1984. It is interesting to note that 3P defense (OP_3P Miss%) is one of the few factors that has either stayed the same or slightly increased with time. These 6 factors, which are the most correlated, increasing in correlation over time, or both, can now be used to predict a team's playoff berth. Thus, the answer to the second question is that the correlation factor does fluctuate over time.  
  

### Creating The Predictive Model  
  
In order to predict which teams will make the playoffs in 2017, a scoring system was created based on these 6 stats. Each stat adds a value based on how well the team is doing relative to the league average. This value is also weighted by the average correlation factor of that statistic. Thus, a team's win percentage counts for more than its three point percentage. The 16 teams with the highest scores are the predicted playoff teams. An error is then calculated from the number of teams that were predicted to earn a playoff berth but ended up missing out. This process is repeated for the 2018 and 2019 seasons, and the results are shown below.
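
The exact scoring formula is not given here, so the following is only one plausible reading: a weighted sum of each team's percent above or below the league average, with the average correlation factor as the weight. The function names, the four-team toy data, and the weights are hypothetical.

```python
import pandas as pd

def playoff_scores(normalized, weights):
    """Weighted sum of normalized stats; `weights` maps stat -> average factor."""
    w = pd.Series(weights)
    return normalized[w.index].mul(w, axis=1).sum(axis=1)

def predict_playoff_teams(normalized, weights, n_playoff=16):
    """Return the `n_playoff` teams with the highest scores."""
    return playoff_scores(normalized, weights).nlargest(n_playoff).index.tolist()

# Made-up percent-vs-league-average values for four teams.
normalized = pd.DataFrame(
    {"Win%": [20.0, -10.0, 5.0, -15.0], "3P%": [-5.0, 10.0, 0.0, -5.0]},
    index=["A", "B", "C", "D"],
)
weights = {"Win%": 0.8, "3P%": 0.5}  # made-up average correlation factors

scores = playoff_scores(normalized, weights)
predicted = predict_playoff_teams(normalized, weights, n_playoff=2)
```

Because Win% carries the larger weight, team C is picked ahead of team B even though B shoots three pointers better.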

### Results

```python
ER_DF = pd.read_pickle('Recent Pred')
ER_DF
```

| | Year | Accuracy (%) | Percent Error (%) |
|---|---|---|---|
| 0 | 2017 | 87.5 | 12.5 |
| 1 | 2018 | 81.25 | 18.75 |
| 2 | 2019 | 75.0 | 25.0 |
| 3 | Average | 81.25 | 18.75 |


These accuracies show the success of the predictive model. On average, it correctly predicted 13 of the 16 playoff teams over the three most recent seasons. When the model is applied retroactively to previous seasons, the same average behavior is seen: the predictive accuracy is about 79% with a standard deviation of 7.6%, peaking as high as 94%. This suggests that it is possible to predict playoff berths with a fair amount of confidence.

![title](ACC_Year.png)
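
For reference, the accuracy and percent error reported above can be computed from two lists of teams. This is a minimal sketch with a hypothetical function name and placeholder team labels; 14 correct picks out of 16 reproduces the 87.5% and 12.5% reported for 2017.

```python
def prediction_accuracy(predicted, actual):
    """Return (accuracy %, percent error %) for a set of predicted playoff teams."""
    predicted, actual = set(predicted), set(actual)
    correct = len(predicted & actual)
    accuracy = 100 * correct / len(predicted)
    return accuracy, 100 - accuracy

# Placeholder labels: 16 predicted teams, 14 of which actually qualified.
predicted = ["T{}".format(i) for i in range(1, 17)]
actual = ["T{}".format(i) for i in range(1, 15)] + ["X1", "X2"]
accuracy, error = prediction_accuracy(predicted, actual)
```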

### Future Research  
  
While this model has fairly high accuracy, there are possible ways to increase its predictive power. One such method would be to split the teams into their conferences and conduct the analysis separately. Although 16 teams make the playoffs, they are not necessarily the top 16 overall. The league has two conferences and takes the top 8 from each. If one conference is much stronger than the other, this can create errors in the predictions. Separating the conferences can help protect the model from such imbalances within the league. Another improvement would be to take into account injuries that severely impact a team. A superstar who is injured at the beginning of a season but returns later would not be taken into account by this model. The same goes for a superstar who is injured right after the month of November. Each scenario would create a skewed prediction that would most likely be incorrect.