This Jupyter notebook uses the `BeautifulSoup` package to scrape NBA injury data from www.prosportstransactions.com. Each scraped row gives the date, team, player, a note describing the injury, and whether the player was "acquired" (taken off the injury list) or "relinquished" (put on the injury list). These rows are used later to build injury stints and compute how long each injury lasted in days. The scrape covers the 2018-19 through 2022-23 NBA seasons, which is five seasons of data.

In [1]:
import math
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import time
from unidecode import unidecode
from tqdm import tqdm

In [24]:
def getInjuryURL(startDate, endDate, start):
    return f"https://www.prosportstransactions.com/basketball/Search/SearchResults.php?Player=&Team=&BeginDate={startDate}&EndDate={endDate}&ILChkBx=yes&Submit=Search&start={start}"
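
The query string above is assembled by hand, which works fine for ISO-formatted dates. As a sketch of an alternative, the same URL can be built with the standard library's `urllib.parse.urlencode`, which also escapes any special characters in the values. `build_injury_url` and `BASE_URL` are hypothetical names used only for illustration; the parameter names are copied from the f-string above.

```python
from urllib.parse import urlencode

# Hypothetical constant: the search endpoint taken from getInjuryURL above
BASE_URL = "https://www.prosportstransactions.com/basketball/Search/SearchResults.php"

def build_injury_url(start_date, end_date, start):
    """Build the search URL from a dict of query parameters."""
    params = {
        "Player": "",
        "Team": "",
        "BeginDate": start_date,
        "EndDate": end_date,
        "ILChkBx": "yes",
        "Submit": "Search",
        "start": start,  # row offset of the requested results page
    }
    return f"{BASE_URL}?{urlencode(params)}"

example_url = build_injury_url("2018-09-01", "2023-07-30", 25)
```

For these inputs the result should match what `getInjuryURL` returns, since dicts preserve insertion order and none of the values need escaping.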



In [30]:
def getSoupDF(soup):
    """Parse the first table on a results page into a DataFrame."""
    table = soup.find(name = 'table')

    columns = ['Date', 'Team', 'Acquired', 'Relinquished', 'Notes']
    rows = []

    # Skip the header row; keep only rows that have one cell per column
    for row in table.find_all('tr')[1:]:
        cells = [entry.text for entry in row.find_all('td')]
        if len(cells) == len(columns):
            rows.append(cells)

    return pd.DataFrame(rows, columns = columns)

In [25]:
START_DATE = "2018-09-01"
END_DATE = "2023-07-30"

In [33]:
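ENTRIES_PER_PAGE = 25  # row offset step between consecutive results pages

def getInjuryData(num_iter, sleep_time = 5):
    """Scrape num_iter results pages, pausing sleep_time seconds after each request."""
    df_list = []
    for i in tqdm(range(num_iter)):
        # The site paginates with a row offset passed as the start parameter
        url = getInjuryURL(START_DATE, END_DATE, ENTRIES_PER_PAGE * i)
        response = requests.get(url)
        response.raise_for_status()  # stop on an HTTP error instead of parsing an error page

        time.sleep(sleep_time)  # pause between requests to avoid hammering the server

        soup = BeautifulSoup(response.content, 'html.parser')
        small_df = getSoupDF(soup)
        df_list.append(small_df)

    # Each page keeps its own 0-based index, so index values repeat in the result
    return pd.concat(df_list)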
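
Each results page is requested with a `start` offset that is a multiple of `ENTRIES_PER_PAGE`. A minimal sketch of how the number of pages and the offsets relate to a total row count follows. `page_offsets` is a hypothetical helper, and the total of 60 rows is a made-up figure for illustration, not the size of the real result set.

```python
import math

ENTRIES_PER_PAGE = 25  # page size used by getInjuryData

def page_offsets(total_entries, per_page=ENTRIES_PER_PAGE):
    """Return the start offsets needed to cover total_entries rows."""
    num_pages = math.ceil(total_entries / per_page)
    return [page * per_page for page in range(num_pages)]

offsets = page_offsets(60)  # hypothetical total row count
print(offsets)  # [0, 25, 50]
```

With `num_iter = 391`, the loop in `getInjuryData` requests offsets 0 through 25 * 390 = 9750.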
        

In [34]:
injury_df = getInjuryData(391, 5)

100%|█████████████████████████████████████████| 390/390 [37:57<00:00,  5.84s/it]


In [35]:
injury_df

     Date        Team     Acquired        Relinquished        Notes
0    2018-09-21  Pacers                   • C.J. Wilcox       placed on IL with torn right Achilles tendon ...
1    2018-10-01  Thunder                  • Andre Roberson    placed on IL recovering from surgery on left ...
2    2018-10-16  76ers                    • Jerryd Bayless    placed on IL with sprained left knee
3    2018-10-16  76ers                    • Mike Muscala      placed on IL with sprained right ankle
4    2018-10-16  76ers                    • Wilson Chandler   placed on IL with strained left hamstring
...  ...         ...      ...             ...                 ...
20   2023-05-03  76ers    • Joel Embiid                       activated from IL
21   2023-05-05  Celtics                  • Blake Griffin     placed on IL with sore lower back
22   2023-05-06  Heat                     • Udonis Haslem     placed on IL with stomach illness
23   2023-05-06  Heat     • Jimmy Butler                      activated from IL
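
The introduction mentions turning these rows into injury stints. A minimal sketch of one way to do that pairing, using only the standard library, is given here: a row with a player in the `Relinquished` column opens a stint, and the next row with that player in the `Acquired` column closes it. `clean_name`, `build_stints`, and the sample rows (placeholder player names and dates) are hypothetical, and this is not necessarily the approach used later in the project.

```python
from datetime import date

def clean_name(raw):
    """Strip whitespace and the leading bullet from a scraped name cell."""
    return raw.strip().lstrip("•").strip()

def build_stints(rows):
    """Pair relinquished/acquired rows into (player, start, end, days) tuples.

    rows are (date_string, acquired, relinquished) tuples with ISO dates.
    """
    open_stints = {}  # player -> date the player was placed on the list
    stints = []
    # ISO date strings sort chronologically, so a plain string sort is enough
    for date_str, acquired, relinquished in sorted(rows, key=lambda r: r[0]):
        day = date.fromisoformat(date_str)
        out_player = clean_name(relinquished)
        in_player = clean_name(acquired)
        if out_player:
            # Keep the earliest date if the player is listed again before returning
            open_stints.setdefault(out_player, day)
        if in_player and in_player in open_stints:
            start = open_stints.pop(in_player)
            stints.append((in_player, start, day, (day - start).days))
    return stints

# Made-up rows for illustration only
sample_rows = [
    ("2020-01-01", "", " • Player A"),
    ("2020-01-03", "", " • Player B"),
    ("2020-01-11", " • Player A", ""),
]
stints = build_stints(sample_rows)
```

In this sketch, a player who is still on the list when the data ends (Player B in the sample) stays unpaired, and an activation with no matching earlier placement is ignored.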
