In [20]:
import json
import sys
import pandas as pd
import sqlite3

# A lens from a statistical point of view. Can we create an ML model to predict the outcomes?

# Introduction

After watching sports for many years, I noticed the negative impact of long-term injuries, especially in the NBA, and decided to examine it. By chance, I clicked on a video in which a popular YouTuber states, 'Look at KD; he had an ACL injury, and before that, he had a calf strain.' Is this really true? The YouTube channel is MPJPerformance, and the video link is [here!](https://www.youtube.com/watch?v=HnPjGpcTU8A&t=47s.)

# Retrieving and Cleaning the Data

## First, getting the data

Luckily for me, the raw data is already available. Big thanks to JaseZiv for the GitHub repository: [click here!](https://github.com/JaseZiv/NBA_data/tree/main)

## How is the data currently structured? 

The file named "nba_injuries" was web-crawled from various NBA sources, and each record in the original file follows this JSON schema:



In [5]:
basketball_data = {
    "Date": "1947-08-05",
    "Team": "Bombers (BAA)",
    "Acquired": None,
    "Relinquished": "Jack Underman",
    "Notes": "fractured legs (in auto accident) (out indefinitely) (date approximate)"
}

## Creating a database and cleaning up the data 

Before we can clean the data to remove duplicates and unnecessary or missing inputs, we first need to create a database schema appropriate to our queries and data retrieval goals:
 
- Identify all injured players easily 
- Identify length of injury easily
- Identify if this player has a prior/future injury
- Proximity of next injury/prior injury
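
Three of these goals (finding injured players, checking for prior/future injuries, and measuring their proximity) can be sketched with pandas on a few made-up rows. The column names match the raw JSON, but the players, dates, and the helper columns `prev_injury_date`, `next_injury_date`, `days_since_prev`, and `days_to_next` are illustrative assumptions, as is treating a non-empty `Relinquished` value as the start of an injury.

```python
import pandas as pd

# Toy rows using the same fields as the raw JSON; all values are invented.
toy = pd.DataFrame(
    {
        "Date": ["2020-01-01", "2020-01-20", "2020-03-01", "2020-02-10"],
        "Team": ["A", "A", "A", "B"],
        "Acquired": ["", "", "", ""],
        "Relinquished": ["Player X", "Player X", "Player X", "Player Y"],
        "Notes": ["calf strain", "ankle sprain", "torn ACL", "flu"],
    }
)
toy["Date"] = pd.to_datetime(toy["Date"])

# Assumption: a non-empty "Relinquished" marks a player leaving the active
# roster, i.e. the start of an injury spell.
injuries = toy[toy["Relinquished"] != ""].sort_values(["Relinquished", "Date"]).copy()

# Previous and next injury date for the same player (NaT when there is none).
by_player = injuries.groupby("Relinquished")["Date"]
injuries["prev_injury_date"] = by_player.shift(1)
injuries["next_injury_date"] = by_player.shift(-1)

# Proximity in days to the neighbouring injuries.
injuries["days_since_prev"] = (injuries["Date"] - injuries["prev_injury_date"]).dt.days
injuries["days_to_next"] = (injuries["next_injury_date"] - injuries["Date"]).dt.days

print(injuries[["Relinquished", "Date", "days_since_prev", "days_to_next"]])
```

A missing value in `days_since_prev` means the row is that player's first recorded injury, and a missing value in `days_to_next` means it is their last.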

Based on these retrieval goals, we can take a statistical view of whether "long-term injuries are sourced from short-term injuries". In particular, we can explore:


$$
\begin{align*}
P(\text{long-term injury} \mid \text{short-term injury}), \\
P(\text{not having a long-term injury} \mid \text{short-term injury}), \\
P(\text{another long-term injury} \mid \text{long-term injury}), \\
P(\text{short-term injury, specific types}), \\
P(\text{long-term injury, specific types}).
\end{align*}
$$
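
One way to estimate the first of these quantities is as a ratio of counts: among players with at least one short-term injury, the fraction who had a long-term injury afterwards. The sketch below does this on invented data. The `player`, `date`, and `kind` columns are assumed labels that the cleaning step would need to produce, and counting per player rather than per injury is one possible choice among several.

```python
import pandas as pd

# Invented, already-labelled injury events (one row per injury).
events = pd.DataFrame(
    {
        "player": ["A", "A", "B", "C", "C", "D"],
        "date": pd.to_datetime(
            ["2020-01-01", "2020-03-01", "2020-02-01",
             "2020-01-10", "2020-04-01", "2020-05-01"]
        ),
        "kind": ["short", "long", "short", "long", "short", "short"],
    }
)

# First short-term and last long-term injury date per player.
first_short = events[events["kind"] == "short"].groupby("player")["date"].min()
last_long = events[events["kind"] == "long"].groupby("player")["date"].max()

# Align on players who had a short-term injury; no long-term injury gives NaT.
last_long = last_long.reindex(first_short.index)

# A player counts if a long-term injury came after their first short-term one.
# Comparisons against NaT evaluate to False, so those players do not count.
long_after_short = last_long > first_short

p_long_given_short = long_after_short.mean()
print(f"P(long-term | short-term) ~ {p_long_given_short:.2f}")
```

In this toy data only player A has a long-term injury after a short-term one, out of four players with a short-term injury, so the estimate is 0.25.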





### How do we classify long-term and short-term injuries?

According to the Bard language model, a short-term injury has a recovery time of 0 to 8 weeks, whilst a long-term injury typically takes more than 8 weeks.
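
A minimal sketch of that rule follows. It assumes the recovery time can be read from notes shaped like "(out ~2 weeks)", a pattern that appears in the raw `Notes` field. The function names `weeks_out` and `classify_injury`, the regular expression, and the choice to put exactly 8 weeks in the short-term bucket are my assumptions. Notes without a stated duration in weeks (for example "out indefinitely") return `None`.

```python
import re
from typing import Optional

# Assumed threshold from the definition above: up to 8 weeks is short-term.
SHORT_TERM_MAX_WEEKS = 8

# Matches durations such as "out ~2 weeks", "out 1 week", "out 3-4 weeks".
_WEEKS_PATTERN = re.compile(
    r"out\s+~?\s*(\d+)(?:\s*-\s*(\d+))?\s+weeks?", re.IGNORECASE
)


def weeks_out(notes: str) -> Optional[float]:
    """Return the stated recovery time in weeks, or None if none is given."""
    match = _WEEKS_PATTERN.search(notes)
    if match is None:
        return None
    low = int(match.group(1))
    high = int(match.group(2)) if match.group(2) else low
    return (low + high) / 2  # midpoint of a range such as "3-4 weeks"


def classify_injury(notes: str) -> Optional[str]:
    """Label a note 'short-term' or 'long-term'; None when no duration is stated."""
    weeks = weeks_out(notes)
    if weeks is None:
        return None
    return "short-term" if weeks <= SHORT_TERM_MAX_WEEKS else "long-term"


print(classify_injury("mumps (out ~2 weeks)"))               # short-term
print(classify_injury("knee surgery (out 10-12 weeks)"))     # long-term (invented note)
print(classify_injury("fractured legs (out indefinitely)"))  # None
```

Durations given in days or months, or not given at all, would need extra handling before this rule could cover the full dataset.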

## Converting to a database
In Chapter two of "Designing Machine Learning Systems", Chip Huyen stresses the importance of data models: "How you choose to represent data not only affects the way your systems are built, but also the problems your systems can solve". Taking this chapter as a guide, where Chip contrasts relational databases (which follow a strict schema) with "NoSQL" alternatives, we will use a relational database to save time.
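
For illustration, an explicit relational schema for these records might look like the sketch below. It uses an in-memory SQLite database. The table name `injury_events`, the column types, and the index are assumptions made for the example, not the schema used in the cells that follow, where pandas infers the table layout from the DataFrame.

```python
import sqlite3

# In-memory database so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")

# Hypothetical schema mirroring the five JSON fields.
conn.execute(
    """
    CREATE TABLE injury_events (
        id           INTEGER PRIMARY KEY,
        date         TEXT NOT NULL,   -- ISO 8601 date string
        team         TEXT,
        acquired     TEXT,            -- player returning to the roster, if any
        relinquished TEXT,            -- player leaving the roster, if any
        notes        TEXT
    )
    """
)

# Index so per-player lookups in date order stay cheap.
conn.execute(
    "CREATE INDEX idx_relinquished_date ON injury_events (relinquished, date)"
)

# Insert the sample record shown earlier, using bound parameters.
conn.execute(
    "INSERT INTO injury_events (date, team, acquired, relinquished, notes) "
    "VALUES (?, ?, ?, ?, ?)",
    (
        "1947-08-05",
        "Bombers (BAA)",
        None,
        "Jack Underman",
        "fractured legs (in auto accident) (out indefinitely) (date approximate)",
    ),
)

rows = conn.execute(
    "SELECT date, team, notes FROM injury_events "
    "WHERE relinquished = ? ORDER BY date",
    ("Jack Underman",),
).fetchall()
print(rows)
conn.close()
```

Declaring the columns up front is what the strict schema buys us: every record has the same five fields, so queries by player and date need no per-row checks.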

In [52]:
# Path to the raw JSON export (machine-specific)
raw_data = '/Users/shahid/Github/nba-injuries-long-short-term/raw_data/nba_injuries.json'
df = pd.read_json(raw_data)

# Write the DataFrame to a SQLite table, replacing any existing copy
db_file = 'nba_injury_v1.db'
conn = sqlite3.connect(db_file)
nba_table = 'nab_injury_table'
df.to_sql(nba_table, conn, index=False, if_exists='replace')
conn.close()

In [56]:
conn = sqlite3.connect(db_file)
query_check = f"SELECT COUNT(*) FROM {nba_table}"
count_rows = pd.read_sql_query(query_check, conn)
print(f"Number of rows in {nba_table}: {count_rows.iloc[0, 0]}")
conn.close()


Number of rows in nab_injury_table: 68603


In [69]:
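conn = sqlite3.connect(db_file)
query = f"SELECT * FROM {nba_table}"
# Load the full table into a DataFrame so its shape can be inspected
result = pd.read_sql_query(query, conn)
print("Number of Rows:", result.shape[0])
print("Number of Columns:", result.shape[1])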

# Display the first 10 rows
result.head(10)
print(df.iloc[4])
print(f"\nNew output\n")
print(df.iloc[1])
print(f"\nNew output\n")
print(df.iloc[1000])
print(f"\nNew output\n")
conn.close()

Number of Rows: 68603
Number of Columns: 5
Date             1949-12-23 00:00:00
Team                          Knicks
Acquired                            
Relinquished            Vince Boryla
Notes           mumps (out ~2 weeks)
Name: 4, dtype: object

New output

Date                                          1947-08-05 00:00:00
Team                                                Bombers (BAA)
Acquired                                                         
Relinquished                                        Jack Underman
Notes           fractured legs (in auto accident) (out indefin...
Name: 1, dtype: object

New output

Date            1988-01-30 00:00:00
Team                       Clippers
Acquired           Lancaster Gordon
Relinquished                       
Notes             activated from IL
Name: 1000, dtype: object

New output



Clearly we 