# Collegiate Athlete Injury Predictor

By Jaedin Hernandez-Rogers

## Dataset Import

In [9]:
# Import required libraries
import math
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as st
from sklearn.linear_model import LinearRegression
from scipy.stats import f_oneway
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf

In [10]:
# Import Dataset
cs = pd.read_csv('/Users/Jaedin/Desktop/Data Science/collegiate_athlete_injury.csv')

I will perform logistic regression on this dataset in order to build a program that takes numeric inputs for the most significant injury predictors and calculates a player's log-odds of injury, and the corresponding probability, from the fitted regression formula. The dataset was retrieved from Kaggle and provides synthetic but realistic data on athletes, capturing their demographics, training regimes, schedules, fatigue levels, and injury risks.

## Data Validation

In [11]:
# Locate rows with missing values
cs.loc[cs.isnull().any(axis=1)]

Unnamed: 0,Athlete_ID,Age,Gender,Height_cm,Weight_kg,Position,Training_Intensity,Training_Hours_Per_Week,Recovery_Days_Per_Week,Match_Count_Per_Week,Rest_Between_Events_Days,Fatigue_Score,Performance_Score,Team_Contribution_Score,Load_Balance_Score,ACL_Risk_Score,Injury_Indicator


In [13]:
# Check for duplicated rows
cs[cs.duplicated()]

Unnamed: 0,Athlete_ID,Age,Gender,Height_cm,Weight_kg,Position,Training_Intensity,Training_Hours_Per_Week,Recovery_Days_Per_Week,Match_Count_Per_Week,Rest_Between_Events_Days,Fatigue_Score,Performance_Score,Team_Contribution_Score,Load_Balance_Score,ACL_Risk_Score,Injury_Indicator


In [14]:
# Drop derived features
cs = cs.drop(['Load_Balance_Score', 'ACL_Risk_Score'], axis = 1)

In [15]:
cs.head(15)

Unnamed: 0,Athlete_ID,Age,Gender,Height_cm,Weight_kg,Position,Training_Intensity,Training_Hours_Per_Week,Recovery_Days_Per_Week,Match_Count_Per_Week,Rest_Between_Events_Days,Fatigue_Score,Performance_Score,Team_Contribution_Score,Injury_Indicator
0,A001,24,Female,195,99,Center,2,13,2,3,1,1,99,58,0
1,A002,21,Male,192,65,Forward,8,14,1,3,1,4,55,63,0
2,A003,22,Male,163,83,Guard,8,8,2,1,3,6,58,62,0
3,A004,24,Female,192,90,Guard,1,13,1,1,1,7,82,74,0
4,A005,20,Female,173,79,Center,3,9,1,2,1,2,90,51,0
5,A006,22,Female,180,75,Guard,9,14,3,4,1,6,74,84,0
6,A007,22,Female,179,90,Forward,5,13,1,4,2,7,97,56,1
7,A008,24,Female,167,64,Center,6,7,2,3,3,2,62,70,0
8,A009,19,Female,166,91,Guard,4,19,2,3,3,2,58,67,0
9,A010,20,Female,162,63,Center,2,8,3,3,2,7,62,52,0


## Data Exploration

In [16]:
# Summary Statistics
cs.describe()

Unnamed: 0,Age,Height_cm,Weight_kg,Training_Intensity,Training_Hours_Per_Week,Recovery_Days_Per_Week,Match_Count_Per_Week,Rest_Between_Events_Days,Fatigue_Score,Performance_Score,Team_Contribution_Score,Injury_Indicator
count,200.0,200.0,200.0,200.0,200.0,200.0,200.0,200.0,200.0,200.0,200.0,200.0
mean,21.17,180.805,77.475,5.105,11.315,1.985,2.385,1.975,4.92,74.465,72.63,0.07
std,2.002787,11.529598,12.440789,2.49904,4.438952,0.811212,1.154748,0.817137,2.560543,14.636939,14.432762,0.255787
min,18.0,160.0,55.0,1.0,5.0,1.0,1.0,1.0,1.0,50.0,50.0,0.0
25%,19.0,171.0,67.0,3.0,7.0,1.0,1.0,1.0,3.0,62.0,60.75,0.0
50%,21.0,182.5,77.5,5.0,11.0,2.0,2.0,2.0,5.0,74.0,72.0,0.0
75%,23.0,191.0,89.0,7.0,15.0,3.0,3.0,3.0,7.0,86.25,85.0,0.0
max,24.0,199.0,99.0,9.0,19.0,3.0,4.0,3.0,9.0,99.0,99.0,1.0


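The summary statistics let us explore the dataset, check for outliers, and better understand the distribution of each variable. Every numeric column stays within a plausible range (for example, Age runs from 18 to 24 and Training_Hours_Per_Week from 5 to 19), and the mean of Injury_Indicator is 0.07, so only 14 of the 200 athletes are recorded as injured.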
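Because the response is binary, the summary table also tells us the baseline injury rate. The short sketch below takes the `count` and `mean` of `Injury_Indicator` from the table above (typed in by hand, not read from the CSV) and converts them into a number of injured athletes and a baseline log-odds. The variable names are illustrative assumptions only.

```python
import math

# Values copied from the cs.describe() output above
n_athletes = 200       # count row
injury_rate = 0.07     # mean of Injury_Indicator (0 = healthy, 1 = injured)

# Number of athletes flagged as injured
n_injured = round(n_athletes * injury_rate)

# Baseline odds and log-odds of injury, ignoring all predictors
baseline_odds = injury_rate / (1 - injury_rate)
baseline_log_odds = math.log(baseline_odds)

print(f"Injured athletes: {n_injured} of {n_athletes}")
print(f"Baseline odds: {baseline_odds:.4f}")
print(f"Baseline log-odds: {baseline_log_odds:.4f}")
```

The baseline log-odds is the value an intercept-only logistic regression would estimate, so it gives a reference point for the intercepts reported later. With so few injured athletes, any fitted coefficients should be read with some caution.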

# Logistic Regression

## Categorical

### Gender

In [17]:
# Checking for significance of gender on injury
gender_reg = smf.logit('Injury_Indicator ~ Gender', data=cs).fit()
gender_reg.summary()

Optimization terminated successfully.
         Current function value: 0.253437
         Iterations 7


0,1,2,3
Dep. Variable:,Injury_Indicator,No. Observations:,200.0
Model:,Logit,Df Residuals:,198.0
Method:,MLE,Df Model:,1.0
Date:,"Tue, 17 Feb 2026",Pseudo R-squ.:,0.0007947
Time:,11:22:55,Log-Likelihood:,-50.687
converged:,True,LL-Null:,-50.728
Covariance Type:,nonrobust,LLR p-value:,0.7765

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-2.5157,0.368,-6.844,0.000,-3.236,-1.795
Gender[T.Male],-0.1585,0.560,-0.283,0.777,-1.255,0.939


H<sub>0</sub>: Gender has no effect on the odds of injury (the gender coefficient is 0)

H<sub>1</sub>: Gender has an effect on the odds of injury (the gender coefficient is not 0)

Here, the p-value for gender's effect on injury is 0.777, which is greater than $\alpha$ = 0.05. We therefore fail to reject H<sub>0</sub> and conclude that there is no evidence that gender has a significant effect on injury.
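Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to read. The sketch below does this for the `Gender[T.Male]` row of the summary table; the numbers are copied by hand from that output and the variable names are assumptions for illustration.

```python
import math

# Coefficient and 95% confidence bounds for Gender[T.Male], from the summary above
coef_male = -0.1585
ci_low, ci_high = -1.255, 0.939

# Exponentiate to move from log-odds to odds ratios
odds_ratio = math.exp(coef_male)
or_low, or_high = math.exp(ci_low), math.exp(ci_high)

# An interval that contains 1 is consistent with "no difference in odds"
contains_one = or_low < 1 < or_high

print(f"Odds ratio (Male vs Female): {odds_ratio:.3f}")
print(f"95% CI: ({or_low:.3f}, {or_high:.3f})")
print(f"Interval contains 1: {contains_one}")
```

An odds ratio close to 1 with a confidence interval that spans 1 tells the same story as the large p-value.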

### Position

In [18]:
# Checking for significance of position on injury
pos_reg = smf.logit('Injury_Indicator ~ Position', data=cs).fit()
pos_reg.summary()

Optimization terminated successfully.
         Current function value: 0.250642
         Iterations 7


0,1,2,3
Dep. Variable:,Injury_Indicator,No. Observations:,200.0
Model:,Logit,Df Residuals:,197.0
Method:,MLE,Df Model:,2.0
Date:,"Tue, 17 Feb 2026",Pseudo R-squ.:,0.01181
Time:,11:22:55,Log-Likelihood:,-50.128
converged:,True,LL-Null:,-50.728
Covariance Type:,nonrobust,LLR p-value:,0.5492

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-3.0445,0.591,-5.152,0.000,-4.203,-1.886
Position[T.Forward],0.4796,0.751,0.638,0.523,-0.993,1.952
Position[T.Guard],0.7758,0.730,1.063,0.288,-0.655,2.207


H<sub>0</sub>: Injury probability does not differ between positions

H<sub>1</sub>: Injury probability differs for at least one position

The 'Center' position was used as the reference group for this regression. The p-value for the 'Forward' position is 0.523 and the p-value for the 'Guard' position is 0.288, both greater than $\alpha$ = 0.05, and the likelihood-ratio test for the model as a whole gives p = 0.5492. We therefore fail to reject H<sub>0</sub>: there is no evidence of a difference in injury probability between the positions.
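Since Position has three levels, the individual coefficient p-values only compare each position against 'Center'. The `LLR p-value` in the summary tests the position effect as a whole, and it can be reproduced from the two log-likelihoods. The sketch below uses the rounded values printed in the output, so it matches the reported 0.5492 only approximately. It also relies on the fact that, for exactly 2 degrees of freedom, the chi-square survival function is `exp(-x / 2)`; other degrees of freedom would need `scipy.stats.chi2.sf` instead.

```python
import math

# Log-likelihoods copied from the Position model summary above
ll_model = -50.128   # Log-Likelihood
ll_null = -50.728    # LL-Null (intercept-only model)
df_model = 2         # two dummy variables: Forward and Guard

# Likelihood-ratio statistic: twice the gain in log-likelihood
lr_stat = 2 * (ll_model - ll_null)

# Chi-square survival function for exactly 2 degrees of freedom
p_value = math.exp(-lr_stat / 2)

print(f"LR statistic: {lr_stat:.3f} on {df_model} df")
print(f"p-value: {p_value:.4f}")
```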

Now that we have found that neither of our categorical variables is a significant predictor of injury, we can move on to analyzing our numeric predictors.

## Numeric

In [19]:
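# Automated backward elimination: refit the logistic regression and drop the
# least significant predictor until every remaining p-value is at most 0.05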

# Initial predictor list
predictors = [
    'Height_cm',
    'Weight_kg',
    'Training_Intensity',
    'Training_Hours_Per_Week',
    'Recovery_Days_Per_Week',
    'Match_Count_Per_Week',
    'Rest_Between_Events_Days',
    'Fatigue_Score',
    'Performance_Score',
    'Team_Contribution_Score'
]

response = 'Injury_Indicator'

while True:
    formula = response + ' ~ ' + ' + '.join(predictors)
    model = smf.logit(formula, data=cs).fit(disp=0)
    
    pvalues = model.pvalues.drop('Intercept')
    
    max_p = pvalues.max()
    
    if max_p > 0.05:
        worst_var = pvalues.idxmax()
        predictors.remove(worst_var)
        print(f"Removing {worst_var} (p = {max_p:.4f})")
    else:
        break

print("\nFinal model:")
print(model.summary())

Removing Match_Count_Per_Week (p = 0.7283)
Removing Team_Contribution_Score (p = 0.6672)
Removing Height_cm (p = 0.4550)
Removing Performance_Score (p = 0.3589)
Removing Rest_Between_Events_Days (p = 0.2688)
Removing Weight_kg (p = 0.2621)
Removing Training_Intensity (p = 0.1666)

Final model:
                           Logit Regression Results                           
Dep. Variable:       Injury_Indicator   No. Observations:                  200
Model:                          Logit   Df Residuals:                      196
Method:                           MLE   Df Model:                            3
Date:                Tue, 17 Feb 2026   Pseudo R-squ.:                  0.4237
Time:                        11:22:55   Log-Likelihood:                -29.232
converged:                       True   LL-Null:                       -50.728
Covariance Type:            nonrobust   LLR p-value:                 2.472e-09
                              coef    std err          z      P>|z|      

After backward elimination, the predictors of 'Injury_Indicator' that remain significant at $\alpha$ = 0.05 are 'Training_Hours_Per_Week', 'Recovery_Days_Per_Week', and 'Fatigue_Score'.
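The fit statistics in the final summary can be checked by hand. statsmodels reports McFadden's pseudo R-squared, which compares the fitted model's log-likelihood with that of the intercept-only model. The sketch below recomputes it from the rounded values printed above; the variable names are illustrative.

```python
# Log-likelihoods copied from the final model summary above
ll_model = -29.232   # Log-Likelihood
ll_null = -50.728    # LL-Null (intercept-only model)

# McFadden's pseudo R-squared: 1 - LL_model / LL_null
pseudo_r2 = 1 - ll_model / ll_null

# Likelihood-ratio statistic behind the reported LLR p-value (3 degrees of freedom)
lr_stat = 2 * (ll_model - ll_null)

print(f"McFadden pseudo R-squared: {pseudo_r2:.4f}")
print(f"LR statistic: {lr_stat:.3f}")
```

The result agrees with the reported `Pseudo R-squ.` of 0.4237 up to rounding, a large improvement over the single-variable categorical models.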

Our injury predictor model then reads as follows:

$$
\log\left(\frac{P(\text{Injury}=1)}{1 - P(\text{Injury}=1)}\right)
= 0.1887(\text{Training Hours})
- 2.0024(\text{Recovery Days})
+ 0.8087(\text{Fatigue Score})
-6.9979
$$
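Each coefficient in the equation is the change in log-odds for a one-unit increase in that predictor, holding the others fixed. Exponentiating turns these into odds ratios. The sketch below uses the coefficients exactly as written in the equation; the dictionary layout is just an illustrative choice.

```python
import math

# Coefficients taken from the regression equation above
coefficients = {
    "Training_Hours_Per_Week": 0.1887,
    "Recovery_Days_Per_Week": -2.0024,
    "Fatigue_Score": 0.8087,
}

# Odds ratio for a one-unit increase in each predictor
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds multiplied by {ratio:.3f} per one-unit increase")
```

Read this way, each extra recovery day cuts the estimated odds of injury to roughly a seventh, while each extra fatigue point a little more than doubles them.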


In [27]:
import math

def predict_injury(training_hours, recovery_days, fatigue_score):
    """Return the injury probability and a risk label from the fitted logistic regression equation."""
    injury_log_odds = (
        (0.1887 * training_hours)
        - (2.0024 * recovery_days)
        + (0.8087 * fatigue_score)
        - 6.9979
    )
    probability = 1 / (1 + math.exp(-injury_log_odds))
    if probability > 0.5:
        risk = 'High'
    elif probability < 0.1:
        risk = 'Low'
    else:
        risk = 'Mid'
    return {
        'Probability': probability,
        'Risk': risk
    }


In [28]:
predict_injury(7, 2, 6)

{'Probability': 0.007926386928930563, 'Risk': 'Low'}

Here, I built a function that uses our logistic regression equation, with our most significant predictors, to calculate the log-odds of injury from Training Hours, Recovery Days, and Fatigue Score (out of 10). It then converts the log-odds to a probability and assigns a risk label.

We can sanity-check it using the minimum, mean, and maximum observed values of each variable.

Suggested input tests:

Minimum Training Hours, maximum Recovery Days, and minimum Fatigue: 5, 3, 1

Mean Training Hours, Recovery Days, and Fatigue: 11.3, 2.0, 4.9

Maximum Training Hours, minimum Recovery Days, and maximum Fatigue: 19, 1, 9
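These three scenarios can be run in one pass. The sketch below is self-contained: `injury_probability` is a hypothetical helper that repeats the regression equation rather than calling the notebook's `predict_injury`, and the "average" scenario uses the unrounded means from the summary statistics table (11.315, 1.985, 4.92).

```python
import math

def injury_probability(training_hours, recovery_days, fatigue_score):
    """Hypothetical helper: apply the fitted equation and return a probability."""
    log_odds = (
        0.1887 * training_hours
        - 2.0024 * recovery_days
        + 0.8087 * fatigue_score
        - 6.9979
    )
    return 1 / (1 + math.exp(-log_odds))

# (training hours, recovery days, fatigue score) for each test case
scenarios = {
    "lowest risk": (5, 3, 1),
    "average": (11.315, 1.985, 4.92),
    "highest risk": (19, 1, 9),
}

results = {label: injury_probability(*values) for label, values in scenarios.items()}

for label, probability in results.items():
    print(f"{label}: {probability:.6f}")
```

If the model behaves sensibly, the probabilities should increase from the first scenario to the last, with only the highest-risk case crossing the 0.5 threshold used for the 'High' label.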

In [22]:
Player_Injury()

Enter Training Hours, Recovery Days, Fatigue Score (comma separated):  20, 1, 9


Predicted probability: 0.8861495516530373
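The definition of `Player_Injury` is not shown in this notebook. Judging only from the prompt and output above, it appears to read three comma-separated numbers and print the predicted probability. The following is a minimal sketch of how such a wrapper might look, not the original implementation; `parse_inputs` and `probability_from_text` are hypothetical helper names.

```python
import math

def parse_inputs(text):
    """Split 'hours, recovery, fatigue' into three floats."""
    parts = [part.strip() for part in text.split(",")]
    if len(parts) != 3:
        raise ValueError("Expected three comma-separated values")
    return tuple(float(part) for part in parts)

def probability_from_text(text):
    """Apply the fitted regression equation to a comma-separated input string."""
    training_hours, recovery_days, fatigue_score = parse_inputs(text)
    log_odds = (
        0.1887 * training_hours
        - 2.0024 * recovery_days
        + 0.8087 * fatigue_score
        - 6.9979
    )
    return 1 / (1 + math.exp(-log_odds))

def Player_Injury():
    """Assumed reconstruction of the interactive wrapper: prompt, predict, print."""
    text = input("Enter Training Hours, Recovery Days, Fatigue Score (comma separated): ")
    print("Predicted probability:", probability_from_text(text))
```

Calling `probability_from_text("20, 1, 9")` reproduces the probability printed above. Note that 20 training hours lies just outside the observed range of 5 to 19, so that prediction is a slight extrapolation.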


In [None]:
!pip install fastapi uvicorn
!python -m uvicorn main:app --reload
