In [None]:
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

In [None]:
print(sys.executable)

<center> 
    <h1> Lab 1 - PECARN TBI Data</h1>
    <h4>Stat 215A, Fall 2024</h4>
</center>

# Introduction
Traumatic brain injury (TBI) is a significant concern in pediatric healthcare, particularly when evaluating children who have head trauma. Identifying clinically important TBIs (ciTBIs) helps decide whether further medical intervention, such as neurosurgical treatment or a CT scan, is necessary. The data we are using come from the Pediatric Emergency Care Applied Research Network (PECARN), which researches acute injuries and illnesses among children across a wide range of demographics and institutions. In this lab, we study this dataset and perform exploratory data analysis (EDA) to discover patterns and insights for detecting the risk of TBI in patients younger than 18. To diagnose patients with TBI, doctors must perform a computed tomography (CT) scan. However, according to many studies, CT imaging of head-injured children carries a risk of radiation-induced malignancy. Patients with minor head trauma (Glasgow Coma Scale scores of 14-15) account for 40-60% of assessments, yet fewer than 10% show signs of actual TBI. The goal of this study is therefore to create a decision rule that identifies ciTBIs without excessive use of CT scans.

We first describe the dataset and the patterns in its features. We then walk through the data cleaning process and justify the judgment calls made in this report.

# Data
The dataset includes children under 18 years who presented with minor head trauma in emergency departments within 24 hours of injury and had Glasgow Coma Scale (GCS) scores of 14-15.

In [None]:
data = pd.read_csv("../data/TBI PUD 10-08-2013.csv")

In [None]:
data

Unnamed: 0,PatNum,EmplType,Certification,InjuryMech,High_impact_InjSev,Amnesia_verb,LOCSeparate,LocLen,Seiz,SeizOccur,...,Finding20,Finding21,Finding22,Finding23,DeathTBI,HospHead,HospHeadPosCT,Intub24Head,Neurosurgery,PosIntFinal
0,1,3.0,3,11.0,2.0,0.0,0.0,92.0,0.0,92.0,...,92,92,92,92,0.0,0.0,0,0.0,0.0,0.0
1,2,5.0,3,8.0,2.0,0.0,0.0,92.0,0.0,92.0,...,0,0,0,0,0.0,0.0,0,0.0,0.0,0.0
2,3,5.0,3,5.0,2.0,,,92.0,,92.0,...,0,0,0,0,0.0,1.0,0,0.0,0.0,0.0
3,4,5.0,3,6.0,1.0,91.0,0.0,92.0,0.0,92.0,...,92,92,92,92,0.0,0.0,0,0.0,0.0,0.0
4,5,3.0,3,12.0,2.0,91.0,0.0,92.0,0.0,92.0,...,0,0,0,0,0.0,0.0,0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
43394,43395,5.0,3,8.0,2.0,0.0,0.0,92.0,0.0,92.0,...,92,92,92,92,0.0,0.0,0,0.0,0.0,0.0
43395,43396,5.0,3,6.0,1.0,91.0,0.0,92.0,0.0,92.0,...,92,92,92,92,0.0,0.0,0,0.0,0.0,0.0
43396,43397,5.0,3,7.0,1.0,0.0,0.0,92.0,0.0,92.0,...,92,92,92,92,0.0,0.0,0,0.0,0.0,0.0
43397,43398,5.0,1,8.0,2.0,0.0,0.0,92.0,0.0,92.0,...,92,92,92,92,0.0,0.0,0,0.0,0.0,0.0


## Data Collection

Data was collected through standardized forms and follow-up phone surveys.

## Data Cleaning

First, we explore general features by category using common knowledge. Then we can further analyze specific questions with more features.

According to the documentation, I divided the dataset into subgroups of features that relate to pre-condition, incidence, post-condition, and intervention. Pre-condition features describe the patient independently of the injury; for example, `Gender` and `Race` are attributes of the patient regardless of the injury. Incidence features describe the injury itself, the incident that led to this analysis. Similarly, post-condition features describe the condition of the patient after the injury. For example, `Seiz`, `Vomit`, and `SFxPalp` describe whether the patient had a post-traumatic seizure, vomiting, or a palpable skull fracture after the incident. Lastly, intervention is the set of features recorded by the emergency department (ED). I grouped these features because of the subjective nature of diagnosis and the patient's ability to self-express.

### Pre-Condition (A description of patients before the injury):
* AgeTwoPlus
* Gender
* Ethnicity
* Race
* Drugs


### Incidence (Relating to injury):
* InjuryMech
* High_impact_InjSev

### Post-Condition (A description of patients after the injury):
* Amnesia_verb
* LOCSeparate
* LocLen
* Seiz
* SeizOccur
* SeizLen
* Vomit
* VomitNbr
* SFxPalp
* FontBulg
* SFxBas
* SFxBasHem


### Intervention (Due to the subjective nature of ED and ability to self-express, it is necessary to compare differences between preverbal and verbal):
* ActNorm
* HA_verb
* HASeverity
* Intubated
* Paralyzed
* Sedated

### Two Groups (Major Head Trauma vs Minor Head Trauma):
* GCSGroup

### Label to determine the outcome or group by
* PosIntFinal

In [None]:
df_incidence = data.loc[:,["InjuryMech","High_impact_InjSev"]]
df_pre = data.loc[:, ["AgeTwoPlus","Gender","Race"]]
df_post = data.loc[:,["Amnesia_verb","LOCSeparate","LocLen","Seiz","SeizOccur","SeizLen","Vomit","VomitNbr","SFxPalp","FontBulg","SFxBas","SFxBasHem"]]
df_int = data.loc[:,["ActNorm","HA_verb","HASeverity","Intubated","Paralyzed","Sedated"]]

In [None]:
def bar_plot(df):
    """Plot, for each column of df, the proportion of NaN vs. non-NaN values."""
    # Fraction of missing entries per column; the remainder is non-missing
    na_values = df.isna().mean()
    non_na_values = 1 - na_values

    fig, ax = plt.subplots()
    columns = df.columns
    # Stacked horizontal bars: non-missing share first, missing share on top
    ax.barh(columns, non_na_values, label='Non-NaN')
    ax.barh(columns, na_values, left=non_na_values, label='NaN', color='orange')

    ax.set_xlabel('Proportion')
    ax.set_title('Proportion of NaN Values')

    ax.legend()
    plt.show()


In [None]:
bar_plot(df_pre)

<Figure size 1650x1050 with 1 Axes>

In [None]:
bar_plot(df_post)

<Figure size 1650x1050 with 1 Axes>

In [None]:
bar_plot(df_incidence)

<Figure size 1650x1050 with 1 Axes>

In [None]:
bar_plot(df_int)

<Figure size 1650x1050 with 1 Axes>

In [None]:
plt.figure(figsize=(3, 3))
sns.heatmap(data.isnull(), cbar=False, cmap='viridis')
plt.title("Missing Values Heatmap")
plt.show()

No column was missing more than 50% of its values, so instead of dropping whole features, I decided to fill missing values with the mode of each feature (its most common value). Additionally, I've included correlations with `PosIntFinal`, a binary value indicating whether the patient was diagnosed with a ciTBI. According to the data report, a clinically important TBI is defined as at least one of the following: (1) a neurosurgical procedure was performed, (2) intubation for more than 24 hours for head trauma, (3) death due to TBI or in the ED, (4) hospitalization for 2 or more nights due to head injury with a TBI on CT.
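The four-part definition above can be sketched as a row-wise "any" over the component outcome columns. This is a minimal illustration on made-up rows, not the study's own derivation: it assumes that the columns `Neurosurgery`, `Intub24Head`, `DeathTBI`, and `HospHeadPosCT` (all present in the data frame shown earlier) code a positive outcome as 1, and the helper name `derive_citbi` is hypothetical.

```python
import pandas as pd

# Component outcomes of the ciTBI definition (assumed coding: 1 = yes, 0 = no)
CITBI_COMPONENTS = ["Neurosurgery", "Intub24Head", "DeathTBI", "HospHeadPosCT"]

def derive_citbi(df):
    """Return 1 where at least one component outcome equals 1, else 0."""
    return df[CITBI_COMPONENTS].eq(1).any(axis=1).astype(int)

# Toy rows for illustration only (not taken from the PECARN data)
toy = pd.DataFrame({
    "Neurosurgery":  [0, 1, 0, 0],
    "Intub24Head":   [0, 0, 0, 0],
    "DeathTBI":      [0, 0, 0, 0],
    "HospHeadPosCT": [0, 0, 1, 0],
})
toy["citbi"] = derive_citbi(toy)
print(toy)
```

Comparing such a derived flag with `PosIntFinal` would be one way to check how the label lines up with its components. Missing component values would need a separate decision, since `eq(1)` treats NaN as "not positive".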

We can see that the four most strongly correlated features are ones that are only known as a result of the clinical workup. I've therefore selected a subset of features, using the judgment call described above, and computed their correlations with `PosIntFinal`.

In [None]:
df_corr = data.corr()['PosIntFinal'][:-1]

In [None]:
corr_lst = df_corr[abs(df_corr) > 0.5].sort_values(ascending=False)
print("These are {} strongly correlated values:\n{}".format(len(corr_lst), corr_lst))

These are 4 strongly correlated values:
HospHeadPosCT    0.952243
HospHead         0.867533
Neurosurgery     0.508631
GCSTotal        -0.519716
Name: PosIntFinal, dtype: float64


In [None]:
df2 = data.loc[:,["InjuryMech","High_impact_InjSev", "AgeTwoPlus","Gender","Race", "Amnesia_verb","LOCSeparate","LocLen","Seiz","SeizOccur","SeizLen","Vomit","VomitNbr","SFxPalp","FontBulg","SFxBas","SFxBasHem", "ActNorm","HA_verb","HASeverity","Intubated","Paralyzed","Sedated", 'PosIntFinal']]

In [None]:
df_corr2 = df2.corr()['PosIntFinal'][:-1]

In [None]:
corr_lst2 = df_corr2.sort_values(ascending=False)
print("These are {} correlated values in order:\n{}".format(len(corr_lst2), corr_lst2))

These are 23 correlated values in order:
Intubated             0.392128
Sedated               0.295219
Paralyzed             0.270167
SFxBas                0.222757
SFxPalp               0.157692
LOCSeparate           0.152461
High_impact_InjSev    0.099557
Seiz                  0.086900
HA_verb               0.077288
Vomit                 0.071679
Amnesia_verb          0.071391
FontBulg              0.063583
HASeverity            0.026814
AgeTwoPlus            0.015662
Gender               -0.001215
Race                 -0.003797
InjuryMech           -0.015431
VomitNbr             -0.048993
SeizLen              -0.049878
SeizOccur            -0.062200
LocLen               -0.107847
ActNorm              -0.167916
SFxBasHem            -0.219501
Name: PosIntFinal, dtype: float64


In [None]:
def clean_data(df, outlier='none', fill_na_with='mode', alpha=1.5):
    """
    Clean a DataFrame by filling missing values and optionally trimming outliers.

    outlier: 'none', 'iqr' (drop rows with any value beyond alpha * IQR from
        the quartiles), or 'z' (drop rows with any |z-score| >= alpha).
    fill_na_with: 'mean', 'mode', or 'drop'.
    """
    # Work on a copy so the caller's DataFrame is not modified in place
    df = df.copy()

    # Replace (or drop) NaN values
    if fill_na_with == 'mean':
        df = df.fillna(df.mean())
    elif fill_na_with == 'mode':
        for c in df.columns:
            mode_val = df[c].mode()[0]
            df[c] = df[c].fillna(mode_val)
    elif fill_na_with == 'drop':
        df = df.dropna()

    # Trim outliers
    if outlier == 'iqr':
        q1 = df.quantile(0.25)
        q3 = df.quantile(0.75)
        iqr = q3 - q1
        df = df[~((df < (q1 - alpha * iqr)) | (df > (q3 + alpha * iqr))).any(axis=1)]

    elif outlier == 'z':
        df = df[(np.abs(stats.zscore(df)) < alpha).all(axis=1)]

    return df


In [None]:
# Filled missing values without trimming
data_new = clean_data(data)

# Filled missing values and trimmed with IQR
data_new_iqr = clean_data(data, outlier="iqr")

# Filled missing values and trimmed with z-score
data_new_z = clean_data(data, outlier="z")

In [None]:
# How much data was reduced after trimming using IQR
print("How much data was reduced after trimming using IQR: ", 1 - (data_new_iqr.shape[0]/data_new.shape[0]))

# How much data was reduced after trimming using z-score
print("How much data was reduced after trimming using z-score: ", 1 - (data_new_z.shape[0]/data_new.shape[0]))

How much data was reduced after trimming using IQR:  0.7581050254614161
How much data was reduced after trimming using z-score:  0.9077858936841863


In [None]:
# Filled missing values without trimming
df_pre_clean = clean_data(df_pre)

# Filled missing values and trimmed with IQR
df_pre_iqr = clean_data(df_pre, outlier="iqr")

# Filled missing values and trimmed with z-score
df_pre_z = clean_data(df_pre, outlier="z")

In [None]:
# How much data was reduced after trimming using IQR
print("How much data was reduced after trimming using IQR: ", 1 - (df_pre_iqr.shape[0]/df_pre.shape[0]))

# How much data was reduced after trimming using z-score
print("How much data was reduced after trimming using z-score: ", 1 - (df_pre_z.shape[0]/df_pre.shape[0]))

How much data was reduced after trimming using IQR:  0.03412521025830084
How much data was reduced after trimming using z-score:  0.272425631927003


In [None]:
# Filled missing values without trimming
df_post_clean = clean_data(df_post)

# Filled missing values and trimmed with IQR
df_post_iqr = clean_data(df_post, outlier="iqr")

# Filled missing values and trimmed with z-score
df_post_z = clean_data(df_post, outlier="z")

In [None]:
# How much data was reduced after trimming using IQR
print("How much data was reduced after trimming using IQR: ", 1 - (df_post_iqr.shape[0]/df_post.shape[0]))

# How much data was reduced after trimming using z-score
print("How much data was reduced after trimming using z-score: ", 1 - (df_post_z.shape[0]/df_post.shape[0]))

How much data was reduced after trimming using IQR:  0.29652756975967187
How much data was reduced after trimming using z-score:  0.29652756975967187


In [None]:
# Filled missing values without trimming
df_incidence_clean = clean_data(df_incidence)

# Filled missing values and trimmed with IQR
df_incidence_iqr = clean_data(df_incidence, outlier="iqr")

# Filled missing values and trimmed with z-score
df_incidence_z = clean_data(df_incidence, outlier="z")

In [None]:
# How much data was reduced after trimming using IQR
print("How much data was reduced after trimming using IQR: ", 1 - (df_incidence_iqr.shape[0]/df_incidence.shape[0]))

# How much data was reduced after trimming using z-score
print("How much data was reduced after trimming using z-score: ", 1 - (df_incidence_z.shape[0]/df_incidence.shape[0]))

How much data was reduced after trimming using IQR:  0.39546994170372585
How much data was reduced after trimming using z-score:  0.39546994170372585


In [None]:
# Filled missing values without trimming
df_int_clean = clean_data(df_int)

# Filled missing values and trimmed with IQR
df_int_iqr = clean_data(df_int, outlier="iqr")

# Filled missing values and trimmed with z-score
df_int_z = clean_data(df_int, outlier="z")

In [None]:
# How much data was reduced after trimming using IQR
print("How much data was reduced after trimming using IQR: ", 1 - (df_int_iqr.shape[0]/df_int.shape[0]))

# How much data was reduced after trimming using z-score
print("How much data was reduced after trimming using z-score: ", 1 - (df_int_z.shape[0]/df_int.shape[0]))

How much data was reduced after trimming using IQR:  0.16018802276550148
How much data was reduced after trimming using z-score:  0.37523906080785274


In general, if trimming would remove more than 10% of the rows, I do not trim that dataset. Among the subgroups chosen here, only the pre-condition subgroup stays under that threshold (about 3.4% of rows removed with IQR trimming). IQR trimming never removes more rows than z-score trimming for these subgroups, so where trimming is applied I use the IQR version: missing values replaced with the mode and outliers trimmed with IQR.

We want to avoid heavy trimming because discarding a large share of rows can bias the remaining sample.
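To make the IQR rule concrete, here is a small sketch of the fence computation that `clean_data` applies to each column, run on a made-up column of category codes. The helper `iqr_fences` and the toy values are assumptions for illustration only; the quartiles are chosen to match the `Race` summary reported in the Data Exploration section (25% = 1, 75% = 2).

```python
import pandas as pd

def iqr_fences(series, alpha=1.5):
    """Return (lower, upper) fences: the quartiles widened by alpha * IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - alpha * iqr, q3 + alpha * iqr

# Toy column of category codes (illustrative values, not the real Race column)
codes = pd.Series([1] * 6 + [2] * 5 + [3, 4, 90])

lower, upper = iqr_fences(codes)
# Values outside the fences are the ones IQR trimming would discard
dropped = codes[(codes < lower) | (codes > upper)].tolist()

print(f"fences: [{lower}, {upper}]")
print(f"dropped codes: {dropped}")
```

With quartiles of 1 and 2 the fences are -0.5 and 3.5, so every code above 3 is treated as an outlier even if it is a legitimate category. This is one reason to be cautious about trimming columns that hold category codes rather than measurements.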

## Data Exploration

In [None]:
def eda(df, trimming="none"):
    """
    Perform EDA on df: summary statistics, a missing-value heatmap, and
    histograms before cleaning, then summary statistics, histograms, and a
    correlation matrix after cleaning with clean_data(df, outlier=trimming).
    """

    print("Summary Statistics (Before Cleaning):")
    display(df.describe(include='all').T)

    plt.figure(figsize=(5, 3))
    sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
    plt.title("Missing Values Heatmap")
    plt.show()

    cols = df.columns
    df[cols].hist(bins=15, figsize=(15, 10), layout=(4, 3))
    plt.suptitle("Histograms of Columns (Before Cleaning)")
    plt.show()

    print("Summary Statistics (After Cleaning):")
    df = clean_data(df, outlier=trimming)
    display(df.describe(include='all').T)

    cols = df.columns
    df[cols].hist(bins=15, figsize=(15, 10), layout=(4, 3))
    plt.suptitle("Histograms of Columns (After Cleaning)")
    plt.show()

    plt.figure(figsize=(10, 6))
    sns.heatmap(df[cols].corr(), annot=True, fmt='.2f')
    plt.title("Correlation Matrix of Features")
    plt.show()

In [None]:
eda(df_pre, "iqr")

Summary Statistics (Before Cleaning):


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
AgeTwoPlus,43399.0,1.74875,0.433737,1.0,1.0,2.0,2.0,2.0
Gender,43399.0,1.376529,0.484521,1.0,1.0,1.0,2.0,2.0
Race,43399.0,4.165718,15.32521,1.0,1.0,1.0,2.0,90.0


<Figure size 4500x3000 with 12 Axes>

Summary Statistics (After Cleaning):


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
AgeTwoPlus,41918.0,1.750775,0.432569,1.0,2.0,2.0,2.0,2.0
Gender,41918.0,1.376235,0.484446,1.0,1.0,1.0,2.0,2.0
Race,41918.0,1.422587,0.533942,1.0,1.0,1.0,2.0,3.0


<Figure size 4500x3000 with 12 Axes>

<Figure size 3000x1800 with 2 Axes>

In [None]:
eda(df_post)

Summary Statistics (Before Cleaning):


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Amnesia_verb,43399.0,31.148229,43.068433,0.0,0.0,0.0,91.0,91.0
LOCSeparate,43399.0,0.204014,0.505093,0.0,0.0,0.0,0.0,2.0
LocLen,43399.0,83.133413,26.786687,1.0,92.0,92.0,92.0,92.0
Seiz,43399.0,0.013894,0.117054,0.0,0.0,0.0,0.0,1.0
SeizOccur,43399.0,90.892509,9.941777,1.0,92.0,92.0,92.0,92.0
SeizLen,43399.0,90.987673,9.51301,1.0,92.0,92.0,92.0,92.0
Vomit,43399.0,0.133736,0.340372,0.0,0.0,0.0,0.0,1.0
VomitNbr,43399.0,80.579322,29.959698,1.0,92.0,92.0,92.0,92.0
SFxPalp,43399.0,0.050185,0.304455,0.0,0.0,0.0,0.0,2.0
FontBulg,43399.0,0.00083,0.02879,0.0,0.0,0.0,0.0,1.0


<Figure size 4500x3000 with 12 Axes>

Summary Statistics (After Cleaning):


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Amnesia_verb,43399.0,31.148229,43.068433,0.0,0.0,0.0,91.0,91.0
LOCSeparate,43399.0,0.204014,0.505093,0.0,0.0,0.0,0.0,2.0
LocLen,43399.0,83.133413,26.786687,1.0,92.0,92.0,92.0,92.0
Seiz,43399.0,0.013894,0.117054,0.0,0.0,0.0,0.0,1.0
SeizOccur,43399.0,90.892509,9.941777,1.0,92.0,92.0,92.0,92.0
SeizLen,43399.0,90.987673,9.51301,1.0,92.0,92.0,92.0,92.0
Vomit,43399.0,0.133736,0.340372,0.0,0.0,0.0,0.0,1.0
VomitNbr,43399.0,80.579322,29.959698,1.0,92.0,92.0,92.0,92.0
SFxPalp,43399.0,0.050185,0.304455,0.0,0.0,0.0,0.0,2.0
FontBulg,43399.0,0.00083,0.02879,0.0,0.0,0.0,0.0,1.0


<Figure size 4500x3000 with 12 Axes>

<Figure size 3000x1800 with 2 Axes>

In [None]:
eda(df_incidence)

Summary Statistics (Before Cleaning):


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
InjuryMech,43399.0,13.86419,22.623717,1.0,6.0,8.0,10.0,90.0
High_impact_InjSev,43399.0,1.986659,0.563684,1.0,2.0,2.0,2.0,3.0


<Figure size 4500x3000 with 12 Axes>

Summary Statistics (After Cleaning):


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
InjuryMech,43399.0,13.86419,22.623717,1.0,6.0,8.0,10.0,90.0
High_impact_InjSev,43399.0,1.986659,0.563684,1.0,2.0,2.0,2.0,3.0


<Figure size 4500x3000 with 12 Axes>

<Figure size 3000x1800 with 2 Axes>

In [None]:
eda(df_int)

Summary Statistics (Before Cleaning):


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
ActNorm,43399.0,0.84343,0.363399,0.0,1.0,1.0,1.0,1.0
HA_verb,43399.0,29.7764,42.385174,0.0,0.0,1.0,91.0,91.0
HASeverity,43399.0,67.505864,40.173334,1.0,2.0,92.0,92.0,92.0
Intubated,43399.0,0.005023,0.070697,0.0,0.0,0.0,0.0,1.0
Paralyzed,43399.0,0.003134,0.055892,0.0,0.0,0.0,0.0,1.0
Sedated,43399.0,0.004931,0.070048,0.0,0.0,0.0,0.0,1.0


<Figure size 4500x3000 with 12 Axes>

Summary Statistics (After Cleaning):


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
ActNorm,43399.0,0.84343,0.363399,0.0,1.0,1.0,1.0,1.0
HA_verb,43399.0,29.7764,42.385174,0.0,0.0,1.0,91.0,91.0
HASeverity,43399.0,67.505864,40.173334,1.0,2.0,92.0,92.0,92.0
Intubated,43399.0,0.005023,0.070697,0.0,0.0,0.0,0.0,1.0
Paralyzed,43399.0,0.003134,0.055892,0.0,0.0,0.0,0.0,1.0
Sedated,43399.0,0.004931,0.070048,0.0,0.0,0.0,0.0,1.0


<Figure size 4500x3000 with 12 Axes>

<Figure size 3000x1800 with 2 Axes>

# Findings

## First finding

Is ciTBI (`PosIntFinal`) associated with the pre-condition features (age group, gender, and drug intoxication)? A chi-squared test is a statistical hypothesis test used in the analysis of contingency tables. I used the `chi2_contingency` function to compute the chi-square statistic and p-value for the test of independence of the observed frequencies in each contingency table. If the p-value is less than 0.05, we call the result statistically significant and reject the hypothesis that the two variables are independent. A larger chi-square statistic indicates a stronger departure from independence.
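As a rough illustration of what the test computes, the sketch below works out the Pearson chi-square statistic by hand for a small hypothetical 2x2 table. The function name `chi_square_independence` and the counts are made up for this example. Note that `scipy.stats.chi2_contingency` applies Yates' continuity correction by default on 2x2 tables, so its numbers will differ slightly from this uncorrected version.

```python
import math
import numpy as np

def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    observed = np.asarray(table, dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    # Expected counts if rows and columns were independent
    expected = row_totals @ col_totals / observed.sum()
    statistic = ((observed - expected) ** 2 / expected).sum()
    dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
    return statistic, dof

# Hypothetical 2x2 table: rows = two patient groups, columns = (no ciTBI, ciTBI)
toy_table = [[90, 10],
             [80, 20]]

chi2, dof = chi_square_independence(toy_table)
# With 1 degree of freedom the p-value has a closed form via erfc
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"chi2 = {chi2:.4f}, dof = {dof}, p-value = {p_value:.4f}")
```

The statistic grows as the observed counts move away from the counts expected under independence, which is why a larger value corresponds to a smaller p-value for a fixed table shape.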

In [None]:
from scipy.stats import chi2_contingency

def chi(df):
    plt.figure(figsize=(8, 6))
    sns.countplot(data=df, x='AgeTwoPlus', hue='PosIntFinal')
    plt.title("Age (<2 years or >=2 years) vs. ciTBI")
    plt.show()
    
    plt.figure(figsize=(8, 6))
    sns.countplot(data=df, x='Gender', hue='PosIntFinal')
    plt.title("Gender vs. ciTBI")
    plt.show()
    
    plt.figure(figsize=(8, 6))
    sns.countplot(data=df, x='Drugs', hue='PosIntFinal')
    plt.title("Drug Intoxication vs. ciTBI")
    plt.show()
    
    ct_age = pd.crosstab(df['AgeTwoPlus'], df['PosIntFinal'])
    ct_gender = pd.crosstab(df['Gender'], df['PosIntFinal'])
    ct_drugs = pd.crosstab(df['Drugs'], df['PosIntFinal'])
    
    chi2_age, p_age, _, _ = chi2_contingency(ct_age)
    chi2_gender, p_gender, _, _ = chi2_contingency(ct_gender)
    chi2_drugs, p_drugs, _, _ = chi2_contingency(ct_drugs)
    print(f"Chi-square test for AgeTwoPlus vs ciTBI: chi2 = {chi2_age}, p-value = {p_age}")
    print(f"Chi-square test for Gender vs ciTBI: chi2 = {chi2_gender}, p-value = {p_gender}")
    print(f"Chi-square test for Drugs vs ciTBI: chi2 = {chi2_drugs}, p-value = {p_drugs}")


In [None]:
chi(data_new)

<Figure size 2400x1800 with 1 Axes>

<Figure size 2400x1800 with 1 Axes>

<Figure size 2400x1800 with 1 Axes>

Chi-square test for AgeTwoPlus vs ciTBI: chi2 = 10.350189010132969, p-value = 0.0012946142872781068
Chi-square test for Gender vs ciTBI: chi2 = 0.04429724363595672, p-value = 0.8333015632206818
Chi-square test for Drugs vs ciTBI: chi2 = 43.030281799667264, p-value = 5.389911598157804e-11


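Thus, according to this test, among the pre-condition features drug intoxication shows the strongest association with ciTBI (largest chi-square statistic, p-value around 5e-11). Age group is also significantly associated (p-value around 0.001), while gender is not (p-value around 0.83).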

## Second finding

Which injury mechanisms and severity levels are associated with a higher chance of ciTBI?

In [None]:
def injury(df):
    ct_injury_mech = pd.crosstab(df['InjuryMech'], df['PosIntFinal'])
    display(ct_injury_mech)
    
    ct_injury_severity = pd.crosstab(df['High_impact_InjSev'], df['PosIntFinal'])
    display(ct_injury_severity)
    
    plt.figure(figsize=(12, 6))
    sns.countplot(data=df, x='InjuryMech', hue='PosIntFinal')
    plt.title("Injury Mechanism vs. ciTBI")
    plt.xticks(rotation=45)
    plt.show()
    
    plt.figure(figsize=(8, 6))
    sns.countplot(data=df, x='High_impact_InjSev', hue='PosIntFinal')
    plt.title("Injury Severity vs. ciTBI")
    plt.show()
    
    print(df.groupby('InjuryMech')['PosIntFinal'].mean())
    print(df.groupby('High_impact_InjSev')['PosIntFinal'].mean())


In [None]:
injury(data_new)

PosIntFinal,0.0,1.0
InjuryMech,Unnamed: 1_level_1,Unnamed: 2_level_1
1.0,3747,163
2.0,1326,107
3.0,518,38
4.0,1671,30
5.0,853,48
6.0,4710,23
7.0,2451,4
8.0,11998,186
9.0,2891,17
10.0,2950,29


PosIntFinal,0.0,1.0
High_impact_InjSev,Unnamed: 1_level_1,Unnamed: 2_level_1
1.0,7161,27
2.0,29199,403
3.0,6276,333


<Figure size 3600x1800 with 1 Axes>

<Figure size 2400x1800 with 1 Axes>

InjuryMech
1.0     0.041688
2.0     0.074669
3.0     0.068345
4.0     0.017637
5.0     0.053274
6.0     0.004859
7.0     0.001629
8.0     0.015266
9.0     0.005846
10.0    0.009735
11.0    0.008289
12.0    0.012350
90.0    0.015584
Name: PosIntFinal, dtype: float64
High_impact_InjSev
1.0    0.003756
2.0    0.013614
3.0    0.050386
Name: PosIntFinal, dtype: float64


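In this finding, injury mechanism 2 (pedestrian struck by moving vehicle) has the highest ciTBI rate (about 7.5%), and severity 3 (high) has the highest rate among the severity levels (about 5.0%). In absolute counts, however, severity 2 contributes the most ciTBI patients (403, versus 333 for severity 3) because that group is much larger, and among the mechanisms shown in the table above, mechanism 8 has the most cases (186).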
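Because a group can have the highest ciTBI rate without contributing the most ciTBI cases, it can help to compute both quantities side by side. The sketch below does this for the `High_impact_InjSev` counts copied from the crosstab above; the column names `no_citbi`, `citbi`, `rate`, and `share_of_cases` are chosen here for illustration.

```python
import pandas as pd

# Counts copied from the High_impact_InjSev crosstab above
counts = pd.DataFrame(
    {"no_citbi": [7161, 29199, 6276], "citbi": [27, 403, 333]},
    index=pd.Index([1, 2, 3], name="High_impact_InjSev"),
)

# Rate: share of each severity group that has ciTBI
counts["rate"] = counts["citbi"] / (counts["no_citbi"] + counts["citbi"])
# Share: fraction of all ciTBI cases that fall in each severity group
counts["share_of_cases"] = counts["citbi"] / counts["citbi"].sum()

print(counts.round(4))
print("highest rate:", counts["rate"].idxmax())
print("most cases:", counts["citbi"].idxmax())
```

The rate answers "how risky is this group?", while the share answers "where do most ciTBI patients come from?"; a decision rule aimed at individual patients is mainly concerned with the former.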

## Third finding

From the post-condition features, can we accurately predict the outcome of ciTBIs? In other words, we want to see what feature sets can accurately discern the potential of having ciTBIs.

In [None]:
df_post

Unnamed: 0,Amnesia_verb,LOCSeparate,LocLen,Seiz,SeizOccur,SeizLen,Vomit,VomitNbr,SFxPalp,FontBulg,SFxBas,SFxBasHem
0,0.0,0.0,92.0,0.0,92.0,92.0,0.0,92.0,0.0,0.0,0.0,92
1,0.0,0.0,92.0,0.0,92.0,92.0,1.0,3.0,0.0,0.0,0.0,92
2,0.0,0.0,92.0,0.0,92.0,92.0,0.0,92.0,1.0,0.0,1.0,0
3,91.0,0.0,92.0,0.0,92.0,92.0,0.0,92.0,0.0,0.0,0.0,92
4,91.0,0.0,92.0,0.0,92.0,92.0,1.0,1.0,0.0,0.0,0.0,92
...,...,...,...,...,...,...,...,...,...,...,...,...
43394,0.0,0.0,92.0,0.0,92.0,92.0,0.0,92.0,0.0,0.0,0.0,92
43395,91.0,0.0,92.0,0.0,92.0,92.0,0.0,92.0,0.0,0.0,0.0,92
43396,0.0,0.0,92.0,0.0,92.0,92.0,0.0,92.0,0.0,0.0,0.0,92
43397,0.0,0.0,92.0,0.0,92.0,92.0,0.0,92.0,0.0,0.0,0.0,92


In [None]:
def post(df):
    plt.figure(figsize=(8, 6))
    sns.countplot(data=df, x='LOCSeparate', hue='PosIntFinal')
    plt.title("Loss of Consciousness (LOC) vs. ciTBI")
    plt.show()
    
    plt.figure(figsize=(8, 6))
    sns.countplot(data=df, x='Vomit', hue='PosIntFinal')
    plt.title("Vomiting vs. ciTBI")
    plt.show()
    
    plt.figure(figsize=(8, 6))
    sns.countplot(data=df, x='SFxPalp', hue='PosIntFinal')
    plt.title("Palpable Skull Fracture vs. ciTBI")
    plt.show()
    
    plt.figure(figsize=(8, 6))
    sns.countplot(data=df, x='Seiz', hue='PosIntFinal')
    plt.title("Post-Traumatic Seizure vs. ciTBI")
    plt.show()
    

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

def model(df):
    X = df[['LOCSeparate', 'Vomit', 'SFxPalp', 'Seiz']]
    y = df['PosIntFinal']
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    
    reg = LogisticRegression()
    reg.fit(X_train, y_train)
    y_pred = reg.predict(X_test)
    print(classification_report(y_test, y_pred))
    coefficients = pd.DataFrame(reg.coef_.flatten(), index=X.columns, columns=['Coefficient'])
    print(coefficients)

In [None]:
post(data_new)

<Figure size 2400x1800 with 1 Axes>

<Figure size 2400x1800 with 1 Axes>

<Figure size 2400x1800 with 1 Axes>

<Figure size 2400x1800 with 1 Axes>

In [None]:
model(data_new)

              precision    recall  f1-score   support

         0.0       0.98      1.00      0.99     12779
         1.0       0.20      0.00      0.01       241

    accuracy                           0.98     13020
   macro avg       0.59      0.50      0.50     13020
weighted avg       0.97      0.98      0.97     13020

             Coefficient
LOCSeparate     1.088662
Vomit           0.991379
SFxPalp         1.303787
Seiz            1.125431


## Reality Check

-   Do a reality check. What reality could you compare your cleaned data
    to?

-   Clearly state your assumptions and explain why this reality check is
    useful.

-   Does your cleaned data pass the reality check or are there issues?
    Discuss.

## Stability Check

Take one of your findings and present a perturbed version. How does this
affect your finding? Add a before and after plot here.

# Discussion

-   Did the data size restrict you in any way? Discuss some challenges
    that you faced as a result of the data size.

-   Address the three realms: data / reality, algorithms / models, and
    future data / reality.

-   Where do the parts of the lab fit into those three realms?

-   Do you think there is a one-to-one correspondence of the data and
    reality?

-   What about reality and data visualization?

# Conclusion

-   You should make attempts to connect your findings/analysis back to
    the domain problem in every section of this report, but here in the
    conclusion, you can reiterate your main points and provide
    overarching remarks on the PECARN data as it relates to the domain
    problem

# Academic honesty statement

Please address to Bin.

# Collaborators

# Bibliography