# Predicting Cognitive Decline in TBI Patients With Factors Such As Mental Illness and Drug Use

## Basic Info

Charlotte Riley-Vanwagoner

Amelia Le
                                       
Lonnie Schneider

## Background and Purpose

Traumatic brain injury (TBI) is a condition that affects approximately 2.8 million people each year in the United States. TBI is increasing across all populations, with a particularly concerning rise among people who play contact sports, military personnel (combat and non-combat blast injuries), and people involved in general traumatic incidents such as automobile accidents. Importantly, there are currently no treatments for chronic TBI, although large initiatives are targeting biomarkers and other potential mechanistic targets. A primary reason for the lack of treatment options is thought to be that the diagnosis and prognosis of TBI are often nebulous, with injury types and severities not clearly defined. Chronic traumatic encephalopathy (CTE) is a form of neurodegeneration that has become apparent in people who suffer repeated mild concussive injuries. Mild injuries such as concussions are often misdiagnosed or not addressed in a clinical setting at all. Repeated head injuries are understood to often lead to neurodegeneration later in life, as well as progressive associated comorbidities such as emotional problems, suicidality, and dementia. Large data sets have been amassed to identify prognostic biomarkers for TBI and to understand its pathological progression to dementia; however, they remain underutilized.

This project aims to analyze a subset of these data from a federally maintained database, the Federal Interagency Traumatic Brain Injury Research (FITBIR) Informatics System, to identify multivariate predictors such as mental illness and drug use that can aid in the prognosis of TBI and CTE. We hope this will lead to an increased understanding of the pathogenesis of CTE and to improved treatment accuracy. This study may also help clarify the timeline for the development of dementia and other comorbidities after injury.

https://fitbir.nih.gov/ 

We collected our data sets from the FITBIR site linked above and narrowed them down to imaging data, pediatric data, biomarker data, cognitive data, and neurodegeneration data.

## Ethical Considerations

There are some ethical considerations for this project. The first is that only authorized personnel are allowed to access data from the FITBIR website. Lonnie is an authorized user, so our team was able to work with the data through his access. Without that authorization, it would have been unethical for us to use the data.

In [1]:
import pandas as pd
import numpy as np

from sklearn import tree, svm, metrics
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
import seaborn as sns

# Data Wrangling

In [2]:
neuro_df = pd.read_csv("CombinedData20percent.csv")
neuro_df.head()

Unnamed: 0,GUID,dizziness,1,AUDTTScore,AcuteAssesmtsEvaluationDayNum,AcuteAssessmt PlateletTransfusionInd,AcuteAssessmtCryoprecipitateTransfusionInd,AcuteAssessmtEEGMonitoringStat,AcuteAssessmtFactor 7TransfusionInd,AcuteAssessmtFresh frozen plasma (FFP)TransfusionInd,...,preOpioid,preSedative,preStimulant,race,residentType,sex,surgeryAdverse,sweating,urinaryDiscomfort,vomiting
0,TBI_INVAB423TUR,No,Complete Scores--Valid,2.0,2.0,No,No,No or not stated,No,No,...,1.0,1.0,1.0,Black or African American,"Private home, apartment or condominium",Male,No,No,No,No
1,TBI_INVAB826GR0,No,Complete Scores--Valid,2.0,3.0,No,No,No or not stated,No,No,...,1.0,1.0,1.0,White,"Private home, apartment or condominium",Female,No,No,No,No
2,TBI_INVAC020WPA,No,Unable to finish,0.0,7.0,No,No,No or not stated,No,No,...,1.0,1.0,1.0,Asian,"Private home, apartment or condominium",Female,No,No,No,No
3,TBI_INVAC153CWH,No,Complete Scores--Valid,3.0,2.0,No,No,No or not stated,No,No,...,1.0,1.0,1.0,White,"Private home, apartment or condominium",Male,No,No,No,Yes
4,TBI_INVAC275CGJ,No,Complete Scores--Valid,2.0,5.0,No,No,No or not stated,No,No,...,1.0,1.0,1.0,Black or African American,"Private home, apartment or condominium",Male,Yes,No,No,No


In [3]:
neuro_df = neuro_df.select_dtypes(exclude=['object']) # Keep only numeric columns to reduce the number of features

In [4]:
print(neuro_df['StroopTestTimesStroopTestTmPartIVal'].isna().sum())
print(neuro_df['StroopTestStroopTestTmPartIIVal'].isna().sum())
# Count the NaN values in each Stroop column to choose a threshold for dropping sparse columns

1163
1180


In [5]:
# Drop Stroop bookkeeping columns that are not test scores
clean_df = neuro_df.drop(columns=['StroopTestAssmntCompltNum', 'StroopTestEvaluationDayNum', 'StroopTestPTADayDur',
                                  'StroopTestStudyDayAssmntCompltNum'])

In [6]:
# Drop any column with more NaNs than the sparser Stroop column (1180),
# then drop the rows that are still incomplete.
for name in clean_df.columns:
    num_na = clean_df[name].isna().sum()
    if num_na > 1180:
        clean_df = clean_df.drop(columns = name)
clean_df = clean_df.dropna()
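
The missing-value filter above can also be written without a loop. The sketch below is a minimal illustration on a made-up frame: the helper name `drop_sparse_columns`, the columns `a`, `b`, `c`, and the threshold of 2 are assumptions for the demo, not values from the FITBIR data. It keeps columns whose NaN count is at or below a threshold, then drops any rows that are still incomplete.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, max_missing):
    """Keep columns with at most `max_missing` NaNs, then drop incomplete rows."""
    na_counts = df.isna().sum()                       # NaN count per column
    keep = na_counts[na_counts <= max_missing].index  # columns that pass the threshold
    return df[keep].dropna()


# Toy data (hypothetical): column "c" is mostly missing.
toy_df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [1.0, 2.0, 3.0, 4.0],
    "c": [np.nan, np.nan, np.nan, 1.0],
})

toy_clean = drop_sparse_columns(toy_df, max_missing=2)
print(toy_clean.shape)
```

On real data the threshold trades off the number of features kept against the number of complete rows left after `dropna()`, so it may be worth printing the resulting shape for a few thresholds before settling on one.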

In [7]:
for name in clean_df.columns:
    print(name)

AUDTTScore
AcuteAssesmtsEvaluationDayNum
AdverseEventResolveTime
AdverseEventsCOBRITAdvrsEventCodeCOBRIT
AdverseEventsCOBRITAdvrsEventNum
AdverseEventsCOBRITSWIMSeqNumCOBRIT
AlchBldLvlMeasr
Alcohol
BSI18AnxScoreRaw
BSI18AnxScoreT
BSI18DeprScoreRaw
BSI18DeprScoreT
BSI18GSIScoreRaw
BSI18GSIScoreT
BSI18SomScoreRaw
BSI18SomScoreT
BldPressrDiastlMeasr
BldPressrSystMeasr
COBRITTempMeasr
COWATAdjustedTotalScore
COWATDaysSinceBaseline
COWATRawTotalScore
CTEpdurlLesnVolMeasr
CTIntraparenLesnVolMeasr
CTMidlineShiftMeasr
CTSbdrlLesnVolMeasr
Cannabis
Cocaine
DaysSinceBaseline
DigitSpanAssmntCompltNum
DigitSpanDigitSpnBckwrdScore
DigitSpanDigitSpnFwrdScr
DigitSpanDigitSpnRawScr
DigitSpanDigitSpnScldScr
DigitSpanEvaluationDayNum
DigitSpanPTADayDur
GCSDay
GCSLeftEyeMeasure
GCSMotor
GCSRightEyeMeas
GCSTotalScore
GOATAmnesia
GOATCrntTimeScore
GOATDayMnthDateScore
GOATDayScore
GOATDetailScore
GOATErrorSumVal
GOATFirstEvntScore
GOATLastEvntScore
GOATMnthScore
GOATModeTranspScore
GOATPreInjuEventDetailSco

In [8]:
# Keep only the columns that may predict cognitive decline.
# The selection follows team member Dr. Schneider's suggestions.
fully_wrangled_df = clean_df.filter(['AdverseEventsCOBRITAdvrsEventNum', 'BSI18AnxScoreT', 'BSI18DeprScoreT', 'BSI18SomScoreT', 'Cannabis',
                                     'Cocaine', 'Hallucinogen', 'Opioid', 'Sedative', 'Stimulant', 'StroopTestStroopTestTmPartIIVal',
                                     'StroopTestTimesStroopTestTmPartIVal', 'WgtMeasr', 'age'])

## Dataframe Basic Info

In [9]:
fully_wrangled_df.shape

(73, 14)

In [10]:
print(fully_wrangled_df['StroopTestTimesStroopTestTmPartIVal'].describe())
print()
print(fully_wrangled_df['StroopTestStroopTestTmPartIIVal'].describe())

count     73.000000
mean      53.219178
std       18.987018
min       26.000000
25%       42.000000
50%       49.000000
75%       59.000000
max      149.000000
Name: StroopTestTimesStroopTestTmPartIVal, dtype: float64

count     73.000000
mean     119.219178
std       46.196815
min       68.000000
25%       91.000000
50%      107.000000
75%      131.000000
max      300.000000
Name: StroopTestStroopTestTmPartIIVal, dtype: float64


## Visualizations [Redacted]

Redacted because it is not my work.

## Building the Models

In [11]:
predictor_df = fully_wrangled_df.drop(columns=['StroopTestStroopTestTmPartIIVal', 'StroopTestTimesStroopTestTmPartIVal']) # Drop the two Stroop outcome columns so only predictors remain
X = predictor_df.values
print("Predictor Variables:")
print(X[:10])

Predictor Variables:
[[  2.   65.   61.   48.    1.    2.    1.    2.    1.    1.   90.   27. ]
 [  1.   49.   44.   42.    1.    1.    1.    1.    1.    1.  141.   60. ]
 [  3.   38.   42.   49.    4.    1.    1.    1.    1.    1.   90.   53. ]
 [  1.   39.   45.   48.    1.    1.    1.    1.    1.    1.   74.   18. ]
 [  1.   60.   65.   66.    1.    1.    1.    1.    1.    1.  110.6  48. ]
 [  6.    7.   71.   78.    4.    3.    2.    3.    3.    3.   77.   40. ]
 [  6.   49.   44.   59.    1.    1.    1.    1.    1.    1.   80.   70. ]
 [  1.   80.   80.   80.    2.    1.    1.    1.    1.    1.   71.4  21. ]
 [  3.   49.   44.   56.    1.    1.    1.    1.    2.    1.   98.   49. ]
 [  1.   48.   67.   56.    1.    1.    1.    1.    1.    1.   86.4  26. ]]


In [12]:
# PART 1
part1_df = fully_wrangled_df['StroopTestTimesStroopTestTmPartIVal']
part1_scores = part1_df.values
print("Part 1 Scores Dateframe:", part1_scores)
print()

# 1 = cognitive issue, 0 = no cognitive issue
y = [] # Initializing list
for score in part1_scores:
    if score > np.mean(part1_df): # The average score for Part 1 is 53.2
        y.append(1) # Slower than average: label as a cognitive deficit
    else:
        y.append(0) # At or below average: label as no cognitive deficit
print("Binary numpy array for Part 1 Scores:")
print(y[:10])

print()
print()
print()

# PART 2
part2_df = fully_wrangled_df['StroopTestStroopTestTmPartIIVal']
part2_scores = part2_df.values
print("Part2 Scores Dateframe:", part2_scores)
print()

# 1 = cognitive issue, 0 = no cognitive issue
y2 = [] # Initializing list
for score in part2_scores:
    if score > np.mean(part2_df): # The average score for Part 2 is 119.2
        y2.append(1) # Slower than average: label as a cognitive deficit
    else:
        y2.append(0) # At or below average: label as no cognitive deficit
print("Binary numpy array for Part 2 Scores:")
print(y2[:10])

Part 1 Scores Dateframe: [ 71.  41.  55.  49.  82.  49.  63.  42.  48.  46. 118.  65.  63.  71.
  64.  50.  63.  36.  45.  59. 149.  57.  46.  47.  75.  51.  51.  39.
  58.  58.  42.  37.  36.  36.  40.  37.  42.  67.  75.  40.  33.  48.
  49.  41.  34.  55.  40.  34.  55.  26.  50.  47.  54.  60.  43.  45.
  76.  45.  42.  41.  32.  74.  36.  43.  56.  47.  52.  52.  80.  85.
  43.  52.  52.]

Binary numpy array for Part 1 Scores:
[1, 0, 1, 0, 1, 0, 1, 0, 0, 0]



Part2 Scores Dateframe: [191. 111. 120. 128.  94. 120. 201.  87.  88. 112. 218. 221.  99. 135.
 131. 124.  91.  96.  93. 145. 180. 300.  92.  86. 113. 113. 140. 112.
 111. 141.  75.  91.  80. 103.  95.  76.  86.  98. 175. 109.  77.  96.
  84. 131.  96.  99.  81.  68. 179. 100.  75. 113. 155. 157. 100. 108.
 129.  80. 176. 131.  76.  79. 102. 107.  86. 112.  99.  83. 300. 135.
  91. 126.  91.]

Binary numpy array for Part 2 Scores:
[1, 0, 1, 1, 0, 1, 1, 0, 0, 0]
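
The labeling rule above (1 when a completion time is above the sample mean, 0 otherwise) can be expressed as a single vectorized comparison. The sketch below uses only the first ten Part 1 times printed above and hard-codes the reported mean as the threshold; the names `part1_sample`, `threshold`, and `labels` are illustrative and not part of the notebook.

```python
import numpy as np

# First ten Part 1 times from the printed scores; the threshold is the
# reported sample mean (about 53.2), hard-coded here for illustration only.
part1_sample = np.array([71., 41., 55., 49., 82., 49., 63., 42., 48., 46.])
threshold = 53.219178

# Boolean comparison cast to 0/1: 1 = slower than average, 0 = otherwise.
labels = (part1_sample > threshold).astype(int)
print(labels.tolist())  # [1, 0, 1, 0, 1, 0, 1, 0, 0, 0]
```

This reproduces the first ten Part 1 labels shown in the output. A mean split is only one possible cutoff; because the times are right-skewed (the maximum is far above the 75th percentile), a median or a clinically motivated threshold could label patients differently.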


## K-nearest Neighbors

In [13]:
print("Part 1 Results for kNN Model")
X_train_neigh, X_test_neigh, y_train_neigh, y_test_neigh = train_test_split(X, y)
kNN_classifier = KNeighborsClassifier(n_neighbors = 3)
kNN_classifier.fit(X_train_neigh, y_train_neigh)
kNN_prediction = kNN_classifier.predict(X_test_neigh)
print(metrics.confusion_matrix(y_true = y_test_neigh, y_pred = kNN_prediction))
print("Accuracy:", metrics.accuracy_score(y_true = y_test_neigh, y_pred = kNN_prediction))

# Testing for best accuracy based off of k-value
bestk = None
largest_average = 0
for rn in np.random.randint(1,10,10):
    kNN_classifier2 = KNeighborsClassifier(n_neighbors = rn)
    kNN_scores2 = cross_val_score(kNN_classifier2, X_train_neigh, y_train_neigh, cv = 3, scoring = 'accuracy')
    mean_scores2 = np.mean(kNN_scores2)
    if mean_scores2 >= largest_average:
        largest_average = mean_scores2
        bestk = rn
#print(np.mean(best_k_scores))
print("Best Accuracy:", largest_average, "with a k value of", bestk)

print()
print()
print()

print("Part 2 Results for kNN Model")
X_train_neigh, X_test_neigh, y_train_neigh, y_test_neigh = train_test_split(X, y2)
kNN_classifier = KNeighborsClassifier(n_neighbors = 3)
kNN_classifier.fit(X_train_neigh, y_train_neigh)
kNN_prediction = kNN_classifier.predict(X_test_neigh)
print(metrics.confusion_matrix(y_true = y_test_neigh, y_pred = kNN_prediction))
print("Accuracy:", metrics.accuracy_score(y_true = y_test_neigh, y_pred = kNN_prediction))

# Testing for best accuracy based off of k-value
bestk = None
largest_average = 0
for rn in np.random.randint(1,10,10):
    kNN_classifier2 = KNeighborsClassifier(n_neighbors = rn)
    kNN_scores2 = cross_val_score(kNN_classifier2, X_train_neigh, y_train_neigh, cv = 3, scoring = 'accuracy')
    mean_scores2 = np.mean(kNN_scores2)
    if mean_scores2 >= largest_average:
        largest_average = mean_scores2
        bestk = rn
#print(np.mean(best_k_scores))
print("Best Accuracy:", largest_average, "with a k value of", bestk)

Part 1 Results for kNN Model
[[6 5]
 [6 2]]
Accuracy: 0.42105263157894735
Best Accuracy: 0.6658640064212821 with a k value of 9



Part 2 Results for kNN Model
[[11  2]
 [ 3  3]]
Accuracy: 0.7368421052631579
Best Accuracy: 0.6855865153078775 with a k value of 7
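
Because kNN relies on distances, features on large scales (such as weight and age here) can dominate the smaller drug-use codes, and sampling k at random may skip some values. The sketch below is one possible alternative, not the notebook's method: it standardizes the features and tries every k from 1 to 9 with scikit-learn's `GridSearchCV`. The data are synthetic stand-ins (`X_demo`, `y_demo` are assumptions), sized like the 73-row, 12-feature matrix above.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 73 rows, 12 numeric features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(73, 12))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=73) > 0).astype(int)

# A fixed random_state makes the split repeatable; stratify keeps class balance.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

# Scale inside the pipeline so each CV fold is standardized on its own training part.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
param_grid = {"kneighborsclassifier__n_neighbors": list(range(1, 10))}
search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(X_tr, y_tr)

best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
test_accuracy = search.score(X_te, y_te)  # refit best model, scored on held-out rows
print("best k:", best_k)
print("cv accuracy:", search.best_score_)
print("test accuracy:", test_accuracy)
```

Reporting the held-out accuracy of the tuned model, rather than only the cross-validated training score, keeps the tuning and the final estimate separate.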


## Decision Tree

In [14]:
print("Part 1 Results for Decision Tree Model")
X_train_tree, X_test_tree, y_train_tree, y_test_tree = train_test_split(X, y)
tree_classifier = tree.DecisionTreeClassifier()
tree_classifier.fit(X_train_tree, y_train_tree)
tree_prediction = tree_classifier.predict(X_test_tree)
print(metrics.confusion_matrix(y_true = y_test_tree, y_pred = tree_prediction))
print("Accuracy:", metrics.accuracy_score(y_true = y_test_tree, y_pred = tree_prediction))

bestdepth = None
bestminsample = None
largest_average = 0
for _ in range(10):
    rn1 = np.random.randint(1, 10)
    rn2 = np.random.randint(2, 10)
    tree_classifier = tree.DecisionTreeClassifier(max_depth = rn1, min_samples_split = rn2)
    tree_scores = cross_val_score(tree_classifier, X_train_tree, y_train_tree, cv = 3, scoring = 'accuracy')
    mean_scores = np.mean(tree_scores)
    if mean_scores >= largest_average:
        largest_average = mean_scores
        bestdepth = rn1
        bestminsample = rn2
print("Best Accuracy:", largest_average, "with a depth value of", bestdepth, "and a min sample of", bestminsample)

print()
print()
print()

print("Part 2 Results for Decision Tree Model")
X_train_tree, X_test_tree, y_train_tree, y_test_tree = train_test_split(X, y2)
tree_classifier = tree.DecisionTreeClassifier()
tree_classifier.fit(X_train_tree, y_train_tree)
tree_prediction = tree_classifier.predict(X_test_tree)
print(metrics.confusion_matrix(y_true = y_test_tree, y_pred = tree_prediction))
print("Accuracy:", metrics.accuracy_score(y_true = y_test_tree, y_pred = tree_prediction))

bestdepth = None
bestminsample = None
largest_average = 0
for _ in range(10):
    rn1 = np.random.randint(1, 10)
    rn2 = np.random.randint(2, 10)
    tree_classifier = tree.DecisionTreeClassifier(max_depth = rn1, min_samples_split = rn2)
    tree_scores = cross_val_score(tree_classifier, X_train_tree, y_train_tree, cv = 3, scoring = 'accuracy')
    mean_scores = np.mean(tree_scores)
    if mean_scores >= largest_average:
        largest_average = mean_scores
        bestdepth = rn1
        bestminsample = rn2

print("Best Accuracy:", largest_average, "with a depth value of", bestdepth, "and a min sample of", bestminsample)

Part 1 Results for Decision Tree Model
[[6 5]
 [5 3]]
Accuracy: 0.47368421052631576
Best Accuracy: 0.4648549478270841 with a depth value of 2 and a min sample of 6



Part 2 Results for Decision Tree Model
[[11  4]
 [ 3  1]]
Accuracy: 0.631578947368421
Best Accuracy: 0.5195505102625845 with a depth value of 6 and a min sample of 9
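
The search over `max_depth` and `min_samples_split` can also be run exhaustively, since the grid is small (9 depths by 8 split sizes). The sketch below is a hedged illustration on synthetic data (`X_demo` and `y_demo` are assumptions, not FITBIR values); it compares each candidate's own cross-validated mean against the running best.

```python
import itertools

import numpy as np
from sklearn import tree
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a training split (assumption): 54 rows, 12 features.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(54, 12))
y_demo = (X_demo[:, 1] > 0).astype(int)

best_score = -np.inf
best_params = None
for depth, min_split in itertools.product(range(1, 10), range(2, 10)):
    clf = tree.DecisionTreeClassifier(
        max_depth=depth, min_samples_split=min_split, random_state=0
    )
    # Mean accuracy over 3 folds for this particular candidate.
    score = cross_val_score(clf, X_demo, y_demo, cv=3, scoring="accuracy").mean()
    if score > best_score:
        best_score = score
        best_params = (depth, min_split)

print("best (max_depth, min_samples_split):", best_params)
print("best cv accuracy:", best_score)
```

Using a strict `>` keeps the first (simplest) setting when several candidates tie, which is a design choice rather than a requirement.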


## SVM

In [15]:
print("Part 1 Results for SVM Model")
Xtrain_svm, Xtest_svm, ytrain_svm, ytest_svm = train_test_split(X, y)
SVM_classifier = svm.SVC(kernel='rbf', C= 1, gamma='scale')
SVM_classifier.fit(Xtrain_svm, ytrain_svm)
SVM_prediction = SVM_classifier.predict(Xtest_svm)
print(metrics.confusion_matrix(y_true = ytest_svm, y_pred = SVM_prediction))
print('Accuracy = ', metrics.accuracy_score(y_true = ytest_svm, y_pred = SVM_prediction))

bestC = None
best_average = 0
for rn in np.random.uniform(50,100,10):
    SVM_classifier2 = svm.SVC(kernel='rbf', C = rn, gamma='scale')
    SVM_scores2 = cross_val_score(SVM_classifier2, Xtrain_svm, ytrain_svm, cv = 3, scoring = 'accuracy')
    mean_scores = np.mean(SVM_scores2)
    if mean_scores >= best_average:
        best_average = mean_scores
        bestC = rn 

print("Accuracy with Best C:", best_average)
print("Best C:", bestC)

print()
print()
print()

print("Part 2 Results for SVM Model")
Xtrain_svm, Xtest_svm, ytrain_svm, ytest_svm = train_test_split(X, y2)
SVM_classifier = svm.SVC(kernel='rbf', C= 1, gamma='scale')
SVM_classifier.fit(Xtrain_svm, ytrain_svm)
SVM_prediction = SVM_classifier.predict(Xtest_svm)
print(metrics.confusion_matrix(y_true = ytest_svm, y_pred = SVM_prediction))
print('Accuracy = ', metrics.accuracy_score(y_true = ytest_svm, y_pred = SVM_prediction))

bestC = None
best_average = 0
for rn in np.random.uniform(50,100,10):
    SVM_classifier2 = svm.SVC(kernel='rbf', C = rn, gamma='scale')
    SVM_scores2 = cross_val_score(SVM_classifier2, Xtrain_svm, ytrain_svm, cv = 3, scoring = 'accuracy')
    mean_scores = np.mean(SVM_scores2)
    if mean_scores >= best_average:
        best_average = mean_scores
        bestC = rn 

print("Accuracy with Best C:", best_average)
print("Best C:", bestC)

Part 1 Results for SVM Model
[[12  0]
 [ 7  0]]
Accuracy =  0.631578947368421
Accuracy with Best C: 0.5546382295608302
Best C: 91.76069901828794



Part 2 Results for SVM Model
[[12  1]
 [ 5  1]]
Accuracy =  0.6842105263157895
Accuracy with Best C: 0.6351335855979819
Best C: 79.54337006928752
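
The Part 1 SVM confusion matrix above is worth unpacking by hand: the second column is all zeros, so the model predicted class 0 for all 19 test patients. The short calculation below (variable names are illustrative) shows that its accuracy equals what a rule that always predicts the larger class would score on this split, and that its recall for the cognitive-issue class is zero.

```python
import numpy as np

# Part 1 SVM confusion matrix from the output above.
# Rows are the true class (0, 1); columns are the predicted class (0, 1).
cm = np.array([[12, 0],
               [7, 0]])

tn, fp, fn, tp = cm.ravel()
accuracy = (tp + tn) / cm.sum()
recall = tp / (tp + fn)                               # share of true class-1 cases found
majority_baseline = cm.sum(axis=1).max() / cm.sum()   # always predict the larger class

print(round(accuracy, 4))           # 0.6316
print(recall)                       # 0.0
print(round(majority_baseline, 4))  # 0.6316
```

Accuracy alone can therefore hide a model that never flags a cognitive issue, which is why recall or a baseline comparison is a useful companion metric on small, imbalanced test sets.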


## Gaussian Naive Bayes

In [16]:
# The model is fit and scored on the full data set, so these are training accuracies
gauss_model = GaussianNB()
gauss_model.fit(X, y)
gauss_model.predict(X)

print("Accuracy of Part 1 Scores:", gauss_model.score(X, y))

gauss_model = GaussianNB()
gauss_model.fit(X, y2)
gauss_model.predict(X)
print("Accuracy of Part 2 Scores:", gauss_model.score(X, y2))

Accuracy of Part 1 Scores: 0.5068493150684932
Accuracy of Part 2 Scores: 0.684931506849315
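
The two accuracies above come from scoring the model on the same rows it was fit on. For comparison with the held-out figures reported for the other models, a sketch along the following lines could be used. It is an illustration on synthetic stand-in data (`X_demo`, `y_demo` are assumptions), showing training accuracy, held-out accuracy, and a 5-fold cross-validated estimate side by side.

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (assumption): 73 rows, 12 numeric features.
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(73, 12))
y_demo = (X_demo[:, 0] + rng.normal(size=73) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

model = GaussianNB().fit(X_tr, y_tr)   # fit on the training rows only
train_acc = model.score(X_tr, y_tr)    # accuracy on rows the model has seen
test_acc = model.score(X_te, y_te)     # accuracy on the 19 held-out rows

# Cross-validation averages over several splits, which is steadier on small data.
cv_acc = cross_val_score(GaussianNB(), X_demo, y_demo, cv=5, scoring="accuracy").mean()

print("train:", train_acc)
print("test:", test_acc)
print("5-fold cv:", cv_acc)
```

With only 73 patients, a single 19-row test split can swing by several points from one run to the next, so the cross-validated mean is usually the more stable number to compare across models.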


## Logistic Regression

In [17]:
# Fit and score on the full Part 1 data, so this is a training accuracy
logistic_model = LogisticRegression(random_state=0).fit(X, y)
logistic_model.predict(X)
logistic_model.score(X,y)



0.6986301369863014

## Analyzing Results

The highest accuracy came from the k-nearest neighbors model, which reached about 0.74 on the held-out split for the Part 2 scores. One possible reason is that its k parameter was tuned, although the 0.74 figure itself comes from a single random split of 19 test patients with k = 3 and may not be stable. Overall, this suggests that the features in our model have at most modest value for predicting cognitive decline in TBI patients.

In the future, the model needs more observations to have substantial predictive value. The dataset our team worked with had so many NaN values that only 73 complete rows were left for modeling. Next time, our team can look for larger datasets with more complete values so that we have enough data to work with.

## What We Learned

Our results suggest that the presence of mental illness (anxiety, depression) and drug use may be associated with a higher chance of a TBI patient experiencing cognitive decline after their injury, although the modest accuracies above mean this link is tentative.

## Citations

Scarpina, F., & Tagini, S. (2017). The Stroop Color and Word Test. Frontiers in Psychology, 8, 557. https://doi.org/10.3389/fpsyg.2017.00557

https://stackoverflow.com/questions/26266362/how-to-count-the-nan-values-in-a-column-in-pandas-dataframe

All data was retrieved from https://fitbir.nih.gov/