# Unsupervised Wisdom: Explore Medical Narratives on Older Adult Falls

## Project Outline

1. Problem Description
2. Introduction to Approach Used
3. Data Cleaning and Preprocessing
4. Exploratory Data Analysis and Data Visualization
5. Model Building
6. Final Inference

---

## 3. Data Cleaning and Preprocessing

##### Import required modules

In [1]:
import numpy as np
import pandas as pd

import json

import re

##### Load the data

In [2]:
data = pd.read_csv("data/primary_data.csv")

#Display an overview of the data
data.head(3)

Unnamed: 0,cpsc_case_number,narrative,treatment_date,age,sex,race,other_race,hispanic,diagnosis,other_diagnosis,...,body_part,body_part_2,disposition,location,fire_involvement,alcohol,drug,product_1,product_2,product_3
0,190103269,94YOM FELL TO THE FLOOR AT THE NURSING HOME ON...,2019-01-01,94,1,0,,0,62,,...,75,,4,5,0,0,0,1807,0,0
1,190103270,86YOM FELL IN THE SHOWER AT HOME AND SUSTAINED...,2019-01-01,86,1,0,,0,62,,...,75,,4,1,0,0,0,611,0,0
2,190103273,87YOF WAS GETTING UP FROM THE COUCH AND FELL T...,2019-01-01,87,2,0,,0,53,,...,32,,4,1,0,0,0,679,1807,0


In [3]:
rows, columns = data.shape
print(f'The data contains {rows:,} rows and {columns} columns')

The data contains 115,128 rows and 22 columns


##### Check for null values

In [4]:
#Only display columns with null values
null = data.isnull().sum()

null[null != 0]

other_race           114106
other_diagnosis      112606
diagnosis_2           71983
other_diagnosis_2    110150
body_part_2           71983
dtype: int64

Null values are present, but only in optional fields (secondary or free-text columns such as other_race and diagnosis_2), so they can be left as they are.

##### Check for duplicates

In [5]:
data.duplicated().sum()

0

This shows that there are no fully duplicated rows, so each recorded fall event appears only once.   
A person can have more than one record if they had several falls, but each fall receives its own case number, so every recorded fall remains unique.
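
Since the uniqueness argument rests on the case number, it can be checked directly as well. The following is a minimal sketch on a toy frame with made-up ages; on the real data the same calls would be applied to `data`, assuming `cpsc_case_number` is intended as the record identifier.

```python
import pandas as pd

# Toy records standing in for the real data (values are illustrative only)
toy = pd.DataFrame({
    "cpsc_case_number": [190103269, 190103270, 190103273],
    "age": [94, 86, 87],
})

# Rows that are identical across every column
full_row_duplicates = int(toy.duplicated().sum())

# True when no case number appears more than once
case_numbers_unique = bool(toy["cpsc_case_number"].is_unique)

# Any rows that share a case number (empty when the identifier is unique)
repeated_case_numbers = toy.loc[toy["cpsc_case_number"].duplicated(keep=False)]

print(full_row_duplicates, case_numbers_unique, len(repeated_case_numbers))
```

Checking the identifier column separately catches records that share a case number but differ in some other field, which `duplicated()` on whole rows would miss.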

##### Variable Mapping

The data shows that some columns, such as sex and race, hold numeric codes that can be decoded into their respective strings.   
The data is redisplayed below for easier reference.

In [6]:
data.head(3)

Unnamed: 0,cpsc_case_number,narrative,treatment_date,age,sex,race,other_race,hispanic,diagnosis,other_diagnosis,...,body_part,body_part_2,disposition,location,fire_involvement,alcohol,drug,product_1,product_2,product_3
0,190103269,94YOM FELL TO THE FLOOR AT THE NURSING HOME ON...,2019-01-01,94,1,0,,0,62,,...,75,,4,5,0,0,0,1807,0,0
1,190103270,86YOM FELL IN THE SHOWER AT HOME AND SUSTAINED...,2019-01-01,86,1,0,,0,62,,...,75,,4,1,0,0,0,611,0,0
2,190103273,87YOF WAS GETTING UP FROM THE COUCH AND FELL T...,2019-01-01,87,2,0,,0,53,,...,32,,4,1,0,0,0,679,1807,0


In [7]:
#Load the variable mapping file
with open('data/variable_mapping.json', 'r') as f:
    mapping = json.load(f)

In [8]:
#Overview of the variable mapping file
mapping

{'sex': {'0': 'UNKNOWN', '1': 'MALE', '2': 'FEMALE', '3': 'NON-BINARY/OTHER'},
 'race': {'0': 'N.S.',
  '1': 'WHITE',
  '2': 'BLACK/AFRICAN AMERICAN',
  '3': 'OTHER',
  '4': 'ASIAN',
  '5': 'AMERICAN INDIAN/ALASKA NATIVE',
  '6': 'NATIVE HAWAIIAN/PACIFIC ISLANDER'},
 'hispanic': {'0': 'Unk/Not stated', '1': 'Yes', '2': 'No'},
 'alcohol': {'0': 'No/Unk', '1': 'Yes'},
 'drug': {'0': 'No/Unk', '1': 'Yes'},
 'body_part': {'0': '0 - INTERNAL',
  '30': '30 - SHOULDER',
  '31': '31 - UPPER TRUNK',
  '32': '32 - ELBOW',
  '33': '33 - LOWER ARM',
  '34': '34 - WRIST',
  '35': '35 - KNEE',
  '36': '36 - LOWER LEG',
  '37': '37 - ANKLE',
  '38': '38 - PUBIC REGION',
  '75': '75 - HEAD',
  '76': '76 - FACE',
  '77': '77 - EYEBALL',
  '78': '78 - UPPER TRUNK(OLD)',
  '79': '79 - LOWER TRUNK',
  '80': '80 - UPPER ARM',
  '81': '81 - UPPER LEG',
  '82': '82 - HAND',
  '83': '83 - FOOT',
  '84': '84 - 25-50% OF BODY',
  '85': '85 - ALL PARTS BODY',
  '86': '86 - OTHER(OLD)',
  '87': '87 - NOT STATED/U

In [9]:
#Convert the encoded values in the mapping to integers since they get read in as strings
#such as '0' to 0
for c in mapping.keys():
    mapping[c] = {int(k): v for k, v in mapping[c].items()}

In [10]:
decoded_data = data.copy()

#Map the keys to the new dataframe
for col in mapping.keys():
    decoded_data[col] = decoded_data[col].map(mapping[col])
    
#Ensure mappings were applied correctly by checking that the number of missing values did not change
assert (decoded_data.isnull().sum() == data.isnull().sum()).all()
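
The assertion above works because `Series.map` returns NaN for any code that is missing from the mapping. A minimal sketch of that behaviour, using the `sex` mapping shown earlier and a made-up code `9` that is deliberately absent from it:

```python
import pandas as pd

# The 'sex' mapping from the variable mapping file, with integer keys
sex_map = {0: "UNKNOWN", 1: "MALE", 2: "FEMALE", 3: "NON-BINARY/OTHER"}

# Hypothetical codes; 9 does not exist in the mapping
codes = pd.Series([1, 2, 9])

decoded = codes.map(sex_map)

# Codes that were present before mapping but became NaN afterwards
unmapped = codes[decoded.isna() & codes.notna()]

print(decoded.tolist())
print(unmapped.tolist())
```

Listing the unmapped codes this way also tells you which values to add to the mapping, whereas the assertion only reports that something went wrong.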

In [11]:
#An overview of the new dataset with decoded values
decoded_data.head(3)

Unnamed: 0,cpsc_case_number,narrative,treatment_date,age,sex,race,other_race,hispanic,diagnosis,other_diagnosis,...,body_part,body_part_2,disposition,location,fire_involvement,alcohol,drug,product_1,product_2,product_3
0,190103269,94YOM FELL TO THE FLOOR AT THE NURSING HOME ON...,2019-01-01,94,MALE,N.S.,,Unk/Not stated,62 - INTERNAL INJURY,,...,75 - HEAD,,4 - TREATED AND ADMITTED/HOSPITALIZED,PUBLIC,NO/?,No/Unk,No/Unk,1807 - FLOORS OR FLOORING MATERIALS,0 - None,0 - None
1,190103270,86YOM FELL IN THE SHOWER AT HOME AND SUSTAINED...,2019-01-01,86,MALE,N.S.,,Unk/Not stated,62 - INTERNAL INJURY,,...,75 - HEAD,,4 - TREATED AND ADMITTED/HOSPITALIZED,HOME,NO/?,No/Unk,No/Unk,611 - BATHTUBS OR SHOWERS,0 - None,0 - None
2,190103273,87YOF WAS GETTING UP FROM THE COUCH AND FELL T...,2019-01-01,87,FEMALE,N.S.,,Unk/Not stated,"53 - CONTUSIONS, ABR.",,...,32 - ELBOW,,4 - TREATED AND ADMITTED/HOSPITALIZED,HOME,NO/?,No/Unk,No/Unk,"679 - SOFAS, COUCHES, DAVENPORTS, DIVANS OR ST...",1807 - FLOORS OR FLOORING MATERIALS,0 - None


##### Data Preprocessing 1

In the decoded data above, the narrative column repeats the age and sex, which already exist in their own columns.    
An example is "94YOM FELL...", where "94YOM" stands for a 94-year-old male.

This redundancy is removed by stripping any leading pattern similar to "94YOM " from each narrative.
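
As a quick illustration of what the pattern matches, the sketch below applies the same regular expression with Python's `re` module to a few hypothetical narratives (shortened, made-up strings in the style of the data).

```python
import re

# Leading digits, optional whitespace, "YO", one letter for the sex, then a space
AGE_SEX_PREFIX = r"^\d+\s*YO\w\s"

samples = [
    "94YOM FELL TO THE FLOOR AT THE NURSING HOME",
    "87 YOF WAS GETTING UP FROM THE COUCH",
    "FELL IN THE SHOWER",  # no prefix, so it should stay unchanged
]

stripped = [re.sub(AGE_SEX_PREFIX, "", s) for s in samples]

for before, after in zip(samples, stripped):
    print(f"{before!r} -> {after!r}")
```

Because the pattern is anchored with `^`, an age and sex token that appears later in a narrative is left in place.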

In [12]:
#Assign the result back instead of using inplace=True on a column selection
decoded_data['narrative'] = decoded_data['narrative'].replace(r'^\d+\s*YO\w\s', '', regex=True)

In [13]:
#An overview of the dataset after recent changes
decoded_data.head(3)

Unnamed: 0,cpsc_case_number,narrative,treatment_date,age,sex,race,other_race,hispanic,diagnosis,other_diagnosis,...,body_part,body_part_2,disposition,location,fire_involvement,alcohol,drug,product_1,product_2,product_3
0,190103269,FELL TO THE FLOOR AT THE NURSING HOME ONTO BAC...,2019-01-01,94,MALE,N.S.,,Unk/Not stated,62 - INTERNAL INJURY,,...,75 - HEAD,,4 - TREATED AND ADMITTED/HOSPITALIZED,PUBLIC,NO/?,No/Unk,No/Unk,1807 - FLOORS OR FLOORING MATERIALS,0 - None,0 - None
1,190103270,FELL IN THE SHOWER AT HOME AND SUSTAINED A CLO...,2019-01-01,86,MALE,N.S.,,Unk/Not stated,62 - INTERNAL INJURY,,...,75 - HEAD,,4 - TREATED AND ADMITTED/HOSPITALIZED,HOME,NO/?,No/Unk,No/Unk,611 - BATHTUBS OR SHOWERS,0 - None,0 - None
2,190103273,WAS GETTING UP FROM THE COUCH AND FELL TO THE ...,2019-01-01,87,FEMALE,N.S.,,Unk/Not stated,"53 - CONTUSIONS, ABR.",,...,32 - ELBOW,,4 - TREATED AND ADMITTED/HOSPITALIZED,HOME,NO/?,No/Unk,No/Unk,"679 - SOFAS, COUCHES, DAVENPORTS, DIVANS OR ST...",1807 - FLOORS OR FLOORING MATERIALS,0 - None


##### Data Preprocessing 2

In the data displayed above, some columns still carry their numeric code as a prefix of the decoded value.   
An example is "62 - INTERNAL INJURY" in the diagnosis column.

Using the first row as a reference, the columns with this prefix can be identified.

In [14]:
first = decoded_data.iloc[0]

#Only display columns that start with a pattern similar to "62 - INTERNAL INJURY"
encoded_still_present = first[first.str.match(r'^\d+\s\-\s\w+', flags = re.IGNORECASE) == True]

encoded_still_present

diagnosis                       62 - INTERNAL INJURY
body_part                                  75 - HEAD
disposition    4 - TREATED AND ADMITTED/HOSPITALIZED
product_1        1807 - FLOORS OR FLOORING MATERIALS
product_2                                   0 - None
product_3                                   0 - None
Name: 0, dtype: object

The same pattern can be used to strip the leading digits and the " - " separator from these columns, leaving only the decoded value.
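
A small sketch of the prefix removal on individual strings, using Python's `re` module. The first two values are taken from the output above; "HOME" is included to show that values without a code prefix are untouched.

```python
import re

# Leading digits followed by " - "
CODE_PREFIX = r"^\d+\s\-\s"

values = [
    "62 - INTERNAL INJURY",
    "1807 - FLOORS OR FLOORING MATERIALS",
    "HOME",
]

cleaned_values = [re.sub(CODE_PREFIX, "", v) for v in values]
print(cleaned_values)
```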


In [15]:
#Remove the redundancies for the observed columns
#(the result is assigned back instead of using inplace=True on a column selection)
for column in encoded_still_present.index:
    decoded_data[column] = decoded_data[column].replace(r'^\d+\s\-\s', '', regex=True)

In [16]:
#An overview of the dataset after changes
decoded_data.head(3)

Unnamed: 0,cpsc_case_number,narrative,treatment_date,age,sex,race,other_race,hispanic,diagnosis,other_diagnosis,...,body_part,body_part_2,disposition,location,fire_involvement,alcohol,drug,product_1,product_2,product_3
0,190103269,FELL TO THE FLOOR AT THE NURSING HOME ONTO BAC...,2019-01-01,94,MALE,N.S.,,Unk/Not stated,INTERNAL INJURY,,...,HEAD,,TREATED AND ADMITTED/HOSPITALIZED,PUBLIC,NO/?,No/Unk,No/Unk,FLOORS OR FLOORING MATERIALS,,
1,190103270,FELL IN THE SHOWER AT HOME AND SUSTAINED A CLO...,2019-01-01,86,MALE,N.S.,,Unk/Not stated,INTERNAL INJURY,,...,HEAD,,TREATED AND ADMITTED/HOSPITALIZED,HOME,NO/?,No/Unk,No/Unk,BATHTUBS OR SHOWERS,,
2,190103273,WAS GETTING UP FROM THE COUCH AND FELL TO THE ...,2019-01-01,87,FEMALE,N.S.,,Unk/Not stated,"CONTUSIONS, ABR.",,...,ELBOW,,TREATED AND ADMITTED/HOSPITALIZED,HOME,NO/?,No/Unk,No/Unk,"SOFAS, COUCHES, DAVENPORTS, DIVANS OR STUDIO C...",FLOORS OR FLOORING MATERIALS,


##### Data Preprocessing 3

Another observation from the data displayed above concerns unknown or not-stated values. The "N.S." value in the race column can be replaced with "Not Stated",    
and the "Unk" abbreviation in the hispanic, alcohol and drug columns can be spelled out as "Unknown" for readability.


The replacements are:    
"N.S." in the race column becomes "Not Stated",   
"Unk/Not stated" in the hispanic column becomes "Unknown/Not stated",   
and "No/Unk" in the alcohol and drug columns becomes "No/Unknown".

In [17]:
#Assign each result back instead of using inplace=True on a column selection
decoded_data['race'] = decoded_data['race'].replace('N.S.', 'Not Stated')

decoded_data['hispanic'] = decoded_data['hispanic'].replace('Unk/Not stated', 'Unknown/Not stated')

decoded_data['alcohol'] = decoded_data['alcohol'].replace('No/Unk', 'No/Unknown')
decoded_data['drug'] = decoded_data['drug'].replace('No/Unk', 'No/Unknown')

In [18]:
#An overview of the dataset after changes
decoded_data.head(3)

Unnamed: 0,cpsc_case_number,narrative,treatment_date,age,sex,race,other_race,hispanic,diagnosis,other_diagnosis,...,body_part,body_part_2,disposition,location,fire_involvement,alcohol,drug,product_1,product_2,product_3
0,190103269,FELL TO THE FLOOR AT THE NURSING HOME ONTO BAC...,2019-01-01,94,MALE,Not Stated,,Unknown/Not stated,INTERNAL INJURY,,...,HEAD,,TREATED AND ADMITTED/HOSPITALIZED,PUBLIC,NO/?,No/Unknown,No/Unknown,FLOORS OR FLOORING MATERIALS,,
1,190103270,FELL IN THE SHOWER AT HOME AND SUSTAINED A CLO...,2019-01-01,86,MALE,Not Stated,,Unknown/Not stated,INTERNAL INJURY,,...,HEAD,,TREATED AND ADMITTED/HOSPITALIZED,HOME,NO/?,No/Unknown,No/Unknown,BATHTUBS OR SHOWERS,,
2,190103273,WAS GETTING UP FROM THE COUCH AND FELL TO THE ...,2019-01-01,87,FEMALE,Not Stated,,Unknown/Not stated,"CONTUSIONS, ABR.",,...,ELBOW,,TREATED AND ADMITTED/HOSPITALIZED,HOME,NO/?,No/Unknown,No/Unknown,"SOFAS, COUCHES, DAVENPORTS, DIVANS OR STUDIO C...",FLOORS OR FLOORING MATERIALS,


---

## 4. Exploratory Data Analysis and Data Visualization

In [None]:
import plotly.express as px 
import matplotlib.pyplot as plt
import seaborn as sns

## AGE GROUPING

In [None]:

# Define custom age bin edges
age_bins = [65, 71, 77, 83, 89, 95, 101, 107, 113]

# Define labels for the age groups
age_labels = ['65-70', '71-76', '77-82', '83-88', '89-94', '95-100', '101-106', '107-112']

# Create a new column 'age_group' based on the custom age bins
# right=False makes each bin [lower, upper), so age 65 lands in '65-70' and age 71 in '71-76'
decoded_data['age_group'] = pd.cut(decoded_data['age'], bins=age_bins, labels=age_labels, right=False)

# Display the first few rows of the DataFrame with the age groups
decoded_data[['age', 'age_group']].head()
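
How `pd.cut` treats the bin edges decides which group a boundary age such as 65 or 71 ends up in. The sketch below compares the default right-closed bins with left-closed bins (`right=False`) on a handful of made-up ages, using the same edges and labels as above.

```python
import pandas as pd

age_bins = [65, 71, 77, 83, 89, 95, 101, 107, 113]
age_labels = ['65-70', '71-76', '77-82', '83-88', '89-94', '95-100', '101-106', '107-112']

# Hypothetical ages chosen to sit on or next to the bin edges
ages = pd.Series([65, 70, 71, 76, 77])

# Default behaviour: intervals are (65, 71], (71, 77], ... so 65 falls outside every bin
right_closed = pd.cut(ages, bins=age_bins, labels=age_labels)

# right=False: intervals are [65, 71), [71, 77), ... which matches the labels
left_closed = pd.cut(ages, bins=age_bins, labels=age_labels, right=False)

print(pd.DataFrame({"age": ages, "right_closed": right_closed, "left_closed": left_closed}))
```

With the default setting, age 65 receives no group at all and age 71 is labelled '65-70', so the labels only describe the bins correctly when `right=False` is used.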


# EDA

#### AGE DISTRIBUTION

In [None]:
plt.figure(figsize=(8, 4))
# age_group is categorical, so a count plot is used; bins and a KDE only apply to numeric data
sns.countplot(data=decoded_data, x='age_group')
plt.title('Age Distribution')
plt.xlabel('Age group')
plt.ylabel('Frequency')
plt.show()


The plot shows that records in this dataset are concentrated among patients aged 71 to 82, so a large share of the recorded falls involve this age range.

#### GENDER DISTRIBUTION

In [None]:
plt.figure(figsize=(8, 6))
sns.countplot(data=decoded_data, x='sex')
plt.title('Gender Distribution')
plt.xlabel('Sex')
plt.ylabel('Count')
plt.show()


There is a higher proportion of females in this dataset.

#### RACE DISTRIBUTION

In [None]:
plt.figure(figsize=(15, 6))
sns.countplot(data=decoded_data, x='race')
plt.title('Race Distribution')
plt.xlabel('Race')
plt.ylabel('Count')
plt.show()


White patients account for the largest share of records, followed by records whose race was not stated (the "N.S." code).

#### DIAGNOSIS DISTRIBUTION

In [None]:
# Get the top 10 diagnosis 
top_10_diagnosis = decoded_data['diagnosis'].value_counts().head(10)

# Create a bar chart
plt.figure(figsize=(10, 6))
top_10_diagnosis.plot(kind='bar')
plt.title('Top 10 Diagnosis by Distribution')
plt.xlabel('Diagnosis')
plt.ylabel('Count')
plt.xticks(rotation=45) 
plt.show()


These are the 10 most frequent diagnoses in this dataset, showing the main injuries older adults sustain when they fall.

#### BODY PARTS DISTRIBUTION

In [None]:
# Get the top 10 body parts affected by fall
top_10_body_part = decoded_data['body_part'].value_counts().head(10)

# Create a bar chart
plt.figure(figsize=(10, 6))
top_10_body_part.plot(kind='bar')
plt.title('Top 10 Body Parts by Distribution')
plt.xlabel('Body part')
plt.ylabel('Count')
plt.xticks(rotation=45) 
plt.show()

These are the body parts most often injured when older adults fall.

### BIVARIATE ANALYSIS

Bivariate analysis refers to the statistical analysis or examination of the relationship or interactions between two variables or factors to understand their correlation, association, or dependencies.
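
When the groups being compared differ in size, raw counts and within-group shares can tell different stories. The sketch below shows one way to compute both with `pd.crosstab`, on a toy frame whose values are made up purely for illustration.

```python
import pandas as pd

# Hypothetical records: 2 males and 4 females
toy = pd.DataFrame({
    "sex": ["MALE", "MALE", "FEMALE", "FEMALE", "FEMALE", "FEMALE"],
    "diagnosis": ["FRACTURE", "INTERNAL INJURY",
                  "FRACTURE", "FRACTURE", "FRACTURE", "INTERNAL INJURY"],
})

# Raw counts per (sex, diagnosis) pair
counts = pd.crosstab(toy["sex"], toy["diagnosis"])

# Row-normalised shares in percent: each row sums to 100
shares = pd.crosstab(toy["sex"], toy["diagnosis"], normalize="index") * 100

print(counts)
print(shares)
```

In this toy example females have more internal injury records only in the sense that there are more female rows overall; the row-normalised table makes the within-group comparison explicit.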

### Demographic Factors:

##### AGE vs DIAGNOSIS

In [None]:
top_10_diagnosis = decoded_data['diagnosis'].value_counts().head(10).index
decoded_data_filtered = decoded_data[decoded_data['diagnosis'].isin(top_10_diagnosis)]
plt.figure(figsize=(14, 6))
sns.countplot(data=decoded_data_filtered, x='diagnosis', hue='age_group')
plt.title('Top 10 Diagnosis Count by Age Group')
plt.xlabel('Diagnosis')
plt.ylabel('Count')
plt.xticks(rotation=90)
plt.show()

## HYPOTHESIS

### Age-Related Injury Trends: 

The chart suggests age-related patterns in the types of injuries diagnosed after falls. Because it shows raw counts, and the 71 to 82 age range also contributes the most records, part of each pattern reflects group size and not only a higher risk per person.

More Internal Injuries (Ages 71-82): Patients between the ages of 71 and 82 account for the most internal injury diagnoses following falls. This may point to a particular susceptibility in this age range, possibly linked to age-related physiological changes or underlying health conditions.

Contusions and Abrasions in Older Age Groups (Ages 65-82): The "CONTUSIONS, ABR." category covers contusions (bruises) and abrasions (scrapes). Its higher count within the age range of 65 to 82 indicates that falls in this range often cause superficial soft-tissue injuries in addition to more serious ones.

More Fractures (Ages 65-88): Fractures are diagnosed most often among patients aged 65 to 88. This susceptibility might be attributed to factors such as reduced bone density and diminished musculoskeletal strength, which raise the risk of fractures during falls.

In [None]:
decoded_data.loc[0, 'diagnosis']

#### SEX VS DIAGNOSIS

In [None]:
# Keep only the three diagnoses of interest
it = decoded_data[decoded_data['diagnosis'].isin(['INTERNAL INJURY', 'CONTUSIONS, ABR.', 'FRACTURE'])]

# Calculate the proportions of each diagnosis by gender
diagnosis_proportions = it.groupby(['sex', 'diagnosis']).size() / it.groupby('sex').size()

# Reset the index for the proportions
diagnosis_proportions = diagnosis_proportions.reset_index(name='Proportion')

# Pivot the data to create a percentage stacked bar chart
pivot_table = diagnosis_proportions.pivot(index='sex', columns='diagnosis', values='Proportion')

# Normalize the data to percentages (multiply by 100)
pivot_table *= 100


# Create the percentage stacked bar chart
ax = pivot_table.plot(kind='bar', stacked=True, figsize=(15, 8))


plt.xlabel('sex')
plt.ylabel('Percentage')
plt.title('Percentage Stacked Bar Chart: Gender vs. Diagnosis')

# Add data labels to each bar segment
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy()
    ax.annotate(f'{height:.1f}%', (x + width / 2, y + height / 2), ha='center', va='center')


plt.show()


## HYPOTHESIS

### Gender-Specific Injury Patterns:

The data reveals gender-specific patterns in the types of injuries diagnosed after falls. These differences highlight the importance of considering gender as a factor in understanding and addressing specific health outcomes.

Higher Incidence of Internal Injury in Males: The higher occurrence of internal injury diagnoses in males suggests that men may be more susceptible to injuries affecting internal organs following falls. This finding may be indicative of gender-related physiological differences or variations in injury mechanisms.

Contusions and Abrasions in Males: The larger share of "CONTUSIONS, ABR." (contusion and abrasion) diagnoses in males implies that men are somewhat more likely to be diagnosed with bruises, scrapes or similar superficial injuries following falls.

Fractures More Common in Females: The observation that fractures are diagnosed more frequently in females suggests that women may face a higher risk of bone fractures during falls. This could be influenced by factors such as bone density, physical activity, or age-related changes affecting bone health.


In [None]:
decoded_data.loc[0, 'diagnosis']

#### RACE VS DIAGNOSIS

In [None]:
# Keep only the three diagnoses of interest
it = decoded_data[decoded_data['diagnosis'].isin(['INTERNAL INJURY', 'CONTUSIONS, ABR.', 'FRACTURE'])]


# Calculate the proportions of each diagnosis by race
diagnosis_proportions = it.groupby(['race', 'diagnosis']).size() / it.groupby('race').size()

# Reset the index for the proportions
diagnosis_proportions = diagnosis_proportions.reset_index(name='Proportion')

# Pivot the data to create a percentage stacked bar chart
pivot_table = diagnosis_proportions.pivot(index='race', columns='diagnosis', values='Proportion')

# Normalize the data to percentages (multiply by 100)
pivot_table *= 100

# Create the percentage stacked bar chart
ax = pivot_table.plot(kind='bar', stacked=True, figsize=(15, 8))

# Add labels and a title
plt.xlabel('race')
plt.ylabel('Percentage')
plt.title('Percentage Stacked Bar Chart: Race vs. Diagnosis')

# Add data labels to each bar segment
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy()
    ax.annotate(f'{height:.1f}%', (x + width / 2, y + height / 2), ha='center', va='center')

plt.show()

## HYPOTHESIS

### Racial Disparities in Diagnosis Patterns: 

The data indicates that there are disparities in the types of diagnoses individuals receive following falls among different racial groups. These disparities may reflect variations in healthcare access, socioeconomic factors, or underlying health conditions within these populations.

Higher Occurrence of Internal Injuries in Native Hawaiian/Pacific Islander: The higher incidence of internal injury diagnoses in the Native Hawaiian/Pacific Islander group suggests that individuals from this racial background may be more susceptible to injuries affecting internal organs after falling. Further research is needed to understand the specific factors contributing to this pattern.

Contusions and Abrasions in Black/African American: The larger share of "CONTUSIONS, ABR." (contusion and abrasion) diagnoses in the Black/African American group indicates that falls in this group are more often diagnosed as superficial injuries such as bruises and scrapes. This finding underscores the need for comprehensive assessments and care for fall-related injuries in this demographic.

Fractures More Common in American Indian/Alaska Native: The observation that fractures are diagnosed more frequently in the American Indian/Alaska Native population suggests that individuals from this racial background may be at a higher risk of bone fractures following falls. Factors contributing to this risk may include variations in bone health, activity levels, or other sociocultural factors.

### ALCOHOL VS DIAGNOSIS

In [None]:
top_10_diagnosis = decoded_data['diagnosis'].value_counts().head(10).index
decoded_data_filtered = decoded_data[decoded_data['diagnosis'].isin(top_10_diagnosis)]
plt.figure(figsize=(14, 6))
sns.countplot(data=decoded_data_filtered, x='diagnosis', hue='alcohol')
plt.title('Top 10 Diagnosis Count by alcohol')
plt.xlabel('Diagnosis')
plt.ylabel('Count')
plt.xticks(rotation=90)
plt.show()

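Most of the recorded falls have no documented alcohol involvement. Because the "No/Unk" category combines "No" with "Unknown", this shows that alcohol was rarely recorded, not that it was absent.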

### DRUGS VS DIAGNOSIS

In [None]:
top_10_diagnosis = decoded_data['diagnosis'].value_counts().head(10).index
decoded_data_filtered = decoded_data[decoded_data['diagnosis'].isin(top_10_diagnosis)]
plt.figure(figsize=(14, 6))
sns.countplot(data=decoded_data_filtered, x='diagnosis', hue='drug')
plt.title('Top 10 Diagnosis Count by drug')
plt.xlabel('Diagnosis')
plt.ylabel('Count')
plt.xticks(rotation=90)
plt.show()

Likewise, most of the recorded falls have no documented drug involvement, with the same caveat that "No" and "Unknown" share one category.
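
Small categories are hard to read off a count plot, so a numeric share can complement the charts above. A minimal sketch using `value_counts(normalize=True)` on a made-up series in the style of the alcohol column:

```python
import pandas as pd

# Hypothetical flag values; the real column would be decoded_data['alcohol'] or decoded_data['drug']
toy_flags = pd.Series(["No/Unk", "No/Unk", "No/Unk", "Yes"], name="alcohol")

# Percentage of records in each category
share = toy_flags.value_counts(normalize=True) * 100

# Percentage flagged "Yes" (0.0 if the category never occurs)
yes_share = float(share.get("Yes", 0.0))

print(share)
print(yes_share)
```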

### Environmental Factors:

##### Does the location or environment where falls occur impact the likelihood of injury?

### LOCATION VS DIAGNOSIS

In [None]:
top_10_diagnosis = decoded_data['diagnosis'].value_counts().head(10).index
decoded_data_filtered = decoded_data[decoded_data['diagnosis'].isin(top_10_diagnosis)]
plt.figure(figsize=(14, 6))
sns.countplot(data=decoded_data_filtered, x='diagnosis', hue='location')
plt.title('Top 10 Diagnosis Count by location')
plt.xlabel('Diagnosis')
plt.ylabel('Count')
plt.xticks(rotation=90)
plt.show()

## HYPOTHESIS

The observation that most falls occur at home, unknown locations, and in public areas provides valuable insights into fall prevention and safety measures. Here are insights and potential solutions based on this observation:

#### Insights:

Home as a Common Location: The fact that a significant number of falls occur at home underscores the importance of home safety for individuals of all ages, particularly for older adults who may spend a substantial amount of time at home.

Unknown Locations: The "unknown location" category means the place of the fall was not recorded, which limits how well these incidents can be attributed to a specific setting and targeted with prevention measures.

Public Spaces: Falls in public areas highlight the need for public safety measures and awareness campaigns to reduce the risk of falls in crowded places.

#### Solutions:

Home Safety Assessments: Encourage home safety assessments, particularly for older adults, to identify and address potential fall hazards at home. This may include removing tripping hazards, improving lighting, installing handrails, and making bathrooms more accessible.

### LOCATION VS DISPOSITION 

In [None]:
location_disposition_cross_tab = pd.crosstab(decoded_data['location'], decoded_data['disposition'])
# Visualize the relationship between location and disposition using a heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(location_disposition_cross_tab, annot=True, fmt='d', cmap='YlGnBu')
plt.title('Relationship between Location and Disposition')
plt.xlabel('Disposition')
plt.ylabel('Location')
plt.show()



### Severity and Treatment Outcomes:

Falls at home are more common but result in a mix of treatment outcomes: a significant number of individuals are treated and released, while a substantial number require hospitalization.



## To identify the common circumstances or activities during which falls occur among older adults 

In [None]:
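#Lowercase, replace punctuation with spaces and collapse repeated whitespace (regex=True is needed because pandas >= 2.0 treats the pattern literally by default)
decoded_data['narrative'] = decoded_data['narrative'].str.lower().str.replace(r'[^\w\s]', ' ', regex=True).str.replace(r'\s\s+', ' ', regex=True)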
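
The normalization step can be illustrated on a single string. This is a minimal sketch with Python's `re` module on a made-up narrative; the helper name `normalize` and the final `strip()` are additions for the example, not part of the cell above.

```python
import re

def normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)   # anything that is not a word character or whitespace
    text = re.sub(r"\s\s+", " ", text)     # runs of two or more whitespace characters
    return text.strip()                     # drop the trailing space left by the final period

cleaned = normalize("FELL TO THE FLOOR; DX: CLOSED HEAD INJURY.")
print(cleaned)
```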

In [None]:
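import nltk
from nltk.tokenize import word_tokenize
from collections import Counter

# word_tokenize relies on NLTK's 'punkt' tokenizer data (named 'punkt_tab' in newer NLTK releases),
# which has to be downloaded once with nltk.download(...) before this cell will run

# Strip leftover punctuation literally; regex=False keeps '.' from being read as a regex wildcard
sentences = decoded_data['narrative'].dropna().str.replace(',', '', regex=False).str.replace(':', '', regex=False).str.replace('.', '', regex=False)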

# Tokenize the sentences into unigrams
unigrams = [word_tokenize(sentence) for sentence in sentences if len(sentence) >= 4]

# Flatten the list of unigrams
unigrams = [word for sublist in unigrams for word in sublist if len(word) >= 4]

# Count the frequency of each unigram
unigram_counts = Counter(unigrams)

#Get the top 15 most common unigrams and their counts
top_n = 15
most_common_unigrams = unigram_counts.most_common(top_n)
common_unigrams, counts = zip(*most_common_unigrams)

# Create a bar chart to visualize unigram frequencies
plt.figure(figsize=(10, 6))
plt.barh(common_unigrams, counts)
plt.xlabel('Frequency')
plt.ylabel('Unigrams')
plt.title(f'Top {top_n} Most Common Unigrams')
plt.gca().invert_yaxis()  # Invert the y-axis to display the most common at the top
plt.show()


## BIGRAMS

In [None]:
from nltk.corpus import stopwords
from nltk import bigrams

# Initialize NLTK's stop words (the 'stopwords' corpus must be available locally)
stop_words = set(stopwords.words('english'))

# Lowercase, replace punctuation with spaces and collapse repeated whitespace
decoded_data['narrative'] = decoded_data['narrative'].str.lower().str.replace(r'[^\w\s]', ' ', regex=True).str.replace(r'\s\s+', ' ', regex=True)

sentences = decoded_data['narrative'].dropna()

# Initialize a list to store preprocessed bigrams
preprocessed_bigrams = []

for sentence in sentences:
    # Tokenize the sentence
    tokens = word_tokenize(sentence)

    # Keep alphabetic tokens of at least 4 characters, which removes most short function words.
    # Stop words of 4+ characters (e.g. 'over', 'down') are kept, which is why bigrams such as
    # 'tripped over' appear below; add `and word not in stop_words` to drop them as well.
    clean_tokens = [word.lower() for word in tokens if word.isalpha() and len(word) >= 4]

    # Create bigrams from the clean tokens
    bigrams_list = list(bigrams(clean_tokens))

    # Append the preprocessed bigrams to the list
    preprocessed_bigrams.extend(bigrams_list)

# Count the frequency of each bigram
bigram_counts = Counter(preprocessed_bigrams)

# Get the top 15 most common bigrams and their counts
top_15_bigrams = bigram_counts.most_common(15)
bigram_labels = [' '.join(pair) for pair, _ in top_15_bigrams]
bigram_freqs = [count for _, count in top_15_bigrams]

# Create a bar chart to visualize bigram frequencies
plt.figure(figsize=(10, 8))
plt.barh(range(len(bigram_labels)), bigram_freqs)
plt.yticks(range(len(bigram_labels)), bigram_labels)
plt.xlabel('Frequency')
plt.ylabel('Bigrams')
plt.title('Top 15 Most Common Bigrams')
plt.gca().invert_yaxis()  # Invert the y-axis to display the most common at the top
plt.show()
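
Because short tokens are filtered out before the pairs are formed, a bigram does not have to be two adjacent words in the original narrative. The sketch below reproduces the idea without NLTK, using `str.split` for tokenization and two made-up narratives; the helper name `clean_bigrams` is hypothetical.

```python
from collections import Counter

# Hypothetical, already-normalized narratives
narratives = [
    "fell to the floor and hit head",
    "lost balance and fell to the floor",
]

def clean_bigrams(text):
    """Pair up consecutive tokens after dropping non-alphabetic and short tokens."""
    tokens = [w for w in text.split() if w.isalpha() and len(w) >= 4]
    return list(zip(tokens, tokens[1:]))

counts = Counter(bg for text in narratives for bg in clean_bigrams(text))

print(counts.most_common(3))
```

Here "fell to the floor" yields the bigram ("fell", "floor") because "to" and "the" are removed first, which is how a phrase like "fell floor" can show up among the most common bigrams.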


The extracted bigrams point to the circumstances and outcomes that recur in fall narratives and suggest where prevention efforts could focus. Each bigram is considered in turn:

1. **Head Injury**: The presence of "head injury" indicates that head injuries are a significant outcome of falls among older adults. It's essential to focus on strategies to prevent head injuries, such as using protective headgear and implementing fall prevention programs.

2. **Fell Floor**: "Fell floor" suggests that falls often result in individuals landing on the floor. This emphasizes the importance of strategies to minimize the impact of falls, including improving flooring materials to reduce injury risk.

3. **Closed Head**: "Closed head" most likely comes from the phrase "closed head injury", a head injury in which the skull is not penetrated but which may still involve a concussion or other brain injury. Preventive measures may include education on recognizing the signs of head injuries and seeking medical attention.

4. **Hitting Head**: Falls that lead to "hitting head" indicate a potential problem with balance or coordination. Solutions may involve balance exercises, regular vision check-ups, and environmental modifications to reduce hazards.

5. **Nursing Home**: The mention of "nursing home" suggests that falls occur in care facilities. Solutions may involve improved staff training, fall risk assessments, and environmental modifications in nursing homes.

6. **Tripped Over**: "Tripped over" highlights the role of tripping hazards in falls. Reducing tripping hazards, such as clutter and loose rugs, can help prevent falls.

7. **Striking Head**: Similar to "hitting head," this bigram points to falls leading to head injuries. Strategies to reduce head injuries are relevant here.

8. **Lost Balance**: The phrase "lost balance" indicates that balance issues may contribute to falls. Balance training exercises and medical evaluations to identify underlying causes can be beneficial.

9. **Fell Down**: "Fell down" is a straightforward representation of falls. Fall prevention strategies should be implemented broadly, including home safety measures and regular exercise.

10. **Floor Home**: "Floor home" suggests that falls often occur at home. Home safety assessments and modifications are essential to reduce fall risks.

Based on these insights, here are some potential solutions for fall prevention among older adults:

- **Fall Risk Assessments**: Conduct regular fall risk assessments for older adults, taking into account their health conditions, mobility, and living environment.

- **Home Safety Modifications**: Promote home safety assessments and modifications, such as installing handrails, improving lighting, and removing tripping hazards.

- **Balance and Strength Training**: Encourage older adults to engage in balance and strength training exercises to improve their physical stability and reduce the risk of falling.

- **Medication Management**: Review and manage medications that may cause dizziness or affect balance, consulting with healthcare professionals as needed.

- **Use of Mobility Aids**: Provide and educate older adults on the appropriate use of mobility aids such as walkers or canes.

- **Regular Health Checkups**: Encourage older adults to have regular checkups with healthcare providers to address any underlying health issues that may contribute to falls.

- **Educational Programs**: Offer educational programs to raise awareness about fall prevention, including information on recognizing fall risks and taking preventive measures.

- **Environmental Modifications in Nursing Homes**: Improve the safety of nursing homes by implementing environmental modifications and staff training programs.

- **Protective Headgear**: For older adults at risk of head injuries due to falls, consider the use of protective headgear, such as helmets or head protection.

- **Regular Vision Exams**: Encourage older adults to have regular vision exams to ensure proper vision, which is crucial for balance and coordination.

Implementing a combination of these solutions tailored to individual needs and circumstances can significantly reduce the risk of falls among older adults and enhance their overall safety and well-being. Additionally, fostering a supportive and vigilant community or caregiver network can further contribute to fall prevention.