# NLP Analysis of Clinical Notes for Sepsis Detection and Insights

This notebook uses PyTorch with Hugging Face Transformers (Bio_ClinicalBERT), Stanza, spaCy, NLTK, scikit-learn, and gensim to perform Named Entity Recognition (NER), relation extraction, and topic modeling on a clinical notes dataset, with sepsis as the central diagnosis. Sentiment analysis was considered but is not part of the final analysis.

### Installation and Imports

In [6]:
# First Cell
# Install necessary packages
!pip install stanza transformers torch

import stanza
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

# Download Stanza Biomedical models
stanza.download('en', package='mimic', processors={'ner': 'bc5cdr'})

# Check GPU availability
if torch.cuda.is_available():
    print("GPU is available and will be used for processing.")
else:
    print("GPU is not available. The CPU will be used for processing.")

# Load Stanza pipeline for English Biomedical models
stanza_nlp = stanza.Pipeline('en', package='mimic', processors={'ner': 'bc5cdr'})

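# Load the Bio_ClinicalBERT model and tokenizer with a token-classification head.
# Note: the checkpoint provides no weights for this head, so the classifier layer
# is newly initialized (see the warning in the output) and its predictions are
# not meaningful until the model is fine-tuned on labeled NER data.
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModelForTokenClassification.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

# Create a NER pipeline using ClinicalBERT
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer)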




Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.8.0.json: 387kB [00:00, 53.1MB/s]                    
2024-07-28 17:24:50 INFO: Downloaded file to /home/codespace/stanza_resources/resources.json
2024-07-28 17:24:51 INFO: Downloading these customized packages for language: en (English)...
| Processor       | Package        |
------------------------------------
| tokenize        | mimic          |
| pos             | mimic_charlm   |
| lemma           | mimic_nocharlm |
| depparse        | mimic_charlm   |
| ner             | bc5cdr         |
| backward_charlm | pubmed         |
| pretrain        | biomed         |
| forward_charlm  | mimic          |
| pretrain        | mimic          |
| backward_charlm | mimic          |
| forward_charlm  | pubmed         |

Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.8.0/models/tokenize/mimic.pt: 100%|██████████| 645k/645k [00:00<00:00, 20.0MB/s]
2024-07-28 17:24:51 INFO: Downloaded

GPU is not available. The CPU will be used for processing.


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.8.0.json: 387kB [00:00, 48.9MB/s]                    
2024-07-28 17:24:59 INFO: Downloaded file to /home/codespace/stanza_resources/resources.json
2024-07-28 17:25:04 INFO: Loading these models for language: en (English):
| Processor | Package        |
------------------------------
| tokenize  | mimic          |
| pos       | mimic_charlm   |
| lemma     | mimic_nocharlm |
| depparse  | mimic_charlm   |
| ner       | bc5cdr         |

2024-07-28 17:25:04 INFO: Using device: cpu
2024-07-28 17:25:04 INFO: Loading: tokenize
2024-07-28 17:25:04 INFO: Loading: pos
2024-07-28 17:25:04 INFO: Loading: lemma
2024-07-28 17:25:04 INFO: Loading: depparse
2024-07-28 17:25:05 INFO: Loading: ner
2024-07-28 17:25:07 INFO: Done loading processors!
Some weights of the model checkpoint at emilyalsentzer/Bio_ClinicalBERT were not used when initializing BertForTokenClassification: ['cls.seq_relationship.bias', 'cls
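
The load warning above reports that some checkpoint weights were unused when building the token-classification model. One quick sanity check before trusting any tags is to look at the model's label map: when no NER labels are configured, Transformers fills `config.id2label` with placeholder names such as `LABEL_0`. The sketch below uses plain dictionaries as stand-ins for `model.config.id2label`; the helper name `has_placeholder_labels` and the example BIO labels are hypothetical.

```python
import re

def has_placeholder_labels(id2label: dict) -> bool:
    """Return True if every label looks like an auto-generated 'LABEL_<n>' name."""
    return bool(id2label) and all(
        re.fullmatch(r"LABEL_\d+", str(name)) for name in id2label.values()
    )

# Stand-in for model.config.id2label on a checkpoint without a trained NER head
default_map = {0: "LABEL_0", 1: "LABEL_1"}
# Stand-in for a checkpoint fine-tuned with BIO tags (illustrative labels only)
bio_map = {0: "O", 1: "B-DISEASE", 2: "I-DISEASE"}

print(has_placeholder_labels(default_map))  # True
print(has_placeholder_labels(bio_map))      # False
```

If the check returns `True` for the loaded model, the predicted labels carry no clinical meaning and the model would need fine-tuning on annotated data first.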

### Load datasets

In [7]:
# Second Cell
import os
import pandas as pd

# Set up the directory path
directory = os.getenv('CLINICAL_REFS_DIR', "./clinical_refs")

if not os.path.exists(directory):
    os.makedirs(directory)

# File paths
train_file_path = os.path.join(directory, 'MTS-Dialog-TrainingSet.csv')
validation_file_path = os.path.join(directory, 'MTS-Dialog-ValidationSet.csv')

# Load the datasets
train_df = pd.read_csv(train_file_path)
validation_df = pd.read_csv(validation_file_path)

# Explicitly retrieve headers
train_headers = train_df.columns.tolist()
validation_headers = validation_df.columns.tolist()

# Print the column headers to confirm the available columns
print("Training dataset columns:", train_headers)
print("Validation dataset columns:", validation_headers)

# Display the first few rows of each dataset
print("Training dataset preview:")
print(train_df.head())

print("Validation dataset preview:")
print(validation_df.head())


Training dataset columns: ['ID', 'section_header', 'section_text', 'dialogue']
Validation dataset columns: ['ID', 'section_header', 'section_text', 'dialogue']
Training dataset preview:
   ID section_header                                       section_text  \
0   0          GENHX  The patient is a 76-year-old white female who ...   
1   1          GENHX  The patient is a 25-year-old right-handed Cauc...   
2   2          GENHX  This is a 22-year-old female, who presented to...   
3   3    MEDICATIONS  Prescribed medications were Salmeterol inhaler...   
4   4             CC                                   Burn, right arm.   

                                            dialogue  
0  Doctor: What brings you back into the clinic t...  
1  Doctor: How're you feeling today?  \r\nPatient...  
2  Doctor: Hello, miss. What is the reason for yo...  
3  Doctor: Are you taking any over the counter me...  
4  Doctor: Hi, how are you? \r\nPatient: I burned...  
Validation dataset preview:
   ID

### Preprocess notes

In [9]:
# Third Cell
import pandas as pd
import re
import os
import nltk
from nltk.corpus import stopwords

# Download stopwords
nltk.download('stopwords')

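# Build the stop-word set once instead of reloading the list for every token
STOP_WORDS = set(stopwords.words('english'))

def preprocess_text(text):
    # Treat missing values (e.g. NaN) as empty text
    if not isinstance(text, str):
        return ''
    # Remove non-alphabetic characters and convert to lowercase.
    # flags must be passed by keyword: the fourth positional argument of re.sub is count.
    text = re.sub(r'[^a-zA-Z\s]', '', text, flags=re.A).lower()
    # Tokenize the text
    tokens = text.split()
    # Remove stopwords
    tokens = [word for word in tokens if word not in STOP_WORDS]
    return ' '.join(tokens)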

# Adjust display settings
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)

# Set up the directory path
directory = os.getenv('CLINICAL_REFS_DIR', "./clinical_refs")

if not os.path.exists(directory):
    os.makedirs(directory)

# File paths
train_file_path = os.path.join(directory, 'MTS-Dialog-TrainingSet.csv')
validation_file_path = os.path.join(directory, 'MTS-Dialog-ValidationSet.csv')

# Load the datasets
train_df = pd.read_csv(train_file_path)
validation_df = pd.read_csv(validation_file_path)

# Explicitly retrieve headers
train_headers = train_df.columns.tolist()
validation_headers = validation_df.columns.tolist()

# Print the column headers to confirm the available columns
print("Training dataset columns:", train_headers)
print("Validation dataset columns:", validation_headers)

# Ensure the 'section_text' column exists in your datasets
assert 'section_text' in train_df.columns, "The training dataset does not have a 'section_text' column."
assert 'section_text' in validation_df.columns, "The validation dataset does not have a 'section_text' column."

# Preprocess the 'section_text' column
train_df['section_text_clean'] = train_df['section_text'].apply(preprocess_text)
validation_df['section_text_clean'] = validation_df['section_text'].apply(preprocess_text)

# Display the first few rows of each dataset after preprocessing
print("Training dataset preview after preprocessing:")
print(train_df[['section_text', 'section_text_clean']].head())

print("Validation dataset preview after preprocessing:")
print(validation_df[['section_text', 'section_text_clean']].head())


[nltk_data] Downloading package stopwords to
[nltk_data]     /home/codespace/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Training dataset columns: ['ID', 'section_header', 'section_text', 'dialogue']
Validation dataset columns: ['ID', 'section_header', 'section_text', 'dialogue']
Training dataset preview after preprocessing:
   section_text  \
0
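
To make the effect of the cleaning step concrete, here is a minimal, self-contained sketch. It assumes a tiny hand-written stop-word set in place of NLTK's English list, and the helper name `clean_note` is hypothetical. It shows why phrases such as "76-year-old" appear as `yearold` in the cleaned text, and how `re.sub` treats its fourth positional argument as a replacement count rather than as flags.

```python
import re

# Tiny illustrative stop-word set (assumption: stands in for NLTK's English list)
STOP_WORDS = {"the", "is", "a", "who", "to", "of", "and"}

def clean_note(text: str) -> str:
    """Drop non-letters, lowercase, and remove stop words."""
    # flags is passed by keyword; the fourth positional slot of re.sub is `count`
    text = re.sub(r"[^a-zA-Z\s]", "", text, flags=re.ASCII).lower()
    return " ".join(tok for tok in text.split() if tok not in STOP_WORDS)

cleaned = clean_note("The patient is a 76-year-old white female.")
print(cleaned)  # patient yearold white female

# A flag value that lands in the count slot only limits the number of replacements
limit = int(re.IGNORECASE | re.ASCII)                  # 258
leftover = re.sub(r"\d", "", "7" * 300, count=limit)   # first 258 digits removed
print(len(leftover))  # 42
```

Because digits and hyphens are removed, ages, doses, and lab values are lost at this stage, which is worth keeping in mind when reading the NER and topic outputs further down.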

### Named Entity Recognition (NER)

In [13]:
# Fourth Cell with Updated NER Function and Application
import torch

# Function to perform NER using ClinicalBERT; input is truncated to max_length tokens
def perform_ner_clinicalbert(text, max_length=512):
    encoded = tokenizer(text, return_tensors='pt', truncation=True, max_length=max_length, padding=True)

    # Move the inputs to whichever device the model is on
    input_ids = encoded['input_ids'].to(model.device)
    attention_mask = encoded['attention_mask'].to(model.device)

    # Inference only, so skip gradient tracking
    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    logits = outputs.logits

    # Pick the highest-scoring label for each token
    predictions = torch.argmax(logits, dim=2)
    tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
    entities = []
    for token, prediction in zip(tokens, predictions[0]):
        # Assumes label id 0 is 'O' (BIO tagging); this only holds for a fine-tuned head
        if prediction != 0:
            entities.append((token, model.config.id2label[prediction.item()]))

    return entities

# Function to perform NER using Stanza
def perform_stanza_ner(text):
    doc = stanza_nlp(text)
    entities = [(ent.text, ent.type) for ent in doc.ents]
    return entities

# Apply NER to the training dataset using ClinicalBERT
print("Applying ClinicalBERT NER to training dataset...")
train_df['clinical_bert_entities'] = train_df['section_text_clean'].apply(perform_ner_clinicalbert)

# Apply NER to the validation dataset using ClinicalBERT
print("Applying ClinicalBERT NER to validation dataset...")
validation_df['clinical_bert_entities'] = validation_df['section_text_clean'].apply(perform_ner_clinicalbert)

# Apply NER to the training dataset using Stanza
print("Applying Stanza NER to training dataset...")
train_df['stanza_entities'] = train_df['section_text_clean'].apply(perform_stanza_ner)

# Apply NER to the validation dataset using Stanza
print("Applying Stanza NER to validation dataset...")
validation_df['stanza_entities'] = validation_df['section_text_clean'].apply(perform_stanza_ner)

# Display the results
print("Training dataset with NER results (ClinicalBERT and Stanza):")
print(train_df[['section_text_clean', 'clinical_bert_entities', 'stanza_entities']].head())

print("Validation dataset with NER results (ClinicalBERT and Stanza):")
print(validation_df[['section_text_clean', 'clinical_bert_entities', 'stanza_entities']].head())


Applying ClinicalBERT NER to training dataset...
Applying ClinicalBERT NER to validation dataset...
Applying Stanza NER to training dataset...
Applying Stanza NER to validation dataset...
Training dataset with NER results (ClinicalBERT and Stanza):
   section_text_clean  \
0  patient yearold white female presents clinic today originally hypertension med check history hypertension osteoarthritis ost
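
The ClinicalBERT function above returns one entry per tokenizer token, so longer words can come back as WordPiece fragments marked with a leading `##`. The sketch below shows one way to stitch such fragments back into whole words. The helper name `merge_wordpieces` and the labels in the example are illustrative assumptions, not output from this notebook.

```python
def merge_wordpieces(tagged_tokens):
    """Join '##'-prefixed WordPiece fragments onto the preceding token.

    The merged word keeps the label of its first fragment.
    """
    merged = []
    for token, label in tagged_tokens:
        if token.startswith("##") and merged:
            prev_token, prev_label = merged[-1]
            merged[-1] = (prev_token + token[2:], prev_label)
        else:
            merged.append((token, label))
    return merged

# Illustrative token-level output (labels are made up for the example)
pieces = [("hyper", "B-DISEASE"), ("##tension", "I-DISEASE"), ("med", "O"), ("check", "O")]
print(merge_wordpieces(pieces))
# [('hypertension', 'B-DISEASE'), ('med', 'O'), ('check', 'O')]
```

Keeping the first fragment's label is a simple convention; a majority vote over fragments would be an equally reasonable choice.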

### Sentiment Analysis

Note: Sentiment analysis was considered but is not included in the final analysis.
Clinical notes tend to have an overwhelmingly negative sentiment due to the nature of the content.
Therefore, sentiment analysis does not provide meaningful insights in this context.

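### Relation Extraction using spaCy, Stanza and ClinicalBERT

Bio_ClinicalBERT was initialized from BioBERT and then further pre-trained on clinical notes from MIMIC-III, so it covers both biomedical-literature and clinical-note language. The checkpoint is a pre-trained language model without a task-specific head: the token-classification layer used below is newly initialized (see the load warning in the output), and the "relations" produced in this cell simply link consecutive tagged tokens as a placeholder.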

In [16]:
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Load ClinicalBERT model and tokenizer
model_name = 'emilyalsentzer/Bio_ClinicalBERT'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Move the model to GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.to(device)

def truncate_text(text, tokenizer, max_length=512):
    # Use tokenizer to encode the text and ensure truncation
    inputs = tokenizer(text, max_length=max_length, truncation=True, return_tensors='pt')
    # Decode back to string to verify truncation
    truncated_text = tokenizer.decode(inputs['input_ids'][0], skip_special_tokens=True)
    return truncated_text

def perform_relation_extraction_clinicalbert(text, max_length=512):
    # Truncate text to the maximum length
    truncated_text = truncate_text(text, tokenizer, max_length)
    
    # Tokenize and encode the text
    tokens = tokenizer(truncated_text, return_tensors='pt', truncation=True, padding=True)
    
    # Move input tensors to the correct device (GPU or CPU)
    input_ids = tokens['input_ids'].to(device)
    attention_mask = tokens['attention_mask'].to(device)
    
    # Run the token-classification model. It has no relation-extraction head,
    # so relations are derived from the token predictions further down.
    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    
    logits = outputs.logits

    # Pick the highest-scoring label for each token
    predictions = torch.argmax(logits, dim=2)
    tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
    entities = []
    for token, prediction in zip(tokens, predictions[0]):
        # Assumes label id 0 is 'O' (BIO tagging); this only holds for a fine-tuned head
        if prediction != 0:
            entities.append((token, model.config.id2label[prediction.item()]))
    
    # Placeholder relation logic: link each tagged token to the next tagged token
    # (replace with a trained relation-extraction model for real use)
    relationships = [(entities[i][0], 'related_to', entities[i+1][0]) for i in range(len(entities)-1)]
    
    return relationships

# Apply relation extraction to the training dataset using ClinicalBERT
print("Applying ClinicalBERT relation extraction to training dataset...")
train_df['clinicalbert_relations'] = train_df['section_text_clean'].apply(perform_relation_extraction_clinicalbert)

# Apply relation extraction to the validation dataset using ClinicalBERT
print("Applying ClinicalBERT relation extraction to validation dataset...")
validation_df['clinicalbert_relations'] = validation_df['section_text_clean'].apply(perform_relation_extraction_clinicalbert)

# Display the first few rows of each dataset after relation extraction
print("Training dataset after ClinicalBERT relation extraction:")
print(train_df[['section_text_clean', 'clinicalbert_relations']].head())

print("Validation dataset after ClinicalBERT relation extraction:")
print(validation_df[['section_text_clean', 'clinicalbert_relations']].head())


Some weights of the model checkpoint at emilyalsentzer/Bio_ClinicalBERT were not used when initializing BertForTokenClassification: ['cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint 

Applying ClinicalBERT relation extraction to training dataset...
Applying ClinicalBERT relation extraction to validation dataset...
Training dataset after ClinicalBERT relation extraction:
   section_text_clean  \
0  patient yearold white female presents clinic today originally hypertension med check history hypertension osteoarthritis osteoporosis hypothyroidism allergic rhinitis kidney stones sin
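
The cell above truncates every note to 512 tokens, so anything beyond that limit is never seen by the model. A common alternative is to split long inputs into overlapping windows and process each window separately. The following is a minimal sketch over a plain list of token ids; the function name `chunk_token_ids` and its default `stride` are assumptions for illustration. Hugging Face tokenizers also accept `return_overflowing_tokens=True` together with `stride` for the same purpose.

```python
def chunk_token_ids(token_ids, max_length=512, stride=64, num_special=2):
    """Split a list of token ids into overlapping windows.

    Each window holds at most max_length - num_special ids, leaving room for
    special tokens such as [CLS] and [SEP]; consecutive windows share
    `stride` ids.
    """
    window = max_length - num_special
    if window <= 0 or not 0 <= stride < window:
        raise ValueError("need 0 <= stride < max_length - num_special")
    chunks = []
    start = 0
    while start < len(token_ids):
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # the last window reached the end of the input
        start += window - stride
    return chunks

ids = list(range(1200))  # stand-in for tokenizer output without special tokens
chunks = chunk_token_ids(ids)
print([len(c) for c in chunks])  # [510, 510, 308]
```

The overlap means tokens near a window boundary are predicted twice, so results from neighbouring windows need to be reconciled, for example by keeping the prediction from the window where the token sits further from the edge.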

In [17]:
# Assuming Stanza is already set up and stanza_nlp is loaded
def extract_stanza_relations(text):
    doc = stanza_nlp(text)
    relations = []
    for sentence in doc.sentences:
        for word in sentence.words:
            if word.head != 0:  # If the word has a head (root has head 0)
                head_word = sentence.words[word.head - 1]
                relations.append((word.text, word.deprel, head_word.text))
    return relations

# Apply relation extraction to the training dataset using Stanza
train_df['stanza_relations'] = train_df['section_text_clean'].apply(extract_stanza_relations)
print("Relation extraction performed on the training dataset using Stanza.")

# Apply relation extraction to the validation dataset using Stanza
validation_df['stanza_relations'] = validation_df['section_text_clean'].apply(extract_stanza_relations)
print("Relation extraction performed on the validation dataset using Stanza.")

# Display the first few rows of each dataset after relation extraction
print("Training dataset after relation extraction (Stanza):")
print(train_df[['section_text_clean', 'stanza_relations']].head())

print("Validation dataset after relation extraction (Stanza):")
print(validation_df[['section_text_clean', 'stanza_relations']].head())


Relation extraction performed on the training dataset using Stanza.
Relation extraction performed on the validation dataset using Stanza.
Training dataset after relation extraction (Stanza):
   section_text_clean  \
0  patient yearold white female presents clinic today originally hypertension med check history hypertension osteoarthritis osteoporosis hypothyroidism allergic rhinitis kidney stones s
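
The Stanza cell relies on the convention that `word.head` is a 1-based index into the sentence, with 0 reserved for the root, which is why it looks up `sentence.words[word.head - 1]`. The sketch below reproduces that lookup on a hand-written parse so the indexing can be checked without loading a model. The `Word` tuple, the `keep` filter, and the example parse are illustrative assumptions rather than actual parser output.

```python
from typing import NamedTuple

class Word(NamedTuple):
    id: int       # 1-based position in the sentence
    text: str
    head: int     # id of the governing word; 0 marks the root
    deprel: str

def dependency_triples(words, keep=None):
    """Return (dependent, relation, head) triples, optionally filtered by relation."""
    triples = []
    for word in words:
        if word.head == 0:
            continue  # the root has no governing word
        head_word = words[word.head - 1]  # shift the 1-based id to a 0-based index
        if keep is None or word.deprel in keep:
            triples.append((word.text, word.deprel, head_word.text))
    return triples

# Hand-written parse of "patient denies chest pain" (not actual parser output)
sentence = [
    Word(1, "patient", 2, "nsubj"),
    Word(2, "denies", 0, "root"),
    Word(3, "chest", 4, "compound"),
    Word(4, "pain", 2, "obj"),
]
print(dependency_triples(sentence, keep={"nsubj", "obj"}))
# [('patient', 'nsubj', 'denies'), ('pain', 'obj', 'denies')]
```

Filtering to a few relation types, as with `keep` here, is one way to reduce the full list of dependency arcs to candidate subject and object links.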

In [18]:
import spacy
from spacy.matcher import Matcher

# Load the spaCy model
try:
    nlp_spacy = spacy.load('en_core_web_sm')
except OSError:
    from spacy.cli import download
    download('en_core_web_sm')
    nlp_spacy = spacy.load('en_core_web_sm')

# Dependency-label patterns for relation extraction (each matches consecutive tokens)
RELATION_PATTERNS = [
    [{'DEP': 'nsubj'}, {'DEP': 'ROOT'}, {'DEP': 'dobj'}],  # Subject-Verb-Object
    [{'DEP': 'nsubj'}, {'DEP': 'prep'}, {'DEP': 'pobj'}],  # Subject-Preposition-Object
    [{'DEP': 'nsubjpass'}, {'DEP': 'aux'}, {'DEP': 'prep'}, {'DEP': 'pobj'}]  # Passive Subject-Preposition-Object
]

# Build the matcher once rather than on every call
relation_matcher = Matcher(nlp_spacy.vocab)
for i, pattern in enumerate(RELATION_PATTERNS):
    relation_matcher.add(f'relation_pattern_{i}', [pattern])

def extract_spacy_relations(doc):
    # Apply the matcher to the doc
    matches = relation_matcher(doc)
    relations = []

    for match_id, start, end in matches:
        span = doc[start:end]
        # Record the matched text together with the head of the span's root token
        relations.append((span.text, span.root.head.text))

    return relations

# Apply relation extraction to the training dataset
train_df['spacy_relations'] = train_df['section_text_clean'].apply(lambda x: extract_spacy_relations(nlp_spacy(x)))
print("Enhanced relation extraction performed on the training dataset using spaCy.")

# Apply relation extraction to the validation dataset
validation_df['spacy_relations'] = validation_df['section_text_clean'].apply(lambda x: extract_spacy_relations(nlp_spacy(x)))
print("Enhanced relation extraction performed on the validation dataset using spaCy.")

# Display the first few rows of each dataset after relation extraction
print("Training dataset after enhanced relation extraction (spaCy):")
print(train_df[['section_text_clean', 'spacy_relations']].head())

print("Validation dataset after enhanced relation extraction (spaCy):")
print(validation_df[['section_text_clean', 'spacy_relations']].head())


Enhanced relation extraction performed on the training dataset using spaCy.
Enhanced relation extraction performed on the validation dataset using spaCy.
Training dataset after enhanced relation extraction (spaCy):
   section_text_clean  \
0  patient yearold white female presents clinic today originally hypertension med check history hypertension osteoarthritis osteoporosis hypothyroidism allergic 
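
The spaCy `Matcher` patterns used above describe consecutive tokens, so a subject-verb-object pattern only fires when the three dependency labels sit directly next to each other. The toy function below mimics that behaviour on plain lists of labels to show how an intervening determiner or modifier prevents a match. The helper name `find_contiguous` and the hand-assigned labels are illustrative assumptions, not spaCy output.

```python
def find_contiguous(dep_labels, pattern):
    """Return (start, end) spans where `pattern` occurs as consecutive labels."""
    n = len(pattern)
    return [
        (i, i + n)
        for i in range(len(dep_labels) - n + 1)
        if dep_labels[i:i + n] == pattern
    ]

svo = ["nsubj", "ROOT", "dobj"]

# "patient denies pain": subject, verb and object are adjacent
adjacent = find_contiguous(["nsubj", "ROOT", "dobj"], svo)
print(adjacent)  # [(0, 3)]

# "the patient denies chest pain": a modifier sits between verb and object
separated = find_contiguous(["det", "nsubj", "ROOT", "compound", "dobj"], svo)
print(separated)  # []
```

For notes where subjects and objects are rarely adjacent, walking the dependency tree from each verb to its children is a possible alternative to contiguous patterns.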

### Topic Modeling

In [8]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Define the number of topics
num_topics = 5

def perform_topic_modeling(data):
    # Vectorize the text data
    vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
    X = vectorizer.fit_transform(data['section_text_clean'])

    # Apply LDA for topic modeling
    lda = LatentDirichletAllocation(n_components=num_topics, random_state=42)
    lda.fit(X)

    # Display the top words for each topic
    def display_topics(model, feature_names, num_top_words):
        for topic_idx, topic in enumerate(model.components_):
            print(f"Topic {topic_idx}:")
            print(" ".join([feature_names[i] for i in topic.argsort()[:-num_top_words - 1:-1]]))

    # Display the topics
    num_top_words = 10
    display_topics(lda, vectorizer.get_feature_names_out(), num_top_words)

# Apply topic modeling to the training dataset
print("Training dataset topics:")
perform_topic_modeling(train_df)

# Apply topic modeling to the validation dataset
print("Validation dataset topics:")
perform_topic_modeling(validation_df)


Training dataset topics:
Topic 0:
use drug patient alcohol history denies lives married allergies years
Topic 1:
patient home history sleep nonsmoker yearold months post medications status
Topic 2:
died history mother age father disease diabetes family cancer surgery
Topic 3:
patient pain states right yearold left time denies symptoms history
Topic 4:
negative history pain patient chest today past denies left noncontributory
Validation dataset topics:
Topic 0:
pain right extremity weakness neck left symptoms denies yearold upper
Topic 1:
patient pain right past significant yearold changes mg like states
Topic 2:
lost dr headache medications followup dermatitis atopic home psychiatric left
Topic 3:
patient denies pain history abuse time nausea symptoms past alcohol
Topic 4:
age died disease hypertension history surgery negative cancer old father
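
The expression `topic.argsort()[:-num_top_words - 1:-1]` in `display_topics` can be hard to read at first: `argsort()` returns indices ordered from the smallest to the largest weight, and the negative-step slice walks that list backwards from the end for `num_top_words` items. The sketch below reproduces the same selection with plain Python lists; the vocabulary and weights are made up for illustration, and `top_word_indices` is a hypothetical helper.

```python
def top_word_indices(weights, k):
    """Indices of the k largest weights, highest first."""
    # Ascending order of indices by weight, analogous to argsort()
    order = sorted(range(len(weights)), key=lambda i: weights[i])
    # Walk backwards from the end, taking k items
    return order[:-k - 1:-1]

# Made-up topic-word weights for a five-word vocabulary
vocab = ["pain", "history", "denies", "left", "surgery"]
topic = [0.40, 0.10, 0.25, 0.05, 0.20]

top = top_word_indices(topic, 3)
print([vocab[i] for i in top])  # ['pain', 'denies', 'surgery']
```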


In [9]:
!pip install gensim

import nltk
nltk.download('punkt')

import gensim
import gensim.corpora as corpora
from gensim.models import LdaModel
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def prepare_text(text):
    stop_words = set(stopwords.words('english'))
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word.isalpha() and word not in stop_words]
    return tokens

def perform_topic_modeling_gensim(df):
    # Tokenize the text
    df['tokens'] = df['section_text_clean'].apply(prepare_text)

    # Create Dictionary and Corpus needed for Topic Modeling
    id2word = corpora.Dictionary(df['tokens'])
    texts = df['tokens']
    corpus = [id2word.doc2bow(text) for text in texts]

    # Build the LDA model
    lda_model = LdaModel(
        corpus=corpus, 
        id2word=id2word, 
        num_topics=5, 
        random_state=42, 
        update_every=1, 
        chunksize=100, 
        passes=10, 
        alpha='auto', 
        per_word_topics=True
    )

    # Extract the topics
    topics = lda_model.print_topics(num_words=5)

    # Display the topics
    for topic in topics:
        print(topic)

# Apply topic modeling to the training dataset
print("Training dataset topics:")
perform_topic_modeling_gensim(train_df)

# Apply topic modeling to the validation dataset
print("Validation dataset topics:")
perform_topic_modeling_gensim(validation_df)



huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


[nltk_data] Downloading package punkt to /home/codespace/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Training dataset topics:
(0, '0.025*"age" + 0.016*"treatment" + 0.012*"loss" + 0.011*"noncontributory" + 0.010*"post"')
(1, '0.032*"patient" + 0.029*"currently" + 0.028*"daily" + 0.028*"smoke" + 0.027*"previously"')
(2, '0.054*"lives" + 0.049*"nursing" + 0.047*"security" + 0.047*"citizens" + 0.047*"building"')
(3, '0.022*"history" + 0.014*"significant" + 0.011*"dr" + 0.011*"hypertension" + 0.011*"cancer"')
(4, '0.039*"pain" + 0.026*"patient" + 0.015*"right" + 0.014*"left" + 0.012*"back"')
Validation dataset topics:
(0, '0.023*"denies" + 0.015*"pain" + 0.011*"patient" + 0.009*"symptoms" + 0.009*"negative"')
(1, '0.016*"patient" + 0.009*"lost" + 0.007*"well" + 0.007*"dr" + 0.007*"feeling"')
(2, '0.027*"pain" + 0.013*"right" + 0.012*"states" + 0.010*"symptoms" + 0.009*"surgery"')
(3, '0.028*"age" + 0.023*"died" + 0.016*"patient" + 0.015*"cancer" + 0.015*"disease"')
(4, '0.031*"pain" + 0.020*"patient" + 0.018*"history" + 0.011*"left" + 0.011*"denies"')
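
The topic strings printed above pack weights and words into a single expression, which is awkward to compare or plot. A small parser can turn each string into `(word, weight)` pairs. This is a sketch with a hypothetical helper name, `parse_topic`; gensim's `LdaModel.show_topic` returns such pairs directly when the model object is still available.

```python
import re

# One term looks like 0.039*"pain"
TERM = re.compile(r'([0-9.]+)\*"([^"]+)"')

def parse_topic(topic_string):
    """Turn '0.025*"age" + 0.016*"treatment"' into [('age', 0.025), ('treatment', 0.016)]."""
    return [(word, float(weight)) for weight, word in TERM.findall(topic_string)]

line = '0.039*"pain" + 0.026*"patient" + 0.015*"right" + 0.014*"left" + 0.012*"back"'
pairs = parse_topic(line)
print(pairs)
# [('pain', 0.039), ('patient', 0.026), ('right', 0.015), ('left', 0.014), ('back', 0.012)]
```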


In [20]:
!pip install gensim

import re
import pandas as pd
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Ensure required nltk data is downloaded
import nltk
nltk.download('punkt')
nltk.download('stopwords')

# Step 1: Filter Documents for Sepsis-Related Content
def filter_sepsis_documents(df):
    sepsis_keywords = ['sepsis', 'septic', 'infection', 'bacteria', 'septicemia']
    sepsis_pattern = re.compile(r'\b(?:' + '|'.join(sepsis_keywords) + r')\b', re.IGNORECASE)
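    # Return a copy so that adding a 'tokens' column later does not write to a slice of the original frame
    return df[df['section_text_clean'].apply(lambda text: bool(sepsis_pattern.search(text)))].copy()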

# Step 2: Adjust Preprocessing
def prepare_text(text):
    stop_words = set(stopwords.words('english'))
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word.isalpha() and word not in stop_words]
    return tokens

# Step 3: Perform Topic Modeling
def perform_topic_modeling_gensim(df, num_topics=5, num_top_words=10):
    # Tokenize the text
    df['tokens'] = df['section_text_clean'].apply(prepare_text)

    # Create Dictionary and Corpus needed for Topic Modeling
    id2word = Dictionary(df['tokens'])
    texts = df['tokens']
    corpus = [id2word.doc2bow(text) for text in texts]

    # Build the LDA model
    lda_model = LdaModel(
        corpus=corpus, 
        id2word=id2word, 
        num_topics=num_topics, 
        random_state=42, 
        update_every=1, 
        chunksize=100, 
        passes=10, 
        alpha='auto', 
        per_word_topics=True
    )

    # Extract the topics
    topics = lda_model.print_topics(num_words=num_top_words)

    # Display the topics
    for topic in topics:
        print(topic)

# Load your data (example)
# train_df = pd.read_csv('your_training_data.csv')
# validation_df = pd.read_csv('your_validation_data.csv')

# Apply filtering and topic modeling to the training dataset
print("Training dataset sepsis-related topics:")
filtered_train_df = filter_sepsis_documents(train_df)
perform_topic_modeling_gensim(filtered_train_df)

# Apply filtering and topic modeling to the validation dataset
print("Validation dataset sepsis-related topics:")
filtered_validation_df = filter_sepsis_documents(validation_df)
perform_topic_modeling_gensim(filtered_validation_df)


Collecting gensim
  Downloading gensim-4.3.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (8.2 kB)
Downloading gensim-4.3.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (26.5 MB)
Installing collected packages: gensim
Successfully installed gensim-4.3.3


[nltk_data] Downloading package punkt to /home/codespace/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/codespace/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Training dataset sepsis-related topics:
(0, '0.014*"left" + 0.014*"infection" + 0.013*"patient" + 0.008*"pain" + 0.008*"today" + 0.008*"year" + 0.008*"started" + 0.008*"mg" + 0.008*"areas" + 0.006*"history"')
(1, '0.033*"patient" + 0.023*"surgery" + 0.019*"risks" + 0.015*"infection" + 0.014*"failure" + 0.014*"need" + 0.013*"hardware" + 0.011*"agreed" + 0.010*"yearold" + 0.010*"questions"')
(2, '0.013*"past" + 0.011*"worse" + 0.010*"infection" + 0.010*"cycle" + 0.008*"significant" + 0.008*"tract" + 0.008*"urinary" + 0.008*"also" + 0.008*"noted" + 0.008*"taking"')
(3, '0.015*"medical" + 0.012*"treatment" + 0.012*"center" + 0.012*"allergies" + 0.008*"patient" + 0.008*"allergy" + 0.008*"reaction" + 0.008*"abc" + 0.008*"taking" + 0.008*"problems"')
(4, '0.014*"infection" + 0.011*"toe" + 0.010*"patient" + 0.008*"possible" + 0.008*"underwent" + 0.007*"left" + 0.007*"pain" + 0.006*"stiffness" + 0.006*"treatment" + 0.006*"problems"')
Validation dataset sepsis-related topics:
(0, '0.009*"symptom
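
The sepsis filter depends on a keyword pattern wrapped in `\b` word boundaries, which matches whole words only. The sketch below rebuilds the same pattern from the keyword list used in the cell and runs it on a few invented phrases to show which variants are caught and which are missed. Applying `re.escape` to the keywords is an added precaution, not part of the original cell.

```python
import re

SEPSIS_KEYWORDS = ["sepsis", "septic", "infection", "bacteria", "septicemia"]
SEPSIS_PATTERN = re.compile(
    r"\b(?:" + "|".join(map(re.escape, SEPSIS_KEYWORDS)) + r")\b", re.IGNORECASE
)

# Invented example phrases, not taken from the dataset
samples = [
    "urinary tract infection",
    "septicemia noted",
    "recurrent infections",
    "bacterial growth culture",
    "antiseptic wash applied",
]
results = {text: bool(SEPSIS_PATTERN.search(text)) for text in samples}
for text, hit in results.items():
    print(f"{text!r}: {hit}")
```

Only the first two phrases match. Plurals ("infections") and derived forms ("bacterial") are skipped because of the trailing word boundary, so the keyword list would need those forms added, or the text lemmatized first, if such notes should be included. The same boundaries also keep "antiseptic" from counting as a hit for "septic".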