### 🚀 Stage 1 – Data Preprocessing

- Task 1.1: Construct Proper Sentences
- Task 1.2 & 1.3: Count Sentences and Labels

In [2]:
import pandas as pd

import sklearn_crfsuite
from sklearn_crfsuite import metrics

import spacy
nlp=spacy.load("en_core_web_sm")

In [3]:
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

In [4]:
#Stage 1: Task 1 - Constructs proper sentences from individual words and prints five sentences

In [5]:
#Load a one-token-per-line file and rebuild one space-joined string per sentence
def load_datafile(input_file):
    #Blank lines separate sentences
    with open(input_file, 'r') as i_file:
        lines=i_file.readlines()
    output_list=[]
    current_tokens=[]
    for each_word in lines:
        each_word=each_word.strip()
        if each_word=="":
            output_list.append(" ".join(current_tokens))
            current_tokens=[]
        else:
            current_tokens.append(each_word)
    #Keep the last sentence if the file does not end with a blank line
    if current_tokens:
        output_list.append(" ".join(current_tokens))
    return output_list

In [6]:
#Load data file
train_sentences=load_datafile('train_sent')
train_labels=load_datafile('train_label')
test_sentences=load_datafile('test_sent')
test_labels=load_datafile('test_label')

In [7]:
for each_item in range(5):
    print(f"✔️ Train Sentence {each_item+1}: {train_sentences[each_item]}")
    print(f"✔️ Train Label {each_item+1}: {train_labels[each_item]}")
    print("\n")

✔️ Train Sentence 1: All live births > or = 23 weeks at the University of Vermont in 1995 ( n = 2395 ) were retrospectively analyzed for delivery route , indication for cesarean , gestational age , parity , and practice group ( to reflect risk status )
✔️ Train Label 1: O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O


✔️ Train Sentence 2: The total cesarean rate was 14.4 % ( 344 of 2395 ) , and the primary rate was 11.4 % ( 244 of 2144 )
✔️ Train Label 2: O O O O O O O O O O O O O O O O O O O O O O O O O


✔️ Train Sentence 3: Abnormal presentation was the most common indication ( 25.6 % , 88 of 344 )
✔️ Train Label 3: O O O O O O O O O O O O O O O


✔️ Train Sentence 4: The `` corrected '' cesarean rate ( maternal-fetal medicine and transported patients excluded ) was 12.4 % ( 273 of 2194 ) , and the `` corrected '' primary rate was 9.6 % ( 190 of 1975 )
✔️ Train Label 4: O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O

In [8]:
for each_item in range(5):
    print(f"✔️ Test Sentence {each_item+1}: {test_sentences[each_item]}")
    print(f"✔️ Test Label {each_item+1}: {test_labels[each_item]}")
    print("\n")

✔️ Test Sentence 1: Furthermore , when all deliveries were analyzed , regardless of risk status but limited to gestational age > or = 36 weeks , the rates did not change ( 12.6 % , 280 of 2214 ; primary 9.2 % , 183 of 1994 )
✔️ Test Label 1: O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O


✔️ Test Sentence 2: As the ambient temperature increases , there is an increase in insensible fluid loss and the potential for dehydration
✔️ Test Label 2: O O O O O O O O O O O O O O O O O O O


✔️ Test Sentence 3: The daily high temperature ranged from 71 to 104 degrees F and AFI values ranged from 1.7 to 24.7 cm during the study period
✔️ Test Label 3: O O O O O O O O O O O O O O O O O O O O O O O O


✔️ Test Sentence 4: There was a significant correlation between the 2- , 3- , and 4-day mean temperature and AFI , with the 4-day mean being the most significant ( r = 0.31 , p & # 60 ; 0.001 )
✔️ Test Label 4: O O O O O O O O O O O O O O O O O O O O O O O O O

In [9]:
#Stage 1: Task 2 – Count Number of Sentences

In [10]:
print(f"✔️ Number of sentences in processed train dataset is: {len(train_sentences)}")
print(f"✔️ Number of sentences in processed test dataset is: {len(test_sentences)}")

✔️ Number of sentences in processed train dataset is: 2599
✔️ Number of sentences in processed test dataset is: 1056


In [11]:
#Stage 1: Task 3 – Count Number of Label Lines

In [12]:
print(f"✔️ Number of lines of labels in processed train dataset is: {len(train_labels)}")
print(f"✔️ Number of lines of labels in processed test dataset is: {len(test_labels)}")

✔️ Number of lines of labels in processed train dataset is: 2599
✔️ Number of lines of labels in processed test dataset is: 1056


In [13]:
assert len(train_sentences)==len(train_labels),"Mismatch in train sentences and labels!"
assert len(test_sentences)==len(test_labels),"Mismatch in test sentences and labels!"
print("✔️ Sentence and label counts match for both train and test datasets.")

✔️ Sentence and label counts match for both train and test datasets.


In [14]:
#Re-check that each sentence has a label line (line counts only, not per-token alignment)
assert len(train_labels)==len(train_sentences),"Mismatch between train sentences and labels!"
assert len(test_labels)==len(test_sentences),"Mismatch between test sentences and labels!"
print("✔️ Sentence-label alignment looks good!")

✔️ Sentence-label alignment looks good!
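
The assertions above compare only the number of lines. Because the CRF needs exactly one label per token, a token-level check may also be worth running. The following is a minimal sketch, assuming sentences and labels are space-joined strings as returned by `load_datafile`; `find_misaligned` is a hypothetical helper and the toy data is made up, not taken from the dataset.

```python
def find_misaligned(sentences, labels):
    """Return indices where the token count differs from the label count."""
    return [
        i
        for i, (sent, lab) in enumerate(zip(sentences, labels))
        if len(sent.split()) != len(lab.split())
    ]

# Toy data (hypothetical, not from the dataset)
toy_sentences = ["Abnormal presentation was common", "The rate was 14.4 %"]
toy_labels = ["O O O O", "O O O O"]  # second line is one label short

print(find_misaligned(toy_sentences, toy_labels))
```

An empty result on the real `train_sentences`/`train_labels` pair would indicate that every sentence has as many labels as whitespace tokens.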


### 🚀 Stage 2 – Syntactic Tagging and Frequency Analysis

- Task 2.1: Extract and Count NOUN/PROPN
- Task 2.2: Show Top 25 Most Common NOUN/PROPN Tokens

In [16]:
#Stage 2: Task 1 - List of tokens

In [17]:
#Collect every token that spaCy tags as NOUN or PROPN
token_noun_propn=[]

#Both the train and test sentences contribute to the counts
for sentences in (train_sentences, test_sentences):
    for sent in sentences:
        processed_sent=nlp(sent)
        for each_token in processed_sent:
            if each_token.pos_ in ("NOUN", "PROPN"):
                token_noun_propn.append(each_token.text)

#Series of NOUN/PROPER NOUN tokens
series_noun_propn=pd.Series(token_noun_propn)

In [18]:
#Stage 2: Task 2 - Top 25 NOUN/PROPER NOUN

In [19]:
#Top 25 NOUN/PROPER NOUN
print("✔️ Top 25 NOUN/PROPN:")
series_noun_propn.value_counts().sort_values(ascending=False).head(25)

✔️ Top 25 NOUN/PROPN:


patients        492
treatment       281
%               247
cancer          200
therapy         175
study           153
disease         141
cell            140
lung            116
group            94
chemotherapy     88
gene             87
effects          85
women            77
results          77
use              73
cases            71
risk             71
surgery          71
analysis         70
rate             67
response         66
survival         65
children         64
dose             63
Name: count, dtype: int64
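
For reference, the same kind of frequency table can be built with the standard library alone. This is a small sketch using `collections.Counter` on a made-up token list (not the notebook's data). Note also that pandas' `value_counts()` already returns counts in descending order by default, so the extra `sort_values(ascending=False)` is not strictly required.

```python
from collections import Counter

# Made-up tokens standing in for token_noun_propn
toy_tokens = ["patients", "treatment", "patients", "cancer", "treatment", "patients"]

# most_common(n) returns the n highest counts as (token, count) pairs
top = Counter(toy_tokens).most_common(2)
print(top)
```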

### 🚀 Stage 3 – Feature Engineering for CRF
- Define Features Including PoS Tag
- Add Previous Word's Info
- Mark Sentence Boundaries

In [21]:
#Feature Engineering
def feature_engineering(sentence, pos, pos_tags):
    word=sentence[pos]
    #Features of the current word
    features=[
        'word.lower='+word.lower(),
        'word[-3:]='+word[-3:],
        'word[-2:]='+word[-2:],
        'word.isupper=%s'%word.isupper(),
        'word.isdigit=%s'%word.isdigit(),
        'word.startsWithCapital=%s'%word[0].isupper(),
        'word.pos='+pos_tags[pos]
    ]
    #Features of the previous word, or a sentence-start marker
    if pos>0:
        prev_word=sentence[pos-1]
        features.extend([
            'prev_word.lower='+prev_word.lower(),
            'prev_word.isupper=%s'%prev_word.isupper(),
            'prev_word.isdigit=%s'%prev_word.isdigit(),
            'prev_word.startsWithCapital=%s'%prev_word[0].isupper(),
            'prev_word.pos='+pos_tags[pos-1]
        ])
    else:
        features.append('BEG')
    #Sentence-end marker
    if pos==len(sentence)-1:
        features.append('END')
    return features
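
To make the feature format concrete, here is a small illustration on a two-word sentence. `token_features` is a condensed, hypothetical restatement of `feature_engineering` (fewer previous-word features) so that the block runs on its own, and the PoS tags are written by hand rather than produced by spaCy.

```python
def token_features(words, pos, pos_tags):
    """Condensed restatement of feature_engineering, for illustration only."""
    word = words[pos]
    features = [
        'word.lower=' + word.lower(),
        'word[-3:]=' + word[-3:],
        'word[-2:]=' + word[-2:],
        'word.isupper=%s' % word.isupper(),
        'word.isdigit=%s' % word.isdigit(),
        'word.startsWithCapital=%s' % word[0].isupper(),
        'word.pos=' + pos_tags[pos],
    ]
    if pos > 0:
        # Context from the previous word
        prev_word = words[pos - 1]
        features.append('prev_word.lower=' + prev_word.lower())
        features.append('prev_word.pos=' + pos_tags[pos - 1])
    else:
        features.append('BEG')  # first token of the sentence
    if pos == len(words) - 1:
        features.append('END')  # last token of the sentence
    return features

words = ["Abnormal", "presentation"]
tags = ["ADJ", "NOUN"]  # hand-written tags, not spaCy output
for i in range(len(words)):
    print(token_features(words, i, tags))
```

Each token becomes a flat list of `name=value` strings; the first token carries `BEG` and the last carries `END`.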

### 🚀 Stage 4 – Feature + Label Preparation
- Compute features for all sentences
- Extract labels for each token in sentence-aligned format

In [23]:
#Features for every whitespace token in a sentence
def sentence_feature(sentence):
    #NOTE: spaCy tokenizes the raw string itself, so postags can be longer than
    #sentence.split() (e.g. when a hyphenated word is split) and tags may drift
    postags=[each_token.pos_ for each_token in nlp(sentence)]
    sentence_list=sentence.split()
    return [feature_engineering(sentence_list,pos,postags) for pos in range(len(sentence_list))]
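
`sentence_feature` indexes spaCy's PoS tags by whitespace-token position, which only lines up when spaCy produces exactly one token per whitespace token. spaCy's tokenizer may split words such as `4-day` or `maternal-fetal` into several tokens, in which case later tags shift. One possible remedy, sketched below under stated assumptions, is to map the finer tokens back onto the whitespace tokens. `align_pos_to_words` is a hypothetical helper, taking the tag of the first sub-token is an arbitrary choice, and the `(text, pos)` pairs are hand-written stand-ins for spaCy output.

```python
def align_pos_to_words(words, sub_tokens):
    """Give each whitespace word the PoS tag of its first sub-token.

    words: list of whitespace tokens
    sub_tokens: list of (text, pos) pairs whose texts concatenate to the words
    """
    aligned = []
    k = 0
    for word in words:
        consumed = ""
        first_pos = None
        # Consume sub-tokens until they spell out the whole whitespace word
        while consumed != word and k < len(sub_tokens):
            text, pos = sub_tokens[k]
            if first_pos is None:
                first_pos = pos
            consumed += text
            k += 1
        if consumed != word:
            raise ValueError(f"cannot align sub-tokens to word {word!r}")
        aligned.append(first_pos)
    return aligned

words = "the 4-day mean".split()
# Hand-written stand-in for [(t.text, t.pos_) for t in nlp(sentence)]
sub_tokens = [("the", "DET"), ("4", "NUM"), ("-", "PUNCT"), ("day", "NOUN"), ("mean", "NOUN")]
print(align_pos_to_words(words, sub_tokens))
```

With this mapping the returned list always has one tag per whitespace token, so it can be indexed by the same `pos` used for `sentence_list`.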

In [24]:
#Sentence labels
def label_feature(labels):
  return labels.split()

### 🚀 Stage 5 – Prepare CRF Model Inputs
- Task 5.1: Build X_train and X_test, one list of per-token feature lists for each sentence
- Task 5.2: Build Y_train and Y_test, one label sequence for each sentence

In [26]:
X_train=[sentence_feature(sentence) for sentence in train_sentences]
X_test=[sentence_feature(sentence) for sentence in test_sentences]

In [27]:
Y_train=[label_feature(labels) for labels in train_labels]
Y_test=[label_feature(labels) for labels in test_labels]

### 🚀 Stage 6 - Train CRF Model

In [29]:
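#Build CRF model
crf=sklearn_crfsuite.CRF(max_iterations=100)
#Some sklearn_crfsuite/scikit-learn version combinations appear to raise AttributeError
#from fit(); it is ignored here, so confirm training worked by checking that crf.predict runs
try:
    crf.fit(X_train, Y_train)
except AttributeError:
    pass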

### 🚀 Stage 7 – Predict + Evaluate
- Task 7.1: Predict labels for each sentence in the test set
- Task 7.2: Calculate F1-score using actual vs predicted labels

In [31]:
Y_pred=crf.predict(X_test)

In [32]:
f1_score=metrics.flat_f1_score(Y_test,Y_pred,average='weighted')
print(f"✔️ F1 score is: {round(f1_score,4)}")

✔️ F1 score is: 0.9071
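
A weighted F1 averages the per-label scores by support, so it includes the `O` label along with `D` and `T`. If `O` dominates the tokens, the aggregate can hide weaker performance on the entity labels. The sketch below computes per-label precision, recall and F1 in plain Python to show the idea; `per_label_f1` is a hypothetical helper and the sequences are toy data, not the notebook's predictions.

```python
def per_label_f1(y_true, y_pred):
    """Precision, recall and F1 per label over flattened token sequences."""
    true_flat = [lab for seq in y_true for lab in seq]
    pred_flat = [lab for seq in y_pred for lab in seq]
    scores = {}
    for label in sorted(set(true_flat) | set(pred_flat)):
        pairs = list(zip(true_flat, pred_flat))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores

# Toy gold and predicted label sequences (made up)
toy_true = [["O", "D", "D", "O", "T"], ["O", "O", "T", "O"]]
toy_pred = [["O", "D", "O", "O", "T"], ["O", "O", "O", "O"]]
for label, (p, r, f1) in per_label_f1(toy_true, toy_pred).items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

In this toy case `O` scores higher than `D` and `T`, which is the pattern an aggregate score can mask. On the real data, `metrics.flat_f1_score` forwards keyword arguments to scikit-learn's `f1_score`, so passing `labels=['D','T']` should restrict the average to the entity labels.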


### 🚀 Stage 8 - Model Evaluation

- Extract treatments (T) corresponding to diseases (D) in the test data
- Use Y_pred and the original test sentences to pair each predicted disease with the treatments predicted in the same sentence
- Identify entities from the predicted O/D/T tags

In [34]:
#Map each predicted disease string to the list of treatments predicted in the same sentence
Disease_Treat_dict=dict()
for i in range(len(Y_pred)):
    #Predicted labels and whitespace tokens of test sentence i
    val=Y_pred[i]
    words=test_sentences[i].split()
    #All tokens tagged 'D' (or 'T') in a sentence are joined into one string
    Diseases=" ".join(words[j] for j in range(len(val)) if val[j]=='D')
    Treatments=" ".join(words[j] for j in range(len(val)) if val[j]=='T')
    #Skip sentences without both a disease and a treatment
    if Diseases and Treatments:
        #Always store a list; list(<str>) on a repeated disease would split the
        #earlier treatment into single characters
        Disease_Treat_dict.setdefault(Diseases,[]).append(Treatments)

In [35]:
target_disease='hereditary retinoblastoma'
print(f"✔️ Predicted treatments for '{target_disease}': {', '.join(Disease_Treat_dict[target_disease])}")

✔️ Predicted treatments for 'hereditary retinoblastoma': radiotherapy
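
The loop above joins every `D` token of a sentence into a single disease string, so two separate disease mentions in one sentence end up merged. A possible refinement, sketched here with `itertools.groupby`, is to treat each run of consecutive tokens with the same tag as one entity. `extract_entities` is a hypothetical helper and the sentence and tags are made up for illustration.

```python
from itertools import groupby

def extract_entities(words, tags):
    """Group consecutive tokens that share a non-'O' tag into (tag, text) spans."""
    entities = []
    # groupby yields one group per run of identical tags
    for tag, group in groupby(zip(words, tags), key=lambda pair: pair[1]):
        if tag != "O":
            entities.append((tag, " ".join(word for word, _ in group)))
    return entities

# Made-up sentence and tags
words = "radiotherapy is used for hereditary retinoblastoma and glioma".split()
tags = ["T", "O", "O", "O", "D", "D", "O", "D"]
print(extract_entities(words, tags))
```

Because the label set has no begin/inside distinction, two entities of the same type that are directly adjacent would still be merged into one span.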


### 🚀 Model Evaluation Summary
#### Dataset Overview
- Total sentences in training set: 2,599
- Total sentences in test set: 1,056
- The label files contain the same number of lines as the sentence files (2,599 and 1,056), so every sentence has a label line. This check compares line counts only, not per-token alignment.

#### Corpus Insights
- The 25 most frequent NOUN/PROPN tokens are dominated by clinical terms such as patients (492), treatment (281), cancer (200), therapy (175), and disease (141), which is consistent with a biomedical corpus focused on diseases and treatments.
- The list also contains `%` (247), so the tagger labels some symbols as nouns and the counts are best read as approximate.

#### Model Performance
- The CRF-based Named Entity Recognition (NER) model achieved a weighted F1-score of 0.9071 on the test set.
- The weighted average includes the `O` label (every sample label line printed in Stage 1 consists entirely of `O`), so this figure may overstate performance on the disease (`D`) and treatment (`T`) labels; per-label scores would give a clearer picture.

#### Use Case Validation

In a spot check, the model's predictions paired the treatment “radiotherapy” with the disease “hereditary retinoblastoma”. This illustrates how the tagged output can be used to extract disease-treatment pairs from natural language text, although the pairing was not compared against the gold labels here.

### 📌 Conclusion
The CRF-based NER model tags diseases and treatments in biomedical text with a weighted F1 score of 0.9071 on the held-out test set and returns a plausible treatment for the spot-checked disease. These results suggest it is a reasonable starting point for downstream tasks such as automated medical literature mining. Uses such as treatment recommendation pipelines or clinical decision support systems would first need per-entity evaluation and clinical validation.