# Natural Language Processing. Lab 5. Feature Engineering with Parsing Competition.

## Feature Engineering with Parsing

Constituency and dependency parsing can be used to build features for text classification. For example, constituency parsing yields phrases that can be passed to the classifier, while dependency parsing lets you filter words by the type of relation they take part in.
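
As a minimal sketch of the second idea, the snippet below filters words by dependency relation. To stay self-contained it works on a hand-written parse rather than real parser output: the `Token` tuple, the `KEEP_RELATIONS` set, and the example sentence are all illustrative assumptions. The relation labels (`nsubj`, `dobj`, `det`, ...) follow the naming used by spaCy's English models.

```python
from typing import NamedTuple


class Token(NamedTuple):
    text: str  # surface form
    dep: str   # dependency relation to the head
    head: int  # index of the head token (the root points to itself)


# Hand-written parse of "The patient denies chest pain." (an assumption,
# not produced by a parser).
parsed = [
    Token("The", "det", 1),
    Token("patient", "nsubj", 2),
    Token("denies", "ROOT", 2),
    Token("chest", "compound", 4),
    Token("pain", "dobj", 2),
    Token(".", "punct", 2),
]

# Relations assumed to carry most of the content; tune this for your task.
KEEP_RELATIONS = {"nsubj", "ROOT", "dobj", "compound"}


def filter_by_relation(tokens, keep):
    """Return the lower-cased words whose dependency relation is in `keep`."""
    return [tok.text.lower() for tok in tokens if tok.dep in keep]


filtered = filter_by_relation(parsed, KEEP_RELATIONS)
print(" ".join(filtered))
```

With a real parser, the same filter could be applied to each token's text and relation label, and the filtered string passed to the classifier in place of the full transcription.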

### Data


In [1]:
import pandas as pd

df = pd.read_csv("/kaggle/input/nlp-week-5-feature-engineering-using-parsing/train.csv")
df.head()

Unnamed: 0,id,medical_specialty,transcription
0,0,Cardiovascular / Pulmonary,"PREOPERATIVE DIAGNOSIS: , Persistent pneumonia..."
1,1,General Medicine,"REASON FOR VISIT: , Mr. ABC is a 30-year-old m..."
2,2,Cardiovascular / Pulmonary,"REASON FOR CONSULTATION: , Mesothelioma.,HISTO..."
3,3,General Medicine,"DISCHARGE DIAGNOSES:,1. Chronic obstructive pu..."
4,4,Cardiovascular / Pulmonary,"CHIEF COMPLAINT:, The patient complains of che..."


In [2]:
df['medical_specialty'] = df['medical_specialty'].apply(lambda x: x.strip())
df['medical_specialty'].unique()

array(['Cardiovascular / Pulmonary', 'General Medicine', 'Surgery',
       'Gastroenterology', 'Consult - History and Phy.'], dtype=object)

In [3]:
class_mapping = {
    0: "Cardiovascular / Pulmonary",
    1: "Consult - History and Phy.",
    2: "Gastroenterology",
    3: "General Medicine",
    4: "Surgery"
}

inverse_class_mapping = {v: k for k, v in class_mapping.items()}

inverse_class_mapping

{'Cardiovascular / Pulmonary': 0,
 'Consult - History and Phy.': 1,
 'Gastroenterology': 2,
 'General Medicine': 3,
 'Surgery': 4}

In [4]:
df['medical_specialty'] = df['medical_specialty'].apply(lambda x: inverse_class_mapping[x])
df.head()

Unnamed: 0,id,medical_specialty,transcription
0,0,0,"PREOPERATIVE DIAGNOSIS: , Persistent pneumonia..."
1,1,3,"REASON FOR VISIT: , Mr. ABC is a 30-year-old m..."
2,2,0,"REASON FOR CONSULTATION: , Mesothelioma.,HISTO..."
3,3,3,"DISCHARGE DIAGNOSES:,1. Chronic obstructive pu..."
4,4,0,"CHIEF COMPLAINT:, The patient complains of che..."


In [5]:
df['medical_specialty'].unique()

array([0, 3, 4, 2, 1])

In [6]:
df['transcription'][99]

"HISTORY: , The patient is a 15-year-old female who was seen in consultation at the request of Dr. X on 05/15/2008 regarding enlarged tonsils.  The patient has been having difficult time with having two to three bouts of tonsillitis this year.  She does average about four bouts of tonsillitis per year for the past several years.  She notes that throat pain and fever with the actual infections.  She is having no difficulty with swallowing.  She does have loud snoring, though there have been no witnessed observed sleep apnea episodes.  She is a mouth breather at nighttime, however.  The patient does feel that she has a cold at today's visit.  She has had tonsil problems again for many years.  She does note a history of intermittent hoarseness as well.  This is particularly prominent with the current cold that she has had.  She had been seen by Dr. Y in Muskegon who had also recommended a tonsillectomy, but she reports she would like to get the surgery done here in the Ludington area as t

In [7]:
# Word counts per transcription, computed once and reused
lengths = df['transcription'].str.split().str.len()
lengths.max(), lengths.min(), lengths.mean(), lengths.std()

(2332.0, 1.0, 476.18797758532855, 299.2589859824925)

In [8]:
df.isna().sum()

id                    0
medical_specialty     0
transcription        18
dtype: int64

In [9]:
df.dropna(inplace=True)

In [10]:
df['medical_specialty'].unique()

array([0, 3, 4, 2, 1])

### Dependency parsing


In [11]:
%pip install spacy networkx

Note: you may need to restart the kernel to use updated packages.


In [12]:
import spacy
import networkx as nx
spacy.prefer_gpu()
nlp = spacy.load("en_core_web_sm")

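def extract_features(text):
    """Extract dependency-based features from one transcription."""
    doc = nlp(text)

    # Subject-verb-object triples: for every subject token, take its head as
    # the verb and the last object-like child of that head as the object.
    svos = []
    for token in doc:
        if "subj" in token.dep_:
            subject = token.text
            verb = token.head.text
            obj = None
            for child in token.head.children:
                if "obj" in child.dep_:
                    obj = child.text
            if obj:
                svos.append(f"{subject} {verb} {obj}")

    # Count adjectival and adverbial modifiers.
    head_modifiers_cntr = 0
    for token in doc:
        if token.dep_ in ("amod", "advmod"):
            head_modifiers_cntr += 1

    # Undirected dependency graph. Nodes are keyed by token text, so repeated
    # words collapse into a single node.
    dep_graph = nx.Graph()
    for token in doc:
        for child in token.children:
            dep_graph.add_edge(token.text, child.text)

    # Average number of nodes on the shortest head-to-child path. Each such
    # pair is directly connected, so a path has 2 nodes (1 when head and child
    # share the same text), which keeps this value close to 2.
    avg_dep_path = 0
    if len(dep_graph.nodes) > 1:
        path_lengths = [len(nx.shortest_path(dep_graph, source=token.text, target=child.text))
                        for token in doc for child in token.children if nx.has_path(dep_graph, token.text, child.text)]
        avg_dep_path = sum(path_lengths) / len(path_lengths) if path_lengths else 0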

    return {
        "number_of_words": len(text.split()),
        "svo_count": len(svos),
        "head_modifier_count": head_modifiers_cntr,
        "avg_dep_path": avg_dep_path,
        "svo_features": " ".join(svos).lower(),
       
    }
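
A depth-based statistic is one possible complement to `avg_dep_path`, which only looks at direct head-to-child pairs. The sketch below computes the depth of every token in a dependency tree from a list of head indices, the form spaCy exposes through `token.head.i`. The `token_depths` helper and the `heads` list are hypothetical: the list is written by hand for a toy sentence and is assumed to describe a well-formed tree.

```python
def token_depths(heads):
    """Depth of every token; the root (which is its own head) has depth 0.

    Assumes `heads` encodes a well-formed tree, i.e. it contains no cycles
    other than the root pointing to itself.
    """
    depths = []
    for i in range(len(heads)):
        depth, node = 0, i
        while heads[node] != node:  # climb until the root is reached
            node = heads[node]
            depth += 1
        depths.append(depth)
    return depths


# Hand-written head indices for "The patient denies chest pain ."
heads = [1, 2, 2, 4, 2, 2]

depths = token_depths(heads)
avg_depth = sum(depths) / len(depths)
max_depth = max(depths)
print(depths, round(avg_depth, 3), max_depth)
```

Inside `extract_features`, the list could be built as `[token.head.i for token in doc]`, and `avg_depth` and `max_depth` returned as additional numerical features.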

### Preprocess data

In [13]:
from tqdm import tqdm

features_list = []

for text in tqdm(df["transcription"], desc="Extracting Features"):
    features_list.append(extract_features(text))

new_features = pd.DataFrame(features_list)

df = df.reset_index(drop=True)
new_features = new_features.reset_index(drop=True)

df_processed = pd.concat([df.drop(columns=["transcription"]), new_features], axis=1)

df_processed.head()


Extracting Features: 100%|██████████| 1963/1963 [03:04<00:00, 10.67it/s]


Unnamed: 0,id,medical_specialty,number_of_words,svo_count,head_modifier_count,avg_dep_path,svo_features
0,0,0,247,4,33,1.996491,he underwent anesthesia trachea had appearance...
1,1,3,497,14,47,2.0,that disrupted sleep cpap limited snoring part...
2,2,0,769,17,104,1.997897,he underwent vats he had tube he had fibrillat...
3,3,3,364,5,48,2.0,she denied history she received steroids medic...
4,4,0,495,13,62,1.998314,he had infarctions patient used amphetamines h...


In [14]:
df_processed.describe()

Unnamed: 0,id,medical_specialty,number_of_words,svo_count,head_modifier_count,avg_dep_path
count,1963.0,1963.0,1963.0,1963.0,1963.0,1963.0
mean,992.479878,2.463576,476.187978,8.481915,66.633214,1.996249
std,570.852881,1.570651,299.258986,8.575625,45.898471,0.004142
min,0.0,0.0,1.0,0.0,0.0,1.925926
25%,499.5,1.0,258.0,3.0,33.5,1.994636
50%,991.0,3.0,412.0,6.0,56.0,1.996928
75%,1487.5,4.0,622.5,11.5,89.0,2.0
max,1980.0,4.0,2332.0,73.0,342.0,2.0


In [15]:
df_processed['medical_specialty'].unique()

array([0, 3, 4, 2, 1])

In [16]:
df_processed['svo_features'].str.split().str.len().max()

219

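The text is embedded inside the model, in the model section below. Here we only prepare the data: tokenization, the dataset class, the train/validation split, and the data loaders.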

In [17]:
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer

bert_model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(bert_model_name)

class MedDataset(Dataset):
    def __init__(self, df):
        self.texts = df['svo_features'].fillna("").astype(str).tolist()
        self.labels = df['medical_specialty'].astype(int).values
        self.numerical_features = df[["number_of_words", 
                                      "svo_count", 
                                      "head_modifier_count", 
                                      "avg_dep_path"]].values
    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = torch.tensor(self.labels[idx], dtype=torch.long)
        num_features = torch.tensor(self.numerical_features[idx], dtype=torch.float32)

        if not isinstance(text, str) or text.strip() == "":
            text = "[UNK]"
        
        encoding = tokenizer(
            text,
            return_tensors="pt",
            padding="max_length",
            truncation=True,
            max_length=220
        )

        input_ids = encoding["input_ids"].squeeze(0)
        attention_mask = encoding["attention_mask"].squeeze(0)


        return input_ids, attention_mask, num_features, label


In [18]:
from sklearn.model_selection import train_test_split
from torch.utils.data import Subset

dataset = MedDataset(df_processed)

train_indices, val_indices  = train_test_split(range(len(dataset)), test_size=0.05, random_state=42)

train_data = Subset(dataset, train_indices)
val_data = Subset(dataset, val_indices)

In [19]:
batch_size = 16

def collate_fn(batch):
    input_ids = torch.stack([item[0] for item in batch])
    attention_mask = torch.stack([item[1] for item in batch])
    num_features = torch.stack([item[2] for item in batch])
    labels = torch.stack([item[3] for item in batch])
    
    return input_ids, attention_mask, num_features, labels

# Create DataLoaders (reuse batch_size instead of hard-coding 16)
train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)
val_loader = DataLoader(val_data, batch_size=batch_size, shuffle=False, collate_fn=collate_fn)
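
The numerical features live on very different scales (`number_of_words` reaches the thousands while `avg_dep_path` stays near 2), which may make the small feed-forward branch harder to train. One common remedy is z-score standardization with statistics taken from the training rows only. The sketch below shows the idea with the standard library; the `fit_scaler` and `standardize` helpers and the toy rows are illustrative assumptions, not part of the notebook.

```python
from statistics import mean, pstdev


def fit_scaler(rows):
    """Column-wise mean and population std, computed on training rows."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Fall back to 1.0 for constant columns so they map to 0 instead of
    # raising a division-by-zero error.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def standardize(rows, means, stds):
    """Apply z-scoring with previously fitted statistics."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# Toy rows: (number_of_words, svo_count, head_modifier_count, avg_dep_path)
train_rows = [
    [200.0, 4.0, 30.0, 2.0],
    [400.0, 8.0, 50.0, 2.0],
    [600.0, 12.0, 70.0, 2.0],
]
val_rows = [[400.0, 10.0, 40.0, 2.0]]

means, stds = fit_scaler(train_rows)
train_scaled = standardize(train_rows, means, stds)
val_scaled = standardize(val_rows, means, stds)  # reuse training statistics
print(val_scaled)
```

In the notebook, the same statistics would be applied to the four numerical columns of the training, validation, and test frames before the tensors are built.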


### Create model

In [20]:
import torch
import torch.nn as nn
import torch.optim as optim
from transformers import BertTokenizer, BertModel

class MedicalClassifier(nn.Module):
    def __init__(self, num_classes, num_structured_features):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_model_name)  # Load the pretrained BERT encoder
        
        self.feature_fc = nn.Sequential(
            nn.Linear(num_structured_features, 16),
            nn.ReLU(),
            nn.Linear(16, 8),
            nn.ReLU()
        )
        
        self.final_fc = nn.Linear(768 + 8, num_classes)


    def forward(self, input_ids, attention_mask, num_features):
        bert_out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = bert_out.last_hidden_state[:, 0, :]
        structured_output = self.feature_fc(num_features)

        combined_features = torch.cat((cls, structured_output), dim=1)

        logits = self.final_fc(combined_features)

        return logits
        

In [21]:
model = MedicalClassifier(5, 4)
criterion = nn.CrossEntropyLoss()
optimizer = optim.AdamW(model.parameters(), lr=2e-5)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)


MedicalClassifier(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

In [22]:
from sklearn.metrics import accuracy_score, f1_score


epochs = 5

for epoch in range(epochs):
    model.train()
    total_train_loss = 0
    all_train_preds, all_train_labels = [], []

    # Training
    for input_ids, attention_mask, num_features, labels in tqdm(train_loader, desc=f"Epoch {epoch+1}/{epochs} [Training]"):
        # Move tensors to device
        input_ids, attention_mask, num_features, labels = (
            input_ids.to(device), 
            attention_mask.to(device), 
            num_features.to(device), 
            labels.to(device)
        )

        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask, num_features)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        total_train_loss += loss.item()

        # Store predictions & labels
        preds = torch.argmax(outputs, dim=1).cpu().numpy()
        all_train_preds.extend(preds)
        all_train_labels.extend(labels.cpu().numpy())

    avg_train_loss = total_train_loss / len(train_loader)
    train_accuracy = accuracy_score(all_train_labels, all_train_preds)
    train_f1 = f1_score(all_train_labels, all_train_preds, average="macro")  # Macro-averaged F1

    # Validation
    model.eval()
    total_val_loss = 0
    all_val_preds, all_val_labels = [], []

    with torch.no_grad():
        for input_ids, attention_mask, num_features, labels in tqdm(val_loader, desc=f"Epoch {epoch+1}/{epochs} [Validation]"):
            # Move tensors to device
            input_ids, attention_mask, num_features, labels = (
                input_ids.to(device), 
                attention_mask.to(device), 
                num_features.to(device), 
                labels.to(device)
            )

            outputs = model(input_ids, attention_mask, num_features)
            loss = criterion(outputs, labels)
            total_val_loss += loss.item()

            preds = torch.argmax(outputs, dim=1).cpu().numpy()
            all_val_preds.extend(preds)
            all_val_labels.extend(labels.cpu().numpy())

    avg_val_loss = total_val_loss / len(val_loader)
    val_accuracy = accuracy_score(all_val_labels, all_val_preds)
    val_f1 = f1_score(all_val_labels, all_val_preds, average="macro")  # Macro-averaged F1

    print(f"Epoch {epoch+1} | Train Loss: {avg_train_loss:.4f} | Train Acc: {train_accuracy:.4f} | Train F1: {train_f1:.4f}")
    print(f"Epoch {epoch+1} | Val Loss: {avg_val_loss:.4f} | Val Acc: {val_accuracy:.4f} | Val F1: {val_f1:.4f}")

Epoch 1/5 [Training]: 100%|██████████| 117/117 [00:43<00:00,  2.68it/s]
Epoch 1/5 [Validation]: 100%|██████████| 7/7 [00:00<00:00,  9.30it/s]


Epoch 1 | Train Loss: 1.2487 | Train Acc: 0.5456 | Train F1: 0.3307
Epoch 1 | Val Loss: 0.9957 | Val Acc: 0.6465 | Val F1: 0.3407


Epoch 2/5 [Training]: 100%|██████████| 117/117 [00:43<00:00,  2.71it/s]
Epoch 2/5 [Validation]: 100%|██████████| 7/7 [00:00<00:00,  9.24it/s]


Epoch 2 | Train Loss: 1.0277 | Train Acc: 0.6068 | Train F1: 0.4064
Epoch 2 | Val Loss: 1.0205 | Val Acc: 0.6263 | Val F1: 0.3914


Epoch 3/5 [Training]: 100%|██████████| 117/117 [00:43<00:00,  2.69it/s]
Epoch 3/5 [Validation]: 100%|██████████| 7/7 [00:00<00:00,  8.98it/s]


Epoch 3 | Train Loss: 0.8938 | Train Acc: 0.6513 | Train F1: 0.4756
Epoch 3 | Val Loss: 1.0477 | Val Acc: 0.6465 | Val F1: 0.3571


Epoch 4/5 [Training]: 100%|██████████| 117/117 [00:43<00:00,  2.68it/s]
Epoch 4/5 [Validation]: 100%|██████████| 7/7 [00:00<00:00,  8.99it/s]


Epoch 4 | Train Loss: 0.7735 | Train Acc: 0.6786 | Train F1: 0.5369
Epoch 4 | Val Loss: 1.2206 | Val Acc: 0.5758 | Val F1: 0.3502


Epoch 5/5 [Training]: 100%|██████████| 117/117 [00:43<00:00,  2.67it/s]
Epoch 5/5 [Validation]: 100%|██████████| 7/7 [00:00<00:00,  9.18it/s]

Epoch 5 | Train Loss: 0.7181 | Train Acc: 0.6819 | Train F1: 0.5556
Epoch 5 | Val Loss: 1.3115 | Val Acc: 0.5657 | Val F1: 0.3589
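
The log above shows validation loss rising after the first epoch while training loss keeps falling, a typical overfitting pattern, yet the prediction step uses the weights from the last epoch. A minimal sketch of picking the best epoch from the logged metrics follows. The `history` list simply copies the numbers printed above; in a real loop one would likely save a checkpoint, for example with `torch.save(model.state_dict(), path)`, whenever the tracked metric improves.

```python
# (epoch, val_loss, val_macro_f1) copied from the training log above.
history = [
    (1, 0.9957, 0.3407),
    (2, 1.0205, 0.3914),
    (3, 1.0477, 0.3571),
    (4, 1.2206, 0.3502),
    (5, 1.3115, 0.3589),
]

# The two criteria can disagree, so decide up front which one to track.
best_by_loss = min(history, key=lambda row: row[1])
best_by_f1 = max(history, key=lambda row: row[2])

print(f"Lowest validation loss: epoch {best_by_loss[0]}")
print(f"Highest validation macro F1: epoch {best_by_f1[0]}")
```

Since the competition metric is not stated here, which criterion to prefer is left open; with a validation set this small, either estimate is noisy.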





### Perform prediction


In [23]:
test_df = pd.read_csv("/kaggle/input/nlp-week-5-feature-engineering-using-parsing/test.csv")
test_df.head()

Unnamed: 0,id,transcription
0,0,"INDICATIONS FOR PROCEDURE:, The patient has pr..."
1,1,"CLINICAL HISTORY: ,This 78-year-old black woma..."
2,2,"PREOPERATIVE DIAGNOSIS: , Penoscrotal abscess...."
3,3,"INDICATIONS:, Ischemic cardiomyopathy, status..."
4,4,"PREOPERATIVE DIAGNOSIS: , Ruptured distal bice..."


In [24]:
from tqdm import tqdm

features_list = []

for text in tqdm(test_df["transcription"], desc="Extracting Features"):
    features_list.append(extract_features(text))

new_features = pd.DataFrame(features_list)

test_df = test_df.reset_index(drop=True)
new_features = new_features.reset_index(drop=True)

test_df = pd.concat([test_df.drop(columns=["transcription"]), new_features], axis=1)

test_df.head()

Extracting Features: 100%|██████████| 495/495 [00:43<00:00, 11.43it/s]


Unnamed: 0,id,number_of_words,svo_count,head_modifier_count,avg_dep_path,svo_features
0,0,303,5,78,1.988981,she had stenosis ultrasound showed stenosis sh...
1,1,481,6,59,1.998172,woman has history she noted complaints ecg sho...
2,2,453,5,58,2.0,patient had changes he need penectomy patient ...
3,3,87,2,9,2.0,electrocardiogram revealed pacemaker fibrillat...
4,4,615,0,87,1.985896,


In [25]:
class TestDataset(Dataset):
    def __init__(self, df):
        self.texts = df['svo_features'].fillna("").astype(str).tolist()
        self.numerical_features = df[["number_of_words", "svo_count", "head_modifier_count", "avg_dep_path"]].values
        self.ids = df["id"].values  # Store IDs for submission

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        num_features = torch.tensor(self.numerical_features[idx], dtype=torch.float32)
        sample_id = self.ids[idx]  # Extract ID

        if not isinstance(text, str) or text.strip() == "":
            text = "[UNK]"  # Replace empty text with BERT's unknown token

        encoding = tokenizer(
            text,
            return_tensors="pt",
            padding="max_length",
            truncation=True,
            max_length=220
        )

        input_ids = encoding["input_ids"].squeeze(0)
        attention_mask = encoding["attention_mask"].squeeze(0)

        return input_ids, attention_mask, num_features, sample_id

# Load test dataset
test_dataset = TestDataset(test_df) 
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)


In [26]:
model.eval() 

predictions = []

with torch.no_grad():
    for input_ids, attention_mask, num_features, sample_ids in tqdm(test_loader, desc="Generating Predictions"):
        input_ids, attention_mask, num_features = (
            input_ids.to(device), 
            attention_mask.to(device), 
            num_features.to(device)
        )

        outputs = model(input_ids, attention_mask, num_features)
        preds = torch.argmax(outputs, dim=1).cpu().numpy()  # Convert logits to predicted class

        for sample_id, pred in zip(sample_ids, preds):
            predictions.append((sample_id.item(), pred))


Generating Predictions: 100%|██████████| 31/31 [00:03<00:00,  8.16it/s]


In [27]:
submission_df = pd.DataFrame(predictions, columns=["id", "class_id"])

submission_df = submission_df.sort_values(by="id")

submission_df.to_csv("submission.csv", index=False)

print("Submission file saved as `submission.csv`!")


Submission file saved as `submission.csv`!
