

# Project 3 — Medical Text Classification Report  

**Author:** _Arash Ganjouri_  
**Course/Section:** _Machine Learning & Deep Learning With Python_  

---



## Introduction

This report presents a comprehensive analysis of a classification system developed for a medical-NLP corpus, aiming to categorize medical transcripts into one of four classes: Surgery, Medical Records, Internal Medicine, and Other. The dataset comprises 4000 training, 500 validation, and 500 test transcripts, processed and analyzed using Python in Google Colab. The project employs Binary Bag-of-Words (BBoW) and Frequency Bag-of-Words (FBoW) representations to transform variable-length texts into fixed-length vectors, followed by training and evaluating multiple machine learning models. The objective is to assess model performance using weighted F1-scores, generate required output files, and compare the effectiveness of BBoW and FBoW representations, providing insights into their suitability for this classification task.

# 1: Data Loading and Preparation
**Purpose:** This section handles the initial setup by loading the CSV files containing the medical transcripts and their labels, preparing the data for further processing.

**What Happens:**

The script uses the pandas library to read the uploaded train.csv, valid.csv, and test.csv files from Google Colab.
It extracts the 'text' column (containing transcripts) and 'label' column (containing class labels 1-4) into separate lists.
To align with machine learning model requirements (e.g., XGBoost expects labels starting from 0), the labels are adjusted by subtracting 1 (e.g., 1 becomes 0, 2 becomes 1, etc.).
This ensures compatibility while preserving the original label meaning for output files.

**Why It’s Done:**

Loading data from CSV files allows the script to work with the provided dataset (4000 training, 500 validation, 500 test transcripts).
Adjusting labels to start from 0 is a common preprocessing step for classification algorithms; recent XGBoost releases, for example, raise a ValueError when the class labels do not start at 0.
This step sets the foundation for consistent data handling across vectorization and model training.

In [None]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import f1_score
from google.colab import files

# Upload files
uploaded = files.upload()

# Load CSV files
train_df = pd.read_csv('train.csv')
val_df = pd.read_csv('valid.csv')
test_df = pd.read_csv('test.csv')

# Extract texts and labels, adjust labels to start from 0
train_texts = train_df['text'].tolist()
train_labels = [label - 1 for label in train_df['label'].tolist()]  # Adjust labels: 1->0, 2->1, 3->2, 4->3
val_texts = val_df['text'].tolist()
val_labels = [label - 1 for label in val_df['label'].tolist()]
test_texts = test_df['text'].tolist()
test_labels = [label - 1 for label in test_df['label'].tolist()]

Saving test.csv to test (2).csv
Saving train.csv to train (2).csv
Saving valid.csv to valid (2).csv
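
The label shift described above can be illustrated on a tiny stand-in table. This is a minimal sketch: `toy_df` and its contents are hypothetical and only mimic the `text` and `label` columns of train.csv. It shows the shift to 0-3 for training and the shift back to 1-4 that is later needed for the output files.

```python
import pandas as pd

# Hypothetical miniature stand-in for train.csv with the same two columns.
toy_df = pd.DataFrame({
    "text": ["note a", "note b", "note c", "note d"],
    "label": [1, 2, 3, 4],
})

# Shift labels to 0-3 for model training (vectorized equivalent of label - 1).
zero_based = (toy_df["label"] - 1).tolist()

# Shift back to 1-4 when writing deliverables.
restored = [label + 1 for label in zero_based]

print(zero_based)  # [0, 1, 2, 3]
print(restored)    # [1, 2, 3, 4]
```

Checking that the shifted labels cover exactly `range(4)` before training is a cheap way to catch an unexpected label value in the CSV files.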


# 2: Text Preprocessing
**Purpose:** This section cleans and standardizes the text data to improve the quality of the vector representations.

**What Happens:**

The preprocess_text function converts all text to lowercase and removes a fixed set of punctuation characters (. , ! ? ; :) using string translation.
This processed text is used as input for both Binary Bag-of-Words (BBoW) and Frequency Bag-of-Words (FBoW) vectorization.

**Why It’s Done:**

Converting to lowercase ensures uniformity (e.g., "Surgery" and "surgery" are treated as the same word).
Removing punctuation reduces noise, focusing the analysis on meaningful words.
These steps align with the project’s instructions (step 2 under Instructions for BBoW and FBoW) to prepare text for vectorization.

In [None]:
# Text preprocessing
def preprocess_text(text):
    return text.lower().translate(str.maketrans("", "", ".,!?;:"))  # Remove punctuation
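
As a quick illustration of what this function does to a transcript, the sketch below applies it to two made-up sentences (the sample strings are assumptions, not taken from the dataset). It also shows that characters outside the listed set, such as hyphens and parentheses, are left in place.

```python
def preprocess_text(text):
    # Same logic as the report's function: lowercase, then drop . , ! ? ; :
    return text.lower().translate(str.maketrans("", "", ".,!?;:"))

sample = "Patient underwent Surgery; no complications noted!"
print(preprocess_text(sample))
# patient underwent surgery no complications noted

# Characters outside the listed set are kept, e.g. hyphens and parentheses.
print(preprocess_text("Follow-up (2 weeks)."))
# follow-up (2 weeks)
```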

# 3: Text Vectorization (BBoW and FBoW)
**Purpose:** This section converts variable-length text transcripts into fixed-length vectors, enabling their use in classification algorithms.

**What Happens:**

BBoW Vectorization: Uses CountVectorizer with binary=True to create a 10,000-dimensional vector per transcript. A value of 1 indicates a word’s presence, 0 indicates absence. The vocabulary is built only from the training set, limited to the top 10,000 most frequent words.
FBoW Vectorization: Uses CountVectorizer with binary=False to count word occurrences, followed by normalization to ensure the vector sums to 1 by dividing by the total word count per transcript.
Both vectorizers are fitted on the training data and applied to validation and test sets.

**Why It’s Done:**

Classification algorithms require fixed-length inputs, and vectorization transforms variable-length texts into a consistent format.
BBoW captures word presence (per project instruction 4 for BBoW), while FBoW captures word frequency relative to total words (per instruction 2 for FBoW), providing two perspectives for comparison.
Using only the training set for vocabulary (instruction 1) prevents data leakage, ensuring unbiased model evaluation.

In [None]:
# BBoW Vectorization
bbow_vectorizer = CountVectorizer(max_features=10000, binary=True)
train_texts_processed = [preprocess_text(text) for text in train_texts]
bbow_vectorizer.fit(train_texts_processed)
bbow_train = bbow_vectorizer.transform(train_texts_processed)
bbow_val = bbow_vectorizer.transform([preprocess_text(text) for text in val_texts])
bbow_test = bbow_vectorizer.transform([preprocess_text(text) for text in test_texts])

# FBoW Vectorization
fbow_vectorizer = CountVectorizer(max_features=10000, binary=False)
fbow_train = fbow_vectorizer.fit_transform(train_texts_processed)
fbow_val = fbow_vectorizer.transform([preprocess_text(text) for text in val_texts])
fbow_test = fbow_vectorizer.transform([preprocess_text(text) for text in test_texts])

# Normalize FBoW vectors to sum to 1
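def normalize_fbow(matrix):
    row_sums = matrix.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1  # keep all-zero rows as zeros instead of dividing by zero
    return matrix / row_sums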

fbow_train = normalize_fbow(fbow_train.toarray())
fbow_val = normalize_fbow(fbow_val.toarray())
fbow_test = normalize_fbow(fbow_test.toarray())
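
To make the difference between the two representations concrete, here is a minimal sketch on a two-document toy corpus (the documents are invented for illustration). It swaps in `sklearn.preprocessing.normalize` with `norm="l1"` as an alternative to the manual division above; for non-negative counts this also makes each row sum to 1, and it keeps the matrix sparse rather than converting it to a dense array.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

# Invented toy corpus; not part of the medical dataset.
toy_docs = ["pain pain fever", "fever cough"]

binary_vec = CountVectorizer(binary=True)
count_vec = CountVectorizer(binary=False)

bbow = binary_vec.fit_transform(toy_docs)   # 1 if the word is present, else 0
counts = count_vec.fit_transform(toy_docs)  # raw occurrence counts

# L1-normalize each row so the frequencies of a document sum to 1.
fbow = normalize(counts, norm="l1", axis=1)

# Words ordered by column index: ['cough', 'fever', 'pain']
vocab = sorted(count_vec.vocabulary_, key=count_vec.vocabulary_.get)
print(vocab)
print(bbow.toarray())   # [[0 1 1], [1 1 0]]
print(fbow.toarray())   # [[0, 1/3, 2/3], [0.5, 0.5, 0]]
```

In the first document "pain" occurs twice, so BBoW records only its presence while FBoW gives it twice the weight of "fever".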

# 4: Model Training and Evaluation
**Purpose:** This section trains and evaluates multiple machine learning models using both BBoW and FBoW representations, assessing their performance.

**What Happens:**

Four models (Logistic Regression, Decision Tree, Random Forest, XGBoost) are defined and trained on both BBoW and FBoW vectors.
The evaluate_model function fits each model, predicts labels for training, validation, and test sets, and calculates weighted F1-scores.
Results are stored in dictionaries (bbow_results and fbow_results) for later analysis.

**Why It’s Done:**

Training multiple models (per instruction (a) for BBoW and FBoW) allows comparison of their effectiveness.
F1-score evaluation (instruction (d)) provides a balanced measure of precision and recall, suitable for multi-class classification.
This step fulfills the project’s goal of classifying transcripts into the correct class (Surgery, Medical Records, Internal Medicine, Other).

In [None]:
# Train and evaluate models
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "XGBoost": XGBClassifier()
}

def evaluate_model(model, train_x, train_y, val_x, val_y, test_x, test_y):
    model.fit(train_x, train_y)
    train_pred = model.predict(train_x)
    val_pred = model.predict(val_x)
    test_pred = model.predict(test_x)
    return {
        "train_f1": f1_score(train_y, train_pred, average='weighted'),
        "val_f1": f1_score(val_y, val_pred, average='weighted'),
        "test_f1": f1_score(test_y, test_pred, average='weighted')
    }

# BBoW Results
bbow_results = {name: evaluate_model(model, bbow_train, train_labels, bbow_val, val_labels, bbow_test, test_labels)
                for name, model in models.items()}

# FBoW Results: train, validate, and test on the FBoW matrices
fbow_results = {name: evaluate_model(model, fbow_train, train_labels, fbow_val, val_labels, fbow_test, test_labels)
                for name, model in models.items()}
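
For readers unfamiliar with the weighted F1-score used in `evaluate_model`, the following sketch reproduces it by hand on a small invented label set: the per-class F1 values are averaged with weights proportional to the number of true samples in each class. The labels `y_true` and `y_pred` are made up for illustration.

```python
import numpy as np
from sklearn.metrics import f1_score

# Invented labels for three classes; not taken from the report's data.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]

per_class_f1 = f1_score(y_true, y_pred, average=None)  # one F1 per class
support = np.bincount(y_true)                          # true samples per class

# Support-weighted mean of the per-class scores.
manual_weighted = np.sum(per_class_f1 * support) / np.sum(support)
sklearn_weighted = f1_score(y_true, y_pred, average="weighted")

print(per_class_f1)      # per-class F1: 0.8, 0.8, 1.0
print(manual_weighted)   # 5/6, about 0.833
print(sklearn_weighted)  # same value
```

Because the weights follow class size, a large class dominates the score; on an imbalanced corpus it can be useful to look at the per-class values as well.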

# 5: Vocabulary and Dataset File Generation
**Purpose:** This section generates the required output files in the specified format for submission.

**What Happens:**

The BBoW vocabulary is extracted, sorted by ID, and written to vocab.txt with each line containing a word, its ID, and frequency (top 10,000 words).
The save_dataset function converts texts to BBoW vectors, replaces words with IDs, and appends the original label (1-4) to create train.txt, val.txt, and test.txt.
Output files are downloaded via Colab’s files.download in the final code cell.

**Why It’s Done:**

The vocabulary file (per Format of Deliverables) documents the word-to-ID mapping and frequencies, aiding reproducibility.
The train/valid/test files (per Format of Deliverables) provide data points with word IDs and labels, matching the example format (e.g., "100 8 3 1034 0").
Downloading ensures the deliverables are accessible for submission.

In [None]:
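# Print Vocabulary (BBoW): word, ID, and training-set frequency
vocab = bbow_vectorizer.vocabulary_
# Count total occurrences of each vocabulary word in the training set,
# using a count vectorizer pinned to the BBoW vocabulary so the column IDs match
count_vectorizer = CountVectorizer(vocabulary=vocab)
word_counts = np.asarray(count_vectorizer.fit_transform(train_texts_processed).sum(axis=0)).ravel()
vocab_list = [(word, idx, int(word_counts[idx])) for word, idx in vocab.items()]
vocab_list.sort(key=lambda x: x[1])
with open("vocab.txt", "w") as f:
    for word, idx, freq in vocab_list[:10000]:  # Top 10,000
        f.write(f"{word} {idx} {freq}\n")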

# Save Train/Val/Test data with IDs (using original labels 1-4 for consistency with deliverables)
def save_dataset(texts, labels, filename):
    with open(filename, "w") as f:
        for text, label in zip(texts, [l + 1 for l in labels]):  # Convert back to 1-4 for output
            vec = bbow_vectorizer.transform([preprocess_text(text)]).toarray()[0]
            ids = [i for i, v in enumerate(vec) if v > 0]
            f.write(" ".join(map(str, ids)) + " " + str(label) + "\n")

save_dataset(train_texts, train_labels, "train.txt")
save_dataset(val_texts, val_labels, "val.txt")
save_dataset(test_texts, test_labels, "test.txt")
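
A simple way to sanity-check the generated files is to read a line back into its word IDs and label. The sketch below assumes the line layout described above (space-separated word IDs followed by the label as the last field); `parse_line` is a hypothetical helper, not part of the project code.

```python
def parse_line(line):
    """Split one line of train.txt/val.txt/test.txt into (word_ids, label)."""
    fields = [int(token) for token in line.split()]
    return fields[:-1], fields[-1]

# The example line from the deliverable format.
ids, label = parse_line("100 8 3 1034 0\n")
print(ids, label)  # [100, 8, 3, 1034] 0

# A transcript with no in-vocabulary words yields an empty ID list.
print(parse_line(" 3\n"))  # ([], 3)
```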

# 6: Results Analysis
**Purpose:** This section analyzes the model performance and compares BBoW and FBoW representations.

**What Happens:**

The best-performing model for each representation is identified based on the test set F1-score.
A simple comparison determines which representation (BBoW or FBoW) performed better, with a basic explanation (e.g., BBoW’s binary nature vs. FBoW’s frequency weighting).

**Why It’s Done:**

Identifying the best model (instruction (d) and (e) for BBoW, (a) for FBoW) helps understand which algorithm suits the data.
Comparing representations (instruction (a) for FBoW) provides insight into whether word presence or frequency better captures transcript type, fulfilling the project’s analytical requirements.

In [None]:
# Analysis and Results
best_bbow_model = max(bbow_results, key=lambda x: bbow_results[x]["test_f1"])
best_fbow_model = max(fbow_results, key=lambda x: fbow_results[x]["test_f1"])
print(f"Best BBoW Model: {best_bbow_model}, Test F1: {bbow_results[best_bbow_model]['test_f1']}")
print(f"Best FBoW Model: {best_fbow_model}, Test F1: {fbow_results[best_fbow_model]['test_f1']}")
if bbow_results[best_bbow_model]['test_f1'] > fbow_results[best_fbow_model]['test_f1']:
    print("BBoW performed better due to binary representation capturing presence effectively.")
else:
    print("FBoW performed better due to frequency capturing term importance.")

# Download output files
files.download('vocab.txt')
files.download('train.txt')
files.download('val.txt')
files.download('test.txt')

Best BBoW Model: XGBoost, Test F1: 0.5994335694409599
Best FBoW Model: Random Forest, Test F1: 0.3225459612100757
BBoW performed better due to binary representation capturing presence effectively.
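
The result dictionaries can also be laid out as a table to compare all models at once. This is a minimal sketch under stated assumptions: `example_results` holds placeholder numbers with the same structure as `bbow_results`, not the report's actual scores, and the selection by validation F1 is a common alternative to the test-set selection used above, not the method of this report.

```python
import pandas as pd

# Placeholder scores for illustration only; these are NOT the report's results.
example_results = {
    "Model A": {"train_f1": 0.9, "val_f1": 0.6, "test_f1": 0.5},
    "Model B": {"train_f1": 0.8, "val_f1": 0.7, "test_f1": 0.6},
}

# One row per model, one column per split.
table = pd.DataFrame.from_dict(example_results, orient="index")

# A large gap between training and test F1 is a sign of overfitting.
table["train_test_gap"] = table["train_f1"] - table["test_f1"]

# Pick the model by validation score so the test set stays untouched.
best_by_val = table["val_f1"].idxmax()

print(table.round(3))
print(best_by_val)  # Model B
```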



## Conclusion

This report implements a classification system for the medical-NLP corpus, using BBoW and FBoW representations of 4000 training, 500 validation, and 500 test transcripts. The preprocessing steps ensured data compatibility, while vectorization provided two distinct approaches to feature extraction, with BBoW encoding word presence and FBoW encoding relative frequency. Logistic Regression, Decision Tree, Random Forest, and XGBoost were evaluated with weighted F1-scores. In the recorded run, the best BBoW model (XGBoost) reached a test F1 of about 0.60 and the best FBoW model (Random Forest) about 0.32, so BBoW outperformed FBoW on this task. The modest test scores leave clear room for better generalization. Future improvements could involve hyperparameter tuning or cross-validation to enhance generalization and reduce the training-test performance gap.