# Model Exploration
The objective of this project is to evaluate three approaches to accurately summarizing data: a naive approach, a non-deep-learning approach, and a neural network-based deep learning approach.

In [27]:
# Imports
import os
import re
import nltk
from rouge import Rouge
from bert_score import score

nltk.download("punkt_tab") # For sentence tokenization

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/mariam/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [28]:
# Paths
PROCESSED_DATA_DIR = "data/processed"

### Naive Approach
Return the first few sentences of the passage as a summary

In [29]:
def get_passage_by_topic(topic):
    '''Return the full text of the first processed .txt file that mentions the topic, or None if no file does.'''
    for filename in os.listdir(PROCESSED_DATA_DIR):
        if filename.endswith(".txt"):
            with open(os.path.join(PROCESSED_DATA_DIR, filename), "r", encoding="utf-8") as f:
                text = f.read()
            if re.search(rf"\b{re.escape(topic)}\b", text, re.IGNORECASE): # Search for whole word match
                return text
    return None

In [30]:
def naive_summary(text, n_sentences=3):
    '''Return the first n_sentences sentences of the text as the naive summary.'''
    sentences = nltk.sent_tokenize(text) # Tokenize into sentences
    return " ".join(sentences[:n_sentences]) # Return the first n sentences

In [31]:
def evaluate_summary(reference, generated):
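    '''Compute ROUGE (1, 2, L) and BERTScore F1 for a generated summary against a reference.'''
    rouge = Rouge()
    rouge_scores = rouge.get_scores(generated, reference)[0] # Hypothesis first, then reference; take the single pair's score dict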

    # Compute BERTScore
    P, R, F1 = score([generated], [reference], lang="en", verbose=False)
    bert_f1 = F1.mean().item()

    return rouge_scores, bert_f1

In [36]:
topic = "bone"  # can be changed to any anatomy topic
passage = get_passage_by_topic(topic)

if passage:
    summary = naive_summary(passage, n_sentences=5)
    print(summary)

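    # Placeholder reference for evaluation: the first half of the passage. Replace with gold summaries if available.
    # The naive summary is a prefix of the passage, so it overlaps this reference almost entirely and precision is inflated.
    reference = passage[:len(passage)//2]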

    rouge_scores, bert_f1 = evaluate_summary(reference, summary)

    print("\nEvaluation")
    print("ROUGE:", rouge_scores)
    print("BERTScore F1:", round(bert_f1, 4))
else:
    print(f"No passages found for topic '{topic}'.")

COMPILED BY HOWIE BAUM OUTLINE AND SCHEDULE - “YOUR AMAZING HUMAN BODY” MODERATOR: Howie Baum WEEK 1 A. Introduction to the class B. Anatomy and Physiology C. Levels of organization of the Human Body D. Characteristics and Maintenance of Life E. Homeostasis and Feedback F. Body Cavities, Membranes, and the 11 Body / Organ Systems G. Diagnostic Imaging techniques and the different types of microscopes and devices for studying the body 2 WEEK 2 ➢ Introductory Chemistry about the atoms and molecules in the body ➢ The importance of Minerals, Vitamins, and Trace mineral elements for the body WEEK 3 ➢ Cells and Tissues ➢ Circulatory System WEEK 4 ➢ Endocrine System ➢ Digestive System WEEK 5 ➢ Immune System ➢ Muscular System WEEK 6 ➢ Nervous System ➢ Integumentary System WEEK 7 ➢ Urinary System ➢ Respiratory System WEEK 8 ➢ Skeletal System/ Joints ➢ Reproductive Systems – Female and Male 4 AN INTRODUCTION TO THE HUMAN BODY ➢ The number of humans in the world now is 7.53 billion (7, 530,000,00

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Evaluation
ROUGE: {'rouge-1': {'r': 0.1687715269804822, 'p': 0.9932432432432432, 'f': 0.28851815257106156}, 'rouge-2': {'r': 0.11761363636363636, 'p': 0.981042654028436, 'f': 0.21004566018861523}, 'rouge-l': {'r': 0.1687715269804822, 'p': 0.9932432432432432, 'f': 0.28851815257106156}}
BERTScore F1: 0.9594
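
The ROUGE output above shows precision near 0.99 but recall near 0.17. A likely reason is the placeholder reference: the naive summary is the opening sentences of the passage and the reference is the first half of the same passage, so almost every summary token also appears in the reference. The sketch below is a simplified ROUGE-1 on whitespace tokens, written only to illustrate that effect. The function name `rouge1_sketch` is hypothetical, and the `rouge` package may tokenize and count differently, so the numbers will not match it exactly.

```python
from collections import Counter

def rouge1_sketch(reference, generated):
    '''Simplified ROUGE-1: clipped unigram overlap on lowercased whitespace tokens.'''
    ref_counts = Counter(reference.lower().split())
    gen_counts = Counter(generated.lower().split())
    overlap = sum((ref_counts & gen_counts).values()) # Clipped count of shared unigrams

    precision = overlap / max(sum(gen_counts.values()), 1) # Share of summary tokens found in the reference
    recall = overlap / max(sum(ref_counts.values()), 1)    # Share of reference tokens covered by the summary
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"p": precision, "r": recall, "f": f1}

# A summary that is a prefix of the reference: perfect precision, low recall
toy_reference = "a b c d e f g h"
toy_summary = "a b"
print(rouge1_sketch(toy_reference, toy_summary))
```

With independent gold summaries as references, precision would no longer be high by construction, which makes the comparison between approaches more meaningful.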


###ANALYSIS###

### Classical Machine Learning Approach
Use TextRank, a graph-based ranking algorithm, to select key sentences from the passage.
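
As a starting point, here is a minimal sketch of extractive TextRank using only the standard library. It is an assumption about how this step could look, not the final implementation: sentences are split with a simple regex (the notebook's `nltk.sent_tokenize` could be swapped in), similarity is word overlap normalized by log sentence lengths, and scores come from a PageRank-style iteration. The names `split_sentences`, `sentence_similarity`, and `textrank_summary` are hypothetical.

```python
import math
import re

def split_sentences(text):
    '''Split on sentence-ending punctuation (simple stand-in for nltk.sent_tokenize).'''
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def sentence_similarity(a, b):
    '''Shared words divided by the sum of log word counts of the two sentences.'''
    words_a = set(re.findall(r"\w+", a.lower()))
    words_b = set(re.findall(r"\w+", b.lower()))
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0 # Avoid a zero denominator, since log(1) == 0
    return len(words_a & words_b) / (math.log(len(words_a)) + math.log(len(words_b)))

def textrank_summary(text, n_sentences=3, damping=0.85, iterations=50):
    '''Return the n highest-ranked sentences, kept in their original order.'''
    sentences = split_sentences(text)
    n = len(sentences)
    if n <= n_sentences:
        return " ".join(sentences)

    # Weighted, undirected sentence graph stored as a similarity matrix
    sim = [[0.0 if i == j else sentence_similarity(sentences[i], sentences[j])
            for j in range(n)] for i in range(n)]
    out_weight = [sum(row) for row in sim]

    # PageRank-style power iteration over the sentence graph
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n + damping * sum(
                sim[j][i] / out_weight[j] * scores[j]
                for j in range(n) if out_weight[j] > 0
            )
            for i in range(n)
        ]

    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)[:n_sentences]
    return " ".join(sentences[i] for i in sorted(ranked))

sample = ("Bones support the body. Bones protect organs in the body. "
          "The skeleton contains many bones. Cats like warm milk. "
          "Bones store minerals for the body.")
print(textrank_summary(sample, n_sentences=2))
```

The output can be passed to `evaluate_summary` in the same way as the naive summary, so both approaches are scored against the same reference.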

###ANALYSIS###

### Neural Network-based Deep Learning Approach
Train a BART model end-to-end to rephrase the content into comprehensive summaries
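
Before any training, a pretrained checkpoint can give a reference point for this approach. The sketch below is not the end-to-end training described above; it only runs inference with an off-the-shelf model. The checkpoint `facebook/bart-large-cnn`, the helper names `load_bart` and `bart_summary`, and the generation settings are all assumptions chosen for illustration. It assumes `transformers` and `torch` are installed.

```python
def load_bart(checkpoint="facebook/bart-large-cnn"):
    '''Load a pretrained BART tokenizer and model (downloads weights on first use).'''
    # Imported here so the helpers can be defined even before transformers is installed
    from transformers import BartForConditionalGeneration, BartTokenizer
    tokenizer = BartTokenizer.from_pretrained(checkpoint)
    model = BartForConditionalGeneration.from_pretrained(checkpoint)
    return tokenizer, model

def bart_summary(text, tokenizer, model, max_input_tokens=1024, max_summary_tokens=128):
    '''Generate an abstractive summary with beam search.'''
    # Truncate long passages to the assumed input limit of the checkpoint
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_input_tokens)
    summary_ids = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        num_beams=4,
        max_length=max_summary_tokens,
    )
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# Example usage (commented out because it downloads a large model):
# tokenizer, model = load_bart()
# summary = bart_summary(passage, tokenizer, model)
# rouge_scores, bert_f1 = evaluate_summary(reference, summary)
```

Passages longer than the input limit are truncated here, so a fine-tuned or chunked variant would likely be needed for full documents.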

###ANALYSIS###