<a href="https://colab.research.google.com/github/Doom-Prophet/CS410/blob/main/CS410_Final_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CS410 Final Project
**Name: Zicheng Ma**

**NetID: zicheng5**




# Introduction
Automatic document summarization is a core task in natural language processing and information retrieval. The volume of digital text makes it necessary to condense information efficiently while preserving its essential meaning. This project builds a summarization system that uses machine learning to generate concise, informative summaries of input documents.

# Setup

In [1]:
from google.colab import drive
drive.mount('/content/gdrive/')

Drive already mounted at /content/gdrive/; to attempt to forcibly remount, call drive.mount("/content/gdrive/", force_remount=True).


In [2]:
import os

project_dir = "/content/gdrive/My Drive/Colab Notebooks/CS410_Final_Project"
os.makedirs(project_dir, exist_ok=True)  # create the project folder if it does not exist
os.chdir(project_dir)

In [3]:
!pwd # Check if this is CS410 Final Project folder
!ls

/content/gdrive/My Drive/Colab Notebooks/CS410_Final_Project
datasets


In [4]:
!pip install datasets
!pip install nltk
!pip install beautifulsoup4
!pip install transformers
!pip install sentencepiece



# Data Collection and Preprocessing
Hugging Face’s datasets library provides an easy way to download and load the CNN/Daily Mail dataset.

In [5]:
from datasets import load_dataset

dataset_dir = "/content/gdrive/My Drive/Colab Notebooks/CS410_Final_Project/datasets/cnn_dailymail"

dataset = load_dataset("cnn_dailymail", "3.0.0", cache_dir=dataset_dir)

Define functions to clean and preprocess the text.

In [6]:
import nltk
from bs4 import BeautifulSoup
import re

nltk.download('punkt')
nltk.download('stopwords')

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))

def clean_html(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

def preprocess_text(text):
    text = clean_html(text)
    text = re.sub(r'\s+', ' ', text)  # Collapse any run of whitespace into a single space
    text = text.strip().lower()  # Convert to lowercase and remove leading/trailing whitespace
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word.isalpha() and word not in stop_words]  # Remove non-alphabetic words and stopwords
    return ' '.join(tokens)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Apply the preprocessing function to a subset of the dataset.

In [7]:
# Apply preprocessing to a subset of the dataset (the first 1000 articles for now).
# Slice the split first, then select the column, so only the needed rows are read.
num_samples = 1000
train_subset = dataset['train'][:num_samples]
preprocessed_articles = [preprocess_text(article) for article in train_subset['article']]
preprocessed_summaries = [preprocess_text(summary) for summary in train_subset['highlights']]

# Held-out examples, also drawn from the train split. The slice starts at num_samples + 1,
# so index num_samples is skipped and this yields num_samples - 1 examples.
test_subset = dataset['train'][num_samples+1:2*num_samples]
preprocessed_articles_for_testing = [preprocess_text(article) for article in test_subset['article']]
preprocessed_summaries_for_testing = [preprocess_text(summary) for summary in test_subset['highlights']]

# Show the first preprocessed article and summary
print("Article:", preprocessed_articles[0])
print("Summary:", preprocessed_summaries[0])



Article: london england reuters harry potter star daniel radcliffe gains access reported million million fortune turns monday insists money wo cast spell daniel radcliffe harry potter harry potter order phoenix disappointment gossip columnists around world young actor says plans fritter cash away fast cars drink celebrity parties plan one people soon turn suddenly buy massive sports car collection something similar told australian interviewer earlier month think particularly extravagant things like buying things cost pounds books cds dvds radcliffe able gamble casino buy drink pub see horror film hostel part ii currently six places number one movie uk box office chart details mark landmark birthday wraps agent publicist comment plans definitely sort party said interview hopefully none reading radcliffe earnings first five potter films held trust fund able touch despite growing fame riches actor says keeping feet firmly ground people always looking say star goes rails told reporters las

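Show evidence of adequacy by reporting basic statistics (document count, word count, average document length, vocabulary size) for the preprocessed articles.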

In [8]:
def dataset_statistics(tokenized_documents):
    total_docs = len(tokenized_documents)
    total_words = sum(len(doc.split()) for doc in tokenized_documents)
    vocabulary = set(word for doc in tokenized_documents for word in doc.split())

    print("Total Documents:", total_docs)
    print("Total Words:", total_words)
    print("Average Document Length:", total_words / total_docs)
    print("Vocabulary Size:", len(vocabulary))

dataset_statistics(preprocessed_articles)


Total Documents: 1000
Total Words: 321810
Average Document Length: 321.81
Vocabulary Size: 26482


# Text Analysis Techniques
This section explores text analysis techniques, namely word association, topic modeling (e.g., Latent Dirichlet Allocation), text clustering/categorization, and semantic analysis (e.g., Word2Vec, BERT embeddings), to identify significant content and relationships within the documents.

**1. Word Association**
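
One common way to measure word association is pointwise mutual information (PMI) over document-level co-occurrence. The sketch below is a minimal, standard-library illustration, not this project's implementation; the helper name `pmi_scores`, the `min_count` threshold, and the toy documents are assumptions made for the example. On the real data it could be called on `preprocessed_articles`.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_scores(documents, min_count=2):
    """Return {(word_a, word_b): PMI} using document-level co-occurrence.

    `documents` is a list of whitespace-tokenized strings, the same format
    produced by `preprocess_text`. `min_count` (an assumed threshold) drops
    pairs that co-occur in too few documents to be reliable.
    """
    n_docs = len(documents)
    word_df = Counter()  # number of documents containing each word
    pair_df = Counter()  # number of documents containing both words of a pair
    for doc in documents:
        vocab = sorted(set(doc.split()))
        word_df.update(vocab)
        pair_df.update(combinations(vocab, 2))

    scores = {}
    for (a, b), joint in pair_df.items():
        if joint < min_count:
            continue
        # PMI = log2( P(a, b) / (P(a) * P(b)) ), with probabilities estimated
        # as document frequencies divided by the number of documents
        scores[(a, b)] = math.log2(joint * n_docs / (word_df[a] * word_df[b]))
    return scores

# Toy documents (illustrative only)
toy_docs = [
    "obama campaign election",
    "obama campaign vote",
    "police arrest suspect",
    "police arrest home",
]
scores = pmi_scores(toy_docs)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

A higher PMI means the two words appear together more often than their individual frequencies would predict. Raw PMI favors rare words, which is why a minimum co-occurrence count is applied here.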


**2. Topic Modeling**

In [9]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# `preprocessed_articles` is a list of preprocessed text documents
vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
X = vectorizer.fit_transform(preprocessed_articles)

lda = LatentDirichletAllocation(n_components=5, random_state=42)
lda.fit(X)

# Display topics
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    print(f"Topic #{topic_idx + 1}:")
    top_indices = topic.argsort()[::-1][:10]  # indices of the 10 highest-weight words
    print(" ".join(feature_names[i] for i in top_indices))
    print()


Topic #1:
said obama clinton cnn new mccain watch campaign people percent

Topic #2:
said new company world think oil says percent years people

Topic #3:
said iraq cnn government people military president united country security

Topic #4:
said cnn children like people family years time says bush

Topic #5:
said police cnn told home time friend people says family
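
Beyond the top words per topic, a fitted LDA model can report a topic mixture for each document, which is the kind of vector the summarization model later takes as `topic_distribution`. The following is a small sketch under stated assumptions: the toy corpus, the choice of two topics, and the `toy_*` names are illustrative only. In the notebook, the same `transform` call could be applied to `X`, or to a held-out article vectorized with the fitted `vectorizer`.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for `preprocessed_articles` (assumption for illustration)
toy_docs = [
    "obama campaign election vote",
    "clinton campaign election poll",
    "police arrest suspect home",
    "police family home friend",
]

toy_vectorizer = CountVectorizer()
toy_X = toy_vectorizer.fit_transform(toy_docs)

toy_lda = LatentDirichletAllocation(n_components=2, random_state=42)
toy_lda.fit(toy_X)

# transform() returns one row per document; each row is a normalized
# distribution over topics (non-negative, sums to 1)
doc_topic = toy_lda.transform(toy_X)
print(doc_topic.shape)

# Topic mixture for a new document, vectorized with the SAME fitted vectorizer
new_doc = ["police arrest friend"]
new_topic_distribution = toy_lda.transform(toy_vectorizer.transform(new_doc))[0]
print(new_topic_distribution.tolist())
```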



**3. Text Clustering/Categorization**
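
A simple baseline for text clustering is k-means over TF-IDF vectors. The sketch below assumes scikit-learn (already used for LDA in this notebook); the toy corpus, `n_clusters=2`, and the variable names are illustrative assumptions rather than settings taken from this project.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for `preprocessed_articles` (assumption for illustration)
toy_docs = [
    "obama campaign election vote",
    "clinton campaign election poll",
    "police arrest suspect home",
    "police family home friend",
]

# Represent each document as a TF-IDF weighted bag of words
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(toy_docs)

# Partition the documents into two clusters; n_init restarts guard against
# a poor random initialization
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(tfidf_matrix)

for label, doc in zip(cluster_labels, toy_docs):
    print(label, doc)
```

The cluster ids themselves are arbitrary; only which documents share an id is meaningful.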

**4. Semantic Analysis**

# Summarization Model


**1. Define the Model**

Use T5, a transformer-based model that has been shown to perform well on summarization tasks.

Use the Hugging Face Transformers library to easily load a pre-trained model and fine-tune it on my dataset.

In [34]:
import torch
import torch.nn as nn
from transformers import T5ForConditionalGeneration, T5Tokenizer

class TopicEmbeddingModel(nn.Module):
    """Wraps a T5 model and adds a projected topic vector to its input embeddings."""

    def __init__(self, model, num_topics):
        super().__init__()
        self.model = model
        # Linear projection from topic space (num_topics) to the model's hidden size
        self.topic_embedding = nn.Linear(num_topics, model.config.d_model)

    def forward(self, input_ids, attention_mask, topic_distribution):
        # Convert the topic distribution to a float tensor on the same device as the inputs
        topic_tensor = torch.as_tensor(topic_distribution, dtype=torch.float32, device=input_ids.device)
        if topic_tensor.dim() == 1:
            topic_tensor = topic_tensor.unsqueeze(0)  # add a batch dimension

        # Project to shape (batch, d_model)
        topic_tensor = self.topic_embedding(topic_tensor)

        # Token embeddings, shape (batch, seq_len, d_model)
        input_embeddings = self.model.get_input_embeddings()(input_ids)

        # Broadcast-add the topic vector to every token position
        extended_embeddings = input_embeddings + topic_tensor.unsqueeze(1)

        # Run only the encoder; the caller passes the result to generate()
        outputs = self.model.encoder(inputs_embeds=extended_embeddings, attention_mask=attention_mask)
        return outputs



**2. Integrate Topic Distributions**

Modify the model to accept the topic distribution as an additional input. Here I project the topic distribution to the model's hidden size with a linear layer and add the result to the token embedding at every position of the input document.

In [35]:
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
num_topics = 5
topic_model = TopicEmbeddingModel(model, num_topics)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


**3. Fine-Tune the Model**

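Fine-tune the model on the dataset. The loop below is only a sketch and is left commented out, so the summary generated in the next step comes from the pre-trained `t5-small` weights and an untrained topic projection layer.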

In [12]:
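# Training sketch, left commented out: `dataloader` and `num_epochs` are not defined in this notebook.
# from torch.optim import AdamW
#
# # Optimize the T5 weights together with the topic projection layer
# optimizer = AdamW(topic_model.parameters(), lr=5e-5)
#
# for epoch in range(num_epochs):
#     for batch in dataloader:
#         # Unpack the batch
#         input_ids, attention_mask, labels, topic_distributions = batch
#
#         # Add the projected topic vector to every token embedding
#         token_embeddings = model.get_input_embeddings()(input_ids)
#         topic_vectors = topic_model.topic_embedding(topic_distributions.float())
#         extended_embeddings = token_embeddings + topic_vectors.unsqueeze(1)
#
#         # Forward pass; passing labels makes the model compute the loss
#         outputs = model(inputs_embeds=extended_embeddings, attention_mask=attention_mask, labels=labels)
#         loss = outputs.loss
#
#         # Backward pass and parameter update
#         loss.backward()
#         optimizer.step()
#         optimizer.zero_grad()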


**4. Generate Summaries**

In [36]:
document_text = preprocessed_articles_for_testing[0]
print("Article:", document_text)
print("Standard Summary:", preprocessed_summaries_for_testing[0])

inputs = tokenizer(document_text, return_tensors="pt", truncation=True, max_length=512)

# Hand-picked placeholder; it is not derived from the LDA model fitted above
topic_distribution = [0.1, 0.3, 0.2, 0.25, 0.15]

# Forward pass through the topic model
encoder_outputs = topic_model(inputs['input_ids'], inputs['attention_mask'], topic_distribution)

# Generate summary IDs
summary_ids = model.generate(encoder_outputs=encoder_outputs, max_length=150, attention_mask=inputs['attention_mask'])

# Decode summary IDs to text
summary_text = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("Generated Summary:", summary_text)

Article: cnn bus carrying high school band students tipped saturday interstate northwest minneapolis minnesota killing one person bus carrying school band members rests upright crashed saturday minnesota three people critically injured authorities said second bus traveling one crashed affected according report posted web site pelican rapids school district students pelican rapids high school returning band trip chicago illinois accident happened near albertville minnesota minnesota highway patrol said people including driver westbound bus tipped minnesota highway patrol said everyone bus taken hospitals treatment evaluation school district said watch rescuers work scene pelican rapids minnesota cause accident investigated friend
Standard Summary: bus carrying high school students tips minnesota interstate one person killed three critically injured authorities say two buses pelican rapids minnesota way home chicago illinois
Generated Summary: ota highway patrol says buses taken hospital

# Evaluation
The system’s performance will be evaluated using standard metrics for summarization tasks, including ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metrics, precision, recall, and F1-score.
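
As a rough illustration of what ROUGE-N measures, the sketch below computes n-gram overlap precision, recall, and F1 between a generated summary and a reference using only the standard library. The function name `rouge_n` is an assumption for this example, and the implementation omits stemming and the ROUGE-L variant, so its numbers may differ from those of an established ROUGE package. The two strings are the generated and reference summaries printed in the previous section.

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, f1) for n-gram overlap between two whitespace-tokenized strings."""
    def ngrams(text):
        tokens = text.split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand_counts, ref_counts = ngrams(candidate), ngrams(reference)
    # Clipped overlap: each n-gram counts at most as often as it appears in both texts
    overlap = sum((cand_counts & ref_counts).values())
    cand_total, ref_total = sum(cand_counts.values()), sum(ref_counts.values())

    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

generated = "ota highway patrol says buses taken hospital"
reference = ("bus carrying high school students tips minnesota interstate one person killed "
             "three critically injured authorities say two buses pelican rapids minnesota "
             "way home chicago illinois")

for n in (1, 2):
    p, r, f = rouge_n(generated, reference, n=n)
    print(f"ROUGE-{n}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Averaging these scores over all pairs in `preprocessed_summaries_for_testing` would give a corpus-level figure.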