<a href="https://colab.research.google.com/github/Fayouzz/Multi-Document-Summarization-with-Centroid-Based-Pretraining/blob/main/Multi_Document_Summarization_with_Centroid_Based_Pretraining1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install datasets



In [11]:
from datasets import load_dataset

# Load the Multi-News dataset
dataset = load_dataset("multi_news")


In [12]:
def sample_dataset(dataset, fraction=0.1):
    sampled_data = dataset.shuffle(seed=42).select(range(int(len(dataset) * fraction)))
    return sampled_data

# Sample 10% of each split
train_sample = sample_dataset(dataset["train"], 0.1)
validation_sample = sample_dataset(dataset["validation"], 0.1)
test_sample = sample_dataset(dataset["test"], 0.1)


In [13]:
import os
import json

# Create necessary directories
os.makedirs("data/raw", exist_ok=True)

# Function to save data to JSON
def save_to_json(data, file_path):
    with open(file_path, "w") as f:
        json.dump(data, f, indent=4)

# Save the sampled splits
save_to_json(train_sample.to_dict(), "data/raw/train.json")
save_to_json(validation_sample.to_dict(), "data/raw/validation.json")
save_to_json(test_sample.to_dict(), "data/raw/test.json")

print("Data splits saved in data/raw/")


Data splits saved in data/raw/


In [14]:
!pip install spacy
!python -m spacy download en_core_web_sm


Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [15]:
import spacy

# Load Spacy's English model
nlp = spacy.load("en_core_web_sm")

# Sample text for tokenization
text = "This is a sample text for testing."

# Tokenize the text
doc = nlp(text)
tokens = [token.text for token in doc]

print("Tokens:", tokens)


Tokens: ['This', 'is', 'a', 'sample', 'text', 'for', 'testing', '.']


In [16]:
def preprocess_text(text):
    doc = nlp(text.lower())  # Lowercasing the text
    tokens = [token.text for token in doc if not token.is_stop and not token.is_punct]
    return " ".join(tokens)

# Example
example_text = "This is an example sentence! With some punctuation?"
processed_text = preprocess_text(example_text)
print("Processed Text:", processed_text)


Processed Text: example sentence punctuation


In [21]:
# Load the sampled training split saved earlier (json was imported above)
with open("data/raw/train.json") as f:
    train_data = json.load(f)

# Check the top-level structure of the dataset
print(train_data.keys())  # Expected keys: 'document' and 'summary'

# Inspect the first item to understand its structure
print(train_data['document'][0])  # Print the first document
print(train_data['summary'][0])  # Print the first summary


dict_keys(['document', 'summary'])
MICHAEL JACKSON's daughter has become a top celebrity crimper's latest client. 
 
 PARIS, 11, visited ANDY LECOMPTE's LA salon last week, following in the footsteps of MADONNA, BRITNEY SPEARS, JENNIFER LOPEZ and GWYNETH PALTROW. 
 
 Her two brothers, PRINCE MICHAEL and BLANKET, sat with their bodyguards while Paris had her hair washed, cut and styled after her nails were painted black. 
 
 Robots 
 
 The salon, located on Melrose Avenue, has a hidden entrance and is so exclusive it doesn't need a sign. 
 
 Customers enjoying treatments at the same time as Paris described the youngsters as "weird". 
 
 One client said: "They weren't like normal children at all because they seemed to have no joy or playfulness about them. 
 
 "I know they recently lost their father but their complete lack of emotion meant it was like watching three robots." ||||| Jackson Kids Save the Day 
 
 More Scooby Roo Jackson Kids Take on Canine Crusade 
 
 Meddling kids and Scoo
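
The printed document above contains a `|||||` marker between two news stories, which suggests that each Multi-News record concatenates its source articles with that separator. A minimal sketch of recovering the individual articles follows; the helper name `split_sources` and the constant `SOURCE_SEPARATOR` are hypothetical and not part of the notebook.

```python
from typing import List

# Separator observed in the printed sample above (assumed to hold for all records)
SOURCE_SEPARATOR = "|||||"


def split_sources(document: str) -> List[str]:
    """Split one concatenated record into its individual source articles."""
    parts = (part.strip() for part in document.split(SOURCE_SEPARATOR))
    # Drop empty pieces left by leading or trailing separators
    return [part for part in parts if part]


example = "First article. ||||| Second article. ||||| "
for index, article in enumerate(split_sources(example)):
    print(index, article)
```

Splitting first keeps article boundaries available for later steps, for example when scoring sentences per source rather than over the concatenated text.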

In [23]:
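# Function to chunk long texts into smaller pieces
def chunk_text(text, chunk_size=1000):
    return [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]

# NOTE: clean_text is not defined anywhere in this notebook as captured.
# The version below is an assumption inferred from the printed output:
# lemmatize with spaCy, keep alphabetic tokens, drop stop words, preserve case.
def clean_text(text):
    doc = nlp(text)
    return " ".join(token.lemma_ for token in doc if token.is_alpha and not token.is_stop)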

# Preprocess the documents in chunks
preprocessed_docs = []
preprocessed_summaries = []

for i in range(len(train_data['document'])):
    # Chunk the document if it exceeds the chunk size
    chunks = chunk_text(train_data['document'][i])
    cleaned_doc = " ".join([clean_text(chunk) for chunk in chunks])
    preprocessed_docs.append(cleaned_doc)

    # Clean the summary text (assuming summaries are short)
    cleaned_summary = clean_text(train_data['summary'][i])
    preprocessed_summaries.append(cleaned_summary)

# Verify preprocessing by printing the first few entries
print(preprocessed_docs[:2])  # Print the first two preprocessed documents
print(preprocessed_summaries[:2])  # Print the first two preprocessed summaries


['MICHAEL JACKSON daughter celebrity crimper late client PARIS visit ANDY LECOMPTE LA salon week follow footstep MADONNA BRITNEY SPEARS JENNIFER LOPEZ GWYNETH PALTROW brother PRINCE MICHAEL BLANKET sit bodyguard Paris hair wash cut style nail paint black robot salon locate Melrose Avenue hidden entrance exclusive need sign customer enjoy treatment time Paris describe youngster weird client say like normal child joy playfulness know recently lose father complete lack emotion mean like watch robot Jackson Kids save Day Scooby Roo Jackson Kids Canine Crusade Meddling kid Scooby Doo hand hand mak es total sense kid help save dog nameda tell month Prince Paris Jackson see Scooby news decide help work Scooby outfit custom cart weekend Fuzzy Rescue organization rescue Scooby tell week car run time', 'image copyright Getty Images Image caption Asaram deny allegation self style indian spiritual guru claim million follower worldwide give life sentence rape year old girl Asaram Bapu convict attac
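
The output above contains the fragment `mak es total sense`, which is what a fixed-width character slice produces when a chunk boundary falls inside a word. One possible alternative is to break only on whitespace. The sketch below is an assumption about how that could look, not the notebook's method, and the function name `chunk_text_on_whitespace` is hypothetical.

```python
def chunk_text_on_whitespace(text, chunk_size=1000):
    """Greedily pack whole words into chunks of at most chunk_size characters.

    A single word longer than chunk_size is kept intact as its own chunk.
    """
    chunks, current, current_len = [], [], 0
    for word in text.split():
        # One extra character for the joining space when the chunk is non-empty
        extra = len(word) + (1 if current else 0)
        if current and current_len + extra > chunk_size:
            chunks.append(" ".join(current))
            current, current_len = [word], len(word)
        else:
            current.append(word)
            current_len += extra
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "it makes total sense that these kids would help"
for chunk in chunk_text_on_whitespace(sample, chunk_size=16):
    print(repr(chunk))
```

Because runs of whitespace are collapsed by `split()`, joining the chunks with single spaces reproduces the words in their original order but not the original line breaks.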

In [24]:
from sklearn.model_selection import train_test_split

# Split the data into training, validation, and test sets (80%, 10%, 10%).
# First hold out 20% of the data
train_docs, holdout_docs, train_summaries, holdout_summaries = train_test_split(
    preprocessed_docs, preprocessed_summaries, test_size=0.2, random_state=42)

# Then split the holdout evenly into validation and test sets (10% each)
val_docs, test_docs, val_summaries, test_summaries = train_test_split(
    holdout_docs, holdout_summaries, test_size=0.5, random_state=42)

# Check the split sizes
print(f"Train size: {len(train_docs)}, Validation size: {len(val_docs)}, Test size: {len(test_docs)}")


Train size: 3597, Validation size: 450, Test size: 450


In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TF-IDF vectorizer
vectorizer = TfidfVectorizer(max_features=5000)  # Limit the number of features to 5000

# Fit and transform the training data, and transform the validation and test data
X_train = vectorizer.fit_transform(train_docs)
X_val = vectorizer.transform(val_docs)
X_test = vectorizer.transform(test_docs)

# Optionally, transform summaries for evaluation
y_train = vectorizer.transform(train_summaries)
y_val = vectorizer.transform(val_summaries)
y_test = vectorizer.transform(test_summaries)

print(f"X_train shape: {X_train.shape}, X_val shape: {X_val.shape}, X_test shape: {X_test.shape}")


X_train shape: (3597, 5000), X_val shape: (450, 5000), X_test shape: (450, 5000)
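
For reference, the weighting that `TfidfVectorizer` applies with its default settings (`smooth_idf=True`, `sublinear_tf=False`, `norm='l2'`) can be written out as below. The symbols are introduced here for illustration: $n$ is the number of training documents, $\mathrm{df}(t)$ the number of them containing term $t$, and $\mathrm{tf}(t, d)$ the raw count of $t$ in document $d$. With `max_features=5000`, only the 5000 terms with the highest total count across the corpus are kept, which is why every matrix above has 5000 columns.

```latex
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
w(t, d) = \mathrm{tf}(t, d)\,\mathrm{idf}(t),
\qquad
\hat{w}(t, d) = \frac{w(t, d)}{\sqrt{\sum_{t'} w(t', d)^{2}}}
```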


In [28]:
import spacy
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Create mock labels (0 for even index, 1 for odd index). These placeholders are
# unrelated to document content, so the scores below only confirm that the
# pipeline runs; roughly chance-level results are expected.
labels = np.array([i % 2 for i in range(len(preprocessed_docs))])

# Use TfidfVectorizer to convert text data into feature vectors
vectorizer = TfidfVectorizer(max_features=5000)

# Vectorize all preprocessed documents
X_all = vectorizer.fit_transform(preprocessed_docs)

# Split the data into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_all, labels, test_size=0.2, random_state=42)

# Train a Logistic Regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Make predictions on the validation set
y_pred = model.predict(X_val)

# Evaluate the model
accuracy = accuracy_score(y_val, y_pred)
precision = precision_score(y_val, y_pred, average="weighted")
recall = recall_score(y_val, y_pred, average="weighted")
f1 = f1_score(y_val, y_pred, average="weighted")

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")


Accuracy: 0.5467
Precision: 0.5467
Recall: 0.5467
F1-Score: 0.5467


In [1]:
!pip install rouge-score

Collecting rouge-score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rouge-score
  Building wheel for rouge-score (setup.py) ... done
  Created wheel for rouge-score: filename=rouge_score-0.1.2-py3-none-any.whl size=24935 sha256=8368284fbe417387b542435591d7fe70b362930c6b4872d1866122f9563be444
  Stored in directory: /root/.cache/pip/wheels/5f/dd/89/461065a73be61a532ff8599a28e9beef17985c9e9c31e541b4
Successfully built rouge-score
Installing collected packages: rouge-score
Successfully installed rouge-score-0.1.2


In [13]:
from google.colab import files
uploaded = files.upload()  # This will allow you to upload files directly


Saving train.json to train.json


In [16]:
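import json

# Load the uploaded training split
with open("train.json") as f:
    train_data = json.load(f)

# spaCy rejects texts longer than nlp.max_length (1,000,000 characters by
# default), which matches the cap used here
def truncate_text(text, max_length=1000000):
    return text[:max_length]

# Apply truncation to documents before processing
train_data['document'] = [truncate_text(doc) for doc in train_data['document']]

# Clean the truncated documents and summaries
# (nlp and the clean_text used in the earlier preprocessing cell must be defined in this session)
preprocessed_docs = [clean_text(doc) for doc in train_data['document']]
preprocessed_summaries = [clean_text(summary) for summary in train_data['summary']]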


In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TfidfVectorizer
vectorizer = TfidfVectorizer(max_features=5000)

# Fit the vocabulary on the documents, then reuse it to transform the summaries
X_docs = vectorizer.fit_transform(preprocessed_docs)
X_summaries = vectorizer.transform(preprocessed_summaries)
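
The notebook builds TF-IDF vectors but stops before producing a summary. As a rough illustration of the centroid idea in the notebook's title, the sketch below scores each sentence by cosine similarity to the average TF-IDF vector and keeps the top few in their original order. This is an assumption about one simple extractive baseline, not the author's method. It is written without scikit-learn so that it runs on its own, and the names `tfidf_vectors`, `cosine`, and `centroid_summary` are hypothetical.

```python
import math
import re
from collections import Counter


def tfidf_vectors(token_lists):
    """Return one {term: weight} dict per token list, using smoothed IDF."""
    n = len(token_lists)
    df = Counter(term for tokens in token_lists for term in set(tokens))
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    return [{term: tf * idf[term] for term, tf in Counter(tokens).items()}
            for tokens in token_lists]


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid_summary(sentences, num_sentences=2):
    """Keep the sentences closest to the TF-IDF centroid, in original order."""
    token_lists = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    vectors = tfidf_vectors(token_lists)
    centroid = Counter()
    for vec in vectors:
        for term, weight in vec.items():
            centroid[term] += weight / len(vectors)
    ranked = sorted(range(len(sentences)),
                    key=lambda i: cosine(vectors[i], centroid), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)


sentences = [
    "The guru was convicted of rape.",
    "The guru was given a life sentence for rape.",
    "Bananas are yellow.",
]
print(centroid_summary(sentences, num_sentences=2))
```

The string returned by such a function could serve as the generated summary when computing ROUGE against a reference summary.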

In [2]:
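from rouge_score import rouge_scorer

# Placeholder texts so the cell runs on its own; they are illustrative only.
# Replace ref_summary with a dataset summary and gen_summary with model output.
ref_summary = "The salon on Melrose Avenue has a hidden entrance and no sign."
gen_summary = "The salon has a hidden entrance."

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
scores = scorer.score(ref_summary, gen_summary)  # score(target, prediction)

print(f"ROUGE-1: {scores['rouge1']}")
print(f"ROUGE-2: {scores['rouge2']}")
print(f"ROUGE-L: {scores['rougeL']}")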
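
To make the ROUGE-1 and ROUGE-2 numbers easier to interpret, here is a minimal sketch of n-gram overlap scoring written from scratch. It is a simplification under stated assumptions: it lowercases and splits on whitespace, applies no stemming, and so will not match `rouge_score` exactly. The function name `rouge_n` is hypothetical.

```python
from collections import Counter


def rouge_n(reference, candidate, n=1):
    """Return (precision, recall, f1) for n-gram overlap between two texts."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref, cand = ngrams(reference), ngrams(candidate)
    # Clipped overlap: each n-gram counts at most as often as it appears in both
    overlap = sum((ref & cand).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print("ROUGE-1:", rouge_n(reference, candidate, n=1))
print("ROUGE-2:", rouge_n(reference, candidate, n=2))
```

Precision is the share of candidate n-grams found in the reference, recall is the share of reference n-grams found in the candidate, and the F-measure is their harmonic mean.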