<a href="https://colab.research.google.com/github/JeanMusenga/PhD-Thesis_2024_Musenga/blob/main/Trial_with_BertSum.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Sammaries are not being generated:

https://chatgpt.com/share/c64e87ae-aefb-4eef-af28-94d2025276d7

It looks like the summaries are still empty, indicating that the summarization process is not working as expected. This might be due to the way the BERT model is being used for summarization. Let's modify the approach to ensure that we are extracting meaningful summaries.

Instead of using softmax on the entire sentence embedding, we'll use a more straightforward method: simply selecting a subset of sentences based on their positions in the text.

In [None]:
# Step 2: Import necessary libraries
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from transformers import BertTokenizer, BertModel
import torch
from torch.nn.functional import softmax

In [4]:
file_path = './saved_file'
file_path = ('DataSampePilot.xlsx')

data = pd.read_excel(file_path)

In [None]:
# Display the first few rows of the dataset
print("Original data:")
print(data.head())


In [1]:
!git clone https://github.com/nlpyang/BertSum.git

Cloning into 'BertSum'...
remote: Enumerating objects: 301, done.[K
remote: Counting objects: 100% (293/293), done.[K
remote: Compressing objects: 100% (124/124), done.[K
remote: Total 301 (delta 165), reused 290 (delta 164), pack-reused 8[K
Receiving objects: 100% (301/301), 15.05 MiB | 9.38 MiB/s, done.
Resolving deltas: 100% (165/165), done.


In [7]:
# Download NLTK data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [8]:
# Step 3: Preprocess the text data
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

In [9]:
def preprocess_text(text):
    # Tokenize into words
    words = word_tokenize(text)
    # Remove stopwords and perform lemmatization
    words = [lemmatizer.lemmatize(word) for word in words if word.lower() not in stop_words]
    # Join words back into a single string
    return ' '.join(words)


In [10]:
# Apply preprocessing to the Question_body and Answer_body columns
data['Question_body'] = data['Question_body'].apply(lambda x: preprocess_text(x) if pd.notnull(x) else "")
data['Answer_body'] = data['Answer_body'].apply(lambda x: preprocess_text(x) if pd.notnull(x) else "")


In [None]:
# Verify preprocessing
print("Preprocessed data:")
print(data[['Question_body', 'Answer_body']].head())

In [12]:
%cd BertSum

/content/BertSum


In [13]:
# Step 4: Define the BertSum model for extractive summarization
class BertSum:
    def __init__(self, model_name='bert-base-uncased'):
        self.tokenizer = BertTokenizer.from_pretrained(model_name)
        self.model = BertModel.from_pretrained(model_name)

    def summarize(self, text):
        if len(text.strip()) == 0:
            return ""
        inputs = self.tokenizer(text, return_tensors='pt', truncation=True, padding='max_length', max_length=512)
        outputs = self.model(**inputs)
        # Obtain the embeddings for each token
        token_embeddings = outputs.last_hidden_state
        # Average the token embeddings to get a sentence embedding
        sentence_embedding = torch.mean(token_embeddings, dim=1)
        # Compute softmax scores
        scores = softmax(sentence_embedding, dim=1)
        # Split the text into sentences
        sentences = sent_tokenize(text)
        if not sentences:  # Check if the sentence list is empty
            return ""
        # Get the top N sentences (e.g., top 3)
        N = min(3, len(sentences))  # Ensure N does not exceed the number of sentences
        top_sentence_idxs = scores[0].topk(N).indices.tolist()
        # Select the top N sentences, ensuring valid indices
        top_sentence_idxs = [idx for idx in top_sentence_idxs if idx < len(sentences)]
        summary = '. '.join([sentences[idx] for idx in top_sentence_idxs])
        return summary

In [None]:
# Initialize the model
bertsum = BertSum()

In [None]:
# Step 5: Apply the model to the Question_body and Answer_body columns
def summarize_column(text, column_name):
    if pd.notnull(text):
        try:
            summary = bertsum.summarize(text)
            print(f"Original {column_name}: {text[:100]}...")  # Print first 100 characters
            print(f"Summary: {summary}\n")
            return summary
        except Exception as e:
            print(f"Error summarizing text: {e}")
            return ""
    return ""
data['Question_summary'] = data['Question_body'].apply(lambda x: summarize_column(x, 'Question_body'))
data['Answer_summary'] = data['Answer_body'].apply(lambda x: summarize_column(x, 'Answer_body'))


In [None]:
# Display the summarized data
data[['Question_body', 'Question_summary', 'Answer_body', 'Answer_summary']].head()


Unnamed: 0,Question_body,Question_summary,Answer_body,Answer_summary
0,Kinda new AWS . high-level question . Iâ€™m lo...,,"send request , get response . order send respo...",
1,spring boot microservices want use microservic...,,< blockquote > tl ; dr : Spring MVC contradict...,
2,'m trying properly design application accordin...,,Determining source information business logic ...,
3,heard .NET8 Microsoft gifted u totally & quot ...,,always asked question : Microsoft template eve...,
4,"trying learn AWS service , mainly focused API ...",,"Short answer : , n't probably . Usually , EKS ...",


In [None]:
# Save the summarized data to a new Excel file
data.to_excel('/path/to/your/SummarizedData.xlsx', index=False)