<a href="https://colab.research.google.com/github/JeanMusenga/PhD-Thesis_2024_Musenga/blob/main/BERTSumWithLemmatization%2C_StopWordRemeval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

https://chatgpt.com/share/c64e87ae-aefb-4eef-af28-94d2025276d7

This notebook implements extractive summarization with a BertSum-style model built on a pretrained BERT encoder. The approach has three steps:

1. Use the BERT model to encode the text.
2. Score each sentence using the BERT model.
3. Select the top sentences based on their scores.

Sentences are indexed so that each score maps back to the sentence it was computed for.
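
The three steps above can be sketched without BERT by plugging in any per-sentence scoring function. The helper below is a minimal illustration, not the notebook's model: `split_sentences`, `toy_score` and `select_top_sentences` are hypothetical names introduced here, the regex splitter is a stand-in for `sent_tokenize`, and the word-count score is a placeholder for a score derived from BERT encodings.

```python
import re

def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def toy_score(sentence):
    # Placeholder score (assumption): number of words in the sentence.
    return len(sentence.split())

def select_top_sentences(text, score_fn, k=3):
    sentences = split_sentences(text)
    scores = [score_fn(s) for s in sentences]
    # Rank sentence indices by score, highest first; slicing caps k at the sentence count.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:k])  # restore document order
    return [sentences[i] for i in chosen]

text = ("Short one. This sentence is clearly the longest of them all. "
        "A medium length sentence here. Tiny.")
print(select_top_sentences(text, toy_score, k=2))
```

Keeping the indices rather than the sentences while ranking is what makes it possible to map each score back to its sentence and, if desired, to return the selection in document order.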


# Step 1: Install necessary dependencies

In [None]:
!pip install transformers
!pip install torch
!pip install nltk
!pip install openpyxl

# Step 2: Import necessary libraries

In [None]:
# Step 2: Import necessary libraries
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from transformers import BertTokenizer, BertModel
import torch

# Download NLTK data

In [None]:
# Download NLTK data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
# Path to the input workbook. Upload it to the working directory first;
# otherwise pd.read_excel raises FileNotFoundError.
file_path = 'DataSampePilot.xlsx'

data = pd.read_excel(file_path)

FileNotFoundError: [Errno 2] No such file or directory: 'DataSampePilot.xlsx'
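
A `FileNotFoundError` like the one above is raised when the workbook is not in the current working directory. A small guard can make the failure easier to diagnose by reporting where the file was expected. This is only a sketch: `resolve_input` is a hypothetical helper, not part of pandas.

```python
from pathlib import Path

def resolve_input(path):
    # Hypothetical helper: return an absolute Path, or fail with a message
    # that shows the resolved location and the current working directory.
    p = Path(path).expanduser().resolve()
    if not p.is_file():
        raise FileNotFoundError(
            f"Input file not found: {p} (current directory: {Path.cwd()})"
        )
    return p

# Usage in the notebook (assumes the workbook has been uploaded):
# data = pd.read_excel(resolve_input('DataSampePilot.xlsx'))
```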

In [None]:
# Display the first few rows of the dataset
print("Original data:")
print(data.head())


In [None]:
!git clone https://github.com/nlpyang/BertSum.git

Cloning into 'BertSum'...
remote: Enumerating objects: 301, done.
remote: Counting objects: 100% (293/293), done.
remote: Compressing objects: 100% (124/124), done.
remote: Total 301 (delta 165), reused 290 (delta 164), pack-reused 8
Receiving objects: 100% (301/301), 15.05 MiB | 9.76 MiB/s, done.
Resolving deltas: 100% (165/165), done.


In [None]:
# List the files in the current directory
!ls

BertSum  sample_data


In [None]:
%cd BertSum

/content/BertSum


# Step 3: Preprocess the text data

In [None]:
# Step 3: Preprocess the text data
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

NameError: name 'stopwords' is not defined

In [None]:
def preprocess_text(text):
    # Tokenize into words
    words = word_tokenize(text)
    # Remove stopwords and perform lemmatization
    words = [lemmatizer.lemmatize(word) for word in words if word.lower() not in stop_words]
    # Join words back into a single string
    return ' '.join(words)

# Apply preprocessing to the Question_body and Answer_body columns
data['Question_body'] = data['Question_body'].apply(lambda x: preprocess_text(x) if pd.notnull(x) else "")
data['Answer_body'] = data['Answer_body'].apply(lambda x: preprocess_text(x) if pd.notnull(x) else "")
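
To see what the stop-word filter does to a single sentence, the sketch below applies the same test with a tiny hand-written stop-word set instead of NLTK's list, and skips lemmatization. `TOY_STOP_WORDS` and `remove_stop_words` are illustrative names, and whitespace splitting stands in for `word_tokenize`.

```python
# Assumption: a tiny sample set, far smaller than NLTK's English stop-word list.
TOY_STOP_WORDS = {"i", "am", "to", "the", "a", "of", "and", "is"}

def remove_stop_words(text, stop_words=TOY_STOP_WORDS):
    # Compare case-insensitively, but keep the original casing of kept words.
    kept = [w for w in text.split() if w.lower() not in stop_words]
    return " ".join(kept)

print(remove_stop_words("I am trying to design the architecture of a service"))
# prints: trying design architecture service
```

Because function words are dropped, the text passed to the summarizer no longer reads as fluent English, which is visible in the summaries printed later in the notebook.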


In [None]:
# Verify preprocessing
print("Preprocessed data:")
print(data[['Question_body', 'Answer_body']].head())

# Step 4: Define the BertSum model for extractive summarization

In [None]:
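# Step 4: Define the BertSum model for extractive summarization
# Note: this class scores sentences with a heuristic (embedding norm) on top of
# plain pretrained BERT; it does not load the code from the cloned BertSum repository.
class BertSum:
    def __init__(self, model_name='bert-base-uncased'):
        self.tokenizer = BertTokenizer.from_pretrained(model_name)
        self.model = BertModel.from_pretrained(model_name)

    def summarize(self, text, num_sentences=3):
        sentences = sent_tokenize(text)
        if not sentences:
            return ""
        inputs = self.tokenizer(sentences, return_tensors='pt', padding=True, truncation=True)
        with torch.no_grad():  # inference only, no gradients needed
            outputs = self.model(**inputs)
        # Mean-pool token embeddings over real tokens only (ignore padding positions)
        mask = inputs['attention_mask'].unsqueeze(-1).float()
        sentence_embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
        scores = torch.norm(sentence_embeddings, dim=1)
        # topk raises an error if k exceeds the number of sentences, so cap it
        k = min(num_sentences, len(sentences))
        top_sentence_idxs = scores.topk(k).indices.tolist()
        # Sentences already end with their own punctuation, so join with a space
        summary = ' '.join(sentences[idx] for idx in top_sentence_idxs)
        return summary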
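
Because `padding=True` pads shorter sentences to the length of the longest one in the batch, mean pooling over the token dimension is sensitive to whether padding positions are included. The pure-Python sketch below uses made-up token vectors to show the difference between a plain mean and a mean restricted to real tokens via the attention mask. `plain_mean` and `masked_mean` are hypothetical helpers; in the notebook the same idea would apply to `outputs.last_hidden_state` and `inputs['attention_mask']`.

```python
def plain_mean(token_vectors):
    # Average over every position, padding included.
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dim)]

def masked_mean(token_vectors, attention_mask):
    # Average only over positions whose mask value is 1 (real tokens).
    dim = len(token_vectors[0])
    real = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    return [sum(vec[d] for vec in real) / len(real) for d in range(dim)]

# Made-up 2-dimensional vectors: two real tokens followed by two padding positions.
# Padding vectors are zero here for simplicity; a real encoder's outputs at
# padding positions are generally not zero.
tokens = [[2.0, 4.0], [4.0, 8.0], [0.0, 0.0], [0.0, 0.0]]
mask = [1, 1, 0, 0]

print(plain_mean(tokens))         # [1.5, 3.0]
print(masked_mean(tokens, mask))  # [3.0, 6.0]
```

Since the sentence score is the norm of the pooled vector, including padding positions makes the score depend on how much padding a sentence received rather than only on its own tokens.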

In [None]:
# Initialize the model
bertsum = BertSum()

# Step 5: Apply the model to the Question_body and Answer_body columns

In [None]:
# Step 5: Apply the model to the Question_body and Answer_body columns
def summarize_column(text, column_name):
    if pd.notnull(text):
        try:
            summary = bertsum.summarize(text)
            print(f"Original {column_name}: {text[:100]}...")  # Print first 100 characters
            print(f"Summary: {summary}\n")
            return summary
        except Exception as e:
            print(f"Error summarizing text: {e}")
            return ""
    return ""

data['Question_summary'] = data['Question_body'].apply(lambda x: summarize_column(x, 'Question_body'))
data['Answer_summary'] = data['Answer_body'].apply(lambda x: summarize_column(x, 'Answer_body'))


Original Question_body: Kinda new AWS . high-level question . I’m looking insight general architecture workflow , without ...
Summary: would React client update message ?. turn , server , code send queue message third party perform action .. waiting response , react app show status update , meaning different stage operation taking place server .

Original Question_body: spring boot microservices want use microservices client application ( front-end ) . use Spring MVC d...
Summary: spring boot microservices want use microservices client application ( front-end ) .. main logic application resides spring boot microservices .. use Spring MVC designing client side application , , client side application sends request microservices REST APIs use service , standard correct solution ?

Original Question_body: 'm trying properly design application according clean architecture , 'm struggling determine layer (...
Summary: also possible call UseCase repository .. RemoteData - load cache data AP

# Display the summarized data

In [None]:
# Display the summarized data
print("Summarized data:")
print(data[['Question_body', 'Question_summary', 'Answer_body', 'Answer_summary']].head())


# Save the summarized data to a new Excel file

In [None]:
# Save the summarized data to a new Excel file
output_path = '/content/SummarizedData.xlsx'
data.to_excel(output_path, index=False, engine='openpyxl')

# Verify that the file has been saved correctly

In [None]:

saved_data = pd.read_excel(output_path)
print("Saved summarized data:")
print(saved_data[['Question_summary', 'Answer_summary']].head())