<a href="https://colab.research.google.com/github/JeanMusenga/PhD-Thesis_2024_Musenga/blob/main/BERT_TextSummarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

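This notebook summarizes the 'Question_body' and 'Answer_body' columns of a question-and-answer dataset with the Hugging Face transformers library. It was framed as adapting a BERT sequence-classification model to extractive summarization, but the code below loads facebook/bart-large-cnn, a BART model that produces abstractive summaries through the summarization pipeline.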
https://chatgpt.com/share/01303bfd-981b-448e-b51c-4ac8bad51dc5

In [1]:
import pandas as pd
from transformers import pipeline

# Step 1: Load the data

In [2]:
df = pd.read_excel('DataSampePilot.xlsx')

In [7]:
# Drop rows with NaN in 'Question_body' or 'Answer_body'
df = df.dropna(subset=['Question_body', 'Answer_body'])
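
The sample output further down shows 'Answer_body' values that contain raw HTML (for example `<blockquote>\n`). A minimal sketch of an optional cleaning step follows, assuming the markup should be removed before summarization. The helper name `strip_html` and the sample string are invented for illustration and are not part of the original notebook.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(text):
    """Return text with tags removed and whitespace collapsed (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(str(text))
    parser.close()  # flush any text still buffered by the parser
    # Join text nodes with spaces so words on either side of a tag do not merge,
    # then collapse runs of whitespace, including newlines.
    return " ".join(" ".join(parser.parts).split())


# Invented sample, loosely shaped like the HTML seen in 'Answer_body'
sample = "<blockquote>\ntl;dr: short answer</blockquote><p>Use a <b>queue</b> &amp; a worker.</p>"
print(strip_html(sample))
```

If this step is wanted, it could be applied with `df['Answer_body'] = df['Answer_body'].apply(strip_html)` before the summaries are computed.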

# Initialize the summarization pipeline


In [None]:
summarizer = pipeline("summarization", model="facebook/bart-large-cnn", tokenizer="facebook/bart-large-cnn")

# Function to split text into chunks smaller than max_length

In [9]:
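# Split text into chunks of at most max_length characters (not tokens),
# breaking only on whitespace
def split_into_chunks(text, max_length=512):
    words = text.split()
    chunks = []
    current_chunk = []
    current_length = 0

    for word in words:
        # +1 accounts for the space that joins this word to the chunk
        if current_length + len(word) + 1 <= max_length:
            current_chunk.append(word)
            current_length += len(word) + 1
        else:
            # Flush the chunk only if it has content; this avoids emitting an
            # empty chunk when the first word alone exceeds max_length
            if current_chunk:
                chunks.append(" ".join(current_chunk))
            current_chunk = [word]
            current_length = len(word) + 1

    if current_chunk:
        chunks.append(" ".join(current_chunk))

    return chunks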
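
The splitter above measures length in characters and breaks on any whitespace, so a chunk can end in the middle of a sentence. As a possible alternative, here is a rough sketch of a sentence-aware variant. The name `split_into_sentence_chunks` is hypothetical, the sentence detection is deliberately naive (whitespace after '.', '!' or '?'), and a single sentence longer than `max_length` is assumed to be acceptable as its own oversized chunk.

```python
import re


def split_into_sentence_chunks(text, max_length=512):
    """Group whole sentences into chunks of at most max_length characters.

    Hypothetical variant: a sentence longer than max_length is kept as its
    own oversized chunk rather than being cut.
    """
    # Naive sentence boundary: whitespace that follows '.', '!' or '?'
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_length:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


demo = "First sentence here. Second one is a bit longer! Third? Fourth sentence ends it."
for chunk in split_into_sentence_chunks(demo, max_length=40):
    print(len(chunk), repr(chunk))
```

Keeping sentences intact may give the model more coherent input per chunk, at the cost of less evenly sized chunks.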

# Function to summarize text by splitting it into chunks

In [10]:
# Summarize long text chunk by chunk and join the partial summaries
def summarize_text(text, max_chunk_size=512):
    # max_chunk_size is measured in characters, as in split_into_chunks
    chunks = split_into_chunks(text, max_chunk_size)

    summaries = []
    for chunk in chunks:
        # do_sample=False disables sampling so the output is deterministic;
        # max_length and min_length bound the summary length in tokens
        result = summarizer(chunk, max_length=130, min_length=30, do_sample=False)
        summaries.append(result[0]['summary_text'])

    return ' '.join(summaries)
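
Because every chunk is sent to the model with `min_length=30`, very short chunks are still expanded to at least 30 tokens, which can pad or repeat the input. The sketch below shows one way to skip such chunks, under the assumption that text below a character threshold is short enough to keep verbatim. `summarize_chunks`, `min_chars`, and `fake_summarizer` are hypothetical names; the fake summarizer only stands in for the real model so the logic can be run without downloading it.

```python
def summarize_chunks(chunks, summarize_fn, min_chars=200):
    """Summarize each chunk with summarize_fn and join the results.

    Chunks shorter than min_chars are passed through unchanged, on the
    assumption that they are already short enough (hypothetical threshold).
    """
    summaries = []
    for chunk in chunks:
        chunk = chunk.strip()
        if not chunk:
            continue  # nothing to summarize
        if len(chunk) < min_chars:
            summaries.append(chunk)
        else:
            summaries.append(summarize_fn(chunk))
    return " ".join(summaries)


def fake_summarizer(text):
    """Stand-in for the real model: keeps the first five words."""
    return " ".join(text.split()[:5]) + " ..."


chunks = ["Short note.", "word " * 60, "   "]
print(summarize_chunks(chunks, fake_summarizer))
```

With the real pipeline, `summarize_fn` could be `lambda c: summarizer(c, max_length=130, min_length=30, do_sample=False)[0]['summary_text']`, which is the call the notebook already makes.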

# Apply summarization to the 'Question_body' and 'Answer_body' columns

In [None]:
# Apply summarization to the 'Question_body' and 'Answer_body' columns
df['Question_summary'] = df['Question_body'].apply(summarize_text)
df['Answer_summary'] = df['Answer_body'].apply(summarize_text)

# Display the dataframe with summaries

In [12]:
# Display the dataframe with summaries
print(df[['Question_body', 'Question_summary', 'Answer_body', 'Answer_summary']].head())


                                       Question_body  \
0  Kinda new to AWS. I have this high-level quest...   
1  I have some spring boot microservices and I wa...   
2  I'm trying to properly design an application a...   
3  I heard that for .NET8 Microsoft gifted us wit...   
4  I am trying to learn AWS services, and now it ...   

                                    Question_summary  \
0  I need a React app on the client, which is cal...   
1  Spring boot microservices can be used to build...   
2  I'm trying to properly design an application a...   
3  Microsoft gave us a totally fixed authenticati...   
4  I am trying to learn AWS services, and now it ...   

                                         Answer_body  \
0  You send a request, you get a response. In ord...   
1  <blockquote>\ntl;dr: Spring MVC will not contr...   
2  Determining the source of the information is b...   
3  I have always asked myself this very same ques...   
4  Short answer is: no, you don't have to but 

In [None]:
# Step 3: Save the dataframe with summaries to a new Excel file
output_file_path = 'path_to_save_summarized_file/DataSampleSummarized.xlsx'  # Adjust this path as needed
df.to_excel(output_file_path, index=False)
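
Writing the file is likely to fail if the target directory does not exist yet. A small sketch of creating it first follows, using only the standard library. The helper name `prepare_output_path` is hypothetical, and the demonstration runs in a temporary directory rather than at the placeholder path used above.

```python
import tempfile
from pathlib import Path


def prepare_output_path(path):
    """Create the parent directory of path if needed and return it as a Path."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


# Demonstration in a temporary directory so nothing is written to the project
with tempfile.TemporaryDirectory() as tmp:
    target = prepare_output_path(Path(tmp) / "summaries" / "DataSampleSummarized.xlsx")
    print(target.parent.is_dir())
    # In the notebook, df.to_excel(target, index=False) would go here
```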