### Notebook for Text Summarization of the "Description" Column in the AI_contracts Dataset Using the BART Model via Hugging Face's Interface

#### This notebook defines several functions for text preprocessing and summarization with the BART model. Preprocessing tokenizes the text, lowercases it, and removes non-alphabetic tokens and stopwords. The BART model and tokenizer are loaded, and a summarization pipeline is initialized. The `chunk_text` function divides the token sequence of a long text into chunks small enough for the model. The `summarize_long_text` function summarizes each chunk and joins the chunk summaries. The `summarize_short_text` function summarizes a short text directly. Note: Ensure VPN is turned off before running the notebook.

### Files needed to run the notebook:
- `AI_contracts.csv`
### Files generated from the notebook:
- `AI_contracts_summaries.csv`

### Import the libraries for data manipulation and text processing, and download the NLTK packages needed for tokenization and stopwords

In [1]:
import pandas as pd
from transformers import BartTokenizer, BartForConditionalGeneration, pipeline
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')

pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)
pd.set_option('display.max_colwidth', 100)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\sankar.kalaga\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sankar.kalaga\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


#### Load the `AI_contracts.csv` dataset, rename columns, and handle missing descriptions

In [2]:
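filtered_df = pd.read_csv("AI_contracts.csv")
# 'Unnamed: 0' is typically the index column saved by an earlier to_csv call
filtered_df.rename(columns={'Unnamed: 0': 'Original_id'}, inplace=True)
# Inspect rows with a missing description (only displayed when run as the last line of a cell)
filtered_df[filtered_df['Description'].isna()]
# Rows with a missing description fall back to the title
filtered_df['Description'] = filtered_df['Description'].fillna(filtered_df['Title'])
# Drop intermediate text columns that are not needed for summarization
filtered_df.drop(columns=['Title_Desc', 'Title_Description_c_sw_stm', 'Title_Description_c_sw_lem', 'Title_Desc_lst', 'Title_Desc_lower'], inplace=True)
filtered_df.head()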

Unnamed: 0,Original_id,Title,Sol#,Department/Ind.Agency,AwardNumber,...,Awardee,Link,Description,contains_keywords,keyword
0,2,Maestro Pro Assay System Purchase,75N95024Q00304,"HEALTH AND HUMAN SERVICES, DEPARTMENT OF",,...,,https://sam.gov/opp/754bb52525fe47ec981a96f85e02b2a2/view,SAM.gov SOURCES SOUGHT NOTICE Solicitation Number: 75N95024Q00304 Title: Maestro Pro Assay Sys...,True,data science
1,34,99--DOI-BLM Automated Closeout Robotic Process Notice,DOILFBO240043,"INTERIOR, DEPARTMENT OF THE",,...,,https://sam.gov/opp/8babf5bc741343bcb9111b5dd4e635e0/view,The Bureau of Land Management intends to utilize robotic process automation (RPA) to closeout co...,True,robot
2,523,6515--Orthopedic Surgical Robotic Systems (VA-23-00059232),36C10G24R0012,"VETERANS AFFAIRS, DEPARTMENT OF",,...,,https://sam.gov/opp/8fec4ecf4eca4f368af94720ce5a9d52/view,Number: 36C10G24R0012 Notice Type: Combined Synopsis/Solicitation Synopsis: COMBINED SYNOPSIS/SO...,True,robot
3,828,6530--ME-ScriptPro Robot SP 200 CCC 12 SLOT CS: Tracy Heath,36C26024Q0556,"VETERANS AFFAIRS, DEPARTMENT OF",,...,,https://sam.gov/opp/6062ec612f7d4401acec7fd34efb0b18/view,Robotic Prescription Dispensing System,True,robot
4,836,6525-- Xenon UV Robot,36C26224Q1094,"VETERANS AFFAIRS, DEPARTMENT OF",,...,,https://sam.gov/opp/75d2573ba5f14b5cb430c5667a8eea03/view,Combined Synopsis/Solicitation Notice Combined Synopsis/Solicitation Notice Page 3 of 3 Combined...,True,robot


#### The following cell defines the text preprocessing and summarization functions, loads the BART model and tokenizer, and initializes the summarization pipeline

In [3]:
# Text preprocessing: tokenize, lowercase, drop non-alphabetic tokens, and filter out stopwords
def preprocess_text(text):
    stop_words = set(stopwords.words('english'))
    tokens = [word.lower() for word in word_tokenize(text)]
    tokens = [word for word in tokens if word.isalpha() and word not in stop_words]
    return ' '.join(tokens)

# Divide the token IDs of a text into consecutive chunks of at most max_length tokens.
# Note: the special tokens are added once for the whole text before slicing, so only
# the first chunk starts with <s> and only the last chunk ends with </s>.
def chunk_text(text, tokenizer, max_length):
    tokens = tokenizer(text, return_tensors='pt', truncation=False)['input_ids'][0]
    return [tokens[i:i + max_length] for i in range(0, len(tokens), max_length)]

# Summarize each chunk separately and join the chunk summaries with spaces.
# The max_length parameter of this function is the chunk size in tokens; the
# max_length passed to generate() caps each chunk summary at 100 tokens.
def summarize_long_text(text, summarizer, tokenizer, max_length=1024):
    chunks = chunk_text(text, tokenizer, max_length)
    summaries = []
    for chunk in chunks:
        inputs = {'input_ids': chunk.unsqueeze(0)}  # add a batch dimension
        summary_ids = summarizer.model.generate(**inputs, max_length=100, min_length=20, length_penalty=2.0, num_beams=4, early_stopping=True)
        summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
        summaries.append(summary)
    return " ".join(summaries)

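# Generate a summary for a short text in a single pass.
# Note: min_length=20 forces at least 20 generated tokens, so a very short input
# (for example a bare title) can produce a summary that is longer than the input.
def summarize_short_text(text, summarizer):
    input_ids = summarizer.tokenizer.encode(text, return_tensors='pt')
    summary_ids = summarizer.model.generate(input_ids, max_length=100, min_length=20, length_penalty=2.0, num_beams=4, early_stopping=True)
    summary = summarizer.tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary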

# Load the BART model and tokenizer
model_name = "facebook/bart-large-cnn"
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

# Initialize the summarization pipeline
summarizer = pipeline("summarization", model=model, tokenizer=tokenizer)

# Optionally preprocess the text, then route it by token count:
# texts longer than `threshold` tokens are chunked (chunk size `max_length`),
# shorter texts are summarized directly.
def apply_summarization(text, summarizer, tokenizer, max_length=1024, threshold=512, preprocess=False):
    if preprocess:
        text = preprocess_text(text)
    if len(tokenizer.encode(text)) > threshold:
        return summarize_long_text(text, summarizer, tokenizer, max_length)
    else:
        return summarize_short_text(text, summarizer)
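
The chunking and routing logic can be tried without downloading the model. The sketch below applies the same slicing to a plain list of integers standing in for token IDs. The helper names `chunk_ids` and `route` are hypothetical and not part of the notebook; the values 1024 and 512 mirror the defaults of `apply_summarization`.

```python
def chunk_ids(ids, max_length):
    """Split a sequence of token IDs into consecutive chunks of at most max_length."""
    return [ids[i:i + max_length] for i in range(0, len(ids), max_length)]


def route(n_tokens, threshold=512):
    """Mirror the branch in apply_summarization: 'long' above the threshold, else 'short'."""
    return "long" if n_tokens > threshold else "short"


# Stand-in for the token IDs of one long description (hypothetical data)
fake_ids = list(range(2500))

chunks = chunk_ids(fake_ids, max_length=1024)
chunk_lengths = [len(c) for c in chunks]

print(chunk_lengths)                      # two full chunks and a shorter remainder
print(route(len(fake_ids)), route(300))   # which path each text would take
```

Under these defaults, a text of 513 to 1024 tokens takes the long path but still yields a single chunk, so the two paths only differ in practice once a text exceeds 1024 tokens.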


#### Apply the summarization to the DataFrame without preprocessing

In [4]:
filtered_df['Summary_without_preprocessing'] = filtered_df['Description'].apply(lambda x: apply_summarization(x, summarizer, tokenizer, preprocess=False))

In [5]:
# Apply the summarization to the DataFrame with preprocessing
# filtered_df['Summary_with_preprocessing'] = filtered_df['Description'].apply(lambda x: apply_summarization(x, summarizer, tokenizer, preprocess=True))
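
The preprocessing variant is left commented out in this notebook. To see what kind of input it would hand to BART, here is a dependency-free approximation of `preprocess_text`. It is a sketch, not the notebook's function: a regular expression stands in for `word_tokenize`, and the small `STOP_WORDS` set is a hypothetical stand-in for NLTK's English stopword list, so results can differ from the NLTK version on hyphenated words and contractions.

```python
import re

# Hypothetical, tiny stand-in for nltk.corpus.stopwords.words('english')
STOP_WORDS = {"the", "of", "to", "a", "an", "and", "in", "for", "is"}


def preprocess_text_approx(text):
    """Lowercase, split into word-like tokens, keep alphabetic non-stopwords."""
    tokens = re.findall(r"\w+", text.lower())
    kept = [t for t in tokens if t.isalpha() and t not in STOP_WORDS]
    return " ".join(kept)


sample = "The Bureau of Land Management intends to utilize robotic process automation (RPA)"
cleaned = preprocess_text_approx(sample)
print(cleaned)

# Tokens that mix letters and digits, such as solicitation numbers, are dropped entirely
print(preprocess_text_approx("Solicitation Number: 75N95024Q00304"))
```

The cleaned text loses casing, punctuation, and identifiers such as solicitation numbers. That may matter for summary quality, since `facebook/bart-large-cnn` was fine-tuned on ordinary news prose rather than on stopword-free keyword strings.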

#### Calculate and store the lengths (in characters) of the original descriptions and their corresponding summaries in new columns

In [6]:
filtered_df["Description_len"] = filtered_df["Description"].apply(len)
filtered_df["Summary_len"] = filtered_df["Summary_without_preprocessing"].apply(len)

#### Display the DataFrame with original descriptions and their respective summaries generated using the BART model

In [7]:
print("\nDataFrame with Summaries generated using the BART model:")
filtered_df[["Description","Summary_without_preprocessing","Description_len","Summary_len"]]


DataFrame with Summaries generated using the BART model:


Unnamed: 0,Description,Summary_without_preprocessing,Description_len,Summary_len
0,SAM.gov SOURCES SOUGHT NOTICE Solicitation Number: 75N95024Q00304 Title: Maestro Pro Assay Sys...,"This is a Sources Sought Notice. This is NOT a solicitation for proposals, proposal abstracts, o...",6828,715
1,The Bureau of Land Management intends to utilize robotic process automation (RPA) to closeout co...,The Bureau of Land Management intends to utilize robotic process automation to closeout contract...,586,332
2,Number: 36C10G24R0012 Notice Type: Combined Synopsis/Solicitation Synopsis: COMBINED SYNOPSIS/SO...,The Veterans Health Administration (VHA) Non-Expendable Equipment (NX) National Program has iden...,9820,910
3,Robotic Prescription Dispensing System,Robotic Prescription Dispensing System will dispense prescription drugs. System will be availabl...,38,133
4,Combined Synopsis/Solicitation Notice Combined Synopsis/Solicitation Notice Page 3 of 3 Combined...,This is a combined synopsis/solicitation for commercial items prepared in accordance with the fo...,32003,2258
...,...,...,...,...
151,SBIR Phase II Proposal #B20P2-1F01. Early detection and prediction of harmful algal blooms using...,SBIR Phase II Proposal #B20P2-1F01. Early detection and prediction of harmful algal blooms using...,169,150
152,SBIR Phase I Proposal #B205C-0003.Procuring Unique MHSM Fuels for Electricity Generation and Coo...,Procuring Unique MHSM Fuels for Electricity Generation and Cooking with Machine Learning QC Asse...,158,148
153,NASA’s Technology Transfer Program solicits inquiries from companies interested in obtaining lic...,The Lunar Surface Manipulation System (LSMS) allows for fine positioning with complete control o...,3503,283
154,OASIS SB Pool 4 Awardees and Apparent Successful Offerors List In accordance with OASIS SB Pools...,"Government anticipated making 60 awards in OASIS SB Pool 4, including ties at rank 60. The lowes...",2602,196
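
The two length columns make it easy to check how much each description was compressed. The sketch below uses the character counts copied from the first five rows of the table above; the variable names are illustrative and not part of the notebook.

```python
# (Description_len, Summary_len) pairs copied from rows 0-4 of the output table
lengths = [(6828, 715), (586, 332), (9820, 910), (38, 133), (32003, 2258)]

# Summary length as a fraction of description length, rounded to two decimals
ratios = [round(summary / description, 2) for description, summary in lengths]

# Row positions where the "summary" is longer than the text it summarizes
expanded = [i for i, (description, summary) in enumerate(lengths) if summary > description]

print(ratios)
print(expanded)
```

Row 3 is the only one of these five where the summary is longer than the description. Its description is just a 38-character title, and the generation settings use `min_length=20`, so the model is likely producing text that is not supported by the source. Rows like this may be worth filtering or leaving unsummarized.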


#### Save the filtered DataFrame to a CSV file named `AI_contracts_summaries.csv`

In [8]:
# index=False avoids writing the row index as an extra unnamed column
filtered_df.to_csv("AI_contracts_summaries.csv", index=False)
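
The loading cell renames a column called `Unnamed: 0`. That is the name pandas assigns when a CSV written with its row index is read back with `read_csv`. The sketch below reproduces this on a toy DataFrame (hypothetical data, in-memory buffers instead of files) and shows that `index=False` avoids the extra column.

```python
import io

import pandas as pd

# Toy stand-in for the summaries DataFrame
df = pd.DataFrame({"Title": ["a", "b"], "Summary": ["x", "y"]})

# Default behaviour: the row index is written as a column with an empty header
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the data columns are written
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```

If the index is written on every save, each load-and-save round trip adds another unnamed column, so choosing one convention and keeping it across notebooks avoids repeated renaming.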