**Installing Required Libraries for Embedding and Summarization**


In [None]:
!pip install transformers datasets sentence-transformers torch nltk
!pip install langchain openai



**Importing Necessary Libraries**

In [None]:
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from transformers import pipeline



**Downloading the NLTK resources: stopwords, punkt (for tokenization), and wordnet (for lemmatization)**

In [None]:

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

**Initializing stopwords and lemmatizer**

In [None]:

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

**Checking the first five lines of the dataset**

In [None]:

def inspect_dataset(file_path, num_lines=5):
    """Print the first num_lines raw lines of the file."""
    with open(file_path, 'r', encoding='utf-8') as file:
        for i in range(num_lines):
            print(file.readline().strip())

inspect_dataset('/content/FLAT_RCL.txt')

1	02V288000	FORD	FOCUS	2000	02S41	ELECTRICAL SYSTEM:12V/24V/48V BATTERY:CABLES	FORD MOTOR COMPANY	19990719	20010531	V	291854	20030210	ODI	Ford Motor Company	20021106	20021106			CERTAIN PASSENGER VEHICLES EQUIPPED WITH ZETEC ENGINES, LOOSE OR BROKEN ATTACHMENTS AND MISROUTED BATTERY CABLES COULD LEAD TO CABLE INSULATION DAMAGE.	THIS, IN TURN, COULD CAUSE THE BATTERY CABLES TO SHORT RESULTING IN HEAT DAMAGE TO THE CABLES.  BESIDES HEAT DAMAGE, THE "CHECK ENGINE" LIGHT MAY ILLUMINATE, THE VEHICLE MAY FAIL TO START, OR SMOKE, MELTING, OR FIRE COULD ALSO OCCUR.	DEALERS WILL INSPECT THE BATTERY CABLES FOR THE CONDITION OF THE CABLE INSULATION AND PROPER TIGHTENING OF THE TERMINAL ENDS.  AS NECESSARY, CABLES WILL BE REROUTED, RETAINING CLIPS INSTALLED, AND DAMAGED BATTERY CABLES REPLACED.   OWNER NOTIFICATION BEGAN FEBRUARY 10, 2003.   OWNERS WHO DO NOT RECEIVE THE FREE REMEDY  WITHIN A REASONABLE TIME SHOULD CONTACT FORD AT 1-866-436-7332.	ALSO CONTACT THE NATIONAL HIGHWAY TRAFFIC SAFETY ADM

Preprocessing the text to standardize and normalize the dataset.
The function below does the following:

1). Converts the text to lowercase.

2). Removes punctuation and numbers for better summarization.

3). Tokenizes the text, removes stopwords, and lemmatizes the remaining tokens.

Finally, it joins the tokens back into a single string.

In [None]:

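def preprocess_text(text):
    # Lowercase so that matching is case-insensitive
    text = text.lower()
    # Replace punctuation and digits with spaces; re.escape keeps characters
    # such as ] and \ literal inside the character class
    text = re.sub(f"[{re.escape(string.punctuation)}0-9]", " ", text)
    tokens = word_tokenize(text)
    # Drop stopwords, then lemmatize what remains
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    return " ".join(tokens)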
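
To make the effect of the cleaning step concrete, here is a minimal standard-library sketch of the lowercase and punctuation/digit removal only. The helper name `strip_punct_and_digits` and the sample string are assumptions made for illustration; stopword removal and lemmatization are left out so the snippet runs without NLTK data.

```python
import re
import string

# Character class covering all ASCII punctuation plus the digits 0-9.
# re.escape is used here so every punctuation character is taken literally.
PUNCT_AND_DIGITS = re.compile(f"[{re.escape(string.punctuation)}0-9]")

def strip_punct_and_digits(text):
    """Lowercase the text and replace punctuation and digits with spaces."""
    return PUNCT_AND_DIGITS.sub(" ", text.lower())

# Hypothetical sample in the style of a recall record.
sample = "FORD FOCUS 2000: 12V/24V BATTERY CABLES"
cleaned = strip_punct_and_digits(sample).split()
print(cleaned)  # ['ford', 'focus', 'v', 'v', 'battery', 'cables']
```

Note that the model year disappears entirely and `12V` collapses to a lone `v`, which is why fragments such as `l v engine` show up in the preprocessed records.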


**Defining a generator that reads and preprocesses the dataset one line at a time, so the whole file is never held in memory**

In [None]:

def load_dataset_in_chunks(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            yield preprocess_text(line.strip())
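
A generator like this is lazy and single-pass, which matters once a search stops early. The sketch below illustrates that behavior with a small temporary file; `read_lines_lazily` and the three-record file are hypothetical stand-ins, and preprocessing is omitted to keep the example dependency-free.

```python
import itertools
import os
import tempfile

def read_lines_lazily(file_path):
    """Yield one stripped line at a time instead of loading the whole file."""
    with open(file_path, "r", encoding="utf-8") as file:
        for line in file:
            yield line.strip()

# Hypothetical three-record file standing in for the real dataset.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("record one\nrecord two\nrecord three\n")
    path = tmp.name

lines = read_lines_lazily(path)
first_two = list(itertools.islice(lines, 2))  # only two lines are read here
remaining = list(lines)                       # the generator resumes; it does not restart
os.remove(path)

print(first_two)   # ['record one', 'record two']
print(remaining)   # ['record three']
```

Because the generator cannot be rewound, a second search over the same file needs a fresh call to the loader.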

**This function searches the corpus for relevant documents**

In [None]:

def search_relevant_documents(input_data, dataset):
    # preprocess_text strips digits, so the year does not survive as a keyword
    keywords = preprocess_text(f"{input_data['make']} {input_data['model']} {input_data['year']} {input_data['issue']}")
    keyword_list = keywords.split()
    relevant_docs = []
    for document in dataset:
        # Substring test: keep documents that contain every keyword
        if all(keyword in document for keyword in keyword_list):
            relevant_docs.append(document)
        # Stop after the first three matches
        if len(relevant_docs) >= 3:
            break

    return relevant_docs
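
Two properties of this search are easy to miss: the year is removed from the query by the digit stripping, and `keyword in document` is a substring test on a string, not a whole-word test. The sketch below shows both and offers one possible whole-token alternative. `simple_clean` is a simplified stand-in for `preprocess_text` (no stopword removal or lemmatization), and `contains_all_tokens` and the sample document are hypothetical.

```python
import re
import string

def simple_clean(text):
    """Simplified stand-in for preprocess_text: lowercase, strip punctuation and digits."""
    return re.sub(f"[{re.escape(string.punctuation)}0-9]", " ", text.lower())

def contains_all_tokens(document, keywords):
    """Return True when every keyword is a whole token of the document."""
    tokens = set(document.split())
    return all(keyword in tokens for keyword in keywords)

query = {"make": "ford", "model": "escape", "year": "2001", "issue": "stuck throttle risk"}
keywords = simple_clean(" ".join(query.values())).split()
print(keywords)  # ['ford', 'escape', 'stuck', 'throttle', 'risk'] (the year is gone)

# Hypothetical document that contains "ford" only inside another word.
doc = "affordable escape from stuck throttle risk"
substring_hit = all(keyword in doc for keyword in keywords)
token_hit = contains_all_tokens(doc, keywords)
print(substring_hit, token_hit)  # True False
```

Whether whole-token matching is preferable depends on the data; it is stricter and can miss matches if the document and the query are normalized differently.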



**Here we use Facebook's BART model (facebook/bart-large-cnn) for summarization**

In [None]:

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")





**This function returns a summary for each document**

In [None]:

def summarize_documents(documents):
    summaries = []
    for doc in documents:
        # Keep only the first 1000 characters to keep the model input short
        truncated_doc = doc[:1000]
        # do_sample=False makes the generation deterministic
        summary = summarizer(truncated_doc, max_length=100, min_length=30, do_sample=False)
        summaries.append(summary[0]['summary_text'])
    return summaries
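
Slicing with `doc[:1000]` cuts by character count, so the last word can be split in half. As an optional refinement, a minimal sketch of truncating at the last space before the limit follows; `truncate_at_word_boundary` is a hypothetical helper, not part of the notebook.

```python
def truncate_at_word_boundary(text, max_chars=1000):
    """Cut text to at most max_chars characters without splitting a word."""
    if len(text) <= max_chars:
        return text
    # Look for the last space at or before position max_chars.
    cut = text.rfind(" ", 0, max_chars + 1)
    # Fall back to a hard cut when there is no space to break on.
    return text[:cut] if cut > 0 else text[:max_chars]

print(truncate_at_word_boundary("stuck throttle risk", max_chars=10))  # stuck
print(truncate_at_word_boundary("stuck throttle risk", max_chars=14))  # stuck throttle
```

The character limit is only a rough proxy for the model's token limit, so this keeps the same budget while avoiding a dangling word fragment.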




**This is the main function: it combines the functions defined above to load and preprocess the dataset, search it for relevant documents, and summarize the matches**

In [None]:

def summarization_agent(input_data, file_path):
    dataset = load_dataset_in_chunks(file_path)
    relevant_docs = search_relevant_documents(input_data, dataset)
    summaries = summarize_documents(relevant_docs)
    return {
        'retrieved_documents': relevant_docs,
        'summaries': summaries
    }



**Defining an input to check that the pipeline works**

In [None]:

input_data = {
    'make': 'ford',
    'model': 'escape',
    'year': '2001',
    'issue': 'stuck throttle risk'
}




**Printing the fetched relevant documents**

In [None]:
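# Run the agent first so that result is defined before it is printed
result = summarization_agent(input_data, '/content/FLAT_RCL.txt')
print("Retrieved Documents:", result['retrieved_documents'])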

Retrieved Documents: ['v ford escape engine engine cooling ford motor company v odi ford motor company ford motor company recalling certain model year escape vehicle equipped l v engine speed control manufactured october january inadequate clearance engine cover speed control cable connector could result stuck throttle accelerator pedal fully almost fully depressed risk exists regardless whether speed control cruise control used stuck throttle may result high vehicle speed make difficult stop slow vehicle could cause crash serious injury death ford notify owner dealer repair vehicle increasing engine cover clearance free charge safety recall began august remedy part expected available mid august dealer disconnect speed control cable interim remedy part available time owner service appointment owner may contact ford ford recall campaign number owner may also contact national highway traffic safety administration vehicle safety hotline tty go www safercar gov', 'v ford escape vehicle spe

**Printing the summarized text**

In [None]:

result = summarization_agent(input_data, '/content/FLAT_RCL.txt')
print("Summaries:", result['summaries'])

Summaries: ['Ford motor company recalling certain model year escape vehicle equipped l v engine speed control. inadequate clearance engine cover speed control cable connector could result stuck throttle accelerator pedal fully almost fully depressed.', 'Ford motor company recalling certain model year escape vehicle equipped l v engine speed control. inadequate clearance engine cover speed control cable connector could result stuck throttle accelerator pedal fully almost fully depressed risk exists regardless whether speed control cruise control used.', 'Ford motor company recalling certain model year escape vehicle equipped l v engine speed control. inadequate clearance engine cover speed control cable connector could result stuck throttle accelerator pedal fully almost fully depressed.']
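
The first and third summaries above are identical, which can happen when the retrieved records are near-duplicates. One possible way to drop exact repeats while keeping the original order is sketched below; `unique_in_order` and the placeholder strings are hypothetical.

```python
def unique_in_order(items):
    """Remove exact duplicates while preserving first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(items))

# Placeholder strings standing in for model output.
summaries = ["summary a", "summary b", "summary a"]
deduplicated = unique_in_order(summaries)
print(deduplicated)  # ['summary a', 'summary b']
```

This only removes exact matches; near-duplicates such as the second summary, which differs by one trailing clause, would still be kept.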
