# Document Summarization

## Introduction 
This demo showcases a chatbot system powered by Generative AI (OpenAI). Using technologies such as <b>RAG, LangChain, and LLMs</b>, users can ask questions in plain language, retrieve relevant passages, and receive concise answers. The approach combines retrieval-based and generative techniques to deliver accurate, user-friendly answers grounded in the source documents.

Additionally, we use Teradata Vantage as the vector store.

The following diagram illustrates the overall architecture.

<center><img src="images/header_chat_td.png" alt="architecture" /></center>
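
The retrieve-then-generate flow described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the notebook's implementation: the 3-dimensional vectors, the placeholder chunk texts, and the helper names `cosine_similarity`, `retrieve`, and `build_prompt` are all assumptions made for the example, and the similarity search runs in memory here rather than in the database.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, store, top_k=2):
    """Return the texts of the top_k chunks most similar to the query."""
    ranked = sorted(
        store,
        key=lambda row: cosine_similarity(query_vec, row["embedding"]),
        reverse=True,
    )
    return [row["text"] for row in ranked[:top_k]]

def build_prompt(question, context_chunks):
    """Assemble the prompt that would be sent to the LLM."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

# Hypothetical toy store: placeholder texts and made-up 3-d embeddings
toy_store = [
    {"text": "chunk about medical expenses", "embedding": [0.9, 0.1, 0.0]},
    {"text": "chunk about lost luggage", "embedding": [0.1, 0.9, 0.0]},
    {"text": "chunk about trip cancellation", "embedding": [0.0, 0.1, 0.9]},
]

query_vec = [0.8, 0.2, 0.0]  # made-up embedding of the user's question
top_chunks = retrieve(query_vec, toy_store)
print(build_prompt("What does the policy cover?", top_chunks))
```

In the real pipeline the embeddings come from the embedding model and the ranking step is delegated to the database, but the shape of the flow (embed, rank, assemble prompt, call the LLM) stays the same.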

# Steps in the analysis
1. Configure the environment  
2. Connect to Vantage  
3. Data Exploration  
4. Generate the embeddings  
5. Load the existing embeddings to DB  
6. Calculate the VectorDistance using Teradata Vantage in-DB function  
7. LLM  
8. Chat with documents  
9. Cleanup  

# Configure the environment

In [1]:
!pip install --upgrade -r requirements.txt --quiet

Import required libraries

In [1]:
import os
import timeit
import pandas as pd  # used explicitly when building the chunk DataFrame
import tqdm
from tqdm.notebook import *

tqdm_notebook.pandas()

# teradata lib
from teradataml import *

# helper functions
from utils.sql_helper_func import *
from utils.tdapiclient_helper_func import *

# LLM
from langchain.chat_models import ChatOpenAI
from langchain.schema import StrOutputParser
from langchain.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFLoader

from dotenv import load_dotenv

# Suppress warnings
import warnings
warnings.filterwarnings("ignore")
display.max_rows = 5





# Connect to Vantage

Connection details are read from the environment variables `TD_USERNAME`, `TD_PW`, and `TD_HOST`, which `load_dotenv()` populates from a local `.env` file, so no interactive password prompt is needed.

In [8]:
load_dotenv()

input_username = os.getenv('TD_USERNAME')
input_password = os.getenv('TD_PW')
input_host = os.getenv('TD_HOST')

In [3]:
eng = create_context(host = input_host, username=input_username, password = input_password)
print(eng)
execute_sql('''SET query_band='DEMO= Chat_with_docs_VantageDB_GenAI_Python.ipynb;' UPDATE FOR SESSION;''')

Engine(teradatasql://demo_user:***@ruvendataiku2-bglgq0q0y78bcvsk.env.clearscape.teradata.com)


TeradataCursor uRowsHandle=9 bClosed=False

Load OpenAI API key.

In [9]:
api_key = os.getenv('OPENAI_API_KEY')
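
Because `os.getenv` silently returns `None` for a variable that is not set, a missing entry in the `.env` file only surfaces later as a connection or authentication error. A small check like the following can catch that earlier. It is a sketch under assumptions: the helper name `missing_settings` is hypothetical, and the variable names are the ones used in the cells above.

```python
import os

# Variable names taken from the cells above
REQUIRED_VARS = ("TD_HOST", "TD_USERNAME", "TD_PW", "OPENAI_API_KEY")

def missing_settings(env=None, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

missing = missing_settings()
if missing:
    print("Missing settings:", ", ".join(missing))
else:
    print("All required settings are present.")
```

The `env` parameter accepts any mapping, which makes the check easy to exercise without touching the real environment.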

# Data Exploration

This notebook demonstrates how to interact with documentation, such as an insurance policy, using an LLM.

The SmartTraveller Easy Single Trip - International insurance policy is a comprehensive travel insurance plan that covers a wide range of risks, including medical expenses, trip cancellation, loss of luggage, and personal accident. The policy is designed to be affordable and flexible, and it can be purchased online or over the phone.

The source document, published by [AXA](https://axa-com-my.cdn.axa-contento-118412.eu/axa-com-my/3d2f84a5-42b9-459b-911a-710546df0633_Policy+wording+-+SmartTraveller+Easy+Single+Trip+-+International+%280820%29.pdf), is loaded into Teradata Vantage, which serves as the vector store.

Now, let's use the `PyMuPDFLoader` document loader to read the PDF and split it into page-based documents.

In [5]:
from langchain_community.document_loaders import PyMuPDFLoader

pages = PyMuPDFLoader("data/SmartTraveller_International.pdf").load_and_split()
print(pages[2].page_content[:100], "\n\n-----------------------------------------------\n")
print(pages[2].metadata)

page 2 
 
Area 3 (Overseas Only): Worldwide EXCLUDING Iran, Syria, Belarus, Cuba, Democratic Republi 

-----------------------------------------------

{'source': 'data/SmartTraveller_International.pdf', 'file_path': 'data/SmartTraveller_International.pdf', 'page': 1, 'total_pages': 24, 'format': 'PDF 1.7', 'title': '', 'author': 'Nur Syuhada Binti Shafiee (UW)', 'subject': '', 'keywords': '', 'creator': 'Microsoft® Word for Office 365', 'producer': 'Microsoft® Word for Office 365', 'creationDate': "D:20200807090818+08'00'", 'modDate': "D:20200807090818+08'00'", 'trapped': ''}
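
In the output above, the text begins with "page 2" while the metadata reports `'page': 1`, which suggests the `page` field is zero-based. The sketch below counts chunks per human-readable page number under that assumption. The `Doc` class is a hypothetical stand-in for the loader's document objects (which expose `page_content` and `metadata`), so the example runs without the PDF.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Stand-in for a loaded document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def chunks_per_page(docs):
    """Map 1-based page numbers to the number of documents from that page."""
    counts = defaultdict(int)
    for doc in docs:
        # Assumption: metadata["page"] is 0-based, so add 1 for display
        counts[doc.metadata["page"] + 1] += 1
    return dict(counts)

# Made-up sample: one document from the first page, two from the second
sample = [Doc("a", {"page": 0}), Doc("b", {"page": 1}), Doc("c", {"page": 1})]
print(chunks_per_page(sample))
```

With the real data, passing `pages` instead of `sample` would show how many documents `load_and_split()` produced for each page.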


# Generate the embeddings

This section explains how to generate embeddings using the OpenAI Embeddings API. The chosen model, text-embedding-3-small, produces 1536-dimensional vectors.
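
As a rough sketch of what an embedding call can look like, the helper below sends texts to an embeddings endpoint in batches and collects one vector per text. The function name `embed_texts` and the batch size are assumptions for illustration. The client is passed in as a parameter and is expected to behave like the `openai` package's `OpenAI` client, whose `embeddings.create(model=..., input=...)` call returns items carrying `index` and `embedding` fields.

```python
def embed_texts(client, texts, model="text-embedding-3-small", batch_size=100):
    """Embed texts in batches and return one vector (list of floats) per text."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        response = client.embeddings.create(model=model, input=batch)
        # Sort by index so vectors line up with the input order
        ordered = sorted(response.data, key=lambda item: item.index)
        vectors.extend(item.embedding for item in ordered)
    return vectors

# Usage (needs the `openai` package and a valid key, so it is left commented out):
# from openai import OpenAI
# client = OpenAI(api_key=api_key)
# vectors = embed_texts(client, ["What does the policy cover?"])
# len(vectors[0])  # expected to be 1536 for text-embedding-3-small
```

Batching keeps the number of requests low when there are several hundred chunks; the right batch size depends on the account's rate limits.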

In [6]:
def read_document_content(pages):
    docs = [p.page_content for p in pages]

    # split the page content
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=200,
        chunk_overlap=30,
        length_function=len,
        is_separator_regex=False,
    )

    docs = text_splitter.create_documents(docs)

    # collect the text of every chunk
    texts_data = [t.page_content for t in docs]

    # generate the dataframe with "id" as the first column
    df = pd.DataFrame({
        "id": range(1090, 1090 + len(texts_data)),
        "text": texts_data,
    })

    return df

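The function above collects the text of every page and splits it with `RecursiveCharacterTextSplitter` into chunks of at most 200 characters with a 30-character overlap. Splitting on natural boundaries and overlapping neighbouring chunks helps preserve the context of sentences. Each chunk is assigned a sequential `id` starting at 1090.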
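
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified fixed-width sliding window. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter breaks on separators, so its chunks are often shorter than `chunk_size`); the function name `sliding_chunks` is hypothetical and the sketch only illustrates how overlap makes consecutive chunks share text.

```python
def sliding_chunks(text, chunk_size=200, chunk_overlap=30):
    """Cut text into fixed-width windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

# Made-up 500-character sample text
sample_text = "".join(chr(97 + i % 26) for i in range(500))
windows = sliding_chunks(sample_text)
print([len(w) for w in windows])               # [200, 200, 160]
print(windows[0][-30:] == windows[1][:30])     # True: 30 shared characters
```

The overlap means a sentence cut at a chunk boundary still appears in full, or nearly so, in at least one of the two neighbouring chunks.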

In [7]:
# read pdf file content
df = read_document_content(pages)

# copy docs to vantage
copy_to_sql(df, table_name="docs_data", primary_index="id", if_exists="replace")

tdf_docs = DataFrame("docs_data")
print("Data information: \n", tdf_docs.shape)
tdf_docs.sort("id")

Data information: 
 (642, 2)


id,text
1090,page 1 STTRE/PL(08/20) SmartTraveller Easy Single Trip - International Policy coverage attaching to and forming part of the Policy Schedule IMPORTANT NOTICE
1091,IMPORTANT NOTICE Welcome to Your SmartTraveller Easy - International Policy. Please read this policy carefully together with Your Policy
1092,Schedule to ensure that You understand the terms and conditions and that the cover You require is being provided. If You
1093,"have any questions after reading this document, please contact Your insurance advisor or AXA Affin General Insurance"
1094,"Berhad. If there are any changes in Your circumstances that may affect the insurance provided, please notify Us immediately, otherwise You may not receive the full benefits of this policy."
