# Welcome Everyone

This notebook shows how to leverage AI to answer questions from our own documents.


### **UNSTRUCTURED Library**
The unstructured library is designed to help preprocess and structure unstructured text documents for use in downstream machine learning tasks. Examples of documents that can be processed using the unstructured library include PDF, XML, and HTML documents.

### **LOADING THE DOCUMENT(file)**
Plain text files, HTML, XML, JSON, and Emails are immediately supported without any additional dependencies.

If you’re processing document types beyond the basics, you can install the necessary extras, for example:

**pip install "unstructured[docx,pdf]"**

To install all the additional document types:

**pip install "unstructured[all-docs]"**

The additional document types are: "csv", "doc", "docx", "epub", "image", "md", "msg", "odt", "org", "pdf", "ppt", "pptx", "rtf", "rst", "tsv", "xlsx"

Note: A few document extras are already installed for you to try (docx, pdf, pptx, md, msg).


In [2]:
import glob
import os
from unstructured.partition.auto import partition

folder_path = "docs"
# Using glob to get a list of file paths in the specified folder and its subfolders
file_paths = glob.glob(os.path.join(folder_path, "**"), recursive=True)

file_elements = []
for file_path in file_paths:
    if os.path.isfile(file_path):
        elements = partition(filename=file_path)
        file_elements.append(elements)
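
The loop above hands every file found under `docs` to `partition`. As a minimal sketch, the paths could first be narrowed to extensions that are expected to work. The set `SUPPORTED_EXTENSIONS` and the helper `filter_supported` are hypothetical names introduced here, and the set itself is an assumption built from the types listed above (the built-in ones plus the installed extras), not something the library exposes.

```python
from pathlib import Path

# Assumed set: the types supported out of the box plus the extras
# this notebook says are already installed. Adjust to your setup.
SUPPORTED_EXTENSIONS = {
    ".txt", ".html", ".xml", ".json", ".eml",
    ".docx", ".pdf", ".pptx", ".md", ".msg",
}


def filter_supported(paths):
    """Return only the paths whose extension is in SUPPORTED_EXTENSIONS."""
    return [p for p in paths if Path(p).suffix.lower() in SUPPORTED_EXTENSIONS]


# Hypothetical candidate paths; only the extension is inspected.
candidates = ["docs/IEEE.pdf", "docs/notes.MD", "docs/archive.zip", "docs/sub"]
print(filter_supported(candidates))
```

The comparison is case-insensitive, so `notes.MD` is kept, while entries without a matching extension are dropped.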

### **How the Documents(file) are categorized into elements**

The element objects represent different components of the source document. Examples: NarrativeText, ListItem, Title, Table, Image, etc.

To learn more about elements, visit: https://unstructured-io.github.io/unstructured/introduction.html#document-elements


In [3]:
total_elements = sum(len(row) for row in file_elements)
print(f"Total element count: {total_elements}")
file_elements[0][0].to_dict()

Total element count: 3984


{'type': 'Title',
 'element_id': '12266cc0b8b0d8278150b03a8bc457cf',
 'metadata': {'coordinates': {'points': ((66.767, 52.48632479999992),
    (66.767, 104.29162479999991),
    (545.2360133, 104.29162479999991),
    (545.2360133, 52.48632479999992)),
   'system': 'PixelSpace',
   'layout_width': 612,
   'layout_height': 792},
  'filename': 'IEEE.pdf',
  'file_directory': 'docs',
  'last_modified': '2023-09-28T10:34:08',
  'filetype': 'application/pdf',
  'page_number': 1},
 'text': 'CNN based Curved Path Detection and Obstacle Avoidance for Autonomous Rover'}
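
To get a feel for how a document was categorized, it can help to count the elements per type. The sketch below works on plain dictionaries shaped like the `to_dict()` output above; the sample data and the helper name `count_element_types` are hypothetical.

```python
from collections import Counter

# Hypothetical sample shaped like the `to_dict()` output shown above,
# reduced to the two keys this sketch needs.
sample_elements = [
    {"type": "Title", "text": "Introduction"},
    {"type": "NarrativeText", "text": "Rovers need to follow curved paths."},
    {"type": "ListItem", "text": "Detect the path"},
    {"type": "ListItem", "text": "Avoid obstacles"},
]


def count_element_types(element_dicts):
    """Count how many elements of each type appear."""
    return Counter(d["type"] for d in element_dicts)


type_counts = count_element_types(sample_elements)
print(type_counts.most_common())
```

With the real data, the same helper could be fed `el.to_dict()` for every element in `file_elements`.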

### **Chunking the documents**
Document chunking is the process of splitting a document into smaller pieces such as pages, paragraphs, or sentences.

Here, the unstructured package provides a method that chunks the documents into paragraphs based on their Title elements.

In [4]:
from unstructured.chunking.title import chunk_by_title

chunks = []
for elements in file_elements:
    chunks.extend(chunk_by_title(elements))

print(f"{len(chunks) = }")
chunks[0].to_dict()

len(chunks) = 429


{'type': 'CompositeElement',
 'element_id': 'c1815235350bbd9718006fb5f4aeecad',
 'metadata': {'filename': 'IEEE.pdf',
  'file_directory': 'docs',
  'last_modified': '2023-09-28T10:34:08',
  'filetype': 'application/pdf',
  'page_number': 1},
 'text': 'CNN based Curved Path Detection and Obstacle Avoidance for Autonomous Rover\n\nSudeep Aryan G 1, Bhanu Meher Srinavas R 2 ,Neethika K 3,Dr.Jayabarathi R 4 Department of Electrical and Electronics Engineering Amrita School of Engineering,Coimbatore Amrita Vishwa Vidyapeetham, India cb.en.u4eee19151@cb.students.amrita.edu, r jayabarathi@cb.amrita.edu'}
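
The core idea of title-based chunking can be illustrated in plain Python: start a new chunk whenever a Title element appears and join everything up to the next Title. This is a simplified sketch, not the implementation of `chunk_by_title`, which applies further rules that are not reproduced here. The function name `chunk_by_title_sketch` and the sample elements are hypothetical.

```python
def chunk_by_title_sketch(elements):
    """Group (type, text) pairs into chunks, starting a new chunk at each Title."""
    chunks = []
    current = []
    for element_type, text in elements:
        # A Title closes the running chunk (if any) and opens a new one.
        if element_type == "Title" and current:
            chunks.append("\n\n".join(current))
            current = []
        current.append(text)
    # Flush the last chunk once the input is exhausted.
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Hypothetical sample elements, not taken from the documents above.
sample = [
    ("Title", "Introduction"),
    ("NarrativeText", "Why path detection matters."),
    ("Title", "Method"),
    ("NarrativeText", "A CNN classifies each frame."),
    ("ListItem", "Collect images"),
]
sketch_chunks = chunk_by_title_sketch(sample)
print(len(sketch_chunks))
```

Joining with a blank line mirrors the `\n\n` separator visible in the `text` field of the chunk shown above.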

### **Initializing the Large Language Models**
Choose the models used for embedding the chunks and for summarizing the retrieved results into an answer.

In [5]:
from llama_index.llms import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from dotenv import load_dotenv, find_dotenv

# Load secrets from .env file
load_dotenv(find_dotenv(), override=True)

llm = OpenAI(temperature=0.9, model="gpt-3.5-turbo")
# Alternative: a local embedding model, used instead of the OpenAI one below
# embedding_llm = "local:BAAI/bge-large-en"
embedding_llm = OpenAIEmbedding(model="text-embedding-ada-002")


### **Embedding the chunks**
Convert the chunks into Llama-Index Document objects and embed those objects.

Here we use the default vector store from Llama-Index. A few other vector databases available are **Pinecone, Chroma, Faiss**, and so on.

In [6]:
from llama_index import VectorStoreIndex, Document, ServiceContext

service_context = ServiceContext.from_defaults(llm=llm, embed_model=embedding_llm)

# Converting each chunk into llama-index's Document
documents = list(
    map(
        lambda chunk: Document(text=chunk.text, extra_info=chunk.metadata.to_dict()),
        chunks,
    )
)

# Embedding the Documents(Chunks)
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

### **Storing the Embedding**

Embeddings are numerical representations of text. The text is first split into tokens, where a token can be a word or a piece of a word.

Instead of embedding the same documents multiple times, the embeddings can be stored locally and reused.

In [7]:
index.storage_context.persist(
    persist_dir="VectorStore-online"
)  # persist_dir is the folder in which the embeddings will be stored
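
The benefit of persisting is that the embedding model is called only once per text. The sketch below shows that compute-once pattern with a JSON file; it is an illustration only and not the storage format Llama-Index uses. `get_or_compute_embeddings` and `fake_embed` are hypothetical names, and `fake_embed` is a stand-in for a real embedding model.

```python
import json
import os
import tempfile


def get_or_compute_embeddings(cache_path, texts, embed_fn):
    """Load embeddings from cache_path if it exists, otherwise compute and save them."""
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as f:
            return json.load(f)
    embeddings = {text: embed_fn(text) for text in texts}
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(embeddings, f)
    return embeddings


calls = []


def fake_embed(text):
    """Stand-in for a real embedding model: records the call, returns a toy vector."""
    calls.append(text)
    return [float(len(text)), 1.0]


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "embeddings.json")
    first = get_or_compute_embeddings(path, ["alpha", "be"], fake_embed)
    # The second call finds the file and never touches fake_embed.
    second = get_or_compute_embeddings(path, ["alpha", "be"], fake_embed)

print(len(calls))
```

The second call returns the same vectors without invoking the embedding function again, which is the saving that a persisted index provides.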

### **Loading the Embedding**

Load the embeddings from local storage.

In [8]:
from llama_index import StorageContext, load_index_from_storage

storage_context = StorageContext.from_defaults(persist_dir="./VectorStore-online")
loaded_index = load_index_from_storage(storage_context)

In [9]:
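# Optional: a metadata filter that restricts retrieval to a single file.
# Note that this filter is not applied to the query engine created below.
from llama_index.vector_stores.types import MetadataFilters, ExactMatchFilter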
filters = MetadataFilters(filters=[ExactMatchFilter(key="filename", value="tps25750.pdf")])
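
An exact-match filter keeps only the entries whose metadata value for a key equals the given value. The plain-Python sketch below illustrates that idea on dictionaries; it does not use or reproduce the Llama-Index classes, and `exact_match` and the sample nodes are hypothetical.

```python
def exact_match(nodes, key, value):
    """Keep only the nodes whose metadata[key] equals value."""
    return [node for node in nodes if node["metadata"].get(key) == value]


# Hypothetical nodes; the filenames come from this notebook, the text is made up.
sample_nodes = [
    {"text": "Path detection with a CNN.", "metadata": {"filename": "IEEE.pdf"}},
    {"text": "Device overview.", "metadata": {"filename": "tps25750.pdf"}},
    {"text": "Pin configuration.", "metadata": {"filename": "tps25750.pdf"}},
]
matching = exact_match(sample_nodes, key="filename", value="tps25750.pdf")
print([node["text"] for node in matching])
```

Nodes that lack the key altogether are dropped as well, because `.get` returns `None` for them.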

### **Creating a query engine**

In [10]:
TOP_SIMILARITIES = 5  # Number of chunks to retrieve for each user query
query_engine = index.as_query_engine(similarity_top_k=TOP_SIMILARITIES)
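
Conceptually, `similarity_top_k` means: score every chunk against the query and keep the best few. The sketch below shows this with toy vectors, assuming cosine similarity as the measure. It illustrates the idea only and is not the retrieval code of Llama-Index; `cosine_similarity`, `top_k`, and `toy_vectors` are hypothetical names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, chunk_vectors, k):
    """Return the ids of the k chunks most similar to the query, best first."""
    ranked = sorted(
        chunk_vectors,
        key=lambda chunk_id: cosine_similarity(query_vector, chunk_vectors[chunk_id]),
        reverse=True,
    )
    return ranked[:k]


# Toy 3-dimensional vectors; real embeddings have far more dimensions.
toy_vectors = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.9, 0.1, 0.0],
    "chunk-c": [0.0, 1.0, 0.0],
}
print(top_k([1.0, 0.0, 0.0], toy_vectors, k=2))
```

A larger `k` gives the language model more context to answer from, at the cost of a longer prompt.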

### **Querying the Document**

In [11]:
response = query_engine.query(
    "what is the abstract of this IEEE paper CNN"
)
response.response

"I'm sorry, but I cannot provide the abstract of this IEEE paper based on the given context information."

### **Source Nodes Retrieved for the User Query**

In [12]:
nodes = list(response.source_nodes)
# print(*nodes, sep="\n")
nodes[0].metadata

{'filename': 'IEEE.pdf',
 'file_directory': 'docs',
 'last_modified': '2023-09-28T10:34:08',
 'filetype': 'application/pdf',
 'page_number': 5}
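
Metadata like the dictionary above can be condensed into a short list of sources to show next to an answer. The helper `format_sources` below is a hypothetical sketch that works on plain dictionaries with the same keys; the sample data is made up to match that shape.

```python
def format_sources(metadata_list):
    """Build one line per distinct source, keeping first-seen order."""
    seen = []
    for metadata in metadata_list:
        page = metadata.get("page_number")
        # Some file types carry no page number, so fall back to the filename alone.
        if page is None:
            line = metadata["filename"]
        else:
            line = f"{metadata['filename']}, page {page}"
        if line not in seen:
            seen.append(line)
    return "\n".join(seen)


# Hypothetical metadata shaped like the dictionary above.
sample_metadata = [
    {"filename": "IEEE.pdf", "page_number": 5},
    {"filename": "IEEE.pdf", "page_number": 2},
    {"filename": "IEEE.pdf", "page_number": 5},
    {"filename": "notes.md"},
]
print(format_sources(sample_metadata))
```

With a real response, the input could be `node.metadata` for each entry in `response.source_nodes`.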

### **Sample UI to try Q&A on your documents**

In [13]:
import gradio as gr


def generate_response(input_text):
    response = query_engine.query(input_text)
    return response.response


question_text = gr.Textbox(lines=7, label="Question")
answer_text = gr.Textbox(lines=7, label="Answer")

gr.Interface(
    fn=generate_response,
    inputs=question_text,
    outputs=answer_text,
    title="Q&A on your Documents",
).launch(inline=False, inbrowser=True)

Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.


