<a href="https://colab.research.google.com/github/Ankitha2003/AI-Powered-Virtual-Analyst/blob/main/Ai_Powered_Virtual_analyst.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Step 1: Install Necessary Libraries

This project needs the following libraries:
- **NumPy**: For numerical operations.
- **Pandas**: For data manipulation (not directly used, but useful).
- **FAISS**: For efficient similarity search.
- **Sentence-Transformers**: For generating text embeddings.
- **LangChain**: For splitting text into chunks (used in Step 3).
- **Gradio**: To create the user interface.
- **python-docx**: To extract text from `.docx` files.

Run the command below to install them:

In [None]:
!pip install numpy pandas faiss-cpu sentence-transformers langchain gradio python-docx




#### **Explanation:**

This installs all the libraries needed for document processing, embeddings, and the Gradio interface.
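
Before moving on, it can help to confirm that the installed packages are importable. The snippet below is a minimal sketch, not part of the original notebook: `find_missing_modules` is a hypothetical helper, and the list uses import names, which differ from some pip names (`faiss-cpu` imports as `faiss`, `python-docx` as `docx`).

```python
import importlib.util

def find_missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]

# Import names for the packages used in this notebook (assumed list)
required = ["numpy", "pandas", "faiss", "sentence_transformers",
            "gradio", "docx", "langchain"]

missing = find_missing_modules(required)
print("Missing modules:", missing or "none")
```

If the list is not empty, re-run the install cell and restart the runtime before continuing.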

### Step 2: Extract Text from .docx Files

In this step, we load and extract text from the three NABARD reports in `.docx` format. The function below reads the documents and extracts their content.

The extracted text from each report will then be combined into one corpus for further processing.

In [10]:
from docx import Document

# Function to extract text from a .docx file
def extract_text_from_docx(file_path):
    doc = Document(file_path)
    text = [para.text for para in doc.paragraphs if para.text.strip()]
    return " ".join(text)

# Extract text from all the uploaded reports
doc1_text = extract_text_from_docx("Annual Report 2020-21 - FINAL.docx")
doc2_text = extract_text_from_docx("Annual Report 2021-22 - FINAL.docx")
doc3_text = extract_text_from_docx("Annual Report 2022-23 - FINAL.docx")

# Combine all documents into one text corpus
corpus = [doc1_text, doc2_text, doc3_text]


#### **Explanation:**

- The `extract_text_from_docx` function reads a `.docx` file and joins the text of its non-empty paragraphs with spaces.
- The text of the three reports is then collected in the `corpus` list, one entry per report, for processing in the next steps.
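
Note that `doc.paragraphs` does not include text stored inside tables, and annual reports tend to contain many of them. The sketch below shows one possible way to pick up table cells as well. The helper names `collect_document_text` and `extract_text_with_tables`, and the `" | "` cell separator, are illustrative assumptions rather than part of the original notebook.

```python
def collect_document_text(doc):
    """Gather non-empty paragraph text, then cell text from top-level tables."""
    parts = [para.text.strip() for para in doc.paragraphs if para.text.strip()]
    for table in doc.tables:
        for row in table.rows:
            # Keep only cells that contain text; join one row per entry
            cells = [cell.text.strip() for cell in row.cells if cell.text.strip()]
            if cells:
                parts.append(" | ".join(cells))
    return " ".join(parts)

def extract_text_with_tables(file_path):
    """Open a .docx file and return paragraph text followed by table text."""
    from docx import Document  # imported here so the helper above stays dependency-free
    return collect_document_text(Document(file_path))
```

In this sketch the table text is appended after all paragraphs, so the original reading order is not preserved. That is usually acceptable for retrieval, but worth keeping in mind.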


### Step 3: Split Text into Chunks

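In this step, we split the text into smaller chunks for processing. The splitter targets chunks of about 1000 characters, with a 100-character overlap between consecutive chunks so that context straddling a chunk boundary is not lost. Note that `CharacterTextSplitter` only splits at its separator (by default a blank line, `"\n\n"`), so a chunk can exceed 1000 characters when the text contains few separators. Because Step 2 joins paragraphs with single spaces, that is likely the case here, which would explain the low chunk count in the output.

Smaller chunks make it easier to generate embeddings and retrieve relevant passages from the documents.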


In [11]:
from langchain.text_splitter import CharacterTextSplitter

# Split the text into manageable chunks
splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = []
for doc in corpus:
    chunks.extend(splitter.split_text(doc))

print(f"Number of chunks created: {len(chunks)}")




Number of chunks created: 6


#### **Explanation:**

- `CharacterTextSplitter` divides the text at its separator and merges the pieces into chunks of up to 1000 characters where possible, with an overlap of up to 100 characters between consecutive chunks.
- The overlap helps preserve context across chunk boundaries, which matters for semantic search and retrieval.
- The `chunks` list stores all the chunks, and the final count is printed to verify how many were created.
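
To make the size and overlap parameters concrete, here is a minimal pure-Python sketch of fixed-size chunking with a sliding window. It is not how `CharacterTextSplitter` works internally (that class splits on a separator first). The function name `chunk_text` is a hypothetical stand-in used only for illustration.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=100):
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

# Small demonstration with tiny sizes so the overlap is visible
sample_chunks = chunk_text("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(sample_chunks)
```

Each chunk in the demonstration starts with the last three characters of the previous one, which is the same idea as the 100-character overlap used above. Unlike a separator-based splitter, this approach can cut words in half.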

### Steps 4 and 5: Generate Embeddings and Create a FAISS Vector Store for Efficient Search

In this step, we first generate an embedding for every chunk with the `all-MiniLM-L6-v2` Sentence-Transformers model, then create a **FAISS vector store** to hold those embeddings. FAISS (Facebook AI Similarity Search) is a library for efficient similarity search over large collections of vectors. We'll use it to quickly find the chunks most relevant to a query.

Adding the embeddings to the FAISS index lets us later retrieve the chunks most similar to a given query.


In [12]:
from sentence_transformers import SentenceTransformer
import numpy as np
import faiss

# Load a local embedding model
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings for all chunks
chunk_embeddings = embedding_model.encode(chunks, convert_to_numpy=True)

# Create a FAISS vector store
dimension = chunk_embeddings.shape[1]  # Embedding size
index = faiss.IndexFlatL2(dimension)
index.add(chunk_embeddings)

print("FAISS index created with embeddings!")


FAISS index created with embeddings!


#### **Explanation:**

- `faiss.IndexFlatL2(dimension)`: Initializes a FAISS index that compares embeddings by L2 (Euclidean) distance; a smaller distance means a closer match. A flat index performs an exact, exhaustive search.
- `index.add(chunk_embeddings)`: Adds the embeddings of all chunks to the index so they can be searched.
- `dimension`: The embedding dimensionality, read from the shape of `chunk_embeddings`. It is the number of features in each embedding.
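
For intuition about what an exhaustive L2 search computes, the following NumPy sketch ranks a handful of toy vectors by squared Euclidean distance to a query. It is an illustration under simplifying assumptions, not FAISS's implementation; `brute_force_l2_search` and the toy data are made up for this example.

```python
import numpy as np

def brute_force_l2_search(vectors, queries, top_k):
    """Return (squared distances, indices) of the top_k nearest vectors per query."""
    # Pairwise differences via broadcasting: shape (n_queries, n_vectors, dim)
    diffs = queries[:, None, :] - vectors[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    # Sort each row by distance and keep the top_k closest entries
    order = np.argsort(sq_dists, axis=1)[:, :top_k]
    return np.take_along_axis(sq_dists, order, axis=1), order

# Three toy 2-dimensional "embeddings" and one query
toy_vectors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]], dtype="float32")
toy_query = np.array([[0.9, 0.1]], dtype="float32")

toy_distances, toy_indices = brute_force_l2_search(toy_vectors, toy_query, top_k=2)
print(toy_indices, toy_distances)
```

The query is closest to the vector at index 1, followed by index 0. As with `IndexFlatL2`, the distances in this sketch are squared, so they are suitable for ranking but are not Euclidean distances themselves.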

### Step 6: Define Search and Answer Functions

In this step, we define two functions:
1. **`search_documents(query, top_k=3)`**: This function searches the FAISS index to find the most relevant text chunks for a given query.
2. **`answer_query(query)`**: This function retrieves the relevant chunks from the FAISS index and combines them to form an answer.

These functions enable us to query the documents and return meaningful results.


In [13]:
def search_documents(query, top_k=3):
    # Encode the query
    query_embedding = embedding_model.encode([query], convert_to_numpy=True)

    # Search the FAISS index for the most relevant chunks
    distances, indices = index.search(query_embedding, top_k)

    # Retrieve the matching chunks (FAISS pads with -1 when fewer than top_k results exist)
    results = [chunks[idx] for idx in indices[0] if idx != -1]
    return results

def answer_query(query):
    # Search for relevant chunks
    relevant_chunks = search_documents(query)

    # Combine the relevant chunks into one text
    context = " ".join(relevant_chunks)

    # Answer the query using the context (for simplicity, return the context)
    return context


#### **Explanation:**

- `search_documents`: Encodes the query into an embedding, searches the FAISS index for the `top_k` most similar chunks, and returns them.
- `answer_query`: Calls `search_documents` to retrieve the relevant chunks for the input query, then joins them into a single string. No language model is involved; the retrieved context itself serves as the answer.
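
The search also returns distances, which the functions above discard. One possible refinement, sketched below, is to keep each chunk together with its distance and drop weak matches. The helper `pair_hits`, the `max_distance` parameter, and the demo values are assumptions for illustration; a sensible threshold would have to be tuned on real queries.

```python
def pair_hits(chunks, indices, distances, max_distance=None):
    """Pair each retrieved chunk with its distance, skipping padding and weak matches."""
    hits = []
    for idx, dist in zip(indices, distances):
        if idx < 0:
            continue  # FAISS uses -1 to pad when fewer than top_k results exist
        if max_distance is not None and dist > max_distance:
            continue  # too far from the query to be considered relevant
        hits.append((chunks[idx], float(dist)))
    return hits

# Toy data standing in for one row of index.search() output
demo_chunks = ["credit growth", "rural infrastructure", "climate action"]
demo_hits = pair_hits(demo_chunks, indices=[2, 0, -1],
                      distances=[0.4, 1.3, 3.4e38], max_distance=1.0)
print(demo_hits)
```

With the notebook's objects, the call would look like `pair_hits(chunks, indices[0], distances[0])`, where `distances` and `indices` come from `index.search`.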

### Step 7: Gradio Interface

This step sets up a Gradio interface to allow users to interact with the AI virtual assistant. Users can type their queries, and the assistant will provide the answer based on the NABARD reports.


In [15]:
import gradio as gr

# Define a function for the Gradio interface
def query_assistant(input_query):
    try:
        # Get the answer from the QA system
        answer = answer_query(input_query)
        return answer if answer.strip() else "No relevant information found."
    except Exception as e:
        return f"Error: {str(e)}"

# Create the Gradio interface
interface = gr.Interface(
    fn=query_assistant,
    inputs="text",
    outputs="text",
    title="AI Virtual Assistant for NABARD Reports",
    description="Ask any questions related to the NABARD annual reports."
)
interface.launch()






#### **Explanation:**

- `query_assistant`: Handles a user query, returns the answer, and reports any exception as an error message.
- `gr.Interface`: Displays a simple text box for user input and shows the assistant's response.
- `interface.launch()`: Starts the web app.
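
Because the answer is the raw retrieved context, it can be long, and a blank query would still return some chunks. The sketch below wraps an answer function with basic input validation and output truncation. `make_query_handler`, the `max_chars` limit, and the messages are hypothetical choices, and the demo uses a stub in place of `answer_query`.

```python
def make_query_handler(answer_fn, max_chars=2000):
    """Wrap answer_fn with blank-input checks, error reporting, and truncation."""
    def handle(input_query):
        query = (input_query or "").strip()
        if not query:
            return "Please enter a question."
        try:
            answer = answer_fn(query).strip()
        except Exception as exc:
            return f"Error: {exc}"
        if not answer:
            return "No relevant information found."
        if len(answer) > max_chars:
            # Cut overly long context so the text box stays readable
            answer = answer[:max_chars].rstrip() + "..."
        return answer
    return handle

# Stub answer function used only for this demonstration
demo_handler = make_query_handler(lambda q: f"Context for: {q}", max_chars=50)
print(demo_handler("agricultural credit"))
```

In the notebook, the wrapped function could be passed to Gradio as `fn=make_query_handler(answer_query)` in place of `query_assistant`.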