In [None]:
pip install -U jupyter ipywidgets

In [None]:
pip install -U langchain langchain-openai

In [None]:
pip install -U langchain-groq langchain-chroma langchain-huggingface langchain-community langchain-text-splitters pymupdf spacy

In [None]:
pip install openai

In this step, LangSmith is configured to enable tracing and monitoring of the information retrieval system. The following settings are used:
- LANGSMITH_TRACING=True: Activates tracing to log system activities.
- LANGSMITH_ENDPOINT: Specifies the API endpoint for the LangSmith service.
- LANGSMITH_API_KEY: Authenticates the API request using a unique key.
- LANGSMITH_PROJECT: Assigns the project name "tourism" to organize and track project-specific traces.

In [1]:
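import os

# Export the settings as environment variables; plain Python assignments
# would not be visible to the LangSmith client.
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGSMITH_PROJECT"] = "tourism"
# LANGSMITH_API_KEY is entered in the next cell with getpass; do not hardcode it.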

Here, environment variables are set to enable LangSmith tracing and API key authentication. LANGSMITH_TRACING is set to "true" to turn on tracing, which is useful for tracking system performance and data flow. The API key is read with getpass.getpass() so that it is not echoed to the screen as it is typed. The key is then stored in the LANGSMITH_API_KEY environment variable, where it stays available for API calls during the rest of the session.

In [3]:
import getpass
import os

os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = getpass.getpass()

 ········


The Groq API key is configured here so that the LLaMA 3 chat model can be used. The code first checks whether the GROQ_API_KEY environment variable is already set; if it is not, the user is prompted to enter the key securely with getpass.getpass(). The "llama3-8b-8192" model from the Groq provider is then initialized with init_chat_model(), giving an llm object that can generate text responses in later steps.

In [5]:
import getpass
import os
from langchain.chat_models import init_chat_model

# Set up Groq API key
if not os.environ.get("GROQ_API_KEY"):
    os.environ["GROQ_API_KEY"] = getpass.getpass("Enter API key for Groq: ")

# Initialize LLaMA 3 chat model
llm = init_chat_model("llama3-8b-8192", model_provider="groq")


Enter API key for Groq:  ········


- Embeddings Definition: The Hugging Face Embeddings model (sentence-transformers/all-mpnet-base-v2) is used to transform text into numerical vector representations.
- Semantic Representation: These vectors capture the semantic meaning of the text, so that passages can be compared and searched by meaning rather than by exact wording.
- Vector Store Initialization: The Chroma vector store is initialized with the embedding function in order to store and handle these vector representations.
- Efficient Retrieval: This configuration allows for efficient similarity search and retrieval of text information for downstream NLP tasks.
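
To make the idea of comparing vectors concrete, here is a minimal sketch of cosine similarity, one common way to compare embeddings. The three-dimensional vectors are made-up stand-ins for real embeddings (all-mpnet-base-v2 produces 768-dimensional vectors), and `cosine_similarity` is an illustrative helper, not part of LangChain or Chroma.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings (hypothetical values).
beach_holiday = [0.9, 0.1, 0.0]
coastal_trip = [0.8, 0.2, 0.1]
tax_form = [0.0, 0.1, 0.9]

print(cosine_similarity(beach_holiday, coastal_trip))  # close to 1: similar direction
print(cosine_similarity(beach_holiday, tax_form))      # close to 0: unrelated
```

Texts with similar meaning should map to vectors that point in similar directions, which is what the score close to 1 reflects.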

In [8]:
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings

# Define embeddings model
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

# Initialize Chroma vector store with the embeddings function
vector_store = Chroma(embedding_function=embeddings)

- PDF Loading: PyMuPDFLoader is used to load the text content of the given PDF file (one.pdf).
- Text Cleaning: Pages whose content is empty or whitespace-only are dropped, and surrounding whitespace is stripped from the rest.
- Document Conversion: The cleaned text is wrapped in new Document objects for subsequent processing (the original page metadata is not carried over).
- Text Splitting: The content is split into smaller, overlapping chunks (500 characters with a 100-character overlap) using RecursiveCharacterTextSplitter to improve search and retrieval.
- Embedding Initialization: The Hugging Face model (multi-qa-mpnet-base-dot-v1) is initialized to produce semantic vector representations of the text.
- Vector Store Setup: The ChromaDB vector store is created and filled with the split document chunks for similarity-based search and retrieval.
- Confirmation: The final print statement signals that the chunks were added to ChromaDB without raising an error.
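
The effect of the chunk overlap can be illustrated with a simplified fixed-width splitter. This is not how RecursiveCharacterTextSplitter works internally (it prefers to break on separators such as paragraph and line breaks); `split_with_overlap` is a hypothetical helper that only shows how a chunk size and an overlap interact.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-width chunks whose starts are chunk_size - chunk_overlap apart."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in split_with_overlap(sample, chunk_size=10, chunk_overlap=4):
    print(chunk)
```

Each chunk repeats the last four characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.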

In [1]:
from langchain_community.document_loaders import PyMuPDFLoader  # Load PDFs
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_core.documents import Document  # Import Document class

# Load PDF content
pdf_path = "one.pdf"  # Update with actual filename
loader = PyMuPDFLoader(pdf_path)
docs = loader.load()

# Strip surrounding whitespace and drop pages that are empty or whitespace-only
cleaned_docs = [doc.page_content.strip() for doc in docs if doc.page_content.strip()]

# Convert cleaned text back to Document objects
cleaned_documents = [Document(page_content=text) for text in cleaned_docs]

# Split text into smaller chunks for better retrieval
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
all_splits = text_splitter.split_documents(cleaned_documents)

# Initialize Hugging Face embeddings model
embeddings = HuggingFaceEmbeddings(model_name="multi-qa-mpnet-base-dot-v1")

# Initialize Chroma vector store
vector_store = Chroma(embedding_function=embeddings)

# Store split documents in ChromaDB
vector_store.add_documents(documents=all_splits)

print("PDF content successfully stored in ChromaDB!")


PDF content successfully stored in ChromaDB!


The code retrieves everything stored in ChromaDB with the get() method of the Chroma vector store. By default this returns the ids, text, and metadata of the stored chunks (embeddings are returned only when requested through the include argument), which makes it possible to check that the documents were indexed. Printing all_docs shows the stored chunks, confirming that the PDF content was processed and is available for later search and retrieval.
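
Printing the whole dictionary is hard to read once there are many chunks. A small summary helper can be sketched as follows; `summarize_store` is a hypothetical function, and `sample_result` is a hand-written stand-in with made-up ids and texts, assuming the result is a dict with "ids" and "documents" keys as in the output of get().

```python
def summarize_store(result, preview_chars=60):
    """Return (chunk_count, previews) for a dict shaped like the output of get()."""
    ids = result.get("ids", [])
    documents = result.get("documents") or []
    previews = [text[:preview_chars].replace("\n", " ") for text in documents]
    return len(ids), previews

# Hand-written stand-in for vector_store.get(); the ids and texts are made up.
sample_result = {
    "ids": ["id-1", "id-2"],
    "documents": ["1. Domestic Tourism\nOverall Visits: ...", "7. Gujarat\nSafety Tips: ..."],
    "metadatas": [{}, {}],
}

count, previews = summarize_store(sample_result)
print(f"{count} chunks stored")
for preview in previews:
    print("-", preview)
```

In the notebook, the same helper could be called as `summarize_store(vector_store.get())`.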

In [3]:
# Fetch all documents stored in ChromaDB
all_docs = vector_store.get()
print(all_docs)


{'ids': ['1efe471a-51a5-4f1f-96f9-280809444df8', '892e3d13-45ce-4e70-8e1c-d82dbf29e1fe', '8cb75054-ccac-4d4e-b714-7f0b9e683f18', '8b4faa80-1c78-44ca-ac87-cb0373d20dc0', 'd5b15813-e1c6-463c-9a11-2e92cf46bf2e', '4ca1c339-87c8-409f-ba96-365e3c3b307c', 'f1e6e63a-bfbb-48b4-b9ba-691282d34e9a', '810ae3d4-d4d6-4ec4-85a0-f27b27e0a791', 'e73ac1c7-0137-40a5-9f00-f005f2525487', '1cdf69df-a929-40ba-a4dc-57cc0b80e8e2', '323b5795-8c32-4a00-9103-4db15da94c5d', 'b20104d2-c4a7-471e-acad-cee6959bd52c', '9c361c68-b91d-4181-8fe4-27b776591b6d', '75b230e4-3607-4b2e-88a7-b3fadabe0b21', 'c6970ca5-6f34-4e40-93b8-02959226a55b', '484db021-5a24-4ee2-95be-1d81cfc4b385', 'fc76136f-018f-412c-82c7-a3581ce28b2e', '53e6e257-5303-4ef0-b24a-53750ad4dd65', '0b72ccb2-bf4b-445e-ab56-45f87ead149f', '8c8b7eff-5559-4432-a9ea-16818376cd75', '640ab92e-58d9-485f-b00f-d365e83fcf0f', '55041abd-d678-4afa-81c1-b824b20b6ac6', 'c282fa24-9582-4bcf-b7f2-a5cb55928cd4', '8f2f8f53-9642-4c05-b39f-91ad05266402', '5c4cb111-331a-4a76-ac21-787aee

- The search query "top states for tourism in India" is used to retrieve relevant information from the ChromaDB vector store.
- The similarity_search() method returns the 5 chunks (k=5) whose embeddings are most similar to the embedding of the query.
- The retrieved chunks contain tourism statistics, cultural festivals, local cuisine, and safety tips for various Indian states.
- The output lists the chunks ranked as most similar to the query; the first one holds the table of top states by domestic tourist visits.
- This shows how ChromaDB supports semantic search and document retrieval over the tourism dataset.
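
Conceptually, a top-k similarity search scores every stored vector against the query vector and keeps the k best. The sketch below does this by brute force with a dot-product score over made-up two-dimensional vectors. Chroma uses an approximate nearest-neighbour index and its own distance metric, so this illustrates the idea rather than its implementation; `top_k` and the sample data are hypothetical.

```python
import heapq

def top_k(query_vec, indexed_chunks, k):
    """Return the k (score, text) pairs with the highest dot-product score."""
    scored = [
        (sum(q * v for q, v in zip(query_vec, vec)), text)
        for text, vec in indexed_chunks
    ]
    return heapq.nlargest(k, scored)

# Made-up chunk texts and vectors standing in for the real index.
index = [
    ("Top states by domestic tourist visits", [0.9, 0.1]),
    ("Local cuisine of Gujarat", [0.4, 0.6]),
    ("Safety tips for Haryana", [0.1, 0.9]),
]
query_vec = [1.0, 0.0]
for score, text in top_k(query_vec, index, k=2):
    print(f"{score:.2f}  {text}")
```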

In [5]:
query = "top states for tourism in India"
retrieved_docs = vector_store.similarity_search(query, k=5)  # Retrieve top 5 similar chunks

for i, doc in enumerate(retrieved_docs):
    print(f"Document {i+1}:")
    print(doc.page_content)
    print("-" * 80)


Document 1:
NLP EXP 
1. Domestic Tourism 
Overall Visits: 
In 2021, domestic tourist visits to various states and Union Territories (UTs) in 
India reached approximately 677.6 million, marking an 11.05% increase from 
610.2 million visits in 2020. 
Top States by Domestic Tourist Visits in 2022: 
Rank State Number of Visits (in millions) Percentage Share 
1 Uttar Pradesh 317.91 18.37% 
2 Tamil Nadu 218.58 12.63% 
3 Andhra Pradesh 192.72 11.13% 
4 Karnataka 182.41 10.54% 
5 Gujarat 135.81 7.85%
--------------------------------------------------------------------------------
Document 2:
7 . Gujarat 
Safety Tips: Safe for tourists, but be aware of high temperatures in summer. 
Languages: Gujarati, Hindi, English. 
Cultural Festivals: Navratri, Rann Utsav, International Kite Festival. 
Local Cuisine: Dhokla, Thepla, Undhiyu. 
8. Haryana 
Safety Tips: Urban areas are safe, but rural areas require basic precautions. 
Languages: Hindi, Haryanvi, Punjabi. 
Cultural Festivals: Surajkund Mela, Ba

- The query "Best tourist destinations in India by state" is passed to the similarity_search() method of the ChromaDB vector store to obtain the 5 most similar chunks.
- Retrieval is based on semantic similarity computed with the Hugging Face embeddings model.
- The retrieved chunks include information on tourist spots, cultural festivals, local food, safety advice, and languages spoken in various Indian states.
- The search returns chunks covering Gujarat, Haryana, Andhra Pradesh, Karnataka, and Kerala, with their corresponding tourism information.
- This illustrates how semantic search retrieves contextually relevant passages from a larger body of text for tourism-related queries.

In [7]:
query = "Best tourist destinations in India by state"
retrieved_docs = vector_store.similarity_search(query, k=5)  # Retrieve top 5 similar chunks

for i, doc in enumerate(retrieved_docs):
    print(f"Document {i+1}:")
    print(doc.page_content)
    print("-" * 80)


Document 1:
7 . Gujarat 
Safety Tips: Safe for tourists, but be aware of high temperatures in summer. 
Languages: Gujarati, Hindi, English. 
Cultural Festivals: Navratri, Rann Utsav, International Kite Festival. 
Local Cuisine: Dhokla, Thepla, Undhiyu. 
8. Haryana 
Safety Tips: Urban areas are safe, but rural areas require basic precautions. 
Languages: Hindi, Haryanvi, Punjabi. 
Cultural Festivals: Surajkund Mela, Baisakhi. 
Local Cuisine: Bajra Roti, Churma, Kachri ki Sabzi. 
9. Himachal Pradesh
--------------------------------------------------------------------------------
Document 2:
NLP EXP 
1. Domestic Tourism 
Overall Visits: 
In 2021, domestic tourist visits to various states and Union Territories (UTs) in 
India reached approximately 677.6 million, marking an 11.05% increase from 
610.2 million visits in 2020. 
Top States by Domestic Tourist Visits in 2022: 
Rank State Number of Visits (in millions) Percentage Share 
1 Uttar Pradesh 317.91 18.37% 
2 Tamil Nadu 218.58 12.63% 


- Loading NLP Model:
The spaCy en_core_web_sm model is loaded to extract named entities, such as geopolitical locations, from the text.
- Indian States List:
A set of Indian state names is predefined so that only state mentions are kept from the documents.
- Regex Pattern:
A regular expression is compiled to match state names regardless of case (uppercase, lowercase, or mixed).
- Document Retrieval:
The query "Best tourist destinations in India by state" is used to retrieve the top 5 similar chunks with the vector_store.similarity_search() method.
- NER-Based Extraction:
spaCy NER is run on each document's text, and entities labelled GPE (geopolitical entity) or LOC (location) are kept if they appear in the predefined set.
- Regex-Based Extraction:
The regex pattern is then applied to each document to catch state mentions that spaCy may have missed.
- Duplicate Removal and Order Maintenance:
A set tracks the states already seen to avoid duplicates, while a list preserves the order of first appearance.
- Final Output:
The cleaned list of Indian states is printed, showing each state once in the order in which it was encountered.
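
One subtlety of combining a case-insensitive regex with a set is that "GUJARAT" and "Gujarat" count as different strings unless they are normalized first. Below is a minimal sketch of order-preserving, case-insensitive de-duplication. It uses a shortened three-state pattern for illustration, and `unique_states` is a hypothetical helper.

```python
import re

# Shortened pattern for illustration; the full list of states is omitted here.
STATES = re.compile(r"\b(?:Gujarat|Haryana|Tamil Nadu)\b", re.IGNORECASE)

def unique_states(texts):
    """Collect state names in order of first appearance, ignoring case differences."""
    ordered, seen = [], set()
    for text in texts:
        for match in STATES.findall(text):
            name = match.title()  # "TAMIL NADU" -> "Tamil Nadu"
            if name not in seen:
                seen.add(name)
                ordered.append(name)
    return ordered

print(unique_states(["7. GUJARAT and Haryana", "Tamil Nadu, then gujarat again"]))
# -> ['Gujarat', 'Haryana', 'Tamil Nadu']
```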

In [9]:
import spacy
import re

# Load spaCy NLP model
nlp = spacy.load("en_core_web_sm")

# Predefined list of Indian states
indian_states = {
    "Andhra Pradesh", "Arunachal Pradesh", "Assam", "Bihar", "Chhattisgarh", 
    "Goa", "Gujarat", "Haryana", "Himachal Pradesh", "Jharkhand", "Karnataka", 
    "Kerala", "Madhya Pradesh", "Maharashtra", "Manipur", "Meghalaya", "Mizoram", 
    "Nagaland", "Odisha", "Punjab", "Rajasthan", "Sikkim", "Tamil Nadu", 
    "Telangana", "Tripura", "Uttar Pradesh", "Uttarakhand", "West Bengal"
}

# Define the regex pattern for extracting states
state_pattern = re.compile(r"\b(?:Andhra Pradesh|Arunachal Pradesh|Assam|Bihar|Chhattisgarh|"
                           r"Goa|Gujarat|Haryana|Himachal Pradesh|Jharkhand|Karnataka|"
                           r"Kerala|Madhya Pradesh|Maharashtra|Manipur|Meghalaya|Mizoram|"
                           r"Nagaland|Odisha|Punjab|Rajasthan|Sikkim|Tamil Nadu|"
                           r"Telangana|Tripura|Uttar Pradesh|Uttarakhand|West Bengal)\b", 
                           re.IGNORECASE)

# Modify the query for better retrieval
query = "Best tourist destinations in India by state"
retrieved_docs = vector_store.similarity_search(query, k=5)  # Retrieve top 5 similar chunks

# Print raw output
print("\n🔹 RAW DOCUMENTS RETRIEVED FROM VECTOR STORE 🔹")
for i, doc in enumerate(retrieved_docs):
    print(f"\nDocument {i+1}:")
    print(doc.page_content)
    print("-" * 80)

# Use a list to preserve order
cleaned_states = []
seen_states = set()

for doc in retrieved_docs:
    text = doc.page_content
    doc_nlp = nlp(text)

    # Extract using spaCy NER
    for ent in doc_nlp.ents:
        if ent.label_ in ["GPE", "LOC"] and ent.text in indian_states and ent.text not in seen_states:
            cleaned_states.append(ent.text)
            seen_states.add(ent.text)

    # Extract using regex (normalize case so "GUJARAT" and "Gujarat" count once)
    matches = state_pattern.findall(text)
    for match in matches:
        state = match.title()
        if state not in seen_states:
            cleaned_states.append(state)
            seen_states.add(state)

# Print cleaned output with preserved order
print("\n🔹 CLEANED OUTPUT: TOP STATES FOR TOURISM IN INDIA 🔹")
for state in cleaned_states:
    print(f"- {state}")



🔹 RAW DOCUMENTS RETRIEVED FROM VECTOR STORE 🔹

Document 1:
7 . Gujarat 
Safety Tips: Safe for tourists, but be aware of high temperatures in summer. 
Languages: Gujarati, Hindi, English. 
Cultural Festivals: Navratri, Rann Utsav, International Kite Festival. 
Local Cuisine: Dhokla, Thepla, Undhiyu. 
8. Haryana 
Safety Tips: Urban areas are safe, but rural areas require basic precautions. 
Languages: Hindi, Haryanvi, Punjabi. 
Cultural Festivals: Surajkund Mela, Baisakhi. 
Local Cuisine: Bajra Roti, Churma, Kachri ki Sabzi. 
9. Himachal Pradesh
--------------------------------------------------------------------------------

Document 2:
NLP EXP 
1. Domestic Tourism 
Overall Visits: 
In 2021, domestic tourist visits to various states and Union Territories (UTs) in 
India reached approximately 677.6 million, marking an 11.05% increase from 
610.2 million visits in 2020. 
Top States by Domestic Tourist Visits in 2022: 
Rank State Number of Visits (in millions) Percentage Share 
1 Uttar Pr

- Dataset Used:
The data is Indian tourism data stored in a vector database.
The documents contain information on states, festivals, seasons, languages, and other tourism details.
- Libraries Used:
spaCy: Natural Language Processing (NLP) operations such as Named Entity Recognition (NER).
re: Regular expressions for pattern matching.
- Regex Patterns and NER for State Extraction:
A regex pattern is defined to match common Indian state names.
In addition to the regex, spaCy's NER model is used to extract location entities (GPE and LOC).
- Query Categorization:
Queries are divided into four types:
States
Festivals
Seasons
Languages
Each type is detected through keyword matching.
- Document Retrieval:
A similarity search is run against the vector store using the user's query.
The top k documents are retrieved on the basis of similarity.
- Information Extraction:
Depending on the query type, the relevant details are extracted from the retrieved documents.
If no category matches, general information is returned.
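
The keyword-based categorization can be sketched in isolation as follows. The keyword lists mirror the four categories described above, and the dictionary order defines which category wins when several match. `classify_query` is a hypothetical helper written for illustration, not the notebook's routing function.

```python
# Keyword lists mirror the categories described above; order defines priority.
CATEGORY_KEYWORDS = {
    "states": ["top states", "which states", "most visited states"],
    "festivals": ["festival", "celebration"],
    "seasons": ["season", "best time", "climate", "weather"],
    "languages": ["language", "spoken"],
}

def classify_query(query):
    """Return the first category whose keyword occurs in the query, else 'general'."""
    lowered = query.lower()  # keywords are lowercase, so lowercase the query too
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return "general"

for q in ["top states for tourism in India",
          "Best season to visit Rajasthan",
          "Tell me about foreign tourist visits in 2023"]:
    print(q, "->", classify_query(q))
```

Because matching is a plain substring test, a keyword such as "festival" also covers "festivals". The trade-off is that it can fire on unrelated words that happen to contain a keyword.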

In [11]:
import re
import spacy

# Load spaCy NLP model
nlp = spacy.load("en_core_web_sm")

# Predefined regex patterns for different query types
STATE_PATTERN = re.compile(r"\b(?:Gujarat|Karnataka|Madhya Pradesh|Maharashtra|Rajasthan|"
                           r"Andhra Pradesh|Telangana|Uttarakhand|Tamil Nadu|Sikkim|"
                           r"Kerala|Uttar Pradesh|Punjab|West Bengal)\b", re.IGNORECASE)

# Keywords to detect the intent of the query
STATE_QUERY_KEYWORDS = ["top states", "which states", "most visited states"]
FESTIVAL_QUERY_KEYWORDS = ["festival", "festivals", "celebration"]
SEASON_QUERY_KEYWORDS = ["season", "best time", "climate", "weather"]
LANGUAGE_QUERY_KEYWORDS = ["language", "spoken", "official language", "languages"]  # lowercase: matched against lowercased text

def extract_states_from_text(text):
    """Extracts state names using Regex and Named Entity Recognition (NER)."""
    extracted_states = []
    seen_states = set()

    # Extract using Regex
    for match in STATE_PATTERN.findall(text):
        state = match.strip()
        if state not in seen_states:
            extracted_states.append(state)
            seen_states.add(state)

    # Extract using NER (Named Entity Recognition)
    doc_nlp = nlp(text)
    for ent in doc_nlp.ents:
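        # Any GPE/LOC entity is accepted here, not only state names, so entries
        # such as "India" or mis-tagged words like "Hindi" can end up in the result.
        if ent.label_ in ["GPE", "LOC"]:  # Location-based entities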
            state = ent.text.strip()
            if state not in seen_states:
                extracted_states.append(state)
                seen_states.add(state)

    return extracted_states

def retrieve_and_clean_states(query, vector_store, k=5):
    """Retrieves documents and extracts state names."""
    retrieved_docs = vector_store.similarity_search(query, k=k)
    all_states = []

    for doc in retrieved_docs:
        extracted_states = extract_states_from_text(doc.page_content)
        all_states.extend(extracted_states)

    return list(set(all_states))  # Ensure unique states

def retrieve_category_info(query, vector_store, category_keywords, k=2):
    """Retrieves specific information based on query type (festivals, seasons, languages)."""
    retrieved_docs = vector_store.similarity_search(query, k=k)
    
    relevant_info = []
    for doc in retrieved_docs:
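        # An empty keyword list means "no filter" (used for general queries);
        # without this check, any() over an empty list is always False.
        if not category_keywords or any(keyword in doc.page_content.lower() for keyword in category_keywords):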
            relevant_info.append(doc.page_content)

    return "\n".join(relevant_info) if relevant_info else "No relevant data found."

def handle_query(query, vector_store):
    """Routes query to the correct function based on intent."""
    
    # 1️⃣ Check if the query is about states
    if any(keyword in query.lower() for keyword in STATE_QUERY_KEYWORDS):
        cleaned_states = retrieve_and_clean_states(query, vector_store)
        return f"🔹 **Top States for Tourism in India:**\n" + ", ".join(cleaned_states)
    
    # 2️⃣ Check if the query is about festivals
    elif any(keyword in query.lower() for keyword in FESTIVAL_QUERY_KEYWORDS):
        return retrieve_category_info(query, vector_store, FESTIVAL_QUERY_KEYWORDS)

    # 3️⃣ Check if the query is about seasons
    elif any(keyword in query.lower() for keyword in SEASON_QUERY_KEYWORDS):
        return retrieve_category_info(query, vector_store, SEASON_QUERY_KEYWORDS)

    # 4️⃣ Check if the query is about languages
    elif any(keyword in query.lower() for keyword in LANGUAGE_QUERY_KEYWORDS):
        return retrieve_category_info(query, vector_store, LANGUAGE_QUERY_KEYWORDS)

    # 5️⃣ If no category matches, return general information
    else:
        general_info = retrieve_category_info(query, vector_store, [])
        return f"🔹 **Retrieved Information:**\n" + general_info

# Example Queries
query1 = "top states for tourism in India"  # Should return state names
query2 = "Which festivals are celebrated in Maharashtra?"  # Should return festivals
query3 = "Best season to visit Rajasthan"  # Should return season info
query4 = "What language is spoken in Kerala?"  # Should return language info
query5 = "Tell me about foreign tourist visits in 2023"  # Should return general info

# Get results
print("🔹 Query 1 Result:\n", handle_query(query1, vector_store))
print("\n" + "=" * 80 + "\n")
print("🔹 Query 2 Result:\n", handle_query(query2, vector_store))
print("\n" + "=" * 80 + "\n")
print("🔹 Query 3 Result:\n", handle_query(query3, vector_store))
print("\n" + "=" * 80 + "\n")
print("🔹 Query 4 Result:\n", handle_query(query4, vector_store))
print("\n" + "=" * 80 + "\n")
print("🔹 Query 5 Result:\n", handle_query(query5, vector_store))


🔹 Query 1 Result:
 🔹 **Top States for Tourism in India:**
Uttar Pradesh, Kerala, Tamil Nadu, Karnataka, Hindi, Dhuska, Maharashtra, Uttarakhand, Tamil, West Bengal, Karam, Santali, Haryanvi, Nagpuri, India, Telangana, Gujarat, Andhra Pradesh, Rajasthani, Rajasthan


🔹 Query 2 Result:
 crowded localities. 
14. Maharashtra 
Peak Season: October to March. 
Details: Pleasant weather ideal for beach holidays, city tours, and hill 
station retreats. 
Off-Peak Season: April to June. 
Details: Hot and humid; fewer tourists mean discounted rates. 
Major Festivals: 
Ganesh Chaturthi (August/September): A ten-day festival celebrating 
Lord Ganesha with processions, music, and dance. 
Gudi Padwa (March/April): Marathi New Year marked by processions, 
cultural performances, and festive
Languages: Hindi, Bundeli, Malvi. 
Cultural Festivals: Khajuraho Dance Festival, Lokrang Festival. 
Local Cuisine: Poha, Bhutte ka Kees, Dal Bafla. 
​
 14. Maharashtra 
Safety Tips: Be cautious of scams in Mumbai; av