# NarrowStreets Navigator: AI-Generated Personalized Travel Itineraries

By Aastha Prashar

#### This project aims to create an AI-powered travel assistant that generates personalized travel itineraries tailored to user preferences such as budget, available time, desired activities, and specific locations. By combining Retrieval-Augmented Generation (RAG), fine-tuning, and prompt engineering, the system is intended to provide recommendations that adapt to user interactions in real time.

In [None]:
# React is a JavaScript library and is installed with npm, not pip
pip install langchain langchain-openai pinecone-client openai pandas python-dotenv flask

### Integrating Pinecone with OpenAI Embeddings for Data Retrieval

This script demonstrates how to use Pinecone and OpenAI embeddings to build a semantic search engine for tourist destinations in India. It reads destination data from a CSV file, generates an embedding for each row, and stores the embeddings in a Pinecone index. The code cell and its output come first; a step-by-step explanation follows.

In [5]:
import os
from pinecone import Pinecone, ServerlessSpec
from langchain_openai.embeddings import OpenAIEmbeddings
import openai
import pandas as pd

# Read API keys from the environment; never hard-code secrets in a notebook
openai_api_key = os.environ["OPENAI_API_KEY"]  # Also read automatically by OpenAIEmbeddings
pinecone_api_key = os.environ["PINECONE_API_KEY"]
pinecone_environment = os.getenv("PINECONE_ENVIRONMENT", "us-east-1")  # Serverless region

# Create a Pinecone instance
pc = Pinecone(api_key=pinecone_api_key)

# Check if the index exists; if not, create one
index_name = "indian-tourist-destinations"
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,  # Must match the embedding model's output size (e.g. 1536 for text-embedding-ada-002)
        metric="cosine",  # Metric to measure similarity
        spec=ServerlessSpec(cloud="aws", region=pinecone_environment)
    )

# Connect to the index
index = pc.Index(index_name)

# Debug: Ensure `index` is not None
if index is None:
    raise ValueError("Failed to initialize or connect to Pinecone index.")

print(f"Connected to Pinecone index: {index_name}")

# Load your CSV data
file_path = "Top_Indian_Places_to_Visit.csv"  # Replace with the actual CSV file path
df = pd.read_csv(file_path)


# Combine relevant columns into a single text representation
df["combined_text"] = df.apply(
    lambda row: f"{row['Zone']}, {row['State']}, {row['City']}, {row['Name']}, {row['Type']}, {row['Significance']}, Best time to visit: {row['Best Time to visit']}",
    axis=1
)

# Create embeddings for the data and upload to Pinecone
embeddings = OpenAIEmbeddings()  # Automatically picks up the API key from the environment
for i, row in df.iterrows():
    text = row["combined_text"]
    embedding = embeddings.embed_query(text)
    index.upsert([
        {"id": str(i), "values": embedding, "metadata": {"name": row["Name"]}}  # Format for Pinecone upsert
    ])

# Retrieval function
def retrieve_relevant_data(user_query):
    query_embedding = embeddings.embed_query(user_query)  # Create the query embedding
    results = index.query(
        vector=query_embedding,  # Use the query embedding
        top_k=5,                # Number of top matches to retrieve
        include_metadata=True   # Include metadata in the results
    )
    return results

# Example usage
user_query = "Best places to visit in Delhi during the morning"
results = retrieve_relevant_data(user_query)

# Display the results
print("Top results:")
for result in results["matches"]:
    destination_name = result["metadata"]["name"]  # Retrieve the destination name
    print(result['id'], destination_name, result['score'])



Connected to Pinecone index: indian-tourist-destinations
Top results:
4 Jantar Mantar 0.900983095
10 Garden of Five Senses 0.895219922
318 Rail Museum 0.893747509
5 Chandni Chowk 0.88187921
14 Qutub Minar 0.879444361


# Explanation Points

### 1. Setup and API Configuration  
- The script begins by importing necessary libraries and setting up API keys for OpenAI and Pinecone. These keys are used to access their respective services.

### 2. Pinecone Initialization  
- A Pinecone client is initialized with the API key. The script checks whether the index exists and, if it does not, creates a serverless index in the configured region with dimension 1536 and cosine similarity as the comparison metric.

### 3. Data Preparation  
- A CSV file containing details about Indian tourist destinations is loaded, and relevant columns are combined into a single text representation to form the basis for embeddings.

### 4. Generating and Storing Embeddings  
- Text embeddings are generated using OpenAI's embedding model and stored in the Pinecone index, along with metadata for each destination.
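
Embedding and upserting one row at a time costs one OpenAI request and one Pinecone request per destination. A minimal sketch of a batched alternative follows, assuming the embedder exposes LangChain's `embed_documents` method and the index exposes Pinecone's `upsert(vectors=...)`. The helper names `chunked` and `upsert_in_batches` and the batch size of 100 are illustrative choices, not part of the original notebook.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence, size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def upsert_in_batches(index, embedder, ids: List[str], texts: List[str],
                      names: List[str], batch_size: int = 100) -> int:
    """Embed texts in batches and upsert them; returns the number of vectors sent.

    `index` is assumed to behave like a Pinecone index (has `upsert(vectors=...)`)
    and `embedder` like a LangChain embeddings object (has `embed_documents`).
    """
    if not (len(ids) == len(texts) == len(names)):
        raise ValueError("ids, texts, and names must have the same length")
    sent = 0
    for batch in chunked(list(zip(ids, texts, names)), batch_size):
        batch_ids, batch_texts, batch_names = zip(*batch)
        vectors = embedder.embed_documents(list(batch_texts))  # one request per batch
        index.upsert(vectors=[
            {"id": i, "values": v, "metadata": {"name": n}}
            for i, v, n in zip(batch_ids, vectors, batch_names)
        ])
        sent += len(batch)
    return sent
```

With the notebook's DataFrame, a call might look like `upsert_in_batches(index, embeddings, df.index.astype(str).tolist(), df["combined_text"].tolist(), df["Name"].tolist())`.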

### 5. Semantic Search Function  
- A function is defined to retrieve the most relevant results based on a user query. The query is embedded and compared against stored embeddings in the Pinecone index.

### 6. Example Query  
- The script provides an example query to retrieve relevant tourist destinations based on user input, showcasing the functionality of the semantic search engine.

---

## Key Features
- **Embedding Generation**: Converts textual data into high-dimensional vectors for semantic understanding.  
- **Efficient Indexing**: Utilizes Pinecone for scalable and fast search operations.  
- **Dynamic Queries**: Supports user-defined queries to retrieve contextually relevant results.

---

## Applications
This script can be extended to build:  
- **Personalized Travel Recommendation Engines**  
- **Content-based Search Engines**  
- **Contextual Data Analysis Systems**


# Querying the Pinecone Index with a Test Vector

### Overview
This snippet demonstrates how to perform a test query on the Pinecone index using a mock vector. A query vector of appropriate dimension is created, and the `query` method of the Pinecone index is used to retrieve the top 5 most relevant matches. The results include metadata associated with the stored embeddings, providing insights into the retrieved items.


In [2]:
test_query = [0.1] * 1536  # Mock query vector of appropriate dimension
test_results = index.query(vector=test_query, top_k=5, include_metadata=True)
print(test_results)


{'matches': [{'id': '35',
              'metadata': {'name': 'Ramoji Film City'},
              'score': -0.0256339014,
              'values': []},
             {'id': '177',
              'metadata': {'name': 'Jim Corbett National Park'},
              'score': -0.0257302076,
              'values': []},
             {'id': '84',
              'metadata': {'name': 'Ajmer Sharif Dargah'},
              'score': -0.0257691536,
              'values': []},
             {'id': '106',
              'metadata': {'name': 'Kovalam Beach'},
              'score': -0.0258768685,
              'values': []},
             {'id': '92',
              'metadata': {'name': 'Golden Temple (Harmandir Sahib)'},
              'score': -0.0259095058,
              'values': []}],
 'namespace': '',
 'usage': {'read_units': 6}}


### Explanation of Output

The output represents the results of a query performed on the Pinecone index. Here's a breakdown of its key components:

1. **`matches`**  
   - Contains a list of the top 5 results retrieved from the index. Each match includes the following details:
     - **`id`**: The unique identifier of the record in the index.
     - **`metadata`**: Metadata associated with the record, such as the name of the location.
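     - **`score`**: A similarity score indicating how well the query matches the indexed record. The index uses the cosine metric, so higher values mean greater similarity; the near-zero negative values here show that the constant mock vector is not meaningfully close to any stored embedding.
     - **`values`**: An empty list in this case, because the query did not request the stored vectors (`include_values` was not set).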

2. **`namespace`**  
   - Represents the namespace in which the query was executed. In this case, it is empty, indicating that the default namespace was used.

3. **`usage`**  
   - Shows resource usage information. For this query, `read_units` indicates that 6 units of read capacity were consumed.

---

### Key Insights
- The retrieved results include metadata like the names of popular locations (e.g., "Ramoji Film City" and "Golden Temple").
- The similarity scores rank the results in descending order, so closer matches appear higher in the list. With a mock vector, the ranking carries no semantic meaning.
- The `values` lists are empty because the query did not set `include_values=True`; only IDs, scores, and metadata are returned.
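
Because the index was created with `metric="cosine"`, each score is the cosine of the angle between the query vector and a stored vector. The sketch below computes that quantity in plain Python on toy three-dimensional vectors (illustrative values, not real embeddings) to show how direction, not magnitude, determines the score.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


query = [1.0, 0.0, 0.0]
print(cosine_similarity(query, [2.0, 0.0, 0.0]))   # same direction, different length
print(cosine_similarity(query, [0.0, 3.0, 0.0]))   # orthogonal
print(cosine_similarity(query, [-1.0, 0.0, 0.0]))  # opposite direction
```

Scores near 1 indicate close matches, scores near 0 indicate unrelated vectors, and scores near -1 indicate opposite directions. The values of about -0.026 in the test output therefore sit close to "unrelated".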


# Generating JSON Messages from CSV Data for a Conversational Guide

### Overview
This script processes data from a CSV file containing information about Indian tourist destinations and converts it into JSON format for use in a conversational system. The system generates prompts and responses dynamically, creating engaging and informative dialogues about attractions in different cities.

---

### Code Breakdown

1. **Loading Data**  
   - The script reads the CSV file, which contains details like attraction names, cities, states, significance, and other relevant metadata.

2. **Prompt and Completion Templates**  
   - Predefined templates for user prompts and assistant completions are defined. These templates are dynamically populated with data from the CSV to generate realistic and contextually appropriate dialogues.

3. **Message Creation Function**  
   - For each row in the CSV:
     - A random prompt and completion template are selected and formatted with the row's data.
     - A system message is created to provide context about the city.
     - The messages are structured into a JSON object with roles (`system`, `user`, and `assistant`) to align with conversational AI frameworks.

4. **Error Handling**  
   - Missing cell values are replaced with empty strings before formatting. A `KeyError`, raised when a template references a column that is absent from the CSV, is caught and the row is skipped.

5. **Saving JSON Data**  
   - The generated messages are saved in JSON format, one object per line, in the specified output file.
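
The template-filling step in item 3 can be tried in isolation. The sketch below uses a hand-written row (hypothetical values, not taken from the CSV) and shows that `str.format` accepts keyword names containing spaces when they are supplied through `**`, and that a template referring to an absent column raises `KeyError`.

```python
row_data = {
    "Name": "example fort",          # hypothetical values for illustration
    "City": "example city",
    "Type": "fort",
    "Best Time to visit": "evening",
}

template = "{Name} in {City} is a {Type}. It's best visited in the {Best Time to visit}."
message = template.format(**row_data)
print(message)

# A template that needs a column the row lacks raises KeyError,
# which is the case the notebook's try/except is written for.
try:
    "{Name} was established in {Establishment Year}.".format(**row_data)
except KeyError as missing:
    missing_column = missing.args[0]
    print(f"Missing column: {missing_column}")
```

Note that a column which exists but holds an empty value does not raise an error; the placeholder is simply replaced by an empty string.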



In [13]:
import pandas as pd
import json
import random

# Load the CSV file
csv_file = "Top_Indian_Places_to_Visit.csv"  # Replace with your actual file path
data = pd.read_csv(csv_file)

# Define prompt and completion templates
prompt_templates = [
    "Tell me about {Name} in {City}.",
    "What is significant about {Name} in {City}?",
    "Where can I experience {Significance} significance in {City}?",
    "Which {Type} in {City} is worth visiting?",
    "What makes {Name} in {City} special?",
    "Suggest a {Significance} destination in {State}.",
]

completion_templates = [
    "{Name} is a {Type} in {City}, built in {Establishment Year}. It has a Google review rating of {Google review rating} and is best visited in the {Best Time to visit}.",
    "{Name} in {City} is known for its {Significance}. It was established in {Establishment Year} and has a rating of {Google review rating}. Ideal visit time: {Best Time to visit}.",
    "For {Significance} experiences in {City}, {Name} is a top choice. Built in {Establishment Year}, it has a Google review rating of {Google review rating}.",
    "{Name} in {City} is a {Type} rated {Google review rating} on Google. It's best visited in the {Best Time to visit}.",
    "Located in {City}, {Name} is a popular {Significance} destination built in {Establishment Year}. It is rated {Google review rating} on Google reviews.",
]

# Function to create messages in the desired JSON format
def create_messages(row):
    # Safely access and preprocess row data
    row_data = {col: (str(row[col]).strip() if pd.notna(row[col]) else "") for col in data.columns}
    # Convert strings to lowercase to match the example format
    row_data = {k: v.lower() if isinstance(v, str) else v for k, v in row_data.items()}
    
    try:
        # Select random prompt and completion templates
        prompt = random.choice(prompt_templates).format(**row_data)
        completion = random.choice(completion_templates).format(**row_data)
        
        # Create the system message based on the City
        system_content = f"You are a helpful guide about {row_data['City']} attractions."
        
        # Structure the messages as required
        messages = [
            {"role": "system", "content": system_content},
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": completion}
        ]
        
        return {"messages": messages}
    
    except KeyError as e:
        # Handle missing columns gracefully
        print(f"KeyError: Missing data for {e}")
        return None

# Generate messages for each row in the CSV
json_data = [create_messages(row) for _, row in data.iterrows()]
# Remove any None entries resulting from KeyErrors
json_data = [entry for entry in json_data if entry]

# Save the data to a JSON file, one JSON object per line
output_file = "output.json"
with open(output_file, "w") as file:
    for entry in json_data:
        file.write(json.dumps(entry) + "\n")

print(f"Formatted JSON data has been written to {output_file}")


Formatted JSON data has been written to output.json


### Explanation
This script creates conversational data from structured input. By filling templates with row data, it produces varied messages about tourist destinations that can be used in chatbots or other conversational AI systems. Missing values are replaced with empty strings, and rows that raise a `KeyError` are skipped rather than written to the output file.

#### The next cell renames the generated output.json file to output.jsonl. JSON Lines (.jsonl) is a format in which each line of the file is a separate, valid JSON object. It is commonly used for large datasets because records can be read line by line without loading the entire file into memory. The file already contains one object per line, so renaming it is sufficient.

In [33]:
mv output.json output.jsonl
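
Before passing the file to a downstream consumer, it can help to confirm that every line parses and carries the three expected roles. A minimal sketch follows, assuming the record layout produced above (`{"messages": [...]}` with `system`, `user`, and `assistant` entries in that order). The function name `validate_chat_jsonl` is illustrative.

```python
import json
from typing import List, Tuple

EXPECTED_ROLES = ["system", "user", "assistant"]


def validate_chat_jsonl(path: str) -> Tuple[int, List[str]]:
    """Read a JSON Lines file one line at a time.

    Returns the number of valid records and a list of problem descriptions.
    """
    valid = 0
    problems: List[str] = []
    with open(path, "r", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append(f"line {line_number}: invalid JSON ({exc.msg})")
                continue
            messages = record.get("messages") if isinstance(record, dict) else None
            if not isinstance(messages, list):
                problems.append(f"line {line_number}: 'messages' is missing or not a list")
                continue
            roles = [m.get("role") for m in messages if isinstance(m, dict)]
            if roles != EXPECTED_ROLES:
                problems.append(f"line {line_number}: unexpected roles {roles}")
                continue
            valid += 1
    return valid, problems
```

A call such as `valid, problems = validate_chat_jsonl("output.jsonl")` returns the count of well-formed records along with a description of each line that failed.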

The next cell loads API keys and other settings from a .env file using the dotenv library. It reads OPENAI_API_KEY, PINECONE_API_KEY, and PINECONE_ENVIRONMENT with os.getenv(), so that secrets are kept outside the notebook and out of version control. os.getenv() returns None for any variable that is not set.

In [27]:
import os
from dotenv import load_dotenv

# Load variables from .env file into environment
load_dotenv()

# Retrieve the variables
openai_api_key = os.getenv("OPENAI_API_KEY")
pinecone_api_key = os.getenv("PINECONE_API_KEY")
pinecone_environment = os.getenv("PINECONE_ENVIRONMENT")

# Verify that the variables are loaded without printing the secrets themselves
print(f"OpenAI API key loaded: {bool(openai_api_key)}")
print(f"Pinecone API key loaded: {bool(pinecone_api_key)}")
print(f"Pinecone environment: {pinecone_environment}")


OpenAI API key loaded: True
Pinecone API key loaded: True
Pinecone environment: None
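
The output shows the Pinecone environment resolving to `None`, which `os.getenv` returns silently when a variable is unset. A small helper can turn that silent `None` into an explicit error or a documented fallback. This is a sketch only; the name `require_env` is hypothetical and not part of the original notebook.

```python
import os
from typing import Optional


def require_env(name: str, default: Optional[str] = None) -> str:
    """Return the value of environment variable `name`.

    Falls back to `default` when one is given; otherwise raises a clear
    error instead of returning None.
    """
    value = os.getenv(name, default)
    if value is None or value.strip() == "":
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

For example, `pinecone_environment = require_env("PINECONE_ENVIRONMENT", default="us-east-1")` would fall back to the region used when the index was created, while `require_env("PINECONE_API_KEY")` would fail early if the key is absent.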
