# CLEAN VERSION

### General_Info_and_History_of_Pittsburgh_CMU
- Grant: ["data/history_wiki_1.txt"] Pittsburgh:https://en.wikipedia.org/wiki/Pittsburgh
- Grant: ["data/history_wiki_2.txt"] History of Pittsburgh:https://en.wikipedia.org/wiki/History_of_Pittsburgh
- Grant: ["data/about_cmu_and_cmu_history.txt"] About CMU & CMU History:https://www.cmu.edu/about/
- Grant: ["data/britannica.txt"] Encyclopedia Britannica page:https://www.britannica.com/place/Pittsburgh
- Grant: ["data/pit_gov.txt"] City of Pittsburgh gov webpage:https://pittsburghpa.gov/index.html
- Nitya: <span style="color:red">[] Visit Pittsburgh webpage:https://www.visitpittsburgh.com/</span>
- Nitya: <span style="color:red">[] Tax Regulations:https://pittsburghpa.gov/finance/tax-forms</span>
- Nitya: <span style="color:red">[] 2024 Operating Budget:https://apps.pittsburghpa.gov/redtail/images/23255_2024_Operating_Budget.pdf</span>

### Events_in_Pittsburgh_and_CMU
- Grant: ["data/event_cal.txt"] Pittsburgh events calendar:https://pittsburgh.events/
- Grant: ["data/downtown_cal.txt"] Downtown Pittsburgh events calendar:https://downtownpittsburgh.com/events/
- Grant: ["data/city_paper.txt"] Pittsburgh City Paper events:https://www.pghcitypaper.com/pittsburgh/EventSearch?v=d
- Grant: ["data/events_cmu.txt"] CMU events calendar:https://events.cmu.edu/
- Grant: ["data/cmu_campus_major_events.txt"] campus events page:https://www.cmu.edu/engage/alumni/events/campus/index.html

### Music_and_Culture
- Grant: ["data/more.txt"] more:https://en.wikipedia.org/wiki/List_of_museums_in_Pittsburgh
- Grant: ["data/picklesburgh.txt"] Picklesburgh:https://www.picklesburgh.com/
- Grant: ["data/pittsburgh_taco_festival.txt"] Pittsburgh Taco Fest:https://www.pghtacofest.com/
- Grant: ["data/pittsburg_restauratnt_week.txt"] Pittsburgh Restaurant Week:https://pittsburghrestaurantweek.com/
- Grant: ["data/little_italy_days.txt"] Little Italy Days:https://littleitalydays.com/
- Grant: ["data/banana_split_fest.txt"] Banana Split Fest:https://bananasplitfest.com/
- Grant: ["data/pittsburgh_symphony.txt"] Symphony:https://www.pittsburghsymphony.org/
- Nitya: <span style="color:red">[] Opera:https://pittsburghopera.org/</span>
- Nitya: <span style="color:red">[] Cultural Trust:https://trustarts.org/</span>
- Nitya: <span style="color:red">[] Carnegie Museums:https://carnegiemuseums.org/</span>
- Nitya: <span style="color:red">[] Heinz History Center:https://www.heinzhistorycenter.org/</span>
- Nitya: <span style="color:red">[] The Frick:https://www.thefrickpittsburgh.org/</span>
- Nitya: <span style="color:red">[] Food Festivals:https://www.visitpittsburgh.com/events-festivals/food-festivals/</span>


### Sports
- Nitya: <span style="color:red">[] Visit Pittsburgh:https://www.visitpittsburgh.com/things-to-do/pittsburgh-sports-teams/</span>
- Nitya: <span style="color:red">[] Pirates:https://www.mlb.com/pirates</span>
- Nitya: <span style="color:red">[] Steelers:https://www.steelers.com/</span>
- Nitya: <span style="color:red">[] Penguins:https://www.nhl.com/penguins/</span>



### Guidelines for compiling the data files
- Each txt file is split into sections: every section opens with =section_start=, is named by a line such as =section name="Intro", and closes with =section_end=.
- Tables are wrapped in the tags === Table === and === End of Table ===; each table sits inside a section.
- Events are delimited by lines that start with "-----"; each event has at least the following attributes: Title, Date & Time, Location.
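
Before running the loader, it can help to confirm that a data file actually follows these conventions. The sketch below is one possible check, not part of the original pipeline: `check_markers` and the `SAMPLE` text are hypothetical, and the sample assumes each event is enclosed by a pair of "-----" lines, which is how the loader further down toggles in and out of an event.

```python
# Hypothetical sample file; section names and event details are invented.
SAMPLE = '''=section_start=
=section name="Intro"
Pittsburgh is a city in Pennsylvania.
=== Table ===
| City | Pittsburgh |
=== End of Table ===
=section_end=
=section_start=
=section name="Events"
-----
Title: Example Event
Date & Time: Jan 1 2025 7:00 PM
Location: Example Hall
-----
=section_end=
'''


def check_markers(text):
    """Return a list of problems found in the marker structure of one file."""
    lines = [line.strip() for line in text.splitlines()]
    problems = []

    def count(prefix):
        # Number of lines that begin with the given marker
        return sum(1 for line in lines if line.startswith(prefix))

    if count("=section_start=") != count("=section_end="):
        problems.append("unbalanced =section_start= / =section_end=")
    if count("=section_start=") != count("=section name="):
        problems.append("a section is missing its =section name= line")
    if count("=== Table ===") != count("=== End of Table ==="):
        problems.append("unclosed table")
    if count("-----") % 2 != 0:
        problems.append("odd number of event delimiter lines")
    return problems


print(check_markers(SAMPLE))  # → []
```

An empty list means the counts line up; it does not prove the markers are correctly nested, so treat it as a quick sanity check only.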

In [5]:
# pip install langchain langchain_chroma langchain_openai langchain_core langchain_community langchain_mistralai langchain_ollama langchain_pinecone

In [1]:
import numpy as np
import transformers
import getpass
import os
import langchain
from langchain_core.output_parsers import StrOutputParser
from langchain.prompts import ChatPromptTemplate
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_core.runnables import RunnablePassthrough
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter
from langchain_mistralai import MistralAIEmbeddings, ChatMistralAI
from langchain_ollama import OllamaEmbeddings
from langchain_community.vectorstores import DocArrayInMemorySearch
from langchain_pinecone import PineconeEmbeddings
from langchain.schema import Document  
from langchain.prompts import PromptTemplate


# Create custom loader

In [2]:
class DynamicHierarchicalTextLoader:
    def __init__(self, categories, name_mappings):
        self.categories = categories  
        self.name_mappings = name_mappings

    def load(self):
        """Load all text files, process events, tables, and sections hierarchically."""
        all_documents = []  
        
        for category, files in self.categories.items():
            for file_path in files:
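                if not os.path.exists(file_path):
                    # Report missing files instead of skipping them silently
                    print(f"Warning: {file_path} not found, skipping")
                    continue
                descriptive_name = self.name_mappings.get(file_path, "")
                file_data = self._load_single_file(file_path, category, descriptive_name)
                all_documents.extend(file_data)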

        return all_documents

    def _load_single_file(self, file_path, category, descriptive_name):
        """Load a single file and process sections, tables, and events."""
        documents = []
        current_section = None
        current_content = []
        inside_event = False
        inside_table = False
        table_content = []
        event_content = {}

        with open(file_path, 'r', encoding='utf-8') as file:
            lines = file.readlines()

        for line in lines:
            line = line.strip()
            if line.startswith('=section_start='):
                if current_section and current_content:
                    documents.extend(self._create_text_chunks(current_section, current_content, category))
                current_content = []
            elif line.startswith('=section name='):
                section_name = line.split('=section name="')[1].strip('\"')
                current_section = f"{descriptive_name}: {section_name}" 
            elif line.startswith('=== Table ==='):
                inside_table = True
                table_content = []
            elif line.startswith('=== End of Table ==='):
                inside_table = False
                documents.append(self._create_table_chunk(current_section, table_content, category))
            elif inside_table:
                table_content.append(line)
            elif line.startswith('-----'):
                if not inside_event:
                    inside_event = True
                    event_content = {"type": "event", "Title": None, "Date & Time": None, "Location": None, "Description": ""}
                else:
                    inside_event = False
                    documents.append(self._create_event_chunk(current_section, event_content, category))
            elif inside_event:
                if line.startswith("Title:"):
                    event_content["Title"] = line.split("Title:")[1].strip()
                elif line.startswith("Date & Time:"):
                    event_content["Date & Time"] = line.split("Date & Time:")[1].strip()
                elif line.startswith("Location:"):
                    event_content["Location"] = line.split("Location:")[1].strip()
                else:
                    event_content["Description"] += line + "\n"
            elif line.startswith('=section_end='):
                if current_section and current_content:
                    documents.extend(self._create_text_chunks(current_section, current_content, category))
                current_section = None
                current_content = []
            else:
                current_content.append(line)
        if current_section and current_content:
            documents.extend(self._create_text_chunks(current_section, current_content, category))

        return documents

    def _create_text_chunks(self, section_name, content, category):
        """Dynamically split large text content into chunks."""
        text = "\n".join(content)
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
        chunks = text_splitter.split_text(text)
        return [
            Document(page_content=chunk, metadata={"category": category, "section": section_name, "type": "text"})
            for chunk in chunks
        ]

    def _create_table_chunk(self, section_name, table_lines, category):
        """Store the entire table as a single chunk."""
        table = "\n".join(table_lines)
        return Document(page_content=table, metadata={"category": category, "section": section_name, "type": "table"})

    def _create_event_chunk(self, section_name, event_content, category):
        """Store the entire event as a single chunk."""
        event_text = f"Title: {event_content['Title']}\nDate & Time: {event_content['Date & Time']}\nLocation: {event_content['Location']}\nDescription: {event_content['Description']}"
        return Document(page_content=event_text, metadata={"category": category, "section": section_name, "type": "event"})

categories = {
    "General_Info_and_History_of_Pittsburgh_CMU": [
        "data/history_wiki_1.txt",
        "data/history_wiki_2.txt",
        "data/britannica.txt",
        "data/about_cmu_and_cmu_history.txt",
        "data/pit_gov.txt"
    ],
    "Events_in_Pittsburgh_and_CMU": [
        "data/city_paper.txt",
        "data/event_cal.txt",
        "data/downtown_cal.txt",
        "data/cmu_campus_major_events.txt",
        "data/events_cmu.txt"
    ],
    "Music_and_Culture": [
        "data/more.txt",
        "data/little_italy_days.txt",
        "data/picklesburgh.txt",
        "data/pittsburg_restauratnt_week.txt",
        "data/pittsburgh_taco_festival.txt",
        "data/banana_split_fest.txt",
        "data/pittsburgh_symphony.txt"
    ],
    "Sports": []  
}

name_mappings = {
    "data/history_wiki_1.txt": "Pittsburgh",
    "data/history_wiki_2.txt": "History of Pittsburgh",
    "data/britannica.txt": "Encyclopedia Britannica page",
    "data/city_paper.txt": "Pittsburgh City Paper events",
    "data/event_cal.txt": "Pittsburgh events calendar",
    "data/downtown_cal.txt": "Downtown Pittsburgh events calendar",
    "data/more.txt": "more",
    "data/little_italy_days.txt":"Little Italy Days",
    "data/picklesburgh.txt":"Picklesburgh",
    "data/pittsburg_restauratnt_week.txt":"Pittsburgh Restaurant Week",
    "data/pittsburgh_taco_festival.txt":"Pittsburgh Taco Festival",
    "data/banana_split_fest.txt": "Banana Split Fest",
    "data/about_cmu_and_cmu_history.txt":"About CMU and CMU history",
    "data/pittsburgh_symphony.txt": "Pittsburgh Symphony",
    "data/events_cmu.txt": "CMU Event Calendar 2024",
    "data/cmu_campus_major_events.txt": "CMU Major Events 2025",
    "data/pit_gov.txt": "City of Pittsburgh governmental webpage"
}

loader = DynamicHierarchicalTextLoader(categories, name_mappings)
all_documents = loader.load()

### Check that the loader correctly loaded and chunked the data

In [3]:
def print_every_20th_document(documents, stop_at=400, interval=20):
    print(f"Total number of documents: {len(documents)}\n")
    
    for i in range(0, min(stop_at, len(documents)), interval):
        doc = documents[i]
        print(f"Document {i+1}:")
        print(f"Content (snippet): {doc.page_content}") 
        print("Metadata:")
        for key, value in doc.metadata.items():
            print(f"  {key}: {value}")
        print("\n" + "-"*60 + "\n")

print_every_20th_document(all_documents, stop_at=400, interval=20)


Total number of documents: 2466

Document 1:
Content (snippet): Table Name: Pittsburgh
+------------------------------+------------------------------+
|City                          |                              |
+------------------------------+------------------------------+
|Downtown PittsburghMexican War|                              |
+------------------------------+------------------------------+
|FlagSealCoat of arms          |                              |
+------------------------------+------------------------------+
|Nicknames: City of Bridges, St|                              |
+------------------------------+------------------------------+
|Motto: Benigno Numine ("With t|                              |
+------------------------------+------------------------------+
|Interactive map of Pittsburgh |                              |
+------------------------------+------------------------------+
|PittsburghShow map of Pennsylv|                              |
+----------------
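
Sampling every 20th document shows what individual chunks look like, but not whether tables and events were detected at all. A per-category, per-type count is a cheap complement. The sketch below is an assumption-laden illustration: `summarize_documents` is a hypothetical helper, and the demo uses stand-in objects that carry only the `metadata` dictionary the loader produces; in the notebook, `all_documents` would be passed instead.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_documents(documents):
    """Count documents per (category, type) pair using their metadata."""
    counts = Counter()
    for doc in documents:
        key = (doc.metadata.get("category"), doc.metadata.get("type"))
        counts[key] += 1
    return counts


# Stand-in objects with the same .metadata shape the loader produces;
# in the notebook, pass all_documents instead.
demo_documents = [
    SimpleNamespace(metadata={"category": "Events_in_Pittsburgh_and_CMU", "type": "event"}),
    SimpleNamespace(metadata={"category": "Events_in_Pittsburgh_and_CMU", "type": "event"}),
    SimpleNamespace(metadata={"category": "Music_and_Culture", "type": "table"}),
]

for (category, doc_type), n in sorted(summarize_documents(demo_documents).items()):
    print(f"{category:35s} {doc_type:6s} {n}")
```

A category with zero "event" or "table" documents where some were expected usually points at a marker problem in the source file rather than at the splitter.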

# Generate Embeddings and create a vectorized representation

In [7]:
# Read API keys at runtime instead of hardcoding them in the notebook
os.environ["MISTRAL_API_KEY"] = getpass.getpass("Mistral API key: ")
os.environ["PINECONE_API_KEY"] = getpass.getpass("Pinecone API key: ")

In [8]:
embeddings = PineconeEmbeddings(model="multilingual-e5-large")
vector_store = Chroma.from_documents(
    documents=all_documents,
    embedding=embeddings,
    collection_name="my_vector_store"
)
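
For intuition about what the vector store does at query time, the sketch below ranks a few stored vectors by cosine similarity to a query vector. It is a toy illustration under stated assumptions: the vectors are invented three-dimensional examples, `cosine_similarity` and `top_k` are hypothetical helpers, and Chroma's actual index structure and distance metric may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored, k=2):
    """Return the k (text, score) pairs most similar to query_vector."""
    scored = [(text, cosine_similarity(query_vector, vec)) for text, vec in stored]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 3-dimensional vectors, invented for illustration; real embeddings from
# multilingual-e5-large have far more dimensions.
stored = [
    ("Symphony concert", [0.9, 0.1, 0.0]),
    ("Steelers game", [0.0, 0.2, 0.9]),
    ("Opera night", [0.8, 0.3, 0.1]),
]
for text, score in top_k([1.0, 0.2, 0.0], stored):
    print(f"{score:.3f}  {text}")
```

The query vector points mostly along the first axis, so the two music entries rank above the sports entry.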

In [11]:
model = ChatMistralAI(api_key=os.environ["MISTRAL_API_KEY"], model="mistral-large-latest")

In [12]:
prompt_template_retrieval = """
Answer the question in one sentence based on the context below.

Context:
{context}

Question:
{question}
"""
prompt_template_no_retrieval = """
Answer the question in one sentence.

Question:
{question}
"""

# Run all the questions from the JSONL file

In [26]:
import json
import time

def process_lines_and_save(filename, vector_store, model, prompt_template_retrieval, output_file, start_index=0, delay=1):
    ans_w_rag = []    
    existing_questions = set()
    try:
        with open(output_file, 'r') as out_f:
            for line in out_f:
                existing_data = json.loads(line)
                existing_questions.add(existing_data['question'])
    except FileNotFoundError:
        pass

    with open(filename, 'r') as f, open(output_file, 'a') as out_f:  
        lines = f.readlines()  
    
        for i, line in enumerate(lines[start_index:], start=start_index):
            try:
                data = json.loads(line)  
                q = data['question']
                
                if q in existing_questions:
                    print(f"Skipping already processed question on line {i}: {q}")
                    continue
                
                # invoke() replaces the deprecated get_relevant_documents()
                retrieved_documents = vector_store.as_retriever().invoke(q)
                retrieved_context = "\n".join([doc.page_content for doc in retrieved_documents])
                
                response_with_retrieval = model.invoke(prompt_template_retrieval.format(
                    context=retrieved_context,  
                    question=q
                ))
                
                ans_w_rag.append(response_with_retrieval.content) 
                
                out_f.write(json.dumps({
                    'question': q,
                    'response': response_with_retrieval.content
                }) + '\n')
                
                out_f.flush()  
                time.sleep(delay) 

            except json.JSONDecodeError as e:
                print(f"Error decoding JSON on line {i}: {e}")
            except Exception as e:
                print(f"Error processing line {i}: {e}")
                break  
    
    return ans_w_rag

output_file = 'mistral7B_results_with_rag.jsonl'
processed_answers = process_lines_and_save(
    'mistral7B_results.jsonl', 
    vector_store, 
    model, 
    prompt_template_retrieval, 
    output_file=output_file,
    start_index=0, 
    delay=11 
)
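
The loop above stops at the first unexpected error and relies on the skip-already-processed check to resume on the next run. If transient failures are the common case (the `delay=11` setting suggests rate limits, though that is an assumption), one option is to retry a call a few times with growing pauses before giving up. The helper below is a hypothetical sketch, not part of the original notebook; `call_with_retries` and `flaky` are invented names.

```python
import time


def call_with_retries(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**n seconds and try again."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Demo with a hypothetical flaky function that fails twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"


# Record the waits instead of sleeping so the demo finishes instantly.
waits = []
result = call_with_retries(flaky, attempts=3, base_delay=1.0, sleep=waits.append)
print(result, waits)
```

In the notebook, the wrapped call could be something like `lambda: model.invoke(prompt)`, with `base_delay` chosen to match the provider's rate limit.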


## Or specify a single question here

In [18]:
q = "Where will the Geek Out Gala event take place on October 17, 2024?"


## LLM only (no retrieval)

In [19]:
response_no_retrieval = model.invoke(prompt_template_no_retrieval.format(
    question=q
))
print("LLM output without retrieval:")
print(response_no_retrieval.content)


LLM output without retrieval:
The Geek Out Gala event will take place at the Hilton Hotel.


## LLM + relevant retrieval 

In [20]:
# invoke() replaces the deprecated get_relevant_documents()
retrieved_documents = vector_store.as_retriever().invoke(q)
retrieved_context = "\n".join([doc.page_content for doc in retrieved_documents])


response_with_retrieval = model.invoke(prompt_template_retrieval.format(
    context=retrieved_context,  
    question=q
))
print(response_with_retrieval.content)


The Geek Out Gala event will take place at the Carnegie Science Center, 1 Allegheny Ave Pittsburgh, PA 15212.


In [15]:
retrieved_documents

[Document(metadata={'category': 'Events_in_Pittsburgh_and_CMU', 'section': 'Pittsburgh events calendar: Pittsburgh events calendar', 'type': 'event'}, page_content='Title: W.A.S.P.\nDate & Time: Nov 20 2024 6:00 PM\nLocation: 15212, Pittsburgh, PA, US\nDescription: Arena: Stage AE\nPrice: Prices from $49\n'),
 Document(metadata={'category': 'Events_in_Pittsburgh_and_CMU', 'section': 'Pittsburgh events calendar: Pittsburgh events calendar', 'type': 'event'}, page_content='Title: The Slaps\nDate & Time: Nov 12 2024 8:00 PM\nLocation: 15203, Pittsburgh, PA, US\nDescription: Arena: Club Cafe\nPrice: Prices from $32\n'),
 Document(metadata={'category': 'Events_in_Pittsburgh_and_CMU', 'section': 'Pittsburgh events calendar: Pittsburgh events calendar', 'type': 'event'}, page_content='Title: Eliza McLamb\nDate & Time: Oct 30 2024 7:00 PM\nLocation: 15201, Pittsburgh, PA, US\nDescription: Arena: Spirit Hall\nPrice: Prices from $33\n'),
 Document(metadata={'category': 'Events_in_Pittsburgh_and_