[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1ZibsKXyH4lIQpBu-dKJWW_VZghwnheaM?usp=sharing)

# Setup

In [1]:
import os
import openai
import sys
import pandas as pd
# from datasets import Dataset
sys.path.append('../..')

# Load environment
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

# Set the OpenAI API key
openai.api_key  = os.environ['OPENAI_API_KEY']

In [2]:
# Define the chat model (ChatOpenAI is provided by the langchain_openai package)
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0.5)



# Web Scrape

In [5]:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
from typing import List, Dict
import logging
import aiohttp
import asyncio

async def fetch_links_from_page(url: str) -> List[str]:
    """Fetch all links from a given URL using BeautifulSoup."""
    try:
        async with aiohttp.ClientSession() as session:
            async with session.get(url, headers={'User-Agent': 'Mozilla/5.0'}) as response:
                html = await response.text()
        
        # Parse the HTML content
        soup = BeautifulSoup(html, "html.parser")
        links = []
        
        # Extract all <a> tags with href attributes
        for link in soup.find_all("a", href=True):
            href = link.get("href")
            # Construct full URL using urljoin
            full_url = urljoin(url, href)
            links.append(full_url)
        
        return list(set(links))  # Remove duplicates
    except Exception as e:
        logging.error(f"Error fetching links from {url}: {e}")
        return []

async def scrape_information_from_page(url: str) -> Dict:
    """Scrape title and paragraphs from a given URL."""
    try:
        async with aiohttp.ClientSession() as session:
            async with session.get(url, headers={'User-Agent': 'Mozilla/5.0'}) as response:
                html = await response.text()
        
        # Parse the HTML content
        soup = BeautifulSoup(html, "html.parser")
        
        # Extract title and paragraphs
        title = soup.title.string if soup.title else 'No Title'
        paragraphs = [p.get_text().strip() for p in soup.find_all('p')]
        
        return {
            'url': url,
            'title': title,
            'paragraphs': paragraphs
        }
    except Exception as e:
        logging.error(f"Error scraping information from {url}: {e}")
        return {
            'url': url,
            'title': 'Error',
            'paragraphs': []
        }

async def main(url: str) -> List[Dict]:
    """Main function to orchestrate the scraping process."""
    # First get all links from the main page
    links = await fetch_links_from_page(url)
    print(f"Found {len(links)} links")
    
    # Scrape every link concurrently; each call opens its own session,
    # so no shared session is needed here
    tasks = [asyncio.create_task(scrape_information_from_page(link)) for link in links]

    # Wait for all tasks to complete; exceptions are returned rather than raised
    results = await asyncio.gather(*tasks, return_exceptions=True)

    # Keep only successful results (dicts) and drop any returned exceptions
    all_scraped_data = [r for r in results if isinstance(r, dict)]

    return all_scraped_data

# Example usage for Jupyter notebook
# Run this in separate cells:

# Cell 1 - URL definition
url_to_scrape = "https://pricai.org/2024/"

# Cell 2 - Run the scraper
scraped_data = await main(url_to_scrape)

# Cell 3 - Print results
for data in scraped_data:
    print(f"\nURL: {data['url']}")
    print(f"Title: {data['title']}")
    print(f"First two paragraphs: {data['paragraphs'][:2] if data['paragraphs'] else []}")
    print("-" * 50)

Found 29 links

URL: https://pricai.org/2024/index.php/programs/technical-sessions
Title: Technical Sessions
First two paragraphs: ['Technical Sessions', '']
--------------------------------------------------

URL: https://pricai.org/2024/index.php/calls/call-for-tutorials2
Title: Call for Tutorials
First two paragraphs: ['The CFT flyer has been published!', 'The PRICAI-2024 organizers invite proposals for the Tutorial Program of the 21st Pacific Rim International Conference on Artificial Intelligence (PRICAI-2024). The tutorials will be held on November 18-20, 2024, in Kyoto, Japan. Anyone interested in presenting a tutorial at PRICAI2024 should submit a proposal as detailed below.']
--------------------------------------------------

URL: https://pricai.org/2024/index.php/programs/sc-meetings
Title: SC Meetings
First two paragraphs: ['', 'Announcement for Steering Committee Meetings of PRICAIPRICAI Steering Committee Meeting- Nov. 20, 2024 13:00-16:00 (JST) - Place: Conference Room 5
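
The links collected by `fetch_links_from_page` are resolved with `urljoin`, but they can still include fragment duplicates, `mailto:` links, and off-site URLs. The following is a minimal sketch of a filter, assuming only same-host HTTP(S) pages are wanted. `normalize_links` is a hypothetical helper and the sample hrefs are invented for illustration; neither is part of the notebook.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def normalize_links(base_url, hrefs):
    """Resolve hrefs against base_url, drop fragments, keep same-host http(s) links."""
    base_host = urlparse(base_url).netloc
    seen = set()
    result = []
    for href in hrefs:
        # Resolve relative links, then strip any "#fragment" part
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skips mailto:, javascript:, tel:, ...
        if parts.netloc != base_host:
            continue  # skips links to other sites
        if absolute not in seen:  # de-duplicate while preserving order
            seen.add(absolute)
            result.append(absolute)
    return result


# Hypothetical hrefs, for illustration only
sample_links = normalize_links(
    "https://pricai.org/2024/",
    [
        "index.php/calls",
        "index.php/calls#top",
        "mailto:info@example.org",
        "https://example.com/",
    ],
)
print(sample_links)
```

Whether to restrict the crawl to one host is a design choice; off-site pages add noise to the vector store but may also hold relevant details such as submission portals.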

# RAG Pipeline

In [6]:
from langchain.schema import Document

# Initialize a dictionary to store links and content
link_content_map = {}

# Build one Document per scraped page, keeping the URL and title as metadata
documents = [
    Document(
        page_content=" ".join(data['paragraphs']),
        metadata={"url": data['url'], "title": data['title']}
    )
    for data in scraped_data
]

In [7]:
# text splitter
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=300,
    length_function=len,
    is_separator_regex=False,
)

chunk_docs = text_splitter.split_documents(documents)
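
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch using a fixed sliding window over characters. This is not how `RecursiveCharacterTextSplitter` works internally (it splits on separators such as paragraph and line breaks and then merges the pieces); `sliding_chunks` is a hypothetical helper that only illustrates how consecutive chunks share their boundary text.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# Tiny values so the overlap is easy to see
demo_chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(demo_chunks)
```

With the notebook's settings (`chunk_size=2000`, `chunk_overlap=300`), up to 300 characters of neighbouring chunks are repeated, so a sentence that straddles a boundary is less likely to be cut off in both chunks.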

In [8]:
# Text embedding
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

In [10]:
# Vector Store
from langchain_chroma import Chroma

db_chroma = Chroma.from_documents(chunk_docs, embeddings)



In [11]:
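# Retriever: maximal marginal relevance (MMR) search, returning up to 10 chunks per query
retriever = db_chroma.as_retriever(search_type="mmr", search_kwargs={"k": 10})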
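
The `mmr` search type stands for maximal marginal relevance: it trades relevance to the query against redundancy among the chunks already chosen. Below is a rough, dependency-free sketch of the selection idea, assuming cosine similarity and a `lambda_mult` weight of 0.5. `mmr_select` and the toy vectors are hypothetical and only illustrate the technique; this is not the Chroma or LangChain implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, doc_vecs, k, lambda_mult=0.5):
    """Return indices of up to k documents chosen greedily by MMR."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected document (0 if none yet)
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


# Toy 2-D "embeddings": documents 0 and 1 are near-duplicates, document 2 differs
toy_docs = [[1.0, 0.0], [1.0, 0.05], [0.0, 1.0]]
picked = mmr_select([1.0, 1.0], toy_docs, k=2)
print(picked)
```

In this toy case the second pick skips the near-duplicate in favour of the dissimilar document, which is the behaviour that makes MMR useful when many scraped pages repeat the same boilerplate text.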

In [12]:
# Function to print contents of chunk_docs
def print_stored_documents(chunk_docs, limit=20):
    print("Contents of the Chroma Vector Database:")
    print("{:<100} {:<200}".format("Link", "Generated from Web Scraping"))  # Header
    print("=" * 300)  # Separator

    # Iterate through the chunked documents and print their content
    for idx, doc in enumerate(chunk_docs):
        if idx >= limit:  # Limit to the first 'limit' entries
            break

        # Extract URL and content from metadata and page content
        url = doc.metadata.get("url", "No URL")
        content = doc.page_content  # The content generated from web scraping

        # Print the URL and the content
        print("{:<100} {:<200}".format(url, content[:197] + '...' if len(content) > 200 else content))

# Call the function to print the stored documents
print_stored_documents(chunk_docs)

Contents of the Chroma Vector Database:
Link                                                                                                 Generated from Web Scraping                                                                                                                                                                             
https://pricai.org/2024/index.php/programs/technical-sessions                                        Technical Sessions                                                                                                                                                                                      
https://pricai.org/2024/index.php/calls/call-for-tutorials2                                          The CFT flyer has been published! The PRICAI-2024 organizers invite proposals for the Tutorial Program of the 21st Pacific Rim International Conference on Artificial Intelligence (PRICAI-2024). The...
https://pricai.org/2024/index.php/calls/call-for-tutor

In [13]:
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_openai import ChatOpenAI

# Define the system prompt template
answer_prompt = """
You are an intelligent system designed to assist users with their questions about PRICAI 2024 and provide helpful, accurate, and natural-sounding responses. Follow ONLY the provided context to answer the question, thinking step-by-step before responding.
As "AI Assistant," your task is to provide clear and concise answers using the available context. If there is no relevant information, respond with "Hmm, I don't know the answer to that, its out of my resource"
Refuse to answer any unrelated questions, never break character, and always base your response on the given context.
PAY ATTENTION if the user asks about events, products, or brands other than PRICAI 2024. Answer directly with "Hmm, I don't know the answer to that, its out of my resource"

Do not create an answer if:

- The answer cannot be determined from the context alone. In this case, respond with "Hmm, I don't know the answer to that, its out of my resource"
- The context is empty. Respond with "Hmm, I don't know the answer to that, its out of my resource"

Steps to answer the question:
1. Check the user's question for any slang, abbreviations, or synonyms. Interpret these if needed.
2. Account for word variations; for example, "timeline" and "schedule" mean the same as "important dates".
3. Verify if the question relates to the PRICAI 2024 data source. If not, inform the user that the question is unrelated to the data source.
4. If the question relates to PRICAI 2024, generate an answer based on the data source.
5. Review the generated response to ensure accuracy.
6. Format the output as bullet points and, where available, include the related link.

Use the retrieved context below to answer the user's question, which follows as the next message:

Context: {context}

Provide a clear answer with steps or bullet points if needed, without any line breaks. Ensure the response is neat, organized, and easy to understand, giving detailed and helpful responses.
"""

# Chat prompt template
question_answering_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", answer_prompt),
        MessagesPlaceholder(variable_name="messages"),
    ]
)

document_chain = create_stuff_documents_chain(llm, question_answering_prompt)

In [14]:
from langchain_core.messages import HumanMessage
from langchain_community.chat_message_histories import ChatMessageHistory
message_history = ChatMessageHistory()
# Function to query the vector store and generate a response
def generate_response(query):
    # Retrieve similar documents
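    # get_relevant_documents is deprecated; invoke() is the current entry point.
    # The number of returned chunks is governed by the retriever's search_kwargs.
    docs = retriever.invoke(query)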

    message_history.add_message(HumanMessage(content=query))

    # Use the documents as context for the LLM
    response = document_chain.invoke(
        {
            "context": docs,
            "messages": [
                HumanMessage(content=query)
            ],
        }
    )

    return response
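
As a rough sketch of what the "stuff" strategy in `create_stuff_documents_chain` amounts to: the retrieved chunks are concatenated into a single string and substituted for `{context}` in the prompt. The code below assumes a blank-line separator and uses a hypothetical `Doc` class and helper functions in place of the LangChain types; it illustrates the idea and is not the library's code.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def stuff_context(docs, separator="\n\n"):
    """Join the text of all retrieved chunks into one context string."""
    return separator.join(doc.page_content for doc in docs)


def build_system_prompt(template, docs):
    """Fill the {context} slot of a prompt template with the stuffed chunks."""
    return template.format(context=stuff_context(docs))


# Invented chunks, for illustration only
demo_docs = [
    Doc("PRICAI 2024 will be held in Kyoto, Japan.", {"url": "https://pricai.org/2024/"}),
    Doc("Tutorials take place on November 18-20, 2024."),
]
demo_prompt = build_system_prompt("Answer using only this context:\n{context}", demo_docs)
print(demo_prompt)
```

Because every chunk is placed in one prompt, the total context grows with both the number of retrieved chunks and the chunk size, so these two settings together determine how much of the model's context window each question consumes.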

# Inference

In [15]:
# query
query = "where is Pricai converence held?"
response = generate_response(query)

print(response)



- PRICAI 2024 will be held in person in Kyoto, Japan.


In [16]:
# query
query = "what can I prepare to attent the Pricai conference?"
response = generate_response(query)

print(response)

- **Proposal Submission**: Ensure that your proposal is well-prepared and submitted as a PDF file to the designated email. Include all necessary information in one PDF, and if additional formats are needed, provide links within the PDF.
- **Accommodation**: Book your hotel as early as possible, as the conference period is a busy time in Japan. There are several accommodation options near the conference venue and Kyoto station.
- **Sponsorship Opportunities**: Consider sponsorship options if applicable, which offer benefits like increased brand awareness, networking opportunities, and access to AI trends.
- **Technical Sessions**: Prepare to engage in technical sessions and presentations to gain insights into the latest AI research and developments.


In [17]:
# query
query = "when last call for the proposal and paper submited?"
response = generate_response(query)

print(response)

- The last call for workshop proposal submissions is on April 15, 2024.
- The notification of acceptance for workshop proposals will be on April 30, 2024.
- All deadlines are at the end of the day specified, anywhere on Earth (UTC-12).

For more information, you can visit the submission page: [EasyChair submission link](https://easychair.org/conferences/?conf=pricai2024)


In [18]:
# query
query = "What is pricai, and why they make an event every year, what they expected from that event?"
response = generate_response(query)

print(response)

- PRICAI 2024 is an international conference focused on sharing the latest research and advances in artificial intelligence within the Pacific Rim.
- The event aims to facilitate dialogue between academia and industry, bringing together researchers, engineers, students, and industry leaders for knowledge exchange and networking.
- The conference is designed to foster the development of new ideas and host lively discussions on AI topics.
- PRICAI 2024 has received more than 500 paper submissions and is expected to feature contributions from young researchers and prominent researchers from various countries.
- The collaboration with PRIMA 2024, which focuses on multiagent systems, allows for a broader appeal to the AI community, enhancing the conference's reach and impact.


In [19]:
# query
query = "I want to cook pisang goreng, please make the step and recipe"
response = generate_response(query)

print(response)

Hmm, I don't know the answer to that, its out of my resource


In [20]:
# query
query = "Who is Prof. Bo An, Nanyang Technological University, Singapore also what his contribution in Pricai 2024?"
response = generate_response(query)

print(response)

- **Keynote Speaker**: Prof. Bo An from Nanyang Technological University, Singapore is one of the keynote speakers at PRICAI 2024.
- **Presentation Title**: "From Algorithmic and RL-based to LLM-powered Agents"
- **Abstract Summary**: The talk will explore the evolution from algorithmic approaches to reinforcement learning (RL) and large language model (LLM)-powered agents in AI, addressing complex cooperation and strategic interactions.
- **Research Interests**: Prof. Bo An's research includes artificial intelligence, multiagent systems, computational game theory, reinforcement learning, and optimization.
- **Applications**: His research has been applied to domains such as infrastructure security, sustainability, and e-commerce.
- **Achievements**: He has published over 150 papers and received several awards, including the 2010 IFAAMAS Victor Lesser Distinguished Dissertation Award and the 2022 Nanyang Research Award.


In [21]:
# query
query = "give me the schedule for pricai 2024"
response = generate_response(query)

print(response)

- PRICAI 2024 is an international conference focused on artificial intelligence in the Pacific Rim.
- It will facilitate dialogue between academia and industry, featuring researchers, engineers, students, and industry leaders.
- The conference is expected to host discussions on AI from young to prominent researchers from various countries.
- PRIMA 2024, a conference on multiagent (distributed artificial intelligence), will be held in collaboration with PRICAI 2024 on the same day and at the same venue.
- Important deadlines:
  - Application Deadline: October 31, 2024
  - Payment Deadline: November 11, 2024
  - All deadlines are at the end of the specified day, anywhere on Earth (UTC-12).
- PRICAI Steering Committee Meeting:
  - Date: November 20, 2024
  - Time: 13:00-16:00 (JST)
  - Place: Conference Room 5a/5b, 5th floor, International Science Innovation Building in Kyoto University

For further details, you may want to visit the official PRICAI 2024 website.


In [22]:
# query
query = "List down topic that will present in PRICAI 2024"
response = generate_response(query)

print(response)

- PRICAI 2024 will focus on the latest research and advances in artificial intelligence.
- The conference will facilitate dialogue between academia and industry.
- It will include discussions from young researchers, including students, to prominent researchers from different countries.
- PRIMA 2024, which focuses on multiagent (distributed artificial intelligence), will also be held in collaboration with PRICAI 2024.


In [23]:
# query
query = "List down topic that will present in tutorials program PRICAI 2024"
response = generate_response(query)

print(response)

Hmm, I don't know the answer to that, its out of my resource


In [24]:
# query
query = "is there any topics that explain about synthetic data"
response = generate_response(query)

print(response)

- Yes, there is a tutorial on "Synthetic Data Generation through Adaptive Diffusion Models."
- This tutorial focuses on the importance of generating high-quality synthetic data for machine learning and AI applications.
- It covers the use of diffusion models, which are versatile and effective alternatives to traditional generative approaches like GANs and VAEs.
- The tutorial aims to guide participants in generating diverse and high-fidelity synthetic data, which is essential in modern AI workflows.
