<a href="https://colab.research.google.com/github/MichaelSchmidt1729/LangChainUAEHolidaysReznikov/blob/main/LangChain_UAE_Holidays_Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Install the required packages (pinned versions)
!pip install chromadb==0.3.25 pydantic==1.10.9 openai==0.27.8 bs4 tiktoken==0.4.0 langchain==0.0.235 huggingface_hub==0.16.4 sentence_transformers==2.2.2 pandas > /dev/null

## Data Download
First, we fetch the data from a website that lists the official UAE public holidays for this year. To work with our own data, we save the table as a CSV file and later load it with the `CSVLoader`. In principle, one could use `WebCrawler` instead of a custom function, or wrap our function in a tool.

In [None]:
import requests
import bs4
import pandas as pd

In [None]:
# Function to make an HTTP GET request
def get_request(url, cookies=None, headers=None):
    # None defaults avoid sharing one mutable dict between calls
    return requests.get(url, cookies=cookies, headers=headers)

# Function to collect data from a URL and extract the table
def collect_data(url):
    response = get_request(url)
    soup = bs4.BeautifulSoup(response.text, features="lxml")
    table = soup.find("table", class_="publicholidays")
    return table

# Function to convert HTML table to pandas DataFrame
def convert_html_table_to_df(html_text):
    return pd.read_html(str(html_text))[0]

In [None]:
# Root URL for the website containing holiday data
ROOT_URL = "https://publicholidays.ae/2023-dates/"

# Collect the data and convert it to a DataFrame
html_text = collect_data(url=ROOT_URL)
df = convert_html_table_to_df(html_text=html_text)

In [None]:
df

Unnamed: 0,Date,Day,Holiday
0,1 Jan,Sun,New Year's Day
1,20 Apr,Thu,Eid al-Fitr Holiday
2,21 Apr,Fri,Eid al-Fitr
3,22 Apr,Sat,Eid al-Fitr Holiday
4,23 Apr,Sun,Eid al-Fitr Holiday
5,27 Jun,Tue,Arafat Day
6,28 Jun,Wed,Eid al-Adha
7,29 Jun,Thu,Eid al-Adha Holiday
8,30 Jun,Fri,Eid al-Adha Holiday
9,21 Jul,Fri,Islamic New Year


In [None]:
# Save the DataFrame to a CSV file, dropping the last row of the scraped table
df.iloc[:-1, :].to_csv("uae_holidays.csv")

## LangChain
Now we import the LangChain classes we will use. For this demo, we begin with a straightforward approach based on the `ChatOpenAI` model. We load the previously saved file and create a vector index from its contents. We also create a simple prompt and set up a memory to store the conversation history. Finally, we configure a `RetrievalQA` chain to bring all these components together.
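To make the indexing step more concrete, the sketch below approximates how a CSV file becomes one text snippet per row before it is embedded. It uses only the standard library and is an illustration, not LangChain's own code; `CSV_TEXT` and `rows_to_documents` are hypothetical names, and the rows are copied from the table above.

```python
import csv
import io

# A few rows copied from the table above, in the shape df.to_csv() writes them
# (the leading unnamed column is the DataFrame index).
CSV_TEXT = """,Date,Day,Holiday
0,1 Jan,Sun,New Year's Day
1,20 Apr,Thu,Eid al-Fitr Holiday
2,21 Apr,Fri,Eid al-Fitr
"""

def rows_to_documents(csv_text):
    """Turn each CSV row into one text snippet of 'column: value' lines."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row in reader:
        lines = [f"{key.strip()}: {value.strip()}" for key, value in row.items()]
        documents.append("\n".join(lines))
    return documents

documents = rows_to_documents(CSV_TEXT)
print(documents[2])
```

Each snippet is small and self-contained, so the retriever can return individual holiday rows as context for a question.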

In [None]:
import os
os.environ['OPENAI_API_KEY'] = 'sk-...'

In [None]:
# Load language model, embeddings, and index for conversational AI
from langchain.chat_models import ChatOpenAI                #model
from langchain.indexes import VectorstoreIndexCreator       #index
from langchain.document_loaders.csv_loader import CSVLoader #tool
from langchain.prompts import PromptTemplate                #prompt
from langchain.memory import ConversationBufferMemory       #memory
from langchain.chains import RetrievalQA                    #chain

#import langchain
#langchain.verbose = True

In [None]:
def load_llm():
    llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo")
    return llm

def load_index():
    # if you want to avoid the step of saving/loading a file, you can use the `from_documents()` method of the VectorstoreIndexCreator()
    loader = CSVLoader(file_path='uae_holidays.csv')
    index = VectorstoreIndexCreator().from_loaders([loader])
    return index

In [None]:
template = """
You are an assistant that helps answer when the official UAE holidays are, based only on the data provided.
Context: {context}
-----------------------
History: {chat_history}
=======================
Human: {question}
Chatbot:

"""

# Create a prompt using the template
prompt = PromptTemplate(
    input_variables=["chat_history", "context", "question"],
    template=template
)

In [None]:
# Set up conversation memory
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True, input_key="question")

In [None]:
# Set up the retrieval-based conversational AI
qa = RetrievalQA.from_chain_type(
    llm=load_llm(),
    chain_type='stuff',
    retriever=load_index().vectorstore.as_retriever(),
    verbose=True,
    chain_type_kwargs={
        "prompt": prompt,
        "memory": memory,
    }
)

## Q&A
Let's now ask some questions about the holidays in the UAE:

In [None]:
# Function to print the response for a given query
def print_response_for_query(query):
    print(qa.run({"query": query}))

### Holidays in March/December

In [None]:
query = "Are there any holidays in March?"
print_response_for_query(query)

> Entering new RetrievalQA chain...

> Finished chain.
Based on the data provided, there are no holidays in March.


Correct response. What about December?

In [None]:
query = "Sorry, I meant December"
print_response_for_query(query)

> Entering new RetrievalQA chain...

> Finished chain.
Based on the data provided, there are two official holidays in December for the UAE. The first one is Commemoration Day on December 1st, which falls on a Friday. The second one is National Day on December 2nd, which falls on a Saturday. Additionally, there is a National Day Holiday on December 3rd, which falls on a Sunday.


Did you notice how we used the **memory** here? Without it, the response would have been something like:
> Sorry, I can't understand you. What exactly are you looking for in December?
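The effect of the memory can be illustrated without any library: earlier turns are stored as text and pasted into the `History` slot of the prompt, so the follow-up question arrives with its context. This is a simplified stand-in, not how `ConversationBufferMemory` is implemented; `SimpleBufferMemory`, `TEMPLATE`, and `memory_sketch` are hypothetical names, and the real memory object may format messages differently.

```python
# Hypothetical stand-in for a buffer memory: it only stores past turns as text.
class SimpleBufferMemory:
    def __init__(self):
        self.turns = []

    def save(self, question, answer):
        """Record one question/answer exchange."""
        self.turns.append(f"Human: {question}")
        self.turns.append(f"Chatbot: {answer}")

    def as_text(self):
        """Return the whole conversation so far as one string."""
        return "\n".join(self.turns)

# Shortened version of the notebook's prompt template.
TEMPLATE = (
    "Context: {context}\n"
    "History: {chat_history}\n"
    "Human: {question}\n"
    "Chatbot:"
)

memory_sketch = SimpleBufferMemory()
memory_sketch.save(
    "Are there any holidays in March?",
    "Based on the data provided, there are no holidays in March.",
)

# The follow-up question is only meaningful because the history travels with it.
followup_prompt = TEMPLATE.format(
    context="(retrieved rows would go here)",
    chat_history=memory_sketch.as_text(),
    question="Sorry, I meant December",
)
print(followup_prompt)
```

With an empty history, the model would see only "Sorry, I meant December" and have nothing to relate it to.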

It is worth noting that, despite the counting error (<font color='red'>two</font>), the response lists all <font color='green'>three</font> holidays. Refining the prompt may solve the issue.

### Multichain Ramadan example

In [None]:
query = "When does this year's holiday marking the end of Ramadan start?"
print_response_for_query(query)

> Entering new RetrievalQA chain...

> Finished chain.
This year's holiday marking the end of Ramadan, also known as Eid al-Fitr, starts on April 21st.


Now this is quite interesting. The chain correctly identified Eid al-Fitr as the holiday that marks the end of Ramadan. But there is a reason why I started with scraping instead of a clean CSV file. As you can see from the table, only one row is called exactly "Eid al-Fitr":

| Date | Day | Holiday |
| --- | --- | --- |
| 20 Apr | Thu | Eid al-Fitr Holiday |
| 21 Apr | Fri | Eid al-Fitr |
| 22 Apr | Sat | Eid al-Fitr Holiday |
| 23 Apr | Sun | Eid al-Fitr Holiday |

The problem here is that the data is dirty, so the model can't tell that it's actually a 4-day holiday. The easy solution would be either to clean the data, possibly through tools, or to modify the prompt.
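One possible cleaning step is sketched below: strip the trailing " Holiday" from each name and merge rows that fall on consecutive days under the same name. It assumes the year 2023 (taken from the source URL), rows sorted by date, and English month abbreviations; `ROWS`, `base_name`, and `merge_consecutive` are hypothetical names, and the rows are a hand-copied subset of the table.

```python
from datetime import datetime, timedelta

YEAR = 2023  # assumption: taken from the "2023-dates" source URL

# Hand-copied subset of the scraped table: (date, holiday name)
ROWS = [
    ("20 Apr", "Eid al-Fitr Holiday"),
    ("21 Apr", "Eid al-Fitr"),
    ("22 Apr", "Eid al-Fitr Holiday"),
    ("23 Apr", "Eid al-Fitr Holiday"),
    ("27 Jun", "Arafat Day"),
]

def parse_date(text, year=YEAR):
    """Parse strings like '20 Apr' (relies on English month abbreviations)."""
    return datetime.strptime(f"{text} {year}", "%d %b %Y").date()

def base_name(holiday):
    """Drop a trailing ' Holiday' so all days of one holiday share a name."""
    suffix = " Holiday"
    return holiday[: -len(suffix)] if holiday.endswith(suffix) else holiday

def merge_consecutive(rows):
    """Merge same-named rows on consecutive days into one multi-day entry."""
    merged = []
    for date_text, holiday in rows:
        day = parse_date(date_text)
        name = base_name(holiday)
        last = merged[-1] if merged else None
        if last and last["name"] == name and day - last["end"] == timedelta(days=1):
            last["end"] = day  # extend the current holiday by one day
        else:
            merged.append({"name": name, "start": day, "end": day})
    for item in merged:
        item["days"] = (item["end"] - item["start"]).days + 1
    return merged

merged_holidays = merge_consecutive(ROWS)
for item in merged_holidays:
    print(item["name"], item["start"], item["end"], item["days"])
```

Saving the merged rows instead of the raw ones would give the model an explicit start date, end date, and length for each holiday.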

In [None]:
query = "How many days is it celebrated for this year?"
print_response_for_query(query)

> Entering new RetrievalQA chain...

> Finished chain.
The holiday marking the end of Ramadan, also known as Eid al-Fitr, is typically celebrated for three days.


### What is the next holiday?

In [None]:
query = "Today is July 16. When is the nearest holiday?"
print_response_for_query(query)

> Entering new RetrievalQA chain...

> Finished chain.
The nearest holiday is the Islamic New Year, which falls on July 21st.


As one can see, the nearest holiday is detected correctly. A math tool looks for the closest date from the data provided.
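The closest-date lookup can also be checked deterministically outside the model. The sketch below reads "nearest" as the next upcoming holiday and assumes the year 2023 (from the source URL) and English month abbreviations; `HOLIDAYS` and `next_holiday` are hypothetical names, and the rows are a hand-copied subset of the table.

```python
from datetime import date, datetime

YEAR = 2023  # assumption: taken from the "2023-dates" source URL

# Hand-copied subset of the scraped table: (date, holiday name)
HOLIDAYS = [
    ("28 Jun", "Eid al-Adha"),
    ("30 Jun", "Eid al-Adha Holiday"),
    ("21 Jul", "Islamic New Year"),
]

def next_holiday(today, holidays, year=YEAR):
    """Return (date, name) of the first holiday on or after `today`, or None."""
    upcoming = []
    for date_text, name in holidays:
        day = datetime.strptime(f"{date_text} {year}", "%d %b %Y").date()
        if day >= today:
            upcoming.append((day, name))
    return min(upcoming) if upcoming else None

nearest = next_holiday(date(2023, 7, 16), HOLIDAYS)
print(nearest)
```

A function like this could be exposed to the chain as a tool, so that date arithmetic does not depend on the language model.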