### Step 1: Import Libraries & Set Up API Keys
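Make sure you have already installed the requirements (pip install -r requirements.txt) and added your [OpenAI Key](https://platform.openai.com/docs/api-reference) as an environment variable (export OPENAI_API_KEY=xxxx). The variable name must match the one read in the code below, and the shell does not allow spaces around the equals sign.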

In [1]:
## LIBRARIES! ##

# Import Libraries for API Usage
import os
import openai
import requests

# Import Langchain Libraries
from langchain.document_loaders import SeleniumURLLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.docstore.document import Document
from langchain.vectorstores.faiss import FAISS
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA, RetrievalQAWithSourcesChain

# Import Libraries to Pull URL Lists
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Import Library to Make Results Pretty
from IPython.display import display, Markdown

In [2]:
## OPENAI KEY ##
openai.api_key = os.environ.get('OPENAI_API_KEY')
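
If the environment variable is missing, `os.environ.get` quietly returns `None` and the failure only shows up later, at the first API call. A minimal sketch of an earlier check follows; `require_env` is a hypothetical helper name, not part of the original notebook or of the `openai` package.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; export it before starting the notebook."
        )
    return value


# Possible usage in the cell above (assumes the key was exported as OPENAI_API_KEY):
# openai.api_key = require_env("OPENAI_API_KEY")
```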

### Step 2: Select a WB Data Lab Data Good URL

In [3]:
url = "https://datapartnership.org/syria-economic-monitor/README.html"

### Step 3: Using Beautiful Soup, Generate a List of Sub-URLs

In [4]:
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

urls = []
for link in soup.find_all('a'):
    href = link.get('href')
    if href and not href.startswith('#'):  # Exclude anchor links
        absolute_url = urljoin(url, href)
        urls.append(absolute_url)

## TEST: PRINT URLS ##
# print (urls)
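
The loop above keeps every non-anchor link, so the list can contain duplicates, `mailto:` links, and several URLs that differ only by a `#fragment`. The sketch below shows one possible cleanup using only the standard library; `clean_links` is a hypothetical helper, and whether to drop these links at all is an assumption about what you want the loader to fetch.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def clean_links(base_url, hrefs):
    """Resolve hrefs against base_url; drop fragments, non-HTTP links and duplicates."""
    seen = set()
    cleaned = []
    for href in hrefs:
        if not href or href.startswith("#"):
            continue  # skip empty hrefs and in-page anchors
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        if absolute not in seen:
            seen.add(absolute)
            cleaned.append(absolute)
    return cleaned


# Possible usage with the variables defined above:
# urls = clean_links(url, [link.get("href") for link in soup.find_all("a")])
```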

### Step 4: Using a Langchain URL Loader, Read in Contents from All Listed URLs 

In [5]:
## READ IN URL CONTENTS FOR USE WITH LANGCHAIN ##

loader = SeleniumURLLoader(urls=urls)
data = loader.load()

### Step 5: Use a Langchain Text Splitter (CharacterTextSplitter) to Break Contents into "Chunks"

In [6]:
## SPLIT URL CONTENTS INTO CHUNKS ##

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=2500,
    chunk_overlap=200,
    length_function=len,
)
docs = text_splitter.split_documents(data)
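
To get a feel for what `separator`, `chunk_size`, and `chunk_overlap` do, here is a simplified, plain-Python sketch of separator-based chunking. It is not LangChain's implementation: `split_into_chunks` is a hypothetical function that splits on the separator, packs pieces greedily up to `chunk_size` characters, and carries trailing pieces forward as overlap. A single piece longer than `chunk_size` is kept whole.

```python
def split_into_chunks(text, separator="\n\n", chunk_size=2500, chunk_overlap=200):
    """Greedily pack separator-delimited pieces into overlapping chunks."""
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks = []
    current = []      # pieces in the chunk being built
    current_len = 0   # length of separator.join(current)
    for piece in pieces:
        added = len(piece) + (len(separator) if current else 0)
        if current and current_len + added > chunk_size:
            chunks.append(separator.join(current))
            # Drop leading pieces until what remains fits the overlap budget
            # and leaves room for the incoming piece.
            while current and (
                current_len > chunk_overlap
                or current_len + len(separator) + len(piece) > chunk_size
            ):
                removed = current.pop(0)
                current_len -= len(removed) + (len(separator) if current else 0)
            added = len(piece) + (len(separator) if current else 0)
        current.append(piece)
        current_len += added
    if current:
        chunks.append(separator.join(current))
    return chunks


sample = "aaaa\n\nbbbb\n\ncccc\n\ndddd"
for chunk in split_into_chunks(sample, chunk_size=10, chunk_overlap=4):
    print(repr(chunk))
```

With these toy settings each chunk repeats the last paragraph of the previous one, which is the purpose of the overlap: a sentence that straddles a boundary still appears intact in at least one chunk.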

### Step 6: Use OpenAI Embeddings and FAISS Vectorstore Tool to, um, Vectorize the Data to Make Search More Efficient

In [7]:
## VECTORIZE THE CHUNKS FOR EFFICIENT SEARCH ##

embeddings = OpenAIEmbeddings()
db = FAISS.from_documents(docs, embeddings)

### Step 7: Save the Vector Store, So You Can Go Back to It Later

In [8]:
## SAVE EMBEDDINGS LOCALLY, SO YOU DO NOT HAVE TO REGENERATE THEM EVERY TIME ##

db.save_local("faiss_index")
new_db = FAISS.load_local("faiss_index", embeddings)
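
As written, the cell rebuilds and saves the index on every run before loading it back. One way to actually skip the regeneration is a load-or-build pattern, sketched below with a hypothetical `load_or_build` helper. It takes plain callables so it stays independent of LangChain; the commented usage relies only on the calls already shown in this notebook.

```python
import os


def load_or_build(index_dir, build, load):
    """Load a saved index if index_dir exists; otherwise build (and save) a new one."""
    if os.path.isdir(index_dir):
        return load(index_dir)
    return build()


# Possible usage (assumes docs and embeddings are defined as in the cells above):
# def build_and_save():
#     db = FAISS.from_documents(docs, embeddings)
#     db.save_local("faiss_index")
#     return db
#
# new_db = load_or_build(
#     "faiss_index",
#     build=build_and_save,
#     load=lambda path: FAISS.load_local(path, embeddings),
# )
```

Delete the `faiss_index` folder whenever the source pages change, otherwise the stale index is loaded.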

### Step 8: Test Embeddings and Set Up and Test Retriever
The Retriever will use the vector store to identify those, um, vectors that have the most relevant information based on the user query. Only the text chunks behind these vectors will be sent to the ChatGPT API for processing.

In [9]:
## TEST EMBEDDINGS ##

# query = "What is GDP of Kenya in 2023?"
# docs = new_db.similarity_search(query)
# print(docs)

## SET UP AND TEST RETRIEVER ##
retriever = new_db.as_retriever(search_type="similarity_score_threshold", search_kwargs={"score_threshold": .7})
# docs = retriever.get_relevant_documents("Which governorate had the biggest change in tree cover?")
# print (docs)

### Step 9: Set Up Query Chain using Langchain RetrievalQAWithSourcesChain, ChatOpenAI (it's good!), and Your Retriever

In [10]:
## SET UP QUERY CHAIN. FIRST RETRIEVE THE RELEVANT CHUNKS FROM THE VECTOR STORE. THEN SEND THEM TO OPENAI TO GENERATE THE RESPONSE ##

# Reuse the retriever configured in Step 8 instead of building an identical one
chain = RetrievalQAWithSourcesChain.from_chain_type(ChatOpenAI(temperature=0), 
                                                    chain_type="stuff", 
                                                    retriever=retriever)


## THIS METHOD USES OPENAI, INSTEAD OF CHATOPENAI -- CHATOPENAI IS SO MUCH BETTER!!! ##
# chain = RetrievalQAWithSourcesChain.from_chain_type(OpenAI(temperature=0), 
#                                                     chain_type="stuff", 
#                                                     retriever=new_db.as_retriever(search_type="similarity_score_threshold", 
#                                                                                   search_kwargs={"score_threshold": .7}))

### Step 10: Big Time! Ask Your Data Good a Question!

In [11]:
## ASK A QUESTION! ##
question = input()

 How were humanitarian surveys used in the Syria Economic Monitor?


### Step 11: Run a Query and Make the Response Pretty
Warning! Even with all this splitting, vectorizing, and retrieving, sometimes we still exceed the paltry ChatGPT token limits. To fix this, make your question more specific to reduce the number of retrieved chunks, or increase the "score_threshold" in your query chain.
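
A rough way to see whether the retrieved chunks will fit is to estimate their token count before sending them. The sketch below assumes about four characters per token, which is only a rule of thumb for English text and not an exact tokenizer; `estimate_tokens` and `trim_to_budget` are hypothetical helpers, and the right budget depends on the model you use.

```python
import math


def estimate_tokens(text, chars_per_token=4):
    """Very rough token estimate; assumes ~4 characters per token."""
    return math.ceil(len(text) / chars_per_token)


def trim_to_budget(texts, max_tokens):
    """Keep texts in order until adding the next one would exceed max_tokens."""
    kept = []
    used = 0
    for text in texts:
        cost = estimate_tokens(text)
        if used + cost > max_tokens:
            break
        kept.append(text)
        used += cost
    return kept


# Possible usage with the retriever from Step 8 (the budget value is an assumption):
# docs = retriever.get_relevant_documents(question)
# texts = trim_to_budget([d.page_content for d in docs], max_tokens=3000)
```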

In [12]:
## RUN QUERY AND MAKE RESPONSE PRETTY ##

response = chain({"question": question}, return_only_outputs=True)
markdown_text = f"**Question:**\n\n{question}\n\n**Answer:**\n\n{response['answer']}\n\n**Sources:**\n\n{response['sources']}\n\n"
display(Markdown(markdown_text))

Retrying langchain.chat_models.openai.ChatOpenAI.completion_with_retry.<locals>._completion_with_retry in 1.0 seconds as it raised RateLimitError: That model is currently overloaded with other requests. You can retry your request, or contact us through our help center at help.openai.com if the error persists. (Please include the request ID 1b667d6c9985cd71fccacf01612105b0 in your message.).


**Question:**

How were humanitarian surveys used in the Syria Economic Monitor?

**Answer:**

Humanitarian surveys were used in the Syria Economic Monitor to collect data on community situations and needs relating to shelter, electricity, water sanitation and hygiene (WASH), food security, livelihoods, health, education, humanitarian assistance, and priority needs. The survey used key informant interviews to collect data at the community (admin4) level. The panel used for the analysis includes 1,426 communities (371 in NWS and 1246 in NES) which are included in all rounds of data collection. The data and methodology used to generate insights for this project have been prepared as Data Goods, which are designed to be re-used for future updates and projects. Nighttime lights were also analyzed to understand the economic impacts of the February 6 earthquake in Turkiye and northern Syria. 


**Sources:**

https://github.com/datapartnership/syria-economic-monitor, https://datapartnership.org/syria-economic-monitor/notebooks/hsos-survey/hsos-survey-readme.html, https://datapartnership.org/syria-economic-monitor/docs/2023-summer-economic-monitor.html
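
The sources come back as one comma-separated string, which gets hard to read once there are more than two or three. A small sketch for turning that string into a Markdown bullet list follows; `sources_to_markdown` is a hypothetical helper, and it assumes no source URL contains a comma.

```python
def sources_to_markdown(sources):
    """Turn a comma-separated string of sources into a Markdown bullet list."""
    items = [item.strip() for item in sources.split(",") if item.strip()]
    return "\n".join(f"- {item}" for item in items)


# Possible usage in the cell above:
# markdown_text = (
#     f"**Question:**\n\n{question}\n\n"
#     f"**Answer:**\n\n{response['answer']}\n\n"
#     f"**Sources:**\n\n{sources_to_markdown(response['sources'])}\n\n"
# )
```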

