# LangChain + GitHub = Langgit

### Step 1: Import Libraries & Set Up API Keys
Make sure you have already installed the requirements (pip install -r requirements.txt) and added your [OpenAI Key](https://platform.openai.com/docs/api-reference) to your environment (export OPENAI_API_KEY=xxxx, with no spaces around the equals sign).

In [1]:
## LIBRARIES! ##

# Import Libraries for API Usage
import os
import openai
import requests

# Import LangChain Libraries
from langchain.document_loaders import SeleniumURLLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.docstore.document import Document
from langchain.vectorstores.faiss import FAISS
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA, RetrievalQAWithSourcesChain

# Import Libraries to Pull URL Lists
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Import Library to Make Results Pretty
from IPython.display import display, Markdown

In [2]:
## OPENAI KEY ##
openai.api_key = os.environ.get('OPENAI_API_KEY')
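
If the environment variable is missing, os.environ.get quietly returns None and the failure only shows up later, at the first API call. A minimal sketch of a fail-early check is below; require_env is a hypothetical helper name, not part of the openai or LangChain libraries.

```python
import os

def require_env(name):
    """Return the value of an environment variable, or raise if it is unset or blank.

    Hypothetical helper for illustration only.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before starting the notebook")
    return value

# In the notebook, this could replace the plain os.environ.get call:
# openai.api_key = require_env("OPENAI_API_KEY")
```

With this in place, a missing key stops the notebook in the setup cell instead of halfway through the embedding step.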

### Step 2: Select a WB Data Lab Data Good URL
This notebook is designed to search Data Goods produced by the WB Data Lab, which are published as GitHub-generated Jupyter Books. That said, the notebook should work with other URLs, too. Please note, though, that this notebook hasn't been optimized in any way, so more complex websites may require additional effort.

In [3]:
url = "https://datapartnership.org/syria-economic-monitor/README.html"

### Step 3: Using Beautiful Soup, Generate a List of Sub-URLs
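Beautiful Soup is a commonly used Python library for scraping websites. This step downloads the page at the URL above and collects every link on it, converting relative links to absolute URLs and skipping in-page anchor links. It does not follow those links any further. Uncomment the "TEST: PRINT URLS" line to make sure it works the first time.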

In [4]:
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

urls = []
for link in soup.find_all('a'):
    href = link.get('href')
    if href and not href.startswith('#'):  # Exclude anchor links
        absolute_url = urljoin(url, href)
        urls.append(absolute_url)

## TEST: PRINT URLS ##
# print (urls)
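
The loop above keeps every link it finds, so the list can contain duplicates, the same page with different "#section" fragments, and non-web links such as mailto: addresses. One possible cleanup pass, using only the standard library, is sketched below. clean_links is a hypothetical helper, and the sample hrefs (including the notebooks/intro.html path) are made up for illustration.

```python
from urllib.parse import urljoin, urldefrag, urlparse

def clean_links(base_url, hrefs, same_host_only=True):
    """Resolve hrefs against base_url, then drop fragments, non-web links, and duplicates.

    Hypothetical helper for illustration only.
    """
    base_host = urlparse(base_url).netloc
    seen = set()
    cleaned = []
    for href in hrefs:
        if not href or href.startswith("#"):
            continue  # skip empty values and in-page anchor links
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar links
        if same_host_only and parsed.netloc != base_host:
            continue  # skip links that leave the starting site
        if absolute not in seen:
            seen.add(absolute)
            cleaned.append(absolute)
    return cleaned

base = "https://datapartnership.org/syria-economic-monitor/README.html"
sample_hrefs = [
    "#top",
    "notebooks/intro.html",        # made-up relative path
    "notebooks/intro.html#data",   # same page, different fragment
    "mailto:someone@example.org",
    "https://github.com/datapartnership/syria-economic-monitor",
]
links = clean_links(base, sample_hrefs)
print(links)
```

Passing same_host_only=False keeps external links such as the GitHub repository, which is closer to what the original loop does.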

### Step 4: Read in Contents from All Listed URLs 
LangChain is a library for building applications on top of large language models (LLMs), and it is commonly used to connect generative AI to existing data: text, .csv files, .pdf files, .json files, and many more. In this step, we use the Selenium URL loader to read in the contents of each of the URLs retrieved in the previous step.

In [5]:
## READ IN URL CONTENTS FOR USE WITH LANGCHAIN ##

loader = SeleniumURLLoader(urls=urls)
data = loader.load()

### Step 5: Break Contents into "Chunks"
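In this step, we use a LangChain text splitter (CharacterTextSplitter) to break the website content into smaller chunks. The splitter breaks the text at blank lines (the "\n\n" separator) and merges the pieces into chunks of up to roughly 2,500 characters, with up to 200 characters of overlap between neighboring chunks. Note that chunk size is measured here in characters (length_function=len), not tokens.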

In [6]:
## SPLIT URL CONTENTS INTO CHUNKS ##

text_splitter = CharacterTextSplitter(separator = "\n\n", chunk_size=2500, chunk_overlap=200, length_function=len,)
docs = text_splitter.split_documents(data)
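
To make the idea of chunking concrete, here is a simplified sketch in plain Python. It is not LangChain's implementation: it ignores overlap, keeps any single piece longer than chunk_size whole, and split_into_chunks is a hypothetical name. It only illustrates the "split on a separator, then pack pieces up to a size limit" idea.

```python
def split_into_chunks(text, separator="\n\n", chunk_size=2500):
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size characters.

    Simplified illustration: no overlap, and an oversized single piece is kept whole.
    """
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate      # the piece still fits (or it is the first piece)
        else:
            chunks.append(current)   # close the current chunk and start a new one
            current = piece
    if current:
        chunks.append(current)
    return chunks

# Three 40-character paragraphs with a 100-character limit
sample = "\n\n".join(["A" * 40, "B" * 40, "C" * 40])
chunks = split_into_chunks(sample, chunk_size=100)
print([len(c) for c in chunks])
```

The first two paragraphs fit together in one chunk; adding the third would exceed the limit, so it starts a new chunk.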

### Step 6: Vectorize the Data
In this step, we use OpenAI Embeddings and the FAISS vector store to make searching our data more efficient. Each chunk is converted into a vector, a list of numbers that captures its meaning. The easiest way to think about this step is to imagine we have a .csv file, where each vector is one row. Then, in a later step, when we make a query, only the rows (vectors) most similar to the query are passed on to the OpenAI API for a natural language response. This prevents us from having to send entire datasets through the API.

In [7]:
## VECTORIZE THE CHUNKS FOR EFFICIENT SEARCH ##

embeddings = OpenAIEmbeddings()
db = FAISS.from_documents(docs, embeddings)
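
A toy version of vector search may help here. The sketch below ranks three made-up, 3-dimensional vectors against a made-up query vector using cosine similarity. Real embeddings have far more dimensions, the numbers here are invented, and the distance measure actually used by the FAISS index may differ (for example, Euclidean distance).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up "embeddings": each text chunk maps to a small vector
toy_index = {
    "tree cover loss by governorate": [0.9, 0.1, 0.0],
    "nighttime lights after the earthquake": [0.1, 0.9, 0.1],
    "humanitarian survey methodology": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]  # stand-in for an embedded question about tree cover

# Rank chunks from most to least similar to the query
ranked = sorted(
    toy_index,
    key=lambda text: cosine_similarity(query_vector, toy_index[text]),
    reverse=True,
)
best_match = ranked[0]
print(best_match)
```

Only the top-ranked chunks would be passed along to the language model, which is what keeps the prompt small.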

### Step 7: Save Vector Store
Now, we will save the vectorized data (Vector Store), so we can go back to it anytime. 

In [8]:
## SAVE EMBEDDINGS LOCALLY, SO YOU DO NOT HAVE TO REGENERATE ALL THE TIME ##

db.save_local("faiss_index")
new_db = FAISS.load_local("faiss_index", embeddings)
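
As written, the cell above rebuilds and saves the index on every run before loading it back. One way to reuse a saved index instead is a small "load if present, otherwise build" pattern, sketched below with only the standard library. load_or_build is a hypothetical helper, and the demo uses a plain text file as a stand-in for the FAISS index.

```python
import os
import tempfile

def load_or_build(path, build, save, load):
    """Return the cached object at path if it exists; otherwise build it, save it, and return it.

    Hypothetical helper for illustration only.
    """
    if os.path.exists(path):
        return load(path)
    obj = build()
    save(obj, path)
    return obj

def _write_text(obj, path):
    with open(path, "w") as f:
        f.write(obj)

def _read_text(path):
    with open(path) as f:
        return f.read()

# Demo with a text file standing in for the saved index
with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "cache.txt")
    first = load_or_build(cache_path, lambda: "expensive result", _write_text, _read_text)
    second = load_or_build(cache_path, lambda: "never built", _write_text, _read_text)

# In the notebook, the same pattern might look like this (using the calls from the cell above):
# new_db = load_or_build(
#     "faiss_index",
#     build=lambda: FAISS.from_documents(docs, embeddings),
#     save=lambda db, path: db.save_local(path),
#     load=lambda path: FAISS.load_local(path, embeddings),
# )
```

On the second call the build function is never invoked, which is the point: the embeddings are only paid for once.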

### Step 8: Set Up and Test Retriever
The retriever will search the vector store to identify those vectors that have the most relevant information based on the user query. Only these vectors will be retrieved and sent to the ChatGPT API for processing. The code below includes a couple of tests that are recommended the first time you run this notebook, to make sure the embeddings (vector process) and retriever work.

In [9]:
## TEST EMBEDDINGS ##

# query = "Which governorate experienced the most tree cover loss?"
# docs = new_db.similarity_search(query)
# print(docs)

## SET UP AND TEST RETRIEVER ##
retriever = new_db.as_retriever(search_type="similarity_score_threshold", search_kwargs={"score_threshold": .7})
# docs = retriever.get_relevant_documents("Which governorate had the biggest change in tree cover?")
# print (docs)
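
The effect of the score threshold can be illustrated without any library. In the sketch below, the chunk names and scores are made up, and filter_by_score is a hypothetical helper. It only shows how raising the threshold shrinks the set of chunks that would be passed on.

```python
def filter_by_score(scored_chunks, score_threshold=0.7):
    """Keep (text, score) pairs at or above the threshold, best match first."""
    kept = [(text, score) for text, score in scored_chunks if score >= score_threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Made-up relevance scores for three chunks
scored = [
    ("chunk about tree cover", 0.82),
    ("chunk about nighttime lights", 0.55),
    ("chunk about surveys", 0.74),
]

loose = filter_by_score(scored, score_threshold=0.7)   # two chunks pass
strict = filter_by_score(scored, score_threshold=0.8)  # only one chunk passes
print(len(loose), len(strict))
```

A higher threshold means fewer chunks and a shorter prompt, at the risk of dropping something relevant.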

### Step 9: Set Up Query Chain 
Now, using LangChain's RetrievalQAWithSourcesChain, ChatOpenAI (it's good!), and your retriever, we are ready to go! The following code takes a query, finds the relevant data in your vector store, and then sends that data as part of a prompt to the ChatGPT API. If you find too much information is being sent through (exceeding the OpenAI token limits), consider increasing the similarity score threshold used by the retriever.

In [10]:
## SET UP QUERY CHAIN. FIRST RETRIEVE VECTORS FROM DATABASE. THEN SEND TO OPENAI FOR GENERATING RESPONSE ##

chain = RetrievalQAWithSourcesChain.from_chain_type(ChatOpenAI(temperature=0),
                                                    chain_type="stuff",
                                                    retriever=retriever)  # Reuses the retriever defined in Step 8


## THIS METHOD USES OPENAI, INSTEAD OF CHATOPENAI -- CHATOPENAI IS SO MUCH BETTER!!! ##
# chain = RetrievalQAWithSourcesChain.from_chain_type(OpenAI(temperature=0), 
#                                                     chain_type="stuff", 
#                                                     retriever=new_db.as_retriever(search_type="similarity_score_threshold", 
#                                                                                   search_kwargs={"score_threshold": .7}))
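
The "stuff" chain type means, roughly, that all retrieved chunks are stuffed into a single prompt. The sketch below illustrates that idea in plain Python. The prompt wording, the build_stuff_prompt name, and the placeholder chunks are assumptions for illustration, not LangChain's actual template.

```python
def build_stuff_prompt(question, chunks):
    """Pack every retrieved chunk, with its source, into one prompt string.

    Hypothetical helper; the wording is illustrative, not LangChain's template.
    """
    context_blocks = []
    for chunk in chunks:
        context_blocks.append(f"Content: {chunk['text']}\nSource: {chunk['source']}")
    context = "\n\n".join(context_blocks)
    return (
        "Answer the question using only the context below, and list the sources you used.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Placeholder chunks standing in for the retriever's output
retrieved = [
    {"text": "Placeholder text of the first retrieved chunk.", "source": "https://example.org/page-a"},
    {"text": "Placeholder text of the second retrieved chunk.", "source": "https://example.org/page-b"},
]
prompt = build_stuff_prompt("What does the first page say?", retrieved)
print(prompt)
```

Because every chunk lands in the same prompt, the prompt length grows with the number of retrieved chunks, which is why the token limit becomes a concern.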

### Step 10: Big Time! Ask Your Content a Question!
Now we can write a question for our Data Good (or other fed-in URL content!)

In [11]:
## ASK A QUESTION! ##
question = input()

 How were humanitarian surveys used in the Syria Economic Monitor?


### Step 11: Run a Query and Make the Response Pretty
Warning! Even with all this splitting, vectorizing, and retrieving, sometimes we still exceed the paltry ChatGPT token limits. To fix this, make your question more specific to reduce the number of retrieved materials, or increase the "score_threshold" in your query chain. In the response below, note that we also include the sources -- important and exciting!

In [12]:
## RUN QUERY AND MAKE RESPONSE PRETTY ##

response = chain({"question": question}, return_only_outputs=True)
markdown_text = f"**Question:**\n\n{question}\n\n**Answer:**\n\n{response['answer']}\n\n**Sources:**\n\n{response['sources']}\n\n"
display(Markdown(markdown_text))

**Question:**

How were humanitarian surveys used in the Syria Economic Monitor?

**Answer:**

Humanitarian surveys were used in the Syria Economic Monitor to collect data on community situations and needs relating to shelter, electricity, water sanitation and hygiene (WASH), food security, livelihoods, health, education, humanitarian assistance, and priority needs. The survey used key informant interviews to collect data at the community (admin4) level. The panel used for the analysis includes 1,426 communities (371 in NWS and 1246 in NES) which are included in all rounds of data collection. The data and methodology used to generate insights for this project have been prepared as Data Goods, which are designed to be re-used for future updates and projects. Nighttime lights were also analyzed to understand the economic impacts of the February 6 earthquake in Turkiye and northern Syria. 


**Sources:**

https://github.com/datapartnership/syria-economic-monitor, https://datapartnership.org/syria-economic-monitor/notebooks/hsos-survey/hsos-survey-readme.html, https://datapartnership.org/syria-economic-monitor/docs/2023-summer-economic-monitor.html
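
In the output above, the sources come back as one comma-separated string. A possible refinement, sketched below, is to render them as a bulleted list. format_answer is a hypothetical helper, the sample response is a placeholder dictionary rather than real chain output, and the comma split assumes the URLs themselves contain no commas.

```python
def format_answer(question, response):
    """Render a question, answer, and comma-separated sources string as Markdown.

    Hypothetical helper; expects a dict with optional 'answer' and 'sources' keys.
    """
    raw_sources = response.get("sources", "")
    sources = [s.strip() for s in raw_sources.split(",") if s.strip()]
    if sources:
        source_lines = "\n".join(f"- {s}" for s in sources)
    else:
        source_lines = "- (no sources returned)"
    answer = response.get("answer", "").strip()
    return (
        f"**Question:**\n\n{question}\n\n"
        f"**Answer:**\n\n{answer}\n\n"
        f"**Sources:**\n\n{source_lines}\n"
    )

# Placeholder response standing in for the chain's output
sample_response = {
    "answer": "Placeholder answer text. ",
    "sources": "https://example.org/page-a, https://example.org/page-b",
}
markdown_text = format_answer("Placeholder question?", sample_response)
print(markdown_text)
```

In the notebook, the result could be passed to display(Markdown(...)) in the same way as the original markdown_text.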



### Final Notes
I am a novice! I have never written an application like this before! My explanations and code are based on what I have taught myself in the past couple months, so if you see areas for improvement, or, more importantly, things that are just completely wrong, please submit an issue on this repo, so I can learn and fix <3

After preparing this and other notebooks, I have come to see generative AI as a Matrix-style superpower, enabling me to do much more, much faster, than I could have ever dreamed of. Maybe this is dangerous! Maybe this is the first step to reaching the next galaxy (because to do that, we can't just do what we are doing now...).