# Question-Answering with LangChain and GPT-3

## Data

Document source: articles from [It's FOSS](https://archive.ph/o/8NCVk/https://itsfoss.com/), a web portal that specializes in open-source technologies, with a particular focus on Linux.

The site's [sitemap-posts.xml file](https://news.itsfoss.com/sitemap-posts.xml) lists the links to all the articles to process.

In [14]:
import os

import pandas as pd
import numpy as np

import xmltodict
import requests
from bs4 import BeautifulSoup
from tqdm.notebook import tqdm


In [2]:
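# Download the sitemap and parse it into nested dictionaries
r = requests.get("https://news.itsfoss.com/sitemap-posts.xml")
r.raise_for_status()  # fail early on an HTTP error
xml = r.text
rss = xmltodict.parse(xml)

# Each <url> entry stores the article address in its <loc> child
article_links = [entry["loc"] for entry in rss["urlset"]["url"]]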

print(f"Total number of articles: {len(article_links)}")

Total number of articles: 986
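
For reference, a sitemap is a `urlset` element holding one `url` entry per page, each with a `loc` child. The sketch below parses a small inline sample with the standard library only, as an alternative to `xmltodict`. The two URLs come from this notebook's own output, the `lastmod` values are made up for illustration, and the namespace is assumed to be the standard sitemaps.org one. One caveat with the cell above: `xmltodict` returns a single dict instead of a list when the sitemap has only one `url` entry, so the list comprehension assumes at least two entries.

```python
import xml.etree.ElementTree as ET

# Inline sample in the standard sitemap format; lastmod values are illustrative.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://news.itsfoss.com/warp-file-sharing/</loc>
    <lastmod>2023-01-01</lastmod>
  </url>
  <url>
    <loc>https://news.itsfoss.com/skiff-mail-review/</loc>
    <lastmod>2023-01-02</lastmod>
  </url>
</urlset>"""

# Assumed namespace: the one defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def links_from_sitemap(xml_text):
    """Return the text of every <loc> element that sits inside a <url> entry."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)]


sample_links = links_from_sitemap(SAMPLE_SITEMAP)
print(sample_links)
```

Because `findall` always returns a list, this version behaves the same for one entry or many.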


In [3]:
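def extract_content(url):
    """Download an article and return its headline, standfirst, and body text."""
    html = requests.get(url).text
    soup = BeautifulSoup(html, features="html.parser")
    elements = [
        soup.select_one(".c-topper__headline"),
        soup.select_one(".c-topper__standfirst"),
        soup.select_one(".c-content"),
    ]

    # select_one returns None when a selector matches nothing, so skip those
    text = "".join(element.get_text() for element in elements if element is not None)

    return text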

In [11]:
# Limit the list to the first 10 articles to keep the demo short
articles = (
    [{"source": url, "content": extract_content(url)}
     for url in tqdm(article_links[0:10], desc="Extracting article content")
     ]
)



In [9]:
articles[0]["source"], articles[0]["content"]

('https://news.itsfoss.com/warp-file-sharing/',
 "Warp: An Open-Source Secure File Sharing App That Works Cross-PlatformA seamless way to securely share files between Linux and Windows? Try this out!\n\nIn our adventure with First Look series of articles, we found a secure and efficient method of transferring files between Linux and Windows systems.A tool called 'Warp', a part of GNOME Circle featuring apps that extend the GNOME ecosystem. Warp facilitates the seamless transfer of files via the Internet or across a local network.Let's take a look at it.Warp: Overview ⭐Written primarily in the Rust programming language, Warp is a GTK-based file transfer app that uses the 'Magic Wormhole' protocol to make file transfers over the internet/local networks possible.All file transfers are encrypted, and the receiver must use a word-based code to access the files, preventing any misuse.Allow me to show you how it works.When you launch the app for the first time, you are greeted with a welcome 

In [15]:
articles_df = pd.DataFrame(articles)
articles_df.head()

| | source | content |
|---|---|---|
| 0 | https://news.itsfoss.com/warp-file-sharing/ | Warp: An Open-Source Secure File Sharing App T... |
| 1 | https://news.itsfoss.com/codecov-open-source/ | Code Coverage Tool 'Codecov' Opens its Source ... |
| 2 | https://news.itsfoss.com/linux-steam-macos/ | Linux Rising: Steam Usage Surpasses macOS for ... |
| 3 | https://news.itsfoss.com/fedora-asahi-remix-ap... | Fedora Asahi Remix to Bring Complete Linux Exp... |
| 4 | https://news.itsfoss.com/skiff-mail-review/ | Skiff is a Dashing Open-Source Secure Email Al... |


## Embedding

### Splitting into chunks

In [18]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

rec_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150)

web_docs, meta = [], []
for article in tqdm(articles, desc="Splitting articles into chunks"):
    splits = rec_splitter.split_text(article["content"])
    web_docs.extend(splits)
    meta.extend([{"source": article["source"]}] * len(splits))
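
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sketch: a plain sliding window over characters. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to cut at separators such as paragraph breaks, newlines, and spaces); the helper name `sliding_chunks` is hypothetical and only illustrates how consecutive chunks share text.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


demo_chunks = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(demo_chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one. The overlap plays the same role in the cell above: a sentence that straddles a chunk boundary still appears whole in at least one chunk.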


### Embedding chunks

In [19]:
import os
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

# Make sure the OPENAI_API_KEY environment variable is set to your OpenAI API key
# os.environ["OPENAI_API_KEY"] = "YOUR KEY"
article_store = FAISS.from_texts(
    texts=web_docs, embedding=OpenAIEmbeddings(), metadatas=meta)
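
Conceptually, the store keeps one vector per chunk and answers a query by embedding it and returning the closest chunk vectors. The sketch below shows that idea as a brute-force search over made-up three-dimensional vectors, assuming Euclidean distance as the metric. The function names and vectors are hypothetical; FAISS does the same job with optimized index structures, and real embeddings have far more dimensions.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query_vec, indexed_vecs, k):
    """Return the indices of the k vectors closest to query_vec."""
    ranked = sorted(
        range(len(indexed_vecs)),
        key=lambda i: l2_distance(query_vec, indexed_vecs[i]),
    )
    return ranked[:k]


# Hypothetical toy "embeddings", one per chunk
toy_vectors = [
    [0.9, 0.1, 0.0],  # chunk 0
    [0.0, 1.0, 0.0],  # chunk 1
    [0.8, 0.2, 0.1],  # chunk 2
]
toy_query = [1.0, 0.0, 0.0]

top_two = nearest(toy_query, toy_vectors, k=2)
print(top_two)
# [0, 2]
```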


## Query

In [24]:
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(
    memory_key="chat_history",
    input_key="question",
    output_key="answer",
    return_messages=True,
    )

In [25]:
from langchain import PromptTemplate

template = """You are a chatbot having a conversation with a human.
Given the following extracted parts of a long document and a question,
create a final answer.
{context}
{chat_history}
Human: {question}
Chatbot:"""

question_prompt = PromptTemplate(
    input_variables=["chat_history", "question", "context"], 
    template=template
    )
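
To see what the model actually receives, the template can be filled in by hand. The sketch below uses plain `str.format`, which is roughly what `PromptTemplate` does with its default f-string-style formatting; the context sentence is a stand-in for a retrieved chunk, not real retriever output.

```python
demo_template = """You are a chatbot having a conversation with a human.
Given the following extracted parts of a long document and a question,
create a final answer.
{context}
{chat_history}
Human: {question}
Chatbot:"""

filled = demo_template.format(
    context="Skiff is an open-source email service.",  # stand-in for a retrieved chunk
    chat_history="",  # empty on the first turn of a conversation
    question="What is Skiff?",
)
print(filled)
```

On later turns, `chat_history` would carry the earlier questions and answers, which is how the model can resolve follow-ups such as "Is it free?".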

In [26]:
from langchain import OpenAI
from langchain.chains import RetrievalQAWithSourcesChain

article_chain = RetrievalQAWithSourcesChain.from_llm(
    llm=OpenAI(temperature=0.0),
    # the number of chunks to retrieve is passed through search_kwargs
    retriever=article_store.as_retriever(search_kwargs={"k": 4}),
    memory=memory,
    question_prompt=question_prompt,
    )

result = article_chain(
    {"question": "What is Skiff?"},
    return_only_outputs=True
)

In [27]:
result

{'answer': ' Skiff is an open-source, secure email alternative to Gmail and Proton Mail that provides two-factor authentication, the ability to block remote content, password update, recovery key, and a verification phrase to let other Skiff users verify your identity. It also offers features like Web3 integration and IPFS decentralized storage, Pages to create/store documents securely, and encrypted cloud storage with IPFS support.\n',
 'sources': 'https://news.itsfoss.com/skiff-mail-review/'}
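
The `sources` value is a single string. When an answer draws on several articles, it can be convenient to turn that string into a list. The helper below is a hypothetical sketch that assumes multiple sources arrive separated by commas or whitespace; the output above contains only one source, so that format is not confirmed by this notebook.

```python
def split_sources(sources):
    """Split a comma- or whitespace-separated sources string into unique URLs, keeping order."""
    unique = []
    for part in sources.replace(",", " ").split():
        if part not in unique:
            unique.append(part)
    return unique


# Same shape as the result above, reduced to the field used here
demo_result = {"sources": "https://news.itsfoss.com/skiff-mail-review/"}
print(split_sources(demo_result["sources"]))
# ['https://news.itsfoss.com/skiff-mail-review/']
```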