# Unsolved Missing People Chatbot

This is a basic RAG application that uses LLaMA3 as the LLM to retrieve information about unsolved disappearances of people throughout history. It draws on both vector embeddings and a SQL database, for investigative analysis or just morbid curiosity.

## 1. Defining the LLM

In [39]:
from langchain_community.llms import Ollama

llm = Ollama(base_url="http://localhost:11434", model="llama3")

print(llm.invoke("Tell me a joke"))

Here's one:

Why don't scientists trust atoms?

Because they make up everything!

Hope that made you smile! Do you want to hear another one?


In [8]:
print(llm.invoke("what's trappist-1e"))

A fascinating topic!

Trappist-1e is one of the seven Earth-sized planets orbiting the ultracool dwarf star Trappist-1, located about 39 light-years away from us. This exoplanet system was discovered in 2017 by a team of astronomers using the Transiting Planets and PlanetesImals Small Telescope (TRAPPIST) at the La Silla Observatory in Chile.

Trappist-1e is a particularly interesting world because it is thought to be a terrestrial planet, meaning it has a rocky composition similar to Earth. This makes it an intriguing candidate for hosting liquid water and potentially, life.

Here are some key facts about Trappist-1e:

1. **Size:** Trappist-1e is roughly the same size as Earth, with a diameter of about 8,200 miles (13,100 km).
2. **Mass:** Its mass is estimated to be around 0.9 times that of Earth.
3. **Orbital period:** It takes Trappist-1e about 6.1 days to complete one orbit around its star.
4. **Surface temperature:** The surface temperature of Trappist-1e is likely to be around -

## 2. Data Ingestion

Importing all the required libraries for data ingestion

In [64]:
import bs4
import requests
from langchain import hub
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_text_splitters import RecursiveCharacterTextSplitter
import csv

### 2.1 Data Extraction

In [81]:
# Wikipedia list pages to scrape: three time periods plus disappearances at sea
link_0 = "https://en.wikipedia.org/wiki/List_of_people_who_disappeared_mysteriously:_1990%E2%80%93present"
link_1 = "https://en.wikipedia.org/wiki/List_of_people_who_disappeared_mysteriously:_1910%E2%80%931990"
link_2 = "https://en.wikipedia.org/wiki/List_of_people_who_disappeared_mysteriously:_pre-1910"
link_3 = "https://en.wikipedia.org/wiki/List_of_people_who_disappeared_mysteriously_at_sea#1970%E2%80%93present"

# Collected in one list so the extraction loop can iterate over them
missing_list = [link_0, link_1, link_2, link_3]

In [92]:
with open("missing_persons.csv", "w", newline="") as file:
    csv_writer = csv.writer(file)
    headers = ["Date", "Age", "Missing from", "Circumstances", "References"]
    csv_writer.writerow(headers)

In [93]:
import pandas as pd

for url in missing_list:
    response = requests.get(url)
    soup = bs4.BeautifulSoup(response.content, "lxml")
    # Find the required tables (three class combinations are matched)
    tables = (
        soup.find_all("table", class_="wikitable sortable zebra")
        + soup.find_all("table", class_="wikitable sortable plainrowheaders zebra")
        + soup.find_all("table", class_="wikitable sortable plainrowheaders")
    )
    # Extract the text of every <td> cell, row by row
    rows = []
    for table in tables:
        for row in table.find_all("tr")[1:]:  # Skip the header row
            cells = row.find_all("td")
            rows.append([cell.text.strip() for cell in cells])
    # Create a DataFrame
    df = pd.DataFrame(rows)

    # Append the DataFrame to the CSV file
    df.to_csv("missing_persons.csv", mode="a", index=False)

    # `url` is no longer shadowed by an inner loop variable, so this prints the link
    print(f"Tables from {url} extracted and saved to missing_persons.csv")
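
The extraction cell above walks each table row and collects the stripped text of its `<td>` cells. As a minimal offline sketch of that idea, the block below does the same on a small hand-written HTML fragment. It swaps in the standard library's `html.parser` in place of BeautifulSoup so that it runs without a network connection or extra packages; `TableRowParser` and `SAMPLE_HTML` are hypothetical names introduced only for this illustration.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the stripped text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row currently being parsed
        self._cell = None   # text fragments of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text inside nested tags such as <a> still belongs to the open cell
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows hold only <th> cells, so they stay empty
                self.rows.append(self._row)
            self._row = None


# Hand-written sample (an assumption, not fetched from Wikipedia)
SAMPLE_HTML = """
<table class="wikitable sortable">
<tr><th>Date</th><th>Age</th><th>Missing from</th></tr>
<tr><td>1487</td><td>Unknown</td><td><a href="#">Atlantic Ocean</a></td></tr>
<tr><td>1499</td><td>~49</td><td>Northwest Passage</td></tr>
</table>
"""

parser = TableRowParser()
parser.feed(SAMPLE_HTML)
table_rows = parser.rows
print(table_rows)
```

Cells that use `rowspan` appear only in the first row they cover, so later rows come out shorter. That is one likely reason some rows in the scraped CSV look shifted.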

### 2.2 Data Cleaning

In [94]:
import pandas as pd

In [95]:
df = pd.read_csv("missing_persons.csv")

In [96]:
df.head()

Unnamed: 0,Date,Age,Missing from,Circumstances,Unnamed: 4
0,,,,,
1,0,2,3,4,
2,2nd century BC,Unknown,Gulf of Aden,Greek navigator who explored the Arabian Sea f...,
3,c. 1291,Unknown,Atlantic Ocean,The Genoese sailor and explorer brothers were ...,
4,Ugolino Vivaldi,,,,


In [103]:
df.drop("Unnamed: 4", inplace=True, axis=1)

In [104]:
df.head()

Unnamed: 0,Date,Age,Missing from,Circumstances
0,,,,
1,0,2,3,4
2,2nd century BC,Unknown,Gulf of Aden,Greek navigator who explored the Arabian Sea f...
3,c. 1291,Unknown,Atlantic Ocean,The Genoese sailor and explorer brothers were ...
4,Ugolino Vivaldi,,,


In [115]:
df = df[df['Circumstances'].notna()]
df = df[~df["Circumstances"].str.isdigit()]
df

Unnamed: 0,Date,Age,Missing from,Circumstances
2,2nd century BC,Unknown,Gulf of Aden,Greek navigator who explored the Arabian Sea f...
3,c. 1291,Unknown,Atlantic Ocean,The Genoese sailor and explorer brothers were ...
5,c. 1307 or c. 1312,Unknown,Atlantic Ocean,"The eighth mansa of the Mali Empire, who was s..."
6,c. 1346,Unknown,Atlantic Ocean,Majorcan sailor who sailed down the west coast...
7,1487,Unknown,Atlantic Ocean,Portuguese sailor who was the co-captain of an...
...,...,...,...,...
1181,01-Jan-23,39,"Cohasset, Massachusetts, U.S.","Ana Walshe, a Serbian-American real estate exe..."
1182,04-Jul-23,46,"Ampang Jaya, Malaysia","Thuzar Maung, a Burmese human rights activist ..."
1183,19-Jul-23,44,"Christchurch, New Zealand",44-year-old real estate agent Yanfei Bao disap...
1184,12-Oct-23,26,"Lemery, Batangas, Philippines",Miss Grand Philippines 2023 Candidate and Teac...
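
The two filters in the cell above keep a row only if its `Circumstances` value is present and is not purely numeric, which removes blank rows and stray rows such as the `0,2,3,4` line visible in `df.head()`. A minimal sketch of the same rule in plain Python, using the `csv` module on an inline sample, is shown below. `RAW_CSV` and `keep_row` are hypothetical names, and the sample rows are shortened stand-ins rather than the real file.

```python
import csv
import io

# Inline stand-in for missing_persons.csv (assumption: shortened sample rows)
RAW_CSV = """Date,Age,Missing from,Circumstances
,,,
0,2,3,4
2nd century BC,Unknown,Gulf of Aden,Greek navigator who explored the Arabian Sea f...
Ugolino Vivaldi,,,
c. 1346,Unknown,Atlantic Ocean,Majorcan sailor who sailed down the west coast...
"""


def keep_row(row):
    """Mirror the notebook filters: Circumstances must exist and not be all digits."""
    circumstances = (row.get("Circumstances") or "").strip()
    return bool(circumstances) and not circumstances.isdigit()


clean_rows = [row for row in csv.DictReader(io.StringIO(RAW_CSV)) if keep_row(row)]
clean_dates = [row["Date"] for row in clean_rows]
print(clean_dates)
```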


In [131]:
pd.set_option('display.max_rows', 500)
# Renumber the rows from 0 after filtering, discarding the old index
df = df.reset_index(drop=True)

In [134]:
df[:25]

Unnamed: 0,Date,Age,Missing from,Circumstances
0,2nd century BC,Unknown,Gulf of Aden,Greek navigator who explored the Arabian Sea f...
1,c. 1291,Unknown,Atlantic Ocean,The Genoese sailor and explorer brothers were ...
2,c. 1307 or c. 1312,Unknown,Atlantic Ocean,"The eighth mansa of the Mali Empire, who was s..."
3,c. 1346,Unknown,Atlantic Ocean,Majorcan sailor who sailed down the west coast...
4,1487,Unknown,Atlantic Ocean,Portuguese sailor who was the co-captain of an...
5,1499,~49,Northwest Passage,"Cabot, an Italian explorer, departed with five..."
6,24 March 1500,Unknown,Cape Verde or Cape of Good Hope,Portuguese sailor Vasco de Ataíde's ship was p...
7,1501,50–51,Northwest Passage,Portuguese explorer Gaspar Corte-Real disappea...
8,1502,53–54,Northwest Passage,"Miguel Corte-Real, a Portuguese explorer, disa..."
9,1511,Unknown,Caribbean Sea,"Nicuesa, a Spanish conquistador and explorer, ..."


In [137]:
df.to_csv("missing_persons_clean.csv", index=False)

In [17]:
df = pd.read_csv("missing_persons_clean.csv")

In [14]:
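# Exploratory only: keep the last two characters of each date.
# The result is not written back to the CSV, so later cells still see full dates.
df["Date"] = df["Date"].str[-2:]
df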

Unnamed: 0,Date,Age,Missing from,Circumstances
0,BC,Unknown,Gulf of Aden,Greek navigator who explored the Arabian Sea f...
1,91,Unknown,Atlantic Ocean,The Genoese sailor and explorer brothers were ...
2,12,Unknown,Atlantic Ocean,"The eighth mansa of the Mali Empire, who was s..."
3,46,Unknown,Atlantic Ocean,Majorcan sailor who sailed down the west coast...
4,87,Unknown,Atlantic Ocean,Portuguese sailor who was the co-captain of an...
...,...,...,...,...
1036,23,39,"Cohasset, Massachusetts, U.S.","Ana Walshe, a Serbian-American real estate exe..."
1037,23,46,"Ampang Jaya, Malaysia","Thuzar Maung, a Burmese human rights activist ..."
1038,23,44,"Christchurch, New Zealand",44-year-old real estate agent Yanfei Bao disap...
1039,23,26,"Lemery, Batangas, Philippines",Miss Grand Philippines 2023 Candidate and Teac...


All the code up to this point generates the primary CSV file that the rest of the notebook works with.

## 3. Retrieval

### 3.1.1 CSV - row-wise chunking and embedding

In [19]:
from langchain_community.document_loaders.csv_loader import CSVLoader

In [20]:
# Creating the loader
loader = CSVLoader(file_path="missing_persons_clean.csv")

In [21]:
# Loading the documents from csv file
data = loader.load()

In [22]:
data

[Document(page_content='Date: 2nd century BC\nAge: Unknown\nMissing from: Gulf of Aden\nCircumstances: Greek navigator who explored the Arabian Sea for Ptolemy VIII Physcon, who is thought to have perished during a journey to circumnavigate Africa, but this has not been definitively confirmed.', metadata={'source': 'missing_persons_clean.csv', 'row': 0}),
 Document(page_content='Date: c. 1291\nAge: Unknown\nMissing from: Atlantic Ocean\nCircumstances: The Genoese sailor and explorer brothers were lost while attempting the first oceanic journey from Europe to Asia. Their two galleys sailed out of the Mediterranean Sea and into the Atlantic Ocean, but were not heard from again.', metadata={'source': 'missing_persons_clean.csv', 'row': 1}),
 Document(page_content="Date: c. 1307 or c. 1312\nAge: Unknown\nMissing from: Atlantic Ocean\nCircumstances: The eighth mansa of the Mali Empire, who was said by his successor Mansa Musa to have disappeared in an attempt to discover the limits of the A

In [29]:
# Splitting
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=250, chunk_overlap=75, add_start_index=True
)
all_splits = text_splitter.split_documents(data)

In [30]:
len(all_splits)

2571

In [31]:
len(all_splits[0].page_content)

60

In [32]:
all_splits[10].metadata

{'source': 'missing_persons_clean.csv', 'row': 5, 'start_index': 0}
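
To make the `chunk_size=250` and `chunk_overlap=75` settings above more concrete, here is a simplified sliding-window sketch in plain Python. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as newlines and spaces); it only illustrates how a size limit and an overlap interact, and how a start index like the `start_index` metadata can be tracked. `sliding_window_chunks` is a hypothetical helper.

```python
def sliding_window_chunks(text, chunk_size=250, chunk_overlap=75):
    """Return (start_index, chunk) pairs using a fixed-size overlapping window."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# 600 characters of repeating alphabet as a stand-in for one long CSV row
sample_text = "".join(chr(97 + i % 26) for i in range(600))
window_chunks = sliding_window_chunks(sample_text)
window_starts = [start for start, _ in window_chunks]
print(window_starts)
```

Each chunk begins 175 characters after the previous one, so the last 75 characters of one chunk are repeated at the start of the next.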

In [33]:
# Store the embeddings in a vectorstore
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

oembed = OllamaEmbeddings(base_url="http://localhost:11434", model="nomic-embed-text")
vectorstore = Chroma.from_documents(
    documents=all_splits, embedding=oembed, persist_directory="./Vectorstores"
)

In [None]:
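# Embed the first chunk and peek at the first five dimensions of its vector
embeds = oembed.embed_documents([all_splits[0].page_content])
embeds[0][:5]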

In [None]:
query = "who disappeared mysteriously in 1990 in scotland"
docs = vectorstore.similarity_search(query)
print(docs[0].page_content)
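
A similarity search compares the embedding of the query with the stored chunk embeddings and returns the closest ones. The sketch below illustrates that ranking step with cosine similarity on made-up three-dimensional vectors. The vectors, the names `toy_vectors` and `top_k`, and the choice of cosine similarity are all assumptions for illustration; the metric a real vector store uses depends on how it is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the names of the k vectors most similar to query_vec."""
    scored = sorted(
        doc_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:k]]


# Toy embeddings (made up): real embeddings have hundreds of dimensions
toy_vectors = {
    "row_a": [1.0, 0.0, 0.0],
    "row_b": [0.9, 0.1, 0.0],
    "row_c": [0.0, 1.0, 0.0],
}
query_vec = [0.8, 0.2, 0.0]
ranked = top_k(query_vec, toy_vectors, k=2)
print(ranked)
```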

### 3.1.2 CSV to SQL

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("missing_persons_clean.csv")

In [3]:
df.head()

Unnamed: 0,Date,Age,Missing from,Circumstances
0,2nd century BC,Unknown,Gulf of Aden,Greek navigator who explored the Arabian Sea f...
1,c. 1291,Unknown,Atlantic Ocean,The Genoese sailor and explorer brothers were ...
2,c. 1307 or c. 1312,Unknown,Atlantic Ocean,"The eighth mansa of the Mali Empire, who was s..."
3,c. 1346,Unknown,Atlantic Ocean,Majorcan sailor who sailed down the west coast...
4,1487,Unknown,Atlantic Ocean,Portuguese sailor who was the co-captain of an...


In [5]:
from langchain_community.utilities import SQLDatabase
from sqlalchemy import create_engine

In [6]:
engine = create_engine("sqlite:///missing_persons.db")

In [7]:
df.to_sql("missing_persons_csv", engine, index=False)

1041

In [8]:
db = SQLDatabase(engine=engine)

In [9]:
print(db.dialect)
print(db.get_usable_table_names())

sqlite
['missing_persons_csv']
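
Once the rows are in SQLite, structured questions (for example, filtering by location) can be answered with plain SQL instead of a similarity search. The sketch below shows such a query using the standard library's `sqlite3` module against an in-memory table that mimics `missing_persons_csv`. The sample rows are copied from the `df.head()` output above, with the `Circumstances` text left in its truncated display form; the in-memory database is an assumption made so the example runs on its own.

```python
import sqlite3

# Rows copied from the df.head() output; Circumstances values are the
# truncated display strings, not the full text stored in the real table.
SAMPLE_ROWS = [
    ("2nd century BC", "Unknown", "Gulf of Aden", "Greek navigator who explored the Arabian Sea f..."),
    ("c. 1291", "Unknown", "Atlantic Ocean", "The Genoese sailor and explorer brothers were ..."),
    ("c. 1307 or c. 1312", "Unknown", "Atlantic Ocean", "The eighth mansa of the Mali Empire, who was s..."),
    ("c. 1346", "Unknown", "Atlantic Ocean", "Majorcan sailor who sailed down the west coast..."),
    ("1487", "Unknown", "Atlantic Ocean", "Portuguese sailor who was the co-captain of an..."),
]

conn = sqlite3.connect(":memory:")  # stand-in for missing_persons.db
conn.execute(
    'CREATE TABLE missing_persons_csv '
    '("Date" TEXT, "Age" TEXT, "Missing from" TEXT, "Circumstances" TEXT)'
)
conn.executemany("INSERT INTO missing_persons_csv VALUES (?, ?, ?, ?)", SAMPLE_ROWS)

# Column names with spaces must be double-quoted in SQL
cursor = conn.execute(
    'SELECT "Date" FROM missing_persons_csv WHERE "Missing from" = ? ORDER BY rowid',
    ("Atlantic Ocean",),
)
atlantic_dates = [row[0] for row in cursor]
print(atlantic_dates)
conn.close()
```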


### 3.2 The Retrieval

In [None]:
# Retrieve the six most similar chunks of the missing-persons data and pull a RAG prompt template.
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 6})
prompt = hub.pull("rlm/rag-prompt")
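
The retriever and prompt above are the two inputs to the generation step: the retrieved chunks are joined into a context string, and that context is placed in the prompt together with the question. The sketch below shows that assembly in plain Python. `PROMPT_TEMPLATE` is a hypothetical stand-in and not the text of `rlm/rag-prompt`; `format_docs`, `build_prompt`, and the sample chunks are likewise illustrative assumptions.

```python
# Hypothetical template; the real prompt pulled from the hub may differ
PROMPT_TEMPLATE = (
    "Use the following context to answer the question. "
    "If the answer is not in the context, say you don't know.\n\n"
    "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)


def format_docs(chunks):
    """Join retrieved chunk texts into one context string."""
    return "\n\n".join(chunks)


def build_prompt(question, chunks):
    """Fill the template with the joined context and the user's question."""
    return PROMPT_TEMPLATE.format(context=format_docs(chunks), question=question)


# Sample chunk texts in the same "Field: value" layout CSVLoader produces
retrieved_chunks = [
    "Date: 1487\nAge: Unknown\nMissing from: Atlantic Ocean",
    "Date: 1499\nAge: ~49\nMissing from: Northwest Passage",
]
rag_prompt = build_prompt("Who went missing in the Northwest Passage?", retrieved_chunks)
print(rag_prompt)
```

In the notebook, the resulting text would be passed to `llm.invoke`, with the chunks coming from the retriever rather than a hard-coded list.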