# Philosophy Chat Bot

Dataset and PDF links: https://www.kaggle.com/datasets/kouroshalizadeh/history-of-philosophy/data


https://www.infobooks.org/pdfview/7529-eastern-philosophy-jsrl-narayana-moorty/
https://www.infobooks.org/authors/classic/plato-books/#Republic
https://www.infobooks.org/book/on-youth-old-age-life-and-death-and-respiration-aristotle/
https://www.infobooks.org/pdfview/7225-the-communist-manifesto-karl-marx/


Related sources:
https://www.kaggle.com/code/gpreda/rag-using-llama3-langchain-and-chromadb
https://www.kaggle.com/code/vanvalkenberg/nlp-what-the-philosopher-said/notebook

https://www.youtube.com/watch?v=luFHMtaw9pk&ab_channel=DavidBU
https://www.youtube.com/watch?v=2TJxpyO3ei4&t=2s&ab_channel=pixegami
https://www.youtube.com/watch?v=tcqEUSNCn8I&ab_channel=pixegami
https://www.youtube.com/watch?v=Ylz779Op9Pw&ab_channel=ShawTalebi
https://www.youtube.com/watch?v=au2WVVGUvc8&t=307s&ab_channel=LiamOttley

### Download dataset

In [1]:
%pip install --upgrade pip
%pip install kaggle



In [2]:
# Install kagglehub package
%pip install kagglehub

import kagglehub
import shutil
import os

# Download latest version
path = kagglehub.dataset_download("kouroshalizadeh/history-of-philosophy")
path = os.path.join(path, "philosophy_data.csv")

print("Path to dataset files:", path)
# Create data directory if it doesn't exist
os.makedirs("data", exist_ok=True)

# Copy the downloaded file to the data directory
shutil.copy(path, "data/philosophy_data.csv")

print("Dataset saved to data/philosophy_data.csv")

Note: you may need to restart the kernel to use updated packages.




Path to dataset files: /home/codespace/.cache/kagglehub/datasets/kouroshalizadeh/history-of-philosophy/versions/3/philosophy_data.csv
Dataset saved to data/philosophy_data.csv


### Pre-processing data set

Because of resource limitations, we work only with Plato's works. The next cells drop every row whose title does not mention Plato and export the result to a new CSV file.

In [3]:
import pandas as pd

df = pd.read_csv("data/philosophy_data.csv")
df.head(2)

Unnamed: 0,title,author,school,sentence_spacy,sentence_str,original_publication_date,corpus_edition_date,sentence_length,sentence_lowered,tokenized_txt,lemmatized_str
0,Plato - Complete Works,Plato,plato,"What's new, Socrates, to make you leave your ...","What's new, Socrates, to make you leave your ...",-350,1997,125,"what's new, socrates, to make you leave your ...","['what', 'new', 'socrates', 'to', 'make', 'you...","what be new , Socrates , to make -PRON- lea..."
1,Plato - Complete Works,Plato,plato,Surely you are not prosecuting anyone before t...,Surely you are not prosecuting anyone before t...,-350,1997,69,surely you are not prosecuting anyone before t...,"['surely', 'you', 'are', 'not', 'prosecuting',...",surely -PRON- be not prosecute anyone before ...


In [4]:
df.shape

(360808, 11)

In [5]:
df_plato = df[df['title'].str.contains("Plato", case=False, na=False)]
print(df_plato.shape)

(38366, 11)


In [6]:
df_plato.to_csv("data/plato_works.csv", index=False)

### Setting up Ollama

In [7]:
!sudo apt-get install -y pciutils
!curl -fsSL https://ollama.com/install.sh | sh  # install Ollama
!sudo apt-get update
!sudo apt-get install -y lshw

from IPython.display import clear_output

# Start the Ollama API server in the background from a separate thread

import threading
import subprocess
import requests
import json

def ollama():
    os.environ['OLLAMA_HOST'] = '0.0.0.0:11434'
    os.environ['OLLAMA_ORIGINS'] = '*'
    subprocess.Popen(["ollama", "serve"])

ollama_thread = threading.Thread(target=ollama)
ollama_thread.start()

%pip install -U lightrag[ollama]

Reading package lists... Done
Building dependency tree       
Reading state information... Done
pciutils is already the newest version (1:3.6.4-1ubuntu0.20.04.1).
0 upgraded, 0 newly installed, 0 to remove and 45 not upgraded.
>>> Installing ollama to /usr/local
>>> Downloading Linux amd64 bundle
######################################################################## 100.0%
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.
Hit:1 https://dl.yarnpkg.com/debian stable InRelease
Hit:2 https://packages.

2024/12/04 05:48:43 routes.go:1197: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/home/codespace/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[* http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2024-12-04T05:48:43.950Z level=INFO source=ima

Collecting ollama<0.3.0,>=0.2.1 (from lightrag[ollama])
  Using cached ollama-0.2.1-py3-none-any.whl.metadata (4.2 kB)
Using cached ollama-0.2.1-py3-none-any.whl (9.7 kB)
Installing collected packages: ollama
  Attempting uninstall: ollama
    Found existing installation: ollama 0.4.2
    Uninstalling ollama-0.4.2:
      Successfully uninstalled ollama-0.4.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
langchain-ollama 0.2.1 requires ollama<1,>=0.3.0, but you have ollama 0.2.1 which is incompatible.
Successfully installed ollama-0.2.1
Note: you may need to restart the kernel to use updated packages.
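
Because `ollama serve` is launched in the background, later cells can run before the server is accepting connections. The following is a minimal sketch of a readiness check, assuming the default address `127.0.0.1:11434` reported by the installer; the helper name `wait_for_port` is hypothetical and not part of Ollama or this notebook.

```python
import socket
import time


def wait_for_port(host: str = "127.0.0.1", port: int = 11434, timeout: float = 30.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            # Not listening yet: wait briefly and retry.
            time.sleep(0.2)
    return False
```

Calling `wait_for_port()` before `ollama pull` gives the server time to start. It only checks that the port is open, not that a model is loaded.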


### Install Dependencies

In [8]:
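%pip install langchain langchain_community langchain-chroma
import sys

# Swap the standard-library sqlite3 for pysqlite3 before importing chromadb.
# This requires the pysqlite3-binary package (see the next cell).
import pysqlite3
sys.modules['sqlite3'] = sys.modules["pysqlite3"]
import chromadb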



In [9]:
# %pip install pysqlite3-binary 

If Chroma reports an incompatible sqlite3 version (it needs 3.35.0 or higher), uncomment the install cell above and the block below.

In [10]:
# import sys

# BASE_DIR = os.path.dirname(sys.executable)
# print(BASE_DIR)

# __import__('pysqlite3')
# sys.modules['sqlite3'] = sys.modules.pop('pysqlite3')

# DATABASES = {
#     'default': {
#         'ENGINE': 'django.db.backends.sqlite3',
#         'NAME': os.path.join(BASE_DIR, 'db.sqlite3'),
#     }
# }
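
A quick way to tell whether the sqlite3 workaround is needed is to inspect the SQLite library that Python is linked against. This is a small sketch; the threshold `(3, 35, 0)` is taken from the comment on the Chroma import in this notebook, and the helper name `sqlite_is_new_enough` is made up for illustration.

```python
import sqlite3

# Minimum SQLite version mentioned in this notebook for Chroma.
MIN_SQLITE = (3, 35, 0)


def sqlite_is_new_enough(version_info=sqlite3.sqlite_version_info, minimum=MIN_SQLITE) -> bool:
    """Compare a (major, minor, patch) version tuple against the minimum."""
    return tuple(version_info) >= tuple(minimum)


print("SQLite library version:", sqlite3.sqlite_version)
print("Workaround needed:", not sqlite_is_new_enough())
```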

### Import Libraries

In [11]:
import bs4
from langchain import hub
from langchain_chroma import Chroma #make sure to have sqlite3 version 3.35.0 or higher
from langchain_core.output_parsers import StrOutputParser
from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain_core.runnables import RunnablePassthrough
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import OllamaEmbeddings
from langchain.prompts import ChatPromptTemplate

### Load models

In [12]:
# Pull models from ollama
!ollama pull llama3.1
!ollama pull nomic-embed-text

[GIN] 2024/12/04 - 05:48:52 | 200 |      26.991µs |       127.0.0.1 | HEAD     "/"
pulling manifest
[GIN] 2024/12/04 - 05:48:52 | 200 |  427.913772ms |       127.0.0.1 | POST     "/api/pull"
pulling 667b0c1932bc... 100% ▕████████████████▏ 4.9 GB
pulling 948af2743fc7... 100% ▕████████████████▏ 1.5 KB
pulling 0ba8f0e314b4... 100% ▕████████████████▏  12 KB
pulling 56bb8bd477a5... 100% ▕████████████████▏   96 B
pulling 455f34728c9b... 100% ▕████████████████▏  487 B
verifying sha256 digest
writing manifest
success
[GIN] 2024/12/04 - 05:48:53 | 200 |      34.204µs |       127.0.0.1 | HEAD     "/"
pulling manifest
[GIN] 2024/12/04 - 05:48:53 | 200 |

In [13]:
%pip install -qU langchain-ollama
from langchain_ollama import OllamaLLM

Note: you may need to restart the kernel to use updated packages.


In [14]:
# Load LLM (langchain_ollama replaces the deprecated langchain_community.llms.Ollama)
from langchain_ollama import OllamaLLM
llm = OllamaLLM(model="llama3.1")


In [15]:
# Load embeddings: the model that converts text into vector representations
# (langchain_ollama replaces the deprecated langchain_community class)
from langchain_ollama import OllamaEmbeddings
embeddings = OllamaEmbeddings(model="nomic-embed-text")


### Load documents

In [16]:
path = "data/plato_works.csv"

loader = CSVLoader(
    file_path=path, 
    source_column="author",
    content_columns=["sentence_str"],
    csv_args={
        "delimiter": ",",
        "quotechar": '"',
    },
    metadata_columns=["school", "title","original_publication_date"],
    )

docs = loader.load()

for record in docs[:2]:
    print(record)

page_content='sentence_str: What's new, Socrates, to make you leave your usual haunts in the Lyceum and spend your time here by the king archon's court?' metadata={'source': 'Plato', 'row': 0, 'school': 'plato', 'title': 'Plato - Complete Works', 'original_publication_date': '-350'}
page_content='sentence_str: Surely you are not prosecuting anyone before the king archon as I am?' metadata={'source': 'Plato', 'row': 1, 'school': 'plato', 'title': 'Plato - Complete Works', 'original_publication_date': '-350'}


### Split and store documents

In [17]:
# Create a RecursiveCharacterTextSplitter object with specified chunk size and overlap
# chunk_size=1000: The maximum size of each chunk (in characters) to split the document into.
# chunk_overlap=200: The number of characters that should overlap between consecutive chunks.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

# Split the documents into smaller chunks using the splitter
split_documents = text_splitter.split_documents(docs)

print(len(split_documents))

38366
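
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window sketch. It is not the algorithm `RecursiveCharacterTextSplitter` uses (the real splitter prefers to break on separators such as paragraphs and spaces), and the helper name `chunk_text` is hypothetical. It only illustrates how consecutive chunks share `chunk_overlap` characters.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

A text shorter than `chunk_size` comes back as a single chunk. That is consistent with the chunk count above matching the number of rows, since each row holds one sentence.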


In [23]:
print(split_documents[20].page_content)

sentence_str: Tell me, what does he say you do to corrupt the young?


In [19]:
from chromadb.config import Settings
import os

In [24]:
persist_directory = "./chroma_db"

# Create a persistent client
client = chromadb.PersistentClient(path=persist_directory)
# Reuse the collection if it already exists so the cell can be re-run
collection = client.get_or_create_collection("rag-chroma")

texts = [d.page_content for d in split_documents]
collection.add(
        ids=[str(i) for i in range(len(texts))],  # Chroma requires one unique id per document
        documents=texts,
        metadatas=[d.metadata for d in split_documents],
        embeddings=embeddings.embed_documents(texts))


time=2024-12-04T05:58:16.029Z level=INFO source=server.go:105 msg="system memory" total="7.7 GiB" free="3.3 GiB" free_swap="0 B"
time=2024-12-04T05:58:16.030Z level=INFO source=memory.go:343 msg="offload to cpu" layers.requested=-1 layers.model=13 layers.offload=0 layers.split="" memory.available="[3.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="352.9 MiB" memory.required.partial="0 B" memory.required.kv="24.0 MiB" memory.required.allocations="[352.9 MiB]" memory.weights.total="240.1 MiB" memory.weights.repeating="195.4 MiB" memory.weights.nonrepeating="44.7 MiB" memory.graph.full="48.0 MiB" memory.graph.partial="48.0 MiB"
time=2024-12-04T05:58:16.030Z level=INFO source=server.go:380 msg="starting llama server" cmd="/tmp/ollama4084615960/runners/cpu_avx2/ollama_llama_server --model /home/codespace/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 --ctx-size 8192 --batch-size 512 --threads 1 --no-mmap --parallel 1 --port 34085"
time=2

[GIN] 2024/12/04 - 05:58:18 | 200 |   2.24692777s |             ::1 | POST     "/api/embeddings"
[GIN] 2024/12/04 - 05:58:18 | 200 |  256.634691ms |             ::1 | POST     "/api/embeddings"
[GIN] 2024/12/04 - 05:58:18 | 200 |  304.718241ms |             ::1 | POST     "/api/embeddings"
[GIN] 2024/12/04 - 05:58:19 | 200 |  182.706437ms |             ::1 | POST     "/api/embeddings"
[GIN] 2024/12/04 - 05:58:19 | 200 |  296.210684ms |             ::1 | POST     "/api/embeddings"
[GIN] 2024/12/04 - 05:58:19 | 200 |  162.314718ms |             ::1 | POST     "/api/embeddings"
[GIN] 2024/12/04 - 05:58:19 | 200 |  231.643672ms |             ::1 | POST     "/api/embeddings"
[GIN] 2024/12/04 - 05:58:19 | 200 |  165.397875ms |             ::1 | POST     "/api/embeddings"
[GIN] 2024/12/04 - 05:58:20 | 200 |  198.628637ms |             ::1 | POST     "/api/embeddings"
[GIN] 2024/12/04 - 05:58:20 | 200 |   563.76783ms |             ::1 | POST     "/api/embeddings"
[GIN] 2024/12/04 - 05:58:20 | 
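
`Collection.add()` needs one unique id per document, and embedding all 38366 chunks in a single call keeps every vector in memory at once. One possible approach, sketched below, is to add the documents in batches with deterministic ids. The helper name `iter_batches`, the `row-<n>` id format, and the batch size of 1000 are illustrative assumptions, not values from Chroma or from this notebook.

```python
from typing import Iterator, Sequence


def iter_batches(items: Sequence, batch_size: int) -> Iterator[tuple[list[str], Sequence]]:
    """Yield (ids, batch) pairs, where each id encodes the item's position."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        ids = [f"row-{i}" for i in range(start, start + len(batch))]
        yield ids, batch


# Intended use with the objects defined in this notebook (not executed here):
# for ids, batch in iter_batches(split_documents, 1000):
#     texts = [d.page_content for d in batch]
#     collection.add(
#         ids=ids,
#         documents=texts,
#         metadatas=[d.metadata for d in batch],
#         embeddings=embeddings.embed_documents(texts),
#     )
```

Position-based ids stay the same across runs, so a re-run refers to the same entries instead of generating new random ones.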

In [None]:
# Wrap the existing Chroma collection in a LangChain vector store.
# This needs the `client` from the previous cell; re-run that cell after a kernel restart.
vectorstore = Chroma(client=client, collection_name="rag-chroma", embedding_function=embeddings)

In [None]:
# # Create a vector store using the Chroma library from the split documents
# # Chroma is a vector database that stores document embeddings.
# # The `from_documents` method takes the list of split documents and generates embeddings for them.
# vectorstore = Chroma.from_documents(
#     documents=split_documents,  # The list of split documents
#     collection_name="rag-chroma",  # Name of the collection to store in the vector store
#     embedding=embeddings,  # The embedding model used to convert the documents into vector representations
#     persist_directory="./chroma_langchain_db"
# )

In [None]:
# Create a retriever from the vector store
# The retriever will allow you to query the vector store and retrieve relevant documents based on vector similarity.
retriever = vectorstore.as_retriever()
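
Retrieval "based on vector similarity" can be pictured with a small pure-Python sketch: score every stored vector against the query vector and keep the best `k`. Cosine similarity and `k=4` are illustrative choices here. The metric Chroma actually applies depends on how the collection is configured, and the function names below are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int = 4) -> list[int]:
    """Return the indices of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]


docs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
print(top_k([1.0, 0.0], docs, k=2))  # [0, 2]
```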

### Prompt Construction

In [None]:
template = """
You are an assistant specialized in explaining Plato's philosophical concepts and ideas.
Use the provided context retrieved from a database containing Plato's works to answer the question.
Explain the concepts in a way that high school and college students can easily understand.
Keep your answer concise, using clear language, examples, or analogies when helpful. Aim for three sentences maximum.
If the context doesn't provide enough information, say you don't know instead of speculating.
Question: {question}
Context: {context}
Answer:
"""
prompt = ChatPromptTemplate.from_template(template)

### Chain Construction

In [None]:
rag_chain = (
    # Define inputs: context from retriever, question passed through unchanged.
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt  # Format inputs with the prompt
    | llm  # Generate response using the LLM
    | StrOutputParser()  # Parse output as a string
)
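
The retriever returns a list of document objects, so the `{context}` slot receives their string representation, metadata included. A common refinement, sketched here as an optional variation, is to join only the text of each document before it reaches the prompt. `format_docs` is a hypothetical helper, and `SimpleNamespace` stands in for LangChain's document class so the example runs on its own.

```python
from types import SimpleNamespace


def format_docs(docs) -> str:
    """Join the text of retrieved documents, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


# Stand-in objects exposing the same `page_content` attribute as retrieved documents.
sample = [
    SimpleNamespace(page_content="sentence_str: Tell me, what does he say you do to corrupt the young?"),
    SimpleNamespace(page_content="sentence_str: Surely you are not prosecuting anyone before the king archon as I am?"),
]
print(format_docs(sample))

# In the chain, the helper would be piped after the retriever:
# {"context": retriever | format_docs, "question": RunnablePassthrough()}
```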

Call `rag_chain.invoke("question")` to get an answer from the RAG chain.

In [None]:
result = rag_chain.invoke("What is the theory of forms according to Plato?")
print(result)

In [None]:
result = rag_chain.invoke("What did Plato say about the soul?")
print(result)