## 0.0 LangChain - Self-querying
- I followed the LangChain course "Chat with Your Data" on DeepLearning.AI
- https://learn.deeplearning.ai/langchain-chat-with-your-data/lesson/4/vectorstores-and-embedding

### 0.1 Reference
- Another reference is the LangChain tutorial on self-querying:
- https://python.langchain.com/docs/modules/data_connection/retrievers/self_query/

### 1.0 Dataframe
- Import a .csv file that stores 2 columns
    - ```name```: names of the games in our Kaggle dataset
    - ```content```: text description scraped from Wikipedia
- The Wikipedia scraper is also in our GitHub repo

In [4]:
import pandas as pd

In [5]:
# read csv as Dataframe
df = pd.read_csv('../data/name_content_plot.csv')
df.dtypes
print(df.head(1)['content'])
df.head()

0    140 (one hundred [and] forty) is the natural n...
Name: content, dtype: object


| Unnamed: 0 | name | content |
|---|---|---|
| 0 | 140 | 140 (one hundred [and] forty) is the natural n... |
| 1 | 60 Seconds! | 60 Seconds! is an action-adventure video game ... |
| 2 | 7 Days to Die | 7 Days to Die is a survival horror video game ... |
| 3 | 911 Operator | A dispatcher is a communications worker who re... |
| 4 | A Hat in Time | A Hat in Time is a platform game developed by ... |


### 1.1 Data Loader

Only one column can be selected for embedding:
- ```page_content_column```
- all remaining columns of the dataset are stored as ```metadata```
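
To make the content/metadata split concrete, here is a minimal sketch in plain Python. It is not LangChain's implementation: `SimpleDocument` and `rows_to_documents` are hypothetical names used only for illustration, and the row is taken from the table above.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def rows_to_documents(rows, page_content_column):
    """Put one column into page_content and every other column into metadata."""
    documents = []
    for row in rows:
        metadata = {key: value for key, value in row.items()
                    if key != page_content_column}
        documents.append(SimpleDocument(row[page_content_column], metadata))
    return documents


rows = [
    {"name": "A Hat in Time",
     "content": "A Hat in Time is a platform game developed by ..."},
]
example_docs = rows_to_documents(rows, page_content_column="content")
print(example_docs[0].metadata)  # {'name': 'A Hat in Time'}
```

Only ```page_content``` is embedded later on; the metadata travels along with each document so it can be printed or filtered on.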

In [6]:
from langchain.document_loaders import DataFrameLoader

In [7]:
# LangChain offers several data loaders, including a CSV loader,
# but the DataFrame loader worked for our file
loader = DataFrameLoader(df, page_content_column="content")
pages = loader.load()

In [56]:
# each page holds one game description
# the metadata holds the game name
page = pages[22]
print('Content:', page.page_content)
print('Game Name:', page.metadata['name'])

Content: Albion Online (AO) is a free medieval fantasy MMORPG developed by Sandbox Interactive, a studio based in Berlin, Germany.Set in a medieval world, Albion Online is a medieval fantasy game based on the Arthurian legends, with militaristic strategy aspects to it. Albion Online has been translated into 11 languages and has over 5 million registered users.


== Gameplay ==
Albion Online's gameplay centres itself around a classless system, in which the equipment a player chooses to wear defines their abilities and the way they can play. Players can go out and do activities in Albion's world in order to gain "Fame," similar to "experience" in other MMORPGs. Through this Fame players can get access to other weapon and armour types, with stronger equipment requiring more Fame to use. Stronger gear can be used as players progress throughout the game.
The game has a large open-world map that players can travel through. Different PVP zones offer different levels of risk and reward, includ

### 2.0 Creating Document with Text Splitter

In [9]:
from langchain.text_splitter import TokenTextSplitter
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

#### 2.1 TokenTextSplitter
- splits on tokens

In [12]:
# Split into chunks of 10 tokens each
text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)
docs = text_splitter.split_documents(pages)

In [13]:
# Number of original pages
print(len(pages))
# Number of documents after splitting into 10-token chunks
print(len(docs))

721
200761
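
The idea of fixed-size token chunks can be sketched in plain Python as follows. This is an illustration only: whitespace-separated words stand in for real tokens (an assumption, since ```TokenTextSplitter``` works on model tokens), and `chunk_tokens` is a hypothetical helper, not a LangChain function.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap=0):
    """Cut a token list into windows of chunk_size, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


# Assumption: words are used as "tokens" to keep the sketch dependency-free
words = ("Albion Online is a free medieval fantasy MMORPG "
         "developed by Sandbox Interactive").split()
word_chunks = chunk_tokens(words, chunk_size=5, chunk_overlap=0)
for chunk in word_chunks:
    print(" ".join(chunk))
```

With very small chunks such as 10 tokens, each page produces many documents, which is why 721 pages grew to 200761 chunks above.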


#### 2.2 RecursiveCharacterTextSplitter
- splits on a prioritized list of separators (punctuation, line breaks, spaces)

In [14]:
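# Recursive splitter: tries each separator in order until chunks fit chunk_size
# Note: depending on the LangChain version, the lookbehind pattern below is only
# treated as a regex when is_separator_regex=True; otherwise it is matched literally
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=[",", r"(?<=\. )", "\n\n", "\n", " ", ""]
)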
r_docs = r_splitter.split_documents(pages)

In [15]:
# Number of original pages
print(len(pages))
# Number of documents after splitting them into sentence-sized chunks
print(len(r_docs))


# a doc example
print(r_docs[2])
print(r_docs[2].metadata)

721
93091
page_content=', which makes it a square pyramidal number.' metadata={'name': '140'}
{'name': '140'}
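
The recursive strategy can be sketched roughly as below. This is a simplified illustration, not LangChain's code: `recursive_split` is a hypothetical function, separators are matched literally and dropped, and there is no overlap handling.

```python
def recursive_split(text, separators, chunk_size):
    """Split on the first separator, merge pieces greedily, recurse on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, rest = (separators[0], separators[1:]) if separators else ("", [])
    if sep == "":
        # Last resort: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging while the chunk still fits
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with the next separator
            chunks.extend(recursive_split(piece, rest, chunk_size))
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph here.\n\nSecond one is a bit longer than the limit."
sample_chunks = recursive_split(sample, ["\n\n", " ", ""], chunk_size=30)
print(sample_chunks)
# ['First paragraph here.', 'Second one is a bit longer', 'than the limit.']
```

The order of the separators matters: because "," comes first in the cell above, chunks such as ', which makes it a square pyramidal number.' begin at a comma rather than at a sentence boundary.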


### 3.0 Embedding


#### 3.1 Choosing Embedding Function 
#### 3.1.1 Sentence Transformer Embedding Function

In [24]:
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
embedding_st = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

#### 3.1.2 OpenAI Embedding Function (API key required)

In [31]:
import os
import openai
os.environ['OPENAI_API_KEY'] = 'your_api_key'  # placeholder: replace with your own key
from langchain.embeddings.openai import OpenAIEmbeddings
embedding_oa = OpenAIEmbeddings()

## 4.0 Creating Chromadb Database

In [None]:
%%cmd
REM For Windows users: install the necessary packages
pip install lark chromadb

In [32]:
from langchain.vectorstores import Chroma

In [None]:
%%cmd
REM For Windows users: remove any existing database directory
rmdir /s /q ..\data\docs\chroma

In [12]:
persist_directory = '../data/docs/chroma/'

In [13]:
# embedding all documents may take more than 1 hour
db = Chroma.from_documents(
    documents=r_docs,
    embedding=embedding_oa,
    persist_directory=persist_directory
)

### 4.1 Loading an Embedded Chromadb
- use the same embedding function that was used to build the database
- otherwise the query may fail with ```Embedding dimension does not match collection dimensionality```
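
The reason for this error can be sketched with a toy store that remembers the dimensionality of the first vector it receives. `TinyVectorStore` is a hypothetical class written for illustration, not part of Chroma; the sizes 1536 and 384 are the ones reported in the error message later in this notebook.

```python
class TinyVectorStore:
    """Hypothetical in-memory store that enforces one embedding dimensionality."""

    def __init__(self):
        self.dimension = None
        self.vectors = []

    def add(self, vector):
        if self.dimension is None:
            self.dimension = len(vector)  # fixed by the first stored vector
        self._check(vector)
        self.vectors.append(list(vector))

    def query(self, vector):
        self._check(vector)
        return len(self.vectors)  # a real store would rank the vectors here

    def _check(self, vector):
        if len(vector) != self.dimension:
            raise ValueError(
                f"Embedding dimension {len(vector)} does not match "
                f"collection dimensionality {self.dimension}"
            )


store = TinyVectorStore()
store.add([0.0] * 1536)          # database built with 1536-dimensional embeddings
try:
    store.query([0.0] * 384)     # query embedded with a 384-dimensional model
except ValueError as err:
    message = str(err)
    print(message)
```

Vectors of different lengths cannot be compared, so the query embedding has to come from the same model as the stored embeddings.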

In [47]:
# Once a local database has been created, it can be loaded from disk later
db_load = Chroma(
    persist_directory="../data/docs/chroma_plot", 
    embedding_function=embedding_oa
)

### 4.2 Query with Similarity Search

In [48]:
# query = "What game is similar to Counter-Strike:global offensive?"
query = "Which 2 games is similar to CSGO?"
# query = "What game has animals?"
# query = "What is a game similar to CSGO, besides CSGO"
# query = "What is a game similar to CSGO?"

The user can adjust ```k``` manually to receive more or fewer docs.

In [49]:
docs = db_load.similarity_search(query, k=3)
print(len(docs))
for doc in docs:
    print(doc.metadata)

3
{'name': 'Counter-Strike Nexon: Studio'}
{'name': 'Counter-Strike'}
{'name': 'DayZ'}
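
Conceptually, similarity search ranks stored vectors by their closeness to the query vector and keeps the top ```k```. The sketch below assumes cosine similarity and uses made-up 2-dimensional vectors with hypothetical names; the distance metric and index structure of the real vector store may differ.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def top_k(query_vec, doc_vecs, k):
    """Return the names of the k documents most similar to the query."""
    ranked = sorted(doc_vecs.items(),
                    key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]


# Toy embeddings (assumptions for illustration only)
toy_vectors = {
    "shooter_a": (0.9, 0.1),
    "shooter_b": (0.8, 0.3),
    "farming_sim": (0.1, 0.9),
}
toy_query = (1.0, 0.2)
nearest = top_k(toy_query, toy_vectors, k=2)
print(nearest)  # ['shooter_a', 'shooter_b']
```

Because ranking is purely by similarity, near-duplicates of the same game or series tend to fill the top results.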


### 4.3 Query with Max Marginal Relevance Search
Two hyperparameters:
- ```fetch_k=10``` first fetches the 10 most similar docs
- MMR then selects ```k``` docs from those candidates
- choosing the ones that are most diverse while still relevant

An improvement over plain similarity search:
* it helps avoid repeated games or several games from the same series

In [29]:
# Example of a mismatch: the query is embedded with a different function
# than the one used to build the database
docs = db_load.max_marginal_relevance_search(query, k=3, fetch_k=10)
print(docs[0].page_content)
print(docs[0].metadata)
print(docs)

InvalidDimensionException: Embedding dimension 384 does not match collection dimensionality 1536

In [51]:
# Correct example
docs = db_load.max_marginal_relevance_search(query, k=3, fetch_k=10)
print(len(docs))
print(docs[0].page_content)
print(docs[0].metadata)

3
CS:GO.
{'name': 'Counter-Strike Nexon: Studio'}
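
The selection step behind MMR can be sketched as follows. This is a minimal illustration of the general algorithm with cosine similarity, not LangChain's implementation; `mmr_select`, the weight `lambda_mult=0.5`, and the toy vectors are assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def mmr_select(query_vec, doc_vecs, k, fetch_k, lambda_mult=0.5):
    """Pick k indices from the fetch_k most similar docs, trading relevance for diversity."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    candidates = ranked[:fetch_k]
    selected = []
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Penalize similarity to anything that was already selected
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                             default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy data: documents 0 and 1 are near-duplicates, document 2 is different
toy_docs = [(1.0, 0.1), (1.0, 0.12), (0.7, -0.7)]
toy_q = (1.0, 0.0)
mmr_result = mmr_select(toy_q, toy_docs, k=2, fetch_k=3)
print(mmr_result)  # [0, 2]
```

Plain similarity search would return documents 0 and 1 here; MMR skips the near-duplicate and returns document 2 instead.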


## 5.0 Self Querying

In [53]:
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever

### 5.1 Selecting LLM 
- LangChain provides many LLM integrations
- such as https://python.langchain.com/docs/integrations/llms/openllm
- in this case, we use ChatOpenAI

In [59]:
from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI(temperature=0)

In [60]:
# Many other LLM wrappers are also available
from langchain.llms import Cohere, HuggingFaceHub, OpenAI

### 5.2 Metadata Field Info
```document_content_description```
- a short description of what the documents contain, which gives the LLM context for building the query

```metadata_field_info```
- describes the metadata fields in our database
- the description can even list the allowed values of a field, for example:
```python
    AttributeInfo(
        name="genre",
        description="The genre of the video game. One of ['science fiction', 'comedy', 'drama', 'thriller', 'romance', 'action', 'animated']",
        type="string",
    ),
```

In [75]:
metadata_field_info = [
    AttributeInfo(
        name="name",
        description="The name of this video game",
        type="string",
    ),
]
document_content_description = "Brief summary of a video game"

### 5.3 Setting Up Retriever

#### 5.3.1 Using db itself as retriever

In [70]:
retriever = db_load.as_retriever()

In [73]:
docs = retriever.invoke(query)
print(len(docs))
print(docs[0].metadata['name'])

4
Counter-Strike Nexon: Studio


#### 5.3.2 Implementing the LLM
Simply reuse the components defined above
- ```enable_limit=True``` lets the LLM infer the value of ```k``` from the question itself

In [74]:
retriever = SelfQueryRetriever.from_llm(
    llm,
    db_load,
    document_content_description,
    metadata_field_info,
    enable_limit=True, 
)

### 5.4 Retrieving Documents

In [79]:
# The retriever detects from the question that 2 games are requested
docs = retriever.invoke(query)
print(len(docs))
for doc in docs:
    print(doc.metadata['name'])

2
Counter-Strike Nexon: Studio
Counter-Strike
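
What ```enable_limit=True``` achieves can be pictured with a small sketch. Everything here is hypothetical: `ParsedQuery` and `run_parsed_query` are illustrative names, the parsed query is written by hand in place of real LLM output, and the retriever's actual internals differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParsedQuery:
    """Hypothetical structured form of a natural-language question."""
    search_text: str
    name_filter: Optional[str] = None   # optional metadata filter on 'name'
    limit: Optional[int] = None         # number of docs requested in the question


def run_parsed_query(parsed, ranked_docs, default_k=4):
    """Apply the metadata filter, then keep 'limit' docs (or default_k if no limit)."""
    docs = ranked_docs
    if parsed.name_filter is not None:
        docs = [doc for doc in docs if doc["name"] == parsed.name_filter]
    k = parsed.limit if parsed.limit is not None else default_k
    return docs[:k]


# Docs already ranked by similarity (names taken from the outputs above)
ranked = [
    {"name": "Counter-Strike Nexon: Studio"},
    {"name": "Counter-Strike"},
    {"name": "DayZ"},
]
# Hand-written stand-in for what an LLM might extract from
# "Which 2 games is similar to CSGO?"
parsed = ParsedQuery(search_text="games similar to CSGO", limit=2)
limited = run_parsed_query(parsed, ranked)
print([doc["name"] for doc in limited])
```

Without a limit in the question, the sketch falls back to a fixed number of documents, much like the plain retriever returned 4 earlier.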


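#### 5.4.1 Using Get Relevant Documents
```get_relevant_documents``` is another way to run the same query
- it returned the same results as ```invoke```
- this is expected if both methods run the same retrieval logic; recent LangChain versions deprecate ```get_relevant_documents``` in favor of ```invoke```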

In [80]:
docs = retriever.get_relevant_documents(query)
print(len(docs))
for doc in docs:
    print(doc.metadata['name'])

2
Counter-Strike Nexon: Studio
Counter-Strike
