## Introduction
- I followed the LangChain course on DeepLearning.AI while working on my own project.
- Course lesson: https://learn.deeplearning.ai/langchain-chat-with-your-data/lesson/4/vectorstores-and-embedding

### Data visualization component
- I deployed my visualization tool for game data on Hugging Face.
- The LangChain component is not deployed yet.
- https://huggingface.co/spaces/po5302006/ForcaSteam

In [7]:
import pandas as pd

In [8]:
df = pd.read_csv('../data/name_content_2.csv')
df.dtypes
print(df.head(1)['content'])
df.head()

0    Vanguard Princess (ヴァンガードプリンセス, Vangaado Purin...
Name: content, dtype: object


| Unnamed: 0 | name | content |
|---|---|---|
| 0 | Vanguard Princess | Vanguard Princess (ヴァンガードプリンセス, Vangaado Purin... |
| 1 | Deadfall Adventures | Deadfall Adventures is an action-adventure vid... |
| 2 | Reigns: Game of Thrones | Reigns: Game of Thrones is a 2018 strategy gam... |
| 3 | Far Cry® 5 | Far Cry 5 is a 2018 first-person shooter, deve... |
| 4 | Forza Horizon 4 | Forza Horizon 4 is a 2018 racing video game de... |


In [10]:
from langchain.document_loaders import DataFrameLoader
loader = DataFrameLoader(df, page_content_column="content")
pages = loader.load()
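
As a rough illustration of what the loader does here, the sketch below turns each row into one document: the chosen column becomes the text and the remaining columns become metadata. `SimpleDocument`, `rows_to_documents`, and the toy rows are hypothetical stand-ins, not LangChain's own classes or my real data.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def rows_to_documents(rows, page_content_column):
    """Turn each row (a dict) into one document.

    The chosen column becomes the text; every other column becomes metadata.
    """
    documents = []
    for row in rows:
        metadata = {k: v for k, v in row.items() if k != page_content_column}
        documents.append(SimpleDocument(str(row[page_content_column]), metadata))
    return documents


# Toy rows (made up for illustration), shaped like the name/content columns above.
rows = [
    {"name": "Example Game A", "content": "A racing game."},
    {"name": "Example Game B", "content": "An action-adventure game."},
]
sample_pages = rows_to_documents(rows, page_content_column="content")
print(sample_pages[0].metadata)
```

This matches the shape of the printed page in the next cell: the description text plus a `name` entry in the metadata.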

In [11]:
# Each page is one game description
# The metadata holds the game name
page = pages[0]
print(page)

page_content='Vanguard Princess (ヴァンガードプリンセス, Vangaado Purinsesu), also known as Vanguard Princess: Senjin no Himegimi (ヴァンガードプリンセス 先陣の姫君) is a Japanese dōjin 2D fighting game for Windows and Linux, developed by a single programmer and illustrator Tomoaki Sugeno nicknamed Suge9. The game was created using Fighter Maker 2nd was released for PC on June 26, 2009, and April 10, 2013, for OnLive.\n\n\n== Gameplay ==\nVanguard Princess is a 2D fighting game featuring an all-female cast. The player selects both a fighter and a support character, who follows the fighter and can assist her.\n\n\n== Plot ==\nA mysterious woman with supernatural powers (the game\'s boss) is captured by the government. Her powers are accidentally unleashed, bestowing various young women with magical powers.\n\n\n== Development and release ==\nVanguard Princess was fully developed by Tomoaki Sugeno (or Suge9), an ex-Capcom sprite designer who worked in games like Resident Evil 3: Nemesis and made character sprites 

### Token-based text splitting

In [12]:
from langchain.text_splitter import TokenTextSplitter
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

In [13]:
# Split into chunks of 10 tokens each
text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)
docs = text_splitter.split_documents(pages)
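
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size token windows. It assumes whitespace-separated words as a stand-in for real model tokens (the actual splitter uses a tokenizer), and `split_by_tokens` is a hypothetical helper, not a LangChain function.

```python
def split_by_tokens(text, chunk_size, chunk_overlap=0):
    """Cut text into windows of chunk_size tokens, sharing chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    tokens = text.split()  # assumption: words stand in for tokenizer tokens
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


print(split_by_tokens("a b c d e f g", chunk_size=3))
print(split_by_tokens("a b c d e f g", chunk_size=3, chunk_overlap=1))
```

With an overlap of 0, as in the cell above, consecutive chunks share no tokens, so a sentence can be cut at an arbitrary point.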

In [14]:
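# Recursive splitter: chunks of at most 150 characters
# Separators are tried in order, so this configuration splits on commas first
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    # The raw string avoids an invalid "\." escape. Newer LangChain versions treat
    # separators as literal text unless is_separator_regex=True is passed.
    separators=[",", r"(?<=\. )", "\n\n", "\n", " ", ""]
#     separators=["\n\n", "\n", r"(?<=\. )", " ", ""]
)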
r_docs = r_splitter.split_documents(pages)
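
The idea behind the recursive splitter can be sketched in plain Python. This is a simplified version under stated assumptions: literal separators only (no regex), no whitespace stripping, no overlap, and the separator kept at the start of the following piece. `recursive_split` and `split_keep_separator` are hypothetical names, not the library's implementation.

```python
def split_keep_separator(text, sep):
    """Split on a literal separator, keeping it at the start of the next piece."""
    parts = text.split(sep)
    pieces = [parts[0]] + [sep + p for p in parts[1:]]
    return [p for p in pieces if p]


def recursive_split(text, separators, chunk_size):
    """Split text into chunks of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    # Use the first separator that occurs in the text; "" means "cut anywhere".
    sep, remaining = "", []
    for i, candidate in enumerate(separators):
        if candidate == "" or candidate in text:
            sep, remaining = candidate, separators[i + 1:]
            break
    if sep == "":
        return [text[j:j + chunk_size] for j in range(0, len(text), chunk_size)]
    chunks, current = [], ""
    for piece in split_keep_separator(text, sep):
        if len(piece) > chunk_size:
            # Too long even alone: flush and retry with the finer separators.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, remaining, chunk_size))
        elif len(current) + len(piece) <= chunk_size:
            current += piece  # merge small pieces up to the size limit
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "alpha, beta, gamma delta epsilon"
sample_chunks = recursive_split(sample, [",", " ", ""], chunk_size=12)
print(sample_chunks)
```

Putting `","` first in the separator list is why chunks in the output below can begin with a comma.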

In [17]:
print(len(pages))
print(len(r_docs))
print(r_docs[2])
print(r_docs[2].metadata)

4932
499955
page_content=', developed by a single programmer and illustrator Tomoaki Sugeno nicknamed Suge9. The game was created using Fighter Maker 2nd was released for PC on' metadata={'name': 'Vanguard Princess'}
{'name': 'Vanguard Princess'}


### Embedding

In [18]:
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
embedding = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
from langchain.vectorstores import Chroma

In [19]:
%%cmd
REM cmd.exe reads "-rf" and forward slashes as switches, so use /s /q and backslashes
rmdir /s /q ..\data\docs\chroma

In [21]:
persist_directory = '../data/docs/chroma/'
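
Clearing the old store can also be done from Python, which avoids shell differences between Windows and Unix. The sketch below is one possible approach; `reset_directory` is a hypothetical helper, and the demo runs on a temporary directory rather than the real store.

```python
import shutil
import tempfile
from pathlib import Path


def reset_directory(path):
    """Delete a directory tree if it exists; return True when it is gone."""
    target = Path(path)
    if target.is_dir():
        shutil.rmtree(target)
    return not target.exists()


# Demo on a throwaway directory instead of the real Chroma store.
demo_root = Path(tempfile.mkdtemp())
demo_store = demo_root / "chroma"
demo_store.mkdir()
(demo_store / "index.bin").write_text("placeholder")
removed = reset_directory(demo_store)
print(removed)
shutil.rmtree(demo_root, ignore_errors=True)
```

In the notebook this would be called as `reset_directory(persist_directory)` before rebuilding the database.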

In [22]:
# Embedding all chunks takes around 5 minutes
db = Chroma.from_documents(
    documents=r_docs,
    embedding=embedding,
    persist_directory=persist_directory
)

In [23]:
# query = "What game is similar to Counter-Strike:global offensive?"
query = "Which game is CSGO?"
# query = "What game has animals?"
# query = "What is a game similar to CSGO, besides CSGO"
# query = "What is a game similar to CSGO?"
docs = db.similarity_search(query, k=3)

In [None]:
# Reload the persisted vector store from disk
db_load = Chroma(
    persist_directory="../data/docs/chroma", 
    embedding_function=embedding
)

In [24]:
print(docs[0].page_content)
print(docs[0].metadata)

, macOS and Microsoft Windows. The game was first unveiled at GDC 2013 and was released in early access the following year on January 10, 2014
{'name': 'Turbo Dismount™'}
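
Conceptually, a similarity search embeds the query and returns the chunks whose vectors are closest to it. The sketch below ranks toy 2-D vectors by cosine similarity. That metric is an assumption for illustration: the distance actually used depends on how the Chroma collection is configured. `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k vectors most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]  # made-up embeddings
print(top_k([1.0, 0.0], toy_vectors, k=2))
```

Because the ranking only sees vector closeness, a short query such as "Which game is CSGO?" can match chunks that are unrelated to the intended game, as in the result above.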


In [16]:
# MMR: fetch the 10 closest chunks, then keep 3 that balance relevance and diversity
docs = db.max_marginal_relevance_search(query, k=3, fetch_k=10)
print(docs[0].page_content)
print(docs[0].metadata)
print(docs)

, 2017. The game features classic MMORPG gameplay and is set in a world inspired by the books of the Chinese fantasy author Jiang Nan. Revelation
{'name': 'Revelation Online'}
[Document(page_content=', 2017. The game features classic MMORPG gameplay and is set in a world inspired by the books of the Chinese fantasy author Jiang Nan. Revelation', metadata={'name': 'Revelation Online'}), Document(page_content='Metro video game series', metadata={'name': 'Metro Exodus'}), Document(page_content=', PlayStation 4 and Microsoft Windows based on the Sword Art Online light novel series. It is the third video game in the series and is the successor', metadata={'name': 'Sword Art Online: Lost Song'})]
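
Maximal marginal relevance (MMR) picks results one at a time, rewarding similarity to the query and penalizing similarity to results already chosen. The following is a minimal sketch on toy vectors, assuming cosine similarity and an equal weighting of 0.5. `mmr_select` is a hypothetical helper, not the library's code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, doc_vecs, k, lambda_mult=0.5):
    """Greedily pick k indices, trading off relevance against redundancy."""
    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine_similarity(query_vec, doc_vecs[i])
            # Redundancy: similarity to the closest already-selected vector.
            redundancy = max(
                (cosine_similarity(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Made-up embeddings: vectors 0 and 1 are near-duplicates, vector 2 is different.
toy_vectors = [[1.0, 0.1], [1.0, 0.12], [0.7, -0.7]]
print(mmr_select([1.0, 0.0], toy_vectors, k=2))
```

In this toy case a plain similarity ranking would return the two near-duplicates, while MMR keeps the best match and swaps the duplicate for the more distinct vector.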


In [17]:
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.chat_models import ChatOpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever

from langchain.llms import Cohere, HuggingFaceHub, OpenAI

In [18]:
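# Note: the documents loaded above only carry a 'name' metadata field;
# 'year' and 'rating' must be added to the metadata before these filters can match
metadata_field_info = [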
#     AttributeInfo(
#         name="genre",
#         description="The genre of the video game. One of ['science fiction', 'comedy', 'drama', 'thriller', 'romance', 'action', 'animated']",
#         type="string",
#     ),
    AttributeInfo(
        name="year",
        description="The year the video game was released",
        type="integer",
    ),
    AttributeInfo(
        name="rating", description="A 1-10 rating for the video game", 
        type="float"
    ),
]
document_content_description = "Brief summary of a video game"
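# Instantiate the LLM (the bare class is not usable); assumes COHERE_API_KEY is set in the environment
llm = Cohere()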
# llm = HuggingFaceHub
# llm = ChatOpenAI(temperature=0)

In [19]:
retriever = SelfQueryRetriever.from_llm(
    llm,
    db,
    document_content_description,
    metadata_field_info,
)

ImportError: Cannot import lark, please install it with 'pip install lark'.
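
One way to catch a missing optional dependency like this before building the retriever is to check for it up front. The sketch below uses only the standard library; `missing_packages` is a hypothetical helper.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# An empty list means lark is installed; otherwise run `pip install lark` first.
print(missing_packages(["lark"]))
```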

In [None]:
docs = retriever.get_relevant_documents(query)