<a href="https://colab.research.google.com/github/Dyllboy/MovieReviewLLM_Recomendations/blob/main/LLM_Movie_Recommender_ipydonb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [104]:
import os
from google.colab import userdata

os.environ['HUGGINGFACEHUB_API_TOKEN'] = userdata.get('HUGGINGFACEHUB_API_TOKEN')

Required Installs for Program

In [105]:
!pip install sentence_transformers
!pip install pinecone-client
!pip install pypdf
!pip install langchain



In [79]:
import pinecone
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.chains.qa_with_sources import load_qa_with_sources_chain
from langchain import HuggingFaceHub
from langchain import PromptTemplate
from langchain.document_loaders import UnstructuredPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Pinecone
from langchain.document_loaders import PyPDFDirectoryLoader
from google.colab import drive

Mount the Google Drive folder where the PDFs are stored. To load files uploaded directly to Colab instead, point the path at /content/.

In [106]:
drive.mount('/content/gdrive', force_remount=True)
root_dir = "/content/gdrive/My Drive/Movie-Review-Project-Folder"

pdf_folder_path = root_dir
print(os.listdir(pdf_folder_path))

Mounted at /content/gdrive
['Napoleon_Review.pdf', 'Napoleon_Review-2.pdf', 'Napoleon_Review-3.pdf', 'Napoleon_Review-4.pdf', 'Napoleon_Review-5.pdf']


Load all PDFs and split them into pages

In [107]:
loader = PyPDFDirectoryLoader(pdf_folder_path)
pages = loader.load_and_split()
print(pages)

[Document(page_content='It’s hard to be a fearless leader and a little cuckhold at the same time. It’s enough to give a man a  \nNapoleon complex.  \nThe first question Francophiles will ask is why are a British director of Oscar -winning epics (Ridley  \nScott, Gladiator) and an American character actor (Joaquin Phoenix, Joker) collaborating on a  \nbio/war film about a legendary, diminutive (5’6”), French leader? Possibly because the film’s name  \nalone sells itself and history buffs will be curious. Scott’s rep for stylistic overblown extravaganzas is  \nanother calling card. (There will be blood —and war.) And the prospect of watching quirky Phoenix  \nanalyze the leader and put his spin on the character is another draw.  \nNapoleon Bonaparte was born on the island of Corsica, on August 15 , 1769, a descendant of  \nItalian nobility. An outsider to the French monarchy, he aggressively championed the French  \nRevolution in 1789. Rising from the military ranks, he ruled France from

Split text using recursive character approach

In [108]:
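text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=20,
    # Try paragraph breaks first, then line breaks, then sentence ends, then words.
    # r'(?<=\. )' is a regex lookbehind that splits after ". "; recent LangChain
    # versions only treat separators as regexes when is_separator_regex=True.
    separators=['\n\n', '\n', r'(?<=\. )', ' ', ''],
    is_separator_regex=True,
)
docs = text_splitter.split_documents(pages)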
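
To make the recursive approach concrete, here is a simplified pure-Python sketch of the idea. It is not LangChain's implementation: it tries the coarsest separator first and only falls back to finer ones for pieces that are still longer than the chunk size. It drops the separators, ignores chunk_overlap, and does not merge small pieces back together. The function name `recursive_split` and the sample text are made up for illustration.

```python
def recursive_split(text, separators, chunk_size):
    """Split text so that every piece is at most chunk_size characters long."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Nothing coarser left to try: fall back to fixed-width cuts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    if sep not in text:
        # This separator does not occur, so move on to the next finer one.
        return recursive_split(text, finer, chunk_size)
    pieces = []
    for part in text.split(sep):
        # Pieces that are still too long are split again with finer separators.
        pieces.extend(recursive_split(part, finer, chunk_size))
    return pieces


sample = ("First paragraph about the film.\n\n"
          "Second paragraph. It is longer and has two sentences.")
chunks = recursive_split(sample, ["\n\n", ". ", " ", ""], chunk_size=40)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

The first paragraph fits within 40 characters and is kept whole, while the second is too long and gets split again at the sentence boundary.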

Import Embeddings from Hugging Face

In [83]:
embeddings = HuggingFaceEmbeddings()


Check the dimension of the embedding vectors. This value is needed when setting up the Pinecone vector store, because the index dimension must match it.

In [109]:
vectors=embeddings.embed_query("Test")
len(vectors)

768

Initialize Pinecone

In [110]:
# Key read from Colab secrets, not hard-coded; the secret name 'PINECONE_API_KEY' is an assumption.
pinecone.init(api_key=userdata.get('PINECONE_API_KEY'), environment="gcp-starter")

Embed the PDF documents into the vector store

In [None]:
index = Pinecone.from_documents(docs, embeddings, index_name="moviellm-vectorstore")

Load the existing index instead if the documents have already been embedded

In [111]:
index = Pinecone.from_existing_index("moviellm-vectorstore", embeddings)

The prompt template gives the LLM context for its answer. Telling it not to answer questions it cannot answer from that context helps prevent odd generalizations.

In [112]:
template ="""
Use {summaries} which are critic reviews of the film Napoleon as a reference to guide your answer.
If you do not know just say I don't know. Do not try and make up an answer.

Question: {question}"""

PROMPT = PromptTemplate(template=template, input_variables=["summaries", "question"])
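
To see what the model actually receives, the two placeholders can be filled in by hand. For a simple template like this one, plain `str.format` should produce the same text as the prompt template. The review snippet, source name, and question below are made-up illustration values, not output from the vector store.

```python
template = """
Use {summaries} which are critic reviews of the film Napoleon as a reference to guide your answer.
If you do not know just say I don't know. Do not try and make up an answer.

Question: {question}"""

# Hypothetical values standing in for retrieved context and a user question.
filled = template.format(
    summaries="Content: The battle sequences are breathtaking.\nSource: example.pdf",
    question="Will I enjoy the battle scenes?",
)
print(filled)
```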

Set up the LLM through the HuggingFaceHub API. The chain type is "stuff", meaning the retrieved context is simply stuffed into the prompt.

In [113]:
llm=HuggingFaceHub(repo_id="google/flan-t5-xxl", model_kwargs={"temperature":1})
chain = load_qa_with_sources_chain(llm, chain_type="stuff", prompt=PROMPT)
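
Roughly speaking, "stuffing" means that every retrieved chunk is formatted together with its source, and the results are joined into one string that fills the {summaries} placeholder. The sketch below uses plain dictionaries in place of LangChain Document objects. The helper name `stuff_documents`, the "Content / Source" layout, and the sample data are assumptions for illustration, not the library's exact internals.

```python
def stuff_documents(retrieved, separator="\n\n"):
    """Join every retrieved chunk, tagged with its source, into one context string."""
    return separator.join(
        f"Content: {doc['page_content']}\nSource: {doc['source']}"
        for doc in retrieved
    )


# Hypothetical retrieved chunks; real ones would come from the vector store.
retrieved = [
    {"page_content": "Phoenix plays Napoleon.", "source": "review-1.pdf"},
    {"page_content": "The battles are spectacular.", "source": "review-2.pdf"},
]
summaries = stuff_documents(retrieved)
print(summaries)
```

Because all chunks land in a single prompt, the combined length of the chunks and the template has to fit within the model's input limit.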



Function that queries the vector database with a similarity search. k is the number of context chunks returned.

In [114]:
def callVectorDB(query):
  db_results = index.similarity_search(query, k=4)
  return db_results
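
As an illustration of what a top-k similarity search does, the sketch below ranks a few toy vectors by cosine similarity to a query vector and keeps the best k. It assumes cosine similarity as the metric; the metric Pinecone actually uses depends on how the index was created. The three-dimensional vectors and chunk labels are made up, whereas the real embeddings have 768 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k=4):
    """Return the texts of the k stored entries most similar to query_vec."""
    ranked = sorted(
        stored,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy (text, vector) pairs standing in for embedded review chunks.
stored = [
    ("chunk about acting", [0.9, 0.1, 0.0]),
    ("chunk about battles", [0.1, 0.9, 0.1]),
    ("chunk about costumes", [0.0, 0.2, 0.9]),
]
results = top_k([0.2, 0.8, 0.0], stored, k=2)
print(results)
```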

Prompt 1 - The model is very good at finding simple facts in the context, such as who played which character.

In [115]:
query = "Who played Napoleon?"
docs= callVectorDB(query)
response = chain.run(input_documents=docs, question=query)
print(response)

Joaquin Phoenix


Prompt 2 - The model is capable of making generalizations and giving suggestions.

In [116]:
query = "I enjoy comedies. Would I enjoy Napoleon? Explain your reasoning."
docs= callVectorDB(query)
response = chain.run(input_documents=docs, question=query)
print(response)

Napoleon is a biopic and not a comedy. The answer: no.


Prompt 3 - The model looks at context about the battle scenes in Napoleon, then answers the question.

In [117]:
query = "I enjoy action and fight scenes. Will I enjoy Napoleon? Explain your reasoning."
docs= callVectorDB(query)
response = chain.run(input_documents=docs, question=query)
print(response)

The battle sequences are breathtaking – a marvel of craft and filmmaking that elevates Napoleon and saves it from being relegated to the discards pile of cinema history. So the answer is yes.


Prompt 4 - The model pulls keywords from the context in its answer.

In [118]:
query = "What did the critics have to say about Joaquin Phoenix's acting? Did they enjoy it? Explain your reasoning."
docs= callVectorDB(query)
response = chain.run(input_documents=docs, question=query)
print(response)

The critics said that Joaquin Phoenix's performance lacks dynamism.


Prompt 5 - The model often quotes directly from the context.

In [119]:
query = "What did the critics say about Ridley Scott's direction? Did they enjoy it? Explain your reasoning."
docs= callVectorDB(query)
response = chain.run(input_documents=docs, question=query)
print(response)

Scott is notorious for placing style over substance, story and characters. Again, his towering production elements shroud the basics.


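Open cell - Try your own query. Here, an unrelated question confirms that the model answers "I don't know" when the context does not contain the answer.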

In [120]:
query = "Who is the president of the United States"
docs= callVectorDB(query)
response = chain.run(input_documents=docs, question=query)
print(response)

I don't know


Overall findings:

Limited tokens when calling the LLM over the API lead to short responses and limit the overall power of the model.

Smaller contexts lead to worse answers and lessen the model's ability to make generalizations.

The template is necessary to tell the LLM what the contexts you are giving it contain.
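
The trade-off between context size and token limits can be estimated with some quick arithmetic. The sketch below uses the chunk size (400 characters) and k (4) from this notebook. The figure of about 4 characters per token is only a common rule of thumb for English text, not a property of this model's tokenizer, and the 200 characters allowed for the template and question is a rounded guess. Since chunk_size is a maximum, the result is closer to an upper bound than an exact count.

```python
def estimated_prompt_tokens(chunk_size, k, template_chars, chars_per_token=4):
    """Rough token estimate for a stuffed prompt (a heuristic, not a tokenizer)."""
    total_chars = chunk_size * k + template_chars
    return total_chars // chars_per_token


# chunk_size and k come from this notebook; template_chars is an assumed value.
tokens = estimated_prompt_tokens(chunk_size=400, k=4, template_chars=200)
print(tokens)
```

Raising chunk_size or k gives the model more to work with, but the estimate grows in proportion, so it is worth checking against the input limit of whichever model and API are used.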