![Example Image](https://static.amazon.jobs/teams/53/images/IMDb_Header_Page.jpg?1501027252)

## Problem Statement

This Jupyter notebook generates movie insights by combining vector embeddings with a large language model. It works through the following steps:

1. **Library Installation and Setup**: Install `chromadb` and `langchain-openai`, then import the libraries the notebook uses.
2. **Global Variables Configuration**: Define the global variables, including the API key and the names of the embedding and language models.
3. **Chroma DB Initialization**: Set up Chroma DB and create a collection to hold the movie data embeddings.
4. **Embedding and Language Model Integration**: Use OpenAI's embedding function to create vector representations of the movie data, and set up a language model (`gpt-4o-mini`) to generate insights.
5. **Query Processing**: Implement a question-answering system that embeds a user query, retrieves the most similar movie records from the collection, and passes them to the language model as context.

<hr>

The result is an interactive system for querying and analyzing movie data: a vector database retrieves the relevant records, and a language model turns them into concise answers.
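Before the real components are introduced, the retrieve-then-answer flow can be sketched with the standard library alone. This is a toy illustration, not the notebook's method: the bag-of-words `embed` function stands in for the OpenAI embedding model, the three-line corpus is hypothetical, and the final prompt string only hints at what the language model receives.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, documents, k=2):
    """Return the k documents most similar to the query, best match first."""
    query_vec = embed(query)
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(query_vec, embed(doc)),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical mini corpus, used only for this illustration
documents = [
    "an astronaut is stranded on mars",
    "a crime dynasty transfers control to a reluctant son",
    "a crew travels through space to save mankind",
]
question = "movies about space travel"
context = retrieve(question, documents, k=1)
prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: " + question
print(context)  # ['a crew travels through space to save mankind']
```

The rest of the notebook replaces each toy piece with a real one: OpenAI embeddings for `embed`, a Chroma collection for the ranking step, and a chat model for the answer.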

### Install Libraries

In [None]:
!pip install chromadb==0.5.3
!pip install langchain-openai


### Import Libraries

In [None]:
import chromadb
import json
import pandas as pd
import chromadb.utils.embedding_functions as embedding_functions

### Set Up the Global Variables

In [None]:
OPENAI_API_KEY = "<your-openai-api-key>"  # placeholder: supply your own key and never commit a real secret
CHROMA_COLLECTION_NAME = "movie_collection"
EMBEDDING_MODEL_NAME = "text-embedding-3-small"
LLM_MODEL_NAME = "gpt-4o-mini-2024-07-18"
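An API key written directly into a notebook travels with every copy of it. A minimal sketch of one alternative, assuming the key is available as an environment variable or can be typed in at runtime; `load_api_key` and the `DEMO_API_KEY` variable are hypothetical names used only for this illustration.

```python
import os
from getpass import getpass

def load_api_key(env_var="OPENAI_API_KEY", prompt_if_missing=False):
    """Return the key from the environment; optionally prompt for it interactively."""
    key = os.environ.get(env_var, "").strip()
    if not key and prompt_if_missing:
        # getpass hides the typed value so it does not end up in the cell output
        key = getpass(f"Enter {env_var}: ").strip()
    if not key:
        raise RuntimeError(f"{env_var} is not set")
    return key

# Demonstration with a throwaway variable (not a real key)
os.environ.setdefault("DEMO_API_KEY", "sk-demo-not-a-real-key")
print(load_api_key("DEMO_API_KEY"))
```

In the notebook this could be used as `OPENAI_API_KEY = load_api_key(prompt_if_missing=True)`.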

### Set Up Chroma DB/Collection

![Example Image](https://miro.medium.com/v2/resize:fit:1400/1*nu_Mvi654Al_DV0i3P31Nw.png)

In [None]:
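# In-memory client: the collection is lost when the runtime restarts
chroma_client = chromadb.Client()
# Embedding function that calls the OpenAI embeddings API
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=OPENAI_API_KEY,
    model_name=EMBEDDING_MODEL_NAME,
)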

In [None]:
# Create collection
#chroma_client.delete_collection(CHROMA_COLLECTION_NAME)
collection = chroma_client.get_or_create_collection(name=CHROMA_COLLECTION_NAME)

### Load Data

In [None]:
from google.colab import drive
drive.mount('/content/drive')
df = pd.read_csv("/content/drive/MyDrive/GEN AI Learning/imdb_top_1000.csv").drop("Poster_Link", axis=1)
#df = pd.read_csv("imdb_top_1000.csv").drop("Poster_Link", axis=1)
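# Build an ID per movie from the lower-cased title and the release year
df["id"] = df.Series_Title.str.lower() + "_" + df.Released_Year.astype("str")
# Cast these columns to strings before the records are serialized
string_cols = ["Released_Year", "IMDB_Rating", "Meta_score", "No_of_Votes"]
df[string_cols] = df[string_cols].astype(str)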

Mounted at /content/drive
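The `id` column is later used as the document ID in the collection, and vector stores such as Chroma generally expect IDs to be unique within an insert. The sketch below mirrors the title-plus-year scheme in plain Python and checks for collisions; `make_movie_id`, `find_duplicate_ids`, and the sample rows are hypothetical and only illustrate the check.

```python
from collections import Counter

def make_movie_id(title, year):
    """Mirror the notebook's ID scheme: lower-cased title, underscore, release year."""
    return f"{title.lower()}_{year}"

def find_duplicate_ids(ids):
    """Return the IDs that occur more than once, in first-seen order."""
    counts = Counter(ids)
    return [movie_id for movie_id in counts if counts[movie_id] > 1]

# Hypothetical rows, including one deliberate repeat
rows = [("The Godfather", "1972"), ("Heat", "1995"), ("The Godfather", "1972")]
ids = [make_movie_id(title, year) for title, year in rows]
duplicates = find_duplicate_ids(ids)
print(duplicates)  # ['the godfather_1972']
```

On the DataFrame itself, `find_duplicate_ids(df.id.tolist())` would run the same check before anything is sent to the collection.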


In [None]:
df.head()

Unnamed: 0,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross,id
0,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469,the shawshank redemption_1994
1,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411,the godfather_1972
2,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444,the dark knight_2008
3,The Godfather: Part II,1974,A,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000,the godfather: part ii_1974
4,12 Angry Men,1957,U,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845,4360000,12 angry men_1957


In [None]:
# Convert the DataFrame to a list of dictionaries, one per record
data_list = json.loads(df.to_json(orient="records"))

# Convert each dictionary to its string representation (Python dict syntax)
data_list = [str(elem) for elem in data_list]
print(len(data_list))

1000
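Note that `str()` applied to a dictionary produces Python's dict syntax with single quotes, which is not valid JSON. The short comparison below shows the difference against `json.dumps`; the sample record reuses three fields from the table above, and `is_valid_json` is a helper introduced only for this illustration.

```python
import json

record = {"Series_Title": "12 Angry Men", "Released_Year": "1957", "Genre": "Crime, Drama"}

as_repr = str(record)         # Python dict syntax: single-quoted keys and values
as_json = json.dumps(record)  # JSON: double-quoted, readable by any JSON parser

def is_valid_json(text):
    """Return True if `text` parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

print(is_valid_json(as_repr))         # False
print(is_valid_json(as_json))         # True
print(json.loads(as_json) == record)  # True
```

Either form can be embedded and shown to a language model; the distinction matters mainly if the stored documents need to be parsed back into records later.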


In [None]:
movie_ids = list(df.id.values)

In [None]:
# Add data to the collection
collection.add(
    embeddings=openai_ef(data_list),  # Generate embeddings for data_list using the openai_ef function
    documents=data_list,
    ids=movie_ids,
)
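The cell above embeds all 1,000 records in a single call. For larger datasets, a common precaution is to embed and insert in batches so that no single request grows too large. The sketch below is one way to do that under stated assumptions: `batched` and `add_in_batches` are hypothetical helpers, the batch size of 100 is arbitrary, and the embedding and insert functions are passed in as arguments (stubs are used here so the example runs without an API key).

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each holding at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def add_in_batches(documents, ids, embed_fn, add_fn, batch_size=100):
    """Embed and store documents batch by batch; return the number of batches sent."""
    n_batches = 0
    for doc_batch, id_batch in zip(batched(documents, batch_size), batched(ids, batch_size)):
        add_fn(embeddings=embed_fn(doc_batch), documents=doc_batch, ids=id_batch)
        n_batches += 1
    return n_batches

# Stubs standing in for openai_ef and collection.add
stored = []

def fake_embed(docs):
    return [[float(len(doc))] for doc in docs]

def fake_add(embeddings, documents, ids):
    stored.extend(zip(ids, documents, embeddings))

docs = [f"doc {i}" for i in range(250)]
doc_ids = [f"id_{i}" for i in range(250)]
print(add_in_batches(docs, doc_ids, fake_embed, fake_add))  # 3
```

With the notebook's objects, the call would be `add_in_batches(data_list, movie_ids, openai_ef, collection.add)`.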

In [None]:
def get_vector_store_documents(query):
    """
    Retrieve the documents in the vector store that best match a query.

    This function takes a query string, generates its embedding, and queries
    the vector store for the 5 nearest documents. Chroma reports distances,
    where a smaller distance means a closer match, so the results are sorted
    by ascending distance to put the most similar document first.

    Parameters:
    query (str): The query string to search for in the vector store.

    Returns:
    list: The top 5 documents, ordered from most to least similar to the query.
    """
    results = collection.query(
        query_embeddings=openai_ef([query]),
        n_results=5
    )
    # Sort on the distance only, so ties never fall back to comparing document text
    sorted_list = sorted(
        zip(results['distances'][0], results['documents'][0]),
        key=lambda pair: pair[0]
    )
    sorted_documents = [document for _, document in sorted_list]
    return sorted_documents
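The values under `results['distances']` are distances rather than similarity scores, so smaller means closer. Assuming the collection uses Chroma's default squared-L2 metric and the embeddings are unit length (OpenAI describes its embeddings as normalized), the two views are linked by the identity squared L2 = 2 − 2 × cosine similarity. The sketch below checks that identity on made-up vectors; none of the numbers come from the notebook.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_similarity_unit(a, b):
    """Cosine similarity for unit-length vectors, which reduces to the dot product."""
    return sum(x * y for x, y in zip(a, b))

# Made-up vectors for illustration
query = normalize([1.0, 2.0, 0.5])
candidates = {
    "close": normalize([1.0, 2.1, 0.4]),
    "far": normalize([-1.0, 0.2, 3.0]),
}

for name, vec in candidates.items():
    distance = squared_l2(query, vec)
    similarity = cosine_similarity_unit(query, vec)
    # For unit vectors: squared L2 distance = 2 - 2 * cosine similarity
    assert math.isclose(distance, 2 - 2 * similarity, abs_tol=1e-12)

# Ascending distance gives the same order as descending similarity
ranked = sorted(candidates, key=lambda name: squared_l2(query, candidates[name]))
print(ranked)  # ['close', 'far']
```

This is why the most relevant document is the one with the smallest distance, not the largest.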

In [None]:
get_vector_store_documents("movies talking about space")

["{'Series_Title': 'The Martian', 'Released_Year': '2015', 'Certificate': 'UA', 'Runtime': '144 min', 'Genre': 'Adventure, Drama, Sci-Fi', 'IMDB_Rating': '8.0', 'Overview': 'An astronaut becomes stranded on Mars after his team assume him dead, and must rely on his ingenuity to find a way to signal to Earth that he is alive.', 'Meta_score': '80.0', 'Director': 'Ridley Scott', 'Star1': 'Matt Damon', 'Star2': 'Jessica Chastain', 'Star3': 'Kristen Wiig', 'Star4': 'Kate Mara', 'No_of_Votes': '760094', 'Gross': '228,433,663', 'id': 'the martian_2015'}",
 "{'Series_Title': '2001: A Space Odyssey', 'Released_Year': '1968', 'Certificate': 'U', 'Runtime': '149 min', 'Genre': 'Adventure, Sci-Fi', 'IMDB_Rating': '8.3', 'Overview': 'After discovering a mysterious artifact buried beneath the Lunar surface, mankind sets off on a quest to find its origins with help from intelligent supercomputer H.A.L. 9000.', 'Meta_score': '84.0', 'Director': 'Stanley Kubrick', 'Star1': 'Keir Dullea', 'Star2': 'Gary 

<hr>

### LLM Insights

![Example Image](https://miro.medium.com/v2/resize:fit:1200/1*-PlFCd_VBcALKReO3ZaOEg.png)

In [None]:
import os
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

In [None]:
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
llm = ChatOpenAI(model=LLM_MODEL_NAME)

In [None]:
system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of context to answer the question. If you don't know the answer, say that you don't know. Use three sentences maximum and keep the answer concise. "
    "The context is a list of movie records with fields such as title, year of release, genre, IMDb rating, and a brief overview."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}")
    ]
)

llm_chain = prompt | llm
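Two details of the prompt construction are easy to miss. First, Python joins adjacent string literals inside parentheses with no separator, so each fragment has to carry its own trailing space. Second, the `{context}` and `{input}` placeholders are filled in when the chain is invoked. The sketch below shows both with plain `str.format`; `render_messages` is a hypothetical stand-in that only loosely mimics what a chat prompt template does.

```python
# Adjacent literals are concatenated exactly as written
without_space = ("Keep the answer concise." "The context follows.")
with_space = ("Keep the answer concise. " "The context follows.")
print(without_space)  # Keep the answer concise.The context follows.
print(with_space)     # Keep the answer concise. The context follows.

def render_messages(system_template, human_template, **variables):
    """Fill both templates and return (role, text) pairs."""
    return [
        ("system", system_template.format(**variables)),
        ("human", human_template.format(**variables)),
    ]

demo_system = (
    "Answer from the context only. "
    "\n\n"
    "{context}"
)
messages = render_messages(
    demo_system,
    "{input}",
    context="{'Series_Title': 'Inception'}",  # braces in a value are inserted as-is
    input="Who directed Inception?",
)
for role, text in messages:
    print(role, "->", repr(text))
```

Printing the rendered messages like this is a quick way to verify what the model will actually see before spending tokens on a real call.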

In [None]:
query = "movies directed by christopher nolan"
vector_store_documents = get_vector_store_documents(query)

In [None]:
response = llm_chain.invoke({"input": query, "context": "\n\n".join(vector_store_documents)})
print(response.content)

The movies directed by Christopher Nolan include "Inception" (2010), "The Prestige" (2006), "The Dark Knight Rises" (2012), "The Dark Knight" (2008), and "Batman Begins" (2005).
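The last two cells can be folded into one helper so that each new question needs a single call. The sketch below takes the retrieval and generation steps as arguments, which lets it run here with stubs; `answer_question` and both stubs are hypothetical names, and the early return for an empty result is an added assumption rather than behavior from the notebook. Because retrieval is capped at 5 documents, any answer that enumerates movies can list at most 5 of them.

```python
def answer_question(query, retrieve, generate, separator="\n\n"):
    """Retrieve context for `query`, then ask `generate` to answer using that context."""
    documents = retrieve(query)
    if not documents:
        # Nothing retrieved: skip the model call entirely
        return "I don't know."
    context = separator.join(documents)
    return generate({"input": query, "context": context})

# Stubs standing in for the vector store lookup and the LLM chain
def fake_retrieve(query):
    return ["doc a", "doc b"]

def fake_generate(inputs):
    return f"{inputs['input']} | {inputs['context']}"

print(repr(answer_question("q", fake_retrieve, fake_generate)))  # 'q | doc a\n\ndoc b'
```

With the notebook's objects, the call would be `answer_question(query, get_vector_store_documents, lambda inputs: llm_chain.invoke(inputs).content)`.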
