![Example Image](https://static.amazon.jobs/teams/53/images/IMDb_Header_Page.jpg?1501027252)

## Problem Statement

The provided Jupyter Notebook is designed to generate movie insights using advanced natural language processing (NLP) techniques and vector embeddings. The primary objectives and steps involved in this notebook include:

1. **Library Installation and Setup**: Installing necessary libraries such as chromadb and langchain-openai, and importing them for use in the notebook.
2. **Global Variables Configuration**: Setting up global variables, including API keys and model names required for embedding and language model functions.
3. **Chroma DB Initialization**: Setting up Chroma DB and creating a collection to store movie data embeddings.
4. **Embedding and Language Model Integration**: Using OpenAI’s embedding function (`text-embedding-3-small`) to create vector representations of the movie data, and setting up a chat model (`gpt-4o-mini`) to generate insights.
5. **Query Processing**: Implementing a question-answering system that embeds a user's query, retrieves the most similar movie records from the collection, and passes them to the language model as context for the answer.

<hr>

The notebook aims to create an interactive and intelligent system for querying and analyzing movie data, leveraging state-of-the-art NLP models and vector databases to provide concise and relevant movie insights.
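
To make the idea of comparing text through vectors concrete, here is a minimal sketch that runs without any API access. It assumes a toy bag-of-words "embedding" in place of the OpenAI model, and the three documents are made up for illustration; `embed`, `cosine_similarity`, and `most_similar` are hypothetical helper names, not part of the notebook.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical mini-corpus, not taken from the dataset.
documents = [
    "astronauts travel through space to find a new planet",
    "a mob boss hands the family business to his son",
    "a robot is stranded alone in deep space",
]


def most_similar(query, docs):
    """Return the document whose toy vector is closest to the query."""
    query_vec = embed(query)
    return max(docs, key=lambda doc: cosine_similarity(query_vec, embed(doc)))


best = most_similar("space planet", documents)
print(best)
```

A real embedding model captures meaning rather than word overlap, but the retrieval step works on the same principle: embed the query, then pick the stored vectors nearest to it.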

### Install Libraries

In [None]:
!pip install chromadb==0.5.3
!pip install langchain-openai

### Import Libraries

In [2]:
import chromadb
import json
import pandas as pd
import chromadb.utils.embedding_functions as embedding_functions

### Setup the global variables

In [None]:
OPENAI_API_KEY = "sk-lkjsdfjdlajd"
CHROMA_COLLECTION_NAME = "movie_collection"
EMBEDDING_MODEL_NAME = "text-embedding-3-small"
LLM_MODEL_NAME = "gpt-4o-mini-2024-07-18"
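
The key above is a placeholder. One common alternative to pasting a real key into a cell is to read it from an environment variable. The sketch below assumes the key has been exported as `OPENAI_API_KEY` before the notebook starts; `load_api_key` is a hypothetical helper, not something the notebook defines.

```python
import os


def load_api_key(env_var="OPENAI_API_KEY"):
    """Read the API key from the environment and fail loudly if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running the notebook."
        )
    return key


# Possible usage in the cell above (left commented out so the sketch
# does not fail when the variable is not set):
# OPENAI_API_KEY = load_api_key()
```

This keeps the key out of the saved notebook, which matters if the file is shared or committed to version control.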

### Setup Chroma DB/Collection

![Example Image](https://miro.medium.com/v2/resize:fit:1400/1*nu_Mvi654Al_DV0i3P31Nw.png)

In [4]:
chroma_client = chromadb.Client()
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
                api_key=OPENAI_API_KEY,
                model_name=EMBEDDING_MODEL_NAME
            )

In [5]:
# Create collection
#chroma_client.delete_collection(CHROMA_COLLECTION_NAME)
collection = chroma_client.get_or_create_collection(name=CHROMA_COLLECTION_NAME)

### Load Data

In [6]:
# from google.colab import drive
# drive.mount('/content/drive')
# df = pd.read_csv("/content/drive/MyDrive/GEN AI Learning/imdb_top_1000.csv").drop("Poster_Link", axis=1)

df = pd.read_csv("data/imdb_top_1000.csv").drop("Poster_Link", axis=1)
df["id"] = df.Series_Title.str.lower() + "_" + df.Released_Year.astype("str")
df[["Released_Year", "IMDB_Rating", "Meta_score", "No_of_Votes"]] = df[["Released_Year", "IMDB_Rating",
                                                                        "Meta_score", "No_of_Votes"]].astype(str)

In [None]:
df.head()

In [None]:
# Convert the DataFrame to a list of dictionaries, one per record
data_list = json.loads(df.to_json(orient="records"))
print(data_list)

In [None]:
# Serialize each dictionary to a JSON string, so the documents match the JSON format the system prompt describes
data_list = [json.dumps(elem, ensure_ascii=False) for elem in data_list]
print(len(data_list))

1000
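
How a record is turned into a string affects what the embedding model and the LLM see later. A small sketch of the difference between Python's `str()` and `json.dumps()`, using a made-up record (the field values are hypothetical):

```python
import json

# Hypothetical record using two of the dataset's column names.
record = {"Series_Title": "Inception", "Meta_score": None}

as_repr = str(record)         # Python repr: single quotes, None
as_json = json.dumps(record)  # JSON: double quotes, null

print(as_repr)  # {'Series_Title': 'Inception', 'Meta_score': None}
print(as_json)  # {"Series_Title": "Inception", "Meta_score": null}
```

Only the second form parses as JSON, which is worth keeping in mind if a prompt tells the model that its context is JSON.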


In [None]:
print(data_list)

In [9]:
movie_ids = list(df.id.values)
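
Chroma expects every ID within a collection to be unique, so it can be worth checking the generated IDs before adding anything. A minimal sketch, assuming IDs in the same `<lowercased title>_<year>` shape; `find_duplicate_ids` and the sample values are hypothetical:

```python
from collections import Counter


def find_duplicate_ids(ids):
    """Return the IDs that occur more than once, in first-seen order."""
    counts = Counter(ids)
    return [item for item, n in counts.items() if n > 1]


# Hypothetical IDs, not taken from the dataset.
sample_ids = ["inception_2010", "heat_1995", "inception_2010"]
print(find_duplicate_ids(sample_ids))  # ['inception_2010']
```

In the notebook, calling `find_duplicate_ids(movie_ids)` should return an empty list if the title and year combination is unique for every row.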

In [11]:
# Add data to the collection
collection.add(
    embeddings = openai_ef(data_list), # Generate embeddings for the data_list using the openai_ef function
    documents=data_list,
    ids=movie_ids
)
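
The cell above embeds all documents in one call. For larger datasets, embedding services usually limit how much a single request may carry, so splitting the work into batches is a common pattern. The sketch below shows only the batching logic; `batched`, the batch size of 3, and the stand-in lists are assumptions for illustration.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical stand-ins for data_list and movie_ids.
docs = [f"doc {i}" for i in range(7)]
ids = [f"id_{i}" for i in range(7)]

for doc_batch, id_batch in zip(batched(docs, 3), batched(ids, 3)):
    # In the notebook, this is where collection.add(...) would run per batch.
    print(len(doc_batch), id_batch)
```

Because both lists are sliced with the same batch size, each document stays paired with its ID.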

In [12]:
def get_vector_store_documents(query):
    """
    Retrieve documents from the vector store, most similar first.

    This function takes a query string, generates its embedding, and queries
    the vector store for the 5 documents closest to the query. The store
    reports distances, so a smaller value means a closer match; the results
    are therefore sorted by distance in ascending order.

    Parameters:
    query (str): The query string to search for in the vector store.

    Returns:
    list: The top 5 documents, ordered from most to least similar to the query.
    """
    results = collection.query(
        query_embeddings=openai_ef([query]),
        n_results=5
    )
    # Sort on distance only (ascending), so ties never fall back to comparing document text
    sorted_list = sorted(zip(results['distances'][0], results['documents'][0]), key=lambda pair: pair[0])
    sorted_documents = [document for _, document in sorted_list]
    return sorted_documents

In [None]:
get_vector_store_documents("movies talking about space")
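
The query result holds one inner list per query, with distances and documents aligned by position. Since a smaller distance means a closer match, ranking "most similar first" means sorting in ascending order. A minimal sketch on a made-up result of the same shape; `rank_by_distance` and `fake_results` are hypothetical:

```python
def rank_by_distance(results):
    """Return documents for the first query, smallest distance first."""
    pairs = zip(results["distances"][0], results["documents"][0])
    return [doc for _, doc in sorted(pairs, key=lambda pair: pair[0])]


# Hypothetical result in the nested-list shape used by the query call above.
fake_results = {
    "distances": [[0.91, 0.35, 0.62]],
    "documents": [["doc far", "doc near", "doc middle"]],
}
print(rank_by_distance(fake_results))  # ['doc near', 'doc middle', 'doc far']
```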

<hr>

### LLM Insights

![Example Image](https://miro.medium.com/v2/resize:fit:1200/1*-PlFCd_VBcALKReO3ZaOEg.png)

In [14]:
import os
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

In [15]:
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
llm = ChatOpenAI(model=LLM_MODEL_NAME)

In [16]:
system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of context to answer the question. If you don't know the answer, say that you don't know. Use three sentences maximum and keep the answer concise. "
    "The context is strictly in json format with fields such as title, year of release, genre, imdb rating and brief introduction."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}")
    ]
)

llm_chain = prompt | llm
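
Conceptually, the prompt template substitutes the `{context}` and `{input}` placeholders before the messages reach the model. The sketch below approximates that step in plain Python; it is not LangChain's implementation, and `system_template` and `render_messages` are hypothetical names. It also shows that adjacent string literals are joined with nothing in between, so each piece needs its own trailing space.

```python
# Adjacent string literals inside parentheses are concatenated as-is.
system_template = (
    "Use the context to answer. "
    "Keep the answer concise. "
    "\n\n"
    "{context}"
)


def render_messages(context, question):
    """Fill the template and return (role, text) pairs, mimicking a chat prompt."""
    return [
        ("system", system_template.format(context=context)),
        ("human", question),
    ]


messages = render_messages("{'Series_Title': 'Inception'}", "Who directed Inception?")
for role, text in messages:
    print(role, "->", repr(text))
```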

In [17]:
query = "movies directed by christopher nolan"
vector_store_documents = get_vector_store_documents(query)

In [None]:
response = llm_chain.invoke({"input": query, "context": "\n\n".join(vector_store_documents)})
print(response.content)
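
The last two cells can be folded into a single helper that takes the retrieval and generation steps as arguments. The sketch below uses fake stand-ins for both, so it runs without API access; `answer_question`, `fake_retrieve`, and `fake_generate` are hypothetical names.

```python
def answer_question(query, retrieve, generate):
    """Retrieve context for `query`, then pass both to `generate`."""
    documents = retrieve(query)
    context = "\n\n".join(documents)
    return generate({"input": query, "context": context})


# Hypothetical stand-ins for the vector store and the LLM chain.
def fake_retrieve(query):
    return ["doc one", "doc two"]


def fake_generate(inputs):
    n_docs = inputs["context"].count("\n\n") + 1
    return f"answered '{inputs['input']}' from {n_docs} documents"


print(answer_question("space movies", fake_retrieve, fake_generate))
# answered 'space movies' from 2 documents
```

In the notebook, `retrieve` would be `get_vector_store_documents` and `generate` could be `lambda inputs: llm_chain.invoke(inputs).content`.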