<a href="https://colab.research.google.com/github/Ami03sa/Openai_projects-/blob/Movie_recommend_system/Movie_Recommendation_with_OpenAI_API.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


Creating a Movie Recommendation System!


# 2. Libraries import

In [None]:
import os
import openai
import numpy as np
import pandas as pd

from openai import OpenAI

In [None]:
pip install openai



# 3. Sending a first request to OpenAI API


### 3.1 Setting up API Key

In [106]:
os.environ["OPENAI_API_KEY"] = ""
client = OpenAI()

### 3.2 Vectors and their similarity


### Embeddings:

Imagine you have a bunch of different fruits, and you want to describe each one on a piece of paper so that someone can understand what each fruit is like without seeing it. You’d write down things like the color, shape, size, and taste of each fruit. In the world of computers and AI, embeddings do something similar for words or movies.

An embedding is a way of turning words, sentences, or things like movies into a list of numbers (we call this list a "vector") that represents different features, just like the list you made for the fruits. For example, for movies, the numbers might represent how action-packed they are, whether they are romantic, if they are funny, and so on. These numbers aren't random; they are calculated so that movies with similar numbers have similar features.

![](https://cdn.sanity.io/images/vr8gru94/production/e016bbd4d7d57ff27e261adf1e254d2d3c609aac-2447x849.png)
Source: https://www.pinecone.io/learn/vector-embeddings/

### Vector Similarity:

Now, let’s say you have two lists of numbers for two different movies. How can you tell if the movies are similar? This is where vector similarity comes in.

Imagine you and a friend each have a toy car, and you race them side by side to see which one is faster. If the cars finish the race at almost the same time, you’d say they’re pretty similar in speed. Vector similarity does the same thing with the lists of numbers for the movies.

Computers use a method to "race" the vectors against each other, often using something called "cosine similarity." They check how close the numbers are in both lists. If the numbers are really close across both lists, it’s like two cars finishing at the same time, which means the movies are similar. If the numbers are far apart, then the movies are quite different, just like if one car finishes way ahead of the other.

So, in simple terms:

- **Embeddings** are like writing a detailed description of something (like a movie) in a special code of numbers that a computer can understand.
- **Vector similarity** is like a race to see how similar two sets of numbers (or embeddings) are, which tells us how similar the things they represent (like two movies) might be to each other.


![](https://cdn.sanity.io/images/vr8gru94/production/5a5ba7e0971f7b6dc4697732fa8adc59a46b6d8d-338x357.png)

Source: https://www.pinecone.io/learn/vector-similarity/

In [None]:
experiment_sentence = "The Terminator is a movie about an AI going after a human"

In [None]:
res = client.embeddings.create(
    model = "text-embedding-ada-002",
    input = experiment_sentence
)

#-0.015121644362807274, -0.05992080271244049, -0.02566564828157425, -0.021782368421554565


In [None]:
len(res.data[0].embedding)

1536

In [None]:
res.data[0].embedding[:10]

[-0.015241867862641811,
 -0.05812705680727959,
 -0.024942390620708466,
 -0.024194246158003807,
 0.01960393227636814,
 0.008793872781097889,
 -0.03053445741534233,
 -3.6220933452568715e-07,
 -0.021480634808540344,
 -0.0056966799311339855]

## Similarity

In [None]:
toy_dataset = [
    "The Terminator is a movie that has AI-based robots inside of them",
    "Harry Potter is all amobut wizards and magic",
    "In the movie Matrix, AI already has become the most powerfull 'being'"
]

In [None]:
toy_embedding = client.embeddings.create(
    model = "text-embedding-ada-002",
    input = toy_dataset
)

In [None]:
clean_embeds = []
for embed in toy_embedding.data:
    clean_embeds.append(embed.embedding)

In [None]:
clean_embeds[0][:10]

[-0.012985923327505589,
 -0.05840099975466728,
 -0.027800120413303375,
 -0.013011856935918331,
 0.01611732691526413,
 0.010515809990465641,
 -0.03132700175046921,
 -0.00822074431926012,
 -0.015481970272958279,
 -0.012434848584234715]

In [None]:
user_request = input("Enter the movie description: ")
user_vector = client.embeddings.create(
    model = "text-embedding-ada-002",
    input = user_request
)



Enter the movie description: i love ai 


In [None]:
user_vector = user_vector.data[0].embedding

In [None]:
from scipy.spatial.distance import cosine, cdist

In [None]:
np.array(user_vector).reshape(1,-1)

array([[-0.01260287, -0.0207954 , -0.0282796 , ..., -0.01888425,
        -0.00636826, -0.01351834]])

## Recommending most similar vector

In [None]:
similarities= 1-cdist(np.array(user_vector).reshape(1,-1), np.array(clean_embeds), metric="cosine")

In [None]:
similarities

array([[0.77261255, 0.7516612 , 0.80279061]])

In [None]:
np.argsort(-similarities)

array([[2, 0, 1]])

In [None]:
p_movies = [toy_dataset[id]for id in np.argsort(-similarities[0])]

In [None]:
p_movies

["In the movie Matrix, AI already has become the most powerfull 'being'",
 'The Terminator is a movie that has AI-based robots inside of them',
 'Harry Potter is all amobut wizards and magic']

# 4. Scaling to the big dataset

You can download dataset from here: https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset?select=movies_metadata.csv

In [46]:
data = pd.read_csv("movies_metadata.csv")

  data = pd.read_csv("movies_metadata.csv")


In [47]:
data.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [48]:
subset = data[["original_title","overview"]]

In [49]:
subset.head()

Unnamed: 0,original_title,overview
0,Toy Story,"Led by Woody, Andy's toys live happily in his ..."
1,Jumanji,When siblings Judy and Peter discover an encha...
2,Grumpier Old Men,A family wedding reignites the ancient feud be...
3,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom..."
4,Father of the Bride Part II,Just when George Banks has recovered from his ...


In [50]:
# Using only 100 movies for recommendation system to peresven money for API :)
small_data = subset.iloc[:100]

In [53]:
# Drop missing values
small_data = small_data.dropna()
small_data.shape

(99, 2)

In [55]:
movie_embeddings = client.embeddings.create(
    model="text-embedding-ada-002",
    input=small_data['overview'].values.tolist(),
)

clean_movie_embeddings = []
for embedding in movie_embeddings.data:
    clean_movie_embeddings.append(embedding.embedding)

In [56]:
clean_movie_embeddings = np.array(clean_movie_embeddings)

In [74]:
user_request = input("Enter the movie description: ")
user_vector = client.embeddings.create(
    model = "text-embedding-ada-002",
    input = user_request
)

user_vector = np.array(user_vector.data[0].embedding).reshape(1,-1)



Enter the movie description: movie about robots


In [75]:
scores = np.argsort(-cdist(user_vector, clean_movie_embeddings, metric="cosine")[0])

In [76]:
scores[:10]

array([33, 41, 75, 10,  7, 78, 96, 44, 88, 46])

In [78]:
for i in scores[:10]:
  print(small_data.iloc[i]["original_title"])

Carrington
Restoration
Nico Icon
The American President
Tom and Huck
بادکنک سفید
Heidi Fleiss: Hollywood Madam
How To Make An American Quilt
The Journey of August King
Pocahontas


### 5. Building movie recommender with Pinecone


Pinecone website: https://www.pinecone.io/

In [94]:
pip install pinecone



In [95]:
from pinecone import Pinecone

pc = Pinecone(api_key="")
index = pc.Index("movie")

In [102]:
for i in range(len(small_data)):
    upsert_response = index.upsert(
    vectors=[
        (
         str(i),
         clean_movie_embeddings[0].tolist(),
         {"title": small_data.iloc[i]['original_title']}
        )
    ]

    )

## Searching the most similar movie

In [103]:
user_request = input("What movie are you looking for? ")

user_vector = client.embeddings.create(
    model="text-embedding-ada-002",
    input=user_request)

user_vector = user_vector.data[0].embedding


What movie are you looking for? movie about cars


In [104]:
matches = index.query(
    vector=user_vector,
    top_k=10,
    include_metadata=True
)

In [105]:
matches

{'matches': [{'id': '15',
              'metadata': {'title': 'Casino'},
              'score': 0.776062071,
              'values': []},
             {'id': '16',
              'metadata': {'title': 'Sense and Sensibility'},
              'score': 0.776062071,
              'values': []},
             {'id': '17',
              'metadata': {'title': 'Four Rooms'},
              'score': 0.776062071,
              'values': []},
             {'id': '18',
              'metadata': {'title': 'Ace Ventura: When Nature Calls'},
              'score': 0.776062071,
              'values': []},
             {'id': '31',
              'metadata': {'title': 'Twelve Monkeys'},
              'score': 0.776062071,
              'values': []},
             {'id': '45',
              'metadata': {'title': 'Se7en'},
              'score': 0.776062071,
              'values': []},
             {'id': '59',
              'metadata': {'title': 'Eye for an Eye'},
              'score': 0.776062071,
     