<div style="border: 2px solid #e74c3c; padding: 10px; border-radius: 5px; text-align: center; background-color: #fdecea; color: #c0392b; font-weight: bold;">
  Recommendation system with LLMs
</div>

In [99]:
import pandas as pd
import ast

# Reading the data

In [100]:
md = pd.read_csv('movies_metadata.csv')
print(md.shape)
md.head(2)

(45466, 24)




Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0


In [101]:
md['genres'] = md['genres'].apply(ast.literal_eval)
md['genres'] = md['genres'].apply(lambda x: [genre['name'] for genre in x])
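The two lines above first parse each `genres` string into a Python list of dicts, then keep only the `name` of each entry. A minimal sketch on a single value, assuming a shortened, illustrative string in the same shape as the CSV column (the `raw` value below is not copied from the dataset):

```python
import ast

# Illustrative value only: same shape as a `genres` cell, shortened by hand.
raw = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

# literal_eval safely evaluates the string as a Python literal (no arbitrary code).
parsed = ast.literal_eval(raw)

# Keep only the genre names, as the notebook cell does for every row.
names = [genre['name'] for genre in parsed]
print(names)
```

`ast.literal_eval` is used instead of `eval` because it only accepts literals, so a malformed cell raises an error rather than executing code.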

In [102]:
def calculate_weighted_rate(vote_average, vote_count, min_vote_count=10):
    # Pull the film's average toward a fixed prior of 5.0; the fewer the votes, the stronger the pull
    total = vote_count + min_vote_count
    return (vote_count / total) * vote_average + (min_vote_count / total) * 5.0

# Use the 95th percentile of the vote counts as the threshold
vote_counts = md[md['vote_count'].notnull()]['vote_count'].astype('int')
min_vote_count = vote_counts.quantile(0.95)

md['weighted_rate'] = md.apply(lambda row: calculate_weighted_rate(row['vote_average'], row['vote_count'], min_vote_count), axis=1)
md.head(2)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,weighted_rate
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[Animation, Comedy, Family]",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0,7.499658
1,False,,65000000,"[Adventure, Fantasy, Family]",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,6.610362
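As a sanity check on the weighted rating, the two displayed rows can be recomputed by hand. This is a sketch under one assumption: the threshold `M = 434` is not printed anywhere in the notebook; it is the value inferred from the two `weighted_rate` figures shown above. The helper name `weighted_rate` is also introduced here for illustration.

```python
def weighted_rate(vote_average, vote_count, min_vote_count, prior=5.0):
    """Average of the film's rating and a fixed prior, weighted by vote counts."""
    total = vote_count + min_vote_count
    return (vote_count * vote_average + min_vote_count * prior) / total

# Assumption: 434 is inferred from the displayed outputs, not read from the data.
M = 434

toy_story = weighted_rate(7.7, 5415.0, M)  # vote_average and vote_count from row 0
jumanji = weighted_rate(6.9, 2413.0, M)    # vote_average and vote_count from row 1

print(round(toy_story, 6), round(jumanji, 6))
```

With that threshold, both values agree with the `weighted_rate` column to within rounding. A film with few votes is pulled strongly toward the prior of 5.0, which is why Jumanji loses more of its average than Toy Story.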


In [103]:
md.dropna(inplace=True)
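Called without arguments, `dropna` removes every row that has a missing value in any column, including columns that are not used later. That is consistent with the outputs: the two rows displayed earlier each have an empty field and no longer appear afterwards. A hedged alternative, shown on a small made-up frame, restricts the check to the columns that are kept (the `subset` list is an assumption about which columns matter):

```python
import pandas as pd

# Made-up sample; titles and values are placeholders, not dataset rows.
sample = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "overview": ["Some plot.", None, "Another plot."],
    "tagline": [None, "A tagline", "Another tagline"],
    "weighted_rate": [7.5, 6.6, 6.1],
})

# Drops a row if ANY column is missing, so a missing tagline removes Film A.
dropped_any = sample.dropna()

# Drops a row only if one of the columns used downstream is missing.
dropped_subset = sample.dropna(subset=["title", "overview", "weighted_rate"])

print(len(dropped_any), len(dropped_subset))
```

Which variant is appropriate depends on whether a smaller, fully populated catalogue is intended.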

In [104]:
md_final = md[['genres', 'title', 'overview', 'weighted_rate']].reset_index(drop=True)
md_final.head()

Unnamed: 0,genres,title,overview,weighted_rate
0,"[Adventure, Action, Thriller]",GoldenEye,James Bond must unmask the mysterious head of ...,6.173464
1,[Comedy],Friday,Craig and Smokey are two guys in Los Angeles h...,6.083421
2,"[Horror, Action, Thriller, Crime]",From Dusk Till Dawn,Seth Gecko and his younger brother Richard are...,6.503176
3,[Comedy],Blue in the Face,"Auggie runs a small tobacco shop in Brooklyn, ...",5.109091
4,"[Action, Adventure, Science Fiction, Family, F...",Mighty Morphin Power Rangers: The Movie,Power up with six incredible teens who out-man...,5.052129


Let's combine all the information.

In [105]:
md_final['combined_info'] = md_final.apply(lambda row: f"Title: {row['title']}. Overview: {row['overview']} Genres: {', '.join(row['genres'])}. Rating: {row['weighted_rate']}", axis=1)
md_final['combined_info'][9]

'Title: Jurassic Park. Overview: A wealthy entrepreneur secretly creates a theme park featuring living dinosaurs drawn from prehistoric DNA. Before opening day, he invites a team of experts and his two eager grandchildren to experience the park and help calm anxious investors. However, the park is anything but amusing as the security systems go off-line and the dinosaurs escape. Genres: Adventure, Science Fiction. Rating: 7.39064935064935'

# Embeddings

We will use SentenceTransformers.

In [106]:
from langchain.embeddings.huggingface import HuggingFaceEmbeddings

In [107]:
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

In [109]:
md_final["embedding"] = md_final.overview.apply(lambda x: embedding_model.embed_query(x))
md_final.head()

Unnamed: 0,genres,title,overview,weighted_rate,combined_info,embedding
0,"[Adventure, Action, Thriller]",GoldenEye,James Bond must unmask the mysterious head of ...,6.173464,Title: GoldenEye. Overview: James Bond must un...,"[-0.08838904649019241, 0.08205067366361618, -0..."
1,[Comedy],Friday,Craig and Smokey are two guys in Los Angeles h...,6.083421,Title: Friday. Overview: Craig and Smokey are ...,"[0.01811734400689602, -0.041648950427770615, -..."
2,"[Horror, Action, Thriller, Crime]",From Dusk Till Dawn,Seth Gecko and his younger brother Richard are...,6.503176,Title: From Dusk Till Dawn. Overview: Seth Gec...,"[-0.007716728374361992, 0.09969004988670349, -..."
3,[Comedy],Blue in the Face,"Auggie runs a small tobacco shop in Brooklyn, ...",5.109091,Title: Blue in the Face. Overview: Auggie runs...,"[-0.03313828632235527, -0.0016525940736755729,..."
4,"[Action, Adventure, Science Fiction, Family, F...",Mighty Morphin Power Rangers: The Movie,Power up with six incredible teens who out-man...,5.052129,Title: Mighty Morphin Power Rangers: The Movie...,"[-0.08589291572570801, 0.04623064771294594, 0...."


In [110]:
md_final.rename(columns = {'embedding': 'vector'}, inplace = True)
md_final.rename(columns = {'combined_info': 'text'}, inplace = True)
#md_final.rename(columns = {'overview': 'metadata'}, inplace = True)
md_final.to_pickle('movies.pkl')

In [111]:
md_final.columns

Index(['genres', 'title', 'overview', 'weighted_rate', 'text', 'vector'], dtype='object')

# LLMs

In [112]:
md = pd.read_pickle('movies.pkl')

md.head(2)

Unnamed: 0,genres,title,overview,weighted_rate,text,vector
0,"[Adventure, Action, Thriller]",GoldenEye,James Bond must unmask the mysterious head of ...,6.173464,Title: GoldenEye. Overview: James Bond must un...,"[-0.08838904649019241, 0.08205067366361618, -0..."
1,[Comedy],Friday,Craig and Smokey are two guys in Los Angeles h...,6.083421,Title: Friday. Overview: Craig and Smokey are ...,"[0.01811734400689602, -0.041648950427770615, -..."


In [136]:
md["metadata"] = md['overview'].str.lower()

In [113]:
from langchain.vectorstores import LanceDB
import lancedb

In [142]:
from pydantic import BaseModel
from lancedb.pydantic import LanceModel, Vector
from datetime import datetime
from typing import Optional

class Metadata(BaseModel):
    # Not referenced by LanceSchema below
    key1: Optional[str] = ""
    key2: Optional[str] = ""

class LanceSchema(LanceModel):
    text: str
    title: str
    vector: Vector(384)  # 384 = embedding size of all-MiniLM-L6-v2
    metadata: Optional[str] = ""




In [143]:
uri = "data/sample-lancedb"
db = lancedb.connect(uri)
table = db.create_table("movies", md, schema=LanceSchema, mode="overwrite")

In [144]:
table.schema

text: string not null
title: string not null
vector: fixed_size_list<item: float>[384]
  child 0, item: float
metadata: string

In [145]:
docsearch = LanceDB(
    connection=db,
    embedding=embedding_model,
    uri=uri,
    table_name="movies",
    vector_key="vector",
    id_key="title",
    text_key="text",
    distance="cosine",
)
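The `distance='cosine'` setting compares the query vector and the stored vectors by direction rather than by length. A minimal pure-Python sketch of cosine distance in its usual definition (one minus cosine similarity); how LanceDB computes it internally is not shown here, and the function name is introduced for illustration:

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Vectors pointing the same way are at distance ~0, whatever their length.
same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])

# Perpendicular vectors share nothing, so the distance is 1.
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])

print(same_direction, orthogonal)
```

Smaller distances mean the overview embedding points in nearly the same direction as the query embedding.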

In [150]:
query = "I'm looking for an animated action movie. What could you suggest to me?"

In [151]:

docs = docsearch.similarity_search(query)
docs

ValidationError: 1 validation error for Document
metadata
  Input should be a valid dictionary [type=dict_type, input_value='a thriller crime comedy ...by wolfgang murnberger.', input_type=str]
    For further information visit https://errors.pydantic.dev/2.9/v/dict_type
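The error message indicates that the `metadata` of a `Document` must be a dictionary, while the `metadata` column built earlier holds plain strings (the lowercase overview). One possible direction, sketched here without any claim that it is the complete fix, is to wrap each string in a dict before the table is created. The helper `to_metadata` and the key `"overview"` are hypothetical names:

```python
def to_metadata(overview: str) -> dict:
    """Wrap a lowercase overview in a dict-shaped metadata payload."""
    return {"overview": overview.lower()}

# Placeholder text, not a dataset row.
sample = "A thriller crime comedy."
metadata = to_metadata(sample)
print(metadata)
```

In the notebook this would correspond to something like `md["metadata"] = md["overview"].apply(to_metadata)`. The `metadata` field of `LanceSchema` would then also need a nested type instead of `Optional[str]`; whether the existing `Metadata` model fits that role has not been verified here.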