# Movie Recommender System Using the TMDB 5000 Movie Dataset
Metadata on ~5,000 movies from TMDb

Recommender systems are programs that suggest items to users based on a variety of criteria.

These systems estimate which product a consumer is most likely to buy or be interested in. Netflix, Amazon, and other companies use recommender systems to help their users find a suitable product or film.

There are three types of recommender systems.

Demographic filtering: the recommendations are the same for every user. They are generalized rather than personalized. Systems of this type power sections such as “Most Popular”.

Content-based filtering: these systems make recommendations based on an item's metadata (film, product, song, etc.). The main idea is that if a user likes an item, they will also like items similar to it.

Collaborative filtering: these systems make recommendations by grouping users with similar interests. Item metadata is not required for this approach.

This project builds a content-based movie recommendation engine.
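For contrast with the content-based approach used in this project, demographic filtering can be sketched in a few lines of plain Python: every user receives the same list, ranked by a global score. The popularity values are copied from the first five rows of the movies table in this notebook; `most_popular` is a hypothetical helper name used only for illustration.

```python
# Popularity values from the first five rows of tmdb_5000_movies.csv
movies = [
    {"title": "Avatar", "popularity": 150.437577},
    {"title": "Pirates of the Caribbean: At World's End", "popularity": 139.082615},
    {"title": "Spectre", "popularity": 107.376788},
    {"title": "The Dark Knight Rises", "popularity": 112.31295},
    {"title": "John Carter", "popularity": 43.926995},
]

def most_popular(rows, n=3):
    """Hypothetical helper: return the n titles with the highest popularity."""
    ranked = sorted(rows, key=lambda row: row["popularity"], reverse=True)
    return [row["title"] for row in ranked[:n]]

# The same list is returned no matter which user asks
top_titles = most_popular(movies)
print(top_titles)
```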

In [7]:
import pandas as pd
credits_df = pd.read_csv("tmdb_5000_credits.csv")
movies_df = pd.read_csv("tmdb_5000_movies.csv")

In [8]:
movies_df.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


In [16]:
movies_df.columns

Index(['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language',
       'original_title', 'overview', 'popularity', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title_x', 'vote_average',
       'vote_count', 'title_y', 'cast_x', 'crew_x', 'title', 'cast_y',
       'crew_y'],
      dtype='object')

In [9]:
credits_df.head()

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


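From the credits DataFrame we use the columns id (renamed from movie_id), title, cast, and crew. We merge the two DataFrames on the ‘id’ column. The outputs recorded in this notebook contain cast_x/crew_x and cast_y/crew_y, which indicates that the merge cell was executed twice in the recorded session; after a single run the columns are named cast and crew.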
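The column names in the merged output depend on how pandas resolves name clashes. The sketch below uses two toy frames (hypothetical stand-ins for `movies_df` and `credits_df`, with placeholder strings for cast and crew) to show where the `_x`/`_y` suffixes come from, and why names such as `cast_x` appear only when the merge is applied a second time.

```python
import pandas as pd

# Toy stand-ins for movies_df and credits_df (two rows each)
movies = pd.DataFrame({"id": [19995, 285], "title": ["Avatar", "Pirates"]})
credits = pd.DataFrame({
    "id": [19995, 285],
    "title": ["Avatar", "Pirates"],
    "cast": ["[...]", "[...]"],
    "crew": ["[...]", "[...]"],
})

# First merge: both frames have "title", so pandas adds _x / _y suffixes
once = movies.merge(credits, on="id")
print(list(once.columns))
# ['id', 'title_x', 'title_y', 'cast', 'crew']

# Second merge of the result with credits: "cast" and "crew" now clash,
# while "title" no longer does (the left side has title_x and title_y)
twice = once.merge(credits, on="id")
print(list(twice.columns))
# ['id', 'title_x', 'title_y', 'cast_x', 'crew_x', 'title', 'cast_y', 'crew_y']
```

If the suffixed names are not wanted, one option is to drop the duplicate `title` column from the credits frame before merging.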

In [11]:
credits_df.columns = ['id','title','cast','crew']
movies_df = movies_df.merge(credits_df, on="id")
movies_df.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,tagline,title_x,vote_average,vote_count,title_y,cast_x,crew_x,title,cast_y,crew_y
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,Enter the World of Pandora.,Avatar,7.2,11800,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de...",Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de...",Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...",...,A Plan No One Escapes,Spectre,6.3,4466,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de...",Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...",...,The Legend Ends,The Dark Knight Rises,7.6,9106,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de...",The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]",...,"Lost in our world, found in another.",John Carter,6.1,2124,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de...",John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."



For personalized recommendations we use the following features: cast, crew, keywords, and genres.

In [24]:
from ast import literal_eval  # parses strings into Python lists/dicts

features = ["cast_x", "crew_x", "keywords", "genres"]

#for feature in features:
    #movies_df[feature] = movies_df[feature].apply(literal_eval)

movies_df[features].head(10)

Unnamed: 0,cast_x,crew_x,keywords,genres
0,"[{'cast_id': 242, 'character': 'Jake Sully', '...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam..."
1,"[{'cast_id': 4, 'character': 'Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de...","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""..."
2,"[{'cast_id': 1, 'character': 'James Bond', 'cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de...","[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam..."
3,"[{'cast_id': 2, 'character': 'Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de...","[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...","[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam..."
4,"[{'cast_id': 5, 'character': 'John Carter', 'c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de...","[{""id"": 818, ""name"": ""based on novel""}, {""id"":...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam..."
5,"[{'cast_id': 30, 'character': 'Peter Parker / ...","[{""credit_id"": ""52fe4252c3a36847f80151a5"", ""de...","[{""id"": 851, ""name"": ""dual identity""}, {""id"": ...","[{""id"": 14, ""name"": ""Fantasy""}, {""id"": 28, ""na..."
6,"[{'cast_id': 34, 'character': 'Flynn Rider (vo...","[{""credit_id"": ""52fe46db9251416c91062101"", ""de...","[{""id"": 1562, ""name"": ""hostage""}, {""id"": 2343,...","[{""id"": 16, ""name"": ""Animation""}, {""id"": 10751..."
7,"[{'cast_id': 76, 'character': 'Tony Stark / Ir...","[{""credit_id"": ""55d5f7d4c3a3683e7e0016eb"", ""de...","[{""id"": 8828, ""name"": ""marvel comic""}, {""id"": ...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam..."
8,"[{'cast_id': 3, 'character': 'Harry Potter', '...","[{""credit_id"": ""52fe4273c3a36847f801fab1"", ""de...","[{""id"": 616, ""name"": ""witch""}, {""id"": 2343, ""n...","[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""..."
9,"[{'cast_id': 18, 'character': 'Bruce Wayne / B...","[{""credit_id"": ""553bf23692514135c8002886"", ""de...","[{""id"": 849, ""name"": ""dc comics""}, {""id"": 7002...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam..."


In [32]:
def safe_literal_eval(x):
    if isinstance(x, str):
        try:
            return literal_eval(x)  # parse the string into a list/dict
        except (ValueError, SyntaxError):
            return []  # or np.nan
    return x  # already a list/dict

# Apply to every column that holds JSON-like strings
json_columns = ["crew_x", "cast_x", "keywords", "genres"]
for col in json_columns:
    movies_df[col] = movies_df[col].apply(safe_literal_eval)
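What the parsing step does to a single cell can be seen on a shortened value. This is a minimal sketch: the `raw` string is an abbreviated genres entry in the format shown in the tables above, and the truncated string in the second half is a made-up example of malformed input.

```python
from ast import literal_eval

# A shortened genres value in the format stored in the CSV
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

parsed = literal_eval(raw)                 # str -> list of dicts
names = [item["name"] for item in parsed]  # keep only the readable names
print(type(parsed).__name__, names)

# A malformed string raises an exception, which is why the wrapper
# catches ValueError and SyntaxError and falls back to an empty list
try:
    literal_eval('[{"id": 28, "name": ')
    error_name = None
except (ValueError, SyntaxError) as exc:
    error_name = type(exc).__name__
print("parse error:", error_name)
```

Because these columns hold JSON text, `json.loads` from the standard library is another option; `literal_eval` would fail on JSON literals such as `null` or `true`.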

Functions that extract the film's director and take the first three elements of a list.

In [None]:
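import numpy as np

def get_director(x):
    # Return the name of the first crew member whose job is "Director"
    if isinstance(x, list):
        for person in x:
            if person.get("job") == "Director":
                return person["name"]
    return np.nan  # no director found, or the value is not a list

def get_list(x):
    # Return up to the first three "name" values from a list of dicts
    if isinstance(x, list):
        names = [i["name"] for i in x if isinstance(i, dict) and "name" in i]
        return names[:3]
    return []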

In [34]:
movies_df["director"] = movies_df["crew_x"].apply(get_director)

features = ["cast_x", "keywords", "genres"]
for feature in features:
    movies_df[feature] = movies_df[feature].apply(get_list)

In [36]:
movies_df[['title', 'cast_x', 'director', 'keywords', 'genres']].head()

Unnamed: 0,title,cast_x,director,keywords,genres
0,Avatar,"[Sam Worthington, Zoe Saldana, Sigourney Weaver]",James Cameron,"[culture clash, future, space war]","[Action, Adventure, Fantasy]"
1,Pirates of the Caribbean: At World's End,"[Johnny Depp, Orlando Bloom, Keira Knightley]",Gore Verbinski,"[ocean, drug abuse, exotic island]","[Adventure, Fantasy, Action]"
2,Spectre,"[Daniel Craig, Christoph Waltz, Léa Seydoux]",Sam Mendes,"[spy, based on novel, secret agent]","[Action, Adventure, Crime]"
3,The Dark Knight Rises,"[Christian Bale, Michael Caine, Gary Oldman]",Christopher Nolan,"[dc comics, crime fighter, terrorist]","[Action, Crime, Drama]"
4,John Carter,"[Taylor Kitsch, Lynn Collins, Samantha Morton]",Andrew Stanton,"[based on novel, mars, medallion]","[Action, Adventure, Science Fiction]"


For convenience, convert everything to lowercase and remove spaces.
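Removing spaces matters because the text is later split into word tokens, and two different people who share a first name would otherwise share a token. The sketch below illustrates this with a plain whitespace split, which is only an approximation of what a vectorizer does; `tokens` is a hypothetical helper and the second actor name is only illustrative.

```python
def tokens(names, strip_spaces):
    """Split names into lowercase word tokens, optionally gluing each name together first."""
    if strip_spaces:
        names = [name.replace(" ", "") for name in names]
    return {word for name in names for word in name.lower().split()}

cast_a = ["Johnny Depp"]
cast_b = ["Johnny Galecki"]

# With spaces kept, the two casts share the token "johnny"
shared_raw = tokens(cast_a, False) & tokens(cast_b, False)
# With spaces removed, each full name is a single distinct token
shared_clean = tokens(cast_a, True) & tokens(cast_b, True)

print(shared_raw)    # {'johnny'}
print(shared_clean)  # set()
```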

In [38]:
def clean_data(row):
    if isinstance(row, list):
        return [str.lower(i.replace(" ", "")) for i in row]
    else:
        if isinstance(row, str):
            return str.lower(row.replace(" ", ""))
        else:
            return ""
features = ['cast_x', 'keywords', 'director', 'genres']
for feature in features:
    movies_df[feature] = movies_df[feature].apply(clean_data)

In [40]:
def create_soup(features):
    return ' '.join(features['keywords']) + ' ' + ' '.join(features['cast_x']) + ' ' + features['director'] + ' ' + ' '.join(features['genres'])
movies_df["soup"] = movies_df.apply(create_soup, axis=1)
print(movies_df["soup"].head())

0    cultureclash future spacewar samworthington zo...
1    ocean drugabuse exoticisland johnnydepp orland...
2    spy basedonnovel secretagent danielcraig chris...
3    dccomics crimefighter terrorist christianbale ...
4    basedonnovel mars medallion taylorkitsch lynnc...
Name: soup, dtype: object


The recommendation engine suggests films to the user based on metadata. The similarity between films is computed and then used to produce the recommendations. To do this, the text data must first be preprocessed and converted into a matrix of token counts with CountVectorizer.

Common English stop words such as a, an, and the are ignored (stop_words="english").

Cosine similarity is used as the similarity measure.
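To make the measure concrete, here is a pure-Python sketch of the same computation on three toy strings: each string becomes a bag of token counts, and the similarity is the dot product divided by the product of the vector norms. This is not the scikit-learn implementation; the `cosine` helper and the shortened example strings are assumptions made for illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy token strings (hypothetical, shortened versions of the metadata strings)
soup_a = Counter("spacewar future samworthington jamescameron action".split())
soup_b = Counter("spacewar alien jamescameron action".split())
soup_c = Counter("ocean johnnydepp adventure".split())

# 3 shared tokens / (sqrt(5) * sqrt(4))
print(round(cosine(soup_a, soup_b), 3))  # 0.671
# No shared tokens
print(cosine(soup_a, soup_c))            # 0.0
```

A score of 1.0 means identical token counts and 0.0 means no tokens in common, so sorting by this score in descending order puts the most similar films first.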

In [50]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Build the token-count matrix; English stop words are dropped
count_vectorizer = CountVectorizer(stop_words="english")
count_matrix = count_vectorizer.fit_transform(movies_df["soup"])

cosine_sim2 = cosine_similarity(count_matrix, count_matrix)
print(cosine_sim2.shape)

# drop=True avoids "ValueError: cannot insert level_0, already exists" when the cell is re-run
movies_df = movies_df.reset_index(drop=True)
indices = pd.Series(movies_df.index, index=movies_df["title"])

(4803, 4803)

In [47]:
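# Map each title to its row position; keep the first row when a title repeats
indices = pd.Series(movies_df.index, index=movies_df["title"])
indices = indices[~indices.index.duplicated(keep="first")]
print(indices.head())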

title
Avatar                                      0
Pirates of the Caribbean: At World's End    1
Spectre                                     2
The Dark Knight Rises                       3
John Carter                                 4
dtype: int64


The get_recommendations() function takes a movie title and a similarity matrix as input.
To produce recommendations, it:
1. Looks up the movie's index by its title.
2. Gets the list of similarity scores between that movie and all movies.
3. Enumerates them, creating tuples whose first element is the index and whose second is the cosine similarity score.
4. Sorts the list of tuples by similarity score in descending order.
5. Takes the indices of the top 10 movies from the sorted list, skipping the first element because it is the movie itself.
6. Maps these indices to the corresponding titles and returns the list of movies.
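The steps above can be traced on a tiny example. This is a minimal sketch with a hypothetical 4×4 similarity matrix and placeholder titles, not the notebook's data; the comments follow the order of the steps.

```python
titles = ["A", "B", "C", "D"]  # hypothetical movie titles
sim = [                        # hypothetical symmetric similarity matrix
    [1.0, 0.2, 0.7, 0.5],
    [0.2, 1.0, 0.1, 0.3],
    [0.7, 0.1, 1.0, 0.4],
    [0.5, 0.3, 0.4, 1.0],
]
index_of = {title: i for i, title in enumerate(titles)}

def recommend(title, top_n=2):
    idx = index_of[title]                                # look up the index by title
    scores = list(enumerate(sim[idx]))                   # (index, similarity) tuples
    scores.sort(key=lambda pair: pair[1], reverse=True)  # most similar first
    best = [i for i, _ in scores if i != idx][:top_n]    # drop the movie itself, keep top_n
    return [titles[i] for i in best]                     # map indices back to titles

print(recommend("A"))  # ['C', 'D']
```

Filtering on `i != idx` rather than slicing off the first element is a small variation: it still removes the query film when another film ties with it at similarity 1.0.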

In [56]:
def get_recommendations(title, cosine_sim=cosine_sim2):
    idx = indices[title]
    similarity_scores = list(enumerate(cosine_sim[idx]))
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
    similarity_scores = similarity_scores[1:11]  # skip the movie itself
    movies_indices = [ind[0] for ind in similarity_scores]
    movies = movies_df["title"].iloc[movies_indices]
    return movies
print("################ Recommendations #############")
print("Recommendations for Avatar")
print(get_recommendations("Avatar", cosine_sim2))
print()
print("Recommendations for The Avengers")
print(get_recommendations("The Avengers", cosine_sim2))

################ Recommendations #############
Recommendations for Avatar
206                         Clash of the Titans
71        The Mummy: Tomb of the Dragon Emperor
786                           The Monkey King 2
103                   The Sorcerer's Apprentice
131                                     G-Force
215      Fantastic 4: Rise of the Silver Surfer
466                            The Time Machine
715                           The Scorpion King
1      Pirates of the Caribbean: At World's End
5                                  Spider-Man 3
Name: title, dtype: object

Recommendations for The Avengers
7                  Avengers: Age of Ultron
26              Captain America: Civil War
79                              Iron Man 2
169     Captain America: The First Avenger
174                    The Incredible Hulk
85     Captain America: The Winter Soldier
31                              Iron Man 3
33                   X-Men: The Last Stand
68                                Iron Man
94       

Save the model objects to a file.

In [None]:
import pickle
model_objects = {
    'count_vectorizer': count_vectorizer,
    'cosine_sim': cosine_sim2,
    'indices': indices,
    'movie_titles': movies_df['title']
}

with open('movie_recommender_model.pkl', 'wb') as f:
    pickle.dump(model_objects, f)
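Loading the saved objects mirrors the save step. The sketch below round-trips a toy dictionary through a temporary file so that it runs on its own; `toy_model` is a hypothetical stand-in for `model_objects`, and only the file name is taken from the cell above.

```python
import os
import pickle
import tempfile

# Toy stand-in for model_objects; the real dictionary holds the vectorizer,
# the similarity matrix, the title index and the titles
toy_model = {"indices": {"Avatar": 0}, "movie_titles": ["Avatar"]}

path = os.path.join(tempfile.mkdtemp(), "movie_recommender_model.pkl")
with open(path, "wb") as f:
    pickle.dump(toy_model, f)

# Loading: open in binary read mode and call pickle.load
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == toy_model)  # True
```

Unpickling the real file requires pandas and scikit-learn to be importable, and `pickle.load` should only be used on files from a trusted source, since it can execute arbitrary code.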
