## Building a content-based recommender system
---
This notebook covers:
1. Data preprocessing
2. Feature construction (TF-IDF)
3. Building a model based on cosine similarity
4. The `get_recommendations` function
5. Saving artifacts for FastAPI

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pickle

# Load the data (change the path to match your setup if needed)
df = pd.read_csv('../data/raw/imdb_top_1000.csv')  # if your data is a CSV file
df.head()

Unnamed: 0,Poster_Link,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469
1,https://m.media-amazon.com/images/M/MV5BM2MyNj...,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411
2,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444
3,https://m.media-amazon.com/images/M/MV5BMWMwMG...,The Godfather: Part II,1974,A,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000
4,https://m.media-amazon.com/images/M/MV5BMWU4N2...,12 Angry Men,1957,U,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845,4360000


### Preprocessing

In [2]:
# Fill in missing values
df['Meta_score'] = df['Meta_score'].fillna(df['Meta_score'].mean())
df['Certificate'] = df['Certificate'].fillna('Unknown')
df['Gross'] = df['Gross'].fillna('0')

# Combine the useful text features into a single string per movie
df['combined'] = (
    df['Genre'].fillna('') + ' ' +
    df['Director'].fillna('') + ' ' +
    df['Star1'].fillna('') + ' ' +
    df['Star2'].fillna('') + ' ' +
    df['Star3'].fillna('') + ' ' +
    df['Star4'].fillna('') + ' ' +
    df['Overview'].fillna('')
)
df.head(2)

Unnamed: 0,Poster_Link,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross,combined
0,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469,Drama Frank Darabont Tim Robbins Morgan Freema...
1,https://m.media-amazon.com/images/M/MV5BM2MyNj...,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411,"Crime, Drama Francis Ford Coppola Marlon Brand..."
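
One thing to keep in mind about the `combined` column: with its default settings `TfidfVectorizer` lowercases the text and splits it into word tokens, so `Christopher Nolan` becomes the two tokens `christopher` and `nolan`, and two films can look similar only because their casts share a first name. A common workaround is to collapse each name into a single token before concatenating. The sketch below shows the idea on toy rows; the helper `collapse_name`, the `combined_alt` column, and the sample values are illustrative assumptions, not part of the notebook.

```python
import pandas as pd


def collapse_name(name: str) -> str:
    """Turn 'Christopher Nolan' into 'christophernolan' so it stays one token."""
    return str(name).replace(" ", "").lower()


# Toy rows that mirror the notebook's column names (values are for illustration only)
toy = pd.DataFrame({
    "Genre": ["Action, Crime, Drama", "Crime, Drama"],
    "Director": ["Christopher Nolan", "Francis Ford Coppola"],
    "Star1": ["Christian Bale", "Marlon Brando"],
    "Overview": [
        "A vigilante faces a criminal mastermind.",
        "A crime family changes hands.",
    ],
})

# Collapse every name column and join the results with spaces
name_cols = ["Director", "Star1"]
collapsed = pd.Series("", index=toy.index)
for col in name_cols:
    collapsed = collapsed + " " + toy[col].fillna("").map(collapse_name)

toy["combined_alt"] = toy["Genre"].fillna("") + collapsed + " " + toy["Overview"].fillna("")
print(toy["combined_alt"].iloc[0])
```

Whether this improves the recommendations depends on the data, so it is worth comparing both variants on a few known titles.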


### Building the TF-IDF matrix

In [3]:
# Vectorize the combined text, dropping common English stop words
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(df['combined'])

# Pairwise cosine similarity between all movies (dense N x N array)
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

print('TF-IDF matrix:', tfidf_matrix.shape)
print('Similarity matrix:', cosine_sim.shape)

TF-IDF matrix: (1000, 9390)
Similarity matrix: (1000, 1000)
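
To make the similarity step concrete, here is a small sketch on three made-up documents. By default `TfidfVectorizer` L2-normalizes every row (`norm='l2'`), so the plain dot product of two rows already equals their cosine similarity, and `linear_kernel` returns the same matrix as `cosine_similarity`. The toy corpus and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Three toy "movies" described by a few keywords each
docs = [
    "crime drama prison friendship",
    "crime drama mafia family",
    "space adventure science fiction",
]

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(docs)

# Each row has unit L2 norm because of the default norm='l2'
row_norms = np.sqrt(matrix.multiply(matrix).sum(axis=1))

sim_cosine = cosine_similarity(matrix)
sim_linear = linear_kernel(matrix, matrix)  # dot products only, no re-normalization

print(np.round(sim_cosine, 3))
print("Same result:", np.allclose(sim_cosine, sim_linear))
```

The two crime dramas share terms and get a positive score, while the science fiction entry shares none with them and scores zero.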


### Recommendation function

In [4]:
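# Map each title to its row number; keep the first row when a title
# appears more than once so the lookup always returns a single number
indices = pd.Series(df.index, index=df['Series_Title'])
indices = indices[~indices.index.duplicated(keep='first')]

def get_recommendations(title, n=5):
    """Return the n movies most similar to `title` by cosine similarity."""
    if title not in indices.index:
        raise KeyError(f'Title not found: {title!r}')
    idx = indices[title]
    # Rank all movies by similarity to the query, most similar first
    sim_scores = sorted(enumerate(cosine_sim[idx]), key=lambda x: x[1], reverse=True)
    # Exclude the query movie explicitly instead of assuming it is ranked first
    movie_indices = [i for i, _ in sim_scores if i != idx][:n]
    return df[['Series_Title','IMDB_Rating','Genre']].iloc[movie_indices]

# Example
get_recommendations('Inception', 5)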

Unnamed: 0,Series_Title,IMDB_Rating,Genre
155,Batman Begins,8.2,"Action, Adventure"
754,(500) Days of Summer,7.7,"Comedy, Drama, Romance"
21,Interstellar,8.6,"Adventure, Drama, Sci-Fi"
934,Mysterious Skin,7.6,Drama
907,50/50,7.6,"Comedy, Drama, Romance"
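
The lookup `indices[title]` needs an exact match, so a different letter case or a small typo ends in a `KeyError`. One possible way to soften this is to resolve the user's input to a known title first, as in the sketch below. The function `resolve_title`, its `cutoff` parameter, and the sample titles are hypothetical; the fuzzy matching relies on `difflib.get_close_matches` from the standard library.

```python
from difflib import get_close_matches

import pandas as pd


def resolve_title(query, titles, cutoff=0.6):
    """Return the best-matching title from `titles`, or None if nothing is close."""
    # Map lowercase titles back to their original spelling
    lowered = {t.lower(): t for t in titles}
    key = query.strip().lower()
    if key in lowered:
        return lowered[key]
    # Fall back to the single closest title above the similarity cutoff
    matches = get_close_matches(key, list(lowered), n=1, cutoff=cutoff)
    return lowered[matches[0]] if matches else None


titles = pd.Series(["Inception", "Interstellar", "The Godfather"])

print(resolve_title("inception ", titles))
print(resolve_title("Intersellar", titles))
print(resolve_title("zzzz", titles))
```

In the notebook this could be called on `df['Series_Title']` before `get_recommendations`, with a `None` result reported back to the user instead of raising.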


### Saving the model for FastAPI

In [5]:
# Persist the fitted vectorizer, the similarity matrix, and the movie table
with open('tfidf_model.pkl', 'wb') as f:
    pickle.dump(tfidf, f)
with open('cosine_sim.pkl', 'wb') as f:
    pickle.dump(cosine_sim, f)
df.to_pickle('movies_df.pkl')

print('Model and data saved.')

Model and data saved.
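
On the consuming side, a service such as a FastAPI app would load these files once at startup and answer requests from memory. The sketch below shows only the loading and lookup part, under the assumption that the file names match the ones saved above; `load_artifacts` and `recommend` are hypothetical helpers, and the toy matrix stands in for the real artifacts so the example runs on its own. Pickle files should only be loaded from a trusted source, because unpickling can execute arbitrary code.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd


def load_artifacts(directory):
    """Load the similarity matrix and the movie table from `directory`."""
    directory = Path(directory)
    with open(directory / "cosine_sim.pkl", "rb") as f:
        cosine_sim = pickle.load(f)
    movies = pd.read_pickle(directory / "movies_df.pkl")
    return cosine_sim, movies


def recommend(title, cosine_sim, movies, n=5):
    """Return the n titles most similar to `title`, excluding the title itself."""
    positions = np.flatnonzero((movies["Series_Title"] == title).to_numpy())
    if len(positions) == 0:
        raise KeyError(f"Title not found: {title!r}")
    idx = positions[0]
    order = np.argsort(cosine_sim[idx])[::-1]  # most similar first
    top = [i for i in order if i != idx][:n]
    return movies["Series_Title"].iloc[top].tolist()


# Round trip with toy artifacts written to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    toy_movies = pd.DataFrame({"Series_Title": ["A", "B", "C"]})
    toy_sim = np.array([[1.0, 0.8, 0.1], [0.8, 1.0, 0.3], [0.1, 0.3, 1.0]])
    with open(Path(tmp) / "cosine_sim.pkl", "wb") as f:
        pickle.dump(toy_sim, f)
    toy_movies.to_pickle(Path(tmp) / "movies_df.pkl")

    sim, movies = load_artifacts(tmp)
    demo = recommend("A", sim, movies, n=2)

print(demo)
```

An endpoint handler would then call `recommend` with the already loaded objects and turn a `KeyError` into a "not found" response.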
