<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Задание" data-toc-modified-id="Задание-1">Assignment</a></span><ul class="toc-item"><li><span><a href="#Genres" data-toc-modified-id="Genres-1.1">Genres</a></span></li><li><span><a href="#Tags" data-toc-modified-id="Tags-1.2">Tags</a></span></li><li><span><a href="#Прогноз-оценки-фильма" data-toc-modified-id="Прогноз-оценки-фильма-1.3">Movie rating prediction</a></span></li></ul></li></ul></div>

# Assignment

1. Use the [MovieLens](https://grouplens.org/datasets/movielens/latest/) dataset
2. Build recommendations (regression, predicting the rating) from these features:
    * TF-IDF on tags and genres
    * Average ratings (+ median, variance, etc.) of the user and of the movie
3. Evaluate RMSE on the test set
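
Steps 2 and 3 can be sketched on a toy frame. This is only an illustration under stated assumptions: the `toy` data, the `add_stats` and `rmse` helpers, and the "predict the user's mean" baseline are hypothetical and are not part of the notebook. The point it shows is that the per-user and per-movie statistics are computed on the training rows only and then joined onto the test rows, so unseen movies end up with missing features.

```python
import numpy as np
import pandas as pd

# Toy stand-in for ratings.csv (hypothetical values, same column names).
toy = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3, 3, 3],
    "movieId": [10, 20, 30, 10, 20, 10, 30, 40],
    "rating":  [4.0, 3.0, 5.0, 2.0, 3.0, 4.5, 4.0, 1.0],
})

# Hold out the last two rows as a test set; statistics use train rows only.
train, test = toy.iloc[:6], toy.iloc[6:]


def add_stats(frame, train, key, prefix):
    """Join mean/median/variance/count of `rating` per `key` onto `frame`."""
    stats = train.groupby(key)["rating"].agg(["mean", "median", "var", "count"])
    stats = stats.add_prefix(prefix).reset_index()
    return frame.merge(stats, on=key, how="left")


test_feats = add_stats(test, train, "userId", "user_")
test_feats = add_stats(test_feats, train, "movieId", "movie_")


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Naive baseline: the user's training mean, falling back to the global mean.
global_mean = train["rating"].mean()
baseline = test_feats["user_mean"].fillna(global_mean)
score = rmse(test_feats["rating"], baseline)
print(test_feats)
print("baseline RMSE:", score)
```

Movie 40 never appears in the training rows, so its `movie_*` columns are `NaN` after the join; a real model would need a fill strategy for such cold-start cases.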

In [1]:
import pandas as pd
import numpy as np

from tqdm.notebook import tqdm

from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
from sklearn.neighbors import NearestNeighbors

In [2]:
# links = pd.read_csv('ml-latest-small/links.csv')
movies = pd.read_csv('ml-latest-small/movies.csv')
ratings = pd.read_csv('ml-latest-small/ratings.csv')
tags = pd.read_csv('ml-latest-small/tags.csv')

In [3]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [4]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


In [5]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [6]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [7]:
tags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


In [8]:
tags.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3683 entries, 0 to 3682
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   userId     3683 non-null   int64 
 1   movieId    3683 non-null   int64 
 2   tag        3683 non-null   object
 3   timestamp  3683 non-null   int64 
dtypes: int64(3), object(1)
memory usage: 115.2+ KB


In [9]:
# tags.groupby('userId').tag.count()

In [10]:
movies.movieId.nunique()
# ratings.userId.nunique()

9742

In [11]:
ratings.userId.nunique()

610

In [12]:
ratings.userId.count()

100836

In [13]:
ratings.movieId.count()

100836

In [14]:
ratings.movieId.nunique()

9724

## Genres

In [15]:
def change_string(s):
    """Turn 'Sci-Fi|Film-Noir' into 'SciFi FilmNoir' so each genre stays a single token."""
    return ' '.join(s.replace(' ', '').replace('-', '').split('|'))

In [16]:
movie_genres = [change_string(g) for g in movies.genres.values]

movie_genres[:10]

['Adventure Animation Children Comedy Fantasy',
 'Adventure Children Fantasy',
 'Comedy Romance',
 'Comedy Drama Romance',
 'Comedy',
 'Action Crime Thriller',
 'Comedy Romance',
 'Adventure Children',
 'Action',
 'Action Adventure Thriller']

In [17]:
count_vec = CountVectorizer()
X_train_counts = count_vec.fit_transform(movie_genres)

In [18]:
X_train_counts.toarray()

array([[0, 1, 1, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [1, 0, 1, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [19]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

In [20]:
neigh = NearestNeighbors(n_neighbors=7, n_jobs=-1, metric='euclidean') 
neigh.fit(X_train_tfidf)

NearestNeighbors(metric='euclidean', n_jobs=-1, n_neighbors=7)

In [21]:
test = change_string("Adventure|Comedy|Fantasy|Crime")

predict = count_vec.transform([test])
X_tfidf2 = tfidf_transformer.transform(predict)

res = neigh.kneighbors(X_tfidf2, return_distance=True)

In [22]:
res

(array([[0.42079615, 0.53300564, 0.54288608, 0.54288608, 0.54288608,
         0.54288608, 0.54288608]]),
 array([[6774, 9096, 5636, 6723, 3376, 7496, 9717]], dtype=int64))

In [23]:
movies.iloc[res[1][0]]

Unnamed: 0,movieId,title,genres
6774,60074,Hancock (2008),Action|Adventure|Comedy|Crime|Fantasy
9096,143559,L.A. Slasher (2015),Comedy|Crime|Fantasy
5636,27368,Asterix & Obelix: Mission Cleopatra (Astérix &...,Adventure|Comedy|Fantasy
6723,58972,Nim's Island (2008),Adventure|Comedy|Fantasy
3376,4591,Erik the Viking (1989),Adventure|Comedy|Fantasy
7496,82854,Gulliver's Travels (2010),Adventure|Comedy|Fantasy
9717,188833,The Man Who Killed Don Quixote (2018),Adventure|Comedy|Fantasy
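
A note on the metric used above: `TfidfTransformer` L2-normalizes each row by default, so for two such rows `a` and `b` the squared Euclidean distance equals `2 - 2 * cos(a, b)`. Nearest neighbours under `metric='euclidean'` are therefore ranked the same way as under cosine similarity. The sketch below checks this on a few genre strings; it uses `TfidfVectorizer` (which combines the count and TF-IDF steps) as a shortcut, and the variable names are illustrative rather than taken from the notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

# A handful of genre strings in the same format as movie_genres.
genre_docs = [
    "Adventure Animation Children Comedy Fantasy",
    "Adventure Children Fantasy",
    "Comedy Romance",
    "Action Crime Thriller",
]

vectorizer = TfidfVectorizer()  # default norm='l2'
X = vectorizer.fit_transform(genre_docs)
query = vectorizer.transform(["Adventure Comedy Fantasy Crime"])

# Every row has unit length because of the default L2 normalization.
row_norms = np.linalg.norm(X.toarray(), axis=1)

eucl = euclidean_distances(query, X).ravel()
cos = cosine_similarity(query, X).ravel()

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
identity_holds = np.allclose(eucl ** 2, 2 - 2 * cos)
print(row_norms, identity_holds)
```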


## Tags

In [24]:
movies_with_tags = movies.join(tags.set_index('movieId'), on='movieId')

In [25]:
movies_with_tags.head()

Unnamed: 0,movieId,title,genres,userId,tag,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,336.0,pixar,1139046000.0
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,474.0,pixar,1137207000.0
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,567.0,fun,1525286000.0
1,2,Jumanji (1995),Adventure|Children|Fantasy,62.0,fantasy,1528844000.0
1,2,Jumanji (1995),Adventure|Children|Fantasy,62.0,magic board game,1528844000.0


In [26]:
movies_with_tags[movies_with_tags.title == 'Toy Story (1995)']

Unnamed: 0,movieId,title,genres,userId,tag,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,336.0,pixar,1139046000.0
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,474.0,pixar,1137207000.0
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,567.0,fun,1525286000.0


In [27]:
movies_with_tags.tag.unique()

array(['pixar', 'fun', 'fantasy', ..., 'star wars', 'gintama', 'remaster'],
      dtype=object)

In [28]:
movies_with_tags.tag.unique().shape

(1590,)

In [29]:
movies_with_tags.dropna(inplace=True)

In [30]:
movies_with_tags.tag.unique().shape

(1589,)

In [31]:
tag_strings = []
movies_list = []

for movie, group in tqdm(movies_with_tags.groupby('title')):
    tag_strings.append(' '.join([str(s).replace(' ', '').replace('-', '') for s in group.tag.values]))
    movies_list.append(movie)
    

  0%|          | 0/1572 [00:00<?, ?it/s]

In [32]:
movies_list[:5]

['(500) Days of Summer (2009)',
 '...And Justice for All (1979)',
 '10 Cloverfield Lane (2016)',
 '10 Things I Hate About You (1999)',
 '101 Dalmatians (1996)']

In [33]:
tag_strings[:5]

['artistic Funny humorous inspiring intelligent quirky romance ZooeyDeschanel',
 'lawyers',
 'creepy suspense',
 'Shakespearesortof',
 'dogs remake']

In [34]:
count_vec = CountVectorizer()
X_train_counts = count_vec.fit_transform(tag_strings)

In [35]:
X_train_counts.toarray()

array([[0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [36]:
X_train_counts.shape

(1572, 1472)

In [37]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

In [38]:
neigh = NearestNeighbors(n_neighbors=10, n_jobs=-1, metric='euclidean') 
neigh.fit(X_train_tfidf)

NearestNeighbors(metric='euclidean', n_jobs=-1, n_neighbors=10)

In [39]:
# Position of the movie in movies_list (the same row index as in X_train_tfidf)
print(movies_list.index('John Wick: Chapter Two (2017)'))

709


In [40]:
tag_strings[709]

'action darkhero guntactics hitman KeanuReeves organizedcrime secretsociety HeroicBloodshed'

In [41]:
test = change_string('KeanuReeves')

predict = count_vec.transform([test])
X_tfidf2 = tfidf_transformer.transform(predict)

res = neigh.kneighbors(X_tfidf2, return_distance=True)

In [42]:
res

(array([[1.        , 1.        , 1.12538979, 1.15797883, 1.41421356,
         1.41421356, 1.41421356, 1.41421356, 1.41421356, 1.41421356]]),
 array([[ 661,  822,  709,  708, 1091,  132,  315,  768,  691, 1317]],
       dtype=int64))

In [43]:
for i in res[1][0]:
    print(movies_list[i])

In a Lonely Place (1950)
Magnolia (1999)
John Wick: Chapter Two (2017)
John Wick (2014)
Pulp Fiction (1994)
Believer, The (2001)
Da Vinci Code, The (2006)
League of Extraordinary Gentlemen, The (a.k.a. LXG) (2003)
It Comes at Night (2017)
Star Wars: Episode VI - Return of the Jedi (1983)


## Movie rating prediction

In [51]:
df = (
    ratings.drop(columns=['timestamp'])
    .merge(movies, on='movieId', how='left')
    .merge(tags.drop(columns=['timestamp']), on=['userId', 'movieId'], how='left')
)

In [52]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 102677 entries, 0 to 102676
Data columns (total 6 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   userId   102677 non-null  int64  
 1   movieId  102677 non-null  int64  
 2   rating   102677 non-null  float64
 3   title    102677 non-null  object 
 4   genres   102677 non-null  object 
 5   tag      3476 non-null    object 
dtypes: float64(1), int64(2), object(3)
memory usage: 5.5+ MB


In [53]:
df[240:250]

Unnamed: 0,userId,movieId,rating,title,genres,tag
240,2,58559,4.5,"Dark Knight, The (2008)",Action|Crime|Drama|IMAX,
241,2,60756,5.0,Step Brothers (2008),Comedy,funny
242,2,60756,5.0,Step Brothers (2008),Comedy,Highly quotable
243,2,60756,5.0,Step Brothers (2008),Comedy,will ferrell
244,2,68157,4.5,Inglourious Basterds (2009),Action|Drama|War,
245,2,71535,3.0,Zombieland (2009),Action|Comedy|Horror,
246,2,74458,4.0,Shutter Island (2010),Drama|Mystery|Thriller,
247,2,77455,3.0,Exit Through the Gift Shop (2010),Comedy|Documentary,
248,2,79132,4.0,Inception (2010),Action|Crime|Drama|Mystery|Sci-Fi|Thriller|IMAX,
249,2,80489,4.5,"Town, The (2010)",Crime|Drama|Thriller,


In [54]:
df.tag.isna().sum()

99201
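
The merged frame has 102677 rows against 100836 ratings because a user can attach several tags to the same movie; rows 241-243 above repeat one Step Brothers rating three times. If one row per rating is wanted (an assumption about the intended regression setup, not something the notebook states), the tags can be collapsed per `(userId, movieId)` pair before merging. The sketch below uses a few rows copied from the outputs above; `squash` and `tags_per_pair` are hypothetical names.

```python
import pandas as pd

# Rows copied from the outputs above (timestamps dropped).
toy_ratings = pd.DataFrame({
    "userId": [2, 2, 2],
    "movieId": [58559, 60756, 68157],
    "rating": [4.5, 5.0, 4.5],
})
toy_tags = pd.DataFrame({
    "userId": [2, 2, 2],
    "movieId": [60756, 60756, 60756],
    "tag": ["funny", "Highly quotable", "will ferrell"],
})


def squash(tag):
    """Make a multi-word tag a single token, as in the Tags section."""
    return str(tag).replace(" ", "").replace("-", "")


# One space-separated tag string per (userId, movieId) pair.
tags_per_pair = (
    toy_tags.assign(tag=toy_tags["tag"].map(squash))
    .groupby(["userId", "movieId"])["tag"]
    .agg(" ".join)
    .reset_index()
)

# The left merge now keeps exactly one row per rating.
merged = toy_ratings.merge(tags_per_pair, on=["userId", "movieId"], how="left")
merged["tag"] = merged["tag"].fillna("")  # untagged ratings get an empty string
print(merged)
```

The resulting `tag` column can be passed to a vectorizer directly, with empty strings producing all-zero rows.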

In [None]:
# def missing_values_table(df):
#     mis_val = df.isnull().sum()
#     mis_val_percent = 100 * df.isnull().sum() / len(df)
#     mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
#     mis_val_table_ren_columns = mis_val_table.rename(
#     columns = {0 : 'Missing Values', 1 : '% of Total Values'})
#     mis_val_table_ren_columns = mis_val_table_ren_columns[
#         mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(
#     '% of Total Values', ascending=False).round(1)
#     print ("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"      
#         "There are " + str(mis_val_table_ren_columns.shape[0]) +
#             " columns that have missing values.")
#     return mis_val_table_ren_columns