## Summaries

1. Non-personalized

Non-personalized recommendation algorithms are simple yet powerful. For a user, it is qute interesting to
see which items are popular or in trend, based on other users rating. Non-personalized recommendations are 
exceptionally important in solving the cold-start problem, when user has no decision history.
Real life example: during Netflix sign up user is presented with some popular movie titles (non-personalized)
that enable making personalized recommendations.

2. Content based

Content based recommendation are really shining when some suggestions should be made for similar items.
For example, in YouTube user is presented with a video suggestion after the currently watched one is finished.
There could be many options of "content". Some of them are textual (tags, description), some could be not so 
trivial (features of images, sound etc). One of the most important things in content based recommenders is
feature engineering. In the current task just by using stemmed "soup" the recommender was able to suggest much 
more meaningful movies, based not only on some description but also on cast, director, and keywords.

3. Collaborative filtering

Collaborative filtering recommendations are used when recommendation should be personalized. It is based not only
on a single user history but also on similarities of preferences among all the users on the given items. 
4. Hybrid

Conclusion:



In [428]:
import os
import json
from time import time
from ast import literal_eval

import pandas as pd
import numpy as np
from scipy import stats
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet
from surprise import Reader, Dataset, SVD, KNNBasic, evaluate
from surprise.model_selection import cross_validate, train_test_split
import warnings; warnings.simplefilter('ignore')
import matplotlib.pyplot as plt
import re
%matplotlib inline
import seaborn as sns

#### Added functions

In [429]:
def make_keyword(string):
    return re.sub('[^a-z0-9]+', '', string.lower())

def print_soup(df, title):
    print('Soup for "{}": {}'.format(title, df[df['title'] == title]['soup'].values[0]))
    
def print_description(df, title):
    print('Description for "{}": {}'.format(title, df[df['title'] == title]['description'].values[0]))

In [430]:
# load data
meta_df = pd.read_csv('movies_metadata.csv')
# parse genre feature
meta_df['genres'] = meta_df['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
# parse date
meta_df['year'] = pd.to_datetime(meta_df['release_date'], errors='coerce').apply(lambda x: str(x).split('-')[0] if x != np.nan else np.nan)
# stack genre and add it to dataframe again
stacked_genre_df = meta_df.apply(lambda x: pd.Series(x['genres']),axis=1).stack().reset_index(level=1, drop=True)
stacked_genre_df.name = 'genre'
stacked_genre_df = meta_df.drop('genres', axis=1).join(stacked_genre_df)

## Non-personalized recommendations

__ToDo__: Implement non-personalized recommendations which will return top 10 movies for a genre.
Come up with specific average ratio, and use it to rank videos.
(Use video_count, video_average features from meta_df dataframe)


IMDB weighted rating is used:

Weighted Rating = $(\frac{v}{v + m} . R) + (\frac{m}{v + m} . C)$

where,
* *v* is the number of votes for the movie
* *m* is the minimum votes required to be listed in the chart
* *R* is the average rating of the movie
* *C* is the mean vote across the whole report

In [431]:
def get_weighted_rating(v, m, R, C):
    return (v / (v + m) * R) + (m / (v + m) * C)

def get_genre_nonpersonalized_recommendations(df, genre, percentile=0.85):
    genre_df = df[df['genre'] == genre].copy()
    C = genre_df['vote_average'].mean()
    m = genre_df['vote_count'].quantile(percentile)
    genre_df = genre_df[genre_df['vote_count'] > m]
    genre_df['weighted_rating'] = genre_df.apply(
        lambda x: get_weighted_rating(x['vote_count'], m, x['vote_average'], C), axis=1)
    return genre_df.nlargest(10, 'weighted_rating')[['title', 'year']]

In [432]:
get_genre_nonpersonalized_recommendations(stacked_genre_df, 'Comedy')

Unnamed: 0,title,year
10309,Dilwale Dulhania Le Jayenge,1995
2211,Life Is Beautiful,1997
351,Forrest Gump,1994
18465,The Intouchables,2011
1225,Back to the Future,1985
22841,The Grand Budapest Hotel,2014
22131,The Wolf of Wall Street,2013
30315,Inside Out,2015
40882,La La Land,2016
732,Dr. Strangelove or: How I Learned to Stop Worr...,1964


In [433]:
get_genre_nonpersonalized_recommendations(stacked_genre_df, 'Animation')

Unnamed: 0,title,year
5481,Spirited Away,2001
40251,Your Name.,2016
9698,Howl's Moving Castle,2004
2884,Princess Mononoke,1997
359,The Lion King,1994
30315,Inside Out,2015
5553,Grave of the Fireflies,1988
5833,My Neighbor Totoro,1988
13724,Up,2009
12704,WALL·E,2008


In [434]:
get_genre_nonpersonalized_recommendations(stacked_genre_df, 'Family')

Unnamed: 0,title,year
5481,Spirited Away,2001
1225,Back to the Future,1985
359,The Lion King,1994
30315,Inside Out,2015
17437,Harry Potter and the Deathly Hallows: Part 2,2011
13724,Up,2009
12704,WALL·E,2008
24455,Big Hero 6,2014
5833,My Neighbor Totoro,1988
7725,Harry Potter and the Prisoner of Azkaban,2004


## Item-item content based recommendations

__ToDo__: implement functions to perform item-item description based recommendations

In [435]:
# load ID from smaller set
links_small = pd.read_csv('links_small.csv')
links_small = links_small[links_small['tmdbId'].notnull()]['tmdbId'].astype('int')
# drop rows with broken ID values
meta_df = meta_df.drop([19730, 29503, 35587])
# parse movie ID to int
meta_df['id'] = meta_df['id'].astype('int')
# create small dataframe
small_meta_df = meta_df[meta_df['id'].isin(links_small)]
small_meta_df.drop_duplicates('title', inplace=True)
small_meta_df.dropna(inplace=True)

In [436]:
# create descriptions
small_meta_df['tagline'] = small_meta_df['tagline'].fillna('')
small_meta_df['description'] = small_meta_df['overview'] + small_meta_df['tagline']
small_meta_df['description'] = small_meta_df['description'].fillna('')

In [448]:
def create_cosine_matrix(df):
    tfidf = TfidfVectorizer()
    descriptions = tfidf.fit_transform(df['description']).todense()
    
    return linear_kernel(descriptions, descriptions)

def get_item_content_recommendations(df, cosine_sim, title, top_n=None):
    if top_n == None:
        top_n = len(cosine_sim)
    top_n = min(top_n,len(cosine_sim))

    np.fill_diagonal(cosine_sim, 0)
    index = np.where((df['title'] == title).values)
    similarity_indexes_sorted_asc = np.argsort(np.squeeze(cosine_sim[index,:]))
    
    top_indexes_desc = np.flip(similarity_indexes_sorted_asc[-top_n:], axis=0)
#     print((df['title'] == title).values)
    return df.iloc[top_indexes_desc]['title']

In [446]:
cosine_matrix = create_cosine_matrix(small_meta_df)

In [450]:
small_meta_df['title']

9                                               GoldenEye
68                                                 Friday
69                                    From Dusk Till Dawn
153                                      Blue in the Face
178               Mighty Morphin Power Rangers: The Movie
219                                                Clerks
256                                             Star Wars
309                                     The Swan Princess
359                                         The Lion King
475                                         Jurassic Park
536                                          Blade Runner
579                                            Home Alone
581                                               Aladdin
588                                  Beauty and the Beast
638                                   Mission: Impossible
727                                         A Close Shave
758                                         Trainspotting
825           

In [449]:
recommendations = get_item_content_recommendations(small_meta_df, cosine_matrix, 'Toy Story', top_n=20)
recommendations

[False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False Fa

ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

#### Recommendation for "Toy Story" does not exactly match what was expected. It is mainly used name Andy to get similar movies.
#### Which resulted in suggestion "The 40 Year Old Virgin" which is unappropriate

In [None]:
print_description(small_meta_df, 'Toy Story')
for recommendation_title in recommendations.values[:5]:
    print_description(small_meta_df, recommendation_title)

In [None]:
recommendations = get_item_content_recommendations(small_meta_df, cosine_matrix, 'Africa Screams', top_n=20)
recommendations

In [None]:
print_description(small_meta_df, 'Africa Screams')
for recommendation_title in recommendations.values[:5]:
    print_description(small_meta_df, recommendation_title)

__ToDo__: implement functions to perform item-item keywords based recommendations

In [None]:
# load credits and keywords data
credits = pd.read_csv('credits.csv')
keywords = pd.read_csv('keywords.csv')
# parse ID
keywords['id'] = keywords['id'].astype('int')
credits['id'] = credits['id'].astype('int')
meta_df['id'] = meta_df['id'].astype('int')
# merge existing dataframe with credits and keywords
meta_df = meta_df.merge(credits, on='id')
meta_df = meta_df.merge(keywords, on='id')
# take only small subset
small_meta_df = meta_df[meta_df['id'].isin(links_small)]
small_meta_df.drop_duplicates('title', inplace=True)

### <font color='green'>Used a proper keyword-making function so that they are consistent (for cast, director, etc.)</font>

In [None]:
# convert parse to json and keep top 3 from cast
small_meta_df['cast'] = small_meta_df['cast'].apply(literal_eval)
small_meta_df['cast'] = small_meta_df['cast'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
small_meta_df['cast'] = small_meta_df['cast'].apply(lambda x: x[:3] if len(x) >=3 else x)

# join cast name and surname
small_meta_df['cast'] = small_meta_df['cast'].apply(lambda cast: [make_keyword(x) for x in cast])

# parse crew
small_meta_df['crew'] = small_meta_df['crew'].apply(literal_eval)

# measure cast and crew sizes
small_meta_df['cast_size'] = small_meta_df['cast'].apply(lambda x: len(x))
small_meta_df['crew_size'] = small_meta_df['crew'].apply(lambda x: len(x))

### <font color='green'>Fixed a bug</font> 
In get director() -- in list comprehension director of the first movie was always taken

In [None]:
# find director
def get_director(crew):
    names = [x['name'] for x in crew if x['job']=='Director']
    return np.nan if not names else names[0]

small_meta_df['director'] = small_meta_df['crew'].apply(get_director)

small_meta_df['director'] = small_meta_df['director'].astype('str').apply(make_keyword)
small_meta_df['director'] = small_meta_df['director'].apply(lambda x: [x, x, x])

### <font color='green'>Fixed the bugs:</font> 
1. In filtering keywords -- intersection was done on words which were a result of value_counts(). Should be: words = words.index.values
2. Stemmer was never initialized (it was done after the usage actually) and it did not crash because keywords were
always empty ([stemmer.stem(i) for i in x])

In [None]:
def filter_keywords(x):
    return list(set(x).intersection(words))

small_meta_df['keywords'] = small_meta_df['keywords'].apply(literal_eval)
small_meta_df['keywords'] = small_meta_df['keywords'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

# keep only frequent words
words = small_meta_df.apply(lambda x: pd.Series(x['keywords']),axis=1).stack().reset_index(level=1, drop=True)
words.name = 'keyword'
words = words.value_counts()
words = words[words > 1]
words = words.index.values

# create stemmer
stemmer = SnowballStemmer('english')

# filter keywords
small_meta_df['keywords'] = small_meta_df['keywords'].apply(filter_keywords)
small_meta_df['keywords'] = small_meta_df['keywords'].apply(lambda x: [stemmer.stem(i) for i in x])
small_meta_df['keywords'] = small_meta_df['keywords'].apply(lambda x: [make_keyword(i) for i in x])

In [None]:
small_meta_df['soup'] = small_meta_df['keywords'] + small_meta_df['cast'] + small_meta_df['director'] + small_meta_df['genres']
small_meta_df['soup'] = small_meta_df['soup'].apply(lambda x: ' '.join(x))

In [None]:
def create_cosine_matrix_for_words(df):
    # use CountVectorizer and cosine_similarity
    stemmed = df['soup'].apply(lambda sentence: ' '.join([stemmer.stem(word) for word in (sentence).split()]))
    vectorizer = CountVectorizer()
    count_vectorized = vectorizer.fit_transform(stemmed)
    cosine_matrix = cosine_similarity(count_vectorized, count_vectorized)
    return cosine_matrix

In [None]:
cosine_matrix = create_cosine_matrix_for_words(small_meta_df)

In [None]:
recommendations = get_item_content_recommendations(small_meta_df, cosine_matrix, 'Toy Story', top_n=20)
recommendations

### <font color='green'>Here we have much better suggestion of kids movies. All thanks to 'soup'!</font>

In [None]:
print_soup(small_meta_df, 'Toy Story')
for recommendation_title in recommendations.values[:5]:
    print_soup(small_meta_df, recommendation_title)

In [None]:
recommendations = get_item_content_recommendations(small_meta_df, cosine_matrix, 'Africa Screams', top_n=20)
recommendations

In [None]:
print_soup(small_meta_df, 'Africa Screams')
for recommendation_title in recommendations.values[:5]:
    print_soup(small_meta_df, recommendation_title)

## Collaborative filtering / Matrix factorization

In [None]:
ratings = pd.read_csv('ratings_small.csv')
ratings.head()

In [None]:
links_small = pd.read_csv('links_small.csv')
ratings = ratings.merge(links_small, how='right', on='movieId')
ratings.drop(ratings[ratings['tmdbId'].isnull()].index.values, inplace=True)
ratings['tmdbId'] = ratings['tmdbId'].astype(int)

In [None]:
ratings = ratings.merge(meta_df, how='left', left_on='tmdbId', right_on='id')

In [None]:
ratings.head()

In [None]:
ratings.describe()['rating']

In [None]:
def run_complex_model(ratings_df, model_class, train_on_all_ratings=False):
    # use everything imported from surprise library at the beginning
    # if train_on_all_ratings=True - train on all ratings
    # if train_on_all_ratings=False - split data on 5 folds and do evaluation
    
    if model_class == 'SVD':
        algo = SVD()
    elif model_class == 'KNN':
        algo = KNNBasic()
    else: 
        assert False, f'Algorithm {model_class} is not supported'
        
    reader = Reader(rating_scale=(0.5, 5))
    data = Dataset.load_from_df(ratings_df[['userId', 'movieId', 'rating']], reader)

    if train_on_all_ratings:
        data = data.build_full_trainset()
        algo.fit(data)
        return algo        
    else:
        cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)
        return None

In [None]:
run_complex_model(ratings, 'SVD')

In [None]:
run_complex_model(ratings, 'KNN')

In [None]:
knn_model = run_complex_model(ratings, 'KNN', train_on_all_ratings=True)

In [None]:
knn_model.predict(10, 50)

In [None]:
knn_model.predict(10, 152)

In [None]:
knn_model.predict(10, 40)

### <font color='green'>Examine what suggestions we can expect for the given user</font>

In [None]:
ratings[ratings['userId'] == 10][['title', 'rating']].sort_values('rating', ascending=False).head(10)

In [None]:
user_id = 10
movie_ids_to_predict = ratings[~ratings['movieId'].isin(ratings[ratings['userId'] == 10])]['movieId'].values
predictions = np.array([knn_model.predict(10, x).est for x in movie_ids_to_predict])

In [None]:
predictions_for_user = pd.DataFrame(data={'movie_title': ratings['title'],
                                         'predicted_rating': predictions})
predictions_for_user.sort_values('predicted_rating', ascending=False).head(10)

## Hybrid recommendations

In [None]:
svd_model = run_complex_model(ratings, 'SVD', train_on_all_ratings=True)

In [None]:
links_small = pd.read_csv('links_small.csv')
links_small = links_small.merge(small_meta_df[['title','id']], left_on='tmdbId', right_on='id')
id_to_title = links_small.copy().set_index('movieId')
title_to_id = links_small.copy().set_index('title')

In [413]:
def get_hybrid_recommendations(small_meta_df, ratings, userId, title):
    similar_movie_titles = get_item_content_recommendations(small_meta_df, cosine_matrix, title, top_n=100)
    similar_movie_ids = title_to_ids.loc[similar_movie_titles]['movieId']
    
    print(len(similar_movie_ids))
    predicted_ratings = np.array([knn_model.predict(userId, movie_id).est for movie_id in similar_movie_ids])
    predicted_ratings_index_desc = predicted_ratings.argsort()[::-1]
    recommended_movie_ids = similar_movie_ids[predicted_ratings_index_desc]
    recommended_movie_titles = id_to_title.loc[recommended_movie_ids]['title'].values
    recommendation = pd.DataFrame(data={'movie_title': recommended_movie_titles, 
                                        'predicted_rating': predicted_ratings[predicted_ratings_index_desc]})
    print(recommendation[:10])

In [414]:
get_hybrid_recommendations(small_meta_df, ratings, 10, 'Central Intelligence')

ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

In [None]:
get_hybrid_recommendations(small_meta_df, ratings, 10, 'Assassins')

In [None]:
get_hybrid_recommendations(small_meta_df, ratings, 101, 'Central Intelligence')

In [None]:
get_hybrid_recommendations(small_meta_df, ratings, 101, 'Assassins')