# Movie Recommendation System

A set of systems for providing movie recommendations to users.

Uses the MovieLens Latest Datasets from https://grouplens.org/datasets/movielens/latest/

* **Full:** contains 27,000,000 ratings and 1,100,000 tag applications applied to 58,000 movies by 280,000 users. Includes tag genome data with 14 million relevance scores across 1,100 tags. Last updated 9/2018.
* **Small:** contains 100,000 ratings and 3,600 tag applications applied to 9,000 movies by 600 users. Last updated 9/2018.

The simple recommender uses the full dataset. All systems with more personalized recommendations will use the small dataset.

In [1]:
import pandas as pd
import numpy as np
from ast import literal_eval

## 1. Simple Recommender System

This system gives general recommendations based only on popularity and genre; it is not personalized to the user.

This system operates under the assumption that if a movie is popular and critically acclaimed, it is more likely that it will appeal to the average user.

To implement this model, the movies are sorted according to ratings and popularity.

In [2]:
try:
    md = pd.read_csv('/Users/awikasaeng/Downloads/Movie-Recommender-Systems-main/input/movies-data/metadata_small.csv',
                     dtype={'id': int, 'vote_count': int, 'vote_average': float})
except FileNotFoundError:
    cols = ['id', 'title', 'release_date', 'genres', 'vote_count', 'vote_average', 'popularity']
    md = pd.read_csv('../input/the-movies-dataset/movies_metadata.csv',
                     #skiprows=[19731, 29504, 35588],  #skip error data
                     usecols=cols)
    #extract genres
    md['genres'] = md['genres'].apply(lambda x: [i['name'] for i in literal_eval(x)])
    md = md[md['title'].notnull()].astype({'vote_count': int})
    md = md[cols]
    md.to_csv('../input/movies-data/metadata_small.csv', index=False)

# display movies with genres
md.head()

Unnamed: 0,id,title,release_date,genres,vote_count,vote_average,popularity
0,862,Toy Story,1995-10-30,"['Animation', 'Comedy', 'Family']",5415,7.7,21.946943
1,8844,Jumanji,1995-12-15,"['Adventure', 'Fantasy', 'Family']",2413,6.9,17.015539
2,15602,Grumpier Old Men,1995-12-22,"['Romance', 'Comedy']",92,6.5,11.7129
3,31357,Waiting to Exhale,1995-12-22,"['Comedy', 'Drama', 'Romance']",34,6.1,3.859495
4,11862,Father of the Bride Part II,1995-02-10,['Comedy'],173,5.7,8.387519


IMDb ranks its charts with its own weighted rating formula. We use it here because it is a more reliable measure than the raw vote average: a movie with only a handful of votes cannot outrank one with thousands.
The formula is:

Weighted Rating (WR) = $\left(\frac{v}{v + m} \cdot R\right) + \left(\frac{m}{v + m} \cdot C\right)$

* *v* = # of votes for the movie
* *m* = minimum votes required to be listed in the chart
* *R* = raw average rating of the movie
* *C* = mean rating across all movies

For *m*, we will use the 95th percentile of vote counts, so every movie in our chart is among the top 5% of movies by number of votes.

In [3]:
print(f"C = {md['vote_average'].mean()}")
print(f"m95 = {md['vote_count'].quantile(0.95)}")
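# count the movies at or above the 95th-percentile vote count (434 for this data)
md[md['vote_count'] >= md['vote_count'].quantile(0.95)].shape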

C = 5.618207215133889
m95 = 434.0


(2274, 7)

The mean rating of movies in our dataset is **5.618** on a rating scale of 1 to 10.

The movies at or above the 95th percentile have at least **434** votes, and **2,274** of them meet this criterion.

In [4]:
def weighted_rating(df, percentile=0.95):
    C = df['vote_average'].mean()
    m = df['vote_count'].quantile(percentile)
    qualified = df[df['vote_count'] >= m].copy()
    R = qualified['vote_average']
    v = qualified['vote_count']
    qualified['weighted_rating'] = (v / (v + m) * R) + (m / (m + v) * C)
    qualified = qualified.sort_values('weighted_rating', ascending=False)
    return qualified

### Top Movies

In [5]:
weighted_rating(md).head(15)

Unnamed: 0,id,title,release_date,genres,vote_count,vote_average,popularity,weighted_rating
314,278,The Shawshank Redemption,1994-09-23,"['Drama', 'Crime']",8358,8.5,51.645403,8.357746
834,238,The Godfather,1972-03-14,"['Drama', 'Crime']",6024,8.5,41.109264,8.306334
12481,155,The Dark Knight,2008-07-16,"['Drama', 'Action', 'Crime', 'Thriller']",12269,8.3,123.167259,8.208376
2843,550,Fight Club,1999-10-15,['Drama'],9678,8.3,63.869599,8.184899
292,680,Pulp Fiction,1994-09-10,"['Thriller', 'Crime']",8670,8.3,140.950236,8.172155
351,13,Forrest Gump,1994-07-06,"['Comedy', 'Drama', 'Romance']",8147,8.2,48.307194,8.069421
522,424,Schindler's List,1993-11-29,"['Drama', 'History', 'War']",4436,8.3,41.725123,8.061007
23671,244786,Whiplash,2014-10-10,['Drama'],4376,8.3,64.29999,8.058025
5481,129,Spirited Away,2001-07-20,"['Fantasy', 'Adventure', 'Animation', 'Family']",3968,8.3,41.048867,8.035598
1154,1891,The Empire Strikes Back,1980-05-17,"['Adventure', 'Action', 'Science Fiction']",5998,8.2,19.470959,8.025793


Here the movies are sorted by weighted rating. Most of the top movies are dramas: 7 of the 10 rows shown include Drama among their genres.
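As a sanity check, the top row of the chart can be reproduced by hand. The sketch below plugs values printed in this notebook (C and m from cell 3, plus the vote count and raw average for The Shawshank Redemption) into the weighted rating formula. The variable names are only for illustration.

```python
# Values copied from this notebook's outputs.
C = 5.618207215133889   # mean vote_average across all movies
m = 434.0               # 95th-percentile vote count
v, R = 8358, 8.5        # The Shawshank Redemption: vote count and raw average

# Weighted rating: the more votes a movie has, the closer WR stays to R;
# with few votes it is pulled toward the global mean C.
wr = (v / (v + m)) * R + (m / (v + m)) * C
print(round(wr, 6))  # 8.357746, matching the weighted_rating column
```

With 8,358 votes against a threshold of 434, about 95% of the weight goes to the movie's own average, which is why the result sits only slightly below 8.5.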

We can also build charts for specific genres. Since drama movies dominated the 95th percentile, let's try using a lower percentile to get more diverse movies.

In [6]:
def build_chart(genre, percentile=0.85):
    df = md[md.genres.apply(lambda x: genre in x)]
    return weighted_rating(df, percentile)

### Top Comedy Movies

In [7]:
build_chart('Comedy').head(10)

Unnamed: 0,id,title,release_date,genres,vote_count,vote_average,popularity,weighted_rating
10309,19404,Dilwale Dulhania Le Jayenge,1995-10-20,"['Comedy', 'Drama', 'Romance']",661,9.1,34.457024,8.605916
2211,637,Life Is Beautiful,1997-12-20,"['Comedy', 'Drama']",3643,8.3,39.39497,8.222252
351,13,Forrest Gump,1994-07-06,"['Comedy', 'Drama', 'Romance']",8147,8.2,48.307194,8.166014
18465,77338,The Intouchables,2011-11-02,"['Drama', 'Comedy']",5410,8.2,16.086919,8.149172
1225,105,Back to the Future,1985-07-03,"['Adventure', 'Comedy', 'Science Fiction', 'Fa...",6239,8.0,25.778509,7.959364
22839,120467,The Grand Budapest Hotel,2014-02-26,"['Comedy', 'Drama']",4644,8.0,14.442048,7.945739
22129,106646,The Wolf of Wall Street,2013-12-25,"['Crime', 'Drama', 'Comedy']",6768,7.9,16.382422,7.86413
30311,150540,Inside Out,2015-06-09,"['Drama', 'Comedy', 'Animation', 'Family']",6737,7.9,23.985587,7.863968
40876,313369,La La Land,2016-11-29,"['Comedy', 'Drama', 'Music', 'Romance']",4745,7.9,19.681686,7.849193
732,935,Dr. Strangelove or: How I Learned to Stop Worr...,1964-01-29,"['Drama', 'Comedy', 'War']",1472,8.0,9.80398,7.837147


This chart lists the top comedy movies whose vote count is at or above the **85th** percentile among comedies.
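One caveat may be worth checking. When the metadata is loaded from the cached `metadata_small.csv`, the `genres` column comes back as a string such as `"['Comedy', 'Drama']"`, so `genre in x` inside `build_chart` is a substring test instead of a list-membership test. The sketch below uses a small hypothetical frame (not the real data) to show the difference, and how parsing with `literal_eval` gives exact matching.

```python
import pandas as pd
from ast import literal_eval

# Hypothetical two-row frame with genres stored as strings, as in a reloaded CSV.
toy = pd.DataFrame({
    'title': ['Movie A', 'Movie B'],
    'genres': ["['Science Fiction']", "['Comedy', 'Drama']"],
})

# Substring test: 'Fiction' is found inside the text 'Science Fiction'.
substring_hits = toy[toy['genres'].apply(lambda x: 'Fiction' in x)]

# Exact test: parse each string into a list first, then check membership.
parsed = toy['genres'].apply(literal_eval)
exact_hits = toy[parsed.apply(lambda g: 'Fiction' in g)]

print(len(substring_hits), len(exact_hits))  # 1 0
```

For full genre names such as 'Comedy' both tests agree, so the charts above are unaffected; the distinction only matters for partial names.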

## 2. Content Based Recommender

Although the simple recommender is useful for finding the most popular movies, it is not personalized to specific users. A better way to cater to a user's tastes is to look at the attributes of the movies they like, such as the director and the cast, since these often follow a pattern. By computing a similarity score between a movie the user is known to like and a candidate recommendation, the system can judge whether the candidate is a good fit.

Here are two implementations of a content-based recommender. The first uses the film's **synopsis and tagline** (the tagline is a short slogan for the movie). The second uses the **director, cast, keywords, and genres** associated with the movie.

In [8]:
try:  #small_movies_data
    smd = pd.read_csv('/Users/awikasaeng/Downloads/Movie-Recommender-Systems-main/input/movies-data/description.csv')
except FileNotFoundError:
    md = pd.read_csv('/Users/awikasaeng/Downloads/Movie-Recommender-Systems-main/input/the-movies-dataset/movies_metadata.csv',
                     skiprows=[19731, 29504, 35588],  #skip error data
                     dtype={'id': int},
                     usecols=['title', 'id', 'overview', 'tagline'])
    links_small = pd.read_csv('/Users/awikasaeng/Downloads/Movie-Recommender-Systems-main/input/the-movies-dataset/links_small.csv')['tmdbId']
    links_small = links_small.dropna().astype(int)
    smd = md[md['id'].isin(links_small)].copy()
    smd['description'] = smd['overview'].fillna('') + ' ' + smd['tagline'].fillna('')
    smd = smd[['id', 'title', 'description']].drop_duplicates()
    smd.to_csv('/Users/awikasaeng/Downloads/Movie-Recommender-Systems-main/input/movies-data/description.csv', index=False)
    smd = smd.reset_index(drop=True)

smd.shape

(9082, 3)

This recommender uses a subset of 9,082 movies from the full dataset of 58,000.

### 2a. Movie Description Based Recommender


In [9]:
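from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
# descriptions read back from the CSV may be NaN when empty
smd['description'] = smd['description'].fillna('')
# unigrams and bigrams, English stop words removed; min_df=1 (the default) keeps every term
tf = TfidfVectorizer(ngram_range=(1, 2), min_df=1, stop_words='english')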

A matrix of the TF-IDF scores for each movie is needed for this recommender.

In [10]:
tfidf_matrix = tf.fit_transform(smd['description'])
tfidf_matrix.shape

(9082, 267952)
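To make the shape above concrete, here is a minimal sketch on three made-up descriptions (they are not from the dataset). It builds the same kind of matrix, one row per document and one column per unigram or bigram, and checks that each row has unit length, which is scikit-learn's default (`norm='l2'`).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical descriptions, for illustration only.
docs = [
    "a detective hunts a killer in los angeles",
    "a toy cowboy befriends a space ranger",
    "a detective in los angeles investigates a robbery",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words='english')
tfidf = vectorizer.fit_transform(docs)

# One row per document, one column per distinct unigram or bigram.
print(tfidf.shape)

# Each row is L2-normalized, so its Euclidean length is 1.
row_norms = np.asarray(np.sqrt(tfidf.multiply(tfidf).sum(axis=1))).ravel()
print(row_norms)
```

The column count grows quickly with bigrams, which is why 9,082 descriptions produce more than 260,000 columns.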

#### Cosine Similarity

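Now that we have a representation of the movie descriptions in terms of normalized TF-IDF vectors, we can compute the cosine similarity between these vectors. Because `TfidfVectorizer` L2-normalizes each row by default, the denominator in the formula equals 1 and the cosine similarity reduces to a dot product, which is why `linear_kernel` is used in the next cell.

$cosine(x,y) = \frac{x \cdot y^\intercal}{\lVert x \rVert \cdot \lVert y \rVert}$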

In [11]:
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
cosine_sim = linear_kernel(tfidf_matrix)
cosine_sim

array([[1.        , 0.00680302, 0.        , ..., 0.        , 0.        ,
        0.00477808],
       [0.00680302, 1.        , 0.01530688, ..., 0.        , 0.00175214,
        0.00367921],
       [0.        , 0.01530688, 1.        , ..., 0.00192587, 0.00221235,
        0.        ],
       ...,
       [0.        , 0.        , 0.00192587, ..., 1.        , 0.        ,
        0.        ],
       [0.        , 0.00175214, 0.00221235, ..., 0.        , 1.        ,
        0.00146392],
       [0.00477808, 0.00367921, 0.        , ..., 0.        , 0.00146392,
        1.        ]])

Using scikit-learn's pairwise metrics, we can create a pairwise cosine similarity matrix that serves as a lookup table whenever we want the similarity score between two movies.
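The following sketch, on made-up descriptions, illustrates why a plain dot product is enough here: for L2-normalized TF-IDF rows, `linear_kernel` and `cosine_similarity` return the same matrix. The documents are hypothetical and the vectorizer uses unigrams only to keep the example small.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity

# Hypothetical descriptions, for illustration only.
docs = [
    "detective hunts killer los angeles",
    "toy cowboy befriends space ranger",
    "detective los angeles investigates robbery",
]

tfidf = TfidfVectorizer(stop_words='english').fit_transform(docs)

sim_dot = linear_kernel(tfidf, tfidf)   # dot products only
sim_cos = cosine_similarity(tfidf)      # dot products divided by the norms

print(np.allclose(sim_dot, sim_cos))    # True
print(sim_dot[0, 1], sim_dot[0, 2])     # no shared words vs. three shared words
```

Documents 0 and 1 share no terms, so their score is exactly 0, while documents 0 and 2 share three terms and score well above 0.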

In [12]:
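def recommend(title):
    # look up every row whose title matches exactly
    movie = smd[smd['title'] == title]
    if movie.empty:
        raise KeyError(f"No movie titled {title!r} in the dataset")
    if len(movie) > 1:
        # ambiguous title: show the candidates so the caller can pick an index
        print("Several movies share this title. Choose an index and call get_recommendations(idx).")
        print(movie)
    else:
        indexes = get_recommendations(movie.index[0])
        recommend_movies = smd.iloc[indexes]
        # the first row is the queried movie itself (similarity 1.0), so skip it
        return recommend_movies[1:].set_index('id')


def get_recommendations(idx):
    # return movie indexes sorted by similarity, keeping only scores above 0.01
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    return [i[0] for i in sim_scores if i[1] > 0.01]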

With these functions, we can pass a title to them and get recommendations for movies similar to that film.
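The lookup that `get_recommendations` performs can be sketched on a toy similarity matrix. The helper `top_similar` below is hypothetical and is a variant, not the notebook's function: it uses `np.argsort` and excludes the query movie by index instead of dropping the first row, which may be safer if two movies have identical feature vectors and tie at 1.0.

```python
import numpy as np

# Hypothetical 4x4 similarity matrix (symmetric, ones on the diagonal).
sim = np.array([
    [1.0, 0.2,   0.0,   0.5],
    [0.2, 1.0,   0.3,   0.0],
    [0.0, 0.3,   1.0,   0.005],
    [0.5, 0.0,   0.005, 1.0],
])

def top_similar(sim, idx, threshold=0.01, n=None):
    """Indexes of the movies most similar to idx, best first."""
    order = np.argsort(sim[idx])[::-1]          # highest score first
    keep = [int(j) for j in order
            if j != idx and sim[idx, j] > threshold]
    return keep[:n]                              # n=None keeps everything

print(top_similar(sim, 0))       # [3, 1]
print(top_similar(sim, 2, n=1))  # [1]
```

Movie 2 is dropped from the first result because its score of 0.0 is below the threshold.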

In [13]:
recommend('Heat').head(15)

id,title,description
10990,Mulholland Falls,"In 1950s Los Angeles, a special crime squad of..."
10061,Escape from L.A.,"This time, a cataclysmic temblor hits Los Ange..."
9491,Blue Steel,A female rookie in the police force engages in...
32042,Night Moves,"When Los Angeles private detective, Harry Mose..."
43089,Roadgames,A truck driver plays a cat-and-mouse game with...
5236,Kiss Kiss Bang Bang,A petty thief posing as an actor is brought to...
86241,Welcome to L.A.,The lives and romantic entanglements of a grou...
171274,Inherent Vice,"In Los Angeles at the turn of the 1970s, drug-..."
10357,Volcano,An earthquake shatters a peaceful Los Angeles ...
14373,Death Wish 2,Paul Kersey is again a vigilante trying to fin...


In [14]:
recommend('Batman Begins').head(10)

id,title,description
14919,Batman: Mask of the Phantasm,An old flame of Bruce Wayne's strolls into tow...
123025,"Batman: The Dark Knight Returns, Part 1",Batman has not been seen for ten years. A new ...
414,Batman Forever,The Dark Knight of Gotham City confronts a das...
69735,Batman: Year One,Two men come to Gotham City: Bruce Wayne after...
268,Batman,The Dark Knight of Gotham City begins his war ...
40662,Batman: Under the Red Hood,Batman faces his ultimate challenge as the mys...
49026,The Dark Knight Rises,Following the death of District Attorney Harve...
364,Batman Returns,"Having defeated the Joker, Batman now faces th..."
142061,"Batman: The Dark Knight Returns, Part 2",Batman has stopped the reign of terror that Th...
155,The Dark Knight,Batman raises the stakes in his war on crime. ...


For **Heat**, a classic crime drama, the recommender returns mostly crime films, along with several titles that simply share the Los Angeles setting, as expected from a description-based match.

For **Batman Begins**, the recommender returns other Batman films. This exposes a weakness for franchise films: a Batman fan has likely already seen most of these. The recommender also ignores factors that strongly affect a movie's quality, such as the director. After all, a film is not automatically good just because it belongs to a popular series, and two movies with similar premises can turn out very differently depending on who made them.

### 2b. Metadata Based Recommender

This recommender takes a different approach to the metrics needed for a content-based recommender, using the movie's genre and keywords in addition to the cast and director to make recommendations.

In [15]:
try:
    credits = pd.read_csv('/Users/awikasaeng/Downloads/Movie-Recommender-Systems-main/input/movies-data/credits_small.csv')
except FileNotFoundError:
    credits = pd.read_csv('/Users/awikasaeng/Downloads/Movie-Recommender-Systems-main/input/the-movies-dataset/credits.csv')
    links_small = pd.read_csv('/Users/awikasaeng/Downloads/Movie-Recommender-Systems-main/input/the-movies-dataset/links_small.csv')['tmdbId']
    links_small = links_small.dropna().astype(int)
    # copy the filtered rows so the column assignments below do not act on a view
    credits = credits[credits['id'].isin(links_small)].copy()


    def get_director(x):
        for i in literal_eval(x):
            if i['job'] == 'Director':
                return i['name']
        return ''


    credits['crew'] = credits['crew'].apply(get_director)
    credits = credits.rename(columns={'crew': 'director'})
    credits['cast'] = credits['cast'].apply(lambda x: [i['name'] for i in literal_eval(x)[:3]])
    credits = credits.astype(str).drop_duplicates()

    keywords = pd.read_csv('/Users/awikasaeng/Downloads/Movie-Recommender-Systems-main/input/the-movies-dataset/keywords.csv')
    keywords = keywords[keywords['id'].isin(links_small)].drop_duplicates()
    keywords['keywords'] = keywords['keywords'].apply(lambda x: [i['name'] for i in literal_eval(x)])

    credits = keywords.astype(str).merge(credits)
    credits.to_csv('/Users/awikasaeng/Downloads/Movie-Recommender-Systems-main/input/movies-data/credits_small.csv', index=False)

Only the director and the top 3 actors in the cast will be used as factors to simplify computing recommendations.


In [16]:
# the lists were stored as strings, so parse them back into Python lists
for col in ['cast', 'keywords']:
    credits[col] = credits[col].apply(literal_eval)
credits.head()

Unnamed: 0,id,keywords,cast,director
0,862,"[jealousy, toy, boy, friendship, friends, riva...","[Tom Hanks, Tim Allen, Don Rickles]",John Lasseter
1,8844,"[board game, disappearance, based on children'...","[Robin Williams, Jonathan Hyde, Kirsten Dunst]",Joe Johnston
2,15602,"[fishing, best friend, duringcreditsstinger, o...","[Walter Matthau, Jack Lemmon, Ann-Margret]",Howard Deutch
3,31357,"[based on novel, interracial relationship, sin...","[Whitney Houston, Angela Bassett, Loretta Devine]",Forest Whitaker
4,11862,"[baby, midlife crisis, confidence, aging, daug...","[Steve Martin, Diane Keaton, Martin Short]",Charles Shyer


Now we have the movies filtered to only their keywords, the top 3 actors in their cast, and the director. A pairwise similarity matrix will be created as was done in the description-based recommender, and the system will provide recommendations according to the similarity scores.

In [17]:
# remove spaces and lowercase names so each person becomes one distinct token,
# e.g. 'Tom Hanks' -> 'tomhanks'
strip = lambda x: str(x).replace(" ", "").lower()
cast = credits['cast'].apply(lambda x: [strip(i) for i in x])
# repeat the director token three times to give it more weight in the similarity score
director = credits['director'].apply(lambda x: [strip(x)] * 3)

#### Keywords

The keywords will need some pre-processing as well.

In [18]:
s = pd.DataFrame(np.concatenate(credits['keywords'])).value_counts()
s[:5]

independent film        603
woman director          541
murder                  397
duringcreditsstinger    327
based on novel          309
dtype: int64

The most common keyword appears 603 times. A keyword that appears only once cannot match any other movie, so it adds nothing to the similarity matrix; those keywords will be dropped.
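The filtering step can be sketched with the standard library alone. The keyword lists below are made up, and `Counter` stands in for the pandas `value_counts` call used in the notebook; stemming is left out to keep the example dependency-free.

```python
from collections import Counter

# Hypothetical keyword lists for three movies.
movie_keywords = [
    ['murder', 'heist'],
    ['murder', 'independent film'],
    ['independent film', 'time travel'],
]

# Count how many times each keyword occurs across all movies.
counts = Counter(k for kws in movie_keywords for k in kws)

# Keep only keywords seen more than once; singletons cannot link two movies.
kept = {k for k, c in counts.items() if c > 1}
filtered = [[k for k in kws if k in kept] for kws in movie_keywords]

print(filtered)
# [['murder'], ['murder', 'independent film'], ['independent film']]
```

Here 'heist' and 'time travel' each occur once, so they are removed before the similarity matrix is built.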

In [19]:
# drop keywords that appear only once
s = s[s > 1]

from nltk.stem.snowball import SnowballStemmer

stem = SnowballStemmer('english').stem

# keep only keywords that occur more than once, then split them into words and stem each word
def filter_keywords(x):
    words = []
    for i in x:
        if i in s:
            for a in i.split():
                words.append(stem(a))
    return words

In [20]:
keywords = credits['keywords'].apply(filter_keywords)
md = pd.read_csv('/Users/awikasaeng/Downloads/Movie-Recommender-Systems-main/input/movies-data/metadata_small.csv',
                 dtype={'id': int}, usecols=['id', 'genres'])
genres = credits.merge(md)[['id', 'genres']].drop_duplicates()
genres = genres['genres'].reset_index(drop=True).apply(literal_eval)
soup = keywords + cast + director + genres
soup = soup.apply(lambda x: ' '.join(x))
soup.head()

0    jealousi toy boy friendship friend rivalri boy...
1    board game disappear base on children book new...
2    fish best friend duringcreditssting waltermatt...
3    base on novel interraci relationship singl mot...
4    babi midlif crisi confid age daughter mother d...
dtype: object

In [21]:
# raw counts (not TF-IDF) so that the repeated director token keeps its extra weight
count = CountVectorizer(ngram_range=(1, 2), min_df=1, stop_words='english')
count_matrix = count.fit_transform(soup)

cosine_sim = cosine_similarity(count_matrix)

Now that `cosine_sim` holds a different similarity matrix, we can run `recommend` again and get a different set of movies. When asked for movies similar to **Heat**, the system now recommends other movies directed by Michael Mann.

In [22]:
recommend('Heat').head(10)

id,title,description
11524,Thief,"Frank is an expert professional safecracker, s..."
82,Miami Vice,Miami Vice is a feature film based on the 1980...
1538,Collateral,Cab driver Max picks up a man who offers him $...
8489,Ali,"In 1964, a brash new pro boxer, fresh from his..."
11322,Public Enemies,Depression-era bank robber John Dillinger's ch...
9008,The Insider,Tells the true story of a 60 Minutes televisio...
9361,The Last of the Mohicans,As the English and French soldiers battle for ...
31640,Blood and Wine,A man who has failed as a father and husband c...
11454,Manhunter,"FBI Agent Will Graham, who retired after catch..."
16958,The Asphalt Jungle,"Recently paroled from prison, legendary burgla..."


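The weights in this recommender can be adjusted. For example, changing how many times the director token is repeated (3 in the cell above) changes how strongly a shared director influences the results.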
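To give a rough sense of what the director weight does, the sketch below compares two hypothetical movies that share a director and nothing else, for several repetition counts. The helper `director_only_similarity` and both "soups" are invented for illustration, and the vectorizer uses unigrams only, so the numbers will not match the notebook's bigram setup.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def director_only_similarity(weight):
    """Cosine similarity of two soups that share only the director token."""
    soup_a = "heist crime " + "michaelmann " * weight
    soup_b = "boxing drama " + "michaelmann " * weight
    counts = CountVectorizer().fit_transform([soup_a, soup_b])
    return cosine_similarity(counts)[0, 1]

for w in (1, 3, 5):
    print(w, round(director_only_similarity(w), 3))
# 1 0.333
# 3 0.818
# 5 0.926
```

With two other tokens per movie the score works out to w² / (w² + 2), so raising the weight from 1 to 3 already makes the shared director dominate everything else.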

If we try getting recommendations from **Gone Girl**, directed by David Fincher, we see that the list also provides other Fincher films. In fact, the first 8 films (up to Se7en) are all directed by Fincher, and so is Alien³ further down the list.

In [23]:
recommend('Gone Girl').head(15)

id,title,description
4547,Panic Room,Trapped in their New York brownstone's panic r...
2649,The Game,"In honor of his birthday, San Francisco banker..."
4922,The Curious Case of Benjamin Button,"Tells the story of Benjamin Button, a man who ..."
65754,The Girl with the Dragon Tattoo,This English-language adaptation of the Swedis...
37799,The Social Network,"On a fall night in 2003, Harvard undergrad and..."
1949,Zodiac,The true story of the investigation of 'The Zo...
550,Fight Club,A ticking-time-bomb insomniac and a slippery s...
807,Se7en,Two homicide detectives are on a desperate hun...
157825,White Bird in a Blizzard,Kat Connors is 17 years old when her perfect h...
8077,Alien³,After escaping with Newt and Hicks from the al...


#### Ratings and Popularity

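Although the system returns movies similar to the one it was given, it ignores how well those movies were received. We can improve the results by ranking the most similar candidates with the weighted rating formula from section 1. The function below takes the 25 most similar movies and keeps those at or above the 60th percentile of vote counts within that group.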

In [24]:
def improved_recommendations(title):
    movies = recommend(title)[:25]
    md_s = pd.read_csv('/Users/awikasaeng/Downloads/Movie-Recommender-Systems-main/input/movies-data/metadata_small.csv',
                       dtype={'id': int, 'vote_count': int, 'vote_average': float})
    md_s = md_s[md_s['id'].isin(movies.index)]
    return weighted_rating(md_s, 0.6)

In [25]:
improved_recommendations('Heat')

Unnamed: 0,id,title,release_date,genres,vote_count,vote_average,popularity,weighted_rating
2890,9008,The Insider,1999-10-28,"['Drama', 'Thriller']",489,7.3,11.3569,6.951717
8071,1538,Collateral,2004-08-04,"['Drama', 'Crime', 'Thriller']",1476,7.0,13.455112,6.897452
1357,9361,The Last of the Mohicans,1992-09-25,"['Action', 'Adventure', 'Drama', 'History', 'R...",747,7.1,15.228794,6.897399
2466,11660,Following,1998-09-12,"['Crime', 'Drama', 'Thriller']",363,7.2,5.283661,6.839131
13042,13809,RockNRolla,2008-09-04,"['Action', 'Crime', 'Thriller']",851,6.9,7.263988,6.773145
4879,8489,Ali,2001-12-11,['Drama'],457,6.7,9.303104,6.597778
13847,11322,Public Enemies,2009-07-01,"['History', 'Crime', 'Drama']",1371,6.5,10.364794,6.492692
17342,22907,Takers,2010-08-26,"['Action', 'Crime', 'Drama', 'Thriller']",399,6.1,7.000421,6.269886
11130,82,Miami Vice,2006-07-27,"['Action', 'Adventure', 'Crime', 'Thriller']",494,5.7,11.72545,6.0164
25834,201088,Blackhat,2015-01-13,"['Crime', 'Drama', 'Mystery']",842,5.1,14.932368,5.499856


In [26]:
improved_recommendations('Gone Girl')

Unnamed: 0,id,title,release_date,genres,vote_count,vote_average,popularity,weighted_rating
2843,550,Fight Club,1999-10-15,['Drama'],9678,8.3,63.869599,8.136671
46,807,Se7en,1995-09-22,"['Crime', 'Mystery', 'Thriller']",5915,8.1,18.45743,7.882168
1551,2649,The Game,1997-09-12,"['Drama', 'Thriller', 'Mystery']",1556,7.5,14.825587,7.185797
13219,4922,The Curious Case of Benjamin Button,2008-11-24,"['Fantasy', 'Drama', 'Thriller', 'Mystery', 'R...",3398,7.3,17.934821,7.163616
11646,1949,Zodiac,2007-03-02,"['Crime', 'Drama', 'Mystery', 'Thriller']",2080,7.3,19.083823,7.107934
18293,65754,The Girl with the Dragon Tattoo,2011-12-14,"['Thriller', 'Crime', 'Mystery', 'Drama']",2479,7.2,8.907829,7.060717
15798,37799,The Social Network,2010-09-30,['Drama'],3492,7.1,16.972995,7.015868
5132,4547,Panic Room,2002-03-29,"['Crime', 'Drama', 'Thriller']",1303,6.6,14.969502,6.674948
1273,8077,Alien³,1992-05-22,"['Science Fiction', 'Action', 'Horror']",1664,6.2,17.126768,6.428644
20048,75780,Jack Reacher,2012-12-20,"['Crime', 'Drama', 'Thriller']",3046,6.3,12.789812,6.425929


## 3. Collaborative Filtering

This system tailors recommendations to a specific user by comparing their ratings with those of other users. If it finds users with similar tastes, it can suggest movies those users enjoyed that the original user has not yet seen.

In [27]:
pip install scikit-surprise

Note: you may need to restart the kernel to use updated packages.


In [28]:
import surprise
from surprise import Reader, Dataset, SVD
from surprise import accuracy
from surprise.model_selection import train_test_split

ratings = pd.read_csv('/Users/awikasaeng/Downloads/Movie-Recommender-Systems-main/input/the-movies-dataset/ratings_small.csv')
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


This system uses a machine-learning algorithm called **Singular Value Decomposition (SVD)**, as implemented in the Surprise library, to predict the ratings users would give to movies they have not rated yet.

In [29]:
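# MovieLens ratings run from 0.5 to 5 in half-star steps; Reader() alone assumes a 1-5 scale
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], Reader(rating_scale=(0.5, 5)))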

# sample random trainset and testset
# test set is made of 25% of the ratings.
trainset, testset = train_test_split(data, test_size=.25)

# using the SVD algorithm
algo = SVD()

# Train the algorithm on the trainset, and predict ratings for the testset
algo.fit(trainset)
predictions = algo.test(testset)

# Then compute RMSE on the held-out ratings (lower is better)
accuracy.rmse(predictions)

RMSE: 0.8945


0.8944873442350099

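The **Root Mean Square Error** (RMSE) measures how far the predicted ratings deviate from the actual ratings in the held-out test set. An RMSE of about 0.89 means predictions are typically off by a bit less than one star on the 0.5 to 5 rating scale, so the model tracks users' ratings reasonably well without being exact.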
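For reference, RMSE is the square root of the mean squared difference between actual and predicted ratings. The sketch below computes it by hand for four made-up rating pairs; it is only meant to show what `accuracy.rmse` summarizes, not how Surprise implements it.

```python
import numpy as np

# Hypothetical actual and predicted ratings for four (user, movie) pairs.
actual = np.array([4.0, 3.0, 5.0, 2.0])
predicted = np.array([3.5, 3.0, 4.0, 3.0])

# Errors are 0.5, 0, 1 and -1; squaring makes large misses count more.
rmse = np.sqrt(np.mean((actual - predicted) ** 2))
print(rmse)  # 0.75
```

Because errors are squared before averaging, one prediction that is off by two stars hurts the score more than two predictions that are each off by one.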

In [30]:
trainset = data.build_full_trainset()
algo.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f82a063cfd0>

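Let's predict the rating user 1 would give the movie with `movieId` 862. Note that the model was trained on MovieLens `movieId` values, so this 862 is a MovieLens ID and not the TMDB ID 862 that identifies **Toy Story** in the metadata tables; the hybrid system in section 4 translates between the two with `links_small.csv`.

In [31]:
# user ID, MovieLens movie ID
algo.predict(1, 862)

Prediction(uid=1, iid=862, r_ui=None, est=2.7156856402662997, details={'was_impossible': False})

The system predicts that user 1 would give this movie a rating of about **2.72**.
Given a user ID and a movie ID, the model estimates the score that user would most likely give the movie, based on patterns learned from the ratings of users with similar tastes.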
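According to the Surprise documentation, its SVD model estimates a rating as the global mean plus a user bias, an item bias, and the dot product of the user's and item's latent factor vectors, and by default clips the result to the rating scale. The sketch below walks through that arithmetic with invented numbers; none of the values come from the trained model, and two latent factors are used only to keep it readable.

```python
import numpy as np

mu = 3.5                      # global mean rating (hypothetical)
b_u, b_i = -0.3, 0.2          # user and item biases (hypothetical)
p_u = np.array([0.1, -0.2])   # user latent factors (hypothetical)
q_i = np.array([0.4, 0.5])    # item latent factors (hypothetical)

# Baseline (mean + biases) adjusted by how well the user's tastes match the item.
raw = mu + b_u + b_i + q_i @ p_u

# Keep the estimate inside the 0.5-5 MovieLens rating scale.
est = float(np.clip(raw, 0.5, 5.0))
print(round(est, 2))  # 3.34
```

The `est` field of the `Prediction` object shown above is the result of this kind of calculation with the learned parameters.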

## 4. Hybrid Recommendation System

This system combines the content-based and collaborative systems to provide movie recommendations.

It takes a user ID and a movie title and outputs movies similar to the given movie, ranked by the rating the user is estimated to give them.

In [32]:
id_map = pd.read_csv('/Users/awikasaeng/Downloads/Movie-Recommender-Systems-main/input/the-movies-dataset/links_small.csv',
                     usecols=['movieId', 'tmdbId'])
id_map = id_map.dropna().astype(int).set_index('tmdbId')


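def hybrid(userid, title):
    # candidates from the content-based recommender, indexed by TMDB id
    movies = recommend(title)
    # map each TMDB id to its MovieLens movieId, then estimate this user's rating
    movies['est'] = [algo.predict(userid, id_map.loc[x]['movieId']).est for x in movies.index]
    # rank the candidates by estimated rating and keep the top 10
    movies = movies.sort_values('est', ascending=False)
    return movies.head(10)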

In [33]:
hybrid(1, 'Avatar')

id,title,description,est
103,Taxi Driver,A mentally unstable Vietnam War veteran works ...,3.713683
629,The Usual Suspects,"Held in an L.A. interrogation room, Verbal Kin...",3.671035
603,The Matrix,"Set in the 22nd century, The Matrix tells the ...",3.604926
11645,Ran,"Set in Japan in the 16th century (or so), an e...",3.578919
625,The Killing Fields,The Killing Fields tells the real life story o...,3.548431
37853,Mister Roberts,A hilarious and heartfelt military comedy-dram...,3.542115
1891,The Empire Strikes Back,"The epic saga continues as Luke Skywalker, in ...",3.534786
901,City Lights,City Lights is the first silent film that Char...,3.528254
238,The Godfather,"Spanning the years 1945 to 1955, a chronicle o...",3.524424
423,The Pianist,The Pianist is a film adapted from the biograp...,3.517847


In [34]:
hybrid(500, 'Avatar')

id,title,description,est
637,Life Is Beautiful,A touching story of an Italian book seller of ...,4.591762
13,Forrest Gump,A man with a low IQ has accomplished great thi...,4.167963
8392,My Neighbor Totoro,Two sisters move to the country with their fat...,4.098971
587,Big Fish,Throughout his life Edward Bloom has always be...,4.087997
497,The Green Mile,A supernatural tale set on death row in a Sout...,4.074447
7985,Penelope,"Forlorn heiress Penelope Wilhern is cursed, an...",4.008574
10705,Henry V,Gritty adaption of William Shakespeare's play ...,3.989948
3090,The Treasure of the Sierra Madre,"Fred C. Dobbs and Bob Curtin, both down on the...",3.920251
508,Love Actually,Follows seemingly unrelated people as their li...,3.90534
4808,Charade,After Regina Lampert falls for the dashing Pet...,3.864646


The recommender returns a different list for each user (compare the results for users 1 and 500 above), so its output is personalized and reflects each user's previous ratings.
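One possible refinement: because `recommend` returns every movie with a similarity above 0.01, sorting the whole list by estimated rating can surface titles that are only weakly related to the query. A common alternative is to shortlist the most similar candidates first and then rank only those by estimated rating. The sketch below shows that idea with a hypothetical `rerank` helper and a stand-in prediction function; it assumes the candidates arrive ordered by similarity, as they do from `recommend`.

```python
import pandas as pd

def rerank(candidates, predict, top_n_similar=25, k=10):
    """Keep the top_n_similar most similar rows, then rank them by predicted rating."""
    shortlist = candidates.head(top_n_similar).copy()
    shortlist['est'] = [predict(movie_id) for movie_id in shortlist.index]
    return shortlist.sort_values('est', ascending=False).head(k)

# Hypothetical candidates, already ordered from most to least similar.
candidates = pd.DataFrame(
    {'title': ['A', 'B', 'C', 'D']},
    index=pd.Index([10, 20, 30, 40], name='id'),
)
# Stand-in for algo.predict(userid, movie_id).est
fake_est = {10: 3.0, 20: 4.5, 30: 2.0, 40: 5.0}

result = rerank(candidates, fake_est.get, top_n_similar=3, k=2)
print(list(result.index))  # [20, 10]
```

Movie 40 has the highest estimate but falls outside the three most similar candidates, so it is never considered. The shortlist size trades similarity to the query against personalization.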