# Movie-Recommendations

Movie Recommendations with the MovieLens Dataset

Almost everyone today streams movies and television shows. Figuring out what to watch next can be daunting, so streaming services recommend titles based on a viewer's history and preferences. These recommendations are produced with machine learning, and building a recommender is a fun and approachable project for beginners. New programmers can practice in either Python or R with data from the MovieLens dataset. The MovieLens 1M release contains more than 1 million ratings of about 3,900 movies, made by more than 6,000 users.

Dataset link: [https://grouplens.org/datasets/movielens/1m/](https://grouplens.org/datasets/movielens/1m/)


A **recommendation system** predicts the rating or preference a user would give to an item and uses that prediction to suggest relevant items. Recommender systems are widely used in products: Netflix recommends which movie to watch, e-commerce sites recommend which product to buy, and Kindle recommends which book to read.

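**Word embeddings** represent words as vectors so that words with similar meanings have similar representations. Stemming and lemmatization are simpler normalization steps: stemming reduces a word to its stem by rule, while lemmatization uses the context in which the word appears and a vocabulary to find its dictionary form.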

For grammatical reasons, sentences use different word forms, such as organize, organizes, and organizing. Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. Both stemming and lemmatization aim to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form.

Example:

am, are, is => be

car, cars, car's, cars' => car

**Stemming** algorithms work by trimming off the beginning or end of a word, using a list of common prefixes and suffixes. Because the rules are applied without context, the result is not always a real word.

**Lemmatization** relies on a morphological analysis of the word. It requires dictionaries that the algorithm can look through to link each form to its lemma.
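
The contrast between the two approaches can be sketched with a toy example. The suffix list and the lookup dictionary below are hypothetical and far smaller than what a real stemmer or lemmatizer uses, so treat this as an illustration of the idea rather than a usable implementation.

```python
# Toy illustration only: real stemmers and lemmatizers use much larger
# rule sets and dictionaries than these hypothetical ones.

SUFFIXES = ["ing", "es", "s"]  # assumed, tiny suffix list

LEMMA_DICT = {  # assumed, tiny lookup dictionary
    "am": "be", "are": "be", "is": "be",
    "cars": "car", "organizes": "organize", "organizing": "organize",
}


def toy_stem(word):
    """Strip the first matching suffix, without looking at context."""
    for suffix in SUFFIXES:
        # keep very short words intact so "is" does not become "i"
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in a dictionary; fall back to the word itself."""
    return LEMMA_DICT.get(word, word)


for w in ["organizes", "organizing", "cars", "is"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

The stemmer maps both "organizes" and "organizing" to the non-word "organiz", while the dictionary lookup returns "organize" and also handles the irregular "is" => "be".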

**TF-IDF** (term frequency-inverse document frequency) is a statistical measure of how relevant a word is to a document within a collection of documents. It multiplies two metrics: the number of times the word appears in the document (term frequency, TF) and the inverse document frequency (IDF) of the word across the collection. Put simply, it is a text vectorizer that turns text into a usable numeric vector.
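
As a rough sketch, the product can be computed by hand on a tiny corpus. The three documents are made up, and the formula used here (raw count times `log(N / df)`) is the plain textbook variant; scikit-learn's `TfidfVectorizer` applies a smoothed IDF and normalizes each row by default, so its numbers will differ.

```python
import math
from collections import Counter

# Hypothetical three-document corpus, already tokenized.
docs = [
    ["space", "crew", "space", "mission"],
    ["wedding", "comedy", "crew"],
    ["space", "comedy"],
]


def tf_idf(term, doc, corpus):
    """Raw term count in doc times log(N / document frequency)."""
    tf = Counter(doc)[term]
    df = sum(1 for d in corpus if term in d)
    if df == 0:
        return 0.0
    idf = math.log(len(corpus) / df)
    return tf * idf


print(tf_idf("space", docs[0], docs))    # appears twice here, but in 2 of 3 docs
print(tf_idf("mission", docs[0], docs))  # appears once, only in this doc
```

Although "space" occurs twice in the first document, "mission" gets the higher score because it appears in no other document.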


**Content-based filtering system:** A content-based recommender uses an item's attributes to predict which items a user will react to positively. During recommendation, similarity metrics are computed between the items' feature vectors and the user's preferred feature vector, which is built from that user's previous data. The top few most similar items are then recommended. No other users' data is required.
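
A minimal sketch of this idea, assuming hand-made three-dimensional feature vectors and a user profile formed by averaging the liked items. The item names, features, and the `recommend` helper are all hypothetical.

```python
import math

# Hypothetical item feature vectors: [action, comedy, sci-fi]
items = {
    "Movie A": [1.0, 0.0, 1.0],
    "Movie B": [0.0, 1.0, 0.0],
    "Movie C": [1.0, 0.0, 0.0],
    "Movie D": [0.0, 1.0, 1.0],
}


def cosine(u, v):
    """Cosine of the angle between u and v (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recommend(liked, items, k=2):
    """Average the liked items into a user profile, then rank unseen items."""
    dims = len(next(iter(items.values())))
    profile = [sum(items[t][i] for t in liked) / len(liked) for i in range(dims)]
    unseen = [t for t in items if t not in liked]
    return sorted(unseen, key=lambda t: cosine(profile, items[t]), reverse=True)[:k]


print(recommend(["Movie A"], items))
```

Only the one user's history and the item features are used; no other user appears anywhere in the computation.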

  

**Collaborative filtering system:** Collaborative filtering does not require the features of the items. Instead, every user and item is described by a feature vector, or embedding, that is learned from feedback. It takes other users' reactions into account when recommending to a particular user: it records which items that user likes, finds users with similar behavior and tastes, and recommends the items those users liked. In short, it collects user feedback on different items and uses it for recommendations.
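
The sketch below illustrates the idea with a simple neighborhood (user-based) approach rather than learned embeddings: a missing rating is predicted as the similarity-weighted average of other users' ratings. The users, items, and ratings are invented for illustration.

```python
import math

# Hypothetical ratings: user -> {item: rating}
ratings = {
    "ann": {"A": 5, "B": 1, "C": 4},
    "bob": {"A": 4, "B": 1, "C": 5, "D": 5},
    "cat": {"A": 1, "B": 5, "D": 1},
}


def user_similarity(u, v):
    """Cosine similarity over the items both users rated."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in common)
    nu = math.sqrt(sum(ratings[u][i] ** 2 for i in common))
    nv = math.sqrt(sum(ratings[v][i] ** 2 for i in common))
    return dot / (nu * nv)


def predict(user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    num = den = 0.0
    for other, their in ratings.items():
        if other != user and item in their:
            s = user_similarity(user, other)
            num += s * their[item]
            den += s
    return num / den if den else None


print(predict("ann", "D"))
```

Ann's tastes are much closer to Bob's than to Cat's, so the prediction for item D is pulled toward Bob's high rating. No item features are involved.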


Differences between collaborative filtering and content-based filtering:

- The content-based method requires information about the items' features rather than other users' likes and feedback. Features can be any item attributes, such as plot, year, or genre, or text features extracted with NLP.
- Collaborative filtering needs nothing except users' preferences on items. Because it is based on historical data, it assumes that users who agreed in the past will tend to agree in the future.
- Domain knowledge is not required for collaborative filtering, because the embeddings are learned automatically.
- In a content-based approach, the feature representation of the items is hand-engineered to an extent, so the technique requires domain knowledge.
- A content-based model does not require any records about other users, because its recommendations are specific to one user.
- The collaborative algorithm uses only user behavior to recommend items.

In [1]:
#importing libraries
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns


import warnings 
warnings.filterwarnings('ignore')

In [None]:
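# the '::' separator is longer than one character, so use the Python parsing engine
ratings = pd.read_csv('../Data/ratings.dat', sep='::', engine='python', header=None, names=["UserID", "MovieID", "Rating", "Timestamp"], encoding="ISO-8859-1")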


In [None]:
ratings

In [None]:
ratings.info()

In [None]:
movies = pd.read_csv('../Data/movies.dat', sep='::', engine='python', header=None, names=["MovieID", "Title", "Genres"], encoding="ISO-8859-1")


In [None]:
movies

In [None]:
movies.info()

In [None]:
users = pd.read_csv('../Data/users.dat', sep='::', engine='python', header=None, names=["UserID", "Gender", "Age", "Occupation", "Zip-code"], encoding="ISO-8859-1")

In [None]:
users

In [None]:
users.info()

## Preprocessing dataset

The content-based recommendation method requires information about each item's features. We therefore use three movie attributes (genres, overview, and tagline) to recommend movies.

Because the genres string has a JSON-like structure, we strip the markup and extract the genre names with the following function:

In [None]:
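def clean_genres(text):
    """Reduce a string like "[{'id': 16, 'name': 'Animation'}]" to "animation"."""
    text = text.replace("[{'id': ", '')      # opening of the first entry
    text = text.replace(", 'name': '", ' ')  # separator between id and name
    text = text.lower()
    text = text.replace(", {'id': ", ' ')    # opening of each later entry
    text = text.replace("'}", '')            # closing of each entry
    text = text.replace("[", '').replace("]", '')  # leftover list brackets
    # drop the numeric ids
    text = ''.join([i for i in text if not i.isdigit()])
    text = text.strip()
    return text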
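
Because the genres string looks like a Python literal (a list of dicts with single-quoted keys), one alternative is to parse it instead of stripping characters. The sketch below is an assumption about how that could look; `parse_genres` is a hypothetical helper, not part of the notebook, and it returns an empty string for anything it cannot parse.

```python
import ast


def parse_genres(text):
    """Return lower-cased genre names from a string such as
    "[{'id': 16, 'name': 'Animation'}]"; return '' if it cannot be parsed."""
    try:
        entries = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return ""
    if not isinstance(entries, list):
        return ""
    names = [e["name"] for e in entries if isinstance(e, dict) and "name" in e]
    return " ".join(names).lower()


sample = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"
print(parse_genres(sample))
```

Parsing keeps multi-word genre names intact and does not depend on the exact spacing of the input, at the cost of being slower than plain string replacement.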

In [None]:
#Read data from file
df = pd.read_csv("../Data/movies_metadata.csv")
df.head().T

In [None]:
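df['genres'] = df['genres'].apply(clean_genres)

# fill missing text first so one missing field does not blank the whole row,
# and join with spaces so words from adjacent fields do not run together
df['overview'] = df['overview'].fillna('')
df['tagline'] = df['tagline'].fillna('')
df['movie_text'] = df['overview'] + ' ' + df['tagline'] + ' ' + df['genres']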

In [None]:
# verifying text
df['movie_text'][16212]

In [None]:
df['movie_text'][1325]

In [None]:
df['movie_text'][1326]

In [None]:
df['movie_text'][1324]

In [None]:
df['movie_text'][20922]

In [None]:
df['movie_text'][1327]

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tf = TfidfVectorizer()
tfidf_matrix = tf.fit_transform(df['movie_text'])

The cosine similarity is the cosine of the angle between two vectors. It equals the inner product of the two vectors after both have been normalized to length one. Cosine similarity therefore depends on vector orientation and is independent of vector magnitude.

Next we compute the cosine similarity between the movie text features we created. Cosine similarity, or the cosine kernel, computes similarity as the normalized dot product of X and Y.
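
To make the definition concrete, here is a small hand-rolled version on two made-up vectors. It is only meant to show the arithmetic; the notebook itself relies on scikit-learn for the real computation.

```python
import math


def cosine_similarity_vec(x, y):
    """Dot product of x and y divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


x = [1.0, 2.0, 0.0]
print(cosine_similarity_vec(x, [2.0, 4.0, 0.0]))  # same direction, twice as long
print(cosine_similarity_vec(x, [0.0, 0.0, 3.0]))  # orthogonal
```

The first pair scores approximately 1 even though the vectors differ in length, and the orthogonal pair scores 0, which is the magnitude independence described above.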

In [None]:
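from sklearn.metrics.pairwise import cosine_similarity

# similarity of every movie with every other movie; the result is a dense
# (n_movies x n_movies) array, so it needs a lot of memory
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)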

In [None]:
df = df.reset_index()
titles = df['title']
title_ids = pd.Series(df.index, index=df['title'])

In [None]:
def get_content_recommendations(title):
    idx = title_ids[title]
    cosine_scores = list(enumerate(cosine_sim[idx]))

    #sorting scores in descending order
    cosine_scores = sorted(cosine_scores, key=lambda x: x[1], reverse=True)
    
    #top 9 recommendations (position 0 is the movie itself, so skip it)
    cosine_scores = cosine_scores[1:10]
    movie_indices = [i[0] for i in cosine_scores]
    return titles.iloc[movie_indices]
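
Skipping position 0 assumes the queried movie always sorts first. If another movie with a lower index has identical text, it ties at the top score and, because Python's sort is stable, lands ahead of the queried movie. A possible alternative, sketched here with a hypothetical helper and a made-up similarity row, is to exclude the query index explicitly and take the top k with `heapq`.

```python
import heapq


def top_k_similar(scores, query_idx, k=10):
    """Return the indices of the k highest scores, excluding query_idx."""
    candidates = ((s, i) for i, s in enumerate(scores) if i != query_idx)
    return [i for s, i in heapq.nlargest(k, candidates)]


# Hypothetical similarity row for item 2; item 0 has identical text,
# so it also scores 1.0 and would sort ahead of item 2 itself.
row = [1.0, 0.2, 1.0, 0.7, 0.0]
print(top_k_similar(row, query_idx=2, k=3))
```

`heapq.nlargest` also avoids sorting the full row when only a handful of results are needed.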

In [None]:
get_content_recommendations('Star Trek: The Motion Picture')

In [None]:
get_content_recommendations('Batman Forever') 

In [None]:
get_content_recommendations('The Hangover') 

In [2]:
import pickle 


The files generated below are loaded by server.py to serve the model.

In [None]:
with open('movies_df.pkl', 'wb') as f:
    pickle.dump(df, f)
with open('similarity.pkl', 'wb') as f:
    pickle.dump(cosine_sim, f)


In [3]:
ls

255-project.ipynb              movies_df.pkl
Collaborative filtering.ipynb  similarity.pkl


In [6]:
with open('movies_df.pkl', 'rb') as f:
    movie_data = pickle.load(f)
with open('similarity.pkl', 'rb') as f:
    sim = pickle.load(f)
titles = movie_data['title']
title_ids = pd.Series(movie_data.index, index=movie_data['title'])
movie_poster_ids = movie_data['id']

In [8]:
title_ids

title
Toy Story                          0
Jumanji                            1
Grumpier Old Men                   2
Waiting to Exhale                  3
Father of the Bride Part II        4
                               ...  
Subdue                         45461
Century of Birthing            45462
Betrayal                       45463
Satan Triumphant               45464
Queerama                       45465
Length: 45466, dtype: int64

In [13]:
def get_content_recommendations(title):
    idx = title_ids[title]
    cosine_scores = list(enumerate(sim[idx]))

    #sorting scores in descending order
    cosine_scores = sorted(cosine_scores, key=lambda x: x[1], reverse=True)
    
    #top 9 recommendations (position 0 is the movie itself, so skip it)
    cosine_scores = cosine_scores[1:10]
    movie_indices = [i[0] for i in cosine_scores]
    return movie_poster_ids.iloc[movie_indices],titles.iloc[movie_indices]

In [18]:
mid,mdata = get_content_recommendations('The Hangover')

In [24]:
a = mid.to_json()

In [25]:
b = mdata.to_json()

In [26]:
a+b

'{"28175":"252838","2700":"11037","25455":"292191","24158":"276843","2453":"16508","39873":"238475","37974":"343112","6840":"6472","15807":"23168"}{"28175":"The Wedding Ringer","2700":"Iron Eagle","25455":"Bachelor Night","24158":"What We Did on Our Holiday","2453":"Doug\'s 1st Movie","39873":"Best Night Ever","37974":"Man Vs.","6840":"Guarding Tess","15807":"The Town"}'
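
Note that concatenating the two strings places two JSON objects back to back, which a JSON parser will not accept as a single document. If the server needs one valid payload, one option is to merge the two objects on their shared index, as sketched below. The two input strings are shortened copies of the output above, and the field names `index`, `id`, and `title` are assumptions.

```python
import json

# Shortened stand-ins for mid.to_json() and mdata.to_json()
a = '{"28175":"252838","2700":"11037"}'
b = '{"28175":"The Wedding Ringer","2700":"Iron Eagle"}'

ids = json.loads(a)
names = json.loads(b)

# One list of records, matched on the shared dataframe index
payload = json.dumps(
    [{"index": k, "id": ids[k], "title": names[k]} for k in ids]
)
print(payload)
```

A client can then read the whole response with a single `json.loads` call.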