# Movie-Recommendations

Movie Recommendations with the MovieLens Dataset

Almost everyone today streams movies and television shows. Deciding what to watch next can be daunting, so streaming services recommend titles based on a viewer's history and preferences. These recommendations are produced with machine learning, and building a recommender is an approachable project for beginners. New programmers can practice in either Python or R using the MovieLens dataset. The MovieLens 1M release contains more than 1 million ratings of about 3,900 movies, contributed by more than 6,000 users.

Dataset link: [https://grouplens.org/datasets/movielens/1m/](https://grouplens.org/datasets/movielens/1m/)


A **recommendation system** (or recommender system) is an algorithm that predicts the rating or preference a user would give to an item and uses that prediction to suggest relevant items. Recommender systems are widely used in products: Netflix recommends which movie to watch, e-commerce sites recommend which product to buy, and Kindle recommends which book to read.

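**Word embeddings** represent words as vectors so that words with similar meanings have similar representations. Before text is vectorized, word forms are often normalized: stemming reduces a word to its stem, while lemmatization uses the context in which the word appears to map it to its dictionary form (lemma).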

For grammatical reasons, sentences use different forms of a word, such as organize, organizes, and organizing. Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. Both stemming and lemmatization aim to reduce inflectional forms and derivationally related forms of a word to a common base form.

example:

am, are, is => be

car, cars, car's, cars' => car

**Stemming** algorithms work by trimming off the beginning or end of a word, using a list of common prefixes and suffixes. Because the rules are purely mechanical, the result is not always a valid word.

**Lemmatization** relies on a morphological analysis of the words. It requires dictionaries that the algorithm can look through to link each form to its lemma.
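The contrast between the two approaches can be sketched in a few lines of plain Python. This is a toy illustration only: the suffix list, the miniature lemma dictionary, and the function names `naive_stem` and `naive_lemmatize` are assumptions made for the example, and it strips suffixes only. Real stemmers and lemmatizers use much richer rules and dictionaries.

```python
# Toy illustration only: the suffix list and the dictionary are hypothetical.
SUFFIXES = ("ing", "es", "s")  # checked in this order, longest first

# Assumed miniature dictionary that links word forms to their lemma.
LEMMA_DICT = {
    "am": "be", "are": "be", "is": "be",
    "cars": "car",
    "organizes": "organize", "organizing": "organize",
}


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def naive_lemmatize(word: str) -> str:
    """Look the word up in the dictionary; fall back to the word itself."""
    word = word.lower()
    return LEMMA_DICT.get(word, word)


for w in ["organizes", "organizing", "cars", "is", "are"]:
    print(f"{w}: stem={naive_stem(w)}, lemma={naive_lemmatize(w)}")
```

The stemmer maps both "organizes" and "organizing" to "organiz", which is not a dictionary word, while the lookup returns "organize". Only the dictionary-based approach can map "is" and "are" to "be".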

**TF-IDF** (term frequency-inverse document frequency) is a statistical measure of how relevant a word is to a document in a collection of documents. It is the product of two metrics: the term frequency (TF), which counts how often the word appears in the document, and the inverse document frequency (IDF), which shrinks as the word appears in more documents of the collection. In practice, it serves as a text vectorizer that transforms text into a usable numeric vector.
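As a minimal sketch of the calculation, the snippet below computes one common textbook variant, `tf * log(N / df)`, on a hypothetical three-document corpus. The corpus and the function name `tf_idf` are assumptions for illustration. Library implementations typically add smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter

# Hypothetical corpus: three short, already tokenized documents.
docs = [
    ["space", "crew", "adventure"],
    ["space", "war", "adventure"],
    ["romantic", "comedy"],
]


def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(documents)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            # Term frequency is normalized by the document length.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


weights = tf_idf(docs)
for doc_weights in weights:
    print({term: round(w, 3) for term, w in doc_weights.items()})
```

In the first document, "crew" gets a higher weight than "space" because "space" also appears in the second document and is therefore less distinctive.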


**Content-based filtering system:** A content-based recommender system uses an item's attributes to predict which items a user will react to positively. During recommendation, similarity metrics are computed between each item's feature vector and the user's preferred feature vector, which is built from the user's previous data. The top few most similar items are then recommended. It does not require other users' data during recommendation.
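A minimal sketch of this idea, assuming hand-made genre vectors and cosine similarity as the similarity metric. The movie names, the feature layout, and the helpers `cosine`, `build_profile`, and `recommend` are hypothetical and exist only for this example.

```python
import math

# Hypothetical item feature vectors: [action, comedy, sci-fi]
item_features = {
    "Movie A": [1.0, 0.0, 1.0],
    "Movie B": [0.0, 1.0, 0.0],
    "Movie C": [1.0, 0.0, 0.0],
    "Movie D": [0.0, 1.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_profile(liked):
    """Average the feature vectors of the items the user liked."""
    vectors = [item_features[title] for title in liked]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def recommend(liked, top_n=2):
    """Rank unseen items by similarity to the user's profile."""
    profile = build_profile(liked)
    candidates = [title for title in item_features if title not in liked]
    ranked = sorted(
        candidates,
        key=lambda title: cosine(profile, item_features[title]),
        reverse=True,
    )
    return ranked[:top_n]


print(recommend(["Movie A"]))
```

Only the target user's own history enters the calculation, which matches the point that no other users' data is needed.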

  

**Collaborative filtering system:** Collaborative filtering does not require the features of the items. Every user and item is described by a feature vector, or embedding, and the model learns an embedding for both users and items. It takes other users' reactions into consideration when recommending to a particular user: it records which items that user likes, along with the items liked by users with similar behavior and tastes, and uses both to recommend things to that user. It collects user feedback on different items and uses it for recommendations.
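The sketch below illustrates the core idea with a simple user-based neighbourhood method: a missing rating is predicted as a similarity-weighted average of other users' ratings. Note that this is a swapped-in, simpler technique than learning embeddings, and the rating data and the names `ratings_by_user`, `user_similarity`, and `predict` are assumptions for illustration.

```python
import math

# Hypothetical ratings on a 1-5 scale: user -> {movie: rating}
ratings_by_user = {
    "alice": {"Movie A": 5, "Movie B": 1, "Movie C": 4},
    "bob": {"Movie A": 4, "Movie B": 1, "Movie C": 5, "Movie D": 5},
    "carol": {"Movie A": 1, "Movie B": 5, "Movie D": 2, "Movie E": 4},
}


def user_similarity(a, b):
    """Cosine similarity computed over the movies both users rated."""
    ra, rb = ratings_by_user[a], ratings_by_user[b]
    common = set(ra) & set(rb)
    if not common:
        return 0.0
    dot = sum(ra[m] * rb[m] for m in common)
    norm_a = math.sqrt(sum(ra[m] ** 2 for m in common))
    norm_b = math.sqrt(sum(rb[m] ** 2 for m in common))
    return dot / (norm_a * norm_b)


def predict(user, movie):
    """Similarity-weighted average of other users' ratings for the movie."""
    numerator = denominator = 0.0
    for other, their_ratings in ratings_by_user.items():
        if other == user or movie not in their_ratings:
            continue
        sim = user_similarity(user, other)
        numerator += sim * their_ratings[movie]
        denominator += sim
    return numerator / denominator if denominator else None


print(round(predict("alice", "Movie D"), 2))
```

No item attributes appear anywhere in the code. The prediction for alice leans toward bob's rating because their past ratings agree more closely than alice's and carol's do.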


Differences between Collaborative Filtering and Content-Based Filtering:

- The content-based method requires information about the items' features rather than feedback from other users. The features can be any item attributes, such as plot, year, or genre, or text features extracted by applying NLP.
- Collaborative filtering needs nothing except users' preferences on items to make recommendations. Because it is based on historical data, it assumes that users who agreed in the past will tend to agree in the future.
- Domain knowledge is not required for collaborative filtering because the embeddings are learned automatically.
- In a content-based approach, the feature representation of the items is hand-engineered to an extent, so the technique requires domain knowledge.
- A content-based filtering model does not require any records about other users because the recommendations are specific to a particular user.
- The collaborative algorithm uses only user behavior for recommending items.

In [1]:
# Import libraries
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns


import warnings 
warnings.filterwarnings('ignore')

In [7]:
# A multi-character separator such as '::' requires the Python parsing engine
ratings = pd.read_csv('../Data/ratings.dat', sep='::', engine='python', header=None, names=["UserID", "MovieID", "Rating", "Timestamp"], encoding="ISO-8859-1")


In [8]:
ratings

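| | UserID | MovieID | Rating | Timestamp |
|---|---|---|---|---|
| 0 | 1 | 1193 | 5 | 978300760 |
| 1 | 1 | 661 | 3 | 978302109 |
| 2 | 1 | 914 | 3 | 978301968 |
| 3 | 1 | 3408 | 4 | 978300275 |
| 4 | 1 | 2355 | 5 | 978824291 |
| ... | ... | ... | ... | ... |
| 1000204 | 6040 | 1091 | 1 | 956716541 |
| 1000205 | 6040 | 1094 | 5 | 956704887 |
| 1000206 | 6040 | 562 | 5 | 956704746 |
| 1000207 | 6040 | 1096 | 4 | 956715648 |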


In [4]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000209 entries, 0 to 1000208
Data columns (total 4 columns):
 #   Column     Non-Null Count    Dtype
---  ------     --------------    -----
 0   UserID     1000209 non-null  int64
 1   MovieID    1000209 non-null  int64
 2   Rating     1000209 non-null  int64
 3   Timestamp  1000209 non-null  int64
dtypes: int64(4)
memory usage: 30.5 MB


In [9]:
# A multi-character separator such as '::' requires the Python parsing engine
movies = pd.read_csv('../Data/movies.dat', sep='::', engine='python', header=None, names=["MovieID", "Title", "Genres"], encoding="ISO-8859-1")


In [10]:
movies

| | MovieID | Title | Genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Animation\|Children's\|Comedy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children's\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |
| ... | ... | ... | ... |
| 3878 | 3948 | Meet the Parents (2000) | Comedy |
| 3879 | 3949 | Requiem for a Dream (2000) | Drama |
| 3880 | 3950 | Tigerland (2000) | Drama |
| 3881 | 3951 | Two Family House (2000) | Drama |


In [11]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3883 entries, 0 to 3882
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   MovieID  3883 non-null   int64 
 1   Title    3883 non-null   object
 2   Genres   3883 non-null   object
dtypes: int64(1), object(2)
memory usage: 91.1+ KB


In [12]:
# A multi-character separator such as '::' requires the Python parsing engine
users = pd.read_csv('../Data/users.dat', sep='::', engine='python', header=None, names=["UserID", "Gender", "Age", "Occupation", "Zip-code"], encoding="ISO-8859-1")

In [13]:
users

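| | UserID | Gender | Age | Occupation | Zip-code |
|---|---|---|---|---|---|
| 0 | 1 | F | 1 | 10 | 48067 |
| 1 | 2 | M | 56 | 16 | 70072 |
| 2 | 3 | M | 25 | 15 | 55117 |
| 3 | 4 | M | 45 | 7 | 02460 |
| 4 | 5 | M | 25 | 20 | 55455 |
| ... | ... | ... | ... | ... | ... |
| 6035 | 6036 | F | 25 | 15 | 32603 |
| 6036 | 6037 | F | 45 | 1 | 76006 |
| 6037 | 6038 | F | 56 | 1 | 14706 |
| 6038 | 6039 | F | 45 | 0 | 01060 |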


In [14]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6040 entries, 0 to 6039
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   UserID      6040 non-null   int64 
 1   Gender      6040 non-null   object
 2   Age         6040 non-null   int64 
 3   Occupation  6040 non-null   int64 
 4   Zip-code    6040 non-null   object
dtypes: int64(3), object(2)
memory usage: 236.1+ KB


## Preprocessing dataset

The content-based recommendation method requires information about the items' features. Therefore, we will use three attributes of each movie (genres, overview, and tagline) to recommend movies.

Because the genres string has a JSON-like structure, we strip the surrounding syntax and extract the genre names with the following function:

In [15]:
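def clean_genres(text):
    """Reduce a stringified list of {'id', 'name'} dicts to lowercase genre names."""
    # Drop the leading "[{'id': " and turn each ", 'name': '" into a space
    text = text.replace("[{'id': ", '')
    text = text.replace(", 'name': '", ' ')
    text = text.lower()
    # Separate consecutive genre entries with a space
    text = text.replace(", {'id': ", ' ')
    # Remove the closing quotes, braces, and bracket
    text = text.replace("'}", '')
    text = text.replace("]", '')
    # Remove the numeric ids left over from the 'id' fields
    text = ''.join([i for i in text if not i.isdigit()])
    text = text.strip()
    return text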
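The string replacements in `clean_genres` depend on the exact layout of the field. As an alternative sketch (not the approach used in this notebook), the field can be parsed as a Python literal with the standard library's `ast.literal_eval`, assuming every entry is a dict with a `'name'` key. The function name `parse_genres` is hypothetical.

```python
import ast


def parse_genres(text: str) -> str:
    """Parse a stringified list of {'id', 'name'} dicts into 'name name ...'."""
    try:
        entries = ast.literal_eval(text)
    except (ValueError, SyntaxError, TypeError):
        # Not a valid Python literal: treat the field as having no genres.
        return ""
    if not isinstance(entries, list):
        return ""
    names = [
        entry["name"].lower()
        for entry in entries
        if isinstance(entry, dict) and "name" in entry
    ]
    return " ".join(names)


sample = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"
print(parse_genres(sample))
```

Unlike `clean_genres`, this version joins the names with a single space and returns an empty string for an empty list, so its output is not character-for-character identical.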

In [16]:
#Read data from file
df = pd.read_csv("../Data/movies_metadata.csv")
df.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| adult | False | False | False | False | False |
| belongs_to_collection | {'id': 10194, 'name': 'Toy Story Collection', ... | | {'id': 119050, 'name': 'Grumpy Old Men Collect... | | {'id': 96871, 'name': 'Father of the Bride Col... |
| budget | 30000000 | 65000000 | 0 | 16000000 | 0 |
| genres | [{'id': 16, 'name': 'Animation'}, {'id': 35, '... | [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... | [{'id': 10749, 'name': 'Romance'}, {'id': 35, ... | [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam... | [{'id': 35, 'name': 'Comedy'}] |
| homepage | http://toystory.disney.com/toy-story | | | | |
| id | 862 | 8844 | 15602 | 31357 | 11862 |
| imdb_id | tt0114709 | tt0113497 | tt0113228 | tt0114885 | tt0113041 |
| original_language | en | en | en | en | en |
| original_title | Toy Story | Jumanji | Grumpier Old Men | Waiting to Exhale | Father of the Bride Part II |
| overview | Led by Woody, Andy's toys live happily in his ... | When siblings Judy and Peter discover an encha... | A family wedding reignites the ancient feud be... | Cheated on, mistreated and stepped on, the wom... | Just when George Banks has recovered from his ... |


In [17]:
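df['genres'] = df['genres'].apply(clean_genres)

# Fill missing values before concatenating: NaN + string yields NaN, which
# would discard the tagline and genres of movies that have no overview
df['tagline'] = df['tagline'].fillna('')
df['movie_text'] = df['overview'].fillna('') + df['tagline'] + df['genres']
df['movie_text'] = df['movie_text'].fillna('')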
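# The cell above concatenates the three fields without a separator, so the last
# word of one field fuses with the first word of the next (for example
# "woman.comedy"). The helper below is a hypothetical sketch, not part of the
# original notebook, that joins fields with a space and skips missing values.

```python
def join_text_fields(*fields, sep=" "):
    """Join the non-empty text fields with a separator, skipping missing values."""
    cleaned = []
    for field in fields:
        # None and NaN (a float in pandas object columns) count as missing.
        if not isinstance(field, str):
            continue
        field = field.strip()
        if field:
            cleaned.append(field)
    return sep.join(cleaned)


print(join_text_fields("A dark comedy about two brothers.", "", "comedy"))
```

Assuming the same column names as above, it could be applied row by row with `df.apply(lambda row: join_text_fields(row['overview'], row['tagline'], row['genres']), axis=1)`. Doing so would change the `movie_text` values shown in the outputs below.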

In [18]:
# verifying text
df['movie_text'][16212]

'A dark comedy which follows two brothers who are in love with the same woman.comedy'

In [19]:
df['movie_text'][1325]

'Admiral James T. Kirk is feeling old; the prospect of accompanying his old ship the Enterprise on a two week cadet cruise is not making him feel any younger. But the training cruise becomes a a life or death struggle when Khan escapes from years of exile and captures the power of creation itself.At the end of the universe lies the beginning of vengeance.action  adventure  science fiction  thriller'

In [20]:
df['movie_text'][1326]

"Admiral Kirk and his bridge crew risk their careers stealing the decommissioned Enterprise to return to the restricted Genesis planet to recover Spock's body.A dying planet. A fight for life.science fiction  action  adventure  thriller"

In [21]:
df['movie_text'][1324]

"Capt. Kirk and his crew must deal with Mr. Spock's half brother who kidnaps three diplomats and hijacks the Enterprise in his obsessive search for God.Adventure and imagination will meet at the final frontier.science fiction  action  adventure  thriller"