# Guided project: Movie recommendation

**Dataset: https://files.grouplens.org/datasets/movielens/ml-25m.zip**

In [1]:
import pandas as pd
import re

In [2]:
#Importing dataset
movies = pd.read_csv('RawData/movies.csv')

In [3]:
#Creating backup
movies.to_csv('RawData/moviesBackup.csv', index = True)

In [4]:
#Checking data
movies.tail()

Unnamed: 0,movieId,title,genres
62418,209157,We (2018),Drama
62419,209159,Window of the Soul (2001),Documentary
62420,209163,Bad Poems (2018),Comedy|Drama
62421,209169,A Girl Thing (2001),(no genres listed)
62422,209171,Women of Devil's Island (1962),Action|Adventure|Drama


In [5]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


# Cleaning

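**Clean the movie titles with regular expressions**

The guided project asks for a function that takes in a title and returns the cleaned title, removing any character that isn't a letter, a digit, or a space. The code cells in this section take a different approach: they split each title at the opening parenthesis to separate the movie name from its release year.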
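As a minimal sketch of the function described in the instruction above, one could rely on `re.sub`. The name `clean_title` and the exact character class are assumptions made for this illustration, not something taken from the project.

```python
import re

def clean_title(title):
    """Remove every character that is not an ASCII letter, a digit, or a space."""
    # Assumption: only ASCII letters count as letters here, so accented
    # characters are dropped as well.
    return re.sub(r"[^a-zA-Z0-9 ]", "", title)

# The parentheses disappear, the year digits stay
print(clean_title("Toy Story (1995)"))
```

With this sketch, `clean_title("Toy Story (1995)")` returns `"Toy Story 1995"`. It could be applied to the whole column with something like `movies['title'].apply(clean_title)`.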

In [6]:
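#Creating two lists: one for the cleaned titles and one for the future dates column
titles = []
dates = []

for element in movies['title']:
    #Splitting on the opening parenthesis: 'Toy Story (1995)' -> ['Toy Story ', '1995)']
    new_title = re.split(r'\(', element)
    if len(new_title) == 2:
        titles.append(new_title[0])
        dates.append(new_title[1])
    else:
        #Titles with no parenthesis, or with more than one, get no date
        titles.append(new_title[0])
        dates.append(None)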
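The split above only yields a date when a title contains exactly one opening parenthesis. A possible alternative, sketched here under the assumption that the year is always a 4-digit number in parentheses at the very end of the title, is to anchor a regular expression on that trailing year. The names `YEAR_PATTERN` and `split_title_and_year` are hypothetical.

```python
import re

# Hypothetical pattern: any text, then a 4-digit year in parentheses at the end
YEAR_PATTERN = re.compile(r'^(?P<name>.*?)\s*\((?P<year>\d{4})\)\s*$')

def split_title_and_year(raw_title):
    """Return (name, year); year is a string, or None when no trailing year is found."""
    match = YEAR_PATTERN.match(raw_title)
    if match is None:
        return raw_title.strip(), None
    return match.group('name'), match.group('year')

print(split_title_and_year("Toy Story (1995)"))
# A title shaped like this one has two parenthesized groups; only the last is the year
print(split_title_and_year("City of Lost Children, The (Cité des enfants perdus, La) (1995)"))
```

In this sketch, any earlier parenthesized text stays part of the name, and titles without a trailing year are returned unchanged with `None` as the year.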

In [7]:
#Checking that everything has been added correctly by checking the two lists contain the same number of elements
print (len(dates), len(titles))

62423 62423


In [8]:
#Cleaning the two lists
titles_cleaned = []
dates_cleaned = []

#Removing the trailing space left before the parenthesis
for element in titles:
    titles_cleaned.append(element.rstrip(' '))

#Removing the closing parenthesis after the year
for element in dates:
    if element is not None:
        dates_cleaned.append(element.rstrip(')'))
    else:
        dates_cleaned.append(None)

In [9]:
print (len(dates_cleaned), len(titles_cleaned))

62423 62423


In [10]:
#In the dataset, replacing the title column with the cleaned one and adding a dates column
#Note: this assignment does not copy the DataFrame, so `movies` is modified as well
movies_cleaned = movies
movies_cleaned['title'] = titles_cleaned
movies_cleaned['dates'] = dates_cleaned

movies_cleaned

Unnamed: 0,movieId,title,genres,dates
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji,Adventure|Children|Fantasy,1995
2,3,Grumpier Old Men,Comedy|Romance,1995
3,4,Waiting to Exhale,Comedy|Drama|Romance,1995
4,5,Father of the Bride Part II,Comedy,1995
...,...,...,...,...
62418,209157,We,Drama,2018
62419,209159,Window of the Soul,Documentary,2001
62420,209163,Bad Poems,Comedy|Drama,2018
62421,209169,A Girl Thing,(no genres listed),2001


# Creating a search engine

**Our search engine should return the 5 titles closest to a search query among the movies we loaded.**

**The first step is creating a TF-IDF matrix. TF-IDF stands for *Term Frequency-Inverse Document Frequency*.**

**It is a statistical measure that evaluates how relevant a word is to a document in a collection of documents.
This is done by multiplying two metrics:**
- The term frequency: how many times a word appears in a document (the simplest version being a raw count)
- The inverse document frequency: how rare the word is across the entire set of documents
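To make the product of the two metrics concrete, here is a small hand-rolled sketch on three made-up titles. It assumes the textbook definition, a raw count multiplied by `log(N / df)`. scikit-learn's `TfidfVectorizer` uses a smoothed idf and normalizes each row by default, so its numbers will differ from these.

```python
import math

def tf_idf(term, document, corpus):
    """Textbook TF-IDF: raw term count times log(N / document frequency)."""
    tf = document.split().count(term)
    # Number of documents that contain the term at least once
    df = sum(1 for doc in corpus if term in doc.split())
    if df == 0:
        return 0.0
    idf = math.log(len(corpus) / df)
    return tf * idf

# Made-up corpus, lowercased for simplicity
corpus = ["toy story", "toy soldiers", "love story"]

# "toy" appears in 2 of 3 documents, "soldiers" in only 1 of 3,
# so "soldiers" gets the higher weight
print(tf_idf("toy", corpus[0], corpus))
print(tf_idf("soldiers", corpus[1], corpus))
```

The rarer word ends up with the larger score, which is the behavior the two bullet points describe.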

In [11]:
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer

**The TF-IDF vectorizer tokenizes the documents, learns the vocabulary and the inverse document frequency weights, and can then encode new documents with that same vocabulary.
It turns text into a numerical representation that algorithms can work with.**

**The TfidfVectorizer converts a collection of raw documents into a matrix of TF-IDF features.**
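A short illustration on three made-up titles shows what the vectorizer produces. The variable names (`sample_titles`, `vectorizer_demo`) are only for this sketch, and it assumes scikit-learn's default tokenizer, which lowercases text and ignores single-character tokens.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up titles, just to inspect the output
sample_titles = ["Toy Story", "Toy Soldiers", "Love Story"]

# Unigrams and bigrams, as in the notebook
vectorizer_demo = TfidfVectorizer(ngram_range=(1, 2))
matrix = vectorizer_demo.fit_transform(sample_titles)

# One row per title, one column per learned unigram or bigram
print(matrix.shape)
print(sorted(vectorizer_demo.vocabulary_))

# A new query is encoded with the vocabulary learned above;
# the single-character token "2" is ignored by the default tokenizer
query = vectorizer_demo.transform(["toy story 2"])
print(query.nnz)
```

The result is a sparse matrix, which keeps memory usage reasonable when there are tens of thousands of titles.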

In [12]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
#np.set_printoptions(threshold=np.inf)

#Initializing TfidfVectorizer with n-grams of 1 to 2 words
vectorizer = TfidfVectorizer(ngram_range = (1,2))
#Turning our titles into a TF-IDF matrix
tfidf = vectorizer.fit_transform(movies_cleaned['title'])

In [13]:
def search_similar_movies(title):
    query_vector = vectorizer.transform([title])
    
    #Computing the cosine similarity between the given title and every row of the matrix, then flattening the result to one dimension:
    #the array has one entry per title, and each entry is the similarity between that title and the searched title
    similarity = cosine_similarity(query_vector, tfidf).flatten()
    
    #Finding the 5 titles that have the greatest similarity with our searched title
    #(argpartition selects the top 5 but does not guarantee their order)
    indices = np.argpartition(similarity, -5)[-5:]
    results = movies_cleaned.iloc[indices].iloc[::-1]
    print(results['title'])
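For intuition about what `cosine_similarity` measures, here is a plain-Python sketch of the cosine between two vectors. The helper name `cosine_sim` and the choice to return 0.0 when a vector is all zeros are assumptions made for this example.

```python
import math

def cosine_sim(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Convention chosen for this sketch: an all-zero vector is similar to nothing
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Same direction gives 1, no shared terms gives 0
print(cosine_sim([1, 1, 0], [2, 2, 0]))
print(cosine_sim([1, 0], [0, 1]))
```

Because only the direction of the vectors matters, a short query can still match a longer title as long as they share the same weighted terms.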
   

In [14]:
search_similar_movies('toy story')

0                  Toy Story
3021             Toy Story 2
59767            Toy Story 4
14813            Toy Story 3
20497    Toy Story of Terror
Name: title, dtype: object
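`np.argpartition` finds the 5 largest similarities but leaves them in no particular order, so reversing the slice does not necessarily put the best match first. One way to get a ranked result, sketched here with a hypothetical helper `top_k_indices`, is to sort only the selected candidates.

```python
import numpy as np

def top_k_indices(scores, k=5):
    """Indices of the k largest scores, ordered from highest to lowest."""
    scores = np.asarray(scores)
    k = min(k, scores.size)
    # Partial selection: the last k positions hold the k largest scores, unordered
    candidates = np.argpartition(scores, -k)[-k:]
    # Sorting just those k candidates is cheap compared with sorting everything
    order = np.argsort(scores[candidates])[::-1]
    return candidates[order]

# Made-up similarity scores
scores = [0.1, 0.9, 0.3, 0.7, 0.5, 0.2]
print(top_k_indices(scores, k=3))
```

When several titles share exactly the same score, their relative order is still arbitrary.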


# Creating a recommendation system based on other users' ratings

**The ratings.csv file contains userId, movieId, rating and timestamp columns: each row is the rating one user gave to one movie.**

**We'll find all the users who also liked the movie that we typed in.
Then we'll look at the other movies they liked, because those will probably be good recommendations for us.**

In [15]:
ratings = pd.read_csv('RawData/ratings.csv')

In [16]:
ratings

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
1,1,306,3.5,1147868817
2,1,307,5.0,1147868828
3,1,665,5.0,1147878820
4,1,899,3.5,1147868510
...,...,...,...,...
25000090,162541,50872,4.5,1240953372
25000091,162541,55768,2.5,1240951998
25000092,162541,56176,2.0,1240950697
25000093,162541,58559,4.0,1240953434


**Searching for users who liked the same movie we liked. We consider that a user liked a movie if they rated it 4 out of 5 or higher.**

In [17]:
searched_movie_id = 1

In [18]:
most_liked_movies = ratings[ratings.rating >= 4]
similar_users = most_liked_movies[most_liked_movies.movieId == searched_movie_id]['userId'].unique()
similar_users

array([     3,      5,      8, ..., 162530, 162533, 162534], dtype=int64)

**Finding the other movies these users also liked.**

In [19]:
prefered_movies = most_liked_movies[(most_liked_movies['userId'].isin(similar_users))]
prefered_movies

Unnamed: 0,userId,movieId,rating,timestamp
254,3,1,4.0,1439472215
255,3,29,4.5,1484754967
256,3,32,4.5,1439474635
257,3,50,5.0,1439474391
258,3,111,4.0,1484753849
...,...,...,...,...
24999332,162534,166643,4.0,1526891765
24999342,162534,171763,4.0,1526717390
24999348,162534,177593,5.0,1526666314
24999351,162534,177765,4.0,1526666311


**Finding the 5 most frequent movies at each of the 3 highest rating levels (5, 4.5 and 4 stars).**

In [20]:
five_stars_ids = prefered_movies[prefered_movies['rating'] == 5]['movieId'].value_counts().index
five_stars_selection_id = five_stars_ids[0:5]

most_loved_movies = movies[movies['movieId'].isin(five_stars_selection_id)]['title']

In [21]:
four_half_stars_ids = prefered_movies[prefered_movies.rating == 4.5]['movieId'].value_counts().index
four_half_stars_selection_id = four_half_stars_ids[0:5]

loved_movies = movies[movies['movieId'].isin(four_half_stars_selection_id)]['title']

In [22]:
four_stars_ids = (prefered_movies[prefered_movies['rating'] == 4]['movieId'].value_counts()).index
four_stars_selection_id = four_stars_ids[0:5]

liked_movies = movies[movies['movieId'].isin(four_stars_selection_id)]['title']
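The three cells above differ only in the rating value, so the same selection could be written as a loop. This is a sketch on made-up data. The helper name `top_movies_per_rating` and its parameters are assumptions, and the DataFrame it receives is expected to have `movieId` and `rating` columns.

```python
import pandas as pd

def top_movies_per_rating(liked, rating_levels=(5.0, 4.5, 4.0), n=5):
    """For each rating level, return the n movieIds that received it most often."""
    selection = {}
    for level in rating_levels:
        # value_counts sorts from the most to the least frequent movieId
        counts = liked.loc[liked['rating'] == level, 'movieId'].value_counts()
        selection[level] = counts.index[:n].tolist()
    return selection

# Made-up ratings, only to show the shape of the result
toy_ratings = pd.DataFrame({
    'userId':  [1, 2, 3, 4, 1, 2, 3, 4],
    'movieId': [10, 10, 10, 20, 30, 30, 40, 50],
    'rating':  [5.0, 5.0, 5.0, 5.0, 4.5, 4.5, 4.5, 4.0],
})
print(top_movies_per_rating(toy_ratings, n=1))
```

The returned dictionary maps each rating level to a list of movie ids, which can then be looked up in the movies table as in the cells above.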

In [25]:
result_list = pd.concat([most_loved_movies, loved_movies, liked_movies]).drop_duplicates()
print ('others like you watched this and also liked:', result_list)

others like you watched this and also liked: 0                                            Toy Story
257                 Star Wars: Episode IV - A New Hope
292                                       Pulp Fiction
314                          Shawshank Redemption, The
1166    Star Wars: Episode V - The Empire Strikes Back
351                                       Forrest Gump
2480                                       Matrix, The
452                                      Fugitive, The
475                                      Jurassic Park
1237                                Back to the Future
Name: title, dtype: object
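The steps of this section could be gathered into one function. The sketch below is a simplified variant, not the exact logic of the cells above: it counts every rating at or above the threshold together instead of treating each rating level separately, and it leaves the searched movie out of the results. The function name, its parameters and the sample data are all made up for illustration.

```python
import pandas as pd

def recommend_from_similar_users(movie_id, ratings, movies, min_rating=4.0, n=5):
    """Titles most often liked by the users who also liked movie_id."""
    liked = ratings[ratings['rating'] >= min_rating]
    # Users who liked the searched movie
    similar_users = liked.loc[liked['movieId'] == movie_id, 'userId'].unique()
    # Everything else those users liked
    others = liked[liked['userId'].isin(similar_users) & (liked['movieId'] != movie_id)]
    top_counts = others['movieId'].value_counts().head(n)
    title_by_id = movies.set_index('movieId')['title']
    # Ids missing from the movies table are skipped
    return [title_by_id[movie] for movie in top_counts.index if movie in title_by_id.index]

# Made-up sample data
sample_movies = pd.DataFrame({
    'movieId': [1, 2, 3, 4],
    'title': ['Toy Story', 'Jumanji', 'Grumpier Old Men', 'Waiting to Exhale'],
})
sample_ratings = pd.DataFrame({
    'userId':  [1, 1, 1, 2, 2, 3, 3],
    'movieId': [1, 2, 3, 1, 2, 1, 4],
    'rating':  [5.0, 4.5, 4.0, 4.0, 5.0, 2.0, 5.0],
})
print(recommend_from_similar_users(1, sample_ratings, sample_movies))
```

In the sample data, user 3 rated the searched movie 2.0, so that user's other favorites are not taken into account.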
