![alt text](3.png "Title")
**Recommender systems** are among the most popular applications of data science today. They are used to predict the "rating" or "preference" that a user would give to an item. Almost every major tech company has applied them in some form. Amazon uses them to suggest products to customers, YouTube uses them to decide which video to play next on autoplay, and Facebook uses them to recommend pages to like and people to follow.

Broadly, recommender systems can be classified into 3 types:

>**Simple recommenders**:<br><br><br>
offer generalized recommendations to every user, based on movie popularity and/or genre. The basic idea behind this system is that movies that are more popular and critically acclaimed will have a higher probability of being liked by the average audience. An example could be IMDB Top 250.<br><br><br>
**Content-based recommenders**:<br><br><br>
suggest similar items based on a particular item. This system uses item metadata, such as genre, director, description, and actors for movies, to make these recommendations. The general idea behind these recommender systems is that if a person likes a particular item, they will also like an item that is similar to it. To make such recommendations, the system uses the metadata of items the user has interacted with in the past. A good example could be YouTube, which suggests new videos you might want to watch based on your history.<br><br><br>
**Collaborative filtering engines**:<br><br><br> these systems are widely used, and they try to predict the rating or preference that a user would give an item based on the past ratings and preferences of other users. Unlike their content-based counterparts, collaborative filters do not require item metadata.

In [1]:
import pandas as pd 
from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

In [2]:
df = pd.read_csv("movies_metadata.csv",low_memory=False)

df.head(2)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0


In [3]:
df.shape

(45466, 24)

In [4]:
df.columns

Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count'],
      dtype='object')

In [5]:
df.isnull().sum()

adult                        0
belongs_to_collection    40972
budget                       0
genres                       0
homepage                 37684
id                           0
imdb_id                     17
original_language           11
original_title               0
overview                   954
popularity                   5
poster_path                386
production_companies         3
production_countries         3
release_date                87
revenue                      6
runtime                    263
spoken_languages             6
status                      87
tagline                  25054
title                        6
video                        6
vote_average                 6
vote_count                   6
dtype: int64

In [6]:
# 'columns=' already selects the axis, so the redundant axis=1 is dropped
df.drop(columns=['homepage','belongs_to_collection','tagline'],inplace=True)
df.dropna(inplace=True)
df.isnull().sum()

adult                   0
budget                  0
genres                  0
id                      0
imdb_id                 0
original_language       0
original_title          0
overview                0
popularity              0
poster_path             0
production_companies    0
production_countries    0
release_date            0
revenue                 0
runtime                 0
spoken_languages        0
status                  0
title                   0
video                   0
vote_average            0
vote_count              0
dtype: int64

In [7]:
df.shape

(44048, 21)

In [8]:
df.duplicated().sum()

17

In [9]:
df.drop_duplicates(inplace=True)

In [10]:
df.duplicated().sum()

0

In [11]:
df.columns

Index(['adult', 'budget', 'genres', 'id', 'imdb_id', 'original_language',
       'original_title', 'overview', 'popularity', 'poster_path',
       'production_companies', 'production_countries', 'release_date',
       'revenue', 'runtime', 'spoken_languages', 'status', 'title', 'video',
       'vote_average', 'vote_count'],
      dtype='object')

**Simple recommenders**: <br><br><br>
>As described in the previous section, simple recommenders are basic systems that recommend the top items based on a certain metric or score. In this section, you will build a simplified clone of IMDB Top 250 Movies using metadata collected from IMDB.

We will use a **weighted rating** that takes into account both the average rating of a movie and the number of votes it has accumulated. Such a system makes sure that a movie with a 9 rating from 100,000 voters gets a (far) higher score than a movie with the same rating but a mere few hundred voters.

$$\text{WeightedRating (WR)} = \left(\frac{v}{v+m} \cdot R\right) + \left(\frac{m}{v+m} \cdot C\right)$$
 
In the above equation,

v is the number of votes for the movie;

m is the minimum votes required to be listed in the chart;

R is the average rating of the movie;

C is the mean vote across the whole dataset.
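To see how the formula behaves, here is a minimal worked sketch. The values of `m` and `C` and the two movies are hypothetical, chosen only for illustration, and the helper name `wr_score` is not part of the notebook.

```python
def wr_score(v, R, m, C):
    """Weighted rating: blend the movie's own mean rating R with the
    global mean C, weighting R by the share of votes v / (v + m)."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Hypothetical values, for illustration only
m = 1000   # minimum votes required to be listed
C = 6.0    # mean rating across all movies

popular = wr_score(v=100_000, R=9.0, m=m, C=C)  # many voters
obscure = wr_score(v=300, R=9.0, m=m, C=C)      # few voters

print(round(popular, 2))  # about 8.97: stays close to its own rating
print(round(obscure, 2))  # about 6.69: pulled toward the global mean
```

With few votes, the second term dominates and the score shrinks toward `C`; with many votes, the score approaches the movie's own rating `R`.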

In [12]:
c= df['vote_average'].mean()
c

5.673836615111793

In [13]:
#calculate the number of votes, m, received by a movie in the 90th percentile
m = df['vote_count'].quantile(0.9)
m

168.0
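As a rough illustration of what the 90th-percentile cutoff does, the sketch below reimplements a linearly interpolated quantile in plain Python (pandas' `quantile` also interpolates linearly by default) and applies it to a small, made-up list of vote counts. The helper `quantile` and the sample data are assumptions for this example only.

```python
def quantile(values, q):
    """Quantile with linear interpolation between the two nearest ranks."""
    s = sorted(values)
    pos = q * (len(s) - 1)          # fractional position in the sorted list
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

# Made-up vote counts, for illustration only
vote_counts = [0, 1, 3, 5, 10, 20, 36, 80, 168, 900, 14075]

cutoff = quantile(vote_counts, 0.9)
qualified = [v for v in vote_counts if v >= cutoff]

print(cutoff)
print(qualified)  # only the most-voted entries survive the filter
```

Because vote counts are heavily skewed, a high percentile keeps only a small set of widely rated movies in the chart.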

Filter the movies based on `vote_count`, keeping only those with at least `m` votes.

In [14]:
q_movies = df.copy().loc[df['vote_count']>=m]
q_movies.shape

(4406, 21)

In [15]:
df['vote_count'].describe()

count    44031.000000
mean       113.240285
std        498.848134
min          0.000000
25%          3.000000
50%         10.000000
75%         36.000000
max      14075.000000
Name: vote_count, dtype: float64

In [16]:
# Function that computes the weighted rating of each movie
def weighted_rating(x,m=m,c=c):
    v=x['vote_count']
    r=x['vote_average']
    rate = (v/(v+m)*r)+(m/(v+m)*c)
    return rate

In [17]:
#Apply Our Function to create new weighted rate column
q_movies['weighted_rate']=q_movies.apply(weighted_rating,axis=1)

In [18]:
#Sort The Movies based on the weighted rate
q_movies = q_movies.sort_values('weighted_rate',ascending=False)
# Print the top 5 rated movies of all time
# Select the title column by name instead of by position
q_movies['title'].head(5)

314         The Shawshank Redemption
834                    The Godfather
10309    Dilwale Dulhania Le Jayenge
12481                The Dark Knight
2843                      Fight Club
Name: title, dtype: object

In [19]:
q_movies.shape

(4406, 22)

**Content-Based Recommender**
<br><br>
>Plot Description Based Recommender
We will build a system that recommends movies similar to a given movie. To achieve this, we will compute pairwise cosine similarity scores between all movies based on their plot descriptions and recommend the movies with the highest similarity scores.

In [20]:
df.overview.head()

0    Led by Woody, Andy's toys live happily in his ...
1    When siblings Judy and Peter discover an encha...
2    A family wedding reignites the ancient feud be...
3    Cheated on, mistreated and stepped on, the wom...
4    Just when George Banks has recovered from his ...
Name: overview, dtype: object

In [21]:
thresh = df['vote_count'].quantile(0.8)
print("we will only recommend movies with vote count greater than or equal to "+str(int(thresh)))

we will only recommend movies with vote count greater than or equal to 52


In [22]:
filtered_df = df.copy().loc[df['vote_count']>=thresh]
filtered_df.shape

(8890, 21)

We will compute Term Frequency-Inverse Document Frequency (TF-IDF) vectors for each movie. This gives a matrix where each column represents a word in the overview vocabulary (all the words that appear in at least one document) and each row represents a movie.

In its essence, the TF-IDF score is the frequency of a word occurring in a document, down-weighted by the number of documents in which it occurs. This is done to reduce the importance of words that frequently occur in plot overviews and, therefore, their significance in computing the final similarity score.
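The idea can be sketched in plain Python on three made-up overviews. This is a textbook variant (raw term count times `log(N / df)`), not the exact formula used by `TfidfVectorizer`, which applies a smoothed IDF and L2 normalization by default; the documents and the helper `tfidf` are assumptions for illustration.

```python
import math
from collections import Counter

# Three toy "overviews", made up for illustration
docs = [
    "a toy cowboy and a toy astronaut",
    "a detective hunts a killer",
    "a cowboy rides into town",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word occur?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tfidf(tokens):
    """Raw term count times log(N / df) for every word in one document."""
    counts = Counter(tokens)
    return {w: c * math.log(n_docs / doc_freq[w]) for w, c in counts.items()}

scores = tfidf(tokenized[0])
for word, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word:10s} {score:.3f}")
```

The word "a" appears in every document, so its weight drops to zero, while "toy", which is frequent in the first overview and absent elsewhere, gets the highest weight. In the notebook, such common words are additionally removed up front via `stop_words='english'`.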

In [23]:

#create an instance of the vectorizer
tfidf = TfidfVectorizer(stop_words='english')
filtered_df['overview'] = filtered_df['overview'].fillna('')
#fit and transform the vectorizer on the dataframe 'overview' columns
fit_matrix= tfidf.fit_transform(filtered_df['overview'])
#view shape
fit_matrix.shape

(8890, 28642)

In [24]:
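# get_feature_names() was removed in newer scikit-learn releases; use get_feature_names_out()
list(tfidf.get_feature_names_out()[1000:1010])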

['alchemist',
 'alchemy',
 'alcohol',
 'alcoholic',
 'alcoholism',
 'alcott',
 'alcs',
 'ald',
 'aldea',
 'alderman']

From the shape of the TF-IDF matrix above, you can see that the vocabulary contains 28,642 distinct words, used to describe the 8,890 movies in the filtered dataset.

We will use cosine similarity to calculate a numeric quantity that denotes the similarity between two movies. We choose the cosine similarity score because it is independent of vector magnitude and is relatively easy and fast to calculate.

Because `TfidfVectorizer` L2-normalizes each row by default, the dot product of two rows already equals their cosine similarity. We therefore use sklearn's `linear_kernel()` instead of `cosine_similarity()`, since it is faster.
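The reason a plain dot product (which is what `linear_kernel()` computes) can stand in for cosine similarity is that the TF-IDF rows have unit length by default. A minimal sketch with two made-up vectors and hypothetical helper names:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Made-up vectors, for illustration only
a = [1.0, 2.0, 0.0]
b = [2.0, 1.0, 1.0]

cos = cosine(a, b)
lin = dot(l2_normalize(a), l2_normalize(b))  # dot product of unit vectors

print(cos, lin)  # the two values agree up to floating-point error
```

Skipping the per-pair norm computation is where the speed advantage comes from when the vectors are already normalized.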

In [25]:
cosine_sim = linear_kernel(fit_matrix,fit_matrix)
cosine_sim.shape

(8890, 8890)

In [26]:
filtered_df['title']= filtered_df['title'].str.lower()

In [27]:
filtered_df =filtered_df.reset_index()

Next, we build a reverse mapping from movie titles to DataFrame indices, so that a title can be looked up to find its row in the similarity matrix.

In [30]:
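get_ind = pd.Series(filtered_df.index,index=filtered_df['title'])
# Keep only the first entry for repeated titles so a lookup returns a single index
get_ind = get_ind[~get_ind.index.duplicated(keep='first')]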

In [31]:
get_ind[:5]

title
toy story                      0
jumanji                        1
grumpier old men               2
father of the bride part ii    3
heat                           4
dtype: int64

In [32]:
final = filtered_df['title']

Here we implement a function that returns the top 10 recommendations for a given movie title.

In [34]:
def get_recommendations(title, final, cosine_sim):
    # Convert the title to lowercase to match the keys of get_ind
    title = title.lower()
    try:
        # Get the index of the movie that matches the title
        ind = get_ind[title]
    except KeyError:
        print("Did not find movie in our database")
        return None

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[ind]))

    # Sort the movies by similarity score, highest first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the 10 most similar movies, skipping the first entry (the movie itself)
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the titles of the top 10 most similar movies
    return final.iloc[movie_indices]
      
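The ranking step can be traced on a tiny example. The 4x4 similarity matrix and the titles below are made up for illustration, and `top_k` is a hypothetical helper, not part of the notebook.

```python
# Toy similarity matrix (made-up values); row i holds movie i's similarity to every movie
titles = ["toy story", "toy story 2", "heat", "jumanji"]
sim = [
    [1.00, 0.46, 0.02, 0.10],
    [0.46, 1.00, 0.01, 0.08],
    [0.02, 0.01, 1.00, 0.05],
    [0.10, 0.08, 0.05, 1.00],
]

def top_k(query_index, k):
    """Return the k titles most similar to the query, excluding the query itself."""
    ranked = sorted(enumerate(sim[query_index]), key=lambda pair: pair[1], reverse=True)
    return [titles[i] for i, _ in ranked if i != query_index][:k]

result = top_k(0, 2)
print(result)  # ['toy story 2', 'jumanji']
```

This sketch drops the query by index rather than by slicing off the first entry. The two approaches agree as long as no other movie has a similarity of exactly 1 to the query, for example because of an identical overview.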

In [41]:
def view_sim_score(mov1, mov2, cosine_sim=cosine_sim):
    # Look up both titles (case-insensitively) and return their pairwise similarity
    ind1 = get_ind[mov1.lower()]
    ind2 = get_ind[mov2.lower()]
    return cosine_sim[ind1][ind2]

print(view_sim_score('Toy Story', 'Toy Story 2'))


0.45815194652946556


In [46]:
def save_artifacts(cosine_sim,final):
    arr = np.array(cosine_sim)
    np.save('cosine_sim',arr)
    final.to_csv('final.csv',index=False)
    return "Saved"
print(save_artifacts(cosine_sim,final))
    

Saved


In [47]:
def load_artifacts():
    # Load the saved similarity matrix and the saved movie titles
    ret = np.load('cosine_sim.npy')
    db = pd.read_csv('final.csv')
    return ret, db

In [48]:
ret, db = load_artifacts()

In [50]:
get_recommendations("source code",final,ret)

7382                  the anomaly
7213    dead snow 2: red vs. dead
6282                      bad ass
2635                 terror train
7217               last passenger
8083                         howl
3559                silver streak
1266                      airport
5157                   jab we met
5854                         hugo
Name: title, dtype: object