# Movie Recommendation System

* Recommendation systems predict user preferences and suggest relevant items (movies, products, articles) by analyzing user behavior and item attributes.
* There are two main types of recommendation systems:
    * **Content-based recommendation system:** analyzes the content you liked to suggest similar items, e.g. using tags, genre, or description.
    * **Collaborative filtering recommendation system:** uses other users' preferences to recommend items you haven't tried yet, e.g. using ratings from other users.
    
# Content-Based Recommendation System

In [1]:
import pandas as pd
data = pd.read_csv("tmdb_movies_data.csv")

In [2]:
data.head()

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,...,Twenty-two years after the events of Jurassic ...,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/2015,5562,6.5,2015,137999939.3,1392446000.0
1,76341,tt1392190,28.419936,150000000,378436354,Mad Max: Fury Road,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,http://www.madmaxmovie.com/,George Miller,What a Lovely Day.,...,An apocalyptic story set in the furthest reach...,120,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,5/13/2015,6185,7.1,2015,137999939.3,348161300.0
2,262500,tt2908446,13.112507,110000000,295238201,Insurgent,Shailene Woodley|Theo James|Kate Winslet|Ansel...,http://www.thedivergentseries.movie/#insurgent,Robert Schwentke,One Choice Can Destroy You,...,Beatrice Prior must confront her inner demons ...,119,Adventure|Science Fiction|Thriller,Summit Entertainment|Mandeville Films|Red Wago...,3/18/2015,2480,6.3,2015,101199955.5,271619000.0
3,140607,tt2488496,11.173104,200000000,2068178225,Star Wars: The Force Awakens,Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...,http://www.starwars.com/films/star-wars-episod...,J.J. Abrams,Every generation has a story.,...,Thirty years after defeating the Galactic Empi...,136,Action|Adventure|Science Fiction|Fantasy,Lucasfilm|Truenorth Productions|Bad Robot,12/15/2015,5292,7.5,2015,183999919.0,1902723000.0
4,168259,tt2820852,9.335014,190000000,1506249360,Furious 7,Vin Diesel|Paul Walker|Jason Statham|Michelle ...,http://www.furious7.com/,James Wan,Vengeance Hits Home,...,Deckard Shaw seeks revenge against Dominic Tor...,137,Action|Crime|Thriller,Universal Pictures|Original Film|Media Rights ...,4/1/2015,2947,7.3,2015,174799923.1,1385749000.0


## Feature Selection

In [3]:
data.isnull().sum()

id                         0
imdb_id                   10
popularity                 0
budget                     0
revenue                    0
original_title             0
cast                      76
homepage                7930
director                  44
tagline                 2824
keywords                1493
overview                   4
runtime                    0
genres                    23
production_companies    1030
release_date               0
vote_count                 0
vote_average               0
release_year               0
budget_adj                 0
revenue_adj                0
dtype: int64

In [4]:
data.columns

Index(['id', 'imdb_id', 'popularity', 'budget', 'revenue', 'original_title',
       'cast', 'homepage', 'director', 'tagline', 'keywords', 'overview',
       'runtime', 'genres', 'production_companies', 'release_date',
       'vote_count', 'vote_average', 'release_year', 'budget_adj',
       'revenue_adj'],
      dtype='object')

In [5]:
# .copy() gives an independent DataFrame, so adding a column does not trigger SettingWithCopyWarning
movies = data[['id', 'original_title', 'overview', 'genres']].copy()
movies['tags'] = movies['overview'] + movies['genres']
movies = movies.drop(columns=['overview', 'genres'])
movies

Unnamed: 0,id,original_title,tags
0,135397,Jurassic World,Twenty-two years after the events of Jurassic ...
1,76341,Mad Max: Fury Road,An apocalyptic story set in the furthest reach...
2,262500,Insurgent,Beatrice Prior must confront her inner demons ...
3,140607,Star Wars: The Force Awakens,Thirty years after defeating the Galactic Empi...
4,168259,Furious 7,Deckard Shaw seeks revenge against Dominic Tor...
...,...,...,...
10861,21,The Endless Summer,"The Endless Summer, by Bruce Brown, is one of ..."
10862,20379,Grand Prix,Grand Prix driver Pete Aron is fired by his te...
10863,39768,Beregis Avtomobilya,An insurance agent who moonlights as a carthie...
10864,21449,"What's Up, Tiger Lily?","In comic Woody Allen's film debut, he took the..."
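
The null counts shown earlier include a few missing `overview` and `genres` values. With plain `+`, a missing value on either side makes the whole tag missing, and with no separator the last overview word can fuse with the first genre. A minimal sketch of one way to handle both, using a hypothetical toy frame rather than the TMDB data (the names `toy`, `naive_tags`, and `safe_tags` are illustrative only):

```python
import pandas as pd

# Toy stand-in for the real data (hypothetical rows, not taken from the TMDB file)
toy = pd.DataFrame({
    "original_title": ["A", "B", "C"],
    "overview": ["A heist goes wrong", None, "Two friends travel"],
    "genres": ["Crime|Thriller", "Drama", None],
})

# Plain concatenation: a missing value on either side yields a missing tag
naive_tags = toy["overview"] + toy["genres"]

# Fill missing text with empty strings and join with a space,
# then trim the stray space left when one side was empty
safe_tags = (toy["overview"].fillna("") + " " + toy["genres"].fillna("")).str.strip()

print(naive_tags.isna().sum())
print(safe_tags.tolist())
```

Applying this to the real data would change the tags slightly, so the similarity scores shown later would not be reproduced exactly.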


## Vectorizing the tags 

* **CountVectorizer:** turns text into a document-term matrix with one row per document and one column per vocabulary word; each cell holds that word's count in the document.

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer(max_features=10866, stop_words='english')
cv

In [7]:
vector=cv.fit_transform(movies['tags'].values.astype('U')).toarray()
vector

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [8]:
vector.shape

(10866, 10866)
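
A dense 10866 x 10866 count array holds roughly 118 million cells, most of them zero. `fit_transform` returns a SciPy sparse matrix before `.toarray()` is called, and `cosine_similarity` accepts sparse input, so the dense conversion could be skipped. A minimal sketch on toy strings (all names here are illustrative) suggesting that both routes give the same scores:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy documents
docs = ["space adventure space battle", "crime drama adventure", "space battle fleet"]

sparse_counts = CountVectorizer().fit_transform(docs)  # SciPy sparse matrix
dense_counts = sparse_counts.toarray()                  # same counts as a dense array

sim_from_sparse = cosine_similarity(sparse_counts)      # returns a dense ndarray by default
sim_from_dense = cosine_similarity(dense_counts)

print(np.allclose(sim_from_sparse, sim_from_dense))
```

The resulting similarity matrix is still dense, so this only saves memory on the count matrix, not on the final result.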

## Cosine Similarity

* In recommendation engines we usually don't use a length-based distance metric (Euclidean distance) to measure similarity; we use angular distance instead, because:
    * **Cold-start problem:** new items or users have little interaction history. Angular distance lets the system find relevant items from whatever information is available, even if it is limited.
    * Angular distance focuses on the direction of the vectors, capturing the features they share.
    * Consider two movies with similar themes but descriptions of different lengths. A length-based distance would penalize the longer one, even though its content might be very relevant.
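
The contrast between length-based and angular measures can be sketched with two hypothetical count vectors that have the same word proportions but different lengths (the vectors below are made up for illustration):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

short_doc = np.array([[1, 2, 0, 1]])   # hypothetical word counts
long_doc = short_doc * 3               # same proportions, three times as many words

# Euclidean distance grows with the difference in length
euclidean = np.linalg.norm(short_doc - long_doc)

# Cosine similarity only looks at direction
cosine = cosine_similarity(short_doc, long_doc)[0, 0]

print(euclidean)  # nonzero, although the content mix is identical
print(cosine)     # approximately 1, since the vectors point the same way
```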

* **Case 1:**
<img src="images/case1.jpeg" style="width:500px; height:200px;">

* **Case 2:**
<img src="images/case2.jpeg" style="width:500px; height:200px;">

* **Case 3:**
<img src="images/case3.jpeg" style="width:500px; height:200px;">

* **Case 4:**
<img src="images/case4.jpeg" style="width:500px; height:200px;">

* **Range of Cosine similarity:**
<img src="images/range.jpeg" style="width:500px; height:200px;">

* **Formula:**
<img src="images/formula.jpeg" style="width:500px; height:400px;">
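
In text form, the standard definition of cosine similarity between two $n$-dimensional vectors $A$ and $B$ (assumed to match the formula image above) is:

$$
\cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
= \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}
$$

In general the value lies between -1 and 1. Word-count vectors have no negative entries, so for them it falls between 0 and 1.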

In [9]:
from sklearn.metrics.pairwise import cosine_similarity
similarity=cosine_similarity(vector)
similarity

array([[1.        , 0.2057378 , 0.19802951, ..., 0.        , 0.0758098 ,
        0.        ],
       [0.2057378 , 1.        , 0.12222647, ..., 0.        , 0.16376789,
        0.        ],
       [0.19802951, 0.12222647, 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.03573708,
        0.        ],
       [0.0758098 , 0.16376789, 0.        , ..., 0.03573708, 1.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        1.        ]])

In [10]:
similarity.shape

(10866, 10866)

## Recommendation engine

In [11]:
movies[movies['original_title']=="The Godfather"].index[0]

7269

In [12]:
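# Sort (row index, score) pairs by score, highest first; the top entry is normally the movie itself, so skip it
distance = sorted(enumerate(similarity[7269]), reverse=True, key=lambda pair: pair[1])
for i, _ in distance[1:6]:
    print(movies.iloc[i].original_title)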

The Godfather: Part II
Blood Ties
Felon
Life
Bad Turn Worse


In [13]:
# Without a key, the tuples are sorted by their first element (the row index).
# To rank by similarity score instead, pass a key that selects the second element.
sorted(list(enumerate(similarity[7269])), reverse=True)

[(10865, 0.08151391459392224),
 (10864, 0.0),
 (10863, 0.0),
 (10862, 0.03251280443811775),
 (10861, 0.027389551783238836),
 (10860, 0.0),
 (10859, 0.062257280636469035),
 (10858, 0.0),
 (10857, 0.024738534799764674),
 (10856, 0.0914991421995628),
 (10855, 0.06745406256199318),
 (10854, 0.09762720569806997),
 (10853, 0.11095900821829636),
 (10852, 0.0),
 (10851, 0.10955820713295535),
 (10850, 0.12199885626608374),
 (10849, 0.130051217752471),
 (10848, 0.02831827358942995),
 (10847, 0.0),
 (10846, 0.0),
 (10845, 0.023816275411477048),
 (10844, 0.0),
 (10843, 0.026546593660094948),
 (10842, 0.10166571355506977),
 (10841, 0.04822428221704121),
 (10840, 0.0),
 (10839, 0.053916386601719206),
 (10838, 0.024738534799764674),
 (10837, 0.03594425773447948),
 (10836, 0.0),
 (10835, 0.022733144649015782),
 (10834, 0.0),
 (10833, 0.1320676359488436),
 (10832, 0.0),
 (10831, 0.0),
 (10830, 0.025070610528195012),
 (10829, 0.0566365471788599),
 (10828, 0.039374961547907886),
 (10827, 0.17049858486761

In [14]:
sorted(list(enumerate(similarity[7269])), reverse=True, key=lambda vector:vector[1])

[(7269, 1.0000000000000004),
 (9758, 0.4969124919176443),
 (5587, 0.3521803625302496),
 (2950, 0.321860342910192),
 (2513, 0.3149996923832631),
 (1008, 0.28768617953126774),
 (3952, 0.28237248320100705),
 (4643, 0.28015776286411065),
 (8786, 0.2795807122764419),
 (9064, 0.2776029241433383),
 (4, 0.2758802939230217),
 (505, 0.2758802939230217),
 (4475, 0.2758802939230217),
 (5721, 0.2744974265986884),
 (3222, 0.2727977357881894),
 (9418, 0.2727977357881894),
 (2444, 0.26687249808205815),
 (7553, 0.26687249808205815),
 (2327, 0.26622333025588873),
 (3583, 0.25890435250935817),
 (4166, 0.25890435250935817),
 (7017, 0.25890435250935817),
 (10558, 0.25438520029955736),
 (10601, 0.25438520029955736),
 (4757, 0.2537729606626882),
 (8357, 0.2537729606626882),
 (10622, 0.2537729606626882),
 (130, 0.25160980414135636),
 (1421, 0.2516098041413563),
 (3120, 0.2516098041413563),
 (5662, 0.2516098041413563),
 (1993, 0.2513674754670647),
 (7583, 0.2500645911391736),
 (2748, 0.24902912254587617),
 (91
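
The same ranking can be sketched with NumPy's `argsort` instead of sorting tuples. This is an alternative under stated assumptions, not the notebook's method; the `scores` row and `query_index` below are hypothetical:

```python
import numpy as np

# Hypothetical similarity row for the movie at position 2
scores = np.array([0.10, 0.45, 1.00, 0.30, 0.05])
query_index = 2

# argsort returns indices in ascending score order; reverse for descending
ranked = np.argsort(scores)[::-1]

# Drop the query movie itself, then keep the three best matches
top_matches = [int(i) for i in ranked if i != query_index][:3]
print(top_matches)
```

When several movies share the same score, `argsort` may order them differently from `sorted`, so tied recommendations can appear in a different order.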

In [15]:
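def recommend(movie_name):
    # Look up the row of the requested title; stop early if it is not in the dataset
    matches = movies[movies['original_title'] == movie_name].index
    if len(matches) == 0:
        print(f"No movie titled {movie_name!r} found")
        return
    # Rank all movies by similarity, highest first; the top entry is normally the movie itself
    distance = sorted(enumerate(similarity[matches[0]]), reverse=True, key=lambda pair: pair[1])
    for i, _ in distance[1:6]:
        print(movies.iloc[i].original_title)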

In [18]:
recommend("Find Me Guilty")

Revenge of the Green Dragons
Casino Jack
The 33
Quiz Show
Seabiscuit


## Save data

In [17]:
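import pickle
with open('movies_list.pkl', 'wb') as f:  # context managers close the files reliably
    pickle.dump(movies, f)
with open('similarity.pkl', 'wb') as f:
    pickle.dump(similarity, f)
with open('movies_list.pkl', 'rb') as f:
    loaded_movies = pickle.load(f)
loaded_movies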

Unnamed: 0,id,original_title,tags
0,135397,Jurassic World,Twenty-two years after the events of Jurassic ...
1,76341,Mad Max: Fury Road,An apocalyptic story set in the furthest reach...
2,262500,Insurgent,Beatrice Prior must confront her inner demons ...
3,140607,Star Wars: The Force Awakens,Thirty years after defeating the Galactic Empi...
4,168259,Furious 7,Deckard Shaw seeks revenge against Dominic Tor...
...,...,...,...
10861,21,The Endless Summer,"The Endless Summer, by Bruce Brown, is one of ..."
10862,20379,Grand Prix,Grand Prix driver Pete Aron is fired by his te...
10863,39768,Beregis Avtomobilya,An insurance agent who moonlights as a carthie...
10864,21449,"What's Up, Tiger Lily?","In comic Woody Allen's film debut, he took the..."
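
The saved files can later be reloaded in a separate session to serve recommendations without recomputing the similarity matrix. A minimal sketch of that round trip, using toy data and a temporary directory; `top_titles` is a hypothetical helper, not part of the notebook:

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd

# Toy stand-ins for the real movies table and similarity matrix
toy_movies = pd.DataFrame({"original_title": ["A", "B", "C"]})
toy_similarity = np.array([[1.0, 0.2, 0.7],
                           [0.2, 1.0, 0.1],
                           [0.7, 0.1, 1.0]])

def top_titles(movies_df, sim, title, n=2):
    """Return the n titles most similar to `title` (hypothetical helper)."""
    index = movies_df[movies_df["original_title"] == title].index[0]
    ranked = sorted(enumerate(sim[index]), key=lambda pair: pair[1], reverse=True)
    return [movies_df.iloc[i].original_title for i, _ in ranked[1:n + 1]]

with tempfile.TemporaryDirectory() as tmp_dir:
    movies_path = os.path.join(tmp_dir, "movies_list.pkl")
    similarity_path = os.path.join(tmp_dir, "similarity.pkl")

    # Save both artifacts
    with open(movies_path, "wb") as f:
        pickle.dump(toy_movies, f)
    with open(similarity_path, "wb") as f:
        pickle.dump(toy_similarity, f)

    # Reload them, as a separate script would
    with open(movies_path, "rb") as f:
        restored_movies = pickle.load(f)
    with open(similarity_path, "rb") as f:
        restored_similarity = pickle.load(f)

recommended = top_titles(restored_movies, restored_similarity, "A")
print(recommended)
```

Pickle files can execute arbitrary code when loaded, so only unpickle files from a source you trust.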
