# Content Cosine Similarity 
- Takes an existing movie title and returns a set of recommended movies, using TF-IDF vectorisation of the movie overviews
- Useful when the user has just watched a film or has a list of films they enjoyed

In [10]:
# Obtain Movies Variable
%store -r movies

- The movies variable was stored in 'preprocessing.ipynb' and is restored here with %store -r

In [11]:
movies['overview']

0       In the 22nd century, a paraplegic Marine is di...
1       Captain Barbossa, long believed to be dead, ha...
2       A cryptic message from Bond’s past sends him o...
3       Following the death of District Attorney Harve...
4       John Carter is a war-weary, former military ca...
                              ...                        
4804    El Mariachi just wants to play his guitar and ...
4805    A newlywed couple's honeymoon is upended by th...
4806    "Signed, Sealed, Delivered" introduces a dedic...
4807    When ambitious New York attorney Sam is sent t...
4808    Ever since the second grade when he first saw ...
Name: overview, Length: 4801, dtype: object

- Quick check of the overview column

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Remove all English stop words, e.g. 'and', 'a', 'an'
tfidf = TfidfVectorizer(stop_words='english')

# Fill missing values first, then cast to string; casting first can turn
# NaN into the literal string 'nan', which would then be treated as a word
movies['overview'] = movies['overview'].fillna('').astype(str)

tfidf_matrix = tfidf.fit_transform(movies['overview'])

print(tfidf_matrix.shape)


(4801, 20970)


- Stop words carry little information about a film's content, so they are removed before the TF-IDF weights are computed
- This way the importance of the meaningful terms is retained
- Missing values are handled to prevent errors during vectorisation
- (4801, 20970): 4801 rows (one per movie) and 20970 columns (one per distinct term in the vocabulary)
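
To make the points above concrete, here is a minimal sketch on a toy corpus. The three overviews and the `toy_` names are hypothetical and are not taken from the movies dataset; the empty string stands in for an overview that was missing and has been filled with ''.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy overviews (assumption: not from the real dataset)
toy_overviews = [
    "A marine is sent to a distant moon",
    "A detective hunts a masked criminal in the city",
    "",  # a missing overview after fillna('')
]

toy_tfidf = TfidfVectorizer(stop_words='english')
toy_matrix = toy_tfidf.fit_transform(toy_overviews)

# One row per overview, one column per term that survived stop-word removal
print(toy_matrix.shape)

# Stop words such as 'the' and 'is' do not appear in the vocabulary
toy_vocab = sorted(toy_tfidf.vocabulary_)
print(toy_vocab)

# The empty overview becomes an all-zero row instead of raising an error
print(toy_matrix[2].nnz)
```

The number of columns depends only on the vocabulary, which is why the real matrix has far more columns than any single overview has words.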

In [13]:
from sklearn.metrics.pairwise import linear_kernel
import pandas as pd

consine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

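# Map each title to its row position in movies, which is also its row in the
# similarity matrix; keep the first entry when a title appears more than once
indices = pd.Series(range(len(movies)), index=movies['title'])
indices = indices[~indices.index.duplicated(keep='first')]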

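- linear_kernel computes the dot product between each pair of movie overviews in the TF-IDF matrix; because TfidfVectorizer L2-normalises each row by default, this dot product equals the cosine similarity
- consine_sim[i][j] is the cosine similarity between movie 'i' and movie 'j'
- indices maps each movie title to its index, so a title can be looked up in the similarity matrix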
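
The equivalence between linear_kernel and cosine similarity on TF-IDF rows can be checked on a small example. This is a sketch under the assumption that TfidfVectorizer is left at its default settings; the overviews and the `toy_` names are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical toy overviews (assumption: not from the real dataset)
toy_overviews = [
    "A billionaire vigilante protects the city",
    "A masked vigilante fights crime in the city",
    "Two friends open a bakery",
]
toy_matrix = TfidfVectorizer(stop_words='english').fit_transform(toy_overviews)

toy_dot = linear_kernel(toy_matrix, toy_matrix)      # plain dot products
toy_cos = cosine_similarity(toy_matrix, toy_matrix)  # normalises, then dot products

# Rows are already unit length, so both matrices agree
print(np.allclose(toy_dot, toy_cos))

# Overviews 0 and 1 share terms; overview 2 shares none with overview 0
print(toy_dot.round(2))
```

Skipping the extra normalisation step is the usual reason for preferring linear_kernel here.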

In [14]:
def get_recommendations(title):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(consine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the 10 most similar movies, skipping the first entry (the movie itself)
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return movies['title'].iloc[movie_indices]

get_recommendations('The Dark Knight Rises')

65                              The Dark Knight
299                              Batman Forever
428                              Batman Returns
1362                                     Batman
1363                                     Batman
3861    Batman: The Dark Knight Returns, Part 2
2513                                  Slow Burn
119                               Batman Begins
9            Batman v Superman: Dawn of Justice
1184                                        JFK
Name: title, dtype: object

- The function retrieves the index of the movie that matches the input title by querying indices
- This index is then used to retrieve the similarity scores between the target movie and all other movies
- sim_scores holds (movie index, cosine similarity) pairs for the target movie against every movie
- The pairs are sorted by similarity in descending order; the first entry is the target movie itself, so it is skipped and the next ten are returned
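
The steps above can also be sketched with NumPy sorting and two small safeguards: a clear error for an unknown title, and removing the target movie by its index rather than by assuming it is always first after sorting. The function `recommend`, the `toy_` data, and the parameter `n` are hypothetical names used only for this illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for movies and the similarity matrix
toy_movies = pd.DataFrame({'title': ['A', 'B', 'C', 'D']})
toy_sim = np.array([
    [1.0, 0.8, 0.1, 0.3],
    [0.8, 1.0, 0.2, 0.0],
    [0.1, 0.2, 1.0, 0.5],
    [0.3, 0.0, 0.5, 1.0],
])
toy_indices = pd.Series(range(len(toy_movies)), index=toy_movies['title'])

def recommend(title, sim, indices, titles, n=10):
    """Return the n titles most similar to `title` (illustrative sketch)."""
    if title not in indices.index:
        raise KeyError(f"{title!r} is not in the dataset")
    idx = indices[title]
    # Row positions ordered from most to least similar
    order = np.argsort(sim[idx])[::-1]
    # Drop the target movie itself, wherever it landed in the ordering
    order = order[order != idx][:n]
    return titles.iloc[order]

toy_result = recommend('A', toy_sim, toy_indices, toy_movies['title'], n=2)
print(toy_result.tolist())  # ['B', 'D']
```

Removing the target by index matters when another movie has an identical overview, since the two would tie at a similarity of 1.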

In [15]:
get_recommendations('Wolf')

2634      Blood and Chocolate
2035               Underworld
273     Gone in Sixty Seconds
2285              Bad Teacher
990               Dream House
3580                Clerks II
1135          Red Riding Hood
1436                   Cursed
135               The Wolfman
3178              Boiler Room
Name: title, dtype: object

In [16]:
get_recommendations('Boiler Room')

3881                                  The Opposite Sex
263                        Around the World in 80 Days
4160    A Funny Thing Happened on the Way to the Forum
1837                             Bridget Jones's Diary
2410                                      My Fair Lady
1776                                     Money Monster
2251                                The Brothers Bloom
3729                                           Airlift
1200                                 Horrible Bosses 2
2550                            Not Another Teen Movie
Name: title, dtype: object
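
The introduction mentions a user who has a list of films they enjoyed. One possible way to handle that case, shown here only as a sketch, is to average the similarity rows of the liked films and rank the remaining films by that average. The function `recommend_for_list` and the `toy_` data are hypothetical; the notebook itself only implements the single-title case.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the titles and the similarity matrix
toy_titles = pd.Series(['A', 'B', 'C', 'D'])
toy_sim = np.array([
    [1.0, 0.8, 0.1, 0.3],
    [0.8, 1.0, 0.2, 0.0],
    [0.1, 0.2, 1.0, 0.5],
    [0.3, 0.0, 0.5, 1.0],
])

def recommend_for_list(liked, sim, titles, n=10):
    """Rank films by their mean similarity to every film in `liked`."""
    liked = set(liked)
    liked_pos = [i for i, t in enumerate(titles) if t in liked]
    if not liked_pos:
        raise ValueError("none of the liked titles are in the dataset")
    # Average the similarity rows of the liked films
    mean_scores = sim[liked_pos].mean(axis=0)
    order = np.argsort(mean_scores)[::-1]
    # Do not recommend films the user has already listed
    order = [i for i in order if i not in liked_pos][:n]
    return titles.iloc[order]

toy_list_result = recommend_for_list(['A', 'C'], toy_sim, toy_titles, n=2)
print(toy_list_result.tolist())  # ['B', 'D']
```

Averaging treats every liked film equally; weighting by rating would be another option, but that is outside what this notebook covers.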