# Recommendation Engine
A recommendation engine is a tool that uses machine learning to detect patterns in a person’s behavioral data and suggest specific content, products or information they’re likely to find interesting or relevant. For example, Netflix uses a recommendation engine to suggest shows and movies based on your watch history, and Amazon uses one to suggest products based on your purchase history.

There are different types of recommendation engines, such as **collaborative filtering**, **content-based filtering**, and **hybrid filtering**, that use different algorithms and data sources to make recommendations. Some of the common data sources are:

- **Implicit data:** This refers to information about a user’s search history, clicks, purchases, and other activities. A company gathers it every time a user visits its site.
- **Explicit data:** This refers to information that a user voluntarily provides, such as ratings, reviews, preferences, and feedback. It is more reliable but harder to collect than implicit data.

Recommendation engines are widely used in various domains, such as e-commerce, entertainment, education, and healthcare, to provide personalized experiences and increase customer satisfaction and retention. They are also beneficial for businesses, as they can increase sales, revenue, and engagement.

#### For more details on this topic, refer to the Drive link below
- **Recommendation Engines Notes.pdf**
- https://drive.google.com/drive/folders/1h-x1kLcB-_rqNcc0lfLSE8fwfukMdTZ9

In [1]:
import numpy as np
import pandas as pd

In [2]:
movies_data = pd.read_csv("movies.csv")
movies_data

Unnamed: 0,index,genres,title
0,0,Action Adventure Fantasy Science Fiction,Avatar
1,1,Adventure Fantasy Action,Pirates of the Caribbean: At World's End
2,2,Action Adventure Crime,Spectre
3,3,Action Crime Drama Thriller,The Dark Knight Rises
4,4,Action Adventure Science Fiction,John Carter
...,...,...,...
4688,4688,Foreign Thriller,Cavite
4689,4689,Action Crime Thriller,El Mariachi
4690,4690,Comedy Romance,Newlyweds
4691,4691,Comedy Drama Romance TV Movie,"Signed, Sealed, Delivered"


In [3]:
movies_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4693 entries, 0 to 4692
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   index   4693 non-null   int64 
 1   genres  4666 non-null   object
 2   title   4693 non-null   object
dtypes: int64(1), object(2)
memory usage: 110.1+ KB


In [4]:
movies_data.isnull().sum()

index      0
genres    27
title      0
dtype: int64

In [5]:
# Replace the null values with an empty string
movies_data["genres"] = movies_data["genres"].fillna("")

In [6]:
movies_data.duplicated().sum()

0

In [7]:
# Converting the text data to feature vectors

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()

feature_vectors = vectorizer.fit_transform(movies_data["genres"])

print(feature_vectors)

  (0, 9)	0.4707458516730511
  (0, 17)	0.4707458516730511
  (0, 8)	0.5071491125814311
  (0, 1)	0.4130006715194316
  (0, 0)	0.3592031586687302
  (1, 8)	0.6796531732320795
  (1, 1)	0.5534806430329519
  (1, 0)	0.48138419365761775
  (2, 4)	0.6185214118995771
  (2, 1)	0.5928935478061559
  (2, 0)	0.5156631691245587
  (3, 18)	0.48649581688790344
  (3, 6)	0.3629530982869474
  (3, 4)	0.6104131981773028
  (3, 0)	0.5089033268563236
  (4, 9)	0.5461986458318269
  (4, 17)	0.5461986458318269
  (4, 1)	0.47919786591815133
  (4, 0)	0.4167774992516349
  (5, 8)	0.6796531732320795
  (5, 1)	0.5534806430329519
  (5, 0)	0.48138419365761775
  (6, 7)	0.6244061832929827
  (6, 2)	0.7810998132540361
  (7, 9)	0.5461986458318269
  :	:
  (4681, 4)	0.6102991975788721
  (4682, 6)	1.0
  (4683, 12)	1.0
  (4684, 6)	1.0
  (4685, 12)	0.7225522352219937
  (4685, 3)	0.4518041679821976
  (4685, 18)	0.5232506676246239
  (4686, 6)	1.0
  (4687, 18)	0.4337475650346395
  (4687, 6)	0.32359995119961343
  (4687, 9)	0.5946200977977993
 

- The above code converts the text data in the `movies_data["genres"]` column to a matrix of TF-IDF features. **TF-IDF** stands for **Term Frequency-Inverse Document Frequency**, and it is a way of measuring how important a word is to a document in a collection of documents. **It assigns a weight to each word based on how often it appears in the document and how rare it is in the whole collection. The higher the weight, the more relevant the word is.**

- The `TfidfVectorizer` class from the `sklearn.feature_extraction.text` module is a tool that can perform this task for you. It takes a collection of raw documents as input and produces a sparse matrix of TF-IDF features as output. It also has many parameters that you can customize, such as the tokenizer, the analyzer, the n-gram range, the stop words, and the vocabulary.

- Here, `TfidfVectorizer` transforms the text in the `movies_data["genres"]` column into a numerical representation that machine learning methods can work with. For example, TF-IDF features can be used to train a classifier that predicts the genre of a movie from its description, or, as in this notebook, to find similar movies based on their genres.

### Cosine Similarity

In [8]:
# Getting the similarity scores using cosine similarity

from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity(feature_vectors)
print(similarity)

[[1.         0.7461881  0.43009327 ... 0.         0.         0.        ]
 [0.7461881  1.         0.5763872  ... 0.         0.         0.        ]
 [0.43009327 0.5763872  1.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 1.         0.30646855 0.        ]
 [0.         0.         0.         ... 0.30646855 1.         0.        ]
 [0.         0.         0.         ... 0.         0.         1.        ]]


- The above code uses the `cosine_similarity` function from the `sklearn.metrics.pairwise` module to compute the cosine similarity between every pair of rows of the `feature_vectors` matrix. **Cosine similarity measures how similar two vectors are, based on the cosine of the angle between them. In general it ranges from -1 (opposite directions) to 1 (same direction), with 0 indicating orthogonality (perpendicularity).** Because TF-IDF weights are never negative, the scores here fall between 0 and 1.

- The purpose of this code is to measure how similar the genre text of each movie is to that of every other movie, using the TF-IDF features produced by `TfidfVectorizer`. The `similarity` variable stores a matrix of shape (n_samples, n_samples), where n_samples is the number of rows in the `feature_vectors` matrix. Each element of the `similarity` matrix is the cosine similarity between two rows of `feature_vectors`, that is, between the genre vectors of two movies.

- Cosine similarity is a useful tool in machine learning for various applications, such as information retrieval, document clustering, text classification, and recommendation systems. It can help to find the most relevant or similar documents, texts, or items based on their features. For example, you can use the `similarity` matrix to find the movies whose genres are most similar to those of a given movie, and recommend them.
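
For intuition, cosine similarity between two vectors u and v is their dot product divided by the product of their lengths: cos(θ) = (u · v) / (‖u‖ ‖v‖). The sketch below computes it by hand with NumPy on three hypothetical genre vectors; the helper name `cosine` and the example vectors are assumptions made for illustration, and returning 0.0 for an all-zero vector is a choice of this sketch.

```python
import numpy as np


def cosine(u, v):
    """Cosine of the angle between two 1-D vectors; 0.0 if either is all zeros."""
    norm_product = np.linalg.norm(u) * np.linalg.norm(v)
    if norm_product == 0:
        return 0.0
    return float(np.dot(u, v) / norm_product)


# Hypothetical genre vectors over the tokens [action, adventure, comedy]
a = np.array([1.0, 1.0, 0.0])
b = np.array([1.0, 0.0, 0.0])
c = np.array([0.0, 0.0, 1.0])

print(cosine(a, a))  # same direction, so approximately 1
print(cosine(a, b))  # one shared genre, so approximately 0.707
print(cosine(a, c))  # no shared genres, so 0
```

The zero-vector case matters in this notebook: a movie whose `genres` value was filled with an empty string has an all-zero TF-IDF row, so it cannot be meaningfully compared with any other movie.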

In [9]:
print(similarity.shape)

(4693, 4693)


#### Creating a list with all the movie names given in the dataset

In [10]:
list_of_all_titles = movies_data["title"].tolist()
print(list_of_all_titles)

['Avatar', "Pirates of the Caribbean: At World's End", 'Spectre', 'The Dark Knight Rises', 'John Carter', 'Spider-Man 3', 'Tangled', 'Avengers: Age of Ultron', 'Harry Potter and the Half-Blood Prince', 'Batman v Superman: Dawn of Justice', 'Superman Returns', 'Quantum of Solace', "Pirates of the Caribbean: Dead Man's Chest", 'The Lone Ranger', 'Man of Steel', 'The Chronicles of Narnia: Prince Caspian', 'The Avengers', 'Pirates of the Caribbean: On Stranger Tides', 'Men in Black 3', 'The Hobbit: The Battle of the Five Armies', 'The Amazing Spider-Man', 'Robin Hood', 'The Hobbit: The Desolation of Smaug', 'The Golden Compass', 'King Kong', 'Titanic', 'Captain America: Civil War', 'Battleship', 'Jurassic World', 'Skyfall', 'Spider-Man 2', 'Iron Man 3', 'Alice in Wonderland', 'X-Men: The Last Stand', 'Monsters University', 'Transformers: Revenge of the Fallen', 'Transformers: Age of Extinction', 'Oz: The Great and Powerful', 'The Amazing Spider-Man 2', 'TRON: Legacy', 'Cars 2', 'Green Lant

## Getting the Movie name from the user

In [11]:
# Getting the movie name from the user
movie_name = input("Enter your favourite movie name : ")

Enter your favourite movie name : batsman


In [12]:
# finding the close match for the movie name given by the user

import difflib
find_close_match = difflib.get_close_matches(movie_name,list_of_all_titles)
print(find_close_match)

['Batman', 'Batman', 'Catwoman']


In [13]:
close_match = find_close_match[0]
print(close_match)

Batman
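
Taking `find_close_match[0]` raises an `IndexError` whenever `difflib.get_close_matches` returns an empty list, which happens when no title reaches the similarity cutoff (0.6 by default). The following is a minimal sketch of a guarded lookup; the helper `best_title_match` and the short `titles` list are hypothetical stand-ins for the notebook's variables. Matching is case-sensitive, so lowercasing both the query and the titles is one possible refinement.

```python
import difflib

# Small illustrative title list (the notebook uses list_of_all_titles instead)
titles = ["Batman", "Batman Returns", "Catwoman", "Spectre", "Avatar"]


def best_title_match(query, candidates, cutoff=0.6):
    """Return the closest title to `query`, or None when nothing clears `cutoff`."""
    matches = difflib.get_close_matches(query, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(best_title_match("batsman", titles))  # Batman
print(best_title_match("zzzz", titles))     # None
```

Raising `cutoff` makes the match stricter, while `n` controls how many candidates are returned (the default is 3, which is why the cell above prints three titles).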


In [14]:
# finding the index of the movie with title

index_of_the_movie = movies_data[movies_data.title == close_match]["index"].values[0]
print(index_of_the_movie)

1341


In [15]:
# Pairing every movie index with its similarity score to the chosen movie

similarity_score = list(enumerate(similarity[index_of_the_movie]))
print(similarity_score)

[(0, 0.6214717464130821), (1, 0.8328620400689603), (2, 0.2980470797389255), (3, 0.29413997260353775), (4, 0.2408923576289715), (5, 0.8328620400689603), (6, 0.0), (7, 0.2408923576289715), (8, 0.511863075084698), (9, 0.8328620400689603), (10, 0.6214717464130821), (11, 0.26733017204095694), (12, 0.8328620400689603), (13, 0.22436823806293424), (14, 0.6214717464130821), (15, 0.511863075084698), (16, 0.2408923576289715), (17, 0.8328620400689603), (18, 0.25552894692978273), (19, 0.8328620400689603), (20, 0.8328620400689603), (21, 0.37930674281666016), (22, 0.6327685734280608), (23, 0.6327685734280608), (24, 0.34353978327424434), (25, 0.0), (26, 0.2408923576289715), (27, 0.2237841645574818), (28, 0.2237841645574818), (29, 0.32131045752504683), (30, 0.8328620400689603), (31, 0.2408923576289715), (32, 0.511863075084698), (33, 0.2237841645574818), (34, 0.0), (35, 0.2408923576289715), (36, 0.2408923576289715), (37, 0.511863075084698), (38, 0.8328620400689603), (39, 0.2408923576289715), (40, 0.0), 

In [16]:
len(similarity_score)

4693

In [17]:
# sorting the movies based on their similarity score

sorted_similar_movies = sorted(similarity_score,key= lambda x:x[1],reverse=True)
print(sorted_similar_movies)

[(113, 1.0000000000000002), (157, 1.0000000000000002), (424, 1.0000000000000002), (661, 1.0000000000000002), (1341, 1.0000000000000002), (2068, 1.0000000000000002), (3529, 1.0000000000000002), (4549, 1.0000000000000002), (531, 0.9245283098776775), (146, 0.9025440934294581), (883, 0.9025440934294581), (541, 0.8752760069471), (795, 0.8752760069471), (921, 0.8752760069471), (1273, 0.8752760069471), (1997, 0.8752760069471), (1, 0.8328620400689603), (5, 0.8328620400689603), (9, 0.8328620400689603), (12, 0.8328620400689603), (17, 0.8328620400689603), (19, 0.8328620400689603), (20, 0.8328620400689603), (30, 0.8328620400689603), (38, 0.8328620400689603), (70, 0.8328620400689603), (96, 0.8328620400689603), (105, 0.8328620400689603), (124, 0.8328620400689603), (127, 0.8328620400689603), (197, 0.8328620400689603), (204, 0.8328620400689603), (206, 0.8328620400689603), (259, 0.8328620400689603), (312, 0.8328620400689603), (325, 0.8328620400689603), (326, 0.8328620400689603), (375, 0.832862040068960
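
Sorting the full list of (index, score) pairs works, but when only the top few results are needed, the indices can also be obtained directly from the score array. This is a sketch of that alternative using `numpy.argsort`; the `scores` values are made-up numbers rather than rows of the real `similarity` matrix.

```python
import numpy as np

# Hypothetical similarity scores for six movies (made-up numbers)
scores = np.array([0.62, 0.83, 0.30, 1.00, 0.00, 0.83])
top_k = 3

# argsort is ascending, so negate the scores to get descending order;
# kind="stable" keeps the original order among tied scores
top_indices = np.argsort(-scores, kind="stable")[:top_k]

print(top_indices.tolist())          # [3, 1, 5]
print(scores[top_indices].tolist())  # [1.0, 0.83, 0.83]
```

In the notebook, the same idea would apply to `similarity[index_of_the_movie]` in place of `scores`.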

## Print the Name of Similar Movies Based on the Index

In [18]:
print("Movies suggested for you: \n")
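# Print the titles of the 10 highest-scoring movies
# (the list includes the chosen movie itself)
for i, (index, _) in enumerate(sorted_similar_movies[:10], start=1):
    # `index` is a row label of movies_data, which uses the default RangeIndex
    title_from_index = movies_data.loc[index, "title"]
    print(i, ".", title_from_index)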

Movies suggested for you: 

1 . Hancock
2 . Spider-Man
3 . Batman Returns
4 . Elektra
5 . Batman
6 . Mortal Kombat
7 . The Beastmaster
8 . Ink
9 . Immortals
10 . Ghostbusters
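
The steps above can be gathered into a single function. The sketch below is one possible arrangement, not the notebook's own code: the function name `recommend`, its parameters, and the `toy_movies` catalogue are assumptions made for illustration. It differs from the cells above in three ways: it returns an empty list when no title matches, it compares only the chosen movie's row against the matrix instead of building the full pairwise matrix, and it leaves the chosen movie out of its own recommendations.

```python
import difflib

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def recommend(movie_name, movies, top_n=10):
    """Return up to `top_n` titles whose genres are most similar to `movie_name`.

    `movies` needs "title" and "genres" columns.
    Returns an empty list when no title is close enough to `movie_name`.
    """
    titles = movies["title"].tolist()
    matches = difflib.get_close_matches(movie_name, titles, n=1)
    if not matches:
        return []

    vectors = TfidfVectorizer().fit_transform(movies["genres"].fillna(""))

    # Row position of the first movie with the matched title
    position = titles.index(matches[0])

    # Compare only the chosen row against all rows: shape (1, n_movies)
    scores = cosine_similarity(vectors[position:position + 1], vectors).ravel()

    # Highest score first; the stable sort keeps dataset order among ties
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    # Unlike the cell above, skip the query movie itself
    ranked = [i for i in ranked if i != position]
    return [titles[i] for i in ranked[:top_n]]


# Tiny made-up catalogue for illustration (not the contents of movies.csv)
toy_movies = pd.DataFrame({
    "title": ["Batman", "Spider-Man", "Newlyweds", "Spectre"],
    "genres": ["Fantasy Action", "Fantasy Action", "Comedy Romance", "Action Adventure Crime"],
})
print(recommend("batsman", toy_movies, top_n=2))  # ['Spider-Man', 'Spectre']
```

With the real data, the call would look like `recommend(movie_name, movies_data)`. Because many movies share identical genre strings, ties are common, and the order among tied titles simply follows their order in the dataset.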
