**Table of contents**<a id='toc0_'></a>    

  - [Background on recommendation systems](#toc1_1_)    
    - [Types of recommendation systems](#toc1_1_1_)    
  - [Objectives](#toc1_2_)    
  - [Setup](#toc1_3_)    
    - [Installing required libraries](#toc1_3_1_)    
    - [Importing required libraries](#toc1_3_2_)    
  - [Exploratory data analysis (EDA)](#toc1_4_)    
  - [Popularity-based recommendation](#toc1_5_)    
    - [Exercise 1 - Get the top 5 suggestions sorting by score in descending order](#toc1_5_1_)    
  - [Content-based recommendation](#toc1_6_)    
    - [Exercise 2 - Check the recommendations for the movie 'Toy Story 2 (1999)'](#toc1_6_1_)    
  - [Collaborative filtering](#toc1_7_)    
    - [Exercise 3 - Check the recommendations for the movie 'Jurassic Park (1993)'](#toc1_7_1_)   


## <a id='toc1_1_'></a>[Background on recommendation systems](#toc0_)

Recommendation systems have become an integral part of our digital lives, subtly shaping the content we consume and the products we buy. From suggesting movies on Netflix to recommending products on Amazon, these systems help users navigate vast amounts of information by providing personalized suggestions based on their preferences and behaviors.

### <a id='toc1_1_1_'></a>[Types of recommendation systems](#toc0_)

There are several types of recommendation systems, each with its unique approach to generating recommendations:

1. **Popularity-based recommendation**: Popularity-based recommendation systems are straightforward to implement because they don't require complex algorithms or user-specific data. They often rely on basic statistics such as item frequency and offer the same suggestions to all users, focusing on what is popular among the majority.

2. **Content-based filtering**: This approach focuses on the characteristics of the items themselves. It recommends items that are similar to those the user has shown interest in, based on item features.

3. **Collaborative filtering**: This method relies on the collective preferences of users. It can be user-based, where recommendations are made based on the preferences of similar users, or item-based, where recommendations are made based on items that are similar to what the user has liked in the past.


## <a id='toc1_2_'></a>[Objectives](#toc0_)



After completing this lab, you will be able to:



- Understand the basic concepts and types of recommendation systems.

- Implement a simple popularity-based recommendation system.

- Implement a content-based recommendation system.

- Implement an item-based collaborative filtering recommendation system.



----


## <a id='toc1_3_'></a>[Setup](#toc0_)

For this lab, you will use the following libraries:

*   [`pandas`](https://pandas.pydata.org/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2021-01-01) for managing the data.
*   [`sklearn`](https://scikit-learn.org/stable/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2021-01-01) for machine learning and machine-learning-pipeline related functions.


### <a id='toc1_3_1_'></a>[Installing required libraries](#toc0_)


In [None]:
%pip install tqdm==4.66.4  | tail -n 1
%pip install pandas==2.1.4  | tail -n 1
%pip install scikit-learn==1.5.1  | tail -n 1

Successfully installed tqdm-4.66.4
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires pandas==2.2.2, but you have pandas 2.1.4 which is incompatible.
mizani 0.13.1 requires pandas>=2.2.0, but you have pandas 2.1.4 which is incompatible.
plotnine 0.14.5 requires pandas>=2.2.0, but you have pandas 2.1.4 which is incompatible.
Successfully installed pandas-2.1.4
Successfully installed scikit-learn-1.5.1


### <a id='toc1_3_2_'></a>[Importing required libraries](#toc0_)


In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors
import statistics


# You can also use this section to suppress warnings generated by your code:
def warn(*args, **kwargs):
    pass

import warnings

warnings.warn = warn
warnings.filterwarnings('ignore')


The dataset is taken from [Kaggle](https://www.kaggle.com/datasets/shubhammehta21/movie-lens-small-latest-dataset/data).
This dataset describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. Users were selected at random for inclusion. No demographic information is included. Each user is represented by an ID, and no other information is provided.

The data are contained in the files movies.csv, ratings.csv and tags.csv.

In the `movies.csv` file:
- `movieId`: ID of the movie/show (unique)
- `title`: Title of the movie/show
- `genres`: Genre of the show
  
In the `ratings.csv` file:
- `userId`: ID of the user who gave a rating
- `movieId`: ID of the movie/show rated
- `rating`: Rating given to the show
- `timestamp`: Time when the rating was specified
  
In the `tags.csv` file:
- `userId`: ID of the user who applied the tag
- `movieId`: ID of the movie/show tagged
- `tag`: Tag given to the show
- `timestamp`: Time when the tag was applied

Now, let's load these datasets into pandas DataFrames.



In [None]:
movie_df = pd.read_csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/BxZuF3FrO7Bdw6McwsBaBw/movies.csv')
rating_df = pd.read_csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/R-bYYyyf7s3IUE5rsssmMw/ratings.csv')
tag_df = pd.read_csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/UZKHhXSl7Ft7t9mfUFZJPQ/tags.csv')

Let's look at some sample rows from the datasets we loaded:


In [None]:
movie_df.sample(5)

Unnamed: 0,movieId,title,genres
2847,3809,What About Bob? (1991),Comedy
265,305,Ready to Wear (Pret-A-Porter) (1994),Comedy
2080,2764,"Thomas Crown Affair, The (1968)",Crime|Drama|Romance|Thriller
3232,4368,Dr. Dolittle 2 (2001),Comedy
6093,42013,"Ringer, The (2005)",Comedy


In [None]:
tag_df.sample(5)

Unnamed: 0,userId,movieId,tag,timestamp
1660,474,2660,aliens,1137521194
1703,474,2935,gambling,1138032373
1741,474,3101,adultery,1138032312
2489,477,32,Bruce Willis,1242494306
2067,474,6063,doll,1137375194


In [None]:
rating_df.sample(5)

Unnamed: 0,userId,movieId,rating,timestamp
94730,599,71550,1.0,1498527121
66093,425,3698,3.0,1085491483
22889,156,1965,4.0,939853183
97558,606,1200,3.5,1177104994
27355,186,1031,4.0,1031087983


In [None]:
# We will merge the three dataframes to create a single dataframe that contains all the information we need.
user_movie_df = movie_df.merge(rating_df, on = 'movieId', how = 'inner')
df = user_movie_df.merge(tag_df, on = ['movieId', 'userId'], how = 'inner')
df

Unnamed: 0,movieId,title,genres,userId,rating,timestamp_x,tag,timestamp_y
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,336,4.0,1122227329,pixar,1139045764
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,474,4.0,978575760,pixar,1137206825
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,567,3.5,1525286001,fun,1525286013
3,2,Jumanji (1995),Adventure|Children|Fantasy,62,4.0,1528843890,fantasy,1528843929
4,2,Jumanji (1995),Adventure|Children|Fantasy,62,4.0,1528843890,magic board game,1528843932
...,...,...,...,...,...,...,...,...
3471,187595,Solo: A Star Wars Story (2018),Action|Adventure|Children|Sci-Fi,62,4.0,1528934550,star wars,1528934552
3472,193565,Gintama: The Movie (2010),Action|Animation|Comedy|Sci-Fi,184,3.5,1537098554,anime,1537098582
3473,193565,Gintama: The Movie (2010),Action|Animation|Comedy|Sci-Fi,184,3.5,1537098554,comedy,1537098587
3474,193565,Gintama: The Movie (2010),Action|Animation|Comedy|Sci-Fi,184,3.5,1537098554,gintama,1537098603


In [None]:
# Here, we will drop the timestamp columns as they are not needed for our analysis.
df.drop(columns = ['timestamp_x', 'timestamp_y'], inplace = True)
df

Unnamed: 0,movieId,title,genres,userId,rating,tag
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,336,4.0,pixar
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,474,4.0,pixar
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,567,3.5,fun
3,2,Jumanji (1995),Adventure|Children|Fantasy,62,4.0,fantasy
4,2,Jumanji (1995),Adventure|Children|Fantasy,62,4.0,magic board game
...,...,...,...,...,...,...
3471,187595,Solo: A Star Wars Story (2018),Action|Adventure|Children|Sci-Fi,62,4.0,star wars
3472,193565,Gintama: The Movie (2010),Action|Animation|Comedy|Sci-Fi,184,3.5,anime
3473,193565,Gintama: The Movie (2010),Action|Animation|Comedy|Sci-Fi,184,3.5,comedy
3474,193565,Gintama: The Movie (2010),Action|Animation|Comedy|Sci-Fi,184,3.5,gintama


---
## <a id='toc1_4_'></a>[Exploratory data analysis (EDA)](#toc0_)


Before any preprocessing, we will perform some simple exploratory data analysis (EDA) to get to know the dataset. This includes looking at the number of unique and duplicate values, the distributions, and so on.

First, let's look at the shape of the DataFrame:


In [None]:
print('Number of rows: ' , df.shape[0])
print('Number of columns: ' , df.shape[1])

Number of rows:  3476
Number of columns:  6


Looking at the data type of each column:


In [None]:
df.dtypes

Unnamed: 0,0
movieId,int64
title,object
genres,object
userId,int64
rating,float64
tag,object


Next, let's see if we have any null values:


In [None]:
# Check whether any column contains null values
df.isnull().any()

Unnamed: 0,0
movieId,False
title,False
genres,False
userId,False
rating,False
tag,False
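
The other EDA checks mentioned at the start of this section (unique values, duplicate rows, distributions) can be sketched as follows. This is a minimal illustration on a hypothetical toy DataFrame (`toy` and its values are made up for the example, not taken from the dataset); the same calls can be applied to `df`.

```python
import pandas as pd

# Hypothetical toy data with the same kind of columns as the merged DataFrame
toy = pd.DataFrame({
    "movieId": [1, 1, 2, 2, 2],
    "title": ["A", "A", "B", "B", "B"],
    "userId": [10, 11, 10, 10, 12],
    "rating": [4.0, 3.5, 5.0, 5.0, 2.0],
    "tag": ["fun", "pixar", "magic", "magic", "slow"],
})

unique_counts = toy.nunique()              # distinct values per column
n_duplicates = toy.duplicated().sum()      # rows identical to an earlier row
rating_summary = toy["rating"].describe()  # count, mean, std, min, quartiles, max

print(unique_counts)
print("Duplicate rows:", n_duplicates)
print(rating_summary)
```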


## <a id='toc1_5_'></a>[Popularity-based recommendation](#toc0_)

Popularity-based recommendation suggests items (in this case, movies) based on what is popular across the site. It is the most basic type of recommendation system. The system identifies popular items using metrics such as the number of views, ratings, or purchases and suggests these items to all users, so every user gets the same recommendations. A common variation restricts the ranking to a segment, for example what is popular in the user's country.

This approach ensures that users are aware of current popular content, which can be useful for new users who have not yet developed a viewing history on the platform. However, this is also a limitation because everyone receives the same suggestions, which may not always be relevant or interesting to them. This lack of specificity can result in a less engaging user experience compared to more personalized recommendation systems.


In [None]:
df_1 = df
df_1

Unnamed: 0,movieId,title,genres,userId,rating,tag
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,336,4.0,pixar
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,474,4.0,pixar
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,567,3.5,fun
3,2,Jumanji (1995),Adventure|Children|Fantasy,62,4.0,fantasy
4,2,Jumanji (1995),Adventure|Children|Fantasy,62,4.0,magic board game
...,...,...,...,...,...,...
3471,187595,Solo: A Star Wars Story (2018),Action|Adventure|Children|Sci-Fi,62,4.0,star wars
3472,193565,Gintama: The Movie (2010),Action|Animation|Comedy|Sci-Fi,184,3.5,anime
3473,193565,Gintama: The Movie (2010),Action|Animation|Comedy|Sci-Fi,184,3.5,comedy
3474,193565,Gintama: The Movie (2010),Action|Animation|Comedy|Sci-Fi,184,3.5,gintama


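Next, we will calculate the number of votes and the average rating for each movie. Note that `df_1` has one row per (user, movie, tag) combination, so `numVotes` counts rows rather than distinct users: a user who applied several tags to a movie is counted once per tag (as with user 62 and Jumanji above).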
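
Because the merged DataFrame has one row per (user, movie, tag) combination, counting rows and counting distinct users can give different vote totals. The sketch below contrasts the two on a hypothetical toy DataFrame (the tags and IDs are illustrative only); it is an alternative to consider, not what the lab code does.

```python
import pandas as pd

# Hypothetical rows: user 62 tagged movie 2 three times, user 474 once
toy = pd.DataFrame({
    "movieId": [2, 2, 2, 2, 1],
    "userId": [62, 62, 62, 474, 336],
    "tag": ["fantasy", "magic board game", "jungle", "game", "pixar"],
})

# Row count per movie (what groupby(...).size() measures)
rows_per_movie = toy.groupby("movieId").size()

# Distinct users per movie (one vote per user)
users_per_movie = toy.groupby("movieId")["userId"].nunique()

print(pd.DataFrame({"rows": rows_per_movie, "distinct_users": users_per_movie}))
```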


In [None]:
num_votes = df_1.groupby('movieId').size().reset_index(name='numVotes')

# Merge the numVotes back into the original DataFrame
df_1 = pd.merge(df_1, num_votes, on='movieId')

df_1

Unnamed: 0,movieId,title,genres,userId,rating,tag,numVotes
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,336,4.0,pixar,3
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,474,4.0,pixar,3
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,567,3.5,fun,3
3,2,Jumanji (1995),Adventure|Children|Fantasy,62,4.0,fantasy,4
4,2,Jumanji (1995),Adventure|Children|Fantasy,62,4.0,magic board game,4
...,...,...,...,...,...,...,...
3471,187595,Solo: A Star Wars Story (2018),Action|Adventure|Children|Sci-Fi,62,4.0,star wars,2
3472,193565,Gintama: The Movie (2010),Action|Animation|Comedy|Sci-Fi,184,3.5,anime,4
3473,193565,Gintama: The Movie (2010),Action|Animation|Comedy|Sci-Fi,184,3.5,comedy,4
3474,193565,Gintama: The Movie (2010),Action|Animation|Comedy|Sci-Fi,184,3.5,gintama,4


In [None]:
avg_ratings = df_1.groupby('movieId')['rating'].mean().reset_index(name='avgRating')

# Merge the avgRating back into the original DataFrame
df_1 = pd.merge(df_1, avg_ratings, on='movieId')

In [None]:
df_1.drop_duplicates(subset = ['movieId', 'title', 'avgRating', 'numVotes'], inplace = True)
df_1

Unnamed: 0,movieId,title,genres,userId,rating,tag,numVotes,avgRating
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,336,4.0,pixar,3,3.833333
3,2,Jumanji (1995),Adventure|Children|Fantasy,62,4.0,fantasy,4,3.750000
7,3,Grumpier Old Men (1995),Comedy|Romance,289,2.5,moldy,2,2.500000
9,5,Father of the Bride Part II (1995),Comedy,474,1.5,pregnancy,2,1.500000
11,7,Sabrina (1995),Comedy|Romance,474,3.0,remake,1,3.000000
...,...,...,...,...,...,...,...,...
3461,183611,Game Night (2018),Action|Comedy|Crime|Horror,62,4.0,Comedy,3,4.000000
3464,184471,Tomb Raider (2018),Action|Adventure|Fantasy,62,3.5,adventure,3,3.500000
3467,187593,Deadpool 2 (2018),Action|Comedy|Sci-Fi,62,4.0,Josh Brolin,3,4.000000
3470,187595,Solo: A Star Wars Story (2018),Action|Adventure|Children|Sci-Fi,62,4.0,Emilia Clarke,2,4.000000


Next, we will calculate a weighted score for each movie. Intuitively, a score is most trustworthy when both the rating and the number of votes are high. For instance, suppose you are choosing a restaurant for a trip. If restaurant A has a score of 8.5 from 100,000 votes and restaurant B has a score of 8.5 from only 10 votes, restaurant A is the more convincing choice. Similarly, if restaurant C has a score of 5.0 from 1,000 votes and restaurant D has a score of 5.0 from a single vote, we cannot conclude that restaurant D is unenjoyable (only that it is not popular): one more rating of 10 would immediately raise its average to 7.5. The weighted score used here handles this by pulling each movie's average rating toward the global average rating, and the pull is stronger the fewer votes the movie has.
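
To make the intuition concrete, here is a small sketch of a vote-weighted score applied to the restaurant example. The global mean `C = 7.0` and the weight `m = 50` are assumptions chosen only for illustration; the last lines plug in the Toy Story values and the two averages that appear in this notebook's outputs.

```python
def weighted_score(avg_rating, num_votes, global_mean, min_votes):
    """Shrink an item's average rating toward the global mean.

    With few votes the result stays close to global_mean;
    with many votes it approaches the item's own average.
    """
    return (num_votes * avg_rating + min_votes * global_mean) / (num_votes + min_votes)

C = 7.0  # assumed global mean score (illustrative)
m = 50   # assumed weight given to the global mean (illustrative)

restaurant_a = weighted_score(8.5, 100_000, C, m)  # many votes: stays near 8.5
restaurant_b = weighted_score(8.5, 10, C, m)       # few votes: pulled toward 7.0

print(f"A: {restaurant_a:.3f}, B: {restaurant_b:.3f}")

# Toy Story (1995): average 11.5 / 3 over 3 votes, with the notebook's C and m
toy_story = weighted_score(11.5 / 3, 3, 3.7323364168313313, 2.3743169398907105)
print(f"Toy Story: {toy_story:.6f}")
```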

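The code below creates a new column `df_1['score']` that holds the weighted score for each movie, computed as `(numVotes * avgRating + m * C) / (numVotes + m)`, where `C` is the mean of the per-movie average ratings and `m` is the mean number of votes per movie.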


In [None]:
import statistics

# Define the function to calculate the weighted score
def calculate_weighted_score(avgRating, num_votes, C, m):
    return (num_votes * avgRating + m * C) / (num_votes + m)

# Calculate the global average rating (C)
average_rating = statistics.mean(df_1['avgRating'])
print('The average rating across all movies is:', average_rating)

# Calculate the average number of votes (m)
avg_num_votes = statistics.mean(df_1['numVotes'])  # Use the average number of votes for threshold
print('The average number of votes is:', avg_num_votes)

# Create a new column 'score' for the weighted average rating using 'avgRating' and 'numVotes'
df_1['score'] = df_1.apply(lambda row: calculate_weighted_score(row['avgRating'], row['numVotes'], average_rating, avg_num_votes), axis=1)

# Display the DataFrame with the calculated weighted score
df_1[['movieId', 'title', 'avgRating', 'numVotes', 'score']].head()

The average rating across all movies is: 3.7323364168313313
The average number of votes is: 2.3743169398907105


Unnamed: 0,movieId,title,avgRating,numVotes,score
0,1,Toy Story (1995),3.833333,3,3.788714
3,2,Jumanji (1995),3.75,4,3.743421
7,3,Grumpier Old Men (1995),2.5,2,3.168895
9,5,Father of the Bride Part II (1995),1.5,2,2.71168
11,7,Sabrina (1995),3.0,1,3.515304


In [None]:
df_1.to_csv("Netflix_movies_data.csv")

### <a id='toc1_5_1_'></a>[Exercise 1 - Get the top 5 suggestions sorting by score in descending order](#toc0_)


In [None]:
# TODO: filtering out the top 5 suggestions
# You can use `sort_values` to sort the DataFrame by the 'score' column in descending order

top_5_movies = df_1.sort_values(by = 'score', ascending = False).head(5)[['title', 'genres', 'tag', 'score']]
print('Top 5 movies:')
top_5_movies

Top 5 movies:


Unnamed: 0,title,genres,tag,score
199,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller,good dialogue,4.967226
1337,Fight Club (1999),Action|Crime|Drama|Thriller,dark comedy,4.893394
604,2001: A Space Odyssey (1968),Adventure|Drama|Sci-Fi,Hal,4.884498
998,"Big Lebowski, The (1998)",Comedy|Crime,Coen Brothers,4.868802
164,Léon: The Professional (a.k.a. The Professiona...,Action|Crime|Drama|Thriller,assassin,4.852577


<details>
    <summary>Click here for the solution</summary>

```python
# filtering out the top 5 suggestions
top_5_movies = df_1.sort_values(by = 'score', ascending = False).head(5)[['title', 'genres', 'tag', 'score']]
print('Top 5 movies:')
top_5_movies
```

</details>


## <a id='toc1_6_'></a>[Content-based recommendation](#toc0_)

Content-based filtering focuses on the attributes of items and the user's profile. It recommends movies to users based on features that closely match the user's profile. Movie A could be recommended because it matches the user's preferred genre, cast, and keywords. However, we might get limited diversity as it may not recommend items outside the user's known preferences, potentially limiting discovery of new types of items.

We want to compute the cosine similarity between movies based on their features. Next, we will create a `features` column that gathers the text columns used for the comparison. In this dataset, the calculation is based on the `genres` and `tag` columns.


In [None]:
# We will now create a new DataFrame that contains only the columns we need for our analysis.
df_2 = df_1[['movieId', 'title', 'userId', 'avgRating', 'numVotes', 'score', 'genres', 'tag']].copy()
df_2.reset_index(drop=True, inplace=True)

# save a dataframe
df_2.to_csv("2_Netflix_movies_data.csv")
df_2.head()

Unnamed: 0,movieId,title,userId,avgRating,numVotes,score,genres,tag
0,1,Toy Story (1995),336,3.833333,3,3.788714,Adventure|Animation|Children|Comedy|Fantasy,pixar
1,2,Jumanji (1995),62,3.75,4,3.743421,Adventure|Children|Fantasy,fantasy
2,3,Grumpier Old Men (1995),289,2.5,2,3.168895,Comedy|Romance,moldy
3,5,Father of the Bride Part II (1995),474,1.5,2,2.71168,Comedy,pregnancy
4,7,Sabrina (1995),474,3.0,1,3.515304,Comedy|Romance,remake


In [None]:
# Replace '|' with spaces in 'genres' and combine it with 'tag' using a space.
# regex=False makes sure '|' is treated as a literal character, not as regex alternation.
df_2['features'] = df_2['genres'].str.replace('|', ' ', regex=False) + ' ' + df_2['tag'].fillna('')
#df_2.to_csv("Ready_Netflix_movies_data.csv")
df_2.head()

Unnamed: 0,movieId,title,userId,avgRating,numVotes,score,genres,tag,features
0,1,Toy Story (1995),336,3.833333,3,3.788714,Adventure|Animation|Children|Comedy|Fantasy,pixar,Adventure Animation Children Comedy Fantasy pixar
1,2,Jumanji (1995),62,3.75,4,3.743421,Adventure|Children|Fantasy,fantasy,Adventure Children Fantasy fantasy
2,3,Grumpier Old Men (1995),289,2.5,2,3.168895,Comedy|Romance,moldy,Comedy Romance moldy
3,5,Father of the Bride Part II (1995),474,1.5,2,2.71168,Comedy,pregnancy,Comedy pregnancy
4,7,Sabrina (1995),474,3.0,1,3.515304,Comedy|Romance,remake,Comedy Romance remake


In [None]:
print(df_2["title"])

0                         Toy Story (1995)
1                           Jumanji (1995)
2                  Grumpier Old Men (1995)
3       Father of the Bride Part II (1995)
4                           Sabrina (1995)
                       ...                
1459                     Game Night (2018)
1460                    Tomb Raider (2018)
1461                     Deadpool 2 (2018)
1462        Solo: A Star Wars Story (2018)
1463             Gintama: The Movie (2010)
Name: title, Length: 1464, dtype: object


Next, let's vectorize the features column using TF-IDF vectorizer.
The Term Frequency-Inverse Document Frequency (TF-IDF) vectorizer is used to transform text into numerical representations. It evaluates the importance of a word in a document relative to a collection of documents by considering both its frequency within a specific document (TF) and its rarity across all documents (IDF).


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english')

# Fit and transform the 'features' column to create TF-IDF vectors
X = vectorizer.fit_transform(df_2['features'])



Finally, let's compute the cosine similarity between all pairs of movies and recommend the movies most similar to a given title.
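
Cosine similarity measures the angle between two vectors: the dot product divided by the product of their lengths. The sketch below computes it by hand for three hypothetical vectors and compares the result with scikit-learn's `cosine_similarity`; the helper `cosine` and the vectors are illustrative names, not part of the lab code.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical feature vectors
a = np.array([1.0, 1.0, 0.0])
b = np.array([1.0, 0.0, 0.0])
c = np.array([0.0, 0.0, 2.0])

def cosine(u, v):
    """Dot product divided by the product of the vector norms."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

manual_ab = cosine(a, b)  # vectors share one direction: between 0 and 1
manual_ac = cosine(a, c)  # orthogonal vectors: 0

# Pairwise similarity matrix; entry [i, j] compares row i with row j
sim_matrix = cosine_similarity(np.vstack([a, b, c]))

print(round(manual_ab, 4), round(manual_ac, 4))
print(sim_matrix.round(4))
```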


In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Calculate Cosine Similarity
similarity = cosine_similarity(X)

# Recommendation function (including itself as first result)
def recommendation(title, df, similarity, top_n=5):
    try:
        # Get the index of the movie that matches the title
        idx = df[df['title'] == title].index[0]
    except IndexError:
        print(f"Movie '{title}' not found in the dataset.")
        return

    # Get the similarity scores for the given movie
    sim_scores = list(enumerate(similarity[idx]))

    # Sort the movies based on similarity scores in descending order
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Print the top_n most similar movies (including itself)
    print(f"Movies similar to '{title}' (First movie is itself):")
    for i, (index, score) in enumerate(sim_scores[:top_n+1]):
        movie = df.iloc[index]
        print(f"{i}. {movie['title']} (Similarity Score: {score:.3f})")
        print(f"   Genres: {movie['genres']}")
        print(f"   Tag: {movie['tag']}\n")

# Test the recommendation function
recommendation("Toy Story (1995)", df_2, similarity)

Movies similar to 'Toy Story (1995)' (First movie is itself):
0. Toy Story (1995) (Similarity Score: 1.000)
   Genres: Adventure|Animation|Children|Comedy|Fantasy
   Tag: pixar

1. Bug's Life, A (1998) (Similarity Score: 0.939)
   Genres: Adventure|Animation|Children|Comedy
   Tag: Pixar

2. Toy Story 2 (1999) (Similarity Score: 0.675)
   Genres: Adventure|Animation|Children|Comedy|Fantasy
   Tag: animation

3. Sintel (2010) (Similarity Score: 0.583)
   Genres: Animation|Fantasy
   Tag: adventure

4. Up (2009) (Similarity Score: 0.550)
   Genres: Adventure|Animation|Children|Drama
   Tag: adventure

5. Jumanji (1995) (Similarity Score: 0.542)
   Genres: Adventure|Children|Fantasy
   Tag: fantasy



### Using Gradio

In [None]:
#!pip install gradio

In [None]:
import gradio as gr
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Precomputed similarity matrix and dataframe
# Assume df_2 and X are already defined
similarity = cosine_similarity(X)

# Gradio-compatible recommendation function
def recommend_movies(title, top_n=5):
    # The slider may pass a float, but slice bounds must be integers
    top_n = int(top_n)

    try:
        idx = df_2[df_2['title'] == title].index[0]
    except IndexError:
        return f"❌ Movie '{title}' not found in the dataset."

    sim_scores = list(enumerate(similarity[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    result = f"### 🎬 Movies similar to **{title}** (First movie is itself):\n"
    for i, (index, score) in enumerate(sim_scores[:top_n+1]):
        movie = df_2.iloc[index]
        result += (
            f"**{i}. {movie['title']}** (Similarity Score: {score:.3f})\n"
            f"- Genres: {movie['genres']}\n"
            f"- Tag: {movie['tag']}\n\n"
        )
    return result

# Create Gradio interface
iface = gr.Interface(
    fn=recommend_movies,
    inputs=[
        gr.Textbox(label="Enter Movie Title", placeholder="e.g. Toy Story (1995)"),
        gr.Slider(1, 10, value=5, step=1, label="Number of Recommendations")
    ],
    outputs=gr.Markdown(label="Recommendations"),
    title="🎥 Movie Recommendation System",
    description="Enter a movie title to get similar movie suggestions based on cosine similarity."
)

iface.launch()


It looks like you are running Gradio on a hosted a Jupyter notebook. For the Gradio app to work, sharing must be enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://68b0948c574f70bbbf.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




In [None]:
# close gradio application
iface.close()

Closing server running on port: 7860


### Finding titles with fuzzy matching

In [None]:
!pip install fuzzywuzzy

Collecting fuzzywuzzy
  Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl.metadata (4.9 kB)
Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl (18 kB)
Installing collected packages: fuzzywuzzy
Successfully installed fuzzywuzzy-0.18.0


In [None]:
from sklearn.metrics.pairwise import cosine_similarity
from fuzzywuzzy import process

# Precompute similarity matrix
similarity = cosine_similarity(X)

# Recommendation function with fuzzy title matching
def recommendation(title, df, similarity, top_n=5):
    # Fuzzy match to find the best matching title in df
    all_titles = df['title'].tolist()
    best_match, match_score = process.extractOne(title, all_titles)

    if match_score < 60:  # minimum fuzzy-match score (0-100) to accept a title
        print(f"❌ No close match found for '{title}'. Try another title.")
        return

    idx = df[df['title'] == best_match].index[0]
    sim_scores = list(enumerate(similarity[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    print(f"🔍 Closest match: '{best_match}' (Match Score: {match_score})")
    print(f"🎬 Movies similar to '{best_match}':")
    # Use a separate name for the similarity score so it does not shadow the match score
    for i, (index, sim_score) in enumerate(sim_scores[:top_n+1]):
        movie = df.iloc[index]
        print(f"{i}. {movie['title']} (Similarity Score: {sim_score:.3f})")
        print(f"   Genres: {movie['genres']}")
        print(f"   Tag: {movie['tag']}\n")

# Example usage
recommendation("jurassic", df_2, similarity)


🔍 Closest match: 'Jurassic Park (1993)' (Match Score: 90)
🎬 Movies similar to 'Jurassic Park (1993)':
0. Jurassic Park (1993) (Similarity Score: 1.000)
   Genres: Action|Adventure|Sci-Fi|Thriller
   Tag: Dinosaur

1. X-Men (2000) (Similarity Score: 0.576)
   Genres: Action|Adventure|Sci-Fi
   Tag: action

2. Batman v Superman: Dawn of Justice (2016) (Similarity Score: 0.525)
   Genres: Action|Adventure|Fantasy|Sci-Fi
   Tag: action

3. Independence Day (a.k.a. ID4) (1996) (Similarity Score: 0.503)
   Genres: Action|Adventure|Sci-Fi|Thriller
   Tag: aliens

4. Spider-Man (2002) (Similarity Score: 0.497)
   Genres: Action|Adventure|Sci-Fi|Thriller
   Tag: superhero

5. Mad Max: Fury Road (2015) (Similarity Score: 0.480)
   Genres: Action|Adventure|Sci-Fi|Thriller
   Tag: beautiful



### Using Gradio with fuzzy matching

In [None]:
import gradio as gr
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from fuzzywuzzy import process

# Assuming df_2 and X are already defined
similarity = cosine_similarity(X)

# Recommendation function returning a string
def recommend_movies(title, top_n=5):
    # The slider may pass a float, but slice bounds must be integers
    top_n = int(top_n)

    all_titles = df_2['title'].tolist()
    best_match, match_score = process.extractOne(title, all_titles)

    if match_score < 60:
        return f"❌ No close match found for '{title}'. Try another title."

    idx = df_2[df_2['title'] == best_match].index[0]
    sim_scores = list(enumerate(similarity[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    result = f"🔍 Closest match: **{best_match}** (Match Score: {match_score})\n\n"
    result += f"🎬 Top {top_n} similar movies:\n"

    for i, (index, score) in enumerate(sim_scores[:top_n+1]):
        movie = df_2.iloc[index]
        result += (
            f"**{i}. {movie['title']}** (Similarity: {score:.3f})\n"
            f"- Genres: {movie['genres']}\n"
            f"- Tag: {movie['tag']}\n\n"
        )

    return result

# Gradio Interface
iface = gr.Interface(
    fn=recommend_movies,
    inputs=[
        gr.Textbox(label="Enter Movie Title", placeholder="e.g. Jurassic"),
        gr.Slider(1, 10, value=5, step=1, label="Number of Recommendations")
    ],
    outputs=gr.Markdown(label="Recommendations"),
    title="🎥 Movie Recommender with Fuzzy Matching",
    description="Type a movie title (even with typos or partial names) and get similar movie suggestions."
)

iface.launch()


It looks like you are running Gradio on a hosted a Jupyter notebook. For the Gradio app to work, sharing must be enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://cc81ac434979af24fd.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




### <a id='toc1_6_1_'></a>[Exercise 2 - Check the recommendations for the movie 'Toy Story 2 (1999)'](#toc0_)


In [None]:
# TODO
recommendation("Toy Story 2 (1999)", df_2, similarity)

Movies similar to 'Toy Story 2 (1999)' (First movie is itself):
0. Toy Story 2 (1999) (Similarity Score: 1.000)
   Genres: Adventure|Animation|Children|Comedy|Fantasy
   Tag: animation

1. Croods, The (2013) (Similarity Score: 0.856)
   Genres: Adventure|Animation|Comedy
   Tag: animation

2. Sintel (2010) (Similarity Score: 0.853)
   Genres: Animation|Fantasy
   Tag: adventure

3. Invincible Iron Man, The (2007) (Similarity Score: 0.775)
   Genres: Animation
   Tag: animation

4. Big Hero 6 (2014) (Similarity Score: 0.757)
   Genres: Action|Animation|Comedy
   Tag: animation

5. Up (2009) (Similarity Score: 0.754)
   Genres: Adventure|Animation|Children|Drama
   Tag: adventure



<details>
    <summary>Click here for the solution</summary>

```python
recommendation("Toy Story 2 (1999)", df_2, similarity)
```

</details>


---


## <a id='toc1_7_'></a>[Collaborative filtering](#toc0_)

Collaborative filtering is a recommendation system technique that makes automatic predictions about a user's preferences by collecting taste or preference information from many users. The assumption behind collaborative filtering is that if users agreed on certain items in the past, they are likely to agree on similar items in the future.

There are two primary approaches to collaborative filtering:

1.	User-based Collaborative Filtering: This method identifies users with similar preferences and recommends items that similar users have liked. In other words, a user receives recommendations based on the preferences of users who have historically rated items similarly.
2.	Item-based Collaborative Filtering: In this method, items similar to those the user has liked or rated highly in the past are recommended. The system identifies items that are frequently rated similarly across a user base and suggests items that share these patterns.
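
The difference between the two approaches comes down to which axis of the rating matrix is compared. The sketch below uses a hypothetical 3x3 rating matrix (users, movies, and ratings are made up for illustration): comparing rows gives user-to-user similarity, and comparing columns (rows of the transposed matrix) gives item-to-item similarity.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical ratings: rows are users, columns are movies, 0 means "not rated"
ratings = pd.DataFrame(
    [[5, 4, 0],
     [4, 5, 0],
     [0, 1, 5]],
    index=["user_a", "user_b", "user_c"],
    columns=["movie_x", "movie_y", "movie_z"],
)

# User-based: compare users by the ratings they gave
user_sim = pd.DataFrame(
    cosine_similarity(ratings), index=ratings.index, columns=ratings.index
)

# Item-based: compare movies by the ratings they received
item_sim = pd.DataFrame(
    cosine_similarity(ratings.T), index=ratings.columns, columns=ratings.columns
)

# Most similar user to user_a and most similar movie to movie_x (excluding themselves)
closest_user = user_sim["user_a"].drop("user_a").idxmax()
closest_item = item_sim["movie_x"].drop("movie_x").idxmax()
print(closest_user, closest_item)
```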



In [None]:
# Pivot the ratings into an item-user matrix (rows: movies, columns: users)
user_rating_matrix = rating_df.pivot(index="movieId", columns="userId", values="rating")

# Fill missing ratings with 0
user_rating_matrix = user_rating_matrix.fillna(0)

user_rating_matrix.head()

movieId \ userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
1,4.0,0.0,0.0,0.0,4.0,0.0,4.5,0.0,0.0,0.0,...,4.0,0.0,4.0,3.0,4.0,2.5,4.0,2.5,3.0,5.0
2,0.0,0.0,0.0,0.0,0.0,4.0,0.0,4.0,0.0,0.0,...,0.0,4.0,0.0,5.0,3.5,0.0,0.0,2.0,0.0,0.0
3,4.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0


In this section, we will fit scikit-learn's `NearestNeighbors` model (an unsupervised neighbor search rather than a classifier) using the cosine distance metric.
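
As a minimal sketch of how the neighbor search behaves, the example below fits `NearestNeighbors` on four hypothetical item vectors (each row stands for one movie's ratings from three users) and queries the neighbors of the first item. The data and the names `item_vectors` and `toy_model` are assumptions for illustration only.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical item-user matrix: one row per movie, one column per user
item_vectors = np.array([
    [5.0, 4.0, 0.0],   # item 0
    [4.0, 5.0, 1.0],   # item 1
    [0.0, 0.0, 5.0],   # item 2
    [5.0, 5.0, 0.0],   # item 3
])

# Cosine distance is 1 - cosine similarity, so smaller means more similar
toy_model = NearestNeighbors(metric="cosine", algorithm="brute")
toy_model.fit(item_vectors)

# Query with item 0; results are sorted by increasing distance,
# so the first neighbor is the queried item itself
distances, indices = toy_model.kneighbors(item_vectors[[0]], n_neighbors=3)

print(indices[0])
print(distances[0].round(4))
```

Because the queried item is returned as its own nearest neighbor, recommendation code typically skips the first result.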


In [None]:
from sklearn.neighbors import NearestNeighbors

rec = NearestNeighbors(metric = 'cosine')
rec.fit(user_rating_matrix)

Finally, here is our function that returns 5 recommended movies based on a movie the user has previously watched.


In [None]:
# Function to get movie recommendations based on a title
def get_recommendations(title):
    # Get movie details
    movie = df_2[df_2['title'] == title]

    if movie.empty:
        print(f"Movie '{title}' not found in dataset.")
        return None

    # Extract the scalar ID; calling int() directly on a Series is deprecated in pandas
    movie_id = int(movie['movieId'].iloc[0])

    # Get the positional index of the movie in the item-user matrix
    try:
        movie_index = user_rating_matrix.index.get_loc(movie_id)
    except KeyError:
        print(f"Movie ID {movie_id} not found in the user rating matrix.")
        return None

    # Get the ratings that all users gave this movie
    movie_ratings = user_rating_matrix.iloc[movie_index]

    # Reshape the ratings to be a single sample (1, -1)
    reshaped_df = movie_ratings.values.reshape(1, -1)

    # Find the nearest neighbors (similar movies)
    distances, indices = rec.kneighbors(reshaped_df, n_neighbors=15)

    # Get the movieIds of the nearest neighbors (excluding the first, which is the queried movie itself)
    nearest_idx = user_rating_matrix.iloc[indices[0]].index[1:]

    # Get the movie details for the nearest neighbors
    nearest_neighbors = pd.DataFrame({'movieId': nearest_idx})
    result = pd.merge(nearest_neighbors, df_2, on='movieId', how='left')

    # Return the top recommendations
    return result[['title', 'avgRating', 'genres']].head()

# Test the recommendation function
get_recommendations('Toy Story (1995)')

Unnamed: 0,title,avgRating,genres
0,Toy Story 2 (1999),3.125,Adventure|Animation|Children|Comedy|Fantasy
1,Jurassic Park (1993),4.5,Action|Adventure|Sci-Fi|Thriller
2,Independence Day (a.k.a. ID4) (1996),4.0,Action|Adventure|Sci-Fi|Thriller
3,Star Wars: Episode IV - A New Hope (1977),4.527778,Action|Adventure|Sci-Fi
4,Forrest Gump (1994),3.666667,Comedy|Drama|Romance|War


### <a id='toc1_7_1_'></a>[Exercise 3 - Check the recommendations for the movie 'Jurassic Park (1993)'](#toc0_)


In [None]:
# TODO
get_recommendations('Jurassic Park (1993)')
#get_recommendations("Forrest Gump (1994)")

Unnamed: 0,title,avgRating,genres
0,Terminator 2: Judgment Day (1991),2.625,Action|Sci-Fi
1,Forrest Gump (1994),3.666667,Comedy|Drama|Romance|War
2,Braveheart (1995),4.35,Action|Drama|War
3,"Fugitive, The (1993)",5.0,Thriller
4,Speed (1994),4.0,Action|Romance|Thriller


<details>
    <summary>Click here for the solution</summary>

```python
get_recommendations('Jurassic Park (1993)')
```

</details>


---
