In [None]:
#Introduction:
#Recommendation systems in Python use data-driven approaches to provide users with personalized suggestions. By analyzing user data and applying algorithms, these systems can predict and recommend products, services, or content that a user is likely to be interested in. Such systems are crucial in environments with vast amounts of information, like social media, streaming platforms, and online retail. Python is a popular choice for building recommendation systems due to its extensive libraries and machine learning frameworks. There are two primary types of recommendation systems: content-based filtering, which considers product attributes and user profiles, and collaborative filtering, which bases recommendations on user behavior and preferences. Additionally, hybrid approaches that combine both methods are widely used. These systems enhance user experiences, increase user engagement, and drive business growth.

In [None]:
""""
There are two main families of recommender systems:

Content-Based Filtering: Recommends items whose attributes resemble those of items the user already liked. It is often framed as supervised learning: a classifier is trained to separate the items a user will find interesting from those they won't.

Collaborative Filtering: Recommends items based on similarity between users and/or items, measured from rating behavior. The underlying premise is that users who rated items similarly in the past tend to have similar preferences in the future.
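
As a rough illustration of the content-based idea, the sketch below compares movies by their genre tags alone. The three-row table is a hypothetical sample in the same shape as movies.csv, and the choice of one-hot genre vectors with cosine similarity is an assumption made for illustration; it is not the method the rest of this notebook uses.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical sample in the same shape as movies.csv (genres separated by '|').
sample_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
    ],
})

# One-hot encode the genre tags: one column per genre, 1 if the movie has it.
genre_matrix = sample_movies["genres"].str.get_dummies(sep="|")

# Pairwise cosine similarity between the movies' genre vectors.
genre_sim = cosine_similarity(genre_matrix)

# Rank the other movies by similarity to the first one (Toy Story).
scores = pd.Series(genre_sim[0], index=sample_movies["title"]).drop("Toy Story (1995)")
most_similar = scores.idxmax()
print(most_similar)
```

Here Jumanji shares three genres with Toy Story while Grumpier Old Men shares only one, so Jumanji ranks first. No rating data is involved, which is the key difference from collaborative filtering.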

In [1]:
#Importing Libraries:
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)




In [None]:
""""
The cell above sets up the Python environment for data analysis and visualization. It imports NumPy, Pandas, scikit-learn, Matplotlib, and Seaborn, and it suppresses FutureWarning messages so that notices about upcoming library changes do not clutter the output.

In [2]:
# Loading the rating dataset:
ratings = pd.read_csv("https://s3-us-west-2.amazonaws.com/recommender-tutorial/ratings.csv")
print(ratings.head())

   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931


In [3]:
# Loading movie dataset:
movies = pd.read_csv("https://s3-us-west-2.amazonaws.com/recommender-tutorial/movies.csv")
print(movies.head())

   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  


In [None]:
""""
The two cells above load the datasets used in this notebook. "ratings.csv" contains user ratings for movies (userId, movieId, rating, timestamp) and is loaded into a Pandas DataFrame named 'ratings'. "movies.csv" holds movie metadata (movieId, title, genres) and is loaded into a DataFrame named 'movies'. Printing the first five rows of each DataFrame gives a quick look at its structure before any analysis.

In [7]:
# Statistical Analysis of Ratings:

n_ratings = len(ratings)
n_movies = ratings['movieId'].nunique()
n_users = ratings['userId'].nunique()

print(f"Number of ratings: {n_ratings}")
print(f"Number of unique movieId's: {n_movies}")
print(f"Number of unique users: {n_users}")
print(f"Average ratings per user: {round(n_ratings/n_users, 2)}")
print(f"Average ratings per movie: {round(n_ratings/n_movies, 2)}")

Number of ratings: 100836
Number of unique movieId's: 9724
Number of unique users: 610
Average ratings per user: 165.3
Average ratings per movie: 10.37


In [None]:
""""
This code computes summary statistics for the ratings dataset: the total number of ratings ('n_ratings'), the number of unique movie IDs ('n_movies'), and the number of unique user IDs ('n_users'). The dataset holds 100,836 ratings from 610 users across 9,724 movies. The averages, about 165 ratings per user but only about 10 per movie, suggest that ratings are spread thinly across movies and that many movies have only a few ratings.
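
The counts also indicate how sparse the data is. As a quick check, the sketch below reuses the numbers printed above as constants and computes the share of all possible user-movie pairs that carry a rating. The name 'density' is introduced here for illustration only.

```python
# Counts copied from the printed output above.
n_ratings = 100836
n_movies = 9724
n_users = 610

# Fraction of all possible (user, movie) pairs that actually have a rating.
density = n_ratings / (n_users * n_movies)
print(f"Matrix density: {density:.2%}")
```

Under these counts, fewer than 2% of the possible pairs are rated, which is why a sparse matrix format is a natural fit later on.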

In [12]:
# User Rating Frequency:
user_freq = ratings[['userId', 'movieId']].groupby('userId').count().reset_index()
user_freq.columns = ['userId', 'n_ratings']
print(user_freq.head())

   userId  n_ratings
0       1        232
1       2         29
2       3         39
3       4        216
4       5         44


In [None]:
""""
This code counts how many ratings each user has submitted. It groups the ratings by 'userId', counts the rows in each group, and stores the result in a new DataFrame called 'user_freq' with two columns: 'userId' and 'n_ratings'. The first rows already show that activity varies widely between users; for example, user 1 has 232 ratings while user 2 has 29. 'print(user_freq.head())' displays the first five rows.

In [15]:
# Movie Rating Analysis:

# Mean rating per movie:
mean_rating = ratings.groupby('movieId')[['rating']].mean()

# Lowest rated movie (print explicitly: a cell only displays its last expression):
lowest_rated = mean_rating['rating'].idxmin()
print(movies.loc[movies['movieId'] == lowest_rated])

# Highest rated movie:
highest_rated = mean_rating['rating'].idxmax()
print(movies.loc[movies['movieId'] == highest_rated])

# Ratings given to the highest rated movie (one row per rating):
print(ratings[ratings['movieId'] == highest_rated])

# Ratings given to the lowest rated movie (one row per rating):
print(ratings[ratings['movieId'] == lowest_rated])

# The movies above have very few ratings, so their plain means are unreliable.
# A Bayesian average is a better measure; it needs each movie's count and mean.
movie_stats = ratings.groupby('movieId')[['rating']].agg(['count','mean'])
movie_stats.columns = movie_stats.columns.droplevel()

In [None]:
""""
This code uses the mean rating of each movie to find the lowest and highest rated films in the dataset. 'idxmin' and 'idxmax' return the corresponding movie IDs, which are then looked up in the 'movies' DataFrame. Filtering 'ratings' by those IDs lists every rating each film received, which shows how many users rated it. A mean based on only a few ratings is not a reliable quality signal, so the last two lines collect each movie's rating count and mean in 'movie_stats', the inputs needed for a Bayesian average.
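
The notebook stops short of computing the Bayesian average, so here is a minimal sketch of one common form of it on hypothetical toy data. Using the average rating count as the prior weight 'C' and the average of the per-movie means as the prior mean 'm' is an assumption; other choices are possible.

```python
import pandas as pd

# Hypothetical toy ratings: movie 1 has a single perfect score,
# movie 2 has six good scores, movie 3 has two poor ones.
toy_ratings = pd.DataFrame({
    "movieId": [1, 2, 2, 2, 2, 2, 2, 3, 3],
    "rating":  [5.0, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 2.0, 3.0],
})

stats = toy_ratings.groupby("movieId")["rating"].agg(["count", "mean"])

C = stats["count"].mean()   # prior weight: average number of ratings per movie
m = stats["mean"].mean()    # prior mean: average of the per-movie means

# Shrink each movie's mean toward m; the fewer ratings, the stronger the pull.
stats["bayesian_avg"] = (C * m + stats["count"] * stats["mean"]) / (C + stats["count"])

print(stats.sort_values("bayesian_avg", ascending=False))
```

In this toy data, movie 1 has the highest plain mean, but movie 2 ranks first once the single-rating movie is pulled toward the overall mean.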

In [16]:
# User-Item Matrix Creation:

from scipy.sparse import csr_matrix

def create_matrix(df):
    # Sorted unique IDs; their positions become the matrix indices.
    user_ids = np.unique(df["userId"])
    movie_ids = np.unique(df["movieId"])
    N = len(user_ids)
    M = len(movie_ids)

    # Mapping Ids to indices:
    user_mapper = dict(zip(user_ids, range(N)))
    movie_mapper = dict(zip(movie_ids, range(M)))

    # Mapping indices to Ids:
    user_inv_mapper = dict(zip(range(N), user_ids))
    movie_inv_mapper = dict(zip(range(M), movie_ids))

    user_index = [user_mapper[i] for i in df['userId']]
    movie_index = [movie_mapper[i] for i in df['movieId']]

    # Rows are movies and columns are users.
    X = csr_matrix((df["rating"], (movie_index, user_index)), shape=(M, N))

    return X, user_mapper, movie_mapper, user_inv_mapper, movie_inv_mapper

X, user_mapper, movie_mapper, user_inv_mapper, movie_inv_mapper = create_matrix(ratings)

In [None]:
""""
The code above builds the rating matrix, the core data structure of this recommender. It works as follows:

1. Unique Counts: It computes the number of unique users (N) and unique movies (M) in the dataset.

2. Dictionary Creation:
   - 'user_mapper': Maps each unique user ID to a matrix index (e.g., user ID 1 becomes index 0).
   - 'movie_mapper': Maps each unique movie ID to a matrix index (e.g., movie ID 1 becomes index 0).
   - 'user_inv_mapper': Reverses 'user_mapper', mapping indices back to user IDs.
   - 'movie_inv_mapper': Reverses 'movie_mapper', mapping indices back to movie IDs.

3. Index Mapping: The lists 'user_index' and 'movie_index' translate the user ID and movie ID of every rating into its column and row index.

4. Sparse Matrix Creation: The SciPy class 'csr_matrix' builds a sparse matrix 'X' from the rating values and their (row, column) indices. Its shape is (M, N): each row is a movie and each column is a user, so strictly speaking it is an item-user matrix. A sparse format is used because most user-movie pairs have no rating.

In summary, this code arranges the ratings into a compact matrix in which every movie is represented by the vector of ratings it received.
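
To make the index mapping concrete, the sketch below repeats the same steps on a hypothetical four-row ratings table with non-contiguous IDs. The names 'toy', 'user_to_col', and 'movie_to_row' are illustrative and do not appear elsewhere in the notebook.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical toy ratings with non-contiguous IDs.
toy = pd.DataFrame({
    "userId":  [10, 10, 20, 30],
    "movieId": [100, 300, 100, 200],
    "rating":  [4.0, 5.0, 3.0, 2.0],
})

user_ids = np.unique(toy["userId"])     # sorted unique IDs: 10, 20, 30
movie_ids = np.unique(toy["movieId"])   # sorted unique IDs: 100, 200, 300
user_to_col = {uid: j for j, uid in enumerate(user_ids)}
movie_to_row = {mid: i for i, mid in enumerate(movie_ids)}

rows = [movie_to_row[mid] for mid in toy["movieId"]]
cols = [user_to_col[uid] for uid in toy["userId"]]

# Rows are movies and columns are users, matching the (M, N) shape above.
toy_X = csr_matrix((toy["rating"], (rows, cols)),
                   shape=(len(movie_ids), len(user_ids)))

print(toy_X.toarray())
# Share of cells that hold a rating.
print(toy_X.nnz / (toy_X.shape[0] * toy_X.shape[1]))
```

Only four of the nine cells are stored. The zeros in the dense view stand for missing ratings, not ratings of zero.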

In [30]:
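# Movie Similarity Analysis:

# Import necessary libraries
from sklearn.neighbors import NearestNeighbors

def find_similar_movies(movie_id, X, k, metric='cosine', show_distance=False):
    # Uses the global movie_mapper / movie_inv_mapper built by create_matrix.
    movie_ind = movie_mapper[movie_id]
    movie_vec = X[movie_ind]  # sparse row of shape (1, n_users)

    # Ask for k + 1 neighbors: the query movie comes back as its own closest match.
    kNN = NearestNeighbors(n_neighbors=k + 1, algorithm="brute", metric=metric)
    kNN.fit(X)
    distances, indices = kNN.kneighbors(movie_vec, return_distance=True)

    # Drop the first hit (the query movie) and map indices back to movie IDs.
    neighbor_ids = [movie_inv_mapper[int(i)] for i in indices[0][1:]]
    if show_distance:
        # Return (movie_id, distance) pairs when distances are requested.
        return list(zip(neighbor_ids, distances[0][1:]))
    return neighbor_ids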
 
movie_titles = dict(zip(movies['movieId'], movies['title']))
 
movie_id = 3
 
similar_ids = find_similar_movies(movie_id, X, k=10)
movie_title = movie_titles[movie_id]
 
print(f"Since you watched {movie_title}")
for i in similar_ids:
    print(movie_titles[i])


Since you watched Grumpier Old Men (1995)
Grumpy Old Men (1993)
Striptease (1996)
Nutty Professor, The (1996)
Twister (1996)
Father of the Bride Part II (1995)
Broken Arrow (1996)
Bio-Dome (1996)
Truth About Cats & Dogs, The (1996)
Sabrina (1995)
Birdcage, The (1996)


In [None]:
""""
The code above defines 'find_similar_movies', which uses the k-Nearest Neighbors (kNN) algorithm to find movies similar to a target movie. It takes the target movie ID, the rating matrix (X), the number of neighbors to return (k), a distance metric (cosine by default), and a 'show_distance' flag for requesting the neighbor distances as well.

The function looks up the row index of the target movie in 'movie_mapper' and takes that row of X as the movie's feature vector: the ratings it received from every user. Two movies count as similar when their rating vectors point in a similar direction. The kNN model is configured with k + 1 neighbors because the target movie is expected to come back as its own closest match.

After the model is fitted on X, the nearest neighbors of the target vector are retrieved, their indices are mapped back to movie IDs with 'movie_inv_mapper', and the first entry (the target movie itself) is dropped. The function returns a list of similar movie IDs, not titles. The rest of the cell builds a 'movie_titles' dictionary, looks up the title of movie 3, and prints it along with the titles of its 10 nearest neighbors.
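
The neighbor lookup can be traced on a tiny example. The sketch below runs the same scikit-learn calls on a hypothetical 3-movie, 3-user matrix, so the values are illustrative only.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Hypothetical movie-by-user matrix: rows are movies, columns are users.
toy_X = csr_matrix(np.array([
    [5.0, 4.0, 0.0],   # movie 0
    [5.0, 5.0, 0.0],   # movie 1: rated by the same users as movie 0
    [0.0, 0.0, 3.0],   # movie 2: rated by a different user
]))

knn = NearestNeighbors(n_neighbors=3, algorithm="brute", metric="cosine")
knn.fit(toy_X)

# Query with movie 0's row; the result includes movie 0 itself at distance ~0.
distances, indices = knn.kneighbors(toy_X[0], return_distance=True)

neighbors = indices[0][1:]              # drop the query movie
print(neighbors.tolist())
print(np.round(distances[0][1:], 3))
```

Movie 1 comes back as the closest neighbor because it was rated by the same users in nearly the same proportions. Movie 2 shares no raters with movie 0, so its cosine distance is 1.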

In [31]:
# Movie Recommendation with respect to User Preferences:

def recommend_movies_for_user(user_id, X, user_mapper, movie_mapper, movie_inv_mapper, k=10):
    # The mapper arguments are accepted but not used here;
    # find_similar_movies reads the global mappers directly.
    df1 = ratings[ratings['userId'] == user_id]

    if df1.empty:
        print(f"User with ID {user_id} does not exist.")
        return

    # The user's highest-rated movie; ties are broken by row order.
    movie_id = df1[df1['rating'] == df1['rating'].max()]['movieId'].iloc[0]

    movie_titles = dict(zip(movies['movieId'], movies['title']))

    similar_ids = find_similar_movies(movie_id, X, k)
    movie_title = movie_titles.get(movie_id, "Movie not found")

    if movie_title == "Movie not found":
        print(f"Movie with ID {movie_id} not found.")
        return

    print(f"Since you watched {movie_title}, you might also like:")
    for i in similar_ids:
        print(movie_titles.get(i, "Movie not found"))

In [None]:
""""
'recommend_movies_for_user' takes the 'user_id' to recommend for, the rating matrix 'X', the dictionaries 'user_mapper', 'movie_mapper', and 'movie_inv_mapper', and an optional 'k' for the number of recommendations (default 10). The dictionaries are accepted as arguments but are not referenced in the function body; 'find_similar_movies' reads the global mappers directly.

The function first filters the ratings for the given user. If the filtered DataFrame is empty, the user does not exist in the dataset, so the function prints a message and returns.

Otherwise, it picks the movie the user rated highest. If several movies share the top rating, the first one in the ratings data is used.

A 'movie_titles' dictionary maps movie IDs to titles. The function calls 'find_similar_movies' to get the IDs of the k movies most similar to the user's top-rated movie, then looks up that movie's title. If the ID is missing from the 'movies' dataset, it prints a message and returns.

Finally, the function prints the top-rated movie's title followed by the titles of the similar movies. Any similar movie whose ID has no title is shown as "Movie not found".

The recommendations are therefore driven by a single film, the user's highest-rated one. The following cells call the function for two user IDs, one that exists in the dataset and one that does not.
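
The selection of the seed movie can be checked in isolation. The sketch below applies the same filter to a hypothetical set of ratings for one user in which two movies are tied at the top; the names 'user_ratings' and 'seed_movie' are illustrative.

```python
import pandas as pd

# Hypothetical ratings for one user; movies 47 and 50 are tied at 5.0.
user_ratings = pd.DataFrame({
    "movieId": [1, 47, 50, 6],
    "rating":  [4.0, 5.0, 5.0, 3.0],
})

# Keep the rows with the top rating, then take the first one.
top = user_ratings[user_ratings["rating"] == user_ratings["rating"].max()]
seed_movie = top["movieId"].iloc[0]

print(seed_movie)
print(top["movieId"].tolist())
```

Only movie 47 would seed the recommendations here, even though movie 50 is rated just as highly. Combining neighbors from all tied movies is one possible refinement, but it is not something this notebook does.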

In [32]:
# Recommend the movies:

user_id = 150
recommend_movies_for_user(user_id, X, user_mapper, movie_mapper, movie_inv_mapper, k=10)

Since you watched Twelve Monkeys (a.k.a. 12 Monkeys) (1995), you might also like:
Pulp Fiction (1994)
Terminator 2: Judgment Day (1991)
Independence Day (a.k.a. ID4) (1996)
Seven (a.k.a. Se7en) (1995)
Fargo (1996)
Fugitive, The (1993)
Usual Suspects, The (1995)
Jurassic Park (1993)
Star Wars: Episode IV - A New Hope (1977)
Heat (1995)


In [33]:
user_id = 2300  # Replace with the desired user ID
recommend_movies_for_user(user_id, X, user_mapper, movie_mapper, movie_inv_mapper, k=10)

User with ID 2300 does not exist.


In [None]:
""""
In summary, this notebook built a simple item-based collaborative filtering recommender in Python. It loaded the ratings and movie metadata, arranged the ratings into a sparse movie-by-user matrix, and used k-nearest neighbors with cosine distance to find movies with similar rating patterns. Recommendations for a user are seeded by that user's highest-rated movie. Content-based filtering, hybrid approaches, and matrix factorization are other common techniques that could extend this baseline. Systems like this deliver personalized suggestions for movies, products, or other content, which can improve user experience, increase engagement, and support business growth.