# Movie Recommendation System Using KNN

This notebook implements a movie recommendation system using **K-Nearest Neighbors (KNN)** on user ratings. It takes an item-based **collaborative filtering** approach: movies are compared by the ratings users gave them, and a user is recommended movies similar to one they rated highly.

## Overview
The recommendation system follows these steps:
1. **Data Loading**: Load the `ratings.csv` and `movies_rc.csv` datasets which contain movie ratings from users and movie details, respectively.
2. **Data Preprocessing**: Inspect the datasets and compute summary statistics (ratings per user, ratings per movie, and the lowest- and highest-rated movies).
3. **User-Item Matrix Creation**: Create a sparse user-item matrix where each row represents a movie and each column represents a user.
4. **Similarity Calculation**: Use K-Nearest Neighbors (KNN) with the cosine metric to find movies whose rating vectors are closest to a given movie's.
5. **Movie Recommendations**: For a given user, take their highest-rated movie and recommend the `k` movies most similar to it.

## Key Concepts:
- **Collaborative Filtering**: A method that recommends items from user-item interactions (here, ratings) rather than from item content. This notebook uses the item-based variant: two movies count as similar when the same users rate them similarly.
- **Cosine Similarity**: A metric used to measure the similarity between two vectors by calculating the cosine of the angle between them.
- **KNN**: An algorithm that finds the `k` points closest to a given point (here, a movie's rating vector) under some distance metric (here, cosine distance, which is 1 minus cosine similarity).
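
To make the cosine similarity definition concrete, here is a minimal sketch on three made-up rating vectors. The helper `cosine_similarity` and the toy ratings are illustrative assumptions, not part of the notebook's pipeline.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical ratings from four users for three movies (0 = not rated)
movie_a = [5, 4, 0, 1]
movie_b = [4, 5, 0, 2]   # rated much like movie_a
movie_c = [0, 1, 5, 4]   # liked by a different group of users

sim_ab = cosine_similarity(movie_a, movie_b)
sim_ac = cosine_similarity(movie_a, movie_c)

print(f"similarity(a, b) = {sim_ab:.3f}")
print(f"similarity(a, c) = {sim_ac:.3f}")
print(f"cosine distance(a, b) = {1 - sim_ab:.3f}")
```

Movies rated alike by the same users point in nearly the same direction, so their similarity is close to 1 and their cosine distance is close to 0.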

### Dataset Information:
- **ratings.csv**: Contains user ratings for movies with columns `userId`, `movieId`, `rating`, and `timestamp`.
- **movies_rc.csv**: Contains movie details with columns `movieId`, `title`, and `genres`.





In [41]:
# Importing necessary libraries for data processing, model creation, and visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import warnings

# Suppressing future warnings for cleaner output
warnings.simplefilter(action = 'ignore', category = FutureWarning)


In [42]:
# Importing ratings data
ratings = pd.read_csv("D:/movie_recommendation/data/ratings.csv")
ratings.head()  # Checking the first few rows of ratings dataset

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [43]:
movies = pd.read_csv("D:/movie_recommendation/data/movies_rc.csv")
movies.head()  # Checking the first few rows of movies dataset

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [44]:
# Checking the structure and details of the ratings dataset
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [45]:
# Extracting the number of ratings, movies, and users
n_ratings = len(ratings.rating)
n_movies = len(ratings.movieId.unique())
n_users = len(ratings.userId.unique())

# Printing summary statistics on ratings, users, and movies
print(f'Number of Ratings: {n_ratings}')
print(f'Number of Unique MoviesIDs: {n_movies}')
print(f'Number of Unique UserIDs: {n_users}')
print(f'Average ratings per user: {round(n_ratings/n_users, 2)}')
print(f'Average ratings per movie: {round(n_ratings/n_movies, 2)}')

Number of Ratings: 100836
Number of Unique MoviesIDs: 9724
Number of Unique UserIDs: 610
Average ratings per user: 165.3
Average ratings per movie: 10.37


In [46]:
# Calculating the frequency of ratings given by each user
user_freq = ratings[['userId', 'movieId']].groupby('userId').count().reset_index()
user_freq.columns = ['userId','n_ratings']
user_freq.head()

Unnamed: 0,userId,n_ratings
0,1,232
1,2,29
2,3,39
3,4,216
4,5,44


In [47]:

# Computing the mean rating of each movie (used below to find the lowest- and highest-rated movies)
mean_rating = ratings.groupby('movieId')[['rating']].mean()

In [48]:

# Finding the lowest rated movie
lowest_rated = mean_rating['rating'].idxmin()
movies.loc[movies['movieId'] == lowest_rated]


Unnamed: 0,movieId,title,genres
2689,3604,Gypsy (1962),Musical


In [49]:
# Finding the highest rated movie
highest_rated = mean_rating['rating'].idxmax()
movies.loc[movies['movieId'] == highest_rated]

Unnamed: 0,movieId,title,genres
48,53,Lamerica (1994),Adventure|Drama


In [50]:
# Number of people that rated the lowest and highest rated movies
print(f"Number of people who rated the lowest rated movie: {ratings.movieId[ratings['movieId'] == lowest_rated].count()}")
print(f"Number of people who rated the highest rated movie: {ratings.movieId[ratings['movieId'] == highest_rated].count()}")

Number of people who rated the lowest rated movie: 1
Number of people who rated the highest rated movie: 2
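
Both extremes rest on only one or two ratings, so a plain mean is not a reliable way to rank movies. One common remedy is a Bayesian average, which shrinks each movie's mean toward a global prior when it has few ratings. The sketch below uses a made-up toy frame; the column name `bayesian_avg` and the choice of prior (`C` as the average rating count, `m` as the average of the per-movie means) are assumptions for illustration.

```python
import pandas as pd

# Toy ratings (hypothetical values, not taken from ratings.csv)
toy = pd.DataFrame({
    "movieId": [1, 1, 1, 1, 1, 1, 2, 3, 3],
    "rating":  [4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 5.0, 0.5, 1.0],
})

stats = toy.groupby("movieId")["rating"].agg(["count", "mean"])
C = stats["count"].mean()   # prior weight: average number of ratings per movie
m = stats["mean"].mean()    # prior mean: average of the per-movie means

# Shrink each movie's mean toward m; the fewer ratings, the stronger the pull
stats["bayesian_avg"] = (C * m + stats["count"] * stats["mean"]) / (C + stats["count"])

print(stats.sort_values("bayesian_avg", ascending=False))
```

In this toy data, movie 2 has the highest plain mean from a single rating, while movie 1, with six consistent ratings, ranks first under the Bayesian average.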


In [51]:
# Creating the user-item matrix using scipy csr_matrix
from scipy.sparse import csr_matrix
def create_matrix(df):
    """
    This function creates a sparse matrix for collaborative filtering.
    It converts the ratings data into a movie-by-user matrix, where rows represent movies and columns represent users.
    It also returns dictionaries that map user and movie IDs to matrix indices and back.
    """
    N = len(df['userId'].unique())  # Number of unique users
    M = len(df['movieId'].unique())  # Number of unique movies
    user_mapper = dict(zip(np.unique(df['userId']), list(range(N))))  # Mapping user IDs to matrix indices
    movie_mapper = dict(zip(np.unique(df['movieId']), list(range(M))))  # Mapping movie IDs to matrix indices
    user_inv_mapper = dict(zip(list(range(N)), np.unique(df['userId'])))  # Inverse mapping for user IDs
    movie_inv_mapper = dict(zip(list(range(M)), np.unique(df['movieId'])))  # Inverse mapping for movie IDs
    user_index = [user_mapper[i] for i in df['userId']]  # Column index (user) of each rating
    movie_index = [movie_mapper[i] for i in df['movieId']]  # Row index (movie) of each rating
    X = csr_matrix((df['rating'], (movie_index, user_index)), shape=(M, N))  # Sparse matrix: M movies x N users
    return X, user_mapper, movie_mapper, user_inv_mapper, movie_inv_mapper

X, user_mapper, movie_mapper, user_inv_mapper, movie_inv_mapper = create_matrix(ratings)
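
The `csr_matrix((data, (rows, cols)), shape=...)` form used in `create_matrix` can be seen on a small scale in the sketch below. The toy triplets and the names `toy_X` and `density` are made up for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical toy triplets: one entry per rating
movie_index = [0, 0, 1, 2]        # row of each rating
user_index = [0, 2, 1, 2]         # column of each rating
rating = [4.0, 5.0, 3.0, 1.0]     # value of each rating

# Same constructor form as in create_matrix
toy_X = csr_matrix((rating, (movie_index, user_index)), shape=(3, 3))
print(toy_X.toarray())

# Share of cells that actually hold a rating
density = toy_X.nnz / (toy_X.shape[0] * toy_X.shape[1])
print(f"Density: {density:.2%}")
```

With the counts printed earlier (100836 ratings, 9724 movies, 610 users), only about 1.7% of the cells in `X` are filled, which is why a sparse format is a sensible choice.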


In [52]:
# Finding similar movies using KNN
from sklearn.neighbors import NearestNeighbors

def find_similar_movies(movie_id, X, k, metric='cosine', show_distance=False):
    """
    This function finds the k most similar movies to a given movie using KNN.

    X is the movie-by-user matrix from create_matrix(); the global movie_mapper and
    movie_inv_mapper translate between movie IDs and row indices. Returns a list of
    movie IDs, or a list of (movie ID, distance) pairs if show_distance is True.
    """
    movie_ind = movie_mapper[movie_id]  # Get the row index of the movie from its ID
    movie_vec = X[movie_ind]  # Get the vector of ratings for the movie
    movie_vec = movie_vec.reshape(1, -1)  # Ensure a 2D query of shape (1, n_users)
    # Request k + 1 neighbours: the nearest neighbour of the query is normally the movie itself
    knn = NearestNeighbors(n_neighbors=k + 1, algorithm='brute', metric=metric)  # KNN model
    knn.fit(X)  # Each row of X (one movie) is a sample
    distances, indices = knn.kneighbors(movie_vec)  # Find nearest neighbors
    # Skip the first hit (the movie itself) and map row indices back to movie IDs
    neighbour_ids = [movie_inv_mapper[int(n)] for n in indices[0][1:]]
    if show_distance:
        return list(zip(neighbour_ids, distances[0][1:]))
    return neighbour_ids
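
To see what the neighbor query returns, the following sketch runs `NearestNeighbors` with the same settings (`algorithm='brute'`, `metric='cosine'`) on a made-up 4 x 4 matrix. The matrix `toy_X` and its values are assumptions for illustration only.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Hypothetical movie-by-user matrix: 4 movies (rows) rated by 4 users (columns)
toy_X = csr_matrix(np.array([
    [5, 4, 0, 1],   # movie 0
    [4, 5, 0, 2],   # movie 1: rated much like movie 0
    [0, 1, 5, 4],   # movie 2
    [0, 0, 4, 5],   # movie 3: rated much like movie 2
], dtype=float))

knn = NearestNeighbors(n_neighbors=2, algorithm="brute", metric="cosine")
knn.fit(toy_X)

# Query with movie 0's own row; both results have shape (1, 2)
distances, indices = knn.kneighbors(toy_X[0])
print(indices[0])
print(distances[0].round(3))
```

The first neighbor is the query movie itself at a distance of roughly zero, which is why the function asks for `k + 1` neighbors and discards the first one.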


In [53]:
# Create a dictionary to map movie IDs to movie titles
movie_titles = dict(zip(movies.movieId, movies.title))
movie_id = 3  # Example movie ID

In [54]:

# Get the 10 most similar movies to the given movie ID
similar_ids = find_similar_movies(movie_id, X, k=10)
movie_title = movie_titles[movie_id]
print(f'Since you watched {movie_title}')
for i in similar_ids:
    print(movie_titles[i])

Since you watched Grumpier Old Men (1995)
Grumpy Old Men (1993)
Striptease (1996)
Nutty Professor, The (1996)
Twister (1996)
Father of the Bride Part II (1995)
Broken Arrow (1996)
Bio-Dome (1996)
Truth About Cats & Dogs, The (1996)
Sabrina (1995)
Birdcage, The (1996)


In [55]:
# Function to recommend movies for a specific user based on their highest rated movie
def recommend_movies_for_user(user_id, k=10):
    """
    This function recommends k similar movies to a user based on their highest-rated movie.
    """
    df1 = ratings[ratings.userId == user_id]  # Filter ratings for the user
    if df1.empty:
        print(f'User with ID {user_id} does not exist.')
        return
    
    # Get the highest-rated movie by the user
    movie_id = df1[df1.rating == max(df1.rating)].movieId.iloc[0]
     
    # Get the title of the user's highest-rated movie
    movie_titles = dict(zip(movies['movieId'], movies['title']))
    movie_title = movie_titles.get(movie_id, "Movie not found")
    
    if movie_title == "Movie not found":
        print(f"Movie with ID {movie_id} not found.")
        return
    # Find similar movies
    similar_ids = find_similar_movies(movie_id, X, k)
    
    # Print recommendations
    print(f"Since you watched {movie_title}, you might also like:")
    for i in similar_ids:
        print(movie_titles.get(i, "Movie not found"))

In [56]:
# Example usage of the recommendation function for user with ID 150
user_id = 150
recommend_movies_for_user(user_id, k=10)

Since you watched Twelve Monkeys (a.k.a. 12 Monkeys) (1995), you might also like:
Pulp Fiction (1994)
Terminator 2: Judgment Day (1991)
Independence Day (a.k.a. ID4) (1996)
Seven (a.k.a. Se7en) (1995)
Fargo (1996)
Fugitive, The (1993)
Usual Suspects, The (1995)
Jurassic Park (1993)
Star Wars: Episode IV - A New Hope (1977)
Heat (1995)
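
The list above is not checked against the movies the user has already rated, so some recommendations may be titles they have seen. A minimal sketch of one way to filter them out follows; the helper `drop_already_rated` and the toy data are hypothetical and not part of the notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for the notebook's `ratings` frame
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2],
    "movieId": [10, 20, 30],
    "rating":  [5.0, 3.0, 4.0],
})

def drop_already_rated(user_id, candidate_ids, ratings_df):
    """Return candidate_ids without the movies that user_id has already rated."""
    seen = set(ratings_df.loc[ratings_df["userId"] == user_id, "movieId"])
    return [m for m in candidate_ids if m not in seen]

candidates = [20, 30, 40]  # e.g. the IDs returned by find_similar_movies
unseen = drop_already_rated(1, candidates, toy_ratings)
print(unseen)
```

Because filtering can shorten the list, one option is to request more than `k` neighbors up front and keep the first `k` unseen ones.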
