In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity
from scipy.spatial.distance import euclidean
from scipy.spatial.distance import cosine
ratings_df = pd.read_csv('/mnt/data/ratings.csv')
movies_df = pd.read_csv('/mnt/data/movies.csv')



How many movies has user 2 watched?

In [None]:
user_2_ratings = ratings_df[ratings_df['userId'] == 2]
movies_watched_by_user_2 = user_2_ratings['movieId'].nunique()
print(f"User 2 has watched {movies_watched_by_user_2} unique movies.")

Plot a bar chart of user 2's ratings

In [None]:
rating_counts = user_2_ratings['rating'].value_counts().sort_index()
plt.figure(figsize=(8, 6))
rating_counts.plot(kind='bar', color='skyblue')
plt.title("User 2's Movie Ratings Distribution")
plt.xlabel('Rating')
plt.ylabel('Count of Ratings')
plt.xticks(rotation=0)
plt.grid(axis='y')
plt.show()

User 2's top-rated movies

In [None]:
user_2_top_movies = pd.merge(user_2_ratings, movies_df, on='movieId')
top_movies = user_2_top_movies.sort_values(by='rating', ascending=False)
top_movies_list = top_movies[['title', 'rating']].head(10)
print("User 2's Top-Rated Movies:")
print(top_movies_list)

Identify the most similar user to user 2 using Euclidean distance and cosine similarity

In [None]:
user_movie_matrix = ratings_df.pivot(index='userId', columns='movieId', values='rating').fillna(0)
user_2_vector = user_movie_matrix.loc[2].values.reshape(1, -1)
cosine_similarities = cosine_similarity(user_movie_matrix, user_2_vector).flatten()
euclidean_distances = user_movie_matrix.apply(lambda x: euclidean(x, user_2_vector.flatten()), axis=1)
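# argsort() returns row positions, so map the position back to a userId label via the index
# (the last position is normally user 2 itself, so [-2] is the closest other user)
most_similar_user_cosine = user_movie_matrix.index[cosine_similarities.argsort()[-2]]
# The smallest distance (0) belongs to user 2 itself, so take the second-smallest
most_similar_user_euclidean = euclidean_distances.sort_values().index[1]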
print(f"Most similar user to User 2 based on Cosine Similarity: User {most_similar_user_cosine}")
print(f"Most similar user to User 2 based on Euclidean Distance: User {most_similar_user_euclidean}")
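The lookup above turns a row position from `argsort()` into a user ID. As a minimal sketch of why reading the label from the index is safer than adding 1 to the position, the toy matrix below uses non-contiguous user IDs. The function name `most_similar_by_cosine` and the toy data are hypothetical, and the sketch assumes every user has at least one non-zero rating.

```python
import numpy as np
import pandas as pd

def most_similar_by_cosine(matrix, target_id):
    """Return the index label of the row most similar to target_id (hypothetical helper)."""
    values = matrix.to_numpy(dtype=float)
    target = matrix.loc[target_id].to_numpy(dtype=float)
    # Cosine similarity = dot product divided by the product of the vector norms
    norms = np.linalg.norm(values, axis=1) * np.linalg.norm(target)
    sims = pd.Series(values @ target / norms, index=matrix.index)
    # Drop the target itself, then return the label (not the position) of the best match
    return sims.drop(target_id).idxmax()

# Toy user-movie matrix with non-contiguous user IDs (2, 5, 9)
toy_matrix = pd.DataFrame(
    [[5, 0, 4], [4, 0, 5], [0, 5, 0]],
    index=[2, 5, 9],
    columns=[10, 20, 30],
)
closest_user = most_similar_by_cosine(toy_matrix, 2)
print(f"Most similar user to user 2 in the toy matrix: {closest_user}")
```

Here the best match sits at row position 1, so "position + 1" would report user 2 (the target itself), while the index label gives user 5.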



Plot cosine similarity and Euclidean distance values

In [None]:
# Create the figure in the same cell as the subplots so both panels are drawn on it
plt.figure(figsize=(10, 6))
plt.subplot(1, 2, 1)
# Plot against the actual userId labels so the x-axis matches the vertical marker line
plt.scatter(user_movie_matrix.index, cosine_similarities, color='blue', label='Cosine Similarity')
plt.axvline(x=most_similar_user_cosine, color='red', linestyle='--', label=f'Most Similar User: {most_similar_user_cosine}')
plt.title('Cosine Similarity with User 2')
plt.xlabel('User ID')
plt.ylabel('Similarity')
plt.legend()
plt.subplot(1, 2, 2)
plt.scatter(euclidean_distances.index, euclidean_distances, color='green', label='Euclidean Distance')
plt.axvline(x=most_similar_user_euclidean, color='red', linestyle='--', label=f'Most Similar User: {most_similar_user_euclidean}')
plt.title('Euclidean Distance from User 2')
plt.xlabel('User ID')
plt.ylabel('Distance')
plt.legend()
plt.tight_layout()
plt.show()

Recommend movies for user 2 based on similar users

In [None]:
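# Movies user 2 has already rated; these are excluded from the recommendations
movies_watched_by_user_2_set = set(ratings_df[ratings_df['userId'] == 2]['movieId'])
# Users 366 and 442 are the most similar users found above (cosine and Euclidean, respectively)
movies_watched_by_user_366 = set(ratings_df[ratings_df['userId'] == 366]['movieId'])
movies_watched_by_user_442 = set(ratings_df[ratings_df['userId'] == 442]['movieId'])
# Candidate recommendations (assumed definition): movies the similar user rated that user 2 has not
recommended_movies_user_366 = movies_watched_by_user_366 - movies_watched_by_user_2_set
recommended_movies_user_442 = movies_watched_by_user_442 - movies_watched_by_user_2_set
recommended_movies_list_366 = movies_df[movies_df['movieId'].isin(recommended_movies_user_366)]['title'].head(10).tolist()
recommended_movies_list_442 = movies_df[movies_df['movieId'].isin(recommended_movies_user_442)]['title'].head(10).tolist()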
print("Recommendations for User 2 based on similarity with User 366 (Cosine Similarity):")
print(recommended_movies_list_366)
print("\nRecommendations for User 2 based on similarity with User 442 (Euclidean Distance):")
print(recommended_movies_list_442)
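The cell above takes the first ten candidate titles in `movies_df` order, regardless of how the similar user rated them. As a possible refinement, the sketch below ranks the unseen movies by the neighbor's rating before taking the top `n`. The function name `recommend_from_neighbor` and the toy frames are hypothetical; the column names follow `ratings.csv` and `movies.csv` as used in this notebook.

```python
import pandas as pd

def recommend_from_neighbor(ratings, movies, target_id, neighbor_id, n=10):
    """Titles the neighbor rated that the target has not, highest-rated first (hypothetical helper)."""
    seen = set(ratings.loc[ratings['userId'] == target_id, 'movieId'])
    neighbor = ratings[ratings['userId'] == neighbor_id]
    # Keep only movies the target user has not rated
    unseen = neighbor[~neighbor['movieId'].isin(seen)]
    # Stable sort so ties keep their original order
    ranked = unseen.sort_values('rating', ascending=False, kind='mergesort').head(n)
    # An inner merge keeps the order of the left frame, so the ranking is preserved
    return ranked.merge(movies, on='movieId')['title'].tolist()

# Toy data standing in for ratings_df and movies_df
toy_ratings = pd.DataFrame({
    'userId':  [1, 1, 2, 2, 2],
    'movieId': [10, 20, 10, 30, 40],
    'rating':  [5.0, 4.0, 5.0, 3.0, 4.5],
})
toy_movies = pd.DataFrame({
    'movieId': [10, 20, 30, 40],
    'title':   ['A', 'B', 'C', 'D'],
})
toy_recommendations = recommend_from_neighbor(toy_ratings, toy_movies, target_id=1, neighbor_id=2)
print(toy_recommendations)
```

On the real data this could be called as `recommend_from_neighbor(ratings_df, movies_df, 2, 366)`, assuming the frames loaded at the top of the notebook.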



Do the recommendations from this method make sense?

In [None]:
The recommendations based on cosine similarity (user 366) seem to align more closely with user 2's preferences. These movies, like "Braveheart" and "Fight Club," are action-packed and intense, similar to user 2's highly rated films such as "Mad Max: Fury Road" and "The Dark Knight." This indicates that user 366 has similar taste in genres, making these recommendations reasonable.

On the other hand, the recommendations based on Euclidean distance (user 442) include a broader range of genres, such as classics ("Patton"), animation ("Aristocats"), and romance/drama ("Dangerous Liaisons"). These do not seem to align as well with user 2's preference for intense, dramatic, or action-oriented films. Thus, the Euclidean distance recommendations are less suitable for user 2's tastes.

Short Analysis

In [None]:
Cosine similarity measures the angle between two users' rating vectors, so it captures how similar their movie preferences are regardless of how generously or harshly each user rates overall. For user 2, this metric identified user 366 as the most similar user, and the resulting recommendations aligned well with user 2's taste. In this case it showed its strength for collaborative filtering by emphasizing shared preferences rather than differences in rating intensity.

On the other hand, Euclidean distance, which measures the straight-line distance between rating vectors, was less effective here. It is sensitive to the magnitude of ratings, and because unrated movies are filled with 0 in the user-movie matrix, it is also affected by how many movies each user has rated. This led to recommendations that were less relevant to user 2's preferences: the metric suggested movies from diverse genres that did not align with user 2's history of favoring intense and dramatic films.
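The contrast between the two metrics can be illustrated on a few toy rating vectors. This is a minimal sketch with made-up numbers, not data from `ratings.csv`: `harsher_rater` has exactly the same relative preferences as `target` but scores everything at half the value, while `flatter_rater` uses similar score levels but a less pronounced preference pattern. All names are hypothetical.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_dist(a, b):
    """Straight-line distance between two vectors."""
    return float(np.linalg.norm(a - b))

target = np.array([4.0, 2.0, 5.0, 1.0])
harsher_rater = target / 2                       # same preferences, lower scores
flatter_rater = np.array([4.0, 3.0, 4.0, 2.0])   # similar levels, flatter preferences

for name, other in [('harsher_rater', harsher_rater), ('flatter_rater', flatter_rater)]:
    print(name,
          'cosine:', round(cosine_sim(target, other), 3),
          'euclidean:', round(euclidean_dist(target, other), 3))
```

Cosine similarity ranks `harsher_rater` as the closer match because the direction of the vector is unchanged, whereas Euclidean distance ranks `flatter_rater` closer because its raw scores are nearer to the target's.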

References
OpenAI. (2024). ChatGPT (Oct 1 version) [Large language model]. https://chat.openai.com/chat