#### Getting Started: Loading Libraries

In [4]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

#### Loading the Dataset
Load the dataset provided by Kaggle, <a href="https://www.kaggle.com/rounakbanik/the-movies-dataset">The Movies Dataset</a>, into pandas DataFrames.

In [7]:
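# Keep both tables: `movies` feeds the content-based part, `ratings` the user-based part
movies = pd.read_csv("/Users/sange/Downloads/movies.csv")
ratings = pd.read_csv("/Users/sange/Downloads/ratings.csv")
df = ratings  # the inspection cells below look at the ratings table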

We have our DataFrame ready, so let's take a look at it.

In [9]:
df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [10]:
df.describe()

Unnamed: 0,userId,movieId,rating,timestamp
count,100836.0,100836.0,100836.0,100836.0
mean,326.127564,19435.295718,3.501557,1205946000.0
std,182.618491,35530.987199,1.042529,216261000.0
min,1.0,1.0,0.5,828124600.0
25%,177.0,1199.0,3.0,1019124000.0
50%,325.0,2991.0,3.5,1186087000.0
75%,477.0,8122.0,4.0,1435994000.0
max,610.0,193609.0,5.0,1537799000.0


In [11]:
print(df.columns.values)

['userId' 'movieId' 'rating' 'timestamp']


In [12]:
# Fill missing values
movies['title'] = movies['title'].fillna('')
movies['genres'] = movies['genres'].fillna('')

In [13]:
# Only the columns that are filled and combined in this notebook
features = ['title', 'genres']

In [14]:
df['movieId'].isnull().values.any()

False

Our next task is to combine the values of these columns into a single string per movie.

In [15]:
# Combine features
movies['combined_features'] = movies['title'] + ' ' + movies['genres']
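
The concatenation above can also be wrapped in a small reusable helper. The sketch below is only an illustration: `combine_features` and the toy frame `toy_movies` are hypothetical names, not part of the notebook, and missing values are assumed to become empty strings as in the `fillna('')` step.

```python
import pandas as pd

def combine_features(frame, columns):
    """Join the given columns row-wise into one space-separated string."""
    # fillna("") mirrors the notebook's handling of missing titles and genres
    return frame[columns].fillna("").astype(str).agg(" ".join, axis=1)

# Hypothetical toy data, not taken from the dataset
toy_movies = pd.DataFrame({
    "title": ["Toy Story (1995)", None],
    "genres": ["Adventure|Animation", "Drama"],
})
toy_movies["combined_features"] = combine_features(toy_movies, ["title", "genres"])
print(toy_movies["combined_features"].tolist())
```

Passing the column list as an argument makes it easy to add further text columns later without rewriting the concatenation.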

With the combined strings in place, we feed them to a CountVectorizer() object to get the count matrix.

In [16]:
# Vectorization
cv = CountVectorizer()
count_matrix = cv.fit_transform(movies['combined_features'])
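
To make the count matrix concrete, here is a minimal sketch on three made-up strings. The names `toy_docs`, `toy_cv` and `toy_counts` are illustrative only, and the strings are not taken from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy strings standing in for movies['combined_features']
toy_docs = [
    "Toy Story Adventure Animation",
    "Toy Soldiers Action",
    "Heat Action Crime",
]

toy_cv = CountVectorizer()
toy_counts = toy_cv.fit_transform(toy_docs)  # sparse matrix, one row per string

# Terms are lowercased; order them by their column index in the matrix
vocab = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
dense = toy_counts.toarray()
print(vocab)
print(dense)
```

Each row counts how often every vocabulary term occurs in one string, so two strings that share words (here "toy" or "action") get overlapping non-zero columns.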

Now, we need to obtain the cosine similarity matrix from the count matrix.

In [17]:
cosine_sim = cosine_similarity(count_matrix)
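
For two count vectors a and b, cosine similarity is the dot product a·b divided by the product of their lengths. The sketch below reproduces that definition with NumPy on a tiny made-up matrix. `cosine_sim_matrix` is a hypothetical helper written for illustration, not the scikit-learn implementation.

```python
import numpy as np

def cosine_sim_matrix(X):
    """Pairwise cosine similarity between the rows of a dense 2-D array."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    unit = X / norms         # scale every row to length 1
    return unit @ unit.T     # dot products of unit vectors

# Hypothetical toy counts: rows 0 and 1 share one term, row 2 shares none
toy_counts = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 2],
])
sim = cosine_sim_matrix(toy_counts)
print(np.round(sim, 3))
```

The diagonal is 1 because every row is identical to itself, which is why the most similar entry for any movie is the movie itself.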

Next, we define a helper function that looks up a movie's index from its title. The reverse lookup, from index to title, is defined further down.

In [18]:
def get_index_from_title(title):
    matches = movies[movies['title'].str.contains(title, case=False, na=False)]
    if matches.empty:
        print("Movie not found in dataset")
        return None
    return matches.index[0]


In [19]:
movie_user_likes = "Toy Story"
movie_index = get_index_from_title(movie_user_likes)

We sort the list similar_movies by similarity score in descending order. Since the most similar movie to a given movie is the movie itself, we discard the first element after sorting and keep the next 10.

In [20]:
# Similarity calculation
similar_movies = list(enumerate(cosine_sim[movie_index]))
sorted_similar_movies = sorted(similar_movies, key=lambda x: x[1], reverse=True)[1:11]
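
Slicing off the first element assumes the query movie sorts first. If another movie has an identical feature string, it also scores 1.0 and may come first instead. A minimal sketch that excludes the query by index is shown below. `top_k_similar` and `toy_row` are hypothetical names used only for illustration.

```python
import numpy as np

def top_k_similar(sim_row, query_index, k=10):
    """Return (index, score) pairs for the k highest scores, excluding the query."""
    sim_row = np.asarray(sim_row, dtype=float)
    order = np.argsort(-sim_row, kind="stable")  # indices by descending score
    order = order[order != query_index][:k]      # drop the query itself
    return [(int(i), float(sim_row[i])) for i in order]

# Hypothetical similarity row: index 2 is the query, index 0 ties with it
toy_row = [1.0, 0.2, 1.0, 0.7, 0.0]
print(top_k_similar(toy_row, query_index=2, k=2))
```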


Then we define the reverse helper, which returns a title for a given index, and run a loop to print the 10 entries of the sorted_similar_movies list.

In [22]:
def get_title_from_index(index):
    return movies[movies.index == index]["title"].values[0]


In [23]:
print("Top 10 similar movies to", movie_user_likes, "are:\n")

for i, element in enumerate(sorted_similar_movies[:10], start=1):
    print(i, ".", get_title_from_index(element[0]))


Top 10 similar movies to Toy Story are:

1 . Toy Story 2 (1999)
2 . Toy Story 3 (2010)
3 . Antz (1998)
4 . Turbo (2013)
5 . Moana (2016)
6 . Jumanji (1995)
7 . Balto (1995)
8 . Gordy (1995)
9 . Shrek (2001)
10 . Monsters, Inc. (2001)


In [24]:
# Create user-movie matrix
user_movie_matrix = ratings.pivot_table(
    index='userId',
    columns='movieId',
    values='rating'
).fillna(0)
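
To show what the pivot produces, here is a minimal sketch on a made-up ratings table. `toy_ratings` and `toy_matrix` are illustrative names and the values are not from the dataset.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, movie) pair
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 30, 20],
    "rating":  [4.0, 5.0, 3.0, 2.0, 1.0],
})

# Wide format: one row per user, one column per movie
toy_matrix = toy_ratings.pivot_table(
    index="userId",
    columns="movieId",
    values="rating",
).fillna(0)
print(toy_matrix)
```

Note that filling with 0 makes "not rated" look like the lowest possible score to the similarity computation. That is a simplification, and mean-centering each user's ratings is a common alternative.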

In [25]:
# Cosine similarity between users
user_similarity = cosine_similarity(user_movie_matrix)

In [26]:
# Convert to DataFrame
user_similarity_df = pd.DataFrame(
    user_similarity,
    index=user_movie_matrix.index,
    columns=user_movie_matrix.index)

In [27]:
# Function to recommend movies
def recommend_movies(user_id, top_n=10):
    similar_users = user_similarity_df[user_id].sort_values(ascending=False)[1:6]
    
    weighted_ratings = user_movie_matrix.loc[similar_users.index].T.dot(similar_users)
    recommendations = weighted_ratings.sort_values(ascending=False).head(top_n)
    
    return recommendations
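
The function above scores every movie, including ones the user has already rated. A possible variant that keeps only unrated movies is sketched below on made-up data. `recommend_unseen`, `toy_matrix` and `toy_similarity` are hypothetical names, and a rating of 0 is assumed to mean "not rated", as in the pivot step.

```python
import numpy as np
import pandas as pd

def recommend_unseen(user_id, ratings_matrix, similarity_df, n_neighbors=5, top_n=10):
    """Score movies by similarity-weighted neighbor ratings, skipping rated ones."""
    # Drop the user by label rather than assuming they sort first
    neighbors = (similarity_df[user_id].drop(user_id)
                 .sort_values(ascending=False).head(n_neighbors))
    scores = ratings_matrix.loc[neighbors.index].T.dot(neighbors)
    # Keep only movies the user has not rated (0 marks "unrated" here)
    unseen = ratings_matrix.loc[user_id] == 0
    return scores[unseen].sort_values(ascending=False).head(top_n)

# Hypothetical user-movie matrix: rows are users, columns are movies
toy_matrix = pd.DataFrame(
    [[5.0, 4.0, 0.0, 0.0],
     [5.0, 5.0, 4.0, 0.0],
     [0.0, 0.0, 5.0, 4.0]],
    index=pd.Index([1, 2, 3], name="userId"),
    columns=pd.Index([10, 20, 30, 40], name="movieId"),
)
# Cosine similarity between users, computed directly with NumPy
values = toy_matrix.to_numpy()
unit = values / np.linalg.norm(values, axis=1, keepdims=True)
toy_similarity = pd.DataFrame(unit @ unit.T, index=toy_matrix.index, columns=toy_matrix.index)

rec = recommend_unseen(1, toy_matrix, toy_similarity, n_neighbors=2, top_n=2)
print(rec)
```

In this toy case user 1 has rated movies 10 and 20, so only movies 30 and 40 can be recommended.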

In [28]:
print("Top 10 movie recommendations for User 1:\n")
print(recommend_movies(1))

Top 10 movie recommendations for User 1:

movieId
1200    8.324253
2571    8.324160
1198    8.324160
2028    8.144516
296     7.811855
50      7.799389
1197    7.608757
1240    7.454355
1610    7.453767
592     7.276591
dtype: float64
