## In this project we build a movie recommendation system based on the GroupLens dataset. The project is for educational purposes only.
### We use four CSV files and features such as tags, genres, and ratings to recommend movies to watch. Further details are given throughout the project.

In [2]:
import pandas as pd

### Let's import the preprocessed tags file and load it into a DataFrame

In [3]:
tags_per_movie_df = pd.read_csv('tags_per_movie.csv')
tags_per_movie_df.head()

Unnamed: 0,movieId,tag
0,1,"pixar, pixar, fun"
1,2,"fantasy, magic board game, Robin Williams, game"
2,3,"moldy, old"
3,5,"pregnancy, remake"
4,7,remake
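
The notebook loads an already preprocessed file, and the preprocessing step itself is not shown. As a rough sketch only, a table of this shape could be produced by grouping a raw tags table by `movieId` and joining the tags into one string. The `raw_tags` frame, its rows, and its column names below are assumptions made for illustration, not the author's actual pipeline.

```python
import pandas as pd

# Hypothetical raw tags table: one row per (user, movie, tag).
# The rows and column names are assumed for illustration.
raw_tags = pd.DataFrame({
    "userId": [1, 2, 3, 4, 5],
    "movieId": [1, 1, 1, 3, 3],
    "tag": ["pixar", "pixar", "fun", "moldy", "old"],
})

# Collapse all tags of a movie into a single comma-separated string.
# Repeated tags are kept, as in the preprocessed file shown above.
tags_per_movie = (
    raw_tags.groupby("movieId")["tag"]
    .agg(", ".join)
    .reset_index()
)
print(tags_per_movie)
```

Keeping repeated tags preserves how often users applied a tag, which may or may not be desirable depending on how the tags are used later.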


### Now let's preprocess the movies DataFrame

In [4]:
raw_data = pd.read_csv('movies.csv')
# create a copy of raw_data
movies_df = raw_data.copy()

In [5]:
# Descriptive stats. Most of the data is categorical, so we will need a way to convert it to numerical values before we can use it in our model
movies_df.info()
print(movies_df.shape)
# Let's check for NaN values and duplicates
print(f"There are {movies_df.isna().values.sum()} NaN values in the dataset")
print(f"There are {movies_df.duplicated().values.sum()} duplicates in the dataset")
# Great, no NaN values and no duplicate rows. Let's check the number of unique values in each column
print(movies_df.nunique())
movies_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB
(9742, 3)
There are 0 NaN values in the dataset
There are 0 duplicates in the dataset
movieId    9742
title      9737
genres      951
dtype: int64


Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
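
One possible way to turn the pipe-separated `genres` column into numerical features is one-hot encoding with `Series.str.get_dummies`. This is a minimal sketch on three rows copied from the output above; whether the project ends up using this encoding is an assumption.

```python
import pandas as pd

# Three rows copied from the movies table shown above.
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
    ],
})

# Split each genres string on '|' and create one 0/1 column per genre.
genre_flags = movies["genres"].str.get_dummies(sep="|")

# Attach the flags to the identifying columns.
encoded = pd.concat([movies[["movieId", "title"]], genre_flags], axis=1)
print(encoded)
```

Each genre becomes its own column, so a movie with several genres simply has several flags set to 1.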


### At this point we can merge the tags and movies DataFrames to get a more detailed picture

In [6]:
# Let's merge the two dataframes on the movieId column
merged_df = pd.merge(movies_df, tags_per_movie_df, on='movieId', how='left')
# The tags dataframe has far fewer rows than the movies dataframe because not every movie has been tagged by users, so the left merge leaves NaN in the tag column for those movies. Let's fill them with 'no tag'
merged_df['tag'] = merged_df['tag'].fillna('no tag')
merged_df.head()

Unnamed: 0,movieId,title,genres,tag
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,"pixar, pixar, fun"
1,2,Jumanji (1995),Adventure|Children|Fantasy,"fantasy, magic board game, Robin Williams, game"
2,3,Grumpier Old Men (1995),Comedy|Romance,"moldy, old"
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,no tag
4,5,Father of the Bride Part II (1995),Comedy,"pregnancy, remake"
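
To see how many movies actually lack tags before filling them in, the `indicator=True` option of `pd.merge` can help. The sketch below uses four rows from the tables above; the variable names `checked` and `untagged` are illustrative.

```python
import pandas as pd

# Four movies, of which only three have tags (values taken from the tables above).
movies = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "title": ["Toy Story (1995)", "Jumanji (1995)",
              "Grumpier Old Men (1995)", "Waiting to Exhale (1995)"],
})
tags = pd.DataFrame({
    "movieId": [1, 2, 3],
    "tag": ["pixar, pixar, fun",
            "fantasy, magic board game, Robin Williams, game",
            "moldy, old"],
})

# indicator=True adds a '_merge' column saying where each row came from.
checked = pd.merge(movies, tags, on="movieId", how="left", indicator=True)

# Rows marked 'left_only' exist in movies but have no match in tags.
untagged = (checked["_merge"] == "left_only").sum()
print(f"Movies without tags: {untagged}")

# Fill the missing tags, as in the cell above.
checked["tag"] = checked["tag"].fillna("no tag")
print(checked[["movieId", "tag"]])
```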


### Before we move forward, let's also preprocess the ratings data, which is stored in a separate file

In [7]:
raw_data_2 = pd.read_csv('ratings.csv')

ratings_df = raw_data_2.copy()

In [13]:
# We don't need the timestamp, so we drop it (errors='ignore' keeps the cell safe to re-run once the column is gone)
ratings_df = ratings_df.drop(columns='timestamp', errors='ignore')
# Let's see the shape of the dataframe
# (the shape and unique counts include avg_rating and num_of_ratings when the cell is re-run after they have been added)
print('Shape of the dataframe: ', ratings_df.shape)
# Let's check for missing, duplicate and unique values in the dataset
print('Missing values: ', ratings_df.isnull().sum().sum())
print('Duplicate values: ', ratings_df.duplicated().sum())
print('Unique values: ', ratings_df.nunique())
# Let's add a new column to the dataframe, which will be the average rating for each movie
ratings_df['avg_rating'] = ratings_df.groupby('movieId')['rating'].transform('mean')
# Let's add another column that counts how many users rated each movie and name it 'num_of_ratings'
ratings_df['num_of_ratings'] = ratings_df.groupby('movieId')['rating'].transform('count')
# Now, let's create a dataframe with one row per movieId holding its avg_rating and num_of_ratings, keeping only movies with more than 30 ratings
new_ratings_df = ratings_df[ratings_df['num_of_ratings'] > 30]
new_ratings_df = new_ratings_df[['movieId', 'avg_rating', 'num_of_ratings']].drop_duplicates()
new_ratings_df.head()


Shape of the dataframe:  (100836, 5)
Missing values:  0
Duplicate values:  0
Unique values:  userId             610
movieId           9724
rating              10
avg_rating        1286
num_of_ratings     177
dtype: int64


Unnamed: 0,movieId,avg_rating,num_of_ratings
0,1,3.92093,215
1,3,3.259615,52
2,6,3.946078,102
3,47,3.975369,203
4,50,4.237745,204
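
The same per-movie summary can also be computed directly with a grouped aggregation, which skips the intermediate per-rating columns and the `drop_duplicates` step. This is an alternative sketch on toy data, not the notebook's approach; the `min_ratings` threshold is set to 2 here only so the toy data shows the filter (the notebook uses 30).

```python
import pandas as pd

# Toy ratings table (hypothetical values).
ratings = pd.DataFrame({
    "userId": [1, 2, 3, 1, 2],
    "movieId": [10, 10, 10, 20, 20],
    "rating": [4.0, 5.0, 3.0, 2.0, 3.0],
})

# One row per movie with its mean rating and number of ratings.
summary = (
    ratings.groupby("movieId")["rating"]
    .agg(avg_rating="mean", num_of_ratings="count")
    .reset_index()
)

# Keep only movies rated more than `min_ratings` times.
min_ratings = 2
popular = summary[summary["num_of_ratings"] > min_ratings]
print(popular)
```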


In [14]:
# Let's merge the two dataframes on the movieId column and drop the movies without rating statistics (those with 30 or fewer ratings, which are absent from new_ratings_df)
merged_df_2 = pd.merge(merged_df, new_ratings_df, on='movieId', how='left')
# drop rows with NaN values
merged_df_2.dropna(inplace=True)
merged_df_2.sort_values(by='avg_rating', ascending=False)

Unnamed: 0,movieId,title,genres,tag,avg_rating,num_of_ratings
277,318,"Shawshank Redemption, The (1994)",Crime|Drama,"prison, Stephen King, wrongful imprisonment, M...",4.429022,317.0
906,1204,Lawrence of Arabia (1962),Adventure|Drama|War,Middle East,4.300000,45.0
659,858,"Godfather, The (1972)",Crime|Drama,Mafia,4.289062,192.0
2226,2959,Fight Club (1999),Action|Crime|Drama|Thriller,"dark comedy, psychology, thought-provoking, tw...",4.272936,218.0
975,1276,Cool Hand Luke (1967),Drama,prison,4.271930,57.0
...,...,...,...,...,...,...
2860,3826,Hollow Man (2000),Horror|Sci-Fi|Thriller,no tag,2.294872,39.0
1174,1562,Batman & Robin (1997),Action|Adventure|Fantasy|Thriller,no tag,2.214286,42.0
2029,2701,Wild Wild West (1999),Action|Comedy|Sci-Fi|Western,no tag,2.207547,53.0
1235,1644,I Know What You Did Last Summer (1997),Horror|Mystery|Thriller,no tag,2.109375,32.0
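
In the output above `num_of_ratings` appears as a float (for example `317.0`). A left merge introduces NaN for unmatched movies, and NaN forces an integer column to float. As a sketch of one alternative, an inner merge keeps the same rows here while preserving the integer type. The frames below are toy data and the variable names are illustrative.

```python
import pandas as pd

# Toy movies table and a stats table that covers only some of the movies.
movies = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})
stats = pd.DataFrame({
    "movieId": [1, 3],
    "avg_rating": [3.92, 3.26],
    "num_of_ratings": [215, 52],
})

# Left merge followed by dropna: the temporary NaN turns the counts into floats.
left_then_drop = pd.merge(movies, stats, on="movieId", how="left").dropna()

# Inner merge: unmatched movies never appear, so the counts stay integers.
inner = pd.merge(movies, stats, on="movieId", how="inner")

print(left_then_drop["num_of_ratings"].dtype)  # a float dtype
print(inner["num_of_ratings"].dtype)           # an integer dtype
```

The two approaches match only when the rating columns are the sole source of NaN, which holds in this notebook because the missing tags were already filled with 'no tag'.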


In [18]:
# Let's create a function that takes a 'genre' argument and returns the top 10 movies to watch, based on rating. We will use the merged_df_2 dataframe
def top_10_movies(genre):
    # keep only the movies whose genres string contains the requested genre (literal match, not a regex)
    genre_df = merged_df_2[merged_df_2['genres'].str.contains(genre, regex=False)]
    # sort by avg_rating, then num_of_ratings; assigning the result instead of using inplace=True avoids pandas' SettingWithCopyWarning
    genre_df = genre_df.sort_values(by=['avg_rating', 'num_of_ratings'], ascending=False)
    # return the top 10 movies
    return genre_df.head(10)

top_10_movies('Horror')


Unnamed: 0,movieId,title,genres,tag,avg_rating,num_of_ratings
1616,2160,Rosemary's Baby (1968),Drama|Horror|Thriller,"Atmospheric, creepy, paranoia, scary, suspense",4.171875,32.0
510,593,"Silence of the Lambs, The (1991)",Crime|Horror|Thriller,"Hannibal Lector, disturbing, drama, gothic, ps...",4.16129,279.0
957,1258,"Shining, The (1980)",Horror,"atmospheric, disturbing, Horror, jack nicholso...",4.082569,109.0
960,1261,Evil Dead II (Dead by Dawn) (1987),Action|Comedy|Fantasy|Horror,no tag,4.044118,34.0
916,1215,Army of Darkness (1993),Action|Adventure|Comedy|Fantasy|Horror,no tag,4.039216,51.0
920,1219,Psycho (1960),Crime|Horror,"Alfred Hitchcock, psychology, suspenseful, ten...",4.036145,83.0
5335,8874,Shaun of the Dead (2004),Comedy|Horror,zombies,4.006494,77.0
1067,1387,Jaws (1975),Action|Horror,Shark,4.005495,91.0
4408,6502,28 Days Later (2002),Action|Horror|Sci-Fi,"zombies, zombies",3.974138,58.0
915,1214,Alien (1979),Horror|Sci-Fi,aliens,3.969178,146.0
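
One detail of the genre filter is worth illustrating: `Series.str.contains` treats its pattern as a regular expression by default, and the `|` that separates genres in this dataset means "or" in a regex. The sketch below, on a toy Series, compares the default behaviour with a literal match via `regex=False`.

```python
import pandas as pd

# Toy genres column in the same pipe-separated format as movies.csv.
genres = pd.Series(["Comedy|Romance", "Comedy", "Romance|Drama", "Horror"])

pattern = "Comedy|Romance"

# Default (regex=True): '|' means "or", so this matches Comedy OR Romance.
as_regex = genres.str.contains(pattern)

# regex=False: the pattern is matched as the literal text 'Comedy|Romance'.
as_literal = genres.str.contains(pattern, regex=False)

print(as_regex.tolist())    # [True, True, True, False]
print(as_literal.tolist())  # [True, False, False, False]
```

For a single genre name such as 'Horror' both modes give the same result, so the difference only matters when the argument contains regex metacharacters.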
