I'm a data scientist at FlixGenius, a popular movie streaming service. Our management has recently decided to invest in enhancing our recommendation engine to provide more personalized movie recommendations to our users. We've found that users are more likely to continue using our service if they receive movie recommendations that match their personal preferences.

As a data scientist, I've been tasked with building a model that provides top 5 movie recommendations to a user, based on their ratings of other movies. The model will take into account the user's past viewing history and ratings, as well as the ratings and viewing history of other users with similar preferences.

For this project, I've been provided with a dataset called MovieLens. The data comes from the GroupLens research lab at the University of Minnesota and includes user ratings of movies, as well as information about the movies themselves. My job is to use this data to build a recommendation model that will provide personalized recommendations to users.

In [63]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [64]:
movies_df = pd.read_csv('https://gist.githubusercontent.com/MichalOst3389/e3913bbb6921ea9475660d58d280d55c/raw/3b8861ea300bbdd6b689bf853dfce94524b39301/movies.csv')
links_df = pd.read_csv('https://gist.githubusercontent.com/MichalOst3389/cfc2c59e9f323d11b7afb8f3224229f3/raw/ce13331097cbff6abcd941e8388db941220876fb/links.csv')
ratings_df = pd.read_csv('https://gist.githubusercontent.com/MichalOst3389/d8ed774a84197205b2d7e53ce8345aae/raw/064966f6d7c5f45f3aa404ca45d5ee9b9fed0ece/ratings.csv')
tags_df = pd.read_csv('https://gist.githubusercontent.com/MichalOst3389/9ff4a3740c440a3391d891af2ccac50a/raw/05572aac39b0fd8d278228158ad5a4cb20ecaa9c/tags.csv')

# Data Cleaning

The first step is to combine them into a single dataset. I've merge the datasets using a common identifier which is movieId

In [65]:
# Merge datasets using movieId as the key
merged_df_1 = pd.merge(ratings_df, movies_df, on='movieId', how='inner')
merged_df = pd.merge(merged_df_1, tags_df, on='movieId', how='inner')

In [66]:
merged_df.isnull()

Unnamed: 0,userId_x,movieId,rating,timestamp_x,title,genres,userId_y,tag,timestamp_y
0,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...
233208,False,False,False,False,False,False,False,False,False
233209,False,False,False,False,False,False,False,False,False
233210,False,False,False,False,False,False,False,False,False
233211,False,False,False,False,False,False,False,False,False


In [67]:
print(merged_df.isnull().sum())

userId_x       0
movieId        0
rating         0
timestamp_x    0
title          0
genres         0
userId_y       0
tag            0
timestamp_y    0
dtype: int64


### There are no null values

In [68]:
merged_df.duplicated()

0         False
1         False
2         False
3         False
4         False
          ...  
233208    False
233209    False
233210    False
233211    False
233212    False
Length: 233213, dtype: bool

In [69]:
print(merged_df.duplicated().sum())

0


### There are no duplicated rows.

In [70]:
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 233213 entries, 0 to 233212
Data columns (total 9 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   userId_x     233213 non-null  int64  
 1   movieId      233213 non-null  int64  
 2   rating       233213 non-null  float64
 3   timestamp_x  233213 non-null  int64  
 4   title        233213 non-null  object 
 5   genres       233213 non-null  object 
 6   userId_y     233213 non-null  int64  
 7   tag          233213 non-null  object 
 8   timestamp_y  233213 non-null  int64  
dtypes: float64(1), int64(5), object(3)
memory usage: 17.8+ MB


### One hot encoding tags

By one-hot encoding the tag column, we can represent the user's tags for each movie in a way that can be easily used as input to a recommendation algorithm.

In [71]:
# One-hot encode the 'tag' column
tags_encoded = pd.get_dummies(merged_df['tag'])

# Merge the encoded tags back into the original dataframe
merged_df = pd.concat([merged_df, tags_encoded], axis=1)

# Drop the original 'tag' column
merged_df.drop('tag', axis=1, inplace=True)

### One hot encoding genres

In [72]:
# One-hot encode the 'tag' column
genres_encoded = pd.get_dummies(merged_df['genres'])

# Merge the encoded tags back into the original dataframe
merged_df = pd.concat([merged_df, tags_encoded], axis=1)

# Drop the original 'tag' column
merged_df.drop('genres', axis=1, inplace=True)

In [73]:
merged_df

Unnamed: 0,userId_x,movieId,rating,timestamp_x,title,userId_y,timestamp_y,"""artsy""",06 Oscar Nominated Best Movie - Animation,1900s,...,women,wonderwoman,workplace,writing,wrongful imprisonment,wry,younger men,zither,zoe kazan,zombies
0,1,1,4.0,964982703,Toy Story (1995),336,1139045764,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,1,4.0,964982703,Toy Story (1995),474,1137206825,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,1,4.0,964982703,Toy Story (1995),567,1525286013,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,5,1,4.0,847434962,Toy Story (1995),336,1139045764,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,1,4.0,847434962,Toy Story (1995),474,1137206825,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
233208,599,176419,3.5,1516604655,Mother! (2017),567,1525287588,0,0,0,...,0,0,0,0,0,0,0,0,0,0
233209,599,176419,3.5,1516604655,Mother! (2017),567,1525287586,0,0,0,...,0,0,0,0,0,0,0,0,0,0
233210,594,7023,4.5,1108972356,"Wedding Banquet, The (Xi yan) (1993)",474,1137179697,0,0,0,...,0,0,0,0,0,0,0,0,0,0
233211,606,6107,4.0,1171324428,Night of the Shooting Stars (Notte di San Lore...,606,1178473747,0,0,0,...,0,0,0,0,0,0,0,0,0,0
