# Problem Statement
Online streaming platforms are the dominant form of media consumption, and viewers face a paradox of choice when selecting films to watch. Even avid watchers struggle to explore their streaming catalogs fully, which leads to subscriber frustration and a higher likelihood of churn. The MovieLens dataset, which comprises over 100,000 ratings applied to nearly 10,000 films by hundreds of users, gives us a concrete basis for building and evaluating a movie recommender. Recommendations that steer viewers toward films matching their interests have the potential to improve viewer satisfaction on streaming platforms. They can also promote media discovery, encourage interest in niche titles, and inspire artistry through expanded access to cinema. Tackling the recommendation challenge is therefore important to unlocking the creative and commercial potential of online streaming.

# Data Cleaning

In [1]:
#importing relevant packages
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
#create data frames 
movies = pd.read_csv("data/ml-latest-small/movies.csv")
links = pd.read_csv("data/ml-latest-small/links.csv")
ratings = pd.read_csv("data/ml-latest-small/ratings.csv")
tags = pd.read_csv("data/ml-latest-small/tags.csv")
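
The problem statement quotes dataset sizes (ratings, films, users) that can be checked right after loading. The following is a minimal sketch, not part of the original notebook: `summarize_ratings` is a hypothetical helper, and it is run here on a small toy frame with the same columns as `ratings.csv` so that it works without the data files.

```python
import pandas as pd

def summarize_ratings(ratings: pd.DataFrame) -> dict:
    """Count ratings, distinct users, and distinct rated movies."""
    return {
        "n_ratings": len(ratings),
        "n_users": ratings["userId"].nunique(),
        "n_movies": ratings["movieId"].nunique(),
    }

# Toy frame with the same columns as ratings.csv (values are illustrative only)
ratings_toy = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [1, 3, 1],
    "rating": [4.0, 4.0, 3.5],
    "timestamp": [964982703, 964981247, 964982224],
})

summary = summarize_ratings(ratings_toy)
print(summary)
```

In the notebook itself, calling `summarize_ratings(ratings)` on the loaded frame would report the corresponding counts for the real data.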

In [3]:
#reading the first 3 rows
movies.head(3)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance


In [4]:
#reading the first 3 rows
links.head(3)

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0


In [5]:
#reading the first 3 rows
ratings.head(3)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224


In [6]:
#reading the first 3 rows
tags.head(3)

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992


In [8]:
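#merge movies and links on 'movieId' (default inner join)
df = pd.merge(movies, links, on='movieId')

#merge with ratings on 'movieId'
df = pd.merge(df, ratings, on='movieId')

#merge with tags on 'movieId'; ratings and tags both carry userId and timestamp,
#so pandas suffixes them _x (ratings) and _y (tags), and each rating is paired
#with every tag applied to the same movie
df = pd.merge(df, tags, on='movieId')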

df.head()

Unnamed: 0,movieId,title,genres,imdbId,tmdbId,userId_x,rating,timestamp_x,userId_y,tag,timestamp_y
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,1,4.0,964982703,336,pixar,1139045764
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,1,4.0,964982703,474,pixar,1137206825
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,1,4.0,964982703,567,fun,1525286013
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,5,4.0,847434962,336,pixar,1139045764
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,5,4.0,847434962,474,pixar,1137206825
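
In the output above, user 1's rating of Toy Story appears once for each tag applied by users 336, 474, and 567. Merging on `movieId` alone produces every rating-tag combination per movie, and the default inner join also drops movies that have no tags. The sketch below reproduces this on toy data and shows one possible alternative, assuming the goal is to attach a tag only to the rating made by the same user. The toy frames and the choice of `how="left"` are illustrative assumptions, not the notebook's method.

```python
import pandas as pd

# Toy stand-ins for ratings.csv and tags.csv (values are illustrative only)
ratings_toy = pd.DataFrame({
    "userId": [1, 5, 336],
    "movieId": [1, 1, 1],
    "rating": [4.0, 4.0, 5.0],
})
tags_toy = pd.DataFrame({
    "userId": [336, 474],
    "movieId": [1, 1],
    "tag": ["pixar", "pixar"],
})

# Joining on movieId alone pairs every rating with every tag of that movie
fan_out = pd.merge(ratings_toy, tags_toy, on="movieId")
print(len(fan_out))  # 3 ratings x 2 tags = 6 rows

# Joining on both keys with how="left" keeps every rating and attaches
# a tag only when the same user tagged the same movie
per_user = pd.merge(ratings_toy, tags_toy, on=["userId", "movieId"], how="left")
print(per_user)
```

With the two-key join, ratings without a matching tag keep a missing `tag` value, and a user who applied several tags to one movie would still yield several rows for that rating.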


In [12]:
#drop the rating (timestamp_x) and tag (timestamp_y) timestamps
columns_to_drop = ['timestamp_x', 'timestamp_y']
df.drop(columns=columns_to_drop, inplace=True)

#display the first few rows of the DataFrame
df.head()

Unnamed: 0,movieId,title,genres,imdbId,tmdbId,userId_x,rating,userId_y,tag
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,1,4.0,336,pixar
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,1,4.0,474,pixar
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,1,4.0,567,fun
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,5,4.0,336,pixar
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,5,4.0,474,pixar


In [13]:
#count missing values in each column of the merged DataFrame
df.isnull().sum()

movieId     0
title       0
genres      0
imdbId      0
tmdbId      0
userId_x    0
rating      0
userId_y    0
tag         0
dtype: int64
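
Zero nulls in the merged frame do not by themselves show that the source files are complete, because an inner join silently drops rows without a match. `tmdbId` also prints as a float (`862.0`), which can happen when a column of integers contains missing values, so `links.csv` may be worth checking on its own. The sketch below uses toy frames (illustrative values, not the real data) to show how a missing value can vanish during the merge and how to list the movies that were dropped.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for movies.csv, links.csv and tags.csv (illustrative only)
movies_toy = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})
links_toy = pd.DataFrame({"movieId": [1, 2, 3], "tmdbId": [862.0, np.nan, 15602.0]})
tags_toy = pd.DataFrame({"movieId": [1, 3], "tag": ["pixar", "moldy"]})

# Null counts on the source table, before any join
nulls_before = links_toy.isnull().sum()

# Same inner-join pattern as the notebook, applied to the toy frames
merged = movies_toy.merge(links_toy, on="movieId").merge(tags_toy, on="movieId")
nulls_after = merged.isnull().sum()

# Movies that disappeared because they had no matching tag
dropped = set(movies_toy["movieId"]) - set(merged["movieId"])
print(nulls_before["tmdbId"], nulls_after["tmdbId"], dropped)
```

Running `isnull().sum()` on each of `movies`, `links`, `ratings`, and `tags` before merging, and comparing `movieId` sets before and after, would give the same check on the real data.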