# Data Cleaning/Visualization Starter

This notebook should include starter functionality to get start cleaning data and looking at some interesting features. The goal is to familiarize yourself with the dataset so when model development starts, working with the dataset will be easier.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
sns.set_theme() # feel free to change this - default seaborn plots look nice but you can play with colors

`movies` is the DataFrame which will hold all the movies in the MovieLens100k dataset. Notice how each movie has:
- a uniquely identifying `movie_id` - this will be useful for tracing ratings to movies
- genre and year information - think about how this might be useful to the prediction process

In [106]:
movies = pd.read_csv("../data/ml-latest-small/movies.csv")
print(f"Number of movies: {len(movies)}")
movies.head(10)

Number of movies: 9742


Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller


In practice, we will not need to use the `movies` dataset as all the information is available in the `ratings` dataset.

`ratings` is the DataFrame which will hold all the ratings given for the movies in `movies`. We will be using these ratings to make predictions on the ratings users will have for movies they have not seen. 

Some features to note:
- `user_id` is the ID assigned to the user who made the rating
- `movie_id` is the movie that was rated (you can use this column to map movie names to ratings)
- `user_rating` are movie ratings on a scale 1-5 stars (with half-star increments).
- How might we use `timestamp` to aid the recommender system? (Times are given in UTC seconds)

In [107]:
ratings = pd.read_csv("../data/ml-latest-small/ratings.csv")
print(f"Number of ratings: {len(ratings)}")
ratings.head(10)

Number of ratings: 100836


Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
5,1,70,3.0,964982400
6,1,101,5.0,964980868
7,1,110,4.0,964982176
8,1,151,5.0,964984041
9,1,157,5.0,964984100


## Suggested data visualization steps
Our main goal this first work session is to get to understand the dataset and manipulate it, as in the future, we will be manipulating this dataset extensively.
- Take a look at the distribution of years among movies in our dataset. What is the most recent movie? The oldest?
- Visualize the genres of the movies represented in the dataset. Are any more represented than others?
- Pick any movie in the dataset - look at the ratings users have given that movie. Do these ratings make sense to you? (You would want to pick a movie you've probably watched before).
- Can you find your favorite movies in here?

**KEY**: Make plots to visualize your results - a picture speaks a 1000 words!

## Suggested data cleaning steps
If you want to get started data cleaning, here are some recommended steps:
- Separate Movie Year from the Movie Name
- Construct a table of ratings per movie (or vice versa). This will help for some of the later data cleaning next week.
- [One-Hot Encode](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/) genre data

If you write any nontrivial code to process the data, try to generalize the processing into functions/scripts. This is good practice in writing extensible data cleaning code.

In [108]:
# merge the two tables together to not have to process data twice
data = ratings.join(movies.set_index('movieId'), on="movieId", how="left")
data

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,1,3,4.0,964981247,Grumpier Old Men (1995),Comedy|Romance
2,1,6,4.0,964982224,Heat (1995),Action|Crime|Thriller
3,1,47,5.0,964983815,Seven (a.k.a. Se7en) (1995),Mystery|Thriller
4,1,50,5.0,964982931,"Usual Suspects, The (1995)",Crime|Mystery|Thriller
...,...,...,...,...,...,...
100831,610,166534,4.0,1493848402,Split (2017),Drama|Horror|Thriller
100832,610,168248,5.0,1493850091,John Wick: Chapter Two (2017),Action|Crime|Thriller
100833,610,168250,5.0,1494273047,Get Out (2017),Horror
100834,610,168252,5.0,1493846352,Logan (2017),Action|Sci-Fi


In [109]:
# convert the genre pipe-separated string into a list
data["genres"] = data["genres"].apply(lambda x: x.split("|"))

In [110]:
# extract the set of genres utilized in the dataset
genres = set()
def func(genre_list):
    for genre in genre_list:
        genres.add(genre.lower())
data["genres"].apply(func)
print(genres)

{'adventure', 'mystery', 'animation', 'romance', '(no genres listed)', 'children', 'thriller', 'crime', 'war', 'imax', 'action', 'documentary', 'horror', 'sci-fi', 'musical', 'fantasy', 'film-noir', 'drama', 'comedy', 'western'}


In [111]:
# one-hot encode genres into table
def update(genres, g=""):
    return g in genres
for g in genres:
    data[g] = data["genres"].apply(update, g=g)
data = data.drop("genres", axis=1)

In [112]:
# extract year from movie title and push to new column
import re
def getyear(x):
    cleaned_year = re.findall(r"\(\d{4}\)", x)
    if not cleaned_year:
        return np.NaN
    return int(cleaned_year[0][1:-1])
data["year"] = data["title"].apply(getyear)
data["title"] = data["title"].apply(lambda x: re.sub(r"\(\d{4}\)", "", x).strip())
data.dropna(inplace=True, subset="year")
data["year"] = data["year"].astype(np.int16)

In [113]:
# remap movieId because we only have 9711 movies but movieId can be up to 178000
index_map = {old:new for old, new in zip(data["movieId"].unique(), np.arange(0,len(data["movieId"])))}
data["movieId"] = data["movieId"].apply(lambda x: index_map[x])

In [115]:
# remap userId to be zero indexed for clarity
data["userId"] = data["userId"].apply(lambda x: x-1)

In [116]:
data

Unnamed: 0,userId,movieId,rating,timestamp,title,adventure,mystery,animation,romance,(no genres listed),...,documentary,horror,sci-fi,musical,fantasy,film-noir,drama,comedy,western,year
0,0,0,4.0,964982703,Toy Story,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,1995
1,0,1,4.0,964981247,Grumpier Old Men,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,1995
2,0,2,4.0,964982224,Heat,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,1995
3,0,3,5.0,964983815,Seven (a.k.a. Se7en),False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,1995
4,0,4,5.0,964982931,"Usual Suspects, The",False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,1995
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
100831,609,3120,4.0,1493848402,Split,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,2017
100832,609,2035,5.0,1493850091,John Wick: Chapter Two,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,2017
100833,609,3121,5.0,1494273047,Get Out,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,2017
100834,609,1392,5.0,1493846352,Logan,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,2017


In [119]:
# change some data types to reduce memory usage
data["userId"] = data["userId"].astype(np.int16)
data["movieId"] = data["movieId"].astype(np.int16)
data["rating"] = data["rating"].astype(np.float16)

In [125]:
data.to_csv("../data/cleaned/cleaned_ratings.csv", index=False)