# Data Cleaning/Visualization Starter

This notebook provides starter code for cleaning the data and exploring some of its interesting features. The goal is to familiarize yourself with the dataset so that working with it is easier once model development starts.

Note that you have two options to obtain the dataset:
- Download the data using the `setup.py` script
- Use `tensorflow_datasets` to fetch the data for you

The notebook currently uses the latter method. If you do not want the overhead of TensorFlow, the former method gives you the same dataset. We prefer the TensorFlow dataset because its data is already partially cleaned.

In [32]:
import numpy as np
import pandas as pd
import seaborn as sns
import tensorflow_datasets as tfds
sns.set_theme() # feel free to change this - default seaborn plots look nice but you can play with colors

`movies` is the DataFrame that holds all the movies in the MovieLens100k dataset. Notice how each movie has:
- a uniquely identifying `movie_id` - this will be useful for tracing ratings to movies
- genre information in `movie_genres` and the release year embedded in `movie_title` - think about how these might be useful to the prediction process

In [33]:
movies = tfds.as_dataframe(tfds.load("movielens/latest-small-movies", split="train"))
print(f"Number of movies: {len(movies)}")
movies.head(10)

Number of movies: 9742


| | movie_genres | movie_id | movie_title |
|---|---|---|---|
| 0 | [4] | b'2261' | b'One Crazy Summer (1986)' |
| 1 | [10] | b'1979' | b'Friday the 13th Part VI: Jason Lives (1986)' |
| 2 | [4, 5] | b'6143' | b'Trail of the Pink Panther (1982)' |
| 3 | [4, 7, 14] | b'5856' | b'Do You Remember Dolly Bell? (Sjecas li se, D... |
| 4 | [0, 4, 7, 16] | b'70728' | b'Bronson (2009)' |
| 5 | [14, 19] | b'4035' | b'Claim, The (2000)' |
| 6 | [4, 19] | b'3873' | b'Cat Ballou (1965)' |
| 7 | [0, 5, 7, 16] | b'8593' | b'Juice (1992)' |
| 8 | [7, 10] | b'71304' | b'Thirst (Bakjwi) (2009)' |
| 9 | [0, 7, 13] | b'107314' | b'Oldboy (2013)' |


In practice, we will not need the `movies` dataset, because all of its information is also available in the `ratings` dataset.

`ratings` is the DataFrame that holds every rating given to the movies in `movies`. We will use these ratings to predict how users would rate movies they have not seen.

Some features to note:
- `user_id` is the ID assigned to the user who made the rating
- `movie_id` is the movie that was rated (you can use this column to map movie names to ratings)
- `user_rating` is the rating itself, from 0.5 to 5 stars in half-star increments
- `timestamp` is when the rating was made, in seconds since the Unix epoch (UTC). How might we use it to aid the recommender system?

In [34]:
ratings = tfds.as_dataframe(tfds.load("movielens/latest-small-ratings", split="train"))
print(f"Number of ratings: {len(ratings)}")
ratings.head(10)

Number of ratings: 100836


| | movie_genres | movie_id | movie_title | timestamp | user_id | user_rating |
|---|---|---|---|---|---|---|
| 0 | [7, 8, 13, 15] | b'4874' | b'K-PAX (2001)' | 1446749868 | b'105' | 5.0 |
| 1 | [7, 18] | b'527' | b"Schindler's List (1993)" | 1305696664 | b'17' | 4.5 |
| 2 | [5, 9] | b'7943' | b'Killers, The (1946)' | 1166068511 | b'309' | 4.0 |
| 3 | [10, 13, 16] | b'1644' | b'I Know What You Did Last Summer (1997)' | 1518640852 | b'111' | 0.5 |
| 4 | [1, 2, 3, 4, 12, 14] | b'8360' | b'Shrek 2 (2004)' | 1127221149 | b'182' | 3.0 |
| 5 | [4, 8, 15] | b'2717' | b'Ghostbusters II (1989)' | 1053021961 | b'474' | 0.5 |
| 6 | [0, 1, 4, 19] | b'3624' | b'Shanghai Noon (2000)' | 974705163 | b'450' | 4.0 |
| 7 | [0, 5, 16] | b'6' | b'Heat (1995)' | 1270603905 | b'434' | 4.0 |
| 8 | [0, 15, 16] | b'3986' | b'6th Day, The (2000)' | 1529902021 | b'586' | 4.0 |
| 9 | [0, 1, 7, 18] | b'57499' | b'Heaven and Earth (Ten to Chi to) (1990)' | 1498521142 | b'599' | 3.0 |
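The preview shows two things worth handling early: the ID and title columns hold `bytes` values (the `b'...'` prefix), and `timestamp` is a raw integer. Here is a minimal sketch of one way to deal with both. The helper `decode_bytes_columns` and the `rated_at` column are hypothetical names, and the sample rows are copied from the preview so that the sketch runs without downloading anything.

```python
import pandas as pd

# Three rows copied from the `ratings` preview above.
sample = pd.DataFrame({
    "movie_id": [b"4874", b"527", b"7943"],
    "movie_title": [b"K-PAX (2001)", b"Schindler's List (1993)", b"Killers, The (1946)"],
    "timestamp": [1446749868, 1305696664, 1166068511],
    "user_id": [b"105", b"17", b"309"],
    "user_rating": [5.0, 4.5, 4.0],
})


def decode_bytes_columns(df, columns):
    """Return a copy of df with the given bytes columns decoded to str (UTF-8 assumed)."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].str.decode("utf-8")
    return out


clean = decode_bytes_columns(sample, ["movie_id", "movie_title", "user_id"])

# Interpret the integers as seconds since the Unix epoch, in UTC.
clean["rated_at"] = pd.to_datetime(clean["timestamp"], unit="s", utc=True)

print(clean[["movie_title", "user_rating", "rated_at"]])
```

On the real data, the same two steps should apply to `ratings` directly. Once `rated_at` is a datetime column, `clean["rated_at"].dt.year` gives one simple time feature to experiment with.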


For your reference, here is the mapping from the indices in the `movie_genres` column above to their genre names:
- 0 = Action
- 1 = Adventure
- 2 = Animation
- 3 = Children's
- 4 = Comedy
- 5 = Crime
- 6 = Documentary
- 7 = Drama
- 8 = Fantasy
- 9 = Film-Noir
- 10 = Horror
- 11 = IMAX
- 12 = Musical
- 13 = Mystery
- 14 = Romance
- 15 = Sci-Fi
- 16 = Thriller
- 17 = ??
- 18 = War
- 19 = Western
- 20 = (no genres listed)
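To make the genre column readable, the mapping above can be turned into a lookup. The following is only a sketch: `GENRE_NAMES` and `genre_names` are hypothetical names, and index 17 is kept as `"??"` because the list above does not identify it. The two sample rows are copied from the `movies` preview.

```python
import pandas as pd

# Index -> name, copied from the list above. Position in the list is the index.
# Index 17 stays "??" because it is not identified in this notebook.
GENRE_NAMES = [
    "Action", "Adventure", "Animation", "Children's", "Comedy", "Crime",
    "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror", "IMAX",
    "Musical", "Mystery", "Romance", "Sci-Fi", "Thriller", "??", "War",
    "Western", "(no genres listed)",
]


def genre_names(indices):
    """Translate a sequence of genre indices into a list of genre names."""
    return [GENRE_NAMES[i] for i in indices]


# Two rows copied from the `movies` preview.
sample = pd.DataFrame({
    "movie_title": [b"Bronson (2009)", b"Cat Ballou (1965)"],
    "movie_genres": [[0, 4, 7, 16], [4, 19]],
})
sample["genre_names"] = sample["movie_genres"].apply(genre_names)

print(sample[["movie_title", "genre_names"]])
```

The same `apply` call should also work when each cell holds a NumPy array instead of a list, since NumPy integers can index a Python list.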

## Suggested data visualization steps
Our main goal for this first work session is to understand the dataset and get comfortable manipulating it, since we will be working with it extensively later on.
- Take a look at the distribution of years among movies in our dataset. What is the most recent movie? The oldest?
- Visualize the genres of the movies represented in the dataset. Are any more represented than others?
- Pick any movie in the dataset, ideally one you have watched, and look at the ratings users have given it. Do these ratings make sense to you?
- Can you find your favorite movies in here?

**KEY**: Make plots to visualize your results - a picture is worth a thousand words!
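As a starting point for the genre question, here is a sketch of counting genres and plotting the counts. The function names `genre_counts` and `plot_genre_counts` are hypothetical, and the sample rows are copied from the `movies` preview (titles already decoded to `str`). On the real data you would pass `movies` instead.

```python
import pandas as pd

# Three rows copied from the `movies` preview.
movies_sample = pd.DataFrame({
    "movie_title": [
        "One Crazy Summer (1986)",
        "Trail of the Pink Panther (1982)",
        "Cat Ballou (1965)",
    ],
    "movie_genres": [[4], [4, 5], [4, 19]],
})


def genre_counts(df, genre_col="movie_genres"):
    """Count how many movies carry each genre index."""
    # explode() gives one row per (movie, genre) pair, so a movie with
    # several genres is counted once for each of them.
    return df[genre_col].explode().value_counts().sort_index()


def plot_genre_counts(counts):
    """Draw a bar plot of genre counts and return the matplotlib Axes."""
    # Imported here so the counting step also works without a plotting backend.
    import seaborn as sns

    ax = sns.barplot(x=counts.index.astype(str), y=counts.values)
    ax.set(xlabel="genre index", ylabel="number of movies")
    return ax


counts = genre_counts(movies_sample)
print(counts)
# In the notebook, draw the plot with: plot_genre_counts(counts)
```

For the single-movie question, a similar approach is to filter `ratings` by `movie_id` and pass the `user_rating` column to `sns.histplot`.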

## Suggested data cleaning steps
If you want to get started with data cleaning, here are some recommended steps:
- Separate Movie Year from the Movie Name
- Construct a table of ratings per movie (or movies per rating). This will help with some of the data cleaning planned for next week.
- [One-Hot Encode](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/) genre data
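Two of the steps above, splitting the year out of the title and encoding genres, can be sketched as follows. This is one possible approach, not a required one. `split_title_year`, `one_hot_genres`, and the `genre_<i>` column names are hypothetical, titles are assumed to be decoded to `str` already, and the count of 21 genres comes from the mapping listed earlier. Because a movie can have several genres, a row may contain more than one 1.

```python
import numpy as np
import pandas as pd

# Three rows copied from the `movies` preview, titles decoded to str.
movies_sample = pd.DataFrame({
    "movie_title": ["One Crazy Summer (1986)", "Claim, The (2000)", "Thirst (Bakjwi) (2009)"],
    "movie_genres": [[4], [14, 19], [7, 10]],
})


def split_title_year(df, title_col="movie_title"):
    """Add `title` and `year` columns parsed from 'Name (YYYY)' strings."""
    out = df.copy()
    # Only a four-digit group in parentheses at the very end counts as the year,
    # so "(Bakjwi)" in the middle of a title is left alone.
    parts = out[title_col].str.extract(r"^(?P<title>.*?)\s*\((?P<year>\d{4})\)\s*$")
    out["title"] = parts["title"].fillna(out[title_col])  # keep titles with no year
    out["year"] = pd.to_numeric(parts["year"]).astype("Int64")  # <NA> if no year
    return out


def one_hot_genres(df, genre_col="movie_genres", n_genres=21):
    """Return a 0/1 DataFrame with one column per genre index."""
    encoded = np.zeros((len(df), n_genres), dtype=int)
    for row, genres in enumerate(df[genre_col]):
        encoded[row, np.asarray(genres, dtype=int)] = 1
    columns = [f"genre_{i}" for i in range(n_genres)]
    return pd.DataFrame(encoded, columns=columns, index=df.index)


cleaned = split_title_year(movies_sample)
genres = one_hot_genres(movies_sample)

print(cleaned[["title", "year"]])
print(genres.loc[:, genres.sum() > 0])  # show only genres that occur in the sample
```

Using the nullable `Int64` dtype for `year` keeps the column integer-valued even if some titles turn out to have no year.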

If you write any nontrivial code to process the data, try to generalize the processing into functions/scripts. This is good practice in writing extensible data cleaning code.
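As an example of packaging a cleaning step into reusable functions, here is a sketch of two ways to build a ratings-per-movie table. The function names are hypothetical, and the toy data is made up purely for illustration (the IDs `u1`, `m1`, and so on do not come from the dataset). The column names match the `ratings` DataFrame.

```python
import pandas as pd

# Made-up toy data, NOT from MovieLens; only the column names match `ratings`.
toy_ratings = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3"],
    "movie_id": ["m1", "m2", "m1", "m1"],
    "user_rating": [4.0, 3.0, 5.0, 3.0],
})


def ratings_per_movie(ratings):
    """Summarize each movie by its number of ratings and its mean rating."""
    return (
        ratings.groupby("movie_id")["user_rating"]
        .agg(n_ratings="count", mean_rating="mean")
        .sort_values("n_ratings", ascending=False)
    )


def user_movie_matrix(ratings):
    """Build a users x movies table of ratings; unrated pairs are NaN."""
    return ratings.pivot_table(index="user_id", columns="movie_id", values="user_rating")


summary = ratings_per_movie(toy_ratings)
matrix = user_movie_matrix(toy_ratings)

print(summary)
print(matrix)
```

Keeping each step in a small function like this makes it easy to rerun the same cleaning on the full dataset, or to move it into a script later.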