# Data Cleaning/Visualization Starter

This notebook includes starter functionality to help you begin cleaning the data and looking at some interesting features. The goal is to familiarize yourself with the dataset so that working with it is easier once model development starts.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
sns.set_theme() # feel free to change this - default seaborn plots look nice but you can play with colors

`movies` is the DataFrame that holds every movie in the MovieLens small dataset (`ml-latest-small`, about 100k ratings). Notice that each movie has:
- a uniquely identifying `movieId` - this will be useful for tracing ratings back to movies
- genre information in `genres` and a release year embedded in `title` - think about how these might be useful to the prediction process

In [3]:
movies = pd.read_csv("../data/ml-latest-small/movies.csv")
print(f"Number of movies: {len(movies)}")
movies.head(10)

Number of movies: 9742


|   | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |
| 5 | 6 | Heat (1995) | Action\|Crime\|Thriller |
| 6 | 7 | Sabrina (1995) | Comedy\|Romance |
| 7 | 8 | Tom and Huck (1995) | Adventure\|Children |
| 8 | 9 | Sudden Death (1995) | Action |
| 9 | 10 | GoldenEye (1995) | Action\|Adventure\|Thriller |


In practice, we will not need to use the `movies` dataset as all the information is available in the `ratings` dataset.

`ratings` is the DataFrame that holds all the ratings given to the movies in `movies`. We will use these ratings to predict how users would rate movies they have not seen.

Some features to note:
- `userId` is the ID assigned to the user who made the rating
- `movieId` is the movie that was rated (you can use this column to map movie names to ratings)
- `rating` is the movie rating on a scale of 0.5 to 5 stars, in half-star increments
- `timestamp` is the time of the rating, in seconds since midnight UTC on January 1, 1970. How might we use it to aid the recommender system?
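As a starting point for the timestamp question, here is a minimal sketch that converts the raw integers into timezone-aware datetimes. It uses the first three rows of the `ratings` output shown in this notebook rather than reading the CSV. The names `ratings_sample`, `rated_at`, `rating_year`, and `rating_dayofweek` are made up for illustration.

```python
import pandas as pd

# First three rows of the ratings output shown in this notebook.
ratings_sample = pd.DataFrame({
    "userId": [1, 1, 1],
    "movieId": [1, 3, 6],
    "rating": [4.0, 4.0, 4.0],
    "timestamp": [964982703, 964981247, 964982224],
})

# Interpret the integers as seconds since the Unix epoch, in UTC.
ratings_sample["rated_at"] = pd.to_datetime(
    ratings_sample["timestamp"], unit="s", utc=True
)

# Hypothetical derived features a recommender might find useful.
ratings_sample["rating_year"] = ratings_sample["rated_at"].dt.year
ratings_sample["rating_dayofweek"] = ratings_sample["rated_at"].dt.dayofweek  # Monday=0

print(ratings_sample[["timestamp", "rated_at", "rating_year"]])
```

Once the column is a datetime, you can sort each user's ratings chronologically or check whether rating behavior drifts over time.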

In [4]:
ratings = pd.read_csv("../data/ml-latest-small/ratings.csv")
print(f"Number of ratings: {len(ratings)}")
ratings.head(10)

Number of ratings: 100836


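|   | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |
| 3 | 1 | 47 | 5.0 | 964983815 |
| 4 | 1 | 50 | 5.0 | 964982931 |
| 5 | 1 | 70 | 3.0 | 964982400 |
| 6 | 1 | 101 | 5.0 | 964980868 |
| 7 | 1 | 110 | 4.0 | 964982176 |
| 8 | 1 | 151 | 5.0 | 964984041 |
| 9 | 1 | 157 | 5.0 | 964984100 |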


## Suggested data visualization steps
Our main goal for this first work session is to understand the dataset and practice manipulating it, since we will be manipulating it extensively in the future.
- Take a look at the distribution of years among movies in our dataset. What is the most recent movie? The oldest?
- Visualize the genres of the movies represented in the dataset. Are any more represented than others?
- Pick any movie in the dataset and look at the ratings users have given it. Do these ratings make sense to you? (Ideally, pick a movie you have already watched.)
- Can you find your favorite movies in here?
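One possible way to find a movie and inspect its ratings is sketched below. It runs on a few rows copied from the outputs above, so each movie has at most one rating here; on the full DataFrames the same calls would return many more. The helper names `find_movies` and `ratings_for_movie` are hypothetical.

```python
import pandas as pd

# A few rows copied from the movies and ratings outputs above.
movies_sample = pd.DataFrame({
    "movieId": [1, 2, 3, 6],
    "title": ["Toy Story (1995)", "Jumanji (1995)",
              "Grumpier Old Men (1995)", "Heat (1995)"],
})
ratings_sample = pd.DataFrame({
    "userId": [1, 1, 1],
    "movieId": [1, 3, 6],
    "rating": [4.0, 4.0, 4.0],
})

def find_movies(movies, query):
    """Return rows whose title contains `query` (case-insensitive, literal text)."""
    mask = movies["title"].str.contains(query, case=False, regex=False)
    return movies[mask]

def ratings_for_movie(ratings, movie_id):
    """Return the Series of ratings given to a single movieId."""
    return ratings.loc[ratings["movieId"] == movie_id, "rating"]

matches = find_movies(movies_sample, "toy story")
toy_story_id = matches["movieId"].iloc[0]
toy_story_ratings = ratings_for_movie(ratings_sample, toy_story_id)

# How many times each star value was given to this movie.
print(toy_story_ratings.value_counts().sort_index())
```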

**KEY**: Make plots to visualize your results - a picture is worth a thousand words!
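For instance, a genre bar chart could be sketched as follows. This is only one of many reasonable plots. It counts genres over the ten movies shown in the `movies` output above, and the names `genres_sample`, `genre_counts`, and `genre_df` are placeholders.

```python
import pandas as pd
import seaborn as sns

# Genre strings of the ten movies shown in the movies output above.
genres_sample = pd.Series([
    "Adventure|Animation|Children|Comedy|Fantasy",
    "Adventure|Children|Fantasy",
    "Comedy|Romance",
    "Comedy|Drama|Romance",
    "Comedy",
    "Action|Crime|Thriller",
    "Comedy|Romance",
    "Adventure|Children",
    "Action",
    "Action|Adventure|Thriller",
], name="genres")

# Split each pipe-delimited string into a list, give every genre
# its own row, then count how often each genre appears.
genre_counts = genres_sample.str.split("|").explode().value_counts()

# Tidy two-column frame (genre, count) for plotting.
genre_df = genre_counts.rename_axis("genre").reset_index(name="count")

ax = sns.barplot(data=genre_df, x="count", y="genre")
ax.set(xlabel="Number of movies", ylabel="Genre",
       title="Genre counts (10-movie sample)")
```

Replacing `genres_sample` with `movies["genres"]` would give the counts for the whole dataset.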

## Suggested data cleaning steps
If you want to get started on data cleaning, here are some recommended steps:
- Separate the movie year from the movie title
- Construct a table of ratings per movie (or vice versa). This will help with the data cleaning planned for next week.
- [One-Hot Encode](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/) genre data
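The year-separation and one-hot encoding steps above might look roughly like this in pandas. The sketch assumes every title ends with a four-digit year in parentheses, as in the rows shown earlier; titles that do not match get a missing year instead of an error. The column names `year` and `clean_title` are illustrative.

```python
import pandas as pd

# First three rows of the movies output shown above.
movies_sample = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": ["Adventure|Animation|Children|Comedy|Fantasy",
               "Adventure|Children|Fantasy",
               "Comedy|Romance"],
})

# Capture a four-digit year in parentheses at the end of the title.
year = movies_sample["title"].str.extract(r"\((\d{4})\)\s*$", expand=False)
# Nullable integer dtype, so titles without a year become <NA>.
movies_sample["year"] = pd.to_numeric(year).astype("Int64")

# Remove the trailing "(YYYY)" to leave just the name.
movies_sample["clean_title"] = movies_sample["title"].str.replace(
    r"\s*\(\d{4}\)\s*$", "", regex=True
)

# One-hot encode: one 0/1 column per genre found in the pipe-delimited strings.
genre_dummies = movies_sample["genres"].str.get_dummies(sep="|")
movies_clean = pd.concat(
    [movies_sample.drop(columns="genres"), genre_dummies], axis=1
)

print(movies_clean)
```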

If you write any nontrivial code to process the data, try to generalize the processing into functions/scripts. This is good practice in writing extensible data cleaning code.
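As an example of wrapping a processing step in a function, here is a minimal sketch of a ratings-per-movie summary. The function name `ratings_per_movie` and the output columns `num_ratings` and `mean_rating` are hypothetical. The toy rows are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

def ratings_per_movie(ratings: pd.DataFrame) -> pd.DataFrame:
    """Return one row per movieId with its rating count and mean rating."""
    return (
        ratings.groupby("movieId")["rating"]
        .agg(num_ratings="count", mean_rating="mean")
        .reset_index()
    )

# Toy rows invented for illustration (NOT real MovieLens ratings).
toy_ratings = pd.DataFrame({
    "userId": [1, 2, 1],
    "movieId": [1, 1, 3],
    "rating": [4.0, 5.0, 3.0],
})

summary = ratings_per_movie(toy_ratings)
print(summary)
```

Because the function takes the DataFrame as an argument, the same code can be reused on the full `ratings` table or on any filtered subset of it.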