# Data Cleaning/Visualization Starter

This notebook provides starter code to help you begin cleaning the data and exploring some of its interesting features. The goal is to familiarize yourself with the dataset so that working with it is easier once model development starts.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
sns.set_theme() # feel free to change this - default seaborn plots look nice but you can play with colors

`movies` is the DataFrame which will hold all the movies in the MovieLens `ml-latest-small` dataset (roughly 100k ratings). Notice how each movie has:
- a uniquely identifying `movieId` - this will be useful for tracing ratings to movies
- genre and year information - think about how this might be useful to the prediction process

In [2]:
movies = pd.read_csv("../data/ml-latest-small/movies.csv")
movies.head(10)

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller


`ratings` is the DataFrame which will hold all the ratings given for the movies in `movies`. We will use these ratings to predict the ratings users would give to movies they have not seen. Some features to note:
- `userId` is the ID assigned to the user who made the rating
- `movieId` is the movie that was rated (you can use this column to map movie names to ratings)
- `rating` is the movie's rating on a scale of 0.5 to 5 stars, in half-star increments
- `timestamp` is the time of the rating, in seconds since midnight UTC on January 1, 1970 (Unix time). How might we use it to aid the recommender system?

In [3]:
ratings = pd.read_csv("../data/ml-latest-small/ratings.csv")
ratings.head(10)

,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
5,1,70,3.0,964982400
6,1,101,5.0,964980868
7,1,110,4.0,964982176
8,1,151,5.0,964984041
9,1,157,5.0,964984100
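
The two notes above about `movieId` and `timestamp` can be sketched as follows. This is a minimal illustration rather than the intended solution: it uses small stand-in frames copied from the first rows shown above, and the names `movies_demo`, `ratings_demo`, `rated_at` and `rated_titles` are made up for the example.

```python
import pandas as pd

# Stand-in frames built from the first rows shown above (illustrative names).
movies_demo = pd.DataFrame({
    "movieId": [1, 3, 6],
    "title": ["Toy Story (1995)", "Grumpier Old Men (1995)", "Heat (1995)"],
})
ratings_demo = pd.DataFrame({
    "userId": [1, 1, 1],
    "movieId": [1, 3, 6],
    "rating": [4.0, 4.0, 4.0],
    "timestamp": [964982703, 964981247, 964982224],
})

# Interpret `timestamp` as seconds since the Unix epoch and keep it in UTC.
ratings_demo["rated_at"] = pd.to_datetime(
    ratings_demo["timestamp"], unit="s", utc=True
)

# Left-join the titles onto the ratings through the shared `movieId` key.
rated_titles = ratings_demo.merge(movies_demo, on="movieId", how="left")
print(rated_titles[["userId", "title", "rating", "rated_at"]])
```

With real datetimes in place, features such as the year or weekday of a rating are available through the `.dt` accessor, for example `rated_titles["rated_at"].dt.year`.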


`tags` holds free-text labels that users have attached to movies. Again, think about how these tags might provide additional information to the recommender system.

In [4]:
tags = pd.read_csv("../data/ml-latest-small/tags.csv")
tags.head(10)

,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200
5,2,89774,Tom Hardy,1445715205
6,2,106782,drugs,1445715054
7,2,106782,Leonardo DiCaprio,1445715051
8,2,106782,Martin Scorsese,1445715056
9,7,48516,way too long,1169687325
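
One possible way to make the tags usable, sketched here under assumptions: normalise the text, then collapse all tags for a movie into a single string that could later feed a text-based feature. The frame `tags_demo` is a stand-in built from rows shown above, and `tag_clean` and `tags_per_movie` are names invented for this example.

```python
import pandas as pd

# Stand-in frame built from a few of the rows shown above (illustrative name).
tags_demo = pd.DataFrame({
    "userId": [2, 2, 2, 2],
    "movieId": [60756, 60756, 60756, 89774],
    "tag": ["funny", "Highly quotable", "will ferrell", "MMA"],
})

# Normalise case and whitespace so that "MMA" and "mma" count as the same tag.
tags_demo["tag_clean"] = tags_demo["tag"].str.strip().str.lower()

# Collapse each movie's distinct tags into one comma-separated string.
tags_per_movie = (
    tags_demo.groupby("movieId")["tag_clean"]
    .agg(lambda s: ", ".join(sorted(set(s))))
    .rename("tags")
    .reset_index()
)
print(tags_per_movie)
```

Whether lowercasing and de-duplicating is the right choice depends on how the tags are used later; counting repeated tags instead of dropping them would keep a signal about how often users agree on a label.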


"links.csv" provides useful information to map the entries in this dataset to other movie databases like IMDb or TMDb. It will likely not be of much use to us, but if you decide to play around more with this dataset this would be useful information. 

Some suggested data cleaning steps:
- Separate the release year from the movie title (the `title` column embeds it, e.g. `Toy Story (1995)`)
- [One-Hot Encode](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/) genre data
- Generate an $n\times m$ utility matrix, where each of the $n$ rows is a user, each of the $m$ columns is a movie, and each entry is the rating that user gave that movie (empty where no rating exists). We will eventually have to do this, so getting a head start won't hurt.
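
The three steps above could be sketched as small functions like the ones below. This is one possible approach, not the required one. The function names (`split_title_year`, `one_hot_genres`, `build_utility_matrix`) and the toy frames are invented for this example, and the code assumes that the year appears as a trailing `(YYYY)` in `title`, that genres are separated by `|`, and that each user rates a given movie at most once.

```python
import pandas as pd


def split_title_year(movies: pd.DataFrame) -> pd.DataFrame:
    """Move a trailing "(YYYY)" from `title` into a new integer `year` column."""
    out = movies.copy()
    # Titles without a trailing year end up with a missing value in `year`.
    year = out["title"].str.extract(r"\((\d{4})\)\s*$", expand=False)
    out["year"] = pd.to_numeric(year).astype("Int64")
    out["title"] = out["title"].str.replace(r"\s*\(\d{4}\)\s*$", "", regex=True)
    return out


def one_hot_genres(movies: pd.DataFrame) -> pd.DataFrame:
    """Replace the pipe-separated `genres` column with one indicator column per genre."""
    dummies = movies["genres"].str.get_dummies(sep="|")
    return pd.concat([movies.drop(columns="genres"), dummies], axis=1)


def build_utility_matrix(ratings: pd.DataFrame) -> pd.DataFrame:
    """Rows are users, columns are movies, entries are ratings (NaN if unrated)."""
    # `pivot` raises if a (userId, movieId) pair appears more than once.
    return ratings.pivot(index="userId", columns="movieId", values="rating")


# Toy data: the movie rows come from the table above, the ratings of user 2 are made up.
movies_demo = pd.DataFrame({
    "movieId": [1, 3, 6],
    "title": ["Toy Story (1995)", "Grumpier Old Men (1995)", "Heat (1995)"],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Comedy|Romance",
        "Action|Crime|Thriller",
    ],
})
ratings_demo = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [1, 3, 1],
    "rating": [4.0, 4.0, 3.5],
})

cleaned = split_title_year(movies_demo)
encoded = one_hot_genres(cleaned)
utility = build_utility_matrix(ratings_demo)
print(encoded.head())
print(utility)
```

Keeping each step as a function that takes a DataFrame and returns a new one makes the steps easy to chain and to rerun on the full dataset.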

If you write any nontrivial code to process the data, try to generalize the processing into functions/scripts. This is good practice in writing extensible data cleaning code.

Some suggested data visualization steps:
- Take a look at the distribution of years among movies in our dataset. What is the most recent movie? The oldest?
- Visualize the genres of the movies represented in the dataset. Are some genres more represented than others?
- Pick any movie in the dataset and look at the ratings users have given it. Do these ratings make sense to you? (It helps to pick a movie you have watched.)
- Can you find your favorite movies in here?

**KEY**: Make plots to visualize your results - a picture is worth a thousand words!