# Data Wrangling in Python  
*__[Pandas](https://pandas.pydata.org/)__ with the __MovieLens__ dataset*  

**Part 2: Playing with the Movies and Ratings data**

### <font color='green'>__Support for Google Colab__  </font>  
    
open this notebook in Colab using the following button:  
  
<a href="https://colab.research.google.com/github/shauryashaurya/learn-data-munging/blob/main/02-Pandas/02.01-Data-Wrangling-with-MovieLens-and-Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>  

  
<font color='green'>uncomment and execute the cell below to setup and run this notebook on Google Colab.</font>

In [1]:
# # SETUP FOR COLAB: select all the lines below and uncomment (CTRL+/ on windows)
# # Let's download and unzip the Small MovieLens Dataset
# ! mkdir ./../data
# ! wget -q https://files.grouplens.org/datasets/movielens/ml-latest-small.zip
# ! unzip ./ml-latest-small.zip -d ./../data/

### Get the _Small_ MovieLens Dataset

We'll use the [small MovieLens dataset](https://grouplens.org/datasets/movielens/#:~:text=Small%3A%20100%2C000%20ratings%20and%203%2C600%20tag%20applications) here.

Download it and unzip to the data folder under the name `ml-latest-small`.

This dataset expands to about 3.2 MB on your local disk. 

# Locate the data

In [2]:
datalocation = "./../data/ml-latest-small/"

In [3]:
# specify file names
file_path_movies = datalocation + "movies.csv"
file_path_links = datalocation + "links.csv"
file_path_ratings = datalocation + "ratings.csv"
file_path_tags = datalocation + "tags.csv"

# Setup Pandas and Numpy

In [4]:
import numpy as np
import pandas as pd

print("numpy version: ", np.__version__)
print("pandas version: ", pd.__version__)

numpy version:  1.26.0
pandas version:  2.1.1


# Load the dataset(s)

From the ```README.txt``` file in the small MovieLens dataset:
The dataset files are written as [**comma-separated values**](http://en.wikipedia.org/wiki/Comma-separated_values) files with a **single header row**. Columns that contain commas (`,`) are **escaped using double-quotes (`"`)**. These files are encoded as **UTF-8**. If accented characters in movie titles or tag values (e.g. Misérables, Les (1995)) display incorrectly, make sure that any program reading the data, such as a text editor, terminal, or script, is configured for UTF-8.

So, we specify:
* Separator - ```,```
* Escape Character - ```"```
* Encoding - ```UTF-8```  
  
We saw in the last notebook that what the README file really meant was that the **Quote Character** is ```"```, so additionally:  
* Quote Character - ```"```

In [1]:
csv_separator = ","
csv_escapechar = '"'
csv_encoding = "utf-8"
csv_quotechar = csv_escapechar

## Movies

Let's specify the [-  ```dtypes```  ](https://pandas.pydata.org/docs/user_guide/basics.html#dtypes) of each of the columns in the movies file. 

In [6]:
# schema, inferred from the README.txt file
movies_schema = {"movieId": "Int32", "title": "string", "genres": "string"}

In [9]:
movies = pd.read_csv(
    file_path_movies,
    dtype=movies_schema,
    sep=csv_separator,
    quotechar=csv_quotechar,
    encoding=csv_encoding,
)

In [10]:
# show the first 15 lines
movies.head(5)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller


In [11]:
# data types of each column
movies.dtypes

movieId             Int32
title      string[python]
genres     string[python]
dtype: object

## Ratings

Reading through the ```README``` file:  
Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).  
Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.  

In [16]:
# schema, inferred from the README.txt file
# read timestamps as integers then convert to dates later.
ratings_schema = {
    "userId": "Int32",
    "movieId": "Int32",
    "rating": "Float32",
    "timestamp": "Int64",
}
#

In [17]:
ratings = pd.read_csv(
    file_path_ratings,
    dtype=ratings_schema,
    sep=csv_separator,
    quotechar=csv_quotechar,
    encoding=csv_encoding,
)

# now let's add a datetime column that we derive from the raw timestamp
ratings["datetime"] = pd.to_datetime(ratings["timestamp"], unit="s", utc=True)
ratings["date"] = pd.to_datetime(ratings["datetime"].dt.date)
ratings["day"] = ratings["date"].dt.day
ratings["month"] = ratings["date"].dt.month
ratings["year"] = ratings["date"].dt.year

In [18]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [19]:
# now let's add a datetime column that we derive from the raw timestamp
ratings["datetime"] = pd.to_datetime(ratings["timestamp"], unit="s", utc=True)

In [20]:
ratings.dtypes

userId                    Int32
movieId                   Int32
rating                  Float32
timestamp                 Int64
datetime     datetime64[s, UTC]
dtype: object

let's [extract the dates](https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.date.html#pandas-series-dt-date) into a new column

In [22]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp,datetime,date
0,1,1,4.0,964982703,2000-07-30 18:45:03+00:00,2000-07-30
1,1,3,4.0,964981247,2000-07-30 18:20:47+00:00,2000-07-30
2,1,6,4.0,964982224,2000-07-30 18:37:04+00:00,2000-07-30
3,1,47,5.0,964983815,2000-07-30 19:03:35+00:00,2000-07-30
4,1,50,5.0,964982931,2000-07-30 18:48:51+00:00,2000-07-30


# Next

* Let's play with the MovieLens dataset some more.