# Pandas Example: MovieLens Data

**Outcomes**

- Learn how to download a file from the internet using the [requests](https://requests.readthedocs.io/en/master/) library
- Know how to operate on a `.zip` file in memory, without writing to hard drive
- Practice merging datasets

**Note: requires internet access to run.**  

This Jupyter notebook was originally created in 2016 by Dave Backus, Chase Coleman, Brian LeBlanc, and Spencer Lyon for the NYU Stern course [Data Bootcamp](http://databootcamp.nyuecon.com/).

The notebook has been modified for this course

In [2]:
%matplotlib inline 

import pandas as pd             # data package
import matplotlib.pyplot as plt # graphics 
import datetime as dt           # date tools, used to note current date  

# these are new 
import os                       # operating system tools (check files)
import requests, io             # internet and input tools  
import zipfile as zf            # zip file tools 
import shutil                   # file management tools

<a id=movielens></a>

## MovieLens data 

The [GroupLens](https://grouplens.org/) team at the University of Minnesota has prepared many datasets

One is called [MovieLens](https://grouplens.org/datasets/movielens/), and contains millions of movie ratings by viewers and users of the MovieLens website

We will use a small subset of the data with 100,000 ratings

This data comes in a `.zip` file that contains a handful of csv's and a readme

Here are some details about the zipped files:

* `ratings.csv`:  each line is an individual film rating with the rater and movie id's and the rating.  Order:  `userId, movieId, rating, timestamp`. 
* `tags.csv`:  each line is a tag on a specific film.  Order:  `userId, movieId, tag, timestamp`. 
* `movies.csv`:  each line is a movie name, its id, and its genre.  Order:  `movieId, title, genres`.  Multiple genres are separated by "pipes" `|`.   
* `links.csv`:  each line contains the movie id and corresponding id's at [IMDb](http://www.imdb.com/) and [TMDb](https://www.themoviedb.org/).  

We are interested in the `ratings.csv` and `movies.csv` files as pandas DataFrames

There are at least two approaches to doing this:

1. Download the file to the hard drive, unzip manually, then come back and use `pd.read_csv`
2. Have Python download the file, learn to use Python to handle zip files, then use `pd.read_csv`

The first option is likely easier, but the second is more powerful

We will choose option 2 here as it gives us a chance to learn more "real-world" data+Python skills

**Why do it the hard way?**

- It builds character
- Entire analysis can be self-contained in our notebook, no external *by hand* steps  needed
- Can be applied to other datasets, as we'll surely see a zip file again in the future

<a id=requests></a>

## Automated file download 

**WANT:** create pandas DataFrames from the `ratings.csv` and `movies.csv` files in the zipfile on the GroupLens website

Let's unpack what steps need to happen:

1. Download the file: we'll use the [requests](http://docs.python-requests.org/) package
2. Unpack raw downloaded content as file: using built-in [io](https://docs.python.org/3.5/library/io.html) module's `io.BytesIO`
3. Access csv files inside the zip file: using built-in [zipfile](https://docs.python.org/3.5/library/zipfile.html) module to read csv's from zip
4. Read in csvs: easy part -- using `pd.read_csv`

### Digression 1

The `requests` documentation states

>Recreational use of other HTTP libraries may result in dangerous side-effects, including: security vulnerabilities, verbose code, reinventing the wheel, constantly reading documentation, depression, headaches, or even death.

We whole-heartedly agree

### Digression 2

We found this [Stack Overflow exchange](http://stackoverflow.com/questions/23419322/download-a-zip-file-and-extract-it-in-memory-using-python3) helpful when creating this solution

### Step 1: Download the file

Let's get to it!

Here we download the file and print out some information about the response object

In [13]:
#  Step 1 -- download the file 
url = 'http://files.grouplens.org/datasets/movielens/ml-latest-small.zip'
res = requests.get(url) 

# (sub-step, see what the response looks like)
print('Response status code:', res.status_code)
print('Response type:', type(res))
print('Response .content:', type(res.content)) 
print('Response headers:\n', res.headers, sep='')

Response status code: 200
Response type: <class 'requests.models.Response'>
Response .content: <class 'bytes'>
Response headers:
{'Date': 'Tue, 06 Oct 2020 23:32:04 GMT', 'Server': 'Apache/2.4.6 (Red Hat Enterprise Linux) OpenSSL/1.0.2k-fips', 'Last-Modified': 'Tue, 03 Dec 2019 17:14:18 GMT', 'ETag': '"eed1a-598cfd387dc5a"', 'Accept-Ranges': 'bytes', 'Content-Length': '978202', 'Keep-Alive': 'timeout=5, max=100', 'Connection': 'Keep-Alive', 'Content-Type': 'application/zip'}


### Step 2: read file as bytes

A zip file is a binary file format (not plain text)

For this reason we need to treat the contents of the file as bytes

We'll load `res.content` into an instance of `io.BytesIO`

In [14]:
# Step 2 -- wrap the bytes of the response content in a file-like object
bytes = io.BytesIO(res.content)

### Step 3: Interpret bytes as ZipFile

Now we have downloaded the file and interpreted it as a binary source (`BytesIO`)

This is not just any binary file, but rather a zip compressed file

Python knows how to handle these using the built-in `zipfile` module

Below we tell Python to interpret the `BytesIO` as a ZipFile

In [15]:
# Step 3 -- Interpret bytes as zipfile
zip = zf.ZipFile(bytes)
print('Type of zipfile object:', type(zip))

Type of zipfile object: <class 'zipfile.ZipFile'>
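
To try the `BytesIO` and `ZipFile` steps without a network connection, here is a minimal sketch that builds a toy archive in memory and reads it back. The archive contents and the names `buffer`, `raw`, and `toy_zip` are made up for illustration; `raw` plays the role of `res.content`.

```python
import io
import zipfile

# Build a toy zip archive entirely in memory (a stand-in for res.content)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w") as archive:
    archive.writestr("toy/movies.csv", "movieId,title\n1,Toy Story (1995)\n")
    archive.writestr("toy/README.txt", "a toy archive")
raw = buffer.getvalue()  # plain bytes, like res.content

# Same two steps as in the notebook: bytes -> BytesIO -> ZipFile
toy_zip = zipfile.ZipFile(io.BytesIO(raw))
toy_names = toy_zip.namelist()
first_line = toy_zip.read("toy/movies.csv").decode("utf-8").splitlines()[0]

print(type(raw))
print(toy_names)
print(first_line)
```

Nothing is written to the hard drive at any point: both the archive and the file inside it live only in memory.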


Now that we have a ZipFile, we need to explore what is inside

The `ZipFile` class has a handy method named `namelist` 

This method lists all folders and files in the zip archive

Let's take a look

In [16]:
# (sub-step, inspect the file)
names = zip.namelist()
names

['ml-latest-small/',
 'ml-latest-small/links.csv',
 'ml-latest-small/tags.csv',
 'ml-latest-small/ratings.csv',
 'ml-latest-small/README.txt',
 'ml-latest-small/movies.csv']

Notice that our target `ratings.csv` and `movies.csv` files are there

However, they are in a folder named `ml-latest-small`

We could write out the "path" to those files by hand, but instead we'll let Python find them for us

In [17]:
movie_fn = [n for n in names if "movies" in n][0]
ratings_fn = [n for n in names if "ratings" in n][0]

print("The path to movies.csv is:", movie_fn)

The path to movies.csv is: ml-latest-small/movies.csv
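
The substring test above works for this archive, but it would also match a name such as `old_movies_backup.csv`. A slightly stricter lookup, sketched below, compares the final path component and complains if there is not exactly one match. The helper `find_member` is hypothetical (it is not part of `zipfile`), and `example_names` simply repeats the listing shown earlier.

```python
import posixpath


def find_member(names, filename):
    """Return the single archive path whose last component equals filename."""
    matches = [n for n in names if posixpath.basename(n) == filename]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one {filename!r}, found {matches}")
    return matches[0]


# Copy of the namelist output shown earlier
example_names = [
    "ml-latest-small/",
    "ml-latest-small/links.csv",
    "ml-latest-small/tags.csv",
    "ml-latest-small/ratings.csv",
    "ml-latest-small/README.txt",
    "ml-latest-small/movies.csv",
]

movies_path = find_member(example_names, "movies.csv")
ratings_path = find_member(example_names, "ratings.csv")
print(movies_path)
print(ratings_path)
```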


### Step 4: `pd.read_csv` the Files

Now that we have the ZipFile **and** the path to our csvs, let's read them in as DataFrames

For this we'll use our friend `pd.read_csv`

We need to call the `.open` method on our zipfile

This method expects the path to the file we want to open; we saved these paths in `movie_fn` and `ratings_fn` above

In [26]:
# extract and read csv's
movies  = pd.read_csv(zip.open(movie_fn))
ratings = pd.read_csv(zip.open(ratings_fn))
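
The object returned by `.open` is file-like, which is why `pd.read_csv` accepts it directly. As a sketch of the same pattern with explicit cleanup, the version below uses `with` blocks so that both the archive and the member file are closed once the DataFrame is built. The toy archive and the name `toy_ratings` are assumptions for illustration, not the MovieLens data.

```python
import io
import zipfile

import pandas as pd

# Toy archive standing in for the downloaded MovieLens zip
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w") as archive:
    archive.writestr(
        "toy/ratings.csv",
        "userId,movieId,rating,timestamp\n"
        "1,1,4.0,964982703\n"
        "1,3,4.0,964981247\n",
    )

# Open the archive, open one member, and hand the file object to pandas
with zipfile.ZipFile(io.BytesIO(buffer.getvalue())) as toy_zip:
    with toy_zip.open("toy/ratings.csv") as csv_file:
        toy_ratings = pd.read_csv(csv_file)

print(toy_ratings.dtypes)
print(toy_ratings.shape)
```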

Let's take a look at the data

In [36]:
movies.info()
movies.head(3)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance


In [37]:
ratings.info()
ratings.head(3)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224


**Exercise.** Something to do together.  Suppose we wanted to save the files on our computer.  How would we do it? Would we prefer individual csv's or a single zip?

In [28]:
# writing csv (specify different location)
with open('test_01.csv', 'wb') as out_file:
    shutil.copyfileobj(zip.open(movie_fn), out_file)

In [29]:
# experiment via http://stackoverflow.com/a/18043472/804513
with open('test.zip', 'wb') as out_file:
    shutil.copyfileobj(io.BytesIO(res.content), out_file)   
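
Another way to approach the exercise, sketched here under the assumption that a toy archive stands in for `res.content`: write the raw bytes unchanged to keep a single zip, or call `ZipFile.extractall` to unpack individual csv's. A temporary directory is used so the example leaves nothing behind.

```python
import io
import os
import tempfile
import zipfile

# Toy archive standing in for res.content
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w") as archive:
    archive.writestr("toy/movies.csv", "movieId,title\n1,Toy Story (1995)\n")
raw = buffer.getvalue()

with tempfile.TemporaryDirectory() as tmp_dir:
    # Option A: keep a single zip by writing the raw bytes unchanged
    zip_path = os.path.join(tmp_dir, "toy.zip")
    with open(zip_path, "wb") as out_file:
        out_file.write(raw)

    # Option B: unpack the individual files, keeping the folder structure
    with zipfile.ZipFile(io.BytesIO(raw)) as toy_zip:
        toy_zip.extractall(tmp_dir)

    saved = sorted(os.listdir(tmp_dir))
    csv_exists = os.path.exists(os.path.join(tmp_dir, "toy", "movies.csv"))

print(saved)
print(csv_exists)
```

The single zip is smaller and keeps the files together; individual csv's are easier to open in other tools.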

<a id=merge-movies></a>
## Merging ratings and movie titles

The movie ratings in the dataframe `ratings` give us individual opinions about movies, but they don't include the name of the movie.  

**Why not?**

### Normalization

Storing movie names in the ratings DataFrame would cause each movie name to be repeated many times

The string "Grumpier Old Men (1995)" takes up more space in a file than the integer `3`

So, the GroupLens team decided to store each movie name once in `movies.csv`, alongside an integer `movieId` column

In other files, they can use just the `movieId` column

This is an example of a relational database (think SQL) technique known as [normalization](https://en.wikipedia.org/wiki/Database_normalization)

**Why normalize?**

There are many benefits to normalization, but two that immediately come to mind are:

1. Save storage space: movie title (and genres in this example) are not repeated for each rating
2. Easier to maintain/update

For the second point, suppose the GroupLens team wanted to include the movie's director

In the current, normalized format they would add a `director` column to `movies.csv` and have one row to update per movie

If they instead chose to put the movie title inside `ratings.csv`, they would have to find all occurrences of each movie and add the director to every one
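
The maintenance argument can be made concrete with a toy sketch. Everything below is illustrative: the ratings are invented, the director values are placeholder strings rather than real data, and the example leans on `merge`, which the next section covers.

```python
import pandas as pd

toy_movies = pd.DataFrame({
    "movieId": [1, 3],
    "title": ["Toy Story (1995)", "Grumpier Old Men (1995)"],
})
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "movieId": [1, 3, 1, 1],
    "rating": [4.0, 4.0, 5.0, 3.5],
})

# Normalized update: one new value per movie (placeholders, not real data)
toy_movies["director"] = ["director A", "director B"]

# The new column reaches every rating through the merge
toy_combo = toy_ratings.merge(toy_movies, on="movieId", how="left")
print(toy_combo)

# Rows edited in each layout
rows_touched_normalized = len(toy_movies)     # one per movie
rows_touched_denormalized = len(toy_ratings)  # one per rating
print(rows_touched_normalized, rows_touched_denormalized)
```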

### Combining ratings and movies

We **want** to be able to analyze the ratings for movies, and associate those ratings with a movie name

We need a way to bring in the movie title information into the `ratings` DataFrame

This is a perfect use case for `merge`!

Let's start with a small example: just the first three rows of `ratings`

In [39]:
ratings.head(3)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224


Now let's see what happens when we `merge` this with the `movies` DataFrame

In [40]:
ratings.head(3).merge(movies, on="movieId")

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,1,3,4.0,964981247,Grumpier Old Men (1995),Comedy|Romance
2,1,6,4.0,964982224,Heat (1995),Action|Crime|Thriller


Here's a breakdown of what happened:

- For each row in `ratings.head(3)` pandas found the `movieId`
- It then looked for row(s) in `movies` that had a matching `movieId`
- It then added the columns `title` and `genres` alongside the existing columns from `ratings` (`userId`, `rating`, `timestamp`) to form the combined DataFrame
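
What happens when a `movieId` has no match depends on the `how` argument. The sketch below uses toy frames (the rating for `movieId` 99 is invented so that one row has no partner) to compare the default inner join with a left join. The `validate` and `indicator` arguments are optional checks offered by `merge`.

```python
import pandas as pd

toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [1, 3, 99],  # 99 has no entry in toy_movies
    "rating": [4.0, 4.0, 2.5],
})
toy_movies = pd.DataFrame({
    "movieId": [1, 3, 6],
    "title": ["Toy Story (1995)", "Grumpier Old Men (1995)", "Heat (1995)"],
})

# Default how="inner": only ratings with a matching movieId survive
inner = toy_ratings.merge(toy_movies, on="movieId")

# how="left": keep every rating; unmatched titles become NaN
left = toy_ratings.merge(
    toy_movies,
    on="movieId",
    how="left",
    validate="many_to_one",  # raise if a movieId repeats in toy_movies
    indicator=True,          # add a _merge column showing where rows matched
)

print(inner.shape)
print(left[["movieId", "title", "_merge"]])
```

With `how="left"` the result always has as many rows as the left DataFrame, provided each key appears at most once on the right.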

Let's now apply this `merge` to the whole `ratings` dataset

In [41]:
combo = pd.merge(ratings, movies,   # left and right df's
                 how='left',        # add to left 
                 on='movieId'       # link with this variable/column 
                ) 

print('Dimensions of ratings:', ratings.shape)
print('Dimensions of movies:', movies.shape)
print('Dimensions of new df:', combo.shape)

combo.head(10)

Dimensions of ratings: (100836, 4)
Dimensions of movies: (9742, 3)
Dimensions of new df: (100836, 6)


Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,1,3,4.0,964981247,Grumpier Old Men (1995),Comedy|Romance
2,1,6,4.0,964982224,Heat (1995),Action|Crime|Thriller
3,1,47,5.0,964983815,Seven (a.k.a. Se7en) (1995),Mystery|Thriller
4,1,50,5.0,964982931,"Usual Suspects, The (1995)",Crime|Mystery|Thriller
5,1,70,3.0,964982400,From Dusk Till Dawn (1996),Action|Comedy|Horror|Thriller
6,1,101,5.0,964980868,Bottle Rocket (1996),Adventure|Comedy|Crime|Romance
7,1,110,4.0,964982176,Braveheart (1995),Action|Drama|War
8,1,151,5.0,964984041,Rob Roy (1995),Action|Drama|Romance|War
9,1,157,5.0,964984100,Canadian Bacon (1995),Comedy|War


Now that we've merged the DataFrames, we can save the result to our own denormalized csv

This could make future analysis easier as we don't have to repeat the merge step

In [50]:
# save as csv file for future use 
combo.to_csv('mlcombined.csv')
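
One detail worth knowing: by default `to_csv` also writes the index, so reading the file back produces an extra `Unnamed: 0` column. The sketch below shows the difference using an in-memory `io.StringIO` buffer in place of a file on disk; the toy DataFrame is for illustration only.

```python
import io

import pandas as pd

toy_combo = pd.DataFrame({
    "userId": [1, 1],
    "movieId": [1, 3],
    "rating": [4.0, 4.0],
})

# Default: the index is written as an unnamed first column
with_index = io.StringIO()
toy_combo.to_csv(with_index)
with_index.seek(0)
reread_default = pd.read_csv(with_index)

# index=False: only the data columns are written
without_index = io.StringIO()
toy_combo.to_csv(without_index, index=False)
without_index.seek(0)
reread_clean = pd.read_csv(without_index)

print(list(reread_default.columns))
print(list(reread_clean.columns))
```

If the index carries no information, as with a default `RangeIndex`, passing `index=False` keeps the saved file tidy.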

In [53]:
print('Current directory:\n', os.getcwd(), sep='')
print('List of files:', os.listdir(), sep='\n')

Current directory:
/home/sglyon/valorum/training/CSS/PHBS20-21/Fall2020/LExtra_pandas_review
List of files:
['test.zip', 'pandas_example_movieLens.ipynb', 'mlcombined.csv', 'test_01.csv', '.ipynb_checkpoints']


**Exercise.** Some of these we know how to do, the others we don't.  For the ones we know, what is the answer?  For the others, what (in loose terms) do we need to be able to do to come up with an answer?  

* What is the overall average rating?  
* What is the overall distribution of ratings?  
* What is the average rating of each movie?  
* How many ratings does each movie get?  

In [None]:
# Your code/idea here -- average rating

In [None]:
# Your code/idea here -- distribution of ratings

In [None]:
# Your code/idea here -- average rating of each movie

In [None]:
# Your code/idea here -- number of ratings for each movie

In [None]:
combo['rating'].mean()

In [None]:
fig, ax = plt.subplots()
bins = [b / 100 for b in range(25, 575, 50)]  # edges 0.25, 0.75, ..., 5.25; avoids shadowing built-in `bin`
print(bins)
combo['rating'].plot(kind='hist', ax=ax, bins=bins, color='blue', alpha=0.5)
ax.set_xlim(0,5.5)
ax.set_ylabel('Number')
ax.set_xlabel('Rating')
plt.show()
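
The histogram bins are centered on the half-star rating values, so each bar is simply a count of one distinct rating. A numeric version of the same distribution can be sketched with `value_counts`; the toy Series below stands in for `combo['rating']`.

```python
import pandas as pd

# Toy ratings standing in for combo['rating']
toy_rating = pd.Series([4.0, 4.0, 5.0, 3.5, 0.5, 5.0, 4.0], name="rating")

# Number of ratings at each value, ordered by rating
counts = toy_rating.value_counts().sort_index()

# Same thing as a share of all ratings
shares = toy_rating.value_counts(normalize=True).sort_index()

print(counts)
print(shares.round(3))
```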

In [None]:
from plotly.offline import iplot             # plotting functions
import plotly.graph_objs as go               # ditto
import plotly

In [None]:
plotly.offline.init_notebook_mode(connected=True)

In [None]:
trace = go.Histogram(
    x=combo['rating'],
    histnorm='',          # empty string (the default) plots raw counts
    name='ratings',
    autobinx=False,
    xbins=dict(
        start=.5,
        end=5.0,
        size=0.5
    ),
    marker=dict(
        color='Blue',
    ),
    opacity=0.75
)

layout = go.Layout(
    title='Distribution of ratings',
    xaxis=dict(
        title='Rating value'
    ),
    yaxis=dict(
        title='Count'
    ),
    bargap=0.01,
    bargroupgap=0.1
)

iplot(go.Figure(data=[trace], layout=layout))

In [None]:
combo[combo['movieId']==31]['rating'].mean()

In [None]:
ave_mov = combo['rating'].groupby(combo['movieId']).mean()

In [None]:
ave_mov = ave_mov.reset_index()

In [None]:
ave_mov = ave_mov.rename(columns={"rating": "average rating"})

In [None]:
combo2 = combo.merge(ave_mov, how='left', on='movieId')

In [None]:
combo2.shape

In [None]:
combo2.head(3)

In [None]:
combo['ave'] = combo['rating'].groupby(combo['movieId']).transform('mean')

In [None]:
combo.head()

In [None]:
combo2[combo2['movieId']==1129]

In [None]:
combo['count'] = combo['rating'].groupby(combo['movieId']).transform('count')

In [None]:
combo.head()
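
The cells above reach the per-movie average by two routes: aggregate and merge back, or `transform`. The toy sketch below (invented ratings, hypothetical variable names) suggests that both give the same column, with `transform` skipping the intermediate table.

```python
import pandas as pd

toy = pd.DataFrame({
    "movieId": [1, 3, 1, 1, 3],
    "rating": [4.0, 4.0, 5.0, 3.0, 2.0],
})

# Route 1: aggregate to one row per movie, then merge back onto the ratings
ave = (
    toy.groupby("movieId")["rating"]
    .mean()
    .rename("average rating")
    .reset_index()
)
via_merge = toy.merge(ave, how="left", on="movieId")

# Route 2: transform returns one value per original row, aligned on the index
via_transform = toy.groupby("movieId")["rating"].transform("mean")
n_ratings = toy.groupby("movieId")["rating"].transform("count")

same = bool((via_merge["average rating"] == via_transform).all())
print(via_merge)
print(same)
```

The merge route is handy when the per-movie table is useful on its own; `transform` is shorter when only the new column is needed.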

## Resources 

The [Pandas docs](http://pandas.pydata.org/pandas-docs/stable/merging.html) are ok, but we prefer the Data Carpentry [guide](http://www.datacarpentry.org/python-ecology-lesson/04-merging-data)