# DataFrames - Pandas

Example analysis of the [MovieLens](https://grouplens.org/datasets/movielens/) data set, demonstrating common data manipulation steps with Pandas.

In [1]:
import datetime
import os
import urllib.request  # the request submodule must be imported explicitly
import zipfile
import pandas as pd

### Download the Movielens data set

In [2]:
# Create the data directory if it does not exist yet
os.makedirs("data", exist_ok=True)

# Download and unpack the archive only if it has not been fetched before
if not os.path.isfile("data/ml-20m.zip"):
    urllib.request.urlretrieve("http://files.grouplens.org/datasets/movielens/ml-20m.zip", "data/ml-20m.zip")
    with zipfile.ZipFile("data/ml-20m.zip", "r") as zipped_data:
        zipped_data.extractall("data")

### Read data from CSV

Movie dimension table

In [3]:
movies = pd.read_csv("data/ml-20m/movies.csv")

In [4]:
movies.head(10)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller


Ratings fact table

In [5]:
ratings = pd.read_csv("data/ml-20m/ratings.csv")

In [6]:
ratings.head(10)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,2,3.5,1112486027
1,1,29,3.5,1112484676
2,1,32,3.5,1112484819
3,1,47,3.5,1112484727
4,1,50,3.5,1112484580
5,1,112,3.5,1094785740
6,1,151,4.0,1094785734
7,1,223,4.0,1112485573
8,1,253,4.0,1112484940
9,1,260,4.0,1112484826
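
The ratings file is large, so it can help to tell `read_csv` which column types to expect rather than letting it infer them. The following is a minimal sketch on a few rows copied from the preview above; the `sample_csv` and `sample_ratings` names and the chosen dtypes are assumptions for illustration, not a documented schema of the data set.

```python
import io

import pandas as pd

# A few rows copied from the ratings preview, standing in for ratings.csv.
sample_csv = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,2,3.5,1112486027\n"
    "1,29,3.5,1112484676\n"
    "1,151,4.0,1094785734\n"
)

# Assumed dtypes: narrower integer and float types to reduce memory use.
sample_ratings = pd.read_csv(
    sample_csv,
    dtype={
        "userId": "int32",
        "movieId": "int32",
        "rating": "float32",
        "timestamp": "int64",
    },
)

print(sample_ratings.dtypes)
```

Whether the narrower types are safe depends on the actual value ranges in the file, so they should be checked before being applied to the full data set.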


### Add a column

Convert timestamp to a datetime

In [7]:
ratings["time"] = pd.to_datetime(ratings["timestamp"], unit="s")
ratings.head(10)

Unnamed: 0,userId,movieId,rating,timestamp,time
0,1,2,3.5,1112486027,2005-04-02 23:53:47
1,1,29,3.5,1112484676,2005-04-02 23:31:16
2,1,32,3.5,1112484819,2005-04-02 23:33:39
3,1,47,3.5,1112484727,2005-04-02 23:32:07
4,1,50,3.5,1112484580,2005-04-02 23:29:40
5,1,112,3.5,1094785740,2004-09-10 03:09:00
6,1,151,4.0,1094785734,2004-09-10 03:08:54
7,1,223,4.0,1112485573,2005-04-02 23:46:13
8,1,253,4.0,1112484940,2005-04-02 23:35:40
9,1,260,4.0,1112484826,2005-04-02 23:33:46
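
The conversion above can be tried in isolation. This small sketch uses two timestamp values taken from the preview; the variable names are hypothetical.

```python
import pandas as pd

# Two epoch-second values taken from the ratings preview.
epoch_seconds = pd.Series([1112486027, 1094785734])

# unit="s" tells pandas the integers count seconds since 1970-01-01 (UTC).
as_datetime = pd.to_datetime(epoch_seconds, unit="s")

# The .dt accessor exposes calendar fields for later filtering or grouping.
print(as_datetime.dt.year.tolist())  # [2005, 2004]
```

The resulting values carry no time zone information; they are naive datetimes that correspond to UTC.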


### Filter

Only include ratings that were submitted in 2009

In [8]:
binary_mask = ratings["time"].dt.year == 2009  # True only for ratings made during 2009
binary_mask.head(10)

0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8    False
9    False
Name: time, dtype: bool
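
A year filter can be written in more than one way, and the boundary behaviour differs. The sketch below, on made-up timestamps, compares `Series.between` (inclusive at both ends by default) with a comparison on the calendar year.

```python
import pandas as pd

# Made-up timestamps around the edges of 2009.
times = pd.Series(pd.to_datetime([
    "2008-12-31 23:59:59",
    "2009-01-01 00:00:00",
    "2009-12-31 23:59:59",
    "2010-01-01 00:00:00",  # exactly on the upper bound
]))

# between() includes both endpoints by default, so midnight on 2010-01-01 matches.
mask_between = times.between(pd.Timestamp(2009, 1, 1), pd.Timestamp(2010, 1, 1))

# Comparing the calendar year keeps only rows that fall in 2009.
mask_year = times.dt.year == 2009

print(mask_between.tolist())  # [False, True, True, True]
print(mask_year.tolist())     # [False, True, True, False]
```

With second-resolution timestamps the difference only affects a rating made exactly at midnight, but the year comparison states the intent more directly.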

In [9]:
ratingsFiltered = ratings[binary_mask]  # boolean-mask indexing returns a new DataFrame (a copy), not a view
ratingsFiltered.head(10)

Unnamed: 0,userId,movieId,rating,timestamp,time
960,11,1,4.5,1230858821,2009-01-02 01:13:41
961,11,10,2.5,1230858959,2009-01-02 01:15:59
962,11,19,3.5,1230783704,2009-01-01 04:21:44
963,11,32,5.0,1230783095,2009-01-01 04:11:35
964,11,39,4.5,1230859032,2009-01-02 01:17:12
965,11,65,2.0,1230856649,2009-01-02 00:37:29
966,11,110,4.0,1230853748,2009-01-01 23:49:08
967,11,145,3.0,1230785947,2009-01-01 04:59:07
968,11,150,5.0,1230785343,2009-01-01 04:49:03
969,11,153,3.5,1230858914,2009-01-02 01:15:14
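
On the view-or-copy question: selecting rows with a boolean mask produces a new object, so changes to the result do not propagate back to the original frame. A minimal sketch on toy data (names are hypothetical), using an explicit `.copy()` to make the intent clear and to avoid chained-assignment warnings:

```python
import pandas as pd

df = pd.DataFrame({"rating": [3.5, 4.0, 5.0]})
mask = df["rating"] >= 4.0

# .copy() makes the independence from df explicit.
subset = df[mask].copy()
subset["rating"] = 0.0

print(df["rating"].tolist())      # [3.5, 4.0, 5.0], the original is unchanged
print(subset["rating"].tolist())  # [0.0, 0.0]
```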


### Group by/Aggregation

Group by movieId and aggregate to find the total number of ratings and their average value

In [10]:
averageRatings = (
    ratingsFiltered.groupby("movieId")["rating"]
    .agg(["count", "mean"])
    .rename(columns={"mean": "averageRating"})
    .sort_values("averageRating", ascending=False)
)
averageRatings.head(10)

movieId,count,averageRating
1138,1,5.0
59,1,5.0
68265,1,5.0
4922,1,5.0
5306,1,5.0
8845,1,5.0
5405,1,5.0
72714,1,5.0
2698,1,5.0
7989,1,5.0
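
The same aggregation can also be expressed with named aggregation, which sets the output column names directly and makes a separate rename step unnecessary. This is an alternative sketch on toy data, not the approach used in the cell above; the `toy` and `summary` names are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "movieId": [1, 1, 2, 2, 2],
    "rating": [4.0, 5.0, 3.0, 3.5, 4.0],
})

# Each keyword is an output column: (input column, aggregation function).
summary = (
    toy.groupby("movieId")
    .agg(count=("rating", "count"), averageRating=("rating", "mean"))
    .sort_values("averageRating", ascending=False)
)

print(summary)
```

Note that the top of the sorted result is dominated by movies with very few ratings, which is why a minimum-count filter is a natural next step.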


### Filter again

Only include movies with more than 100 ratings

In [11]:
averageRatingsFiltered = averageRatings[averageRatings["count"] > 100]
averageRatingsFiltered.head(10)

movieId,count,averageRating
318,2623,4.439954
2959,2709,4.313031
296,2556,4.28169
6016,1153,4.273634
5618,986,4.26572
50,1826,4.259036
926,141,4.255319
1201,682,4.248534
858,1851,4.246623
44555,664,4.224398


### Join

Join with the movie dimension table

In [12]:
averageRatingsWithMovie = (
    averageRatingsFiltered.merge(movies, on="movieId", how="inner")
    .sort_values("averageRating", ascending=False)
)
averageRatingsWithMovie.head(10)

Unnamed: 0,movieId,count,averageRating,title,genres
0,318,2623,4.439954,"Shawshank Redemption, The (1994)",Crime|Drama
1,2959,2709,4.313031,Fight Club (1999),Action|Crime|Drama|Thriller
2,296,2556,4.28169,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller
3,6016,1153,4.273634,City of God (Cidade de Deus) (2002),Action|Adventure|Crime|Drama|Thriller
4,5618,986,4.26572,Spirited Away (Sen to Chihiro no kamikakushi) ...,Adventure|Animation|Fantasy
5,50,1826,4.259036,"Usual Suspects, The (1995)",Crime|Mystery|Thriller
6,926,141,4.255319,All About Eve (1950),Drama
7,1201,682,4.248534,"Good, the Bad and the Ugly, The (Buono, il bru...",Action|Adventure|Western
8,858,1851,4.246623,"Godfather, The (1972)",Crime|Drama
9,44555,664,4.224398,"Lives of Others, The (Das leben der Anderen) (...",Drama|Romance|Thriller
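
In the join above, `movieId` is the index of the aggregated frame and a regular column of the movie table; `merge` accepts the name in `on=` for both. The sketch below reproduces that situation on toy data (all names and values are made up) and adds the optional `validate` argument, which raises an error if the key is unexpectedly duplicated.

```python
import pandas as pd

# Aggregated statistics, indexed by movieId (as produced by a groupby).
stats = pd.DataFrame(
    {"count": [120, 300], "averageRating": [4.1, 3.9]},
    index=pd.Index([10, 20], name="movieId"),
)

# Dimension table with movieId as an ordinary column.
titles = pd.DataFrame({"movieId": [10, 20, 30], "title": ["A", "B", "C"]})

# on="movieId" matches the named index level on the left with the column on the right.
# validate="one_to_one" checks that movieId is unique on both sides.
joined = stats.merge(titles, on="movieId", how="inner", validate="one_to_one")

print(joined)
```

An inner join drops movies that have no statistics (movieId 30 here), which is the desired behaviour when only rated movies are of interest.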


### Filter again

Only include Westerns

In [13]:
topWesterns = averageRatingsWithMovie[averageRatingsWithMovie["genres"].str.contains("Western", regex=False)]
topWesterns.head(10)

Unnamed: 0,movieId,count,averageRating,title,genres
7,1201,682,4.248534,"Good, the Bad and the Ugly, The (Buono, il bru...",Action|Adventure|Western
19,1209,284,4.163732,Once Upon a Time in the West (C'era una volta ...,Action|Drama|Western
26,1254,130,4.134615,"Treasure of the Sierra Madre, The (1948)",Action|Adventure|Drama|Western
112,1304,313,4.041534,Butch Cassidy and the Sundance Kid (1969),Action|Western
118,3681,279,4.035842,For a Few Dollars More (Per qualche dollaro in...,Action|Drama|Thriller|Western
123,1266,439,4.031891,Unforgiven (1992),Drama|Western
192,26649,195,3.961538,Lonesome Dove (1989),Adventure|Drama|Western
194,2951,321,3.961059,"Fistful of Dollars, A (Per un pugno di dollari...",Action|Western
252,714,156,3.916667,Dead Man (1995),Drama|Mystery|Western
257,56782,721,3.911928,There Will Be Blood (2007),Drama|Western
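
The genre filter above is a substring test on the pipe-separated `genres` string. That is sufficient here, but a substring test would also match any label that merely contains the word. The sketch below contrasts it with a whole-label test; the "Spaghetti-Western" label is made up for illustration and is not a genre in the data set.

```python
import pandas as pd

genres = pd.Series([
    "Action|Western",
    "Drama",
    "Comedy|Spaghetti-Western",  # hypothetical label, not in the data set
])

# Substring test: matches any string that contains "Western".
is_western_substring = genres.str.contains("Western", regex=False)

# Whole-label test: split on "|" and compare complete genre names.
is_western_label = genres.str.split("|").apply(lambda parts: "Western" in parts)

print(is_western_substring.tolist())  # [True, False, True]
print(is_western_label.tolist())      # [True, False, False]
```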
