# Data Science 1 Topic 2

## DataFrames

### From a dictionary

<u>In T0-TupleListDictionary, you've seen the following dictionary. Run the cell.</u>

In [2]:
movie_ratings = {'Toy Story':{'rating':4.0, 'genre':'Animation'},
                          'Jumanji':{'rating':4.0, 'genre':'Adventure'},
                          'Grumpier Old Men':{'rating':4.0, 'genre':'Comedy'},
                          'Waiting to Exhale':{'rating':4.0, 'genre':'Comedy'},
                          'Father of the Bride Part II':{'rating':5.0, 'genre':'Comedy'}
                         }

<u>Let's now create a simple DataFrame from this dictionary using [Pandas](https://pandas.pydata.org/) package.<br>Follow the instructions below.</u>

In [1]:
# import Pandas package with the alias pd
import pandas as pd

In [3]:
ratings_df = pd.DataFrame(movie_ratings)

# display ratings_df
ratings_df.head()

Unnamed: 0,Toy Story,Jumanji,Grumpier Old Men,Waiting to Exhale,Father of the Bride Part II
rating,4.0,4.0,4.0,4.0,5.0
genre,Animation,Adventure,Comedy,Comedy,Comedy


### Transposing a DataFrame

<u>With this simple DataFrame, let's say we want the rating and genre to be the **columns**.</u>

In [5]:
# How to use transpose() on a DataFrame? overwrite ratings_df with its transpose
ratings_df = ratings_df.transpose()

# print ratings_df
print(ratings_df)

        Toy Story    Jumanji Grumpier Old Men Waiting to Exhale  \
rating        4.0        4.0              4.0               4.0   
genre   Animation  Adventure           Comedy            Comedy   

       Father of the Bride Part II  
rating                         5.0  
genre                       Comedy  


<u>The title of the movies became the **index** of the DataFrame.</u>

### Using `reset_index` and  `set_index`

<u>What if we don't want the title to be the indices, but as a variable (contained in a column)?<br>The current index has no name, and we want the contents to be under the column `title`. So we'll first rename it, before using `reset_index`.</u>

In [None]:
# First use rename_axis(), then reset_index(), where should the `title` be?
ratings_df = ratings_df.__________.__________

# print ratings_df
# __________

<u>You can simply use `.set_index('title')` on ratings_df to reverse this. <br>We will see later, when using a bigger DataFrame, how `index` and `columns` can be useful.</u>

***

## Reading csv files

<u>We will use the [MovieLens Latest Datasets](https://grouplens.org/datasets/movielens/latest/). Download **ml-latest-small.zip** and extract the zip file. There are four csv files (`links.csv`, `movies.csv`, `ratings.csv`, and `tags.csv`). Take some time to read the README file.<br>We'll first look at `ratings.csv`</u>

In [None]:
# ONLY IF YOU'RE USING GOOGLE COLAB AND WANT TO MOUNT YOUR DRIVE TO ACCESS YOUR FILES,
# uncomment these two lines:
# from google.colab import drive
# drive.mount('/content/drive')

# Once you're done:
#drive.flush_and_unmount()

# ALTERNATIVE: Upload the file to the session storage (will be deleted once your session ends). 
# Click on the folder icon, then the icon with the up arrow symbol.

In [23]:
ratings_df = pd.read_csv('ml-latest-small/ratings.csv')

<u>Read the rest of the csv files.</u>

In [24]:
movies_df = pd.read_csv("ml-latest-small/movies.csv")
links_df = pd.read_csv("ml-latest-small/links.csv")
tags_df = pd.read_csv("ml-latest-small/tags.csv")

***

## Exploring DataFrames

<u>Run the cells below, follow the instructions and inspect the outputs.</u>

In [25]:
ratings_df.head(3) # without number specification, how many entries will be printed?

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224


<u>We want the timestamp column to be more readable and easier to deal with. Let's change it to DateTime format.</u>

In [26]:
# run this
ratings_df["time"] = pd.to_datetime(ratings_df.timestamp, unit='s')
ratings_df = ratings_df.drop(columns="timestamp")

In [15]:
# .tail() will print the last few entries.
# show the last 3 entries of ratings_df
ratings_df.tail()

Unnamed: 0,userId,movieId,rating,time
100831,610,166534,4.0,2017-05-03 21:53:22
100832,610,168248,5.0,2017-05-03 22:21:31
100833,610,168250,5.0,2017-05-08 19:50:47
100834,610,168252,5.0,2017-05-03 21:19:12
100835,610,170875,3.0,2017-05-03 21:20:15


In [27]:
# show the first few entries of movies_df
movies_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [28]:
# show the first few entries of links_df
links_df.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [32]:
tags_df.head()

Unnamed: 0,userId,movieId,tag,timestamp,time
0,2,60756,funny,1445714994,1970-01-01 00:00:01.445714994
1,2,60756,Highly quotable,1445714996,1970-01-01 00:00:01.445714996
2,2,60756,will ferrell,1445714992,1970-01-01 00:00:01.445714992
3,2,89774,Boxing story,1445715207,1970-01-01 00:00:01.445715207
4,2,89774,MMA,1445715200,1970-01-01 00:00:01.445715200


In [35]:
# tags_df also contains a Unix timestamp, "timestamp", column

# add a column "time" that contains the DateTime format of "timestamp"
tags_df["time"] = pd.to_datetime(tags_df["timestamp"])

# drop the column "timestamp"
tags_df.drop(columns="timestamp")

# show the first few entries of tags_df
tags_df.head()

Unnamed: 0,userId,movieId,tag,timestamp,time
0,2,60756,funny,1445714994,1970-01-01 00:00:01.445714994
1,2,60756,Highly quotable,1445714996,1970-01-01 00:00:01.445714996
2,2,60756,will ferrell,1445714992,1970-01-01 00:00:01.445714992
3,2,89774,Boxing story,1445715207,1970-01-01 00:00:01.445715207
4,2,89774,MMA,1445715200,1970-01-01 00:00:01.445715200


<u>Run the following cells and inspect the outputs.</u>

In [36]:
# Print the size 
ratings_df.shape

(100836, 4)

In [37]:
# Print the information about ratings_df
ratings_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column   Non-Null Count   Dtype         
---  ------   --------------   -----         
 0   userId   100836 non-null  int64         
 1   movieId  100836 non-null  int64         
 2   rating   100836 non-null  float64       
 3   time     100836 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int64(2)
memory usage: 3.1 MB


In [38]:
# Print the description 
ratings_df.describe()

Unnamed: 0,userId,movieId,rating,time
count,100836.0,100836.0,100836.0,100836
mean,326.127564,19435.295718,3.501557,2008-03-19 17:01:27.368469248
min,1.0,1.0,0.5,1996-03-29 18:36:55
25%,177.0,1199.0,3.0,2002-04-18 09:57:46
50%,325.0,2991.0,3.5,2007-08-02 20:31:02
75%,477.0,8122.0,4.0,2015-07-04 07:15:44.500000
max,610.0,193609.0,5.0,2018-09-24 14:27:30
std,182.618491,35530.987199,1.042529,


<u>How many entries are there in `ratings_df`? Are there any empty (null) entries?</u>

**Ans**: ...

<u>Now the userId and movieId are still numerical. We can use .astype(str) or .astype('category')) or pd.Category(...) to change them. But we'll leave them be for now.</u>

<u>How do you print the names of the columns and index (or indices)?</u>

In [39]:
# Print the column names
ratings_df.columns

Index(['userId', 'movieId', 'rating', 'time'], dtype='object')

In [40]:
# Print the indices
ratings_df.index

RangeIndex(start=0, stop=100836, step=1)

## Subsetting and Sorting

### Subsetting columns

In [None]:
# Run this cell
only_ratings = ratings_df["rating"]

<u>Q: What is the data type?</u>

In [None]:
# print the data type of only_ratings
# ___________

<u>We can subset multiple columns by passing the column names as a list.</u>

In [None]:
# Select the columns "movieId" and "rating"
__________

# What is the data type?
__________

### Sorting rows

<u>Let's sort the movie based on their ratings (in `rating`).</u>

In [41]:
ratings_df.sort_values(by="rating")

Unnamed: 0,userId,movieId,rating,time
3752,22,53519,0.5,2010-03-16 08:12:17
60861,393,5445,0.5,2015-05-01 18:57:16
47025,307,2017,0.5,2007-08-03 20:40:39
22446,153,1198,0.5,2018-05-05 19:24:24
60865,393,5902,0.5,2015-05-01 19:11:49
...,...,...,...,...
90260,587,50,5.0,2000-03-15 17:29:26
90261,587,58,5.0,2000-03-15 17:33:12
17061,108,5303,5.0,2003-01-17 21:41:59
90266,587,236,5.0,2000-03-15 16:59:01


<u>How to get the highest rating movies on the top?</u>

In [48]:
# Pass the argument ascending=False and display the result.
ratings_df.sort_values(by="rating", ascending=False).iloc[0]

userId                     232
movieId                   3147
rating                     5.0
time       2008-08-08 03:08:23
Name: 34031, dtype: object

### Sorting rows based on multiple columns

<u>To sort the rows based on multiple columns, pass the column names and the `ascending` argument as lists.</u>

In [49]:
# Try it: sort with descending rating and ascending time
ratings_df.sort_values(by=["rating", "time"], ascending=[False, True])

Unnamed: 0,userId,movieId,rating,time
66665,429,150,5.0,1996-03-29 18:36:55
66667,429,161,5.0,1996-03-29 18:36:55
66716,429,588,5.0,1996-03-29 18:36:55
66717,429,590,5.0,1996-03-29 18:36:55
66718,429,592,5.0,1996-03-29 18:36:55
...,...,...,...,...
81446,514,141994,0.5,2018-09-08 04:35:03
58089,380,179819,0.5,2018-09-13 21:05:21
27248,184,184641,0.5,2018-09-16 10:46:48
27232,184,175475,0.5,2018-09-16 14:52:50


### Subsetting rows with a Boolean series

<u>Run the cells below.</u>

In [None]:
# This will result in a Boolean series
ratings_df["userId"] == 1 

In [None]:
#Filtering for rows where userId=1
ratings_df[ratings_df["userId"]==1]

<u>Now create a DataFrame called `high_ratings_df` where it only contains the entries with rating=5.0.</u>

In [50]:
# Filtering for rows where rating=5.0
high_ratings_df = ratings_df[ratings_df["rating"] == 5.0]

# Display high_ratings_df 
high_ratings_df.head()

Unnamed: 0,userId,movieId,rating,time
3,1,47,5.0,2000-07-30 19:03:35
4,1,50,5.0,2000-07-30 18:48:51
6,1,101,5.0,2000-07-30 18:14:28
8,1,151,5.0,2000-07-30 19:07:21
9,1,157,5.0,2000-07-30 19:08:20


### Subsetting rows with multiple conditions

<u>Which movieId did user 1 give 5.0 rating to?</u>

In [None]:
# Construct the Boolean series
is_user1 = ratings_df["userId"] == 1
is_rating5 = ratings_df["rating"] == 5.0

# Try it: Use both Boolean series to subset ratings_df. Which logical operator do you need?
_________


### Subsetting using `.isin()`

<u>Say we want to subset the ratings based on more than two conditions, checking the conditions one by one can get quite tedious. In this case, it's better to use `.isin()`.<br>Run the cell below.</u>

In [80]:
# also results in Boolean series
is_movie123 = ratings_df['movieId'].isin([1, 2, 3])

#subset ratings_df with is_movie123
ratings_df[is_movie123]

Unnamed: 0,userId,movieId,rating,time
0,1,1,4.0,2018
1,1,3,4.0,2018
516,5,1,4.0,2018
560,6,2,4.0,2018
561,6,3,5.0,2018
...,...,...,...,...
98666,608,1,2.5,2018
98667,608,2,2.0,2018
98668,608,3,2.0,2018
99497,609,1,3.0,2018


### Hierarchical indices

<u>We've seen how to use `.set_index()` and `.reset_index()`before, now let's try setting multiple indices.</u>

In [60]:
# create a list of the column names to set as indices
indices = ["userId", "movieId"]

# use .set_index() on ratings_df and pass the indices as the argument
ratings_multindex = ratings_df.set_index(indices)

# print ratings_multindex
ratings_multindex.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,rating,time
userId,movieId,Unnamed: 2_level_1,Unnamed: 3_level_1
1,1,4.0,2000-07-30 18:45:03
1,3,4.0,2000-07-30 18:20:47
1,6,4.0,2000-07-30 18:37:04
1,47,5.0,2000-07-30 19:03:35
1,50,5.0,2000-07-30 18:48:51


### Subsetting with indices

<u>Here we'll learn how to use `loc` and `iloc` to subset a DataFrame using its indices.</u>

#### Outer index level

In [61]:
# Run this cell

print(tags_df.head(2))

# Set tag as the index
tags_1idx = tags_df.set_index('tag')
tags_1idx.head()

   userId  movieId              tag   timestamp                          time
0       2    60756            funny  1445714994 1970-01-01 00:00:01.445714994
1       2    60756  Highly quotable  1445714996 1970-01-01 00:00:01.445714996


Unnamed: 0_level_0,userId,movieId,timestamp,time
tag,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
funny,2,60756,1445714994,1970-01-01 00:00:01.445714994
Highly quotable,2,60756,1445714996,1970-01-01 00:00:01.445714996
will ferrell,2,60756,1445714992,1970-01-01 00:00:01.445714992
Boxing story,2,89774,1445715207,1970-01-01 00:00:01.445715207
MMA,2,89774,1445715200,1970-01-01 00:00:01.445715200


In [62]:
# print out all indices
tags_1idx.index

Index(['funny', 'Highly quotable', 'will ferrell', 'Boxing story', 'MMA',
       'Tom Hardy', 'drugs', 'Leonardo DiCaprio', 'Martin Scorsese',
       'way too long',
       ...
       'music', 'British', 'Romans', '70mm', 'World War II', 'for katie',
       'austere', 'gun fu', 'heroic bloodshed', 'Heroic Bloodshed'],
      dtype='object', name='tag', length=3683)

<u> You've seen the following examples in the lecture</u>

In [63]:
# iloc: by the integer position(s)
tags_1idx.iloc[[2,0]]

Unnamed: 0_level_0,userId,movieId,timestamp,time
tag,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
will ferrell,2,60756,1445714992,1970-01-01 00:00:01.445714992
funny,2,60756,1445714994,1970-01-01 00:00:01.445714994


In [64]:
# loc: by the label(s)
tags_1idx.loc[['funny', 'will ferrell']].tail()

Unnamed: 0_level_0,userId,movieId,timestamp,time
tag,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
funny,599,1732,1498456291,1970-01-01 00:00:01.498456291
will ferrell,2,60756,1445714992,1970-01-01 00:00:01.445714992
will ferrell,62,60756,1528934379,1970-01-01 00:00:01.528934379
will ferrell,62,107348,1528935002,1970-01-01 00:00:01.528935002
will ferrell,424,60756,1457846129,1970-01-01 00:00:01.457846129


#### Inner index level

<u>We'll set a second index based on the year of ratings.</u>

In [67]:
tags_df["year"] = tags_df.time.dt.year

tags_2idx = tags_df.set_index(['tag', 'year']).sort_index()
tags_2idx.head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,userId,movieId,timestamp,time
tag,year,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
"""artsy""",1970,567,4552,1525285878,1970-01-01 00:00:01.525285878
06 Oscar Nominated Best Movie - Animation,1970,474,31658,1145451856,1970-01-01 00:00:01.145451856
06 Oscar Nominated Best Movie - Animation,1970,474,37729,1139059938,1970-01-01 00:00:01.139059938
06 Oscar Nominated Best Movie - Animation,1970,474,38038,1139664190,1970-01-01 00:00:01.139664190
1900s,1970,474,918,1138137949,1970-01-01 00:00:01.138137949
1920s,1970,474,1084,1138137844,1970-01-01 00:00:01.138137844
1920s,1970,474,3548,1137205331,1970-01-01 00:00:01.137205331
1950s,1970,474,1103,1137207088,1970-01-01 00:00:01.137207088
1950s,1970,474,2664,1138032017,1970-01-01 00:00:01.138032017
1960s,1970,474,1093,1137375736,1970-01-01 00:00:01.137375736


In [70]:
# Try it: Pass a tag and a year to .loc[]
tags_2idx.loc[("artsy", 1970)]

Unnamed: 0_level_0,Unnamed: 1_level_0,userId,movieId,timestamp,time
tag,year,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
artsy,1970,567,1921,1525287302,1970-01-01 00:00:01.525287302
artsy,1970,567,99917,1525286654,1970-01-01 00:00:01.525286654


<u> Run the cells and inspect the results</u>

In [None]:
# Slicing based on multiple indices
print(tags_2idx.loc[("funny",2015):("funny",2017)])
print("---------------------")
print(tags_2idx.loc[("will ferrell",(2015,2018)),])

In [None]:
print(tags_2idx.loc["Al Pacino":"Alicia Vikander"])
print("---------------------")
print(tags_2idx.loc[("Al Pacino",2018):("Alicia Vikander",2015)])

### Subsetting in both directions

<u>This is especially useful if your DataFrame has many columns (variables).</u>

In [None]:
# Try it: display the columns "userId" and "movieId", 
# for movies with tag "funny" between years 2016-2018
# Pass the columns as a tuple
print( tags_2idx.loc[(_______,_______):(_______,_______), 
                     (__________, __________)] )

### Slicing by (partial) dates

<u>Say we want to see only the ratings made in 2018.</u>

In [81]:
# Try it: get the Boolean series using logical operator(s) on ratings_df["time"]
is_year2018 = ratings_df["time"] == 2018

# ratings made in 2018
ratings_df[is_year2018]

Unnamed: 0,userId,movieId,rating,time
0,1,1,4.0,2018
1,1,3,4.0,2018
2,1,6,4.0,2018
3,1,47,5.0,2018
4,1,50,5.0,2018
...,...,...,...,...
100831,610,166534,4.0,2018
100832,610,168248,5.0,2018
100833,610,168250,5.0,2018
100834,610,168252,5.0,2018


In [82]:
ratings_df.head()

Unnamed: 0,userId,movieId,rating,time
0,1,1,4.0,2018
1,1,3,4.0,2018
2,1,6,4.0,2018
3,1,47,5.0,2018
4,1,50,5.0,2018


In [None]:
# Alternative: use .dt.year
# Get the Boolean series
is_year2018 = ratings_df["time"].dt.year==2018

# ratings made in 2018
ratings_df[is_year2018]

In [None]:
# Try it: Get the ratings made in years 2000 and 2001.
__________

## Merging and Pivoting

<u>In the lecture we have seen the example of merging and pivoting using `tags_df` and `ratings_df`.</u>

In [53]:
# Print a few lines from tags_df and ratings_df to remind ourselves how they look like.
tags_df.head()


Unnamed: 0,userId,movieId,tag,timestamp,time
0,2,60756,funny,1445714994,1970-01-01 00:00:01.445714994
1,2,60756,Highly quotable,1445714996,1970-01-01 00:00:01.445714996
2,2,60756,will ferrell,1445714992,1970-01-01 00:00:01.445714992
3,2,89774,Boxing story,1445715207,1970-01-01 00:00:01.445715207
4,2,89774,MMA,1445715200,1970-01-01 00:00:01.445715200


In [54]:
ratings_df.head()

Unnamed: 0,userId,movieId,rating,time
0,1,1,4.0,2000-07-30 18:45:03
1,1,3,4.0,2000-07-30 18:20:47
2,1,6,4.0,2000-07-30 18:37:04
3,1,47,5.0,2000-07-30 19:03:35
4,1,50,5.0,2000-07-30 18:48:51


In [72]:
# Merge the two DataFrames on two columns, "userId" and "movieId",
# by first dropping the time stamp on ratings_df
tags_ratings_df = tags_df.merge(ratings_df.drop(columns=["time"]),
                                on=["userId", "movieId"])

tags_ratings_df

Unnamed: 0,userId,movieId,tag,timestamp,time,year,rating
0,2,60756,funny,1445714994,1970-01-01 00:00:01.445714994,1970,5.0
1,2,60756,Highly quotable,1445714996,1970-01-01 00:00:01.445714996,1970,5.0
2,2,60756,will ferrell,1445714992,1970-01-01 00:00:01.445714992,1970,5.0
3,2,89774,Boxing story,1445715207,1970-01-01 00:00:01.445715207,1970,5.0
4,2,89774,MMA,1445715200,1970-01-01 00:00:01.445715200,1970,5.0
...,...,...,...,...,...,...,...
3471,606,6107,World War II,1178473747,1970-01-01 00:00:01.178473747,1970,4.0
3472,606,7382,for katie,1171234019,1970-01-01 00:00:01.171234019,1970,4.5
3473,610,3265,gun fu,1493843984,1970-01-01 00:00:01.493843984,1970,5.0
3474,610,3265,heroic bloodshed,1493843978,1970-01-01 00:00:01.493843978,1970,5.0


<u>Read the description in the cell below and run it.</u>

In [73]:
# Pivot table by year and aggregate the ratings with their average
mean_yearly_rating = tags_ratings_df.pivot_table(index="tag", 
                                                 columns="year",
                                                 values="rating", 
                                                 aggfunc="mean")

# This line of code sorts the pivot table by the least occurrence of NAs, 
# because not every year has data, and replace NAs by "-"
mean_yearly_rating.iloc[mean_yearly_rating.isnull().sum(axis=1).argsort()].fillna("-").head(10)

year,1970
tag,Unnamed: 1_level_1
"""artsy""",3.5
invisibility,4.0
investor corruption,5.0
introspection,4.5
intimate,4.5
interwoven storylines,5.0
intertwining storylines,5.0
interracial romance,2.0
interracial marriage,4.5
interesting scenario,3.0


In [None]:
# Try it: Create the table of the median rating per year from 2010 onwards,
# set the tag as the index and years as the columns
# display 10 lines, sort based on the number the least number of NAs like before, and 
# replace NAs by blanks ("")

median_yearly_rating = 

# display it
median_yearly_rating.__________

In [None]:
# Try it: Create the table of the count of ratings
# We're only interested with the tags that start with 'com', e.g. 'comic', 'computer', etc
# set the year as the index and tags as the columns
# display all lines sorted by year
# replace NAs by "-"
count_yearly_rating = __________

# display it
count_yearly_rating.___________