# Data Science 1 Topic 2

## DataFrames

### From a dictionary

<u>In T0-TupleListDictionary, you've seen the following dictionary. Run the cell.</u>

In [1]:
movie_ratings = {'Toy Story':{'rating':4.0, 'genre':'Animation'},
                          'Jumanji':{'rating':4.0, 'genre':'Adventure'},
                          'Grumpier Old Men':{'rating':4.0, 'genre':'Comedy'},
                          'Waiting to Exhale':{'rating':4.0, 'genre':'Comedy'},
                          'Father of the Bride Part II':{'rating':5.0, 'genre':'Comedy'}
                         }

<u>Let's now create a simple DataFrame from this dictionary using the [pandas](https://pandas.pydata.org/) package.<br>Follow the instructions below.</u>

In [2]:
# import Pandas package with the alias pd
import pandas as pd

In [3]:
ratings_df = pd.DataFrame(movie_ratings)

# display ratings_df
ratings_df.head()

        Toy Story    Jumanji  Grumpier Old Men  Waiting to Exhale  Father of the Bride Part II
rating        4.0        4.0               4.0                4.0                          5.0
genre   Animation  Adventure            Comedy             Comedy                       Comedy
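
<u>As a side note, here is a minimal sketch of another way to get one row per movie, assuming `pd.DataFrame.from_dict` with `orient="index"`. The dictionary `small_ratings` is a hypothetical, shortened stand-in for `movie_ratings`.</u>

```python
import pandas as pd

# Hypothetical, shortened stand-in for movie_ratings
small_ratings = {
    "Toy Story": {"rating": 4.0, "genre": "Animation"},
    "Jumanji": {"rating": 4.0, "genre": "Adventure"},
    "Grumpier Old Men": {"rating": 4.0, "genre": "Comedy"},
}

# Default constructor: the outer keys (movie titles) become the columns
by_column = pd.DataFrame(small_ratings)
print(by_column.shape)  # (2, 3): 2 rows (rating, genre), 3 movie columns

# orient="index": the outer keys become the index, one row per movie
by_row = pd.DataFrame.from_dict(small_ratings, orient="index")
print(by_row.shape)  # (3, 2)
print(by_row.dtypes)
```

<u>Built this way, `rating` keeps a numeric dtype, whereas transposing a mixed-type DataFrame leaves every column with dtype `object`.</u>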


### Transposing a DataFrame

<u>With this simple DataFrame, let's say we want the rating and genre to be the **columns**.</u>

In [4]:
# Use transpose() to swap rows and columns; overwrite ratings_df with its transpose
ratings_df = ratings_df.transpose()

# print ratings_df
print(ratings_df)

                            rating      genre
Toy Story                      4.0  Animation
Jumanji                        4.0  Adventure
Grumpier Old Men               4.0     Comedy
Waiting to Exhale              4.0     Comedy
Father of the Bride Part II    5.0     Comedy


<u>The title of the movies became the **index** of the DataFrame.</u>

### Using `reset_index` and  `set_index`

<u>What if we don't want the titles to be the index, but a variable (contained in a column)?<br>The current index has no name, and we want its contents to end up in a column called `title`. So we'll first name the index with `rename_axis`, then call `reset_index`.</u>

In [6]:
# First use rename_axis(), then reset_index(), where should the `title` be?
ratings_df = ratings_df.rename_axis("title", axis=0).reset_index()

# print ratings_df
ratings_df.head()

                         title  rating      genre
0                    Toy Story     4.0  Animation
1                      Jumanji     4.0  Adventure
2             Grumpier Old Men     4.0     Comedy
3            Waiting to Exhale     4.0     Comedy
4  Father of the Bride Part II     5.0     Comedy


In [54]:
ratings_df = ratings_df.set_index("title")

print(ratings_df.loc["Toy Story"])

rating          4.0
genre     Animation
Name: Toy Story, dtype: object


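In [32]:
# reset_index() moves "title" back into a regular column.
# The result is only displayed, not assigned, because the cells below
# still need "title" as the index.
ratings_df.reset_index().head()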


In [55]:
ratings_df.rating > 2

title
Toy Story                      True
Jumanji                        True
Grumpier Old Men               True
Waiting to Exhale              True
Father of the Bride Part II    True
Name: rating, dtype: bool

In [56]:
ratings_df.loc["Toy Story":"Waiting to Exhale", "rating":"genre"]

                   rating      genre
title
Toy Story             4.0  Animation
Jumanji               4.0  Adventure
Grumpier Old Men      4.0     Comedy
Waiting to Exhale     4.0     Comedy


In [20]:
ratings_df.iloc[0:3, -1]

title
Toy Story           Animation
Jumanji             Adventure
Grumpier Old Men       Comedy
Name: genre, dtype: object

<u>`.reset_index()` and `.set_index('title')` reverse each other: the first moves the titles into a column, the second makes them the index again.<br>We will see later, when using a bigger DataFrame, how `index` and `columns` can be useful.</u>

***

## Reading csv files

<u>We will use the [MovieLens Latest Datasets](https://grouplens.org/datasets/movielens/latest/). Download **ml-latest-small.zip** and extract the zip file. There are four csv files (`links.csv`, `movies.csv`, `ratings.csv`, and `tags.csv`). Take some time to read the README file.<br>We'll first look at `ratings.csv`</u>

In [None]:
# ONLY IF YOU'RE USING GOOGLE COLAB AND WANT TO MOUNT YOUR DRIVE TO ACCESS YOUR FILES,
# uncomment these two lines:
# from google.colab import drive
# drive.mount('/content/drive')

# Once you're done:
#drive.flush_and_unmount()

# ALTERNATIVE: Upload the file to the session storage (will be deleted once your session ends). 
# Click on the folder icon, then the icon with the up arrow symbol.

In [None]:
ratings_df = pd.read_csv('ml-latest-small/ratings.csv')

<u>Read the rest of the csv files.</u>

In [None]:
movies_df = pd.read_csv("ml-latest-small/movies.csv")
links_df = pd.read_csv("ml-latest-small/links.csv")
tags_df = pd.read_csv("ml-latest-small/tags.csv")

***

## Exploring DataFrames

<u>Run the cells below, follow the instructions and inspect the outputs.</u>

In [None]:
ratings_df.head(3) # without number specification, how many entries will be printed?

<u>We want the timestamp column to be more readable and easier to deal with. Let's change it to DateTime format.</u>

In [None]:
# run this
ratings_df["time"] = pd.to_datetime(ratings_df.timestamp, unit='s')
ratings_df = ratings_df.drop(columns="timestamp")
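
<u>To see why `unit='s'` matters, here is a small sketch on made-up timestamps (the values in `stamps` are hypothetical and not taken from `ratings.csv`). Without a unit, pandas reads integers as nanoseconds since 1970-01-01.</u>

```python
import pandas as pd

# Hypothetical Unix timestamps in seconds (not from the dataset)
stamps = pd.Series([0, 86400, 1_500_000_000])

# With unit="s", the integers are read as seconds since 1970-01-01
as_seconds = pd.to_datetime(stamps, unit="s")
print(as_seconds.dt.year.tolist())  # [1970, 1970, 2017]

# Without unit, the integers are read as nanoseconds,
# so even the largest value is only 1.5 seconds after the epoch
as_nanoseconds = pd.to_datetime(stamps)
print(as_nanoseconds.dt.year.tolist())  # [1970, 1970, 1970]
```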

In [None]:
# .tail() will print the last few entries.
# show the last 3 entries of ratings_df
ratings_df.tail(3)

In [None]:
# show the first few entries of movies_df
movies_df.head()

In [None]:
# show the first few entries of links_df
links_df.head()

In [None]:
tags_df.head()

In [None]:
# tags_df also has a "timestamp" column containing Unix timestamps (in seconds)

# add a column "time" that contains the DateTime format of "timestamp"
tags_df["time"] = pd.to_datetime(tags_df["timestamp"], unit='s')

# drop the column "timestamp" (drop() returns a new DataFrame, so assign it back)
tags_df = tags_df.drop(columns="timestamp")

# show the first few entries of tags_df
tags_df.head()

<u>Run the following cells and inspect the outputs.</u>

In [None]:
# Print the size 
ratings_df.shape

In [None]:
# Print the information about ratings_df
ratings_df.info()

In [None]:
# Print the description 
ratings_df.describe()

<u>How many entries are there in `ratings_df`? Are there any empty (null) entries?</u>

**Ans**: ...

<u>For now, userId and movieId are still numerical. We could use `.astype(str)`, `.astype('category')`, or `pd.Categorical(...)` to change them, but we'll leave them as they are for now.</u>
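
<u>For reference, a minimal sketch of such a conversion on a toy DataFrame. The name `ids_df` and its values are hypothetical; they only stand in for the ID columns of `ratings_df`.</u>

```python
import pandas as pd

# Hypothetical IDs standing in for userId / movieId
ids_df = pd.DataFrame({"userId": [1, 1, 2], "movieId": [10, 20, 10]})

# Convert one column to strings and the other to a categorical
ids_df["userId"] = ids_df["userId"].astype(str)
ids_df["movieId"] = ids_df["movieId"].astype("category")

print(ids_df.dtypes)
# The distinct values of a categorical column are stored once as categories
print(ids_df["movieId"].cat.categories.tolist())  # [10, 20]
```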

<u>How do you print the names of the columns and index (or indices)?</u>

In [None]:
# Print the column names
ratings_df.columns

In [None]:
# Print the indices
ratings_df.index

## Subsetting and Sorting

### Subsetting columns

In [None]:
# Run this cell
only_ratings = ratings_df["rating"]

<u>Q: What is the data type?</u>

In [None]:
# print the data type of only_ratings
# ___________

<u>We can subset multiple columns by passing the column names as a list.</u>

In [None]:
# Select the columns "movieId" and "rating"
__________

# What is the data type?
__________

### Sorting rows

<u>Let's sort the rows based on their ratings (in `rating`).</u>

In [None]:
ratings_df.sort_values(by="rating")

<u>How do we get the highest-rated movies at the top?</u>

In [None]:
# Pass the argument ascending=False and display the result.
ratings_df.sort_values(by="rating", ascending=False).head()

### Sorting rows based on multiple columns

<u>To sort the rows based on multiple columns, pass the column names and the `ascending` argument as lists.</u>

In [None]:
# Try it: sort with descending rating and ascending time
ratings_df.sort_values(by=["rating", "time"], ascending=[False, True])
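
<u>On a small, made-up table the effect of the two lists is easier to follow. The DataFrame `toy` below is hypothetical: ties in `rating` are broken by `time`.</u>

```python
import pandas as pd

# Hypothetical ratings with two ties in "rating"
toy = pd.DataFrame({
    "rating": [5.0, 3.0, 5.0, 3.0],
    "time": pd.to_datetime(["2018-03-01", "2017-01-01", "2016-06-01", "2019-09-01"]),
})

# Highest rating first; within the same rating, earliest time first
ordered = toy.sort_values(by=["rating", "time"], ascending=[False, True])
print(ordered)
print(ordered.index.tolist())  # [2, 0, 1, 3]
```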

### Subsetting rows with a Boolean series

<u>Run the cells below.</u>

In [None]:
# This will result in a Boolean series
ratings_df["userId"] == 1 

In [None]:
#Filtering for rows where userId=1
ratings_df[ratings_df["userId"]==1]

<u>Now create a DataFrame called `high_ratings_df` that contains only the entries with rating=5.0.</u>

In [None]:
# Filtering for rows where rating=5.0
high_ratings_df = ratings_df[ratings_df["rating"] == 5.0]

# Display high_ratings_df 
high_ratings_df.head()

### Subsetting rows with multiple conditions

<u>Which movieIds did user 1 give a 5.0 rating to?</u>

In [None]:
# Construct the Boolean series
is_user1 = ratings_df["userId"] == 1
is_rating5 = ratings_df["rating"] == 5.0

# Try it: Use both Boolean series to subset ratings_df. Which logical operator do you need?
_________
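
<u>As a reminder of how Boolean series combine, here is a sketch on a hypothetical DataFrame `toy` (different conditions from the exercise above). The element-wise operators are `&` (and), `|` (or), and `~` (not).</u>

```python
import pandas as pd

# Hypothetical data, unrelated to the exercise's answer
toy = pd.DataFrame({"userId": [1, 1, 2, 2], "rating": [5.0, 3.0, 5.0, 4.0]})

is_user2 = toy["userId"] == 2
is_high = toy["rating"] >= 4.5

both = toy[is_user2 & is_high]    # rows where both conditions hold
either = toy[is_user2 | is_high]  # rows where at least one condition holds
not_user2 = toy[~is_user2]        # rows where the condition is False

print(both.index.tolist())       # [2]
print(either.index.tolist())     # [0, 2, 3]
print(not_user2.index.tolist())  # [0, 1]
```

<u>When the comparisons are written inline, each one needs its own parentheses, e.g. `toy[(toy["userId"] == 2) & (toy["rating"] >= 4.5)]`, because `&` binds more tightly than `==`.</u>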


### Subsetting using `.isin()`

<u>Say we want to keep the rows whose value matches any one of several values. Checking the values one by one gets tedious; in this case, it's better to use `.isin()`.<br>Run the cell below.</u>

In [None]:
# also results in Boolean series
is_movie123 = ratings_df['movieId'].isin([1, 2, 3])

#subset ratings_df with is_movie123
ratings_df[is_movie123]
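
<u>The sketch below, on a hypothetical DataFrame `toy`, illustrates that `.isin()` gives the same Boolean series as chaining `==` comparisons with `|`, and that `~` inverts the selection.</u>

```python
import pandas as pd

# Hypothetical ratings for four movies
toy = pd.DataFrame({"movieId": [1, 2, 3, 4, 1],
                    "rating": [4.0, 3.5, 5.0, 2.0, 3.0]})

with_isin = toy["movieId"].isin([1, 2, 3])
with_or = (toy["movieId"] == 1) | (toy["movieId"] == 2) | (toy["movieId"] == 3)

print(with_isin.equals(with_or))  # True

# ~ keeps the rows whose movieId is NOT in the list
print(toy[~with_isin])
```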

### Hierarchical indices

<u>We've seen how to use `.set_index()` and `.reset_index()` before. Now let's try setting multiple indices.</u>

In [None]:
# create a list of the column names to set as indices
indices = ["userId", "movieId"]

# use .set_index() on ratings_df and pass the indices as the argument
ratings_multindex = ratings_df.set_index(indices)

# print ratings_multindex
ratings_multindex.head()
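
<u>A minimal sketch of what a two-level index allows, using a hypothetical three-row DataFrame `toy`: a tuple selects one (userId, movieId) pair, a single label selects everything under that outer level, and `reset_index()` flattens the index again.</u>

```python
import pandas as pd

# Hypothetical ratings
toy = pd.DataFrame({"userId": [1, 1, 2],
                    "movieId": [10, 20, 10],
                    "rating": [4.0, 5.0, 3.0]})

multi = toy.set_index(["userId", "movieId"])
print(multi.index.nlevels)  # 2

# A tuple addresses both levels at once
print(multi.loc[(1, 20), "rating"])  # 5.0

# A single label addresses the outer level; the result is indexed by movieId
print(multi.loc[1])

# reset_index() turns both levels back into ordinary columns
flat = multi.reset_index()
print(flat.equals(toy))  # True
```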

### Subsetting with indices

<u>Here we'll learn how to use `loc` and `iloc` to subset a DataFrame using its indices.</u>

#### Outer index level

In [None]:
# Run this cell

print(tags_df.head(2))

# Set tag as the index
tags_1idx = tags_df.set_index('tag')
tags_1idx.head()

In [None]:
# print out all indices
tags_1idx.index

<u>You've seen the following examples in the lecture.</u>

In [None]:
# iloc: by the integer position(s)
tags_1idx.iloc[[2,0]]

In [None]:
# loc: by the label(s)
tags_1idx.loc[['funny', 'will ferrell']].tail()

#### Inner index level

<u>We'll set a second index level based on the year in which each tag was added.</u>

In [None]:
tags_df["year"] = tags_df.time.dt.year

tags_2idx = tags_df.set_index(['tag', 'year']).sort_index()
tags_2idx.head(10)

In [None]:
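# Try it: Pass a tag and a year to .loc[] as a tuple.
# The (tag, year) pair must exist in the index, otherwise a KeyError is raised.
# Example: the tag "funny" in 2015 (assumes "time" was parsed with unit='s')
tags_2idx.loc[("funny", 2015)]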

<u>Run the cells and inspect the results.</u>

In [None]:
# Slicing based on multiple indices
print(tags_2idx.loc[("funny",2015):("funny",2017)])
print("---------------------")
print(tags_2idx.loc[("will ferrell",(2015,2018)),])

In [None]:
print(tags_2idx.loc["Al Pacino":"Alicia Vikander"])
print("---------------------")
print(tags_2idx.loc[("Al Pacino",2018):("Alicia Vikander",2015)])
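
<u>The same kinds of selection on a hypothetical five-row DataFrame `toy`, where the result is easy to check by eye. This is only a sketch; it assumes the index has been sorted with `.sort_index()`, as was done for `tags_2idx`.</u>

```python
import pandas as pd

# Hypothetical tags, indexed by (tag, year) and sorted
toy = pd.DataFrame({
    "tag": ["funny", "funny", "funny", "sad", "sad"],
    "year": [2015, 2016, 2018, 2015, 2017],
    "movieId": [1, 2, 3, 4, 5],
}).set_index(["tag", "year"]).sort_index()

# Slice between two (tag, year) tuples; both endpoints are included
sliced = toy.loc[("funny", 2016):("sad", 2015)]
print(sliced["movieId"].tolist())  # [2, 3, 4]

# A list on the inner level picks specific years within one tag
picked = toy.loc[("funny", [2015, 2018]), :]
print(picked["movieId"].tolist())  # [1, 3]
```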

### Subsetting in both directions

<u>This is especially useful if your DataFrame has many columns (variables).</u>

In [None]:
# Try it: display the columns "userId" and "movieId", 
# for movies with tag "funny" between years 2016-2018
# Pass the columns as a tuple
print( tags_2idx.loc[(_______,_______):(_______,_______), 
                     (__________, __________)] )

### Slicing by (partial) dates

<u>Say we want to see only the ratings made in 2018.</u>

In [None]:
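# Try it: get the Boolean series using logical operator(s) on ratings_df["time"]
# (comparing a datetime column with the integer 2018 does not select the year)
is_year2018 = (ratings_df["time"] >= "2018-01-01") & (ratings_df["time"] < "2019-01-01")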

# ratings made in 2018
ratings_df[is_year2018]

In [None]:
ratings_df.head()

In [None]:
# Alternative: use .dt.year
# Get the Boolean series
is_year2018 = ratings_df["time"].dt.year==2018

# ratings made in 2018
ratings_df[is_year2018]
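
<u>A quick check on hypothetical timestamps that a date-range comparison and `.dt.year` select the same rows. The DataFrame `toy` is made up and includes the boundary cases at the start and end of 2018.</u>

```python
import pandas as pd

# Hypothetical timestamps around the boundaries of 2018
toy = pd.DataFrame({"time": pd.to_datetime([
    "2017-12-31 23:59:59",
    "2018-01-01 00:00:00",
    "2018-06-15 12:00:00",
    "2019-01-01 00:00:00",
])})

# Range comparison: date strings are parsed to timestamps for the comparison
by_range = (toy["time"] >= "2018-01-01") & (toy["time"] < "2019-01-01")

# Year accessor: compare the extracted year with an integer
by_year = toy["time"].dt.year == 2018

print(by_range.equals(by_year))     # True
print(toy[by_year].index.tolist())  # [1, 2]
```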

In [None]:
# Try it: Get the ratings made in years 2000 and 2001.
__________

## Merging and Pivoting

<u>In the lecture we have seen the example of merging and pivoting using `tags_df` and `ratings_df`.</u>

In [None]:
# Print a few lines from tags_df and ratings_df to remind ourselves what they look like.
tags_df.head()


In [None]:
ratings_df.head()

In [None]:
# Merge the two DataFrames on two columns, "userId" and "movieId",
# by first dropping the time stamp on ratings_df
tags_ratings_df = tags_df.merge(ratings_df.drop(columns=["time"]),
                                on=["userId", "movieId"])

tags_ratings_df
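
<u>To make the effect of merging on two columns concrete, here is a sketch with two hypothetical DataFrames, `tags` and `ratings`. By default `merge` keeps only the key pairs present in both (`how="inner"`); `how="left"` is shown for contrast.</u>

```python
import pandas as pd

# Hypothetical tags and ratings
tags = pd.DataFrame({"userId": [1, 1, 2],
                     "movieId": [10, 20, 10],
                     "tag": ["funny", "dark", "funny"]})
ratings = pd.DataFrame({"userId": [1, 2, 3],
                        "movieId": [10, 10, 10],
                        "rating": [4.0, 3.0, 5.0]})

# Inner merge: only (userId, movieId) pairs found in both tables survive
merged = tags.merge(ratings, on=["userId", "movieId"])
print(merged)  # 2 rows; the pair (1, 20) has no rating and is dropped

# Left merge: every tag is kept; missing ratings become NaN
left = tags.merge(ratings, on=["userId", "movieId"], how="left")
print(left)  # 3 rows
```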

<u>Read the description in the cell below and run it.</u>

In [None]:
# Pivot table by year and aggregate the ratings with their average
mean_yearly_rating = tags_ratings_df.pivot_table(index="tag", 
                                                 columns="year",
                                                 values="rating", 
                                                 aggfunc="mean")

# This line of code sorts the pivot table by the number of NAs per row (fewest first),
# because not every year has data, and replaces NAs by "-"
mean_yearly_rating.iloc[mean_yearly_rating.isnull().sum(axis=1).argsort()].fillna("-").head(10)
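
<u>The long one-liner above can be unpacked step by step. The sketch below uses a hypothetical four-row DataFrame `toy`; the intermediate names (`table`, `na_per_row`, `ordered`) are introduced here only for illustration.</u>

```python
import pandas as pd

# Hypothetical tagged ratings
toy = pd.DataFrame({
    "tag": ["funny", "funny", "funny", "dark"],
    "year": [2015, 2015, 2016, 2016],
    "rating": [4.0, 5.0, 3.0, 2.0],
})

# One row per tag, one column per year, cells hold the mean rating
table = toy.pivot_table(index="tag", columns="year",
                        values="rating", aggfunc="mean")
print(table.loc["funny", 2015])  # 4.5; "dark" has no 2015 rating, so that cell is NaN

# Step 1: count the missing cells in each row
na_per_row = table.isnull().sum(axis=1)

# Step 2: argsort gives the row positions ordered by that count (fewest NAs first)
ordered = table.iloc[na_per_row.to_numpy().argsort()]
print(ordered.index.tolist())  # ['funny', 'dark']

# Step 3: replace the remaining NAs for display
print(ordered.fillna("-"))
```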

In [None]:
# Try it: Create the table of the median rating per year from 2010 onwards,
# set the tag as the index and the years as the columns,
# sort by the number of NAs (fewest first) like before, display 10 lines, and
# replace NAs by blanks ("")

median_yearly_rating = __________

# display it
median_yearly_rating.__________

In [None]:
# Try it: Create the table of the count of ratings
# We're only interested in the tags that start with 'com', e.g. 'comic', 'computer', etc.
# set the year as the index and the tags as the columns
# display all lines sorted by year
# replace NAs by "-"
count_yearly_rating = __________

# display it
count_yearly_rating.___________