# Data Structures and Processing

## Week 8: Data Wrangling with Pandas

### Remarks:

1. Press the `install requirements` button above to install the required packages.  See `requirements.txt` for the list of packages to be installed.

2. Make sure that you follow the conventions.  For example, `import pandas as pd` imports the pandas package and sets the abbreviation for it.

3. Do not import the packages without the short names.  Doing so might lead to a namespace conflict, where a function from one library is silently used in place of a function with the same name from another library.

4. We assign `None` to variables and use `pass` in the body of functions where we expect a solution from you.  Please replace these values and statements with your solution.

The exercises in this notebook are aligned with the material provided for the lecture.
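
The namespace conflict mentioned in remark 3 can be sketched with a small example. It assumes only NumPy, and the variable names are illustrative. Python's built-in `sum` and `np.sum` interpret their second positional argument differently, so a star import such as `from numpy import *` can change the result of existing code without raising an error.

```python
import builtins

import numpy as np

values = [0, 1, 2, 3, 4]

# Built-in sum: the second argument is the start value, so -1 is added.
builtin_total = builtins.sum(values, -1)

# NumPy sum: the second argument is the axis, so -1 means "last axis".
numpy_total = int(np.sum(values, -1))

print(builtin_total, numpy_total)
```

Keeping the `np.` prefix makes it visible at every call site which of the two functions is meant.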

### Load Libraries

In [70]:
import numpy as np
import pandas as pd

## MultiIndex


### Task 1

Consider the `json` file named `entertain.json` (attached) and read it into a variable `df1` using the function `pd.read_json`.

In [71]:
# Your solution goes here.
df1 = pd.read_json('entertain.json')

In [72]:
assert df1.shape == (8, 6)
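
As a rough illustration of how `pd.read_json` maps records onto a table, the sketch below parses a small JSON string instead of a file. The records are made up for the example and are not the contents of `entertain.json`. A key that is absent from a record appears as a missing value in the resulting column.

```python
from io import StringIO

import pandas as pd

# Hypothetical records; the second one has no "genre" key.
json_text = '[{"title": "A", "genre": "Drama"}, {"title": "B"}]'

demo = pd.read_json(StringIO(json_text), orient="records")

print(demo.shape)
print(demo["genre"].isna().tolist())
```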

### Task 2

We want to change the index of the imported `DataFrame`. More specifically, we would like to have a two-level index.  The two levels come from the columns named `"rating"` and `"stars"`.

Define a new `DataFrame` by the name `df2` where `"rating"` and `"stars"` are the index levels.

In [73]:
# Your solution goes here.
df2 = df1.set_index(["rating","stars"])
df2


rating,stars,title,duration,actors,genre
PG-13,7.5,Quiz Show,133,"[Ralph Fiennes, John Turturro, Rob Morrow]",
PG-13,7.6,Batman,126,"[Michael Keaton, Jack Nicholson, Kim Basinger]",Action
R,8.2,The Wolf of Wall Street,180,"[Leonardo DiCaprio, Jonah Hill, Margot Robbie]",Biography
PG,8.1,Jaws,124,"[Roy Scheider, Robert Shaw, Richard Dreyfuss]",Drama
,7.8,Belle de Jour,101,"[Catherine Deneuve, Jean Sorel, Michel Piccoli]",Drama
PG-13,7.8,As Good as It Gets,139,"[Jack Nicholson, Helen Hunt, Greg Kinnear]",Comedy
G,8.4,Toy Story 3,103,"[Tom Hanks, Tim Allen, Joan Cusack]",Animation
PG,7.4,Manhattan Murder Mystery,104,"[Woody Allen, Diane Keaton, Jerry Adler]",Comedy


In [74]:
assert df2.index[0] == ('PG-13', 7.5)
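
The effect of passing a list of column names to `set_index` can be sketched on a small frame. The frame `toy` below is a hypothetical subset built for this example. The listed columns leave the table body and become the levels of a `MultiIndex`, and a row can then be addressed with a tuple.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "rating": ["PG", "PG", "R"],
        "stars": [8.1, 7.4, 8.2],
        "title": ["Jaws", "Manhattan Murder Mystery", "The Wolf of Wall Street"],
    }
)

# The two columns are moved out of the body and become index levels.
indexed = toy.set_index(["rating", "stars"])

print(indexed.index.nlevels)
print(list(indexed.columns))
print(indexed.loc[("PG", 8.1), "title"])
```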

### Task 3

In the `DataFrame` named `df2`, defined above, we have two index levels: `"rating"` and `"stars"`, *in this order*.  We want to change

1. their order so that `"stars"` is the first level and `"rating"` is the second one. You might want to use `swaplevel`.
2. their case, i.e., make the names of the index columns uppercase.

Define a new `DataFrame` by the name `df3`, which is the same as `df2`, except that the index levels are swapped and their names are in uppercase.

In [75]:
# Your solution goes here.
df3 = df2.swaplevel(0,1)
df3


stars,rating,title,duration,actors,genre
7.5,PG-13,Quiz Show,133,"[Ralph Fiennes, John Turturro, Rob Morrow]",
7.6,PG-13,Batman,126,"[Michael Keaton, Jack Nicholson, Kim Basinger]",Action
8.2,R,The Wolf of Wall Street,180,"[Leonardo DiCaprio, Jonah Hill, Margot Robbie]",Biography
8.1,PG,Jaws,124,"[Roy Scheider, Robert Shaw, Richard Dreyfuss]",Drama
7.8,,Belle de Jour,101,"[Catherine Deneuve, Jean Sorel, Michel Piccoli]",Drama
7.8,PG-13,As Good as It Gets,139,"[Jack Nicholson, Helen Hunt, Greg Kinnear]",Comedy
8.4,G,Toy Story 3,103,"[Tom Hanks, Tim Allen, Joan Cusack]",Animation
7.4,PG,Manhattan Murder Mystery,104,"[Woody Allen, Diane Keaton, Jerry Adler]",Comedy


In [76]:
assert df3.index.names == ['stars', 'rating']
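
Both steps of this task, swapping the levels and uppercasing their names, can be sketched on a small hypothetical frame. This is one possible approach, using `rename_axis` for the names. It is shown on separate toy data because the check cell above compares against lowercase names.

```python
import pandas as pd

toy = pd.DataFrame(
    {"title": ["Jaws", "Batman"]},
    index=pd.MultiIndex.from_tuples(
        [("PG", 8.1), ("PG-13", 7.6)], names=["rating", "stars"]
    ),
)

# swaplevel returns a new object with the two index levels exchanged.
swapped = toy.swaplevel(0, 1)

# rename_axis also returns a new frame; toy keeps its original names.
upper = swapped.rename_axis(index=[name.upper() for name in swapped.index.names])

print(list(upper.index.names))
print(upper.index[0])
```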

### Task 4

We now have data with a two-level index. Let us say that we do not need this indexing and would like to recover the data in the form in which it was imported into a `DataFrame`.  There are three ways to do it:

1. Keep the original data stored in a separate variable and define a new variable with the new index.  This might lead to a memory problem when huge data is read into a `DataFrame`.

2. Reread the data from the file into the desired variable.

3. Use the built-in function for resetting the index in `pandas`.

Use the third method above to reset the index of the data stored in `df3` and put the result in the variable `df4` (we do not want to modify the existing variable).

In [77]:
# Your solution goes here.
# reset_index returns a new DataFrame, so df3 itself is left unchanged.
df4 = df3.reset_index()

In [78]:
assert len(df4.columns) == 6
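
The third method relies on `reset_index` returning a new object by default. A minimal sketch on hypothetical toy data shows that the index levels come back as ordinary columns while the original frame keeps its `MultiIndex`.

```python
import pandas as pd

toy = pd.DataFrame(
    {"title": ["Jaws", "Batman"]},
    index=pd.MultiIndex.from_tuples(
        [(8.1, "PG"), (7.6, "PG-13")], names=["stars", "rating"]
    ),
)

# The index levels become the leading columns of the new frame.
flat = toy.reset_index()

print(list(flat.columns))
print(toy.index.nlevels)
```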

### Task 5

Recall from the beginning of this set of exercises that we imported the data from a `json` file.  Hierarchical data is usually stored in this format; other formats, such as `xml`, serve the same purpose.  You might have noticed that the imported data was presented in tabular form, and that `NaN` was assigned to the entries whose values were missing.  For example, `"Quiz Show"` has `"genre"` `NaN`.  Compare this with the json file, where `"Quiz Show"` does not have any value for `"genre"`.

Recall that in `pandas`, tabular data can be converted into hierarchical data using the `stack()` method.

Create a new `DataFrame`, called `df5`, from `df3` and use the method `stack()` to present it in hierarchical form.

In [79]:
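# Your solution goes here.
# Stack df2 and swap the first two index levels so that "stars" comes first;
# this gives the same result as stacking a DataFrame indexed by (stars, rating).
df5 = df2.stack().swaplevel(0, 1)
df5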


stars  rating          
7.5    PG-13   title                                             Quiz Show
               duration                                                133
               actors           [Ralph Fiennes, John Turturro, Rob Morrow]
7.6    PG-13   title                                                Batman
               duration                                                126
               actors       [Michael Keaton, Jack Nicholson, Kim Basinger]
               genre                                                Action
8.2    R       title                               The Wolf of Wall Street
               duration                                                180
               actors       [Leonardo DiCaprio, Jonah Hill, Margot Robbie]
               genre                                             Biography
8.1    PG      title                                                  Jaws
               duration                                                124
 

In [80]:
assert df5.index[0] == (7.5, 'PG-13', 'title')
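
What `stack()` does to the shape of the data can be sketched on a small hypothetical frame without missing values. The column labels move into a new innermost index level, and the result is a `Series` with one entry per (row, column) pair.

```python
import pandas as pd

toy = pd.DataFrame(
    {"title": ["Jaws", "Batman"], "duration": [124, 126]},
    index=pd.MultiIndex.from_tuples(
        [(8.1, "PG"), (7.6, "PG-13")], names=["stars", "rating"]
    ),
)

# Each cell of the table becomes one entry of the stacked Series.
stacked = toy.stack()

print(type(stacked).__name__)
print(stacked.index.nlevels)
print(stacked.index[0])
```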

### Task 6

Consider the hierarchical object `df5` (a `Series` after `stack()`), and filter it down to all the entries with `"rating"` `"PG-13"`.  Store the result in the variable `df6`.

In [85]:
# Your solution goes here.
# Keep every entry whose second index level ("rating") equals "PG-13".
df6 = df5.loc[:, "PG-13"]
df6

stars          
7.5    title                                            Quiz Show
       duration                                               133
       actors          [Ralph Fiennes, John Turturro, Rob Morrow]
7.6    title                                               Batman
       duration                                               126
       actors      [Michael Keaton, Jack Nicholson, Kim Basinger]
       genre                                               Action
7.8    title                                   As Good as It Gets
       duration                                               139
       actors          [Jack Nicholson, Helen Hunt, Greg Kinnear]
       genre                                               Comedy
dtype: object

In [86]:
assert len(df6) == 11
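
Selecting on an inner index level can also be written with the level name instead of its position. The sketch below uses a hypothetical stacked `Series` and shows two alternatives: `xs`, which drops the selected level, and a boolean mask built with `get_level_values`, which keeps all levels.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [
        (7.6, "PG-13", "title"),
        (7.6, "PG-13", "duration"),
        (8.1, "PG", "title"),
        (8.1, "PG", "duration"),
    ],
    names=["stars", "rating", None],
)
stacked = pd.Series(["Batman", 126, "Jaws", 124], index=index)

# xs selects by level name and removes that level from the result.
by_xs = stacked.xs("PG-13", level="rating")

# A boolean mask keeps the full three-level index.
mask = stacked.index.get_level_values("rating") == "PG-13"
by_mask = stacked[mask]

print(by_xs.index.nlevels, by_mask.index.nlevels)
```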

### Task 7

Let us consider a case where we are given two `DataFrame`s with a `MultiIndex` and we would like to merge them using a column as a reference.  There are several different functions available in `pandas` for such a purpose.

Let us define two variables, `df71` and `df72`.  These `DataFrame`s help us demonstrate what is stated above.  In practice, the two `DataFrame`s may come from different sources, unlike how we have defined them here.

Define a variable `df7`, which contains a merge of `df71` and `df72` on the column `"title"`.  Make sure that the returned table inherits the `MultiIndex` and that it is not stripped away. You can do this by resetting the index and setting it back after the merge, or by using the `combine_first` function.

In [124]:
# Your solution goes here
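df71 = df2[["title", "duration"]]
df72 = df2[["genre", "title"]]

# pd.merge on a column ignores the indexes of both frames, so move the
# index levels of df71 into columns first and restore them after the merge.
df7 = (
    df71.reset_index()
    .merge(df72, on="title")
    .set_index(["rating", "stars"])
)
df7



rating,stars,title,duration,genre
PG-13,7.5,Quiz Show,133,
PG-13,7.6,Batman,126,Action
R,8.2,The Wolf of Wall Street,180,Biography
PG,8.1,Jaws,124,Drama
,7.8,Belle de Jour,101,Drama
PG-13,7.8,As Good as It Gets,139,Comedy
G,8.4,Toy Story 3,103,Animation
PG,7.4,Manhattan Murder Mystery,104,Comedy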


In [128]:
assert df7.index.names == ['rating', 'stars']
assert df7.shape == (8, 3)
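
The difference between a plain column merge and `combine_first` can be sketched on two small hypothetical frames that share a `MultiIndex`. The names `left` and `right` are illustrative. A merge on a column returns a fresh default index, whereas `combine_first` aligns the two frames on their index and therefore keeps it.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("PG", 8.1), ("PG-13", 7.6)], names=["rating", "stars"]
)
left = pd.DataFrame(
    {"title": ["Jaws", "Batman"], "duration": [124, 126]}, index=index
)
right = pd.DataFrame(
    {"genre": ["Drama", "Action"], "title": ["Jaws", "Batman"]}, index=index
)

# Merging on a column ignores both indexes; the MultiIndex is lost.
plain = pd.merge(left, right, on="title")

# combine_first aligns on the index and takes the union of the columns.
combined = left.combine_first(right)

print(list(plain.index.names))
print(list(combined.index.names))
```

Note that `combine_first` matches rows by index rather than by `"title"`, so it is only a substitute for the merge when both frames are indexed consistently, as they are here.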

### Task 8 (Bonus)

Consider the file `taxi.csv` in the attachment. Your task is to follow the discussion in the section "Reshaping and Pivoting", using the data in this file.

Explain what goes wrong. You could write your remarks as comments or in new blocks.

In [132]:
data = pd.read_csv("taxi.csv")
# data.head() 

periods = pd.PeriodIndex(data.pop('pickup_datetime'), freq="s")
data.index = periods.to_timestamp("s")
# data.head()

data.columns.name = 'item'
long_data = data.stack().reset_index().rename(columns={0: 'value'})
# long_data

# Observation: the values in pickup_datetime do not change
# when we change the frequency.



,pickup_datetime,item,value
0,2023-01-01 00:26:10,dropoff_datetime,2023-01-01 00:37:11
1,2023-01-01 00:26:10,passenger_count,1
2,2023-01-01 00:26:10,trip_distance,2.58
3,2023-01-01 00:51:03,dropoff_datetime,2023-01-01 00:57:49
4,2023-01-01 00:51:03,passenger_count,1
...,...,...,...
2986,2023-01-03 18:21:36,passenger_count,1
2987,2023-01-03 18:21:36,trip_distance,2.06
2988,2023-01-03 18:37:07,dropoff_datetime,2023-01-01 18:46:09
2989,2023-01-03 18:37:07,passenger_count,1
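
One plausible cause of trouble in the pivoting step, assuming that several trips share the same pickup second (this has not been verified against `taxi.csv`), is that `pivot` refuses to reshape when an (index, column) pair occurs more than once. The sketch below uses a hypothetical long-format frame with made-up values. `pivot_table` is one way around the problem, at the cost of aggregating the duplicates.

```python
import pandas as pd

# Hypothetical long-format data; the first two rows share a timestamp.
long_demo = pd.DataFrame(
    {
        "pickup_datetime": pd.to_datetime(
            ["2023-01-01 00:26:10", "2023-01-01 00:26:10", "2023-01-01 00:51:03"]
        ),
        "item": ["trip_distance", "trip_distance", "trip_distance"],
        "value": [2.58, 1.10, 0.75],
    }
)

# pivot needs unique (index, columns) pairs and raises otherwise.
try:
    long_demo.pivot(index="pickup_datetime", columns="item", values="value")
    pivot_error = None
except ValueError as err:
    pivot_error = str(err)

# pivot_table resolves duplicates by aggregating them, here with the mean.
wide = long_demo.pivot_table(
    index="pickup_datetime", columns="item", values="value", aggfunc="mean"
)

print(pivot_error)
print(wide.shape)
```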
