<h1><center> PPOL 5203 Data Science I: Foundations <br><br> 
<font color='grey'> Data Wrangling and Tidy Data in Pandas<br><br>
Tiago Ventura</font></center></h1> 

---

**In this Notebook:**

We will first discuss the concept of `Tidy Data` and reshaping methods using `pandas`.


Building on that, we will cover standard data wrangling methods using pandas:

- Selecting Methods.
- Filtering Methods.
- Grouping and Summarization.
- Recoding Variables.
- Reshaping data in Pandas.


## Setup

In [1]:
import pandas as pd
import numpy as np

## Tidy Data

Data can be organized in many different ways and can target many different concepts. Having consistent and well-established procedures for data organization will make your workflow faster, more efficient, and less prone to errors. 

For tasks related to data visualization and data wrangling, we will organize our datasets following the `tidy data format` proposed by [Hadley Wickham in his 2014 article](https://vita.had.co.nz/papers/tidy-data.pdf). 

Consider the following 4 ways to organize the same data (_example pulled from [R4DS](https://r4ds.had.co.nz/tidy-data.html)_).

**<center> Example 1 </center>**

| Country     |Year| Cases | Population|
|:-----------:|:--:|:-----:|:---------:|
|Afghanistan  |1999|    745|   19987071|
|Afghanistan  |2000|   2666|   20595360|
|Brazil       |1999|  37737|  172006362|
|Brazil       |2000|  80488|  174504898|
|China        |1999| 212258| 1272915272|
|China        |2000| 213766| 1280428583|

**<center> Example 2 </center>**

|country      |year |type      |     count|
|:-----------:|:--:|:-----:|:---------:|
|Afghanistan  |1999 |cases     |       745|
|Afghanistan  |1999 |population|  19987071|
|Afghanistan  |2000 |cases     |      2666|
|Afghanistan  |2000 |population|  20595360|
|Brazil       |1999 |cases     |     37737|
|Brazil       |1999 |population| 172006362|

**<center> Example 3 </center>**

|country      |year| rate             |
|:-----------:|:--:|:-----:|
|Afghanistan  |1999| 745/19987071     |
|Afghanistan  |2000| 2666/20595360    |
|Brazil       |1999| 37737/172006362  |
|Brazil       |2000| 80488/174504898  |
|China        |1999| 212258/1272915272|
|China        |2000| 213766/1280428583|

**<center> Example 4 </center>**

| country     |`1999` |`2000`|
|:-----------:|:--:|:-----:|
| Afghanistan |   745 |  2666|
| Brazil      | 37737 | 80488|
| China       |212258 |213766|
    
    
    
| country     |    `1999`|     `2000`|
|:-----------:|:--:|:-----:|
| Afghanistan |  19987071|   20595360|
| Brazil      | 172006362|  174504898|
| China       |1272915272| 1280428583|

<div class="alert alert-block alert-info">

**Of the data examples outlined above, only the first is `tidy` under the definition given below.**
</div>

### What makes a dataset tidy?

Three interrelated rules make a dataset **tidy**:
<br><br>

1. **Each variable must have its own column.**


2. **Each observation must have its own row.**


3. **Each value must have its own cell.**

![Image drawn from Grolemund and Wickham 2017](https://d33wubrfki0l68.cloudfront.net/6f1ddb544fc5c69a2478e444ab8112fb0eea23f8/91adc/images/tidy-1.png)


### Why Tidy?

There are many reasons why the `tidy format` facilitates data analysis:  

- It facilitates split-apply-combine analysis.
- It takes full advantage of `pandas` vectorized operations over columns.
- It allows apply operations over the units of your data.
- It fits well with the grammar of graphics approach to visualization.
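
As a rough illustration of the first two points, the sketch below rebuilds Example 1 by hand (the name `tidy` and the derived column `rate_per_10k` are chosen here for illustration only) and runs one vectorized column operation and one split-apply-combine step on it.

```python
import pandas as pd

# Hand-typed copy of Example 1: one row per country-year, one column per variable
tidy = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan", "Brazil", "Brazil", "China", "China"],
    "year": [1999, 2000, 1999, 2000, 1999, 2000],
    "cases": [745, 2666, 37737, 80488, 212258, 213766],
    "population": [19987071, 20595360, 172006362, 174504898, 1272915272, 1280428583],
})

# Vectorized operation over whole columns: cases per 10,000 people
tidy["rate_per_10k"] = tidy["cases"] / tidy["population"] * 10_000

# Split-apply-combine: split by year, sum the cases, combine into one Series
cases_by_year = tidy.groupby("year")["cases"].sum()
print(cases_by_year)
```

Neither step needs any reshaping first, because every variable already sits in its own column.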

<div class="alert alert-block alert-info">

You don't need to be convinced of these advantages in theory. You will get a taste of working with tidy data in practice. **The most important thing for now is to follow a consistent strategy for organizing your datasets, and to apply it across the board!**
    
</div>    

## Using `pandas` to tidy your data. 

Most often you will encounter untidy datasets. A huge portion of your time as a data scientist will consist of applying tidying procedures to your dataset before starting any analysis or modeling. Let's learn some `pandas` methods for it!

Let's first create each of the example datasets we saw above.

In [67]:
base_url = "https://github.com/byuidatascience/data4python4ds/raw/master/data-raw/"
table1 = pd.read_csv("{}table1/table1.csv".format(base_url))
table2 = pd.read_csv("{}table2/table2.csv".format(base_url))
table3 = pd.read_csv("{}table3/table3.csv".format(base_url))
# header=0 replaces the file's own header row with the names given here;
# without it, the original header would be read as the first row of data
table4a = pd.read_csv("{}table4a/table4a.csv".format(base_url), header=0, names=["country", "cases_1999", "cases_2000"])
table4b = pd.read_csv("{}table4b/table4b.csv".format(base_url), header=0, names=["country", "population_1999", "population_2000"])
table5 = pd.read_csv("{}table5/table5.csv".format(base_url), dtype='object')

For most real analyses, you will need to resolve one of three common problems to tidy your data:

- One variable might be spread across multiple columns (`pd.melt()`).

- One observation might be scattered across multiple rows (`pd.pivot_table()`).

- A cell might contain weird, nonsensical, or missing values. 

We will focus on the first two cases, as the last requires a mix of data cleaning skills that are spread over different notebooks.

#### `pd.melt()` from wide to long

Requires: 

- The identifier columns, which you wish to keep in the dataset as they are (`id_vars`). 

- A string naming the new column that will hold the old column names (`var_name`). 

- A string naming the new column that will hold the values (`value_name`).


In [14]:
# untidy - wide
print(table4a)

# tidy - long
table4a.melt(id_vars=['country'], var_name = "year", value_name = "cases")

       country  cases_1999  cases_2000
0  Afghanistan         745        2666
1       Brazil       37737       80488
2        China      212258      213766


Unnamed: 0,country,year,cases
0,Afghanistan,cases_1999,745
1,Brazil,cases_1999,37737
2,China,cases_1999,212258
3,Afghanistan,cases_2000,2666
4,Brazil,cases_2000,80488
5,China,cases_2000,213766


![https://r4ds.had.co.nz/tidy-data.html](https://d33wubrfki0l68.cloudfront.net/3aea19108d39606bbe49981acda07696c0c7fcd8/2de65/images/tidy-9.png)
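
After melting, the `year` column still carries the old column names (`cases_1999`, `cases_2000`). A minimal sketch of one way to finish the job, using a hand-typed copy of the wide table (the names `wide` and `long_df` are illustrative):

```python
import pandas as pd

# Hand-typed copy of the wide table, with the `cases_<year>` column names used above
wide = pd.DataFrame({
    "country": ["Afghanistan", "Brazil", "China"],
    "cases_1999": [745, 37737, 212258],
    "cases_2000": [2666, 80488, 213766],
})

long_df = wide.melt(id_vars=["country"], var_name="year", value_name="cases")

# Strip the "cases_" prefix and store the year as an integer
long_df["year"] = long_df["year"].str.replace("cases_", "", regex=False).astype(int)

# Sort so that each country's years sit together
long_df = long_df.sort_values(["country", "year"]).reset_index(drop=True)
print(long_df)
```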

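#### `pd.pivot()` from long to wide (but tidy)

`pivot()` is the opposite of `melt()`. Think of it as widening a column that stacks several variables. The example below uses `.pivot_table()`, which takes the same three arguments (and averages the values by default if an index/column combination appears more than once). It requires: 

- The index: the column(s) that identify each row of the new dataset.

- The column whose values will be spread across multiple new columns. 

- The column with the values that fill the cells of the new columns. 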

In [23]:
#untidy
print(table2)

#tidy
table2.pivot_table(
    index = ['country', 'year'], 
    columns = 'type', 
    values = 'count')

        country  year        type       count
0   Afghanistan  1999       cases         745
1   Afghanistan  1999  population    19987071
2   Afghanistan  2000       cases        2666
3   Afghanistan  2000  population    20595360
4        Brazil  1999       cases       37737
5        Brazil  1999  population   172006362
6        Brazil  2000       cases       80488
7        Brazil  2000  population   174504898
8         China  1999       cases      212258
9         China  1999  population  1272915272
10        China  2000       cases      213766
11        China  2000  population  1280428583


Unnamed: 0_level_0,type,cases,population
country,year,Unnamed: 2_level_1,Unnamed: 3_level_1
Afghanistan,1999,745,19987071
Afghanistan,2000,2666,20595360
Brazil,1999,37737,172006362
Brazil,2000,80488,174504898
China,1999,212258,1272915272
China,2000,213766,1280428583


![](https://d33wubrfki0l68.cloudfront.net/8350f0dda414629b9d6c354f87acf5c5f722be43/bcb84/images/tidy-8.png)
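
The pivoted result above keeps `country` and `year` in the row index. If you prefer them as ordinary columns, one option is sketched below on a hand-typed subset of `table2` (the names `long_df` and `wide_df` are illustrative; passing a list to `index` in `.pivot()` assumes pandas 1.1 or later).

```python
import pandas as pd

# Hand-typed subset of table2 (long format: one row per country-year-type)
long_df = pd.DataFrame({
    "country": ["Afghanistan"] * 4 + ["Brazil"] * 4,
    "year": [1999, 1999, 2000, 2000] * 2,
    "type": ["cases", "population"] * 4,
    "count": [745, 19987071, 2666, 20595360, 37737, 172006362, 80488, 174504898],
})

wide_df = (
    long_df
    .pivot(index=["country", "year"], columns="type", values="count")
    .reset_index()               # move country/year back to ordinary columns
    .rename_axis(columns=None)   # drop the leftover "type" label on the column axis
)
print(wide_df)
```

Unlike `.pivot_table()`, `.pivot()` does not aggregate: it raises an error if an index/column combination is duplicated, which can be a useful check that each observation really is unique.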

## String methods with `pandas`

To tidy Example 3, we will use a combination of `pandas` and string methods. This combination comes up very often in data cleaning tasks.

In [24]:
# we need to separate rate
table3

Unnamed: 0,country,year,rate
0,Afghanistan,1999,745/19987071
1,Afghanistan,2000,2666/20595360
2,Brazil,1999,37737/172006362
3,Brazil,2000,80488/174504898
4,China,1999,212258/1272915272
5,China,2000,213766/1280428583


#### `str.split()`: to split strings in multiple elements

In [41]:
# simple string
hello, world = "hello/world".split("/")
print(hello)
print(world)

# with pandas: expand=True returns one column per piece
table3[["cases","population"]]=table3['rate'].str.split("/",  expand=True)
table3


hello
world


Unnamed: 0,country,year,rate,cases,population
0,Afghanistan,1999,745/19987071,745,19987071
1,Afghanistan,2000,2666/20595360,2666,20595360
2,Brazil,1999,37737/172006362,37737,172006362
3,Brazil,2000,80488/174504898,80488,174504898
4,China,1999,212258/1272915272,212258,1272915272
5,China,2000,213766/1280428583,213766,1280428583
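
Note that `str.split()` returns strings, so `cases` and `population` are still text at this point. A minimal sketch of the conversion step, on a hand-typed subset of `table3` (the name `t3` is illustrative):

```python
import pandas as pd

# Hand-typed subset of table3
t3 = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan", "Brazil"],
    "year": [1999, 2000, 1999],
    "rate": ["745/19987071", "2666/20595360", "37737/172006362"],
})

# Split into two string columns, then convert both to integers
t3[["cases", "population"]] = t3["rate"].str.split("/", expand=True).astype(int)

# With numeric columns, the rate can be computed directly
t3["rate"] = t3["cases"] / t3["population"]
print(t3.dtypes)
```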


String methods are very helpful when cleaning your data.

In [68]:
# all to upper
table3["country_upper"]=table3["country"].str.upper()

# length
table3["country_length"]=table3["country"].str.len()

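# find (returns the start position of the match, or -1 if absent)
table3["country_brazil"]=table3["country"].str.find("Brazil")


# replace (overwrites the column created by find above)
table3["country_brazil"]=table3["country"].str.replace("Brazil", "BR")

# extract (raw string so that \d reaches the regex engine unchanged)
table3[["century", "years"]] = table3["year"].astype(str).str.extract(r"(\d{2})(\d{2})")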

# see all
table3

Unnamed: 0,country,year,rate,country_upper,country_length,country_brazil,century,years
0,Afghanistan,1999,745/19987071,AFGHANISTAN,11,Afghanistan,19,99
1,Afghanistan,2000,2666/20595360,AFGHANISTAN,11,Afghanistan,20,00
2,Brazil,1999,37737/172006362,BRAZIL,6,BR,19,99
3,Brazil,2000,80488/174504898,BRAZIL,6,BR,20,00
4,China,1999,212258/1272915272,CHINA,5,China,19,99
5,China,2000,213766/1280428583,CHINA,5,China,20,00


## Data Wrangling in `pandas`


Since you are also learning R throughout DSPP, here is an overview of the main data wrangling functions in R (using the tidyverse) and in Python (using `pandas`). 

**<center>Main (tidy) Data Wrangling Functions </center>**

|   [`pandas`](https://pandas.pydata.org/)      |   [`dplyr`](https://dplyr.tidyverse.org/)      |     Description     |
|:---------------:|:-------------:|:-----------------------------|
| `.filter()`     | `select()`    | select column variables/index |
| `.drop()`       | `select()`    | drop selected column variables/index |
| `.rename()`     | `rename()`    | rename column variables/index |
| `.query()`      | `filter()`    | row-wise subset of a data frame by the values of a column variable/index |
| `.assign()`     |`mutate()`    | Create a new variable on the existing data frame |
| `.sort_values()`| `arrange()`   | Arrange all data values along a specified (set of) column variable(s)/indices |
| `.groupby()`    |  `group_by()`  | Index data frame by specific (set of) column variable(s)/index value(s)|
| `.agg()`        |  `summarize()` | aggregate data by specific function rules |
| `.pivot_table()`        | `spread()` | cast the data from a "long" to a "wide" format |
| `pd.melt()`        | `gather()` | cast the data from a "wide" to a "long" format |
| `.()`            | `%>%`          | piping, fluent programming, or passing one function's output to the next |


If you want to fully embrace the tidyverse style from R in Python, you should check the [`dfply` module](https://github.com/kieferk/dfply). This module offers an alternative approach to data wrangling in Python and mirrors the popular tidyverse functionality from R. 

We will not cover `dfply` in class because I believe you should master `pandas` as data scientists who are fluent in Python. However, feel free to learn it and even use it in your homework and assignments. 

In [4]:
# load worldcup dataset
wc = pd.read_csv("WorldCups.csv")

###  Column-Wise Operations

For data wrangling tasks on the columns of your data frame, we will discuss: 

- Select columns
- Drop  columns
- Generate new columns
- Rename columns


### Select Columns
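
A minimal sketch of common ways to select columns. It uses a small hand-typed stand-in for the World Cup data (`wc_toy` is a hypothetical name, with values copied from the first rows shown elsewhere in this notebook), so it runs without `WorldCups.csv`.

```python
import pandas as pd

# Hypothetical stand-in for the first rows of the World Cup data
wc_toy = pd.DataFrame({
    "Year": [1930, 1934, 1938],
    "Country": ["Uruguay", "Italy", "France"],
    "Winner": ["Uruguay", "Italy", "Italy"],
    "GoalsScored": [70, 70, 84],
    "MatchesPlayed": [18, 17, 18],
})

# A list of names inside brackets returns a data frame with those columns
by_brackets = wc_toy[["Year", "Winner"]]

# .loc with a label slice: both ends are inclusive
by_loc = wc_toy.loc[:, "Year":"Winner"]

# .filter() takes a list of names, which makes it convenient in method chains
by_filter = wc_toy.filter(["Year", "Winner"])

# .filter() can also match column names with a regular expression
by_regex = wc_toy.filter(regex="^(Goals|Matches)")
```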

### Drop Columns
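
A minimal sketch of dropping columns with `.drop()`, again on a hand-typed stand-in for the World Cup data (`wc_toy` is a hypothetical name).

```python
import pandas as pd

# Hypothetical stand-in for the first rows of the World Cup data
wc_toy = pd.DataFrame({
    "Year": [1930, 1934, 1938],
    "Country": ["Uruguay", "Italy", "France"],
    "Winner": ["Uruguay", "Italy", "Italy"],
    "GoalsScored": [70, 70, 84],
    "MatchesPlayed": [18, 17, 18],
})

# .drop() returns a new data frame; the original is left untouched
dropped = wc_toy.drop(columns=["GoalsScored", "MatchesPlayed"])
print(dropped.columns.tolist())
```

Because `.drop()` returns a copy, assign the result to a name (or keep chaining) if you want to keep it.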

### Transforming columns

**Functionality:**

- Create a new column/index given inputs and or transformations from other columns. 

**Implementation**

- Traditional index assignment. Advantages:
  
    + Looks like a dictionary operation
    + Overwrites the data frame

- `.assign()` method. Advantages:
    
    + It returns a dataframe so you can chain/pipe operations
    + Can create multiple variables in a single call
    + Easy to combine with numpy + lambda functions
    + Improves readability.
    
Let's see examples of both methods:


#### Transformation via index assignment

In [8]:
# With built in math operations
wc["av_goals_matches"] = wc["GoalsScored"]/wc["MatchesPlayed"]
wc.head(5)

Unnamed: 0,Year,Country,Winner,Runners-Up,Third,Fourth,GoalsScored,QualifiedTeams,MatchesPlayed,Attendance,av_goals_matches
0,1930,Uruguay,Uruguay,Argentina,USA,Yugoslavia,70,13,18,590.549,3.888889
1,1934,Italy,Italy,Czechoslovakia,Germany,Austria,70,16,17,363.000,4.117647
2,1938,France,Italy,Hungary,Brazil,Sweden,84,15,18,375.700,4.666667
3,1950,Brazil,Uruguay,Brazil,Sweden,Spain,88,13,22,1.045.246,4.0
4,1954,Switzerland,Germany FR,Hungary,Austria,Uruguay,140,16,26,768.607,5.384615


In [10]:
## with numpy
wc["winner_and_hoster"] = np.where(wc.Country==wc.Winner, True, False)
wc[["Year", "Winner", "Country", "winner_and_hoster"]].head(5)

Unnamed: 0,Year,Winner,Country,winner_and_hoster
0,1930,Uruguay,Uruguay,True
1,1934,Italy,Italy,True
2,1938,Italy,France,False
3,1950,Uruguay,Brazil,False
4,1954,Germany FR,Switzerland,False


In [17]:
# with an apply method + function
wc["av_goals_matches"] = wc.apply(lambda x: x["GoalsScored"]/x["MatchesPlayed"], axis=1)
wc.head()

Unnamed: 0,Year,Country,Winner,Runners-Up,Third,Fourth,GoalsScored,QualifiedTeams,MatchesPlayed,Attendance,av_goals_matches,winner_and_hoster
0,1930,Uruguay,Uruguay,Argentina,USA,Yugoslavia,70,13,18,590.549,3.888889,True
1,1934,Italy,Italy,Czechoslovakia,Germany,Austria,70,16,17,363.000,4.117647,True
2,1938,France,Italy,Hungary,Brazil,Sweden,84,15,18,375.700,4.666667,False
3,1950,Brazil,Uruguay,Brazil,Sweden,Spain,88,13,22,1.045.246,4.0,False
4,1954,Switzerland,Germany FR,Hungary,Austria,Uruguay,140,16,26,768.607,5.384615,False


#### `.assign()` method.

Allows you to create variables within a method chain. 

In [21]:
## when was the winner also hosting the World Cup?
(wc.
 # multiple variables
 assign(final=wc.Winner + " vs " + wc['Runners-Up'],
        winner_and_hoster_np = np.where(wc.Winner==wc.Country, True, False), 
        av_goals_matches = lambda x: x["GoalsScored"]/x["MatchesPlayed"], 
        ).
  #allows for methods chaining
 filter(["Year", "Winner", "Country", "final", "winner_and_hoster_np", "av_goals_matches"]).
 head(5))

Unnamed: 0,Year,Winner,Country,final,winner_and_hoster_np,av_goals_matches
0,1930,Uruguay,Uruguay,Uruguay vs Argentina,True,3.888889
1,1934,Italy,Italy,Italy vs Czechoslovakia,True,4.117647
2,1938,Italy,France,Italy vs Hungary,False,4.666667
3,1950,Uruguay,Brazil,Uruguay vs Brazil,False,4.0
4,1954,Germany FR,Switzerland,Germany FR vs Hungary,False,5.384615


**Alert:**
    
Spend a few seconds looking at the lambda function in the code above. A function passed to `.assign()` receives the data frame as it exists at that point in the chain, which is what makes `.assign()` resemble `mutate` in R. In this example every input column already exists in `wc`, so the lambda is optional: referencing `wc` directly returns the same result.


In [23]:
(wc.
 # multiple variables
 assign(final=wc.Winner + " vs " + wc['Runners-Up'],
        winner_and_hoster_np = np.where(wc.Winner==wc.Country, True, False), 
        av_goals_matches = wc["GoalsScored"]/wc["MatchesPlayed"], 
        ).
  #allows for methods chaining
 filter(["Year", "Winner", "Country", "final", "winner_and_hoster_np", "av_goals_matches"]).
 head(5))

Unnamed: 0,Year,Winner,Country,final,winner_and_hoster_np,av_goals_matches
0,1930,Uruguay,Uruguay,Uruguay vs Argentina,True,3.888889
1,1934,Italy,Italy,Italy vs Czechoslovakia,True,4.117647
2,1938,Italy,France,Italy vs Hungary,False,4.666667
3,1950,Uruguay,Brazil,Uruguay vs Brazil,False,4.0
4,1954,Germany FR,Switzerland,Germany FR vs Hungary,False,5.384615


`.assign` also allows you to use newly created variables in the same chain. To do that, you need to use a lambda function.

In [30]:
# Notice calling the recently created variable final
(wc.
 assign(final= wc.Winner + " vs " + wc['Runners-Up'],
        best_three = lambda x: x["final"] + "in" + x["Third"]).
 filter(["best_three","final", "Country"])
)

Unnamed: 0,best_three,final,Country
0,Uruguay vs ArgentinainUSA,Uruguay vs Argentina,Uruguay
1,Italy vs CzechoslovakiainGermany,Italy vs Czechoslovakia,Italy
2,Italy vs HungaryinBrazil,Italy vs Hungary,France
3,Uruguay vs BrazilinSweden,Uruguay vs Brazil,Brazil
4,Germany FR vs HungaryinAustria,Germany FR vs Hungary,Switzerland
5,Brazil vs SwedeninFrance,Brazil vs Sweden,Sweden
6,Brazil vs CzechoslovakiainChile,Brazil vs Czechoslovakia,Chile
7,England vs Germany FRinPortugal,England vs Germany FR,England
8,Brazil vs ItalyinGermany FR,Brazil vs Italy,Mexico
9,Germany FR vs NetherlandsinPoland,Germany FR vs Netherlands,Germany


<div class="alert alert-block alert-danger">

**Alert:** Spend a few seconds trying to understand the use of the lambda function in the code above. 

</div>   

The combination of a lambda (or a regular function) and `.assign()` resembles `mutate` in R. A function passed to `.assign()` receives the **current state of the dataframe**, including columns created by earlier arguments in the same call. Because `final` is created first, the lambda can access it and build on it. The same mechanism applies in chains that filter observations or otherwise transform the dataframe before `.assign()` is called.

To make this point clear, see how the same operation without a lambda function throws an error, because `wc` itself has no `final` column: 


In [32]:
# will throw an error
(wc.
 assign(final= wc.Winner + " vs " + wc['Runners-Up'],
        best_three = wc["final"] + "in" + wc["Third"]).
 filter(["best_three","final", "Country"])
)

KeyError: 'final'
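
The same logic matters when a chain filters rows before creating a column. The sketch below uses a small hand-typed stand-in (`wc_toy` and `share_of_goals` are hypothetical names) to show that the lambda sees the filtered data frame rather than the original one.

```python
import pandas as pd

# Hypothetical stand-in for the first rows of the World Cup data
wc_toy = pd.DataFrame({
    "Year": [1930, 1934, 1938, 1950, 1954],
    "GoalsScored": [70, 70, 84, 88, 140],
    "MatchesPlayed": [18, 17, 18, 22, 26],
})

result = (
    wc_toy
    .query("Year >= 1938")
    # `d` is the filtered frame (3 rows), not the original `wc_toy` (5 rows)
    .assign(share_of_goals=lambda d: d["GoalsScored"] / d["GoalsScored"].sum())
)
print(result)
```

Writing `wc_toy["GoalsScored"].sum()` inside `.assign()` instead would divide by the total over all five rows, so the shares would no longer add up to one.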

### Rename Columns 
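
A minimal sketch of `.rename()`, on a hand-typed stand-in for the World Cup data (`wc_toy` and the new column names are hypothetical).

```python
import pandas as pd

# Hypothetical stand-in for the first rows of the World Cup data
wc_toy = pd.DataFrame({
    "Year": [1930, 1934],
    "Winner": ["Uruguay", "Italy"],
    "Runners-Up": ["Argentina", "Czechoslovakia"],
    "GoalsScored": [70, 70],
})

# Pass a dictionary mapping old names to new names
renamed = wc_toy.rename(columns={"Runners-Up": "runners_up", "GoalsScored": "goals_scored"})

# Or pass a function that is applied to every column name
lowered = wc_toy.rename(columns=str.lower)
```

Names without hyphens or spaces, such as `runners_up`, can also be accessed with dot notation and used directly inside `.query()` strings.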

### Row-Wise Operations

For data wrangling tasks on the rows of your data frame, we will discuss: 

- Subsetting
- Filtering distinct values
- Recoding values 
- Grouping and summarizing
- Grouping and transforming
- Sorting values

#### subsetting
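
A minimal sketch of row subsetting with `.query()` and with a boolean mask, on a hand-typed stand-in for the World Cup data (`wc_toy` is a hypothetical name).

```python
import pandas as pd

# Hypothetical stand-in for the first rows of the World Cup data
wc_toy = pd.DataFrame({
    "Year": [1930, 1934, 1938, 1950, 1954],
    "Country": ["Uruguay", "Italy", "France", "Brazil", "Switzerland"],
    "Winner": ["Uruguay", "Italy", "Italy", "Uruguay", "Germany FR"],
    "GoalsScored": [70, 70, 84, 88, 140],
})

# .query() takes the condition as a string and chains well
high_scoring = wc_toy.query("GoalsScored > 80")

# The equivalent boolean mask
same_rows = wc_toy[wc_toy["GoalsScored"] > 80]

# Conditions can compare two columns
host_won = wc_toy.query("Country == Winner")
```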

#### filtering distinct values
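
A minimal sketch of finding distinct values, on a hand-typed stand-in for the World Cup data (`wc_toy` is a hypothetical name).

```python
import pandas as pd

# Hypothetical stand-in for the first rows of the World Cup data
wc_toy = pd.DataFrame({
    "Year": [1930, 1934, 1938, 1950, 1954],
    "Winner": ["Uruguay", "Italy", "Italy", "Uruguay", "Germany FR"],
})

# Distinct values of one column, in order of first appearance
winners = wc_toy["Winner"].unique()

# Number of distinct values
n_winners = wc_toy["Winner"].nunique()

# Keep the first row for each distinct winner (here, each team's first title)
first_titles = wc_toy.drop_duplicates(subset="Winner")
```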

#### recoding values
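
A minimal sketch of two common recoding patterns, on a hand-typed stand-in for the World Cup data (`wc_toy`, the `era` column, and its labels are hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the first rows of the World Cup data
wc_toy = pd.DataFrame({
    "Year": [1930, 1934, 1938, 1950, 1954],
    "Winner": ["Uruguay", "Italy", "Italy", "Uruguay", "Germany FR"],
})

recoded = wc_toy.assign(
    # Replace specific values using a dictionary of old -> new
    Winner=lambda d: d["Winner"].replace({"Germany FR": "Germany"}),
    # Build a two-category variable from a condition
    era=lambda d: np.where(d["Year"] < 1950, "pre-war", "post-war"),
)
print(recoded)
```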

#### grouping and summarizing
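
A minimal sketch of `.groupby()` followed by `.agg()`, on a hand-typed stand-in for the World Cup data (`wc_toy`, `titles`, and `mean_goals` are hypothetical names).

```python
import pandas as pd

# Hypothetical stand-in for the first rows of the World Cup data
wc_toy = pd.DataFrame({
    "Year": [1930, 1934, 1938, 1950, 1954],
    "Winner": ["Uruguay", "Italy", "Italy", "Uruguay", "Germany FR"],
    "GoalsScored": [70, 70, 84, 88, 140],
})

summary = (
    wc_toy
    .groupby("Winner")
    # named aggregation: new_column=(source_column, function)
    .agg(titles=("Year", "count"), mean_goals=("GoalsScored", "mean"))
    .reset_index()
)
print(summary)
```

The result has one row per group, which is what distinguishes summarizing from transforming.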

#### grouping and Transforming
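
A minimal sketch of `.groupby()` followed by `.transform()`, which returns one value per original row instead of one per group. It uses a hand-typed stand-in for the World Cup data (`wc_toy` and the new column names are hypothetical).

```python
import pandas as pd

# Hypothetical stand-in for the first rows of the World Cup data
wc_toy = pd.DataFrame({
    "Year": [1930, 1934, 1938, 1950, 1954],
    "Winner": ["Uruguay", "Italy", "Italy", "Uruguay", "Germany FR"],
    "GoalsScored": [70, 70, 84, 88, 140],
})

# Group mean, repeated on every row of the group
wc_toy["winner_mean_goals"] = wc_toy.groupby("Winner")["GoalsScored"].transform("mean")

# Deviation of each tournament from its winner's average
wc_toy["goals_vs_winner_mean"] = wc_toy["GoalsScored"] - wc_toy["winner_mean_goals"]
print(wc_toy)
```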

#### sorting values
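
A minimal sketch of `.sort_values()`, on a hand-typed stand-in for the World Cup data (`wc_toy` is a hypothetical name).

```python
import pandas as pd

# Hypothetical stand-in for the first rows of the World Cup data
wc_toy = pd.DataFrame({
    "Year": [1930, 1934, 1938, 1950, 1954],
    "Winner": ["Uruguay", "Italy", "Italy", "Uruguay", "Germany FR"],
    "GoalsScored": [70, 70, 84, 88, 140],
})

# Single column, largest first
by_goals = wc_toy.sort_values("GoalsScored", ascending=False)

# Several columns, each with its own direction:
# winners alphabetically, and within each winner the most recent year first
by_winner_year = wc_toy.sort_values(["Winner", "Year"], ascending=[True, False])
```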

### `pandas` apply methods
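
A minimal sketch of the main ways `.apply()` is used, on a hand-typed stand-in for the World Cup data (`wc_toy` and the result names are hypothetical).

```python
import pandas as pd

# Hypothetical stand-in for the first rows of the World Cup data
wc_toy = pd.DataFrame({
    "Country": ["Uruguay", "Italy", "France"],
    "GoalsScored": [70, 70, 84],
    "MatchesPlayed": [18, 17, 18],
})

# On a Series: the function is called once per element
name_length = wc_toy["Country"].apply(len)

# On a data frame (default axis=0): the function is called once per column
col_ranges = wc_toy[["GoalsScored", "MatchesPlayed"]].apply(lambda col: col.max() - col.min())

# On a data frame with axis=1: the function is called once per row
goals_per_match = wc_toy.apply(lambda row: row["GoalsScored"] / row["MatchesPlayed"], axis=1)
```

Row-wise `.apply()` is flexible but slow on large data; when a vectorized expression such as `wc_toy["GoalsScored"] / wc_toy["MatchesPlayed"]` exists, it is usually the better choice.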