<h1><center> PPOL 5203 Data Science I: Foundations <br><br> 
<font color='grey'> Miscellaneous in Pandas<br><br>
Tiago Ventura</font></center></h1>

---

**In this Notebook we cover some miscellaneous data wrangling techniques:**

- Piping operations in Pandas.
- Dealing with missing values and imputation.



## Setup

In [4]:
import pandas as pd
import numpy as np

## Piping

If you started your data science career in R, or are learning R alongside Python, you have likely been exposed to the `tidyverse` world and its famous `pipe`. Here it is for you:

![](https://magrittr.tidyverse.org/logo.png)


Pipe operators, introduced by the `magrittr` package in R, transformed the way people write code in R, and for good reasons. In the past few years, piping has also become more prevalent in Python.

Pipe operators allow you to:

- **Chain together data manipulations in a single operational sequence.**

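In pandas, piping means chaining methods so that a sequence of operations reads as a single workflow: each method returns a new object, and the next method is called on that result. For steps that are not built-in methods, the `.pipe()` method (available since version 0.16.2) lets you insert your own functions into the chain.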
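As a minimal sketch of how `.pipe()` fits into a chain, the example below passes a DataFrame through a user-written function. The data is a made-up toy frame (not the World Cup file), and `count_by` is a hypothetical helper introduced only for illustration.

```python
import pandas as pd

# Toy data, made up for illustration (not the World Cup file)
df = pd.DataFrame({"team": ["A", "B", "A", "C", "A", "B"]})

def count_by(data, col, name="n"):
    """Hypothetical helper: count the rows for each value of `col`."""
    return data.groupby(col).size().reset_index(name=name)

counts = (
    df
    .pipe(count_by, "team")                # same as count_by(df, "team")
    .sort_values(by="n", ascending=False)  # most frequent first
    .reset_index(drop=True)
)

print(counts)
```

Here `.pipe(count_by, "team")` hands the DataFrame to `count_by` as its first argument, so a custom function can sit in the chain next to built-in methods.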

Let's see some examples using the World Cup data. We want to find which countries have played the most World Cup games.

In [None]:
# read world cup data
wc = pd.read_csv("WorldCupMatches.csv")

#### _Method 1_: sequentially overwrite the object

In [25]:
wc = pd.read_csv("WorldCupMatches.csv")

# select columns
wc = wc.filter(['Year','Home Team Name','Away Team Name'])

# make it tidy
wc = wc.melt(id_vars=["Year"], var_name="var", value_name="country")

# group by
wc = wc.groupby("country")

# count occurrences
wc = wc.size()

# move index to column with new name
wc = wc.reset_index(name="n")

# sort
wc = wc.sort_values(by="n", ascending=False)

# print 10
wc.head(10)

Unnamed: 0,country,n
7,Brazil,108
39,Italy,83
2,Argentina,81
25,England,62
29,Germany FR,62
26,France,61
66,Spain,59
45,Mexico,54
47,Netherlands,54
74,Uruguay,52
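To see why the "make it tidy" step matters, here is a small sketch of what `melt` does before the count. The three matches are made up for illustration; only the column names are taken from the code above.

```python
import pandas as pd

# Three made-up matches, using the same column names as the code above
toy = pd.DataFrame({
    "Year": [1930, 1930, 1934],
    "Home Team Name": ["Uruguay", "Argentina", "Italy"],
    "Away Team Name": ["Argentina", "France", "Uruguay"],
})

# Wide to long: one row per team per match instead of one row per match
tidy = toy.melt(id_vars=["Year"], var_name="var", value_name="country")
print(tidy)

# Home and away appearances now live in a single column, so one groupby counts both
n_games = tidy.groupby("country").size()
print(n_games)
```

Without the `melt` step, home and away appearances would sit in two separate columns and would have to be counted separately and then added together.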


#### _Method 2_: Pandas Pipe

In [27]:
wc = pd.read_csv("WorldCupMatches.csv")

# chain all steps: select columns, make it tidy, count by country, sort
wc_10 = (
    wc
    .filter(['Year','Home Team Name','Away Team Name'])
    .melt(id_vars=["Year"], var_name="var", value_name="country")
    .groupby("country")
    .size()
    .reset_index(name="n")
    .sort_values(by="n", ascending=False)
)

# print 10
wc_10.head(10)

Unnamed: 0,country,n
7,Brazil,108
39,Italy,83
2,Argentina,81
25,England,62
29,Germany FR,62
26,France,61
66,Spain,59
45,Mexico,54
47,Netherlands,54
74,Uruguay,52


#### Notice that a chain written on a single line would also work

But it is not nice code to read!

In [28]:
wc.filter(['Year','Home Team Name','Away Team Name']).melt(id_vars=["Year"], var_name="var", value_name="country"). groupby("country"). size(). reset_index(name="n").sort_values(by="n", ascending=False).head(10)
        

Unnamed: 0,country,n
7,Brazil,108
39,Italy,83
2,Argentina,81
25,England,62
29,Germany FR,62
26,France,61
66,Spain,59
45,Mexico,54
47,Netherlands,54
74,Uruguay,52


#### Final notes on Piping

<div class="alert alert-block alert-info">

To understand pipes, you should always remember: **data in, data out**. That's what pipes do: apply methods sequentially to `pandas dataframes`. 
    
</div>


It should be clear by now, but these are some of the advantages of piping your code:

- Improves code structure
- Eliminates intermediate variables
- Can improve code readability
- Can reduce memory use, since intermediate results are not kept alive by variable names
- Makes it easier to add steps anywhere in the sequence of operations

Keep in mind that pipes also have some disadvantages. In my view, especially when working in a notebook, where you do not execute code line by line, pipes can make code a bit harder to debug and can hide errors that you would spot more easily by examining intermediate variables.
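One possible way to soften the debugging problem is to inspect intermediate results without breaking the chain. The sketch below assumes a hypothetical helper, `peek`, and a made-up toy frame; it is an illustration of the idea, not a standard pandas function.

```python
import pandas as pd

def peek(data, label=""):
    """Hypothetical helper: print the shape, then return the data unchanged."""
    print(f"{label}: shape={data.shape}")
    return data

# Toy data, made up for illustration
df = pd.DataFrame({"team": ["A", "B", "A", "C"], "goals": [1, 0, 3, 2]})

result = (
    df
    .pipe(peek, "raw")
    .loc[lambda d: d["goals"] > 0]       # keep rows with at least one goal
    .pipe(peek, "after filter")
    .groupby("team")["goals"].sum()      # total goals per team
    .reset_index(name="goals")
    .pipe(peek, "after groupby")
)
```

Because `peek` returns its input untouched, the calls can be added or removed at any point in the chain without changing the final result.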