<h1><center> PPOL 5203 Data Science I: Foundations <br><br> 
<font color='grey'> Data Wrangling in Pandas<br><br>
Tiago Ventura</font></center></h1> 

---

**In this Notebook:**

In this notebook, we will cover standard data wrangling methods using pandas:

- Selecting Methods.
- Filtering Methods.
- Grouping and Summarization.
- Recoding Variables.
- Reshaping data in Pandas.


## Setup

In [3]:
import pandas as pd
import numpy as np

## Data Wrangling in `pandas`


Since you are also learning R throughout DSPP, here is an overview of the main data wrangling functions in R (using the Tidyverse) and their counterparts in Python (using `pandas`). 

**<center>Main (tidy) Data Wrangling Functions </center>**

|   [`pandas`](https://pandas.pydata.org/)      |   [`dplyr`](https://dplyr.tidyverse.org/)$^\dagger$      |     Description     |
|:---------------:|:-------------:|:-----------------------------|
| `.filter()`     | `select()`    | select column variables/index |
| `.drop()`       | `select()`    | drop selected column variables/index |
| `.rename()`     | `rename()`    | rename column variables/index |
| `.query()`      | `filter()`    | row-wise subset of a data frame by the values of a column variable/index |
| `.assign()`     |`mutate()`    | Create a new variable on the existing data frame |
| `.sort_values()`| `arrange()`   | Arrange all data values along a specified (set of) column variable(s)/indices |
| `.groupby()`    |  `group_by()`  | Index data frame by specific (set of) column variable(s)/index value(s)|
| `.agg()`        |  `summarize()` | aggregate data by specific function rules |
| `.pivot_table()`        | `spread()` | cast the data from a "long" to a "wide" format |
| `pd.melt()`        | `gather()` | cast the data from a "wide" to a "long" format |
| `.()`            | `%>%`          | piping (method chaining), or passing the output of one function to the next |


If you want to fully embrace the tidyverse style from R in Python, you should check the [`dfply` module](https://github.com/kieferk/dfply). This module offers an alternative to data wrangling in Python, and mirrors the popular tidyverse functionalities from R. 

We will not cover `dfply` in class because I believe you should master `pandas` as data scientists who are fluent in Python. However, feel free to learn it and even use it in your homework and assignments. 

In [4]:
# load worldcup dataset
wc = pd.read_csv("WorldCups.csv")
wc_matches = pd.read_csv("WorldCupMatches.csv")

###  Column-Wise Operations

For data wrangling tasks at the columns of your data frame, we will discuss: 

- Select columns
- Drop  columns
- Create new columns
- Rename columns


### Select Columns

**Functionality:**

- Select specific variables/column indices

**Implementation**

- Traditional indexing in pandas
- `.loc[]` indexer
- `.filter()` method (allows piping). 


#### Select via index

In [6]:
## simple index for columns
wc["Year"].head(3)

0    1930
1    1934
2    1938
Name: Year, dtype: int64

In [8]:
## using dot method
wc.Year.head(3)

0    1930
1    1934
2    1938
Name: Year, dtype: int64

In [31]:
## using .loc methods
wc.loc[:,"Year"].head(3)

0    1930
1    1934
2    1938
Name: Year, dtype: int64

Notice that in all cases the output comes as a pandas `Series`. If you would like the output to be a full data frame, or if you need to select multiple columns, pass a list of column names as input.

In [32]:
wc[["Winner", "Year"]].head()

Unnamed: 0,Winner,Year
0,Uruguay,1930
1,Italy,1934
2,Italy,1938
3,Uruguay,1950
4,Germany FR,1954
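To make the Series versus DataFrame distinction concrete, here is a minimal sketch. It assumes a small hand-typed frame, `toy_wc` (a hypothetical name), built from the first rows shown above rather than the full CSV.

```python
import pandas as pd

# Toy frame: hand-typed subset of the rows displayed above (an assumption, not the CSV)
toy_wc = pd.DataFrame({
    "Year": [1930, 1934, 1938],
    "Winner": ["Uruguay", "Italy", "Italy"],
})

as_series = toy_wc["Year"]     # a single label returns a Series
as_frame = toy_wc[["Year"]]    # a list with one label returns a DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

The double brackets are simply a list inside the indexing operator, which is why the result keeps its two-dimensional shape.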


#### `.loc[]` for selecting columns

It allows you to select multiple variables by slicing between column names!

In [42]:
# .loc for a single or multiple columns
wc.loc[:, "Year":"Winner"].head(3)

Unnamed: 0,Year,Country,Winner
0,1930,Uruguay,Uruguay
1,1934,Italy,Italy
2,1938,France,Italy
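One detail worth checking on a small example: label slices with `.loc` include both endpoints, while positional slices with `.iloc` stop before the end position. The sketch below uses a hypothetical toy frame, `toy_wc`, typed from the rows above.

```python
import pandas as pd

# Toy frame: hand-typed subset of the rows displayed above (an assumption, not the CSV)
toy_wc = pd.DataFrame({
    "Year": [1930, 1934],
    "Country": ["Uruguay", "Italy"],
    "Winner": ["Uruguay", "Italy"],
    "Runners-Up": ["Argentina", "Czechoslovakia"],
})

by_label = toy_wc.loc[:, "Year":"Winner"]   # label slice: "Winner" is included
by_position = toy_wc.iloc[:, 0:2]           # positional slice: position 2 is excluded

print(list(by_label.columns))
print(list(by_position.columns))
```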


#### `.filter()` method

The `filter` method in pandas works similarly to the `select` function in the Tidyverse in R. It has the following advantages: 

- Allows for a piping approach
- Can be combined with regex queries for selecting columns

In [26]:
# simple filter
wc.filter(["Year", "Winner"]).head(3)

Unnamed: 0,Year,Winner
0,1930,Uruguay
1,1934,Italy
2,1938,Italy


**Using the like parameter:** Select columns that contain a specific substring.

In [43]:
# like parameter. In R: data %>% select(contains("Away"))
wc_matches.filter(like="Away").head()

Unnamed: 0,Away Team Goals,Away Team Name,Half-time Away Goals,Away Team Initials
0,1,Mexico,0,MEX
1,0,Belgium,0,BEL
2,1,Brazil,0,BRA
3,1,Peru,0,PER
4,0,France,0,FRA


**Using regex:** You can also pass regex queries to select columns, which gives you even more flexibility.

In [39]:
# starts with
wc_matches.filter(regex="^Away").tail()

Unnamed: 0,Away Team Goals,Away Team Name,Away Team Initials
847,0,Costa Rica,CRC
848,7,Germany,GER
849,0,Argentina,ARG
850,3,Netherlands,NED
851,0,Argentina,ARG


In [41]:
# ends with
wc_matches.filter(regex="Initials$").tail()


Unnamed: 0,Home Team Initials,Away Team Initials
847,NED,CRC
848,BRA,GER
849,NED,ARG
850,BRA,NED
851,GER,ARG


### Drop Columns

**Functionality:**

Drop specific variables/column indices

**Implementation:**

- Indexing + Boolean operations
- `.loc[]` indexer
- `.drop()` method (allows piping).

In [59]:
# Indexing: Bit too much?!
# .isin returns a boolean.
wc[wc.columns[~wc.columns.isin(["Year", "Country"])]].head()

Unnamed: 0,Winner,Runners-Up,Third,Fourth,GoalsScored,QualifiedTeams,MatchesPlayed,Attendance
0,Uruguay,Argentina,USA,Yugoslavia,70,13,18,590.549
1,Italy,Czechoslovakia,Germany,Austria,70,16,17,363.000
2,Italy,Hungary,Brazil,Sweden,84,15,18,375.700
3,Uruguay,Brazil,Sweden,Spain,88,13,22,1.045.246
4,Germany FR,Hungary,Austria,Uruguay,140,16,26,768.607


In [65]:
# loc methods
wc.loc[:,~wc.columns.isin(["Year", "Country"])]

Unnamed: 0,Winner,Runners-Up,Third,Fourth,GoalsScored,QualifiedTeams,MatchesPlayed,Attendance
0,Uruguay,Argentina,USA,Yugoslavia,70,13,18,590.549
1,Italy,Czechoslovakia,Germany,Austria,70,16,17,363.000
2,Italy,Hungary,Brazil,Sweden,84,15,18,375.700
3,Uruguay,Brazil,Sweden,Spain,88,13,22,1.045.246
4,Germany FR,Hungary,Austria,Uruguay,140,16,26,768.607
5,Brazil,Sweden,France,Germany FR,126,16,35,819.810
6,Brazil,Czechoslovakia,Chile,Yugoslavia,89,16,32,893.172
7,England,Germany FR,Portugal,Soviet Union,89,16,32,1.563.135
8,Brazil,Italy,Germany FR,Uruguay,95,16,32,1.603.975
9,Germany FR,Netherlands,Poland,Brazil,97,16,38,1.865.753


In [None]:
# what is going on there? What is the difference? 

# .loc accepts the booleans returned by the .isin method
bol_col = ~wc.columns.isin(["Year", "Country"])


<div class="alert alert-block alert-warning">

`bol_col` is an array of booleans. If you pass it directly to the indexing operator, as in `wc[bol_col]`, pandas interprets it as a row mask, not a column mask: a boolean array given to `[]` always filters rows. Use `.loc[:, bol_col]` to apply the mask to the columns.
</div>    

In [70]:
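# indexing -> error
# note: the quoted string "bol_col" is looked up as a column name, hence the KeyError;
# passing the array itself, wc[bol_col], is treated as a row mask and fails too,
# because its length matches the number of columns, not the number of rows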
wc["bol_col"]

KeyError: 'bol_col'

In [73]:
# With .loc, it works
wc.loc[:,bol_col].head()

Unnamed: 0,Winner,Runners-Up,Third,Fourth,GoalsScored,QualifiedTeams,MatchesPlayed,Attendance
0,Uruguay,Argentina,USA,Yugoslavia,70,13,18,590.549
1,Italy,Czechoslovakia,Germany,Austria,70,16,17,363.000
2,Italy,Hungary,Brazil,Sweden,84,15,18,375.700
3,Uruguay,Brazil,Sweden,Spain,88,13,22,1.045.246
4,Germany FR,Hungary,Austria,Uruguay,140,16,26,768.607
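The difference between the two indexing styles can be reproduced on a small frame. This is a sketch under the assumption of a hypothetical toy frame, `toy_wc`, with 3 rows and 4 columns, so the column mask cannot be mistaken for a valid row mask.

```python
import pandas as pd

# Toy frame: hand-typed subset of the rows displayed above (an assumption, not the CSV)
toy_wc = pd.DataFrame({
    "Year": [1930, 1934, 1938],
    "Country": ["Uruguay", "Italy", "France"],
    "Winner": ["Uruguay", "Italy", "Italy"],
    "GoalsScored": [70, 70, 84],
})

# One boolean per column (length 4)
col_mask = ~toy_wc.columns.isin(["Year", "Country"])

try:
    toy_wc[col_mask]   # [] reads a boolean array as a row mask: 4 booleans for 3 rows
    failed = False
except ValueError as err:
    failed = True
    print("row-mask attempt failed:", err)

# .loc lets us say explicitly that the mask applies to the columns
kept = toy_wc.loc[:, col_mask]
print(list(kept.columns))
```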


#### `.drop()` method: easier way to go

In [76]:
wc.drop(columns=["Year"]).head()

Unnamed: 0,Country,Winner,Runners-Up,Third,Fourth,GoalsScored,QualifiedTeams,MatchesPlayed,Attendance
0,Uruguay,Uruguay,Argentina,USA,Yugoslavia,70,13,18,590.549
1,Italy,Italy,Czechoslovakia,Germany,Austria,70,16,17,363.000
2,France,Italy,Hungary,Brazil,Sweden,84,15,18,375.700
3,Brazil,Uruguay,Brazil,Sweden,Spain,88,13,22,1.045.246
4,Switzerland,Germany FR,Hungary,Austria,Uruguay,140,16,26,768.607


### Create new columns

**Functionality:**

- Create a new column/index given inputs and/or transformations from other columns. 

**Implementation**

- Traditional index assignment. Advantages:
  
    + Looks like a dictionary operation
    + Overwrites the data frame

- `.assign()` method. Advantages:
    
    + It returns a dataframe so you can chain/pipe operations
    + Can create multiple variables in a single call
    + Easy to combine with numpy + lambda functions
    + Improves readability.
    
Let's see examples of both methods:


#### Transformation via index assignment

In [8]:
# With built in math operations
wc["av_goals_matches"] = wc["GoalsScored"]/wc["MatchesPlayed"]
wc.head(5)

Unnamed: 0,Year,Country,Winner,Runners-Up,Third,Fourth,GoalsScored,QualifiedTeams,MatchesPlayed,Attendance,av_goals_matches
0,1930,Uruguay,Uruguay,Argentina,USA,Yugoslavia,70,13,18,590.549,3.888889
1,1934,Italy,Italy,Czechoslovakia,Germany,Austria,70,16,17,363.000,4.117647
2,1938,France,Italy,Hungary,Brazil,Sweden,84,15,18,375.700,4.666667
3,1950,Brazil,Uruguay,Brazil,Sweden,Spain,88,13,22,1.045.246,4.0
4,1954,Switzerland,Germany FR,Hungary,Austria,Uruguay,140,16,26,768.607,5.384615


In [10]:
## with numpy
wc["winner_and_hoster"] = np.where(wc.Country==wc.Winner, True, False)
wc[["Year", "Winner", "Country", "winner_and_hoster"]].head(5)

Unnamed: 0,Year,Winner,Country,winner_and_hoster
0,1930,Uruguay,Uruguay,True
1,1934,Italy,Italy,True
2,1938,Italy,France,False
3,1950,Uruguay,Brazil,False
4,1954,Germany FR,Switzerland,False


In [17]:
# with an apply method + function
wc["av_goals_matches"] = wc.apply(lambda x: x["GoalsScored"]/x["MatchesPlayed"], axis=1)
wc.head()

Unnamed: 0,Year,Country,Winner,Runners-Up,Third,Fourth,GoalsScored,QualifiedTeams,MatchesPlayed,Attendance,av_goals_matches,winner_and_hoster
0,1930,Uruguay,Uruguay,Argentina,USA,Yugoslavia,70,13,18,590.549,3.888889,True
1,1934,Italy,Italy,Czechoslovakia,Germany,Austria,70,16,17,363.000,4.117647,True
2,1938,France,Italy,Hungary,Brazil,Sweden,84,15,18,375.700,4.666667,False
3,1950,Brazil,Uruguay,Brazil,Sweden,Spain,88,13,22,1.045.246,4.0,False
4,1954,Switzerland,Germany FR,Hungary,Austria,Uruguay,140,16,26,768.607,5.384615,False


#### `.assign()` method.

Allows you to create variables with method chaining. 

In [21]:
## when was the winner also hosting the world cup?
(wc.
 # multiple variables
 assign(final=wc.Winner + " vs " + wc['Runners-Up'],
        winner_and_hoster_np = np.where(wc.Winner==wc.Country, True, False), 
        av_goals_matches = lambda x: x["GoalsScored"]/x["MatchesPlayed"], 
        ).
  #allows for methods chaining
 filter(["Year", "Winner", "Country", "final", "winner_and_hoster_np", "av_goals_matches"]).
 head(5))

Unnamed: 0,Year,Winner,Country,final,winner_and_hoster_np,av_goals_matches
0,1930,Uruguay,Uruguay,Uruguay vs Argentina,True,3.888889
1,1934,Italy,Italy,Italy vs Czechoslovakia,True,4.117647
2,1938,Italy,France,Italy vs Hungary,False,4.666667
3,1950,Uruguay,Brazil,Uruguay vs Brazil,False,4.0
4,1954,Germany FR,Switzerland,Germany FR vs Hungary,False,5.384615


In [77]:
(wc.
 # multiple variables
 assign(final=wc.Winner + " vs " + wc['Runners-Up'],
        winner_and_hoster_np = np.where(wc.Winner==wc.Country, True, False), 
        av_goals_matches = wc["GoalsScored"]/wc["MatchesPlayed"], 
        ).
  #allows for methods chaining
 filter(["Year", "Winner", "Country", "final", "winner_and_hoster_np", "av_goals_matches"]).
 head(5))

Unnamed: 0,Year,Winner,Country,final,winner_and_hoster_np,av_goals_matches
0,1930,Uruguay,Uruguay,Uruguay vs Argentina,True,3.888889
1,1934,Italy,Italy,Italy vs Czechoslovakia,True,4.117647
2,1938,Italy,France,Italy vs Hungary,False,4.666667
3,1950,Uruguay,Brazil,Uruguay vs Brazil,False,4.0
4,1954,Germany FR,Switzerland,Germany FR vs Hungary,False,5.384615


`.assign` also allows you to use newly created variables within the same call. To do that, you need to use a lambda function.

In [79]:
# Notice calling the recently created variable final
(wc.
 assign(final= wc.Winner + " vs " + wc['Runners-Up'],
        best_three = lambda x: x["final"] + "in" + x["Third"]).
 filter(["best_three","final", "Country"]).
 head(5)
)

Unnamed: 0,best_three,final,Country
0,Uruguay vs ArgentinainUSA,Uruguay vs Argentina,Uruguay
1,Italy vs CzechoslovakiainGermany,Italy vs Czechoslovakia,Italy
2,Italy vs HungaryinBrazil,Italy vs Hungary,France
3,Uruguay vs BrazilinSweden,Uruguay vs Brazil,Brazil
4,Germany FR vs HungaryinAustria,Germany FR vs Hungary,Switzerland


<div class="alert alert-block alert-danger">

**Alert:** Spend a few seconds trying to understand the use of the lambda function in the code above. 

</div>   

The combination of a lambda (or a regular function) and `.assign()` is a nice feature that resembles the behavior of `mutate` in R. When you pass a function to `.assign()`, pandas calls it with the **current state of the dataframe**. Because `final` was created earlier in the same call, the lambda function can access this new variable and manipulate it in sequence. This also works for cases of grouped data, or for chains in which we filter observations and then transform the dataframe.

To make this point clear, see how doing the same operation without a lambda function will throw an error: 


In [32]:
# will throw an error
(wc.
 assign(final= wc.Winner + " vs " + wc['Runners-Up'],
        best_three = wc["final"] + "in" + wc["Third"]).
 filter(["best_three","final", "Country"])
)

KeyError: 'final'
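The same mechanism matters when `.assign()` comes after a filtering step in a chain. The sketch below assumes a hypothetical toy frame, `toy_wc`, typed from the first rows of the data; the column names `av_goals` and `high_scoring` and the 4.5 threshold are illustrative choices.

```python
import pandas as pd

# Toy frame: hand-typed subset of the rows displayed above (an assumption, not the CSV)
toy_wc = pd.DataFrame({
    "Year": [1930, 1934, 1938, 1950, 1954],
    "Winner": ["Uruguay", "Italy", "Italy", "Uruguay", "Germany FR"],
    "GoalsScored": [70, 70, 84, 88, 140],
    "MatchesPlayed": [18, 17, 18, 22, 26],
})

result = (
    toy_wc
    .query("Year > 1934")   # the chain now holds only 3 rows
    .assign(
        # d is the filtered frame at this point in the chain, not toy_wc
        av_goals=lambda d: d["GoalsScored"] / d["MatchesPlayed"],
        # av_goals was created one line above, so the lambda can already see it
        high_scoring=lambda d: d["av_goals"] > 4.5,
    )
)
print(result)
```

Referring to `toy_wc` directly inside `.assign()` would use the unfiltered frame and would not know about `av_goals`, which is the situation the `KeyError` above illustrates.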

### Renaming 

**Functionality:**

Rename columns directly or via a function. 

**Implementation:**

- Use dictionaries!

In [82]:
# Pandas: renaming variables using the rename method
wc.rename(columns={"Year":"year"}).head(3)

Unnamed: 0,year,Country,Winner,Runners-Up,Third,Fourth,GoalsScored,QualifiedTeams,MatchesPlayed,Attendance
0,1930,Uruguay,Uruguay,Argentina,USA,Yugoslavia,70,13,18,590.549
1,1934,Italy,Italy,Czechoslovakia,Germany,Austria,70,16,17,363.0
2,1938,France,Italy,Hungary,Brazil,Sweden,84,15,18,375.7


In [85]:
# we can use dictionary comprehension to apply functions in all
wc.rename(columns={col: col.lower() for col in wc.columns}).head(5)

Unnamed: 0,year,country,winner,runners-up,third,fourth,goalsscored,qualifiedteams,matchesplayed,attendance
0,1930,Uruguay,Uruguay,Argentina,USA,Yugoslavia,70,13,18,590.549
1,1934,Italy,Italy,Czechoslovakia,Germany,Austria,70,16,17,363.000
2,1938,France,Italy,Hungary,Brazil,Sweden,84,15,18,375.700
3,1950,Brazil,Uruguay,Brazil,Sweden,Spain,88,13,22,1.045.246
4,1954,Switzerland,Germany FR,Hungary,Austria,Uruguay,140,16,26,768.607


In [87]:
# Or for only a subset of the columns by giving them as inputs
wc.rename(columns={col: col.lower() for col in ["Year", "Country"]}).head(5)

Unnamed: 0,year,country,Winner,Runners-Up,Third,Fourth,GoalsScored,QualifiedTeams,MatchesPlayed,Attendance
0,1930,Uruguay,Uruguay,Argentina,USA,Yugoslavia,70,13,18,590.549
1,1934,Italy,Italy,Czechoslovakia,Germany,Austria,70,16,17,363.000
2,1938,France,Italy,Hungary,Brazil,Sweden,84,15,18,375.700
3,1950,Brazil,Uruguay,Brazil,Sweden,Spain,88,13,22,1.045.246
4,1954,Switzerland,Germany FR,Hungary,Austria,Uruguay,140,16,26,768.607
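Besides dictionaries, `.rename()` also accepts a function that is applied to every column name. A minimal sketch, assuming a hypothetical toy frame `toy_wc`; the snake-case cleanup in the second call is an illustrative choice.

```python
import pandas as pd

# Toy frame: hand-typed subset of the rows displayed above (an assumption, not the CSV)
toy_wc = pd.DataFrame({
    "Year": [1930, 1934],
    "Runners-Up": ["Argentina", "Czechoslovakia"],
    "GoalsScored": [70, 70],
})

# Pass a function directly: it is called once per column name
lowered = toy_wc.rename(columns=str.lower)

# Or a lambda for a custom rule, here lowercase plus underscores instead of hyphens
cleaned = toy_wc.rename(columns=lambda c: c.lower().replace("-", "_"))

print(list(lowered.columns))
print(list(cleaned.columns))
```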


## Row-Wise Operations

For data wrangling tasks at the rows of your data frame, we will discuss: 

- Subsetting
- Filtering distinct values
- Recoding values 
- Grouping and Summarizing
- Grouping and Transforming
- Sorting values

### Subsetting

**Functionality:**

Slice the dataframe row-wise following a certain input. 

**Implementation:**

- index based implementation
- `.loc` or `.iloc` methods
- `.query`


#### Subsetting by index

In [90]:
# index based implementation
wc[wc.Year<1990]

Unnamed: 0,Year,Country,Winner,Runners-Up,Third,Fourth,GoalsScored,QualifiedTeams,MatchesPlayed,Attendance
0,1930,Uruguay,Uruguay,Argentina,USA,Yugoslavia,70,13,18,590.549
1,1934,Italy,Italy,Czechoslovakia,Germany,Austria,70,16,17,363.000
2,1938,France,Italy,Hungary,Brazil,Sweden,84,15,18,375.700
3,1950,Brazil,Uruguay,Brazil,Sweden,Spain,88,13,22,1.045.246
4,1954,Switzerland,Germany FR,Hungary,Austria,Uruguay,140,16,26,768.607
5,1958,Sweden,Brazil,Sweden,France,Germany FR,126,16,35,819.810
6,1962,Chile,Brazil,Czechoslovakia,Chile,Yugoslavia,89,16,32,893.172
7,1966,England,England,Germany FR,Portugal,Soviet Union,89,16,32,1.563.135
8,1970,Mexico,Brazil,Italy,Germany FR,Uruguay,95,16,32,1.603.975
9,1974,Germany,Germany FR,Netherlands,Poland,Brazil,97,16,38,1.865.753


In [91]:
# or with multiple condition
wc[(wc.Year<1990)&(wc.Year>1940)]

Unnamed: 0,Year,Country,Winner,Runners-Up,Third,Fourth,GoalsScored,QualifiedTeams,MatchesPlayed,Attendance
3,1950,Brazil,Uruguay,Brazil,Sweden,Spain,88,13,22,1.045.246
4,1954,Switzerland,Germany FR,Hungary,Austria,Uruguay,140,16,26,768.607
5,1958,Sweden,Brazil,Sweden,France,Germany FR,126,16,35,819.810
6,1962,Chile,Brazil,Czechoslovakia,Chile,Yugoslavia,89,16,32,893.172
7,1966,England,England,Germany FR,Portugal,Soviet Union,89,16,32,1.563.135
8,1970,Mexico,Brazil,Italy,Germany FR,Uruguay,95,16,32,1.603.975
9,1974,Germany,Germany FR,Netherlands,Poland,Brazil,97,16,38,1.865.753
10,1978,Argentina,Argentina,Netherlands,Brazil,Italy,102,16,38,1.545.791
11,1982,Spain,Italy,Germany FR,Poland,France,146,24,52,2.109.723
12,1986,Mexico,Argentina,Germany FR,France,Belgium,132,24,52,2.394.031


#### Subsetting with `.loc()`  method

In [97]:
# pretty much the same
wc.loc[(wc.Year<1990)&(wc.Year>1940),:]

Unnamed: 0,Year,Country,Winner,Runners-Up,Third,Fourth,GoalsScored,QualifiedTeams,MatchesPlayed,Attendance
3,1950,Brazil,Uruguay,Brazil,Sweden,Spain,88,13,22,1.045.246
4,1954,Switzerland,Germany FR,Hungary,Austria,Uruguay,140,16,26,768.607
5,1958,Sweden,Brazil,Sweden,France,Germany FR,126,16,35,819.810
6,1962,Chile,Brazil,Czechoslovakia,Chile,Yugoslavia,89,16,32,893.172
7,1966,England,England,Germany FR,Portugal,Soviet Union,89,16,32,1.563.135
8,1970,Mexico,Brazil,Italy,Germany FR,Uruguay,95,16,32,1.603.975
9,1974,Germany,Germany FR,Netherlands,Poland,Brazil,97,16,38,1.865.753
10,1978,Argentina,Argentina,Netherlands,Brazil,Italy,102,16,38,1.545.791
11,1982,Spain,Italy,Germany FR,Poland,France,146,24,52,2.109.723
12,1986,Mexico,Argentina,Germany FR,France,Belgium,132,24,52,2.394.031


#### Subsetting with `.query()`: a pipe approach

As before, you can use the `.query()` method for a more readable and pipeable approach. Notice that inside the quotation marks you use only the column names. 

In [98]:
wc.query('Year<1990& Year>1940')

Unnamed: 0,Year,Country,Winner,Runners-Up,Third,Fourth,GoalsScored,QualifiedTeams,MatchesPlayed,Attendance
3,1950,Brazil,Uruguay,Brazil,Sweden,Spain,88,13,22,1.045.246
4,1954,Switzerland,Germany FR,Hungary,Austria,Uruguay,140,16,26,768.607
5,1958,Sweden,Brazil,Sweden,France,Germany FR,126,16,35,819.810
6,1962,Chile,Brazil,Czechoslovakia,Chile,Yugoslavia,89,16,32,893.172
7,1966,England,England,Germany FR,Portugal,Soviet Union,89,16,32,1.563.135
8,1970,Mexico,Brazil,Italy,Germany FR,Uruguay,95,16,32,1.603.975
9,1974,Germany,Germany FR,Netherlands,Poland,Brazil,97,16,38,1.865.753
10,1978,Argentina,Argentina,Netherlands,Brazil,Italy,102,16,38,1.545.791
11,1982,Spain,Italy,Germany FR,Poland,France,146,24,52,2.109.723
12,1986,Mexico,Argentina,Germany FR,France,Belgium,132,24,52,2.394.031
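Inside a `.query()` string, Python variables from the surrounding scope can be referenced with the `@` prefix, which avoids hard-coding the cutoffs. A small sketch, assuming a hypothetical toy frame `toy_wc` and illustrative variable names `start` and `end`:

```python
import pandas as pd

# Toy frame: hand-typed subset of the rows displayed above (an assumption, not the CSV)
toy_wc = pd.DataFrame({
    "Year": [1930, 1934, 1938, 1950, 1954],
    "Winner": ["Uruguay", "Italy", "Italy", "Uruguay", "Germany FR"],
})

start, end = 1940, 1990

# @name looks up a local Python variable; bare names refer to columns
subset = toy_wc.query("Year > @start & Year < @end")
print(subset)
```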


### Other types of subsetting

#### Subset by distinct entry

In [100]:
# Pandas: drop duplicative entries for a specific variable
# notice here you are actually deleting important rows
wc.drop_duplicates("Country")

Unnamed: 0,Year,Country,Winner,Runners-Up,Third,Fourth,GoalsScored,QualifiedTeams,MatchesPlayed,Attendance
0,1930,Uruguay,Uruguay,Argentina,USA,Yugoslavia,70,13,18,590.549
1,1934,Italy,Italy,Czechoslovakia,Germany,Austria,70,16,17,363.000
2,1938,France,Italy,Hungary,Brazil,Sweden,84,15,18,375.700
3,1950,Brazil,Uruguay,Brazil,Sweden,Spain,88,13,22,1.045.246
4,1954,Switzerland,Germany FR,Hungary,Austria,Uruguay,140,16,26,768.607
5,1958,Sweden,Brazil,Sweden,France,Germany FR,126,16,35,819.810
6,1962,Chile,Brazil,Czechoslovakia,Chile,Yugoslavia,89,16,32,893.172
7,1966,England,England,Germany FR,Portugal,Soviet Union,89,16,32,1.563.135
8,1970,Mexico,Brazil,Italy,Germany FR,Uruguay,95,16,32,1.603.975
9,1974,Germany,Germany FR,Netherlands,Poland,Brazil,97,16,38,1.865.753


#### Subset by sampling

In [104]:
# Pandas: randomly sample N number of rows from the data
wc.sample(3)

Unnamed: 0,Year,Country,Winner,Runners-Up,Third,Fourth,GoalsScored,QualifiedTeams,MatchesPlayed,Attendance
0,1930,Uruguay,Uruguay,Argentina,USA,Yugoslavia,70,13,18,590.549
1,1934,Italy,Italy,Czechoslovakia,Germany,Austria,70,16,17,363.000
2,1938,France,Italy,Hungary,Brazil,Sweden,84,15,18,375.700
3,1950,Brazil,Uruguay,Brazil,Sweden,Spain,88,13,22,1.045.246
4,1954,Switzerland,Germany FR,Hungary,Austria,Uruguay,140,16,26,768.607
5,1958,Sweden,Brazil,Sweden,France,Germany FR,126,16,35,819.810
6,1962,Chile,Brazil,Czechoslovakia,Chile,Yugoslavia,89,16,32,893.172
7,1966,England,England,Germany FR,Portugal,Soviet Union,89,16,32,1.563.135
8,1970,Mexico,Brazil,Italy,Germany FR,Uruguay,95,16,32,1.603.975
9,1974,Germany,Germany FR,Netherlands,Poland,Brazil,97,16,38,1.865.753
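Because `.sample()` draws rows at random, each call can return different rows. If you need the draw to be reproducible, one option is the `random_state` argument. The sketch below assumes a hypothetical toy frame `toy_wc`; the seed value 42 is arbitrary.

```python
import pandas as pd

# Toy frame: hand-typed subset of the rows displayed above (an assumption, not the CSV)
toy_wc = pd.DataFrame({
    "Year": [1930, 1934, 1938, 1950, 1954],
    "Winner": ["Uruguay", "Italy", "Italy", "Uruguay", "Germany FR"],
})

# Same seed, same rows
first_draw = toy_wc.sample(n=3, random_state=42)
second_draw = toy_wc.sample(n=3, random_state=42)

# frac samples a share of the rows instead of a fixed number (40% of 5 rows -> 2 rows)
share = toy_wc.sample(frac=0.4, random_state=42)

print(first_draw)
print(len(share))
```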


### Recoding values

**Functionality:**

Recode a value given certain conditions. This type of transformation is one of the most important in data cleaning. 

**Implementation:**

There are many ways to recode variables in Python. We will showcase four of the most useful in my view. 

First, we will see a more general row-wise approach in pandas using: 

- `map()`
- `apply()`

Then we will see two vectorized solutions using numpy: 

- `np.where()`
- `np.select()` 


#### Recode with map() + dictionaries

The `map()` function is used to substitute each value in a **Series** with another value. 

- It takes a series as input
- Uses a function/dictionary to transform values
- Returns a series. 

In [110]:
wc

Unnamed: 0,Year,Country,Winner,Runners-Up,Third,Fourth,GoalsScored,QualifiedTeams,MatchesPlayed,Attendance
0,1930,Uruguay,Uruguay,Argentina,USA,Yugoslavia,70,13,18,590.549
1,1934,Italy,Italy,Czechoslovakia,Germany,Austria,70,16,17,363.000
2,1938,France,Italy,Hungary,Brazil,Sweden,84,15,18,375.700
3,1950,Brazil,Uruguay,Brazil,Sweden,Spain,88,13,22,1.045.246
4,1954,Switzerland,Germany FR,Hungary,Austria,Uruguay,140,16,26,768.607
5,1958,Sweden,Brazil,Sweden,France,Germany FR,126,16,35,819.810
6,1962,Chile,Brazil,Czechoslovakia,Chile,Yugoslavia,89,16,32,893.172
7,1966,England,England,Germany FR,Portugal,Soviet Union,89,16,32,1.563.135
8,1970,Mexico,Brazil,Italy,Germany FR,Uruguay,95,16,32,1.603.975
9,1974,Germany,Germany FR,Netherlands,Poland,Brazil,97,16,38,1.865.753


In [121]:
# map + dictionary to recode to create a dummy for certain country

# create a mapping dictionary
mapping = {"Brazil":1}

# map the values; winners missing from the dictionary become NaN,
# so fill them with a default value (e.g., 0)
wc["brazil_winner"] = wc["Winner"].map(mapping).fillna(0)

# see results
wc.tail(10)

Unnamed: 0,Year,Country,Winner,Runners-Up,Third,Fourth,GoalsScored,QualifiedTeams,MatchesPlayed,Attendance,brazil_winner
10,1978,Argentina,Argentina,Netherlands,Brazil,Italy,102,16,38,1.545.791,0.0
11,1982,Spain,Italy,Germany FR,Poland,France,146,24,52,2.109.723,0.0
12,1986,Mexico,Argentina,Germany FR,France,Belgium,132,24,52,2.394.031,0.0
13,1990,Italy,Germany FR,Argentina,Italy,England,115,24,52,2.516.215,0.0
14,1994,USA,Brazil,Italy,Sweden,Bulgaria,141,24,52,3.587.538,1.0
15,1998,France,France,Brazil,Croatia,Netherlands,171,32,64,2.785.100,0.0
16,2002,Korea/Japan,Brazil,Germany,Turkey,Korea Republic,161,32,64,2.705.197,1.0
17,2006,Germany,Italy,France,Germany,Portugal,147,32,64,3.359.439,0.0
18,2010,South Africa,Spain,Netherlands,Germany,Uruguay,145,32,64,3.178.856,0.0
19,2014,Brazil,Germany,Argentina,Netherlands,Brazil,171,32,64,3.386.810,0.0


#### Recode with apply() + function

The `apply()` function is used when you want to apply a function along an axis of a DataFrame (either rows or columns), or to each value of a Series. This function can be either a built-in function or a user-defined function.

In [123]:
# apply + function to recode
def get_dummies(x):
    if x =="Brazil":
        return 1
    else:
        return 0
    
# apply function  
wc['Winner'].apply(get_dummies).head(10)

0    0
1    0
2    0
3    0
4    0
5    1
6    1
7    0
8    1
9    0
Name: Winner, dtype: int64

Notice that we can make it more general by providing an argument to the function.

In [126]:
# apply + function to recode
def get_dummies(x, country):
    if x ==country:
        return 1
    else:
        return 0
    
# Apply function with Uruguay now
wc['Winner'].apply(get_dummies, country="Uruguay").head(10)

0    1
1    0
2    0
3    1
4    0
5    0
6    0
7    0
8    0
9    0
Name: Winner, dtype: int64

#### Summary of `apply()` and `map()`

- `apply()` is used to apply a function along an axis of the DataFrame or on values of Series.
    - you are free to use lambda functions with `apply()` (and with `map()` too)
    - on a DataFrame, the default `axis=0` passes each column to the function; set `axis=1` to pass each row instead. 
    
- `map()` is used to substitute each value in a Series with another value.
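The three usages summarized above can be compared side by side. This is a sketch on a hypothetical toy frame `toy_wc`, with illustrative variable names:

```python
import pandas as pd

# Toy frame: hand-typed subset of the rows displayed above (an assumption, not the CSV)
toy_wc = pd.DataFrame({
    "Country": ["Uruguay", "Italy", "France"],
    "Winner": ["Uruguay", "Italy", "Italy"],
})

# map with a dictionary: values missing from the dictionary become NaN
mapped = toy_wc["Winner"].map({"Italy": 1})

# apply on a Series: the function receives one value at a time
applied = toy_wc["Winner"].apply(lambda w: 1 if w == "Italy" else 0)

# apply on a DataFrame with axis=1: the function receives one row at a time
host_won = toy_wc.apply(lambda row: row["Country"] == row["Winner"], axis=1)

print(mapped.tolist())
print(applied.tolist())
print(host_won.tolist())
```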




### Recode using numpy

#### `np.where`: if-else approach.

- `np.where` is similar to `ifelse` in R
- Useful when there are only one or two conditions (True/False)
- syntax: `np.where(condition, value_if_true, value_if_false)`
- condition can be anything that returns a boolean. 

Let's see some examples:

In [130]:
# create a new variable
wc["winner_brazil"]=np.where(wc["Winner"]=="Brazil", 1, 0)
wc.head()

Unnamed: 0,Year,Country,Winner,Runners-Up,Third,Fourth,GoalsScored,QualifiedTeams,MatchesPlayed,Attendance,brazil_winner,winner_brazil
0,1930,Uruguay,Uruguay,Argentina,USA,Yugoslavia,70,13,18,590.549,0.0,0
1,1934,Italy,Italy,Czechoslovakia,Germany,Austria,70,16,17,363.000,0.0,0
2,1938,France,Italy,Hungary,Brazil,Sweden,84,15,18,375.700,0.0,0
3,1950,Brazil,Uruguay,Brazil,Sweden,Spain,88,13,22,1.045.246,0.0,0
4,1954,Switzerland,Germany FR,Hungary,Austria,Uruguay,140,16,26,768.607,0.0,0


In [134]:
# notice we can easily use np.where with assign
wc.assign(winner_brazil_=np.where(wc["Winner"]=="Brazil", 1, 0)).head(10)

Unnamed: 0,Year,Country,Winner,Runners-Up,Third,Fourth,GoalsScored,QualifiedTeams,MatchesPlayed,Attendance,brazil_winner,winner_brazil,winner_brazil_
0,1930,Uruguay,Uruguay,Argentina,USA,Yugoslavia,70,13,18,590.549,0.0,0,0
1,1934,Italy,Italy,Czechoslovakia,Germany,Austria,70,16,17,363.000,0.0,0,0
2,1938,France,Italy,Hungary,Brazil,Sweden,84,15,18,375.700,0.0,0,0
3,1950,Brazil,Uruguay,Brazil,Sweden,Spain,88,13,22,1.045.246,0.0,0,0
4,1954,Switzerland,Germany FR,Hungary,Austria,Uruguay,140,16,26,768.607,0.0,0,0
5,1958,Sweden,Brazil,Sweden,France,Germany FR,126,16,35,819.810,1.0,1,1
6,1962,Chile,Brazil,Czechoslovakia,Chile,Yugoslavia,89,16,32,893.172,1.0,1,1
7,1966,England,England,Germany FR,Portugal,Soviet Union,89,16,32,1.563.135,0.0,0,0
8,1970,Mexico,Brazil,Italy,Germany FR,Uruguay,95,16,32,1.603.975,1.0,1,1
9,1974,Germany,Germany FR,Netherlands,Poland,Brazil,97,16,38,1.865.753,0.0,0,0


In [140]:
# using string methods
(wc
 .assign(south_america=np.where(wc["Winner"].str.contains("Brazil|Uruguay|Argentina"), 1, 0))
     .filter(["Winner", "south_america"])
     .head(10)
)

Unnamed: 0,Winner,south_america
0,Uruguay,1
1,Italy,0
2,Italy,0
3,Uruguay,1
4,Germany FR,0
5,Brazil,1
6,Brazil,1
7,England,0
8,Brazil,1
9,Germany FR,0


#### `np.select`: for multiple conditions

- `np.select`: similar to `case_when` in R. 
- useful when there are multiple conditions to be recoded
- syntax: `np.select(condlist, choicelist, default)`

In [149]:
# step one: create a list of conditions
condition = [wc["Winner"]==wc["Country"], 
             wc["Runners-Up"]==wc["Country"],
             wc["Third"]==wc["Country"]
            ]
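# step two: create the choice list
# (strings, so the choices share a dtype with the string default "4+";
# some recent NumPy versions may raise a TypeError when integers are mixed with a string default)
choice_list = ["1", "2", "3"]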

# recode
(wc
    .assign(where_is_hoster=np.select(condition, choice_list, default="4+"))
    .filter(["where_is_hoster"])
)

Unnamed: 0,where_is_hoster
0,1
1,1
2,4+
3,2
4,4+
5,2
6,3
7,1
8,4+
9,4+
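When conditions overlap, `np.select` uses the first condition that is true for each row, so the order of the list matters. The sketch below uses made-up thresholds and labels on a hypothetical toy frame `toy`:

```python
import numpy as np
import pandas as pd

# Toy data: four goal totals taken from the tables above (an assumption, not the CSV)
toy = pd.DataFrame({"GoalsScored": [70, 95, 140, 171]})

# Conditions are checked in order; the first match wins.
# 171 satisfies both conditions but is labeled by the first one.
conditions = [
    toy["GoalsScored"] >= 150,
    toy["GoalsScored"] >= 90,
]
labels = ["very high", "high"]

toy["goal_level"] = np.select(conditions, labels, default="low")
print(toy)
```

Listing the broader condition (`>= 90`) first would label 171 as "high", so the most specific condition should usually come first.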


### Group by: split-apply-combine

**Functionality**:

- Grouping data by specific variables/column indices
- Summarize/aggregate data by specific group features

**How it works**

“group by” is a common data wrangling process that exists in any language (R, Python, SQL) and refers to a process involving one or more of the following steps:

- Splitting the data into groups based on some criteria.

- Applying a function to each group independently.

- Combining the results into a data structure.


![From Python Data Science Handbook by Jake VanderPlas](https://jakevdp.github.io/PythonDataScienceHandbook/figures/03.08-split-apply-combine.png)

### group by

The `groupby()` method in pandas splits your dataset into smaller parts. 

It generates an iterable in which each group is represented as a tuple `(group, data)`: the group key and the sub-dataframe for that group. 

We can iterate over these tuples. 


In [24]:
# load worldcup dataset
wc = pd.read_csv("WorldCups.csv")
wc_matches = pd.read_csv("WorldCupMatches.csv")

#groupby object
g = wc.groupby(["Winner"])
g

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x16d6e0b50>

In [25]:
# Iteration for groups
for group, data in g:
    print(group)

('Argentina',)
('Brazil',)
('England',)
('France',)
('Germany',)
('Germany FR',)
('Italy',)
('Spain',)
('Uruguay',)


In [26]:
# iteration for data grouped
for group, data in g:
    print(data.head(2))

    Year    Country     Winner   Runners-Up   Third   Fourth  GoalsScored  \
10  1978  Argentina  Argentina  Netherlands  Brazil    Italy          102   
12  1986     Mexico  Argentina   Germany FR  France  Belgium          132   

    QualifiedTeams  MatchesPlayed Attendance  
10              16             38  1.545.791  
12              24             52  2.394.031  
   Year Country  Winner      Runners-Up   Third      Fourth  GoalsScored  \
5  1958  Sweden  Brazil          Sweden  France  Germany FR          126   
6  1962   Chile  Brazil  Czechoslovakia   Chile  Yugoslavia           89   

   QualifiedTeams  MatchesPlayed Attendance  
5              16             35    819.810  
6              16             32    893.172  
   Year  Country   Winner  Runners-Up     Third        Fourth  GoalsScored  \
7  1966  England  England  Germany FR  Portugal  Soviet Union           89   

   QualifiedTeams  MatchesPlayed Attendance  
7              16             32  1.563.135  
    Year Co

And we can access a specific group:

In [27]:
g.get_group("Argentina")

Unnamed: 0,Year,Country,Winner,Runners-Up,Third,Fourth,GoalsScored,QualifiedTeams,MatchesPlayed,Attendance
10,1978,Argentina,Argentina,Netherlands,Brazil,Italy,102,16,38,1.545.791
12,1986,Mexico,Argentina,Germany FR,France,Belgium,132,24,52,2.394.031


### Aggregations (summarize)

The power of a grouping function (like `.groupby()`) shines when coupled with an aggregation operation.

An aggregation is any operation that reduces the dimensionality of the data!


In [31]:
# mean of all numeric variables grouping by winners
# in pandas >= 2.0 this raises a TypeError because string columns are not dropped automatically
wc.groupby(["Winner"]).mean()

TypeError: Could not convert ArgentinaMexico to numeric
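The error comes from the string columns: recent pandas versions try to average them as well. One way around it, sketched here on a hypothetical toy frame `toy_wc`, is to pass `numeric_only=True` so that only numeric columns are aggregated.

```python
import pandas as pd

# Toy frame: hand-typed subset of the rows displayed above (an assumption, not the CSV)
toy_wc = pd.DataFrame({
    "Winner": ["Uruguay", "Italy", "Italy", "Uruguay"],
    "Country": ["Uruguay", "Italy", "France", "Brazil"],
    "GoalsScored": [70, 70, 84, 88],
    "MatchesPlayed": [18, 17, 18, 22],
})

# numeric_only=True skips the string column "Country" instead of raising
means = toy_wc.groupby("Winner").mean(numeric_only=True)
print(means)
```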

#### `pandas`: `.groupby()` +  built-in methods

Alternatively, select a specific variable to perform the aggregation step on. 

In [33]:
# reset_index() turns the group labels (Winner) back into a regular column
wc.groupby(["Winner"])["GoalsScored"].mean().reset_index()

Unnamed: 0,Winner,GoalsScored
0,Argentina,117.0
1,Brazil,122.4
2,England,89.0
3,France,171.0
4,Germany,171.0
5,Germany FR,117.333333
6,Italy,111.75
7,Spain,145.0
8,Uruguay,79.0


#### `pandas`: `.groupby()` + `agg()`

Alternatively, we can specify a whole range of operations to aggregate by (along with specific variable columns) using the `.aggregate()`/`.agg()` method. To keep track of which operations correspond which variable, `pandas` will generate a hierarchical index for column entries. 

In [165]:
wc.groupby(["Winner"])["GoalsScored"].agg(["mean","std","median"])

Unnamed: 0_level_0,mean,std,median
Winner,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Argentina,117.0,21.213203,117.0
Brazil,122.4,30.47622,126.0
England,89.0,,89.0
France,171.0,,171.0
Germany,171.0,,171.0
Germany FR,117.333333,21.594752,115.0
Italy,111.75,40.532908,115.0
Spain,145.0,,145.0
Uruguay,79.0,12.727922,79.0
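
The hierarchical column index mentioned above shows up once more than one column is aggregated. The following is a rough sketch on a hypothetical toy subset (`wc_small`, rows copied from outputs in this notebook). It also shows named aggregation, a `pandas` feature that gives flat, self-chosen column names instead; the names `goals_mean` and `matches_max` are illustrative only.

```python
import pandas as pd

# Hypothetical toy subset of the World Cup data used in this notebook.
wc_small = pd.DataFrame({
    "Winner": ["Uruguay", "Uruguay", "England", "Argentina", "Argentina"],
    "GoalsScored": [70, 88, 89, 102, 132],
    "MatchesPlayed": [18, 22, 32, 38, 52],
})

# Several columns x several functions -> two-level (hierarchical) columns.
multi = wc_small.groupby("Winner")[["GoalsScored", "MatchesPlayed"]].agg(["mean", "max"])
print(multi.columns.tolist())

# Named aggregation: each keyword becomes a flat output column name,
# and each tuple is (input column, aggregation function).
flat = wc_small.groupby("Winner").agg(
    goals_mean=("GoalsScored", "mean"),
    matches_max=("MatchesPlayed", "max"),
)
print(flat)
```

Named aggregation is often easier to chain with later steps because the resulting columns can be referenced by a single string.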


We can also pass **_user-defined functions_** to the `.aggregate()` method.

In [169]:
def mean_add_50(x):
    return np.mean(x) + 50

wc.groupby(["Winner"])["GoalsScored"].agg(["mean","std","median",mean_add_50])

Unnamed: 0_level_0,mean,std,median,mean_add_50
Winner,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Argentina,117.0,21.213203,117.0,167.0
Brazil,122.4,30.47622,126.0,172.4
England,89.0,,89.0,139.0
France,171.0,,171.0,221.0
Germany,171.0,,171.0,221.0
Germany FR,117.333333,21.594752,115.0,167.333333
Italy,111.75,40.532908,115.0,161.75
Spain,145.0,,145.0,195.0
Uruguay,79.0,12.727922,79.0,129.0
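
To make the mechanics of a user-defined aggregation more concrete: the function receives one group's values as a Series and must return a single value. The sketch below assumes a hypothetical helper `goal_range` and a toy subset `wc_small` copied from outputs in this notebook. It also illustrates why `std` shows up as missing for winners with a single tournament: the sample standard deviation is undefined for one observation.

```python
import pandas as pd

# Hypothetical toy subset of the World Cup data used in this notebook.
wc_small = pd.DataFrame({
    "Winner": ["Uruguay", "Uruguay", "England", "Argentina", "Argentina"],
    "GoalsScored": [70, 88, 89, 102, 132],
})

def goal_range(x):
    # x is a Series holding the GoalsScored values of one group.
    return x.max() - x.min()

# Built-in names (strings) and custom functions can be mixed in one list;
# the custom function's name becomes the column label.
out = wc_small.groupby("Winner")["GoalsScored"].agg(["std", goal_range])
print(out)
```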


We can also group by more than one variable (i.e. implement a **_multi-index on the rows_**).

In [185]:
wc.groupby(["Winner", "Year"])["GoalsScored"].mean().reset_index()

Unnamed: 0,Winner,Year,GoalsScored
0,Argentina,1978,102.0
1,Argentina,1986,132.0
2,Brazil,1958,126.0
3,Brazil,1962,89.0
4,Brazil,1970,95.0
5,Brazil,1994,141.0
6,Brazil,2002,161.0
7,England,1966,89.0
8,France,1998,171.0
9,Germany,2014,171.0
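
The multi-index on the rows is only visible before `.reset_index()` is called. As a small sketch (again on a hypothetical toy subset, `wc_small`), the grouped result can be inspected directly and then flattened:

```python
import pandas as pd

# Hypothetical toy subset of the World Cup data used in this notebook.
wc_small = pd.DataFrame({
    "Year": [1930, 1950, 1966, 1978, 1986],
    "Winner": ["Uruguay", "Uruguay", "England", "Argentina", "Argentina"],
    "GoalsScored": [70, 88, 89, 102, 132],
})

# Grouping by two keys yields a Series indexed by (Winner, Year).
grouped = wc_small.groupby(["Winner", "Year"])["GoalsScored"].mean()
print(grouped.index.nlevels)

# Selecting on the outer level returns all years for that winner.
print(grouped.loc["Argentina"])

# reset_index() moves both index levels back into regular columns.
flat = grouped.reset_index()
print(flat.columns.tolist())
```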


A common task in data analysis is to simply count the appearances of values in a variable. We can do that very easily with `groupby` and `size`:

In [175]:
(wc.groupby(["Winner"]).
    size().
    reset_index(name='N'))

Unnamed: 0,Winner,N
0,Argentina,2
1,Brazil,5
2,England,1
3,France,1
4,Germany,1
5,Germany FR,3
6,Italy,4
7,Spain,1
8,Uruguay,2
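
For a single column, `.value_counts()` is a common shortcut that produces the same counts. A brief sketch comparing the two approaches on a hypothetical toy subset (`wc_small`):

```python
import pandas as pd

# Hypothetical toy subset of the World Cup data used in this notebook.
wc_small = pd.DataFrame({
    "Winner": ["Uruguay", "Uruguay", "England", "Argentina", "Argentina"],
})

# groupby + size: one row per group, counts stored in column N.
by_size = wc_small.groupby("Winner").size().reset_index(name="N")

# value_counts: a Series of counts, sorted from most to least frequent.
by_vc = wc_small["Winner"].value_counts()

print(by_size)
print(by_vc)
```

The `groupby` version extends naturally to several grouping variables, while `value_counts()` is handy for a quick look at one column.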


#### `pandas`: `.groupby()` +   `.transform()`

Other times we want to implement data manipulations by some grouping variable but retain the structure of the original data. Put differently, our aim is not to aggregate but to perform some operation within specific groups. For example, we might want to group-mean center our variables as a way of removing between-group variation.

To do so, we will combine `.groupby()` and `.transform()` methods. Let's see an example: 

In [211]:
# returns one value per original row (the mean of that row's group)
# notice you need to select the column you want to transform.
wc.groupby(["Winner"])["GoalsScored"].transform("mean").head(5)

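Without the column selection, `wc.groupby(["Winner"]).transform("mean")` applies the group mean to every numeric column at once, which is the output shown below: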


Unnamed: 0,Year,GoalsScored,QualifiedTeams,MatchesPlayed,brazil_winner,winner_brazil,count
0,1940.0,79.0,13.0,20.0,0.0,0.0,2.0
1,1965.0,111.75,21.75,37.75,0.0,0.0,4.0
2,1965.0,111.75,21.75,37.75,0.0,0.0,4.0
3,1940.0,79.0,13.0,20.0,0.0,0.0,2.0
4,1972.666667,117.333333,18.666667,38.666667,0.0,0.0,3.0
5,1977.2,122.4,20.8,43.0,1.0,1.0,5.0
6,1977.2,122.4,20.8,43.0,1.0,1.0,5.0
7,1966.0,89.0,16.0,32.0,0.0,0.0,1.0
8,1977.2,122.4,20.8,43.0,1.0,1.0,5.0
9,1972.666667,117.333333,18.666667,38.666667,0.0,0.0,3.0


In [196]:
# easily combined with assign
# create a new column
(wc.assign(goals_score_wc_mean_wc=wc.groupby(["Winner"])["GoalsScored"].
    transform("mean"))).head()

Unnamed: 0,Year,Country,Winner,Runners-Up,Third,Fourth,GoalsScored,QualifiedTeams,MatchesPlayed,Attendance,brazil_winner,winner_brazil,goals_score_wc_mean_wc
0,1930,Uruguay,Uruguay,Argentina,USA,Yugoslavia,70,13,18,590.549,0.0,0,79.0
1,1934,Italy,Italy,Czechoslovakia,Germany,Austria,70,16,17,363.000,0.0,0,111.75
2,1938,France,Italy,Hungary,Brazil,Sweden,84,15,18,375.700,0.0,0,111.75
3,1950,Brazil,Uruguay,Brazil,Sweden,Spain,88,13,22,1.045.246,0.0,0,79.0
4,1954,Switzerland,Germany FR,Hungary,Austria,Uruguay,140,16,26,768.607,0.0,0,117.333333
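
The group-mean centering mentioned earlier follows directly from this pattern: subtract the transformed group mean from the original column. A minimal sketch, assuming a hypothetical toy subset `wc_small` and an illustrative column name `goals_centered`:

```python
import pandas as pd

# Hypothetical toy subset of the World Cup data used in this notebook.
wc_small = pd.DataFrame({
    "Winner": ["Uruguay", "Uruguay", "England", "Argentina", "Argentina"],
    "GoalsScored": [70, 88, 89, 102, 132],
})

# transform("mean") returns a Series aligned row by row with wc_small.
group_mean = wc_small.groupby("Winner")["GoalsScored"].transform("mean")

# Centering: each row's deviation from its own group's mean.
centered = wc_small.assign(goals_centered=wc_small["GoalsScored"] - group_mean)
print(centered)
```

After centering, every group has a mean of zero, so only within-group variation remains.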


In [213]:
# also: very useful with lambda functions
wc.groupby("Winner")["Winner"].transform(lambda x: len(x))


0     2
1     4
2     4
3     2
4     3
5     5
6     5
7     1
8     5
9     3
10    2
11    4
12    2
13    3
14    5
15    1
16    5
17    4
18    1
19    1
Name: Winner, dtype: int64
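
For simple cases like counting group members, a built-in method name can replace the lambda. The sketch below, on a hypothetical toy subset `wc_small`, compares the lambda version with `transform("size")`; both should give the group size repeated on every row.

```python
import pandas as pd

# Hypothetical toy subset of the World Cup data used in this notebook.
wc_small = pd.DataFrame({
    "Winner": ["Uruguay", "Uruguay", "England", "Argentina", "Argentina"],
})

# Lambda version: len() is called once per group and broadcast to its rows.
n_lambda = wc_small.groupby("Winner")["Winner"].transform(lambda x: len(x))

# String version: delegates to the built-in group size.
n_size = wc_small.groupby("Winner")["Winner"].transform("size")

print(n_size.tolist())
```

Lambdas remain the more flexible option when the per-group computation has no built-in name.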

### Sorting values

In [214]:
# Pandas: sort values by a column variable (ascending)
wc.sort_values('Country').head(3)

Unnamed: 0,Year,Country,Winner,Runners-Up,Third,Fourth,GoalsScored,QualifiedTeams,MatchesPlayed,Attendance,brazil_winner,winner_brazil,count
10,1978,Argentina,Argentina,Netherlands,Brazil,Italy,102,16,38,1.545.791,0.0,0,2
19,2014,Brazil,Germany,Argentina,Netherlands,Brazil,171,32,64,3.386.810,0.0,0,1
3,1950,Brazil,Uruguay,Brazil,Sweden,Spain,88,13,22,1.045.246,0.0,0,2


In [215]:
# Pandas: sort values by a column variable (descending)
wc.sort_values('Year',ascending=False).head(3)

Unnamed: 0,Year,Country,Winner,Runners-Up,Third,Fourth,GoalsScored,QualifiedTeams,MatchesPlayed,Attendance,brazil_winner,winner_brazil,count
19,2014,Brazil,Germany,Argentina,Netherlands,Brazil,171,32,64,3.386.810,0.0,0,1
18,2010,South Africa,Spain,Netherlands,Germany,Uruguay,145,32,64,3.178.856,0.0,0,1
17,2006,Germany,Italy,France,Germany,Portugal,147,32,64,3.359.439,0.0,0,4


In [217]:
# Pandas: sort values by more than one column variable 
wc.sort_values(['Winner', "Country"]).head(3)

Unnamed: 0,Year,Country,Winner,Runners-Up,Third,Fourth,GoalsScored,QualifiedTeams,MatchesPlayed,Attendance,brazil_winner,winner_brazil,count
10,1978,Argentina,Argentina,Netherlands,Brazil,Italy,102,16,38,1.545.791,0.0,0,2
12,1986,Mexico,Argentina,Germany FR,France,Belgium,132,24,52,2.394.031,0.0,0,2
6,1962,Chile,Brazil,Czechoslovakia,Chile,Yugoslavia,89,16,32,893.172,1.0,1,5
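
When sorting by several columns, `ascending` also accepts a list, one direction per column. A short sketch on a hypothetical toy subset (`wc_small`), sorting winners alphabetically and, within each winner, the most recent year first:

```python
import pandas as pd

# Hypothetical toy subset of the World Cup data used in this notebook.
wc_small = pd.DataFrame({
    "Year": [1930, 1950, 1966, 1978, 1986],
    "Winner": ["Uruguay", "Uruguay", "England", "Argentina", "Argentina"],
})

# Winner ascending (A to Z), Year descending (newest first) within each winner.
sorted_mixed = wc_small.sort_values(["Winner", "Year"], ascending=[True, False])
print(sorted_mixed)

# sort_values returns a new DataFrame; the original order is untouched.
print(wc_small["Year"].tolist())
```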


### That was a lot! But you get to keep this notebook!

And remember the [cheat sheet for pandas](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf).