# Python Data Wrangling with `pandas`

* * * 
<div class="alert alert-success">  
    
### Learning Objectives 
    
* Gain familiarity with `pandas` and the core `DataFrame` object
* Apply core data wrangling techniques in `pandas`
* Understand the flexibility of the `pandas` library
</div>

### Icons Used in This Notebook
🔔 **Question**: A quick question to help you understand what's going on.<br>
🥊 **Challenge**: Interactive exercise. We'll work through these in the workshop!<br>
💡 **Tip**: How to do something a bit more efficiently or effectively.<br>
⚠️ **Warning:** Heads-up about tricky stuff or common mistakes.<br>
🎬 **Demo**: Showing off something more advanced, so you know what `pandas` can be used for!<br>

In this workshop, we provide an introduction to **data wrangling with Python**. We will be extensively using the `pandas` package, which provides a rich set of tools to manipulate and interact with DataFrames, the most common data structure used when analyzing tabular data.

We'll use worked examples and practice on real data to learn the core techniques of data wrangling -- how to index, manipulate, merge, group, and plot DataFrames -- in `pandas`. 

Now let's get started!

### Sections
1. [The `DataFrame` object](#dataframe)
2. [Indexing data](#indexing)
3. [Manipulating data](#manipulating)

<a id='dataframe'></a>
# The `DataFrame` object

`pandas` is designed to make it easier to work with structured, tabular data. Many of the analyses you might typically perform involve tabular data, e.g. `.csv` files, Excel files, extracts from relational databases, etc. `pandas` represents this data as a `DataFrame` object -- we'll see what this looks like in a moment.

## Importing and Viewing Data
We are going to work with European unemployment data from Eurostat, which is [hosted by Google](https://code.google.com/p/dspl/downloads/list). There are several `.csv` files related to this topic that we'll work with in this workshop.

Let's begin by importing `pandas` using the conventional `pd` abbreviation.

In [6]:
# Import pandas and assign it to the variable `pd`
import pandas as pd

# We often import NumPy (numerical Python) alongside pandas;
# we will import it and assign it to the variable `np`
import numpy as np

# Load matplotlib for plotting later in the notebook
import matplotlib.pyplot as plt
%matplotlib inline

The `read_csv()` function allows us to easily import tabular data (e.g. `.csv` files). The function returns a `DataFrame` object, which is the main object `pandas` uses to represent tabular data.

Notice that we call `read_csv()` using the `pd` abbreviation from the import statement above:

In [7]:
unemployment = pd.read_csv('../data/country_total.csv')

Let's run `type()` on the `unemployment` object and see what it is...

In [8]:
type(unemployment)

pandas.core.frame.DataFrame

Great! You've created a `pandas` `DataFrame`. We can look at our data by using the `.head()` method. By default, this shows the header (column names) and the first **five** rows.  

In [9]:
unemployment.head()

Unnamed: 0,country,seasonality,month,unemployment,unemployment_rate
0,at,nsa,1993.01,171000,4.5
1,at,nsa,1993.02,175000,4.6
2,at,nsa,1993.03,166000,4.4
3,at,nsa,1993.04,157000,4.1
4,at,nsa,1993.05,147000,3.9


💡 **Tip**: If you'd like to see some other number of rows, you can pass an integer to `.head()` to return that many rows. For example `unemployment.head(6)` would return the first six rows.  
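If you don't have the data file handy, here is a minimal sketch of the same steps on a tiny in-memory table. The three rows are copied from the output above; the names `csv_text`, `sample`, and `first_two` are only for illustration, and we pass `read_csv()` a file-like object (`io.StringIO`) instead of a path.

```python
import io

import pandas as pd

# Three rows copied from the unemployment output above (illustration only)
csv_text = """country,seasonality,month,unemployment,unemployment_rate
at,nsa,1993.01,171000,4.5
at,nsa,1993.02,175000,4.6
at,nsa,1993.03,166000,4.4
"""

# read_csv() accepts a file path, a URL, or a file-like object such as StringIO
sample = pd.read_csv(io.StringIO(csv_text))

# Passing an integer to .head() returns that many rows
first_two = sample.head(2)
print(first_two)
```

Calling `sample.head()` with no argument would return all three rows here, because the table has fewer than five.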



To find the number of rows and columns, you can use the `.shape` attribute, which returns a [tuple](https://www.w3schools.com/python/python_tuples.asp): `(number of rows, number of columns)`.

In [10]:
unemployment.shape

(20796, 5)

To find out exactly what all of your columns are, you can use the `.columns` attribute.

In [11]:
unemployment.columns

Index(['country', 'seasonality', 'month', 'unemployment', 'unemployment_rate'], dtype='object')

To find out what kinds of data we have, we use the `.dtypes` attribute, which tells us which columns contain numerical data (e.g. `float64` or `int64` types) and which ones contain text (e.g. `object` types).

In [12]:
unemployment.dtypes

country               object
seasonality           object
month                float64
unemployment           int64
unemployment_rate    float64
dtype: object
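To see how `.shape` and `.dtypes` relate to the data itself, here is a small sketch on a hand-built DataFrame. The frame `toy` and its values are made up for illustration. `select_dtypes()` is one way to pick out columns by type, though it is not used elsewhere in this workshop.

```python
import pandas as pd

# A made-up three-row table (illustrative values only)
toy = pd.DataFrame({
    "country": ["at", "be", "bg"],
    "unemployment": [171000, 175000, 166000],
    "unemployment_rate": [4.5, 4.6, None],  # the missing value is stored as NaN
})

print(toy.shape)    # (number of rows, number of columns)
print(toy.dtypes)   # one dtype per column

# Keep only the names of the numeric columns
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```

Notice that a column with a missing value is stored as a float, since `NaN` is itself a floating-point value.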

### Summarizing data
A useful method that generates various summary statistics is `.describe()`. This is a powerful method that will return a lot of information, so before we run it, let's look up exactly what it does.

💡 **Tip**: The [`pandas` documentation](http://pandas.pydata.org/pandas-docs/stable/) contains exhaustive information on every function, object, etc. in `pandas`. It can be a little difficult to navigate on its own, so it's typical to interact with the documentation primarily through Google searches.  

The following is a general workflow for learning about a function in `pandas`:
1. Google the `pandas` function, e.g. "pandas {insert function name}"
2. Find a result from pandas.pydata.org (the pandas documentation)
3. Read the summary of what the function does (at the top of the page), examine its arguments and what it returns.

<span color="purple">🔔 **Question:** Before running the following code, try using the general workflow detailed above to find out what `.describe()` returns. </span>  

In [13]:
unemployment.describe()

Unnamed: 0,month,unemployment,unemployment_rate
count,20796.0,20796.0,19851.0
mean,1999.40129,790081.8,8.179764
std,7.483751,1015280.0,3.922533
min,1983.01,2000.0,1.1
25%,1994.09,140000.0,5.2
50%,2001.01,310000.0,7.6
75%,2006.01,1262250.0,10.0
max,2010.12,4773000.0,20.9


⚠️ **Warning**: `.describe()` will behave differently depending on your data's types, or `dtype`s. If your `DataFrame` includes both numeric and object (e.g., string) `dtype`s, it will default to **summarizing only the numeric data** (as shown above). If `.describe()` is called on a `DataFrame` that only contains strings, it will return the count, the number of unique values, and the most frequent value along with its count.  
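The sketch below illustrates this behavior on a made-up DataFrame (`toy` and its values are only for illustration). It also uses the `include="all"` argument of `.describe()`, which asks for a summary of every column regardless of type.

```python
import pandas as pd

# Made-up data with one text column and one numeric column
toy = pd.DataFrame({
    "country_group": ["eu", "eu", "non-eu", "eu"],
    "latitude": [47.7, 50.5, 44.7, 35.1],
})

# Mixed dtypes: only the numeric column is summarized by default
numeric_summary = toy.describe()

# Text-only DataFrame: count, unique, top (most frequent value), freq
text_summary = toy[["country_group"]].describe()

# Summarize every column; statistics that don't apply show up as NaN
full_summary = toy.describe(include="all")

print(numeric_summary)
print(text_summary)
```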

---

## 🥊 Challenge 1: Import Data From A URL

Above, we imported the unemployment data using the `read_csv` function and a relative file path. `read_csv` is [a very flexible method](https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.read_csv.html); it also allows us to import data using a URL as the file path. 

A .csv file with data on world countries and their abbreviations is located at the URL:

[https://raw.githubusercontent.com/dlab-berkeley/Python-Data-Wrangling/main/data/countries.csv](https://raw.githubusercontent.com/dlab-berkeley/Python-Data-Wrangling/main/data/countries.csv)

We've saved this exact URL as a string variable, `countries_url`, below.

Using `read_csv`, import the country data and save it to the variable `countries`. Once you've saved the variable, look at the first five rows of the data using the `.head()` method.

---

In [14]:
countries_url = 'https://raw.githubusercontent.com/dlab-berkeley/Python-Data-Wrangling/main/data/countries.csv'
# countries = # YOUR CODE HERE
### Answer
countries = pd.read_csv(countries_url)

---

## 🥊 Challenge 2: Exploring `countries`

It's important to understand a few fundamentals about your data before you start working with it, including what information it contains, how large it is, and how the values are generally distributed.

Using the methods and attributes above, **answer the following questions** about `countries`:

1. What columns does `countries` contain?
2. How many rows and columns does it contain?
3. What are the minimum and maximum values of the columns with numerical data?

Hint: consider using `.columns`, `.describe()`, and `.shape` here.


---

In [15]:
# Answer
countries.columns

Index(['country', 'google_country_code', 'country_group', 'name_en', 'name_fr',
       'name_de', 'latitude', 'longitude'],
      dtype='object')

In [16]:
# Answer
countries.shape

(30, 8)

In [17]:
# Answer
countries.describe()

Unnamed: 0,latitude,longitude
count,30.0,30.0
mean,49.092609,14.324579
std,7.956624,11.25701
min,35.129141,-8.239122
25%,43.230916,6.979186
50%,49.238087,14.941462
75%,54.0904,23.35169
max,64.950159,35.439795


<a id='indexing'></a>
## Indexing Data

Wrangling data in a DataFrame often requires extracting specific rows and/or columns of interest. This is referred to as **Indexing**. We've actually already learned a simple way to index our data using `.head()`, which isolated the first five rows of our data. Now, we'll learn more flexible and powerful methods for indexing.

### Recall basic Python indexing
To index (this is synonymous with other verbs like "subset," "slice," etc.) data in Python, we use bracket notation: `[]`. Run the following code to instantiate a list of numbers and observe what different indexes return:

In [18]:
my_list = [1, 2, 3, 4, 5, 6]

In [19]:
my_list[:4]

[1, 2, 3, 4]

In [20]:
my_list[0]

1

In [21]:
my_list[2:]

[3, 4, 5, 6]


Indexing in `pandas` works much as it does in standard Python, with a few key differences. In `pandas`, indexing relies on referencing a DataFrame's rows and then its columns <span>&#8594;</span> `[rows, columns]`. Let's get a more visual sense of this -- in the `countries` DataFrame that we created earlier, the structure of the data is as follows:  

<img src="../images/df_diagram.png" align="left" width="50%" alt="diagram of pandas DataFrame">  

To index and get to specific data from this DataFrame, we select a row/column combination.  
For example, indexing row 3 and the column `google_country_code` would give us the value 'HR'. In code, that would look as follows:  
`countries.loc[3, 'google_country_code']`  
Try writing that in the cell below and running it.

In [22]:
...

Ellipsis

### `.loc`
Let's go deeper into what `.loc` does, as this will be the primary tool we use for indexing.  

`.loc` allows us to index data based on the labels of our DataFrame's index and its column names. Let's take a look at its behavior below:

In [23]:
countries.loc[:4, :]

Unnamed: 0,country,google_country_code,country_group,name_en,name_fr,name_de,latitude,longitude
0,at,AT,eu,Austria,Autriche,Österreich,47.696554,13.34598
1,be,BE,eu,Belgium,Belgique,Belgien,50.501045,4.476674
2,bg,BG,eu,Bulgaria,Bulgarie,Bulgarien,42.725674,25.482322
3,hr,HR,non-eu,Croatia,Croatie,Kroatien,44.746643,15.340844
4,cy,CY,eu,Cyprus,Chypre,Zypern,35.129141,33.428682


The code, `countries.loc[:4, :]` in effect says the following:  
- `countries.loc[`<mark style="background: yellow">**:4**</mark>`, :]` <span>&#8594;</span> Select rows up to and including index 4
- `countries.loc[:4, `<mark style="background: yellow">**:**</mark>`]` <span>&#8594;</span> Select all columns

This format allows us to flexibly select ranges of rows and columns at the same time. Consider this more complex example:

In [24]:
countries.loc[2:4, 'name_en']

2    Bulgaria
3     Croatia
4      Cyprus
Name: name_en, dtype: object

This code executed the following:  
- `countries.loc[`<mark style="background: yellow">**2:4**</mark>`, 'name_en']` <span>&#8594;</span> Select rows from index 2 through index 4 (inclusive)
- `countries.loc[2:4, `<mark style="background: yellow">**'name_en'**</mark>`]` <span>&#8594;</span> Select the `name_en` column

💡 **Tip**: Note that the output of this code looks different from our previous output! Because we selected a single column, our code returned a `Series` object, which is a single vector of data (similar to a NumPy array, but with an index attached).  

In [25]:
type(countries.loc[2:4, 'name_en'])

pandas.core.series.Series
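Whether `.loc` returns a `Series` or a `DataFrame` depends on how we ask for the column. The sketch below shows the difference on a small made-up DataFrame (`toy` is only for illustration, with names taken from the first rows of `countries`).

```python
import pandas as pd

# Made-up frame using the first five country names from above
toy = pd.DataFrame({
    "name_en": ["Austria", "Belgium", "Bulgaria", "Croatia", "Cyprus"],
    "country_group": ["eu", "eu", "eu", "non-eu", "eu"],
})

# A single column label returns a Series
as_series = toy.loc[2:4, "name_en"]

# A list of labels (even a list with one label) returns a DataFrame
as_frame = toy.loc[2:4, ["name_en"]]

# The values of a Series can be pulled out as a NumPy array
values = as_series.to_numpy()

print(type(as_series))
print(type(as_frame))
```

Both selections keep the original row labels (2, 3, and 4), because `.loc` slices by label and includes the end label.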

Let's look at one more example of `.loc`.  
<span>🔔 **Question:** Before running the following code block, can you anticipate what it will output?</span> 

In [26]:
countries.loc[19:29, ['name_en', 'country_group']]

Unnamed: 0,name_en,country_group
19,Netherlands,eu
20,Norway,non-eu
21,Poland,eu
22,Portugal,eu
23,Romania,eu
24,Slovakia,eu
25,Slovenia,eu
26,Spain,eu
27,Sweden,eu
28,Turkey,non-eu
29,United Kingdom,eu


---

## 🥊 Challenge 3: Indexing with `.loc`

Let's get a little practice with the `.loc` operator.
1. Select the `google_country_code` for rows 10 through 20
2. Select the `name_en`, `longitude`, and `latitude` for the following rows: [1, 4, 2, 9]
3. Select the first five rows, then compute their average `latitude`
<details>
    <summary><a>Click for Hint</a></summary>
    This can be done using <code>.loc</code> and <code>.mean()</code>, all in one line of code: <code>countries.loc[{YOUR_CODE_HERE}].mean()</code>
</details>

In [27]:
# answer
countries.loc[10:20, 'google_country_code']

10    DE
11    GR
12    HU
13    IE
14    IT
15    LV
16    LT
17    LU
18    MT
19    NL
20    NO
Name: google_country_code, dtype: object

In [28]:
# answer
countries.loc[[1, 4, 2, 9], ['name_en', 'longitude', 'latitude']]

Unnamed: 0,name_en,longitude,latitude
1,Belgium,4.476674,50.501045
4,Cyprus,33.428682,35.129141
2,Bulgaria,25.482322,42.725674
9,France,1.718561,46.710994


In [29]:
# answer
countries.loc[:4, 'latitude'].mean()

44.159811444

### Positional Indexing
`.loc` is a very powerful indexing system that can handle almost any indexing task you can imagine. However, as is typical in `pandas`, there are several alternative ways to accomplish the same thing. For very simple indexing tasks, such as selecting a full column, it is common to use the more succinct **positional indexing** system, which places the brackets directly on the DataFrame. Note that `countries['latitude']` still looks the column up by its name.  
Positional indexing allows us to omit `.loc`, but only allows us to select rows **OR** columns, whereas most of the indexing we just did using `.loc` involved both row **AND** column indices.

In [44]:
# This will work
countries['latitude']

0     47.696554
1     50.501045
2     42.725674
3     44.746643
4     35.129141
5     49.803531
6     55.939684
7     58.592469
8     64.950159
9     46.710994
10    51.163825
11    39.698467
12    47.161163
13    53.415260
14    42.504191
15    56.880117
16    55.173687
17    49.815319
18    35.902422
19    52.108118
20    64.556460
21    51.918907
22    39.558069
23    45.942611
24    48.672644
25    46.149259
26    39.895013
27    62.198467
28    38.952942
29    54.315447
Name: latitude, dtype: float64

⚠️ **Warning**: It is also possible to access a column via dot notation (also referred to as attribute access) as follows: `countries.latitude`. You should avoid this technique, as a column name might inadvertently have the same name as a `DataFrame` (or `Series`) method. In addition, only bracket notation can be used to create a new column. If you try to use attribute access to create a new column, you'll create a new attribute, *not* a new column.
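Here is a short sketch of both pitfalls on a made-up DataFrame. The frame `toy`, the column named `count`, and the `is_eu` flag are all invented for illustration.

```python
import pandas as pd

# Made-up frame with a column whose name matches a DataFrame method
toy = pd.DataFrame({"name_en": ["Austria", "Belgium"], "count": [1, 2]})

# Pitfall 1: `toy.count` is the DataFrame method, not our column
attr_is_method = callable(toy.count)
count_column = toy["count"]  # bracket notation returns the column

# Pitfall 2: attribute assignment creates an attribute, not a column
# (pandas may emit a UserWarning here)
toy.is_eu = [True, True]
column_after_attribute = "is_eu" in toy.columns

# Bracket notation creates the column
toy["is_eu"] = [True, True]
column_after_brackets = "is_eu" in toy.columns

print(attr_is_method, column_after_attribute, column_after_brackets)
```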

***

The first statement in the following cell is commented out because it throws an error (a `KeyError`): plain brackets cannot take a row and a column at the same time. To see the error yourself, "un-comment" it (remove the # before the code) and run the cell. The second statement shows how `.loc` fixes the problem. 

In [45]:
# This won't work: it raises a KeyError
# countries[0, 'latitude']

# this will work
countries.loc[0, 'latitude']

47.6965545

### `.iloc`

Another widely used alternative to `.loc` is `.iloc`. **We recommend sticking to `.loc` while learning `pandas`**.

Whereas `.loc` ultimately selects data based on the index and the column names, `.iloc` selects data based purely on integer positions. For any given DataFrame, we can use `.iloc` to select rows by position, from 0 up to the number of rows minus one, and columns by position, from 0 up to the number of columns minus one:

In [32]:
countries.iloc[:5, :6]

Unnamed: 0,country,google_country_code,country_group,name_en,name_fr,name_de
0,at,AT,eu,Austria,Autriche,Österreich
1,be,BE,eu,Belgium,Belgique,Belgien
2,bg,BG,eu,Bulgaria,Bulgarie,Bulgarien
3,hr,HR,non-eu,Croatia,Croatie,Kroatien
4,cy,CY,eu,Cyprus,Chypre,Zypern


⚠️ **Warning**: `.loc[:5, :]` and `.iloc[:5, :]` return different selections of data. `.loc` makes an **inclusive** selection, in this case selecting up to and including index 5. This is contrary to typical Python subsetting behavior. `.iloc` uses Python's typical subsetting behavior, which means it **always excludes the end position**. Therefore, we don't see the row associated with the index 5. [You can read more here](https://pandas.pydata.org/docs/user_guide/indexing.html#different-choices-for-indexing).
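A minimal sketch of the difference, using a made-up four-row DataFrame (`toy` and the relabeled index values 10 through 40 are hypothetical). The second half shows why the distinction matters more once the index labels are no longer 0, 1, 2, and so on.

```python
import pandas as pd

# Made-up frame with the default index 0, 1, 2, 3
toy = pd.DataFrame({"name_en": ["Austria", "Belgium", "Bulgaria", "Croatia"]})

n_loc = len(toy.loc[:2])    # labels 0, 1, and 2 (end label included)
n_iloc = len(toy.iloc[:2])  # positions 0 and 1 (end position excluded)

# Give the rows hypothetical labels that differ from their positions
relabeled = toy.copy()
relabeled.index = [10, 20, 30, 40]

n_loc_labels = len(relabeled.loc[:20])       # labels 10 and 20
n_iloc_positions = len(relabeled.iloc[:20])  # positions 0-19, so all four rows

print(n_loc, n_iloc, n_loc_labels, n_iloc_positions)
```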

## Boolean Indexing
Now that we've covered the basics of indexing, let's get into an extremely powerful extension -- "boolean indexing." This is a complicated term that just describes filtering data based on some logical test. The `pandas` implementation of boolean indexing can be a little jarring at first, so let's build up to it from scratch. First, recall how booleans and logical tests work in standard Python:

In [46]:
"D-Lab" == "D-Lab"

True

In [47]:
"D-Lab" == "H-Lab"

False

In [48]:
7 > 3

True

We will use that same style of logical test in `pandas` to execute boolean indexing.   

### Example: find countries outside the EU
Notice in the `countries` DataFrame pictured below that we have a column, `country_group`, that tells us whether or not a country is in the European Union (EU). We're going to do a boolean indexing example on the first six rows (index 0 through 5).
<img src="../images/df_diagram.png" align="left" width="50%" alt="diagram of pandas DataFrame">  

In [58]:
# Create a smaller test dataframe
# to show how boolean indexing works
test = countries.loc[:5, :]

Let's use that column to filter our data down to only countries outside of the European Union. The steps are as follows:
1. Select the column we will use as a filter: `test['country_group']` or `test.loc[:, 'country_group']`

In [59]:
test['country_group']

0        eu
1        eu
2        eu
3    non-eu
4        eu
5        eu
Name: country_group, dtype: object

2. Determine which rows in that column are equal to "non-eu" -- which denotes that the country is outside the European Union: `test['country_group'] == 'non-eu'`. The output of this code is what's called a **boolean mask**.

In [60]:
test['country_group'] == 'non-eu'

0    False
1    False
2    False
3     True
4    False
5    False
Name: country_group, dtype: bool

3. Use the boolean mask to index only those rows that satisfied the test: `test[test['country_group'] == 'non-eu']`

In [61]:
test[test['country_group'] == 'non-eu']

Unnamed: 0,country,google_country_code,country_group,name_en,name_fr,name_de,latitude,longitude
3,hr,HR,non-eu,Croatia,Croatie,Kroatien,44.746643,15.340844


And that is boolean indexing! We used a test for equality (`test['country_group'] == 'non-eu'`), but we can use a variety of different tests and conditions to index our data.

For example, we might want to find those countries with a longitude greater than some threshold, such as 25, if we want to examine countries further east (note that we will go back to using the full `countries` DataFrame now):

In [62]:
countries[countries['longitude'] > 25]

Unnamed: 0,country,google_country_code,country_group,name_en,name_fr,name_de,latitude,longitude
2,bg,BG,eu,Bulgaria,Bulgarie,Bulgarien,42.725674,25.482322
4,cy,CY,eu,Cyprus,Chypre,Zypern,35.129141,33.428682
7,ee,EE,eu,Estonia,Estonie,Estland,58.592469,25.80695
8,fi,FI,eu,Finland,Finlande,Finnland,64.950159,26.067564
28,tr,TR,non-eu,Turkey,Turquie,Türkei,38.952942,35.439795


We can refine this even more by using `.loc` to simultaneously index rows and columns to isolate the names of those countries with longitude over 25:

In [63]:
countries.loc[countries['longitude'] > 25, "name_en"]

2     Bulgaria
4       Cyprus
7      Estonia
8      Finland
28      Turkey
Name: name_en, dtype: object
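To make the role of the mask explicit, the sketch below stores it in its own variable before using it. The DataFrame `toy` is a made-up four-row table with longitudes rounded from the data above; the variable names are only for illustration.

```python
import pandas as pd

# Made-up frame; longitudes rounded from the countries data
toy = pd.DataFrame({
    "name_en": ["Austria", "Bulgaria", "Cyprus", "Ireland"],
    "longitude": [13.35, 25.48, 33.43, -8.24],
})

# Step 1: the logical test produces a boolean mask (one True/False per row)
mask = toy["longitude"] > 25

# True counts as 1, so summing the mask counts the matching rows
n_matches = int(mask.sum())

# Step 2: the mask selects rows; .loc lets us pick a column at the same time
east_names = toy.loc[mask, "name_en"].tolist()

print(mask)
print(n_matches, east_names)
```

Saving the mask to a variable is optional, but it can make longer filters easier to read and lets you reuse the same test in several places.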

## 🥊 Challenge 4: Boolean Indexing

Let's push our boolean indexing skills a little further with two challenge problems.
1. Find the average longitude of countries outside of the European Union in our data  
<details>
    <summary><a>Click for Hint</a></summary>
This sounds pretty tough, but start with the code we used above to isolate non-EU countries: <code>countries[countries['country_group'] == 'non-eu']</code>.  
Use <code>.loc</code> to extend this to also select just the <code>longitude</code> column. Once you've figured that out, use <code>.mean()</code> to compute the average.
</details>
2. Find countries that have "above average" longitude
<details>
    <summary><a>Click for Hint</a></summary>
    Compute the average longitude of the data: <code>countries['longitude'].mean()</code> and save that to a variable <code>average_longitude</code>. Then, you can use that variable to create a boolean mask for indexing: <code>countries['longitude'] > average_longitude</code>
</details>

In [64]:
countries[countries['longitude'] > countries['longitude'].mean()]

Unnamed: 0,country,google_country_code,country_group,name_en,name_fr,name_de,latitude,longitude
2,bg,BG,eu,Bulgaria,Bulgarie,Bulgarien,42.725674,25.482322
3,hr,HR,non-eu,Croatia,Croatie,Kroatien,44.746643,15.340844
4,cy,CY,eu,Cyprus,Chypre,Zypern,35.129141,33.428682
5,cz,CZ,eu,Czech Republic,République tchèque,Tschechische Republik,49.803531,15.474998
7,ee,EE,eu,Estonia,Estonie,Estland,58.592469,25.80695
8,fi,FI,eu,Finland,Finlande,Finnland,64.950159,26.067564
11,gr,GR,eu,Greece,Grèce,Griechenland,39.698467,21.577256
12,hu,HU,eu,Hungary,Hongrie,Ungarn,47.161163,19.504265
15,lv,LV,eu,Latvia,Lettonie,Lettland,56.880117,24.606555
16,lt,LT,eu,Lithuania,Lituanie,Litauen,55.173687,23.943168


### Boolean Indexing with multiple conditions
We won't have a challenge on this topic, but it's useful to know that we can boolean index using as many logical tests as we want by wrapping each test in parentheses (`()`) and by using the AND operator (`&`) or the OR operator (`|`).

In [65]:
# Select the countries with longitude greater than 25 but less than 30
countries[(countries['longitude'] > 25) & (countries['longitude'] < 30)]

Unnamed: 0,country,google_country_code,country_group,name_en,name_fr,name_de,latitude,longitude
2,bg,BG,eu,Bulgaria,Bulgarie,Bulgarien,42.725674,25.482322
7,ee,EE,eu,Estonia,Estonie,Estland,58.592469,25.80695
8,fi,FI,eu,Finland,Finlande,Finnland,64.950159,26.067564


In [66]:
# Select the countries with longitude greater than 30 or less than 0
countries[(countries['longitude'] > 30) | (countries['longitude'] < 0)]

Unnamed: 0,country,google_country_code,country_group,name_en,name_fr,name_de,latitude,longitude
4,cy,CY,eu,Cyprus,Chypre,Zypern,35.129141,33.428682
13,ie,IE,eu,Ireland,Irlande,Irland,53.41526,-8.239122
22,pt,PT,eu,Portugal,Portugal,Portugal,39.558069,-7.844941
26,es,ES,eu,Spain,Espagne,Spanien,39.895013,-2.988296
28,tr,TR,non-eu,Turkey,Turquie,Türkei,38.952942,35.439795
29,uk,GB,eu,United Kingdom,Royaume-Uni,Vereinigtes Königreich,54.315447,-2.232612
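Conditions can also be saved as named masks and combined afterwards, which some people find easier to read. The sketch below uses a made-up DataFrame (`toy`, with longitudes rounded from the data above) and two tools not covered elsewhere in this notebook: the NOT operator (`~`) and the `.isin()` method.

```python
import pandas as pd

# Made-up frame; values rounded from the countries data
toy = pd.DataFrame({
    "name_en": ["Bulgaria", "Cyprus", "Ireland", "Turkey"],
    "country_group": ["eu", "eu", "eu", "non-eu"],
    "longitude": [25.48, 33.43, -8.24, 35.44],
})

# Named masks: each one is a boolean Series
east = toy["longitude"] > 30
in_eu = toy["country_group"] == "eu"

# AND: both conditions must hold
east_and_eu = toy.loc[east & in_eu, "name_en"].tolist()

# NOT: flip every True/False in a mask with ~
not_east = toy.loc[~east, "name_en"].tolist()

# .isin() tests membership in a list of allowed values
non_eu = toy.loc[toy["country_group"].isin(["non-eu"]), "name_en"].tolist()

print(east_and_eu, not_east, non_eu)
```

Python's own `and` and `or` keywords do not work on masks; using them raises a `ValueError` because the truth value of a whole `Series` is ambiguous.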


# End Part 1