# Extract, transform and load data (ETL) 



The ETL process is a fundamental part of working with data and consists of three steps:


*   **Extraction**: collecting data, potentially from multiple heterogeneous sources. It can involve scraping web pages, calling application programming interfaces (APIs) or querying databases.
*   **Transformation**: reorganizing data through operations such as joining, merging and aggregation.
*   **Load**: persisting the new dataset wherever it needs to be stored.
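
To make the three steps concrete, here is a minimal sketch of a complete ETL run. The data, the column names and the use of in-memory buffers in place of real sources and destinations are assumptions made only for illustration.

```python
import io
import pandas as pd

# Extract: read raw data (an in-memory CSV stands in for a real source).
raw_csv = io.StringIO("city,sales\nLisbon,10\nPorto,5\nLisbon,7\n")
raw = pd.read_csv(raw_csv)

# Transform: aggregate the sales of each city.
totals = raw.groupby("city", as_index=False)["sales"].sum()

# Load: persist the result (an in-memory buffer stands in for a file).
output = io.StringIO()
totals.to_csv(output, index=False)
print(output.getvalue())
```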

This notebook focuses on examples of transformation with Pandas. 

For this, we will use three dataframes in our examples: `df_a`, `df_b` and `df_c`.






## Creating dataframes from dictionaries

---



Since we will create custom dataframes, we will need the help of **dictionaries**.

A dictionary is a Python object type that stores values indexed by keys, much like a Pandas `DataFrame` stores columns indexed by name.

We use the following notation for creating a dictionary:

```
name = {
        key1: value1,
        key2: value2,
        ...
        keyN: valueN
        }
```
We access a value in a dictionary through its key, using the notation `dictionary[key]`.
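
As a quick illustration of this notation, the sketch below builds a small dictionary and reads and writes values through their keys. The exam codes and names are hypothetical.

```python
# A hypothetical dictionary mapping exam codes to exam names.
exams = {15: "blood test", 61: "x-ray"}

print(exams[15])          # access a value through its key
exams[16] = "ultrasound"  # assigning to a new key adds a key-value pair
print(list(exams.keys()))
```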

In the following example, the keys of the dictionary `data_df_a` are the names of the associated series (columns).




In [0]:
import pandas as pd

In [0]:
data_df_a = {
            'id_individual': ['1', '2', '3', '4', '5'],
            'name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
            'surname': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']
            }

Note that each series is represented as a list. 

Creating a `DataFrame` from a dictionary of lists is really simple:

In [0]:
df_a = pd.DataFrame(data_df_a)
df_a

Following the same model, let's create the dataframe `df_b`:

In [0]:
data_df_b = {
            'id_individual': ['4', '5', '6', '7', '8'],
            'name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'], 
            'surname': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']
            }

In [0]:
df_b = pd.DataFrame(data_df_b)
df_b

In [0]:
data_df_c = {
            'id_individual': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
            'id_exam': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]
            }

In [0]:
df_c = pd.DataFrame(data_df_c)
df_c

## Joining data

A common operation is joining observations that have the same characteristics but are stored in different dataframes.

For this we will use the `concat` function, which receives a list of ***n*** `DataFrame` objects as a parameter.

In [0]:
df_new = pd.concat([df_a, df_b])
df_new
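
One detail worth knowing is that `concat` keeps the original row labels of each input, so labels can repeat in the result. The sketch below, which uses two small hypothetical frames, shows this and how the `ignore_index` option renumbers the rows.

```python
import pandas as pd

# Two small illustrative frames with the same columns (hypothetical data).
first = pd.DataFrame({"id": ["1", "2"], "name": ["Alex", "Amy"]})
second = pd.DataFrame({"id": ["3", "4"], "name": ["Billy", "Brian"]})

# By default the original row labels are kept, so 0 and 1 appear twice.
stacked = pd.concat([first, second])
print(stacked.index.tolist())

# ignore_index=True renumbers the rows from 0 to n-1.
renumbered = pd.concat([first, second], ignore_index=True)
print(renumbered.index.tolist())
```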

It is also possible to join `DataFrame` objects with distinct characteristics.

Such an operation, however, produces a `DataFrame` with many missing values:

In [0]:
pd.concat([df_a, df_c])
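
To quantify the missing values produced by this kind of join, one option is to count them per column. This is a minimal sketch with two small hypothetical frames that share only the `id_individual` column.

```python
import pandas as pd

left = pd.DataFrame({"id_individual": ["1", "2"], "name": ["Alex", "Amy"]})
right = pd.DataFrame({"id_individual": ["1", "9"], "id_exam": [51, 1]})

combined = pd.concat([left, right], ignore_index=True)

# Columns absent from one of the inputs are filled with NaN for its rows.
print(combined.isna().sum())
```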

## Merging data

In the previous example, we saw the result of joining `DataFrames` whose characteristics are not identical.


However, when two `DataFrames` have at least one characteristic in common, we can use
the technique called **data merging**, through the pandas **merge** function, which returns
a new `DataFrame` combining the columns of both inputs. By default, only the observations whose key appears in both `DataFrames` are kept.


In the example below, the observations of the `DataFrame` on the left (`df_a`) and the `DataFrame` on the right (`df_c`) are matched, taking `id_individual` as the common characteristic.

As you can see, the new `DataFrame` gathers information from both `DataFrames` used in the merge:


In [0]:
pd.merge(df_a, df_c, on='id_individual')

In some situations, the same characteristic may have different names in the `DataFrames` one wants to merge. In this case, we use
the arguments `left_on` and `right_on` to specify, respectively, the name of the characteristic in the `DataFrame` on the left and in the `DataFrame` on the right.
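
A minimal sketch of `left_on` and `right_on` follows. The `exams` frame and its `person_id` column are hypothetical, introduced only to show a key with a different name on each side.

```python
import pandas as pd

people = pd.DataFrame({"id_individual": ["1", "2", "3"],
                       "name": ["Alex", "Amy", "Allen"]})
# Hypothetical frame in which the same identifier is called "person_id".
exams = pd.DataFrame({"person_id": ["1", "2", "9"],
                      "id_exam": [51, 15, 1]})

merged = pd.merge(people, exams, left_on="id_individual", right_on="person_id")

# Both key columns are kept in the result; drop the redundant one if desired.
merged = merged.drop(columns="person_id")
print(merged)
```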

In summary, the **merging** operation combines the data from two `DataFrames` with at least one
common **column**. In the previous example, the common characteristic was the **id_individual** field.

### Types of merge

As seen above, a merge matches observations through a **common characteristic**, which in our example was the `id_individual` field.


Note that the observations in the `df_c` dataframe whose values for `id_individual` are not present in the `df_a` dataframe were not shown. This is the default behavior of `merge`, known as an **inner** merge.


If we want these **observations** to be preserved, we can use a **right** merge.

To do so, we include the **how** argument, which accepts, among other values, **left** or **right**. These refer, respectively, to the first and second `DataFrames` passed to `merge`.


What does that mean? With `how='right'`, every row of the `DataFrame` on the right is kept in the result, even when it has no match on the left; `how='left'` does the same for the `DataFrame` on the left.

In [0]:
pd.merge(df_a, df_c, on='id_individual', how='right')


The result above shows both the observations with `id_individual` present in the two `DataFrames` and the remaining observations of the `DataFrame` on the right.

Note that the observations added by the right merge have **missing data** in the columns that come from the left.


The **left merge** behaves the same way, now preserving the `DataFrame` on the left:

In [0]:
pd.merge(df_b, df_c, on='id_individual', how='left')


In this case, the observation of the `df_b` dataframe whose `id_individual` was not present in the `df_c` dataframe was kept.

In a more extreme case, we can use an **outer merge**, which keeps all observations from both dataframes. Instead of **right** or **left**, use **outer**:

In [0]:
pd.merge(df_b, df_c, on='id_individual', how='outer')
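
To see where each row of a merge came from, `merge` offers the `indicator` option. The sketch below uses two small hypothetical frames and also compares the number of rows produced by each type of merge.

```python
import pandas as pd

left = pd.DataFrame({"id_individual": ["4", "5", "6"],
                     "name": ["Billy", "Brian", "Bran"]})
right = pd.DataFrame({"id_individual": ["4", "5", "9"],
                      "id_exam": [61, 16, 1]})

# indicator=True adds a "_merge" column: "both", "left_only" or "right_only".
outer = pd.merge(left, right, on="id_individual", how="outer", indicator=True)
print(outer)

# Number of rows produced by each type of merge on the same inputs.
row_counts = {}
for how in ["inner", "left", "right", "outer"]:
    row_counts[how] = len(pd.merge(left, right, on="id_individual", how=how))
print(row_counts)
```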

## Aggregating data

Joining and merging operations aim to gather information spread across multiple sources into a single `DataFrame`.

A complementary type of operation is **aggregation**, which aims to summarize blocks of information using **descriptive statistics**.

The main forms of aggregation are obtained through **pivoting**, be it one-dimensional (**groups**)
 or two-dimensional (**pivot** **tables**).

### Groups

Organizing data into groups can be useful to analyse each group or to calculate statistics by group.

The first step of aggregation is to define one or more characteristics used as grouping factors. 

In the example below, we group the data from the `iris` dataset.

This is one of the best-known datasets in the [UCI](https://archive.ics.uci.edu/ml/) machine learning repository. It lists petal and sepal measurements of three species of iris flowers.

For convenience, we are going to load it through the `seaborn` library:

In [0]:
import seaborn as sns
iris_dataset = sns.load_dataset('iris')
iris_dataset

As we can see, the dataset contains the length and width of the sepals and petals of 150 samples of iris flowers.

Let's see how many samples we have per species: 

In [0]:
iris_dataset['species'].value_counts()

To group this dataset by species, we can use the `groupby()` method:

In [0]:
iris_groups = iris_dataset.groupby('species')

Then, we can treat each group as a `DataFrame` using the `get_group()` method:

In [0]:
iris_groups.get_group('versicolor').head()
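
Besides `get_group()`, a grouped object can also be iterated over, yielding each group in turn. A minimal sketch, using a tiny hypothetical frame in place of the iris data:

```python
import pandas as pd

flowers = pd.DataFrame({
    "species": ["setosa", "virginica", "setosa"],
    "petal_length": [1.4, 5.0, 1.6],
})

# Each iteration yields the group key and the matching rows as a DataFrame.
group_sizes = {}
for species, group in flowers.groupby("species"):
    group_sizes[species] = len(group)
print(group_sizes)
```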

The grouping allows us to compute statistics for all groups at the same time or for each group individually:

#### At the same time

In [0]:
iris_groups.min()

In [0]:
iris_groups.max()

In [0]:
iris_groups.mean()
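
Several statistics can also be computed in a single call with the `agg()` method. The sketch below assumes a tiny hypothetical frame in the spirit of the iris data.

```python
import pandas as pd

# Tiny hypothetical sample with one categorical and one numerical column.
flowers = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "petal_length": [1.4, 1.6, 5.0, 6.0],
})

# One call computes several statistics for each group.
summary = flowers.groupby("species")["petal_length"].agg(["min", "max", "mean"])
print(summary)
```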

#### Individually

In [0]:
iris_groups.get_group("versicolor").describe()

In [0]:
versicolor_group = iris_groups.get_group("versicolor")
versicolor_group.count()

#### Aggregating by multiple characteristics

A powerful feature of Pandas is that it allows aggregating by multiple characteristics.

In general, we use this feature when we have a set of data that has categorical and numerical characteristics.

In the `iris` dataset, however, we only have one categorical feature available.

Let's take advantage of this situation and take a look at a really cool feature of Pandas, called discretization in intervals:

In [0]:
pd.cut(iris_dataset["petal_width"], bins=3)

Did you understand what happened?

The `cut()` function found the minimum and maximum values of the `petal_width` characteristic and divided this range into three subintervals of equal width.

Thus, each of the original values was replaced by the subinterval to which it belongs and, now, we have a categorical variable 😄
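
The behavior of `cut()` is easier to follow on a handful of values. This is a minimal sketch with hypothetical numbers chosen so that the bin edges are easy to read.

```python
import pandas as pd

widths = pd.Series([0.0, 1.0, 2.0, 3.0])  # hypothetical values

binned = pd.cut(widths, bins=3)

# Three equal-width intervals covering the range from the minimum to the maximum.
print(binned.cat.categories)

# Position (0, 1 or 2) of the interval assigned to each value.
# Intervals are closed on the right, so 1.0 falls in the first one.
print(binned.cat.codes.tolist())
```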

Let's replace the original data with categorized data:

In [0]:
iris_dataset["petal_width"] = pd.cut(iris_dataset["petal_width"], bins=3)
iris_dataset

An additional feature of Pandas to deal with categorical characteristics is renaming the categories.

Let's rename the generated subintervals.

Note that `rename_categories()` returns a new series rather than modifying the original, so we assign the result back to the column (recent versions of Pandas no longer accept the `inplace=True` option in this method).

In [0]:
iris_dataset["petal_width"] = iris_dataset["petal_width"].cat.rename_categories(["low", "medium", "high"])
iris_dataset
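
An alternative to renaming the categories afterwards is to name them when the intervals are created, through the `labels` option of `cut()`. A minimal sketch with hypothetical values:

```python
import pandas as pd

widths = pd.Series([0.0, 1.0, 2.0, 3.0])  # hypothetical values

# labels= names the intervals at creation time, from the lowest to the highest.
labelled = pd.cut(widths, bins=3, labels=["low", "medium", "high"])
print(labelled.tolist())
```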

Now that our dataset has two categorical characteristics, we can aggregate by multiple characteristics:

In [0]:
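# observed=False keeps the combinations with no observations (size 0) in the result
group2_iris = iris_dataset.groupby(["species","petal_width"], observed=False).size()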
group2_iris

In this case, instead of producing the groups, we directly produce the aggregation using the `size()` method, which counts the size of each group.

From the data above, we can see that all iris flowers of the `setosa` species present in the dataset have a `low` petal width.

The petal width categories also separate the `versicolor` and `virginica` species quite well.

Note that the data above is a series that has a multi-level index (known in Pandas as `MultiIndex`):

In [0]:
group2_iris.index

Amid the verbose messages of Pandas, we see that there are two levels in this index (`levels`), whose names (`names`) are `species` and `petal_width`.

We can index this series in several different ways:

In [0]:
group2_iris["virginica","high"]

In [0]:
group2_iris["virginica",]

In [0]:
group2_iris[:,"high"]
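
When the position of a level is hard to remember, the `xs()` method selects by level name instead. The sketch below builds a small series with a two-level index; the counts are hypothetical and do not come from the iris data.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["versicolor", "virginica"], ["low", "high"]],
    names=["species", "petal_width"],
)
counts = pd.Series([10, 2, 1, 12], index=index)  # hypothetical counts

# Select every entry whose "petal_width" level equals "high".
high_counts = counts.xs("high", level="petal_width")
print(high_counts)

# A full key (one value per level) returns a single element.
print(counts.loc[("virginica", "high")])
```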

We can also convert this series into a `DataFrame`.

For this, we use the `reset_index()` method and specify the name we want to give to the column that holds the values of the series:

In [0]:
df_iris = group2_iris.reset_index(name="count")
df_iris

### Pivot tables

Another form of aggregation available in Pandas is through pivot tables.

In this case, we use the `pivot_table()` method. We must specify the characteristics used for grouping at the level of rows (`index`) and columns (`columns`).

We can also specify the aggregation method using the `aggfunc` option, which by default calculates the mean:

In [0]:
pt_iris = iris_dataset.pivot_table(index="species", columns="petal_width", aggfunc="size")
pt_iris

Note that the pivot table tries to generate all possible combinations between the values of the row and column characteristics.

Since our dataset does not have observations of the `setosa` species with petal width `medium` or `high`, these values are marked as missing (`NaN`).

The `pivot_table()` method provides the `fill_value` option, which allows us to choose how to fill these cases:

In [0]:
pt_iris = iris_dataset.pivot_table(index="species", columns="petal_width", aggfunc="size", fill_value=0)
pt_iris

The `pivot_table()` method produces an object of type `DataFrame`.

So, the indexing works the way we already know:

In [0]:
pt_iris.loc["versicolor"]

In [0]:
pt_iris.loc["versicolor","low"]

In [0]:
pt_iris.loc[:,"low"]
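
To round off, the sketch below shows the default behavior of `pivot_table()` (the mean of a numerical column) next to `pd.crosstab()`, a related Pandas function that counts combinations. The frame and the `size_class` column are hypothetical.

```python
import pandas as pd

# Hypothetical measurements with two categorical columns.
flowers = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "size_class": ["low", "low", "low", "high"],
    "petal_length": [1.4, 1.6, 5.0, 6.0],
})

# Without aggfunc, pivot_table averages the column given in values=.
means = flowers.pivot_table(index="species", columns="size_class",
                            values="petal_length")
print(means)

# pd.crosstab counts the combinations; absent ones appear as 0.
counts = pd.crosstab(flowers["species"], flowers["size_class"])
print(counts)
```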