# Data exploration & preprocessing and the `dfply` library

`dfply` is a library for manipulating pandas dataframes with "pipes": operators such as `>>` and `>>=` that chain operations from left to right, which often makes code easier to read than the equivalent `pandas` operations. You can find the full library documentation [here](https://github.com/kieferk/dfply).

First things first... let's install!

In [None]:
# !pip install dfply

# We'll directly import all functions from dfply
from dfply import * 

![Diamond image](https://imgs.search.brave.com/9G9zDX2B07vBNRsjS3uhXz_axrgqmj0ulCY1P2nO90k/rs:fit:844:225:1/g:ce/aHR0cHM6Ly90c2U0/Lm1tLmJpbmcubmV0/L3RoP2lkPU9JUC54/R1BTRkJvRUx2WkUt/OFBKUjIxUFh3SGFF/SyZwaWQ9QXBp)

In [None]:
# dfply comes with an in-built generic dataset
diamonds.head()

# How did I find this?

I regularly follow a weekly Data Science newsletter called [Data Elixir](https://dataelixir.com/), which compiles industry news, articles, and tips, and which I highly recommend! This newsletter is also how I learnt that the Dunning-Kruger effect, or the belief that "stupid people don't know they're stupid", is actually [statistics gone wrong](https://economicsfromthetopdown.com/2022/04/08/the-dunning-kruger-effect-is-autocorrelation/#fn2)...

# Pipeline operators

## 1. Create and display a new dataframe (`>>`)

In [None]:
diamonds >> head(5)

In [None]:
diamonds >> head(5) >> tail(3)

In [None]:
diamonds.head(5).tail(3)

In [None]:
diamonds_5 = diamonds >> head(5)
diamonds_5

## 2. Replace current dataframe (`>>=`)

In [None]:
diamonds >>= head(3)

In [None]:
diamonds

We haven't yet seen why `dfply` might be nicer than `pandas`, since showing the `head` and `tail` of a dataframe is a fairly basic task, but stay tuned!
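If you're curious how `>>` can work on data at all, here is a minimal sketch in plain Python. It assumes the left-hand object does not handle `>>` itself, so Python falls back to the right-hand object's `__rrshift__` method. The `pipe` class and the list-based `head` and `tail` below are made up for illustration and are not `dfply`'s actual implementation.

```python
class pipe:
    """Wrap a function so it can sit on the right-hand side of `>>`."""

    def __init__(self, func):
        self.func = func
        self.args = ()
        self.kwargs = {}

    def __call__(self, *args, **kwargs):
        # Calling the wrapper, e.g. head(3), only stores the arguments for later.
        bound = pipe(self.func)
        bound.args, bound.kwargs = args, kwargs
        return bound

    def __rrshift__(self, data):
        # `data >> step` ends up here when `data` does not define `>>` itself.
        return self.func(data, *self.args, **self.kwargs)


@pipe
def head(rows, n=5):
    return rows[:n]


@pipe
def tail(rows, n=5):
    return rows[-n:]


rows = [10, 20, 30, 40, 50, 60]

# Steps run left to right: same as tail(head(rows, 5), 3)
first_then_last = rows >> head(5) >> tail(3)

# Lists have no in-place `>>=` hook, so this is `rows = rows >> head(3)`
rows >>= head(3)
```

The last line is why `diamonds >>= head(3)` above replaced the dataframe: the name is rebound to the result of the pipeline.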

# Selecting & Filtering

## 1. Select & drop (columns)

The beauty of the `select` function is that it lets you pick columns by:
- name (e.g. `'price'`)
- position (e.g. `1`)
- name as attribute (e.g. `X.price`, we'll get back to this later)

In [None]:
diamonds >> select(1, X.price, ['x', 'y']) >> head(2)

We'll try something similar with `pandas`:

In [None]:
try:
    diamonds[1]
except Exception as e:
    print(f"Following exception occurred: {repr(e)}")

We need to use `iloc` ("integer location") to select a column by position. Positions are zero-based, so `1` is the second column:

In [None]:
diamonds.iloc[:, 1]

However, we cannot mix column names with column positions in the same call!

In [None]:
try:
    diamonds.loc[:, ['price', 1]]
except Exception as e:
    print(f"Following exception occurred: {repr(e)}")

In [None]:
try:
    diamonds.iloc[:, ['price', 1]]
except Exception as e:
    print(f"Following exception occurred: {repr(e)}")
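For comparison, one way to get a mixed selection in plain `pandas` is to translate positions into labels first. This is only a sketch: `select_mixed` is a hypothetical helper written for this example, and the small frame uses illustrative values rather than the real dataset.

```python
import pandas as pd

# Small illustrative frame (values are placeholders, not the full dataset)
df = pd.DataFrame({
    "carat": [0.23, 0.21],
    "cut": ["Ideal", "Premium"],
    "price": [326, 326],
    "x": [3.95, 3.89],
    "y": [3.98, 3.84],
})


def select_mixed(frame, *selectors):
    """Select columns given a mix of labels and integer positions."""
    labels = []
    for sel in selectors:
        if isinstance(sel, int):
            labels.append(frame.columns[sel])  # position -> label
        else:
            labels.append(sel)
    return frame.loc[:, labels]


subset = select_mixed(df, 1, "price", "x", "y")
```

It works, but it is extra code you have to write and maintain yourself.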

This already shows one cool thing about `dfply`: its column selection is more flexible than what `pandas` offers! Plus, you can also do this cool trick:

In [None]:
diamonds >> select(starts_with('c')) 

`drop` is simply the opposite of `select` and accepts the same kinds of selectors, so we won't spend too much time on it:

In [None]:
diamonds >> drop(starts_with('c')) 
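For reference, a rough `pandas` counterpart of the two cells above uses a boolean array built from the column names. The frame below is a one-row stand-in with illustrative values.

```python
import pandas as pd

# One-row stand-in with illustrative values
df = pd.DataFrame({
    "carat": [0.23], "cut": ["Ideal"], "color": ["E"],
    "clarity": ["SI2"], "depth": [61.5], "price": [326],
})

# Boolean array with one entry per column
starts_with_c = df.columns.str.startswith("c")

selected = df.loc[:, starts_with_c]   # keep columns starting with "c"
dropped = df.loc[:, ~starts_with_c]   # keep everything else
```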

## 2. Masking (a.k.a. my favourite function ever!!)

In [None]:
diamonds >> mask(X.cut == 'Ideal')

Now you can see why I don't like the `pandas` version as much:

In [None]:
diamonds[diamonds.cut == 'Ideal']

The result is the same, and the `pandas` version is only slightly harder to read. However, once you start adding more conditions, the difference becomes easier to see!

In [None]:
# Let's first reload the dataset (we overwrote `diamonds` with `>>=` earlier)
from dfply import diamonds
diamonds

In [None]:
diamonds >> mask(X.cut == 'Ideal') \
         >> mask(X.table > 60) \
         >> mask(X.x > 4)

You can also pass all the conditions to a single `mask` call, but I don't recommend it because it's less readable:

In [None]:
diamonds >> mask(X.cut == 'Ideal', X.table > 60, X.x > 4)

The `pandas` version looks slightly more cluttered:

In [None]:
diamonds[(diamonds.cut == 'Ideal') \
            & (diamonds.table > 60) \
            & (diamonds.x > 4)]
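To be fair to `pandas`, the combined mask is not the only way to write this. The sketch below, on a tiny made-up frame, shows two alternatives that return the same rows: one filter per step with `.loc` and a callable, and a `query` string.

```python
import pandas as pd

# Tiny made-up frame; only the first row satisfies all three conditions
df = pd.DataFrame({
    "cut": ["Ideal", "Ideal", "Premium", "Ideal"],
    "table": [62.0, 55.0, 61.0, 63.0],
    "x": [4.5, 4.2, 4.8, 3.9],
})

# One combined boolean mask
combined = df[(df.cut == "Ideal") & (df.table > 60) & (df.x > 4)]

# One filter per step, closer in spirit to the chained `mask` calls
stepwise = (
    df
    .loc[lambda d: d.cut == "Ideal"]
    .loc[lambda d: d.table > 60]
    .loc[lambda d: d.x > 4]
)

# The same conditions written as a query string
queried = df.query("cut == 'Ideal' and table > 60 and x > 4")
```

Whether these read better than the piped version is a matter of taste; the piped version keeps the conditions as Python expressions rather than lambdas or strings.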

In real-world code, you can really see how organizing data preprocessing steps with pipes improves readability:

Example from my work:  
![](masking_example_1.png)  

How would this look in `pandas`?

In [None]:
# scope3_reported = reported_df[(reported_df.emission_co2e_indirect_scope3_total.isna() == False) & \
#                               (reported_df.emission_co2e_indirect_scope2_total.isna() == False) & \
#                               (reported_df.emission_co2e_direct_scope1_total.isna() == False)]
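The commented-out filter above keeps the rows where all three columns are present. As a sketch, the same idea can be written with `notna` or `dropna`; the frame below reuses the column names from the comment, but its values are placeholders invented for this example.

```python
import numpy as np
import pandas as pd

# Placeholder values; only the column names come from the snippet above
reported_df = pd.DataFrame({
    "emission_co2e_direct_scope1_total": [1.0, np.nan, 3.0],
    "emission_co2e_indirect_scope2_total": [1.0, 2.0, np.nan],
    "emission_co2e_indirect_scope3_total": [1.0, 2.0, 3.0],
})
scope_cols = list(reported_df.columns)

# Keep rows where every listed column has a value
with_mask = reported_df[reported_df[scope_cols].notna().all(axis=1)]

# Equivalent: drop rows with a missing value in any listed column
with_dropna = reported_df.dropna(subset=scope_cols)
```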

What about using `OR` conditions? In this case, the `mask` function very closely resembles the `pandas` alternative:

In [None]:
diamonds >> mask((X.cut == 'Ideal') | (X.table < 60) | (X.x < 4))

In [None]:
diamonds[(diamonds.cut == 'Ideal') \
            | (diamonds.table < 60) \
            | (diamonds.x < 4)]

# Joining dataframes

When it comes to combining dataframes, you already know that you can use `pd.concat` and `DataFrame.join` (or the more general `pd.merge`).

For `pd.concat`, `dfply` has two great replacements in terms of readability: the `bind_rows` & `bind_cols` functions!

In [None]:
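import pandas as pd  # explicit import, in case the star import above does not expose `pd`

# Let's first get 2 dataframes:
a = pd.DataFrame({
    'x1': ['A', 'B', 'C'],
    'x2': [1, 2, 3]
})
display(a)

c = pd.DataFrame({
    'x1': ['B', 'C', 'D'],
    'x2': [2, 3, 4]
})
display(c)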

In [None]:
a >> bind_rows(c)

In [None]:
pd.concat([a, c], axis=0)

In [None]:
a >> bind_cols(c)

In [None]:
pd.concat([a, c], axis=1)
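One detail worth checking in the `pd.concat` outputs is what happens to the row index and the column names. The sketch below rebuilds the same two frames so it can run on its own.

```python
import pandas as pd

a = pd.DataFrame({"x1": ["A", "B", "C"], "x2": [1, 2, 3]})
c = pd.DataFrame({"x1": ["B", "C", "D"], "x2": [2, 3, 4]})

# Stacking rows keeps each frame's original index, so labels repeat
stacked = pd.concat([a, c], axis=0)

# ignore_index=True renumbers the rows from 0
renumbered = pd.concat([a, c], axis=0, ignore_index=True)

# Placing frames side by side keeps duplicate column names
side_by_side = pd.concat([a, c], axis=1)
```

Repeated index labels and duplicate column names are easy to miss and can cause surprises in later steps, so it is worth looking at them before moving on.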

However, when it comes to joining dataframes, `dfply`'s functions (`inner_join`, `left_join`, `right_join`, `outer_join`) are great when the key columns have the same names in both dataframes (they fill the same role as `DataFrame.join`), but they're not as useful when names aren't standardized! For example:

In [None]:
# We'll create a copy of diamonds and rename its `color` column to `color_tag`
diamonds_copy = diamonds.copy()
diamonds_copy >>= rename(color_tag=X.color)
diamonds_copy >> head(5)

In [None]:
# I'm selecting only 2 columns and the first 51 rows (`.loc[:50]` includes label 50) to avoid any memory issues
pd.merge(
    diamonds.loc[:50, ['carat', 'color']],
    diamonds_copy.loc[:50, ['clarity', 'color_tag']],
    how='inner', left_on='color', right_on='color_tag'
)

In [None]:
# Now I'll try to join the two dataframes (selecting 2 columns from each table for simplicity):
try:
    diamonds.loc[:50, ['carat', 'color']] >> inner_join(diamonds_copy.loc[:50, ['clarity', 'color_tag']], by=['color', 'color_tag'])
except Exception as e:
    print(repr(e))
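A common workaround when key names differ is to rename one side first so both frames share the key name. Here is a sketch in plain `pandas` on two small made-up frames; the same rename-first idea should carry over to a `dfply` pipeline using the `rename` function shown earlier, though that is not tested here.

```python
import pandas as pd

# Small made-up frames with differently named key columns
left = pd.DataFrame({"carat": [0.23, 0.21, 0.29], "color": ["E", "E", "I"]})
right = pd.DataFrame({"clarity": ["SI2", "VS2"], "color_tag": ["E", "J"]})

# Option 1: name the key on each side explicitly; both key columns are kept
explicit = pd.merge(left, right, how="inner",
                    left_on="color", right_on="color_tag")

# Option 2: rename first so the key has one shared name, then join on it
renamed = pd.merge(left, right.rename(columns={"color_tag": "color"}),
                   how="inner", on="color")
```

Option 2 also avoids carrying two copies of the key column in the result.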

# Summary statistics

In [None]:
round(diamonds.describe(), 2)

`describe` is an amazing method for getting a quick feel for the data you're working with. However, as a growing data analyst, you might need a bit more than that later on. For example, how about the mean price per `cut` or `color`?

In [None]:
diamonds >> group_by('cut') >> summarize(price_mean=X.price.mean().round(2), price_std=X.price.std().round(2))
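For comparison, a rough `pandas` counterpart of the cell above uses `groupby` with named aggregation. This sketch runs on a small made-up frame, so the numbers are not those of the real dataset.

```python
import pandas as pd

# Small made-up frame
df = pd.DataFrame({
    "cut": ["Ideal", "Ideal", "Premium", "Premium"],
    "price": [300, 400, 500, 700],
})

summary = (
    df.groupby("cut")
      .agg(price_mean=("price", "mean"),   # output name = (column, function)
           price_std=("price", "std"))
      .round(2)
      .reset_index()
)
```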

`summarize_each` is a shorter way of summarizing the data when you want the same summary statistics for several columns. The drawback is that it's harder to customize your functions!

In [None]:
diamonds >> group_by(X.cut) >> summarize_each([np.mean, np.var, np.min, np.max], X.price, 4)

<details>
<summary>
<b>Question for the class:</b> What other way of checking the price mean per cut could we use? 👀
</summary>

```
import plotly.express as px
px.box(diamonds, x='price', color='cut')
```
</details>

# Who said ranking was just for SQL?

In [None]:
diamonds >> select(X.price) >> mutate(price_drank=dense_rank(X.price)) >> head(6)
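If you want to see what "dense" means here, the sketch below uses the `pandas` `rank` method on a handful of illustrative prices. Tied values share a rank in both variants, but dense ranking leaves no gap after a tie, while `method="min"` does.

```python
import pandas as pd

# Illustrative prices with one tie
prices = pd.DataFrame({"price": [326, 326, 327, 334, 335, 336]})

# Dense rank: the next distinct value gets the next rank
prices["price_drank"] = prices["price"].rank(method="dense")

# Min rank: ties share the lowest rank and the following rank is skipped
prices["price_minrank"] = prices["price"].rank(method="min")
```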