# Pandas DataFrames (`pd.DataFrame`)

The pandas DataFrame is the workhorse of most data science and analytics projects.  It represents the data you're working with as a table.  What makes it flexible is that every row **and** every column can be pulled out as a pandas Series, which opens up many powerful ways to work with the data.
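As a minimal sketch of that idea, the snippet below uses a small made-up table (the `prices` frame and its values are illustrative, not the data loaded later) to show that selecting one column or one row hands back a `pd.Series`:

```python
import pandas as pd

# Hypothetical toy data, for illustration only
prices = pd.DataFrame(
    {'open': [10.0, 11.0], 'close': [10.5, 10.8], 'token': ['AAA', 'BBB']},
    index=['r0', 'r1'],
)

column = prices['close']   # one column: a Series indexed by the row labels
row = prices.loc['r0']     # one row: a Series indexed by the column names

print(type(column).__name__, list(column.index))
print(type(row).__name__, list(row.index))
```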

In [1]:
import pandas as pd
import numpy as np
import requests

First, let's get some data so we can see what we can do with a data frame.  Don't worry about exactly what this function is doing; we will go over it in a bit.

In [2]:
def get_data(token):
    res = requests.get(
        f'https://api.cryptowat.ch/markets/coinbase-pro/{token}usd/ohlc',
        params={
            'periods': '3600',
            'after': str(int(pd.Timestamp('2021-12-01').timestamp()))
        }
    )

    df = pd.DataFrame(
        res.json()['result']['3600'],
        columns=['ts', 'open', 'high', 'low', 'close', 'volume', 'volumeUSD']
    )
    df['ts'] = pd.to_datetime(df.ts, unit='s')
    df['token'] = token
    
    return df


In [3]:
tokens = ['BTC', 'ETH', 'SOL', 'AAVE', 'COMP']

Don't worry too much about what is going on in the list comprehension below.  We'll briefly go over it because it showcases the power of Python, but it's not necessary for the class.

In [4]:
dfs = [
    (lambda x: x.assign(chain=np.where(x.token.isin(['ETH', 'AAVE', 'COMP']), np.full(x.shape[0], 'ETH'), x.token)))(get_data(token)) 
    for token in tokens
]

In [5]:
df_base = pd.concat(get_data(token) for token in tokens)
df_base['chain'] = np.where(df_base.token.isin(['ETH', 'AAVE', 'COMP']), np.full(df_base.shape[0], 'ETH'), df_base.token)
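As a side note on the `np.where` call above: NumPy broadcasts scalars, so the `np.full(...)` array is not strictly required. A minimal sketch on a made-up three-row frame (the `toy` name and its values are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({'token': ['BTC', 'ETH', 'AAVE']})

# Where the condition holds use the scalar 'ETH', otherwise keep the token itself
chain = np.where(toy['token'].isin(['ETH', 'AAVE', 'COMP']), 'ETH', toy['token'])

print(chain.tolist())
```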


In [6]:
df = df_base.set_index('ts')

## Understanding the data frame

After loading the data into our data frame, we can inspect what's inside.  We'll do this often: real data sets are usually too large to check row by row, so we need quick ways to confirm that the data loaded correctly.

Let's check some basic properties of the data set:

We can see how many rows and columns this data frame has, and the total number of data points (rows × columns):

In [7]:
df.shape

(2360, 8)

In [8]:
df.size

18880
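The two numbers are related: `size` is simply rows times columns (2360 × 8 = 18880 above). A small sketch on a made-up frame (`toy` is illustrative):

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})

n_rows, n_cols = toy.shape      # shape is a (rows, columns) tuple
total_cells = toy.size          # number of individual values

print(n_rows, n_cols, total_cells, len(toy))
```

`len(toy)` gives the row count only, which is handy when the column count is not needed.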

We can see what the first 5 rows look like:

In [9]:
df.head()

ts,open,high,low,close,volume,volumeUSD,token,chain
2021-12-01 00:00:00,57321.41,57451.05,56814.34,56987.97,388.482022,22184300.0,BTC,BTC
2021-12-01 01:00:00,56998.35,57726.45,56705.06,57616.41,599.791578,34371530.0,BTC,BTC
2021-12-01 02:00:00,57618.55,57620.0,56972.97,57030.83,591.6872,33870670.0,BTC,BTC
2021-12-01 03:00:00,57029.79,57396.87,56841.01,57307.59,702.560364,40078160.0,BTC,BTC
2021-12-01 04:00:00,57306.55,57456.82,57026.11,57404.01,859.591535,49205030.0,BTC,BTC


...and the last 5 rows

In [10]:
df.tail()

ts,open,high,low,close,volume,volumeUSD,token,chain
2021-12-20 11:00:00,185.0,188.11,185.0,185.59,422.829,78860.58968,COMP,ETH
2021-12-20 12:00:00,185.54,192.04,185.42,186.57,3997.07,746777.49758,COMP,ETH
2021-12-20 13:00:00,186.68,186.96,183.09,183.61,550.611,101728.26831,COMP,ETH
2021-12-20 14:00:00,183.69,185.16,182.4,183.13,973.037,178787.34838,COMP,ETH
2021-12-20 15:00:00,183.08,183.8,182.29,182.6,186.177,34097.9409,COMP,ETH


We can also see a general overview of the schema of the data (column names, non-null counts, and data types):

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2360 entries, 2021-12-01 00:00:00 to 2021-12-20 15:00:00
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   open       2360 non-null   float64
 1   high       2360 non-null   float64
 2   low        2360 non-null   float64
 3   close      2360 non-null   float64
 4   volume     2360 non-null   float64
 5   volumeUSD  2360 non-null   float64
 6   token      2360 non-null   object 
 7   chain      2360 non-null   object 
dtypes: float64(6), object(2)
memory usage: 165.9+ KB


as well as descriptive statistics for every numeric column:

In [12]:
df.describe()

,open,high,low,close,volume,volumeUSD
count,2360.0,2360.0,2360.0,2360.0,2360.0,2360.0
mean,10893.584959,10956.035398,10821.472901,10888.121808,16131.924348,17416290.0
std,19546.478028,19654.869615,19421.386583,19536.27455,34189.188556,26646110.0
min,151.63,152.75,148.03,151.64,67.178,18212.33
25%,186.05,187.749,184.335,185.9535,741.46975,328643.6
50%,223.1275,224.985,220.285,223.0535,2234.017851,8934317.0
75%,4316.97,4339.36,4285.1625,4316.1975,11588.103742,24292120.0
max,58664.4,59118.84,58445.53,58664.4,534212.095,398803500.0


## DataFrame Indexing

Indexing in data frames works much like indexing in Series, except there are now two "axes" we can operate on: rows and columns.  For example, using `[*]` for indexing (as with Series) operates on columns by default:

In [13]:
df['open']

ts
2021-12-01 00:00:00    57321.41
2021-12-01 01:00:00    56998.35
2021-12-01 02:00:00    57618.55
2021-12-01 03:00:00    57029.79
2021-12-01 04:00:00    57306.55
                         ...   
2021-12-20 11:00:00      185.00
2021-12-20 12:00:00      185.54
2021-12-20 13:00:00      186.68
2021-12-20 14:00:00      183.69
2021-12-20 15:00:00      183.08
Name: open, Length: 2360, dtype: float64

However, using `.loc[*]` lets you access rows by index label:

In [14]:
df.loc['2021-12-01 01:00:00']

ts,open,high,low,close,volume,volumeUSD,token,chain
2021-12-01 01:00:00,56998.35,57726.45,56705.06,57616.41,599.791578,34371530.0,BTC,BTC
2021-12-01 01:00:00,4636.43,4736.9,4605.49,4729.1,13819.06161,64876930.0,ETH,ETH
2021-12-01 01:00:00,208.716,211.773,207.821,211.507,92606.555,19461530.0,SOL,SOL
2021-12-01 01:00:00,257.149,266.249,255.27,264.816,5752.541,1516148.0,AAVE,ETH
2021-12-01 01:00:00,278.65,283.8,276.36,283.44,817.668,229274.6,COMP,ETH


and `.iloc[*]` gets you rows by integer position:

In [15]:
df.iloc[0]

open               57321.41
high               57451.05
low                56814.34
close              56987.97
volume           388.482022
volumeUSD    22184300.66241
token                   BTC
chain                   BTC
Name: 2021-12-01 00:00:00, dtype: object

we can also get to the last row easily

In [16]:
df.iloc[-1]

open             183.08
high              183.8
low              182.29
close             182.6
volume          186.177
volumeUSD    34097.9409
token              COMP
chain               ETH
Name: 2021-12-20 15:00:00, dtype: object

or return it as a data frame instead of a Series

In [17]:
df.iloc[[-1]]

ts,open,high,low,close,volume,volumeUSD,token,chain
2021-12-20 15:00:00,183.08,183.8,182.29,182.6,186.177,34097.9409,COMP,ETH


**note**: `df.loc[0]` will not work here, because `.loc` looks rows up by index label and this index holds timestamps, not integers

---
**note**: Also, the index operators return a `pd.Series` when a single label or position matches exactly one row, and a new `pd.DataFrame` when multiple rows are returned (or when a list or slice is passed), e.g.:

In [18]:
type(df.iloc[0])

pandas.core.series.Series

In [19]:
type(df.loc['2021-12-01 01:00:00']) # 5 rows returned

pandas.core.frame.DataFrame
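The same rule can be checked on a small made-up frame with a repeated label. This is a sketch only; `toy` and its labels are illustrative:

```python
import pandas as pd

# Hypothetical toy data: label 'a' appears twice, label 'b' once
toy = pd.DataFrame({'close': [1.0, 2.0, 3.0]}, index=['a', 'a', 'b'])

single = toy.loc['b']        # scalar label, one matching row
repeated = toy.loc['a']      # scalar label, two matching rows
as_list = toy.loc[['b']]     # list of labels, even with one match
by_position = toy.iloc[0]    # scalar position always matches one row

for name, result in [('single', single), ('repeated', repeated),
                     ('as_list', as_list), ('by_position', by_position)]:
    print(name, type(result).__name__)
```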

We can convert a Series to a DataFrame at any time with the `.to_frame()` method.  This turns the Series into a single-column DataFrame, using `Series.name` as the column name:

In [20]:
df.iloc[0].to_frame()

,2021-12-01
open,57321.41
high,57451.05
low,56814.34
close,56987.97
volume,388.482022
volumeUSD,22184300.66241
token,BTC
chain,BTC
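A short sketch of how the name carries over, using a made-up Series (`row0` is an illustrative name). Transposing with `.T` is one way to turn the result back into a single-row frame:

```python
import pandas as pd

# Hypothetical toy Series, for illustration only
s = pd.Series([1.0, 2.0], index=['open', 'close'], name='row0')

as_column = s.to_frame()     # one column named after the Series
as_row = s.to_frame().T      # transposed: one row labelled 'row0'

print(as_column.shape, list(as_column.columns))
print(as_row.shape, list(as_row.index))
```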


---

In addition, we can select on multiple columns and rows in various ways:

In [21]:
df[['open', 'close']]

ts,open,close
2021-12-01 00:00:00,57321.41,56987.97
2021-12-01 01:00:00,56998.35,57616.41
2021-12-01 02:00:00,57618.55,57030.83
2021-12-01 03:00:00,57029.79,57307.59
2021-12-01 04:00:00,57306.55,57404.01
...,...,...
2021-12-20 11:00:00,185.00,185.59
2021-12-20 12:00:00,185.54,186.57
2021-12-20 13:00:00,186.68,183.61
2021-12-20 14:00:00,183.69,183.13


In [22]:
df[0:2]

ts,open,high,low,close,volume,volumeUSD,token,chain
2021-12-01 00:00:00,57321.41,57451.05,56814.34,56987.97,388.482022,22184300.0,BTC,BTC
2021-12-01 01:00:00,56998.35,57726.45,56705.06,57616.41,599.791578,34371530.0,BTC,BTC


In [23]:
df.loc['2021-12-01 00:00:00':'2021-12-01 02:00:00']

ts,open,high,low,close,volume,volumeUSD,token,chain
2021-12-01 00:00:00,57321.41,57451.05,56814.34,56987.97,388.482022,22184300.0,BTC,BTC
2021-12-01 01:00:00,56998.35,57726.45,56705.06,57616.41,599.791578,34371530.0,BTC,BTC
2021-12-01 02:00:00,57618.55,57620.0,56972.97,57030.83,591.6872,33870670.0,BTC,BTC
2021-12-01 00:00:00,4656.62,4672.43,4624.16,4634.95,6013.006735,27933210.0,ETH,ETH
2021-12-01 01:00:00,4636.43,4736.9,4605.49,4729.1,13819.06161,64876930.0,ETH,ETH
2021-12-01 02:00:00,4729.1,4729.1,4684.49,4695.78,7491.46544,35241610.0,ETH,ETH
2021-12-01 00:00:00,210.312,210.59,208.432,208.676,70031.618,14658510.0,SOL,SOL
2021-12-01 01:00:00,208.716,211.773,207.821,211.507,92606.555,19461530.0,SOL,SOL
2021-12-01 02:00:00,211.506,212.235,210.003,210.868,49728.032,10497560.0,SOL,SOL
2021-12-01 00:00:00,257.102,260.775,255.345,257.078,2730.299,703918.3,AAVE,ETH


In [24]:
df.iloc[0:4]

ts,open,high,low,close,volume,volumeUSD,token,chain
2021-12-01 00:00:00,57321.41,57451.05,56814.34,56987.97,388.482022,22184300.0,BTC,BTC
2021-12-01 01:00:00,56998.35,57726.45,56705.06,57616.41,599.791578,34371530.0,BTC,BTC
2021-12-01 02:00:00,57618.55,57620.0,56972.97,57030.83,591.6872,33870670.0,BTC,BTC
2021-12-01 03:00:00,57029.79,57396.87,56841.01,57307.59,702.560364,40078160.0,BTC,BTC


In [25]:
df.iloc[[0, 4, 10, 50]]

ts,open,high,low,close,volume,volumeUSD,token,chain
2021-12-01 00:00:00,57321.41,57451.05,56814.34,56987.97,388.482022,22184300.0,BTC,BTC
2021-12-01 04:00:00,57306.55,57456.82,57026.11,57404.01,859.591535,49205030.0,BTC,BTC
2021-12-01 10:00:00,56945.94,57220.39,56756.48,57131.16,586.052971,33407680.0,BTC,BTC
2021-12-03 02:00:00,56569.3,56762.82,56425.99,56545.98,238.584708,13496120.0,BTC,BTC


And finally, we can index on both rows and columns at the same time with `.loc`:

In [26]:
df.loc['2021-12-01 00:00:00':'2021-12-01 02:00:00', ['close', 'volume', 'token']]

ts,close,volume,token
2021-12-01 00:00:00,56987.97,388.482022,BTC
2021-12-01 01:00:00,57616.41,599.791578,BTC
2021-12-01 02:00:00,57030.83,591.6872,BTC
2021-12-01 00:00:00,4634.95,6013.006735,ETH
2021-12-01 01:00:00,4729.1,13819.06161,ETH
2021-12-01 02:00:00,4695.78,7491.46544,ETH
2021-12-01 00:00:00,208.676,70031.618,SOL
2021-12-01 01:00:00,211.507,92606.555,SOL
2021-12-01 02:00:00,210.868,49728.032,SOL
2021-12-01 00:00:00,257.078,2730.299,AAVE


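**note**: By default, data frame indices are sequential integers, so a label passed to `.loc` and a position passed to `.iloc` can look identical.  It's good practice to use `.loc` and `.iloc` explicitly to make the intent clear.  For example, let's shuffle our data frame and then select (`df_base` was concatenated from five per-token frames, so each integer label appears once per token):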

In [27]:
df_shuffled = df_base.sample(frac=1)

In [28]:
df_shuffled.loc[[0, 2, 3]]

,ts,open,high,low,close,volume,volumeUSD,token,chain
0,2021-12-01 00:00:00,4656.62,4672.43,4624.16,4634.95,6013.006735,27933210.0,ETH,ETH
0,2021-12-01 00:00:00,57321.41,57451.05,56814.34,56987.97,388.482022,22184300.0,BTC,BTC
0,2021-12-01 00:00:00,280.59,281.4,278.3,278.7,207.849,58221.57,COMP,ETH
0,2021-12-01 00:00:00,210.312,210.59,208.432,208.676,70031.618,14658510.0,SOL,SOL
0,2021-12-01 00:00:00,257.102,260.775,255.345,257.078,2730.299,703918.3,AAVE,ETH
2,2021-12-01 02:00:00,283.2,283.2,280.61,281.29,254.33,71609.33,COMP,ETH
2,2021-12-01 02:00:00,211.506,212.235,210.003,210.868,49728.032,10497560.0,SOL,SOL
2,2021-12-01 02:00:00,4729.1,4729.1,4684.49,4695.78,7491.46544,35241610.0,ETH,ETH
2,2021-12-01 02:00:00,57618.55,57620.0,56972.97,57030.83,591.6872,33870670.0,BTC,BTC
2,2021-12-01 02:00:00,264.755,266.187,262.597,263.125,1559.33,412444.9,AAVE,ETH


In [29]:
df_shuffled.iloc[[0, 2, 3]]

,ts,open,high,low,close,volume,volumeUSD,token,chain
412,2021-12-18 04:00:00,206.75,213.01,206.75,211.09,4854.884,1018896.0,COMP,ETH
30,2021-12-02 06:00:00,56535.01,57023.59,56522.52,56963.08,713.298801,40519600.0,BTC,BTC
289,2021-12-13 01:00:00,179.87,180.77,177.75,178.01,794.861,142723.2,AAVE,ETH
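The repeated labels in the `.loc` output come from concatenating frames that each start counting at 0. A minimal sketch with two made-up frames (`part_a`, `part_b` are illustrative) shows the effect, and how `ignore_index=True` renumbers the rows if unique labels are wanted:

```python
import pandas as pd

# Hypothetical toy data, for illustration only
part_a = pd.DataFrame({'token': ['AAA', 'AAA'], 'close': [1.0, 2.0]})
part_b = pd.DataFrame({'token': ['BBB', 'BBB'], 'close': [3.0, 4.0]})

stacked = pd.concat([part_a, part_b])                         # keeps labels 0, 1, 0, 1
renumbered = pd.concat([part_a, part_b], ignore_index=True)   # fresh labels 0..3

print(stacked.index.tolist(), stacked.index.is_unique)
print(len(stacked.loc[[0]]), len(stacked.iloc[[0]]))   # label 0 vs position 0
print(renumbered.index.tolist())
```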


Lastly, we can set the DataFrame index from a column, or move the index back into a column:

In [30]:
df_shuffled.set_index('ts')

ts,open,high,low,close,volume,volumeUSD,token,chain
2021-12-18 04:00:00,206.75,213.01,206.75,211.09,4854.884000,1.018896e+06,COMP,ETH
2021-12-16 01:00:00,178.59,179.96,177.12,179.47,82310.880000,1.471309e+07,SOL,SOL
2021-12-02 06:00:00,56535.01,57023.59,56522.52,56963.08,713.298801,4.051960e+07,BTC,BTC
2021-12-13 01:00:00,179.87,180.77,177.75,178.01,794.861000,1.427232e+05,AAVE,ETH
2021-12-06 04:00:00,48950.48,49300.00,48893.94,49118.32,714.117438,3.504269e+07,BTC,BTC
...,...,...,...,...,...,...,...,...
2021-12-18 15:00:00,204.36,208.56,204.36,207.65,814.755000,1.684817e+05,COMP,ETH
2021-12-02 20:00:00,4483.57,4509.11,4476.16,4490.75,7955.847565,3.574223e+07,ETH,ETH
2021-12-18 23:00:00,205.82,207.08,205.13,205.65,706.472000,1.453265e+05,COMP,ETH
2021-12-01 11:00:00,4744.70,4754.94,4706.00,4727.15,7807.194746,3.691863e+07,ETH,ETH


In [31]:
df_shuffled.set_index('ts').reset_index()

,ts,open,high,low,close,volume,volumeUSD,token,chain
0,2021-12-18 04:00:00,206.75,213.01,206.75,211.09,4854.884000,1.018896e+06,COMP,ETH
1,2021-12-16 01:00:00,178.59,179.96,177.12,179.47,82310.880000,1.471309e+07,SOL,SOL
2,2021-12-02 06:00:00,56535.01,57023.59,56522.52,56963.08,713.298801,4.051960e+07,BTC,BTC
3,2021-12-13 01:00:00,179.87,180.77,177.75,178.01,794.861000,1.427232e+05,AAVE,ETH
4,2021-12-06 04:00:00,48950.48,49300.00,48893.94,49118.32,714.117438,3.504269e+07,BTC,BTC
...,...,...,...,...,...,...,...,...,...
2355,2021-12-18 15:00:00,204.36,208.56,204.36,207.65,814.755000,1.684817e+05,COMP,ETH
2356,2021-12-02 20:00:00,4483.57,4509.11,4476.16,4490.75,7955.847565,3.574223e+07,ETH,ETH
2357,2021-12-18 23:00:00,205.82,207.08,205.13,205.65,706.472000,1.453265e+05,COMP,ETH
2358,2021-12-01 11:00:00,4744.70,4754.94,4706.00,4727.15,7807.194746,3.691863e+07,ETH,ETH
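Both methods return a new frame rather than changing the original. The sketch below, on a made-up two-row frame (`toy` is illustrative), also shows `reset_index(drop=True)`, which discards the index instead of turning it into a column:

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'ts': pd.to_datetime(['2021-12-02', '2021-12-01']),
    'close': [2.0, 1.0],
})

indexed = toy.set_index('ts')              # 'ts' becomes the index; toy is unchanged
restored = indexed.reset_index()           # index moves back into a 'ts' column
dropped = indexed.reset_index(drop=True)   # index is thrown away

print(list(toy.columns), list(indexed.columns))
print(list(restored.columns), list(dropped.columns))
```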


## DataFrame Filtering

Filtering a data frame is very similar to filtering a Series.  We can filter on any combination of columns; the filtering is done with a boolean mask that is aligned on the index.  For example, if we wanted just the data points for tokens on the Ethereum chain:

In [32]:
df['chain'] == 'ETH'

ts
2021-12-01 00:00:00    False
2021-12-01 01:00:00    False
2021-12-01 02:00:00    False
2021-12-01 03:00:00    False
2021-12-01 04:00:00    False
                       ...  
2021-12-20 11:00:00     True
2021-12-20 12:00:00     True
2021-12-20 13:00:00     True
2021-12-20 14:00:00     True
2021-12-20 15:00:00     True
Name: chain, Length: 2360, dtype: bool

In [33]:
df.loc[df['chain'] == 'ETH']

ts,open,high,low,close,volume,volumeUSD,token,chain
2021-12-01 00:00:00,4656.62,4672.43,4624.16,4634.95,6013.006735,2.793321e+07,ETH,ETH
2021-12-01 01:00:00,4636.43,4736.90,4605.49,4729.10,13819.061610,6.487693e+07,ETH,ETH
2021-12-01 02:00:00,4729.10,4729.10,4684.49,4695.78,7491.465440,3.524161e+07,ETH,ETH
2021-12-01 03:00:00,4695.78,4754.97,4672.30,4754.09,10530.834423,4.963273e+07,ETH,ETH
2021-12-01 04:00:00,4754.09,4774.74,4722.02,4764.59,12471.624735,5.924627e+07,ETH,ETH
...,...,...,...,...,...,...,...,...
2021-12-20 11:00:00,185.00,188.11,185.00,185.59,422.829000,7.886059e+04,COMP,ETH
2021-12-20 12:00:00,185.54,192.04,185.42,186.57,3997.070000,7.467775e+05,COMP,ETH
2021-12-20 13:00:00,186.68,186.96,183.09,183.61,550.611000,1.017283e+05,COMP,ETH
2021-12-20 14:00:00,183.69,185.16,182.40,183.13,973.037000,1.787873e+05,COMP,ETH


In [34]:
df.loc[df['chain'] == 'ETH', 'close']

ts
2021-12-01 00:00:00    4634.95
2021-12-01 01:00:00    4729.10
2021-12-01 02:00:00    4695.78
2021-12-01 03:00:00    4754.09
2021-12-01 04:00:00    4764.59
                        ...   
2021-12-20 11:00:00     185.59
2021-12-20 12:00:00     186.57
2021-12-20 13:00:00     183.61
2021-12-20 14:00:00     183.13
2021-12-20 15:00:00     182.60
Name: close, Length: 1416, dtype: float64
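Masks can also be combined. A minimal sketch on made-up rows (the `toy` frame and its rounded prices are illustrative): `&` means "and", `|` means "or", `~` negates, and each comparison needs its own parentheses.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'token': ['BTC', 'ETH', 'AAVE', 'SOL'],
    'chain': ['BTC', 'ETH', 'ETH', 'SOL'],
    'close': [57000.0, 4600.0, 257.0, 208.0],
})

eth_and_cheap = toy.loc[(toy['chain'] == 'ETH') & (toy['close'] < 1000)]
btc_or_sol = toy.loc[toy['token'].isin(['BTC', 'SOL'])]   # membership test
not_eth = toy.loc[~(toy['chain'] == 'ETH')]               # negated mask

print(eth_and_cheap['token'].tolist())
print(btc_or_sol['token'].tolist())
print(not_eth['token'].tolist())
```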

## Deleting from Dataframes

We can select whatever we'd like, but we can also drop rows and columns.  Dropping also works by index label (every row carrying the label is removed), e.g.:

In [35]:
df.drop(pd.to_datetime('2021-12-01 00:00:00'))

ts,open,high,low,close,volume,volumeUSD,token,chain
2021-12-01 01:00:00,56998.35,57726.45,56705.06,57616.41,599.791578,3.437153e+07,BTC,BTC
2021-12-01 02:00:00,57618.55,57620.00,56972.97,57030.83,591.687200,3.387067e+07,BTC,BTC
2021-12-01 03:00:00,57029.79,57396.87,56841.01,57307.59,702.560364,4.007816e+07,BTC,BTC
2021-12-01 04:00:00,57306.55,57456.82,57026.11,57404.01,859.591535,4.920503e+07,BTC,BTC
2021-12-01 05:00:00,57404.01,57460.42,57016.00,57084.36,566.037996,3.238116e+07,BTC,BTC
...,...,...,...,...,...,...,...,...
2021-12-20 11:00:00,185.00,188.11,185.00,185.59,422.829000,7.886059e+04,COMP,ETH
2021-12-20 12:00:00,185.54,192.04,185.42,186.57,3997.070000,7.467775e+05,COMP,ETH
2021-12-20 13:00:00,186.68,186.96,183.09,183.61,550.611000,1.017283e+05,COMP,ETH
2021-12-20 14:00:00,183.69,185.16,182.40,183.13,973.037000,1.787873e+05,COMP,ETH


In [36]:
df.drop(columns='volumeUSD')

ts,open,high,low,close,volume,token,chain
2021-12-01 00:00:00,57321.41,57451.05,56814.34,56987.97,388.482022,BTC,BTC
2021-12-01 01:00:00,56998.35,57726.45,56705.06,57616.41,599.791578,BTC,BTC
2021-12-01 02:00:00,57618.55,57620.00,56972.97,57030.83,591.687200,BTC,BTC
2021-12-01 03:00:00,57029.79,57396.87,56841.01,57307.59,702.560364,BTC,BTC
2021-12-01 04:00:00,57306.55,57456.82,57026.11,57404.01,859.591535,BTC,BTC
...,...,...,...,...,...,...,...
2021-12-20 11:00:00,185.00,188.11,185.00,185.59,422.829000,COMP,ETH
2021-12-20 12:00:00,185.54,192.04,185.42,186.57,3997.070000,COMP,ETH
2021-12-20 13:00:00,186.68,186.96,183.09,183.61,550.611000,COMP,ETH
2021-12-20 14:00:00,183.69,185.16,182.40,183.13,973.037000,COMP,ETH


In [37]:
df.drop(['close', 'open'], axis=1)

ts,high,low,volume,volumeUSD,token,chain
2021-12-01 00:00:00,57451.05,56814.34,388.482022,2.218430e+07,BTC,BTC
2021-12-01 01:00:00,57726.45,56705.06,599.791578,3.437153e+07,BTC,BTC
2021-12-01 02:00:00,57620.00,56972.97,591.687200,3.387067e+07,BTC,BTC
2021-12-01 03:00:00,57396.87,56841.01,702.560364,4.007816e+07,BTC,BTC
2021-12-01 04:00:00,57456.82,57026.11,859.591535,4.920503e+07,BTC,BTC
...,...,...,...,...,...,...
2021-12-20 11:00:00,188.11,185.00,422.829000,7.886059e+04,COMP,ETH
2021-12-20 12:00:00,192.04,185.42,3997.070000,7.467775e+05,COMP,ETH
2021-12-20 13:00:00,186.96,183.09,550.611000,1.017283e+05,COMP,ETH
2021-12-20 14:00:00,185.16,182.40,973.037000,1.787873e+05,COMP,ETH


## Common Operations

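As with pandas Series, the data in a DataFrame is stored in NumPy arrays under the hood, and `.values` exposes it as a single array (with `object` dtype here, because the columns mix floats and strings).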

In [38]:
type(df.values)

numpy.ndarray

This means that the operations we saw for pandas Series can be applied to DataFrames as well, e.g. we can apply a scalar operation to every element in the DataFrame:

In [39]:
df.head() * 10

ts,open,high,low,close,volume,volumeUSD,token,chain
2021-12-01 00:00:00,573214.1,574510.5,568143.4,569879.7,3884.82022,221843000.0,BTCBTCBTCBTCBTCBTCBTCBTCBTCBTC,BTCBTCBTCBTCBTCBTCBTCBTCBTCBTC
2021-12-01 01:00:00,569983.5,577264.5,567050.6,576164.1,5997.915776,343715300.0,BTCBTCBTCBTCBTCBTCBTCBTCBTCBTC,BTCBTCBTCBTCBTCBTCBTCBTCBTCBTC
2021-12-01 02:00:00,576185.5,576200.0,569729.7,570308.3,5916.872,338706700.0,BTCBTCBTCBTCBTCBTCBTCBTCBTCBTC,BTCBTCBTCBTCBTCBTCBTCBTCBTCBTC
2021-12-01 03:00:00,570297.9,573968.7,568410.1,573075.9,7025.603645,400781600.0,BTCBTCBTCBTCBTCBTCBTCBTCBTCBTC,BTCBTCBTCBTCBTCBTCBTCBTCBTCBTC
2021-12-01 04:00:00,573065.5,574568.2,570261.1,574040.1,8595.915349,492050300.0,BTCBTCBTCBTCBTCBTCBTCBTCBTCBTC,BTCBTCBTCBTCBTCBTCBTCBTCBTCBTC


However, the operation needs to be valid for ALL elements if we want to do this.  For example, `*` is defined for strings (it repeats them), but `/` is not and will fail:

In [40]:
df.head() / 10

TypeError: unsupported operand type(s) for /: 'str' and 'int'
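One way around this, sketched here on a made-up frame (`toy` is illustrative), is to restrict the operation to the numeric columns with `select_dtypes`:

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'close': [10.0, 20.0],
    'volume': [1.0, 2.0],
    'token': ['AAA', 'BBB'],
})

numeric = toy.select_dtypes(include='number')   # drops the string column
scaled = numeric / 10                           # now valid for every element

print(list(scaled.columns))
print(scaled['close'].tolist())
```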

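Aggregation functions operate column by column by default.  Passing `numeric_only=True` skips the string columns (`token`, `chain`); pandas 2.0 and later raise a `TypeError` without it.

In [41]:
df.mean(numeric_only=True)
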
open         1.089358e+04
high         1.095604e+04
low          1.082147e+04
close        1.088812e+04
volume       1.613192e+04
volumeUSD    1.741629e+07
dtype: float64

However, we can also make them aggregate by row:

In [42]:
df.mean(axis=1, numeric_only=True)

ts
2021-12-01 00:00:00    3.735544e+06
2021-12-01 01:00:00    5.766863e+06
2021-12-01 02:00:00    5.683417e+06
2021-12-01 03:00:00    6.717907e+06
2021-12-01 04:00:00    8.239181e+06
                           ...     
2021-12-20 11:00:00    1.333785e+04
2021-12-20 12:00:00    1.252540e+05
2021-12-20 13:00:00    1.716987e+04
2021-12-20 14:00:00    3.008246e+04
2021-12-20 15:00:00    5.835981e+03
Length: 2360, dtype: float64
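The `axis` argument is easy to mix up, so here is a sketch with numbers small enough to check by hand (the `toy` frame is made up for illustration):

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({'a': [1.0, 3.0], 'b': [5.0, 7.0], 'label': ['x', 'y']})

col_means = toy.mean(numeric_only=True)           # one value per column
row_means = toy.mean(axis=1, numeric_only=True)   # one value per row

print(col_means.to_dict())
print(row_means.tolist())
```

Here the column means are 2.0 for `a` and 6.0 for `b`, while the row means are 3.0 and 5.0.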

## Mutating the Dataframe

Like with other functionality, mutating DataFrames is very similar to mutating Series.  For example, setting one column to a single value is easy:

In [43]:
df_mutations = df_base.set_index('ts')

In [44]:
df_mutations.head()

ts,open,high,low,close,volume,volumeUSD,token,chain
2021-12-01 00:00:00,57321.41,57451.05,56814.34,56987.97,388.482022,22184300.0,BTC,BTC
2021-12-01 01:00:00,56998.35,57726.45,56705.06,57616.41,599.791578,34371530.0,BTC,BTC
2021-12-01 02:00:00,57618.55,57620.0,56972.97,57030.83,591.6872,33870670.0,BTC,BTC
2021-12-01 03:00:00,57029.79,57396.87,56841.01,57307.59,702.560364,40078160.0,BTC,BTC
2021-12-01 04:00:00,57306.55,57456.82,57026.11,57404.01,859.591535,49205030.0,BTC,BTC


In [45]:
df_mutations['chain'] = 'NA'

In [46]:
df_mutations.head()

ts,open,high,low,close,volume,volumeUSD,token,chain
2021-12-01 00:00:00,57321.41,57451.05,56814.34,56987.97,388.482022,22184300.0,BTC,NA
2021-12-01 01:00:00,56998.35,57726.45,56705.06,57616.41,599.791578,34371530.0,BTC,NA
2021-12-01 02:00:00,57618.55,57620.0,56972.97,57030.83,591.6872,33870670.0,BTC,NA
2021-12-01 03:00:00,57029.79,57396.87,56841.01,57307.59,702.560364,40078160.0,BTC,NA
2021-12-01 04:00:00,57306.55,57456.82,57026.11,57404.01,859.591535,49205030.0,BTC,NA


We can also create a new column and add data by index:

In [47]:
updates = pd.Series({pd.to_datetime('2021-12-01 00:00:00'): 1})
updates

2021-12-01    1
dtype: int64

In [48]:
df_mutations['start_of_week'] = updates

In [49]:
df_mutations.head()

ts,open,high,low,close,volume,volumeUSD,token,chain,start_of_week
2021-12-01 00:00:00,57321.41,57451.05,56814.34,56987.97,388.482022,22184300.0,BTC,NA,1.0
2021-12-01 01:00:00,56998.35,57726.45,56705.06,57616.41,599.791578,34371530.0,BTC,NA,
2021-12-01 02:00:00,57618.55,57620.0,56972.97,57030.83,591.6872,33870670.0,BTC,NA,
2021-12-01 03:00:00,57029.79,57396.87,56841.01,57307.59,702.560364,40078160.0,BTC,NA,
2021-12-01 04:00:00,57306.55,57456.82,57026.11,57404.01,859.591535,49205030.0,BTC,NA,
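The blanks in `start_of_week` come from index alignment: when a Series is assigned to a column, values are matched on index labels, and rows with no matching label get `NaN`. A minimal sketch on made-up data (`toy` and `flags` are illustrative names):

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({'close': [1.0, 2.0, 3.0]}, index=['a', 'b', 'c'])
flags = pd.Series({'b': 1})   # only label 'b' has a value

toy['flag'] = flags           # aligned on the index, not on position

print(toy['flag'].tolist())
print(toy['flag'].dtype)      # the integers become floats so NaN can be stored
```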


We can also use the `.assign(...)` method to update columns, e.g.:

In [50]:
df_mutations.assign(
    chain=np.where(df_mutations.token.isin(['ETH', 'AAVE', 'COMP']), np.full(df_mutations.shape[0], 'ETH'), df_mutations.token),
    start_of_week=np.nan
)

ts,open,high,low,close,volume,volumeUSD,token,chain,start_of_week
2021-12-01 00:00:00,57321.41,57451.05,56814.34,56987.97,388.482022,2.218430e+07,BTC,BTC,
2021-12-01 01:00:00,56998.35,57726.45,56705.06,57616.41,599.791578,3.437153e+07,BTC,BTC,
2021-12-01 02:00:00,57618.55,57620.00,56972.97,57030.83,591.687200,3.387067e+07,BTC,BTC,
2021-12-01 03:00:00,57029.79,57396.87,56841.01,57307.59,702.560364,4.007816e+07,BTC,BTC,
2021-12-01 04:00:00,57306.55,57456.82,57026.11,57404.01,859.591535,4.920503e+07,BTC,BTC,
...,...,...,...,...,...,...,...,...,...
2021-12-20 11:00:00,185.00,188.11,185.00,185.59,422.829000,7.886059e+04,COMP,ETH,
2021-12-20 12:00:00,185.54,192.04,185.42,186.57,3997.070000,7.467775e+05,COMP,ETH,
2021-12-20 13:00:00,186.68,186.96,183.09,183.61,550.611000,1.017283e+05,COMP,ETH,
2021-12-20 14:00:00,183.69,185.16,182.40,183.13,973.037000,1.787873e+05,COMP,ETH,


**note**: assigning with the index notation `[*]` mutates the data frame in place, whereas `.assign` returns a new data frame and leaves the original untouched
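A quick way to see the difference, sketched on a made-up frame (`toy` and the `flag` column are illustrative):

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({'close': [1.0, 2.0]})

with_flag = toy.assign(flag=True)        # new frame; toy is untouched
had_flag_after_assign = 'flag' in toy.columns

toy['flag'] = False                      # bracket assignment changes toy itself
has_flag_after_brackets = 'flag' in toy.columns

print(had_flag_after_assign, has_flag_after_brackets)
print(with_flag['flag'].tolist(), toy['flag'].tolist())
```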

We can also rename columns using a `{from:to}` syntax, e.g.:

In [51]:
df_mutations.rename(
    columns={
        'open':'OpeningPrice',
        'chain':'CryptoChain'
    }
)

ts,OpeningPrice,high,low,close,volume,volumeUSD,token,CryptoChain,start_of_week
2021-12-01 00:00:00,57321.41,57451.05,56814.34,56987.97,388.482022,2.218430e+07,BTC,NA,1.0
2021-12-01 01:00:00,56998.35,57726.45,56705.06,57616.41,599.791578,3.437153e+07,BTC,NA,
2021-12-01 02:00:00,57618.55,57620.00,56972.97,57030.83,591.687200,3.387067e+07,BTC,NA,
2021-12-01 03:00:00,57029.79,57396.87,56841.01,57307.59,702.560364,4.007816e+07,BTC,NA,
2021-12-01 04:00:00,57306.55,57456.82,57026.11,57404.01,859.591535,4.920503e+07,BTC,NA,
...,...,...,...,...,...,...,...,...,...
2021-12-20 11:00:00,185.00,188.11,185.00,185.59,422.829000,7.886059e+04,COMP,NA,
2021-12-20 12:00:00,185.54,192.04,185.42,186.57,3997.070000,7.467775e+05,COMP,NA,
2021-12-20 13:00:00,186.68,186.96,183.09,183.61,550.611000,1.017283e+05,COMP,NA,
2021-12-20 14:00:00,183.69,185.16,182.40,183.13,973.037000,1.787873e+05,COMP,NA,


We can also use functions to rename, e.g.:

In [52]:
df_mutations.rename(columns=lambda x: x.upper())

ts,OPEN,HIGH,LOW,CLOSE,VOLUME,VOLUMEUSD,TOKEN,CHAIN,START_OF_WEEK
2021-12-01 00:00:00,57321.41,57451.05,56814.34,56987.97,388.482022,2.218430e+07,BTC,NA,1.0
2021-12-01 01:00:00,56998.35,57726.45,56705.06,57616.41,599.791578,3.437153e+07,BTC,NA,
2021-12-01 02:00:00,57618.55,57620.00,56972.97,57030.83,591.687200,3.387067e+07,BTC,NA,
2021-12-01 03:00:00,57029.79,57396.87,56841.01,57307.59,702.560364,4.007816e+07,BTC,NA,
2021-12-01 04:00:00,57306.55,57456.82,57026.11,57404.01,859.591535,4.920503e+07,BTC,NA,
...,...,...,...,...,...,...,...,...,...
2021-12-20 11:00:00,185.00,188.11,185.00,185.59,422.829000,7.886059e+04,COMP,NA,
2021-12-20 12:00:00,185.54,192.04,185.42,186.57,3997.070000,7.467775e+05,COMP,NA,
2021-12-20 13:00:00,186.68,186.96,183.09,183.61,550.611000,1.017283e+05,COMP,NA,
2021-12-20 14:00:00,183.69,185.16,182.40,183.13,973.037000,1.787873e+05,COMP,NA,


The above commands will return a new DataFrame.  If we want to rename the input DataFrame, we can use the `inplace` option (which is available on most mutating functions), such as:

In [53]:
df_mutations.rename(
    columns={
        'open':'OpeningPrice',
        'chain':'CryptoChain'
    },
    inplace=True
)

df_mutations

ts,OpeningPrice,high,low,close,volume,volumeUSD,token,CryptoChain,start_of_week
2021-12-01 00:00:00,57321.41,57451.05,56814.34,56987.97,388.482022,2.218430e+07,BTC,NA,1.0
2021-12-01 01:00:00,56998.35,57726.45,56705.06,57616.41,599.791578,3.437153e+07,BTC,NA,
2021-12-01 02:00:00,57618.55,57620.00,56972.97,57030.83,591.687200,3.387067e+07,BTC,NA,
2021-12-01 03:00:00,57029.79,57396.87,56841.01,57307.59,702.560364,4.007816e+07,BTC,NA,
2021-12-01 04:00:00,57306.55,57456.82,57026.11,57404.01,859.591535,4.920503e+07,BTC,NA,
...,...,...,...,...,...,...,...,...,...
2021-12-20 11:00:00,185.00,188.11,185.00,185.59,422.829000,7.886059e+04,COMP,NA,
2021-12-20 12:00:00,185.54,192.04,185.42,186.57,3997.070000,7.467775e+05,COMP,NA,
2021-12-20 13:00:00,186.68,186.96,183.09,183.61,550.611000,1.017283e+05,COMP,NA,
2021-12-20 14:00:00,183.69,185.16,182.40,183.13,973.037000,1.787873e+05,COMP,NA,


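We can also add rows to the DataFrame.  `DataFrame.append` was removed in pandas 2.0, so we build a one-row DataFrame and combine it with `pd.concat`:

In [54]:
# Columns missing from the new row (e.g. 'close') are filled with NaN
new_row = pd.DataFrame(
    [{'high': 1, 'low': 2, 'token': 'FAKE'}],
    index=pd.DatetimeIndex(['2021-11-30 00:00:00'], name='ts')
)
pd.concat([df_mutations, new_row])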

ts,OpeningPrice,high,low,close,volume,volumeUSD,token,CryptoChain,start_of_week
2021-12-01 00:00:00,57321.41,57451.05,56814.34,56987.97,388.482022,2.218430e+07,BTC,NA,1.0
2021-12-01 01:00:00,56998.35,57726.45,56705.06,57616.41,599.791578,3.437153e+07,BTC,NA,
2021-12-01 02:00:00,57618.55,57620.00,56972.97,57030.83,591.687200,3.387067e+07,BTC,NA,
2021-12-01 03:00:00,57029.79,57396.87,56841.01,57307.59,702.560364,4.007816e+07,BTC,NA,
2021-12-01 04:00:00,57306.55,57456.82,57026.11,57404.01,859.591535,4.920503e+07,BTC,NA,
...,...,...,...,...,...,...,...,...,...
2021-12-20 12:00:00,185.54,192.04,185.42,186.57,3997.070000,7.467775e+05,COMP,NA,
2021-12-20 13:00:00,186.68,186.96,183.09,183.61,550.611000,1.017283e+05,COMP,NA,
2021-12-20 14:00:00,183.69,185.16,182.40,183.13,973.037000,1.787873e+05,COMP,NA,
2021-12-20 15:00:00,183.08,183.80,182.29,182.60,186.177000,3.409794e+04,COMP,NA,


## Sorting DataFrames

One thing we rarely needed to do with Series is sorting.  With DataFrames, we will often need to sort by one or more columns or by the index.  We can use `sort_values` and `sort_index` to do this.

In [55]:
df.sort_values('open')

Unnamed: 0_level_0,open,high,low,close,volume,volumeUSD,token,chain
ts,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
2021-12-14 07:00:00,151.63,152.75,149.65,151.96,85682.725000,1.293305e+07,SOL,SOL
2021-12-14 08:00:00,152.00,155.49,151.10,154.82,44388.372000,6.808425e+06,SOL,SOL
2021-12-13 22:00:00,152.00,154.08,150.87,153.35,88402.908000,1.347827e+07,SOL,SOL
2021-12-14 06:00:00,153.19,154.28,151.40,151.64,40724.649000,6.230916e+06,SOL,SOL
2021-12-13 23:00:00,153.34,160.00,152.66,157.39,161350.804000,2.518932e+07,SOL,SOL
...,...,...,...,...,...,...,...,...
2021-12-01 15:00:00,57706.57,58783.16,57704.98,58610.19,908.858596,5.303393e+07,BTC,BTC
2021-12-01 19:00:00,58037.52,58148.15,57423.59,57498.26,851.152400,4.917105e+07,BTC,BTC
2021-12-01 18:00:00,58485.88,58631.40,58007.24,58037.51,633.239868,3.697725e+07,BTC,BTC
2021-12-01 16:00:00,58610.19,58900.00,58349.19,58664.40,684.590976,4.012464e+07,BTC,BTC


In [56]:
df.sort_values('open', ascending=False)

Unnamed: 0_level_0,open,high,low,close,volume,volumeUSD,token,chain
ts,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
2021-12-01 17:00:00,58664.40,59118.84,58445.53,58485.88,728.515015,4.285684e+07,BTC,BTC
2021-12-01 16:00:00,58610.19,58900.00,58349.19,58664.40,684.590976,4.012464e+07,BTC,BTC
2021-12-01 18:00:00,58485.88,58631.40,58007.24,58037.51,633.239868,3.697725e+07,BTC,BTC
2021-12-01 19:00:00,58037.52,58148.15,57423.59,57498.26,851.152400,4.917105e+07,BTC,BTC
2021-12-01 15:00:00,57706.57,58783.16,57704.98,58610.19,908.858596,5.303393e+07,BTC,BTC
...,...,...,...,...,...,...,...,...
2021-12-13 23:00:00,153.34,160.00,152.66,157.39,161350.804000,2.518932e+07,SOL,SOL
2021-12-14 06:00:00,153.19,154.28,151.40,151.64,40724.649000,6.230916e+06,SOL,SOL
2021-12-13 22:00:00,152.00,154.08,150.87,153.35,88402.908000,1.347827e+07,SOL,SOL
2021-12-14 08:00:00,152.00,155.49,151.10,154.82,44388.372000,6.808425e+06,SOL,SOL


In [57]:
df.sort_values(['volumeUSD', 'open'])

Unnamed: 0_level_0,open,high,low,close,volume,volumeUSD,token,chain
ts,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
2021-12-02 08:00:00,270.71,272.13,270.58,270.89,67.178000,1.821233e+04,COMP,ETH
2021-12-12 08:00:00,195.65,195.98,194.83,195.27,134.849000,2.635749e+04,COMP,ETH
2021-12-17 10:00:00,171.97,172.10,170.76,171.59,153.954000,2.640436e+04,AAVE,ETH
2021-12-20 15:00:00,168.85,169.59,168.30,168.30,160.734000,2.716503e+04,AAVE,ETH
2021-12-16 02:00:00,192.91,193.41,191.41,193.16,157.224000,3.025104e+04,COMP,ETH
...,...,...,...,...,...,...,...,...
2021-12-03 21:00:00,4229.09,4229.09,4035.00,4212.34,51723.840085,2.147795e+08,ETH,ETH
2021-12-15 20:00:00,47843.14,49300.00,47079.44,48696.07,4652.190928,2.256240e+08,BTC,BTC
2021-12-15 20:00:00,3811.38,3989.00,3760.98,3972.00,62360.886127,2.444909e+08,ETH,ETH
2021-12-04 06:00:00,4022.66,4038.80,3575.00,3884.54,97097.417492,3.702769e+08,ETH,ETH


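We can also sort by the index.  Because `df` holds several tokens per timestamp, rows that share a timestamp end up next to each other: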

In [58]:
df.sort_index()

Unnamed: 0_level_0,open,high,low,close,volume,volumeUSD,token,chain
ts,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
2021-12-01 00:00:00,57321.410,57451.050,56814.340,56987.970,388.482022,2.218430e+07,BTC,BTC
2021-12-01 00:00:00,4656.620,4672.430,4624.160,4634.950,6013.006735,2.793321e+07,ETH,ETH
2021-12-01 00:00:00,280.590,281.400,278.300,278.700,207.849000,5.822157e+04,COMP,ETH
2021-12-01 00:00:00,257.102,260.775,255.345,257.078,2730.299000,7.039183e+05,AAVE,ETH
2021-12-01 00:00:00,210.312,210.590,208.432,208.676,70031.618000,1.465851e+07,SOL,SOL
...,...,...,...,...,...,...,...,...
2021-12-20 15:00:00,168.850,169.590,168.300,168.300,160.734000,2.716503e+04,AAVE,ETH
2021-12-20 15:00:00,3777.070,3791.140,3768.940,3775.750,1599.147903,6.042037e+06,ETH,ETH
2021-12-20 15:00:00,45725.000,45834.750,45620.180,45651.410,167.994386,7.679295e+06,BTC,BTC
2021-12-20 15:00:00,170.040,170.370,169.280,169.370,13936.621000,2.366874e+06,SOL,SOL
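
Both sorting methods accept an `ascending` argument, and `sort_values` takes one flag per column when sorting by several columns. A minimal sketch on a made-up frame (the `toy` data is an assumption for illustration only):

```python
import pandas as pd

toy = pd.DataFrame(
    {'token': ['ETH', 'BTC', 'ETH', 'BTC'], 'open': [4.0, 50.0, 3.0, 60.0]},
    index=[3, 1, 2, 0],
)

# Sort by token ascending, then by open descending within each token.
by_cols = toy.sort_values(['token', 'open'], ascending=[True, False])

# Sort by the index labels, largest first.
by_index = toy.sort_index(ascending=False)

print(by_cols)
print(by_index)
```

Neither call modifies `toy`; each returns a new, sorted DataFrame.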


## Grouping DataFrames

One _very common_ action during data manipulation is grouping and then aggregating.  A pandas DataFrame has the method `groupby`, which allows us to group by any column (or columns) in our DataFrame.

`groupby` returns a `DataFrameGroupBy` object.  We can then apply a function to each group or aggregate the groups directly.

In [59]:
df.groupby('chain')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fd7ce98a0d0>

In [60]:
df.groupby('chain').groups

{'BTC': [2021-12-01 00:00:00, 2021-12-01 01:00:00, 2021-12-01 02:00:00, 2021-12-01 03:00:00, 2021-12-01 04:00:00, 2021-12-01 05:00:00, 2021-12-01 06:00:00, 2021-12-01 07:00:00, 2021-12-01 08:00:00, 2021-12-01 09:00:00, 2021-12-01 10:00:00, 2021-12-01 11:00:00, 2021-12-01 12:00:00, 2021-12-01 13:00:00, 2021-12-01 14:00:00, 2021-12-01 15:00:00, 2021-12-01 16:00:00, 2021-12-01 17:00:00, 2021-12-01 18:00:00, 2021-12-01 19:00:00, 2021-12-01 20:00:00, 2021-12-01 21:00:00, 2021-12-01 22:00:00, 2021-12-01 23:00:00, 2021-12-02 00:00:00, 2021-12-02 01:00:00, 2021-12-02 02:00:00, 2021-12-02 03:00:00, 2021-12-02 04:00:00, 2021-12-02 05:00:00, 2021-12-02 06:00:00, 2021-12-02 07:00:00, 2021-12-02 08:00:00, 2021-12-02 09:00:00, 2021-12-02 10:00:00, 2021-12-02 11:00:00, 2021-12-02 12:00:00, 2021-12-02 13:00:00, 2021-12-02 14:00:00, 2021-12-02 15:00:00, 2021-12-02 16:00:00, 2021-12-02 17:00:00, 2021-12-02 18:00:00, 2021-12-02 19:00:00, 2021-12-02 20:00:00, 2021-12-02 21:00:00, 2021-12-02 22:00:00, 2021

In [61]:
len(df.groupby('chain'))

3

In [62]:
df.groupby('chain').size()

chain
BTC     472
ETH    1416
SOL     472
dtype: int64
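
To see what a grouped object actually contains, we can iterate over it or pull out a single group. This is a small sketch on made-up data (the `toy` frame is an assumption, not the course data):

```python
import pandas as pd

toy = pd.DataFrame({
    'chain': ['BTC', 'ETH', 'ETH', 'SOL'],
    'token': ['BTC', 'ETH', 'AAVE', 'SOL'],
    'volumeUSD': [100.0, 50.0, 5.0, 20.0],
})

grouped = toy.groupby('chain')

# Iterating yields (group key, sub-DataFrame) pairs.
rows_per_chain = {chain: len(group) for chain, group in grouped}

# get_group pulls out the sub-DataFrame for a single key.
eth_rows = grouped.get_group('ETH')

print(rows_per_chain)
print(eth_rows)
```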

After grouping, we can operate on the whole DataFrame or on any single column.

In [63]:
df.groupby('chain')['volumeUSD'].sum().to_frame()

Unnamed: 0_level_0,volumeUSD
chain,Unnamed: 1_level_1
BTC,16335020000.0
ETH,18862380000.0
SOL,5905052000.0


We can also group by multiple columns.  The row index is now a multi-index (`MultiIndex`); we will not go into the details here.

In [64]:
df.groupby(['chain', 'token'])['volumeUSD'].sum().to_frame()

Unnamed: 0_level_0,Unnamed: 1_level_0,volumeUSD
chain,token,Unnamed: 2_level_1
BTC,BTC,16335020000.0
ETH,AAVE,235878800.0
ETH,COMP,120022200.0
ETH,ETH,18506480000.0
SOL,SOL,5905052000.0


We can also aggregate without creating a multi-index by passing `as_index=False`, which keeps the grouping keys as ordinary columns:

In [65]:
df.groupby(['chain', 'token'], as_index=False)['volumeUSD'].sum()

Unnamed: 0,chain,token,volumeUSD
0,BTC,BTC,16335020000.0
1,ETH,AAVE,235878800.0
2,ETH,COMP,120022200.0
3,ETH,ETH,18506480000.0
4,SOL,SOL,5905052000.0
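
Another way to get the same flat shape is to aggregate first and then call `reset_index()`. A minimal sketch comparing the two on made-up data (`toy` is an illustrative assumption):

```python
import pandas as pd

toy = pd.DataFrame({
    'chain': ['ETH', 'ETH', 'ETH', 'BTC'],
    'token': ['ETH', 'AAVE', 'ETH', 'BTC'],
    'volumeUSD': [50.0, 5.0, 25.0, 100.0],
})

# Grouping keys stay as regular columns.
flat = toy.groupby(['chain', 'token'], as_index=False)['volumeUSD'].sum()

# Aggregate into a multi-indexed Series, then move the index levels back into columns.
via_reset = toy.groupby(['chain', 'token'])['volumeUSD'].sum().reset_index()

print(flat)
```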


We can now operate on the groups.  For example, if we wanted to sum all columns:

In [66]:
df.groupby('chain').aggregate(np.sum)

Unnamed: 0_level_0,open,high,low,close,volume,volumeUSD
chain,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
BTC,23481470.0,23613310.0,23329440.0,23469680.0,329565.8,16335020000.0
ETH,2139105.0,2153749.0,2121890.0,2138040.0,6328884.0,18862380000.0
SOL,88289.18,89187.01,87342.58,88248.2,31412890.0,5905052000.0
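
Our frame also contains a text column (`token`), and how a blanket sum treats non-numeric columns has differed between pandas versions. To get only numeric totals, as in the output above, one option is to ask for them explicitly. A minimal sketch on made-up data (`toy` is an assumption for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    'chain': ['ETH', 'ETH', 'BTC'],
    'token': ['ETH', 'AAVE', 'BTC'],
    'open': [4000.0, 200.0, 50000.0],
    'volume': [10.0, 3.0, 1.0],
})

# numeric_only=True restricts the sum to numeric columns, skipping 'token'.
totals = toy.groupby('chain').sum(numeric_only=True)

print(totals)
```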


Or describe all columns:

In [67]:
df.groupby('chain').describe()

Unnamed: 0_level_0,open,open,open,open,open,open,open,open,high,high,...,volume,volume,volumeUSD,volumeUSD,volumeUSD,volumeUSD,volumeUSD,volumeUSD,volumeUSD,volumeUSD
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
chain,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
BTC,472.0,49748.869894,3273.993777,45678.15,47565.3425,48775.185,50454.3775,58664.4,472.0,50028.193051,...,853.557968,8420.715164,472.0,34608090.0,30033910.0,6951952.0,17107870.0,26684210.0,42867470.0,398803500.0
ETH,1416.0,1510.667185,1856.941185,159.53,187.744,217.18,3943.52,4772.6,1416.0,1521.009471,...,5418.896666,97097.417492,1416.0,13320890.0,27018280.0,18212.33,179982.5,446529.2,18163890.0,370276900.0
SOL,472.0,187.053347,20.544945,151.63,172.205,182.6775,195.71975,242.099,472.0,188.955525,...,81407.571,534212.095,472.0,12510700.0,9501726.0,1781884.0,6447974.0,10360060.0,15100980.0,99408870.0


We can also compute several aggregations at once by passing a list of functions:

In [68]:
df.groupby('chain')['open'].agg([np.size, np.mean, np.std, np.min, np.max])

Unnamed: 0_level_0,size,mean,std,amin,amax
chain,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
BTC,472,49748.869894,3273.993777,45678.15,58664.4
ETH,1416,1510.667185,1856.941185,159.53,4772.6
SOL,472,187.053347,20.544945,151.63,242.099


We can actually use _any_ arbitrary function.  For example, we can pass a lambda as a keyword argument, and the keyword becomes the name of the output column:

In [69]:
df.groupby('chain')['open'].agg(
    range=lambda x: x.max() - x.min()
)

Unnamed: 0_level_0,range
chain,Unnamed: 1_level_1
BTC,12986.25
ETH,4613.07
SOL,90.469
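
The same keyword style extends to several columns at once when we aggregate the whole grouped DataFrame: each keyword maps to a `(column, function)` pair. A minimal sketch on made-up data; the output names `open_range` and `total_volume` are hypothetical:

```python
import pandas as pd

toy = pd.DataFrame({
    'chain': ['ETH', 'ETH', 'BTC', 'BTC'],
    'open': [4000.0, 4100.0, 50000.0, 48000.0],
    'volumeUSD': [10.0, 30.0, 5.0, 15.0],
})

summary = toy.groupby('chain').agg(
    # Custom function applied to the 'open' column of each group.
    open_range=('open', lambda s: s.max() - s.min()),
    # Built-in aggregation referenced by name.
    total_volume=('volumeUSD', 'sum'),
)

print(summary)
```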


## Joining DataFrames

One of the first things we need to do before cleaning data is to get all of our data into one place.  The result is usually called either a fat table or a long table, depending on how we do the joining.  We'll look at a few different ways to join pandas DataFrames below.

We will be using `dfs`, the list of DataFrames that we created above.

### `pd.concat`

To join the DataFrames lengthwise, we can use `pd.concat`.  This stacks the DataFrames on top of each other, aligning them by column name.  If one DataFrame lacks a column that another one has, the column still appears in the result, filled with NA for the rows that came from the DataFrame without it.

In [70]:
pd.concat(dfs)

Unnamed: 0,ts,open,high,low,close,volume,volumeUSD,token,chain
0,2021-12-01 00:00:00,57321.41,57451.05,56814.34,56987.97,388.482022,2.218430e+07,BTC,BTC
1,2021-12-01 01:00:00,56998.35,57726.45,56705.06,57616.41,599.791578,3.437153e+07,BTC,BTC
2,2021-12-01 02:00:00,57618.55,57620.00,56972.97,57030.83,591.687200,3.387067e+07,BTC,BTC
3,2021-12-01 03:00:00,57029.79,57396.87,56841.01,57307.59,702.560364,4.007816e+07,BTC,BTC
4,2021-12-01 04:00:00,57306.55,57456.82,57026.11,57404.01,859.591535,4.920503e+07,BTC,BTC
...,...,...,...,...,...,...,...,...,...
467,2021-12-20 11:00:00,185.00,188.11,185.00,185.59,422.829000,7.886059e+04,COMP,ETH
468,2021-12-20 12:00:00,185.54,192.04,185.42,186.57,3997.070000,7.467775e+05,COMP,ETH
469,2021-12-20 13:00:00,186.68,186.96,183.09,183.61,550.611000,1.017283e+05,COMP,ETH
470,2021-12-20 14:00:00,183.69,185.16,182.40,183.13,973.037000,1.787873e+05,COMP,ETH
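
All of our frames share the same columns, so the NA behaviour does not show up here. A minimal sketch with two made-up frames (`left` and `right` are assumptions for illustration) where only one has a `volume` column:

```python
import pandas as pd

left = pd.DataFrame({'ts': [1, 2], 'close': [10.0, 11.0]})
right = pd.DataFrame({'ts': [1, 2], 'close': [20.0, 21.0], 'volume': [5.0, 6.0]})

# Columns are aligned by name; rows from `left` get NaN in 'volume'.
stacked = pd.concat([left, right])

print(stacked)
```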


If we want to keep track of which source DataFrame each row came from, we can pass `keys`, which creates a multi-index:

In [71]:
res = pd.concat(dfs, keys=tokens)
res

Unnamed: 0,Unnamed: 1,ts,open,high,low,close,volume,volumeUSD,token,chain
BTC,0,2021-12-01 00:00:00,57321.41,57451.05,56814.34,56987.97,388.482022,2.218430e+07,BTC,BTC
BTC,1,2021-12-01 01:00:00,56998.35,57726.45,56705.06,57616.41,599.791578,3.437153e+07,BTC,BTC
BTC,2,2021-12-01 02:00:00,57618.55,57620.00,56972.97,57030.83,591.687200,3.387067e+07,BTC,BTC
BTC,3,2021-12-01 03:00:00,57029.79,57396.87,56841.01,57307.59,702.560364,4.007816e+07,BTC,BTC
BTC,4,2021-12-01 04:00:00,57306.55,57456.82,57026.11,57404.01,859.591535,4.920503e+07,BTC,BTC
...,...,...,...,...,...,...,...,...,...,...
COMP,467,2021-12-20 11:00:00,185.00,188.11,185.00,185.59,422.829000,7.886059e+04,COMP,ETH
COMP,468,2021-12-20 12:00:00,185.54,192.04,185.42,186.57,3997.070000,7.467775e+05,COMP,ETH
COMP,469,2021-12-20 13:00:00,186.68,186.96,183.09,183.61,550.611000,1.017283e+05,COMP,ETH
COMP,470,2021-12-20 14:00:00,183.69,185.16,182.40,183.13,973.037000,1.787873e+05,COMP,ETH


This allows us to select the rows that came from a given source table, e.g.:

In [72]:
res.loc['COMP']

Unnamed: 0,ts,open,high,low,close,volume,volumeUSD,token,chain
0,2021-12-01 00:00:00,280.59,281.40,278.30,278.70,207.849,58221.57184,COMP,ETH
1,2021-12-01 01:00:00,278.65,283.80,276.36,283.44,817.668,229274.61550,COMP,ETH
2,2021-12-01 02:00:00,283.20,283.20,280.61,281.29,254.330,71609.32568,COMP,ETH
3,2021-12-01 03:00:00,281.25,283.22,279.90,283.09,393.771,110890.74168,COMP,ETH
4,2021-12-01 04:00:00,283.10,284.16,282.60,283.73,489.120,138532.76788,COMP,ETH
...,...,...,...,...,...,...,...,...,...
467,2021-12-20 11:00:00,185.00,188.11,185.00,185.59,422.829,78860.58968,COMP,ETH
468,2021-12-20 12:00:00,185.54,192.04,185.42,186.57,3997.070,746777.49758,COMP,ETH
469,2021-12-20 13:00:00,186.68,186.96,183.09,183.61,550.611,101728.26831,COMP,ETH
470,2021-12-20 14:00:00,183.69,185.16,182.40,183.13,973.037,178787.34838,COMP,ETH


As we saw above, we can also use `.append` to add a whole DataFrame rather than a single Series (on pandas versions before 2.0, where `append` is still available):

In [73]:
dfs[0].append(dfs[1])

Unnamed: 0,ts,open,high,low,close,volume,volumeUSD,token,chain
0,2021-12-01 00:00:00,57321.41,57451.05,56814.34,56987.97,388.482022,2.218430e+07,BTC,BTC
1,2021-12-01 01:00:00,56998.35,57726.45,56705.06,57616.41,599.791578,3.437153e+07,BTC,BTC
2,2021-12-01 02:00:00,57618.55,57620.00,56972.97,57030.83,591.687200,3.387067e+07,BTC,BTC
3,2021-12-01 03:00:00,57029.79,57396.87,56841.01,57307.59,702.560364,4.007816e+07,BTC,BTC
4,2021-12-01 04:00:00,57306.55,57456.82,57026.11,57404.01,859.591535,4.920503e+07,BTC,BTC
...,...,...,...,...,...,...,...,...,...
467,2021-12-20 11:00:00,3768.69,3821.54,3767.10,3798.99,4426.081206,1.682557e+07,ETH,ETH
468,2021-12-20 12:00:00,3798.96,3820.00,3787.21,3814.44,5781.695890,2.197543e+07,ETH,ETH
469,2021-12-20 13:00:00,3814.63,3815.80,3770.17,3782.86,4682.198085,1.773570e+07,ETH,ETH
470,2021-12-20 14:00:00,3782.85,3811.00,3766.88,3776.87,5356.147003,2.028810e+07,ETH,ETH


Lastly, remember the importance of indices.  In the operations above (both `pd.concat` and `.append`), we joined the DataFrames while keeping the indices of the original tables.  This means that we have repeated index values:

In [74]:
dfs[0].append(dfs[1]).sort_index()

Unnamed: 0,ts,open,high,low,close,volume,volumeUSD,token,chain
0,2021-12-01 00:00:00,57321.41,57451.05,56814.34,56987.97,388.482022,2.218430e+07,BTC,BTC
0,2021-12-01 00:00:00,4656.62,4672.43,4624.16,4634.95,6013.006735,2.793321e+07,ETH,ETH
1,2021-12-01 01:00:00,56998.35,57726.45,56705.06,57616.41,599.791578,3.437153e+07,BTC,BTC
1,2021-12-01 01:00:00,4636.43,4736.90,4605.49,4729.10,13819.061610,6.487693e+07,ETH,ETH
2,2021-12-01 02:00:00,57618.55,57620.00,56972.97,57030.83,591.687200,3.387067e+07,BTC,BTC
...,...,...,...,...,...,...,...,...,...
469,2021-12-20 13:00:00,46104.32,46112.00,45666.00,45735.37,539.322634,2.472197e+07,BTC,BTC
470,2021-12-20 14:00:00,3782.85,3811.00,3766.88,3776.87,5356.147003,2.028810e+07,ETH,ETH
470,2021-12-20 14:00:00,45735.37,46003.59,45600.00,45724.22,519.184470,2.377989e+07,BTC,BTC
471,2021-12-20 15:00:00,45725.00,45834.75,45620.18,45651.23,161.281285,7.372779e+06,BTC,BTC


This often isn't ideal, especially if we want to join against these indices later.  Instead, we can create a new index on the joined table using the `ignore_index` parameter, which gives us a sequential, non-repeated index:

In [75]:
dfs[0].append(dfs[1], ignore_index=True).sort_index()

Unnamed: 0,ts,open,high,low,close,volume,volumeUSD,token,chain
0,2021-12-01 00:00:00,57321.41,57451.05,56814.34,56987.97,388.482022,2.218430e+07,BTC,BTC
1,2021-12-01 01:00:00,56998.35,57726.45,56705.06,57616.41,599.791578,3.437153e+07,BTC,BTC
2,2021-12-01 02:00:00,57618.55,57620.00,56972.97,57030.83,591.687200,3.387067e+07,BTC,BTC
3,2021-12-01 03:00:00,57029.79,57396.87,56841.01,57307.59,702.560364,4.007816e+07,BTC,BTC
4,2021-12-01 04:00:00,57306.55,57456.82,57026.11,57404.01,859.591535,4.920503e+07,BTC,BTC
...,...,...,...,...,...,...,...,...,...
939,2021-12-20 11:00:00,3768.69,3821.54,3767.10,3798.99,4426.081206,1.682557e+07,ETH,ETH
940,2021-12-20 12:00:00,3798.96,3820.00,3787.21,3814.44,5781.695890,2.197543e+07,ETH,ETH
941,2021-12-20 13:00:00,3814.63,3815.80,3770.17,3782.86,4682.198085,1.773570e+07,ETH,ETH
942,2021-12-20 14:00:00,3782.85,3811.00,3766.88,3776.87,5356.147003,2.028810e+07,ETH,ETH
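
`pd.concat` accepts the same `ignore_index` parameter. A minimal sketch on two made-up frames (`first` and `second` are assumptions for illustration) that contrasts the two behaviours:

```python
import pandas as pd

first = pd.DataFrame({'token': ['BTC', 'BTC'], 'close': [1.0, 2.0]})
second = pd.DataFrame({'token': ['ETH', 'ETH'], 'close': [3.0, 4.0]})

# Original labels are kept, so 0 and 1 each appear twice.
repeated = pd.concat([first, second])

# A fresh 0..n-1 index replaces the original labels.
renumbered = pd.concat([first, second], ignore_index=True)

print(repeated.index.is_unique, renumbered.index.is_unique)
```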


### `pd.DataFrame.join`

`df.join` is a convenient method that allows us to join two DataFrames on their index:

In [76]:
dfs[0].set_index('ts')['close'].rename(f'close_{tokens[0]}').to_frame().join(
    dfs[1].set_index('ts')['close'].rename(f'close_{tokens[1]}').to_frame()
)

Unnamed: 0_level_0,close_BTC,close_ETH
ts,Unnamed: 1_level_1,Unnamed: 2_level_1
2021-12-01 00:00:00,56987.97,4634.95
2021-12-01 01:00:00,57616.41,4729.10
2021-12-01 02:00:00,57030.83,4695.78
2021-12-01 03:00:00,57307.59,4754.09
2021-12-01 04:00:00,57404.01,4764.59
...,...,...
2021-12-20 11:00:00,45894.94,3798.99
2021-12-20 12:00:00,46111.64,3814.44
2021-12-20 13:00:00,45735.37,3782.86
2021-12-20 14:00:00,45724.22,3776.87


We can get a little more advanced by joining a **left** DataFrame on one of its columns (here `ts`, passed via `on`) against a **right** DataFrame that is indexed by that key, e.g.:

In [77]:
dfs[0][['ts', 'close']].join(
    dfs[1].set_index('ts')['close'].rename(f'close_{tokens[1]}').to_frame(),
    on='ts'
)

Unnamed: 0,ts,close,close_ETH
0,2021-12-01 00:00:00,56987.97,4634.95
1,2021-12-01 01:00:00,57616.41,4729.10
2,2021-12-01 02:00:00,57030.83,4695.78
3,2021-12-01 03:00:00,57307.59,4754.09
4,2021-12-01 04:00:00,57404.01,4764.59
...,...,...,...
467,2021-12-20 11:00:00,45894.94,3798.99
468,2021-12-20 12:00:00,46111.64,3814.44
469,2021-12-20 13:00:00,45735.37,3782.86
470,2021-12-20 14:00:00,45724.22,3776.87
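
`join` performs a left join by default, and it needs `lsuffix`/`rsuffix` when both frames share a column name. A minimal sketch on made-up frames (`btc` and `eth` here are small hypothetical stand-ins, with integers in place of timestamps):

```python
import pandas as pd

btc = pd.DataFrame({'close': [10.0, 11.0, 12.0]}, index=pd.Index([1, 2, 3], name='ts'))
eth = pd.DataFrame({'close': [20.0, 21.0]}, index=pd.Index([2, 3], name='ts'))

# Default how='left': every row of `btc` is kept, missing matches become NaN.
left_joined = btc.join(eth, lsuffix='_BTC', rsuffix='_ETH')

# how='inner' keeps only index labels present in both frames.
inner_joined = btc.join(eth, how='inner', lsuffix='_BTC', rsuffix='_ETH')

print(left_joined)
print(inner_joined)
```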


### `pd.merge`

`pd.merge` is pandas' way of doing SQL-like joins (e.g. left join, inner join, outer join, etc.).  There are a few quirks we'll see, though.

In [78]:
pd.merge(
    dfs[0][['ts', 'close']].rename(columns={'close': f'close_{tokens[0]}'}),
    dfs[1][['ts', 'close']].rename(columns={'close': f'close_{tokens[1]}'}),
    on='ts',
    how='inner'
)

Unnamed: 0,ts,close_BTC,close_ETH
0,2021-12-01 00:00:00,56987.97,4634.95
1,2021-12-01 01:00:00,57616.41,4729.10
2,2021-12-01 02:00:00,57030.83,4695.78
3,2021-12-01 03:00:00,57307.59,4754.09
4,2021-12-01 04:00:00,57404.01,4764.59
...,...,...,...
467,2021-12-20 11:00:00,45894.94,3798.99
468,2021-12-20 12:00:00,46111.64,3814.44
469,2021-12-20 13:00:00,45735.37,3782.86
470,2021-12-20 14:00:00,45724.22,3776.87


We can use other values for `how`, e.g. `'left'`, `'right'`, `'outer'`, and `'cross'`.
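
Our BTC and ETH frames cover the same timestamps, so the choice of `how` makes no visible difference on them. A minimal sketch on made-up frames with only partly overlapping keys (`btc` and `eth` here are hypothetical, with integers in place of timestamps); `indicator=True` adds a `_merge` column showing where each row came from:

```python
import pandas as pd

btc = pd.DataFrame({'ts': [1, 2, 3], 'close_BTC': [10.0, 11.0, 12.0]})
eth = pd.DataFrame({'ts': [2, 3, 4], 'close_ETH': [20.0, 21.0, 22.0]})

# Only keys present in both frames.
inner = pd.merge(btc, eth, on='ts', how='inner')

# All keys from the left frame; unmatched right values become NaN.
left = pd.merge(btc, eth, on='ts', how='left')

# All keys from either frame, with a column recording the source of each row.
outer = pd.merge(btc, eth, on='ts', how='outer', indicator=True)

print(outer)
```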

If the left and right DataFrames have non-key columns with the same name, pandas automatically resolves the conflict by adding `_x` and `_y` suffixes to the conflicting columns:

In [79]:
pd.merge(
    dfs[0][['ts', 'close']],
    dfs[1][['ts', 'close']],
    on='ts',
    how='inner'
)

Unnamed: 0,ts,close_x,close_y
0,2021-12-01 00:00:00,56987.97,4634.95
1,2021-12-01 01:00:00,57616.41,4729.10
2,2021-12-01 02:00:00,57030.83,4695.78
3,2021-12-01 03:00:00,57307.59,4754.09
4,2021-12-01 04:00:00,57404.01,4764.59
...,...,...,...
467,2021-12-20 11:00:00,45894.94,3798.99
468,2021-12-20 12:00:00,46111.64,3814.44
469,2021-12-20 13:00:00,45735.37,3782.86
470,2021-12-20 14:00:00,45724.22,3776.87


However, we can also define our own suffixes, e.g.:

In [80]:
pd.merge(
    dfs[0][['ts', 'close']],
    dfs[1][['ts', 'close']],
    on='ts',
    how='inner',
    suffixes=[f'_{tokens[0]}', f'_{tokens[1]}']
)

Unnamed: 0,ts,close_BTC,close_ETH
0,2021-12-01 00:00:00,56987.97,4634.95
1,2021-12-01 01:00:00,57616.41,4729.10
2,2021-12-01 02:00:00,57030.83,4695.78
3,2021-12-01 03:00:00,57307.59,4754.09
4,2021-12-01 04:00:00,57404.01,4764.59
...,...,...,...
467,2021-12-20 11:00:00,45894.94,3798.99
468,2021-12-20 12:00:00,46111.64,3814.44
469,2021-12-20 13:00:00,45735.37,3782.86
470,2021-12-20 14:00:00,45724.22,3776.87
