# Pandas DataFrames (`pd.DataFrame`)

The pandas DataFrame is the workhorse of most data science and analytics projects.  It represents the data you're working with as a table.  What makes it flexible is that each row **and** each column can be accessed as a pandas Series, which opens up many powerful ways to work with the data.
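
As a minimal sketch of that idea, using a tiny hand-made table (the values, labels and column names here are made up for illustration and are not part of the data set loaded below), both a single column and a single row come back as a `pd.Series`:

```python
import pandas as pd

# Hypothetical toy table, only for illustration
toy = pd.DataFrame(
    {'price': [10.0, 12.5, 11.0], 'token': ['AAA', 'BBB', 'CCC']},
    index=['r0', 'r1', 'r2'],
)

col = toy['price']     # one column -> Series indexed by the row labels
row = toy.loc['r1']    # one row    -> Series indexed by the column names

print(type(col).__name__, list(col.index))
print(type(row).__name__, list(row.index))
```

The column Series keeps the row labels as its index, while the row Series uses the column names as its index.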

In [1]:
import pandas as pd
import numpy as np
import requests

First, let's get some data so we can see what we can do with a DataFrame.  Don't worry about exactly what this function is doing; we will go over it in a bit.

In [2]:
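def get_data(token):
    """Fetch hourly OHLC candles for `token` (quoted in USD) since 2021-12-01."""
    res = requests.get(
        f'https://api.cryptowat.ch/markets/coinbase-pro/{token}usd/ohlc',
        params={
            'periods': '3600',  # candle length in seconds (1 hour)
            'after': str(int(pd.Timestamp('2021-12-01').timestamp()))
        },
        timeout=30
    )
    res.raise_for_status()  # fail early on HTTP errors

    df = pd.DataFrame(
        res.json()['result']['3600'],
        columns=['ts', 'open', 'high', 'low', 'close', 'volume', 'volumeUSD']
    )
    df['ts'] = pd.to_datetime(df.ts, unit='s')  # epoch seconds -> timestamps
    df['token'] = token

    return df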


In [3]:
tokens = ['BTC', 'ETH', 'SOL', 'AAVE', 'COMP']

Don't worry too much about what is going on in the code below.  We'll briefly go over it because it showcases the power of Python, but it's not necessary for the class.

In [4]:
dfs = [
    (lambda x: x.assign(chain=np.where(x.token.isin(['ETH', 'AAVE', 'COMP']), np.full(x.shape[0], 'ETH'), x.token)))(get_data(token)) 
    for token in tokens
]

In [5]:
df_base = pd.concat(get_data(token) for token in tokens)
df_base['chain'] = np.where(df_base.token.isin(['ETH', 'AAVE', 'COMP']), np.full(df_base.shape[0], 'ETH'), df_base.token)


In [6]:
df = df_base.set_index('ts')

## Understanding the data frame

After loading the data into our DataFrame, we can inspect what's inside.  We'll do this often, because the data we store is usually too large to inspect row by row, and we need to check that it was loaded correctly.

Let's check some basic properties of the data set:

We can see how many rows and columns this DataFrame has, and the total number of data points (rows × columns):

In [7]:
df.shape

(2020, 8)

In [8]:
df.size

16160

We can see what the first 5 rows look like:

In [9]:
df.head()

ts,open,high,low,close,volume,volumeUSD,token,chain
2021-12-01 00:00:00,57321.41,57451.05,56814.34,56987.97,388.482022,22184300.0,BTC,BTC
2021-12-01 01:00:00,56998.35,57726.45,56705.06,57616.41,599.791578,34371530.0,BTC,BTC
2021-12-01 02:00:00,57618.55,57620.0,56972.97,57030.83,591.6872,33870670.0,BTC,BTC
2021-12-01 03:00:00,57029.79,57396.87,56841.01,57307.59,702.560364,40078160.0,BTC,BTC
2021-12-01 04:00:00,57306.55,57456.82,57026.11,57404.01,859.591535,49205030.0,BTC,BTC


...and the last 5 rows

In [10]:
df.tail()

ts,open,high,low,close,volume,volumeUSD,token,chain
2021-12-17 15:00:00,183.67,183.88,177.76,182.28,2137.736,385827.04609,COMP,ETH
2021-12-17 16:00:00,182.34,185.8,179.29,185.59,1743.956,317583.25306,COMP,ETH
2021-12-17 17:00:00,185.59,189.54,184.54,188.37,1197.438,224311.62004,COMP,ETH
2021-12-17 18:00:00,188.27,188.27,184.99,185.93,573.678,107020.17582,COMP,ETH
2021-12-17 19:00:00,185.94,186.48,184.97,185.18,123.647,22971.81363,COMP,ETH


We can also see a general overview of the schema (column names, non-null counts and data types) of the data

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2020 entries, 2021-12-01 00:00:00 to 2021-12-17 19:00:00
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   open       2020 non-null   float64
 1   high       2020 non-null   float64
 2   low        2020 non-null   float64
 3   close      2020 non-null   float64
 4   volume     2020 non-null   float64
 5   volumeUSD  2020 non-null   float64
 6   token      2020 non-null   object 
 7   chain      2020 non-null   object 
dtypes: float64(6), object(2)
memory usage: 142.0+ KB


as well as descriptive statistics about every numeric column

In [12]:
df.describe()

,open,high,low,close,volume,volumeUSD
count,2020.0,2020.0,2020.0,2020.0,2020.0,2020.0
mean,11001.52865,11065.869541,10927.090963,10995.710508,17323.77799,18731340.0
std,19743.072574,19854.492039,19614.332101,19732.210624,36360.531485,28196880.0
min,151.63,152.75,148.03,151.64,67.178,18212.33
25%,186.74925,188.67375,184.66575,186.555,748.770952,319159.6
50%,226.1695,227.92,223.493,226.15,2206.631,10321790.0
75%,4348.6875,4375.9275,4322.7925,4347.3125,12475.055259,26525940.0
max,58664.4,59118.84,58445.53,58664.4,534212.095,398803500.0


## DataFrame Indexing

Indexing into DataFrames works much like indexing into Series, but there are now two "axes" that we can operate on: rows and columns.  For example, indexing with `[*]` and a column name (like a label in a Series) selects a column:

In [13]:
df['open']

ts
2021-12-01 00:00:00    57321.41
2021-12-01 01:00:00    56998.35
2021-12-01 02:00:00    57618.55
2021-12-01 03:00:00    57029.79
2021-12-01 04:00:00    57306.55
                         ...   
2021-12-17 15:00:00      183.67
2021-12-17 16:00:00      182.34
2021-12-17 17:00:00      185.59
2021-12-17 18:00:00      188.27
2021-12-17 19:00:00      185.94
Name: open, Length: 2020, dtype: float64

whereas `.loc[*]` lets you access rows by their index label:

In [14]:
df.loc['2021-12-01 01:00:00']

ts,open,high,low,close,volume,volumeUSD,token,chain
2021-12-01 01:00:00,56998.35,57726.45,56705.06,57616.41,599.791578,34371530.0,BTC,BTC
2021-12-01 01:00:00,4636.43,4736.9,4605.49,4729.1,13819.06161,64876930.0,ETH,ETH
2021-12-01 01:00:00,208.716,211.773,207.821,211.507,92606.555,19461530.0,SOL,SOL
2021-12-01 01:00:00,257.149,266.249,255.27,264.816,5752.541,1516148.0,AAVE,ETH
2021-12-01 01:00:00,278.65,283.8,276.36,283.44,817.668,229274.6,COMP,ETH


and `.iloc[*]` selects rows by integer position:

In [15]:
df.iloc[0]

open               57321.41
high               57451.05
low                56814.34
close              56987.97
volume           388.482022
volumeUSD    22184300.66241
token                   BTC
chain                   BTC
Name: 2021-12-01 00:00:00, dtype: object

we can also get to the last row easily

In [16]:
df.iloc[-1]

open              185.94
high              186.48
low               184.97
close             185.18
volume           123.647
volumeUSD    22971.81363
token               COMP
chain                ETH
Name: 2021-12-17 19:00:00, dtype: object

or return it as a data frame instead of a Series

In [17]:
df.iloc[[-1]]

ts,open,high,low,close,volume,volumeUSD,token,chain
2021-12-17 19:00:00,185.94,186.48,184.97,185.18,123.647,22971.81363,COMP,ETH


**note**: `df.loc[0]` will not work here, as `.loc` looks rows up by index label and this index holds timestamps, not integers

---
**note**: Also, indexing with a single label or position returns a `pd.Series` if exactly one row matches, or a new `pd.DataFrame` if multiple rows match, e.g.:

In [18]:
type(df.iloc[0])

pandas.core.series.Series

In [19]:
type(df.loc['2021-12-01 01:00:00']) # 5 rows returned

pandas.core.frame.DataFrame
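
A small sketch of the same rule on a made-up frame (the labels `t0`/`t1` and the values are hypothetical): a label that matches one row gives a Series, a repeated label gives a DataFrame, and wrapping the label in a list gives a DataFrame regardless of how many rows match.

```python
import pandas as pd

# Hypothetical toy frame whose index contains a repeated label
toy = pd.DataFrame(
    {'close': [1.0, 2.0, 3.0]},
    index=['t0', 't0', 't1'],
)

unique_label = toy.loc['t1']      # label matches one row  -> Series
repeated_label = toy.loc['t0']    # label matches two rows -> DataFrame
as_list = toy.loc[['t1']]         # list of labels         -> DataFrame

print(type(unique_label).__name__)
print(type(repeated_label).__name__)
print(type(as_list).__name__)
```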

We can convert a Series to a DataFrame at any time with the `.to_frame()` method on the Series object.  The resulting DataFrame has a single column, named after `Series.name`

In [20]:
df.iloc[0].to_frame()

,2021-12-01
open,57321.41
high,57451.05
low,56814.34
close,56987.97
volume,388.482022
volumeUSD,22184300.66241
token,BTC
chain,BTC


---

In addition, we can select multiple columns and rows in various ways.  Note that a list of names inside `[*]` selects columns, while a slice inside `[*]` selects rows:

In [21]:
df[['open', 'close']]

ts,open,close
2021-12-01 00:00:00,57321.41,56987.97
2021-12-01 01:00:00,56998.35,57616.41
2021-12-01 02:00:00,57618.55,57030.83
2021-12-01 03:00:00,57029.79,57307.59
2021-12-01 04:00:00,57306.55,57404.01
...,...,...
2021-12-17 15:00:00,183.67,182.28
2021-12-17 16:00:00,182.34,185.59
2021-12-17 17:00:00,185.59,188.37
2021-12-17 18:00:00,188.27,185.93


In [22]:
df[0:2]

ts,open,high,low,close,volume,volumeUSD,token,chain
2021-12-01 00:00:00,57321.41,57451.05,56814.34,56987.97,388.482022,22184300.0,BTC,BTC
2021-12-01 01:00:00,56998.35,57726.45,56705.06,57616.41,599.791578,34371530.0,BTC,BTC


In [23]:
df.loc['2021-12-01 00:00:00':'2021-12-01 02:00:00']

ts,open,high,low,close,volume,volumeUSD,token,chain
2021-12-01 00:00:00,57321.41,57451.05,56814.34,56987.97,388.482022,22184300.0,BTC,BTC
2021-12-01 01:00:00,56998.35,57726.45,56705.06,57616.41,599.791578,34371530.0,BTC,BTC
2021-12-01 02:00:00,57618.55,57620.0,56972.97,57030.83,591.6872,33870670.0,BTC,BTC
2021-12-01 00:00:00,4656.62,4672.43,4624.16,4634.95,6013.006735,27933210.0,ETH,ETH
2021-12-01 01:00:00,4636.43,4736.9,4605.49,4729.1,13819.06161,64876930.0,ETH,ETH
2021-12-01 02:00:00,4729.1,4729.1,4684.49,4695.78,7491.46544,35241610.0,ETH,ETH
2021-12-01 00:00:00,210.312,210.59,208.432,208.676,70031.618,14658510.0,SOL,SOL
2021-12-01 01:00:00,208.716,211.773,207.821,211.507,92606.555,19461530.0,SOL,SOL
2021-12-01 02:00:00,211.506,212.235,210.003,210.868,49728.032,10497560.0,SOL,SOL
2021-12-01 00:00:00,257.102,260.775,255.345,257.078,2730.299,703918.3,AAVE,ETH


In [24]:
df.iloc[0:4]

ts,open,high,low,close,volume,volumeUSD,token,chain
2021-12-01 00:00:00,57321.41,57451.05,56814.34,56987.97,388.482022,22184300.0,BTC,BTC
2021-12-01 01:00:00,56998.35,57726.45,56705.06,57616.41,599.791578,34371530.0,BTC,BTC
2021-12-01 02:00:00,57618.55,57620.0,56972.97,57030.83,591.6872,33870670.0,BTC,BTC
2021-12-01 03:00:00,57029.79,57396.87,56841.01,57307.59,702.560364,40078160.0,BTC,BTC


In [25]:
df.iloc[[0, 4, 10, 50]]

ts,open,high,low,close,volume,volumeUSD,token,chain
2021-12-01 00:00:00,57321.41,57451.05,56814.34,56987.97,388.482022,22184300.0,BTC,BTC
2021-12-01 04:00:00,57306.55,57456.82,57026.11,57404.01,859.591535,49205030.0,BTC,BTC
2021-12-01 10:00:00,56945.94,57220.39,56756.48,57131.16,586.052971,33407680.0,BTC,BTC
2021-12-03 02:00:00,56569.3,56762.82,56425.99,56545.98,238.584708,13496120.0,BTC,BTC


And finally, we can index on both rows and columns at the same time with `.loc`:

In [26]:
df.loc['2021-12-01 00:00:00':'2021-12-01 02:00:00', ['close', 'volume', 'token']]

ts,close,volume,token
2021-12-01 00:00:00,56987.97,388.482022,BTC
2021-12-01 01:00:00,57616.41,599.791578,BTC
2021-12-01 02:00:00,57030.83,591.6872,BTC
2021-12-01 00:00:00,4634.95,6013.006735,ETH
2021-12-01 01:00:00,4729.1,13819.06161,ETH
2021-12-01 02:00:00,4695.78,7491.46544,ETH
2021-12-01 00:00:00,208.676,70031.618,SOL
2021-12-01 01:00:00,211.507,92606.555,SOL
2021-12-01 02:00:00,210.868,49728.032,SOL
2021-12-01 00:00:00,257.078,2730.299,AAVE


**note**: Because DataFrame indices are sequential integers by default, it's good practice to use `.loc` and `.iloc` explicitly so it is clear whether you mean labels or positions.  For example, let's shuffle our DataFrame and then select:

In [27]:
df_shuffled = df_base.sample(frac=1)

In [28]:
df_shuffled.loc[[0, 2, 3]]

,ts,open,high,low,close,volume,volumeUSD,token,chain
0,2021-12-01 00:00:00,257.102,260.775,255.345,257.078,2730.299,703918.3,AAVE,ETH
0,2021-12-01 00:00:00,57321.41,57451.05,56814.34,56987.97,388.482022,22184300.0,BTC,BTC
0,2021-12-01 00:00:00,280.59,281.4,278.3,278.7,207.849,58221.57,COMP,ETH
0,2021-12-01 00:00:00,4656.62,4672.43,4624.16,4634.95,6013.006735,27933210.0,ETH,ETH
0,2021-12-01 00:00:00,210.312,210.59,208.432,208.676,70031.618,14658510.0,SOL,SOL
2,2021-12-01 02:00:00,4729.1,4729.1,4684.49,4695.78,7491.46544,35241610.0,ETH,ETH
2,2021-12-01 02:00:00,57618.55,57620.0,56972.97,57030.83,591.6872,33870670.0,BTC,BTC
2,2021-12-01 02:00:00,264.755,266.187,262.597,263.125,1559.33,412444.9,AAVE,ETH
2,2021-12-01 02:00:00,211.506,212.235,210.003,210.868,49728.032,10497560.0,SOL,SOL
2,2021-12-01 02:00:00,283.2,283.2,280.61,281.29,254.33,71609.33,COMP,ETH


In [29]:
df_shuffled.iloc[[0, 2, 3]]

,ts,open,high,low,close,volume,volumeUSD,token,chain
79,2021-12-04 07:00:00,47844.68,48892.32,47200.0,47711.33,2910.596893,138974700.0,BTC,BTC
271,2021-12-12 07:00:00,178.9,180.06,178.39,179.26,555.967,99530.62,AAVE,ETH
330,2021-12-14 18:00:00,184.64,186.82,180.77,181.13,2661.731,490289.9,COMP,ETH
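
The difference is easier to see on a tiny frame.  The sketch below uses made-up values and a fixed reordering (instead of the random `sample`) so the result is reproducible:

```python
import pandas as pd

# Hypothetical toy frame with the default integer index 0..3
toy = pd.DataFrame({'value': ['a', 'b', 'c', 'd']})

# Deterministic "shuffle": rows are reordered but keep their original labels
shuffled = toy.iloc[[2, 0, 3, 1]]

by_label = shuffled.loc[0, 'value']        # row whose index label is 0
by_position = shuffled.iloc[0]['value']    # first row in the current order

print(by_label, by_position)
```

Here `.loc[0]` still finds the row that was originally first, while `.iloc[0]` returns whichever row is now on top.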


Lastly, we can set the DataFrame index from a column, or move the index back into a column:

In [30]:
df_shuffled.set_index('ts')

ts,open,high,low,close,volume,volumeUSD,token,chain
2021-12-04 07:00:00,47844.680,48892.320,47200.000,47711.330,2910.596893,1.389747e+08,BTC,BTC
2021-12-15 05:00:00,3848.940,3895.580,3826.130,3857.920,8607.636619,3.325520e+07,ETH,ETH
2021-12-12 07:00:00,178.900,180.060,178.390,179.260,555.967000,9.953062e+04,AAVE,ETH
2021-12-14 18:00:00,184.640,186.820,180.770,181.130,2661.731000,4.902899e+05,COMP,ETH
2021-12-13 23:00:00,46829.920,47292.730,46710.940,46914.020,702.442910,3.306139e+07,BTC,BTC
...,...,...,...,...,...,...,...,...
2021-12-16 14:00:00,193.890,196.760,192.840,196.250,1341.163000,2.617817e+05,COMP,ETH
2021-12-06 18:00:00,220.360,221.920,217.990,218.540,1673.916000,3.678586e+05,COMP,ETH
2021-12-16 06:00:00,176.900,177.660,174.110,174.550,50759.476000,8.937873e+06,SOL,SOL
2021-12-17 19:00:00,46773.460,46890.250,46648.310,46708.950,328.815147,1.537009e+07,BTC,BTC


In [31]:
df_shuffled.set_index('ts').reset_index()

,ts,open,high,low,close,volume,volumeUSD,token,chain
0,2021-12-04 07:00:00,47844.680,48892.320,47200.000,47711.330,2910.596893,1.389747e+08,BTC,BTC
1,2021-12-15 05:00:00,3848.940,3895.580,3826.130,3857.920,8607.636619,3.325520e+07,ETH,ETH
2,2021-12-12 07:00:00,178.900,180.060,178.390,179.260,555.967000,9.953062e+04,AAVE,ETH
3,2021-12-14 18:00:00,184.640,186.820,180.770,181.130,2661.731000,4.902899e+05,COMP,ETH
4,2021-12-13 23:00:00,46829.920,47292.730,46710.940,46914.020,702.442910,3.306139e+07,BTC,BTC
...,...,...,...,...,...,...,...,...,...
2015,2021-12-16 14:00:00,193.890,196.760,192.840,196.250,1341.163000,2.617817e+05,COMP,ETH
2016,2021-12-06 18:00:00,220.360,221.920,217.990,218.540,1673.916000,3.678586e+05,COMP,ETH
2017,2021-12-16 06:00:00,176.900,177.660,174.110,174.550,50759.476000,8.937873e+06,SOL,SOL
2018,2021-12-17 19:00:00,46773.460,46890.250,46648.310,46708.950,328.815147,1.537009e+07,BTC,BTC
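
A compact sketch of the round trip on a hypothetical two-row frame.  It also shows `reset_index(drop=True)`, which discards the old index instead of turning it into a column:

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    'ts': pd.to_datetime(['2021-12-01', '2021-12-02']),
    'close': [1.0, 2.0],
})

indexed = toy.set_index('ts')              # 'ts' becomes the index
restored = indexed.reset_index()           # 'ts' is a regular column again
dropped = indexed.reset_index(drop=True)   # old index is thrown away

print(list(indexed.columns), indexed.index.name)
print(list(restored.columns))
print(list(dropped.columns), list(dropped.index))
```

Both methods return new DataFrames, so `toy` itself is unchanged.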


## DataFrame Filtering

Filtering a DataFrame is very similar to filtering a Series.  We can filter on any set of columns; the filter is a boolean Series that is matched to the rows via the index.  For example, if we wanted to just get the data points for tokens on the Ethereum chain:

In [32]:
df['chain'] == 'ETH'

ts
2021-12-01 00:00:00    False
2021-12-01 01:00:00    False
2021-12-01 02:00:00    False
2021-12-01 03:00:00    False
2021-12-01 04:00:00    False
                       ...  
2021-12-17 15:00:00     True
2021-12-17 16:00:00     True
2021-12-17 17:00:00     True
2021-12-17 18:00:00     True
2021-12-17 19:00:00     True
Name: chain, Length: 2020, dtype: bool

In [33]:
df.loc[df['chain'] == 'ETH']

ts,open,high,low,close,volume,volumeUSD,token,chain
2021-12-01 00:00:00,4656.62,4672.43,4624.16,4634.95,6013.006735,2.793321e+07,ETH,ETH
2021-12-01 01:00:00,4636.43,4736.90,4605.49,4729.10,13819.061610,6.487693e+07,ETH,ETH
2021-12-01 02:00:00,4729.10,4729.10,4684.49,4695.78,7491.465440,3.524161e+07,ETH,ETH
2021-12-01 03:00:00,4695.78,4754.97,4672.30,4754.09,10530.834423,4.963273e+07,ETH,ETH
2021-12-01 04:00:00,4754.09,4774.74,4722.02,4764.59,12471.624735,5.924627e+07,ETH,ETH
...,...,...,...,...,...,...,...,...
2021-12-17 15:00:00,183.67,183.88,177.76,182.28,2137.736000,3.858270e+05,COMP,ETH
2021-12-17 16:00:00,182.34,185.80,179.29,185.59,1743.956000,3.175833e+05,COMP,ETH
2021-12-17 17:00:00,185.59,189.54,184.54,188.37,1197.438000,2.243116e+05,COMP,ETH
2021-12-17 18:00:00,188.27,188.27,184.99,185.93,573.678000,1.070202e+05,COMP,ETH


In [34]:
df.loc[df['chain'] == 'ETH', 'close']

ts
2021-12-01 00:00:00    4634.95
2021-12-01 01:00:00    4729.10
2021-12-01 02:00:00    4695.78
2021-12-01 03:00:00    4754.09
2021-12-01 04:00:00    4764.59
                        ...   
2021-12-17 15:00:00     182.28
2021-12-17 16:00:00     185.59
2021-12-17 17:00:00     188.37
2021-12-17 18:00:00     185.93
2021-12-17 19:00:00     185.18
Name: close, Length: 1212, dtype: float64
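
Conditions can also be combined.  The sketch below, on a made-up four-row table, uses `&` (and) and `~` (not); each comparison needs its own parentheses because these operators bind more tightly than `==` and `>`:

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    'token': ['BTC', 'ETH', 'AAVE', 'SOL'],
    'chain': ['BTC', 'ETH', 'ETH', 'SOL'],
    'close': [50000.0, 4000.0, 250.0, 200.0],
})

# Rows on the ETH chain AND with a close above 1000
mask = (toy['chain'] == 'ETH') & (toy['close'] > 1000)
eth_large = toy.loc[mask, ['token', 'close']]

# Rows NOT on the ETH chain, keeping only the token column
not_eth = toy.loc[~(toy['chain'] == 'ETH'), 'token']

print(eth_large)
print(not_eth.tolist())
```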

## Deleting from Dataframes

We can select whatever we'd like, but we can also drop rows and columns.  This also works by index label (dropping a label removes every row that carries it) and returns a new DataFrame, i.e.:

In [35]:
df.drop(pd.to_datetime('2021-12-01 00:00:00'))

ts,open,high,low,close,volume,volumeUSD,token,chain
2021-12-01 01:00:00,56998.35,57726.45,56705.06,57616.41,599.791578,3.437153e+07,BTC,BTC
2021-12-01 02:00:00,57618.55,57620.00,56972.97,57030.83,591.687200,3.387067e+07,BTC,BTC
2021-12-01 03:00:00,57029.79,57396.87,56841.01,57307.59,702.560364,4.007816e+07,BTC,BTC
2021-12-01 04:00:00,57306.55,57456.82,57026.11,57404.01,859.591535,4.920503e+07,BTC,BTC
2021-12-01 05:00:00,57404.01,57460.42,57016.00,57084.36,566.037996,3.238116e+07,BTC,BTC
...,...,...,...,...,...,...,...,...
2021-12-17 15:00:00,183.67,183.88,177.76,182.28,2137.736000,3.858270e+05,COMP,ETH
2021-12-17 16:00:00,182.34,185.80,179.29,185.59,1743.956000,3.175833e+05,COMP,ETH
2021-12-17 17:00:00,185.59,189.54,184.54,188.37,1197.438000,2.243116e+05,COMP,ETH
2021-12-17 18:00:00,188.27,188.27,184.99,185.93,573.678000,1.070202e+05,COMP,ETH


In [36]:
df.drop(columns='volumeUSD')

ts,open,high,low,close,volume,token,chain
2021-12-01 00:00:00,57321.41,57451.05,56814.34,56987.97,388.482022,BTC,BTC
2021-12-01 01:00:00,56998.35,57726.45,56705.06,57616.41,599.791578,BTC,BTC
2021-12-01 02:00:00,57618.55,57620.00,56972.97,57030.83,591.687200,BTC,BTC
2021-12-01 03:00:00,57029.79,57396.87,56841.01,57307.59,702.560364,BTC,BTC
2021-12-01 04:00:00,57306.55,57456.82,57026.11,57404.01,859.591535,BTC,BTC
...,...,...,...,...,...,...,...
2021-12-17 15:00:00,183.67,183.88,177.76,182.28,2137.736000,COMP,ETH
2021-12-17 16:00:00,182.34,185.80,179.29,185.59,1743.956000,COMP,ETH
2021-12-17 17:00:00,185.59,189.54,184.54,188.37,1197.438000,COMP,ETH
2021-12-17 18:00:00,188.27,188.27,184.99,185.93,573.678000,COMP,ETH


In [37]:
df.drop(['close', 'open'], axis=1)

ts,high,low,volume,volumeUSD,token,chain
2021-12-01 00:00:00,57451.05,56814.34,388.482022,2.218430e+07,BTC,BTC
2021-12-01 01:00:00,57726.45,56705.06,599.791578,3.437153e+07,BTC,BTC
2021-12-01 02:00:00,57620.00,56972.97,591.687200,3.387067e+07,BTC,BTC
2021-12-01 03:00:00,57396.87,56841.01,702.560364,4.007816e+07,BTC,BTC
2021-12-01 04:00:00,57456.82,57026.11,859.591535,4.920503e+07,BTC,BTC
...,...,...,...,...,...,...
2021-12-17 15:00:00,183.88,177.76,2137.736000,3.858270e+05,COMP,ETH
2021-12-17 16:00:00,185.80,179.29,1743.956000,3.175833e+05,COMP,ETH
2021-12-17 17:00:00,189.54,184.54,1197.438000,2.243116e+05,COMP,ETH
2021-12-17 18:00:00,188.27,184.99,573.678000,1.070202e+05,COMP,ETH
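
A minimal sketch of `drop` on a made-up frame with a repeated row label, which mirrors our data where every timestamp appears once per token:

```python
import pandas as pd

# Hypothetical toy frame; the label 'x' appears twice
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, index=['x', 'x', 'y'])

without_rows = toy.drop('x')          # removes BOTH rows labelled 'x'
without_col = toy.drop(columns='b')   # removes column 'b'

print(without_rows)
print(without_col)
print(toy.shape)   # the original frame is left untouched
```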


## Common Operations

Like a pandas Series, a DataFrame is backed by NumPy arrays under the hood, and `.values` exposes the whole table as a single array.

In [38]:
type(df.values)

numpy.ndarray

This means that the operations we saw for pandas Series can be applied to DataFrames as well, e.g. we can multiply every element in the DataFrame by a scalar

In [39]:
df.head() * 10

ts,open,high,low,close,volume,volumeUSD,token,chain
2021-12-01 00:00:00,573214.1,574510.5,568143.4,569879.7,3884.82022,221843000.0,BTCBTCBTCBTCBTCBTCBTCBTCBTCBTC,BTCBTCBTCBTCBTCBTCBTCBTCBTCBTC
2021-12-01 01:00:00,569983.5,577264.5,567050.6,576164.1,5997.915776,343715300.0,BTCBTCBTCBTCBTCBTCBTCBTCBTCBTC,BTCBTCBTCBTCBTCBTCBTCBTCBTCBTC
2021-12-01 02:00:00,576185.5,576200.0,569729.7,570308.3,5916.872,338706700.0,BTCBTCBTCBTCBTCBTCBTCBTCBTCBTC,BTCBTCBTCBTCBTCBTCBTCBTCBTCBTC
2021-12-01 03:00:00,570297.9,573968.7,568410.1,573075.9,7025.603645,400781600.0,BTCBTCBTCBTCBTCBTCBTCBTCBTCBTC,BTCBTCBTCBTCBTCBTCBTCBTCBTCBTC
2021-12-01 04:00:00,573065.5,574568.2,570261.1,574040.1,8595.915349,492050300.0,BTCBTCBTCBTCBTCBTCBTCBTCBTCBTC,BTCBTCBTCBTCBTCBTCBTCBTCBTCBTC


However, the operation needs to be valid for ALL elements if we want to do this - e.g. while `*` is defined for strings (it repeats them), `/` is not and will fail

In [40]:
df.head() / 10

TypeError: unsupported operand type(s) for /: 'str' and 'int'
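
One way around this, sketched on made-up data, is to restrict the operation to the numeric columns first with `select_dtypes`:

```python
import pandas as pd

# Hypothetical toy data with two numeric columns and one string column
toy = pd.DataFrame({
    'close': [10.0, 20.0],
    'volume': [100.0, 300.0],
    'token': ['BTC', 'ETH'],
})

numeric = toy.select_dtypes(include='number')   # drops the 'token' column
scaled = numeric / 10                           # now valid for every element

print(scaled)
```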

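Aggregation functions work column by column by default.  Recent pandas versions raise an error when asked for the mean of string columns, so we restrict the aggregation to the numeric ones:

In [None]:
df.mean(numeric_only=True)

However, we can also make them aggregate by row:

In [None]:
df.mean(axis=1, numeric_only=True)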
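
To make the two directions concrete, here is a sketch on a made-up two-row frame, small enough to check by hand:

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    'open': [1.0, 3.0],
    'close': [2.0, 6.0],
    'token': ['A', 'B'],
})

col_means = toy.mean(numeric_only=True)           # one value per column
row_means = toy.mean(axis=1, numeric_only=True)   # one value per row

print(col_means)
print(row_means)
```

`col_means` is indexed by column name and `row_means` by the row index; the string column `token` is skipped in both.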

## Mutating the Dataframe

Like with other functionality, mutating DataFrames is very similar to mutating Series.  For example, setting one column to a single value is easy:

In [None]:
df_mutations = df_base.set_index('ts')

In [None]:
df_mutations.head()

In [None]:
df_mutations['chain'] = 'NA'

In [None]:
df_mutations.head()

We can also create a new column and add data by index:

In [None]:
updates = pd.Series({pd.to_datetime('2021-12-01 00:00:00'): 1})
updates

In [None]:
df_mutations['start_of_week'] = updates

In [None]:
df_mutations.head()

We can also use the `.assign(...)` method to update columns, e.g.:

In [None]:
df_mutations.assign(
    chain=np.where(df_mutations.token.isin(['ETH', 'AAVE', 'COMP']), np.full(df_mutations.shape[0], 'ETH'), df_mutations.token),
    start_of_week=np.nan
)

**note**: assigning with the index notation `[*]` mutates the DataFrame in place, whereas `.assign` returns a new DataFrame and leaves the original untouched
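
A short sketch of that difference on a hypothetical one-column frame:

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({'token': ['ETH', 'SOL']})

assigned = toy.assign(chain='NA')        # new DataFrame with an extra column
cols_after_assign = list(toy.columns)    # the original has not changed

toy['chain'] = 'NA'                      # this changes `toy` itself

print(cols_after_assign)
print(list(assigned.columns))
print(list(toy.columns))
```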

We can also rename columns using a `{from:to}` syntax, e.g.:

In [None]:
df_mutations.rename(
    columns={
        'open':'OpeningPrice',
        'chain':'CryptoChain'
    }
)

We can also use functions to rename, e.g.:

In [None]:
df_mutations.rename(columns=lambda x: x.upper())

The above commands return a new DataFrame.  If we want to rename the columns of the input DataFrame itself, we can use the `inplace` option (which is available on many mutating methods), such as:

In [None]:
df_mutations.rename(
    columns={
        'open':'OpeningPrice',
        'chain':'CryptoChain'
    },
    inplace=True
)

df_mutations

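We can also add rows to the DataFrame.  Older pandas versions offered `DataFrame.append` for this; it was removed in pandas 2.0, so we use `pd.concat` instead:

In [None]:
new_row = pd.Series({
    'high': 1,
    'low': 2,
    'token': 'FAKE'
}, name=pd.to_datetime('2021-11-30 00:00:00'))

# wrap the Series in a one-row DataFrame, then stack it under the original
pd.concat([df_mutations, pd.DataFrame([new_row])])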

## Sorting DataFrames

One thing that we didn't really need to do with Series is sorting.  For DataFrames, we will often need to sort by one or more columns or by the index.  We can use `sort_values` and `sort_index` to do this

In [None]:
df.sort_values('open')

In [None]:
df.sort_values('open', ascending=False)

In [None]:
df.sort_values(['volumeUSD', 'open'])

We can also sort by the index

In [None]:
df.sort_index()
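
When sorting by several columns, `ascending` can also be a list with one entry per column.  A sketch on made-up data:

```python
import pandas as pd

# Hypothetical toy data with a deliberately unordered index
toy = pd.DataFrame(
    {'token': ['B', 'A', 'B', 'A'], 'close': [1.0, 4.0, 3.0, 2.0]},
    index=[3, 1, 2, 0],
)

# token ascending, and within each token close descending
by_cols = toy.sort_values(['token', 'close'], ascending=[True, False])

# back to index order
by_index = toy.sort_index()

print(by_cols)
print(by_index)
```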

## Grouping DataFrames

One _very common_ action during data manipulation is grouping and then aggregating.  A pandas DataFrame has the method `groupby`, which allows us to group by any column in our DataFrame.

`groupby` returns a `DataFrameGroupBy` object, to which we can apply a function group by group, or which we can aggregate directly

In [None]:
df.groupby('chain')

In [None]:
df.groupby('chain').groups

In [None]:
len(df.groupby('chain'))

In [None]:
df.groupby('chain').size()

after grouping, we can operate on the whole DataFrame or on any column

In [None]:
df.groupby('chain')['volumeUSD'].sum().to_frame()

We can also group by multiple columns.  The row index is now a multi-index, which we will not go into here

In [None]:
df.groupby(['chain', 'token'])['volumeUSD'].sum().to_frame()

We can actually aggregate without setting a compound index by adding `as_index=False`

In [None]:
df.groupby(['chain', 'token'], as_index=False)['volumeUSD'].sum()

We can now operate on the groups.  For example, if we wanted to sum all numeric columns:

In [None]:
df.groupby('chain').sum(numeric_only=True)

or describe all columns

In [None]:
df.groupby('chain').describe()

We can also do multiple aggregations at once by passing a list of function names:

In [None]:
df.groupby('chain')['open'].agg(['size', 'mean', 'std', 'min', 'max'])

we can actually use _any_ arbitrary functions - for example, we can use lambdas

In [None]:
df.groupby('chain')['open'].agg(
    range=lambda x: x.max() - x.min()
)
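
The same keyword style also works on the whole grouped DataFrame, where each output column is given as a `(source column, function)` pair.  A sketch on made-up data (the output names `open_mean`, `open_range` and `volume_total` are arbitrary choices for this example):

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    'chain': ['ETH', 'ETH', 'SOL'],
    'open': [10.0, 30.0, 5.0],
    'volumeUSD': [100.0, 200.0, 50.0],
})

summary = toy.groupby('chain').agg(
    open_mean=('open', 'mean'),
    open_range=('open', lambda x: x.max() - x.min()),
    volume_total=('volumeUSD', 'sum'),
)

print(summary)
```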

## Joining Dataframes

One of the primary things we need to do before starting to clean data is to make sure that we can get all of our data into one place.  This is usually called either a wide (fat) table or a long table, depending on how we are doing the joining.  We'll look at a few different ways to join pandas DataFrames below.

We will be using `dfs`, which is a list of DataFrames that we created up above

### `pd.concat`

To join the DataFrames lengthwise, we can use `pd.concat`.  This stacks the DataFrames on top of each other and lines the values up by column name.  If a DataFrame lacks a column that another one has, the column still appears in the combined DataFrame, with NaN in the rows that came from the DataFrame without it

In [None]:
pd.concat(dfs)

If we want to keep track of where each row originally came from, we can add keys, which creates a multi-index:

In [None]:
res = pd.concat(dfs, keys=tokens)
res

this allows us to select the data from the source tables, e.g.:

In [None]:
res.loc['COMP']

`pd.concat` also works on just two DataFrames (older pandas versions offered `.append(*)` for this, but it was removed in pandas 2.0):

In [None]:
pd.concat([dfs[0], dfs[1]])

Lastly, remember the importance of indices.  In the operations above we joined the DataFrames while keeping the indices of the original tables.  This means that we have repeated indices:

In [None]:
pd.concat([dfs[0], dfs[1]]).sort_index()

This sometimes isn't ideal, especially if we want to join against these indices later.  Instead, we can create a new index on the joined table using the `ignore_index` parameter, which gives us a sequential, non-repeated index:

In [None]:
pd.concat([dfs[0], dfs[1]], ignore_index=True).sort_index()
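
The effect on the index, and the NaN filling for columns that only one table has, can be checked on two made-up frames:

```python
import pandas as pd

# Hypothetical toy frames; `b` has a column that `a` lacks
a = pd.DataFrame({'close': [1.0, 2.0]})
b = pd.DataFrame({'close': [3.0], 'volume': [9.0]})

stacked = pd.concat([a, b])                          # keeps original indices
renumbered = pd.concat([a, b], ignore_index=True)    # fresh 0..n-1 index

print(list(stacked.index))
print(list(renumbered.index))
print(stacked['volume'].isna().tolist())
```

The rows that came from `a` have NaN in `volume`, because `a` has no such column.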

### `pd.DataFrame.join`

`df.join` is a nice and easy method that allows us to join two dataframes by their index

In [None]:
dfs[0].set_index('ts')['close'].rename(f'close_{tokens[0]}').to_frame().join(
    dfs[1].set_index('ts')['close'].rename(f'close_{tokens[1]}').to_frame()
)

we can get a little more advanced by having a **left** unkeyed DataFrame joining against a **right** keyed DataFrame, e.g.:

In [None]:
dfs[0][['ts', 'close']].join(
    dfs[1].set_index('ts')['close'].rename(f'close_{tokens[1]}').to_frame(),
    on='ts'
)
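
By default `join` behaves like a left join: every row of the left DataFrame is kept, and rows without a match on the right are filled with NaN.  A sketch on made-up data, with `how='inner'` shown for comparison:

```python
import pandas as pd

# Hypothetical toy frames indexed by timestamp
left = pd.DataFrame(
    {'close_BTC': [1.0, 2.0]},
    index=pd.to_datetime(['2021-12-01', '2021-12-02']),
)
right = pd.DataFrame(
    {'close_ETH': [10.0]},
    index=pd.to_datetime(['2021-12-01']),
)

joined = left.join(right)               # left join on the index (default)
inner = left.join(right, how='inner')   # only timestamps present in both

print(joined)
print(inner)
```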

### `pd.merge`

`pd.merge` is pandas' way of doing SQL-like joins (e.g. left join, inner join, outer join etc).  There are a few quirks we'll see though.

In [None]:
pd.merge(
    dfs[0][['ts', 'close']].rename(columns={'close': f'close_{tokens[0]}'}),
    dfs[1][['ts', 'close']].rename(columns={'close': f'close_{tokens[1]}'}),
    on='ts',
    how='inner'
)

We can use other values for `how`, e.g. 'left', 'right', 'outer', and 'cross'

If the left and right DataFrames have columns with the same name, pandas will automatically resolve the conflict by adding `_x` and `_y` suffixes to the conflicting columns

In [None]:
pd.merge(
    dfs[0][['ts', 'close']],
    dfs[1][['ts', 'close']],
    on='ts',
    how='inner'
)

however, we can also define our own suffixes, e.g.

In [None]:
pd.merge(
    dfs[0][['ts', 'close']],
    dfs[1][['ts', 'close']],
    on='ts',
    how='inner',
    suffixes=[f'_{tokens[0]}', f'_{tokens[1]}']
)
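
To see how `how` and the suffixes interact, here is a sketch on two made-up frames that share only one key.  The `indicator=True` option adds a `_merge` column saying which side each row came from:

```python
import pandas as pd

# Hypothetical toy frames; only ts == 2 appears in both
left = pd.DataFrame({'ts': [1, 2], 'close': [10.0, 20.0]})
right = pd.DataFrame({'ts': [2, 3], 'close': [200.0, 300.0]})

inner = pd.merge(left, right, on='ts', how='inner',
                 suffixes=('_left', '_right'))

# Default suffixes (_x, _y) plus a column describing each row's origin
outer = pd.merge(left, right, on='ts', how='outer', indicator=True)

print(inner)
print(outer)
```

The inner merge keeps only the shared key, while the outer merge keeps all three keys and fills the missing side with NaN.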