# Pandas DataFrames (`pd.DataFrame`)

The pandas dataframe is the workhorse of most data science and analytics projects.  The dataframe represents the data you're working with as a table.  What makes the dataframe flexible is that each row **and** each column can be pulled out as a pandas Series, which allows for many powerful ways to slice and transform the data.
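
As a quick illustration of the "rows and columns are Series" idea, here is a minimal sketch on a made-up two-row table (the name `toy` and its values are assumptions for illustration only, not the market data used in this notebook):

```python
import pandas as pd

# Hypothetical toy prices, not the market data used in this notebook
toy = pd.DataFrame(
    {'open': [1.0, 2.0], 'close': [1.5, 2.5]},
    index=['a', 'b'],
)

column = toy['open']  # one column: a Series indexed by the row labels
row = toy.loc['a']    # one row: a Series indexed by the column names

print(type(column).__name__, list(column.index))
print(type(row).__name__, list(row.index))
```

Note how the index of the column Series is the row labels, while the index of the row Series is the column names.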

In [1]:
import pandas as pd
import numpy as np
import requests

First let's get some data so we can see what we can do with a data frame.  Don't worry about exactly what this function is doing; we will go over it in a bit.

In [2]:
def get_data(token):
    res = requests.get(
        f'https://api.cryptowat.ch/markets/coinbase-pro/{token}usd/ohlc',
        params={
            'periods': '3600',
            'after': str(int(pd.Timestamp('2021-12-01').timestamp()))
        }
    )

    df = pd.DataFrame(
        res.json()['result']['3600'],
        columns=['ts', 'open', 'high', 'low', 'close', 'volume', 'volumeUSD']
    )
    df['ts'] = pd.to_datetime(df.ts, unit='s')
    df['token'] = token
    
    return df


In [3]:
tokens = ['BTC', 'ETH', 'SOL', 'AAVE', 'COMP']

Don't worry too much about what is going on in the expression below.  We'll briefly go over it because it showcases the power of Python, but it's not necessary for the class

In [4]:
dfs = [
    (lambda x: x.assign(chain=np.where(x.token.isin(['ETH', 'AAVE', 'COMP']), np.full(x.shape[0], 'ETH'), x.token)))(get_data(token)) 
    for token in tokens
]

In [5]:
df_base = pd.concat(get_data(token) for token in tokens)
df_base['chain'] = np.where(df_base.token.isin(['ETH', 'AAVE', 'COMP']), np.full(df_base.shape[0], 'ETH'), df_base.token)


In [6]:
df = df_base.set_index('ts')

## Understanding the data frame

After loading the data into our data frame, we can now inspect what's inside.  We need to do this often, because real data sets are usually too large to inspect row by row, and we still need to check that the data was loaded correctly

Let's check some basic properties of the data set:

We can see how many rows and columns this data frame has, and the total number of data points (rows times columns)

In [None]:
df.shape

In [None]:
df.size

We can see what the first 5 rows look like:

In [None]:
df.head()

...and the last 5 rows

In [None]:
df.tail()

We can also see a general overview of the schema (column names, non-null counts and data types) of the data

In [None]:
df.info()

as well as descriptive statistics about every column

In [None]:
df.describe()

## DataFrame Indexing

Indexing in data frames works very similarly to Series, however there are now two "axes" that we can operate on: rows and columns.  For example, using `[*]` for indexing (like in Series) will by default operate on columns:

In [None]:
df['open']

however using `.loc[*]` will allow you to access rows by their index label:

In [None]:
df.loc['2021-12-01 01:00:00']

and `.iloc[*]` will get you rows by integer position

In [None]:
df.iloc[0]

we can also get to the last row easily

In [None]:
df.iloc[-1]

or return it as a data frame instead of a Series

In [None]:
df.iloc[[-1]]

**note**: `df.loc[0]` will not work here, as `.loc` looks rows up by index label and our index holds timestamps rather than integers

---
**note**: Also, the index operators will return a `pd.Series` if exactly one row matches, or a new `pd.DataFrame` if multiple rows match (or if a list or slice is passed), e.g.:

In [None]:
type(df.iloc[0])

In [None]:
type(df.loc['2021-12-01 01:00:00']) # 5 rows returned

We can convert a Series to a DataFrame anytime by using the `.to_frame()` method on the Series object.  This will turn the Series into a one-column DataFrame, using `Series.name` as the column name

In [None]:
df.iloc[0].to_frame()
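
To make the Series-versus-DataFrame behaviour concrete, here is a small sketch on a made-up frame with a repeated timestamp label (mimicking several tokens sharing the same hour).  The name `toy` and its values are assumptions for illustration only:

```python
import pandas as pd

# Hypothetical frame: the first timestamp appears twice, the second only once
idx = pd.to_datetime(['2021-12-01 00:00', '2021-12-01 00:00', '2021-12-01 01:00'])
toy = pd.DataFrame({'close': [10.0, 20.0, 11.0], 'token': ['A', 'B', 'A']}, index=idx)

unique_hit = toy.loc[pd.Timestamp('2021-12-01 01:00')]    # label occurs once -> Series
repeated_hit = toy.loc[pd.Timestamp('2021-12-01 00:00')]  # label occurs twice -> DataFrame
as_frame = toy.iloc[0].to_frame()                          # Series -> one-column DataFrame

print(type(unique_hit).__name__, type(repeated_hit).__name__, as_frame.shape)
```

The column of `as_frame` is named after the row's label (here the timestamp), because that is what `Series.name` holds for a row pulled out of a DataFrame.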

---

In addition, we can select on multiple columns and rows in various ways:

In [None]:
df[['open', 'close']]

In [None]:
df[0:2]

In [None]:
df.loc['2021-12-01 00:00:00':'2021-12-01 02:00:00']

In [None]:
df.iloc[0:4]

In [None]:
df.iloc[[0, 4, 10, 50]]

And finally, we can index on both rows and columns at the same time with `.loc`:

In [None]:
df.loc['2021-12-01 00:00:00':'2021-12-01 02:00:00', ['close', 'volume', 'token']]

**note**: Given that dataframe indices are sequential integers by default, it's good practice to use `.loc` and `.iloc` explicitly so it is clear whether we mean labels or positions.  For example, let's shuffle our data frame and then select:

In [None]:
df_shuffled = df_base.sample(frac=1)

In [None]:
df_shuffled.loc[[0, 2, 3]]

In [None]:
df_shuffled.iloc[[0, 2, 3]]
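
The difference is easier to see on a tiny example.  The sketch below uses a made-up four-row frame and a fixed reordering in place of a random shuffle (both are assumptions for illustration), so the result is deterministic:

```python
import pandas as pd

toy = pd.DataFrame({'close': [10.0, 11.0, 12.0, 13.0]})  # default index 0..3
reordered = toy.iloc[[3, 1, 0, 2]]  # deterministic stand-in for a shuffle

by_label = reordered.loc[0]      # the row whose index label is 0
by_position = reordered.iloc[0]  # whichever row is physically first

print(by_label['close'], by_position['close'])
```

After reordering, label `0` still points at the original first row, while position `0` is whatever row ended up on top.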

Lastly, we can set the DataFrame index from a column, or move the index back into a regular column

In [None]:
df_shuffled.set_index('ts')

In [None]:
df_shuffled.set_index('ts').reset_index()

## DataFrame Filtering

Filtering a data frame is very similar to filtering a Series.  We can filter on any set of columns: a comparison produces a boolean Series aligned on the index, and passing it to `.loc` keeps the rows where it is `True`.  For example, if we wanted to just get the data points for tokens on the Ethereum chain:

In [None]:
df['chain'] == 'ETH'

In [None]:
df.loc[df['chain'] == 'ETH']

In [None]:
df.loc[df['chain'] == 'ETH', 'close']
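
Boolean masks can also be combined.  Here is a minimal sketch on a made-up four-row frame (the values are illustrative assumptions), using `&` for "and" and `~` for "not"; the parentheses around each comparison are required:

```python
import pandas as pd

toy = pd.DataFrame({
    'token': ['BTC', 'ETH', 'AAVE', 'SOL'],
    'chain': ['BTC', 'ETH', 'ETH', 'SOL'],
    'close': [57000.0, 4600.0, 250.0, 210.0],
})

# Rows on the ETH chain AND with a close above 1000
mask = (toy['chain'] == 'ETH') & (toy['close'] > 1000)
filtered = toy.loc[mask, ['token', 'close']]

# Rows NOT on the ETH chain
not_eth = toy.loc[~(toy['chain'] == 'ETH')]

print(filtered)
print(not_eth)
```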

## Deleting from Dataframes

We can select all the things we'd like, but we can also drop both rows and columns.  This also works by index label, and by default returns a new DataFrame rather than changing `df`, i.e.:

In [None]:
df.drop(pd.to_datetime('2021-12-01 00:00:00'))

In [None]:
df.drop(columns='volumeUSD')

In [None]:
df.drop(['close', 'open'], axis=1)
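
As a small sketch of the point that `drop` hands back a new frame, here is a made-up two-row table (names and values are assumptions for illustration):

```python
import pandas as pd

toy = pd.DataFrame({'open': [1.0, 2.0], 'close': [1.5, 2.5]}, index=['a', 'b'])

without_row = toy.drop('a')            # drop a row by its index label
without_col = toy.drop(columns='open')  # drop a column by name

# The original frame still has both rows and both columns
print(toy.shape, without_row.shape, without_col.shape)
```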

## Common Operations

Like with pandas Series, the data in a DataFrame can be accessed as a numpy array.

In [None]:
type(df.values)

This means that the operations we saw for pandas Series can be applied to DataFrames as well, e.g. we can apply a scalar to every element in the DataFrame

In [None]:
df.head() * 10

However, the operation needs to be valid for ALL elements if we want to do this, e.g. while `*` is defined for strings (it repeats them), `/` is not and will fail

In [None]:
df.head() / 10

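Aggregation functions operate column by column by default.  Our DataFrame also holds string columns (`token`, `chain`), so we pass `numeric_only=True` to restrict the aggregation to the numeric columns (pandas 2.0 and later raise an error otherwise)

In [None]:
df.mean(numeric_only=True)

However we can also make them aggregate by row:

In [None]:
df.mean(axis=1, numeric_only=True)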
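
To see what the two axes mean, here is a rough sketch on a made-up frame (values are illustrative assumptions).  It selects the numeric columns explicitly with `select_dtypes` first, which is one way to keep string columns out of the aggregation:

```python
import pandas as pd

toy = pd.DataFrame({
    'open': [1.0, 3.0],
    'close': [2.0, 4.0],
    'token': ['BTC', 'ETH'],
})

numeric = toy.select_dtypes(include='number')  # keeps 'open' and 'close' only

col_means = numeric.mean()        # default axis=0: one value per column
row_means = numeric.mean(axis=1)  # axis=1: one value per row

print(col_means)
print(row_means)
```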

## Mutating the Dataframe

Like with other functionality, mutating DataFrames is very similar to mutating Series.  For example, setting one column to a single value is easy:

In [7]:
df_mutations = df_base.set_index('ts')

In [None]:
df_mutations.head()

In [8]:
df_mutations['chain'] = 'NA'

In [None]:
df_mutations.head()

We can also create a new column and add data by index:

In [9]:
updates = pd.Series({pd.to_datetime('2021-12-01 00:00:00'): 1})
updates

2021-12-01    1
dtype: int64

In [10]:
df_mutations['start_of_week'] = updates

In [None]:
df_mutations.head()
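
What happens here is index alignment: the new column takes values from the Series wherever the labels match and is filled with `NaN` everywhere else.  A minimal sketch on a made-up frame (names and values are assumptions for illustration):

```python
import pandas as pd

toy = pd.DataFrame({'close': [10.0, 20.0, 30.0]}, index=['a', 'b', 'c'])
updates = pd.Series({'b': 1})  # only one label is present

toy['flag'] = updates  # aligned on the index: 'a' and 'c' get NaN

print(toy)
```

Because `NaN` is a float, the new column ends up as floats even though `updates` held an integer.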

We can also use the `.assign(...)` method to update columns, e.g.:

In [11]:
df_mutations.assign(
    chain=np.where(df_mutations.token.isin(['ETH', 'AAVE', 'COMP']), np.full(df_mutations.shape[0], 'ETH'), df_mutations.token),
    start_of_week=np.nan
)

ts,open,high,low,close,volume,volumeUSD,token,chain,start_of_week
2021-12-01 00:00:00,57321.41,57451.05,56814.34,56987.97,388.482022,2.218430e+07,BTC,BTC,
2021-12-01 01:00:00,56998.35,57726.45,56705.06,57616.41,599.791578,3.437153e+07,BTC,BTC,
2021-12-01 02:00:00,57618.55,57620.00,56972.97,57030.83,591.687200,3.387067e+07,BTC,BTC,
2021-12-01 03:00:00,57029.79,57396.87,56841.01,57307.59,702.560364,4.007816e+07,BTC,BTC,
2021-12-01 04:00:00,57306.55,57456.82,57026.11,57404.01,859.591535,4.920503e+07,BTC,BTC,
...,...,...,...,...,...,...,...,...,...
2021-12-14 20:00:00,178.99,181.27,178.72,180.28,628.699000,1.130815e+05,COMP,ETH,
2021-12-14 21:00:00,180.15,183.76,179.96,183.43,2453.455000,4.447119e+05,COMP,ETH,
2021-12-14 22:00:00,183.31,186.74,182.60,185.24,2086.017000,3.865710e+05,COMP,ETH,
2021-12-14 23:00:00,185.22,186.09,183.88,185.08,416.028000,7.698720e+04,COMP,ETH,


**note**: assigning with the index notation (`df[col] = ...`) mutates the dataframe in place, whereas `.assign` returns a new data frame and leaves the original untouched
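
A short sketch of that difference on a made-up frame (the names `toy` and `close_x2` are assumptions for illustration):

```python
import pandas as pd

toy = pd.DataFrame({'close': [10.0, 20.0]})

# .assign returns a new frame; `toy` itself is unchanged
derived = toy.assign(close_x2=toy['close'] * 2)
columns_after_assign = toy.columns.tolist()

# Index notation writes the column into `toy` itself
toy['close_x2'] = toy['close'] * 2
columns_after_setitem = toy.columns.tolist()

print(columns_after_assign, columns_after_setitem)
```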

We can also rename columns using a `{from:to}` syntax, e.g.:

In [12]:
df_mutations.rename(
    columns={
        'open':'OpeningPrice',
        'chain':'CryptoChain'
    }
)

ts,OpeningPrice,high,low,close,volume,volumeUSD,token,CryptoChain,start_of_week
2021-12-01 00:00:00,57321.41,57451.05,56814.34,56987.97,388.482022,2.218430e+07,BTC,,1.0
2021-12-01 01:00:00,56998.35,57726.45,56705.06,57616.41,599.791578,3.437153e+07,BTC,,
2021-12-01 02:00:00,57618.55,57620.00,56972.97,57030.83,591.687200,3.387067e+07,BTC,,
2021-12-01 03:00:00,57029.79,57396.87,56841.01,57307.59,702.560364,4.007816e+07,BTC,,
2021-12-01 04:00:00,57306.55,57456.82,57026.11,57404.01,859.591535,4.920503e+07,BTC,,
...,...,...,...,...,...,...,...,...,...
2021-12-14 20:00:00,178.99,181.27,178.72,180.28,628.699000,1.130815e+05,COMP,,
2021-12-14 21:00:00,180.15,183.76,179.96,183.43,2453.455000,4.447119e+05,COMP,,
2021-12-14 22:00:00,183.31,186.74,182.60,185.24,2086.017000,3.865710e+05,COMP,,
2021-12-14 23:00:00,185.22,186.09,183.88,185.08,416.028000,7.698720e+04,COMP,,


We can also use functions to rename, e.g.:

In [13]:
df_mutations.rename(columns=lambda x: x.upper())

ts,OPEN,HIGH,LOW,CLOSE,VOLUME,VOLUMEUSD,TOKEN,CHAIN,START_OF_WEEK
2021-12-01 00:00:00,57321.41,57451.05,56814.34,56987.97,388.482022,2.218430e+07,BTC,,1.0
2021-12-01 01:00:00,56998.35,57726.45,56705.06,57616.41,599.791578,3.437153e+07,BTC,,
2021-12-01 02:00:00,57618.55,57620.00,56972.97,57030.83,591.687200,3.387067e+07,BTC,,
2021-12-01 03:00:00,57029.79,57396.87,56841.01,57307.59,702.560364,4.007816e+07,BTC,,
2021-12-01 04:00:00,57306.55,57456.82,57026.11,57404.01,859.591535,4.920503e+07,BTC,,
...,...,...,...,...,...,...,...,...,...
2021-12-14 20:00:00,178.99,181.27,178.72,180.28,628.699000,1.130815e+05,COMP,,
2021-12-14 21:00:00,180.15,183.76,179.96,183.43,2453.455000,4.447119e+05,COMP,,
2021-12-14 22:00:00,183.31,186.74,182.60,185.24,2086.017000,3.865710e+05,COMP,,
2021-12-14 23:00:00,185.22,186.09,183.88,185.08,416.028000,7.698720e+04,COMP,,


The above commands will return a new DataFrame.  If we want to rename the columns of the original DataFrame itself, we can use the `inplace` option (which is available on many mutating methods), such as:

In [14]:
df_mutations.rename(
    columns={
        'open':'OpeningPrice',
        'chain':'CryptoChain'
    },
    inplace=True
)

df_mutations

ts,OpeningPrice,high,low,close,volume,volumeUSD,token,CryptoChain,start_of_week
2021-12-01 00:00:00,57321.41,57451.05,56814.34,56987.97,388.482022,2.218430e+07,BTC,,1.0
2021-12-01 01:00:00,56998.35,57726.45,56705.06,57616.41,599.791578,3.437153e+07,BTC,,
2021-12-01 02:00:00,57618.55,57620.00,56972.97,57030.83,591.687200,3.387067e+07,BTC,,
2021-12-01 03:00:00,57029.79,57396.87,56841.01,57307.59,702.560364,4.007816e+07,BTC,,
2021-12-01 04:00:00,57306.55,57456.82,57026.11,57404.01,859.591535,4.920503e+07,BTC,,
...,...,...,...,...,...,...,...,...,...
2021-12-14 20:00:00,178.99,181.27,178.72,180.28,628.699000,1.130815e+05,COMP,,
2021-12-14 21:00:00,180.15,183.76,179.96,183.43,2453.455000,4.447119e+05,COMP,,
2021-12-14 22:00:00,183.31,186.74,182.60,185.24,2086.017000,3.865710e+05,COMP,,
2021-12-14 23:00:00,185.22,186.09,183.88,185.08,416.028000,7.698720e+04,COMP,,


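We can also add rows to the DataFrame.  Older pandas versions had a `DataFrame.append` method for this, but it was deprecated in pandas 1.4 and removed in 2.0, so we use `pd.concat` with a one-row DataFrame instead:

In [None]:
pd.concat([
    df_mutations,
    pd.DataFrame(
        {'high': 1, 'low': 2, 'token': 'FAKE'},
        index=pd.DatetimeIndex(['2021-11-30 00:00:00'], name='ts')
    )
])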

## Sorting DataFrames

One thing that we didn't really need to do with Series is sorting.  For DataFrames, we will often need to sort by one or more columns or by the index.  We can use `sort_values` and `sort_index` to do this

In [None]:
df.sort_values('open')

In [None]:
df.sort_values('open', ascending=False)

In [None]:
df.sort_values(['volumeUSD', 'open'])

We can also sort by the index

In [None]:
df.sort_index()
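
When sorting by several columns, `ascending` can also take one flag per column.  A rough sketch on a made-up frame (values are illustrative assumptions):

```python
import pandas as pd

toy = pd.DataFrame(
    {'token': ['ETH', 'BTC', 'ETH', 'BTC'], 'open': [2.0, 9.0, 3.0, 8.0]},
    index=[3, 1, 2, 0],
)

# token ascending, then open descending within each token
by_cols = toy.sort_values(['token', 'open'], ascending=[True, False])

# restore the order of the index labels
by_index = toy.sort_index()

print(by_cols)
print(by_index)
```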

## Grouping DataFrames

One _very common_ action we will do during data manipulation is grouping then aggregating.  The pandas DataFrame has the method `groupby`, which allows us to group by any column in our DataFrame.

`groupby` returns a `DataFrameGroupBy` object, on which we can apply a function to each group, or directly aggregate

In [15]:
df.groupby('chain')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x12055c7c0>

In [16]:
df.groupby('chain').groups

{'BTC': [2021-12-01 00:00:00, 2021-12-01 01:00:00, 2021-12-01 02:00:00, ...], ...}

In [17]:
len(df.groupby('chain'))

3

In [18]:
df.groupby('chain').size()

chain
BTC     337
ETH    1011
SOL     337
dtype: int64

after grouping, we can operate on the whole DataFrame or on any column

In [19]:
df.groupby('chain')['volumeUSD'].sum().to_frame()

chain,volumeUSD
BTC,12611250000.0
ETH,14809560000.0
SOL,4639044000.0


We can also group by multiple columns.  The row index of the result is then a multi-index, which we will not go into here

In [None]:
df.groupby(['chain', 'token'])['volumeUSD'].sum().to_frame()

We can actually aggregate without setting a compound index by adding `as_index=False`

In [None]:
df.groupby(['chain', 'token'], as_index=False)['volumeUSD'].sum()

We can now operate on the groups.  For example, if we wanted to sum all columns:

In [22]:
df.groupby('chain').aggregate(np.sum)

chain,open,high,low,close,volume,volumeUSD
BTC,17081110.0,17181740.0,16964340.0,17071910.0,251074.8,12611250000.0
ETH,1558996.0,1569836.0,1545875.0,1557992.0,4738150.0,14809560000.0
SOL,64358.58,65032.74,63624.48,64307.39,24209130.0,4639044000.0


The built-in `.sum()` shortcut gives the same result:

In [23]:
df.groupby('chain').sum()

chain,open,high,low,close,volume,volumeUSD
BTC,17081110.0,17181740.0,16964340.0,17071910.0,251074.8,12611250000.0
ETH,1558996.0,1569836.0,1545875.0,1557992.0,4738150.0,14809560000.0
SOL,64358.58,65032.74,63624.48,64307.39,24209130.0,4639044000.0


or describe all columns

In [None]:
df.groupby('chain').describe()

We can also do multiple aggregations

In [None]:
df.groupby('chain')['open'].agg(['size', 'mean', 'std', 'min', 'max'])

We can actually use _any_ arbitrary function, for example a lambda passed as a named aggregation

In [None]:
df.groupby('chain')['open'].agg(
    range=lambda x: x.max() - x.min()
)
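
Named aggregations also work across several columns at once, using `new_name=(column, function)` pairs.  Here is a minimal sketch on a made-up frame (column values and the output names `mean_open`, `total_volume`, `open_range` are assumptions for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    'chain': ['BTC', 'ETH', 'ETH', 'SOL'],
    'open': [10.0, 2.0, 6.0, 1.0],
    'volumeUSD': [100.0, 20.0, 30.0, 5.0],
})

summary = toy.groupby('chain').agg(
    mean_open=('open', 'mean'),                        # built-in by name
    total_volume=('volumeUSD', 'sum'),
    open_range=('open', lambda s: s.max() - s.min()),  # arbitrary function
)

print(summary)
```

Each keyword becomes a column in the result, so the output stays flat and readable even with several aggregations.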

## Joining Dataframes

One of the primary things we need to do before starting to clean data is to make sure that we can get all of our data into one place.  This is usually called either a fat table or a long table, depending on how we are doing the joining.  We'll look at a few different ways to join pandas DataFrames below.

We will be using `dfs`, which is a list of DataFrames that we created up above

### `pd.concat`

To join the dataframes lengthwise, we can use `pd.concat`.  This stacks the rows of the dataframes on top of each other, lining up the columns by name.  If a dataframe lacks a column that another one has, the column still appears in the combined DataFrame, filled with NA for the rows that came from the dataframe without it

In [24]:
pd.concat(dfs)

,ts,open,high,low,close,volume,volumeUSD,token,chain
0,2021-12-01 00:00:00,57321.41,57451.05,56814.34,56987.97,388.482022,2.218430e+07,BTC,BTC
1,2021-12-01 01:00:00,56998.35,57726.45,56705.06,57616.41,599.791578,3.437153e+07,BTC,BTC
2,2021-12-01 02:00:00,57618.55,57620.00,56972.97,57030.83,591.687200,3.387067e+07,BTC,BTC
3,2021-12-01 03:00:00,57029.79,57396.87,56841.01,57307.59,702.560364,4.007816e+07,BTC,BTC
4,2021-12-01 04:00:00,57306.55,57456.82,57026.11,57404.01,859.591535,4.920503e+07,BTC,BTC
...,...,...,...,...,...,...,...,...,...
332,2021-12-14 20:00:00,178.99,181.27,178.72,180.28,628.699000,1.130815e+05,COMP,ETH
333,2021-12-14 21:00:00,180.15,183.76,179.96,183.43,2453.455000,4.447119e+05,COMP,ETH
334,2021-12-14 22:00:00,183.31,186.74,182.60,185.24,2086.017000,3.865710e+05,COMP,ETH
335,2021-12-14 23:00:00,185.22,186.09,183.88,185.08,416.028000,7.698720e+04,COMP,ETH


If we want to keep track of which original DataFrame each row came from, we can add keys, which creates a multi-index:

In [25]:
res = pd.concat(dfs, keys=tokens)
res

,,ts,open,high,low,close,volume,volumeUSD,token,chain
BTC,0,2021-12-01 00:00:00,57321.41,57451.05,56814.34,56987.97,388.482022,2.218430e+07,BTC,BTC
BTC,1,2021-12-01 01:00:00,56998.35,57726.45,56705.06,57616.41,599.791578,3.437153e+07,BTC,BTC
BTC,2,2021-12-01 02:00:00,57618.55,57620.00,56972.97,57030.83,591.687200,3.387067e+07,BTC,BTC
BTC,3,2021-12-01 03:00:00,57029.79,57396.87,56841.01,57307.59,702.560364,4.007816e+07,BTC,BTC
BTC,4,2021-12-01 04:00:00,57306.55,57456.82,57026.11,57404.01,859.591535,4.920503e+07,BTC,BTC
...,...,...,...,...,...,...,...,...,...,...
COMP,332,2021-12-14 20:00:00,178.99,181.27,178.72,180.28,628.699000,1.130815e+05,COMP,ETH
COMP,333,2021-12-14 21:00:00,180.15,183.76,179.96,183.43,2453.455000,4.447119e+05,COMP,ETH
COMP,334,2021-12-14 22:00:00,183.31,186.74,182.60,185.24,2086.017000,3.865710e+05,COMP,ETH
COMP,335,2021-12-14 23:00:00,185.22,186.09,183.88,185.08,416.028000,7.698720e+04,COMP,ETH


this allows us to select the data from the source tables, e.g.:

In [None]:
res.loc['COMP']

Older pandas versions also offered `.append(*)` on DataFrames for this, but it was deprecated in pandas 1.4 and removed in 2.0, so we use `pd.concat` with a list of two DataFrames instead

In [26]:
dfs[0]

,ts,open,high,low,close,volume,volumeUSD,token,chain
0,2021-12-01 00:00:00,57321.41,57451.05,56814.34,56987.97,388.482022,2.218430e+07,BTC,BTC
1,2021-12-01 01:00:00,56998.35,57726.45,56705.06,57616.41,599.791578,3.437153e+07,BTC,BTC
2,2021-12-01 02:00:00,57618.55,57620.00,56972.97,57030.83,591.687200,3.387067e+07,BTC,BTC
3,2021-12-01 03:00:00,57029.79,57396.87,56841.01,57307.59,702.560364,4.007816e+07,BTC,BTC
4,2021-12-01 04:00:00,57306.55,57456.82,57026.11,57404.01,859.591535,4.920503e+07,BTC,BTC
...,...,...,...,...,...,...,...,...,...
332,2021-12-14 20:00:00,46658.23,47086.98,46649.03,46877.70,783.748050,3.677161e+07,BTC,BTC
333,2021-12-14 21:00:00,46875.95,47890.62,46856.37,47805.73,1634.610294,7.759534e+07,BTC,BTC
334,2021-12-14 22:00:00,47805.73,48686.91,47773.01,48303.57,1804.082095,8.714276e+07,BTC,BTC
335,2021-12-14 23:00:00,48303.56,48482.90,48008.28,48288.99,675.097958,3.259071e+07,BTC,BTC


In [None]:
pd.concat([dfs[0], dfs[1]])

Lastly, remember the importance of indices.  In the operations above we joined the DataFrames while keeping the indices of the original tables.  This means that we have repeated indices:

In [None]:
pd.concat([dfs[0], dfs[1]]).sort_index()

This sometimes isn't ideal, especially if we want to join against these indices later.  Instead, we can create a new index on the joined table using the `ignore_index` parameter, which gives us a sequential, non-repeated index:

In [None]:
pd.concat([dfs[0], dfs[1]], ignore_index=True).sort_index()
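
The index and column behaviour of `pd.concat` is easier to see on tiny inputs.  The sketch below uses two made-up frames (`a`, `b` and their values are assumptions for illustration), one of which has an extra column:

```python
import pandas as pd

a = pd.DataFrame({'close': [1.0, 2.0]})
b = pd.DataFrame({'close': [3.0], 'volume': [9.0]})  # has a column `a` lacks

stacked = pd.concat([a, b])                        # original indices kept: 0, 1, 0
renumbered = pd.concat([a, b], ignore_index=True)  # fresh index: 0, 1, 2
keyed = pd.concat([a, b], keys=['A', 'B'])         # outer index level records the source

print(stacked)
print(renumbered.index.tolist())
print(keyed.loc['B'])
```

The rows that came from `a` get `NaN` in `volume`, since `a` never had that column.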

### `pd.DataFrame.join`

`df.join` is a nice and easy method that allows us to join two dataframes by their index

In [30]:
dfs[0].set_index('ts')['close'].to_frame().join(dfs[1].set_index('ts')['close'].to_frame())

ValueError: columns overlap but no suffix specified: Index(['close'], dtype='object')

In [31]:
dfs[0].set_index('ts')['close'].rename(f'close_{tokens[0]}').to_frame().join( 
    dfs[1].set_index('ts')['close'].rename(f'close_{tokens[1]}').to_frame()
)

ts,close_BTC,close_ETH
2021-12-01 00:00:00,56987.97,4634.95
2021-12-01 01:00:00,57616.41,4729.10
2021-12-01 02:00:00,57030.83,4695.78
2021-12-01 03:00:00,57307.59,4754.09
2021-12-01 04:00:00,57404.01,4764.59
...,...,...
2021-12-14 20:00:00,46877.70,3771.94
2021-12-14 21:00:00,47805.73,3835.07
2021-12-14 22:00:00,48303.57,3852.80
2021-12-14 23:00:00,48288.99,3840.49


We can get a little more advanced by joining a **left** DataFrame on one of its regular columns (here `ts`, which is not its index) against a **right** DataFrame that is indexed by `ts`, using the `on` parameter, e.g.:

In [None]:
dfs[0][['ts', 'close']].join(
    dfs[1].set_index('ts')['close'].rename(f'close_{tokens[1]}').to_frame(),
    on='ts'
)
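
Two details of `join` are worth a quick sketch: it is a left join by default, and overlapping column names can be handled with `lsuffix`/`rsuffix` instead of renaming beforehand.  The frames below are made up for illustration (names and values are assumptions):

```python
import pandas as pd

ts = pd.to_datetime(['2021-12-01 00:00', '2021-12-01 01:00'])
left = pd.DataFrame({'ts': ts, 'close': [10.0, 11.0]})    # ts is a regular column
right = pd.DataFrame({'close_ETH': [1.0]}, index=ts[:1])  # ts is the index, one row only

# Left join: every row of `left` is kept, missing matches become NaN
joined = left.join(right, on='ts')

# Same column name on both sides: disambiguate with suffixes
overlapping = right.rename(columns={'close_ETH': 'close'})
suffixed = left.set_index('ts').join(overlapping, lsuffix='_BTC', rsuffix='_ETH')

print(joined)
print(suffixed)
```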

### `pd.merge`

`pd.merge` is pandas' way of doing SQL-like joins (e.g. left join, inner join, outer join etc).  There are a few quirks we'll see though.

In [None]:
pd.merge(
    dfs[0][['ts', 'close']].rename(columns={'close': f'close_{tokens[0]}'}),
    dfs[1][['ts', 'close']].rename(columns={'close': f'close_{tokens[1]}'}),
    on='ts',
    how='inner'
)

we can use other conditions for `how`, e.g. 'left', 'right', 'outer', and 'cross'

If the left and right DataFrames have columns with the same name, pandas will automatically resolve the conflict by adding `_x` and `_y` suffixes to the conflicting columns

In [None]:
pd.merge(
    dfs[0][['ts', 'close']],
    dfs[1][['ts', 'close']],
    on='ts',
    how='inner'
)

however, we can also define our own suffixes, e.g.

In [None]:
pd.merge(
    dfs[0][['ts', 'close']],
    dfs[1][['ts', 'close']],
    on='ts',
    how='inner',
    suffixes=[f'_{tokens[0]}', f'_{tokens[1]}']
)
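
To compare the `how` options side by side, here is a rough sketch on two made-up frames with partially overlapping keys (integer `ts` values are used purely to keep the example small; they are an assumption, not the notebook's timestamps).  The `indicator=True` option adds a `_merge` column showing where each row came from:

```python
import pandas as pd

left = pd.DataFrame({'ts': [1, 2, 3], 'close': [10.0, 11.0, 12.0]})
right = pd.DataFrame({'ts': [2, 3, 4], 'close': [1.0, 2.0, 3.0]})

# Inner join: only keys present on both sides
inner = pd.merge(left, right, on='ts', how='inner', suffixes=('_BTC', '_ETH'))

# Outer join: all keys from both sides, with the source of each row recorded
outer = pd.merge(
    left, right, on='ts', how='outer',
    suffixes=('_BTC', '_ETH'), indicator=True,
)

print(inner)
print(outer)
```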