In [6]:
import pandas as pd
import os
import numpy as np

### Problem: Creating a DataFrame by combining multiple CSVs

There are two ways to do this:
1. Initialize one DataFrame and concatenate on each read
2. Read several small DataFrames and concatenate at the end
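
As a minimal sketch of the second approach applied to actual CSV files (the cells further down use random data instead), the following writes a few small files to a temporary directory and combines them with a single `pd.concat` call. The file names, the `read_all_csvs` helper, and the sizes are assumptions made for illustration; they are not part of the Effective-Pandas source.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd


def read_all_csvs(directory):
    """Read every *.csv file in `directory` and concatenate them once."""
    paths = sorted(Path(directory).glob("*.csv"))
    frames = [pd.read_csv(path) for path in paths]
    return pd.concat(frames, ignore_index=True)


# Hypothetical demo data: three small CSV files with the same columns.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(3):
        part = pd.DataFrame(np.random.randn(5, 4), columns=list("abcd"))
        part.to_csv(Path(tmp) / f"part_{i}.csv", index=False)

    # All files are read first, then combined in one concat call.
    combined = read_all_csvs(tmp)

print(combined.shape)  # (15, 4)
```

Collecting the frames in a list and concatenating once means the data is copied a single time, instead of once per file.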


Source: https://github.com/TomAugspurger/effective-pandas/blob/master/modern_4_performance.ipynb

**This entire example and its source code were taken from the Effective-Pandas project.**



In [7]:
size_per = 5000
N = 100
cols = list('abcd')

In [11]:
%%timeit

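# NOTE: this cell only defines append_df and never calls it, so %%timeit
# measures the cost of the definition, not of building the DataFrame.
def append_df():
    '''
    The pythonic (bad) way
    '''
    df = pd.DataFrame(columns=cols)
    for _ in range(N):
        chunk = pd.DataFrame(np.random.randn(size_per, 4), columns=cols)
        # Reassign the result: concatenation returns a new DataFrame and
        # does not modify df in place. pd.concat is used here because
        # DataFrame.append was removed in pandas 2.0.
        df = pd.concat([df, chunk])
    return df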

69.6 ns ± 0.389 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


In [12]:
%%timeit

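# NOTE: this cell only defines concat_df and never calls it, so %%timeit
# measures the cost of the definition, not of building the DataFrame.
def concat_df():
    '''
    The pandorable (good) way
    '''
    # Build all the small DataFrames first, then concatenate them once.
    dfs = [pd.DataFrame(np.random.randn(size_per, 4), columns=cols)
           for _ in range(N)]
    return pd.concat(dfs, ignore_index=True)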

69.1 ns ± 2.01 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


### Conclusion

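In my tests I found no significant difference between the two methods. Note, however, that both `%%timeit` cells only define their function and never call it, so the roughly 69 ns reported in each case is the cost of a function definition. These measurements therefore do not show whether the two methods actually differ.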
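
For a timing to reflect the concatenation work, each function has to be called inside the timed statement. A minimal sketch using the standard-library `timeit` module follows. The helper names `make_chunk`, `grow_in_loop`, and `concat_once`, as well as the smaller `size_per` and `N` values, are assumptions chosen to keep the run short. No timings are quoted here because they depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

size_per = 500   # rows per chunk (assumed; smaller than in the cells above)
N = 20           # number of chunks (assumed)
cols = list("abcd")


def make_chunk():
    """Return one small DataFrame of random data."""
    return pd.DataFrame(np.random.randn(size_per, 4), columns=cols)


def grow_in_loop():
    """Concatenate onto a growing DataFrame on every iteration."""
    df = make_chunk()
    for _ in range(N - 1):
        df = pd.concat([df, make_chunk()], ignore_index=True)
    return df


def concat_once():
    """Collect all chunks in a list and concatenate them a single time."""
    dfs = [make_chunk() for _ in range(N)]
    return pd.concat(dfs, ignore_index=True)


# Passing the callables to timeit means the functions are actually executed.
runs = 5
t_loop = timeit.timeit(grow_in_loop, number=runs)
t_once = timeit.timeit(concat_once, number=runs)
print(f"grow in loop: {t_loop / runs:.4f} s per call")
print(f"concat once:  {t_once / runs:.4f} s per call")
```

In a notebook, the equivalent is to define the functions in one cell and run `%timeit grow_in_loop()` and `%timeit concat_once()` in separate cells.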
