# Split a DataFrame Using Pandas Groupby

For this tutorial, I will assume you have a basic understanding of Python and know how to load a DataFrame using the pandas library.
I will use the GL_Detail example file from the AICPA's AuditDataAnalytics GitHub.

In [1]:
import pandas as pd
import numpy as np

# Displays numbers with 2 decimals and thousands separators.
pd.options.display.float_format = '{:,.2f}'.format 

In [2]:
# Load Dataframe into memory.
file_location = "data/GL_Detail_YYYYMMDD_YYYYMMDD.csv"
df = pd.read_csv(file_location)
df.head()

Unnamed: 0,Journal_ID,Journal_ID_Line_Number,JE_Line_Description,Business_Unit_Code,Effective_Date,Fiscal_Year,GL_Account_Number,Amount,Amount_Credit_Debit_Indicator,Amount_Currency,JE_Header_ Description,Source,Entered_By,Document_Date,Entered_Date,Entered_Time,Period
0,100000000,1,Postkosten ohne Tel.,9900.0,19000101,2007,473000,9770.52,S,USD,,SA,STEINER,20070101,20070122,101205,1
1,100000000,2,,,19000101,2007,113100,9770.52,H,USD,,SA,STEINER,20070101,20070122,101205,1
2,100000001,1,Reisekst./Unterkunft,9900.0,19000101,2007,474210,5875.2,S,USD,,SA,STEINER,20070101,20070122,101206,1
3,100000001,2,,,19000101,2007,113100,5875.2,H,USD,,SA,STEINER,20070101,20070122,101206,1
4,100000002,1,,9900.0,19000101,2007,474211,244.8,S,USD,,SA,STEINER,20070101,20070122,101206,1


<p>This method splits the DataFrame into individual DataFrames by the "Entered_By" column. Groupby in pandas groups all rows that share the same value in the chosen column. This can replace a pivot table in Excel, and functions similarly to SQL's GROUP BY. <b>Note:</b> if you want to have multiple layers of grouping (e.g. "Entered_By", then "Business_Unit_Code"), you must pass a list of column names. This would look like df.groupby(["Entered_By", "Business_Unit_Code"]).</p>
<p>Note how Python allows for "unpacking" of elements, in this case into <i>split, file</i>. Iterating over df.groupby yields two values on each pass through the loop: the split value (my terminology), which is our "Entered_By" code, and the DataFrame holding the matching rows. We then use the split value as a key to add that DataFrame to our dict of files.</p>

In [3]:
files = {}
for split, file in df.groupby("Entered_By"):
    files[split] = file
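
The same unpacking also works inside a dict comprehension. Below is a minimal sketch on a made-up toy DataFrame (the `toy` data and the `toy_files` name are my own illustrations, not part of the GL_Detail file), assuming you only want the key-to-DataFrame mapping and nothing else from the loop.

```python
import pandas as pd

# Toy stand-in for the GL detail data; the values are made up for illustration.
toy = pd.DataFrame({
    "Entered_By": ["STEINER", "FISCHER", "STEINER", "ZECHA"],
    "Amount": [100.0, 250.0, 75.5, 10.0],
})

# Each iteration yields a (key, sub-DataFrame) pair, so a dict comprehension
# builds the same kind of mapping as the explicit for loop.
toy_files = {user: group for user, group in toy.groupby("Entered_By")}

print(sorted(toy_files))          # the distinct Entered_By values
print(len(toy_files["STEINER"]))  # number of rows entered by STEINER
```

Whether you prefer the loop or the comprehension is mostly a matter of taste; the loop is easier to extend if you want to do more work per group.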

In [4]:
files.keys()  # Shows the users who entered journal entries.

dict_keys(['BRAUNDI', 'D020281', 'D023132', 'D025016', 'D032884', 'D034394', 'D036495', 'D037397', 'D046954', 'D049461', 'DEVENTER', 'FISCHER', 'GENTNERB', 'I034454', 'I036867', 'I040584', 'I040990', 'I800109', 'I807424', 'MARCINKO', 'RAGHAVAN', 'STEINER', 'ZECHA'])
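
For the multi-level grouping mentioned earlier, the split value becomes a tuple with one entry per grouping column. Here is a small sketch on made-up data (the `toy` DataFrame and the `nested` dict are hypothetical names for illustration). One thing to be aware of: by default, groupby drops rows whose grouping value is missing, and the sample file does have blank "Business_Unit_Code" values. In pandas 1.1 and later, passing `dropna=False` should keep those rows as their own group.

```python
import pandas as pd

# Toy data for illustration only; not taken from the GL_Detail file.
toy = pd.DataFrame({
    "Entered_By": ["STEINER", "STEINER", "FISCHER", "STEINER"],
    "Business_Unit_Code": [9900.0, 9900.0, 9900.0, 1000.0],
    "Amount": [100.0, 50.0, 250.0, 10.0],
})

nested = {}
# Grouping by a list of columns yields a tuple key, which can be unpacked too.
for (user, unit), group in toy.groupby(["Entered_By", "Business_Unit_Code"]):
    nested[(user, unit)] = group

for key, group in nested.items():
    print(key, len(group))
```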

<p>This is a loop through all of the items in the Python dict "files". When looping through a dict, make sure to add ".items()" to the end if you want both the dict key and the value. In our case, the key is the filename, and the value is the split DataFrame.</p>
<p>Please note the "!mkdir" at the top of the cell. This creates a new directory for the split files to go in. The "!" prefix hands the command to the system shell, and the command as written is meant for Mac/Linux systems. On a Windows OS, you would need the following instead: <code>import os</code> followed by <code>os.mkdir("data/split")</code>. The os.mkdir call is plain Python, so it also works on Mac/Linux.</p>
<p>Every DataFrame has a family of "to_" methods for writing its contents to various formats. In this case, I'm using to_csv. Because the dict keys are strings, you can concatenate them with the "+" operator to build a unique file path for each DataFrame.</p>
<p>If you don't need to keep the DataFrames in memory, and only wish to split a file, you could perform this task in one loop:</p>

<pre>for filename, file in df.groupby("Entered_By"):
    file.to_csv("data/split/"+filename+".csv")</pre>

In [5]:
!mkdir data/split

for filename, file in files.items():
    file.to_csv("data/split/"+filename+".csv")
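
If you would rather avoid the shell command altogether, here is a minimal cross-platform sketch using pathlib. The toy data, the temporary base directory, and the names `out_dir`, `written`, and `reloaded` are my own assumptions for illustration; in the tutorial you would point `out_dir` at "data/split". I also pass `index=False`, which is optional: without it, to_csv writes the row index as an extra unnamed column, which then shows up as "Unnamed: 0" when the file is read back.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy data for illustration only; not taken from the GL_Detail file.
toy = pd.DataFrame({
    "Entered_By": ["STEINER", "FISCHER", "STEINER"],
    "Amount": [100.0, 250.0, 75.5],
})

# A fresh directory under the system temp location stands in for the project folder.
base_dir = Path(tempfile.mkdtemp())
out_dir = base_dir / "data" / "split"

# parents=True creates "data" as well; exist_ok=True makes re-running the cell safe.
out_dir.mkdir(parents=True, exist_ok=True)

for user, group in toy.groupby("Entered_By"):
    # index=False leaves the row index out of the CSV.
    group.to_csv(out_dir / f"{user}.csv", index=False)

written = sorted(p.name for p in out_dir.glob("*.csv"))
reloaded = pd.read_csv(out_dir / "STEINER.csv")
print(written)
print(reloaded)
```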