# Split column by example
Copyright (c) Microsoft Corporation. All rights reserved.<br>
Licensed under the MIT License.

DataPrep also offers an easy way to split one column into multiple columns.
The SplitColumnByExampleBuilder class lets you generate a split program that works even in non-trivial cases, like the example below.

In [1]:
import azureml.dataprep as dprep

In [2]:
dflow = dprep.read_lines(path='https://dpreptestfiles.blob.core.windows.net/testfiles/sample.log')
df = dflow.head(10)

In [3]:
df['Line'].iloc[0]

'2012-02-03 18:35:34 SampleClass6 [INFO] everything normal for id 577725851'

As you can see above, you can't split this particular log file on the space character: that would create too many columns and, even worse, the number of columns would depend on the string in the 6th column.
That's where split_column_by_example can be useful.
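
To make the problem concrete, here is a minimal plain-Python sketch (no DataPrep involved) that splits two lines of the sample log on a single space. The second line is the one retrieved later in this notebook. The variable names are only illustrative.

```python
# Two lines from the sample log; the second one is retrieved later in the notebook.
lines = [
    '2012-02-03 18:35:34 SampleClass6 [INFO] everything normal for id 577725851',
    '2012-02-03 18:35:34 SampleClass0 [ERROR] incorrect id  1886438513',
]

# Naive split on a single space character.
naive = [line.split(' ') for line in lines]

# Number of resulting "columns" per line.
column_counts = [len(tokens) for tokens in naive]

for tokens in naive:
    print(len(tokens), tokens)
```

The two lines produce a different number of tokens, and the date, time, and message words all end up in separate columns, which is rarely what you want from a log line.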

In [4]:
b = dflow.builders.split_column_by_example('Line', keep_delimiters=True)

In [5]:
b.preview()

Unnamed: 0,Line,Line_1,Line_2,Line_3,Line_4,Line_5,Line_6,Line_7,Line_8,Line_9,Line_10,Line_11,Line_12
0,2012-02-03 18:35:34 SampleClass6 [INFO] everyt...,2012-02-03,,18:35:34,,SampleClass,6.0,[,INFO,],everything normal for id,,577725851.0
1,2012-02-03 18:35:34 SampleClass4 [FATAL] syste...,2012-02-03,,18:35:34,,SampleClass,4.0,[,FATAL,],system problem at id,,1991281254.0
2,2012-02-03 18:35:34 SampleClass3 [DEBUG] detai...,2012-02-03,,18:35:34,,SampleClass,3.0,[,DEBUG,],detail for id,,1304807656.0
3,2012-02-03 18:35:34 SampleClass3 [WARN] missin...,2012-02-03,,18:35:34,,SampleClass,3.0,[,WARN,],missing id,,423340895.0
4,2012-02-03 18:35:34 SampleClass5 [TRACE] verbo...,2012-02-03,,18:35:34,,SampleClass,5.0,[,TRACE,],verbose detail for id,,2082654978.0
5,2012-02-03 18:35:34 SampleClass0 [ERROR] incor...,,,,,,,,,,,,
6,2012-02-03 18:35:34 SampleClass9 [TRACE] verbo...,2012-02-03,,18:35:34,,SampleClass,9.0,[,TRACE,],verbose detail for id,,438634209.0
7,2012-02-03 18:35:34 SampleClass8 [DEBUG] detai...,2012-02-03,,18:35:34,,SampleClass,8.0,[,DEBUG,],detail for id,,2074121310.0
8,2012-02-03 18:55:54 SampleClass4 [DEBUG] detai...,2012-02-03,,18:55:54,,SampleClass,4.0,[,DEBUG,],detail for id,,1029178762.0
9,2012-02-03 18:55:54 SampleClass2 [TRACE] verbo...,2012-02-03,,18:55:54,,SampleClass,2.0,[,TRACE,],verbose detail for id,,1135460272.0


A couple of things to note here. No examples were given, and yet DataPrep was able to generate a reasonable split program.
We passed keep_delimiters=True so that we can see all the data split into columns. In practice, though, delimiters are rarely useful, so let's exclude them.
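
As a rough analogy only (this is not how DataPrep implements keep_delimiters), Python's re.split shows the same distinction: with a capturing group the delimiters are returned as separate items, and without one they are dropped. The sample text and pattern below are assumptions chosen for illustration.

```python
import re

text = 'SampleClass6 [INFO] everything'

# A capturing group makes re.split return the delimiters as separate items.
with_delims = re.split(r'( \[|\] )', text)

# Without the capturing group the delimiters are dropped.
without_delims = re.split(r' \[|\] ', text)

print(with_delims)
print(without_delims)
```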

In [6]:
b.keep_delimiters = False
b.preview()

Unnamed: 0,Line,Line_1,Line_2,Line_3,Line_4,Line_5,Line_6,Line_7
0,2012-02-03 18:35:34 SampleClass6 [INFO] everyt...,2012-02-03,18:35:34,SampleClass,6.0,INFO,everything normal for id,577725851.0
1,2012-02-03 18:35:34 SampleClass4 [FATAL] syste...,2012-02-03,18:35:34,SampleClass,4.0,FATAL,system problem at id,1991281254.0
2,2012-02-03 18:35:34 SampleClass3 [DEBUG] detai...,2012-02-03,18:35:34,SampleClass,3.0,DEBUG,detail for id,1304807656.0
3,2012-02-03 18:35:34 SampleClass3 [WARN] missin...,2012-02-03,18:35:34,SampleClass,3.0,WARN,missing id,423340895.0
4,2012-02-03 18:35:34 SampleClass5 [TRACE] verbo...,2012-02-03,18:35:34,SampleClass,5.0,TRACE,verbose detail for id,2082654978.0
5,2012-02-03 18:35:34 SampleClass0 [ERROR] incor...,,,,,,,
6,2012-02-03 18:35:34 SampleClass9 [TRACE] verbo...,2012-02-03,18:35:34,SampleClass,9.0,TRACE,verbose detail for id,438634209.0
7,2012-02-03 18:35:34 SampleClass8 [DEBUG] detai...,2012-02-03,18:35:34,SampleClass,8.0,DEBUG,detail for id,2074121310.0
8,2012-02-03 18:55:54 SampleClass4 [DEBUG] detai...,2012-02-03,18:55:54,SampleClass,4.0,DEBUG,detail for id,1029178762.0
9,2012-02-03 18:55:54 SampleClass2 [TRACE] verbo...,2012-02-03,18:55:54,SampleClass,2.0,TRACE,verbose detail for id,1135460272.0


This already looks pretty good, except for row 5, where all the new columns are empty.
If we request suggested examples, we will see that row 5 is one of the items the program needs more input on.

In [7]:
suggestions = b.generate_suggested_examples()
suggestions

Unnamed: 0,Line
0,2012-02-03 18:35:34 SampleClass6 [INFO] everyt...
1,2012-02-03 18:35:34 SampleClass0 [ERROR] incor...
2,
3,java.lang.Exception: 2012-02-03 19:11:02 Sampl...
4,\tat com.osa.mocklogger.MockLogger$2.run(MockL...


In [8]:
suggestions.iloc[1]['Line']

'2012-02-03 18:35:34 SampleClass0 [ERROR] incorrect id  1886438513'
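
One visible difference between this value and the first line of the log is the doubled space before the id. Whether that is the reason the generated program skipped the row is an assumption; the notebook does not say. A small plain-Python check (the helper name whitespace_runs is hypothetical) makes such irregularities easy to spot:

```python
import re

ok_line = '2012-02-03 18:35:34 SampleClass6 [INFO] everything normal for id 577725851'
odd_line = '2012-02-03 18:35:34 SampleClass0 [ERROR] incorrect id  1886438513'

def whitespace_runs(line):
    """Return every run of two or more whitespace characters in the line."""
    return re.findall(r'\s{2,}', line)

print(repr(odd_line))             # repr makes the doubled space visible
print(whitespace_runs(ok_line))   # no runs in the regular line
print(whitespace_runs(odd_line))  # one run of two spaces
```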

Having retrieved the source value, we can now provide an example of the desired split.
Notice that we chose not to split the date and time, but to keep them together in one column.
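
For comparison, here is a hand-written regular expression that produces the same five-column split in plain Python. It is only a sketch of the target shape: the pattern and the helper split_line are hypothetical, and the program DataPrep generates from the example is not shown in this notebook and may work differently.

```python
import re

# Hypothetical hand-written pattern; not the program DataPrep generates.
LOG_PATTERN = re.compile(
    r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})'  # date and time kept together
    r'\s+(\S+)'                                # class name, e.g. SampleClass0
    r'\s+\[(\w+)\]'                            # log level without the brackets
    r'\s+(.*?)'                                # message text (lazy match)
    r'\s+(\d+)$'                               # trailing numeric id
)

def split_line(line):
    """Return the five fields as a list, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return list(match.groups()) if match else None

source = '2012-02-03 18:35:34 SampleClass0 [ERROR] incorrect id  1886438513'
print(split_line(source))
```

Writing and maintaining such a pattern by hand is exactly the work that providing an example to the builder is meant to replace.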

In [9]:
b.add_example(example=(suggestions['Line'].iloc[1], ['2012-02-03 18:35:34','SampleClass0','ERROR','incorrect id','1886438513']))

In [10]:
b.preview()

Unnamed: 0,Line,Line_1,Line_2,Line_3,Line_4,Line_5
0,2012-02-03 18:35:34 SampleClass6 [INFO] everyt...,2012-02-03 18:35:34,SampleClass6,INFO,everything normal for id,577725851
1,2012-02-03 18:35:34 SampleClass4 [FATAL] syste...,2012-02-03 18:35:34,SampleClass4,FATAL,system problem at id,1991281254
2,2012-02-03 18:35:34 SampleClass3 [DEBUG] detai...,2012-02-03 18:35:34,SampleClass3,DEBUG,detail for id,1304807656
3,2012-02-03 18:35:34 SampleClass3 [WARN] missin...,2012-02-03 18:35:34,SampleClass3,WARN,missing id,423340895
4,2012-02-03 18:35:34 SampleClass5 [TRACE] verbo...,2012-02-03 18:35:34,SampleClass5,TRACE,verbose detail for id,2082654978
5,2012-02-03 18:35:34 SampleClass0 [ERROR] incor...,2012-02-03 18:35:34,SampleClass0,ERROR,incorrect id,1886438513
6,2012-02-03 18:35:34 SampleClass9 [TRACE] verbo...,2012-02-03 18:35:34,SampleClass9,TRACE,verbose detail for id,438634209
7,2012-02-03 18:35:34 SampleClass8 [DEBUG] detai...,2012-02-03 18:35:34,SampleClass8,DEBUG,detail for id,2074121310
8,2012-02-03 18:55:54 SampleClass4 [DEBUG] detai...,2012-02-03 18:55:54,SampleClass4,DEBUG,detail for id,1029178762
9,2012-02-03 18:55:54 SampleClass2 [TRACE] verbo...,2012-02-03 18:55:54,SampleClass2,TRACE,verbose detail for id,1135460272


This looks just like what we need, so let's get a dataflow with the split in it and drop the original column.

In [11]:
dflow = b.to_dataflow()
dflow = dflow.drop_columns(['Line'])
dflow.head(10)

Unnamed: 0,Line_1,Line_2,Line_3,Line_4,Line_5
0,2012-02-03 18:35:34,SampleClass6,INFO,everything normal for id,577725851
1,2012-02-03 18:35:34,SampleClass4,FATAL,system problem at id,1991281254
2,2012-02-03 18:35:34,SampleClass3,DEBUG,detail for id,1304807656
3,2012-02-03 18:35:34,SampleClass3,WARN,missing id,423340895
4,2012-02-03 18:35:34,SampleClass5,TRACE,verbose detail for id,2082654978
5,2012-02-03 18:35:34,SampleClass0,ERROR,incorrect id,1886438513
6,2012-02-03 18:35:34,SampleClass9,TRACE,verbose detail for id,438634209
7,2012-02-03 18:35:34,SampleClass8,DEBUG,detail for id,2074121310
8,2012-02-03 18:55:54,SampleClass4,DEBUG,detail for id,1029178762
9,2012-02-03 18:55:54,SampleClass2,TRACE,verbose detail for id,1135460272
