# Introduction
Sometimes you want to read in data that contains different kinds of information per line. This could be a log file where some lines contain a timestamp and the log messages itself, but other lines could contain specific information about errors or stack traces. 

The first thing that we need to accomplish is to read in the file into Pandas. There is a nice little trick that you can use to read almost every kind of structured data into a Pandas DataFrame. By using a separator that doesn't occur in the dataset, we can read in the content of a file line by line. Once we have the data into a DataFrame, we can use Pandas functionality to reshape the data as we need it.

# Example data set

For this example, I choose a Git log output from the Sprint PetClinic project.  Such a dataset look like the following:
<hr/>
```
commit 75912a06c5613a2ea1305ad4d8ad6bc4be7765ce
Author: Stephane Nicoll <snicoll@pivotal.io>
Date:   Fri Feb 17 12:30:57 2017 +0100

    Polish contribution
    
    Closes gh-229

commit 443d35eae23c874ed38305fbe75216339c41beaf
Author: Henri Tremblay <henri.tremblay@gmail.com>
Date:   Thu Feb 16 15:08:30 2017 -0500

    Put Ehcache back

```
<hr/>

This is a pretty challenging dataset: Albeit we have some lines that could be used as markers (like "`commit:`", "`Author`" and "`Date:`"), we also have multiple lines without a marker text. Luckily, there is some convention here: The first line in this particular dataset is the so called "subject" or "title line", the other lines is the "full commit message". This makes it possible to differentiate between these two kind of information as well. So the data structure for one complete Git log entry looks like this (from [doc](https://git-scm.com/docs/git-log#_pretty_formats)):

<hr/>

```
commit <sha1>
Author: <author>
Date:   <author date>
[empty]
<title line>
[empty]
<full commit message>
[empty]
```
<hr/>

It may seem impossible to read in that data in a Pandas DataFrame, but we'll see that it's relative easy to do once we know some basic techniques for treating these kind of semi-structured data.

<div class="alert alert-block alert-info">
Granted, this particular case is some kind of artificial because we can output the information in a Git repository in one line using the `--pretty=format:` option. But I choose that one as example because it's easy to understand. Overall, there could be other cases where the following technique could be get very handy.
</div>

## Reading in the file
Let's start by reading in the given dataset. Especially take a look at the choosen `sep` parameter: After hours of <strike>trial and error</strike> research, I've found the Unicode character "DEVICE CONTROL TWO" very suitable as separator. As of today, I hadn't come across any datasets where this character is included. By using it's Unicode character code `u0012`, we can read in the dataset line by line. 

The parameter `names` takes a array of column names that present the headers for the DataFrame. This is necessary for our specific dataset because it doesn't have any specific header at the top of the file. 

We also make use of the `skipinitialspace` parameter. This will not delete the whitespace in front of the first characters. In our case, this is very useful because the commit's subject and message lines are prefixed with four whitespace characters. By setting `skipinitialspace=False`, we can ensure that we don't interpret this data wrongly as some other type of data like the commit or author line.

There is one keyword that could be useful for this task, too, but isn't used here. The `skip_blank_lines` parameter will not jump over lines in the given dataset, that are empty. Again, the use of this parameters depends highly on your kind if data.

In [40]:
import pandas as pd
git_log = pd.read_csv(
    'datasets/git_log_sample.txt',
    sep='\u0012',
    names=['raw'],
    skipinitialspace=False)
git_log.head()

Unnamed: 0,raw
0,commit 101c9dc69064633f697d93dcf0918bb4f74ff7ed
1,Author: Dave Syer <dsyer@pivotal.io>
2,Date: Fri Jun 30 11:07:07 2017 +0100
3,Update Spring Boot and Thymeleaf versions
4,commit ffa967c94b65a70ea6d3b44275632821838d9fd3


## Marking data that belongs together 

Next, we need a marking that shows which entries belong together (speaking in Data Science terms: are just variables for one observation). This could be a continuous number that marks all entries that should later be grouped together or an entry that is suitable as a index key. We'll take a look at both approaches right now.

### The index column approach

An easy way to achieve a marking for entries that belong together is to use the default index column. When we read in a dataset, Pandas usually creates an index column to enumerate all entries that were read in. So we have a columns with consecutive numbers. The idea is to keep just those index entries that mark the beginning of a new group of entries that belong together.

For this, we first we reset the DataFrame's existing index column to get our marking column named `index`.

In [38]:
marked_git_log = git_log.reset_index()
marked_git_log.head()

Unnamed: 0,index,raw
0,0,commit 101c9dc69064633f697d93dcf0918bb4f74ff7ed
1,1,Author: Dave Syer <dsyer@pivotal.io>
2,2,Date: Fri Jun 30 11:07:07 2017 +0100
3,3,Update Spring Boot and Thymeleaf versions
4,4,commit ffa967c94b65a70ea6d3b44275632821838d9fd3


Next, we choose a suitable text in the dataset that marks the beginning of a new group of consecutive data. In our case, these are entries that start with the text "`commit `". We set all other entries to `None`. With this

In [41]:
git_log_with_marker = git_log_with_marker.loc[~git_log_with_marker['raw'].str.startswith('commit '), 'index'] = None
git_log_with_marker.head()

AttributeError: 'NoneType' object has no attribute 'loc'

If we propagate the data from the former `index` column with a `ffill()` (which forward fills missing data), we get a nice column that marks entries that should belong together.

In [42]:
git_log_with_marker['index'] = git_log_with_marker['index'].ffill()
git_log_with_marker.head(10)

TypeError: 'NoneType' object is not subscriptable

### The index key column approach
Alternatively (and used in this example), we can use also an entry that is suitable as index later on. Again, we use the entries with the commit information. Analog, we create a new column with this information.

In [220]:
git_log.loc[git_log.raw.str.startswith("commit"), 'commit_id'] = git_log['raw']
git_log.head()

Unnamed: 0,raw,commit_id
0,commit 101c9dc69064633f697d93dcf0918bb4f74ff7ed,commit 101c9dc69064633f697d93dcf0918bb4f74ff7ed
1,Author: Dave Syer <dsyer@pivotal.io>,
2,Date: Fri Jun 30 11:07:07 2017 +0100,
3,Update Spring Boot and Thymeleaf versions,
4,commit ffa967c94b65a70ea6d3b44275632821838d9fd3,commit ffa967c94b65a70ea6d3b44275632821838d9fd3


Next, we fill all the other columns with the information about the commit. We clean up the commit columns at the same time.

In [221]:
git_log['commit_id'] = git_log['commit_id'].ffill()
git_log['commit_id'] = git_log['commit_id'].str.replace("commit ", "")
git_log.head()

Unnamed: 0,raw,commit_id
0,commit 101c9dc69064633f697d93dcf0918bb4f74ff7ed,101c9dc69064633f697d93dcf0918bb4f74ff7ed
1,Author: Dave Syer <dsyer@pivotal.io>,101c9dc69064633f697d93dcf0918bb4f74ff7ed
2,Date: Fri Jun 30 11:07:07 2017 +0100,101c9dc69064633f697d93dcf0918bb4f74ff7ed
3,Update Spring Boot and Thymeleaf versions,101c9dc69064633f697d93dcf0918bb4f74ff7ed
4,commit ffa967c94b65a70ea6d3b44275632821838d9fd3,ffa967c94b65a70ea6d3b44275632821838d9fd3


## Dissceting the data

We mark each entry with its meaning. We can achieve this by using the starting characters of each `raw` entry. For the commit messages, we don't habe any information that could be used as marker. So we just fill in the missing information in the `type` entries with a `"message"` text in the last step.

Side note: When to use `'` and when to use `"` ? When working directly with Pandas, the difference doesn't really matter. I use `'` when I'm referencing keys or parameters, but `" ` when I'm using text information. I'm trying to use it consinstent, though.

In [186]:
git_log.loc[git_log.raw.str.startswith("commit "), 'type'] = "commit"
git_log.loc[git_log.raw.str.startswith("Author: "), 'type'] = "author"
git_log.loc[git_log.raw.str.startswith("Date: "), 'type'] = "date"
git_log['type'] = git_log['type'].fillna("message")
git_log.head()

Unnamed: 0,raw,commit_id,type
0,commit 101c9dc69064633f697d93dcf0918bb4f74ff7ed,101c9dc69064633f697d93dcf0918bb4f74ff7ed,commit
1,Author: Dave Syer <dsyer@pivotal.io>,101c9dc69064633f697d93dcf0918bb4f74ff7ed,author
2,Date: Fri Jun 30 11:07:07 2017 +0100,101c9dc69064633f697d93dcf0918bb4f74ff7ed,date
3,Update Spring Boot and Thymeleaf versions,101c9dc69064633f697d93dcf0918bb4f74ff7ed,message
4,commit ffa967c94b65a70ea6d3b44275632821838d9fd3,ffa967c94b65a70ea6d3b44275632821838d9fd3,commit


With all lines marked by their types, we can 

In [179]:
git_log = git_log[git_log['type'] != 'commit']

git_log.head()

Unnamed: 0,raw,commit_id,type
1,Author: Dave Syer <dsyer@pivotal.io>,101c9dc69064633f697d93dcf0918bb4f74ff7ed,author
2,Date: Fri Jun 30 11:07:07 2017 +0100,101c9dc69064633f697d93dcf0918bb4f74ff7ed,date
3,Update Spring Boot and Thymeleaf versions,101c9dc69064633f697d93dcf0918bb4f74ff7ed,message
5,Author: Antoine Rey <antoine.rey@gmail.com>,ffa967c94b65a70ea6d3b44275632821838d9fd3,author
6,Date: Wed Apr 12 21:41:00 2017 +0200,ffa967c94b65a70ea6d3b44275632821838d9fd3,date


In [172]:
git_log_data = git_log.pivot_table(index='commit', columns='type', values='raw' , aggfunc='first')
git_log_data.head()

type,author,date,message
commit,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
commit 024811d252f8d8218e6795d46203cff25971bc19,Mic <misvy@vmware.com>,Date: Thu Mar 14 18:04:36 2013 +0800,simplifying access to Integer
commit 0504ec9fe345d9d34b15c374333f709fb147e6d6,thinksh <thinkshihang@gmail.com>,Date: Wed Feb 3 23:19:46 2016 -0500,Update petclinic_db_setup_mysql.txt
commit 053c84ecc95b246ef4a40fb3d4304e8908604af4,Mic <misvy@vmware.com>,Date: Mon Feb 3 09:31:44 2014 +0800,migrated to Spring 4.0.1
commit 057015c14cce4791ff309419de8a8bd6339fd6e7,Mic <misvy@vmware.com>,Date: Fri Feb 15 15:31:04 2013 +0800,Spring MVC Test Framework and migration to...
commit 05c1110dceeaef0626137a2f7a509add6617765b,Mic <misvy@vmware.com>,Date: Tue Jan 15 09:29:01 2013 +0800,fixed content negotiation configuration


In [173]:
git_log_data[['author', 'email']] = git_log_data['author'].str.extract(
    "(.*) <(.*)>", expand=True)
git_log_data.head()

type,author,date,message,email
commit,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
commit 024811d252f8d8218e6795d46203cff25971bc19,Mic,Date: Thu Mar 14 18:04:36 2013 +0800,simplifying access to Integer,misvy@vmware.com
commit 0504ec9fe345d9d34b15c374333f709fb147e6d6,thinksh,Date: Wed Feb 3 23:19:46 2016 -0500,Update petclinic_db_setup_mysql.txt,thinkshihang@gmail.com
commit 053c84ecc95b246ef4a40fb3d4304e8908604af4,Mic,Date: Mon Feb 3 09:31:44 2014 +0800,migrated to Spring 4.0.1,misvy@vmware.com
commit 057015c14cce4791ff309419de8a8bd6339fd6e7,Mic,Date: Fri Feb 15 15:31:04 2013 +0800,Spring MVC Test Framework and migration to...,misvy@vmware.com
commit 05c1110dceeaef0626137a2f7a509add6617765b,Mic,Date: Tue Jan 15 09:29:01 2013 +0800,fixed content negotiation configuration,misvy@vmware.com


In [174]:
git_log_data['date'] =  git_log_data['date'].str.replace("Date: ", "")
git_log_data['date'].head()

commit
commit 024811d252f8d8218e6795d46203cff25971bc19      Thu Mar 14 18:04:36 2013 +0800
commit 0504ec9fe345d9d34b15c374333f709fb147e6d6       Wed Feb 3 23:19:46 2016 -0500
commit 053c84ecc95b246ef4a40fb3d4304e8908604af4       Mon Feb 3 09:31:44 2014 +0800
commit 057015c14cce4791ff309419de8a8bd6339fd6e7      Fri Feb 15 15:31:04 2013 +0800
commit 05c1110dceeaef0626137a2f7a509add6617765b      Tue Jan 15 09:29:01 2013 +0800
Name: date, dtype: object

In [169]:
git_log_data['date'] = pd.to_datetime(git_log_data['date'])
git_log_data['date'].head()

commit
commit 024811d252f8d8218e6795d46203cff25971bc19   2013-03-14 10:04:36
commit 0504ec9fe345d9d34b15c374333f709fb147e6d6   2016-02-04 04:19:46
commit 053c84ecc95b246ef4a40fb3d4304e8908604af4   2014-02-03 01:31:44
commit 057015c14cce4791ff309419de8a8bd6339fd6e7   2013-02-15 07:31:04
commit 05c1110dceeaef0626137a2f7a509add6617765b   2013-01-15 01:29:01
Name: date, dtype: datetime64[ns]

In [117]:
git_log.type.unstack

<bound method Series.unstack of 0        commit
1        author
2          date
3       message
4        commit
5        author
6          date
7       message
8        commit
9        author
10         date
11      message
12       commit
13      message
14       author
15         date
16      message
17      message
18      message
19      message
20       commit
21       author
22         date
23      message
24      message
25       commit
26       author
27         date
28      message
29       commit
         ...   
2520       date
2521    message
2522     commit
2523     author
2524       date
2525    message
2526     commit
2527     author
2528       date
2529    message
2530     commit
2531     author
2532       date
2533    message
2534     commit
2535     author
2536       date
2537    message
2538     commit
2539     author
2540       date
2541    message
2542     commit
2543     author
2544       date
2545    message
2546     commit
2547     author
2548       date
2549    

In [86]:
git_log['type'], git_log['value'] = "commit" ,2 #.ix[git_log['raw'].str.startswith('commit '), 'commit'] = 1
git_log

Unnamed: 0,raw,commit,type,value
0,commit 101c9dc69064633f697d93dcf0918bb4f74ff7ed,1,commit,2
1,Author: Dave Syer <dsyer@pivotal.io>,,commit,2
2,Date: Fri Jun 30 11:07:07 2017 +0100,,commit,2
3,Update Spring Boot and Thymeleaf versions,,commit,2
4,commit ffa967c94b65a70ea6d3b44275632821838d9fd3,1,commit,2
5,Author: Antoine Rey <antoine.rey@gmail.com>,,commit,2
6,Date: Wed Apr 12 21:41:00 2017 +0200,,commit,2
7,spring-petclinic-angular1 repo renamed to ...,,commit,2
8,commit fd1c742d4f8d193eb935519909c15302b783cd52,1,commit,2
9,Author: Antoine Rey <antoine.rey@gmail.com>,,commit,2


In [68]:
git_log[['type', 'commit']] = git_log['raw'].str.split(' ', expand=True)
git_log.head()

ValueError: Columns must be same length as key

In [42]:
import pandas as pds
pd.read_csv(
    "datasets/mixed_separators.txt",
    sep="\t",
    names=['timedata', 'author'])

Unnamed: 0,timedata,author
0,1514531161 -0800,Linus Torvalds
1,1514489303 -0500,David S. Miller
2,1514487644 -0800,Tom Herbert
3,1514487643 -0800,Tom Herbert
4,1514482693 -0500,Willem de Bruijn
