# Data science is OSEMN

According to a popular model, the elements of data science are

* Obtaining data
* Scrubbing data
* Exploring data
* Modeling data
* iNterpreting data

and hence the acronym OSEMN, pronounced as “Awesome”.

We will start with the **O**, moving towards the rest later, but first let's have a quick look at what it all boils down to:

In [19]:
import numpy as np

data = np.loadtxt('populations.txt')
year, hares, lynxes, carrots = data.T # trick: columns to variables


from matplotlib import pyplot as plt
%matplotlib inline

plt.axes([0.2, 0.1, 0.5, 0.8]) 
plt.plot(year, hares, year, lynxes, year, carrots) 
plt.legend(('Hare', 'Lynx', 'Carrot'), loc=(1.05, 0.5)) 

FileNotFoundError: [Errno 2] No such file or directory: 'populations.txt'

By plotting the data, a clear (and reasonable) correlation between prey and predator becomes evident. How can it be quantified? Is it statistically significant? What about the correlation between carrots and hares? Is that evident? Is it significant?

Finding correlations in data is the main goal of data science, though that is not the end of the story: as this precious [site](http://tylervigen.com/spurious-correlations) demonstrates, **correlation is not causation**.


*Exercise*: write an algorithm that determines and quantifies the correlation between two time series. Use the hare-lynx-carrot dataset as an example.

# Obtaining and processing (remote) data

Accessing data is a really serious business. Data can sit on public servers or on private remote machines. In the former case things may be straightforward, whereas in the latter you need to worry about a few things.

In both cases, depending on the size of the dataset, managing it can become extremely complicated. We won't deal here with large datasets, which would require a whole course of their own, but care should still be taken. In particular, it is not wise to keep (and even worse commit) data in a git repository!

The suggestion is then to create a directory somewhere and copy the example datasets there. From a terminal:

```bash

# create a data directory in your home directory
mkdir ~/data/

# check the content (it's empty now of course)
ls -ltr ~/data/

# in the case you need to move there:
cd ~/data/
```

### Download data from a server

A nice set of interesting datasets can be found on this [server](https://archive.ics.uci.edu/ml/datasets.html?sort=nameUp&view=list), which collects training/test data for machine learning developments. Several of them pertain to the physical sciences; it is worth browsing through them.

You can download any of them; in the following we will consider a dataset from the MAGIC experiment. For that we will use the `wget` command:

In [3]:
# get the dataset and its description on the proper data directory
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/magic/magic04.data -P ~/data/
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/magic/magic04.names -P ~/data/    

--2018-11-12 15:11:53--  https://archive.ics.uci.edu/ml/machine-learning-databases/magic/magic04.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.249
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.249|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1477391 (1.4M) [text/plain]
Saving to: ‘/home/kommurik/data/magic04.data’


2018-11-12 15:11:59 (276 KB/s) - ‘/home/kommurik/data/magic04.data’ saved [1477391/1477391]

--2018-11-12 15:11:59--  https://archive.ics.uci.edu/ml/machine-learning-databases/magic/magic04.names
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.249
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.249|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5400 (5.3K) [text/plain]
Saving to: ‘/home/kommurik/data/magic04.names’


2018-11-12 15:12:00 (50.5 MB/s) - ‘/home/kommurik/data/magic04.names’ saved [5400/5400]



In [4]:
# print the description; this can be done (and is better done) from a terminal
!cat ~/data/magic04.names

1. Title of Database: MAGIC gamma telescope data 2004

2. Sources:

   (a) Original owner of the database:

       R. K. Bock
       Major Atmospheric Gamma Imaging Cherenkov Telescope project (MAGIC)
       http://wwwmagic.mppmu.mpg.de
       rkb@mail.cern.ch

   (b) Donor:

       P. Savicky
       Institute of Computer Science, AS of CR
       Czech Republic
       savicky@cs.cas.cz

   (c) Date received: May 2007

3. Past Usage:

   (a) Bock, R.K., Chilingarian, A., Gaug, M., Hakl, F., Hengstebeck, T.,
       Jirina, M., Klaschka, J., Kotrc, E., Savicky, P., Towers, S.,
       Vaicilius, A., Wittek W. (2004).
       Methods for multidimensional event classification: a case study
       using images from a Cherenkov gamma-ray telescope.
       Nucl.Instr.Meth. A, 516, pp. 511-528.

   (b) P. Savicky, E. Kotrc.
       Experimental Study of Leaf Confidences for Random Forest.
       Proceedings of COMPSTAT 2004, In: Computational Statistics.
       (Ed.

It is possible to download and load remote files via their URLs directly from within Python (and thus from a Jupyter session). This is a rather powerful tool, as it allows HTTP communication, I/O streaming and so on.

Care should be taken, as data read this way is held in memory.

In [5]:
import urllib.request
url ='https://archive.ics.uci.edu/ml/machine-learning-databases/magic/magic04.names'
with urllib.request.urlopen(url) as data_file:
    #print (data_file.read(300))
    for line in data_file:
        print (line)

b'1. Title of Database: MAGIC gamma telescope data 2004\n'
b'\n'
b'2. Sources:\n'
b'\n'
b'   (a) Original owner of the database:\n'
b'\n'
b'       R. K. Bock\n'
b'       Major Atmospheric Gamma Imaging Cherenkov Telescope project (MAGIC)\n'
b'       http://wwwmagic.mppmu.mpg.de\n'
b'       rkb@mail.cern.ch\n'
b'\n'
b'   (b) Donor:\n'
b'\n'
b'       P. Savicky\n'
b'       Institute of Computer Science, AS of CR\n'
b'       Czech Republic\n'
b'       savicky@cs.cas.cz\n'
b'\n'
b'   (c) Date received: May 2007\n'
b'\n'
b'3. Past Usage:\n'
b'\n'
b'   (a) Bock, R.K., Chilingarian, A., Gaug, M., Hakl, F., Hengstebeck, T.,\n'
b'       Jirina, M., Klaschka, J., Kotrc, E., Savicky, P., Towers, S.,\n'
b'       Vaicilius, A., Wittek W. (2004).\n'
b'       Methods for multidimensional event classification: a case study\n'
b'       using images from a Cherenkov gamma-ray telescope.\n'
b'       Nucl.Instr.Meth. A, 516, pp. 511-528.\n'
b'\n'
b'   (b) P. Savicky, E. Kotrc.\n'
b'       Experimental Stu
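The `b'...'` prefix in the output above appears because the response yields raw bytes rather than text. The following is a minimal sketch of decoding each line, assuming the file is UTF-8 encoded (an assumption, not something stated by the server). To keep the example runnable offline, an in-memory `io.BytesIO` buffer stands in for the HTTP response, and `decode_lines` is a hypothetical helper name.

```python
import io

def decode_lines(byte_stream, encoding="utf-8"):
    """Yield each line of a binary stream as a str, without its trailing newline."""
    for raw_line in byte_stream:
        yield raw_line.decode(encoding).rstrip("\n")

# stand-in for the object returned by urllib.request.urlopen(url)
fake_response = io.BytesIO(
    b"1. Title of Database: MAGIC gamma telescope data 2004\n\n2. Sources:\n"
)

decoded = list(decode_lines(fake_response))
for line in decoded:
    print(line)
```

The same generator should work on the real response object, since it only relies on iterating over lines of bytes.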

### Copy data from a remote machine

Often datasets are not available on websites but rather sit on some remote machine. Several tools exist that allow you to get hold of remote data, even from within Python (e.g. [paramiko](https://www.paramiko.org/)), but in this case it is best to get a local copy. E.g. from a terminal:

```bash
scp lemma@lxplus.cern.ch:/eos/project/l/lemma/data/2018/raw/Run000333/data_000637.* ~/data/
```

By issuing that command you are immediately exposed to the most relevant problem in obtaining data: permissions/authorization.

Secondly (essentially a further consequence of the same issue), the remote machine itself may have accessibility restrictions, e.g. it may sit behind a firewall. In that case you may need to use a tunnel through a gateway machine (G below) that can reach the remote machine (R below):

```bash
ssh -L 1234:<address of R known to G>:22 <user at G>@<address of G> 

scp -P 1234 <user at R>@127.0.0.1:/path/to/file file-name-to-be-copied
```

In summary, just getting the data is a complicated business.

## Data Formats

Datasets can be stored in a gazillion different ways. Often their formats are application dependent, even though more and more standards are being established. Python has "readers" for most formats, which is another reason why it is so well suited to data analysis.

### Text files 

Plain text files are commonly used for "readability", at the price of a very poor storage efficiency due to their low entropy. [UTF-8](https://en.wikipedia.org/wiki/UTF-8) is the most common encoding.

Reading (and writing) text files in Python is straightforward:

In [10]:
file_name = "/home/kommurik/data/magic04.names"

# mode can be specified for writing, reading or both
with open(file_name, mode='r') as f:
    # print out the whole file at once
    # print(f.read())
    for line in f:
        # print line by line (each line keeps its trailing newline)
        print(line)
        # each line is a string; you need to split it yourself
        for c in line.split():  # check the functionalities of the split() method
            print(c)

1. Title of Database: MAGIC gamma telescope data 2004

1.
Title
of
Database:
MAGIC
gamma
telescope
data
2004


2. Sources:

2.
Sources:


   (a) Original owner of the database:

(a)
Original
owner
of
the
database:


       R. K. Bock

R.
K.
Bock
       Major Atmospheric Gamma Imaging Cherenkov Telescope project (MAGIC)

Major
Atmospheric
Gamma
Imaging
Cherenkov
Telescope
project
(MAGIC)
       http://wwwmagic.mppmu.mpg.de

http://wwwmagic.mppmu.mpg.de
       rkb@mail.cern.ch

rkb@mail.cern.ch


   (b) Donor:

(b)
Donor:


       P. Savicky

P.
Savicky
       Institute of Computer Science, AS of CR

Institute
of
Computer
Science,
AS
of
CR
       Czech Republic

Czech
Republic
       savicky@cs.cas.cz

savicky@cs.cas.cz


   (c) Date received: May 2007

(c)
Date
received:
May
2007


3. Past Usage:

3.
Past
Usage:


   (a) Bock, R.K., Chilingarian, A., Gaug, M., Hakl, F., Hengstebeck, T.,

(a)
Bock,
R.K.,
Chilingarian,
A.,
Gaug,
M.,
Hakl,
F.,
Hengstebeck,
T.,
       Jirina, M., Klaschka, 
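Writing works the same way, using `mode='w'`. The following is a minimal sketch that writes a few lines to a temporary file and reads them back; the file name `notes.txt` and its contents are made up for illustration.

```python
import os
import tempfile

lines_to_write = ["first line", "second line", "third line"]

with tempfile.TemporaryDirectory() as tmp_dir:
    file_name = os.path.join(tmp_dir, "notes.txt")

    # mode='w' creates the file (or overwrites an existing one)
    with open(file_name, mode="w", encoding="utf-8") as f:
        for line in lines_to_write:
            f.write(line + "\n")  # write() does not add the newline for you

    # read the file back, stripping the trailing newline of each line
    with open(file_name, mode="r", encoding="utf-8") as f:
        lines_read = [line.rstrip("\n") for line in f]

print(lines_read)
```

Passing `encoding` explicitly avoids depending on the platform default.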

### CSV files

If you are lucky, text files are already framed into a defined structure, in a "table-like" manner. These files are called "comma separated values" (csv), even though the separator may well not be the "," symbol.
Python has a package to deal with them:

In [14]:
import csv

with open('/home/kommurik/data/magic04.data') as data_file:
    for line in csv.reader(data_file, delimiter=','): # ',' is the default delimiter; csv.Sniffer can help guess other ones
        # again note that the elements of each line are treated as strings;
        # if you need to convert them into numbers, you need to do that yourself
        fLength,fWidth,fSize,\
        fConc,fConc1,fAsym,\
        fM3Long,fM3Trans,fAlpha,fDist = map(float,line[:-1])
        category = line[-1]
        print (fLength,fWidth,fSize,fConc,fConc1,fAsym,fM3Long,fM3Trans,fAlpha,fDist)
        print (category)
        break

28.7967 16.0021 2.6449 0.3918 0.1982 27.7004 22.011 -8.2027 40.092 81.8828
g


More often than not, csv files have comments (e.g. starting with '#'), which cannot be interpreted by the reader. Tricks like:

```python
csv.reader(row for row in f if not row.startswith('#'))
```

may be useful.
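As a self-contained illustration of that trick, the sketch below filters comment lines out of an in-memory stand-in for a csv file before handing the rows to `csv.reader`. The three-column layout and the values are illustrative only, not the full MAGIC format.

```python
import csv
import io

# in-memory stand-in for a csv file containing comment lines
raw_text = """# fLength,fWidth,category
28.7967,16.0021,g
# a comment in the middle of the file
31.6036,11.7235,h
"""

with io.StringIO(raw_text) as f:
    # generator that skips every line starting with '#'
    data_rows = (row for row in f if not row.startswith("#"))
    records = [
        (float(length), float(width), category)
        for length, width, category in csv.reader(data_rows)
    ]

print(records)
```

Note that this only catches lines that start with `#`; a comment appended at the end of a data row would need separate handling.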

### Binary (hexadecimal) files

The output of sensors is often stored in binary files. Information is packed in a well-defined format (similarly to how floating point numbers are formatted).
To read and process binary files in Python you need to use the "b" option of `open` and progress along the file in steps of a defined length (depending on the size of the words the information is packed into).

The following is an example with data collected from an FPGA implementing a TDC. The relevant pieces of information are the coordinates of the TDC channels and their time measurements.

In [None]:
import struct

with open('/home/kommurik/data/data_000637.dat', 'rb') as f:
    word_size = 8 # size of the word in bytes
    file_content = f.read()
    # header line for the comma-separated printout
    print('HEAD,FPGA,TDC_CHANNEL,ORB_CNT,BX,TDC_MEAS')
    for word_counter, i in enumerate(range(0, len(file_content), word_size)):
        if word_counter > 100: break # only look at the first words of the file
        # '<q': one little-endian 64-bit integer per word
        thisInt = struct.unpack('<q', file_content[i:i+word_size])[0]
        # each field is extracted by shifting and masking the word
        head = (thisInt >> 62) & 0x3
        if head == 1:
            fpga     = (thisInt >> 58) & 0xF
            tdc_chan = (thisInt >> 49) & 0x1FF
            orb_cnt  = (thisInt >> 17) & 0xFFFFFFFF
            bx       = (thisInt >> 5 ) & 0xFFF
            tdc_meas = (thisInt >> 0 ) & 0x1F
            print('{0},{1},{2},{3},{4},{5}'.format(head, fpga, tdc_chan, orb_cnt, bx, tdc_meas))
        else:
            print('ERROR! head =', head)

HEAD,FPGA,TDC_CHANNEL,ORB_CNT,BX,TDC_MEAS
1,0,122,3869200167,2374,27
1,0,123,3869200167,2374,28
1,0,62,3869200167,2553,29
1,0,63,3869200167,2558,20
1,0,63,3869200167,2760,26
1,0,62,3869200167,2762,5
1,0,60,3869200167,2772,15
1,0,138,3869200167,2776,1
1,0,61,3869200167,2774,22
1,0,59,3869200167,2788,8
1,1,6,3869200167,2785,5
1,0,63,3869200167,2786,20
1,1,5,3869200167,2792,19
1,0,35,3869200167,2791,24
1,0,55,3869200167,2789,4
1,1,138,3869200167,2797,1
1,1,7,3869200167,2787,15
1,0,62,3869200167,2790,11
1,1,4,3869200167,2795,5
1,0,52,3869200167,2796,27
1,1,9,3869200167,2789,15
1,0,56,3869200167,2789,11
1,0,60,3869200167,2790,24
1,0,37,3869200167,2799,16
1,0,57,3869200167,2795,20
1,0,61,3869200167,2797,15
1,0,58,3869200167,2799,15
1,0,58,3869200167,3081,22
1,0,60,3869200167,3081,2
1,0,59,3869200167,3083,21
1,0,138,3869200167,3085,1
1,0,61,3869200167,3079,5
1,0,60,3869200167,3085,26
1,0,53,3869200167,3134,5
1,0,52,3869200167,3136,26
1,0,50,3869200167,3146,21
1,0,58,3869200167,3176,9
...
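The shifts and masks can be checked without the data file. The following sketch packs field values into a 64-bit word using the same layout as the cell above (2 bits for the header, 4 for the FPGA, 9 for the TDC channel, 32 for the orbit counter, 12 for BX and 5 for the TDC measurement), writes it as 8 little-endian bytes and decodes it again. `FIELDS`, `pack_word` and `unpack_word` are hypothetical names introduced for illustration; the example values are taken from the first printed row.

```python
import struct

# (name, shift, mask) for each field, matching the shifts used above
FIELDS = [
    ("head",     62, 0x3),
    ("fpga",     58, 0xF),
    ("tdc_chan", 49, 0x1FF),
    ("orb_cnt",  17, 0xFFFFFFFF),
    ("bx",        5, 0xFFF),
    ("tdc_meas",  0, 0x1F),
]

def pack_word(**values):
    """Combine the field values into 8 little-endian bytes."""
    word = 0
    for name, shift, mask in FIELDS:
        word |= (values[name] & mask) << shift
    return struct.pack("<Q", word)  # unsigned 64-bit integer, little-endian

def unpack_word(raw_bytes):
    """Split 8 little-endian bytes back into the individual fields."""
    word = struct.unpack("<Q", raw_bytes)[0]
    return {name: (word >> shift) & mask for name, shift, mask in FIELDS}

example = dict(head=1, fpga=0, tdc_chan=122, orb_cnt=3869200167, bx=2374, tdc_meas=27)
raw = pack_word(**example)
decoded_word = unpack_word(raw)
print(len(raw), decoded_word)
```

The sketch uses the unsigned format `'<Q'`; because every field is masked after shifting, the signed `'<q'` format yields the same field values.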

### JSON files

JSON is JavaScript Object Notation - a format used widely for web-based resource sharing. It is very similar in structure to a Python nested dictionary. Here is an example from http://json.org/example

In [1]:
%%file example.json
{
    "glossary": {
        "title": "example glossary",
            "GlossDiv": {
            "title": "S",
                    "GlossList": {
                "GlossEntry": {
                    "ID": "SGML",
                                    "SortAs": "SGML",
                                    "GlossTerm": "Standard Generalized Markup Language",
                                    "Acronym": "SGML",
                                    "Abbrev": "ISO 8879:1986",
                                    "GlossDef": {
                        "para": "A meta-markup language, used to create markup languages such as DocBook.",
                                            "GlossSeeAlso": ["GML", "XML"]
                    },
                                    "GlossSee": "markup"
                }
            }
        }
    }
}

Writing example.json


In [2]:
!cat example.json

{
    "glossary": {
        "title": "example glossary",
            "GlossDiv": {
            "title": "S",
                    "GlossList": {
                "GlossEntry": {
                    "ID": "SGML",
                                    "SortAs": "SGML",
                                    "GlossTerm": "Standard Generalized Markup Language",
                                    "Acronym": "SGML",
                                    "Abbrev": "ISO 8879:1986",
                                    "GlossDef": {
                        "para": "A meta-markup language, used to create markup languages such as DocBook.",
                                            "GlossSeeAlso": ["GML", "XML"]
                    },
                                    "GlossSee": "markup"
                }
            }
        }
    }
}

In [3]:
import json
# the with statement makes sure the file is closed after loading
with open('example.json') as f:
    data = json.load(f)
print (data)

{'glossary': {'title': 'example glossary', 'GlossDiv': {'title': 'S', 'GlossList': {'GlossEntry': {'ID': 'SGML', 'SortAs': 'SGML', 'GlossTerm': 'Standard Generalized Markup Language', 'Acronym': 'SGML', 'Abbrev': 'ISO 8879:1986', 'GlossDef': {'para': 'A meta-markup language, used to create markup languages such as DocBook.', 'GlossSeeAlso': ['GML', 'XML']}, 'GlossSee': 'markup'}}}}}


In [4]:
# and can be parsed using standard key lookups
data['glossary']['GlossDiv']['GlossList']

{'GlossEntry': {'Abbrev': 'ISO 8879:1986',
  'Acronym': 'SGML',
  'GlossDef': {'GlossSeeAlso': ['GML', 'XML'],
   'para': 'A meta-markup language, used to create markup languages such as DocBook.'},
  'GlossSee': 'markup',
  'GlossTerm': 'Standard Generalized Markup Language',
  'ID': 'SGML',
  'SortAs': 'SGML'}}
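The correspondence with Python dictionaries works in both directions. As a minimal sketch, `json.dumps` turns a dictionary into a JSON string and `json.loads` parses it back; the `year` and `obsolete` keys are made up here to show how numbers and booleans are mapped.

```python
import json

record = {
    "ID": "SGML",
    "GlossSeeAlso": ["GML", "XML"],
    "year": 1986,        # made-up key, illustrates numbers
    "obsolete": False,   # made-up key, illustrates booleans
}

text = json.dumps(record, indent=2)  # dict -> JSON string
restored = json.loads(text)          # JSON string -> dict

print(text)
print(restored == record)
```

Python's `False` is written as `false` in the JSON text and converted back on loading. Use `json.dump` and `json.load` to work with files directly instead of strings.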

### HDF5

HDF5 is a hierarchical format often used to store complex scientific data. For instance, Matlab now saves its data to HDF5. It is particularly useful to store complex hierarchical data sets with associated metadata, for example, the results of a computer simulation experiment.

The main concepts associated with HDF5 are

* file: container for hierarchical data - serves as ‘root’ for tree
* group: a node for a tree
* dataset: array for numeric data - can be huge
* attribute: small pieces of metadata that provide additional context


In [10]:
import datetime
import os

import h5py
import numpy as np

# create the HDF5 file only if it does not exist yet
if not os.path.exists('example.hdf5'):

    # mode 'w' creates the file (recent h5py versions open read-only by default)
    with h5py.File('example.hdf5', 'w') as f:
        # groups are the nodes of the tree, attributes hold their metadata
        project = f.create_group('project')
        project.attrs.create('name', 'My project')
        project.attrs.create('date', str(datetime.date.today()))

        expt1 = project.create_group('expt1')
        expt2 = project.create_group('expt2')
        # datasets are typed arrays of a given shape
        expt1.create_dataset('counts', (100,), dtype='i')
        expt2.create_dataset('values', (1000,), dtype='f')

        expt1['counts'][:] = range(100)
        expt2['values'][:] = np.random.random(1000)

In [11]:
# open the file read-only and navigate it like a nested dictionary
with h5py.File('example.hdf5', 'r') as f:
    project = f['project']
    print(project.attrs['name'])
    print(project.attrs['date'])
    print(project['expt1']['counts'][:10])
    print(project['expt2']['values'][:10])

### Pandas

The most convenient tool to read and process formatted datasets is, however, Pandas. A couple of examples follow. Pandas will be the main subject of one of the next classes.


In [15]:
import pandas as pd
file_name="/home/kommurik/data/data_000637.txt"
data=pd.read_csv(file_name,nrows=10,skiprows=range(1,1))
data

Unnamed: 0,HEAD,FPGA,TDC_CHANNEL,ORBIT_CNT,BX_COUNTER,TDC_MEAS
0,1,0,123,3869200167,2374,26
1,1,0,124,3869200167,2374,27
2,1,0,63,3869200167,2553,28
3,1,0,64,3869200167,2558,19
4,1,0,64,3869200167,2760,25
5,1,0,63,3869200167,2762,4
6,1,0,61,3869200167,2772,14
7,1,0,139,3869200167,2776,0
8,1,0,62,3869200167,2774,21
9,1,0,60,3869200167,2788,7


In [None]:
import pandas as pd
file_name="/home/kommurik/data/magic04.data"
# the file has no header row, so the column names are passed explicitly
data=pd.read_csv(file_name,nrows=1000,header=None,
        names=['fLength','fWidth','fSize',
        'fConc','fConc1','fAsym',
        'fM3Long','fM3Trans','fAlpha','fDist','category'])
data

In [None]:
%matplotlib inline

data.plot.scatter("fLength","fWidth",)


In [None]:
# select the column first, then histogram it
data["fAlpha"].plot.hist()