# Handling Data over Time

There's a widespread trend in solar physics at the moment for correlation over actual science, so being able to handle data over time spans is a skill we all need to have. Python has ample support for this, so let's have a look at what we can use.

So let's import some data to handle:

In [2]:
from __future__ import print_function, division
import numpy as np

data = np.genfromtxt('macrospicules.csv', skip_header=1, dtype=None, delimiter=',')

The line above imports information on some solar features over a sample time period. Specifically, we have the maximum length, the lifetime, and the time at which each occurred. Now, if we type `data[0]`, what will happen?

In [3]:
data[0]

(27.022617088020528, 13.6, '2010-06-01T13:00:14.120000')

This is the first row of the array, containing the first value of each of our three properties. This particular example is a structured array, so each column (field) has a name and its own data type. We can ask what the names of these columns are through the `dtype` attribute:

In [4]:
data.dtype.names

('f0', 'f1', 'f2')

Unhelpful, so let's give them something more recognisable. We can use the docs to look up the syntax and change the names of the column labels.
<br/>
<section class="objectives panel panel-success">
<div class="panel-heading">
<h2><span class="fa fa-pencil"></span> Google your troubles away </h2>
</div>
<br/>
The docs are [here](http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.genfromtxt.html). Find the syntax to change the names to something representing the maximum length, the lifetime, and the point in time at which they occurred.
<br/>
</section>

In [5]:
data.dtype.names = ('max_len', 'ltime', 'sample_time')
data['max_len']

array([ 27.02261709,  36.20809658,  62.93289752,  38.97054949,
        53.58942032,  30.6401107 ,  34.62323773,  25.76578452,
        42.55229885,  43.78096232,  32.07478058,  46.32895236,
        41.81948071,  36.15639728,  27.76449747,  28.36338204,
        46.24264885,  31.66963787,  37.98509124,  41.10763789,
        36.31153664,  63.94515322,  33.20603032,  45.12547627,
        45.59984305,  44.13747471,  39.05135355,  42.70997285,
        22.00943851,  45.70103315,  40.61460622,  53.58952877,
        33.55068528,  34.12704163,  28.84501455,  33.50857898,
        33.04991294,  43.14772776,  64.7263967 ,  58.4113328 ,
        26.51897508,  38.16891551,  53.91004427,  38.66085552,
        27.32363074,  34.82350067,  49.29044968,  44.55780869,
        30.01404343,  33.79223954,  37.02116485,  28.72061961,
        29.21863497,  55.72629452,  54.57759503,  45.33884703,
        50.69244379,  33.25632341,  37.73000783,  26.92855622,
        52.49096835,  63.49966052,  36.13680388,  61.63
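As an alternative to renaming the fields after loading, `genfromtxt` also accepts a `names` argument. The sketch below is a minimal illustration of that route. It assumes a recent NumPy (for the `encoding` argument) and uses a hypothetical two-row stand-in for `macrospicules.csv`, including an invented header line, so it runs without the real file.

```python
import io
import numpy as np

# Hypothetical stand-in for macrospicules.csv (header line is invented)
csv_text = (
    "length,lifetime,time\n"
    "27.02,13.6,2010-06-01T13:00:14.120000\n"
    "36.21,8.4,2010-06-01T12:58:02.120000\n"
)

sample = np.genfromtxt(
    io.StringIO(csv_text),
    skip_header=1,          # skip the header line, as in the cell above
    dtype=None,             # let NumPy infer a type per column
    delimiter=",",
    names=("max_len", "ltime", "sample_time"),  # name the fields at load time
    encoding="utf-8",       # read the timestamp column as text
)

print(sample.dtype.names)
print(sample["ltime"])
```

Naming the fields at load time avoids the intermediate `f0`, `f1`, `f2` labels altogether.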

## Pandas

In its own words, pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. Pandas has two main data structures: the 1D `Series` and the 2D `DataFrame`. How often do you have literally just one column of data? Rarely, so we're going to focus on the DataFrame.

Here we give the pandas DataFrame two arguments: an index and the data. In this case the index will be the sample time, and the maximum length and lifetime will be our data. So let's import pandas and build a DataFrame.
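To make the difference between the two structures concrete, here is a small sketch. The numbers are illustrative values, rounded, and the variable names `max_len` and `table` are chosen only for this example.

```python
import pandas as pd

# A Series is one labelled column of values
max_len = pd.Series([27.02, 36.21, 62.93], name="max_len")

# A DataFrame is a table made of several such columns
table = pd.DataFrame({
    "max_len": [27.02, 36.21, 62.93],
    "ltime": [13.6, 8.4, 24.2],
})

print(max_len.shape)          # one dimension
print(table.shape)            # rows and columns
print(type(table["ltime"]))   # a single DataFrame column is a Series
```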


<section class="objectives panel panel-warning">
<div class="panel-heading">
<h2><span class="fa fa-certificate"></span> Dictionaries </h2>
</div>
<br/>
We covered dictionaries earlier. We can create key-value pairs to form a dictionary (shock horror) of values. For example:
<code>
temps = {'Brussels': 9, 'London': 3, 'Barcelona': 13, 'Rome': 16}
temps['Rome']
>>> 16
</code>
<br/>
</section>

Pandas can take a dictionary when we want to input multiple data columns. We therefore make a dictionary of our data and pass it to a pandas DataFrame. First, though, we need to import pandas.


In [6]:
import pandas as pd

d = {'max_len': data['max_len'], 'ltime': data['ltime']}

df = pd.DataFrame(data=d, index=data['sample_time'])
print(df)

                                ltime    max_len
2010-06-01T13:00:14.120000  13.600000  27.022617
2010-06-01T12:58:02.120000   8.400000  36.208097
2010-06-15T12:55:02.110000  24.199833  62.932898
2010-07-07T12:23:50.110000  22.999667  38.970549
2010-07-07T13:28:50.120000  21.799833  53.589420
2010-07-15T12:12:02.130000  17.400167  30.640111
2010-07-15T12:19:02.110000  12.800000  34.623238
2010-07-24T12:09:50.120000  14.200000  25.765785
2010-08-01T13:27:14.120000  24.400000  42.552299
2010-08-07T12:37:44.120000  15.400000  43.780962
2010-08-15T13:10:32.120000  15.199833  32.074781
2010-08-24T12:51:44.120000  18.599833  46.328952
2010-08-24T13:22:20.120000  15.200000  41.819481
2010-09-01T12:43:44.120000  18.800167  36.156397
2010-09-15T12:21:20.120000  17.200000  27.764497
2010-09-15T12:26:08.120000  20.200000  28.363382
2010-09-15T12:33:44.120000  18.800167  46.242649
2010-09-15T13:06:32.130000  18.999833  31.669638
2010-09-24T12:36:56.120000  18.400000  37.985091
2010-10-15T12:32:08.

## Datetime Objects

Notice that the time for each sample is in a strange format. It is a string in ISO 8601 form: the date as YYYY-MM-DD, a `T` separator, then the time as HH:MM:SS.ffffff (with microseconds). Python's `datetime` module represents such timestamps as datetime objects, which have their own set of methods. Once the strings are converted to datetime objects, they can be used for indexing easily.

We can use the `datetime` module to create timedelta objects and date objects (representing just year, month, day). We can also get information about universal time, such as the time and date today.

In [7]:
import datetime

# Current local time and current UTC time
print(datetime.datetime.now())
print(datetime.datetime.utcnow())

# Build a datetime from a separate time and date
lunchtime = datetime.time(12, 30)
the_date = datetime.date(2005, 7, 14)
dinner = datetime.datetime.combine(the_date, lunchtime)

print("When is dinner? {}".format(dinner))

2016-01-11 14:58:04.936936
2016-01-11 14:58:04.937047
When is dinner? 2005-07-14 12:30:00
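The timestamp strings in our data can be turned into datetime objects with `datetime.datetime.strptime`, and subtracting two datetimes gives a timedelta. The sketch below uses the first two sample times from the table above. The format string is an assumption based on how those strings look.

```python
import datetime

# Format matching strings such as '2010-06-01T13:00:14.120000'
ISO_FORMAT = "%Y-%m-%dT%H:%M:%S.%f"

first = datetime.datetime.strptime("2010-06-01T13:00:14.120000", ISO_FORMAT)
second = datetime.datetime.strptime("2010-06-01T12:58:02.120000", ISO_FORMAT)

# Subtracting two datetime objects yields a timedelta
gap = first - second

print(first.year, first.month, first.day)
print(gap.total_seconds())  # 132.0
```

Once parsed, attributes such as `.year` and `.month` become available, which plain strings do not offer.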


### So let's do something with the data

Pandas has built-in methods attached to the DataFrame, similar to the way plotting methods are attached to axes on graphs. Let's take the example of making a histogram of the data. First we need to bin the data by the year and month of the index:

In [8]:
bins = pd.groupby(df, by=[df.index.year, df.index.month])


AttributeError: 'Index' object has no attribute 'year'
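The `AttributeError` most likely arises because the index still holds plain strings rather than datetime objects, so it has no `.year` or `.month`. A minimal sketch of one way around this, assuming the strings parse cleanly with `pd.to_datetime`, is shown below on a four-row stand-in built from the first rows printed earlier. It uses the `DataFrame.groupby` method, since recent pandas releases do not provide a top-level `pd.groupby`.

```python
import pandas as pd

# Four-row stand-in using values from the table printed earlier
times = [
    "2010-06-01T13:00:14.120000",
    "2010-06-01T12:58:02.120000",
    "2010-06-15T12:55:02.110000",
    "2010-07-07T12:23:50.110000",
]
d = {
    "max_len": [27.022617, 36.208097, 62.932898, 38.970549],
    "ltime": [13.6, 8.4, 24.199833, 22.999667],
}
df = pd.DataFrame(data=d, index=times)

# Convert the string index into a DatetimeIndex so .year and .month exist
df.index = pd.to_datetime(df.index)

# Group the rows by (year, month) of the index
bins = df.groupby([df.index.year, df.index.month])

# Number of events in each month
counts = bins.size()
print(counts)
```

The resulting `counts` is indexed by (year, month) pairs, which is a convenient starting point for a histogram of events per month.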