# 03_05: Special arrays

This lesson covers two NumPy features that tutorials rarely mention but that are still very useful.

In [25]:
import math
import collections
import dataclasses
import datetime

import numpy as np
import pandas as pd
import matplotlib.pyplot as pp

The first is record arrays, which allow us to mix different data types and give descriptive names to fields. There is a much more powerful version of this in the pandas library, but sometimes it's good to stay within NumPy.

The other feature is datetime objects, which, as the name says, can encode a date and a time.

We will start by loading a simple example of a record array, saved in a NumPy format.

In [26]:
discography = np.load('discography.npy')

This is a partial David Bowie discography. Each entry shows a record's name, the date of release, and the top rank in the UK charts. 

The dtype, shown at the end of the array's output, is a list giving the name and data type of each field.

For 'title', it's U32, which denotes a Unicode string of length 32. For 'release', it's M8[D], which denotes a NumPy datetime object with a precision of one day; the 8 is there because NumPy datetime objects take 8 bytes. Finally, 'toprank' is i8, an 8-byte integer.

If you're wondering about the less-than symbols, they refer to the endianness of the data types, that is, the order in which the bytes are stored in memory. A less-than sign denotes little-endian numbers. This table, which is in the NumPy cheat sheet, shows the most common NumPy data types, their memory usage, and the data type string.
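The dtype strings described above can be inspected directly. The following is a small sketch that rebuilds the same dtype by hand (the variable names `record_dtype` and `field_sizes` are illustrative) and reports how many bytes each field occupies.

```python
import numpy as np

# The same field layout as the discography array, written out by hand.
record_dtype = np.dtype([('title', '<U32'), ('release', '<M8[D]'), ('toprank', '<i8')])

# Bytes used by each field; NumPy stores Unicode strings with 4 bytes per character.
field_sizes = {name: record_dtype[name].itemsize for name in record_dtype.names}
print(field_sizes)            # {'title': 128, 'release': 8, 'toprank': 8}

# Total bytes per record (fields are packed without padding by default).
print(record_dtype.itemsize)  # 144
```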

Do remember that these NumPy strings have a fixed length, and that assigning a string longer than the defined length will truncate it.
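The truncation is silent, so it is worth seeing once. This sketch uses a made-up two-element array with a deliberately short string length of 6.

```python
import numpy as np

# A small illustrative array of strings with room for 6 characters each.
titles = np.array(['Low', 'Heroes'], dtype='U6')

# Assigning a longer string raises no error; the extra characters are dropped.
titles[0] = 'Station To Station'
print(titles[0])  # Statio
```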

In [28]:
discography

array([('David Bowie', '1969-11-14', 17),
       ('The Man Who Sold the World', '1970-11-04',  3),
       ('Hunky Dory', '1971-12-17',  5),
       ('Ziggy Stardust', '1972-06-16',  1),
       ('Aladdin Sane', '1973-04-13',  1), ('Pin Ups', '1973-10-19',  1),
       ('Diamond Dogs', '1974-05-24',  1),
       ('Young Americans', '1975-03-07',  2),
       ('Station To Station', '1976-01-23',  5),
       ('Low', '1977-01-14',  2), ('Heroes', '1977-10-14',  3),
       ('Lodger', '1979-05-18',  4)],
      dtype=[('title', '<U32'), ('release', '<M8[D]'), ('toprank', '<i8')])

We could also load his discography from a text file. 

In [27]:
print(open('discography.txt', 'r').read())

# title, release, toprank
David Bowie, 1969-11-14, 17
The Man Who Sold the World, 1970-11-04, 3
Hunky Dory, 1971-12-17, 5
Ziggy Stardust, 1972-06-16, 1
Aladdin Sane, 1973-04-13, 1
Pin Ups, 1973-10-19, 1
Diamond Dogs, 1974-05-24, 1
Young Americans, 1975-03-07, 2
Station To Station, 1976-01-23, 5
Low, 1977-01-14, 2
Heroes, 1977-10-14, 3
Lodger, 1979-05-18, 4



That takes a little more work because we have to specify the dtype of every field as well as the field delimiter, which here is a comma. The result, however, is the same.

In [29]:
discography_txt = np.genfromtxt('discography.txt', dtype=('U32', 'M8[D]', 'i8'), delimiter=',', names=True)

Comparing the two arrays element by element confirms that they are the same.

In [30]:
discography_txt == discography

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True])

So, once we have a record array, what can we do? Each record looks like a Python tuple, and we can extract its elements with a second index, as we would for a tuple. Unlike a tuple, a record can also be modified in place.

In [32]:
discography[0]

np.void(('David Bowie', '1969-11-14', 17), dtype=[('title', '<U32'), ('release', '<M8[D]'), ('toprank', '<i8')])

In [33]:
discography[0][0]

np.str_('David Bowie')

In [34]:
discography[0][1]

np.datetime64('1969-11-14')

But we can also use a dictionary-style interface with field names.

In [35]:
discography[0]['title']

np.str_('David Bowie')

Using the field name on the entire array will give us a view of a column.

In [36]:
discography['title']

array(['David Bowie', 'The Man Who Sold the World', 'Hunky Dory',
       'Ziggy Stardust', 'Aladdin Sane', 'Pin Ups', 'Diamond Dogs',
       'Young Americans', 'Station To Station', 'Low', 'Heroes', 'Lodger'],
      dtype='<U32')
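Because indexing by field name returns a view, changes made through the column show up in the original array, and a column can also serve as a filter. The sketch below uses a hypothetical three-record array (`albums`) and an invented edit to the rank, purely for illustration.

```python
import numpy as np

# A small illustrative record array with the same fields as the discography.
albums = np.array(
    [('Low', '1977-01-14', 2),
     ('Heroes', '1977-10-14', 3),
     ('Lodger', '1979-05-18', 4)],
    dtype=[('title', 'U32'), ('release', 'M8[D]'), ('toprank', 'i8')],
)

# Selecting a field gives a view onto the same memory, not a copy.
ranks = albums['toprank']
ranks[0] = 1                      # hypothetical edit, for demonstration only
print(albums[0]['toprank'])       # 1

# A field can also be used to build a boolean mask over the records.
top_three = albums[albums['toprank'] <= 3]
print(top_three['title'])         # ['Low' 'Heroes']
```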

To create a record array in Python code, we have to provide the data records as tuples, and we need to be careful about describing the data types.

For our discography, it would look like this.

In [37]:
np.array([('David Bowie', '1969-11-14', 17),
          ('The Man Who Sold The World', '1970-11-04', 3)],
         dtype=[('title', 'U32'), ('release', 'M8[D]'), ('toprank', 'i8')])

array([('David Bowie', '1969-11-14', 17),
       ('The Man Who Sold The World', '1970-11-04',  3)],
      dtype=[('title', '<U32'), ('release', '<M8[D]'), ('toprank', '<i8')])

Now for dates and times in NumPy. The dtype that we need is called datetime64. The name avoids confusion with the datetime object in the Python standard library and reminds us that each element takes 64 bits. We initialize datetime64 objects from strings, and we can give as much detail as we want.

The string format is ISO 8601, which goes from larger to smaller units. That is, from year to month to day, and so on.

Here are three dates of increasing precision.

In [38]:
np.datetime64('1969')

np.datetime64('1969')

In [39]:
np.datetime64('1969-11-14')

np.datetime64('1969-11-14')

In [45]:
np.datetime64('2015-02-03 12:00')

np.datetime64('2015-02-03T12:00')
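The precision of each object follows from the amount of detail in the string. As a quick check, this sketch uses `np.datetime_data` to read the unit NumPy inferred for the three strings above (the `units` dictionary is just a convenience for this example).

```python
import numpy as np

units = {}
for text in ('1969', '1969-11-14', '2015-02-03 12:00'):
    stamp = np.datetime64(text)
    # datetime_data returns the unit code and the step count of the dtype.
    unit, count = np.datetime_data(stamp.dtype)
    units[text] = unit

print(units)  # {'1969': 'Y', '1969-11-14': 'D', '2015-02-03 12:00': 'm'}
```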

We can also create a NumPy datetime from a standard library datetime object. Without a unit, the result has microsecond precision and a time of exactly midnight; specifying a granularity of D keeps only the date.

The Python datetime module has a lot of functionality that can be useful before we bring data into NumPy.

In [41]:
np.datetime64(datetime.datetime(2015,2,3))

np.datetime64('2015-02-03T00:00:00.000000')

In [46]:
np.datetime64(datetime.datetime(2015,2,3),'D') # 'D' for Day

np.datetime64('2015-02-03')

For instance, if we need to parse a date string in a generic format, we can do so by specifying the format itself.

In [44]:
np.datetime64(datetime.datetime.strptime('02/03/2015','%m/%d/%Y'),'D')

np.datetime64('2015-02-03')
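The same approach extends to a whole list of strings. This is one possible sketch, assuming all strings share the month/day/year format; the list `us_dates` is made up for the example.

```python
import datetime
import numpy as np

# Illustrative input: dates written month/day/year.
us_dates = ['02/03/2015', '11/14/1969']

# Parse each string with strptime, then convert to a day-precision datetime64.
parsed = np.array([
    np.datetime64(datetime.datetime.strptime(text, '%m/%d/%Y'), 'D')
    for text in us_dates
])

print(parsed)        # ['2015-02-03' '1969-11-14']
print(parsed.dtype)  # datetime64[D]
```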

NumPy datetime objects can be compared.

In [47]:
np.datetime64('2025-02-03 12:00') < np.datetime64('2015-02-03 18:00')

np.False_

They can be subtracted, which results in a timedelta object. Here it is expressed in minutes, the precision of the two inputs.

In [49]:
np.datetime64('2015-02-03 18:00') - np.datetime64('2015-02-03 12:00')

np.timedelta64(360,'m')
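If minutes are not the unit we want, the timedelta can be converted. A short sketch of two common options: dividing by a unit timedelta gives a plain float, while `astype` gives a timedelta in the new unit.

```python
import numpy as np

delta = np.datetime64('2015-02-03 18:00') - np.datetime64('2015-02-03 12:00')

# Dividing by one hour yields a floating-point number of hours.
hours = delta / np.timedelta64(1, 'h')
print(hours)        # 6.0

# astype converts to a timedelta expressed in whole hours.
whole_hours = delta.astype('timedelta64[h]')
print(whole_hours)  # 6 hours
```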

The nice thing about these datetime64 objects is that they work across NumPy. For instance, we can use the NumPy function diff, which computes the difference between successive array elements, to see how long it took David to come up with each new record.

In [50]:
np.diff(discography['release'])

array([355, 408, 182, 301, 189, 217, 287, 322, 357, 273, 581],
      dtype='timedelta64[D]')

It seems 'Ziggy Stardust' was especially quick: it came out only 182 days after 'Hunky Dory'.

In [52]:
discography[3]

np.void(('Ziggy Stardust', '1972-06-16', 1), dtype=[('title', '<U32'), ('release', '<M8[D]'), ('toprank', '<i8')])
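Rather than reading the shortest gap off the output by eye, we can let NumPy find it. The sketch below retypes the titles and release dates listed earlier into two plain arrays (so it runs without `discography.npy`) and uses `argmin` on the differences.

```python
import numpy as np

titles = np.array([
    'David Bowie', 'The Man Who Sold the World', 'Hunky Dory',
    'Ziggy Stardust', 'Aladdin Sane', 'Pin Ups', 'Diamond Dogs',
    'Young Americans', 'Station To Station', 'Low', 'Heroes', 'Lodger',
])
releases = np.array([
    '1969-11-14', '1970-11-04', '1971-12-17', '1972-06-16',
    '1973-04-13', '1973-10-19', '1974-05-24', '1975-03-07',
    '1976-01-23', '1977-01-14', '1977-10-14', '1979-05-18',
], dtype='M8[D]')

# gaps[i] is the time between record i and record i + 1.
gaps = np.diff(releases)
shortest = np.argmin(gaps)

# The record that followed the shortest gap, and the gap in days.
quickest_title = titles[shortest + 1]
quickest_days = int(gaps[shortest] / np.timedelta64(1, 'D'))
print(quickest_title, quickest_days)  # Ziggy Stardust 182
```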

Another example of using standard NumPy functions with datetime64 is making a range of dates. Consistent with the usual convention in Python, the last day in the range is excluded.

In [55]:
np.arange(np.datetime64('2015-02-03'), np.datetime64('2015-03-01'))

array(['2015-02-03', '2015-02-04', '2015-02-05', '2015-02-06',
       '2015-02-07', '2015-02-08', '2015-02-09', '2015-02-10',
       '2015-02-11', '2015-02-12', '2015-02-13', '2015-02-14',
       '2015-02-15', '2015-02-16', '2015-02-17', '2015-02-18',
       '2015-02-19', '2015-02-20', '2015-02-21', '2015-02-22',
       '2015-02-23', '2015-02-24', '2015-02-25', '2015-02-26',
       '2015-02-27', '2015-02-28'], dtype='datetime64[D]')
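As with numeric ranges, `arange` also accepts a step. This sketch repeats the daily range and then, as an illustrative variation, passes a seven-day timedelta to get weekly dates.

```python
import numpy as np

start = np.datetime64('2015-02-03')
stop = np.datetime64('2015-03-01')

# Default step of one day; the stop date itself is excluded.
daily = np.arange(start, stop)
print(len(daily), daily[-1])  # 26 2015-02-28

# A timedelta64 step gives one date per week.
weekly = np.arange(start, stop, np.timedelta64(7, 'D'))
print(weekly)  # ['2015-02-03' '2015-02-10' '2015-02-17' '2015-02-24']
```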