**Pickle and Numpy formats**

So far, we have used datasets stored in .csv files. Working with this format has many advantages. It’s well supported by data analysis tools such as Pandas and Excel, and it’s a text format, which means that we can easily edit the files with a text editor. However, .csv files are limited to tabular data, and they are not optimized for working with large arrays of values.

In this unit, we will see how to store/load Python objects and Numpy arrays using the Pickle .p and the Numpy .npy and .npz binary formats.

**Create Pickle files**

A Pickle file is a convenient way to store data in Python. This format works with many Python objects and data structures, and it’s cross-platform, which means that we can use these files across different platforms, e.g., Windows, macOS and Linux.

Let’s take an example. We will start by storing a simple dictionary with two lists.



In [1]:
# Sample dictionary with two lists
data = {
    'x': [6.28318, 2.71828, 1],
    'y': [2, 3, 5]
}

In [2]:
# First, we need to import the pickle library.

import pickle

We can then use the Pickle dump() function to store the data variable into a Pickle .p file. But first, we need to create it using the Python open() function. **Pickle files are binary, i.e., they are not text files. For this reason, we need to use the wb flags (writing and binary mode).**

In [3]:
# Save the dictionary into a pickle file
with open('data.p', 'wb') as file:
    pickle.dump(data, file)

It’s important to understand that Pickle is a binary format. We can take a look at the content of the data.p file by typing the xxd -b data.p command in a terminal.

The command returns an eight-column representation of the data.p file. The first one is a count of the number of bytes and indicates how far we are from the start of the file. The next six columns show its binary content by groups of eight bits (also known as bytes) and the last column is the text representation of each byte.
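If xxd is not available, a similar view can be produced from Python. The sketch below is only an approximation of the xxd -b layout: the binary_dump() helper is a hypothetical name introduced for illustration, and the data is serialized in memory with pickle.dumps() rather than read back from data.p.

```python
import pickle

# Same sample dictionary as in the example above
data = {
    'x': [6.28318, 2.71828, 1],
    'y': [2, 3, 5]
}

# Serialize to a bytes object in memory instead of a file
raw = pickle.dumps(data)


def binary_dump(raw_bytes, width=6):
    """Hypothetical helper: offset, bytes as 8-bit groups, text column."""
    pad = width * 9 - 1  # width of a full row of bit groups
    lines = []
    for offset in range(0, len(raw_bytes), width):
        chunk = raw_bytes[offset:offset + width]
        bits = ' '.join(format(byte, '08b') for byte in chunk)
        # Non-printable bytes are shown as dots
        text = ''.join(chr(b) if 32 <= b < 127 else '.' for b in chunk)
        lines.append(f'{offset:08x}: {bits.ljust(pad)}  {text}')
    return lines


# Print only the first rows of the dump
for line in binary_dump(raw)[:4]:
    print(line)
```

Most bytes in the text column appear as dots, which is another way to see that the file is not meant to be read as text.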

As you can see, unlike CSV files, we cannot open data.p with a text editor and edit the text representation of the Python dictionary directly. To modify its content, we need to open, edit and save the changes using Python code. Let’s see how to do that.

**Edit Pickle files**

The Pickle library implements a load() function to read the content of a Pickle file. Let’s use it to read our data.p file.

In [4]:
# Load the pickle file
with open('data.p', 'rb') as file:
    data = pickle.load(file)

print(data)

{'x': [6.28318, 2.71828, 1], 'y': [2, 3, 5]}


This time, we are reading and not writing the file. Hence, we need to pass the rb flags to the open() function (reading and binary mode).

We can now edit the dictionary using Python code and store the modified version with another dump() instruction.

In [5]:
# Add a new element to each list
data['x'].append(0)
data['y'].append(7)

In [6]:
# Save our modifications back to the same file
with open('data.p', 'wb') as file:
    pickle.dump(data, file)

In practice, Pickle files are often used to store machine learning datasets. In a typical scenario, we have a train.p and a test.p file. Both contain a dictionary with the array of features, the target values and optionally some meta information such as the names of the categories for classification tasks.
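As a rough illustration of this scenario, the sketch below builds a tiny train.p file and reads it back. The key names (features, targets, class_names) and the toy values are assumptions made for this example; real datasets may use other names. The file is written to a temporary directory so that the sketch leaves nothing behind.

```python
import os
import pickle
import tempfile

import numpy as np

# Hypothetical toy dataset: 4 samples, 2 features, 2 classes
train = {
    'features': np.array([[0.1, 1.2], [0.4, 0.9], [1.5, 0.2], [1.7, 0.1]]),
    'targets': np.array([0, 0, 1, 1]),
    'class_names': ['cat', 'dog'],  # meta information
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'train.p')

    # Store the whole dictionary in one Pickle file
    with open(path, 'wb') as file:
        pickle.dump(train, file)

    # Load it back
    with open(path, 'rb') as file:
        loaded = pickle.load(file)

# Unpack the features and the target values
X, y = loaded['features'], loaded['targets']
print(X.shape, y.shape)

# Translate the numeric targets into category names
print([loaded['class_names'][label] for label in y])
```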

Pickle files are very convenient since they work with a large number of Python objects. However, loading a Pickle file can execute arbitrary (and potentially malicious) code stored in it. For this reason, you should only load Pickle files from sources you trust, and always verify their integrity before loading them.
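One possible way to check integrity is to compare a checksum of the file against a value recorded when the file was created. The sketch below uses SHA-256 from the standard hashlib module. The sha256_of() and load_verified() helpers are hypothetical names introduced here, and the check only helps if the expected digest comes from a trusted source. It does not make Pickle safe for untrusted files.

```python
import hashlib
import os
import pickle
import tempfile


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's content."""
    with open(path, 'rb') as file:
        return hashlib.sha256(file.read()).hexdigest()


def load_verified(path, expected_digest):
    """Unpickle the file only if its checksum matches the expected one."""
    if sha256_of(path) != expected_digest:
        raise ValueError('Checksum mismatch, refusing to unpickle')
    with open(path, 'rb') as file:
        return pickle.load(file)


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'data.p')
    with open(path, 'wb') as file:
        pickle.dump({'y': [2, 3, 5]}, file)

    # Digest recorded when the file was created
    trusted_digest = sha256_of(path)

    # The untouched file loads normally
    restored = load_verified(path, trusted_digest)

    # Simulate tampering by appending bytes to the file
    with open(path, 'ab') as file:
        file.write(b'tampered')
    try:
        load_verified(path, trusted_digest)
        tamper_detected = False
    except ValueError:
        tamper_detected = True

print(restored, tamper_detected)
```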

We will now see another way to store our datasets using the Numpy .npy and .npz formats which are specifically designed to store potentially large Numpy arrays.

**Numpy npy files**

Numpy implements a .npy binary format. It has two main advantages.

* Unlike Pickle files, it doesn’t duplicate the data in memory before loading or saving the file which is very convenient for large datasets.
* It implements memory-mapping which allows reading small parts of a large dataset without loading the entire file into memory.

For these reasons, it’s the recommended way to store Numpy arrays. Let’s take an example. This time, we will save a Numpy array with three values of type float16.

In [7]:
import numpy as np

# Create the Numpy array
data = np.array([6.28318, 2.71828, 1], dtype=np.float16)

We can now store this array using the Numpy save() function. Unlike the Pickle dump() one, we don’t need to open the file beforehand with open(). Numpy will automatically create and manage the file object for us.

In [8]:
# Save it into a .npy file
np.save('data.npy', data)

In [9]:
# Read it back
np.load('data.npy') # array([6.28 , 2.719, 1.   ], dtype=float16)


array([6.28 , 2.719, 1.   ], dtype=float16)
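The memory-mapping advantage mentioned earlier can be sketched with the mmap_mode argument of load(). The array size and the big.npy file name below are arbitrary choices for illustration, and the file is written to a temporary directory.

```python
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'big.npy')

    # Save a larger array (one million float32 values)
    big = np.arange(1_000_000, dtype=np.float32)
    np.save(path, big)

    # mmap_mode='r' maps the file read-only instead of reading it entirely
    mapped = np.load(path, mmap_mode='r')
    is_memmap = isinstance(mapped, np.memmap)

    # Copy only a small slice into memory
    first_values = np.array(mapped[:3])

    # Release the mapping before the temporary directory is removed
    del mapped

print(is_memmap, first_values)
```

Only the slices that we actually index are read from disk, which is what makes this mode attractive for arrays that don’t fit comfortably in memory.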

The .npy format is made to store Numpy arrays, but it’s also possible to use the save() function to store other data structures such as dictionaries.



In [11]:
data = {
    'x': np.array([6.28318, 2.71828, 1], dtype=np.float16),
    'y': np.array([2, 3, 5])
}

# Save it into a .npy file
np.save('data.npy', data)

# Read it
np.load('data.npy', allow_pickle=True)

# Note: Numpy wraps the dictionary in an array of
# type `object` and uses pickle to save that object

array({'x': array([6.28 , 2.719, 1.   ], dtype=float16), 'y': array([2, 3, 5])},
      dtype=object)

In this case, Numpy wraps the data dictionary in a Numpy array with the object data type and saves its content using Pickle. In other words, the file has the .npy extension but contains a Pickle object.
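Since load() returns a zero-dimensional object array rather than the dictionary itself, we need one more step to get the dictionary back. A minimal sketch, assuming the same data dictionary as above and a temporary directory for the file, uses the array item() method:

```python
import os
import tempfile

import numpy as np

data = {
    'x': np.array([6.28318, 2.71828, 1], dtype=np.float16),
    'y': np.array([2, 3, 5])
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'data.npy')
    np.save(path, data)

    # The dictionary comes back wrapped in an array of type object
    wrapped = np.load(path, allow_pickle=True)

# The wrapper has no dimensions and the object data type
print(wrapped.shape, wrapped.dtype)

# item() extracts the single Python object stored in the array
restored = wrapped.item()
print(type(restored).__name__, list(restored.keys()))
```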

Note that, for security reasons, Numpy does not load pickled objects by default: the allow_pickle argument of the load() function defaults to False.



In [12]:
try:
    np.load('data.npy', allow_pickle=False)
except Exception as e:
    print(e) # Object arrays cannot be loaded when allow_pickle=False

Object arrays cannot be loaded when allow_pickle=False


The load() call raises an error, which we catch and print, because the data.npy file contains a Pickle object.

In our machine learning tasks, we usually want to work with a single file that contains both the array of features and the array of target values. We will now see how to do that without using the “array of objects” trick from above.

**The Numpy npz format**

It’s possible to store multiple arrays using the Numpy .npz format. Let’s take an example. Say that we want to save the x and y arrays from above into a single file.

In [13]:
# Create two Numpy arrays
x = np.array([6.28318, 2.71828, 1], dtype=np.float16)
y = np.array([2, 3, 5])

To achieve this, we need to use the savez() function. In our case, we will pass the two arrays as arguments, but you can use it to save any number of arrays.

In [14]:
# Save them into a .npz file
np.savez('data.npz', features=x, targets=y)

Note that we labeled each array with a keyword argument. Without labels, Numpy names the arrays arr_0, arr_1 and so on. We chose to use the names features and targets, but you can try with other labels. We use these labels to refer to each array when loading the data.npz file with the load() function.

In [15]:
# Load the npz file
with np.load('data.npz', allow_pickle=False) as npz_file:
    # It's a dictionary-like object
    print(list(npz_file.keys()))

    # Load the arrays
    print('x:', npz_file['features'])
    print('y:', npz_file['targets'])

['features', 'targets']
x: [6.28  2.719 1.   ]
y: [2 3 5]


Unlike .npy files, the load() function doesn’t return the content of the file directly but rather an NpzFile dictionary-like object which performs lazy loading, i.e., it loads the arrays only when we access them. For this reason, we need to use a **with statement** to manage the file resource.

This also implies that we cannot use the npz_file variable to read the arrays outside the **with** statement.

In [16]:
with np.load('data.npz', allow_pickle=False) as npz_file:
    # Read the "y" array (inside the with statement)
    print('y:', npz_file['targets'])

# Read the "y" array (outside the with statement)
try:
    print('y:', npz_file['targets'])
except Exception as e:
    print(e)

y: [2 3 5]
'NoneType' object has no attribute 'open'


The reason is that Python closes the data.npz file after the last line of the with statement, so we can no longer read its content. One solution is to load the arrays into an x and a y variable inside the with statement.

In [17]:
with np.load('data.npz', allow_pickle=False) as npz_file:
    # Load the arrays
    x = npz_file['features']
    y = npz_file['targets']

print('x:', x)
print('y:', y)

x: [6.28  2.719 1.   ]
y: [2 3 5]


Alternatively, we can collect all the arrays at once: inside the with statement, we get the (label, array) pairs using the items() function and build a {label: array} dictionary from them using the Python dict() function.
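A minimal sketch of the items() and dict() approach could look like this. It assumes the same x and y arrays and labels as above, and it writes data.npz to a temporary directory so that it can run on its own.

```python
import os
import tempfile

import numpy as np

x = np.array([6.28318, 2.71828, 1], dtype=np.float16)
y = np.array([2, 3, 5])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'data.npz')
    np.savez(path, features=x, targets=y)

    with np.load(path, allow_pickle=False) as npz_file:
        # Build a {label: array} dictionary while the file is still open
        data = dict(npz_file.items())

# The arrays are now in memory and usable outside the with statement
print(sorted(data.keys()))
print('x:', data['features'])
print('y:', data['targets'])
```

This reads every array in the file, so it gives up the benefit of lazy loading in exchange for convenience.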

We now know three different formats to store our data using Python.

* .csv files to store tabular data
* Pickle .p files for Python objects
* .npy and .npz files for (potentially large) Numpy arrays