---

## File I/O with NumPy



---

### Writing a NumPy Array to a File

Suppose we have an array `a` that we would like to export to a file.

In [None]:
import numpy as np

In [None]:
a = np.random.random((5,5))

In [None]:
print(a)

[[3.22102520e-01 3.83853526e-03 2.97083536e-02 9.11864235e-01
  9.34494371e-01]
 [3.43968157e-01 7.67771813e-01 1.43700773e-01 6.49020017e-01
  2.77784882e-02]
 [4.74948783e-01 7.75934484e-01 3.97296565e-01 3.92097538e-01
  8.38167344e-01]
 [6.36458399e-01 7.54254301e-04 7.99089014e-01 3.52367122e-02
  7.20586765e-01]
 [1.98810435e-01 3.15008299e-01 2.26772390e-01 8.09334656e-01
  9.73348553e-01]]


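One option is `np.save()`, which saves the array to a binary `.npy` file (the `.npy` extension is appended automatically if the file name does not already have it). Alternatively, `np.savetxt()` writes the array to a plain text file; its `fmt` and `delimiter` arguments control the number format and the column separator.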

In [None]:
np.save('array1', a)

In [None]:
np.savetxt('array2.txt', a)

In [None]:
np.savetxt('array3.csv', a, fmt='%.12f', delimiter=',')

---

### Reading a NumPy Array from a File

To read a binary `.npy` file into a NumPy array, we can use `np.load()`.

In [None]:
b = np.load('array1.npy')

In [None]:
b

array([[3.22102520e-01, 3.83853526e-03, 2.97083536e-02, 9.11864235e-01,
        9.34494371e-01],
       [3.43968157e-01, 7.67771813e-01, 1.43700773e-01, 6.49020017e-01,
        2.77784882e-02],
       [4.74948783e-01, 7.75934484e-01, 3.97296565e-01, 3.92097538e-01,
        8.38167344e-01],
       [6.36458399e-01, 7.54254301e-04, 7.99089014e-01, 3.52367122e-02,
        7.20586765e-01],
       [1.98810435e-01, 3.15008299e-01, 2.26772390e-01, 8.09334656e-01,
        9.73348553e-01]])

To read data from a text file into a NumPy array, we can use either `np.loadtxt()` or `np.genfromtxt()`.

- `np.loadtxt()` is the simpler function; it expects every row to contain the same number of values and provides only basic functionality
- `np.genfromtxt()` is more customizable and can handle missing values

Hence `np.genfromtxt()` is a reasonable default when you are unsure how clean a file is; for large, clean files `np.loadtxt()` is usually the faster choice in recent NumPy versions. With either function, you have to pass the `delimiter` argument if the values are separated by anything other than whitespace.

A detailed guide on importing data with `np.genfromtxt()`: https://numpy.org/doc/stable/user/basics.io.genfromtxt.html
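The handling of missing values mentioned above can be illustrated with a small sketch. The CSV content here is made up for the example and is read from an in-memory `io.StringIO` object rather than from a file on disk; `filling_values` is an existing `np.genfromtxt()` argument.

```python
import io
import numpy as np

# Hypothetical CSV content with one empty field in the second row.
text = "1.0,2.0,3.0\n4.0,,6.0\n"

# By default, genfromtxt replaces a missing float field with nan.
with_nan = np.genfromtxt(io.StringIO(text), delimiter=',')

# filling_values substitutes a chosen value for missing fields instead.
with_zero = np.genfromtxt(io.StringIO(text), delimiter=',', filling_values=0.0)

print(with_nan)
print(with_zero)
```

Whether `nan` or a fill value is the better choice depends on what the downstream code expects.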

In [None]:
c = np.loadtxt('array2.txt')

In [None]:
c

array([[3.22102520e-01, 3.83853526e-03, 2.97083536e-02, 9.11864235e-01,
        9.34494371e-01],
       [3.43968157e-01, 7.67771813e-01, 1.43700773e-01, 6.49020017e-01,
        2.77784882e-02],
       [4.74948783e-01, 7.75934484e-01, 3.97296565e-01, 3.92097538e-01,
        8.38167344e-01],
       [6.36458399e-01, 7.54254301e-04, 7.99089014e-01, 3.52367122e-02,
        7.20586765e-01],
       [1.98810435e-01, 3.15008299e-01, 2.26772390e-01, 8.09334656e-01,
        9.73348553e-01]])

In [None]:
d = np.genfromtxt('array3.csv', delimiter=',')

In [None]:
d

array([[3.22102520e-01, 3.83853526e-03, 2.97083536e-02, 9.11864235e-01,
        9.34494371e-01],
       [3.43968157e-01, 7.67771813e-01, 1.43700773e-01, 6.49020017e-01,
        2.77784882e-02],
       [4.74948783e-01, 7.75934484e-01, 3.97296565e-01, 3.92097538e-01,
        8.38167344e-01],
       [6.36458399e-01, 7.54254301e-04, 7.99089014e-01, 3.52367122e-02,
        7.20586765e-01],
       [1.98810435e-01, 3.15008299e-01, 2.26772390e-01, 8.09334656e-01,
        9.73348553e-01]])

An important thing to note when saving floating-point arrays to text files is ***loss of significance***. A text file stores only as many digits as the chosen format specifies, so values may be rounded when they are written, which introduces round-off errors and causes precision loss.

Note that this is not the case when using the binary `.npy` format.

In [None]:
a == b

array([[ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True]])

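When writing to a text file with the default format of `np.savetxt()`, `fmt='%.18e'` (scientific notation with 18 digits after the decimal point), precision loss does not occur for 64-bit floats, because 17 significant digits are enough to reproduce them exactly. A shorter format, such as the `'%.12f'` used for `array3.csv`, does round the values. Also note that this depends on the *datatype* of your array.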
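The effect of the format on precision can be checked with a short round-trip sketch. It uses a fresh array `x` and temporary files (both are assumptions made for the example, so that the files written earlier are left untouched), and a seeded generator only to make the run repeatable.

```python
import os
import tempfile
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
x = rng.random((5, 5))

with tempfile.TemporaryDirectory() as tmp:
    default_path = os.path.join(tmp, 'default.txt')
    short_path = os.path.join(tmp, 'short.csv')

    np.savetxt(default_path, x)                            # default fmt='%.18e'
    np.savetxt(short_path, x, fmt='%.12f', delimiter=',')  # 12 decimal places

    x_default = np.loadtxt(default_path)
    x_short = np.loadtxt(short_path, delimiter=',')

# Exact equality is expected only for the default format.
print(np.array_equal(x, x_default))
print(np.array_equal(x, x_short))

# Largest rounding error introduced by the shorter format.
print(np.abs(x - x_short).max())
```

The rounding error of the shorter format stays below about `5e-13`, which is harmless for many applications but is enough to make an exact `==` comparison fail.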

---

### Advanced: File I/O With Python

But what exactly happens when we use `np.genfromtxt()` to read data from a file? We can get a high-level overview of the mechanisms that take place in the background when we try to recreate the functionality using standard Python.

First, we have to open the file in order to be able to read data from it.

In [None]:
file = open('array3.csv')

Now we have a **file object** called `file` that gives us access to `array3.csv`. Using `.readlines()` on a file object, we can read all the lines from the file into a list.

In [None]:
lines = file.readlines()

In [None]:
s = 'hello'
s[0] = 'm'  # strings are immutable, so item assignment raises a TypeError

TypeError: 'str' object does not support item assignment

In [None]:
lines

['0.322102519573,0.003838535263,0.029708353561,0.911864235169,0.934494371035\n',
 '0.343968156552,0.767771813325,0.143700773197,0.649020016879,0.027778488244\n',
 '0.474948783366,0.775934484381,0.397296564911,0.392097538433,0.838167343918\n',
 '0.636458399008,0.000754254301,0.799089014128,0.035236712233,0.720586764954\n',
 '0.198810435127,0.315008298530,0.226772389646,0.809334656190,0.973348552803\n']

Now we have a list called `lines`, where each element is a line from the file `array3.csv`. Note that some cleaning needs to be done as these lines still contain whitespace characters like newlines.

In [None]:
str_test = '    test   '
s = str_test.strip()  # strip() returns a new string without leading/trailing whitespace
s

'test'

In [None]:
cleaned_lines = []
for line in lines:
    line = line.strip()
    cleaned_lines.append(line)

In [None]:
cleaned_lines

['0.322102519573,0.003838535263,0.029708353561,0.911864235169,0.934494371035',
 '0.343968156552,0.767771813325,0.143700773197,0.649020016879,0.027778488244',
 '0.474948783366,0.775934484381,0.397296564911,0.392097538433,0.838167343918',
 '0.636458399008,0.000754254301,0.799089014128,0.035236712233,0.720586764954',
 '0.198810435127,0.315008298530,0.226772389646,0.809334656190,0.973348552803']

The next step would be to convert each line to a list by splitting the string on the separator. This will lead to a list of lists, which is already quite similar to a two-dimensional NumPy array.

In [None]:
lists = []
for line in cleaned_lines:
    lst = line.split(',')
    lists.append(lst)

In [None]:
lists

[['0.322102519573',
  '0.003838535263',
  '0.029708353561',
  '0.911864235169',
  '0.934494371035'],
 ['0.343968156552',
  '0.767771813325',
  '0.143700773197',
  '0.649020016879',
  '0.027778488244'],
 ['0.474948783366',
  '0.775934484381',
  '0.397296564911',
  '0.392097538433',
  '0.838167343918'],
 ['0.636458399008',
  '0.000754254301',
  '0.799089014128',
  '0.035236712233',
  '0.720586764954'],
 ['0.198810435127',
  '0.315008298530',
  '0.226772389646',
  '0.809334656190',
  '0.973348552803']]

Note how all the elements still have the type of `str`, meaning they are text, not numbers. Luckily there is an easy fix for that.

In [None]:
type(lists[0][0])

str

In [None]:
float_lists = []
for lst in lists:
    flst = []
    for element in lst:
        element = float(element)
        flst.append(element)
    float_lists.append(flst)

In [None]:
float_lists

In [None]:
type(float_lists[0][0])

Now we can use this list of lists to create a NumPy array.

In [None]:
e = np.array(float_lists)

In [None]:
e

We can confirm that we got the same result as we would have gotten using `np.genfromtxt()` by comparing it to the array `d` from before.

In [None]:
e == d
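An element-wise `==` returns a whole array of booleans. When a single yes/no answer is enough, `np.array_equal()` and `np.allclose()` can be used instead. The sketch below uses small made-up arrays `p`, `q` and `r` rather than the arrays from this section.

```python
import numpy as np

p = np.array([[0.1, 0.2], [0.3, 0.4]])
q = p.copy()
r = p + 1e-13  # differs from p only by a tiny amount

# array_equal is True only if shapes and all elements match exactly.
print(np.array_equal(p, q))
print(np.array_equal(p, r))

# allclose tolerates small differences (default rtol=1e-05, atol=1e-08).
print(np.allclose(p, r))
```

`np.allclose()` is usually the more appropriate check for data that has been through a text file with a limited number of digits.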

Finally, we have to remember to close the file. Closing releases the file handle held by the operating system; for files opened for writing, it also ensures that buffered data is actually written to disk.

In [None]:
file.close()

Forgetting to close a file can leak file handles and, when writing, leave data unwritten. Hence, it is common to use `open()` together with a `with` statement. Code inside the block defined by the `with` statement can use the file, and the file is closed automatically when the block ends, even if an error occurs. This reduces the potential for errors and means you do not have to close the file manually.
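The automatic closing can be observed through the `closed` attribute of the file object. The following is a minimal sketch that creates its own throwaway file (the name `example.csv` and its content are made up for the example).

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create a small throwaway file so the sketch is self-contained.
    path = os.path.join(tmp, 'example.csv')
    with open(path, 'w') as f:
        f.write('1.0,2.0\n3.0,4.0\n')

    with open(path) as f:
        lines = f.readlines()
        closed_inside = f.closed  # the file is still open inside the block

    closed_after = f.closed       # the file was closed when the block ended

print(closed_inside, closed_after)
print(lines)
```

Note that the name `f` still exists after the block, but any attempt to read from it would raise an error because the underlying file is closed.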

Also note that our previous processing looped over essentially the same list several times. We can simplify this a little by looping over indices and updating the list in place.

In [None]:
with open('array3.csv') as f:
    lines = f.readlines()

In [None]:
lines

In [None]:
for i in range(len(lines)):
    lines[i] = lines[i].strip().split(',')
    for j in range(len(lines[i])):
        lines[i][j] = float(lines[i][j])

In [None]:
lines

In [None]:
arr = np.array(lines)

In [None]:
arr

We can confirm that the result is indeed the same as before.

In [None]:
arr == e

Note that you can condense this even more by using `map()` with `lambda` and remembering that `np.array()` has a `dtype` argument.

In [None]:
with open('array3.csv') as f:
    arr2 = np.array(list(map(lambda x : x.strip().split(','), f.readlines())), dtype=float)

In [None]:
arr2

In [None]:
arr == arr2
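As an alternative to `map()` with `lambda`, the same steps (strip, split, convert) can be written as a nested list comprehension. This is only a sketch of one possible style; it works on a small made-up file created in a temporary directory, and the names `rows` and `arr3` are chosen for the example.

```python
import os
import tempfile
import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    # Create a small throwaway CSV file so the sketch is self-contained.
    path = os.path.join(tmp, 'example.csv')
    with open(path, 'w') as f:
        f.write('0.5,1.5,2.5\n3.5,4.5,5.5\n')

    with open(path) as f:
        # Iterating over a file object yields one line at a time.
        rows = [[float(value) for value in line.strip().split(',')]
                for line in f]

arr3 = np.array(rows)
print(arr3)
```

Because the values are converted with `float()` inside the comprehension, no `dtype` argument is needed when creating the array.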