# <center> <div style="width: 370px;"> ![numpy title](pictures/numpy_tytle.jpg)

# <center> File I/O

## File I/O and Numpy

**Exploring Data with NumPy and File I/O**

Now that we can compute with and manipulate NumPy arrays and build record arrays, we can move on to real-world data analysis. This section covers file input/output (I/O): loading external data into NumPy arrays and saving processed results for later analysis.

The right way to load a dataset depends on its file type. Text files, SAS/Stata files, HDF5 files, and other formats each call for different tools.

HDF (Hierarchical Data Format) is one of the go-to choices for storing and organizing large amounts of data, and it is particularly useful for multidimensional, homogeneous arrays. To simplify working with HDF5 files, we'll introduce the `HDFStore` class from pandas.

As you work on data science projects, you'll encounter many file formats. Here we cover the most commonly used ones: ***NumPy binary files, plain text files*** (`.txt`), and ***Comma Separated Values*** (`.csv`) files.

## Text and CSV Files

Normally we would cover reading a file first and exporting second. Here we reverse the order: we create a record array, export it to a CSV file, then read that CSV file back into a NumPy record array and compare it with the original. The sample array contains an `id` field with consecutive integers, a `value` field with random floats, and a `day` field holding a day of the year as an integer. This exercise draws on what you learned in the previous sections and chapters. Let's start by creating the record array:

In [1]:
import numpy as np

In [2]:
id_ = np.arange(1000)

In [3]:
value = np.random.random(1000)

In [4]:
day = np.random.randint(0, 366, 1000)

In [5]:
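# np.rec.fromarrays is the public name of numpy.core.records.fromarrays;
# accessing it through np.core is deprecated in recent NumPy releases.
record_array = np.rec.fromarrays(
    [id_, value, day],
    names='id, value, day',
    formats='i4, f4, i4'
)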

In [6]:
record_array[:10]

rec.array([(0, 0.5390407 ,  25), (1, 0.3517334 ,  42),
           (2, 0.47612056, 289), (3, 0.15126763,  85),
           (4, 0.9585677 , 245), (5, 0.47141927, 307),
           (6, 0.7562439 , 303), (7, 0.44614092, 360),
           (8, 0.2888388 ,   1), (9, 0.5640747 , 360)],
          dtype=[('id', '<i4'), ('value', '<f4'), ('day', '<i4')])

We first create three NumPy arrays representing the fields we need: `id`, `value`, and `day`.
Then we use the `fromarrays()` function (available as `numpy.rec.fromarrays()`; `numpy.core.records.fromarrays()` is an older spelling of the same function) to merge the three arrays into a record array, assigning the field names through `names` and the field types through `formats`.
Next, we export the record array to a CSV file:

In [7]:
np.savetxt('./record.csv', record_array, fmt='%d, %.4f, %d')

We use the `numpy.savetxt()` function to handle the export. The first argument is the output file location, the second is the array to write, and the `fmt` argument sets the format of each row. We have three fields with two different data types, so `fmt` contains one format specifier per field, and we put `,` between the specifiers to separate the fields in the CSV file. If you prefer another delimiter, replace the commas in the `fmt` argument. We also drop redundant digits from the `value` field: `%.4f` writes only four digits after the decimal point, while `%d` writes the two integer fields as plain integers. You can now open the file at the location given in the first argument. For the record array shown above, the file starts like this:

```csv
0, 0.5390, 25
1, 0.3517, 42
2, 0.4761, 289
3, 0.1513, 85
4, 0.9586, 245
5, 0.4714, 307
6, 0.7562, 303
7, 0.4461, 360
8, 0.2888, 1
9, 0.5641, 360
...
```
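
If the CSV file should also carry its field names, `numpy.savetxt()` accepts a `header` argument. The following is a minimal sketch, not part of the original walkthrough: it uses a small hypothetical array named `small` and writes to an in-memory buffer instead of `./record.csv`. Passing `comments=''` removes the `# ` prefix that `savetxt()` otherwise puts in front of the header line.

```python
import io

import numpy as np

# Hypothetical three-row record array with the same fields as record_array.
small = np.rec.fromarrays(
    [np.arange(3), np.array([0.25, 0.5, 0.875]), np.array([10, 200, 365])],
    names='id, value, day',
    formats='i4, f4, i4',
)

buffer = io.StringIO()  # in-memory stand-in for a file on disk
np.savetxt(
    buffer,
    small,
    fmt='%d,%.4f,%d',       # one specifier per field, separated by commas
    header='id,value,day',  # written as the first line of the output
    comments='',            # no '# ' prefix in front of the header
)
csv_text = buffer.getvalue()
print(csv_text)
```

A header written this way can be skipped on reading with `skip_header=1`.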

Next, we read the CSV file into a record array and use the `value` field to generate a Boolean field named `mask`, which is `True` wherever the value is greater than or equal to 0.75. Then we append the new `mask` field to the record array. Let's read the CSV file first:

In [8]:
a = read_array = np.genfromtxt(
    './record.csv',
    dtype='i4, f4, i4',
    delimiter=',',
    skip_header=0
)

In [9]:
read_array.dtype

dtype([('f0', '<i4'), ('f1', '<f4'), ('f2', '<i4')])

In [10]:
a[0]

(0, 0.539, 25)

We use the `numpy.genfromtxt()` function to read the file into a NumPy structured array (a record array in the loose sense used here). The first argument is the file location. The optional `dtype` argument defines the data type of each field. Specifying it is recommended whenever you know the structure of the data, so that every column is interpreted as intended.

The optional `delimiter` argument sets the character that separates values in the file. By default, any run of consecutive whitespace acts as the delimiter. We pass `","` because we are reading a CSV file.

The optional `skip_header` argument skips the given number of lines at the beginning of the file. Our file has no header line with field names, so we pass `0`, which is also the default.

In addition to `skip_header`, the `numpy.genfromtxt()` function offers many more optional parameters for fine-tuning the resulting array, such as handling missing values and specifying fill values. For details on these parameters, consult the official documentation at https://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.genfromtxt.html.
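
As a hedged illustration of two of these parameters, the sketch below reads a small hypothetical CSV text (`csv_text`, held in memory rather than in a file) that has a header line and one empty `value` field. With `names=True`, `genfromtxt()` takes the field names from the first line, and with the default settings a missing float is read as `nan`.

```python
import io

import numpy as np

# Hypothetical CSV text: a header line, then three rows. The second row
# has an empty `value` field.
csv_text = "id,value,day\n0,0.2500,10\n1,,200\n2,0.8750,365\n"

data = np.genfromtxt(
    io.StringIO(csv_text),
    dtype='i4, f4, i4',
    delimiter=',',
    names=True,  # read the field names from the first line
)

print(data.dtype.names)
print(data['value'])  # the missing entry is filled with nan
```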

Now that the data is loaded into the record array, you may notice that the `value` field does not always display exactly the four decimal digits written to the CSV file: printed at full precision, a value can show extra digits. This comes from reading the field as `f4`, a 32-bit float. Most decimal fractions have no exact 32-bit binary representation, so NumPy stores the nearest representable value. That value still agrees with the file to four decimal places.
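
A small sketch of this effect, using the hypothetical text value `"0.5390"` as it would appear in the CSV file:

```python
import numpy as np

text_value = "0.5390"            # as written by '%.4f'
stored = np.float32(text_value)  # what an 'f4' field holds

print(float(stored))               # full-precision view of the stored value
print(abs(float(stored) - 0.539))  # rounding error, far below 1e-4
print(f"{float(stored):.4f}")      # four-digit formatting recovers the text
```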

Finally, the array we read has no meaningful field names: they default to `f0`, `f1`, and `f2`. Let's set the field names explicitly to make the data easier to work with.

In [11]:
read_array.dtype.names = ('id', 'value', 'day')
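
The mask step described earlier can be sketched as follows. This is one possible approach under stated assumptions, not necessarily the one the original exercise intended: `loaded` is a hypothetical stand-in for the array read from the CSV file, and the new field is added by building a wider dtype and copying the fields across. NumPy also ships `numpy.lib.recfunctions.append_fields()` for this purpose.

```python
import numpy as np

# Hypothetical stand-in for the array read from the CSV file.
loaded = np.array(
    [(0, 0.25, 10), (1, 0.75, 200), (2, 0.875, 365)],
    dtype=[('id', 'i4'), ('value', 'f4'), ('day', 'i4')],
)

# New dtype: the existing fields plus a Boolean `mask` field.
new_dtype = np.dtype(loaded.dtype.descr + [('mask', '?')])
with_mask = np.empty(loaded.shape, dtype=new_dtype)

# Copy the existing fields, then fill the new one.
for name in loaded.dtype.names:
    with_mask[name] = loaded[name]
with_mask['mask'] = loaded['value'] >= 0.75

print(with_mask)
```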

## `.npy` or `.npz`

Once you have finished working with an array, it is common practice to save it as a NumPy binary file. The binary format stores the array's shape and data type along with its data, so when you reload the array NumPy restores these attributes and you can pick up your work where you left off.

NumPy binary files are also portable across platforms. If you move the file to another machine with a different architecture, the stored information about the array remains intact and interpretable, which makes the format well suited to sharing data.

To create and read NumPy binary files, NumPy provides the functions `load()`, `save()`, `savez()`, and `savez_compressed()`.

In [12]:
example_array = np.arange(12).reshape(3,4)

In [13]:
# See the note below for why `allow_pickle` is set to False.
np.save('example.npy', example_array, allow_pickle=False)

In [14]:
d = np.load('example.npy')

In [15]:
d.shape

(3, 4)

In [16]:
d

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [17]:
(d == example_array).all()

True

The code above saves an array as a binary file and loads it back unchanged:

1. We create an array with shape `(3, 4)`.

2. We save the array as a binary file with `np.save()`.

3. We load the saved array from the binary file with `np.load()`.

4. We verify that the loaded array has the original shape and the same elements as the original array.

> **Note:** Unless the `dtype` of the array includes Python objects, set `allow_pickle=False` when calling both `numpy.save()` and `numpy.load()`. With this setting, the array is written and read in NumPy's own binary format, and the call fails instead of falling back to the pickling mechanism. Keep in mind that pickles are not secure against erroneous or maliciously constructed data.
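
To illustrate the note, here is a minimal sketch with two hypothetical arrays, `numeric` and `objects`, written to in-memory buffers rather than files. The numeric array round-trips with `allow_pickle=False`, while saving the object array is refused with a `ValueError`.

```python
import io

import numpy as np

numeric = np.arange(3)                                # plain integer dtype
objects = np.array([{'a': 1}, 'text'], dtype=object)  # holds Python objects

buffer = io.BytesIO()
np.save(buffer, numeric, allow_pickle=False)  # works: no pickling needed
buffer.seek(0)
restored = np.load(buffer, allow_pickle=False)

try:
    np.save(io.BytesIO(), objects, allow_pickle=False)
    refused = False
except ValueError:
    refused = True  # object arrays cannot be written without pickle

print(restored, refused)
```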

To save multiple arrays into a single file, use the `savez()` function. To reduce the size of the file, use the `savez_compressed()` function, which stores the arrays in the same way but compresses them.

In [18]:
x = np.arange(5)
y = np.arange(10)

In [19]:
np.savez('x_y.npz', x, y)

In [20]:
npzfile = np.load('x_y.npz')

In [21]:
npzfile.files

['arr_0', 'arr_1']

In [22]:
npzfile['arr_0']

array([0, 1, 2, 3, 4])

In [23]:
npzfile['arr_1']

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

When you save several arrays in a single file, you can name each one with a keyword argument such as `first_array=x`, and the array is saved under that name. Arrays passed without a keyword get default names: `arr_0` for the first, `arr_1` for the second, and so on.

In [24]:
np.savez_compressed('x_y_compressed.npz', first_array=x , second_array=y)

In [25]:
npzfile = np.load('x_y_compressed.npz')

In [26]:
npzfile.files

['first_array', 'second_array']

In [27]:
npzfile['first_array']

array([0, 1, 2, 3, 4])

In [28]:
npzfile['second_array']

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
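
To get a feel for the effect of compression, the following sketch compares the sizes of the two archive types. It assumes a hypothetical, highly repetitive array (`zeros`) and in-memory buffers instead of files; real data will usually compress less well. It also shows that the object returned by `np.load()` for an `.npz` archive can be used as a context manager.

```python
import io

import numpy as np

zeros = np.zeros(100_000)  # highly repetitive data compresses well

plain, packed = io.BytesIO(), io.BytesIO()
np.savez(plain, zeros=zeros)
np.savez_compressed(packed, zeros=zeros)

plain_size = len(plain.getvalue())
packed_size = len(packed.getvalue())
print(plain_size, packed_size)

# The loaded archive can be used in a `with` block, which closes it
# when the block ends.
packed.seek(0)
with np.load(packed) as archive:
    same = np.array_equal(archive['zeros'], zeros)
print(same)
```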

> **Note:** As a general guideline, prefer `numpy.save` and `numpy.load` over `numpy.ndarray.tofile` and `numpy.fromfile`. The latter pair loses information about endianness and precision, so it is best suited to temporary or scratch storage and less suitable when data must be preserved reliably or exchanged between systems.
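
The difference can be seen in a short sketch. The file names `raw.bin` and `saved.npy` are hypothetical and live in a temporary directory. The raw file holds only the bytes of the array, so the reader has to supply the dtype and gets a flat array back, while the `.npy` file restores shape and dtype on its own.

```python
import os
import tempfile

import numpy as np

original = np.arange(12, dtype=np.int32).reshape(3, 4)

with tempfile.TemporaryDirectory() as tmp:
    raw_path = os.path.join(tmp, 'raw.bin')
    npy_path = os.path.join(tmp, 'saved.npy')

    original.tofile(raw_path)    # raw bytes only
    np.save(npy_path, original)  # bytes plus a header with shape and dtype

    # fromfile must be told the dtype, and it returns a flat array.
    from_raw = np.fromfile(raw_path, dtype=np.int32)
    from_npy = np.load(npy_path)

print(from_raw.shape, from_npy.shape)
```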

## `json`

> **Warning:** NumPy arrays are not directly JSON serializable
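
A common workaround, sketched here under the assumption that converting the array to nested Python lists is acceptable, is to call `tolist()` before serializing. The dictionary keys `shape` and `data` are hypothetical names chosen for this example. Note that the dtype is not stored in the JSON text.

```python
import json

import numpy as np

arr = np.arange(6).reshape(2, 3)

try:
    json.dumps(arr)
    direct_ok = True
except TypeError:
    direct_ok = False  # ndarray is not a type the json module knows

# Convert to nested Python lists first, then serialize.
encoded = json.dumps({'shape': arr.shape, 'data': arr.tolist()})

# Rebuild the array from the parsed lists. The dtype is whatever NumPy
# infers from the parsed numbers, since JSON does not record it.
decoded = np.array(json.loads(encoded)['data'])
print(direct_ok, encoded, decoded.shape)
```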