# <center> <div style="width: 370px;"> ![numpy title](pictures/numpy_tytle.jpg)

# <center>Available Data Types in NumPy

## Numerical Types: int, bool, float, and complex

When diving into data science and numerical computations, it's crucial to start with a solid understanding of numeric data types. In NumPy, you'll encounter four primary numeric types, each offering a range of sizes to accommodate various data needs.


Here's a breakdown of these numeric types:




| Name     | # of Bits | Python Type | NumPy Type    |
| -------- | --------- | ----------- | ------------- |
| Integer  | 64        | int         | np.int_       |
| Boolean  | 8         | bool        | np.bool_      |
| Float    | 64        | float       | np.float64    |
| Complex  | 128       | complex     | np.complex128 |



These types match their Python counterparts, which makes it easy to move between plain Python and NumPy. NumPy also provides smaller versions of these types: 8-bit, 16-bit, and 32-bit integers, 32-bit single-precision floating-point numbers, and 64-bit complex numbers built from two single-precision floats. Older tutorials write `np.float_` and `np.complex_`; these aliases were removed in NumPy 2.0 in favor of `np.float64` and `np.complex128`. You can find a comprehensive list of the available types in the NumPy documentation.
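
As a rough illustration of the size differences, the sketch below prints the number of bytes per element for a handful of fixed-size types and looks up the range of an 8-bit integer. The variable name `int8_range` is only illustrative.

```python
import numpy as np

# Fixed-size dtypes and the number of bytes each element occupies.
for dt in (np.int8, np.int16, np.int32, np.int64,
           np.float32, np.float64, np.complex64, np.complex128):
    print(np.dtype(dt).name, np.dtype(dt).itemsize)

# np.iinfo reports the representable range of an integer type.
int8_range = (np.iinfo(np.int8).min, np.iinfo(np.int8).max)
print(int8_range)
```

Smaller types save memory, at the cost of a narrower range or lower precision.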

Starting your journey with a strong grasp of these numeric data types will pave the way for effective data manipulation and analysis in NumPy.

To specify the type when creating an array, you can provide a `dtype` argument:

In [1]:
import numpy as np

In [2]:
np.array([1, 3, 5.5, 7.7, 9.2], dtype=np.single)

array([1. , 3. , 5.5, 7.7, 9.2], dtype=float32)

In [3]:
np.array([1, 3, 5.5, 7.7, 9.2], dtype=np.uint8)

array([1, 3, 5, 7, 9], dtype=uint8)

`np.single` is named after the C type `float`, and NumPy maps it to the fixed-size type `np.float32`, which is why the output above reports `dtype=float32`. If the values you provide don't fit the requested dtype, NumPy either casts them to fit or, when no conversion is possible, raises an error. In input 3, for example, the fractional values `5.5`, `7.7`, and `9.2` are truncated to `5`, `7`, and `9` so that they fit `np.uint8`.
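
The two outcomes can be sketched side by side. The example below is a minimal illustration; the string `"abc"` is just an assumed example of a value that has no integer conversion.

```python
import numpy as np

# np.single and np.float32 describe the same dtype.
same_dtype = np.dtype(np.single) == np.dtype(np.float32)

# Fractional values are truncated toward zero when cast to an integer dtype.
truncated = np.array([1, 3, 5.5, 7.7, 9.2], dtype=np.uint8)

# A value with no sensible conversion raises an error instead.
try:
    np.array(["abc"], dtype=np.uint8)
    raised = False
except ValueError:
    raised = True

print(same_dtype, truncated, raised)
```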

## String Types: Sized Unicode

Strings behave a little strangely in NumPy code because NumPy needs to know how many bytes to expect, which isn’t usually a factor in Python programming. Luckily, NumPy does a pretty good job at taking care of less complex cases for you.

In [4]:
import numpy as np

In [5]:
names = np.array(["bob", "amy", "han"], dtype=str)
names

array(['bob', 'amy', 'han'], dtype='<U3')

In [6]:
names.itemsize

12

In [7]:
names = np.array(["bob", "amy", "han"])
names

array(['bob', 'amy', 'han'], dtype='<U3')

In [8]:
more_names = np.array(["bobo", "jehosephat"])

In [9]:
more_names.dtype, more_names.itemsize

(dtype('<U10'), 40)

In [10]:
np.concatenate((names, more_names))

array(['bob', 'amy', 'han', 'bobo', 'jehosephat'], dtype='<U10')

In input 5, you pass Python's built-in `str` type as the `dtype`, but the output shows that NumPy converted it to a little-endian Unicode string of size 3. When you check the item size in input 6, you see that each item is 12 bytes: three 4-byte Unicode characters.

> **Note:** NumPy dtypes also record the endianness of your data. The dtype `'<U3'` means that each value holds three Unicode characters of 4 bytes each, stored little-endian: the least-significant byte of each character comes first in memory and the most-significant byte comes last. A `dtype` of `'>U3'` means the reverse order.
> 
> For example, the Unicode character "🐍" has the code point U+1F40D. With a dtype of `'<U1'`, it is stored as the bytes `0x0D 0xF4 0x01 0x00`, while with a `dtype` of `'>U1'`, it is stored as `0x00 0x01 0xF4 0x0D`. You can try this yourself by creating an array of emojis with one of these dtypes and calling `.tobytes()` on it.
> 
> If you're interested in delving deeper into how Python handles the binary representation of your typical Python data types, you might find the official documentation for the [struct library](https://docs.python.org/3/library/struct.html#struct-alignment) valuable. The struct library is a standard Python module that specializes in working with raw bytes, providing further insights into data representation and alignment.
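
The experiment suggested in the note can be sketched as follows. The variable names `little` and `big` are illustrative, and the snake emoji is written with its Unicode name so the code point is unambiguous.

```python
import numpy as np

snake = "\N{SNAKE}"  # code point U+1F40D

# Same character, stored with opposite byte orders.
little = np.array([snake], dtype="<U1").tobytes()
big = np.array([snake], dtype=">U1").tobytes()

print(little.hex())  # 0df40100
print(big.hex())     # 0001f40d
```

Each character occupies four bytes either way; only the order of those bytes differs.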

When you combine `names` with an array that holds longer items (as in input 10), NumPy works out the item size that the new array needs and widens every item to `<U10`.

However, things take a different turn when you attempt to modify one of the slots by assigning a value that exceeds the capacity of the specified dtype:

In [11]:
names[2] = "jamima"

names

array(['bob', 'amy', 'jam'], dtype='<U3')

Here the value is silently truncated. NumPy's automatic size detection only applies when a new array is created; an existing `<U3` array holds three characters per item, and anything beyond that is discarded.

In short, NumPy handles strings well, but you need to keep the item size in mind. Make sure the dtype is wide enough before you modify an array in place, or your values will be truncated.
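
One way to avoid the truncation, sketched here under the assumption that ten characters is enough for your data, is to widen the dtype before assigning, or to reserve the width when the array is created. The names `wider` and `roomy` are illustrative.

```python
import numpy as np

names = np.array(["bob", "amy", "han"])  # inferred dtype holds 3 characters

# astype returns a new array whose items have room for 10 characters.
wider = names.astype("<U10")
wider[2] = "jamima"

# Alternatively, reserve the width up front.
roomy = np.array(["bob", "amy", "han"], dtype="U10")
roomy[2] = "jamima"

print(wider, roomy)
```

Note that `astype` makes a copy, so the original `names` array keeps its three-character limit.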

## Structured Arrays

Originally, you learned that array items all have to be the same data type, but that wasn’t entirely correct. NumPy has a special kind of array, called a ***record array*** or ***structured array***, with which you can specify a type and, optionally, a name on a per-column basis. This makes sorting and filtering even more powerful, and it can feel similar to working with data in Excel, CSVs, or relational databases.

Structured arrays are useful when you want to compute on data while keeping closely related values together. For example, if each incident record holds geographic coordinates and a time of occurrence, you can compute a result and still look up the matching location and time for later visualization. Multiple data types can live in one NumPy array this way, but one principle still holds: the data within each field (think of it as a column of the records) must be homogeneous. Here are some simple examples that show how it works:

In [12]:
import numpy as np

In [13]:
x = np.empty((2,), dtype = ('i4, f4, U10'))

In [14]:
x[:] = [(1, 0.5, 'NumPy'), (10, -0.5, 'Essential')]

In the preceding example, we created a one-dimensional record array with `numpy.empty()` and specified a data type for each field. The first field is `i4`, a 32-bit signed integer (`i` for integer, `4` for 4 bytes, equivalent to `np.int32`). The second is `f4`, a 32-bit float (`f` for float, again 4 bytes), and the third is `U10`, a Unicode string of at most 10 characters. We then assigned values to the array in the same order as the field types.

When `x` is printed (see input 22 below), it shows two records with three fields each, and the dtype lists the default field names `f0`, `f1`, and `f2`. You can also choose your own field names, as later examples show.

Note the `<` symbol in front of `i4` and `f4` in the print-out. It marks the byte order as **little-endian**: the least-significant byte is stored at the lowest memory address. A `>` prefix would mark big-endian order.
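
To make the byte-order prefix concrete, the sketch below stores the integer 1 as a 4-byte value in both orders and also reads the default field names from a record dtype. The names `one_le` and `one_be` are illustrative.

```python
import numpy as np

# The integer 1 as a 4-byte value, little-endian versus big-endian.
one_le = np.array([1], dtype="<i4").tobytes()
one_be = np.array([1], dtype=">i4").tobytes()
print(one_le)  # b'\x01\x00\x00\x00'
print(one_be)  # b'\x00\x00\x00\x01'

# Default field names assigned when none are given.
x = np.empty((2,), dtype="i4, f4, U10")
field_names = x.dtype.names
print(field_names)  # ('f0', 'f1', 'f2')
```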

In [15]:
x[0]

(1, 0.5, 'NumPy')

In [16]:
x['f2']

array(['NumPy', 'Essential'], dtype='<U10')

Data retrieval works as before: an index returns a record. In addition, a field name returns the values of that field across all records, so in the previous example `f2` returned the string field. In the following example, we create a view of field `f0` of `x`, named `y`, and see how it interacts with the original record array:

In [17]:
y = x['f0']

In [18]:
y

array([ 1, 10], dtype=int32)

In [19]:
y.strides

(48,)

In [20]:
y[:] = y + 0.5

In [21]:
y

array([ 1, 10], dtype=int32)

In [22]:
x

array([( 1,  0.5, 'NumPy'), (10, -0.5, 'Essential')],
      dtype=[('f0', '<i4'), ('f1', '<f4'), ('f2', '<U10')])

Here `y` is a view of field `f0` in the record array `x`. Record arrays keep the usual NumPy behavior: the scalar `0.5` is broadcast across all of `y`, and the assignment respects the field's dtype. `y + 0.5` evaluates to `[1.5, 10.5]`, but because `f0` holds 32-bit integers, the values are truncated when written back and `y` remains `[1, 10]`. The stride of 48 bytes shows that consecutive elements of `y` sit one full record apart in memory.

Because `y` is a view, it shares its memory with `x`, so any change written through `y` appears in `x`. In this case the truncated values equal the originals, which is why the print-out of `x` looks unchanged.
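
A write that does change a value makes the shared memory easier to see. The following sketch rebuilds `x` as defined earlier; the value `7` is an arbitrary choice for illustration.

```python
import numpy as np

x = np.empty((2,), dtype="i4, f4, U10")
x[:] = [(1, 0.5, "NumPy"), (10, -0.5, "Essential")]

y = x["f0"]             # a view of the field, not a copy
y[:] = y + 0.5          # 1.5 and 10.5 are truncated back to 1 and 10
after_add = y.tolist()

y[0] = 7                # write through the view
shared = np.shares_memory(x, y)

print(after_add, x["f0"], shared)
```

After the last assignment, the first record of `x` holds `7` in field `f0`, even though `x` was never assigned to directly.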

Before delving further into record arrays, let's clarify how to define one. The simplest approach is demonstrated in the previous example, where we initialize a NumPy array and employ a string argument to specify the data type of its fields. NumPy provides several acceptable forms of string arguments, and you can choose the most suitable one for your needs. For detailed information, you can refer to the [official documentation](https://numpy.org/doc/stable/user/basics.rec.html). Here are some commonly used representations:

| Type code | Meaning |
| ---------- | -------------------------- |
| `b1`       | Boolean (1 byte)           |
| `i1, i2, i4, i8` | Signed integers with 1, 2, 4, and 8 bytes |
| `u1, u2, u4, u8` | Unsigned integers with 1, 2, 4, and 8 bytes |
| `f2, f4, f8` | Floats with 2, 4, and 8 bytes |
| `c8, c16`   | Complex with 8 and 16 bytes |
| `S<n>` (legacy alias `a<n>`) | Fixed-length byte strings of length `n` |
| `U<n>`     | Fixed-length Unicode strings of `n` characters |
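
If you are unsure what a type code means, one quick check is to pass it to `np.dtype` and compare the result with a named type. The sketch below does this for a few codes; the `codes` dictionary is only an illustrative selection, and `S` and `U` are the byte-string and Unicode-string codes.

```python
import numpy as np

# Type codes paired with the named types they are expected to match.
codes = {
    "b1": np.bool_,
    "i4": np.int32,
    "u2": np.uint16,
    "f8": np.float64,
    "c16": np.complex128,
}
matches = {code: np.dtype(code) == np.dtype(t) for code, t in codes.items()}
print(matches)

# String codes: bytes use 1 byte per character, Unicode uses 4.
byte_string_bytes = np.dtype("S10").itemsize
unicode_bytes = np.dtype("U10").itemsize
print(byte_string_bytes, unicode_bytes)
```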

You can also prefix a type code with a repeat count or a shape to give the field its own dimensions; it still counts as a single field in the record array. The following example uses both forms:

In [23]:
z = np.ones((2,), dtype = ('3i4, (2,3)f4'))

In [24]:
z

array([([1, 1, 1], [[1., 1., 1.], [1., 1., 1.]]),
       ([1, 1, 1], [[1., 1., 1.], [1., 1., 1.]])],
      dtype=[('f0', '<i4', (3,)), ('f1', '<f4', (2, 3))])
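
Indexing such a field returns an array whose shape is the array's own shape followed by the field's shape. A minimal check, using the same `z` as above with illustrative variable names:

```python
import numpy as np

z = np.ones((2,), dtype="3i4, (2,3)f4")

f0_shape = z["f0"].shape   # 2 records, each holding 3 integers
f1_shape = z["f1"].shape   # 2 records, each holding a 2x3 block of floats
n_fields = len(z.dtype.names)

print(f0_shape, f1_shape, n_fields)
```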

In the previous example, field `f0` is a one-dimensional array with size `3` and `f1` is a two-dimensional array with shape `(2, 3)`. Now, we are clear about the structure of a record array and how to define it. You might be wondering whether the default field name can be changed to something meaningful in your analysis? Of course it can! This is how:

In [25]:
x.dtype.names 

('f0', 'f1', 'f2')

In [26]:
x.dtype.names = ('id', 'value', 'note') 

In [27]:
x

array([( 1,  0.5, 'NumPy'), (10, -0.5, 'Essential')],
      dtype=[('id', '<i4'), ('value', '<f4'), ('note', '<U10')])

Assigning a new tuple of names to the `names` attribute of the `dtype` object renames the fields. You can also set the names when you create the array, using either a list of tuples or a dictionary. The following examples create two identical record arrays with custom field names, first with a list and then with a dictionary:

In [28]:
list_x = np.zeros((2,), dtype = [('id', 'i4'), ('value', 'f4', (2,))]) 

In [29]:
list_x

array([(0, [0., 0.]), (0, [0., 0.])],
      dtype=[('id', '<i4'), ('value', '<f4', (2,))])

In [30]:
dict_x = np.zeros((2,), dtype = {'names':['id', 'value'], 'formats':['i4', '2f4']}) 

In [31]:
dict_x

array([(0, [0., 0.]), (0, [0., 0.])],
      dtype=[('id', '<i4'), ('value', '<f4', (2,))])

In the list example, each field is a tuple of (field name, data type, shape). The shape is optional; you can also fold it into the data type string, as `'2f4'` does in the dictionary example. When you use a dictionary, two keys are required (`names` and `formats`), and their lists must have the same length.
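
To confirm that the two spellings describe the same layout, you can build the dtypes on their own and compare them. This is a small sketch; `list_dtype` and `dict_dtype` are illustrative names.

```python
import numpy as np

list_dtype = np.dtype([("id", "i4"), ("value", "f4", (2,))])
dict_dtype = np.dtype({"names": ["id", "value"], "formats": ["i4", "2f4"]})

same = list_dtype == dict_dtype
value_shape = list_dtype["value"].shape  # shape of the 'value' sub-array

print(same, value_shape)
```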

Before moving on, here is how to access several fields of a record array at once. The following example uses the array `x` from earlier, with its custom field names `id`, `value`, and `note`:

In [32]:
x[['id', 'note']]

array([( 1, 'NumPy'), (10, 'Essential')],
      dtype={'names': ['id', 'note'], 'formats': ['<i4', '<U10'], 'offsets': [0, 8], 'itemsize': 48})
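
The `offsets` and `itemsize` entries in this output indicate that the result keeps the original 48-byte record layout, including the gap left by the omitted `value` field. If you need a compact copy instead, one option is `repack_fields` from `numpy.lib.recfunctions`. The sketch below assumes a recent NumPy version (1.16 or later), where multi-field indexing returns a view.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

x = np.empty((2,), dtype=[("id", "i4"), ("value", "f4"), ("note", "U10")])
x[:] = [(1, 0.5, "NumPy"), (10, -0.5, "Essential")]

subset = x[["id", "note"]]          # keeps the original record layout
packed = rfn.repack_fields(subset)  # copy without the gap left by 'value'

sizes = (subset.dtype.itemsize, packed.dtype.itemsize)
print(sizes)
```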

To learn more, see [SQL Like Query on Structured Data](https://github.com/pytopia/ML/blob/main/Machine%20Learning/02.%20Data%20Processing/02.%20Numpy/08%20More%20on%20Data%20Types.ipynb) and [Dates and time in NumPy](https://github.com/pytopia/ML/blob/main/Machine%20Learning/02.%20Data%20Processing/02.%20Numpy/08%20More%20on%20Data%20Types.ipynb).