
# **Structured Data: NumPy's Structured Arrays**

Structured arrays let you store values of different data types in a single array, much like the rows of a table or spreadsheet. You define a compound data type with named fields and give each field its own format.

In [52]:
# Import NumPy

import numpy as np

In [36]:
# Define lists for name, age, and weight
name = ['Alice', 'Bob', 'Cathy', 'Doug']

age = [25, 45, 37, 19]

weight = [55.0, 85.5, 68.0, 61.5]

In [37]:
# Create a NumPy array of zeros with integer data type
x = np.zeros(4, dtype=int)

In [38]:
# Use a compound data type for structured arrays
# Define the structure with names and formats for each field
data = np.zeros(4, dtype={'names':('name', 'age', 'weight'),
                          'formats':('U10', 'i4', 'f8')})
# Print the data type of the structured array
print(data.dtype)

[('name', '<U10'), ('age', '<i4'), ('weight', '<f8')]


In [39]:
# Assign data to the structured array
data['name'] = name
data['age'] = age
data['weight'] = weight
# Print the structured array
print(data)

[('Alice', 25, 55. ) ('Bob', 45, 85.5) ('Cathy', 37, 68. )
 ('Doug', 19, 61.5)]
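Filling the array field by field, as above, is one option. As a hedged alternative, the same array can be built in one step from a list of tuples, one tuple per record. The sketch below reuses the values from the lists above; `person_dtype`, `rows`, and `people` are illustrative names that do not appear in the original notebook.

```python
import numpy as np

name = ['Alice', 'Bob', 'Cathy', 'Doug']
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]

# Same three fields as the `data` array above (illustrative name)
person_dtype = np.dtype([('name', 'U10'), ('age', 'i4'), ('weight', 'f8')])

# zip() pairs the lists up into one tuple per record
rows = list(zip(name, age, weight))
people = np.array(rows, dtype=person_dtype)

print(people.shape)   # one-dimensional: one element per record
print(people['age'])  # a single field across all records
print(people[1])      # a single record
```

Which form is more convenient tends to depend on whether the data arrives column by column or row by row.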


In [40]:
# Access the 'name' field of every record
data['name']

array(['Alice', 'Bob', 'Cathy', 'Doug'], dtype='<U10')

In [41]:
# Access the first record (row) of the structured array
data[0]

np.void(('Alice', 25, 55.0), dtype=[('name', '<U10'), ('age', '<i4'), ('weight', '<f8')])

In [42]:
# Access the 'name' field of the last record
data[-1]['name']

np.str_('Doug')

In [43]:
# Select the names of records where age is under 30, using a boolean mask
data[data['age'] < 30]['name']

array(['Alice', 'Doug'], dtype='<U10')
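Boolean masks on fields can be combined, and whole records can be sorted by a field. The following is a minimal sketch that rebuilds the same `data` array so it runs on its own; the thresholds and the names `mask` and `by_age` are chosen purely for illustration.

```python
import numpy as np

data = np.array([('Alice', 25, 55.0), ('Bob', 45, 85.5),
                 ('Cathy', 37, 68.0), ('Doug', 19, 61.5)],
                dtype=[('name', 'U10'), ('age', 'i4'), ('weight', 'f8')])

# Combine conditions with &; each comparison needs its own parentheses
mask = (data['age'] < 40) & (data['weight'] > 60)
print(data[mask]['name'])

# Sort whole records by the 'age' field; np.sort returns a sorted copy
by_age = np.sort(data, order='age')
print(by_age['name'])
```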

# **Exploring Structured Array Creation**

This section shows several ways to specify a compound data type: a dictionary of names and formats, a list of tuples, and a comma-separated string. Formats can be written as type strings with an explicit size (such as 'U10' or 'i4') or as Python and NumPy type objects.

In [44]:
# Define a compound data type using a dictionary with names and formats
np.dtype({'names':('name', 'age', 'weight'),
          'formats':('U10', 'i4', 'f8')})

dtype([('name', '<U10'), ('age', '<i4'), ('weight', '<f8')])

In [45]:
# Define a compound data type using Python and NumPy types as formats
# (plain `int` maps to NumPy's default integer, int64 in the output below)
np.dtype({'names':('name', 'age', 'weight'),
          'formats':((np.str_, 10), int, np.float32)})

dtype([('name', '<U10'), ('age', '<i8'), ('weight', '<f4')])

In [46]:
# Define a compound data type using a list of (name, type string) tuples
# ('S10' is a 10-byte string, unlike the Unicode 'U10' used earlier)
np.dtype([('name', 'S10'), ('age', 'i4'), ('weight', 'f8')])

dtype([('name', 'S10'), ('age', '<i4'), ('weight', '<f8')])

In [47]:
# Define a compound data type using a comma-separated string of type strings
# (fields get the default names f0, f1, f2)
np.dtype('S10,i4,f8')

dtype([('f0', 'S10'), ('f1', '<i4'), ('f2', '<f8')])
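To check that these specifications describe the same layout, the resulting dtype objects can be compared and inspected. The sketch below is illustrative only; the variable names are made up for this example. It also shows how the choice between 'U10' and 'S10' changes the record size, since NumPy stores each Unicode character in 4 bytes.

```python
import numpy as np

# Dictionary form and list-of-tuples form of the same compound type
from_dict = np.dtype({'names': ('name', 'age', 'weight'),
                      'formats': ('U10', 'i4', 'f8')})
from_list = np.dtype([('name', 'U10'), ('age', 'i4'), ('weight', 'f8')])
print(from_dict == from_list)

# Same fields, but with a byte-string name instead of a Unicode one
bytes_dt = np.dtype([('name', 'S10'), ('age', 'i4'), ('weight', 'f8')])

# Bytes per record: 40 + 4 + 8 for 'U10', 10 + 4 + 8 for 'S10'
print(from_list.itemsize, bytes_dt.itemsize)

# Field names, and the (dtype, byte offset) pair stored for one field
print(from_list.names)
print(from_list.fields['age'])
```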

# **More Advanced Compound Types**

A field in a compound type can itself be a fixed-shape array. The example below gives each record an integer ID and a 3x3 float matrix, which is useful when every record needs to carry a small block of numeric data alongside its scalar fields.

In [48]:
# Define a compound data type with an integer ID and a 3x3 float matrix
tp = np.dtype([('id', 'i8'), ('mat', 'f8', (3, 3))])
# Create a zero array with the defined compound data type
X = np.zeros(1, dtype=tp)
# Print the first element of the array
print(X[0])
# Print the 'mat' field of the first element
print(X['mat'][0])

(0, [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
[[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]
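With more than one record, the sub-array field behaves like a stacked array with one matrix per record. A minimal sketch, assuming the same `tp` type as above with two records instead of one; the values assigned are arbitrary.

```python
import numpy as np

tp = np.dtype([('id', 'i8'), ('mat', 'f8', (3, 3))])
X = np.zeros(2, dtype=tp)

# The field adds the sub-array shape to the array shape: (2, 3, 3)
print(X['mat'].shape)

# X['mat'] is a view, so writing through it updates the records in X
X['id'] = [10, 11]
X['mat'][0] = np.eye(3)
X['mat'][1] = np.arange(9).reshape(3, 3)
print(X[0])

# Each record occupies 8 bytes for the id plus 9 * 8 bytes for the matrix
print(tp.itemsize)
```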


# **Record Arrays: Structured Arrays with a Twist**

Record arrays expose the fields of a structured array as attributes, so you can write data_rec.age instead of the dictionary-like data['age']. The convenience has a cost: in the timings below, field access on the record array takes microseconds, compared with roughly 130 ns on the plain structured array, whether the field is reached by key or by attribute.

In [49]:
# Access the 'age' field of the structured array
data['age']

array([25, 45, 37, 19], dtype=int32)

In [50]:
# Create a record array view of the structured array
data_rec = data.view(np.recarray)
# Access the 'age' field using attribute-like access in the record array
data_rec.age

array([25, 45, 37, 19], dtype=int32)
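Because `data.view(np.recarray)` creates a view rather than a copy, changes made through the record array show up in the original structured array. The sketch below illustrates this and, as an aside, builds a record array directly with `np.rec.fromarrays`; the name `rec` and the two-row sample are illustrative choices.

```python
import numpy as np

data = np.array([('Alice', 25, 55.0), ('Bob', 45, 85.5),
                 ('Cathy', 37, 68.0), ('Doug', 19, 61.5)],
                dtype=[('name', 'U10'), ('age', 'i4'), ('weight', 'f8')])
data_rec = data.view(np.recarray)

# Writing through the record-array view changes the original array too
data_rec.age[0] = 26
print(data['age'])

# A record array can also be built directly from per-field sequences
rec = np.rec.fromarrays([['Alice', 'Bob'], [25, 45], [55.0, 85.5]],
                        names='name,age,weight')
print(rec.age)
print(rec['name'])
```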

In [51]:
# Time the access of the 'age' field using standard structured array indexing
%timeit data['age']
# Time the access of the 'age' field using dictionary-like indexing on the record array
%timeit data_rec['age']
# Time the access of the 'age' field using attribute-like access on the record array
%timeit data_rec.age

133 ns ± 29.3 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
1.79 µs ± 280 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
2.99 µs ± 70 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


# **On to Pandas**

This marks the transition from NumPy's structured arrays to the Pandas library. Pandas is built on NumPy and provides more powerful and flexible tools for data manipulation and analysis, and it often simplifies the tasks shown here with structured arrays.
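As a small bridge, a structured array can be passed straight to the `pandas.DataFrame` constructor, with the field names becoming column labels. This is only a sketch that assumes Pandas is installed; it rebuilds the `data` array from earlier so that it runs on its own, and `df` and `young` are illustrative names.

```python
import numpy as np
import pandas as pd

data = np.array([('Alice', 25, 55.0), ('Bob', 45, 85.5),
                 ('Cathy', 37, 68.0), ('Doug', 19, 61.5)],
                dtype=[('name', 'U10'), ('age', 'i4'), ('weight', 'f8')])

# Field names of the structured array become the DataFrame's columns
df = pd.DataFrame(data)
print(df.columns.tolist())

# The same "age under 30" selection as before, written on the DataFrame
young = df[df['age'] < 30]['name']
print(young.tolist())
```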