Start: NumPy Basics, from the book Python for Data Analysis by Wes McKinney

In [2]:
import numpy as np

Topics for Data Analysis:
- Fast vectorized array operations for data munging and cleaning, subsetting and
filtering, transformation, and any other kinds of computations
- Common array algorithms like sorting, unique, and set operations
- Efficient descriptive statistics and aggregating/summarizing data
- Data alignment and relational data manipulations for merging and joining
together heterogeneous datasets
- Expressing conditional logic as array expressions instead of loops with if-elif-else
branches
- Group-wise data manipulations (aggregation, transformation, function application)
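
As a small illustration of the point about conditional logic in the list above, the sketch below labels each value by its sign, first with a loop and then as a single array expression. The sample values and the names values, labels_loop and labels_vec are made up for this example, and np.select is only one way to write it (np.where is another).

```python
import numpy as np

# Hypothetical sample data for the illustration
values = np.array([-3.0, 0.0, 2.5, 7.0])

# Loop version with if-elif-else branches
labels_loop = []
for v in values:
    if v < 0:
        labels_loop.append(-1)
    elif v == 0:
        labels_loop.append(0)
    else:
        labels_loop.append(1)

# Array-expression version: for each element, np.select takes the choice
# that belongs to the first condition that is True, else the default
labels_vec = np.select([values < 0, values == 0], [-1, 0], default=1)

print(labels_loop)
print(labels_vec)
```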

4.1 The NumPy ndarray: A Multidimensional Array Object

In [3]:
# Generate a random set of data
data = np.random.rand(2, 3)
print(data)

[[0.88858052 0.7520688  0.45499294]
 [0.97589812 0.15466148 0.47364846]]


In [4]:
# Mathematical operations on the data
mul = data * 10
print(mul)

add = data + data
print(add)

[[8.88580518 7.52068798 4.54992944]
 [9.75898115 1.54661477 4.73648456]]
[[1.77716104 1.5041376  0.90998589]
 [1.95179623 0.30932295 0.94729691]]
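
The operators above work element by element and return new arrays; the operand itself is not modified. A minimal check, assuming a small hand-written array in place of the random data so that the result is reproducible:

```python
import numpy as np

# Fixed values instead of np.random.rand, so the results are reproducible
data = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
original = data.copy()

mul = data * 10      # the scalar is applied to every element
add = data + data    # arrays of equal shape are combined element by element

print(mul)
print(add)
print(np.array_equal(data, original))  # data itself has not been modified
```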


In [5]:
# Get the shape and the element type (dtype) of the array
shape = data.shape
print(shape)

dtype = data.dtype  # named dtype so that the built-in type() is not shadowed
print(dtype)

(2, 3)
float64


In [6]:
# Creating ndarrays with one dimension
data1 = [1, 1.2, 3, 5.5, -12, -7.8]
data1 = np.array(data1)
print(data1)

[  1.    1.2   3.    5.5 -12.   -7.8]


In [7]:
# Creating ndarrays with multiple dimensions
data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
data2 = np.array(data2)
print(data2)

print(f"Dimension of the array {data2}: {data2.ndim} ")
print(f"Shape of the array {data2}: {data2.shape} ")

[[1 2 3 4]
 [5 6 7 8]]
Dimension of the array [[1 2 3 4]
 [5 6 7 8]]: 2 
Shape of the array [[1 2 3 4]
 [5 6 7 8]]: (2, 4) 


In [8]:
# np.array infers a suitable dtype for the new array from the input data
print(f" Type of {data1} is {data1.dtype}")
print(f" Type of {data2} is {data2.dtype}")

 Type of [  1.    1.2   3.    5.5 -12.   -7.8] is float64
 Type of [[1 2 3 4]
 [5 6 7 8]] is int64
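
A short sketch of how this inference behaves for a few kinds of input. The variable names are illustrative only, and the exact integer width (int64 above) can differ by platform and NumPy version, so the example checks the kind of dtype, not the width.

```python
import numpy as np

ints = np.array([1, 2, 3])         # all integers -> an integer dtype
mixed = np.array([1, 2.5, 3])      # a single float is enough -> float64
flags = np.array([True, False])    # booleans -> bool
words = np.array(['a', 'bcd'])     # strings -> fixed-length Unicode, sized for the longest item

for arr in (ints, mixed, flags, words):
    print(arr, arr.dtype)
```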


In [9]:
# Create array of zeros
print(np.zeros(10))
# Create array of ones (tuple as a shape)
print(np.ones((10,2)))



[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[[1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]]


In [10]:
# Create an array without initializing its values
np.empty((2, 3, 2))

array([[[9.47799446e-312, 3.16202013e-322],
        [0.00000000e+000, 0.00000000e+000],
        [1.11260619e-306, 8.23009093e-067]],

       [[2.59834267e-056, 1.18779705e-075],
        [3.95373276e+179, 1.21544159e-046],
        [1.50195669e-076, 8.03394082e-042]]])

In [11]:
# Create an array-valued version of the built-in Python range function
np.arange(start=5, stop=100, step=5, dtype=float)

array([ 5., 10., 15., 20., 25., 30., 35., 40., 45., 50., 55., 60., 65.,
       70., 75., 80., 85., 90., 95.])
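
Like the built-in range, np.arange excludes the stop value, which is why the output above ends at 95. A quick check of that behaviour (the names steps and from_range are introduced here only for the comparison):

```python
import numpy as np

steps = np.arange(start=5, stop=100, step=5, dtype=float)
from_range = np.array(list(range(5, 100, 5)), dtype=float)

print(steps.size)                          # number of generated values
print(steps[-1])                           # last value; the stop value 100 is excluded
print(np.array_equal(steps, from_range))   # same values as the built-in range
```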

In [12]:
# Produce an array of the given shape and dtype with all values set to the indicated “fill value”
np.full((5,2), 12)

array([[12, 12],
       [12, 12],
       [12, 12],
       [12, 12],
       [12, 12]])

In [13]:
array_zeros = np.zeros((2,3))
# full_like takes another array and produces a filled array of the same shape and dtype
filled_array_twos = np.full_like(array_zeros, 2)
print(array_zeros.shape, filled_array_twos.shape)
print(array_zeros)
print(filled_array_twos)

(2, 3) (2, 3)
[[0. 0. 0.]
 [0. 0. 0.]]
[[2. 2. 2.]
 [2. 2. 2.]]
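
Because full_like also copies the dtype of the template array, the fill value is cast to that dtype. The sketch below uses an int32 template (an assumption chosen to make the effect visible) and shows how an explicit dtype argument changes the result.

```python
import numpy as np

int_zeros = np.zeros((2, 3), dtype=np.int32)

# The fill value is cast to the dtype of the template array (int32 here),
# so 2.7 is stored as 2
inherited = np.full_like(int_zeros, 2.7)

# Passing dtype explicitly overrides the inherited dtype
overridden = np.full_like(int_zeros, 2.7, dtype=np.float64)

print(inherited, inherited.dtype)
print(overridden, overridden.dtype)
```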


In [14]:
# Create identity matrices
identity_matrix1 = np.eye(10,10)
identity_matrix2 = np.identity(10)
print(identity_matrix1)
print(identity_matrix2)


[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]]
[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]]
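
The two calls above give the same matrix. The difference is that np.eye also accepts a column count and a diagonal offset k, while np.identity always returns a square matrix. A small sketch with arbitrarily chosen sizes:

```python
import numpy as np

# For square matrices the two functions agree
print(np.array_equal(np.eye(4), np.identity(4)))

# np.eye additionally accepts a column count and a diagonal offset k
shifted = np.eye(3, 4, k=1)   # 3 rows, 4 columns, ones on the first diagonal above the main one
print(shifted)
```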


Data Types for ndarrays

In [15]:
arr1 = np.array([1, 2, 3], dtype=np.complex128)
print(arr1)

[1.+0.j 2.+0.j 3.+0.j]


In [16]:
arr2 = np.array([1, 2, 3], dtype=np.int32)
arr2.dtype

dtype('int32')

List of types supported by NumPy:
- int8 (i1), uint8 (u1): Signed and unsigned 8-bit (1 byte) integer types
- int16 (i2), uint16 (u2): Signed and unsigned 16-bit integer types
- int32 (i4), uint32 (u4): Signed and unsigned 32-bit integer types
- int64 (i8), uint64 (u8): Signed and unsigned 64-bit integer types
- float16 (f2): Half-precision floating point
- float32 (f4 or f): Standard single-precision floating point; compatible with C float
- float64 (f8 or d): Standard double-precision floating point; compatible with C double and Python float object
- float128 (f16 or g): Extended-precision floating point (not available on every platform)
- complex64 (c8), complex128 (c16), complex256 (c32): Complex numbers represented by two 32-, 64-, or 128-bit floats, respectively
- bool (?): Boolean type storing True and False values
- object (O): Python object type; a value can be any Python object
- bytes_ (S), called string_ before NumPy 2.0: Fixed-length ASCII string type (1 byte per character); for example, to create a string dtype with length 10, use 'S10'
- str_ (U), called unicode_ before NumPy 2.0: Fixed-length Unicode type (4 bytes per character in NumPy); same
specification semantics as bytes_ (e.g., 'U10')
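
The shorthand codes in the list can be passed to np.dtype to inspect what they stand for. The sketch below prints the element size for a selection of codes; float128 and complex256 are left out because they are not available on every platform, and the names codes and item_sizes are introduced only for this example.

```python
import numpy as np

codes = ['i1', 'u1', 'i2', 'i4', 'i8', 'f2', 'f4', 'f8', 'c8', 'c16', '?', 'S10', 'U10']

# Map each type code to the size in bytes of one array element
item_sizes = {code: np.dtype(code).itemsize for code in codes}
for code, size in item_sizes.items():
    print(f"{code:>4} -> {np.dtype(code).name:<10} {size} bytes per element")

# A type code and the corresponding NumPy type describe the same dtype
print(np.dtype('f8') == np.float64, np.dtype('i4') == np.int32)
```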

In [17]:
# Cast an array explicitly
arr = np.array([1, 2, 3, 4, 5])
print(f"Original dtype of the array: {arr.dtype}")
float_arr = arr.astype(np.float64)
print(f"dtype after the cast: {float_arr.dtype}")

Original dtype of the array: int64
dtype after the cast: float64


In [18]:
# Casting floats to integers truncates the fractional part, so information is lost
arr = np.array([3.7, -1.2, -2.6, 0.5, 12.9, 10.1])
print(f"Array with floating-point values: {arr}")
# Cast the array to integers
int_arr = arr.astype(np.int32)
print(f"Array after casting to integers: {int_arr}")

Array with floating-point values: [ 3.7 -1.2 -2.6  0.5 12.9 10.1]
Array after casting to integers: [ 3 -1 -2  0 12 10]
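
The cast drops the fractional part; it does not round. If rounding is what is wanted, one option is to round first and cast afterwards. A minimal sketch with the same values (np.rint rounds halves to the nearest even integer, so 0.5 becomes 0):

```python
import numpy as np

arr = np.array([3.7, -1.2, -2.6, 0.5, 12.9, 10.1])

truncated = arr.astype(np.int32)            # drops the fractional part
rounded = np.rint(arr).astype(np.int32)     # rounds to the nearest integer first

print(truncated)   # [ 3 -1 -2  0 12 10]
print(rounded)     # [ 4 -1 -3  0 13 10]
```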


In [19]:
# Cast an array of strings representing numeric values (np.string_ was removed in NumPy 2.0; use np.bytes_ instead)
numeric_strings = np.array(['1.25', '-9.6', '42'], dtype=np.bytes_)
numeric_strings = numeric_strings.astype(float)
print(f"dtype after casting the strings to floating point: {numeric_strings.dtype}")

dtype after casting the strings to floating point: float64


In [20]:
# dtypes can also be given as shorthand type codes ('u4' is uint32); np.empty leaves the values uninitialized
empty_uint32 = np.empty(8, dtype='u4')
print(empty_uint32, empty_uint32.dtype)



[         0          0 2794633008        446 2794629888        446
 2794629968        446] uint32


Arithmetic with NumPy arrays

In [21]:
# Arithmetic on arrays is applied element-wise (vectorized), without explicit loops
arr = np.array([[1., 2., 3.], [4., 5., 6.]])

arr ** 2

array([[ 1.,  4.,  9.],
       [16., 25., 36.]])

In [22]:
# Comparing two arrays of the same shape creates a new array of boolean values
arr1 = np.array([[1., 2., 3.], [4., 5., 6.]])
arr2 = np.array([[0., 4., 1.], [7., 2., 12.]])

print(arr1 > arr2)

[[ True False  True]
 [False  True False]]
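
The boolean result is an ordinary array, so it can be counted, reduced, or used to pick elements. A short sketch with the same two arrays (the name greater is introduced here for the comparison result):

```python
import numpy as np

arr1 = np.array([[1., 2., 3.], [4., 5., 6.]])
arr2 = np.array([[0., 4., 1.], [7., 2., 12.]])

greater = arr1 > arr2

print(greater.sum())                 # True counts as 1, so this is the number of True entries
print(greater.any(), greater.all())  # is any / is every comparison True?
print(arr1[greater])                 # the elements of arr1 that exceed their counterpart in arr2
```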


Boolean indexing

In [23]:
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
data = np.random.randn(7, 4)
print(names, names.dtype, names.shape)
print(data, data.dtype, data.shape)


['Bob' 'Joe' 'Will' 'Bob' 'Will' 'Joe' 'Joe'] <U4 (7,)
[[ 2.72029814  1.0460852   0.18522771  0.01946102]
 [-0.87208698 -3.36461117  0.93769823  0.73371306]
 [-0.33607918 -0.67318552  0.29474069 -0.54601805]
 [-0.61497639 -2.04792625 -0.05170833 -0.92930583]
 [ 0.38734891 -0.50578761 -0.17594827 -0.01480459]
 [ 0.09930235  1.91556529 -1.08454362 -0.97803802]
 [ 0.60509287  0.57622709  0.44878758  0.69129616]] float64 (7, 4)


In [24]:
# Boolean selection
print(names == "Bob")
print(data[names == "Bob"])
# Boolean selection with indexing
print(data[names == "Bob", 2:])

[ True False False  True False False False]
[[ 2.72029814  1.0460852   0.18522771  0.01946102]
 [-0.61497639 -2.04792625 -0.05170833 -0.92930583]]
[[ 0.18522771  0.01946102]
 [-0.05170833 -0.92930583]]


In [25]:
# Selection with inverted condition
cond = names == 'Bob'
print(data[~cond])

[[-0.87208698 -3.36461117  0.93769823  0.73371306]
 [-0.33607918 -0.67318552  0.29474069 -0.54601805]
 [ 0.38734891 -0.50578761 -0.17594827 -0.01480459]
 [ 0.09930235  1.91556529 -1.08454362 -0.97803802]
 [ 0.60509287  0.57622709  0.44878758  0.69129616]]


In [26]:
# Combine two conditions into a mask, then invert it to select the remaining rows
mask = (names == 'Bob') | (names == 'Will')
print(data[~mask])

# The Python keywords `and` and `or` do not work with boolean arrays; use & (and) and | (or) instead

[[-0.87208698 -3.36461117  0.93769823  0.73371306]
 [ 0.09930235  1.91556529 -1.08454362 -0.97803802]
 [ 0.60509287  0.57622709  0.44878758  0.69129616]]
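
The sketch below shows what happens when the keyword `or` is used on boolean arrays, and two ways to build the mask correctly. The parentheses around each comparison in the cell above are needed because | binds more tightly than ==. np.isin is used here as an alternative that the original cell does not use.

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
is_bob = names == 'Bob'
is_will = names == 'Will'

# `or` asks for the truth value of a whole array, which NumPy refuses to guess
try:
    is_bob or is_will
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# The element-wise operator works; np.isin builds the same mask in one call
mask = is_bob | is_will
same_mask = np.isin(names, ['Bob', 'Will'])
print(mask)
print(np.array_equal(mask, same_mask))
```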


In [27]:
# Example of boolean condition: set the negative values to 0
data = np.random.randn(7, 4)
data[data < 0] = 0
print(data)


[[0.         0.0478897  0.90944899 0.12402133]
 [0.04248635 0.         1.76956796 0.91642999]
 [0.         0.77396051 0.         0.        ]
 [0.         0.         0.19429904 1.75393729]
 [0.15895959 0.         0.62658654 0.58401748]
 [0.         0.         0.66601036 0.        ]
 [0.7339389  0.06325991 0.         0.        ]]
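
Assigning through a boolean mask changes the array in place. When the original values should be kept, np.where or np.maximum return a new array instead. A minimal sketch, assuming a small fixed array in place of the random data:

```python
import numpy as np

# Fixed values instead of np.random.randn, so the results are reproducible
data = np.array([[-1.5, 0.2], [0.7, -0.3]])

# Both of these build a new array and leave data untouched
replaced = np.where(data < 0, 0, data)
clipped = np.maximum(data, 0)
still_negative = bool((data < 0).any())   # data still contains negative values here

# Boolean assignment overwrites the selected elements of data itself
data[data < 0] = 0

print(replaced)
print(np.array_equal(replaced, clipped), np.array_equal(replaced, data))
```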


Fancy indexing:
* Definition: fancy indexing is the term NumPy uses for indexing with integer arrays

In [28]:
arr = np.empty((8, 4), dtype=int)
for i in range(8):
    arr[i] = i

print(arr)

[[0 0 0 0]
 [1 1 1 1]
 [2 2 2 2]
 [3 3 3 3]
 [4 4 4 4]
 [5 5 5 5]
 [6 6 6 6]
 [7 7 7 7]]


In [29]:
# To select a subset of the rows in a particular order, pass a list or
# ndarray of integers specifying the desired order
print(arr[[4, 3, 0, 6]])

# Negative indices select rows counting from the end of the array
arr[[-3, -5, -7]]
# Indices: 8-3=5, 8-5=3, 8-7=1

[[4 4 4 4]
 [3 3 3 3]
 [0 0 0 0]
 [6 6 6 6]]


array([[5, 5, 5, 5],
       [3, 3, 3, 3],
       [1, 1, 1, 1]])
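
Unlike slicing, fancy indexing copies the selected data into a new array. The sketch below rebuilds the same 8x4 array (with repeat and reshape instead of the loop, which is a choice made here for brevity) and checks both the negative-index equivalence and the copy behaviour.

```python
import numpy as np

# Same 8x4 array as above: row i is filled with the value i
arr = np.arange(8).repeat(4).reshape(8, 4)

# Negative indices count from the end, so these two selections match
from_end = arr[[-3, -5, -7]]
from_start = arr[[5, 3, 1]]
print(np.array_equal(from_end, from_start))

# Fancy indexing copies the data: writing to the result leaves arr unchanged
from_end[0, 0] = 99
print(arr[5, 0])

# A slice, by contrast, is a view: writing to it changes arr
row_view = arr[1:3]
row_view[0, 0] = -1
print(arr[1, 0])
```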

In [30]:
# Passing multiple index arrays does something slightly different; it selects a one-dimensional
# array of elements corresponding to each tuple of indices:
arr = np.arange(32).reshape((8, 4))
print(arr)

select_arr = arr[[1, 5, 7, 2], [0, 3, 1, 2]]
print(select_arr)  # Output: [4, 23, 29, 10] -> (row, column) positions of the values = [(1,0), (5,3), (7,1), (2,2)]



[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]
 [16 17 18 19]
 [20 21 22 23]
 [24 25 26 27]
 [28 29 30 31]]
[ 4 23 29 10]
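
If the goal is the rectangular block formed by the chosen rows and columns instead of one element per index pair, two common ways to get it are chained indexing and np.ix_. A sketch with the same indices (the names rows, cols, block_a and block_b are introduced for this example):

```python
import numpy as np

arr = np.arange(32).reshape((8, 4))
rows = [1, 5, 7, 2]
cols = [0, 3, 1, 2]

pairs = arr[rows, cols]            # one element per (row, column) pair
block_a = arr[rows][:, cols]       # select the rows first, then reorder the columns
block_b = arr[np.ix_(rows, cols)]  # np.ix_ builds the same row/column grid in one step

print(pairs)
print(block_a)
print(np.array_equal(block_a, block_b))
```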


Transposing Arrays and Swapping Axes

In [31]:
arr = np.arange(9).reshape((3, 3))
print(arr)

# Transpose the array
arr_t1 = np.transpose(arr)
print(arr_t1)

arr_t2 = arr.T
print(arr_t2)

[[0 1 2]
 [3 4 5]
 [6 7 8]]
[[0 3 6]
 [1 4 7]
 [2 5 8]]
[[0 3 6]
 [1 4 7]
 [2 5 8]]
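
Transposing returns a view on the same data without copying anything. A typical use is the inner matrix product of an array with its own transpose, sketched here for the same 3x3 array (the name gram is chosen for this example):

```python
import numpy as np

arr = np.arange(9).reshape((3, 3))

# arr.T is a view on the same memory, not a copy
print(np.shares_memory(arr, arr.T))

# Inner matrix product arr.T @ arr, also available as np.dot(arr.T, arr)
gram = arr.T @ arr
print(gram)
print(np.array_equal(gram, np.dot(arr.T, arr)))
```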


In [32]:
# Swap axes
arr = np.arange(16).reshape((2, 2, 4))
print(arr)
swap_ax = arr.swapaxes(1,2)
print(swap_ax)

[[[ 0  1  2  3]
  [ 4  5  6  7]]

 [[ 8  9 10 11]
  [12 13 14 15]]]
[[[ 0  4]
  [ 1  5]
  [ 2  6]
  [ 3  7]]

 [[ 8 12]
  [ 9 13]
  [10 14]
  [11 15]]]
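
swapaxes(1, 2) is a special case of transpose with an explicit axis order, and it also returns a view. A short check of both statements on the same array:

```python
import numpy as np

arr = np.arange(16).reshape((2, 2, 4))

swapped = arr.swapaxes(1, 2)
permuted = arr.transpose((0, 2, 1))   # keep axis 0, exchange axes 1 and 2

print(swapped.shape)
print(np.array_equal(swapped, permuted))
# Element [i, j, k] of the result is element [i, k, j] of the original
print(swapped[1, 3, 0] == arr[1, 0, 3])
print(np.shares_memory(arr, swapped))   # a view, no data is copied
```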


Universal Functions: Fast Element-Wise Array Functions
* A universal function, or ufunc, is a function that performs element-wise operations on data in ndarrays
* You can think of them as fast vectorized wrappers for simple
functions that take one or more scalar values and produce one or more scalar results

In [34]:
array = np.arange(10)
print(array)

[0 1 2 3 4 5 6 7 8 9]


In [36]:
# Apply the square root to every element of the array
array = np.sqrt(array)
print(array)

[0.         1.         1.41421356 1.73205081 2.         2.23606798
 2.44948974 2.64575131 2.82842712 3.        ]


In [37]:
# Apply the exponential to every element of the ndarray (array now holds the square roots from the previous cell)
array = np.exp(array)
print(array)

[ 1.          2.71828183  4.11325038  5.65223367  7.3890561   9.35646902
 11.58243519 14.09403011 16.91882868 20.08553692]


* sqrt and exp take a single array and are called ***unary*** ufuncs. Other functions, such as add or maximum, take two arrays (thus, ***binary*** ufuncs) and return a single array as the result

In [39]:
# Create two np.arrays
x = np.random.randn(8)
y = np.random.randn(8)

print(x)
print(y)

[-0.61353843 -0.53915955  1.50271319  1.79970101  2.18844309 -0.72009134
  0.50101919 -0.32461596]
[-0.81240752 -0.97099821  1.49635335  1.47699218  0.63893209 -0.92272442
  1.76376202  1.06246533]


In [40]:
max_arr = np.maximum(x, y)
print(max_arr)

[-0.61353843 -0.53915955  1.50271319  1.79970101  2.18844309 -0.72009134
  1.76376202  1.06246533]
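
np.maximum compares two arrays element by element, which is different from np.max, a reduction over one array. The sketch below contrasts the two on small fixed arrays (chosen here in place of the random data) and shows that np.add is the ufunc behind the + operator.

```python
import numpy as np

x = np.array([1.0, -2.0, 3.5, 0.0])
y = np.array([0.5, 4.0, 3.0, -1.0])

elementwise_max = np.maximum(x, y)   # binary ufunc: one result per pair of elements
overall_max = np.max(x)              # reduction: a single value for the whole array
summed = np.add(x, y)                # the ufunc behind the + operator

print(elementwise_max)
print(overall_max)
print(np.array_equal(summed, x + y))
```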


* While not common, a ufunc can return multiple arrays. modf is one example, a vectorized version of the built-in Python divmod; it returns the fractional and integral parts of a floating-point array

In [42]:
array = np.random.randn(7) * 5
print(array)

[  4.92347651  -0.23923073   0.82358909  -5.56702421   1.36602415
 -11.70924824   0.3201148 ]


In [45]:
remainder, whole_part = np.modf(array)
print(remainder)
print(whole_part)

[ 0.92347651 -0.23923073  0.82358909 -0.56702421  0.36602415 -0.70924824
  0.3201148 ]
[  4.  -0.   0.  -5.   1. -11.   0.]
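
As the output above shows, both parts returned by np.modf carry the sign of the input. This matches divmod(x, 1) for non-negative values only, because divmod floors toward negative infinity. A small sketch of the difference, using values that are exactly representable in floating point (an assumption made so the comparison is exact):

```python
import numpy as np

values = np.array([4.5, -5.5, 0.25])

frac, whole = np.modf(values)           # both parts carry the sign of the input
quotient, rest = np.divmod(values, 1)   # floor-based, like Python's divmod(x, 1)

print(frac, whole)
print(rest, quotient)
print(np.array_equal(frac + whole, values))   # the two parts add back up to the input
```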
