# Enhancing Your NumPy Skills - Statistics, Conditionals, and Dimension Manipulation

## Introduction
Day 1 covered the fundamentals of NumPy: the core structure of the ndarray, its shape and dtype attributes, and the principles of indexing and vectorized operations. Day 2 builds on that foundation with functions that extract statistical insights from data, apply conditional logic efficiently, and reshape and combine arrays to fit the requirements of more complex problems. These tools are indispensable for data analysis, machine learning, and numerical simulation.

## 1. Essential Descriptive Statistics: np.mean and np.std
One of NumPy's common applications in scientific computing involves its capability for rapid and efficient statistical calculations. The np.mean() and np.std() functions are prime examples, facilitating the computation of the mean and standard deviation for entire arrays or along specific axes.

### 1.1. np.mean(): Calculating the Arithmetic Mean
The np.mean() function computes the arithmetic mean of an array's elements. By default, it calculates the mean of all flattened elements, as if the array were a 1D vector. However, its true versatility lies in the axis argument, which allows the mean to be calculated along a specific dimension.

#### Basic Syntax:
numpy.mean(a, axis=None, dtype=None, out=None, keepdims=<no value>, *, where=<no value>)

- a: The input array.
- axis: The axis or axes along which the mean is computed. If None (default), the mean is calculated over all flattened array elements.
- keepdims: If True, the reduced axis will retain a dimension of size 1 in the result.

In [58]:
import numpy as np

# 1D Array
data_1d = np.array([10, 20, 30, 40, 50])
mean_1d = np.mean(data_1d)
print(f"Mean of 1D array: {mean_1d}\n")

# 2D Array (Matrix)
data_2d = np.array([[1, 2, 3],
                    [4, 5, 6],
                    [7, 8, 9]])

# Mean of all elements (flattened)
mean_all = np.mean(data_2d)
print(f"Mean of all elements in the matrix: {mean_all}\n") 

# Mean by column (axis=0): calculation is performed "down" the columns
mean_columns = np.mean(data_2d, axis=0)
print(f"Mean by column:\n{mean_columns}\n") # Output: [4. 5. 6.] (mean of [1,4,7], [2,5,8], [3,6,9])

# Mean by row (axis=1): calculation is performed "across" the rows
mean_rows = np.mean(data_2d, axis=1)
print(f"Mean by row:\n{mean_rows}\n")

# Mean by column, keeping dimensions
mean_columns_keepdims = np.mean(data_2d, axis=0, keepdims=True)
print(f"Mean by column (with keepdims=True):\n{mean_columns_keepdims}\n")

Mean of 1D array: 30.0

Mean of all elements in the matrix: 5.0

Mean by column:
[4. 5. 6.]

Mean by row:
[2. 5. 8.]

Mean by column (with keepdims=True):
[[4. 5. 6.]]



The use of axis is crucial for meaningfully summarizing data in tables or matrices, whether for analyzing the mean of features (columns) or the mean of observations (rows).
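One place where keepdims tends to matter is centering data, because the retained size-1 axis lets the means broadcast back against the original array. The following is a minimal sketch of that idea; the `samples` array and its values are made up for illustration.

```python
import numpy as np

# Hypothetical data: 3 observations (rows) x 2 features (columns)
samples = np.array([[1.0, 10.0],
                    [2.0, 20.0],
                    [3.0, 30.0]])

# Column means with shape (1, 2), which broadcasts against (3, 2)
col_means = np.mean(samples, axis=0, keepdims=True)
centered_cols = samples - col_means

# Row means with shape (3, 1); keepdims is what makes this subtraction line up
row_means = np.mean(samples, axis=1, keepdims=True)
centered_rows = samples - row_means

print(col_means.shape, row_means.shape)  # (1, 2) (3, 1)
print(centered_cols)
print(centered_rows)
```

Without keepdims=True, the row means would have shape (3,), and subtracting them from a (3, 2) array would fail with a broadcasting error.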

### 1.2. np.std(): Calculating the Standard Deviation
The standard deviation is a measure of dispersion indicating the extent to which data values deviate from the mean. np.std() computes either the sample or population standard deviation, depending on the ddof (delta degrees of freedom) parameter.

#### Basic Syntax:
numpy.std(a, axis=None, dtype=None, out=None, ddof=0, keepdims=<no value>, *, where=<no value>)

- a: The input array.
- axis: The axis or axes along which the standard deviation is computed.
- ddof: Delta Degrees of Freedom. The divisor used in the calculation is N - ddof, where N is the number of elements. By default, it is 0 (calculating the population standard deviation). For the sample standard deviation, ddof=1 should be used.
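
To make the role of the N - ddof divisor concrete, the sketch below recomputes the standard deviation by hand and compares it with np.std. The helper `manual_std` is a hypothetical name introduced only for this illustration; it is not part of NumPy.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5])

def manual_std(a, ddof=0):
    """Hypothetical helper: standard deviation with divisor N - ddof."""
    deviations = a - a.mean()                  # distance of each value from the mean
    variance = np.sum(deviations ** 2) / (a.size - ddof)
    return np.sqrt(variance)

# Population version (divisor N) and sample version (divisor N - 1)
print(np.isclose(manual_std(values), np.std(values)))                  # True
print(np.isclose(manual_std(values, ddof=1), np.std(values, ddof=1)))  # True
```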

In [60]:
# 1D Array
data_1d = np.array([1, 2, 3, 4, 5])
std_pop_1d = np.std(data_1d)
std_sample_1d = np.std(data_1d, ddof=1)
print(f"Population Standard Deviation (1D): {std_pop_1d:.2f}") # Output: 1.41
print(f"Sample Standard Deviation (1D): {std_sample_1d:.2f}\n") # Output: 1.58

# 2D Array
data_2d = np.array([[10, 12, 14],
                    [16, 18, 20]])

# Standard deviation of all elements
std_all = np.std(data_2d)
print(f"Standard Deviation of all elements in the matrix: {std_all:.2f}\n") # Output: 3.42

# Standard deviation by row (axis=1)
std_rows = np.std(data_2d, axis=1)
print(f"Standard Deviation by row:\n{std_rows}\n") # Output: [1.63299316 1.63299316]

# Standard deviation by column (axis=0)
std_columns = np.std(data_2d, axis=0)
print(f"Standard Deviation by column:\n{std_columns}\n") # Output: [3. 3. 3.]

Population Standard Deviation (1D): 1.41
Sample Standard Deviation (1D): 1.58

Standard Deviation of all elements in the matrix: 3.42

Standard Deviation by row:
[1.63299316 1.63299316]

Standard Deviation by column:
[3. 3. 3.]



np.mean and np.std are merely two of the many statistical functions offered by NumPy (e.g., np.sum, np.min, np.max, np.median, np.percentile). They form the basis for any exploratory data analysis.

## 2. Powerful Conditional Logic: np.where and Boolean Masks
The ability to apply conditional logic to array elements is fundamental for data preprocessing and manipulating information based on specific criteria. NumPy offers highly efficient tools for this purpose, notably np.where() and the extensive use of boolean masks.

### 2.1. np.where(): Element-wise Conditional Selection
The np.where() function serves as a vectorized equivalent of the conditional if-else statement. It enables the creation of a new array by selecting elements from two input arrays based on a boolean condition.

#### Basic Syntax:
numpy.where(condition, x, y)

- condition: A boolean array. Where True, the corresponding element from x is selected; where False, the element from y is selected.
- x, y: Arrays or scalars.

In [62]:
# Replace negative values with zero
data = np.array([-1, 2, -3, 4, -5])
data_positive = np.where(data > 0, data, 0)
print(f"Positive or zero data: {data_positive}\n") 

# Classify elements based on a threshold
scores = np.array([75, 88, 62, 91, 55])
status = np.where(scores >= 70, 'Approved', 'Failed')
print(f"Student status: {status}\n")

# `np.where` with multiple input arrays
array_a = np.array([1, 10, 3, 15, 5])
array_b = np.array([6, 7, 8, 9, 10])
condition = array_a > array_b
result = np.where(condition, array_a, array_b)
print(f"Result of comparison with where: {result}\n") # Output: [ 6 10  8 15 10]

Positive or zero data: [0 2 0 4 0]

Student status: ['Approved' 'Approved' 'Failed' 'Approved' 'Failed']

Result of comparison with where: [ 6 10  8 15 10]



When only the condition argument is provided, np.where() returns the indices of the elements where the condition is True. This functionality is useful for combining with advanced indexing.

In [64]:
indices_greater_than_zero = np.where(data > 0)
print(f"Indices where elements are greater than zero: {indices_greater_than_zero}\n")

print(f"Values at selected indices: {data[indices_greater_than_zero]}\n") # Output: [2 4]

Indices where elements are greater than zero: (array([1, 3], dtype=int64),)

Values at selected indices: [2 4]
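
For arrays with more than one dimension, the single-argument form returns one index array per axis. A small sketch of that behavior follows; the `grid` array is invented for the example, and np.argwhere is shown only as a related way to view the same positions.

```python
import numpy as np

grid = np.array([[0, 5, 0],
                 [7, 0, 9]])

# One index array per axis: row indices first, then column indices
rows, cols = np.where(grid > 0)
print(rows)              # [0 1 1]
print(cols)              # [1 0 2]
print(grid[rows, cols])  # [5 7 9]

# np.argwhere groups the same positions as (row, col) pairs
pairs = np.argwhere(grid > 0)
print(pairs)
```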



### 2.2. Boolean Masks: Powerful Filtering and Modification
As briefly demonstrated on Day 1, boolean masks are arrays of type bool that possess the same shape as the array being filtered. They are a natural outcome of comparison operations between arrays and scalars, and their application as an indexing mechanism is both incredibly powerful and idiomatic within NumPy.

#### Filtering Data with Boolean Masks:

By passing a boolean mask to the indexing brackets of a NumPy array, only the elements corresponding to True in the mask are returned.

In [66]:
temperatures = np.array([25, 28, 22, 30, 19, 23])
hot_mask = temperatures > 25
print(f"Mask for temperatures > 25: {hot_mask}")
print(f"Temperatures above 25 degrees: {temperatures[hot_mask]}\n")

# Combining multiple conditions with logical operators (& for AND, | for OR)
ideal_mask = (temperatures >= 20) & (temperatures <= 25)
print(f"Mask for temperatures between 20 and 25: {ideal_mask}")
print(f"Ideal temperatures: {temperatures[ideal_mask]}\n")

Mask for temperatures > 25: [False  True False  True False False]
Temperatures above 25 degrees: [28 30]

Mask for temperatures between 20 and 25: [ True False  True False False  True]
Ideal temperatures: [25 22 23]
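
Two details about combining masks are easy to miss. The ~ operator negates a mask, and each comparison needs its own parentheses because & and | bind more tightly than comparison operators. The sketch below reuses the temperatures array from the cell above and also shows why Python's `and` cannot be used here; the `and_failed` flag exists only to record the outcome.

```python
import numpy as np

temperatures = np.array([25, 28, 22, 30, 19, 23])

ideal_mask = (temperatures >= 20) & (temperatures <= 25)

# ~ flips every element of the mask (logical NOT)
outside_mask = ~ideal_mask
print(temperatures[outside_mask])  # [28 30 19]

# Python's `and` asks for a single truth value, which a multi-element array cannot give
try:
    (temperatures >= 20) and (temperatures <= 25)
    and_failed = False
except ValueError as err:
    and_failed = True
    print(f"ValueError: {err}")
```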



#### Modifying Data with Boolean Masks:

Boolean masks are not only useful for filtering; they also serve as an efficient means to modify elements within an array that satisfy a particular condition.

In [70]:
prices = np.array([10.50, 12.00, 8.75, 15.25, 9.90])
# Increase prices greater than 10 by 5%
increase_mask = prices > 10
prices[increase_mask] = prices[increase_mask] * 1.05
print(f"Prices after increase: {prices}\n")
# Set values below a threshold to NaN (Not a Number)
sales = np.array([100, 50, 120, 30, 80], dtype=np.float64) 
sales[sales < 60] = np.nan 
print(f"Sales with values below 60 replaced by NaN: {sales}\n")

Prices after increase: [11.025  12.6     8.75   16.0125  9.9   ]

Sales with values below 60 replaced by NaN: [100.  nan 120.  nan  80.]



The application of boolean masks is a cornerstone of data manipulation in both NumPy and Pandas, enabling concise and efficient operations on subsets of data.
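
Masking values with NaN affects the statistics from Section 1, so it may help to see the two together. The sketch below assumes the same sales data as above. It uses np.nanmean and np.isnan, which this text has not introduced but which are standard NumPy functions. The array needs a float dtype because integer arrays cannot store NaN.

```python
import numpy as np

sales = np.array([100, 50, 120, 30, 80], dtype=np.float64)
sales[sales < 60] = np.nan

# A single NaN propagates through the ordinary mean
print(np.mean(sales))     # nan

# np.nanmean ignores NaN entries
print(np.nanmean(sales))  # 100.0

# The same result using a boolean mask of the valid entries
valid = ~np.isnan(sales)
print(sales[valid].mean())  # 100.0
```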

## 3. Reshaping and Combining Arrays: reshape, stack, and concatenate
The flexibility of NumPy arrays extends beyond their creation and elementary manipulation. Frequently, it becomes necessary to alter the dimensions of an existing array or to combine multiple arrays into a single structure. The functions reshape, stack, and concatenate are the designated tools for these tasks.

### 3.1. reshape(): Changing Array Dimensions
The reshape() function allows for altering the shape of an array without modifying its underlying data. The total number of elements in the array must remain constant. A value of -1 can be employed in one of the dimensions to allow NumPy to automatically infer the size of that particular dimension.

#### Basic Syntax:
ndarray.reshape(shape, order='C') or numpy.reshape(a, newshape, order='C')

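- shape: A tuple of integers that defines the new dimensions of the array. In numpy.reshape this parameter was named newshape; NumPy 2.1 and later call it shape and deprecate the old name, so passing it positionally works across versions.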
- order: Specifies the order in which array elements should be read for insertion into the new array. 'C' (default) corresponds to C-like (row-major) order, while 'F' corresponds to Fortran-like (column-major) order.
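
The effect of the order argument is easiest to see side by side. The following minimal sketch reshapes the same six values in both orders; the variable names are chosen for illustration.

```python
import numpy as np

data = np.arange(1, 7)  # [1 2 3 4 5 6]

# 'C' order fills the result row by row
row_major = data.reshape((2, 3), order='C')
print(row_major)
# [[1 2 3]
#  [4 5 6]]

# 'F' order fills the result column by column
col_major = data.reshape((2, 3), order='F')
print(col_major)
# [[1 3 5]
#  [2 4 6]]
```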

In [76]:
# 1D Array to 2D
data_1d = np.array([1, 2, 3, 4, 5, 6])
matrix_2x3 = data_1d.reshape((2, 3))
print(f"1D Array:\n{data_1d}")
print(f"2x3 Matrix (reshape):\n{matrix_2x3}\n")

# 2D Array to 1D
flattened_vector = matrix_2x3.reshape(-1) # -1 calculates the size automatically
print(f"Matrix flattened to 1D:\n{flattened_vector}\n") # Output: [1 2 3 4 5 6]

# Using -1 to infer a dimension
matrix_3x2 = data_1d.reshape((3, -1))
print(f"3x2 Matrix (with dimension inference):\n{matrix_3x2}\n")


# Reshaping to a 3D tensor
tensor_2x3x1 = data_1d.reshape((2, 3, 1))
print(f"2x3x1 Tensor:\n{tensor_2x3x1}\n")


1D Array:
[1 2 3 4 5 6]
2x3 Matrix (reshape):
[[1 2 3]
 [4 5 6]]

Matrix flattened to 1D:
[1 2 3 4 5 6]

3x2 Matrix (with dimension inference):
[[1 2]
 [3 4]
 [5 6]]

2x3x1 Tensor:
[[[1]
  [2]
  [3]]

 [[4]
  [5]
  [6]]]



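reshape returns a view whenever the requested shape can be expressed over the existing memory layout: no data is copied, and the result shares its buffer with the original, which is why the operation is so cheap. When that is not possible (for example, when flattening a transposed array), reshape silently returns a copy instead. Because a view shares memory, modifying it also modifies the original array.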
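
Whether a given reshape produced a view or a copy can be checked with np.shares_memory. The sketch below shows one case of each; the array names are illustrative.

```python
import numpy as np

base = np.arange(6)
reshaped = base.reshape(2, 3)

# Contiguous data: reshape returns a view onto the same buffer
print(np.shares_memory(base, reshaped))  # True
reshaped[0, 0] = 99
print(base[0])                           # 99, the original changed too

# A transposed array is not contiguous in C order, so flattening it must copy
transposed = reshaped.T
flat = transposed.reshape(-1)
print(np.shares_memory(transposed, flat))  # False
print(flat)                                # [99  3  1  4  2  5]
```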

### 3.2. np.stack(): Stacking Arrays
The np.stack() function combines a sequence of arrays into a single array, thereby increasing the number of dimensions. This is particularly useful when multiple arrays share the same shape and require stacking along a newly introduced axis.

#### Basic Syntax:
numpy.stack(arrays, axis=0, out=None)

- arrays: A sequence of input arrays. All arrays within the sequence must possess identical shapes.
- axis: The axis along which the input arrays will be stacked. The resultant array will feature an additional dimension at this specified axis.

In [79]:
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
c = np.array([7, 8, 9])

# Stacking along a new axis 0 (the default)
stacked_axis0 = np.stack((a, b, c), axis=0)
print(f"Stacked along axis 0:\n{stacked_axis0}\n")

# Stacking along a new axis 1
stacked_axis1 = np.stack((a, b, c), axis=1)
print(f"Stacked along axis 1:\n{stacked_axis1}\n")

# Example with 2D arrays
matrix1 = np.array([[1, 2], [3, 4]])
matrix2 = np.array([[5, 6], [7, 8]])

stacked_matrices = np.stack((matrix1, matrix2), axis=0)
print(f"Stacked matrices (axis=0):\n{stacked_matrices}\n")


Stacked along axis 0:
[[1 2 3]
 [4 5 6]
 [7 8 9]]

Stacked along axis 1:
[[1 4 7]
 [2 5 8]
 [3 6 9]]

Stacked matrices (axis=0):
[[[1 2]
  [3 4]]

 [[5 6]
  [7 8]]]



np.stack finds particular utility in deep learning contexts for combining batches of data or for adding a "channel" dimension to images.
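
As a rough illustration of those two uses, the sketch below builds a batch and a channels-last array from three small placeholder arrays. The 4x4 "images" are hypothetical stand-ins filled with constant values, not real image data.

```python
import numpy as np

# Three hypothetical 4x4 grayscale "images", filled with 0, 1, and 2
images = [np.full((4, 4), fill_value=i, dtype=np.float32) for i in range(3)]

# New leading axis: a batch of 3 images
batch = np.stack(images, axis=0)
print(batch.shape)  # (3, 4, 4)

# New trailing axis: the three planes become channels of one image
channels_last = np.stack(images, axis=-1)
print(channels_last.shape)  # (4, 4, 3)
print(channels_last[0, 0])  # [0. 1. 2.]
```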

### 3.3. np.concatenate(): Joining Arrays
The np.concatenate() function joins a sequence of existing arrays along a specified axis. In contrast to np.stack(), np.concatenate() does not introduce a new dimension; instead, it merges arrays along an existing dimension. A crucial requirement is that arrays slated for concatenation must possess identical shapes across all dimensions, with the exception of the dimension along which the concatenation is being performed.

#### Basic Syntax:
numpy.concatenate((a1, a2, ...), axis=0, out=None)

- (a1, a2, ...): A sequence (tuple or list) of input arrays.
- axis: The axis along which the arrays will be concatenated.

In [83]:
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])

# Concatenating 1D arrays (default is axis=0)
concatenated_1d = np.concatenate((arr1, arr2))
print(f"Concatenated 1D arrays:\n{concatenated_1d}\n")

# Concatenating 2D arrays
matrix1 = np.array([[1, 2],
                    [3, 4]])
matrix2 = np.array([[5, 6],
                    [7, 8]])

# Concatenating along axis 0 (rows)
concatenated_rows = np.concatenate((matrix1, matrix2), axis=0)
print(f"Concatenated by rows (axis=0):\n{concatenated_rows}\n")

# Concatenating along axis 1 (columns)
concatenated_columns = np.concatenate((matrix1, matrix2), axis=1)
print(f"Concatenated by columns (axis=1):\n{concatenated_columns}\n")

# Example with incompatible shapes (will raise an error)
# arr_incompatible = np.array([[9]])  # shape (1, 1): column count differs from matrix1
# np.concatenate((matrix1, arr_incompatible), axis=0) # ValueError: all the input array dimensions except for the concatenation axis must match exactly

Concatenated 1D arrays:
[1 2 3 4 5 6]

Concatenated by rows (axis=0):
[[1 2]
 [3 4]
 [5 6]
 [7 8]]

Concatenated by columns (axis=1):
[[1 2 5 6]
 [3 4 7 8]]
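
The shape requirement can be checked directly. The following sketch, using made-up arrays, concatenates one compatible array and then catches the ValueError raised in two incompatible cases: a mismatched column count and a mismatched number of dimensions. The `errors` list is only there to collect the messages.

```python
import numpy as np

matrix1 = np.array([[1, 2],
                    [3, 4]])

# Shape (1, 2): only the concatenation axis differs, so this works
extra_row = np.array([[9, 10]])
extended = np.concatenate((matrix1, extra_row), axis=0)
print(extended.shape)  # (3, 2)

errors = []
# Shape (1, 1) has the wrong column count; shape (2,) has the wrong number of dimensions
for candidate in (np.array([[9]]), np.array([9, 10])):
    try:
        np.concatenate((matrix1, candidate), axis=0)
    except ValueError as err:
        errors.append(str(err))
        print(f"ValueError: {err}")
```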



np.concatenate is the general form of more specific functions such as np.vstack (vertical stacking) and np.hstack (horizontal stacking). For 2D arrays, these are convenient shortcuts for np.concatenate with axis=0 and axis=1, respectively. For 1D inputs they behave differently: np.vstack first treats each array as a row, and np.hstack joins along axis 0.
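
The relationship between these functions can be verified with a short sketch. The arrays below are illustrative, and the 1D case is included because np.vstack and np.hstack treat 1D inputs differently from 2D ones.

```python
import numpy as np

matrix1 = np.array([[1, 2], [3, 4]])
matrix2 = np.array([[5, 6], [7, 8]])

# For 2D inputs, vstack/hstack match concatenate along axis 0/1
same_v = np.array_equal(np.vstack((matrix1, matrix2)),
                        np.concatenate((matrix1, matrix2), axis=0))
same_h = np.array_equal(np.hstack((matrix1, matrix2)),
                        np.concatenate((matrix1, matrix2), axis=1))
print(same_v, same_h)  # True True

# For 1D inputs, vstack makes each array a row; hstack joins end to end
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print(np.vstack((a, b)).shape)  # (2, 3)
print(np.hstack((a, b)).shape)  # (6,)
```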