## Introduction to NumPy
Imagine you just started a new job as a data analyst at an online store. On your first day, you're handed a large file of numbers – it's last year's sales data for various product categories across different quarters. You need to analyze this data to find trends and insights. You could use plain Python lists and loops for this task, but soon you realize it would be slow and cumbersome. This is where NumPy comes to the rescue!

**NumPy** (short for Numerical Python) is a fundamental library for scientific computing and data analysis in Python. It provides a powerful N-dimensional array object (ndarray) that allows you to store and manipulate large datasets efficiently. Operations on NumPy arrays are optimized and implemented in C, making them significantly faster than pure Python loops or lists for numerical computations. NumPy forms the backbone of many other libraries in the data science ecosystem (such as pandas, SciPy, and scikit-learn) and is an essential tool for anyone working with numerical data.

Before diving in, make sure you have NumPy installed. If you're using Anaconda or a Google Colab environment, NumPy is likely already available. Otherwise, you can install it via pip:

In [1]:
pip install numpy



Once installed, we typically import NumPy with an alias for convenience. The common convention is:

In [2]:
import numpy as np

## Creating Arrays
Our data analyst (let's call them Alex) obtains the sales numbers for each product category. First, Alex creates NumPy arrays from Python lists containing the data. For example, suppose Product A had quarterly sales (in thousands of dollars) of `[100, 120, 130, 150]` for Q1 through Q4. We can convert this list into a NumPy array:

In [3]:
# Create a NumPy array from a Python list for Product A's sales
product_a_sales = np.array([100, 120, 130, 150])
print(product_a_sales)
# The output is a one-dimensional NumPy array:

[100 120 130 150]


Now `product_a_sales` is an `ndarray` object. We can easily check its attributes:

In [4]:
print("Shape:", product_a_sales.shape)
print("Data type:", product_a_sales.dtype)
print("Dimensions:", product_a_sales.ndim)

Shape: (4,)
Data type: int64
Dimensions: 1


Here, `(4,)` indicates it's a 1-dimensional array of length 4. The data type is integer: NumPy picked its default integer type, which is `int64` on most 64-bit platforms (it can differ, for example `int32` on Windows with NumPy versions before 2.0). `ndim=1` confirms the array is 1D. All elements in a NumPy array must be of the same type, which is why NumPy arrays are called *homogeneous*.
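To make the idea of homogeneity concrete, here is a minimal sketch. The names `mixed` and `as_float` are illustrative only and are not part of Alex's dataset. When a list mixes integers and floats, NumPy converts every element to one common type:

```python
import numpy as np

# Mixing ints and a float: NumPy upcasts every element to a single float type
mixed = np.array([100, 120.5, 130])
print(mixed.dtype)

# A dtype can also be requested explicitly when the array is created
as_float = np.array([100, 120, 130, 150], dtype=np.float64)
print(as_float.dtype)
```

Requesting a `dtype` up front is a reasonable habit when the type matters later, for example when integer sales figures will be multiplied by fractional factors.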

Now, let's say the store has three product categories (A, B, and C) with their sales data for Q1-Q4 in Python lists. We can create a 2D NumPy array (matrix) to represent all products' sales. Each row will represent one product's sales over the four quarters:

In [5]:
# Sales data for three products (A, B, C) across 4 quarters
product_a = [100, 120, 130, 150]
product_b = [80, 90, 100, 110]
product_c = [90, 95, 100, 105]
sales_matrix = np.array([product_a, product_b, product_c])
print(sales_matrix)
print("Shape:", sales_matrix.shape)

[[100 120 130 150]
 [ 80  90 100 110]
 [ 90  95 100 105]]
Shape: (3, 4)


We now have a 2D array with 3 rows and 4 columns. The shape `(3, 4)` corresponds to 3 products and 4 quarters. This `sales_matrix` will be used in our analysis throughout the tutorial.

NumPy also provides convenient functions to create arrays without explicitly providing all the data. This is useful for initializing arrays or generating sequences of numbers:
* `np.zeros(shape)` creates an array filled with zeros. For example, if the store plans to launch a new product (Product D) and we want to prepare an array for its future 4 quarters of sales initialized to 0:

In [6]:
zeros_array = np.zeros(4)
print(zeros_array)
print("Shape:", zeros_array.shape)

[0. 0. 0. 0.]
Shape: (4,)


Notice that by default, `np.zeros` creates an array of type `float64` (thus the `0.` with a decimal point). You can specify the `dtype` if needed (e.g., `np.zeros(4, dtype=int)` for integers).

* `np.ones(shape)` creates an array of ones. For example, `np.ones((2,3))` would create a 2x3 array of ones. This can be handy for creating a baseline array or an array to hold coefficients.
* `np.arange(start, stop, step)` is similar to Python's `range()` but returns an array. It's often used to generate sequences. For instance, if we want an array of quarter numbers 1 through 4, we can do:

In [7]:
quarters = np.arange(1, 5)
print(quarters)

[1 2 3 4]


* `np.linspace(start, stop, num)` generates a specified number of evenly spaced values between start and stop (inclusive by default). This is useful in scientific computations or creating smooth sequences. For example, if Alex wants to simulate checking sales at 5 evenly-spaced points between 0 and 1 (perhaps as fractions of a target), they could use:

In [8]:
fractions = np.linspace(0, 1, 5)
print(fractions)

[0.   0.25 0.5  0.75 1.  ]


In summary, NumPy offers multiple ways to create arrays, either from existing data (Python lists) or by generating new data. Now that we have our `sales_matrix` ready, let's learn how to retrieve and manipulate data from it.
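As a compact recap of the creation functions above, the following sketch shows a few variations side by side. The variable names are made up for illustration; the `step` and `endpoint` arguments are optional parameters of `np.arange` and `np.linspace`.

```python
import numpy as np

zeros_int = np.zeros(4, dtype=int)        # integer zeros instead of the float default
ones_2x3 = np.ones((2, 3))                # the shape is passed as a tuple
even_quarters = np.arange(2, 5, 2)        # start=2, stop=5 (exclusive), step=2
open_interval = np.linspace(0, 1, 4, endpoint=False)  # stop value is excluded

print(zeros_int)
print(ones_2x3.shape)
print(even_quarters)
print(open_interval)
```

Note the difference in how the end value is treated: `np.arange` never includes `stop`, while `np.linspace` includes it unless `endpoint=False` is passed.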

## Array Indexing and Slicing
With our `sales_matrix` in hand, Alex needs to extract specific information from it. NumPy array indexing works similarly to Python lists (0-based indexing), but with additional capabilities for multi-dimensional arrays.

**Indexing Single Elements (1D and 2D)**: If we want a single value, we can directly index by its position. For a 1D array like `product_a_sales`, the syntax is the same as a list. For example, `product_a_sales[2]` would give the third element (Q3 sales for Product A). For a 2D array, we provide indices for each dimension, separated by a comma: `array[row_index, column_index]`.

Let's retrieve some specific sales figures:

In [9]:
# Example: Get Product B's sales in Q3 (Product B is index 1, Q3 is index 2)
product_b_q3 = sales_matrix[1, 2]
print("Product B sales in Q3:", product_b_q3)

# Example: Get Product C's sales in Q4 (Product C is index 2, Q4 is index 3)
product_c_q4 = sales_matrix[2, 3]
print("Product C sales in Q4:", product_c_q4)

Product B sales in Q3: 100
Product C sales in Q4: 105


In the code above, `sales_matrix[1, 2]` accesses the element in the 2nd row (index 1 corresponds to Product B) and 3rd column (index 2 corresponds to Q3). Similarly, `sales_matrix[2, 3]` fetches Product C's Q4 sales.

**Slicing Arrays**: Slicing in NumPy extends Python's list slicing to multiple dimensions. You use `start:stop` indices for each axis, separated by commas. Think of it as cropping out a sub-array from the matrix.

  * For 1D arrays, slicing is identical to lists. For example, `product_a_sales[1:3]` would give a slice of Product A's sales for Q2 and Q3 (indices 1 and 2).
  * For 2D arrays, you can slice each dimension. For instance, let's get the sales data for the first two products (rows) and first three quarters (columns):

In [10]:
first_two_products = sales_matrix[0:2, 0:3]
print(first_two_products)

[[100 120 130]
 [ 80  90 100]]


This slice took rows 0 and 1 (Product A and B) and columns 0, 1, 2 (Q1, Q2, Q3) from `sales_matrix`. You can omit either the start or end index to slice from the beginning or through the end respectively. For example, `sales_matrix[:2, :]` selects the same first two rows but keeps all four columns.
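A few more slicing patterns, sketched on the same data (the matrix is redefined so the snippet runs on its own, and the variable names are illustrative):

```python
import numpy as np

sales_matrix = np.array([[100, 120, 130, 150],
                         [80, 90, 100, 110],
                         [90, 95, 100, 105]])

product_b_all = sales_matrix[1, :]   # one whole row: Product B, all quarters
q4_all = sales_matrix[:, 3]          # one whole column: Q4 for every product
second_half = sales_matrix[:, 2:]    # Q3 and Q4 for every product
last_product = sales_matrix[-1, :]   # negative indices count from the end, as with lists

print(product_b_all)
print(q4_all)
print(second_half.shape)
print(last_product)
```

Indexing a single row or column with an integer drops that dimension, so `product_b_all` and `q4_all` are 1D arrays, while `second_half` stays 2D.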

**Using Boolean Indexing:** NumPy allows you to use boolean conditions to filter arrays. This is a powerful feature to extract elements that satisfy a condition. When you apply a condition to an array, you get a boolean array of the same shape, with `True` where the condition holds and `False` elsewhere. You can use this boolean array to index the original array.

For example, Alex might want to find all quarter sales that exceeded $100k. Let's create a boolean mask and then use it:

In [11]:
mask = sales_matrix > 100
print("Mask of sales > 100:\n", mask)
# Use the mask to filter the array
high_sales = sales_matrix[mask]
print("Sales values > 100:", high_sales)

Mask of sales > 100:
 [[False  True  True  True]
 [False False False  True]
 [False False False  True]]
Sales values > 100: [120 130 150 110 105]


The first output is a boolean matrix the same shape as `sales_matrix`, showing which entries are >100. The second line shows the actual values that met the condition, returned as a 1D array. Boolean indexing is very useful for filtering data based on conditions.
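Masks can also be combined or counted. The sketch below is an illustrative extension of the example above, not part of Alex's original analysis. It uses `&` with parentheses to combine two conditions, and relies on `True` counting as 1 when a mask is summed:

```python
import numpy as np

sales_matrix = np.array([[100, 120, 130, 150],
                         [80, 90, 100, 110],
                         [90, 95, 100, 105]])

mask = sales_matrix > 100

# Count how many quarter figures exceeded 100, overall and per product
count_high = mask.sum()
count_high_per_product = mask.sum(axis=1)

# Combine two conditions element-wise; each comparison needs its own parentheses
in_range = sales_matrix[(sales_matrix >= 90) & (sales_matrix <= 100)]

print(count_high)
print(count_high_per_product)
print(in_range)
```

Python's `and`/`or` keywords do not work element-wise on arrays, which is why the `&` and `|` operators are used here.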

**Fancy Indexing (Indexing with Arrays of Indices)**: Fancy indexing refers to passing an array (or list) of indices to directly select multiple elements. This allows non-consecutive or custom order retrieval of array elements.

Let's say Alex wants to compare specific quarters or specific products:

In [12]:
# Select Product A and Product C rows (indices 0 and 2)
print("Products A and C (rows 0 and 2):\n", sales_matrix[[0, 2], :])

# Select Q1 and Q4 columns (indices 0 and 3) for all products
print("Quarters Q1 and Q4 (cols 0 and 3):\n", sales_matrix[:, [0, 3]])

Products A and C (rows 0 and 2):
 [[100 120 130 150]
 [ 90  95 100 105]]
Quarters Q1 and Q4 (cols 0 and 3):
 [[100 150]
 [ 80 110]
 [ 90 105]]


In the first print, we picked rows 0 and 2 (Product A and C) and all columns to get a sub-matrix containing only those products. In the second print, we picked columns 0 and 3 (Q1 and Q4) for every row, retrieving the sales for those quarters. Fancy indexing makes it easy to grab specific rows and columns in one go.

  **Note:** Fancy indexing (using index arrays) always returns a copy of the data, whereas basic slicing (using `:`) returns a view (reference) of the original array. This means that modifying a slice also modifies the original array, but modifying a fancy-indexed result does not. Keep this in mind when manipulating subarrays.
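The view-versus-copy distinction is easy to verify. The sketch below works on a copy named `work` (a hypothetical name) so the tutorial's data stays intact, and uses `np.shares_memory` to check whether two arrays overlap in memory:

```python
import numpy as np

sales_matrix = np.array([[100, 120, 130, 150],
                         [80, 90, 100, 110],
                         [90, 95, 100, 105]])
work = sales_matrix.copy()   # scratch copy; the original matrix is left untouched

row_view = work[0, :]        # basic slicing returns a view
row_view[0] = 999            # this write shows up in `work` as well

picked = work[[0, 2], :]     # fancy indexing returns a copy
picked[0, 1] = -1            # this write does NOT reach `work`

print(work[0])
print(np.shares_memory(work, row_view))
print(np.shares_memory(work, picked))
```

When an independent sub-array is needed from a slice, calling `.copy()` on it, as done for `work` above, avoids accidental changes to the source data.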


Now that we can access and filter our data, let's see how to change the shape or layout of arrays to suit our needs.

## Reshaping and Resizing
Sometimes data isn't in the ideal shape you need. For example, imagine the sales data was loaded as a single long list of values (perhaps from a CSV or an API) without clear row/column structure. NumPy makes it easy to reshape arrays without changing the data.

Let's say we received a flat list of 12 sales values (which we know correspond to 3 products × 4 quarters). We'll convert that list to a NumPy array and then reshape it into a 3x4 matrix:

In [13]:
# Flat list of 12 values (e.g., from a file) - same data as our sales_matrix
flat_data = [100, 120, 130, 150, 80, 90, 100, 110, 90, 95, 100, 105]
data_array = np.array(flat_data)
print("Flat array shape:", data_array.shape)

# Reshape this 1D array into 3 rows and 4 columns
reshaped_data = data_array.reshape(3, 4)
print("Reshaped array:\n", reshaped_data)
print("Reshaped array shape:", reshaped_data.shape)

Flat array shape: (12,)
Reshaped array:
 [[100 120 130 150]
 [ 80  90 100 110]
 [ 90  95 100 105]]
Reshaped array shape: (3, 4)


We started with a 1D array of length 12 and reshaped it into a 3x4 2D array. Note that the total number of elements remains the same (12); NumPy raises a `ValueError` if the requested shape does not match the total size. (A useful trick: you can use `-1` as one of the dimensions in `reshape` to let NumPy compute it automatically, as long as the other dimension divides the total size evenly. For example, `data_array.reshape(3, -1)` would have figured out 4 columns automatically.)
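Both behaviors described above, the `-1` shortcut and the size check, can be sketched as follows. A stand-in array of 12 consecutive integers is used here instead of the sales figures, and the variable names are illustrative:

```python
import numpy as np

data_array = np.arange(12)               # stand-in for the 12 flat sales values

auto_cols = data_array.reshape(3, -1)    # NumPy infers 4 columns
auto_rows = data_array.reshape(-1, 6)    # NumPy infers 2 rows
print(auto_cols.shape, auto_rows.shape)

# 12 elements cannot fill a 5x3 layout, so reshape refuses
reshape_error = None
try:
    data_array.reshape(5, 3)
except ValueError as err:
    reshape_error = str(err)
print("Reshape failed:", reshape_error)
```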

Reshaping is often handy to organize raw data into a matrix or to flatten multi-dimensional data for processing. Speaking of flattening, sometimes we need to go back to 1D. NumPy provides:
  * `flatten()` method to create a 1D copy of an array.
  * `ravel()` method to get a flattened view of the array (if possible, it avoids copying data).

For example, if we flatten our `reshaped_data` back to 1D:

In [14]:
flat_again = reshaped_data.flatten()
ravel_again = reshaped_data.ravel()
print("Flattened array:", flat_again)
print("Raveled array:", ravel_again)

Flattened array: [100 120 130 150  80  90 100 110  90  95 100 105]
Raveled array: [100 120 130 150  80  90 100 110  90  95 100 105]


They look the same. The difference is that `flat_again` is a new independent array, whereas `ravel_again` is a view of the original `reshaped_data` whenever the memory layout allows it (as it does here, because the data is contiguous). If we were to modify `ravel_again`, `reshaped_data` would reflect those changes (because they share data in memory), whereas modifying `flat_again` would not affect the original. For most basic uses, both give you a 1D array of the elements.
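A short check of that difference, assuming the same contiguous 3x4 array as above (rebuilt here so the snippet is self-contained):

```python
import numpy as np

reshaped_data = np.array([100, 120, 130, 150,
                          80, 90, 100, 110,
                          90, 95, 100, 105]).reshape(3, 4)

flat_again = reshaped_data.flatten()   # always an independent copy
ravel_again = reshaped_data.ravel()    # a view here, since the data is contiguous

print(np.shares_memory(reshaped_data, flat_again))
print(np.shares_memory(reshaped_data, ravel_again))

ravel_again[0] = 101                   # writes through to reshaped_data
print(reshaped_data[0, 0], flat_again[0])
```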

**Resizing Arrays**: What if we need to change the size of an array? NumPy's `reshape` only changes the shape, not the total size. To change the size (for example, adding or removing elements), we can use the `resize` method.

Continuing our story, suppose the store adds a new product (Product D) and we want to extend our `sales_matrix` to include this new product's data. We currently have a 3x4 array and want a 4x4 array. We can resize the array in-place:

In [15]:
expanded_sales = sales_matrix.copy()
expanded_sales.resize((4, 4))  # Resize to 4 rows and 4 columns
print(expanded_sales)
print("New shape:", expanded_sales.shape)

[[100 120 130 150]
 [ 80  90 100 110]
 [ 90  95 100 105]
 [  0   0   0   0]]
New shape: (4, 4)


We resized the array to add an extra row. The new row is filled with `0` because no data was available for it (the `ndarray.resize` method fills with zeros when expanding an array; the separate `np.resize` function behaves differently and fills with repeated copies of the original data). We could now, for instance, fill in this new row with projected sales for Product D or leave them as zeros until data is available.

Be cautious when resizing: if we resize to a smaller shape, data will be truncated (cut off). Also, the `resize` method changes the array in place, and by default it raises a `ValueError` if the array is referenced elsewhere (for example, by another variable or a view) unless you pass `refcheck=False`. In practice, many NumPy workflows avoid in-place resizing and instead create new arrays as needed, but it's good to know this functionality exists.
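One way to create a new array instead of resizing in place is to stack an extra row onto the matrix with `np.vstack`. This is a sketch of one possible alternative, not the only approach. The placeholder row `product_d` is hypothetical, and `np.resize` is shown only for contrast:

```python
import numpy as np

sales_matrix = np.array([[100, 120, 130, 150],
                         [80, 90, 100, 110],
                         [90, 95, 100, 105]])

# Hypothetical placeholder row for Product D, matching the matrix dtype
product_d = np.zeros(4, dtype=sales_matrix.dtype)

# vstack builds a NEW 4x4 array; sales_matrix itself is unchanged
expanded = np.vstack([sales_matrix, product_d])
print(expanded)
print(sales_matrix.shape)

# For contrast: the np.resize FUNCTION pads by repeating the original data
repeated = np.resize(sales_matrix, (4, 4))
print(repeated[-1])
```

Because `np.vstack` returns a new array, other variables that refer to `sales_matrix` are unaffected, which avoids the in-place pitfalls described above.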

Now that our data is well-structured and we know how to reshape it as needed, let's explore one of NumPy's most powerful features: broadcasting.

## Broadcasting
One of NumPy's most powerful features is **broadcasting**. Broadcasting allows NumPy to perform arithmetic operations on arrays of different shapes by automatically expanding one array to match the shape of the other (without actually copying data for each expansion). This sounds abstract, but it's easier to understand with examples.

Let's start with a simple example: adding a 1D array to a 2D array. Suppose we have a 2x3 matrix and a 1D array of length 3:

In [16]:
A = np.ones((2, 3))
b = np.array([10, 20, 30])
print("A =\n", A)
print("b =", b)
print("A + b =\n", A + b)

A =
 [[1. 1. 1.]
 [1. 1. 1.]]
b = [10 20 30]
A + b =
 [[11. 21. 31.]
 [11. 21. 31.]]


Here, `A` is shape (2,3) and `b` is shape (3,). When we do `A + b`, NumPy broadcasts `b` across the two rows of `A`. In other words, it behaves as if `b` were duplicated for each row of `A`, and then addition is done element-wise. The result is a 2x3 matrix where `b`'s values have been added to each row of `A`.

The rule of broadcasting is that two arrays can be operated on together if their shapes are compatible. NumPy compares the shapes dimension by dimension, starting from the trailing (rightmost) one: for each pair, the sizes must either be equal or one of them must be 1 (a dimension missing from the shorter shape is treated as size 1). A dimension of size 1 is stretched to match the other array. In our example, `b` with shape (3,) is treated as shape (1,3), which then stretches to (2,3) to match `A`.
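The compatibility rule can be checked without doing any arithmetic. The sketch below assumes NumPy 1.20 or later, where `np.broadcast_shapes` is available, and then shows what happens when two shapes are not compatible:

```python
import numpy as np

# Shapes are compared from the rightmost dimension
shape_a = np.broadcast_shapes((2, 3), (3,))     # (3,) is treated as (1, 3)
shape_b = np.broadcast_shapes((3, 1), (1, 4))   # both size-1 dimensions stretch
print(shape_a, shape_b)

# (2, 3) and (2,) are NOT compatible: the trailing sizes are 3 and 2
broadcast_error = None
try:
    np.ones((2, 3)) + np.ones(2)
except ValueError as err:
    broadcast_error = str(err)
print("Broadcast failed:", broadcast_error)
```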

Now let's apply this to our sales data. Suppose the store manager tells Alex that each quarter next year is expected to grow by a certain factor (due to seasonality or market trends). For example, they expect:

  * Q1 sales to increase by 10% (factor 1.1)
  * Q2 sales to increase by 20% (factor 1.2)
  * Q3 sales to increase by 5% (factor 1.05)
  * Q4 sales to increase by 30% (factor 1.3)

We can represent these growth factors as a NumPy array:

In [17]:
growth_factors = np.array([1.1, 1.2, 1.05, 1.3])
print("Growth factors:", growth_factors)
print("Shape:", growth_factors.shape)

Growth factors: [1.1  1.2  1.05 1.3 ]
Shape: (4,)


This is a 1D array with 4 elements, corresponding to the factor for each quarter. Now, to apply these factors to the entire sales matrix (shape 3x4), we simply multiply the two arrays:

In [18]:
projected_sales = sales_matrix * growth_factors
print("Projected sales for next year:\n", projected_sales)

Projected sales for next year:
 [[110.  144.  136.5 195. ]
 [ 88.  108.  105.  143. ]
 [ 99.  114.  105.  136.5]]


Every column of `sales_matrix` was multiplied by the corresponding factor in `growth_factors`. Notice that the resulting array is float-valued (the factors were floats and some multiplications, like 130×1.05, produce fractional values). NumPy automatically took our 3x4 `sales_matrix` and a 4-element `growth_factors` and broadcast the latter to shape (3,4) behind the scenes.

Broadcasting is extremely useful for applying operations across data without writing loops. We could just as easily add a constant to an entire matrix (e.g., `sales_matrix + 5` would add 5 to every entry), or subtract two arrays of different shapes as long as they are compatible. It makes code more concise and often more efficient.
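Broadcasting along the other axis needs one extra step. Suppose, purely as an illustration, that each product gets its own flat adjustment (the `product_bonus` values below are invented). A length-3 array does not line up with the trailing dimension of size 4, so it is first turned into a column of shape (3, 1):

```python
import numpy as np

sales_matrix = np.array([[100, 120, 130, 150],
                         [80, 90, 100, 110],
                         [90, 95, 100, 105]])

# Hypothetical per-product adjustment, in thousands of dollars
product_bonus = np.array([5, 10, 0])

# Shape (3,) -> (3, 1), which broadcasts across the 4 quarter columns
adjusted = sales_matrix + product_bonus[:, np.newaxis]
print(adjusted)
```

Writing `sales_matrix + product_bonus` without the reshape would raise a `ValueError`, because shapes (3, 4) and (3,) disagree in the trailing dimension.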

Now that we've seen how NumPy can handle different shapes seamlessly, let's look at another major advantage of NumPy: performing operations on entire arrays (vectorization) which leads to huge performance gains.

## Vectorized Operations and Performance
NumPy is designed for **vectorized operations**, meaning you can apply operations to entire arrays without writing explicit loops in Python. These operations happen under the hood in optimized C code. We've already seen examples of this: when we did `sales_matrix * growth_factors`, NumPy multiplied each element of `sales_matrix` by the corresponding element of `growth_factors` automatically. No Python `for` loops were needed in our code.

Vectorized operations make code more concise and much faster. To illustrate the performance gain, let's do a simple experiment. Alex wants to compute the sum of squares of the first one million non-negative integers (0 through 999,999). We'll do it first using pure Python (with a list comprehension) and then using NumPy, and time each approach:

In [19]:
N = 1000000
python_list = list(range(N))
numpy_array = np.arange(N)

# Summation of squares using Python list comprehension
%timeit sum([x**2 for x in python_list])

# Summation of squares using NumPy vectorized operations
%timeit np.sum(numpy_array**2)

85.4 ms ± 2.48 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.42 ms ± 218 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


*(The `%timeit` magic command runs the statement many times and reports the mean time per loop, along with the standard deviation across runs.)*

You can see that the NumPy approach is **more than an order of magnitude faster** (in this example, roughly 35 times faster: 85.4 ms versus 2.42 ms) than the pure Python approach. The exact numbers will vary by machine and NumPy version, but NumPy's speedup becomes more pronounced as the data size grows.

Why this huge difference? Because the NumPy version offloads the looping to highly optimized C code (and uses vectorized CPU operations), while the Python version has to loop in Python, which is much slower for large workloads. In general, whenever possible, you should try to use NumPy's vectorized operations (such as arithmetic on arrays, NumPy math functions, aggregations like sum/mean, etc.) instead of explicit Python loops.
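`%timeit` only works inside IPython or Jupyter. Outside a notebook, a comparable measurement can be sketched with the standard library's `timeit` module. The smaller `N`, the number of repetitions, and the explicit `int64` dtype below are choices made for this sketch, not part of the original experiment:

```python
import timeit
import numpy as np

N = 100_000
REPEATS = 5
python_list = list(range(N))
# Explicit int64 avoids overflow on platforms whose default integer is 32-bit
numpy_array = np.arange(N, dtype=np.int64)

# Both approaches should agree on the answer before we compare speed
python_result = sum(x**2 for x in python_list)
numpy_result = np.sum(numpy_array**2)

loop_time = timeit.timeit(lambda: sum(x**2 for x in python_list), number=REPEATS)
numpy_time = timeit.timeit(lambda: np.sum(numpy_array**2), number=REPEATS)

print(f"Python loop: {loop_time / REPEATS * 1e3:.2f} ms per run")
print(f"NumPy:       {numpy_time / REPEATS * 1e3:.2f} ms per run")
```

Timings depend on hardware and library versions, so treat the printed numbers as indicative rather than exact.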

Beyond speed, vectorized code is often more concise and easier to read. Compare `np.sum(numpy_array**2)` to a multi-line loop or comprehension; the intent is clear and the code is shorter.

Next, let's look at some of these convenient math functions and aggregations that NumPy provides out of the box for arrays.

## Common Math Functions and Statistics
NumPy comes with a plethora of built-in mathematical functions that apply to arrays as a whole, or along a specified axis. These allow us to quickly compute statistics or transform data without manual loops.

Let's explore some of the commonly used functions using our sales data:

In [20]:
# Total sales across all products and quarters (sum of all elements)
total_sales = np.sum(sales_matrix)
# Average (mean) sales across all products and quarters
average_sales = np.mean(sales_matrix)
# Standard deviation of sales values
sales_std = np.std(sales_matrix)
# Minimum and maximum sales values
min_sales = np.min(sales_matrix)
max_sales = np.max(sales_matrix)
# Indices of minimum and maximum (in the flattened array)
min_index = np.argmin(sales_matrix)
max_index = np.argmax(sales_matrix)
print(f"Total sales (all products, all quarters): {total_sales}")
print(f"Average sales value: {average_sales:.2f}")
print(f"Standard deviation of sales: {sales_std:.2f}")
print(f"Minimum sale value: {min_sales} at flattened index {min_index}")
print(f"Maximum sale value: {max_sales} at flattened index {max_index}")

Total sales (all products, all quarters): 1270
Average sales value: 105.83
Standard deviation of sales: 18.58
Minimum sale value: 80 at flattened index 4
Maximum sale value: 150 at flattened index 3


We see that the total sales for the year (summing all products and quarters) is 1270 (thousand dollars in our example dataset). The average sales value is about 105.83. NumPy's `mean` and `sum` made it easy to get these with one function call. The standard deviation (~18.58) gives a sense of how much variation there is in the sales figures.

The `np.min` and `np.max` give us the smallest and largest sales values. We also used `np.argmin` and `np.argmax`, which return the index of the minimum and maximum values (here the index is in the flattened 1D sense). In our case, the minimum value 80 has index 4 (which corresponds to Product B Q1 if we map it back to 2D), and the maximum value 150 has index 3 (Product A Q4).
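Mapping a flattened index back to a (row, column) pair does not have to be done by hand. A small sketch using `np.unravel_index` (the variable names are illustrative):

```python
import numpy as np

sales_matrix = np.array([[100, 120, 130, 150],
                         [80, 90, 100, 110],
                         [90, 95, 100, 105]])

flat_max = np.argmax(sales_matrix)
flat_min = np.argmin(sales_matrix)

# Convert each flat index into (row, column) for the matrix shape
max_row, max_col = np.unravel_index(flat_max, sales_matrix.shape)
min_row, min_col = np.unravel_index(flat_min, sales_matrix.shape)

print("Max at row", max_row, "column", max_col)   # Product A, Q4
print("Min at row", min_row, "column", min_col)   # Product B, Q1
```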

We can also compute these statistics along a particular axis. For example, let's find the total sales for each product (summing across quarters, axis=1) and the average sales for each quarter (averaging across products, axis=0):

In [21]:
total_per_product = np.sum(sales_matrix, axis=1)
avg_per_quarter = np.mean(sales_matrix, axis=0)
print("Total sales per product:", total_per_product)
print("Average sales per quarter:", np.round(avg_per_quarter, 2))

Total sales per product: [500 380 390]
Average sales per quarter: [ 90.   101.67 110.   121.67]


The first array tells us Product A sold 500, Product B 380, and Product C 390 (in thousands of dollars, the unit of our dataset). The second array shows the average sales in each quarter across the products (Q1 average 90, Q2 ~101.67, etc.). We used `np.round` just to make the quarter averages easier to read.

Similarly, we could use `np.argmax(sales_matrix, axis=1)` to find which quarter each product had its peak sales (you'd get an array of quarter indices), or `np.argmin(..., axis=0)` to find which product performed worst in each quarter, and so on.
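The two calls mentioned above can be sketched on the sales data like this (the result names are illustrative):

```python
import numpy as np

sales_matrix = np.array([[100, 120, 130, 150],
                         [80, 90, 100, 110],
                         [90, 95, 100, 105]])

# For each product (row): index of the quarter with the highest sales
peak_quarter = np.argmax(sales_matrix, axis=1)

# For each quarter (column): index of the product with the lowest sales
worst_product = np.argmin(sales_matrix, axis=0)

print("Peak quarter index per product:", peak_quarter)
print("Worst product index per quarter:", worst_product)
```

When several entries tie, as Products B and C do in Q3 with 100 each, `argmin` and `argmax` return the index of the first occurrence.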

These functions (`sum`, `mean`, `std`, `min`, `max`, `argmin`, `argmax`, and many more like `median`, `percentile`, etc.) allow for quick analysis of data. By leveraging these, Alex can quickly summarize and derive insights from the sales data without manual calculations.
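As a brief illustration of two of the other functions named above, `np.median` and `np.percentile` follow the same pattern, including the optional `axis` argument. The choice of the 90th percentile here is arbitrary:

```python
import numpy as np

sales_matrix = np.array([[100, 120, 130, 150],
                         [80, 90, 100, 110],
                         [90, 95, 100, 105]])

overall_median = np.median(sales_matrix)             # middle value of all 12 figures
median_per_product = np.median(sales_matrix, axis=1) # one median per row
p90 = np.percentile(sales_matrix, 90)                # value below which ~90% of figures fall

print("Overall median:", overall_median)
print("Median per product:", median_per_product)
print("90th percentile:", p90)
```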

We've now covered a lot of ground: from creating arrays and slicing them, to reshaping and broadcasting, to performing fast computations and summary statistics. Let's wrap up with a brief conclusion.

## Conclusion
In this tutorial, we followed a story-driven approach to learn the fundamentals of NumPy. We saw how NumPy provides:
1. **Efficiency:** Operations on arrays are fast and use vectorized implementations in C, which gave us huge performance gains over pure Python loops.
2. **Convenience:** A simple and expressive syntax for array operations (e.g., adding a constant, element-wise arithmetic, slicing with ease).
3. **Powerful features:** such as broadcasting (for handling arrays of different shapes) and a wide range of built-in functions for computations and statistics.
4. **Ecosystem:** NumPy arrays are the foundation of many other libraries (pandas, SciPy, scikit-learn, etc.), so mastering NumPy will pay off across the data science stack.

Using our online store sales data example, Alex was able to quickly manipulate and analyze the data in ways that would be cumbersome with plain Python lists. NumPy enabled concise code and fast computation, turning raw data into insights.

To further solidify your NumPy skills, here are a few practice ideas:
1. Create a NumPy array of a range of numbers (say 1 to 20) and reshape it into a 4x5 matrix. Try selecting different sub-blocks of this matrix using slicing.
2. Generate an array of random integers (for example, using `np.random.randint`) and use boolean indexing to keep only the values above a certain threshold.
3. Given two lists of numbers, turn them into NumPy arrays and compute their element-wise product and sum without using any explicit loops.


Keep practicing by applying NumPy to real or simulated datasets. The more you use it, the more naturally these operations will come. With these fundamentals, you are well on your way to leveraging NumPy for efficient numerical computing. Happy coding!