# 01 - NumPy Introduction

## Introduction

NumPy (Numerical Python) is the fundamental package for numerical computing in Python. It provides powerful N-dimensional array objects and tools for working with them. NumPy is the foundation for most data science and machine learning libraries in Python.

## What You'll Learn

- What is NumPy?
- Why NumPy for data engineering?
- Understanding arrays vs Python lists
- Importing NumPy
- Basic array creation
- Array attributes (shape, dtype, size)
- Data types (dtype)
- Performance compared with Python lists
- Viewing and inspecting arrays


## What is NumPy?

**NumPy** stands for "Numerical Python" and is the core library for numerical computing in Python. It provides:

- **ndarray**: N-dimensional array object (the main data structure)
- **Vectorized operations**: Fast element-wise operations
- **Mathematical functions**: Comprehensive library of mathematical functions
- **Linear algebra**: Tools for matrix operations
- **Broadcasting**: Operations between arrays of different but compatible shapes
- **Performance**: Compact, contiguous storage and optimized C loops
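
As a rough illustration of the features listed above, the sketch below touches each one in a line or two. The array names and values are made up for the example.

```python
import numpy as np

# Vectorized operation and a mathematical function, applied to every element
values = np.array([1.0, 4.0, 9.0])
doubled = values * 2          # [ 2.  8. 18.]
roots = np.sqrt(values)       # [1. 2. 3.]

# Linear algebra: matrix-vector product with the @ operator
matrix = np.array([[1.0, 2.0], [3.0, 4.0]])
vector = np.array([1.0, 1.0])
product = matrix @ vector     # [3. 7.]

# Broadcasting: the shape-(2,) row is applied to each row of the (2, 2) matrix
offsets = np.array([10.0, 20.0])
shifted = matrix + offsets    # [[11. 22.] [13. 24.]]

print(doubled, roots, product, shifted, sep="\n")
```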

**Why NumPy for Data Engineering?**
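- **Performance**: Often an order of magnitude or more faster than Python lists for numerical operations; the exact speedup depends on the operation and the array size
- **Memory efficiency**: Stores raw values in one contiguous buffer instead of pointers to separate Python objects
- **Foundation**: Pandas and scikit-learn are built on NumPy, and TensorFlow and PyTorch interoperate with NumPy arrays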
- **Vectorization**: Enables batch operations without explicit loops
- **Scientific computing**: Essential for numerical computations and data preprocessing
- **Industry standard**: Used in virtually every data science and ML project
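
The memory claim can be checked with a rough estimate. The sketch below assumes CPython, where `sys.getsizeof` reports the size of the list container and of each integer object separately. The list figure is an approximation, because CPython shares some small integer objects between references.

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
# Explicit dtype so the element size is 8 bytes on every platform
np_arr = np.array(py_list, dtype=np.int64)

# A list holds pointers to separate int objects, so count the
# container plus the objects it points to (approximate).
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)

# An array holds the raw 8-byte values in one contiguous buffer.
array_bytes = np_arr.nbytes

print(f"list:  about {list_bytes} bytes")
print(f"array: {array_bytes} bytes of element data")
```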


## NumPy Arrays vs Python Lists

Understanding the difference between NumPy arrays and Python lists is crucial:

| Feature | Python Lists | NumPy Arrays |
|---------|-------------|--------------|
| **Data Type** | Can contain mixed types | Homogeneous (same type) |
| **Memory** | More memory per element | Less memory per element |
| **Speed** | Slower for numerical ops | Much faster (vectorized) |
| **Operations** | Element-wise loops needed | Vectorized operations |
| **Size** | Dynamic, can grow | Fixed at creation (resizing makes a new array) |
| **Use Case** | General purpose | Numerical computing |

**Key Advantage**: NumPy arrays enable vectorized operations: an operation is applied to the entire array in compiled code, rather than by looping through elements in Python.
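
A small sketch of the difference, using made-up price data. Note that `*` and `+` mean different things for lists and arrays, which is a common source of confusion.

```python
import numpy as np

prices = [10.0, 20.0, 30.0]
prices_arr = np.array(prices)

# With a list, arithmetic needs an explicit loop or comprehension
doubled_list = [p * 2 for p in prices]    # [20.0, 40.0, 60.0]

# With an array, the same arithmetic is a single expression
doubled_arr = prices_arr * 2              # [20. 40. 60.]

# Caution: on lists these operators repeat and concatenate instead
repeated = prices * 2                     # six elements
joined = prices + prices                  # six elements
summed = prices_arr + prices_arr          # three elements, added pairwise

print(doubled_list, doubled_arr, len(repeated), len(joined), summed)
```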


## Importing NumPy

The standard convention is to import NumPy as `np`.


In [1]:
import numpy as np

# Check NumPy version
print(f"NumPy version: {np.__version__}")
print("NumPy imported successfully!")


NumPy version: 2.4.0
NumPy imported successfully!


## Understanding NumPy Arrays

A **NumPy array** (ndarray) is a homogeneous multidimensional array. Think of it as:
- A 1D array: Like a Python list, but all elements are the same type
- A 2D array: Like a matrix or table (rows and columns)
- A 3D+ array: Like a cube or higher-dimensional structure

**Key characteristics:**
- All elements must be the same data type
- Fixed size (more memory efficient)
- Supports vectorized operations
- Fast mathematical operations
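
The first two characteristics have practical consequences, sketched below with an illustrative array: a value that does not match the array's dtype is converted on assignment, and "growing" an array actually builds a new one.

```python
import numpy as np

ints = np.array([1, 2, 3])

# Same data type: a float assigned into an integer array is truncated
ints[0] = 9.7
print(ints)                 # [9 2 3]

# Fixed size: np.append returns a new, larger array and leaves the original alone
extended = np.append(ints, 4)
print(extended)             # [9 2 3 4]
print(ints.size)            # 3
print(extended is ints)     # False
```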


In [2]:
# Creating a 1D array from a Python list
arr1d = np.array([1, 2, 3, 4, 5])
print("1D Array:")
print(arr1d)
print(f"Type: {type(arr1d)}")
print(f"Shape: {arr1d.shape}")
print(f"Dimensions: {arr1d.ndim}")


1D Array:
[1 2 3 4 5]
Type: <class 'numpy.ndarray'>
Shape: (5,)
Dimensions: 1


In [3]:
# Creating a 2D array (matrix)
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print("2D Array (Matrix):")
print(arr2d)
print(f"Shape: {arr2d.shape}")  # (rows, columns)
print(f"Dimensions: {arr2d.ndim}")


2D Array (Matrix):
[[1 2 3]
 [4 5 6]
 [7 8 9]]
Shape: (3, 3)
Dimensions: 2


In [4]:
# Creating a 3D array
arr3d = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
print("3D Array:")
print(arr3d)
print(f"Shape: {arr3d.shape}")  # (depth, rows, columns)
print(f"Dimensions: {arr3d.ndim}")


3D Array:
[[[1 2]
  [3 4]]

 [[5 6]
  [7 8]]]
Shape: (2, 2, 2)
Dimensions: 3
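
One way to read a shape tuple is as the list length at each nesting level, outermost first. The sketch below rebuilds the 3D array from above to check this; the `nested` and `lengths` names are only for illustration.

```python
import numpy as np

nested = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
arr3d = np.array(nested)

# Each shape entry is the list length at one nesting level, outermost first
lengths = (len(nested), len(nested[0]), len(nested[0][0]))
print(lengths == arr3d.shape)            # True
print(arr3d.ndim == len(arr3d.shape))    # True

# Indexing follows the same order: block 1, row 0, column 1
print(arr3d[1, 0, 1])                    # 6
```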


## Array Attributes

NumPy arrays have several important attributes that describe their properties:


In [5]:
# Create a sample array
arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
print("Array:")
print(arr)
print()

# Shape: dimensions of the array (rows, columns, ...)
print(f"Shape: {arr.shape}")
print(f"Number of rows: {arr.shape[0]}")
print(f"Number of columns: {arr.shape[1]}")

# Size: total number of elements
print(f"Size (total elements): {arr.size}")

# ndim: number of dimensions
print(f"Number of dimensions: {arr.ndim}")

# dtype: data type of elements
print(f"Data type: {arr.dtype}")

# itemsize: size of each element in bytes
print(f"Item size (bytes): {arr.itemsize}")

# nbytes: bytes used by the elements (size * itemsize), excluding object overhead
print(f"Total memory (bytes): {arr.nbytes}")


Array:
[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]

Shape: (3, 4)
Number of rows: 3
Number of columns: 4
Size (total elements): 12
Number of dimensions: 2
Data type: int64
Item size (bytes): 8
Total memory (bytes): 96
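
These attributes are related to one another, which the following sketch checks for the same 3 x 4 array. The dtype is set explicitly here so the byte counts match on every platform.

```python
import math
import numpy as np

arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]], dtype=np.int64)

# ndim is the length of the shape tuple
ndim_matches = arr.ndim == len(arr.shape)

# size is the product of the shape entries: 3 * 4 = 12
size_matches = arr.size == math.prod(arr.shape)

# nbytes is size * itemsize: 12 * 8 = 96
nbytes_matches = arr.nbytes == arr.size * arr.itemsize

print(ndim_matches, size_matches, nbytes_matches)   # True True True
```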


## Data Types (dtype)

NumPy arrays are homogeneous - all elements must be the same type. NumPy provides many data types:

- **Integer types**: int8, int16, int32, int64 (and the unsigned uint8 through uint64)
- **Float types**: float16, float32, float64
- **Complex types**: complex64, complex128
- **Boolean**: bool
- **String**: str_ (Unicode text), bytes_ (byte strings)

**Why different types?**
- Memory efficiency: Use smaller types when possible
- Performance: Some operations are faster with specific types
- Precision: Choose appropriate precision for your data
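
To make the memory and precision trade-offs concrete, here is a minimal sketch with made-up sensor readings. A smaller type only works when every value fits its range, so check the limits before downcasting.

```python
import numpy as np

readings = np.arange(100)               # values 0..99

# Memory: the same 100 values at 8 bytes vs 1 byte per element
as_int64 = readings.astype(np.int64)
as_int8 = readings.astype(np.int8)      # safe here: int8 holds -128..127
print(as_int64.nbytes)                  # 800
print(as_int8.nbytes)                   # 100
print(np.iinfo(np.int8).max)            # 127

# Precision: approximate number of reliable decimal digits per float type
print(np.finfo(np.float32).precision)   # 6
print(np.finfo(np.float64).precision)   # 15
```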


In [6]:
# NumPy automatically infers data type
arr_int = np.array([1, 2, 3, 4])
print(f"Integer array dtype: {arr_int.dtype}")

arr_float = np.array([1.0, 2.0, 3.0, 4.0])
print(f"Float array dtype: {arr_float.dtype}")

arr_mixed = np.array([1, 2.5, 3])  # Integers are upcast to float64 when mixed with floats
print(f"Mixed array dtype: {arr_mixed.dtype}")
print(f"Values: {arr_mixed}")

# Specify data type explicitly
arr_specific = np.array([1, 2, 3, 4], dtype=np.float32)
print(f"Explicit float32 dtype: {arr_specific.dtype}")


Integer array dtype: int64
Float array dtype: float64
Mixed array dtype: float64
Values: [1.  2.5 3. ]
Explicit float32 dtype: float32
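
The inference shown above picks a type that can hold every input value. The sketch below shows a few more cases, plus `astype` for converting after creation; the variable names are illustrative.

```python
import numpy as np

# int mixed with float gives float64
int_and_float = np.array([1, 2.5])
print(int_and_float.dtype)              # float64

# float mixed with complex gives complex128
float_and_complex = np.array([1.5, 2 + 0j])
print(float_and_complex.dtype)          # complex128

# Only booleans: stays bool
flags = np.array([True, False])
print(flags.dtype)                      # bool

# astype returns a converted copy; float to int drops the fractional part
truncated = np.array([1, 2.5, 3]).astype(np.int64)
print(truncated)                        # [1 2 3]
```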


## Performance Comparison: NumPy vs Python Lists

Let's measure how much faster NumPy is for a simple numerical operation:


In [7]:
import time

# Create large lists/arrays
size = 1000000
python_list = list(range(size))
numpy_array = np.array(python_list)

# Python list operation (explicit Python-level loop)
# perf_counter is a monotonic, high-resolution clock intended for timing
start = time.perf_counter()
result_list = [x * 2 for x in python_list]
python_time = time.perf_counter() - start

# NumPy array operation (vectorized)
start = time.perf_counter()
result_array = numpy_array * 2
numpy_time = time.perf_counter() - start

print(f"Python list time: {python_time:.6f} seconds")
print(f"NumPy array time: {numpy_time:.6f} seconds")
print(f"NumPy is {python_time/numpy_time:.2f}x faster!")


Python list time: 0.047077 seconds
NumPy array time: 0.006845 seconds
NumPy is 6.88x faster!
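
A single timed run is noisy, so the ratio will vary from run to run and machine to machine. As a sketch of a steadier measurement, the standard-library `timeit` module can repeat each statement and keep the fastest trial. The repeat counts below are arbitrary choices.

```python
import timeit
import numpy as np

size = 1_000_000
python_list = list(range(size))
numpy_array = np.arange(size)

runs = 5      # calls per trial
trials = 3    # number of trials; the fastest one is kept

# timeit.repeat returns one total time per trial; divide by runs for per-call time
list_time = min(timeit.repeat(lambda: [x * 2 for x in python_list],
                              number=runs, repeat=trials)) / runs
numpy_time = min(timeit.repeat(lambda: numpy_array * 2,
                               number=runs, repeat=trials)) / runs

print(f"Python list: {list_time:.6f} s per call")
print(f"NumPy array: {numpy_time:.6f} s per call")
print(f"Ratio: {list_time / numpy_time:.1f}x")
```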


## Viewing and Inspecting Arrays

NumPy provides several ways to inspect arrays:


In [8]:
# Create a sample array
arr = np.array([[1, 2, 3, 4, 5], 
                [6, 7, 8, 9, 10], 
                [11, 12, 13, 14, 15]])

print("Full array:")
print(arr)
print()

# Print array representation
print("Array representation:")
print(repr(arr))
print()

# Get basic information
print(f"Shape: {arr.shape}")
print(f"Size: {arr.size}")
print(f"Data type: {arr.dtype}")
print(f"Memory size: {arr.nbytes} bytes")


Full array:
[[ 1  2  3  4  5]
 [ 6  7  8  9 10]
 [11 12 13 14 15]]

Array representation:
array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10],
       [11, 12, 13, 14, 15]])

Shape: (3, 5)
Size: 15
Data type: int64
Memory size: 120 bytes
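
When inspecting many arrays, it can help to gather these attributes in one place. The `describe_array` function below is a hypothetical helper written for this example, not part of NumPy. The sketch also uses `np.array2string` to print a shortened view of a large array.

```python
import numpy as np

def describe_array(arr):
    """Hypothetical helper: collect basic array attributes into a dict."""
    return {
        "shape": arr.shape,
        "ndim": arr.ndim,
        "size": arr.size,
        "dtype": str(arr.dtype),
        "nbytes": arr.nbytes,
    }

# Same values as the 3 x 5 array above, with an explicit dtype
arr = np.arange(1, 16, dtype=np.int64).reshape(3, 5)
info = describe_array(arr)
for name, value in info.items():
    print(f"{name}: {value}")

# Large arrays: summarize with "..." once the size exceeds the threshold
big = np.arange(10_000)
summary = np.array2string(big, threshold=10, edgeitems=2)
print(summary)
```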


## Summary

In this notebook, you learned:

1. **What NumPy is**: The fundamental package for numerical computing in Python
2. **Why NumPy matters**: Performance, memory efficiency, and foundation for data science
3. **Arrays vs Lists**: NumPy arrays are homogeneous, fixed-size, and support vectorization
4. **Array creation**: Using `np.array()` to create arrays from Python lists
5. **Array attributes**: shape, size, ndim, dtype, itemsize, nbytes
6. **Data types**: Understanding homogeneous types and dtype specification
7. **Performance**: NumPy is significantly faster for numerical operations

**Key Takeaway**: NumPy arrays are the foundation for most numerical computing in Python. Understanding arrays is essential before learning Pandas, which builds on NumPy arrays.

**Next Steps**: In the next notebook, we'll explore different ways to create arrays and work with array data types.
