<a href="https://colab.research.google.com/github/Dasaru-t/My-Machine-Learning-Course/blob/main/Section%201-%20Python%20Crash%20Course/2_NumPy_The_Engine_of_Data_Science.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1>NumPy: The Engine of Data Science</h1>
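<p><b>NumPy</b> (Numerical Python) is the library that makes numerical work in Python fast enough for data science. It provides the <code>ndarray</code> object, which for element-wise numerical operations is often cited as up to 50x faster than a plain Python list. The actual speedup depends on the operation, the array size, and the hardware.</p>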

<p><b>Why is it faster?</b></p>
<ul>
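  <li><b>Locality of Reference:</b> A NumPy array stores its values in one contiguous block of memory, all of the same type. A Python list stores pointers to objects that can sit anywhere in memory, so each element access needs an extra lookup.</li>
  <li><b>SIMD (Single Instruction, Multiple Data):</b> Because the values are contiguous and uniformly typed, NumPy's compiled loops can use CPU vector instructions that operate on several values per instruction.</li>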
</ul>
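
<p>The memory layout described above can be inspected directly. The following is a minimal sketch, assuming an explicit <code>int64</code> dtype so the byte counts are the same on every platform. The timing comparison at the end is only illustrative: the ratio you measure will vary by machine and NumPy build.</p>

```python
import timeit
import numpy as np

# Fixed dtype so the byte counts below do not depend on the platform default.
arr = np.arange(100_000, dtype=np.int64)
py_values = arr.tolist()

# One contiguous buffer: 8 bytes per element, 8-byte step between neighbours.
print(arr.flags["C_CONTIGUOUS"])  # True
print(arr.itemsize, arr.strides)  # 8 (8,)
print(arr.nbytes)                 # 800000

# Same computation both ways; no particular speedup is guaranteed.
list_time = timeit.timeit(lambda: [x * 2 for x in py_values], number=5)
numpy_time = timeit.timeit(lambda: arr * 2, number=5)
print(f"list: {list_time:.4f}s  numpy: {numpy_time:.4f}s")
```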

<p>In this tutorial, we will move beyond basic syntax and learn how to manipulate data like a Data Engineer.</p>

In [1]:
import numpy as np

# 1. Creating Arrays: List vs NumPy
# A standard Python list (Good for general purpose, bad for math)
py_list = [1, 2, 3, 4, 5]

# A NumPy Array (Optimized for calculation)
np_arr = np.array(py_list)

print(f"Type: {type(np_arr)}")
print(f"Array: {np_arr}")

# 2. Multi-Dimensional Arrays (Matrices)
# Think of this as a dataset with Rows and Columns
data_matrix = np.array([
    [1, 2, 3],  # Row 0
    [4, 5, 6],  # Row 1
    [7, 8, 9]   # Row 2
])

print("\n--- Matrix Shape ---")
print(f"Shape: {data_matrix.shape}") # Output: (3, 3) -> (Rows, Columns)
print(f"Dimensions: {data_matrix.ndim}") # Output: 2

Type: <class 'numpy.ndarray'>
Array: [1 2 3 4 5]

--- Matrix Shape ---
Shape: (3, 3)
Dimensions: 2
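
<p>The same 3x3 matrix can also be built without typing every value. This is a small sketch using <code>np.arange</code> and <code>reshape</code>; the variable names <code>grid</code> and <code>mixed</code> are illustrative and not part of the tutorial's own cells.</p>

```python
import numpy as np

# np.arange(1, 10) yields 1..9 (the stop value is exclusive);
# reshape(3, 3) lays those nine values out row by row.
grid = np.arange(1, 10).reshape(3, 3)

print(grid)
print(grid.shape)  # (3, 3)
print(grid.ndim)   # 2
print(grid.size)   # 9

# Every element shares one dtype; mixing ints and floats upcasts to float64.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)  # float64
```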


<h2>1. Modern Random Number Generation</h2>
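<p>Older tutorials use legacy functions such as <code>np.random.rand()</code>. Since NumPy 1.17, the recommended approach is to create a <code>Generator</code> with <code>default_rng()</code>. By default it uses the PCG64 bit generator, which has better statistical properties than the legacy Mersenne Twister and is generally faster.</p>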
<p>We use this to simulate datasets, initialize neural network weights, or split data into Train/Test sets.</p>

In [2]:
# Initialize the modern random number generator
rng = np.random.default_rng(seed=42)

# Generate a mock dataset: 5 Rows (Users), 3 Columns (Features: Age, Income, Score)
# random(shape) gives floats between 0.0 and 1.0
mock_data = rng.random((5, 3))

print("--- Mock Normalized Data (0-1) ---")
print(mock_data)

# Generate integers (e.g., random user IDs). `high` is exclusive, so values fall in 1000..9998;
# pass endpoint=True to include 9999.
user_ids = rng.integers(low=1000, high=9999, size=5)
print(f"\nUser IDs: {user_ids}")

--- Mock Normalized Data (0-1) ---
[[0.77395605 0.43887844 0.85859792]
 [0.69736803 0.09417735 0.97562235]
 [0.7611397  0.78606431 0.12811363]
 [0.45038594 0.37079802 0.92676499]
 [0.64386512 0.82276161 0.4434142 ]]

User IDs: [5053 3044 1829 5990 8990]
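
<p>Two of the uses mentioned above, reproducibility and train/test splitting, can be sketched with the same generator API. The 10-row dataset and the 80/20 ratio are assumptions chosen for illustration.</p>

```python
import numpy as np

# Two generators built from the same seed produce identical streams.
rng_a = np.random.default_rng(seed=42)
rng_b = np.random.default_rng(seed=42)
print(np.array_equal(rng_a.random(3), rng_b.random(3)))  # True

# Hypothetical toy dataset: 10 rows, 3 feature columns.
rng = np.random.default_rng(seed=42)
features = rng.random((10, 3))

# Shuffle the row indices once, then cut them 80/20.
shuffled = rng.permutation(len(features))
split_at = int(0.8 * len(features))
train_idx, test_idx = shuffled[:split_at], shuffled[split_at:]

X_train, X_test = features[train_idx], features[test_idx]
print(X_train.shape, X_test.shape)  # (8, 3) (2, 3)
```

<p>Shuffling indices rather than the data itself makes it easy to apply the same split to a separate labels array.</p>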


<h2>2. Slicing & Indexing (The "SQL" of NumPy)</h2>
<p>As a Data Engineer, you often need to select specific rows or columns. This is NumPy's version of SQL's <code>SELECT</code>.</p>
<ul>
    <li><b>Syntax:</b> <code>array[row_start:row_end, col_start:col_end]</code> (start is inclusive, end is exclusive)</li>
</ul>

In [3]:
# Let's create a clear 4x4 matrix to practice on
matrix = np.array([
    [10, 20, 30, 40],
    [50, 60, 70, 80],
    [90, 100, 110, 120],
    [130, 140, 150, 160]
])

# Scenario 1: Select the first 2 rows and first 2 columns (Top-Left quadrant)
top_left = matrix[:2, :2]
print(f"Top Left:\n{top_left}")

# Scenario 2: Select all rows, but only the last column (Common for extracting 'Labels' in ML)
labels = matrix[:, -1]
print(f"\nLabels (Last Column): {labels}")

# Scenario 3: Conditional Selection (Filtering)
# "Select all values greater than 100" (Like SQL WHERE value > 100)
high_values = matrix[matrix > 100]
print(f"\nValues > 100: {high_values}")

Top Left:
[[10 20]
 [50 60]]

Labels (Last Column): [ 40  80 120 160]

Values > 100: [110 120 130 140 150 160]
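
<p>The scenarios above leave out two common patterns: slicing with explicit start and end indices, and filtering on more than one condition. Here is a short sketch on the same 4x4 values, redefined so the block runs on its own. The names <code>center</code>, <code>mid_range</code>, and <code>rows</code> are illustrative.</p>

```python
import numpy as np

matrix = np.array([
    [10, 20, 30, 40],
    [50, 60, 70, 80],
    [90, 100, 110, 120],
    [130, 140, 150, 160],
])

# Explicit start:end on both axes (end is exclusive): rows 1-2, columns 1-2.
center = matrix[1:3, 1:3]
print(center)  # the 2x2 block 60, 70 / 100, 110

# Combine conditions with & (AND) or | (OR); each condition needs parentheses.
mid_range = matrix[(matrix > 50) & (matrix < 120)]
print(mid_range)  # [ 60  70  80  90 100 110]

# Filter whole rows on one column, like WHERE last_col > 100.
rows = matrix[matrix[:, -1] > 100]
print(rows)  # the last two rows
```

<p>Note that basic slices such as <code>center</code> are views onto the original array, while boolean filtering returns a copy.</p>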


<h2>3. Broadcasting & Vectorization</h2>
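<p>This is the "magic" of NumPy: you don't need <code>for</code> loops to do arithmetic on arrays. <b>Vectorization</b> applies an operation to every element in compiled code, and <b>broadcasting</b> stretches a smaller operand (such as a single number) to match the shape of the larger one.</p>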
<p><b>Example:</b> If you want to convert a list of prices from USD to LKR, you don't loop through them. You just multiply the array by the exchange rate.</p>

In [4]:
# Prices in USD
prices_usd = np.array([10, 25, 50, 100])
exchange_rate = 300 # Approximate LKR

# The Old Way (Slow List Comprehension)
# prices_lkr = [x * exchange_rate for x in prices_usd]

# The NumPy Way (Vectorized: the loop runs in compiled code)
prices_lkr = prices_usd * exchange_rate

print(f"USD: {prices_usd}")
print(f"LKR: {prices_lkr}")

# Broadcasting with Matrices
# Add 5 to every single element in the matrix
adjusted_matrix = matrix + 5
print(f"\nMatrix + 5:\n{adjusted_matrix}")


USD: [ 10  25  50 100]
LKR: [ 3000  7500 15000 30000]

Matrix + 5:
[[ 15  25  35  45]
 [ 55  65  75  85]
 [ 95 105 115 125]
 [135 145 155 165]]
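
<p>Broadcasting is not limited to a single number. The sketch below, which uses a small 2x4 matrix and made-up factors, shows how a 1-D array is matched against columns, how a column-shaped array is matched against rows, and what happens when shapes are incompatible.</p>

```python
import numpy as np

matrix = np.array([
    [10, 20, 30, 40],
    [50, 60, 70, 80],
])

# Shape (4,) lines up with the last axis: one factor per column.
column_factors = np.array([1, 10, 100, 1000])
scaled = matrix * column_factors
print(scaled)

# Shape (2, 1) stretches across the columns: one offset per row.
row_offsets = np.array([[1], [2]])
shifted = matrix + row_offsets
print(shifted)

# Incompatible shapes raise an error instead of guessing: (2, 4) vs (3,).
try:
    matrix + np.array([1, 2, 3])
except ValueError as err:
    print(f"ValueError: {err}")
```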


<h2>4. Essential Statistical Methods</h2>
<p>Before training any ML model, you must understand your data's distribution. NumPy provides built-in functions for this.</p>

In [5]:
data = rng.normal(loc=50, scale=15, size=1000) # Generate 1000 data points (Bell Curve)

print(f"Mean (Average): {np.mean(data):.2f}")
print(f"Median (Middle): {np.median(data):.2f}")
print(f"Std Dev (Spread): {np.std(data):.2f}")
print(f"Variance: {np.var(data):.2f}")

# The axis parameter is crucial for matrices
# axis=0 -> collapse the rows: one result per column
# axis=1 -> collapse the columns: one result per row
simple_matrix = np.array([[1, 2], [3, 4]])
print(f"\nColumn Sums (axis=0): {np.sum(simple_matrix, axis=0)}") # [1+3, 2+4] = [4, 6]
print(f"Row Sums (axis=1): {np.sum(simple_matrix, axis=1)}")    # [1+2, 3+4] = [3, 7]

Mean (Average): 49.54
Median (Middle): 49.99
Std Dev (Spread): 14.81
Variance: 219.42

Column Sums (axis=0): [4 6]
Row Sums (axis=1): [3 7]
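
<p>The same <code>axis</code> logic applies to every aggregation, not only <code>np.sum</code>. The following sketch uses a small hypothetical score table to show per-column and per-row means, the difference between population and sample standard deviation, and percentiles.</p>

```python
import numpy as np

# Hypothetical scores: 4 rows (students) x 3 columns (exams).
scores = np.array([
    [70.0, 80.0, 90.0],
    [60.0, 85.0, 75.0],
    [80.0, 90.0, 60.0],
    [90.0, 65.0, 75.0],
])

col_means = scores.mean(axis=0)  # one value per column (exam)
row_means = scores.mean(axis=1)  # one value per row (student)
print(col_means)  # [75. 80. 75.]
print(row_means)

# np.std defaults to the population formula (ddof=0);
# ddof=1 gives the sample standard deviation.
first_exam = scores[:, 0]
pop_std = first_exam.std()
sample_std = first_exam.std(ddof=1)
print(f"{pop_std:.2f} {sample_std:.2f}")  # 11.18 12.91

# Quartiles of the first exam (linear interpolation by default).
quartiles = np.percentile(first_exam, [25, 50, 75])
print(quartiles)  # [67.5 75.  82.5]
```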


<h2>🎓 Final Challenge: The "Sales Data" Analysis</h2>
<p><b>Scenario:</b> You have two arrays. One contains <b>Item Prices</b> and the other contains <b>Quantity Sold</b> for 5 different products. You need to calculate the total revenue and find which product performed best.</p>
<ol>
  <li>Calculate Total Revenue per product (Price * Quantity).</li>
  <li>Calculate Total Revenue for the shop.</li>
  <li>Find the index of the best-selling product (the one with the highest revenue).</li>
</ol>

In [6]:
# 1. Setup Data
prices = np.array([100, 50, 200, 25, 10])      # Price per unit
quantity = np.array([10, 50, 5, 100, 500])     # Units sold

# 2. Vectorized Calculation (Revenue per product)
revenue_per_product = prices * quantity
print(f"Revenue per Product: {revenue_per_product}")

# 3. Aggregation (Total Shop Revenue)
total_revenue = np.sum(revenue_per_product)
print(f"Total Revenue: ${total_revenue}")

# 4. Analysis (Finding the 'ArgMax')
# np.argmax returns the INDEX of the highest value
best_selling_index = np.argmax(revenue_per_product)
print(f"Best Selling Product Index: {best_selling_index}")
print(f"Best Selling Revenue: {revenue_per_product[best_selling_index]}")

Revenue per Product: [1000 2500 1000 2500 5000]
Total Revenue: $12000
Best Selling Product Index: 4
Best Selling Revenue: 5000
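
<p>As an optional extension of the challenge, the sketch below ranks all five products and computes each one's share of the total. It reuses the same prices and quantities; the names <code>ranking</code> and <code>share</code> are illustrative. Note that <code>np.argmax</code> returns only the first index when several products tie.</p>

```python
import numpy as np

prices = np.array([100, 50, 200, 25, 10])
quantity = np.array([10, 50, 5, 100, 500])
revenue = prices * quantity

# Rank products from highest to lowest revenue.
# A stable sort keeps tied products in their original order.
ranking = np.argsort(-revenue, kind="stable")
print(ranking)  # [4 1 3 0 2]

# Each product's share of total revenue, as a percentage.
share = revenue / revenue.sum() * 100
print(np.round(share, 1))  # [ 8.3 20.8  8.3 20.8 41.7]

# Units sold and revenue are different metrics; compare them explicitly.
print(np.argmax(quantity), np.argmax(revenue))  # 4 4
```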


<h3>NumPy Quick Reference</h3>
<table style="border-collapse: collapse; width: 100%; border: 1px solid #dddddd;">
  <tr style="background-color: #f2f2f2; text-align: left;">
    <th style="border: 1px solid #dddddd; padding: 8px;">Feature</th>
    <th style="border: 1px solid #dddddd; padding: 8px;">Python List <code>[ ]</code></th>
    <th style="border: 1px solid #dddddd; padding: 8px;">NumPy Array <code>np.array([ ])</code></th>
  </tr>
  <tr>
    <td style="border: 1px solid #dddddd; padding: 8px;"><b>Speed</b></td>
    <td style="border: 1px solid #dddddd; padding: 8px;">Slow (General purpose)</td>
    <td style="border: 1px solid #dddddd; padding: 8px;"><b>Fast</b> (Optimized for Math)</td>
  </tr>
  <tr>
    <td style="border: 1px solid #dddddd; padding: 8px;"><b>Operations</b></td>
    <td style="border: 1px solid #dddddd; padding: 8px;">Requires Loops</td>
    <td style="border: 1px solid #dddddd; padding: 8px;"><b>Vectorized</b> (Batch operations)</td>
  </tr>
  <tr>
    <td style="border: 1px solid #dddddd; padding: 8px;"><b>Use Case</b></td>
    <td style="border: 1px solid #dddddd; padding: 8px;">Data Collection</td>
    <td style="border: 1px solid #dddddd; padding: 8px;"><b>Data Processing & ML</b></td>
  </tr>
</table>

<hr>

<h3>Key Concepts</h3>
<ul>
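  <li><b>Broadcasting:</b> Applying an operation across an entire array without writing a loop (e.g., <code>prices * 1.1</code>). Smaller operands, such as a single number, are stretched to match the array's shape.</li>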
  <li><b>Slicing:</b> <code>arr[row, col]</code>. Use <code>arr[:, -1]</code> to grab the last column (often the target variable).</li>
  <li><b>Filtering:</b> <code>arr[arr > 50]</code> acts like a SQL WHERE clause.</li>
  <li><b>Axes:</b> <code>axis=0</code> aggregates down the rows (one result per column); <code>axis=1</code> aggregates across the columns (one result per row).</li>
</ul>