# NumPy & Pandas: Beginner to Advanced

This notebook teaches **NumPy** (numerical arrays) and **Pandas** (tabular data) from basics to advanced topics. It fits with the *Python Basics for ML* guide: variables → lists → **NumPy** → **Pandas** → ML workflow.

**You will:**
- Use one **sample dataset** across the whole notebook
- Learn **data manipulation** (filter, clean, aggregate, merge)
- Use **inbuilt methods** of arrays and DataFrames
- Progress from beginner to advanced with notes and examples

**How to use:** Run cells in order (Shift+Enter). Read the markdown notes above each section.

---
## Setup & Sample Dataset

We create a single dataset once and reuse it for NumPy and Pandas. It has:
- **Numeric columns** (sales, score, age) for arrays and aggregations
- **Categorical column** (region) for `value_counts` and `groupby`
- **Missing values** for learning `fillna`, `dropna`, `isna`
- **String column** (name) for `.str` and a **date** for datetime basics

In [None]:
import numpy as np
import pandas as pd

# Sample dataset: sales/performance (used throughout the notebook)
np.random.seed(42)
n = 16
df_sample = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve", "Frank", "Grace", "Henry",
             "Ivy", "Jack", "Kate", "Leo", "Mia", "Noah", "Olivia", "Paul"],
    "region": ["North", "South", "North", "East", "South", "East", "North", "South",
               "East", "North", "South", "East", "North", "South", "East", "North"],
    "sales": [120, 95, 140, 88, 110, np.nan, 130, 75, 100, 115, 90, 102, 135, 82, 98, 108],
    "score": [85, 78, 92, 88, 90, 72, 95, 70, 86, 91, 79, 84, 93, 77, 81, 89],
    "age": [28, 34, 29, 31, 27, 38, 26, 40, 33, 30, 35, 36, 25, 39, 32, 37],
    "join_date": ["2022-01-15", "2021-06-20", "2023-02-10", "2021-11-01", "2022-08-14",
                  "2020-03-22", "2023-05-05", "2020-09-12", "2022-04-18", "2023-01-08",
                  "2021-07-30", "2022-11-25", "2023-07-01", "2020-12-10", "2022-02-28", "2021-10-05"]
})

print("Sample dataset shape:", df_sample.shape)
df_sample

---
# Part 1: NumPy

**NumPy** provides fast n-dimensional arrays and vectorized numerical operations. Much of the Python data and ML stack (pandas, scikit-learn) builds on it.

**Concepts:** `ndarray`, `shape`, `dtype`, indexing, vectorized operations, broadcasting.

## 1.1 NumPy — Beginner: Arrays and Creation

- Create arrays from lists or with **inbuilt constructors**: `np.zeros`, `np.ones`, `np.arange`, `np.linspace`
- Inspect **shape** (dimensions) and **dtype** (data type)
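
As a small sketch of how creation and `dtype` interact (the variable names here are made up for illustration), NumPy picks one dtype for the whole array, and the constructors take their shape or range as arguments:

```python
import numpy as np

ints = np.array([1, 2, 3])          # all integers -> an integer dtype
mixed = np.array([1, 2.5, 3])       # one float -> the whole array becomes float64
as_float = ints.astype(np.float64)  # explicit conversion returns a new array

grid = np.zeros((2, 3))             # shape is a tuple: 2 rows, 3 columns
steps = np.arange(0, 10, 2)         # start, stop (excluded), step
points = np.linspace(0, 1, 5)       # start, stop (included), number of points

print(ints.dtype, mixed.dtype, as_float.dtype)
print(grid.shape, steps, points)
```

Note that `np.arange` excludes its stop value while `np.linspace` includes it.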

In [None]:
# From list
arr = np.array([1, 2, 3, 4, 5])
print("1D array:", arr)
print("  shape:", arr.shape, "  dtype:", arr.dtype)

# 2D array (e.g. feature matrix)
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print("\n2D array shape:", matrix.shape)

# Inbuilt constructors
print("\nnp.zeros((2,3)):", np.zeros((2, 3)))
print("np.arange(0, 10, 2):", np.arange(0, 10, 2))
print("np.linspace(0, 1, 5):", np.linspace(0, 1, 5))

In [None]:
# Use sample data as NumPy array (numeric columns only)
values = df_sample[["sales", "score", "age"]].values
print("Sample data as NumPy array (sales, score, age):")
print("  shape:", values.shape)
print(values)

## 1.2 NumPy — Beginner: Indexing and Slicing

- Index by position: `arr[0]`, `arr[-1]`
- Slice: `arr[start:stop]` (stop excluded)
- 2D: `matrix[row, col]`, `matrix[:, col]`, `matrix[1:3, :]`

In [None]:
arr = np.array([10, 20, 30, 40, 50])
print("arr[0]:", arr[0], "  arr[-1]:", arr[-1])
print("arr[1:4]:", arr[1:4])

matrix = values[:5, :]  # first 5 rows, all cols
print("\nFirst 5 rows of sample values:")
print(matrix)
print("  matrix[:, 0] (first column):", matrix[:, 0])
print("  matrix[2, 1] (row 2, col 1):", matrix[2, 1])

## 1.3 NumPy — Beginner: Operations and Inbuilt Methods

- **Element-wise:** `a + b`, `a * 2`, `np.sqrt(a)`
- **Aggregations (inbuilt methods):** `.sum()`, `.mean()`, `.min()`, `.max()`, `.std()`
- **Axis:** `axis=0` aggregates down the rows (one result per column); `axis=1` aggregates across the columns (one result per row)
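
A minimal sketch of the axis rule, using a made-up 2x3 array `m`: the axis you pass is the one that disappears from the result.

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])   # shape (2, 3): 2 rows, 3 columns

col_sums = m.sum(axis=0)    # collapses the 2 rows -> one value per column
row_sums = m.sum(axis=1)    # collapses the 3 columns -> one value per row
total = m.sum()             # no axis -> a single value for the whole array

print(col_sums, col_sums.shape)
print(row_sums, row_sums.shape)
print(total)
```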

In [None]:
a = np.array([1.0, 2.0, 3.0])
print("Element-wise: a * 2 =", a * 2)
print("np.sqrt(a) =", np.sqrt(a))

arr = np.array([2, 4, 6, 8, 10])
print("\nInbuilt methods: .sum() =", arr.sum(), " .mean() =", arr.mean())

matrix = values[:5, :]
print("\nWith axis (matrix):")
print("  .sum(axis=0) (per column):", matrix.sum(axis=0))
print("  .mean(axis=1) (per row):", matrix.mean(axis=1))

## 1.4 NumPy — Intermediate: Boolean Indexing

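Filter arrays with conditions: a comparison such as `x > 5` returns a boolean mask, and indexing with that mask keeps only the elements where it is `True`. Combine conditions with `&`, `|`, and `~`, wrapping each one in parentheses.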
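
One detail worth seeing in isolation is how NaN behaves in a condition. The sketch below uses a small made-up `sales` array; any comparison with NaN evaluates to `False`, so NaN silently drops out of a filter and reappears when the mask is inverted.

```python
import numpy as np

sales = np.array([120.0, 95.0, np.nan, 88.0, 110.0])

mask = sales >= 100       # boolean array with the same shape as sales
print(mask)               # the NaN position is False

high = sales[mask]        # keeps only positions where mask is True
not_high = sales[~mask]   # ~ inverts the mask, so the NaN lands here

# Combine conditions with & and wrap each condition in parentheses
mid_range = sales[(sales >= 90) & (sales < 115)]
print(high, not_high, mid_range)
```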

In [None]:
sales = values[:, 0]  # sales column (1st column, index 0); it contains one NaN
print("Sales (from sample):", sales)
valid = sales[~np.isnan(sales)]  # drop NaN before comparing
print("  sales[~np.isnan(sales)] (NaN dropped):", valid)
print("  valid[valid >= 110]:", valid[valid >= 110])

# Simple 1D array (no NaN)
x = np.array([3, 7, 2, 9, 1, 8, 4])
print("\nx[x > 5]:", x[x > 5])
print("x[(x >= 2) & (x <= 6)]:", x[(x >= 2) & (x <= 6)])

In [None]:
# The score column has no NaN, so it can be filtered directly
scores = df_sample["score"].values
print("Scores >= 85:", scores[scores >= 85])
print("Mean of scores >= 85:", scores[scores >= 85].mean())

## 1.5 NumPy — Intermediate: Broadcasting

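Broadcasting lets NumPy combine arrays of different shapes without writing loops. Shapes are compared from the last dimension backwards; a dimension that is missing or has size 1 is stretched to match the other array.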
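
A common use of broadcasting is scaling the columns of a feature matrix. The following is a sketch on a made-up matrix `X`, not a replacement for a proper scaler:

```python
import numpy as np

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])              # shape (3, 2)

col_mean = X.mean(axis=0)                # shape (2,)
col_std = X.std(axis=0)                  # shape (2,)

# (3, 2) combined with (2,): the 1D arrays are applied to every row
X_scaled = (X - col_mean) / col_std

# keepdims=True keeps a size-1 axis so per-row values line up with the rows
row_sum = X.sum(axis=1, keepdims=True)   # shape (3, 1)
X_share = X / row_sum                    # each row now sums to 1

print(X_scaled)
print(X_share)
```

Without `keepdims=True`, `row_sum` would have shape `(3,)`, which cannot be broadcast against `(3, 2)`.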

In [None]:
matrix = np.array([[1, 2], [3, 4], [5, 6]])
row = np.array([10, 20])
print("matrix (3x2) + row (2,):\n", matrix + row)

col = np.array([[100], [200], [300]])
print("matrix + col (3x1):\n", matrix + col)

## 1.6 NumPy — Advanced: Views vs Copies, Reshape, Stacking

- **Views** share memory with the original (basic slicing returns a view); **copies** are independent (`.copy()`; boolean and integer-array indexing also return copies)
- **Reshape** and **stack** for reorganizing data
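
When it is unclear whether an operation returned a view or a copy, `np.shares_memory` can check. A short sketch on a made-up array:

```python
import numpy as np

arr = np.arange(6)

sliced = arr[1:4]              # basic slice -> view
masked = arr[arr > 2]          # boolean indexing -> copy
reshaped = arr.reshape(2, 3)   # reshape of a contiguous array -> view

print(np.shares_memory(arr, sliced))
print(np.shares_memory(arr, masked))
print(np.shares_memory(arr, reshaped))

reshaped[0, 0] = 100           # writes through to arr because it is a view
print(arr)
```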

In [None]:
arr = np.array([1, 2, 3, 4, 5])
view = arr[1:4]
view[0] = 99
print("Slice is a view; changing it changes original:", arr)

arr2 = np.array([1, 2, 3, 4, 5])
copy = arr2[1:4].copy()
copy[0] = 88
print("Using .copy() leaves original unchanged:", arr2)

flat = np.arange(12)
reshaped = flat.reshape(3, 4)
print("\nreshape(3, 4):\n", reshaped)
print("vstack shape:", np.vstack((reshaped[:2], reshaped[2:])).shape)
print("hstack shape:", np.hstack((reshaped[:, :2], reshaped[:, 2:])).shape)

## 1.7 NumPy — Advanced: NaN-Safe and Linear Algebra

- Use `np.nanmean`, `np.nansum` when data has NaNs
- `np.dot(a, b)` or `a @ b` for dot product / matrix multiply
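
To see why the NaN-safe functions matter, compare them with the plain aggregation. The sketch also shows `@` with a matrix and a vector, using made-up `features` and `weights`:

```python
import numpy as np

data = np.array([1.0, 2.0, np.nan, 4.0])
plain = data.mean()          # NaN propagates through the plain mean
safe = np.nanmean(data)      # ignores the NaN: (1 + 2 + 4) / 3

features = np.array([[1.0, 2.0, 3.0],
                     [4.0, 5.0, 6.0]])   # shape (2, 3)
weights = np.array([0.5, 0.3, 0.2])      # shape (3,)
weighted = features @ weights            # (2, 3) @ (3,) -> shape (2,)

print(plain, safe)
print(weighted)
```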

In [None]:
with_nan = np.array([1.0, 2.0, np.nan, 4.0, 5.0])
print("np.nanmean(with_nan):", np.nanmean(with_nan))
print("np.nansum(with_nan):", np.nansum(with_nan))

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
print("\nMatrix multiply a @ b:\n", a @ b)

---
# Part 2: Pandas

**Pandas** is for tabular data: load CSVs, filter rows, select columns, handle missing values, and aggregate. DataFrames are built on NumPy.

**Concepts:** Series, DataFrame, indexing (`loc`/`iloc`), inbuilt methods, groupby, merge.

## 2.1 Pandas — Beginner: Series and DataFrame

- **Series:** 1D labeled array
- **DataFrame:** 2D table with labeled rows and columns
- Our **sample dataset** is already a DataFrame: `df_sample`

In [None]:
s = pd.Series([10, 20, 30], index=["a", "b", "c"])
print("Series:", s)
print("  s['b']:", s["b"])

print("\nDataFrame columns:", list(df_sample.columns))
print("Shape:", df_sample.shape)

## 2.2 Pandas — Beginner: Inspect with Inbuilt Methods

- **.head()**, **.tail()** — first/last rows
- **.describe()** — summary statistics
- **.info()** — dtypes and non-null counts
- **.dtypes**, **.shape**, **.columns**

In [None]:
print("head(3):")
display(df_sample.head(3))
print("describe():")
display(df_sample.describe())
print("info():")
df_sample.info()

## 2.3 Pandas — Beginner: Select Columns and Rows

- Single column: `df['col']`
- Multiple columns: `df[['col1', 'col2']]`
- Filter rows: `df[df['col'] > value]`
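
Row filters can be combined in the same way as NumPy masks. A minimal sketch on a small made-up DataFrame with the same column names as the sample:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "region": ["North", "South", "North", "East"],
    "score": [85, 78, 92, 88],
})

# Each condition needs its own parentheses when combined with & or |
high_north = df[(df["region"] == "North") & (df["score"] >= 90)]

# isin() tests membership in a list of values
north_or_east = df[df["region"].isin(["North", "East"])]

# loc takes a row condition and a column in one step
names = df.loc[df["score"] > 80, "name"]

print(high_north)
print(north_or_east)
print(names.tolist())
```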

In [None]:
print("Single column df_sample['region']:")
print(df_sample["region"])
print("\nMultiple columns:")
display(df_sample[["name", "sales", "score"]])
print("Filter: score >= 90")
display(df_sample[df_sample["score"] >= 90])

## 2.4 Pandas — Beginner: Inbuilt Methods (value_counts, add/rename columns)

- **.value_counts()** — count occurrences (great for categories)
- Add column: `df['new'] = ...`
- Rename: **.rename(columns=...)**

In [None]:
print("value_counts() for 'region':")
print(df_sample["region"].value_counts())
print("\nWith proportions: value_counts(normalize=True)")
print(df_sample["region"].value_counts(normalize=True).round(2))

df_demo = df_sample.copy()
df_demo["bonus"] = df_demo["sales"] * 0.1  # 10% bonus
df_demo = df_demo.rename(columns={"join_date": "start_date"})
print("\nAdded 'bonus', renamed 'join_date' -> 'start_date':")
display(df_demo.head(3))

---
## 2.5 Data Manipulation: Missing Values

**Detect:** `.isna()`, `.isna().sum()`  
**Drop:** `.dropna()`  
**Fill:** `.fillna(value)` or `.fillna(df['col'].median())`
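
The three steps can be sketched on a small made-up frame. Here `dropna` is limited to one column with `subset`, and `fillna` takes a dict so each column gets its own fill value:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "sales": [120.0, np.nan, 140.0, 90.0],
    "score": [85.0, 78.0, np.nan, 70.0],
})

missing_per_col = df.isna().sum()          # count of NaN in each column

dropped = df.dropna(subset=["sales"])      # drop rows only where sales is missing

filled = df.fillna({
    "sales": df["sales"].median(),         # median skips NaN by default
    "score": df["score"].mean(),
})

print(missing_per_col)
print(dropped)
print(filled)
```

Neither `dropna` nor `fillna` modifies `df` here; both return new DataFrames.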

In [None]:
print("Missing values per column:")
print(df_sample.isna().sum())

df_clean = df_sample.copy()
df_clean["sales"] = df_clean["sales"].fillna(df_clean["sales"].median())
print("\nAfter fillna(sales.median()):")
print(df_clean["sales"].isna().sum(), "missing in 'sales'")

## 2.6 Data Manipulation: Sort and Drop Duplicates

- **.sort_values('col')** — order by column; use `ascending=False` for descending
- **.drop_duplicates(subset=['col'], keep='first')** — remove duplicate rows
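
Sorting and de-duplication are often combined, for example to keep the best row per key. A sketch on made-up data where "Alice" appears twice:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Alice", "Diana"],
    "region": ["North", "South", "North", "East"],
    "score": [85, 78, 91, 88],
})

# Sort by region ascending, then by score descending within each region
ordered = df.sort_values(["region", "score"], ascending=[True, False])

# duplicated() marks the second and later occurrences of each name
dup_mask = df.duplicated(subset=["name"])

# Sort first so that keep="first" retains the highest score per name
best_per_name = (
    df.sort_values("score", ascending=False)
      .drop_duplicates(subset=["name"], keep="first")
)

print(ordered)
print(dup_mask.tolist())
print(best_per_name)
```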

In [None]:
print("Sort by score (descending):")
display(df_sample.sort_values("score", ascending=False).head(5))

df_dup = pd.concat([df_sample.head(3), df_sample.head(2)], ignore_index=True)
print("After drop_duplicates(subset=['name']):")
display(df_dup.drop_duplicates(subset=["name"], keep="first"))

## 2.7 Data Manipulation: groupby and Aggregate

- **.groupby('col')** — split by column values
- **.agg({'col': 'mean', ...})** or **.mean()**, **.sum()** — aggregate per group
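
Two variations that may be useful alongside the dict form: named aggregation, which sets the output column names, and `transform`, which returns a result aligned with the original rows. The data and output names below are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "sales": [120, 95, 140, 75],
    "score": [85, 78, 92, 70],
})

# Named aggregation: output_name=(input_column, function)
summary = df.groupby("region").agg(
    mean_sales=("sales", "mean"),
    max_score=("score", "max"),
    n=("sales", "count"),
)

# transform keeps one value per original row, so it can become a new column
df["region_mean_sales"] = df.groupby("region")["sales"].transform("mean")

print(summary)
print(df)
```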

In [None]:
by_region = df_sample.groupby("region")
print("Mean sales and mean score by region:")
display(by_region.agg({"sales": "mean", "score": "mean"}).round(2))
print("Count per region:")
print(by_region.size())

## 2.8 Pandas — Intermediate: loc and iloc

- **.loc[]** — label-based (row/column names)
- **.iloc[]** — position-based (integer indices)
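
One difference that is easy to miss: a `loc` slice includes its end label, while an `iloc` slice excludes its end position. A minimal sketch on a made-up frame indexed by name:

```python
import pandas as pd

df = pd.DataFrame(
    {"score": [85, 78, 92, 88]},
    index=["Alice", "Bob", "Charlie", "Diana"],
)

by_label = df.loc["Alice":"Charlie"]    # end label included -> 3 rows
by_pos = df.iloc[0:2]                   # end position excluded -> 2 rows

# loc also supports conditional assignment: rows by mask, column by name
df.loc[df["score"] < 80, "score"] = 80

print(len(by_label), len(by_pos))
print(df)
```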

In [None]:
df_idx = df_sample.set_index("name")
print("After set_index('name'), loc['Bob']:")
print(df_idx.loc["Bob"])
print("\nloc slice: df_idx.loc['Alice':'Diana', ['region', 'score']]")
display(df_idx.loc["Alice":"Diana", ["region", "score"]])
print("iloc: first 3 rows, columns 0 and 2")
display(df_sample.iloc[:3, [0, 2]])

## 2.9 Pandas — Intermediate: String and Datetime

- **.str** accessor: `.str.upper()`, `.str.contains()`, `.str.len()`
- **pd.to_datetime()** and **.dt** accessor: `.dt.year`, `.dt.month`
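
As a sketch of both accessors together (the reference date `2024-01-01` is an assumption chosen for the example, not part of the sample dataset):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "join_date": ["2022-01-15", "2021-06-20", "2023-02-10"],
})

# .str methods work element-wise on a string column
df["name_len"] = df["name"].str.len()
df["starts_with_a"] = df["name"].str.startswith("A")

# Parse strings to datetimes, then pull out parts with .dt
df["date"] = pd.to_datetime(df["join_date"])
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month

# Subtracting datetimes gives timedeltas; .dt.days converts them to whole days
reference = pd.Timestamp("2024-01-01")   # assumed reference date
df["tenure_days"] = (reference - df["date"]).dt.days

print(df)
```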

In [None]:
print("String: name in uppercase")
print(df_sample["name"].str.upper().head())
print("\nNames containing 'a':")
print(df_sample[df_sample["name"].str.contains("a", case=False)]["name"].tolist())

df_sample["date"] = pd.to_datetime(df_sample["join_date"])
print("\nDatetime: parsed column, .dt.year and .dt.month")
print(df_sample[["join_date", "date"]].head(3))
print("Years:", df_sample["date"].dt.year.unique()[:5])
print("Months (first 5 rows):", df_sample["date"].dt.month.head(5).tolist())

## 2.10 Pandas — Advanced: Merge (Join)

Combine two DataFrames on a key: **pd.merge(left, right, on='col', how='inner'/'left')**
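
The `how` argument decides what happens to keys that appear in only one table. The sketch below uses made-up data in which "Zoe" and the "West" region are hypothetical and deliberately have no match; `indicator=True` adds a `_merge` column showing where each row came from.

```python
import pandas as pd

people = pd.DataFrame({
    "name": ["Alice", "Bob", "Zoe"],
    "region": ["North", "South", "West"],   # "West" has no match in regions
})
regions = pd.DataFrame({
    "region": ["North", "South", "East"],   # "East" has no match in people
    "manager": ["Manager_A", "Manager_B", "Manager_C"],
})

inner = pd.merge(people, regions, on="region", how="inner")   # matching keys only
left = pd.merge(people, regions, on="region", how="left")     # all rows of people
outer = pd.merge(people, regions, on="region", how="outer", indicator=True)

print(inner)
print(left)
print(outer)
```

With `how="left"`, unmatched rows are kept and their right-hand columns are filled with NaN.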

In [None]:
regions_info = pd.DataFrame({
    "region": ["North", "South", "East"],
    "manager": ["Manager_A", "Manager_B", "Manager_C"],
    "target_sales": [500, 400, 450]
})
merged = pd.merge(df_sample, regions_info, on="region", how="left")
print("Merged sample with region info (manager, target_sales):")
display(merged.head(5))

## 2.11 Pandas — Advanced: Apply and DataFrame → NumPy

- **.apply(func, axis=0/1)** — apply function per column or row
- **.values** — get NumPy array for use in scikit-learn etc.
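
A short sketch of `apply` along each axis, next to the vectorized equivalent, which is usually the faster choice when one exists. It also uses `.to_numpy()`, the method the pandas documentation recommends over `.values`. The data is made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "sales": [120.0, 95.0, 140.0],
    "score": [85, 78, 92],
})

# axis=0: the function receives each column as a Series
col_range = df.apply(lambda col: col.max() - col.min(), axis=0)

# axis=1: the function receives each row as a Series
row_ratio = df.apply(lambda row: row["sales"] / row["score"], axis=1)

# Same result as row_ratio, computed on whole columns at once
vectorized = df["sales"] / df["score"]

X = df[["sales"]].to_numpy()   # double brackets -> 2D array, shape (n, 1)
y = df["score"].to_numpy()     # single column -> 1D array, shape (n,)

print(col_range)
print(X.shape, y.shape)
```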

In [None]:
print("Apply: sum per column")
print(df_sample[["sales", "score", "age"]].apply(np.nansum, axis=0))

print("\nDataFrame to NumPy (for ML):")
# Features exclude the target column; note that 'sales' still contains one NaN
X = df_sample[["sales", "age"]].values
y = df_sample["score"].values  # target column
print("  X.shape:", X.shape, "  y.shape:", y.shape)

---
# Quick Reference (Cheat Sheet)

| Topic | NumPy | Pandas |
|-------|--------|--------|
| Create | `np.array([...])`, `np.zeros()`, `np.arange()` | `pd.DataFrame({})`, `pd.read_csv()` |
| Shape | `arr.shape`, `arr.dtype` | `df.shape`, `df.columns`, `df.dtypes` |
| Index | `arr[i]`, `arr[1:4]`, `matrix[:, 0]` | `df['col']`, `df.loc[]`, `df.iloc[]` |
| Filter | `arr[arr > 5]` | `df[df['col'] > 5]` |
| Aggregate | `arr.sum()`, `arr.mean(axis=0)` | `df.mean()`, `df.groupby('col').agg()` |
| Missing | `np.nanmean(arr)` | `df.isna()`, `df.fillna()`, `df.dropna()` |
| Combine | `np.vstack()`, `np.hstack()` | `pd.merge()`, `pd.concat()` |

**Next steps:**
- Practice on real CSVs: `pd.read_csv('file.csv')`
- Combine with **scikit-learn**: use `df[features].values` and `train_test_split`
- Visualize: **matplotlib** or **pandas .plot()**
- Revisit the *Python Basics for ML* guide and the `numpy_basics.py` / `pandas_basics.py` scripts in this project.

In [None]:
# Optional: save sample dataset to CSV for practice
df_sample.to_csv("sample_dataset.csv", index=False)
print("Saved df_sample to 'sample_dataset.csv'.")