# Student Notes: Advanced Pandas, NumPy & Functional Programming for ML

**LogicMojo AI/ML Bootcamp – Part 4**  
Theory → example format.

## Topics
1. **Functional Programming**: Lambda, Map, Filter, Reduce
2. **Pandas Apply & Map** (DataFrames/Series)
3. **Advanced Pandas**: GroupBy, Agg vs Transform, Merge, Text, Time, One-Hot
4. **NumPy for ML**: Arrays, Shape, Broadcasting, Boolean Indexing, Dot Product
5. **Data Preprocessing**: Binning, Encoding, Outliers

In [None]:
import numpy as np
import pandas as pd
from functools import reduce
import warnings
warnings.filterwarnings('ignore')

---
# PART 1: FUNCTIONAL PROGRAMMING

## 1.1 Lambda Functions

### Theory
- **Lambda** = Small anonymous function: `lambda arguments: expression`.
- Use for short, one-off operations (no `def` or `return`).
- Often used with `map`, `filter`, `sorted(key=...)`, and pandas `.apply()`.
- Limited to a single expression.

In [None]:
# EXAMPLE: Lambda vs normal function
def square(x):
    return x ** 2

square_lambda = lambda x: x ** 2
print(square(5), square_lambda(5))

# Sorting by custom key
pairs = [(1, 'one'), (3, 'three'), (2, 'two')]
pairs.sort(key=lambda x: x[1])  # Sort by string
print(pairs)

---
## 1.2 Map, Filter, Reduce

### Theory
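- **map(function, iterable)** → Apply function to every item; returns a lazy iterator.
- **filter(function, iterable)** → Keep only items for which function returns a truthy value; also returns a lazy iterator.
- **reduce(function, iterable [, initial])** → Combine all items into one value (e.g. product, sum) by folding left to right; imported from `functools`.
- Use `list()` on map/filter to get a list. Reduce returns a single value; without `initial`, an empty iterable raises `TypeError`.
- Useful for data cleaning steps and simple pipelines over plain Python iterables.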

In [None]:
# EXAMPLE: Map, Filter, Reduce
numbers = [1, 2, 3, 4, 5, 6]

# Map: square every number
squared = list(map(lambda x: x**2, numbers))
print("Map (squared):", squared)

# Filter: keep even numbers
evens = list(filter(lambda x: x % 2 == 0, numbers))
print("Filter (evens):", evens)

# Reduce: product of all numbers
product = reduce(lambda x, y: x * y, numbers)
print("Reduce (product):", product)
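
The optional `initial` argument listed in the theory is not used in the cell above. A minimal sketch of what it changes, reusing the same `numbers` list (the accumulator name `acc` is just an illustrative choice):

```python
from functools import reduce

numbers = [1, 2, 3, 4, 5, 6]

# With an initial value the fold starts from it, so an empty list is safe
total = reduce(lambda acc, x: acc + x, numbers, 0)
empty_total = reduce(lambda acc, x: acc + x, [], 0)

# Without an initial value, reducing an empty iterable raises TypeError
try:
    reduce(lambda acc, x: acc + x, [])
    raised = False
except TypeError:
    raised = True

print("Sum:", total)                  # Sum: 21
print("Empty sum:", empty_total)      # Empty sum: 0
print("Raised TypeError:", raised)    # Raised TypeError: True
```

Passing an initial value is a reasonable default whenever the input might be empty, for example after a `filter` step.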

---
## 1.3 Pandas Apply & Map

### Theory
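- **Series.map(arg)** → Element-wise mapping; `arg` can be a function, dict, or Series (substitutions, lookups). Values missing from a dict become NaN.
- **Series.apply(function)** → Element-wise; can use lambdas or multi-line logic.
- **DataFrame.apply(function, axis=0/1)** → Calls the function once per column (axis=0, the default) or once per row (axis=1).
- `Series.map()` is element-wise on a Series; `.apply()` works on Series and DataFrame. For element-wise work on a whole DataFrame, use `DataFrame.map()` (pandas 2.1+, previously `applymap()`).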

In [None]:
# EXAMPLE: Pandas apply and map
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Income': [50000, 60000, 55000],
    'Department': ['IT', 'HR', 'IT']
})

# .map() – element-wise substitution via a dict lookup
df['Dept_Code'] = df['Department'].map({'IT': 1, 'HR': 2})

# .apply() on a Series – custom function called once per element
def tax_calc(income):
    return income * 0.3 if income > 55000 else income * 0.2

df['Tax'] = df['Income'].apply(tax_calc)
# .apply() on a DataFrame with axis=1 – lambda called once per row
df['Net_Income'] = df.apply(lambda row: row['Income'] - row['Tax'], axis=1)
print(df)
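
The `axis` argument of `DataFrame.apply` is easiest to see on a small numeric table. The sketch below uses a made-up `scores` frame and a made-up `dept` Series (neither is from the notes) to show what the function receives for each axis, and what `Series.map` does with a value that is missing from the dict:

```python
import pandas as pd

scores = pd.DataFrame({'Math': [80, 90, 70], 'Physics': [60, 85, 95]})

# axis=0 (default): the function receives each column as a Series
col_range = scores.apply(lambda col: col.max() - col.min(), axis=0)

# axis=1: the function receives each row as a Series
row_mean = scores.apply(lambda row: row.mean(), axis=1)

# Series.map with a dict: values not found in the dict become NaN
dept = pd.Series(['IT', 'HR', 'Sales'])
dept_code = dept.map({'IT': 1, 'HR': 2})

print("Range per column:\n", col_range)
print("Mean per row:\n", row_mean)
print("Mapped codes:\n", dept_code)
```

For simple arithmetic like this, built-in vectorized methods (`scores.max() - scores.min()`, `scores.mean(axis=1)`) are usually faster than `.apply`; the lambda form is shown only to make the axis behaviour visible.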

---
# PART 2: ADVANCED PANDAS

## 2.1 GroupBy & Aggregation

### Theory
- **groupby(column)** groups rows by unique values; then apply aggregations.
- **agg()** lets you specify different aggregations per column: sum, mean, min, max, count.
- **Multiple columns**: `groupby(['A','B'])` and `.agg({col: [list of funcs]})`.
- **unstack()** turns an index level into columns (long → wide).

In [None]:
# EXAMPLE: GroupBy and aggregation
np.random.seed(0)
df = pd.DataFrame({
    'Date': pd.date_range('2023-01-01', periods=100),
    'Region': np.random.choice(['North', 'South', 'East', 'West'], 100),
    'Product': np.random.choice(['A', 'B', 'C'], 100),
    'Sales': np.random.randint(100, 1000, 100)
})

# Total sales per region
region_sales = df.groupby('Region')['Sales'].sum()
print("Sales by Region:", region_sales)

# Multiple aggregations per product
product_stats = df.groupby('Product')['Sales'].agg(['mean', 'min', 'max', 'count'])
print("\nProduct stats:", product_stats)

# Group by two columns, then unstack
summary = df.groupby(['Region', 'Product'])['Sales'].mean().unstack()
print("\nRegion vs Product (mean):", summary)
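
The dict form of `.agg()` mentioned in the theory is not exercised above. A small sketch on a hypothetical `orders` table (the column names are illustrative), alongside named aggregation, which produces flat column names:

```python
import pandas as pd

orders = pd.DataFrame({
    'Region': ['North', 'North', 'South', 'South', 'South'],
    'Sales': [100, 300, 200, 400, 600],
    'Units': [1, 3, 2, 4, 6],
})

# Dict form: column -> list of functions (result has MultiIndex columns)
per_col = orders.groupby('Region').agg({'Sales': ['sum', 'mean'], 'Units': ['max']})

# Named aggregation: output_name=(column, function), flat column names
named = orders.groupby('Region').agg(
    total_sales=('Sales', 'sum'),
    avg_sales=('Sales', 'mean'),
    max_units=('Units', 'max'),
)

print(per_col)
print(named)
```

Named aggregation tends to be easier to work with downstream because the result does not need its column index flattened.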

---
## 2.2 Agg vs Transform

### Theory
- **Aggregation**: Many rows → one row per group (e.g. sum, mean).
- **Transform**: Same shape as original; each row gets a group-level value (e.g. group sum attached to every row).
- Use **transform** when you need “percent of group total” or “deviation from group mean” while keeping one row per observation.

In [None]:
# EXAMPLE: Agg vs Transform
df_sales = pd.DataFrame({
    'Store': ['A', 'A', 'B', 'B', 'B'],
    'Sales': [100, 200, 150, 250, 100]
})

# Agg: one row per group
agg_result = df_sales.groupby('Store')['Sales'].agg('sum')
print("Agg (sum per store):", agg_result)

# Transform: group sum repeated for each row
df_sales['Store_Total'] = df_sales.groupby('Store')['Sales'].transform('sum')
df_sales['Pct_of_Store'] = (df_sales['Sales'] / df_sales['Store_Total']) * 100
print("\nWith transform:", df_sales)
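
The other use case named in the theory, deviation from the group mean, follows the same pattern. A minimal sketch; the sales figures are adjusted from the cell above so the group means come out as whole numbers:

```python
import pandas as pd

df_sales = pd.DataFrame({
    'Store': ['A', 'A', 'B', 'B', 'B'],
    'Sales': [100, 200, 150, 250, 200],
})

# transform('mean') returns one value per original row, aligned on the index
group_mean = df_sales.groupby('Store')['Sales'].transform('mean')
df_sales['Dev_From_Mean'] = df_sales['Sales'] - group_mean

print(df_sales)
```

Because the transformed Series has the same index as the original frame, it can be subtracted directly with no merge step.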

---
## 2.3 Merging & Joining

### Theory
- **pd.merge(left, right, on='col', how='inner'|'left'|'right'|'outer')**; the default is `how='inner'`.
- **inner**: Only keys present in both tables.
- **left**: All rows from left; columns from right are NaN where there is no match.
- **right**: All rows from right; columns from left are NaN where there is no match.
- **outer**: All keys from both; NaN where no match.
- For ML we often use **left** to keep every event row, then fill the unmatched values with a default.

In [None]:
# EXAMPLE: Merge (inner vs left)
users = pd.DataFrame({'UserID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
trans = pd.DataFrame({'TxID': [101, 102, 103], 'UserID': [1, 2, 5], 'Amount': [500, 200, 100]})

inner = pd.merge(trans, users, on='UserID', how='inner')
print("Inner (only matches):", inner)

left = pd.merge(trans, users, on='UserID', how='left')
print("\nLeft (all transactions):", left)
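
The `right` and `outer` joins can be sketched on the same two tables. The `indicator=True` option is an addition not covered in the notes; it adds a `_merge` column recording where each row came from, which is handy for auditing unmatched keys:

```python
import pandas as pd

users = pd.DataFrame({'UserID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
trans = pd.DataFrame({'TxID': [101, 102, 103], 'UserID': [1, 2, 5], 'Amount': [500, 200, 100]})

# Right join: every user is kept; Charlie (UserID 3) has no transaction
right = pd.merge(trans, users, on='UserID', how='right')

# Outer join: all keys from both sides, with the source of each row recorded
outer = pd.merge(trans, users, on='UserID', how='outer', indicator=True)

print("Right (all users):\n", right)
print("\nOuter with indicator:\n", outer)
```

Here UserID 5 is marked `left_only` (a transaction without a known user) and UserID 3 is marked `right_only` (a user without transactions).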

---
## 2.4 Text Processing (str accessor)

### Theory
- Use **Series.str** for string methods: `.str.strip()`, `.str.title()`, `.str.contains()`, `.str.split()`.
- Enables cleaning and feature extraction from text (e.g. extract domain, first/last name).

In [None]:
# EXAMPLE: Text with .str
text_df = pd.DataFrame({
    'Name': ['  john doe ', 'jane smith', 'bob johnson'],
    'Email': ['john@gmail.com', 'jane@test.com', 'bob@yahoo.com']
})
text_df['Name_Clean'] = text_df['Name'].str.strip().str.title()
text_df['Is_Gmail'] = text_df['Email'].str.contains('gmail')
text_df[['First', 'Last']] = text_df['Name_Clean'].str.split(' ', expand=True)
print(text_df)
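
Domain extraction, mentioned in the theory, is not shown above. One possible sketch, assuming every address contains exactly one `@`:

```python
import pandas as pd

emails = pd.Series(['john@gmail.com', 'jane@test.com', 'bob@yahoo.com'])

# Split on '@' and take the second piece of each resulting list
domain = emails.str.split('@').str[1]

# Alternative: a regex capture group for the part before the '@'
user = emails.str.extract(r'^([^@]+)@', expand=False)

print("Domain:", domain.tolist())   # ['gmail.com', 'test.com', 'yahoo.com']
print("User:", user.tolist())       # ['john', 'jane', 'bob']
```

The domain can then be treated as a categorical feature, for example with frequency encoding.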

---
## 2.5 Time Series Features

### Theory
- Use **pd.to_datetime()** so the column is datetime.
- **.dt** accessor: `.dt.month`, `.dt.dayofweek`, `.dt.day`, etc.
- **rolling(window=n)** for moving average or other rolling stats.
- ML models need numeric features; extract month, day of week, is_weekend, etc.

In [None]:
# EXAMPLE: Time features and rolling
df = pd.DataFrame({
    'Date': pd.date_range('2023-01-01', periods=10, freq='D'),
    'Sales': [100, 120, 90, 150, 130, 110, 140, 160, 120, 180]
})
df['Date'] = pd.to_datetime(df['Date'])
df['Month'] = df['Date'].dt.month
df['DayOfWeek'] = df['Date'].dt.dayofweek
df['Is_Weekend'] = df['Date'].dt.dayofweek > 4
df['Sales_3Day_Avg'] = df['Sales'].rolling(window=3).mean()
print(df)
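
In the cell above the column is already datetime because it comes from `pd.date_range`. When dates arrive as strings, as they do from most CSV files, the conversion matters. A sketch on a made-up `raw` frame, also showing `min_periods=1` as one way to avoid the leading NaNs of a rolling mean:

```python
import pandas as pd

raw = pd.DataFrame({
    'Date': ['2023-01-06', '2023-01-07', '2023-01-08', '2023-01-09'],
    'Sales': [100, 120, 90, 150],
})

# Parse strings to datetime64; an explicit format avoids ambiguous parsing
raw['Date'] = pd.to_datetime(raw['Date'], format='%Y-%m-%d')

raw['DayOfWeek'] = raw['Date'].dt.dayofweek      # Monday=0 ... Sunday=6
raw['Is_Weekend'] = raw['DayOfWeek'] >= 5        # Saturday or Sunday

# min_periods=1 averages whatever is available in the first windows
raw['Sales_3Day_Avg'] = raw['Sales'].rolling(window=3, min_periods=1).mean()

print(raw)
```

Whether partial windows are acceptable depends on the model; leaving the NaNs and dropping those rows is an equally valid choice.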

---
## 2.6 One-Hot Encoding

### Theory
- ML models need numbers; categories like "Red", "Blue" must be encoded.
- **pd.get_dummies(df, columns=[...])** creates one indicator column per category (boolean by default in pandas 2.x; pass `dtype=int` for 0/1).
- **drop_first=True** avoids multicollinearity in linear models (one column is redundant).

In [None]:
# EXAMPLE: One-hot encoding
data = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Red', 'Blue'], 'Price': [10, 20, 15, 10, 20]})
encoded = pd.get_dummies(data, columns=['Color'], drop_first=False)
print(encoded)
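
The cell above uses `drop_first=False`. For comparison, a sketch of the `drop_first=True` variant on the same data, with `dtype=int` added so the columns print as 0/1:

```python
import pandas as pd

data = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Red', 'Blue'],
                     'Price': [10, 20, 15, 10, 20]})

full = pd.get_dummies(data, columns=['Color'], dtype=int)
reduced = pd.get_dummies(data, columns=['Color'], drop_first=True, dtype=int)

# In the full encoding the three Color columns always sum to 1,
# which is the redundancy that drop_first removes
print(full)
print(reduced)   # Color_Blue is dropped; Blue is the all-zeros row
```

The dropped category is the first in sorted order (`Blue` here), so it becomes the implicit baseline.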

---
# PART 3: NUMPY FOR ML

## 3.1 Creating Arrays

### Theory
- **np.array(list)** converts list to array.
- **np.arange(start, stop, step)** like range but returns array.
- **np.zeros(shape)**, **np.ones(shape)** for initialization.
- **np.random.rand(d0, d1, ...)** uniform [0,1); **np.random.randn(d0, d1, ...)** standard normal. Both take dimensions as separate arguments, not a shape tuple.
- Most Python ML libraries accept NumPy arrays as input; arrays hold a single dtype, which is what makes vectorized operations fast.

In [None]:
# EXAMPLE: Array creation
arr = np.array([1, 2, 3, 4, 5])
print("Array:", arr)
print("arange:", np.arange(0, 10, 2))
print("zeros 3x3:\n", np.zeros((3, 3)))
print("ones 2x4:\n", np.ones((2, 4)))
np.random.seed(42)
print("rand (0-1):\n", np.random.rand(2, 3))
print("randn (normal):\n", np.random.randn(2, 3))

---
## 3.2 Shape and Reshape

### Theory
- **.shape** gives dimensions (rows, cols) or (n,) for 1D.
- **.reshape(rows, cols)** changes layout; total size must stay the same.
- **reshape(-1, n)** means “n columns, compute rows automatically”.
- **.flatten()** turns a matrix into 1D and returns a copy. Checking `.shape` before and after each step helps avoid shape errors in deep learning code.

In [None]:
# EXAMPLE: Shape and reshape
arr = np.arange(12)
print("1D:", arr, "shape:", arr.shape)
arr_2d = arr.reshape(3, 4)
print("2D (3x4):\n", arr_2d)
arr_auto = arr.reshape(-1, 4)
print("reshape(-1, 4):\n", arr_auto)
print("flatten:", arr_2d.flatten())
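
Two reshape situations come up often in practice: turning a 1D feature into a single column, and hitting a size mismatch. A short sketch (the array names are illustrative):

```python
import numpy as np

feature = np.array([3.0, 5.0, 7.0])   # shape (3,)
column = feature.reshape(-1, 1)       # shape (3, 1): 3 samples, 1 feature

grid = np.arange(6).reshape(2, 3)
flat_copy = grid.flatten()            # flatten always returns a copy
flat_copy[0] = 99                     # grid itself is unchanged

# Total size must match: 6 elements cannot fill a 4x2 layout
try:
    grid.reshape(4, 2)
    reshape_ok = True
except ValueError:
    reshape_ok = False

print("column shape:", column.shape)   # (3, 1)
print("grid[0, 0]:", grid[0, 0])       # 0
print("reshape(4, 2) ok:", reshape_ok) # False
```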

---
## 3.3 Broadcasting & Vectorization

### Theory
- **Vectorization**: Operations on whole arrays without Python loops (much faster).
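- **Broadcasting**: NumPy virtually stretches size-1 (or missing) dimensions so shapes match, without copying data (e.g. add (3,) to (2,3)).
- Rule: shapes are compared from the trailing dimension backwards; two dimensions are compatible if they are equal or one of them is 1.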
- Example: adding a bias vector to a batch of samples (matrix + vector).

In [None]:
# EXAMPLE: Broadcasting
data = np.array([1, 2, 3, 4])
print("Plus 10:", data + 10)
print("Squared:", data ** 2)

matrix = np.array([[10, 20, 30], [40, 50, 60]])  # (2, 3)
bias = np.array([1, 2, 3])                        # (3,)
result = matrix + bias  # bias broadcast to each row
print("Matrix + bias:\n", result)
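
The compatibility rule is easiest to see with one case that works and one that fails. A sketch reusing the same `matrix`; `row_offset` and `per_row` are illustrative names:

```python
import numpy as np

matrix = np.array([[10, 20, 30], [40, 50, 60]])   # (2, 3)

# (2, 3) + (2, 1): the size-1 column is stretched across the 3 columns
row_offset = np.array([[100], [200]])
shifted = matrix + row_offset

# (2, 3) + (2,): trailing dimensions are 3 and 2, so this fails
per_row = np.array([1, 2])
try:
    matrix + per_row
    compatible = True
except ValueError:
    compatible = False

# Adding an axis turns (2,) into (2, 1), which is compatible
fixed = matrix + per_row[:, np.newaxis]

print("shifted:\n", shifted)
print("compatible without new axis:", compatible)   # False
print("fixed:\n", fixed)
```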

---
## 3.4 Boolean Indexing

### Theory
- Create a boolean array (mask) from a condition: `arr > 0`.
- Use mask to select: `arr[mask]` or `arr[arr > 0]`.
- Use for filtering outliers or selecting classes.

In [None]:
# EXAMPLE: Boolean indexing
data = np.array([10, -5, 30, -2, 50, 0])
mask = data > 0
print("Mask:", mask)
print("Positive:", data[mask])

# Clip: cap values
data_clip = data.copy()
data_clip[data_clip > 40] = 40
print("Clipped:", data_clip)
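
The manual capping above has a built-in equivalent, and masks can be combined. A sketch on the same `data` array showing `np.clip`, `np.where`, and a compound condition:

```python
import numpy as np

data = np.array([10, -5, 30, -2, 50, 0])

# np.clip caps both ends in one call and returns a new array
clipped = np.clip(data, 0, 40)

# np.where(condition, a, b): keep positives, replace the rest with 0
non_negative = np.where(data > 0, data, 0)

# Combine masks with & and |; each condition needs its own parentheses
in_range = data[(data > 0) & (data < 40)]

print("np.clip:", clipped)           # [10  0 30  0 40  0]
print("np.where:", non_negative)     # [10  0 30  0 50  0]
print("0 < x < 40:", in_range)       # [10 30]
```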

---
## 3.5 Dot Product (Linear Algebra)

### Theory
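- **y = Wx + b** is the core of neural layers: matrix-vector or matrix-matrix multiply. With samples stored as rows, the same layer is written `X @ W + b`, with the weight matrix laid out as (inputs, outputs).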
- **np.dot(a, b)** or **a @ b** for matrix multiplication.
- Shape rule: (m, n) @ (n, p) → (m, p).

In [None]:
# EXAMPLE: Dot product (like one layer)
X = np.array([0.5, 0.1, -0.2])  # 1 sample, 3 features
W = np.array([[1.0, 0.5], [-0.5, 0.2], [0.1, -0.1]])  # 3 in, 2 out
output = X @ W  # (3,) @ (3, 2) -> (2,)
print("X:", X)
print("W:\n", W)
print("Output (X @ W):", output)
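
The cell above handles one sample and leaves out the bias. A sketch extending it to a batch of two samples with a bias vector; the second sample and the values of `b` are made up for illustration:

```python
import numpy as np

X = np.array([[0.5, 0.1, -0.2],
              [1.0, 0.0,  2.0]])                        # (2, 3): 2 samples, 3 features
W = np.array([[1.0, 0.5], [-0.5, 0.2], [0.1, -0.1]])    # (3, 2): 3 in, 2 out
b = np.array([0.1, -0.1])                               # (2,): one bias per output

# (2, 3) @ (3, 2) -> (2, 2); b is broadcast across the rows
output = X @ W + b
print("Output shape:", output.shape)
print(output)

# Same result for the first sample in column convention: y = W^T x + b
y0 = W.T @ X[0] + b
print("Matches row 0:", np.allclose(y0, output[0]))
```

The two conventions differ only in whether the weight matrix is stored as (inputs, outputs) or (outputs, inputs).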

---
# PART 4: DATA PREPROCESSING

## 4.1 Binning (pd.cut, pd.qcut)

### Theory
- **pd.cut(series, bins=[...], labels=[...])** – fixed bin edges (e.g. age groups); intervals are right-closed by default, so with edges [0, 18, 35] the value 18 lands in (0, 18].
- **pd.qcut(series, q=n, labels=[...])** – quantile-based; roughly equal counts per bin.
- Turns continuous variables into categories for analysis or encoding.

In [None]:
# EXAMPLE: Binning
ages = pd.Series([15, 25, 45, 70, 22, 55, 18, 80])
bins = [0, 18, 35, 60, 100]
labels = ['Child', 'Young', 'Adult', 'Senior']
ages_cut = pd.cut(ages, bins=bins, labels=labels)
print("pd.cut:", ages_cut)

incomes = pd.Series([20000, 50000, 80000, 35000, 90000, 45000])
incomes_q = pd.qcut(incomes, q=3, labels=['Low', 'Mid', 'High'])
print("pd.qcut:", incomes_q)
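
Values that sit exactly on a bin edge are where `pd.cut` most often surprises. A sketch using the same bins and labels, with ages chosen to land on the edges:

```python
import pandas as pd

ages = pd.Series([18, 35, 60])
bins = [0, 18, 35, 60, 100]
labels = ['Child', 'Young', 'Adult', 'Senior']

# Default right=True: intervals are (0, 18], (18, 35], (35, 60], (60, 100]
right_closed = pd.cut(ages, bins=bins, labels=labels)

# right=False: intervals are [0, 18), [18, 35), [35, 60), [60, 100)
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False)

print("right=True: ", right_closed.tolist())   # ['Child', 'Young', 'Adult']
print("right=False:", left_closed.tolist())    # ['Young', 'Adult', 'Senior']
```

Which side should be closed depends on how the groups are defined, so it is worth deciding explicitly rather than relying on the default.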

---
## 4.2 Label & Frequency Encoding

### Theory
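- **Label encoding**: Category → integer. Use `.astype('category').cat.codes` (codes follow sorted category order, e.g. Blue=0, Red=1) or an explicit mapping. The integers imply an order, so this suits tree-based models or genuinely ordinal categories better than linear models.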
- **Frequency encoding**: Replace category with its count or proportion in the dataset; good for high-cardinality.

In [None]:
# EXAMPLE: Label and frequency encoding
df = pd.DataFrame({'City': ['NY', 'Paris', 'London', 'Paris', 'NY'], 'Price': [100, 200, 150, 220, 110]})
df['City_Label'] = df['City'].astype('category').cat.codes
freq_map = df['City'].value_counts(normalize=True)
df['City_Freq'] = df['City'].map(freq_map)
print(df)
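
In a real pipeline the encoding is learned on training data and then applied to new data, which may contain unseen categories. A sketch under that assumption; the `train`/`test` split and the city `Tokyo` are hypothetical:

```python
import pandas as pd

train = pd.DataFrame({'City': ['NY', 'Paris', 'London', 'Paris', 'NY']})
test = pd.DataFrame({'City': ['Paris', 'Tokyo']})

# cat.codes follow sorted category order: London=0, NY=1, Paris=2
codes = train['City'].astype('category').cat.codes

# Learn frequencies on train only, then map them onto test;
# categories never seen in train map to NaN, filled here with 0.0
freq_map = train['City'].value_counts(normalize=True)
test['City_Freq'] = test['City'].map(freq_map).fillna(0.0)

print("Codes:", codes.tolist())   # [1, 2, 0, 2, 1]
print(test)
```

Filling unseen categories with 0.0 is one option among several; a small constant or the minimum observed frequency would also be defensible.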

---
## 4.3 Outlier Handling

### Theory
- **Clipping (capping)**: Restrict values to a range, e.g. 5th–95th percentile: `series.clip(lower, upper)`.
- **IQR rule**: Values below Q1 − 1.5×IQR or above Q3 + 1.5×IQR are often treated as outliers (remove or cap).
- Outliers can distort linear models; clipping keeps every row while limiting each extreme value's influence.

In [None]:
# EXAMPLE: Outliers – clip and IQR
data = pd.Series([10, 12, 11, 14, 12, 11, 10, 1000])
print("Original mean:", data.mean())

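lower, upper = data.quantile(0.05), data.quantile(0.95)
# With only 8 points, the 95th percentile is interpolated between 14 and 1000,
# so the upper cap is still far above the typical values
clipped = data.clip(lower=lower, upper=upper)
print("After clip:", clipped.values, "mean:", clipped.mean())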

Q1, Q3 = data.quantile(0.25), data.quantile(0.75)
IQR = Q3 - Q1
fence = 1.5 * IQR
filtered = data[(data >= Q1 - fence) & (data <= Q3 + fence)]
print("After IQR filter:", filtered.values, "mean:", filtered.mean())
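
The theory notes that IQR outliers can be removed or capped, and the cell above only removes them. A sketch of the capping variant on the same series, which keeps all eight rows:

```python
import pandas as pd

data = pd.Series([10, 12, 11, 14, 12, 11, 10, 1000])

Q1, Q3 = data.quantile(0.25), data.quantile(0.75)
IQR = Q3 - Q1
lower_fence = Q1 - 1.5 * IQR
upper_fence = Q3 + 1.5 * IQR

# Cap at the fences instead of dropping rows
capped = data.clip(lower=lower_fence, upper=upper_fence)

print("Fences:", lower_fence, upper_fence)
print("Capped:", capped.values, "mean:", capped.mean())
```

Capping is often preferable when rows cannot be dropped, for example when each row must stay aligned with a label or with other feature columns.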

---
## Summary – Quick Reference

| Topic | Key idea | Example |
|-------|----------|--------|
| Lambda | Anonymous function | `lambda x: x**2` |
| Map | Apply to each | `list(map(f, lst))` |
| Filter | Keep where True | `list(filter(f, lst))` |
| Reduce | Single value | `reduce(lambda x,y: x*y, lst)` |
| apply/map | Pandas element-wise | `df['col'].apply(f)` |
| groupby | Group + agg/transform | `df.groupby('A')['B'].sum()` |
| merge | Join tables | `pd.merge(left, right, on=..., how='left')` |
| NumPy shape | Reshape for ML | `arr.reshape(-1, n)` |
| Broadcasting | Array + scalar/vector | `matrix + bias` |
| Dot product | Weights × inputs | `X @ W` |
| Binning | Continuous → categories | `pd.cut`, `pd.qcut` |
| Outliers | Clip or IQR filter | `series.clip(lower, upper)` |