# NumPy Basics: Stats, Normalization, and Smoothing

**Goal:** use NumPy arrays to implement the math-y parts of wrangling.

**Covered:**
- Compute descriptive stats and handle NaNs
- Normalize/standardize arrays
- Apply simple smoothing for noisy signals

## 1. Setup
- Import `numpy as np`.
- Assume we start from either a Pandas column (`df['col'].to_numpy()`) or a raw array `x`.

In [None]:
import os, sys, glob
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# ------------------
# GENERATING ARRAYS
# ------------------

rng = np.random.default_rng(7)

# Section 2: stats with NaNs
arr_nan = np.array([42.0, 50.0, np.nan, 60.0, 70.0, np.nan, 55.0])

# Section 3: clean numeric with one outlier
arr_clean = np.array([1., 5., 10., 12., 14., 15., 20., 22., 25., 200.])  # 200 is an outlier

# Section 4: noisy signal for smoothing
linspace = np.linspace(0, 2*np.pi, 40)
signal_noisy = np.sin(linspace) + rng.normal(0, 0.15, size=linspace.size)
signal_noisy[10] = np.nan  # one missing sample to show handling
print("Demo arrays ready: arr_nan, arr_clean, signal_noisy")

## 2. Descriptive Stats with NaNs
---
**Dataset:** use `arr_nan` (a tiny 1D array).

**Use:** `np.nanmean`, `np.nanmedian`, `np.nanstd` for arrays containing NaNs.
- Plain `np.mean/median/std` will return `nan` if any NaN is present.
- The `nan*` versions compute while **ignoring** NaNs.
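
To make "ignoring NaNs" concrete, the sketch below checks that the `nan*` functions agree with dropping the NaN entries by hand and calling the plain functions. The names `a` and `valid` are illustrative only; `a` holds the same values as `arr_nan`.

```python
import numpy as np

a = np.array([42.0, 50.0, np.nan, 60.0, 70.0, np.nan, 55.0])
valid = a[~np.isnan(a)]  # keep only the non-NaN entries

# Each nan-aware function should match the plain function on the filtered array
mean_match = bool(np.isclose(np.nanmean(a), valid.mean()))
median_match = bool(np.isclose(np.nanmedian(a), np.median(valid)))
std_match = bool(np.isclose(np.nanstd(a), valid.std()))
print(mean_match, median_match, std_match)
```

Masking by hand is fine for a 1D array; the `nan*` functions are more convenient when reducing along an axis of a 2D array, where each row or column can have a different number of NaNs.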


In [None]:
# --- Section 2: Descriptive stats that ignore NaNs ---
print("[nan-aware stats on arr_nan]")
print(f"arr_nan: {arr_nan}\n")

print(f"np.mean      -> {np.mean(arr_nan)}")         # returns nan
print(f"np.nanmean   -> {np.nanmean(arr_nan)}\n")    # ignores NaN

print(f"np.median    -> {np.median(arr_nan)}")       # returns nan
print(f"np.nanmedian -> {np.nanmedian(arr_nan)}\n")  # ignores NaN

print(f"np.std       -> {np.std(arr_nan)}")          # returns nan
print(f"np.nanstd    -> {np.nanstd(arr_nan)}")       # ignores NaN

In [None]:
# --------------
# Handling NaNs:
#   count, fill with a constant or the mean,
#   linear interpolation, forward-fill
# --------------

# Count NaNs
nan_count = np.isnan(arr_nan).sum()
print(f"arr_nan: {arr_nan}")
print(f"NaN count: {nan_count}\n")

# Fill with 0
filled_zero = np.nan_to_num(arr_nan, nan=0.0)
print("[Fill NaNs with 0]")
print(filled_zero, "\n")

# Fill with mean
mu = np.nanmean(arr_nan)
filled_mean = np.nan_to_num(arr_nan, nan=mu)
print(f"[Fill NaNs with mean={mu:.2f}]")
print(filled_mean, "\n")

# Interpolation across gaps (linear)
arr_interp = arr_nan.copy()
mask = np.isnan(arr_interp)
if mask.any():
    idx = np.arange(arr_interp.size)
    arr_interp[mask] = np.interp(idx[mask], idx[~mask], arr_interp[~mask])
print("[Linear interpolation across NaNs]")
print(arr_interp, "\n")

# Forward-fill (carry the last valid value forward; a leading NaN would stay NaN)
arr_ffill = arr_nan.copy()
mask = np.isnan(arr_ffill)
idx = np.where(~mask, np.arange(arr_ffill.size), 0)
np.maximum.accumulate(idx, out=idx)
arr_ffill = arr_ffill[idx]
print("[Forward-fill NaNs (carry last valid)]")
print(arr_ffill)
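
The index trick behind the forward-fill is easier to follow one step at a time. The sketch below repeats it on the same values as `arr_nan`, using illustrative names (`a`, `last_valid`, `filled`) and printing each intermediate array.

```python
import numpy as np

a = np.array([42.0, 50.0, np.nan, 60.0, 70.0, np.nan, 55.0])
mask = np.isnan(a)

# Step 1: keep each valid position's own index; put 0 where the value is NaN
idx = np.where(~mask, np.arange(a.size), 0)
# Step 2: a running maximum replaces each 0 with the index of the last valid value
last_valid = np.maximum.accumulate(idx)
# Step 3: fancy indexing reads the value stored at that last valid index
filled = a[last_valid]

print(idx)         # [0 1 0 3 4 0 6]
print(last_valid)  # [0 1 1 3 4 4 6]
print(filled)      # [42. 50. 50. 60. 70. 70. 55.]
```

If the array started with a NaN, step 2 would map it to index 0, which is the NaN itself, so a leading gap is left unfilled.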

## 3. Normalization & Standardization
---
**Dataset:** use `arr_clean` (1D array with an outlier).
> How do outliers influence our normalization & standardization methods?

**Formulas (NumPy algebra):**
- Min–max: `(x - x.min()) / (x.max() - x.min())`
- Z-score: `(x - x.mean()) / x.std()`
- Robust (median/Median Absolute Deviation): `(x - median) / MAD` where `MAD = median(|x - median|)`
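
One way to quantify the outlier question is to measure how much room the nine non-outlier values get after each scaling. The sketch below is a rough illustration, not a definitive comparison: the `inliers` mask and the `spreads` dict are hypothetical helpers, and `x` holds the same values as `arr_clean`.

```python
import numpy as np

x = np.array([1., 5., 10., 12., 14., 15., 20., 22., 25., 200.])
inliers = x < 200  # hypothetical mask: everything except the outlier

minmax = (x - x.min()) / (x.max() - x.min())
z = (x - x.mean()) / x.std()          # x.std() defaults to ddof=0 (population std)
med = np.median(x)                    # 14.5
mad = np.median(np.abs(x - med))      # 6.5
robust = (x - med) / mad

# Spread (max - min) of the inliers under each scaling
spreads = {}
for name, s in [("min-max", minmax), ("z-score", z), ("robust", robust)]:
    spreads[name] = s[inliers].max() - s[inliers].min()
    print(f"{name:8s} inlier spread: {spreads[name]:.3f}")
```

With min–max, the inliers are squeezed into roughly the bottom 12% of the [0, 1] range because the single value 200 sets the maximum. The median and MAD barely move when one extreme value is present, so robust scaling keeps the inliers well separated. If `mad` were 0 (for example, when more than half the values are identical), the robust formula would divide by zero, so real data may need a guard.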


In [None]:
# --------------
# Normalization/Standardization
# --------------

x = arr_clean.copy()
print("Original:", x)
print('-' * 50)

# Min–max
x_min, x_max = x.min(), x.max()
x_minmax = (x - x_min) / (x_max - x_min)
print("[Min–max scaling]")
print(x_minmax)
print('-' * 50)

# Z-score
mu, sigma = x.mean(), x.std()
x_z = (x - mu) / sigma
print("[Z-score standardization]")
print(x_z)
print('-' * 50)

# Robust scaling (median/MAD)
med = np.median(x)
mad = np.median(np.abs(x - med))
x_robust = (x - med) / mad
print("[Robust scaling: median/MAD]")
print(x_robust)

In [None]:
# Plot unscaled vs scaled arrays
plt.figure(figsize=(8,4))
plt.plot(x, 'o-', label='Original', markersize=8)
plt.plot(x_minmax, 'o-', label='Min–max', markersize=8)
plt.plot(x_z, 'o-', label='Z-score', markersize=8)
plt.plot(x_robust, 'o-', label='Robust (med/MAD)', markersize=8)
plt.title('Normalization/Standardization Methods')
plt.legend()
plt.show()

## 4. Smoothing / Noise Reduction
---
**Dataset:** use `signal_noisy` (noisy sine wave) to compare each filter against the original signal.

- Simple moving average (SMA) with window `k` using 1D convolution
- Exponential moving average (EMA) with smoothing factor `alpha` (smaller `alpha` means heavier smoothing)
- Handle NaNs first so filters don't propagate missing values.
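
To see why the NaN has to be handled first, the sketch below runs both filters on a short toy array with one missing sample. The array `y` and the parameter values are made up for illustration; the filter logic mirrors the SMA and EMA definitions used in this section.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, np.nan, 5.0, 6.0, 7.0])

# SMA with k=3: every window that overlaps the NaN becomes NaN
k = 3
sma_out = np.convolve(y, np.ones(k) / k, mode="valid")

# EMA with alpha=0.5: once a NaN enters the recursion, it never leaves
alpha = 0.5
ema_out = np.empty_like(y)
ema_out[0] = y[0]
for t in range(1, len(y)):
    ema_out[t] = alpha * y[t] + (1 - alpha) * ema_out[t - 1]

print("SMA NaNs:", np.isnan(sma_out).sum(), "of", sma_out.size)  # SMA NaNs: 3 of 5
print("EMA NaNs:", np.isnan(ema_out).sum(), "of", ema_out.size)  # EMA NaNs: 4 of 7
```

A single missing sample costs the SMA `k` outputs, while the EMA loses everything from that point on because each output depends on the previous one.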

In [None]:
# --------------
# Smoothing a noisy 1D signal
# --------------

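# Simple moving average: convolve the signal with a length-k box kernel
def sma(y, k):
    kernel = np.ones(k) / k
    # mode="valid" keeps only full windows, so the output has len(y) - k + 1 samples
    return np.convolve(y, kernel, mode="valid")

# Exponential moving average: blend each new sample with the previous output
def ema(y, alpha=0.2):
    y = np.asarray(y, dtype=float)  # avoid integer truncation for int inputs
    out = np.empty_like(y)
    out[0] = y[0]  # seed the recursion with the first sample
    for t in range(1, len(y)):
        out[t] = alpha * y[t] + (1 - alpha) * out[t-1]
    return out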

# Handle a NaN in the signal by linear interpolation so filters don't propagate NaNs
interp_s = signal_noisy.copy()
mask = np.isnan(interp_s)
if mask.any():
    idx = np.arange(interp_s.size)
    interp_s[mask] = np.interp(idx[mask], idx[~mask], interp_s[~mask])
   
# Generate smoothed arrays
sma_s = sma(interp_s, k=5)
ema_s = ema(interp_s, alpha=0.2)

In [None]:
# Plot the interpolated signal and both smoothed versions on shared axes
plt.figure(figsize=(6, 3))
plt.plot(linspace, interp_s, 'o-', label='Interpolated signal', markersize=5)
plt.plot(linspace[2:-2], sma_s, 'o-', label='SMA (k=5)', markersize=5)  # SMA is shorter by k-1
plt.plot(linspace, ema_s, 'o-', label='EMA (alpha=0.2)', markersize=5)
plt.title('Smoothed Signals')
plt.legend()
plt.show()
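
The slice `linspace[2:-2]` in the plot is specific to `k=5`. As a sketch, the trimming can be generalized to any odd window size; `sma_x` is a hypothetical helper, not part of NumPy, and the centered alignment it uses is an assumption that only makes sense for odd `k`.

```python
import numpy as np

def sma_x(x, k):
    """Hypothetical helper: x-values centered on each full SMA window (odd k only)."""
    if k % 2 == 0:
        raise ValueError("centered alignment assumes an odd window size")
    half = (k - 1) // 2
    # Slice by explicit end index so k=1 (half=0) returns the whole array
    return x[half : len(x) - half]

x = np.linspace(0, 2 * np.pi, 40)
y = np.sin(x)
for k in (3, 5, 7):
    smoothed = np.convolve(y, np.ones(k) / k, mode="valid")
    # The trimmed x-axis and the smoothed signal should have the same length
    print(k, sma_x(x, k).shape == smoothed.shape)
```

Slicing with `len(x) - half` instead of `-half` avoids the empty result that `x[0:-0]` would produce when `k=1`.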