# Chapter 17: The Data Science Stack

Python's rise to prominence in the software industry is inextricably linked to its dominance in data science. From experimental prototyping to production-grade machine learning pipelines, Python offers a unified ecosystem for numerical computing, data manipulation, and visualization. This ecosystem is built on three pillars: **NumPy** for efficient numerical operations, **Pandas** for structured data analysis, and visualization libraries for communicating insights.

This chapter explores these foundational tools. You will learn to leverage vectorization for performance that rivals compiled languages, manipulate datasets with expressive syntax, and generate publication-quality visualizations. We emphasize the patterns that differentiate data science code from general-purpose scripting: array-oriented thinking, the split-apply-combine strategy, and the grammar of graphics.

## 17.1 NumPy: N-Dimensional Arrays

**NumPy** (Numerical Python) is the foundation of the scientific Python stack. It provides the `ndarray`—a multidimensional array object that enables vectorized operations, broadcasting, and memory-efficient storage.

### The ndarray: Memory and Performance

Unlike Python lists, which store references to scattered objects, NumPy arrays store homogeneous data in a contiguous memory block. This enables CPU cache optimization and SIMD (Single Instruction, Multiple Data) operations.

```python
import numpy as np
import time

# Comparing Python lists vs. NumPy arrays
def performance_comparison() -> None:
    """Demonstrate NumPy's performance advantage."""
    size: int = 10_000_000
    
    # Python lists
    list_a = list(range(size))
    list_b = list(range(size))
    
    # perf_counter() is a monotonic, high-resolution clock intended for timing
    start = time.perf_counter()
    # Element-wise operation requires loop
    list_c = [a + b for a, b in zip(list_a, list_b)]
    list_time = time.perf_counter() - start
    
    # NumPy arrays
    array_a = np.arange(size)
    array_b = np.arange(size)
    
    start = time.perf_counter()
    # Vectorized operation (no explicit loop)
    array_c = array_a + array_b
    numpy_time = time.perf_counter() - start
    
    print(f"Python List: {list_time:.4f}s")
    print(f"NumPy Array: {numpy_time:.6f}s")
    print(f"Speedup: {list_time / numpy_time:.0f}x")

performance_comparison()           # Exact timings vary by machine

# Array creation
arr_1d = np.array([1, 2, 3, 4, 5])
arr_2d = np.array([[1, 2, 3], [4, 5, 6]])

# Special constructors
zeros = np.zeros((3, 4))           # 3x4 matrix of 0s
ones = np.ones((2, 3), dtype=np.float64)
identity = np.eye(3)               # 3x3 identity matrix
random = np.random.randn(2, 2)     # Standard normal distribution
linspace = np.linspace(0, 10, 5)   # [0., 2.5, 5., 7.5, 10.]
```
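
To make the contiguous-memory claim concrete, the following sketch compares the approximate footprint of a list with the raw data buffer of an array and inspects the array's layout attributes. The sizes chosen are arbitrary, and the list figure is only an estimate because CPython shares small integer objects.

```python
import sys
import numpy as np

values = list(range(1_000))
arr = np.arange(1_000, dtype=np.int64)

# A list holds pointers; every int is a separate Python object on the heap.
# This sum is an approximation (CPython caches small ints).
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# An ndarray stores the raw 8-byte integers back to back.
array_bytes = arr.nbytes           # 1000 elements * 8 bytes = 8000

print(f"list (approx.): {list_bytes} bytes")
print(f"ndarray data:   {array_bytes} bytes")
print(f"itemsize={arr.itemsize}, strides={arr.strides}")
print(f"C-contiguous: {arr.flags['C_CONTIGUOUS']}")

# Strides give the byte step per axis: one row of four float64 values is 32 bytes
grid = np.zeros((3, 4), dtype=np.float64)
print(grid.strides)                # (32, 8)
```

The `strides` tuple is what lets NumPy walk a multidimensional array as a single flat buffer, which is the property the vectorized loops rely on.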

### Array Attributes and Indexing

```python
import numpy as np

arr = np.random.randint(0, 100, (3, 4, 5))  # 3D array: 3 blocks, 4 rows, 5 cols

# Attributes
print(f"Shape: {arr.shape}")       # (3, 4, 5)
print(f"Dimensions: {arr.ndim}")   # 3
print(f"Size: {arr.size}")         # 60 total elements
print(f"Dtype: {arr.dtype}")       # int64 on most 64-bit platforms (default integer is platform-dependent)

# Indexing (0-based)
print(arr[0, 1, 2])                # First block, second row, third column

# Slicing
print(arr[:, 0, :])                # All blocks, first row, all columns
print(arr[::2])                    # Every other block (step=2)
print(arr[..., -1])                # Last column of all blocks (ellipsis)

# Boolean Indexing (Filtering)
data = np.array([1, 2, 3, 4, 5, 6, 7, 8])
mask = data > 4
filtered = data[mask]              # [5, 6, 7, 8]

# Fancy Indexing (Integer array indexing)
indices = np.array([0, 2, 5])
selected = data[indices]           # [1, 3, 6]
```
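
One consequence of these indexing modes is easy to miss: basic slicing returns a view onto the same memory, while boolean and fancy indexing return copies. A minimal sketch of the difference, using a throwaway array:

```python
import numpy as np

data = np.arange(10)

view = data[2:5]            # Basic slice: a view on the same buffer
view[0] = 99                # Writes through to `data`

copy = data[[2, 3, 4]]      # Fancy indexing: an independent copy
copy[0] = -1                # Does not affect `data`

masked = data[data > 5]     # Boolean indexing also produces a copy

print(data[2])                          # 99
print(np.shares_memory(data, view))     # True
print(np.shares_memory(data, copy))     # False
print(np.shares_memory(data, masked))   # False
```

When you need an independent array from a slice, call `.copy()` explicitly.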

### Vectorization and Broadcasting

**Broadcasting** describes how NumPy treats arrays with different shapes during arithmetic operations. Smaller arrays are "broadcast" across larger ones without copying data.

```python
import numpy as np

# Broadcasting rules:
# 1. Compare shapes right-to-left
# 2. Dimensions are compatible if equal or one is 1
# 3. Missing dimensions are treated as 1

# Example 1: Scalar + Array
arr = np.array([1, 2, 3])
result = arr + 5                   # [6, 7, 8] (Scalar broadcast to shape (3,))

# Example 2: 2D + 1D
matrix = np.ones((3, 3))           # Shape (3, 3)
row = np.array([1, 2, 3])          # Shape (3,)
result = matrix + row              # Row broadcast across each row of matrix

# Example 3: Column + Row
col = np.array([[1], [2], [3]])    # Shape (3, 1)
row = np.array([1, 2, 3])          # Shape (3,)
result = col + row                 # Outer sum (every pairwise addition): Shape (3, 3)

# Visualizing Broadcasting
"""
(3, 1) + (3,) -> (3, 3)
[[1],          [1, 2, 3]    [[1+1, 1+2, 1+3],
 [2],      +            =   [2+1, 2+2, 2+3],
 [3]]                        [3+1, 3+2, 3+3]]
"""
```
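
The broadcasting rules can be checked without doing any arithmetic. The sketch below assumes NumPy 1.20 or later, where `np.broadcast_shapes` is available, and also shows what happens when shapes are incompatible.

```python
import numpy as np

# Ask NumPy what shape a broadcast would produce
print(np.broadcast_shapes((3, 1), (3,)))        # (3, 3)
print(np.broadcast_shapes((8, 1, 6), (7, 1)))   # (8, 7, 6)

# Trailing dimensions 4 and 3 are neither equal nor 1, so this fails
try:
    np.ones((3, 4)) + np.ones(3)
except ValueError as exc:
    print(f"Incompatible shapes: {exc}")

# broadcast_to exposes the "no copy" behavior: the repeated axis has stride 0,
# so all three rows read the same three values in memory
row = np.array([1, 2, 3])
expanded = np.broadcast_to(row, (3, 3))
print(expanded.shape, expanded.strides[0])      # (3, 3) 0

# A common practical use: center each column of a matrix
X = np.array([[1.0, 10.0], [3.0, 30.0]])
centered = X - X.mean(axis=0)                   # (2, 2) - (2,)
print(centered)
```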

### Universal Functions (ufuncs)

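Universal functions (ufuncs) such as `np.sqrt` and `np.exp` apply an operation element-wise across an entire array in compiled code. NumPy also provides aggregation functions such as `np.mean` and `np.sum`, which are not ufuncs themselves but reduce an array along one or more axes: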

```python
import numpy as np

arr = np.array([1, 4, 9, 16, 25])

# Mathematical functions
np.sqrt(arr)                       # [1., 2., 3., 4., 5.]
np.exp(arr)
np.log(arr)
np.sin(arr)

# Statistical aggregations
arr = np.random.randn(1000)
np.mean(arr)
np.std(arr)
np.min(arr), np.max(arr)
np.sum(arr)

# Axis-based operations (for multi-dimensional arrays)
matrix = np.random.rand(3, 4)
np.sum(matrix, axis=0)             # Sum each column (result shape: (4,))
np.sum(matrix, axis=1)             # Sum each row (result shape: (3,))
np.mean(matrix, axis=1, keepdims=True)  # Keep dimension (shape: (3, 1))
```
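
Every ufunc is an object with methods of its own, which is what distinguishes it from an ordinary function like `np.mean`. A brief sketch of the most commonly used ones, with a small array chosen for easy checking:

```python
import numpy as np

arr = np.array([1, 2, 3, 4])

total = np.add.reduce(arr)             # 10, same result as arr.sum()
running = np.add.accumulate(arr)       # [1, 3, 6, 10], same as arr.cumsum()
table = np.multiply.outer(arr, arr)    # 4x4 multiplication table

# out= writes into a preallocated buffer instead of allocating a new array
buf = np.empty(4, dtype=np.float64)
np.sqrt(arr, out=buf)

print(total, running, table.shape)
print(isinstance(np.sqrt, np.ufunc))   # True
print(isinstance(np.mean, np.ufunc))   # False
```

Reusing a buffer with `out=` can matter inside tight loops over large arrays, where repeated allocation becomes a measurable cost.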

### Linear Algebra

```python
import numpy as np
from numpy.linalg import inv, eig, svd

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# Matrix multiplication
C = A @ B                         # Preferred (Python 3.5+)
C = np.matmul(A, B)               # Equivalent

# Element-wise multiplication
D = A * B

# Transpose
A_T = A.T

# Inverse
A_inv = inv(A)

# Eigenvalues and Eigenvectors
eigenvalues, eigenvectors = eig(A)

# Singular Value Decomposition
U, S, Vh = svd(A)
```
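
Decompositions are easiest to trust when you verify them numerically. The sketch below checks each result with `np.allclose` and, as a design note, solves a linear system with `np.linalg.solve` instead of forming the inverse, which is generally the more accurate and cheaper route. The right-hand side `b` is an arbitrary illustrative vector.

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([5.0, 6.0])               # Illustrative right-hand side

# Solve A x = b directly rather than computing inv(A) @ b
x = np.linalg.solve(A, b)
print(x, np.allclose(A @ x, b))

# The inverse times the original should give the identity
A_inv = np.linalg.inv(A)
print(np.allclose(A @ A_inv, np.eye(2)))

# SVD: A == U @ diag(S) @ Vh
U, S, Vh = np.linalg.svd(A)
print(np.allclose(U @ np.diag(S) @ Vh, A))

# Eigen-decomposition: A v = lambda v for each column v of `eigenvectors`
eigenvalues, eigenvectors = np.linalg.eig(A)
print(np.allclose(A @ eigenvectors, eigenvectors * eigenvalues))
```

Floating-point results are rarely exact, so comparisons use `np.allclose` and never `==`.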

## 17.2 Pandas: Data Manipulation and Analysis

**Pandas** builds on NumPy to provide labeled, tabular data structures. It handles heterogeneous data types, missing values, and time series functionality that NumPy alone cannot provide.

### Core Data Structures

```python
import pandas as pd
import numpy as np
from typing import List, Dict

# Series: 1D labeled array
series = pd.Series(
    [10, 20, 30, 40],
    index=['a', 'b', 'c', 'd'],
    name='values'
)
print(series['b'])                 # 20 (label-based indexing)
print(series.iloc[1])              # 20 (position-based indexing)

# DataFrame: 2D labeled table (dict of Series)
data: Dict[str, List] = {
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['NY', 'LA', 'NY', 'Chicago'],
    'salary': [50000, 60000, 75000, np.nan]
}
df = pd.DataFrame(data)

# Attributes
print(df.shape)                    # (4, 4)
print(df.columns)                  # Index(['name', 'age', 'city', 'salary'])
print(df.dtypes)                   # Data types per column
df.info()                          # Prints a summary including memory usage (returns None, so no print())
print(df.describe())               # Statistical summary of numeric columns
```
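
Once a DataFrame exists, most work starts with selecting rows and columns. The following sketch rebuilds the same small table and shows the main selection idioms; the filters themselves are arbitrary examples.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['NY', 'LA', 'NY', 'Chicago'],
    'salary': [50000, 60000, 75000, np.nan],
})

ages = df['age']                        # Single column -> Series
subset = df[['name', 'salary']]         # List of columns -> DataFrame
first_two = df.iloc[:2]                 # Rows by position

# .loc takes a row selector (here a boolean mask) and a column selector
ny = df.loc[df['city'] == 'NY', ['name', 'age']]

# Combine conditions with & and |, wrapping each in parentheses
young_ny = df[(df['city'] == 'NY') & (df['age'] < 30)]

print(ny)
print(young_ny)
print(df['salary'].dtype)               # float64: the NaN forces a float column
```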

### Data Loading and Persistence

```python
import pandas as pd

# CSV
df = pd.read_csv('data.csv', sep=',', header=0, index_col=0)
df.to_csv('output.csv', index=False)

# Excel (requires openpyxl for .xlsx; xlrd reads only legacy .xls files)
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
df.to_excel('output.xlsx', sheet_name='Results')

# JSON
df = pd.read_json('data.json', orient='records')
df.to_json('output.json', indent=2)

# SQL (requires SQLAlchemy)
from sqlalchemy import create_engine
engine = create_engine('postgresql://user:password@localhost/db')
df = pd.read_sql('SELECT * FROM users', engine)
df.to_sql('users_copy', engine, if_exists='replace', index=False)

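# Large files: Chunking
def process_chunk(chunk: pd.DataFrame) -> None:
    """Hypothetical placeholder: replace with your own per-chunk logic."""
    print(len(chunk))

chunk_iter = pd.read_csv('huge_file.csv', chunksize=10000)
for chunk in chunk_iter:
    process_chunk(chunk)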
```
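
Chunked reading can be tried without a large file on disk. This sketch feeds a small in-memory CSV (made-up data) through `read_csv` with `chunksize` and accumulates a running total, which is the typical shape of an out-of-core aggregation.

```python
import io
import pandas as pd

# Illustrative data standing in for a file too large to load at once
csv_text = "id,amount\n1,10.0\n2,20.5\n3,30.0\n4,4.5\n5,15.0\n"

total = 0.0
rows = 0
# Each chunk is an ordinary DataFrame of at most `chunksize` rows
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total += chunk['amount'].sum()
    rows += len(chunk)

print(rows, total)   # 5 80.0
```

Only one chunk is held in memory at a time, so peak memory depends on `chunksize` and not on the size of the file.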

### Data Cleaning

Real-world data is messy. Pandas provides robust tools for handling issues:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [5, np.nan, np.nan, 8, 10],
    'C': ['apple', 'banana', 'apple', np.nan, 'banana'],
    'D': [10, 20, 30, 40, 50]
})

# Handling Missing Data
# Check for missing values
print(df.isnull().sum())

# Drop rows with any missing values
df_clean = df.dropna()

# Drop columns with missing values
df_clean_cols = df.dropna(axis=1)

# Fill missing values
df_filled = df.fillna(0)           # Fill all with 0
df_ffill = df.ffill()              # Forward fill (fillna(method='ffill') is deprecated)

# Column-specific filling
values = {'A': df['A'].mean(), 'B': 0, 'C': 'unknown'}
df_filled_custom = df.fillna(values)

# Duplicates
df_deduped = df.drop_duplicates(subset=['C'], keep='first')

# Type Conversion
df['D'] = df['D'].astype(float)
df['D'] = pd.to_numeric(df['D'], errors='coerce')  # Invalid -> NaN

# String Operations (accessed via .str)
df['C_upper'] = df['C'].str.upper()
df['C_len'] = df['C'].str.len()
df['is_apple'] = df['C'].str.contains('apple', na=False)

# Applying Functions
# Element-wise
df['D_squared'] = df['D'].apply(lambda x: x ** 2)

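# Row-wise (axis=1 passes each row to the function as a Series)
def categorize(row):
    if row['age'] < 30:
        return 'Young'
    elif row['age'] < 50:
        return 'Adult'
    return 'Senior'

# `df` above has no 'age' column, so apply the function to a small illustrative frame
people = pd.DataFrame({'age': [25, 42, 67]})
people['age_group'] = people.apply(categorize, axis=1)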

# Vectorized string operations (faster than .apply)
df['C_first'] = df['C'].str[0]     # First character
```
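
Row-wise `apply` runs a Python function once per row, which becomes slow on large tables. Bucketing by thresholds (under 30, under 50, otherwise) can usually be expressed with `pd.cut` instead. A sketch on made-up ages:

```python
import pandas as pd

people = pd.DataFrame({'age': [22, 29, 30, 49, 50, 71]})

# right=False makes the intervals [0, 30), [30, 50), [50, inf)
people['age_group'] = pd.cut(
    people['age'],
    bins=[0, 30, 50, float('inf')],
    labels=['Young', 'Adult', 'Senior'],
    right=False,
)

print(people)
```

The result is a categorical column, which also stores the labels more compactly than plain strings.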

### Data Transformation: Split-Apply-Combine

The `groupby` operation is Pandas' most powerful analytical tool, implementing the split-apply-combine pattern:

```python
import pandas as pd

# Sample data
df = pd.DataFrame({
    'department': ['Sales', 'Sales', 'IT', 'IT', 'HR', 'HR'],
    'employee': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank'],
    'salary': [50000, 60000, 70000, 80000, 45000, 55000],
    'years_experience': [2, 5, 8, 10, 1, 3]
})

# Grouping
grouped = df.groupby('department')

# Aggregations
print(grouped['salary'].mean())    # Average salary per department
print(grouped['salary'].agg(['mean', 'min', 'max']))

# Multiple columns
print(grouped.agg({
    'salary': 'mean',
    'years_experience': 'max'
}))

# Custom aggregation
def salary_range(series):
    return series.max() - series.min()

print(grouped['salary'].agg(salary_range))

# Named aggregations (cleaner syntax)
result = df.groupby('department').agg(
    avg_salary=('salary', 'mean'),
    total_experience=('years_experience', 'sum'),
    employee_count=('employee', 'count')
)

# Transformation (return same shape as input)
# Standardize within group
df['salary_normalized'] = df.groupby('department')['salary'].transform(
    lambda x: (x - x.mean()) / x.std()
)

# Filtering groups
# Keep only departments with avg salary > 60000
high_paid = df.groupby('department').filter(
    lambda g: g['salary'].mean() > 60000
)
```
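
To see what `groupby` does on your behalf, it can help to carry out the three steps by hand once. The sketch below splits, applies, and combines manually on a reduced copy of the sample data, then compares the result with the built-in aggregation.

```python
import pandas as pd

df = pd.DataFrame({
    'department': ['Sales', 'Sales', 'IT', 'IT', 'HR', 'HR'],
    'salary': [50000, 60000, 70000, 80000, 45000, 55000],
})

# Split: one sub-DataFrame per department
pieces = {dept: group for dept, group in df.groupby('department')}

# Apply: compute a statistic on each piece independently
means = {dept: group['salary'].mean() for dept, group in pieces.items()}

# Combine: assemble the per-group results into one labeled structure
manual = pd.Series(means, name='salary').sort_index()

builtin = df.groupby('department')['salary'].mean()
print(manual)
print(manual.to_dict() == builtin.to_dict())   # True
```

In practice you would use the one-line `groupby` form; the manual version is only a mental model.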

### Merging and Joining

```python
import pandas as pd

employees = pd.DataFrame({
    'emp_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'dept_id': [10, 20, 10, 30]
})

departments = pd.DataFrame({
    'dept_id': [10, 20, 40],
    'dept_name': ['Sales', 'IT', 'Marketing']
})

# Inner Join (intersection)
inner = pd.merge(employees, departments, on='dept_id', how='inner')
# Only emp_id 1, 2, 3 remain (dept 30 not in departments, dept 40 has no employees)

# Left Join (keep all employees)
left = pd.merge(employees, departments, on='dept_id', how='left')
# Diana (dept 30) has NaN for dept_name

# Outer Join (union)
outer = pd.merge(employees, departments, on='dept_id', how='outer')

# Concatenation
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

vertical = pd.concat([df1, df2], ignore_index=True)
horizontal = pd.concat([df1, df2], axis=1)
```
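
Joins fail silently when keys do not line up, so it is worth auditing them. One option, sketched here with the same two tables, is the `indicator` and `validate` parameters of `pd.merge`: the first records where each row came from, and the second raises an error if the key relationship is not what you expect.

```python
import pandas as pd

employees = pd.DataFrame({
    'emp_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'dept_id': [10, 20, 10, 30],
})
departments = pd.DataFrame({
    'dept_id': [10, 20, 40],
    'dept_name': ['Sales', 'IT', 'Marketing'],
})

audit = pd.merge(
    employees, departments,
    on='dept_id', how='outer',
    indicator=True,            # Adds a '_merge' column: left_only / right_only / both
    validate='many_to_one',    # Raises MergeError if dept_id repeats in `departments`
)

left_only = audit[audit['_merge'] == 'left_only']     # Employees without a department row
right_only = audit[audit['_merge'] == 'right_only']   # Departments without employees

print(audit[['name', 'dept_id', 'dept_name', '_merge']])
print(left_only['name'].tolist())         # ['Diana']
print(right_only['dept_name'].tolist())   # ['Marketing']
```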

## 17.3 Data Visualization: Matplotlib and Seaborn

Visualization transforms numbers into narratives. **Matplotlib** provides low-level control, while **Seaborn** offers high-level abstractions for statistical graphics.

### Matplotlib Fundamentals

```python
import matplotlib.pyplot as plt
import numpy as np

# Data
x = np.linspace(0, 10, 100)
y1 = np.sin(x)
y2 = np.cos(x)

# Figure and Axes (Object-Oriented Approach - Recommended)
fig, ax = plt.subplots(figsize=(10, 6))

# Plotting
ax.plot(x, y1, label='sin(x)', color='blue', linestyle='-', linewidth=2)
ax.plot(x, y2, label='cos(x)', color='red', linestyle='--', linewidth=2)

# Customization
ax.set_xlabel('X Axis')
ax.set_ylabel('Y Axis')
ax.set_title('Trigonometric Functions')
ax.legend(loc='upper right')
ax.grid(True, alpha=0.3)

# Annotations
ax.set_ylim(-1.5, 2.0)  # Headroom so the annotation text at y=1.5 stays inside the axes
ax.annotate(
    'Local Max', 
    xy=(np.pi/2, 1), 
    xytext=(np.pi/2, 1.5),
    arrowprops=dict(facecolor='black', shrink=0.05)
)

# Saving the figure
fig.savefig('trig_functions.png', dpi=300, bbox_inches='tight')

# Display (in interactive environments)
plt.show()
```

### Multiple Subplots

```python
import matplotlib.pyplot as plt
import numpy as np

# Create 2x2 grid of subplots
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
fig.suptitle('Multiple Plot Types', fontsize=16)

# Flatten axes array for easy iteration
axes = axes.ravel()

# 1. Line plot
x = np.linspace(0, 10, 50)
axes[0].plot(x, np.sin(x), 'b-')
axes[0].set_title('Line Plot')

# 2. Scatter plot
axes[1].scatter(np.random.rand(50), np.random.rand(50), c='green', alpha=0.6)
axes[1].set_title('Scatter Plot')

# 3. Bar chart
categories = ['A', 'B', 'C', 'D']
values = [23, 45, 12, 67]
axes[2].bar(categories, values, color='purple')
axes[2].set_title('Bar Chart')

# 4. Histogram
axes[3].hist(np.random.randn(1000), bins=30, color='orange', edgecolor='black')
axes[3].set_title('Histogram')

plt.tight_layout()  # Adjust spacing
plt.show()
```

### Seaborn: Statistical Visualization

Seaborn simplifies creating attractive statistical graphics and integrates seamlessly with Pandas DataFrames.

```python
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Set aesthetic style
sns.set_theme(style="whitegrid")

# Sample dataset
tips = sns.load_dataset("tips")
print(tips.head())

# 1. Distribution Plots
# Histogram with Kernel Density Estimate (KDE)
sns.histplot(data=tips, x="total_bill", kde=True, bins=20)
plt.title("Distribution of Total Bill")
plt.show()

# 2. Categorical Plots
# Box plot (shows quartiles and outliers)
plt.figure(figsize=(10, 6))
sns.boxplot(data=tips, x="day", y="total_bill", hue="smoker")
plt.title("Total Bill by Day and Smoking Status")
plt.show()

# Violin plot (combines box plot with KDE)
sns.violinplot(data=tips, x="day", y="total_bill", hue="time", split=True)
plt.show()

# 3. Relational Plots
# Scatter plot with regression line
sns.lmplot(data=tips, x="total_bill", y="tip", hue="smoker", height=6, aspect=1.5)
plt.title("Tip vs Total Bill with Regression")
plt.show()

# Scatter plot with semantic mapping
plt.figure(figsize=(10, 6))
sns.scatterplot(
    data=tips, 
    x="total_bill", 
    y="tip", 
    hue="time",    # Color by time
    style="sex",   # Marker style by sex
    size="size",   # Marker size by party size
    sizes=(50, 200)
)
plt.title("Multi-dimensional Scatter Plot")
plt.show()

# 4. Pairwise Relationships (Exploratory Data Analysis)
# Plots pairwise relationships in a dataset
iris = sns.load_dataset("iris")
sns.pairplot(iris, hue="species", corner=True)  # corner=True removes upper triangle
plt.show()

# 5. Heatmaps (Correlation Matrices)
# Select only numeric columns for correlation
numeric_tips = tips.select_dtypes(include=[np.number])
correlation_matrix = numeric_tips.corr()

plt.figure(figsize=(8, 6))
sns.heatmap(
    correlation_matrix, 
    annot=True,       # Show values
    cmap='coolwarm',  # Color map
    center=0,         # Center colormap at 0
    square=True
)
plt.title("Correlation Matrix")
plt.show()

# 6. Faceting (Plotting on multiple axes based on data subsets)
g = sns.FacetGrid(tips, col="time", row="smoker", hue="sex", height=3, aspect=1.5)  # hue gives add_legend() entries to show
g.map_dataframe(sns.scatterplot, x="total_bill", y="tip")
g.add_legend()
plt.show()
```

### Best Practices for Visualization

1.  **Choose the Right Plot Type:**
    *   **Distribution**: Histogram, KDE, Box Plot
    *   **Relationship**: Scatter Plot, Line Plot
    *   **Comparison**: Bar Chart, Violin Plot
    *   **Composition**: Stacked Bar, Pie Chart (use sparingly)
    
2.  **Declutter (Data-Ink Ratio):**
    Remove chart junk (excessive grid lines, 3D effects, distracting backgrounds) that doesn't add information.

3.  **Label Clearly:**
    Titles, axis labels, and legends must be unambiguous. Units should be included in labels.

4.  **Color Blindness Accessibility:**
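    Use palettes that remain distinguishable for color-blind users. Seaborn ships a dedicated `colorblind` palette (`sns.set_palette("colorblind")`), and perceptually uniform colormaps such as `viridis` or `cividis` suit continuous data. Avoid relying on red-green contrast alone.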
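
The practices above might be applied as in the following sketch. It uses only Matplotlib, with synthetic data, hypothetical labels and units, the built-in `tableau-colorblind10` style, and the non-interactive `Agg` backend so that it runs without a display (drop that line in a notebook).

```python
import io
import matplotlib
matplotlib.use("Agg")  # Assumption: headless environment
import matplotlib.pyplot as plt
import numpy as np

# Synthetic monthly figures for two made-up regions
rng = np.random.default_rng(0)
months = np.arange(1, 13)
series = {
    'Region A': 20 + 1.5 * months + rng.normal(0, 1, 12),
    'Region B': 35 - 0.5 * months + rng.normal(0, 1, 12),
}
linestyles = ['-', '--']  # Redundant encoding: do not rely on color alone

# The color cycle is fixed when the axes are created, so build them inside the context
with plt.style.context('tableau-colorblind10'):
    fig, ax = plt.subplots(figsize=(8, 4.5))
    for (label, values), ls in zip(series.items(), linestyles):
        ax.plot(months, values, linestyle=ls, marker='o', label=label)

# Label clearly, including units
ax.set_xlabel('Month')
ax.set_ylabel('Revenue (thousand USD)')
ax.set_title('Monthly Revenue by Region')

# Declutter: drop the top and right spines and the legend frame
for side in ('top', 'right'):
    ax.spines[side].set_visible(False)
ax.legend(frameon=False)

# Saved to memory here; pass a filename instead to write to disk
buffer = io.BytesIO()
fig.savefig(buffer, format='png', dpi=150, bbox_inches='tight')
```

A line plot is used because the data describe a relationship over time; the same decluttering and labeling steps carry over to other plot types.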

## Summary

Data science in Python is built on a powerful stack of libraries that transform raw data into actionable insights. **NumPy** provides the computational engine with its N-dimensional arrays, enabling vectorized operations that are often orders of magnitude faster than equivalent Python loops. You have seen how broadcasting combines arrays of different shapes and why contiguous storage makes arrays memory-efficient.

**Pandas** brings data to life through DataFrames and Series, offering intuitive tools for cleaning, transforming, and analyzing tabular data. You have mastered the split-apply-combine pattern through `groupby`, handled missing data gracefully, and joined datasets with SQL-like flexibility.

**Visualization** with Matplotlib and Seaborn allows you to communicate findings effectively. Matplotlib provides low-level control over every element of a plot, while Seaborn offers high-level abstractions for statistical visualization that integrate seamlessly with Pandas data structures.

These tools form the basis for analysis, but they are also the stepping stones to automation. In the next chapter, we apply Python to system administration and task automation—writing scripts to interact with the operating system, scrape the web, and schedule routine work.

**Next Chapter**: Chapter 18: Automation and Scripting.