# Session 4 — Data Types & File Formats

Deep dive into **structured**, **semi-structured**, **unstructured**, plus **time‑series**, **graph**, and **geospatial** data. Learn common file formats, when to use them, and how they flow through pipelines.

## 🧭 1️⃣ Why Data Variety Matters

Modern pipelines rarely handle tables alone. You’ll ingest logs (JSON), documents (PDF), media (images/video), metrics (time‑series), and even relationships (graph).

**Goal of this session:** understand each data type, typical storage/format choices, and how that impacts ETL, querying, and cost.

## 📊 2️⃣ Structured Data

**Definition:** Tabular data with a **fixed schema** (rows/columns).

**Pros:** easy SQL querying, strong integrity (keys/constraints)

**Cons:** schema changes require migration; less flexible for nested data

**Examples:** CSV, TSV, Excel; relational DB tables (PostgreSQL, MySQL, SQL Server)

**Mini sample (CSV):**
```csv
customer_id,name,region
101,Aria,North
102,Dev,South
```
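
To make "fixed schema" and "strong integrity" concrete, here is a minimal sketch that loads the sample rows into an in-memory SQLite table. The table definition, and in particular the list of allowed regions in the `CHECK` constraint, is an assumption made for illustration.

```python
import sqlite3

# In-memory database; column names mirror the CSV sample.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        -- Allowed regions are assumed for this example.
        region      TEXT NOT NULL CHECK (region IN ('North', 'South', 'East', 'West'))
    )
    """
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(101, "Aria", "North"), (102, "Dev", "South")],
)

# A duplicate key violates the PRIMARY KEY constraint and is rejected.
try:
    conn.execute("INSERT INTO customers VALUES (?, ?, ?)", (101, "Zoe", "East"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
print("duplicate rejected:", duplicate_rejected, "| rows:", row_count)
```

The database, not the application code, refuses the bad row, which is the integrity guarantee a plain CSV file cannot give.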


## 🧩 3️⃣ Semi‑Structured Data

**Definition:** Self‑describing, **flexible schema** (often hierarchical/nested).

**Pros:** schema evolution, good for APIs/logs; columnar formats compress well

**Cons:** complex joins; requires tools that understand nested data

**Examples:** JSON, XML, Parquet, Avro, YAML, logs

**Mini sample (JSON):**
```json
{
  "order_id": 9001,
  "customer": { "id": "C-123", "name": "Alice" },
  "items": [ { "sku": "LAPTOP-15", "qty": 2 } ],
  "total": 2400.00
}
```
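
Nested records usually have to be flattened before they can be joined or loaded into a table. The sketch below does this for the sample order using only the standard library; `flatten_order` and the output column names are hypothetical, chosen for this example.

```python
import json

raw = """
{
  "order_id": 9001,
  "customer": { "id": "C-123", "name": "Alice" },
  "items": [ { "sku": "LAPTOP-15", "qty": 2 } ],
  "total": 2400.00
}
"""
order = json.loads(raw)

def flatten_order(order):
    """Return one flat dict per line item, repeating the order-level fields."""
    rows = []
    for item in order["items"]:
        rows.append({
            "order_id": order["order_id"],
            "customer_id": order["customer"]["id"],
            "customer_name": order["customer"]["name"],
            "sku": item["sku"],
            "qty": item["qty"],
        })
    return rows

rows = flatten_order(order)
print(rows)
```

An order with several items would produce several rows, which is why joins over nested data tend to multiply row counts.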


## 🗃️ 4️⃣ Unstructured Data

**Definition:** No predefined schema; binary or free text.

**Pros:** richest signals (text sentiment, images, audio)

**Cons:** needs metadata/indexing; heavy processing

**Examples:** PDFs, images, audio, video, emails, raw text files
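
Because unstructured files carry no schema, pipelines typically attach a metadata record to each one so it can be catalogued and found later. A minimal sketch, assuming a hypothetical `describe_file` helper and field names invented for this example:

```python
import hashlib
import mimetypes
import tempfile
from pathlib import Path

def describe_file(path):
    """Build a small metadata record for one file (field names are illustrative)."""
    path = Path(path)
    data = path.read_bytes()
    mime_type, _ = mimetypes.guess_type(path.name)  # guessed from the extension only
    return {
        "name": path.name,
        "size_bytes": len(data),
        "mime_type": mime_type or "application/octet-stream",
        "sha256": hashlib.sha256(data).hexdigest(),  # content hash, useful for deduplication
    }

# Create a throwaway text file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "note.txt"
    sample.write_bytes(b"Customer called about a late delivery.")
    record = describe_file(sample)

print(record)
```

Records like this would normally be written to a catalog or index, while the file itself stays in object storage.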


## ⏱️ 5️⃣ Time‑Series Data

**Definition:** Observations indexed by **time** (regular or event‑driven). Ideal for metrics and IoT.

**Stores:** TimescaleDB, InfluxDB, Amazon Timestream, Azure Data Explorer

**Mini sample:**
```csv
timestamp,temperature
2025-10-28T10:00:00Z,22.3
2025-10-28T10:05:00Z,22.5
```
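
The "regular or event-driven" distinction can be checked directly from the timestamps. The sketch below parses the sample with the standard library and collects the gaps between consecutive observations; `parse_utc` is a hypothetical helper written for this example.

```python
import csv
import io
from datetime import datetime, timedelta

raw = """timestamp,temperature
2025-10-28T10:00:00Z,22.3
2025-10-28T10:05:00Z,22.5
"""

def parse_utc(text):
    # Replace the trailing 'Z' so fromisoformat also works on Python < 3.11.
    return datetime.fromisoformat(text.replace("Z", "+00:00"))

readings = [
    (parse_utc(row["timestamp"]), float(row["temperature"]))
    for row in csv.DictReader(io.StringIO(raw))
]

# Gaps between consecutive observations; one distinct gap means a regular series.
gaps = {later[0] - earlier[0] for earlier, later in zip(readings, readings[1:])}
is_regular = len(gaps) == 1
print("interval(s):", gaps, "| regular:", is_regular)
```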


## 🕸️ 6️⃣ Graph Data

**Definition:** Entities (**nodes**) and their **relationships** (**edges**) with properties.

**Stores:** Neo4j, Amazon Neptune, Cosmos DB (Gremlin)

**Mini sample (Cypher):**
```cypher
CREATE (a:Person {name:'Alice'})-[:FRIENDS_WITH]->(b:Person {name:'Bob'})
```
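
Outside a graph database, the same node/edge model can be sketched with plain dictionaries and a breadth-first traversal. The node ids and the extra `Carol` node are assumptions added for illustration; a graph store would index and persist this for you.

```python
from collections import deque

# Nodes keyed by a made-up id, each with a label and properties.
nodes = {
    "a": {"label": "Person", "name": "Alice"},
    "b": {"label": "Person", "name": "Bob"},
    "c": {"label": "Person", "name": "Carol"},  # extra node, added for illustration
}
# Directed edges as (source, relationship type, target).
edges = [
    ("a", "FRIENDS_WITH", "b"),
    ("b", "FRIENDS_WITH", "c"),
]

def reachable(start, rel_type):
    """Breadth-first traversal following edges of one relationship type."""
    adjacency = {}
    for src, rel, dst in edges:
        if rel == rel_type:
            adjacency.setdefault(src, []).append(dst)
    seen, queue, order = {start}, deque([start]), []
    while queue:
        current = queue.popleft()
        for neighbor in adjacency.get(current, []):
            if neighbor not in seen:
                seen.add(neighbor)
                order.append(nodes[neighbor]["name"])
                queue.append(neighbor)
    return order

friends_of_friends = reachable("a", "FRIENDS_WITH")
print(friends_of_friends)
```

Multi-hop questions like this one are where graph stores tend to be more convenient than repeated self-joins in SQL.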


## 🌍 7️⃣ Geospatial Data

**Definition:** Coordinates, shapes, and spatial relationships.

**Formats:** GeoJSON, Shapefile, KML

**Stores/Engines:** PostGIS, BigQuery GIS, Azure Maps, Amazon Location Service

**Mini sample (GeoJSON):**
```json
{ "type": "Point", "coordinates": [77.5946, 12.9716] }
```
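
A common pitfall is coordinate order: GeoJSON positions are `[longitude, latitude]`. The sketch below computes a great-circle (haversine) distance between the sample point and a second point chosen for illustration. The second point and the mean Earth radius of 6371 km are assumptions, and spatial engines such as PostGIS provide more accurate geodesic functions.

```python
import math

def haversine_km(point_a, point_b, radius_km=6371.0):
    """Great-circle distance between two GeoJSON Point geometries, in kilometres."""
    lon1, lat1 = point_a["coordinates"]  # GeoJSON order: longitude first
    lon2, lat2 = point_b["coordinates"]
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(h))

bengaluru = {"type": "Point", "coordinates": [77.5946, 12.9716]}
chennai = {"type": "Point", "coordinates": [80.2707, 13.0827]}  # illustrative second point

distance_km = haversine_km(bengaluru, chennai)
print(f"{distance_km:.0f} km")
```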


## 📦 8️⃣ Common File Formats & Compression

| Format | Type | Readability | Compression | Schema | Typical Use |
|--------|------|-------------|------------|--------|-------------|
| CSV | Structured | Human‑readable | Poor | None (header row only) | Simple exports, interoperability |
| JSON | Semi‑Structured | Human‑readable | Moderate | Flexible | APIs, logs, configs |
| Parquet | Semi‑Structured (columnar) | Binary | Excellent | Embedded | Analytics over large data |
| Avro | Semi‑Structured (row) | Binary | Good | Embedded in file; registry for streams | Streaming ETL, schema evolution |
| ORC | Semi‑Structured (columnar) | Binary | Excellent | Embedded | Hadoop/Spark ecosystems |
| XML | Semi‑Structured | Human‑readable (verbose) | Poor | Optional (XSD/DTD) | Legacy systems, configs |
| Image/Video | Unstructured | Binary | — | — | Media storage, ML inputs |
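
To get a feel for the size differences between the text formats, the sketch below serializes the same synthetic rows as CSV and as JSON Lines, then gzips both. The data is generated purely for the measurement, and the exact byte counts will vary with real data.

```python
import csv
import gzip
import io
import json

# Synthetic rows, generated only to have something to measure.
rows = [
    {"customer_id": 100 + i, "name": f"customer_{i}", "region": ["North", "South"][i % 2]}
    for i in range(1000)
]

csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=["customer_id", "name", "region"])
writer.writeheader()
writer.writerows(rows)
csv_bytes = csv_buffer.getvalue().encode("utf-8")

# JSON Lines repeats every key on every record, so it is larger than CSV.
jsonl_bytes = "\n".join(json.dumps(r) for r in rows).encode("utf-8")

sizes = {
    "csv": len(csv_bytes),
    "jsonl": len(jsonl_bytes),
    "csv.gz": len(gzip.compress(csv_bytes)),
    "jsonl.gz": len(gzip.compress(jsonl_bytes)),
}
for name, size in sizes.items():
    print(f"{name:>9}: {size:>7} bytes")
```

Text formats have no compression of their own, so any savings come from an external codec such as gzip; binary formats like Parquet apply encoding and compression internally.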


## 🧠 9️⃣ Choosing the Right Format

- **Human inspection?** → CSV/JSON
- **Analytics at scale?** → Parquet/ORC (columnar, compress well)
- **Streaming & schema evolution?** → Avro (+ schema registry)
- **Nested data?** → JSON/Parquet
- **Large media?** → Object storage (S3/Blob) with metadata catalog
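
The checklist above can be written down as a small lookup, which is sometimes handy in onboarding scripts or design reviews. `suggest_formats` and its keyword names are hypothetical; real decisions usually weigh several of these needs at once.

```python
# Each rule pairs a need from the checklist with its suggested formats.
RULES = [
    ("human_inspection", ["CSV", "JSON"]),
    ("analytics_at_scale", ["Parquet", "ORC"]),
    ("streaming_schema_evolution", ["Avro (+ schema registry)"]),
    ("nested_data", ["JSON", "Parquet"]),
    ("large_media", ["Object storage with metadata catalog"]),
]

def suggest_formats(**needs):
    """Return the suggestions for every need flagged True, in checklist order."""
    return {need: formats for need, formats in RULES if needs.get(need)}

suggestions = suggest_formats(analytics_at_scale=True, nested_data=True)
print(suggestions)
```

Parquet appears under both flagged needs here, which matches the checklist: it is columnar and also handles nested fields.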


## 🧪 🔟 Practical: Read/Write Examples (CSV, JSON, Parquet)

```python
import pandas as pd
from pathlib import Path

# Write sample files under the current working directory
base = Path.cwd() / "session4_samples"
base.mkdir(parents=True, exist_ok=True)

df = pd.DataFrame([
    {'customer_id': 101, 'name': 'Aria', 'region': 'North', 'amount': 250.5},
    {'customer_id': 102, 'name': 'Dev', 'region': 'South', 'amount': 175.0},
])

# CSV
csv_path = base / 'customers.csv'
df.to_csv(csv_path, index=False)
print('Wrote:', csv_path)
print(pd.read_csv(csv_path).head())

# JSON Lines (one JSON object per line)
json_path = base / 'customers.json'
df.to_json(json_path, orient='records', lines=True)
print('\nWrote:', json_path)
print(pd.read_json(json_path, lines=True).head())

# Parquet (optional: needs pyarrow or fastparquet)
parquet_path = base / 'customers.parquet'
try:
    df.to_parquet(parquet_path)
    print('\nWrote:', parquet_path)
    print(pd.read_parquet(parquet_path).head())
except ImportError as e:
    # pandas raises ImportError when no Parquet engine is installed
    print('\nParquet example skipped (install pyarrow or fastparquet):', e)
```



### 📄 XML Example (write/read minimal)

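```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Reuses `df` and `base` from the previous cell.
root = ET.Element('customers')
for _, row in df.iterrows():
    c = ET.SubElement(root, 'customer')
    # XML text nodes are strings, so numeric values are converted explicitly.
    ET.SubElement(c, 'customer_id').text = str(row['customer_id'])
    ET.SubElement(c, 'name').text = row['name']
    ET.SubElement(c, 'region').text = row['region']
    ET.SubElement(c, 'amount').text = str(row['amount'])

# Write: pretty-print via minidom, then save as UTF-8
xml_str = minidom.parseString(ET.tostring(root)).toprettyxml(indent='  ')
xml_path = base / 'customers.xml'
with open(xml_path, 'w', encoding='utf-8') as f:
    f.write(xml_str)
print('Wrote:', xml_path)
print('\n'.join(xml_str.splitlines()[:10]))

# Read: parse the file back and print one line per customer
parsed = ET.parse(xml_path).getroot()
for c in parsed.findall('customer'):
    print(c.findtext('customer_id'), c.findtext('name'),
          c.findtext('region'), c.findtext('amount'))
```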


## 🖼️ 1️⃣1️⃣ Visual: Data Landscape Overview

```python
import matplotlib.pyplot as plt
from matplotlib.patches import FancyBboxPatch

# Colour palette
BG = '#f5f9ff'
FILL = '#e0ebff'
EDGE = '#2563eb'
TXT = '#111827'
TITLE = '#0f172a'
SUBTLE = '#374151'

# (data type, example formats, typical systems)
groups = [
    ('Structured', 'CSV, Tables, Excel', 'RDBMS (Postgres, MySQL)'),
    ('Semi-Structured', 'JSON, XML, Parquet, Avro', 'NoSQL / Docs (Mongo, Cosmos)'),
    ('Unstructured', 'PDF, Images, Audio, Video', 'Object Stores (S3, Blob)'),
    ('Time-Series', 'Metrics & Sensors', 'TimescaleDB, Influx, Timestream'),
    ('Graph', 'Nodes & Edges', 'Neo4j, Neptune'),
    ('Geospatial', 'GeoJSON, Shapefiles', 'PostGIS, BigQuery GIS'),
]

fig, ax = plt.subplots(figsize=(12, 5.6))
fig.patch.set_facecolor(BG)
ax.set_facecolor(BG)
ax.set_axis_off()
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)

# Box geometry in data units (the axes span 0..1 in both directions)
W, H = 0.27, 0.15
X_GAP, Y_GAP = 0.06, 0.09
PAD = 0.02
START_X, START_Y = 0.04, 0.70  # leaves room for PAD so the right-hand boxes are not clipped
COLS = 3

def box(x, y, title, examples, systems):
    """Draw one rounded box with a title, example formats, and typical systems."""
    # rounding_size is in data units, so it must stay small relative to W and H
    patch = FancyBboxPatch((x, y), W, H, boxstyle=f'round,pad={PAD},rounding_size=0.02',
                           fc=FILL, ec=EDGE, lw=1.5)
    ax.add_patch(patch)
    ax.text(x + W / 2, y + H * 0.68, title, ha='center', va='center',
            fontsize=11, fontweight='bold', color=TITLE)
    ax.text(x + W / 2, y + H * 0.43, examples, ha='center', va='center', fontsize=9.5, color=TXT)
    ax.text(x + W / 2, y + H * 0.22, systems, ha='center', va='center', fontsize=9, color=SUBTLE)

# Lay the groups out row by row, COLS boxes per row
for i, (title, examples, systems) in enumerate(groups):
    row, col = divmod(i, COLS)
    box(START_X + col * (W + X_GAP), START_Y - row * (H + Y_GAP), title, examples, systems)

plt.tight_layout()
plt.show()
```


## 💡 1️⃣2️⃣ Practice / Assignment

1) For 12 sample files (CSV, JSON, logs, PDFs, images), **classify** the data type and format.

2) Convert a JSON log file to **Parquet** and compare file size and read speed (requires pyarrow or fastparquet).

3) Build a simple **time‑series** dataframe and compute moving averages.

4) Sketch a pipeline diagram showing how each data type lands in your data lake/warehouse.