# Week 1 — Introduction to Data Handling, SQL & NoSQL

**Objectives**
- Understand data engineering (DE) workflows and the three data types (structured, semi-structured, unstructured)
- Practice data cleaning with `pandas`
- Learn SQL basics with SQLite
- Explore NoSQL via JSON-like documents

## 0) Setup

In [1]:
# If in Colab, optionally install extras
# !pip -q install pandas duckdb sqlite-utils
import json
import sqlite3

import numpy as np
import pandas as pd
from io import StringIO
print("Pandas:", pd.__version__)

Pandas: 2.3.3


## 1) Sample messy dataset

In [2]:
raw = StringIO("""Name,Age,Email,Country,Income
Alice,25,alice@email.com,US,80000
Bob,,bob@,UK,62000
,35,charlie@email.com,IN,NaN
David,45,,US,120000
Eva,29,eva@email.com,DE,70000
""")
df = pd.read_csv(raw)
df

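|   | Name | Age | Email | Country | Income |
|---|---|---|---|---|---|
| 0 | Alice | 25.0 | alice@email.com | US | 80000.0 |
| 1 | Bob | NaN | bob@ | UK | 62000.0 |
| 2 | NaN | 35.0 | charlie@email.com | IN | NaN |
| 3 | David | 45.0 | NaN | US | 120000.0 |
| 4 | Eva | 29.0 | eva@email.com | DE | 70000.0 |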


### Notes
- **Structured**: CSVs, relational tables (like `df` above)
- **Semi-Structured**: JSON, XML (flexible schema)
- **Unstructured**: Free text, images, audio/video
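
To make the three categories concrete, here is a minimal sketch using only the standard library. The sample strings (`csv_text`, `json_text`, `note`) are made up for illustration and are not part of the dataset above.

```python
import csv
import json
from io import StringIO

# Structured: fixed columns, every row has the same fields
csv_text = "Name,Country\nAlice,US\nBob,UK\n"
rows = list(csv.DictReader(StringIO(csv_text)))

# Semi-structured: nested and optional fields; records need not share a shape
json_text = '[{"name": "Alice", "skills": ["sql"]}, {"name": "Bob", "meta": {"country": "UK"}}]'
docs = json.loads(json_text)

# Unstructured: no schema at all; any structure has to be extracted by you
note = "Alice joined from the US office and mostly writes SQL."
words = note.split()

print(rows[0]["Country"])      # column access by name
print(sorted(docs[1].keys()))  # keys can differ from one document to the next
print(len(words))              # only token-level facts come for free
```

Notice that the two JSON records do not share the same keys, which is exactly what "flexible schema" means in practice.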

## 2) Clean with `pandas`

In [3]:
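clean = df.copy()
# Fill missing names with a placeholder; impute missing ages with the column mean (33.5 here)
clean['Name'] = clean['Name'].fillna('Unknown')
clean['Age'] = clean['Age'].fillna(clean['Age'].mean())
# Keep rows whose email contains '@'. This is a loose check: 'bob@' still passes.
clean = clean[clean['Email'].str.contains('@', na=False)].copy()
# Coerce Income to numeric first, then fill gaps with the median of the remaining rows
clean['Income'] = pd.to_numeric(clean['Income'], errors='coerce')
clean['Income'] = clean['Income'].fillna(clean['Income'].median())
clean = clean.reset_index(drop=True)
clean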

|   | Name | Age | Email | Country | Income |
|---|---|---|---|---|---|
| 0 | Alice | 25.0 | alice@email.com | US | 80000.0 |
| 1 | Bob | 33.5 | bob@ | UK | 62000.0 |
| 2 | Unknown | 35.0 | charlie@email.com | IN | 70000.0 |
| 3 | Eva | 29.0 | eva@email.com | DE | 70000.0 |
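
The `'@'` filter lets `bob@` through, so a stricter check may be worth trying. The sketch below is one possible approach, not a full email validator: `EMAIL_RE` and `is_plausible_email` are hypothetical names, and the pattern only requires something before the `@` and a dotted domain after it.

```python
import re

# Hypothetical pattern for illustration: local part, '@', domain with at least one dot.
# It is deliberately simple and does not implement the full email address grammar.
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

def is_plausible_email(value):
    """Return True only for strings shaped like name@domain.tld."""
    return isinstance(value, str) and EMAIL_RE.fullmatch(value) is not None

samples = ["alice@email.com", "bob@", None, "eva@email.com"]
print([is_plausible_email(s) for s in samples])  # [True, False, False, True]
```

In the notebook this could be applied as a mask with `clean['Email'].apply(is_plausible_email)`. Doing so would also drop Bob, so the SQL results in the next section would change.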


## 3) SQL with SQLite (in-memory)

In [4]:
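# In-memory database: nothing is written to disk, and it disappears when the connection closes
conn = sqlite3.connect(":memory:")
clean.to_sql("people", conn, index=False, if_exists="replace")

# Head count per country, largest groups first
q1 = pd.read_sql_query("SELECT Country, COUNT(*) AS cnt FROM people GROUP BY Country ORDER BY cnt DESC", conn)
# People earning more than 75,000, highest first
q2 = pd.read_sql_query("SELECT Name, Income FROM people WHERE Income > 75000 ORDER BY Income DESC", conn)

display(q1); display(q2)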

|   | Country | cnt |
|---|---|---|
| 0 | US | 1 |
| 1 | UK | 1 |
| 2 | IN | 1 |
| 3 | DE | 1 |


|   | Name | Income |
|---|---|---|
| 0 | Alice | 80000.0 |


**Try more SQL:**  
- Top earners per country  
- Average income by country  
- People aged between 25 and 40
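
One possible way to write these three queries is sketched below. To keep it self-contained it uses only the standard `sqlite3` module and rebuilds the `people` table from the cleaned rows shown earlier; the explicit column types in `CREATE TABLE` are an assumption.

```python
import sqlite3

# Rows copied from the cleaned table above: (Name, Age, Email, Country, Income)
rows = [
    ("Alice", 25.0, "alice@email.com", "US", 80000.0),
    ("Bob", 33.5, "bob@", "UK", 62000.0),
    ("Unknown", 35.0, "charlie@email.com", "IN", 70000.0),
    ("Eva", 29.0, "eva@email.com", "DE", 70000.0),
]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (Name TEXT, Age REAL, Email TEXT, Country TEXT, Income REAL)")
conn.executemany("INSERT INTO people VALUES (?, ?, ?, ?, ?)", rows)

# Top earner per country: keep rows whose income equals their country's maximum
top_earners = conn.execute("""
    SELECT p.Country, p.Name, p.Income
    FROM people AS p
    JOIN (SELECT Country, MAX(Income) AS max_income FROM people GROUP BY Country) AS m
      ON p.Country = m.Country AND p.Income = m.max_income
    ORDER BY p.Country
""").fetchall()

# Average income by country
avg_income = conn.execute(
    "SELECT Country, AVG(Income) AS avg_income FROM people GROUP BY Country ORDER BY Country"
).fetchall()

# People aged between 25 and 40 (BETWEEN includes both endpoints)
aged_25_40 = conn.execute(
    "SELECT Name, Age FROM people WHERE Age BETWEEN 25 AND 40 ORDER BY Age"
).fetchall()

print(top_earners)
print(avg_income)
print(aged_25_40)
conn.close()
```

With one person per country, the top earner and the average are trivially the same row. Add a few more rows per country to see the queries diverge.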

## 4) NoSQL-style JSON querying

In [5]:
people_docs = [
    {"_id": 1, "name": "Alice", "skills": ["sql","python"], "meta": {"country":"US","newsletter":True}},
    {"_id": 2, "name": "Bob", "skills": ["spark","airflow"], "meta": {"country":"UK","newsletter":False}},
    {"_id": 3, "name": "Eva", "skills": ["pandas","ml"], "meta": {"country":"DE","newsletter":True}},
]

country_us = [d for d in people_docs if d["meta"]["country"] == "US"]
has_ml = [d for d in people_docs if "ml" in d["skills"]]

print("US people:", country_us)
print("Has ML skill:", has_ml)

US people: [{'_id': 1, 'name': 'Alice', 'skills': ['sql', 'python'], 'meta': {'country': 'US', 'newsletter': True}}]
Has ML skill: [{'_id': 3, 'name': 'Eva', 'skills': ['pandas', 'ml'], 'meta': {'country': 'DE', 'newsletter': True}}]
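
The comprehensions above index `d["meta"]["country"]` directly, which raises `KeyError` as soon as one document lacks that field. Below is a minimal sketch of a more tolerant lookup. The helper `get_path` and the fourth document (`Dan`) are hypothetical additions for illustration.

```python
people_docs = [
    {"_id": 1, "name": "Alice", "skills": ["sql", "python"], "meta": {"country": "US", "newsletter": True}},
    {"_id": 2, "name": "Bob", "skills": ["spark", "airflow"], "meta": {"country": "UK", "newsletter": False}},
    {"_id": 3, "name": "Eva", "skills": ["pandas", "ml"], "meta": {"country": "DE", "newsletter": True}},
    {"_id": 4, "name": "Dan"},  # hypothetical newer document without skills or meta
]

def get_path(doc, path, default=None):
    """Follow a dotted path such as 'meta.country'; return default if any key is missing."""
    current = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current

# Filter on a nested field without assuming every document has it
subscribers = [d["name"] for d in people_docs if get_path(d, "meta.newsletter") is True]
# Filter on list membership, treating a missing list as empty
python_users = [d["name"] for d in people_docs if "python" in d.get("skills", [])]
# Projection: keep only selected fields from each document
projected = [{"name": d["name"], "country": get_path(d, "meta.country")} for d in people_docs]

print(subscribers)
print(python_users)
print(projected)
```

Document stores typically offer dotted-path queries and projections natively. The point here is only that each document may have a different shape, so queries have to decide what a missing field means.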


### Key Takeaways
- Use **SQL** for structured analytics & joins
- Use **JSON/NoSQL** when schema evolves or for nested data
- In production, SQL workloads often run on PostgreSQL or BigQuery, while MongoDB or Elasticsearch handle semi-structured documents and search