# DATA 304 – Module 3, Session 1: Hands‑On
Flat files, paths, CSV/Excel, compression, and large-file strategies.

**Instructions**
- Use only the Python standard library (which includes `pathlib`) and pandas.
- Data files are already provided in the working directory under `data/`.


In [None]:
# Setup
from pathlib import Path
import pandas as pd

DATA_DIR = Path("data")

list(DATA_DIR.iterdir())

## Task 1 — Paths and portability
**Goal:** Build a robust relative path to `clean_sales.csv` and resolve its absolute path.  
**Deliverable:** A variable `CLEAN_PATH` of type `Path` and the resolved path printed.
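
As an illustration of the idea, here is a minimal sketch of joining path segments with the `/` operator and resolving the result. The directory and file names (`example_dir`, `subfolder`, `example.csv`) are hypothetical and are not part of the course data.

```python
from pathlib import Path

# Hypothetical names, used only for illustration.
base = Path("example_dir")
target = base / "subfolder" / "example.csv"  # "/" joins segments portably

print(target)                # relative path, using the current OS separator
print(target.is_absolute())  # a path built from relative parts stays relative
print(target.resolve())      # absolute path, based on the working directory
```

Building the path from segments avoids hard-coding `/` or `\` separators, so the same notebook should run unchanged on different operating systems.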

In [None]:
CLEAN_PATH = ...
print(CLEAN_PATH)
print(CLEAN_PATH.resolve())

## Task 2 — Import and inspect a clean CSV
**Goal:** Import `clean_sales.csv` into a DataFrame and inspect it.  
**Checklist:**
- Preview with `.head()`
- Check `.info()` and `.dtypes`
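
The checklist above can be sketched on a small in-memory CSV. The column names and values below are invented for illustration and do not describe `clean_sales.csv`.

```python
import io

import pandas as pd

# Invented sample data; the real file's columns may differ.
csv_text = "order_id,region,amount\n1,North,10.5\n2,South,20.0\n3,East,7.25\n"
df = pd.read_csv(io.StringIO(csv_text))

print(df.head(2))  # preview the first rows
df.info()          # column names, non-null counts, and memory usage
print(df.dtypes)   # inferred dtype of each column
```

`.info()` prints its report directly, while `.dtypes` returns a Series that can be inspected or compared in code.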

In [None]:
df_clean = ...

## Task 3 — Semicolon CSV with European decimals and NA tokens
**File:** `messy_semicolon.csv`  
**Goal:**
- Read the file with the correct delimiter (`;`).
- Treat `NA` as missing.
- Parse `amount`, which uses a comma as the decimal separator, as `float`.  

**Deliverable:** `df_messy` with `amount` as float and NA handled.
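
One possible approach, sketched on invented in-memory data, is to let `pd.read_csv` handle all three requirements through its `sep`, `decimal`, and `na_values` parameters. The columns and values below are assumptions for illustration, not the contents of `messy_semicolon.csv`.

```python
import io

import pandas as pd

# Invented sample: semicolon-delimited, comma decimals, "NA" for missing.
csv_text = "id;amount;city\n1;12,50;Berlin\n2;NA;Paris\n3;1000,25;NA\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    sep=";",            # field delimiter
    decimal=",",        # treat "12,50" as 12.5
    na_values=["NA"],   # explicit here; "NA" is also in pandas' default NA list
)

print(df.dtypes)
print(df)
```

Parsing the decimals at read time avoids a separate string-replace and `astype(float)` step afterwards, although that two-step route is also a valid solution.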

In [None]:
MESSY_PATH = DATA_DIR / "messy_semicolon.csv"
df_messy = ...
df_messy.dtypes, df_messy

## Task 4 — Quoted fields and multiline text
**File:** `multiline_quotes.csv`  
**Goal:** Read the file and confirm that the `notes` value in the second row contains a newline and a comma, and that neither splits the row or its columns.  
**Hint:** Default `quotechar` is `"`.
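
To see why quoting matters, here is a small sketch with invented data in which a quoted field contains both a comma and a line break. It assumes the default `quotechar` mentioned in the hint; the actual layout of `multiline_quotes.csv` may differ.

```python
import io

import pandas as pd

# Invented sample: the second record's notes field spans two physical lines.
csv_text = 'id,notes\n1,"plain note"\n2,"first line, with comma\nsecond line"\n'

df = pd.read_csv(io.StringIO(csv_text))
note = df.loc[1, "notes"]

print(df.shape)        # two records, two columns
print(repr(note))      # repr() makes the embedded newline visible
print("\n" in note, "," in note)
```

Printing with `repr()` is a convenient way to verify the embedded `\n`, since a plain `print` would simply wrap the text onto a new line.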

In [None]:
QUOTES_PATH = DATA_DIR / "multiline_quotes.csv"
df_quotes = ...

## Task 5 — Excel with junk rows and header repair
**File:** `report.xlsx`  
**Subtasks:**
1. List sheet names.
2. Read `Summary` skipping the top 3 rows.
3. Read `RawData`.

**Deliverables:** `df_summary`, `df_raw`.
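
The three subtasks can be sketched against a workbook built in memory. The sheet names match the task, but the cell contents are invented, and the example assumes the `openpyxl` engine is installed for reading and writing `.xlsx` files.

```python
import io

import pandas as pd

# Build a throwaway workbook in memory (invented contents).
buffer = io.BytesIO()
with pd.ExcelWriter(buffer, engine="openpyxl") as writer:
    summary_rows = pd.DataFrame([
        ["Quarterly report", None],           # junk row 1
        ["Generated for illustration", None], # junk row 2
        ["Internal use only", None],          # junk row 3
        ["region", "total"],                  # real header row
        ["North", 10],
        ["South", 20],
    ])
    summary_rows.to_excel(writer, sheet_name="Summary", header=False, index=False)
    pd.DataFrame({"id": [1, 2], "value": [3.5, 4.5]}).to_excel(
        writer, sheet_name="RawData", index=False
    )
buffer.seek(0)

xls = pd.ExcelFile(buffer)
print(xls.sheet_names)                                            # 1. list sheets
df_summary = pd.read_excel(xls, sheet_name="Summary", skiprows=3) # 2. skip junk rows
df_raw = pd.read_excel(xls, sheet_name="RawData")                 # 3. read as-is
print(df_summary)
```

Opening the workbook once with `pd.ExcelFile` and passing that object to `pd.read_excel` avoids re-parsing the file for each sheet.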

In [None]:
XLS_PATH = DATA_DIR / "report.xlsx"
xls = ...

## Task 6 — Read compressed CSV without extracting
**File:** `events.csv.gz`  
**Goal:** Read directly with pandas and preview 3 rows.
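
As a sketch of the mechanism, the example below writes a tiny gzip-compressed CSV to a temporary directory and reads it back. The file name and contents are invented; it relies on `pd.read_csv` inferring the compression from the `.gz` extension, which is its default behaviour.

```python
import gzip
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    gz_path = Path(tmp) / "example.csv.gz"  # hypothetical file name
    with gzip.open(gz_path, "wt", newline="") as fh:
        fh.write("event,count\nclick,3\nview,5\nscroll,2\nclick,1\n")

    # compression="infer" is the default, so the .gz suffix is enough.
    df = pd.read_csv(gz_path, nrows=3)

print(df)
```

Passing `nrows=3` stops reading after three records; reading the whole file and calling `.head(3)` gives the same preview at the cost of decompressing everything.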

In [None]:
GZ_PATH = DATA_DIR / "events.csv.gz"
df_gz = ...

## Task 7 — Large-file streaming and aggregation
**File:** `large_synthetic.csv`  
**Goal:**
- Stream in chunks of 50,000 rows.
- Filter rows where `flag == 'B'` and compute count and sum of `value`.
- Report the results.  
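
The streaming pattern can be sketched on a few invented rows, with a chunk size of 2 instead of 50,000 so that several chunks are produced. The column names follow the task; the values are made up.

```python
import io

import pandas as pd

# Invented sample data standing in for the large file.
csv_text = "user_id,flag,value\n1,A,1.5\n2,B,2.0\n3,B,3.5\n4,A,4.0\n5,B,0.5\n"

count_b = 0
sum_b = 0.0

# chunksize makes read_csv return an iterator of DataFrames.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    matches = chunk[chunk["flag"] == "B"]  # filter within the chunk
    count_b += len(matches)                # accumulate the row count
    sum_b += matches["value"].sum()        # accumulate the sum

print("Rows with flag B:", count_b)
print("Sum of value for flag B:", round(sum_b, 4))
```

Only one chunk is held in memory at a time, so the running totals are the only state that grows with the file.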

In [None]:
LARGE_PATH = DATA_DIR / "large_synthetic.csv"

chunks = ...
total_B = 0
sum_value_B = 0.0

for chunk in chunks:
    df = ...
    total_B += len(df)
    sum_value_B += df['value'].sum()

print("Rows with flag B:", total_B)
print("Sum of value for flag B:", round(sum_value_B, 4))

**Stretch:** Downcast dtypes and compare memory usage.
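
One common way to downcast, shown here on synthetic data, is `pd.to_numeric(..., downcast=...)` for numeric columns and the `category` dtype for low-cardinality strings. The data below is generated for illustration, and whether these particular target dtypes suit `large_synthetic.csv` is an assumption to check against the real value ranges.

```python
import pandas as pd

# Synthetic data with the same column names as the task.
df = pd.DataFrame({
    "user_id": range(1000),
    "value": [i * 0.5 for i in range(1000)],
    "flag": ["A", "B"] * 500,
})
before = df.memory_usage(deep=True).sum()

opt = df.copy()
opt["user_id"] = pd.to_numeric(opt["user_id"], downcast="unsigned")  # smallest unsigned int
opt["value"] = pd.to_numeric(opt["value"], downcast="float")         # float32 if values fit
opt["flag"] = opt["flag"].astype("category")                         # codes + small lookup table
after = opt.memory_usage(deep=True).sum()

print(opt.dtypes)
print("Bytes before:", before, "after:", after)
```

Note that `float32` carries roughly 7 significant digits, so downcasting floats trades precision for memory and may not be appropriate for every column.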

In [None]:
sample = pd.read_csv(LARGE_PATH)
before_mb = sample.memory_usage(deep=True).sum() / (1024**2)

opt = sample.copy()
opt['user_id'] = ...
opt['value'] = ...
opt['flag'] = ...

after_mb = opt.memory_usage(deep=True).sum() / (1024**2)
saved_pct = 100 * (1 - after_mb / before_mb)
print("Before MB:", round(before_mb, 3), "After MB:", round(after_mb, 3), "Saved %:", round(saved_pct, 2))