HOMEWORK 05: DATA STORAGE

IMPORTING NECESSARY LIBRARIES

In [17]:
import os, pathlib, datetime as dt
import pandas as pd
import numpy as np
from dotenv import load_dotenv

# Load .env, then create the raw and processed folders
# (paths come from DATA_DIR_RAW and DATA_DIR_PROCESSED, defaulting to data/raw and data/processed)
load_dotenv()
RAW = pathlib.Path(os.getenv('DATA_DIR_RAW', 'data/raw'))
PROC = pathlib.Path(os.getenv('DATA_DIR_PROCESSED', 'data/processed'))
RAW.mkdir(parents=True, exist_ok=True)
PROC.mkdir(parents=True, exist_ok=True)
print('RAW ->', RAW.resolve())
print('PROC ->', PROC.resolve())

RAW -> C:\Users\hinat\bootcamp_hina_tomar\homework\homework5\notebooks\data\raw
PROC -> C:\Users\hinat\bootcamp_hina_tomar\homework\homework5\notebooks\data\processed


1) SAVE IN TWO FORMATS
i) Load a sample DataFrame.
ii) Save to data/raw/ as CSV and to data/processed/ as Parquet.
iii) Use DATA_DIR_RAW and DATA_DIR_PROCESSED from .env.
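
As a rough illustration of item iii, the sketch below shows how an environment variable can override a default directory. It uses only the standard library; `resolve_dir` is a hypothetical helper and `custom/raw` is an illustrative value, neither of which appears in the notebook. In the notebook itself, `load_dotenv()` is what copies the `.env` entries into the environment before `os.getenv` is called.

```python
import os
import pathlib


def resolve_dir(var_name: str, default: str) -> pathlib.Path:
    """Return the directory named by an environment variable, or a default."""
    return pathlib.Path(os.getenv(var_name, default))


# With the variable unset, the default path is used.
os.environ.pop("DATA_DIR_RAW", None)
fallback = resolve_dir("DATA_DIR_RAW", "data/raw")

# With the variable set (illustrative value), it takes precedence.
os.environ["DATA_DIR_RAW"] = "custom/raw"
override = resolve_dir("DATA_DIR_RAW", "data/raw")

print(fallback)
print(override)
```

Keeping the defaults in code means the notebook still runs when no `.env` file is present.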


In [18]:
# Loading Sample Dataframe
dates = pd.date_range('2024-01-01', periods=20, freq='D')
df = pd.DataFrame({'date': dates, 'ticker': ['AAPL']*20, 'price': 150 + np.random.randn(20).cumsum()})
df.head()

        date ticker       price
0 2024-01-01   AAPL  149.873206
1 2024-01-02   AAPL  149.314237
2 2024-01-03   AAPL  149.501673
3 2024-01-04   AAPL  151.422687
4 2024-01-05   AAPL  149.373938


In [19]:
# Save the sample to data/raw/ as CSV and to data/processed/ as Parquet.
def ts():
    """Return a timestamp such as 20240101-093000 for unique file names."""
    return dt.datetime.now().strftime('%Y%m%d-%H%M%S')

# Save CSV
csv_path = RAW / f"sample_{ts()}.csv"
df.to_csv(csv_path, index=False)

# Save Parquet
pq_path = PROC / f"sample_{ts()}.parquet"
try:
    df.to_parquet(pq_path)
except ImportError:
    # pandas raises ImportError when neither pyarrow nor fastparquet is installed
    print('Parquet engine not available. Install pyarrow or fastparquet to complete this step.')
    pq_path = None
pq_path


Parquet engine not available. Install pyarrow or fastparquet to complete this step.
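
The message above means that neither Parquet engine is installed in this environment. A minimal sketch of checking for an engine up front, without importing it, is shown below. `parquet_engine_available` is a hypothetical helper, not part of the notebook; it only tests whether the `pyarrow` or `fastparquet` package can be found.

```python
import importlib.util


def parquet_engine_available() -> bool:
    """Return True if pyarrow or fastparquet can be located by the import system."""
    return any(
        importlib.util.find_spec(name) is not None
        for name in ("pyarrow", "fastparquet")
    )


if parquet_engine_available():
    print("A Parquet engine is installed.")
else:
    print("No Parquet engine found. Try: pip install pyarrow")
```

Installing either package and re-running the save cell should then produce a Parquet file in the processed folder.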


2) RELOAD AND VALIDATE
i) Reload both files.
ii) Confirm shapes match and critical columns keep expected dtypes.
iii) Add a small validation function and show results in the notebook.
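
To show why item ii matters, here is a small sketch of a per-column dtype comparison after a CSV round trip. `dtype_mismatches` is a hypothetical helper, and the three-row frame is made-up sample data. The round trip goes through an in-memory buffer so no files are written.

```python
import io

import pandas as pd


def dtype_mismatches(original: pd.DataFrame, reloaded: pd.DataFrame) -> dict:
    """Map each shared column whose dtype changed to an (original, reloaded) pair."""
    shared = [c for c in original.columns if c in reloaded.columns]
    return {
        c: (str(original[c].dtype), str(reloaded[c].dtype))
        for c in shared
        if str(original[c].dtype) != str(reloaded[c].dtype)
    }


# Made-up sample data with the same columns as the notebook's DataFrame.
frame = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=3, freq="D"),
    "ticker": ["AAPL"] * 3,
    "price": [150.0, 151.5, 149.25],
})

# Round-trip through CSV in memory, deliberately omitting parse_dates.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)
naive = pd.read_csv(buffer)

mismatches = dtype_mismatches(frame, naive)
print(mismatches)
```

CSV stores no type information, so the date column comes back as text unless `parse_dates` is passed on reload. Parquet keeps the column types in the file itself.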

In [20]:
# Creating validate function
def validate_loaded(original, reloaded):
    checks = {
        'shape_equal': original.shape == reloaded.shape,
        'date_is_datetime': pd.api.types.is_datetime64_any_dtype(reloaded['date']) if 'date' in reloaded.columns else False,
        'price_is_numeric': pd.api.types.is_numeric_dtype(reloaded['price']) if 'price' in reloaded.columns else False,
    }
    return checks
    
df_csv = pd.read_csv(csv_path, parse_dates=['date']) # reloading csv file
validate_loaded(df, df_csv) # Checking csv file shape match and columns dtypes as required

{'shape_equal': True, 'date_is_datetime': True, 'price_is_numeric': True}

In [22]:
if pq_path:
    try:
        df_pq = pd.read_parquet(pq_path)  # reload the Parquet file
        # Print explicitly: an expression inside if/try is not echoed by the notebook
        print(validate_loaded(df, df_pq))
    except Exception as e:
        print('Parquet read failed:', e)

3) REFACTOR TO UTILITIES
i) Implement write_df and read_df that route by file suffix (csv/parquet).
ii) Handle missing directories and a missing Parquet engine with a clear message.
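
One possible way to route by suffix is a lookup table keyed on `pathlib.Path.suffix`. The sketch below is an alternative illustration, not the implementation used in the cell that follows; `SUFFIX_TO_FORMAT` and `format_from_suffix` are hypothetical names.

```python
import pathlib

# Hypothetical lookup table: file suffix -> storage format.
SUFFIX_TO_FORMAT = {
    ".csv": "csv",
    ".parquet": "parquet",
    ".pq": "parquet",
    ".parq": "parquet",
}


def format_from_suffix(path) -> str:
    """Return 'csv' or 'parquet' based on the file suffix (case-insensitive)."""
    suffix = pathlib.Path(path).suffix.lower()
    try:
        return SUFFIX_TO_FORMAT[suffix]
    except KeyError:
        raise ValueError(f"Unsupported format: {suffix or '<none>'}") from None


print(format_from_suffix("data/raw/sample.CSV"))
print(format_from_suffix("data/processed/sample.pq"))
```

With a table, supporting another suffix is a one-line change, and an unsupported suffix fails early with a `ValueError` before any file is touched.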

In [26]:
import typing as t, pathlib

def detect_format(path: t.Union[str, pathlib.Path]):
    s = str(path).lower()
    if s.endswith('.csv'): return 'csv'
    if s.endswith('.parquet') or s.endswith('.pq') or s.endswith('.parq'): return 'parquet'
    raise ValueError('Unsupported format: ' + s)

def write_df(df: pd.DataFrame, path: t.Union[str, pathlib.Path]):
    p = pathlib.Path(path)
    p.parent.mkdir(parents=True, exist_ok=True)  # create missing directories
    fmt = detect_format(p)
    if fmt == 'csv':
        df.to_csv(p, index=False)
    else:
        try:
            df.to_parquet(p)
        except ImportError as e:
            # pandas raises ImportError when no Parquet engine is installed;
            # other errors (e.g. permissions) propagate unchanged
            raise RuntimeError('Parquet engine not available. Install pyarrow or fastparquet.') from e
    return p

def read_df(path: t.Union[str, pathlib.Path]):
    p = pathlib.Path(path)
    fmt = detect_format(p)
    if fmt == 'csv':
        # Read only the header first so 'date' is parsed only when present
        header = pd.read_csv(p, nrows=0).columns
        return pd.read_csv(p, parse_dates=['date'] if 'date' in header else False)
    else:
        try:
            return pd.read_parquet(p)
        except ImportError as e:
            # Only a missing engine is reported this way; a missing file still raises its own error
            raise RuntimeError('Parquet engine not available. Install pyarrow or fastparquet.') from e

# Demo
p_csv = RAW / f"util_{ts()}.csv"
p_pq  = PROC / f"util_{ts()}.parquet"
write_df(df, p_csv)
read_df(p_csv).head()  # not displayed: only a cell's last expression is echoed
try:
    write_df(df, p_pq)
    read_df(p_pq).head()
except RuntimeError as e:
    print('Skipping Parquet util demo:', e)

Skipping Parquet util demo: Parquet engine not available. Install pyarrow or fastparquet.
