This project ingests a raw stock_market.csv, performs cleaning and validation, builds analytical aggregates, and exposes a Streamlit dashboard for quick exploration.
The README below gives a compact Quick Start, the pipeline steps, produced artifacts, and a few notes to help you run it locally.
- Create (or activate) a Python environment with a supported version (Python 3.9+ recommended).
- Install the lightweight dependencies:
    pip install pandas pyarrow streamlit
- Run the pipeline steps (clean → aggregate → dashboard):
    # Clean the raw CSV and produce cleaned.parquet
    python api/cleaning.py
    # Build the aggregate parquet files
    python api/aggregates.py
    # Start the Streamlit dashboard
    streamlit run app.py

High-level flow:
- Input: api/stock_market.csv (raw)
- Step 1: Normalize and validate the raw CSV → cleaned.parquet
- Step 2: Build analytic aggregates (parquet files)
- Step 3: Launch Streamlit dashboard to visualize results

Command:
    python api/cleaning.py

Cleaning features:
- Convert column names to snake_case
- Trim whitespace and lowercase textual fields
- Normalize common missing-value tokens (na, -, null, etc.)
- Enforce a schema for key columns (date, float, int, string, boolean)
- Output: cleaned.parquet
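
The cleaning features listed above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of api/cleaning.py: the column names (trade_date, close_price, volume), the SCHEMA mapping, the exact missing-token list, and the helper names are all assumptions made for the example.

```python
import re

import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype

# Assumed token list; the README only names "na", "-", "null, etc."
MISSING_TOKENS = ["", "na", "n/a", "-", "null", "none", "nan"]

# Hypothetical schema for a few key columns (names are illustrative).
SCHEMA = {"trade_date": "date", "close_price": "float", "volume": "int"}


def to_snake_case(name: str) -> str:
    """Turn 'Trade Date' or 'closePrice' into 'trade_date' / 'close_price'."""
    name = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name.strip())
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name)
    return name.strip("_").lower()


def clean_frame(df: pd.DataFrame) -> pd.DataFrame:
    out = df.rename(columns=to_snake_case)

    # Trim and lowercase text columns, then blank out missing-value tokens.
    for col in out.columns:
        if is_object_dtype(out[col]) or is_string_dtype(out[col]):
            text = out[col].str.strip().str.lower()
            out[col] = text.mask(text.isin(MISSING_TOKENS))

    # Coerce key columns; values that cannot be parsed become missing.
    for col, kind in SCHEMA.items():
        if col not in out.columns:
            continue
        if kind == "date":
            out[col] = pd.to_datetime(out[col], errors="coerce")
        elif kind == "float":
            out[col] = pd.to_numeric(out[col], errors="coerce")
        elif kind == "int":
            out[col] = pd.to_numeric(out[col], errors="coerce").astype("Int64")
    return out
```

Under these assumptions, writing the result with `clean_frame(raw).to_parquet("cleaned.parquet", index=False)` would produce the output file; the nullable Int64 dtype is used so that integer columns can still hold missing values.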

Command:
    python api/aggregates.py

Produced artifacts (Parquet files):
- agg_daily_avg_close.parquet: average close price per day per ticker
- agg_avg_volume_sector.parquet: average trading volume by sector
- agg_daily_return.parquet: daily return (% change of close price)
These are lightweight, columnar files suitable for fast reads in analytics or feeding the dashboard.
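
As a rough sketch of how the three aggregates could be computed with pandas (not necessarily how api/aggregates.py does it), assuming the cleaned data has columns named trade_date, ticker, sector, close_price, and volume. Those column names and the function names are hypothetical.

```python
import pandas as pd


def build_aggregates(df: pd.DataFrame) -> dict:
    """Return a mapping of artifact file name -> aggregate DataFrame."""
    # Average close price per day per ticker.
    daily_avg_close = df.groupby(
        ["trade_date", "ticker"], as_index=False
    )["close_price"].mean()

    # Average trading volume by sector.
    avg_volume_sector = df.groupby("sector", as_index=False)["volume"].mean()

    # Daily return: % change of close versus the previous row of the same ticker.
    ordered = df.sort_values(["ticker", "trade_date"])
    prev_close = ordered.groupby("ticker")["close_price"].shift(1)
    daily_return = ordered[["trade_date", "ticker"]].copy()
    daily_return["daily_return_pct"] = (
        (ordered["close_price"] - prev_close) / prev_close * 100
    )

    return {
        "agg_daily_avg_close.parquet": daily_avg_close,
        "agg_avg_volume_sector.parquet": avg_volume_sector,
        "agg_daily_return.parquet": daily_return,
    }


def write_aggregates(aggregates: dict, out_dir: str = ".") -> None:
    """Write each aggregate to Parquet (needs pyarrow); out_dir is an assumption."""
    for file_name, frame in aggregates.items():
        frame.to_parquet(f"{out_dir}/{file_name}", index=False)
```

The first row of each ticker has no previous close, so its daily return is left missing rather than set to zero.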

Command (from repository root):
    streamlit run app.py

Dashboard features:
- Daily Avg Close (interactive bar chart with filters)
- Daily Return (bar chart + filters)
- Volume by Sector (bar chart)
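
To illustrate how a filtered bar chart like the ones above might be wired up, here is a small sketch. It is not the actual app.py: the column names (trade_date, ticker, close_price), the artifact path, and the functions `filter_rows` and `render` are assumptions. Streamlit is imported inside `render` so the filtering helper can be used and tested without it.

```python
import pandas as pd


def filter_rows(df: pd.DataFrame, ticker=None, start=None, end=None) -> pd.DataFrame:
    """Keep rows matching the chosen ticker and an inclusive date range."""
    mask = pd.Series(True, index=df.index)
    if ticker is not None:
        mask &= df["ticker"] == ticker
    if start is not None:
        mask &= df["trade_date"] >= pd.Timestamp(start)
    if end is not None:
        mask &= df["trade_date"] <= pd.Timestamp(end)
    return df[mask]


def render(parquet_path: str = "agg_daily_avg_close.parquet") -> None:
    """Draw a ticker filter and a bar chart; the path is an assumed location."""
    import streamlit as st  # imported lazily; only needed when rendering

    df = pd.read_parquet(parquet_path)
    tickers = sorted(df["ticker"].dropna().unique())
    choice = st.selectbox("Ticker", tickers)
    view = filter_rows(df, ticker=choice)
    st.title("Daily Avg Close")
    st.bar_chart(view.set_index("trade_date")["close_price"])
```

In a script launched with `streamlit run`, calling `render()` at the bottom of the file would display the widget and chart.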

- Recommended Python: 3.9+.
- If you prefer isolating the environment, use python -m venv .venv then source .venv/bin/activate (or the Windows equivalent).
- For reproducible installs, consider adding a requirements.txt with pinned versions.
- If the source CSV is very large, run the cleaning step on a machine with sufficient memory or adapt api/cleaning.py to stream/process in chunks.
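
One possible shape for the chunked approach mentioned in the last note, assuming a per-chunk cleaning function is available. The function names, the chunk size, and the idea of passing in `clean_fn` are illustrative choices, not part of the existing api/cleaning.py.

```python
import pandas as pd


def iter_clean_chunks(csv_source, chunksize=100_000, clean_fn=lambda chunk: chunk):
    """Read the CSV in fixed-size chunks and yield each chunk after cleaning.

    dtype=str keeps raw values as text so every chunk starts from the same types.
    """
    with pd.read_csv(csv_source, chunksize=chunksize, dtype=str) as reader:
        for chunk in reader:
            yield clean_fn(chunk)


def write_chunks_to_parquet(chunks, out_path="cleaned.parquet"):
    """Append cleaned chunks to a single Parquet file using pyarrow."""
    import pyarrow as pa
    import pyarrow.parquet as pq

    writer = None
    schema = None
    try:
        for chunk in chunks:
            if writer is None:
                # The first chunk fixes the file schema for all later chunks.
                table = pa.Table.from_pandas(chunk, preserve_index=False)
                schema = table.schema
                writer = pq.ParquetWriter(out_path, schema)
            else:
                table = pa.Table.from_pandas(
                    chunk, schema=schema, preserve_index=False
                )
            writer.write_table(table)
    finally:
        if writer is not None:
            writer.close()
```

Usage would look like `write_chunks_to_parquet(iter_clean_chunks("api/stock_market.csv", clean_fn=my_clean))`, where `my_clean` is a placeholder for the cleaning logic. Because the schema is taken from the first chunk, a column that happens to be entirely empty in that chunk may be typed incorrectly; declaring an explicit pyarrow schema up front would be more robust.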

- api/cleaning.py: cleaning & validation logic
- api/aggregates.py: aggregation and output writers
- app.py: Streamlit app that reads the parquet artifacts and renders charts

Below are example screenshots from the dashboard. Replace these files in images/ with your own screenshots if you want different images to appear in this README.


