This solution aims to design and implement a scalable data pipeline that extracts New York Taxi Trip data, processes it to derive analytical insights, and loads the processed data into a data warehouse for further analysis.
- Python 3.8+
- SQLite
- Clone the repository:
git clone <repository_url>
cd New_York_Assignment
- Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
- Install the required packages:
pip install -r prerequisites.txt
To download the trip data for the year 2019:
python scripts/download_data.py
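The contents of `download_data.py` are not shown here, so the following is only a minimal sketch of what a download step could look like. The base URL, the `yellow` taxi type, the `data/raw` directory, and the function names `monthly_urls` and `download_year` are all assumptions for illustration; adjust them to match the actual script.

```python
from pathlib import Path
from typing import List
from urllib.request import urlretrieve

# Assumed URL pattern for the public TLC trip record files; adjust it if the
# real script uses a different source or taxi type.
BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"


def monthly_urls(year: int, taxi_type: str = "yellow") -> List[str]:
    """Build one download URL per month of the given year."""
    return [
        f"{BASE_URL}/{taxi_type}_tripdata_{year}-{month:02d}.parquet"
        for month in range(1, 13)
    ]


def download_year(year: int, dest_dir: str = "data/raw") -> List[Path]:
    """Download every monthly file, skipping files that already exist."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in monthly_urls(year):
        target = dest / url.rsplit("/", 1)[-1]  # file name is the last URL segment
        if not target.exists():
            urlretrieve(url, str(target))
        saved.append(target)
    return saved
```

Calling `download_year(2019)` would fetch twelve monthly files. Skipping files that already exist keeps reruns cheap after a partial failure.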
To convert the downloaded Parquet data to CSV:
python scripts/parquet_to_csv.py
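As a rough illustration of the conversion step, the sketch below reads one Parquet file with pandas and writes it back out as CSV. It assumes pandas and a Parquet engine such as pyarrow are installed; the `data/csv` output directory and the helper names `csv_path_for` and `convert` are hypothetical, not taken from `parquet_to_csv.py`.

```python
from pathlib import Path


def csv_path_for(parquet_path: str, out_dir: str = "data/csv") -> Path:
    """Map a Parquet file path to a CSV path with the same stem in out_dir."""
    return Path(out_dir) / (Path(parquet_path).stem + ".csv")


def convert(parquet_path: str, out_dir: str = "data/csv") -> Path:
    """Read one Parquet file and write it back out as CSV."""
    # Imported lazily so the path helper works without pandas installed.
    # Reading Parquet also needs an engine such as pyarrow or fastparquet.
    import pandas as pd

    target = csv_path_for(parquet_path, out_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    pd.read_parquet(parquet_path).to_csv(target, index=False)
    return target
```

Passing `index=False` keeps the pandas row index out of the CSV, so the output columns match the source file.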
To clean and transform the downloaded data:
python scripts/processed_data.py
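The cleaning rules applied by `processed_data.py` are not described here. The sketch below shows one plausible approach using only the standard library: drop rows with malformed fields or implausible values, and derive a trip duration. The column names follow the yellow taxi trip record schema and are an assumption, as are the sample rows and the `trip_duration_min` field.

```python
import csv
import io
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def clean_rows(rows):
    """Yield valid trips with a derived duration in minutes."""
    for row in rows:
        try:
            pickup = datetime.strptime(row["tpep_pickup_datetime"], TIME_FORMAT)
            dropoff = datetime.strptime(row["tpep_dropoff_datetime"], TIME_FORMAT)
            distance = float(row["trip_distance"])
            fare = float(row["fare_amount"])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with missing or malformed fields
        minutes = (dropoff - pickup).total_seconds() / 60
        if distance <= 0 or fare <= 0 or minutes <= 0:
            continue  # drop implausible trips
        yield {**row, "trip_duration_min": round(minutes, 2)}


# Hypothetical sample: one valid trip, one zero-distance trip, one bad timestamp.
SAMPLE = """tpep_pickup_datetime,tpep_dropoff_datetime,trip_distance,fare_amount
2019-01-01 00:10:00,2019-01-01 00:25:00,2.5,11.0
2019-01-01 01:00:00,2019-01-01 01:05:00,0,4.5
2019-01-01 02:00:00,not-a-date,1.2,6.0
"""

cleaned = list(clean_rows(csv.DictReader(io.StringIO(SAMPLE))))
```

Because `clean_rows` is a generator, it can stream over a large monthly file without loading it fully into memory.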
To load the data into the database:
python scripts/loading_data.py
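A loading step into SQLite might look like the sketch below. The `trips` table, its columns, and the sample rows are hypothetical and only illustrate the pattern of creating the table once and bulk-inserting inside a single transaction; the real schema in `loading_data.py` may differ.

```python
import sqlite3

# Hypothetical cleaned rows: pickup time, distance, fare, duration in minutes.
TRIPS = [
    ("2019-01-01 00:10:00", 2.5, 11.0, 15.0),
    ("2019-01-01 00:40:00", 1.0, 6.5, 7.0),
    ("2019-01-01 09:05:00", 4.2, 17.0, 22.5),
]


def load_trips(conn, trips):
    """Create the table if needed and bulk-insert rows in one transaction."""
    with conn:  # commits on success, rolls back if an error is raised
        conn.execute(
            """CREATE TABLE IF NOT EXISTS trips (
                   pickup_datetime TEXT,
                   trip_distance REAL,
                   fare_amount REAL,
                   trip_duration_min REAL
               )"""
        )
        conn.executemany("INSERT INTO trips VALUES (?, ?, ?, ?)", trips)
    return conn.execute("SELECT COUNT(*) FROM trips").fetchone()[0]


# An in-memory database keeps the example self-contained;
# pass a file path instead to persist the data.
conn = sqlite3.connect(":memory:")
row_count = load_trips(conn, TRIPS)
```

Using `executemany` with placeholders avoids building SQL strings by hand and is typically much faster than inserting row by row with separate commits.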
To generate insights and visualizations:
python scripts/analysis_data.py
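The specific insights produced by `analysis_data.py` are not listed here. As one example of the kind of query such a step could run, the sketch below aggregates trip counts and average fares by pickup hour directly in SQLite. The table layout, the sample rows, and the `hourly_summary` helper are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trips (pickup_datetime TEXT, trip_distance REAL, fare_amount REAL)"
)
# Hypothetical sample rows standing in for the loaded 2019 data.
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [
        ("2019-01-01 00:10:00", 2.5, 11.0),
        ("2019-01-01 00:40:00", 1.0, 7.0),
        ("2019-01-01 09:05:00", 4.0, 17.0),
    ],
)


def hourly_summary(conn):
    """Return (hour, trip count, average fare) tuples ordered by pickup hour."""
    return conn.execute(
        """SELECT CAST(strftime('%H', pickup_datetime) AS INTEGER) AS hour,
                  COUNT(*) AS trips,
                  AVG(fare_amount) AS avg_fare
           FROM trips
           GROUP BY hour
           ORDER BY hour"""
    ).fetchall()


summary = hourly_summary(conn)
```

Doing the aggregation in SQL means only the small summary table is pulled into Python, which can then be passed to a plotting library for visualization.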