NYC Driver Behavior Analysis Project

OVERVIEW

I initiated the NYC Driver Behavior Analysis Project with a multifaceted motivation that extends beyond the immediate scenario. While the project addresses the impact of the COVID-19 pandemic on driving behaviors in New York City, it also serves as a platform for personal and professional growth.

Personal Motivations

Skill Enhancement: I aim to polish my data skills and build a comprehensive data portfolio through hands-on experience with real-world datasets.

End-to-End Learning: I aspire to gain a holistic view of data-driven decision-making by immersing myself in every stage of the project lifecycle: understanding the business context, extracting and processing data from source to delivery, and deriving actionable insights.

Technology Exploration: As the owner of this project, I am eager to explore and master the technologies commonly used in work like this: Docker for containerization, AWS for scalable computing and storage, and database technologies for efficient data management, expanding my technical skill set and keeping pace with industry-standard practices.

Domain Expertise: By delving into the nuances of vehicle collisions in a metropolis like New York City, I seek to build a deep understanding of this complex issue and contribute meaningfully to the project's objectives.


SCENARIO: A Brief History of NYC Driver Behavior

Join us as we delve into the impact of the COVID-19 pandemic on driving behaviors in New York City. Led by the data and business intelligence team at The Car Insurance Company (TCIC), our mission is clear:

Refine the premium pricing methodology to adapt to significant changes in urban mobility patterns.

Our project involves a deep analysis of various datasets to understand how the pandemic has reshaped driving habits, risk factors, and accident trends among our NYC customer base. By leveraging these insights, we aim to make informed decisions within our underwriting and customer service departments, ensuring that our products and services remain relevant to our customers' evolving needs.

The information below outlines the business context and expected outcomes for this project. I'll share some key points from the presentation here; the full slideshow is available via the Miro board link provided: link.

Business Context

[Business context slide]

METHODOLOGY

In this section, we outline our approach to extracting actionable insights:

[Methodology diagram]

DATA SOURCES and TOOLS

Data Sources

NYPD Open Data API, Powered by Socrata

Tools Used

  • Python 3: Libraries (pandas, scipy, numpy, boto3, sys, io, sodapy)
  • Docker: Deploys data pipelines in Docker containers.
  • AWS Lambda: Automates workflows in the AWS ecosystem.
  • AWS S3: Repository for both raw and processed data.
  • AWS Batch: Batch processing for large-scale data transformation jobs.
  • AWS Redshift: Hosts the data warehouse for OLAP (Online Analytical Processing).
  • Microsoft Power BI: Our chosen tool for data visualization.
  • Miro: The canvas for our data story.

Data Pipeline

The following diagram shows the ecosystem that transforms raw data into actionable insights. This data pipeline is the backbone of the project, letting us process large volumes of data efficiently and reliably. Below is an overview of its key components and the tools employed at each stage:

[Data pipeline diagram]

Data Ingestion

Mass Upload: To kickstart the data pipeline, I built a Python application that fetches data directly from the Socrata API and stores it in an S3 bucket dedicated to raw collisions data.

I devised two methods for this upload: sequential retrieval and parallel retrieval. Each has its own advantages and drawbacks, discussed at length in this tutorial; a sketch of both appears below.

This initial load runs locally and establishes the foundational storage layer, priming the pipeline for subsequent processing stages.
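As a rough illustration, here is a minimal sketch of both retrieval modes. It assumes the public NYC Motor Vehicle Collisions dataset ID (h9gi-nx95), a hypothetical bucket name, and an app token supplied via the environment; the repository's actual scripts may differ.

```python
import json
import os
from concurrent.futures import ThreadPoolExecutor

import boto3
from sodapy import Socrata

DATASET_ID = "h9gi-nx95"       # NYC Motor Vehicle Collisions - Crashes (assumed)
BUCKET = "nyc-collisions-raw"  # hypothetical bucket name
PAGE_SIZE = 50_000

client = Socrata("data.cityofnewyork.us", os.environ.get("SOCRATA_APP_TOKEN"))
s3 = boto3.client("s3")

def upload_page(offset: int) -> int:
    """Fetch one page from the Socrata API and land it in S3 as JSON."""
    records = client.get(DATASET_ID, limit=PAGE_SIZE, offset=offset, order=":id")
    if records:
        s3.put_object(
            Bucket=BUCKET,
            Key=f"raw/collisions/offset={offset}.json",
            Body=json.dumps(records),
        )
    return len(records)

def sequential_load() -> None:
    """Walk the dataset page by page until the API returns a short page."""
    offset = 0
    while upload_page(offset) == PAGE_SIZE:
        offset += PAGE_SIZE

def parallel_load(total_rows: int, workers: int = 8) -> None:
    """Fetch pages concurrently; faster, but risks hitting API rate limits."""
    offsets = range(0, total_rows, PAGE_SIZE)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pool.map(upload_page, offsets)
```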


Daily Updates: A pivotal aspect of this pipeline is keeping the data current. To achieve this, I designed a streamlined pipeline orchestrated to run updates daily.

It leverages Docker containers, AWS ECR (Elastic Container Registry), AWS CloudWatch, AWS Secrets Manager, and AWS Lambda to automate the ETL (Extract, Transform, Load) process.

This automated pipeline handles the retrieval and processing of data according to a pre-defined daily schedule, ensuring our dataset remains current and actionable. Please access this link for more details.
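For illustration, here is a minimal sketch of what the daily-update Lambda handler could look like, assuming the same dataset ID and bucket as above and a hypothetical Secrets Manager secret named socrata/app_token; the actual function in this repository may be structured differently.

```python
import datetime
import json

import boto3
from sodapy import Socrata

DATASET_ID = "h9gi-nx95"       # assumed dataset ID
BUCKET = "nyc-collisions-raw"  # hypothetical bucket name

def handler(event, context):
    # Pull the Socrata app token from Secrets Manager (hypothetical secret name).
    secrets = boto3.client("secretsmanager")
    token = secrets.get_secret_value(SecretId="socrata/app_token")["SecretString"]

    # Fetch yesterday's crashes with a SoQL date filter (field name per the
    # public dataset schema).
    yesterday = (datetime.date.today() - datetime.timedelta(days=1)).isoformat()
    client = Socrata("data.cityofnewyork.us", token)
    records = client.get(DATASET_ID, where=f"crash_date >= '{yesterday}'")

    # Land the batch in S3, partitioned by date, for downstream processing.
    boto3.client("s3").put_object(
        Bucket=BUCKET,
        Key=f"raw/collisions/daily/{yesterday}.json",
        Body=json.dumps(records),
    )
    return {"rows": len(records)}
```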

Workflow Automation

AWS Lambda serves as the backbone for workflow automation within the AWS ecosystem, letting us manage and automate the various processes in our data pipeline. Leveraging Lambda significantly improves efficiency while minimizing the risk of manual errors, keeping data processing and analysis running smoothly across the entire pipeline.

Data Storage

AWS S3 serves as the primary data repository, hosting both the raw data collected from the Socrata API and the processed data, organized for ease of access and analysis.
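The exact bucket layout isn't documented here; the sketch below shows one plausible convention, with hypothetical bucket and prefix names, and how to browse it with boto3.

```python
import boto3

# Assumed layout (illustrative, not confirmed by the repo):
#   s3://nyc-collisions-raw/raw/collisions/...        <- landed API pages
#   s3://nyc-collisions-raw/processed/collisions/...  <- cleaned/aggregated output

s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="nyc-collisions-raw",
                          Prefix="processed/collisions/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```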

Data Transformation

AWS Batch plays a crucial role here, extracting raw data from S3 and transforming it into a format suitable for analysis: cleaning, normalizing, and aggregating the records before they move downstream.
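As a rough sketch of the kind of transformation a Batch job might run here (column names follow the public collisions dataset schema; bucket and key names are hypothetical):

```python
import io

import boto3
import pandas as pd

BUCKET = "nyc-collisions-raw"  # hypothetical bucket name
s3 = boto3.client("s3")

# Load one raw JSON page landed by the ingestion step.
obj = s3.get_object(Bucket=BUCKET, Key="raw/collisions/daily/2020-03-15.json")
df = pd.read_json(io.BytesIO(obj["Body"].read()))

# Clean and normalize: parse dates, standardize boroughs, coerce counts,
# and drop duplicate collision records.
df["crash_date"] = pd.to_datetime(df["crash_date"])
df["borough"] = df["borough"].str.title().fillna("Unknown")
df["number_of_persons_injured"] = pd.to_numeric(
    df["number_of_persons_injured"], errors="coerce").fillna(0)
df = df.drop_duplicates(subset="collision_id")

# Aggregate: daily injury counts per borough.
daily = (
    df.groupby([df["crash_date"].dt.date, "borough"])
      ["number_of_persons_injured"].sum()
      .reset_index(name="persons_injured")
)

# Write the processed output back to S3 as CSV.
buf = io.StringIO()
daily.to_csv(buf, index=False)
s3.put_object(Bucket=BUCKET, Key="processed/collisions/daily_injuries.csv",
              Body=buf.getvalue())
```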

Data Warehousing for Analysis

AWS Redshift hosts the data warehouse for Online Analytical Processing (OLAP). It manages large volumes of processed data, making them readily available for complex queries and analysis.
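A minimal sketch of loading the processed output into Redshift via the Redshift Data API; the cluster identifier, database, table, and IAM role below are all hypothetical:

```python
import boto3

client = boto3.client("redshift-data")

# COPY the processed CSV from S3 into a hypothetical fact table.
copy_sql = """
    COPY analytics.daily_injuries
    FROM 's3://nyc-collisions-raw/processed/collisions/daily_injuries.csv'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-load'
    CSV IGNOREHEADER 1;
"""

client.execute_statement(
    ClusterIdentifier="nyc-collisions-dw",  # hypothetical cluster name
    Database="analytics",
    DbUser="etl_user",
    Sql=copy_sql,
)
```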

Data Visualization and Reporting

Microsoft Power BI is our chosen tool for data visualization, enabling us to unearth deep insights from the data and present them in an intuitive, visually compelling format; Miro serves as the canvas for telling the data story.


The integration of these tools forms a robust and dynamic data pipeline, essential for navigating the complexities of urban driving behavior analysis in the post-pandemic era.

FILE and RESOURCE ACCESS

This repository contains all the necessary files and resources used in the NYC Driver Behavior Analysis Project. Here's how to navigate and utilize them:

Directory Structure

  • AWS: Contains scripts and configuration files related to AWS services such as Lambda, S3, and Redshift.
  • Docker: Includes Dockerfile and related scripts for setting up the Docker containers used in the project.
  • data pipelines: Scripts and code for setting up and managing the data pipelines.
  • data: Storage for data files; due to the sensitive nature of the data, raw data files may not be included here.
  • resources: Additional resources for the project, such as documentation, configuration files, or reference material.

HOW TO USE

To work with the files in this repository, follow these steps:

  1. Clone the Repository: Clone this repository to your local machine with the Git command: git clone https://github.com/JavierGalindo91/NYC-Collisions.git
  2. Navigate to Directories: Use the command line to navigate into any of the directories listed above. For example: cd NYC-Collisions/AWS
  3. File Access: Access files directly within the cloned directories. If you're using an IDE or text editor, you can open the entire project folder to browse through the files.
  4. Running Scripts: To run scripts, ensure you have the necessary runtime environments set up, such as Python for .py files or Docker for containers. For Python scripts, the command might look like: python3 script_name.py
  5. Data Privacy: Please note that the actual data may be protected due to privacy concerns and thus not available in the repository. If you require access to the data, contact the repository owner at the provided email address.
  6. Updating Files: If you've made changes and wish to push them to the repository, use the standard Git commands git add, git commit, and git push to update the repository.
  7. Docker Containers: To work with Docker containers, make sure Docker is installed on your system and use the Docker CLI to build and run containers.
  8. AWS and Cloud Resources: For AWS resources, you will need appropriate access credentials and permissions. Use the AWS CLI or management console to interact with the services.

Need Help?

If you encounter any issues or have questions about accessing specific resources, please open an issue in this repository or contact the repository administrator at javier.galindobrito@gmail.com.

HOW TO CONTRIBUTE

I welcome contributions from data analysts, data scientists, and urban mobility experts. If you're interested in contributing to this project, please follow these steps:

  1. Fork this repository.
  2. Create a new branch for your feature (git checkout -b feature/YourFeature).
  3. Commit your changes (git commit -am 'Add YourFeature').
  4. Push to the branch (git push origin feature/YourFeature).
  5. Open a new Pull Request.
  6. For any queries or suggestions, feel free to open an issue in the repository.

Contact Information

For more information on this project, please email javier.galindobrito@gmail.com.

This README is part of the NYC Driver Behavior Analysis Project initiated by Javier Galindo.
