# Phase I: Project Proposal and Data Sources (5\%)

### Due (Each Student): October 4

Each **individual student** will submit a project proposal in a **jupyter notebook** which:

1. (2\%) Describes and motivates a real-world problem where data science may provide helpful insights. This problem must consist of at least two key questions of interest and your description of the problem and questions should be easily understood by a casual reader. Citations to motivating sources are preferred where possible (e.g. news articles, published papers, etc. Do not use Wikipedia itself, but the links Wikipedia articles cite may be useful)

2. (2\%) Explicity state source and load data (does not have to be clean) from data source:
   * Data must include at least **2 numeric** (e.g. number of friends, height, gpa, temperature, etc.) and **1 categorical** (e.g. color, class, car type, etc.) features
   * Data **MUST BE COLLECTED VIA PYTHON, EITHER WITH AN API OR VIA WEB SCRAPING**. You may **NOT** simply download a .csv file
     * Once you have succesfully collected, curated, cleaned the data, you may save it as a .csv or .json in your GitHub repo and call that final data set when implementing the ML (i.e. you do not have to call the API/do the web scraping every time you want to use the data).

4. (1\%) Write a paragraph about how the data will be used to solve the problem and your two questions of interest. At this point of the semester, we haven't studied the machine learning methods yet, but you should have a general idea of what you can do with ML (predict numerical values, predict class labels, or characterize relationships between features). If you do not, ask a TA or the professor or do a little googling.

### **1. Project Proposal: Analyzing Public Transportation Commute Times**

#### **Problem Description and Motivation (2%)**

The Metropolitan Transportation Authority (MTA) in New York City is important for getting millions of people around the city every day. However, the subway system faces a lot of problems, such as delays, overcrowding, and service interruptions. These issues not only frustrate riders but also impact the city’s economy. By diving into the data behind the MTA's operations, we might find patterns or causes for some of these problems and even identify potential solutions.

Key Questions of Interest:
1. What are the main causes of delays in the subway system, and how can we address them? 
   - Delays can happen for a lot of reasons; mechanical breakdowns, signal problems, weather conditions, or just rush-hour crowds. Finding out which of these factors are the biggest contributors could help make things run smoother.

2. How does subway ridership change at different times, days, or on different lines?
   - By understanding when and where people use the subway most, we can spot patterns that might help with resource planning, like knowing when to add more trains or where to focus repairs and upgrades.

Citations
- https://www.nytimes.com/interactive/2017/06/28/nyregion/subway-delays-overcrowding.html

### 2. Data Source and Collection (2%)

Data Source:
The data for this project is sourced from the MTA’s Subway Real-Time Feeds available at the URL: https://api-endpoint.mta.info/Dataservice/mtagtfsfeeds/nyct%2Fgtfs-bdfm. This provides real-time information on train locations, service statuses, and delays.

Data Features:
- Numerical Data:
1. Train Headways: Time intervals between consecutive trains on a given line, which can be calculated to analyze congestion and train frequency.
2. Train Delays: The duration of delays reported in real-time feeds, providing insights into subway performance.
- Categorical Data:
1. Train Line: Identifiers for different subway lines (e.g., B, D, F, M).

Data Collection Method:
We will use Python to collect data from the MTA's real-time feed endpoint. The requests library will be used to make HTTP requests to the URL, and gtfs-realtime-bindings will parse the response. The collected data will be stored in a pandas DataFrame and saved as a CSV file for further analysis.

In [1]:
import requests
import pandas as pd
from google.transit import gtfs_realtime_pb2
from datetime import datetime

# MTA data feed URL
URL = 'https://api-endpoint.mta.info/Dataservice/mtagtfsfeeds/nyct%2Fgtfs-bdfm'

def fetch_mta_data():
    response = requests.get(URL)
    
    if response.status_code == 200:
        # Parse the GTFS data
        feed = gtfs_realtime_pb2.FeedMessage()
        feed.ParseFromString(response.content)
        
        data = []
        
        for entity in feed.entity:
            if entity.trip_update:
                train_info = {
                    'train_line': entity.trip_update.trip.route_id,  # Categorical data: Train Line
                    'start_time': entity.trip_update.trip.start_time,
                    'arrival_time': None,
                    'delay': None  # Numerical data: Train Delay
                }
                
                # Extract the first stop update (if available)
                if entity.trip_update.stop_time_update:
                    stop_update = entity.trip_update.stop_time_update[0]
                    train_info['arrival_time'] = datetime.fromtimestamp(
                        stop_update.arrival.time).strftime('%Y-%m-%d %H:%M:%S')
                    train_info['delay'] = stop_update.arrival.delay if stop_update.arrival.HasField('delay') else 0
                
                data.append(train_info)
        
        return pd.DataFrame(data)
    else:
        print(f"Failed to fetch data: {response.status_code}")
        return None

# Fetch the data
mta_data = fetch_mta_data()

# Check if data was successfully retrieved
if mta_data is not None:
    # Save the data to a CSV file for later use
    mta_data.to_csv('mta_real_time_data.csv', index=False)
    print(mta_data.head())
else:
    print("No data to load.")




  train_line start_time         arrival_time  delay
0          B   21:17:20  2024-10-04 22:30:39    0.0
1                                       None    NaN
2          D   20:42:30  2024-10-04 22:30:36    0.0
3                                       None    NaN
4          D   20:57:00  2024-10-04 22:30:39    0.0


### 3. How the Data Will Be Used (1%)

The data collected from the MTA real-time feed for subway lines B, D, F, and M provides valuable information into train performance and delay patterns. This dataset, which includes features such as train line identifiers, arrival times, and delay durations, allows us to explore two key questions: (1) What are the primary factors contributing to train delays? (2) How does train performance vary across different times of day? By analyzing the arrival times and delay durations, we can identify peak hours and trends that may contribute to train delays, such as increased congestion during rush hours or technical issues at certain times. Furthermore, this data enables us to build models to predict delays based on historical patterns, providing useful information for MTA scheduling and resource allocation. Using machine learning techniques, we can also explore relationships between train lines and their performance, helping to identify which lines may require additional support to improve service quality.
