# Exploratory Data Analysis: US Transportation
## Authors: Yasmine Thandi, Kyle Truong, Bin Xu
**Original Dataset Source: Monthly Transportation Statistics (Updated 2024). Kaggle Data Science Platform. https://www.kaggle.com/datasets/utkarshx27/monthly-transportation-statistics/data**

**Modified Dataset: https://raw.githubusercontent.com/HenryCROSS/eecs3401_final_project/main/data/Monthly_Transportation_Statistics.csv**

## Transportation Dataset Description
From the original dataset, all records dated before 1967 were removed because the Bureau of Transportation Statistics recorded too little data for those years.
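The cutoff described above could be applied along the following lines. This is a minimal sketch on a toy frame, not the exact preprocessing used for the modified dataset; the `raw` and `trimmed` names are hypothetical, and the timestamp format is inferred from the `Date` values shown in the data preview.

```python
import pandas as pd

# Toy stand-in for the raw table (assumption: the real file has many more columns).
raw = pd.DataFrame({
    "Date": [
        "12/01/1966 12:00:00 AM",
        "01/01/1967 12:00:00 AM",
        "02/01/1967 12:00:00 AM",
    ],
    "Index": [0, 1, 2],
})

# Parse the month/day/year 12-hour timestamps used in the Date column.
raw["Date"] = pd.to_datetime(raw["Date"], format="%m/%d/%Y %I:%M:%S %p")

# Keep only rows dated 1 January 1967 or later.
cutoff = pd.Timestamp("1967-01-01")
trimmed = raw[raw["Date"] >= cutoff].reset_index(drop=True)

print(trimmed)
```

Parsing the column into real datetimes first avoids comparing dates as strings, which would sort by month rather than by year in this format.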

Most of the attributes in the original data are not needed for the task we want to focus on. We therefore reduced the 136 unique attributes to the 27 that we considered useful for our model.
### Attributes Used:
1. **Index** - Entry number.
1. **Date** - The date the data was recorded (typically the first day of each month at 12:00 AM).
1. **Transit Ridership - Other Transit Modes - Adjusted** - Total number of riders on other transit modes.
1. **Transit Ridership - Fixed Route Bus - Adjusted** - Total number of riders on any bus routes.
1. **Transit Ridership - Urban Rail - Adjusted** - Total number of riders on any form of urban rail (e.g., subway, local trains).
1. **Freight Rail Intermodal Units** - Number of intermodal units (shipping containers and truck trailers) moved by rail per month.
1. **Freight Rail Carloads** - Number of freight cars with cargo loaded per month.
1. **Highway Vehicle Miles Traveled - All Systems** - Total combined vehicle miles traveled on all highway systems.
1. **Highway Fuel Price - Regular Gasoline** - Price of regular gasoline per gallon.
1. **Highway Fuel Price - On-highway diesel** - Price of diesel per gallon.
1. **Personal Spending on Transportation - Transportation Services - Seasonally Adjusted** - Average monthly spending on transportation services.
1. **Personal Spending on Transportation - Gasoline and Other Energy Goods - Seasonally Adjusted** - Average monthly spending on gasoline, diesel, or electricity.
1. **Personal Spending on Transportation - Motor Vehicles and Parts - Seasonally Adjusted** - Average monthly spending on auto shops, repair parts, and services.
1. **Passenger Rail Passengers** - Number of passengers who use the trains every month.
1. **Transportation Services Index - Freight** - Month-to-month performance output measure of freight services.
1. **Transportation Services Index - Passenger** - Month-to-month performance output measure of passenger services.
1. **Real Gross Domestic Product - Seasonally Adjusted** - Inflation-adjusted monetary value of all goods and services produced.
1. **U.S.-Canada Incoming Person Crossings** - Number of people entering the United States from Canada.
1. **U.S.-Canada Incoming Truck Crossings** - Number of trucks entering the United States from Canada.
1. **U.S.-Mexico Incoming Person Crossings** - Number of people entering the United States from Mexico.
1. **U.S.-Mexico Incoming Truck Crossings** - Number of trucks entering the United States from Mexico.
1. **U.S. Airline Traffic - Domestic - Non Seasonally Adjusted** - Amount of airline traffic travelling within the United States.
1. **U.S. Airline Traffic - Total - Non Seasonally Adjusted** - Total amount of airline traffic (domestic and international) involving the United States.
1. **U.S. Airline Traffic - International - Non Seasonally Adjusted** - Amount of airline traffic travelling in and out of the United States.
1. **Transborder - Total North American Freight** - Total freight travelled across North America.
1. **Transborder - U.S. - Mexico Freight** - Total freight travelled across the US-Mexico border into the United States.
1. **Transborder - U.S. - Canada Freight** - Total freight travelled across the US-Canada border into the United States.
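
Reducing a wide table to a chosen set of attributes can be sketched as follows. This is only an illustration on a toy frame with made-up values: `selected_columns` holds a hypothetical four-name excerpt of the 27 attributes listed above, and `full` and `reduced` are names introduced for this example.

```python
import pandas as pd

# Hypothetical excerpt of the attribute list; the real selection has 27 names.
selected_columns = [
    "Index",
    "Date",
    "Freight Rail Carloads",
    "Passenger Rail Passengers",
]

# Toy frame with made-up values and one column that is not in the selection.
full = pd.DataFrame({
    "Index": [0, 1],
    "Date": ["01/01/2020 12:00:00 AM", "02/01/2020 12:00:00 AM"],
    "Freight Rail Carloads": [1000.0, 1100.0],
    "Passenger Rail Passengers": [200.0, 210.0],
    "Auto sales SAAR (millions)": [5.0, 5.1],
})

# Fail early if a selected name is misspelled or absent from the table.
missing = [col for col in selected_columns if col not in full.columns]
if missing:
    raise KeyError(f"Columns not found: {missing}")

# Keep only the selected attributes, in the listed order.
reduced = full[selected_columns].copy()
print(reduced.columns.tolist())
```

Checking the names before indexing makes a typo in a long column label easier to spot than the default pandas error.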





# 1- Look at the big picture

### Frame the problem
1. Supervised learning.
2. A regression task – predict a value.
3. Batch learning.
    - The data set is small.
    - There is no need to continuously adjust to incoming data, because the most recent record is from December 2023.

### Look at the big picture
Predictions will inform transportation operators in the US about future transportation metrics, based on historical data on border crossings, ridership, freight values, prices, and revenue. We will predict the future cost of transportation and the future size of ridership. This will help operators allocate resources and anticipate future demand for transportation services. By understanding the relationship between demand and revenue in the data set, we aim to provide a reference budget that helps operators optimize pricing strategies for transportation services.
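
To make the supervised regression framing concrete, the sketch below separates a target column from the features and holds out the most recent months for evaluation. It is a rough illustration under stated assumptions: the choice of target column, the 80/20 chronological split, and the toy values are all hypothetical and not taken from the project.

```python
import pandas as pd

# Assumption: this spending column stands in for "cost of transportation".
target = "Personal Spending on Transportation - Transportation Services - Seasonally Adjusted"

# Toy monthly data with made-up values.
toy = pd.DataFrame({
    "Date": pd.date_range("2023-01-01", periods=10, freq="MS"),
    "Highway Fuel Price - Regular Gasoline": [3.0 + 0.1 * i for i in range(10)],
    target: [100.0 + 5.0 * i for i in range(10)],
})

# Order the rows by time so the split never trains on the future.
toy = toy.sort_values("Date").reset_index(drop=True)

# Features (X) and the value to predict (y).
X = toy.drop(columns=[target, "Date"])
y = toy[target]

# Batch setting: one fixed split, earliest 80% for training, latest 20% for testing.
split = int(len(toy) * 0.8)
X_train, X_test = X.iloc[:split], X.iloc[split:]
y_train, y_test = y.iloc[:split], y.iloc[split:]

print(len(X_train), len(X_test))
```

A chronological split is used here instead of a random one because the records are monthly observations, and shuffling them would let later months leak into training.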

In [1]:
# Import libraries
# Install any missing library with pip, e.g. pip install numpy

import sklearn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# 2- Load the data

In [2]:
url = "https://raw.githubusercontent.com/HenryCROSS/eecs3401_final_project/main/data/Monthly_Transportation_Statistics.csv"
data = pd.read_csv(url, sep=',')
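# Keep an independent backup; plain assignment would only create a second name for the same frame
data_bak = data.copy()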

In [3]:
data

Unnamed: 0,Index,Date,Air Safety - General Aviation Fatalities,Highway Fatalities Per 100 Million Vehicle Miles Traveled,Highway Fatalities,U.S. Airline Traffic - Total - Seasonally Adjusted,U.S. Airline Traffic - International - Seasonally Adjusted,U.S. Airline Traffic - Domestic - Seasonally Adjusted,Transit Ridership - Other Transit Modes - Adjusted,Transit Ridership - Fixed Route Bus - Adjusted,...,Heavy truck sales SAAR (millions),U.S. Airline Traffic - Total - Non Seasonally Adjusted,Light truck sales SAAR (millions),U.S. Airline Traffic - International - Non Seasonally Adjusted,Auto sales SAAR (millions),U.S. Airline Traffic - Domestic - Non Seasonally Adjusted,Transborder - Total North American Freight,Transborder - U.S. - Mexico Freight,U.S. marketing air carriers on-time performance (percent),Transborder - U.S. - Canada Freight
0,0,01/01/1947 12:00:00 AM,,,,,,,,,...,,,,,,,,,,
1,1,02/01/1947 12:00:00 AM,,,,,,,,,...,,,,,,,,,,
2,2,03/01/1947 12:00:00 AM,,,,,,,,,...,,,,,,,,,,
3,3,04/01/1947 12:00:00 AM,,,,,,,,,...,,,,,,,,,,
4,4,05/01/1947 12:00:00 AM,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
919,919,08/01/2023 12:00:00 AM,43.0,,,,,,,,...,544000.0,,12196000.0,,3099000.0,,,,0.8,
920,920,09/01/2023 12:00:00 AM,17.0,,,,,,,,...,501000.0,,12438000.0,,3220000.0,,,,0.8,
921,921,10/01/2023 12:00:00 AM,28.0,,,,,,,,...,451000.0,,12408000.0,,3038000.0,,,,,
922,922,11/01/2023 12:00:00 AM,12.0,,,,,,,,...,478000.0,,12329000.0,,2991000.0,,,,,
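
The preview shows that many columns are empty for long stretches, so a per-column summary of missing values is a natural next check. The sketch below runs on a toy frame built from a few values visible in the preview; `toy`, `missing_share`, and `first_valid` are hypothetical names, and the same calls could be applied to `data`.

```python
import numpy as np
import pandas as pd

# Toy frame using four dates and two fatality counts that appear in the preview.
toy = pd.DataFrame({
    "Date": [
        "01/01/1947 12:00:00 AM",
        "02/01/1947 12:00:00 AM",
        "08/01/2023 12:00:00 AM",
        "09/01/2023 12:00:00 AM",
    ],
    "Air Safety - General Aviation Fatalities": [np.nan, np.nan, 43.0, 17.0],
    "Highway Fatalities": [np.nan, np.nan, np.nan, np.nan],
})

# Fraction of missing entries per column, most incomplete first.
missing_share = toy.isna().mean().sort_values(ascending=False)

# Row label of the first non-missing entry per column (None if entirely empty).
first_valid = {col: toy[col].first_valid_index() for col in toy.columns}

print(missing_share)
print(first_valid)
```

The first valid index per column indicates when each series starts being recorded, which helps when choosing a cutoff date.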


# 3- Explore and visualize the data to gain insights