# I-94 Westbound Traffic Analysis

This project focuses on analyzing traffic data collected from the westbound lanes of the I-94 Interstate highway. The dataset, provided by John Hogue, contains information on traffic flow along with several potential factors that might influence congestion, such as weather conditions, time of day, and day of the week.

The main goal of this analysis is to identify patterns and indicators that contribute to heavier traffic on I-94, such as seasonal changes or specific weather events. By gaining insights into these factors, we can better understand traffic trends and potentially inform transportation planning or improve traffic management strategies.

### Description of the dataset:

| Variable Name        | Role       | Type          | Description                                                                 | Units   | Missing Values |
|----------------------|------------|---------------|-----------------------------------------------------------------------------|---------|----------------|
| holiday              | Feature    | Categorical   | US National holidays plus regional holiday, Minnesota State Fair           | no      | no             |
| temp                 | Feature    | Continuous    | Average temp in kelvin                                                      | Kelvin  | no             |
| rain_1h              | Feature    | Continuous    | Amount in mm of rain that occurred in the hour                              | mm      | no             |
| snow_1h              | Feature    | Continuous    | Amount in mm of snow that occurred in the hour                              | mm      | no             |
| clouds_all           | Feature    | Integer       | Percentage of cloud cover                                                  | %       | no             |
| weather_main         | Feature    | Categorical   | Short textual description of the current weather                           | no      | no             |
| weather_description  | Feature    | Categorical   | Longer textual description of the current weather                          | no      | no             |
| date_time            | Feature    | Date          | Hour of the data collected in local CST time                               | no      | no             |
| traffic_volume       | Target     | Integer       | Hourly I-94 ATR 301 reported westbound traffic volume                      | no      | no             |

**Note:** The "no" value under "Units" indicates the variable has no applicable unit (e.g., categorical or date types). The "no" under "Missing Values" confirms no missing data in the dataset.

In [1]:
# Import necessary libraries for data manipulation
import pandas as pd

# Load the dataset from a CSV file into a DataFrame
traffic = pd.read_csv('Metro_Interstate_Traffic_Volume.csv')

FileNotFoundError: [Errno 2] No such file or directory: 'Metro_Interstate_Traffic_Volume.csv'
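
The `FileNotFoundError` above means the CSV is not in the notebook's working directory. A minimal sketch of a guarded load follows, assuming the file keeps its original name; `DATA_PATH` and `load_traffic` are hypothetical names introduced here for illustration, not part of the original notebook.

```python
from pathlib import Path

import pandas as pd

# Assumed location of the CSV; adjust to wherever the file was downloaded.
DATA_PATH = Path("Metro_Interstate_Traffic_Volume.csv")


def load_traffic(path=DATA_PATH):
    """Read the traffic CSV, failing early with a clearer message if it is missing."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(
            f"{path} not found; place the CSV in {Path.cwd()} or pass its full path."
        )
    # parse_dates converts the 'date_time' column to datetime64 while reading
    return pd.read_csv(path, parse_dates=["date_time"])
```

With the file in place, `traffic = load_traffic()` would stand in for the plain `pd.read_csv` call.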

In [None]:
# We examine the first and last five rows
traffic

In [None]:
# Summarize column names, dtypes, and non-null counts
traffic.info()

In [None]:
# Count the NaN values in each column
traffic.isnull().sum()

### Initial exploratory data visualization

In [None]:
# Import necessary library for data visualization
import matplotlib.pyplot as plt

# A Jupyter command that ensures matplotlib plots are embedded directly in the notebook instead of a pop-up window. 
%matplotlib inline

### Distribution of the `traffic_volume` column

In [None]:
# Construct a histogram to examine the distribution of values in the traffic volume column
traffic['traffic_volume'].plot.hist(bins=30,
                                    color='lightgreen',
                                    edgecolor='black')


# Label axes and title to clarify the histogram's purpose  
plt.xlabel('Traffic Volume')
plt.ylabel('Frequency')
plt.title('Distribution of Hourly Traffic Volume on I-94 Westbound')

# Render the plot to analyze traffic patterns 
plt.show()

In [None]:
# Inspect the summary statistics for a side-by-side comparison with the histogram
traffic['traffic_volume'].describe()

### Key observations:
**Bimodal Distribution:** Two peaks (0–1,000 and ~5,000 vehicles/hour) imply two dominant traffic scenarios:
- Low Traffic: Possibly nighttime or off-peak hours.
- High Traffic: Rush hours (e.g., 8 AM or 5 PM).
### Possible actions:
- Investigate the relationship between the time of day and the traffic volume.
- Check whether traffic is higher in the middle of the week and lower on the weekend, or vice versa.

# Traffic Volume Analysis: Day vs. Night (I-94 Westbound)  

## Methodology  
1. **Data Preparation**:  
   - Converted `date_time` to datetime format.  
   - Segmented data into:  
     - **Daytime**: 7 AM – 7 PM (rush hour focus).  
     - **Nighttime**: 7 PM – 7 AM (off-peak focus).  

In [None]:
# Convert 'date_time' to datetime to enable time-based operations  
traffic['date_time'] = pd.to_datetime(traffic['date_time'])  

# Extract hour from datetime to categorize day/night  
traffic['hour'] = traffic['date_time'].dt.hour  

# Split data into daytime (7 AM–7 PM) and nighttime (7 PM–7 AM)  
daytime_data = traffic[(traffic['hour'] >= 7) & (traffic['hour'] < 19)].copy()  
nighttime_data = traffic[(traffic['hour'] >= 19) | (traffic['hour'] < 7)].copy()  
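
As an alternative to two separate DataFrames, the same 7 AM/7 PM boundary can be stored as a label column so both segments are summarized in one call. This is only a sketch on a tiny synthetic frame; `demo` and the `period` column are hypothetical names, and the values are illustrative rather than taken from the dataset.

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the real data (values are illustrative only)
demo = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2016-01-01 06:00", "2016-01-01 07:00",
        "2016-01-01 18:00", "2016-01-01 19:00",
    ]),
    "traffic_volume": [1000, 5000, 4500, 2500],
})

hour = demo["date_time"].dt.hour
# Hours 7..18 count as daytime; 19..23 and 0..6 count as nighttime
demo["period"] = np.where((hour >= 7) & (hour < 19), "day", "night")

# One call summarizes both segments side by side
summary = demo.groupby("period")["traffic_volume"].describe()
print(summary)
```

The boundary rows (7:00 and 19:00) show that 7 AM falls into the daytime segment and 7 PM into the nighttime segment, matching the filters above.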

2. **Visual Comparison**:  
   - Histograms with identical axes for direct comparison.   

In [None]:
# Set up the figure size for the entire plot (width, height)
plt.figure(figsize=(8,4))

# -------------------------------
# Subplot 1: Daytime Traffic
# -------------------------------
plt.subplot(1,2,1) # 1 row, 2 columns, activate 1st subplot

# Create histogram for daytime traffic data
plt.hist(daytime_data['traffic_volume'],
         bins=30,
         color='skyblue',        # Light blue fill
         edgecolor='black')      # Black borders for bars

# Labels and formatting
plt.title('Day Traffic Volume')  # Subplot title
plt.xlabel('Traffic Volume')     # X-axis label
plt.ylabel('Frequency')          # Y-axis label
plt.xlim(0, max(traffic['traffic_volume']) + 1000)  # Set x-axis range (aligns both plots)
plt.ylim(0,5000)                 # Set consistent y-axis range for comparison

# -------------------------------
# Subplot 2: Nighttime Traffic
# -------------------------------
plt.subplot(1,2,2) # Activate 2nd subplot

# Create histogram for nighttime traffic data
plt.hist(nighttime_data['traffic_volume'],
         bins=30,
         color='salmon',         # Light red/orange fill
         edgecolor='black')      # Black borders for bars

# Labels and formatting
plt.title('Night Traffic Volume')
plt.xlabel('Traffic Volume')
plt.ylabel('Frequency')
plt.xlim(0, max(traffic['traffic_volume']) + 1000)  # Match x-axis with daytime plot
plt.ylim(0,5000)                 # Match y-axis with daytime plot
# -------------------------------
# Final Adjustments
# -------------------------------
plt.tight_layout()               # Automatically adjust spacing between subplots
plt.show()                       # Render the entire figure
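
The manual `xlim`/`ylim` calls above keep the two panels comparable. A possible variation, sketched here under the assumption that identical bin edges are also wanted, lets matplotlib share the axes through `plt.subplots(sharex=True, sharey=True)`. The function name `plot_day_night_hist` is hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np


def plot_day_night_hist(day_values, night_values, bins=30):
    """Draw two histograms that share both axes and the same bin edges."""
    # Common bin edges so the bars in both panels cover identical ranges
    upper = max(np.max(day_values), np.max(night_values))
    edges = np.linspace(0, upper, bins + 1)

    fig, (ax_day, ax_night) = plt.subplots(
        1, 2, figsize=(8, 4), sharex=True, sharey=True
    )
    ax_day.hist(day_values, bins=edges, color="skyblue", edgecolor="black")
    ax_day.set_title("Day Traffic Volume")
    ax_day.set_ylabel("Frequency")
    ax_night.hist(night_values, bins=edges, color="salmon", edgecolor="black")
    ax_night.set_title("Night Traffic Volume")
    for ax in (ax_day, ax_night):
        ax.set_xlabel("Traffic Volume")
    fig.tight_layout()
    return fig, (ax_day, ax_night)
```

In the notebook this would be called as `plot_day_night_hist(daytime_data['traffic_volume'], nighttime_data['traffic_volume'])`, followed by `plt.show()`.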

In [None]:
daytime_data['traffic_volume'].describe()

In [None]:
nighttime_data['traffic_volume'].describe()

## Key Findings: Day vs. Night Traffic Volume

### Daytime Traffic
#### 1- Dominant Peak Hours:
- Traffic volume peaks at **~8,000 vehicles/hour**, indicating heavy congestion during the **daytime window (7 AM–7 PM)**, which contains both rush hours.
- Two distinct peaks are consistent with separate morning **(8–9 AM)** and evening **(5–6 PM)** rush periods; the hourly breakdown later in the analysis checks this directly.

#### 2- High Variability:
- Broad distribution reflects fluctuating traffic (e.g., commuter patterns, weather disruptions).

---

### Nighttime Traffic
#### 1- Consistently Lower Volume:
- Traffic rarely exceeds **~6,000 vehicles/hour**, with most observations clustered below **4,000 vehicles/hour.**
- Single peak suggests steady, low-volume flow (e.g., night shifts, freight transport).

#### 2- Predictable Patterns:
- Narrow distribution implies fewer disruptions compared to daytime.
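
The "broad" versus "narrow" comparison can also be checked numerically. The sketch below computes the standard deviation, interquartile range, and coefficient of variation for two series; `spread_summary` is a hypothetical helper, and the sample numbers are illustrative stand-ins, not values from the dataset.

```python
import pandas as pd


def spread_summary(values):
    """Return std, IQR and coefficient of variation for a numeric Series."""
    q1, q3 = values.quantile([0.25, 0.75])
    return pd.Series({
        "std": values.std(),
        "iqr": q3 - q1,
        "cv": values.std() / values.mean(),
    })


# Illustrative numbers only, not taken from the dataset
day_demo = pd.Series([3000, 4500, 5000, 5500, 6500])
night_demo = pd.Series([400, 800, 1200, 2000, 3000])

comparison = pd.DataFrame({
    "day": spread_summary(day_demo),
    "night": spread_summary(night_demo),
})
print(comparison)
```

In the notebook, passing `daytime_data['traffic_volume']` and `nighttime_data['traffic_volume']` instead of the demo series would show whether the night distribution really is the narrower one.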

# Time Indicators

Our goal is to find indicators of heavy traffic, so **we decided to only focus on the daytime data** moving forward.  
One of the possible indicators of heavy traffic is time. **There might be more people on the road in a certain month, on a certain day, or at a certain time of the day.**  
We're going to look at a few line plots showing how the traffic volume changed according to the following parameters:

- Month
- Day of the week
- Time of day

## Unit 1 : Month

In [None]:
# Extract the month from the 'date_time' column and create a new 'month' column
daytime_data['month'] = daytime_data['date_time'].dt.month

# Group data by month and calculate mean values for all numeric columns
by_month = daytime_data.groupby('month').mean(numeric_only=True)

# Display only the average traffic volume per month
by_month['traffic_volume']

In [None]:
# Create line plot for the monthly pattern (the Series index supplies the x-axis)
by_month['traffic_volume'].plot.line()

plt.title('Mean Traffic Volume by Month')
plt.xlabel('Months')
plt.ylabel('Average Vehicles/Hour')
plt.grid(True)
plt.show()

## Observations
### Monthly Traffic Pattern

#### 1. Peak Seasons:
- Traffic is highest from **March (3)** to **October (10)**, at roughly 4,850-4,920 vehicles/hour on average, apart from a dip in July.
- Peak month: **May (5) at 4,911 vehicles/hour**.
#### 2. Low Seasons:
- Sharp drop in **July (7)** (4,595) - possibly due to summer vacation effect.
- Lowest in **December (12)** (4,374) - holiday season/winter weather impact.
#### 3. Commuter Pattern:
- Consistent high volumes March-October suggest **work-related commuting** dominates.
- The roughly 11% drop from May (4,911) to December (4,374) **implies seasonal behavior changes**.
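
The relative drop between two monthly means can be recomputed from the figures quoted in the observations above. A minimal sketch, with `pct_drop` as a hypothetical helper name:

```python
def pct_drop(reference, value):
    """Percentage by which `value` falls below `reference`."""
    return (reference - value) / reference * 100


# Monthly means quoted in the observations above
may_mean = 4911
december_mean = 4374

print(f"May -> December: {pct_drop(may_mean, december_mean):.1f}% lower")
# prints "May -> December: 10.9% lower"
```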

## Unit 2 : Day of the week

In [None]:
# Extract day of week (0=Monday, 6=Sunday) from datetime column
daytime_data['dayofweek'] = daytime_data['date_time'].dt.dayofweek

# Group data by day of week and calculate mean traffic volume
by_dayofweek = daytime_data.groupby('dayofweek').mean(numeric_only=True)

# Display average traffic volume for each weekday
by_dayofweek['traffic_volume'] # 0 is Monday, 6 is Sunday

In [None]:
# Create line plot for weekly pattern
by_dayofweek['traffic_volume'].plot.line()  # x-axis comes from the Series index

plt.title('Mean Traffic Volume by Day of Week') 
plt.xlabel('Day of Week') 
plt.ylabel('Average Vehicles/Hour')
plt.xticks(range(0,7), ['Mon','Tue','Wed','Thu','Fri','Sat','Sun'])  # Convert numbers to labels
plt.grid(True)
plt.show()

---
## Observations
### Weekly Traffic Pattern Analysis

- **Workday Surge** 📈
  - Peak traffic on **Thursday (3)** at **5,311 vehicles/hour**
  - Consistent high volume **Monday-Friday (~4,900-5,300)**

- **Weekend Drop 📉**
  - About 26% below the weekday peak on **Saturday (5)** (3,927)
  - About 35% below the weekday peak on **Sunday (6)** (3,436)
---
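
Because the numeric `dayofweek` codes are easy to misread (0 is Monday, so 3 is Thursday), it can help to attach day names and a business-day/weekend label before averaging. The sketch below uses a few illustrative rows; `demo`, `day_name`, and `day_type` are hypothetical names, and the volumes are made up for the example.

```python
import numpy as np
import pandas as pd

# Illustrative rows only; the real analysis would use daytime_data
demo = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2016-02-01 08:00",  # Monday
        "2016-02-04 08:00",  # Thursday
        "2016-02-06 08:00",  # Saturday
        "2016-02-07 08:00",  # Sunday
    ]),
    "traffic_volume": [4900, 5300, 3900, 3400],
})

# Human-readable day names alongside the numeric codes
demo["day_name"] = demo["date_time"].dt.day_name()
# dayofweek 5 and 6 are Saturday and Sunday
demo["day_type"] = np.where(
    demo["date_time"].dt.dayofweek >= 5, "weekend", "business day"
)

by_day_type = demo.groupby("day_type")["traffic_volume"].mean()
gap_pct = (
    (by_day_type["business day"] - by_day_type["weekend"])
    / by_day_type["business day"] * 100
)
print(by_day_type)
print(f"Weekend mean is {gap_pct:.1f}% below the business-day mean")
```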

## Unit 3 : Time of day (Hour)

We'll now generate a line plot for the time of day.  
The weekends, however, will drag down the average values, so we're going to look at the averages separately.  
To do that, we'll start by splitting the data based on the day type: business day or weekend.

In [None]:
daytime_data['hour'] = daytime_data['date_time'].dt.hour
business_days = daytime_data[daytime_data['dayofweek'] <= 4].copy() # 4 == Friday
weekend = daytime_data[daytime_data['dayofweek'] >= 5].copy() # 5 == Saturday

by_hour_business = business_days.groupby('hour').mean(numeric_only=True)
by_hour_weekend = weekend.groupby('hour').mean(numeric_only=True)

print('Average number of vehicles each hour on business days')
print(by_hour_business['traffic_volume'])
print('\n')
print('Average number of vehicles each hour on weekend')
print(by_hour_weekend['traffic_volume'])

In [None]:
plt.figure(figsize=(6, 3))

# Business Days Plot
plt.subplot(1,2,1)

plt.plot(
    by_hour_business['traffic_volume'],
    marker='o',
    linestyle='-'
)
plt.grid(True)
plt.title('Mean Traffic Volume per \n Hour on Business day')
plt.xlabel('Time of Day (Hour)')
plt.ylabel('Average Vehicles/Hour')
plt.ylim(1500, 7000)  # Keep scales identical for fair comparison

# Weekend Plot
plt.subplot(1,2,2)

plt.plot(
    by_hour_weekend['traffic_volume'],
    color='salmon',
    marker='o',
    linestyle='-'
)
plt.grid(True)
plt.title('Mean Traffic Volume per \n Hour on weekend')
plt.xlabel('Time of Day (Hour)')
plt.ylabel('Average Vehicles/Hour')
plt.ylim(1500, 7000)  # Keep scales identical for fair comparison

# Shared formatting
plt.suptitle('Hourly Traffic Comparison', fontweight='bold')
plt.tight_layout()
plt.show()

## Observations
### Hourly Traffic Patterns

#### 🏙️ Business Days
- **Rush Hour Peaks:**

  - Morning surge: **7-9 AM** (~6,500 vehicles)

  - Evening surge: **4-6 PM** (~6,200 vehicles)

- **Commuter Pattern:** Traffic mirrors standard work hours (9 AM–5 PM baseline).

#### 🌇 Weekends
- **Leisure Travel:**

  - Gradual midday peak (**12 PM–4 PM** ~4,500 vehicles)

  - 30% lower volumes compared to business days

- **Early morning:** Minimal traffic (<2,000 vehicles at **7 AM**)
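
Peak hours can be read off programmatically rather than from the plot. The sketch below applies `idxmax` to morning and evening slices of an hourly series; the series values are illustrative placeholders, and `by_hour_demo` is a hypothetical stand-in for `by_hour_business['traffic_volume']`.

```python
import pandas as pd

# Illustrative hourly means (index = hour of day); not the dataset's values
by_hour_demo = pd.Series(
    [6000, 5500, 4900, 4400, 4600, 4700, 4800, 5100, 5600, 6200, 5800, 4400],
    index=pd.RangeIndex(7, 19, name="hour"),
)

# .loc slices by label and includes both endpoints
morning = by_hour_demo.loc[7:11]   # 7 AM through 11 AM
evening = by_hour_demo.loc[15:18]  # 3 PM through 6 PM

print("Morning peak hour:", morning.idxmax(), "with", morning.max(), "vehicles")
print("Evening peak hour:", evening.idxmax(), "with", evening.max(), "vehicles")
```

The window boundaries (7–11 and 15–18) are assumptions chosen for the example; other splits of the day would work the same way.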

# Summary of Observations

## 📅 **Monthly Patterns**  
- **Peak Months**: March–October (~4,850–4,920 vehicles/hour on average), peaking in **May** (4,911).  
- **Low Months**: December (4,374) and July (4,595) due to holidays/summer vacations.  
- **Insight**: Seasonal commuter dominance with predictable drops during holidays.  

---

## 📆 **Weekly Patterns**  
- **Workdays**:  
  - Highest on **Thursday** (5,311 vehicles/hour).  
  - Consistent high volume Monday-Friday (~4,900–5,300).  
- **Weekends**:  
  - 26–35% below the weekday peak (Saturday: 3,927; Sunday: 3,436).  
- **Insight**: Traffic aligns with traditional work schedules.  

---

## 🕒 **Hourly Patterns**  
- **Business Days**:  
  - **Morning Rush**: 7–9 AM (~6,500 vehicles).  
  - **Evening Rush**: 4–6 PM (~6,200 vehicles).  
- **Weekends**:  
  - Midday leisure peak: 12–4 PM (~4,500 vehicles).  
- **Insight**: Clear commuter vs. leisure-driven patterns.  

---

**Conclusion**:  
Traffic on I-94 Westbound follows **predictable human behavior** tied to work, leisure, and seasons.

# Weather Indicators

The dataset provides a few useful weather columns: `temp`, `rain_1h`, `snow_1h`, `clouds_all`, `weather_main`, and `weather_description`.
Four of these are numerical (`temp`, `rain_1h`, `snow_1h`, and `clouds_all`), so let's start by looking at their correlation with `traffic_volume`.

In [None]:
daytime_data[['traffic_volume', 'temp', 'rain_1h', 'snow_1h', 'clouds_all']].corr()

### Key Observations  
1. **Weak Temperature Correlation**:  
   - The `temp` column has the strongest correlation with traffic volume (r = 0.128), but this is still **weak**.  
   - Higher temperatures *slightly* correlate with busier roads (e.g., summer months).  

2. **No Meaningful Weather Relationships**:  
   - Precipitation (`rain_1h`, `snow_1h`) and cloud cover (`clouds_all`) show **no practical correlation** with traffic volume.  

3. **Scatter Plot Insights**:  
   - A plot of `temp` vs. `traffic_volume` (not shown here) would likely show **no clear linear pattern**, confirming the weak correlation.  
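
The scatter plot mentioned in the last point could be produced along these lines. This is a sketch only: `scatter_with_r` is a hypothetical helper, and the synthetic `demo` frame stands in for `daytime_data` so that the block runs on its own.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def scatter_with_r(df, x="temp", y="traffic_volume"):
    """Scatter-plot two columns and show their Pearson r in the title."""
    r = df[x].corr(df[y])  # Series.corr uses Pearson correlation by default
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.scatter(df[x], df[y], s=5, alpha=0.3)
    ax.set_xlabel(x)
    ax.set_ylabel(y)
    ax.set_title(f"{y} vs. {x} (r = {r:.3f})")
    return fig, ax, r


# Synthetic stand-in so the sketch is self-contained (not real measurements)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "temp": rng.normal(285, 10, 500),
    "traffic_volume": rng.integers(1500, 7000, 500),
})
fig, ax, r = scatter_with_r(demo)
```

In the notebook, `scatter_with_r(daytime_data)` followed by `plt.show()` would draw the real relationship.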

---

## Conclusion  
**No weather column is a reliable indicator of heavy traffic** based on this dataset. The strongest correlation (`temp`) is too weak to draw actionable conclusions.  

To see if we can find more useful data, we'll look next at the categorical weather-related columns: `weather_main` and `weather_description`.

In [None]:
by_weather_main = daytime_data.groupby('weather_main').mean(numeric_only=True)
by_weather_description = daytime_data.groupby('weather_description').mean(numeric_only=True)

In [None]:
by_weather_main

In [None]:
plt.bar(
    by_weather_main.index,
    by_weather_main['traffic_volume']
)
plt.title('Average Traffic Volume by Weather Condition')
plt.xlabel('Weather Type')
plt.ylabel('Average Traffic Volume')
plt.xticks(rotation = 30)
plt.grid(True)
plt.show()

## Key Observations from `weather_main` Bar Plot  
- **Near-Threshold Volumes**:  
  - Weather types like **Rain** and **Clouds** show average traffic volumes close to **~4,900 cars**, approaching but not exceeding 5,000.  
  - **Clear** weather averages **~4,750 cars**.  

- **No Clear Heavy Traffic Indicators**:  
  - No weather type in `weather_main` averages above the **5,000-car** threshold for "heavy traffic."  

In [None]:
by_weather_description

In [None]:
plt.figure(figsize=(7,7))
plt.barh(
    by_weather_description.index,
    by_weather_description['traffic_volume'], 
    color = 'salmon'
)
plt.title('Average Traffic Volume by Weather Description')
plt.ylabel('Weather Description')
plt.xlabel('Average Traffic Volume')
plt.yticks(fontsize = 8)
plt.grid(True)

plt.show()

### Key Findings  
1. **Threshold Exceedance in Granular Weather Types**:  
   - **Highest Traffic Volumes**:  
     - `shower snow`: ~5,200 cars  
     - `thunderstorm with heavy rain`: ~5,100 cars  
   - These are the **only weather descriptions** exceeding the 5,000-car threshold.  

2. **Rarity of High-Traffic Weather Events**:  
   - Both `shower snow` and `thunderstorm with heavy rain` are **rare occurrences**, making them unreliable as frequent indicators.  

3. **Broad Weather Categories vs. Specific Descriptions**:  
   - Generic `weather_main` categories (e.g., "Rain", "Clouds") never average above the heavy-traffic threshold.  
   - Detailed `weather_description` entries reveal niche patterns but lack consistency.  
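
The rarity point can be made concrete by reporting the number of observations next to each mean. The sketch below does this with `groupby(...).agg(['mean', 'count'])` on a few illustrative rows; the 5,000 threshold comes from the text above, while the minimum of 3 observations and the `demo` data are assumptions made for the example.

```python
import pandas as pd

# Illustrative rows only; replace `demo` with daytime_data in the notebook
demo = pd.DataFrame({
    "weather_description": ["sky is clear"] * 4 + ["light rain"] * 3 + ["shower snow"],
    "traffic_volume": [4700, 4800, 4600, 4900, 4900, 5000, 4800, 5200],
})

# Mean volume and number of hourly observations per description
stats = (
    demo.groupby("weather_description")["traffic_volume"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)

# Hypothetical cut-offs: mean above 5,000 and at least 3 observations
heavy = stats[stats["mean"] > 5000]
heavy_and_frequent = heavy[heavy["count"] >= 3]
print(stats)
```

A description that tops the ranking with only a handful of observations, like `shower snow` in this toy data, drops out once the minimum count is applied.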

---

### Reliability of Weather Indicators  
- **No standalone weather type** (main or description) reliably predicts heavy traffic.  
- Observed high-traffic weather events:  
  - **Context-dependent**: Likely tied to **unplanned disruptions** (e.g., accidents during storms).  
  - **Not causal**: Correlation ≠ causation (e.g., traffic jams during snow may stem from accidents, not snow itself).  

---

### Project Conclusion  
**Weather alone is not a robust predictor of heavy traffic**. To improve accuracy:  
1. **Combine with temporal data**: Analyze "rush hour + heavy rain".  
2. **Integrate incident reports**: Pair weather with accident/construction data.  
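
As a starting point for the first suggestion, weather can be crossed with a rush-hour flag in a pivot table. This sketch assumes rush-hour windows of 7–9 AM and 4–6 PM (hours 7, 8, 16, 17); the `demo` rows and the `period` column are hypothetical and the volumes are illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative rows only; the notebook would use the full `traffic` DataFrame
demo = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2016-03-01 08:00", "2016-03-02 08:00",
        "2016-03-01 13:00", "2016-03-02 13:00",
    ]),
    "weather_main": ["Rain", "Clear", "Rain", "Clear"],
    "traffic_volume": [6100, 6400, 4700, 4900],
})

hour = demo["date_time"].dt.hour
# Assumed rush-hour windows: 7-9 AM and 4-6 PM
demo["period"] = np.where(hour.isin([7, 8, 16, 17]), "rush", "off-peak")

# Mean volume for every weather/period combination
table = demo.pivot_table(
    index="weather_main",
    columns="period",
    values="traffic_volume",
    aggfunc="mean",
)
print(table)
```

Comparing rows within the same column then shows whether a weather type changes volume once the time of day is held fixed.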

> *"Data whispers patterns; context shouts insights."* 🌧️🚗