# Commute Carbon Footprint Analysis: Data Exploration

This notebook explores the dataset collected to analyze the impact of departure time and traffic conditions on commute carbon footprint.

## Project Overview

This project analyzes how **departure time** and **traffic conditions** impact the **carbon footprint** of daily commutes between home and campus. The goal is to identify patterns and factors that contribute to higher emissions and suggest strategies to reduce environmental impact.

### Dataset Description

- **Fuel Consumption**: Tracked using a car app, measured in liters per trip
- **Traffic Conditions**: Collected using GPS data from Google Maps, encoded as an integer level (0 = low, 1 = moderate, 2 = high)
- **Trip Time**: Duration of each trip, measured in minutes
- **Distance Traveled**: Recorded using GPS data, measured in kilometers
- **Carbon Emissions**: Calculated using the formula: CO2 Emissions (kg) = Fuel Consumption (liters) × 2.31 kg CO2/liter
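
As a quick illustration of the emissions formula, the sketch below computes CO2 for a single trip of 39.0 km at 4.9 L/100 km. The helper names are hypothetical, and deriving fuel used from distance and consumption rate is an assumption (the description only says fuel consumption comes from a car app), although it matches the sample rows of the dataset.

```python
# Emission factor from the dataset description (kg CO2 per liter of fuel).
CO2_KG_PER_LITER = 2.31


def fuel_used_liters(distance_km: float, l_per_100km: float) -> float:
    """Fuel burned on a trip, from distance and consumption rate (L/100 km).

    Assumption: fuel used is distance times consumption rate.
    """
    return distance_km * l_per_100km / 100.0


def co2_emissions_kg(fuel_liters: float) -> float:
    """CO2 emitted for a given fuel volume, using the fixed factor above."""
    return fuel_liters * CO2_KG_PER_LITER


# Worked example: a 39.0 km trip at 4.9 L/100 km.
fuel = fuel_used_liters(39.0, 4.9)
co2 = co2_emissions_kg(fuel)
print(f"fuel = {fuel:.4f} L, CO2 = {co2:.5f} kg")
```

Because the factor is a constant, emissions in this dataset are a rescaled copy of fuel used, so the two columns carry the same information.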

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Set plot style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('deep')
plt.rcParams['figure.figsize'] = [10, 6]
plt.rcParams['figure.dpi'] = 100

## Loading and Inspecting the Data

In [2]:
# Load data
df = pd.read_csv('commute_data.csv')

# Display the first few rows
print("First 5 rows of the dataset:")
df.head()

First 5 rows of the dataset:


| | date | departure_time | trip_direction | trip_duration | distance_km | fuel_efficiency_l_per_100km | fuel_used_l | traffic_condition | day_of_week | co2_emissions_kg |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2025-03-13 | 13:10 | Home to Campus | 47 | 39.0 | 4.9 | 1.911 | 1 | Thursday | 4.41441 |
| 1 | 2025-03-13 | 21:52 | Campus to Home | 29 | 33.8 | 3.8 | 1.2844 | 1 | Thursday | 2.966964 |
| 2 | 2025-03-17 | 13:32 | Home to Campus | 48 | 40.0 | 4.1 | 1.64 | 1 | Monday | 3.7884 |
| 3 | 2025-03-18 | 13:39 | Home to Campus | 42 | 39.4 | 5.0 | 1.97 | 2 | Tuesday | 4.5507 |
| 4 | 2025-03-26 | 18:27 | Campus to Home | 54 | 40.0 | 4.9 | 1.96 | 1 | Wednesday | 4.5276 |


In [3]:
# Check the shape of the dataset
print(f"Dataset dimensions: {df.shape[0]} rows and {df.shape[1]} columns")

# Check column types
print("\nData types:")
df.dtypes

Dataset dimensions: 33 rows and 10 columns

Data types:


date                            object
departure_time                  object
trip_direction                  object
trip_duration                    int64
distance_km                    float64
fuel_efficiency_l_per_100km    float64
fuel_used_l                    float64
traffic_condition                int64
day_of_week                     object
co2_emissions_kg               float64
dtype: object

In [4]:
# Check for missing values
print("Missing values in each column:")
df.isnull().sum()

Missing values in each column:


date                           0
departure_time                 0
trip_direction                 0
trip_duration                  0
distance_km                    0
fuel_efficiency_l_per_100km    0
fuel_used_l                    0
traffic_condition              0
day_of_week                    0
co2_emissions_kg               0
dtype: int64
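
Beyond missing values, it can be worth checking that the derived columns agree with their inputs. The sketch below does this for the five rows shown by `df.head()`. The `sample` frame is rebuilt by hand so the example is self-contained. The relation `fuel_used_l = distance_km × fuel_efficiency_l_per_100km / 100` is an assumption inferred from those rows, not something the dataset description states.

```python
import numpy as np
import pandas as pd

CO2_KG_PER_LITER = 2.31  # factor stated in the dataset description

# First five rows of the dataset, copied from the df.head() output.
sample = pd.DataFrame({
    "distance_km": [39.0, 33.8, 40.0, 39.4, 40.0],
    "fuel_efficiency_l_per_100km": [4.9, 3.8, 4.1, 5.0, 4.9],
    "fuel_used_l": [1.911, 1.2844, 1.64, 1.97, 1.96],
    "co2_emissions_kg": [4.41441, 2.966964, 3.7884, 4.5507, 4.5276],
})

# Recompute the derived columns from their inputs.
expected_fuel = sample["distance_km"] * sample["fuel_efficiency_l_per_100km"] / 100
expected_co2 = sample["fuel_used_l"] * CO2_KG_PER_LITER

# Boolean arrays marking rows whose stored value matches the recomputed one
# (np.isclose tolerates floating-point rounding).
fuel_ok = np.isclose(sample["fuel_used_l"], expected_fuel)
co2_ok = np.isclose(sample["co2_emissions_kg"], expected_co2)

print(f"fuel_used_l consistent: {fuel_ok.all()}")
print(f"co2_emissions_kg consistent: {co2_ok.all()}")
```

On the full dataset the same comparison could be run against `df` directly; any row where the check fails would point to a data-entry or unit problem.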

## Data Preprocessing

In [5]:
# Convert date to datetime and extract features
df['date'] = pd.to_datetime(df['date'])

# Extract time components from departure_time
df['departure_hour'] = pd.to_datetime(df['departure_time'], format='%H:%M').dt.hour

# Flag peak hours by departure hour: 07:00-09:59 and 17:00-19:59 (both end hours inclusive)
df['peak_hour'] = ((df['departure_hour'] >= 7) & (df['departure_hour'] <= 9)) | \
                  ((df['departure_hour'] >= 17) & (df['departure_hour'] <= 19))

# Calculate average speed (km/h)
df['avg_speed_kmh'] = df['distance_km'] / (df['trip_duration'] / 60)

# Calculate CO2 per km (emissions efficiency)
df['co2_per_km'] = df['co2_emissions_kg'] / df['distance_km']

# Ensure traffic_condition is consistent (convert strings to numeric if needed)
if df['traffic_condition'].dtype == 'object':
    traffic_mapping = {'low': 0, 'moderate': 1, 'high': 2}
    df['traffic_condition'] = df['traffic_condition'].map(traffic_mapping)

# Show the updated dataframe
df.head()

| | date | departure_time | trip_direction | trip_duration | distance_km | fuel_efficiency_l_per_100km | fuel_used_l | traffic_condition | day_of_week | co2_emissions_kg | departure_hour | peak_hour | avg_speed_kmh | co2_per_km |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2025-03-13 | 13:10 | Home to Campus | 47 | 39.0 | 4.9 | 1.911 | 1 | Thursday | 4.41441 | 13 | False | 49.787234 | 0.11319 |
| 1 | 2025-03-13 | 21:52 | Campus to Home | 29 | 33.8 | 3.8 | 1.2844 | 1 | Thursday | 2.966964 | 21 | False | 69.931034 | 0.08778 |
| 2 | 2025-03-17 | 13:32 | Home to Campus | 48 | 40.0 | 4.1 | 1.64 | 1 | Monday | 3.7884 | 13 | False | 50.0 | 0.09471 |
| 3 | 2025-03-18 | 13:39 | Home to Campus | 42 | 39.4 | 5.0 | 1.97 | 2 | Tuesday | 4.5507 | 13 | False | 56.285714 | 0.1155 |
| 4 | 2025-03-26 | 18:27 | Campus to Home | 54 | 40.0 | 4.9 | 1.96 | 1 | Wednesday | 4.5276 | 18 | True | 44.444444 | 0.11319 |
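
The peak-hour flag in the preprocessing cell compares whole hours only, so a 09:30 or 19:45 departure counts as peak. If the windows are meant to end exactly at 09:00 and 19:00 (an assumption about the intent, not something the notebook states), a minute-resolution variant could look like the sketch below. `is_peak` is a hypothetical helper, not part of the original code.

```python
import pandas as pd


def is_peak(departure_time: pd.Series) -> pd.Series:
    """Flag departures inside [07:00, 09:00) or [17:00, 19:00).

    The half-open windows are an assumption about what "7-9 AM and 5-7 PM"
    means; adjust the bounds if the intended windows differ.
    """
    parsed = pd.to_datetime(departure_time, format="%H:%M")
    minutes = parsed.dt.hour * 60 + parsed.dt.minute  # minutes since midnight
    morning = (minutes >= 7 * 60) & (minutes < 9 * 60)
    evening = (minutes >= 17 * 60) & (minutes < 19 * 60)
    return morning | evening


times = pd.Series(["13:10", "18:27", "08:59", "09:30", "19:45"])
print(is_peak(times).tolist())
```

With these bounds, 09:30 and 19:45 are no longer flagged, while 08:59 and 18:27 still are. Which definition is appropriate depends on when congestion actually occurs on this route.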


## Basic Statistics

In [6]:
# Overall statistics
print("Overall Statistics:")
stats_summary = df[['trip_duration', 'distance_km', 'fuel_efficiency_l_per_100km', 
                   'fuel_used_l', 'co2_emissions_kg', 'avg_speed_kmh', 'co2_per_km']].describe()
stats_summary

Overall Statistics:


| | trip_duration | distance_km | fuel_efficiency_l_per_100km | fuel_used_l | co2_emissions_kg | avg_speed_kmh | co2_per_km |
|---|---|---|---|---|---|---|---|
| count | 33.0 | 33.0 | 33.0 | 33.0 | 33.0 | 33.0 | 33.0 |
| mean | 54.060606 | 39.587879 | 4.306061 | 1.726139 | 3.987382 | 47.546528 | 0.10073 |
| std | 16.872573 | 1.249939 | 0.601529 | 0.245322 | 0.566693 | 12.504464 | 0.014039 |
| min | 29.0 | 33.8 | 3.1 | 1.2152 | 2.807112 | 25.578947 | 0.07161 |
| 25% | 45.0 | 39.4 | 3.8 | 1.521 | 3.51351 | 38.285714 | 0.09009 |
| 50% | 49.0 | 39.8 | 4.4 | 1.764 | 4.07484 | 49.22449 | 0.10395 |
| 75% | 63.0 | 40.0 | 4.8 | 1.911 | 4.41441 | 53.333333 | 0.11319 |
| max | 95.0 | 41.4 | 5.2 | 2.0935 | 4.835985 | 71.818182 | 0.12243 |
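
The summary shows a maximum trip duration of 95 minutes against a 75th percentile of 63, which suggests at least one unusually long trip. One common way to check is Tukey's 1.5 × IQR rule, sketched below using only the quartiles from the table. The rule is an illustrative choice rather than something the notebook applies, and `iqr_fences` is a hypothetical helper.

```python
def iqr_fences(q1: float, q3: float, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) Tukey fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles and extremes of trip_duration, read from the describe() output.
q1, q3 = 45.0, 63.0
shortest, longest = 29.0, 95.0

lower, upper = iqr_fences(q1, q3)
print(f"fences: [{lower}, {upper}] min")
print(f"longest trip flagged: {longest > upper}")
print(f"shortest trip flagged: {shortest < lower}")
```

With an IQR of 18 minutes the fences fall at 18 and 90 minutes, so the 95-minute trip lies above the upper fence and the 29-minute trip lies within range. A flagged trip is not necessarily an error; it may simply reflect heavy traffic.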


In [7]:
# Statistics by trip direction
print("\nStatistics by Trip Direction:")
direction_stats = df.groupby('trip_direction')[['trip_duration', 'fuel_efficiency_l_per_100km', 
                                              'co2_emissions_kg', 'avg_speed_kmh']].agg(['mean', 'std', 'count'])
direction_stats


Statistics by Trip Direction:


| trip_direction | trip_duration mean | trip_duration std | trip_duration count | fuel_efficiency_l_per_100km mean | fuel_efficiency_l_per_100km std | fuel_efficiency_l_per_100km count | co2_emissions_kg mean | co2_emissions_kg std | co2_emissions_kg count | avg_speed_kmh mean | avg_speed_kmh std | avg_speed_kmh count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Campus to Home | 58.1875 | 21.646305 | 16 | 4.25 | 0.621825 | 16 | 3.877552 | 0.603201 | 16 | 45.767602 | 14.952539 | 16 |
| Home to Campus | 50.176471 | 9.850351 | 17 | 4.358824 | 0.59588 | 17 | 4.090752 | 0.527024 | 17 | 49.220812 | 9.847611 | 17 |
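
To get a rough sense of whether the difference in mean trip duration between the two directions exceeds what chance alone would produce, the summary statistics can be passed to `scipy.stats.ttest_ind_from_stats`. The sketch below uses Welch's t-test (`equal_var=False`) because the two standard deviations differ considerably. This test is an illustrative choice, not part of the original analysis, and with only 33 trips any result should be read as tentative.

```python
from scipy import stats

# Summary statistics of trip_duration (minutes), read from the grouped table.
campus_to_home = {"mean": 58.1875, "std": 21.646305, "n": 16}
home_to_campus = {"mean": 50.176471, "std": 9.850351, "n": 17}

# equal_var=False selects Welch's t-test, which does not assume that both
# directions share the same variance.
result = stats.ttest_ind_from_stats(
    mean1=campus_to_home["mean"],
    std1=campus_to_home["std"],
    nobs1=campus_to_home["n"],
    mean2=home_to_campus["mean"],
    std2=home_to_campus["std"],
    nobs2=home_to_campus["n"],
    equal_var=False,
)
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.3f}")
```

The function expects sample standard deviations (ddof=1), which is what pandas' `std` returns by default, so the values from the table can be used directly.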


In [None]:
# Statistics by traffic condition
print("\nStatistics by Traffic Condition:")
# Create a mapping dictionary for display purposes
traffic_display = {0: 'Low', 1: 'Moderate', 2: 'High'}

traffic_stats = df.groupby('traffic_condition')[['trip_duration', 'fuel_efficiency_l_per_100km', 
                                              'co2_emissions_kg', 'avg_speed_kmh']].agg(['mean', 'std', 'count'])
traffic_stats.index = [traffic_display[i] for i in traffic_stats.index]
traffic_stats

In [None]:
# Statistics by day of week
print("\nStatistics by Day of Week:")
# Sorting days in proper order
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
day_stats = df.groupby('day_of_week')[['trip_duration', 'fuel_efficiency_l_per_100km', 
                                      'co2_emissions_kg', 'avg_speed_kmh']].agg(['mean', 'std', 'count'])
# Sort by day of week
day_stats = day_stats.reindex([d for d in day_order if d in day_stats.index])
day_stats
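
An alternative to reindexing after the groupby is to store the weekday as an ordered categorical, so that every later groupby or plot follows calendar order automatically. The sketch below shows the idea on a small frame rebuilt from the first five trips. The `trips` frame is illustrative, and this approach is a suggestion rather than what the notebook does.

```python
import pandas as pd

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Small illustrative frame built from the first five trips shown earlier.
trips = pd.DataFrame({
    "day_of_week": ["Thursday", "Thursday", "Monday", "Tuesday", "Wednesday"],
    "co2_emissions_kg": [4.41441, 2.966964, 3.7884, 4.5507, 4.5276],
})

# An ordered categorical makes groupby sort by calendar order, not alphabetically.
trips["day_of_week"] = pd.Categorical(
    trips["day_of_week"], categories=day_order, ordered=True
)

# observed=True drops days that never occur in the data (Friday to Sunday here).
by_day = trips.groupby("day_of_week", observed=True)["co2_emissions_kg"].mean()
print(by_day)
```

The trade-off is that a day name missing from `day_order`, such as a misspelling, silently becomes NaN when the column is converted, so it is worth checking for missing values again afterwards.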

## Analyzing Peak vs. Non-Peak Hours

In [None]:
# Stats by peak hour
peak_stats = df.groupby('peak_hour')[['trip_duration', 'fuel_efficiency_l_per_100km', 
                                     'co2_emissions_kg', 'avg_speed_kmh']].agg(['mean', 'std', 'count'])
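# Map the boolean group keys to labels explicitly, so the labels stay correct
# even if only one of the two groups is present in the data
peak_stats.index = peak_stats.index.map({False: 'Non-Peak Hours', True: 'Peak Hours'})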
peak_stats

## Trip Duration Distribution

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(data=df, x='trip_duration', bins=10, kde=True)
plt.axvline(x=df['trip_duration'].mean(), color='red', linestyle='--', 
            label=f'Mean: {df["trip_duration"].mean():.1f} min')
plt.axvline(x=df['trip_duration'].median(), color='green', linestyle='--', 
            label=f'Median: {df["trip_duration"].median():.1f} min')
plt.title('Trip Duration Distribution', fontsize=14)
plt.xlabel('Trip Duration (minutes)', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

## Summary of Data Exploration

From this initial exploration, we can observe several key patterns in the commute data:

1. Trip duration varies widely: trips range from 29 to 95 minutes (mean 54.1, median 49), and the middle half take between 45 and 63 minutes
2. Traffic conditions show a notable influence on trip duration, fuel efficiency, and CO2 emissions
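3. Trip direction shows some differences: Campus to Home trips are longer and more variable (mean 58.2 minutes, std 21.6) than Home to Campus trips (mean 50.2 minutes, std 9.9), while their mean CO2 emissions are slightly lower (3.88 kg vs. 4.09 kg)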
4. Day of the week may have an impact on commute efficiency

In the next notebook, we'll perform more detailed visualizations and statistical analysis to further investigate these relationships.