# 🚴‍♂️ Bike Rental Analytics - Phase 1: Data Exploration

## Project Overview
This notebook explores Citi Bike ridership data from Jersey City (2016) combined with NOAA weather data from Newark Airport to understand weather impact on bike rentals.

## Phase 1 Objectives
- Load and inspect Citi Bike CSV files
- Examine weather data structure and quality
- Identify data quality issues and patterns
- Document initial findings and assumptions

## 📦 Import Libraries


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings

# Set up plotting styles
sns.set_theme(style="whitegrid", palette="muted", font_scale=1.1)
plt.style.use("ggplot")

# Suppress warnings for cleaner output
warnings.filterwarnings("ignore")


## 📁 Data Loading

Let's start by exploring the data structure and loading our datasets.


In [11]:
# TODO: Explore the data directory structure
# Hint: Use Path('../data') to create a path object, then use .glob('*.csv') to find CSV files
# Print out all the CSV files you find
data_path = Path("../data")
data_csv = list(data_path.glob("*.csv"))
print("CSV files found:")
for csv_file in data_csv:
    print(csv_file.name)



CSV files found:
JC-201602-citibike-tripdata.csv
JC-201608-citibike-tripdata.csv
newark_airport_2016.csv
JC-201606-citibike-tripdata.csv
JC-201610-citibike-tripdata.csv
JC-201604-citibike-tripdata.csv
JC-201612-citibike-tripdata.csv
JC-201603-citibike-tripdata.csv
JC-201609-citibike-tripdata.csv
JC-201601-citibike-tripdata.csv
JC-201611-citibike-tripdata.csv
JC-201607-citibike-tripdata.csv
JC-201605-citibike-tripdata.csv


### 🚴‍♂️ Citi Bike Data Exploration

Let's start by examining one month of Citi Bike data to understand the structure.


In [22]:
# TODO: Load the January 2016 Citi Bike data
# Hint: Use pd.read_csv() to load 'JC-201601-citibike-tripdata.csv'
# Print the shape, number of records, and column names
# What do you notice about the data structure?
january_data = pd.read_csv("../data/JC-201601-citibike-tripdata.csv")
print(january_data.head())
print()
print("Data shape:")
print(january_data.shape)
print(f"Data has {january_data.shape[0]} rows and {january_data.shape[1]} columns")
print()
print("Column names:")
print(january_data.columns)
print()
print("Type of data:")
print(january_data.dtypes)

   Trip Duration           Start Time            Stop Time  Start Station ID  \
0            362  2016-01-01 00:02:52  2016-01-01 00:08:54              3186   
1            200  2016-01-01 00:18:22  2016-01-01 00:21:42              3186   
2            202  2016-01-01 00:18:25  2016-01-01 00:21:47              3186   
3            248  2016-01-01 00:23:13  2016-01-01 00:27:21              3209   
4            903  2016-01-01 01:03:20  2016-01-01 01:18:24              3195   

  Start Station Name  Start Station Latitude  Start Station Longitude  \
0      Grove St PATH               40.719586               -74.043117   
1      Grove St PATH               40.719586               -74.043117   
2      Grove St PATH               40.719586               -74.043117   
3       Brunswick St               40.724176               -74.050656   
4            Sip Ave               40.730743               -74.063784   

   End Station ID End Station Name  End Station Latitude  \
0            3209   

In [None]:
# TODO: Examine the first few rows of the Citi Bike data
# Hint: Use .head() method to see the first 5 rows
# What patterns do you see in the data?




In [None]:
# TODO: Get basic information about the Citi Bike data
# Hint: Use .info() to see data types and .describe() for statistics
# What data types do you see? Are there any obvious issues?

# Your code here:


### 🌤️ Weather Data Exploration

Now let's examine the weather data structure.


In [31]:
# TODO: Load the weather data
# Hint: Use pd.read_csv() to load 'newark_airport_2016.csv'
# Print the shape, number of records, and column names
# How does this data structure compare to the Citi Bike data?

# Your code here:
newark_airport_df = pd.read_csv("../data/newark_airport_2016.csv")

print(newark_airport_df.head())
print()
print("Columns names:")
print(newark_airport_df.columns)
print()
print("Data shape:")
print(newark_airport_df.shape)

print()
print("Types of data:")
print(newark_airport_df.dtypes)
print()
print("Describe data:")
print(newark_airport_df.describe())
print()

# .info() prints its summary directly and returns None, so it is not wrapped in print()
newark_airport_df.info()

       STATION                                         NAME        DATE  \
0  USW00014734  NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US  2016-01-01   
1  USW00014734  NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US  2016-01-02   
2  USW00014734  NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US  2016-01-03   
3  USW00014734  NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US  2016-01-04   
4  USW00014734  NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US  2016-01-05   

    AWND  PGTM  PRCP  SNOW  SNWD  TAVG  TMAX  TMIN  TSUN  WDF2   WDF5  WSF2  \
0  12.75   NaN   0.0   0.0   0.0    41    43    34   NaN   270  280.0  25.9   
1   9.40   NaN   0.0   0.0   0.0    36    42    30   NaN   260  260.0  21.0   
2  10.29   NaN   0.0   0.0   0.0    37    47    28   NaN   270  250.0  23.9   
3  17.22   NaN   0.0   0.0   0.0    32    35    14   NaN   330  330.0  25.9   
4   9.84   NaN   0.0   0.0   0.0    19    31    10   NaN   360  350.0  25.1   

   WSF5  
0  35.1  
1  25.1  
2  30.0  
3  33.1  
4  31.1  

Columns names

In [None]:
# TODO: Examine the weather data structure
# Hint: Use .head(), .info(), and .describe() methods
# What weather variables are available? What's the time granularity?

# Your code here:


## 🔍 Initial Data Quality Assessment

Let's identify potential data quality issues that we'll need to address in Phase 2.


In [None]:
# Load every 2016 Citi Bike file, concatenate the months into one DataFrame,
# and count the missing values in each column across the full year.

import glob

# Get all monthly files for 2016
citibike_files = sorted(glob.glob("../data/JC-2016*-citibike-tripdata.csv"))

# Read each file into a DataFrame and concatenate them
all_months_data = [pd.read_csv(f) for f in citibike_files]
citibike_2016_data = pd.concat(all_months_data, ignore_index=True)

# Check for missing values using .isnull().sum()
number_nulls_citibike = citibike_2016_data.isnull().sum()
print("Number of missing values:")
print(number_nulls_citibike)

# .isna() is an alias of .isnull() in pandas, so a second check with .isna() would be redundant.

# To better understand the impact, let's also calculate the percentage of missing values per column:
print("\nPercentage of missing values per column:")
percent_nulls = round((number_nulls_citibike / len(citibike_2016_data)) * 100, 2)
print(percent_nulls)




Number of missing values:
Trip Duration                  0
Start Time                     0
Stop Time                      0
Start Station ID               0
Start Station Name             0
Start Station Latitude         0
Start Station Longitude        0
End Station ID                 0
End Station Name               0
End Station Latitude           0
End Station Longitude          0
Bike ID                        0
User Type                    380
Birth Year                 18999
Gender                         0
dtype: int64

Percentage of missing values per column:
Trip Duration              0.00
Start Time                 0.00
Stop Time                  0.00
Start Station ID           0.00
Start Station Name         0.00
Start Station Latitude     0.00
Start Station Longitude    0.00
End Station ID             0.00
End Station Name           0.00
End Station Latitude       0.00
End Station Longitude      0.00
Bike ID                    0.00
User Type                  0.15
Birth Ye

In [41]:
# TODO: Check for missing values in Weather data
# Hint: Use .isnull().sum() to count missing values
# Calculate the percentage of missing values
# Are there any missing weather observations?

# Your code here:
newark_missing_values = newark_airport_df.isnull().sum()
print("Missing numbers in weather df:")
print(newark_missing_values)
missing_weather_percentage = round((newark_missing_values / len(newark_airport_df)) * 100, 2)
print(missing_weather_percentage)
print()
print("Data shape:", newark_airport_df.shape)


Missing numbers in weather df:
STATION      0
NAME         0
DATE         0
AWND         0
PGTM       366
PRCP         0
SNOW         0
SNWD         0
TAVG         0
TMAX         0
TMIN         0
TSUN       366
WDF2         0
WDF5         2
WSF2         0
WSF5         2
dtype: int64
STATION      0.00
NAME         0.00
DATE         0.00
AWND         0.00
PGTM       100.00
PRCP         0.00
SNOW         0.00
SNWD         0.00
TAVG         0.00
TMAX         0.00
TMIN         0.00
TSUN       100.00
WDF2         0.00
WDF5         0.55
WSF2         0.00
WSF5         0.55
dtype: float64

Data shape: (366, 16)
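
The count-and-percentage check is run twice above, once per dataset, so it could be wrapped in a small helper. The sketch below is one possible way to do that; `missing_summary` and the `toy_weather` frame are hypothetical names with made-up values, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages for columns that have gaps."""
    counts = df.isna().sum()
    percent = (counts / len(df) * 100).round(2)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    # Keep only columns with at least one missing value, worst first
    return summary[summary["missing"] > 0].sort_values("percent", ascending=False)


# Tiny made-up frame standing in for the weather data (illustration only)
toy_weather = pd.DataFrame(
    {
        "DATE": ["2016-01-01", "2016-01-02", "2016-01-03", "2016-01-04"],
        "PGTM": [np.nan, np.nan, np.nan, np.nan],
        "WSF5": [35.1, np.nan, 30.0, 33.1],
    }
)
summary = missing_summary(toy_weather)
print(summary)
```

In the notebook, the same helper could be called on `citibike_2016_data` and `newark_airport_df` so that both reports share one format.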


## 📊 Quick Exploratory Analysis

Now let's do some basic analysis to understand the data better.


In [45]:
# TODO: Analyze trip duration patterns
# Hint: Use .describe() on the 'Trip Duration' column
# What's the average trip duration? Are there any extreme values?
# Think about: What might cause very short or very long trips?

# Your code here:
trip_duration_patterns = citibike_2016_data["Trip Duration"].describe()
print("Trip duration patterns:")
print(trip_duration_patterns)


Trip duration patterns:
count    2.475840e+05
mean     8.856305e+02
std      3.593798e+04
min      6.100000e+01
25%      2.480000e+02
50%      3.900000e+02
75%      6.660000e+02
max      1.632981e+07
Name: Trip Duration, dtype: float64
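
The gap between the mean (~886 s) and the median (390 s) suggests that a few very long trips pull the average up. A minimal sketch of flagging such trips follows, assuming a cutoff of one day. `MAX_TRIP_SECONDS` and the toy durations are illustrative assumptions, and the right threshold is a Phase 2 decision.

```python
import pandas as pd

# Hypothetical cutoff: treat anything longer than one day as extreme
MAX_TRIP_SECONDS = 24 * 60 * 60

# Made-up durations in seconds, including one very long trip
toy_trips = pd.DataFrame({"Trip Duration": [61, 248, 390, 666, 16_329_810]})

# Boolean mask marking trips above the cutoff
is_extreme = toy_trips["Trip Duration"] > MAX_TRIP_SECONDS

print("Trips over the cutoff:", int(is_extreme.sum()))
print("Median with all trips:", toy_trips["Trip Duration"].median())
print("Mean with all trips:", toy_trips["Trip Duration"].mean())
print("Mean without flagged trips:", toy_trips.loc[~is_extreme, "Trip Duration"].mean())
```

Flagging with a mask rather than dropping rows keeps the original data intact, which makes it easier to revisit the cutoff later.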


In [54]:
# TODO: Look at user types and gender distribution
# Hint: Use .value_counts() on 'User Type' and 'Gender' columns
# What types of users are there? How is gender coded?
print(citibike_2016_data["User Type"].value_counts())
print(citibike_2016_data["Gender"].value_counts())



User Type
Subscriber    231683
Customer       15521
Name: count, dtype: int64
Gender
1    177197
2     50486
0     19901
Name: count, dtype: int64
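
Raw counts hide two things: the share of each category and the rows where the value is missing. As a hedged sketch, `value_counts` accepts `normalize=True` to return proportions and `dropna=False` to keep missing values as their own category. The `toy_users` series below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up user types, including one missing value
toy_users = pd.Series(
    ["Subscriber", "Subscriber", "Subscriber", "Customer", np.nan],
    name="User Type",
)

# Proportions as percentages, with missing values counted as a category
shares = toy_users.value_counts(normalize=True, dropna=False).mul(100).round(1)
print(shares)
```

With `dropna=False`, the percentages are shares of all rows, so they will differ slightly from shares computed only over rows with a recorded value.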


In [56]:
# TODO: Examine the date ranges in both datasets
# Hint: Convert date columns to datetime and check min/max dates
# For Citi Bike: look at 'Start Time' column
# For Weather: look at 'DATE' column
# Do the date ranges overlap? What's the coverage?
# Split the 'Start Time' string into date and time using .str.split(), then take the date part
citibike_2016_data["Date"] = citibike_2016_data["Start Time"].str.split(" ").str[0]
citibike_2016_data["Time"] = citibike_2016_data["Start Time"].str.split(" ").str[1]
print("Max Citibike date:", citibike_2016_data["Date"].max())
print("Min Citibike date:", citibike_2016_data["Date"].min())
print()
print("Max weather date:", newark_airport_df["DATE"].max())
print("Min weather date:", newark_airport_df["DATE"].min())





Max Citibike date: 2016-12-31
Min Citibike date: 2016-01-01

Max weather date: 2016-12-31
Min weather date: 2016-01-01
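
Matching minimum and maximum dates show that both datasets span the same period, but not that every day in between is present. One way to check day-by-day coverage is sketched below with `pd.to_datetime` and `pd.date_range`. The toy frames are made up, and the column names mirror the ones used above.

```python
import pandas as pd

# Made-up samples: the trips skip 2016-01-02, the weather data does not
toy_trips = pd.DataFrame(
    {"Start Time": ["2016-01-01 00:02:52", "2016-01-01 00:18:22", "2016-01-03 09:15:00"]}
)
toy_weather = pd.DataFrame({"DATE": ["2016-01-01", "2016-01-02", "2016-01-03"]})

# Parse the strings into datetimes and reduce trips to distinct calendar days
start = pd.to_datetime(toy_trips["Start Time"])
trip_days = pd.DatetimeIndex(start.dt.normalize().unique())
weather_days = pd.DatetimeIndex(pd.to_datetime(toy_weather["DATE"]))

# Every calendar day between the first and last weather observation
expected = pd.date_range(weather_days.min(), weather_days.max(), freq="D")

days_without_trips = expected.difference(trip_days)
days_without_weather = expected.difference(weather_days)
print("Days without trips:", list(days_without_trips.strftime("%Y-%m-%d")))
print("Days without weather:", list(days_without_weather.strftime("%Y-%m-%d")))
```

An empty result for both differences would support the claim that the two datasets overlap on every day, not just at the endpoints.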


## 📝 Your Initial Findings

### Document Your Observations:
1. **Citi Bike Data Structure**: 
   - Number of records: **247,584** (full year 2016)
   - Columns: **15 columns** (Trip Duration, Start/Stop Time, Station info, Bike ID, User Type, Birth Year, Gender)
   - Data types: Mix of int64, float64, and object types
   - Missing values: **Birth Year (7.67% missing)**, **User Type (0.15% missing)**

2. **Weather Data Structure**:
   - Number of records: **366** (daily data for 2016)
   - Columns: **16 columns** (Station info, Date, Temperature, Precipitation, Wind, etc.)
   - Data types: Mix of int64, float64, and object types
   - Missing values: **PGTM (100% missing)**, **TSUN (100% missing)**, **WDF5/WSF5 (0.55% missing)**

3. **Data Quality Issues Found**:
   - [x] Missing birth year data (7.67% - significant for demographic analysis)
   - [x] Extreme trip durations (max: 16,329,810 seconds = ~189 days!)
   - [x] Missing weather data (PGTM, TSUN completely missing)
   - [x] Other issues: **User Type missing for some records**

4. **Key Insights Discovered**:
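   - **Date Coverage**: Both datasets span 2016-01-01 to 2016-12-31, and the weather data has one row per day (366 rows, since 2016 is a leap year)
   - **User Distribution**: 93.7% Subscribers vs 6.3% Customers (among trips with a recorded User Type)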
   - **Gender Distribution**: 1=Male (71.6%), 2=Female (20.4%), 0=Unknown (8.0%)
   - **Trip Duration**: Median ~6.5 minutes, but extreme outliers exist
   - **Weather Variables**: Temperature (TAVG, TMAX, TMIN), Precipitation (PRCP), Wind (AWND, WSF2/WSF5)

### Key Questions for Phase 2:
- How will you handle missing birth year data? (7.67% missing)
- Should you filter out extreme trip durations? (the longest recorded trip is ~189 days!)
- How will you align weather data with bike trips? (one weather row per day vs. trips timestamped to the second)
- What time zones are you working with? (appears to be EST/EDT)
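
For the alignment question, one possible approach is to aggregate trips to one row per day and then join that onto the daily weather table. The sketch below assumes the trip timestamps are naive local times and uses made-up toy data. The names `daily_trips` and `daily` are hypothetical, not something the notebook defines.

```python
import pandas as pd

# Made-up trips and weather rows (illustration only)
toy_trips = pd.DataFrame(
    {
        "Start Time": ["2016-01-01 00:02:52", "2016-01-01 00:18:22", "2016-01-02 08:30:00"],
        "Trip Duration": [362, 200, 903],
    }
)
toy_weather = pd.DataFrame(
    {
        "DATE": ["2016-01-01", "2016-01-02", "2016-01-03"],
        "TAVG": [41, 36, 37],
        "PRCP": [0.0, 0.0, 0.0],
    }
)

# Build a shared datetime key: the calendar day of each trip and each weather row
toy_trips["date"] = pd.to_datetime(toy_trips["Start Time"]).dt.normalize()
toy_weather["date"] = pd.to_datetime(toy_weather["DATE"])

# Collapse trips to one row per day
daily_trips = (
    toy_trips.groupby("date")
    .agg(trips=("Trip Duration", "size"), median_duration=("Trip Duration", "median"))
    .reset_index()
)

# Left join from weather so days without any trips are kept with a count of 0
daily = toy_weather.merge(daily_trips, on="date", how="left")
daily["trips"] = daily["trips"].fillna(0).astype(int)
print(daily[["date", "TAVG", "PRCP", "trips", "median_duration"]])
```

Joining from the weather side keeps every calendar day in the result, so days with no rides show up as zeros instead of silently disappearing.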

### Next Steps:
- [x] Complete Phase 1 exploration
- [ ] Move to Phase 2: Data Cleaning & Validation
- [ ] Create data cleaning pipeline
- [ ] Handle extreme trip duration outliers
- [ ] Decide on missing data strategy

---

**Great work!** You've successfully identified the key data quality issues and understand both datasets well. The perfect date overlap is excellent for analysis!
