# 🧹 Bike Rental Analytics - Phase 2: Data Cleaning & Validation

## Project Overview
This notebook cleans and validates the Citi Bike and weather data explored in Phase 1 and prepares both datasets for database loading.

## Phase 2 Objectives
- Handle missing data strategically
- Filter extreme outliers in trip durations
- Standardize data formats and types
- Validate data quality and business rules
- Prepare clean datasets for database loading

---


## 📦 Import Libraries & Load Data


In [None]:
# TODO: Import necessary libraries and load the raw data you explored in Phase 1
# Hint: Import pandas, numpy, matplotlib, seaborn, pathlib, warnings
# Load both the concatenated Citi Bike data and weather data
# You can copy the loading code from your Phase 1 notebook

# Your code here:
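
One possible way to load a folder of monthly CSV files into a single DataFrame is sketched below. The helper name `load_csv_folder` and the demo file names are hypothetical, and the demo writes two tiny files to a temporary folder so the cell runs on its own; in your notebook you would point it at the real Phase 1 data folder instead.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_csv_folder(folder, pattern="*.csv"):
    """Read every CSV matching `pattern` in `folder` and stack them into one DataFrame."""
    files = sorted(Path(folder).glob(pattern))
    if not files:
        raise FileNotFoundError(f"No files matching {pattern!r} in {folder}")
    return pd.concat((pd.read_csv(f) for f in files), ignore_index=True)


# Demo with two throwaway files (hypothetical names and columns).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "trips_01.csv").write_text("Trip Duration,User Type\n300,Subscriber\n")
    Path(tmp, "trips_02.csv").write_text("Trip Duration,User Type\n900,Customer\n")
    trips = load_csv_folder(tmp)

print(trips.shape)
```

The weather file is usually a single CSV, so a plain `pd.read_csv(...)` call is likely enough for it.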


## 🚨 Handle Trip Duration Outliers

From Phase 1, we discovered extreme trip durations (some > 100 days!). Let's analyze and filter these outliers.


In [None]:
# TODO: Analyze trip duration distribution
# Hint: Use .describe() and create a histogram
# What's a reasonable maximum trip duration? (think: bike rentals are typically minutes/hours, not days)
# Consider: What might cause trips > 24 hours?

# Your code here:
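
A minimal sketch of the analysis, run on a toy frame rather than the real data. The column name `Trip Duration` (in seconds) is an assumption; adjust it to whatever your Phase 1 DataFrame uses. Bucketing the durations gives a quick text version of a histogram and makes the long tail visible.

```python
import pandas as pd

# Toy stand-in for the trip data; 'Trip Duration' (seconds) is an assumed column name.
trips = pd.DataFrame({"Trip Duration": [45, 300, 620, 1500, 3600, 90_000, 9_000_000]})
duration = trips["Trip Duration"]

# Summary statistics and upper quantiles show how far the tail stretches.
print(duration.describe())
print(duration.quantile([0.5, 0.95, 0.99]))

# Left-closed buckets: [0, 60), [60, 1800), [1800, 3600), [3600, 86400), [86400, inf)
DAY_SECONDS = 24 * 60 * 60
bins = [0, 60, 1800, 3600, DAY_SECONDS, float("inf")]
labels = ["<1 min", "1-30 min", "30-60 min", "1-24 h", ">=24 h"]
bucket_counts = pd.cut(duration, bins=bins, labels=labels, right=False).value_counts()
print(bucket_counts)
```

On the full dataset a histogram with a log-scaled x-axis tends to be more readable than a linear one, because a handful of multi-day trips would otherwise flatten everything else.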


In [None]:
# TODO: Define and apply trip duration filters
# Hint: Set reasonable bounds (e.g., 60 seconds to 24 hours)
# Use boolean indexing to filter: df[(df['column'] >= min) & (df['column'] <= max)]
# Print how many records you're removing and why

# Your code here:
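
The filter described in the hint could look roughly like this. The bounds follow the example in the hint (60 seconds to 24 hours); the column name `Trip Duration` and the toy values are assumptions for illustration.

```python
import pandas as pd

# Toy data; 'Trip Duration' (seconds) is an assumed column name.
trips = pd.DataFrame({"Trip Duration": [45, 60, 300, 3600, 86_400, 90_000, 9_000_000]})

MIN_SECONDS = 60            # shorter trips are likely false starts or re-docks
MAX_SECONDS = 24 * 60 * 60  # longer trips are likely lost or undocked bikes

too_short = trips["Trip Duration"] < MIN_SECONDS
too_long = trips["Trip Duration"] > MAX_SECONDS

# Keep rows inside the bounds (both ends inclusive); .copy() avoids chained-assignment warnings later.
trips_filtered = trips[~too_short & ~too_long].copy()

print(f"Removed {too_short.sum()} trips under {MIN_SECONDS} s")
print(f"Removed {too_long.sum()} trips over {MAX_SECONDS} s")
print(f"Kept {len(trips_filtered)} of {len(trips)} trips")
```

Counting the two masks separately makes it easy to fill in the "Records removed" and "Reasoning" lines in the summary at the end of the notebook.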


## 🔧 Handle Missing Data

Now let's address the missing data issues identified in Phase 1.


In [None]:
# TODO: Handle missing Birth Year data (7.67% missing)
# Hint: Options include: drop rows, impute with median/mode, or create "Unknown" category
# Think: Is birth year critical for analysis? What's the best approach?
# Use .fillna() or .dropna() methods

# Your code here:
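
Two of the options from the hint are sketched below on toy data: imputing with the median while keeping a flag for which rows were filled, and keeping the gaps explicit with pandas' nullable `Int64` dtype. The new column names `birth_year_missing` and `Birth Year Imputed` are made up for this example; which option fits best depends on how much your analysis relies on age.

```python
import numpy as np
import pandas as pd

# Toy data with two missing birth years.
trips = pd.DataFrame({"Birth Year": [1985.0, np.nan, 1990.0, 1972.0, np.nan]})

# Option A: impute with the median, but remember which rows were imputed.
trips["birth_year_missing"] = trips["Birth Year"].isna()
median_year = trips["Birth Year"].median()
trips["Birth Year Imputed"] = trips["Birth Year"].fillna(median_year).astype("int64")

# Option B: leave the gaps in place and store the column as nullable integers.
trips["Birth Year"] = trips["Birth Year"].astype("Int64")

print(trips)
```

Dropping the rows with `trips.dropna(subset=["Birth Year"])` is the third option; at 7.67% missing it would discard a noticeable share of otherwise valid trips.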


In [None]:
# TODO: Handle missing User Type data (0.15% missing)
# Hint: This is a small percentage - consider dropping these rows
# Or investigate if there's a pattern to missing User Type

# Your code here:
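
A short sketch of both suggestions: first look for a pattern in the rows with a missing User Type, then drop them. `Start Station ID` is used here only as an example grouping column and may be named differently in your data.

```python
import numpy as np
import pandas as pd

# Toy data; 'Start Station ID' is a hypothetical column used to look for a pattern.
trips = pd.DataFrame({
    "User Type": ["Subscriber", "Customer", np.nan, "Subscriber", np.nan, "Subscriber"],
    "Start Station ID": [72, 79, 82, 72, 82, 83],
})

missing = trips["User Type"].isna()
print(f"{missing.sum()} rows ({missing.mean():.2%}) have no User Type")

# Are the gaps concentrated at particular stations?
print(trips.loc[missing, "Start Station ID"].value_counts())

# With so few rows affected, dropping them is a defensible choice.
trips_clean = trips.dropna(subset=["User Type"]).reset_index(drop=True)
print(len(trips_clean), "rows remain")
```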


In [None]:
# TODO: Handle missing weather data
# Hint: PGTM and TSUN are 100% missing - consider dropping these columns
# WDF5 and WSF5 have minimal missing data - decide on strategy
# Use .drop() for columns or .fillna() for missing values

# Your code here:
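
One way to handle the weather gaps, shown on a toy frame. Entirely empty columns are detected rather than hard-coded, so the same code still works if PGTM or TSUN turn out to contain values in another extract. For the two wind columns this sketch assumes the usual NOAA daily-summary meaning (WSF5 is a wind speed, WDF5 a wind direction in degrees): the speed is filled with the median, while the direction is forward-filled because averaging compass directions can be misleading. Both fill strategies are illustrative choices, not the only reasonable ones.

```python
import numpy as np
import pandas as pd

# Toy weather data; values are invented for the example.
weather = pd.DataFrame({
    "DATE": pd.date_range("2016-01-01", periods=5),
    "PGTM": [np.nan] * 5,
    "TSUN": [np.nan] * 5,
    "WDF5": [270.0, np.nan, 300.0, 310.0, 280.0],
    "WSF5": [21.0, np.nan, 25.1, 18.0, 30.0],
})

# Drop every column that is 100% missing.
empty_cols = weather.columns[weather.isna().all()].tolist()
weather = weather.drop(columns=empty_cols)
print("Dropped:", empty_cols)

# Fill the small gaps in the wind columns.
weather = weather.sort_values("DATE")
weather["WSF5"] = weather["WSF5"].fillna(weather["WSF5"].median())
weather["WDF5"] = weather["WDF5"].ffill()

print(weather.isna().sum())
```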


## 📅 Standardize Date/Time Formats

Let's ensure all date/time data is properly formatted for database storage.


In [None]:
# TODO: Convert date/time columns to proper datetime format
# Hint: Use pd.to_datetime() for 'Start Time', 'Stop Time', and 'DATE' columns
# Consider timezone handling - are these EST/EDT times?
# Create separate date and time columns if needed for analysis

# Your code here:
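
A hedged sketch of the conversion on toy data. It assumes the timestamps are stored as naive New York local times, which you should confirm against the data source before localizing; the `Start Time Local` and `start_date` column names are made up for this example. Using `errors="coerce"` turns unparseable strings into `NaT` so they can be counted instead of raising.

```python
import pandas as pd

# Toy data, including one deliberately bad timestamp.
trips = pd.DataFrame({
    "Start Time": ["2016-07-01 08:15:00", "2016-07-01 23:50:10", "not a date"],
    "Stop Time": ["2016-07-01 08:32:30", "2016-07-02 00:05:00", "2016-07-01 09:00:00"],
})
weather = pd.DataFrame({"DATE": ["2016-07-01", "2016-07-02"]})

for col in ["Start Time", "Stop Time"]:
    trips[col] = pd.to_datetime(trips[col], errors="coerce")
weather["DATE"] = pd.to_datetime(weather["DATE"], errors="coerce")

unparsed = trips["Start Time"].isna().sum()
print(f"{unparsed} start times could not be parsed")

# Assumption: naive values are New York wall-clock times. Ambiguous or
# nonexistent times around daylight-saving changes become NaT.
trips["Start Time Local"] = trips["Start Time"].dt.tz_localize(
    "America/New_York", ambiguous="NaT", nonexistent="NaT"
)

# A midnight-normalized date column lines up with the daily weather 'DATE'.
trips["start_date"] = trips["Start Time"].dt.normalize()
print(trips.dtypes)
```

If the target database stores timestamps without a time zone, keeping the naive column and documenting the assumed zone may be the simpler route.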


## 🏷️ Create Derived Features

Let's add useful features for analytics.


In [None]:
# TODO: Create derived features for Citi Bike data
# Hint: Add columns for:
# - Day of week (Monday=0, Sunday=6)
# - Hour of day (0-23)
# - Trip duration in minutes
# - Age (calculated from birth year)
# Use .dt.dayofweek, .dt.hour, and arithmetic operations

# Your code here:
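
The four features from the hint, sketched on two toy trips. The new column names are suggestions only, and `Trip Duration` is again assumed to be in seconds. Because only the birth year is known, the age is approximate and can be off by one.

```python
import pandas as pd

# Toy data: 2016-07-04 was a Monday, 2016-07-10 a Sunday.
trips = pd.DataFrame({
    "Start Time": pd.to_datetime(["2016-07-04 08:15:00", "2016-07-10 17:40:00"]),
    "Trip Duration": [930, 2400],
    "Birth Year": [1986, 1990],
})

trips["day_of_week"] = trips["Start Time"].dt.dayofweek   # Monday=0 ... Sunday=6
trips["hour"] = trips["Start Time"].dt.hour               # 0-23
trips["duration_min"] = trips["Trip Duration"] / 60       # seconds -> minutes
trips["age"] = trips["Start Time"].dt.year - trips["Birth Year"]  # age at time of trip

print(trips)
```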


In [None]:
# TODO: Create derived features for weather data
# Hint: Add columns for:
# - Day of week
# - Month
# - Season (Winter, Spring, Summer, Fall)
# - Weather categories (e.g., "Hot" if TAVG > 80, "Cold" if TAVG < 40)
# Use .dt.dayofweek, .dt.month, and conditional logic

# Your code here:
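
A possible implementation of the weather features. The season mapping uses meteorological seasons (December to February as Winter, and so on), and the "Mild" label for temperatures between the two thresholds is an addition for this example; the hint only names "Hot" and "Cold". The thresholds 80 and 40 are taken from the hint and presumably refer to degrees Fahrenheit.

```python
import numpy as np
import pandas as pd

# Toy data, one day per season.
weather = pd.DataFrame({
    "DATE": pd.to_datetime(["2016-01-15", "2016-04-15", "2016-07-15", "2016-10-15"]),
    "TAVG": [28, 55, 84, 60],
})

weather["day_of_week"] = weather["DATE"].dt.dayofweek
weather["month"] = weather["DATE"].dt.month

# Meteorological seasons keyed by month number (an assumed convention).
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}
weather["season"] = weather["month"].map(SEASON_BY_MONTH)

# Conditions are checked in order; rows matching none get the default.
weather["temp_category"] = np.select(
    [weather["TAVG"] > 80, weather["TAVG"] < 40],
    ["Hot", "Cold"],
    default="Mild",
)

print(weather)
```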


## ✅ Data Quality Validation

Let's validate our cleaned data and check for any remaining issues.


In [None]:
# TODO: Validate cleaned data
# Hint: Check for:
# - Remaining missing values
# - Data type consistency
# - Reasonable value ranges
# - Duplicate records
# Print summary statistics and any issues found

# Your code here:
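
The checks in the hint can be bundled into one small helper. `quality_report` and its `range_checks` argument are hypothetical names invented for this sketch; the toy frame deliberately contains one duplicate row, one missing value, and one out-of-range duration so each check has something to report.

```python
import pandas as pd


def quality_report(df, range_checks=None):
    """Summarize missing values, duplicates, dtypes, and out-of-range values.

    range_checks maps a column name to an inclusive (low, high) pair.
    """
    report = {
        "rows": len(df),
        "missing_by_column": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "dtypes": df.dtypes.astype(str).to_dict(),
        "out_of_range": {},
    }
    for col, (low, high) in (range_checks or {}).items():
        # .between() is False for NaN, so missing values count as out of range here.
        report["out_of_range"][col] = int((~df[col].between(low, high)).sum())
    return report


# Toy data with known problems.
trips = pd.DataFrame({
    "Trip Duration": [300, 300, 90_000, 1200],
    "User Type": ["Subscriber", "Subscriber", "Customer", None],
})

report = quality_report(trips, range_checks={"Trip Duration": (60, 24 * 60 * 60)})
for key, value in report.items():
    print(f"{key}: {value}")
```

On cleaned data you would expect zero duplicates and zero out-of-range values; anything else points back to an earlier step.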


In [None]:
# TODO: Save cleaned datasets
# Hint: Save your cleaned DataFrames to CSV files in a 'processed' folder
# Use .to_csv() method
# Consider naming: 'citibike_cleaned.csv' and 'weather_cleaned.csv'

# Your code here:
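
A small sketch of the save step. The helper `save_cleaned` is a made-up name, and the demo writes into a temporary directory so it has no side effects; in the notebook you would pass your project's `processed` folder instead. Reading the file back is a cheap check that dates survive the round trip.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_cleaned(df, filename, out_dir="processed"):
    """Write df to out_dir/filename as CSV, creating the folder if needed."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    target = out_path / filename
    df.to_csv(target, index=False)  # index=False avoids an extra unnamed column
    return target


# Demo in a throwaway location (hypothetical data).
demo_dir = Path(tempfile.mkdtemp()) / "processed"
trips_clean = pd.DataFrame({
    "Start Time": pd.to_datetime(["2016-07-04 08:15:00"]),
    "Trip Duration": [930],
})

saved = save_cleaned(trips_clean, "citibike_cleaned.csv", out_dir=demo_dir)

# CSV does not store dtypes, so datetime columns must be parsed again on load.
reloaded = pd.read_csv(saved, parse_dates=["Start Time"])
print(saved.name, reloaded.shape)
```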


## 📝 Phase 2 Summary

### Document Your Cleaning Decisions:
1. **Trip Duration Filtering**: 
   - Minimum duration: ___ seconds
   - Maximum duration: ___ seconds
   - Records removed: ___
   - Reasoning: ___

2. **Missing Data Handling**:
   - Birth Year: ___ (strategy used)
   - User Type: ___ (strategy used)
   - Weather columns: ___ (strategy used)

3. **Derived Features Created**:
   - Citi Bike: ___
   - Weather: ___

4. **Data Quality Issues Resolved**:
   - [ ] Trip duration outliers
   - [ ] Missing birth year data
   - [ ] Missing user type data
   - [ ] Missing weather data
   - [ ] Date/time formatting
   - [ ] Data type consistency

### Next Steps:
- [x] Complete Phase 2 data cleaning
- [ ] Move to Phase 3: Database Schema Design
- [ ] Design table structures and relationships
- [ ] Create ER diagram

---

**Great work!** Your data is now clean and ready for database implementation. Document your decisions for the portfolio!
