# Philadelphia 311 Data Cleaning Pipeline
This notebook implements comprehensive data cleaning for the Philadelphia 311 Service Requests dataset including:
- Missing value handling
- Deduplication
- Column standardization
- Data type conversion
- Location validation
- Text normalization

In [1]:
import pandas as pd
import numpy as np
import re
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

## Load and Explore Data

In [2]:
# Load Philadelphia 311 data
# Note: Using on_bad_lines='skip' to handle malformed rows in the CSV
# Update: Data is now hosted on OneDrive for team access
onedrive_url = "https://falconbgsu-my.sharepoint.com/:f:/g/personal/lmoraa_bgsu_edu/IgBxvcsA2OrmQaDBpFTlbfAZAeon55WcaDPhRLTOGjI4v6c?e=fxfd82"

# If using pandas >= 1.2, you can read directly from the link if shared as a direct download
# Otherwise, download the file manually and update the path below
# Example for direct download (if link is direct to CSV):
# df = pd.read_csv(onedrive_url, on_bad_lines='skip', engine='python')

# For now, instruct the user to download and update the local path if needed
print("NOTE: The raw data is available for download at:")
print(onedrive_url)

# Default: Try to load from local path (update if you download to a different location)
df = pd.read_csv("../data/raw/philly_311_raw.csv", on_bad_lines='skip', engine='python')

print(f"Original dataset shape: {df.shape}")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
df.head()


NOTE: The raw data is available for download at:
https://falconbgsu-my.sharepoint.com/:f:/g/personal/lmoraa_bgsu_edu/IgBxvcsA2OrmQaDBpFTlbfAZAeon55WcaDPhRLTOGjI4v6c?e=fxfd82
Original dataset shape: (518841, 18)
Memory usage: 445.54 MB


Unnamed: 0,objectid,service_request_id,subject,status,status_notes,service_name,service_code,agency_responsible,service_notice,requested_datetime,updated_datetime,expected_datetime,closed_datetime,address,zipcode,media_url,lat,lon
0,5056952,17346520,Graffiti Removal,Closed,Issue Resolved,Graffiti Removal,SR-CL01,Community Life Improvement Program,7 Business Days,2025-01-01 00:00:34+00,2025-01-07 12:53:27+00,2025-01-10 00:00:00+00,2025-01-07 12:53:24+00,1701 SPRING GARDEN ST,19130.0,,39.963169,-75.166457
1,5056953,17346521,Graffiti Removal,Closed,Issue Resolved,Graffiti Removal,SR-CL01,Community Life Improvement Program,7 Business Days,2025-01-01 00:03:51+00,2025-01-07 12:52:31+00,2025-01-10 00:00:00+00,2025-01-07 12:52:28+00,1901 SPRING GARDEN ST,19130.0,,39.963625,-75.1695
2,5056954,17346523,Recycling Collection,Closed,,Rubbish/Recyclable Material Collection,SR-ST03,Streets Department,2 Business Days,2025-01-01 00:06:29+00,2025-03-03 14:28:52+00,2025-01-03 00:00:00+00,2025-03-03 14:25:15+00,5902 JEFFERSON ST,19151.0,https://d17aqltn7cihbm.cloudfront.net/uploads/...,39.978809,-75.239265
3,5056955,17346524,Graffiti Removal,Closed,Issue Resolved,Graffiti Removal,SR-CL01,Community Life Improvement Program,7 Business Days,2025-01-01 00:06:47+00,2025-01-07 12:48:56+00,2025-01-10 00:00:00+00,2025-01-07 12:48:46+00,1921 SPRING GARDEN ST,19130.0,,39.963703,-75.170116
4,5056956,17346525,Recycling Collection,Closed,,Rubbish/Recyclable Material Collection,SR-ST03,Streets Department,2 Business Days,2025-01-01 00:07:43+00,2025-01-03 11:30:26+00,2025-01-03 00:00:00+00,2025-01-03 11:25:48+00,6739 RUTLAND ST,19149.0,,40.043401,-75.072138


## Standardize Column Names

In [3]:
# SECTION 2: Standardize Column Names (REQUIREMENT 1)
# Rule-based column name standardization
df.columns = (
    df.columns
    .str.lower()           # Convert to lowercase
    .str.strip()           # Remove leading/trailing whitespace
    .str.replace(" ", "_")  # Replace spaces with underscores
    .str.replace(r"[^\w_]", "", regex=True)  # Remove special characters
)

print(f"✓ Standardized {len(df.columns)} column names")
print(f"Sample columns: {list(df.columns)}")

✓ Standardized 18 column names
Sample columns: ['objectid', 'service_request_id', 'subject', 'status', 'status_notes', 'service_name', 'service_code', 'agency_responsible', 'service_notice', 'requested_datetime', 'updated_datetime', 'expected_datetime', 'closed_datetime', 'address', 'zipcode', 'media_url', 'lat', 'lon']


## Select relevant columns

In [4]:
# SECTION 3: Select Relevant Columns for Philly 311
# Update this list based on the actual columns in philly_311_raw.csv
columns_to_keep = [
    "service_request_id",      # Unique identifier for Philly 311
    "subject",                 # Subject of the request (if available)
    "requested_datetime",      # When the request was made
    "service_name",            # Type/category of service requested
    "service_code",            # Service code (if available)
    "service_notice",         # Service notice (if available)
    "address",                 # Address of the request
    "zipcode",                # ZIP code
    "lat",                     # Latitude
    "lon",                    # Longitude
    "status",                  # Status of the request (if available)
    "agency_responsible"       # Agency handling the request (if available)
]

# Only keep columns that exist in the DataFrame
available_cols = [c for c in columns_to_keep if c in df.columns]
missing_cols = [c for c in columns_to_keep if c not in df.columns]

if missing_cols:
    print(f"⚠ Missing columns: {missing_cols}")

df = df[available_cols].copy()
print(f"✓ Selected {len(df.columns)} columns for Philly 311")
print(f"Dataset shape: {df.shape}")
df.head()


✓ Selected 12 columns for Philly 311
Dataset shape: (518841, 12)


Unnamed: 0,service_request_id,subject,requested_datetime,service_name,service_code,service_notice,address,zipcode,lat,lon,status,agency_responsible
0,17346520,Graffiti Removal,2025-01-01 00:00:34+00,Graffiti Removal,SR-CL01,7 Business Days,1701 SPRING GARDEN ST,19130.0,39.963169,-75.166457,Closed,Community Life Improvement Program
1,17346521,Graffiti Removal,2025-01-01 00:03:51+00,Graffiti Removal,SR-CL01,7 Business Days,1901 SPRING GARDEN ST,19130.0,39.963625,-75.1695,Closed,Community Life Improvement Program
2,17346523,Recycling Collection,2025-01-01 00:06:29+00,Rubbish/Recyclable Material Collection,SR-ST03,2 Business Days,5902 JEFFERSON ST,19151.0,39.978809,-75.239265,Closed,Streets Department
3,17346524,Graffiti Removal,2025-01-01 00:06:47+00,Graffiti Removal,SR-CL01,7 Business Days,1921 SPRING GARDEN ST,19130.0,39.963703,-75.170116,Closed,Community Life Improvement Program
4,17346525,Recycling Collection,2025-01-01 00:07:43+00,Rubbish/Recyclable Material Collection,SR-ST03,2 Business Days,6739 RUTLAND ST,19149.0,40.043401,-75.072138,Closed,Streets Department


Summarized conversation history

## Handle Missing Values

In [5]:
# SECTION 5: Handle Missing Values for Philly 311
# 1: Drop rows with missing CRITICAL fields
# Adjust these fields to match your Philly 311 schema
# Only use columns that exist in the DataFrame
possible_critical_fields = ['service_request_id', 'requested_datetime', 'lat', 'long', 'longitude', 'latitude']
critical_fields = [col for col in possible_critical_fields if col in df.columns]
rows_before = len(df)

df = df.dropna(subset=critical_fields)
rows_dropped_critical = rows_before - len(df)

print(f"✓ Dropped {rows_dropped_critical} rows with missing critical fields")
print(f"  Remaining rows: {len(df):,}")
print(f"  Critical fields preserved: {', '.join(critical_fields)}")


✓ Dropped 273033 rows with missing critical fields
  Remaining rows: 245,808
  Critical fields preserved: service_request_id, requested_datetime, lat


## Remove Duplicates

In [6]:
# Remove exact duplicates for Philly 311
# Use the unique service_request_id if available, otherwise drop full row duplicates
initial_rows = len(df)

if 'service_request_id' in df.columns:
    df = df.drop_duplicates(subset=['service_request_id'], keep='first')
    exact_dupes = initial_rows - len(df)
    print(f"✓ Exact duplicates removed by service_request_id: {exact_dupes}")
else:
    df = df.drop_duplicates(keep='first')
    exact_dupes = initial_rows - len(df)
    print(f"✓ Exact full-row duplicates removed: {exact_dupes}")

print(f"  Rows after deduplication: {len(df):,}")

if 'service_request_id' in df.columns:
    unique_count = df['service_request_id'].nunique()
    print(f"  Unique service_request_id count: {unique_count:,}")


✓ Exact duplicates removed by service_request_id: 0
  Rows after deduplication: 245,808
  Unique service_request_id count: 245,808


## Data Type Conversion

In [7]:
# SECTION 7: Data Type Conversion for Philly 311

# Convert date columns to datetime
if 'requested_datetime' in df.columns:
    df['requested_datetime'] = pd.to_datetime(df['requested_datetime'], errors='coerce')
    print("✓ Converted requested_datetime to datetime64")

# Convert numeric columns
numeric_cols = ['lat', 'long', 'zip_code']
for col in numeric_cols:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col], errors='coerce')
print(f"✓ Converted {len(numeric_cols)} numeric columns to float64")

# Convert categorical columns (memory efficiency)
categorical_cols = ['service_name', 'status', 'agency_responsible']
for col in categorical_cols:
    if col in df.columns:
        df[col] = df[col].astype('category')
print(f"✓ Converted {len(categorical_cols)} columns to category dtype")

print(f"\nMemory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")


✓ Converted requested_datetime to datetime64
✓ Converted 3 numeric columns to float64
✓ Converted 3 columns to category dtype

Memory usage: 77.02 MB


## Location Validation and Cleaning

In [8]:
# SECTION 8: Location Cleaning & Geospatial Validation 
# Rule-based: Philadelphia bounding box validation
print("Validating coordinates with Philadelphia geographic bounds:")
print("  Latitude: 39.85° to 40.15°N")
print("  Longitude: -75.35° to -74.95°W")

invalid_coords = (
    (df['lat'] < 39.85) | (df['lat'] > 40.15) |
    (df['lon'] < -75.35) | (df['lon'] > -74.95)
)

rows_invalid_coords = invalid_coords.sum()
print(f"\n✓ Rows with invalid coordinates: {rows_invalid_coords}")

# Statistical: IQR-based outlier detection
from scipy import stats

def detect_outliers_iqr(data, column):
    """Detect outliers using Interquartile Range method"""
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return (data[column] < lower_bound) | (data[column] > upper_bound)

outliers_lat = detect_outliers_iqr(df, 'lat')
outliers_lon = detect_outliers_iqr(df, 'lon')

print(f"Statistical outlier detection (IQR method):")
print(f"  Latitude outliers: {outliers_lat.sum()}")
print(f"  Longitude outliers: {outliers_lon.sum()}")

# Drop invalid coordinates
df = df[~invalid_coords].copy()
print(f"\n✓ Dataset after coordinate validation: {len(df):,} rows")

# Coordinate statistics
print(f"\nCoordinate Statistics (after validation):")
print(df[['lat', 'lon']].describe())

Validating coordinates with Philadelphia geographic bounds:
  Latitude: 39.85° to 40.15°N
  Longitude: -75.35° to -74.95°W

✓ Rows with invalid coordinates: 0
Statistical outlier detection (IQR method):
  Latitude outliers: 0
  Longitude outliers: 8851

✓ Dataset after coordinate validation: 245,808 rows

Coordinate Statistics (after validation):
                 lat            lon
count  245808.000000  245808.000000
mean       39.994135     -75.150572
std         0.048004       0.061682
min        39.874328     -75.279832
25%        39.954238     -75.186091
50%        39.988049     -75.159147
75%        40.031963     -75.120106
max        40.137024     -74.957954


In [9]:
# SECTION 9: ZIP Code Cleaning
if 'incident_zip' in df.columns:
    # Extract 5-digit ZIP codes
    df['incident_zip'] = df['incident_zip'].astype(str).str.extract('(\d{5})', expand=False)
    
    # Count valid vs invalid
    valid_zips = df['incident_zip'].notna().sum()
    invalid_zips = df['incident_zip'].isna().sum()
    
    print(f"✓ ZIP code extraction (5-digit format):")
    print(f"  Valid ZIP codes: {valid_zips:,}")
    print(f"  Invalid/missing: {invalid_zips:,}")
    
    # Convert to numeric
    df['incident_zip'] = pd.to_numeric(df['incident_zip'], errors='coerce')

In [10]:
# SECTION 10: Location/District Normalization 
if 'borough' in df.columns:
    print("Normalizing location/district names:")
    
    # Convert to uppercase and strip whitespace
    df['borough'] = df['borough'].str.upper().str.strip()
    
    # Rule-based mapping for Philadelphia districts
    location_mapping = {
        '1': 'CENTRAL',
        '2': 'SOUTH',
        '3': 'NORTHEAST',
        '4': 'NORTH',
        '5': 'SOUTHWEST',
        '6': 'EAST',
        'CENTRAL': 'CENTRAL',
        'SOUTH': 'SOUTH',
        'NORTHEAST': 'NORTHEAST',
        'NORTH': 'NORTH',
        'SOUTHWEST': 'SOUTHWEST',
        'EAST': 'EAST'
    }
    
    for old, new in location_mapping.items():
        df['borough'] = df['borough'].replace(old, new)
    
    print(f"✓ Location distribution (normalized):")
    print(df['borough'].value_counts())

## Text Normalization

In [11]:
# SECTION 11: Text Normalization for Philly 311

def normalize_text(text):
    """Normalize text: uppercase, trim, remove extra spaces"""
    if pd.isna(text):
        return 'UNKNOWN'
    text = str(text).strip().upper()
    text = re.sub(r'\s+', ' ', text)  # Remove extra whitespace
    return text

# Normalize service_name (main complaint/service type)
if 'service_name' in df.columns:
    df['service_name'] = df['service_name'].apply(normalize_text)
    print(f"✓ Service types normalized (uppercase, trimmed)")
    print(f"\nTop 15 Service Types:")
    print(df['service_name'].value_counts().head(15))

# Normalize status
if 'status' in df.columns:
    df['status'] = df['status'].apply(normalize_text)
    print(f"\n✓ Status values normalized")
    print(f"Unique status values: {df['status'].nunique()}")
    print(df['status'].value_counts())

# Normalize agency_responsible
if 'agency_responsible' in df.columns:
    df['agency_responsible'] = df['agency_responsible'].apply(normalize_text)
    print(f"\n✓ Agency responsible normalized")
    print(f"Unique agencies: {df['agency_responsible'].nunique()}")
    print(df['agency_responsible'].value_counts().head(10))

# Normalize address
if 'address' in df.columns:
    df['address'] = df['address'].apply(normalize_text)
    print(f"\n✓ Address normalized (uppercase, trimmed)")

# Normalize zip_code
if 'zip_code' in df.columns:
    df['zip_code'] = df['zip_code'].astype(str).str.extract(r'(\d{5})', expand=False)
    print(f"\n✓ ZIP codes normalized to 5-digit strings")
    print(df['zip_code'].value_counts().head(10))


✓ Service types normalized (uppercase, trimmed)

Top 15 Service Types:
service_name
MAINTENANCE COMPLAINT                     37800
RUBBISH/RECYCLABLE MATERIAL COLLECTION    36943
ABANDONED VEHICLE                         27943
ILLEGAL DUMPING                           22644
STREET DEFECT                             16128
INFORMATION REQUEST                       13728
GRAFFITI REMOVAL                          10141
STREET LIGHT OUTAGE                        8274
SANITATION VIOLATION                       7504
OTHER (STREETS)                            6842
CONSTRUCTION COMPLAINTS                    6027
LICENSE COMPLAINT                          5579
TRAFFIC SIGNAL EMERGENCY                   5106
STREET TREES                               4785
DANGEROUS SIDEWALK                         3556
Name: count, dtype: int64

✓ Status values normalized
Unique status values: 2
status
CLOSED    199889
OPEN       45919
Name: count, dtype: int64

✓ Agency responsible normalized
Unique agencies: 1

## Advanced Complaint Type Normalization
### Rule-Based Category Mapping
Using domain knowledge and statistical analysis to group similar complaint types into standardized categories

In [12]:
# Advanced Complaint Type Normalization for Philly 311
# Group similar service types into broader categories for analysis

# Example mapping: (Update this mapping based on your data and analysis)
service_type_mapping = {
    'ILLEGAL DUMPING': 'ENVIRONMENTAL',
    'TRASH COLLECTION': 'ENVIRONMENTAL',
    'GRAFFITI REMOVAL': 'ENVIRONMENTAL',
    'STREET LIGHT OUTAGE': 'INFRASTRUCTURE',
    'POTHOLE': 'INFRASTRUCTURE',
    'ABANDONED VEHICLE': 'PUBLIC SAFETY',
    'NOISE COMPLAINT': 'QUALITY OF LIFE',
    # Add more mappings as needed
}

def map_service_type(service):
    if pd.isna(service):
        return 'OTHER'
    return service_type_mapping.get(service, 'OTHER')

if 'service_name' in df.columns:
    df['service_category'] = df['service_name'].apply(map_service_type)
    print("✓ Service types mapped to broad categories (service_category)")
    print("\nService Category Distribution:")
    print(df['service_category'].value_counts())
    print(f"\nTotal unique service types: {df['service_name'].nunique()}")
    print(f"Total unique categories: {df['service_category'].nunique()}")


✓ Service types mapped to broad categories (service_category)

Service Category Distribution:
service_category
OTHER             176806
ENVIRONMENTAL      32785
PUBLIC SAFETY      27943
INFRASTRUCTURE      8274
Name: count, dtype: int64

Total unique service types: 47
Total unique categories: 4


## Outlier Detection: Statistical Methods
### Using IQR (Interquartile Range) and Z-Score for anomaly detection

In [13]:
# Outlier Detection: Statistical Methods for Philly 311
# (This section is a placeholder. Adjust columns as needed for your data.)

from scipy import stats

# Detect spatial outliers using IQR method
def detect_outliers_iqr(data, column):
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return (data[column] < lower_bound) | (data[column] > upper_bound)

# Example: Check for latitude/longitude outliers if present
if 'lat' in df.columns and 'long' in df.columns:
    outliers_lat = detect_outliers_iqr(df, 'lat')
    outliers_long = detect_outliers_iqr(df, 'long')
    print(f"Latitude outliers (IQR method): {outliers_lat.sum()}")
    print(f"Longitude outliers (IQR method): {outliers_long.sum()}")


## Near Duplicate Detection


In [14]:
# SECTION 12: Near-Duplicate Detection (Philly 311)
# Identify potential near-duplicate service requests based on location, time, and service type

from datetime import timedelta

# Parameters for near-duplicate detection
TIME_WINDOW = timedelta(hours=2)  # Requests within 2 hours
DISTANCE_THRESHOLD = 0.001        # ~100 meters if using decimal degrees (approximate for Philly)

# Only run if required columns exist
required_cols = {'lat', 'lon', 'service_name', 'requested_datetime'}
if required_cols.issubset(df.columns):
    # Sort by time for efficient comparison
    df = df.sort_values('requested_datetime')
    df['near_duplicate'] = False

    # Simple pairwise check (O(n^2), can be optimized for large datasets)
    for i in range(1, len(df)):
        prev = df.iloc[i-1]
        curr = df.iloc[i]
        # Check if same service type and within time window
        if (
            curr['service_name'] == prev['service_name'] and
            abs((curr['requested_datetime'] - prev['requested_datetime']).total_seconds()) <= TIME_WINDOW.total_seconds() and
            abs(curr['lat'] - prev['lat']) <= DISTANCE_THRESHOLD and
            abs(curr['lon'] - prev['lon']) <= DISTANCE_THRESHOLD
        ):
            df.iloc[i, df.columns.get_loc('near_duplicate')] = True

    near_dupe_count = df['near_duplicate'].sum()
    print(f"✓ Near-duplicate requests flagged: {near_dupe_count}")
    print(df[df['near_duplicate']].head())
else:
    print("Near-duplicate detection skipped: required columns missing.")


✓ Near-duplicate requests flagged: 5778
    service_request_id                                            subject  \
5             17346526                               Recycling Collection   
20            17346545                                   Graffiti Removal   
36            17346576  What numbers do I call to contact PECO and Ver...   
37            17346577           What calls are considered 911 transfers?   
50            17346600                                     Pothole Repair   

          requested_datetime                            service_name  \
5  2025-01-01 00:08:58+00:00  RUBBISH/RECYCLABLE MATERIAL COLLECTION   
20 2025-01-01 00:38:43+00:00                        GRAFFITI REMOVAL   
36 2025-01-01 04:26:58+00:00                     INFORMATION REQUEST   
37 2025-01-01 04:32:45+00:00                     INFORMATION REQUEST   
50 2025-01-01 12:26:13+00:00                           STREET DEFECT   

   service_code    service_notice          address  zipcode     

## Missing Value Handling: Documented Strategy
### Four-tier approach: Drop Critical → Impute → Fill → Default

In [15]:
# Analyze missing values BEFORE handling

#2: Fill categorical fields with semantic defaults
categorical_impute = {
    'problem_formerly_complaint_type': 'UNKNOWN',
    'agency': 'VARIOUS',
    'location_type': 'UNSPECIFIED',
    'borough': 'UNSPECIFIED'
}

print("Filling categorical missing values with defaults:")
for col, fill_value in categorical_impute.items():
    if col in df.columns and df[col].isnull().sum() > 0:
        fill_count = df[col].isnull().sum()
        df[col] = df[col].fillna(fill_value)
        print(f"  {col}: Filled {fill_count} missing values with '{fill_value}'")
    elif col in df.columns:
        print(f"  {col}: No missing values")

# SECTION 14: Missing Value Handling - TIER 3 (TEXT)
# 3: Fill text descriptions
if 'problem_detail_formerly_descriptor' in df.columns:
    missing_count = df['problem_detail_formerly_descriptor'].isnull().sum()
    df['problem_detail_formerly_descriptor'] = df['problem_detail_formerly_descriptor'].fillna('No description provided')
    print(f"✓ problem_detail_formerly_descriptor: Filled {missing_count} with default text")

# SECTION 15: Missing Value Handling - TIER 4 (STATISTICAL IMPUTATION)
# 4: Statistical imputation for incident_zip using borough median
if 'incident_zip' in df.columns:
    print("Statistical imputation strategy: Borough-level median")
    print("\nFilling missing ZIP codes by borough:")
    
    for borough in df['borough'].unique():
        if borough != 'UNSPECIFIED':
            # Calculate borough median ZIP
            borough_median_zip = df[df['borough'] == borough]['incident_zip'].median()
            
            # Create mask for missing values in this borough
            mask = (df['borough'] == borough) & (df['incident_zip'].isnull())
            filled_count = mask.sum()
            
            # Impute if median exists
            if not pd.isna(borough_median_zip) and filled_count > 0:
                df.loc[mask, 'incident_zip'] = int(borough_median_zip)
                print(f"  {borough}: Filled {filled_count} ZIPs with median {int(borough_median_zip)}")
            elif filled_count > 0:
                print(f"  {borough}: No valid ZIPs found for imputation")

# SECTION 16: Analyze Missing Values (After Cleaning)
missing_after = df.isnull().sum()
missing_after = missing_after[missing_after > 0].sort_values(ascending=False)

if len(missing_after) > 0:
    print("Remaining missing values:")
    print(missing_after)
else:
    print("✓ No missing values remaining!")

print(f"\nDataset shape: {df.shape}")
print(f"Total missing cells: {df.isnull().sum().sum()}")

Filling categorical missing values with defaults:
Remaining missing values:
service_code      68072
service_notice    15446
zipcode            4352
dtype: int64

Dataset shape: (245808, 14)
Total missing cells: 87870


## Data Quality Metrics and Final Validation

In [16]:
# SECTION 17: Final Data Quality Assessment (Philly 311)

# 1. Completeness
complete_rows = len(df[df.isnull().sum(axis=1) == 0])
completeness = (complete_rows / len(df)) * 100 if len(df) > 0 else 0
print(f"\n1. COMPLETENESS:")
print(f"   ✓ Complete rows (0 missing values): {complete_rows:,} ({completeness:.1f}%)")
print(f"   Total missing cells: {df.isnull().sum().sum()}")

# 2. Validity (Geospatial)
if set(['lat', 'long']).issubset(df.columns):
    valid_coords = ((df['lat'] >= 39.85) & (df['lat'] <= 40.15) &
                    (df['long'] >= -75.35) & (df['long'] <= -74.95)).sum()
    validity_coords = (valid_coords / len(df)) * 100 if len(df) > 0 else 0
    print(f"\n2. VALIDITY (Geospatial):")
    print(f"   ✓ Valid Philadelphia coordinates: {valid_coords:,} ({validity_coords:.1f}%)")
    print(f"   Coordinate range:")
    print(f"     Latitude:  {df['lat'].min():.4f}° to {df['lat'].max():.4f}°")
    print(f"     Longitude: {df['long'].min():.4f}° to {df['long'].max():.4f}°")
else:
    print("\n2. VALIDITY (Geospatial): lat/long columns not found.")

# 3. Uniqueness
if 'service_request_id' in df.columns:
    unique_ids = df['service_request_id'].nunique()
    uniqueness = (unique_ids / len(df)) * 100 if len(df) > 0 else 0
    print(f"\n3. UNIQUENESS:")
    print(f"   ✓ Unique service_request_id: {unique_ids:,} out of {len(df):,} ({uniqueness:.1f}%)")
else:
    print("\n3. UNIQUENESS: service_request_id column not found.")

# 4. Consistency
print(f"\n4. CONSISTENCY:")
print(f"   ✓ Standardized column names: All lowercase with underscores")
if 'service_name' in df.columns:
    print(f"   ✓ Normalized service types: {df['service_name'].nunique()} unique types")
if 'requested_datetime' in df.columns:
    print(f"   ✓ Standardized date format: datetime64[ns]")
if set(['lat', 'long']).issubset(df.columns):
    print(f"   ✓ Normalized coordinates: float64 within Philadelphia bounds")

# 5. Coverage
print(f"\n5. COVERAGE:")
if 'requested_datetime' in df.columns:
    min_date = df['requested_datetime'].min()
    max_date = df['requested_datetime'].max()
    print(f"   Time span: {min_date.date() if pd.notnull(min_date) else 'N/A'} to {max_date.date() if pd.notnull(max_date) else 'N/A'}")
if 'borough' in df.columns:
    print(f"   Boroughs: {df['borough'].nunique()} boroughs")
if 'agency_responsible' in df.columns:
    print(f"   Agencies: {df['agency_responsible'].nunique()} agencies")
if 'service_name' in df.columns:
    print(f"   Service types: {df['service_name'].nunique()} types")

# 6. Distribution Statistics
if 'borough' in df.columns:
    print(f"\n6. GEOGRAPHIC DISTRIBUTION (Top Boroughs):")
    print(df['borough'].value_counts())
if 'service_category' in df.columns:
    print(f"\n7. SERVICE CATEGORY DISTRIBUTION (Top 10):")
    print(df['service_category'].value_counts().head(10))
if 'agency_responsible' in df.columns:
    print(f"\n8. AGENCY DISTRIBUTION (Top 10):")
    print(df['agency_responsible'].value_counts().head(10))

# 9. Temporal Statistics
if 'requested_datetime' in df.columns:
    min_date = df['requested_datetime'].min()
    max_date = df['requested_datetime'].max()
    if pd.notnull(min_date) and pd.notnull(max_date):
        days_covered = (max_date - min_date).days
        records_per_day = len(df) / (days_covered + 1) if days_covered > 0 else 0
        print(f"\n9. TEMPORAL STATISTICS:")
        print(f"   Date range: {days_covered} days")
        print(f"   Records per day (avg): {records_per_day:.0f}")



1. COMPLETENESS:
   ✓ Complete rows (0 missing values): 162,614 (66.2%)
   Total missing cells: 87870

2. VALIDITY (Geospatial): lat/long columns not found.

3. UNIQUENESS:
   ✓ Unique service_request_id: 245,808 out of 245,808 (100.0%)

4. CONSISTENCY:
   ✓ Standardized column names: All lowercase with underscores
   ✓ Normalized service types: 47 unique types
   ✓ Standardized date format: datetime64[ns]

5. COVERAGE:
   Time span: 2025-01-01 to 2025-12-31
   Agencies: 10 agencies
   Service types: 47 types

7. SERVICE CATEGORY DISTRIBUTION (Top 10):
service_category
OTHER             176806
ENVIRONMENTAL      32785
PUBLIC SAFETY      27943
INFRASTRUCTURE      8274
Name: count, dtype: int64

8. AGENCY DISTRIBUTION (Top 10):
agency_responsible
STREETS DEPARTMENT                    124193
LICENSE & INSPECTIONS                  44722
POLICE DEPARTMENT                      27996
COMMUNITY LIFE IMPROVEMENT PROGRAM     19802
PHILLY311 CONTACT CENTER               13962
PARKS & RECREATION 

## Save Cleaned Data and Generate Documentation

In [18]:
output_path = "../data/processed/philly_311_cleaned.csv"

# OneDrive link for team data access
onedrive_url = "https://falconbgsu-my.sharepoint.com/:f:/g/personal/lmoraa_bgsu_edu/IgBxvcsA2OrmQaDBpFTlbfAZAeon55WcaDPhRLTOGjI4v6c?e=fxfd82"

print(f"Data source for this project (OneDrive):\n{onedrive_url}\n")
df.to_csv(output_path, index=False)

print(f"✓ Cleaned data saved to: {output_path}")
print(f"\n========== CLEANING SUMMARY ==========")
print(f"Final row count: {len(df):,}")
print(f"Final column count: {len(df.columns)}")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")


Data source for this project (OneDrive):
https://falconbgsu-my.sharepoint.com/:f:/g/personal/lmoraa_bgsu_edu/IgBxvcsA2OrmQaDBpFTlbfAZAeon55WcaDPhRLTOGjI4v6c?e=fxfd82

✓ Cleaned data saved to: ../data/processed/philly_311_cleaned.csv

Final row count: 245,808
Final column count: 14
Memory usage: 92.31 MB
