# 🔍 STR Nuisance Prediction - Scottsdale Data Analysis

**Objective**: Explore Scottsdale's short-term rental (STR) and complaint datasets as groundwork for nuisance prediction.

**Data**: 9 Scottsdale public datasets loaded directly from Google Drive

## 📊 What We'll Analyze:
1. 🏠 Licensed, unlicensed, and pending STR properties
2. 📞 EZ complaints and code violations  
3. 👮 Police incidents, citations, and arrests
4. 🗺️ Property parcel information
5. 🎯 STR nuisance prediction targets

---

In [None]:
# 📚 Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import warnings
import sys
import os

# Silence all warnings for cleaner notebook output; remove this line when debugging
warnings.filterwarnings('ignore')
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 6)

print("🚀 STR Nuisance Prediction Analysis - Scottsdale")
print(f"📅 Analysis Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("📚 Libraries loaded successfully!")

In [None]:
# 📥 Load Scottsdale Data
# Import the data loader
sys.path.append('../src')

try:
    from data_processing.scottsdale_data_loader import ScottsdaleSTRDataLoader
    print("✅ Data loader imported successfully")
    
    # Initialize and load all data (no setup needed - file IDs are pre-configured)
    loader = ScottsdaleSTRDataLoader()
    
    print("\n🔄 Loading all 9 Scottsdale datasets...")
    print("⏳ This will take a few minutes for the large files...")
    
    # Load everything
    datasets = loader.load_all_datasets()
    
    # Organize by category
    categorized = loader.get_dataset_by_category()
    
    print("\n🎉 Data loading completed!")
    
except ImportError as e:
    print(f"❌ Could not import data loader: {e}")
    print("📝 Make sure you've created: ml-pipeline/src/data_processing/scottsdale_data_loader.py")
except Exception as e:
    print(f"❌ Error loading data: {e}")

In [None]:
# 📊 Extract Individual Datasets
if 'datasets' in locals():
    # STR Properties
    licensed_strs = datasets.get('licensed_strs')
    unlicensed_strs = datasets.get('unlicensed_strs')
    pending_strs = datasets.get('pending_licences')
    
    # Complaints
    ez_complaints = datasets.get('ez_complaints')
    code_violations = datasets.get('code_violations')
    
    # Police Data
    police_incidents = datasets.get('police_incidents')
    police_citations = datasets.get('police_citations')
    police_arrests = datasets.get('police_arrests')
    
    # Geographic
    parcels = datasets.get('parcels')
    
    print("📋 DATASET SUMMARY")
    print("=" * 40)
    
    # Function to show dataset info
    def show_dataset_info(df, name, emoji):
        if df is not None:
            print(f"{emoji} {name}: {df.shape[0]:,} rows × {df.shape[1]} columns")
            return True
        else:
            print(f"❌ {name}: Failed to load")
            return False
    
    print("\n🏠 STR Properties:")
    licensed_loaded = show_dataset_info(licensed_strs, "Licensed STRs", "✅")
    unlicensed_loaded = show_dataset_info(unlicensed_strs, "Unlicensed STRs", "✅")
    pending_loaded = show_dataset_info(pending_strs, "Pending Licences", "✅")
    
    print("\n📞 Complaints & Violations:")
    complaints_loaded = show_dataset_info(ez_complaints, "EZ Complaints", "✅")
    violations_loaded = show_dataset_info(code_violations, "Code Violations", "✅")
    
    print("\n👮 Police Data:")
    incidents_loaded = show_dataset_info(police_incidents, "Police Incidents", "✅")
    citations_loaded = show_dataset_info(police_citations, "Police Citations", "✅")
    arrests_loaded = show_dataset_info(police_arrests, "Police Arrests", "✅")
    
    print("\n🗺️ Geographic:")
    parcels_loaded = show_dataset_info(parcels, "Property Parcels", "✅")
    
    # Count successful loads
    successful_loads = sum([
        licensed_loaded, unlicensed_loaded, pending_loaded,
        complaints_loaded, violations_loaded,
        incidents_loaded, citations_loaded, arrests_loaded,
        parcels_loaded
    ])
    
    print(f"\n📈 Successfully loaded: {successful_loads}/9 datasets")
    
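else:
    print("❌ Datasets not available - check data loading above")
    # Define placeholders so the cells below report missing data instead of raising NameError
    licensed_strs = unlicensed_strs = pending_strs = None
    ez_complaints = code_violations = None
    police_incidents = police_citations = police_arrests = None
    parcels = None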

In [None]:
# 🏠 Examine Licensed STR Properties (Main Dataset)
if licensed_strs is not None:
    print("🏠 LICENSED STR PROPERTIES - DETAILED ANALYSIS")
    print("=" * 50)
    
    print(f"📊 Basic Info:")
    print(f"   Total licensed properties: {len(licensed_strs):,}")
    print(f"   Columns: {len(licensed_strs.columns)}")
    print(f"   Memory usage: {licensed_strs.memory_usage(deep=True).sum() / 1024**2:.1f} MB")
    
    print(f"\n📋 Column Names and Types:")
    for i, (col, dtype) in enumerate(zip(licensed_strs.columns, licensed_strs.dtypes), 1):
        non_null = licensed_strs[col].count()
        null_pct = (len(licensed_strs) - non_null) / len(licensed_strs) * 100
        print(f"   {i:2d}. {col:<35} | {str(dtype):<15} | {null_pct:5.1f}% missing")
    
    print(f"\n👁️ Sample Data (first 5 rows):")
    display(licensed_strs.head())
    
else:
    print("❌ Licensed STR data not available - this is our main dataset!")
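
The per-column loop above can also be expressed as a small summary table, which is easier to sort and filter. The following is a minimal sketch on toy data; the helper name `summarize_columns` and the columns `license_id` and `owner` are hypothetical and do not come from the Scottsdale files.

```python
import pandas as pd

def summarize_columns(df):
    """Return one row per column with its dtype and percent of missing values."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_pct": (df.isna().mean() * 100).round(1),
    })
    # Columns with the most missing values come first
    return summary.sort_values("missing_pct", ascending=False)

# Toy frame with hypothetical column names
toy = pd.DataFrame({"license_id": [1, 2, 3, 4], "owner": ["a", None, "c", None]})
summary = summarize_columns(toy)
print(summary)
```

In the notebook, the same helper could be called as `summarize_columns(licensed_strs)` once the data has loaded.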

In [None]:
# 📞 Examine EZ Complaints (Main Complaints Dataset)
if ez_complaints is not None:
    print("📞 EZ COMPLAINTS - DETAILED ANALYSIS")
    print("=" * 40)
    
    print(f"📊 Basic Info:")
    print(f"   Total complaints: {len(ez_complaints):,}")
    print(f"   Columns: {len(ez_complaints.columns)}")
    print(f"   Memory usage: {ez_complaints.memory_usage(deep=True).sum() / 1024**2:.1f} MB")
    
    print(f"\n📋 Column Names:")
    for i, col in enumerate(ez_complaints.columns, 1):
        print(f"   {i:2d}. {col}")
    
    print(f"\n👁️ Sample Complaints Data (first 3 rows):")
    display(ez_complaints.head(3))
    
    # Look for key columns
    address_cols = [col for col in ez_complaints.columns if 'address' in col.lower()]
    date_cols = [col for col in ez_complaints.columns if 'date' in col.lower()]
    type_cols = [col for col in ez_complaints.columns if any(word in col.lower() for word in ['type', 'category', 'subject'])]
    
    print(f"\n🔍 Key Columns Identified:")
    if address_cols:
        print(f"   📍 Address columns: {address_cols}")
    if date_cols:
        print(f"   📅 Date columns: {date_cols}")
    if type_cols:
        print(f"   📋 Type columns: {type_cols}")
        
else:
    print("❌ EZ Complaints data not available")
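
Date columns identified by name often arrive as plain strings, so checking temporal coverage requires parsing them first. The sketch below shows one way to do that, assuming a hypothetical column called `complaint_date`; in practice the names would come from `date_cols`, and the toy values stand in for real records.

```python
import pandas as pd

def parse_date_columns(df, date_cols):
    """Return a copy of df with the given columns parsed to datetime.

    Values that cannot be parsed become NaT instead of raising an error.
    """
    out = df.copy()
    for col in date_cols:
        out[col] = pd.to_datetime(out[col], errors="coerce")
    return out

# Toy data; `complaint_date` is a hypothetical column name
toy = pd.DataFrame({"complaint_date": ["2023-01-15", "2023-03-02", "not a date"]})
parsed = parse_date_columns(toy, ["complaint_date"])

# Earliest and latest parsed dates, plus a count of values that failed to parse
date_range = (parsed["complaint_date"].min(), parsed["complaint_date"].max())
n_unparsed = int(parsed["complaint_date"].isna().sum())
print(date_range, n_unparsed)
```

Counting the unparsed values is worth doing before any trend analysis, since a high count usually means the column uses a format that needs explicit handling.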

In [None]:
# 🔗 Identify Data Linking Opportunities
print("🔗 DATA LINKING ANALYSIS")
print("=" * 35)

if licensed_strs is not None and ez_complaints is not None:
    # Look for common columns
    str_cols = set(licensed_strs.columns)
    complaint_cols = set(ez_complaints.columns)
    common_cols = str_cols.intersection(complaint_cols)
    
    print(f"📊 Column Analysis:")
    print(f"   Licensed STR columns: {len(str_cols)}")
    print(f"   EZ Complaints columns: {len(complaint_cols)}")
    print(f"   Common columns: {len(common_cols)}")
    
    if common_cols:
        print(f"\n🔗 Common Columns for Linking:")
        for col in sorted(common_cols):
            print(f"   📋 {col}")
    
    # Look for address-like columns in each
    str_address_cols = [col for col in licensed_strs.columns if 'address' in col.lower()]
    complaint_address_cols = [col for col in ez_complaints.columns if 'address' in col.lower()]
    
    print(f"\n📍 Address Columns for Spatial Linking:")
    if str_address_cols:
        print(f"   🏠 Licensed STRs: {str_address_cols}")
    if complaint_address_cols:
        print(f"   📞 EZ Complaints: {complaint_address_cols}")
    
    # Show sample addresses for comparison
    if str_address_cols and complaint_address_cols:
        print(f"\n🔍 Sample Address Formats:")
        print(f"   STR addresses:")
        for addr in licensed_strs[str_address_cols[0]].dropna().head(3):
            print(f"      {addr}")
        
        print(f"   Complaint addresses:")
        for addr in ez_complaints[complaint_address_cols[0]].dropna().head(3):
            print(f"      {addr}")

else:
    print("❌ Cannot analyze linking - missing STR or complaints data")

print(f"\n✅ Initial exploration completed!")
print(f"📝 Next: Based on the column structure, we'll create the linking and target variables")
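
If the two sample address formats differ (case, punctuation, spelled-out street types), exact matching will miss many links. A minimal normalization sketch follows. The abbreviation map and the example addresses are made up for illustration and should be revised after inspecting the real address columns.

```python
import re
import pandas as pd

# Hypothetical abbreviation map; extend it after inspecting the real data
ABBREVIATIONS = {
    "STREET": "ST", "AVENUE": "AVE", "ROAD": "RD", "DRIVE": "DR",
    "BOULEVARD": "BLVD", "NORTH": "N", "SOUTH": "S", "EAST": "E", "WEST": "W",
}

def normalize_address(addr):
    """Uppercase, replace punctuation with spaces, collapse whitespace, abbreviate common words."""
    if pd.isna(addr):
        return None
    text = re.sub(r"[^\w\s]", " ", str(addr).upper())
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)

# Two made-up spellings of the same address normalize to the same key
a = normalize_address("123 N. Example Street")
b = normalize_address("123 north example st")
print(a, "|", b)
```

String normalization only handles formatting differences. Unit numbers, typos, and missing directionals would need fuzzy matching or a parcel-based join, which this sketch does not attempt.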

## 🎯 Next Steps

Based on the data structure we discovered above:

1. **🔗 Data Linking Strategy**: Connect STR properties to complaints using address matching
2. **📊 Complaint Analysis**: Understand complaint patterns and frequencies  
3. **🎯 Target Creation**: Define nuisance properties based on complaint history
4. **🛠️ Feature Engineering**: Create predictive features from all 9 datasets
5. **🤖 Model Development**: Build and validate prediction models
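
Steps 1 and 3 can be sketched together: count complaints per address, join the counts onto the STR list, and flag properties above a cutoff. This is an illustration on toy data, not the project's final method. The `address` column name and the threshold of 2 are assumptions.

```python
import pandas as pd

# Toy frames; the `address` column name is an assumption
strs = pd.DataFrame({"address": ["1 A ST", "2 B AVE", "3 C RD"]})
complaints = pd.DataFrame(
    {"address": ["1 A ST", "1 A ST", "1 A ST", "3 C RD", "9 Z DR"]}
)

NUISANCE_THRESHOLD = 2  # hypothetical cutoff; choose it from the real distribution

# Count complaints per address
counts = (
    complaints.groupby("address").size().rename("complaint_count").reset_index()
)

# Left join keeps every STR, including those with no complaints
labeled = strs.merge(counts, on="address", how="left")
labeled["complaint_count"] = labeled["complaint_count"].fillna(0).astype(int)
labeled["is_nuisance"] = labeled["complaint_count"] >= NUISANCE_THRESHOLD
print(labeled)
```

The left join matters here: an inner join would silently drop STRs with no complaints, which are the negative examples a classifier needs.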

## 📋 What We Learned

The nine Scottsdale datasets cover:
- **STR landscape**: Licensed, unlicensed, and pending properties
- **Multi-source complaints**: EZ complaints, code violations, and police data
- **Geographic context**: Parcel-level property information
- **Temporal depth**: Date columns that can support trend analysis once their coverage is confirmed

Together these provide a foundation for nuisance prediction modeling. 🏛️