# 🔍 STR Nuisance Prediction - Initial EDA

**Objective**: Explore the short-term rental (STR) and complaint datasets to understand their patterns and identify candidate features for nuisance prediction.

**Dataset Source**: Scottsdale public data from Google Drive

## Analysis Goals:
1. 📊 Understanding data structure and quality
2. 🔍 Identifying nuisance patterns and trends  
3. 🏠 Exploring relationships between properties and complaints
4. 🛠️ Feature engineering opportunities
5. 🎯 Target variable definition

---

In [None]:
# 📚 Import Required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import warnings
import sys
import os
from IPython.display import display  # explicit import so display() also works outside a notebook

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 6)

print("📚 Libraries imported successfully!")
print(f"🗓️ Analysis Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("🚀 Ready to analyze Scottsdale STR data!")

## 📁 Load Scottsdale STR Data

Loading all 9 datasets from Google Drive.
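
Share links of the form used below point at a viewer page rather than the file itself. A minimal sketch of one common way to derive a direct-download URL from such a link follows. The helper name `drive_share_to_download_url` is hypothetical, and whether `ScottsdaleSTRDataLoader` does anything similar internally is an assumption; large files may also require an extra confirmation step that this sketch does not handle.

```python
import re

# Assumption: share links follow the ".../file/d/<FILE_ID>/view" pattern used in this notebook.
_DRIVE_ID_PATTERN = re.compile(r"/file/d/([A-Za-z0-9_-]+)")


def drive_share_to_download_url(share_url: str) -> str:
    """Turn a Google Drive share link into a direct-download URL (hypothetical helper)."""
    match = _DRIVE_ID_PATTERN.search(share_url)
    if match is None:
        raise ValueError(f"No Drive file ID found in: {share_url}")
    file_id = match.group(1)
    return f"https://drive.google.com/uc?export=download&id={file_id}"


share = "https://drive.google.com/file/d/16-lg-5fj-dttKUgwWTbzo0wDH4lCvV-t/view?usp=drive_link"
print(drive_share_to_download_url(share))
```

Raising on an unrecognized link makes a malformed entry in `file_links` fail early instead of surfacing later as a confusing download error.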

In [None]:
# Add project path to Python path
project_root = os.path.abspath('../src')
if project_root not in sys.path:
    sys.path.append(project_root)

# Import the Scottsdale data loader
try:
    from data_processing.scottsdale_data_loader import ScottsdaleSTRDataLoader
    print("✅ Scottsdale data loader imported successfully")
except ImportError as e:
    print(f"❌ Error importing data loader: {e}")
    print("📝 Please create the scottsdale_data_loader.py file first")
    raise  # Stop here: every later cell needs ScottsdaleSTRDataLoader

In [None]:
# 🔗 Initialize Data Loader and Set Up File Links
loader = ScottsdaleSTRDataLoader()

# Set up all your Google Drive file links
file_links = {
    'unlicensed_strs': 'https://drive.google.com/file/d/12mlo9JtfIUfOz3CxJVCEIgVIEZKGyQ6X/view?usp=drive_link',
    'ez_complaints': 'https://drive.google.com/file/d/1UDbXLlVdikJGFyVgOxLExYWFqe3ADcSj/view?usp=drive_link',
    'police_incidents': 'https://drive.google.com/file/d/1PF_cAutvvEMiAmEHzH2Qbljz73k75x0R/view?usp=drive_link',
    'police_citations': 'https://drive.google.com/file/d/1PQW90VjQsbYXxhOlpRXKM2MRyEbmOL0N/view?usp=drive_link',
    'police_arrests': 'https://drive.google.com/file/d/118W8cbYAnEgzPwy1I_cVuqoHLpoMz9UG/view?usp=drive_link',
    'code_violations': 'https://drive.google.com/file/d/1vUJ-HXU1RGb9AOvn0jAaaSkiYq4ICITs/view?usp=drive_link',
    'pending_licences': 'https://drive.google.com/file/d/1ybALd2DDYsdP6VgeLfnioo1kKSYt_xRR/view?usp=drive_link',
    'parcels': 'https://drive.google.com/file/d/19PPloUcM2FHxxQP17s4091aaLZjp4juB/view?usp=drive_link',
    'licensed_strs': 'https://drive.google.com/file/d/16-lg-5fj-dttKUgwWTbzo0wDH4lCvV-t/view?usp=drive_link'
}

print("🔗 Setting up Google Drive connections...")
loader.setup_file_links(file_links)
loader.print_file_info()

In [None]:
# 📥 Load All Datasets
print("📥 Loading all Scottsdale STR datasets...")
print("⏳ This may take a few minutes for large files...")

# Load all datasets
datasets = loader.load_all_datasets()

# Get organized datasets by category
categorized = loader.get_dataset_by_category()

# Extract main datasets for analysis
str_properties = categorized['str_properties']
complaints_data = categorized['complaints']
police_data = categorized['police']
geographic_data = categorized['geographic']

# Main datasets for EDA
licensed_strs = str_properties['licensed']           # Main STR properties
unlicensed_strs = str_properties['unlicensed']       # Unlicensed properties
pending_strs = str_properties['pending']             # Pending applications

ez_complaints = complaints_data['ez_complaints']      # Main complaints system
code_violations = complaints_data['code_violations']  # Code enforcement

police_incidents = police_data['incidents']          # Police incidents
police_citations = police_data['citations']          # Citations issued
police_arrests = police_data['arrests']              # Arrests made

parcels = geographic_data['parcels']                 # Property details

print("\n✅ Data loading completed!")
print("🎯 Ready for comprehensive STR nuisance analysis!")

## 📊 Dataset Overview

Let's examine what we have in each dataset.
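
Beyond row and column counts, a per-column missing-value report is a quick way to judge data quality. The sketch below runs on toy data; the helper `missing_value_report` and the column names are hypothetical and not taken from the Scottsdale files.

```python
import pandas as pd


def missing_value_report(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, missing count, and missing share, worst columns first."""
    report = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(1),
    })
    return report.sort_values("pct_missing", ascending=False)


# Toy data for illustration only; not the Scottsdale records.
toy = pd.DataFrame({
    "license_id": [101, 102, 103, 104],
    "address": ["1 Main St", None, "3 Oak Ave", None],
    "zip_code": ["85251", "85251", None, "85254"],
})
report = missing_value_report(toy)
print(report)
```

The same function could be applied to each loaded dataset, for example `missing_value_report(licensed_strs)`, after checking that the dataset is not `None`.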

In [None]:
# 📋 Dataset Summary
def print_dataset_summary(df, name):
    """Print summary of a dataset"""
    if df is not None:
        print(f"📊 {name}:")
        print(f"   Rows: {df.shape[0]:,}")
        print(f"   Columns: {df.shape[1]}")
        print(f"   Memory: {df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")
        print(f"   First columns: {list(df.columns)[:5]}{'...' if len(df.columns) > 5 else ''}")
        print()
    else:
        print(f"❌ {name}: Not loaded")
        print()

print("📋 DATASET SUMMARY")
print("=" * 40)

print("\n🏠 STR Properties:")
print_dataset_summary(licensed_strs, "Licensed STRs")
print_dataset_summary(unlicensed_strs, "Unlicensed STRs")
print_dataset_summary(pending_strs, "Pending Licences")

print("\n📞 Complaints & Violations:")
print_dataset_summary(ez_complaints, "EZ Complaints")
print_dataset_summary(code_violations, "Code Violations")

print("\n👮 Police Data:")
print_dataset_summary(police_incidents, "Police Incidents")
print_dataset_summary(police_citations, "Police Citations")
print_dataset_summary(police_arrests, "Police Arrests")

print("\n🗺️ Geographic Data:")
print_dataset_summary(parcels, "Parcels")

## 🏠 STR Properties Analysis

Let's start with the core STR properties data.

In [None]:
# 🏠 Licensed STR Analysis
if licensed_strs is not None:
    print("🏠 LICENSED STR PROPERTIES ANALYSIS")
    print("=" * 45)
    
    print(f"📊 Dataset Overview:")
    print(f"   Total licensed STRs: {len(licensed_strs):,}")
    print(f"   Columns: {len(licensed_strs.columns)}")
    
    # Show column names and types
    print(f"\n📋 Column Information:")
    for i, (col, dtype) in enumerate(zip(licensed_strs.columns, licensed_strs.dtypes), 1):
        non_null = licensed_strs[col].count()
        print(f"   {i:2d}. {col:30} | {str(dtype):15} | {non_null:,} non-null")
    
    # Show sample data
    print(f"\n👁️  Sample Licensed STRs:")
    display(licensed_strs.head(3))
    
    # Basic statistics for numeric columns
    numeric_cols = licensed_strs.select_dtypes(include=[np.number]).columns
    if len(numeric_cols) > 0:
        print(f"\n📈 Numeric Columns Statistics:")
        display(licensed_strs[numeric_cols].describe())
else:
    print("❌ Licensed STR data not available")

## 📞 Complaints Analysis

Next, examine the EZ Complaints records.

In [None]:
# 📞 EZ Complaints Analysis
if ez_complaints is not None:
    print("📞 EZ COMPLAINTS ANALYSIS")
    print("=" * 35)
    
    print(f"📊 Dataset Overview:")
    print(f"   Total complaints: {len(ez_complaints):,}")
    print(f"   Columns: {len(ez_complaints.columns)}")
    
    # Show column names
    print(f"\n📋 Complaint Columns:")
    for i, col in enumerate(ez_complaints.columns, 1):
        print(f"   {i:2d}. {col}")
    
    # Show sample data
    print(f"\n👁️  Sample Complaints:")
    display(ez_complaints.head(3))
    
    # Look for date columns
    date_cols = [col for col in ez_complaints.columns if 'date' in col.lower()]
    if date_cols:
        print(f"\n📅 Date columns found: {date_cols}")
    
    # Look for complaint type columns
    type_cols = [col for col in ez_complaints.columns 
                if any(word in col.lower() for word in ['type', 'category', 'nature'])]
    if type_cols:
        print(f"📋 Type columns found: {type_cols}")
else:
    print("❌ EZ Complaints data not available")

## 🔗 Data Relationships

Let's explore how we can connect the different datasets.
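
If the datasets share no identifier column, address text is the likely join key, and it usually needs normalizing first. A minimal sketch follows, assuming free-text street addresses; the helper `normalize_address` and both lookup tables are hypothetical and deliberately small.

```python
import re

# Assumption: small illustrative maps; real data will likely need more entries.
_SUFFIXES = {"STREET": "ST", "AVENUE": "AVE", "ROAD": "RD",
             "DRIVE": "DR", "BOULEVARD": "BLVD", "LANE": "LN"}
_DIRECTIONS = {"NORTH": "N", "SOUTH": "S", "EAST": "E", "WEST": "W"}


def normalize_address(raw):
    """Build a comparable key from a free-text street address (hypothetical helper)."""
    if not isinstance(raw, str):
        return None  # None, NaN, and other non-string values have no usable key
    text = re.sub(r"[^A-Z0-9 ]", " ", raw.upper())  # replace punctuation with spaces
    tokens = [_SUFFIXES.get(t, _DIRECTIONS.get(t, t)) for t in text.split()]
    return " ".join(tokens) or None  # empty input also yields None


key_a = normalize_address("7447 E. Indian School Road")
key_b = normalize_address("7447 east indian school rd.")
print(key_a, key_a == key_b)
```

Token-level replacement is crude: a street whose name is itself "North" or "Lane" would be abbreviated too, so matches produced this way are worth spot-checking before they feed a target variable.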

In [None]:
# 🔗 Find Common Columns for Linking Datasets
print("🔗 DATASET LINKING ANALYSIS")
print("=" * 35)

# Function to find common columns
def find_common_columns(df1, df2, name1, name2):
    """Print and return the column names shared by two datasets."""
    if df1 is None or df2 is None:
        return set()
    common = set(df1.columns) & set(df2.columns)
    print(f"\n🔗 {name1} ↔ {name2}:")
    if common:
        for col in sorted(common, key=str):
            print(f"   📋 {col}")
    else:
        print("   ⚠️ No shared column names")
    return common

# Check key relationships
print("📊 Looking for linkage columns...")

# find_common_columns already returns an empty set when either dataset is None,
# so no extra None checks are needed here.

# STR properties to complaints
find_common_columns(licensed_strs, ez_complaints, "Licensed STRs", "EZ Complaints")

# STR properties to parcels
find_common_columns(licensed_strs, parcels, "Licensed STRs", "Parcels")

# Complaints to police data
find_common_columns(ez_complaints, police_incidents, "EZ Complaints", "Police Incidents")

# Look for address-like columns
print(f"\n🏠 Address-like columns for spatial linking:")
for name, df in [('Licensed STRs', licensed_strs), ('EZ Complaints', ez_complaints), 
                ('Parcels', parcels), ('Police Incidents', police_incidents)]:
    if df is not None:
        addr_cols = [col for col in df.columns if 'address' in col.lower() or 'location' in col.lower()]
        if addr_cols:
            print(f"   📍 {name}: {addr_cols}")

## 🎯 Target Variable Creation

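Targets depend on how STR properties link to complaints, so the cell below only lists the planned steps. The targets themselves are built once the linking columns are confirmed.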
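
A minimal sketch of the counting and thresholding idea, on toy data. The column name `address_key` and the cutoff `NUISANCE_THRESHOLD` are assumptions made for illustration; the real cutoff should come from the observed complaint distribution.

```python
import pandas as pd

# Toy data for illustration only; column names are assumptions.
properties = pd.DataFrame({"address_key": ["1 MAIN ST", "2 OAK AVE", "3 ELM DR"]})
complaints = pd.DataFrame({"address_key": ["1 MAIN ST", "1 MAIN ST", "3 ELM DR", "9 PINE LN"]})

NUISANCE_THRESHOLD = 2  # Assumption: placeholder cutoff, not derived from real data

# Count complaints per address key.
counts = complaints.groupby("address_key").size().reset_index(name="complaint_count")

# Left join keeps every property; properties without complaints get a count of 0.
targets = properties.merge(counts, on="address_key", how="left")
targets["complaint_count"] = targets["complaint_count"].fillna(0).astype(int)
targets["is_nuisance"] = targets["complaint_count"] >= NUISANCE_THRESHOLD
print(targets)
```

The left join matters here: an inner join would silently drop properties with no complaints, which are exactly the negative examples a classifier needs. Complaints at addresses that match no property (such as `9 PINE LN` above) are ignored by this join and may deserve a separate look.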

In [None]:
# 🎯 Create Target Variables
print("🎯 TARGET VARIABLE CREATION")
print("=" * 35)

# This will be customized based on what we find in the actual data structure
print("📝 Next steps after examining the data:")
print("   1. Identify address/location matching strategy")
print("   2. Link STR properties to complaints")
print("   3. Count complaints per property")
print("   4. Define nuisance thresholds")
print("   5. Create binary/categorical targets")

print(f"\n✅ Data loading and initial exploration completed!")
print(f"📊 Ready for detailed feature engineering based on actual data structure")
print(f"🗓️ Analysis timestamp: {datetime.now()}")

## 📝 Next Steps

Based on this initial exploration, we'll proceed with:

1. **🔗 Data Linking** - Connect STRs to complaints using address matching
2. **🎯 Target Definition** - Define nuisance based on complaint patterns
3. **🛠️ Feature Engineering** - Create predictive features from all datasets
4. **🤖 Model Development** - Build and train prediction models
5. **✅ Validation** - Test model performance

The nine Scottsdale datasets make the following analyses possible:
- Multi-source complaint analysis (EZ + Code + Police)
- Licensed vs unlicensed property comparison
- Geographic risk assessment using parcels
- Temporal pattern analysis
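
For the temporal pattern analysis listed above, a first step could be monthly complaint counts. The sketch below uses toy data and a hypothetical `complaint_date` column; the real date column names are whatever the date-column scan in the complaints cell reports.

```python
import pandas as pd

# Toy data for illustration only; the column name is an assumption.
toy = pd.DataFrame({
    "complaint_date": ["2023-01-05", "2023-01-20", "2023-02-11", "not a date"],
})

# Unparseable values become NaT and are dropped instead of raising.
dates = pd.to_datetime(toy["complaint_date"], errors="coerce").dropna()

# Bucket by calendar month and count, in chronological order.
monthly = dates.dt.to_period("M").value_counts().sort_index()
print(monthly)
```

Using `errors="coerce"` keeps the notebook running on messy input, at the cost of hiding bad rows, so comparing `len(dates)` with the original row count shows how many values failed to parse.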