# Generate Fake Parquet Files for Testing

This notebook generates synthetic parquet files that mimic the structure of real dwells data for testing the outlier detection analysis.

## Data Structure

Each generated parquet file contains the nine required columns listed below, matching the structure of the BigQuery table `phd_dwells.dwells_monthly_fltrd`:

### Required Columns:
- **identifier**: Unique user identifier (string, format: `user_YYYYMM_XXXXXX`, where `XXXXXX` is a zero-padded six-digit user index)
- **identifier_type**: Type of identifier - either 'GAID' (Android) or 'IDFA' (iOS)
- **date**: Date of the dwell event (datetime at midnight, days 1-28 of the specified month)
- **timestamp**: Full timestamp of the dwell event (datetime, minute resolution, same day as `date`)
- **duration_seconds**: Duration of the dwell in seconds (int, 180 to 28,800, i.e. 3 minutes to 8 hours)
- **centroid_latitude**: Latitude coordinate (float, within 0.1 degrees of the Tel Aviv center at 32.0853)
- **centroid_longitude**: Longitude coordinate (float, within 0.1 degrees of the Tel Aviv center at 34.7818)
- **classification**: Type of dwell event - 'DWELL', 'AREA_DWELL', or 'STOP'
- **bump_count**: Number of location updates during the dwell (int, 0-5)

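### Derived Columns (not stored in the generated files):
The generator in this notebook writes only the required columns (see the column list in the verification output). The columns below are presumably calculated downstream from the required ones:
- **duration_hours**: Calculated from duration_seconds / 3600
- **geohash**: Calculated from latitude/longitude using pygeohash (precision=7)
- **flag_night**: Boolean flag for night hours (20:00-03:59, wrapping past midnight)
- **flag_work_hours**: Boolean flag for work hours (08:00-17:59)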
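
The time-based derived columns can be sketched with the standard library alone. The helper name `derive_columns` is hypothetical, and the sketch assumes the flags are taken from the hour of the dwell's `timestamp`, which the text does not state explicitly. The `geohash` column is left out because it depends on pygeohash.

```python
from datetime import datetime


def derive_columns(timestamp: datetime, duration_seconds: int) -> dict:
    """Compute the time-based derived columns for one dwell (illustrative helper)."""
    hour = timestamp.hour
    return {
        "duration_hours": duration_seconds / 3600,
        # The night window wraps past midnight: 20:00-23:59 or 00:00-03:59.
        "flag_night": hour >= 20 or hour < 4,
        # The work window is contiguous: 08:00-17:59.
        "flag_work_hours": 8 <= hour < 18,
    }


# A dwell starting at 20:59 that lasts 26,440 seconds.
example = derive_columns(datetime(2020, 1, 13, 20, 59), 26440)
print(example)
```

Because the night window crosses midnight, it needs an `or` of two comparisons, whereas the work window is a single chained comparison.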

## Usage

1. Set the output folder and months to generate
2. Adjust parameters (number of users, dwells per user range)
3. Run the generator
4. Use the generated files in `demo-outlier-detection-parquet.ipynb`


In [1]:
# Imports
import pandas as pd
import numpy as np
import os
import random
from datetime import timedelta
import pygeohash

print("Imports loaded successfully")


Imports loaded successfully


## Configuration

Set the output folder and months to generate.


In [2]:
# Configuration
OUTPUT_FOLDER = "data/test_parquet_files"  # Folder to save generated parquet files

# Months to generate (format: YYYYMM)
MONTHS = ['202001', '202002', '202003', '202004', '202005', '202006']

# Data generation parameters
NUM_USERS_PER_MONTH = 500  # Number of unique users per month
DWELLS_PER_USER_RANGE = (2, 30)  # Min and max dwells per user

print("Configuration:")
print(f"  Output folder: {OUTPUT_FOLDER}")
print(f"  Months to generate: {len(MONTHS)} months")
print(f"  Users per month: {NUM_USERS_PER_MONTH}")
print(f"  Dwells per user: {DWELLS_PER_USER_RANGE[0]}-{DWELLS_PER_USER_RANGE[1]}")


Configuration:
  Output folder: data/test_parquet_files
  Months to generate: 6 months
  Users per month: 500
  Dwells per user: 2-30
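
For longer test periods, the `MONTHS` list could be built programmatically instead of typed by hand. The following is a minimal sketch; `month_range` is a hypothetical helper, not part of the notebook.

```python
def month_range(start, count):
    """Return `count` consecutive month identifiers (YYYYMM), starting at `start`."""
    year, month = int(start[:4]), int(start[4:6])
    months = []
    for _ in range(count):
        months.append(f"{year:04d}{month:02d}")
        month += 1
        if month > 12:
            # Roll over into January of the next year.
            year, month = year + 1, 1
    return months


MONTHS = month_range("202001", 6)
print(MONTHS)
```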


## Generator Function

The function below writes one parquet file per month. Each file follows the schema of the real dwells data, with values drawn uniformly at random from fixed ranges.


In [3]:
def generate_fake_parquet_files(output_folder, months, num_users_per_month=1000, dwells_per_user_range=(1, 50)):
    """
    Generate fake parquet files for testing outlier detection analysis.
    
    This function creates synthetic dwells data that mimics the structure of real data
    from the BigQuery table `phd_dwells.dwells_monthly_fltrd`. The generated data includes:
    
    - Multiple users per month (with unique identifiers)
    - Multiple dwells per user (varying counts)
    - Realistic timestamps and dates within each month
    - Geographic coordinates around Tel Aviv, Israel
    - Duration values (minimum 180 seconds as per filtering)
    - Classification types and bump counts
    
    The generated files can be used to test the outlier detection notebook
    (`demo-outlier-detection-parquet.ipynb`) without needing access to BigQuery or real data.
    
    Parameters
    ----------
    output_folder : str
        Path to folder where parquet files will be saved. Will be created if it doesn't exist.
        
    months : list of str
        List of month identifiers in YYYYMM format (e.g., ['202001', '202002', '202003']).
        Each month will generate a separate parquet file named {month}.parquet.
        
    num_users_per_month : int, default=1000
        Number of unique users to generate for each month. Each user will have
        a unique identifier and multiple dwell events.
        
    dwells_per_user_range : tuple of (int, int), default=(1, 50)
        Range of dwell counts per user. Each user will have a random number of dwells
        between the minimum and maximum values (inclusive).
        
    Returns
    -------
    None
        Files are saved to disk. Prints progress and summary.
        
    Generated File Format
    ---------------------
    Each parquet file contains a DataFrame with the following columns:
    
    - identifier (str): Unique user ID, format: 'user_{month}_{index:06d}'
    - identifier_type (str): Either 'GAID' (Android) or 'IDFA' (iOS), randomly assigned
    - date (datetime): Date of the dwell event (within the specified month)
    - timestamp (datetime): Full timestamp including time of day
    - duration_seconds (int): Duration in seconds, range: 180 to 28800 (8 hours)
    - centroid_latitude (float): Latitude coordinate, centered around Tel Aviv (~32.0853)
    - centroid_longitude (float): Longitude coordinate, centered around Tel Aviv (~34.7818)
    - classification (str): Randomly selected from ['DWELL', 'AREA_DWELL', 'STOP']
    - bump_count (int): Number of location updates, range: 0 to 5
    
    Notes
    -----
    - Geographic coordinates are randomly distributed within ~0.1 degrees of Tel Aviv center
    - Timestamps are randomly distributed throughout each day
    - Dates are randomly selected from days 1-28 of each month (to avoid month-end issues)
    - Duration values respect the minimum 180 seconds filter used in real data processing
    
    Examples
    --------
    >>> # Generate 6 months of test data
    >>> months = ['202001', '202002', '202003', '202004', '202005', '202006']
    >>> generate_fake_parquet_files('data/test', months, num_users_per_month=500)
    
    >>> # Generate a single month with many users
    >>> generate_fake_parquet_files('data/test', ['202001'], num_users_per_month=2000)
    """
    os.makedirs(output_folder, exist_ok=True)
    
    # Israel coordinates (Tel Aviv area)
    base_lat = 32.0853
    base_lon = 34.7818
    
    total_dwells = 0
    
    for month in months:
        print(f"Generating fake data for {month}...")
        
        # Parse month
        year = int(month[:4])
        month_num = int(month[4:6])
        
        all_dwells = []
        
        # Generate users
        for user_idx in range(num_users_per_month):
            identifier = f"user_{month}_{user_idx:06d}"
            identifier_type = random.choice(['GAID', 'IDFA'])
            
            # Number of dwells for this user
            num_dwells = random.randint(dwells_per_user_range[0], dwells_per_user_range[1])
            
            # Generate dwells for this user
            for dwell_idx in range(num_dwells):
                # Random date within the month
                day = random.randint(1, 28)  # Use 28 to avoid month-end issues
                date = pd.Timestamp(year, month_num, day)
                
                # Random timestamp within the day
                hour = random.randint(0, 23)
                minute = random.randint(0, 59)
                timestamp = date.replace(hour=hour, minute=minute)
                
                # Random duration (in seconds, minimum 180)
                duration_seconds = random.randint(180, 8 * 3600)  # 3 min to 8 hours
                
                # Random location (around Tel Aviv)
                lat = base_lat + random.uniform(-0.1, 0.1)
                lon = base_lon + random.uniform(-0.1, 0.1)
                
                # Random classification
                classification = random.choice(['DWELL', 'AREA_DWELL', 'STOP'])
                
                # Random bump count
                bump_count = random.randint(0, 5)
                
                dwell = {
                    'identifier': identifier,
                    'identifier_type': identifier_type,
                    'date': date,
                    'timestamp': timestamp,
                    'duration_seconds': duration_seconds,
                    'centroid_latitude': lat,
                    'centroid_longitude': lon,
                    'classification': classification,
                    'bump_count': bump_count,
                }
                
                all_dwells.append(dwell)
        
        # Create DataFrame
        df = pd.DataFrame(all_dwells)
        total_dwells += len(df)
        
        # Save as parquet
        filename = f"{month}.parquet"
        filepath = os.path.join(output_folder, filename)
        df.to_parquet(filepath, index=False)
        
        print(f"  ✓ Saved {len(df):,} dwells to {filename}")
        print(f"    - {df['identifier'].nunique():,} unique users")
        print(f"    - Date range: {df['date'].min().date()} to {df['date'].max().date()}")
    
    print(f"\n{'='*60}")
    print(f"✓ Generated {len(months)} parquet files in {output_folder}")
    print(f"  Total dwells: {total_dwells:,}")
    print(f"  Total users: {len(months) * num_users_per_month:,}")
    print(f"{'='*60}")
    print(f"\nTo use these files, update PARQUET_FOLDER in demo-outlier-detection-parquet.ipynb to:")
    print(f"  {os.path.abspath(output_folder)}")

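
The generator draws from Python's module-level `random` functions without a seed, so every run produces different files. Calling `random.seed(...)` before the generator should make its output repeatable. The sketch below shows the same idea with a dedicated `random.Random` instance; `sample_dwell_counts` is a hypothetical helper that reproduces only the dwells-per-user draw.

```python
import random


def sample_dwell_counts(num_users, dwells_per_user_range, seed=None):
    """Draw a dwell count per user from an independent, optionally seeded generator."""
    rng = random.Random(seed)  # separate from the global `random` state
    low, high = dwells_per_user_range
    return [rng.randint(low, high) for _ in range(num_users)]


first = sample_dwell_counts(5, (2, 30), seed=42)
second = sample_dwell_counts(5, (2, 30), seed=42)
print(first == second)  # the same seed yields the same sequence
```

A dedicated generator keeps the test data reproducible even if other code in the session also uses `random`.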

In [4]:
# Generate the fake parquet files
generate_fake_parquet_files(
    OUTPUT_FOLDER, 
    MONTHS, 
    num_users_per_month=NUM_USERS_PER_MONTH,
    dwells_per_user_range=DWELLS_PER_USER_RANGE
)


Generating fake data for 202001...
  ✓ Saved 8,122 dwells to 202001.parquet
    - 500 unique users
    - Date range: 2020-01-01 to 2020-01-28
Generating fake data for 202002...
  ✓ Saved 8,015 dwells to 202002.parquet
    - 500 unique users
    - Date range: 2020-02-01 to 2020-02-28
Generating fake data for 202003...
  ✓ Saved 8,199 dwells to 202003.parquet
    - 500 unique users
    - Date range: 2020-03-01 to 2020-03-28
Generating fake data for 202004...
  ✓ Saved 8,408 dwells to 202004.parquet
    - 500 unique users
    - Date range: 2020-04-01 to 2020-04-28
Generating fake data for 202005...
  ✓ Saved 7,916 dwells to 202005.parquet
    - 500 unique users
    - Date range: 2020-05-01 to 2020-05-28
Generating fake data for 202006...
  ✓ Saved 8,127 dwells to 202006.parquet
    - 500 unique users
    - Date range: 2020-06-01 to 2020-06-28

✓ Generated 6 parquet files in data/test_parquet_files
  Total dwells: 48,787
  Total users: 3,000

To use these files, update PARQUET_FOLDER in de
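
The date ranges in the output stop at day 28 because the generator caps the day at 28. If full-month coverage matters for a test, the actual month length can be looked up with `calendar.monthrange`. This is a sketch of that alternative, not what the generator does; `random_day_in_month` is a hypothetical helper.

```python
import calendar
import random
from datetime import date


def random_day_in_month(month, rng):
    """Pick a random calendar date inside a YYYYMM month, including days 29-31."""
    year, month_num = int(month[:4]), int(month[4:6])
    # monthrange returns (weekday of the first day, number of days in the month).
    _, days_in_month = calendar.monthrange(year, month_num)
    return date(year, month_num, rng.randint(1, days_in_month))


rng = random.Random(0)
days = [random_day_in_month("202002", rng) for _ in range(1000)]
print(min(days), max(days))
```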

## Verify Generated Files

List the generated files and inspect the first one: row count, columns, unique users, date range, identifier types, and a sample row.


In [5]:
# Verify files were created
import glob

parquet_files = glob.glob(os.path.join(OUTPUT_FOLDER, "*.parquet"))
print(f"Found {len(parquet_files)} parquet files:")

for file_path in sorted(parquet_files):
    filename = os.path.basename(file_path)
    df = pd.read_parquet(file_path)
    print(f"\n{filename}:")
    print(f"  - Rows: {len(df):,}")
    print(f"  - Columns: {list(df.columns)}")
    print(f"  - Unique users: {df['identifier'].nunique():,}")
    print(f"  - Date range: {df['date'].min().date()} to {df['date'].max().date()}")
    print(f"  - Identifier types: {df['identifier_type'].value_counts().to_dict()}")
    
    # Show sample row
    print("\n  Sample row:")
    print(df.iloc[0].to_dict())
    break  # Just show first file as example


Found 6 parquet files:

202001.parquet:
  - Rows: 8,122
  - Columns: ['identifier', 'identifier_type', 'date', 'timestamp', 'duration_seconds', 'centroid_latitude', 'centroid_longitude', 'classification', 'bump_count']
  - Unique users: 500
  - Date range: 2020-01-01 to 2020-01-28
  - Identifier types: {'IDFA': 4377, 'GAID': 3745}

  Sample row:
{'identifier': 'user_202001_000000', 'identifier_type': 'IDFA', 'date': Timestamp('2020-01-13 00:00:00'), 'timestamp': Timestamp('2020-01-13 20:59:00'), 'duration_seconds': 26440, 'centroid_latitude': 32.09231240281736, 'centroid_longitude': 34.78557187708164, 'classification': 'STOP', 'bump_count': 1}
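
Beyond eyeballing a sample row, the documented value ranges can be checked per row. The following is a minimal sketch that uses plain dictionaries and standard-library datetimes; `validate_dwell` and `REQUIRED_COLUMNS` are hypothetical names, and the ranges are the ones stated in the generator's docstring.

```python
from datetime import datetime

REQUIRED_COLUMNS = {
    "identifier", "identifier_type", "date", "timestamp", "duration_seconds",
    "centroid_latitude", "centroid_longitude", "classification", "bump_count",
}


def validate_dwell(row):
    """Return a list of problems found in one dwell row (empty if the row is valid)."""
    missing = REQUIRED_COLUMNS - set(row)
    if missing:
        return [f"missing columns: {sorted(missing)}"]
    problems = []
    if row["identifier_type"] not in {"GAID", "IDFA"}:
        problems.append("unknown identifier_type")
    if row["classification"] not in {"DWELL", "AREA_DWELL", "STOP"}:
        problems.append("unknown classification")
    if not 180 <= row["duration_seconds"] <= 8 * 3600:
        problems.append("duration_seconds out of range")
    if not 0 <= row["bump_count"] <= 5:
        problems.append("bump_count out of range")
    if row["timestamp"].date() != row["date"].date():
        problems.append("timestamp is not on the dwell date")
    if abs(row["centroid_latitude"] - 32.0853) > 0.1:
        problems.append("latitude too far from Tel Aviv center")
    if abs(row["centroid_longitude"] - 34.7818) > 0.1:
        problems.append("longitude too far from Tel Aviv center")
    return problems


# The sample row from the verification output, with plain datetimes.
sample = {
    "identifier": "user_202001_000000", "identifier_type": "IDFA",
    "date": datetime(2020, 1, 13), "timestamp": datetime(2020, 1, 13, 20, 59),
    "duration_seconds": 26440, "centroid_latitude": 32.09231240281736,
    "centroid_longitude": 34.78557187708164, "classification": "STOP",
    "bump_count": 1,
}
print(validate_dwell(sample))
```

With pandas, rows in this shape can be obtained from `df.to_dict("records")`, since `pd.Timestamp` also provides a `.date()` method.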
