# Using Predictive Analytics on Business Registration Trends to Prioritize Development Funds in San Francisco

## Problem Statement

**How can predictive analytics, specifically survival analysis of business registration trends, help prioritize neighborhoods for development funds in the business-dense areas of San Francisco's Supervisor District 3?**

Understanding where new businesses are emerging, and how business density evolves over time, allows the City Funding Team to allocate resources strategically. Predictive models can uncover patterns in business registrations, enabling more informed, data-driven decisions about neighborhood-level investment.

---

## Data Collection Process

### Overview

A Python script was developed to collect and store data on registered businesses in San Francisco. The data is retrieved from the [San Francisco Open Data API](https://data.sfgov.org/Economy-and-Community/Registered-Business-Locations-San-Francisco/g8m3-pdis/about_data) and saved locally in CSV format for further analysis.

### API Source

- **Endpoint**: `https://data.sfgov.org/resource/g8m3-pdis.json`
- **Limit per request**: 1,000 records (the `$limit` value the script sends with each request)

### Features of the Script

#### 1. **CSV Loading**
- The script attempts to load previously saved data from `sf_registered_business.csv`.
- If the file does not exist, it initializes an empty DataFrame to begin data collection.
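
The loading step described above can be sketched as follows. This is a minimal illustration rather than the script's exact code: the helper name `load_existing` is hypothetical, and reading `uniqueid` as text is an assumption made so that stored IDs compare equal to the strings returned by the API.

```python
from pathlib import Path

import pandas as pd


def load_existing(csv_path):
    """Return previously saved records, or an empty DataFrame if none exist."""
    path = Path(csv_path)
    if path.exists():
        # Read uniqueid as text so IDs match the string values in the JSON response.
        return pd.read_csv(path, dtype={"uniqueid": str})
    return pd.DataFrame()


existing_df = load_existing("sf_registered_business.csv")
# Seed the set of known IDs so a later fetch can skip records already on disk.
seen_ids = set(existing_df["uniqueid"]) if "uniqueid" in existing_df else set()
```

Seeding `seen_ids` from the saved file is what lets a later run add only records it has not stored before.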

#### 2. **Paginated Requests**
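- Records are fetched in batches of 1,000 using an adjustable offset.
- Fetching stops when a batch returns fewer than 1,000 records. Offset paging alone does not guarantee that batches never overlap, so repeated records are removed by the deduplication step.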
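
To make the stopping rule concrete, here is a small sketch of offset paging that is independent of the network. The names `iter_batches`, `fake_fetch`, and `fake_rows` are hypothetical, and the 2,500 synthetic rows only stand in for the API.

```python
def iter_batches(fetch, limit=1000):
    """Yield successive batches from `fetch(offset, limit)` until a short batch arrives."""
    offset = 0
    while True:
        batch = fetch(offset, limit)
        if batch:
            yield batch
        # A batch shorter than `limit` means the source has no more rows.
        if len(batch) < limit:
            break
        offset += limit


# Stand-in for the API: 2,500 synthetic rows served by slicing a list.
fake_rows = [{"uniqueid": str(i)} for i in range(2500)]


def fake_fetch(offset, limit):
    return fake_rows[offset:offset + limit]


batch_sizes = [len(batch) for batch in iter_batches(fake_fetch)]
print(batch_sizes)  # [1000, 1000, 500]
```

Against the real endpoint, `fetch` would wrap a `requests.get` call that passes `$limit` and `$offset`. Adding an `$order` parameter so that pages keep a stable order between requests may also help; whether this dataset needs it is an assumption that should be checked against the API documentation.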

#### 3. **Deduplication**
- A `set` of `uniqueid` values is maintained to prevent duplicate records.
- Only new, unseen records are added to the dataset.

#### 4. **Data Storage**
- New records are appended to the existing dataset.
- The combined dataset is then saved back to `sf_registered_business.csv`.
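
One possible shape for the append-and-save step, combined with the `uniqueid` deduplication, is sketched below. The function name `merge_and_save` is hypothetical, and keeping the first occurrence of a repeated ID is an assumption about the desired behavior.

```python
import pandas as pd


def merge_and_save(existing_df, new_records, csv_path):
    """Append new records to the existing data, drop repeated IDs, and write the CSV."""
    new_df = pd.DataFrame(new_records)
    combined = pd.concat([existing_df, new_df], ignore_index=True)
    if "uniqueid" in combined.columns:
        # Keep the first occurrence so rows already on disk win over re-fetched copies.
        combined = combined.drop_duplicates(subset="uniqueid", keep="first")
    combined.to_csv(csv_path, index=False)
    return combined
```

Dropping duplicates at save time acts as a second safeguard in case the in-memory set of seen IDs and the file on disk drift apart.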

#### 5. **Error Handling**
- The script handles common API issues such as connection errors, HTTP error responses, and JSON decoding problems.
- Each exception is reported together with the offset at which it occurred, and collection stops at that point.
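
A minimal sketch of how exceptions could be captured with Python's standard `logging` module is shown below. The wrapper `safe_fetch` and the logger name are hypothetical. The `OSError` branch relies on the fact that the exceptions raised by `requests` derive from `IOError`, which is an alias of `OSError`.

```python
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("sf_business_fetch")  # hypothetical logger name


def safe_fetch(fetch, offset, limit=1000):
    """Call `fetch(offset, limit)`; log any failure and return None instead of raising."""
    try:
        return fetch(offset, limit)
    except json.JSONDecodeError:
        logger.error("Could not decode JSON at offset %d", offset)
    except OSError as exc:
        # Connection errors and timeouts are OSError subclasses, so they land here.
        logger.error("Request failed at offset %d: %s", offset, exc)
    except Exception:
        # logger.exception also records the traceback for unexpected failures.
        logger.exception("Unexpected error at offset %d", offset)
    return None
```

Returning `None` lets the calling loop decide whether to stop or retry, instead of hiding that decision inside the fetch function.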

#### 6. **Execution Tracking**
- Real-time progress updates are printed to the console.
- Logs the number of new records fetched in each iteration.
- Tracks and reports total execution time and the number of records processed.

#### 7. **Rate Limiting**
- A 0.1-second pause between API requests reduces load on the server and helps avoid throttling.
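
As an illustration of the delay, the sketch below enforces a minimum gap between requests. The `Throttle` class is hypothetical and is a variation on a fixed sleep: it subtracts the time a request already took, so slow responses are not delayed further.

```python
import time


class Throttle:
    """Enforce a minimum gap between successive calls to `wait()`."""

    def __init__(self, min_interval=0.1):
        self.min_interval = min_interval
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()
```

Calling `throttle.wait()` immediately before each request keeps the request rate at or below one per `min_interval` seconds.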

---


import json  # needed for json.JSONDecodeError in the except clause below
import requests
import pandas as pd
import time

url = 'https://data.sfgov.org/resource/g8m3-pdis.json'  # Using JSON for structured data
csv_filename = '../data/sf_registered_business.csv'
limit = 1000
offset = 0
all_records = []
seen_ids = set()
total_fetched = 0

print("Fetching data in chunks and removing duplicates...")
start_time = time.time()

while True:
    fetch_start_time = time.time()
    params = {'$limit': limit, '$offset': offset}
    try:
        response = requests.get(url, params=params, timeout=30)  # timeout avoids hanging on a stalled connection
        response.raise_for_status()
        data = response.json()
        fetch_end_time = time.time()
        fetch_time = fetch_end_time - fetch_start_time
        num_records = len(data)
        total_fetched += num_records
        print(f"Fetched {num_records} records (offset: {offset}), Fetch time: {fetch_time:.2f} seconds")

        new_records = 0
        for record in data:
            unique_id = record.get('uniqueid')
            if unique_id and unique_id not in seen_ids:
                all_records.append(record)
                seen_ids.add(unique_id)
                new_records += 1

        print(f"Added {new_records} new unique records.")

        if num_records < limit:
            print("Finished fetching all data.")
            break

        offset += limit
        time.sleep(0.1)  # brief pause between requests to avoid throttling

    except requests.exceptions.RequestException as e:
        print(f"Error fetching data at offset {offset}: {e}")
        break
    except json.JSONDecodeError:
        print(f"Error decoding JSON response at offset {offset}.")
        break
    except Exception as e:
        print(f"An unexpected error occurred at offset {offset}: {e}")
        break

collect_start_time = time.time()
if all_records:
    df = pd.DataFrame(all_records)
    collect_end_time = time.time()
    collect_time = collect_end_time - collect_start_time
    print(f"\nData collected into DataFrame in {collect_time:.2f} seconds.")

    # Save the DataFrame to CSV
    save_start_time = time.time()
    df.to_csv(csv_filename, index=False)
    save_end_time = time.time()
    save_time = save_end_time - save_start_time
    print(f"Data saved to {csv_filename} in {save_time:.2f} seconds.")

    print("\nFirst few rows of the saved DataFrame:")
    print(df.head())
    print("\nDataFrame information:")
    df.info()

else:
    print("No unique records were collected.")

total_time = time.time() - start_time
print(f"\nTotal execution time: {total_time:.2f} seconds.")
print(f"Total unique records collected: {len(seen_ids)}")
print(f"Total records fetched from API: {total_fetched}")

Fetching data in chunks and removing duplicates...
Fetched 1000 records (offset: 0), Fetch time: 0.87 seconds
Added 1000 new unique records.
Fetched 1000 records (offset: 1000), Fetch time: 1.12 seconds
Added 1000 new unique records.
Fetched 1000 records (offset: 2000), Fetch time: 0.85 seconds
Added 1000 new unique records.
Fetched 1000 records (offset: 3000), Fetch time: 1.19 seconds
Added 985 new unique records.
Fetched 1000 records (offset: 4000), Fetch time: 0.91 seconds
Added 1000 new unique records.
Fetched 1000 records (offset: 5000), Fetch time: 1.25 seconds
Added 1000 new unique records.
Fetched 1000 records (offset: 6000), Fetch time: 0.89 seconds
Added 1000 new unique records.
Fetched 1000 records (offset: 7000), Fetch time: 1.19 seconds
Added 1000 new unique records.
Fetched 1000 records (offset: 8000), Fetch time: 0.96 seconds
Added 985 new unique records.
Fetched 1000 records (offset: 9000), Fetch time: 1.17 seconds
Added 1000 new unique records.
Fetched 1000 records (of