# Space Launch Data Acquisition via API Pagination

**Notebook for 1st Phase** - Data Collection

## Purpose
This notebook collects comprehensive historical rocket launch data from the Space Devs Launch Library 2 API using automated pagination to retrieve all available records.

## Authors
- **Phillip Roman** - Pagination implementation and rate limiting
- **Jillian Kunze** - Initial API exploration and date filtering prototype

## Workflow Position
1. **This Notebook** → Collect raw launch data via API (5-6 hours for full collection)
2. Data Cleaning → Extract and clean specific parameters
3. Data Merging → Combine cleaned datasets

## Key Features
- Automated pagination through API results
- Built-in rate limiting compliance (15 requests/hour)
- Test mode for validation before full collection
- Automatic file saving with collection metadata

## Output
- `raw_baseline_launches_[collector_name].json` - Complete launch dataset ready for cleaning

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Import Required Libraries

Load all necessary Python packages for API communication, data handling, and file operations.

In [2]:
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo
import dateutil.parser as dateparser
from pprint import pprint
import requests
import time
import json

## Date Filtering (Optional - Not Currently Used)

**Note:** This section is included for reference but is NOT used in the current data collection workflow. The main pagination loop collects ALL historical launches without date restrictions.

### When Date Filtering Would Be Useful:
- Collecting specific time periods for targeted analysis
- Incremental updates to add only new launches since last collection
- Testing with smaller datasets before full collection
- Recovering missing data from specific date ranges

### How to Implement:
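If needed in the future, add `net_filters` to the tuple passed to `"&".join(...)` in the "Set Up API Query Parameters" section, for example `(mode, limit, ordering, net_filters)`. The join supplies the `&` separator, so the string itself needs no leading `&`.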

**For this project, skip to the next section.**

In [3]:
# Optional: Date filtering
start_time = dateparser.parse("1/1/25")
end_time = dateparser.parse("2/28/25")

# Visual check
print("Start date:", start_time.isoformat())
print("End date:", end_time.isoformat())

# Set the filter parameters with the start and end date:
net_filters = f'net__gte={start_time.isoformat()}&net__lte={end_time.isoformat()}'

Start date: 2025-01-01T00:00:00
End date: 2025-02-28T00:00:00
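
If date filtering is needed later, one way to attach it is to build the query string from a dictionary so the timestamp values are percent-encoded. This is only a sketch: `build_launch_url` is a hypothetical helper that is not part of this notebook, and it reuses the `net__gte` / `net__lte` filter names from the cell above.

```python
from datetime import datetime
from urllib.parse import urlencode

BASE_URL = "https://ll.thespacedevs.com/2.3.0/launches/previous/"


def build_launch_url(start=None, end=None):
    """Hypothetical helper: assemble the query URL, optionally date-filtered."""
    params = {"mode": "detailed", "limit": 100, "ordering": "net"}
    if start is not None:
        params["net__gte"] = start.isoformat()
    if end is not None:
        params["net__lte"] = end.isoformat()
    # urlencode percent-encodes characters such as ':' in the timestamps
    return BASE_URL + "?" + urlencode(params)


filtered_url = build_launch_url(datetime(2025, 1, 1), datetime(2025, 2, 28))
print(filtered_url)
```

Because the end date parses to midnight, a filter ending on 2/28 excludes launches later that day; passing midnight of the following day is one way to cover the full date.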


## Set Up API Query Parameters

Configure the API request settings and build the initial URL for data collection.

**Parameters:**
- **Mode:** detailed - Returns all nested objects for each launch
- **Limit:** 100 records per page (maximum allowed by API)
- **Ordering:** net (launch date) - Ascending chronological order

In [4]:
# API parameters
mode = 'mode=detailed'  # returns all related objects
limit = 'limit=100'  # this is the max!
ordering = 'ordering=net'  # ascending date order

# Assemble the full URL
current_url = "https://ll.thespacedevs.com/2.3.0/launches/previous/" + "?" + "&".join(
    (mode, limit, ordering)
)

print(f'Query URL: {current_url}')  # Visual check

Query URL: https://ll.thespacedevs.com/2.3.0/launches/previous/?mode=detailed&limit=100&ordering=net


## Configuration: Test Mode vs Full Collection

Set TEST_MODE to control the data collection scope.

**TEST_MODE = True:**
- Collects up to 12 API calls (1,200 launches), matching `max_calls = 12` in the cell below
- Pauses after every 4 calls for 2 minutes
- Use for testing and validation

**TEST_MODE = False:**
- Collects ALL historical launches (7,000+ records)
- Pauses after every 14 calls for 1 hour 30 seconds
- Full collection takes approximately 5-6 hours

**Important:** Run this cell before the pagination loop to set your collection mode.

In [5]:

TEST_MODE = True  # Set to True for test run, False for full collection

# Set parameters
if TEST_MODE:
    max_calls = 12
    pause_after_call_num = 4
    pause_duration_in_seconds = 120  # 2 min
else:
    max_calls = None  # collects everything
    pause_after_call_num = 14  # API limit is 15 calls/hour
    pause_duration_in_seconds = 3630  # 1 hour 30 seconds

print(f"Running in {'TEST' if TEST_MODE else 'FULL COLLECTION'} mode")

Running in TEST mode
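
The runtime figures follow from the pause settings, and the arithmetic can be sketched as below. The helper names are hypothetical, the 7,500 record count is an illustrative assumption (the notebook only states "7,000+"), and the estimate covers sleep time only, not the time spent on the requests themselves.

```python
import math


def estimate_pauses(total_calls, pause_after_call_num):
    """Count pauses: one after every full batch, except after the final call."""
    if total_calls <= 0:
        return 0
    return (total_calls - 1) // pause_after_call_num


def estimate_wait_hours(total_records, page_size, pause_after_call_num, pause_seconds):
    """Estimate total sleep time in hours; request time is excluded."""
    total_calls = math.ceil(total_records / page_size)
    return estimate_pauses(total_calls, pause_after_call_num) * pause_seconds / 3600


# 7,500 is a hypothetical record count used only for illustration
full_wait = estimate_wait_hours(7500, 100, 14, 3630)
test_pauses = estimate_pauses(12, 4)
print(f"Full collection sleeps for about {full_wait:.1f} hours")
print(f"Test mode pauses {test_pauses} times")
```

No pause is counted after the final call, which mirrors the loop's behavior of skipping the sleep once there is nothing left to fetch.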


## Data Collection via Automated Pagination

This cell executes the main data collection workflow, including:

1. **API Pagination:** Automatically follows 'next' URLs until all launches are collected
2. **Rate Limiting:** Pauses after every 14 calls for 1 hour 30 seconds in full-collection mode (every 4 calls for 2 minutes in test mode) to comply with API limits
3. **Network Timeout:** 60-second timeout to handle connection issues
4. **Progress Tracking:** Displays request count and cumulative totals
5. **Metadata Collection:** Records collector name, timestamp, and total count
6. **Automatic Saving:** Saves complete dataset to JSON file upon completion

### Before Running:
**REQUIRED:** Update the `collector_name` variable with your name

Set `output_path` to one of the three options in the cell: current directory, Google Drive, or a local path. As written, the Google Drive option is active.

### Expected Runtime:
- **Test Mode:** ~6-10 minutes
- **Full Collection:** ~5-6 hours (due to API rate limiting)

### Output File:
`raw_baseline_launches_[collector_name].json`

---

**Pagination Logic Credit:**  
Based on [Stack Overflow example](https://stackoverflow.com/questions/56206038/how-to-loop-through-paginated-api-using-python)
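
The core of the pagination logic, following each page's `next` link until it is `None`, can be isolated as a small generator. This is a minimal sketch rather than the notebook's actual loop: `iter_pages` and `fetch_page` are hypothetical names, and the fetch function is injected so the logic can be exercised offline without touching the API or its rate limit.

```python
def iter_pages(start_url, fetch_page, max_calls=None):
    """Yield each page's results, following 'next' links until none remain.

    fetch_page is any callable that takes a URL and returns the decoded JSON
    dict for that page (hypothetical; injected so the loop can be tested offline).
    """
    url = start_url
    calls = 0
    while url and (max_calls is None or calls < max_calls):
        page = fetch_page(url)
        calls += 1
        yield page["results"]
        url = page["next"]  # None on the last page


# Offline stand-in for the API: three fake pages linked by 'next'
fake_api = {
    "page1": {"results": [1, 2], "next": "page2"},
    "page2": {"results": [3, 4], "next": "page3"},
    "page3": {"results": [5], "next": None},
}

collected = []
for results in iter_pages("page1", fake_api.__getitem__):
    collected.extend(results)

# Same walk, capped at two calls as in a test run
capped = [
    item
    for results in iter_pages("page1", fake_api.__getitem__, max_calls=2)
    for item in results
]
print(collected, capped)
```

Against the real API, `fetch_page` could wrap `requests.get(url, timeout=60).json()`, with the rate-limit pause applied between pages by the caller.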


In [6]:
all_launches = []

api_call_count = 0

print("STARTING DATA ACQUISITION")
while current_url and (max_calls is None or api_call_count < max_calls):

    print(f"Request #: {api_call_count + 1}")

    try:
        response = requests.get(current_url, timeout=60)  # stop waiting if the server sends no data for 60 s

        if response.status_code == 200:

            raw_data = response.json()

            all_launches.extend(raw_data['results'])

            # get next page url (None if last page)
            current_url = raw_data['next']

            print(f"Success! So far collected {len(all_launches)} total launches.")

            api_call_count += 1

            # pause after every full batch of calls, unless this was the last page
            should_pause = (api_call_count % pause_after_call_num == 0 and
                            current_url is not None)

            # in a capped (test) run, skip the pause after the final allowed call
            if max_calls is not None:
                should_pause = should_pause and (api_call_count < max_calls)

            if should_pause:

                philly_tz = ZoneInfo('America/New_York')
                time_in_philly = datetime.now(philly_tz)
                length_of_pause = timedelta(seconds=pause_duration_in_seconds)
                resume_time = time_in_philly + length_of_pause

                print(f"{api_call_count} CALLS MADE. PAUSING FOR {pause_duration_in_seconds / 60:.0f} MIN.")
                print(f"Resuming at {resume_time.strftime('%I:%M:%S %p %Z')}")

                time.sleep(pause_duration_in_seconds)

                print("RESUMING API CALLS")

        else:
            print(f"Error! Status code: {response.status_code}")
            break

    except requests.exceptions.Timeout:
        print("Request timed out. Stopping collection.")
        break

    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}. Stopping collection.")
        break

print(f"FINISHED: Collected {len(all_launches)} total launches!!!")

# update collector's name before running
collector_name = 'PJR322test'

final_data = {
    'collector': collector_name,
    'total_launches': len(all_launches),
    'collection_date': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
    'launches': all_launches
}

# file saving
print(f"Saving {len(all_launches)} launches to a file...")

# option 1 - save to current directory
# output_path = ''

# option 2 (Colab users) - for Google Drive
output_path = '/content/drive/MyDrive/DSCI511/Term Project/'

# option 3 - to local path
# output_path = '/Users/YourName/Desktop/DSCI511/'

with open(output_path + f'raw_baseline_launches_{collector_name}.json', 'w', encoding='utf-8') as f:
    json.dump(final_data, f, indent=4)

print(f"File saved to: {output_path if output_path else 'current directory'}")

print("Save complete!")

STARTING DATA ACQUISITION
Request #: 1
Success! So far collected 100 total launches.
Request #: 2
Success! So far collected 200 total launches.
2 CALLS MADE. PAUSING FOR 1 MIN.
Resuming at 07:25:49 AM EST
RESUMING API CALLS
Request #: 3
Success! So far collected 300 total launches.
Request #: 4
Success! So far collected 400 total launches.
FINISHED: Collected 400 total launches!!!
Saving 400 launches to a file...
File saved to: /content/drive/MyDrive/DSCI511/Term Project/
Save complete!
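
Before handing the file to the cleaning step, it may be worth reloading it and confirming that the metadata agrees with the contents. The sketch below assumes the four top-level keys written by the cell above; `validate_saved_file` is a hypothetical helper, and the round trip uses a tiny stand-in dataset rather than real launch data.

```python
import json
import os
import tempfile

REQUIRED_KEYS = {"collector", "total_launches", "collection_date", "launches"}


def validate_saved_file(path):
    """Reload a saved collection and check its metadata against its contents."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    missing = REQUIRED_KEYS - set(data)
    count_matches = not missing and data["total_launches"] == len(data["launches"])
    return {"missing_keys": sorted(missing), "count_matches": count_matches}


# Illustrative round trip with a tiny stand-in dataset (not real launch data)
sample = {
    "collector": "example",
    "total_launches": 2,
    "collection_date": "2025-01-01 00:00:00",
    "launches": [{"id": "a"}, {"id": "b"}],
}
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "raw_baseline_launches_example.json")
    with open(sample_path, "w", encoding="utf-8") as f:
        json.dump(sample, f, indent=4)
    report = validate_saved_file(sample_path)

print(report)
```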


## Data Exploration and Validation (Optional)

The following cells are for exploring the collected data and validating the API collection worked correctly. These are NOT required for the data acquisition workflow.

**Skip these cells if you only need to collect the data.**

In [7]:
# Quick data check - view first few launches

print(f"Collected a total of {len(all_launches)} launches")
print("First 3 launches:")

pprint(all_launches[:3])

Collected a total of 400 launches
First 3 launches:
[{'agency_launch_attempt_count': 1,
  'agency_launch_attempt_count_year': 1,
  'failreason': '',
  'flightclub_url': None,
  'hashtag': None,
  'id': 'e3df2ecd-c239-472f-95e4-2b89b4f75800',
  'image': {'credit': None,
            'id': 1844,
            'image_url': 'https://thespacedevs-prod.nyc3.digitaloceanspaces.com/media/images/sputnik_8k74ps_image_20210830185541.jpg',
            'license': {'id': 1,
                        'link': None,
                        'name': 'Unknown',
                        'priority': 9},
            'name': '[AUTO] Sputnik 8K74PS - image',
            'single_use': True,
            'thumbnail_url': 'https://thespacedevs-prod.nyc3.digitaloceanspaces.com/media/images/255bauto255d__image_thumbnail_20240305193923.jpeg',
            'variants': []},
  'info_urls': [],
  'infographic': None,
  'last_updated': '2024-03-17T19:17:35Z',
  'launch_designator': '1957-001',
  'launch_service_provider': {'abbr

Looking up launch dates is straightforward because each record stores its date in the `net` field. The earliest launches recorded by The Space Devs date to 1957.

In [8]:
# dates for each launch collected
for item in all_launches[:10]:
    print(item["net"])

1957-10-04T19:28:34Z
1957-11-03T02:30:00Z
1957-12-06T16:44:35Z
1958-02-01T03:47:56Z
1958-02-05T07:33:00Z
1958-03-05T18:27:57Z
1958-03-17T12:15:41Z
1958-03-26T17:38:01Z
1958-04-27T07:00:35Z
1958-04-29T02:53:00Z
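
Since the query uses `ordering=net`, one quick validation is to confirm that the collected `net` values never go backwards. The sketch below uses hypothetical helper names and the first three timestamps from the output above as sample input.

```python
from datetime import datetime


def parse_net(net_string):
    """Parse an ISO 8601 timestamp like '1957-10-04T19:28:34Z' as timezone-aware UTC."""
    # Replace the trailing 'Z' so fromisoformat also works on Python versions before 3.11
    return datetime.fromisoformat(net_string.replace("Z", "+00:00"))


def is_chronological(launches):
    """Return True if the 'net' values never decrease from one launch to the next."""
    times = [parse_net(launch["net"]) for launch in launches]
    return all(earlier <= later for earlier, later in zip(times, times[1:]))


# The first three 'net' values from the output above
sample_launches = [
    {"net": "1957-10-04T19:28:34Z"},
    {"net": "1957-11-03T02:30:00Z"},
    {"net": "1957-12-06T16:44:35Z"},
]
in_order = is_chronological(sample_launches)
out_of_order = is_chronological(sample_launches[::-1])
print(in_order, out_of_order)
```

On real data, the same check could be run as `is_chronological(all_launches)`, assuming every record has a non-empty `net` value.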


Check current usage against the rate limit with the API throttle endpoint, based on the [docs](https://ll.thespacedevs.com/2.3.0/api-throttle/)

In [12]:
API_throttle_URL = "https://ll.thespacedevs.com/2.3.0/api-throttle/"

response_throttle = requests.get(API_throttle_URL, timeout=60)

In [13]:
throttle_data = response_throttle.json()
pprint(throttle_data)

{'current_use': 4,
 'ident': '34.61.113.243',
 'limit_frequency_secs': 3600,
 'next_use_secs': 0,
 'your_request_limit': 15}
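
The throttle response could also drive the pause decision directly instead of relying on a fixed call count. The sketch below rests on an assumption inferred from the field names, not confirmed here: that `current_use` counts requests made in the current window and `next_use_secs` is the wait until another request is allowed. Both helper names are hypothetical.

```python
def requests_remaining(throttle):
    """Requests left in the current window, under the assumed field meanings."""
    return max(throttle["your_request_limit"] - throttle["current_use"], 0)


def seconds_to_wait(throttle):
    """Suggested wait before the next call: 0 while quota remains, else next_use_secs."""
    if requests_remaining(throttle) > 0:
        return 0
    return throttle["next_use_secs"]


# Numeric fields copied from the output above
sample_throttle = {
    "current_use": 4,
    "limit_frequency_secs": 3600,
    "next_use_secs": 0,
    "your_request_limit": 15,
}
remaining = requests_remaining(sample_throttle)
wait = seconds_to_wait(sample_throttle)
print(remaining, wait)
```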


Test automatically dividing an input year into monthly date ranges

In [11]:
year = 2014
# First day of each month in `year`, plus January 1 of the next year as the closing boundary
year_month_starts = [datetime(year, month, 1) for month in range(1, 13)]
year_month_starts.append(datetime(year + 1, 1, 1))

# print(year_month_starts[-1])
for index, month in enumerate(year_month_starts[:-1]):
    start_date = month
    end_date = year_month_starts[index + 1]
    print(f"From {start_date} to {end_date}")

From 2014-01-01 00:00:00 to 2014-02-01 00:00:00
From 2014-02-01 00:00:00 to 2014-03-01 00:00:00
From 2014-03-01 00:00:00 to 2014-04-01 00:00:00
From 2014-04-01 00:00:00 to 2014-05-01 00:00:00
From 2014-05-01 00:00:00 to 2014-06-01 00:00:00
From 2014-06-01 00:00:00 to 2014-07-01 00:00:00
From 2014-07-01 00:00:00 to 2014-08-01 00:00:00
From 2014-08-01 00:00:00 to 2014-09-01 00:00:00
From 2014-09-01 00:00:00 to 2014-10-01 00:00:00
From 2014-10-01 00:00:00 to 2014-11-01 00:00:00
From 2014-11-01 00:00:00 to 2014-12-01 00:00:00
From 2014-12-01 00:00:00 to 2015-01-01 00:00:00
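
These monthly boundaries could be turned into filter strings of the same form as `net_filters` in the date-filtering cell, one per month. This is a sketch with hypothetical helper names. Because both `net__gte` and `net__lte` are inclusive, a launch timestamped exactly at a month boundary could appear in two windows, so results would need deduplicating by `id`.

```python
from datetime import datetime


def month_windows(year):
    """Return (start, end) pairs covering each month of the given year."""
    starts = [datetime(year, month, 1) for month in range(1, 13)]
    starts.append(datetime(year + 1, 1, 1))  # boundary closing December
    return list(zip(starts[:-1], starts[1:]))


def window_filter(start, end):
    """Build a filter string in the same form as net_filters."""
    return f"net__gte={start.isoformat()}&net__lte={end.isoformat()}"


windows = month_windows(2014)
filters = [window_filter(start, end) for start, end in windows]
print(len(filters))
print(filters[0])
```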
