# Split JSON Documents by Year

This notebook loads a JSON file containing multiple study documents, inspects its structure, and saves each document into a separate JSON file organized by year. The output files are placed in the corresponding year subfolders under `data/all_documents`.

In [7]:
import os
import sys
import json
from pathlib import Path

# add parent directory to path
library_path = os.path.abspath('..')
if library_path not in sys.path:
    sys.path.append(library_path)

## 1. Import Required Libraries and Set Up Paths

The first code cell imports the required Python libraries and adds the parent directory to `sys.path`. Because `os.path.abspath('..')` is resolved against the current working directory, the notebook must be run from its own folder for the project and data paths to resolve correctly.

In [19]:
DATA_FOLDER = Path(library_path) / "data"
documents_folder = DATA_FOLDER / 'all_documents'
documents_folder.exists()

True

## 2. Define Data and Output Folders

This cell defines the location of the main data folder and the folder where the split documents will be saved. It also checks that the output folder exists.
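If the output folder might not exist yet, it can be created instead of only checked. The following is a minimal sketch, not part of the original notebook: `ensure_folder` is a hypothetical helper, and the temporary directory stands in for the project root.

```python
import tempfile
from pathlib import Path

def ensure_folder(path):
    """Create the folder (and any missing parents) if needed and return it as a Path."""
    folder = Path(path)
    # exist_ok=True makes the call safe to repeat when the folder is already there
    folder.mkdir(parents=True, exist_ok=True)
    return folder

# Demonstration in a temporary directory standing in for the project root (assumption)
with tempfile.TemporaryDirectory() as tmp:
    demo_folder = ensure_folder(Path(tmp) / "data" / "all_documents")
    folder_exists = demo_folder.is_dir()
```

In the notebook itself, the same call could be applied to `documents_folder` before any files are written.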

In [9]:
# Load the JSON file containing the documents
with open(DATA_FOLDER / 'pd_review.studies.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

# Check the structure and number of documents
print(type(data))
if isinstance(data, dict):
    print('Keys:', data.keys())
elif isinstance(data, list):
    print('Number of documents:', len(data))
else:
    print('Unknown structure:', data)

<class 'list'>
Number of documents: 345


## 3. Load and Inspect the JSON Data

This cell loads the main JSON file containing all study documents and prints its top-level structure: the keys if the file holds a dictionary, or the number of documents if it holds a list. Here the file is a list of 345 documents. This helps verify the data before processing.
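To go one step further and see which fields the documents carry, the top-level keys can be counted across the list. This is a hedged sketch: `summarize_keys` is a hypothetical helper, and the sample records are invented stand-ins for the loaded `data` list, not real entries from the file.

```python
from collections import Counter

def summarize_keys(documents):
    """Count how often each top-level key appears across a list of dict documents."""
    counts = Counter()
    for doc in documents:
        # Skip anything that is not a dictionary rather than failing on it
        if isinstance(doc, dict):
            counts.update(doc.keys())
    return counts

# Hypothetical sample records (assumption), standing in for the loaded `data`
sample = [
    {"study_id": "a/1", "year": 2019},
    {"study_id": "b/2", "year": 2020, "title": "Example"},
    {"study_id": "c/3"},
]
key_counts = summarize_keys(sample)
```

A key whose count is lower than the number of documents, such as `year` in the sample, marks records that would need a fallback value when the data is split.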

In [None]:
# Iterate over each document in the loaded data
for i, doc in enumerate(data):
    # Get the year as a string from the document (used for folder name);
    # a missing year becomes the string 'None'
    year = str(doc.get('year'))
    # Get the study_id as a string, replacing '/' with '_' to avoid issues in filenames
    study_id = str(doc.get('study_id', f"unknown_{i+1}")).replace("/", "_")
    # Build the output folder path for the year and create it if it does not exist
    output_folder = documents_folder / year
    output_folder.mkdir(parents=True, exist_ok=True)
    # Save the document as a JSON file in the corresponding year folder
    with open(output_folder / f"{study_id}.json", 'w', encoding='utf-8') as out_f:
        json.dump(doc, out_f, ensure_ascii=False, indent=2)

## 4. Split and Save Documents by Year

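This cell iterates over each document, extracts the year and study ID, and saves the document as a separate JSON file in the corresponding year folder. Any `/` in the study ID is replaced with `_` so that the ID can be used as a filename; other special characters are left unchanged. Documents without a `study_id` are named `unknown_<n>`, where `<n>` is the document's 1-based position in the list.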
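A stricter variant of this step could replace every character that is unsafe in filenames and group documents without a year under a fallback folder. The sketch below is one possible approach under stated assumptions, not the notebook's own method: `safe_filename`, `split_by_year`, the `unknown_year` folder name, and the demo records are all hypothetical.

```python
import json
import re
import tempfile
from pathlib import Path

def safe_filename(name):
    """Replace any character other than letters, digits, '.', '_' or '-' with '_'."""
    return re.sub(r"[^A-Za-z0-9._-]", "_", str(name))

def split_by_year(documents, root):
    """Write each document to root/<year>/<study_id>.json and return the written paths."""
    written = []
    for i, doc in enumerate(documents):
        year = doc.get("year")
        # Fallback folder name for documents without a year (assumption)
        year = "unknown_year" if year is None else str(year)
        study_id = safe_filename(doc.get("study_id", f"unknown_{i+1}"))
        folder = Path(root) / year
        folder.mkdir(parents=True, exist_ok=True)
        path = folder / f"{study_id}.json"
        with open(path, "w", encoding="utf-8") as out_f:
            json.dump(doc, out_f, ensure_ascii=False, indent=2)
        written.append(path)
    return written

# Invented demo records written to a temporary directory
demo_docs = [{"study_id": "10.1000/abc:1", "year": 2020}, {"title": "No year or id"}]
with tempfile.TemporaryDirectory() as tmp:
    paths = split_by_year(demo_docs, tmp)
    relative = [p.relative_to(tmp).as_posix() for p in paths]
```

Note that two different study IDs can map to the same sanitized name, in which case the later file overwrites the earlier one; checking for an existing path before writing would catch that.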