# 🧬 OpenFDA Structure Analysis Notebook

This notebook explores the structure of OpenFDA drug adverse event reports.  
Goals:
 - Goal 1
 - Goal 2
 - ...

## 🔍 Notebook Overview: OpenFDA Structure Analysis

This notebook performs structural analysis of the OpenFDA drug adverse event reports. It follows these steps:

1. **Step 1**
2. **Step 2**
3. **...**

In [22]:
import os
import glob
import ijson
from collections import Counter, defaultdict
import pandas as pd
from typing import Any, Dict
from tqdm import tqdm


## Initialize the path to the dataset:

In [24]:
data_path = "../data/raw/source_data"

## Function for iterating over the reports in the dataset:

In [23]:
def iterate_reports_ijson(path):
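    """Yield one report at a time from the top-level 'results' array of a JSON file.

    If `path` is a directory, every .json file in it is streamed in sorted
    filename order. ijson parses incrementally, so a file is never loaded
    into memory in full.
    """
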
    def yield_file(file_path):
        with open(file_path, 'rb') as f:
            parser = ijson.items(f, 'results.item')
            for report in parser:
                yield report

    if os.path.isdir(path):
        for file_path in sorted(glob.glob(os.path.join(path, '*.json'))):
            yield from yield_file(file_path)
    else:
        yield from yield_file(path)
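
Because the generator yields one report at a time, the first few reports can be inspected without reading the whole dataset. The sketch below is one way to do that. The helper name `peek_top_level_keys` and the stand-in reports are hypothetical, added only for illustration; in the notebook the argument would be `iterate_reports_ijson(data_path)`.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def peek_top_level_keys(reports: Iterable[Dict[str, Any]], n: int = 3) -> List[List[str]]:
    """Return the sorted top-level keys of the first n reports (hypothetical helper)."""
    # islice stops after n items, so the rest of the stream is never parsed.
    return [sorted(report.keys()) for report in islice(reports, n)]


# Made-up stand-in reports so the sketch runs without the dataset.
sample_reports = iter([
    {"safetyreportid": "1", "patient": {}},
    {"safetyreportid": "2", "patient": {}, "serious": "1"},
    {"safetyreportid": "3"},
    {"safetyreportid": "4"},
])

preview = peek_top_level_keys(sample_reports, n=2)
print(preview)
```

Only the first `n` reports are consumed from the iterator, which keeps the preview cheap even when the source directory holds many large files.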


## 1. Counting the number of reports and identifying duplicates.

In [26]:
# Count reports in the dataset. The loop stops after 100 reports as a quick
# check of the pipeline; remove the `break` to count the full dataset.

report_count = 0
for _ in tqdm(iterate_reports_ijson(data_path), desc="Counting reports"):
    report_count += 1
    if report_count == 100:
        break

print(f"📦 Total number of reports: {report_count}")


Counting reports: 99it [00:02, 41.62it/s]

📦 Total number of reports: 100
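
The cell above only counts reports. For the duplicate check named in this section's heading, a minimal sketch could tally report identifiers with a `Counter`. It assumes that `safetyreportid` is the field identifying a report and that two reports sharing it count as duplicates; the helper name `find_duplicate_ids` and the sample data are hypothetical.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def find_duplicate_ids(
    reports: Iterable[Dict[str, Any]],
    id_field: str = "safetyreportid",  # assumption: this field identifies a report
) -> Dict[str, int]:
    """Return {report id: occurrences} for ids seen more than once (hypothetical helper)."""
    # Reports without the id field are tallied under None and excluded below.
    counts = Counter(report.get(id_field) for report in reports)
    return {rid: n for rid, n in counts.items() if rid is not None and n > 1}


# Made-up stand-in reports so the sketch runs without the dataset.
sample_reports = [
    {"safetyreportid": "A"},
    {"safetyreportid": "B"},
    {"safetyreportid": "A"},
    {"safetyreportid": "C"},
    {"safetyreportid": "B"},
    {"safetyreportid": "A"},
    {"patient": {}},  # no id at all
]

duplicates = find_duplicate_ids(sample_reports)
print(duplicates)
```

In the notebook this would be called as `find_duplicate_ids(iterate_reports_ijson(data_path))`. The `Counter` keeps one entry per distinct identifier, so memory grows with the number of unique reports rather than with the size of the files.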



