# 🧪 MongoDB Query Performance Evaluation (Baseline)
This notebook runs a set of core queries against the `openfda.full_reports` collection in MongoDB.
It measures execution time to establish a performance baseline **before indexing**.

Later, the same queries will be rerun for comparison against:
- ✅ Indexed MongoDB collection (e.g., `indexed_reports`)
- ✅ SQLite database (once finalized)
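
A single wall-clock measurement is sensitive to cache warm-up and background load, which matters once the same queries are compared across backends. A minimal sketch of a reusable timing helper follows; the name `time_query` and its `repeats` parameter are hypothetical and not part of the original notebook. It runs any zero-argument callable several times and reports the median duration:

```python
import statistics
import time


def time_query(fn, repeats=5):
    """Run fn() `repeats` times; return (last_result, median_seconds).

    Hypothetical helper for illustration. `fn` is any zero-argument callable,
    for example: lambda: collection.count_documents({})
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    timings = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        result = fn()
        timings.append(time.perf_counter() - start)
    return result, statistics.median(timings)


# Stand-in workload so the sketch runs without a database connection.
result, median_s = time_query(lambda: sum(range(100_000)), repeats=3)
print(f"result={result}, median={median_s:.6f} seconds")
```

The median is used here because it is less affected by one unusually slow first run than the mean; whether that is the right summary for this benchmark is a judgment call.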


In [15]:
from pymongo import MongoClient
import time

# MongoDB connection
client = MongoClient('mongodb://localhost:27017')
db = client['openfda']
collection = db['full_reports']

## Q0 – Count All Reports in Collection

Before running specific queries, it's useful to perform a quick sanity check to confirm that the MongoDB collection is populated as expected.

This query counts all documents in the `full_reports` collection.

In [7]:
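def count_all_reports():
    """Count every document in the collection and print the elapsed time."""
    start_time = time.perf_counter()  # monotonic clock, suited to interval timing
    count = collection.count_documents({})
    elapsed = time.perf_counter() - start_time
    print(f'Total Reports: {count}, Elapsed: {elapsed:.6f} seconds')
    return count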

# Example
count_all_reports()

Total Reports: 35999, Elapsed: 0.068735 seconds


35999

## Q1 – Find Report by `safetyreportid`

In [2]:
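def query_by_safetyreportid(report_id):
    """Fetch one report by its ID; IDs are stored as strings, hence str()."""
    start_time = time.perf_counter()  # monotonic clock, suited to interval timing
    result = collection.find_one({'safetyreportid': str(report_id)})
    elapsed = time.perf_counter() - start_time
    print(f'Elapsed: {elapsed:.6f} seconds')
    return result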

# Example
query_by_safetyreportid('19854733')

Elapsed: 0.023175 seconds


{'_id': ObjectId('6819f1af7894e637cc60dc80'),
 'safetyreportversion': '15',
 'safetyreportid': '19854733',
 'primarysourcecountry': 'CA',
 'occurcountry': 'CA',
 'transmissiondateformat': '102',
 'transmissiondate': '20240409',
 'reporttype': '1',
 'serious': '1',
 'seriousnessdeath': '2',
 'seriousnesslifethreatening': '2',
 'seriousnesshospitalization': '1',
 'seriousnessdisabling': '2',
 'seriousnesscongenitalanomali': '2',
 'seriousnessother': '1',
 'receivedateformat': '102',
 'receivedate': '20210920',
 'receiptdateformat': '102',
 'receiptdate': '20240112',
 'fulfillexpeditecriteria': '1',
 'companynumb': 'CA-CELLTRION INC.-2021CA001620',
 'duplicate': '1',
 'reportduplicate': [{'duplicatesource': 'PFIZER INC',
   'duplicatenumb': 'CA-PFIZER INC-2021085662'},
  {'duplicatesource': 'PFIZER INC',
   'duplicatenumb': 'CA-PFIZER INC-2021085662'},
  {'duplicatesource': 'PFIZER INC',
   'duplicatenumb': 'CA-PFIZER INC-2021085662'},
  {'duplicatesource': 'PFIZER INC',
   'duplicatenumb

## Q2 – Count Reports Where `serious == "1"`

In [8]:
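def count_serious_reports():
    """Count reports flagged as serious; the flag is stored as the string "1"."""
    start_time = time.perf_counter()  # monotonic clock, suited to interval timing
    count = collection.count_documents({'serious': "1"})
    elapsed = time.perf_counter() - start_time
    print(f'Count: {count}, Elapsed: {elapsed:.6f} seconds')
    return count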

# Example
count_serious_reports()

Count: 20375, Elapsed: 0.050486 seconds


20375

## Q3 – Count Reports by `receivedate` Year

In [4]:
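def count_by_year():
    """Group reports by the year prefix of `receivedate` (format YYYYMMDD)."""
    pipeline = [
        # The first four characters are the year. Note: `$substr` is a
        # deprecated alias of `$substrBytes` since MongoDB 3.4.
        {'$project': {'year': {'$substr': ['$receivedate', 0, 4]}}},
        {'$group': {'_id': '$year', 'count': {'$sum': 1}}},
        {'$sort': {'_id': 1}}
    ]
    start_time = time.perf_counter()
    # list() drains the cursor so the timing includes fetching all groups
    results = list(collection.aggregate(pipeline))
    elapsed = time.perf_counter() - start_time
    print(f'Elapsed: {elapsed:.6f} seconds')
    return results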

# Example
count_by_year()

Elapsed: 0.193533 seconds


[{'_id': '2007', 'count': 2},
 {'_id': '2008', 'count': 2},
 {'_id': '2010', 'count': 3},
 {'_id': '2012', 'count': 2},
 {'_id': '2013', 'count': 1},
 {'_id': '2014', 'count': 10},
 {'_id': '2015', 'count': 23},
 {'_id': '2016', 'count': 55},
 {'_id': '2017', 'count': 91},
 {'_id': '2018', 'count': 110},
 {'_id': '2019', 'count': 186},
 {'_id': '2020', 'count': 298},
 {'_id': '2021', 'count': 580},
 {'_id': '2022', 'count': 1521},
 {'_id': '2023', 'count': 7257},
 {'_id': '2024', 'count': 25858}]
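
To make the pipeline's logic concrete, here is a rough pure-Python equivalent of the project, group, and sort stages. The function name `count_by_year_local` and the sample records are made up for illustration. The sketch also assumes documents without a `receivedate` can simply be skipped, which may differ from how the pipeline treats them:

```python
from collections import Counter


def count_by_year_local(docs):
    """Count documents per year, taking the year from the first four
    characters of 'receivedate' (format YYYYMMDD)."""
    years = Counter(
        doc["receivedate"][:4]
        for doc in docs
        if doc.get("receivedate")  # assumption: skip missing/empty dates
    )
    # Same output shape as the aggregation: sorted by year, ascending.
    return [{"_id": year, "count": n} for year, n in sorted(years.items())]


sample_docs = [  # made-up records, not taken from the collection
    {"receivedate": "20210920"},
    {"receivedate": "20240112"},
    {"receivedate": "20240409"},
    {"safetyreportid": "no-date"},
]
yearly_counts = count_by_year_local(sample_docs)
print(yearly_counts)
```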

## Q4 – Find Reports for a Specific `drugname`

In [18]:
import re

def find_reports_by_drugname(drug_name, preview_limit=5):
    """Count reports listing drug_name (exact, case-insensitive) and preview a few IDs."""
    query = {
        'patient.drug.medicinalproduct': {
            # Escape the name so characters such as '.' or '(' match literally
            '$regex': f'^{re.escape(drug_name)}$',
            '$options': 'i'  # Case-insensitive
        }
    }

    # The timing covers both the count and the preview fetch
    start_time = time.perf_counter()
    count = collection.count_documents(query)
    results = list(collection.find(query).limit(preview_limit))
    elapsed = time.perf_counter() - start_time

    print(f"Found {count} reports for '{drug_name}' (showing up to {preview_limit}) in {elapsed:.6f} seconds")
    for r in results:
        print(r.get('safetyreportid'))

    return count

find_reports_by_drugname('aspirin', preview_limit=5)

Found 1188 reports for 'aspirin' (showing up to 5) in 3.574678 seconds
17998181
23656546
18981049
22802248
23382421


1188
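
The dotted path `patient.drug.medicinalproduct` reaches into embedded drug entries, and MongoDB treats a document as a match when any array element satisfies the condition. Below is a rough local illustration of that behavior, assuming `patient.drug` is a list of dicts. The function `matches_drugname` and the sample documents are hypothetical, and this is not how the server evaluates the query:

```python
import re


def matches_drugname(doc, drug_name):
    """Return True if any drug entry's medicinalproduct equals drug_name,
    ignoring case (mirrors an anchored, case-insensitive regex)."""
    pattern = re.compile(f"^{re.escape(drug_name)}$", re.IGNORECASE)
    drugs = doc.get("patient", {}).get("drug", [])
    return any(pattern.search(d.get("medicinalproduct", "")) for d in drugs)


docs = [  # made-up documents for illustration only
    {"safetyreportid": "A1",
     "patient": {"drug": [{"medicinalproduct": "IBUPROFEN"},
                          {"medicinalproduct": "ASPIRIN"}]}},
    {"safetyreportid": "A2",
     "patient": {"drug": [{"medicinalproduct": "ASPIRIN 81 MG"}]}},
    {"safetyreportid": "A3", "patient": {}},
]
matched = [d["safetyreportid"] for d in docs if matches_drugname(d, "aspirin")]
print(matched)
```

Only the first document matches: the anchors `^` and `$` reject longer product names such as `ASPIRIN 81 MG`, so an exact-match query will undercount reports that record the drug with a dose or brand suffix.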

## Q5 – Fatal Reports with `drugindication == 'headache'`

In [19]:
def find_fatal_headache_reports(preview_limit=5):
    """Count fatal reports with a 'headache' drug indication and preview a few IDs."""
    query = {
        'seriousnessdeath': "1",  # stored as string in OpenFDA
        'patient.drug.drugindication': {
            '$regex': '^headache$',
            '$options': 'i'  # Case-insensitive
        }
    }

    # The timing covers both the count and the preview fetch
    start_time = time.perf_counter()
    count = collection.count_documents(query)
    results = list(collection.find(query).limit(preview_limit))
    elapsed = time.perf_counter() - start_time

    print(f"Found {count} fatal headache reports (showing up to {preview_limit}) in {elapsed:.6f} seconds")
    for r in results:
        print(r.get('safetyreportid'))

    return count

# Find fatal reports with drug indication 'headache', preview up to 5
find_fatal_headache_reports(preview_limit=5)

Found 3 fatal headache reports (showing up to 5) in 0.216393 seconds
23458086
23604512
23427798


3

# Debugging - To delete later

In [16]:
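# `drug` is not a top-level field in this collection (Q4 reaches it via
# `patient.drug`), so this filter matches nothing and `sample` is None.
sample = collection.find_one(
    {"drug.medicinalproduct": {"$exists": True}},
    {"safetyreportid": 1, "drug.medicinalproduct": 1}
)
sample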


In [17]:
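# No document has a top-level `drug` field, so find_one returns None
# and the subscript on the next line raises the TypeError shown below.
doc = collection.find_one({"drug": {"$exists": True}}, {"drug": 1})
doc["drug"]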

TypeError: 'NoneType' object is not subscriptable
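
`find_one` returns `None` when nothing matches, so indexing its result directly raises the `TypeError` above. A small sketch of a defensive accessor follows; the helper name `get_nested` and the sample documents are hypothetical. It walks a dotted path through nested dicts and returns a default instead of raising:

```python
def get_nested(doc, dotted_path, default=None):
    """Follow a dotted path through nested dicts; return `default` if the
    document is None or any key along the path is missing."""
    current = doc
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Made-up stand-ins for find_one results, for illustration only.
found = {"safetyreportid": "X1",
         "patient": {"drug": [{"medicinalproduct": "ASPIRIN"}]}}
missing = None  # what find_one returns when no document matches

nested_drugs = get_nested(found, "patient.drug")
top_level_drugs = get_nested(found, "drug")
from_none = get_nested(missing, "drug")
print(nested_drugs, top_level_drugs, from_none)
```

Unlike MongoDB's dot notation, this helper does not descend into lists, so it stops at the `patient.drug` array rather than reaching individual drug entries.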