# MongoDB Ingestion and NoSQL Data Modeling

This notebook implements the ingestion and modeling phase of the Big Data architecture.
Two heterogeneous prostate cancer datasets are ingested into MongoDB to demonstrate
NoSQL schema flexibility and support downstream distributed analytics.

- A CSV dataset representing structured historical patient records
- A JSON dataset representing semi-structured clinical records

MongoDB serves as a flexible storage layer that absorbs this data variety before the
batch and velocity-oriented processing stages.


In [None]:
from pymongo import MongoClient
import pandas as pd
import json

# Connect to the local MongoDB instance and select the project database
client = MongoClient("mongodb://localhost:27017/")
db = client.prostate_big_data

## Ingesting Structured Patient Data (CSV)

The CSV dataset contains structured patient-level attributes such as age, PSA level,
cancer stage, and lifestyle indicators. Each row is stored as one document in the
`csv_patients` collection, following a collection-per-entity strategy in which every
source dataset gets its own collection.
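
One detail worth keeping in mind: pandas represents empty CSV cells as `NaN`, and `to_dict("records")` passes those values through, so they would be stored in MongoDB as numeric NaN rather than as `null`. The following is a minimal sketch of one way to normalize them before insertion. The helper `nan_to_none` and the sample rows are hypothetical and only illustrate the idea; they are not part of the notebook's pipeline.

```python
import math

def nan_to_none(record):
    """Return a copy of one record with float NaN values replaced by None."""
    return {
        key: None if isinstance(value, float) and math.isnan(value) else value
        for key, value in record.items()
    }

# Hypothetical rows shaped like the output of csv_df.to_dict("records")
rows = [
    {"age": 64.0, "psa_level": 5.2},
    {"age": float("nan"), "psa_level": 7.1},
]
documents = [nan_to_none(row) for row in rows]
# documents could then be passed to db.csv_patients.insert_many(documents)
```

Storing missing values as `null` keeps queries such as `{"age": None}` meaningful, which is harder to express when the gap is encoded as NaN.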


In [None]:
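# --------- Ingest CSV data ---------
# Load the CSV into a DataFrame; each row becomes one MongoDB document
csv_df = pd.read_csv("../data/prostate_cancer_prediction.csv")

# Note: re-running this cell inserts the same records again
db.csv_patients.insert_many(csv_df.to_dict("records"))

print("CSV data ingested into MongoDB.")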

## Ingesting Semi-Structured Patient Records (JSON)

The JSON dataset represents semi-structured clinical records whose attributes can vary
from one record to the next. MongoDB stores each record as-is and defers interpretation
of its structure to query time (schema-on-read), which suits heterogeneous medical data.
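
To make schema-on-read concrete, the sketch below reads a nested field from documents that do not all share the same shape, falling back to a default when the field is absent. The helper `read_field` and the field names (`patient_id`, `labs.psa`, `notes`) are hypothetical and are not taken from the actual dataset.

```python
def read_field(document, path, default=None):
    """Follow a dotted path through nested dicts; return default if any key is missing."""
    current = document
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current

# Two hypothetical records with different shapes, as a document store allows
records = [
    {"patient_id": 1, "labs": {"psa": 6.4}},
    {"patient_id": 2, "notes": "follow-up scheduled"},
]

# The structure is interpreted when reading, not enforced when writing
psa_values = [read_field(record, "labs.psa") for record in records]
```

The same tolerance applies on the database side: both records can sit in one collection, and consumers decide at query time which fields they expect.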


In [None]:
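# --------- Ingest JSON data ---------
# The file is read as JSON Lines: one JSON object per line
with open("../data/patients_records.json", "r", encoding="utf-8") as f:
    # Skip blank lines, which would otherwise raise json.JSONDecodeError
    json_records = [json.loads(line) for line in f if line.strip()]

db.json_patients.insert_many(json_records)

print("JSON data ingested into MongoDB.")

print("MongoDB ingestion completed successfully.")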

In [None]:
# Verify ingestion: document counts, then one sample document per collection
print("CSV records count:", db.csv_patients.count_documents({}))
print("JSON records count:", db.json_patients.count_documents({}))

print("\nSample CSV document:")
print(db.csv_patients.find_one())

print("\nSample JSON document:")
print(db.json_patients.find_one())
