## Transforming Structured CSV Data into Semi-Structured JSON Records

The original Kaggle dataset is provided in CSV format, representing a highly structured and tabular view of prostate cancer patient data. While this format is well suited for batch analytics, it does not naturally capture the hierarchical and evolving nature of real-world clinical records.

To address this, a subset of the CSV data is transformed into a semi-structured JSON representation. This transformation serves two key purposes:

1. **Demonstrating data variety**:  
   By converting flat CSV rows into nested JSON documents, the project illustrates how heterogeneous medical data (demographics, clinical visits, and outcomes) can be modeled more naturally using a document-oriented schema.

2. **Justifying the use of NoSQL storage**:  
   The JSON structure groups related attributes (e.g., demographics, clinical visits, outcomes) into nested fields, reflecting how patient records evolve over time and may contain optional or repeated elements. This design aligns well with MongoDB's schema-on-read philosophy and highlights its flexibility compared to rigid relational schemas.

The transformation is intentionally lightweight and derived directly from the CSV dataset to maintain medical consistency while avoiding synthetic or fabricated clinical information. The resulting JSON records are later ingested into MongoDB as a separate collection, demonstrating how structured and semi-structured datasets can coexist within a unified Big Data architecture.
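To make the point about optional and repeated elements concrete, here is a minimal sketch of two documents that could sit in the same collection despite having different shapes. The second patient, the follow-up visit, the `treatment` field, and the helper `latest_psa` are hypothetical illustrations and are not taken from the dataset.

```python
# Two hypothetical documents with different shapes (values are illustrative only).
patient_a = {
    "patient_id": 1,
    "demographics": {"age": 78},
    "clinical_visits": [
        {"visit_type": "Initial Screening", "psa": 5.07},
    ],
}

patient_b = {
    "patient_id": 2,
    "demographics": {"age": 64},
    "clinical_visits": [
        {"visit_type": "Initial Screening", "psa": 8.4},
        {"visit_type": "Follow-Up", "psa": 6.1},  # repeated element
    ],
    "treatment": {"recommended": "Active Surveillance"},  # optional field
}


def latest_psa(doc):
    """Return the PSA value of the last recorded visit, or None if there are no visits."""
    visits = doc.get("clinical_visits", [])
    return visits[-1]["psa"] if visits else None


for doc in (patient_a, patient_b):
    # Code that reads the documents has to tolerate missing fields and variable-length arrays.
    print(doc["patient_id"], len(doc["clinical_visits"]), latest_psa(doc), "treatment" in doc)
```

A relational table would typically need a separate visits table and a nullable treatment column to hold the same information, whereas a document store can keep each patient as one self-contained record.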


In [1]:
import pandas as pd
import json
import os

# Load the Kaggle CSV
df = pd.read_csv("../data/prostate_cancer_prediction.csv")

# Select a small subset for clarity and speed
df = df.head(10)

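json_records = []


def is_yes(value):
    """Map the dataset's "Yes"/"No" strings to booleans."""
    return str(value).strip().lower() == "yes"


for _, row in df.iterrows():
    record = {
        "patient_id": int(row["Patient_ID"]),
        "demographics": {
            "age": int(row["Age"]),
            "race": "African Ancestry" if is_yes(row["Race_African_Ancestry"]) else "Other"
        },
        "clinical_visits": [
            {
                "visit_type": "Initial Screening",
                "psa": float(row["PSA_Level"]),
                # Assumption: any value other than "Benign" is treated as a positive biopsy
                "biopsy_result": "Negative" if row["Biopsy_Result"] == "Benign" else "Positive"
            }
        ],
        # The CSV stores these flags as "Yes"/"No" strings, so bool(...) would always be True
        "family_history": is_yes(row["Family_History"]),
        "outcomes": {
            "survival_5_years": is_yes(row["Survival_5_Years"]),
            "follow_up_required": is_yes(row["Follow_Up_Required"])
        }
    }
    json_records.append(record)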

# Ensure data directory exists
os.makedirs("../data", exist_ok=True)

# Write JSON Lines file
output_path = "../data/patients_records.json"
with open(output_path, "w") as f:
    for rec in json_records:
        f.write(json.dumps(rec) + "\n")

print(f"JSON dataset created at: {output_path}")


JSON dataset created at: ../data/patients_records.json
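The file written above is in JSON Lines form (one JSON object per line) even though it carries a `.json` extension, so it has to be read back line by line rather than with a single `json.load` call. The following sketch shows a round trip under that assumption; the helper names `write_jsonl` and `read_jsonl`, the temporary file, and the sample records are hypothetical.

```python
import json
import os
import tempfile


def write_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")


def read_jsonl(path):
    """Parse each non-empty line as its own JSON document."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Illustrative records only, shaped like the documents built in the cell above.
sample = [
    {"patient_id": 1, "demographics": {"age": 78}},
    {"patient_id": 2, "demographics": {"age": 64}},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.jsonl")
    write_jsonl(sample, path)
    restored = read_jsonl(path)
    with open(path, "r", encoding="utf-8") as f:
        line_count = sum(1 for _ in f)

print("Round trip preserved records:", restored == sample)
print("Lines written:", line_count)
```

Checking that the number of lines equals the number of records is a cheap sanity test before handing the file to the ingestion step.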


# MongoDB Ingestion and NoSQL Data Modeling

This notebook implements the ingestion and modeling phase of the Big Data architecture.
To demonstrate NoSQL schema flexibility and support downstream distributed analytics,
two heterogeneous prostate cancer datasets are ingested into MongoDB:

- A CSV dataset representing structured historical patient records
- A JSON dataset representing semi-structured clinical records

MongoDB is used as a flexible storage layer to support data variety prior to batch and
velocity-oriented processing.


In [2]:
from pymongo import MongoClient
# Connect to MongoDB
client = MongoClient("mongodb://localhost:27017/")
db = client.prostate_big_data






## Ingesting Structured Patient Data (CSV)

The CSV dataset contains structured patient-level attributes such as age, PSA level,
cancer stage, and lifestyle indicators. This data is stored in MongoDB using a
collection-per-entity strategy.


In [3]:
# --------- Ingest CSV data ---------
csv_df = pd.read_csv("../data/prostate_cancer_prediction.csv")
db.csv_patients.insert_many(csv_df.to_dict("records"))

print("CSV data ingested into MongoDB.")

CSV data ingested into MongoDB.
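Re-running the ingestion cell appends the same rows again, so document counts grow with every execution. One possible way to make the step repeatable is sketched below: clear the collection, then insert in batches. The helper `reload_collection` and the in-memory `FakeCollection` stand-in are hypothetical; the helper relies only on the `delete_many` and `insert_many` methods that a pymongo collection provides.

```python
def reload_collection(collection, records, batch_size=1000):
    """Empty `collection`, then insert `records` in batches. Returns the number inserted."""
    collection.delete_many({})  # remove documents left over from earlier runs
    inserted = 0
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        result = collection.insert_many(batch)
        inserted += len(result.inserted_ids)
    return inserted


class _FakeResult:
    """Stand-in for the result object returned by insert_many (illustration only)."""

    def __init__(self, inserted_ids):
        self.inserted_ids = inserted_ids


class FakeCollection:
    """Minimal in-memory stand-in so the sketch runs without a MongoDB server."""

    def __init__(self):
        self.docs = []

    def delete_many(self, _filter):
        self.docs.clear()

    def insert_many(self, batch):
        first_id = len(self.docs)
        self.docs.extend(batch)
        return _FakeResult(list(range(first_id, first_id + len(batch))))


fake = FakeCollection()
rows = [{"Patient_ID": i} for i in range(2500)]  # illustrative rows only

first_run = reload_collection(fake, rows)
second_run = reload_collection(fake, rows)  # a re-run does not duplicate documents
print(first_run, second_run, len(fake.docs))
```

Against the real database the call would look like `reload_collection(db.csv_patients, csv_df.to_dict("records"))`. Because the loop never runs for an empty list, the helper also avoids calling `insert_many` with no documents.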


## Ingesting Semi-Structured Patient Records (JSON)

The JSON dataset represents semi-structured clinical records with flexible attributes.
Storing this data in MongoDB demonstrates schema-on-read capabilities and supports
heterogeneous medical data modeling.


In [4]:
# --------- Ingest JSON data ---------
with open("../data/patients_records.json", "r") as f:
    json_records = [json.loads(line) for line in f]

db.json_patients.insert_many(json_records)

print("JSON data ingested into MongoDB.")

print("MongoDB ingestion completed successfully.")

JSON data ingested into MongoDB.
MongoDB ingestion completed successfully.
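Once the nested records are stored, their sub-fields can be addressed directly in query filters. The sketch below only builds the filter dictionaries, so it runs without a database; the helper names `older_than`, `psa_above`, and `both` and the thresholds are hypothetical, while dot notation, `$gt`, `$elemMatch`, and `$and` are standard MongoDB query syntax.

```python
def older_than(age):
    """Filter on a nested field using dot notation."""
    return {"demographics.age": {"$gt": age}}


def psa_above(threshold):
    """Match patients with at least one visit whose PSA exceeds `threshold`."""
    return {"clinical_visits": {"$elemMatch": {"psa": {"$gt": threshold}}}}


def both(*filters):
    """Combine several filters so that all of them must match."""
    return {"$and": list(filters)}


# Thresholds chosen purely for illustration.
query = both(older_than(70), psa_above(4.0))
print(query)

# With the connection opened earlier, the filter could be passed to find(), e.g.:
#   for doc in db.json_patients.find(query):
#       print(doc["patient_id"])
```

The same filters would not work against `csv_patients`, whose documents are flat; there the equivalent fields are top-level keys such as `Age` and `PSA_Level`.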


In [5]:
print("CSV records count:", db.csv_patients.count_documents({}))
print("JSON records count:", db.json_patients.count_documents({}))

print("\nSample CSV document:")
print(db.csv_patients.find_one())

print("\nSample JSON document:")
print(db.json_patients.find_one())


CSV records count: 27945
JSON records count: 10

Sample CSV document:
{'_id': ObjectId('696bb4fa982f5ebc7f31e352'), 'Patient_ID': 1, 'Age': 78, 'Family_History': 'No', 'Race_African_Ancestry': 'Yes', 'PSA_Level': 5.07, 'DRE_Result': 'Normal', 'Biopsy_Result': 'Benign', 'Difficulty_Urinating': 'No', 'Weak_Urine_Flow': 'No', 'Blood_in_Urine': 'No', 'Pelvic_Pain': 'No', 'Back_Pain': 'No', 'Erectile_Dysfunction': 'No', 'Cancer_Stage': 'Localized', 'Treatment_Recommended': 'Active Surveillance', 'Survival_5_Years': 'Yes', 'Exercise_Regularly': 'No', 'Healthy_Diet': 'Yes', 'BMI': 22.3, 'Smoking_History': 'Yes', 'Alcohol_Consumption': 'Moderate', 'Hypertension': 'No', 'Diabetes': 'No', 'Cholesterol_Level': 'Normal', 'Screening_Age': 45, 'Follow_Up_Required': 'No', 'Prostate_Volume': 46.0, 'Genetic_Risk_Factors': 'No', 'Previous_Cancer_History': 'No', 'Early_Detection': 'Yes'}

Sample JSON document:
{'_id': ObjectId('696bb4fa982f5ebc7f32507b'), 'patient_id': 1, 'demographics': {'age': 78, 'rac