# 🏥 01_data_collection.ipynb

### **Objective**
This notebook documents the process of acquiring the data for the Healthcare KPI project. In a real-world scenario, this could involve connecting to databases, querying APIs, or scraping web data. For this project, we are generating a synthetic dataset to ensure we have the necessary fields for analysis.


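### **1. Libraries and Dependencies**
We'll use `pandas` and `numpy` for data generation and manipulation, plus the standard-library `random` and `datetime` modules. Both random number generators are seeded with a fixed value so that repeated runs should produce the same data.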

In [None]:
import pandas as pd
import numpy as np
import random
from datetime import date, timedelta

np.random.seed(42)
random.seed(42)
print('Libraries imported successfully.')

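### **2. Data Generation Logic**
We'll call `generate_healthcare_data` from the `src.data_generator` module to create a large synthetic dataset (50,000 simulated patients across calendar year 2024) with fields for hospital performance and patient experience. The result is saved as a CSV file in the `data/raw` folder.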
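
The generator itself is not shown in this notebook. As a rough illustration only, a function with the same call signature could look like the sketch below. The name `generate_healthcare_data_sketch`, the `seed` parameter, the column names, and the distributions are all assumptions made for this example, not the contents of `src.data_generator`.

```python
from datetime import date

import numpy as np
import pandas as pd


def generate_healthcare_data_sketch(num_patients, start_date, end_date, seed=42):
    """Hypothetical stand-in for generate_healthcare_data.

    Column names and distributions are illustrative assumptions only.
    """
    rng = np.random.default_rng(seed)
    n_days = (end_date - start_date).days + 1  # inclusive date range

    # Draw one admission-day offset per patient, uniformly over the range.
    offsets = rng.integers(0, n_days, size=num_patients)
    admission = pd.Timestamp(start_date) + pd.to_timedelta(offsets, unit="D")

    return pd.DataFrame({
        "patient_id": np.arange(1, num_patients + 1),
        "admission_date": admission,
        "department": rng.choice(
            ["Cardiology", "Emergency", "Orthopedics"], size=num_patients
        ),
        "length_of_stay_days": rng.integers(1, 15, size=num_patients),  # 1 to 14
        "satisfaction_score": rng.integers(1, 6, size=num_patients),    # 1 to 5
    })


sample = generate_healthcare_data_sketch(1000, date(2024, 1, 1), date(2024, 12, 31))
print(sample.head())
```

Passing an explicit `seed` to a local `np.random.default_rng` keeps the sketch reproducible without relying on global state. The project's real generator may take a different approach.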

In [None]:
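import sys

# Assumes the notebook runs from a subfolder of the project root
# (as the '../data/raw' output path suggests), so that `src` is importable.
sys.path.append('..')

from src.data_generator import generate_healthcare_data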

num_patients_to_simulate = 50000
start_date_sim = date(2024, 1, 1)
end_date_sim = date(2024, 12, 31)

print(f'Generating data for {num_patients_to_simulate} patients from {start_date_sim} to {end_date_sim}...')
df_raw = generate_healthcare_data(
    num_patients=num_patients_to_simulate,
    start_date=start_date_sim,
    end_date=end_date_sim
)

print('\nData generation complete. Preview of the raw data:')
print(df_raw.head())

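from pathlib import Path

output_path = Path('../data/raw/healthcare_raw_data.csv')
output_path.parent.mkdir(parents=True, exist_ok=True)  # create data/raw if it does not exist yet
df_raw.to_csv(output_path, index=False)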
print(f'\nRaw data saved to {output_path}')
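
Before moving on to the next notebook, it can be worth confirming that the file on disk matches what was generated. The helper below is a minimal sketch of such a check. `verify_saved_csv` is a hypothetical name introduced here, and the demo uses a tiny made-up frame in a temporary folder so that it runs on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd


def verify_saved_csv(df, path):
    """Reload the CSV at `path` and confirm it matches `df` in shape and columns."""
    reloaded = pd.read_csv(path)
    assert reloaded.shape == df.shape, f"shape mismatch: {reloaded.shape} vs {df.shape}"
    assert list(reloaded.columns) == list(df.columns), "column mismatch"
    return reloaded


# Demo on a tiny stand-in frame written to a temporary folder.
demo = pd.DataFrame({"patient_id": [1, 2, 3], "wait_minutes": [12, 45, 30]})
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.csv"
    demo.to_csv(demo_path, index=False)
    reloaded = verify_saved_csv(demo, demo_path)

print(reloaded.shape)
```

In this notebook the equivalent call would be `verify_saved_csv(df_raw, output_path)`. Note that CSV does not preserve dtypes, so date columns come back as strings unless `parse_dates` is passed to `pd.read_csv`.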