This Jupyter notebook will guide users through the data cleaning process interactively. Here's an outline of what the notebook will contain:

# Data Cleaning and Preprocessing - Healthcare Data
## 1. Introduction
In this notebook, we will clean and preprocess the `patient_records.csv` dataset, which contains 6,000 rows of healthcare data. We will focus on handling missing values, correcting data types, and adding missing data flags.

## 2. Load the Data
```python
import pandas as pd

# Load the dataset and preview the first five rows
df = pd.read_csv('/workspaces/swiss-data-science-demos-/data_cleaning_preprocessing/data/patient_records.csv')
df.head()
```

| | patient_id | name | dob | gender | visit_date | diagnosis | treatment | cost | insurance_status | missing_data_flags |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7325 | Amanda Rivera | 1973-10-21 | Other | 2021-06-19 | COVID-19 | Medication | 4856.443195 | Yes | NaN |
| 1 | 9822 | Brandy Hunt | 1992-06-03 | M | 2020-08-07 | Diabetes | NaN | 7657.783544 | Yes | NaN |
| 2 | 2612 | Heidi Brooks | NaN | M | 2020-03-12 | Flu | NaN | 9991.06453 | No | NaN |
| 3 | 6799 | Jacqueline Conley | 1992-04-08 | Other | 2022-11-03 | Hypertension | NaN | 5913.568352 | NaN | NaN |
| 4 | 3202 | Julia Benjamin | 1988-10-08 | F | 2021-09-01 | Hypertension | Surgery | NaN | Yes | NaN |
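
Before cleaning anything, it can help to see how much is missing in each column. The sketch below is not part of the original notebook: it re-types the five preview rows as an in-memory CSV, and the names `sample_csv` and `missing_summary` are made up for this illustration. It reports the count and share of missing values per column.

```python
import io

import pandas as pd

# The five preview rows from this section, re-typed as an in-memory CSV (illustration only)
sample_csv = """patient_id,name,dob,gender,visit_date,diagnosis,treatment,cost,insurance_status,missing_data_flags
7325,Amanda Rivera,1973-10-21,Other,2021-06-19,COVID-19,Medication,4856.443195,Yes,
9822,Brandy Hunt,1992-06-03,M,2020-08-07,Diabetes,,7657.783544,Yes,
2612,Heidi Brooks,,M,2020-03-12,Flu,,9991.06453,No,
6799,Jacqueline Conley,1992-04-08,Other,2022-11-03,Hypertension,,5913.568352,,
3202,Julia Benjamin,1988-10-08,F,2021-09-01,Hypertension,Surgery,,Yes,
"""

sample = pd.read_csv(io.StringIO(sample_csv))


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values for each column."""
    counts = frame.isnull().sum()
    percent = (counts / len(frame) * 100).round(1)
    return pd.DataFrame({"missing": counts, "percent": percent})


summary = missing_summary(sample)
print(summary)
```

Running the same helper on the full `df` would show which columns need attention before deciding whether to fill or drop.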


## 3. Handle Missing Values

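We need to address missing values in the `gender`, `dob`, `diagnosis`, `treatment`, and `cost` columns: missing `gender` values are filled with a placeholder, rows missing `dob`, `diagnosis`, or `treatment` are dropped, and missing `cost` values are filled with the column mean.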

```python
# Fill missing gender with an explicit placeholder.
# Assigning the result back avoids the chained `inplace=True` pattern,
# which pandas warns will stop working in pandas 3.0.
df['gender'] = df['gender'].fillna('Unknown')

# Drop rows with missing essential data
df.dropna(subset=['dob', 'diagnosis', 'treatment'], inplace=True)

# Fill missing cost with the mean cost of the remaining rows
df['cost'] = df['cost'].fillna(df['cost'].mean())

# Count the missing values left in each column
df.isnull().sum()
```

```
patient_id               0
name                     0
dob                      0
gender                   0
visit_date               0
diagnosis                0
treatment                0
cost                     0
insurance_status       897
missing_data_flags    2226
dtype: int64
```

Only `insurance_status` and `missing_data_flags` still contain missing values.
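
The column-by-column fills in this section can also be written as a single `DataFrame.fillna` call that takes a dictionary mapping column names to fill values, which sidesteps calling `fillna(..., inplace=True)` on a selected column. A minimal sketch on a made-up three-row frame (`toy` and `fill_values` are illustrative names, not the real dataset):

```python
import pandas as pd

# Made-up rows for illustration; not taken from patient_records.csv
toy = pd.DataFrame({
    "gender": ["F", None, "M"],
    "cost": [100.0, None, 300.0],
})

# One fill value per column; the cost mean is computed over the non-missing values
fill_values = {"gender": "Unknown", "cost": toy["cost"].mean()}
toy = toy.fillna(fill_values)

print(toy)
```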

## 4. Correct Data Types

We will ensure that the `dob` and `visit_date` columns are stored as datetimes rather than strings.

```python
# Convert 'dob' and 'visit_date' to datetime format
df['dob'] = pd.to_datetime(df['dob'])
df['visit_date'] = pd.to_datetime(df['visit_date'])

# Check data types
df.dtypes
```

```
patient_id                     int64
name                          object
dob                   datetime64[ns]
gender                        object
visit_date            datetime64[ns]
diagnosis                     object
treatment                     object
cost                         float64
insurance_status              object
missing_data_flags           float64
dtype: object
```
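
`pd.to_datetime` raises an error by default when it meets a value it cannot parse. If the date columns might contain malformed entries (an assumption; the output in this section does not show any), passing `errors='coerce'` turns them into `NaT` so they can be counted and handled like other missing values. A small sketch on made-up strings:

```python
import pandas as pd

# Made-up date strings; the second one is deliberately invalid
raw_dates = pd.Series(["1973-10-21", "not a date", "1992-06-03"])

# An explicit format avoids guessing; unparseable values become NaT instead of raising
parsed = pd.to_datetime(raw_dates, format="%Y-%m-%d", errors="coerce")

print(parsed)
print("unparseable:", parsed.isnull().sum())
```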

## 5. Add Missing Data Flags

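We can now populate the `missing_data_flags` column so that it flags whether any data is missing in a given record. The column is empty in the rows loaded from the file and is itself being overwritten, so it is excluded from the check; including it would flag every row in which it is empty.

```python
# Flag rows that still have a missing value in any other column
df['missing_data_flags'] = (
    df.drop(columns='missing_data_flags').isnull().any(axis=1).astype(int)
)

# Display sample rows
df.head()
```

| | patient_id | name | dob | gender | visit_date | diagnosis | treatment | cost | insurance_status | missing_data_flags |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7325 | Amanda Rivera | 1973-10-21 | Other | 2021-06-19 | COVID-19 | Medication | 4856.443195 | Yes | 0 |
| 4 | 3202 | Julia Benjamin | 1988-10-08 | F | 2021-09-01 | Hypertension | Surgery | 5126.41537 | Yes | 0 |
| 7 | 7164 | Catherine Baker | 1934-08-12 | M | 2023-03-20 | Diabetes | Therapy | 9725.533041 | No | 0 |
| 8 | 8251 | Gary Howard | 1980-11-08 | Other | 2024-07-25 | Flu | Medication | 9000.986751 | NaN | 1 |
| 9 | 8502 | Veronica Perez | 1973-09-02 | Unknown | 2022-09-03 | COVID-19 | Surgery | 5126.41537 | No | 0 |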
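
Because the flag in this section is computed after the fills in section 3, it only reflects values that are still missing at this point. If the goal is to remember which fields were originally empty, one option (an alternative to the notebook's approach, not part of it) is to record the missing column names before imputing. A sketch on made-up rows; `missing_columns` is a hypothetical column name:

```python
import pandas as pd

# Made-up rows for illustration
toy = pd.DataFrame({
    "gender": ["F", None, "M"],
    "cost": [100.0, None, None],
    "insurance_status": ["Yes", "No", None],
})

# Record, per row, the names of the columns that are missing before any filling
is_missing = toy.isnull()
toy["missing_columns"] = is_missing.apply(
    lambda row: ",".join(col for col, flag in row.items() if flag), axis=1
)
toy["missing_data_flags"] = is_missing.any(axis=1).astype(int)

# Impute afterwards; the flags still describe the original gaps
toy["gender"] = toy["gender"].fillna("Unknown")
toy["cost"] = toy["cost"].fillna(toy["cost"].mean())

print(toy)
```

Keeping the column names rather than a single 0/1 flag makes it possible to tell later which values in a row are imputed.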


## 6. Save Cleaned Data

Finally, we save the cleaned dataset.

```python
# Save the cleaned data without the row index
df.to_csv('/workspaces/swiss-data-science-demos-/data_cleaning_preprocessing/data/patient_records_cleaned.csv', index=False)
```
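
CSV files do not store column types, so datetime columns come back as plain strings unless they are parsed again on load. A sketch of a round trip using a temporary file and made-up rows (the `cleaned` frame and the `cleaned.csv` file name are illustrative; the real path is left untouched):

```python
import os
import tempfile

import pandas as pd

# Made-up cleaned rows for illustration
cleaned = pd.DataFrame({
    "patient_id": [1, 2],
    "dob": pd.to_datetime(["1973-10-21", "1988-10-08"]),
    "cost": [4856.44, 5126.42],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned.csv")
    cleaned.to_csv(path, index=False)

    # parse_dates restores the datetime dtype that the CSV format cannot store
    reloaded = pd.read_csv(path, parse_dates=["dob"])

print(reloaded.dtypes)
```

Writing with `index=False` keeps the row index out of the file, so reloading does not produce an extra `Unnamed: 0` column.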


# Final Thoughts

This project demonstrates a systematic approach to cleaning and preprocessing messy healthcare data. The Python script (`data_cleaning.py`) automates the steps, while the Jupyter notebook (`preprocessing_notebook.ipynb`) walks through each one interactively.