# Python 102 – Data Analysis and Manipulation for Healthcare


## Course Overview
This notebook builds on the foundations of Python 101 and introduces structured data manipulation using Python. You will explore core data structures, work with files, and use libraries such as `pandas`, `numpy`, and `matplotlib` to clean, analyze, and visualize healthcare data.

This knowledge is essential for developing machine learning pipelines and AI-driven healthcare applications.

## Learning Objectives
- Understand Python’s core data structures (lists, dictionaries, sets, tuples)
- Read and write CSV and text files
- Use `pandas` and `numpy` to manipulate tabular data
- Clean data by handling missing values and filtering
- Perform data aggregation with `groupby` and `pivot_table`
- Create basic visualizations using `matplotlib`
- Practice with exercises on a synthetic patient dataset

## 1. Review: Loading Data from Hugging Face
We will use the same synthetic dataset from [patjs/patient1](https://huggingface.co/datasets/patjs/patient1).
For best compatibility, run this notebook in Google Colab or Hugging Face Notebooks.


In [None]:
!pip install -q datasets
from datasets import load_dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
data = load_dataset('patjs/patient1')
patients_df = data['patients'].to_pandas()
encounters_df = data['encounters'].to_pandas()

In [None]:
print("Patients Sample:")
print(patients_df.head())

print("Encounters Sample:")
print(encounters_df.head())

## 2. Core Data Structures
### Lists
Ordered, mutable collections used for vitals, medications, and time series.

In [None]:
vitals = [98.6, 99.1, 100.2]
print("Recorded temperatures:", vitals)
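Because lists are mutable, new readings can be appended in place, and converting a list to a NumPy array allows arithmetic on every value at once. The following is a minimal sketch; it assumes the readings are in degrees Fahrenheit, which the cell above does not state.

```python
import numpy as np

vitals = [98.6, 99.1, 100.2]

# Lists are mutable: append a new reading in place
vitals.append(99.4)

# Slicing returns a new list holding the last two readings
latest_two = vitals[-2:]

# A NumPy array supports element-wise arithmetic
# (assumption: the readings are in degrees Fahrenheit)
temps_f = np.array(vitals)
temps_c = (temps_f - 32) * 5 / 9

print("Latest two readings:", latest_two)
print("Highest reading (F):", max(vitals))
print("Readings in Celsius:", temps_c.round(1))
```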

### Dictionaries
Key-value structures ideal for patient records.

In [None]:
patient = {
    "id": "patient-001",
    "age": 67,
    "conditions": ["diabetes", "hypertension"]
}
print(patient)
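Common dictionary operations include looking up a key, supplying a default for a missing key, and adding or updating entries. The sketch below reuses the record from the cell above; the `allergies` and `last_visit` keys and their values are hypothetical, added only for illustration.

```python
patient = {
    "id": "patient-001",
    "age": 67,
    "conditions": ["diabetes", "hypertension"],
}

# Look up a value by key
print("Age:", patient["age"])

# .get() returns a default instead of raising KeyError for a missing key
allergies = patient.get("allergies", [])
print("Allergies:", allergies)

# Add a new key (hypothetical value) and extend a nested list in place
patient["last_visit"] = "2024-01-15"
patient["conditions"].append("asthma")

# Iterate over key-value pairs
for key, value in patient.items():
    print(f"{key}: {value}")
```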

### Sets and Tuples
Sets are unordered and unique; tuples are fixed-size and immutable.

In [None]:
# Duplicates are discarded when the set is built
diagnoses = {"hypertension", "diabetes", "hypertension"}
print("Unique diagnoses:", diagnoses)
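The cell above covers sets only, so here is a short sketch of tuples together with two set operations. The blood pressure reading and the two patients' diagnoses are made-up values for illustration.

```python
# A tuple groups a fixed number of related values; here (systolic, diastolic)
blood_pressure = (120, 80)
systolic, diastolic = blood_pressure  # tuple unpacking

# Tuples are immutable: item assignment raises TypeError
try:
    blood_pressure[0] = 130
except TypeError as err:
    print("Cannot modify a tuple:", err)

# Set operations compare two collections of diagnoses
patient_a = {"hypertension", "diabetes"}
patient_b = {"diabetes", "asthma"}
shared = patient_a & patient_b    # intersection
combined = patient_a | patient_b  # union

print("Shared diagnoses:", shared)
print("All diagnoses:", sorted(combined))
```

Sorting the set before printing gives a stable order, since sets themselves do not guarantee one.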

## 3. Reading and Writing Files
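Use `pandas` to save a DataFrame to a CSV file with `to_csv` and load it back with `read_csv`. Passing `index=False` keeps the row index out of the file.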

In [None]:
patients_df.to_csv('patients_output.csv', index=False)
loaded_df = pd.read_csv('patients_output.csv')
print(loaded_df.head())
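Plain text files can be handled with Python's built-in `open`. The sketch below writes a few lines and reads them back; the file name `visit_notes.txt` and the note contents are hypothetical, and the file is placed in the system temporary directory so nothing in the working directory is overwritten.

```python
import tempfile
from pathlib import Path

# Hypothetical file name and note contents, for illustration only
notes_path = Path(tempfile.gettempdir()) / "visit_notes.txt"
notes = ["Patient reports mild headache.", "Blood pressure within normal range."]

# Write one note per line; the with-block closes the file automatically
with open(notes_path, "w", encoding="utf-8") as f:
    for note in notes:
        f.write(note + "\n")

# Read the file back, stripping the trailing newline from each line
with open(notes_path, encoding="utf-8") as f:
    loaded_notes = [line.rstrip("\n") for line in f]

print(loaded_notes)
```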

## 4. Data Cleaning
### Handle Missing Values

In [None]:
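# Count missing values per column
print(patients_df.isnull().sum())
# Caution: this fills every column, including dates and numbers, with the string "Unknown"
patients_df = patients_df.fillna("Unknown")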
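A blanket `fillna("Unknown")` writes a string into every column, which turns numeric columns into mixed-type ones. One alternative is to choose a fill value per column. The sketch below uses a small toy table; the column names `city` and `weight_kg` are hypothetical and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical column names
df = pd.DataFrame({
    "city": ["Boston", None, "Salem"],
    "weight_kg": [70.5, np.nan, 82.0],
})

# Count missing values per column
missing_counts = df.isnull().sum()

# Fill each column with a value that matches its type
cleaned = df.fillna({"city": "Unknown", "weight_kg": df["weight_kg"].median()})

# Alternatively, drop any row that has a missing value
complete_rows = df.dropna()

print(missing_counts)
print(cleaned)
```

Whether to fill or drop depends on the analysis; filling with the median keeps the row but invents a value, while dropping loses the row entirely.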

### Filter and Sort

In [None]:
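# Assumes BIRTHDATE holds ISO-formatted strings (YYYY-MM-DD), which compare chronologically
older_patients = patients_df[patients_df['BIRTHDATE'] < '1975-01-01'].sort_values('BIRTHDATE')
print(older_patients[['Id', 'BIRTHDATE']].head())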
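Comparing dates as strings only works when every value is in ISO format. Parsing the column into datetime values first makes the comparison robust. The sketch below uses toy rows whose column names mirror those in the cell above; the values are made up.

```python
import pandas as pd

# Toy data; column names mirror those used in the cell above
df = pd.DataFrame({
    "Id": ["p1", "p2", "p3"],
    "BIRTHDATE": ["1980-05-01", "1962-11-23", "1970-02-14"],
})

# Parse strings into datetime values so comparisons are chronological
df["BIRTHDATE"] = pd.to_datetime(df["BIRTHDATE"])

# Boolean mask: True for patients born before 1975
born_before_1975 = df["BIRTHDATE"] < pd.Timestamp("1975-01-01")

# Filter, then sort from oldest to youngest
older = df[born_before_1975].sort_values("BIRTHDATE")
print(older)
```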

## 5. Data Aggregation and Pivoting

In [None]:
grouped = encounters_df.groupby('PATIENT').size()
print(grouped.head())
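`groupby` can also compute several statistics in one pass using named aggregation. The sketch below runs on a toy table; the `COST` column and its values are hypothetical and may not exist in the dataset.

```python
import pandas as pd

# Toy encounters table; COST is a hypothetical column
enc = pd.DataFrame({
    "PATIENT": ["p1", "p1", "p2", "p2", "p2"],
    "COST": [100.0, 150.0, 80.0, 120.0, 100.0],
})

# Named aggregation: one output column per statistic
summary = enc.groupby("PATIENT").agg(
    n_encounters=("COST", "count"),
    mean_cost=("COST", "mean"),
)
print(summary)
```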

In [None]:
# pivot_table can also count encounters per patient,
# comparable to the groupby in the previous cell
pivot = encounters_df.pivot_table(index='PATIENT', values='Id', aggfunc='count')
print(pivot.head())
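A pivot table becomes more useful once a `columns` argument spreads a second category across the columns. The sketch below counts encounters per patient and encounter class on toy data; `ENCOUNTERCLASS` is a hypothetical column name and may not match the dataset.

```python
import pandas as pd

# Toy encounters table; ENCOUNTERCLASS is a hypothetical column
enc = pd.DataFrame({
    "Id": ["e1", "e2", "e3", "e4"],
    "PATIENT": ["p1", "p1", "p2", "p2"],
    "ENCOUNTERCLASS": ["outpatient", "emergency", "outpatient", "outpatient"],
})

# Rows are patients, columns are encounter classes, cells are counts
pivot = enc.pivot_table(
    index="PATIENT",
    columns="ENCOUNTERCLASS",
    values="Id",
    aggfunc="count",
    fill_value=0,  # show 0 instead of NaN for combinations with no rows
)
print(pivot)
```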

## 6. Basic Visualization with matplotlib

In [None]:
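# Values that are not valid dates (such as "Unknown") become NaT and are dropped
birth_years = pd.to_datetime(patients_df['BIRTHDATE'], errors='coerce').dt.year.dropna()
plt.hist(birth_years, bins=20)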
plt.title('Distribution of Birth Years')
plt.xlabel('Year')
plt.ylabel('Count')
plt.show()
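Histograms suit numeric data; for categorical data, a bar chart of `value_counts()` is a common choice. The sketch below uses toy data, and `ENCOUNTERCLASS` is a hypothetical column name chosen for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy data; ENCOUNTERCLASS is a hypothetical column name
enc = pd.DataFrame({
    "ENCOUNTERCLASS": ["outpatient", "emergency", "outpatient", "inpatient", "outpatient"],
})

# value_counts() tallies each category, most frequent first
counts = enc["ENCOUNTERCLASS"].value_counts()

# One bar per category
fig, ax = plt.subplots()
ax.bar(list(counts.index), list(counts.values))
ax.set_title("Encounters by Class")
ax.set_xlabel("Encounter class")
ax.set_ylabel("Count")
plt.show()
```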

## 7. Practical Exercise
Use the `encounters_df` and `patients_df` to answer:
- How many encounters does each patient have?
- What is the average year of birth?
- Plot the gender distribution (if the dataset includes a gender column).

## Summary
- You reviewed core data structures and file I/O
- You learned how to clean, filter, group, and pivot clinical datasets
- You created basic visualizations

These skills form the basis for building real-world healthcare analytics pipelines.