# Step 1 – Understand Context (Patient Behaviour Profiling)

## 1.1 Business / Domain Problem
The objective is to build an **AI Behaviour Profile for each patient**, which involves:  
- Understanding **baseline patterns** of vitals, ADLs (activities of daily living), medications, and behaviours.  
- Detecting **deviations** or abnormal patterns, such as missed medications, low mobility, or elevated heart rate.  
- Supporting clinical decision-making by providing alerts, summarised patient health, and early intervention recommendations.  

Decisions that depend on this data include:  
1. Identifying patients at risk of deteriorating health.  
2. Monitoring adherence to medications and routine activities.  
3. Detecting emotional or behavioural changes over time.  
4. Providing AI-driven summaries and recommendations for clinical staff.  

---

## 1.2 EDA Goals
The exploratory data analysis should aim to:  
- **Assess data quality**: Identify missing values, inconsistencies, and duplicate records.  
- **Understand feature distributions**: Determine normal ranges for vitals, ADLs, and other measurements.  
- **Identify relationships**: Explore correlations between behaviours, ADLs, vitals, and patient states.  
- **Inform feature engineering**: Generate derived metrics for AI profiling, such as variability in heart rate or frequency of skipped meals.  
- **Detect anomalies**: Highlight unusual patterns that may require clinical attention or alerts.  

---

## 1.3 Data Type
The dataset consists of **mixed-type tabular data with time-series elements**:  

| Feature Type       | Example Columns                     | Notes                                                      |
|------------------|-----------------------------------|-----------------------------------------------------------|
| Categorical       | `gender`, `state`, `behaviourTags`, `emotionTags` | Single or multi-label; some columns are lists (behaviourTags) |
| Numerical         | `heartRate`, `spo2`, `temperature`, `stepsTaken`, `sleepHours` | Continuous or discrete                                    |
| Text              | `nursingNote`, `clinicalSummary`   | Free text; can be used for NLP or sentiment analysis     |
| Time              | `observationStart`, `observationEnd` | Allows computation of durations, trends, or lag features |
| JSON / Structured | `baselineStats`, `entitiesExtracted`, `alerts` | Nested information that may need to be flattened         |

---


In [8]:

sheet_url = "https://docs.google.com/spreadsheets/d/1qfsk9oR_Ml9Upkbgje8qEKyX8w78SulLiJZ9UIbL0PI/export?format=csv"



In [10]:
# =========================
# IMPORTS
# =========================
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# =========================
# LOAD DATA FROM GOOGLE SHEET
# =========================

df = pd.read_csv(sheet_url)

# =========================
# DATA PREPROCESSING
# =========================

# 1. Convert numeric columns
def extract_number(x):
    try:
        return float(str(x).split()[0])
    except:
        return np.nan

df['heartRate'] = df['heartRate'].apply(extract_number)
df['spo2'] = df['spo2'].apply(extract_number)
df['temperature'] = df['temperature'].apply(extract_number)

# Split blood pressure into systolic and diastolic
def split_bp(bp):
    try:
        sys, dia = str(bp).split('/')[0:2]
        return pd.Series([float(sys), float(dia)])
    except:
        return pd.Series([np.nan, np.nan])

df[['systolicBP','diastolicBP']] = df['bloodPressure'].apply(split_bp)

# Convert date columns
df['observationStart'] = pd.to_datetime(df['observationStart'])
df['observationEnd'] = pd.to_datetime(df['observationEnd'])

# Convert list-like columns from string to list
import ast
list_columns = ['behaviourTags', 'emotionTags']
for col in list_columns:
    df[col] = df[col].apply(lambda x: ast.literal_eval(str(x)) if pd.notnull(x) else [])

# =========================
# DATA QUALITY AUDIT
# =========================
print("Data Shape:", df.shape)
print("\nMissing Values:\n", df.isnull().sum())
print("\nDuplicates:", df.duplicated().sum())
print("\nData Types:\n", df.dtypes)
print("\nUnique Patient IDs:", df['patientId'].nunique())

# =========================
# UNIVARIATE ANALYSIS
# =========================

# Numerical columns
num_cols = ['age', 'heartRate', 'spo2', 'temperature', 'systolicBP', 'diastolicBP',
            'stepsTaken', 'calorieIntake', 'sleepHours', 'waterIntakeMl',
            'mealsSkipped', 'exerciseMinutes', 'bathroomVisits']

df[num_cols].describe().T

# Histograms
df[num_cols].hist(bins=20, figsize=(15,12))
plt.tight_layout()
plt.show()

# Categorical columns
cat_cols = ['gender', 'state']
for col in cat_cols:
    print(f"\nValue counts for {col}:\n", df[col].value_counts())
    sns.countplot(data=df, x=col)
    plt.title(f"Distribution of {col}")
    plt.show()

# Behaviour and Emotion Tags (list columns)
for col in list_columns:
    exploded = df.explode(col)
    counts = exploded[col].value_counts()
    print(f"\n{col} counts:\n", counts)
    plt.figure(figsize=(8,4))
    sns.barplot(x=counts.index, y=counts.values, palette="viridis")
    plt.xticks(rotation=45)
    plt.title(f"{col} Frequency")
    plt.show()

# =========================
# BIVARIATE ANALYSIS
# =========================

# Correlation matrix for numeric variables
plt.figure(figsize=(12,10))
sns.heatmap(df[num_cols].corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation Matrix")
plt.show()

# Age vs HeartRate by State
sns.boxplot(data=df, x='state', y='heartRate')
plt.title("Heart Rate Distribution by State")
plt.show()

# =========================
# TIME SERIES / TREND ANALYSIS
# =========================

# Heart rate trend for a single patient
patient = df['patientId'].iloc[0]
patient_df = df[df['patientId'] == patient].sort_values('observationStart')

plt.figure(figsize=(10,4))
plt.plot(patient_df['observationStart'], patient_df['heartRate'], marker='o')
plt.title(f"Heart Rate Over Time for {patient}")
plt.xlabel("Observation Start")
plt.ylabel("Heart Rate (bpm)")
plt.xticks(rotation=45)
plt.show()

# =========================
# SAVE CLEANED DATA
# =========================
df.to_csv("patient_observations_cleaned.csv", index=False)
print("Cleaned data saved to patient_observations_cleaned.csv")


ValueError: malformed node or string on line 1: <ast.Name object at 0x177599390>