# Healthcare Appointments (No-Show) Analysis

This notebook demonstrates an end-to-end data analysis workflow using a healthcare-style appointments dataset:

- Load and validate data
- Clean and prepare features
- Exploratory Data Analysis (EDA)
- Baseline Machine Learning model (Logistic Regression)
- Insight summary and recommendations

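**Dataset:** `data/appointments.csv` in this GitHub repository. The first cell reads the local copy if it exists and otherwise downloads the file from the raw GitHub URL.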

In [None]:
import os
import urllib.request
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 120)

LOCAL_PATH = '../data/appointments.csv'
GITHUB_RAW_URL = 'https://raw.githubusercontent.com/Brantson08/healthcare-data-analysis-python/main/data/appointments.csv'
COLAB_LOCAL = 'appointments.csv'

if os.path.exists(LOCAL_PATH):
    df = pd.read_csv(LOCAL_PATH)
    print('Loaded from local repository structure:', LOCAL_PATH)
else:
    urllib.request.urlretrieve(GITHUB_RAW_URL, COLAB_LOCAL)
    df = pd.read_csv(COLAB_LOCAL)
    print('Loaded from GitHub raw URL:', GITHUB_RAW_URL)

df.head()


In [None]:
df.info()
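
`df.info()` shows dtypes and non-null counts, but it does not check the values themselves. The following is a minimal sketch of an explicit validation step, not part of the original workflow. The helper name `validate_appointments`, the `EXPECTED_COLUMNS` tuple (inferred from the columns the later cells use), and the demo rows are all assumptions made for illustration.

```python
import pandas as pd

# Columns referenced by later cells in this notebook (inferred, not an official schema).
EXPECTED_COLUMNS = ('Gender', 'Age', 'Scholarship', 'Hypertension', 'Diabetes',
                    'Alcoholism', 'SMS_Received', 'AppointmentDay', 'NoShow')

def validate_appointments(frame, expected=EXPECTED_COLUMNS):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []

    # 1. Every column the analysis relies on should be present.
    missing = [c for c in expected if c not in frame.columns]
    if missing:
        problems.append(f"missing columns: {missing}")

    # 2. Age should be numeric and non-negative.
    if 'Age' in frame.columns:
        ages = pd.to_numeric(frame['Age'], errors='coerce')
        n_bad = int((ages.isna() | (ages < 0)).sum())
        if n_bad:
            problems.append(f"{n_bad} row(s) with missing, non-numeric, or negative Age")

    # 3. The label should only contain yes/no (case and whitespace ignored).
    if 'NoShow' in frame.columns:
        labels = frame['NoShow'].astype(str).str.strip().str.lower()
        unexpected = labels[~labels.isin(['yes', 'no'])].unique().tolist()
        if unexpected:
            problems.append(f"unexpected NoShow values: {unexpected}")

    return problems

# Made-up rows for illustration only (not taken from appointments.csv).
demo = pd.DataFrame({
    'Age': [34, -1, 60],
    'SMS_Received': [0, 1, 1],
    'NoShow': ['No', 'Yes', 'maybe'],
})
demo_problems = validate_appointments(demo)
for problem in demo_problems:
    print(problem)
```

In the notebook itself, the call would be `validate_appointments(df)`, run before any cleaning so that problems are reported rather than silently coerced.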

## Data Cleaning & Feature Engineering

In [None]:
# Ensure correct types
df['AppointmentDay'] = pd.to_datetime(df['AppointmentDay'], errors='coerce')

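# Create binary label (1 = no-show, 0 = attended).
# Note: any value other than 'yes', including missing values, maps to 0.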
df['NoShowFlag'] = (df['NoShow'].astype(str).str.strip().str.lower() == 'yes').astype(int)

# Basic validation
print("Missing values:\n", df.isna().sum())
print("\nNoShow distribution:\n", df['NoShow'].value_counts())
df.head()

## Overall No-Show Rate

In [None]:
no_show_rate = df['NoShowFlag'].mean()
print(f"Overall no-show rate: {no_show_rate:.1%}")

## No-Show Rate by Key Factors

In [None]:
def rate_by(col):
    out = df.groupby(col)['NoShowFlag'].mean().sort_values(ascending=False)
    return (out * 100).round(1)

for c in ['Gender', 'Scholarship', 'Hypertension', 'Diabetes', 'Alcoholism', 'SMS_Received']:
    print(f"\nNo-show rate by {c} (%):")
    print(rate_by(c))
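
A rate computed on a small group can be misleading, so it may help to report the group size next to each rate. The sketch below is one possible variant of the `rate_by` helper above; the function name `rate_with_counts` and the demo frame are hypothetical.

```python
import pandas as pd

def rate_with_counts(frame, col, target='NoShowFlag'):
    """No-show rate (%) and number of rows for each level of `col`."""
    out = frame.groupby(col)[target].agg(n='count', no_show_rate='mean')
    out['no_show_rate'] = (out['no_show_rate'] * 100).round(1)
    return out.sort_values('no_show_rate', ascending=False)

# Made-up rows for illustration only.
demo = pd.DataFrame({
    'SMS_Received': [0, 0, 0, 1, 1],
    'NoShowFlag':   [1, 0, 0, 1, 1],
})
demo_rates = rate_with_counts(demo, 'SMS_Received')
print(demo_rates)
```

With the real data, `rate_with_counts(df, 'SMS_Received')` would show whether each rate rests on enough appointments to be taken seriously.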

## Visualisations

In [None]:
# No-show by SMS received
sms_rates = df.groupby('SMS_Received')['NoShowFlag'].mean()
sms_rates.plot(kind='bar')
plt.ylabel('No-show rate')
plt.title('No-show rate by SMS reminder')
plt.xticks(rotation=0)
plt.show()

In [None]:
# Age distribution by No-show
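# Use shared bin edges so the two histograms are directly comparable
# (bins=8 on each subset separately would compute different edges).
age_bins = np.histogram_bin_edges(df['Age'].dropna(), bins=8)
plt.figure()
df.loc[df['NoShowFlag'] == 0, 'Age'].plot(kind='hist', alpha=0.7, bins=age_bins)
df.loc[df['NoShowFlag'] == 1, 'Age'].plot(kind='hist', alpha=0.7, bins=age_bins)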
plt.xlabel('Age')
plt.title('Age distribution: Show vs No-show')
plt.legend(['Show', 'No-show'])
plt.show()

## Baseline Machine Learning Model

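We train a simple Logistic Regression model to predict whether a patient will miss an appointment (No-Show). The features are age, scholarship status, three health-condition flags, and whether an SMS reminder was received; the data is split 70/30 into training and test sets, stratified on the label.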

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression

features = ['Age','Scholarship','Hypertension','Diabetes','Alcoholism','SMS_Received']
X = df[features].copy()
y = df['NoShowFlag'].copy()

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print('Accuracy:', round(accuracy_score(y_test, y_pred), 3))
print('\nConfusion Matrix:\n', confusion_matrix(y_test, y_pred))
print('\nClassification Report:\n', classification_report(y_test, y_pred))
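
Accuracy is hard to interpret on its own when one class is much more common than the other, because a model that always predicts the majority class can already score well. The sketch below computes that reference point; the helper name `majority_baseline_accuracy` and the demo labels are hypothetical.

```python
import numpy as np

def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most frequent class."""
    _, counts = np.unique(np.asarray(y_true), return_counts=True)
    return counts.max() / counts.sum()

# Made-up labels for illustration only: four attended, one no-show.
demo_labels = [0, 0, 0, 0, 1]
demo_baseline = majority_baseline_accuracy(demo_labels)
print(f"Majority-class baseline accuracy: {demo_baseline:.1%}")
```

In the notebook, `majority_baseline_accuracy(y_test)` gives the figure to compare against the model's accuracy. If the two are close, the recall for the no-show class in the classification report is the more informative number.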

In [None]:
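# Feature weights: the sign shows the direction of the association with no-show.
# The features are not standardized, so magnitudes are not directly comparable
# (e.g. the Age weight is per year, while the flag weights are per 0 -> 1 change).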
importance = pd.Series(model.coef_[0], index=features).sort_values()
importance.plot(kind='barh')
plt.title('Baseline model feature weights (Logistic Regression)')
plt.xlabel('Weight')
plt.show()

importance
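
To compare weight magnitudes across features, one common option is to standardize the inputs before fitting. This is a sketch of that idea using a scikit-learn pipeline, not the notebook's original model. The function name `standardized_weights` and the randomly generated demo data are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def standardized_weights(X, y):
    """Fit logistic regression on z-scored features and return the sorted weights."""
    pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    pipe.fit(X, y)
    coefs = pipe.named_steps['logisticregression'].coef_[0]
    return pd.Series(coefs, index=X.columns).sort_values()

# Randomly generated data for illustration only (no real signal is built in).
rng = np.random.default_rng(0)
demo_X = pd.DataFrame({
    'Age': rng.integers(0, 90, size=200),
    'SMS_Received': rng.integers(0, 2, size=200),
})
demo_y = (rng.random(200) < 0.3).astype(int)
demo_weights = standardized_weights(demo_X, demo_y)
print(demo_weights)
```

In the notebook, `standardized_weights(X_train, y_train)` would give weights expressed per standard deviation of each feature, which makes the bar lengths in the chart comparable.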

## Insight Summary

In [None]:
summary = pd.DataFrame([
    ('Overall no-show rate', f'{df["NoShowFlag"].mean():.1%}'),
    ('No-show rate (SMS=0)', f'{df[df["SMS_Received"]==0]["NoShowFlag"].mean():.1%}'),
    ('No-show rate (SMS=1)', f'{df[df["SMS_Received"]==1]["NoShowFlag"].mean():.1%}')
], columns=['Metric', 'Value'])
summary

## Recommendations

- Strengthen reminder workflows (SMS/phone) for higher-risk groups.
- Monitor no-show patterns by age and chronic conditions to improve scheduling.
- Add data quality checks (consistent patient demographics and contact fields).
- Consider targeting tailored reminders where predicted no-show risk is high.
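
The last recommendation can be sketched as a simple thresholding step on predicted probabilities. The helper name `flag_high_risk` and the threshold of 0.3 are hypothetical; a real cut-off would need to be chosen from the cost of extra reminders versus missed appointments.

```python
import numpy as np

def flag_high_risk(no_show_probability, threshold=0.3):
    """Return a boolean array marking appointments at or above the risk threshold."""
    probs = np.asarray(no_show_probability, dtype=float)
    return probs >= threshold

# Made-up probabilities for illustration only.
demo_probs = [0.1, 0.3, 0.8]
demo_flags = flag_high_risk(demo_probs)
print(demo_flags)
```

With the fitted model from this notebook, the input would be `model.predict_proba(X_test)[:, 1]`, the predicted probability of the no-show class.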