# HealthConnect Patient Data Analysis

## Overview
This notebook analyzes HealthConnect patient data to identify trends in health outcomes, treatment effectiveness, and demographic patterns. The results are intended to inform strategic decisions and improve healthcare service delivery.

## Analysis Goals
1. Identify the most common diagnoses by region
2. Analyze treatment success rates
3. Understand demographic patterns in health conditions
4. Predict patient outcomes using machine learning

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

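# Set plotting style
# The 'seaborn' style name was deprecated in matplotlib 3.6 and later removed,
# so apply seaborn's own theme instead of plt.style.use('seaborn')
sns.set_theme()
sns.set_palette('husl')
plt.rcParams['figure.figsize'] = [12, 8]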

## 1. Data Loading and Initial Exploration

In [None]:
# Load the dataset
df = pd.read_csv('../data/patient_data.csv')

# Display basic information
print('Dataset Shape:', df.shape)
print('\nColumns:', df.columns.tolist())
print('\nData Types:')
print(df.dtypes)

# Display first few rows
df.head()
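
Before going further, it can help to confirm that the file contains every column the later cells rely on. The sketch below is one possible check, not part of the original pipeline: the column list is collected from the names used elsewhere in this notebook, and `missing_columns` is a hypothetical helper.

```python
import pandas as pd

# Column names referenced by the later cells of this notebook
EXPECTED_COLUMNS = [
    'AdmissionDate', 'DischargeDate', 'Age', 'Gender', 'Region',
    'Diagnosis', 'Treatment', 'TreatmentSuccess',
    'InitialSeverity', 'Comorbidities',
]

def missing_columns(frame, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from the frame."""
    return [col for col in expected if col not in frame.columns]

# Illustrative frame with only three of the expected columns
partial = pd.DataFrame(columns=['Age', 'Gender', 'Region'])
print('Missing columns:', missing_columns(partial))
```

Calling `missing_columns(df)` right after loading would surface a renamed or absent column before it causes a `KeyError` in a later cell.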

## 2. Data Cleaning and Preprocessing

In [None]:
# Convert date columns to datetime
df['AdmissionDate'] = pd.to_datetime(df['AdmissionDate'])
df['DischargeDate'] = pd.to_datetime(df['DischargeDate'])

# Check for missing values
print('Missing Values:')
print(df.isnull().sum())

# Check for duplicates
print('\nDuplicate Rows:', df.duplicated().sum())
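
The checks above only report problems. A possible follow-up is sketched below on a tiny synthetic frame; the values are made up, and `PatientID`, `LengthOfStay`, and `NeedsReview` are hypothetical names that do not come from the dataset. It assumes that unparseable dates should become missing values rather than raise an error.

```python
import pandas as pd

# Tiny synthetic frame standing in for patient_data.csv (values are made up)
sample = pd.DataFrame({
    'PatientID': [1, 2, 2, 3],
    'AdmissionDate': ['2023-01-01', '2023-01-05', '2023-01-05', 'not a date'],
    'DischargeDate': ['2023-01-04', '2023-01-03', '2023-01-03', '2023-02-01'],
})

# errors='coerce' turns unparseable dates into NaT instead of raising
for col in ['AdmissionDate', 'DischargeDate']:
    sample[col] = pd.to_datetime(sample[col], errors='coerce')

# Drop exact duplicate rows, keeping the first occurrence
sample = sample.drop_duplicates().reset_index(drop=True)

# Hypothetical derived column: days between admission and discharge
sample['LengthOfStay'] = (sample['DischargeDate'] - sample['AdmissionDate']).dt.days

# Flag rows with a missing date or a discharge that precedes admission
sample['NeedsReview'] = sample['LengthOfStay'].isna() | (sample['LengthOfStay'] < 0)
print(sample)
```

Flagging suspicious rows instead of dropping them keeps the decision about how to handle them visible and reversible.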

## 3. Exploratory Data Analysis

### 3.1 Demographic Analysis

In [None]:
# Age distribution
plt.figure(figsize=(12, 6))
sns.histplot(data=df, x='Age', bins=30)
plt.title('Age Distribution of Patients')
plt.show()

# Gender distribution
plt.figure(figsize=(8, 6))
df['Gender'].value_counts().plot(kind='pie', autopct='%1.1f%%')
plt.title('Gender Distribution')
plt.show()
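
To relate demographics to health conditions (goal 3), one option is to bin ages into groups and cross-tabulate them against diagnoses. The sketch below uses made-up data; the bin edges, labels, and the `AgeGroup` column are illustrative assumptions, not values taken from the dataset.

```python
import pandas as pd

# Made-up records for illustration only
toy = pd.DataFrame({
    'Age': [5, 17, 18, 40, 64, 65, 90],
    'Diagnosis': ['Asthma', 'Flu', 'Flu', 'Diabetes',
                  'Diabetes', 'Hypertension', 'Hypertension'],
})

# Hypothetical age bands; right=False makes each bin include its lower edge
bins = [0, 18, 40, 65, 120]
labels = ['0-17', '18-39', '40-64', '65+']
toy['AgeGroup'] = pd.cut(toy['Age'], bins=bins, labels=labels, right=False)

# Count diagnoses within each age band
age_diag = pd.crosstab(toy['AgeGroup'], toy['Diagnosis'])
print(age_diag)
```

The same two steps applied to `df['Age']` and `df['Diagnosis']` would produce a table suitable for the heatmap style used in the next section.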

### 3.2 Disease Patterns by Region

In [None]:
# Create a heatmap of diagnoses by region
diagnosis_region = pd.crosstab(df['Region'], df['Diagnosis'])
plt.figure(figsize=(15, 8))
sns.heatmap(diagnosis_region, annot=True, fmt='d', cmap='YlOrRd')
plt.title('Diagnosis Distribution by Region')
plt.xticks(rotation=45)
plt.show()
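
The heatmap shows raw counts, which can be hard to compare when regions differ in size. A minimal sketch of two complementary summaries follows, using made-up data: the single most frequent diagnosis per region, and each diagnosis as a share of its region's cases.

```python
import pandas as pd

# Made-up records for illustration only
toy = pd.DataFrame({
    'Region': ['North', 'North', 'North', 'South', 'South', 'South'],
    'Diagnosis': ['Flu', 'Flu', 'Asthma', 'Asthma', 'Asthma', 'Flu'],
})

counts = pd.crosstab(toy['Region'], toy['Diagnosis'])

# Most frequent diagnosis in each region (ties resolve to the first column)
top_diagnosis = counts.idxmax(axis=1)
print(top_diagnosis)

# Each row sums to 1, so regions of different size become comparable
shares = pd.crosstab(toy['Region'], toy['Diagnosis'], normalize='index')
print(shares.round(2))
```

On the real data, `diagnosis_region.idxmax(axis=1)` would answer goal 1 directly, and the normalized table can be passed to `sns.heatmap` with `fmt='.2f'` in place of `fmt='d'`.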

### 3.3 Treatment Success Analysis

In [None]:
# Calculate success rates by treatment
treatment_success = df.groupby('Treatment')['TreatmentSuccess'].agg(['count', 'mean'])
treatment_success = treatment_success.sort_values('mean', ascending=False)

# Plot success rates
plt.figure(figsize=(12, 6))
sns.barplot(x=treatment_success.index, y='mean', data=treatment_success)
plt.title('Treatment Success Rates')
plt.xticks(rotation=45)
plt.ylabel('Success Rate')
plt.show()
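
A mean success rate from a handful of patients is much less reliable than one from hundreds, and the bar chart alone does not show that. One way to express the uncertainty is a Wilson score interval computed from the `count` and `mean` columns. The sketch below assumes `TreatmentSuccess` is coded as 0/1; the treatment names and numbers are made up, and `wilson_interval` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * np.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Made-up summary in the same shape as treatment_success (count, mean)
summary = pd.DataFrame(
    {'count': [10, 1000], 'mean': [0.8, 0.8]},
    index=['TreatmentA', 'TreatmentB'],
)

# With a 0/1 outcome, mean * count recovers the number of successes
successes = summary['mean'] * summary['count']
summary['ci_low'], summary['ci_high'] = wilson_interval(successes, summary['count'])
print(summary.round(3))
```

Both made-up treatments have the same 80% rate, but the interval for the one with 10 patients is far wider, which is worth keeping in mind when ranking treatments by `mean`.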

## 4. Predictive Analytics

In [None]:
# Prepare data for modeling
# Convert categorical variables to dummy variables
X = pd.get_dummies(df[['Age', 'Gender', 'Region', 'Diagnosis', 'Treatment', 'InitialSeverity', 'Comorbidities']])
y = df['TreatmentSuccess']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a decision tree classifier
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Print classification report
print('Classification Report:')
print(classification_report(y_test, y_pred))
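
A classification report is easier to interpret next to a trivial baseline and a confusion matrix. The sketch below is illustrative rather than a description of the model above: it uses synthetic data with a made-up labeling rule, assumes `InitialSeverity` is a numeric score from 1 to 5, limits tree depth to 3, and stratifies the split so both classes keep their proportions in the test set.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration only
rng = np.random.default_rng(42)
n = 400
demo = pd.DataFrame({
    'Age': rng.integers(18, 90, size=n),
    'InitialSeverity': rng.integers(1, 6, size=n),  # assumed 1-5 scale
})
# Made-up rule: treatment succeeds when severity is 3 or lower
demo['TreatmentSuccess'] = (demo['InitialSeverity'] <= 3).astype(int)

X_demo = demo[['Age', 'InitialSeverity']]
y_demo = demo['TreatmentSuccess']

# stratify keeps the class balance the same in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)
tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_tr, y_tr)

baseline_acc = accuracy_score(y_te, baseline.predict(X_te))
tree_acc = accuracy_score(y_te, tree.predict(X_te))
# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, tree.predict(X_te))

print(f'Baseline accuracy: {baseline_acc:.2f}')
print(f'Tree accuracy:     {tree_acc:.2f}')
print(cm)
```

If the real model does not clearly beat the most-frequent-class baseline, the features or the model choice probably need revisiting before the predictions are used.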

## 5. Key Findings and Recommendations

1. Demographics:
   - [To be filled after analysis]

2. Regional Patterns:
   - [To be filled after analysis]

3. Treatment Effectiveness:
   - [To be filled after analysis]

4. Predictive Insights:
   - [To be filled after analysis]

## Next Steps
1. [To be determined based on findings]
2. [Additional recommendations]