# Week 1, Class 1: Introduction to Machine Learning in Healthcare
## Hands-on Lab: Google Colab Setup and Medical Data Exploration
### INSTRUCTOR VERSION - WITH SOLUTIONS

**Course:** AI/ML in Medicine and Healthcare  
**Module:** Week 1 - Foundations  
**Lab Type:** Individual Work  
**Estimated Time:** 90 minutes

---

## Learning Objectives
By the end of this lab, you will be able to:
1. Navigate and use Google Colab effectively
2. Mount Google Drive for file persistence
3. Load and explore a medical dataset
4. Perform basic data analysis with NumPy and pandas
5. Create visualizations with matplotlib

---

## Part 1: Welcome to Google Colab! 🚀

Google Colab is a free cloud-based Jupyter notebook environment. It provides:
- Free GPU access (with limits)
- Pre-installed ML libraries
- Easy sharing and collaboration
- Integration with Google Drive

### Colab Basics
- **Run a cell:** Shift+Enter or click the play button
- **Add cell:** Click "+Code" or "+Text" buttons
- **Save:** File → Save or Ctrl+S
- **Share:** Click "Share" button (top right)


In [None]:
# Let's start with a simple test
print("Hello, AI/ML in Medicine!")
print("You're running Python in the cloud! ☁️")

# Check Python version
import sys
print(f"\nPython version: {sys.version}")

# Expected output: a Python 3.x version string (the exact version depends on the current Colab runtime)

### Mounting Google Drive

⚠️ **IMPORTANT:** Mount your Google Drive to save your work persistently!

Without this, your files will disappear when the Colab session ends.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

print("✓ Google Drive mounted successfully!")
print("Your files are now accessible at: /content/drive/MyDrive/")

# Create a folder for this course
import os
course_folder = '/content/drive/MyDrive/AI_ML_Healthcare'
os.makedirs(course_folder, exist_ok=True)
print(f"\n✓ Course folder created: {course_folder}")
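
The mount cell above only works inside Colab. As a minimal sketch, assuming you may also want to run this notebook on a local machine, the import can be guarded so the course folder falls back to a local directory. The helper name `get_course_folder` and its `local_base` parameter are hypothetical, not part of the original lab.

```python
import os

def get_course_folder(local_base="."):
    """Return a writable course folder, mounting Google Drive when in Colab.

    `get_course_folder` and `local_base` are illustrative names for this sketch.
    """
    try:
        # google.colab is only importable inside a Colab runtime
        from google.colab import drive
        drive.mount('/content/drive')
        base = '/content/drive/MyDrive'
    except ImportError:
        # Not in Colab: keep files under a local directory instead
        base = local_base
    folder = os.path.join(base, 'AI_ML_Healthcare')
    os.makedirs(folder, exist_ok=True)
    return folder
```

Calling `course_folder = get_course_folder()` would then give the same variable the rest of the lab uses, whichever environment the notebook runs in.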

---

## Part 2: Essential Libraries for Medical ML 📚

In [None]:
# Import essential libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 11

print("Libraries imported successfully!")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

# Set random seed for reproducibility
np.random.seed(42)

---

## Part 3: Loading the Diabetes Dataset 🏥

We'll use the UCI Pima Indians Diabetes dataset:
- **768 patient records**
- **8 features:** pregnancies, glucose, blood pressure, skin thickness, insulin, BMI, diabetes pedigree, age
- **Target:** diabetes diagnosis (0 = no, 1 = yes)

This dataset is perfect for learning ML fundamentals!

In [None]:
# Load diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"

column_names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 
                'Insulin', 'BMI', 'DiabetesPedigree', 'Age', 'Outcome']

df = pd.read_csv(url, names=column_names)

print("✓ Dataset loaded successfully!")
print(f"\nDataset shape: {df.shape}")
print(f"  - {df.shape[0]} patients")
print(f"  - {df.shape[1]} columns (8 features + 1 target)")
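
Loading straight from a URL means every run depends on the network and on that repository staying online. One hedged option is to keep a local copy after the first download. In this sketch `load_with_cache` and `cache_path` are hypothetical names; `header=None` with `names=` is used because the CSV has no header row.

```python
import os
import pandas as pd

COLUMN_NAMES = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
                'Insulin', 'BMI', 'DiabetesPedigree', 'Age', 'Outcome']

def load_with_cache(url, cache_path):
    """Read the headerless CSV from cache_path if it exists, else from url.

    Hypothetical helper for illustration; not part of the original lab.
    """
    source = cache_path if os.path.exists(cache_path) else url
    # The file has no header row, so supply the column names explicitly
    df = pd.read_csv(source, header=None, names=COLUMN_NAMES)
    if source == url:
        # First download: save a headerless copy so the next run works offline
        df.to_csv(cache_path, index=False, header=False)
    return df
```

For example, `df = load_with_cache(url, os.path.join(course_folder, 'pima.csv'))` would reuse the Drive folder created in Part 1.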

### First Look at the Data

In [None]:
# Display first few rows
print("First 5 patients in the dataset:\n")
df.head()

In [None]:
# Get basic information about the dataset
print("Dataset Information:")
print("="*50)
df.info()

In [None]:
# Statistical summary
print("Statistical Summary:\n")
df.describe().round(2)

### 🤔 Observation Questions

Look at the statistical summary above and answer:
1. What's the average glucose level?
2. What's the age range of patients?
3. Do you notice any unusual values? (Hint: Can blood pressure be 0?)

**SOLUTION - Expected Student Observations:**
1. **Average glucose:** ~120.9 mg/dL (from describe() output)
2. **Age range:** 21 to 81 years (min to max)
3. **Unusual values:** 
   - Blood pressure has minimum of 0 (physiologically impossible - likely missing data)
   - Skin thickness minimum is 0 (also suspicious)
   - Insulin has 0 values (missing data)
   - BMI minimum is 0 (impossible)
   - Glucose minimum is also 0 (impossible in a living patient - missing data)
   
   **Key Insight:** Many features use 0 to represent missing data, which needs to be handled in preprocessing!

In [None]:
# SOLUTION: Let's investigate the missing data (zeros)
print("Number of zero values (likely missing data):\n")
for col in df.columns[1:6]:  # Columns that shouldn't have zeros
    zero_count = (df[col] == 0).sum()
    zero_pct = (zero_count / len(df)) * 100
    print(f"{col:20s}: {zero_count:3d} ({zero_pct:5.1f}%)")

print("\n💡 This is a common data quality issue in medical datasets!")
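
One common way to act on this, sketched here on a tiny made-up frame rather than the real data, is to convert the placeholder zeros to `NaN` so pandas treats them as missing, then fill them with each column's median. Median imputation is just one simple choice and is not necessarily what later labs use.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame (made-up values); 0 stands in for "not measured"
toy = pd.DataFrame({
    'Glucose': [148, 85, 0, 89],
    'BMI': [33.6, 0.0, 23.3, 28.1],
    'Pregnancies': [6, 1, 0, 1],   # 0 is a legitimate value here
})

zero_means_missing = ['Glucose', 'BMI']  # do NOT include Pregnancies

cleaned = toy.copy()
# Replace placeholder zeros with NaN so they are excluded from statistics
cleaned[zero_means_missing] = cleaned[zero_means_missing].replace(0, np.nan)
print(cleaned.isna().sum())

# Fill each missing value with the median of the observed values in that column
imputed = cleaned.fillna(cleaned.median())
print(imputed)
```

Note that `Pregnancies` is left alone: a zero there is a real value, so which columns count as "zero means missing" is a clinical judgment, not something the code can infer.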

---

## Part 4: Data Exploration with NumPy 🔢

In [None]:
# Convert to NumPy arrays for practice
glucose = df['Glucose'].values
bmi = df['BMI'].values

print("NumPy Array Operations:")
print("="*50)
print(f"Glucose - Mean: {np.mean(glucose):.2f}, Std: {np.std(glucose):.2f}")
print(f"BMI - Mean: {np.mean(bmi):.2f}, Std: {np.std(bmi):.2f}")
print(f"\nGlucose range: {np.min(glucose):.0f} to {np.max(glucose):.0f}")
print(f"BMI range: {np.min(bmi):.1f} to {np.max(bmi):.1f}")

# SOLUTION: Additional statistics
print(f"\nGlucose - Median: {np.median(glucose):.2f}")
print(f"Glucose - 25th percentile: {np.percentile(glucose, 25):.2f}")
print(f"Glucose - 75th percentile: {np.percentile(glucose, 75):.2f}")
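
The standard deviations printed here may differ slightly from the `std` row of `df.describe()`. That is because `np.std` defaults to the population formula (`ddof=0`), while pandas uses the sample formula (`ddof=1`). A small sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # made-up data

pop_std = np.std(values)              # divides by n (ddof=0)
sample_std = np.std(values, ddof=1)   # divides by n - 1
pandas_std = pd.Series(values).std()  # pandas default is ddof=1

print(f"NumPy default (ddof=0): {pop_std:.4f}")
print(f"NumPy with ddof=1:      {sample_std:.4f}")
print(f"pandas Series.std():    {pandas_std:.4f}")
```

With 768 rows the gap is small, but it explains why the two outputs do not match to the last digit.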

In [None]:
# Array operations - vectorization is powerful!
# Let's categorize BMI

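# BMI categories: <18.5 (underweight), 18.5-25 (normal), 25-30 (overweight), >=30 (obese)
# A BMI of 0 is a missing-value placeholder, so exclude those rows first;
# otherwise they would all be counted as underweight.
bmi_valid = bmi[bmi > 0]
underweight = np.sum(bmi_valid < 18.5)
normal = np.sum((bmi_valid >= 18.5) & (bmi_valid < 25))
overweight = np.sum((bmi_valid >= 25) & (bmi_valid < 30))
obese = np.sum(bmi_valid >= 30)

print(f"BMI Distribution ({len(bmi_valid)} patients with a recorded BMI):")
print(f"  Underweight (<18.5): {underweight} ({underweight/len(bmi_valid)*100:.1f}%)")
print(f"  Normal (18.5-25): {normal} ({normal/len(bmi_valid)*100:.1f}%)")
print(f"  Overweight (25-30): {overweight} ({overweight/len(bmi_valid)*100:.1f}%)")
print(f"  Obese (>=30): {obese} ({obese/len(bmi_valid)*100:.1f}%)")

# SOLUTION: Expected pattern (check exact counts against your own run)
# Obese is the largest category: the median BMI is about 32, so more than half of patients fall in it.
# Very few patients are underweight once the zero-BMI rows are excluded.
# Note: High obesity rate in this population!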
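
The same binning can also be expressed with `pd.cut`, which keeps a category label next to each value. This is a sketch on made-up BMI values: `right=False` makes each bin include its lower edge, matching the `>=` comparisons above, and zeros are converted to `NaN` first so they are left uncategorized.

```python
import numpy as np
import pandas as pd

bmi_toy = pd.Series([0.0, 17.9, 18.5, 24.9, 25.0, 29.9, 30.0, 41.2])  # made-up values

# Treat 0 as missing so it does not land in the underweight bin
bmi_clean = bmi_toy.replace(0, np.nan)

bins = [0, 18.5, 25, 30, np.inf]
labels = ['Underweight', 'Normal', 'Overweight', 'Obese']
# right=False -> intervals are [low, high), so 18.5 is Normal and 30.0 is Obese
categories = pd.cut(bmi_clean, bins=bins, labels=labels, right=False)

print(categories.value_counts(sort=False))
```

The result is a categorical Series, so it could be stored as a new column and reused in later plots or group-by summaries.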

---

## Part 5: Visualizing Medical Data 📊

Visualization is crucial in medical ML for:
- Understanding data distributions
- Identifying outliers and anomalies
- Communicating insights to clinicians
- Debugging models

In [None]:
# Distribution of diabetes outcomes
plt.figure(figsize=(8, 6))
outcome_counts = df['Outcome'].value_counts().sort_index()  # sort by label (0, 1) so counts line up with the bar labels
plt.bar(['No Diabetes', 'Diabetes'], outcome_counts.values, 
        color=['#2ecc71', '#e74c3c'], alpha=0.8, edgecolor='black', linewidth=1.5)
plt.title('Distribution of Diabetes Diagnosis', fontsize=14, fontweight='bold')
plt.ylabel('Number of Patients', fontsize=12)
plt.grid(axis='y', alpha=0.3)

# Add value labels on bars
for i, v in enumerate(outcome_counts.values):
    plt.text(i, v + 10, str(v), ha='center', fontweight='bold', fontsize=12)

plt.tight_layout()
plt.show()

print(f"Class distribution:")
print(f"  No diabetes: {outcome_counts[0]} ({outcome_counts[0]/len(df)*100:.1f}%)")
print(f"  Diabetes: {outcome_counts[1]} ({outcome_counts[1]/len(df)*100:.1f}%)")
print(f"\n💡 The dataset is imbalanced (65% no diabetes, 35% diabetes)")
print(f"   This is important for model training!")
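
To see why imbalance matters, consider a model that always predicts the majority class. With a 65/35 split it is already about 65% accurate while detecting no diabetic patients at all, so accuracy is best read alongside sensitivity and specificity. A minimal sketch on synthetic labels (the arrays are made up to mimic the 65/35 split):

```python
import numpy as np

# Synthetic labels mimicking the class balance: 65 negatives, 35 positives
y_true = np.array([0] * 65 + [1] * 35)

# "Model" that always predicts the majority class (no diabetes)
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()

# Confusion-matrix counts
tp = np.sum((y_pred == 1) & (y_true == 1))
tn = np.sum((y_pred == 0) & (y_true == 0))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))

sensitivity = tp / (tp + fn)  # share of diabetic patients correctly flagged
specificity = tn / (tn + fp)  # share of non-diabetic patients correctly cleared

print(f"Accuracy:    {accuracy:.2f}")
print(f"Sensitivity: {sensitivity:.2f}")
print(f"Specificity: {specificity:.2f}")
```

Any model trained on this dataset should therefore be compared against that 65% baseline rather than against 50%.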

In [None]:
# Glucose distribution by diabetes status
plt.figure(figsize=(14, 5))

# Subplot 1: Histogram
plt.subplot(1, 2, 1)
plt.hist(df[df['Outcome'] == 0]['Glucose'], bins=25, alpha=0.7, 
         label='No Diabetes', color='#2ecc71', edgecolor='black')
plt.hist(df[df['Outcome'] == 1]['Glucose'], bins=25, alpha=0.7, 
         label='Diabetes', color='#e74c3c', edgecolor='black')
plt.xlabel('Glucose Level (mg/dL)', fontsize=11)
plt.ylabel('Frequency', fontsize=11)
plt.title('Glucose Distribution by Diabetes Status', fontweight='bold')
plt.legend(fontsize=10)
plt.grid(alpha=0.3)

# Subplot 2: Box plot
plt.subplot(1, 2, 2)
df.boxplot(column='Glucose', by='Outcome', ax=plt.gca(), 
           patch_artist=True, 
           boxprops=dict(facecolor='lightblue', alpha=0.7))
plt.xlabel('Diabetes Status (0=No, 1=Yes)', fontsize=11)
plt.ylabel('Glucose Level (mg/dL)', fontsize=11)
plt.title('Glucose Levels: Box Plot Comparison', fontweight='bold')
plt.suptitle('')  # Remove default title

plt.tight_layout()
plt.show()

# SOLUTION: Statistical comparison
glucose_no_diabetes = df[df['Outcome'] == 0]['Glucose']
glucose_diabetes = df[df['Outcome'] == 1]['Glucose']

print("Statistical Comparison:")
print(f"  No Diabetes - Mean: {glucose_no_diabetes.mean():.1f}, Median: {glucose_no_diabetes.median():.1f}")
print(f"  Diabetes - Mean: {glucose_diabetes.mean():.1f}, Median: {glucose_diabetes.median():.1f}")
print(f"  Difference: {glucose_diabetes.mean() - glucose_no_diabetes.mean():.1f} mg/dL")

In [None]:
# Scatter plot: BMI vs Age, colored by diabetes status
plt.figure(figsize=(10, 6))

# Plot non-diabetic patients
no_diabetes = df[df['Outcome'] == 0]
plt.scatter(no_diabetes['Age'], no_diabetes['BMI'], 
            alpha=0.6, s=50, c='#2ecc71', label='No Diabetes', 
            edgecolors='black', linewidth=0.5)

# Plot diabetic patients
diabetes = df[df['Outcome'] == 1]
plt.scatter(diabetes['Age'], diabetes['BMI'], 
            alpha=0.6, s=50, c='#e74c3c', label='Diabetes', 
            edgecolors='black', linewidth=0.5)

plt.xlabel('Age (years)', fontsize=12)
plt.ylabel('BMI (kg/m²)', fontsize=12)
plt.title('Age vs BMI: Diabetes Status', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

print("\n💡 Observations:")
print("  - Diabetic patients (red) tend to have higher BMI")
print("  - Diabetes cases look more frequent at older ages and higher BMI")
print("  - But there's significant overlap - not a perfect separator")

---

## Part 6: Correlation Analysis 🔗

Understanding feature correlations is crucial for:
- Feature selection
- Understanding relationships
- Avoiding multicollinearity

In [None]:
# Compute correlation matrix
correlation_matrix = df.corr()

# Visualize with heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', 
            center=0, square=True, linewidths=1, cbar_kws={"shrink": 0.8},
            vmin=-1, vmax=1)
plt.title('Feature Correlation Heatmap', fontsize=14, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

# Show correlations with outcome
print("\nCorrelations with Diabetes Outcome:")
print("="*50)
outcome_corr = correlation_matrix['Outcome'].sort_values(ascending=False)
for feature, corr in outcome_corr.items():
    if feature != 'Outcome':
        print(f"{feature:20s}: {corr:+.3f}")

### 🤔 Analysis Questions

Based on the correlation heatmap:
1. Which feature has the strongest correlation with diabetes outcome?
2. Are there any highly correlated features (excluding the diagonal)?
3. Would you consider removing any features? Why?

**SOLUTION - Expected Answers:**

1. **Strongest correlation with Outcome:** 
   - **Glucose** (r ≈ 0.47) - highest positive correlation
   - This makes clinical sense: high blood glucose is a primary indicator of diabetes

2. **Most correlated feature pairs** (all moderate; no pair exceeds r ≈ 0.55):
   - **Age & Pregnancies** (r ≈ 0.54) - older women tend to have had more pregnancies
   - **SkinThickness & BMI** (r ≈ 0.39) - both measure body fat
   - **Insulin & SkinThickness** (r ≈ 0.44) - both related to metabolism; zeros that occur together in these two columns may partly explain this value
   
3. **Features to potentially remove:**
   - **SkinThickness** - has many missing values (0s) and correlates with BMI
   - **Insulin** - about 49% missing data (zeros), might not be reliable
   - However, in practice we'd test model performance with/without these features
   - Better approach: impute missing values rather than remove features
   
**Teaching Point:** Correlation doesn't imply causation, and feature selection should be data-driven (test performance), not just correlation-based.
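
For reference, each cell of the heatmap is a Pearson correlation coefficient, which is the default method of `df.corr()`. The sketch below computes it by hand on made-up values and checks the result against `np.corrcoef`:

```python
import numpy as np

# Made-up paired measurements
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Pearson r = sum of products of deviations / sqrt(product of sums of squared deviations)
x_centered = x - x.mean()
y_centered = y - y.mean()
r_manual = np.sum(x_centered * y_centered) / np.sqrt(
    np.sum(x_centered ** 2) * np.sum(y_centered ** 2)
)

r_numpy = np.corrcoef(x, y)[0, 1]

print(f"Manual r: {r_manual:.4f}")
print(f"NumPy r:  {r_numpy:.4f}")
```

Because r only captures linear association, a low value does not rule out a non-linear relationship between a feature and the outcome.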

---

## Part 7: Your First ML Insight! 🎯

Let's make a simple observation about glucose levels and diabetes risk.

In [None]:
# Compare average glucose between groups
avg_glucose_no_diabetes = df[df['Outcome'] == 0]['Glucose'].mean()
avg_glucose_diabetes = df[df['Outcome'] == 1]['Glucose'].mean()

print("Average Glucose Levels:")
print("="*50)
print(f"No Diabetes: {avg_glucose_no_diabetes:.1f} mg/dL")
print(f"Diabetes:    {avg_glucose_diabetes:.1f} mg/dL")
print(f"\nDifference:  {avg_glucose_diabetes - avg_glucose_no_diabetes:.1f} mg/dL")
print(f"({(avg_glucose_diabetes/avg_glucose_no_diabetes - 1)*100:.1f}% higher in diabetic patients)")

# SOLUTION: Simple rule-based prediction
threshold = 120  # mg/dL
rule_based_predictions = (df['Glucose'] > threshold).astype(int)
accuracy = (rule_based_predictions == df['Outcome']).mean()

print(f"\n🎯 Simple Rule: 'Predict diabetes if glucose > {threshold}'")
print(f"   Accuracy: {accuracy*100:.1f}%")

# SOLUTION: Let's try to find the best threshold
best_threshold = 0
best_accuracy = 0

for thresh in range(80, 180, 5):
    predictions = (df['Glucose'] > thresh).astype(int)
    acc = (predictions == df['Outcome']).mean()
    if acc > best_accuracy:
        best_accuracy = acc
        best_threshold = thresh

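print(f"\n🎯 Best threshold: {best_threshold} mg/dL")
print(f"   Accuracy: {best_accuracy*100:.1f}%")

print(f"\n💡 This is machine learning at its simplest!")
print(f"   We'll learn to do much better with sophisticated models!")

# SOLUTION: Expected best threshold around 125-130 mg/dL with ~73-75% accuracy
# Note: the threshold is selected and scored on the same rows, so this accuracy is optimistic.
# For context, always predicting "no diabetes" already scores about 65%.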
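
The search above picks the threshold and measures its accuracy on the same rows, which tends to overstate how well the rule generalizes. Below is a hedged sketch of the usual remedy: choose the threshold on a training split and score it on held-out rows. It uses synthetic glucose values; the generating parameters and the helper name `best_threshold_on` are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in data: the positive class is drawn with a higher mean glucose
n_neg, n_pos = 500, 268
glucose = np.concatenate([rng.normal(110, 25, n_neg), rng.normal(140, 30, n_pos)])
outcome = np.concatenate([np.zeros(n_neg, dtype=int), np.ones(n_pos, dtype=int)])

# Shuffle, then hold out 30% of rows for testing
order = rng.permutation(len(glucose))
n_test = int(0.3 * len(order))
test_idx, train_idx = order[:n_test], order[n_test:]

def best_threshold_on(x, y, candidates):
    """Return the candidate threshold with the highest accuracy on (x, y)."""
    accuracies = [((x > t).astype(int) == y).mean() for t in candidates]
    return candidates[int(np.argmax(accuracies))]

candidates = list(range(80, 180, 5))
threshold = best_threshold_on(glucose[train_idx], outcome[train_idx], candidates)

train_acc = ((glucose[train_idx] > threshold).astype(int) == outcome[train_idx]).mean()
test_acc = ((glucose[test_idx] > threshold).astype(int) == outcome[test_idx]).mean()

print(f"Threshold chosen on training rows: {threshold}")
print(f"Training accuracy: {train_acc:.3f}")
print(f"Held-out accuracy: {test_acc:.3f}")
```

On the real data, `train_test_split` (already imported in Part 2) could produce the split instead of the manual shuffle.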

---

## Part 8: Exercise - Feature Engineering 🛠️

**YOUR TURN!** Create a new feature and test its predictive value.

**Task:** Create a "risk score" combining Age and BMI, then test if it's better than using glucose alone.

In [None]:
# SOLUTION:

# Create a composite risk score
# Normalize age and BMI to [0, 1] range first
# (Caveat: BMI zeros are missing-value placeholders, so df['BMI'].min() is 0 here)
age_normalized = (df['Age'] - df['Age'].min()) / (df['Age'].max() - df['Age'].min())
bmi_normalized = (df['BMI'] - df['BMI'].min()) / (df['BMI'].max() - df['BMI'].min())

# Simple risk score: weighted average
df['RiskScore'] = 0.5 * age_normalized + 0.5 * bmi_normalized

print("Risk Score Statistics:")
print(df.groupby('Outcome')['RiskScore'].describe())

# Test predictive value
best_threshold_risk = 0
best_accuracy_risk = 0

for thresh in np.linspace(0, 1, 100):
    predictions = (df['RiskScore'] > thresh).astype(int)
    acc = (predictions == df['Outcome']).mean()
    if acc > best_accuracy_risk:
        best_accuracy_risk = acc
        best_threshold_risk = thresh

print(f"\nRisk Score Best Threshold: {best_threshold_risk:.3f}")
print(f"Risk Score Accuracy: {best_accuracy_risk*100:.1f}%")

print(f"\nComparison:")
print(f"  Glucose alone: {best_accuracy*100:.1f}%")  # best_accuracy comes from the Part 7 threshold search
print(f"  Risk Score (Age+BMI): {best_accuracy_risk*100:.1f}%")
print(f"\n💡 Glucose is still better! But combining features might help...")

# Expected: Risk score gets ~65-67% accuracy, worse than glucose alone
# This teaches that not all feature combinations improve performance
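
One caveat with the min-max scaling above: BMI contains zeros that stand for missing values, so `df['BMI'].min()` is 0 rather than the smallest real measurement, which compresses every valid BMI toward the top of the [0, 1] range. A small sketch on made-up values, using a hypothetical `min_max` helper:

```python
import numpy as np
import pandas as pd

def min_max(series):
    """Scale a Series to [0, 1]; NaN values are skipped by min()/max() and stay NaN."""
    return (series - series.min()) / (series.max() - series.min())

bmi_toy = pd.Series([0.0, 20.0, 30.0, 40.0])  # made-up values; 0 = not recorded

naive = min_max(bmi_toy)                      # the zero becomes the minimum
masked = min_max(bmi_toy.replace(0, np.nan))  # scale using real measurements only

print(pd.DataFrame({'bmi': bmi_toy, 'naive': naive, 'masked': masked}))
```

In the naive version a BMI of 20 already scores 0.5, while after masking it scores 0.0; the missing row then needs its own handling (for example imputation) before it can enter a risk score.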

---

## Part 9: Save Your Work 💾

In [None]:
# Create a comprehensive summary report
summary = {
    'Total Patients': len(df),
    'Diabetic': outcome_counts[1],
    'Non-diabetic': outcome_counts[0],
    'Avg Age': df['Age'].mean(),
    'Avg Glucose': df['Glucose'].mean(),
    'Avg BMI': df['BMI'].mean(),
    'Best Glucose Threshold': best_threshold,
    'Best Accuracy': best_accuracy,
}

summary_df = pd.DataFrame([summary])

# Save to Google Drive
output_path = '/content/drive/MyDrive/AI_ML_Healthcare/Week1_Class1_Summary.csv'
summary_df.to_csv(output_path, index=False)
print(f"✓ Summary saved to: {output_path}")

print("\nData Summary:")
print(summary_df.T)

# Also save the processed dataframe
df_output_path = '/content/drive/MyDrive/AI_ML_Healthcare/Week1_Class1_Processed_Data.csv'
df.to_csv(df_output_path, index=False)
print(f"\n✓ Processed data saved to: {df_output_path}")
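
A quick way to confirm a save worked is to read the file back and compare it with what was written. The sketch below uses a temporary directory and a made-up summary so it runs anywhere; in the lab you would point it at `output_path` instead.

```python
import os
import tempfile
import pandas as pd

# Made-up summary values, standing in for the lab's summary_df
summary_demo = pd.DataFrame([{'Total Patients': 768, 'Best Glucose Threshold': 125}])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'summary_demo.csv')
    summary_demo.to_csv(path, index=False)

    # Read the file back and check that it matches what was written
    reloaded = pd.read_csv(path)
    file_exists = os.path.exists(path)
    round_trip_ok = reloaded.equals(summary_demo)

print(f"File written: {file_exists}")
print(f"Round trip matches: {round_trip_ok}")
```

Floating-point columns can differ in the last digits after a CSV round trip, so for those a tolerance-based comparison is usually a safer check than exact equality.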

---

**Great job completing your first lab! 🎉**  
See you in Class 2!