# Phase 2: Data Enrichment via Web Scraping
## Objective: Contextualize ICD-9 diagnosis codes.

This notebook enriches the diabetic dataset with human-readable descriptions for the 20 most frequent ICD-9 codes in the primary diagnosis column (`diag_1`).

In [1]:
import pandas as pd
import sys
import os

# Make ../src importable so the local icd9_scraper module can be found
sys.path.append(os.path.abspath(os.path.join('..', 'src')))
from icd9_scraper import scrape_top_codes

### 1. Load Data

In [2]:
data_path = '../data/processed/diabetic_data_clean.csv'
df = pd.read_csv(data_path)
print(f"Dataset loaded: {df.shape}")

Dataset loaded: (100114, 52)
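
ICD-9 codes are identifiers rather than numbers (for example `V57` or `250.8`), so it can help to read the diagnosis column as text. The following is a minimal sketch on an inline CSV, assuming only the column name `diag_1` from this notebook; the sample rows and the `encounter_id` column are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical sample rows; only the column name diag_1 comes from the notebook.
sample_csv = "encounter_id,diag_1\n1,428\n2,V57\n3,250.8\n4,038\n"

# dtype=str keeps each code exactly as written (no numeric conversion).
sample = pd.read_csv(io.StringIO(sample_csv), dtype={"diag_1": str})
print(sample["diag_1"].tolist())
```

Whether the real file needs this depends on how it was cleaned in the previous phase, so treat it as an option rather than a required change.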




### 2. Identify Top 20 Most Frequent Diagnoses

In [3]:
top_20_codes = df['diag_1'].value_counts().head(20).index.tolist()
print("Top 20 ICD-9 Codes:", top_20_codes)

Top 20 ICD-9 Codes: ['428', '414', '786', '410', '486', '427', '491', '715', '682', '780', '434', '996', '276', '250.8', '599', '38', '584', 'V57', '250.6', '820']
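
Before scraping, it may be worth checking how much of the dataset the selected codes actually cover. The sketch below does this on a toy series standing in for `df['diag_1']`; the values and the names `diag`, `n_top`, and `coverage` are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for df['diag_1']; the values are illustrative, not real data.
diag = pd.Series(["428", "428", "414", "786", "428", "414", "250.83", "648"])

n_top = 2  # the notebook uses 20
counts = diag.value_counts()  # sorted by frequency, most frequent first
top_codes = counts.head(n_top).index.tolist()

# Fraction of rows whose code is in the top-n list.
coverage = diag.isin(top_codes).mean()
print(top_codes, f"{coverage:.1%}")
```

Running the same two lines on the real column with `n_top = 20` would show what share of rows will end up with a real description rather than "Other".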


### 3. Scrape Descriptions

In [4]:
print("Starting scraper (this may take 20+ seconds)...")
descriptions = scrape_top_codes(top_20_codes)
print("Scraping complete.")

Starting scraper (this may take 20+ seconds)...
Scraping code: 428...
Found: Heart failure
Scraping code: 414...
Found: Other forms of chronic ischemic heart disease
Scraping code: 786...
Found: Symptoms involving respiratory system and other chest symptoms
Scraping code: 410...
Found: Acute myocardial infarction
Scraping code: 486...
Found: Pneumonia, organism unspecified
Scraping code: 427...
Found: Cardiac dysrhythmias
Scraping code: 491...
Found: Chronic bronchitis
Scraping code: 715...
Found: Osteoarthrosis and allied disorders
Scraping code: 682...
Found: Other cellulitis and abscess
Scraping code: 780...
Found: General symptoms
Scraping code: 434...
Found: Occlusion of cerebral arteries
Scraping code: 996...
Found: Complications peculiar to certain specified procedures
Scraping code: 276...
Found: Disorders of fluid, electrolyte, and acid-base balance
Scraping code: 250.8...
Found: Diabetes with other specified manifestations
Scraping code: 599...
Found: Other disorders of ureth
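
The implementation of `scrape_top_codes` lives in `src/icd9_scraper.py` and is not shown here. As a rough illustration of the pattern the log suggests (one lookup per code, with failures skipped), here is a sketch under stated assumptions: the function name `scrape_codes_sketch`, the `fetch_description` callable, and the `delay_seconds` pause are all hypothetical, and the lookup is replaced by an offline dictionary so the example runs without network access.

```python
import time
from typing import Callable, Dict, Iterable, Optional


def scrape_codes_sketch(
    codes: Iterable[str],
    fetch_description: Callable[[str], Optional[str]],
    delay_seconds: float = 1.0,
) -> Dict[str, str]:
    """Look up each code, pausing between requests; skip codes that fail."""
    results: Dict[str, str] = {}
    for code in codes:
        print(f"Scraping code: {code}...")
        try:
            description = fetch_description(code)
        except Exception as exc:  # a failed request should not stop the loop
            print(f"Failed: {exc}")
            description = None
        if description:
            results[code] = description
            print(f"Found: {description}")
        time.sleep(delay_seconds)  # pause between requests to the remote site
    return results


# Offline stand-in for a real HTTP lookup (hypothetical data source).
fake_site = {"428": "Heart failure", "410": "Acute myocardial infarction"}
found = scrape_codes_sketch(["428", "410", "999"], fake_site.get, delay_seconds=0)
print(found)
```

Codes with no description are simply absent from the returned dictionary, which is why the integration step needs a fallback label.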

### 4. Integration
Create a `Primary_Diagnosis_Desc` column by mapping each `diag_1` code to its scraped description. Codes outside the top 20, or whose scrape failed, are labeled "Other".

In [5]:
# Create the new column
df['Primary_Diagnosis_Desc'] = df['diag_1'].map(descriptions)

# Codes outside the top 20 (or whose scrape failed) map to NaN; label them "Other"
df['Primary_Diagnosis_Desc'] = df['Primary_Diagnosis_Desc'].fillna('Other')

# Verify
print(df[['diag_1', 'Primary_Diagnosis_Desc']].head(10))
print(df['Primary_Diagnosis_Desc'].value_counts().head())

   diag_1                             Primary_Diagnosis_Desc
0  250.83                                              Other
1     276  Disorders of fluid, electrolyte, and acid-base...
2     648                                              Other
3       8                                              Other
4     197                                              Other
5     414      Other forms of chronic ischemic heart disease
6     414      Other forms of chronic ischemic heart disease
7     428                                      Heart failure
8     398                                              Other
9     434                     Occlusion of cerebral arteries
Primary_Diagnosis_Desc
Other                                                             49276
Heart failure                                                      6735
Other forms of chronic ischemic heart disease                      6555
Symptoms involving respiratory system and other chest symptoms     4016
Acute myocardial i
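
The mapping step relies on exact key matches, which is why `250.83` in the first row is labeled "Other" even though `250.8` is among the top codes. A small sketch on toy data makes this visible; the frame `toy` and the dictionary `toy_descriptions` are illustrative, with descriptions copied from the scraper log.

```python
import pandas as pd

# Toy data; the two descriptions are copied from the scraper log.
toy = pd.DataFrame({"diag_1": ["428", "250.83", "250.8", "648"]})
toy_descriptions = {
    "428": "Heart failure",
    "250.8": "Diabetes with other specified manifestations",
}

# Series.map with a dict does exact key lookup; unmatched values become NaN,
# which fillna then replaces with the fallback label.
toy["Primary_Diagnosis_Desc"] = toy["diag_1"].map(toy_descriptions).fillna("Other")
print(toy)
```

If more specific subcodes should inherit the description of their parent code, that would need an explicit extra step; the notebook as written does not do this.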

### 5. Save Data

In [6]:
output_path = '../data/processed/diabetic_data_enriched.csv'
os.makedirs(os.path.dirname(output_path), exist_ok=True)
df.to_csv(output_path, index=False)
print(f"Enriched data saved to {output_path}")

Enriched data saved to ../data/processed/diabetic_data_enriched.csv
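
A quick round-trip check can confirm that a saved file reads back with the same columns and values. The sketch below uses a temporary directory and a two-row toy frame rather than the real output path; the names `toy`, `toy_path`, and `reloaded` are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame(
    {"diag_1": ["428", "V57"], "Primary_Diagnosis_Desc": ["Heart failure", "Other"]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "processed", "toy_enriched.csv")
    os.makedirs(os.path.dirname(toy_path), exist_ok=True)
    toy.to_csv(toy_path, index=False)
    # Read everything back as text so codes are compared as written.
    reloaded = pd.read_csv(toy_path, dtype=str)

print(reloaded.values.tolist() == toy.values.tolist())
```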
