# Comprehensive Data Standardization

This notebook consolidates all standardization tasks for the AMR dataset, including:
- DEPARTMENT code standardization
- SEX code standardization
- Organism name mapping
- Organism type mapping

## Objectives
- Ensure consistency and readability across all key columns.
- Validate transformations to maintain data integrity.
- Export the fully standardized dataset.

In [8]:
# Import required libraries (pandas is the only one this notebook uses)
import pandas as pd

# Load the dataset
input_file = r'C:\NATIONAL AMR DATA ANALYSIS FILES\data\processed\mapped\df_mapped_org_type_2025-06-12.csv'
df = pd.read_csv(input_file)

print("📂 Dataset loaded successfully!")
print(f"   📊 Shape: {df.shape}")
print(f"   📋 Columns: {list(df.columns)}")

📂 Dataset loaded successfully!
   📊 Shape: (32688, 47)
   📋 Columns: ['PATIENT_ID', 'ORGANISM_CODE', 'ORGANISM_NAME', 'ORGANISM_TYPE', 'ROW_IDX', 'COUNTRY', 'SEX', 'AGE', 'INSTITUTION', 'REGION', 'DEPARTMENT', 'SPEC_DATE', 'ORG_TYPE', 'AMC_ND20', 'AMK_ND30', 'AMP_ND10', 'AMX_ND30', 'AZM_ND15', 'CAZ_ND30', 'CHL_ND30', 'CIP_ND5', 'CLI_ND2', 'CLO_ND5', 'CRO_ND30', 'CTX_ND30', 'CXM_ND30', 'ERY_ND15', 'ETP_ND10', 'FEP_ND30', 'FLC_ND', 'FOX_ND30', 'GEN_ND10', 'LEX_ND30', 'LIN_ND4', 'LNZ_ND30', 'LVX_ND5', 'MEM_ND10', 'MNO_ND30', 'OXA_ND1', 'PEN_ND10', 'PNV_ND10', 'RIF_ND5', 'SXT_ND1_2', 'TCY_ND30', 'TGC_ND15', 'TZP_ND100', 'VAN_ND30']
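
Each later cell checks for its own column separately. A minimal sketch of checking all required columns once, right after loading, might look like the following. The `REQUIRED_COLUMNS` list and the `missing_columns` helper are illustrative names and are not part of this notebook.

```python
import pandas as pd

# Columns the standardization steps rely on (assumed list for illustration)
REQUIRED_COLUMNS = ["DEPARTMENT", "SEX", "ORGANISM_CODE"]


def missing_columns(frame, required):
    """Return the required column names that are absent from the frame."""
    return [col for col in required if col not in frame.columns]


# Hypothetical frame that lacks two of the required columns
example = pd.DataFrame(columns=["SEX", "AGE"])
absent = missing_columns(example, REQUIRED_COLUMNS)
print(f"Missing columns: {absent}")
```

Stopping early when the list is non-empty avoids running a partial standardization on an unexpected file.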


In [9]:
# 1. DEPARTMENT Code Standardization
print("🏥 Standardizing DEPARTMENT codes")

# Define mapping for DEPARTMENT codes
department_mapping = {"Out": "Out-patient", "Inp": "In-patient"}

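# Apply mapping to the DEPARTMENT column.
# Note: after stripping whitespace and title-casing, any value other than
# "Out" or "Inp" has no entry in the mapping, so .map() turns it into NaN.
if "DEPARTMENT" in df.columns:
    df["DEPARTMENT"] = df["DEPARTMENT"].str.strip().str.title().map(department_mapping)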
    print("✅ DEPARTMENT codes standardized successfully!")
else:
    print("❌ DEPARTMENT column not found in the dataset!")

🏥 Standardizing DEPARTMENT codes
✅ DEPARTMENT codes standardized successfully!
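
Because `.map()` returns NaN for any value without a dictionary entry, unexpected department codes would be lost silently. The sketch below is one possible way to keep such values and list them for review. The `standardize_codes` helper and the sample values are hypothetical.

```python
import pandas as pd


def standardize_codes(series, mapping):
    """Map known codes to labels and keep unrecognized values unchanged.

    Returns the standardized series and a sorted list of values that had
    no entry in the mapping.
    """
    cleaned = series.str.strip()
    mapped = cleaned.str.title().map(mapping)
    # Values that were present before mapping but have no mapped label
    unmapped_mask = mapped.isna() & cleaned.notna()
    unmapped = sorted(cleaned[unmapped_mask].unique())
    return mapped.fillna(cleaned), unmapped


department_mapping = {"Out": "Out-patient", "Inp": "In-patient"}
sample = pd.Series([" out", "INP", "Emergency", None])  # hypothetical raw values
result, unmapped = standardize_codes(sample, department_mapping)
print(result.tolist())
print(f"Unmapped values: {unmapped}")
```

With this fallback, re-running the step on already standardized values leaves them unchanged instead of producing NaN, although they are then reported as unmapped.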


In [10]:
# 2. SEX Code Standardization
print("🔄 Standardizing SEX codes")

# Define mapping for SEX codes
sex_mapping = {"f": "Female", "m": "Male"}

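# Apply mapping to the SEX column.
# Note: after stripping whitespace and lower-casing, any value other than
# "f" or "m" has no entry in the mapping, so .map() turns it into NaN.
if "SEX" in df.columns:
    df["SEX"] = df["SEX"].str.strip().str.lower().map(sex_mapping)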
    print("✅ SEX codes standardized successfully!")
else:
    print("❌ SEX column not found in the dataset!")

🔄 Standardizing SEX codes
✅ SEX codes standardized successfully!
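
The objectives call for validating each transformation, and the success message above is printed whether or not every value was mapped. One way to audit a mapping step is to compare missing values before and after it. This is a minimal sketch with made-up sample values; `missing_introduced` is a hypothetical helper.

```python
import pandas as pd


def missing_introduced(before, after):
    """Count original values that were present before but are missing after."""
    lost = before.notna() & after.isna()
    return before[lost].value_counts()


sex_mapping = {"f": "Female", "m": "Male"}
raw = pd.Series(["F", " m ", "U", None, "f"])  # hypothetical raw codes
mapped = raw.str.strip().str.lower().map(sex_mapping)

lost_values = missing_introduced(raw, mapped)
print(lost_values)
```

An empty result means the mapping introduced no new missing values. Any listed value is a candidate for a new dictionary entry.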


In [11]:
# 3. Organism Name Mapping
print("🦠 Mapping organism codes to names")

# Load organism reference table
organism_ref_file = r'c:\NATIONAL AMR DATA ANALYSIS FILES\data\Database Resources\Organisms_Data_Final.csv'
df_organism_ref = pd.read_csv(organism_ref_file)

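# Create mapping dictionary (code -> name).
# If a code appears more than once in the reference table, the last row wins.
organism_mapping = df_organism_ref.set_index("ORGANISM_CODE")["ORGANISM_NAME"].to_dict()

# Apply mapping to the ORGANISM_CODE column.
# This overwrites the existing ORGANISM_NAME column; codes that are absent
# from the reference table become NaN.
if "ORGANISM_CODE" in df.columns:
    df["ORGANISM_NAME"] = df["ORGANISM_CODE"].map(organism_mapping)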
    print("✅ Organism names mapped successfully!")
else:
    print("❌ ORGANISM_CODE column not found in the dataset!")

🦠 Mapping organism codes to names
✅ Organism names mapped successfully!
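
Two conditions can quietly affect a dictionary-based lookup like this one: duplicate codes in the reference table, and codes in the data that the table does not cover. The following sketch checks for both on a small, invented reference table. The organism rows shown are illustrative and are not taken from `Organisms_Data_Final.csv`.

```python
import pandas as pd

# Hypothetical miniature reference table with one duplicated code
ref = pd.DataFrame({
    "ORGANISM_CODE": ["eco", "sau", "eco"],
    "ORGANISM_NAME": ["Escherichia coli", "Staphylococcus aureus", "E. coli"],
})
data = pd.DataFrame({"ORGANISM_CODE": ["eco", "kpn", "sau"]})

# Codes that occur more than once in the reference table
is_duplicated = ref["ORGANISM_CODE"].duplicated(keep=False)
duplicate_codes = ref.loc[is_duplicated, "ORGANISM_CODE"].unique().tolist()

# Resolve duplicates explicitly (here: keep the first row) before building the dict
ref_unique = ref.drop_duplicates(subset="ORGANISM_CODE", keep="first")
mapping = ref_unique.set_index("ORGANISM_CODE")["ORGANISM_NAME"].to_dict()

# Codes present in the data but absent from the reference table
missing_codes = sorted(set(data["ORGANISM_CODE"].dropna()) - set(mapping))

print(f"Duplicate reference codes: {duplicate_codes}")
print(f"Codes without a reference entry: {missing_codes}")
```

Keeping the first row is only one possible rule. The point is to make the choice explicit rather than rely on dictionary construction order.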


In [12]:
# 4. Organism Type Mapping
print("🔬 Mapping organism codes to types")

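# Create mapping dictionary for organism types (code -> type description).
# df_organism_ref is the reference table loaded in the previous cell.
organism_type_mapping = df_organism_ref.set_index("ORGANISM_CODE")["ORGANISM_TYPE_DESCRIPTION"].to_dict()

# Apply mapping to the ORGANISM_CODE column.
# This overwrites the existing ORGANISM_TYPE column; codes that are absent
# from the reference table become NaN.
if "ORGANISM_CODE" in df.columns:
    df["ORGANISM_TYPE"] = df["ORGANISM_CODE"].map(organism_type_mapping)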
    print("✅ Organism types mapped successfully!")
else:
    print("❌ ORGANISM_CODE column not found in the dataset!")

🔬 Mapping organism codes to types
✅ Organism types mapped successfully!
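
Since the name and the type come from the same reference table, an alternative to two dictionary lookups is a single left merge. This is a sketch of that alternative and not the approach used in this notebook. The reference rows and type labels are invented for illustration. In the real dataset the existing `ORGANISM_NAME` and `ORGANISM_TYPE` columns would need to be dropped first to avoid suffixed duplicate columns.

```python
import pandas as pd

# Hypothetical reference table and data
ref = pd.DataFrame({
    "ORGANISM_CODE": ["eco", "sau"],
    "ORGANISM_NAME": ["Escherichia coli", "Staphylococcus aureus"],
    "ORGANISM_TYPE_DESCRIPTION": ["Gram-negative", "Gram-positive"],
})
data = pd.DataFrame({"PATIENT_ID": [1, 2, 3], "ORGANISM_CODE": ["sau", "eco", "xxx"]})

# validate="many_to_one" raises if the reference table has duplicate codes;
# indicator=True adds a "_merge" column showing which rows found a match.
merged = data.merge(
    ref, on="ORGANISM_CODE", how="left", validate="many_to_one", indicator=True
)
merged = merged.rename(columns={"ORGANISM_TYPE_DESCRIPTION": "ORGANISM_TYPE"})

unmatched = merged.loc[merged["_merge"] == "left_only", "ORGANISM_CODE"].tolist()
merged = merged.drop(columns="_merge")

print(merged)
print(f"Unmatched codes: {unmatched}")
```

The left merge keeps every row of the data, so the row count should be the same before and after.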


In [13]:
# Export the fully standardized dataset
output_file = r'c:\NATIONAL AMR DATA ANALYSIS FILES\data\processed\mapped\df_fully_standardized_2025-06-13.csv'
df.to_csv(output_file, index=False)

print("💾 Dataset exported successfully!")
print(f"   📁 Location: {output_file}")

💾 Dataset exported successfully!
   📁 Location: c:\NATIONAL AMR DATA ANALYSIS FILES\data\processed\mapped\df_fully_standardized_2025-06-13.csv
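
A successful `to_csv` call does not by itself confirm that the file reads back as expected. A minimal round-trip check, sketched here on a tiny invented frame written to a temporary directory, could compare shape, column order, and missing-value counts. Reading `PATIENT_ID` as a string is an assumption made for this example to show how leading zeros can be preserved.

```python
import os
import tempfile

import pandas as pd

# Hypothetical standardized frame
frame = pd.DataFrame({"PATIENT_ID": ["001", "002"], "SEX": ["Female", None]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "standardized.csv")
    frame.to_csv(path, index=False)
    # Read identifiers as text so values such as "001" are not parsed as integers
    reloaded = pd.read_csv(path, dtype={"PATIENT_ID": str})

same_shape = reloaded.shape == frame.shape
same_columns = list(reloaded.columns) == list(frame.columns)
same_missing = reloaded.isna().sum().tolist() == frame.isna().sum().tolist()

print(f"Shape preserved: {same_shape}")
print(f"Columns preserved: {same_columns}")
print(f"Missing-value counts preserved: {same_missing}")
```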
