# oBDS Data Import and Analysis

This notebook imports oBDS mapping data from Excel files with multiple sheets (modules) and prepares it for quality analysis.

In [141]:
import pandas as pd
import numpy as np
import os
from pathlib import Path

# Display options for better data viewing
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 50)

## File Path Configuration

Specify the path to your oBDS Excel file here.

In [142]:
# TODO: Update this path to point to your oBDS Excel file
excel_file_path = "rawData/oBDS_Module_alle_neu.xlsx"

# Check if file exists
if os.path.exists(excel_file_path):
    print(f"✓ File found: {excel_file_path}")
else:
    print(f"✗ File not found: {excel_file_path}")
    print("Please update the file path above.")

✓ File found: rawData/oBDS_Module_alle_neu.xlsx


## Discover Sheet Names

First, let's see what sheets (modules) are available in the Excel file.

In [143]:
# Read Excel file and get sheet names
try:
    excel_file = pd.ExcelFile(excel_file_path)
    sheet_names = excel_file.sheet_names
    
    print(f"Found {len(sheet_names)} sheets in the Excel file:")
    for i, sheet in enumerate(sheet_names, 1):
        print(f"{i:2d}. {sheet}")
        
except FileNotFoundError:
    print("Excel file not found. Please check the file path.")
    sheet_names = []

Found 19 sheets in the Excel file:
 1. Modul 5
 2. Modul 6
 3. Modul 9
 4. Modul 10
 5. Modul 11
 6. Modul 12
 7. Modul 13
 8. Modul 14
 9. Modul 15
10. Modul 16
11. Modul 17
12. Modul 18
13. Modul 19
14. Modul 20
15. Modul 21
16. Modul 22
17. Modul 23
18. Modul 24
19. Modul 25


## Import All Sheets

Load all sheets into a dictionary of DataFrames.

In [144]:
# Dictionary to store all sheets
obds_data = {}

if sheet_names:
    for sheet_name in sheet_names:
        try:
            df = pd.read_excel(excel_file_path, sheet_name=sheet_name)
            obds_data[sheet_name] = df
            print(f"✓ Loaded sheet '{sheet_name}': {df.shape[0]} rows, {df.shape[1]} columns")
        except Exception as e:
            print(f"✗ Error loading sheet '{sheet_name}': {e}")
    
    print(f"\nTotal sheets loaded: {len(obds_data)}")
else:
    print("No sheets to load.")

✓ Loaded sheet 'Modul 5': 35 rows, 27 columns
✓ Loaded sheet 'Modul 6': 26 rows, 24 columns
✓ Loaded sheet 'Modul 9': 5 rows, 24 columns
✓ Loaded sheet 'Modul 10': 17 rows, 24 columns
✓ Loaded sheet 'Modul 11': 19 rows, 24 columns
✓ Loaded sheet 'Modul 12': 9 rows, 24 columns
✓ Loaded sheet 'Modul 13': 94 rows, 26 columns
✓ Loaded sheet 'Modul 14': 202 rows, 26 columns
✓ Loaded sheet 'Modul 15': 14 rows, 24 columns
✓ Loaded sheet 'Modul 16': 42 rows, 24 columns
✓ Loaded sheet 'Modul 17': 41 rows, 24 columns
✓ Loaded sheet 'Modul 18': 9 rows, 24 columns
✓ Loaded sheet 'Modul 19': 23 rows, 24 columns
✓ Loaded sheet 'Modul 20': 9 rows, 24 columns
✓ Loaded sheet 'Modul 21': 2 rows, 24 columns
✓ Loaded sheet 'Modul 22': 2 rows, 24 columns
✓ Loaded sheet 'Modul 23': 14 rows, 24 columns
✓ Loaded sheet 'Modul 24': 7 rows, 24 columns
✓ Loaded sheet 'Modul 25': 9 rows, 24 columns

Total sheets loaded: 19


## Examine Data Structure

Let's look at the column structure of the first available sheet to understand the data format.

In [145]:
if obds_data:
    # Get the first sheet for examination
    first_sheet_name = list(obds_data.keys())[0]
    first_df = obds_data[first_sheet_name]
    
    print(f"Examining structure of sheet: '{first_sheet_name}'")
    print(f"Shape: {first_df.shape}")
    print("\nColumn names:")
    
    for i, col in enumerate(first_df.columns):
        print(f"{chr(65+i):2s} ({i:2d}): {col}")
    
    print("\nFirst few rows:")
    display(first_df.head())

Examining structure of sheet: 'Modul 5'
Shape: (35, 27)

Column names:
A  ( 0): Identifier
B  ( 1): Feldbezeichnung
C  ( 2): Ausprägungen
D  ( 3): SNOMED CT_Code_Nini 
E  ( 4): SNOMED CT_FSN_Nini
F  ( 5): ISO Score_Nini
G  ( 6): SNOMED CT_Code_Sophie 
H  ( 7): SNOMED CT_FSN_Sophie
I  ( 8): ISO Score_Sophie
J  ( 9): SNOMED CT_Code_Paul 
K  (10): SNOMED CT_FSN_Paul
L  (11): ISO
M  (12): SNOMED CT_Code_Lotte
N  (13): SNOMED CT_FSN_Lotte
O  (14): ISO Score_Lotte
P  (15): Unnamed: 15
Q  (16): SNOMED CT_Code_All
R  (17): SNOMED CT_FSN_all
S  (18): Häufigkeit der Codeverwendung
T  (19): Codes gesamt
U  (20): Unnamed: 20
V  (21): Unnamed: 21
W  (22): SNOMED CT_Code_Konsens
X  (23): SNOMED CT_FSN_Konsens
Y  (24): ISO_Score_Konsens
Z  (25): nan
[  (26): German Edition: Version: 2024-05-15

First few rows:


Unnamed: 0,Identifier,Feldbezeichnung,Ausprägungen,SNOMED CT_Code_Nini,SNOMED CT_FSN_Nini,ISO Score_Nini,SNOMED CT_Code_Sophie,SNOMED CT_FSN_Sophie,ISO Score_Sophie,SNOMED CT_Code_Paul,SNOMED CT_FSN_Paul,ISO,SNOMED CT_Code_Lotte,SNOMED CT_FSN_Lotte,ISO Score_Lotte,Unnamed: 15,SNOMED CT_Code_All,SNOMED CT_FSN_all,Häufigkeit der Codeverwendung,Codes gesamt,Unnamed: 20,Unnamed: 21,SNOMED CT_Code_Konsens,SNOMED CT_FSN_Konsens,ISO_Score_Konsens,NaN,German Edition: Version: 2024-05-15
0,5.0,Diagnose,,439401000.0,Diagnosis (observable entity),0.0,439401001.0,Diagnosis (observable entity),0.0,439401000.0,Diagnosis (observable entity),0.0,439401001 (SCT),Diagnosis (observable entity),0.0,,439401000.0,Diagnosis (observable entity),4.0,1,1.0,,,,,0.0,439401001 Diagnose
1,5.1,Primärtumor Tumordiagnose ICD Code,C00-C97 Bösartige Neubildung?,,,4.0,372087000.0,Primary malignant neoplasm (disorder),1.0,363346000.0,Malignant neoplastic disease (disorder),2.0,363346000,Malignant neoplastic disease (disorder),2.0,Identifier,363346000.0,Malignant neoplastic disease (disorder),2.0,2,0.5,,,,,2.25,-
2,5.2,ICD VERSION ?,,,,4.0,,,4.0,,,4.0,,,4.0,,,,0.0,0,0.0,,,,,4.0,-
3,5.3,Freitext,,9e+17,String (foundation metadata concept),1.0,,,4.0,9e+17,String (foundation metadata concept),1.0,,,4.0,,9e+17,String (foundation metadata concept),2.0,2,0.5,,,,,2.5,-
4,5.4,Primärtumor Topographie ICD-O,,371480000.0,Anatomic location of neoplasm (observable entity),2.0,399687005.0,Anatomic location of primary malignant neoplas...,1.0,399687000.0,Anatomic location of primary malignant neoplas...,1.0,,,4.0,,399687000.0,Anatomic location of primary malignant neoplas...,2.0,2,0.5,,,,,2.0,-


## Column Mapping

Based on the project description, the expected column structure is:
- **A-C**: Original oBDS fields
- **D-E**: Nina (SCTID, FSN, ISO-Score)
- **F**: (Additional Nina column?)
- **G-I**: Sophie (SCTID, FSN, ISO-Score)
- **J-L**: Paul (SCTID, FSN, ISO-Score)
- **M-O**: Lotte (SCTID, FSN, ISO-Score)
- **P**: (Gap or additional column?)
- **Q-R**: Thomas MII Onko (SCTID, FSN, ISO-Score)
- **S**: Code usage frequency
- **T**: Total unique codes
- **U**: Code ratio
- **V**: ISO mismatch annotations
- **W-Z**: Additional mapping data (varies by module)
- **AA**: German translations (some modules)

In [146]:
def analyze_column_structure(df, sheet_name):
    """Analyze and categorize columns based on expected structure."""
    
    print(f"\n=== Column Analysis for {sheet_name} ===")
    
    # Expected column ranges (0-indexed)
    column_ranges = {
        'Original_Fields': (0, 3),    # A-C
        'Nina': (3, 6),               # D-F
        'Sophie': (6, 9),             # G-I
        'Paul': (9, 12),              # J-L
        'Lotte': (12, 15),            # M-O
        'Thomas': (15, 18),           # P-R (adjusted)
        'Metrics': (18, 22),          # S-V
        'Additional': (22, None)      # W+
    }
    
    for category, (start, end) in column_ranges.items():
        if end is None:
            cols = df.columns[start:]
        else:
            cols = df.columns[start:end]
        
        if len(cols) > 0:
            print(f"\n{category} ({len(cols)} columns):")
            for i, col in enumerate(cols):
                col_idx = start + i
                excel_col = chr(65 + col_idx) if col_idx < 26 else f"A{chr(65 + col_idx - 26)}"
                print(f"  {excel_col}: {col}")

# Analyze structure for all sheets
for sheet_name, df in obds_data.items():
    analyze_column_structure(df, sheet_name)


=== Column Analysis for Modul 5 ===

Original_Fields (3 columns):
  A: Identifier
  B: Feldbezeichnung
  C: Ausprägungen

Nina (3 columns):
  D: SNOMED CT_Code_Nini 
  E: SNOMED CT_FSN_Nini
  F: ISO Score_Nini

Sophie (3 columns):
  G: SNOMED CT_Code_Sophie 
  H: SNOMED CT_FSN_Sophie
  I: ISO Score_Sophie

Paul (3 columns):
  J: SNOMED CT_Code_Paul 
  K: SNOMED CT_FSN_Paul
  L: ISO

Lotte (3 columns):
  M: SNOMED CT_Code_Lotte
  N: SNOMED CT_FSN_Lotte
  O: ISO Score_Lotte

Thomas (3 columns):
  P: Unnamed: 15
  Q: SNOMED CT_Code_All
  R: SNOMED CT_FSN_all

Metrics (4 columns):
  S: Häufigkeit der Codeverwendung
  T: Codes gesamt
  U: Unnamed: 20
  V: Unnamed: 21

Additional (5 columns):
  W: SNOMED CT_Code_Konsens
  X: SNOMED CT_FSN_Konsens
  Y: ISO_Score_Konsens
  Z: nan
  AA: German Edition: Version: 2024-05-15

=== Column Analysis for Modul 6 ===

Original_Fields (3 columns):
  A: Identifier
  B: Feldbezeichnung
  C: Ausprägungen

Nina (3 columns):
  D: SNOMED CT_Code_Nini 
  E: SN

## Data Quality Overview

Check for missing values, data types, and basic statistics.

In [147]:
def data_quality_summary(df, sheet_name):
    """Provide a data quality summary for a sheet."""
    
    print(f"\n=== Data Quality Summary: {sheet_name} ===")
    print(f"Shape: {df.shape}")
    print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024:.1f} KB")
    
    # Missing values
    missing = df.isnull().sum()
    missing_pct = (missing / len(df)) * 100
    
    print("\nColumns with missing values:")
    for col in missing[missing > 0].index:
        print(f"  {col}: {missing[col]} ({missing_pct[col]:.1f}%)")
    
    # Data types
    print("\nData types:")
    print(df.dtypes.value_counts())
    
    return {
        'shape': df.shape,
        'missing_values': missing.sum(),
        'missing_pct': (missing.sum() / (df.shape[0] * df.shape[1])) * 100
    }

# Generate quality summaries
quality_summary = {}
for sheet_name, df in obds_data.items():
    quality_summary[sheet_name] = data_quality_summary(df, sheet_name)


=== Data Quality Summary: Modul 5 ===
Shape: (35, 27)
Memory usage: 31.6 KB

Columns with missing values:
  Identifier: 8 (22.9%)
  Feldbezeichnung: 13 (37.1%)
  Ausprägungen: 21 (60.0%)
  SNOMED CT_Code_Nini : 8 (22.9%)
  SNOMED CT_FSN_Nini: 8 (22.9%)
  ISO Score_Nini: 1 (2.9%)
  SNOMED CT_Code_Sophie : 6 (17.1%)
  SNOMED CT_FSN_Sophie: 6 (17.1%)
  ISO Score_Sophie: 1 (2.9%)
  SNOMED CT_Code_Paul : 4 (11.4%)
  SNOMED CT_FSN_Paul: 4 (11.4%)
  ISO: 1 (2.9%)
  SNOMED CT_Code_Lotte: 6 (17.1%)
  SNOMED CT_FSN_Lotte: 6 (17.1%)
  ISO Score_Lotte: 1 (2.9%)
  Unnamed: 15: 34 (97.1%)
  SNOMED CT_Code_All: 5 (14.3%)
  SNOMED CT_FSN_all: 5 (14.3%)
  Häufigkeit der Codeverwendung: 1 (2.9%)
  Unnamed: 21: 25 (71.4%)
  SNOMED CT_Code_Konsens: 35 (100.0%)
  SNOMED CT_FSN_Konsens: 35 (100.0%)
  ISO_Score_Konsens: 35 (100.0%)
  nan: 1 (2.9%)

Data types:
float64    15
object     12
dtype: int64

=== Data Quality Summary: Modul 6 ===
Shape: (26, 24)
Memory usage: 22.7 KB

Columns with missing values:
 

## Data Quality Control and Cleaning

This section performs quality checks on the raw data and creates cleaned versions with full documentation of all changes made.

In [148]:
import re

# SNOMED CT concept ID validation regex
# Valid SNOMED IDs: exactly 6-18 digits, no other characters
snomed_pattern = re.compile(r'^\d{6,18}$')

def is_valid_snomed(concept_id):
      """
      Check if concept_id is a valid SNOMED CT format.
      
      Valid SNOMED CT IDs:
      - Only digits (no letters, spaces, symbols)
      - Between 6 and 18 digits long
      - No whitespace (will fail validation if present)
      """
      if pd.isna(concept_id):
          return False
      return bool(snomed_pattern.match(str(concept_id)))

  # Extract SNOMED concepts from 4 mappers
mapper_columns = {
      'Nina': 3,    # Column D - SNOMED CT_Code_Nini
      'Sophie': 6,  # Column G - SNOMED CT_Code_Sophie
      'Paul': 9,    # Column J - SNOMED CT_Code_Paul
      'Lotte': 12   # Column M - SNOMED CT_Code_Lotte
  }

  # Use sets to automatically store only unique values
valid_concepts = set()      # Unique valid SNOMED concepts
invalid_entries = set()     # Unique invalid entries for review

for sheet_name, df in obds_data.items():
      for mapper_name, col_idx in mapper_columns.items():
          concepts = df.iloc[:, col_idx].dropna()
          for concept in concepts:
              concept_str = str(concept)  # No stripping - keep original format

              if is_valid_snomed(concept_str):
                  valid_concepts.add(concept_str)
              else:
                  invalid_entries.add(concept_str)

print(f"Valid SNOMED concepts: {len(valid_concepts)} unique")
print(f"Invalid entries: {len(invalid_entries)} unique")
print(f"Sample valid: {sorted(list(valid_concepts))[:5]}")
print(f"Sample invalid: {list(invalid_entries)[:10]}")


# Analyze the invalid entries by type
float_pattern = re.compile(r'^\d{6,18}\.0$')
multi_value_pattern = re.compile(r'\n')
whitespace_pattern = re.compile(r'^\s|\s$')

issue_counts = {'float_format': 0, 'multi_value': 0, 'whitespace': 0, 'other': 0}

for entry in invalid_entries:
      if float_pattern.match(entry):
          issue_counts['float_format'] += 1
      elif multi_value_pattern.search(entry):
          issue_counts['multi_value'] += 1
      elif whitespace_pattern.search(entry):
          issue_counts['whitespace'] += 1
      else:
          issue_counts['other'] += 1

print("Invalid entry analysis:")
for issue_type, count in issue_counts.items():
      print(f"  {issue_type}: {count} entries")

Valid SNOMED concepts: 305 unique
Invalid entries: 556 unique
Sample valid: ['10200004', '105590001', '108369006', '113120007', '113192009']
Sample invalid: ['266987004.0', '54735007 \n64688005 ', '1287955007\n1287954006 ', '234105001.0', '897713009\n1255831008', '258337006.0', '397440000.0', '394804000.0', '746224000 \n746225004', '29836001.0']
Invalid entry analysis:
  float_format: 502 entries
  multi_value: 42 entries
  whitespace: 2 entries
  other: 10 entries


## Data Cleaning and Documentation

### Clean concepts and track multi-value cases
 
DATA QUALITY ISSUE CONTEXT:

The .0 endings are caused by pandas reading Excel files with missing values.
When a column contains NaN (missing values), pandas treats the entire column as float64
because NaN is a float concept. This converts valid integers like 439401001 
to floats like 439401001.0, which then fail our strict SNOMED validation.
 
This is pandas being "helpful" but creating data cleaning requirements.
Alternative would be to use pd.read_excel(dtype=str) to keep everything as strings.

  

In [149]:
multi_value_cases = []  # For export - cases where mappers selected multiple concepts
cleaned_valid_concepts = set()
cleaning_log = {'float_fixed': 0, 'multi_value_split': 0, 'whitespace_cleaned': 0}
for sheet_name, df in obds_data.items():
      for mapper_name, col_idx in mapper_columns.items():
          concepts = df.iloc[:, col_idx].dropna()
          for row_idx, concept in enumerate(concepts):
              original = str(concept)

              # Fix pandas float conversion (e.g., "439401001.0" -> "439401001")
              if original.endswith('.0') and original[:-2].isdigit():
                  cleaned = original[:-2]
                  cleaning_log['float_fixed'] += 1
              # Handle multi-value entries (record for later analysis)
              elif '\n' in original:
                  parts = [p.strip() for p in original.split('\n') if p.strip()]
                  cleaned = parts[0]  # Take first concept for now
                  cleaning_log['multi_value_split'] += 1
                  # Record this multi-value case with oBDS context
                  multi_value_cases.append({
                      'obds_id': df.iloc[row_idx, 0],  # oBDS identifier
                      'sheet': sheet_name,
                      'mapper': mapper_name,
                      'all_concepts': parts,
                      'selected_concept': cleaned
                  })
              else:
                  cleaned = original.strip()
                  if cleaned != original:
                      cleaning_log['whitespace_cleaned'] += 1

              if is_valid_snomed(cleaned):
                  cleaned_valid_concepts.add(cleaned)

print(f"Cleaning results:")
print(f"  Float format fixed: {cleaning_log['float_fixed']}")
print(f"  Multi-value cases split: {cleaning_log['multi_value_split']}")
print(f"  Whitespace cleaned: {cleaning_log['whitespace_cleaned']}")
print(f"  Final valid concepts: {len(cleaned_valid_concepts)}")
print(f"  Multi-value cases for analysis: {len(multi_value_cases)}")

Cleaning results:
  Float format fixed: 1095
  Multi-value cases split: 51
  Whitespace cleaned: 2
  Final valid concepts: 596
  Multi-value cases for analysis: 51


In [150]:


# Remove workflow comment rows from Modul 11
  # These are project notes from initial mapping pilot, not actual oBDS data
print("=== CLEANING MODUL 11 ===")
df_m11 = obds_data['Modul 11']
print(f"Original: {df_m11.shape[0]} rows")

  # Keep only rows where oBDS ID (first column) is not null
clean_df = df_m11[df_m11.iloc[:, 0].notna()].copy()
obds_data['Modul 11'] = clean_df

print(f"After removing comments: {clean_df.shape[0]} rows")
print(f"Removed: {df_m11.shape[0] - clean_df.shape[0]} comment rows")



print("=== APPLYING CONCEPT FORMAT CLEANING ===")

for sheet_name, df in obds_data.items():
      for mapper_name, col_idx in mapper_columns.items():
          # Apply cleaning transformations to the column
          concepts = df.iloc[:, col_idx]

          for row_idx in concepts.index:
              if pd.notna(concepts[row_idx]):
                  original = str(concepts[row_idx])

                  # Fix float format (.0 endings)
                  if original.endswith('.0') and original[:-2].isdigit():
                      cleaned = original[:-2]
                  # Handle multi-value (take first)
                  elif '\n' in original:
                      parts = [p.strip() for p in original.split('\n') if p.strip()]
                      cleaned = parts[0] if parts else None
                  else:
                      cleaned = original.strip()

                  # Update the dataframe with cleaned value
                  df.iloc[row_idx, col_idx] = cleaned

print("Concept format cleaning applied to dataframes")


# Handle specific SNOMED concept formatting issues found in the data
print("=== CLEANING SPECIFIC SNOMED FORMATTING ISSUES ===")

cleaning_stats = {
      'scientific_notation_converted': 0,
      'sct_annotations_removed': 0,
      'pipe_expressions_extracted': 0,
      'non_breaking_spaces_fixed': 0, 
      'typo_fixed': 0
  }

scientific_notation_map = {
      '9.00000000000465e+17': '900000000000465024', # should be  String (foundation metadata concept), was first incorrectly mapped to 900000000000465024 (no valid id)
      '9.00000000000519e+17': '900000000000519001', # should be Annotation value (foundation metadata concept), was first incorrectly mapped to 900000000000519040 (no valid id)
      '9.11753521000004e+17': '911753521000004000'  # Update with correct value, was first incorrectly mapped to 911753521000003968 (no valid id)
  }, 

for sheet_name, df in obds_data.items():
      for mapper_name, col_idx in mapper_columns.items():
          concepts = df.iloc[:, col_idx]

          for row_idx in concepts.index:
              if pd.notna(concepts[row_idx]):
                  original = str(concepts[row_idx])
                  cleaned = original

                    
                  # Convert Excel scientific notation back to full SNOMED concept IDs
                  # e.g., '9.00000000000465e+17' -> '900000000000465024'
                  if 'e+' in original.lower():
                      if original in scientific_notation_map:
                          cleaned = scientific_notation_map[original]
                          cleaning_stats['scientific_notation_converted'] += 1
                          print(f"  Scientific notation: '{original}' -> '{cleaned}'")
                      else:
                          print(f"  WARNING: Unmapped scientific notation: '{original}'")
                          cleaned = None  # Handle unmapped cases

                  # Remove SNOMED CT system annotations from concept IDs  
                  # e.g., '439401001 (SCT)' -> '439401001'
                  elif '(' in original and original.endswith(')'):
                      concept_id = original.split('(')[0].strip()
                      cleaned = concept_id
                      cleaning_stats['sct_annotations_removed'] += 1
                      print(f"  Removed SCT annotation: '{original}' -> '{cleaned}'")

                  # Extract concept ID from full SNOMED CT expressions
                  # e.g., '16633941000119101 | Radiotherapy declined by patient (situation) |' -> '16633941000119101'
                  elif '|' in original:
                      concept_id = original.split('|')[0].strip()
                      cleaned = concept_id
                      cleaning_stats['pipe_expressions_extracted'] += 1
                      print(f"  Extracted from expression: '{original[:50]}...' -> '{cleaned}'")

                  # Fix non-breaking spaces (Unicode 160) at start of concept IDs
                  # e.g., ' 371932001' (with Unicode 160) -> '371932001'
                  elif len(original) > 0 and ord(original[0]) == 160:
                      concept_id = original[1:].strip()
                      cleaned = concept_id
                      cleaning_stats['non_breaking_spaces_fixed'] += 1
                      print(f"  Fixed non-breaking space: leading char removed from '{len(original)} chars'")
                  # fix original typo of lymph node concept
                  elif original == '44025001':
                      cleaned = '444025001'  # Fix the lost first digit
                      cleaning_stats['typo_fixed'] += 1
                      print(f"  Fixed typo: '{original}' -> '{cleaned}'")

                  # Update the dataframe with the cleaned concept ID
                  if cleaned != original:
                      df.iloc[row_idx, col_idx] = cleaned

print(f"\nCleaning summary:")
for operation, count in cleaning_stats.items():
      print(f"  {operation}: {count} entries")







  # Replace '-' with None across all modules
print(f"\n=== REPLACING '-' WITH MISSING VALUES ===")
total_dashes = 0
for sheet_name, df in obds_data.items():
      for col_idx in mapper_columns.values():
          dashes = (df.iloc[:, col_idx] == '-').sum()
          total_dashes += dashes
          df.iloc[:, col_idx] = df.iloc[:, col_idx].replace('-', None)

print(f"Replaced {total_dashes} dashes with missing values")

=== CLEANING MODUL 11 ===
Original: 19 rows
After removing comments: 15 rows
Removed: 4 comment rows
=== APPLYING CONCEPT FORMAT CLEANING ===
Concept format cleaning applied to dataframes
=== CLEANING SPECIFIC SNOMED FORMATTING ISSUES ===
  Removed SCT annotation: '439401001 (SCT)' -> '439401001'
  Fixed typo: '44025001' -> '444025001'
  Extracted from expression: '16633941000119101 | Radiotherapy declined by patie...' -> '16633941000119101'
  Extracted from expression: '16633941000119101 | Radiotherapy declined by patie...' -> '16633941000119101'

Cleaning summary:
  scientific_notation_converted: 0 entries
  sct_annotations_removed: 1 entries
  pipe_expressions_extracted: 2 entries
  non_breaking_spaces_fixed: 0 entries
  typo_fixed: 1 entries

=== REPLACING '-' WITH MISSING VALUES ===
Replaced 1 dashes with missing values


In [151]:
 # Let's specifically look at the 10 "other" entries that didn't match our patterns
other_entries = []

for entry in invalid_entries:
      # Check if it matches any of our known patterns
      is_float = entry.endswith('.0') and entry[:-2].isdigit()
      has_linebreak = '\n' in entry
      has_whitespace = entry.startswith(' ') or entry.endswith(' ')

      # If it doesn't match any known pattern, it's "other"
      if not (is_float or has_linebreak or has_whitespace):
          other_entries.append(entry)

print(f"The {len(other_entries)} 'other' invalid entries are:")
for entry in other_entries:
      print(f"  '{entry}' (length: {len(entry)})")
      # Show character codes to see any hidden characters
      print(f"    chars: {[ord(c) for c in entry[:20]]}")  # First 20 chars

The 12 'other' invalid entries are:
  '-' (length: 1)
    chars: [45]
  '9.00000000000465e+17' (length: 20)
    chars: [57, 46, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 52, 54, 53, 101, 43, 49, 55]
  'SNOMED-Kodierung für 100 Begriffe' (length: 33)
    chars: [83, 78, 79, 77, 69, 68, 45, 75, 111, 100, 105, 101, 114, 117, 110, 103, 32, 102, 252, 114]
  ' 371932001' (length: 10)
    chars: [160, 51, 55, 49, 57, 51, 50, 48, 48, 49]
  '439401001 (SCT)' (length: 15)
    chars: [52, 51, 57, 52, 48, 49, 48, 48, 49, 32, 40, 83, 67, 84, 41]
  'Für FHIR-Spezifikation: Sophie, Nina' (length: 36)
    chars: [70, 252, 114, 32, 70, 72, 73, 82, 45, 83, 112, 101, 122, 105, 102, 105, 107, 97, 116, 105]
  'Semantisch: Paul, Lotte' (length: 23)
    chars: [83, 101, 109, 97, 110, 116, 105, 115, 99, 104, 58, 32, 80, 97, 117, 108, 44, 32, 76, 111]
  'ISO mitführen' (length: 13)
    chars: [73, 83, 79, 32, 109, 105, 116, 102, 252, 104, 114, 101, 110]
  '9.00000000000519e+17' (length: 20)
    chars: [57, 4

In [152]:
valid_concepts_clean=set()     # Unique valid SNOMED concepts, taken from above and emptied
invalid_entries_clean=set()  # Unique invalid entries for review, tken from above and emptied

for sheet_name, df in obds_data.items():
      for mapper_name, col_idx in mapper_columns.items():
          concepts = df.iloc[:, col_idx].dropna()
          for concept in concepts:
              concept_str = str(concept)  # No stripping - keep original format

              if is_valid_snomed(concept_str):
                  valid_concepts_clean.add(concept_str)
              else:
                  invalid_entries_clean.add(concept_str)

print(f"Valid SNOMED concepts: {len(valid_concepts_clean)} unique")
print(f"Invalid entries: {len(invalid_entries_clean)} unique")
print(f"Sample valid: {sorted(list(valid_concepts_clean))[:5]}")
print(f"Sample invalid: {list(invalid_entries_clean)[:10]}")

Valid SNOMED concepts: 596 unique
Invalid entries: 0 unique
Sample valid: ['10200004', '103165007', '103319000', '103337004', '103338009']
Sample invalid: []


## Extract Mapper Data

Extract SCTID mappings from each mapper for further analysis.

In [153]:
def extract_mapper_data(df, sheet_name):
    """Extract SCTID mappings from each mapper."""
    
    # This function will need to be adjusted based on actual column names
    # For now, we'll create a template structure
    
    mapper_data = {
        'sheet': sheet_name,
        'mappers': {}
    }
    
    # Expected mapper column positions (will need adjustment)
    mapper_positions = {
        'Nina': {'sctid_col': 3, 'fsn_col': 4, 'iso_col': 5},
        'Sophie': {'sctid_col': 6, 'fsn_col': 7, 'iso_col': 8},
        'Paul': {'sctid_col': 9, 'fsn_col': 10, 'iso_col': 11},
        'Lotte': {'sctid_col': 12, 'fsn_col': 13, 'iso_col': 14},
        'Thomas': {'sctid_col': 15, 'fsn_col': 16, 'iso_col': 17}
    }
    
    for mapper_name, positions in mapper_positions.items():
        try:
            sctid_col = positions['sctid_col']
            if sctid_col < len(df.columns):
                sctids = df.iloc[:, sctid_col]
                mapper_data['mappers'][mapper_name] = {
                    'sctids': sctids.tolist(),
                    'non_null_count': sctids.notna().sum(),
                    'unique_codes': sctids.nunique()
                }
            else:
                print(f"Warning: Column index {sctid_col} for {mapper_name} not found in {sheet_name}")
        except Exception as e:
            print(f"Error extracting data for {mapper_name} in {sheet_name}: {e}")
    
    return mapper_data

# Extract mapper data from all sheets
mapper_extracts = {}
for sheet_name, df in obds_data.items():
    mapper_extracts[sheet_name] = extract_mapper_data(df, sheet_name)
    
print("Mapper data extraction completed.")

Mapper data extraction completed.


## Summary Statistics

Display overall project statistics.

In [154]:
if obds_data:
    print("=== oBDS Mapping Project Summary ===")
    print(f"Total modules (sheets): {len(obds_data)}")
    
    total_rows = sum(df.shape[0] for df in obds_data.values())
    total_concepts = sum(df.shape[0] for df in obds_data.values() if df.shape[0] > 0)
    
    print(f"Total concepts to map: {total_concepts}")
    
    print("\nModule breakdown:")
    for sheet_name, df in obds_data.items():
        print(f"  {sheet_name}: {df.shape[0]} concepts")
    
    # Overall data quality
    total_missing = sum(summary['missing_values'] for summary in quality_summary.values())
    total_cells = sum(summary['shape'][0] * summary['shape'][1] for summary in quality_summary.values())
    overall_missing_pct = (total_missing / total_cells) * 100 if total_cells > 0 else 0
    
    print(f"\nOverall missing data: {total_missing} cells ({overall_missing_pct:.1f}%)")
else:
    print("No data loaded. Please check the file path and try again.")

=== oBDS Mapping Project Summary ===
Total modules (sheets): 19
Total concepts to map: 575

Module breakdown:
  Modul 5: 35 concepts
  Modul 6: 26 concepts
  Modul 9: 5 concepts
  Modul 10: 17 concepts
  Modul 11: 15 concepts
  Modul 12: 9 concepts
  Modul 13: 94 concepts
  Modul 14: 202 concepts
  Modul 15: 14 concepts
  Modul 16: 42 concepts
  Modul 17: 41 concepts
  Modul 18: 9 concepts
  Modul 19: 23 concepts
  Modul 20: 9 concepts
  Modul 21: 2 concepts
  Modul 22: 2 concepts
  Modul 23: 14 concepts
  Modul 24: 7 concepts
  Modul 25: 9 concepts

Overall missing data: 3564 cells (24.4%)


## Next Steps

This notebook provides the foundation for importing oBDS mapping data. Next steps could include:

1. **Column Refinement**: Adjust column mappings based on actual data structure
2. **Data Cleaning**: Handle missing values and invalid SNOMED codes
3. **Export for R Analysis**: Prepare data formats for Krippendorff Alpha calculations
4. **Validation**: Check SNOMED code validity against current terminologies
5. **Quality Metrics**: Calculate agreement statistics and identify problematic mappings

Please provide guidance on the actual column structure to refine the data extraction.

In [155]:


 # Export unique concepts list with timestamped filenames
from datetime import datetime

print("=== EXPORTING UNIQUE CONCEPTS FOR VALIDATION ===")

  # Create timestamp for filenames
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

  # Convert set to sorted list
unique_concepts_sorted = sorted(list(valid_concepts_clean))

  # Export as simple text file (one concept per line)
txt_filename = f'obds_unique_concepts_for_validation_{timestamp}.txt'
with open(txt_filename, 'w') as f:
      for concept in unique_concepts_sorted:
          f.write(f"{concept}\n")

print(f"✓ Exported {len(unique_concepts_sorted)} unique concepts")
print(f"✓ File: {txt_filename}")

  # Also export as CSV with metadata
csv_filename = f'obds_unique_concepts_metadata_{timestamp}.csv'
import pandas as pd

concepts_df = pd.DataFrame({
      'concept_id': unique_concepts_sorted,
      'length': [len(c) for c in unique_concepts_sorted],
      'is_extension': [len(c) > 10 and '1000' in c[-8:] for c in unique_concepts_sorted]
  })

concepts_df.to_csv(csv_filename, index=False)
print(f"✓ Also exported: {csv_filename}")
print(f"✓ Timestamp: {timestamp}")
print(f"✓ Sample concepts: {unique_concepts_sorted[:5]}")

=== EXPORTING UNIQUE CONCEPTS FOR VALIDATION ===
✓ Exported 596 unique concepts
✓ File: obds_unique_concepts_for_validation_20250804_150736.txt
✓ Also exported: obds_unique_concepts_metadata_20250804_150736.csv
✓ Timestamp: 20250804_150736
✓ Sample concepts: ['10200004', '103165007', '103319000', '103337004', '103338009']


In [156]:
import json
from datetime import datetime

  # Export obds_data as JSON file with timestamp       
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S") 
json_filename = f'obds_cleaned_data_{timestamp}.json'

print("=== EXPORTING OBDS DATA AS JSON ===")

  # Convert DataFrames to JSON-serializable format
obds_json = {}
for sheet_name, df in obds_data.items():
      # Convert DataFrame to dict with proper handling of NaN values
      obds_json[sheet_name] = df.to_dict('records')  # Each row as a dictionary

  # Export to JSON file
with open(json_filename, 'w', encoding='utf-8') as f:
      json.dump(obds_json, f, indent=2, ensure_ascii=False, default=str)

print(f"Before: {len(obds_data)} modules in memory")
print(f"After: Exported to {json_filename}")
print(f"File contains: {sum(len(data) for data in obds_json.values())} total rows")

=== EXPORTING OBDS DATA AS JSON ===
Before: 19 modules in memory
After: Exported to obds_cleaned_data_20250804_150736.json
File contains: 575 total rows
