# SD_Study Data Files Scanner and Analyzer
In the drug registration process, SD files (short for Study Data files) refer to electronic data submissions that contain structured datasets from nonclinical and clinical studies. These datasets are prepared following standardized data models to allow regulatory agencies (like the FDA, EMA, or local authorities) to efficiently review, validate, and analyze the study results.

### Definition
**SD files** = Study Data Files — electronic data packages containing the raw and tabulated data from:
**Nonclinical studies** (toxicology, pharmacology)
**Clinical studies** (efficacy, safety, pharmacokinetics, etc.)

### CDISC standards
- SDTM (Study Data Tabulation Model) – standardized structure for tabulated data.
- ADaM (Analysis Data Model) – datasets used for statistical analysis.
- SEND (Standard for Exchange of Nonclinical Data) – for preclinical (animal) data.

This notebook scans the current directory for files and analyzes specific file types using appropriate libraries:
- **.sdf files**:  Structure Data File (Chemoinformatics) Plain text molecular data format used to store 3D structures, bonds, properties, and metadata - RDKit for chemical structure analysis
- **.xpt files**: SAS XPORT (SDTM Clinical Datasets) Contains clinical study data such as Demographics (DM), Adverse Events (AE), and Laboratory Results (LB) - pyreadstat for SAS transport files
- **.asnt files**: Textual representation of Abstract Syntax Notation One, sed in regulatory metadata, pharma labeling, and bioinformatics standards - Biopython for ASN.1 files

In [1]:
# Install required packages if not already installed
# Uncomment the following lines if you need to install packages

!pip install -q rdkit-pypi
!pip install -q pandas
!pip install -q biopython

In [2]:
import os
import glob
from pathlib import Path

# Scan current directory for files
current_dir = Path('.')
all_files = list(current_dir.rglob('*'))
files_only = [f for f in all_files if f.is_file()]

print(f"Found {len(files_only)} files in the current directory and subdirectories")

# Group files by extension
file_extensions = {}
for file_path in files_only:
    ext = file_path.suffix.lower()
    if ext not in file_extensions:
        file_extensions[ext] = []
    file_extensions[ext].append(file_path)

print("\nFiles by extension:")
for ext, files in file_extensions.items():
    print(f"{ext}: {len(files)} files")
    for file in files[:5]:  # Show first 5 files per extension
        print(f"  - {file}")
    if len(files) > 5:
        print(f"  ... and {len(files) - 5} more")

Found 103 files in the current directory and subdirectories

Files by extension:
.md: 2 files
  - validation_README.md
  - README.md
: 59 files
  - .DS_Store
  - m5/.DS_Store
  - m5/53-clin-stud-reports/.DS_Store
  - m5/53-clin-stud-reports/study1234/.DS_Store
  - m5/53-clin-stud-reports/study1234/datasets/.DS_Store
  ... and 54 more
.csv: 6 files
  - integrity_results.csv
  - SD_Study-Data-files.csv
  - validation_results.csv
  - m5/53-clin-stud-reports/study1234/datasets/datasets/dm.csv
  - m5/53-clin-stud-reports/study1234/datasets/datasets/ae.csv
  ... and 1 more
.ipynb: 2 files
  - SD_Study-Data-files.ipynb
  - SD_Study-Data-files_validated.ipynb
.txt: 1 files
  - validation_report.txt
.pdf: 4 files
  - SD_Study-Data-files.pdf
  - m5/53-clin-stud-reports/study1234/datasets/datasets/annotated-crf.pdf
  - m5/53-clin-stud-reports/study1234/datasets/datasets/study1234-clin-report.pdf
  - m5/53-clin-stud-reports/study1234/datasets/datasets/study1234-sdtm-rg.pdf
.html: 1 files
  - SD_St

## Validation Functions

In [3]:
# Validation functions based on regulatory requirements
import chardet
import os

def validate_xpt_file(file_path):
    """Validate XPT file according to regulatory requirements."""
    validation_results = {
        'File': str(file_path.name),
        'Format Check': 'PASS',
        'CDISC Compliance': 'PASS',
        'Required Variables': 'PASS',
        'Data Integrity': 'PASS',
        'Issues': []
    }
    
    try:
        import pyreadstat
        df, meta = pyreadstat.read_xport(str(file_path))
        
        # Check format version
        if meta.file_format != 'XPORT':
            validation_results['Format Check'] = 'FAIL'
            validation_results['Issues'].append(f'Not XPORT format: {meta.file_format}')
        
        # Check for required variables
        required_vars = ['STUDYID', 'USUBJID']
        missing_vars = [var for var in required_vars if var not in df.columns]
        if missing_vars:
            validation_results['Required Variables'] = 'FAIL'
            validation_results['Issues'].append(f'Missing required variables: {missing_vars}')
        
        # Check for nulls in required fields
        for var in required_vars:
            if var in df.columns and df[var].isnull().any():
                validation_results['Data Integrity'] = 'FAIL'
                validation_results['Issues'].append(f'Null values in {var}')
                break
        
        # Check for --SEQ if present
        seq_cols = [col for col in df.columns if '--SEQ' in col.upper()]
        for seq_col in seq_cols:
            if df[seq_col].isnull().any():
                validation_results['Data Integrity'] = 'FAIL'
                validation_results['Issues'].append(f'Null values in {seq_col}')
                break
        
    except Exception as e:
        validation_results['Format Check'] = 'ERROR'
        validation_results['Issues'].append(f'Error reading file: {str(e)}')
    
    return validation_results

def validate_sdf_file(file_path):
    """Validate SDF file according to regulatory requirements."""
    validation_results = {
        'File': str(file_path.name),
        'Structure Check': 'PASS',
        'Molecule Count': 'PASS',
        'Property Blocks': 'PASS',
        'Connectivity': 'PASS',
        'Issues': []
    }
    
    try:
        from rdkit import Chem
        
        with open(str(file_path), 'r') as f:
            content = f.read()
        
        # Check for proper SDF structure (molecules separated by $$$$)
        molecules = content.split('$$$$')
        molecules = [mol.strip() for mol in molecules if mol.strip()]
        
        if not molecules:
            validation_results['Structure Check'] = 'FAIL'
            validation_results['Issues'].append('No molecules found')
            return validation_results
        
        # Validate each molecule
        valid_molecules = 0
        for i, mol_block in enumerate(molecules):
            if not mol_block.strip():
                continue
            
            # Try to parse with RDKit
            mol = Chem.MolFromMolBlock(mol_block)
            if mol is None:
                validation_results['Connectivity'] = 'FAIL'
                validation_results['Issues'].append(f'Molecule {i+1}: Invalid structure')
            else:
                valid_molecules += 1
                
                # Check atom/bond counts
                if mol.GetNumAtoms() == 0:
                    validation_results['Connectivity'] = 'FAIL'
                    validation_results['Issues'].append(f'Molecule {i+1}: No atoms')
                
                # Check for property blocks
                if '> ' not in mol_block:
                    validation_results['Property Blocks'] = 'WARN'
                    validation_results['Issues'].append(f'Molecule {i+1}: No property blocks found')
        
        if valid_molecules == 0:
            validation_results['Molecule Count'] = 'FAIL'
            validation_results['Issues'].append('No valid molecules')
        
    except Exception as e:
        validation_results['Structure Check'] = 'ERROR'
        validation_results['Issues'].append(f'Error reading file: {str(e)}')
    
    return validation_results

def validate_asnt_file(file_path):
    """Validate ASNT file according to regulatory requirements."""
    validation_results = {
        'File': str(file_path.name),
        'ASN.1 Structure': 'PASS',
        'Encoding': 'PASS',
        'Schema Compliance': 'PASS',
        'Mandatory Fields': 'PASS',
        'Issues': []
    }
    
    try:
        # Check encoding
        with open(str(file_path), 'rb') as f:
            raw_data = f.read()
        
        detected_encoding = chardet.detect(raw_data)
        encoding = detected_encoding.get('encoding', 'unknown')
        
        if encoding not in ['utf-8', 'ascii', 'UTF-8', 'ASCII']:
            validation_results['Encoding'] = 'WARN'
            validation_results['Issues'].append(f'Encoding {encoding} may not be compliant')
        
        # Try to decode
        try:
            content = raw_data.decode('utf-8')
        except UnicodeDecodeError:
            try:
                content = raw_data.decode('ascii')
            except UnicodeDecodeError:
                validation_results['Encoding'] = 'FAIL'
                validation_results['Issues'].append('Cannot decode file content')
                return validation_results
        
        # Check if XML
        if content.startswith('<?xml'):
            import xml.etree.ElementTree as ET
            try:
                root = ET.fromstring(content)
                validation_results['ASN.1 Structure'] = 'PASS (XML)'
                
                # Check for mandatory fields in assessment
                mandatory_fields = ['StudyID', 'Reviewer', 'AssessmentDate']
                missing_fields = []
                for field in mandatory_fields:
                    if not root.find(field) or not root.find(field).text:
                        missing_fields.append(field)
                
                if missing_fields:
                    validation_results['Mandatory Fields'] = 'FAIL'
                    validation_results['Issues'].append(f'Missing mandatory fields: {missing_fields}')
                
            except ET.ParseError as e:
                validation_results['Schema Compliance'] = 'FAIL'
                validation_results['Issues'].append(f'Invalid XML: {str(e)}')
        else:
            # Assume ASN.1 text format
            validation_results['ASN.1 Structure'] = 'PASS (Text)'
            # Basic checks for ASN.1 structure
            if '::=' not in content:
                validation_results['Schema Compliance'] = 'WARN'
                validation_results['Issues'].append('No ASN.1 definitions found')
    
    except Exception as e:
        validation_results['ASN.1 Structure'] = 'ERROR'
        validation_results['Issues'].append(f'Error reading file: {str(e)}')
    
    return validation_results

def validate_file_integrity(file_path):
    """General file integrity checks."""
    integrity_results = {
        'File': str(file_path.name),
        'File Size': 'PASS',
        'File Exists': 'PASS',
        'Readable': 'PASS',
        'Issues': []
    }
    
    try:
        # Check file exists
        if not file_path.exists():
            integrity_results['File Exists'] = 'FAIL'
            integrity_results['Issues'].append('File does not exist')
            return integrity_results
        
        # Check file size (FDA eCTD limit is typically 100MB per file)
        size_mb = file_path.stat().st_size / (1024 * 1024)
        if size_mb > 100:
            integrity_results['File Size'] = 'WARN'
            integrity_results['Issues'].append(f'File size {size_mb:.2f}MB exceeds typical limits')
        
        # Check readability
        try:
            with open(str(file_path), 'rb') as f:
                f.read(1024)  # Read first 1KB
        except Exception as e:
            integrity_results['Readable'] = 'FAIL'
            integrity_results['Issues'].append(f'File not readable: {str(e)}')
    
    except Exception as e:
        integrity_results['Readable'] = 'ERROR'
        integrity_results['Issues'].append(f'Error checking file: {str(e)}')
    
    return integrity_results

## SDF Files Analysis (RDKit)

In [4]:
# Analyze SDF files using RDKit
try:
    from rdkit import Chem
    from rdkit.Chem import Descriptors
    
    sdf_files = file_extensions.get('.sdf', [])
    if sdf_files:
        print(f"Found {len(sdf_files)} SDF files")
        
        for sdf_file in sdf_files:
            print(f"\nAnalyzing {sdf_file}:")
            
            # Read SDF file
            suppl = Chem.SDMolSupplier(str(sdf_file))
            molecules = [mol for mol in suppl if mol is not None]
            
            print(f"  - Number of molecules: {len(molecules)}")
            
            if molecules:
                # Analyze first molecule as example
                mol = molecules[0]
                print(f"  - First molecule: {mol.GetProp('_Name') if mol.HasProp('_Name') else 'Unnamed'}")
                print(f"  - Molecular weight: {Descriptors.MolWt(mol):.2f}")
                print(f"  - Number of atoms: {mol.GetNumAtoms()}")
                print(f"  - Number of bonds: {mol.GetNumBonds()}")
                print(f"  - SMILES: {Chem.MolToSmiles(mol)}")
    else:
        print("No SDF files found")
        
except ImportError:
    print("RDKit not installed. Install with: pip install rdkit-pypi")
except Exception as e:
    print(f"Error analyzing SDF files: {e}")

Found 3 SDF files

Analyzing m5/53-clin-stud-reports/study1234/datasets/datasets/compound.sdf:
  - Number of molecules: 0

Analyzing m5/53-clin-stud-reports/study1234/datasets/datasets/Structure2D_COMPOUND_CID_197365.sdf:
  - Number of molecules: 1
  - First molecule: 197365
  - Molecular weight: 699.99
  - Number of atoms: 44
  - Number of bonds: 46
  - SMILES: CN1CCN(c2ccc3nc(-c4ccc5nc(CCCc6ccc(N(CCCl)CCCl)cc6)[nH]c5c4)[nH]c3c2)CC1.Cl.Cl.Cl

Analyzing m5/53-clin-stud-reports/study1234/datasets/datasets/Conformer3D_COMPOUND_CID_197366.sdf:
  - Number of molecules: 1
  - First molecule: 197366
  - Molecular weight: 590.60
  - Number of atoms: 41
  - Number of bonds: 46
  - SMILES: CN1CCN(c2ccc3nc(-c4ccc5nc(CCCc6ccc(N(CCCl)CCCl)cc6)[nH]c5c4)[nH]c3c2)CC1


[17:20:08] ERROR: Atom line too short: '   -0.0015    1.2095    0.0000 C' on line 5
[17:20:08] ERROR: moving to the beginning of the next molecule


## XPT Files Analysis (pyreadstat)

In [5]:
# Analyze XPT files using pyreadstat
try:
    import pyreadstat
    
    xpt_files = file_extensions.get('.xpt', [])
    if xpt_files:
        print(f"Found {len(xpt_files)} XPT files")
        
        for xpt_file in xpt_files:
            print(f"\nAnalyzing {xpt_file}:")
            
            try:
                # Read XPT file using pyreadstat
                df, meta = pyreadstat.read_xport(str(xpt_file))
                
                print(f"  - Shape: {df.shape}")
                print(f"  - Columns: {list(df.columns)}")
                print(f"  - First 5 rows:\n{df.head()}")
                
                # Basic statistics
                print(f"  - Summary statistics:\n{df.describe()}")
                
            except Exception as file_error:
                print(f"  - Error reading {xpt_file}: {file_error}")
                
    else:
        print("No XPT files found")
        
except ImportError as e:
    print(f"Required packages not installed: {e}")
    print("Install with: pip install pyreadstat")
except Exception as e:
    print(f"Error analyzing XPT files: {e}")

Found 3 XPT files

Analyzing m5/53-clin-stud-reports/study1234/datasets/datasets/lb.xpt:
  - Shape: (2, 5)
  - Columns: ['STUDYID', 'USUBJID', 'LBTEST', 'LBSTRESN', 'LBSTRESU']
  - First 5 rows:
  STUDYID  USUBJID      LBTEST  LBSTRESN LBSTRESU
0  ABC123  SUBJ001  Hemoglobin      14.2     g/dL
1  ABC123  SUBJ002     Glucose      88.0    mg/dL
  - Summary statistics:
       LBSTRESN
count   2.00000
mean   51.10000
std    52.18448
min    14.20000
25%    32.65000
50%    51.10000
75%    69.55000
max    88.00000

Analyzing m5/53-clin-stud-reports/study1234/datasets/datasets/ae.xpt:
  - Shape: (2, 5)
  - Columns: ['STUDYID', 'USUBJID', 'AETERM', 'AESEV', 'AEREL']
  - First 5 rows:
  STUDYID  USUBJID    AETERM     AESEV      AEREL
0  ABC123  SUBJ001  Headache      MILD    RELATED
1  ABC123  SUBJ002    Nausea  MODERATE  UNRELATED
  - Summary statistics:
       STUDYID  USUBJID    AETERM AESEV    AEREL
count        2        2         2     2        2
unique       1        2         2     2     

## ASNT Files Analysis (Biopython)

In [6]:
# Analyze ASNT files using Biopython and XML parsing
try:
    from Bio import SeqIO
    import xml.etree.ElementTree as ET
    
    def read_asnt(asnt_file):
        # Check if it's XML
        with open(asnt_file, 'r') as f:
            first_line = f.readline()
            if first_line.startswith('<?xml'):
                # It's XML, parse as XML
                tree = ET.parse(asnt_file)
                root = tree.getroot()
                return [root]  # Return as list with one element
            else:
                # Try as ASN.1 sequence
                records = list(SeqIO.parse(asnt_file, "genbank"))
                return records
    
    asnt_files = file_extensions.get('.asnt', [])
    if asnt_files:
        print(f"Found {len(asnt_files)} ASNT files")
        
        for asnt_file in asnt_files:
            print(f"\nAnalyzing {asnt_file}:")
            
            try:
                records = read_asnt(str(asnt_file))
                print(f"  - Number of records: {len(records)}")
                
                for i, record in enumerate(records[:3]):  # Show first 3 records
                    if hasattr(record, 'id'):  # BioPython record
                        print(f"  - Record {i+1}: {record.id}")
                        print(f"    Description: {record.description}")
                        print(f"    Sequence length: {len(record.seq)}")
                        print(f"    Sequence type: {record.seq.alphabet}")
                    else:  # XML element
                        print(f"  - XML Root: {record.tag}")
                        for child in record:
                            print(f"    {child.tag}: {child.text}")
                print("---")
                        
            except Exception as e:
                print(f"  - Error reading {asnt_file}: {e}")
                # Fallback: read as text
                try:
                    with open(asnt_file, 'r') as f:
                        content = f.read()
                        print(f"  - File size: {len(content)} characters")
                        print(f"  - First 500 characters:\n{content[:500]}...")
                except Exception as e2:
                    print(f"  - Error reading as text: {e2}")
    else:
        print("No ASNT files found")
        
except ImportError:
    print("Biopython not installed. Install with: pip install biopython")
except Exception as e:
    print(f"Error analyzing ASNT files: {e}")

Found 3 ASNT files

Analyzing m5/53-clin-stud-reports/study1234/datasets/datasets/Conformer3D_COMPOUND_CID_197366.asnt:
  - Number of records: 0
---

Analyzing m5/53-clin-stud-reports/study1234/datasets/datasets/Structure2D_COMPOUND_CID_197365.asnt:
  - Number of records: 0
---

Analyzing m5/53-clin-stud-reports/study1234/datasets/datasets/assessment.asnt:
  - Number of records: 1
  - XML Root: AssessmentTemplate
    StudyID: STUDY1234
    Reviewer: Dr. Smith
    AssessmentDate: 2025-11-05
    Findings: 
    
    Recommendation: Approved for next review phase
---


## Summary

In [7]:
# Summary of analysis
print("=== ANALYSIS SUMMARY ===")
print(f"Total files scanned: {len(files_only)}")
print(f"File extensions found: {len(file_extensions)}")

# Check specific file types
special_types = ['.sdf', '.xpt', '.asnt']
for ext in special_types:
    count = len(file_extensions.get(ext, []))
    print(f"{ext.upper()} files: {count}")

print("\nAnalysis complete!")

=== ANALYSIS SUMMARY ===
Total files scanned: 103
File extensions found: 15
.SDF files: 3
.XPT files: 3
.ASNT files: 3

Analysis complete!


## Regulatory Validation Report

In [8]:
# Run comprehensive validation checks based on regulatory requirements
import pandas as pd
from datetime import datetime

# Collect all validation results
validation_report = []

# Validate XPT files
xpt_files = file_extensions.get('.xpt', [])
for file_path in xpt_files:
    result = validate_xpt_file(file_path)
    validation_report.append(result)

# Validate SDF files
sdf_files = file_extensions.get('.sdf', [])
for file_path in sdf_files:
    result = validate_sdf_file(file_path)
    validation_report.append(result)

# Validate ASNT files
asnt_files = file_extensions.get('.asnt', [])
for file_path in asnt_files:
    result = validate_asnt_file(file_path)
    validation_report.append(result)

# Validate file integrity for all files
all_target_files = xpt_files + sdf_files + asnt_files
integrity_report = []
for file_path in all_target_files:
    result = validate_file_integrity(file_path)
    integrity_report.append(result)

# Create validation summary DataFrame
validation_df = pd.DataFrame(validation_report)
integrity_df = pd.DataFrame(integrity_report)

# Display validation results
print("=== REGULATORY VALIDATION REPORT ===")
print(f"Report generated on: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print(f"Total files validated: {len(validation_report)}")
print()

# Summary statistics
total_checks = len(validation_df) * (len(validation_df.columns) - 2)  # Exclude 'File' and 'Issues'
pass_count = 0
fail_count = 0
warn_count = 0
error_count = 0

for _, row in validation_df.iterrows():
    for col in validation_df.columns:
        if col not in ['File', 'Issues']:
            status = str(row[col]).upper()
            if 'PASS' in status:
                pass_count += 1
            elif 'FAIL' in status:
                fail_count += 1
            elif 'WARN' in status:
                warn_count += 1
            elif 'ERROR' in status:
                error_count += 1

print(f"Validation Summary:")
print(f"  PASS: {pass_count}")
print(f"  FAIL: {fail_count}")
print(f"  WARN: {warn_count}")
print(f"  ERROR: {error_count}")
print(f"  Overall Compliance: {'PASS' if fail_count == 0 and error_count == 0 else 'REVIEW REQUIRED'}")
print()

# Display detailed validation results
print("Detailed Validation Results:")
for _, row in validation_df.iterrows():
    print(f"\nFile: {row['File']}")
    for col in validation_df.columns:
        if col not in ['File', 'Issues']:
            status = row[col]
            print(f"  {col}: {status}")
    if row['Issues']:
        print(f"  Issues: {', '.join(row['Issues'])}")

print("\n=== FILE INTEGRITY CHECKS ===")
for _, row in integrity_df.iterrows():
    print(f"\nFile: {row['File']}")
    for col in integrity_df.columns:
        if col != 'File':
            status = row[col]
            print(f"  {col}: {status}")
    if row['Issues']:
        print(f"  Issues: {', '.join(row['Issues'])}")

# Save validation report to file
validation_report_filename = 'validation_report.txt'
with open(validation_report_filename, 'w') as f:
    f.write("=== REGULATORY VALIDATION REPORT ===\n")
    f.write(f"Report generated on: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
    f.write(f"Total files validated: {len(validation_report)}\n\n")
    f.write(f"Validation Summary:\n")
    f.write(f"  PASS: {pass_count}\n")
    f.write(f"  FAIL: {fail_count}\n")
    f.write(f"  WARN: {warn_count}\n")
    f.write(f"  ERROR: {error_count}\n")
    f.write(f"  Overall Compliance: {'PASS' if fail_count == 0 and error_count == 0 else 'REVIEW REQUIRED'}\n\n")
    
    f.write("Detailed Validation Results:\n")
    for _, row in validation_df.iterrows():
        f.write(f"\nFile: {row['File']}\n")
        for col in validation_df.columns:
            if col not in ['File', 'Issues']:
                status = row[col]
                f.write(f"  {col}: {status}\n")
        if row['Issues']:
            f.write(f"  Issues: {', '.join(row['Issues'])}\n")
    
    f.write("\n=== FILE INTEGRITY CHECKS ===\n")
    for _, row in integrity_df.iterrows():
        f.write(f"\nFile: {row['File']}\n")
        for col in integrity_df.columns:
            if col != 'File':
                status = row[col]
                f.write(f"  {col}: {status}\n")
        if row['Issues']:
            f.write(f"  Issues: {', '.join(row['Issues'])}\n")

print(f"\nValidation report saved to: {validation_report_filename}")

# Save validation data to CSV for further analysis
validation_csv_filename = 'validation_results.csv'
validation_df.to_csv(validation_csv_filename, index=False)
integrity_df.to_csv('integrity_results.csv', index=False)
print(f"Validation results saved to: {validation_csv_filename}")
print(f"Integrity results saved to: integrity_results.csv")

=== REGULATORY VALIDATION REPORT ===
Report generated on: 2025-11-05 17:20:11
Total files validated: 9

Validation Summary:
  PASS: 30
  FAIL: 6
  WARN: 0
  ERROR: 0
  Overall Compliance: REVIEW REQUIRED

Detailed Validation Results:

File: lb.xpt
  Format Check: FAIL
  CDISC Compliance: PASS
  Required Variables: PASS
  Data Integrity: PASS
  Structure Check: nan
  Molecule Count: nan
  Property Blocks: nan
  Connectivity: nan
  ASN.1 Structure: nan
  Encoding: nan
  Schema Compliance: nan
  Mandatory Fields: nan
  Issues: Not XPORT format: xport

File: ae.xpt
  Format Check: FAIL
  CDISC Compliance: PASS
  Required Variables: PASS
  Data Integrity: PASS
  Structure Check: nan
  Molecule Count: nan
  Property Blocks: nan
  Connectivity: nan
  ASN.1 Structure: nan
  Encoding: nan
  Schema Compliance: nan
  Mandatory Fields: nan
  Issues: Not XPORT format: xport

File: dm.xpt
  Format Check: FAIL
  CDISC Compliance: PASS
  Required Variables: PASS
  Data Integrity: PASS
  Structure Chec

[17:20:11] Atom line too short: '   -0.0015    1.2095    0.0000 C' on line 5


## Enhanced Comprehensive File Summary Table with Detailed Content Columns

In [9]:
# Create enhanced summary table with detailed content columns
import pandas as pd

# Helper functions to extract detailed file information
def analyze_sdf_file(file_path):
    """Analyze SDF file and return detailed information."""
    try:
        from rdkit import Chem
        from rdkit.Chem import Descriptors
        
        suppl = Chem.SDMolSupplier(str(file_path))
        molecules = [mol for mol in suppl if mol is not None]
        
        if molecules:
            mol = molecules[0]
            return {
                'Number of molecules': len(molecules),
                'First molecule': mol.GetProp('_Name') if mol.HasProp('_Name') else 'Unnamed',
                'Molecular weight': f"{Descriptors.MolWt(mol):.2f}",
                'Number of atoms': mol.GetNumAtoms(),
                'Number of bonds': mol.GetNumBonds(),
                'SMILES': Chem.MolToSmiles(mol)
            }
        else:
            return {
                'Number of molecules': 0,
                'First molecule': 'N/A',
                'Molecular weight': 'N/A',
                'Number of atoms': 'N/A',
                'Number of bonds': 'N/A',
                'SMILES': 'N/A'
            }
    except Exception as e:
        return {
            'Number of molecules': 'Error',
            'First molecule': f'Error: {str(e)}',
            'Molecular weight': 'Error',
            'Number of atoms': 'Error',
            'Number of bonds': 'Error',
            'SMILES': 'Error'
        }

def analyze_xpt_file(file_path):
    """Analyze XPT file and return detailed information."""
    try:
        import pyreadstat
        
        df, meta = pyreadstat.read_xport(str(file_path))
        
        # Get first 5 rows as formatted string
        first_rows = df.head().to_string(index=True)
        
        # Get summary statistics
        stats = df.describe().to_string()
        
        return {
            'Shape': str(df.shape),
            'Columns': str(list(df.columns)),
            'First 5 rows': first_rows,
            'Summary statistics': stats
        }
    except Exception as e:
        return {
            'Shape': f'Error: {str(e)}',
            'Columns': 'Error',
            'First 5 rows': 'Error',
            'Summary statistics': 'Error'
        }

def analyze_asnt_file(file_path):
    """Analyze ASNT file and return detailed information."""
    try:
        from Bio import SeqIO
        import xml.etree.ElementTree as ET
        
        # Check if it's XML
        with open(file_path, 'r') as f:
            first_line = f.readline()
            
        if first_line.startswith('<?xml'):
            # It's XML, parse as XML
            tree = ET.parse(str(file_path))
            root = tree.getroot()
            records = [root]
        else:
            # Try as ASN.1 sequence
            records = list(SeqIO.parse(str(file_path), "genbank"))
        
        if records:
            record = records[0]
            if hasattr(record, 'id'):  # BioPython record
                return {
                    'Number of records': len(records),
                    'XML Root': 'N/A (ASN.1 format)',
                    'StudyID': 'N/A',
                    'Reviewer': 'N/A',
                    'AssessmentDate': 'N/A',
                    'Findings': 'N/A',
                    'Recommendation': 'N/A'
                }
            else:  # XML element
                # Extract XML data
                xml_data = {}
                for child in record:
                    xml_data[child.tag] = child.text or ''
                
                return {
                    'Number of records': len(records),
                    'XML Root': record.tag,
                    'StudyID': xml_data.get('StudyID', ''),
                    'Reviewer': xml_data.get('Reviewer', ''),
                    'AssessmentDate': xml_data.get('AssessmentDate', ''),
                    'Findings': xml_data.get('Findings', ''),
                    'Recommendation': xml_data.get('Recommendation', '')
                }
        else:
            return {
                'Number of records': 0,
                'XML Root': 'N/A',
                'StudyID': 'N/A',
                'Reviewer': 'N/A',
                'AssessmentDate': 'N/A',
                'Findings': 'N/A',
                'Recommendation': 'N/A'
            }
    except Exception as e:
        return {
            'Number of records': f'Error: {str(e)}',
            'XML Root': 'Error',
            'StudyID': 'Error',
            'Reviewer': 'Error',
            'AssessmentDate': 'Error',
            'Findings': 'Error',
            'Recommendation': 'Error'
        }

# Filter for only the three file types
target_extensions = ['.xpt', '.sdf', '.asnt']

# Create summary data with detailed columns
summary_data = []
for ext in target_extensions:
    if ext in file_extensions:
        files = file_extensions[ext]
        
        for file_path in files:
            file_name = file_path.name
            
            # Initialize row data
            row = {
                'File Name': file_name,
                'Extension': ext.upper(),
                'File Path': str(file_path)
            }
            
            # Add file-specific detailed columns
            if ext == '.sdf':
                sdf_info = analyze_sdf_file(file_path)
                row.update({
                    'Number of molecules': sdf_info['Number of molecules'],
                    'First molecule': sdf_info['First molecule'],
                    'Molecular weight': sdf_info['Molecular weight'],
                    'Number of atoms': sdf_info['Number of atoms'],
                    'Number of bonds': sdf_info['Number of bonds'],
                    'SMILES': sdf_info['SMILES']
                })
                # Add empty columns for other file types
                row.update({
                    'Shape': '',
                    'Columns': '',
                    'First 5 rows': '',
                    'Summary statistics': '',
                    'Number of records': '',
                    'XML Root': '',
                    'StudyID': '',
                    'Reviewer': '',
                    'AssessmentDate': '',
                    'Findings': '',
                    'Recommendation': ''
                })
            
            elif ext == '.xpt':
                xpt_info = analyze_xpt_file(file_path)
                row.update({
                    'Shape': xpt_info['Shape'],
                    'Columns': xpt_info['Columns'],
                    'First 5 rows': xpt_info['First 5 rows'],
                    'Summary statistics': xpt_info['Summary statistics']
                })
                # Add empty columns for other file types
                row.update({
                    'Number of molecules': '',
                    'First molecule': '',
                    'Molecular weight': '',
                    'Number of atoms': '',
                    'Number of bonds': '',
                    'SMILES': '',
                    'Number of records': '',
                    'XML Root': '',
                    'StudyID': '',
                    'Reviewer': '',
                    'AssessmentDate': '',
                    'Findings': '',
                    'Recommendation': ''
                })
            
            elif ext == '.asnt':
                asnt_info = analyze_asnt_file(file_path)
                row.update({
                    'Number of records': asnt_info['Number of records'],
                    'XML Root': asnt_info['XML Root'],
                    'StudyID': asnt_info['StudyID'],
                    'Reviewer': asnt_info['Reviewer'],
                    'AssessmentDate': asnt_info['AssessmentDate'],
                    'Findings': asnt_info['Findings'],
                    'Recommendation': asnt_info['Recommendation']
                })
                # Add empty columns for other file types
                row.update({
                    'Number of molecules': '',
                    'First molecule': '',
                    'Molecular weight': '',
                    'Number of atoms': '',
                    'Number of bonds': '',
                    'SMILES': '',
                    'Shape': '',
                    'Columns': '',
                    'First 5 rows': '',
                    'Summary statistics': ''
                })
            
            summary_data.append(row)

# Create DataFrame and display
summary_df = pd.DataFrame(summary_data)
summary_df

[17:20:11] ERROR: Atom line too short: '   -0.0015    1.2095    0.0000 C' on line 5
[17:20:11] ERROR: moving to the beginning of the next molecule


Unnamed: 0,File Name,Extension,File Path,Shape,Columns,First 5 rows,Summary statistics,Number of molecules,First molecule,Molecular weight,Number of atoms,Number of bonds,SMILES,Number of records,XML Root,StudyID,Reviewer,AssessmentDate,Findings,Recommendation
0,lb.xpt,.XPT,m5/53-clin-stud-reports/study1234/datasets/dat...,"(2, 5)","['STUDYID', 'USUBJID', 'LBTEST', 'LBSTRESN', '...",STUDYID USUBJID LBTEST LBSTRESN LBSTR...,LBSTRESN\ncount 2.00000\nmean 51.10...,,,,,,,,,,,,,
1,ae.xpt,.XPT,m5/53-clin-stud-reports/study1234/datasets/dat...,"(2, 5)","['STUDYID', 'USUBJID', 'AETERM', 'AESEV', 'AER...",STUDYID USUBJID AETERM AESEV AE...,STUDYID USUBJID AETERM AESEV AER...,,,,,,,,,,,,,
2,dm.xpt,.XPT,m5/53-clin-stud-reports/study1234/datasets/dat...,"(2, 4)","['STUDYID', 'USUBJID', 'AGE', 'SEX']",STUDYID USUBJID AGE SEX\n0 ABC123 SUBJ0...,AGE\ncount 2.000000\nmean 48....,,,,,,,,,,,,,
3,compound.sdf,.SDF,m5/53-clin-stud-reports/study1234/datasets/dat...,,,,,0.0,,,,,,,,,,,,
4,Structure2D_COMPOUND_CID_197365.sdf,.SDF,m5/53-clin-stud-reports/study1234/datasets/dat...,,,,,1.0,197365.0,699.99,44.0,46.0,CN1CCN(c2ccc3nc(-c4ccc5nc(CCCc6ccc(N(CCCl)CCCl...,,,,,,,
5,Conformer3D_COMPOUND_CID_197366.sdf,.SDF,m5/53-clin-stud-reports/study1234/datasets/dat...,,,,,1.0,197366.0,590.6,41.0,46.0,CN1CCN(c2ccc3nc(-c4ccc5nc(CCCc6ccc(N(CCCl)CCCl...,,,,,,,
6,Conformer3D_COMPOUND_CID_197366.asnt,.ASNT,m5/53-clin-stud-reports/study1234/datasets/dat...,,,,,,,,,,,0.0,,,,,,
7,Structure2D_COMPOUND_CID_197365.asnt,.ASNT,m5/53-clin-stud-reports/study1234/datasets/dat...,,,,,,,,,,,0.0,,,,,,
8,assessment.asnt,.ASNT,m5/53-clin-stud-reports/study1234/datasets/dat...,,,,,,,,,,,1.0,AssessmentTemplate,STUDY1234,Dr. Smith,2025-11-05,\n,Approved for next review phase


## Save Summary Table to CSV

In [10]:
# Save the summary table to CSV
csv_filename = 'SD_Study-Data-files.csv'
summary_df.to_csv(csv_filename, index=False)
print(f"Summary table saved to {csv_filename}")
print(f"CSV file contains {len(summary_df)} rows and {len(summary_df.columns)} columns")
print(f"Columns: {list(summary_df.columns)}")

Summary table saved to SD_Study-Data-files.csv
CSV file contains 9 rows and 20 columns
Columns: ['File Name', 'Extension', 'File Path', 'Shape', 'Columns', 'First 5 rows', 'Summary statistics', 'Number of molecules', 'First molecule', 'Molecular weight', 'Number of atoms', 'Number of bonds', 'SMILES', 'Number of records', 'XML Root', 'StudyID', 'Reviewer', 'AssessmentDate', 'Findings', 'Recommendation']
