# Elliptic Bitcoin Dataset - Data Quality Check

**Objective:** Assess the data quality of the three Elliptic Bitcoin dataset files.

**Analysis Focus:**
1. Missing values
2. Duplicate records
3. Data type mismatches
4. Cross-dataset transaction ID consistency

**Datasets:**
- `elliptic_txs_classes.csv` - Transaction labels (illicit/licit/unknown)
- `elliptic_txs_edgelist.csv` - Transaction graph edges
- `elliptic_txs_features.csv` - Transaction features (167 columns: `txId`, a time step, and 165 further features; 166 features when the time step is counted)
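
The per-file checks listed above (missing values, duplicates, data types) are repeated for each dataset below, so they could be bundled into one helper. The following is a minimal sketch rather than part of the original notebook: `quality_report`, its parameters, and the toy DataFrame are hypothetical names introduced for illustration.

```python
import pandas as pd


def quality_report(df: pd.DataFrame, key_cols: list[str], numeric_cols: list[str]) -> dict:
    """Summarise missing values, duplicate keys, and non-numeric columns.

    Hypothetical helper, not part of the original notebook.
    """
    return {
        # Total number of NaN/None cells across the whole frame
        "missing": int(df.isnull().sum().sum()),
        # Rows whose key columns repeat an earlier row
        "duplicates": int(df.duplicated(subset=key_cols).sum()),
        # Columns expected to be numeric that are not
        "non_numeric": [c for c in numeric_cols
                        if not pd.api.types.is_numeric_dtype(df[c])],
    }


# Toy data standing in for a real file: one missing label, one repeated txId
toy = pd.DataFrame({"txId": [1, 2, 2, 3], "class": ["1", "2", "2", None]})
report = quality_report(toy, key_cols=["txId"], numeric_cols=["txId"])
print(report)
```

For the edgelist, both columns would presumably be passed as `key_cols`, so that a duplicate means a repeated `(txId1, txId2)` pair.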

---
## Setup

In [1]:
import pandas as pd
import numpy as np
import warnings
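# Suppress all warnings for cleaner output. Note that this also hides
# pandas' DtypeWarning, which read_csv raises for mixed-type columns.
warnings.filterwarnings('ignore')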

print("Libraries imported successfully!")

Libraries imported successfully!


---
## Dataset 1: Transaction Classes (`elliptic_txs_classes.csv`)

### Load Data

In [2]:
# Load transaction classes
classes_df = pd.read_csv("raw_data/elliptic_bitcoin_dataset/elliptic_txs_classes.csv")

print(f"Shape: {classes_df.shape}")
print(f"Columns: {list(classes_df.columns)}")
print("\nFirst 5 rows:")
classes_df.head()

Shape: (203769, 2)
Columns: ['txId', 'class']

First 5 rows:


        txId    class
0  230425980  unknown
1    5530458  unknown
2  232022460  unknown
3  232438397        2
4  230460314  unknown


### Check Missing Values

In [3]:
# Count missing values
missing_counts = classes_df.isnull().sum()
missing_pct = (missing_counts / len(classes_df)) * 100

missing_summary = pd.DataFrame({
    'Missing_Count': missing_counts,
    'Missing_Percentage': missing_pct
})

print("Missing Values Summary:")
print(missing_summary)

if missing_summary['Missing_Count'].sum() == 0:
    print("\n✓ No missing values detected")
else:
    print(f"\n⚠ Total missing values: {missing_summary['Missing_Count'].sum():,}")

Missing Values Summary:
       Missing_Count  Missing_Percentage
txId               0                 0.0
class              0                 0.0

✓ No missing values detected


### Check Duplicates

In [4]:
# Check for duplicate transaction IDs
n_duplicates = classes_df['txId'].duplicated().sum()
n_unique = classes_df['txId'].nunique()
n_total = len(classes_df)

print(f"Total records: {n_total:,}")
print(f"Unique transaction IDs: {n_unique:,}")
print(f"Duplicate transaction IDs: {n_duplicates:,}")

if n_duplicates == 0:
    print("\n✓ No duplicate transaction IDs found")
else:
    print(f"\n⚠ {n_duplicates} duplicate transaction IDs detected")
    print("\nDuplicate entries:")
    duplicated_txs = classes_df[classes_df['txId'].duplicated(keep=False)].sort_values('txId')
    print(duplicated_txs.head(10))

Total records: 203,769
Unique transaction IDs: 203,769
Duplicate transaction IDs: 0

✓ No duplicate transaction IDs found


### Check Data Types

In [5]:
# Check data types
print("Data Types:")
print(classes_df.dtypes)

# Validate txId is numeric
is_numeric = pd.api.types.is_numeric_dtype(classes_df['txId'])
print(f"\ntxId is numeric: {is_numeric}")

if not is_numeric:
    print("⚠ txId should be numeric (int64)")
else:
    print("✓ txId has correct data type")

# Check class values
print("\nClass value distribution:")
print(classes_df['class'].value_counts())

valid_classes = {'1', '2', 'unknown'}
invalid_classes = set(classes_df['class'].unique()) - valid_classes

if len(invalid_classes) == 0:
    print("\n✓ All class values are valid ('1', '2', 'unknown')")
else:
    print(f"\n⚠ Invalid class values found: {invalid_classes}")

Data Types:
txId      int64
class    object
dtype: object

txId is numeric: True
✓ txId has correct data type

Class value distribution:
class
unknown    157205
2           42019
1            4545
Name: count, dtype: int64

✓ All class values are valid ('1', '2', 'unknown')
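
The `class` column stores its labels as strings, and `isnull()` does not count `unknown` rows as missing even though they carry no label. A minimal sketch of making the labels readable and measuring their share follows. It assumes the usual Elliptic convention that `'1'` means illicit and `'2'` means licit; `LABEL_NAMES` and `toy_classes` are hypothetical names, and the toy frame stands in for `classes_df`.

```python
import pandas as pd

# Assumed mapping: '1' = illicit, '2' = licit (Elliptic dataset convention)
LABEL_NAMES = {"1": "illicit", "2": "licit", "unknown": "unknown"}

# Toy stand-in for classes_df; the real file has 203,769 rows
toy_classes = pd.DataFrame({
    "txId": [10, 11, 12, 13],
    "class": ["unknown", "2", "1", "2"],
})

# map() returns NaN for any value missing from the dictionary, so an
# unexpected label surfaces as a missing value instead of passing silently
toy_classes["label"] = toy_classes["class"].map(LABEL_NAMES)
unmapped = int(toy_classes["label"].isnull().sum())

# Share of each label as a fraction of all rows
label_share = toy_classes["label"].value_counts(normalize=True)
print(label_share)
print(f"Unmapped labels: {unmapped}")
```

Applied to the counts printed above, the same calculation gives roughly 77% unlabeled transactions (157,205 of 203,769), which is worth keeping in mind before any supervised modelling.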


---
## Dataset 2: Transaction Edgelist (`elliptic_txs_edgelist.csv`)

### Load Data

In [6]:
# Load transaction edgelist
edges_df = pd.read_csv("raw_data/elliptic_bitcoin_dataset/elliptic_txs_edgelist.csv")

print(f"Shape: {edges_df.shape}")
print(f"Columns: {list(edges_df.columns)}")
print("\nFirst 5 rows:")
edges_df.head()

Shape: (234355, 2)
Columns: ['txId1', 'txId2']

First 5 rows:


       txId1      txId2
0  230425980    5530458
1  232022460  232438397
2  230460314  230459870
3  230333930  230595899
4  232013274  232029206


### Check Missing Values

In [7]:
# Count missing values
missing_counts = edges_df.isnull().sum()
missing_pct = (missing_counts / len(edges_df)) * 100

missing_summary = pd.DataFrame({
    'Missing_Count': missing_counts,
    'Missing_Percentage': missing_pct
})

print("Missing Values Summary:")
print(missing_summary)

if missing_summary['Missing_Count'].sum() == 0:
    print("\n✓ No missing values detected")
else:
    print(f"\n⚠ Total missing values: {missing_summary['Missing_Count'].sum():,}")

Missing Values Summary:
       Missing_Count  Missing_Percentage
txId1              0                 0.0
txId2              0                 0.0

✓ No missing values detected


### Check Duplicates

In [8]:
# Check for duplicate edges
n_duplicates = edges_df.duplicated().sum()
n_total = len(edges_df)

print(f"Total edges: {n_total:,}")
print(f"Duplicate edges: {n_duplicates:,}")

if n_duplicates == 0:
    print("\n✓ No duplicate edges found")
else:
    print(f"\n⚠ {n_duplicates} duplicate edges detected")
    print("\nDuplicate edge examples:")
    duplicated_edges = edges_df[edges_df.duplicated(keep=False)].sort_values(['txId1', 'txId2'])
    print(duplicated_edges.head(10))

Total edges: 234,355
Duplicate edges: 0

✓ No duplicate edges found


### Check Data Types

In [9]:
# Check data types
print("Data Types:")
print(edges_df.dtypes)

# Validate both columns are numeric
txId1_numeric = pd.api.types.is_numeric_dtype(edges_df['txId1'])
txId2_numeric = pd.api.types.is_numeric_dtype(edges_df['txId2'])

print(f"\ntxId1 is numeric: {txId1_numeric}")
print(f"txId2 is numeric: {txId2_numeric}")

if txId1_numeric and txId2_numeric:
    print("\n✓ Both columns have correct data types (numeric)")
else:
    print("\n⚠ Transaction IDs should be numeric (int64)")

# Check for self-loops (edges where txId1 == txId2)
self_loops = (edges_df['txId1'] == edges_df['txId2']).sum()
print(f"\nSelf-loops (txId1 == txId2): {self_loops:,}")

if self_loops == 0:
    print("✓ No self-loops found")
else:
    print(f"⚠ {self_loops} self-loops detected")

Data Types:
txId1    int64
txId2    int64
dtype: object

txId1 is numeric: True
txId2 is numeric: True

✓ Both columns have correct data types (numeric)

Self-loops (txId1 == txId2): 0
✓ No self-loops found
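
`DataFrame.duplicated()` only flags rows that repeat the same `(txId1, txId2)` pair in the same order. If the graph is later treated as undirected, an edge stored in both directions would also count as a duplicate. The sketch below shows one way to detect that case on toy data; `toy_edges` and the `low`/`high` column names are hypothetical, and whether reversed pairs matter depends on how the edges are interpreted.

```python
import numpy as np
import pandas as pd

# Toy stand-in for edges_df: (1, 2) also appears reversed as (2, 1)
toy_edges = pd.DataFrame({"txId1": [1, 2, 3], "txId2": [2, 1, 4]})

# Sort the two IDs within each row so (2, 1) and (1, 2) become the same pair
unordered = pd.DataFrame(
    np.sort(toy_edges[["txId1", "txId2"]].to_numpy(), axis=1),
    columns=["low", "high"],
)

# Repeats in the original orientation
exact_dupes = int(toy_edges.duplicated().sum())
# Additional repeats that only appear once direction is ignored
reversed_dupes = int(unordered.duplicated().sum()) - exact_dupes

print(f"Exact duplicate edges: {exact_dupes}")
print(f"Edges repeated in reverse: {reversed_dupes}")
```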


---
## Dataset 3: Transaction Features (`elliptic_txs_features.csv`)

### Load Data

In [10]:
# Load transaction features
features_df = pd.read_csv("raw_data/elliptic_bitcoin_dataset/elliptic_txs_features.csv", header=None)

print(f"Shape: {features_df.shape}")
print(f"\nFirst 3 rows (first 10 columns):")
features_df.iloc[:3, :10]

Shape: (203769, 167)

First 3 rows (first 10 columns):


           0  1          2          3          4         5          6          7          8          9
0  230425980  1  -0.171469  -0.184668  -1.201369  -0.12197  -0.043875  -0.113002  -0.061584  -0.162097
1    5530458  1  -0.171484  -0.184668  -1.201369  -0.12197  -0.043875  -0.113002  -0.061584  -0.162112
2  232022460  1  -0.172107  -0.184668  -1.201369  -0.12197  -0.043875  -0.113002  -0.061584  -0.162749


In [12]:
# Rename columns for clarity
# Column 0: txId
# Column 1: time_step
# Columns 2-166: feature_1 to feature_165 (165 feature columns)
feature_cols = ['txId', 'time_step'] + [f'feature_{i}' for i in range(1, 166)]
features_df.columns = feature_cols

print("Column names assigned:")
print(f"Total columns: {len(features_df.columns)}")
print(f"First 5 column names: {feature_cols[:5]}")
print(f"Last 5 column names: {feature_cols[-5:]}")

Column names assigned:
Total columns: 167
First 5 column names: ['txId', 'time_step', 'feature_1', 'feature_2', 'feature_3']
Last 5 column names: ['feature_161', 'feature_162', 'feature_163', 'feature_164', 'feature_165']
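
A hard-coded range is easy to get wrong by one, and assigning a list of the wrong length to `features_df.columns` raises an error. One alternative is to derive the names from the column count itself. This is a minimal sketch: `make_feature_columns` is a hypothetical helper, and it assumes column 0 is `txId`, column 1 is `time_step`, and everything after that is an anonymous feature.

```python
def make_feature_columns(n_columns: int) -> list[str]:
    """Build column names for a features table with n_columns columns.

    Hypothetical helper: assumes column 0 is txId, column 1 is time_step,
    and every remaining column is an anonymous feature.
    """
    if n_columns < 2:
        raise ValueError("expected at least the txId and time_step columns")
    n_features = n_columns - 2
    return ["txId", "time_step"] + [f"feature_{i}" for i in range(1, n_features + 1)]


# 167 matches the column count reported by features_df.shape above
cols = make_feature_columns(167)
print(len(cols), cols[:3], cols[-1])
```

In the notebook this could be used as `features_df.columns = make_feature_columns(features_df.shape[1])`.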


### Check Missing Values

In [13]:
# Count missing values
missing_counts = features_df.isnull().sum()
columns_with_missing = missing_counts[missing_counts > 0]

print(f"Total columns: {len(features_df.columns)}")
print(f"Columns with missing values: {len(columns_with_missing)}")

if len(columns_with_missing) == 0:
    print("\n✓ No missing values detected in any column")
else:
    print(f"\n⚠ Missing values found in {len(columns_with_missing)} columns")
    print("\nColumns with missing values:")
    missing_summary = pd.DataFrame({
        'Column': columns_with_missing.index,
        'Missing_Count': columns_with_missing.values,
        'Missing_Percentage': (columns_with_missing.values / len(features_df)) * 100
    })
    print(missing_summary.to_string(index=False))

Total columns: 167
Columns with missing values: 0

✓ No missing values detected in any column


### Check Duplicates

In [14]:
# Check for duplicate transaction IDs
n_duplicates = features_df['txId'].duplicated().sum()
n_unique = features_df['txId'].nunique()
n_total = len(features_df)

print(f"Total records: {n_total:,}")
print(f"Unique transaction IDs: {n_unique:,}")
print(f"Duplicate transaction IDs: {n_duplicates:,}")

if n_duplicates == 0:
    print("\n✓ No duplicate transaction IDs found")
else:
    print(f"\n⚠ {n_duplicates} duplicate transaction IDs detected")
    print("\nDuplicate entries (first 5):")
    duplicated_txs = features_df[features_df['txId'].duplicated(keep=False)].sort_values('txId')
    print(duplicated_txs.head())

Total records: 203,769
Unique transaction IDs: 203,769
Duplicate transaction IDs: 0

✓ No duplicate transaction IDs found


### Check Data Types

In [15]:
# Check data types for key columns
print("Data Types (first 10 columns):")
print(features_df.dtypes[:10])

# Validate txId is numeric
txId_numeric = pd.api.types.is_numeric_dtype(features_df['txId'])
print(f"\ntxId is numeric: {txId_numeric}")

# Validate time_step is numeric
time_numeric = pd.api.types.is_numeric_dtype(features_df['time_step'])
print(f"time_step is numeric: {time_numeric}")

# Count numeric vs non-numeric feature columns
feature_columns = [col for col in features_df.columns if col.startswith('feature_')]
numeric_features = sum([pd.api.types.is_numeric_dtype(features_df[col]) for col in feature_columns])

print(f"\nFeature columns: {len(feature_columns)}")
print(f"Numeric feature columns: {numeric_features}")
print(f"Non-numeric feature columns: {len(feature_columns) - numeric_features}")

if numeric_features == len(feature_columns):
    print("\n✓ All feature columns are numeric")
else:
    print(f"\n⚠ {len(feature_columns) - numeric_features} feature columns are not numeric")
    non_numeric = [col for col in feature_columns if not pd.api.types.is_numeric_dtype(features_df[col])]
    print(f"Non-numeric columns: {non_numeric[:10]}...")  # Show first 10

Data Types (first 10 columns):
txId           int64
time_step      int64
feature_1    float64
feature_2    float64
feature_3    float64
feature_4    float64
feature_5    float64
feature_6    float64
feature_7    float64
feature_8    float64
dtype: object

txId is numeric: True
time_step is numeric: True

Feature columns: 165
Numeric feature columns: 165
Non-numeric feature columns: 0

✓ All feature columns are numeric


In [16]:
# Check time_step range
print("Time Step Summary:")
print(f"  Min: {features_df['time_step'].min()}")
print(f"  Max: {features_df['time_step'].max()}")
print(f"  Unique time steps: {features_df['time_step'].nunique()}")
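print("  Expected time steps: 49 (1-49)")

# Min and max alone would not catch a gap, so also require 49 distinct steps
if (features_df['time_step'].min() == 1 and features_df['time_step'].max() == 49
        and features_df['time_step'].nunique() == 49):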
    print("\n✓ Time step range is valid (1-49)")
else:
    print("\n⚠ Time step range differs from expected (1-49)")

Time Step Summary:
  Min: 1
  Max: 49
  Unique time steps: 49
  Expected time steps: 49 (1-49)

✓ Time step range is valid (1-49)
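
Min and max alone would not reveal a gap inside the range, so it can help to list exactly which steps are absent. The sketch below does this with plain Python sets; `missing_time_steps` is a hypothetical helper, and the default range of 1 to 49 is the one this notebook expects.

```python
def missing_time_steps(observed, first=1, last=49):
    """Return the sorted time steps in [first, last] absent from `observed`.

    Hypothetical helper; the 1-49 default is the range the notebook expects.
    """
    expected = set(range(first, last + 1))
    return sorted(expected - set(observed))


# Toy example: step 7 is removed, yet min is still 1 and max is still 49
steps_with_gap = [s for s in range(1, 50) if s != 7]
gap_result = missing_time_steps(steps_with_gap)
full_result = missing_time_steps(range(1, 50))

print(f"Missing with a gap: {gap_result}")
print(f"Missing when complete: {full_result}")
```

With the real data, the observed values could be passed as `features_df['time_step'].unique().tolist()`.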


---
## Cross-Dataset Validation

### Check Transaction ID Consistency Across Datasets

In [17]:
# Get unique transaction IDs from each dataset
txIds_classes = set(classes_df['txId'].unique())
txIds_features = set(features_df['txId'].unique())
txIds_edges_all = set(edges_df['txId1'].unique()) | set(edges_df['txId2'].unique())

print("Transaction ID counts by dataset:")
print(f"  Classes:  {len(txIds_classes):,} unique transaction IDs")
print(f"  Features: {len(txIds_features):,} unique transaction IDs")
print(f"  Edges:    {len(txIds_edges_all):,} unique transaction IDs (combined from txId1 and txId2)")

Transaction ID counts by dataset:
  Classes:  203,769 unique transaction IDs
  Features: 203,769 unique transaction IDs
  Edges:    203,769 unique transaction IDs (combined from txId1 and txId2)


In [18]:
# Check if classes and features have the same transaction IDs
classes_not_in_features = txIds_classes - txIds_features
features_not_in_classes = txIds_features - txIds_classes

print("\nClasses vs Features:")
print(f"  Transaction IDs in classes but not in features: {len(classes_not_in_features):,}")
print(f"  Transaction IDs in features but not in classes: {len(features_not_in_classes):,}")

if len(classes_not_in_features) == 0 and len(features_not_in_classes) == 0:
    print("  ✓ Classes and features have identical transaction IDs")
else:
    print("  ⚠ Transaction ID mismatch between classes and features")


Classes vs Features:
  Transaction IDs in classes but not in features: 0
  Transaction IDs in features but not in classes: 0
  ✓ Classes and features have identical transaction IDs


In [19]:
# Check if edge transaction IDs exist in features
edges_not_in_features = txIds_edges_all - txIds_features

print("\nEdges vs Features:")
print(f"  Transaction IDs in edges but not in features: {len(edges_not_in_features):,}")

if len(edges_not_in_features) == 0:
    print("  ✓ All edge transaction IDs exist in features dataset")
else:
    print(f"  ⚠ {len(edges_not_in_features)} transaction IDs from edges are missing in features")
    print(f"  Missing IDs (first 10): {list(edges_not_in_features)[:10]}")


Edges vs Features:
  Transaction IDs in edges but not in features: 0
  ✓ All edge transaction IDs exist in features dataset
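
The set differences above report which IDs are unmatched. An outer merge with `indicator=True` gives the same answer while keeping the rows themselves available for inspection. This is a minimal sketch on toy data: `toy_classes` and `toy_features` are hypothetical stand-ins for `classes_df` and `features_df`.

```python
import pandas as pd

# Toy stand-ins for classes_df and features_df
toy_classes = pd.DataFrame({"txId": [1, 2, 3], "class": ["1", "2", "unknown"]})
toy_features = pd.DataFrame({"txId": [2, 3, 4], "time_step": [1, 1, 2]})

# An outer join keeps every txId from both sides; indicator=True adds a
# `_merge` column recording whether each row came from left, right, or both
merged = toy_classes.merge(toy_features, on="txId", how="outer", indicator=True)

only_classes = merged.loc[merged["_merge"] == "left_only", "txId"].tolist()
only_features = merged.loc[merged["_merge"] == "right_only", "txId"].tolist()

print(f"Only in classes:  {only_classes}")
print(f"Only in features: {only_features}")
```

For a frame the size of the features table, the set-based check is lighter on memory, so the merge is probably best reserved for cases where a mismatch has already been found.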


---
## Summary

In [20]:
print("="*70)
print("DATA QUALITY CHECK - SUMMARY")
print("="*70)

print("\n1. DATASET DIMENSIONS:")
print(f"   Classes:  {classes_df.shape[0]:,} rows × {classes_df.shape[1]} columns")
print(f"   Edges:    {edges_df.shape[0]:,} rows × {edges_df.shape[1]} columns")
print(f"   Features: {features_df.shape[0]:,} rows × {features_df.shape[1]} columns")

print("\n2. MISSING VALUES:")
classes_missing = classes_df.isnull().sum().sum()
edges_missing = edges_df.isnull().sum().sum()
features_missing = features_df.isnull().sum().sum()

print(f"   Classes:  {classes_missing:,} missing values")
print(f"   Edges:    {edges_missing:,} missing values")
print(f"   Features: {features_missing:,} missing values")

if classes_missing + edges_missing + features_missing == 0:
    print("   ✓ No missing values in any dataset")
else:
    print("   ⚠ Missing values detected")

print("\n3. DUPLICATES:")
classes_dupes = classes_df['txId'].duplicated().sum()
edges_dupes = edges_df.duplicated().sum()
features_dupes = features_df['txId'].duplicated().sum()

print(f"   Classes:  {classes_dupes:,} duplicate transaction IDs")
print(f"   Edges:    {edges_dupes:,} duplicate edges")
print(f"   Features: {features_dupes:,} duplicate transaction IDs")

if classes_dupes + edges_dupes + features_dupes == 0:
    print("   ✓ No duplicates in any dataset")
else:
    print("   ⚠ Duplicates detected")

print("\n4. DATA TYPES:")
classes_types_ok = pd.api.types.is_numeric_dtype(classes_df['txId'])
edges_types_ok = (pd.api.types.is_numeric_dtype(edges_df['txId1']) and 
                  pd.api.types.is_numeric_dtype(edges_df['txId2']))
features_types_ok = (pd.api.types.is_numeric_dtype(features_df['txId']) and
                     pd.api.types.is_numeric_dtype(features_df['time_step']))

print(f"   Classes:  {'✓' if classes_types_ok else '⚠'} txId is numeric")
print(f"   Edges:    {'✓' if edges_types_ok else '⚠'} txId1 and txId2 are numeric")
print(f"   Features: {'✓' if features_types_ok else '⚠'} txId and time_step are numeric")

print("\n5. CROSS-DATASET CONSISTENCY:")
consistency_ok = (len(classes_not_in_features) == 0 and 
                  len(features_not_in_classes) == 0 and
                  len(edges_not_in_features) == 0)

if consistency_ok:
    print("   ✓ Transaction IDs are consistent across all datasets")
else:
    print("   ⚠ Transaction ID mismatches detected between datasets")

print("\n" + "="*70)

# Overall assessment
all_checks_passed = (classes_missing + edges_missing + features_missing == 0 and
                     classes_dupes + edges_dupes + features_dupes == 0 and
                     classes_types_ok and edges_types_ok and features_types_ok and
                     consistency_ok)

if all_checks_passed:
    print("\n✓ ALL DATA QUALITY CHECKS PASSED")
    print("  The datasets are clean and ready for analysis.")
else:
    print("\n⚠ SOME DATA QUALITY ISSUES DETECTED")
    print("  Review the warnings above for details.")

print("\n" + "="*70)

DATA QUALITY CHECK - SUMMARY

1. DATASET DIMENSIONS:
   Classes:  203,769 rows × 2 columns
   Edges:    234,355 rows × 2 columns
   Features: 203,769 rows × 167 columns

2. MISSING VALUES:
   Classes:  0 missing values
   Edges:    0 missing values
   Features: 0 missing values
   ✓ No missing values in any dataset

3. DUPLICATES:
   Classes:  0 duplicate transaction IDs
   Edges:    0 duplicate edges
   Features: 0 duplicate transaction IDs
   ✓ No duplicates in any dataset

4. DATA TYPES:
   Classes:  ✓ txId is numeric
   Edges:    ✓ txId1 and txId2 are numeric
   Features: ✓ txId and time_step are numeric

5. CROSS-DATASET CONSISTENCY:
   ✓ Transaction IDs are consistent across all datasets


✓ ALL DATA QUALITY CHECKS PASSED
  The datasets are clean and ready for analysis.

