The data is available in two places:
- The full dataset: https://huggingface.co/datasets/rusheeliyer/german-courts
- The excerpt from st and mp: https://docs.google.com/spreadsheets/d/1EKI58wSObtLT0Uf_SH-R8FixGzQTpn1IuOVLDix03F8/edit?usp=sharing

The excerpt is partially deduplicated and cleaned up. The full dataset is not deduplicated but contains more information. The excerpt is linked for reference only; this notebook works with the full dataset.
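
Because the full dataset is not deduplicated, it can help to count repeated rows before relying on any totals. The following is a minimal sketch on a toy frame, assuming pandas is available; the column names `text` and `summary` are hypothetical and may not match the actual columns of the dataset.

```python
import pandas as pd

# Toy stand-in for the merged data; the column names are hypothetical.
toy_df = pd.DataFrame(
    {
        "text": ["Urteil A", "Urteil B", "Urteil A", "Urteil C"],
        "summary": ["s1", "s2", "s1", "s3"],
    }
)

# duplicated() flags each repeat of an earlier row; the first copy is not flagged.
duplicate_mask = toy_df.duplicated(subset=["text", "summary"])
n_duplicates = int(duplicate_mask.sum())

# drop_duplicates() keeps the first copy of every repeated row.
deduplicated_df = toy_df.drop_duplicates(subset=["text", "summary"]).reset_index(drop=True)

print(f"{n_duplicates} duplicate row(s), {len(deduplicated_df)} unique row(s)")
```

Which columns define a duplicate is a judgment call; comparing on the decision text alone would be stricter than comparing on all columns.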

# Loading and Merging All German Court Subsets

The Hugging Face dataset `german-courts` contains six subsets (one per court), each with its own train/validation/test splits. We load all of them and merge them into a single DataFrame, adding one column for the subset name and one for the split name.

In [1]:
from datasets import load_dataset
import pandas as pd

# List of all subsets in the german-courts dataset (based on available configurations)
court_subsets = [
    "bundesarbeitsgericht",
    "bundesfinanzhof",
    "bundesgerichtshof",
    "bundessozialgericht",
    "bundesverfassungsgericht",
    "bundesverwaltungsgericht"
]

# List of all splits
split_names = ["train", "validation", "test"]

# Create a list to store all dataframes
all_dfs = []

print("Loading all subsets from the german-courts dataset...")

# Iterate through each subset (configuration)
for court in court_subsets:
    try:
        # Load the specific subset
        print(f"Loading {court}...")
        subset_data = load_dataset("rusheeliyer/german-courts", court)
        
        # Process each split in the dataset
        for split in split_names:
            if split in subset_data:
                # Convert to pandas DataFrame
                split_df = subset_data[split].to_pandas()
                
                # Add columns for subset name and split name
                # Using the capitalized version for better readability in subset_name
                capitalized_court = court[0].upper() + court[1:]
                split_df['subset_name'] = capitalized_court
                split_df['split_name'] = split
                
                # Add to our list of dataframes
                all_dfs.append(split_df)
                
                print(f"Loaded {court}/{split} with {len(split_df)} entries")
            else:
                print(f"Split {split} not found in {court}")
    except Exception as e:
        print(f"Error loading {court}: {e}")

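# Merge all dataframes into one (pd.concat raises ValueError on an empty list,
# which would happen if every subset failed to load)
if not all_dfs:
    raise RuntimeError("No subsets could be loaded; nothing to merge")
merged_df = pd.concat(all_dfs, ignore_index=True)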

# Display information about the merged dataset
print("\nMerged dataset information:")
print(f"Total entries: {len(merged_df)}")
print("\nEntries per subset:")
print(merged_df.groupby('subset_name').size())
print("\nEntries per split:")
print(merged_df.groupby('split_name').size())
print("\nEntries per subset and split:")
print(merged_df.groupby(['subset_name', 'split_name']).size())

# Display the first few rows
print("\nSample of the merged dataset:")
print(merged_df.head())

Loading all subsets from the german-courts dataset...
Loading bundesarbeitsgericht...


Loaded bundesarbeitsgericht/train with 117 entries
Loaded bundesarbeitsgericht/validation with 30 entries
Loaded bundesarbeitsgericht/test with 31 entries
Loading bundesfinanzhof...


Loaded bundesfinanzhof/train with 569 entries
Loaded bundesfinanzhof/validation with 59 entries
Loaded bundesfinanzhof/test with 133 entries
Loading bundesgerichtshof...


Loaded bundesgerichtshof/train with 1889 entries
Loaded bundesgerichtshof/validation with 288 entries
Loaded bundesgerichtshof/test with 377 entries
Loading bundessozialgericht...


Loaded bundessozialgericht/train with 113 entries
Loaded bundessozialgericht/validation with 20 entries
Loaded bundessozialgericht/test with 29 entries
Loading bundesverfassungsgericht...


Loaded bundesverfassungsgericht/train with 1280 entries
Loaded bundesverfassungsgericht/validation with 193 entries
Loaded bundesverfassungsgericht/test with 305 entries
Loading bundesverwaltungsgericht...


Loaded bundesverwaltungsgericht/train with 826 entries
Loaded bundesverwaltungsgericht/validation with 157 entries
Loaded bundesverwaltungsgericht/test with 176 entries

Merged dataset information:
Total entries: 6592

Entries per subset:
subset_name
Bundesarbeitsgericht         178
Bundesfinanzhof              761
Bundesgerichtshof           2554
Bundessozialgericht          162
Bundesverfassungsgericht    1778
Bundesverwaltungsgericht    1159
dtype: int64

Entries per split:
split_name
test          1051
train         4794
validation     747
dtype: int64

Entries per subset and split:
subset_name               split_name
Bundesarbeitsgericht      test            31
                          train          117
                          validation      30
Bundesfinanzhof           test           133
                          train          569
                          validation      59
Bundesgerichtshof         test           377
                          train         1889
         

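The per-split counts in the `Loaded <court>/<split>` log lines can be cross-checked against the merged totals by hand. The sketch below copies those logged counts into a plain dictionary (the variable names are illustrative, not part of the notebook) and sums them per subset, per split, and overall.

```python
# Per-split counts copied from the "Loaded <court>/<split>" log lines.
logged_counts = {
    "Bundesarbeitsgericht": {"train": 117, "validation": 30, "test": 31},
    "Bundesfinanzhof": {"train": 569, "validation": 59, "test": 133},
    "Bundesgerichtshof": {"train": 1889, "validation": 288, "test": 377},
    "Bundessozialgericht": {"train": 113, "validation": 20, "test": 29},
    "Bundesverfassungsgericht": {"train": 1280, "validation": 193, "test": 305},
    "Bundesverwaltungsgericht": {"train": 826, "validation": 157, "test": 176},
}

# Total entries per court, summed over its splits.
subset_totals = {court: sum(splits.values()) for court, splits in logged_counts.items()}

# Total entries per split, summed over all courts.
split_totals = {}
for splits in logged_counts.values():
    for split, count in splits.items():
        split_totals[split] = split_totals.get(split, 0) + count

grand_total = sum(subset_totals.values())

print(subset_totals)
print(split_totals)
print(grand_total)
```

If these sums agree with the `groupby` output and the reported total, no split was silently skipped during loading.
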
## Saving the Merged Dataset

In [3]:
# Save the merged dataset to a CSV file
from pathlib import Path

# Create output directory if it doesn't exist
output_dir = Path('data')
output_dir.mkdir(exist_ok=True)

# Save the merged dataset
output_file = output_dir / "german_courts.csv"
merged_df.to_csv(output_file, index=False)
print(f"Saved merged dataset to {output_file} with {len(merged_df)} entries")

Saved merged dataset to data/german_courts.csv with 6592 entries
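
Court decisions usually contain line breaks and quotation marks, so it may be worth confirming that a CSV round trip preserves such text. This is a minimal sketch on a toy frame written to a temporary directory, assuming pandas is available; the `text` column name and the file name `german_courts_toy.csv` are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame with an embedded newline and embedded quotes; "text" is a hypothetical column.
toy_df = pd.DataFrame(
    {
        "text": ["Erster Absatz.\nZweiter Absatz.", 'Zitat: "Im Namen des Volkes"'],
        "subset_name": ["Bundesgerichtshof", "Bundesfinanzhof"],
        "split_name": ["train", "test"],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "german_courts_toy.csv"
    # Same call pattern as the notebook: no index column in the file.
    toy_df.to_csv(csv_path, index=False)
    reloaded_df = pd.read_csv(csv_path)

# Compare shape, column order, and the text column value by value.
same_shape = reloaded_df.shape == toy_df.shape
same_columns = list(reloaded_df.columns) == list(toy_df.columns)
same_text = reloaded_df["text"].tolist() == toy_df["text"].tolist()

print(same_shape, same_columns, same_text)
```

The same three comparisons could be run on `merged_df` and `data/german_courts.csv`, keeping in mind that missing values and numeric columns may change dtype on reload.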
