Epstein Files 20K Dataset

Curated dataset from official House Oversight Committee release




🌟 About

This repository catalogs the main EPSTEIN_FILES_20K dataset and derived datasets from the Epstein estate documents released by the U.S. House Committee on Oversight and Accountability on November 12, 2025.

Original Source: House Oversight Committee Release

📂 Dataset Structure

The processed dataset is a single CSV file with two columns:

| Column | Description |
| --- | --- |
| text | Full text content extracted from the document |
| filename | Modified file path maintaining directory structure |

Access: Hugging Face

βš™οΈ Processing Pipeline

Image-to-Text Conversion

All JPG files were converted to text using convert_jpg_to_txt.py, preserving the original directory structure.
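
The conversion script itself is not reproduced here, so the following is only a minimal sketch of the directory-mirroring idea, not the actual convert_jpg_to_txt.py. The function names `mirror_txt_path` and `convert_tree` are hypothetical, and the OCR step is passed in as a callable because the source does not say which OCR engine was used.

```python
from pathlib import Path
from typing import Callable


def mirror_txt_path(jpg_path: Path, src_root: Path, dst_root: Path) -> Path:
    """Map src_root/a/b/page.jpg to dst_root/a/b/page.txt."""
    return dst_root / jpg_path.relative_to(src_root).with_suffix(".txt")


def convert_tree(src_root, dst_root, ocr: Callable[[Path], str]) -> list:
    """Run `ocr` on every JPG under src_root and write the result to a
    .txt file at the same relative location under dst_root.

    `ocr` is an assumption: any function that takes an image path and
    returns the extracted text.
    """
    src_root, dst_root = Path(src_root), Path(dst_root)
    written = []
    for path in sorted(src_root.rglob("*")):
        if path.is_file() and path.suffix.lower() in {".jpg", ".jpeg"}:
            out = mirror_txt_path(path, src_root, dst_root)
            out.parent.mkdir(parents=True, exist_ok=True)
            out.write_text(ocr(path), encoding="utf-8")
            written.append(out)
    return written
```

Keeping the OCR engine behind a plain callable means the path handling can be checked with a stub before running a slow OCR pass over the full release.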

Data Compilation

  1. Loaded all text files (converted + existing) into a pandas DataFrame
  2. Created a two-column structure: text and filename
  3. Exported the result as a single CSV file
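
The three steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the maintainers' code: `compile_dataset` is a hypothetical name, and storing `filename` as a path relative to the text root is a guess at what "modified file path" means.

```python
from pathlib import Path

import pandas as pd


def compile_dataset(text_root, csv_path) -> pd.DataFrame:
    """Collect every .txt file under text_root into a two-column CSV."""
    text_root = Path(text_root)
    rows = []
    for txt in sorted(text_root.rglob("*.txt")):
        rows.append({
            # Replace undecodable bytes instead of failing on one bad file.
            "text": txt.read_text(encoding="utf-8", errors="replace"),
            # Assumption: the relative path stands in for the "modified" path.
            "filename": txt.relative_to(text_root).as_posix(),
        })
    df = pd.DataFrame(rows, columns=["text", "filename"])
    df.to_csv(csv_path, index=False)
    return df
```

Passing `index=False` keeps the CSV to exactly the two documented columns.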

👀 Usage

import pandas as pd

# Load the dataset
df = pd.read_csv('path_to_dataset.csv')

# View basic information
print(f"Total documents: {len(df)}")
print(df.head())
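
Once the CSV is loaded, a common next step is a plain keyword search over the `text` column. The helper below is a small sketch; `search_documents` is a hypothetical name, and it assumes only the two documented columns.

```python
import pandas as pd


def search_documents(df: pd.DataFrame, keyword: str) -> pd.DataFrame:
    """Return rows whose text contains `keyword` (case-insensitive).

    regex=False treats the keyword literally; na=False skips rows whose
    text is missing, for example where extraction produced nothing.
    """
    mask = df["text"].str.contains(keyword, case=False, na=False, regex=False)
    return df.loc[mask, ["filename", "text"]]
```

For example, `search_documents(df, "deposition")["filename"]` would list the paths of matching documents, which can then be grouped by directory.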

Citation

Epstein Estate Documents Dataset
Source: U.S. House Committee on Oversight and Accountability
Available at: https://huggingface.co/datasets/tensonaut/EPSTEIN_FILES_20K

📦 Derived Datasets

New derived datasets are organized in separate folders with individual README files.

Structure:

Dataset/
├── README.md (this file)
├── derived-dataset-1/
│   └── README.md
└── derived-dataset-2/
    └── README.md

To add a derived dataset, create a new folder with documentation following the main dataset format.

👋 Contributing

Contributions are welcome for:

  • Derivative datasets
  • Processing pipeline improvements
  • Documentation enhancements

How to contribute:

  1. Fork this repository
  2. Create your dataset/improvement
  3. Document thoroughly
  4. Submit a pull request

💎 Resources

⚠️ License

Please refer to the original source for licensing and usage terms of the Epstein estate documents.

🤝 Contact

For questions or concerns:


Community-maintained • Not affiliated with any official investigation
