# Introduction

This is the first in a series of notebooks that guide you through processing oral interview PDFs from the United States Holocaust Memorial Museum (USHMM). These notebooks, along with several datasets and machine learning models, are an output of the NEH-funded Placing the Holocaust Project. A collection of these PDFs, all of them in English, is available on Hugging Face [here](https://huggingface.co/datasets/placingholocaust/ushmm-pdfs).

## Objective of Notebook

This notebook will:

- Walk through correcting the OCR output.
- Demonstrate how to use the `ushmm` Python package to process the PDFs.

This will be the most computationally expensive portion of the workflow.

# Getting Started

To work with this notebook, you will need several packages installed. You can install them from the `requirements.txt` file, or with a pip command run inside this notebook.

In [1]:
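import os
# Move to the parent directory (presumably the repository root) so that `src` and `./data` resolve.
os.chdir('..')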

import glob
import datasets
import pandas as pd
from src.visualize import visualize_diff
from src.clean import clean_html
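
If the import cell above fails with a `ModuleNotFoundError`, it can help to check which third-party packages are missing before installing anything. The sketch below uses only the standard library. The package list is an assumption based on the imports in this notebook (`datasets` and `pandas`), not on the contents of `requirements.txt`, and the helper name `missing_packages` is hypothetical.

```python
import importlib.util

def missing_packages(names):
    """Return the names that cannot be found on the current Python path."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Assumed list, taken from the imports used in this notebook.
required = ["datasets", "pandas"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

Any package reported as missing can then be installed from `requirements.txt` as described above.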

In [2]:
metadata = datasets.load_dataset("placingholocaust/testimony-metadata")["train"]
metadata

Dataset({
    features: ['RG Number', 'PDF URL', 'USHMM URL', 'First Name', 'Middle Name', 'Last Name', 'Birth Name', 'Gender', 'Birth Date', 'Birth Year', 'Place of Birth', 'Country', 'Experience Group', 'Ghetto(s) Encyclopedia', 'Ghetto', 'Camp(s) Encyclopedia', 'Camp', 'Non-SS Camp  ', 'Region', 'Needs Research', 'Data Entry', 'Accession', 'Notes:', 'Revisit'],
    num_rows: 977
})
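
The feature list above includes at least one column name with trailing whitespace (`'Non-SS Camp  '`), which makes lookups by name easy to get wrong. A minimal sketch of normalizing such names follows, using only the standard library. The helper name `strip_column_names` is hypothetical and not part of the project code.

```python
def strip_column_names(names):
    """Map each original column name to a copy without surrounding whitespace."""
    return {name: name.strip() for name in names}

# A few names copied from the feature list above.
columns = ["RG Number", "Non-SS Camp  ", "Notes:"]
mapping = strip_column_names(columns)
print(mapping["Non-SS Camp  "])
```

Once the metadata is in a pandas DataFrame, a mapping like this could be applied with `df.rename(columns=mapping)`.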

In [3]:
df = pd.DataFrame(metadata)
df

Unnamed: 0,RG Number,PDF URL,USHMM URL,First Name,Middle Name,Last Name,Birth Name,Gender,Birth Date,Birth Year,...,Ghetto,Camp(s) Encyclopedia,Camp,Non-SS Camp,Region,Needs Research,Data Entry,Accession,Notes:,Revisit
0,RG-50.549.02.0033,https://collections.ushmm.org/oh_findingaids/R...,https://collections.ushmm.org/search/catalog/i...,Hetty,d'Ancona de,Leeuwe,Hetty D'Ancona,F,1930-05-01,1930.0,...,,,,,,,CL,1999.A.0293,,
1,RG-50.549.02.0072,https://collections.ushmm.org/oh_findingaids/R...,https://collections.ushmm.org/search/catalog/i...,Emanuel,,Mandel,,M,,1936.0,...,,,,,,checked,GG,2003.205,Follow-up interview,
2,RG-50.549.02.0035,https://collections.ushmm.org/oh_findingaids/R...,https://collections.ushmm.org/search/catalog/i...,Judith,,Meisel,,F,,1929.0,...,Kaunas,,,,,checked,GG,1999.A.0024,This is a follow-up interview to one already d...,checked
3,RG-50.471.0015,https://collections.ushmm.org/oh_findingaids/R...,https://collections.ushmm.org/search/catalog/i...,Esther,,Lurie,,F,,,...,,,,,,,CL,1998.A.0119.15,,
4,RG-50.030.0585,https://collections.ushmm.org/oh_findingaids/R...,https://collections.ushmm.org/search/catalog/i...,Eugene,,Miller,,M,1923-10-16,1923.0,...,Lodz,"Auschwitz,Dachau",,,,checked,GG,2010.249,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
972,RG-50.549.02.0073,https://collections.ushmm.org/oh_findingaids/R...,https://collections.ushmm.org/search/catalog/i...,Flory,,Jagoda,,F,1923-12-21,1923.0,...,,,,,,,GG,2004.48,Follow-up,checked
973,RG-50.030.0137,https://collections.ushmm.org/oh_findingaids/R...,https://collections.ushmm.org/search/catalog/i...,Cornelius,,Loen,,M,1922-05-02,1922.0,...,,,,,,,CL,1990.437.1,,
974,RG-50.030.0058,https://collections.ushmm.org/oh_findingaids/R...,https://collections.ushmm.org/search/catalog/i...,Isaac,,Danon,,M,,1929.0,...,,,,,,,GG,,,
975,RG-50.549.02.0078,https://collections.ushmm.org/oh_findingaids/R...,https://collections.ushmm.org/search/catalog/i...,Lucie,,Rosenberg,,F,,1921.0,...,,,,,,checked,CL,2004.214,"Not a survivor, volunteered for the museum?",


In [4]:
html_files = glob.glob("./data/pdfs/*.html")
print(html_files[:5])
print(len(html_files))

['./data/pdfs/RG-50.030.0706_tcn_en.html', './data/pdfs/RG-50.030.0670_sum_en.html', './data/pdfs/RG-50.030.0414_trs_en.html', './data/pdfs/RG-50.030.0144_trs_en.html', './data/pdfs/RG-50.549.02.0062_trs_en.html']
1214
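
The file names printed above appear to follow the pattern `<RG number>_<type>_<language>.html`. Assuming that pattern holds for every file (it is inferred from the five names shown, not documented here), a small helper can split a path into its parts, which makes it possible to match a file to the `RG Number` column of the metadata. The name `parse_filename` is hypothetical.

```python
import os
import re

# Pattern inferred from the sample file names; treat it as an assumption.
FILENAME_RE = re.compile(
    r"^(?P<rg>RG-[\d.]+)_(?P<kind>[a-z]+)_(?P<lang>[a-z]+)\.html$"
)

def parse_filename(path):
    """Split a path into (RG number, document type, language), or None."""
    match = FILENAME_RE.match(os.path.basename(path))
    if match is None:
        return None
    return match.group("rg"), match.group("kind"), match.group("lang")

print(parse_filename("./data/pdfs/RG-50.030.0706_tcn_en.html"))
```

Returning `None` for names that do not fit keeps unexpected files visible instead of silently misparsing them.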


In [5]:
# Keep files whose path contains "trs_en" and does not contain "sum".
approved_files = [
    filename for filename in html_files
    if "sum" not in filename and "trs_en" in filename
]
len(approved_files)

949
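
The filter above tests substrings of the full path, so a directory name containing `sum` would silently exclude every file. A slightly stricter sketch checks only the end of the file name. It assumes the transcripts of interest all end in `_trs_en.html`, as the sample names suggest, so its count may differ from the one above if other naming variants exist. The helper name `select_transcripts` is hypothetical.

```python
import os

def select_transcripts(paths):
    """Return the sorted paths whose file name ends in '_trs_en.html'."""
    return sorted(
        p for p in paths if os.path.basename(p).endswith("_trs_en.html")
    )

# The five sample paths printed earlier in this notebook.
sample = [
    "./data/pdfs/RG-50.030.0706_tcn_en.html",
    "./data/pdfs/RG-50.030.0670_sum_en.html",
    "./data/pdfs/RG-50.030.0414_trs_en.html",
    "./data/pdfs/RG-50.030.0144_trs_en.html",
    "./data/pdfs/RG-50.549.02.0062_trs_en.html",
]
print(select_transcripts(sample))
```

Sorting inside the helper gives a stable processing order regardless of the order in which `glob` returns the files.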

In [6]:
approved_files.sort()
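
The next cell calls `clean_html`, whose source (in `src/clean.py`) is not shown in this notebook. Judging only from the messages that cell prints, it removes repeated boilerplate strings and reports how often each occurred. A minimal sketch of that idea follows, assuming plain string matching and using the two phrases from the log as the boilerplate list. The function name `remove_boilerplate` and its return shape are hypothetical, and the real implementation may work differently (for example, by parsing the HTML).

```python
# Phrases copied from the cleaning log in this notebook.
BOILERPLATE = [
    "http://collections.ushmm.org Contact reference@ushmm.org for further information about this collection",
    "This is a verbatim transcript of spoken word. It is not the primary source, and it has not been checked for spelling or accuracy.",
]

def remove_boilerplate(text, phrases=BOILERPLATE):
    """Remove each phrase from text and return (cleaned text, counts)."""
    counts = {}
    for phrase in phrases:
        n = text.count(phrase)
        if n:
            counts[phrase] = n
            text = text.replace(phrase, "")
    return text, counts

# Made-up placeholder text, used only to exercise the function.
page = "First page text. " + BOILERPLATE[1] + " Second page text."
cleaned, counts = remove_boilerplate(page)
for phrase, n in counts.items():
    print(f"Removed {phrase!r} {n} time(s)")
```

Returning the counts alongside the cleaned text makes it easy to log what was removed per file, in the same spirit as the output shown below.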

In [7]:
for filename in approved_files:
    clean_html(filename, "./data/cleaned_htmls")


No removals made in ./data/pdfs/RG-50.030.0001_trs_en.html
Cleaned HTML saved to: ./data/cleaned_htmls/RG-50.030.0001_trs_en_cleaned.html

No removals made in ./data/pdfs/RG-50.030.0002_trs_en.html
Cleaned HTML saved to: ./data/cleaned_htmls/RG-50.030.0002_trs_en_cleaned.html

Removals for ./data/pdfs/RG-50.030.0003_trs_en.html:
Removed 'http://collections.ushmm.org Contact reference@ushmm.org for further information about this collection' 34 time(s)
Removed 'This is a verbatim transcript of spoken word. It is not the primary source, and it has not been checked for spelling or accuracy.' 34 time(s)
Cleaned HTML saved to: ./data/cleaned_htmls/RG-50.030.0003_trs_en_cleaned.html

No removals made in ./data/pdfs/RG-50.030.0004_trs_en.html
Cleaned HTML saved to: ./data/cleaned_htmls/RG-50.030.0004_trs_en_cleaned.html

No removals made in ./data/pdfs/RG-50.030.0006_trs_en.html
Cleaned HTML saved to: ./data/cleaned_htmls/RG-50.030.0006_trs_en_cleaned.html

No removals made in ./data/pdfs/RG-