# Bob Nelkin Collection - Pre-processing

<br>

**Notebook author:** Ben Naismith  
**Last modified:** July 15, 2021

<br>

**Notebook contents:**
1. [Initial setup](#1.-Initial-setup)
2. [Combined dataframe](#2.-Combined-dataframe)
3. [Missing data](#3.-Missing-data)
4. [Standardization](#4.-Standardization)
5. [Save dataframe](#5.-Save-dataframe)

## 1. Initial setup

In [1]:
# Import necessary modules

import pandas as pd
import pprint
from IPython.core.interactiveshell import InteractiveShell
import csv
import glob
from textblob import TextBlob
import PyPDF2
import pdfminer
import joblib

In [2]:
# Set preferred notebook format

InteractiveShell.ast_node_interactivity = "all" # Show all output, not just last item
pd.set_option('display.max_columns', 999) # Allow viewing of all columns

## 2. Combined dataframe
Create a combined dataframe from the archival base layer (`bob-nelkin-collection_item-base-layer_archival.csv`) and the OCR text files in the `ocr_new` folder.

In [3]:
# Read in the archival base layer CSV (bob-nelkin-collection_item-base-layer_archival.csv)

bob_df = pd.read_csv('../../../base-layers/bob-nelkin-collection/bob-nelkin-collection_item-base-layer_archival.csv')
bob_df.head(1)

Unnamed: 0,id,title,creator,contributor,creation_date,sort_date,display_date,language,type_of_resource,format,extent,genre,abstract,subject,temporal_coverage,geographic_coverage,host,series,container,owner,depositor,collection_id
0,MSS_1002_B001_F11_I01,Recent Litigation Memo,,,,,"July 11, 1975",,,,,,A PARC internal memo that summarizes recent li...,,,,Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 11, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History\n ...",collection.341


In [4]:
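# Read in all the OCR file names and texts as a list of tuples

def read_text(path):
    # Close each file after reading; UTF-8 decoding keeps the leading '\ufeff' byte order mark
    with open(path, encoding='utf-8') as f:
        return f.read()

texts = [(filename, read_text(filename)) for filename in glob.glob("../../../source-data/bob-nelkin-collection/ocr_new/*.txt")]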
len(texts)
texts[2] #Random example

537

('../../../source-data/bob-nelkin-collection/ocr_new/pitt_MSS_1002_B004_F17_I13_PDF.txt',
 '\ufeff\ni I\nA27 Vermont Avenue\nErie, Pennsylvania 16505\nApril 2A, 1973\n\nMs. Helen Wohlgemuth Secretary of Welfare Harrisburg, Pennsylvania\nDear Ms. Wohlgemuth,\nI am a Special Education teacher and h-ave a severely living at Polk State School. Therefore, I feel more than express my great distress caused by your recent firing of superintendent of Polk State School.\nretarded daughter\nQualified to\nDr. JamES McClelland,\n/\n\n\n\nYour ahrupt action indicates to me that you have very little intimate knowledge of severely and profoundly retarded children and adults. There are individuals who require close supervision and partial confinement in a playpen-like enclosure for their safety and the safety of others. This is not cruel and inhumane, it is sensible action taken for safety. Some of the children and adults do reouire rather heavy medication at various times because of severely aggressiv

In [5]:
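# Shorten filenames to match the dataframe ids: drop the first 56 characters
# (directory path plus 'pitt_') and the last 8 characters ('_PDF.txt')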

texts = [(x[0][56:-8],x[1]) for x in texts]
len(texts)
texts[2]

537

('MSS_1002_B004_F17_I13',
 '\ufeff\ni I\nA27 Vermont Avenue\nErie, Pennsylvania 16505\nApril 2A, 1973\n\nMs. Helen Wohlgemuth Secretary of Welfare Harrisburg, Pennsylvania\nDear Ms. Wohlgemuth,\nI am a Special Education teacher and h-ave a severely living at Polk State School. Therefore, I feel more than express my great distress caused by your recent firing of superintendent of Polk State School.\nretarded daughter\nQualified to\nDr. JamES McClelland,\n/\n\n\n\nYour ahrupt action indicates to me that you have very little intimate knowledge of severely and profoundly retarded children and adults. There are individuals who require close supervision and partial confinement in a playpen-like enclosure for their safety and the safety of others. This is not cruel and inhumane, it is sensible action taken for safety. Some of the children and adults do reouire rather heavy medication at various times because of severely aggressive hehavior.\nIt is my opinion, and the opinion of various profes
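
The fixed slice `[56:-8]` only works while the directory path keeps exactly the same length. As a minimal sketch of a less brittle alternative, the id can be taken from the file name itself. The helper `ocr_path_to_id` and its `prefix`/`suffix` parameters are hypothetical names, and the sketch assumes every OCR file is named `pitt_<id>_PDF.txt` with forward-slash paths, as in the example above.

```python
from pathlib import PurePosixPath


def ocr_path_to_id(path, prefix="pitt_", suffix="_PDF.txt"):
    """Return the item id embedded in an OCR file path.

    Hypothetical helper: assumes files are named <prefix><id><suffix>.
    """
    name = PurePosixPath(path).name  # file name without the directory part
    if not (name.startswith(prefix) and name.endswith(suffix)):
        raise ValueError(f"Unexpected OCR file name: {name}")
    return name[len(prefix):len(name) - len(suffix)]


example = "../../../source-data/bob-nelkin-collection/ocr_new/pitt_MSS_1002_B004_F17_I13_PDF.txt"
print(ocr_path_to_id(example))  # MSS_1002_B004_F17_I13
```

For this path the result matches the slice used in the cell above, but a file that does not follow the naming pattern raises an error instead of silently producing a wrong id.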

In [6]:
# Create dictionary from the tuples

texts_dict = dict(texts)
texts_dict['MSS_1002_B004_F17_I13']

'\ufeff\ni I\nA27 Vermont Avenue\nErie, Pennsylvania 16505\nApril 2A, 1973\n\nMs. Helen Wohlgemuth Secretary of Welfare Harrisburg, Pennsylvania\nDear Ms. Wohlgemuth,\nI am a Special Education teacher and h-ave a severely living at Polk State School. Therefore, I feel more than express my great distress caused by your recent firing of superintendent of Polk State School.\nretarded daughter\nQualified to\nDr. JamES McClelland,\n/\n\n\n\nYour ahrupt action indicates to me that you have very little intimate knowledge of severely and profoundly retarded children and adults. There are individuals who require close supervision and partial confinement in a playpen-like enclosure for their safety and the safety of others. This is not cruel and inhumane, it is sensible action taken for safety. Some of the children and adults do reouire rather heavy medication at various times because of severely aggressive hehavior.\nIt is my opinion, and the opinion of various professional persons that I have 

In [7]:
# Map dictionary to the dataframe based on the id column

bob_df['text'] = bob_df.id.map(texts_dict)
bob_df.head()

Unnamed: 0,id,title,creator,contributor,creation_date,sort_date,display_date,language,type_of_resource,format,extent,genre,abstract,subject,temporal_coverage,geographic_coverage,host,series,container,owner,depositor,collection_id,text
0,MSS_1002_B001_F11_I01,Recent Litigation Memo,,,,,"July 11, 1975",,,,,,A PARC internal memo that summarizes recent li...,,,,Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 11, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History\n ...",collection.341,﻿V\n\nPennsylvania Association for Retarded Ci...
1,MSS_1002_B001_F12_I01,Letter from Peter Polloni to Bob Nelkin,,,,,"March 11, 1975",,,,,,"A letter from Peter Polloni, executive directo...",,,,Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 12, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History\n ...",collection.341,﻿Pennsylvania Association for Retarded Citizen...
2,MSS_1002_B001_F13_I01,Letter to Frank Beal from Families and Friends...,,,,,"August 19, 1976",,,,,,A letter from Families and Friends of Southwes...,,,,Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 13, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History\n ...",collection.341,﻿11\nFAMILIES & FRIENDS OF SOUTHWEST HABILITAT...
3,MSS_1002_B001_F13_I02,Letter from families of patients at Southwest ...,,,,,"July 27, 1976",,,,,,A letter requesting Bob Nelkin's advice on adv...,,,,Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 13, Item 2",Heinz History Center,"Detre Library & Archives, Heinz History\n ...",collection.341,﻿FAMILIES & FRIENDS OF\nSOUTHWEST HABILITATION...
4,MSS_1002_B001_F16_I01,ACC-PARC Recent Benefits to Families Memo,,,,,"March 28, 1977",,,,,,Correspondence from Bob Nelkin to Joan Murdoch...,,,,Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 16, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History\n ...",collection.341,C ommonwealth of Pennsylvania\n\nDepartment of...
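
Mapping a dictionary onto a column leaves a missing value for any id without a matching key, and it silently ignores dictionary keys that match no id. Since 537 files were read but 536 texts end up non-null in the dataframe, a quick set comparison may help to see what did not line up. The sketch below uses toy stand-ins (`df_ids`, `toy_texts`) rather than the real `bob_df.id` and `texts_dict`; the values are invented for illustration.

```python
# Toy stand-ins for bob_df.id and texts_dict (hypothetical values)
df_ids = ["ITEM_01", "ITEM_02", "ITEM_03"]
toy_texts = {
    "ITEM_01": "first text",
    "ITEM_03": "third text",
    "ITEM_99": "orphan text",
}

# Ids in the dataframe that have no OCR text
rows_without_text = sorted(set(df_ids) - set(toy_texts))

# OCR texts whose key matches no dataframe id
texts_without_row = sorted(set(toy_texts) - set(df_ids))

print(rows_without_text)  # ['ITEM_02']
print(texts_without_row)  # ['ITEM_99']
```

In the notebook, the same comparison could be run with `set(bob_df.id)` and `set(texts_dict)`.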


## 3. Missing data

Count the non-null values in each column to see which columns have gaps to fill and which are empty and therefore unnecessary.

In [8]:
# Count the non-null values in each column

[(x,len(bob_df.loc[~bob_df[x].isnull()])) for x in bob_df.columns]

[('id', 542),
 ('title', 542),
 ('creator', 1),
 ('contributor', 0),
 ('creation_date', 0),
 ('sort_date', 0),
 ('display_date', 541),
 ('language', 1),
 ('type_of_resource', 1),
 ('format', 1),
 ('extent', 1),
 ('genre', 0),
 ('abstract', 540),
 ('subject', 1),
 ('temporal_coverage', 0),
 ('geographic_coverage', 1),
 ('host', 541),
 ('series', 541),
 ('container', 541),
 ('owner', 541),
 ('depositor', 542),
 ('collection_id', 542),
 ('text', 536)]
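
The same per-column count can be obtained more compactly with built-in pandas methods. This is a sketch on a toy dataframe (`toy_df` and its values are invented for illustration), not a replacement for the cell above.

```python
import pandas as pd

# Toy dataframe with one full, one sparse, and one empty column (hypothetical values)
toy_df = pd.DataFrame({
    "id": ["a", "b", "c"],
    "creator": [None, None, "someone"],
    "genre": [None, None, None],
})

# notna() marks non-missing cells; sum() counts them per column
non_null_counts = toy_df.notna().sum()

# Convert to plain Python ints for a compact printout
counts = {col: int(n) for col, n in non_null_counts.items()}
print(counts)  # {'id': 3, 'creator': 1, 'genre': 0}
```

`toy_df.count()` returns the same numbers, since `DataFrame.count` also counts non-missing cells per column.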

Look at the columns with exactly one non-null value to see what they contain.

In [9]:
# creator column (are any non-null?)

bob_df.loc[~bob_df.creator.isnull()]

# Only one file, the guide, has a creator

Unnamed: 0,id,title,creator,contributor,creation_date,sort_date,display_date,language,type_of_resource,format,extent,genre,abstract,subject,temporal_coverage,geographic_coverage,host,series,container,owner,depositor,collection_id,text
541,US-QQS-MSS1002,Guide to the Bob Nelkin Collection of ACC-PARC...,"Nelkin, Bob",,,,,English,text,print,4.5 linear feet + shelf,,The Bob Nelkin Collection of ACC-PARC records ...,Social service|||Social welfare|||Disability a...,,Pennsylvania|||Pittsburgh,,,,,"Detre Library & Archives, Heinz History Center",collection.341,


In [10]:
# Remove this guide from the dataframe to keep only actual collection texts

len(bob_df)
bob_df = bob_df.loc[bob_df.id != 'US-QQS-MSS1002'].copy() # .copy() avoids SettingWithCopyWarning on later assignments
len(bob_df)

542

541

In [11]:
# Check null values again

[(x,len(bob_df.loc[~bob_df[x].isnull()])) for x in bob_df.columns]

[('id', 541),
 ('title', 541),
 ('creator', 0),
 ('contributor', 0),
 ('creation_date', 0),
 ('sort_date', 0),
 ('display_date', 541),
 ('language', 0),
 ('type_of_resource', 0),
 ('format', 0),
 ('extent', 0),
 ('genre', 0),
 ('abstract', 539),
 ('subject', 0),
 ('temporal_coverage', 0),
 ('geographic_coverage', 0),
 ('host', 541),
 ('series', 541),
 ('container', 541),
 ('owner', 541),
 ('depositor', 541),
 ('collection_id', 541),
 ('text', 536)]

Every column is now either fully populated or completely empty, except for 'abstract' (539 of 541) and 'text' (536 of 541).

In [12]:
# check missing abstracts

bob_df.loc[bob_df.abstract.isnull()]

Unnamed: 0,id,title,creator,contributor,creation_date,sort_date,display_date,language,type_of_resource,format,extent,genre,abstract,subject,temporal_coverage,geographic_coverage,host,series,container,owner,depositor,collection_id,text
50,MSS_1002_B001_F65_I12,Letter from Mr. W. to Dr. James R. McClelland,,,,,"September 28, 1972",,,,,,,,,,Bob Nelkin Collection of ACC-PARC Records,II. State School and Hospital (SSH) and Interi...,"box 1, Folder 65, Item 12",Heinz History Center,"Detre Library & Archives, Heinz History\n ...",collection.341,﻿
51,MSS_1002_B001_F65_I13,Reforming the State Schools and Interim Care C...,,,,,"November 28, 1972",,,,,,,,,,Bob Nelkin Collection of ACC-PARC Records,II. State School and Hospital (SSH) and Interi...,"box 1, Folder 65, Item 13",Heinz History Center,"Detre Library & Archives, Heinz History\n ...",collection.341,﻿REFORMING OF THE STATE SCHOOLS AND HOSPITALS ...


In [13]:
# Write new temporary abstracts (can be used if desired by the collection owners)

new_abstract1 = "A letter from Mr. W. to Dr. James R. McClelland expressing his opposition to the rule prohibiting the use of phone communication between patients and parents."
new_abstract2 = "A statement from the 'Reforming the State Schools and Interim Care Committee' providing an update on the committee and the challenges it faces."

In [14]:
# Add the new abstracts to the dataframe

bob_df.loc[bob_df.id == 'MSS_1002_B001_F65_I12','abstract'] = new_abstract1
bob_df.loc[bob_df.id == 'MSS_1002_B001_F65_I13','abstract'] = new_abstract2

In [15]:
# Delete empty columns

cols = [x for x in bob_df.columns if len(bob_df.loc[~bob_df[x].isnull()])!=0]
bob_df = bob_df[cols]
bob_df.head()

Unnamed: 0,id,title,display_date,abstract,host,series,container,owner,depositor,collection_id,text
0,MSS_1002_B001_F11_I01,Recent Litigation Memo,"July 11, 1975",A PARC internal memo that summarizes recent li...,Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 11, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History\n ...",collection.341,﻿V\n\nPennsylvania Association for Retarded Ci...
1,MSS_1002_B001_F12_I01,Letter from Peter Polloni to Bob Nelkin,"March 11, 1975","A letter from Peter Polloni, executive directo...",Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 12, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History\n ...",collection.341,﻿Pennsylvania Association for Retarded Citizen...
2,MSS_1002_B001_F13_I01,Letter to Frank Beal from Families and Friends...,"August 19, 1976",A letter from Families and Friends of Southwes...,Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 13, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History\n ...",collection.341,﻿11\nFAMILIES & FRIENDS OF SOUTHWEST HABILITAT...
3,MSS_1002_B001_F13_I02,Letter from families of patients at Southwest ...,"July 27, 1976",A letter requesting Bob Nelkin's advice on adv...,Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 13, Item 2",Heinz History Center,"Detre Library & Archives, Heinz History\n ...",collection.341,﻿FAMILIES & FRIENDS OF\nSOUTHWEST HABILITATION...
4,MSS_1002_B001_F16_I01,ACC-PARC Recent Benefits to Families Memo,"March 28, 1977",Correspondence from Bob Nelkin to Joan Murdoch...,Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 16, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History\n ...",collection.341,C ommonwealth of Pennsylvania\n\nDepartment of...
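
Dropping columns that contain no values at all can also be done with `DataFrame.dropna`. The sketch below shows the idea on a toy dataframe (`toy_df` is invented for illustration); it should behave like the list comprehension above, which keeps every column with at least one non-null value.

```python
import pandas as pd

# Toy dataframe: 'genre' is entirely empty, 'abstract' is only partly empty
toy_df = pd.DataFrame({
    "id": ["a", "b"],
    "genre": [None, None],
    "abstract": ["some text", None],
})

# axis=1 works on columns; how="all" drops a column only if every value is missing
trimmed = toy_df.dropna(axis=1, how="all")

print(list(trimmed.columns))  # ['id', 'abstract']
```

Partly filled columns such as `abstract` are kept, which matters here because the `text` column is legitimately empty for the photo items.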


In [16]:
# text column: check for rows where the text is an empty string or NaN

len(bob_df.loc[bob_df.text == ''])
bob_df.loc[bob_df.text == ''].head()

len(bob_df.loc[bob_df.text.isnull()])
bob_df.loc[bob_df.text.isnull()].head()

0

Unnamed: 0,id,title,display_date,abstract,host,series,container,owner,depositor,collection_id,text


5

Unnamed: 0,id,title,display_date,abstract,host,series,container,owner,depositor,collection_id,text
532,MSS_1002_B004_F56_I01,Highland Park Center Restraining Chair,August 1983,An image of a chair with restraints used on in...,Bob Nelkin Collection of ACC-PARC Records,II. State School and Hospital (SSH) and Interi...,"box 4, Folder 56, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History\n ...",collection.341,
533,MSS_1002_B004_F56_I02,Highland Park Center Resident Helmet,August 1983,An image of a helmet used on an individual wit...,Bob Nelkin Collection of ACC-PARC Records,II. State School and Hospital (SSH) and Interi...,"box 4, Folder 56, Item 2",Heinz History Center,"Detre Library & Archives, Heinz History\n ...",collection.341,
534,MSS_1002_B004_F56_I03,Highland Park Center Resident Helmet (2),August 1983,An image of a helmet used on an individual wit...,Bob Nelkin Collection of ACC-PARC Records,II. State School and Hospital (SSH) and Interi...,"box 4, Folder 56, Item 3",Heinz History Center,"Detre Library & Archives, Heinz History\n ...",collection.341,
535,MSS_1002_B004_F56_I04,Highland Park Center Straitjacket,August 1983,An image of a straitjacket used on individuals...,Bob Nelkin Collection of ACC-PARC Records,II. State School and Hospital (SSH) and Interi...,"box 4, Folder 56, Item 4",Heinz History Center,"Detre Library & Archives, Heinz History\n ...",collection.341,
536,MSS_1002_B004_F56_I05,Highland Park Center Cattle Prod,August 1983,An image of a cattle prod used to shock indivi...,Bob Nelkin Collection of ACC-PARC Records,II. State School and Hospital (SSH) and Interi...,"box 4, Folder 56, Item 5",Heinz History Center,"Detre Library & Archives, Heinz History\n ...",collection.341,


These five items are all photos, so they have no OCR text.

#### Check language

Do not run the cells in this section more often than necessary, as the Google API has a daily limit on queries.

In [17]:
# Use TextBlob, which calls the Google Translate API to detect the language; there is a maximum number of requests per day, so do not run unless needed

def find_lang(text):
    return TextBlob(text).detect_language()

find_lang("This is a test")

'en'

In [18]:
# Apply function to the title column (since the text column is sometimes blank)

bob_df['language'] = bob_df.title.apply(find_lang)
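
Because every call to `find_lang` counts against the daily request limit, one option is to cache results on disk so that re-running the notebook does not query the API again for titles it has already seen. The sketch below is only an illustration: `detect_with_cache`, `fake_detector`, and the cache file name are hypothetical, and the stand-in detector always returns `'en'` so that the example runs offline. Note also that `detect_language` was deprecated in later TextBlob releases, which is another reason to pass the detector in as an argument.

```python
import json
import tempfile
from pathlib import Path


def detect_with_cache(titles, detector, cache_path):
    """Detect a language per title, calling `detector` only for uncached titles."""
    cache_file = Path(cache_path)
    if cache_file.exists():
        cache = json.loads(cache_file.read_text(encoding="utf-8"))
    else:
        cache = {}
    for title in titles:
        if title not in cache:
            cache[title] = detector(title)  # the only place the API would be hit
    cache_file.write_text(json.dumps(cache, ensure_ascii=False, indent=2), encoding="utf-8")
    return [cache[title] for title in titles]


# Stand-in detector that records its calls (hypothetical; replaces find_lang here)
calls = []

def fake_detector(text):
    calls.append(text)
    return "en"


with tempfile.TemporaryDirectory() as tmp:
    cache_path = Path(tmp) / "language_cache.json"
    first = detect_with_cache(["Memo", "Letter"], fake_detector, cache_path)
    second = detect_with_cache(["Memo", "Letter"], fake_detector, cache_path)

print(first, second, len(calls))  # ['en', 'en'] ['en', 'en'] 2
```

The second call returns the same labels without invoking the detector again, so only two requests are made in total.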

In [19]:
bob_df.language.value_counts()

en    536
de      3
fy      1
no      1
Name: language, dtype: int64

In [20]:
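# Inspect the rows tagged as non-English (the titles are in English but consist largely of personal names, which likely explains the mistagging)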

bob_df.loc[bob_df.language == 'de']
bob_df.loc[bob_df.language == 'fy']
bob_df.loc[bob_df.language == 'no']

Unnamed: 0,id,title,display_date,abstract,host,series,container,owner,depositor,collection_id,text,language
260,MSS_1002_B004_F17_I10,Letter from Mrs. Besser to Helene Wohlgemuth,May 1973,"A letter from Mrs. Besser, a parent of a child...",Bob Nelkin Collection of ACC-PARC Records,II. State School and Hospital (SSH) and Interi...,"box 4, Folder 17, Item 10",Heinz History Center,"Detre Library & Archives, Heinz History\n ...",collection.341,﻿\n\n\n\n\n-v:\nt'\nT\n£\n\n5\n1\nv<\n\n.3 .\n...,de
334,MSS_1002_B004_F20_I01,Letter from Marjorie Felder to Helene Wohlgemuth,"April 23, 1973",A letter from Marjorie Felder to Secretary Hel...,Bob Nelkin Collection of ACC-PARC Records,II. State School and Hospital (SSH) and Interi...,"box 4, Folder 20, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History\n ...",collection.341,﻿\n\n4\n\n5F\nt-\n>*\n.><\n\n/\n!\n/\nV? c\nf ...,de
360,MSS_1002_B004_F20_I27,Note from Barbara Fruchter to Helene Wohlgemuth,"April 18, 1973",A brief note from Barbara Fruchter to Secretar...,Bob Nelkin Collection of ACC-PARC Records,II. State School and Hospital (SSH) and Interi...,"box 4, Folder 20, Item 27",Heinz History Center,"Detre Library & Archives, Heinz History\n ...",collection.341,"﻿\n*4t\n>\n»(<\n•\n\nf •\n-,T\n... ,„,,,\n\nc\...",de


Unnamed: 0,id,title,display_date,abstract,host,series,container,owner,depositor,collection_id,text,language
270,MSS_1002_B004_F17_I20,Letter from David Ferleger to Helene Wohlgemuth,"April 26, 1973",A letter from David Ferleger of the Mental Pat...,Bob Nelkin Collection of ACC-PARC Records,II. State School and Hospital (SSH) and Interi...,"box 4, Folder 17, Item 20",Heinz History Center,"Detre Library & Archives, Heinz History\n ...",collection.341,﻿MENTAL PATIENT CIVIL LIBERTIES PROJECT\n121 S...,fy


Unnamed: 0,id,title,display_date,abstract,host,series,container,owner,depositor,collection_id,text,language
265,MSS_1002_B004_F17_I15,Letter from Eleanor Etter to Helene Wohlgemuth,"April 27, 1973",A letter from Eleanor Etter to Helene Wohlgemu...,Bob Nelkin Collection of ACC-PARC Records,II. State School and Hospital (SSH) and Interi...,"box 4, Folder 17, Item 15",Heinz History Center,"Detre Library & Archives, Heinz History\n ...",collection.341,"﻿\n# • •\n\n\ni\n> H / x\n• f K\nApril 27, 197...",no


In [21]:
# Set the language column to 'English' for all rows

bob_df.language = 'English'

## 4. Standardization

Most columns and values are already standardized and consistent. One option would be to add a new column for the type of text (e.g., article, press release) based on the titles. However, the type is not always clear from the title alone.

In [22]:
# Replace the depositor string, which contains a line break, with a single-line version

bob_df.depositor = 'Detre Library & Archives, Heinz History Center'
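
Overwriting the whole column works here because every row has the same depositor. If the values differed from row to row, a more general approach would be to collapse the whitespace inside each string instead. The helper `collapse_whitespace` below is a hypothetical sketch, and the raw value is an assumed example, since the exact whitespace in the CSV is not shown in the notebook output.

```python
def collapse_whitespace(value):
    """Collapse runs of spaces, tabs, and newlines into single spaces."""
    return " ".join(value.split())


# Assumed raw value: the notebook output truncates the original string after the line break
raw = "Detre Library & Archives, Heinz History\n        Center"

print(collapse_whitespace(raw))  # Detre Library & Archives, Heinz History Center
```

In the notebook this could be applied with `bob_df.depositor.map(collapse_whitespace)`, assuming the column has no missing values (it has 541 non-null values out of 541 rows at this point).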

## 5. Save dataframe

In [23]:
# Pickle dataframe

joblib.dump(bob_df,'bob_df_pre-processed.pkl')

['bob_df_pre-processed.pkl']

In [24]:
bob_df.head()

Unnamed: 0,id,title,display_date,abstract,host,series,container,owner,depositor,collection_id,text,language
0,MSS_1002_B001_F11_I01,Recent Litigation Memo,"July 11, 1975",A PARC internal memo that summarizes recent li...,Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 11, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History Center",collection.341,﻿V\n\nPennsylvania Association for Retarded Ci...,English
1,MSS_1002_B001_F12_I01,Letter from Peter Polloni to Bob Nelkin,"March 11, 1975","A letter from Peter Polloni, executive directo...",Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 12, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History Center",collection.341,﻿Pennsylvania Association for Retarded Citizen...,English
2,MSS_1002_B001_F13_I01,Letter to Frank Beal from Families and Friends...,"August 19, 1976",A letter from Families and Friends of Southwes...,Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 13, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History Center",collection.341,﻿11\nFAMILIES & FRIENDS OF SOUTHWEST HABILITAT...,English
3,MSS_1002_B001_F13_I02,Letter from families of patients at Southwest ...,"July 27, 1976",A letter requesting Bob Nelkin's advice on adv...,Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 13, Item 2",Heinz History Center,"Detre Library & Archives, Heinz History Center",collection.341,﻿FAMILIES & FRIENDS OF\nSOUTHWEST HABILITATION...,English
4,MSS_1002_B001_F16_I01,ACC-PARC Recent Benefits to Families Memo,"March 28, 1977",Correspondence from Bob Nelkin to Joan Murdoch...,Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 16, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History Center",collection.341,C ommonwealth of Pennsylvania\n\nDepartment of...,English


[Back to top](#Bob-Nelkin-Collection---Pre-processing)