# Bob Nelkin Collection - Exploratory Data Analysis (EDA)  

<br>

**Notebook author:** Ben Naismith  
**Last modified:** July 15, 2021

<br>

**Notebook contents:**
1. [Initial setup](#1.-Initial-setup)
2. [`source-data` folder](#2.-source-data-folder)
    - [`ead` folder](#ead-folder)
    - [`mods` folder](#mods-folder)
    - [`ocr_new` folder](#ocr_new-folder)
    - [`rel-ext` folder](#rel-ext-folder)
    - [`source-data` summary](#source-data-summary)
3. [`base-layers` folder](#3.-base-layers-folder)

## 1. Initial setup

In [1]:
# Import necessary modules

import pandas as pd
import pprint
from IPython.core.interactiveshell import InteractiveShell
import csv
from ast import literal_eval
import joblib
import xml.etree.ElementTree as ET
import os

In [2]:
# Set preferred notebook format

InteractiveShell.ast_node_interactivity = "all" # Show all output, not just last item
pd.set_option('display.max_columns', 999) # Allow viewing of all columns

In [3]:
# Set root folder to the data-layers parent directory

root = '/Users/Ben/Documents/data-layers'
os.chdir(root)
os.getcwd()

'/Users/Ben/Documents/data-layers'
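The cells that follow navigate with repeated `os.chdir` calls, so each relative path only works if the earlier cells have already run. As a minimal sketch of an alternative, assuming the same folder layout, the paths can be built once with `pathlib` and hidden files such as `.DS_Store` skipped when listing. The helper name `visible_entries` is hypothetical and not part of this notebook.

```python
from pathlib import Path

def visible_entries(folder):
    """Return the sorted names in `folder`, skipping hidden entries such as .DS_Store."""
    return sorted(p.name for p in Path(folder).iterdir() if not p.name.startswith("."))

# Assumed layout, taken from the paths used in this notebook.
root = Path("/Users/Ben/Documents/data-layers")
source_dir = root / "source-data" / "bob-nelkin-collection"
base_dir = root / "base-layers" / "bob-nelkin-collection"

# Only list the folder if it exists on this machine.
if source_dir.is_dir():
    print(visible_entries(source_dir))
```

With absolute paths like these, a cell can be re-run on its own without changing the working directory first.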

## 2. `source-data` folder

### Folder contents

In [4]:
# Move into source-data folder

os.chdir('source-data/bob-nelkin-collection')
os.listdir()

['ead',
 '.DS_Store',
 'ocr',
 'pdf',
 'rels-ext',
 'ocr_new',
 'mods',
 'CLAWS_tagged']

The sections below look at `ead`, `mods`, `ocr_new`, and `rels-ext` in turn (`.DS_Store` is just a hidden file on my local computer).

### `ead` folder

There is only one large file in this folder which contains information about the collection (abstract, history, etc.) and about data objects.

In [5]:
# Move into ead folder

os.chdir('ead')
os.listdir()

['pitt_US-QQS-MSS1002_EAD.xml']

In [6]:
# Reading in .xml file

ead_tree = ET.parse('pitt_US-QQS-MSS1002_EAD.xml')
ead_root = ead_tree.getroot()

In [7]:
ead_root.tag

'{urn:isbn:1-931666-22-9}ead'

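`urn:isbn:1-931666-22-9` is the XML namespace identifier for the EAD 2002 schema rather than a catalogue record, so searching for it as an ISBN returns no records.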

In [8]:
ead_root.attrib

{'{http://www.w3.org/2001/XMLSchema-instance}schemaLocation': 'urn:isbn:1-931666-22-9 http://www.loc.gov/ead/ead.xsd'}

The `schemaLocation` attribute pairs that namespace with the Library of Congress schema file at http://www.loc.gov/ead/ead.xsd. That link is broken.

In [9]:
for child in ead_root:
    print(child.tag, child.attrib)

{urn:isbn:1-931666-22-9}eadheader {'countryencoding': 'iso3166-1', 'dateencoding': 'iso8601', 'langencoding': 'iso639-2b', 'repositoryencoding': 'iso15511'}
{urn:isbn:1-931666-22-9}archdesc {'level': 'collection'}


In [10]:
# Example of subelements (all descendants, not just direct children) in the file (first 20)

list(ead_root.iter())[:20]

[<Element '{urn:isbn:1-931666-22-9}ead' at 0x7fbbc7c9c890>,
 <Element '{urn:isbn:1-931666-22-9}eadheader' at 0x7fbbc7c9c9b0>,
 <Element '{urn:isbn:1-931666-22-9}eadid' at 0x7fbbc7c9ca70>,
 <Element '{urn:isbn:1-931666-22-9}filedesc' at 0x7fbbc7c9cb30>,
 <Element '{urn:isbn:1-931666-22-9}titlestmt' at 0x7fbbc7c9cc50>,
 <Element '{urn:isbn:1-931666-22-9}titleproper' at 0x7fbbc7c9cd70>,
 <Element '{urn:isbn:1-931666-22-9}num' at 0x7fbbc7c9ce30>,
 <Element '{urn:isbn:1-931666-22-9}author' at 0x7fbbc7c9ce90>,
 <Element '{urn:isbn:1-931666-22-9}sponsor' at 0x7fbbc7c9cef0>,
 <Element '{urn:isbn:1-931666-22-9}publicationstmt' at 0x7fbbc7caa050>,
 <Element '{urn:isbn:1-931666-22-9}publisher' at 0x7fbbc7caa110>,
 <Element '{urn:isbn:1-931666-22-9}p' at 0x7fbbc7caa170>,
 <Element '{urn:isbn:1-931666-22-9}date' at 0x7fbbc7caa1d0>,
 <Element '{urn:isbn:1-931666-22-9}address' at 0x7fbbc7caa230>,
 <Element '{urn:isbn:1-931666-22-9}addressline' at 0x7fbbc7caa2f0>,
 <Element '{urn:isbn:1-931666-22-9}ad

In [11]:
# Checking one of these - most don't have text, but 'titleproper' does

list(ead_root.iter())[5].text

'Guide to the Bob Nelkin Collection of ACC-PARC Records, 1953-2000 '

In [12]:
# Total number of subelements

len(list(ead_root.iter()))

9011
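With this many elements, a tally by tag name gives a quicker overview than reading the list. A minimal sketch, assuming the namespace prefix can be dropped by splitting on the closing brace; the helper `tag_counts` and the tiny sample document are illustrative stand-ins, and the real input would be `ead_root`.

```python
from collections import Counter
import xml.etree.ElementTree as ET

def tag_counts(root):
    """Count elements by local tag name (namespace prefix removed)."""
    return Counter(elem.tag.rsplit("}", 1)[-1] for elem in root.iter())

# Tiny stand-in document; the real input would be ead_root from the cells above.
sample = ET.fromstring(
    '<ead xmlns="urn:isbn:1-931666-22-9">'
    "<eadheader><eadid/></eadheader>"
    "<archdesc><did/><did/></archdesc>"
    "</ead>"
)
print(tag_counts(sample).most_common(3))
```

Calling `tag_counts(ead_root).most_common(10)` on the real file would show which element types account for most of the total.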

### `mods` folder 

In [13]:
# Move into mods folder

os.chdir('../mods')
len(os.listdir())
os.listdir()[:5]

542

['pitt_MSS_1002_B004_F12_I08_MODS.xml',
 'pitt_MSS_1002_B004_F12_I09_MODS.xml',
 'pitt_MSS_1002_B002_F47_I01_MODS.xml',
 'pitt_MSS_1002_B004_F38_I01_MODS.xml',
 'pitt_MSS_1002_B003_F52_I10_MODS.xml']

In [14]:
# Explore one of these xml files as the names suggest they are all similar in format

mods_tree = ET.parse('pitt_MSS_1002_B004_F12_I08_MODS.xml')
mods_root = mods_tree.getroot()

In [15]:
mods_root.tag
mods_root.attrib # This link is also broken

'{http://www.loc.gov/mods/v3}mods'

{'{http://www.w3.org/2001/XMLSchema-instance}schemaLocation': 'http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-3.xsd'}

In [16]:
for child in mods_root:
    print(child.tag, child.attrib)

{http://www.loc.gov/mods/v3}titleInfo {}
{http://www.loc.gov/mods/v3}accessCondition {}
{http://www.loc.gov/mods/v3}name {}
{http://www.loc.gov/mods/v3}originInfo {}
{http://www.loc.gov/mods/v3}abstract {}
{http://www.loc.gov/mods/v3}identifier {'type': 'pitt'}
{http://www.loc.gov/mods/v3}relatedItem {'type': 'host'}


In [17]:
# All the subelements

len(list(mods_root.iter()))
list(mods_root.iter())

25

[<Element '{http://www.loc.gov/mods/v3}mods' at 0x7fbbc83050b0>,
 <Element '{http://www.loc.gov/mods/v3}titleInfo' at 0x7fbbc8305290>,
 <Element '{http://www.loc.gov/mods/v3}title' at 0x7fbbc83053b0>,
 <Element '{http://www.loc.gov/mods/v3}accessCondition' at 0x7fbbc83054d0>,
 <Element '{http://www.cdlib.org/inside/diglib/copyrightMD}copyright' at 0x7fbbc8305590>,
 <Element '{http://www.loc.gov/mods/v3}name' at 0x7fbbc8305650>,
 <Element '{http://www.loc.gov/mods/v3}namePart' at 0x7fbbc8305710>,
 <Element '{http://www.loc.gov/mods/v3}role' at 0x7fbbc8305830>,
 <Element '{http://www.loc.gov/mods/v3}roleTerm' at 0x7fbbc8305950>,
 <Element '{http://www.loc.gov/mods/v3}originInfo' at 0x7fbbc8305a10>,
 <Element '{http://www.loc.gov/mods/v3}dateIssued' at 0x7fbbc8305ad0>,
 <Element '{http://www.loc.gov/mods/v3}dateOther' at 0x7fbbc8305b90>,
 <Element '{http://www.loc.gov/mods/v3}abstract' at 0x7fbbc8305c50>,
 <Element '{http://www.loc.gov/mods/v3}identifier' at 0x7fbbc8305d10>,
 <Element '{h

In [18]:
# Determining how much of the name to strip when looking at them in Python

len('{http://www.loc.gov/mods/v3}')

28
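A fixed-length slice only works for tags in the MODS namespace. The `copyright` element listed above belongs to `http://www.cdlib.org/inside/diglib/copyrightMD`, which is longer than 28 characters, so a 28-character slice leaves part of that namespace in the printed name. A sketch of a length-independent alternative, which also collapses the line breaks and indentation found inside some text values; `local_name` and `clean_text` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

def local_name(tag):
    """Drop a leading '{namespace}' from an ElementTree tag, whatever its length."""
    return tag.rsplit("}", 1)[-1]

def clean_text(text):
    """Collapse runs of whitespace; return '' for None or whitespace-only text."""
    return " ".join((text or "").split())

# Stand-in fragment modelled on the MODS record explored in this section.
sample = ET.fromstring(
    '<mods xmlns="http://www.loc.gov/mods/v3" '
    'xmlns:c="http://www.cdlib.org/inside/diglib/copyrightMD">'
    "<accessCondition><c:copyright/></accessCondition>"
    "<name><namePart>Detre Library &amp; Archives, Heinz History\n"
    "        Center</namePart></name>"
    "</mods>"
)

# Print only the elements that actually carry text.
for elem in sample.iter():
    text = clean_text(elem.text)
    if text:
        print(local_name(elem.tag), ":", text)
```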

In [19]:
# Let's look at format of each of these (my own formatting added for clarity)

for elem in mods_tree.iter():
    print(elem.tag[28:],':\n',elem.text,'\n _______ \n')

mods :
 
   
 _______ 

titleInfo :
 
     
 _______ 

title :
 Letter from Charles Peters to Norman Taylor 
 _______ 

accessCondition :
 
     
 _______ 

/diglib/copyrightMD}copyright :
 None 
 _______ 

name :
 
     
 _______ 

namePart :
 Detre Library & Archives, Heinz History
                            Center 
 _______ 

role :
 
       
 _______ 

roleTerm :
 depositor 
 _______ 

originInfo :
 
     
 _______ 

dateIssued :
 1972-11-15/1972-11-15 
 _______ 

dateOther :
 November 15, 1972 
 _______ 

abstract :
 A letter from Charles Peters, executive director of ACC-PARC, to Secretary Norman Taylor expressing his support for Mr. W. and Mike Levine's demands regarding telephone access for residents of Polk State School and Hospital. 
 _______ 

identifier :
 MSS_1002_B004_F12_I08 
 _______ 

relatedItem :
 
     
 _______ 

titleInfo :
 
       
 _______ 

title :
 Bob Nelkin Collection of ACC-PARC Records 
 _______ 

identifier :
 MSS 1002 
 _______ 

originInfo :
 
       

### `ocr_new` folder

The original `ocr` folder contained many blank files, so I used the [Optical Character Recognition (OCR) station](https://www.library.pitt.edu/digital-scholarship-commons) at Pitt, which uses ABBYY FineReader to convert images to text. These new text files are used throughout this extension layer.

In [20]:
# Move into ocr_new folder

os.chdir('../ocr_new')
len(os.listdir())
ocr_list = os.listdir()
ocr_list[:5]

537

['pitt_MSS_1002_B004_F17_I03_PDF.txt',
 'pitt_MSS_1002_B001_F65_I12_PDF.txt',
 'pitt_MSS_1002_B004_F17_I13_PDF.txt',
 'pitt_MSS_1002_B001_F65_I02_PDF.txt',
 'pitt_MSS_1002_B001_F76_I01_PDF.txt']

There are 537 OCR files, but I still need to find out how many objects there are in total to see what percentage have OCR text.
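Once the full list of objects is known, the coverage is a simple ratio. A minimal sketch, assuming both lists have already been reduced to comparable object IDs; the function name `ocr_coverage` and the IDs below are hypothetical and not taken from the collection.

```python
def ocr_coverage(ocr_ids, object_ids):
    """Return the share of object IDs that have an OCR file, as a percentage."""
    objects = set(object_ids)
    if not objects:
        raise ValueError("object_ids is empty")
    covered = objects & set(ocr_ids)
    return 100 * len(covered) / len(objects)

# Hypothetical IDs for illustration only.
objects = ["B001_F01_I01", "B001_F01_I02", "B001_F02_I01", "B001_F02_I02"]
with_ocr = ["B001_F01_I01", "B001_F02_I01", "B001_F02_I02"]
print(f"{ocr_coverage(with_ocr, objects):.1f}% of objects have OCR text")
```

Using sets means an OCR file with no matching object does not inflate the percentage.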

In [21]:
# Check one file as an example

with open("pitt_MSS_1002_B004_F17_I13_PDF.txt", "r") as f:  # Close the file automatically
    ocr = f.read()
print(ocr)

﻿
i I
A27 Vermont Avenue
Erie, Pennsylvania 16505
April 2A, 1973

Ms. Helen Wohlgemuth Secretary of Welfare Harrisburg, Pennsylvania
Dear Ms. Wohlgemuth,
I am a Special Education teacher and h-ave a severely living at Polk State School. Therefore, I feel more than express my great distress caused by your recent firing of superintendent of Polk State School.
retarded daughter
Qualified to
Dr. JamES McClelland,
/



Your ahrupt action indicates to me that you have very little intimate knowledge of severely and profoundly retarded children and adults. There are individuals who require close supervision and partial confinement in a playpen-like enclosure for their safety and the safety of others. This is not cruel and inhumane, it is sensible action taken for safety. Some of the children and adults do reouire rather heavy medication at various times because of severely aggressive hehavior.
It is my opinion, and the opinion of various professional persons that I have discussed this matter w

Even with the updated OCR, the quality of the original scans has led to numerous OCR issues resulting in spelling mistakes, odd characters, and spacing issues.
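The sample output also appears to start with an invisible byte order mark (BOM) on the line before `i I`. A sketch for reading the files without it, assuming they are UTF-8 encoded (the encoding is not stated anywhere in this notebook); `read_ocr` is a hypothetical helper name.

```python
import tempfile
from pathlib import Path

def read_ocr(path):
    """Read an OCR text file, dropping a leading UTF-8 byte order mark if present."""
    # 'utf-8-sig' also decodes plain UTF-8, so files without a BOM are unaffected.
    return Path(path).read_text(encoding="utf-8-sig")

# Demonstration on a temporary file that starts with a BOM.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo_PDF.txt"
    demo.write_bytes(b"\xef\xbb\xbfDear Ms. Wohlgemuth,\n")
    text = read_ocr(demo)

print(repr(text))
```

This only removes the BOM; the spelling and spacing errors introduced by the OCR itself still need separate cleaning.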

### `rel-ext` folder

In [22]:
# Move into rels-ext folder

os.chdir('../rels-ext')
len(os.listdir())
rels_ext_list = os.listdir()
rels_ext_list[:5]

542

['pitt_MSS_1002_B004_F20_I14_RELS-EXT.xml',
 'pitt_MSS_1002_B003_F18_I13_RELS-EXT.xml',
 'pitt_MSS_1002_B004_F20_I13_RELS-EXT.xml',
 'pitt_MSS_1002_B004_F18_I39_RELS-EXT.xml',
 'pitt_MSS_1002_B001_F52_I02_RELS-EXT.xml']

There are 5 more files in this folder than OCR files (likely due to not all objects being text files).

In [23]:
# Standardize list names for easy list comprehension

ocr_list[0]
ocr_list[0][:-8]
ocr_list_temp = [x[:-8] for x in ocr_list]

'pitt_MSS_1002_B004_F17_I03_PDF.txt'

'pitt_MSS_1002_B004_F17_I03'

In [24]:
# The files with no equivalent in OCR folder

[x for x in rels_ext_list if x[:-13] not in ocr_list_temp]

['pitt_MSS_1002_B004_F56_I03_RELS-EXT.xml',
 'pitt_MSS_1002_B004_F56_I04_RELS-EXT.xml',
 'pitt_MSS_1002_B004_F56_I05_RELS-EXT.xml',
 'pitt_MSS_1002_B004_F56_I02_RELS-EXT.xml',
 'pitt_US-QQS-MSS1002_RELS-EXT.xml',
 'pitt_MSS_1002_B004_F56_I01_RELS-EXT.xml']
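The list above has six entries although the folder counts differ by only five (542 vs. 537), which suggests that at least one entry in `ocr_list` has no counterpart in `rels-ext` either. A sketch of a two-way comparison, assuming the `_PDF.txt` and `_RELS-EXT.xml` suffixes seen in the listings; the helper names are hypothetical, and `.DS_Store` appears in the demo list only to illustrate an unmatched entry.

```python
def strip_suffix(name, suffix):
    """Remove `suffix` from `name` if present; otherwise return `name` unchanged."""
    return name[: -len(suffix)] if name.endswith(suffix) else name

def compare_folders(ocr_names, rels_names):
    """Return (IDs only in rels-ext, IDs only in OCR), each sorted."""
    ocr_ids = {strip_suffix(n, "_PDF.txt") for n in ocr_names}
    rels_ids = {strip_suffix(n, "_RELS-EXT.xml") for n in rels_names}
    return sorted(rels_ids - ocr_ids), sorted(ocr_ids - rels_ids)

# Small illustrative lists; the real inputs would be ocr_list and rels_ext_list.
ocr_demo = ["pitt_MSS_1002_B004_F17_I03_PDF.txt", ".DS_Store"]
rels_demo = [
    "pitt_MSS_1002_B004_F17_I03_RELS-EXT.xml",
    "pitt_US-QQS-MSS1002_RELS-EXT.xml",
]
only_rels, only_ocr = compare_folders(ocr_demo, rels_demo)
print("Only in rels-ext:", only_rels)
print("Only in OCR:", only_ocr)
```

Checking the suffix before stripping avoids mangling names that do not follow the expected pattern, which a fixed slice such as `x[:-8]` would do silently.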

In [25]:
# Check the one different file name

with open('pitt_US-QQS-MSS1002_RELS-EXT.xml', "r") as f:  # Close the file automatically
    rel_ext1 = f.read()
print(rel_ext1)


<rdf:RDF xmlns:fedora="info:fedora/fedora-system:def/relations-external#" xmlns:fedora-model="info:fedora/fedora-system:def/model#" xmlns:islandora="http://islandora.ca/ontology/relsext#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="info:fedora/pitt:US-QQS-MSS1002">
    <fedora-model:hasModel rdf:resource="info:fedora/islandora:findingAidCModel"></fedora-model:hasModel>
    <isMemberOfSite xmlns="http://digital.library.pitt.edu/ontology/relations#" rdf:resource="info:fedora/pitt:site.historic-pittsburgh"></isMemberOfSite>
    <fedora:isMemberOfCollection rdf:resource="info:fedora/pitt:collection.341"></fedora:isMemberOfCollection>
  </rdf:Description>
</rdf:RDF>



This file describes the finding aid in Fedora (its model is `islandora:findingAidCModel`); it is not a data object file.

In [26]:
# Explore one of the other xml files which correspond to the OCR files

rel_ext_tree = ET.parse('pitt_MSS_1002_B004_F20_I13_RELS-EXT.xml')
rel_ext_root = rel_ext_tree.getroot()

In [27]:
rel_ext_root.tag
rel_ext_root.attrib

'{http://www.w3.org/1999/02/22-rdf-syntax-ns#}RDF'

{}

In [28]:
for child in rel_ext_root:
    print(child.tag, child.attrib)

{http://www.w3.org/1999/02/22-rdf-syntax-ns#}Description {'{http://www.w3.org/1999/02/22-rdf-syntax-ns#}about': 'info:fedora/pitt:MSS_1002_B004_F20_I13'}


In [29]:
# All the subelements

len(list(rel_ext_root.iter()))
list(rel_ext_root.iter())

8

[<Element '{http://www.w3.org/1999/02/22-rdf-syntax-ns#}RDF' at 0x7fbbc83464d0>,
 <Element '{http://www.w3.org/1999/02/22-rdf-syntax-ns#}Description' at 0x7fbbc83466b0>,
 <Element '{http://digital.library.pitt.edu/ontology/relations#}isMemberOfSite' at 0x7fbbc8346830>,
 <Element '{info:fedora/fedora-system:def/model#}hasModel' at 0x7fbbc8346950>,
 <Element '{http://digital.library.pitt.edu/ontology/relations#}isMemberOfSite' at 0x7fbbc8346a10>,
 <Element '{info:fedora/fedora-system:def/relations-external#}isMemberOfCollection' at 0x7fbbc8346a70>,
 <Element '{http://islandora.ca/ontology/relsext#}deferDerivatives' at 0x7fbbc8346b30>,
 <Element '{info:fedora/fedora-system:def/relations-external#}isMemberOf' at 0x7fbbc8346bf0>]

In [30]:
# Let's look at format of each of these (my own formatting added for clarity)

for elem in rel_ext_tree.iter():
    print(elem.tag,':\n',elem.text,'\n _______ \n')

{http://www.w3.org/1999/02/22-rdf-syntax-ns#}RDF :
 
   
 _______ 

{http://www.w3.org/1999/02/22-rdf-syntax-ns#}Description :
 
     
 _______ 

{http://digital.library.pitt.edu/ontology/relations#}isMemberOfSite :
 None 
 _______ 

{info:fedora/fedora-system:def/model#}hasModel :
 None 
 _______ 

{http://digital.library.pitt.edu/ontology/relations#}isMemberOfSite :
 None 
 _______ 

{info:fedora/fedora-system:def/relations-external#}isMemberOfCollection :
 None 
 _______ 

{http://islandora.ca/ontology/relsext#}deferDerivatives :
 true 
 _______ 

{info:fedora/fedora-system:def/relations-external#}isMemberOf :
 None 
 _______ 



There appears to be minimal metadata for each text.
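Most of these elements print `None` because RELS-EXT keeps its values in `rdf:resource` attributes rather than in element text, as the raw finding-aid file printed earlier shows. A sketch that collects predicate and value pairs from those attributes; the function name `relations` is hypothetical, and the sample is an abridged copy of that finding-aid file.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

def relations(root):
    """Return (predicate, value) pairs for every statement under rdf:Description."""
    pairs = []
    for description in root.iter(f"{{{RDF_NS}}}Description"):
        for elem in description:
            predicate = elem.tag.rsplit("}", 1)[-1]
            # Prefer the rdf:resource attribute; fall back to the element text.
            value = elem.get(f"{{{RDF_NS}}}resource") or (elem.text or "").strip()
            pairs.append((predicate, value))
    return pairs

# Abridged copy of the finding-aid RELS-EXT file printed earlier in this section.
sample = ET.fromstring("""
<rdf:RDF xmlns:fedora="info:fedora/fedora-system:def/relations-external#" xmlns:fedora-model="info:fedora/fedora-system:def/model#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="info:fedora/pitt:US-QQS-MSS1002">
    <fedora-model:hasModel rdf:resource="info:fedora/islandora:findingAidCModel"></fedora-model:hasModel>
    <isMemberOfSite xmlns="http://digital.library.pitt.edu/ontology/relations#" rdf:resource="info:fedora/pitt:site.historic-pittsburgh"></isMemberOfSite>
    <fedora:isMemberOfCollection rdf:resource="info:fedora/pitt:collection.341"></fedora:isMemberOfCollection>
  </rdf:Description>
</rdf:RDF>
""".strip())

for predicate, value in relations(sample):
    print(predicate, "->", value)
```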

### `source-data` summary

- `ead` folder: contains one large xml file with extensive metadata about the collection
- `ocr_new` folder: contains 537 text files. The quality of OCR appears to be variable, leading to spelling mistakes, odd characters, and spacing issues.
- `mods` and `rels-ext` folders: each contains 542 .xml files with metadata which correspond to the OCR text files. In `rels-ext`, six files have no OCR counterpart: five items in box 4, folder 56, and one file for the 'Finding aid content model'. These figures are in line with the description [online](https://historicpittsburgh.org/collection/nelkin-acc-parc-records).

## 3. `base-layers` folder

### Folder contents

In [31]:
# Move into base-layers folder

os.chdir('../../../base-layers/bob-nelkin-collection')
os.listdir()

['bob-nelkin-collection_collection-base-layer.csv',
 'bob-nelkin-collection_item-base-layer_archival.csv']

### `collection-base-layer.csv`

In [32]:
# Read in base_layer.csv

base_df = pd.read_csv("bob-nelkin-collection_collection-base-layer.csv")
base_df.head()

Unnamed: 0,finding_aid_id,finding_aid_title,finding_aid_creator,finding_aid_creation_date,finding_aid_publisher,finding_aid_pub_date,acquisition_number,collection_title,collection_creator,collection_language,collection_extent,collection_temp_coverage,collection_scope_content,biography_history,collection_abstract,subject_headings,related_material,repository,preferred_citation,conditions_governing_use,collection_id
0,US-QQS-MSS1002,Guide to the Bob Nelkin Collection of ACC-PARC...,The guide to this collection was written by Si...,2020-08-11 17:14:30 -0400,Heinz History Center,,MSS 1002,Bob Nelkin Collection of ACC-PARC Records,,,4.5 linear feet|||(5 boxes + shelf),1953-2000,The Bob Nelkin Collection of ACC-PARC Records ...,"First formed in 1951, the Allegheny County Cha...","First formed in 1951, the Allegheny County Cha...",Social service--Pennsylvania--Pittsburgh|||Soc...,Pennsylvania Association for Retarded Children...,Heinz History Center,"Bob Nelkin Collection of ACC-PARC Records, 195...",Property rights reside with the Senator John H...,collection.341


In [33]:
base_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 21 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   finding_aid_id             1 non-null      object 
 1   finding_aid_title          1 non-null      object 
 2   finding_aid_creator        1 non-null      object 
 3   finding_aid_creation_date  1 non-null      object 
 4   finding_aid_publisher      1 non-null      object 
 5   finding_aid_pub_date       0 non-null      float64
 6   acquisition_number         1 non-null      object 
 7   collection_title           1 non-null      object 
 8   collection_creator         0 non-null      float64
 9   collection_language        0 non-null      float64
 10  collection_extent          1 non-null      object 
 11  collection_temp_coverage   1 non-null      object 
 12  collection_scope_content   1 non-null      object 
 13  biography_history          1 non-null      object 
 14

Only 1 row, with 3 null values (`finding_aid_pub_date`, `collection_creator`, and `collection_language`).

### `base-layer_archival.csv`

In [34]:
# Read in item-base-layer_archival.csv

archive_df = pd.read_csv('bob-nelkin-collection_item-base-layer_archival.csv')
archive_df.head()

Unnamed: 0,id,title,creator,contributor,creation_date,sort_date,display_date,language,type_of_resource,format,extent,genre,abstract,subject,temporal_coverage,geographic_coverage,host,series,container,owner,depositor,collection_id
0,MSS_1002_B001_F11_I01,Recent Litigation Memo,,,,,"July 11, 1975",,,,,,A PARC internal memo that summarizes recent li...,,,,Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 11, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History\n ...",collection.341
1,MSS_1002_B001_F12_I01,Letter from Peter Polloni to Bob Nelkin,,,,,"March 11, 1975",,,,,,"A letter from Peter Polloni, executive directo...",,,,Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 12, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History\n ...",collection.341
2,MSS_1002_B001_F13_I01,Letter to Frank Beal from Families and Friends...,,,,,"August 19, 1976",,,,,,A letter from Families and Friends of Southwes...,,,,Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 13, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History\n ...",collection.341
3,MSS_1002_B001_F13_I02,Letter from families of patients at Southwest ...,,,,,"July 27, 1976",,,,,,A letter requesting Bob Nelkin's advice on adv...,,,,Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 13, Item 2",Heinz History Center,"Detre Library & Archives, Heinz History\n ...",collection.341
4,MSS_1002_B001_F16_I01,ACC-PARC Recent Benefits to Families Memo,,,,,"March 28, 1977",,,,,,Correspondence from Bob Nelkin to Joan Murdoch...,,,,Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 16, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History\n ...",collection.341
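The `id` and `container` columns carry the same information: `MSS_1002_B001_F11_I01` sits alongside "box 1, folder 11, Item 1". A sketch for splitting an ID into its numeric parts, assuming every item ID follows the `MSS_<collection>_B<box>_F<folder>_I<item>` pattern seen in these rows; `parse_item_id` is a hypothetical helper.

```python
import re

ID_PATTERN = re.compile(r"MSS_(\d+)_B(\d+)_F(\d+)_I(\d+)")

def parse_item_id(item_id):
    """Split an item ID into collection, box, folder, and item numbers."""
    match = ID_PATTERN.search(item_id)
    if match is None:
        return None  # e.g. the finding aid ID, which has a different shape
    collection, box, folder, item = (int(g) for g in match.groups())
    return {"collection": collection, "box": box, "folder": folder, "item": item}

print(parse_item_id("MSS_1002_B001_F11_I01"))
print(parse_item_id("US-QQS-MSS1002"))
```

Because the pattern is searched rather than anchored, the same function also works on the file names in the `mods`, `ocr_new`, and `rels-ext` folders.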


In [35]:
archive_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 542 entries, 0 to 541
Data columns (total 22 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   id                   542 non-null    object 
 1   title                542 non-null    object 
 2   creator              1 non-null      object 
 3   contributor          0 non-null      float64
 4   creation_date        0 non-null      float64
 5   sort_date            0 non-null      float64
 6   display_date         541 non-null    object 
 7   language             1 non-null      object 
 8   type_of_resource     1 non-null      object 
 9   format               1 non-null      object 
 10  extent               1 non-null      object 
 11  genre                0 non-null      float64
 12  abstract             540 non-null    object 
 13  subject              1 non-null      object 
 14  temporal_coverage    0 non-null      float64
 15  geographic_coverage  1 non-null      obj

- 1 row for each text + the finding aid
- many null/near-null columns
- 2 rows have no abstract (540 of 542 non-null) and 1 row has no `display_date`
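The counts in these notes can be read off `info()`, but a sorted summary makes the sparse columns easier to spot. A minimal sketch using a toy frame with made-up values; `null_summary` is a hypothetical helper, and the real input would be `archive_df`.

```python
import pandas as pd

def null_summary(df):
    """Return per-column missing counts and shares, most incomplete first."""
    summary = pd.DataFrame({
        "missing": df.isna().sum(),
        "share": df.isna().mean().round(3),
    })
    return summary.sort_values("missing", ascending=False)

# Toy frame with made-up values; the real input would be archive_df.
demo = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "abstract": ["text", None, "text", "text"],
    "genre": [None, None, None, None],
})
print(null_summary(demo))

# Columns with no values at all can be dropped in one call.
trimmed = demo.dropna(axis=1, how="all")
print(list(trimmed.columns))
```

Whether to drop the empty columns depends on whether other collections in the same base-layer format populate them, which this notebook does not establish.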

[Back to top](#Bob-Nelkin-Collection---Exploratory-Data-Analysis-(EDA))