# Bob Nelkin Collection - Processing

<br>

**Notebook author:** Ben Naismith  
**Last modified:** July 15, 2021

<br>

**Notebook contents:**
1. [Initial setup](#1.-Initial-setup)
2. [Text cleaning](#2.-Text-cleaning)
3. [Tokenization](#3.-Tokenization)
4. [POS tagging and lemmatization](#4.-POS-tagging-and-lemmatization)
5. [Spelling correction](#5.-Spelling-correction)
6. [Genre tagging](#6.-Genre-tagging)
7. [Wrap-up](#7.-Wrap-up)

## 1. Initial setup

In [1]:
# Import necessary modules

import pandas as pd
import pprint
from IPython.core.interactiveshell import InteractiveShell
import joblib
import re
import nltk
from nltk import pos_tag_sents
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer 
from nltk import FreqDist
import shutil
from CLAWSTag import Tagger
import random
from symspellpy import SymSpell
from symspellpy import Verbosity
import pkg_resources

In [2]:
# Set preferred notebook format

InteractiveShell.ast_node_interactivity = "all" # Show all output, not just last item
pd.set_option('display.max_columns', 999) # Allow viewing of all columns

In [3]:
# Read in pre-processed dataframe

bob_df = joblib.load('../pre-processing/bob_df_pre-processed.pkl')
bob_df.head(2)

Unnamed: 0,id,title,display_date,abstract,host,series,container,owner,depositor,collection_id,text,language
0,MSS_1002_B001_F11_I01,Recent Litigation Memo,"July 11, 1975",A PARC internal memo that summarizes recent li...,Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 11, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History Center",collection.341,﻿V\n\nPennsylvania Association for Retarded Ci...,English
1,MSS_1002_B001_F12_I01,Letter from Peter Polloni to Bob Nelkin,"March 11, 1975","A letter from Peter Polloni, executive directo...",Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 12, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History Center",collection.341,﻿Pennsylvania Association for Retarded Citizen...,English


## 2. Text cleaning

Prior to tokenizing the texts, I will remove whitespace and other formatting strings.

In [4]:
# Replace NaN with empty strings

bob_df.text = bob_df.text.fillna('')

In [5]:
# Create function to remove unwanted whitespace characters

def clean_text(text):
    clean_text = re.sub(r"\s+", " ", text) #Strip 1 or more white space characters using regex
    clean_text = re.sub(r"\ufeff", " ", clean_text) #Remove \ufeff string seen in many texts
    clean_text = clean_text.strip() # remove trailing whitespace at start or end of string
    return clean_text

In [6]:
# Test function

# Create examples
foo = bob_df.text[0][:200]
foo
foo2 = ''

# Apply function to examples
clean_text(foo)
clean_text(foo2)

'\ufeffV\n\nPennsylvania Association for Retarded Citizens\nr\nt5CO NORTH SECOND STREET • HARRIS8URG, PA. 1 71C2\nTEL: (717) 234-2621\n: -J:\nMEMO TO:\nOfficers\nPARC Residential Services Committee\nAll Regional Resi'

'V Pennsylvania Association for Retarded Citizens r t5CO NORTH SECOND STREET • HARRIS8URG, PA. 1 71C2 TEL: (717) 234-2621 : -J: MEMO TO: Officers PARC Residential Services Committee All Regional Resi'

''

In [7]:
# Apply function to dataframe

bob_df['toks'] = bob_df.text.apply(clean_text) # Will become the tokens column later

## 3. Tokenization

Tokenization using NLTK tools.

In [8]:
# Create tokens column based on cleaned text from previous section

bob_df["toks"] = bob_df.toks.apply(nltk.word_tokenize)

In [9]:
# Add number of words column - different function needed so that punctuation is not counted as words

def re_tokenize(text):
    return re.findall(r"[A-Za-z0-9_]+", text.lower())

bob_df['len'] = bob_df.text.apply(re_tokenize).apply(len)

In [10]:
bob_df.head(2)

Unnamed: 0,id,title,display_date,abstract,host,series,container,owner,depositor,collection_id,text,language,toks,len
0,MSS_1002_B001_F11_I01,Recent Litigation Memo,"July 11, 1975",A PARC internal memo that summarizes recent li...,Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 11, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History Center",collection.341,﻿V\n\nPennsylvania Association for Retarded Ci...,English,"[V, Pennsylvania, Association, for, Retarded, ...",3042
1,MSS_1002_B001_F12_I01,Letter from Peter Polloni to Bob Nelkin,"March 11, 1975","A letter from Peter Polloni, executive directo...",Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 12, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History Center",collection.341,﻿Pennsylvania Association for Retarded Citizen...,English,"[Pennsylvania, Association, for, Retarded, Cit...",242


## 4. POS tagging and lemmatization

#### NLTK

Uses the [Penn Treebank tagset](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).
The NLTK lemmatizer was selected due to its ease of use and public availability, though other bespoke lemmatizers would provide greater lemmatization accuracy.

In [11]:
# POS tag tokens - NLTK

bob_df['POS_NLTK'] = pos_tag_sents(bob_df.toks.tolist())

In [12]:
bob_df.head(2)

Unnamed: 0,id,title,display_date,abstract,host,series,container,owner,depositor,collection_id,text,language,toks,len,POS_NLTK
0,MSS_1002_B001_F11_I01,Recent Litigation Memo,"July 11, 1975",A PARC internal memo that summarizes recent li...,Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 11, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History Center",collection.341,﻿V\n\nPennsylvania Association for Retarded Ci...,English,"[V, Pennsylvania, Association, for, Retarded, ...",3042,"[(V, NNP), (Pennsylvania, NNP), (Association, ..."
1,MSS_1002_B001_F12_I01,Letter from Peter Polloni to Bob Nelkin,"March 11, 1975","A letter from Peter Polloni, executive directo...",Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 12, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History Center",collection.341,﻿Pennsylvania Association for Retarded Citizen...,English,"[Pennsylvania, Association, for, Retarded, Cit...",242,"[(Pennsylvania, NNP), (Association, NNP), (for..."


In [13]:
# Lemmatize with POS tag

# Define function for lemmatizing a word
def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()

# Define function for lemmatizing a text
def lemmatize_text(toks):
    return [lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in toks]

In [14]:
# Lemmatize texts

bob_df['lemmas_NLTK'] = bob_df.toks.apply(lemmatize_text)

In [15]:
# Combine toks, lemmas, POS into one column

bob_df['tok_lem_POS_NLTK'] = bob_df[['toks','lemmas_NLTK','POS_NLTK']].apply(lambda x: tuple(zip(x[0],x[1],[y[1] for y in x[2]])), axis=1)
bob_df['tok_lem_POS_NLTK'] = bob_df['tok_lem_POS_NLTK'].apply(list)

del bob_df['toks']
del bob_df['lemmas_NLTK']
del bob_df['POS_NLTK']

In [16]:
bob_df.head(2)

Unnamed: 0,id,title,display_date,abstract,host,series,container,owner,depositor,collection_id,text,language,len,tok_lem_POS_NLTK
0,MSS_1002_B001_F11_I01,Recent Litigation Memo,"July 11, 1975",A PARC internal memo that summarizes recent li...,Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 11, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History Center",collection.341,﻿V\n\nPennsylvania Association for Retarded Ci...,English,3042,"[(V, V, NNP), (Pennsylvania, Pennsylvania, NNP..."
1,MSS_1002_B001_F12_I01,Letter from Peter Polloni to Bob Nelkin,"March 11, 1975","A letter from Peter Polloni, executive directo...",Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 12, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History Center",collection.341,﻿Pennsylvania Association for Retarded Citizen...,English,242,"[(Pennsylvania, Pennsylvania, NNP), (Associati..."


#### CLAWS7

Uses the [CLAWS7 tagset](http://ucrel.lancs.ac.uk/claws7tags.html).

This wrapper uses the free online tagger from the University of Lancaster. Each text must be in a separate txt file.

In [17]:
# Copy ocr folder

# Create function for copying directory
def copyDirectory(src, dest):
    try:
        shutil.copytree(src, dest)
    except shutil.Error as e:
        print('Directory not copied. Error: %s' % e)
    except OSError as e:
        print('Directory not copied. Error: %s' % e)

# Copies directory, but only if one doesn't already exist
copyDirectory("../../../source-data/bob-nelkin-collection/ocr_new/", "../../../source-data/bob-nelkin-collection/CLAWS_tagged/")

Directory not copied. Error: [Errno 17] File exists: '../../../source-data/bob-nelkin-collection/CLAWS_tagged/'


In [18]:
# Change extensions from .asc to .txt - no need with ocr_new folder as already txt files

#folder = '../../source-data/bob-nelkin-collection/CLAWS_tagged/'

#for filename in glob.iglob(os.path.join(folder, '*.asc')):
#    os.rename(filename, filename[:-4] + '.txt')

In [19]:
# Move into text folder

%cd '../../../source-data/bob-nelkin-collection/CLAWS_tagged/'

/Users/Ben/Documents/data-layers/source-data/bob-nelkin-collection/CLAWS_tagged


In [20]:
#%%capture

# Tag texts (output suppressed)

#Tagger.Postag('c7', 'horiz').start()

# This takes a while, so no need to re-run every time

In [21]:
# Remove untagged txt files

to_remove = glob.glob('pitt*.txt', recursive=True)

for file in to_remove:
    os.remove(file)

<IPython.core.display.Javascript object>

In [22]:
# Read in tagged CLAWS texts as one list

CLAWS_texts = []

for file in glob.glob("*.txt"):
    f = open(file, "r")
    text = [file,f.read()]
    CLAWS_texts.append(text)

<IPython.core.display.Javascript object>

In [23]:
CLAWS_texts[3] # Example
len(CLAWS_texts)

['Tagged_pitt_MSS_1002_B004_F20_I08_PDF.txt',
 '\nt_ZZ1 WEDNESDAY_NPD1 ,_, AfRIL_NP1 25_MC ,_, 1973_MC Association_NN1 Backs_VVZ \nMcClellan_NP1 Bans_NN2 Use_VV0 Post-Gozette_JJ Harrlsburo_NP1 Burea_NP1 \nHARRISBURGThe_NP1 association_NN1 of_IO State_NN1 Mental_JJ Hospital_NN1 \nPhysicians_NN2 has_VHZ called_VVN for_IF the_AT reinstatement_NN1 of_IO \nDr._NNB Janies_NP1 McClelland_NP1 as_CSA superintendent_NN1 of_IO the_AT \nPolk_NP1 State_NN1 School_NN1 an_AT1 Hospital_NN1 Meanwhile_RR Welfare_NN1 \nSecretary_NN1 Helen_NP1 Wohlgemuth_NP1 issued_VVD an_AT1 order_NN1 banning_VVG \nthe_AT use_NN1 of_IO locked_JJ cages_NN2 or_CC pens_NN2 for_IF the_AT \nincarceration_NN1 of_IO residents_NN2 at_II all_DB other_JJ state_NN1 \ninstitution_NN1 under_II her_APPGE control_NN1 or_CC supervision_NN1 ._. \nThese_DD2 institutions_NN2 house_VV0 some_DD 40.000_MC residents_NN2 The_AT \nuse_NN1 of_IO five_MC feet_NN2 by_II fiv_NN1 feet_NN2 by_II five_MC feet_NN2 \npens_NN2 at_II Polk_NP1 led_VVD to_II

537

In [24]:
# Move back to working directory

%cd '../../../extension-layers/bob-nelkin-collection/processing'

/Users/Ben/Documents/data-layers/extension-layers/bob-nelkin-collection/processing


In [25]:
# Clean up CLAWS tagged texts

# Shorten identifier for easier matching
CLAWS_clean = [(x[0][12:-8],x[1]) for x in CLAWS_texts]

# Remove new line characters, split on whitespace, and remove identifier at end
CLAWS_clean = [(x[0],x[1].replace('\n', '').split(' ')[:-2]) for x in CLAWS_clean]

# Remove strings with no tags
CLAWS_clean = [(x[0],[y for y in x[1] if '_' in y]) for x in CLAWS_clean]

# Change tags into tuples
CLAWS_clean = [(x[0],[tuple(y.split('_')) for y in x[1]]) for x in CLAWS_clean]

In [26]:
# Check example output

CLAWS_clean[3]

('MSS_1002_B004_F20_I08',
 [('t', 'ZZ1'),
  ('WEDNESDAY', 'NPD1'),
  (',', ','),
  ('AfRIL', 'NP1'),
  ('25', 'MC'),
  (',', ','),
  ('1973', 'MC'),
  ('Association', 'NN1'),
  ('Backs', 'VVZ'),
  ('McClellan', 'NP1'),
  ('Bans', 'NN2'),
  ('Use', 'VV0'),
  ('Post-Gozette', 'JJ'),
  ('Harrlsburo', 'NP1'),
  ('Burea', 'NP1'),
  ('HARRISBURGThe', 'NP1'),
  ('association', 'NN1'),
  ('of', 'IO'),
  ('State', 'NN1'),
  ('Mental', 'JJ'),
  ('Hospital', 'NN1'),
  ('Physicians', 'NN2'),
  ('has', 'VHZ'),
  ('called', 'VVN'),
  ('for', 'IF'),
  ('the', 'AT'),
  ('reinstatement', 'NN1'),
  ('of', 'IO'),
  ('Dr.', 'NNB'),
  ('Janies', 'NP1'),
  ('McClelland', 'NP1'),
  ('as', 'CSA'),
  ('superintendent', 'NN1'),
  ('of', 'IO'),
  ('the', 'AT'),
  ('Polk', 'NP1'),
  ('State', 'NN1'),
  ('School', 'NN1'),
  ('an', 'AT1'),
  ('Hospital', 'NN1'),
  ('Meanwhile', 'RR'),
  ('Welfare', 'NN1'),
  ('Secretary', 'NN1'),
  ('Helen', 'NP1'),
  ('Wohlgemuth', 'NP1'),
  ('issued', 'VVD'),
  ('an', 'AT1'),
  (

In [27]:
# Create dict and add as column

CLAWS_dict = dict(CLAWS_clean)
bob_df['tok_CLAWS'] = bob_df.id.map(CLAWS_dict)
bob_df.head(1)

Unnamed: 0,id,title,display_date,abstract,host,series,container,owner,depositor,collection_id,text,language,len,tok_lem_POS_NLTK,tok_CLAWS
0,MSS_1002_B001_F11_I01,Recent Litigation Memo,"July 11, 1975",A PARC internal memo that summarizes recent li...,Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 11, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History Center",collection.341,﻿V\n\nPennsylvania Association for Retarded Ci...,English,3042,"[(V, V, NNP), (Pennsylvania, Pennsylvania, NNP...","[(Pennsylvania, NP1), (Association, NN1), (for..."


In [28]:
# Create lemma_CLAWS column using COCA data

coca_freq_dict = joblib.load('../../../../COCA_data/COCA_2020_lemma_freq_dict.pkl')

In [29]:
# Sample from dictionary

random.sample(coca_freq_dict.items(), 2)

[(('drag-and-drop', 'j'), 156), (('sinuously', 'r'), 62)]

COCA data is from a paid license. Please see [Dr Na-Rae Han](https://www.linguistics.pitt.edu/people/na-rae-han) for access information for Pitt students and faculty or the [COCA website](https://www.wordfrequency.info/purchase.asp) for purchase information.

In [30]:
# Remove tagged text with only 1 token causing issues

bob_df.loc[bob_df.tok_CLAWS.str.len()==1]

bob_df.loc[351,'tok_CLAWS']=['']

Unnamed: 0,id,title,display_date,abstract,host,series,container,owner,depositor,collection_id,text,language,len,tok_lem_POS_NLTK,tok_CLAWS
261,MSS_1002_B004_F17_I11,Letter from Mrs. E. Wilkinson to Helene Wohlge...,"April 26, 1973",A letter from Mrs. E. Wilkinson thanking Secre...,Bob Nelkin Collection of ACC-PARC Records,II. State School and Hospital (SSH) and Interi...,"box 4, Folder 17, Item 11",Heinz History Center,"Detre Library & Archives, Heinz History Center",collection.341,﻿i/,English,1,"[(i/, i/, NN)]","[(i/, FU)]"
289,MSS_1002_B004_F18_I15,Letter from Edward and Elizabeth M.,c. 1973,"A letter from Edward and Elizabeth M., parents...",Bob Nelkin Collection of ACC-PARC Records,II. State School and Hospital (SSH) and Interi...,"box 4, Folder 18, Item 15",Heinz History Center,"Detre Library & Archives, Heinz History Center",collection.341,﻿V,English,1,"[(V, V, NN)]","[(V, ZZ1)]"
292,MSS_1002_B004_F18_I18,Letter from George S. to Helene Wohlgemuth,"May 16, 1973","A letter from George S., a parent of a Polk St...",Bob Nelkin Collection of ACC-PARC Records,II. State School and Hospital (SSH) and Interi...,"box 4, Folder 18, Item 18",Heinz History Center,"Detre Library & Archives, Heinz History Center",collection.341,"﻿i\n\nfay 16, 1<?73-\nI\npAd Helene Wolyemudh ...",English,250,"[(i, i, NN), (fay, fay, VBD), (16, 16, CD), (,...","[(fay, VV0)]"
428,MSS_1002_B004_F22_I39,"Letter from a ""Penn Citizen"" to Helene Wohlgemuth","April 21, 1973","An anonymous letter signed by a ""Penn Citizen""...",Bob Nelkin Collection of ACC-PARC Records,II. State School and Hospital (SSH) and Interi...,"box 4, Folder 22, Item 39",Heinz History Center,"Detre Library & Archives, Heinz History Center",collection.341,﻿\n1/\nZDcg-acXx\n/Zx2 a\ni^p-eiU/Xa t ^9\t>Dv...,English,192,"[(1/, 1/, CD), (ZDcg-acXx, ZDcg-acXx, JJ), (/Z...","[(1, MC1)]"


In [31]:
def get_short_POS(tok_CLAWS_list):
    tok_CLAWS_list = [x for x in tok_CLAWS_list if len(x[1])!=0]
    if len(tok_CLAWS_list) !=0:
        short_POS = [(x[0],x[1][0].lower()) for x in tok_CLAWS_list]
    else:
        short_POS = []
    return short_POS

In [32]:
# Replace NaN values for the photos

bob_df.tok_CLAWS = bob_df.tok_CLAWS.fillna('')

In [33]:
# Keep only first letter of CLAWS POS tags

bob_df.tok_CLAWS = bob_df.tok_CLAWS.apply(get_short_POS)

In [34]:
# Check all COCA lemma dict POS

COCA_POS = sorted(list(set([x[1] for x in coca_freq_dict.keys()])))
COCA_POS

['a', 'c', 'd', 'e', 'i', 'j', 'm', 'n', 'p', 'r', 't', 'u', 'v', 'x']

In [35]:
# Remove puncuation from CLAWS texts

bob_df.tok_CLAWS = bob_df.tok_CLAWS.apply(lambda row: [x for x in row if x[1] in COCA_POS])

In [36]:
# Create CLAWS lemma column

# First lower case all toks (as in the word_lemma dict)
bob_df['lemmas_CLAWS'] = bob_df.tok_CLAWS.apply(lambda row: [(x[0].lower(),x[1]) for x in row])

In [37]:
# Then map dict

coca_word_lemma_dict = joblib.load('../../../../COCA_data/COCA_2020_word_lemma_dict.pkl')

bob_df.lemmas_CLAWS = bob_df.lemmas_CLAWS.apply(
    lambda row:[coca_word_lemma_dict[x] if x in coca_word_lemma_dict else x for x in row])

In [38]:
# Combine toks, lemmas, POS into one column

bob_df['tok_lem_POS_CLAWS'] = bob_df[['tok_CLAWS','lemmas_CLAWS']].apply(
    lambda x: tuple(zip([y[0] for y in x[0]],[y[0] for y in x[1]],[y[1] for y in x[1]])), axis=1)
bob_df['tok_lem_POS_CLAWS'] = bob_df['tok_lem_POS_CLAWS'].apply(list)

del bob_df['tok_CLAWS']
del bob_df['lemmas_CLAWS']

In [39]:
bob_df.head()

Unnamed: 0,id,title,display_date,abstract,host,series,container,owner,depositor,collection_id,text,language,len,tok_lem_POS_NLTK,tok_lem_POS_CLAWS
0,MSS_1002_B001_F11_I01,Recent Litigation Memo,"July 11, 1975",A PARC internal memo that summarizes recent li...,Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 11, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History Center",collection.341,﻿V\n\nPennsylvania Association for Retarded Ci...,English,3042,"[(V, V, NNP), (Pennsylvania, Pennsylvania, NNP...","[(Pennsylvania, pennsylvania, n), (Association..."
1,MSS_1002_B001_F12_I01,Letter from Peter Polloni to Bob Nelkin,"March 11, 1975","A letter from Peter Polloni, executive directo...",Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 12, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History Center",collection.341,﻿Pennsylvania Association for Retarded Citizen...,English,242,"[(Pennsylvania, Pennsylvania, NNP), (Associati...","[(Pennsylvania, pennsylvania, n), (Association..."
2,MSS_1002_B001_F13_I01,Letter to Frank Beal from Families and Friends...,"August 19, 1976",A letter from Families and Friends of Southwes...,Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 13, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History Center",collection.341,﻿11\nFAMILIES & FRIENDS OF SOUTHWEST HABILITAT...,English,268,"[(11, 11, CD), (FAMILIES, FAMILIES, NNP), (&, ...","[(1, 1, m), (FAMILIES, family, n), (FRIENDS, f..."
3,MSS_1002_B001_F13_I02,Letter from families of patients at Southwest ...,"July 27, 1976",A letter requesting Bob Nelkin's advice on adv...,Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 13, Item 2",Heinz History Center,"Detre Library & Archives, Heinz History Center",collection.341,﻿FAMILIES & FRIENDS OF\nSOUTHWEST HABILITATION...,English,320,"[(FAMILIES, FAMILIES, NNP), (&, &, CC), (FRIEN...","[(FAMILIES, family, n), (FRIENDS, friend, n), ..."
4,MSS_1002_B001_F16_I01,ACC-PARC Recent Benefits to Families Memo,"March 28, 1977",Correspondence from Bob Nelkin to Joan Murdoch...,Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 16, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History Center",collection.341,C ommonwealth of Pennsylvania\n\nDepartment of...,English,6932,"[(C, C, NNP), (ommonwealth, ommonwealth, NN), ...","[(C, c, n), (ommonwealth, ommonwealth, n), (of..."


## 5. Spelling correction

Code adapted from https://github.com/ELI-Data-Mining-Group/PELIC-spelling

### Building non_words_df
In this section, we build a dataframe, `non_words_df`, which collects all of the non-words from the `bob_df`. The final dataframe has the following columns:
- `non_word`: tuples with the non-words and their parts of speech
- `sentence`: the complete sentence containing the non-word to provide context
- `id`: the id of the text they come from

In [40]:
# Creating small dataframe to be used for finding non-words

non_words = bob_df[['text','tok_lem_POS_NLTK']]
non_words.head()

Unnamed: 0,text,tok_lem_POS_NLTK
0,﻿V\n\nPennsylvania Association for Retarded Ci...,"[(V, V, NNP), (Pennsylvania, Pennsylvania, NNP..."
1,﻿Pennsylvania Association for Retarded Citizen...,"[(Pennsylvania, Pennsylvania, NNP), (Associati..."
2,﻿11\nFAMILIES & FRIENDS OF SOUTHWEST HABILITAT...,"[(11, 11, CD), (FAMILIES, FAMILIES, NNP), (&, ..."
3,﻿FAMILIES & FRIENDS OF\nSOUTHWEST HABILITATION...,"[(FAMILIES, FAMILIES, NNP), (&, &, CC), (FRIEN..."
4,C ommonwealth of Pennsylvania\n\nDepartment of...,"[(C, C, NNP), (ommonwealth, ommonwealth, NN), ..."


**Note:** For spelling correction, it is necessary to decide what list of words will be used for determining if a word is real or not.

Here, we use the [`SCOWL_condensed.txt`](https://github.com/ELI-Data-Mining-Group/PELIC-spelling/blob/master/SCOWL_condensed.txt) file which is a combination of wordlists available for download at http://wordlist.aspell.net/. We include items from all the dictionaries _except_ the abbreviations dictionary. For a detailed look at the compilation of this dictionary, please see the [SCOWL_wordlist](https://github.com/ELI-Data-Mining-Group/PELIC-spelling/blob/master/SCOWL_wordlist.ipynb) notebook.

In [41]:
# Reading in SCOWL_condensed as a set as a lookup list for spelling (500k words)

pelic = '/Users/Ben/Documents/ELI_Data_Mining/PELIC-spelling/'
scowl = set(open(pelic+"PELIC-SCOWL.txt", "r").read().split('\n'))
scowl = set([x.lower() for x in scowl])
print(random.sample(scowl,5))
len(scowl)

['gritrock', 'bahuts', 'petula', 'quinquagenaries', 'cumba']


565840

The following is a list of words which should be considered words but which were previously being labelled as non-words. These items have been manually added to this list based on output later in this notebook.

In [42]:
bob_df_spelling_supp = open("bob_df_spelling_supp.txt", "r").read().split('\n')
len(bob_df_spelling_supp)
bob_df_spelling_supp

67

['acc',
 'assoc',
 'assn',
 'baumstein',
 'bcc',
 'bldg',
 'blvd',
 'cassileth',
 'chamovitz',
 'colautti',
 'colombatto',
 'columbatto',
 'cornwells',
 'countv',
 'delmuth',
 'dept',
 'div',
 'dr',
 'drs',
 'eg',
 'embreeville',
 'etc',
 'ferleger',
 'ft',
 'gardenside',
 'gilhool',
 'gishbaugher',
 'gov',
 'handwashing',
 'hockendoner',
 'hosp',
 'hrs',
 'interferring',
 'inservice',
 'iq',
 'jr',
 'lt',
 'mcclellands',
 'mckees',
 'meadowside',
 'minetti',
 'mulgrave',
 'nelkin',
 'northside',
 'passavant',
 'pennhurst',
 'pgh',
 'pl',
 'polloni',
 'pr',
 'rd',
 'reg',
 'rusynyk',
 'sec',
 'sistik',
 'southside',
 'sr',
 'supp',
 'tv',
 'underprogrammed',
 'veedock',
 'wettick',
 'willowbrook',
 'woglemuth',
 'wohlegemuth',
 'wolgemuth',
 'woodhaven']

In [43]:
# Lower case all toks

non_words.tok_lem_POS_NLTK = non_words.tok_lem_POS_NLTK.apply(lambda row: [(x[0].lower(),x[1],x[2]) for x in row])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


In [44]:
# Function to find non-words

def spell_check(tok_lem_POS_list):
    word_list = scowl # Choose word_list here. Default is scowl described above.
    not_in_word_list = []
    for tok_lem_POS in tok_lem_POS_list:
        if tok_lem_POS[0] not in word_list and tok_lem_POS[0] not in bob_df_spelling_supp:
            not_in_word_list.append(tok_lem_POS)
    return not_in_word_list

In [45]:
# Apply spell check function to find all misspelled-words. 

non_words['misspelled_words'] = non_words.tok_lem_POS_NLTK.apply(spell_check)
non_words.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until


Unnamed: 0,text,tok_lem_POS_NLTK,misspelled_words
0,﻿V\n\nPennsylvania Association for Retarded Ci...,"[(v, V, NNP), (pennsylvania, Pennsylvania, NNP...","[(t5co, t5CO, VBD), (•, •, NNP), (harris8urg, ..."
1,﻿Pennsylvania Association for Retarded Citizen...,"[(pennsylvania, Pennsylvania, NNP), (associati...","[(1500, 1500, CD), (•, •, NNP), (,, ,, ,), (pa..."
2,﻿11\nFAMILIES & FRIENDS OF SOUTHWEST HABILITAT...,"[(11, 11, CD), (families, FAMILIES, NNP), (&, ...","[(11, 11, CD), (&, &, CC), (2517, 2517, CD), (..."
3,﻿FAMILIES & FRIENDS OF\nSOUTHWEST HABILITATION...,"[(families, FAMILIES, NNP), (&, &, CC), (frien...","[(&, &, CC), (2517, 2517, CD), (,, ,, ,), (pa...."
4,C ommonwealth of Pennsylvania\n\nDepartment of...,"[(c, C, NNP), (ommonwealth, ommonwealth, NN), ...","[(ommonwealth, ommonwealth, NN), (,, ,, ,), (,..."


#### Adding context to the dataframe
Seeing the mistakes in the context of a sentence will allow for better manual checking if required.

In [46]:
# Sent-tokenizing the text

non_words['sentence'] = non_words['text'].apply(lambda x: nltk.sent_tokenize(x))

# And delete text column which is no longer needed

del non_words['text']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until


In [47]:
# Checking for hyphenated words tagged as misspellings because SCOWL doesn't contain hypenated words

hyphenated = set([x[0] for x in [x for y in non_words.misspelled_words.to_list() for x in y] if '-' in x[0]])
print(len(hyphenated))
print(list(hyphenated)[:10])

# These need to be removed from the non-words dataframe if composed of valid words

2517
['-c*', 'playpen-like', '287-0791', '——.-r———^^-^.y..', '6-^xtnv^', 'medical-coverage', 'staff-to-resident', 'on-the', '_-c', '-aot.chruse']


In [48]:
# Hyphenated items whose components are not in scowl - possible misspellings or punctuation strings

sorted([y for y in [x.split('-') for x in hyphenated] if y[0].lower() not in scowl or y[1].lower() not in scowl])

[['', "'/", ''],
 ['', "'/vwwi"],
 ['', "'^"],
 ['', "'al", '/'],
 ['', "'astefl"],
 ['', "'ini"],
 ['', "'~"],
 ['', "'■"],
 ['', '*'],
 ['', '*', '1'],
 ['', '***t'],
 ['', '**1^1'],
 ['', '*1'],
 ['', '*•'],
 ['', '.'],
 ['', '.', '.c'],
 ['', '.', '/a'],
 ['', '.', 'u', '—~'],
 ['', '..'],
 ['', '..4'],
 ['', '..ft'],
 ['', './'],
 ['', '.1'],
 ['', '._'],
 ['', '.a'],
 ['', '.ax'],
 ['', '.f'],
 ['', '.n'],
 ['', '.~'],
 ['', '.~~.', 'as'],
 ['', '.—'],
 ['', '.•••21'],
 ['', '.■*•'],
 ['', '/'],
 ['', '/', ''],
 ['', '/', 'zz/c/^'],
 ['', "/.'./l"],
 ['', '///institution'],
 ['', '/j/..'],
 ['', '/r'],
 ['', '/tv'],
 ['', '/•'],
 ['', '0'],
 ['', '0', ''],
 ['', '0352'],
 ['', '1'],
 ['', '1', ''],
 ['', '1.17'],
 ['', '1.42'],
 ['', '1.81'],
 ['', '10', ''],
 ['', '11', ''],
 ['', '12', ''],
 ['', '13', ''],
 ['', '14'],
 ['', '14', ''],
 ['', '15', ''],
 ['', '16', ''],
 ['', '17', ''],
 ['', '18', ''],
 ['', '19', ''],
 ['', '1977', '78'],
 ['', '1^11'],
 ['', '1l'],
 ['', '1—

After manual checking, all the hyphenated words are punctuation and can be removed from the non-words df. Because all hyphenated word are fine, we will just eliminate all words that are not purely composed of letters. This will have the effect of removing the following categories from the dataframe:
- punctuation
- hyphenated words (e.g. well-known)
- contractions (e.g. 'll, 've)
- years (e.g. 1950s)
- ordinals (e.g. 1st, 2nd)

In [49]:
# Removing items that are not purely alpha

non_words.misspelled_words = non_words.misspelled_words.apply(lambda row: [x for x in row if x[0].isalpha()])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


In [50]:
# Removing all words with length 1

non_words.misspelled_words = non_words.misspelled_words.apply(lambda row: [x for x in row if len(x[0]) > 1])

In [51]:
# Removing all words with special characters (non_ascii)
# after checking, these are all foreign words with accents and other non-latin characters

# Creating function to check
def is_ascii(s):
    return all(ord(c) < 128 for c in s)

In [52]:
non_word_list = set([x for y in non_words.misspelled_words.to_list() for x in y])
foreign_words = [x for x in non_word_list if is_ascii(x[0]) == False ]
foreign_words[:20]

[('yﬁly', 'yﬁly', 'NN'),
 ('ofﬂce', 'ofﬂce', 'NNS'),
 ('welﬂcuze', 'Welﬂcuze', 'NNP'),
 ('lllllaﬁ', 'Lllllaﬁ', 'NNP'),
 ('frrﬁ', 'frrﬁ', 'JJ'),
 ('supportedé', 'Supportedé', 'NNP'),
 ('iﬁcn', 'iﬁcn', 'NN'),
 ('iugétegnv', 'Iugétegnv', 'NNP'),
 ('mcﬁlelland', 'Mcﬁlelland', 'NNP'),
 ('ﬂy', 'ﬂy', 'NNP'),
 ('éeﬁeve', 'éeﬁeve', 'VBP'),
 ('ixﬁfngerjgv', 'ixﬁfngerjgv', 'NN'),
 ('ﬂy', 'ﬂy', 'NN'),
 ('ﬁmw', 'ﬁmw', 'NNP'),
 ('lcuué', 'lcuué', 'VBD'),
 ('aﬂutﬁouas', 'aﬂutﬁouas', 'VBP'),
 ('éeléeve', 'éeléeve', 'NNP'),
 ('péeaoecl', 'péeaoecl', 'NN'),
 ('iﬁ', 'Iﬁ', 'NNP'),
 ('ﬁacket', 'ﬁacket', 'NN')]

In [53]:
# Removing foreign words

non_words.misspelled_words = non_words.misspelled_words.apply(lambda row: [x for x in row if x not in foreign_words])

In [54]:
# Checking affect of removal

print(len([x for y in non_words.misspelled_words.to_list() for x in y]))
print(len(set([x for y in non_words.misspelled_words.to_list() for x in y])))

5063
3566


In [55]:
non_words.head()

Unnamed: 0,tok_lem_POS_NLTK,misspelled_words,sentence
0,"[(v, V, NNP), (pennsylvania, Pennsylvania, NNP...","[(dpw, DPW, NNP), (dpw, DPW, NNP), (dpw, DPW, ...",[﻿V\n\nPennsylvania Association for Retarded C...
1,"[(pennsylvania, Pennsylvania, NNP), (associati...","[(ppp, PPP, NNP), (schmi, Schmi, NNP), (dt, dt...",[﻿Pennsylvania Association for Retarded Citize...
2,"[(11, 11, CD), (families, FAMILIES, NNP), (&, ...","[(fodi, Fodi, NNP)]",[﻿11\nFAMILIES & FRIENDS OF SOUTHWEST HABILITA...
3,"[(families, FAMILIES, NNP), (&, &, CC), (frien...","[(tesident, tesident, NN), (fodi, Fodi, NNP), ...",[﻿FAMILIES & FRIENDS OF\nSOUTHWEST HABILITATIO...
4,"[(c, C, NNP), (ommonwealth, ommonwealth, NN), ...","[(ommonwealth, ommonwealth, NN), (hmamssune, H...",[C ommonwealth of Pennsylvania\n\nDepartment o...


Create new dataframe so that each misspelling token is a separate row.

In [56]:
# Removing rows with no misspellings

non_words2 = non_words.loc[non_words.misspelled_words.str.len() > 0,:].copy()

In [57]:
# Exploding the lists in misspelled words so that each misspelling gets its own row

non_words2 = non_words2.explode('misspelled_words')

In [58]:
# Keeping only the sentence containing the error (the first occurence of the error if repeated)

non_words2['sentence'] = list(zip([x[0] for x in non_words2.misspelled_words], non_words2.sentence))
non_words2['sentence'] = non_words2['sentence'].apply(
    lambda row: [i for i in row[1] if row[0] in re_tokenize(i) or row[0]+"n't" in i.lower()])
non_words2['sentence'] = [x[0] for x in non_words2['sentence']]
non_words2 = non_words2.drop_duplicates(subset = ['misspelled_words','sentence'])

In [59]:
# Keeping the id (which is no longer unique) as a separate column

non_words2 = non_words2.reset_index(drop = False)
non_words2.head()

Unnamed: 0,index,tok_lem_POS_NLTK,misspelled_words,sentence
0,0,"[(v, V, NNP), (pennsylvania, Pennsylvania, NNP...","(dpw, DPW, NNP)",Downs v. Pennsylvania DPW: outlines the plan w...
1,0,"[(v, V, NNP), (pennsylvania, Pennsylvania, NNP...","(bazelon, Bazelon, NNP)",cc:\tGilhool\nBazelon Haggerty Cohen\nPolloni ...
2,0,"[(v, V, NNP), (pennsylvania, Pennsylvania, NNP...","(nastav, Nastav, NNP)",cc:\tGilhool\nBazelon Haggerty Cohen\nPolloni ...
3,0,"[(v, V, NNP), (pennsylvania, Pennsylvania, NNP...","(lgmiutbvno, lgmiutbvNo, NN)",9674. ycg£hi<H^v^_Wo.bl!lgmiutbvNo.
4,0,"[(v, V, NNP), (pennsylvania, Pennsylvania, NNP...","(plaintitts, Plaintitts, NNP)",Plaintitts and Plaintiff-Intervenors represent...


Adding a bigrams column, i.e. one token left and right of the misspelled word

In [60]:
# Creating a tokenized version of the sentence without punctuation and with the index for each token

non_words2['enumerated'] = non_words2.sentence.apply(lambda x: re_tokenize(x)).apply(enumerate).apply(list)
non_words2.head()

Unnamed: 0,index,tok_lem_POS_NLTK,misspelled_words,sentence,enumerated
0,0,"[(v, V, NNP), (pennsylvania, Pennsylvania, NNP...","(dpw, DPW, NNP)",Downs v. Pennsylvania DPW: outlines the plan w...,"[(0, downs), (1, v), (2, pennsylvania), (3, dp..."
1,0,"[(v, V, NNP), (pennsylvania, Pennsylvania, NNP...","(bazelon, Bazelon, NNP)",cc:\tGilhool\nBazelon Haggerty Cohen\nPolloni ...,"[(0, cc), (1, gilhool), (2, bazelon), (3, hagg..."
2,0,"[(v, V, NNP), (pennsylvania, Pennsylvania, NNP...","(nastav, Nastav, NNP)",cc:\tGilhool\nBazelon Haggerty Cohen\nPolloni ...,"[(0, cc), (1, gilhool), (2, bazelon), (3, hagg..."
3,0,"[(v, V, NNP), (pennsylvania, Pennsylvania, NNP...","(lgmiutbvno, lgmiutbvNo, NN)",9674. ycg£hi<H^v^_Wo.bl!lgmiutbvNo.,"[(0, 9674), (1, ycg), (2, hi), (3, h), (4, v),..."
4,0,"[(v, V, NNP), (pennsylvania, Pennsylvania, NNP...","(plaintitts, Plaintitts, NNP)",Plaintitts and Plaintiff-Intervenors represent...,"[(0, plaintitts), (1, and), (2, plaintiff), (3..."


In [61]:
# Creating a function to extract the bigrams (1 word either side of misspelling)

def get_bigrams(misspelled_word, enumerated_list):
    if len(enumerated_list) <2:
        return []
    for tup in enumerated_list:
        if tup[1] == misspelled_word[0]:
            if tup[0] == 0:
                bigram = ' '.join([x[1] for x in (enumerated_list[tup[0]],enumerated_list[tup[0]+1])])
                return [(bigram,2)]
            if tup[0] == len(enumerated_list)-1:
                bigram = ' '.join([x[1] for x in (enumerated_list[tup[0]-1],enumerated_list[tup[0]])])
                return [(bigram,1)]
            else:
                bigram1 = ' '.join([x[1] for x in (enumerated_list[tup[0]-1],enumerated_list[tup[0]])])
                bigram2 = ' '.join([x[1] for x in (enumerated_list[tup[0]],enumerated_list[tup[0]+1])])
                return [(bigram1,1), (bigram2,2)]
            
# Notes:
# '1' and '2' added as identifiers to show which word in bigram is the misspelling
# Type 1 bigram = (word, misspelling), Type 2 bigram = (misspelling, word)

In [62]:
# Applying getbigrams

non_words2['bigrams'] = non_words2[['misspelled_words','enumerated']].apply(lambda x: get_bigrams(x[0],x[1]), axis=1)

In [63]:
# Deleting the enumerated column as no longer necessary

del non_words2['enumerated']

In [64]:
# Renaming the 'misspelled_words' column as there is only one word in each row

non_words2 = non_words2.rename(columns={"misspelled_words": "misspelling"})

In [65]:
# Checking final non_words2 dataframe

non_words2.head()

Unnamed: 0,index,tok_lem_POS_NLTK,misspelling,sentence,bigrams
0,0,"[(v, V, NNP), (pennsylvania, Pennsylvania, NNP...","(dpw, DPW, NNP)",Downs v. Pennsylvania DPW: outlines the plan w...,"[(pennsylvania dpw, 1), (dpw outlines, 2)]"
1,0,"[(v, V, NNP), (pennsylvania, Pennsylvania, NNP...","(bazelon, Bazelon, NNP)",cc:\tGilhool\nBazelon Haggerty Cohen\nPolloni ...,"[(gilhool bazelon, 1), (bazelon haggerty, 2)]"
2,0,"[(v, V, NNP), (pennsylvania, Pennsylvania, NNP...","(nastav, Nastav, NNP)",cc:\tGilhool\nBazelon Haggerty Cohen\nPolloni ...,"[(roth nastav, 1), (nastav v, 2)]"
3,0,"[(v, V, NNP), (pennsylvania, Pennsylvania, NNP...","(lgmiutbvno, lgmiutbvNo, NN)",9674. ycg£hi<H^v^_Wo.bl!lgmiutbvNo.,"[(bl lgmiutbvno, 1)]"
4,0,"[(v, V, NNP), (pennsylvania, Pennsylvania, NNP...","(plaintitts, Plaintitts, NNP)",Plaintitts and Plaintiff-Intervenors represent...,"[(plaintitts and, 2)]"


In [66]:
# Total number of non-words (tokens)
len(non_words2)

# Total number of non-words (types)
non_words2.misspelling.nunique()

4141

3566

#### Creating a dataframe of misspellings
In the `non-words2` dataframe above, each row is an occurrence of a misspelling (i.e. _tokens_ ). We also want a dataframe where each row is a misspelling _type_ with frequency information attached.

In [67]:
# Gathering the total misspellings and bigrams

total_unigram_misspell = [x for x in non_words2['misspelling']]
total_bigram_misspell = [x[0] for y in non_words2['bigrams'] for x in y] #flattened list

In [68]:
# Creating frequency dictionaries for unigrams and bigrams

unigram_misspell_freq_dict = {}
for word in total_unigram_misspell:
    if word not in unigram_misspell_freq_dict:
        unigram_misspell_freq_dict[word] = 1
    else:
        unigram_misspell_freq_dict[word] += 1
        
bigram_misspell_freq_dict = {}
for bigram in total_bigram_misspell:
    if bigram not in bigram_misspell_freq_dict:
        bigram_misspell_freq_dict[bigram] = 1
    else:
        bigram_misspell_freq_dict[bigram] += 1

In [69]:
# Checking dictionaries

print(random.sample(list(unigram_misspell_freq_dict),5))
print(random.sample(list(bigram_misspell_freq_dict),5))

[('nutes', 'nutes', 'VBZ'), ('vpjl', 'VPJL', 'NNP'), ('defeffied', 'DEFEffiED', 'NNP'), ('erank', 'Erank', 'NNP'), ('yrnnsijlnania', 'yrnnsijlnania', 'RB')]
['opment they', 'become entepreneurs', 'ris mfifafa', 'vg c', 'bttipely unrealistic']


In [70]:
# Remove duplicates

final_unigram_misspellings = sorted(list(set(total_unigram_misspell)))
final_bigram_misspellings = sorted(list(set(total_bigram_misspell)))
len(final_unigram_misspellings)
len(final_bigram_misspellings)

3566

6748

In [71]:
# Constructing misspell_df

misspell_df = pd.DataFrame(final_unigram_misspellings)

In [72]:
# Renaming columns to match other DataFrames in this notebook

misspell_df.rename(columns = {0: 'misspelling',1:'lemma',2:'POS'}, inplace = True)

In [73]:
# Recreating tok_lem_POS column to match dictionary

misspell_df['tok_lem_POS'] = list(zip(misspell_df.misspelling, misspell_df.lemma, misspell_df.POS))
misspell_df.head()

Unnamed: 0,misspelling,lemma,POS,tok_lem_POS
0,aa,AA,NNP,"(aa, AA, NNP)"
1,aa,aA,NN,"(aa, aA, NN)"
2,aa,aa,VBP,"(aa, aa, VBP)"
3,aaaoctatton,AAAoctatton,NNP,"(aaaoctatton, AAAoctatton, NNP)"
4,aae,aAe,RB,"(aae, aAe, RB)"


In [74]:
# Mapping dictionary to DataFrame

misspell_df['freq'] = misspell_df['tok_lem_POS'].map(unigram_misspell_freq_dict)
misspell_df = misspell_df.sort_values(by=['freq'], ascending=False)

In [75]:
# Resetting index and deleting unnecessary columns

misspell_df = misspell_df.reset_index(drop = True)
del misspell_df['lemma']
del misspell_df['POS']

misspell_df.head(10)

Unnamed: 0,misspelling,tok_lem_POS,freq
0,shapp,"(shapp, Shapp, NNP)",73
1,dpw,"(dpw, DPW, NNP)",58
2,wssh,"(wssh, WSSH, NNP)",40
3,cla,"(cla, CLA, NNP)",19
4,cp,"(cp, cp, NN)",18
5,omr,"(omr, OMR, NNP)",10
6,cca,"(cca, CCA, NNP)",8
7,tdmt,"(tdmt, TDMT, NNP)",8
8,torisky,"(torisky, Torisky, NNP)",7
9,jcah,"(jcah, JCAH, NNP)",7


#### bob_df_spelling_supp
The following is the basis for the 'bob_df_spelling_supp' list used earlier. Here, errors with a frequency over 1 were manually checked, and if determined to be a real word, were added to the bob_df_spelling_supp. There were originally 418 items which met this criteria.

In [76]:
len(misspell_df.loc[misspell_df.freq > 1])
misspell_df.loc[misspell_df.freq > 1].head()

208

Unnamed: 0,misspelling,tok_lem_POS,freq
0,shapp,"(shapp, Shapp, NNP)",73
1,dpw,"(dpw, DPW, NNP)",58
2,wssh,"(wssh, WSSH, NNP)",40
3,cla,"(cla, CLA, NNP)",19
4,cp,"(cp, cp, NN)",18


In [77]:
%%capture

# Creating a text file of the bob_df_spelling_supp to check manually

off_list_to_check = misspell_df.loc[misspell_df.freq > 1].misspelling
with open('off_list', 'w') as f:
    for item in off_list_to_check:
        f.write("%s\n" % item)

### Applying spelling correction

In some ways SymSpell is not ideal as full sentence context is not considered, only general frequencies. However, other well-known spellcheckers (hunspell, pyspell, etc.) use the same strategy - frequency based criteria for suggestions, without considering immediate cotext. As such, we have followed this common practice, but it is important to remember that accuracy of corrected tokens will not be 100% and must be taken into consideration.

As a compromise and to consider context, spelling corrections based on bigrams is first implemented. If no suggestions are available, spelling corrections based on unigrams are implemented.

In [78]:
# Testing spelling suggestions with 'becuase'

sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_dictionary_en_82_765.txt")
# term_index is the column of the term and count_index is the column of the term frequency
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)

input_term = "becuase"
suggestions = sym_spell.lookup(input_term, Verbosity.CLOSEST, max_edit_distance=2, #Edit distance can be adjusted
                               transfer_casing=True, #Optional argument set to ignore case
                              include_unknown=True) #Return same word if unknown
for suggestion in suggestions:
    print(suggestion)  

True

because, 1, 271323986


In [79]:
# Creating function for finding unigram suggestions

def get_unigram_suggestions(word):
    if len(word) >= 4:
        suggestions = sym_spell.lookup(word, Verbosity.CLOSEST,max_edit_distance=2, transfer_casing=True)
    else:
        suggestions = sym_spell.lookup(word, Verbosity.CLOSEST,max_edit_distance=1, transfer_casing=True)
    return [str(x).split(',') for x in suggestions]

In [80]:
# Testing function

get_unigram_suggestions('becuase')

[['because', ' 1', ' 271323986']]

**Note**: The function has a variable edit distance: words of length 4 or more get edit distance of 2, shorter words get edit distance of 1. These preferences can be adjusted in the function if desired.

In [81]:
# Testing spelling suggestions with 'becuase of'

max_edit_distance_dictionary = 2
prefix_length = 7
sym_spell = SymSpell(max_edit_distance_dictionary, prefix_length)
bigram_path = pkg_resources.resource_filename("symspellpy", "frequency_bigramdictionary_en_243_342.txt")
if not sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1):
    print("Dictionary file not found")
if not sym_spell.load_bigram_dictionary(bigram_path, term_index=0, count_index=2):
    print("Bigram dictionary file not found")
input_term = 'becuase of'
max_edit_distance_lookup = 2
suggestions = sym_spell.lookup_compound(input_term, max_edit_distance_lookup)
for suggestion in suggestions:
    print(suggestion) 

because of, 1, 3481714


In [82]:
# Creating function for finding bigram suggestions

def get_bigram_suggestions(bigram):
    suggestions = sym_spell.lookup_compound(bigram, max_edit_distance_lookup)
    for suggestion in suggestions:
        return [str(x).split(',') for x in suggestions] 

In [83]:
# Testing function

get_bigram_suggestions('worq harg')

[['work hard', ' 2', ' 53229']]

In [84]:
# Returing to non_words2 dataframe and applying functions to create new column

# Creating unigram suggestions column

non_words2['unigram_suggestions'] =  non_words2['misspelling'].apply(
    lambda x: get_unigram_suggestions(x[0]))

In [85]:
# Turning into tuples for easier processing

non_words2.unigram_suggestions = non_words2.unigram_suggestions.apply(
    lambda row: [tuple(x) for x in row])

In [86]:
# Creating bigram suggestions column

non_words2['bigram_suggestions'] =  non_words2['bigrams'].apply(
    lambda row: [(get_bigram_suggestions(x[0]),x[1]) for x in row])

In [87]:
# Flattening and turning into tuples for easier processing

non_words2['bigram_suggestions'] = non_words2.bigram_suggestions.apply(
    lambda row: [(tuple([i for j in x[0] for i in j]),x[1]) for x in row])

In [88]:
# Checking how many items without suggestions

len(non_words2.loc[(non_words2.unigram_suggestions.str.len() == 0),:])
len(non_words2.loc[(non_words2.bigram_suggestions.str.len() == 0),:])
len(non_words2.loc[(non_words2.bigram_suggestions.str.len() == 0) & (non_words2.unigram_suggestions.str.len() == 0),:])

762

38

6

Items with no suggestions - these will be left in their original form though manual corrections could be applied if desired.  
Next, we create a new column with just the most likely correction (based on frequency). Bigram suggestions are given preference before unigram suggestions. If there is no suggestion, the original word is returned.

In [89]:
# Create new column with just the most likely correction (based on frequency)

def sort_unigram_tuple(tup):  
    tup.sort(key = lambda x: x[2], reverse=True)  
    return tup    

In [90]:
# Keeping the unigram correction with the highest frequency

non_words2['unigram_correction'] = [sort_unigram_tuple(x)[0][0] if len(x) != 0 else np.NaN for x in non_words2['unigram_suggestions']]

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

In [91]:
# Create new column with just the most likely correction (based on frequency)

def sort_bigram_tuple(tup):  
    tup.sort(key = lambda x: x[0][2], reverse=True)  
    return tup    

In [92]:
# Keeping the bigram correction with the highest frequency

non_words2['bigram_correction'] = [(sort_bigram_tuple(x)[0][0][0],x[0][1]) if len(x) != 0 else np.NaN for x in non_words2['bigram_suggestions']]

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

In [93]:
# Splitting the bigram corrections into separate words

mask = non_words2.loc[non_words2.bigram_correction.isnull() == False].index
non_words2.loc[mask, 'bigram_correction'] = non_words2.loc[mask, 'bigram_correction'].apply(
    lambda x: (x[0].split(),x[1]))

Some bigrams that were previously two words have now been corrected to one word, e.g. _paragragh, s --> paragraphs._  
These will now be labelled Type 3.

In [94]:
mask2 = non_words2.loc[mask].loc[non_words2.loc[mask].bigram_correction.apply(lambda x: len(x[0])==1)].index 
non_words2.loc[mask2].head()
len(non_words2.loc[mask2].head())

Unnamed: 0,index,tok_lem_POS_NLTK,misspelling,sentence,bigrams,unigram_suggestions,bigram_suggestions,unigram_correction,bigram_correction
9,0,"[(v, V, NNP), (pennsylvania, Pennsylvania, NNP...","(dep, Dep, NNP)",Mental Patient Civil Liberties Project v. Dep’...,"[(v dep, 1), (dep t, 2)]","[(dep, 0, 3107120)]","[((dept, 1, 6534300), 2), ((a dep, 1, 2753...",dep,"([dept], 2)"
22,0,"[(v, V, NNP), (pennsylvania, Pennsylvania, NNP...","(osser, Osser, NNP)",The court found that at least one of plaintiff...,"[(v osser, 1), (osser 409, 2)]","[(tosser, 1, 64096), (osier, 1, 53672)]","[((tosser, 2, 64095), 1), ((tosser of a, 5,...",tosser,"([tosser], 1)"
27,1,"[(pennsylvania, Pennsylvania, NNP), (associati...","(schmi, Schmi, NNP)",PPP:ef cc: Brown\nSchmi dt\nPhi 11 ips,"[(brown schmi, 1), (schmi dt, 2)]","[(schmo, 1, 60716)]","[((schmidt, 1, 4367060), 2), ((brown schmo, ...",schmo,"([schmidt], 2)"
28,1,"[(pennsylvania, Pennsylvania, NNP), (associati...","(dt, dt, VBD)",PPP:ef cc: Brown\nSchmi dt\nPhi 11 ips,"[(schmi dt, 1), (dt phi, 2)]","[(do, 1, 950751722), (dpt, 1, 665373), (dt...","[((schmidt, 1, 4367060), 1), ((it phi, 1, ...",do,"([schmidt], 1)"
29,1,"[(pennsylvania, Pennsylvania, NNP), (associati...","(ips, ip, NNS)",PPP:ef cc: Brown\nSchmi dt\nPhi 11 ips,"[(11 ips, 1)]","[(tips, 1, 94800899), (lips, 1, 9228435), ...","[((tips, 3, 94800899), 1)]",tips,"([tips], 1)"


5

In [95]:
non_words2.loc[mask2,'bigram_correction'] = non_words2.loc[mask2,'bigram_correction'].apply(lambda x: (x[0],3))

In [96]:
# Keeping only the word in the bigram correction that was originally misspelled

def corrected_only(bigram_tuple):
    bigram_list = bigram_tuple[0]
    bigram_type = bigram_tuple[1]
    for tup in bigram_tuple:
        if bigram_type == 1:
            corrected_word = bigram_list[1:]
        if bigram_type == 2:
            corrected_word = bigram_list[:-1]
        if bigram_type == 3:
            corrected_word = bigram_list
    return ' '.join(corrected_word)

# Type 1 bigram = (word, misspelling), Type 2 bigram = (misspelling, word), Type 3 bigram = misspelling only

In [97]:
# Applying the function

non_words2.loc[mask, 'bigram_correction'] = non_words2.loc[mask, 'bigram_correction'].apply(
    lambda x: corrected_only(x))

In [98]:
# Creating a 'final_correction' column with order of preference: bigram_correction, unigram_correction, misspelling

# Changing empty strings to NaN in the bigram correction column
non_words2.bigram_correction.replace("", np.nan, inplace=True)

# Creating a tuple of these three items
non_words2['final_correction'] = list(zip(non_words2.misspelling.apply(
    lambda x: x[1]), non_words2.unigram_correction, non_words2.bigram_correction))

# Choosing which item based on if strings or not (i.e. NaNs)
non_words2['final_correction'] = [x[2] if isinstance(x[2], str) else x[1] if isinstance(x[1], str) else x[0] for x in non_words2['final_correction']]

<IPython.core.display.Javascript object>

In [99]:
# Create correction_POS column

non_words2['final_correction_POS'] = list(zip(non_words2.final_correction, non_words2.final_correction, non_words2.misspelling.apply(lambda x: x[2])))
non_words2.head(6)

Unnamed: 0,index,tok_lem_POS_NLTK,misspelling,sentence,bigrams,unigram_suggestions,bigram_suggestions,unigram_correction,bigram_correction,final_correction,final_correction_POS
0,0,"[(v, V, NNP), (pennsylvania, Pennsylvania, NNP...","(dpw, DPW, NNP)",Downs v. Pennsylvania DPW: outlines the plan w...,"[(pennsylvania dpw, 1), (dpw outlines, 2)]","[(dpi, 1, 7260183), (dpt, 1, 665373), (daw...","[((dpi outlines, 1, 31), 2), ((pennsylvania ...",dpi,dpi,dpi,"(dpi, dpi, NNP)"
1,0,"[(v, V, NNP), (pennsylvania, Pennsylvania, NNP...","(bazelon, Bazelon, NNP)",cc:\tGilhool\nBazelon Haggerty Cohen\nPolloni ...,"[(gilhool bazelon, 1), (bazelon haggerty, 2)]","[(ballon, 2, 372246), (babylon, 2, 3284724)]","[((girlhood babylon, 4, 0), 1), ((babylon ja...",ballon,babylon,babylon,"(babylon, babylon, NNP)"
2,0,"[(v, V, NNP), (pennsylvania, Pennsylvania, NNP...","(nastav, Nastav, NNP)",cc:\tGilhool\nBazelon Haggerty Cohen\nPolloni ...,"[(roth nastav, 1), (nastav v, 2)]","[(nasty, 2, 7434847), (pasta, 2, 5855460),...","[((nasty a, 3, 65876), 2), ((roth nasty, 2,...",nasty,nasty,nasty,"(nasty, nasty, NNP)"
3,0,"[(v, V, NNP), (pennsylvania, Pennsylvania, NNP...","(lgmiutbvno, lgmiutbvNo, NN)",9674. ycg£hi<H^v^_Wo.bl!lgmiutbvNo.,"[(bl lgmiutbvno, 1)]",[],"[((by limit bono, 5, 0), 1)]",,limit bono,limit bono,"(limit bono, limit bono, NN)"
4,0,"[(v, V, NNP), (pennsylvania, Pennsylvania, NNP...","(plaintitts, Plaintitts, NNP)",Plaintitts and Plaintiff-Intervenors represent...,"[(plaintitts and, 2)]","[(plaintiffs, 2, 3433271)]","[((plaintiffs and, 2, 43539), 2)]",plaintiffs,plaintiffs,plaintiffs,"(plaintiffs, plaintiffs, NNP)"
5,0,"[(v, V, NNP), (pennsylvania, Pennsylvania, NNP...","(untii, untii, IN)",The court has granted a motion to intervene an...,"[(and untii, 1), (untii he, 2)]","[(untie, 1, 146576), (until, 1, 113090086)]","[((until he, 1, 93001), 2), ((and until, 1,...",untie,until,until,"(until, until, IN)"


#### Incorporating corrections into `bob_df`

In [100]:
# Focusing on rows in bob_df containing spelling mistakes to replace

mask = bob_df.index.isin(non_words2["index"])
len(bob_df)
len(bob_df.loc[mask])

541

469

In [101]:
# Transforming non_words2 back so that each row is a text (with a list of the misspellings and final corrections)

# First create a column with tuples of misspellings and their corrections
non_words2['misspelling_correction'] = list(zip(non_words2.misspelling, non_words2.final_correction_POS))

In [102]:
# Then group by the text and combine the misspelling_corrections

non_words3 = non_words2[['index','tok_lem_POS_NLTK','misspelling_correction']]
non_words3 = non_words2.groupby('index').agg({'tok_lem_POS_NLTK':'first','misspelling_correction': sum}).reset_index()
non_words3.misspelling_correction = non_words3.misspelling_correction.apply(lambda x:list(zip(x[::2], x[1::2]))) #combining items into tuples
non_words3 = non_words3.set_index('index') # use answer_id as new index since it is now unique
non_words3.head()

Unnamed: 0_level_0,tok_lem_POS_NLTK,misspelling_correction
index,Unnamed: 1_level_1,Unnamed: 2_level_1
0,"[(v, V, NNP), (pennsylvania, Pennsylvania, NNP...","[((dpw, DPW, NNP), (dpi, dpi, NNP)), ((bazelon..."
1,"[(pennsylvania, Pennsylvania, NNP), (associati...","[((ppp, PPP, NNP), (pop, pop, NNP)), ((schmi, ..."
2,"[(11, 11, CD), (families, FAMILIES, NNP), (&, ...","[((fodi, Fodi, NNP), (jodi, jodi, NNP))]"
3,"[(families, FAMILIES, NNP), (&, &, CC), (frien...","[((tesident, tesident, NN), (resident, residen..."
4,"[(c, C, NNP), (ommonwealth, ommonwealth, NN), ...","[((ommonwealth, ommonwealth, NN), (commonwealt..."


In [103]:
# Creating a function to find and replace misspellings

def replace_misspelling(tok_lem_POS, misspelling_correction):
    tok_lem_POS_corrected = []
    misspellings = [x[0] for x in misspelling_correction]
    corrections = [x[1] for x in misspelling_correction]
    correction_dict = dict((x, y) for x, y in misspelling_correction)
    for tok in tok_lem_POS:
        if tok in misspellings:
            tok = correction_dict[tok]
        tok_lem_POS_corrected.append(tok)
    return tok_lem_POS_corrected

In [104]:
# Applying the above function

non_words3['tok_lem_POS_NLTK_corrected'] = non_words3[['tok_lem_POS_NLTK','misspelling_correction']].apply(
    lambda x: replace_misspelling(x[0],x[1]), axis=1)

non_words3.head()

Unnamed: 0_level_0,tok_lem_POS_NLTK,misspelling_correction,tok_lem_POS_NLTK_corrected
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,"[(v, V, NNP), (pennsylvania, Pennsylvania, NNP...","[((dpw, DPW, NNP), (dpi, dpi, NNP)), ((bazelon...","[(v, V, NNP), (pennsylvania, Pennsylvania, NNP..."
1,"[(pennsylvania, Pennsylvania, NNP), (associati...","[((ppp, PPP, NNP), (pop, pop, NNP)), ((schmi, ...","[(pennsylvania, Pennsylvania, NNP), (associati..."
2,"[(11, 11, CD), (families, FAMILIES, NNP), (&, ...","[((fodi, Fodi, NNP), (jodi, jodi, NNP))]","[(11, 11, CD), (families, FAMILIES, NNP), (&, ..."
3,"[(families, FAMILIES, NNP), (&, &, CC), (frien...","[((tesident, tesident, NN), (resident, residen...","[(families, FAMILIES, NNP), (&, &, CC), (frien..."
4,"[(c, C, NNP), (ommonwealth, ommonwealth, NN), ...","[((ommonwealth, ommonwealth, NN), (commonwealt...","[(c, C, NNP), (commonwealth, commonwealth, NN)..."


In [105]:
# Creating a dictionary of answer_id and tok_lem_POS_corrected

corrected_text_dict = pd.Series(non_words3.tok_lem_POS_NLTK_corrected.values,non_words3.index).to_dict()

In [106]:
# Adding the tok_lem_POS_corrected column to pelic_df

bob_df['tok_lem_POS_NLTK_corrected'] = bob_df.index.map(corrected_text_dict)

In [107]:
# Creating a dictionary of answer_id and misspelling_correction

misspelling_correction_dict = pd.Series(non_words3.misspelling_correction.values,non_words3.index).to_dict()

In [108]:
# Adding the misspelling_correction column to pelic_df

bob_df['misspelling_correction'] = bob_df.index.map(misspelling_correction_dict)

In [109]:
# Creating a column with number of corrected errors

bob_df['len_errors'] = bob_df.misspelling_correction.fillna('').apply(len)

In [110]:
%pprint

# Checking with random short text

bob_df.iloc[40,-5] #uncorrected
bob_df.iloc[40,-3] #corrected
bob_df.iloc[40,-2] #corrected mistakes

%pprint

Pretty printing has been turned OFF


[('MARY', 'MARY', 'NNP'), ('LOU', 'LOU', 'NNP'), ('MAGISTRI', 'MAGISTRI', 'NNP'), (',', ',', ','), ('Public', 'Public', 'NNP'), ('Relationa', 'Relationa', 'NNP'), ('Director', 'Director', 'NNP'), ('Phone', 'Phone', 'NNP'), (':', ':', ':'), ('322-6008', '322-6008', 'CD'), ('of', 'of', 'IN'), ('the', 'the', 'DT'), ('Pennsylvania', 'Pennsylvania', 'NNP'), ('Association', 'Association', 'NNP'), ('for', 'for', 'IN'), ('Retarded', 'Retarded', 'NNP'), ('Children', 'Children', 'NNP'), (',', ',', ','), ('Inc.', 'Inc.', 'NNP'), ('917-1001', '917-1001', 'CD'), ('Brighton', 'Brighton', 'NNP'), ('Road', 'Road', 'NNP'), ('Pittsburgh', 'Pittsburgh', 'NNP'), (',', ',', ','), ('Pa.', 'Pa.', 'NNP'), ('15233', '15233', 'CD'), ('fom', 'fom', 'NN'), ('October', 'October', 'NNP'), ('24', '24', 'CD'), (',', ',', ','), ('1972', '1972', 'CD'), ('Information', 'Information', 'NN'), ('to', 'to', 'TO'), ('be', 'be', 'VB'), ('used', 'use', 'VBN'), ('by', 'by', 'IN'), ('Don', 'Don', 'NNP'), ('Cannon', 'Cannon', 'NN

[('mary', 'MARY', 'NNP'), ('lou', 'LOU', 'NNP'), ('magistral', 'magistral', 'NNP'), (',', ',', ','), ('public', 'Public', 'NNP'), ('relations', 'relations', 'NNP'), ('director', 'Director', 'NNP'), ('phone', 'Phone', 'NNP'), (':', ':', ':'), ('322-6008', '322-6008', 'CD'), ('of', 'of', 'IN'), ('the', 'the', 'DT'), ('pennsylvania', 'Pennsylvania', 'NNP'), ('association', 'Association', 'NNP'), ('for', 'for', 'IN'), ('retarded', 'Retarded', 'NNP'), ('children', 'Children', 'NNP'), (',', ',', ','), ('inc.', 'Inc.', 'NNP'), ('917-1001', '917-1001', 'CD'), ('brighton', 'Brighton', 'NNP'), ('road', 'Road', 'NNP'), ('pittsburgh', 'Pittsburgh', 'NNP'), (',', ',', ','), ('pa.', 'Pa.', 'NNP'), ('15233', '15233', 'CD'), ('for', 'for', 'NN'), ('october', 'October', 'NNP'), ('24', '24', 'CD'), (',', ',', ','), ('1972', '1972', 'CD'), ('information', 'Information', 'NN'), ('to', 'to', 'TO'), ('be', 'be', 'VB'), ('used', 'use', 'VBN'), ('by', 'by', 'IN'), ('don', 'Don', 'NNP'), ('cannon', 'Cannon', '

[(('magistri', 'MAGISTRI', 'NNP'), ('magistral', 'magistral', 'NNP')), (('relationa', 'Relationa', 'NNP'), ('relations', 'relations', 'NNP')), (('fom', 'fom', 'NN'), ('for', 'for', 'NN')), (('hissim', 'Hissim', 'NNP'), ('his sim', 'his sim', 'NNP')), (('dpw', 'DPW', 'NNP'), ('dpi', 'dpi', 'NNP')), (('cind', 'cind', 'VBP'), ('find', 'find', 'VBP')), (('magistri', 'Magistri', 'NNP'), ('magistral', 'magistral', 'NNP'))]

Pretty printing has been turned ON


In [111]:
bob_df.head()

Unnamed: 0,id,title,display_date,abstract,host,series,container,owner,depositor,collection_id,text,language,len,tok_lem_POS_NLTK,tok_lem_POS_CLAWS,tok_lem_POS_NLTK_corrected,misspelling_correction,len_errors
0,MSS_1002_B001_F11_I01,Recent Litigation Memo,"July 11, 1975",A PARC internal memo that summarizes recent li...,Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 11, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History Center",collection.341,﻿V\n\nPennsylvania Association for Retarded Ci...,English,3042,"[(V, V, NNP), (Pennsylvania, Pennsylvania, NNP...","[(Pennsylvania, pennsylvania, n), (Association...","[(v, V, NNP), (pennsylvania, Pennsylvania, NNP...","[((dpw, DPW, NNP), (dpi, dpi, NNP)), ((bazelon...",26
1,MSS_1002_B001_F12_I01,Letter from Peter Polloni to Bob Nelkin,"March 11, 1975","A letter from Peter Polloni, executive directo...",Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 12, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History Center",collection.341,﻿Pennsylvania Association for Retarded Citizen...,English,242,"[(Pennsylvania, Pennsylvania, NNP), (Associati...","[(Pennsylvania, pennsylvania, n), (Association...","[(pennsylvania, Pennsylvania, NNP), (associati...","[((ppp, PPP, NNP), (pop, pop, NNP)), ((schmi, ...",4
2,MSS_1002_B001_F13_I01,Letter to Frank Beal from Families and Friends...,"August 19, 1976",A letter from Families and Friends of Southwes...,Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 13, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History Center",collection.341,﻿11\nFAMILIES & FRIENDS OF SOUTHWEST HABILITAT...,English,268,"[(11, 11, CD), (FAMILIES, FAMILIES, NNP), (&, ...","[(1, 1, m), (FAMILIES, family, n), (FRIENDS, f...","[(11, 11, CD), (families, FAMILIES, NNP), (&, ...","[((fodi, Fodi, NNP), (jodi, jodi, NNP))]",1
3,MSS_1002_B001_F13_I02,Letter from families of patients at Southwest ...,"July 27, 1976",A letter requesting Bob Nelkin's advice on adv...,Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 13, Item 2",Heinz History Center,"Detre Library & Archives, Heinz History Center",collection.341,﻿FAMILIES & FRIENDS OF\nSOUTHWEST HABILITATION...,English,320,"[(FAMILIES, FAMILIES, NNP), (&, &, CC), (FRIEN...","[(FAMILIES, family, n), (FRIENDS, friend, n), ...","[(families, FAMILIES, NNP), (&, &, CC), (frien...","[((tesident, tesident, NN), (resident, residen...",3
4,MSS_1002_B001_F16_I01,ACC-PARC Recent Benefits to Families Memo,"March 28, 1977",Correspondence from Bob Nelkin to Joan Murdoch...,Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 16, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History Center",collection.341,C ommonwealth of Pennsylvania\n\nDepartment of...,English,6932,"[(C, C, NNP), (ommonwealth, ommonwealth, NN), ...","[(C, c, n), (ommonwealth, ommonwealth, n), (of...","[(c, C, NNP), (commonwealth, commonwealth, NN)...","[((ommonwealth, ommonwealth, NN), (commonwealt...",209


## 6. Genre tagging

As much as possible has been done automatically based on titles and abstract, and then the rest have been manually checked. The first genre column keeps the genre terms used in the collection, and the second genre column uses the [MODS metadata terms](https://github.com/uls-mad/islandora_metadata/wiki/Genre-Terms-for-Historic-Pittsburgh-Digital-Objects).

In [112]:
# Create lists based on titles (these have each been checked)

letter = [(x,'letter') for x in bob_df.title if 'letter' in x.lower()]
memo = [(x,'memo') for x in bob_df.title if 'memo' in x.lower()]
report = [(x,'report') for x in bob_df.title if 'report' in x.lower()]
excerpt = [(x,'excerpt') for x in bob_df.title if 'excerpt' in x.lower()]
pamphlet = [(x,'pamphlet') for x in bob_df.title if 'pamphlet' in x.lower()]
flyer = [(x,'flyer') for x in bob_df.title if 'flyer' in x.lower()]
handout = [(x,'handout') for x in bob_df.title if 'handout' in x.lower()]
summary = [(x,'summary') for x in bob_df.title if 'summary' in x.lower()]
news_request = [(x,'news_request') for x in bob_df.title if 'request for coverage' in x.lower()]
minutes = [(x,'minutes') for x in bob_df.title if 'minutes' in x.lower()]
news_release = [(x,'news_release') for x in bob_df.title if 'news release' in x.lower()]
senate_bill = [(x,'senate_bill') for x in bob_df.title if 'senate bill' in x.lower()]
note = [(x,'note') for x in bob_df.title if 'note' in x.lower()]
notes = [(x,'notes') for x in bob_df.title if 'notes' in x.lower()] #make sure to update this one after 'note'
statement = [(x,'statement') for x in bob_df.title if 'statement' in x.lower()]
telegram = [(x,'telegram') for x in bob_df.title if 'telegram' in x.lower()]
article = [(x,'article') for x in bob_df.title if 'article' in x.lower()]
transcript = [(x,'transcript') for x in bob_df.title if 'transcript' in x.lower()]
news_conference = [(x,'news_conference') for x in bob_df.title if 'news conference' in x.lower()]
proclamation = [(x,'proclamation') for x in bob_df.title if 'proclamation' in x.lower()]
data = [(x,'data') for x in bob_df.title if 'data' in x.lower()]
agreement = [(x,'agreement') for x in bob_df.title if 'agreement' in x.lower()]
court_order = [(x,'court_order') for x in bob_df.title if 'court order' in x.lower()]
legislative_action = [(x,'legislative_action') for x in bob_df.title if 'legislative action' in x.lower()]
press_release = [(x,'press_release') for x in bob_df.title if 'press release' in x.lower()]
transcript = [(x,'transcript') for x in bob_df.title if 'transcript' in x.lower()]
findings = [(x,'findings') for x in bob_df.title if 'findings' in x.lower()]
review = [(x,'review') for x in bob_df.title if 'review of' in x.lower()]
policy = [(x,'policy') for x in bob_df.title if 'polic' in x.lower()]
court_order = [(x,'court_order') for x in bob_df.title if 'court order' in x.lower()]
speech = [(x,'speech') for x in bob_df.title if 'speech' in x.lower()]
questions = [(x,'questions') for x in bob_df.title if 'questions' in x.lower()]
plan = [(x,'plan') for x in bob_df.title if 'plan' in x.lower()]
response = [(x,'response') for x in bob_df.title if 'response' in x.lower()]
information = [(x,'information') for x in bob_df.title if 'information' in x.lower()]
press_packet = [(x,'press_packet') for x in bob_df.title if 'press packet' in x.lower()]
trial_testimony = [(x,'trial_testimony') for x in bob_df.title if 'trial testimony' in x.lower()]

In [113]:
# Manual genre classifications

manual_dict = {
    'Halderman v. Pennhurst Appendices':'court_order',
    'Release of Investigation into the Death of Paul Jenkins':'report',
    'Highland Park Center Restraining Chair':'photograph',
    'Highland Park Center Resident Helmet':'photograph',
    'Highland Park Center Resident Helmet (2)':'photograph',
    'Highland Park Center Straitjacket':'photograph',
    'Highland Park Center Cattle Prod':'photograph',
    'Monitoring of Community Living Arrangements in Allegheny County':'plan',
    'ACC-PARC Residential Services Committee Listing of Objectives Draft':'plan',
    "December 1972 Visit to Children's Rehabilitation Center":'report',
    'Parent Training Workshop Ideas':'notes',
    'Supreme Court of the United States Pennhurst State School and Hospital v. Halderman et al., On Writ of Certiorari to the United States Court of Appeals for the Third Circuit':'legal_brief',
    'Progress in Pennhurst Dispersal':'report',
    'Joyce Z. Case Petition for Civil Court Commitment Under Section 406 of The Mental Health and Mental Retardation Act of 1966':'report',
    'Quiet Room Orders from Physicians':'report',
    'Immediate Needs for Intermediate Unit 1 at Western State School and Hospital':'report',
    'Gardenside Building Concerns':'letter',
    'Parent Allegations Against Polk State School and Hospital':'letter',
    'Resolution from the Borough of Grove City in Support of Dr. James McClelland':'resolution'
}

In [114]:
# Create and add to genre_dict

genre_dict = {}
genre_dict.update(dict(plan))
genre_dict.update(dict(questions))
genre_dict.update(dict(agreement))
genre_dict.update(dict(policy))
genre_dict.update(dict(letter))
genre_dict.update(dict(memo))
genre_dict.update(dict(court_order))
genre_dict.update(dict(report))
genre_dict.update(dict(excerpt))
genre_dict.update(dict(pamphlet))
genre_dict.update(dict(flyer))
genre_dict.update(dict(handout))
genre_dict.update(dict(summary))
genre_dict.update(dict(news_request))
genre_dict.update(dict(minutes))
genre_dict.update(dict(news_release))
genre_dict.update(dict(senate_bill))
genre_dict.update(dict(note))
genre_dict.update(dict(notes))
genre_dict.update(dict(statement))
genre_dict.update(dict(telegram))
genre_dict.update(dict(article))
genre_dict.update(dict(news_conference))
genre_dict.update(dict(proclamation))
genre_dict.update(dict(data))
genre_dict.update(dict(court_order))
genre_dict.update(dict(legislative_action))
genre_dict.update(dict(press_release))
genre_dict.update(dict(transcript))
genre_dict.update(dict(findings))
genre_dict.update(dict(review))
genre_dict.update(dict(speech))
genre_dict.update(dict(response))
genre_dict.update(dict(information))
genre_dict.update(dict(press_packet))
genre_dict.update(dict(trial_testimony))
genre_dict.update(manual_dict)

In [115]:
# Check that all texts are in dict

len([x for x in bob_df.title if x not in genre_dict])

0

In [116]:
# Map to dataframe

bob_df['genre'] = bob_df.title.map(genre_dict)
len(bob_df.loc[bob_df.genre.isnull()])

0

In [117]:
bob_df.loc[bob_df.genre == 'news_request','abstract'].to_list()[0]

"An ACC-PARC News announcement requesting press coverage for Frank S. Beal's follow-up visit. The third in a series of meetings on a lack of programs, services, and facilities."

In [118]:
# Check results

bob_df.genre.value_counts()

# These can be collapsed if desired

letter                314
report                 37
article                33
memo                   32
news_release           13
notes                  11
minutes                10
telegram               10
plan                    7
statement               6
senate_bill             6
photograph              5
policy                  5
information             4
court_order             4
excerpt                 4
review                  4
response                3
questions               3
summary                 3
note                    3
transcript              2
news_request            2
data                    2
findings                2
press_release           2
flyer                   2
agreement               2
resolution              1
legislative_action      1
legal_brief             1
trial_testimony         1
proclamation            1
press_packet            1
handout                 1
news_conference         1
speech                  1
pamphlet                1
Name: genre,

In [119]:
# Create column with MODS dict and categories

mods_dict = {'letter': 'correspondence',
             'report': 'report',
             'article': 'newspaper',
             'memo': 'correspondence',
             'news_release': 'press release',
             'notes': 'correspondence',
             'telegram':'correspondence',
             'minutes':'minutes',
             'plan':'report',
             'statement':'press release',
             'senate_bill':'case file',
             'photograph':'photograph',
             'policy':'report',
             'review':'report',
             'information':'report',
             'court_order':'case file',
             'excerpt':'clipping',
             'response':'correspondence',
             'summary':'report',
             'note':'correspondence',
             'questions':'correspondence',
             'data':'case file',
             'agreement':'case file',
             'news_request':'correspondence',
             'flyer':'flier',
             'transcript':'transcript',
             'findings':'report',
             'speech':'transcript',
             'resolution':'report',
             'press packet':'press release',
             'handout':'broadside',
             'trial_testimony':'case file',
             'news_conference':'press release',
             'pamphlet':'pamphlet',
             'legal_brief':'case file',
             'legislative_action':'case file',
             'proclamation':'broadside'}

In [120]:
bob_df['genre_MODS'] = bob_df.genre.map(mods_dict)
bob_df.genre_MODS.value_counts()

correspondence    378
report             63
newspaper          33
press release      20
case file          17
minutes            10
photograph          5
clipping            4
transcript          3
flier               2
broadside           2
pamphlet            1
Name: genre_MODS, dtype: int64

In [121]:
# Create resource_type column

bob_df['resource_type'] = ['text' if x != 'photograph' else 'still image' for x in bob_df.genre_MODS]
bob_df.resource_type.value_counts()

text           536
still image      5
Name: resource_type, dtype: int64

## 7. Wrap-up

In [122]:
# Write out dataframe as .pkl file

joblib.dump(bob_df,'bob_df.pkl')

['bob_df.pkl']

In [123]:
bob_df.head()

Unnamed: 0,id,title,display_date,abstract,host,series,container,owner,depositor,collection_id,text,language,len,tok_lem_POS_NLTK,tok_lem_POS_CLAWS,tok_lem_POS_NLTK_corrected,misspelling_correction,len_errors,genre,genre_MODS,resource_type
0,MSS_1002_B001_F11_I01,Recent Litigation Memo,"July 11, 1975",A PARC internal memo that summarizes recent li...,Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 11, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History Center",collection.341,﻿V\n\nPennsylvania Association for Retarded Ci...,English,3042,"[(V, V, NNP), (Pennsylvania, Pennsylvania, NNP...","[(Pennsylvania, pennsylvania, n), (Association...","[(v, V, NNP), (pennsylvania, Pennsylvania, NNP...","[((dpw, DPW, NNP), (dpi, dpi, NNP)), ((bazelon...",26,memo,correspondence,text
1,MSS_1002_B001_F12_I01,Letter from Peter Polloni to Bob Nelkin,"March 11, 1975","A letter from Peter Polloni, executive directo...",Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 12, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History Center",collection.341,﻿Pennsylvania Association for Retarded Citizen...,English,242,"[(Pennsylvania, Pennsylvania, NNP), (Associati...","[(Pennsylvania, pennsylvania, n), (Association...","[(pennsylvania, Pennsylvania, NNP), (associati...","[((ppp, PPP, NNP), (pop, pop, NNP)), ((schmi, ...",4,letter,correspondence,text
2,MSS_1002_B001_F13_I01,Letter to Frank Beal from Families and Friends...,"August 19, 1976",A letter from Families and Friends of Southwes...,Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 13, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History Center",collection.341,﻿11\nFAMILIES & FRIENDS OF SOUTHWEST HABILITAT...,English,268,"[(11, 11, CD), (FAMILIES, FAMILIES, NNP), (&, ...","[(1, 1, m), (FAMILIES, family, n), (FRIENDS, f...","[(11, 11, CD), (families, FAMILIES, NNP), (&, ...","[((fodi, Fodi, NNP), (jodi, jodi, NNP))]",1,letter,correspondence,text
3,MSS_1002_B001_F13_I02,Letter from families of patients at Southwest ...,"July 27, 1976",A letter requesting Bob Nelkin's advice on adv...,Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 13, Item 2",Heinz History Center,"Detre Library & Archives, Heinz History Center",collection.341,﻿FAMILIES & FRIENDS OF\nSOUTHWEST HABILITATION...,English,320,"[(FAMILIES, FAMILIES, NNP), (&, &, CC), (FRIEN...","[(FAMILIES, family, n), (FRIENDS, friend, n), ...","[(families, FAMILIES, NNP), (&, &, CC), (frien...","[((tesident, tesident, NN), (resident, residen...",3,letter,correspondence,text
4,MSS_1002_B001_F16_I01,ACC-PARC Recent Benefits to Families Memo,"March 28, 1977",Correspondence from Bob Nelkin to Joan Murdoch...,Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 16, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History Center",collection.341,C ommonwealth of Pennsylvania\n\nDepartment of...,English,6932,"[(C, C, NNP), (ommonwealth, ommonwealth, NN), ...","[(C, c, n), (ommonwealth, ommonwealth, n), (of...","[(c, C, NNP), (commonwealth, commonwealth, NN)...","[((ommonwealth, ommonwealth, NN), (commonwealt...",209,memo,correspondence,text


[Back to top](#Bob-Nelkin-Collection---Processing)