# Bob Nelkin Collection - Lexical analysis

<br>

**Notebook author:** Ben Naismith  
**Last modified:** July 16, 2021

<br>

This notebook explores options for creating _concordances_ and examining _collocations_ in the Bob Nelkin Collection. These tools are intended to support a better understanding of the lexis used in the collection's texts by considering how frequently lexical items occur and the contexts in which they occur.  

For more detail, please see the `README.md` file in this folder.

<br>

**Notebook contents:**
1. [Initial setup](#1.-Initial-setup)
2. [Concordancing](#2.-Concordancing)
3. [Collocations](#3.-Collocations)

## 1. Initial setup

In [1]:
# Import necessary modules

import pandas as pd
import pprint
from IPython.core.interactiveshell import InteractiveShell
import joblib
from pelitk import conc
from more_itertools import unique_everseen
from nltk import FreqDist

In [2]:
# Set preferred notebook format

InteractiveShell.ast_node_interactivity = "all" # Show all output, not just last item
pd.set_option('display.max_columns', 999) # Allow viewing of all columns

In [3]:
# Read in pre-processed dataframe

bob_df = joblib.load('../bob_df.pkl')
bob_df.head(3)

Unnamed: 0,id,title,display_date,abstract,host,series,container,owner,depositor,collection_id,text,language,len,tok_lem_POS_NLTK,tok_lem_POS_CLAWS,tok_lem_POS_NLTK_corrected,misspelling_correction,len_errors,genre,genre_MODS,resource_type,sentiment_polarity,sentiment_agreement,sentiment_confidence,entities,topics
0,MSS_1002_B001_F11_I01,Recent Litigation Memo,"July 11, 1975",A PARC internal memo that summarizes recent li...,Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 11, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History Center",collection.341,﻿V\n\nPennsylvania Association for Retarded Ci...,English,3042,"[(V, V, NNP), (Pennsylvania, Pennsylvania, NNP...","[(Pennsylvania, pennsylvania, n), (Association...","[(v, V, NNP), (pennsylvania, Pennsylvania, NNP...","[((dpw, DPW, NNP), (dpi, dpi, NNP)), ((bazelon...",26,memo,correspondence,text,NEU,DISAGREEMENT,86,"[(Legal Services, Company), (Supreme Court, Go...","[(plaintiff, Person), (patient, Person), (pati..."
1,MSS_1002_B001_F12_I01,Letter from Peter Polloni to Bob Nelkin,"March 11, 1975","A letter from Peter Polloni, executive directo...",Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 12, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History Center",collection.341,﻿Pennsylvania Association for Retarded Citizen...,English,242,"[(Pennsylvania, Pennsylvania, NNP), (Associati...","[(Pennsylvania, pennsylvania, n), (Association...","[(pennsylvania, Pennsylvania, NNP), (associati...","[((ppp, PPP, NNP), (pop, pop, NNP)), ((schmi, ...",4,letter,correspondence,text,P,DISAGREEMENT,84,"[(Pennsylvania, Adm1)]","[(report, Top)]"
2,MSS_1002_B001_F13_I01,Letter to Frank Beal from Families and Friends...,"August 19, 1976",A letter from Families and Friends of Southwes...,Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 13, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History Center",collection.341,﻿11\nFAMILIES & FRIENDS OF SOUTHWEST HABILITAT...,English,268,"[(11, 11, CD), (FAMILIES, FAMILIES, NNP), (&, ...","[(1, 1, m), (FAMILIES, family, n), (FRIENDS, f...","[(11, 11, CD), (families, FAMILIES, NNP), (&, ...","[((fodi, Fodi, NNP), (jodi, jodi, NNP))]",1,letter,correspondence,text,P,DISAGREEMENT,92,"[(Habilitation Center, Facility), (Pennsylvani...","[(southwest, Location), (unit, Unit), (group, ..."


## 2. Concordancing

In [4]:
%pprint

# Example with single lemma in single text

example = [x[0].lower() for x in bob_df.iloc[10,13]]
conc.concordance(example,'advocacy',5,pretty=True)

Pretty printing has been turned OFF


['                                           advocacy   report march 21 , 1978                  ', '                  , 1978 i . educational   advocacy   10 year old - placed                    ', '              held . -2- ii• residential   advocacy   somerset m , r .                        ', '        report - an in-depth residential   advocacy   report is available . ill               ']
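As a rough illustration of what a concordancer does, the sketch below builds keyword-in-context (KWIC) lines in plain Python. The function `kwic` and the toy token list are hypothetical and only approximate the idea; this is not how `pelitk` implements `conc.concordance`.

```python
def kwic(tokens, node, span=5):
    """Return (left context, node, right context) for every occurrence of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - span):i]    # clamp at the start of the text
            right = tokens[i + 1:i + 1 + span]   # slicing past the end is safe
            lines.append((" ".join(left), tok, " ".join(right)))
    return lines

# Toy data, invented for illustration only
toy_tokens = "the advocacy report is an in-depth residential advocacy report".split()
kwic_lines = kwic(toy_tokens, "advocacy", span=3)

for left, node, right in kwic_lines:
    # Right-align the left context so the node words line up in a column
    print(f"{left:>30}   {node}   {right}")
```

Aligning the node word in a fixed column is what makes recurring patterns to its left and right easy to scan.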

In [5]:
# Create function which returns a concordance line for each occurrence of any item in a list

def get_concs(tok_text, forms_list):
    forms = set(forms_list)  # set for fast membership tests
    conclist = []
    for tok in tok_text:
        if tok.lower() in forms:
            # conc.concordance returns every line for this token, so repeats are dropped below
            conclist.append(conc.concordance(tok_text, tok, 5))
    return [line for lines in unique_everseen(conclist) for line in lines]

In [6]:
# Test with example

conc.prettify(get_concs(example,['advocacy','school']))

['                                           advocacy   report march 21 , 1978                  ', '                  , 1978 i . educational   advocacy   10 year old - placed                    ', '              held . -2- ii• residential   advocacy   somerset m , r .                        ', '        report - an in-depth residential   advocacy   report is available . ill               ', '                    year old - placed in    school    acc-parc was informed by a              ', '           year old child never attended    school    . the child is now                      ', '                  ’ s education . parent    school    conference - penn hills acc-parc        ', '             , and evaluation . expedite    school    placement acc-parc requested pittsburgh schools', '               to our intervention , the    school    board refused to make the               ', '                make the placement . the    school    placement was made and the              ', '           

In [7]:
# Test on entire dataset

all_toks = bob_df.tok_lem_POS_NLTK.apply(lambda row: [x[0].lower() for x in row])
all_toks = [x for y in all_toks for x in y]

In [8]:
conc.prettify(get_concs(all_toks,['behavior']))[:10]
len(conc.prettify(get_concs(all_toks,['behavior'])))

['                    very dire need for a   behavior   management unit at southwest habilitation', '                     recently , due to a   behavior   problem of some residents ,             ', '               a much needed third floor   behavior   care unit is needed for                 ', '        be- cause of severely aggressive   behavior   . it is my opinion                      ', '   inclined toward assaultive or violent   behavior   than are any other persons              ', ' and emotionally disturbed or exhibiting   behavior   oroblems . the problem is               ', '  families reduce aggressive or abnormal   behavior   of their children and reduce            ', '  this responsibility because of extreme   behavior   . 3 ) the conversion                    ', '                 the care of people with   behavior   problems , who currently reside         ', 'also emotionally disturbed or exhibiting   behavior   problems 4 ) adult training             ']

178

## 3. Collocations

In [9]:
# Extract potential collocations in span 4 (up to 4 words either side of key word)

padding = [('x','x'),('x','x'),('x','x'),('x','x'),('x','x')]
all_lemmas = bob_df.tok_lem_POS_CLAWS.apply(lambda row: [(x[1],x[2]) for x in row])
all_lemmas = [x for y in all_lemmas for x in y]
padded_lemmas = padding + all_lemmas.copy() + padding

def find_cols(lemma,POS):
    col_list = []
    for i in range(len(padded_lemmas)):
        if padded_lemmas[i] == (lemma,POS):
            col_list.extend(padded_lemmas[i-4:i])    # up to 4 items to the left
            col_list.extend(padded_lemmas[i+1:i+5])  # up to 4 items to the right
    # Build the frequency distribution once; most_common() sorts by descending count
    return FreqDist(col_list).most_common()

In [10]:
padded_lemmas[:10]

[('x', 'x'), ('x', 'x'), ('x', 'x'), ('x', 'x'), ('x', 'x'), ('pennsylvania', 'n'), ('association', 'n'), ('for', 'i'), ('retarded', 'j'), ('citizen', 'n')]

In [11]:
# Test function - 10 most common collocations with the adjective 'mental'

find_cols('mental','j')[:10]

[(('retardation', 'n'), 283), (('the', 'a'), 265), (('of', 'i'), 213), (('health', 'n'), 128), (('and', 'c'), 125), (('in', 'i'), 90), (('mental', 'j'), 82), (('be', 'v'), 74), (('county', 'n'), 67), (('office', 'n'), 63)]
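The window-based counting used here can also be sketched without the global padded list. The version below is a minimal, hypothetical variant (`window_collocates` is not part of this notebook or of any library): it takes the token list as an argument, clamps the window at the start of the list instead of padding, and uses `collections.Counter`, which `FreqDist` subclasses. The toy data is invented.

```python
from collections import Counter

def window_collocates(tagged_tokens, node, span=4):
    """Count items occurring within `span` positions either side of `node`."""
    counts = Counter()
    for i, item in enumerate(tagged_tokens):
        if item == node:
            counts.update(tagged_tokens[max(0, i - span):i])    # left window
            counts.update(tagged_tokens[i + 1:i + 1 + span])    # right window
    return counts.most_common()  # sorted by descending count

# Toy (lemma, POS) pairs, invented for illustration only
toy = [("mental", "j"), ("health", "n"), ("and", "c"),
       ("mental", "j"), ("retardation", "n"), ("office", "n")]
toy_result = window_collocates(toy, ("mental", "j"), span=2)
print(toy_result)
```

Passing the tokens in as a parameter would also make it possible to run the count one document at a time, so that windows do not span document boundaries.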

To make these results more meaningful, they can be filtered in different ways. Two options are shown here:
1. Filter results to keep only words with more semantic meaning (nouns, verbs, adjectives, adverbs).
2. Filter results to keep only collocations with Mutual Information (MI) ≥ 3 and frequency > 10 (common thresholds).

In [12]:
# Filter results by POS

lexical = ['n','v','j','r']
    
[x for x in find_cols('mental','j') if x[0][1] in lexical][:10]

[(('retardation', 'n'), 283), (('health', 'n'), 128), (('mental', 'j'), 82), (('be', 'v'), 74), (('county', 'n'), 67), (('office', 'n'), 63), (('allegheny', 'n'), 50), (('program', 'n'), 44), (('hospital', 'n'), 36), (('state', 'n'), 29)]

The COCA MI data is available only under a paid license. Please see [Dr Na-Rae Han](https://www.linguistics.pitt.edu/people/na-rae-han) for access information for Pitt students and faculty or the [COCA website](https://www.wordfrequency.info/purchase.asp) for purchase information.
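For readers unfamiliar with the measure, Mutual Information compares how often two words co-occur with how often they would be expected to co-occur by chance. The sketch below uses the basic pointwise definition, $\log_2(\text{observed}/\text{expected})$. The function name and all numbers are hypothetical, and the MI values in the COCA data may be computed with a different variant (for example, one adjusted for span size), so this should not be read as COCA's formula.

```python
import math

def mutual_information(pair_freq, freq_a, freq_b, corpus_size):
    """Basic pointwise MI: log2 of observed over expected co-occurrence."""
    expected = freq_a * freq_b / corpus_size  # co-occurrences expected by chance
    return math.log2(pair_freq / expected)

# Invented counts: two words that each occur 100 times in a 100,000-token
# corpus and co-occur 10 times
mi = mutual_information(pair_freq=10, freq_a=100, freq_b=100, corpus_size=100_000)
print(round(mi, 2))
```

In this toy case the pair co-occurs about 100 times more often than chance would predict, which puts its MI well above a threshold of 3.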

In [13]:
# Filter results by mutual information

# Load and modify collocation dataframe
col_df = joblib.load('../../../../COCA_data/COCA_2020_collocation_df.pkl')
col_df = col_df.loc[col_df.MI >= 3].reset_index(drop=True)
col_df['word1'] = col_df.collocation.apply(lambda x: x[0])
col_df['word2'] = col_df.collocation.apply(lambda x: x[1])
col_df.loc[(col_df.word1 == ('mental','j')) | (col_df.word2 == ('mental','j'))].head()

Unnamed: 0,freq,MI,collocation,tscore,word1,word2
455,19432,7.01,"((health, n), (mental, j))",139.26,"(health, n)","(mental, j)"
456,19431,7.01,"((mental, j), (health, n))",139.25,"(mental, j)","(health, n)"
1623,7306,8.78,"((illness, n), (mental, j))",85.45,"(illness, n)","(mental, j)"
1627,7304,8.78,"((mental, j), (illness, n))",85.44,"(mental, j)","(illness, n)"
3409,4180,6.23,"((physical, j), (mental, j))",64.54,"(physical, j)","(mental, j)"


In [14]:
# Create function to perform this search

all_cols = set(col_df.collocation)

def MI_cols(lemma,POS):
    node = (lemma,POS)
    bob_cols = find_cols(lemma,POS)
    # Keep collocates whose pairing with the node (in either order) appears in the COCA set
    return [x for x in bob_cols if (node,x[0]) in all_cols or (x[0],node) in all_cols]

In [15]:
MI_cols('mental','j')

[(('retardation', 'n'), 283), (('health', 'n'), 128), (('mental', 'j'), 82), (('hospital', 'n'), 36), (('service', 'n'), 18), (('institution', 'n'), 17), (('treatment', 'n'), 12), (('patient', 'n'), 11), (('illness', 'n'), 9), (('physical', 'j'), 8), (('condition', 'n'), 4), (('diagnosis', 'n'), 3), (('emotional', 'j'), 3), (('professional', 'n'), 3), (('evaluation', 'n'), 2), (('disorder', 'n'), 2), (('qualified', 'j'), 2), (('facility', 'n'), 2), (('serious', 'j'), 1), (('abnormality', 'n'), 1), (('deficiency', 'n'), 1), (('handicap', 'n'), 1), (('disability', 'n'), 1), (('stimulation', 'n'), 1), (('licensed', 'j'), 1), (('developmental', 'j'), 1), (('clinic', 'n'), 1), (('provider', 'n'), 1), (('well-being', 'n'), 1), (('deficient', 'j'), 1), (('suffer', 'v'), 1), (('stigma', 'n'), 1), (('ward', 'n'), 1), (('abuse', 'n'), 1), (('status', 'n'), 1)]
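Many of the collocates in this output occur only once or twice in the collection. One possible refinement, sketched below, is to combine the reference-list check with a minimum count. The helper `filter_by_reference` is hypothetical, and applying the cutoff to counts from this collection (rather than to COCA frequencies) is an assumption, not something the notebook specifies. The counts in the toy input are copied from the outputs above; the reference pairs mirror the format of the COCA `collocation` column.

```python
def filter_by_reference(collocates, node, reference_pairs, min_freq=1):
    """Keep collocates paired with `node` in the reference list, in either order."""
    # frozenset makes the lookup order-insensitive: (a, b) and (b, a) match
    unordered = {frozenset(pair) for pair in reference_pairs}
    return [(col, n) for col, n in collocates
            if n >= min_freq and frozenset((node, col)) in unordered]

node = ("mental", "j")
toy_collocates = [(("the", "a"), 265), (("health", "n"), 128), (("illness", "n"), 9)]
toy_reference = [(("health", "n"), ("mental", "j")),
                 (("mental", "j"), ("illness", "n"))]

kept_all = filter_by_reference(toy_collocates, node, toy_reference)
kept_frequent = filter_by_reference(toy_collocates, node, toy_reference, min_freq=11)
print(kept_all)
print(kept_frequent)
```

Storing each pair as a `frozenset` replaces the two separate membership tests with a single lookup.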

[Back to top](#Bob-Nelkin-Collection---Lexical-analysis)