# Bob Nelkin Collection - Text analysis

<br>

**Notebook author:** Ben Naismith  
**Last modified:** June 16, 2021

<br>

This notebook contains code for text analysis of the Bob Nelkin Collection texts. Text analysis in this case refers to the use of machine learning tools to extract information about the texts in terms of entities, topics, and sentiment. All of the text analysis tools use APIs from [meaningcloud.com](https://www.meaningcloud.com). This information can be used to filter the collection down to texts related to certain topics or containing certain sentiments.  


<br>

**Notebook contents:**
1. [Initial setup](#1.-Initial-setup)
2. [Sentiment analysis](#2.-Sentiment-analysis)
3. [Topic and entity extraction](#3.-Topic-and-entity-extraction)
4. [Cluster analysis](#4.-Cluster-analysis)
5. [Wrap-up](#5.-Wrap-up)

## 1. Initial setup

In [1]:
# Import necessary modules

import pandas as pd
import pprint
from IPython.core.interactiveshell import InteractiveShell
import joblib
import sys
import meaningcloud
import requests
from tqdm import tqdm
from nltk import FreqDist
import glob
import os
import shutil

In [2]:
# Set preferred notebook format

InteractiveShell.ast_node_interactivity = "all" # Show all output, not just last item
pd.set_option('display.max_columns', 999) # Allow viewing of all columns

In [3]:
# Read in processed dataframe

bob_df = joblib.load('../processing/bob_df.pkl')
bob_df.head(1)

Unnamed: 0,id,title,display_date,abstract,host,series,container,owner,depositor,collection_id,text,language,len,tok_lem_POS_NLTK,tok_lem_POS_CLAWS,tok_lem_POS_NLTK_corrected,misspelling_correction,len_errors,genre,genre_MODS,resource_type
0,MSS_1002_B001_F11_I01,Recent Litigation Memo,"July 11, 1975",A PARC internal memo that summarizes recent li...,Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 11, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History Center",collection.341,﻿V\n\nPennsylvania Association for Retarded Ci...,English,3042,"[(V, V, NNP), (Pennsylvania, Pennsylvania, NNP...","[(Pennsylvania, pennsylvania, n), (Association...","[(v, V, NNP), (pennsylvania, Pennsylvania, NNP...","[((dpw, DPW, NNP), (dpi, dpi, NNP)), ((bazelon...",26,memo,correspondence,text


## 2. Sentiment analysis

Sentiment analysis uses the MeaningCloud API: https://www.meaningcloud.com/developer/sentiment-analysis  
To run this code, you will need a license key from MeaningCloud, which allows for a limited number of API calls per month. A free account provides sufficient API calls to run all of the code in this notebook.  

Here, I am analyzing the polarity and agreement of the sentiment, as well as the confidence level that the label is correct. However, other options are also available, e.g. irony (see the MeaningCloud documentation).
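
The polarity labels returned by the API are short codes. As a reading aid, the mapping below spells them out. The code values are the ones that appear in this notebook's results, while `POLARITY_LABELS` and `describe_polarity` are illustrative names rather than part of the MeaningCloud SDK.

```python
# Illustrative helper (not part of the MeaningCloud SDK): spell out score_tag codes.
POLARITY_LABELS = {
    "P+": "strong positive",
    "P": "positive",
    "NEU": "neutral",
    "N": "negative",
    "N+": "strong negative",
    "NONE": "without polarity",
}

def describe_polarity(score_tag):
    """Return a readable label; blank tags (texts never analyzed) map to 'not analyzed'."""
    if score_tag == "":
        return "not analyzed"
    return POLARITY_LABELS.get(score_tag, "unknown code: " + score_tag)

print(describe_polarity("P+"))
print(describe_polarity("NONE"))
```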

In [4]:
license_key = 'YOUR KEY HERE'
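
Hard-coding the key makes it easy to commit by accident. One alternative, sketched here, is to read it from an environment variable. The variable name `MEANINGCLOUD_KEY` and the helper `load_license_key` are assumptions made for illustration.

```python
import os

def load_license_key(var_name="MEANINGCLOUD_KEY"):
    """Read the license key from an environment variable (the name is an assumption)."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError("Set the " + var_name + " environment variable before running the notebook.")
    return key

# Example usage, assuming the variable was exported before Jupyter was started:
# license_key = load_license_key()
```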

In [5]:
# Create short function to get key info

def get_sentiment(text):
    # Skip empty texts (e.g. the photographs) so no API credits are wasted
    if len(text) == 0:
        return ''
    url = "https://api.meaningcloud.com/sentiment-2.1"
    payload = {"key": license_key,
               "txt": text,
               "lang": "en"}
    response = requests.post(url, data=payload)
    result = response.json() # Parse the JSON body once
    return (result["score_tag"], result["agreement"], result["confidence"])

In [6]:
# Test function

text1 = "I hope this is a good idea."
text2 = ""

get_sentiment(text1)
get_sentiment(text2)

('P', 'AGREEMENT', '100')

''
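
If a request fails (for example, when the monthly credits run out), the response will not contain the sentiment fields and indexing into it raises a `KeyError`. The sketch below shows one defensive way to unpack a decoded response. It assumes the JSON carries a `status` object whose `code` is `"0"` on success, and `parse_sentiment` is an illustrative name. The dictionaries are hand-written stand-ins, not real API output.

```python
def parse_sentiment(result):
    """Pull (score_tag, agreement, confidence) out of a decoded JSON response.

    Assumes the response carries a 'status' object whose 'code' is "0" on
    success; anything else is reported and an empty string is returned.
    """
    status = result.get("status", {})
    if str(status.get("code")) != "0":
        print("Sentiment request failed: (" + str(status.get("code")) + ") " + str(status.get("msg")))
        return ''
    return (result["score_tag"], result["agreement"], result["confidence"])

# Hand-written dictionaries standing in for decoded API responses
ok = {"status": {"code": "0", "msg": "OK"}, "score_tag": "P", "agreement": "AGREEMENT", "confidence": "100"}
bad = {"status": {"code": "200", "msg": "Missing required parameter(s): txt, url, doc"}}

print(parse_sentiment(ok))
print(parse_sentiment(bad))
```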

In [7]:
# Apply function to dataframe - do once only as time consuming and uses credits

tqdm.pandas(desc="Progress")
bob_df['sent'] = bob_df.text.progress_apply(get_sentiment)

Progress: 100%|██████████| 541/541 [07:32<00:00,  1.19it/s]  
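
Because the full run takes several minutes and spends credits, it can help to cache results on disk so that re-running the cell only calls the API for texts not seen before. The sketch below is one way to do that. `cached_apply`, the cache file name, and the use of `pickle` are illustrative choices, not part of the original notebook.

```python
import os
import pickle

def cached_apply(texts, func, cache_path):
    """Apply func to each text, reusing results stored in cache_path from earlier runs."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            cache = pickle.load(f)
    results = []
    for text in texts:
        if text not in cache:
            cache[text] = func(text)  # Only new texts trigger a call
        results.append(cache[text])
    with open(cache_path, "wb") as f:
        pickle.dump(cache, f)
    return results

# Example usage (hypothetical file name):
# bob_df['sent'] = cached_apply(bob_df.text, get_sentiment, 'sentiment_cache.pkl')
```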


In [8]:
# Split up sent columns into three parts

bob_df['sentiment_polarity'] = [x[0] if len(x) == 3 else x for x in bob_df.sent]
bob_df['sentiment_agreement'] = [x[1] if len(x) == 3 else x for x in bob_df.sent]
bob_df['sentiment_confidence'] = [x[2] if len(x) == 3 else x for x in bob_df.sent]

del bob_df['sent']

bob_df.head()

Unnamed: 0,id,title,display_date,abstract,host,series,container,owner,depositor,collection_id,text,language,len,tok_lem_POS_NLTK,tok_lem_POS_CLAWS,tok_lem_POS_NLTK_corrected,misspelling_correction,len_errors,genre,genre_MODS,resource_type,sentiment_polarity,sentiment_agreement,sentiment_confidence
0,MSS_1002_B001_F11_I01,Recent Litigation Memo,"July 11, 1975",A PARC internal memo that summarizes recent li...,Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 11, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History Center",collection.341,﻿V\n\nPennsylvania Association for Retarded Ci...,English,3042,"[(V, V, NNP), (Pennsylvania, Pennsylvania, NNP...","[(Pennsylvania, pennsylvania, n), (Association...","[(v, V, NNP), (pennsylvania, Pennsylvania, NNP...","[((dpw, DPW, NNP), (dpi, dpi, NNP)), ((bazelon...",26,memo,correspondence,text,NEU,DISAGREEMENT,86
1,MSS_1002_B001_F12_I01,Letter from Peter Polloni to Bob Nelkin,"March 11, 1975","A letter from Peter Polloni, executive directo...",Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 12, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History Center",collection.341,﻿Pennsylvania Association for Retarded Citizen...,English,242,"[(Pennsylvania, Pennsylvania, NNP), (Associati...","[(Pennsylvania, pennsylvania, n), (Association...","[(pennsylvania, Pennsylvania, NNP), (associati...","[((ppp, PPP, NNP), (pop, pop, NNP)), ((schmi, ...",4,letter,correspondence,text,P,DISAGREEMENT,84
2,MSS_1002_B001_F13_I01,Letter to Frank Beal from Families and Friends...,"August 19, 1976",A letter from Families and Friends of Southwes...,Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 13, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History Center",collection.341,﻿11\nFAMILIES & FRIENDS OF SOUTHWEST HABILITAT...,English,268,"[(11, 11, CD), (FAMILIES, FAMILIES, NNP), (&, ...","[(1, 1, m), (FAMILIES, family, n), (FRIENDS, f...","[(11, 11, CD), (families, FAMILIES, NNP), (&, ...","[((fodi, Fodi, NNP), (jodi, jodi, NNP))]",1,letter,correspondence,text,P,DISAGREEMENT,92
3,MSS_1002_B001_F13_I02,Letter from families of patients at Southwest ...,"July 27, 1976",A letter requesting Bob Nelkin's advice on adv...,Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 13, Item 2",Heinz History Center,"Detre Library & Archives, Heinz History Center",collection.341,﻿FAMILIES & FRIENDS OF\nSOUTHWEST HABILITATION...,English,320,"[(FAMILIES, FAMILIES, NNP), (&, &, CC), (FRIEN...","[(FAMILIES, family, n), (FRIENDS, friend, n), ...","[(families, FAMILIES, NNP), (&, &, CC), (frien...","[((tesident, tesident, NN), (resident, residen...",3,letter,correspondence,text,NEU,DISAGREEMENT,91
4,MSS_1002_B001_F16_I01,ACC-PARC Recent Benefits to Families Memo,"March 28, 1977",Correspondence from Bob Nelkin to Joan Murdoch...,Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 16, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History Center",collection.341,C ommonwealth of Pennsylvania\n\nDepartment of...,English,6932,"[(C, C, NNP), (ommonwealth, ommonwealth, NN), ...","[(C, c, n), (ommonwealth, ommonwealth, n), (of...","[(c, C, NNP), (commonwealth, commonwealth, NN)...","[((ommonwealth, ommonwealth, NN), (commonwealt...",209,memo,correspondence,text,P,DISAGREEMENT,86


In [9]:
# Check polarity and agreement counts (the blank responses are the 5 photos)

bob_df.sentiment_polarity.value_counts()
bob_df.sentiment_agreement.value_counts()

P       221
NEU     137
N       103
NONE     57
P+       16
          5
N+        2
Name: sentiment_polarity, dtype: int64

DISAGREEMENT    443
AGREEMENT        93
                  5
Name: sentiment_agreement, dtype: int64
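
These columns can now be used to filter the collection. The sketch below uses a small made-up dataframe with the same column names. Note that the confidence values come back from the API as strings, so they are converted to numbers first (blank values for the photographs become `NaN`). The ids and the threshold of 90 are placeholders.

```python
import pandas as pd

# Toy stand-in for bob_df; ids and values are invented for illustration
toy_df = pd.DataFrame({
    "id": ["doc_a", "doc_b", "doc_c", "photo_d"],
    "sentiment_polarity": ["P", "N", "N+", ""],
    "sentiment_confidence": ["92", "95", "84", ""],
})

# Confidence is stored as text, so convert before comparing
confidence = pd.to_numeric(toy_df["sentiment_confidence"], errors="coerce")

# Keep negative texts (N or N+) labelled with at least 90% confidence
negative = toy_df[toy_df["sentiment_polarity"].isin(["N", "N+"]) & (confidence >= 90)]
print(negative["id"].to_list())
```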

## 3. Topic and entity extraction

Example code from https://github.com/MeaningCloud/meaningcloud-python/blob/master/example/mc_showcase.py

In [10]:
# Edited version of the function from the link above.

def extract_topics(text):
    relevance = 80 # Set minimum required relevance
    entities = ''
    concepts = ''
    topics_req = meaningcloud.TopicsRequest(license_key, txt=text, lang='en', topicType='ec',otherparams={'txtf':'markup'})
    topics_response = meaningcloud.TopicsResponse(topics_req.sendReq())
    if topics_response.isSuccessful():
        entities_list = topics_response.getEntities()
        formatted_entities = []
        if entities_list:
            for entity in entities_list:
                if int(topics_response.getTopicRelevance(entity)) >= relevance:
                    formatted_entities.append(topics_response.getTopicForm(entity) + ' (' + topics_response.getTypeLastNode(topics_response.getOntoType(entity)) + ')')
            entities = ', '.join(formatted_entities)
        else:
            entities = '(none)'
        concepts_list = topics_response.getConcepts()
        formatted_concepts = []
        if concepts_list:
            for concept in concepts_list:
                if int(topics_response.getTopicRelevance(concept)) >= relevance  or (' ' in topics_response.getTopicForm(concept) and int(topics_response.getTopicRelevance(concept)) >= (relevance/2)) or topics_response.isUserDefined(concept):
                    formatted_concepts.append(topics_response.getTopicForm(concept) + ' (' + topics_response.getTypeLastNode(topics_response.getOntoType(concept)) + ')')
            concepts = ', '.join(formatted_concepts) if formatted_concepts else '(none)'
        else:
            concepts = "(none)"
    else:            
        print("\tRequest to topics was not succesful: (" + topics_response.getStatusCode() + ') ' + topics_response.getStatusMsg())
    return [entities, concepts]
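
The rule for keeping a concept in `extract_topics` is easier to see in isolation. The stand-alone function below restates it with plain values. `keep_concept` is an illustrative name, and it takes the form, relevance, and user-defined flag directly instead of reading them from the response object.

```python
def keep_concept(form, relevance, user_defined=False, threshold=80):
    """Restate the concept filter: keep a concept if it is highly relevant,
    or multi-word and at least half as relevant, or user-defined."""
    is_multiword = ' ' in form
    return (relevance >= threshold
            or (is_multiword and relevance >= threshold / 2)
            or user_defined)

print(keep_concept("patient", 85))         # single word above the threshold
print(keep_concept("mental health", 45))   # multi-word, above half the threshold
print(keep_concept("report", 45))          # single word below the threshold
```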

In [14]:
# Test function

example = "The Bob Nelkin Collection is owned by the Heinz History Center in Pittsburgh"
example2 = "There are no entities in this sentence"

extract_topics(example)
extract_topics(example2)

['Bob Nelkin Collection (Top), Heinz History Center (Facility), Pittsburgh (City)',
 '(none)']

['(none)', 'sentence (Top), sentence (Top), sentence (Top)']

In [15]:
# Apply function to dataframe. The 5 objects with no text (the photographs) will not return anything.

tqdm.pandas(desc="Progress")
bob_df['temp'] = bob_df.text.progress_apply(lambda row: extract_topics(row)) # Create temporary column

Progress:  99%|█████████▉| 535/541 [04:55<00:01,  3.70it/s]

	Request to topics was not succesful: (200) Missing required parameter(s): txt, url, doc
	Request to topics was not succesful: (200) Missing required parameter(s): txt, url, doc


Progress:  99%|█████████▉| 537/541 [04:56<00:00,  5.12it/s]

	Request to topics was not succesful: (200) Missing required parameter(s): txt, url, doc
	Request to topics was not succesful: (200) Missing required parameter(s): txt, url, doc


Progress:  99%|█████████▉| 538/541 [04:56<00:00,  5.77it/s]

	Request to topics was not succesful: (200) Missing required parameter(s): txt, url, doc


Progress: 100%|██████████| 541/541 [04:57<00:00,  1.82it/s]


In [16]:
# Split into two columns: entities and topics

bob_df['entities'] = [x[0] for x in bob_df.temp]
bob_df['topics'] = [x[1] for x in bob_df.temp]

del bob_df['temp']
bob_df.head()

Unnamed: 0,id,title,display_date,abstract,host,series,container,owner,depositor,collection_id,text,language,len,tok_lem_POS_NLTK,tok_lem_POS_CLAWS,tok_lem_POS_NLTK_corrected,misspelling_correction,len_errors,genre,genre_MODS,resource_type,sentiment_polarity,sentiment_agreement,sentiment_confidence,entities,topics
0,MSS_1002_B001_F11_I01,Recent Litigation Memo,"July 11, 1975",A PARC internal memo that summarizes recent li...,Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 11, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History Center",collection.341,﻿V\n\nPennsylvania Association for Retarded Ci...,English,3042,"[(V, V, NNP), (Pennsylvania, Pennsylvania, NNP...","[(Pennsylvania, pennsylvania, n), (Association...","[(v, V, NNP), (pennsylvania, Pennsylvania, NNP...","[((dpw, DPW, NNP), (dpi, dpi, NNP)), ((bazelon...",26,memo,correspondence,text,NEU,DISAGREEMENT,86,"Legal Services (Company), Supreme Court (Gover...","plaintiff (Person), patient (Person), patient ..."
1,MSS_1002_B001_F12_I01,Letter from Peter Polloni to Bob Nelkin,"March 11, 1975","A letter from Peter Polloni, executive directo...",Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 12, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History Center",collection.341,﻿Pennsylvania Association for Retarded Citizen...,English,242,"[(Pennsylvania, Pennsylvania, NNP), (Associati...","[(Pennsylvania, pennsylvania, n), (Association...","[(pennsylvania, Pennsylvania, NNP), (associati...","[((ppp, PPP, NNP), (pop, pop, NNP)), ((schmi, ...",4,letter,correspondence,text,P,DISAGREEMENT,84,Pennsylvania (Adm1),report (Top)
2,MSS_1002_B001_F13_I01,Letter to Frank Beal from Families and Friends...,"August 19, 1976",A letter from Families and Friends of Southwes...,Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 13, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History Center",collection.341,﻿11\nFAMILIES & FRIENDS OF SOUTHWEST HABILITAT...,English,268,"[(11, 11, CD), (FAMILIES, FAMILIES, NNP), (&, ...","[(1, 1, m), (FAMILIES, family, n), (FRIENDS, f...","[(11, 11, CD), (families, FAMILIES, NNP), (&, ...","[((fodi, Fodi, NNP), (jodi, jodi, NNP))]",1,letter,correspondence,text,P,DISAGREEMENT,92,"Habilitation Center (Facility), Pennsylvania (...","southwest (Location), unit (Unit), group (Orga..."
3,MSS_1002_B001_F13_I02,Letter from families of patients at Southwest ...,"July 27, 1976",A letter requesting Bob Nelkin's advice on adv...,Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 13, Item 2",Heinz History Center,"Detre Library & Archives, Heinz History Center",collection.341,﻿FAMILIES & FRIENDS OF\nSOUTHWEST HABILITATION...,English,320,"[(FAMILIES, FAMILIES, NNP), (&, &, CC), (FRIEN...","[(FAMILIES, family, n), (FRIENDS, friend, n), ...","[(families, FAMILIES, NNP), (&, &, CC), (frien...","[((tesident, tesident, NN), (resident, residen...",3,letter,correspondence,text,NEU,DISAGREEMENT,91,Southwest Habilitation Center (Facility),"resident (Person), group (Organization)"
4,MSS_1002_B001_F16_I01,ACC-PARC Recent Benefits to Families Memo,"March 28, 1977",Correspondence from Bob Nelkin to Joan Murdoch...,Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 16, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History Center",collection.341,C ommonwealth of Pennsylvania\n\nDepartment of...,English,6932,"[(C, C, NNP), (ommonwealth, ommonwealth, NN), ...","[(C, c, n), (ommonwealth, ommonwealth, n), (of...","[(c, C, NNP), (commonwealth, commonwealth, NN)...","[((ommonwealth, ommonwealth, NN), (commonwealt...",209,memo,correspondence,text,P,DISAGREEMENT,86,"Ze (LastName), Le (Top), Harrisburg (City)","institute (Institute), pound sterling (Currency)"


In [17]:
# Format new columns

# Create function
def clean_MC_output(MC_output):
    return [tuple(x.split(' (')) for x in MC_output.replace(')','').split(', ')]

# Apply to columns
bob_df.entities = bob_df.entities.apply(clean_MC_output)
bob_df.topics = bob_df.topics.apply(clean_MC_output)

bob_df.head()

Unnamed: 0,id,title,display_date,abstract,host,series,container,owner,depositor,collection_id,text,language,len,tok_lem_POS_NLTK,tok_lem_POS_CLAWS,tok_lem_POS_NLTK_corrected,misspelling_correction,len_errors,genre,genre_MODS,resource_type,sentiment_polarity,sentiment_agreement,sentiment_confidence,entities,topics
0,MSS_1002_B001_F11_I01,Recent Litigation Memo,"July 11, 1975",A PARC internal memo that summarizes recent li...,Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 11, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History Center",collection.341,﻿V\n\nPennsylvania Association for Retarded Ci...,English,3042,"[(V, V, NNP), (Pennsylvania, Pennsylvania, NNP...","[(Pennsylvania, pennsylvania, n), (Association...","[(v, V, NNP), (pennsylvania, Pennsylvania, NNP...","[((dpw, DPW, NNP), (dpi, dpi, NNP)), ((bazelon...",26,memo,correspondence,text,NEU,DISAGREEMENT,86,"[(Legal Services, Company), (Supreme Court, Go...","[(plaintiff, Person), (patient, Person), (pati..."
1,MSS_1002_B001_F12_I01,Letter from Peter Polloni to Bob Nelkin,"March 11, 1975","A letter from Peter Polloni, executive directo...",Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 12, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History Center",collection.341,﻿Pennsylvania Association for Retarded Citizen...,English,242,"[(Pennsylvania, Pennsylvania, NNP), (Associati...","[(Pennsylvania, pennsylvania, n), (Association...","[(pennsylvania, Pennsylvania, NNP), (associati...","[((ppp, PPP, NNP), (pop, pop, NNP)), ((schmi, ...",4,letter,correspondence,text,P,DISAGREEMENT,84,"[(Pennsylvania, Adm1)]","[(report, Top)]"
2,MSS_1002_B001_F13_I01,Letter to Frank Beal from Families and Friends...,"August 19, 1976",A letter from Families and Friends of Southwes...,Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 13, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History Center",collection.341,﻿11\nFAMILIES & FRIENDS OF SOUTHWEST HABILITAT...,English,268,"[(11, 11, CD), (FAMILIES, FAMILIES, NNP), (&, ...","[(1, 1, m), (FAMILIES, family, n), (FRIENDS, f...","[(11, 11, CD), (families, FAMILIES, NNP), (&, ...","[((fodi, Fodi, NNP), (jodi, jodi, NNP))]",1,letter,correspondence,text,P,DISAGREEMENT,92,"[(Habilitation Center, Facility), (Pennsylvani...","[(southwest, Location), (unit, Unit), (group, ..."
3,MSS_1002_B001_F13_I02,Letter from families of patients at Southwest ...,"July 27, 1976",A letter requesting Bob Nelkin's advice on adv...,Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 13, Item 2",Heinz History Center,"Detre Library & Archives, Heinz History Center",collection.341,﻿FAMILIES & FRIENDS OF\nSOUTHWEST HABILITATION...,English,320,"[(FAMILIES, FAMILIES, NNP), (&, &, CC), (FRIEN...","[(FAMILIES, family, n), (FRIENDS, friend, n), ...","[(families, FAMILIES, NNP), (&, &, CC), (frien...","[((tesident, tesident, NN), (resident, residen...",3,letter,correspondence,text,NEU,DISAGREEMENT,91,"[(Southwest Habilitation Center, Facility)]","[(resident, Person), (group, Organization)]"
4,MSS_1002_B001_F16_I01,ACC-PARC Recent Benefits to Families Memo,"March 28, 1977",Correspondence from Bob Nelkin to Joan Murdoch...,Bob Nelkin Collection of ACC-PARC Records,I. Administrative Records 1953-1983,"box 1, folder 16, Item 1",Heinz History Center,"Detre Library & Archives, Heinz History Center",collection.341,C ommonwealth of Pennsylvania\n\nDepartment of...,English,6932,"[(C, C, NNP), (ommonwealth, ommonwealth, NN), ...","[(C, c, n), (ommonwealth, ommonwealth, n), (of...","[(c, C, NNP), (commonwealth, commonwealth, NN)...","[((ommonwealth, ommonwealth, NN), (commonwealt...",209,memo,correspondence,text,P,DISAGREEMENT,86,"[(Ze, LastName), (Le, Top), (Harrisburg, City)]","[(institute, Institute), (pound sterling, Curr..."
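
Splitting on `', '` and `' ('` works for the output above, but it turns the `(none)` placeholder into the tuple `('(none',)` and would break on a form that itself contains a comma. A regular-expression version is sketched below as an alternative. `parse_mc_output` is an illustrative name, and it returns an empty list when nothing was found.

```python
import re

# A "form (Type)" pair followed by a comma or the end of the string
PAIR = re.compile(r'\s*(.+?) \(([^()]+)\)(?:,|$)')

def parse_mc_output(mc_output):
    """Turn 'form (Type), form (Type)' into a list of (form, type) tuples."""
    return PAIR.findall(mc_output)

print(parse_mc_output('Heinz History Center (Facility), Pittsburgh (City)'))
print(parse_mc_output('(none)'))
```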


In [18]:
# Check most common entities

all_entities = [x for y in bob_df.entities.to_list() for x in y] # Create flat list of all entities
entity_freq_dict = FreqDist(all_entities) # Create frequency dictionary
entity_freq = [(x,entity_freq_dict[x]) for x in entity_freq_dict]
sorted(entity_freq,key = lambda x: x[1],reverse=True)[:10]

[(('Polk', 'LastName'), 68),
 (('Pennsylvania', 'Adm1'), 62),
 (('McClelland', 'GeoPoliticalEntity'), 48),
 (('Polk', 'GeoPoliticalEntity'), 37),
 (('(none',), 37),
 (('Palo Alto Research Center', 'TechnologyCompany'), 21),
 (('Pennhurst', 'Top'), 21),
 (('Harrisburg', 'City'), 18),
 (('Polk State School', 'School'), 18),
 (('Pittsburgh', 'City'), 17)]
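
nltk's `FreqDist` is a subclass of `collections.Counter`, so `most_common` gives the same ranking without building and sorting a list by hand. The sketch below uses `Counter` directly on a few made-up tuples and also drops the `(none` placeholder before counting.

```python
from collections import Counter

# Made-up sample in the same shape as all_entities
sample_entities = [('Polk', 'LastName'), ('Pennsylvania', 'Adm1'), ('(none',),
                   ('Polk', 'LastName'), ('Pittsburgh', 'City'), ('Polk', 'LastName'),
                   ('Pennsylvania', 'Adm1')]

# Drop the placeholder left behind by texts with no entities
real_entities = [x for x in sample_entities if x != ('(none',)]

print(Counter(real_entities).most_common(2))
```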

In [19]:
# Check most common topics

all_topics = [x for y in bob_df.topics.to_list() for x in y] # Create flat list of all topics
topic_freq_dict = FreqDist(all_topics) # Create frequency dictionary
topic_freq = [(x,topic_freq_dict[x]) for x in topic_freq_dict]
sorted(topic_freq,key = lambda x: x[1],reverse=True)[:20]

[(('patient', 'Person'), 78),
 (('resident', 'Person'), 78),
 (('state school', 'School'), 60),
 (('pound sterling', 'Currency'), 40),
 (('(none',), 39),
 (('institute', 'Institute'), 32),
 (('child', 'Person'), 32),
 (('program', 'Broadcast'), 29),
 (('mental health', 'Top'), 24),
 (('general manager', 'Title'), 24),
 (('dollar', 'Currency'), 19),
 (('staff', 'Organization'), 19),
 (('action', 'Process'), 18),
 (('parent', 'Person'), 17),
 (('home', 'Facility'), 16),
 (('state', 'Adm1'), 16),
 (('person', 'Person'), 14),
 (('people', 'Person'), 13),
 (('service', 'ProfessionalService'), 12),
 (('community', 'Organization'), 12)]

## 4. Cluster analysis

Example code from https://github.com/MeaningCloud/meaningcloud-python/blob/master/example/mc_showcase.py

In [20]:
# This function obtains the text clustering of the text collection passed as a parameter

def getClustering(text_collection, cluster_score_threshold):
    clustering_response = meaningcloud.ClusteringResponse(meaningcloud.ClusteringRequest(license_key, lang='en', texts=text_collection).sendReq())
    if clustering_response.isSuccessful():
        clusters = clustering_response.getClusters()
        maximum_score = float(clustering_response.getClusterScore(clusters[0])) # The first cluster has the highest score
        titles = []
        sizes = []
        scores = []
        docs = []
        for cl in clusters:
            if (maximum_score == 0 or (float(clustering_response.getClusterScore(cl))/maximum_score)*100 >= cluster_score_threshold):
                titles.append(clustering_response.getClusterTitle(cl))
                sizes.append(clustering_response.getClusterSize(cl))
                scores.append(clustering_response.getClusterScore(cl))
                docs.append(', '.join(cl['document_list'].keys()))
        return titles, sizes, scores, docs
    else:
        print('Request to clustering was not successful: (' + clustering_response.getStatusCode() + ') ' + clustering_response.getStatusMsg())
        return [], [], [], []

In [21]:
# Test function - requires the texts to be passed in as a dictionary

example = bob_df.iloc[0:5,10].to_dict()
example = {str(key): str(value) for key, value in example.items()}
example

{'0': '\ufeffV\n\nPennsylvania Association for Retarded Citizens\nr\nt5CO NORTH SECOND STREET • HARRIS8URG, PA. 1 71C2\nTEL: (717) 234-2621\n: -J:\nMEMO TO:\nOfficers\nPARC Residential Services Committee\nAll Regional Residential Services Committees\nDATE:\nJuly 11, 1975\nFROM:\nRE:\nRecent li\nDiane J.\nAttached, for your information and review, are several summaries of recent court cases which affect residents living in Pennsylvania facilities.\n1.\t- Vecchione v. Wohlgemuth: ensures that no resident of a state hospital shall be deprived of any property unless and until he has been determined incompetent and a court authorizes such an undertaking.\n2.\tDowns v. Pennsylvania DPW: outlines the plan which DPW shall implement to prevent peonage in residential facilities.\n3.\tMental Patients Civil Liberties Project v. DPW: assures residents in mental hospitals the right to free communication and the right to organize.\n4.\tIn re Zeiler: involves the commitment of a 13-year-old girl to th

In [22]:
# Returns titles, sizes, scores, docs

getClustering(example,50)

(['Mental Patient Civil Liberties Project', 'Southwest Habilitation Center'],
 ['2', '2'],
 ['41.37', '25.77'],
 ['0, 4', '2, 3'])
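
The second argument is a relative threshold: a cluster is kept when its score is at least that percentage of the top cluster's score. Using the two scores returned by the test call, the check in `getClustering` works out as follows (`relative_score` is an illustrative helper):

```python
def relative_score(score, maximum_score):
    """Express a cluster score as a percentage of the highest score."""
    return float(score) / float(maximum_score) * 100

top, second = '41.37', '25.77'  # Scores from the test call
print(round(relative_score(second, top), 1))   # about 62.3, so it passes a threshold of 50
print(relative_score(second, top) >= 50)
```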

In [23]:
# Split dataframe into chunks as meaningcloud has max 100k words per API call

bob_df = bob_df.reset_index(drop=True)
bob_df1 = bob_df.iloc[:100,:]
bob_df2 = bob_df.iloc[100:200,:]
bob_df3 = bob_df.iloc[200:300,:]
bob_df4 = bob_df.iloc[300:400,:]
bob_df5 = bob_df.iloc[400:532,:]
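
Fixed slices of 100 rows work here, but the limit is on words rather than rows. A sketch of chunking by a word budget instead is shown below. `chunk_by_words` is an illustrative helper, words are counted by whitespace splitting (which may differ from how MeaningCloud counts them), and a single text larger than the budget still gets a chunk of its own.

```python
def chunk_by_words(text_dict, max_words):
    """Group {key: text} items into consecutive dicts whose word totals stay within max_words."""
    chunks, current, current_words = [], {}, 0
    for key, text in text_dict.items():
        n_words = len(text.split())
        # Start a new chunk when adding this text would exceed the budget
        if current and current_words + n_words > max_words:
            chunks.append(current)
            current, current_words = {}, 0
        current[key] = text
        current_words += n_words
    if current:
        chunks.append(current)
    return chunks

toy_texts = {'0': 'one two three', '1': 'four five', '2': 'six seven eight nine', '3': 'ten'}
print(chunk_by_words(toy_texts, max_words=5))
```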

In [24]:
# Turn all bob_df texts into dict

text_dict1 = bob_df1.iloc[:,10].to_dict()
text_dict1 = {str(key): str(value) for key, value in text_dict1.items()}

text_dict2 = bob_df2.iloc[:,10].to_dict()
text_dict2 = {str(key): str(value) for key, value in text_dict2.items()}

text_dict3 = bob_df3.iloc[:,10].to_dict()
text_dict3 = {str(key): str(value) for key, value in text_dict3.items()}

text_dict4 = bob_df4.iloc[:,10].to_dict()
text_dict4 = {str(key): str(value) for key, value in text_dict4.items()}

text_dict5 = bob_df5.iloc[:,10].to_dict()
text_dict5 = {str(key): str(value) for key, value in text_dict5.items()}

In [25]:
# Apply function

clusters1 = getClustering(text_dict1,50)
clusters1 = [pd.Series(x) for x in clusters1]

In [26]:
clusters2 = getClustering(text_dict2,50)
clusters2 = [pd.Series(x) for x in clusters2]

In [27]:
clusters3 = getClustering(text_dict3,50)
clusters3 = [pd.Series(x) for x in clusters3]

In [28]:
clusters4 = getClustering(text_dict4,50)
clusters4 = [pd.Series(x) for x in clusters4]

In [29]:
clusters5 = getClustering(text_dict5,50)
clusters5 = [pd.Series(x) for x in clusters5]

In [30]:
# Turn into dataframe and clean up

cluster1_df = pd.DataFrame(clusters1).T
cluster1_df = cluster1_df.astype({1:'int32',2:'float'})

cluster2_df = pd.DataFrame(clusters2).T
cluster2_df = cluster2_df.astype({1:'int32',2:'float'})

cluster3_df = pd.DataFrame(clusters3).T
cluster3_df = cluster3_df.astype({1:'int32',2:'float'})

cluster4_df = pd.DataFrame(clusters4).T
cluster4_df = cluster4_df.astype({1:'int32',2:'float'})

cluster5_df = pd.DataFrame(clusters5).T
cluster5_df = cluster5_df.astype({1:'int32',2:'float'})

In [31]:
# Combine all clusters into one dataframe and clean up

cluster_df = pd.concat([cluster1_df,cluster2_df,cluster3_df,cluster4_df,cluster5_df])
cluster_df = cluster_df.rename(columns={0: 'cluster', 1: 'cluster_freq', 2:'score',3:'cluster_loc'})
d = {'cluster_freq': 'sum', 'score': 'mean', 'cluster_loc': lambda x: ', '.join(x)}
cluster_df = cluster_df.groupby('cluster', as_index=False).aggregate(d).reindex(columns=cluster_df.columns)
cluster_df.cluster_loc = [sorted(x.split(', ')) for x in cluster_df.cluster_loc]
cluster_df.head()

Unnamed: 0,cluster,cluster_freq,score,cluster_loc
0,A. B.,2,191.64,"[505, 521]"
1,ACC Board,6,112.96,"[39, 54, 58, 75, 81, 82]"
2,ACC PARC,2,163.21,"[128, 175]"
3,ACC-PARC Office,5,117.55,"[25, 46, 80, 81, 82]"
4,Admission are not within their Realm,2,155.79,"[502, 504]"
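
Because the document indices in `cluster_loc` are strings, `sorted` orders them alphabetically (for example, `'4'` lands after `'25'`). If numeric order is preferred, converting to integers before sorting is one option, sketched here on a made-up value:

```python
joined = '25, 4, 401, 39'  # Made-up string in the same format as cluster_loc before splitting

as_strings = sorted(joined.split(', '))
as_numbers = sorted(int(x) for x in joined.split(', '))

print(as_strings)   # ['25', '39', '4', '401']
print(as_numbers)   # [4, 25, 39, 401]
```

In the notebook, this would correspond to something like `[sorted(int(i) for i in x.split(', ')) for x in cluster_df.cluster_loc]`.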


In [32]:
# Sort clusters by freq and score

cluster_df.sort_values('cluster_freq',ascending=False).head(10)
cluster_df.sort_values('score',ascending=False).head(10)

Unnamed: 0,cluster,cluster_freq,score,cluster_loc
44,Helene Wohlgemuth,28,172.91,"[25, 4, 401, 418, 419, 420, 425, 434, 436, 437..."
6,Allegheny County Chapter,25,162.415,"[206, 211, 218, 222, 226, 228, 232, 235, 243, ..."
51,James H. McClelland,20,131.19,"[235, 238, 244, 245, 246, 257, 264, 267, 268, ..."
48,Interim care Facilities,19,127.87,"[23, 39, 41, 42, 43, 52, 53, 54, 57, 58, 59, 6..."
52,James McClelland,14,162.17,"[418, 419, 420, 421, 433, 436, 437, 439, 441, ..."
35,Ebensburg and Cresson,14,160.63,"[470, 471, 500, 501, 502, 503, 504, 505, 506, ..."
60,Mental Health and Mental Retardation,14,108.27,"[0, 14, 15, 25, 28, 31, 32, 35, 36, 4, 49, 56,..."
68,Pennsylvania Association for Retarded Children,13,190.55,"[434, 440, 455, 456, 464, 484, 485, 486, 489, ..."
95,State Hospital,12,128.5,"[437, 439, 447, 450, 464, 465, 467, 468, 498, ..."
12,Board of Directors,11,135.59,"[39, 4, 40, 51, 54, 58, 59, 61, 77, 81, 82]"


Unnamed: 0,cluster,cluster_freq,score,cluster_loc
102,William J. Cavanaugh ATTORNEY,2,272.53,"[388, 389]"
66,Paul Jenkins,6,262.7,"[128, 129, 135, 138, 150, 161]"
10,Avenue Pittsburgh,3,261.41,"[376, 377, 395]"
88,Seizure Pattern,2,240.68,"[169, 189]"
67,Pennhurst Center,9,235.74,"[103, 104, 105, 108, 109, 113, 115, 116, 118]"
61,Mike Levine,7,228.94,"[233, 236, 239, 241, 242, 243, 295]"
14,C. Duane Youngberg,2,225.33,"[436, 441]"
7,Anna Belle Calloway,4,223.22,"[178, 190, 191, 193]"
97,Thomas W. Snyder,2,223.12,"[349, 353]"
29,Dangerous Technique of Raking Food off Spoons,3,218.01,"[105, 106, 115]"


Clusters could potentially be fine-tuned and customized, but that is beyond the scope of the current project. Refer to the MeaningCloud documentation for the available options.

## 5. Wrap-up

As this is the final version of `bob_df`, the main dataframe is saved in the main directory in three formats: .pkl, .json, and .csv. The .csv file omits the text column and the related token columns, which do not convert cleanly to csv.

#### .pkl

In [33]:
# Write out main dataframe and cluster dataframe as .pkl files

joblib.dump(bob_df,'../bob_df.pkl')
joblib.dump(cluster_df,'bob_cluster.pkl')

['../bob_df.pkl']

['bob_cluster.pkl']

#### .json

In [34]:
bob_df.to_json('../bob_df.json') # to_json returns None when given a path, so there is nothing to assign
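
One thing to keep in mind with the JSON copy: JSON has no tuple type, so the `(form, type)` tuples in the `entities` and `topics` columns should be written as arrays and come back as lists. The standard-library round trip below illustrates this on a made-up value and shows one way to restore the tuples after loading.

```python
import json

entities = [('Pennsylvania', 'Adm1'), ('Harrisburg', 'City')]  # Same shape as one cell of the entities column

loaded = json.loads(json.dumps(entities))
print(loaded)                      # Lists, not tuples
restored = [tuple(pair) for pair in loaded]
print(restored == entities)
```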

#### .csv

In [36]:
# Remove the text and related columns which do not convert to csv

bob_csv = bob_df.copy()
del bob_csv['text']
del bob_csv['tok_lem_POS_NLTK']
del bob_csv['tok_lem_POS_CLAWS']
del bob_csv['tok_lem_POS_NLTK_corrected']
del bob_csv['misspelling_correction']
del bob_csv['len_errors']

In [37]:
# Write out bob_df as a csv file

bob_csv.to_csv('../bob_df.csv',index=False)

[Back to top](#Bob-Nelkin-Collection---Text-analysis)