Introduction

This corpus contains five representative poems by Dickinson. I will use spacy to analyze the parts of speech in these five poems and explore the common images and the topics they represent in Dickinson's poems.

Research Questions

What are the main images in Dickinson's poetry? 
In relation to specific texts, what themes does Dickinson typically use these images to express?

Installing, Importing and Preprocessing

In [1]:
pip install nbformat --upgrade

Note: you may need to restart the kernel to use updated packages.


In [2]:
# Install and import spacy and plotly.
%pip install spaCy
%pip install plotly

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [5]:
# Import spacy
import spacy

# Install English language model
!spacy download en_core_web_sm

# Import os to upload documents and metadata
import os

# Load spaCy visualizer
from spacy import displacy

# Import pandas DataFrame packages
import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'

# Import graphing package
import plotly.graph_objects as go
import plotly.express as px

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.8/12.8 MB[0m [31m18.8 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')


In [15]:
# Create empty lists for file names and contents
texts = []
file_names = []

# Iterate through each file in the folder
for _file_name in os.listdir('txt_files'):
# Look for only text files
    if _file_name.endswith('.txt'):
    # Append contents of each text file to text list
        texts.append(open('txt_files' + '/' + _file_name, 'r', encoding='utf-8').read())
        # Append name of each file to file name list
        file_names.append(_file_name)


In [16]:
# Create dictionary object associating each file name with its text
d = {'Filename':file_names,'Text':texts}

In [17]:
# Turn dictionary into a dataframe
poems_df = pd.DataFrame(d)

In [18]:
poems_df.head()

Unnamed: 0,Filename,Text
0,I never saw a Moor.txt,﻿I never saw a Moor\nI never saw a Moor--\nI n...
1,I keep my pledge.txt,﻿I keep my pledge\nI was not called -\nDeath d...
2,Because I could not stop for Death.txt,﻿Because I could not stop for Death \n\nHe kin...
3,After A Hundred Years.txt,﻿After A Hundred Years\n\nAfter a hundred year...
4,Had I not seen the Sun.txt,﻿Had I not seen the Sun\n\nHad I not seen the ...


In [48]:
# Remove extra spaces from papers
paper_df['Text'] = paper_df['Text'].str.replace('\s+', ' ', regex=True).str.strip()
paper_df.head()

Unnamed: 0,Filename,Text
0,I never saw a Moor,﻿I never saw a Moor I never saw a Moor-- I nev...
1,I keep my pledge,﻿I keep my pledge I was not called - Death did...
2,Because I could not stop for Death,﻿Because I could not stop for Death He kindly ...
3,After A Hundred Years,﻿After A Hundred Years After a hundred years N...
4,Had I not seen the Sun,﻿Had I not seen the Sun Had I not seen the sun...


In [20]:
# Remove .txt from title of each paper
poems_df['Filename'] = poems_df['Filename'].str.replace('.txt', '', regex=True)

In [21]:
# Print DataFrame
poems_df.head()

Unnamed: 0,Filename,Text
0,I never saw a Moor,﻿I never saw a Moor\nI never saw a Moor--\nI n...
1,I keep my pledge,﻿I keep my pledge\nI was not called -\nDeath d...
2,Because I could not stop for Death,﻿Because I could not stop for Death \n\nHe kin...
3,After A Hundred Years,﻿After A Hundred Years\n\nAfter a hundred year...
4,Had I not seen the Sun,﻿Had I not seen the Sun\n\nHad I not seen the ...


Tokenization

In [22]:
# Load nlp pipeline
nlp = spacy.load('en_core_web_sm')

# Check what functions it performs
print(nlp.pipe_names)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']


In [23]:
#Define example sentence
sentence = "This is 'an' example? sentence"

# Call the nlp model on the sentence
doc = nlp(sentence)

In [24]:
# Loop through each token in doc object
for token in doc:
    # Print text and part of speech for each
    print(token.text, token.pos_)

This PRON
is AUX
' PUNCT
an DET
' PUNCT
example NOUN
? PUNCT
sentence NOUN


In [26]:
# Define a function that runs the nlp pipeline on any given input text
def process_text(text):
    return nlp(text)

In [29]:
# Apply the function to the "Text" column, so that the nlp pipeline is called on each poem
poems_df['Doc'] = poems_df['Text'].apply(process_text)

In [30]:
# Define a function to retrieve tokens from a doc object
def get_token(doc):
    return [(token.text) for token in doc]

In [31]:
# Run the token retrieval function on the doc objects in the dataframe
poems_df['Tokens'] = poems_df['Doc'].apply(get_token)
poems_df.head()

Unnamed: 0,Filename,Text,Doc,Tokens
0,I never saw a Moor,﻿I never saw a Moor\nI never saw a Moor--\nI n...,"(﻿I, never, saw, a, Moor, \n, I, never, saw, a...","[﻿I, never, saw, a, Moor, \n, I, never, saw, a..."
1,I keep my pledge,﻿I keep my pledge\nI was not called -\nDeath d...,"(﻿I, keep, my, pledge, \n, I, was, not, called...","[﻿I, keep, my, pledge, \n, I, was, not, called..."
2,Because I could not stop for Death,﻿Because I could not stop for Death \n\nHe kin...,"(﻿Because, I, could, not, stop, for, Death, \n...","[﻿Because, I, could, not, stop, for, Death, \n..."
3,After A Hundred Years,﻿After A Hundred Years\n\nAfter a hundred year...,"(﻿After, A, Hundred, Years, \n\n, After, a, hu...","[﻿After, A, Hundred, Years, \n\n, After, a, hu..."
4,Had I not seen the Sun,﻿Had I not seen the Sun\n\nHad I not seen the ...,"(﻿Had, I, not, seen, the, Sun, \n\n, Had, I, n...","[﻿Had, I, not, seen, the, Sun, \n\n, Had, I, n..."


In [33]:
tokens = poems_df[['Text', 'Tokens']].copy()
tokens.head()

Unnamed: 0,Text,Tokens
0,﻿I never saw a Moor\nI never saw a Moor--\nI n...,"[﻿I, never, saw, a, Moor, \n, I, never, saw, a..."
1,﻿I keep my pledge\nI was not called -\nDeath d...,"[﻿I, keep, my, pledge, \n, I, was, not, called..."
2,﻿Because I could not stop for Death \n\nHe kin...,"[﻿Because, I, could, not, stop, for, Death, \n..."
3,﻿After A Hundred Years\n\nAfter a hundred year...,"[﻿After, A, Hundred, Years, \n\n, After, a, hu..."
4,﻿Had I not seen the Sun\n\nHad I not seen the ...,"[﻿Had, I, not, seen, the, Sun, \n\n, Had, I, n..."


Lemmatization

In [35]:
# Define a function to retrieve lemmas from a doc object
def get_lemma(doc):
    return [(token.lemma_) for token in doc]

# Run the lemma retrieval function on the doc objects in the dataframe
poems_df['Lemmas'] = poems_df['Doc'].apply(get_lemma)

Part of Speech Analysis

In [36]:
# Define a function to retrieve lemmas from a doc object
def get_pos(doc):
    #Return the coarse- and fine-grained part of speech text for each token in the doc
    return [(token.pos_, token.tag_) for token in doc]

# Define a function to retrieve parts of speech from a doc object
poems_df['POS'] = poems_df['Doc'].apply(get_pos)

In [38]:
# Create a list of part of speech tags
list(poems_df['POS'])

[[('PROPN', 'NNP'),
  ('ADV', 'RB'),
  ('VERB', 'VBD'),
  ('DET', 'DT'),
  ('PROPN', 'NNP'),
  ('SPACE', '_SP'),
  ('PRON', 'PRP'),
  ('ADV', 'RB'),
  ('VERB', 'VBD'),
  ('DET', 'DT'),
  ('NOUN', 'NN'),
  ('SPACE', '_SP'),
  ('PRON', 'PRP'),
  ('ADV', 'RB'),
  ('VERB', 'VBD'),
  ('DET', 'DT'),
  ('NOUN', 'NN'),
  ('SPACE', '_SP'),
  ('CCONJ', 'CC'),
  ('VERB', 'VBP'),
  ('PRON', 'PRP'),
  ('SCONJ', 'WRB'),
  ('DET', 'DT'),
  ('PROPN', 'NNP'),
  ('VERB', 'VBZ'),
  ('SPACE', '_SP'),
  ('CCONJ', 'CC'),
  ('PRON', 'WP'),
  ('DET', 'DT'),
  ('PROPN', 'NNP'),
  ('AUX', 'VB'),
  ('PUNCT', '.'),
  ('SPACE', '_SP'),
  ('PRON', 'PRP'),
  ('ADV', 'RB'),
  ('VERB', 'VBD'),
  ('ADP', 'IN'),
  ('PROPN', 'NNP'),
  ('SPACE', '_SP'),
  ('CCONJ', 'CC'),
  ('VERB', 'VBD'),
  ('ADP', 'IN'),
  ('NUM', 'CD'),
  ('SPACE', '_SP'),
  ('CCONJ', 'CC'),
  ('ADJ', 'JJ'),
  ('AUX', 'VBP'),
  ('PRON', 'PRP'),
  ('ADP', 'IN'),
  ('DET', 'DT'),
  ('NOUN', 'NN'),
  ('SPACE', '_SP'),
  ('SCONJ', 'IN'),
  ('SCONJ', 'IN')

In [39]:
spacy.explain("IN")

'conjunction, subordinating or preposition'

In [40]:
# Define function to extract proper nouns from Doc object
def extract_proper_nouns(doc):
    return [token.text for token in doc if token.pos_ == 'PROPN']

# Apply function to Doc column and store resulting proper nouns in new column
poems_df['Proper_Nouns'] = poems_df['Doc'].apply(extract_proper_nouns)

In [41]:
# Create doc object from single sentence
doc = nlp("This is 'an' example? sentence")

# Print counts of each part of speech in sentence
print(doc.count_by(spacy.attrs.POS))

{95: 1, 87: 1, 97: 3, 90: 1, 92: 2}


In [42]:
# Store dictionary with indexes and POS counts in a variable
num_pos = doc.count_by(spacy.attrs.POS)

dictionary = {}

# Create a new dictionary which replaces the index of each part of speech for its label (NOUN, VERB, ADJECTIVE)
for k,v in sorted(num_pos.items()):
  dictionary[doc.vocab[k].text] = v

dictionary

{'AUX': 1, 'DET': 1, 'NOUN': 2, 'PRON': 1, 'PUNCT': 3}

In [52]:
# Create new DataFrame for analysis purposes
pos_analysis_df = poems_df[['Filename', 'Text','Doc']]

# Create list to store each dictionary
num_list = []

# Define a function to get part of speech tags and counts and append them to a new dictionary
def get_pos_tags(doc):
    dictionary = {}
    num_pos = doc.count_by(spacy.attrs.POS)
    for k,v in sorted(num_pos.items()):
        dictionary[doc.vocab[k].text] = v
    num_list.append(dictionary)

# Apply function to each doc object in DataFrame
pos_analysis_df.loc['C_POS'] = pos_analysis_df['Doc'].apply(get_pos_tags)

In [55]:
# Create new dataframe with part of speech counts
pos_counts = pd.DataFrame(num_list)
columns = list(pos_counts.columns)

# Add Text of each poem as new column to dataframe
idx = 0
new_col = pos_analysis_df['Text']
pos_counts.insert(loc=idx, column='Text', value=new_col)

pos_counts

Unnamed: 0,Text,ADJ,ADP,ADV,AUX,CCONJ,DET,NOUN,NUM,PRON,PROPN,PUNCT,SCONJ,VERB,SPACE,PART
0,﻿I never saw a Moor\nI never saw a Moor--\nI n...,1,3.0,4.0,3.0,4,7,3,1.0,6,7,1.0,3.0,7,9,
1,﻿I keep my pledge\nI was not called -\nDeath d...,1,5.0,3.0,3.0,2,1,7,,8,5,11.0,,7,11,2.0
2,﻿Because I could not stop for Death \n\nHe kin...,2,12.0,12.0,4.0,7,15,12,1.0,20,21,26.0,2.0,18,24,2.0
3,﻿After A Hundred Years\n\nAfter a hundred year...,3,7.0,1.0,,1,7,13,2.0,2,4,11.0,,8,13,
4,﻿Had I not seen the Sun\n\nHad I not seen the ...,1,,,4.0,1,4,4,,4,3,,,4,5,2.0


In [56]:
# Get average part of speech counts used in poems of each Text
average_pos_df = pos_counts.groupby(['Text']).mean()

# Round calculations to the nearest whole number
average_pos_df = average_pos_df.round(0)

# Reset index to improve DataFrame readability
average_pos_df = average_pos_df.reset_index()

# Show dataframe
average_pos_df

Unnamed: 0,Text,ADJ,ADP,ADV,AUX,CCONJ,DET,NOUN,NUM,PRON,PROPN,PUNCT,SCONJ,VERB,SPACE,PART
0,﻿After A Hundred Years\n\nAfter a hundred year...,3.0,7.0,1.0,,1.0,7.0,13.0,2.0,2.0,4.0,11.0,,8.0,13.0,
1,﻿Because I could not stop for Death \n\nHe kin...,2.0,12.0,12.0,4.0,7.0,15.0,12.0,1.0,20.0,21.0,26.0,2.0,18.0,24.0,2.0
2,﻿Had I not seen the Sun\n\nHad I not seen the ...,1.0,,,4.0,1.0,4.0,4.0,,4.0,3.0,,,4.0,5.0,2.0
3,﻿I keep my pledge\nI was not called -\nDeath d...,1.0,5.0,3.0,3.0,2.0,1.0,7.0,,8.0,5.0,11.0,,7.0,11.0,2.0
4,﻿I never saw a Moor\nI never saw a Moor--\nI n...,1.0,3.0,4.0,3.0,4.0,7.0,3.0,1.0,6.0,7.0,1.0,3.0,7.0,9.0,


In [57]:
# Load metadata.
metadata_df = pd.read_csv('metadata.csv')
metadata_df.head()

Unnamed: 0,Title,Author,Language,Type
0,Had I not seen the Sun,Dickinson,English,modern poetry
1,I keep my pledge,Dickinson,English,modern poetry
2,I never saw a Moor,Dickinson,English,modern poetry
3,Because I could not stop for Death,Dickinson,English,modern poetry
4,After A Hundred Years,Dickinson,English,modern poetry


In [79]:
# Rename column from paper ID to Title
poems_df.rename(columns={"Title": "Filename"}, inplace=True)

In [80]:
# Merge metadata and papers into new DataFrame
# Will only keep rows where both essay and metadata are present
poems_df = metadata_df.merge(poems_df,on='Filename')

In [64]:
# Print DataFrame
poems_df.head()

Unnamed: 0,Title,Author_x,Language_x,Type_x,Author_y,Language_y,Type_y,Text,Doc,Tokens,Lemmas,POS,Proper_Nouns
0,Had I not seen the Sun,Dickinson,English,modern poetry,Dickinson,English,modern poetry,﻿Had I not seen the Sun\n\nHad I not seen the ...,"(﻿Had, I, not, seen, the, Sun, \n\n, Had, I, n...","[﻿Had, I, not, seen, the, Sun, \n\n, Had, I, n...","[﻿had, I, not, see, the, Sun, \n\n, have, I, n...","[(NOUN, NN), (PRON, PRP), (PART, RB), (VERB, V...","[Sun, Light, Wilderness]"
1,I keep my pledge,Dickinson,English,modern poetry,Dickinson,English,modern poetry,﻿I keep my pledge\nI was not called -\nDeath d...,"(﻿I, keep, my, pledge, \n, I, was, not, called...","[﻿I, keep, my, pledge, \n, I, was, not, called...","[﻿i, keep, my, pledge, \n, I, be, not, call, -...","[(NOUN, NN), (VERB, VB), (PRON, PRP$), (NOUN, ...","[Rose, Bee, Daisy, Bobolink, Blossom]"
2,I never saw a Moor,Dickinson,English,modern poetry,Dickinson,English,modern poetry,﻿I never saw a Moor\nI never saw a Moor--\nI n...,"(﻿I, never, saw, a, Moor, \n, I, never, saw, a...","[﻿I, never, saw, a, Moor, \n, I, never, saw, a...","[﻿I, never, see, a, Moor, \n, I, never, see, a...","[(PROPN, NNP), (ADV, RB), (VERB, VBD), (DET, D...","[﻿I, Moor, Heather, Billow, God, Checks, given--]"
3,Because I could not stop for Death,Dickinson,English,modern poetry,Dickinson,English,modern poetry,﻿Because I could not stop for Death \n\nHe kin...,"(﻿Because, I, could, not, stop, for, Death, \n...","[﻿Because, I, could, not, stop, for, Death, \n...","[﻿because, I, could, not, stop, for, Death, \n...","[(NOUN, NN), (PRON, PRP), (AUX, MD), (PART, RB...","[Death, Immortality, Civility, School, Fields,..."
4,After A Hundred Years,Dickinson,English,modern poetry,Dickinson,English,modern poetry,﻿After A Hundred Years\n\nAfter a hundred year...,"(﻿After, A, Hundred, Years, \n\n, After, a, hu...","[﻿After, A, Hundred, Years, \n\n, After, a, hu...","[﻿After, a, hundred, Years, \n\n, after, a, hu...","[(PUNCT, .), (DET, DT), (NUM, CD), (PROPN, NNP...","[Years, Agony, Motionless, Strangers]"


In [65]:
# Save DataFrame as csv (in Google Drive)
# Use this step only to save  csv to your computer's working directory
poems_df.to_csv('MICUSP_papers_with_spaCy_tags.csv')