This notebook is split into three parts:

1. Add POS with spaCy - sample data: mock example of the POS approach. Produces two lists for the sample data: one with tokens and one with POS tags.

2. Add POS with spaCy - our data - 2 lists: the same POS approach applied to our Emo_DV/Arctic data. Produces two lists for our data: one with tokens and one with POS tags. **Use this approach in our pipeline.**

3. Add POS with spaCy - our data - dictionary: approach used to easily visualize POS tagging on our data set. It won't work in the pipeline, because a dictionary keeps only one entry per distinct token.



# Libraries

In [None]:
# drive
from google.colab import drive
drive.mount('/content/drive')

# POS tagging
import spacy

# numerical and data libraries
import numpy as np
import pandas as pd
import tensorflow as tf

# punctuation characters, used when preparing text for the confusion set
import string

# LLM
!pip install -q transformers
from transformers import RobertaTokenizer, TFRobertaModel, pipeline, TFRobertaForMaskedLM

# progress bar
from tqdm import tqdm


Mounted at /content/drive


# Add POS with spaCy - sample data



### Create sample data

In [None]:
#load spacy for English
nlp = spacy.load("en_core_web_sm")


In [None]:
# create some sample sentences
org_text = []
org_text.append("They drank the pub .")
org_text.append("I am looking forway to see you soon .")
org_text.append("The cat sat on mat .")
org_text.append('He ate a apple .')
org_text

['They drank the pub .',
 'I am looking forway to see you soon .',
 'The cat sat on mat .',
 'He ate a apple .']

### Create two lists: One for tokenized words, another for POS

In [None]:
# add POS tags and tokens as two parallel lists
sample_pos_sentence = [[token.pos_ for token in nlp(sentence)]
                       for sentence in org_text]

print("List of POS:", sample_pos_sentence)

sample_tok_sentence = [[token.text for token in nlp(sentence)]
                       for sentence in org_text]
print("List of tokens:", sample_tok_sentence)

List of POS: [['PRON', 'VERB', 'DET', 'NOUN', 'PUNCT'], ['PRON', 'AUX', 'VERB', 'ADJ', 'PART', 'VERB', 'PRON', 'ADV', 'PUNCT'], ['DET', 'NOUN', 'VERB', 'ADP', 'PROPN', 'PUNCT'], ['PRON', 'VERB', 'DET', 'NOUN', 'PUNCT']]
List of tokens: [['They', 'drank', 'the', 'pub', '.'], ['I', 'am', 'looking', 'forway', 'to', 'see', 'you', 'soon', '.'], ['The', 'cat', 'sat', 'on', 'mat', '.'], ['He', 'ate', 'a', 'apple', '.']]


### Mask all words found in the confusion set


In [None]:
# List of common determiners
det = ['the', 'a', 'an', 'this', 'that', 'these', 'those', 'my', 'your', 'his',
       'her', 'its', 'our', 'their', 'all', 'both', 'half', 'either', 'neither',
       'each', 'every', 'other', 'another', 'such', 'what', 'rather', 'quite']

# List of common prepositions
prep = ["about", "at", "by", "for", "from", "in", "of", "on", "to", "with",
        "into", "during", "including", "until", "against", "among",
        "throughout", "despite", "towards", "upon", "concerning"]

# List of helping verbs
helping_verbs = ['am', 'is', 'are', 'was', 'were', 'being', 'been', 'be',
                 'have', 'has', 'had', 'do', 'does', 'did', 'will', 'would',
                 'shall', 'should', 'may', 'might', 'must', 'can', 'could']

confusion_set = det + prep + helping_verbs

masked_sentences = []

for sentence in sample_tok_sentence:
  masked_words = []

  for word in sentence:
    if word.lower() in confusion_set:
      masked_words.append('<mask>')
    else:
      masked_words.append(word)

  masked_sentences.append(masked_words)

print(masked_sentences)


[['They', 'drank', '<mask>', 'pub', '.'], ['I', '<mask>', 'looking', 'forway', '<mask>', 'see', 'you', 'soon', '.'], ['<mask>', 'cat', 'sat', '<mask>', 'mat', '.'], ['He', 'ate', '<mask>', 'apple', '.']]


### Find probabilities for masked words

In [None]:
# Join back to a single, untokenized sentence:
joined_sentences = []
for sentence in masked_sentences:
    joined = ' '.join(sentence)
    joined_sentences.append(joined)

print(joined_sentences)

['They drank <mask> pub .', 'I <mask> looking forway <mask> see you soon .', '<mask> cat sat <mask> mat .', 'He ate <mask> apple .']


In [None]:
# Find masked probabilities
unmasker = pipeline('fill-mask', model='roberta-large')

suggestions = []
for sentence in tqdm(joined_sentences):
    suggestions.append(unmasker(sentence))

100%|██████████| 4/4 [00:01<00:00,  2.28it/s]


In [None]:
suggestions

[[{'score': 0.5072934627532959,
   'token': 5,
   'token_str': ' the',
   'sequence': 'They drank the pub.'},
  {'score': 0.10973586142063141,
   'token': 11,
   'token_str': ' in',
   'sequence': 'They drank in pub.'},
  {'score': 0.09439131617546082,
   'token': 10,
   'token_str': ' a',
   'sequence': 'They drank a pub.'},
  {'score': 0.06871365755796432,
   'token': 23,
   'token_str': ' at',
   'sequence': 'They drank at pub.'},
  {'score': 0.03474224731326103,
   'token': 8,
   'token_str': ' and',
   'sequence': 'They drank and pub.'}],
 [[{'score': 0.7100566625595093,
    'token': 524,
    'token_str': ' am',
    'sequence': '<s>I am looking forway<mask> see you soon.</s>'},
   {'score': 0.19876757264137268,
    'token': 437,
    'token_str': "'m",
    'sequence': "<s>I'm looking forway<mask> see you soon.</s>"},
   {'score': 0.023501884192228317,
    'token': 21,
    'token_str': ' was',
    'sequence': '<s>I was looking forway<mask> see you soon.</s>'},
   {'score': 0.0146893
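The second result above is nested, and its `sequence` strings still contain `<mask>`, because that sentence has two masked positions: the pipeline returns one candidate list per mask. One possible way to keep every result flat is to mask a single position at a time. The sketch below uses plain Python only; `mask_one_at_a_time` and the reduced `small_confusion_set` are hypothetical names introduced for illustration, not part of the notebook's pipeline.

```python
def mask_one_at_a_time(tokens, confusion_set, mask_token="<mask>"):
    """Return (position, sentence) pairs, each with exactly one masked token."""
    variants = []
    for i, word in enumerate(tokens):
        if word.lower() in confusion_set:
            # Copy the sentence and replace only position i with the mask
            masked = tokens[:i] + [mask_token] + tokens[i + 1:]
            variants.append((i, " ".join(masked)))
    return variants

# Hypothetical, reduced confusion set used only for this illustration
small_confusion_set = {"am", "to", "the", "on", "a"}
tokens = ["I", "am", "looking", "forway", "to", "see", "you", "soon", "."]

for position, sentence in mask_one_at_a_time(tokens, small_confusion_set):
    print(position, sentence)
```

Each returned string could then be passed to `unmasker` separately, at the cost of one model call per masked position.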

In [None]:
# add POS tags and tokens as two parallel lists
# with this approach, still can't get the masking done

pos_sentence = [[token.pos_ for token in nlp(sentence)]
    for sentence in org_text
]

print("List of POS:", pos_sentence)

tok_sentence = [[token.text for token in nlp(sentence)]
    for sentence in org_text
]

print("List of tokens:", tok_sentence)

# and with this one (one {token: POS} dictionary per sentence), some words are
# removed, because a repeated word can only appear once as a dictionary key
clean = [
    {token.text: token.pos_ for token in nlp(sentence)}
    for sentence in org_text
]

clean




[{'They': 'PRON', 'drank': 'VERB', 'the': 'DET', 'pub': 'NOUN', '.': 'PUNCT'},
 {'I': 'PRON',
  'am': 'AUX',
  'looking': 'VERB',
  'forway': 'ADJ',
  'to': 'PART',
  'see': 'VERB',
  'you': 'PRON',
  'soon': 'ADV',
  '.': 'PUNCT'},
 {'The': 'DET',
  'cat': 'NOUN',
  'sat': 'VERB',
  'on': 'ADP',
  'mat': 'PROPN',
  '.': 'PUNCT'},
 {'He': 'PRON', 'ate': 'VERB', 'a': 'DET', 'apple': 'NOUN', '.': 'PUNCT'}]
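To illustrate why the dictionary form drops words, here is a minimal sketch in plain Python. The tagged sentence is made up for the example; the point is only that dictionary keys are unique, so a token that occurs twice survives once, while a list of `(token, POS)` pairs keeps every occurrence in order.

```python
# Hypothetical (token, POS) pairs for a sentence with a repeated word
tagged = [("the", "DET"), ("cat", "NOUN"), ("saw", "VERB"),
          ("the", "DET"), ("dog", "NOUN"), (".", "PUNCT")]

as_dict = dict(tagged)    # keys must be unique, so the second "the" is lost
as_pairs = list(tagged)   # a list keeps every token, in order

print(len(as_dict), " ".join(as_dict))
print(len(as_pairs), " ".join(tok for tok, _ in as_pairs))
```

Rebuilding a sentence from the dictionary keys therefore cannot be relied on to reproduce the original text, which is one reason the two-list form is the safer input for masking.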

In [None]:
# Mask the word only if the word is in the confusion set and its POS tag is in pos_mask

pos_mask = ['DET', 'VERB', 'ADP']

# List of common determiners
det = ['the', 'a', 'an', 'this', 'that', 'these', 'those', 'my', 'your', 'his',
       'her', 'its', 'our', 'their', 'all', 'both', 'half', 'either', 'neither',
       'each', 'every', 'other', 'another', 'such', 'what', 'rather', 'quite']

# List of common prepositions
prep = ["about", "at", "by", "for", "from", "in", "of", "on", "to", "with",
        "into", "during", "including", "until", "against", "among",
        "throughout", "despite", "towards", "upon", "concerning"]

# List of helping verbs
helping_verbs = ['am', 'is', 'are', 'was', 'were', 'being', 'been', 'be',
                 'have', 'has', 'had', 'do', 'does', 'did', 'will', 'would',
                 'shall', 'should', 'may', 'might', 'must', 'can', 'could']

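# NOTE: this overwrites the POS value (d[k]), not the token itself, and the
# membership test is case-sensitive, so 'The' is skipped and the dictionary
# keys still hold the original words.
confusion_set = set(det + prep + helping_verbs)
for d in clean:
    for k, v in d.items():
        if k in confusion_set and v in pos_mask:
            d[k] = '<mask>'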

clean



[{'They': 'PRON',
  'drank': 'VERB',
  'the': '<mask>',
  'pub': 'NOUN',
  '.': 'PUNCT',
  'DET': '<mask>'},
 {'I': 'PRON',
  'am': 'AUX',
  'looking': 'VERB',
  'forway': 'ADJ',
  'to': 'PART',
  'see': 'VERB',
  'you': 'PRON',
  'soon': 'ADV',
  '.': 'PUNCT'},
 {'The': 'DET',
  'cat': 'NOUN',
  'sat': 'VERB',
  'on': '<mask>',
  'mat': 'PROPN',
  '.': 'PUNCT'},
 {'He': 'PRON', 'ate': 'VERB', 'a': '<mask>', 'apple': 'NOUN', '.': 'PUNCT'}]

In [None]:
#convert back to a list of sentences
sentences = []
for d in clean:
  sentence = ' '.join(d.keys())
  sentences.append(sentence)

print(sentences)

['They drank the pub .', 'I am looking forway to see you soon .', 'The cat sat on mat .', 'He ate a apple .']
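The joined sentences above come back unmasked because the loop replaced the POS value in each dictionary rather than the token, and the sentences are rebuilt from the keys. A possible alternative, sketched below under stated assumptions, is to walk the parallel token and POS lists together and replace the token itself. The function name `mask_with_pos`, the reduced confusion set, and the tag whitelist are all hypothetical choices for illustration; the tokens and tags are copied from the sample output earlier in this notebook.

```python
def mask_with_pos(tokens, pos_tags, confusion_set, allowed_pos, mask_token="<mask>"):
    """Mask a token only if it is in the confusion set AND has an allowed POS tag."""
    return [
        mask_token if tok.lower() in confusion_set and pos in allowed_pos else tok
        for tok, pos in zip(tokens, pos_tags)
    ]

# Tokens and tags copied from the sample output earlier in this notebook
tokens = ["The", "cat", "sat", "on", "mat", "."]
pos_tags = ["DET", "NOUN", "VERB", "ADP", "PROPN", "PUNCT"]

# Assumptions: a reduced confusion set and a tag whitelist chosen for illustration
confusion_set = {"the", "a", "an", "on", "to", "am"}
allowed_pos = {"DET", "ADP", "AUX", "PART"}

masked = mask_with_pos(tokens, pos_tags, confusion_set, allowed_pos)
print(" ".join(masked))
```

Lowercasing the token before the membership test means a sentence-initial "The" is handled the same way as "the".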


In [None]:
# use an MLM to generate likely options for each of the masked words.
checkpoint = 'roberta-large'
tokenizer = RobertaTokenizer.from_pretrained(checkpoint)
model = TFRobertaForMaskedLM.from_pretrained(checkpoint)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFRobertaForMaskedLM: ['roberta.embeddings.position_ids']
- This IS expected if you are initializing TFRobertaForMaskedLM from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaForMaskedLM from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFRobertaForMaskedLM were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForMaskedLM for predictions without further training.


In [None]:
unmasker = pipeline('fill-mask', model='roberta-large')
unmasker('He ate <mask> apple.')


[{'score': 0.44446417689323425,
  'token': 5,
  'token_str': ' the',
  'sequence': 'He ate the apple.'},
 {'score': 0.31382378935813904,
  'token': 41,
  'token_str': ' an',
  'sequence': 'He ate an apple.'},
 {'score': 0.12721681594848633,
  'token': 39,
  'token_str': ' his',
  'sequence': 'He ate his apple.'},
 {'score': 0.04988005757331848,
  'token': 277,
  'token_str': ' another',
  'sequence': 'He ate another apple.'},
 {'score': 0.030983757227659225,
  'token': 65,
  'token_str': ' one',
  'sequence': 'He ate one apple.'}]

In [None]:
sentences

['They drank <mask> pub .',
 'you am see forway to soon .',
 'The cat sat <mask> mat .',
 'Giant predator is <mask> .',
 'There brought <mask> age . has many <mask>',
 'He ate <mask> apple .']

# Add POS for confusion set

**Determiners:**

- Articles: a, an, the
- Demonstratives: this, that, these, those
- Possessives: my, your, his, her, its, our, their
- Quantifiers: some, many, much, few, several, each, every, either, neither
- Numbers: one, two, first, second, etc.
- Interrogatives: which, what




In [None]:
# List of common determiners
# POS tags observed when each word is tagged on its own:
# PRON - Pronoun, NOUN - Noun, ADV - Adverb, CCONJ - Coordinating Conjunction, ADJ - Adjective
det = ['the', 'a', 'an', 'this', 'that', 'these', 'those', 'my', 'your', 'his',
       'her', 'its', 'our', 'their', 'all', 'both', 'half', 'either', 'neither',
       'each', 'every', 'other', 'another', 'such', 'what', 'rather', 'quite']

# List of common prepositions and adverbs
prep = ["about", "at", "by", "for", "from", "in", "of", "on", "to", "with",
        "into", "during", "including", "until", "against", "among",
        "throughout", "despite", "towards", "upon", "concerning"]

# List of helping verbs and some verbs (items 9-14 are tagged VERB - 'have', 'has', 'had', 'do', 'does', 'did')
#helping_verbs = ['am', 'is', 'are', 'was', 'were', 'being', 'been', 'be',
#                 'have', 'has', 'had', 'do', 'does', 'did', 'will', 'would',
#                 'shall', 'should', 'may', 'might', 'must', 'can', 'could']


helping_verbs = ['am', 'is', 'are', 'was', 'were', 'being', 'been', 'be',
                 'will', 'would','shall', 'should', 'may', 'might', 'must', 'can', 'could']

In [None]:
# check to see if POS changes based on context

expect_verb = "I have a cat"
expect_aux = "I have been planning on getting a cat"

expect_verb_pos = [token.pos_ for token in nlp(expect_verb)]
print(expect_verb_pos)

expect_aux_pos = [token.pos_ for token in nlp(expect_aux)]
print(expect_aux_pos)

['PRON', 'VERB', 'DET', 'NOUN']
['PRON', 'AUX', 'AUX', 'VERB', 'ADP', 'VERB', 'DET', 'NOUN']
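The output above shows "have" tagged `VERB` in one sentence and `AUX` in the other. To check this kind of context dependence across many sentences, one option is to count the tags each word receives. The sketch below assumes parallel token and POS lists like the ones built earlier; `tags_by_word` is a hypothetical helper, and the tokens are written out by hand to match the two example sentences.

```python
from collections import Counter, defaultdict

def tags_by_word(token_lists, pos_lists, words_of_interest):
    """Count which POS tags each word of interest receives across sentences."""
    counts = defaultdict(Counter)
    for tokens, tags in zip(token_lists, pos_lists):
        for tok, tag in zip(tokens, tags):
            if tok.lower() in words_of_interest:
                counts[tok.lower()][tag] += 1
    return counts

# Tokens written out by hand; tags copied from the output above
token_lists = [
    ["I", "have", "a", "cat"],
    ["I", "have", "been", "planning", "on", "getting", "a", "cat"],
]
pos_lists = [
    ["PRON", "VERB", "DET", "NOUN"],
    ["PRON", "AUX", "AUX", "VERB", "ADP", "VERB", "DET", "NOUN"],
]

counts = tags_by_word(token_lists, pos_lists, {"have", "been"})
print(dict(counts["have"]))
print(dict(counts["been"]))
```

A word that collects more than one tag is a candidate for POS-conditioned masking rather than unconditional masking.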


### "Determiners"

In [None]:
det_pos = [[token.pos_ for token in nlp(sentence)]
    for sentence in det
]

det_pos

[['PRON'],
 ['PRON'],
 ['PRON'],
 ['PRON'],
 ['PRON'],
 ['PRON'],
 ['PRON'],
 ['PRON'],
 ['PRON'],
 ['PRON'],
 ['PRON'],
 ['PRON'],
 ['PRON'],
 ['PRON'],
 ['PRON'],
 ['PRON'],
 ['NOUN'],
 ['ADV'],
 ['CCONJ'],
 ['PRON'],
 ['PRON'],
 ['ADJ'],
 ['PRON'],
 ['ADJ'],
 ['PRON'],
 ['ADV'],
 ['ADV']]

### "Prepositions"

In [None]:
# ADV - Adverb
# ADP - Adposition
# PART - Particle
# VERB - Verb
# SCONJ - Subordinating Conjunction

prep_pos = [[token.pos_ for token in nlp(sentence)]
    for sentence in prep
]

prep_pos

[['ADV'],
 ['ADP'],
 ['ADP'],
 ['ADP'],
 ['ADP'],
 ['ADP'],
 ['ADP'],
 ['ADP'],
 ['PART'],
 ['ADP'],
 ['ADP'],
 ['ADP'],
 ['VERB'],
 ['ADP'],
 ['ADP'],
 ['ADP'],
 ['ADP'],
 ['SCONJ'],
 ['ADP'],
 ['SCONJ'],
 ['VERB']]

### "Auxiliary verbs"

In [None]:
helping_verbs_pos = [[token.pos_ for token in nlp(sentence)]
    for sentence in helping_verbs
]

helping_verbs_pos

[['AUX'],
 ['AUX'],
 ['AUX'],
 ['AUX'],
 ['AUX'],
 ['AUX'],
 ['AUX'],
 ['AUX'],
 ['VERB'],
 ['VERB'],
 ['VERB'],
 ['VERB'],
 ['VERB'],
 ['VERB'],
 ['AUX'],
 ['AUX'],
 ['AUX'],
 ['AUX'],
 ['AUX'],
 ['AUX'],
 ['AUX'],
 ['AUX'],
 ['AUX']]

# Add POS with spaCy - our data - 2 lists

### Load our data

In [None]:
df = pd.read_csv('/content/drive/MyDrive/266/Data/Clean_Data/EmoV_Arctic/punctuated_cased_train.csv')
df.head()

|Unnamed: 0.1|Unnamed: 0|filename|clean_filename|actor|gender|emotion|auto_transcription|label|cleaned_auto_transcription|cleaned_label|
|---|---|---|---|---|---|---|---|---|---|---|
|0|0|amused_29-45_0042.wav|42|bea|female|amused|HOW COULD HE EXPLAIN HIS POSSESSION OF THE SKETCH|How could he explain his possession of the ske...|How could he explain his possession of the ske...|How could he explain his possession of the ske...|
|1|1|amused_46-56_0046.wav|46|bea|female|amused|THE GIRL FACED HIM HER EYES SHINING WITH SUDDE...|The girl faced him, her eyes shining with sudd...|The girl faced him, her eyes shining with sudd...|The girl faced him, her eyes shining with sudd...|
|2|2|amused_1-15_0005.wav|5|bea|female|amused|WILL WE EVER FORGET IT|Will we ever forget it.|Will we ever forget it.|Will we ever forget it.|
|3|3|amused_281-308_0281.wav|281|bea|female|amused|I DO NOT BLAME YOU FOR ANYTHING REMEMBER THAT|I do not blame you for anything; remember that.|I do not blame you for anything. Remember that.|I do not blame you for anything. Remember that.|
|4|4|amused_225-252_0226.wav|226|bea|female|amused|THAT CAME BEFORE MY A V CS|That came before my A B C's.|That came before my a v cs.|That came before my A B C's.|


### Create two lists: one with tokenized sentences, another with POS

In [None]:
# convert series to list
check = df['cleaned_auto_transcription'].tolist()

# add a space on both sides of each punctuation mark to allow for proper POS tagging
# (build the translation table once instead of once per sentence)
punct_table = str.maketrans({key: " {0} ".format(key) for key in string.punctuation})
punctuated_sentences = [sentence.translate(punct_table) for sentence in check]

punctuated_sentences[:3]

['How could he explain his possession of the sketch . ',
 'The girl faced him ,  her eyes shining with sudden fear . ',
 'Will we ever forget it . ']
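Padding punctuation on both sides leaves double spaces and a trailing space, as the output above shows. A minimal sketch of one way to avoid that is to collapse whitespace after the translation step. `space_punctuation` and `PUNCT_TABLE` are hypothetical names; the translation itself mirrors the cell above.

```python
import string

# Translation table built once: surround every punctuation character with spaces
PUNCT_TABLE = str.maketrans({ch: f" {ch} " for ch in string.punctuation})

def space_punctuation(sentence):
    """Separate punctuation from words, then collapse repeated whitespace."""
    return " ".join(sentence.translate(PUNCT_TABLE).split())

print(space_punctuation("The girl faced him, her eyes shining with sudden fear."))
print(space_punctuation("I do not blame you for anything. Remember that."))
```

Because `str.split()` with no arguments drops leading, trailing, and repeated whitespace, every token ends up separated by exactly one space.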

In [None]:
type(punctuated_sentences)

list

In [None]:
# parse each sentence once and reuse the parsed docs for both lists
docs = list(nlp.pipe(punctuated_sentences))

pos_sentence = [[token.pos_ for token in doc] for doc in docs]
print("List of POS:", pos_sentence)

tok_sentence = [[token.text for token in doc] for doc in docs]
print("List of tokens:", tok_sentence)

List of POS: [['SCONJ', 'AUX', 'PRON', 'VERB', 'PRON', 'NOUN', 'ADP', 'DET', 'NOUN', 'PUNCT'], ['DET', 'NOUN', 'VERB', 'PRON', 'PUNCT', 'SPACE', 'PRON', 'NOUN', 'VERB', 'ADP', 'ADJ', 'NOUN', 'PUNCT'], ['AUX', 'PRON', 'ADV', 'VERB', 'PRON', 'PUNCT'], ['PRON', 'AUX', 'PART', 'VERB', 'PRON', 'ADP', 'PRON', 'PUNCT', 'SPACE', 'VERB', 'PRON', 'PUNCT'], ['PRON', 'VERB', 'ADP', 'PRON', 'DET', 'NOUN', 'NOUN', 'PUNCT'], ['ADV', 'ADV', 'PUNCT', 'SPACE', 'PRON', 'AUX', 'ADJ', 'PUNCT', 'SPACE', 'PRON', 'VERB', 'PUNCT'], ['PRON', 'VERB', 'ADP', 'DET', 'NOUN', 'ADP', 'DET', 'NOUN', 'CCONJ', 'VERB', 'VERB', 'PRON', 'PUNCT'], ['AUX', 'VERB', 'NOUN', 'VERB', 'DET', 'NOUN', 'PUNCT'], ['ADV', 'ADV', 'PUNCT', 'SPACE', 'DET', 'NOUN', 'AUX', 'ADV', 'VERB', 'ADP', 'NOUN', 'PUNCT'], ['PROPN', 'PUNCT', 'SPACE', 'PRON', 'NOUN', 'PUNCT', 'SPACE', 'PRON', 'VERB', 'PART', 'VERB', 'ADJ', 'NOUN', 'ADP', 'PRON', 'NOUN', 'PUNCT'], ['PRON', 'NOUN', 'AUX', 'ADV', 'ADJ', 'PUNCT'], ['PROPN', 'NOUN', 'ADP', 'NOUN', 'PUNCT',

### Mask all words found in the confusion set


In [None]:
# List of common determiners
det = ['the', 'a', 'an', 'this', 'that', 'these', 'those', 'my', 'your', 'his',
       'her', 'its', 'our', 'their', 'all', 'both', 'half', 'either', 'neither',
       'each', 'every', 'other', 'another', 'such', 'what', 'rather', 'quite']

# List of common prepositions
prep = ["about", "at", "by", "for", "from", "in", "of", "on", "to", "with",
        "into", "during", "including", "until", "against", "among",
        "throughout", "despite", "towards", "upon", "concerning"]

# List of helping verbs
helping_verbs = ['am', 'is', 'are', 'was', 'were', 'being', 'been', 'be',
                 'have', 'has', 'had', 'do', 'does', 'did', 'will', 'would',
                 'shall', 'should', 'may', 'might', 'must', 'can', 'could']

confusion_set = det + prep + helping_verbs

masked_sentences = []

for sentence in tok_sentence:
  masked_words = []

  for word in sentence:
    if word.lower() in confusion_set:
      masked_words.append('<mask>')
    else:
      masked_words.append(word)

  masked_sentences.append(masked_words)

print(masked_sentences)

[['How', '<mask>', 'he', 'explain', '<mask>', 'possession', '<mask>', '<mask>', 'sketch', '.'], ['<mask>', 'girl', 'faced', 'him', ',', ' ', '<mask>', 'eyes', 'shining', '<mask>', 'sudden', 'fear', '.'], ['<mask>', 'we', 'ever', 'forget', 'it', '.'], ['I', '<mask>', 'not', 'blame', 'you', '<mask>', 'anything', '.', ' ', 'Remember', '<mask>', '.'], ['<mask>', 'came', 'before', '<mask>', '<mask>', 'v', 'cs', '.'], ['<mask>', 'course', ',', ' ', '<mask>', '<mask>', 'uninteresting', '.', ' ', 'She', 'continued', '.'], ['He', 'wated', '<mask>', '<mask>', 'edge', '<mask>', '<mask>', 'water', 'and', 'began', 'scrubbing', 'himself', '.'], ['<mask>', 'almay', 'dreams', 'violated', '<mask>', 'law', '.'], ['Down', 'there', ',', ' ', '<mask>', 'earth', '<mask>', 'already', 'swelling', '<mask>', 'life', '.'], ['Nunte', ',', ' ', '<mask>', 'surprise', ',', ' ', 'he', 'began', '<mask>', 'show', 'actual', 'enthusiasm', '<mask>', '<mask>', 'favor', '.'], ['<mask>', 'voice', '<mask>', 'passionately', 'r

### Find probabilities for masked words

In [None]:
# Join back to a single, untokenized sentence:
joined_sentences = []
for sentence in masked_sentences:
    joined = ' '.join(sentence)
    joined_sentences.append(joined)

print(joined_sentences)

['How <mask> he explain <mask> possession <mask> <mask> sketch .', '<mask> girl faced him ,   <mask> eyes shining <mask> sudden fear .', '<mask> we ever forget it .', 'I <mask> not blame you <mask> anything .   Remember <mask> .', '<mask> came before <mask> <mask> v cs .', '<mask> course ,   <mask> <mask> uninteresting .   She continued .', 'He wated <mask> <mask> edge <mask> <mask> water and began scrubbing himself .', '<mask> almay dreams violated <mask> law .', 'Down there ,   <mask> earth <mask> already swelling <mask> life .', 'Nunte ,   <mask> surprise ,   he began <mask> show actual enthusiasm <mask> <mask> favor .', '<mask> voice <mask> passionately rebellious .', 'Ahai game <mask> information ,   more out <mask> curiosity than anything else .', '<mask> maddening joy pounded <mask> <mask> brain .', '<mask> questions <mask> <mask> come vaguely <mask> <mask> mind .', 'They <mask> big trees and require plenty <mask> room .', '<mask> immaculate appearance <mask> gone .', '<mask> <m

In [None]:
# Find masked probabilities
unmasker = pipeline('fill-mask', model='roberta-large')

suggestions = []

for sentence in tqdm(joined_sentences):
  if "<mask>" in sentence:
    suggestion = unmasker(sentence)
    suggestions.append(suggestion)
  else:
    # append a single entry so that suggestions stays aligned with joined_sentences
    suggestions.append(("No word from sentence found in confusion set.", sentence))

print(suggestions)

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)
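Printing every suggestion for the whole data set is what triggers the data rate limit above. One way to inspect the results without hitting it is to reduce each result to its best candidate per mask and look at a few entries only. The sketch below is an assumption-laden illustration: `top_candidates` is a hypothetical helper, and the mock results only imitate the shape of the pipeline output shown earlier in this notebook (the scores are rounded and the candidate for the second mask is made up).

```python
def top_candidates(result):
    """Return the best token_str for each mask in one fill-mask result.

    Handles both shapes seen in this notebook: a flat list of candidate
    dicts (one mask) and a list of such lists (several masks).
    """
    if not isinstance(result, list) or not result:
        return []  # placeholder entries, e.g. a message instead of candidates
    per_mask = [result] if isinstance(result[0], dict) else result
    return [max(cands, key=lambda c: c["score"])["token_str"].strip()
            for cands in per_mask]

# Mock results shaped like the pipeline output (values are illustrative only)
single = [{"score": 0.51, "token_str": " the"}, {"score": 0.11, "token_str": " in"}]
double = [[{"score": 0.71, "token_str": " am"}, {"score": 0.20, "token_str": "'m"}],
          [{"score": 0.90, "token_str": " to"}]]
placeholder = "No word from sentence found in confusion set."

for result in [single, double, placeholder]:
    print(top_candidates(result))
```

Applied to the real data, something like `[top_candidates(r) for r in suggestions[:5]]` would show a short preview instead of the full nested output.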



In [None]:
suggestions

# Add POS with spaCy - our data - dictionary

In [None]:
our_data = df['cleaned_auto_transcription']
our_data

0       How could he explain his possession of the ske...
1       The girl faced him, her eyes shining with sudd...
2                                 Will we ever forget it.
3         I do not blame you for anything. Remember that.
4                             That came before my a v cs.
                              ...                        
9009    Unconsciously, our yells and exclamations yiel...
9010        The stout wood was crushed like an egg shell.
9011    Ten minutes had not elapsed since he had dropp...
9012    I suppose you picked that lingo up among the i...
9013    Mc coy found a stifling, poisonous atmosphere ...
Name: cleaned_auto_transcription, Length: 9014, dtype: object

In [None]:
df['cln_autot_pos'] = [
    {token.text: token.pos_ for token in nlp(sentence)}
    for sentence in df['cleaned_auto_transcription']
]

# Some examples of how misspelled words are being tagged

- "Wated" (waited) is correctly recognized as a verb
- "Nunte" (to) is tagged as a proper noun (PROPN), as the table below shows, rather than as a preposition
- "Gean" (Jeanne) is correctly recognized as a proper noun


- But words that are fused together in the autotranscription are tagged incorrectly: 'almay' is tagged 'VERB', when the original label was ALL MY.



|autotranscription|pos tagging|
|---|---|
|That came before my a v cs.|{'That': 'PRON', 'came': 'VERB', 'before': 'ADP', 'my': 'PRON', 'a': 'DET', 'v': 'NOUN', 'cs': 'NOUN', '.': 'PUNCT'}|
|He wated in the edge of the water and began scrubbing himself.|{'He': 'PRON', 'wated': 'VERB', 'in': 'ADP', 'the': 'DET', 'edge': 'NOUN', 'of': 'ADP', 'water': 'NOUN', 'and': 'CCONJ', 'began': 'VERB', 'scrubbing': 'VERB', 'himself': 'PRON', '.': 'PUNCT'}|
|Nunte, my surprise, he began to show actual enthusiasm in my favor.|{'Nunte': 'PROPN', ',': 'PUNCT', 'my': 'PRON', 'surprise': 'NOUN', 'he': 'PRON', 'began': 'VERB', 'to': 'PART', 'show': 'VERB', 'actual': 'ADJ', 'enthusiasm': 'NOUN', 'in': 'ADP', 'favor': 'NOUN', '.': 'PUNCT'}|
|Could almay dreams violated this law.|{'Could': 'AUX', 'almay': 'VERB', 'dreams': 'NOUN', 'violated': 'VERB', 'this': 'DET', 'law': 'NOUN', '.': 'PUNCT'}|