# A Deeper Look at spaCy
## Student: Levi Lowther
### https://github.com/LevLow/Datamine_07_spaCy_dive 

### Source for used Tutorial: https://realpython.com/natural-language-processing-spacy-python/ 

### Load and test needed modules

In [2]:
from collections import Counter
import pickle
import requests
import spacy
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt

!pip list

print('All prereqs installed.')

Package                       Version
----------------------------- --------------------
alabaster                     0.7.12
anaconda-client               1.11.0
anaconda-navigator            2.3.2
anyio                         3.5.0
appdirs                       1.4.4
argon2-cffi                   21.3.0
argon2-cffi-bindings          21.2.0
arrow                         1.2.2
asgiref                       3.5.2
astroid                       2.11.7
astropy                       5.1
atomicwrites                  1.4.0
attrs                         22.1.0
Automat                       20.2.0
autopep8                      1.6.0
Babel                         2.11.0
backcall                      0.2.0
backports.functools-lru-cache 1.6.4
backports.tempfile            1.0
backports.weakref             1.0.post1
bcrypt                        3.2.0
beautifulsoup4                4.11.1
binaryornot                   0.4.4
bitarray                      2.5.1
bkcharts                      0.2
blac

### Load Modules and save the A Midsummer Night's Dream in as a pickle

In [5]:
import requests,pickle,io,re,spacy
from bs4 import BeautifulSoup
from contextlib import redirect_stdout
from spacytextblob.spacytextblob import SpacyTextBlob
from spacy.lang.en.stop_words import STOP_WORDS
from collections import Counter
import matplotlib.pyplot as plt
import numpy as np

url = "https://shakespeare.mit.edu/midsummer/full.html"

response = requests.get(url)
print(response.status_code)
print(response.headers['content-type'])

#code to check if request worked and write to pickel
if response.status_code == 200:
    html_content= response.text
    soup = BeautifulSoup(html_content, "html.parser")
    article = soup.find("html")

    if article:
        with open("midsummer.pkl", "wb") as file:
            pickle.dump(str(article), file)
            print("Our Play is Saved!")
    else:
        print("Article not found")
else:
    print("Webpage Error")

200
text/html
Our Play is Saved!


### Read the pickle file and print the text


In [6]:
#Read the file
with open("midsummer.pkl", "rb") as file:
    html_content = pickle.load(file)

#parse
soup = BeautifulSoup(html_content, "html.parser")
text = soup.get_text()
print(text)



Midsummer Night's Dream: Entire Play
 





A Midsummer Night's Dream

Shakespeare homepage 
    | Midsummer Night's Dream 
    | Entire play

ACT I
SCENE I. Athens. The palace of THESEUS.

Enter THESEUS, HIPPOLYTA, PHILOSTRATE, and Attendants

THESEUS

Now, fair Hippolyta, our nuptial hour
Draws on apace; four happy days bring in
Another moon: but, O, methinks, how slow
This old moon wanes! she lingers my desires,
Like to a step-dame or a dowager
Long withering out a young man revenue.

HIPPOLYTA

Four days will quickly steep themselves in night;
Four nights will quickly dream away the time;
And then the moon, like to a silver bow
New-bent in heaven, shall behold the night
Of our solemnities.

THESEUS

Go, Philostrate,
Stir up the Athenian youth to merriments;
Awake the pert and nimble spirit of mirth;
Turn melancholy forth to funerals;
The pale companion is not for our pomp.
Exit PHILOSTRATE
Hippolyta, I woo'd thee with my sword,
And won thy love, doing thee injuries;
But I will we

## The Doc object for Processed Text
### The tutorial has us running tokenization on text that has been typed directly into the constructor. Since this isn't something
### that we would normally do I am instead going to read directly from the text I just parsed and printed. 

In [13]:
import spacy
nlp = spacy.load("en_core_web_sm")
introduction_doc = nlp(text)
print ([token.text for token in introduction_doc])

#I had to troubleshoot and determine the correct encoding. utf-8 and HTML were incorrect, but a little digging showed me the cp1252 encoding.




## Sentence Detection
### Using sentence detection to divide the text into meaningful units and extract useful information. 
### This will also help set us up for Parts of Speach Tagging and Named Entity Recognition

In [15]:
# Determine the number of sentences in A Midsummer Night's Dream
about_doc = nlp(text)
sentences = list(about_doc.sents)
len(sentences)
# there are 1135 sentences



1135

In [18]:
#print the first word of every sentence followed by and elipsis
for sentence in sentences:
    print(f"{sentence[:1]}...")



...
The...
Enter...
she...
HIPPOLYTA...
THESEUS...
Exit...
But...
Enter...
THESEUS...
Stand...
My...
Stand...
Thou...
so...
THESEUS...
To...
Demetrius...
HERMIA...
THESEUS...
HERMIA...
but...
THESEUS...
HERMIA...
I...
But...
THESEUS...
Therefore...
Thrice...
HERMIA...
THESEUS...
The...
DEMETRIUS...
LYSANDER...
You...
EGEUS...
true...
And...
LYSANDER...
And...
THESEUS...
But...
For...
Come...
EGEUS...
Exeunt...
How...
why...
How...
LYSANDER...
Ay...
for...
But...
LYSANDER...
too...
LYSANDER...
to...
LYSANDER...
Or...
The...
HERMIA...
LYSANDER...
I...
And...
There...
If...
HERMIA...
I...
By...
LYSANDER...
Look...
Enter...
whither...
that...
Demetrius...
Your...
Sickness...
Were...
O...
HERMIA...
HELENA...
HERMIA...
HELENA...
HERMIA...
HELENA...
HERMIA...
HELENA...
Before...
LYSANDER...
HERMIA...
Farewell...
Keep...
LYSANDER...
I...
Exit...
Exit...
Through...
But...
Demetrius...
but...
And...
And...
As...
I...
But...
Exit...
Athens...
QUINCE...
Enter...
BOTTOM...
QUINCE...
Here...
BOTTO

In [20]:
# use the language comonent to set custom boundaries to use ":" as a delimiter for sentences

from spacy.language import Language
@Language.component("set_custom_boundaries")
def set_custom_boundaries(doc):
    """Add support to use `:` as a delimiter for sentence detection"""
    for token in doc[:-1]:
        if token.text == ":":
            doc[token.i + 1].is_sent_start = True
        return doc


custom_nlp = spacy.load("en_core_web_sm")
custom_nlp.add_pipe("set_custom_boundaries", before="parser")
custom_ellipsis_doc = custom_nlp(text)
custom_ellipsis_sentences = list(custom_ellipsis_doc.sents)
for sentence in custom_ellipsis_sentences:
     print(sentence)



Midsummer Night's Dream: Entire Play
 





A Midsummer Night's Dream

Shakespeare homepage 
    | Midsummer Night's Dream 
    | Entire play

ACT I
SCENE I. Athens.
The palace of THESEUS.


Enter THESEUS, HIPPOLYTA, PHILOSTRATE, and Attendants

THESEUS

Now, fair Hippolyta, our nuptial hour
Draws on apace; four happy days bring in
Another moon: but, O, methinks, how slow
This old moon wanes!
she lingers my desires,
Like to a step-dame or a dowager
Long withering out a young man revenue.


HIPPOLYTA

Four days will quickly steep themselves in night;
Four nights will quickly dream away the time;
And then the moon, like to a silver bow
New-bent in heaven, shall behold the night
Of our solemnities.


THESEUS

Go, Philostrate,
Stir up the Athenian youth to merriments;
Awake the pert and nimble spirit of mirth;
Turn melancholy forth to funerals;
The pale companion is not for our pomp.

Exit PHILOSTRATE
Hippolyta, I woo'd thee with my sword,
And won thy love, doing thee injuries;

But I wi

## Tokens
### Use tokens to print index and common atributes of tokens

In [21]:

# print index of tokens 
for token in about_doc:
    print (token, token.idx)



 0
Midsummer 2
Night 12
's 17
Dream 20
: 25
Entire 27
Play 34

 





 38
A 46
Midsummer 48
Night 58
's 63
Dream 66


 71
Shakespeare 73
homepage 85

     94
| 99
Midsummer 101
Night 111
's 116
Dream 119

     125
| 130
Entire 132
play 139


 143
ACT 145
I 149

 150
SCENE 151
I. 157
Athens 160
. 166
The 168
palace 172
of 179
THESEUS 182
. 189


 190
Enter 192
THESEUS 198
, 205
HIPPOLYTA 207
, 216
PHILOSTRATE 218
, 229
and 231
Attendants 235


 245
THESEUS 247


 254
Now 256
, 259
fair 261
Hippolyta 266
, 275
our 277
nuptial 281
hour 289

 293
Draws 294
on 300
apace 303
; 308
four 310
happy 315
days 321
bring 326
in 332

 334
Another 335
moon 343
: 347
but 349
, 352
O 354
, 355
methinks 357
, 365
how 367
slow 371

 375
This 376
old 381
moon 385
wanes 390
! 395
she 397
lingers 401
my 409
desires 412
, 419

 420
Like 421
to 426
a 429
step 431
- 435
dame 436
or 441
a 444
dowager 446

 453
Long 454
withering 459
out 469
a 473
young 475
man 481
revenue 485
. 492


 493
HIPPOLYTA 495


 504

In [22]:
# Prit commmon attribues for the tokens 

print(
    f"{'Text with Whitespace':22}"
    f"{'Is Alphanumeric?':15}"
    f"{'Is Punctuation?':18}"
    f"{'Is Stop Word?'}"
    )
for token in about_doc:
    print(
        f"{str(token.text_with_ws):22}"
        f"{str(token.is_alpha):15}"
        f"{str(token.is_punct):18}"
        f"{str(token.is_stop)}"
   )

Text with Whitespace  Is Alphanumeric?Is Punctuation?   Is Stop Word?


                    False          False             False
Midsummer             True           False             False
Night                 True           False             False
's                    False          False             True
Dream                 True           False             False
:                     False          True              False
Entire                True           False             False
Play                  True           False             False

 





              False          False             False
A                     True           False             True
Midsummer             True           False             False
Night                 True           False             False
's                    False          False             True
Dream                 True           False             False


                    False          False             False
Shakespeare       

In [45]:
# customize the tokenization process buy building our own tokenizer object. Since this document does not have
# an @ symbol decided to use "--" which appears a few times. 

import re
from spacy.tokenizer import Tokenizer

custom_nlp = spacy.load("en_core_web_sm")
prefix_re = spacy.util.compile_prefix_regex(
     custom_nlp.Defaults.prefixes
)
suffix_re = spacy.util.compile_suffix_regex(
     custom_nlp.Defaults.suffixes
 )

custom_infixes = [r"--"]

infix_re = spacy.util.compile_infix_regex(
     list(custom_nlp.Defaults.infixes) + custom_infixes
 )

custom_nlp.tokenizer = Tokenizer(
     nlp.vocab,
     prefix_search=prefix_re.search,
     suffix_search=suffix_re.search,
     infix_finditer=infix_re.finditer,
     token_match=None,
 )

custom_tokenizer_about_doc = custom_nlp(text)

print([token.text for token in custom_tokenizer_about_doc[1275:1350]])

# As we can see the double hyphen now tokenizes as it own token 


['\n', 'For', 'you', ',', 'fair', 'Hermia', ',', 'look', 'you', 'arm', 'yourself', '\n', 'To', 'fit', 'your', 'fancies', 'to', 'your', 'father', "'s", 'will', ';', '\n', 'Or', 'else', 'the', 'law', 'of', 'Athens', 'yields', 'you', 'up', '--', '\n', 'Which', 'by', 'no', 'means', 'we', 'may', 'extenuate', '--', '\n', 'To', 'death', ',', 'or', 'to', 'a', 'vow', 'of', 'single', 'life', '.', '\n', 'Come', ',', 'my', 'Hippolyta', ':', 'what', 'cheer', ',', 'my', 'love', '?', '\n', 'Demetrius', 'and', 'Egeus', ',', 'go', 'along', ':', '\n']


## Stop Words in spaCy
### Here we will examine stop words and remove them from the text.

In [46]:
import spacy
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
len(spacy_stopwords)

326

In [47]:
for stop_word in list(spacy_stopwords)[:10]:
    print(stop_word)

hereby
out
upon
empty
mine
been
indeed
latterly
or
against


In [48]:
#remove stopwords from the play

about_doc = nlp(text)
print([token for token in about_doc if not token.is_stop])

[

, Midsummer, Night, Dream, :, Entire, Play, 
 





, Midsummer, Night, Dream, 

, Shakespeare, homepage, 
    , |, Midsummer, Night, Dream, 
    , |, Entire, play, 

, ACT, 
, SCENE, I., Athens, ., palace, THESEUS, ., 

, Enter, THESEUS, ,, HIPPOLYTA, ,, PHILOSTRATE, ,, Attendants, 

, THESEUS, 

, ,, fair, Hippolyta, ,, nuptial, hour, 
, Draws, apace, ;, happy, days, bring, 
, moon, :, ,, O, ,, methinks, ,, slow, 
, old, moon, wanes, !, lingers, desires, ,, 
, Like, step, -, dame, dowager, 
, Long, withering, young, man, revenue, ., 

, HIPPOLYTA, 

, days, quickly, steep, night, ;, 
, nights, quickly, dream, away, time, ;, 
, moon, ,, like, silver, bow, 
, New, -, bent, heaven, ,, shall, behold, night, 
, solemnities, ., 

, THESEUS, 

, ,, Philostrate, ,, 
, Stir, Athenian, youth, merriments, ;, 
, Awake, pert, nimble, spirit, mirth, ;, 
, Turn, melancholy, forth, funerals, ;, 
, pale, companion, pomp, ., 
, Exit, PHILOSTRATE, 
, Hippolyta, ,, woo'd, thee, sword, ,, 
, won, thy,

## Lemmatization
### We will create lemmas, or the base roots of words, and compare them to their original word from the play. 


In [55]:
play_doc = nlp(text)
for token in play_doc:
    if str(token) != str(token.lemma_):
            print(f"{str(token):>20} : {str(token.lemma_)}")

# This is really interesting to see, especilly for what is very flowery language. 

              Entire : entire
                Play : play
                   A : a
              Entire : entire
                 The : the
           HIPPOLYTA : hippolyta
                 Now : now
                days : day
             Another : another
                   O : o
            methinks : methink
                This : this
               wanes : wane
             lingers : linger
             desires : desire
                Like : like
           withering : wither
                Four : four
                days : day
                Four : four
              nights : night
                 And : and
                 New : new
                  Of : of
         solemnities : solemnity
                  Go : go
                Stir : stir
            Athenian : athenian
          merriments : merriment
               Awake : awake
                Turn : turn
            funerals : funeral
                 The : the
                  is : be
                Exit : exit

## Word Frequency
### We can count the fequency of word


In [59]:
words = [
    token.text
    for token in play_doc
    if not token.is_stop and not token.is_punct
]

print(Counter(words).most_common(15))

# after seeing these results I think we need to add some new stop words

[('\n', 1782), ('\n\n', 997), ('love', 103), ('thou', 98), ('shall', 65), ('thee', 64), ('LYSANDER', 62), ('O', 59), ('DEMETRIUS', 59), ('thy', 58), ('HERMIA', 57), ('Pyramus', 57), ('THESEUS', 54), ('man', 50), ('night', 49)]


In [80]:
#remove old english stop words and re- process the text
nlp = spacy.load("en_core_web_sm")

new_stop_words = ["thou", "shall", "thee", "O", "thy"]

stop_words = nlp.Defaults.stop_words

stop_words.update(new_stop_words)

nlp.Defaults.stop_words = stop_words

play_doc = nlp(text)

words = [
    token.text
    for token in play_doc
    if not token.is_stop and not token.is_punct
]

print(Counter(words).most_common(15))

# these resutls look better

[('\n', 1782), ('\n\n', 997), ('love', 103), ('LYSANDER', 62), ('O', 59), ('DEMETRIUS', 59), ('HERMIA', 57), ('Pyramus', 57), ('THESEUS', 54), ('man', 50), ('night', 49), ('QUINCE', 46), ('HELENA', 44), ('Hermia', 43), ('sweet', 43)]


## Part-of-Speech Tagging
### I will use POS to determine which parts of speech each token is. 
### and also use spacy.explain to give descriptive details about the POS

In [81]:
for token in play_doc:
    print(
        F"""
    TOKEN: {str(token)}
    =====
    TAG: {str(token.tag_):10} POS: {token.pos_}
    EXSPLANATION: {spacy.explain(token.tag_)}"""
    )


    TOKEN: 


    =====
    TAG: _SP        POS: SPACE
    EXSPLANATION: whitespace

    TOKEN: Midsummer
    =====
    TAG: NNP        POS: PROPN
    EXSPLANATION: noun, proper singular

    TOKEN: Night
    =====
    TAG: NNP        POS: PROPN
    EXSPLANATION: noun, proper singular

    TOKEN: 's
    =====
    TAG: POS        POS: PART
    EXSPLANATION: possessive ending

    TOKEN: Dream
    =====
    TAG: NNP        POS: PROPN
    EXSPLANATION: noun, proper singular

    TOKEN: :
    =====
    TAG: :          POS: PUNCT
    EXSPLANATION: punctuation mark, colon or ellipsis

    TOKEN: Entire
    =====
    TAG: JJ         POS: ADJ
    EXSPLANATION: adjective (English), other noun-modifier (Chinese)

    TOKEN: Play
    =====
    TAG: NN         POS: NOUN
    EXSPLANATION: noun, singular or mass

    TOKEN: 
 






    =====
    TAG: _SP        POS: SPACE
    EXSPLANATION: whitespace

    TOKEN: A
    =====
    TAG: DT         POS: DET
    EXSPLANATION: determiner

    TOKEN: Mids

In [82]:
nouns = []
adjectives = []
for token in play_doc:
    if token.pos_=="NOUN":
        nouns.append(token)
    if token.pos_=="ADJ":
        adjectives.append(token)