# A Deeper Look at spaCy
## Student: Levi Lowther
### https://github.com/LevLow/Datamine_07_spaCy_dive 

### Source for used Tutorial: https://realpython.com/natural-language-processing-spacy-python/ 

### Load and test needed modules

In [2]:
from collections import Counter
import pickle
import requests
import spacy
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt

!pip list

print('All prereqs installed.')

Package                       Version
----------------------------- --------------------
alabaster                     0.7.12
anaconda-client               1.11.0
anaconda-navigator            2.3.2
anyio                         3.5.0
appdirs                       1.4.4
argon2-cffi                   21.3.0
argon2-cffi-bindings          21.2.0
arrow                         1.2.2
asgiref                       3.5.2
astroid                       2.11.7
astropy                       5.1
atomicwrites                  1.4.0
attrs                         22.1.0
Automat                       20.2.0
autopep8                      1.6.0
Babel                         2.11.0
backcall                      0.2.0
backports.functools-lru-cache 1.6.4
backports.tempfile            1.0
backports.weakref             1.0.post1
bcrypt                        3.2.0
beautifulsoup4                4.11.1
binaryornot                   0.4.4
bitarray                      2.5.1
bkcharts                      0.2
blac

### Load Modules and save the A Midsummer Night's Dream in as a pickle

In [5]:
import requests,pickle,io,re,spacy
from bs4 import BeautifulSoup
from contextlib import redirect_stdout
from spacytextblob.spacytextblob import SpacyTextBlob
from spacy.lang.en.stop_words import STOP_WORDS
from collections import Counter
import matplotlib.pyplot as plt
import numpy as np

url = "https://shakespeare.mit.edu/midsummer/full.html"

response = requests.get(url)
print(response.status_code)
print(response.headers['content-type'])

#code to check if request worked and write to pickel
if response.status_code == 200:
    html_content= response.text
    soup = BeautifulSoup(html_content, "html.parser")
    article = soup.find("html")

    if article:
        with open("midsummer.pkl", "wb") as file:
            pickle.dump(str(article), file)
            print("Our Play is Saved!")
    else:
        print("Article not found")
else:
    print("Webpage Error")

200
text/html
Our Play is Saved!


### Read the pickle file and print the text


In [6]:
#Read the file
with open("midsummer.pkl", "rb") as file:
    html_content = pickle.load(file)

#parse
soup = BeautifulSoup(html_content, "html.parser")
text = soup.get_text()
print(text)



Midsummer Night's Dream: Entire Play
 





A Midsummer Night's Dream

Shakespeare homepage 
    | Midsummer Night's Dream 
    | Entire play

ACT I
SCENE I. Athens. The palace of THESEUS.

Enter THESEUS, HIPPOLYTA, PHILOSTRATE, and Attendants

THESEUS

Now, fair Hippolyta, our nuptial hour
Draws on apace; four happy days bring in
Another moon: but, O, methinks, how slow
This old moon wanes! she lingers my desires,
Like to a step-dame or a dowager
Long withering out a young man revenue.

HIPPOLYTA

Four days will quickly steep themselves in night;
Four nights will quickly dream away the time;
And then the moon, like to a silver bow
New-bent in heaven, shall behold the night
Of our solemnities.

THESEUS

Go, Philostrate,
Stir up the Athenian youth to merriments;
Awake the pert and nimble spirit of mirth;
Turn melancholy forth to funerals;
The pale companion is not for our pomp.
Exit PHILOSTRATE
Hippolyta, I woo'd thee with my sword,
And won thy love, doing thee injuries;
But I will we

## The Doc object for Processed Text
### The tutorial has us running tokenization on text that has been typed directly into the constructor. Since this isn't something
### that we would normally do I am instead going to read directly from the text I just parsed and printed. 

In [13]:
import spacy
nlp = spacy.load("en_core_web_sm")
introduction_doc = nlp(text)
print ([token.text for token in introduction_doc])

#I had to troubleshoot and determine the correct encoding. utf-8 and HTML were incorrect, but a little digging showed me the cp1252 encoding.




## Sentence Detection
### Using sentence detection to divide the text into meaningful units and extract useful information. 
### This will also help set us up for Parts of Speach Tagging and Named Entity Recognition

In [15]:
# Determine the number of sentences in A Midsummer Night's Dream
about_doc = nlp(text)
sentences = list(about_doc.sents)
len(sentences)
# there are 1135 sentences



1135

In [18]:
#print the first word of every sentence followed by and elipsis
for sentence in sentences:
    print(f"{sentence[:1]}...")



...
The...
Enter...
she...
HIPPOLYTA...
THESEUS...
Exit...
But...
Enter...
THESEUS...
Stand...
My...
Stand...
Thou...
so...
THESEUS...
To...
Demetrius...
HERMIA...
THESEUS...
HERMIA...
but...
THESEUS...
HERMIA...
I...
But...
THESEUS...
Therefore...
Thrice...
HERMIA...
THESEUS...
The...
DEMETRIUS...
LYSANDER...
You...
EGEUS...
true...
And...
LYSANDER...
And...
THESEUS...
But...
For...
Come...
EGEUS...
Exeunt...
How...
why...
How...
LYSANDER...
Ay...
for...
But...
LYSANDER...
too...
LYSANDER...
to...
LYSANDER...
Or...
The...
HERMIA...
LYSANDER...
I...
And...
There...
If...
HERMIA...
I...
By...
LYSANDER...
Look...
Enter...
whither...
that...
Demetrius...
Your...
Sickness...
Were...
O...
HERMIA...
HELENA...
HERMIA...
HELENA...
HERMIA...
HELENA...
HERMIA...
HELENA...
Before...
LYSANDER...
HERMIA...
Farewell...
Keep...
LYSANDER...
I...
Exit...
Exit...
Through...
But...
Demetrius...
but...
And...
And...
As...
I...
But...
Exit...
Athens...
QUINCE...
Enter...
BOTTOM...
QUINCE...
Here...
BOTTO

In [20]:
# use the language comonent to set custom boundaries to use ":" as a delimiter for sentences

from spacy.language import Language
@Language.component("set_custom_boundaries")
def set_custom_boundaries(doc):
    """Add support to use `:` as a delimiter for sentence detection"""
    for token in doc[:-1]:
        if token.text == ":":
            doc[token.i + 1].is_sent_start = True
        return doc


custom_nlp = spacy.load("en_core_web_sm")
custom_nlp.add_pipe("set_custom_boundaries", before="parser")
custom_ellipsis_doc = custom_nlp(text)
custom_ellipsis_sentences = list(custom_ellipsis_doc.sents)
for sentence in custom_ellipsis_sentences:
     print(sentence)



Midsummer Night's Dream: Entire Play
 





A Midsummer Night's Dream

Shakespeare homepage 
    | Midsummer Night's Dream 
    | Entire play

ACT I
SCENE I. Athens.
The palace of THESEUS.


Enter THESEUS, HIPPOLYTA, PHILOSTRATE, and Attendants

THESEUS

Now, fair Hippolyta, our nuptial hour
Draws on apace; four happy days bring in
Another moon: but, O, methinks, how slow
This old moon wanes!
she lingers my desires,
Like to a step-dame or a dowager
Long withering out a young man revenue.


HIPPOLYTA

Four days will quickly steep themselves in night;
Four nights will quickly dream away the time;
And then the moon, like to a silver bow
New-bent in heaven, shall behold the night
Of our solemnities.


THESEUS

Go, Philostrate,
Stir up the Athenian youth to merriments;
Awake the pert and nimble spirit of mirth;
Turn melancholy forth to funerals;
The pale companion is not for our pomp.

Exit PHILOSTRATE
Hippolyta, I woo'd thee with my sword,
And won thy love, doing thee injuries;

But I wi

## Tokens
### Use tokens to print index and common atributes of tokens