# Homework 7

In [4]:
import pandas as pd
import spacy

## Task 1

In [5]:
text = "The quick brown fox doesn't jump over the lazy dog. Natural Language Processing is fascinating!"

In [None]:
# Baseline: Python's str.split() splits on whitespace only, so punctuation stays attached to words
tokens_split = text.split()
print(tokens_split)

['The', 'quick', 'brown', 'fox', "doesn't", 'jump', 'over', 'the', 'lazy', 'dog.', 'Natural', 'Language', 'Processing', 'is', 'fascinating!']


In [None]:
# This is how to use spaCy to tokenize, instead of using just the .split() method
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
tokens_spacy = [token.text for token in doc]
print(tokens_spacy)

['The', 'quick', 'brown', 'fox', 'does', "n't", 'jump', 'over', 'the', 'lazy', 'dog', '.', 'Natural', 'Language', 'Processing', 'is', 'fascinating', '!']


In [9]:
# Here is a side by side comparison of the two
print("Split():", tokens_split)
print("spaCy:", tokens_spacy)

Split(): ['The', 'quick', 'brown', 'fox', "doesn't", 'jump', 'over', 'the', 'lazy', 'dog.', 'Natural', 'Language', 'Processing', 'is', 'fascinating!']
spaCy: ['The', 'quick', 'brown', 'fox', 'does', "n't", 'jump', 'over', 'the', 'lazy', 'dog', '.', 'Natural', 'Language', 'Processing', 'is', 'fascinating', '!']


In [22]:
for token in doc:
    print(f"'{token.text}' ({token.morph}) has a base form of '{token.lemma_}' and its syntactic head is '{token.head}'.")

'The' (Definite=Def|PronType=Art) has a base form of 'the' and its syntactic head is 'fox'.
'quick' (Degree=Pos) has a base form of 'quick' and its syntactic head is 'fox'.
'brown' (Degree=Pos) has a base form of 'brown' and its syntactic head is 'fox'.
'fox' (Number=Sing) has a base form of 'fox' and its syntactic head is 'jump'.
'does' (Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin) has a base form of 'do' and its syntactic head is 'jump'.
'n't' (Polarity=Neg) has a base form of 'not' and its syntactic head is 'jump'.
'jump' (VerbForm=Inf) has a base form of 'jump' and its syntactic head is 'jump'.
'over' () has a base form of 'over' and its syntactic head is 'jump'.
'the' (Definite=Def|PronType=Art) has a base form of 'the' and its syntactic head is 'dog'.
'lazy' (Degree=Pos) has a base form of 'lazy' and its syntactic head is 'dog'.
'dog' (Number=Sing) has a base form of 'dog' and its syntactic head is 'over'.
'.' (PunctType=Peri) has a base form of '.' and its syntactic he

1. spaCy usually makes each word its own token, but it also splits some words into several tokens, such as contractions ("doesn't" becomes "does" and "n't" in this example), and it separates punctuation marks from the words they are attached to. Once the text is tokenized, the attributes used above (and many others) expose what spaCy's pipeline predicts for each token, such as its morphology, lemma, and syntactic head, which shows how the words relate to each other.
2. Punctuation marks are treated much like words: each one is its own token with its own attributes (for example, PunctType=Peri for the period) and its own relationship to the words around it.
3. As mentioned above, contractions are typically split into multiple tokens, so each part can be analyzed independently. Here "does" gets the base form "do" and "n't" gets the base form "not".
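
The splitting behavior described in the answers above can be roughly approximated with a regular expression. This is only a sketch using Python's re module: the pattern and the helper name regex_tokenize are assumptions made for illustration, not how spaCy's tokenizer is implemented, and it only handles the "n't" contraction and simple punctuation.

```python
import re

# Hypothetical pattern for illustration only; spaCy's real tokenizer uses
# its own prefix, suffix, and exception rules and covers far more cases.
TOKEN_PATTERN = re.compile(
    r"""
    n't              # the negation clitic, kept as its own token
    | \w+(?=n't)     # the word stem directly in front of the clitic
    | \w+            # any other run of word characters
    | [^\w\s]        # a single punctuation mark
    """,
    re.VERBOSE,
)


def regex_tokenize(text):
    """Split text into word, clitic, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


text = "The quick brown fox doesn't jump over the lazy dog. Natural Language Processing is fascinating!"

print("Split():", text.split())        # punctuation stays glued to words
print("Regex:  ", regex_tokenize(text))  # punctuation and "n't" are separate
```

On this sentence the sketch yields the same 18 tokens that spaCy printed above, while .split() yields 15.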

## Task 2

In [25]:
for token in doc:
    print(f"{token.text} | {token.pos_} | {token.tag_}")

The | DET | DT
quick | ADJ | JJ
brown | ADJ | JJ
fox | NOUN | NN
does | AUX | VBZ
n't | PART | RB
jump | VERB | VB
over | ADP | IN
the | DET | DT
lazy | ADJ | JJ
dog | NOUN | NN
. | PUNCT | .
Natural | PROPN | NNP
Language | PROPN | NNP
Processing | NOUN | NN
is | AUX | VBZ
fascinating | ADJ | JJ
! | PUNCT | .


.pos_ and .tag_ give part-of-speech information for each token: .pos_ is the coarse-grained category (such as ADJ or VERB), while .tag_ is a finer-grained tag (such as VB for a base-form verb or VBZ for a 3rd person singular present verb).
1. As the output shows, "quick" is an adjective (ADJ, JJ), "jump" is a verb in its base form (VERB, VB), and "is" is a 3rd person singular present tense verb (VBZ), which spaCy labels as an auxiliary (AUX).
2. Part-of-speech tagging is a useful tool for grammar checking and translation, because many words take different forms or meanings depending on context. Knowing a word's part of speech resolves much of that ambiguity, which makes the output more accurate.
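
As one small illustration of how part-of-speech information can be used, the sketch below keeps only the content words from the sentence. It works on rows copied from the output above rather than on a live Doc, and the set CONTENT_POS is my own assumption about which coarse tags count as content words.

```python
# (text, coarse POS, fine-grained tag) rows copied from the output above
rows = [
    ("The", "DET", "DT"), ("quick", "ADJ", "JJ"), ("brown", "ADJ", "JJ"),
    ("fox", "NOUN", "NN"), ("does", "AUX", "VBZ"), ("n't", "PART", "RB"),
    ("jump", "VERB", "VB"), ("over", "ADP", "IN"), ("the", "DET", "DT"),
    ("lazy", "ADJ", "JJ"), ("dog", "NOUN", "NN"), (".", "PUNCT", "."),
    ("Natural", "PROPN", "NNP"), ("Language", "PROPN", "NNP"),
    ("Processing", "NOUN", "NN"), ("is", "AUX", "VBZ"),
    ("fascinating", "ADJ", "JJ"), ("!", "PUNCT", "."),
]

# Assumption for this sketch: these coarse tags mark content words.
CONTENT_POS = {"NOUN", "PROPN", "VERB", "ADJ"}

# Keep the tokens whose coarse tag is in the content set, in sentence order.
content_words = [text for text, pos, _tag in rows if pos in CONTENT_POS]
print(content_words)
```

With a live Doc, the same filter could read token.text and token.pos_ directly instead of the hard-coded rows.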

## Task 3

In [29]:
text = "Barack Obama was the 44th President of the United States. He was born in Hawaii."
doc = nlp(text)
tokens_spacy = [token.text for token in doc]
print(tokens_spacy)
for ent in doc.ents:
    print(f"{ent.text}: {ent.label_} ({spacy.explain(ent.label_)})")

['Barack', 'Obama', 'was', 'the', '44th', 'President', 'of', 'the', 'United', 'States', '.', 'He', 'was', 'born', 'in', 'Hawaii', '.']
Barack Obama: PERSON (People, including fictional)
44th: ORDINAL ("first", "second", etc.)
the United States: GPE (Countries, cities, states)
Hawaii: GPE (Countries, cities, states)


1. The entities recognized by spaCy are Barack Obama, 44th, the United States, and Hawaii.
2. Barack Obama is recognized as a PERSON, while Hawaii is recognized as a GPE, the category that covers countries, cities, and states.
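
To make the answers above easier to check, the entities can be grouped by label. This is a minimal sketch that uses the (text, label) pairs copied from the output above; with a live Doc, the loop would presumably read ent.text and ent.label_ from doc.ents instead.

```python
from collections import defaultdict

# (entity text, label) pairs copied from the output above
entities = [
    ("Barack Obama", "PERSON"),
    ("44th", "ORDINAL"),
    ("the United States", "GPE"),
    ("Hawaii", "GPE"),
]

# Collect entity texts under their label, preserving the original order.
entities_by_label = defaultdict(list)
for ent_text, label in entities:
    entities_by_label[label].append(ent_text)

for label, texts in entities_by_label.items():
    print(f"{label}: {texts}")
```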

## Task 4

In [33]:
text = "They lost they're luggage in the airport in Atlanta, Georgia, before landing in Russia and crossing the border into Georgia with are friends."
doc = nlp(text)
tokens_spacy = [token.text for token in doc]
for token in doc:
    print(f"'{token.text}' | base form: '{token.lemma_}' | PoS: {token.pos_}")

'They' | base form: 'they' | PoS: PRON
'lost' | base form: 'lose' | PoS: VERB
'they' | base form: 'they' | PoS: PRON
''re' | base form: 'be' | PoS: AUX
'luggage' | base form: 'luggage' | PoS: NOUN
'in' | base form: 'in' | PoS: ADP
'the' | base form: 'the' | PoS: DET
'airport' | base form: 'airport' | PoS: NOUN
'in' | base form: 'in' | PoS: ADP
'Atlanta' | base form: 'Atlanta' | PoS: PROPN
',' | base form: ',' | PoS: PUNCT
'Georgia' | base form: 'Georgia' | PoS: PROPN
',' | base form: ',' | PoS: PUNCT
'before' | base form: 'before' | PoS: ADP
'landing' | base form: 'land' | PoS: VERB
'in' | base form: 'in' | PoS: ADP
'Russia' | base form: 'Russia' | PoS: PROPN
'and' | base form: 'and' | PoS: CCONJ
'crossing' | base form: 'cross' | PoS: VERB
'the' | base form: 'the' | PoS: DET
'border' | base form: 'border' | PoS: NOUN
'into' | base form: 'into' | PoS: ADP
'Georgia' | base form: 'Georgia' | PoS: PROPN
'with' | base form: 'with' | PoS: ADP
'are' | base form: 'be' | PoS: AUX
'friends' | base form: 'friend' | PoS: N

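After deliberately introducing errors, I found that spaCy analyzes text as written rather than as intended. I wrote "they're" where "their" was correct, and spaCy still split it into "they" (PRON) and "'re" (AUX, base form "be") instead of treating it as a possessive. The same goes for "are" in place of "our": spaCy tagged it as AUX with the base form "be".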
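
The tags themselves hint at where the sentence goes wrong, so a crude check can be layered on top of them. The sketch below is a hypothetical heuristic of my own (the function name and both rules are assumptions, and it will produce false positives on sentences like "they're friends"). It runs on (text, PoS) pairs copied from the output above, leaving out the final row because it is cut off in the notebook.

```python
# (text, coarse POS) pairs copied from the output above; the last row of
# that output is truncated, so the list stops at "are".
tagged = [
    ("They", "PRON"), ("lost", "VERB"), ("they", "PRON"), ("'re", "AUX"),
    ("luggage", "NOUN"), ("in", "ADP"), ("the", "DET"), ("airport", "NOUN"),
    ("in", "ADP"), ("Atlanta", "PROPN"), (",", "PUNCT"), ("Georgia", "PROPN"),
    (",", "PUNCT"), ("before", "ADP"), ("landing", "VERB"), ("in", "ADP"),
    ("Russia", "PROPN"), ("and", "CCONJ"), ("crossing", "VERB"),
    ("the", "DET"), ("border", "NOUN"), ("into", "ADP"),
    ("Georgia", "PROPN"), ("with", "ADP"), ("are", "AUX"),
]


def flag_possible_homophones(tagged):
    """Return (index, written form, suggestion) for suspicious tokens."""
    flags = []
    for i, (text, pos) in enumerate(tagged):
        prev_text, prev_pos = tagged[i - 1] if i > 0 else (None, None)
        next_pos = tagged[i + 1][1] if i + 1 < len(tagged) else None
        # Rule 1 (assumed): "they" + "'re" directly before a noun may be "their".
        if (text == "'re" and prev_text is not None
                and prev_text.lower() == "they" and next_pos == "NOUN"):
            flags.append((i, prev_text + text, "their"))
        # Rule 2 (assumed): auxiliary "are" right after a preposition may be "our".
        elif text.lower() == "are" and pos == "AUX" and prev_pos == "ADP":
            flags.append((i, text, "our"))
    return flags


for index, written, suggestion in flag_possible_homophones(tagged):
    print(f"token {index}: '{written}' might be '{suggestion}'")
```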

In [36]:
from spacy import displacy
for ent in doc.ents:
    print(f"{ent.text}: {ent.label_} ({spacy.explain(ent.label_)})")
displacy.render(doc, style='ent', jupyter=True)

Atlanta: GPE (Countries, cities, states)
Georgia: GPE (Countries, cities, states)
Russia: GPE (Countries, cities, states)
Georgia: GPE (Countries, cities, states)


Although I tried to confuse spaCy by including both Georgia the state and Georgia the country in one sentence, it simply labels both as GPE, a broad category that covers countries, cities, and states. This could be a problem, since many cities, states, and countries share the same name, and a single label gives no way to tell them apart when that level of specificity is needed.
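
The ambiguity can be made concrete with a short sketch. It uses the entity pairs and token texts copied from the outputs above, finds entities that appear more than once with the same label, and prints the two tokens before each mention. Looking at the preceding tokens is only my assumption about what context might help. It is not a disambiguation feature that spaCy provides.

```python
from collections import Counter

# (entity text, label) pairs copied from the entity output above
entities = [("Atlanta", "GPE"), ("Georgia", "GPE"), ("Russia", "GPE"), ("Georgia", "GPE")]

# Token texts copied from the token output above
tokens = [
    "They", "lost", "they", "'re", "luggage", "in", "the", "airport", "in",
    "Atlanta", ",", "Georgia", ",", "before", "landing", "in", "Russia",
    "and", "crossing", "the", "border", "into", "Georgia", "with", "are",
    "friends",
]

# Entities that occur more than once with an identical text and label.
repeated = [ent for ent, count in Counter(entities).items() if count > 1]
print("Indistinguishable by label alone:", repeated)

# Map each "Georgia" token index to the two tokens that precede it.
contexts = {
    i: tokens[max(i - 2, 0):i]
    for i, tok in enumerate(tokens)
    if tok == "Georgia"
}
for index, left_context in contexts.items():
    print(f"Georgia at token {index}, preceded by {left_context}")
```

The first mention follows "Atlanta" and a comma, while the second follows "border into", which is the kind of context a reader uses to tell the state from the country.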