In [3]:
import pandas as pd
import numpy as np
reviews = pd.read_csv('../NLP_DATASETS/reviews_get.csv')

# Instructions for Downloading spaCy

**Fun Fact**: the "Cy" in spaCy indicates that Cython powers many of the underlying computations. Cython is a superset of the Python language that additionally supports calling C functions.

1. Go to https://spacy.io/usage and follow the instructions. I chose:
 Operating System: macOS / OSX
 Platform: x86
 Package Manager: pip (I couldn't get it to download with conda)
 (Check "virtual env": starting with a fresh virtual environment is recommended, and the page will generate the commands for it)
 Hardware: CPU (GPU support requires an NVIDIA graphics card with CUDA installed)
 Trained Pipeline: English
 Selected to prioritize Accuracy over Efficiency
2. This will result in the following commands:
    - python -m venv .env
    - source .env/bin/activate
    - pip install -U pip setuptools wheel
    - pip install -U spacy
    - python -m spacy download en_core_web_trf (if you want efficiency, use en_core_web_sm instead)

    These commands install and update the necessary packaging tools and spaCy, and hopefully avoid most of the issues I ran into when trying to do it on my own. The resulting pipeline is spaCy's en_core_web_trf model, a pretrained statistical model for predicting linguistic annotations, for example whether a word is a verb or a noun.
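Before calling spacy.load, it can help to confirm from Python that the packages actually landed in the active environment. The following is a minimal sketch using only the standard library; the helper name `is_installed` is made up for illustration, and it relies on pipelines installed with `python -m spacy download` being ordinary importable Python packages.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package with this name can be imported."""
    return importlib.util.find_spec(package_name) is not None


# Report which of the packages mentioned above are available.
for name in ("spacy", "en_core_web_sm", "en_core_web_trf"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If a pipeline shows up as missing, re-running the matching `python -m spacy download ...` command inside the activated virtual environment is the first thing to try.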

**Helpful Links** 
* https://spacy.io/usage (referenced above)
* https://spacy.io/usage/spacy-101
* https://course.spacy.io/en (a course that walks you through using spaCy with code examples)

### What's included in a spaCy trained pipeline?
* **Binary weights** for the part-of-speech tagger, dependency parser and named entity recognizer to predict those annotations in context.
* **Lexical entries in the vocabulary**, i.e. words and their context-independent attributes like the shape or spelling.
* **Data files** like lemmatization rules and lookup tables.
* **Word vectors**, i.e. multi-dimensional meaning representations of words that let you determine how similar they are to each other.
* **Configuration options**, like the language and processing pipeline settings and model implementations to use, to put spaCy in the correct state when you load the pipeline.

In [4]:
import spacy
from spacy import displacy
from spacy.matcher import Matcher
from collections import Counter

In [5]:
#load the pre-trained pipeline
# nlp = spacy.load('en_core_web_trf')
nlp = spacy.load('en_core_web_sm')

### Some common linguistic annotations that come packaged into the models as token attributes and will be useful to know:
* **text** - the verbatim text of the token
* **lemma_** - the lemmatized (base) form of the token
* **orth_** - the verbatim text of the token, identical to text; the un-suffixed orth attribute is the integer hash ID of that string in the vocabulary
* **pos_** - coarse-grained part of speech, such as verb, adverb, noun, etc.
* **is_stop** - boolean that identifies whether the token is a stop word
* **tag_** - fine-grained part-of-speech tag that carries more context-specific information about the token
* **dep_** - type of dependency relation; it is used for the arc label, which describes the type of syntactic relation that connects the child to the head
* **shape_** - the orthographic shape of the token (not its lemma): letters are replaced with x or X and digits with d, e.g. "Apple" becomes "Xxxxx"
* **is_alpha** - boolean that identifies whether the token consists only of alphabetic characters (note there is no trailing underscore)
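Several of these attributes are purely lexical, so they can be inspected without any trained model. The sketch below assumes spaCy v3 is installed and uses `spacy.blank("en")`, which provides only the English tokenizer and vocabulary; the variable names `blank_nlp` and `sample` are illustrative. Attributes such as pos_, tag_, dep_ and lemma_ need a trained pipeline and are left out here.

```python
import spacy

# A blank English pipeline has a tokenizer and vocabulary but no trained components.
blank_nlp = spacy.blank("en")
sample = blank_nlp("Apple is looking at buying U.K. startup for $1 billion")

# Collect the lexical attributes for every token.
rows = [(t.text, t.orth_, t.shape_, t.is_alpha, t.is_stop) for t in sample]
for row in rows:
    print(row)
```

The first row shows that orth_ and text carry the same string, and that shape_ describes the casing pattern of the original token rather than its lemma.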

In [6]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

print(f"{'text':{8}} {'lemma':{6}} {'orth':{6}} {'POS':{6}} {'TAG':{6}} {'Dep':{6}} {'shape':{6}} {'is_stop':{6}} {'is_alpha':{6}} {'POS explained':{20}} {'tag explained'} ")
for token in doc:
    print(f'{token.text:{8}} {token.lemma_:{6}} {token.orth_:{6}} {token.pos_:{6}} {token.tag_:{6}} {token.dep_:{6}} {token.shape_:{6}} {token.is_stop:{6}} {token.is_alpha:{10}} {spacy.explain(token.pos_):{20}} {spacy.explain(token.tag_)}')

text     lemma  orth   POS    TAG    Dep    shape  is_stop is_alpha POS explained        tag explained 
Apple    Apple  Apple  PROPN  NNP    nsubj  Xxxxx       0          1 proper noun          noun, proper singular
is       be     is     AUX    VBZ    aux    xx          1          1 auxiliary            verb, 3rd person singular present
looking  look   looking VERB   VBG    ROOT   xxxx        0          1 verb                 verb, gerund or present participle
at       at     at     ADP    IN     prep   xx          1          1 adposition           conjunction, subordinating or preposition
buying   buy    buying VERB   VBG    pcomp  xxxx        0          1 verb                 verb, gerund or present participle
U.K.     U.K.   U.K.   PROPN  NNP    dobj   X.X.        0          0 proper noun          noun, proper singular
startup  startup startup NOUN   NN     dobj   xxxx        0          1 noun                 noun, singular or mass
for      for    for    ADP    IN     prep   xxx   

### Now onto Named Entity Recognition and a more podium-specific demo...
**Goal** - To be able to identify the name of a Key decision-maker at a business based only on review text

**Process** -
* Identify a "cherry-picked" example for demonstration
* Predict its part-of-speech tags and "ents" (named entities) by running it through the pre-trained pipeline we loaded from spaCy
* Begin to identify tokens around the main word that can inform whether or not the main word refers to a key decision-maker at the company
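To make the last step concrete, here is one possible way to look at the tokens around a candidate name. This is only a sketch in plain Python, not the method used later in this notebook: the function `has_role_cue`, the cue list `ROLE_CUES` and the window size are all hypothetical choices for illustration.

```python
def has_role_cue(tokens, name_index, cues, window=2):
    """Return True if any cue word occurs within `window` tokens of the name."""
    start = max(0, name_index - window)
    end = min(len(tokens), name_index + window + 1)
    # Neighbours on both sides, excluding the name itself.
    neighbours = tokens[start:name_index] + tokens[name_index + 1:end]
    return any(tok.lower() in cues for tok in neighbours)


ROLE_CUES = {"manager", "owner", "director"}  # hypothetical cue list

tokens = ["the", "manager", "Mike", "has", "always", "taken", "care", "of", "me"]
print(has_role_cue(tokens, tokens.index("Mike"), ROLE_CUES))
print(has_role_cue(tokens, tokens.index("care"), ROLE_CUES))
```

An exact-match cue list like this would miss misspellings in real reviews, which is one reason to lean on the model's annotations rather than raw strings.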

In [7]:
#Named Entity Recognition (ner)

#NOTE: spacy has the 'ner' pipeline component that identifies token spans fitting a predetermined set of named entities. 
#These are available as the 'ents' property of a Doc object and return any named entities including names of people and organizations 

ents = list(doc.ents)
print(ents)
print(ents[0].label_)
print(ents[0].text)

[Apple, U.K., $1 billion]
ORG
Apple


### One more note about the 'ner' component (per spaCy documentation - link below):
* The entity recognizer identifies non-overlapping labelled spans of tokens. The transition-based algorithm used encodes certain assumptions that are effective for “traditional” named entity recognition tasks, but may not be a good fit for every span identification problem. Specifically, the **loss function optimizes for whole entity accuracy**, so if your inter-annotator agreement on boundary tokens is low, the component will likely perform poorly on your problem
* The transition-based algorithm also assumes that the most decisive information about your entities will be close to their initial tokens. If your entities are long and characterized by tokens in their middle, the component will likely not be a good fit for your task
* (see https://spacy.io/api/entityrecognizer for more information)
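To see why boundary disagreements matter when whole entities are what counts, consider exact-match span scoring, where a predicted span is correct only if its start, end and label all agree with the gold annotation. The sketch below is an illustration in plain Python, not spaCy's loss function or scorer; the function name and the offsets are made up.

```python
def exact_match_scores(gold, predicted):
    """Precision, recall and F1 where a span counts only if (start, end, label) all match."""
    gold_set, pred_set = set(gold), set(predicted)
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if true_pos else 0.0
    return precision, recall, f1


# Token offsets as (start, end, label); the values are invented for illustration.
gold = [(0, 2, "PERSON"), (5, 6, "ORG")]
pred = [(0, 1, "PERSON"), (5, 6, "ORG")]  # the first span is one token short

print(exact_match_scores(gold, pred))
```

The PERSON span is off by a single boundary token, yet it earns no credit at all, so half of the score is lost.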

In [8]:
# Cohen's kappa is one measure of agreement between two annotators (Fleiss' kappa is another).
# This check probably only matters when you label your own annotations and token boundaries,
# which may not apply here (open question: ask Aaron).
def cohen_kappa(ann1, ann2):
    """Computes Cohen kappa for pair-wise annotators.
    :param ann1: annotations provided by first annotator
    :type ann1: list
    :param ann2: annotations provided by second annotator
    :type ann2: list
    :rtype: float
    :return: Cohen kappa statistic
    """
    count = 0
    for an1, an2 in zip(ann1, ann2):
        if an1 == an2:
            count += 1
    A = count / len(ann1)  # observed agreement A (Po)

    uniq = set(ann1 + ann2)
    E = 0  # expected agreement E (Pe)
    for item in uniq:
        cnt1 = ann1.count(item)
        cnt2 = ann2.count(item)
        count = ((cnt1 / len(ann1)) * (cnt2 / len(ann2)))
        E += count

    return round((A - E) / (1 - E), 4)

#see link: https://towardsdatascience.com/inter-annotator-agreement-2f46c6d37bf3
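A small worked example makes the statistic easier to read. The block below is a compact, self-contained restatement of the same formula using Counter (the name `kappa` and the label sequences are invented for illustration), assuming both annotators labeled the same items in the same order.

```python
from collections import Counter


def kappa(ann1, ann2):
    """Cohen's kappa for two equal-length label sequences."""
    n = len(ann1)
    observed = sum(a == b for a, b in zip(ann1, ann2)) / n
    c1, c2 = Counter(ann1), Counter(ann2)
    # Chance agreement: product of each annotator's label frequencies, summed over labels.
    expected = sum((c1[label] / n) * (c2[label] / n) for label in set(c1) | set(c2))
    return round((observed - expected) / (1 - expected), 4)


ann1 = ["PERSON", "O", "ORG", "O"]
ann2 = ["PERSON", "O", "O", "O"]
print(kappa(ann1, ann2))  # 0.5556
```

Here the annotators agree on 3 of 4 items (observed agreement 0.75), chance agreement is 0.0625 + 0.375 + 0 = 0.4375, and kappa is (0.75 - 0.4375) / (1 - 0.4375), which rounds to 0.5556.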

In [9]:
#look for a good example to work with by searching for reviews with the word "team"
examples = reviews.loc[reviews['REVIEW_BODY'].str.contains('team|Team', na=False),'REVIEW_BODY']
examples = list(examples)
examples

['Very professional and courteous technicians.  Highly recommend these guys.  Team assigned to my project:  Ryan, Adrian, Arthur.  ServPro will be the only place I will call for any and all future clean up and damage mitigation projects.Service:\xa0Water damage-related cleanup & repair',
 'Great moving company! Super fast and efficient! The team was made up of experienced movers!',
 'Always good customer service, the manger Mike has always take care of me, he and his team do a great gob every time.',
 'Yes a very pleasant and efficient team']
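The pattern 'team|Team' misses other casings such as "TEAM". A case-insensitive variant is sketched below on a toy table; the DataFrame `toy` and its rows are invented for illustration, and only the column name REVIEW_BODY comes from the data above.

```python
import pandas as pd

# Toy stand-in for the reviews table; the rows are invented for illustration.
toy = pd.DataFrame({"REVIEW_BODY": [
    "The TEAM was great",
    "Fast service",
    None,
    "Thanks to Mike and his team",
]})

# case=False ignores casing; na=False treats missing reviews as non-matches.
mask = toy["REVIEW_BODY"].str.contains("team", case=False, na=False)
matches = toy.loc[mask, "REVIEW_BODY"].tolist()
print(matches)
```

Note that a plain substring search also matches words like "teammate" or "steam", which may or may not be wanted when hunting for examples.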

In [10]:
#select the best example manually
example = examples[2]
example

'Always good customer service, the manger Mike has always take care of me, he and his team do a great gob every time.'

In [11]:
#run pipeline on selection to tag with annotations and ents
example = nlp(example)
type(example)

spacy.tokens.doc.Doc

In [12]:
# get list of ents in the doc "example"
labels = [x.label_ for x in example.ents]
print(example)
print(Counter(labels))

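# Another, richer example:
rich = list(reviews['REVIEW_BODY'].iloc[[42]])
# NOTE: str(rich) stringifies the whole list, so the brackets and escaped quotes
# become part of the text that spaCy sees; nlp(rich[0]) would pass the raw review.
rz = nlp(str(rich))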
rz_ent = [x.label_ for x in rz.ents]
print(rich)
Counter(rz_ent)

Always good customer service, the manger Mike has always take care of me, he and his team do a great gob every time.
Counter({'PERSON': 1})


Counter({'PERSON': 1,
         'ORDINAL': 3,
         'CARDINAL': 2,
         'DATE': 9,
         'NORP': 2,
         'MONEY': 5,
         'WORK_OF_ART': 3,
         'ORG': 2})

In [13]:
# Let's see which one is which for our two examples.
# example and rz are already Doc objects, so there is no need to re-run the pipeline.
print({ent.text: ent.label_ for ent in example.ents})
{ent.text: ent.label_ for ent in rz.ents}

{'Mike': 'PERSON'}


{"Natalie Kalaitzidis\\'s": 'PERSON',
 'first': 'ORDINAL',
 '4': 'CARDINAL',
 'the 2nd': 'DATE',
 'December': 'DATE',
 'the following year': 'DATE',
 'the end of the year': 'DATE',
 'Christmas Eve': 'DATE',
 'Dental': 'NORP',
 '520.88': 'MONEY',
 '520': 'MONEY',
 '3rd': 'ORDINAL',
 'one': 'CARDINAL',
 'Balance Forward': 'WORK_OF_ART',
 'Karthik': 'ORG',
 'August': 'DATE',
 'Aug & Sep.': 'ORG'}
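The Counter shown earlier reports more DATE and MONEY entities than this dictionary lists. The most likely reason is that a dict keeps a single entry per distinct entity string, so repeated mentions collapse. A small sketch with invented (text, label) pairs shows the difference, along with how to pull out just the PERSON entities we care about.

```python
from collections import Counter

# (text, label) pairs as a pipeline might return them; invented for illustration.
pairs = [("December", "DATE"), ("August", "DATE"), ("December", "DATE"), ("Mike", "PERSON")]

label_counts = Counter(label for _, label in pairs)  # counts every mention
by_text = dict(pairs)                                # repeated texts overwrite each other
people = [text for text, label in pairs if label == "PERSON"]

print(label_counts)
print(by_text)
print(people)
```

Keeping the entities as a list of pairs preserves every mention, which matters if the position or frequency of a name is used later.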

### Entities that spaCy recognizes

* **PERSON**:      People, including fictional.
* **NORP**:        Nationalities or religious or political groups.
* **FAC**:         Buildings, airports, highways, bridges, etc.
* **ORG**:         Companies, agencies, institutions, etc.
* **GPE**:         Countries, cities, states.
* **LOC**:         Non-GPE locations, mountain ranges, bodies of water.
* **PRODUCT**:     Objects, vehicles, foods, etc. (Not services.)
* **EVENT**:       Named hurricanes, battles, wars, sports events, etc.
* **WORK_OF_ART**: Titles of books, songs, etc.
* **LAW**:         Named documents made into laws.
* **LANGUAGE**:    Any named language.
* **DATE**:        Absolute or relative dates or periods.
* **TIME**:        Times smaller than a day.
* **PERCENT**:     Percentage, including "%".
* **MONEY**:       Monetary values, including unit.
* **QUANTITY**:    Measurements, as of weight or distance.
* **ORDINAL**:     “first”, “second”, etc.
* **CARDINAL**:    Numerals that do not fall under another type.

* see link: https://towardsdatascience.com/explorations-in-named-entity-recognition-and-was-eleanor-roosevelt-right-671271117218

In [19]:
#Another way to visualize is with their graphical interface displacy that integrates into jupyter
displacy.render(example, jupyter=True, style='ent')
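displacy can also render entities that are supplied by hand, which is handy for checking the visual output without a trained pipeline or outside a notebook. The sketch below assumes spaCy v3 and uses displacy's manual mode; the sentence and the character offsets are constructed for illustration.

```python
from spacy import displacy

text = "the manager Mike and his team do a great job every time."
start = text.index("Mike")

# Manual mode expects character offsets and a label for each entity.
manual_doc = {
    "text": text,
    "ents": [{"start": start, "end": start + len("Mike"), "label": "PERSON"}],
    "title": None,
}

# With jupyter=False the markup is returned as a string instead of being displayed.
html = displacy.render(manual_doc, style="ent", manual=True, jupyter=False)
print(html[:80])
```

The returned string can be written to an .html file and opened in a browser.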

### It might be helpful to have this here to interpret what comes next - Parts of Speech:
1. **Adverbs** - a word or phrase that modifies or qualifies an adjective, verb, or other adverb or a word group, expressing a relation of place, time, circumstance, manner, cause, degree, etc. (e.g., gently, quite, then, there )
2. **Verbs** - a word used to describe an action, state, or occurrence, and forming the main part of the predicate of a sentence, such as hear, become, happen.
3. **Adjective** - a word or phrase naming an attribute, added to or grammatically related to a noun to modify or describe it.
4. **Nouns** - a word (other than a pronoun) used to identify any of a class of people, places, or things ( common noun ), or to name a particular one of these ( proper noun )
5. **Pronouns** - a word that can function by itself as a noun phrase and that refers either to the participants in the discourse (e.g., I, you ) or to someone or something mentioned elsewhere in the discourse (e.g., she, it, this ).
6. **Prepositions** - a word governing, and usually preceding, a noun or pronoun and expressing a relation to another word or element in the clause, as in “the man on the platform,” “she arrived after dinner,” “what did you do it for ?”.
7. **Conjunctions** - a word used to connect clauses or sentences or to coordinate words in the same clause (e.g. and, but, if )
8. **Interjections** - an exclamation, especially as a part of speech (e.g. ah!, dear me! ).
9. **Determiner** - a word (such as an article, possessive, demonstrative, or quantifier) that makes specific the denotation of a noun phrase, e.g. "**The** bunny went home." (the determiner comes directly before the noun) or "I ate **the** chocolate cookie for dessert." (the determiner comes before the adjective "chocolate", which describes the noun "cookie")
10. **Article** - used to modify nouns (two in english: "the" and "a/an")
11. **Possessive** - the case of a noun or pronoun that shows possession, e.g. "his" (a possessive pronoun) or "**Wayne's** World" (a possessive noun)
12. **Demonstrative** - a determiner or a pronoun that points to a particular noun or to the noun it replaces (e.g. "this", "these", "that", "those")
13. **Quantifier** - a word, especially a modifier, that indicates the quantity of something. e.g. much, many, each, every, one, two, several, a few, a couple etc.
14. **CCONJ** - A coordinating conjunction is a word that links words or larger constituents without syntactically subordinating one to the other and expresses a semantic relationship between them
15. **SCONJ** - A subordinating conjunction is a conjunction that links constructions by making one of them a constituent of the other. It typically marks the incorporated constituent, which has the status of a (subordinate) clause, e.g. **that** in "I believe that he will come", or "if" and "while".
16. **AUX** - An auxiliary verb is a verb that adds functional or grammatical meaning to the clause in which it occurs, so as to express tense, aspect, modality, voice, emphasis, etc., e.g. **to be, to have, and to do**.
17. **ADP** - **Adposition** - general term including both prepositions and postpositions
18. **Clause** - a group of related words with both a subject and a predicate (verb). An **Independent Clause** (also called a main clause) expresses a complete thought: subject + verb (predicate), e.g. "I eat bananas."
19. **Dependent Clause** - (also called a subordinate clause) a group of words that has a subject and a verb and starts with a subordinating conjunction. A dependent clause cannot stand alone as a sentence, e.g. "After I ate raspberries"
20. **AMOD** - An adjectival modifier of an NP (noun phrase) is any adjectival phrase that serves to modify the meaning of the NP, e.g. "Sam eats **red** meat"
21. **APPOS** - An appositional modifier of a noun is a nominal immediately following the first noun that serves to define, modify, name, or describe that noun. It includes parenthesized examples, as well as defining abbreviations in one of these structures, e.g. "Sam, **my brother**, arrived"
22. **NN** - singular common nouns
23. **NNS** - plural common nouns
24. **NNP** - singular proper nouns
25. **Compound** - a word that is made up of two or more existing parts or elements
26. **Nominal Subject** - a nominal which is the syntactic subject and the proto-agent of a clause.
27. **Nominal** - a grammatical category for words or groups of words that function as nouns in a sentence.
28. **Direct Object** - The direct object of a verb is the noun phrase that denotes the entity acted upon, e.g. in "The students eat cake", the direct object is **cake**
29. **Adverbial Modifier** - a word or phrase that is used to modify another part of a sentence, typically a verb or adjective. When used properly, these modifiers provide additional information about an action or some part of a sentence and answer a question about it, e.g. "**less** often". In the sentence "He crossed the bridge quickly," the word "quickly" is an adverbial modifier; it answers the question "How did he cross the bridge?" Likewise, "hardly" is the adverbial modifier in the sentence "She slapped him hardly."
30. **Predicate** - the part of a sentence or clause containing a verb and stating something about the subject (e.g. **went home** in "John went home").
31. **NPADVMOD** - noun phrase as adverbial modifier, e.g. "The silence is **itself** significant"

**Good resources:** 
* https://universaldependencies.org/docs/
* https://downloads.cs.stanford.edu/nlp/software/dependencies_manual.pdf
* https://machinelearningknowledge.ai/tutorial-on-spacy-part-of-speech-pos-tagging/
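Rather than memorizing every abbreviation, spaCy's built-in glossary can be queried directly with spacy.explain, which works for part-of-speech tags as well as dependency labels and needs no trained pipeline. The short sketch below assumes spaCy is installed; the choice of labels is arbitrary.

```python
import spacy

# Coarse POS tags, fine-grained tags and dependency labels share one glossary.
labels = ("PROPN", "AUX", "ADP", "NNP", "VBG", "amod", "nsubj")
glossary = {label: spacy.explain(label) for label in labels}

for label, meaning in glossary.items():
    print(f"{label:6} {meaning}")
```

This is the same helper used to fill the "POS explained" and "tag explained" columns in the token table earlier.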

In [15]:
#Here is an "under the hood" look at the annotations that the model placed on example
displacy.render(example, style='dep', jupyter=True, options={'distance': 120})

### Fin for now

### For creating vector representations of each token:
* https://www.dataquest.io/blog/tutorial-text-classification-in-python-using-spacy/

In [16]:
print(example.vector.shape)
print(example.vector)

(96,)
[-0.01597212 -0.06585884 -0.12272161 -0.06319936  0.10595281 -0.26592857
  0.009597   -0.18387161  0.22103883 -0.09978317 -0.06497978 -0.14702599
  0.18614815  0.23146775 -0.17594999 -0.17636737 -0.04162822 -0.12467759
 -0.27792218 -0.13081977 -0.09354423 -0.03238046 -0.02361845  0.14279263
  0.23841856  0.17729084 -0.0099194  -0.0245508  -0.24542347  0.7447064
 -0.26634976  0.22512293 -0.34267935  0.2837784  -0.14622386  0.11161509
 -0.30725792  0.05584962 -0.05270758  0.1990669  -0.02435051  0.32070118
 -0.14390947 -0.26428363 -0.24050215 -0.0786149   0.32975835  0.48022923
 -0.05015451 -0.27521688  0.15141559 -0.10644344  0.04019525  0.173697
 -0.09854608  0.04899363  0.3485072  -0.12223566  0.2754811  -0.08169848
 -0.03703514 -0.19430453  0.33448026 -0.42756495 -0.06252185 -0.30588007
 -0.22029015  0.3366049  -0.19730693 -0.25510186  0.0416397   0.18952209
 -0.11853432 -0.02600521 -0.28164327 -0.09401219  0.38853905  0.32815808
 -0.29747817  0.21674028 -0.06386046 -0.5815586 
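A single vector for a whole document is typically obtained by averaging the vectors of its tokens, and vectors are usually compared with cosine similarity. The sketch below shows both ideas with NumPy on toy 3-dimensional vectors; the numbers are invented, and this is an illustration of the arithmetic, not spaCy's internal implementation.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy "token vectors"; real pipelines use far more dimensions.
token_vectors = np.array([[1.0, 0.0, 0.0],
                          [0.0, 1.0, 0.0]])
doc_vector = token_vectors.mean(axis=0)  # average of the token vectors

print(doc_vector.shape)
print(cosine_similarity(doc_vector, token_vectors[0]))
```

Small pipelines such as en_core_web_sm do not ship static word vectors, so similarity scores from them tend to be less reliable than those from the larger packages that do.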

### (In the event that a rule-based approach is needed) Now that we've been able to identify named entities, let's take a look at spaCy's customizable matching capabilities
### Say we want to enable spaCy to find a combination of three tokens:

* A token whose lowercase form matches "mike", e.g. "Mike" or "MIKE".
* A token whose is_punct flag is set to True, i.e. any punctuation.
* A token whose lowercase form matches "you", e.g. "You" or "YOU".

see links: 
* https://towardsdatascience.com/explorations-in-named-entity-recognition-and-was-eleanor-roosevelt-right-671271117218
* https://spacy.io/usage/rule-based-matching

In [17]:
matcher = Matcher(nlp.vocab)
# Add match ID "MikeYou" with no callback and one pattern
pattern = [{"LOWER": "mike"}, {"IS_PUNCT": True}, {"LOWER": "you"}]
matcher.add("MikeYou", [pattern])

doc = nlp("Mike, You and your Team were great!!")
print(f"{'match_id':{20}} {'string_id':{6}} {'start':{3}} {'end':{6}} {'span_text'}")
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(f"{match_id:{20}} {string_id:{6}} {start:{3}} {end:{6}} {span.text}")

match_id             string_id start end    span_text
12662094209803597944 MikeYou   0      3 Mike, You
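The same three-token shape works for other combinations, for example a "team" token, any punctuation, then a "manager" token. Because LOWER and IS_PUNCT are lexical attributes, the sketch below runs on a blank English pipeline with no trained model; it assumes spaCy v3, and the sentence and the names `blank_nlp` and `team_matcher` are invented for illustration.

```python
import spacy
from spacy.matcher import Matcher

blank_nlp = spacy.blank("en")  # tokenizer only; no trained model required
team_matcher = Matcher(blank_nlp.vocab)

# "team" in any casing, then any punctuation, then "manager" in any casing.
pattern = [{"LOWER": "team"}, {"IS_PUNCT": True}, {"LOWER": "manager"}]
team_matcher.add("TeamManager", [pattern])

review = blank_nlp("Thanks to the whole Team, Manager Mike included!")
spans = [review[start:end].text for _, start, end in team_matcher(review)]
print(spans)
```

Patterns that rely on predicted attributes such as POS or ENT_TYPE would need a trained pipeline like en_core_web_sm instead of the blank one.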
