# Text Processing - Stemming, Lemmatization, Stopwords, Phrase Matching & Vocabularies

### Stemming
Stemming strips suffixes from words to reduce them to a shorter base form (the stem). It is fast and simple to apply, but the stem is often not a real word, which can lose meaning in later text processing. For example:
<br>

`running → run` <br>
`studies → studi` <br>

<p style="margin-left: 60px;">
    <span style="color:salmon;"> Because stemming is a crude way of cutting words down, <b>spaCy</b> does not include a stemmer. It relies on lemmatization instead, which is covered later in this notebook.</span><br>
</p>

So for now, we will use the `NLTK` package to stem text.

#### Type of Stemmers
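1. **Porter Stemmer (Porter's Algorithm):** Uses 5 phases of word reduction, each with its own mapping rules.<br>
    ***Mapping Rules*** (`m` is the *measure* of the stem left after removing the suffix, i.e. the number of vowel-consonant sequences in it; `(m>0)` means the rule applies only when that count is positive):
    | Rule (suffix → replacement) | Example |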
    | --- | --- |
    | SSES → SS | Actresses → Actress |
    | IES → I | Pastries → Pastri |
    | SS → SS | Dress → Dress |
    | S → (removed) | Dogs → Dog |
    | (m>0)ATIONAL → ATE | Relational → Relate<br>National → National |
    | (m>0)EED → EE | Agreed → Agree<br>Deed → Deed |
    
 <br>
2. **Snowball Stemmer (aka Porter2 Algorithm):** A revised version of the Porter stemmer that is generally more accurate; its English variant is also known as the "English Stemmer".
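
The mapping rules in the table above can be sketched in plain Python. This is only a toy illustration of those six rules and of the measure `m`; it is not the full five-phase Porter algorithm, and the helper names `is_consonant`, `measure`, and `apply_table_rules` are made up for this sketch.

```python
import re

VOWELS = "aeiou"

def is_consonant(word: str, i: int) -> bool:
    """Return True if word[i] is a consonant in Porter's sense."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # 'y' acts as a vowel when it follows a consonant
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem: str) -> int:
    """Count the vowel-consonant sequences in a stem of the form [C](VC)^m[V]."""
    shape = "".join("c" if is_consonant(stem, i) else "v" for i in range(len(stem)))
    return len(re.findall(r"v+c+", shape))

def apply_table_rules(word: str) -> str:
    """Apply only the six rules listed in the table (hypothetical helper)."""
    w = word.lower()
    # Plural rules: the longest matching suffix wins
    for suffix, repl in (("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")):
        if w.endswith(suffix):
            return w[: len(w) - len(suffix)] + repl
    # Conditional rules: applied only when the remaining stem has m > 0
    for suffix, repl in (("ational", "ate"), ("eed", "ee")):
        if w.endswith(suffix):
            stem = w[: len(w) - len(suffix)]
            return stem + repl if measure(stem) > 0 else w
    return w

for example in ["Actresses", "Pastries", "Dress", "Dogs",
                "Relational", "National", "Agreed", "Deed"]:
    print(f" {example:<12} → {apply_table_rules(example)}")
```

`National` and `Deed` stay unchanged because the stems left after removing the suffix (`n` and `d`) contain no vowel-consonant sequence, so `m` is 0.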

In [1]:
import nltk

In [2]:
from nltk.stem.porter import PorterStemmer

p_stemmer = PorterStemmer()
words = ['dream', 'dreamer', 'dreaming', 'dreams',
         'study', 'studying', 'studies', 'studious',
         'love', 'lover', 'loves', 'lovely', 'loving',
         'sensational', 'relational', 'disagreed', 'parties', 
         'poorness', 'poverty', 'poorly']

print("Using Porter's Stemming Algorithm:")
for word in words:
    print(f" {word:<15} → {p_stemmer.stem(word)}")

Using Porter's Stemming Algorithm:
 dream           → dream
 dreamer         → dreamer
 dreaming        → dream
 dreams          → dream
 study           → studi
 studying        → studi
 studies         → studi
 studious        → studiou
 love            → love
 lover           → lover
 loves           → love
 lovely          → love
 loving          → love
 sensational     → sensat
 relational      → relat
 disagreed       → disagre
 parties         → parti
 poorness        → poor
 poverty         → poverti
 poorly          → poorli


In [3]:
from nltk.stem.snowball import SnowballStemmer

s_stemmer = SnowballStemmer(language='english')
words = ['dream', 'dreamer', 'dreaming', 'dreams',
         'study', 'studying', 'studies', 'studious',
         'love', 'lover', 'loves', 'lovely', 'loving',
         'sensational', 'relational', 'disagreed', 'parties', 
         'poorness', 'poverty', 'poorly']

print("Using Snowball Stemming Algorithm:")
for word in words:
    print(f" {word:<15} → {s_stemmer.stem(word)}")

Using Snowball Stemming Algorithm:
 dream           → dream
 dreamer         → dreamer
 dreaming        → dream
 dreams          → dream
 study           → studi
 studying        → studi
 studies         → studi
 studious        → studious
 love            → love
 lover           → lover
 loves           → love
 lovely          → love
 loving          → love
 sensational     → sensat
 relational      → relat
 disagreed       → disagre
 parties         → parti
 poorness        → poor
 poverty         → poverti
 poorly          → poor
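
To see where the two stemmers disagree, the outputs can be compared directly. The sketch below assumes the same NLTK classes used above; the names `porter`, `snowball`, `sample_words`, and `differences` are introduced only for this illustration.

```python
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer(language="english")

# A few words taken from the lists above
sample_words = ["studious", "poorly", "lovely", "parties"]

# Keep only the words for which the two algorithms return different stems
differences = {}
for word in sample_words:
    porter_stem, snowball_stem = porter.stem(word), snowball.stem(word)
    if porter_stem != snowball_stem:
        differences[word] = (porter_stem, snowball_stem)

for word, (porter_stem, snowball_stem) in differences.items():
    print(f" {word:<10} Porter: {porter_stem:<10} Snowball: {snowball_stem}")
```

On the word lists above, the two stemmers differ only on `studious` and `poorly`, where Snowball keeps or produces a real word.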


### Lemmatization
Unlike stemming, lemmatization reduces a word to its correct dictionary base form (the lemma), using linguistic knowledge and the word's part of speech in context. It is more accurate than stemming but slower, because it relies on morphological analysis of the language. For example:<br>
`studying → study`<br>
`mice → mouse`<br>
`am, is, are, was, were, been, being → be`<br>
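
The role of part of speech can be illustrated with a toy lookup lemmatizer. This is a minimal sketch of the idea only; the table `LEMMA_TABLE` and the function `toy_lemmatize` are hypothetical and do not reflect how spaCy implements lemmatization internally.

```python
# Hypothetical lookup table keyed by (lowercased word, coarse part-of-speech tag)
LEMMA_TABLE = {
    ("studying", "VERB"): "study",
    ("mice", "NOUN"): "mouse",
    ("was", "AUX"): "be",
    ("saw", "VERB"): "see",   # "I saw a bird"
    ("saw", "NOUN"): "saw",   # "a rusty saw"
}

def toy_lemmatize(word: str, pos: str) -> str:
    """Return the lemma for (word, pos), or the word unchanged if it is unknown."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

print(toy_lemmatize("saw", "VERB"))   # lemma depends on the part of speech
print(toy_lemmatize("saw", "NOUN"))
print(toy_lemmatize("Mice", "NOUN"))
```

The same surface form `saw` gets a different lemma depending on its part of speech, which a suffix-stripping stemmer cannot express.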


In [4]:
import spacy

nlp = spacy.load("en_core_web_sm")

In [5]:
text1 = ("I am a lover who loves to read lovely poems with a loving partner"
" who lovingly partners with me. Last time I broke my previous record of reading"
" a book while my nose was running, breaking it with a studious boy who was"
" studying about Physics. I saw couple of mouse squeeking in the back.")

In [6]:
doc = nlp(text1)

# token.lemma is the hash ID of the lemma; token.lemma_ is its string form
for token in doc:
    print(f"{token.text:<12}{token.pos_:<7}{token.lemma:<25}{token.lemma_}")

I           PRON   4690420944186131903      I
am          AUX    10382539506755952630     be
a           DET    11901859001352538922     a
lover       NOUN   2827967690759983985      lover
who         PRON   3876862883474502309      who
loves       VERB   3702023516439754181      love
to          PART   3791531372978436496      to
read        VERB   11792590063656742891     read
lovely      ADJ    12747335289542760454     lovely
poems       NOUN   3429842728468313404      poem
with        ADP    12510949447758279278     with
a           DET    11901859001352538922     a
loving      VERB   3702023516439754181      love
partner     NOUN   15215732398709813157     partner
who         PRON   3876862883474502309      who
lovingly    ADV    10008751644619825950     lovingly
partners    VERB   15215732398709813157     partner
with        ADP    12510949447758279278     with
me          PRON   4690420944186131903      I
.           PUNCT  12646065887601541794     .
Last        ADJ    103215189

In [7]:
for token in doc:
    print(f"{token.text:<12} → {token.lemma_}")

I            → I
am           → be
a            → a
lover        → lover
who          → who
loves        → love
to           → to
read         → read
lovely       → lovely
poems        → poem
with         → with
a            → a
loving       → love
partner      → partner
who          → who
lovingly     → lovingly
partners     → partner
with         → with
me           → I
.            → .
Last         → last
time         → time
I            → I
broke        → break
my           → my
previous     → previous
record       → record
of           → of
reading      → read
a            → a
book         → book
while        → while
my           → my
nose         → nose
was          → be
running      → run
,            → ,
breaking     → break
it           → it
with         → with
a            → a
studious     → studious
boy          → boy
who          → who
was          → be
studying     → study
about        → about
Physics      → Physics
.            → .
I            → I
saw          → see
coup

### Stopwords
Stopwords are very common words that add little meaning to a text, such as: <br>
`a, an, the, this, in, and, is, of, on, to, for, with, that,..., etc.`<br>
spaCy ships with a built-in list of 326 English stopwords. These words can easily be filtered out so that processing focuses on the meaningful words in the text.
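
Filtering could look roughly like the sketch below. It assumes a blank English pipeline (`spacy.blank("en")`, tokenizer only), which should be enough because `is_stop` is a lexical attribute; the names `blank_nlp`, `sample`, and `kept` are made up for this example.

```python
import spacy

# Assumption: a blank English pipeline carries the default stopword list,
# so no trained model is needed for this check.
blank_nlp = spacy.blank("en")

sample = blank_nlp("This is an example of the filtering step.")

# Keep tokens that are neither stopwords nor punctuation
kept = [token.text for token in sample if not token.is_stop and not token.is_punct]
print(kept)
```

The same list comprehension works on a `Doc` produced by the `nlp` object loaded earlier in this notebook.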

In [8]:
# all stopwords available in the 'en_core_web_sm' model of spaCy
print(nlp.Defaults.stop_words)

{'third', 're', 'other', 'few', 'own', 'next', 'eleven', 'will', 'to', 'whether', 'afterwards', 'of', 'six', 'again', 'however', 'upon', 'too', 'least', 'herself', 'not', 'unless', 'would', "'ve", 'may', 'two', '’m', 'together', 'within', 'anyway', 'be', 'their', 'therefore', 'against', 'then', 'ourselves', 'regarding', 'that', 'since', 'anywhere', 'put', 'serious', 'by', 'do', 'n’t', 'how', 'so', 'everything', 'four', 'had', 'nine', 'namely', '’re', 'for', 'in', 'must', 'yourselves', 'beforehand', '‘re', 'everywhere', 'until', 'whereas', 'any', 'thereafter', 'yours', 'yet', 'more', 'did', 'behind', 'and', 'was', 'just', 'or', 'get', 'thereupon', 'what', 'therein', 'per', 'about', 'moreover', 'his', 'because', 'ever', 'themselves', 'only', 'should', 'he', 'besides', 'became', 'nowhere', 'anyone', 'sometimes', 'enough', 'both', 'mostly', "n't", 'she', 'quite', 'before', 'across', 'as', 'empty', 'thence', 'seem', 'which', 'perhaps', 'doing', 'none', 'take', 'might', 'been', 'into', 'ever

In [9]:
# check the total number of stopwords
len(nlp.Defaults.stop_words)

326

In [10]:
nlp.vocab['The'].is_stop

True

In [11]:
nlp.vocab['btw'].is_stop

False

In [12]:
"""To add a custom stopword to the list"""

nlp.Defaults.stop_words.add('btw')   # add the word to the default stopword set
nlp.vocab['btw'].is_stop = True      # flag the vocabulary entry as a stopword
len(nlp.Defaults.stop_words)

327

In [13]:
"""To remove a stopword from the list"""

nlp.Defaults.stop_words.remove('btw')   # drop the word from the default stopword set
nlp.vocab['btw'].is_stop = False        # clear the stopword flag on the vocabulary entry
len(nlp.Defaults.stop_words)

326

### Phrase Matching & Vocabularies
Phrase matching is the process of identifying and labeling specific phrases that match patterns we define ourselves. It can be seen as a more powerful version of regular expressions, because patterns can use token attributes such as part of speech rather than raw text alone.

#### List of Token Attributes

| Attributes | Description | Example |
| --- | --- | --- |
| `ORTH`| The exact verbatim text of a token | `"ApPlE"` matches only `"ApPlE"` |
| `LOWER` | The lowercase form of the token text | `"happy"` matches `"Happy"`, `"HAPPY"` |
| `LENGTH` | The length of the token text | `5` matches `"Jello"`, `6` matches `"Rabbit"`, `7` matches `"Pokemon"` |
| `IS_ALPHA, IS_ASCII, IS_DIGIT` | Token text consists of alphabetic characters, ASCII characters, digits | `"HOMER"` is alphabetic,<br>`"35"` is all digits |
| `IS_LOWER, IS_UPPER, IS_TITLE` | Token text is in lowercase, uppercase, titlecase | `"parsimony", "CRUEL", "Paris"`|
| `IS_PUNCT, IS_SPACE, IS_STOP` | Token is punctuation, whitespace, stopword | `"!", " ", "the"` |
| `LIKE_NUM, LIKE_URL, LIKE_EMAIL` | Token text resembles a number, URL, email | `"10.9", "www.example.com", "user@yahoo.com"`|
| `POS, TAG, DEP, LEMMA, SHAPE` | Token's POS, tag, dependency label, lemma, shape | `"VERB", "NNP", "dobj", "be", "xxxx"` |
| `ENT_TYPE` | The token's entity label| `"GPE", "ORG", "MONEY"` |
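
As a small illustration of how these attributes are used as pattern keys, the sketch below matches a number-like token followed by the word "dollars". It assumes a blank English pipeline, since `LIKE_NUM` and `LOWER` are lexical attributes; the rule name `NUM_DOLLARS` and the variable names are hypothetical.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a tokenizer-only pipeline is sufficient for lexical attributes
blank_nlp = spacy.blank("en")
num_matcher = Matcher(blank_nlp.vocab)

# One pattern: a token that looks like a number, then "dollars" in any casing
num_matcher.add("NUM_DOLLARS", [[{"LIKE_NUM": True}, {"LOWER": "dollars"}]])

sample = blank_nlp("I paid ten dollars and 25 Dollars.")

# Sort matches by position and turn them into text spans
num_spans = sorted((start, end) for _, start, end in num_matcher(sample))
num_texts = [sample[start:end].text for start, end in num_spans]
print(num_texts)
```

Attributes such as `POS`, `TAG`, `DEP`, `LEMMA`, and `ENT_TYPE` are filled in by trained pipeline components, so patterns that use them need a loaded model such as `en_core_web_sm` rather than a blank pipeline.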

In [14]:
from spacy.matcher import Matcher

In [15]:
matcher = Matcher(nlp.vocab)

In [16]:
""" To detect:
        + solarpower  = a single token, compared in lowercase
        + Solar-power = two words with punctuation in between
        + Solar power = two separate words
"""
pattern1 = [{'LOWER':'solarpower'}]
pattern2 = [{'LOWER':'solar'}, {'IS_PUNCT':True}, {'LOWER':'power'}]
pattern3 = [{'LOWER':'solar'},{'LOWER':'power'}]

# register all three patterns under the match key 'SolarPower'
matcher.add('SolarPower', [pattern1, pattern2, pattern3])

In [17]:
doc = nlp("The Solar Power industry is growing more with the solarpower increment. The team Solar-Power is good.")

In [18]:
found_matches = matcher(doc)
print(found_matches)

[(8656102463236116519, 1, 3), (8656102463236116519, 9, 10), (8656102463236116519, 14, 17)]


In [19]:
print(f"Match_ID{'':15}StringID{'':3}Starting_Idx{'':3}Ending_Idx{'':3}Text_span")
for match_id, start, end in found_matches:
    stringID = nlp.vocab.strings[match_id]  # to get string representation
    span = doc[start:end]
    print(f"{match_id:<23}{stringID:<15}{start:<15}{end:<10}{span.text}")

Match_ID               StringID   Starting_Idx   Ending_Idx   Text_span
8656102463236116519    SolarPower     1              3         Solar Power
8656102463236116519    SolarPower     9              10        solarpower
8656102463236116519    SolarPower     14             17        Solar-Power
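
The hyphenated and two-word variants can also be covered by a single pattern if the punctuation token is made optional with the `OP` key. The following is a sketch under the assumption that a blank English pipeline tokenizes the sentence the same way as above; `compact_matcher` and the other names are introduced only for this example.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: tokenizer-only pipeline; LOWER and IS_PUNCT need no trained model
blank_nlp = spacy.blank("en")
compact_matcher = Matcher(blank_nlp.vocab)

compact_matcher.add("SolarPower", [
    [{"LOWER": "solarpower"}],
    # 'OP': '?' makes the punctuation token optional (zero or one occurrence)
    [{"LOWER": "solar"}, {"IS_PUNCT": True, "OP": "?"}, {"LOWER": "power"}],
])

sample = blank_nlp("The Solar Power industry is growing more with the "
                   "solarpower increment. The team Solar-Power is good.")

# Order the matches by start index before extracting the text
ordered = sorted(compact_matcher(sample), key=lambda match: match[1])
compact_spans = [sample[start:end].text for _, start, end in ordered]
print(compact_spans)
```

Two patterns now do the work of three, at the cost of a slightly less explicit rule.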


In [20]:
# remove all patterns registered under the 'SolarPower' key
matcher.remove('SolarPower')