## Credit

Notes taken from the NLPlanet Practical NLP with Python course, section 1.5: Stemming, Lemmatization, Stopwords, POS Tagging.
* https://www.nlplanet.org/course-practical-nlp/01-intro-to-nlp/05-tokenization-stemming-lemmatization

Authored by Fabio Chiusano
* https://medium.com/@chiusanofabio94

**All quotes '' are sourced from the NLPlanet course.**

**Inflection**

<u>Inflection:</u>
* The modification of a word to express different grammatical categories such as tense, case, voice, person, number, and gender.
* Ex: eat, eats, ate
    
<u>Inflected Language:</u>
* A language in which words change form (inflect) depending on their grammatical role in a sentence.

**Stemming**

<u>Stemming:</u>
* The process of reducing inflected words to their stem (their base or root form).
* Related words should reduce to the same stem, even if that stem is not itself a valid word.
* Operates on a single word without knowledge of context.
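
To make the bullets above concrete, here is a minimal sketch of suffix stripping in plain Python. It is a toy illustration, not the Porter algorithm: the `SUFFIXES` rule list and the `toy_stem` function are hypothetical names invented for this example, and the three-character minimum is an arbitrary assumption.

```python
# Toy suffix-stripping stemmer: a simplified illustration, NOT the Porter algorithm.
# SUFFIXES is a hypothetical rule list, checked in order.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Each word is processed on its own, with no knowledge of its context.
words = ["walks", "walked", "walking", "navigates", "navigated"]
stems = [toy_stem(w) for w in words]
print(stems)  # ['walk', 'walk', 'walk', 'navigat', 'navigat']
```

Related words collapse to the same stem, and a stem such as `navigat` need not be a real word.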

In [1]:
# Imports

import nltk
nltk.download('punkt')
# Allows you to tokenize

from nltk.tokenize import word_tokenize
# Tokenizes based on words and punctuation

from nltk.stem import PorterStemmer
# Performs suffix stripping to produce stems
# Applies algorithmic rules to generate stems

[nltk_data] Downloading package punkt to /home/brandon/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [6]:
stemmer = PorterStemmer()
# PorterStemmer object used to stem

In [7]:
# Example

print(
stemmer.stem("program"), 
stemmer.stem("programming"),
stemmer.stem("programer"),
stemmer.stem("programmed")
)

program program program program


In [11]:
# Using tokenization

text = "The marching ants navigate the waving grass blades."

# Tokenize
tokens = nltk.word_tokenize(text)

# Stem
token_stems = [stemmer.stem(token) for token in tokens]

print(token_stems)

['the', 'march', 'ant', 'navig', 'the', 'wave', 'grass', 'blade', '.']


In [12]:
# Stemming other languages: Import

from nltk.stem.snowball import SnowballStemmer
# Contains a family of stemmers for different languages

In [13]:
# Stemming other languages: Available languages

SnowballStemmer.languages

('arabic',
 'danish',
 'dutch',
 'english',
 'finnish',
 'french',
 'german',
 'hungarian',
 'italian',
 'norwegian',
 'porter',
 'portuguese',
 'romanian',
 'russian',
 'spanish',
 'swedish')

In [19]:
# Stemming other languages: Example use

stemmer = SnowballStemmer("portuguese")
# SnowballStemmer object set to stem words from the Portuguese language

print(
stemmer.stem("Algoritmo"),
stemmer.stem("algoritmos")
)

algoritm algoritm


**Lemmatization**

<u>Lemmatization:</u>
* Used to group together the inflected forms of a word so that they can be analyzed as a single item (their lemma).
* Reduces inflected words to their lemma (an existing word).
* Can leverage context to find the correct lemma of a word.
* Slower than stemming.
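
The point about context can be sketched with a tiny dictionary-based lemmatizer. The `LEMMA_TABLE` lookup and the `toy_lemmatize` function are hypothetical, made up for illustration; a real lemmatizer consults a full lexical database such as WordNet. NLTK's `WordNetLemmatizer.lemmatize` similarly accepts an optional `pos` argument that defaults to noun.

```python
# Hypothetical lookup table keyed by (word, part of speech).
# A real lemmatizer consults a full lexical database instead.
LEMMA_TABLE = {
    ("saw", "v"): "see",
    ("saw", "n"): "saw",
    ("meeting", "v"): "meet",
    ("meeting", "n"): "meeting",
    ("better", "a"): "good",
}

def toy_lemmatize(word: str, pos: str = "n") -> str:
    """Return the lemma for (word, pos), falling back to the word itself."""
    word = word.lower()
    return LEMMA_TABLE.get((word, pos), word)

print(toy_lemmatize("saw", pos="v"))     # see
print(toy_lemmatize("saw", pos="n"))     # saw
print(toy_lemmatize("better", pos="a"))  # good
print(toy_lemmatize("better"))           # better (treated as a noun by default)
```

The same surface form maps to different lemmas depending on its part of speech, which is why a lemmatizer benefits from context in a way a stemmer cannot.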

In [21]:
# Imports

import nltk
nltk.download('wordnet')
# WordNet is a lexical database that tracks words and their relations
nltk.download('omw-1.4')
# Open Multilingual WordNet provides translations and word senses in multiple languages
nltk.download('averaged_perceptron_tagger')
# Model used for POS tagging (mentioned later)

from nltk.tokenize import word_tokenize
# Function that tokenizes text into individual tokens (words and punctuation)
from nltk.stem import WordNetLemmatizer
# Class used to reduce words to their lemma

[nltk_data] Downloading package wordnet to /home/brandon/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/brandon/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/brandon/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [28]:
# Example

lemmatizer = WordNetLemmatizer()
# WordNetLemmatizer object

print(lemmatizer.lemmatize("software"))

software


**Stopwords**

<u>Stop-words:</u>
* Words that are so common in all texts that they add little to no information for NLP tasks.
* Usually include articles and prepositions.

In [29]:
# Imports

import nltk
nltk.download('stopwords')
# A corpus of stopwords
from nltk.corpus import stopwords
# Imports stopwords from the corpus

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/brandon/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [31]:
# Stop words

english_stopwords = stopwords.words("english")
# Retrieves stopwords specific to English

print(f"""
Total number of stopwords in English: {len(english_stopwords)}
Examples:
{english_stopwords[:10]}
""")


Total number of stopwords in English: 179
Examples:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
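
A stopword list like the one above is typically used to filter tokens before further processing. The sketch below assumes a small hand-written set, `SAMPLE_STOPWORDS`, standing in for `stopwords.words("english")`; the `remove_stopwords` helper is a hypothetical name for this example.

```python
# Small hand-written stopword set standing in for stopwords.words("english").
# A set gives fast membership checks compared with a list.
SAMPLE_STOPWORDS = {"the", "a", "an", "in", "on", "of", "and", "is"}

def remove_stopwords(tokens, stopword_set):
    """Keep tokens whose lowercase form is not in the stopword set."""
    return [t for t in tokens if t.lower() not in stopword_set]

tokens = ["The", "ants", "march", "in", "the", "grass"]
filtered = remove_stopwords(tokens, SAMPLE_STOPWORDS)
print(filtered)  # ['ants', 'march', 'grass']
```

Lowercasing before the lookup matters because the NLTK list shown above is all lowercase.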



**POS Tagging**

<u>Part-Of-Speech Tagging:</u>
* The act of inferring the part of speech of words in sentences based on context.
* POS tags are short codes for the part of speech a word represents. Ex: NN (noun) or VB (verb)

In [34]:
# Example

import nltk

text = word_tokenize("I can't figure this out")
print(nltk.pos_tag(text))
# Takes a list of tokens and returns the POS (part-of-speech) tag for each token (word or punctuation)

text = word_tokenize("This is the easiest thing ever")
print(nltk.pos_tag(text))

[('I', 'PRP'), ('ca', 'MD'), ("n't", 'RB'), ('figure', 'VB'), ('this', 'DT'), ('out', 'RP')]
[('This', 'DT'), ('is', 'VBZ'), ('the', 'DT'), ('easiest', 'JJS'), ('thing', 'NN'), ('ever', 'RB')]
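
One common use of these tags is to pass part-of-speech information to a lemmatizer. The sketch below maps the Penn Treebank style tags shown above to the single-letter codes WordNet uses (`a`, `v`, `r`, `n`). The `penn_to_wordnet` helper is a hypothetical name, and falling back to noun for every other tag is an assumption of this example.

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank style tag to a WordNet POS letter (noun as fallback)."""
    if tag.startswith("J"):   # JJ, JJR, JJS: adjectives
        return "a"
    if tag.startswith("V"):   # VB, VBD, VBZ, ...: verbs
        return "v"
    if tag.startswith("RB"):  # RB, RBR, RBS: adverbs
        return "r"
    return "n"                # assumption: treat everything else as a noun

# Tagged tokens copied from the second output line above.
tagged = [("This", "DT"), ("is", "VBZ"), ("the", "DT"),
          ("easiest", "JJS"), ("thing", "NN"), ("ever", "RB")]

converted = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(converted)
# [('This', 'n'), ('is', 'v'), ('the', 'n'), ('easiest', 'a'), ('thing', 'n'), ('ever', 'r')]
```

Each resulting letter could then be supplied as the `pos` argument when lemmatizing the corresponding word.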
