# NLTK Basics - Stemmers, Lemmatizers and Vectorizers

### Stemming

**Word stemming means removing affixes from a word and returning its root form (the stem). The goal is to normalise the different inflected and derived forms of a word, possibly of different parts of speech, to a single token.**

#### Porter Stemmer

**One of the oldest stemming algorithms still in wide use. Its functionality is limited: the stems it produces are not guaranteed to be dictionary words, as seen below. Instead of "cry" as the stem, we get "cri".**

In [7]:
import nltk
from nltk.stem import PorterStemmer
stemmerporter=PorterStemmer()
stemmerporter.stem('crying')

'cri'
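
**The normalising effect described above is easier to see on a family of related words. The snippet below is a minimal sketch; the word list is an illustrative choice and is not part of the original notebook.**

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Inflected and derived forms of the same word (illustrative list)
words = ["connect", "connected", "connecting", "connection", "connections"]
stems = [stemmer.stem(word) for word in words]

# Every form collapses to the same stem
print(set(stems))
```

**All five forms should reduce to the single stem "connect", which is what makes stemming useful for matching and counting words.**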

#### Regexp Stemmer

**A stemmer driven by a user-supplied regular expression. Whatever part of the word matches the expression is cropped out, wherever in the word it occurs.**

In [1]:
from nltk.stem import RegexpStemmer
stemmerregexp=RegexpStemmer('s')
stemmerregexp.stem('cries')

'crie'
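
**Because every match is removed, an unanchored pattern such as 's' also deletes letters from the middle of a word. A sketch of the difference an anchored pattern makes is shown below; the pattern, the `min` value and the sample words are illustrative assumptions, not taken from the notebook above.**

```python
from nltk.stem import RegexpStemmer

# Unanchored pattern: every "s" is removed, wherever it occurs
loose = RegexpStemmer('s')
print(loose.stem('glasses'))

# Anchored alternatives strip only a trailing suffix; min=4 leaves
# words shorter than four characters untouched
suffix_only = RegexpStemmer('ing$|s$|e$|able$', min=4)
for word in ['cars', 'was', 'compute', 'advisable']:
    print(word, '->', suffix_only.stem(word))
```

**Anchoring each alternative with `$` restricts the stemmer to suffixes, which is usually what is wanted.**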

#### Snowball Stemmer

**An improvement over PorterStemmer. It also supports stemming for languages other than English (such as German and French), as shown below. It is slightly faster than Porter and has a reasonably large community around it.**

In [10]:
import nltk
from nltk.stem import SnowballStemmer
SnowballStemmer.languages  # tuple of supported language names
germanstemmer=SnowballStemmer('german')
germanstemmer.stem('studenten')

'student'

In [11]:
import nltk
from nltk.stem import SnowballStemmer
SnowballStemmer.languages
frenchstemmer=SnowballStemmer('french')
frenchstemmer.stem('manges')

'mang'

#### Sample words run through the various stemmers

** Unit Tests for PorterStemmer **

In [1]:
from nltk.stem.porter import PorterStemmer

In [2]:
stemmer=PorterStemmer()

In [3]:
plurals = ['caresses', 'flies', 'dies', 'mules', 'denied', 'died', 'agreed', 'owned', 'humbled', 'sized', 'meeting', 'stating', 'siezing', 'itemization', 'sensational', 'traditional', 'reference', 'colonizer', 'plotted']

In [4]:
singles = [stemmer.stem(plural) for plural in plurals]

In [5]:
print(' '.join(singles))

caress fli die mule deni die agre own humbl size meet state siez item sensat tradit refer colon plot


** Unit Tests for SnowballStemmer **

In [6]:
from nltk.stem.snowball import SnowballStemmer

**Here we define two stemmers. The second one is created with `ignore_stopwords=True`, so it leaves stop words as they are instead of stemming them (this option relies on the NLTK `stopwords` corpus being installed).**

In [12]:
stemmer2 = SnowballStemmer("english")

In [13]:
stemmer3 = SnowballStemmer("english", ignore_stopwords=True)

In [14]:
print(stemmer2.stem("having"))

have


In [15]:
print(stemmer3.stem("having"))

having


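**The 'english' stemmer generally gives better results than the original 'porter' stemmer. In the example below it keeps "generous", where 'porter' cuts the word down to "gener".**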

In [16]:
print(SnowballStemmer("english").stem("generously"))

generous


In [17]:
print(SnowballStemmer("porter").stem("generously"))

gener


### Word Lemmatizers

** *Lemmatisation* is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form. **

In [2]:
from nltk.stem import WordNetLemmatizer 
lemmatizer = WordNetLemmatizer()

In [3]:
print(lemmatizer.lemmatize("Am"))

Am

**Called without a `pos` argument, the lemmatizer treats the word as a noun, so "Am" is returned unchanged. The verb and adjective examples below pass `pos` explicitly.**


**Lemmatization of irregular nouns:**

In [11]:
print("oxen: ",lemmatizer.lemmatize("oxen"))
print("geese: ",lemmatizer.lemmatize("geese"))

oxen:  ox
geese:  goose


**Lemmatization of irregular verbs:**

In [12]:
print("brought: ",lemmatizer.lemmatize("brought",pos='v'))
print("caught: ",lemmatizer.lemmatize("caught",pos='v'))

brought:  bring
caught:  catch


**Lemmatization of adjectives:**

In [13]:
print("better :", lemmatizer.lemmatize("better", pos ="a"))

better : good


### Working of Stemmers with Prefixes

**From the code snippets below, we can observe that neither the traditional PorterStemmer nor the SnowballStemmer strips prefixes; both target suffixes alone.**

In [14]:
from nltk.stem import PorterStemmer
stemmer=PorterStemmer()
prefix=["unhappy","redraw","disapprove"]
prefix=[stemmer.stem(token) for token in prefix]
for token in prefix:
    print(token)

unhappi
redraw
disapprov


In [18]:
from nltk.stem import SnowballStemmer
stemmer=SnowballStemmer("english",ignore_stopwords=True)
prefix=["unhappy","redraw","disapprove"]
prefix=[stemmer.stem(token) for token in prefix]
for token in prefix:
    print(token)

unhappi
redraw
disapprov


#### Stemming the words of an entire passage

In [25]:
from nltk.stem import PorterStemmer
stemmer=PorterStemmer()
ex="In writing the words point and purpose are almost synonymous Your point is your purpose and how you decide to make your point clear to your reader is also your purpose Writers have a point and a purpose for every paragraph that they create"
ex=[stemmer.stem(token) for token in ex.split(" ")]
print(" ".join(ex))

In write the word point and purpos are almost synonym your point is your purpos and how you decid to make your point clear to your reader is also your purpos writer have a point and a purpos for everi paragraph that they creat


In [26]:
#Performance of Stemmer-- Porter, Lemmatizer, Snowball, Regexp
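
**One way such a comparison could be set up is sketched below with `timeit`. The word list, the repeat counts and the regular expression are illustrative assumptions. The lemmatizer is left out here because it needs the WordNet corpus, and the timings depend on the machine and the NLTK version, so no figures are quoted.**

```python
import timeit
from nltk.stem import PorterStemmer, RegexpStemmer, SnowballStemmer

# Illustrative workload: a few sample words repeated many times
words = ["caresses", "flies", "denied", "meeting", "itemization",
         "sensational", "traditional", "generously"] * 500

stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "regexp": RegexpStemmer("ing$|s$|ed$|ly$", min=4),  # hypothetical pattern
}

timings = {}
for name, stemmer in stemmers.items():
    # Time one full pass over the word list and keep the best of three runs
    timings[name] = min(timeit.repeat(
        lambda: [stemmer.stem(w) for w in words], number=1, repeat=3))

# Fastest first
for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{name:10s} {seconds:.4f} s")
```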

### Vectorisers

**Word vectorization is a methodology in NLP that maps words or phrases from a vocabulary to corresponding vectors of real numbers, which are then used for word prediction and for measuring word similarity/semantics.**


**The process of converting words into numbers is called *Vectorization*.**

** Note: For the application of document similarity evaluation we require vectorization as the first step. **

**First up is a simple vectorizer: CountVectorizer.**

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    "The cat ate the mouse",
    "Ralph and his parents went to the movies.",
    "A quick brown fox jumps over a lazy dog.",
]
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(corpus)

**Here the fit method of the vectorizer takes a list of strings and builds a dictionary of the vocabulary found in the corpus.**

**When transform is called, the documents are converted into a sparse matrix indexed by row (document ID) and column (token ID from the dictionary); each stored value is the count of that token in that document.**
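
**The learned dictionary and the resulting counts can be inspected directly. The snippet below is a minimal sketch that repeats the corpus from above so it runs on its own; the variable names `vocab` and `first_doc` are introduced here for illustration.**

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The cat ate the mouse",
    "Ralph and his parents went to the movies.",
    "A quick brown fox jumps over a lazy dog.",
]
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(corpus)

# vocabulary_ maps each token to its column index (the token ID)
vocab = vectorizer.vocabulary_
print(sorted(vocab, key=vocab.get))

# One row per document, one column per vocabulary entry
print(vectors.shape)

# Dense view of the first document: counts for "the" and "cat"
first_doc = vectors.toarray()[0]
print(first_doc[vocab["the"]], first_doc[vocab["cat"]])
```

**Tokens are lowercased by default, and the default token pattern only keeps tokens of two or more characters, so the article "a" does not appear in the vocabulary.**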

In [5]:
print(vectorizer)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)


** Next is the TF-IDF Vectorizer (Term Frequency - Inverse Document Frequency)**

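**The major difference is that the raw counts are not stored as they are. Each count is reweighted by the inverse document frequency of its token, so tokens that occur in many documents count for less, and the resulting document vectors are normalised.**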
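
**A minimal sketch with scikit-learn's `TfidfVectorizer` on the same corpus is given below. It assumes the default settings, under which each document vector is scaled to unit L2 length; the variable names are illustrative.**

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "The cat ate the mouse",
    "Ralph and his parents went to the movies.",
    "A quick brown fox jumps over a lazy dog.",
]
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(corpus).toarray()

# Same shape as the count matrix: documents x vocabulary
print(weights.shape)

# With the default settings every row has unit L2 norm
row_norms = np.linalg.norm(weights, axis=1)
print(row_norms)

# "the" occurs in two documents and "cat" in one,
# so "the" receives the lower inverse document frequency
vocab = tfidf.vocabulary_
print(tfidf.idf_[vocab["the"]], tfidf.idf_[vocab["cat"]])
```

**Because the rows are already normalised, the dot product of two rows gives their cosine similarity, which is convenient for the document similarity use case mentioned earlier.**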