# Basics of NLP

### Stemming

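Stemming is the process of reducing a word to its stem, the base form that remains after affixes such as suffixes and prefixes are stripped. The stem need not be a dictionary word, which distinguishes stemming from lemmatization, where the result is a valid base form known as a lemma. Stemming is a common preprocessing step in natural language understanding (NLU) and natural language processing (NLP).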
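To make the idea concrete, here is a minimal sketch that stems a handful of words with NLTK's `PorterStemmer`. It assumes NLTK is installed. The word list is an illustrative selection taken from the sample paragraph used in this notebook, and the `examples` and `stems` names are introduced here only for the demonstration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Illustrative words taken from the sample paragraph in this notebook
examples = ["disappointing", "minutes", "tokenize", "annoying", "This"]

# Map each word to its Porter stem
stems = {w: stemmer.stem(w) for w in examples}

for w, s in stems.items():
    print(f"{w} -> {s}")
```

Stems such as `minut` and `thi` are not dictionary words. The Porter algorithm strips suffixes by rule, so it is fast but does not guarantee a valid base form.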

In [3]:
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

In [4]:
pragraph = """Well that’s just disappointing: it takes 5 minutes to just tokenize 100000 notes. This is kind of annoying if you are playing with hyperparameters of a Vectorizer for your NLP Bag-of-words model. Note that the cleaning function plays a minimal role with this tokenizer (12 seconds out of 291 seconds). Let’s see if we can do better."""

In [5]:
pragraph

'Well that’s just disappointing: it takes 5 minutes to just tokenize 100000 notes. This is kind of annoying if you are playing with hyperparameters of a Vectorizer for your NLP Bag-of-words model. Note that the cleaning function plays a minimal role with this tokenizer (12 seconds out of 291 seconds). Let’s see if we can do better.'

In [6]:
# Sentence tokenization: split the paragraph into a list of sentences
sentences = nltk.sent_tokenize(pragraph)
sentences

['Well that’s just disappointing: it takes 5 minutes to just tokenize 100000 notes.',
 'This is kind of annoying if you are playing with hyperparameters of a Vectorizer for your NLP Bag-of-words model.',
 'Note that the cleaning function plays a minimal role with this tokenizer (12 seconds out of 291 seconds).',
 'Let’s see if we can do better.']

In [8]:
# Word tokenization: split the paragraph into word and punctuation tokens
word = nltk.word_tokenize(pragraph)
word

['Well',
 'that',
 '’',
 's',
 'just',
 'disappointing',
 ':',
 'it',
 'takes',
 '5',
 'minutes',
 'to',
 'just',
 'tokenize',
 '100000',
 'notes',
 '.',
 'This',
 'is',
 'kind',
 'of',
 'annoying',
 'if',
 'you',
 'are',
 'playing',
 'with',
 'hyperparameters',
 'of',
 'a',
 'Vectorizer',
 'for',
 'your',
 'NLP',
 'Bag-of-words',
 'model',
 '.',
 'Note',
 'that',
 'the',
 'cleaning',
 'function',
 'plays',
 'a',
 'minimal',
 'role',
 'with',
 'this',
 'tokenizer',
 '(',
 '12',
 'seconds',
 'out',
 'of',
 '291',
 'seconds',
 ')',
 '.',
 'Let',
 '’',
 's',
 'see',
 'if',
 'we',
 'can',
 'do',
 'better',
 '.']

In [10]:
stemmer = PorterStemmer()

In [11]:
stopwords.words("english")

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [18]:
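# Stemming: drop stopwords, then reduce each remaining token to its Porter stem
stop_words = set(stopwords.words("english"))  # build the set once, not once per token

for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    # The comparison is case-sensitive, so capitalized stopwords such as "This" are kept
    words = [stemmer.stem(word) for word in words if word not in stop_words]
    sentences[i] = ' '.join(words)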

In [19]:
sentences

['well ’ disappoint : take 5 minut token 100000 note .',
 'thi kind annoy play hyperparamet vector nlp bag-of-word model .',
 'note clean function play minim role token ( 12 second 291 second ) .',
 'let ’ see better .']
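
The `thi` at the start of the second sentence comes from `This`. The stopword list is all lowercase, so the capitalized token passes the filter and is then stemmed. One possible variation, which is not part of the original notebook, lowercases each token before the stopword check. In the sketch below, the helper `stem_without_stopwords` is a hypothetical name. The tokens are copied from the `word_tokenize` output above, and `stop_words` is a hard-coded subset of the English list, chosen so the example runs without downloading any NLTK data.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Assumption: a hard-coded subset of the English stopword list, sufficient for
# this one sentence; real code would use set(stopwords.words("english"))
stop_words = {"this", "is", "of", "if", "you", "are", "with", "a", "for", "your"}

# Tokens of the second sentence, copied from the word_tokenize output above
tokens = ["This", "is", "kind", "of", "annoying", "if", "you", "are",
          "playing", "with", "hyperparameters", "of", "a", "Vectorizer",
          "for", "your", "NLP", "Bag-of-words", "model", "."]


def stem_without_stopwords(tokens, lowercase=True):
    """Hypothetical helper: filter out stopwords, then stem what is left."""
    kept = []
    for tok in tokens:
        # Optionally lowercase the token before comparing with the stopword set
        key = tok.lower() if lowercase else tok
        if key not in stop_words:
            kept.append(stemmer.stem(tok))
    return kept


case_sensitive = stem_without_stopwords(tokens, lowercase=False)
case_insensitive = stem_without_stopwords(tokens, lowercase=True)

print(" ".join(case_sensitive))
print(" ".join(case_insensitive))
```

The case-sensitive call reproduces the notebook's result for this sentence, while the case-insensitive call drops the leading `thi`. Whether that is desirable depends on the task, since lowercasing can also merge tokens that differ only in capitalization.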