# Installation
```bash
pip install nltk
```

You also need to download the NLTK data (corpora and tokenizer models). You can do this with:

```python
import nltk
nltk.download()
```

Select `all` from the list in the downloader that opens.
Downloading the entire collection takes around 2 minutes.
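
If you would rather not download everything, `nltk.download` also accepts a resource identifier. Below is a minimal sketch, assuming this notebook needs only the tokenizer models and the stop-word corpus. The helper name `download_resources` and the tuple `REQUIRED_RESOURCES` are made up for illustration, and the exact identifiers you need may differ between NLTK versions.

```python
# Resource identifiers assumed to cover this notebook: the tokenizer models
# and the stop-word corpus. Newer NLTK releases look for "punkt_tab" rather
# than "punkt", so both are listed here.
REQUIRED_RESOURCES = ("punkt", "punkt_tab", "stopwords")


def download_resources(names=REQUIRED_RESOURCES):
    """Download each named NLTK resource and return {name: success flag}."""
    # Imported inside the function so the helper can be defined
    # even before NLTK itself has been installed.
    import nltk

    # quiet=True suppresses the progress messages; download() returns a bool.
    return {name: nltk.download(name, quiet=True) for name in names}
```

Calling `download_resources()` once before running the rest of the notebook should be enough. A `False` value in the returned dictionary means that resource could not be fetched.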

In [2]:
import nltk

# Tokenization
Text must be split into sentences or words before we can process it any further. NLTK provides functions to do this:
* `sent_tokenize` - Paragraph → Sentences
* `word_tokenize` - Paragraph → Words

In [3]:
TEXT = "Hello! My name is SyzygianInfern0. It's inspired from Dante Alighieri's 14th-century \
        epic poem Divine Comedy. Anyways python is the best of all"
nltk.tokenize.sent_tokenize(TEXT)

['Hello!',
 'My name is SyzygianInfern0.',
 "It's inspired from Dante Alighieri's 14th-century         epic poem Divine Comedy.",
 'Anyways python is the best of all']

In [4]:
nltk.tokenize.word_tokenize(TEXT)

['Hello',
 '!',
 'My',
 'name',
 'is',
 'SyzygianInfern0',
 '.',
 'It',
 "'s",
 'inspired',
 'from',
 'Dante',
 'Alighieri',
 "'s",
 '14th-century',
 'epic',
 'poem',
 'Divine',
 'Comedy',
 '.',
 'Anyways',
 'python',
 'is',
 'the',
 'best',
 'of',
 'all']

# Stop Words
Everyday text is full of filler words (such as "a", "the" and "is") that add little to the meaning of a sentence. We can remove them to reduce the size of our training data.
<br>
NLTK ships a list of these words in `nltk.corpus.stopwords`.

In [8]:
set(nltk.corpus.stopwords.words('english'))

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

---
Let's combine this with our tokenizer so that the stop words are removed right after tokenization.

In [10]:
stop_words = set(nltk.corpus.stopwords.words('english'))
tokens = nltk.tokenize.word_tokenize(TEXT)
filtered = [w for w in tokens if w not in stop_words]
print(f'Tokenized : {tokens}')
print(f'Cleaned : {filtered}')

Tokenized : ['Hello', '!', 'My', 'name', 'is', 'SyzygianInfern0', '.', 'It', "'s", 'inspired', 'from', 'Dante', 'Alighieri', "'s", '14th-century', 'epic', 'poem', 'Divine', 'Comedy', '.', 'Anyways', 'python', 'is', 'the', 'best', 'of', 'all']
Cleaned : ['Hello', '!', 'My', 'name', 'SyzygianInfern0', '.', 'It', "'s", 'inspired', 'Dante', 'Alighieri', "'s", '14th-century', 'epic', 'poem', 'Divine', 'Comedy', '.', 'Anyways', 'python', 'best']
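
Notice that `'My'` and `'It'` are still in the cleaned list: the stop-word list is all lowercase and the comparison above is case-sensitive. Here is a minimal sketch of a case-insensitive filter that can also drop punctuation-only tokens. The names `remove_stop_words`, `drop_punctuation` and `DEMO_STOP_WORDS` are hypothetical, and the small stop-word set is a hand-picked subset so that the example runs without the NLTK data.

```python
import string

# Tokens copied from the word_tokenize output above.
tokens = ['Hello', '!', 'My', 'name', 'is', 'SyzygianInfern0', '.', 'It', "'s",
          'inspired', 'from', 'Dante', 'Alighieri', "'s", '14th-century', 'epic',
          'poem', 'Divine', 'Comedy', '.', 'Anyways', 'python', 'is', 'the',
          'best', 'of', 'all']

# Hand-picked subset of the English stop-word list (illustration only).
DEMO_STOP_WORDS = {'my', 'is', 'it', 'from', 'the', 'of', 'all'}


def remove_stop_words(tokens, stop_words, drop_punctuation=False):
    """Drop stop words case-insensitively; optionally drop punctuation tokens."""
    kept = []
    for token in tokens:
        # Lowercase only for the comparison; the original token is kept as-is.
        if token.lower() in stop_words:
            continue
        # A token counts as punctuation if every character is punctuation.
        if drop_punctuation and all(ch in string.punctuation for ch in token):
            continue
        kept.append(token)
    return kept


cleaned = remove_stop_words(tokens, DEMO_STOP_WORDS)
words_only = remove_stop_words(tokens, DEMO_STOP_WORDS, drop_punctuation=True)
print(f'Cleaned : {cleaned}')
print(f'Words only : {words_only}')
```

In the notebook itself you would pass `set(nltk.corpus.stopwords.words('english'))` instead of the demo set. Whether lowercasing is appropriate depends on the task, since it also treats a capitalised word such as "It" as a stop word.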


Pretty good. Let's move on to the next notebook.