# **What is NLTK?**
NLTK (Natural Language Toolkit) is a powerful Python library for natural language processing (NLP). It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for tokenization, stemming, tagging, parsing, and classification. It is widely used in both academia and industry, for work ranging from simple text analysis to more complex natural language understanding and generation.

reference: https://www.nltk.org/index.html

In [71]:
# Run the command below in a terminal (or command prompt) to install NLTK
# pip install nltk

In [72]:
import nltk

## 0. Strings
Strings are sequences of characters: letters, digits, whitespace and other symbols. They are surrounded by either single or double quotation marks, e.g. "hello" or 'hello'. Strings come with their own built-in methods; because strings are immutable, these methods return a new string or value instead of changing the original. The most commonly used ones are shown below.

reference: https://www.w3schools.com/python/python_strings.asp
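A point worth keeping in mind for the cells that follow: string methods leave the original string untouched and hand back a new value. The sketch below illustrates this with a hypothetical variable `greeting` that is not part of the notebook.

```python
# Hypothetical example string (not from the notebook)
greeting = "hello"

# upper() returns a new string; the original is left unchanged
shouted = greeting.upper()

print(greeting)  # hello
print(shouted)   # HELLO

# Strings do not support item assignment, so this raises a TypeError
try:
    greeting[0] = "H"
except TypeError as error:
    print("TypeError:", error)
```

This is why the cells below store each result in `query` rather than expecting `text` itself to change.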

In [73]:
text = "I can only comment on the vermicelli bowls because I have never had the Pho. The vermicelli are a decent size and a fair price. Conveniently located if you live in the Alberta Ave neighbourhood, my orders have always been ready in 15 minutes - even on a Friday night."

In [74]:
text

'I can only comment on the vermicelli bowls because I have never had the Pho. The vermicelli are a decent size and a fair price. Conveniently located if you live in the Alberta Ave neighbourhood, my orders have always been ready in 15 minutes - even on a Friday night.'

In [75]:
# Gets the first character of the text (index 0)
text[0]

'I'

In [76]:
# Gets the first 10 characters of the text (indexes 0 to 9; the end index 10 is excluded)
text[0:10]

'I can only'

In [77]:
# Gets the characters at indexes 3 to 13 (the end index 14 is excluded)
text[3:14]

'an only com'

In [78]:
# Gets the length of the string (its number of characters)
len(text)

267

In [79]:
# Returns the index at which the first occurrence of the given substring starts
query = text.find("comment")
query

11

In [80]:
# Returns the index of the first occurrence of the letter "o"
query = text.find("o")
query

6

In [81]:
# Returns the index of the first "o" found between indexes 10 and 19 (the end index 20 is excluded)
query = text.find("o", 10, 20)
query

12
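The `find` calls above all succeed, so it may help to see what happens when the substring is absent. The sketch below uses a hypothetical variable `sample` holding only the opening words of the review; `find` returns -1 when nothing is found, while the related `index` method raises a `ValueError`.

```python
# Hypothetical sample string: the opening words of the review used above
sample = "I can only comment on the vermicelli bowls"

found = sample.find("comment")         # substring is present
missing = sample.find("pizza")         # substring is absent
out_of_range = sample.find("o", 0, 6)  # "o" exists, but not within indexes 0 to 5

print(found, missing, out_of_range)  # 11 -1 -1

# Check for -1 before using the result as an index
if missing == -1:
    print("'pizza' does not occur in the sample")

# index() behaves like find() but raises ValueError instead of returning -1
try:
    sample.index("pizza")
except ValueError:
    print("index() raised ValueError")
```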

In [82]:
# Counts how many times the substring "have" occurs in the text
query = text.count("have")
query

2

In [83]:
# Counts how many times the substring "have" occurs between indexes 10 and 99 (the end index 100 is excluded)
query = text.count("have", 10, 100)
query

1
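Note that `count` matches substrings, not whole words, so it also counts occurrences inside longer words. A minimal sketch with a hypothetical sentence (`sample` is not part of the notebook) shows the difference between a substring count and a rough whole-word count:

```python
# Hypothetical sentence chosen so that "have" also appears inside "behave"
sample = "I have to behave. We have time."

# Substring count: matches "have", "behave" and "have"
substring_hits = sample.count("have")

# Rough whole-word count: split on whitespace, then count exact matches
word_hits = sample.split().count("have")

print(substring_hits)  # 3
print(word_hits)       # 2
```

The whitespace split is only an approximation, since a word followed by punctuation (for example "have.") would not match exactly.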

In [84]:
# Capitalizes the first character of the string and lowercases the rest
word = "word"
query = word.capitalize()
query

'Word'

In [85]:
# Checks if the input starts with the letter "I"
query = text.startswith("I")
query

True

In [86]:
# Checks if the input ends with "."
query = text.endswith(".")
query

True

In [87]:
# Converts all letters to uppercase
query = text.upper()
query

'I CAN ONLY COMMENT ON THE VERMICELLI BOWLS BECAUSE I HAVE NEVER HAD THE PHO. THE VERMICELLI ARE A DECENT SIZE AND A FAIR PRICE. CONVENIENTLY LOCATED IF YOU LIVE IN THE ALBERTA AVE NEIGHBOURHOOD, MY ORDERS HAVE ALWAYS BEEN READY IN 15 MINUTES - EVEN ON A FRIDAY NIGHT.'

In [88]:
# Converts all letters to lowercase
query = text.lower()
query

'i can only comment on the vermicelli bowls because i have never had the pho. the vermicelli are a decent size and a fair price. conveniently located if you live in the alberta ave neighbourhood, my orders have always been ready in 15 minutes - even on a friday night.'

In [89]:
# Replaces every occurrence of the first argument with the second
query = text.replace("I", "You")
query

'You can only comment on the vermicelli bowls because You have never had the Pho. The vermicelli are a decent size and a fair price. Conveniently located if you live in the Alberta Ave neighbourhood, my orders have always been ready in 15 minutes - even on a Friday night.'

In [90]:
# Splits a string on the given separator and returns a list (by default it splits on any run of whitespace)
query = text.split()
query

['I',
 'can',
 'only',
 'comment',
 'on',
 'the',
 'vermicelli',
 'bowls',
 'because',
 'I',
 'have',
 'never',
 'had',
 'the',
 'Pho.',
 'The',
 'vermicelli',
 'are',
 'a',
 'decent',
 'size',
 'and',
 'a',
 'fair',
 'price.',
 'Conveniently',
 'located',
 'if',
 'you',
 'live',
 'in',
 'the',
 'Alberta',
 'Ave',
 'neighbourhood,',
 'my',
 'orders',
 'have',
 'always',
 'been',
 'ready',
 'in',
 '15',
 'minutes',
 '-',
 'even',
 'on',
 'a',
 'Friday',
 'night.']

In [91]:
# Removes leading and trailing whitespace
word = "  word  "
query = word.strip()
query

'word'
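`strip` has two one-sided relatives, `lstrip` and `rstrip`, and all three accept an optional set of characters to remove instead of whitespace. A short sketch, where `padded` and `noisy` are hypothetical example strings:

```python
# Hypothetical example strings (not from the notebook)
padded = "  word  "
noisy = "...word!!"

left_trimmed = padded.lstrip()   # removes leading whitespace only
right_trimmed = padded.rstrip()  # removes trailing whitespace only
cleaned = noisy.strip(".!")      # removes any leading/trailing "." or "!" characters

print(repr(left_trimmed))   # 'word  '
print(repr(right_trimmed))  # '  word'
print(repr(cleaned))        # 'word'
```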

reference: https://www.w3schools.com/python/python_strings.asp

## 1. Tokenization
Tokenizers divide strings into lists of substrings. They can be used to break a large text body down to sentences, groups of words, single words or even just groups of characters or subwords. Tokenization is a common step used to help prepare language data for further use. There are several ways and methods available to tokenize data.

reference: https://medium.com/@kelsklane/tokenization-with-nltk-52cd7b88c7d

In [92]:
# Downloads the Punkt sentence tokenization models (recent NLTK releases, 3.9 and later, look for the 'punkt_tab' resource instead)
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [93]:
# Imports the tokenizer functions needed
from nltk.tokenize import sent_tokenize, word_tokenize, wordpunct_tokenize, WhitespaceTokenizer, regexp_tokenize

In [94]:
text2 = "Good muffins cost $3.88\nin New York.  Please buy me two of them.\n\nThanks."

In [95]:
# Tokenizes a string by sentences
sent_tokenize(text2)

['Good muffins cost $3.88\nin New York.',
 'Please buy me two of them.',
 'Thanks.']

In [96]:
# Tokenizes a string by words
word_tokenize(text2)

['Good',
 'muffins',
 'cost',
 '$',
 '3.88',
 'in',
 'New',
 'York',
 '.',
 'Please',
 'buy',
 'me',
 'two',
 'of',
 'them',
 '.',
 'Thanks',
 '.']

In [97]:
# Tokenizes a string into words and punctuation (each run of punctuation becomes its own token)
wordpunct_tokenize(text2)

['Good',
 'muffins',
 'cost',
 '$',
 '3',
 '.',
 '88',
 'in',
 'New',
 'York',
 '.',
 'Please',
 'buy',
 'me',
 'two',
 'of',
 'them',
 '.',
 'Thanks',
 '.']
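The splitting of `$3.88` into `$`, `3`, `.` and `88` follows from the rule this tokenizer applies: a token is either a run of word characters or a run of punctuation. The sketch below reproduces that rule with the standard `re` module only; the pattern is my understanding of what NLTK's `WordPunctTokenizer` uses, so treat it as an approximation rather than the library's definitive implementation.

```python
import re

text2 = "Good muffins cost $3.88\nin New York.  Please buy me two of them.\n\nThanks."

# Either a run of word characters (\w+) or a run of characters that are
# neither word characters nor whitespace ([^\w\s]+)
pattern = r"\w+|[^\w\s]+"

tokens = re.findall(pattern, text2)
print(tokens)
```

Because `.` is not a word character, the price is cut at the decimal point, which is the main practical difference from `word_tokenize`.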

In [98]:
# Tokenizes a string by whitespace ( space, tab, newline )
WhitespaceTokenizer().tokenize(text2)

['Good',
 'muffins',
 'cost',
 '$3.88',
 'in',
 'New',
 'York.',
 'Please',
 'buy',
 'me',
 'two',
 'of',
 'them.',
 'Thanks.']
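For this input, the whitespace tokenizer gives the same tokens as the built-in `split()` from the Strings section when it is called without arguments. The sketch below, which uses only built-in string methods, also shows why passing an explicit single space behaves differently: newlines stay inside tokens and consecutive spaces produce empty strings.

```python
text2 = "Good muffins cost $3.88\nin New York.  Please buy me two of them.\n\nThanks."

# No argument: split on any run of whitespace (spaces, tabs, newlines)
by_whitespace = text2.split()

# Explicit " ": split on every single space character only
by_space = text2.split(" ")

print(by_whitespace)
print(by_space)
```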

In [99]:
# Tokenizes a string with a regular expression (here it keeps only the words that start with a capital letter)
# A raw string (r'...') is used so that \w is not treated as a string escape sequence
regexp_tokenize(text2, r'[A-Z]\w+')

['Good', 'New', 'York', 'Please', 'Thanks']

reference: https://www.nltk.org/api/nltk.tokenize.html