
# **What is NLTK?**
NLTK (Natural Language Toolkit) is a powerful Python library for natural language processing (NLP). It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for tokenization, stemming, tagging, parsing, and classification. It is widely used in both academia and industry, for work ranging from simple text analysis to more complex natural language understanding and generation.

reference: https://www.nltk.org/index.html

Run the command below in a terminal (Command Prompt on Windows) to install NLTK:

`pip install nltk`

Import NLTK into your program using:

`import nltk`

In [None]:
import nltk

## **1. Introduction**
In Natural Language Processing (NLP), strings and regular expressions (regex) play crucial roles in various tasks, especially when working with libraries like NLTK (Natural Language Toolkit).

In this section, we will review strings and regular expressions.

### **1.1. Strings**
Strings are sequences of characters: letters, digits, punctuation, whitespace, and other symbols. They are written between single, double, or triple quotation marks, e.g.:
```
string1 = 'hello'
string2 = "hello"
string3 = ''' triple
quotation
marks
'''
```
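As a small sketch of how the quoting styles relate (the variable names here are illustrative, not from the tutorial), the snippet below checks that all three produce ordinary `str` values and that triple quotes keep the line breaks typed between them:

```python
# The three quoting styles all create ordinary str objects.
single = 'hello'
double = "hello"
triple = '''hello'''

print(single == double == triple)  # True

# Triple quotes preserve the line breaks written inside them.
multi_line = '''triple
quotation
marks
'''
print(multi_line.splitlines())  # ['triple', 'quotation', 'marks']
```

Single and double quotes are interchangeable; picking the one that does not appear inside the text avoids escaping.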
Strings have built-in methods that return information about the string or a modified copy of it (the original string is never changed, because strings are immutable). Most of these methods are shown below.

reference: https://www.w3schools.com/python/python_strings.asp

In [None]:
text = "I can only comment on the vermicelli bowls because I have never had the Pho. The vermicelli are a decent size and a fair price. Conveniently located if you live in the Alberta Ave neighbourhood, my orders have always been ready in 15 minutes - even on a Friday night."

In [None]:
text

'I can only comment on the vermicelli bowls because I have never had the Pho. The vermicelli are a decent size and a fair price. Conveniently located if you live in the Alberta Ave neighbourhood, my orders have always been ready in 15 minutes - even on a Friday night.'

In [None]:
# Gets the first character of the text (index 0)
text[0]

'I'

In [None]:
# Gets the first 10 characters of the text (indexes 0 through 9; the end index 10 is excluded)
text[0:10]

'I can only'

In [None]:
# Gets the characters at indexes 3 through 13 (the end index 14 is excluded)
text[3:14]

'an only com'
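
Slices also accept omitted bounds, negative indexes, and a step. The sketch below uses a shorter string, `sample` (an illustrative name, not part of the tutorial's data), so the results are easy to check by eye:

```python
sample = "I can only comment"

first_five = sample[:5]      # start omitted: begins at index 0
last_seven = sample[-7:]     # negative index counts from the end
middle = sample[2:5]         # indexes 2, 3, 4
reversed_text = sample[::-1] # step of -1 walks the string backwards

print(first_five)     # I can
print(last_seven)     # comment
print(middle)         # can
print(reversed_text)  # tnemmoc ylno nac I
```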

In [None]:
# Gets the length of the string (number of characters)
len(text)

267

In [None]:
# Returns the index at which the first occurrence of the given word starts
query = text.find("comment")
query

11

In [None]:
# Returns the index of the first occurrence of the letter "o"
query = text.find("o")
query

6

In [None]:
# Returns the index of the first "o" between index 10 (inclusive) and index 20 (exclusive)
query = text.find("o", 10, 20)
query

12
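
When the substring is absent, `find` returns `-1` instead of raising an error. A minimal sketch of the common ways to handle a missing substring, using an illustrative string `sample`:

```python
sample = "I can only comment"

# find() signals "not found" with -1
missing = sample.find("pizza")
print(missing)  # -1

# The `in` operator is the simplest membership test
print("only" in sample)  # True

# index() behaves like find() but raises ValueError when nothing is found
try:
    sample.index("pizza")
    raised = False
except ValueError:
    raised = True
print(raised)  # True
```

Because `-1` is also a valid index (the last character), check the result of `find` before using it to slice.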

In [None]:
# Counts how many times the substring "have" occurs in the text
query = text.count("have")
query

2

In [None]:
# Counts occurrences of "have" between index 10 (inclusive) and index 100 (exclusive)
query = text.count("have", 10, 100)
query

1
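
Note that `count` matches substrings, not whole words. The sketch below (with an illustrative string `sample`) contrasts the two; counting over the result of `split()` is one simple way to count whole words, assuming words are separated by whitespace:

```python
sample = "the other the"

# "the" also occurs inside "other", so the substring count is 3
substring_hits = sample.count("the")

# Splitting first restricts the count to whole words
whole_word_hits = sample.split().count("the")

print(substring_hits)   # 3
print(whole_word_hits)  # 2
```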

In [None]:
# Capitalizes the first letter of the given word (and lowercases the rest)
word = "word"
query = word.capitalize()
query

'Word'

In [None]:
# Checks if the input starts with the letter "I"
query = text.startswith("I")
query

True

In [None]:
# Checks if the input ends with "."
query = text.endswith(".")
query

True

In [None]:
# Converts all the letters to uppercase
query = text.upper()
query

'I CAN ONLY COMMENT ON THE VERMICELLI BOWLS BECAUSE I HAVE NEVER HAD THE PHO. THE VERMICELLI ARE A DECENT SIZE AND A FAIR PRICE. CONVENIENTLY LOCATED IF YOU LIVE IN THE ALBERTA AVE NEIGHBOURHOOD, MY ORDERS HAVE ALWAYS BEEN READY IN 15 MINUTES - EVEN ON A FRIDAY NIGHT.'

In [None]:
# Converts all the letters to lowercase
query = text.lower()
query

'i can only comment on the vermicelli bowls because i have never had the pho. the vermicelli are a decent size and a fair price. conveniently located if you live in the alberta ave neighbourhood, my orders have always been ready in 15 minutes - even on a friday night.'

In [None]:
# Replaces every occurrence of the first argument with the second (the match is case-sensitive)
query = text.replace("I", "You")
query

'You can only comment on the vermicelli bowls because You have never had the Pho. The vermicelli are a decent size and a fair price. Conveniently located if you live in the Alberta Ave neighbourhood, my orders have always been ready in 15 minutes - even on a Friday night.'

In [None]:
# Splits a string on the given separator and returns a list (by default it splits on any run of whitespace)
query = text.split()
query

['I',
 'can',
 'only',
 'comment',
 'on',
 'the',
 'vermicelli',
 'bowls',
 'because',
 'I',
 'have',
 'never',
 'had',
 'the',
 'Pho.',
 'The',
 'vermicelli',
 'are',
 'a',
 'decent',
 'size',
 'and',
 'a',
 'fair',
 'price.',
 'Conveniently',
 'located',
 'if',
 'you',
 'live',
 'in',
 'the',
 'Alberta',
 'Ave',
 'neighbourhood,',
 'my',
 'orders',
 'have',
 'always',
 'been',
 'ready',
 'in',
 '15',
 'minutes',
 '-',
 'even',
 'on',
 'a',
 'Friday',
 'night.']
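
`split` also takes an explicit separator and an optional maximum number of splits, and `join` reverses the operation. A short sketch with illustrative data (`row` and `spaced` are made-up strings):

```python
row = "pho,vermicelli,spring rolls"

# Split on an explicit separator
items = row.split(",")
print(items)  # ['pho', 'vermicelli', 'spring rolls']

# Limit the number of splits with the second argument (maxsplit)
first, rest = row.split(",", 1)
print(first)  # pho
print(rest)   # vermicelli,spring rolls

# With no argument, runs of whitespace collapse; with " ", they do not
spaced = "a  b"
print(spaced.split())     # ['a', 'b']
print(spaced.split(" "))  # ['a', '', 'b']

# join() glues a list of strings back together
joined = " | ".join(items)
print(joined)  # pho | vermicelli | spring rolls
```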

In [None]:
# Removes leading and trailing whitespace
word = "  word  "
query = word.strip()
query

'word'
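
`strip` has one-sided variants and can remove characters other than whitespace. A minimal sketch (the `noisy` string is an illustrative example):

```python
word = "  word  "

left_clean = word.lstrip()   # removes leading whitespace only
right_clean = word.rstrip()  # removes trailing whitespace only
print(repr(left_clean))   # 'word  '
print(repr(right_clean))  # '  word'

# Passing a string strips any of those characters from both ends
noisy = "...word!?"
cleaned = noisy.strip(".!?")
print(cleaned)  # word
```

The argument is treated as a set of characters, not as a prefix or suffix, so their order does not matter.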

In [None]:
# Capitalizes the first letter of every word in a string
query = text.title()
query

'I Can Only Comment On The Vermicelli Bowls Because I Have Never Had The Pho. The Vermicelli Are A Decent Size And A Fair Price. Conveniently Located If You Live In The Alberta Ave Neighbourhood, My Orders Have Always Been Ready In 15 Minutes - Even On A Friday Night.'


### **1.2. Regular Expressions (RegEx)**
A regular expression (also known as regex) is a sequence of characters that forms a search pattern. The functions in Python's `re` module use such a pattern to search a string, check whether it matches, split it, or replace parts of it.

The `re` module is part of Python's standard library, so there is nothing to install. (The `regex` package on PyPI is a separate third-party module and is not needed for this tutorial.)

Import it into your program using:

```
import re
```
reference: https://www.w3schools.com/python/python_regex.asp

In [None]:
import re

In [None]:
text2 = "The rain in Spain"

In [None]:
# The findall function returns a list containing all non-overlapping matches
# The following command returns one list entry for each occurrence of "ai"
query = re.findall("ai", text2)
query

['ai', 'ai']

In [None]:
# The search function looks for the first match of the pattern and returns a Match object (or None if there is no match)
# The following command checks whether the text starts with "The" and ends with "Spain"
query = re.search("^The.*Spain$", text2)
query

<re.Match object; span=(0, 17), match='The rain in Spain'>
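
The returned Match object carries the matched text and its position, and `search` returns `None` when nothing matches. A brief sketch of reading those values:

```python
import re

text2 = "The rain in Spain"

match = re.search(r"ai", text2)
if match:
    matched_text = match.group()  # the text that matched
    position = match.span()       # (start, end) indexes of the match
    print(matched_text)  # ai
    print(position)      # (5, 7)

# No match: search returns None, so test the result before using it
no_match = re.search(r"Portugal", text2)
print(no_match)  # None
```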

In [None]:
# The split function is similar to the string split method, but it splits wherever the search pattern matches
# The following command returns a list of strings split at every character from a to f
query = re.split("[a-f]", text2)
query

['Th', ' r', 'in in Sp', 'in']

In [None]:
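# The sub function is similar to the string replace method, but it replaces whatever the search pattern matches
# The following command replaces every whitespace character with |
# (the raw string r"\s" stops Python from treating \s as an invalid string escape)
query = re.sub(r"\s", "|", text2)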
query

'The|rain|in|Spain'
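
`re.sub` can also limit the number of replacements and reuse parts of the match in the replacement. A small sketch; the angle-bracket markup is only an illustration:

```python
import re

text2 = "The rain in Spain"

# count=1 replaces only the first whitespace character
first_only = re.sub(r"\s", "|", text2, count=1)
print(first_only)  # The|rain in Spain

# \1 in the replacement refers to the first capture group of the pattern
wrapped = re.sub(r"(\w+)ain", r"<\1ain>", text2)
print(wrapped)  # The <rain> in <Spain>
```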

For more metacharacters and sequences visit this link: https://www.w3schools.com/python/python_regex.asp

## **2. Tokenization**
Tokenizers divide strings into lists of substrings. They can be used to break a large text body down to sentences, groups of words, single words or even just groups of characters or subwords. Tokenization is a common step used to help prepare language data for further use. There are several ways and methods available to tokenize data.

For more information, read this [link](https://medium.com/@ajay_khanna/tokenization-techniques-in-natural-language-processing-67bb22088c75).

reference: https://medium.com/@kelsklane/tokenization-with-nltk-52cd7b88c7d

In [None]:
# Downloads the Punkt sentence tokenization models
# (recent NLTK releases may ask for the 'punkt_tab' resource instead)
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
# Imports the tokenizer functions needed
from nltk.tokenize import sent_tokenize, word_tokenize, wordpunct_tokenize, WhitespaceTokenizer, regexp_tokenize

In [None]:
# Tokenizes a string by sentences
sent_tokenize(text)

['I can only comment on the vermicelli bowls because I have never had the Pho.',
 'The vermicelli are a decent size and a fair price.',
 'Conveniently located if you live in the Alberta Ave neighbourhood, my orders have always been ready in 15 minutes - even on a Friday night.']
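
To see why a trained sentence tokenizer is useful, compare it with a naive rule. The sketch below is not how `sent_tokenize` works; it simply splits after every `.`, `!`, or `?` followed by whitespace, using only the standard library and a made-up `sample` string:

```python
import re

sample = "Dr. Smith ordered pho. It was ready in 15 minutes."

# Naive rule: a sentence ends at ., ! or ? followed by whitespace
naive_sentences = re.split(r"(?<=[.!?])\s+", sample)
print(naive_sentences)
# ['Dr.', 'Smith ordered pho.', 'It was ready in 15 minutes.']
```

The abbreviation "Dr." is wrongly treated as the end of a sentence. Punkt is designed to learn which periods belong to abbreviations, which is the kind of case where it can do better than a fixed rule.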

In [None]:
# Tokenizes a string by words
word_tokenize(text)

['I',
 'can',
 'only',
 'comment',
 'on',
 'the',
 'vermicelli',
 'bowls',
 'because',
 'I',
 'have',
 'never',
 'had',
 'the',
 'Pho',
 '.',
 'The',
 'vermicelli',
 'are',
 'a',
 'decent',
 'size',
 'and',
 'a',
 'fair',
 'price',
 '.',
 'Conveniently',
 'located',
 'if',
 'you',
 'live',
 'in',
 'the',
 'Alberta',
 'Ave',
 'neighbourhood',
 ',',
 'my',
 'orders',
 'have',
 'always',
 'been',
 'ready',
 'in',
 '15',
 'minutes',
 '-',
 'even',
 'on',
 'a',
 'Friday',
 'night',
 '.']

In [None]:
# Tokenizes a string into runs of word characters and runs of punctuation
wordpunct_tokenize(text)

['I',
 'can',
 'only',
 'comment',
 'on',
 'the',
 'vermicelli',
 'bowls',
 'because',
 'I',
 'have',
 'never',
 'had',
 'the',
 'Pho',
 '.',
 'The',
 'vermicelli',
 'are',
 'a',
 'decent',
 'size',
 'and',
 'a',
 'fair',
 'price',
 '.',
 'Conveniently',
 'located',
 'if',
 'you',
 'live',
 'in',
 'the',
 'Alberta',
 'Ave',
 'neighbourhood',
 ',',
 'my',
 'orders',
 'have',
 'always',
 'been',
 'ready',
 'in',
 '15',
 'minutes',
 '-',
 'even',
 'on',
 'a',
 'Friday',
 'night',
 '.']
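
On the review text above, `wordpunct_tokenize` and `word_tokenize` happen to give the same tokens. The difference shows up with contractions and numbers. The NLTK documentation describes `WordPunctTokenizer` as using the pattern `\w+|[^\w\s]+`; the sketch below applies that pattern with plain `re.findall` to an illustrative sentence, as an approximation rather than a call into NLTK itself:

```python
import re

sample = "Don't panic, it's only $3.50!"

# Runs of word characters OR runs of non-word, non-space characters
tokens = re.findall(r"\w+|[^\w\s]+", sample)
print(tokens)
# ['Don', "'", 't', 'panic', ',', 'it', "'", 's', 'only', '$', '3', '.', '50', '!']
```

Every apostrophe and decimal point becomes its own token here, whereas `word_tokenize` applies additional rules and typically handles such cases differently.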

In [None]:
# Tokenizes a string by whitespace ( space, tab, newline )
WhitespaceTokenizer().tokenize(text)

['I',
 'can',
 'only',
 'comment',
 'on',
 'the',
 'vermicelli',
 'bowls',
 'because',
 'I',
 'have',
 'never',
 'had',
 'the',
 'Pho.',
 'The',
 'vermicelli',
 'are',
 'a',
 'decent',
 'size',
 'and',
 'a',
 'fair',
 'price.',
 'Conveniently',
 'located',
 'if',
 'you',
 'live',
 'in',
 'the',
 'Alberta',
 'Ave',
 'neighbourhood,',
 'my',
 'orders',
 'have',
 'always',
 'been',
 'ready',
 'in',
 '15',
 'minutes',
 '-',
 'even',
 'on',
 'a',
 'Friday',
 'night.']

In [None]:
# Tokenizes a string using a regex pattern (here it keeps capitalized words of two or more characters, so the single-letter "I" is skipped)
regexp_tokenize(text, r'[A-Z]\w+')

['Pho', 'The', 'Conveniently', 'Alberta', 'Ave', 'Friday']
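
For a simple pattern without capture groups like this one, the result matches what `re.findall` would return, which makes it easy to experiment with patterns before handing them to NLTK. A minimal sketch on a shorter illustrative string, showing how changing `\w+` to `\w*` also keeps the single-letter word "I":

```python
import re

sample = "I have never had the Pho. The vermicelli are a decent size."

# An uppercase letter followed by ONE or more word characters
two_or_more = re.findall(r"[A-Z]\w+", sample)
print(two_or_more)  # ['Pho', 'The']

# An uppercase letter followed by ZERO or more word characters
one_or_more = re.findall(r"[A-Z]\w*", sample)
print(one_or_more)  # ['I', 'Pho', 'The']
```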

reference: https://www.nltk.org/api/nltk.tokenize.html