
## Tokenization

In [42]:
# Download the NLTK data packages (tokenizer models, corpora, tagger) used in this notebook
import nltk
nltk.download('punkt')
nltk.download('webtext')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('omw-1.4')
nltk.download('tagsets')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package webtext to /root/nltk_data...
[nltk_data]   Package webtext is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package tagsets to /root/nltk_data...
[nltk_data]   Unzipping help/tagsets.zip.


True

### Sentence Tokenization

In [3]:
para = "Hello everyone. It's good to see you. Let's start our text mining class!"

from nltk.tokenize import sent_tokenize

# Split the text into sentences, mainly at sentence-final punctuation such as . ! ?
print(sent_tokenize(para))

['Hello everyone.', "It's good to see you.", "Let's start our text mining class!"]
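
To see what `sent_tokenize` adds beyond splitting on punctuation, here is a minimal sketch of a naive splitter built only on the standard `re` module. The helper name `naive_sent_split` is hypothetical and is not part of NLTK. It reproduces the result above on this simple paragraph, but it also breaks after abbreviations, a case that a trained model such as Punkt is designed to handle.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when whitespace follows (illustrative only)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

para = "Hello everyone. It's good to see you. Let's start our text mining class!"
simple_result = naive_sent_split(para)
print(simple_result)

# The naive rule cannot tell an abbreviation from the end of a sentence.
abbrev_result = naive_sent_split("Dr. Kim teaches the class. It starts today.")
print(abbrev_result)
```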


In [5]:
paragraph_french = """Je t'ai demandé si tu m'aimais bien, Tu m'a répondu non.
Je t'ai demandé si j'étais jolie, Tu m'a répondu non.
Je t'ai demandé si j'étais dans ton coeur, Tu m'a répondu non."""

import nltk.data
# Load the pretrained Punkt sentence tokenizer for French.
tokenizer = nltk.data.load('tokenizers/punkt/french.pickle')
print(tokenizer.tokenize(paragraph_french))

["Je t'ai demandé si tu m'aimais bien, Tu m'a répondu non.", "Je t'ai demandé si j'étais jolie, Tu m'a répondu non.", "Je t'ai demandé si j'étais dans ton coeur, Tu m'a répondu non."]


In [6]:
para_kor = "안녕하세요, 여러분. 만나서 반갑습니다. 이제 텍스트마이닝 클래스를 시작해봅시다!"

# The sentence tokenizer also handles this Korean text, which uses the same sentence-final punctuation.
print(sent_tokenize(para_kor))

['안녕하세요, 여러분.', '만나서 반갑습니다.', '이제 텍스트마이닝 클래스를 시작해봅시다!']


### Word Tokenization

In [7]:
from nltk.tokenize import word_tokenize

# Split the text into word tokens; punctuation marks become separate tokens.
print(word_tokenize(para))

['Hello', 'everyone', '.', 'It', "'s", 'good', 'to', 'see', 'you', '.', 'Let', "'s", 'start', 'our', 'text', 'mining', 'class', '!']


In [8]:
from nltk.tokenize import WordPunctTokenizer
print(WordPunctTokenizer().tokenize(para))

['Hello', 'everyone', '.', 'It', "'", 's', 'good', 'to', 'see', 'you', '.', 'Let', "'", 's', 'start', 'our', 'text', 'mining', 'class', '!']
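
The difference from `word_tokenize` comes from the rule `WordPunctTokenizer` applies: NLTK documents it as a regular-expression tokenizer with the pattern `\w+|[^\w\s]+`, that is, runs of word characters or runs of punctuation. As a rough sketch, the same split can be reproduced with the standard `re` module alone, which shows why the apostrophe becomes its own token:

```python
import re

para = "Hello everyone. It's good to see you. Let's start our text mining class!"

# \w+      : a run of letters, digits, or underscores
# [^\w\s]+ : a run of characters that are neither word characters nor whitespace
word_punct_tokens = re.findall(r"\w+|[^\w\s]+", para)
print(word_punct_tokens)
```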


In [9]:
print(word_tokenize(para_kor))

['안녕하세요', ',', '여러분', '.', '만나서', '반갑습니다', '.', '이제', '텍스트마이닝', '클래스를', '시작해봅시다', '!']


### Tokenization using Regular Expressions

In [10]:
import re
# [abc] matches any single character that is 'a', 'b', or 'c'.
re.findall("[abc]", "How are you, boy?")

['a', 'b']

In [11]:
# [0123456789] matches any single digit (the same as [0-9]).
re.findall("[0123456789]", "3a7b5c9d")

['3', '7', '5', '9']

In [12]:
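# \w matches a single word character: a letter, digit, or underscore.
# A raw string (r"...") keeps Python from treating \w as a string escape.
re.findall(r"[\w]", "3a 7b_ '.^&5c9d")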

['3', 'a', '7', 'b', '_', '5', 'c', '9', 'd']

In [13]:
# + matches one or more repetitions of the preceding pattern.
re.findall("[_]+", "a_b, c__d, e___f")

['_', '__', '___']

In [14]:
# [\w]+ matches runs of word characters, i.e. whole words.
re.findall(r"[\w]+", "How are you, boy?")

['How', 'are', 'you', 'boy']

In [16]:
# {2,4} matches two to four repetitions, taking as many as possible each time.
re.findall("[o]{2,4}", "oh, hoow are yoooou, boooooooy?")

['oo', 'oooo', 'oooo', 'ooo']

In [17]:
from nltk.tokenize import RegexpTokenizer

# Tokenizer driven by a regular expression.
# \w matches a letter, digit, or underscore, so [\w']+ matches runs of word characters and apostrophes.
tokenizer = RegexpTokenizer(r"[\w']+")

# Because the apostrophe is part of the pattern, "can't" stays a single token.
print(tokenizer.tokenize("Sorry, I can't go there."))

['Sorry', 'I', "can't", 'go', 'there']


In [18]:
# Without the apostrophe in the pattern, "can't" splits into 'can' and 't'.
tokenizer = RegexpTokenizer(r"[\w]+")
print(tokenizer.tokenize("Sorry, I can't go there."))

['Sorry', 'I', 'can', 't', 'go', 'there']


In [21]:
text1 = "Sorry, I can't go there."
# {3,} keeps only tokens that are at least three characters long.
tokenizer = RegexpTokenizer(r"[\w']{3,}")
print(tokenizer.tokenize(text1.lower()))

['sorry', "can't", 'there']


### Remove Noise and Stopwords

In [23]:
from nltk.corpus import stopwords                 # Common words that are usually excluded from analysis.
english_stops = set(stopwords.words('english'))   # Convert to a set to drop duplicates and speed up membership tests.

text1 = "Sorry, I couldn't go to movie yesterday."

tokenizer = RegexpTokenizer(r"[\w']+")
tokens = tokenizer.tokenize(text1.lower())        # Tokenize the lowercased text with RegexpTokenizer.

# Build a list that keeps only the tokens that are not stopwords.
result = [word for word in tokens if word not in english_stops]

print(result)

['sorry', 'go', 'movie', 'yesterday']


In [24]:
# Inspect the English stopwords provided by NLTK.
print(english_stops)

{'mustn', 'ma', 'but', 'will', 'needn', 'down', 'themselves', 'who', 'such', 'being', 'doesn', 'himself', 'so', 'weren', 'we', 'don', 'should', "hadn't", 'were', 'you', 'him', 'did', "didn't", "you'd", 'or', 'are', 'myself', 'can', 'won', 'them', 'both', 'it', 'as', 'how', "isn't", "doesn't", 'shan', 'am', 'after', 'wouldn', "mustn't", 'not', 'didn', 'over', 'whom', 'to', 'while', "hasn't", 'more', 'other', 'couldn', "should've", 'shouldn', 'each', 'below', "you're", 'no', 'ours', 'your', "won't", 'up', 'between', 'her', 'its', "that'll", 'those', 'what', 'be', 'on', 'because', 'a', 'has', 'once', "weren't", 'he', 'during', "aren't", 'with', 'few', 'most', 'under', 'this', 'their', 'having', "it's", 'here', "don't", 'further', 'very', 'which', 'above', 'too', 'isn', 'at', 'his', 'hasn', 're', 'out', 've', 'mightn', 'why', 'these', 'from', 'theirs', 'she', 'do', 'll', 'own', 'the', 'through', 'about', 'd', 'had', 'was', 'only', 'they', 'our', 'yourselves', "wouldn't", "shouldn't", 'itse

In [26]:
# Define and use your own stopwords.
# This approach is also useful when processing Korean text.
# Collect your own stopwords in a list.
my_stopword = ['i', 'go', 'to']

result = [word for word in tokens if word not in my_stopword]
print(result)

['sorry', "couldn't", 'movie', 'yesterday']
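
The tokenization and filtering steps above can be wrapped into one reusable function. The following is only a sketch using the standard library: the name `remove_stopwords` and its parameters are hypothetical, and `re.findall` stands in for `RegexpTokenizer` with the same `[\w']+` pattern.

```python
import re

def remove_stopwords(text, stopwords, pattern=r"[\w']+"):
    """Lowercase the text, tokenize it with a regex, and drop stopwords."""
    stopword_set = set(stopwords)            # a set gives fast membership tests
    tokens = re.findall(pattern, text.lower())
    return [token for token in tokens if token not in stopword_set]

my_stopword = ['i', 'go', 'to']
filtered = remove_stopwords("Sorry, I couldn't go to movie yesterday.", my_stopword)
print(filtered)
```

Passing the stopword list as an argument makes it easy to swap between NLTK's English list and a custom list for another language.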


## Normalization

### Stemming

In [27]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
print(stemmer.stem('cooking'), stemmer.stem('cookery'), stemmer.stem('cookbooks'))

cook cookeri cookbook


In [28]:
from nltk.tokenize import word_tokenize

para = "Hello everyone. It's good to see you. Let's start our text mining class!"
tokens = word_tokenize(para)                        # Tokenize the paragraph.
print(tokens)
result = [stemmer.stem(token) for token in tokens]  # Stem every token.
print(result)

['Hello', 'everyone', '.', 'It', "'s", 'good', 'to', 'see', 'you', '.', 'Let', "'s", 'start', 'our', 'text', 'mining', 'class', '!']
['hello', 'everyon', '.', 'it', "'s", 'good', 'to', 'see', 'you', '.', 'let', "'s", 'start', 'our', 'text', 'mine', 'class', '!']


In [29]:
from nltk.stem import LancasterStemmer
stemmer = LancasterStemmer()
print(stemmer.stem('cooking'), stemmer.stem('cookery'), stemmer.stem('cookbooks'))

cook cookery cookbook


### Lemmatization

In [32]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('cooking'))
print(lemmatizer.lemmatize('cooking', pos='v'))   # Specify the part of speech (here, verb).
print(lemmatizer.lemmatize('cookery'))
print(lemmatizer.lemmatize('cookbooks'))

cooking
cook
cookery
cookbook


In [33]:
# Compare lemmatization with stemming.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print('stemming result:', stemmer.stem('believes'))
print('lemmatizing result:', lemmatizer.lemmatize('believes'))
print('lemmatizing result:', lemmatizer.lemmatize('believes', pos='v'))

stemming result: believ
lemmatizing result: belief
lemmatizing result: believe
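
As the results show, the lemma depends on the `pos` argument, which takes WordNet's single-letter codes (`'n'`, `'v'`, `'a'`, `'r'`). Taggers such as `nltk.pos_tag`, used in the next section, return Penn Treebank tags instead, so a small mapping is usually needed between the two. The helper below is a hypothetical sketch, not an NLTK function, and falling back to noun is an assumption that mirrors the lemmatizer's default.

```python
def penn_to_wordnet(tag, default='n'):
    """Map a Penn Treebank tag to a WordNet POS code (illustrative mapping)."""
    if tag.startswith('JJ'):   # JJ, JJR, JJS: adjectives
        return 'a'
    if tag.startswith('VB'):   # VB, VBD, VBG, VBN, VBP, VBZ: verbs
        return 'v'
    if tag.startswith('NN'):   # NN, NNS, NNP, NNPS: nouns
        return 'n'
    if tag.startswith('RB'):   # RB, RBR, RBS: adverbs
        return 'r'
    return default             # assumption: treat everything else as a noun

for tag in ['VBZ', 'NNS', 'JJ', 'RB', 'PRP']:
    print(tag, '->', penn_to_wordnet(tag))

# Intended use with the objects defined above (requires NLTK):
# lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
```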


## POS Tagging

### POS Tagging using NLTK

In [34]:
import nltk
from nltk.tokenize import word_tokenize

tokens = word_tokenize("Hello everyone. It's good to see you. Let's start our text mining class!")
print(nltk.pos_tag(tokens))

[('Hello', 'NNP'), ('everyone', 'NN'), ('.', '.'), ('It', 'PRP'), ("'s", 'VBZ'), ('good', 'JJ'), ('to', 'TO'), ('see', 'VB'), ('you', 'PRP'), ('.', '.'), ('Let', 'VB'), ("'s", 'POS'), ('start', 'VB'), ('our', 'PRP$'), ('text', 'NN'), ('mining', 'NN'), ('class', 'NN'), ('!', '.')]


In [43]:
# Look up the description and examples for a Penn Treebank tag.
nltk.help.upenn_tagset('CC')

CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet


#### Extract only the words of the desired POS

In [44]:
# Keep singular nouns (NN), base-form verbs (VB), and adjectives (JJ); tags are matched exactly.
my_tag_set = ['NN', 'VB', 'JJ']
my_words = [word for word, tag in nltk.pos_tag(tokens) if tag in my_tag_set]
print(my_words)

['everyone', 'good', 'see', 'Let', 'start', 'text', 'mining', 'class']
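
Because the filter above matches tags exactly, `'Hello'` (`NNP`) and `"'s"` (`VBZ`) are dropped even though they are a noun and a verb. If whole tag families are wanted, one option is prefix matching. The sketch below hard-codes the tagged tokens printed earlier so that it runs without NLTK; the name `my_tag_prefixes` is hypothetical.

```python
# Tagged tokens copied from the nltk.pos_tag output shown earlier.
tagged = [('Hello', 'NNP'), ('everyone', 'NN'), ('.', '.'), ('It', 'PRP'),
          ("'s", 'VBZ'), ('good', 'JJ'), ('to', 'TO'), ('see', 'VB'),
          ('you', 'PRP'), ('.', '.'), ('Let', 'VB'), ("'s", 'POS'),
          ('start', 'VB'), ('our', 'PRP$'), ('text', 'NN'), ('mining', 'NN'),
          ('class', 'NN'), ('!', '.')]

# str.startswith accepts a tuple, so one call covers every tag family.
my_tag_prefixes = ('NN', 'VB', 'JJ')
family_words = [word for word, tag in tagged if tag.startswith(my_tag_prefixes)]
print(family_words)
```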


#### Distinguish words by appending POS information

In [45]:
words_with_tag = ['/'.join(item) for item in nltk.pos_tag(tokens)]
print(words_with_tag)

['Hello/NNP', 'everyone/NN', './.', 'It/PRP', "'s/VBZ", 'good/JJ', 'to/TO', 'see/VB', 'you/PRP', './.', 'Let/VB', "'s/POS", 'start/VB', 'our/PRP$', 'text/NN', 'mining/NN', 'class/NN', '!/.']
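
When the `word/TAG` strings need to be turned back into pairs, splitting on the last slash is the safer choice, since a token may itself contain `/`. A minimal sketch with plain string methods follows; the sample list is a subset of the output above plus one made-up token, `'1/2/CD'`, added to show the edge case.

```python
words_with_tag = ['Hello/NNP', 'everyone/NN', './.', "'s/POS", 'our/PRP$', '1/2/CD']

# rsplit('/', 1) splits once, starting from the right, so only the final
# slash separates the word from its tag.
pairs = [tuple(item.rsplit('/', 1)) for item in words_with_tag]
print(pairs)
```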
