## 03_01 Tokenization

Tokenization refers to converting a text string into individual tokens. Tokens may be words or punctuation marks.

In [1]:
import nltk
import os


In [2]:
#Read the base file into a raw text variable (the with block closes the file automatically)
with open(os.path.join(os.getcwd(), "data_science.txt"), 'rt') as base_file:
    raw_text = base_file.read()

In [3]:
#Extract tokens (word_tokenize relies on NLTK's Punkt tokenizer data being available locally)
token_list = nltk.word_tokenize(raw_text)
print("Token List : ",token_list[:20])
print("\n Total Tokens : ",len(token_list))

Token List :  ['Data', 'science', 'is', 'the', 'study', 'of', 'data', 'to', 'extract', 'meaningful', 'insights', 'for', 'business', '.', 'It', 'is', 'a', 'multidisciplinary', 'approach', 'that']

 Total Tokens :  158
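
As a rough illustration of what splitting text into word and punctuation tokens involves, the sketch below uses a single regular expression from the standard library. It is a simplified stand-in, not how `nltk.word_tokenize` works internally; the function name `simple_tokenize` and the sample sentence are assumptions made for this example, and its results may differ from NLTK's on contractions, abbreviations, and numbers.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a word; [^\w\s] matches one character that is neither
    # a word character nor whitespace (a punctuation mark or symbol).
    return re.findall(r"\w+|[^\w\s]", text)

# Hypothetical sample text, modelled on the opening of the output above
sample = "Data science is the study of data. It is a multidisciplinary approach."
tokens = simple_tokenize(sample)
print("Token List : ", tokens)
print("Total Tokens : ", len(tokens))
```

Each punctuation mark becomes its own token, which is why a full stop appears as a separate entry in the token list.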


## 03_02 Cleansing Text

We will see examples of removing punctuation and converting to lower case.

#### Remove Punctuation

In [4]:
#Keep only the tokens that Punkt does not classify as punctuation
token_list2 = list(filter(lambda token: nltk.tokenize.punkt.PunktToken(token).is_non_punct, token_list))
print("Token List after removing punctuation : ",token_list2[:20])
print("\nTotal tokens after removing punctuation : ", len(token_list2))

Token List after removing punctuation :  ['Data', 'science', 'is', 'the', 'study', 'of', 'data', 'to', 'extract', 'meaningful', 'insights', 'for', 'business', 'It', 'is', 'a', 'multidisciplinary', 'approach', 'that', 'combines']

Total tokens after removing punctuation :  136
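
The same idea can be sketched without NLTK by checking each token against `string.punctuation` from the standard library. This is an alternative technique, not what `PunktToken.is_non_punct` does, and the two may disagree on edge cases such as non-ASCII punctuation; the helper name `is_punctuation` and the sample tokens are assumptions for illustration.

```python
import string

def is_punctuation(token):
    """Return True when every character of the token is ASCII punctuation."""
    return bool(token) and all(ch in string.punctuation for ch in token)

# Hypothetical sample tokens, including mixed cases such as "it's" and "3.5"
tokens = ['Data', 'science', ',', 'study', '.', "it's", '...', '3.5']
cleaned = [t for t in tokens if not is_punctuation(t)]
print("Token List after removing punctuation : ", cleaned)
```

Tokens that merely contain a punctuation character, such as `it's` or `3.5`, are kept; only tokens made up entirely of punctuation are dropped.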


#### Convert to Lower Case

In [5]:
token_list3=[word.lower() for word in token_list2 ]
print("Token list after converting to lower case : ", token_list3[:20])
print("\nTotal tokens after converting to lower case : ", len(token_list3))

Token list after converting to lower case :  ['data', 'science', 'is', 'the', 'study', 'of', 'data', 'to', 'extract', 'meaningful', 'insights', 'for', 'business', 'it', 'is', 'a', 'multidisciplinary', 'approach', 'that', 'combines']

Total tokens after converting to lower case :  136
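
Lower-casing leaves the token count unchanged (136 before and after), but it reduces the number of distinct tokens because forms such as `Data` and `data` become identical. A minimal sketch of that effect, using a hypothetical seven-token sample taken from the start of the list above:

```python
from collections import Counter

# Hypothetical sample: the first seven tokens shown in the output above
tokens = ['Data', 'science', 'is', 'the', 'study', 'of', 'data']

raw_counts = Counter(tokens)                       # case-sensitive counts
lower_counts = Counter(t.lower() for t in tokens)  # case-insensitive counts

print("Total tokens : ", len(tokens))
print("Distinct tokens before lower-casing : ", len(raw_counts))
print("Distinct tokens after lower-casing : ", len(lower_counts))
print("Occurrences of 'data' : ", lower_counts['data'])
```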


## 03_03 Stop word Removal

Remove stop words using the standard English stop word list available in NLTK.

In [6]:
#Download the standard stopword list
nltk.download('stopwords')
from nltk.corpus import stopwords

#Remove stopwords (load the stop word list once into a set instead of once per token)
stop_words = set(stopwords.words('english'))
token_list4 = [token for token in token_list3 if token not in stop_words]
print("Token list after removing stop words : ", token_list4[:20])
print("\nTotal tokens after removing stop words : ", len(token_list4))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.

Token list after removing stop words :  ['data', 'science', 'study', 'data', 'extract', 'meaningful', 'insights', 'business', 'multidisciplinary', 'approach', 'combines', 'principles', 'practices', 'fields', 'mathematics', 'statistics', 'artificial', 'intelligence', 'computer', 'engineering']

Total tokens after removing stop words :  79
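
Stop word removal is a membership test against a fixed word list, so holding that list in a `set` keeps each lookup cheap. The sketch below shows the mechanics with a hypothetical miniature stop word list; `STOP_WORDS` and `remove_stop_words` are names invented for this example, and NLTK's English list is considerably longer.

```python
# Hypothetical miniature stop word list, for illustration only
STOP_WORDS = {"is", "the", "of", "to", "for", "it", "a", "that"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that do not appear in the stop word set."""
    return [t for t in tokens if t not in stop_words]

# The first thirteen lower-cased tokens from the output above
tokens = ['data', 'science', 'is', 'the', 'study', 'of', 'data', 'to',
          'extract', 'meaningful', 'insights', 'for', 'business']
filtered = remove_stop_words(tokens)
print("Token list after removing stop words : ", filtered)
```

The comparison is case-sensitive, which is one reason the tokens are converted to lower case before this step.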


## 03_04 Stemming

In [7]:
#Use NLTK's PorterStemmer class for stemming.
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

#Stem data
token_list5 = [stemmer.stem(word) for word in token_list4 ]
print("Token list after stemming : ", token_list5[:20])
print("\nTotal tokens after Stemming : ", len(token_list5))

Token list after stemming :  ['data', 'scienc', 'studi', 'data', 'extract', 'meaning', 'insight', 'busi', 'multidisciplinari', 'approach', 'combin', 'principl', 'practic', 'field', 'mathemat', 'statist', 'artifici', 'intellig', 'comput', 'engin']

Total tokens after Stemming :  79
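
Stems such as `studi` and `scienc` are not dictionary words because a stemmer rewrites suffixes by rule without consulting a vocabulary. To make that concrete, the sketch below implements only step 1a (plural suffixes) of Porter's published algorithm. It is not NLTK's `PorterStemmer`, which applies several further steps and some extensions of its own; the function name `porter_step_1a` is an assumption for this example.

```python
def porter_step_1a(word):
    """Apply only step 1a of Porter's algorithm, which handles plural suffixes."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress (unchanged)
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word

for word in ["insights", "studies", "business", "data"]:
    print("Raw : ", word, " , After step 1a : ", porter_step_1a(word))
```

This single step leaves `business` untouched; in the full stemmer, later steps strip further suffixes, which is how it ends up as `busi` in the output above.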


## 03_05 Lemmatization

In [9]:
#Download the WordNet data, then import the lemmatizer that maps words to their lemmatized form
nltk.download('omw-1.4')
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...


True

In [12]:
lemmatizer = WordNetLemmatizer()
token_list6 = [lemmatizer.lemmatize(word) for word in token_list4 ]
print("Token list after Lemmatization : ", token_list6[:20])
print("\nTotal tokens after Lemmatization : ", len(token_list6))

Token list after Lemmatization :  ['data', 'science', 'study', 'data', 'extract', 'meaningful', 'insight', 'business', 'multidisciplinary', 'approach', 'combine', 'principle', 'practice', 'field', 'mathematics', 'statistic', 'artificial', 'intelligence', 'computer', 'engineering']

Total tokens after Lemmatization :  79
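
Unlike the stemmed tokens, the lemmatized tokens above are all valid words, because a lemmatizer only accepts a candidate form that exists in a dictionary. The sketch below illustrates that principle with a few suffix rewrite rules checked against a hypothetical miniature lexicon. It is a deliberate simplification and not `WordNetLemmatizer`'s actual implementation; `LEXICON`, `NOUN_RULES`, and `toy_lemmatize` are names invented for this example.

```python
# Hypothetical miniature lexicon standing in for a real dictionary such as WordNet
LEXICON = {"insight", "business", "mathematics", "statistic",
           "statistics", "study", "principle"}

# Illustrative subset of noun suffix rules: (suffix to remove, replacement)
NOUN_RULES = [("ies", "y"), ("ses", "s"), ("s", "")]

def toy_lemmatize(word, lexicon=LEXICON, rules=NOUN_RULES):
    """Return the shortest dictionary form reachable from word, else word itself."""
    # The word itself is a candidate if the lexicon already contains it
    candidates = [word] if word in lexicon else []
    for old, new in rules:
        if word.endswith(old):
            candidate = word[:len(word) - len(old)] + new
            # Accept a rewritten form only when it is a known dictionary word
            if candidate in lexicon:
                candidates.append(candidate)
    return min(candidates, key=len) if candidates else word

for word in ["insights", "business", "mathematics", "statistics", "studies"]:
    print("Raw : ", word, " , Lemmatized : ", toy_lemmatize(word))
```

Because `busines` and `mathematic` are not in the lexicon, `business` and `mathematics` are returned unchanged, whereas a rule-only stemmer would shorten them.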


#### Comparison of tokens between raw, stemming and lemmatization

In [18]:
#Compare the raw, stemmed, and lemmatized forms of the first 20 tokens
for i in range(20):
    print( "Raw : ", token_list4[i]," , Stemmed : ", token_list5[i], " , Lemmatized : ", token_list6[i])

Raw :  data  , Stemmed :  data  , Lemmatized :  data
Raw :  science  , Stemmed :  scienc  , Lemmatized :  science
Raw :  study  , Stemmed :  studi  , Lemmatized :  study
Raw :  data  , Stemmed :  data  , Lemmatized :  data
Raw :  extract  , Stemmed :  extract  , Lemmatized :  extract
Raw :  meaningful  , Stemmed :  meaning  , Lemmatized :  meaningful
Raw :  insights  , Stemmed :  insight  , Lemmatized :  insight
Raw :  business  , Stemmed :  busi  , Lemmatized :  business
Raw :  multidisciplinary  , Stemmed :  multidisciplinari  , Lemmatized :  multidisciplinary
Raw :  approach  , Stemmed :  approach  , Lemmatized :  approach
Raw :  combines  , Stemmed :  combin  , Lemmatized :  combine
Raw :  principles  , Stemmed :  principl  , Lemmatized :  principle
Raw :  practices  , Stemmed :  practic  , Lemmatized :  practice
Raw :  fields  , Stemmed :  field  , Lemmatized :  field
Raw :  mathematics  , Stemmed :  mathemat  , Lemmatized :  mathematics
Raw :  statistics  , Stemmed :  statist  , 