**TEXT PREPROCESSING AND NLP**

**Text Preprocessing – Noise Removal**

**•Removal of ‘stop words’ (‘the’, ‘a’, ‘and’)**

**•Removal of punctuation and special symbols (‘?’, ‘@’)**

In [1]:
import nltk

nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
text = 'I really like the dessert here'
text = text.split()
text

['I', 'really', 'like', 'the', 'dessert', 'here']

In [3]:
from nltk.corpus import stopwords
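filtered = []  # collects the words that are not stop words
stop_words = set(stopwords.words('english'))  # NLTK's English list is all lowercase
for word in text:
    # The comparison is case-sensitive, so the capitalized 'I' is kept
    if word not in stop_words:
        filtered.append(word)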

In [4]:
filtered

['I', 'really', 'like', 'dessert']
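
The output above still contains 'I' because the stop word list is lowercase and the membership test is case-sensitive. The following is a minimal sketch of a case-insensitive filter. It assumes a tiny hand-picked set, `demo_stop_words`, as a stand-in for NLTK's list so that it runs without any downloads; the set and the variable names are illustrative only.

```python
# Hypothetical mini stop word list for illustration; NLTK's English list is far longer.
demo_stop_words = {"i", "the", "here", "a", "and"}

tokens = "I really like the dessert here".split()

# Lowercase each token only for the membership test, so the original casing
# of the words that are kept is preserved.
kept = [w for w in tokens if w.lower() not in demo_stop_words]
print(kept)
```

With NLTK installed, the same comparison should work by swapping `demo_stop_words` for `set(stopwords.words('english'))`.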

In [5]:
# Remove punctuation and special symbols
import re
text = 'I really like the dessert here. I am definitely coming back to this restaurant!'
# [^...] matches any character not listed inside the brackets; each match is replaced by a space ' '
# a-z is shorthand for the lowercase letters a to z, and A-Z for the uppercase letters A to Z
text = re.sub(r'[^a-zA-Z]', ' ', text)
print(text)

I really like the dessert here  I am definitely coming back to this restaurant 
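
Replacing every non-letter with a space leaves a double space where the period was and a trailing space where the exclamation mark was. One possible follow-up step, sketched here with illustrative variable names, collapses runs of whitespace and trims the ends:

```python
import re

raw = 'I really like the dessert here. I am definitely coming back to this restaurant!'

# Same substitution as in the cell above: every non-letter becomes a space
letters_only = re.sub(r'[^a-zA-Z]', ' ', raw)

# \s+ matches one or more whitespace characters; strip() trims both ends
cleaned = re.sub(r'\s+', ' ', letters_only).strip()
print(cleaned)
```

This step is optional when the text is passed to `split()` afterwards, since `split()` with no arguments already ignores repeated whitespace.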


In [6]:
import re
result = re.sub(r'[/;,]', ' ', "adfa/sdfe;arwe,awrc")
print(result)

adfa sdfe arwe awrc


**Text Preprocessing – Word Normalization Stemming**

In [7]:
from nltk.stem.porter import PorterStemmer
ps =  PorterStemmer()
words = ['student', 'study', 'studying', 'studies']

for word in words:
    stemming = ps.stem(word)
    print(stemming)

student
studi
studi
studi


In [8]:
from nltk.stem.porter import PorterStemmer
ps =  PorterStemmer()
words = ['love', 'lovely', 'loveable', 'loving']

for word in words:
    stemming = ps.stem(word)
    print(stemming)

love
love
loveabl
love


In [9]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()  # the loop below calls ps.stem, so the stemmer must be bound to ps
words = ['wonderous', 'wonder', 'wondering', 'wonderful']

for word in words:
    stemming = ps.stem(word)
    print(stemming)

wonder
wonder
wonder
wonder


**Text Preprocessing – Word Normalization Lemmatizer**

In [10]:
from nltk.stem import WordNetLemmatizer
wn1 =  WordNetLemmatizer()
words = ['student', 'study', 'studying', 'studies']

for word in words:
    tokens = wn1.lemmatize(word)
    print(tokens)
    
    # 'studies' is reduced to 'study'; the other words are unchanged

student
study
studying
study


In [11]:
from nltk.stem import WordNetLemmatizer
wn1 =  WordNetLemmatizer()
words = ['love', 'lovely', 'loveable', 'loving']

for word in words:
    tokens = wn1.lemmatize(word)
    print(tokens)
    
    # Without a part-of-speech tag, lemmatize() treats every word as a noun,
    # so it changes fewer words than the stemmer does

love
lovely
loveable
loving


In [12]:
from nltk.stem import WordNetLemmatizer
wn1 =  WordNetLemmatizer()
words = ['wonderous', 'wonder', 'wondering', 'wonderful']

for word in words:
    tokens = wn1.lemmatize(word)
    print(tokens)

wonderous
wonder
wondering
wonderful


In [13]:
import nltk
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer =  WordNetLemmatizer()
text = "studies studying cries cry"
tokenization = nltk.word_tokenize(text)

for w in tokenization:
    print("Lemma for {} is {}".format(w, wordnet_lemmatizer.lemmatize(w)))

Lemma for studies is study
Lemma for studying is studying
Lemma for cries is cry
Lemma for cry is cry


**Text Preprocessing – Word Normalization Tokenization**

In [14]:
# STEP 1 - USE TOKENIZATION: PUNCTUATION BECOMES SEPARATE TOKENS
from nltk.tokenize import word_tokenize
text = "God is Great! I won a lottery."  # contains '!' and '.'
print(word_tokenize(text))

['God', 'is', 'Great', '!', 'I', 'won', 'a', 'lottery', '.']


In [15]:
# STEP 2 - split() WITHOUT TOKENIZATION: '!' AND '.' STAY ATTACHED TO THE WORDS
text = 'God is Great! I won a lottery.'
text.split()

['God', 'is', 'Great!', 'I', 'won', 'a', 'lottery.']

In [16]:
# STEP 3 - split() WITHOUT TOKENIZATION, ON TEXT THAT HAS NO '!' OR '.'
text = 'God is Great I won a lottery'
text.split()

['God', 'is', 'Great', 'I', 'won', 'a', 'lottery']
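
The three steps show that `split()` only separates punctuation from words when the punctuation has already been removed. As a rough illustration of what a tokenizer adds, the sketch below uses a single regular expression to split off punctuation. This is a simplification under my own assumptions, not how `word_tokenize` works internally; the real tokenizer also handles contractions, abbreviations, and other cases that this pattern does not.

```python
import re

sentence = "God is Great! I won a lottery."

# \w+ matches a run of word characters; [^\w\s] matches a single character
# that is neither a word character nor whitespace (i.e. punctuation)
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(regex_tokens)
```

For this simple sentence the result has the same nine tokens as the `word_tokenize` output in Step 1.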

In [17]:
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer
text = ["You can catch more flies with honey than you can with vinegar.",
       "You can lead a horse to water, but you can't make him drink."]
cv = CountVectorizer()
X = cv.fit_transform(text)
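# toarray() and get_feature_names_out() replace X.A and get_feature_names(),
# which are no longer available in recent SciPy and scikit-learn releases
print(pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out()).to_string())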

   but  can  catch  drink  flies  him  honey  horse  lead  make  more  than  to  vinegar  water  with  you
0    0    2      1      0      1    0      1      0     0     0     1     1   0        1      0     2    2
1    1    2      0      1      0    1      0      1     1     1     0     0   1        0      1     0    2
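
To make the table above less of a black box, here is a minimal pure-Python sketch of the same bag-of-words count. It assumes the vectorizer's default behaviour: lowercase the text and keep only tokens of two or more word characters, which is why 'a' and the 't' of "can't" do not appear as columns. The helper name `tokenize` and the variables are illustrative and not part of scikit-learn.

```python
import re
from collections import Counter

docs = ["You can catch more flies with honey than you can with vinegar.",
        "You can lead a horse to water, but you can't make him drink."]

def tokenize(doc):
    # Lowercase, then keep runs of at least two word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

# One Counter of token frequencies per document
counts = [Counter(tokenize(d)) for d in docs]

# The vocabulary is every distinct token, sorted alphabetically
vocab = sorted(set().union(*counts))

# One row per document, one column per vocabulary term (0 when the term is absent)
matrix = [[c[term] for term in vocab] for c in counts]

print(vocab)
for row in matrix:
    print(row)
```

Each row lines up with the corresponding row of the DataFrame printed above.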


