# Practical 4 Morphology
## Text Preprocessing

In NLP, text preprocessing is the first step in building a model. Common preprocessing steps include tokenization, lowercasing, stop word removal, and others.

### Convert text to lowercase

In [11]:
txt = "The top 5 countries in the world containing COVID-19 are United States of America, Brazil, Russian Federation, India and United Kingdom" #WHO
result = txt.lower()

print(result)

the top 5 countries in the world containing covid-19 are united states of america, brazil, russian federation, india and united kingdom


### Remove numbers
Remove numbers if they are not relevant to your analyses. Usually, regular expressions are used to remove numbers.

In [12]:
import re
txt = "Box A contains 3 red and 5 white balls, while Box B contains 4 red and 2 blue balls."
result = re.sub('[0-9]','', txt)

print(result)

Box A contains  red and  white balls, while Box B contains  red and  blue balls.
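Deleting the digits one by one leaves doubled spaces behind, as the output above shows. A minimal sketch of one way to tidy this up, assuming you also want to collapse the leftover whitespace (the helper name `remove_numbers` is hypothetical, not part of the practical):

```python
import re

def remove_numbers(text: str) -> str:
    """Delete runs of digits, then collapse the whitespace they leave behind."""
    # \d+ matches one or more digits (for str patterns this includes non-ASCII digits)
    no_digits = re.sub(r"\d+", "", text)
    # \s+ matches any run of whitespace; replace each run with a single space
    return re.sub(r"\s+", " ", no_digits).strip()

txt = "Box A contains 3 red and 5 white balls, while Box B contains 4 red and 2 blue balls."
result = remove_numbers(txt)
print(result)
# Box A contains red and white balls, while Box B contains red and blue balls.
```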


### Punctuation Removal

The following code removes this set of symbols: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

In [13]:
import string

txt = "This &is [an] example? {of} string. with.? punctuation!!!!" # Sample string
result = txt.translate(str.maketrans("", "", string.punctuation)) #string.punctuation = !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
print(result)

This is an example of string with punctuation
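Removing every punctuation character can be too aggressive, for example when a hyphen is part of a term such as "COVID-19". The sketch below, under the assumption that some characters should survive, wraps the same `str.translate` technique in a hypothetical helper `remove_punctuation` with a `keep` parameter. Note that `string.punctuation` only covers ASCII symbols, so typographic characters such as curly quotes are not removed by either version.

```python
import string

def remove_punctuation(text: str, keep: str = "") -> str:
    """Remove ASCII punctuation from text, except the characters listed in keep."""
    to_delete = "".join(ch for ch in string.punctuation if ch not in keep)
    # The third argument of str.maketrans lists the characters to delete
    return text.translate(str.maketrans("", "", to_delete))

txt = "COVID-19 cases: rising (again)!"
print(remove_punctuation(txt))            # COVID19 cases rising again
print(remove_punctuation(txt, keep="-"))  # COVID-19 cases rising again
```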


### Tokenization with Punctuation Removal (NLTK)
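The Natural Language Toolkit, more commonly called NLTK, is a suite of Python libraries and programs for symbolic and statistical natural language processing, aimed mainly at English. The tokenizer below keeps only runs of word characters (`\w+`), so punctuation is dropped as a side effect of tokenizing.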

In [14]:
import nltk # install using "pip install nltk" in cmd 

txt = "This &is [an] example? {of} string. with.? punctuation!!!!" # Sample string
remover = nltk.RegexpTokenizer(r"\w+")
clean = remover.tokenize(txt)

print(clean)

['This', 'is', 'an', 'example', 'of', 'string', 'with', 'punctuation']
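For this particular pattern, the standard library's `re.findall` should give the same tokens, which can be handy when NLTK is not installed. This is a sketch of an alternative, not the method used in the practical. It also shows that `\w` matches digits and underscores as well as letters.

```python
import re

txt = "This &is [an] example? {of} string. with.? punctuation!!!!"  # Sample string

# Collect every run of word characters (letters, digits, underscore)
tokens = re.findall(r"\w+", txt)
print(tokens)
# ['This', 'is', 'an', 'example', 'of', 'string', 'with', 'punctuation']

# Digits and underscores count as word characters, so they are kept
print(re.findall(r"\w+", "snake_case, 42!"))
# ['snake_case', '42']
```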


### White space removal

In [16]:
txt = "\t a string example\t"
print(txt)

clean = txt.strip()
print(clean)

	 a string example	
a string example
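`strip()` only removes whitespace at the start and end of the string. A minimal sketch, assuming you also want to normalise whitespace inside the text, using `split()` and `join()`:

```python
txt = "\t a   string \n example\t"

# strip() trims leading and trailing whitespace only
stripped = txt.strip()

# split() with no argument splits on any run of whitespace,
# so joining with a single space collapses internal runs as well
collapsed = " ".join(txt.split())

print(repr(stripped))   # 'a   string \n example'
print(repr(collapsed))  # 'a string example'
```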


### Stop word removal using NLTK

In [18]:
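# pip install nltk
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize  # needs NLTK's 'punkt' tokenizer data

# Build the English stop word set once; stopwords.words() without a language returns every language's list
english_stopwords = set(stopwords.words('english'))

text = "Nick likes to play football, however he is not into tennis."
text_tokens = word_tokenize(text)

clean = [word for word in text_tokens if word not in english_stopwords]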

print(clean)

['Nick', 'likes', 'play', 'football', ',', 'however', 'tennis', '.']


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
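The result above still contains the punctuation tokens `','` and `'.'`, and the comparison is case-sensitive, so a capitalised stop word such as "The" would be kept. The sketch below shows one way to handle both, using only the standard library. The stop list `STOP_WORDS`, the helper `filter_tokens`, and the pre-tokenized input are all hypothetical and chosen for illustration; in practice you would pass in the tokens and stop word list from NLTK or spaCy.

```python
import string

# Hypothetical, deliberately tiny stop list for illustration only
STOP_WORDS = {"to", "he", "is", "not", "into", "the"}

def filter_tokens(tokens):
    """Drop stop words (case-insensitively) and punctuation-only tokens."""
    kept = []
    for tok in tokens:
        if tok.lower() in STOP_WORDS:
            continue  # stop word, regardless of capitalisation
        if all(ch in string.punctuation for ch in tok):
            continue  # token consists only of punctuation
        kept.append(tok)
    return kept

# Assumed to be the output of a tokenizer such as word_tokenize
tokens = ["The", "coach", "likes", "to", "play", "football", ",",
          "however", "he", "is", "not", "into", "tennis", "."]
print(filter_tokens(tokens))
# ['coach', 'likes', 'play', 'football', 'however', 'tennis']
```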


### Stop word removal using spaCy

In [19]:
# pip install -U spacy                        <--- (cmd) package
# python -m spacy download en_core_web_sm     <--- language model

from nltk.tokenize import word_tokenize

import spacy
sp = spacy.load('en_core_web_sm')

all_stopwords = sp.Defaults.stop_words

text = "Nick likes to play football, however he is not too fond of tennis."
text_tokens = word_tokenize(text)
tokens_without_sw= [word for word in text_tokens if not word in all_stopwords]

print(tokens_without_sw)

['Nick', 'likes', 'play', 'football', ',', 'fond', 'tennis', '.']


### Tokenization in Chinese Language

In [20]:
import jieba  # install using "pip install jieba"; the name jieba (结巴) means "to stutter" (口吃)

seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("Full Mode: " + "/ ".join(seg_list))

Full Mode: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学


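### Stemming
Stemming reduces a word to its stem by stripping suffixes with heuristic rules, so the result (e.g. "troubl") is not necessarily a dictionary word.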

In [22]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
stemmer= PorterStemmer()
#input_str="connect connected connection connections"
input_str="trouble troubled troubles troublesome"

input_str=word_tokenize(input_str)
for word in input_str:
    print(stemmer.stem(word))

troubl
troubl
troubl
troublesom


### Lemmatization
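Lemmatization tools include NLTK (WordNet Lemmatizer), spaCy, TextBlob, Pattern, gensim, Stanford CoreNLP, Memory-Based Shallow Parser (MBSP), Apache OpenNLP, Apache Lucene, General Architecture for Text Engineering (GATE), Illinois Lemmatizer, and DKPro Core. The example below uses NLTK's WordNet lemmatizer, which treats every word as a noun unless a part-of-speech tag is passed (for example `pos="v"`), so the verb forms "been", "had", and "done" come back unchanged.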

In [25]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
lemmatizer=WordNetLemmatizer()

input_str="been had done languages cities mice"
input_str=word_tokenize(input_str)
for word in input_str:
    print(lemmatizer.lemmatize(word))

been
had
done
language
city
mouse
