# Text Preprocessing

Text preprocessing is the process of cleaning unstructured text data so that it is noise-free and ready for further analysis. Typical steps:
 - Load raw data
 - Tokenize text
 - Filter out tokens that are stop words
 - Convert to lowercase
 - Remove punctuation marks, URLs and special characters from each token
 - Stemming
 - Lemmatization
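
The first few steps above can be sketched with the standard library alone. This is a minimal illustration, not the NLTK workflow used in the rest of the session: the stop word set `TOY_STOP_WORDS` and the function `preprocess` are hypothetical names, and tokenization is a naive whitespace split. Lowercasing is done before stop word filtering here so that capitalized stop words such as "The" also match.

```python
import string

# Hypothetical, deliberately tiny stop word list; real lists are much longer.
TOY_STOP_WORDS = {"the", "is", "a", "of", "and", "to", "in"}

def preprocess(text):
    """Tokenize, lowercase, strip punctuation, and drop stop words."""
    tokens = text.split()                                # naive whitespace tokenization
    tokens = [t.lower() for t in tokens]                 # convert to lowercase
    table = str.maketrans("", "", string.punctuation)    # deletes punctuation characters
    tokens = [t.translate(table) for t in tokens]        # remove punctuation from each token
    # drop empty strings left over from punctuation-only tokens, then stop words
    return [t for t in tokens if t and t not in TOY_STOP_WORDS]

print(preprocess("The cat is sleeping in the garden, and the dog barks!"))
```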

## Packages used in this session
- NLTK (Natural Language Toolkit)
- spaCy (library for advanced NLP)

## Install NLTK

In [None]:
pip install nltk  # install NLTK with pip; run in a terminal or a notebook cell

In [None]:
conda install nltk  # alternative when working in a conda environment (e.g. Anaconda with Jupyter Notebook)

Download NLTK data:

In [None]:
import nltk
nltk.download()  # with no arguments, this opens the interactive NLTK downloader

## Install spaCy

In [None]:
conda install spacy

## Stop Word Removal 

NLTK ships stop word lists for a number of languages (16 or more, depending on the version of the NLTK data). Stop words are the very frequent words of a language, such as "the", "is" and "of", that carry little meaning on their own; search engines are typically programmed to ignore them.

In [None]:
import nltk
#stopword removal
nltk.download('stopwords')
from nltk.corpus import stopwords
#specifying the stopword language dataset
stop_words = stopwords.words('english')
print(stop_words) #printing all stopwords in the English stopwords dataset

In [None]:
#stopword removal from a string
import nltk
nltk.download('stopwords')
nltk.download('punkt') #tokenizer models used by word_tokenize (recent NLTK releases may ask for 'punkt_tab' instead)
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize #word_tokenize accepts a string as input, not a file
example_line = 'something...is! wrong() with.,; this :: sentence.'
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example_line)
#keep only the tokens that are not stop words (the list is lowercase, so matching is case-sensitive)
filtered_sentence = [w for w in word_tokens if w not in stop_words]
print(filtered_sentence)

In [None]:
#stopword removal from a text file
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
#read the whole input file into memory and split it on whitespace
with open('preprocessing_data.txt') as input_file:
    words = input_file.read().split()
#open the output file once and append every word that is not a stop word
with open('nostopwords_data.txt','a') as outputFile:
    for r in words:
        if r not in stop_words:
            outputFile.write(" "+r)

In [None]:
#converting to lowercase
#both files are opened once; 'a' appends, so re-running the cell adds the text again
with open('preprocessing_data.txt','r') as fileinput, open('lowercase_data.txt','a') as outputFile:
    for line in fileinput:
        outputFile.write(line.lower()) #writing processed output to a new file

In [None]:
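#removing punctuation marks (only the characters in string.punctuation are deleted; URLs are not detected by this snippet)
import string
tr = str.maketrans("","",string.punctuation) #translation table that deletes every punctuation character
s = "{{Here is some stuff in curly braces..!!}}"
print(s.translate(tr)) #without print, a notebook cell shows only the value of its last expression

with open('preprocessing_data.txt','r') as fileinput:
    for line in fileinput:
        stripped = line.translate(tr)
        print(stripped)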
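
Deleting punctuation characters mangles a URL rather than removing it, so URLs are usually stripped first. The following is a minimal sketch using a regular expression; `URL_PATTERN` and `strip_urls_and_punctuation` are hypothetical names, and the pattern is a simplification that only catches links starting with `http://`, `https://` or `www.`.

```python
import re
import string

# Simplified, assumed pattern: an http(s):// or www. prefix followed by non-whitespace characters.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")

def strip_urls_and_punctuation(text):
    """Remove URLs first, then punctuation, then collapse repeated whitespace."""
    no_urls = URL_PATTERN.sub("", text)
    no_punct = no_urls.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.split())

print(strip_urls_and_punctuation("Read the docs at https://www.nltk.org/ today!!"))
```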

In [None]:
#sentence segmentation
from nltk import tokenize
with open('preprocessing_data.txt','r') as fileinput:
    for line in fileinput:
        tokenized = tokenize.sent_tokenize(line)
        print(tokenized)
        
#text tokenization
from nltk.tokenize import word_tokenize
with open('preprocessing_data.txt','r') as fileinput:
    for line in fileinput:
        tokens = word_tokenize(line)
        print(tokens)

## Stemming
 -  Reducing each word to its root or base.
 -  A rudimentary rule-based process of stripping the suffixes ("ing", "ly", "es", "s" etc.) from a word.
 -  For example, "fishing" and "fished" both reduce to the stem "fish".
 -  A stem need not be a dictionary word: with the Porter stemmer, "studies" --> "studi" and "studying" --> "studi".
 -  The most common algorithm for stemming English is Porter's algorithm.
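
The idea of rule-based suffix stripping can be illustrated with a toy function. This is not Porter's algorithm, which applies several ordered rule phases with conditions on the remaining stem; `toy_stem`, its suffix list (with "ed" added to the suffixes named above) and the `min_stem_len` parameter are assumptions made for this sketch.

```python
# Suffixes are tried in this order; the first one that matches is removed.
SUFFIXES = ("ing", "ly", "es", "ed", "s")

def toy_stem(word, min_stem_len=3):
    """Strip one known suffix if at least min_stem_len characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word  # no rule applied

for w in ["fishing", "fished", "quickly", "studies", "cats", "sing"]:
    print(w, "->", toy_stem(w))
```

The length check keeps short words such as "sing" intact; real stemmers use more careful conditions for the same purpose.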

In [None]:
from nltk.tokenize import word_tokenize
#read the whole file; the with block closes it automatically
with open('preprocessing_data.txt') as input_file:
    line = input_file.read()
tokens = word_tokenize(line)
#import the stemming algorithm: Porter's Stemmer
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
stemmed = [porter.stem(word) for word in tokens]
print(stemmed[:100]) #show the first 100 stems

## Lemmatization

 - Lemmatization uses a vocabulary and morphological analysis of words to remove inflectional endings only and return the base or dictionary form of a word, known as the lemma.
 - "studies", "studying" --> "study"
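
The vocabulary-based part of lemmatization can be pictured as a dictionary lookup. The sketch below is a toy illustration only: `TOY_LEMMAS` is a hypothetical, hand-written mapping, whereas a real lemmatizer such as the WordNet one relies on a full lexicon and on the part of speech of each word.

```python
# Hypothetical miniature vocabulary mapping inflected forms to their lemma.
TOY_LEMMAS = {
    "studies": "study",
    "studying": "study",
    "studied": "study",
    "went": "go",
    "mice": "mouse",
}

def toy_lemmatize(word):
    """Return the lemma from the lookup table, or the lowercased word if it is unknown."""
    key = word.lower()
    return TOY_LEMMAS.get(key, key)

print([toy_lemmatize(w) for w in ["Studies", "studying", "went", "table"]])
```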

In [None]:
import nltk
nltk.download('wordnet') #WordNet data used by the lemmatizer
from nltk.tokenize import word_tokenize
with open('preprocessing_data.txt') as input_file:
    line = input_file.read()
tokens = word_tokenize(line)
#import the WordNet lemmatizer
from nltk.stem.wordnet import WordNetLemmatizer
lem = WordNetLemmatizer()
lemmatized = [lem.lemmatize(word,'v') for word in tokens] #'v' treats every token as a verb
print(lemmatized)

The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.

## Part of speech tagging

Assigning a grammatical category (noun, verb, adjective, etc.) to each token in the text.

In [None]:
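import nltk
nltk.download('averaged_perceptron_tagger') #tagger model used by nltk.pos_tag (the resource name can differ between NLTK releases)
from nltk.tokenize import word_tokenize
with open('preprocessing_data.txt','r') as fileinput:
    for line in fileinput:
        tokens = word_tokenize(line)
        tags = nltk.pos_tag(tokens) #list of (token, tag) tuples
        print(tags)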
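
`nltk.pos_tag` returns a list of `(token, tag)` tuples using Penn Treebank tags by default, which makes simple filtering and counting straightforward. The sketch below works on a hand-written sample in that shape; the sample is illustrative and is not actual tagger output.

```python
from collections import Counter

# Hand-written sample in the (token, tag) shape; not produced by a tagger.
tagged = [
    ("The", "DT"), ("quick", "JJ"), ("fox", "NN"), ("jumps", "VBZ"),
    ("over", "IN"), ("lazy", "JJ"), ("dogs", "NNS"),
]

# Penn Treebank noun tags all start with "NN" (NN, NNS, NNP, NNPS).
nouns = [token for token, tag in tagged if tag.startswith("NN")]
tag_counts = Counter(tag for _, tag in tagged)

print(nouns)
print(tag_counts.most_common(1))
```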

## Named Entity Recognition

Named Entity Recognition (NER) is an information extraction task that identifies named entities in text and classifies them into pre-defined entity types such as PERSON, LOCATION, ORG, etc.

<b>Entity Identification</b> can be performed using dependency parsing and part-of-speech tagging (noun phrases).<br>
<b>Entity Classification</b> categorizes the identified noun phrases into entity types, for example by looking them up in type dictionaries and other sources available on the Web (Wikipedia, DBpedia, Google Maps, etc.).
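
The dictionary-lookup approach to entity classification can be sketched as follows. `GAZETTEER` is a hypothetical, hand-written type dictionary and `classify_candidates` is an illustrative helper; this is not how spaCy classifies entities, since spaCy uses a trained statistical model.

```python
# Hypothetical miniature type dictionary (gazetteer): lowercased name -> entity type.
GAZETTEER = {
    "madrid": "LOCATION",
    "apple": "ORG",
    "mary": "PERSON",
}

def classify_candidates(candidates):
    """Pair each candidate phrase with its type, or UNKNOWN if it is not in the dictionary."""
    return [(c, GAZETTEER.get(c.lower(), "UNKNOWN")) for c in candidates]

print(classify_candidates(["Mary", "Madrid", "lunch"]))
```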


Entity types used by spaCy are listed <a href="https://spacy.io/api/annotation#named-entities">here</a>.

In [None]:
!python -m spacy download en_core_web_sm #The installation of spaCy doesn’t automatically download the English model.
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

#to obtain IOB-style tagging of a sentence with the entity types
doc = nlp("Next week I'll be in Madrid.")
iob_tagged = [
    (
        token.text, 
        token.tag_, 
        "{0}-{1}".format(token.ent_iob_, token.ent_type_) if token.ent_iob_ != 'O' else token.ent_iob_
    ) for token in doc
]
 
print(iob_tagged)
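
In IOB tagging, `B-` marks the first token of an entity, `I-` a token inside the same entity, and `O` a token outside any entity. A minimal sketch of grouping such tags back into entity spans is shown below; `iob_to_spans` is a hypothetical helper, it takes `(token, iob_tag)` pairs rather than the three-element tuples built above, and the sample input is hand-written rather than copied from spaCy output.

```python
def iob_to_spans(tagged_tokens):
    """Group (token, iob_tag) pairs into (entity_text, entity_type) spans."""
    spans, current_tokens, current_type = [], [], None
    for token, tag in tagged_tokens:
        if tag.startswith("B-"):
            # a new entity begins; flush the previous one, if any
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens:
            current_tokens.append(token)  # continue the open entity
        else:
            # "O" (or a stray "I-") closes any open entity
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans

# Hand-written sample for illustration.
sample = [("Next", "B-DATE"), ("week", "I-DATE"), ("I", "O"), ("'ll", "O"),
          ("be", "O"), ("in", "O"), ("Madrid", "B-GPE"), (".", "O")]
print(iob_to_spans(sample))
```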

In [None]:
#Obtaining frequencies of entity types
import spacy
from spacy import displacy
from collections import Counter
nlp = spacy.load("en_core_web_sm")
doc = nlp("Monica, Mary and Oliver had lunch together and bought some Apple products.")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
    
labels = [x.label_ for x in doc.ents]
Counter(labels) #count entity type occurrences

In [None]:
#Visualising the document with identified named entities and entity types.
displacy.render(doc, jupyter=True, style='ent')