# Text Analytics

    1. Extract a sample document and apply the following preprocessing methods:
    tokenization, POS tagging, stop word removal, stemming, and lemmatization.
    2. Create a representation of the document by calculating term frequency (TF) and inverse document frequency (IDF).

In [1]:
# Using Natural Language Toolkit (nltk)
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer 

In [2]:
# Read the sample text from sample.txt; the with-block closes the file automatically
with open("/home/student/Documents/31170/A7/sample.txt", "rt") as file:
    text = file.read()
print(text)

Two roads diverged in a yellow wood,
And sorry I could not travel both
And be one traveler, long I stood
And looked down one as far as I could
To where it bent in the undergrowth;

Then took the other, as just as fair,
And having perhaps the better claim,
Because it was grassy and wanted wear;
Though as for that the passing there
Had worn them really about the same,

And both that morning equally lay
In leaves no step had trodden black.
Oh, I kept the first for another day!
Yet knowing how way leads on to way,
I doubted if I should ever come back.

I shall be telling this with a sigh
Somewhere ages and ages hence:
Two roads diverged in a wood, and I—
I took the one less traveled by,
And that has made all the difference.




## Tokenization

In [3]:
# Tokenization by split()
tokens = text.split()
print(tokens)

['Two', 'roads', 'diverged', 'in', 'a', 'yellow', 'wood,', 'And', 'sorry', 'I', 'could', 'not', 'travel', 'both', 'And', 'be', 'one', 'traveler,', 'long', 'I', 'stood', 'And', 'looked', 'down', 'one', 'as', 'far', 'as', 'I', 'could', 'To', 'where', 'it', 'bent', 'in', 'the', 'undergrowth;', 'Then', 'took', 'the', 'other,', 'as', 'just', 'as', 'fair,', 'And', 'having', 'perhaps', 'the', 'better', 'claim,', 'Because', 'it', 'was', 'grassy', 'and', 'wanted', 'wear;', 'Though', 'as', 'for', 'that', 'the', 'passing', 'there', 'Had', 'worn', 'them', 'really', 'about', 'the', 'same,', 'And', 'both', 'that', 'morning', 'equally', 'lay', 'In', 'leaves', 'no', 'step', 'had', 'trodden', 'black.', 'Oh,', 'I', 'kept', 'the', 'first', 'for', 'another', 'day!', 'Yet', 'knowing', 'how', 'way', 'leads', 'on', 'to', 'way,', 'I', 'doubted', 'if', 'I', 'should', 'ever', 'come', 'back.', 'I', 'shall', 'be', 'telling', 'this', 'with', 'a', 'sigh', 'Somewhere', 'ages', 'and', 'ages', 'hence:', 'Two', 'roads'

In [4]:
# Tokenization by using nltk
token = word_tokenize(text) 
print(token)

['Two', 'roads', 'diverged', 'in', 'a', 'yellow', 'wood', ',', 'And', 'sorry', 'I', 'could', 'not', 'travel', 'both', 'And', 'be', 'one', 'traveler', ',', 'long', 'I', 'stood', 'And', 'looked', 'down', 'one', 'as', 'far', 'as', 'I', 'could', 'To', 'where', 'it', 'bent', 'in', 'the', 'undergrowth', ';', 'Then', 'took', 'the', 'other', ',', 'as', 'just', 'as', 'fair', ',', 'And', 'having', 'perhaps', 'the', 'better', 'claim', ',', 'Because', 'it', 'was', 'grassy', 'and', 'wanted', 'wear', ';', 'Though', 'as', 'for', 'that', 'the', 'passing', 'there', 'Had', 'worn', 'them', 'really', 'about', 'the', 'same', ',', 'And', 'both', 'that', 'morning', 'equally', 'lay', 'In', 'leaves', 'no', 'step', 'had', 'trodden', 'black', '.', 'Oh', ',', 'I', 'kept', 'the', 'first', 'for', 'another', 'day', '!', 'Yet', 'knowing', 'how', 'way', 'leads', 'on', 'to', 'way', ',', 'I', 'doubted', 'if', 'I', 'should', 'ever', 'come', 'back', '.', 'I', 'shall', 'be', 'telling', 'this', 'with', 'a', 'sigh', 'Somewhe
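
The two outputs differ in how punctuation is handled: `split()` leaves it attached to the neighbouring word (`'wood,'`), while `word_tokenize` emits it as a separate token (`'wood'`, `','`). As a rough illustration of that difference, the sketch below uses a regular expression from the standard library. `simple_tokenize` is a hypothetical helper, not NLTK's algorithm, and its pattern is an assumption that only approximates what `word_tokenize` does.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a word; [^\w\s] matches one character that is neither word nor space
    return re.findall(r"\w+|[^\w\s]", text)

line = "Two roads diverged in a yellow wood,"
whitespace_tokens = line.split()
regex_tokens = simple_tokenize(line)

print(whitespace_tokens)  # punctuation stays attached to the last word
print(regex_tokens)       # punctuation becomes its own token
```

For real text, `word_tokenize` remains the better choice, since it also handles contractions and other cases this pattern ignores.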

## POS Tagging

Part-of-speech (POS) tagging is a common natural language processing step that assigns each word in a text to a part of speech, based on both the word's definition and its context. NLTK's `pos_tag` labels words with Penn Treebank tags; a few of them are listed below.

<table>
    <tr>
        <th>Abbreviation</th>
        <th>Meaning</th>
    </tr>
    <tr>
        <td>CC</td>
        <td>Coordinating Conjunction</td>
    </tr>
    <tr>
        <td>CD</td>
        <td>Cardinal number</td>
    </tr>
    <tr>
        <td>VB</td>
        <td>Verb, base form (ask)</td>
    </tr>
    <tr>
        <td>VBG</td>
        <td>Verb, gerund or present participle (judging)</td>
    </tr>
    <tr>
        <td>NNS</td>
        <td>Noun, plural (desks)</td>
    </tr>
</table>

In [5]:
# Parts of Speech (POS) Tagging using nltk
tagged = pos_tag(token)      
print(tagged)

[('Two', 'CD'), ('roads', 'NNS'), ('diverged', 'VBN'), ('in', 'IN'), ('a', 'DT'), ('yellow', 'JJ'), ('wood', 'NN'), (',', ','), ('And', 'CC'), ('sorry', 'NN'), ('I', 'PRP'), ('could', 'MD'), ('not', 'RB'), ('travel', 'VB'), ('both', 'DT'), ('And', 'CC'), ('be', 'VB'), ('one', 'CD'), ('traveler', 'NN'), (',', ','), ('long', 'RB'), ('I', 'PRP'), ('stood', 'VBD'), ('And', 'CC'), ('looked', 'VBD'), ('down', 'RB'), ('one', 'CD'), ('as', 'RB'), ('far', 'RB'), ('as', 'IN'), ('I', 'PRP'), ('could', 'MD'), ('To', 'TO'), ('where', 'WRB'), ('it', 'PRP'), ('bent', 'VBD'), ('in', 'IN'), ('the', 'DT'), ('undergrowth', 'NN'), (';', ':'), ('Then', 'RB'), ('took', 'VBD'), ('the', 'DT'), ('other', 'JJ'), (',', ','), ('as', 'RB'), ('just', 'RB'), ('as', 'IN'), ('fair', 'JJ'), (',', ','), ('And', 'CC'), ('having', 'VBG'), ('perhaps', 'RB'), ('the', 'DT'), ('better', 'JJR'), ('claim', 'NN'), (',', ','), ('Because', 'IN'), ('it', 'PRP'), ('was', 'VBD'), ('grassy', 'JJ'), ('and', 'CC'), ('wanted', 'VBD'), ('
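
Each element of the tagger output is a `(word, tag)` tuple, so the result can be summarised with ordinary Python. The sketch below counts the tags and collects the verbs. `sample_tagged` is copied by hand from the first few pairs of the output above so that the block runs without NLTK, and the variable names are illustrative only.

```python
from collections import Counter

# First few (word, tag) pairs copied from the pos_tag output above
sample_tagged = [
    ('Two', 'CD'), ('roads', 'NNS'), ('diverged', 'VBN'), ('in', 'IN'),
    ('a', 'DT'), ('yellow', 'JJ'), ('wood', 'NN'), (',', ','),
    ('And', 'CC'), ('sorry', 'NN'), ('I', 'PRP'), ('could', 'MD'),
    ('not', 'RB'), ('travel', 'VB'),
]

# Count how often each tag occurs
tag_counts = Counter(tag for _, tag in sample_tagged)

# Collect every word whose tag starts with 'VB' (all verb forms)
verbs = [word for word, tag in sample_tagged if tag.startswith('VB')]

print(tag_counts.most_common(3))
print(verbs)
```

To summarise the whole poem, `tagged` can be used in place of `sample_tagged`.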

## Stop Words

Stop words are words that are generally filtered out before processing natural language text. They are the most common words in a language (articles, prepositions, pronouns, conjunctions, etc.) and do not add much information to the text. Examples of English stop words are “the”, “a”, “an”, “so”, and “what”.

In [6]:
# Store NLTK's list of English stop words in stop_words
stop_words = stopwords.words('english')

In [7]:
print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [8]:
# Keep only the tokens that are not stop words.
# The stop-word list is lowercase, so each token is lowercased before the check.
cleaned_token = []
for word in token:
    if word.lower() not in stop_words:
        cleaned_token.append(word)

In [9]:
print('Unclean version:', token)
print('\nCleaned version:', cleaned_token)

Unclean version: ['Two', 'roads', 'diverged', 'in', 'a', 'yellow', 'wood', ',', 'And', 'sorry', 'I', 'could', 'not', 'travel', 'both', 'And', 'be', 'one', 'traveler', ',', 'long', 'I', 'stood', 'And', 'looked', 'down', 'one', 'as', 'far', 'as', 'I', 'could', 'To', 'where', 'it', 'bent', 'in', 'the', 'undergrowth', ';', 'Then', 'took', 'the', 'other', ',', 'as', 'just', 'as', 'fair', ',', 'And', 'having', 'perhaps', 'the', 'better', 'claim', ',', 'Because', 'it', 'was', 'grassy', 'and', 'wanted', 'wear', ';', 'Though', 'as', 'for', 'that', 'the', 'passing', 'there', 'Had', 'worn', 'them', 'really', 'about', 'the', 'same', ',', 'And', 'both', 'that', 'morning', 'equally', 'lay', 'In', 'leaves', 'no', 'step', 'had', 'trodden', 'black', '.', 'Oh', ',', 'I', 'kept', 'the', 'first', 'for', 'another', 'day', '!', 'Yet', 'knowing', 'how', 'way', 'leads', 'on', 'to', 'way', ',', 'I', 'doubted', 'if', 'I', 'should', 'ever', 'come', 'back', '.', 'I', 'shall', 'be', 'telling', 'this', 'with', 'a',
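
NLTK's English stop-word list is entirely lowercase, so whether capitalised tokens such as `'And'` or `'I'` are filtered depends on whether they are lowercased before the comparison. The sketch below illustrates this on the first tokens of the poem. `sample_stop_words` is a small subset copied by hand from the list printed above (an assumption made so the block runs without NLTK), stored as a set for fast membership tests.

```python
# Small hand-copied subset of the NLTK English stop-word list (all lowercase)
sample_stop_words = {'i', 'in', 'a', 'and', 'not', 'both'}

sample_tokens = ['Two', 'roads', 'diverged', 'in', 'a', 'yellow', 'wood', ',',
                 'And', 'sorry', 'I', 'could', 'not', 'travel', 'both']

# Case-sensitive comparison: 'And' and 'I' are kept
case_sensitive = [w for w in sample_tokens if w not in sample_stop_words]

# Lowercasing first: 'And' and 'I' are removed as well
case_insensitive = [w for w in sample_tokens if w.lower() not in sample_stop_words]

print(case_sensitive)
print(case_insensitive)
```

Punctuation tokens such as `','` are not stop words and survive either way; removing them would need a separate filter.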

## Stemming

Stemming strips suffixes from words, leaving a so-called word stem. For example, “likes”, “likely”, and “liked” all reduce to the common stem “like”, which can then stand in for all three words. That way, an NLP model can learn that the three words are related and are used in similar contexts.

### Porter's Stemmer

Porter’s stemmer, proposed in 1980, is one of the most popular stemming algorithms. It is based on the idea that suffixes in English are built from combinations of smaller, simpler suffixes. It is known for being simple and efficient, but it has several disadvantages. Because it relies on many hard-coded rules derived from English, it can only be used for English words. Its output is also not always an English word; it may be an artificial stem, such as “diverg” or “sorri” in the output below.

In [10]:
# Stemming with NLTK's Porter stemmer
stemmer = PorterStemmer()
stemmed = [stemmer.stem(word) for word in token]
print(" ".join(stemmed))

two road diverg in a yellow wood , and sorri i could not travel both and be one travel , long i stood and look down one as far as i could to where it bent in the undergrowth ; then took the other , as just as fair , and have perhap the better claim , becaus it wa grassi and want wear ; though as for that the pass there had worn them realli about the same , and both that morn equal lay in leav no step had trodden black . oh , i kept the first for anoth day ! yet know how way lead on to way , i doubt if i should ever come back . i shall be tell thi with a sigh somewher age and age henc : two road diverg in a wood , and i— i took the one less travel by , and that ha made all the differ .
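
Porter's rules are applied in a fixed sequence of steps. As a sketch of what one such step looks like, the function below implements only step 1a of the original 1980 algorithm, which handles plural-style suffixes. `porter_step_1a` is a hypothetical name; NLTK's `PorterStemmer` implements the full algorithm, including the later steps and some extensions, so its final stems can differ from the result of this single step.

```python
def porter_step_1a(word):
    """Apply step 1a of Porter's 1980 algorithm (plural-style suffixes)."""
    # Rules are tried longest suffix first; only the first match is applied
    if word.endswith('sses'):
        return word[:-2]   # SSES -> SS
    if word.endswith('ies'):
        return word[:-2]   # IES  -> I
    if word.endswith('ss'):
        return word        # SS   -> SS (unchanged)
    if word.endswith('s'):
        return word[:-1]   # S    -> (removed)
    return word

for word in ['roads', 'ages', 'this', 'was', 'has', 'caresses']:
    print(word, '->', porter_step_1a(word))
```

The unconditional removal of a final “s” is consistent with artificial stems such as “thi”, “wa”, and “ha” in the output above.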


## Lemmatization

Lemmatization is a refinement of stemming: it groups the different inflected forms of a word so they can be analyzed as a single item, the lemma. It is similar to stemming but takes the word's context into account, so related forms are mapped to one dictionary word rather than to an artificial stem. Lemmatization algorithms usually also take the word's part of speech as an input, such as whether the word is an adjective, noun, or verb.

In [11]:
# Performing lemmatization using nltk
lemmatizer = WordNetLemmatizer()
lemmatized_output = [lemmatizer.lemmatize(word) for word in token]
print(" ".join(lemmatized_output))

Two road diverged in a yellow wood , And sorry I could not travel both And be one traveler , long I stood And looked down one a far a I could To where it bent in the undergrowth ; Then took the other , a just a fair , And having perhaps the better claim , Because it wa grassy and wanted wear ; Though a for that the passing there Had worn them really about the same , And both that morning equally lay In leaf no step had trodden black . Oh , I kept the first for another day ! Yet knowing how way lead on to way , I doubted if I should ever come back . I shall be telling this with a sigh Somewhere age and age hence : Two road diverged in a wood , and I— I took the one le traveled by , And that ha made all the difference .
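
In the output above, words such as “as”, “was”, and “less” come out as “a”, “wa”, and “le”. A likely cause is that `lemmatize` was called without a part of speech, in which case NLTK's `WordNetLemmatizer` treats every word as a noun (`pos='n'` is its default). One common remedy is to pass the tag from `pos_tag`, converted from Penn Treebank tags to the single-letter codes WordNet uses. The helper below, `penn_to_wordnet`, is a hypothetical name, and falling back to noun for unrecognised tags is an assumption; it is a sketch that runs without NLTK.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech letter."""
    if tag.startswith('J'):
        return 'a'  # adjective
    if tag.startswith('V'):
        return 'v'  # verb
    if tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # noun, also used as the fallback

# A few (word, tag) pairs copied from the pos_tag output earlier in the document
sample_tagged = [('roads', 'NNS'), ('diverged', 'VBN'), ('was', 'VBD'),
                 ('yellow', 'JJ'), ('as', 'RB')]

for word, tag in sample_tagged:
    print(word, tag, penn_to_wordnet(tag))
```

With NLTK available, the converted code would be passed as the second argument, for example `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))` for each pair in `tagged`.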
