# Text Analytics

    1. Extract a sample document and apply the following preprocessing methods:
    tokenization, POS tagging, stop word removal, stemming and lemmatization.
    2. Create a representation of the document by calculating term frequency and inverse document frequency.

In [1]:
# Using Natural Language Toolkit (nltk)
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer 
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
# Extracting text from sample.txt
with open("/home/student/Documents/31170/A7/sample.txt", "rt") as file:
    text = file.read()
print(text)

 It was 7 minutes after midnight. The dog was lying on the grass in the middle of the lawn in front of Mrs Shears’ house. Its eyes were closed. It looked as if it was running on its side, the way dogs run when they think they are chasing a cat in a dream. But the dog was not running or asleep. The dog was dead. There was a garden fork sticking out of the dog. The points of the fork must have gone all the way through the dog and into the ground because the fork had not fallen over. I decided that the dog was probably killed with the fork because I could not see any other wounds in the dog and I do not think you would stick a garden fork into a dog after it had died for some other reason, like cancer for example, or a road accident. But I could not be certain about this.



## Tokenization

In [3]:
# Tokenization by split()
tokens = text.split()
print(tokens)

['It', 'was', '7', 'minutes', 'after', 'midnight.', 'The', 'dog', 'was', 'lying', 'on', 'the', 'grass', 'in', 'the', 'middle', 'of', 'the', 'lawn', 'in', 'front', 'of', 'Mrs', 'Shears’', 'house.', 'Its', 'eyes', 'were', 'closed.', 'It', 'looked', 'as', 'if', 'it', 'was', 'running', 'on', 'its', 'side,', 'the', 'way', 'dogs', 'run', 'when', 'they', 'think', 'they', 'are', 'chasing', 'a', 'cat', 'in', 'a', 'dream.', 'But', 'the', 'dog', 'was', 'not', 'running', 'or', 'asleep.', 'The', 'dog', 'was', 'dead.', 'There', 'was', 'a', 'garden', 'fork', 'sticking', 'out', 'of', 'the', 'dog.', 'The', 'points', 'of', 'the', 'fork', 'must', 'have', 'gone', 'all', 'the', 'way', 'through', 'the', 'dog', 'and', 'into', 'the', 'ground', 'because', 'the', 'fork', 'had', 'not', 'fallen', 'over.', 'I', 'decided', 'that', 'the', 'dog', 'was', 'probably', 'killed', 'with', 'the', 'fork', 'because', 'I', 'could', 'not', 'see', 'any', 'other', 'wounds', 'in', 'the', 'dog', 'and', 'I', 'do', 'not', 'think', 'y

In [4]:
# Tokenization by using nltk
token = word_tokenize(text) 
print(token)

['It', 'was', '7', 'minutes', 'after', 'midnight', '.', 'The', 'dog', 'was', 'lying', 'on', 'the', 'grass', 'in', 'the', 'middle', 'of', 'the', 'lawn', 'in', 'front', 'of', 'Mrs', 'Shears', '’', 'house', '.', 'Its', 'eyes', 'were', 'closed', '.', 'It', 'looked', 'as', 'if', 'it', 'was', 'running', 'on', 'its', 'side', ',', 'the', 'way', 'dogs', 'run', 'when', 'they', 'think', 'they', 'are', 'chasing', 'a', 'cat', 'in', 'a', 'dream', '.', 'But', 'the', 'dog', 'was', 'not', 'running', 'or', 'asleep', '.', 'The', 'dog', 'was', 'dead', '.', 'There', 'was', 'a', 'garden', 'fork', 'sticking', 'out', 'of', 'the', 'dog', '.', 'The', 'points', 'of', 'the', 'fork', 'must', 'have', 'gone', 'all', 'the', 'way', 'through', 'the', 'dog', 'and', 'into', 'the', 'ground', 'because', 'the', 'fork', 'had', 'not', 'fallen', 'over', '.', 'I', 'decided', 'that', 'the', 'dog', 'was', 'probably', 'killed', 'with', 'the', 'fork', 'because', 'I', 'could', 'not', 'see', 'any', 'other', 'wounds', 'in', 'the', 'do

## POS Tagging

Part-of-speech (POS) tagging is a common natural language processing task that assigns each word in a text a part of speech (noun, verb, adjective, and so on), based on the definition of the word and its context. nltk's `pos_tag` uses the Penn Treebank tag set; a few of its tags are listed below.

<table>
    <tr>
        <th>Abbreviation</th>
        <th>Meaning</th>
    </tr>
    <tr>
        <td>CC</td>
        <td>Coordinating Conjunction</td>
    </tr>
    <tr>
        <td>CD</td>
        <td>Cardinal Number</td>
    </tr>
    <tr>
        <td>VB</td>
        <td>Verb, Base Form (ask)</td>
    </tr>
    <tr>
        <td>VBG</td>
        <td>Verb, Gerund or Present Participle (judging)</td>
    </tr>
    <tr>
        <td>NNS</td>
        <td>Noun, Plural (desks)</td>
    </tr>
</table>

In [6]:
# Parts of Speech (POS) Tagging using nltk
tagged = pos_tag(token)      
print(tagged)

[('It', 'PRP'), ('was', 'VBD'), ('7', 'CD'), ('minutes', 'NNS'), ('after', 'IN'), ('midnight', 'NN'), ('.', '.'), ('The', 'DT'), ('dog', 'NN'), ('was', 'VBD'), ('lying', 'VBG'), ('on', 'IN'), ('the', 'DT'), ('grass', 'NN'), ('in', 'IN'), ('the', 'DT'), ('middle', 'NN'), ('of', 'IN'), ('the', 'DT'), ('lawn', 'NN'), ('in', 'IN'), ('front', 'NN'), ('of', 'IN'), ('Mrs', 'NNP'), ('Shears', 'NNP'), ('’', 'NNP'), ('house', 'NN'), ('.', '.'), ('Its', 'PRP$'), ('eyes', 'NNS'), ('were', 'VBD'), ('closed', 'VBN'), ('.', '.'), ('It', 'PRP'), ('looked', 'VBD'), ('as', 'IN'), ('if', 'IN'), ('it', 'PRP'), ('was', 'VBD'), ('running', 'VBG'), ('on', 'IN'), ('its', 'PRP$'), ('side', 'NN'), (',', ','), ('the', 'DT'), ('way', 'NN'), ('dogs', 'NNS'), ('run', 'VBP'), ('when', 'WRB'), ('they', 'PRP'), ('think', 'VBP'), ('they', 'PRP'), ('are', 'VBP'), ('chasing', 'VBG'), ('a', 'DT'), ('cat', 'NN'), ('in', 'IN'), ('a', 'DT'), ('dream', 'NN'), ('.', '.'), ('But', 'CC'), ('the', 'DT'), ('dog', 'NN'), ('was', 'V
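
Once tokens are tagged, the tags can be used to summarize or filter the text. The following is a minimal sketch in plain Python, assuming a small `tagged_sample` list copied from the first pairs of the output above; the variable names are illustrative and not part of nltk.

```python
from collections import Counter

# First tagged tokens, copied from the pos_tag output above
tagged_sample = [
    ("It", "PRP"), ("was", "VBD"), ("7", "CD"), ("minutes", "NNS"),
    ("after", "IN"), ("midnight", "NN"), (".", "."), ("The", "DT"),
    ("dog", "NN"), ("was", "VBD"), ("lying", "VBG"), ("on", "IN"),
    ("the", "DT"), ("grass", "NN"),
]

# Count how often each tag occurs in the sample
tag_counts = Counter(tag for _, tag in tagged_sample)

# Keep only the singular nouns (tag NN)
nouns = [word for word, tag in tagged_sample if tag == "NN"]

print(tag_counts)
print(nouns)
```

The same pattern works on the full `tagged` list produced by the cell above.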

## Stop Words

Stop words are words that are generally filtered out before natural language text is processed. They are the most common words in a language (articles, prepositions, pronouns, conjunctions, etc.) and do not add much information to the text. Examples of English stop words are “the”, “a”, “an”, “so” and “what”.

In [7]:
# Creating a variable stop_words to store all of the stop words in the English language using nltk
stop_words = stopwords.words('english')

In [8]:
print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [9]:
# Keeping only the tokens that are not stop words; each token is lowercased for the
# comparison because the nltk stop word list contains only lowercase words
cleaned_token = [word for word in token if word.lower() not in stop_words]

In [10]:
print('Unclean version:', token)
print('\nCleaned version:', cleaned_token)

Unclean version: ['It', 'was', '7', 'minutes', 'after', 'midnight', '.', 'The', 'dog', 'was', 'lying', 'on', 'the', 'grass', 'in', 'the', 'middle', 'of', 'the', 'lawn', 'in', 'front', 'of', 'Mrs', 'Shears', '’', 'house', '.', 'Its', 'eyes', 'were', 'closed', '.', 'It', 'looked', 'as', 'if', 'it', 'was', 'running', 'on', 'its', 'side', ',', 'the', 'way', 'dogs', 'run', 'when', 'they', 'think', 'they', 'are', 'chasing', 'a', 'cat', 'in', 'a', 'dream', '.', 'But', 'the', 'dog', 'was', 'not', 'running', 'or', 'asleep', '.', 'The', 'dog', 'was', 'dead', '.', 'There', 'was', 'a', 'garden', 'fork', 'sticking', 'out', 'of', 'the', 'dog', '.', 'The', 'points', 'of', 'the', 'fork', 'must', 'have', 'gone', 'all', 'the', 'way', 'through', 'the', 'dog', 'and', 'into', 'the', 'ground', 'because', 'the', 'fork', 'had', 'not', 'fallen', 'over', '.', 'I', 'decided', 'that', 'the', 'dog', 'was', 'probably', 'killed', 'with', 'the', 'fork', 'because', 'I', 'could', 'not', 'see', 'any', 'other', 'wounds',

## Stemming

Stemming removes suffixes from words to obtain a so-called word stem. The words “likes”, “likely” and “liked”, for example, all reduce to the common stem “like”, which can then stand in for all three. That way, an NLP model can learn that the three words are related and are used in similar contexts.

### Porter's Stemmer

Porter’s stemming algorithm, proposed in 1980, is one of the most popular stemming methods. It is based on the idea that suffixes in the English language are made up of combinations of smaller and simpler suffixes. It is known for being simple and efficient, but it also has disadvantages. Because it relies on many hard-coded rules derived from English, it can only be used for English words. Also, its output is not always an English word and may be only an artificial stem, such as “minut” for “minutes” in the output below.
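
To see both effects on single words, here is a small sketch using nltk's `PorterStemmer` on the three example words from the stemming introduction and on two words from the sample text. The dictionary names are illustrative only.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Inflected forms of "like" from the stemming introduction
like_forms = ["likes", "likely", "liked"]
like_stems = {word: stemmer.stem(word) for word in like_forms}

# Words from the sample text whose stem is not an English word
artificial_stems = {word: stemmer.stem(word) for word in ["minutes", "because"]}

print(like_stems)
print(artificial_stems)
```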

In [11]:
# Using the Porter stemmer in nltk for stemming
stemmer = PorterStemmer()
stemmed = [stemmer.stem(word) for word in token]
print(" ".join(stemmed))

it wa 7 minut after midnight . the dog wa lie on the grass in the middl of the lawn in front of mr shear ’ hous . it eye were close . it look as if it wa run on it side , the way dog run when they think they are chase a cat in a dream . but the dog wa not run or asleep . the dog wa dead . there wa a garden fork stick out of the dog . the point of the fork must have gone all the way through the dog and into the ground becaus the fork had not fallen over . i decid that the dog wa probabl kill with the fork becaus i could not see ani other wound in the dog and i do not think you would stick a garden fork into a dog after it had die for some other reason , like cancer for exampl , or a road accid . but i could not be certain about thi .


## Lemmatization

Lemmatization goes a step beyond stemming: it groups the different inflected forms of a word so they can be analyzed as a single item, the lemma. Unlike a stem, a lemma is a real dictionary word. Lemmatization algorithms usually also take the part of speech of the word (adjective, noun, verb, and so on) as input. nltk's `WordNetLemmatizer` treats every word as a noun unless a part of speech is passed, which is why “was” becomes “wa” in the output below.

In [12]:
# Performing lemmatization using nltk
lemmatizer = WordNetLemmatizer()
lemmatized_output = [lemmatizer.lemmatize(word) for word in token]
print(" ".join(lemmatized_output))

It wa 7 minute after midnight . The dog wa lying on the grass in the middle of the lawn in front of Mrs Shears ’ house . Its eye were closed . It looked a if it wa running on it side , the way dog run when they think they are chasing a cat in a dream . But the dog wa not running or asleep . The dog wa dead . There wa a garden fork sticking out of the dog . The point of the fork must have gone all the way through the dog and into the ground because the fork had not fallen over . I decided that the dog wa probably killed with the fork because I could not see any other wound in the dog and I do not think you would stick a garden fork into a dog after it had died for some other reason , like cancer for example , or a road accident . But I could not be certain about this .
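
One common way to give the lemmatizer the part of speech is to translate the Penn Treebank tags from `pos_tag` into the single-letter codes WordNet uses (`n`, `v`, `a`, `r`). The helper below is a hypothetical sketch, not part of nltk, and the sample pairs are taken from the POS tagging output earlier in this notebook.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (hypothetical helper)."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # everything else is treated as a noun

# A few (word, tag) pairs from the pos_tag output
tagged_sample = [("The", "DT"), ("dog", "NN"), ("was", "VBD"), ("lying", "VBG")]

wordnet_pos = [(word, penn_to_wordnet(tag)) for word, tag in tagged_sample]
print(wordnet_pos)
# → [('The', 'n'), ('dog', 'n'), ('was', 'v'), ('lying', 'v')]
```

Each pair can then be passed on as `lemmatizer.lemmatize(word, pos=wn_pos)`, so that verbs such as “was” are looked up as verbs rather than nouns.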


## Term Frequency (TF) and Inverse Document Frequency (IDF)

<b>Term frequency</b> measures how often a term occurs in a document. The weight of a term in a document is simply proportional to its term frequency. In the formula below, the raw count is divided by the length of the document.

Formula: tf(t,d) = count of t in d / number of words in d

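<b>Inverse document frequency</b> measures how informative a term t is. It is the inverse of the document frequency df, the number of documents that contain t. For a large corpus, say N = 100,000,000 documents, the ratio N/df becomes very large for rare terms, so we take its logarithm to dampen the effect. Adding 1 to df avoids a division by zero for terms that occur in no document.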

Formula: idf(t) = log(N/(df + 1))

<b>TF-IDF</b> combines the two into a measure of how important a word is to a document in a collection or corpus.

Formula: tf-idf(t, d) = tf(t, d) * log(N/(df + 1))
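
The three formulas above can be sketched directly in plain Python. The toy corpus `docs` and the function names are made up for illustration; the natural logarithm is assumed, since the formulas do not name a base.

```python
import math

# Toy corpus of three tokenized documents (made up for illustration)
docs = [
    ["the", "dog", "was", "dead"],
    ["the", "fork"],
    ["the", "dog"],
]

def tf(term, doc):
    """Count of the term in the document divided by the document length."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """log(N / (df + 1)), where df is the number of documents containing the term."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / (df + 1))

def tf_idf(term, doc, docs):
    """Product of term frequency and inverse document frequency."""
    return tf(term, doc) * idf(term, docs)

for term in ["the", "dog", "dead"]:
    print(term, tf_idf(term, docs[0], docs))
```

With this variant of idf, a term that occurs in every document (“the”) gets a negative weight, a term in two of the three documents (“dog”) gets zero, and only the rarer term (“dead”) gets a positive score.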

In [13]:
# Wrapping the string in a list, since TfidfVectorizer expects an iterable of documents
text = [text]

In [None]:
# Using sklearn's TfidfVectorizer class to compute term frequency and inverse document frequency
vectorize = TfidfVectorizer()
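
Note that `TfidfVectorizer` with default settings does not use exactly the formulas above: it uses raw counts as tf, a smoothed idf of ln((1 + N) / (1 + df)) + 1, and then scales each document vector to unit L2 length. With a single document, every idf equals 1, so the scores are just normalized counts. The sketch below reproduces that computation in plain Python under those default settings; `single_doc_tfidf` is a hypothetical helper and the input sentence is made up.

```python
import math
import re
from collections import Counter

def single_doc_tfidf(text):
    """TF-IDF scores for a one-document corpus, assuming TfidfVectorizer defaults."""
    # Default tokenization: lowercase, words of two or more characters
    tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())
    counts = Counter(tokens)
    # Smoothed idf: ln((1 + N) / (1 + df)) + 1, which is 1 when N = df = 1
    idf = math.log((1 + 1) / (1 + 1)) + 1
    weights = {term: count * idf for term, count in counts.items()}
    # Scale the vector to unit L2 length
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

scores = single_doc_tfidf("The dog was dead. The fork was in the dog.")
print(scores)
```

This is why the scores printed for the sample text are all multiples of one base value: a term that occurs twice scores exactly twice as much as a term that occurs once.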

In [14]:
# Fitting the model and passing our text
response = vectorize.fit_transform(text)
print(response)

  (0, 77)	0.03984095364447979
  (0, 0)	0.03984095364447979
  (0, 14)	0.03984095364447979
  (0, 9)	0.03984095364447979
  (0, 1)	0.03984095364447979
  (0, 63)	0.03984095364447979
  (0, 25)	0.03984095364447979
  (0, 12)	0.03984095364447979
  (0, 45)	0.03984095364447979
  (0, 62)	0.03984095364447979
  (0, 69)	0.03984095364447979
  (0, 28)	0.07968190728895957
  (0, 20)	0.03984095364447979
  (0, 70)	0.03984095364447979
  (0, 84)	0.03984095364447979
  (0, 86)	0.03984095364447979
  (0, 21)	0.03984095364447979
  (0, 85)	0.03984095364447979
  (0, 57)	0.07968190728895957
  (0, 5)	0.03984095364447979
  (0, 66)	0.03984095364447979
  (0, 17)	0.07968190728895957
  (0, 83)	0.03984095364447979
  (0, 43)	0.03984095364447979
  (0, 61)	0.03984095364447979
  :	:
  (0, 38)	0.03984095364447979
  (0, 7)	0.03984095364447979
  (0, 46)	0.03984095364447979
  (0, 16)	0.03984095364447979
  (0, 81)	0.03984095364447979
  (0, 26)	0.03984095364447979
  (0, 42)	0.07968190728895957
  (0, 37)	0.03984095364447979
  (0, 67)