<a href="https://colab.research.google.com/github/ProfessorPatrickSlatraigh/CST3512/blob/main/CST3512_NLP_Intro_Spring_2022.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Natural Language Processing

*CST3512 Data and Information Management - II*

The content of this notebook is derived from several sources including:
*  **Basic Concepts of Natural Language Processing (NLP) Models and Python Implementation** by Prasun Biswas in Towards Data Science, 01-Jul-2021   
*  **What Is the Difference Between Stemming and Lemmatization?** from [StackOverflow](https://stackoverflow.com/questions/1787110/what-is-the-difference-between-lemmatization-vs-stemming)    
*  **Beginners Guide to Stemming in Python NLTK** from [MachineLearningKnowledge.ai](https://machinelearningknowledge.ai/beginners-guide-to-stemming-in-python-nltk/)    


Humans understand language with little effort, but machines do not recognize it easily.  Natural Language Processing (NLP) enables computers to interpret and work with the way humans communicate using language, although NLP does not interpret language the same way humans do.

The Natural Language Toolkit (NLTK) is a platform for building Python programs that work with human language data, with a focus on statistical natural language processing. It contains text processing libraries for tokenization, parsing, classification, stemming, tagging and semantic reasoning. This notebook steps through the basics of NLP using NLTK in Python.

### Initial Housekeeping    

First, import the requisite libraries and access the desired data sources.


In [1]:
# Setup wordcloud and nltk
!pip install -q wordcloud
import wordcloud
import nltk

In [None]:
# Download and import book sources from nltk
nltk.download('book') 
from nltk.book import *

In [None]:
# Get stopwords for English language from NLTK
stopwords = nltk.corpus.stopwords.words('english')

In [33]:
# Get Lemmatizer and stemmers (Porter, Snowball, Lancaster, Regexp) from NLTK
# The Snowball Stemmer may be preferred as it is more modern

# Get PorterStemmer from NLTK
porter_stemmer = nltk.stem.PorterStemmer()

# Get SnowballStemmer from NLTK
# SnowballStemmer needs a language parameter set
snowball_stemmer = nltk.stem.SnowballStemmer(language='english')

# Get LancasterStemmer from NLTK
lancaster_stemmer = nltk.stem.LancasterStemmer()

# Get the RegexpStemmer from NLTK
from nltk.stem import RegexpStemmer
# Instances of RegexpStemmer require a regex as a positional argument
# RegexpStemmer calls may also provide a kwarg for min(imum) length
regexp_stemmer = nltk.stem.RegexpStemmer('ing$|s$|e$|able$', min=4)

# Get Lemmatizer from NLTK
lemmatizer = nltk.stem.WordNetLemmatizer()

In [36]:
# Get tagging capability for parts of speech using spacy library
import spacy
# Get Tokenizer class from spacy module
from spacy.tokenizer import Tokenizer
# Instantiate an nlp object with "en_core_web_sm" argument
nlp = spacy.load("en_core_web_sm")

In [4]:
# Import regular expressions for searches and filtering
import re

Natural language is free-form, unstructured text. Cleaning and preparing the text so that features can be extracted from it is an essential part of any NLP approach to building a model. This notebook covers the basic but important steps and shows how to implement them in Python using NLTK and other packages, with the goal of developing an NLP-based classification model.

The following phases of the approach will be described:

A. Data Cleaning    
B. Tokenization    
C. Vectorization/Word Embedding    
D. Model Development

## A. Data Cleaning    

Data cleaning is a basic but very important step in NLP. This section includes a few steps in an approach for data cleaning.  Depending on the source and nature of the text data being analyzed, new steps may be added and the existing steps refined.  As with any data science analysis, a continuous improvement approach is recommended, where learnings are applied to every step in the approach and not just to the final analysis and presentation.


For this exercise a simple line of text is created by assigning a string to the variable `line`.

In [6]:
line = 'Reaching out for HELP. Please meet me in LONDON at 6 a.m xyz@abc.com #urgent!'

### 1. Remove stopwords

Some words are used very often but carry little meaning on their own, so they add little value to the analysis. Depending on the source of data and the analysis to be performed, there may also be other words that the natural language processing approach does not need. It is usually best to remove these words from the data.  The set of words removed because they add little value is referred to as **stopwords** in NLP.

The NLTK package has a defined set of stopwords for different languages like English. This example focuses on English-language stopwords but, of course, every language and use case may have its own set of stopwords.


In [None]:
# Additional stopwords
extra_list = ['let', 'may', 'might', 'must', 'need', 'apologies', 'meet']
# Extending the list of stopwords
stopwords.extend(extra_list)
# Display the extended list of stopwords
print('Stopwords to use:', stopwords)
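
The cell above builds the stopword list but does not yet apply it to any text. A minimal sketch of the filtering step follows, written in plain Python so that it runs without any downloads. The names `sample_stopwords`, `sample_line` and `remove_stopwords` are hypothetical, and the small hand-written set merely stands in for the extended NLTK list.

```python
# Hypothetical stand-in for the extended NLTK stopword list built above
sample_stopwords = {'out', 'for', 'me', 'in', 'at', 'meet'}

sample_line = 'Reaching out for HELP. Please meet me in LONDON at 6 a.m xyz@abc.com #urgent!'

def remove_stopwords(text, stop_set):
    """Split on whitespace and drop tokens whose lower-case form is a stopword."""
    return [tok for tok in text.split() if tok.lower() not in stop_set]

filtered_tokens = remove_stopwords(sample_line, sample_stopwords)
print(filtered_tokens)
```

Punctuation stays attached to tokens such as `HELP.`, so a stopword followed directly by punctuation would not be matched in this sketch. In the notebook, the extended list could be passed as `set(stopwords)` so that membership tests stay fast.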

### 2. Transform to lower case

Convert all the text to lowercase to maintain uniformity in subsequent analysis.  Of course, all text could just as easily be converted to upper case, but it is generally accepted practice to work with text all in lower case.    


In [None]:
# Transform text to lower case
line = line.lower()
# Display text after transformation to lower case
print(line)

**Lemmatization and Stemming**

Stemming simply strips the last few characters of a word, which often yields incorrect meanings and spellings. Lemmatization considers the context and converts the word to its meaningful base form, which is called the lemma.

Stemming is the process of reducing a word to its stem by removing affixes (suffixes and prefixes). In simple terms, stemming reduces a word to a base form in such a way that related words fall under a common stem. For example, the words care, cared and caring fall under the same stem ‘care’. Stemming is important in NLP as a faster alternative to lemmatization.


Sometimes the same word can have multiple different lemmas, so the Part of Speech (POS) tag for the word in its specific context should be identified. The following examples illustrate the differences and use cases:

1. If you lemmatize the word 'Caring', it would return 'Care'. If you stem it with a crude suffix-stripping rule, it would return 'Car', which is erroneous.
2. If you lemmatize the word 'Stripes' in verb context, it would return 'Strip'. If you lemmatize it in noun context, it would return 'Stripe'. If you just stem it, it would just return 'Strip'.
3. You would get the same results whether you lemmatize or stem words such as walking, running, swimming: walk, run, swim, etc.
4. Lemmatization is computationally expensive since it involves look-up tables and similar resources.     

If you have a large dataset and performance is an issue, go with Stemming. Remember you can also add your own rules to Stemming. If accuracy is paramount and dataset isn't humongous, go with Lemmatization.

*source: StackOverflow, [What Is the Difference Between Stemming and Lemmatization?](https://stackoverflow.com/questions/1787110/what-is-the-difference-between-lemmatization-vs-stemming)*
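
The contrast drawn above can be sketched with a toy example in plain Python. Neither function is an NLTK algorithm: `toy_stem` strips a suffix blindly, while `toy_lemmatize` consults a tiny hand-written look-up table keyed by word and part of speech. All names and table entries here are hypothetical and exist only to illustrate the idea.

```python
import re

def toy_stem(word):
    """Crude stemmer: strip one common suffix, with no knowledge of the word."""
    return re.sub(r'(ing|es|s)$', '', word.lower())

# Hypothetical look-up table keyed by (word, part of speech): 'v' verb, 'n' noun
TOY_LEMMAS = {
    ('caring', 'v'): 'care',
    ('stripes', 'v'): 'strip',
    ('stripes', 'n'): 'stripe',
    ('walking', 'v'): 'walk',
}

def toy_lemmatize(word, pos):
    """Look the word up with its part of speech; fall back to the lower-cased word."""
    return TOY_LEMMAS.get((word.lower(), pos), word.lower())

for word, pos in [('Caring', 'v'), ('Stripes', 'v'), ('Stripes', 'n'), ('walking', 'v')]:
    print(word, pos, '| stem:', toy_stem(word), '| lemma:', toy_lemmatize(word, pos))
```

The stemmer returns the same string regardless of part of speech, whereas the look-up approach can return different lemmas for the same surface form, at the cost of needing a table.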


**Glossary**    

* **whitespace tokenizer** - a tokenizer that splits on and discards only whitespace characters returning individual words or terms as tokens.    

* **Porter Stemming** - The Porter stemming algorithm (or 'Porter stemmer') is a process for removing the most common morphological and inflexional endings from words in English. Its main use is as part of a term normalisation process that is usually done when setting up Information Retrieval systems.    

* **Snowball Stemmer** - This algorithm is also known as the Porter2 stemming algorithm. It is almost universally accepted as better than the Porter stemmer, even being acknowledged as such by the individual who created the Porter stemmer.    

* **Lancaster Stemmer** - is simple but it tends to produce results with over stemming. Over-stemming causes the stems to be not linguistic, or they may have no meaning.    

* **Regexp Stemmer** - uses regular expressions (regex) to identify morphological affixes. Any substrings that match the regular expressions will be removed.


### 3. Lemmatization

Lemmatization reduces the inflected forms of a word to a single form, taking the word's part of speech into consideration.  The resulting term is called a lemma. The following example defines a function that uses a lemmatization method from NLTK.  Note that `WordNetLemmatizer.lemmatize()` treats every word as a noun unless a `pos` argument is supplied, so the function below applies noun rules only.  Other methods may prove more effective depending on the data source and use case.

In [None]:
# Define a function to Lemmatize text and to tokenize based on whitespace
def lemmatize_text(text):
    w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
    lemmatizer = nltk.stem.WordNetLemmatizer()
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]

# Invoke the function defined above and display result
lemma_line = lemmatize_text(line) 
print(lemma_line)
# Display the line without tokenization and Lemmatization
print(line)

### 4. Stemming

Stemming is a faster, more efficient way to attempt to reduce words to their root form. As mentioned earlier, stemming trades accuracy for efficiency compared with lemmatization.  The following example defines a function to perform whitespace tokenization and stemming of a line of text, then invokes that function.  In this example the NLTK Porter method of stemming is used.  Depending on the nature of the source text and the use case(s) for the analysis, different methods may be considered for stemming.

**Porter Stemming**

In [None]:
# Define function for Porter stemming of text tokenized for whitespaces
def stem_porter(text):
    w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
    ps = nltk.PorterStemmer()
    return [ps.stem(w) for w in w_tokenizer.tokenize(text)]
# Invoke the method defined for Porter stemming and display the result
porter_stem_line = stem_porter(line)
print(porter_stem_line)
# Display the original line
print(line)

**Snowball Stemming**    

A few common rules of Snowball stemming are:    
* ILY  -----> ILI    
* LY   -----> Nill    
* SS   -----> SS    
* S    -----> Nill    
* ED   -----> E,Nill    

**Nill** means the suffix is replaced with nothing and is just removed.    

These rules may vary depending on the word. In the case of the suffix ‘ed’, the words ‘cared’ and ‘bumped’ are stemmed to ‘care’ and ‘bump’. Hence, in ‘cared’ the suffix is treated as ‘d’ only and not ‘ed’.

The word ‘stemmed’ is reduced to ‘stem’ and not ‘stemm’. Therefore, the suffix that is removed depends on the word.

Here are a few examples:-
```
WORD           STEM
cared          care
university     univers
fairly         fair
easily         easili
singing        sing
sings          sing
sung           sung
singer         singer
sportingly     sport
```
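
The rule table above can be read as "find the first suffix that matches and replace it". The following sketch applies a subset of those rules in plain Python. It is a toy illustration and not the Snowball algorithm itself: the `ED` rule is left out because, as noted above, its outcome depends on the word. The names `SUFFIX_RULES` and `apply_first_rule` are hypothetical.

```python
# (suffix, replacement) pairs taken from the rule list above, longest suffix first
SUFFIX_RULES = [('ily', 'ili'), ('ly', ''), ('ss', 'ss'), ('s', '')]

def apply_first_rule(word):
    """Apply the first matching suffix rule; return the word unchanged if none match."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            return word[:len(word) - len(suffix)] + replacement
    return word

for word in ['easily', 'fairly', 'sings', 'glass', 'sung']:
    print(word, '->', apply_first_rule(word))
```

Ordering matters: `ily` must be tested before `ly`, and `ss` before `s`, otherwise the shorter rule would fire first and `glass` would lose its final letter.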

In [None]:
# Define function for Snowball stemming of text tokenized for whitespaces
def stem_snowball(text):
    w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
    # Bear in mind that the Snowball stemmer needs language defined
    ps = nltk.SnowballStemmer(language='english')
    return [ps.stem(w) for w in w_tokenizer.tokenize(text)]
# Invoke the method defined for Snowball stemming and display the result
snowball_stem_line = stem_snowball(line)
print(snowball_stem_line)
# Display the original line
print(line)

**Lancaster Stemmer**    

There is a third method of stemming available, Lancaster stemming.

The Lancaster stemmer is simple, but it tends to over-stem. Over-stemming produces stems that are not linguistic roots or that have no meaning.

In NLTK, the class `LancasterStemmer` implements the Lancaster stemming algorithm.  The following code shows an example of the application of Lancaster stemming along with whitespace tokenization.


In [None]:
# Define function for Lancaster stemming of text tokenized for whitespaces
def stem_lancaster(text):
    w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
    # Unlike the Snowball stemmer, the Lancaster stemmer takes no language argument
    ps = nltk.LancasterStemmer()
    return [ps.stem(w) for w in w_tokenizer.tokenize(text)]
# Invoke the method defined for Lancaster stemming and display the result
lancaster_stem_line = stem_lancaster(line)
print(lancaster_stem_line)
# Display the original line
print(line)

**RegexpStemmer**

The Regexp stemmer uses regular expressions to identify morphological affixes. Any substrings that match the regular expression are removed.

In NLTK, the class `RegexpStemmer` implements this approach.

In [None]:
# Define function for Regexp stemming of text tokenized for whitespaces
def stem_regexp(text):
    w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
    # Bear in mind that the Regexp stemmer needs a regex defined
    ps = nltk.RegexpStemmer('ing$|s$|e$|able$', min=4)
    return [ps.stem(w) for w in w_tokenizer.tokenize(text)]
# Invoke the method defined for Regexp stemming and display the result
regexp_stem_line = stem_regexp(line)
print(regexp_stem_line)
# Display the original line
print(line)
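
To make the behavior of a regex-based stemmer concrete, here is a minimal plain-Python sketch of the idea. It assumes, as an approximation of how `RegexpStemmer` is understood to work, that words shorter than the minimum length are returned unchanged and that every match of the pattern is otherwise deleted. The function name `regexp_stem` and its parameter names are hypothetical.

```python
import re

def regexp_stem(word, pattern=r'ing$|s$|e$|able$', min_length=4):
    """Delete every match of `pattern` from words of at least `min_length` characters."""
    if len(word) < min_length:
        return word
    return re.sub(pattern, '', word)

for word in ['reaching', 'please', 'advisable', 'meet', 'was', 'me']:
    print(word, '->', regexp_stem(word))
```

The pattern knows nothing about the word, so `please` loses its final `e` even though `pleas` is not a useful stem. This is why the choice of expression and minimum length matters for this stemmer.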

*The results of lemmatization and of the four stemming methods are displayed together below for comparison.  The following code requires the execution of each of the five snippets above to create the variables to print.*

In [None]:
print("Lemmatization:")
print(lemma_line, "\n")
print("Porter Stemming:")
print(porter_stem_line, "\n")
print("Snowball Stemming:")
print(snowball_stem_line, "\n")
print("Lancaster Stemming:")
print(lancaster_stem_line, "\n")
print("Regexp Stemming with 'ing$|s$|e$|able$', min=4:")
print(regexp_stem_line, "\n")


**More Comparative Examples of Stemming**    

In [None]:
# STEMMING EXAMPLE 1
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer, RegexpStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer(language='english')
regexp = RegexpStemmer('ing$|s$|e$|able$', min=4)

word_list = ["friend", "friendship", "friends", "friendships"]
print("{0:20}{1:20}{2:20}{3:30}{4:40}".format("Word","Porter Stemmer","Snowball Stemmer","Lancaster Stemmer",'Regexp Stemmer'))
for word in word_list:
    print("{0:20}{1:20}{2:20}{3:30}{4:40}".format(word,porter.stem(word),snowball.stem(word),lancaster.stem(word),regexp.stem(word)))

In [None]:
# STEMMING EXAMPLE 2
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer, RegexpStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer(language='english')
regexp = RegexpStemmer('ing$|s$|e$|able$', min=4)

word_list = ['run','runs','running','runner','ran','easily','fairly']
print("{0:20}{1:20}{2:20}{3:30}{4:40}".format("Word","Porter Stemmer","Snowball Stemmer","Lancaster Stemmer",'Regexp Stemmer'))
for word in word_list:
    print("{0:20}{1:20}{2:20}{3:30}{4:40}".format(word,porter.stem(word),snowball.stem(word),lancaster.stem(word),regexp.stem(word)))

**Difference Between Porter Stemmer and Snowball Stemmer**

The Snowball stemmer is a revised version of the Porter stemmer: some issues in the Porter stemmer were fixed in Snowball, and it is slightly more aggressive.
For most words the two behave almost identically.
Words like ‘fairly‘ and ‘sportingly‘ are stemmed to ‘fair’ and ‘sport’ by the Snowball stemmer, whereas the Porter stemmer reduces them to ‘fairli‘ and ‘sportingli‘.    

The difference between the two algorithms is clearly visible in the way each stems the word ‘sportingly’: the Snowball stemmer produces the more accurate stem.

**Drawbacks of Stemming**    

Over-stemming and under-stemming may produce stems that are inappropriate or carry little meaning.

Stemming does not consider how the word is being used. For example – the word ‘saw‘ will be stemmed to ‘saw‘ itself but it won’t be considered whether the word is being used as a noun or a verb in the context. For this reason, Lemmatization is used as it keeps this fact in consideration and will return either ‘see’ or ‘saw’ depending on whether the word ‘saw’ was used as a verb or a noun.

### 5. Removal of Patterns Using Regular Expressions

Regular expressions (regex) help to identify and remove patterns that are not required in the text, such as email addresses, links and hashtags.


In [None]:
print("Line:", line)
print("...cleaning...\n")
clean0 = line
# Raw strings (r'...') keep backslashes such as \S intact for the regex engine
clean1 = re.sub(r'\S*@\S*\s?', " ", clean0)          # email removal
clean2 = re.sub(r'\s+', " ", clean1)                 # collapse whitespace, including new lines
clean3 = re.sub("’", " ", clean2)                    # typographic single quote removal
clean4 = re.sub('_', " ", clean3)                    # underscore removal
clean5 = re.sub(r'http\S*\s?', " ", clean4)          # link removal
clean = ' '.join([i for i in clean5.split() if '#' not in i])  # drop tokens containing a hashtag
print("Clean line:", clean)

### 6. Parts-of-Speech (POS) tagging

Parts-of-speech tagging helps to identify the parts of speech. Based on the use case one can keep or remove some of them.  Parts-of-speech may be localized by dialect and other cultural factors (profession, age, etc.)

The following code is an example of parts-of-speech tagging using the `spacy` library.


In [None]:
tokens_spacy = nlp(line)
for token in tokens_spacy:
    print(token.text, ': ', token.pos_, ': ', token.is_stop)

### 7. Named-Entity-Recognition (NER)

Named-entity recognition helps to identify entities and to categorize them into groups such as names, places and currency.

The following code is an example of the use of named-entity-recognition using the `spacy` module. 

*The following code requires execution of the prior snippet of code to create the variable `tokens_spacy`.*

In [None]:
for ent in tokens_spacy.ents:
    print(ent.text, ': ', ent.label_)



---








## Let us see how we can use LISTs in NLP

[NLTK Chapter 1](https://www.nltk.org/book/ch01.html)

In [None]:
# Setup wordcloud and nltk
!pip install -q wordcloud
import wordcloud
import nltk

In [None]:
nltk.download('book') 


In [None]:
from nltk.book import *

In [None]:
# Get stopwords, stemmer and lemmatizer
stopwords = nltk.corpus.stopwords.words('english')
stemmer = nltk.stem.PorterStemmer()
lemmatizer = nltk.stem.WordNetLemmatizer()


In [None]:
type(stopwords)

list

In [None]:
len(stopwords)

179

In [None]:
stopwords

In [None]:
stopwords[0]

'i'

In [None]:
sent1 = ['Call', 'me', 'Ishmael', '.']
sent1

['Call', 'me', 'Ishmael', '.']

In [None]:
text1[100:110]

['and', 'to', 'teach', 'them', 'by', 'what', 'name', 'a', 'whale', '-']

In [None]:
text1[0:50]

In [None]:
len(sent1)

4

In [None]:
sent2 = ['The', 'family', 'of', 'Dashwood', 'had', 'long', 'been', 'settled', 'in', 'Sussex', '.']
sent3 = ['In', 'the', 'beginning', 'God', 'created', 'the', 'heaven', 'and', 'the', 'earth', '.']

In [None]:
sent2+sent3

['The',
 'family',
 'of',
 'Dashwood',
 'had',
 'long',
 'been',
 'settled',
 'in',
 'Sussex',
 '.',
 'In',
 'the',
 'beginning',
 'God',
 'created',
 'the',
 'heaven',
 'and',
 'the',
 'earth',
 '.']

In [None]:
sent1.append("Some")
sent1

['Call', 'me', 'Ishmael', '.', 'Some']

In [None]:
len(text1)

260819

In [None]:
len(text2)

141576

In [None]:
text1[100:110]

['and', 'to', 'teach', 'them', 'by', 'what', 'name', 'a', 'whale', '-']

## [NLTK concordance](http://www.nltk.org/api/nltk.html?highlight=concordance)

In [None]:
text4

<Text: Inaugural Address Corpus>

In [None]:
text4.concordance("nation")

Displaying 25 of 330 matches:
 to the character of an independent nation seems to have been distinguished by
f Heaven can never be expected on a nation that disregards the eternal rules o
first , the representatives of this nation , then consisting of little more th
, situation , and relations of this nation and country than any which had ever
, prosperity , and happiness of the nation I have acquired an habitual attachm
an be no spectacle presented by any nation more pleasing , more noble , majest
party for its own ends , not of the nation for the national good . If that sol
tures and the people throughout the nation . On this subject it might become m
if a personal esteem for the French nation , formed in a residence of seven ye
f our fellow - citizens by whatever nation , and if success can not be obtaine
y , continue His blessing upon this nation and its Government and give it all 
powers so justly inspire . A rising nation , spread over a wide and fruitful l
ing now decided by the

In [None]:
print(text1)
text1.concordance("son")

<Text: Moby Dick by Herman Melville 1851>
Displaying 20 of 20 matches:
l sinners among men , the sin of this son of Amittai was in his wilful disobedi
r his naked wrists ; Queequeg was the son of a King , and Queequeg budged not .
 cabin , ye canting , drab - coloured son of a wooden gun -- a straight wake wi
 He must show that he ' s converted . Son of darkness ," he added , turning to 
 and all of us , and every mother ' s son and soul of us belong ; the great and
arnestly into his eyes , and said , " Son of darkness , I must do my duty by th
the sea ; the unerring harpoon of the son fitly replacing the infallible arrow 
f - believed this wild Indian to be a son of the Prince of the Powers of the Ai
or little Flask , he was the youngest son , and little boy of this weary family
narrative ; I have conversed with his son ; and all this within a few miles of 
 from his girdle ; " every mother ' s son of ye draw his knife , and pull with 
elkilt Charlemagne , had he been born son to Char

In [None]:
print(text1)
text1.concordance("king")

<Text: Moby Dick by Herman Melville 1851>
Displaying 25 of 64 matches:
th , of which he brought some to the king . ... The best whales were catched i
RRATIVE TAKEN DOWN FROM HIS MOUTH BY KING ALFRED , A . D . 890 . " And whereas
armacetti for an inward bruise ." -- KING HENRY . " Very like a whale ." -- HA
SOMEWHERE .) " A tenth branch of the king ' s ordinary revenue , said to be gr
 the coast , are the property of the king ." -- BLACKSTONE . " Soon to the spo
 Io ! sing . To the finny people ' s king . Not a mightier whale than this In 
n might , where might is right , And King of the boundless sea ." -- WHALE SON
wo . His father was a High Chief , a King ; his uncle a High Priest ; and on t
, spurned his suit ; and not all the King his father ' s influence could preva
d wrists ; Queequeg was the son of a King , and Queequeg budged not . Struck b
 the High Priest and his majesty the King , Queequeg ' s father . Grace being 
 plain precedence over a mere island King , especially in th

In [None]:
print(text1)
text1.concordance("ship")

<Text: Moby Dick by Herman Melville 1851>
Displaying 25 of 518 matches:
hale is floating at the stern of the ship , they cut off his head , and tow it
ution for fear they should run their ship upon them ." -- SCHOUTEN ' S SIXTH C
 from the Elbe , wind N . E . in the ship called The Jonas - in - the - Whale 
RATIVE OF THE SHIPWRECK OF THE WHALE SHIP ESSEX OF NANTUCKET , WHICH WAS ATTAC
HALING CRUIZE . 1846 . " The Whale - ship Globe , on board of which vessel occ
OCK . ANOTHER VERSION OF THE WHALE - SHIP GLOBE NARRATIVE . " The voyages of t
" It is impossible to meet a whale - ship on the ocean without being struck by
e whales , that the whites saw their ship in bloody possession of the savages 
E TAKING AND RETAKING OF THE WHALE - SHIP HOBOMACK . " It is generally well kn
on his sword ; I quietly take to the ship . There is nothing surprising in thi
 , when first told that you and your ship were now out of sight of land ? Why 
 , a cook being a sort of officer on ship - board -- yet , 

In [None]:
print(text1)
text1.concordance("knife")

<Text: Moby Dick by Herman Melville 1851>
Displaying 25 of 31 matches:
 further adorning it with his jack - knife , stooping over and diligently worki
ed with a sailor - belt and sheath - knife . Here comes another with a sou '- w
rd into its face , and with a jack - knife gently whittling away at its nose , 
trying to mend a pen with his jack - knife , old Bildad , to my no small surpri
 mend that pen , will ye . My jack - knife here needs the grindstone . That ' s
es all fastened upon the old man ' s knife , as he carved the chief dish before
her . No ! And when reaching out his knife and fork , between which the slice o
 little started if , perchance , the knife grazed against the plate ; and chewe
y wooden trencher , while Tashtego , knife in hand , began laying out the circl
him ; tows me with a cable I have no knife to cut . Horrible old man ! Who ' s 
er ! SPANISH SAILOR ( MEETING HIM ). Knife thee heartily ! big frame , small sp
 Fair play ! Snatch the Spaniard ' s knife ! A ri

In [None]:
print(text1)
text1.concordance("monster")

<Text: Moby Dick by Herman Melville 1851>
Displaying 25 of 49 matches:
des cometh within the chaos of this monster ' s mouth , be it beast , boat , or
nter into the dreadful gulf of this monster ' s ( whale ' s ) mouth , are immed
time with a lance ; but the furious monster at length rushed on the boat ; hims
 . Such a portentous and mysterious monster roused all my curiosity . Then the 
and flank with the most exasperated monster . Long usage had , for this Stubb ,
ACK ).-- Under this head I reckon a monster which , by the various names of Fin
arned the history of that murderous monster against whom I and all the others h
ocity , cunning , and malice in the monster attacked ; therefore it was , that 
iathan is restricted to the ignoble monster primitively pursued in the North ; 
 and incontestable character of the monster to strike the imagination with unwo
mberment . Then , in darting at the monster , knife in hand , he had but given 
e rock ; instead of this we saw the monster saili
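
The concordance views above show each occurrence of a word together with its surrounding context, matching regardless of case. A simplified sketch of that idea in plain Python is given below. It works on a list of tokens and measures context in tokens rather than characters, which is an assumption of this sketch and not how NLTK formats its output. The function name `keyword_in_context` is hypothetical.

```python
def keyword_in_context(tokens, target, width=2):
    """Return (left context, match, right context) for every case-insensitive match."""
    target = target.lower()
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            left = ' '.join(tokens[max(0, i - width):i])
            right = ' '.join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits

sample_tokens = 'Queequeg was the son of a King , and the King budged not .'.split()
for left, match, right in keyword_in_context(sample_tokens, 'king'):
    print(f'{left:>12} [{match}] {right}')
```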

Using `.similar()` to find words that appear in the same contexts as a given word

In [None]:
len(text1)

260819

In [None]:
text1.similar("monster")

whale ship world sea whales boat pequod other sun leviathan thing king
water head captain air crew cabin body more


In [None]:
text1.similar("king")

whale ship sea boat man line pequod water head captain time crew rope
harpooneer cry world lord day monster land


In [None]:
text1.similar("knife")

head and side way ship time face hand whale more boat men place duty
leg officers marines trumpet heart body


In [None]:
text1.similar("son")

side head part heart body hand boat sort matter time sense wife cabin
whale book eye god lord sea mouth


In [None]:
text1.similar("ship")

whale boat sea world captain way head time other man crew pequod line
deck body fishery air side water voyage
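
The lists above come from distributional similarity: words are considered similar when they occur between the same neighboring words. The sketch below illustrates that idea in plain Python by counting shared (previous word, next word) contexts. The scoring is deliberately simplified and is an assumption of this sketch, so it should not be expected to reproduce the rankings printed by NLTK. The function name `similar_words` is hypothetical.

```python
from collections import Counter, defaultdict

def similar_words(tokens, target, top_n=5):
    """Rank words by how many (left, right) contexts they share with `target`."""
    tokens = [t.lower() for t in tokens]
    target = target.lower()
    contexts = defaultdict(set)  # word -> set of (previous word, next word)
    for left, word, right in zip(tokens, tokens[1:], tokens[2:]):
        contexts[word].add((left, right))
    target_contexts = contexts[target]
    scores = Counter()
    for word, ctxs in contexts.items():
        shared = len(ctxs & target_contexts)
        if word != target and shared:
            scores[word] = shared
    return [word for word, _ in scores.most_common(top_n)]

sample_tokens = 'the whale sank . the ship sank . the boat floated .'.split()
print(similar_words(sample_tokens, 'whale'))
```

In the sample, `ship` shares the context "the _ sank" with `whale`, while `boat` appears in "the _ floated" and is therefore not returned.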




---



In [None]:
print(text4)
len(text4)

<Text: Inaugural Address Corpus>


152901

In [None]:
print(text2)
len(text2)

<Text: Sense and Sensibility by Jane Austen 1811>


141576

In [None]:
text2.dispersion_plot(["Elinor", "Marianne", "Edward", "Willoughby", "Brandon"])

In [None]:
len(set(text4))

In [None]:
text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "liberty", "is"])

In [None]:
text2.dispersion_plot(["citizens", "democracy", "freedom", "duties", "is"])

In [None]:
text2.dispersion_plot(["society", "husband", "feeling", "marriage", "independent", "Dashwood"])
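
A dispersion plot marks where in the text each word occurs. The underlying data is simply the list of token positions for each word, which the following plain-Python sketch computes for a small sample. The function name `word_offsets` and the sample tokens are hypothetical, and exact case-sensitive matching is an assumption of this sketch.

```python
def word_offsets(tokens, word):
    """Return the positions at which `word` occurs in `tokens` (exact match)."""
    return [i for i, tok in enumerate(tokens) if tok == word]

sample_tokens = ['Elinor', 'spoke', 'to', 'Marianne', 'and', 'Elinor', 'smiled']
for name in ['Elinor', 'Marianne', 'Edward']:
    print(name, word_offsets(sample_tokens, name))
```

A word with an empty list of offsets, such as `Edward` in the sample, would show up as an empty row in the plot.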

In [None]:
len(text2)

In [None]:
len(set(text2))

In [None]:
len(set(text2))/len(text2)

In [None]:
len(set(text3)) / len(text3)
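
The ratio computed in the cells above, distinct tokens divided by total tokens, is often called lexical diversity. Wrapping it in a small function avoids retyping the expression for each text. This is a minimal sketch; the guard for empty input is an addition of this sketch, and the sample tokens are hypothetical.

```python
def lexical_diversity(tokens):
    """Ratio of distinct tokens to total tokens (0.0 for an empty sequence)."""
    return len(set(tokens)) / len(tokens) if len(tokens) else 0.0

sample_tokens = ['the', 'whale', 'and', 'the', 'ship']
print(lexical_diversity(sample_tokens))
```

In the notebook the same function could be called as `lexical_diversity(text2)` or `lexical_diversity(text3)` to compare texts.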


In [None]:
len(set(text3))

In [None]:
len(set(text3)-set(stopwords))

In [None]:
fdist1 = FreqDist(text1)

In [None]:
fdist1.most_common(50) 

In [None]:
fdist2 = FreqDist(text2)

In [None]:
len(fdist2)

In [None]:
fdist2.most_common(50) 

Find the 50 most frequent words in Sense and Sensibility that are not stopwords.

In [None]:
fdist2[',']

In [None]:
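# Print the 50 most frequent tokens that are not stopwords
counter = 0
for item in fdist2.most_common():
    # Compare in lower case because the NLTK English stopwords are all lower case
    if item[0].lower() not in stopwords:
        print(item)
        counter += 1
        if counter == 50:
            break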
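
The same kind of filtering can be sketched with `collections.Counter` from the standard library, which offers the same `most_common()` interface used above. This version also drops punctuation by keeping only alphabetic tokens, which is an extra assumption of the sketch. The function name `top_content_words` and the sample data are hypothetical.

```python
from collections import Counter

def top_content_words(tokens, stop_set, n=50):
    """Count lower-cased alphabetic tokens and return the n most frequent non-stopwords."""
    counts = Counter(tok.lower() for tok in tokens if tok.isalpha())
    kept = [(word, count) for word, count in counts.most_common() if word not in stop_set]
    return kept[:n]

sample_tokens = ['The', 'family', 'of', 'Dashwood', 'had', 'long', 'been', 'settled', '.',
                 'The', 'family', 'estate', 'was', 'large', ',', 'and', 'the', 'family',
                 'was', 'respected', '.']
sample_stop_set = {'the', 'of', 'had', 'been', 'was', 'and'}
print(top_content_words(sample_tokens, sample_stop_set, n=3))
```

Slicing the filtered list with `kept[:n]` removes the need for a manual counter and `break`.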

In [None]:
170/len(text2)