# Natural Language Processing

* NLP focuses on making natural human language usable by computer programs.
* These notes use a library called NLTK (Natural Language Toolkit).

## Tokenizing

* Tokenizing divides text into words or sentences.
* This lets you work with shorter passages of text that are still largely coherent and intelligible, even when read separately from the rest of the text.
* It's the first step in structuring unstructured data so that it can be analyzed more easily.
* When you’re analyzing text, you’ll be tokenizing by word and tokenizing by sentence.
* Here’s what both types of tokenization bring to the table.
    * Tokenizing by word
        * This allows you to identify words that come up particularly often.
    * Tokenizing by sentence
        * When you tokenize by sentence, you can analyze how those words relate to one another and see more context.
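
To make the "words that come up particularly often" point concrete, here is a minimal sketch that counts word frequencies with the standard library. The token list is written out by hand in the shape `word_tokenize` returns, so the sketch runs without NLTK data; the variable names are illustrative only.

```python
from collections import Counter

# Hand-written tokens, in the shape word_tokenize would return for one sentence.
tokens = ['And', 'the', 'first', 'lesson', 'of', 'all', 'was', 'the', 'basic',
          'trust', 'that', 'he', 'could', 'learn', '.']

# Keep alphabetic tokens only and normalize case before counting.
counts = Counter(token.casefold() for token in tokens if token.isalpha())

top_word, top_count = counts.most_common(1)[0]
print(top_word, top_count)  # the 2
```

Punctuation is dropped here on purpose; whether to keep it depends on the analysis you have in mind.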

In [1]:
# Install NLTK once, from a terminal, rather than calling pip.main() from Python:
#   python -m pip install nltk

In [2]:
from nltk.tokenize import sent_tokenize, word_tokenize



In [3]:
sample = "Muad'Dib learned rapidly because his first training was in how to learn. And the first lesson of all was the basic trust that he could learn. It's shocking to find how many people do not believe they can learn, and how many more belive learning to be difficult."

In [4]:
print(sample)

Muad'Dib learned rapidly because his first training was in how to learn. And the first lesson of all was the basic trust that he could learn. It's shocking to find how many people do not believe they can learn, and how many more belive learning to be difficult.


In [5]:
# import nltk
# nltk.download('punkt')

In [6]:
sent_tokenize(sample)  # split into sentences with NLTK's pre-trained Punkt model, not simply on ". "

["Muad'Dib learned rapidly because his first training was in how to learn.",
 'And the first lesson of all was the basic trust that he could learn.',
 "It's shocking to find how many people do not believe they can learn, and how many more belive learning to be difficult."]

In [8]:
print(word_tokenize(sample))

["Muad'Dib", 'learned', 'rapidly', 'because', 'his', 'first', 'training', 'was', 'in', 'how', 'to', 'learn', '.', 'And', 'the', 'first', 'lesson', 'of', 'all', 'was', 'the', 'basic', 'trust', 'that', 'he', 'could', 'learn', '.', 'It', "'s", 'shocking', 'to', 'find', 'how', 'many', 'people', 'do', 'not', 'believe', 'they', 'can', 'learn', ',', 'and', 'how', 'many', 'more', 'belive', 'learning', 'to', 'be', 'difficult', '.']


## Stop Words

* Stop words are common words that you usually want to filter out of a text (a sentence or a paragraph) because they add little meaning on their own.

In [9]:
# import nltk
# nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Lapmart\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [10]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [11]:
sample = "Sir, I protest. I am not merry man!"
words = word_tokenize(sample)
print(words)   # Words in the sample

['Sir', ',', 'I', 'protest', '.', 'I', 'am', 'not', 'merry', 'man', '!']


In [12]:
stop_words = set(stopwords.words("english"))  # convert to a set: unique entries and fast membership tests

The stop words in question:

In [13]:
print(stopwords.words("english"))  # the raw stop word list (may contain duplicates)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [14]:
print(stop_words)  # now a set of unique stop words

{'what', 'doesn', 'against', 'yourself', 'off', 'them', 'ourselves', 'other', 'himself', "she's", 'did', 'not', 'having', 'yourselves', 'can', 'o', 'this', 'some', 'be', 'i', 're', 'weren', 'will', 'but', 'ain', 'wouldn', "isn't", "you'd", 'her', 'is', 'we', 'theirs', 'isn', 'during', 'shouldn', 'she', 'through', 'didn', "shan't", 'wasn', "hasn't", "you'll", "weren't", 'y', 'while', 'there', 'being', 't', 'which', "you're", 'its', 'where', 'above', 'herself', 'ours', 'doing', 'or', 'my', 'to', 'only', 'our', "won't", 'by', 'why', 'they', 'too', "should've", 'their', 'do', "you've", 'had', 'just', 'an', 'further', 've', 'themselves', 'hadn', 'needn', 'whom', 'over', 'couldn', 'the', 'itself', 'of', 'yours', 'were', 'how', "that'll", 'very', 'those', 'both', 'mustn', 'was', 'these', 'been', 'mightn', 'his', 'after', 'does', 'few', 'have', 'same', 'as', "aren't", 'myself', 's', 'are', 'because', 'below', 'ma', 'on', 'shan', "don't", 'hers', 'with', "mustn't", 'aren', 'so', "doesn't", 'him

In [15]:
filtered_list = []

In [16]:
for word in words:
    if word.casefold() not in stop_words:   # casefold() lowercases the word (more aggressively than lower()) before the check
        filtered_list.append(word)   # keep only words that are not in stop_words

In [17]:
filtered_list

['Sir', ',', 'protest', '.', 'merry', 'man', '!']
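
The same filter can be written as a list comprehension. This is a sketch under one stated assumption: `stop_words` below is a tiny hand-picked subset, used so the example runs without downloading the NLTK stop word corpus. In the notebook it would be `set(stopwords.words("english"))`.

```python
words = ['Sir', ',', 'I', 'protest', '.', 'I', 'am', 'not', 'merry', 'man', '!']

# Illustrative subset only; the real list comes from nltk.corpus.stopwords.
stop_words = {'i', 'am', 'not'}

# Drop stop words, comparing case-insensitively.
filtered_list = [word for word in words if word.casefold() not in stop_words]

# Optionally drop punctuation tokens as well.
content_words = [word for word in filtered_list if word.isalpha()]

print(filtered_list)
print(content_words)
```

Note that removing "not" changes the meaning of the sentence, so a stop word list may need adjusting for tasks such as sentiment analysis.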

## Stemming

* Stemming is a text processing task in which you reduce words to their root, which is the core part of a word.
* For example, the words “helping” and “helper” share the root “help.”
* Stemming allows you to zero in on the basic meaning of a word rather than all the details of how it’s being used.
* NLTK has more than one stemmer, but the most popular one is the Porter stemmer.
* There are two ways stemming can go wrong:
    * **Understemming**
        * Happens when two related words should be reduced to the same stem but aren’t. This is a false negative.
    * **Overstemming**
        * Happens when two unrelated words are reduced to the same stem even though they shouldn’t be. This is a false positive.

In [18]:
from nltk.stem import PorterStemmer, SnowballStemmer

In [19]:
stemmer_porterstemmer = PorterStemmer()  # create a Porter stemmer instance

In [20]:
sample = "The crew of the USS Discovery discovered many discoveries. Discovering is what explorers do."

In [21]:
words = word_tokenize(sample)
print(words)

['The', 'crew', 'of', 'the', 'USS', 'Discovery', 'discovered', 'many', 'discoveries', '.', 'Discovering', 'is', 'what', 'explorers', 'do', '.']


In [22]:
stemmed_words1 = [stemmer_porterstemmer.stem(word) for word in words]
print(stemmed_words1)   # some understemming ('discoveri' vs. 'discov'), but the result is still useful

['the', 'crew', 'of', 'the', 'uss', 'discoveri', 'discov', 'mani', 'discoveri', '.', 'discov', 'is', 'what', 'explor', 'do', '.']
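
The understemming in this output is easier to see when the words are grouped by stem. The sketch below copies the words and stems from the cell above instead of calling the stemmer again, so it needs only the standard library; the variable names are illustrative.

```python
from collections import defaultdict

# Words and their Porter stems, copied from the output above.
words = ['The', 'crew', 'of', 'the', 'USS', 'Discovery', 'discovered', 'many',
         'discoveries', '.', 'Discovering', 'is', 'what', 'explorers', 'do', '.']
stems = ['the', 'crew', 'of', 'the', 'uss', 'discoveri', 'discov', 'mani',
         'discoveri', '.', 'discov', 'is', 'what', 'explor', 'do', '.']

# Group the distinct (case-folded) words under each stem.
groups = defaultdict(set)
for word, stem in zip(words, stems):
    groups[stem].add(word.casefold())

# The four "discover" forms end up under two different stems.
related = {stem: sorted(forms) for stem, forms in groups.items()
           if stem.startswith('discov')}
print(related)
```

All four forms are related, yet they land in two groups, which is the false negative described above.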


In [23]:
stemmer_snowball = SnowballStemmer("english")

In [24]:
stemmed_words2 = [stemmer_snowball.stem(word) for word in words]
print(stemmed_words2)

['the', 'crew', 'of', 'the', 'uss', 'discoveri', 'discov', 'mani', 'discoveri', '.', 'discov', 'is', 'what', 'explor', 'do', '.']
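
The two outputs above look identical. A small sketch like the following confirms that programmatically. It assumes NLTK is installed; neither stemmer needs a data download when used this way.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

words = ['The', 'crew', 'of', 'the', 'USS', 'Discovery', 'discovered', 'many',
         'discoveries', '.', 'Discovering', 'is', 'what', 'explorers', 'do', '.']

porter = PorterStemmer()
snowball = SnowballStemmer("english")

# Collect (word, porter_stem, snowball_stem) wherever the two stemmers disagree.
differences = [
    (word, porter.stem(word), snowball.stem(word))
    for word in words
    if porter.stem(word) != snowball.stem(word)
]
print(differences)  # an empty list means both stemmers agree on every token
```

Running the same comparison on a larger vocabulary is a simple way to decide which stemmer suits a given corpus.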


## POS Tagging

* Part of speech is a grammatical term that deals with the roles words play when you use them together in sentences.
* Tagging parts of speech, or POS tagging, is the task of labeling the words in your text according to their part of speech.
* POS tagging only labels words. It does not change them (for example, it does not strip a trailing 's').
* In English, there are traditionally eight parts of speech: noun, pronoun, adjective, verb, adverb, preposition, conjunction, and interjection.

In [25]:
sample = "If you wish to make an apple pie from scratch, you must first invent the universe."

In [26]:
from nltk import pos_tag

Tokenize the sample first:

In [27]:
words = word_tokenize(sample)
print(words)

['If', 'you', 'wish', 'to', 'make', 'an', 'apple', 'pie', 'from', 'scratch', ',', 'you', 'must', 'first', 'invent', 'the', 'universe', '.']


In [30]:
# import nltk
# nltk.download('averaged_perceptron_tagger')

* `pos_tag` returns a list of tuples; each tuple contains a word and its corresponding POS tag.

In [31]:
pos_tag(words)  # e.g. 'If' is tagged IN (preposition or subordinating conjunction)

[('If', 'IN'),
 ('you', 'PRP'),
 ('wish', 'VBP'),
 ('to', 'TO'),
 ('make', 'VB'),
 ('an', 'DT'),
 ('apple', 'NN'),
 ('pie', 'NN'),
 ('from', 'IN'),
 ('scratch', 'NN'),
 (',', ','),
 ('you', 'PRP'),
 ('must', 'MD'),
 ('first', 'VB'),
 ('invent', 'VB'),
 ('the', 'DT'),
 ('universe', 'NN'),
 ('.', '.')]
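
Because the result is a plain list of `(word, tag)` tuples, it can be filtered and summarized with ordinary Python. The sketch below copies the tagged output above so that it runs without the tagger model; the variable names are illustrative.

```python
from collections import Counter

# Tagged tokens copied from the pos_tag output above.
tagged = [('If', 'IN'), ('you', 'PRP'), ('wish', 'VBP'), ('to', 'TO'),
          ('make', 'VB'), ('an', 'DT'), ('apple', 'NN'), ('pie', 'NN'),
          ('from', 'IN'), ('scratch', 'NN'), (',', ','), ('you', 'PRP'),
          ('must', 'MD'), ('first', 'VB'), ('invent', 'VB'), ('the', 'DT'),
          ('universe', 'NN'), ('.', '.')]

# All noun tags in the Penn Treebank tagset start with "NN".
nouns = [word for word, tag in tagged if tag.startswith('NN')]

# How often each tag occurs in the sentence.
tag_counts = Counter(tag for _, tag in tagged)

print(nouns)
print(tag_counts.most_common(3))
```

The count also makes tagger mistakes easier to spot: here 'first' is tagged `VB`, although it acts as an adverb in this sentence.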

In [32]:
# import nltk
# nltk.download('tagsets')

In [33]:
import nltk

nltk.help.upenn_tagset()  # print a description and examples for every Penn Treebank tag

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

## Lemmatizing

* Like stemming, lemmatizing reduces words to their core meaning, but it gives you a complete English word that makes sense on its own instead of a fragment of a word such as 'discoveri'.


In [34]:
from nltk.stem import WordNetLemmatizer

In [35]:
lemmatizer = WordNetLemmatizer()

In [36]:
sample = "The friends of Desoto love scarves."

In [37]:
words = word_tokenize(sample)
print(words)

['The', 'friends', 'of', 'Desoto', 'love', 'scarves', '.']


In [38]:
# import nltk
# nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Lapmart\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [39]:
# import nltk
# nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Lapmart\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [40]:
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print(lemmatized_words)

['The', 'friend', 'of', 'Desoto', 'love', 'scarf', '.']
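
`WordNetLemmatizer.lemmatize` treats every word as a noun unless told otherwise, because its `pos` parameter defaults to `"n"`. One common approach is to map the Penn Treebank tags from `pos_tag` to WordNet's single-letter codes first. The sketch below shows only that mapping, so it runs without WordNet; `penn_to_wordnet` is a hypothetical helper name and the tags are written by hand for illustration.

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS letter (hypothetical helper)."""
    if tag.startswith('JJ'):
        return 'a'  # adjective
    if tag.startswith('VB'):
        return 'v'  # verb
    if tag.startswith('RB'):
        return 'r'  # adverb
    return 'n'      # noun, the same default lemmatize() uses


# Hand-written tags for illustration; pos_tag would normally supply them.
tagged = [('The', 'DT'), ('friends', 'NNS'), ('love', 'VBP'), ('scarves', 'NNS')]

wordnet_pos = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(wordnet_pos)

# With WordNet downloaded, each pair could then be passed on as
# lemmatizer.lemmatize(word, pos=pos).
```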


## Chunking

* While tokenizing allows you to identify words and sentences, chunking allows you to identify phrases.
* Chunking makes use of POS tags to group words and apply chunk tags to those groups.
* Chunks don’t overlap, so one instance of a word can be in only one chunk at a time.

In [57]:
from nltk.tokenize import word_tokenize
from nltk import pos_tag

In [44]:
sample = "It's a dangerous business, Frodo, going out your door."

In [47]:
words = word_tokenize(sample)
print(words)

['It', "'s", 'a', 'dangerous', 'business', ',', 'Frodo', ',', 'going', 'out', 'your', 'door', '.']


In [58]:
tagpos = pos_tag(words)
tagpos

[('It', 'PRP'),
 ("'s", 'VBZ'),
 ('a', 'DT'),
 ('dangerous', 'JJ'),
 ('business', 'NN'),
 (',', ','),
 ('Frodo', 'NNP'),
 (',', ','),
 ('going', 'VBG'),
 ('out', 'RP'),
 ('your', 'PRP$'),
 ('door', 'NN'),
 ('.', '.')]

In [59]:
grammar = "NP: {<DT>?<JJ>*<NN>}"  # look up the meaning of DT, JJ, NN, ... with nltk.help.upenn_tagset()
# NP stands for noun phrase
# Start with an optional (?) determiner (DT): zero or one occurrence
# Follow with any number (*) of adjectives (JJ): zero or more occurrences
# End with a noun (NN): exactly one occurrence

In [53]:
import nltk

chunk_parser = nltk.RegexpParser(grammar)

In [54]:
tree = chunk_parser.parse(tagpos) # The parse method is used to parse the POS-tagged words into a chunk tree.

In [55]:
tree.draw()
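
`tree.draw()` opens a separate window, which is not always available. As an alternative, the chunks can be read out of the tree as text. The sketch below assumes NLTK is installed and reuses the tagged tokens from the output above, so it does not need the tokenizer or tagger models.

```python
from nltk import RegexpParser

# Tagged tokens copied from the pos_tag output above.
tagged = [('It', 'PRP'), ("'s", 'VBZ'), ('a', 'DT'), ('dangerous', 'JJ'),
          ('business', 'NN'), (',', ','), ('Frodo', 'NNP'), (',', ','),
          ('going', 'VBG'), ('out', 'RP'), ('your', 'PRP$'), ('door', 'NN'),
          ('.', '.')]

parser = RegexpParser("NP: {<DT>?<JJ>*<NN>}")
tree = parser.parse(tagged)

# Collect the words of every subtree labeled NP.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(lambda t: t.label() == "NP")
]
print(noun_phrases)  # ['a dangerous business', 'door']
```

'Frodo' is not chunked because its tag is `NNP`, and the pattern `<NN>` matches the tag `NN` only.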

## Chinking

* Chinking is used together with chunking, but while chunking is used to include a pattern, chinking is used to exclude a pattern.


In [71]:
from nltk.tokenize import word_tokenize
from nltk import pos_tag

In [73]:
sample = "It's a dangerous business, Frodo, going out your door."
words = word_tokenize(sample)
tagpos = pos_tag(words)
tagpos

[('It', 'PRP'),
 ("'s", 'VBZ'),
 ('a', 'DT'),
 ('dangerous', 'JJ'),
 ('business', 'NN'),
 (',', ','),
 ('Frodo', 'NNP'),
 (',', ','),
 ('going', 'VBG'),
 ('out', 'RP'),
 ('your', 'PRP$'),
 ('door', 'NN'),
 ('.', '.')]

In [61]:
grammar = """
chunk: {<.*>+}
}<JJ>{
"""

# 'chunk: {<.*>+}': this rule matches any sequence of one or more (+) POS tags (<.*> matches any tag),
# so it initially puts the whole sentence into a single chunk.

# '}<JJ>{': this is a chinking rule, which removes adjectives (<JJ>) from the chunks created by the previous rule.

In [62]:
import nltk

chunk_parser = nltk.RegexpParser(grammar)

In [65]:
tree = chunk_parser.parse(tagpos) # The parse method is used to parse the POS-tagged words into a chunk tree.

In [69]:
tree.draw()   # the adjective 'dangerous' (JJ) is excluded, which splits the sentence into two chunks
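
The effect of the chink rule can also be checked without the drawing window. This sketch assumes NLTK is installed and reuses the tagged tokens from the output above; it lists the words of each remaining chunk.

```python
from nltk import RegexpParser

# Tagged tokens copied from the pos_tag output above.
tagged = [('It', 'PRP'), ("'s", 'VBZ'), ('a', 'DT'), ('dangerous', 'JJ'),
          ('business', 'NN'), (',', ','), ('Frodo', 'NNP'), (',', ','),
          ('going', 'VBG'), ('out', 'RP'), ('your', 'PRP$'), ('door', 'NN'),
          ('.', '.')]

grammar = """
chunk: {<.*>+}
}<JJ>{
"""

tree = RegexpParser(grammar).parse(tagged)

# Words of each chunk that survives the chink rule.
chunks = [
    [word for word, tag in subtree.leaves()]
    for subtree in tree.subtrees(lambda t: t.label() == "chunk")
]
# Words that ended up inside some chunk.
chunked_words = {word for chunk in chunks for word in chunk}

print(chunks)
print('dangerous' in chunked_words)  # False
```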

## Named Entity Recognition (NER)

* Named entities are noun phrases that refer to specific locations, people, organizations, and so on.
* With named entity recognition, you can find the named entities in your texts and also determine what kind of named entity they are.


In [77]:
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag

nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\Lapmart\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\Lapmart\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\words.zip.


True

In [85]:
from nltk.tokenize import word_tokenize
from nltk import pos_tag

sample = "It's a dangerous business, Frodo, going out your door."
words = word_tokenize(sample)
tagpos = pos_tag(words)

In [86]:
tree = nltk.ne_chunk(tagpos)

In [87]:
tree.draw()  # Frodo is labeled PERSON

In [89]:
msg = "Obama was the president of the America"
words = word_tokenize(msg)
tagpos = nltk.pos_tag(words)

In [90]:
tree = nltk.ne_chunk(tagpos)

In [94]:
tree.draw()  # Obama is labeled PERSON, America is labeled GPE
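
To use the entities in code instead of viewing them in a window, you can walk the tree and collect the labeled subtrees. The sketch below assumes NLTK is installed. So that it runs without the chunker models, it works on a hand-built tree in the shape `ne_chunk` returns, with labels that mirror the result noted above. `extract_entities` is a hypothetical helper name, and the POS tags are written by hand.

```python
from nltk.tree import Tree

# Hand-built stand-in for the ne_chunk result; tags and labels are illustrative.
tree = Tree('S', [
    Tree('PERSON', [('Obama', 'NNP')]),
    ('was', 'VBD'), ('the', 'DT'), ('president', 'NN'),
    ('of', 'IN'), ('the', 'DT'),
    Tree('GPE', [('America', 'NNP')]),
])


def extract_entities(tree):
    """Return (label, text) pairs for each named-entity subtree (hypothetical helper)."""
    entities = []
    for node in tree:
        # Plain tokens are (word, tag) tuples; named entities are nested Trees.
        if isinstance(node, Tree):
            text = " ".join(word for word, tag in node.leaves())
            entities.append((node.label(), text))
    return entities


entities = extract_entities(tree)
print(entities)  # [('PERSON', 'Obama'), ('GPE', 'America')]
```

The same function should work on the tree returned by `nltk.ne_chunk(tagpos)`, since that is also a `Tree` whose named entities are nested subtrees.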