# Natural Language Processing

# Google Cloud Platform (GCP) NLP 

![image-2.png](attachment:image-2.png)

# Example - Implementation Process



![image.png](attachment:image.png)

# Base Installation

In [1]:
pip install nltk==3.5




In [2]:
pip install numpy matplotlib

Note: you may need to restart the kernel to use updated packages.


# Tokenization

Tokenizing splits text into smaller units so that you can process it piece by piece. This notebook uses two granularities:<br><br>
Tokenizing by word<br>
Tokenizing by sentence

In [3]:
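# sent_tokenize and word_tokenize rely on the "punkt" models; run nltk.download("punkt") once if they are missing
from nltk.tokenize import sent_tokenize, word_tokenize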

In [4]:
Latest = "Trump fraud trial live updates: State to question defense expert on potential fine."

In [5]:
sent_tokenize(Latest)

['Trump fraud trial live updates: State to question defense expert on potential fine.']

In [6]:
word_tokenize(Latest)

['Trump',
 'fraud',
 'trial',
 'live',
 'updates',
 ':',
 'State',
 'to',
 'question',
 'defense',
 'expert',
 'on',
 'potential',
 'fine',
 '.']
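
To make the idea of tokenizing concrete, here is a minimal sketch that approximates both granularities with regular expressions from the standard library. It is not how NLTK's tokenizers work internally, and the helper names `simple_word_tokenize` and `simple_sent_tokenize` are hypothetical, introduced only for illustration. For the headline used above it happens to give the same tokens as `word_tokenize`, but it will differ on contractions, abbreviations, and similar cases.

```python
import re

def simple_word_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sent_tokenize(text):
    """Split after '.', '!' or '?' when whitespace follows (naive: breaks on abbreviations)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

latest = "Trump fraud trial live updates: State to question defense expert on potential fine."

print(simple_word_tokenize(latest))
print(simple_sent_tokenize("Discovering is what explorers do. The crew agrees."))
```

The sentence splitter only looks for end punctuation followed by whitespace, so an abbreviation such as "U.S. Navy" would be split incorrectly. Handling such cases is what a trained sentence tokenizer is for.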

# Stop Words

In [7]:
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sbatukdeo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [8]:
Latest

'Trump fraud trial live updates: State to question defense expert on potential fine.'

In [9]:
Latest_Tokenized = word_tokenize(Latest)

In [10]:
stop_words = set(stopwords.words("english"))

In [11]:
filtered_list = []

In [12]:
for word in Latest_Tokenized:
    if word.casefold() not in stop_words:
        filtered_list.append(word)

In [13]:
# Equivalent to the loop in the previous cell, written as a list comprehension
filtered_list = [
    word for word in Latest_Tokenized if word.casefold() not in stop_words
]

In [14]:
filtered_list


['Trump',
 'fraud',
 'trial',
 'live',
 'updates',
 ':',
 'State',
 'question',
 'defense',
 'expert',
 'potential',
 'fine',
 '.']
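
The filtered list above still contains the punctuation tokens `':'` and `'.'`, because punctuation is not part of the stop-word list. The sketch below shows one possible way to drop both in a single pass. It assumes a tiny hand-picked stop-word set (`demo_stop_words`, not NLTK's full English list) so that it runs without any downloads, and the helper name `drop_stop_words_and_punct` is hypothetical.

```python
# Tokens copied from the word_tokenize output shown earlier
tokens = ['Trump', 'fraud', 'trial', 'live', 'updates', ':', 'State', 'to',
          'question', 'defense', 'expert', 'on', 'potential', 'fine', '.']

# Assumption: a small illustrative subset, NOT the full NLTK stop-word list
demo_stop_words = {"to", "on", "the", "is", "of"}

def drop_stop_words_and_punct(tokens, stop_words):
    """Keep purely alphabetic tokens that are not stop words (case-insensitive)."""
    return [t for t in tokens if t.isalpha() and t.casefold() not in stop_words]

content_words = drop_stop_words_and_punct(tokens, demo_stop_words)
print(content_words)
```

Note that `str.isalpha()` also discards tokens containing digits or hyphens, which may or may not be what you want for a given data set.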

# Stemming

In [15]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

In [16]:
stemmer = PorterStemmer()

In [17]:
string_for_stemming = """
The crew of the USS Discovery discovered many discoveries.
Discovering is what explorers do.
"""

In [18]:
words = word_tokenize(string_for_stemming)

In [19]:
words

['The',
 'crew',
 'of',
 'the',
 'USS',
 'Discovery',
 'discovered',
 'many',
 'discoveries',
 '.',
 'Discovering',
 'is',
 'what',
 'explorers',
 'do',
 '.']

In [20]:
stemmed_words = [stemmer.stem(word) for word in words]

In [21]:
stemmed_words

['the',
 'crew',
 'of',
 'the',
 'uss',
 'discoveri',
 'discov',
 'mani',
 'discoveri',
 '.',
 'discov',
 'is',
 'what',
 'explor',
 'do',
 '.']
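
The output above shows that the four "discover" forms do not all reduce to one stem. The following sketch groups the original words by their stem so this is easier to see. It reuses the two lists printed above as literals, so it runs without NLTK; the helper name `group_by_stem` is hypothetical.

```python
from collections import defaultdict

# Copied from the outputs of the cells above
words = ['The', 'crew', 'of', 'the', 'USS', 'Discovery', 'discovered', 'many',
         'discoveries', '.', 'Discovering', 'is', 'what', 'explorers', 'do', '.']
stemmed_words = ['the', 'crew', 'of', 'the', 'uss', 'discoveri', 'discov', 'mani',
                 'discoveri', '.', 'discov', 'is', 'what', 'explor', 'do', '.']

def group_by_stem(words, stems):
    """Map each stem to the distinct surface forms that produced it."""
    groups = defaultdict(list)
    for word, stem in zip(words, stems):
        if word not in groups[stem]:
            groups[stem].append(word)
    return dict(groups)

groups = group_by_stem(words, stemmed_words)

# Keep only stems shared by more than one distinct word
merged = {stem: forms for stem, forms in groups.items() if len(forms) > 1}
print(merged)
```

Here 'Discovery' and 'discoveries' share the stem 'discoveri', while 'discovered' and 'Discovering' share 'discov', so the stemmer treats them as two separate groups even though all four words are related.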

# Lemmatizing

In [22]:
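# WordNetLemmatizer relies on the "wordnet" corpus; run nltk.download("wordnet") once if it is missing
from nltk.stem import WordNetLemmatizer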

In [23]:
lemmatizer = WordNetLemmatizer()

In [31]:
lemmatizer.lemmatize("loves")


'love'

# Tagging

Part of speech is a grammatical term that deals with the roles words play when you use them together in sentences. Tagging parts of speech, or POS tagging, is the task of labeling the words in your text according to their part of speech.

In [25]:
from nltk.tokenize import word_tokenize

In [26]:
News_tagging = """
Surendra loves Python. Python, does not love Surendra."""

In [27]:
News_tagged = word_tokenize(News_tagging)

In [28]:
# pos_tag needs the 'averaged_perceptron_tagger' model (downloaded in cell In [29])
nltk.pos_tag(News_tagged)

[('Surendra', 'NNP'),
 ('loves', 'VBZ'),
 ('Python', 'NNP'),
 ('.', '.'),
 ('Python', 'NNP'),
 (',', ','),
 ('does', 'VBZ'),
 ('not', 'RB'),
 ('love', 'VB'),
 ('Surendra', 'NNP'),
 ('.', '.')]
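
The tags above can also guide lemmatization, since `WordNetLemmatizer.lemmatize` accepts a `pos` argument and treats every word as a noun by default. The sketch below shows a commonly used mapping from Penn Treebank tags to the WordNet part-of-speech letters (`'n'`, `'v'`, `'a'`, `'r'`). The helper names `penn_to_wordnet` and `lemmatize_tagged` are hypothetical, and the mapping is a simplification: anything that is not a verb, adjective, or adverb, including punctuation, falls back to noun.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    if tag.startswith("J"):
        return "a"  # adjective (JJ, JJR, JJS)
    if tag.startswith("V"):
        return "v"  # verb (VB, VBD, VBG, VBN, VBP, VBZ)
    if tag.startswith("RB"):
        return "r"  # adverb (RB, RBR, RBS)
    return "n"      # noun and everything else

def lemmatize_tagged(tagged_tokens, lemmatize):
    """Call lemmatize(word, pos) for each (word, tag) pair."""
    return [lemmatize(word, penn_to_wordnet(tag)) for word, tag in tagged_tokens]

# A few pairs copied from the pos_tag output above
tagged = [('Surendra', 'NNP'), ('loves', 'VBZ'), ('Python', 'NNP'), ('.', '.')]
print([(word, penn_to_wordnet(tag)) for word, tag in tagged])
```

With NLTK available, a call such as `lemmatize_tagged(nltk.pos_tag(News_tagged), lemmatizer.lemmatize)` would pass each word together with its mapped part of speech to the lemmatizer created earlier.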

In [29]:
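import nltk
nltk.download('averaged_perceptron_tagger')  # model used by nltk.pos_tag
# upenn_tagset() reads the "tagsets" resource; run nltk.download('tagsets') if it is missing
nltk.help.upenn_tagset()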

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\sbatukdeo\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


# Take Home Quiz

Write a pipelined (stage-by-stage) architecture, following the cell above titled "Example - Implementation Process", using any data set that has at least 5 lines and 10 words per line (minimum 50 words).

Use 4 stages: Tokenization, Tagging, Stop Words, and Lemmatization.
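
As a starting point, a stage-by-stage architecture could be sketched as a list of named functions, each consuming the output of the previous one. The runner below is only a skeleton under that assumption: `run_pipeline` and the two placeholder stages are hypothetical, use plain Python instead of NLTK, and do not implement the four required stages.

```python
def run_pipeline(data, stages):
    """Feed data through each (name, function) stage in order.

    Returns a dict with the intermediate result of every stage,
    which makes it easy to inspect the pipeline step by step.
    """
    results = {}
    for name, stage in stages:
        data = stage(data)
        results[name] = data
    return results

# Placeholder stages for illustration only; swap in NLTK-based stages for the quiz
def tokenize(lines):
    return [line.split() for line in lines]

def lowercase(token_lines):
    return [[t.lower() for t in tokens] for tokens in token_lines]

demo_lines = ["The crew of the USS Discovery", "Discovering is what explorers do"]
results = run_pipeline(demo_lines, [("tokenize", tokenize), ("lowercase", lowercase)])
print(results["lowercase"])
```

Keeping every intermediate result makes it straightforward to print the output of each stage, which mirrors the cell-by-cell style used throughout this notebook.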