## In this article, we apply basic NLP techniques to a paragraph using the NLTK library.

<b> Importing the NLTK library and downloading its packages.

<b> Install the NLTK library using the pip command.

If NLTK is already installed in your notebook environment, pip reports "Requirement already satisfied"; otherwise it downloads and installs the library.

In [1]:
# Install the NLTK library using pip command.
!pip install nltk



<b> In my case it is already installed, so pip shows "Requirement already satisfied".
    
<b> Import the NLTK library:
    
With the "**import nltk**" statement we import the NLTK library for the operations that follow.

In [2]:
# import nltk library
import nltk

<b> Downloading all packages from the NLTK library

To download packages from the NLTK library, use the "**nltk.download()**" command. Called without arguments, it opens the interactive NLTK downloader, where you can select all packages; they are then downloaded and unzipped, including data such as the tokenizer models, the stop word lists, and WordNet for the lemmatizer.

In [3]:
# open the NLTK downloader to download the packages
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

<b> Define the paragraph on which we will apply the NLP steps.

In [4]:
paragraph = "He is a good boy. She is a good girl. boy & girl are good."

## Tokenization
<b> Tokenization means splitting text into smaller meaningful units (tokens), such as sentences or words. There are sentence tokenizers as well as word tokenizers.
    
<b> Sentence Tokenization
    
A sentence tokenizer splits a paragraph into sentences.

In [5]:
# Sentence Tokenization
sentences = nltk.sent_tokenize(paragraph)

# print the result
sentences

['He is a good boy.', 'She is a good girl.', 'boy & girl are good.']

<b> Word Tokenization
    
A word tokenizer splits text into individual words and punctuation tokens.

In [6]:
# Word Tokenization
words = nltk.word_tokenize(paragraph)

# print the result
words

['He',
 'is',
 'a',
 'good',
 'boy',
 '.',
 'She',
 'is',
 'a',
 'good',
 'girl',
 '.',
 'boy',
 '&',
 'girl',
 'are',
 'good',
 '.']
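
To make the idea of tokenization concrete, here is a minimal sketch that uses only Python's `re` module. It is not how NLTK's tokenizers work internally; the two regular expressions are simplifying assumptions that happen to give the same result on our short paragraph, and the names `naive_sentences` and `naive_words` are made up for this illustration.

```python
import re

paragraph = "He is a good boy. She is a good girl. boy & girl are good."

# Naive sentence split: break on whitespace that follows ".", "!" or "?".
# Assumption: no abbreviations such as "Dr. Smith", which this rule would split.
naive_sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())

# Naive word split: a run of word characters is one token, and every
# other non-space character (punctuation) becomes its own token.
naive_words = re.findall(r"\w+|[^\w\s]", paragraph)

print(naive_sentences)
print(naive_words)
```

On real text, prefer `nltk.sent_tokenize` and `nltk.word_tokenize`, which handle abbreviations, contractions, and other cases that these simple patterns miss.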

## Text Cleaning
<b> Punctuation Removal:
    
- Removing punctuation is a common cleaning step because, for many tasks, punctuation adds little information to the data. Removing it reduces the data size and can therefore improve computational efficiency.
    
    
- The punctuation characters are ('!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~'), excluding the outer parentheses and quotes.

In [7]:
# import the regular expression module
import re

# initialise the "corpus" empty list
corpus = []

# iterate over each sentence
for sentence in sentences:
    # replace every non-letter character (punctuation, digits) with a space and store it as "rp"
    rp = re.sub('[^a-zA-Z]', " ", sentence)
    # append the result to the corpus list
    corpus.append(rp)
    
    
# print the corpus
print(corpus)

['He is a good boy ', 'She is a good girl ', 'boy   girl are good ']
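
Note that the pattern `[^a-zA-Z]` replaces every non-letter, so digits are removed as well. If only the punctuation characters listed earlier should go, one possible alternative is `str.translate` with `string.punctuation`. This is a sketch of that variant, not part of the original pipeline; the extra sentence with digits is an invented example.

```python
import string

sentences = ["He is a good boy.", "She is a good girl.", "boy & girl are good."]

# Translation table that deletes every character in string.punctuation.
remove_punct = str.maketrans("", "", string.punctuation)

# Apply the table to each sentence; letters, digits and spaces are kept.
cleaned = [s.translate(remove_punct) for s in sentences]
print(cleaned)

# Digits survive, unlike with the '[^a-zA-Z]' pattern.
print("Room 42, floor 3!".translate(remove_punct))
```

Unlike the regex version, this deletes the characters instead of replacing them with spaces, so "boy & girl" keeps the two spaces that surrounded the "&".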


<b> Stop Words Removal:
    
- Stop words are words that occur frequently in sentences but carry little meaning on their own (for example "is", "a", "the"). They are usually not important for prediction, so we remove them to reduce the data size, which can also help reduce overfitting. Note: lowercase the text before filtering, since the stop words in NLTK's list are lowercase.
    
    
- Using the NLTK library, we can filter the stop words out of our dataset.

In [8]:
# import the stopwords corpus reader from nltk.corpus
from nltk.corpus import stopwords

# build the set of English stop words once, so each lookup is fast
stop_words = set(stopwords.words("english"))

# initialise the "corpus" empty list
corpus = []

# iterate over each sentence
for sentence in sentences:
    # replace every non-letter character with a space and store it as "rp"
    rp = re.sub('[^a-zA-Z]', " ", sentence)
    # lowercase the text
    rp = rp.lower()
    # split the sentence into words
    rp = rp.split()
    # remove the stop words (using a list comprehension)
    rp = [word for word in rp if word not in stop_words]
    # join the words back together
    rp = " ".join(rp)
    # append the result to the corpus list
    corpus.append(rp)
    
# print the corpus
print(corpus)


['good boy', 'good girl', 'boy girl good']
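
The note about lowercasing can be checked with a small sketch. The stop list below is a tiny hypothetical one written for this illustration (NLTK's English list is much longer), and the variable names are assumptions.

```python
# A tiny, hypothetical stop list for illustration only.
stop_words = {"he", "she", "is", "a", "are"}

tokens = "He is a good boy".split()

# Without lowercasing, "He" does not match "he" and is kept.
kept_raw = [t for t in tokens if t not in stop_words]

# Lowercasing first lets every stop word match.
kept_lower = [t for t in (w.lower() for w in tokens) if t not in stop_words]

print(kept_raw)    # ['He', 'good', 'boy']
print(kept_lower)  # ['good', 'boy']
```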


<b> Stemming

Stemming reduces words to a root form by applying a fixed set of rules, irrespective of meaning.

The resulting stem may or may not be a meaningful word.

In [9]:
# import the PorterStemmer class from nltk.stem
from nltk.stem import PorterStemmer    

# initialize the PorterStemmer as "ps"
ps = PorterStemmer()          

# apply stemming to the word "history"
ps.stem("history")

'histori'

<b> Lemmatization
    
Lemmatization converts words to their base form (lemma) using a vocabulary mapping. It takes the word's part of speech and meaning into account, so it does not generate meaningless root words. However, lemmatization is slower than stemming.  
    
Unlike stemming, lemmatization returns meaningful words.

In [10]:
# import the WordNetLemmatizer class from nltk.stem
from nltk.stem import WordNetLemmatizer   

# initialize the WordNetLemmatizer as "wnl"
wnl = WordNetLemmatizer()          

# apply lemmatization to the word "history" (treated as a noun by default)
wnl.lemmatize("history")

'history'
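
The difference between the two approaches can be sketched with a toy example. The suffix rules and the lemma table below are hypothetical and far simpler than the Porter algorithm or WordNet; they only illustrate that stemming applies rules blindly while lemmatization looks words up in a vocabulary.

```python
# Toy suffix rules (hypothetical, much simpler than the Porter algorithm).
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("y", "i"), ("s", "")]

# Toy lemma table (hypothetical stand-in for a vocabulary such as WordNet).
LEMMA_TABLE = {"studies": "study", "running": "run", "went": "go"}


def toy_stem(word):
    """Apply the first matching suffix rule without checking the result is a word."""
    for suffix, replacement in SUFFIX_RULES:
        # Require a few leading characters so very short words stay intact.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word


def toy_lemmatize(word):
    """Look the word up in the table and fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)


for w in ["history", "studies", "running", "went"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

The toy stemmer turns "history" into "histori" and "running" into "runn", neither of which is a word, while the lookup always returns a dictionary form and can handle irregular cases such as "went".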

## Text Cleaning (Punctuation Removal + Stop Words Removal + Stemming/Lemmatization)

<b> Here we apply all these steps in a single cell.

In [11]:
# build the set of English stop words once, so each lookup is fast
stop_words = set(stopwords.words('english'))

# initialise the "corpus" empty list
corpus = []

# iterate over each sentence
for sentence in sentences:
    # replace every non-letter character with a space and store it as "rp"
    rp = re.sub('[^a-zA-Z]', " ", sentence)
    # lowercase the text
    rp = rp.lower()
    # split the sentence into words
    rp = rp.split()
    # drop the stop words and stem the remaining words (using a list comprehension)
    rp = [ps.stem(word) for word in rp if word not in stop_words]
    # join the words back together
    rp = " ".join(rp)
    # append the result to the corpus list
    corpus.append(rp)
    
# print the corpus
print(corpus)

['good boy', 'good girl', 'boy girl good']


## Vectorization:
Vectorization is the process of converting textual data into numerical vectors. It is usually applied after the text has been cleaned.

<b> 1. CountVectorizer (bag of words):
    
- CountVectorizer is one of the simplest techniques for converting text into vectors. It starts by tokenizing each document into a list of tokens (words). It then selects the unique tokens and builds a vocabulary from them. Finally, it creates a sparse matrix of word frequencies, where each row represents one sentence (document) and each column represents one unique word.

In [12]:
# import the CountVectorizer class from the sklearn.feature_extraction.text module
from sklearn.feature_extraction.text import CountVectorizer

# create the object "cv" of class CountVectorizer
cv = CountVectorizer()

# fit and transform the corpus and store the result as "bow"
bow = cv.fit_transform(corpus).toarray()          # toarray converts the sparse matrix to an ndarray

# print the result
bow

array([[1, 0, 1],
       [0, 1, 1],
       [1, 1, 1]], dtype=int64)
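
The counting step can be reproduced by hand with the standard library. This is a minimal sketch, assuming whitespace tokenization and an alphabetically sorted vocabulary; CountVectorizer's own tokenizer differs in details (for example, by default it ignores single-character tokens), but on our cleaned corpus the result matches the array above.

```python
from collections import Counter

corpus = ["good boy", "good girl", "boy girl good"]

# Vocabulary: sorted unique tokens, giving the column order boy, girl, good.
vocabulary = sorted({token for doc in corpus for token in doc.split()})

# One row per document, one column per vocabulary word.
bow_rows = []
for doc in corpus:
    counts = Counter(doc.split())
    bow_rows.append([counts[word] for word in vocabulary])

print(vocabulary)  # ['boy', 'girl', 'good']
print(bow_rows)    # [[1, 0, 1], [0, 1, 1], [1, 1, 1]]
```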

<b> 2. TF-IDF:
    
- TF-IDF, or **Term Frequency–Inverse Document Frequency**, is a statistical measure of how relevant a word is to a document within a collection of documents. It combines two metrics, term frequency and inverse document frequency, to produce a relevance score.


The **Term Frequency** is the frequency of a word in a document. It is calculated by dividing the number of occurrences of the word in a document by the total number of words in that document. (scikit-learn's TfidfVectorizer uses the raw count instead and normalizes each row afterwards.)


The **Inverse Document Frequency** is a measure of how much information a word provides. Words like "the", for example, occur very frequently but provide little context or value to a sentence. It is calculated as the logarithm of the inverse document frequency, that is, log(N / df), where N is the total number of documents and df is the number of documents that contain the word.

- With scikit-learn's default L2 normalization, TF-IDF scores range from 0 to 1. The closer a score is to 1, the more important the word is to the document.

In [16]:
# import the TfidfVectorizer class from the sklearn.feature_extraction.text module
from sklearn.feature_extraction.text import TfidfVectorizer

# create the object "tf" of class TfidfVectorizer
tf = TfidfVectorizer()

# fit and transform the corpus and store the result as "tfidf"
tfidf = tf.fit_transform(corpus).toarray()          # toarray converts the sparse matrix to an ndarray

# print the result
tfidf

array([[0.78980693, 0.        , 0.61335554],
       [0.        , 0.78980693, 0.61335554],
       [0.61980538, 0.61980538, 0.48133417]])
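
To see where these numbers come from, the calculation can be repeated by hand. The sketch below assumes scikit-learn's documented defaults: raw counts as term frequency, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The variable names are chosen for this illustration.

```python
import math
from collections import Counter

corpus = ["good boy", "good girl", "boy girl good"]
docs = [doc.split() for doc in corpus]
vocabulary = sorted({t for d in docs for t in d})   # ['boy', 'girl', 'good']
n_docs = len(docs)

# Document frequency: number of documents that contain each word.
df = {w: sum(w in d for d in docs) for w in vocabulary}

# Smoothed IDF (assumed scikit-learn default): ln((1 + n) / (1 + df)) + 1
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocabulary}

tfidf_rows = []
for d in docs:
    counts = Counter(d)
    raw = [counts[w] * idf[w] for w in vocabulary]   # raw count times IDF
    norm = math.sqrt(sum(x * x for x in raw))        # L2 norm of the row
    tfidf_rows.append([round(x / norm, 8) for x in raw])

for row in tfidf_rows:
    print(row)
```

"good" appears in all three documents, so it gets the lowest IDF (exactly 1) and therefore the lowest score in every row, while "boy" and "girl" appear in two documents each and score higher.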

<b> 3. N-grams

- An N-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be letters, words, or base pairs, depending on the application. N-grams are typically collected from a text or speech corpus (a long text dataset).


- Here, n is the number of consecutive words that are combined into one unit.
    
    
- To implement n-grams, we use the ngrams function from NLTK, which builds the n-grams for us.

<b> Define the sentence on which we will build the n-grams.

In [20]:
sentence2 = "This is a sentence"

In [21]:
# import the ngrams function from the nltk library
from nltk import ngrams

# build a generator "n_grams" of bigrams (n=2) from the list of words
n_grams = ngrams(sentence2.split(), 2)

# iterate over the n-grams
for grams in n_grams:
    print(grams)         # print each n-gram as a tuple of words

('This', 'is')
('is', 'a')
('a', 'sentence')


In the above, we first imported the ngrams function from the nltk library and then called it to create a generator of n-grams. The parameters of this function are:

- sequence: the source data to be converted into n-grams.


- n: the size of the n-gram, e.g. 1 for unigrams, 2 for bigrams, 3 for trigrams.

We then used a for loop to iterate over the n-grams and printed each one. In the above we passed n=2, i.e. bigrams. To get unigrams or other n-grams, change the value of n.
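
For intuition, the same sliding window can be written in a few lines of plain Python. This is a sketch of the idea and not NLTK's implementation; `make_ngrams` is a hypothetical helper name.

```python
def make_ngrams(tokens, n):
    """Return all contiguous n-token tuples from a list of tokens."""
    # zip stops at the shortest slice, so exactly len(tokens) - n + 1 tuples result.
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = "This is a sentence".split()

print(make_ngrams(tokens, 1))  # unigrams
print(make_ngrams(tokens, 2))  # bigrams, same tuples as printed by nltk above
print(make_ngrams(tokens, 3))  # trigrams
```

If n is larger than the number of tokens, the result is simply an empty list.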

## Part-of-Speech (POS) Tagging
- Part-of-Speech (POS) tagging is the process of assigning a part of speech to each word.


- In practice, POS tagging converts a sentence, given as a list of words, into a list of tuples. Each tuple has the form (word, tag).


**Example:**
Let us understand it with a Python example:

In [23]:
# import the nltk library.
import nltk

# import the word_tokenize function from the nltk library.
from nltk import word_tokenize

# create a sentence.
sentence3 = "I am going to school"

# tokenize the sentence, tag each token with pos_tag()
# and print the result.
print(nltk.pos_tag(word_tokenize(sentence3)))

[('I', 'PRP'), ('am', 'VBP'), ('going', 'VBG'), ('to', 'TO'), ('school', 'NN')]


<b> In the above, we obtained the part-of-speech tag of each word in the sentence. The tags "PRP", "VBP", and "VBG" are discussed in the 0.1 Natural Language Processing (NLP) Notes.
