# NLP 

Natural language processing (NLP) is a subfield of artificial intelligence that helps computers understand, interpret, and use human languages.

NLP allows computers to communicate with people in human languages.

NLP also gives computers the ability to read text, hear speech, and try to interpret it. NLP draws on several disciplines, including computational linguistics and computer science, to bridge the gap between human and computer communication.


Let's take a look at 10 of the most interesting applications of natural language processing in business:

1. Sentiment Analysis
2. Text Classification
3. Chatbots & Virtual Assistants
4. Text Extraction
5. Machine Translation
6. Text Summarization
7. Auto-Correct
8. Intent Classification
9. Urgency Detection
10. Speech Recognition

## Text Processing 

Suppose we have text data with a lot of variation. We need to apply several pre-processing steps to transform the words into numerical features that machine learning algorithms can work with, because in the end ML algorithms only understand numbers.
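To make "turning words into numbers" concrete, here is a minimal bag-of-words sketch in plain Python. The two documents and the whitespace tokenizer are assumptions made for illustration; real pipelines usually rely on a library vectorizer.

```python
from collections import Counter

# Two toy documents (made up for illustration).
docs = [
    "NLP helps computers understand text",
    "computers understand numbers not text",
]

# Lowercase and split on whitespace: a deliberately crude tokenizer.
tokenized = [doc.lower().split() for doc in docs]

# The vocabulary is the sorted set of all distinct tokens.
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# Each document becomes a vector of word counts, one slot per vocabulary word.
vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each position in a vector corresponds to one vocabulary word, so documents of any length end up as fixed-size numeric rows that an ML algorithm can consume.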

### NLTK

In [37]:
!pip install nltk  # only needed if NLTK is not already installed



The Natural Language Toolkit, more commonly known as NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing of English, written in the Python programming language.
Much of the data you may want to analyze is unstructured and consists of human-readable text. NLTK is to text data what pandas is to tabular data.

In [3]:
# import the necessary libraries
import nltk    # Natural Language Toolkit
import string  # common string operations
import re      # regular expression matching (already covered in the Python section)

### Basic Text Manipulation

### Uppercase & Lowercase

In [4]:
text1= "I am Going to Learn NLP" 

In [5]:
print(text1)

I am Going to Learn NLP


In [7]:
text1.lower()  # lowercasing the text reduces the size of its vocabulary

'i am going to learn nlp'

In [8]:
text1.upper()

'I AM GOING TO LEARN NLP'

### Remove Punctuation

In [20]:
def rem_punct(text): 
    translator = str.maketrans('', '', string.punctuation) 
    return text.translate(translator) 
  
input_str = "Hi!, Guys are you excited ?? we will start CV soon !!"
rem_punct(input_str) 

'Hi Guys are you excited  we will start CV soon '
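Note that deleting punctuation outright leaves doubled spaces and glues hyphenated words together. A possible alternative, sketched below with the standard `re` module, replaces each punctuation character with a space and then collapses whitespace. The function name `rem_punct_keep_spacing` is a hypothetical helper, not something from NLTK.

```python
import re
import string

# Character class matching any single ASCII punctuation character.
PUNCT_RE = re.compile(f"[{re.escape(string.punctuation)}]")

def rem_punct_keep_spacing(text):
    """Replace punctuation with spaces, then collapse repeated whitespace."""
    no_punct = PUNCT_RE.sub(" ", text)
    return " ".join(no_punct.split())

print(rem_punct_keep_spacing("Hi!, Guys are you excited ?? we will start CV soon !!"))
print(rem_punct_keep_spacing("well-known state-of-the-art"))
```

Which behaviour is preferable depends on the task: deleting the hyphen in "well-known" gives one token, while replacing it with a space gives two.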

In [34]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

### Remove Default Stopwords

Stopwords are common words (such as "is", "a", and "the") that contribute little to the meaning of a sentence, so they can usually be removed without changing what the sentence says. The NLTK (Natural Language Toolkit) library ships a list of stopwords, which we can use to remove them from our text and return a list of word tokens.

In [25]:
# A corpus is a collection of authentic text or audio organized into datasets.
# "Authentic" here means text written or audio spoken by a native of the language or dialect.
# A corpus can be made up of everything from newspapers, novels, recipes, and radio broadcasts
# to television shows, movies, and tweets.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Pranav\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Pranav\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [26]:
def remove_stopwords(text):
    stop_words = set(stopwords.words("english"))
    word_tokens = word_tokenize(text)
    # keep only tokens that are not stop words; compare in lowercase because
    # the NLTK stop list is lowercase (so "The" is removed as well as "the")
    filtered_text = [word for word in word_tokens if word.lower() not in stop_words]
    return filtered_text
  

In [29]:
ex_text = "Natural language processing is a subset of Artificial intelligence that helps computers to understand the human languages"
remove_stopwords(ex_text)

['Natural',
 'language',
 'processing',
 'subset',
 'Artificial',
 'intelligence',
 'helps',
 'computers',
 'understand',
 'human',
 'languages']
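The stop list that NLTK provides is lowercase, and many projects also need to drop domain-specific words. The sketch below illustrates both points with a tiny hand-written stop list so that it runs without any downloads. `BASE_STOP_WORDS`, `remove_stopwords_custom`, and the regex tokenizer are assumptions for illustration; in practice you would start from `stopwords.words("english")` and `word_tokenize`.

```python
import re

# A tiny hand-written stop list, used only so this sketch runs offline.
BASE_STOP_WORDS = {"is", "a", "of", "that", "to", "the"}

def remove_stopwords_custom(text, extra_stop_words=()):
    """Drop stop words case-insensitively, with optional extra words."""
    stop_words = BASE_STOP_WORDS | {w.lower() for w in extra_stop_words}
    tokens = re.findall(r"[A-Za-z']+", text)  # crude stand-in for word_tokenize
    return [t for t in tokens if t.lower() not in stop_words]

text = "The model is a subset of the language tools that helps computers"
print(remove_stopwords_custom(text))
print(remove_stopwords_custom(text, extra_stop_words=["model", "tools"]))
```

Because the comparison uses `t.lower()`, the capitalized "The" at the start of the sentence is removed along with the lowercase "the".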

### Stemming

Stemming is the process of reducing a word to its root form. The root, or stem, is the part of the word to which affixes (such as -ed or -ize) are attached. A stemmer produces the stem by stripping prefixes or suffixes from the word, so the result is not always a real word.

For example: Mangoes ---> Mango, Boys ---> Boy, going ---> go

Stemming works on individual words, so text that is not yet tokenized must first be split into tokens; each token is then reduced to its root form. NLTK provides several stemmers, including the Porter stemmer, the Snowball stemmer, and the Lancaster stemmer. We usually use the Porter stemmer.

In [30]:
#importing nltk's porter stemmer 
from nltk.stem.porter import PorterStemmer #Stemming
from nltk.tokenize import word_tokenize 
stem1 = PorterStemmer() 
  
# stem words in the list of tokenised words 
def s_words(text): 
    word_tokens = word_tokenize(text) 
    stems = [stem1.stem(word) for word in word_tokens] 
    return stems 
  
text = 'Data is the new revolution in the World, in a day one individual would generate terabytes of data.'
s_words(text)

['data',
 'is',
 'the',
 'new',
 'revolut',
 'in',
 'the',
 'world',
 ',',
 'in',
 'a',
 'day',
 'one',
 'individu',
 'would',
 'gener',
 'terabyt',
 'of',
 'data',
 '.']
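Stems such as 'revolut' and 'gener' in the output above are not dictionary words because a stemmer only strips affixes by rule. The toy function below is a deliberately naive suffix stripper written to show that idea. It is not the Porter algorithm, and the suffix list and the name `naive_stem` are assumptions for illustration.

```python
# Suffixes to try, in order; "es" is checked before "s" so that
# "mangoes" loses "es" rather than just "s".
SUFFIXES = ("ing", "es", "ed", "s")

def naive_stem(word, min_stem_length=2):
    """Strip the first matching suffix (a toy rule, NOT the Porter stemmer)."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        # only strip when enough of the word is left to act as a stem
        if word.endswith(suffix) and len(stem) >= min_stem_length:
            return stem
    return word

for w in ["Mangoes", "Boys", "going", "generates", "is"]:
    print(w, "->", naive_stem(w))
```

Here "generates" is cut down to "generat", which is not an English word, and "is" is left alone because stripping "s" would leave too short a stem. Real stemmers apply many more rules, but the trade-off is the same.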

### Lemmatization

Lemmatization, like stemming, reduces a word to a base form, but it ensures that the result (the lemma) is a valid word in the language. In NLTK (Natural Language Toolkit), we use WordNetLemmatizer to get the lemmas of words. The lemmatizer also needs context to pick the right lemma, so we pass the part of speech (pos) as a parameter.

In [41]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
nltk.download('wordnet')  # WordNet data used by the lemmatizer
lemma = WordNetLemmatizer()

# lemmatize every token in a string
def lemmatize_word(text):
    word_tokens = word_tokenize(text)
    # provide context, i.e. the part of speech (pos); 'v' treats every token as a verb
    lemmas = [lemma.lemmatize(word, pos='v') for word in word_tokens]
    return lemmas
  
text = 'Data is the new revolution in the World, in a day one individual would generate terabytes of data.'
lemmatize_word(text)

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Pranav\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


['Data',
 'be',
 'the',
 'new',
 'revolution',
 'in',
 'the',
 'World',
 ',',
 'in',
 'a',
 'day',
 'one',
 'individual',
 'would',
 'generate',
 'terabytes',
 'of',
 'data',
 '.']
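In the cell above, `pos='v'` is applied to every token, so only verbs are reduced ('is' becomes 'be') while nouns such as 'terabytes' are left unchanged. One common workaround is to derive the WordNet part of speech from each token's Penn Treebank tag, the tag format produced by `nltk.pos_tag`. The helper below is a minimal sketch of that mapping. The name `penn_to_wordnet` and the sample tags are illustrative assumptions, not part of NLTK.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter ('n', 'v', 'a', 'r')."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # fall back to noun, the lemmatizer's own default

# Hand-written (word, tag) pairs standing in for pos_tag output.
tagged = [("Are", "NNP"), ("you", "PRP"), ("going", "VBG"), ("somewhere", "RB")]
for word, tag in tagged:
    print(word, tag, penn_to_wordnet(tag))
```

With NLTK loaded, the result could be passed on as `lemma.lemmatize(word, pos=penn_to_wordnet(tag))`, so that each token is lemmatized according to its own part of speech.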

### Parts of Speech (POS) Tagging

A part-of-speech (POS) tag describes how a word is used in a sentence. The same word can take on different roles and meanings depending on its context. Basic NLP models such as bag-of-words (BoW) fail to capture these relationships between words. POS tagging addresses this by labelling each word with a tag based on its context in the data. POS tags are also used to extract relationships between words.

In [35]:
# importing tokenize library
from nltk.tokenize import word_tokenize 
from nltk import pos_tag 
nltk.download('averaged_perceptron_tagger')
  
# convert text into word_tokens with their tags 
def pos_tagg(text): 
    word_tokens = word_tokenize(text) 
    return pos_tag(word_tokens) 
  
pos_tagg('Are you going of somewhere?') 

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Pranav\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


[('Are', 'NNP'),
 ('you', 'PRP'),
 ('going', 'VBG'),
 ('of', 'IN'),
 ('somewhere', 'RB'),
 ('?', '.')]

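In the example above, NNP stands for proper noun (singular), PRP for personal pronoun, VBG for verb (gerund or present participle), IN for preposition or subordinating conjunction, and RB for adverb. The full list of tags and their meanings is given by the Penn Treebank tagset.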
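To read tagger output more easily, the tags can be looked up in a small dictionary. The sketch below covers only the tags that appear in the example above. `PENN_TAGS` and `describe_tags` are hypothetical names, and the descriptions follow the Penn Treebank tagset.

```python
# Descriptions for a handful of Penn Treebank tags (a partial list).
PENN_TAGS = {
    "NNP": "proper noun, singular",
    "PRP": "personal pronoun",
    "VBG": "verb, gerund or present participle",
    "IN": "preposition or subordinating conjunction",
    "RB": "adverb",
    ".": "sentence-final punctuation",
}

def describe_tags(tagged_tokens):
    """Attach a readable description to each (word, tag) pair."""
    return [(word, tag, PENN_TAGS.get(tag, "unknown tag"))
            for word, tag in tagged_tokens]

# The tagger output from the cell above, copied in by hand.
tagged = [('Are', 'NNP'), ('you', 'PRP'), ('going', 'VBG'),
          ('of', 'IN'), ('somewhere', 'RB'), ('?', '.')]
for row in describe_tags(tagged):
    print(row)
```

Tags missing from the dictionary are reported as "unknown tag" rather than raising an error, which keeps the helper usable on arbitrary tagger output.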

## Using Regular Expressions (RE) to Extract Emails from Text
1. Create a text file containing some email addresses mixed with other text.
2. Read the text file into Python.
3. Extract all email addresses from the text.
4. Store them in CSV or Excel format.
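The four steps can be sketched with the standard library alone. The file names, the sample addresses, and the email pattern are all assumptions for illustration. The pattern covers common address shapes and is not a full validator.

```python
import csv
import re
import tempfile
from pathlib import Path

# A simple pattern for common addresses; it is not a complete email validator.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

workdir = Path(tempfile.mkdtemp())
text_path = workdir / "sample.txt"  # hypothetical file names
csv_path = workdir / "emails.csv"

# 1. Create a text file containing emails mixed with other text.
text_path.write_text(
    "Contact alice@example.com for details.\n"
    "Sales: bob.smith@example.org, support: help@example.co.in\n",
    encoding="utf-8",
)

# 2. Read the file back into Python.
content = text_path.read_text(encoding="utf-8")

# 3. Extract every email address.
emails = EMAIL_RE.findall(content)

# 4. Store the result as CSV, one address per row under an "email" header.
with csv_path.open("w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["email"])
    writer.writerows([e] for e in emails)

print(emails)
```

Writing to a temporary directory keeps the sketch from touching real files. For an Excel file, the same list could be written with a spreadsheet library instead of `csv`.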

## Bigrams and Trigrams

In [1]:
sent = "I will go to United States"
lst_sent = sent.split(" ")
bigrams = []
# pair each word with the word that follows it
for i in range(len(lst_sent) - 1):
    bigrams.append(lst_sent[i] + " " + lst_sent[i + 1])

print(bigrams)

['I will', 'will go', 'go to', 'to United', 'United States']


In [2]:
import re
# match common punctuation marks: . , ! ? "
punctuation_pattern = re.compile(r'[.,!?"]')

sent = "I will go to United States"
no_punctuation_sent = re.sub(punctuation_pattern, " ", sent)
lst_sent = no_punctuation_sent.split()  # split on any run of whitespace
trigram = []
# join each word with the two words that follow it
for i in range(len(lst_sent) - 2):
    trigram.append(lst_sent[i] + " " + lst_sent[i + 1] + " " + lst_sent[i + 2])

In [3]:
trigram

['I will go', 'will go to', 'go to United', 'to United States']
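The bigram and trigram loops differ only in the window size, so they can be folded into one function. The version below is a small sketch of that generalization. The name `ngrams` is chosen here for illustration and is unrelated to any library function of the same name.

```python
def ngrams(tokens, n):
    """Return all n-grams of a token list as space-joined strings."""
    # zip over n shifted views of the list; zip stops at the shortest view,
    # which yields exactly len(tokens) - n + 1 windows
    return [" ".join(gram) for gram in zip(*(tokens[i:] for i in range(n)))]

tokens = "I will go to United States".split()
print(ngrams(tokens, 2))
print(ngrams(tokens, 3))
```

When `n` is larger than the number of tokens, the function returns an empty list instead of raising an error.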