Suppose we want to build a sentiment analysis model and already have a dataset. The problem is that a machine does not understand sentences in any human language. We first have to clean the data by removing stop words, punctuation, and other irrelevant content until it is in a form that can be fed to machine learning and deep learning algorithms to get the desired output.

In [1]:
#import necessary libraries
import nltk
import string
import re

### NLTK Tokenization

In [2]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\NIKIL\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\NIKIL\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
def rem_stopwords(text):
    stop_words = set(stopwords.words("english"))
    words_tokens = word_tokenize(text)
    # Compare in lowercase so capitalized stop words (e.g. "The") are removed too
    filtered_text = [word for word in words_tokens if word.lower() not in stop_words]
    return filtered_text

ex_text = "Data is new oil, A.I. is the last invention"
rem_stopwords(ex_text)

['Data', 'new', 'oil', ',', 'A.I', '.', 'last', 'invention']
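
The filtered list above still contains punctuation tokens such as ',' and '.'. As a minimal sketch, assuming punctuation carries no signal for the task, the characters in the standard-library `string.punctuation` can be stripped before tokenizing. The helper name `remove_punctuation` is hypothetical and not part of NLTK.

```python
import string

# Translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punctuation(text):
    """Return text with all ASCII punctuation characters removed."""
    return text.translate(_PUNCT_TABLE)

print(remove_punctuation("Data is new oil, A.I. is the last invention"))
# prints: Data is new oil AI is the last invention
```

Note that this also turns abbreviations such as "A.I." into "AI", which may or may not be desirable for a given dataset.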

### Stemming
Stemming is the process of reducing a word to its root form. The root, or stem, is the part to which inflectional affixes (like -ed, -ing, -s, etc.) are added. A stemmer produces the stem by stripping the prefix or the suffix from a word, so the result may not be an actual word.
e.g.:   Mangoes >> Mango
        Boys >> Boy

If the text is not yet tokenized, we first need to convert it into tokens. Once the strings are split into word tokens, we can reduce those tokens to their root form. NLTK provides several stemmers, including the Porter stemmer, the Snowball stemmer, and the Lancaster stemmer. The Porter stemmer is the one most commonly used.

##### Whenever grammar is considered an important part of the operation, we should not use stemming

In [4]:
#import nltk's porter stemmer
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize
stem1 = PorterStemmer()

# stem words in the list of tokenized words

def s_words(text):
    word_tokens = word_tokenize(text)
    stems = [stem1.stem(word) for word in word_tokens]
    return stems

text = "Using Stemming, we will process of getting the root form of a word."
s_words(text)

['use',
 'stem',
 ',',
 'we',
 'will',
 'process',
 'of',
 'get',
 'the',
 'root',
 'form',
 'of',
 'a',
 'word',
 '.']
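
The point that a stem need not be a real word can be seen even with a toy rule set. The sketch below is a deliberately naive suffix stripper, not the Porter algorithm; the suffix list, the minimum stem length of 3, and the name `naive_stem` are all assumptions made for illustration.

```python
# Suffixes are tried in this order; the first one that fits is stripped
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    """Lowercase a word and strip one common suffix, keeping at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for w in ("Mangoes", "Boys", "flying", "studies"):
    print(w, ">>", naive_stem(w))
```

Here "studies" becomes "studi", which is not an English word. Real stemmers apply many more rules, but they share this limitation.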

### Lemmatization
Stemming and lemmatization both reduce words to a root form. The difference is that lemmatization ensures the root word belongs to the language, so it always returns valid words. In NLTK, we use WordNetLemmatizer to get the lemmas of words.

In [5]:
import nltk
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [6]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('wordnet')
lemma = WordNetLemmatizer()

# Lemmatize every token in a string
def lemmatize_word(text):
    word_tokens = word_tokenize(text)
    # pos='v' tells the lemmatizer to treat each token as a verb
    return ' '.join(lemma.lemmatize(word, pos='v') for word in word_tokens)

text = 'Data is the new revolution in the world, in a day one individual would generate terabytes of data'
lemmatize_word(text)

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\NIKIL\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


'Data be the new revolution in the world , in a day one individual would generate terabytes of data'

### Word Lemmatization with appropriate POS tag.

In [7]:
from nltk import pos_tag
from nltk.tokenize import word_tokenize
nltk.download('averaged_perceptron_tagger')

sentence = "Data is the new revolution in the world, in a day one individual would generate terabytes of data"
# Tag each token with its Penn Treebank part-of-speech tag
def pos_tagging(text):
    word_tokenised = word_tokenize(text)
    return pos_tag(word_tokenised)

pos_tagging(sentence)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\NIKIL\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


[('Data', 'NNP'),
 ('is', 'VBZ'),
 ('the', 'DT'),
 ('new', 'JJ'),
 ('revolution', 'NN'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('world', 'NN'),
 (',', ','),
 ('in', 'IN'),
 ('a', 'DT'),
 ('day', 'NN'),
 ('one', 'CD'),
 ('individual', 'NN'),
 ('would', 'MD'),
 ('generate', 'VB'),
 ('terabytes', 'NNS'),
 ('of', 'IN'),
 ('data', 'NNS')]
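
The tags above use the Penn Treebank tag set, while the lemmatizer's `pos` argument expects WordNet's single-letter codes ('n', 'v', 'a', 'r'). A minimal sketch of bridging the two follows; the helper names `penn_to_wordnet` and `lemmatize_tagged` are hypothetical, and falling back to noun for unmapped tags is an assumption.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter ('a', 'v', 'r' or 'n')."""
    if tag.startswith("J"):
        return "a"  # adjective (JJ, JJR, JJS)
    if tag.startswith("V"):
        return "v"  # verb (VB, VBD, VBG, VBN, VBP, VBZ)
    if tag.startswith("RB"):
        return "r"  # adverb (RB, RBR, RBS)
    return "n"      # noun, and the fallback for every other tag

def lemmatize_tagged(tagged_tokens, lemmatizer):
    """Lemmatize (word, tag) pairs with any object exposing lemmatize(word, pos=...)."""
    return [lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
            for word, tag in tagged_tokens]

print([penn_to_wordnet(t) for t in ("VBZ", "JJ", "NNS", "RB", "DT")])
# prints: ['v', 'a', 'n', 'r', 'n']
```

With the cells above, this could be called as `lemmatize_tagged(pos_tagging(sentence), lemma)`, where `lemma` is the `WordNetLemmatizer` instance created earlier.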

### Chunking
Chunking is the process of extracting phrases from unstructured text to give it more structure.

In [8]:
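from nltk import pos_tag, RegexpParser
from nltk.tokenize import word_tokenize

def chunking(text, grammar):
    # Tokenize and POS-tag the text, then group tags that match the grammar
    word_tokens = word_tokenize(text)
    word_pos = pos_tag(word_tokens)
    chunkParser = RegexpParser(grammar)
    tree = chunkParser.parse(word_pos)
    # Print the full tree first, then each chunk found in it
    for subtree in tree.subtrees():
        print(subtree)

sentence = 'the little red parrot is flying in the sky'
# NP: an optional determiner, any number of adjectives, then a noun
grammar = "NP:{<DT>?<JJ>*<NN>}"
chunking(sentence, grammar)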

(S
  (NP the/DT little/JJ red/JJ parrot/NN)
  is/VBZ
  flying/VBG
  in/IN
  (NP the/DT sky/NN))
(NP the/DT little/JJ red/JJ parrot/NN)
(NP the/DT sky/NN)
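
To make the grammar `<DT>?<JJ>*<NN>` less opaque, the same pattern can be approximated with the standard `re` module over a string of tags. This is a plain-Python sketch for illustration, not NLTK's implementation; the function name `np_chunks` is hypothetical, and the tagged tokens are copied from the output above.

```python
import re

def np_chunks(tagged):
    """Return phrases whose tags match <DT>?<JJ>*<NN> in a tagged sentence."""
    # Encode the tag sequence as one string, e.g. "<DT><JJ><NN>"
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    chunks = []
    for match in re.finditer(r"(?:<DT>)?(?:<JJ>)*<NN>", tag_string):
        # Convert character offsets back to token indices by counting tags
        start = tag_string[:match.start()].count("<")
        end = tag_string[:match.end()].count("<")
        chunks.append(" ".join(word for word, _ in tagged[start:end]))
    return chunks

tagged = [("the", "DT"), ("little", "JJ"), ("red", "JJ"), ("parrot", "NN"),
          ("is", "VBZ"), ("flying", "VBG"), ("in", "IN"),
          ("the", "DT"), ("sky", "NN")]
print(np_chunks(tagged))
# prints: ['the little red parrot', 'the sky']
```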


### Remove numbers!

In [9]:
s = "My 2 favorite numbers are 7 and 10"
# \d+ matches every run of digits (0-9)
lst = re.sub(r'\d+', "", s)
print(lst)

My  favorite numbers are  and 
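
Deleting digits leaves double spaces and a trailing space behind, as the output above shows. A minimal sketch that also normalizes the whitespace, using only the standard `re` module (the name `remove_numbers` is hypothetical):

```python
import re

def remove_numbers(text):
    """Remove all digit runs, then collapse leftover whitespace."""
    without_digits = re.sub(r"\d+", "", text)
    # Replace any run of whitespace with one space and trim both ends
    return re.sub(r"\s+", " ", without_digits).strip()

print(remove_numbers("My 2 favorite numbers are 7 and 10"))
# prints: My favorite numbers are and
```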


### Convert text to lowercase

In [10]:
def lowercase(text):
    return text.lower()
input_str = "Weather is Cloudy. Possibility of Rain is High Today!"
lowercase(input_str)

'weather is cloudy. possibility of rain is high today!'

In [14]:
import os
current_path = os.getcwd()