# Natural Language Processing

According to industry estimates, only 21% of the available data is present in structured form. Data is being generated as we speak, as we tweet, as we send messages on WhatsApp, and through various other activities. The majority of this data exists in textual form, which is highly unstructured in nature.

A few notable examples include tweets and posts on social media, user-to-user chat conversations, news, blogs and articles, product or service reviews, and patient records in the healthcare sector. More recent ones include chatbots and other voice-driven bots.

Despite being high-dimensional, the information present in text data is not directly accessible unless it is processed (read and understood) manually or analyzed by an automated system. To produce significant and actionable insights from text data, it is important to get acquainted with the techniques and principles of Natural Language Processing (NLP).

In [1]:
!pip install nltk



In [4]:
import nltk
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [2]:
! pip install spacy


# Introduction to Natural Language Processing

NLP is a branch of data science that consists of systematic processes for analyzing, understanding, and deriving information from text data in a smart and efficient manner. By utilizing NLP and its components, one can organize massive chunks of text data, perform numerous automated tasks, and solve a wide range of problems such as automatic summarization, machine translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation.

Before moving further, let's have a look at a few keywords used frequently.

Tokenization – process of converting a text into tokens

Tokens – words or entities present in the text

Text object – a sentence or a phrase or a word or an article
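To make these terms concrete, here is a minimal sketch of tokenization using only Python's standard library. The helper name `simple_tokenize` and its regular expression are illustrative assumptions, not part of NLTK or spaCy; library tokenizers handle contractions, abbreviations, and other edge cases more carefully.

```python
import re

def simple_tokenize(text):
    # Hypothetical helper: each run of word characters becomes one token,
    # and every other non-space character becomes a token of its own.
    return re.findall(r"\w+|[^\w\s]", text)

# A "text object" here is a single sentence
text_object = "NLP turns raw text into tokens."
tokens = simple_tokenize(text_object)
print(tokens)
# ['NLP', 'turns', 'raw', 'text', 'into', 'tokens', '.']
```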

# Text Preprocessing

Since text is the most unstructured form of all the available data, various types of noise are present in it, and the data is not readily analyzable without pre-processing. The entire process of cleaning and standardizing text, making it noise-free and ready for analysis, is known as text preprocessing.

It predominantly comprises three steps:

1. Noise Removal
2. Lexicon Normalization
3. Object Standardization

# Noise Removal

Any piece of text that is not relevant to the context of the data and the end output can be considered noise.

For example: language stopwords (commonly used words of a language such as is, am, the, of, in), URLs or links, social media entities (mentions, hashtags), punctuation, and industry-specific words. This step deals with the removal of all types of noisy entities present in the text. A general approach for noise removal is to prepare a dictionary of noisy entities and iterate over the text object token by token (or word by word), eliminating those tokens which are present in the noise dictionary.

In [5]:
# Sample code to remove noisy words from a text

noise_list = ["is", "a", "this", "...",'to', 'and'] 
def remove_noise(input_text):
    words = input_text.split() 
    noise_free_words = [word for word in words if word not in noise_list] 
    noise_free_text = " ".join(noise_free_words) 
    return noise_free_text

remove_noise("this is a sample text and i'll go to market now")

"sample text i'll go market now"

In [6]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to C:\Users\AMRESH
[nltk_data]     SINGH\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [7]:
from nltk.corpus import stopwords
set(stopwords.words('english'))

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

Another approach is to use regular expressions when dealing with special patterns of noise.

In [1]:
# Sample code to remove a regex pattern
import re

def remove_regex(input_text, regex_pattern):
    # Substitute every match of the pattern directly. Re-using the matched
    # text as a new pattern would break on regex metacharacters.
    return re.sub(regex_pattern, '', input_text)

# Raw string: a '#' followed by any number of word characters
regex_pattern = r"#[\w]*"

remove_regex("remove this #hashtag from my given string object", regex_pattern)

'remove this  from my given string object'
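The same idea extends to the other noisy entities mentioned earlier, such as URLs and mentions. The following is a rough sketch only: the function name `strip_noise`, the pattern list, and the sample text are assumptions made for illustration, and the URL pattern is deliberately simplistic.

```python
import re

# Illustrative patterns (assumptions, not an exhaustive list)
NOISE_PATTERNS = [
    r"https?://\S+",  # URLs starting with http:// or https://
    r"[@#]\w+",       # social media mentions and hashtags
]

def strip_noise(text):
    # Remove every match of each noise pattern in turn
    for pattern in NOISE_PATTERNS:
        text = re.sub(pattern, "", text)
    # Collapse the whitespace gaps left behind by the removals
    return re.sub(r"\s+", " ", text).strip()

print(strip_noise("Check https://example.com via @nlp_bot #NLP now"))
# Check via now
```

Collapsing whitespace at the end avoids the double spaces that a plain substitution leaves behind.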

# Lexicon Normalization

Another type of textual noise comes from the multiple representations exhibited by a single word.

For example, “play”, “player”, “played”, “plays” and “playing” are different variations of the word “play”. Though they differ in meaning, contextually they are all similar. This step converts all the variants of a word into their normalized form (also known as the lemma). Normalization is a pivotal step for feature engineering with text, as it converts high-dimensional features (N different features) to a low-dimensional space (1 feature), which is ideal for any ML model.

The most common lexicon normalization practices are:

Stemming: Stemming is a rudimentary rule-based process of stripping the suffixes (“ing”, “ly”, “es”, “s” etc.) from a word.
Lemmatization: Lemmatization, on the other hand, is an organized, step-by-step procedure for obtaining the root form of a word. It makes use of vocabulary (dictionary importance of words) and morphological analysis (word structure and grammar relations).
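To illustrate how rudimentary suffix stripping is, here is a toy stemmer written for this note. It is not the Porter algorithm: the name `naive_stem`, the suffix tuple, and the `min_stem_len` guard are all assumptions chosen to mirror the suffixes listed above.

```python
# Suffixes from the description above, checked in this order
SUFFIXES = ("ing", "ly", "es", "s")

def naive_stem(word, min_stem_len=3):
    # Toy rule: strip the first matching suffix, as long as at least
    # min_stem_len characters remain. Real stemmers apply many more rules.
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= min_stem_len:
            return lowered[: -len(suffix)]
    return lowered

for w in ["playing", "plays", "quickly", "boxes", "played"]:
    print(w, "->", naive_stem(w))
# playing -> play
# plays -> play
# quickly -> quick
# boxes -> box
# played -> played
```

Note that “played” is left untouched because “ed” is not in the suffix list, which shows why purely rule-based stripping misses forms that a dictionary-backed lemmatizer can resolve.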

In [2]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to C:\Users\AMRESH
[nltk_data]     SINGH\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.


True

In [3]:
from nltk.stem.wordnet import WordNetLemmatizer 
lem = WordNetLemmatizer()

from nltk.stem.porter import PorterStemmer 
stem = PorterStemmer()

word = "Lexical" 

print(lem.lemmatize(word, "v"))
print(stem.stem(word))

Lexical
lexic


# Object Standardization

Text data often contains words or phrases which are not present in any standard lexical dictionary. These pieces are not recognized by search engines and models.

Some examples are acronyms, hashtags with attached words, and colloquial slang. With the help of regular expressions and manually prepared data dictionaries, this type of noise can be fixed. The code below uses a dictionary lookup method to replace social media slang in a text.

In [4]:
lookup_dict = {'rt':'Retweet', 'dm':'direct message', "awsm" : "awesome", "luv" :"love"}
def lookup_words(input_text):
    words = input_text.split() 
    new_words = [] 
    for word in words:
        # Replace known slang or abbreviations (case-insensitive); keep other words as-is
        if word.lower() in lookup_dict:
            word = lookup_dict[word.lower()]
        new_words.append(word) 
    # Join once after the loop so that an empty input still returns a string
    new_text = " ".join(new_words) 
    return new_text

print(lookup_words("RT this is a retweeted tweet by Doland J.Trump"))

Retweet this is a retweeted tweet by Doland J.Trump


# Text to Features (Feature Engineering on text data)

In [1]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to C:\Users\AMRESH
[nltk_data]     SINGH\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [2]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\AMRESH SINGH\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


True

In [3]:
from nltk import word_tokenize, pos_tag
text = "I am going to travel the world and click lot of beautiful pictures"
tokens = word_tokenize(text)
print(tokens)
print (pos_tag(tokens))

['I', 'am', 'going', 'to', 'travel', 'the', 'world', 'and', 'click', 'lot', 'of', 'beautiful', 'pictures']
[('I', 'PRP'), ('am', 'VBP'), ('going', 'VBG'), ('to', 'TO'), ('travel', 'VB'), ('the', 'DT'), ('world', 'NN'), ('and', 'CC'), ('click', 'NN'), ('lot', 'NN'), ('of', 'IN'), ('beautiful', 'JJ'), ('pictures', 'NNS')]


In [4]:
! pip install gensim
# gensim.corpora is part of gensim itself, so no separate "corpora" install is required


In [5]:
! pip install corpus

Collecting corpus
  Downloading https://files.pythonhosted.org/packages/f1/b9/120d9e0ae8702a6929946b494b723a4de6c9bf3d79e8e07e239a81be4e7c/Corpus-0.4.2.tar.gz (88kB)
Building wheels for collected packages: corpus
  Building wheel for corpus (setup.py): started
  Building wheel for corpus (setup.py): finished with status 'done'
  Stored in directory: C:\Users\AMRESH SINGH\AppData\Local\pip\Cache\wheels\9d\20\6d\214e9c84ce43f62538d4c2f6e23d412bf9a52dd0f12bc716c9
Successfully built corpus
Installing collected packages: corpus
Successfully installed corpus-0.4.2


In [6]:
doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father." 
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."
doc_complete = [doc1, doc2, doc3]
doc_clean = [doc.split() for doc in doc_complete]

import gensim
# import corpora

# Creating the term dictionary of our corpus, where every unique term is assigned an index.  
dictionary = gensim.corpora.Dictionary(doc_clean)

# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above. 
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]

# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

# Running and Training LDA model on the document term matrix
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word = dictionary, passes=50)

# Results 
print(ldamodel.print_topics())

[(0, '0.053*"driving" + 0.053*"sister" + 0.053*"my" + 0.053*"My" + 0.053*"to" + 0.053*"time" + 0.053*"around" + 0.053*"lot" + 0.053*"spends" + 0.053*"practice."'), (1, '0.063*"to" + 0.036*"is" + 0.036*"sugar," + 0.036*"Sugar" + 0.036*"likes" + 0.036*"have" + 0.036*"not" + 0.036*"father." + 0.036*"but" + 0.036*"bad"'), (2, '0.029*"driving" + 0.029*"sister" + 0.029*"My" + 0.029*"my" + 0.029*"to" + 0.029*"stress" + 0.029*"may" + 0.029*"and" + 0.029*"cause" + 0.029*"Doctors"')]
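To show what the dictionary and document-term matrix steps produce before LDA is trained, here is a plain-Python sketch of the bag-of-words idea. It is an approximation for illustration only: the names `token2id` and `to_bow` are hypothetical, and the integer ids gensim assigns may differ from the first-appearance order used here.

```python
from collections import Counter

doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."
doc_clean = [doc.split() for doc in [doc1, doc2, doc3]]

# Term dictionary: every unique token gets an integer id
# (assigned in order of first appearance, an assumption of this sketch)
token2id = {}
for doc in doc_clean:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    # One (token_id, count) pair per distinct token in the document
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())

doc_term_matrix = [to_bow(doc) for doc in doc_clean]
print(doc_term_matrix[0])
```

Because the documents are split on whitespace only, “Sugar”, “sugar,” and “father.” are distinct tokens, which is also visible in the topic output above. Applying the preprocessing steps from earlier sections would merge them.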


In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
obj = TfidfVectorizer()
corpus = ['This is sample document.', 'another random document.', 'third sample document text']
X = obj.fit_transform(corpus)
print (X)

  (0, 7)	0.5844829010200651
  (0, 2)	0.5844829010200651
  (0, 4)	0.444514311537431
  (0, 1)	0.34520501686496574
  (1, 1)	0.3853716274664007
  (1, 0)	0.652490884512534
  (1, 3)	0.652490884512534
  (2, 4)	0.444514311537431
  (2, 1)	0.34520501686496574
  (2, 6)	0.5844829010200651
  (2, 5)	0.5844829010200651
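The weights printed above can be reproduced by hand, which helps clarify what the sparse output means. The sketch below assumes scikit-learn's default `TfidfVectorizer` settings (lowercasing, tokens of two or more word characters, smoothed IDF computed as ln((1 + n) / (1 + df)) + 1, and L2 normalization per document); the helper names `tokenize` and `tfidf` are illustrative.

```python
import math
import re
from collections import Counter

corpus = ['This is sample document.', 'another random document.', 'third sample document text']

def tokenize(doc):
    # Lowercase, then keep tokens made of two or more word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

tokenized = [tokenize(doc) for doc in corpus]
n_docs = len(tokenized)

# Document frequency: in how many documents each term occurs
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    counts = Counter(tokens)
    # Raw term count times smoothed inverse document frequency
    raw = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1) for t, c in counts.items()}
    # Scale each document vector to unit Euclidean length
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = [tfidf(tokens) for tokens in tokenized]
print(round(weights[1]["document"], 4))  # 0.3854
print(round(weights[1]["random"], 4))    # 0.6525
```

In the printed matrix, each `(row, column)` pair is a document index and a vocabulary index, so `(1, 1)` with value 0.385... is the weight of “document” in the second text. It is the lowest weight in that row because “document” appears in all three texts.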


In [8]:
from gensim.models import Word2Vec
sentences = [['data', 'science'], ['vidhya', 'science', 'data', 'analytics', 'physics'],['machine', 'learning'], ['deep', 'learning']]

# train the model on your corpus  
model = Word2Vec(sentences, min_count = 1)

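# Word vectors are accessed through model.wv; calling these directly on the
# model is deprecated in gensim 3.x and removed in gensim 4.
print(model.wv.similarity('physics', 'science'))
print(model.wv['learning'])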

0.06255885
[ 3.3129281e-03  3.9250827e-03  8.6020317e-04 -2.9725595e-03
 -2.8257733e-03 -2.7850142e-03 -2.5648235e-03 -4.7016231e-04
 -4.0992303e-03 -4.1560782e-03  2.2628147e-03 -4.8905291e-04
 -2.1010332e-03 -2.5545047e-03 -4.3336535e-03  4.9901833e-03
  2.2883723e-03  2.2870705e-03 -4.2620664e-03 -1.5417896e-03
 -3.1300588e-03 -2.6271115e-03  3.9877226e-03 -3.2053944e-03
 -1.9581441e-03 -2.8376102e-03  9.5188874e-04  1.6796933e-03
  1.0374994e-03 -3.6458403e-03  3.2441567e-03  3.5289060e-03
 -2.6706520e-03  1.1933720e-03  5.4661208e-04 -3.5518835e-05
 -4.2518107e-03  1.7608163e-03 -3.9118016e-03  1.5823084e-03
  3.3075358e-03  1.5819842e-03 -3.3941732e-03 -3.6831538e-04
 -4.5937770e-03  4.4235573e-03 -1.2582025e-03 -3.1509036e-03
  2.0073862e-03 -3.0839434e-03  3.8265167e-03  4.7102114e-03
  1.1531030e-03 -9.2678948e-04  4.6883905e-03  3.0220570e-03
  1.0830073e-03  4.5398511e-03 -1.4505321e-03  5.5985490e-04
  3.8825913e-04  4.7606616e-03  4.0753088e-03 -2.3248633e-03
 -2.0214819e-
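The similarity score printed first is the cosine similarity between the two word vectors. As a minimal sketch of that measure, using short made-up vectors rather than the trained embeddings (the function name `cosine_similarity` is ours):

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal vectors)
print(round(cosine_similarity([1.0, 2.0], [2.0, 4.0]), 6))  # 1.0 (same direction)
```

With only four tiny training sentences, the learned vectors stay close to their random initialization, so a score near zero such as the one above should not be read as a meaningful relationship between the words.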



In [9]:
training_corpus = [
                   ('I am exhausted of this work.', 'Class_B'),
                   ("I can't cooperate with this", 'Class_B'),
                   ('He is my badest enemy!', 'Class_B'),
                   ('My management is poor.', 'Class_B'),
                   ('I love this burger.', 'Class_A'),
                   ('This is an brilliant place!', 'Class_A'),
                   ('I feel very good about these dates.', 'Class_A'),
                   ('This is my best work.', 'Class_A'),
                   ("What an awesome view", 'Class_A'),
                   ('I do not like this dish', 'Class_B')]
test_corpus = [
                ("I am not feeling well today.", 'Class_B'), 
                ("I feel brilliant!", 'Class_A'), 
                ('Gary is a friend of mine.', 'Class_A'), 
                ("I can't believe I'm doing this.", 'Class_B'), 
                ('The date was good.', 'Class_A'), ('I do not enjoy my job', 'Class_B')]


In [102]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn import svm
from sklearn.metrics import classification_report
import pandas as pd

# Preparing data for the classifier (using training_corpus and test_corpus from the previous cell)
train_data = [text for text, _ in training_corpus]
train_labels = [label for _, label in training_corpus]

test_data = [text for text, _ in test_corpus]
test_labels = [label for _, label in test_corpus]

print(train_data)

# Create feature vectors. With only 10 training sentences, min_df=5 keeps a
# single term ("this"), so the default document-frequency limits are used here.
vectorizer = TfidfVectorizer()

# Learn the vocabulary on the training data, then reuse it for the test data
x_train_vectors = vectorizer.fit_transform(train_data)
x_test_vectors = vectorizer.transform(test_data)

# Columns are vocabulary terms (one per feature), not the original sentences
feature_names = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
df_train = pd.DataFrame(x_train_vectors.toarray(), columns=feature_names)
df_test = pd.DataFrame(x_test_vectors.toarray(), columns=feature_names)

print(df_train)

# Perform classification. GaussianNB needs dense input, hence toarray() above.
# model = svm.SVC(kernel='linear')
model = GaussianNB()
model.fit(df_train, train_labels)
prediction = model.predict(df_test)

print(classification_report(test_labels, prediction))

['I am exhausted of this work.', "I can't cooperate with this", 'He is my badest enemy!', 'My management is poor.', 'I love this burger.', 'This is an brilliant place!', 'I feel very good about these dates.', 'This is my best work.', 'What an awesome view', 'I do not like this dish']