Text Mining in Python

Text Mining is the process of deriving meaningful information from natural language text.

Note: NLP is a component of text mining that performs linguistic analysis to help a machine “read” text. It uses a range of methods to resolve the ambiguities in human language, including automatic summarization, part-of-speech tagging, disambiguation, chunking, and natural language understanding and recognition.

Terminologies in NLP
- Tokenization: Tokenization is the first step in NLP. It is the process of breaking strings into tokens, which are small structures or units. Tokenization involves three steps: breaking a complex sentence into words, understanding the importance of each word with respect to the sentence, and finally producing a structural description of the input sentence.
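
The first of these steps, breaking a sentence into words and punctuation marks, can be sketched with a regular expression from the standard library. This is only an illustration under stated assumptions: the pattern and the function name `simple_tokenize` are hypothetical and do not reproduce the rules NLTK's `word_tokenize` applies.

```python
import re

# Hypothetical pattern (an assumption, not NLTK's rules): a token is either
# a run of word characters, optionally joined by internal hyphens or
# apostrophes, or a single non-space punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")

def simple_tokenize(text):
    """Split a string into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

sample = "Geotab is a global leader in IoT, offering web-based analytics."
tokens = simple_tokenize(sample)
print(tokens)
```

Hyphenated words such as "web-based" stay whole, while the comma and the final period become tokens of their own.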

In [1]:
# Import the necessary libraries
import pandas as pd
import numpy as np
import nltk
import os
import nltk.corpus

In [2]:
# Read the input data (the with-block closes the file after reading)
with open('JD.txt', 'r') as f:
    text = f.read()

In [3]:
# Import word_tokenize from nltk
import nltk
nltk.download('punkt')  # newer NLTK releases may need nltk.download('punkt_tab') instead
from nltk.tokenize import word_tokenize

# Pass the text to word_tokenize to split it into tokens
token = word_tokenize(text)
token

# The text is now split into tokens: words, commas, and other punctuation marks are all tokens.

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


['Data',
 'Analyst',
 '-',
 'Geotab',
 'Hiring',
 'Manager',
 '-',
 'James',
 'Brown',
 'Location',
 '-',
 'Oakville',
 'Who',
 'we',
 'are',
 ':',
 'Geotab',
 'is',
 'a',
 'global',
 'leader',
 'in',
 'IoT',
 'and',
 'connected',
 'transportation',
 'and',
 'certified',
 '“',
 'Great',
 'Place',
 'to',
 'Work.',
 '”',
 'We',
 'are',
 'a',
 'company',
 'of',
 'diverse',
 'and',
 'talented',
 'individuals',
 'who',
 'work',
 'together',
 'to',
 'help',
 'businesses',
 'grow',
 'and',
 'succeed',
 ',',
 'and',
 'increase',
 'the',
 'safety',
 'and',
 'sustainability',
 'of',
 'our',
 'communities',
 '.',
 'Geotab',
 'is',
 'advancing',
 'security',
 ',',
 'connecting',
 'commercial',
 'vehicles',
 'to',
 'the',
 'internet',
 'and',
 'providing',
 'web-based',
 'analytics',
 'to',
 'help',
 'customers',
 'better',
 'manage',
 'their',
 'fleets',
 '.',
 'Geotab',
 '’',
 's',
 'open',
 'platform',
 'and',
 'Marketplace',
 ',',
 'offering',
 'hundreds',
 'of',
 'third-party',
 'solution',
 '

Finding the frequency distribution of tokens in the text

In [4]:
# Compute the frequency distribution of the tokens (import FreqDist from nltk and pass the token list to it)

from nltk.probability import FreqDist
fdist = FreqDist(token)
fdist

FreqDist({'!': 4,
          '$': 1,
          '%': 1,
          '&': 3,
          "'": 1,
          "'d": 1,
          "'re": 3,
          "'s": 5,
          '(': 7,
          ')': 7,
          ',': 124,
          '-': 12,
          '.': 86,
          '0-5': 1,
          '100': 1,
          '220': 1,
          '3': 1,
          '3+': 1,
          '3-5': 2,
          ':': 14,
          ';': 1,
          '<': 8,
          '>': 8,
          '?': 1,
          '@': 2,
          'A': 1,
          'AI': 2,
          'ASR': 1,
          'AWS': 1,
          'Ability': 2,
          'About': 2,
          'Advanced': 1,
          'All': 1,
          'An': 1,
          'Analyst': 9,
          'Analysts': 1,
          'Analytics': 1,
          'Andreessen': 1,
          'Anywhere': 1,
          'Are': 1,
          'Argentine': 1,
          'As': 2,
          'Assist': 1,
          'At': 2,
          'Austin': 2,
          'B2B': 1,
          'BS': 1,
          'Baby': 1,
          'Bangalore': 2,
  

In [5]:
# Find the 10 most frequent tokens

fdist1 = fdist.most_common(10)
fdist1

[(',', 124),
 ('and', 101),
 ('.', 86),
 ('to', 66),
 ('the', 45),
 ('data', 39),
 ('of', 35),
 ('with', 35),
 ('a', 32),
 ('in', 29)]
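
The top entries above are mostly punctuation and function words. As a minimal sketch of one way to get a more informative ranking, the tokens can be lowercased and restricted to alphabetic strings before counting. The example uses `collections.Counter` from the standard library (NLTK's `FreqDist` is a `Counter` subclass, so `most_common` works the same way). The `sample_tokens` list is a hypothetical stand-in for the `token` list.

```python
from collections import Counter

# Hypothetical stand-in for the `token` list produced by word_tokenize
sample_tokens = ["Data", "Analyst", "-", "Geotab", "is", "a", "global",
                 "leader", ",", "and", "data", "is", "key", "."]

# Lowercase each token and keep only purely alphabetic ones,
# which drops punctuation such as "-", "," and "."
words = [t.lower() for t in sample_tokens if t.isalpha()]

counts = Counter(words)
print(counts.most_common(2))
```

Note that `str.isalpha` also drops tokens such as "3-5" or "web-based", which may or may not be what an analysis needs.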

Stemming: Stemming refers to reducing words to their base or root form.

- Two common stemming algorithms are Porter stemming (removes common morphological and inflectional endings from words) and Lancaster stemming (a more aggressive stemming algorithm).



In [6]:
# Import PorterStemmer from nltk (checking the word 'waiting')

from nltk.stem import PorterStemmer
pst = PorterStemmer()
pst.stem("waiting")

'wait'

In [7]:
# Stem each word in a list

stm = ["waited", "waiting", "waits"]
for word in stm:
    print(word + ":" + pst.stem(word))

waited:wait
waiting:wait
waits:wait


In [8]:
# Import LancasterStemmer from nltk

from nltk.stem import LancasterStemmer
lst = LancasterStemmer()
stm = ["giving", "given", "given", "gave"]
for word in stm:
    print(word + ":" + lst.stem(word))

giving:giv
given:giv
given:giv
gave:gav


Lemmatization: Lemmatization is the process of converting a word to its base form (its lemma). Unlike stemming, which just strips the last few characters and often produces incorrect meanings or misspelled stems, lemmatization considers the context and converts the word to a meaningful base form.

- Lemmatization can be implemented in Python using the WordNet Lemmatizer, spaCy Lemmatizer, TextBlob, or Stanford CoreNLP.
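
The contrast between the two approaches can be sketched in plain Python. This is a toy illustration, not how NLTK works: `SUFFIXES`, `LEMMA_TABLE`, `naive_stem` and `naive_lemmatize` are all hypothetical names, and the lookup table is a tiny hand-written stand-in for a real lexicon such as WordNet.

```python
# Toy illustration only: neither function reproduces NLTK's algorithms.

SUFFIXES = ("ing", "ed", "s")            # hypothetical suffix list

# Hypothetical hand-written lookup table standing in for a real lexicon
LEMMA_TABLE = {"gave": "give", "given": "give", "giving": "give",
               "corpora": "corpus", "rocks": "rock"}

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def naive_lemmatize(word):
    """Look the word up in the table; fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)

for w in ["giving", "gave", "corpora", "rocks"]:
    print(w, "-> stem:", naive_stem(w), "| lemma:", naive_lemmatize(w))
```

Suffix stripping turns "giving" into the non-word "giv" and leaves "gave" untouched, while the lookup maps both to "give".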

In [9]:
# Import WordNetLemmatizer from nltk
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# lemmatize() treats the word as a noun unless a pos argument is given
print("rocks :", lemmatizer.lemmatize("rocks"))
print("corpora :", lemmatizer.lemmatize("corpora"))

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
rocks : rock
corpora : corpus


Stop Words: “Stop words” are the most common words in a language, such as “the”, “a”, “at”, “for”, “above”, “on”, “is”, and “all”. These words carry little meaning on their own and are usually removed from texts. We can remove them using the nltk library.

In [10]:
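# Import stopwords from the nltk library

import nltk
nltk.download('stopwords')
from nltk import word_tokenize
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

# Read the input data (the with-block closes the file after reading)
with open('JD.txt', 'r') as f:
    text = f.read()

# Lowercase the text before tokenizing so it matches the lowercase stop word list
text1 = word_tokenize(text.lower())
print(text1)

# Keep only the tokens that are not stop words
# (named `filtered` so the imported `stopwords` module is not overwritten)
filtered = [x for x in text1 if x not in stop_words]
print(filtered)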

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
['data', 'analyst', '-', 'geotab', 'hiring', 'manager', '-', 'james', 'brown', 'location', '-', 'oakville', 'who', 'we', 'are', ':', 'geotab', 'is', 'a', 'global', 'leader', 'in', 'iot', 'and', 'connected', 'transportation', 'and', 'certified', '“', 'great', 'place', 'to', 'work.', '”', 'we', 'are', 'a', 'company', 'of', 'diverse', 'and', 'talented', 'individuals', 'who', 'work', 'together', 'to', 'help', 'businesses', 'grow', 'and', 'succeed', ',', 'and', 'increase', 'the', 'safety', 'and', 'sustainability', 'of', 'our', 'communities', '.', 'geotab', 'is', 'advancing', 'security', ',', 'connecting', 'commercial', 'vehicles', 'to', 'the', 'internet', 'and', 'providing', 'web-based', 'analytics', 'to', 'help', 'customers', 'better', 'manage', 'their', 'fleets', '.', 'geotab', '’', 's', 'open', 'platform', 'and', 'marketplace', ',', 'offering', 'hundreds', 'of', 'third-p

Part of speech tagging (POS)

- Part-of-speech tagging assigns a part of speech (such as noun, verb, pronoun, adverb, conjunction, adjective, or interjection) to each word of a given text, based on its definition and its context. Many POS taggers are available; some of the widely used ones are NLTK, spaCy, TextBlob, and Stanford CoreNLP.

In [11]:
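import nltk
nltk.download('averaged_perceptron_tagger')

# Read the input data (the with-block closes the file after reading)
with open('JD.txt', 'r') as f:
    text = f.read()

# Tokenize the text
tex = word_tokenize(text)

# Tag the tokens one at a time. Note: tagging a single-token list gives the
# tagger no sentence context; nltk.pos_tag(tex) tags the whole sequence at once.
for tok in tex:
    print(nltk.pos_tag([tok]))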

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[('Data', 'NNS')]
[('Analyst', 'NN')]
[('-', ':')]
[('Geotab', 'NN')]
[('Hiring', 'VBG')]
[('Manager', 'NN')]
[('-', ':')]
[('James', 'NNP')]
[('Brown', 'NNP')]
[('Location', 'NN')]
[('-', ':')]
[('Oakville', 'NNP')]
[('Who', 'WP')]
[('we', 'PRP')]
[('are', 'VBP')]
[(':', ':')]
[('Geotab', 'NN')]
[('is', 'VBZ')]
[('a', 'DT')]
[('global', 'JJ')]
[('leader', 'NN')]
[('in', 'IN')]
[('IoT', 'NN')]
[('and', 'CC')]
[('connected', 'VBN')]
[('transportation', 'NN')]
[('and', 'CC')]
[('certified', 'VBN')]
[('“', 'NN')]
[('Great', 'NN')]
[('Place', 'NN')]
[('to', 'TO')]
[('Work.', 'NN')]
[('”', 'NN')]
[('We', 'PRP')]
[('are', 'VBP')]
[('a', 'DT')]
[('company', 'NN')]
[('of', 'IN')]
[('diverse', 'NN')]
[('and', 'CC')]
[('talented', 'VBN')]
[('individuals', 'NNS')]
[('who', 'WP')]
[('work', 'NN')]
[('together', '
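
Each tagged token is a `(word, tag)` pair, so ordinary Python is enough for simple follow-up analysis. Below is a minimal sketch that counts the tags and collects the nouns, using a few pairs copied from the output above as sample data. Treating every tag that starts with `NN` as a noun is an assumption of this example, based on the Penn Treebank style tag names that appear in the output.

```python
from collections import Counter

# A few (word, tag) pairs copied from the output above
tagged = [("Geotab", "NN"), ("is", "VBZ"), ("a", "DT"), ("global", "JJ"),
          ("leader", "NN"), ("in", "IN"), ("IoT", "NN"), ("and", "CC"),
          ("individuals", "NNS")]

# Count how often each tag occurs
tag_counts = Counter(tag for _, tag in tagged)

# Collect words whose tag starts with "NN" (NN, NNS, NNP, NNPS)
nouns = [word for word, tag in tagged if tag.startswith("NN")]

print(tag_counts.most_common(3))
print(nouns)
```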

Named entity recognition: It is the process of detecting named entities such as person names, location names, company names, quantities, and monetary values.

In [12]:
import nltk
nltk.download('maxent_ne_chunker')
nltk.download('words')
from nltk import ne_chunk

# Read the input data (the with-block closes the file after reading)
with open('JD.txt', 'r') as f:
    text = f.read()

# Tokenize and POS-tag the text before chunking, since ne_chunk expects tagged tokens
token = word_tokenize(text)
tags = nltk.pos_tag(token)
chunk = ne_chunk(tags)
chunk

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


TclError: ignored

Tree('S', [Tree('PERSON', [('Data', 'NNP'), ('Analyst', 'NNP')]), ('-', ':'), Tree('PERSON', [('Geotab', 'NNP'), ('Hiring', 'NNP'), ('Manager', 'NNP')]), ('-', ':'), Tree('PERSON', [('James', 'NNP'), ('Brown', 'NNP'), ('Location', 'NNP')]), ('-', ':'), Tree('GPE', [('Oakville', 'NN')]), ('Who', 'IN'), ('we', 'PRP'), ('are', 'VBP'), (':', ':'), Tree('PERSON', [('Geotab', 'NNP')]), ('is', 'VBZ'), ('a', 'DT'), ('global', 'JJ'), ('leader', 'NN'), ('in', 'IN'), Tree('ORGANIZATION', [('IoT', 'NNP')]), ('and', 'CC'), ('connected', 'VBN'), ('transportation', 'NN'), ('and', 'CC'), ('certified', 'JJ'), ('“', 'NNP'), ('Great', 'NNP'), ('Place', 'NNP'), ('to', 'TO'), ('Work.', 'NNP'), ('”', 'NNP'), ('We', 'PRP'), ('are', 'VBP'), ('a', 'DT'), ('company', 'NN'), ('of', 'IN'), ('diverse', 'NN'), ('and', 'CC'), ('talented', 'JJ'), ('individuals', 'NNS'), ('who', 'WP'), ('work', 'VBP'), ('together', 'RB'), ('to', 'TO'), ('help', 'VB'), ('businesses', 'NNS'), ('grow', 'VB'), ('and', 'CC'), ('succeed', '
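
The raw tree is hard to read, so a common next step is to pull out just the labelled entities. The sketch below assumes NLTK is installed and builds a small tree by hand that mirrors part of the output above, so no model download is needed. The helper name `extract_entities` is hypothetical. With the real data, `chunk` would be passed in place of `sample`.

```python
from nltk.tree import Tree

# Small hand-built tree mirroring part of the output above
sample = Tree('S', [
    Tree('PERSON', [('James', 'NNP'), ('Brown', 'NNP'), ('Location', 'NNP')]),
    ('-', ':'),
    Tree('GPE', [('Oakville', 'NN')]),
    ('Who', 'IN'),
    ('we', 'PRP'),
    ('are', 'VBP'),
])

def extract_entities(tree):
    """Return (label, text) pairs for every labelled chunk below the root."""
    entities = []
    for subtree in tree.subtrees(filter=lambda t: t.label() != 'S'):
        words = [word for word, tag in subtree.leaves()]
        entities.append((subtree.label(), " ".join(words)))
    return entities

entities = extract_entities(sample)
print(entities)
```

In the output above, "Location" was grouped into the PERSON entity together with "James Brown", a reminder that the chunker's labels should be reviewed rather than taken at face value.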

Chunking: Chunking means picking up individual pieces of information and grouping them into bigger pieces. In the context of NLP and text mining, chunking means grouping words or tokens into chunks.

In [14]:
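# Read the input data (the with-block closes the file after reading)
with open('JD.txt', 'r') as f:
    text = f.read()

# Tokenize and POS-tag the text
token = word_tokenize(text)
tags = nltk.pos_tag(token)

# Grammar: an NP chunk is an optional determiner (DT), any number of
# adjectives (JJ), then a singular common noun (NN)
reg = "NP: {<DT>?<JJ>*<NN>}"
parser = nltk.RegexpParser(reg)
result = parser.parse(tags)
print(result)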

(S
  Data/NNP
  Analyst/NNP
  -/:
  Geotab/NNP
  Hiring/NNP
  Manager/NNP
  -/:
  James/NNP
  Brown/NNP
  Location/NNP
  -/:
  (NP Oakville/NN)
  Who/IN
  we/PRP
  are/VBP
  :/:
  Geotab/NNP
  is/VBZ
  (NP a/DT global/JJ leader/NN)
  in/IN
  IoT/NNP
  and/CC
  connected/VBN
  (NP transportation/NN)
  and/CC
  certified/JJ
  “/NNP
  Great/NNP
  Place/NNP
  to/TO
  Work./NNP
  ”/NNP
  We/PRP
  are/VBP
  (NP a/DT company/NN)
  of/IN
  (NP diverse/NN)
  and/CC
  talented/JJ
  individuals/NNS
  who/WP
  work/VBP
  together/RB
  to/TO
  help/VB
  businesses/NNS
  grow/VB
  and/CC
  succeed/VB
  ,/,
  and/CC
  increase/VB
  (NP the/DT safety/NN)
  and/CC
  (NP sustainability/NN)
  of/IN
  our/PRP$
  communities/NNS
  ./.
  Geotab/NNP
  is/VBZ
  advancing/VBG
  (NP security/NN)
  ,/,
  connecting/VBG
  commercial/JJ
  vehicles/NNS
  to/TO
  (NP the/DT internet/NN)
  and/CC
  providing/VBG
  web-based/JJ
  analytics/NNS
  to/TO
  help/VB
  customers/NNS
  better/RBR
  manage/VBP
  their/PRP$
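
The grammar `NP: {<DT>?<JJ>*<NN>}` reads as "an optional determiner, any number of adjectives, then one singular noun". The plain-Python sketch below spells out that matching rule to show why some tokens in the output above are grouped and others are not. It is not `RegexpParser`'s implementation, and the function name `chunk_noun_phrases` is hypothetical. The sample pairs are copied from the output above.

```python
def chunk_noun_phrases(tagged):
    """Group tokens matching DT? JJ* NN into ("NP", [...]); keep the rest as pairs."""
    result = []
    i = 0
    while i < len(tagged):
        j = i
        if j < len(tagged) and tagged[j][1] == "DT":      # optional determiner
            j += 1
        while j < len(tagged) and tagged[j][1] == "JJ":   # zero or more adjectives
            j += 1
        if j < len(tagged) and tagged[j][1] == "NN":      # required singular noun
            result.append(("NP", tagged[i:j + 1]))
            i = j + 1
        else:
            result.append(tagged[i])                      # no chunk starts here
            i += 1
    return result

tagged = [("Geotab", "NNP"), ("is", "VBZ"), ("a", "DT"), ("global", "JJ"),
          ("leader", "NN"), ("in", "IN"), ("commercial", "JJ"), ("vehicles", "NNS")]
chunks = chunk_noun_phrases(tagged)
for item in chunks:
    print(item)
```

"a global leader" forms a chunk, but "commercial vehicles" does not, because the grammar requires the tag `NN` and "vehicles" is tagged `NNS`. Extending the pattern to plural or proper nouns would be a change to the grammar itself.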
  