# Text Mining 


### What is Text Mining?
Text Mining is the process of deriving meaningful information from natural language text.

### What is Natural Language Processing?

Natural Language Processing (NLP) is a part of computer science and artificial intelligence which deals with human languages.

In other words, NLP is a component of text mining that performs linguistic analysis to help a machine “read” text. It uses a range of techniques to resolve the ambiguities of human language, including automatic summarization, part-of-speech tagging, chunking, disambiguation, and natural language understanding and recognition.

### Tokenization
Tokenization is the first step in NLP. It is the process of breaking a string into tokens, which are small structural units such as words and punctuation marks. Tokenization involves three steps: breaking a complex sentence into words, understanding the importance of each word with respect to the sentence, and finally producing a structural description of the input sentence.


In [19]:
import nltk
# nltk.download('punkt')

from nltk.tokenize import word_tokenize

text = 'In Brazil they drive on the right-hand side of the road. Brazil has a large coastline on the eastern side of South America'

# Pass the text to word_tokenize to split it into word and punctuation tokens
token = word_tokenize(text)
token

['In',
 'Brazil',
 'they',
 'drive',
 'on',
 'the',
 'right-hand',
 'side',
 'of',
 'the',
 'road',
 '.',
 'Brazil',
 'has',
 'a',
 'large',
 'coastline',
 'on',
 'the',
 'eastern',
 'side',
 'of',
 'South',
 'America']
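
To make the idea concrete, here is a minimal sketch of a tokenizer built with a regular expression from Python's standard library. It is not how `word_tokenize` is implemented; the `simple_tokenize` name and the pattern are assumptions chosen for illustration, and the pattern only handles plain words, words joined by a hyphen or apostrophe, and single punctuation marks.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens (toy version)."""
    # \w+(?:[-']\w+)*  -> a word, optionally continued by -part or 'part
    # [^\w\s]          -> any single character that is neither a word char nor whitespace
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", text)

sample = "In Brazil they drive on the right-hand side of the road."
tokens = simple_tokenize(sample)
print(tokens)
```

A real tokenizer handles many more cases (abbreviations, contractions, numbers with separators), which is why a library function is normally preferred.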

In [20]:
# Compute the frequency distribution of the tokens
# Import FreqDist from nltk.probability and pass the token list to it
from nltk.probability import FreqDist

fdist = FreqDist(token)
fdist

FreqDist({'the': 3, 'Brazil': 2, 'on': 2, 'side': 2, 'of': 2, 'In': 1, 'they': 1, 'drive': 1, 'right-hand': 1, 'road': 1, ...})

In [21]:
# Find the 10 most frequent tokens
fdist1 = fdist.most_common(10)
fdist1

[('the', 3),
 ('Brazil', 2),
 ('on', 2),
 ('side', 2),
 ('of', 2),
 ('In', 1),
 ('they', 1),
 ('drive', 1),
 ('right-hand', 1),
 ('road', 1)]
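
The same kind of counting can be sketched with `collections.Counter` from the standard library, which also offers `most_common`. The token list below is a made-up sample, and the relative-frequency step is an addition for illustration, not something the cells above compute.

```python
from collections import Counter

# Hypothetical sample tokens (already tokenized by splitting on spaces)
tokens = "the road on the right side of the road".split()

counts = Counter(tokens)
top_two = counts.most_common(2)   # the two most frequent tokens with their counts

# Relative frequency: each count divided by the total number of tokens
total = sum(counts.values())
relative = {word: count / total for word, count in counts.items()}

print(top_two)
print(round(relative["the"], 3))
```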

### Stemming

Stemming usually refers to normalizing words to their base or root form.

Two stemmers are covered here: **Porter Stemming** (removes common morphological and inflectional endings from words) and **Lancaster Stemming** (a more aggressive stemming algorithm).



### Porter Stemming

In [22]:
# Checking the word 'waiting'
from nltk.stem import PorterStemmer

pst = PorterStemmer()
pst.stem('waiting')

'wait'

In [23]:
# Checking a list of words
stm = ["waited", "waiting", "waits"]
for word in stm:
    print(word + ":" + pst.stem(word))

waited:wait
waiting:wait
waits:wait
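
As a rough illustration of what suffix stripping means, the sketch below removes a few hard-coded endings. It is a toy and is not the Porter algorithm, which applies a much larger set of ordered rules; the `toy_stem` name, the suffix list, and the minimum stem length of 3 are all assumptions made for this example.

```python
def toy_stem(word, suffixes=("ing", "ed", "s")):
    """Strip the first matching suffix, keeping at least 3 characters of stem."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ["waited", "waiting", "waits", "caring", "bus"]:
    print(word + ":" + toy_stem(word))
```

Note that this toy turns `caring` into `car`: blind suffix removal can change the meaning of a word, which is the weakness lemmatization tries to address.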


### Lancaster Stemming

Lancaster is more aggressive than the Porter stemmer.

We use the Lancaster stemmer when we care about the frequency of the words rather than their meaning.

In [24]:
# Importing LancasterStemmer from nltk
from nltk.stem import LancasterStemmer

lst = LancasterStemmer()
stm = ['giving', 'given', 'given', 'gave']
for word in stm:
    print(word + ':' + lst.stem(word))


giving:giv
given:giv
given:giv
gave:gav


In [25]:
stm = ['care', 'carefully', 'caring']
for word in stm:
    print(word + ':' + lst.stem(word))

# The Lancaster stemmer is clearly more aggressive than the Porter stemmer: here it reduces every word to 'car', which changes the meaning

care:car
carefully:car
caring:car


### Lemmatization


The difference between stemming and lemmatization is that lemmatization considers the context and converts the word to its meaningful base form, called the lemma, whereas stemming just removes the last few characters, often leading to incorrect meanings and spelling errors.
For example, lemmatization would correctly reduce ‘caring’ to ‘care’, whereas an aggressive stemmer such as Lancaster cuts off the ending and produces ‘car’.

Sometimes the same word can have multiple different lemmas, depending on how it is used. The goal of lemmatization is to find the true meaning of the word.


In [26]:
# Import WordNetLemmatizer from nltk
from nltk.stem import WordNetLemmatizer

# Download the 'wordnet' dataset if it's not already present
# nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

print('rocks :', lemmatizer.lemmatize('rocks'))
print('corpora :', lemmatizer.lemmatize('corpora'))


# 'corpora' is the plural of 'corpus': a collection of written texts, especially the entire works of a particular author or a body of writing on a particular subject. The lemmatizer maps the plural form to its singular lemma; it does not look up synonyms.

rocks : rock
corpora : corpus
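
The point that one word can have several lemmas can be sketched with a small lookup table keyed by word and part of speech. The table, the `toy_lemmatize` name, and the one-letter part-of-speech codes are assumptions for illustration only; a real lemmatizer relies on a full dictionary such as WordNet. `WordNetLemmatizer.lemmatize` also accepts a `pos` argument (for example `pos='v'`), which serves a similar purpose.

```python
# Hypothetical mini dictionary: (word, part of speech) -> lemma
LEMMAS = {
    ("caring", "v"): "care",
    ("rocks", "n"): "rock",
    ("corpora", "n"): "corpus",
    ("saw", "v"): "see",   # "I saw the film"
    ("saw", "n"): "saw",   # "a saw cuts wood"
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos); fall back to the word itself if unknown."""
    return LEMMAS.get((word.lower(), pos), word)

print("saw (verb) :", toy_lemmatize("saw", "v"))
print("saw (noun) :", toy_lemmatize("saw", "n"))
print("corpora    :", toy_lemmatize("corpora"))
```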


### Stop Words
 
“Stop words” are the most common words in a language, such as “the”, “a”, “at”, “for”, “above”, “on”, “is”, and “all”. These words carry little meaning on their own and are usually removed from texts before analysis.


In [27]:
from nltk import word_tokenize
from nltk.corpus import stopwords

# Download the 'stopwords' dataset
# nltk.download('stopwords')

a = set(stopwords.words('english'))
text = 'Cristiano Ronaldo was born on February 5, 1985, in Funchal, Madeira, Portugal.'

text1 = word_tokenize(text.lower())
print(text1)
print()

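# Keep only the tokens that are not stop words
# (named `filtered` so the imported `stopwords` corpus is not overwritten)
filtered = [x for x in text1 if x not in a]
print(filtered)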

['cristiano', 'ronaldo', 'was', 'born', 'on', 'february', '5', ',', '1985', ',', 'in', 'funchal', ',', 'madeira', ',', 'portugal', '.']

['cristiano', 'ronaldo', 'born', 'february', '5', ',', '1985', ',', 'funchal', ',', 'madeira', ',', 'portugal', '.']
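
The call to `text.lower()` in the cell above matters because the stop-word list is lowercase and the membership test is case-sensitive. A small sketch, assuming a tiny hand-made stop list rather than the NLTK one:

```python
# Tiny hand-made stop list, for illustration only
STOP_WORDS = {"the", "was", "on", "in"}
tokens = ["The", "match", "was", "played", "in", "Lisbon"]

# Without lowercasing, "The" does not match "the" and slips through
without_lowercasing = [t for t in tokens if t not in STOP_WORDS]

# Lowercasing first makes the comparison consistent
with_lowercasing = [t.lower() for t in tokens if t.lower() not in STOP_WORDS]

print(without_lowercasing)
print(with_lowercasing)
```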


In [28]:
print(a)

{'that', 'been', "should've", 'again', 'my', 'she', "that'll", 'down', 'over', "hasn't", 'wouldn', 'itself', 'ourselves', 'out', 'if', 'was', 'hasn', 'don', 'both', "shan't", 'these', 'any', 'same', 'at', 'off', 'needn', "mightn't", "doesn't", 'am', 'but', 'during', 'have', 'as', 'shan', 'you', 'himself', 'here', 'to', 'own', 'which', 'up', 'should', 'won', 'how', 'just', 't', 'when', 'then', 'once', 'and', 'about', 's', 'above', 'having', 'be', 'has', 'haven', "haven't", 'it', 'other', 'not', 'only', "you're", 'herself', "it's", 'into', "needn't", 'yourselves', "hadn't", 'by', 'me', "you've", 'no', 'mustn', 'nor', 'isn', "isn't", 'hers', 'below', 'm', 'd', "you'd", 'o', 'couldn', 'doing', 'an', 'there', "aren't", 'so', 'more', 'does', 'being', "don't", 'some', 'myself', 'themselves', "you'll", 'through', 'mightn', 'ain', 'this', 'or', 'will', "didn't", 'between', 'their', 'doesn', "won't", 'he', 'with', 'your', 'ours', 'them', 'until', "mustn't", "wasn't", 've', 'few', 'than', 'yourse

### Part of Speech Tagging (POS)
Part-of-speech tagging assigns a part of speech (such as noun, verb, pronoun, adverb, conjunction, adjective, or interjection) to each word of a given text, based on its definition and its context.

In [29]:
from nltk import word_tokenize

# Download the 'averaged_perceptron_tagger' resource
# nltk.download('averaged_perceptron_tagger')

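text = 'vote to choose a particular man or a group (party) to represent them in parliament'
# Tokenize the text
tokens = word_tokenize(text)
# Each token is tagged on its own here, so the tagger sees no surrounding context,
# which is likely why verbs such as 'choose' and 'represent' are tagged NN below.
# Passing the whole list, nltk.pos_tag(tokens), lets the tagger use the sentence context.
for tok in tokens:
    print(nltk.pos_tag([tok]))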



[('vote', 'NN')]
[('to', 'TO')]
[('choose', 'NN')]
[('a', 'DT')]
[('particular', 'JJ')]
[('man', 'NN')]
[('or', 'CC')]
[('a', 'DT')]
[('group', 'NN')]
[('(', '(')]
[('party', 'NN')]
[(')', ')')]
[('to', 'TO')]
[('represent', 'NN')]
[('them', 'PRP')]
[('in', 'IN')]
[('parliament', 'NN')]


### Alphabetical list of part-of-speech tags


| Number | Tag  | Description                              |
|--------|------|------------------------------------------|
| 1      | CC   | Coordinating conjunction                 |
| 2      | CD   | Cardinal number                          |
| 3      | DT   | Determiner                               |
| 4      | EX   | Existential there                        |
| 5      | FW   | Foreign word                             |
| 6      | IN   | Preposition or subordinating conjunction |
| 7      | JJ   | Adjective                                |
| 8      | JJR  | Adjective, comparative                   |
| 9      | JJS  | Adjective, superlative                   |
| 10     | LS   | List item marker                         |
| 11     | MD   | Modal                                    |
| 12     | NN   | Noun, singular or mass                   |
| 13     | NNS  | Noun, plural                             |
| 14     | NNP  | Proper noun, singular                    |
| 15     | NNPS | Proper noun, plural                      |
| 16     | PDT  | Predeterminer                            |
| 17     | POS  | Possessive ending                        |
| 18     | PRP  | Personal pronoun                         |
| 19     | PRP$ | Possessive pronoun                       |
| 20     | RB   | Adverb                                   |
| 21     | RBR  | Adverb, comparative                      |
| 22     | RBS  | Adverb, superlative                      |
| 23     | RP   | Particle                                 |
| 24     | SYM  | Symbol                                   |
| 25     | TO   | to                                       |
| 26     | UH   | Interjection                             |
| 27     | VB   | Verb, base form                          |
| 28     | VBD  | Verb, past tense                         |
| 29     | VBG  | Verb, gerund or present participle       |
| 30     | VBN  | Verb, past participle                    |
| 31     | VBP  | Verb, non-3rd person singular present    |
| 32     | VBZ  | Verb, 3rd person singular present        |
| 33     | WDT  | Wh-determiner                            |
| 34     | WP   | Wh-pronoun                               |
| 35     | WP$  | Possessive wh-pronoun                    |
| 36     | WRB  | Wh-adverb                                |
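
Because related tags in the table share a prefix (`NN`, `NNS`, `NNP`, `NNPS` are all nouns), tagged output can be grouped into coarse categories with a simple lookup. The sketch below is one possible way to do this; the `COARSE` mapping and the `coarse_category` helper are assumptions for illustration and are not part of NLTK.

```python
# Hypothetical mapping from the first two letters of a tag to a coarse category
COARSE = {"NN": "noun", "VB": "verb", "JJ": "adjective", "RB": "adverb"}

def coarse_category(tag):
    """Map a tag from the table above to a coarse category, or 'other'."""
    return COARSE.get(tag[:2], "other")

# A few (word, tag) pairs in the same shape that nltk.pos_tag returns
tagged = [("vote", "NN"), ("to", "TO"), ("a", "DT"), ("particular", "JJ"), ("them", "PRP")]

nouns = [word for word, tag in tagged if coarse_category(tag) == "noun"]
print(nouns)
print([coarse_category(tag) for _, tag in tagged])
```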



### Named entity recognition

It is the process of detecting named entities such as person names, location names, company names, quantities, and monetary values.


#### What is a Named Entity?
A named entity is a real-life object that has a proper identification and can be denoted with a proper name. Named entities can be a place, person, organization, time, object, or geographic entity.
For example, Roger Federer, Honda City, and Samsung Galaxy S10 are named entities. Named entities are usually instances of entity types: Roger Federer is an instance of a tennis player/person, Honda City is an instance of a car, and Samsung Galaxy S10 is an instance of a mobile phone.



#### Named Entity Recognition:
Named Entity Recognition is the NLP task of identifying and classifying named entities. The raw text is taken as input, and the named entities in it are classified into persons, organizations, places, money, time, etc. In other words, named entities are identified and assigned to predefined classes.
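
The "identify, then classify" idea can be sketched with a plain dictionary lookup, sometimes called a gazetteer. This is a deliberately naive stand-in: statistical NER models learn from context instead of relying on a fixed list. The `GAZETTEER` contents, the labels, and the `find_entities` helper are assumptions made for this example.

```python
# Hypothetical gazetteer: known entity name -> predefined class
GAZETTEER = {
    "Roger Federer": "PERSON",
    "Honda City": "PRODUCT",
    "Samsung Galaxy S10": "PRODUCT",
}

def find_entities(text):
    """Return (name, label) pairs for gazetteer names found in text, in order of appearance."""
    found = []
    for name, label in GAZETTEER.items():
        start = text.find(name)
        if start != -1:
            found.append((start, name, label))
    return [(name, label) for _, name, label in sorted(found)]

sentence = "Roger Federer drove a Honda City to the match."
entities = find_entities(sentence)
print(entities)
```

A lookup like this cannot recognize names it has never seen, which is the main reason trained models are used in practice.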


In [31]:
# import spacy
# from spacy import displacy
# 
# NER = spacy.load("en_core_web_sm")
# 
# # Sample text to test
# raw_text = ("The Indian Space Research Organisation is the national space agency of India, headquartered in Bengaluru. It operates under Department of Space which is directly overseen by the Prime Minister of India while Chairman of ISRO acts as executive of DOS as well.")
# text1 = NER(raw_text)
# 
# # Now, we print the data on the NEs found in this text sample.
# for word in text1.ents:
#     print(word.text,word.label_)


Let's apply what we have learned to real-world data: we are going to scrape content from the web and apply text mining to it.

Steps:
- Scrape content from the web
- Remove punctuation marks
- Convert all text to lowercase
- Perform tokenization
- Compute the frequency distribution of the tokens
- Find the 30 most frequent words
- Remove stop words
- Compute the frequency distribution of the remaining terms
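
Leaving the web scraping aside, the remaining steps can be sketched end to end on a short string using only the standard library. The `top_terms` function, the mini stop list, and the sample sentence are assumptions for illustration; whitespace splitting stands in for a real tokenizer.

```python
import string
from collections import Counter

# Hypothetical mini stop list, far smaller than a real one
STOP_WORDS = {"the", "a", "and", "of", "was", "in", "he"}

def top_terms(text, n=3):
    """Clean, tokenize, filter, and count; return the n most frequent terms."""
    # Remove ASCII punctuation marks
    cleaned = text.translate(str.maketrans("", "", string.punctuation))
    # Convert to lowercase and split on whitespace
    tokens = cleaned.lower().split()
    # Remove stop words
    kept = [t for t in tokens if t not in STOP_WORDS]
    # Count what is left
    return Counter(kept).most_common(n)

sample = "Jobs co-founded Apple. Jobs left Apple in 1985, and Jobs returned in 1997."
result = top_terms(sample, n=2)
print(result)
```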


In [37]:
from bs4 import BeautifulSoup
import requests

# URL of the Wikipedia article to scrape
URL = "https://en.wikipedia.org/wiki/Steve_Jobs"

html_content = requests.get(URL).text
soup = BeautifulSoup(html_content, "html.parser")  # Use "html.parser" instead of "lxml"

# Now, we get the body content
body = soup.body.text

print(body)

Jump to content
Main menu
Main menu
move to sidebar
hide
Navigation
Main pageContentsCurrent eventsRandom articleAbout WikipediaContact us
Contribute
HelpLearn to editCommunity portalRecent changesUpload file
Search
Search
Appearance
Donate
Create account
Log in
Personal tools
Donate Create account Log in
Pages for logged out editors learn more
ContributionsTalk
Contents
move to sidebar
hide
(Top)
1
Early life
Toggle Early life subsection
1.1
Family
1.2
Infancy
1.3
Childhood
1.4
Homestead High
1.5
Reed College
2
1974–1985
Toggle 1974–1985 subsection
2.1
Pre-Apple
2.2
Apple (1976–1985)
3
1985–1997
Toggle 1985–1997 subsection
3.1
NeXT computer
3.2
Pixar and Disney
4
1997–2011
Toggle 1997–2011 subsection
4.1
Return to
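
The printed body text starts with navigation labels such as "Main menu" and "Search" rather than article content. One possible way to reduce this noise is to keep only the text inside paragraph elements. The sketch below does this with the standard library's `html.parser` on a tiny made-up HTML string; the `ParagraphText` class and the sample markup are assumptions for illustration. With BeautifulSoup, `soup.find_all("p")` plays a similar role.

```python
from html.parser import HTMLParser

class ParagraphText(HTMLParser):
    """Collect only the text that appears inside <p> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0     # number of <p> elements currently open
        self.chunks = []   # text fragments found inside paragraphs

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

# Made-up page: a navigation block followed by one paragraph
sample_html = "<body><nav>Main menu</nav><p>Steve Jobs co-founded <b>Apple</b>.</p></body>"

parser = ParagraphText()
parser.feed(sample_html)
parser.close()
article_text = "".join(parser.chunks)
print(article_text)
```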

In [38]:
# Remove punctuation marks

# Characters to strip from the text (raw string so the backslash is taken literally)
punc = r'''!()-[]{};:'"\,<>./?@#$%^&*_~—'''

# Delete every punctuation character in a single pass
body = body.translate(str.maketrans('', '', punc))

# Print the result
print(body)


Jump to content
Main menu
Main menu
move to sidebar
hide
Navigation
Main pageContentsCurrent eventsRandom articleAbout WikipediaContact us
Contribute
HelpLearn to editCommunity portalRecent changesUpload file
Search
Search
Appearance
Donate
Create account
Log in
Personal tools
Donate Create account Log in
Pages for logged out editors learn more
ContributionsTalk
Contents
move to sidebar
hide
Top
1
Early life
Toggle Early life subsection
11
Family
12
Infancy
13
Childhood
14
Homestead High
15
Reed College
2
1974–1985
Toggle 1974–1985 subsection
21
PreApple
22
Apple 1976–1985
3
1985–1997
Toggle 1985–1997 subsection
31
NeXT computer
32
Pixar and Disney
4
1997–2011
Toggle 1997–2011 subsection
41
Return to Apple

In [42]:
# Convert all text to lowercase
body = body.lower()

In [43]:
token = word_tokenize(body)
token

['jump',
 'to',
 'content',
 'main',
 'menu',
 'main',
 'menu',
 'move',
 'to',
 'sidebar',
 'hide',
 'navigation',
 'main',
 'pagecontentscurrent',
 'eventsrandom',
 'articleabout',
 'wikipediacontact',
 'us',
 'contribute',
 'helplearn',
 'to',
 'editcommunity',
 'portalrecent',
 'changesupload',
 'file',
 'search',
 'search',
 'appearance',
 'donate',
 'create',
 'account',
 'log',
 'in',
 'personal',
 'tools',
 'donate',
 'create',
 'account',
 'log',
 'in',
 'pages',
 'for',
 'logged',
 'out',
 'editors',
 'learn',
 'more',
 'contributionstalk',
 'contents',
 'move',
 'to',
 'sidebar',
 'hide',
 'top',
 '1',
 'early',
 'life',
 'toggle',
 'early',
 'life',
 'subsection',
 '11',
 'family',
 '12',
 'infancy',
 '13',
 'childhood',
 '14',
 'homestead',
 'high',
 '15',
 'reed',
 'college',
 '2',
 '1974–1985',
 'toggle',
 '1974–1985',
 'subsection',
 '21',
 'preapple',
 '22',
 'apple',
 '1976–1985',
 '3',
 '1985–1997',
 'toggle',
 '1985–1997',
 'subsection',
 '31',
 'next',
 'computer',

In [45]:
# Compute the frequency distribution of the tokens
fdist = FreqDist(token)
fdist

FreqDist({'the': 1061, 'and': 472, 'jobs': 472, 'of': 417, 'a': 415, 'to': 389, 'on': 349, 'in': 344, 'from': 311, 'apple': 248, ...})

In [48]:
# Find the 30 most frequent words
fdist1 = fdist.most_common(30)
fdist1

[('the', 1061),
 ('and', 472),
 ('jobs', 472),
 ('of', 417),
 ('a', 415),
 ('to', 389),
 ('on', 349),
 ('in', 344),
 ('from', 311),
 ('apple', 248),
 ('was', 238),
 ('steve', 231),
 ('original', 230),
 ('archived', 229),
 ('retrieved', 212),
 ('he', 204),
 ('2011', 194),
 ('that', 192),
 ('his', 180),
 ('for', 149),
 ('at', 131),
 ('with', 128),
 ('october', 117),
 ('2012', 109),
 ('as', 106),
 ('by', 95),
 ('computer', 87),
 ('i', 85),
 ('it', 85),
 ('april', 80)]

In [57]:
# Remove stop words
# a = set(stopwords.words('english'))  # the stop-word set `a` was already created above

token2 = [x for x in token if x not in a]
print(token2)


['jump', 'content', 'main', 'menu', 'main', 'menu', 'move', 'sidebar', 'hide', 'navigation', 'main', 'pagecontentscurrent', 'eventsrandom', 'articleabout', 'wikipediacontact', 'us', 'contribute', 'helplearn', 'editcommunity', 'portalrecent', 'changesupload', 'file', 'search', 'search', 'appearance', 'donate', 'create', 'account', 'log', 'personal', 'tools', 'donate', 'create', 'account', 'log', 'pages', 'logged', 'editors', 'learn', 'contributionstalk', 'contents', 'move', 'sidebar', 'hide', 'top', '1', 'early', 'life', 'toggle', 'early', 'life', 'subsection', '11', 'family', '12', 'infancy', '13', 'childhood', '14', 'homestead', 'high', '15', 'reed', 'college', '2', '1974–1985', 'toggle', '1974–1985', 'subsection', '21', 'preapple', '22', 'apple', '1976–1985', '3', '1985–1997', 'toggle', '1985–1997', 'subsection', '31', 'next', 'computer', '32', 'pixar', 'disney', '4', '1997–2011', 'toggle', '1997–2011', 'subsection', '41', 'return', 'apple', '5', 'health', 'problems', 'toggle', 'heal

In [59]:
# Compute the frequency distribution of the remaining terms
fdist2 = FreqDist(token2)
fdist2

FreqDist({'jobs': 472, 'apple': 248, 'steve': 231, 'original': 230, 'archived': 229, 'retrieved': 212, '2011': 194, 'october': 117, '2012': 109, 'computer': 87, ...})

In [60]:
fdist3 = fdist2.most_common(30)
fdist3

[('jobs', 472),
 ('apple', 248),
 ('steve', 231),
 ('original', 230),
 ('archived', 229),
 ('retrieved', 212),
 ('2011', 194),
 ('october', 117),
 ('2012', 109),
 ('computer', 87),
 ('april', 80),
 ('june', 80),
 ('january', 76),
 ('august', 75),
 ('jobss', 63),
 ('2013', 59),
 ('wozniak', 58),
 ('first', 55),
 ('b', 52),
 ('p', 51),
 ('november', 50),
 ('next', 48),
 ('5', 48),
 ('december', 47),
 ('2009', 46),
 ('10', 45),
 ('may', 45),
 ('isaacson', 45),
 ('2015', 45),
 ('new', 44)]
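
The list above still contains noise: single letters (`b`, `p`), numbers (`2011`, `5`), and citation words such as `archived` and `retrieved` that come from the reference section of the page. A possible extra filtering step is sketched below; the `CITATION_WORDS` set and the `is_content_word` helper are assumptions for illustration, and month names could be added to the set in the same way.

```python
from collections import Counter

# Hypothetical corpus-specific stop words: citation boilerplate seen in the output above
CITATION_WORDS = {"archived", "retrieved", "original"}

def is_content_word(token):
    """Keep alphabetic tokens longer than one character that are not citation words."""
    return token.isalpha() and len(token) > 1 and token not in CITATION_WORDS

# Small made-up sample in the same style as the tokens above
sample_tokens = ["jobs", "apple", "archived", "2011", "b", "p", "retrieved", "wozniak", "jobs"]

content = [t for t in sample_tokens if is_content_word(t)]
print(content)
print(Counter(content).most_common(1))
```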

In [None]:
# import spacy
# from spacy import displacy
# 
# NER = spacy.load("en_core_web_sm")
# 
# text3= NER(body)
# displacy.render(text3,style="ent",jupyter=True)
