# Text Data Mining

Text data mining is the process of extracting useful information from natural language text. The text messages, documents, emails, and files we generate every day are all written in natural language, and text mining is primarily used to draw useful insights or patterns from such data.

<img src="https://i.imgur.com/mQvsNyf.png" width="400" height="350" class="center">


<br>

# Areas of text mining in data mining

Text mining draws on the following areas:

<br><br>

<img src="https://i.imgur.com/Y0Ql6J3.png" width="450" height="450" class="center">


<br><br>

- **Information Extraction:**<br>
Information extraction is the automatic extraction of structured data, such as entities, relationships between entities, and attributes describing entities, from an unstructured source.

- **Natural Language Processing:**<br>
NLP stands for Natural Language Processing. Its goal is to enable computer software to understand human language as it is spoken and written. NLP is primarily a component of artificial intelligence (AI). Developing NLP applications is difficult because computers generally expect humans to "speak" to them in a programming language that is precise, unambiguous, and highly structured. Human speech, however, is often imprecise and ambiguous, and it depends on many complex variables, including slang, social context, and regional dialects.

- **Data Mining:**<br>
Data mining refers to the extraction of useful data and hidden patterns from large data sets. Data mining tools can predict behaviors and future trends, which allows businesses to make better data-driven decisions. They can also be used to resolve many business problems that have traditionally been too time-consuming.

- **Information Retrieval:**<br>
Information retrieval deals with retrieving useful data from the data stored in our systems. As an analogy, the search engines found on websites such as e-commerce sites can be viewed as a form of information retrieval.
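
As a rough illustration of information retrieval, the sketch below builds a tiny inverted index (a map from each word to the documents containing it) and answers keyword queries against it. The catalogue entries and the function names `build_inverted_index` and `search` are made up for this example; real search engines add ranking, stemming, and much more.

```python
from collections import defaultdict

# Hypothetical product catalogue; the documents and IDs are made up for illustration.
documents = {
    1: "red running shoes for men",
    2: "blue running jacket",
    3: "red leather wallet",
}

def build_inverted_index(docs):
    """Map each lowercased word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the sorted IDs of documents containing every word in the query."""
    words = query.lower().split()
    if not words:
        return []
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return sorted(result)

index = build_inverted_index(documents)
print(search(index, "red"))            # [1, 3]
print(search(index, "running shoes"))  # [1]
```

Looking words up in the index avoids scanning every document for each query, which is the main reason this structure is commonly used for search.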


# Text Mining Process:

The text mining process incorporates the following steps to extract the data from the document.<br><br>

<img src="https://i.imgur.com/sDfE2o4.png" width="500" height="400">

- **Text transformation**<br>
Text transformation is a technique used to control the capitalization of the text.
The two major ways of representing a document are:

    - Bag of words
    - Vector Space
   
- **Text Pre-processing**<br>
Pre-processing is a critical step in text mining, Natural Language Processing (NLP), and Information Retrieval (IR). In text mining, data pre-processing is used to extract useful information and knowledge from unstructured text data. Information retrieval is a matter of choosing which documents in a collection should be retrieved to fulfill the user's need.

- **Feature selection**<br>
Feature selection is a significant part of data mining. It can be defined as the process of reducing the input to processing by finding the most essential information sources. Feature selection is also called variable selection.

- **Data Mining**<br>
In this step, the text mining procedure merges with the conventional data mining process. Classic data mining techniques are applied to the resulting structured database.

- **Evaluate**<br>
Afterward, the results are evaluated. Once the results have been evaluated, they can be discarded.


- **Applications**<br>
Text mining has the following applications:

    - Risk Management
    - Customer Care Service
    - Business Intelligence
    - Social Media Analysis

# Natural Language Processing (NLP)

NLP stands for Natural Language Processing, a field at the intersection of computer science, human language, and artificial intelligence. It is the technology machines use to understand, analyse, manipulate, and interpret human languages. It helps developers organize knowledge for tasks such as translation, automatic summarization, Named Entity Recognition (NER), speech recognition, relationship extraction, and topic segmentation.

## Applications of NLP

NLP has the following applications:

**1. Speech Recognition**<br>
Speech recognition converts spoken words into text. It is used in applications such as mobile devices, home automation, video recovery, dictation in Microsoft Word, voice biometrics, and voice user interfaces.<br>

<img src="https://i.imgur.com/EatZjYU.png" width="300" height="100" class="center">


**2. Spam Detection**<br>
Spam detection is used to detect unwanted e-mails before they reach a user's inbox.<br>

<img src="https://i.imgur.com/U31OVdW.png" width="400" class="center">

**3. Sentiment Analysis**<br>
Sentiment analysis is also known as opinion mining. It is used on the web to analyse the attitude, behaviour, and emotional state of the sender. It is implemented through a combination of NLP (Natural Language Processing) and statistics: values (positive, negative, or neutral) are assigned to the text, and the mood of the context (happy, sad, angry, etc.) is identified.
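
A minimal sketch of the idea of assigning positive, negative, or neutral values to text is shown below. It simply counts words from two small word lists; the lists and the `sentiment` function are assumptions made for this illustration, and practical systems typically rely on trained statistical models instead.

```python
# Hypothetical word lists, chosen only for this illustration.
POSITIVE = {"good", "great", "happy", "love", "excellent"}
NEGATIVE = {"bad", "sad", "angry", "hate", "terrible"}

def sentiment(text):
    """Label text by counting positive and negative words."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I love this phone, the camera is great!"))  # positive
print(sentiment("The battery is terrible."))                 # negative
print(sentiment("The parcel arrived on Monday."))            # neutral
```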

<img src="https://i.imgur.com/AYd74N2.png" width="400" height="500" class="center">

**4. Spelling correction**<br>
Office software such as Microsoft Word and PowerPoint uses NLP to provide spelling correction.
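
One simple way to suggest corrections, sketched here with Python's standard `difflib` module, is to return the dictionary word most similar to the misspelled one. The mini-dictionary and the `suggest` helper are hypothetical; this is not how any particular word processor implements the feature.

```python
import difflib

# Hypothetical mini-dictionary; a real spell checker would use a full word list.
DICTIONARY = ["spelling", "spending", "selling", "language", "processing"]

def suggest(word, n=1):
    """Return up to n dictionary words that most closely resemble `word`."""
    return difflib.get_close_matches(word.lower(), DICTIONARY, n=n, cutoff=0.6)

print(suggest("speling"))   # ['spelling']
print(suggest("langauge"))  # ['language']
```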

<img src="https://i.imgur.com/0Ir3LFQ.png" width="400" height="400" class="center">

**5. Chatbot**<br>
Chatbots are one of the important applications of NLP. Many companies use them to provide customer chat services.

<img src="https://i.imgur.com/ixoPd2Z.png" width="300" height="400" class="center">



## How to build an NLP pipeline

An NLP pipeline is built in the following steps:

### Step 1: Sentence Segmentation

Sentence segmentation is the first step in building the NLP pipeline. It breaks the paragraph into separate sentences.

**Example:** Consider the following paragraph -

**Hello everyone. Welcome to Tech IS.Tech IS is a global programming school.We are from Silicon Valley, USA. Our Tutors are Professional engineers with extensive work experience in Information Technology Industry hailing from around the world. You are studying NLP article.**

**Sentence segmentation produces the following result:**

    1. "Hello everyone."
    2. "Welcome to Tech IS.Tech IS is a global programming school.We are from Silicon Valley, USA."
    3. "Our Tutors are Professional engineers with extensive work experience in Information Technology Industry
        hailing from around the world."
    4. "You are studying NLP article."

Because the paragraph has no space after some of its periods ("IS.Tech", "school.We"), the tokenizer does not split the text at those points.

```python
import nltk
nltk.download('punkt')  # tokenizer models used by sent_tokenize and word_tokenize

paragraph = """Hello everyone. Welcome to Tech IS.Tech IS is a global programming school.We are from Silicon Valley, USA. 
            Our Tutors are Professional engineers with extensive work experience in Information Technology Industry hailing from around the world.
            You are studying NLP article."""

# Tokenizing sentences
sentences = nltk.sent_tokenize(paragraph)
print(sentences)
```

```
['Hello everyone.', 'Welcome to Tech IS.Tech IS is a global programming school.We are from Silicon Valley, USA.', 'Our Tutors are Professional engineers with extensive work experience in Information Technology Industry hailing from around the world.', 'You are studying NLP article.']
```
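
For comparison, here is a small regular-expression sketch that also splits where a period is followed directly by a capital letter with no space. The `split_sentences` function is an illustrative assumption, not part of NLTK, and it needs Python 3.7 or later because it splits on matches that can be empty. It would wrongly split abbreviations such as "Dr. Smith", which is one reason trained tokenizers are usually preferred.

```python
import re

text = ("Hello everyone. Welcome to Tech IS.Tech IS is a global programming "
        "school.We are from Silicon Valley, USA.")

def split_sentences(text):
    """Split after '.', '!' or '?' when the next non-space character is uppercase."""
    return re.split(r"(?<=[.!?])\s*(?=[A-Z])", text)

for sentence in split_sentences(text):
    print(sentence)
# Hello everyone.
# Welcome to Tech IS.
# Tech IS is a global programming school.
# We are from Silicon Valley, USA.
```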




### Step 2: Word Tokenization

Word tokenization breaks a sentence into separate words or tokens.

**Example:**

**Hello everyone. Welcome to Tech IS.Tech IS is a global programming school.We are from Silicon Valley, USA. Our Tutors are Professional engineers with extensive work experience in Information Technology Industry hailing from around the world. You are studying NLP article.**

**Word Tokenizer generates the following result:**

'Hello', 'everyone', '.', 'Welcome', 'to', 'Tech', 'IS.Tech', 'IS', 'is', 'a', 'global', 'programming', 'school.We', 'are', 'from', 'Silicon', 'Valley', ',', 'USA', '.', 'Our', 'Tutors', 'are', 'Professional', 'engineers', 'with', 'extensive', 'work', 'experience', 'in', 'Information', 'Technology', 'Industry', 'hailing', 'from', 'around', 'the', 'world', '.', 'You', 'are', 'studying', 'NLP', 'article', '.'

```python
# Tokenizing words
words = nltk.word_tokenize(paragraph)
print(words)
```

```
['Hello', 'everyone', '.', 'Welcome', 'to', 'Tech', 'IS.Tech', 'IS', 'is', 'a', 'global', 'programming', 'school.We', 'are', 'from', 'Silicon', 'Valley', ',', 'USA', '.', 'Our', 'Tutors', 'are', 'Professional', 'engineers', 'with', 'extensive', 'work', 'experience', 'in', 'Information', 'Technology', 'Industry', 'hailing', 'from', 'around', 'the', 'world', '.', 'You', 'are', 'studying', 'NLP', 'article', '.']
```
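
Tokens such as 'IS.Tech' and 'school.We' stay joined because the text has no space after those periods. As a rough alternative, the regular-expression sketch below treats every punctuation mark as its own token. The `tokenize` function is an assumption for illustration and is much cruder than NLTK's tokenizer (for example, it would split "don't" into three tokens).

```python
import re

def tokenize(text):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Welcome to Tech IS.Tech IS is a global programming school.We are here."))
# ['Welcome', 'to', 'Tech', 'IS', '.', 'Tech', 'IS', 'is', 'a', 'global',
#  'programming', 'school', '.', 'We', 'are', 'here', '.']
```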


### Step 3: Identifying Stop Words

In English, many words appear very frequently, such as "is", "and", "the", and "a". NLP pipelines flag these words as **stop words.** Stop words are often filtered out before any statistical analysis is done.

**Example:** He **is a** good boy.

```python
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # the stop word lists must be downloaded once
print(stopwords.words('english'))
```

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '
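
To show the filtering step itself, the sketch below removes stop words from the example sentence using a small hand-written set. The set and the `remove_stop_words` helper are assumptions for illustration; in practice the NLTK list printed above would normally be used.

```python
# Small hand-written stop word set for illustration; NLTK's English list is much longer.
STOP_WORDS = {"he", "she", "it", "is", "a", "an", "the", "and"}

def remove_stop_words(sentence):
    """Lowercase the sentence, strip punctuation, and drop stop words."""
    tokens = [token.strip(".,!?").lower() for token in sentence.split()]
    return [token for token in tokens if token and token not in STOP_WORDS]

print(remove_stop_words("He is a good boy."))  # ['good', 'boy']
```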

### Step 4: Stemming

Stemming is used to normalize words into their base or root form. For example, the words celebrates, celebrated, and celebrating all originate from the single root word "celebrate." The big problem with stemming is that it sometimes produces a root word that has no meaning.

**For example,** the words intelligence, intelligent, and intelligently may all be reduced to a single stem such as "intelligen." In English, the word "intelligen" does not have any meaning.

```python
from nltk.stem import PorterStemmer

sentences = nltk.sent_tokenize(paragraph)
stemmer = PorterStemmer()
# Build the stop word set once instead of on every loop iteration
stop_words = set(stopwords.words('english'))

# Stemming: drop stop words, then reduce each remaining word to its stem.
# The stop word check is case-sensitive, so capitalized words such as 'Our' are kept.
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    words = [stemmer.stem(word) for word in words if word not in stop_words]
    sentences[i] = ' '.join(words)
print(sentences)
```

```
['hello everyon .', 'welcom tech is.tech IS global program school.w silicon valley , usa .', 'our tutor profession engin extens work experi inform technolog industri hail around world .', 'you studi nlp articl .']
```
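
To make the idea of suffix stripping concrete, here is a deliberately naive stemmer. It is not the Porter algorithm; the suffix list and the `naive_stem` function are assumptions chosen only to show how related words collapse to one stem, and how that stem may not be a real word.

```python
# Suffixes tried in order; an illustrative list, not the Porter rule set.
SUFFIXES = ("ing", "ed", "es", "ly", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["celebrates", "celebrated", "celebrating"]:
    print(w, "->", naive_stem(w))
# celebrates -> celebrat
# celebrated -> celebrat
# celebrating -> celebrat
```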


### Step 5: Lemmatization

Lemmatization is quite similar to stemming. It is used to group the different inflected forms of a word under a common base form, called the lemma. The main difference between stemming and lemmatization is that lemmatization produces a root word that has a meaning.

**For example:** In lemmatization, the words intelligence, intelligent, and intelligently have the root word intelligent, which has a meaning.

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # WordNet data used by the lemmatizer
sentences = nltk.sent_tokenize(paragraph)
lemmatizer = WordNetLemmatizer()
# Build the stop word set once instead of on every loop iteration
stop_words = set(stopwords.words('english'))

# Lemmatization: lemmatize() treats each word as a noun unless a part of speech is given
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    sentences[i] = ' '.join(words)
print(sentences)
```

```
['Hello everyone .', 'Welcome Tech IS.Tech IS global programming school.We Silicon Valley , USA .', 'Our Tutors Professional engineer extensive work experience Information Technology Industry hailing around world .', 'You studying NLP article .']
```
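
Conceptually, a lemmatizer looks words up in a vocabulary rather than chopping off suffixes. The toy lookup below illustrates that difference; the `LEMMAS` table and `lemmatize` function are hypothetical and cover only a handful of words.

```python
# Hypothetical lookup table; a real lemmatizer such as WordNet's covers a full vocabulary.
LEMMAS = {
    "tutors": "tutor",
    "engineers": "engineer",
    "studies": "study",
    "mice": "mouse",
    "better": "good",
}

def lemmatize(word):
    """Return the dictionary form of a word, or the word itself if it is unknown."""
    return LEMMAS.get(word.lower(), word)

print([lemmatize(w) for w in ["Tutors", "engineers", "mice", "world"]])
# ['tutor', 'engineer', 'mouse', 'world']
```

Unlike suffix stripping, a lookup can handle irregular forms such as "mice", and every result is a real word.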


## Bag of Words (BoW) model

The Bag of Words (BoW) model is the simplest way of representing text as numbers. As the name suggests, a sentence is represented as a bag-of-words vector (a list of numbers).

Consider the following three movie reviews:

**Review 1:** This movie is very scary and long.<br>
**Review 2:** This movie is not scary and is slow.<br>
**Review 3:** This movie is spooky and good.<br>

We will first build a vocabulary from all the unique words in the above three reviews. The vocabulary consists of these 11 words: ‘This’, ‘movie’, ‘is’, ‘very’, ‘scary’, ‘and’, ‘long’, ‘not’,  ‘slow’, ‘spooky’,  ‘good’.

We can now take each of these words and count its occurrences in the three movie reviews above. This gives us 3 vectors for the 3 reviews:

<img src="https://i.imgur.com/Q8e9CXG.png" width="500" height="600" class="center">

Vector of Review 1: [1 1 1 1 1 1 1 0 0 0 0]

Vector of Review 2: [1 1 2 0 1 1 0 1 1 0 0]

Vector of Review 3: [1 1 1 0 0 1 0 0 0 1 1]

And that’s the core idea behind a Bag of Words (BoW) model.
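
The counting can be reproduced in a few lines of plain Python. This is a minimal sketch: the `tokenize` helper is an assumption that lowercases the text and drops periods, and the vocabulary is kept in order of first appearance so that it matches the word list given earlier.

```python
reviews = [
    "This movie is very scary and long.",
    "This movie is not scary and is slow.",
    "This movie is spooky and good.",
]

def tokenize(text):
    """Lowercase the text, drop periods, and split on whitespace."""
    return text.lower().replace(".", "").split()

# Vocabulary in order of first appearance across the reviews
vocabulary = []
for review in reviews:
    for word in tokenize(review):
        if word not in vocabulary:
            vocabulary.append(word)

# One count per vocabulary word for each review
vectors = [[tokenize(review).count(word) for word in vocabulary] for review in reviews]

print(vocabulary)
for vector in vectors:
    print(vector)
# [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
# [1, 1, 2, 0, 1, 1, 0, 1, 1, 0, 0]
# [1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1]
```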

```python
# Uses nltk and paragraph from the earlier cells
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
stop_words = set(stopwords.words('english'))
sentences = nltk.sent_tokenize(paragraph)

# Clean each sentence: keep letters only, lowercase, remove stop words, stem
corpus = []
for i in range(len(sentences)):
    review = re.sub('[^a-zA-Z]', ' ', sentences[i])
    review = review.lower()
    review = review.split()
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)

# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(corpus).toarray()
print(X)
```

```
[[0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 1 0 2 0 0 1 1 1 0 0]
 [1 0 1 0 1 1 0 1 0 1 1 0 1 0 0 0 0 0 1 1 0 0 0 1 1]
 [0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0]]
```


## TF-IDF (term frequency-inverse document frequency)
TF-IDF is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. This is done by multiplying two metrics: how many times a word appears in a document, and the inverse document frequency of the word across a set of documents.

## How is TF-IDF calculated?
TF-IDF for a word in a document is calculated by multiplying two different metrics:

### Term Frequency (TF)
The number of times a word appears in a document divided by the total number of words in the document. Every document has its own term frequency.

<img src="https://i.imgur.com/BdRVuyn.png" width="200" height="80" class="center">

### Inverse Document Frequency (IDF)
The log of the number of documents divided by the number of documents that contain the word w. Inverse document frequency determines the weight of rare words across all documents in the corpus.

<img src="https://i.imgur.com/l4Qq4ZO.png" width="200" height="80" class="center">

**Lastly, the TF-IDF is simply the TF multiplied by IDF.**
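
The definitions above can be checked by hand with a short script. This sketch assumes the natural logarithm for IDF (the text only says "the log") and reuses three short movie reviews as the corpus; the function names `tf`, `idf`, and `tf_idf` are chosen for this example.

```python
import math

documents = [
    "this movie is very scary and long",
    "this movie is not scary and is slow",
    "this movie is spooky and good",
]
tokenized = [doc.split() for doc in documents]

def tf(word, tokens):
    """Occurrences of the word divided by the number of words in the document."""
    return tokens.count(word) / len(tokens)

def idf(word, corpus):
    """Natural log of (number of documents / documents containing the word).

    Assumes the word occurs in at least one document.
    """
    containing = sum(1 for tokens in corpus if word in tokens)
    return math.log(len(corpus) / containing)

def tf_idf(word, tokens, corpus):
    """TF multiplied by IDF."""
    return tf(word, tokens) * idf(word, corpus)

print(tf_idf("this", tokenized[0], tokenized))             # 0.0, "this" occurs in every review
print(round(tf_idf("very", tokenized[0], tokenized), 4))   # 0.1569
print(round(tf_idf("scary", tokenized[0], tokenized), 4))  # 0.0579
```

A word that appears in every document gets a score of zero, while a word unique to one document scores highest. Note that scikit-learn's `TfidfVectorizer` by default uses raw counts for TF, a smoothed IDF, and L2 normalization, so its numbers differ from this hand calculation.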

<img src="https://i.imgur.com/RfIjTpr.png" width="300" height="100" class="center">

```python
# Cleaning the texts and creating the corpus
# Uses nltk and paragraph from the earlier cells
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

wordnet = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
sentences = nltk.sent_tokenize(paragraph)

# Clean each sentence: keep letters only, lowercase, remove stop words, lemmatize
corpus = []
for i in range(len(sentences)):
    review = re.sub('[^a-zA-Z]', ' ', sentences[i])
    review = review.lower()
    review = review.split()
    review = [wordnet.lemmatize(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)

# Creating the TF-IDF model
from sklearn.feature_extraction.text import TfidfVectorizer
cv = TfidfVectorizer()
X = cv.fit_transform(corpus).toarray()
print(X)
```

[[0.         0.         0.         0.70710678 0.         0.
  0.         0.         0.70710678 0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.        ]
 [0.         0.         0.         0.         0.         0.
  0.30151134 0.         0.         0.         0.         0.
  0.         0.30151134 0.30151134 0.30151134 0.         0.60302269
  0.         0.         0.30151134 0.30151134 0.30151134 0.
  0.        ]
 [0.28867513 0.         0.28867513 0.         0.28867513 0.28867513
  0.         0.28867513 0.         0.28867513 0.28867513 0.
  0.28867513 0.         0.         0.         0.         0.
  0.28867513 0.28867513 0.         0.         0.         0.28867513
  0.28867513]
 [0.         0.57735027 0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.57735027
  0.         0.         0.         0.         0.57735027 0.
  0.         0.         0.