<a href="https://colab.research.google.com/github/ThanhVanLe0605/Data-Mining-For-Business-Analytics-In-Python/blob/main/Chapter_20_Text_mining.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this chapter, we introduce text as a form of data.
- First, we discuss a tabular representation of text data in which each column is a word, each row is a document, and each cell is a 0 or 1, indicating whether that column's word is present in that row's document.
- Then we consider how to move from unstructured documents to this structured matrix.
- Finally, we illustrate how to integrate this process into the standard data mining procedures covered in earlier parts of the book.

20.1. [INTRODUCTION](https://colab.research.google.com/drive/1rIv6HOfTRsYjsN7KnG_-WYKFBvIB4oaL#scrollTo=_tud5FVMy6dV&line=1&uniqifier=1)

20.2. [THE TABULAR REPRESENTATION OF TEXT: TERM-DOCUMENT MATRIX AND "BAG-OF-WORDS"](https://colab.research.google.com/drive/1rIv6HOfTRsYjsN7KnG_-WYKFBvIB4oaL#scrollTo=tlry-74ezIfL&line=1&uniqifier=1)

20.3. [BAG-OF-WORDS VS. MEANING EXTRACTION AT DOCUMENT LEVEL](https://colab.research.google.com/drive/1rIv6HOfTRsYjsN7KnG_-WYKFBvIB4oaL#scrollTo=yqryZDJ1zdYq&line=1&uniqifier=1)

20.4. [PREPROCESSING THE TEXT](https://colab.research.google.com/drive/1rIv6HOfTRsYjsN7KnG_-WYKFBvIB4oaL#scrollTo=Ulj1YLIWzvzW&line=1&uniqifier=1)

20.5. [IMPLEMENTING DATA MINING METHODS](https://colab.research.google.com/drive/1rIv6HOfTRsYjsN7KnG_-WYKFBvIB4oaL#scrollTo=PcvJBFVQ0y9z&line=1&uniqifier=1)

20.6. [EXAMPLE: ONLINE DISCUSSION ON AUTOS AND ELECTRONICS](https://colab.research.google.com/drive/1rIv6HOfTRsYjsN7KnG_-WYKFBvIB4oaL#scrollTo=eKYVcdku05FR&line=1&uniqifier=1)

20.7. [SUMMARY](https://colab.research.google.com/drive/1rIv6HOfTRsYjsN7KnG_-WYKFBvIB4oaL#scrollTo=Xb7hf_1o1FRS&line=1&uniqifier=1)

**Python**

In this chapter, we will use **pandas** for data handling and **scikit-learn** for feature creation and model building. The Natural Language Toolkit (nltk: https://www.nltk.org) will be used for more advanced text processing.

In [4]:
# import required functionality for this chapter
from zipfile import ZipFile
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
import nltk
from nltk import word_tokenize
from nltk.stem.snowball import EnglishStemmer
import matplotlib.pylab as plt

# Install the dmba library, which is not preinstalled in Colab
!pip install dmba

from dmba import printTermDocumentMatrix, classificationSummary, liftChart

# download data required for NLTK
nltk.download('punkt')

Collecting dmba
  Downloading dmba-0.2.4-py3-none-any.whl.metadata (1.9 kB)
Downloading dmba-0.2.4-py3-none-any.whl (11.8 MB)
Installing collected packages: dmba
Successfully installed dmba-0.2.4
Colab environment detected.


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

# 20.1. INTRODUCTION


**1. Traditional Data Types**
Up to this point, data mining has primarily focused on structured data types:

* a. **Numerical**
* b. **Binary** (yes/no)

* c. **Multicategory**

**2. The Role of Text in Predictive Analytics**
In many modern applications, data exists in **text form**. In these scenarios:

* a. The **predictor variables (features)** are embedded directly within the text of documents.
* b. **Automated algorithms** are required to process this unstructured data.

**3. Real-world Classification Examples**
* a. **Internet Service Providers (ISP):** Using algorithms to **classify** support tickets as *urgent* or *routine* for efficient routing.

* b. **Legal Industry:** Automating the discovery process by classifying massive volumes of documents as *relevant* or *irrelevant*.

**4. Drivers of Text Mining Growth**
The field has expanded significantly due to the availability of **social media data** (Twitter feeds, blogs, online forums, news).

* a. **Adoption Rate:** According to the *Rexer Analytics 2013 Data Mining Survey*, 25% to 40% of data miners utilize text data.

* b. **Research Impact:** High public availability of this data provides a repository for researchers to hone and improve text mining methods.

**5. Emerging Application Areas**

* a. A key area of growth is applying text mining methods to **notes and transcripts** from contact centers and service centers.

# 20.2. THE TABULAR REPRESENTATION OF TEXT: TERM-DOCUMENT MATRIX AND "BAG-OF-WORDS"

**1. The Term-Document Matrix**
To analyze text quantitatively, unstructured sentences must be converted into a structured matrix format:
* **Documents:** The distinct units of text (e.g., sentences $S1, S2, S3$).
* **Terms:** The individual words extracted from these documents.
* **Structure:** A matrix where rows represent terms and columns represent documents (or vice versa). The cell values represent the **frequency** of the term in that document.

**2. Implementation in Python**
* **Library:** `scikit-learn` (a standard machine learning library).
* **Tool:** `CountVectorizer`.
    * *Process:* Collect documents into a list $\rightarrow$ Apply `CountVectorizer` to generate the matrix.

**3. The "Bag-of-Words" (BoW) Approach**
This matrix representation relies on the **Bag-of-Words** assumption:
* **Ignored:** Grammar, syntax, and word order do not matter.
* **Preserved:** Only the presence and frequency of words matter.
* **Result:** Text is transformed into **tabular data** (numerical features) suitable for standard algorithms.
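
To make the bag-of-words assumption concrete, here is a minimal plain-Python sketch (not the scikit-learn implementation). The helper name `bag_of_words` and the two example sentences are invented for illustration. It shows that two documents with the same words in a different order produce identical bags.

```python
from collections import Counter

def bag_of_words(document):
    """Lowercase the text, split on whitespace, and count each word."""
    return Counter(document.lower().split())

# Two hypothetical documents with the same words in a different order
doc_a = "the dog chased the cat"
doc_b = "the cat chased the dog"

bow_a = bag_of_words(doc_a)
bow_b = bag_of_words(doc_b)

print(bow_a)
# Word order is discarded, so the two bags compare equal
print(bow_a == bow_b)
```

The two sentences mean different things, yet their tabular representations are indistinguishable. This is exactly the information the bag-of-words approach gives up.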



**4. Applications and Challenges**
* **Utility:** Once in tabular form, the data can be used for **Clustering**, **Classification**, or **Prediction** (by appending an outcome variable).
* **Reality Check:** Text mining is rarely simple.
    * Raw data usually requires significant **preprocessing** (cleaning).
    * **Human review** is often indispensable to validate the relevance of documents before training.

**TABLE 20.1. TERM-DOCUMENT MATRIX REPRESENTATION OF WORDS IN SENTENCES S1-S3**

In [5]:
text = [
    'this is the first sentence.',
    'this is a second sentence.',
    'the third sentence is here.'
]
# learn features based on text
count_vect = CountVectorizer()
counts = count_vect.fit_transform(text)

printTermDocumentMatrix(count_vect, counts)

          S1  S2  S3
first      1   0   0
here       0   0   1
is         1   1   1
second     0   1   0
sentence   1   1   1
the        1   0   1
third      0   0   1
this       1   1   0


# 20.3. BAG-OF-WORDS VS. MEANING EXTRACTION AT DOCUMENT LEVEL

**1. Two Distinct Goals in Text Mining**
We can categorize text mining tasks into two levels of complexity:
* **Goal A: Classification & Clustering (The focus of this book)**
    * *Task:* Labeling a document as belonging to a specific class or grouping similar documents together.
    * *Method:* Uses standard statistical and machine learning predictive models (similar to those used for numerical data).
    * *Requirement:* A sizable collection of documents (**Corpus**) and pre-labeled data for training.
* **Goal B: Extracting Detailed Meaning**
    * *Task:* Deriving deep understanding from a single document.
    * *Method:* Requires complex algorithms to handle grammar, syntax, punctuation, and natural language logic.
    * *Field:* This is the domain of **Natural Language Processing (NLP)**.

**2. The Challenge of "Meaning"**
Extracting meaning is far more formidable than probabilistic classification due to:
* **Ambiguity:** Identical words can have vastly different meanings depending on context.
* **Context Sensitivity:**
    * *Example:* "Hitchcock **shot** The Birds in Bodega Bay."
    * *Interpretation 1:* Filming a movie (Correct in film context).
    * *Interpretation 2:* Hunting birds (Possible interpretation if context is ignored).
* **Resolution:** Humans use cultural and social context to resolve ambiguity; computers struggle with this in a simple Bag-of-Words model.



**3. Scope of Analysis**
* **In Scope:** We will focus on **probabilistic assignment**—using word frequencies to predict which class a document belongs to (e.g., "Urgent" vs. "Routine").
* **Out of Scope:** We will not attempt to program the computer to "understand" the documents in the human sense (deep NLP).

# 20.4. PREPROCESSING THE TEXT

**1. Theory vs. Reality**
* **Simple Theory:** Previous examples used clean text where words were separated by single spaces and sentences ended with periods. Simple rules could easily parse this.
* **Complex Reality:** Real-world text data is "messy." Preparing text for mining is significantly more involved than preparing numerical or categorical data.

**2. Common Data Issues (The "Dirty" Examples)**
Consider the modified sentences ($S1, S2, S3, S4$) presented in the text. They introduce typical noise found in raw data:

  a. **Extra Spaces:** Irregular spacing between words.

  b. **Non-alpha Characters:** Usage of punctuation and emoticons (e.g., `!!`, `:)`, `,`).

  c. **Inconsistent Capitalization:** Random upper/lower case usage (e.g., "Sentence" capitalized in the middle).

  d. **Misspellings:** Typographical errors (e.g., "forth" instead of "fourth").

**3. The Concept of a "Corpus"**
* **Definition:** A *corpus* refers to a large, fixed, and standard collection of documents.
* **Purpose:**
    a. Used to train text preprocessing algorithms.
    
    b. Serves as a benchmark to compare the results of different algorithms.
* **Example:** The **Brown Corpus** (compiled at Brown University in the 1960s), which contains 500 English documents across various types.

**TABLE 20.2. TERM-DOCUMENT MATRIX REPRESENTATION OF WORDS IN SENTENCES S1-S4**

In [7]:
text = [
    'this is the first     sentence!!',
    'this is a second sentence :',
    'the third sentence, is here ',
    'forth of all sentences'
]

# Learn features based on text. Special characters are excluded in the analysis
count_vect = CountVectorizer()
counts = count_vect.fit_transform(text)

printTermDocumentMatrix(count_vect, counts)

           S1  S2  S3  S4
all         0   0   0   1
first       1   0   0   0
forth       0   0   0   1
here        0   0   1   0
is          1   1   1   0
of          0   0   0   1
second      0   1   0   0
sentence    1   1   1   0
sentences   0   0   0   1
the         1   0   1   0
third       0   0   1   0
this        1   1   0   0


**Tokenization**

**1. Tokenization (The First Step)**
* **Definition:** The automated process of dividing raw text into separate units called "tokens" (or terms).
* **Delimiters:** Software uses characters like spaces, commas, or colons to decide where one token ends and another begins.
* **Result:** A raw list of terms that forms the basis of the Term-Document Matrix.
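
As a rough illustration of delimiter-based tokenization, the sketch below splits text on runs of delimiter characters using the standard `re` module. The function name `tokenize` and the particular delimiter set (whitespace, comma, colon, exclamation mark, period) are assumptions chosen for this example. They are not the rules that `CountVectorizer` applies.

```python
import re

def tokenize(text, delimiters=r"[\s,:!.]+"):
    """Split text on runs of delimiter characters and drop empty strings."""
    return [tok for tok in re.split(delimiters, text.lower()) if tok]

tokens = tokenize("the third sentence, is here ")
print(tokens)
```

Changing the delimiter set changes the resulting vocabulary, which is why the choice of delimiters is a modeling decision rather than a fixed rule.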

**2. The Problem: "Bulk and Noise"**
Creating a matrix from every single token leads to significant issues:
* **Bulk (High Dimensionality):** English has a very large vocabulary (over a million words by some counts). Treating every date stamp, email address, or typo as its own term on top of that creates a massive, computationally heavy matrix.
* **Noise (Irrelevant Data):** Many terms do not help in prediction and confuse the model.
    * *Examples:* Boilerplate text (email signatures), random punctuation, or overly common words.



**3. Text Reduction Strategies**
To combat "Bulk and Noise," preprocessing is essential:
* **Stopwords Removal:** Eliminating common words (e.g., "and", "the", "is") that carry little semantic meaning for classification.
* **Filtering Boilerplate:** Removing repetitive standard text (like legal disclaimers in emails) that appears in every document but adds no unique information.
* **Goal:** To reduce the number of variables (columns) to only those that aid in analysis, making the model faster and more accurate.

**TABLE 20.3 TOKENIZATION OF S1-S4 EXAMPLE**

In [8]:
# code for term frequency of second example
text = [
    'this is the first     sentence!!',
    'this is a second sentence :',
    'the third sentence, is here ',
    'forth of all sentences'
]
# Learn features based on text. The custom token pattern keeps '!' and ':' so that
# special characters attached to a word stay part of the token.
# The pattern must not start with a space, or the first word of each document is skipped.
count_vect = CountVectorizer(token_pattern='[a-zA-Z!:]+')
counts = count_vect.fit_transform(text)

printTermDocumentMatrix(count_vect, counts)

            S1  S2  S3  S4
:            0   1   0   0
a            0   1   0   0
all          0   0   0   1
first        1   0   0   0
forth        0   0   0   1
here         0   0   1   0
is           1   1   1   0
of           0   0   0   1
second       0   1   0   0
sentence     0   1   1   0
sentence!!   1   0   0   0
sentences    0   0   0   1
the          1   0   1   0
third        0   0   1   0
this         1   1   0   0


**Text Reduction**

Effective text reduction focuses on removing noise and reducing the vocabulary size to improve model performance. Below are the key techniques:

### 1. Stopword Removal
  a. **Tools:** Most software, such as the `CountVectorizer` class in `scikit-learn`, includes generic stopword lists for removing frequently occurring terms.

  b. **Customization:** Users can review the extensive default list in `scikit-learn` or provide a custom list using the `stop_words` argument.
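
The idea behind stopword removal can be sketched in a few lines of plain Python. The `STOPWORDS` set below is a deliberately tiny, hypothetical list made up for this example; real lists, including the one shipped with `scikit-learn`, are far longer.

```python
# Hypothetical, deliberately tiny stopword list (real lists are much longer)
STOPWORDS = {"a", "all", "here", "is", "of", "the", "this"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stopword set."""
    return [tok for tok in tokens if tok not in stopwords]

tokens = ["this", "is", "the", "first", "sentence"]
kept = remove_stopwords(tokens)
print(kept)
```

Passing a custom list through the `stop_words` argument of `CountVectorizer` applies the same kind of filter while the matrix is built.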

### 2. Vocabulary Reduction Strategies
Additional techniques to reduce text volume and focus on meaningful content include:

* **Stemming**

    a. A linguistic method that reduces different variants of words to a common core (root form).

* **Frequency Filters**

    a. Eliminate terms occurring in the great majority of documents (stop-words-like behavior).

    b. Eliminate very rare terms to reduce noise.

    c. Limit the vocabulary to the top *n* most frequent terms.

* **Synonyms & Formatting**

    a. Consolidate synonyms or synonymous phrases.

    b. Ignore letter case (usually converting all text to lowercase).

* **Normalization**

    a. Replace specific terms within a category with the general category name.
    
    b. *Example:* Replacing distinct e-mail addresses with `emailtoken` or different numbers with `numbertoken`.
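
The normalization step can be sketched with regular expressions. The two patterns below are simplified assumptions: they catch common e-mail and number formats but are not complete validators. The sample sentence is invented for illustration.

```python
import re

# Simplified patterns (assumptions for this sketch, not full validators)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
NUMBER_RE = re.compile(r"\b\d+(?:\.\d+)?\b")

def normalize(text):
    """Lowercase the text and replace e-mail addresses and numbers with category tokens."""
    text = text.lower()
    text = EMAIL_RE.sub("emailtoken", text)
    text = NUMBER_RE.sub("numbertoken", text)
    return text

normalized = normalize("Contact Jane.Doe@example.com about invoice 4521")
print(normalized)
```

After this step, every document that mentions any e-mail address shares the single term `emailtoken`, instead of each address becoming its own rare column.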

Table 20.5 presents the text reduction step applied to the four-sentence example, after tokenization. We can see that the number of terms has been reduced to five.
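
The frequency filters listed above can be sketched with document frequencies computed by hand. The function and parameter names (`filter_vocabulary`, `min_docs`, `max_share`, `top_n`) are hypothetical. In `scikit-learn`, the `min_df`, `max_df`, and `max_features` arguments of `CountVectorizer` serve a similar purpose.

```python
from collections import Counter

def document_frequency(tokenized_docs):
    """Count in how many documents each term appears at least once."""
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))
    return df

def filter_vocabulary(tokenized_docs, min_docs=1, max_share=1.0, top_n=None):
    """Drop rare terms, drop near-ubiquitous terms, and optionally keep the top n."""
    n_docs = len(tokenized_docs)
    df = document_frequency(tokenized_docs)
    kept = [t for t, c in df.items() if c >= min_docs and c / n_docs <= max_share]
    # Most frequent first; ties broken alphabetically for a stable result
    kept.sort(key=lambda t: (-df[t], t))
    return kept[:top_n] if top_n is not None else kept

docs = [
    ["this", "is", "the", "first", "sentence"],
    ["this", "is", "a", "second", "sentence"],
    ["the", "third", "sentence", "is", "here"],
    ["forth", "of", "all", "sentences"],
]
# Keep terms found in at least 2 documents but in no more than half of them
vocab = filter_vocabulary(docs, min_docs=2, max_share=0.5)
print(vocab)
```

With these thresholds, "is" and "sentence" are dropped for being too common (three of four documents), and terms that occur in a single document are dropped for being too rare.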

**Presence/Absence vs. Frequency**

The "Bag-of-Words" model can be implemented in two ways:

a. **Frequency-Based (Count)**
   * Counts how many times a term appears.
   * **Use Case:** When repetition implies intensity (e.g., repeated mentions of "IP address" in a support ticket indicate a specific technical issue).

b. **Presence/Absence (Binary)**
   * Records only if a term exists (1) or not (0), ignoring counts.
   * **Use Case:** Classification tasks where the mere existence of a term is a key predictor (e.g., a specific vendor name in forensic accounting).
   * **Implementation:** Set `binary=True` in `CountVectorizer`.
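
The difference between the two representations is easy to see on a single document. The sketch below uses plain Python rather than `CountVectorizer`; the ticket text and the helper names are made up for illustration.

```python
from collections import Counter

def term_counts(tokens):
    """Frequency-based representation: how often each term occurs."""
    return dict(Counter(tokens))

def term_presence(tokens):
    """Presence/absence representation: 1 for every term that occurs at all."""
    return {term: 1 for term in set(tokens)}

# Hypothetical support ticket text
ticket = "ip address conflict after ip address renewal".split()

counts = term_counts(ticket)
presence = term_presence(ticket)
print(counts["ip"], presence["ip"])
```

Here the count for "ip" records the repetition, while the binary version only records that the term occurs.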

**Term Frequency-Inverse Document Frequency (TF-IDF)**

TF-IDF is a statistical measure used to evaluate the importance of a word to a document in a collection or corpus.

a. **The Concept**
   * It highlights terms that are frequent in a specific document but rare across the entire corpus.

b. **The Formula**
   * **Term Frequency ($TF$):** Count of term $t$ in document $d$.
   * **Inverse Document Frequency ($IDF$):**
     $$IDF(t) = 1 + \log\left(\frac{\text{Total Documents}}{\text{Documents containing } t}\right)$$
   * **Final Score:**
     $$TF\text{-}IDF = TF(t,d) \times IDF(t)$$

c. **Interpretation**
   * **High score:** Rare term appearing frequently in the document.
   * **Low score:** Term appearing in almost all documents (stopwords-like behavior) or absent terms.
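
The formula above can be checked by hand on the three sentences S1-S3 from Table 20.1. This sketch assumes the natural logarithm and applies no smoothing or normalization. By default, scikit-learn's `TfidfTransformer` applies both, so its numbers will generally differ from this hand calculation. The function names are hypothetical.

```python
import math

def idf(term, tokenized_docs):
    """IDF(t) = 1 + ln(total documents / documents containing t)."""
    n_docs = len(tokenized_docs)
    n_containing = sum(1 for tokens in tokenized_docs if term in tokens)
    return 1 + math.log(n_docs / n_containing)

def tf_idf(term, doc_tokens, tokenized_docs):
    """TF-IDF = TF(t, d) * IDF(t); a term absent from the document scores 0."""
    tf = doc_tokens.count(term)
    if tf == 0:
        return 0.0
    return tf * idf(term, tokenized_docs)

docs = [
    "this is the first sentence".split(),   # S1
    "this is a second sentence".split(),    # S2
    "the third sentence is here".split(),   # S3
]

score_first = tf_idf("first", docs[0], docs)        # appears only in S1
score_sentence = tf_idf("sentence", docs[0], docs)  # appears in every document
print(round(score_first, 4), round(score_sentence, 4))
```

The term "first" occurs in only one of three documents, so its score is 1 + ln(3), whereas "sentence" occurs in all three and gets the minimum weight of 1 for a term that is present.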

**Table 20.5. Text reduction of S1-S4 (after tokenization)**

**Table 20.6. TF-IDF MATRIX FOR S1-S4 EXAMPLE (AFTER TOKENIZATION AND TEXT REDUCTION)**

**From Terms to Concepts: Latent Semantic Indexing**

Latent semantic indexing (LSI), also known as latent semantic analysis (LSA), is a dimensionality reduction technique: it reduces the complexity of text data by transforming terms into a smaller set of concepts.

a. **Mechanism**
   * Similar to PCA (Principal Component Analysis), it groups correlated terms into linear combinations.
   * **Example:** Terms like *alternator, battery, headlights* $\rightarrow$ mapped to concept **"Alternator Failure"**.

b. **Trade-off**
   * **Pros:** Handles synonyms and reduces noise; improves manageability for modeling.
   * **Cons:** Creates a **black-box** model. The resulting concepts may not have a clear human-readable meaning, even when they cluster related documents effectively.
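
The notes above give no formula, so the following is the usual textbook formulation rather than something taken from this chapter. LSI is commonly computed as a truncated singular value decomposition of the term-document matrix $A$ (here with $t$ terms as rows and $d$ documents as columns), keeping only the $k$ largest singular values:

$$A \approx A_k = U_k \Sigma_k V_k^{\top}$$

Here $U_k$ is $t \times k$ and maps terms to concepts, $\Sigma_k$ is a $k \times k$ diagonal matrix of the $k$ largest singular values, and $V_k$ is $d \times k$ and maps documents to concepts. Each document is then described by $k$ concept scores (a column of $\Sigma_k V_k^{\top}$) instead of $t$ term counts, with $k$ much smaller than $t$.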

# 20.5. IMPLEMENTING DATA MINING METHODS

Once text is converted into a numeric matrix, standard Data Mining methods are applied:

a. **Clustering:** Grouping similar documents (e.g., clustering medical reports by symptoms).

b. **Prediction:** Predicting continuous values (e.g., time to resolve a ticket).

c. **Classification (Labeling):** Assigning categories to documents (the most common application).
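
As a toy illustration of the classification case, the sketch below turns a few labelled documents into count vectors and labels a new document by its most similar training document (1-nearest neighbor with cosine similarity). The training texts are invented, and the method was chosen only because it fits in a few lines; any standard classifier could take its place.

```python
from collections import Counter
import math

def vectorize(text, vocabulary):
    """Turn a document into a list of term counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

def cosine(u, v):
    """Cosine similarity between two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical labelled training documents (invented for illustration)
train = [
    ("engine oil brake tire", "autos"),
    ("car engine transmission", "autos"),
    ("battery circuit voltage resistor", "electronics"),
    ("circuit board voltage", "electronics"),
]
vocabulary = sorted({word for text, _ in train for word in text.split()})
train_vectors = [(vectorize(text, vocabulary), label) for text, label in train]

def classify(text):
    """Return the label of the most similar training document."""
    query = vectorize(text, vocabulary)
    _, best_label = max(train_vectors, key=lambda pair: cosine(query, pair[0]))
    return best_label

prediction = classify("my car engine leaks oil")
print(prediction)
```

Words outside the training vocabulary ("my", "leaks") are simply ignored, which is one reason a sizable labelled corpus matters in practice.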

# 20.6. EXAMPLE: ONLINE DISCUSSION ON AUTOS AND ELECTRONICS

# 20.7. SUMMARY

**Distinction between NLP and Text Mining**

- **Natural Language Processing (NLP):** Focuses on extracting meaning from a single document.
- **Text Mining:** Focuses on classifying or labeling numerous documents in a probabilistic fashion.
- *Note:* This chapter concentrates on Text Mining.

**Preprocessing Challenges**

- Preprocessing text is more varied and involved than preparing numerical data.
- The ultimate goal is to produce a matrix where rows represent **terms** and columns represent **documents** (or the transpose, with documents as rows and terms as columns).

**Dimensionality Reduction**

- **Vocabulary Reduction:** Necessary because the sheer number of terms in natural language is excessive for effective model-building.

- **Concept Extraction:** A final major reduction involves using a limited set of *concepts* instead of raw terms.

- **Analogy:** This captures variation in documents similarly to how *Principal Components* capture variation in numerical data.

**Final Output & Application**

- The process results in a **quantitative matrix** where cells represent the frequency or presence of terms.

- **Document labels (classes)** are appended to this matrix.

- The data is now ready for **document classification** using standard classification methods.