# Text Data

## Types of Datasets
   - **Numerical Dataset:** Contains measurable quantities and can be analyzed mathematically. Examples include temperature, humidity, and test scores.
   - **Categorical Dataset:** Comprises a set of categories or groups. Examples are colors, product categories, and yes/no responses.
   - **Time Series Dataset:** Captures data points at successive time intervals. Useful for analyzing trends over time.
   - **Image Dataset:** Consists of image files, often used in computer vision tasks to identify patterns or objects.
   - **Text Dataset:** Includes collections of words, sentences, or documents, typically analyzed for linguistic patterns or content.

These are only some common examples; many other dataset types exist.

## Text Data

1. **Understanding Text Data:**
   - Text data is composed of sequences of characters, forming words, sentences, or paragraphs.
   - It varies in length and complexity, often containing nuanced linguistic features.

2. **Text vs. Categorical Data:**
   - Strings of characters can represent different types of data.
   - Categorical data is derived from a predefined set of options, such as 'red' or 'blue', 'yes' or 'no'.
   - Text data, however, is more fluid, often forming meaningful phrases or sentences that convey complex ideas.

3. **Analyzing Text: Corpus and Documents:**
   - Text analysis typically involves examining a large body of text, known as a [corpus](https://en.wikipedia.org/wiki/Text_corpus).
   - Within this corpus, each individual text entry, whether an article, social media post, or review, is termed a **document**.

## What is bag-of-words?

Bag-of-words is a technique in natural language processing where we ignore the structure of input text and focus solely on word occurrences. It’s like mentally holding a bag of words and counting how many times each word appears in a document.

## Steps for bag-of-words representation

1. **Tokenization**:
   - **Tokenization** is the process of splitting each document (text) into individual words or tokens.
   - We achieve this by breaking the text at whitespace, punctuation marks, or other delimiters.

2. **Vocabulary Building**:
   - Next, we create a **vocabulary** containing all unique words (tokens) that appear in any of the documents.
   - Each word is assigned a unique **index** (usually in alphabetical order).

3. **Encoding**:
   - For each document, we count how often each word from the vocabulary appears in that document.
   - The resulting vector represents the word frequencies (counts) for that document.
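
The three steps above can be sketched by hand in plain Python before turning to a library. This is a minimal illustration, not scikit-learn's implementation: the helper names `tokenize` and `encode` are hypothetical, and the regular expression (lowercased runs of two or more word characters) is an assumption chosen to resemble common default tokenizers.

```python
import re
from collections import Counter

docs = [
    "Calgary is known for its annual Stampede.",
    "The Calgary Tower offers stunning views of the city.",
    "Calgary's weather can be unpredictable.",
]

def tokenize(text):
    """Lowercase the text and keep runs of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

# Step 1: tokenization
tokenized = [tokenize(doc) for doc in docs]

# Step 2: vocabulary building (sorted, so indices follow alphabetical order)
unique_tokens = sorted({token for tokens in tokenized for token in tokens})
vocabulary = {token: index for index, token in enumerate(unique_tokens)}

# Step 3: encoding each document as a vector of token counts
def encode(tokens):
    counts = Counter(tokens)
    return [counts[token] for token in vocabulary]

bow_matrix = [encode(tokens) for tokens in tokenized]

print(f"Vocabulary size: {len(vocabulary)}")
for row in bow_matrix:
    print(row)
```

Each row of `bow_matrix` has one entry per vocabulary word, so the second document, which contains "the" twice, has a 2 in the column for "the".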

## Implementing Bag-of-Words

In the scikit-learn library, the [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) is utilized to transform text data into a bag-of-words representation.

By default, the `CountVectorizer` class converts all text to lowercase (`lowercase=True`), so that words differing only in capitalization, such as "Calgary" and "calgary", are treated as the same token.

<font color='Blue'><b>Example:</b></font> Let's create a simple example of the Bag-of-Words (BoW) model using text related to Calgary. We'll follow the steps mentioned earlier:

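The tokenization of each document can be inspected as follows. The analyzer returned by `build_analyzer()` applies the same preprocessing and tokenization that the vectorizer uses internally: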

In [None]:
# Import the necessary libraries
from sklearn.feature_extraction.text import CountVectorizer

# Step 0: Collect Data
# Define the documents
docs = ["Calgary is known for its annual Stampede.",
        "The Calgary Tower offers stunning views of the city.",
        "Calgary's weather can be unpredictable."
        ]

# Initialize the CountVectorizer
vectorizer = CountVectorizer()

tokenizer = vectorizer.build_analyzer()
for (i, doc) in enumerate(docs, 1):
    print(f'\nDoc {i}:')
    print("Original document:")
    print(doc)
    print("Tokenized document:")
    print(tokenizer(doc))

In [None]:
# Import the necessary libraries
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Step 0: Collect Data
# Define the documents
docs = ["Calgary is known for its annual Stampede.",
        "The Calgary Tower offers stunning views of the city.",
        "Calgary's weather can be unpredictable."
        ]

# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Fit the vectorizer to the documents
vectorizer.fit(docs)

# Transform the documents into a bag-of-words matrix
bag_of_words = vectorizer.transform(docs)

# Get the feature names (vocabulary)
feature_names = vectorizer.get_feature_names_out()
print('Feature Names (Vocabulary): ' + ', '.join(feature_names))

# Display the vocabulary size and content
print(f"Vocabulary size: {len(vectorizer.vocabulary_)}")
print(f"Vocabulary content: {vectorizer.vocabulary_}")

# Display the dense representation of the bag_of_words
# print("Dense representation of bag_of_words:\n{}".format(bag_of_words.toarray()))

# Create a DataFrame from the BoW matrix
df_bow = pd.DataFrame(bag_of_words.toarray(), columns=feature_names)

# Display the DataFrame
display(df_bow)

<font color='Blue'><b>Example:</b></font> Bag-of-Words (BoW) model using text related to Calgary with **repeating words**

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Define a list of sentences with repeating words related to the city of Calgary
# This list will be used to demonstrate how the Bag-of-Words model handles repetition
repeating_words = ["Calgary, Calgary, a city so vibrant, so vibrant.",
                   "The Rockies, the Rockies, so majestic, so majestic."
                   ]

# Initialize a CountVectorizer object to convert the text data into a matrix of token counts
# The CountVectorizer will tokenize the text and count the occurrences of each token
vect = CountVectorizer()

# Fit the vectorizer to the list of sentences to build the vocabulary
# The vocabulary consists of all unique tokens in the text
vect.fit(repeating_words)

# Display the vocabulary that has been learned from the input documents
# The vocabulary is a dictionary where keys are tokens and values are their corresponding indices
print(f"Vocabulary learned from the documents: {vect.vocabulary_}")

# Transform the list of sentences into a bag-of-words matrix
# Each row in the matrix represents a document, and each column represents a token from the vocabulary
# The values in the matrix are the frequency counts of the tokens in each document
bag_of_words = vect.transform(repeating_words)

# Convert the bag-of-words matrix into a pandas DataFrame for better visualization
# The DataFrame makes it easier to see the token frequencies in tabular form
df_bow = pd.DataFrame(bag_of_words.toarray(), columns=vect.get_feature_names_out())

# Display the DataFrame that shows the frequency of each word in the given sentences
# This DataFrame is useful for understanding the distribution of tokens across the documents
print("DataFrame showing the Bag-of-Words matrix:")
display(df_bow)

## Enhancing Bag-of-Words: Stopword Removal

In the Bag-of-Words (BoW) model, certain words are so common that they carry minimal useful information about the actual content of the document. These words, known as 'stopwords', can be removed to improve the analysis. There are two primary methods to eliminate stopwords:

1. Utilizing a predefined list of stopwords specific to a language.
2. Excluding words that appear too frequently across the documents.

The scikit-learn library provides a built-in English stopword list in the `feature_extraction.text` module. This list can be used to filter out stopwords from the text data during the vectorization process, resulting in a more meaningful BoW representation.


In [None]:
# Import the set of English stop words from scikit-learn's feature_extraction.text module
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Display the total number of stop words provided in the scikit-learn's list
print(f"Number of stop words: {len(ENGLISH_STOP_WORDS)}")

# To provide a sample of this list, print every 20th stop word
# ENGLISH_STOP_WORDS is a frozenset, so sort it first to get a reproducible order
print("Every 20th stopword:")
print(sorted(ENGLISH_STOP_WORDS)[::20])

In [None]:
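# Combine the documents from the two earlier examples into one corpus
docs_ext = docs + repeating_words

# Build a bag-of-words model that drops scikit-learn's built-in English stop words
vect = CountVectorizer(stop_words="english")
vect.fit(docs_ext)
bag_of_words = vect.transform(docs_ext)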
pd.DataFrame(bag_of_words.toarray(), columns=vect.get_feature_names_out())
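
The second method, excluding words that appear in too many documents, can be sketched in plain Python by counting document frequencies. The threshold name `MAX_DF` and its value of 0.5 are assumptions chosen for illustration; in scikit-learn, the `max_df` parameter of `CountVectorizer` serves a similar purpose.

```python
import re

docs = [
    "Calgary is known for its annual Stampede.",
    "The Calgary Tower offers stunning views of the city.",
    "Calgary's weather can be unpredictable.",
    "Calgary, Calgary, a city so vibrant, so vibrant.",
    "The Rockies, the Rockies, so majestic, so majestic.",
]

def tokenize(text):
    """Lowercase the text and keep runs of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in docs]

# Document frequency: the number of documents in which each token occurs
doc_freq = {}
for tokens in tokenized:
    for token in set(tokens):
        doc_freq[token] = doc_freq.get(token, 0) + 1

# Hypothetical threshold: drop tokens found in more than half of the documents
MAX_DF = 0.5
n_docs = len(docs)
too_frequent = sorted(t for t, df in doc_freq.items() if df / n_docs > MAX_DF)

# Remove the overly frequent tokens from every document
filtered = [[t for t in tokens if t not in too_frequent] for tokens in tokenized]

print(f"Removed tokens: {too_frequent}")
```

On this small corpus the only token above the threshold is "calgary", while "the" survives because it occurs in just two of the five documents. Frequency-based filtering is therefore corpus-specific and may remove different words than a predefined stopword list would.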

## Understanding TfidfVectorizer

`TfidfVectorizer` is an advanced feature extraction tool from the scikit-learn library that converts text data into a numerical matrix of TF-IDF features, suitable for use in machine learning models.

- **TF (Term Frequency)**: This quantifies the frequency of a term within a single document, giving higher weight to terms that occur more frequently.

- **IDF (Inverse Document Frequency)**: This assesses the significance of a term across the entire document corpus. It diminishes the weight of terms that occur in many documents, thereby amplifying the importance of rarer terms. In its basic form, the IDF of a term is the logarithm of the ratio of the total number of documents to the number of documents containing the term. By default, scikit-learn's `TfidfVectorizer` uses a smoothed variant, `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is the number of documents containing the term.

- **TF-IDF**: This metric is the product of TF and IDF, reflecting the term's significance within a particular document relative to the entire collection of documents. A term's TF-IDF score increases with its frequency in a document but is balanced by its commonality across all documents.
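
To make the definitions above concrete, here is a minimal hand computation in plain Python. It assumes the smoothed IDF `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each document vector, which is my understanding of the `TfidfVectorizer` defaults. The small `STOPWORDS` set is a hand-picked stand-in for scikit-learn's English list, and the helper names are hypothetical.

```python
import math
import re
from collections import Counter

docs = [
    "Calgary is known for its annual Stampede.",
    "The Calgary Tower offers stunning views of the city.",
    "Calgary's weather can be unpredictable.",
    "Calgary, Calgary, a city so vibrant, so vibrant.",
    "The Rockies, the Rockies, so majestic, so majestic.",
]

# Assumption: a tiny stopword set standing in for scikit-learn's English list
STOPWORDS = {"a", "be", "can", "for", "is", "its", "of", "so", "the"}

def tokenize(text):
    """Lowercase, split into word tokens, and drop the stopwords."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

tokenized = [tokenize(doc) for doc in docs]
n_docs = len(tokenized)
vocabulary = sorted({t for tokens in tokenized for t in tokens})

# Document frequency and smoothed inverse document frequency per term
doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

def tfidf_row(tokens):
    """Multiply raw term counts by IDF, then scale the vector to unit length."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

tfidf = [tfidf_row(tokens) for tokens in tokenized]

# Non-zero scores of the first document, rounded for readability
doc0 = dict(zip(vocabulary, tfidf[0]))
print({term: round(score, 3) for term, score in doc0.items() if score > 0})
```

In the first document, "calgary" receives a lower score than "annual", "known", and "stampede" because it occurs in four of the five documents, whereas each of the other three terms occurs in only one.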

<font color='Blue'><b>Example:</b></font> TF-IDF scores for the Calgary documents.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Sample extended documents related to Calgary
docs_ext = docs + repeating_words

# Initialize the TfidfVectorizer with English stop words to be filtered out
vectorizer = TfidfVectorizer(stop_words = "english")

# Fit the vectorizer to the documents to learn the vocabulary
vectorizer.fit(docs_ext)

# Display the size and content of the learned vocabulary
print(f"Vocabulary size: {len(vectorizer.vocabulary_)}")
print("Vocabulary content:")
print(vectorizer.vocabulary_)

# Transform the documents into a TF-IDF-weighted term-document matrix
tfidf_matrix = vectorizer.transform(docs_ext)

# Retrieve and display the feature names (vocabulary)
print("\nFeature names:")
print(vectorizer.get_feature_names_out())

# Display the TF-IDF matrix as a dense array
# print("\nTF-IDF matrix:")
# print(tfidf_matrix.toarray())

# Convert the TF-IDF matrix into a DataFrame for better readability
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())

# Display the DataFrame
print("\nDataFrame representation of the TF-IDF matrix:")
display(tfidf_df.round(3))

The numbers in the table represent the **TF-IDF (Term Frequency-Inverse Document Frequency) scores** for various terms across different documents. Here's what they indicate:

- **Columns**: Each column header represents a unique term (e.g., 'annual', 'calgary', 'city', etc.).
- **Rows**: Each row corresponds to a different document (e.g., Document 0, Document 1, etc.).
- **Values**: The numerical values are the TF-IDF scores, which measure how important a term is to a document in a collection. A score of **0** means the term does not appear in the document, while higher scores indicate greater importance.

For example, in Document 0, the terms 'annual', 'known', and 'stampede' have a score of **0.549**, suggesting they are significant in that document. Conversely, the term 'city' has a score of **0**, indicating it does not appear in Document 0.

## Expanding Bag-of-Words with n-Grams

The Bag-of-Words (BoW) model is a simple yet powerful way to represent text data for machine learning. However, it has a significant limitation: it ignores word order. For example, "it's bad, not good at all" and "it's good, not bad at all" would have identical BoW representations despite their opposite meanings. This is where n-grams come into play.

### Capturing Context with n-Grams

To capture more context, we can extend the BoW model to consider sequences of words:
- **Bigrams**: Pairs of consecutive words.
- **Trigrams**: Triplets of consecutive words.
- **n-Grams**: Sequences of 'n' consecutive words.
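
The following plain-Python sketch shows how n-grams separate the two sentences from the example above. The helpers `tokenize` and `ngrams` are hypothetical names introduced for illustration, and the tokenization rule (lowercased runs of two or more word characters) is an assumption.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

def ngrams(tokens, n):
    """Return every sequence of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens_a = tokenize("it's bad, not good at all")
tokens_b = tokenize("it's good, not bad at all")

# Unigram counts are identical, so plain bag-of-words cannot tell them apart
same_unigrams = Counter(ngrams(tokens_a, 1)) == Counter(ngrams(tokens_b, 1))
print(f"Same unigram counts: {same_unigrams}")

# Bigram counts differ because bigrams retain local word order
bigrams_a = ngrams(tokens_a, 2)
bigrams_b = ngrams(tokens_b, 2)
print(f"Bigrams of first sentence:  {bigrams_a}")
print(f"Bigrams of second sentence: {bigrams_b}")
print(f"Shared bigrams: {sorted(set(bigrams_a) & set(bigrams_b))}")
```

The bigram "not good" appears only in the first sentence and "not bad" only in the second, which is exactly the contextual information that single-word counts discard.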


### Implementing n-Grams with CountVectorizer

The `CountVectorizer` and `TfidfVectorizer` classes in scikit-learn can be configured to use n-grams by setting the `ngram_range` parameter. Here's how you can do it:

In [None]:
docs = ["Calgary is known for its annual Stampede.",
        "The Calgary Tower offers stunning views of the city.",
        "Calgary's weather can be unpredictable."
        ]

# ngram_range is a tuple (min_n, max_n); the default (1, 1) uses unigrams only
vectorizer = CountVectorizer(ngram_range=(1, 1), stop_words="english").fit(docs)
print(f"Vocabulary size: {len(vectorizer.vocabulary_)}")
print("Vocabulary:")
print(vectorizer.get_feature_names_out())

ngram_words = vectorizer.transform(docs)
pd.DataFrame(ngram_words.toarray(), columns= vectorizer.get_feature_names_out())

In [None]:
from pprint import pprint

# ngram_range=(2, 2) keeps bigrams only; stop words are removed before the bigrams are formed
vectorizer = CountVectorizer(ngram_range=(2, 2), stop_words="english").fit(docs)
print(f"Vocabulary size: {len(vectorizer.vocabulary_)}")
print("Vocabulary:")
pprint(vectorizer.get_feature_names_out())

ngram_words = vectorizer.transform(docs)
pd.DataFrame(ngram_words.toarray(), columns= vectorizer.get_feature_names_out())

In [None]:
# ngram_range=(1, 2) keeps both unigrams and bigrams
vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words="english").fit(docs)
print(f"Vocabulary size: {len(vectorizer.vocabulary_)}")
print("Vocabulary:")
print(vectorizer.get_feature_names_out())

ngram_words = vectorizer.transform(docs)
pd.DataFrame(ngram_words.toarray(), columns= vectorizer.get_feature_names_out())

## TF-IDF with n-Grams

In [None]:
docs = ["Calgary is known for its annual Stampede.",
        "The Calgary Tower offers stunning views of the city.",
        "Calgary's weather can be unpredictable."
        ]
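# Weight unigrams and bigrams by TF-IDF, dropping English stop words
vect = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
vect.fit(docs)

# Transform the documents into a TF-IDF matrix over the n-gram vocabulary
tfidf_words = vect.transform(docs)

# Show the scores in a DataFrame with one column per n-gram
df = pd.DataFrame(tfidf_words.toarray(), columns=vect.get_feature_names_out())
display(df.round(3))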