# Statistics and Information Theory

In the previous lecture, we examined some general statistical features of words and tokens. We observed that the frequencies of terms in a document follow a power-law distribution, and that the most frequent words are often hardly germane to the ideas of the text. What is more, we rarely think of an idea as tied to a single word; we build noun phrases or prepositional phrases to communicate it. For example, the phrase "bacon and eggs" might, in a given context, simply denote the foods that the words name. In a different context, however, the same noun phrase could mean the event of breakfast or the items one would prefer to eat at a morning meal. Thus, we need a strategy for assessing the co-occurrence of n terms and deciding whether that co-occurrence is significant.

## Introduction

### Overview of Information Theory

Information theory was introduced by Claude Shannon in 1948 in his seminal paper "A Mathematical Theory of Communication." Originally developed for telecommunications, it provides a framework for quantifying information and understanding the limits of data compression and transmission.

At its heart, information theory deals with the quantification of information. It provides measures to determine the amount of uncertainty in a set of outcomes or the surprise of an event. The more uncertain or surprising an event is, the more information it contains.

#### Some Key Concepts

- **Entropy**: Represents the amount of uncertainty or randomness in a dataset. High entropy means more unpredictability, while low entropy indicates more regularity or predictability.
- **Mutual Information**: Measures the amount of information shared between two variables. It tells us how much knowing one variable reduces uncertainty about the other.
- **Redundancy**: The opposite of information; it represents the predictability in a dataset. In communication, redundancy can be added intentionally to protect data from errors during transmission.

## Use of Information Theory in NLP

#### Quantifying Textual Information

In NLP, the text is treated as a source of data. Information theory helps in quantifying the amount of information present in this data. For instance, understanding the entropy of a text can give insights into its complexity or diversity.

#### Language Modeling

Information theory aids in building probabilistic language models. By understanding the predictability of words or sequences in a language, we can build models that generate or recognize text more effectively.
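As a rough sketch of how predictability is measured in practice, the snippet below trains a unigram model on a toy corpus and reports its cross-entropy (average bits per token) and perplexity on held-out text. The corpus, the function names `train_unigram` and `cross_entropy`, and the choice of add-one smoothing are illustrative assumptions, not something fixed by the lecture.

```python
import math
from collections import Counter

def train_unigram(tokens, vocabulary):
    """Add-one smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + len(vocabulary)
    return {word: (counts[word] + 1) / total for word in vocabulary}

def cross_entropy(model, tokens):
    """Average number of bits the model needs per token of the given text."""
    return -sum(math.log2(model[token]) for token in tokens) / len(tokens)

# Hypothetical toy data, chosen only for illustration.
train = "the cat sat on the mat the dog sat on the rug".split()
held_out = "the dog sat on the mat".split()
vocabulary = set(train) | set(held_out)

model = train_unigram(train, vocabulary)
uniform = {word: 1 / len(vocabulary) for word in vocabulary}

bits_per_token = cross_entropy(model, held_out)
baseline_bits = cross_entropy(uniform, held_out)
perplexity = 2 ** bits_per_token

print(f"unigram model: {bits_per_token:.3f} bits/token, perplexity {perplexity:.2f}")
print(f"uniform model: {baseline_bits:.3f} bits/token")
```

On this toy data the trained model needs fewer bits per token than the uniform baseline, which is the sense in which it has captured some of the predictability of the text.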

#### Feature Selection

In text classification or sentiment analysis tasks, not all words or features are equally informative. Mutual information can help in selecting the most informative features, leading to better model performance.
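To make this concrete, here is a minimal sketch that scores terms by the mutual information between "the term appears in the document" and the class label. The four labeled documents and the helper names `mutual_information` and `term_score` are hypothetical and exist only for illustration.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """I(X; Y) in bits, estimated from a list of (x, y) observations."""
    n = len(pairs)
    joint = Counter(pairs)
    x_counts = Counter(x for x, _ in pairs)
    y_counts = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log2(p_xy / ((x_counts[x] / n) * (y_counts[y] / n)))
    return mi

# Hypothetical toy corpus: (document text, label).
documents = [
    ("great film loved the acting", "pos"),
    ("great story and the cast", "pos"),
    ("boring plot and the pacing", "neg"),
    ("boring film hated the ending", "neg"),
]

def term_score(term):
    # X = whether the term appears in the document, Y = the class label.
    pairs = [(term in text.split(), label) for text, label in documents]
    return mutual_information(pairs)

for term in ["great", "the", "film"]:
    print(term, term_score(term))
# great 1.0
# the 0.0
# film 0.0
```

Here "great" perfectly separates the two classes and earns a full bit, while "the" (present everywhere) and "film" (split evenly across classes) carry no information about the label.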

#### Text Compression

Information theory principles underlie text compression: the entropy of a text source sets a limit on how compactly it can be encoded without losing information, so frequent, predictable symbols receive short codes and rare ones receive longer codes.

#### Semantic Similarity

By comparing the information content of different textual units (like sentences or documents), we can gauge their semantic similarity or relevance.
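One simple way to sketch this idea is to compare the word distributions of two texts with the Jensen-Shannon divergence, which is 0 bits for identical distributions and 1 bit for distributions with no words in common. Treat this as a crude lexical proxy rather than a full measure of semantic similarity; the example sentences and function names are assumptions made for illustration.

```python
import math
from collections import Counter

def word_distribution(text):
    """Relative frequency of each lowercased, whitespace-separated token."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def kl_divergence(p, q):
    """D(p || q) in bits; assumes q[w] > 0 wherever p[w] > 0."""
    return sum(pw * math.log2(pw / q[w]) for w, pw in p.items() if pw > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits between two word distributions."""
    vocabulary = set(p) | set(q)
    p_full = {w: p.get(w, 0.0) for w in vocabulary}
    q_full = {w: q.get(w, 0.0) for w in vocabulary}
    mixture = {w: 0.5 * (p_full[w] + q_full[w]) for w in vocabulary}
    return 0.5 * kl_divergence(p_full, mixture) + 0.5 * kl_divergence(q_full, mixture)

# Hypothetical example sentences.
doc_a = word_distribution("the cat sat on the mat")
doc_b = word_distribution("the cat sat on the rug")
doc_c = word_distribution("stocks fell sharply today")

print(round(js_divergence(doc_a, doc_b), 3))  # about 0.167: one word differs
print(round(js_divergence(doc_a, doc_c), 3))  # about 1.0: no shared words
```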

#### Understanding Neural Networks

As deep learning models become more prevalent in NLP, information theory provides tools to understand what these models learn and how they represent information.

## Exploring Entropy

- Definition: Entropy is a measure of the uncertainty or randomness of a random variable.
- Formula:
    $$
    H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)
    $$

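where $p(x_i)$ is the probability of outcome $x_i$ occurring and $n$ is the number of possible outcomes. Because the logarithm is base 2, $H(X)$ is measured in bits.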
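The formula translates almost directly into code. The sketch below assumes the distribution is given as a list of probabilities that sum to 1; the function name `shannon_entropy` and the coin examples are ours, chosen for illustration.

```python
import math

def shannon_entropy(probabilities):
    """Return H(X) in bits for a discrete distribution given as probabilities."""
    # Outcomes with p == 0 are skipped: they contribute nothing to the sum.
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

fair_coin = [0.5, 0.5]
biased_coin = [0.9, 0.1]
certain = [1.0, 0.0]

print(shannon_entropy(fair_coin))              # 1.0 bit: maximally uncertain
print(round(shannon_entropy(biased_coin), 3))  # about 0.469 bits
print(shannon_entropy(certain))                # no uncertainty, so 0 bits
```

The fair coin is the hardest to predict and has the highest entropy; as one outcome becomes more likely, the entropy falls toward zero.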

### Expected Values

To get a better understanding of entropy, let's look first at expected values. Let's imagine we have a discrete random variable that records whether a person in Williamsburg has or has not played tennis.

| Have Played Tennis | Have Not Played Tennis | Total |
|--------------------|------------------------|-------|
| 5289               | 2376                   | 7665  |

We can calculate the probability of each outcome by dividing its count by the total number of people. Now let's imagine that we bet a random person on the street that they have played tennis. If they have, we win \$1. If they have not, we lose \$1. What is the expected value of our bet?

|              | Have Played Tennis | Have Not Played Tennis | Total |
|--------------|--------------------|------------------------|-------|
| Count        | 5289               | 2376                   | 7665  |
| Payoff (\$)  | 1                  | -1                     |       |

$$P(\text{Have Played Tennis}) = \frac{5289}{7665} \approx 0.69$$
$$P(\text{Have Not Played Tennis}) = \frac{2376}{7665} \approx 0.31$$

So we can calculate the expected value of our bet as follows:

$$\text{Expected Value} = 0.69 \times 1 + 0.31 \times (-1) = 0.38$$

Thus, we can expect to win \$0.38 on average for each bet we make.

More generally, the expected value of a discrete random variable $X$ is the probability-weighted sum of its possible values:

$$E(X) = \sum_{i=1}^{n} p(x_i) \times x_i$$

In our case, $x_i$ is the amount of money we win or lose, and $p(x_i)$ is the probability of winning or losing that amount. The same sum is often written over outcomes rather than indices:

$$E(X) = \sum_{x} x \, P(X = x)$$

where $x$ is a specific outcome, $P(X = x)$ is the probability of observing that outcome, and $E(X)$ is the expected value of $X$.
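The tennis bet can be checked numerically. This is a minimal sketch that assumes we always bet that the person has played tennis, so the payoff is +1 for "played" and -1 for "not played"; the dictionary names are ours.

```python
# Counts from the table above.
counts = {"played": 5289, "not_played": 2376}
# Payoff in dollars, assuming we always bet that the person has played.
payoffs = {"played": 1.0, "not_played": -1.0}

total = sum(counts.values())
probabilities = {outcome: n / total for outcome, n in counts.items()}

# Expected value: sum of probability times payoff over all outcomes.
expected_value = sum(probabilities[o] * payoffs[o] for o in counts)
print(round(expected_value, 2))  # 0.38
```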

### Entropy as a Measure of Surprise
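One way to read the entropy formula is as an expected value: if the surprise of an outcome with probability $p$ is $-\log_2 p$ bits, then entropy is the average surprise we should expect. The sketch below applies that reading to the tennis counts from the previous section; the function name `surprisal` is our own label for $-\log_2 p$.

```python
import math

def surprisal(p):
    """Surprise of an outcome with probability p, in bits."""
    return -math.log2(p)

p_played = 5289 / 7665
p_not_played = 2376 / 7665

# The rarer outcome is the more surprising one.
print(round(surprisal(p_played), 2))      # roughly 0.54 bits
print(round(surprisal(p_not_played), 2))  # roughly 1.69 bits

# Entropy is the expected value of the surprisal.
entropy = p_played * surprisal(p_played) + p_not_played * surprisal(p_not_played)
print(round(entropy, 2))  # roughly 0.89 bits
```

Meeting someone who has not played tennis is more surprising than meeting someone who has, and the entropy of about 0.89 bits sits below the 1 bit of a fair coin because the outcome is somewhat predictable.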



The function below estimates the entropy of a text's bigram distribution: it counts every pair of adjacent tokens, converts the counts to probabilities, and applies the entropy formula.

```python
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

def calculate_bigram_entropy(text):
    # Split on whitespace and count bigrams (pairs of adjacent tokens).
    # CountVectorizer lowercases the text by default; token_pattern=None
    # only silences the warning that the pattern is unused with a custom tokenizer.
    vectorizer = CountVectorizer(ngram_range=(2, 2), tokenizer=str.split, token_pattern=None)
    bigram_counts = vectorizer.fit_transform([text]).toarray()

    # Calculate the probability of each bigram
    total_bigrams = np.sum(bigram_counts)
    bigram_probabilities = bigram_counts / total_bigrams

    # Calculate the entropy in bits; the tiny epsilon guards against log2(0)
    entropy = -np.sum(bigram_probabilities * np.log2(bigram_probabilities + np.finfo(float).eps))

    return entropy

text = """
To the People of the State of New York:
WHEN the people of America reflect that they are now called upon to decide a question, which, in its consequences, must prove one of the most important that ever engaged their attention, the propriety of their taking a very comprehensive, as well as a very serious, view of it, will be evident.
Nothing is more certain than the indispensable necessity of government, and it is equally undeniable, that whenever and however it is instituted, the people must cede to it some of their natural rights in order to vest it with requisite powers. It is well worthy of consideration therefore, whether it would conduce more to the interest of the people of America that they should, to all general purposes, be one nation, under one federal government, or that they should divide themselves into separate confederacies, and give to the head of each the same kind of powers which they are advised to place in one national government.
It has until lately been a received and uncontradicted opinion that the prosperity of the people of America depended on their continuing firmly united, and the wishes, prayers, and efforts of our best and wisest citizens have been constantly directed to that object. But politicians now appear, who insist that this opinion is erroneous, and that instead of looking for safety and happiness in union, we ought to seek it in a division of the States into distinct confederacies or sovereignties. However extraordinary this new doctrine may appear, it nevertheless has its advocates; and certain characters who were much opposed to it formerly, are at present of the number. Whatever may be the arguments or inducements which have wrought this change in the sentiments and declarations of these gentlemen, it certainly would not be wise in the people at large to adopt these new political tenets without being fully convinced that they are founded in truth and sound policy.
It has often given me pleasure to observe that independent America was not composed of detached and distant territories, but that one connected, fertile, widespreading country was the portion of our western sons of liberty. Providence has in a particular manner blessed it with a variety of soils and productions, and watered it with innumerable streams, for the delight and accommodation of its inhabitants. A succession of navigable waters forms a kind of chain round its borders, as if to bind it together; while the most noble rivers in the world, running at convenient distances, present them with highways for the easy communication of friendly aids, and the mutual transportation and exchange of their various commodities.
"""

print(calculate_bigram_entropy(text))
```

Output:

```
8.56172381942633
```




## Use Cases of Entropy in NLP


1. **Language Modeling**:
   - **Predicting the Next Word**: Language models are evaluated by their cross-entropy (or, equivalently, perplexity) on held-out text. A lower value means the model assigns higher probability to the words that actually occur.
   - **Model Comparison**: When comparing different language models, the one with lower cross-entropy on the same test set is generally considered better, because it predicts the test data more accurately.
2. **Text Classification**:
   - **Feature Selection**: Entropy can be used to rank the importance of features (words or n-grams) in text classification tasks. A feature is informative when knowing whether it occurs lowers the entropy of the class label (its information gain); a feature spread evenly across all classes tells us little.
3. **Information Retrieval**:
   - **Query Expansion**: Entropy can help in determining which terms to add to a query to make it more specific or broad.
   - **Document Ranking**: Documents whose language model assigns lower cross-entropy to the query can be ranked higher, as they are more likely to be relevant to it.
4. **Text Summarization**:
   - **Sentence Selection**: Sentences with lower entropy might be more generic and can be good candidates for inclusion in a summary, as they might capture the main theme of the document.
5. **Topic Modeling**:
   - **Topic Coherence**: Entropy can be used to measure how concentrated a topic's word distribution is. Topics with lower entropy tend to be more coherent and focused.
6. **Text Compression**:
   - **Lossless Compression**: Entropy provides a lower bound on the average number of bits per symbol needed to encode a piece of text without loss. Huffman coding and arithmetic coding are examples of entropy-based compression algorithms.
7. **Authorship Attribution**:
   - **Stylistic Analysis**: Different authors might have different entropy patterns in their writings. Analyzing these patterns can help in attributing a piece of text to its most likely author.
8. **Detecting Language Change Over Time**:
   - By measuring the entropy of specific words or phrases over different time periods, one can detect shifts in language usage and semantics.
9. **Anomaly Detection**:
   - **Detecting Spam**: Emails or messages with unusually high or low entropy might be considered suspicious and flagged as potential spam.
   - **Detecting Machine-Generated Text**: Text generated by machines or bots might have different entropy patterns compared to human-written text.
10. **Optimizing Text Representations**:
    - **Dimensionality Reduction**: Features (like words or n-grams) that are spread almost uniformly across documents or classes, and so have very high entropy, are less discriminative and can be considered for removal when representing text in a high-dimensional space.
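To illustrate the lossless compression item in the list above, the following sketch builds a Huffman code for the characters of a short string and compares its average code length with the entropy of the character distribution. The input string and the function name `huffman_code_lengths` are illustrative assumptions, and the sketch assumes at least two distinct symbols.

```python
import heapq
import math
from collections import Counter

def huffman_code_lengths(frequencies):
    """Return {symbol: code length in bits} for a Huffman code."""
    # Heap entries are (weight, tie_breaker, {symbol: depth so far}).
    # The unique tie_breaker keeps Python from ever comparing the dicts.
    heap = [(weight, i, {symbol: 0})
            for i, (symbol, weight) in enumerate(frequencies.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees; every symbol in them gets one bit deeper.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**left, **right}.items()}
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]

text = "abracadabra"  # hypothetical input
frequencies = Counter(text)
total = sum(frequencies.values())

entropy = -sum((n / total) * math.log2(n / total) for n in frequencies.values())
lengths = huffman_code_lengths(frequencies)
average_length = sum(frequencies[s] * lengths[s] for s in frequencies) / total

print(f"entropy:        {entropy:.3f} bits per character")
print(f"Huffman length: {average_length:.3f} bits per character")
```

For this string the Huffman code averages 23/11, or about 2.09, bits per character against an entropy of about 2.04, consistent with the bound: no symbol-by-symbol code can average fewer bits than the entropy, and a Huffman code is always within one bit of it.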