# Embeddings (Traditional Statistical Vector-based Embeddings)

Traditional statistical vector-based embeddings are foundational techniques in natural language processing (NLP) that represent text as numeric vectors derived from word counts and co-occurrence statistics. The main methods are:

### 1. Bag of Words (BoW)
- **Description**: Represents text by the occurrence (count) of each word in the document without considering the word order or context.
- **Implementation**: Typically uses scikit-learn's `CountVectorizer`.
- **Characteristics**: Produces sparse vectors where each dimension corresponds to a specific term from the vocabulary and the value is the word count.
- **Use Cases**: Simple and effective for basic text classification and clustering tasks.

### 2. Term Frequency-Inverse Document Frequency (TF-IDF)
- **Description**: Enhances the Bag of Words model by weighting terms based on their frequency in a document and their inverse frequency across all documents in the corpus.
- **Implementation**: Typically uses scikit-learn's `TfidfVectorizer`.
- **Characteristics**: Produces sparse vectors with weighted values, reducing the impact of common words and highlighting important terms.
- **Use Cases**: Widely used in information retrieval and text mining.

### 3. Latent Semantic Analysis (LSA) or Latent Semantic Indexing (LSI)
- **Description**: Applies Singular Value Decomposition (SVD) to the term-document matrix (typically after applying TF-IDF) to reduce dimensions and capture latent semantic relationships between terms.
- **Implementation**: Performs a truncated SVD on the term-document matrix, keeping only the largest singular components.
- **Characteristics**: Transforms high-dimensional sparse vectors into lower-dimensional dense vectors.
- **Use Cases**: Useful for topic modeling and capturing underlying semantic structures.
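
One common way to sketch LSA is to chain `TfidfVectorizer` with scikit-learn's `TruncatedSVD`. The corpus and the choice of two components below are assumptions made for illustration; real corpora usually need many more components.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, assumed purely for illustration
docs = [
    "eat more fruit and vegetables every day",
    "fruit and vegetables are rich in fibre",
    "regular exercise keeps your heart healthy",
    "walking is a simple form of exercise",
]

tfidf = TfidfVectorizer().fit_transform(docs)  # sparse, (n_docs, n_terms)

# Keep only the top 2 latent dimensions (illustrative choice)
svd = TruncatedSVD(n_components=2, random_state=0)
lsa_vectors = svd.fit_transform(tfidf)  # dense NumPy array, (n_docs, 2)

print(lsa_vectors.shape)
print(svd.explained_variance_ratio_)
```

`TruncatedSVD` works directly on sparse input, which is why it is often preferred over a full SVD for term-document matrices.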

### 4. Latent Dirichlet Allocation (LDA)
- **Description**: A generative probabilistic model that represents documents as mixtures of topics and topics as mixtures of words.
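- **Implementation**: Infers topic distributions with approximate posterior inference, such as variational Bayes or Gibbs sampling.
- **Characteristics**: Produces dense, low-dimensional vectors in which each entry is the proportion of a topic in the document; the entries are non-negative and sum to 1.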
- **Use Cases**: Widely used for topic modeling and discovering abstract topics in large text corpora.
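
A minimal sketch of document-topic vectors, using scikit-learn's `LatentDirichletAllocation` on raw counts. The toy corpus and the number of topics are assumptions for illustration; with so little text the topics themselves are not meaningful.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, assumed purely for illustration
docs = [
    "eat more fruit and vegetables every day",
    "fruit and vegetables are rich in fibre",
    "regular exercise keeps your heart healthy",
    "walking is a simple form of exercise",
]

# LDA models word counts, so it is fitted on a bag-of-words matrix
counts = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # (n_docs, n_topics)

print(doc_topics)              # each row is a topic distribution
print(doc_topics.sum(axis=1))  # rows sum to 1
```

The per-topic word weights are available in `lda.components_`, with one row per topic and one column per vocabulary term.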

### 5. Pointwise Mutual Information (PMI)
- **Description**: Measures the association between a pair of words by comparing the probability of their co-occurrence to the probabilities of their individual occurrences.
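- **Implementation**: Computed from a word co-occurrence matrix, where co-occurrence is counted within a context window such as a sentence or document.
- **Characteristics**: Produces a word-by-word matrix of association scores. The rows are typically sparse and high-dimensional; dense word vectors are usually obtained by applying a dimensionality reduction such as SVD afterwards.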
- **Use Cases**: Useful for capturing word associations and semantic relationships.

In [1]:
import os
import pickle

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from alive_progress import alive_bar
from nltk.tokenize import sent_tokenize
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [None]:
# Map each method name to its scikit-learn vectorizer class
statistical_methods = {
    "bow": CountVectorizer,
    "tfidf": TfidfVectorizer,
}

In [None]:
# Parameters
CONTRIBUTOR: str = "Health Promotion Board"
CATEGORY: str = "live-healthy"
MODEL_NAME: str = "all-MiniLM-L6-v2"
POOLING_STRATEGY: str = "max"