## Classification - Logistic Regression 

Category ID 17: "Sports"

Category ID 10: "Music"

Category ID 20: "Gaming"

Category ID 24: "Entertainment"

## Two aggregation approaches: "sentence-by-sentence" vs. "whole document"
### 1. Treating Each Video's Tags as a "Sentence"

In this approach, you consider the tags of each video as a single "sentence". This method allows you to maintain the context of which tags belong to which video, which can be valuable for certain types of analysis.

**Pros:**
- **Context Preservation:** Keeping tags grouped by video preserves the context, which is crucial if the relationships between tags within a video are important.
- **Individual Video Analysis:** It allows for analysis at the video level, such as understanding common tag combinations within individual videos.
- **Granular Data:** More granular data can lead to more detailed insights, especially if the goal is to understand tagging patterns or to generate video-specific recommendations or classifications.

**Cons:**
- **Complexity:** More complex to implement, as you need to maintain the association between tags and their respective videos.
- **Data Sparsity:** If tags are not repeated often across videos, it might result in sparse data, which can be challenging for some machine learning models.
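
A minimal sketch of this approach, using hypothetical toy tags rather than the real category files: each video keeps its own row of tag counts, so the tag-to-video association is preserved. The names `videos`, `count_vector`, and `sparsity` are introduced here for illustration only.

```python
from collections import Counter

# Hypothetical toy data: one list of tags per video.
videos = [
    ["guitar", "live", "concert"],
    ["guitar", "cover"],
    ["football", "live", "highlights"],
]

# Shared vocabulary across all videos, in sorted order.
vocabulary = sorted({tag for tags in videos for tag in tags})


def count_vector(tags, vocabulary):
    """Return the tag counts of one video, aligned with the vocabulary."""
    counts = Counter(tags)
    return [counts[token] for token in vocabulary]


# One row per video: each "sentence" becomes its own feature vector.
X = [count_vector(tags, vocabulary) for tags in videos]

# Fraction of zero entries, a rough indicator of the sparsity noted above.
total_cells = len(X) * len(vocabulary)
sparsity = sum(value == 0 for row in X for value in row) / total_cells
```

Even in this tiny example most entries of `X` are zero, which illustrates why sparsity grows quickly when tags are rarely shared between videos.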

### 2. Aggregating All Tags from a Category into One Document

In this method, you aggregate all tags from videos in a particular category into a single document, without distinguishing which tags came from which video.

**Pros:**
- **Simplicity:** Simpler to implement and process since you're dealing with a bulk collection of tags without worrying about their original context.
- **Category-Level Analysis:** Useful for understanding broad trends and commonalities across an entire category of videos.
- **Better for Large-Scale Patterns:** Can be more effective for identifying overarching patterns that define a category.

**Cons:**
- **Loss of Context:** Loses the specific context of which tags are used together in individual videos.
- **Potential Noise:** If certain tags are overly common or generic across many videos, they might dominate the dataset and potentially skew the analysis.
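
A minimal sketch of this approach, again on hypothetical toy data: all tags of a category are pooled into one bag, and tags that appear in every category are flagged as candidates for generic noise. The names `videos_by_category`, `aggregate_category`, and `shared_tags` are assumptions made for this illustration.

```python
from collections import Counter

# Hypothetical toy data: category name -> list of per-video tag lists.
videos_by_category = {
    "music": [["guitar", "live", "concert"], ["guitar", "cover"]],
    "sports": [["football", "live", "highlights"], ["football", "goal"]],
}


def aggregate_category(videos):
    """Pool all per-video tag lists into one bag of tags for the category."""
    return Counter(tag for tags in videos for tag in tags)


# One document (bag of tags) per category; per-video grouping is discarded.
category_counts = {
    name: aggregate_category(videos)
    for name, videos in videos_by_category.items()
}

# Tags present in every category carry little category-specific signal.
shared_tags = set.intersection(*(set(c) for c in category_counts.values()))
```

Here `shared_tags` contains only `"live"`, the kind of generic tag that can dominate an aggregated document unless it is down-weighted, for example with TF-IDF.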

### Deciding on the Approach

- **Goal-Oriented:** Consider what you want to achieve with the analysis. If you are interested in patterns within individual videos or in understanding how certain tags cluster within videos, the first approach is better. If your goal is to understand broader trends across a category, the second approach might be more suitable.
- **Data Size and Quality:** The amount and quality of data you have can also influence your choice. If you have a large number of videos with a diverse range of tags, the second approach might provide more meaningful insights at a category level.
- **Machine Learning Considerations:** If you're using machine learning, think about what kind of features and labels will be most effective for your model. The granularity of your data can significantly affect model performance.

## Unnormalized sentence-by-sentence aggregation with 4 embeddings (plain counts, TF-IDF, LSA, Word2Vec)
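
As a rough summary of what the code in this section computes, the four representations of a video's tag list $d$ can be written as follows. The notation is introduced here for illustration: $e_t$ is the one-hot vector of tag $t$, $n$ is the number of training documents, $\mathrm{df}_j$ is the number of training documents containing tag $j$, $V_k$ holds the top $k$ right singular vectors of the training count matrix, and $v_t$ is the pretrained Word2Vec vector of tag $t$. The TF-IDF line assumes scikit-learn's default `smooth_idf=True`, which the code does not override; `norm=None` means no length normalization is applied.

$$
\begin{aligned}
\text{Counts:}\quad & x_d = \sum_{t \in d} e_t \\
\text{TF-IDF:}\quad & x'_{d,j} = x_{d,j}\left(\ln\frac{1+n}{1+\mathrm{df}_j} + 1\right) \\
\text{LSA:}\quad & z_d = V_k^{\top} x_d, \qquad k = 300 \\
\text{Word2Vec:}\quad & w_d = \sum_{t \in d,\ t \in \mathcal{V}_{\mathrm{w2v}}} v_t
\end{aligned}
$$

Tags missing from the Word2Vec vocabulary $\mathcal{V}_{\mathrm{w2v}}$ are skipped, and a document with no known tags maps to the zero vector.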

In [None]:
"""Compare token/document vectors for classification."""
import random
from typing import List, Mapping, Optional, Sequence
import gensim
import nltk
import numpy as np
from numpy.typing import NDArray
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LogisticRegression

FloatArray = NDArray[np.float64]
import gensim.downloader as api

# Load Google's pre-trained Word2Vec model.

model = api.load("word2vec-google-news-300")
# print(api.info())  # show info about available models/datasets

# Fix the random seed so the train/test split is reproducible
random.seed(31)


def read_file_to_sentences(file_path):
    with open(file_path, "r", encoding="utf-8") as file:
        return [line.strip().split(",") for line in file if line.strip()]


music = read_file_to_sentences("category10.txt")
print(music)

In [2]:
"""Compare token/document vectors for classification."""
import random
from typing import List, Mapping, Optional, Sequence
import gensim
import nltk
import numpy as np
from numpy.typing import NDArray
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LogisticRegression

FloatArray = NDArray[np.float64]
import gensim.downloader as api

# Load Google's pre-trained Word2Vec model.

model = api.load("word2vec-google-news-300")
# print(api.info())  # show info about available models/datasets

# Fix the random seed so the train/test split is reproducible
random.seed(31)


def read_file_to_sentences(file_path):
    with open(file_path, "r", encoding="utf-8") as file:
        return [line.strip().split(",") for line in file if line.strip()]


music = read_file_to_sentences("category10.txt")
sports = read_file_to_sentences("category17.txt")
gaming = read_file_to_sentences("category20.txt")
entertainment = read_file_to_sentences("category24.txt")

vocabulary = sorted(
    set(
        token
        for sentence in music + sports + gaming + entertainment
        for token in sentence
    )
) + [None]

vocabulary_map = {token: idx for idx, token in enumerate(vocabulary)}


def onehot(
    vocabulary_map: Mapping[Optional[str], int], token: Optional[str]
) -> FloatArray:
    """Generate the one-hot encoding for the provided token in the provided vocabulary."""
    embedding = np.zeros((len(vocabulary_map),))
    idx = vocabulary_map.get(token, len(vocabulary_map) - 1)
    embedding[idx] = 1
    return embedding


def sum_token_embeddings(
    token_embeddings: Sequence[FloatArray],
) -> FloatArray:
    """Sum the token embeddings."""
    total: FloatArray = np.array(token_embeddings).sum(axis=0)
    return total


def split_train_test(
    X: FloatArray, y: FloatArray, test_percent: float = 10
) -> tuple[FloatArray, FloatArray, FloatArray, FloatArray]:
    """Split data into training and testing sets."""
    N = len(y)
    data_idx = list(range(N))
    random.shuffle(data_idx)
    break_idx = round(test_percent / 100 * N)
    training_idx = data_idx[break_idx:]
    testing_idx = data_idx[:break_idx]
    X_train = X[training_idx, :]
    y_train = y[training_idx]
    X_test = X[testing_idx, :]
    y_test = y[testing_idx]
    return X_train, y_train, X_test, y_test


def generate_data_token_counts(
    music_document: list[list[str]],
    sports_document: list[list[str]],
    gaming_document: list[list[str]],
    entertainment_document: list[list[str]],
) -> tuple[FloatArray, FloatArray, FloatArray, FloatArray]:
    """Generate training and testing data with raw token counts for four categories."""

    # Aggregate embeddings for each category
    X: FloatArray = np.array(
        [
            sum_token_embeddings([onehot(vocabulary_map, token) for token in sentence])
            for sentence in music_document
        ]
        + [
            sum_token_embeddings([onehot(vocabulary_map, token) for token in sentence])
            for sentence in sports_document
        ]
        + [
            sum_token_embeddings([onehot(vocabulary_map, token) for token in sentence])
            for sentence in gaming_document
        ]
        + [
            sum_token_embeddings([onehot(vocabulary_map, token) for token in sentence])
            for sentence in entertainment_document
        ]
    )
    # Generate labels for each category
    # Assuming music:0, sports:1, gaming:2, entertainment:3
    y: FloatArray = np.array(
        [0 for sentence in music_document]
        + [1 for sentence in sports_document]
        + [2 for sentence in gaming_document]
        + [3 for sentence in entertainment_document]
    )

    return split_train_test(X, y)


def generate_data_tfidf(
    music_document: list[list[str]],
    sports_document: list[list[str]],
    gaming_document: list[list[str]],
    entertainment_document: list[list[str]],
) -> tuple[FloatArray, FloatArray, FloatArray, FloatArray]:
    """Generate training and testing data with TF-IDF scaling."""
    X_train, y_train, X_test, y_test = generate_data_token_counts(
        music_document, sports_document, gaming_document, entertainment_document
    )
    tfidf = TfidfTransformer(norm=None).fit(X_train)
    X_train = tfidf.transform(X_train)
    X_test = tfidf.transform(X_test)
    return X_train, y_train, X_test, y_test


def generate_data_lsa(
    music_document: list[list[str]],
    sports_document: list[list[str]],
    gaming_document: list[list[str]],
    entertainment_document: list[list[str]],
) -> tuple[FloatArray, FloatArray, FloatArray, FloatArray]:
    """Generate training and testing data with LSA."""
    X_train, y_train, X_test, y_test = generate_data_token_counts(
        music_document, sports_document, gaming_document, entertainment_document
    )
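    # random_state makes the randomized SVD reproducible (random.seed does not affect it)
    lsa = TruncatedSVD(n_components=300, random_state=0).fit(X_train)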
    X_train = lsa.transform(X_train)
    X_test = lsa.transform(X_test)
    return X_train, y_train, X_test, y_test


def generate_data_word2vec(
    music_document: list[list[str]],
    sports_document: list[list[str]],
    gaming_document: list[list[str]],
    entertainment_document: list[list[str]],
) -> tuple[FloatArray, FloatArray, FloatArray, FloatArray]:
    """Generate training and testing data with word2vec."""
    # Reuse the module-level pretrained word2vec `model` loaded above
    # instead of loading it a second time.

    def get_document_vector(sentence: list[str]) -> NDArray:
        """Return document vector by summing word vectors."""
        vectors = [model[word] for word in sentence if word in model.key_to_index]
        if vectors:
            return np.sum(vectors, axis=0)
        else:
            return np.zeros(
                300
            )  # return zero vector if no word in the document has a pretrained vector

    # Produce document vectors for each sentence
    X = np.array(
        [
            get_document_vector(sentence)
            for sentence in music_document
            + sports_document
            + gaming_document
            + entertainment_document
        ]
    )
    y = np.array(
        [0 for sentence in music_document]
        + [1 for sentence in sports_document]
        + [2 for sentence in gaming_document]
        + [3 for sentence in entertainment_document]
    )
    return split_train_test(X, y)


def run_experiment() -> None:
    """Compare performance with different embeddings."""
    X_train, y_train, X_test, y_test = generate_data_token_counts(
        music, sports, gaming, entertainment
    )
    clf = LogisticRegression(random_state=0, max_iter=1000).fit(X_train, y_train)
    print("raw counts (train):", clf.score(X_train, y_train))
    print("raw counts (test):", clf.score(X_test, y_test))
    X_train, y_train, X_test, y_test = generate_data_tfidf(
        music, sports, gaming, entertainment
    )
    clf = LogisticRegression(random_state=0, max_iter=1000).fit(X_train, y_train)
    print("tfidf (train):", clf.score(X_train, y_train))
    print("tfidf (test):", clf.score(X_test, y_test))
    X_train, y_train, X_test, y_test = generate_data_lsa(
        music, sports, gaming, entertainment
    )
    clf = LogisticRegression(random_state=0, max_iter=1000).fit(X_train, y_train)
    print("lsa (train):", clf.score(X_train, y_train))
    print("lsa (test):", clf.score(X_test, y_test))
    X_train, y_train, X_test, y_test = generate_data_word2vec(
        music, sports, gaming, entertainment
    )
    clf = LogisticRegression(random_state=0, max_iter=1000).fit(X_train, y_train)
    print("word2vec (train):", clf.score(X_train, y_train))
    print("word2vec (test):", clf.score(X_test, y_test))


if __name__ == "__main__":
    run_experiment()
