# You are part of a team developing a text classification system for a news aggregator platform. The platform aims to categorize news articles into different topics automatically. The dataset contains news articles along with their corresponding topics. Perform only the Feature extraction techniques.

Dataset Link: https://www.kaggle.com/datasets/therohk/million-headlines

Data Exploration: Begin by exploring the dataset. What are the different topics/categories present in the dataset? What is the distribution of articles across these topics?

Bag-of-Words (BoW): Implement a Bag-of-Words (BoW) model using Count Vectorizer or TF-IDF to transform the text data into numerical features. Discuss the advantages and limitations of BoW in this context. Apply both unigram and bigram techniques and compare their effects on classification accuracy.

N-grams: Explore the use of N-grams (bi-grams, tri-grams) in feature engineering. How do different N-gram ranges impact the performance of the classification model?

TF-IDF: Apply TF-IDF (Term Frequency-Inverse Document Frequency) to the text data. Describe how TF-IDF works and its significance in capturing the importance of words across documents. Compare the results of TF-IDF with the BoW approach.

One-Hot Encoding: Investigate the application of One-Hot Encoding to encode categorical variables or labels. Can One-Hot Encoding be used directly for text classification? Why or why not?

Deliverables:

Present insights gathered from data exploration and discuss the impact of different feature engineering techniques (BoW, N-grams, TF-IDF, One-Hot Encoding). Provide recommendations for the best feature engineering strategy.
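
A minimal sketch of the data exploration step, assuming the CSV has the two columns the code in this notebook reads (`publish_date` and `headline_text`). The small DataFrame is a hypothetical stand-in with invented rows so that the snippet runs on its own. If the column listing shows no topic column, the topic distribution asked for above cannot be computed directly from the file.

```python
import pandas as pd

# Hypothetical stand-in rows; with the real data, use
# df = pd.read_csv('abcnews-date-text.csv') instead.
df = pd.DataFrame({
    "publish_date": [20030219, 20030219, 20030220, 20030220, 20030220, 20030221],
    "headline_text": [
        "council approves new water plan",
        "police investigate city robbery",
        "farmers welcome overdue rain",
        "minister defends health budget",
        "storm damages coastal homes",
        "team wins grand final",
    ],
})

# Which columns exist? A topic/category column would show up here.
columns = list(df.columns)

# Distribution of headlines over the column used as the label.
label_counts = df["publish_date"].value_counts().sort_index()

# Headline length in words, a quick check on how short the documents are.
words_per_headline = df["headline_text"].str.split().str.len()

print(columns)
print(label_counts.to_dict())
print(words_per_headline.describe())
```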

In [43]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

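# Load the dataset
# Note: the outputs below report 120 test rows, so they come from a subset of
# roughly 600 headlines rather than the full file.
df = pd.read_csv('abcnews-date-text.csv')

# Split the data into training and testing sets
# Note: publish_date is used as the label here, so the models below predict
# the publication date of a headline, not its topic.
train_data, test_data, train_labels, test_labels = train_test_split(
    df['headline_text'], df['publish_date'], test_size=0.2, random_state=42
)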

# Bag-of-Words (BoW) with Count Vectorizer (Unigrams)
count_vectorizer = CountVectorizer()
X_train_count = count_vectorizer.fit_transform(train_data)
X_test_count = count_vectorizer.transform(test_data)

# Bag-of-Words (BoW) with Count Vectorizer (Bigrams)
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
X_train_bigram = bigram_vectorizer.fit_transform(train_data)
X_test_bigram = bigram_vectorizer.transform(test_data)

# Bag-of-Words (BoW) with TF-IDF Vectorizer (Unigrams)
tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(train_data)
X_test_tfidf = tfidf_vectorizer.transform(test_data)

# Bag-of-Words (BoW) with TF-IDF Vectorizer (Bigrams)
tfidf_bigram_vectorizer = TfidfVectorizer(ngram_range=(2, 2))
X_train_tfidf_bigram = tfidf_bigram_vectorizer.fit_transform(train_data)
X_test_tfidf_bigram = tfidf_bigram_vectorizer.transform(test_data)

# Train a Multinomial Naive Bayes classifier on each representation
model_count = MultinomialNB()
model_count.fit(X_train_count, train_labels)
predictions_count = model_count.predict(X_test_count)
accuracy_count = accuracy_score(test_labels, predictions_count)
print(f"Accuracy with Count Vectorizer (Unigrams): {accuracy_count}")

model_bigram = MultinomialNB()
model_bigram.fit(X_train_bigram, train_labels)
predictions_bigram = model_bigram.predict(X_test_bigram)
accuracy_bigram = accuracy_score(test_labels, predictions_bigram)
print(f"Accuracy with Count Vectorizer (Bigrams): {accuracy_bigram}")

model_tfidf = MultinomialNB()
model_tfidf.fit(X_train_tfidf, train_labels)
predictions_tfidf = model_tfidf.predict(X_test_tfidf)
accuracy_tfidf = accuracy_score(test_labels, predictions_tfidf)
print(f"Accuracy with TF-IDF Vectorizer (Unigrams): {accuracy_tfidf}")

model_tfidf_bigram = MultinomialNB()
model_tfidf_bigram.fit(X_train_tfidf_bigram, train_labels)
predictions_tfidf_bigram = model_tfidf_bigram.predict(X_test_tfidf_bigram)
accuracy_tfidf_bigram = accuracy_score(test_labels, predictions_tfidf_bigram)
print(f"Accuracy with TF-IDF Vectorizer (Bigrams): {accuracy_tfidf_bigram}")


Accuracy with Count Vectorizer (Unigrams): 0.36666666666666664
Accuracy with Count Vectorizer (Bigrams): 0.4166666666666667
Accuracy with TF-IDF Vectorizer (Unigrams): 0.3416666666666667
Accuracy with TF-IDF Vectorizer (Bigrams): 0.4166666666666667


# Advantages of Bag-of-Words (BoW):
Simple Representation: BoW is a simple and effective way to represent text data numerically.
Interpretability: The generated features (word frequencies or TF-IDF scores) are interpretable and can provide insights into the importance of words.

# Limitations of Bag-of-Words (BoW):
Loss of Word Order: BoW discards the order of words in the text, which may result in a loss of important syntactic and semantic information.
High Dimensionality: The feature space can become very large, especially with a large vocabulary or when using n-grams, leading to increased computational requirements and potential overfitting.
Ignores Context: BoW treats each word independently and doesn't consider the context in which words appear.
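
The loss of word order can be illustrated with a short sketch. The two headlines are hypothetical, and `bag_of_words` is a helper written for this example (plain whitespace tokenization, not the tokenizer that `CountVectorizer` uses).

```python
from collections import Counter

def bag_of_words(text):
    """Return word counts for a lowercased, whitespace-tokenized text."""
    return Counter(text.lower().split())

# Two hypothetical headlines with opposite meanings.
a = bag_of_words("dog bites man")
b = bag_of_words("man bites dog")

print(a == b)  # True: both headlines map to the same bag of words
```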

# Comparison of Unigrams and Bigrams:
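Unigrams: Capture individual words and their frequencies. Suitable for capturing the overall vocabulary and word usage patterns.
Bigrams: Consider pairs of consecutive words. Useful for capturing local word order and short phrases.
Results: In the run above, bigrams reached 0.417 accuracy (50 of 120 test headlines) with both Count Vectorizer and TF-IDF, against 0.367 (44 of 120) for unigram counts and 0.342 (41 of 120) for unigram TF-IDF. The differences amount to a handful of headlines, and the labels are publish dates rather than topics, so the comparison is weak evidence about topic classification.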
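
To make the difference concrete, here is a small sketch that lists the unigram and bigram features of one hypothetical headline. `word_ngrams` is a helper written for this example and uses plain whitespace tokenization.

```python
def word_ngrams(text, n):
    """Return all runs of n consecutive words, each joined by a space."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

headline = "police investigate city robbery"  # hypothetical headline
unigrams = word_ngrams(headline, 1)
bigrams = word_ngrams(headline, 2)

print(unigrams)  # ['police', 'investigate', 'city', 'robbery']
print(bigrams)   # ['police investigate', 'investigate city', 'city robbery']
```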

In [44]:
# Explore N-grams (Bi-grams and Tri-grams)
# Using Bi-grams
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
X_train_bigram = bigram_vectorizer.fit_transform(train_data)
X_test_bigram = bigram_vectorizer.transform(test_data)

# Using Tri-grams
trigram_vectorizer = CountVectorizer(ngram_range=(3, 3))
X_train_trigram = trigram_vectorizer.fit_transform(train_data)
X_test_trigram = trigram_vectorizer.transform(test_data)

# Train a Multinomial Naive Bayes classifier on each representation
model_bigram = MultinomialNB()
model_bigram.fit(X_train_bigram, train_labels)
predictions_bigram = model_bigram.predict(X_test_bigram)
accuracy_bigram = accuracy_score(test_labels, predictions_bigram)
print(f"Accuracy with Bi-grams: {accuracy_bigram}")

model_trigram = MultinomialNB()
model_trigram.fit(X_train_trigram, train_labels)
predictions_trigram = model_trigram.predict(X_test_trigram)
accuracy_trigram = accuracy_score(test_labels, predictions_trigram)
print(f"Accuracy with Tri-grams: {accuracy_trigram}")

Accuracy with Bi-grams: 0.4166666666666667
Accuracy with Tri-grams: 0.4083333333333333


# Impact on Performance:
Bi-grams and Tri-grams can improve model performance when short multi-word phrases carry information that single words do not.
However, higher-order N-grams also increase the dimensionality and sparsity of the feature space, which might lead to increased computational requirements and potential overfitting.
In the run above, bi-grams (0.417) and tri-grams (0.408) differ by a single test headline out of 120, so neither is clearly better here.
The optimal choice of N-grams depends on the specific characteristics of the text data and the nature of the classification task.
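
The cells above use `ngram_range=(2, 2)` and `(3, 3)`, which keep only bi-grams or only tri-grams. A range such as `(1, 2)` keeps unigrams and bi-grams together, which may be worth trying as well. The sketch below compares vocabulary sizes on a hypothetical three-headline corpus; the corpus is invented and far too small to say anything about accuracy.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus standing in for train_data.
corpus = [
    "police investigate city robbery",
    "police investigate house fire",
    "council approves city budget",
]

vocab_sizes = {}
for ngram_range in [(1, 1), (2, 2), (1, 2), (1, 3)]:
    vectorizer = CountVectorizer(ngram_range=ngram_range)
    vectorizer.fit(corpus)
    # vocabulary_ maps each n-gram to its column index in the feature matrix.
    vocab_sizes[ngram_range] = len(vectorizer.vocabulary_)

print(vocab_sizes)  # {(1, 1): 9, (2, 2): 8, (1, 2): 17, (1, 3): 23}
```

Each widening of the range adds columns, which is the dimensionality cost described above.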

In [45]:
# Bag-of-Words (BoW) using CountVectorizer
count_vectorizer = CountVectorizer()
X_train_count = count_vectorizer.fit_transform(train_data)
X_test_count = count_vectorizer.transform(test_data)

# TF-IDF
tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(train_data)
X_test_tfidf = tfidf_vectorizer.transform(test_data)

# Train a Multinomial Naive Bayes classifier on BoW
model_count = MultinomialNB()
model_count.fit(X_train_count, train_labels)
predictions_count = model_count.predict(X_test_count)
accuracy_count = accuracy_score(test_labels, predictions_count)
print(f"Accuracy with BoW: {accuracy_count}")

# Train a Multinomial Naive Bayes classifier on TF-IDF
model_tfidf = MultinomialNB()
model_tfidf.fit(X_train_tfidf, train_labels)
predictions_tfidf = model_tfidf.predict(X_test_tfidf)
accuracy_tfidf = accuracy_score(test_labels, predictions_tfidf)
print(f"Accuracy with TF-IDF: {accuracy_tfidf}")

# Compare results
print("Classification Report for BoW:")
print(classification_report(test_labels, predictions_count))

print("\nClassification Report for TF-IDF:")
print(classification_report(test_labels, predictions_tfidf))

Accuracy with BoW: 0.36666666666666664
Accuracy with TF-IDF: 0.3416666666666667
Classification Report for BoW:
              precision    recall  f1-score   support

    20030219       0.42      0.24      0.30        42
    20030220       0.40      0.62      0.49        48
    20030221       0.19      0.13      0.16        30

    accuracy                           0.37       120
   macro avg       0.34      0.33      0.32       120
weighted avg       0.35      0.37      0.34       120


Classification Report for TF-IDF:
              precision    recall  f1-score   support

    20030219       0.25      0.10      0.14        42
    20030220       0.36      0.75      0.49        48
    20030221       0.20      0.03      0.06        30

    accuracy                           0.34       120
   macro avg       0.27      0.29      0.23       120
weighted avg       0.28      0.34      0.26       120



# How TF-IDF Works:
Term Frequency (TF): Measures how often a term (word) appears in a document. It is calculated as the number of occurrences of the term divided by the total number of terms in the document.
Inverse Document Frequency (IDF): Measures how rare a term is across a collection of documents. It is calculated as the logarithm of the total number of documents divided by the number of documents containing the term.
TF-IDF: The product of TF and IDF. A term scores high when it is frequent within a document (local importance) and rare across the collection (global importance).
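
The definitions above can be sketched in a few lines of plain Python. This follows the textbook formula as described, on a hypothetical three-headline corpus; `tf_idf` is a helper written for this example. Note that scikit-learn's `TfidfVectorizer` uses a smoothed variant by default (raw counts for TF, `ln((1 + N) / (1 + df)) + 1` for IDF, then L2 normalization), so its numbers will differ from these.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Textbook TF-IDF: tf = count / document length, idf = ln(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical headlines.
docs = [
    "police investigate city robbery",
    "police investigate house fire",
    "council approves city budget",
]
weights = tf_idf(docs)
print(weights[0])
```

In the first headline, "robbery" appears in one document out of three and receives a higher weight than "police", which appears in two.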

# Significance of TF-IDF:
Capturing Importance: TF-IDF assigns higher weights to terms that are frequent within a document but relatively rare across the entire document collection.
Discriminative Power: Terms that are common in a specific document but uncommon in other documents can have higher TF-IDF scores, making them more discriminative for classification tasks.
Normalization: Dividing by document length in TF, or scaling each vector to unit length as TfidfVectorizer does by default, keeps longer documents from having inherently larger feature values.

# Comparison with BoW:
BoW represents documents as vectors of raw term counts, so every occurrence of every term is weighted equally.
TF-IDF, on the other hand, down-weights terms that appear in many documents, providing a more nuanced representation of document content.
TF-IDF often outperforms BoW on large and diverse document collections, but not in the run above: unigram TF-IDF scored 0.342 against 0.367 for unigram counts.

# One-Hot Encoding:
One-Hot Encoding is a technique commonly used to represent categorical variables as binary vectors. Each category is represented by a vector with a single 1 in the position assigned to that category and 0 everywhere else, so the entire set of categories becomes a binary matrix. It suits a small, fixed set of categories such as class labels. However, using One-Hot Encoding directly for text classification may not be the most suitable approach. Here's why:

High Dimensionality: One-Hot Encoding creates one binary dimension for each unique category. In text classification, the number of unique words can be extensive, so each word becomes a vector as long as the entire vocabulary.

Sparse Representation: One-Hot Encoding produces a sparse matrix where the majority of entries are zero. Most documents contain only a small subset of the vocabulary, so storing these vectors densely would be computationally inefficient and memory-intensive.

Loss of Sequence Information: Once the one-hot vectors of a document are combined into a single feature vector, the order and context of the words are lost. Text classification often benefits from capturing the sequential nature of language.

Semantic Gap: One-Hot Encoding does not capture the semantic relationships between words. Each word is represented as an independent feature, ignoring potential similarities or connections between words.

Not Suitable for Continuous Text: The input in text classification is a sequence of words forming coherent text, not a single categorical value, so a document has no one-hot representation of its own without an extra step that aggregates its words.
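
A short sketch of both uses, assuming hypothetical topic labels and a hypothetical five-word vocabulary. `one_hot` is a helper written for this example. The second half shows that summing the one-hot vectors of a headline's words reproduces its bag-of-words count vector, which is one way to see why One-Hot Encoding alone adds nothing beyond BoW for documents.

```python
def one_hot(item, vocabulary):
    """Return a list with a single 1 at the item's position in vocabulary."""
    return [1 if item == entry else 0 for entry in vocabulary]

# One-hot encoding of a hypothetical topic label.
topics = ["business", "politics", "sport"]
label_vector = one_hot("politics", topics)

# One-hot encoding of each word in a hypothetical headline.
vocabulary = ["budget", "city", "council", "police", "question"]
headline = "city police question city council".split()
word_vectors = [one_hot(word, vocabulary) for word in headline]

# Summing the word vectors column by column gives the bag-of-words counts.
bow_vector = [sum(column) for column in zip(*word_vectors)]

print(label_vector)  # [0, 1, 0]
print(bow_vector)    # [0, 2, 1, 1, 1]
```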


# Alternative Approaches for Text Classification:

Word Embeddings: Word embeddings, such as Word2Vec, GloVe, or FastText, represent words as dense vectors in a continuous vector space. They capture semantic relationships and can be used as features for text classification.

TF-IDF or Bag-of-Words (BoW): These approaches, as discussed earlier, are more suitable for text classification tasks. They represent documents based on word counts or TF-IDF scores, and can capture some local context when combined with N-grams.

Sequence Models (e.g., RNNs, LSTMs, and Transformers): These models are designed to capture sequential dependencies in text data and have shown excellent performance in various natural language processing tasks, including text classification.

# Impact:

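Bag-of-Words (BoW): Captures word frequency information, ignoring the order of words. May lead to high-dimensional, sparse feature vectors.
N-grams: Add local word order at the cost of a larger feature space. Bigrams gave the highest accuracy in this experiment (0.417).
TF-IDF: Considers both word frequency and importance, providing a more nuanced representation. Often outperforms BoW, though unigram TF-IDF scored slightly lower than unigram counts in this experiment (0.342 against 0.367).
One-Hot Encoding: Not directly applicable to text features due to high dimensionality and loss of sequential information.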

# Recommendations:
TF-IDF for Text Representation: TF-IDF captures both word frequency and importance, making it a suitable choice for text classification tasks. It often outperforms BoW, although it did not in the small experiment above.

Word Embeddings (Optional): Consider using pre-trained word embeddings (e.g., Word2Vec, GloVe) for richer semantic representations if the dataset is large enough.

N-grams (Optional): Experiment with N-grams, especially bigrams and trigrams, to capture local word order if the dataset size allows. This can improve the model's understanding of context.

Consider Advanced Models: Explore more advanced models like sequence models (RNNs, LSTMs, Transformers) for tasks where capturing sequential dependencies is crucial.

In summary, TF-IDF is recommended as the primary feature engineering strategy for text classification, with optional exploration of word embeddings and N-grams based on the specific characteristics and size of the dataset.
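
As a rough sketch, the recommended strategy could be wired up as a scikit-learn pipeline that combines TF-IDF with unigrams and bigrams. The headlines and topic labels below are invented for illustration, since real topic labels would be needed to evaluate it; the classifier choice simply mirrors the `MultinomialNB` used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical labeled headlines; both texts and topics are invented.
headlines = [
    "council approves city budget",
    "minister defends health budget",
    "government debates tax vote",
    "team wins grand final",
    "striker scores late goal",
    "coach names final squad",
]
topics = ["politics", "politics", "politics", "sport", "sport", "sport"]

# TF-IDF over unigrams and bigrams, followed by Multinomial Naive Bayes.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(headlines, topics)

prediction = model.predict(["council debates budget vote"])
print(prediction)
```

Keeping the vectorizer inside the pipeline means the vocabulary is always fitted on training data only, which also makes the setup usable with cross-validation.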