# Day 2
You are part of a team developing a text classification system for a news aggregator 
platform. The platform aims to categorize news articles into different topics automatically. 
The dataset contains news articles along with their corresponding topics. Perform only the 
feature extraction techniques.
Dataset Link: https://www.kaggle.com/datasets/therohk/million-headlines
Data Exploration: Begin by exploring the dataset. What are the different topics/categories 
present in the dataset? What is the distribution of articles across these topics?
Bag-of-Words (BoW): Implement a Bag-of-Words (BoW) model using CountVectorizer 
or TF-IDF to transform the text data into numerical features. Discuss the advantages and 
limitations of BoW in this context. Apply both unigram and bigram techniques and 
compare their effects on classification accuracy.
N-grams: Explore the use of N-grams (bi-grams, tri-grams) in feature engineering. How do 
different N-gram ranges impact the performance of the classification model?
TF-IDF: Apply TF-IDF (Term Frequency-Inverse Document Frequency) to the text data. 
Describe how TF-IDF works and its significance in capturing the importance of words 
across documents. Compare the results of TF-IDF with the BoW approach.
One-Hot Encoding: Investigate the application of One-Hot Encoding to encode categorical 
variables or labels. Can One-Hot Encoding be used directly for text classification? Why or 
why not?
Deliverables: 
Present insights gathered from data exploration and discuss the impact of different feature 
engineering techniques (BoW, N-grams, TF-IDF, One-Hot Encoding). Provide 
recommendations for the best feature engineering strategy.

# 1

In [2]:
# Import necessary libraries
import pandas as pd

# Load the dataset
df = pd.read_csv("C://Users//TmC//Downloads//archive//abcnews-date-text.csv")

# Display the first few rows of the dataset
print("Sample of the Dataset:")
print(df.head())

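# Explore different topics/categories
# Note: this file has only 'publish_date' and 'headline_text' columns (no topic column),
# so publish_date is used here as a stand-in for the category
categories = df['publish_date'].unique()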
print("\nCategories present in the dataset:")
print(categories)

# Distribution of articles across topics
print("\nDistribution of articles across topics:")
print(df['publish_date'].value_counts())


Sample of the Dataset:
   publish_date                                      headline_text
0      20030219  aba decides against community broadcasting lic...
1      20030219     act fire witnesses must be aware of defamation
2      20030219     a g calls for infrastructure protection summit
3      20030219           air nz staff in aust strike for pay rise
4      20030219      air nz strike to affect australian travellers

Categories present in the dataset:
[20030219 20030220 20030221 ... 20211229 20211230 20211231]

Distribution of articles across topics:
20120824    384
20130412    383
20110222    380
20120814    379
20130514    378
           ... 
20210605      6
20211023      5
20210515      5
20210806      1
20170209      1
Name: publish_date, Length: 6882, dtype: int64
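
Because `publish_date` is a YYYYMMDD integer, the per-day counts above are hard to read as a distribution. A minimal sketch of grouping headlines by year instead, using a small made-up frame (`toy` and its values are assumptions for illustration, not rows from the real file):

```python
import pandas as pd

# Toy stand-in for the real file; the values below are made up for illustration.
toy = pd.DataFrame({
    "publish_date": [20030219, 20030220, 20120824, 20120824, 20211231],
    "headline_text": ["a", "b", "c", "d", "e"],
})

# publish_date is an integer in YYYYMMDD form, so parse it as a date first.
toy["date"] = pd.to_datetime(toy["publish_date"].astype(str), format="%Y%m%d")

# Count headlines per calendar year, ordered by year.
per_year = toy["date"].dt.year.value_counts().sort_index()
print(per_year)
```

The same two lines should work on the full `df`, assuming every `publish_date` value is a valid date in that format.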


# 2

In [18]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Load the dataset
file_path = "C://Users//TmC//Downloads//archive//abcnews-date-text.csv"  # Replace with the actual path to your downloaded CSV file
df = pd.read_csv(file_path)

# Check the column names in the DataFrame
print(df.columns)

# The columns are 'publish_date' and 'headline_text'; there is no 'category' column, so 'publish_date' is used as the label below
# For simplicity, let's focus on a smaller subset for faster execution
df = df.sample(frac=0.1, random_state=42)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df['headline_text'], df['publish_date'], test_size=0.2, random_state=42)

# Implement Bag-of-Words with unigrams using CountVectorizer
count_vectorizer_unigram = CountVectorizer()
X_train_counts_unigram = count_vectorizer_unigram.fit_transform(X_train)
X_test_counts_unigram = count_vectorizer_unigram.transform(X_test)

# Implement Bag-of-Words with bigrams using CountVectorizer
count_vectorizer_bigram = CountVectorizer(ngram_range=(2, 2))
X_train_counts_bigram = count_vectorizer_bigram.fit_transform(X_train)
X_test_counts_bigram = count_vectorizer_bigram.transform(X_test)

# Implement Bag-of-Words with unigrams using TF-IDF
tfidf_vectorizer_unigram = TfidfVectorizer()
X_train_tfidf_unigram = tfidf_vectorizer_unigram.fit_transform(X_train)
X_test_tfidf_unigram = tfidf_vectorizer_unigram.transform(X_test)

# Implement Bag-of-Words with bigrams using TF-IDF
tfidf_vectorizer_bigram = TfidfVectorizer(ngram_range=(2, 2))
X_train_tfidf_bigram = tfidf_vectorizer_bigram.fit_transform(X_train)
X_test_tfidf_bigram = tfidf_vectorizer_bigram.transform(X_test)

# Training a simple classifier (e.g., Multinomial Naive Bayes) on these features and
# evaluating its accuracy is done in section 4 for the unigram case

Index(['publish_date', 'headline_text'], dtype='object')
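
The cell above builds unigram and bigram features but does not compare them with a classifier. A minimal sketch of that comparison, assuming a labelled dataset were available (the headlines and the `weather`/`sport` topics below are hypothetical, since the real file has no topic column):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical labelled headlines, made up for illustration.
texts = [
    "rain floods coastal towns", "storm warning issued for coast",
    "heavy rain expected this weekend", "cyclone damages coastal homes",
    "team wins grand final", "striker scores twice in final",
    "coach praises team after win", "captain injured before final",
]
topics = ["weather"] * 4 + ["sport"] * 4

# Stratify so both topics appear in the train and test splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, topics, test_size=0.25, random_state=42, stratify=topics
)

# Fit one pipeline per n-gram setting and record its test accuracy.
results = {}
for name, ngram_range in [("unigram", (1, 1)), ("bigram", (2, 2)), ("uni+bigram", (1, 2))]:
    clf = make_pipeline(CountVectorizer(ngram_range=ngram_range), MultinomialNB())
    clf.fit(X_tr, y_tr)
    results[name] = accuracy_score(y_te, clf.predict(X_te))

print(results)
```

With only eight headlines the accuracies are not meaningful; the point is the loop structure, which keeps the split and classifier fixed so that only the n-gram range changes.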


In [19]:
print(df.head())

         publish_date                                      headline_text
1144371      20181017  virtual reality trial ahead of fire season in ...
282871       20070131                     farmers prepare for ec funding
895099       20140810                   the sunday inquisition august 10
764744       20130221                                      news csg reax
894276       20140806  rosetta spacecraft on final approach to comet ...


# 3

In [49]:
import nltk
nltk.download('punkt')


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\TmC\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [50]:
import nltk
from nltk.util import ngrams

# Example sentence
sentence = "this is a sample sentence"

# Convert sentence to words
words = nltk.word_tokenize(sentence)

# Create unigrams, bigrams, and trigrams
unigrams = list(ngrams(words, 1))
bigrams = list(ngrams(words, 2))
trigrams = list(ngrams(words, 3))

print("Unigrams:", unigrams)
print("Bigrams:", bigrams)
print("Trigrams:", trigrams)


Unigrams: [('this',), ('is',), ('a',), ('sample',), ('sentence',)]
Bigrams: [('this', 'is'), ('is', 'a'), ('a', 'sample'), ('sample', 'sentence')]
Trigrams: [('this', 'is', 'a'), ('is', 'a', 'sample'), ('a', 'sample', 'sentence')]
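
To see how the N-gram range affects the feature space, the sketch below fits `CountVectorizer` with several `ngram_range` values on two made-up sentences (`docs` is an assumption for illustration) and records the vocabulary size for each:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up sentences for illustration.
docs = ["this is a sample sentence", "this is another sample"]

sizes = {}
for ngram_range in [(1, 1), (2, 2), (1, 2), (1, 3)]:
    vec = CountVectorizer(ngram_range=ngram_range)
    vec.fit(docs)
    sizes[ngram_range] = len(vec.vocabulary_)
    print(ngram_range, sizes[ngram_range], sorted(vec.vocabulary_))
```

Unlike `nltk.word_tokenize`, the default `CountVectorizer` tokenizer drops single-character tokens such as "a", so its bigrams differ slightly from the NLTK output above. Wider ranges add features quickly, which matters on a corpus of over a million headlines.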


# 4

In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

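# df is the 10% sample from section 2; publish_date stands in for the label because the file has no topic column
# With thousands of distinct dates as classes, accuracy is expected to be very low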
X_train, X_test, y_train, y_test = train_test_split(df['headline_text'], df['publish_date'], test_size=0.2, random_state=42)

# TF-IDF with unigrams
tfidf_vectorizer_unigram = TfidfVectorizer()
X_train_tfidf_unigram = tfidf_vectorizer_unigram.fit_transform(X_train)
X_test_tfidf_unigram = tfidf_vectorizer_unigram.transform(X_test)

# Bag-of-Words (BoW) with unigrams using CountVectorizer
count_vectorizer_unigram = CountVectorizer()
X_train_bow_unigram = count_vectorizer_unigram.fit_transform(X_train)
X_test_bow_unigram = count_vectorizer_unigram.transform(X_test)

# Train a simple classifier (e.g., Multinomial Naive Bayes) and evaluate accuracy

# TF-IDF with unigrams
model_tfidf_unigram = MultinomialNB()
model_tfidf_unigram.fit(X_train_tfidf_unigram, y_train)
y_pred_tfidf_unigram = model_tfidf_unigram.predict(X_test_tfidf_unigram)
accuracy_tfidf_unigram = accuracy_score(y_test, y_pred_tfidf_unigram)
print(f'Accuracy with TF-IDF (unigram): {accuracy_tfidf_unigram}')

# BoW with unigrams
model_bow_unigram = MultinomialNB()
model_bow_unigram.fit(X_train_bow_unigram, y_train)
y_pred_bow_unigram = model_bow_unigram.predict(X_test_bow_unigram)
accuracy_bow_unigram = accuracy_score(y_test, y_pred_bow_unigram)
print(f'Accuracy with BoW (unigram): {accuracy_bow_unigram}')


Accuracy with TF-IDF (unigram): 0.0019289503295290146
Accuracy with BoW (unigram): 0.003777527728660987
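
As a rough illustration of how TF-IDF weights are computed, the sketch below rebuilds them by hand from raw counts and compares the result with `TfidfVectorizer`. It assumes the scikit-learn defaults (smoothed idf, L2 row normalisation); the three documents are made up:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up documents for illustration.
docs = ["rain floods town", "rain warning issued", "team wins final"]

counts = CountVectorizer().fit_transform(docs).toarray()  # raw term counts (tf)
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)                       # documents containing each term

# Smoothed idf as used by scikit-learn when smooth_idf=True (the default).
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight counts by idf, then scale each row to unit L2 length.
tfidf_manual = counts * idf
tfidf_manual = tfidf_manual / np.linalg.norm(tfidf_manual, axis=1, keepdims=True)

tfidf_sklearn = TfidfVectorizer().fit_transform(docs).toarray()
matches = np.allclose(tfidf_manual, tfidf_sklearn)
print(matches)
```

A term such as "rain", which appears in two of the three documents, gets a smaller idf than a term that appears in only one, so it contributes less to the document vector than it would under plain counts.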


# 5

In [None]:
One-Hot Encoding is a technique commonly used for encoding categorical variables or labels.
It works by representing each category as a binary vector, where each element in the vector 
corresponds to a unique category. The element is set to 1 if the data point belongs to that category and 0 otherwise.

However, One-Hot Encoding of individual words is not directly suitable for text classification tasks, for three reasons:

1. High Dimensionality: 
    In text classification, especially when dealing with a large vocabulary or a large number of 
    unique words, One-Hot Encoding results in a very high-dimensional and sparse vector 
    representation for each document. This can lead to computational inefficiency and memory issues.

2. Loss of Sequence Information: 
    One-Hot Encoding treats each word independently and doesn't capture the sequential 
    relationships between words in a document. It doesn't consider the order or proximity of words, 
    which is crucial in understanding the meaning of a sentence or document.

3. Doesn't Capture Semantic Similarity: 
    One-Hot Encoding doesn't capture the semantic similarity between words. Words with similar meanings
    are represented as completely independent vectors, which makes it hard to model 
    the context and semantics of the text.

Instead of One-Hot Encoding, more advanced techniques like Word Embeddings (e.g., Word2Vec, GloVe) or methods 
based on deep learning architectures (e.g., LSTM, GRU, Transformer models) are commonly used for text classification.
Embeddings provide dense, lower-dimensional vector representations that capture semantic relationships, 
and sequence models built on them can also take word order into account.
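
The points above can be checked on a tiny example. This is a minimal sketch with a made-up four-word vocabulary (`vocab`, `one_hot`, and `cosine` are hypothetical helpers, not part of any library):

```python
import numpy as np

# Made-up vocabulary for illustration.
vocab = ["rain", "storm", "team", "final"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector with a single 1 at the word's vocabulary index."""
    vec = np.zeros(len(vocab))
    vec[index[word]] = 1.0
    return vec

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Any two distinct one-hot vectors are orthogonal, related in meaning or not.
sim_related = cosine(one_hot("rain"), one_hot("storm"))
sim_unrelated = cosine(one_hot("rain"), one_hot("team"))
print(sim_related, sim_unrelated)

# Summing one-hot vectors into a document vector discards word order.
doc_a = sum(one_hot(w) for w in ["team", "final"])
doc_b = sum(one_hot(w) for w in ["final", "team"])
print(np.array_equal(doc_a, doc_b))
```

Each vector is as long as the vocabulary, so a real corpus with tens of thousands of distinct words would need vectors of that length for every word.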

In [53]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Sample data with a categorical variable
data = {'Category': ['A', 'B', 'A', 'C', 'B']}
df = pd.DataFrame(data)

# One-Hot Encoding
encoder = OneHotEncoder(sparse_output=False)  # Dense array output; scikit-learn >= 1.2 (older versions use sparse=False)
encoded_data = encoder.fit_transform(df[['Category']])

# Create a DataFrame with the encoded data
encoded_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(['Category']))

# Concatenate the original DataFrame with the encoded DataFrame
df_encoded = pd.concat([df, encoded_df], axis=1)

print(df_encoded)


  Category  Category_A  Category_B  Category_C
0        A         1.0         0.0         0.0
1        B         0.0         1.0         0.0
2        A         1.0         0.0         0.0
3        C         0.0         0.0         1.0
4        B         0.0         1.0         0.0




In [73]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Sample text data
text_data = ['This is a positive example', 'This is a negative example', 'Another positive one', 'Negative example here']

# Sample labels
labels = [10, 10, 0, 1]

# Split the data
X_train, X_test, y_train, y_test = train_test_split(text_data, labels, test_size=0.2, random_state=42)

# TF-IDF vectorization
tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

# Train a classifier (e.g., Naive Bayes)
model = MultinomialNB()
model.fit(X_train_tfidf, y_train)

# Make predictions
y_pred = model.predict(X_test_tfidf)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')


Accuracy: 1.00
