# You are part of a team developing a text classification system for a news aggregator platform. The platform aims to categorize news articles into different topics automatically. The dataset contains news articles along with their corresponding topics. Perform only the Feature extraction techniques.

Dataset Link: https://www.kaggle.com/datasets/therohk/million-headlines

# Data Exploration: 

Begin by exploring the dataset. What are the different topics/categories present in the dataset? What is the distribution of articles across these topics?

In [13]:
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

In [8]:
# Load the dataset

df = pd.read_csv('abcnews-date-text.csv')
df.head()

Unnamed: 0,publish_date,headline_text
0,20030219,aba decides against community broadcasting lic...
1,20030219,act fire witnesses must be aware of defamation
2,20030219,a g calls for infrastructure protection summit
3,20030219,air nz staff in aust strike for pay rise
4,20030219,air nz strike to affect australian travellers


In [9]:
# The CSV has only publish_date and headline_text (no topic column), so this lists the unique headlines, not topics
topics = df['headline_text'].unique()
print("Topics:\n", topics)

Topics:
 ['aba decides against community broadcasting licence'
 'act fire witnesses must be aware of defamation'
 'a g calls for infrastructure protection summit' ...
 'wa delays adopting new close contact definition'
 'western ringtail possums found badly dehydrated in heatwave'
 'what makes you a close covid contact here are the new rules']


In [10]:
# Without a topic column, value_counts() shows how often each headline text repeats, not articles per topic
topic_distribution = df['headline_text'].value_counts()
print("Topic Distribution:\n", topic_distribution)

Topic Distribution:
 national rural news                                            983
abc sport                                                      718
abc weather                                                    714
abc business news and market analysis                          585
abc entertainment                                              551
                                                              ... 
rio drug gang used alligators to terrify slum                    1
research to identify women at risk of premature                  1
religious order defends sex abuse handling                       1
reigning champion federer advances to us semis                   1
what makes you a close covid contact here are the new rules      1
Name: headline_text, Length: 1213004, dtype: int64
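
Because the file carries no topic column, one way to still get a feel for the text is to look at token frequencies and headline lengths. The sketch below is only an illustration on the five headlines shown by `df.head()` above, copied in by hand so that it runs on its own; the variable names are made up for this example.

```python
from collections import Counter

# The five headlines shown by df.head() above, copied here so the sketch is self-contained.
headlines = [
    "aba decides against community broadcasting licence",
    "act fire witnesses must be aware of defamation",
    "a g calls for infrastructure protection summit",
    "air nz staff in aust strike for pay rise",
    "air nz strike to affect australian travellers",
]

# Count how often each whitespace-separated token appears across the headlines.
token_counts = Counter(word for line in headlines for word in line.split())

# Number of tokens in each headline.
headline_lengths = [len(line.split()) for line in headlines]

print("Most common tokens:", token_counts.most_common(5))
print("Tokens per headline:", headline_lengths)
```

On the full dataset the same idea could be applied to `df['headline_text']` directly.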


In [21]:
# Select a small portion of the data for illustration
text_data = df.iloc[:1000]

# Initialize NLTK components
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))

corpus = []
for i in range(len(text_data)):
    text = text_data['headline_text'].iloc[i].lower()
    words = word_tokenize(text)
    words = [ps.stem(word) for word in words if word not in stop_words]
    text = ' '.join(words)
    corpus.append(text)

new_text_data = corpus
new_text_data

['aba decid commun broadcast licenc',
 'act fire wit must awar defam',
 'g call infrastructur protect summit',
 'air nz staff aust strike pay rise',
 'air nz strike affect australian travel',
 'ambiti olsson win tripl jump',
 'antic delight record break barca',
 'aussi qualifi stosur wast four memphi match',
 'aust address un secur council iraq',
 'australia lock war timet opp']

# Bag-of-Words (BoW):

Implement a Bag-of-Words (BoW) model using Count Vectorizer or TF-IDF to transform the text data into numerical features. Discuss the advantages and limitations of BoW in this context. Apply both unigram and bigram techniques and compare their effects on classification accuracy.

In [25]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Create a feature matrix using Count Vectorizer (unigrams)
cv = CountVectorizer()
X_cv = cv.fit_transform(corpus)

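# Create a feature matrix using Count Vectorizer (unigrams + bigrams; ngram_range=(2, 2) would give bigrams only)
cv_bigram = CountVectorizer(ngram_range=(1, 2))
X_cv_bigram = cv_bigram.fit_transform(corpus)
# NOTE: the dataset has no topic column, so the headline text itself stands in as the label here.
# Almost every label is then unique, so test labels are rarely seen in training and accuracy is ~0.
y = text_data['headline_text']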

# Split the data into training and testing sets
X_train_cv, X_test_cv, y_train, y_test = train_test_split(X_cv, y, test_size=0.2, random_state=42)
X_train_cv_bigram, X_test_cv_bigram, _, _ = train_test_split(X_cv_bigram, y, test_size=0.2, random_state=42)

# Train a classifier (Naive Bayes is used here as an example)
clf_cv = MultinomialNB()
clf_cv.fit(X_train_cv, y_train)

clf_cv_bigram = MultinomialNB()
clf_cv_bigram.fit(X_train_cv_bigram, y_train)

# Make predictions
y_pred_cv = clf_cv.predict(X_test_cv)
y_pred_cv_bigram = clf_cv_bigram.predict(X_test_cv_bigram)

# Evaluate the classifiers
print("Accuracy (BoW - Unigram):", accuracy_score(y_test, y_pred_cv))
print("Accuracy (BoW - Bigram):", accuracy_score(y_test, y_pred_cv_bigram))

# Display information about unigram BoW
print("\nUnigram BoW:")
print("\nVocabulary Size:", len(cv.vocabulary_))
print("Shape of BoW Matrix:", X_cv.shape)
print("BoW Feature Names:\n", cv.get_feature_names_out())

# Display information about bigram BoW
print("\nBigram BoW:")
print("\nVocabulary Size:", len(cv_bigram.vocabulary_))
print("Shape of BoW Matrix:", X_cv_bigram.shape)
print("BoW Feature Names:\n", cv_bigram.get_feature_names_out())

Accuracy (BoW - Unigram): 0.0
Accuracy (BoW - Bigram): 0.0

Unigram BoW:

Vocabulary Size: 2205
Shape of BoW Matrix: (1000, 2205)
BoW Feature Names:
 ['10' '100th' '108' ... 'zealand' 'zimbabw' 'zone']

Bigram BoW:

Vocabulary Size: 6226
Shape of BoW Matrix: (1000, 6226)
BoW Feature Names:
 ['10' '10 day' '10 man' ... 'zimbabw world' 'zone' 'zone home']
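
The accuracy of 0.0 follows from the labels rather than from the features: `y` holds the headline text, so nearly every label is unique and a test label is almost never seen during training. The sketch below shows roughly what a labelled comparison of unigram and unigram+bigram features could look like. It assumes a topic label per headline, which the dataset does not provide: the `train_topics` values and the test headline are invented for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Stemmed headlines taken from the preprocessing output above.
train_texts = [
    "air nz staff aust strike pay rise",
    "air nz strike affect australian travel",
    "ambiti olsson win tripl jump",
    "antic delight record break barca",
    "aussi qualifi stosur wast four memphi match",
    "aust address un secur council iraq",
    "australia lock war timet opp",
]
# ASSUMPTION: hypothetical topic labels, invented for this sketch only.
train_topics = ["business", "business", "sport", "sport", "sport", "politics", "politics"]

# Hypothetical unseen headline (already stemmed) to classify.
test_texts = ["nz staff strike pay"]

predictions = {}
for name, ngram_range in [("unigram", (1, 1)), ("unigram+bigram", (1, 2))]:
    vectorizer = CountVectorizer(ngram_range=ngram_range)
    X_train = vectorizer.fit_transform(train_texts)
    X_test = vectorizer.transform(test_texts)  # reuse the training vocabulary
    clf = MultinomialNB().fit(X_train, train_topics)
    predictions[name] = clf.predict(X_test)[0]
    print(name, "| features:", X_train.shape[1], "| prediction:", predictions[name])
```

With real topic labels, the same loop could be wrapped in a train/test split to compare accuracy across N-gram ranges.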


Advantages of Bag-of-Words (BoW):
    
   - Simplicity: BoW is a simple and effective way to represent text data.
   - Interpretability: The resulting feature matrix is easy to interpret as it directly represents the occurrence of words.
    
Limitations of Bag-of-Words (BoW):
    
   - Lack of Semantic Understanding: BoW doesn't capture the semantic meaning of words and their relationships.
   - High Dimensionality: The feature matrix gets one column per vocabulary term. For the 1,000 headlines used here it already has 2,205 columns with unigrams and 6,226 once bigrams are added.
   - No Contextual Information: BoW treats each word independently, ignoring the order and structure of the words in the text.
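
The loss of word order can be seen directly with a hand-rolled count vector. This is a minimal sketch, not the `CountVectorizer` implementation; `bow_vector` is a helper defined only for this example, and the second document is a made-up reordering of the first.

```python
from collections import Counter

def bow_vector(text, vocabulary):
    """Return the count of each vocabulary term in `text`, in vocabulary order."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]

docs = [
    "air nz strike affect australian travel",   # stemmed headline from the corpus
    "australian travel affect air nz strike",   # hypothetical reordering of the same words
]

# Shared vocabulary, sorted so the column order is reproducible.
vocabulary = sorted({word for doc in docs for word in doc.split()})
vectors = [bow_vector(doc, vocabulary) for doc in docs]

print(vocabulary)
print(vectors[0])
print(vectors[1])
print("Identical BoW vectors:", vectors[0] == vectors[1])
```

Both documents map to the same vector even though the word order differs, which is exactly the information BoW discards.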

# N-grams: 

Explore the use of N-grams (bi-grams, tri-grams) in feature engineering. How do different N-gram ranges impact the performance of the classification model?

In [28]:
import nltk
from nltk import ngrams

# Combine sentences into a single text (n-grams built this way also span headline boundaries, e.g. ('licenc', 'act'))
text = ' '.join(new_text_data)

# Tokenize the text
tokens = nltk.word_tokenize(text)

# Bi-grams
bi_grams = list(ngrams(tokens, 2))

print('Original text:\n', text)
print('\nGenerated Bi-grams:')
for grams in bi_grams:
    print(grams)

Original text:
 aba decid commun broadcast licenc act fire wit must awar defam g call infrastructur protect summit air nz staff aust strike pay rise air nz strike affect australian travel ambiti olsson win tripl jump antic delight record break barca aussi qualifi stosur wast four memphi match aust address un secur council iraq australia lock war timet opp

Generated Bi-grams:
('aba', 'decid')
('decid', 'commun')
('commun', 'broadcast')
('broadcast', 'licenc')
('licenc', 'act')
('act', 'fire')
('fire', 'wit')
('wit', 'must')
('must', 'awar')
('awar', 'defam')
('defam', 'g')
('g', 'call')
('call', 'infrastructur')
('infrastructur', 'protect')
('protect', 'summit')
('summit', 'air')
('air', 'nz')
('nz', 'staff')
('staff', 'aust')
('aust', 'strike')
('strike', 'pay')
('pay', 'rise')
('rise', 'air')
('air', 'nz')
('nz', 'strike')
('strike', 'affect')
('affect', 'australian')
('australian', 'travel')
('travel', 'ambiti')
('ambiti', 'olsson')
('olsson', 'win')
('win', 'tripl')
('tripl', 'jump')
...

In [29]:
# Tri-grams
tri_grams = list(ngrams(tokens, 3))

print('\nGenerated Tri-grams:')
for grams in tri_grams:
    print(grams)


Generated Tri-grams:
('aba', 'decid', 'commun')
('decid', 'commun', 'broadcast')
('commun', 'broadcast', 'licenc')
('broadcast', 'licenc', 'act')
('licenc', 'act', 'fire')
('act', 'fire', 'wit')
('fire', 'wit', 'must')
('wit', 'must', 'awar')
('must', 'awar', 'defam')
('awar', 'defam', 'g')
('defam', 'g', 'call')
('g', 'call', 'infrastructur')
('call', 'infrastructur', 'protect')
('infrastructur', 'protect', 'summit')
('protect', 'summit', 'air')
('summit', 'air', 'nz')
('air', 'nz', 'staff')
('nz', 'staff', 'aust')
('staff', 'aust', 'strike')
('aust', 'strike', 'pay')
('strike', 'pay', 'rise')
('pay', 'rise', 'air')
('rise', 'air', 'nz')
('air', 'nz', 'strike')
('nz', 'strike', 'affect')
('strike', 'affect', 'australian')
('affect', 'australian', 'travel')
('australian', 'travel', 'ambiti')
('travel', 'ambiti', 'olsson')
('ambiti', 'olsson', 'win')
('olsson', 'win', 'tripl')
('win', 'tripl', 'jump')
('tripl', 'jump', 'antic')
('jump', 'antic', 'delight')
('antic', 'delight', 'record')
...
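
Joining all headlines into one string before calling `ngrams` produces pairs such as `('licenc', 'act')` that straddle two unrelated headlines. A possible alternative, sketched here in plain Python with a helper name (`make_ngrams`) chosen for this example, is to build the n-grams one headline at a time.

```python
def make_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# First two stemmed headlines from the corpus above.
headlines = [
    "aba decid commun broadcast licenc",
    "act fire wit must awar defam",
]

# N-grams built per headline never cross a headline boundary.
bigrams = [gram for h in headlines for gram in make_ngrams(h.split(), 2)]
trigrams = [gram for h in headlines for gram in make_ngrams(h.split(), 3)]

# For comparison: n-grams over the joined text do cross the boundary.
joined_bigrams = make_ngrams(" ".join(headlines).split(), 2)

print(bigrams)
print(("licenc", "act") in bigrams, ("licenc", "act") in joined_bigrams)
```

`CountVectorizer(ngram_range=...)` already works per document, so this issue only affects the manual NLTK exploration.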

# TF-IDF: 

Apply TF-IDF (Term Frequency-Inverse Document Frequency) to the text data. Describe how TF-IDF works and its significance in capturing the importance of words across documents. Compare the results of TF-IDF with the BoW approach.

In [33]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Create TF-IDF feature matrix
X_tfidf = tfidf_vectorizer.fit_transform(corpus)

# Get feature names
feature_names_tfidf = tfidf_vectorizer.get_feature_names_out()

# Display information about TF-IDF
print("TF-IDF Feature Names:")
print("\nVocabulary Size:", len(feature_names_tfidf))
print("Shape of TF-IDF Matrix:", X_tfidf.shape)
print("TF-IDF Feature Names:\n", feature_names_tfidf)

# Display the TF-IDF matrix
print("\nTF-IDF Matrix:")
print(X_tfidf.toarray())

TF-IDF Feature Names:

Vocabulary Size: 2205
Shape of TF-IDF Matrix: (1000, 2205)
TF-IDF Feature Names:
 ['10' '100th' '108' ... 'zealand' 'zimbabw' 'zone']

TF-IDF Matrix:
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


# One-Hot Encoding: 

Investigate the application of One-Hot Encoding to encode categorical variables or labels. Can One-Hot Encoding be used directly for text classification? Why or why not?

In [35]:
import numpy as np

# Step 1 - Tokens (new_text_data is already lowercased and stemmed, so a plain split is enough)
tokens = [word for sent in new_text_data for word in sent.split()]

# Step 2 - Vocabulary: unique words in the text (set order is arbitrary, so the column order can vary between runs)
vocab = list(set(tokens))
word_to_index = {word: i for i, word in enumerate(vocab)}  # O(1) lookup instead of vocab.index()

# Perform the One-Hot Encoding: one vocabulary-length vector per word
one_hot_encoder = []
for sent in new_text_data:
    sent_encoded = []
    for word in sent.split():
        word_vector = np.zeros(len(vocab))
        word_vector[word_to_index[word]] = 1
        sent_encoded.append(word_vector)
    one_hot_encoder.append(sent_encoded)

for sent in one_hot_encoder:
    print(sent)

[array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.,
       0., 0.]), array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0.]), array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
       0., 0.]), array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0.]), array([0., 0., 0., 0., 0., 0., 

# Deliverables:

Present insights gathered from data exploration and discuss the impact of different feature engineering techniques (BoW, N-grams, TF-IDF, One-Hot Encoding). Provide recommendations for the best feature engineering strategy.

Impact of Feature Engineering Techniques:

Bag-of-Words (BoW):

Advantages:

   - Simplicity: BoW is straightforward to implement and interpret.
   - Interpretability: The resulting feature matrix is easy to understand.

Limitations:

   - Lack of Semantic Understanding: BoW doesn't capture the semantic meaning of words.
   - High Dimensionality: The feature matrix can become very high-dimensional in large vocabularies.
   - No Contextual Information: BoW treats each word independently, ignoring word order.

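N-grams:

   - Bi-grams and tri-grams capture relationships between adjacent words: a bi-gram is two consecutive tokens and a tri-gram is three.

Impact on Performance:

   - Bi-grams and tri-grams preserve local word order, which may help the model capture context that single words miss (for example 'air nz' or 'pay rise').
   - Larger N-gram ranges also enlarge the vocabulary: for the 1,000 headlines above, adding bi-grams raised the feature count from 2,205 to 6,226.
   - The notebook has no topic labels to score against (both BoW runs report an accuracy of 0.0), so the effect of the N-gram range on accuracy is not measured here.
   - The choice of N-gram range should be based on the specific characteristics of the dataset.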
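
The growth in feature count with the N-gram range can be reproduced on a tiny scale. The following is a rough sketch in plain Python; `make_ngrams` and `vocabulary_size` are helper names introduced for this example, and the two headlines are taken from the stemmed corpus above.

```python
def make_ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined into one string each."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def vocabulary_size(headlines, max_n):
    """Count distinct n-grams for every n from 1 up to max_n, built per headline."""
    vocab = set()
    for headline in headlines:
        tokens = headline.split()
        for n in range(1, max_n + 1):
            vocab.update(make_ngrams(tokens, n))
    return len(vocab)

headlines = [
    "air nz staff aust strike pay rise",
    "air nz strike affect australian travel",
]

# Feature counts for the ranges (1, 1), (1, 2) and (1, 3).
sizes = {max_n: vocabulary_size(headlines, max_n) for max_n in (1, 2, 3)}
print(sizes)
```

Each wider range adds columns while the number of documents stays the same, so the matrix becomes sparser as the range grows.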

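TF-IDF (Term Frequency-Inverse Document Frequency):

How TF-IDF Works:

   - TF-IDF weights each term by multiplying its frequency within a document (TF) by a factor that grows as the term appears in fewer documents of the corpus (IDF).

Significance:

   - Highlights terms that are frequent in a specific document but rare in the entire corpus.
   - Useful for identifying distinctive words that carry meaningful information.
   - Compared with BoW: the vocabulary and matrix shape are the same as for unigram BoW (2,205 terms, a 1,000 x 2,205 matrix); only the cell values change from raw counts to weights.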
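
To make the weighting concrete, here is a small hand computation. It is a sketch that assumes the smoothed variant used by `TfidfVectorizer` with its default settings, namely idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalisation of each row; the `tfidf` function itself is written only for this example.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) with smoothed-IDF, L2-normalised TF-IDF weights."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    vocab = sorted({term for doc in tokenized for term in doc})
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in tokenized for term in set(doc))
    # Assumed smoothed IDF: ln((1 + n) / (1 + df)) + 1.
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}
    rows = []
    for doc in tokenized:
        tf = Counter(doc)
        weights = [tf[term] * idf[term] for term in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows

docs = [
    "air nz staff aust strike pay rise",
    "air nz strike affect australian travel",
]
vocab, rows = tfidf(docs)

for term, weight in zip(vocab, rows[0]):
    print(f"{term:12s} {weight:.3f}")
```

In the first headline, 'staff' (present in one document) gets a higher weight than 'air' (present in both), which is the behaviour described above.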

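One-Hot Encoding:

Issues Encountered:

   - The vocabulary is built from the stemmed, stop-word-filtered headlines, so the words being encoded must go through the same preprocessing or they will not be found in it.

Applicability:

   - One-Hot Encoding is generally not suitable as a direct input for text classification: every word becomes a vector as long as the vocabulary, and each headline becomes a variable-length sequence of such vectors rather than a single fixed-length feature vector.
   - It is better suited to encoding categorical variables or labels.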
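
The link between one-hot vectors and BoW can be shown in a few lines: summing the one-hot vectors of a headline's words gives a single fixed-length count vector. This is only an illustrative sketch in plain Python; the `one_hot` helper and variable names are chosen for this example.

```python
def one_hot(word, word_to_index):
    """Return a vocabulary-length list with a single 1 at the word's index."""
    vector = [0] * len(word_to_index)
    vector[word_to_index[word]] = 1
    return vector

# Two stemmed headlines from the corpus above.
headline = "air nz strike affect australian travel"
other = "air nz staff aust strike pay rise"

# Sorted vocabulary so the column order is reproducible.
vocab = sorted(set((headline + " " + other).split()))
word_to_index = {word: i for i, word in enumerate(vocab)}

# One vector per word: the number of rows depends on the headline length.
one_hot_rows = [one_hot(word, word_to_index) for word in headline.split()]

# Column-wise sum collapses the rows into one fixed-length count vector.
bow = [sum(column) for column in zip(*one_hot_rows)]

print(len(one_hot_rows), "x", len(vocab))
print(bow)
```

The summed vector has the same length for every headline, which is what a classifier such as Naive Bayes needs as input.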