# 20 Newsgroups

The 20 Newsgroups dataset is a collection of approximately 20,000 newsgroup documents, organized into 20 different categories. This dataset is widely used for experimenting with text classification and clustering algorithms. It was originally collected for a text classification project and has since become a standard benchmark in the field of machine learning.
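
In scikit-learn's loader the labels come back as integers, and each integer is an index into `target_names`. The sketch below illustrates that relationship on a tiny hand-made stand-in, so it runs without downloading anything. The three category names are real 20 Newsgroups categories, but the `target` values are invented for illustration.

```python
from collections import Counter

# Hypothetical stand-in for the loader's output: a few real category names
# and made-up integer labels (one per document).
target_names = ["comp.graphics", "rec.autos", "sci.space"]
target = [2, 0, 2, 1, 2, 0]

# Each label is an index into target_names.
label_names = [target_names[i] for i in target]

# Count how many documents fall into each category.
docs_per_category = Counter(label_names)
print(docs_per_category)
```

The same lookup, `target_names[label]`, should work on the real dataset's `target` array.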

In [None]:
from sklearn.datasets import fetch_20newsgroups

# Download the 20 Newsgroups dataset (subset='all' combines the train and test splits)
newsgroups_dataset = fetch_20newsgroups(subset='all')
categories = newsgroups_dataset.target_names

print(categories)

# Access the data
print(newsgroups_dataset.data[0])  # Print the first news article
print(newsgroups_dataset.target[0])  # Print its integer label (an index into target_names)


# Amazon Review Dataset
The Amazon Reviews dataset, here in its `amazon_polarity` version, is a collection of Amazon product reviews prepared for binary sentiment classification: each review is labeled as either positive or negative.
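
To make the record layout concrete, here is a minimal sketch that reads one review. It assumes the field names `label`, `title`, and `content` and the encoding 0 = negative, 1 = positive, as the dataset is published on the Hugging Face hub. The record itself is invented so the example runs offline.

```python
# Hypothetical record shaped like an amazon_polarity row; the text is invented.
record = {
    "label": 1,
    "title": "Great value",
    "content": "Works exactly as described and arrived early.",
}

# Assumed label encoding: 0 = negative, 1 = positive.
LABEL_NAMES = {0: "negative", 1: "positive"}

sentiment = LABEL_NAMES[record["label"]]

# Title and body are separate fields; join them if both should be analyzed.
full_text = f"{record['title']}. {record['content']}"
print(sentiment, "|", full_text)
```

Because each record is a dictionary rather than a plain string, any text-processing function has to be given one of the text fields, not the record as a whole.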

In [None]:
from datasets import load_dataset

# Download the Amazon reviews dataset
amazon_dataset = load_dataset('amazon_polarity')
print(type(amazon_dataset))

# Access the data
print(amazon_dataset['train'][0])  # Print the first review in the training set
print(amazon_dataset['test'][0])  # Print the first review in the test set


# POS Tagging
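The `get_wordnet_pos` function tags a single word with NLTK's `pos_tag` and converts the resulting Penn Treebank tag into one of the four POS constants that `WordNetLemmatizer` accepts (adjective, noun, verb, adverb). The lemmatizer treats every word as a noun unless told otherwise, so passing the right tag is what lets it reduce, for example, a verb form to its base form. Because each word is tagged in isolation, without sentence context, the tag can differ from what the tagger would assign inside a full sentence.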
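
The mapping step on its own can be sketched without NLTK. The helper name `penn_to_wordnet` is hypothetical; the single-character constants mirror the values of `wordnet.ADJ`, `wordnet.NOUN`, `wordnet.VERB`, and `wordnet.ADV`.

```python
# WordNet's POS constants are single characters; these match
# nltk.corpus.wordnet.ADJ, NOUN, VERB and ADV.
ADJ, NOUN, VERB, ADV = "a", "n", "v", "r"

def penn_to_wordnet(penn_tag):
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet POS character."""
    first_letter_map = {"J": ADJ, "N": NOUN, "V": VERB, "R": ADV}
    # Unknown or empty tags fall back to noun, the lemmatizer's default.
    return first_letter_map.get(penn_tag[:1].upper(), NOUN)

for tag in ["JJ", "NNS", "VBG", "RB", "DT"]:
    print(tag, "->", penn_to_wordnet(tag))
```

Keying on the first letter keeps the table small, at the cost of some coarseness: the particle tag `RP`, for instance, also starts with `R` and would be treated as an adverb.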

In [None]:
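import nltk
from nltk.corpus import wordnet

# One-time setup: the tagger and WordNet data must be downloaded first, e.g.
# nltk.download('averaged_perceptron_tagger'); nltk.download('wordnet')
# (newer NLTK releases name the tagger resource 'averaged_perceptron_tagger_eng').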

def get_wordnet_pos(word):
    """
    Tag a single word with NLTK and map its POS tag to the format that
    WordNetLemmatizer can use.
    
    Parameters:
    word (str): The word for which to determine the POS tag.
    
    Returns:
    str: The WordNet POS tag corresponding to the first letter of the NLTK POS tag.
         Defaults to 'n' for noun if no corresponding tag is found.
    
    Example:
    >>> get_wordnet_pos('running')
    'v'
    """
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {
        'J': wordnet.ADJ,
        'N': wordnet.NOUN,
        'V': wordnet.VERB,
        'R': wordnet.ADV
    }
    return tag_dict.get(tag, wordnet.NOUN)


# Pre-processing
The `preprocess_text` function cleans and prepares raw text for natural language processing tasks. It chains several preprocessing steps (lowercasing, punctuation removal, tokenization, stopword removal, and POS-aware lemmatization) and returns the result as a single space-separated string.
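
As a rough illustration of the cleaning steps that do not need NLTK, the sketch below lowercases, strips punctuation, splits on whitespace, and filters stopwords. The helper name `simple_clean` and the tiny `STOP_WORDS` set are assumptions made for this example. It uses a plain `split()` where a real tokenizer would go, and it skips lemmatization entirely.

```python
import string

# Tiny illustrative stopword list; NLTK's English list is much longer.
STOP_WORDS = {"and", "are", "the", "a", "is"}

def simple_clean(text):
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    return [tok for tok in tokens if tok not in STOP_WORDS]

print(simple_clean("Running and jumping are fun activities!"))
# → ['running', 'jumping', 'fun', 'activities']
```

One consequence of removing punctuation before filtering stopwords is that contractions such as "don't" become "dont", which a stopword list containing only "don't" would no longer match.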

In [None]:
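import nltk
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# One-time setup: word_tokenize and stopwords also need downloaded data, e.g.
# nltk.download('punkt'); nltk.download('stopwords')
# (newer NLTK releases use 'punkt_tab' for the tokenizer data).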

def preprocess_text(text):
    """
    Preprocess the input text by performing the following steps:
    1. Convert to lowercase.
    2. Remove punctuation.
    3. Tokenize the text.
    4. Remove stopwords.
    5. Lemmatize the tokens with POS tagging.
    
    Parameters:
    text (str): The input text to preprocess.
    
    Returns:
    str: The preprocessed text as a single string of tokens.
    
    Example:
    >>> preprocess_text("Running and jumping are fun activities!")
    'run jump fun activity'
    """
    # Convert text to lowercase
    text = text.lower()
    
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    
    # Tokenize the text
    tokens = word_tokenize(text)
    
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    
    # Lemmatize the text with POS tagging
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in tokens]
    
    return ' '.join(tokens)


In [None]:

for article in newsgroups_dataset.data:
    preprocessed_article = preprocess_text(article)
    print(preprocessed_article)
    # break  # uncomment to stop after the first article
    

In [None]:
for review in amazon_dataset["train"]:
    # Each record is a dict; the review body is stored under "content"
    print(preprocess_text(review["content"]))
    break

In [None]:
# from sklearn.feature_extraction.text import CountVectorizer
#
# def extract_feature(processed_documents):
#     vectorizer = CountVectorizer()
#
#     # Fit the model and transform the documents into a BoW representation
#     X = vectorizer.fit_transform(processed_documents)
#
#     # Convert the BoW matrix to an array for easier understanding
#     bow_array = X.toarray()
#
#     # Get the feature names (i.e., words)
#     feature_names = vectorizer.get_feature_names_out()
#
#     # Print the BoW representation
#     print("Feature Names (Vocabulary):")
#     print(feature_names)
#     print("\nBag-of-Words Array:")
#     print(bow_array)
