# STOPWORDS

Stopwords are common words (e.g. "is", "the") that are often removed in natural language processing (NLP) because they contribute little to the meaning of a text. This preprocessing step eliminates auxiliary terms so that the analysis focuses on the essential content.

For example, removing stopwords from a sentence strips out common words such as "I", "have", "been", "the", "and", "that", "are", "about", and "of", leaving key terms such as "studied", "effects", "global warming", "noticed", "studies", "inconclusive", and "results".
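
As a minimal sketch of this idea, the snippet below filters a sentence against a small hand-written stopword list. Both the sentence and the `SAMPLE_STOPWORDS` set are hypothetical and exist only for illustration; a real pipeline would use a fuller list, such as the one shipped with NLTK.

```python
# Hypothetical stopword list for illustration only; real lists are much longer.
SAMPLE_STOPWORDS = {"i", "have", "been", "the", "and", "that", "are", "about", "of"}

def drop_stopwords(sentence, stopwords=SAMPLE_STOPWORDS):
    """Return the words of `sentence` that are not in `stopwords` (case-insensitive)."""
    # str.split() is a naive whitespace tokenizer; it does not separate punctuation.
    return [word for word in sentence.split() if word.lower() not in stopwords]

# Hypothetical example sentence.
sentence = (
    "I have studied the effects of global warming and noticed "
    "that the results of the studies are inconclusive"
)
kept_words = drop_stopwords(sentence)
print(kept_words)
```

Only the content-bearing words remain, which is the effect the rest of this section reproduces with NLTK on real data.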

Stopword removal offers several key benefits in NLP:

1. **Dimensionality Reduction:** Reduces the total number of unique words (vocabulary) in a corpus, which optimizes memory usage and speeds up processing in ML models.
2. **Focus on Relevant Words:** Allows NLP algorithms to focus on terms that actually carry meaning and contextual information, improving the relevance of the analysis.
3. **Improved Efficiency and Performance:** By dealing with less data, algorithms run faster and more efficiently, which is crucial for large volumes of text.
4. **Increased Accuracy in Specific Tasks:** In tasks such as text classification, sentiment analysis, and information retrieval, stopword removal can improve the accuracy of results, as irrelevant words do not "pollute" the analysis.
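
The dimensionality-reduction benefit can be checked on a toy corpus. The sketch below counts unique words before and after filtering. The three documents and the `SAMPLE_STOPWORDS` set are made up for illustration, so the numbers say nothing about how large the reduction is on a real corpus.

```python
# Hypothetical stopword list and corpus, for illustration only.
SAMPLE_STOPWORDS = {"the", "is", "a", "of", "and", "on", "in"}

toy_corpus = [
    "the cat is on the mat",
    "a dog and a cat in the garden",
    "the price of the card is high",
]

def vocabulary(documents, stopwords=frozenset()):
    """Collect the set of unique lowercase words, skipping any in `stopwords`."""
    return {
        word
        for document in documents
        for word in document.lower().split()
        if word not in stopwords
    }

full_vocab = vocabulary(toy_corpus)
reduced_vocab = vocabulary(toy_corpus, SAMPLE_STOPWORDS)

print(len(full_vocab), "unique words before removal")
print(len(reduced_vocab), "unique words after removal")
```

In a bag-of-words representation, each unique word becomes one feature, so a smaller vocabulary translates directly into fewer columns for the model to handle.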

To practice stopword removal, we will use the 20 Newsgroups corpus, loaded through scikit-learn. This dataset, consisting of about 20,000 posts from old online forums (newsgroups) divided into 20 categories, is widely used in machine learning for text classification and clustering.

In [1]:
import pandas as pd
# Imports the pandas library, which is fundamental for data manipulation and analysis in tabular format (DataFrames).
# It is a powerful tool for working with structured data.

from sklearn.datasets import fetch_20newsgroups
# Imports the 'fetch_20newsgroups' function from the 'datasets' module of the scikit-learn library.
# This function is specifically used to download and load the famous '20 Newsgroups' dataset.

newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
# Calls the 'fetch_20newsgroups' function to load the '20 Newsgroups' dataset.
# 'subset='all'' specifies that we want to load all available data (both training and test sets).
# 'remove=('headers', 'footers', 'quotes')' is an argument that instructs the function to remove common and less relevant parts of the newsgroup articles,
# such as headers (sender info, date, etc.), footers (signatures), and quotes (parts of previous messages).
# This is a common pre-processing step to focus on the main content of the text.
# The result (the cleaned dataset) is stored in the 'newsgroups' variable.

After importing the data, the next step is to visualize it. To make it easier to read during preprocessing, we will load it into a DataFrame and apply a style to it, making the presentation of the data more pleasant.

In [6]:
import pandas as pd
# Imports the pandas library for data manipulation and analysis (DataFrames).

from sklearn.datasets import fetch_20newsgroups
# Imports 'fetch_20newsgroups' to load the 20 Newsgroups dataset from scikit-learn.

# Load the 20 Newsgroups dataset
# Loads all subsets of the dataset and removes headers, footers, and quotes
# to focus on the main content for NLP tasks.
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))

# Creating a pandas DataFrame from the newsgroups data
# Creates a DataFrame with two columns: 'Category' (human-readable names)
# and 'News' (the cleaned text content).
data_frame = pd.DataFrame({
    'Category': [newsgroups.target_names[i] for i in newsgroups.target],
    'News': newsgroups.data
})

# Function to style the DataFrame rows
# Gives each row a single background color, alternating between lightgray (even index)
# and lightblue (odd index), for improved visual readability.
# With axis=1, Styler.apply passes one row at a time as a Series whose .name is the row's index label.
def highlight_alternating_rows(row):
    color = 'lightblue' if row.name % 2 else 'lightgray'
    return [f'background-color: {color}'] * len(row)

# Apply styling to the first few rows of the DataFrame
# Selects the first 5 rows (.head()), applies the alternating row highlight,
# and sets common CSS properties for text alignment, border color, and font size.
# Index hiding is omitted for compatibility across pandas versions
# (Styler.hide_index() was deprecated in pandas 1.4 in favor of Styler.hide(axis='index')).
styled_data_frame = data_frame.head().style.apply(highlight_alternating_rows, axis=1).set_properties(**{
    'text-align': 'left',
    'border-color': 'white',
    'font-size': '12pt'
})

# Display the styled DataFrame
# Renders the visually enhanced DataFrame in an interactive environment (e.g., Jupyter Notebook).
styled_data_frame

,Category,News
0,rec.sport.hockey,"I am sure some bashers of Pens fans are pretty confused about the lack of any kind of posts about the recent Pens massacre of the Devils. Actually, I am bit puzzled too and a bit relieved. However, I am going to put an end to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they are killing those Devils worse than I thought. Jagr just showed you why he is much better than his regular season stats. He is also a lot fo fun to watch in the playoffs. Bowman should let JAgr have a lot of fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final regular season game. PENS RULE!!!"
1,comp.sys.ibm.pc.hardware,My brother is in the market for a high-performance video card that supports VESA local bus with 1-2MB RAM. Does anyone have suggestions/ideas on:  - Diamond Stealth Pro Local Bus  - Orchid Farenheit 1280  - ATI Graphics Ultra Pro  - Any other high-performance VLB card Please post or email. Thank you!  - Matt
2,talk.politics.mideast,"Finally you said what you dream about. Mediterranean???? That was new.... 	The area will be ""greater"" after some years, like your ""holocaust"" numbers......  ***** 	Is't July in USA now????? Here in Sweden it's April and still cold. 	Or have you changed your calendar???  ****************  ******************  *************** 	NOTHING OF THE MENTIONED IS TRUE, BUT LET SAY IT's TRUE.  SHALL THE AZERI WOMEN AND CHILDREN GOING TO PAY THE PRICE WITH  ************** 	BEING RAPED, KILLED AND TORTURED BY THE ARMENIANS??????????  HAVE YOU HEARDED SOMETHING CALLED: ""GENEVA CONVENTION""??????? 	YOU FACIST!!!!! 	Ohhh i forgot, this is how Armenians fight, nobody has forgot 	you killings, rapings and torture against the Kurds and Turks once 	upon a time!  Ohhhh so swedish RedCross workers do lie they too? What ever you say ""regional killer"", if you don't like the person then shoot him that's your policy.....l  i  i  i 	Confused????? i  i  Search Turkish planes? You don't know what you are talking about.	i  Turkey's government has announced that it's giving weapons <-----------i  to Azerbadjan since Armenia started to attack Azerbadjan it self, not the Karabag province. So why search a plane for weapons since it's content is announced to be weapons? If there is one that's confused then that's you! We have the right (and we do) 	to give weapons to the Azeris, since Armenians started the fight in Azerbadjan!  Shoot down with what? Armenian bread and butter? Or the arms and personel of the Russian army?"
3,comp.sys.ibm.pc.hardware,Think! It's the SCSI card doing the DMA transfers NOT the disks... The SCSI card can do DMA transfers containing data from any of the SCSI devices it is attached when it wants to. An important feature of SCSI is the ability to detach a device. This frees the SCSI bus for other devices. This is typically used in a multi-tasking OS to start transfers on several devices. While each device is seeking the data the bus is free for other commands and data transfers. When the devices are ready to transfer the data they can aquire the bus and send the data. On an IDE bus when you start a transfer the bus is busy until the disk has seeked the data and transfered it. This is typically a 10-20ms second lock out for other processes wanting the bus irrespective of transfer time.
4,comp.sys.mac.hardware,"1) I have an old Jasmine drive which I cannot use with my new system.  My understanding is that I have to upsate the driver with a more modern one in order to gain compatability with system 7.0.1. does anyone know of an inexpensive program to do this? ( I have seen formatters for <$20 buit have no idea if they will work)  2) I have another ancient device, this one a tape drive for which the back utility freezes the system if I try to use it. THe drive is a jasmine direct tape (bought used for $150 w/ 6 tapes, techmar mechanism). Essentially I have the same question as above, anyone know of an inexpensive beckup utility I can use with system 7.0.1"


After loading the news dataset, the next step is to load and store the stopwords.

In [8]:
import nltk
from nltk.corpus import stopwords

# Download the stopwords corpus (if not already downloaded).
nltk.download('stopwords')

# Load the English stopwords into a set for efficient lookup.
# Change 'english' to 'portuguese' for Portuguese stopwords.
english_stop_words = set(stopwords.words('english'))

# Print the loaded set of stopwords.
print(english_stop_words)

{'being', 'can', 'll', 'more', "we've", 'is', "you'd", "mightn't", 'why', 'all', 'its', "hasn't", 'which', "won't", "you'll", 'mightn', 'yourself', "you've", 'after', 'ain', 'own', 'we', "aren't", 'about', 'are', "you're", 'other', 'again', 'both', 'him', 'when', 'your', 'those', "hadn't", 'didn', 'there', 'any', "she'll", 'needn', 'his', 'do', 'does', "he's", "don't", 'here', 'this', 'off', 'yourselves', 'them', 'nor', 'by', "she'd", 'some', 'have', 'with', 'into', 'o', 'did', "he'll", "haven't", 'herself', 'has', 've', 're', 'wouldn', "needn't", 'hadn', 'few', 'if', 'm', "they'd", 'her', 'doing', 'himself', 'they', "shouldn't", 'y', 'while', 'hers', "doesn't", 'too', 'how', 'but', 'a', "didn't", 'were', "we're", 'ours', "we'll", 'i', 'no', 'and', "i'll", "shan't", 'of', 'most', 'he', 'because', 'having', 'out', "they'll", 't', 'so', 'over', 'theirs', 'don', "should've", "weren't", 'an', "they've", 'very', 's', 'where', "it'll", 'itself', 'weren', "couldn't", 'only', 'or', 'at', 'was'

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Felipe\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


We have the `20newsgroups` (full texts) and the `stopwords` set (words to remove). To apply the removal, we first need to "break" the `20newsgroups` texts into individual words (tokenize). Then, we will create a function that tokenizes any text and removes the stopwords found.

In [16]:
import nltk
# Ensure the tokenizer data is downloaded: nltk.download('punkt')
# (recent NLTK releases look for 'punkt_tab' instead: nltk.download('punkt_tab')).
# Ensure you have 'stopwords' downloaded: nltk.download('stopwords')

# Assumes 'english_stop_words' has been previously loaded as a set of stopwords:
#     from nltk.corpus import stopwords
#     english_stop_words = set(stopwords.words('english'))

def remove_stopwords(input_text):
    """
    Removes stopwords from a given text.

    Args:
        input_text (str): The input text string from which to remove stopwords.

    Returns:
        str: The text string with stopwords removed, where words are joined by spaces.
    """
    # Tokenizes the input text into a list of individual words (tokens).
    # nltk.word_tokenize handles punctuation correctly by separating it from words.
    words = nltk.word_tokenize(input_text)

    # Filters the tokenized words, keeping only those that are NOT in the 'english_stop_words' set.
    # word.lower() is used because the NLTK stopwords are lowercase, making the match case-insensitive.
    filtered_words = [word for word in words if word.lower() not in english_stop_words]

    # Joins the filtered words back into a single string, separated by spaces.
    # This reconstructs the text without the stopwords.
    return ' '.join(filtered_words)

# Example usage (assumes 'english_stop_words' is defined, as in the previous steps)
# sample_text = "This is a sample sentence showing the removal of common words."
# cleaned_text = remove_stopwords(sample_text)
# print(cleaned_text)

To demonstrate the use of the function, we will apply it to the first text in the 20newsgroups dataset, displaying the original text and the text after removing the stopwords.

In [18]:
# Get the first document text from the loaded newsgroups dataset.
original_text = newsgroups.data[0]

# Print a header indicating the original text.
print("Original Text")
print("----------------")
# Display the original text content.
print(original_text)

# Print a header for the text after stopword removal.
print("\n\nText After Stopwords Removal:")
print("---------------------------------")

# Call the 'remove_stopwords' function to process the original text
# and print the cleaned result.
print(remove_stopwords(original_text))

Original Text
----------------


I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game.          PENS RULE!!!




Text After Stopwords Removal:
---------------------------------
sure bashers Pens fans pretty confused lack kind posts recent Pens massacre Devils . Actually , bit puzzled bit relieved . However , going put end non-PIttsburghers ' relief bit praise Pens 
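
The cleaned text above still contains stand-alone punctuation tokens such as `.` and `,`, because the tokenizer emits them as separate tokens and they are not in the stopword set. One possible refinement, sketched below, also drops tokens that contain no alphanumeric character. To stay self-contained, this sketch uses a simple regular-expression tokenizer and a tiny `SAMPLE_STOPWORDS` set as stand-ins for `nltk.word_tokenize` and the NLTK stopword list. Both are assumptions, and the regex will split some edge cases differently from NLTK.

```python
import re

# Hypothetical, shortened stopword list standing in for NLTK's English list.
SAMPLE_STOPWORDS = {"i", "am", "some", "of", "are", "about", "the", "any"}

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+(?:[-']\w+)* keeps hyphenated words and contractions together;
    # [^\w\s] captures each remaining punctuation character on its own.
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", text)

def remove_stopwords_and_punctuation(text, stopwords=SAMPLE_STOPWORDS):
    """Drop stopwords (case-insensitive) and tokens without any letter or digit."""
    tokens = simple_tokenize(text)
    kept = [
        token
        for token in tokens
        if token.lower() not in stopwords
        and any(char.isalnum() for char in token)
    ]
    return " ".join(kept)

sample = (
    "I am sure some bashers of Pens fans are pretty confused "
    "about the lack of any kind of posts."
)
print(remove_stopwords_and_punctuation(sample))
```

Whether punctuation should be removed depends on the task: it is usually noise for bag-of-words models, but it can carry signal elsewhere (for example, "!!!" in sentiment analysis).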