# STOPWORDS

Stopwords are common words (e.g. "is", "the") that are removed in NLP because they do not contribute significantly to the meaning of the text. This preprocessing aims to eliminate irrelevant and auxiliary terms, focusing the analysis on the essential content.

Removing stopwords simplifies a sentence by dropping common words such as "I", "have", "been", "the", "and", "that", "are", "about", and "of", leaving key terms such as "studied", "effects", "global warming", "noticed", "studies", "inconclusive", and "results".
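
To make this concrete, here is a minimal sketch. The sentence is hypothetical (assembled from the words listed above, not taken from any dataset), and `toy_stop_words` is a small hand-picked set used only for illustration; NLTK's English list is much longer.

```python
# Hypothetical sentence built from the words mentioned above.
sentence = (
    "I have studied the effects of global warming and noticed "
    "that the studies are inconclusive about the results"
)

# Small illustrative stopword set (an assumption, not NLTK's list).
toy_stop_words = {"i", "have", "been", "the", "and", "that", "are", "about", "of"}

# Keep only the words whose lowercase form is not a stopword.
kept_words = [word for word in sentence.split() if word.lower() not in toy_stop_words]

print(kept_words)
```

Only the content-bearing words remain, which is the effect the rest of this notebook measures on a real dataset.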

Stopword extraction (or removal) offers several key benefits in Natural Language Processing (NLP):

1. **Dimensionality Reduction:** Reduces the total number of unique words (vocabulary) in a corpus, which optimizes memory usage and speeds up processing in ML models.
2. **Focus on Relevant Words:** Allows NLP algorithms to focus on terms that actually carry meaning and contextual information, improving the relevance of the analysis.
3. **Improved Efficiency and Performance:** By dealing with less data, algorithms run faster and more efficiently, which is crucial for large volumes of text.
4. **Increased Accuracy in Specific Tasks:** In tasks such as text classification, sentiment analysis, and information retrieval, stopword removal can improve the accuracy of results, as irrelevant words do not "pollute" the analysis.
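
The dimensionality-reduction point can be sketched on a toy corpus. The documents and the stopword set below are made up for illustration; the idea is only that the vocabulary (the set of unique words) shrinks once stopwords are dropped.

```python
# Toy corpus and stopword set (both are illustrative assumptions).
toy_docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "a cat and a dog",
]
toy_stop_words = {"the", "on", "a", "and"}

# Vocabulary = set of unique words across all documents.
full_vocab = {word for doc in toy_docs for word in doc.split()}
reduced_vocab = {word for word in full_vocab if word not in toy_stop_words}

print(f"Vocabulary size with stopwords:    {len(full_vocab)}")
print(f"Vocabulary size without stopwords: {len(reduced_vocab)}")
```

In a bag-of-words model each vocabulary entry becomes one feature column, so a smaller vocabulary means a narrower feature matrix.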

To practice stopword removal, we will use the 20 Newsgroups dataset, loaded through scikit-learn's `fetch_20newsgroups` function. This dataset, consisting of about 20,000 posts from old online forums (newsgroups) divided into 20 categories, is widely used in machine learning for text classification and clustering.

In [1]:
import pandas as pd
# Imports the pandas library, which is fundamental for data manipulation and analysis in tabular format (DataFrames).
# It is a powerful tool for working with structured data.

from sklearn.datasets import fetch_20newsgroups
# Imports the 'fetch_20newsgroups' function from the 'datasets' module of the scikit-learn library.
# This function is specifically used to download and load the famous '20 Newsgroups' dataset.

newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
# Calls the 'fetch_20newsgroups' function to load the '20 Newsgroups' dataset.
# 'subset='all'' specifies that we want to load all available data (both training and test sets).
# 'remove=('headers', 'footers', 'quotes')' is an argument that instructs the function to remove common and less relevant parts of the newsgroup articles,
# such as headers (sender info, date, etc.), footers (signatures), and quotes (parts of previous messages).
# This is a common pre-processing step to focus on the main content of the text.
# The result (the cleaned dataset) is stored in the 'newsgroups' variable.

After importing the data, the next step is to inspect it. To make the rows easier to read during preprocessing, we will apply a style to the DataFrame before displaying it.

In [6]:
import pandas as pd
# Imports the pandas library for data manipulation and analysis (DataFrames).

from sklearn.datasets import fetch_20newsgroups
# Imports 'fetch_20newsgroups' to load the 20 Newsgroups dataset from scikit-learn.

# Load the 20 Newsgroups dataset
# Loads all subsets of the dataset and removes headers, footers, and quotes
# to focus on the main content for NLP tasks.
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))

# Creating a pandas DataFrame from the newsgroups data
# Creates a DataFrame with two columns: 'Category' (human-readable names)
# and 'News' (the cleaned text content).
data_frame = pd.DataFrame({
    'Category': [newsgroups.target_names[i] for i in newsgroups.target],
    'News': newsgroups.data
})

# Function to style the DataFrame rows
# With axis=1, Styler.apply passes one row at a time as a Series whose .name is
# the row's index label. Every cell in a row gets the same color, and the color
# alternates between rows (lightgray for even labels, lightblue for odd labels).
def highlight_alternating_rows(row):
    color = 'lightblue' if row.name % 2 else 'lightgray'
    return [f'background-color: {color}'] * len(row)

# Apply styling to the first few rows of the DataFrame
# Selects the first 5 rows (.head()), applies the alternating row highlight,
# and sets common CSS properties for text alignment, border color, and font size.
# The index is left visible: Styler.hide_index() is not used because it is not available in every pandas version.
styled_data_frame = data_frame.head().style.apply(highlight_alternating_rows, axis=1).set_properties(**{
    'text-align': 'left',
    'border-color': 'white',
    'font-size': '12pt'
})

# Display the styled DataFrame
# Renders the visually enhanced DataFrame in an interactive environment (e.g., Jupyter Notebook).
styled_data_frame

Unnamed: 0,Category,News
0,rec.sport.hockey,"I am sure some bashers of Pens fans are pretty confused about the lack of any kind of posts about the recent Pens massacre of the Devils. Actually, I am bit puzzled too and a bit relieved. However, I am going to put an end to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they are killing those Devils worse than I thought. Jagr just showed you why he is much better than his regular season stats. He is also a lot fo fun to watch in the playoffs. Bowman should let JAgr have a lot of fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final regular season game. PENS RULE!!!"
1,comp.sys.ibm.pc.hardware,My brother is in the market for a high-performance video card that supports VESA local bus with 1-2MB RAM. Does anyone have suggestions/ideas on:  - Diamond Stealth Pro Local Bus  - Orchid Farenheit 1280  - ATI Graphics Ultra Pro  - Any other high-performance VLB card Please post or email. Thank you!  - Matt
2,talk.politics.mideast,"Finally you said what you dream about. Mediterranean???? That was new.... 	The area will be ""greater"" after some years, like your ""holocaust"" numbers......  ***** 	Is't July in USA now????? Here in Sweden it's April and still cold. 	Or have you changed your calendar???  ****************  ******************  *************** 	NOTHING OF THE MENTIONED IS TRUE, BUT LET SAY IT's TRUE.  SHALL THE AZERI WOMEN AND CHILDREN GOING TO PAY THE PRICE WITH  ************** 	BEING RAPED, KILLED AND TORTURED BY THE ARMENIANS??????????  HAVE YOU HEARDED SOMETHING CALLED: ""GENEVA CONVENTION""??????? 	YOU FACIST!!!!! 	Ohhh i forgot, this is how Armenians fight, nobody has forgot 	you killings, rapings and torture against the Kurds and Turks once 	upon a time!  Ohhhh so swedish RedCross workers do lie they too? What ever you say ""regional killer"", if you don't like the person then shoot him that's your policy.....l  i  i  i 	Confused????? i  i  Search Turkish planes? You don't know what you are talking about.	i  Turkey's government has announced that it's giving weapons <-----------i  to Azerbadjan since Armenia started to attack Azerbadjan it self, not the Karabag province. So why search a plane for weapons since it's content is announced to be weapons? If there is one that's confused then that's you! We have the right (and we do) 	to give weapons to the Azeris, since Armenians started the fight in Azerbadjan!  Shoot down with what? Armenian bread and butter? Or the arms and personel of the Russian army?"
3,comp.sys.ibm.pc.hardware,Think! It's the SCSI card doing the DMA transfers NOT the disks... The SCSI card can do DMA transfers containing data from any of the SCSI devices it is attached when it wants to. An important feature of SCSI is the ability to detach a device. This frees the SCSI bus for other devices. This is typically used in a multi-tasking OS to start transfers on several devices. While each device is seeking the data the bus is free for other commands and data transfers. When the devices are ready to transfer the data they can aquire the bus and send the data. On an IDE bus when you start a transfer the bus is busy until the disk has seeked the data and transfered it. This is typically a 10-20ms second lock out for other processes wanting the bus irrespective of transfer time.
4,comp.sys.mac.hardware,"1) I have an old Jasmine drive which I cannot use with my new system.  My understanding is that I have to upsate the driver with a more modern one in order to gain compatability with system 7.0.1. does anyone know of an inexpensive program to do this? ( I have seen formatters for <$20 buit have no idea if they will work)  2) I have another ancient device, this one a tape drive for which the back utility freezes the system if I try to use it. THe drive is a jasmine direct tape (bought used for $150 w/ 6 tapes, techmar mechanism). Essentially I have the same question as above, anyone know of an inexpensive beckup utility I can use with system 7.0.1"


After loading the news dataset, the next step is to load and store the stopwords.

In [8]:
import nltk
from nltk.corpus import stopwords

# Download the stopwords corpus (if not already downloaded).
nltk.download('stopwords')

# Load the English stopwords into a set for efficient lookup.
# Change 'english' to 'portuguese' for Portuguese stopwords.
english_stop_words = set(stopwords.words('english'))

# Print the loaded set of stopwords.
print(english_stop_words)

{'being', 'can', 'll', 'more', "we've", 'is', "you'd", "mightn't", 'why', 'all', 'its', "hasn't", 'which', "won't", "you'll", 'mightn', 'yourself', "you've", 'after', 'ain', 'own', 'we', "aren't", 'about', 'are', "you're", 'other', 'again', 'both', 'him', 'when', 'your', 'those', "hadn't", 'didn', 'there', 'any', "she'll", 'needn', 'his', 'do', 'does', "he's", "don't", 'here', 'this', 'off', 'yourselves', 'them', 'nor', 'by', "she'd", 'some', 'have', 'with', 'into', 'o', 'did', "he'll", "haven't", 'herself', 'has', 've', 're', 'wouldn', "needn't", 'hadn', 'few', 'if', 'm', "they'd", 'her', 'doing', 'himself', 'they', "shouldn't", 'y', 'while', 'hers', "doesn't", 'too', 'how', 'but', 'a', "didn't", 'were', "we're", 'ours', "we'll", 'i', 'no', 'and', "i'll", "shan't", 'of', 'most', 'he', 'because', 'having', 'out', "they'll", 't', 'so', 'over', 'theirs', 'don', "should've", "weren't", 'an', "they've", 'very', 's', 'where', "it'll", 'itself', 'weren', "couldn't", 'only', 'or', 'at', 'was'

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Felipe\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


We have the `20newsgroups` (full texts) and the `stopwords` set (words to remove). To apply the removal, we first need to "break" the `20newsgroups` texts into individual words (tokenize). Then, we will create a function that tokenizes any text and removes the stopwords found.
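
How a text is split into tokens affects which stopwords can be matched. As a rough illustration, the sketch below compares plain whitespace splitting with a simple regular expression that separates punctuation from words. The regex is a stand-in chosen for this example, not what `nltk.word_tokenize` does internally, and the sample text is adapted from the first document shown above.

```python
import re

sample = "Man, they are killing those Devils. PENS RULE!!!"

# Whitespace splitting keeps punctuation attached to the neighboring word.
whitespace_tokens = sample.split()

# Illustrative pattern: runs of letters/apostrophes, or any single
# non-space character that is not a letter (punctuation, digits).
regex_tokens = re.findall(r"[A-Za-z']+|[^\sA-Za-z']", sample)

print(whitespace_tokens)
print(regex_tokens)
```

With whitespace splitting, a token such as `"Man,"` carries its comma, so a stopword followed by punctuation (for example `"the,"`) would not match an entry like `"the"` in the stopword set.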

In [16]:
import nltk
# Requires the 'punkt' tokenizer data: nltk.download('punkt')
# (recent NLTK releases may ask for 'punkt_tab' instead).
# Requires the 'stopwords' corpus: nltk.download('stopwords')

# Relies on 'english_stop_words', the set loaded in the previous cell:
#     from nltk.corpus import stopwords
#     english_stop_words = set(stopwords.words('english'))

def remove_stopwords(input_text):
    """
    Removes stopwords from a given text.

    Args:
        input_text (str): The input text string from which to remove stopwords.

    Returns:
        str: The text string with stopwords removed, where words are joined by spaces.
    """
    # Tokenizes the input text into a list of individual words (tokens).
    # nltk.word_tokenize handles punctuation correctly by separating it from words.
    words = nltk.word_tokenize(input_text)

    # Filters the tokenized words, keeping only those that are NOT in the 'english_stop_words' set.
    # word.lower() is used to ensure case-insensitive matching against that set.
    filtered_words = [word for word in words if word.lower() not in english_stop_words]

    # Joins the filtered words back into a single string, separated by spaces.
    # This reconstructs the text without the stopwords.
    return ' '.join(filtered_words)

# Example usage (relies on english_stop_words from the previous cell)
# sample_text = "This is a sample sentence showing the removal of common words."
# cleaned_text = remove_stopwords(sample_text)
# print(cleaned_text)

To demonstrate the use of the function, we will apply it to the first text in the 20newsgroups dataset, displaying the original text and the text after removing the stopwords.

In [18]:
# Get the first document text from the loaded newsgroups dataset.
original_text = newsgroups.data[0]

# Print a header indicating the original text.
print("Original Text")
print("----------------")
# Display the original text content.
print(original_text)

# Print a header for the text after stopword removal.
print("\n\nText After Stopwords Removal:")
print("---------------------------------")

# Call the 'remove_stopwords' function to process the original text
# and print the cleaned result.
print(remove_stopwords(original_text))

Original Text
----------------


I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game.          PENS RULE!!!




Text After Stopwords Removal:
---------------------------------
sure bashers Pens fans pretty confused lack kind posts recent Pens massacre Devils . Actually , bit puzzled bit relieved . However , going put end non-PIttsburghers ' relief bit praise Pens 

Removing stopwords in the example shortened the text while keeping its key terms. Punctuation tokens such as "." and "," remain because they are not in the stopword list. We will now apply the same preprocessing to the entire 20 Newsgroups dataset to see its effect on a large text collection.

A first measurable effect of stopword removal is a smaller dataset. By eliminating frequent words with little intrinsic meaning, we decrease the dataset's volume. We will verify this in practice by comparing the total word count before and after removal.

In [19]:
from sklearn.datasets import fetch_20newsgroups
import nltk
from nltk.corpus import stopwords

# Download NLTK stopwords (if not already present).
nltk.download('stopwords')
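# Load Portuguese stopwords into a set for efficient lookup.
# NOTE: the corpus is English, so this list is a mismatch. It only removes words
# that happen to appear in both languages (such as "a" or "do").
# The counts printed below were produced with this Portuguese list;
# use 'english' for a meaningful comparison, as a later cell does.
stop_words = set(stopwords.words('portuguese'))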

# Load the 20 Newsgroups dataset.
# 'subset='all'' gets both training and test data.
# 'remove' strips headers, footers, and quotes for cleaner text.
news_groups_data = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))

# Calculate the total number of words in the original dataset.
# It splits each document into words and sums their counts across all documents.
original_word_count = sum(len(text.split()) for text in news_groups_data.data)

# Process each document: split into words, filter out stopwords, then join back into a string.
# This creates a new list of documents with stopwords removed.
cleaned_data = [" ".join(word for word in text.split() if word.lower() not in stop_words) for text in news_groups_data.data]

# Calculate the total number of words in the cleaned dataset.
# This reflects the word count after stopword removal.
cleaned_word_count = sum(len(text.split()) for text in cleaned_data)

# Print the comparison of word counts before and after stopword removal.
print(f"Original Total Words: {original_word_count}")
print(f"Total Words After Stopword Removal: {cleaned_word_count}")

Original Total Words: 3423145
Total Words After Stopword Removal: 3281948



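Analyzing the results, the **original dataset contained 3,423,145 words**. After **removing stopwords, this number dropped to 3,281,948**, a **reduction of 141,197 words**, or about 4.1%.

The reduction is small for two reasons. The cell above filters English text with the Portuguese stopword list, so only words that happen to be in that list are removed. In addition, `split()` leaves punctuation attached to words, so a token like "the," would not match "the". With the English list the reduction should be considerably larger. In real-world NLP applications, where datasets are often much larger than these roughly 20,000 texts, **stopword removal can noticeably reduce overall dataset size**, leading to more efficient processing and storage.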
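
The relative size of the reduction follows directly from the two totals printed above. The short calculation below uses those reported numbers as constants rather than recomputing them from the dataset.

```python
# Totals copied from the cell output above.
original_word_count = 3_423_145
cleaned_word_count = 3_281_948

removed_words = original_word_count - cleaned_word_count
reduction_pct = 100 * removed_words / original_word_count

print(f"Removed {removed_words:,} words ({reduction_pct:.2f}% of the original)")
```

Expressing the change as a percentage makes it easier to compare runs that use different stopword lists or tokenizers.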




Stopword removal is **crucial for eliminating noise in text analysis**. Consider an application tracking trending terms over time. Without filtering stopwords, **irrelevant words could obscure valuable insights**, such as identifying prominent politicians or the most discussed electronics.

We'll now demonstrate this by showing the most frequent words in the 20Newsgroups dataset, first from the original text, and then after stopword removal.


In [23]:
import re
from collections import Counter

from nltk.corpus import stopwords
from sklearn.datasets import fetch_20newsgroups

# --- 1. Load the stopwords ---
# Load the ENGLISH stopwords into a set.
# This corrects the previous cell, which used 'portuguese' on English text.
stop_words = set(stopwords.words('english'))

# --- 2. Load the 20 Newsgroups dataset ---
# Load the 20 Newsgroups dataset, removing headers, footers, and quotes.
newsgroups_data = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))

# --- 3. Define the cleaning function ---
def clean_text(text):
    # Remove non-alphabetic characters (numbers, punctuation) and replaces them with a space.
    # This also helps separate words that might be joined by punctuation.
    cleaned_alpha_text = re.sub(r'[^a-zA-Z\s]', ' ', text)
    # Split the text into words and filter out stopwords.
    # Lowercasing is applied only for the comparison, so kept words retain their original case.
    # The split() method without arguments handles multiple spaces correctly.
    filtered_words = [word for word in cleaned_alpha_text.split() if word.lower() not in stop_words]
    # Join the cleaned words back into a single string.
    return ' '.join(filtered_words)

# --- 4. Apply the cleaning and compute frequencies ---

# Clean all documents in the dataset.
cleaned_documents = [clean_text(text) for text in newsgroups_data.data]

# Calculate word frequencies for the original dataset.
# Join all original texts into one string, then split on whitespace to get words, then count.
word_counts_original = Counter(" ".join(newsgroups_data.data).split())

# Calculate word frequencies for the cleaned dataset.
# Join all cleaned documents into one string, then split by space, then count.
word_counts_cleaned = Counter(" ".join(cleaned_documents).split())

# --- 5. Print the results ---

# Print the 10 most common words in the original text.
print("Top 10 Most Common Words (Original Text):")
print(word_counts_original.most_common(10))

# Print the 10 most common words after cleaning (stopword removal and non-alphabetic characters).
print("\nTop 10 Most Common Words (After Cleaning):")
print(word_counts_cleaned.most_common(10))

# --- Explicit test with one document ---
print("\n--- Explicit Document Test ---")
sample_original_text = newsgroups_data.data[0]
print("\nOriginal Text Sample:")
print(sample_original_text[:10]) # Print the first 10 characters for brevity

sample_cleaned_text = clean_text(sample_original_text)
print("\nCleaned Text Sample:")
print(sample_cleaned_text[:10]) # Print the first 10 characters of the cleaned text

Top 10 Most Common Words (Original Text):
[('the', 153574), ('to', 83965), ('of', 75111), ('a', 65521), ('and', 64420), ('in', 45448), ('is', 45404), ('I', 44740), ('that', 41057), ('for', 29222)]

Top 10 Most Common Words (After Cleaning):
[('AX', 62500), ('X', 12208), ('would', 9902), ('Q', 9313), ('one', 9197), ('W', 8546), ('F', 7940), ('G', 7598), ('B', 7576), ('R', 7530)]

--- Explicit Document Test ---

Original Text Sample:


I am sur

Cleaned Text Sample:
sure bashe
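
In the output above, the cleaned top-10 list is dominated by tokens such as 'AX' and single letters rather than topical words. These probably come from encoded or otherwise non-prose content in some posts; that is an assumption, since this notebook does not investigate them. One possible mitigation, sketched below, is to also drop very short tokens. The helper name `clean_text_min_len`, the three-character threshold, and the toy stopword set are all illustrative choices, not part of the original pipeline.

```python
import re
from collections import Counter

# Small illustrative stopword set (an assumption, not NLTK's list).
toy_stop_words = {"the", "is", "a", "of", "would"}

def clean_text_min_len(text, stop_words, min_len=3):
    """Return lowercase tokens that are not stopwords and have at least min_len letters."""
    letters_only = re.sub(r"[^a-zA-Z\s]", " ", text)
    return [
        word.lower()
        for word in letters_only.split()
        if word.lower() not in stop_words and len(word) >= min_len
    ]

# Made-up sample mixing noise tokens with ordinary words.
sample = "AX AX X Q the driver is a SCSI driver of the card"
token_counts = Counter(clean_text_min_len(sample, toy_stop_words))

print(token_counts.most_common(2))
```

A length threshold is a blunt tool: it also removes legitimate short words and abbreviations, so the cutoff should be checked against the data before it is adopted.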


---

### Summary: Model Evaluation Metrics from Confusion Matrix

When evaluating classification models, understanding their **accuracy and reliability** is crucial. The **confusion matrix** is a vital tool for deriving key performance metrics, including **precision, recall, F-measure, and accuracy**.

**Accuracy** is one of the most intuitive and widely used metrics. It represents the **proportion of correct predictions** out of the total. For instance, if a model correctly classifies 90 out of 100 samples, its accuracy is 90%. It offers a quick overview of the model's overall effectiveness.

Beyond accuracy, the **F-measure (or F-score)** is another fundamental metric that **combines precision and recall** into a single score; the common F1 variant is their harmonic mean. It's particularly useful when dealing with **imbalanced classes**, where accuracy alone can look good even if a minority class is rarely predicted correctly. A high F-score indicates a good balance between precision and recall. While both accuracy and F-measure provide valuable insights, it's essential to consider other metrics and the specific context when assessing model performance.
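
As a small worked example, the metrics named above can be computed by hand from the four cells of a binary confusion matrix. The counts below are made up for illustration and do not come from any model in this notebook.

```python
# Hypothetical binary confusion matrix counts.
true_positives = 40
false_positives = 10
false_negatives = 20
true_negatives = 30

total = true_positives + false_positives + false_negatives + true_negatives

# Accuracy: share of all predictions that are correct.
accuracy = (true_positives + true_negatives) / total
# Precision: share of predicted positives that are truly positive.
precision = true_positives / (true_positives + false_positives)
# Recall: share of actual positives that the model found.
recall = true_positives / (true_positives + false_negatives)
# F1: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1:        {f1:.4f}")
```

For multi-class problems such as 20 Newsgroups, these per-class values are averaged; the evaluation cell later in this notebook uses scikit-learn's `f1_score` with `average='macro'` for that purpose.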

---



## Model Evaluation Steps

| Step | Description | Purpose |
| :--- | :---------- | :------ |
| **1. Load Dataset** | Load the `20 Newsgroups` base dataset. | To obtain the raw text data for analysis. |
| **2. Clean Special Characters** | Remove special characters (e.g., `!@#$%^&*`). | To reduce noise and prepare text for tokenization and further processing. |
| **3. Create Cleaned Dataset** | Create a dataset containing **only the cleaned data** (without stopwords removed yet). | To serve as a baseline for comparison, reflecting text after basic normalization. |
| **4. Create Stopword-Removed Dataset** | Create a second dataset where **stopwords are also removed** from the cleaned data. | To isolate the impact of stopword removal on model performance and data size. |
| **5. Execute ML Algorithm** | Run the machine learning algorithm on **both** the cleaned dataset and the stopword-removed dataset. | To train and test models under different pre-processing conditions. |
| **6. Collect Metrics** | Collect **accuracy and F-measure** metrics for models trained on both datasets. | To compare model performance and determine the effectiveness of stopword removal on classification quality. |



In [25]:
from sklearn.datasets import fetch_20newsgroups # Imports the function to load the 20 Newsgroups dataset.
from sklearn.feature_extraction.text import CountVectorizer # Imports CountVectorizer, a tool to convert text into numerical feature vectors.
from nltk.corpus import stopwords # Imports the stopwords corpus from NLTK.
import nltk # Imports the Natural Language Toolkit library.

nltk.download('stopwords') # Downloads the NLTK stopwords list (if not already downloaded).

# Defines a set of English stopwords. Using a set makes lookups very fast.
stop_words = set(stopwords.words('english'))

# Loads the training subset of the 20 Newsgroups dataset.
# 'remove' strips common elements like headers, footers, and quotes to clean the text.
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))

# Loads the test subset of the 20 Newsgroups dataset, applying the same cleaning.
newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))

# --- Vectorizing with and without stopwords ---

# Initializes a CountVectorizer without any stopword filtering.
# This vectorizer will convert text into numerical counts of all words.
vectorizer = CountVectorizer()

# Initializes a CountVectorizer configured to remove the specified 'stop_words'.
# The 'stop_words' parameter expects a list, so the set is converted to a list.
vectorizer_sw = CountVectorizer(stop_words=list(stop_words))

# Transforms the training text data into a numerical feature matrix (X_train).
# 'fit_transform' learns the vocabulary from the training data and then converts it.
# This matrix will include all words.
X_train = vectorizer.fit_transform(newsgroups_train.data)

# Transforms the training text data into a numerical feature matrix (X_train_sw),
# but this time, stopwords are excluded from the vocabulary.
X_train_sw = vectorizer_sw.fit_transform(newsgroups_train.data)

# Transforms the test text data using the vocabulary learned from 'vectorizer' (without stopwords removed during fitting).
# 'transform' is used here because the vocabulary is already learned from the training data.
X_test = vectorizer.transform(newsgroups_test.data)

# Transforms the test text data using the vocabulary learned from 'vectorizer_sw' (with stopwords removed during fitting).
# This ensures consistency in feature representation between training and testing sets.
X_test_sw = vectorizer_sw.transform(newsgroups_test.data)

# Extracts the numerical target labels (categories) for the training data.
y_train = newsgroups_train.target

# Extracts the numerical target labels (categories) for the test data.
y_test = newsgroups_test.target



We'll now execute machine learning algorithms to **classify text based on the 20 Newsgroups categories**. To assess performance with and without stopword removal, we'll employ **three different classification algorithms**.



In [26]:
from sklearn.naive_bayes import MultinomialNB # Imports the Multinomial Naive Bayes classifier, suitable for text classification with word counts.
from sklearn.svm import LinearSVC # Imports Linear Support Vector Classification, a robust linear classifier.
from sklearn.ensemble import RandomForestClassifier # Imports the Random Forest classifier, an ensemble method that builds multiple decision trees.
from sklearn.metrics import f1_score, accuracy_score # Imports F1-score and accuracy score metrics for evaluating classification models.

# Defines a dictionary of machine learning models to be evaluated.
# Keys are user-friendly names for the models, and values are instantiated model objects.
models = {
    'Naive Bayes': MultinomialNB(),
    'SVM': LinearSVC(),
    'Random Forest': RandomForestClassifier()
}

# Loops through each model in the 'models' dictionary.
# 'name' will be the string key (e.g., 'Naive Bayes'), and 'model' will be the model object itself.
for name, model in models.items():
    # --- Evaluation without stopword removal ---
    # Trains the current model using the training data (X_train) that INCLUDES stopwords.
    # y_train contains the true categories for the training data.
    model.fit(X_train, y_train)
    # Makes predictions on the test data (X_test), which also includes stopwords.
    predictions = model.predict(X_test)
    # Prints the model's performance metrics for the case WITHOUT stopword removal.
    # Uses f-strings for formatted output:
    #   - {name}: The name of the current model.
    #   - accuracy_score(y_test, predictions): Calculates the overall accuracy.
    #   - f1_score(y_test, predictions, average='macro'): Calculates the F1-score.
    #     'macro' average calculates F1-score for each class independently and then takes the unweighted mean,
    #     useful when you want to treat all classes equally regardless of their size.
    #   - :.4f ensures the numbers are formatted to 4 decimal places.
    print(f"{name} (Without stopwords removal) - Accuracy: {accuracy_score(y_test, predictions):.4f}, F1: {f1_score(y_test, predictions, average='macro'):.4f}")

    # --- Evaluation with stopword removal ---
    # Trains the current model using the training data (X_train_sw) that EXCLUDES stopwords.
    model.fit(X_train_sw, y_train)
    # Makes predictions on the test data (X_test_sw), which also has stopwords removed.
    predictions_sw = model.predict(X_test_sw)
    # Prints the model's performance metrics for the case WITH stopword removal.
    print(f"{name} (With stopwords removal) - Accuracy: {accuracy_score(y_test, predictions_sw):.4f}, F1: {f1_score(y_test, predictions_sw, average='macro'):.4f}\n")

Naive Bayes (Without stopwords removal) - Accuracy: 0.5431, F1: 0.5121
Naive Bayes (With stopwords removal) - Accuracy: 0.6288, F1: 0.5907

SVM (Without stopwords removal) - Accuracy: 0.5720, F1: 0.5637
SVM (With stopwords removal) - Accuracy: 0.5786, F1: 0.5690

Random Forest (Without stopwords removal) - Accuracy: 0.5932, F1: 0.5713
Random Forest (With stopwords removal) - Accuracy: 0.6147, F1: 0.5969
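
To compare the two conditions at a glance, the sketch below computes the change in each metric per model. The scores are copied from the output above as constants; the dictionary layout and variable names are illustrative.

```python
# Reported scores as (accuracy, macro F1), copied from the output above.
reported_scores = {
    "Naive Bayes": {"without": (0.5431, 0.5121), "with": (0.6288, 0.5907)},
    "SVM": {"without": (0.5720, 0.5637), "with": (0.5786, 0.5690)},
    "Random Forest": {"without": (0.5932, 0.5713), "with": (0.6147, 0.5969)},
}

score_deltas = {}
for model_name, scores in reported_scores.items():
    accuracy_delta = scores["with"][0] - scores["without"][0]
    f1_delta = scores["with"][1] - scores["without"][1]
    score_deltas[model_name] = (round(accuracy_delta, 4), round(f1_delta, 4))
    print(f"{model_name}: accuracy {accuracy_delta:+.4f}, macro F1 {f1_delta:+.4f}")
```

In this run all three models score higher with stopword removal, with the largest gain for Naive Bayes and only a marginal one for the SVM. Because `RandomForestClassifier()` is created without a fixed `random_state`, its numbers will vary somewhat between runs.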

