dataset for Sentiment Analysis using SVM. We will use the SMS Spam Collection dataset, which contains messages labeled as either "ham" (non-spam) or "spam". This dataset is a popular choice for text classification tasks.
Steps:
1. Load the SMS Spam Collection dataset.
2. Preprocess the text data (lowercasing, removing punctuation and stopwords).
3. Convert the cleaned text to TF-IDF vectors.
4. Train the SVM model.
5. Evaluate the model.
We will load the dataset from an Excel file (spam.xlsx), preprocess the text, vectorize it with TF-IDF, and then train a linear SVM to classify the messages as spam or non-spam.
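To make the preprocessing step concrete, here is a minimal sketch of what it does to a single message. The sample text, the short `demo_stopwords` set, and the `demo_preprocess` name are illustrative assumptions, not part of the dataset or the notebook's actual stopword list.

```python
import string

# Tiny illustrative stopword set (an assumption; the notebook uses a longer list)
demo_stopwords = {"you", "have", "a", "to", "the"}

def demo_preprocess(text):
    """Lowercase the text, strip punctuation, and drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(word for word in text.split() if word not in demo_stopwords)

# Hypothetical spam-like message, made up for this example
sample = "You have WON a prize! Call now to claim the reward."
cleaned = demo_preprocess(sample)
print(cleaned)  # won prize call now claim reward
```

Only the content-bearing words survive, which is what the TF-IDF vectorizer then turns into numeric features.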

In [2]:
import pandas as pd
import string
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

# Load the dataset
file_path = "spam.xlsx"  # Ensure the file is in the correct directory
df = pd.read_excel(file_path)

df = df.rename(columns={"v1": "label", "v2": "message"})
df = df[["label", "message"]]  # Keep only relevant columns

df["message"] = df["message"].astype(str)  # Ensure all messages are strings
df["label"] = df["label"].map({"spam": 1, "ham": 0})  # Convert labels to binary

# Predefined stopwords list
custom_stopwords = {
    "i", "me", "my", "we", "our", "you", "your", "he", "she", "it", "they",
    "them", "this", "that", "is", "are", "was", "were", "be", "been", "have",
    "has", "had", "do", "does", "did", "a", "an", "the", "and", "but", "if",
    "or", "because", "as", "until", "with", "about", "between", "to", "from",
    "up", "down", "in", "out", "on", "off", "over", "under", "again",
    "further", "then", "once", "here", "there", "when", "where", "why", "how",
    "all", "any", "both", "each", "more", "most", "some", "such", "no", "nor",
    "not", "only", "own", "same", "so", "than", "too", "very", "s", "t", "can",
    "will", "just", "don", "should", "now",
}

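# Function to preprocess text
def preprocess_text(text):
    """Lowercase the text, strip punctuation, and remove stopwords."""
    text = text.lower()
    # Delete every ASCII punctuation character
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Keep only the words that are not in the stopword set
    text = " ".join(word for word in text.split() if word not in custom_stopwords)
    return text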

df["cleaned_message"] = df["message"].apply(preprocess_text)

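# Convert text to TF-IDF vectors
# Note: the vectorizer is fitted on all messages before the split, so the
# vocabulary and IDF weights also reflect the test messages. Fitting on the
# training portion only gives a stricter estimate of performance on unseen data.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df["cleaned_message"])
y = df["label"]

# Split the dataset (stratified so both sets keep the same spam/ham ratio)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)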

# Train the SVM model
svm_model = SVC(kernel="linear")
svm_model.fit(X_train, y_train)

# Make predictions
y_pred = svm_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy:.4f}")
print("Classification Report:\n", classification_rep)

Accuracy: 0.9776
Classification Report:
               precision    recall  f1-score   support

           0       0.98      1.00      0.99       966
           1       0.98      0.85      0.91       149

    accuracy                           0.98      1115
   macro avg       0.98      0.92      0.95      1115
weighted avg       0.98      0.98      0.98      1115
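
The cell above fits the TF-IDF vectorizer on every message before splitting. One possible variant, sketched here under stated assumptions, wraps the vectorizer and the SVM in a scikit-learn pipeline so that the vocabulary is learned from the training messages only, and the same object can then classify new text. The toy `messages`, `labels`, and `new_messages` are made-up stand-ins for `df["cleaned_message"]` and `df["label"]`, and `spam_pipeline` is a hypothetical name. This is not the configuration that produced the scores reported above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy stand-in data (hypothetical); 1 = spam, 0 = ham
messages = [
    "won free prize call now",
    "claim free cash reward today",
    "urgent winner claim prize now",
    "free entry win cash call",
    "see lunch tomorrow",
    "running late home soon",
    "thanks help yesterday",
    "call me when get home",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first, so the test messages stay unseen
X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training text only, then trains the SVM
spam_pipeline = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
spam_pipeline.fit(X_train, y_train)

# The fitted pipeline vectorizes and classifies new text in one call
new_messages = ["free prize claim now", "lunch tomorrow"]
predictions = spam_pipeline.predict(new_messages)
print(list(zip(new_messages, predictions)))
```

Because the pipeline accepts raw strings, the fitted vectorizer and the classifier always travel together, which avoids transforming new messages with a vectorizer that was fitted on different data.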

