Let's apply the same SVM approach to a different dataset. We will use the SMS Spam Collection dataset, which contains messages labeled as either "ham" (non-spam) or "spam". Strictly speaking this is spam detection rather than sentiment analysis, but the text classification workflow is the same, and the dataset is a popular choice for such tasks.

Steps:

1. Load the SMS Spam Collection dataset.

2. Preprocess the text data (lowercasing, removing stopwords).

3. Vectorize the text with TF-IDF.

4. Train the SVM model.

5. Evaluate the model.

We will load the dataset from a CSV file, preprocess the text, vectorize it, and then train the SVM model to classify the messages as spam or non-spam.

Download the SMS Spam Collection dataset

https://archive.ics.uci.edu/dataset/228/sms+spam+collection

The dataset contains two columns (named v1 and v2 in the spam.csv file used below, and renamed in the code):

label: 'ham' (non-spam) or 'spam' (spam).

message: the content of the SMS message.
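The code in this section reads a CSV copy named spam.csv. The UCI archive itself appears to distribute the data as a tab-separated text file without a header row. Assuming that layout, a minimal loading sketch could look like the following; the helper name load_sms_tsv and the two sample rows are made up for illustration.

```python
import csv
import io

import pandas as pd


def load_sms_tsv(source):
    """Read a tab-separated, header-less SMS file into label/message columns."""
    return pd.read_csv(
        source,
        sep="\t",
        header=None,
        names=["label", "message"],
        quoting=csv.QUOTE_NONE,  # treat quote characters inside messages as plain text
    )


# Two made-up rows standing in for the real file, so the sketch runs on its own.
sample = io.StringIO('ham\tOk, see you at 5\nspam\tWIN a "free" prize now\n')
df_tsv = load_sms_tsv(sample)
print(df_tsv)
```

With the real download, the same function would be called with the path to the extracted file instead of the in-memory sample.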

In [4]:
# Importing required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

# Load the SMS Spam Collection dataset
# Assuming the dataset is available as 'spam.csv', with the label in column 'v1' and the message text in column 'v2'.
df = pd.read_csv("spam.csv", encoding='latin-1')

# Display first few rows of the dataset to understand its structure
print(df.head())

# Preprocessing: keep only the label and message columns
df = df[['v1', 'v2']].copy()  # 'v1' holds the label, 'v2' the message; copy() avoids working on a view of the original frame
df.columns = ['label', 'message']  # Rename columns to 'label' and 'message'

# Handling missing values if any (removing rows with missing values)
df = df.dropna()

# Split the data into features (X) and labels (y)
X = df['message']
y = df['label'].map({'ham': 0, 'spam': 1})  # Mapping 'ham' -> 0 and 'spam' -> 1

# Split into training and testing sets (70% training, 30% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Preprocessing: Lowercasing the text and removing stopwords
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    text = text.lower()  # Convert to lowercase
    text = ' '.join([word for word in text.split() if word not in stop_words])  # Remove stopwords
    return text

# Apply preprocessing function to both train and test data
X_train = X_train.apply(preprocess_text)
X_test = X_test.apply(preprocess_text)

# Vectorization: Convert text to numerical format using TF-IDF
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Model: Support Vector Machine (SVM) with a linear kernel
svm_model = SVC(kernel='linear')  # Using a linear kernel for text classification
svm_model.fit(X_train_tfidf, y_train)

# Predicting on the test set
y_pred_svm = svm_model.predict(X_test_tfidf)

# Evaluate the model
print("SVM Model Performance:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_svm)}")
print("Classification Report:\n", classification_report(y_test, y_pred_svm))


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


     v1                                                 v2 Unnamed: 2  \
0   ham  Go until jurong point, crazy.. Available only ...        NaN   
1   ham                      Ok lar... Joking wif u oni...        NaN   
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...        NaN   
3   ham  U dun say so early hor... U c already then say...        NaN   
4   ham  Nah I don't think he goes to usf, he lives aro...        NaN   

  Unnamed: 3 Unnamed: 4  
0        NaN        NaN  
1        NaN        NaN  
2        NaN        NaN  
3        NaN        NaN  
4        NaN        NaN  
SVM Model Performance:
Accuracy: 0.9802631578947368
Classification Report:
               precision    recall  f1-score   support

           0       0.98      1.00      0.99      1453
           1       0.98      0.86      0.92       219

    accuracy                           0.98      1672
   macro avg       0.98      0.93      0.95      1672
weighted avg       0.98      0.98      0.98      1672



Explanation of Code:


1. Data Loading:

We load the SMS Spam Collection dataset using pandas.read_csv(). As the df.head() output shows, the file has five columns: v1 (label), v2 (message), and three unnamed columns that are empty in the rows shown. We keep v1 and v2 and rename them to label and message for clarity.

2. Data Preprocessing:

We drop unnecessary columns and handle any missing values by removing rows with missing data.

We map the label column from ham (non-spam) to 0 and spam to 1 to prepare the labels for machine learning.

The preprocess_text() function performs lowercasing and stopwords removal to clean the text data.
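To see what this kind of preprocessing does and does not catch, here is a small sketch of the same logic. It uses a tiny hand-picked stopword set instead of NLTK's list (an assumption made so the example runs without a download), and the function name preprocess_text_demo is hypothetical.

```python
# A tiny hand-picked stopword set (an assumption for this sketch; the code
# above uses NLTK's full English list).
demo_stop_words = {"you", "have", "a", "to", "the"}


def preprocess_text_demo(text):
    """Lowercase, split on whitespace, and drop exact stopword matches."""
    words = text.lower().split()
    return " ".join(w for w in words if w not in demo_stop_words)


first = preprocess_text_demo("You have WON a prize, reply to claim")
second = preprocess_text_demo("Thank you, call me")
print(first)   # won prize, reply claim
print(second)  # thank you, call me
```

In the second message "you," survives because the attached comma keeps it from matching the stopword exactly. Splitting on whitespace leaves punctuation in place, so some stopwords slip through unless punctuation is stripped first.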

3. Text Vectorization (TF-IDF):

We use TF-IDF (Term Frequency-Inverse Document Frequency) to convert the text into numerical form that the SVM model can understand. This method helps to weigh words based on their importance in the document relative to the entire corpus.
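The weighting can be worked through by hand on a toy corpus. The sketch below uses the smoothed IDF formula, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization, which to my understanding matches the TfidfVectorizer defaults. The three documents are made up, and tokenization is simplified to a whitespace split.

```python
import math
from collections import Counter

# Made-up documents for illustration only.
docs = [
    "free prize call now",
    "call me later",
    "see you later",
]

tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each term.
df_counts = Counter(term for doc in tokenized for term in set(doc))


def tfidf_vector(tokens):
    """TF-IDF weights for one document, using smoothed IDF and L2 normalization."""
    tf = Counter(tokens)
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + df_counts[term])) + 1)
        for term, count in tf.items()
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}


weights = tfidf_vector(tokenized[0])
for term, w in sorted(weights.items()):
    print(f"{term}: {w:.3f}")
```

In the first document, "call" also appears in the second document, so it receives a lower weight than "free", "prize", or "now", which occur nowhere else.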

4. Training the SVM Model:

We use the Support Vector Machine (SVM) algorithm with a linear kernel. SVM is effective for text classification tasks because of its ability to handle high-dimensional data and find the optimal separating hyperplane.
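One way to keep vectorization and training tied together is a scikit-learn pipeline. The sketch below is an alternative arrangement of the same two steps, not the code used above; the toy messages and labels are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Made-up messages for illustration only (not taken from the dataset).
toy_messages = [
    "win a free prize now",
    "free entry claim your cash prize",
    "urgent call now to claim free cash",
    "are we still meeting for lunch",
    "i will call you after work",
    "see you at home tonight",
]
toy_labels = [1, 1, 1, 0, 0, 0]  # 1 = spam, 0 = ham

# The pipeline fits the vectorizer and the classifier together, so the
# vocabulary is always learned from the training messages only.
toy_model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
toy_model.fit(toy_messages, toy_labels)

toy_predictions = toy_model.predict(["claim your free prize", "see you after lunch"])
print(toy_predictions)
```

Because the fitted pipeline accepts raw strings, new messages can be classified without calling the vectorizer separately.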

5. Prediction and Evaluation:

After training, the model is used to predict whether a message is spam or not on the test data.

We evaluate the model using Accuracy and Classification Report (precision, recall, and F1-score) to assess how well the model performs.
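The metrics in the classification report can be computed by hand from the predictions. The following sketch does this for the positive class on a made-up set of ten labels; the function name binary_report and the toy labels are assumptions for illustration.

```python
def binary_report(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 4 spam (1) and 6 ham (0), with one missed spam and one false alarm.
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

report = binary_report(toy_true, toy_pred)
print(report)
```

Here accuracy is 0.8 while precision, recall, and F1 for the spam class are all 0.75, which shows how accuracy can look better than the per-class metrics when one class dominates.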

Explanation of Output:

Accuracy: The model achieved about 98.0% accuracy on the 1,672 test messages. Because ham messages heavily outnumber spam, accuracy alone overstates performance on the spam class, so the per-class metrics below matter more.

Precision: Precision is 0.98 for both classes, so about 98% of the messages the model assigns to each class really belong to it.

Recall: Recall for 'ham' rounds to 1.00, meaning almost no legitimate messages are flagged as spam. Recall for 'spam' is lower (0.86): roughly 30 of the 219 spam messages in the test set were misclassified as 'ham'.

F1-Score: The F1-score, which balances precision and recall, is 0.99 for 'ham' and 0.92 for 'spam'; the lower spam score reflects the missed spam messages.
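To put the accuracy figure in context, the support column of the classification report is enough to compute what a trivial "always ham" classifier would score. The sketch below uses only numbers from the report above; the spam counts are approximations because the reported recall is rounded.

```python
# Support values copied from the classification report above.
support = {"ham": 1453, "spam": 219}
total = sum(support.values())

# Accuracy of a trivial classifier that labels every message as ham.
baseline_accuracy = support["ham"] / total

# Approximate spam counts implied by the rounded spam recall of 0.86.
spam_recall = 0.86
spam_caught = round(spam_recall * support["spam"])
spam_missed = support["spam"] - spam_caught

print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
print(f"Spam caught (approx.): {spam_caught}, missed (approx.): {spam_missed}")
```

The baseline comes out near 0.87, so the SVM's gain over always predicting ham is about 11 percentage points, and nearly all of its remaining errors are missed spam.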

Conclusion:

SVM with a linear kernel performed well on the SMS Spam dataset, with high precision on both classes, although it missed some spam messages. This suggests that Support Vector Machines are well suited to text classification tasks like spam detection.

In real-world scenarios, performance can vary based on dataset size, feature engineering, and hyperparameter tuning.
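As one example of hyperparameter tuning, the regularization strength C of the SVM can be searched with cross-validation. This is a minimal sketch on invented toy messages, assuming a pipeline of TfidfVectorizer and SVC; the candidate values for C are arbitrary, and on the real dataset more folds and a wider grid would likely be appropriate.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Made-up messages for illustration only (not taken from the dataset).
toy_messages = [
    "win a free prize now",
    "free entry claim your cash prize",
    "urgent call now to claim free cash",
    "free cash prize waiting claim now",
    "are we still meeting for lunch",
    "i will call you after work",
    "see you at home tonight",
    "lunch at home after work",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = spam, 0 = ham

pipeline = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))

# make_pipeline names each step after its lowercased class name, hence "svc__C".
param_grid = {"svc__C": [0.1, 1, 10]}

# cv=2 only because the toy set is tiny.
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(toy_messages, toy_labels)

print(search.best_params_)
```

Running the vectorizer inside the pipeline means each cross-validation fold learns its vocabulary from its own training portion, which avoids leaking information from the held-out messages.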