# Introduction

This notebook applies natural language processing (NLP) to the UCI SMS Spam Collection dataset. The dataset contains 5574 labeled text messages, each marked as either ham (legitimate) or spam. The two variables are:
- Data Label (Ham versus Spam)
- SMS text

The relevant coding steps include:

- Import Libraries: Necessary for data manipulation, model building, and evaluation.
- Load Data: The dataset is directly fetched from the UCI repository. The dataset is expected to be tab-separated with labels and messages.
- Data Preprocessing: Splitting the dataset into training and testing parts. Additionally, the SMS text needs to be vectorized (converted into numerical format) to be used by the machine learning model.
- Model Training: Using a Multinomial Naive Bayes classifier, a simple and effective baseline for text classification on word counts.
- Prediction and Evaluation: Once the model is trained on the training set, it is used to predict the labels of the test set. The predictions are then evaluated using accuracy, a confusion matrix, and a classification report (precision, recall, and F1-score).

# Import libraries

In [1]:
import pandas as pd
import requests
import io
from zipfile import ZipFile
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# The Python library io provides the main facilities for dealing with various types of I/O (input/output).
# The requests library in Python is a popular HTTP client for making requests to web servers. 
# The zipfile library in Python provides tools to create, read, write, append, and list ZIP files. 

# Load Dataset

In [2]:
# Download the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip"
response = requests.get(url, timeout=30)
response.raise_for_status()  # Stop early if the download failed
zipfile = ZipFile(io.BytesIO(response.content))

In [3]:
# Load the dataset from the specified file within the ZIP
with zipfile.open('SMSSpamCollection') as f:
    df = pd.read_csv(f, sep='\t', header=None, names=['Label', 'SMS'])

In [4]:
# Display the first few rows of the dataset
print(df.head())

  Label                                                SMS
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...
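
A note on parsing, offered as a sketch rather than a verified diagnosis: the evaluation output later in this notebook reports 1393 test messages, which at `test_size=0.25` suggests slightly fewer than 5574 rows were loaded. One possible cause (an assumption, not checked against the file itself) is that `pd.read_csv` treats double quotes inside messages as field delimiters and merges lines. The example below uses a hypothetical three-line sample (`sample`, `df_default`, and `df_raw` are illustrative names) to show how `quoting=csv.QUOTE_NONE` changes the row count.

```python
import csv
import io

import pandas as pd

# Hypothetical three-message sample in the same tab-separated layout.
# The first message starts with a double quote and the second ends with one.
sample = 'ham\t"Hello there\nspam\tWin cash now"\nham\tok\n'

# Default quoting: the quoted field runs until the closing quote,
# so the first two lines are merged into a single row.
df_default = pd.read_csv(
    io.StringIO(sample), sep='\t', header=None, names=['Label', 'SMS']
)

# QUOTE_NONE: quote characters are kept as ordinary text, one row per line.
df_raw = pd.read_csv(
    io.StringIO(sample), sep='\t', header=None, names=['Label', 'SMS'],
    quoting=csv.QUOTE_NONE,
)

print("Rows with default quoting:", len(df_default))
print("Rows with QUOTE_NONE:", len(df_raw))
```

If the row count of `df` differs from the documented 5574 messages, passing `quoting=csv.QUOTE_NONE` to the original `pd.read_csv` call is one thing to try.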


# Preprocessing the data

In [5]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df['SMS'], df['Label'], test_size=0.25, random_state=42)

# Convert the SMS messages to a matrix of token counts
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)
X_test_counts = vectorizer.transform(X_test)
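
To make the vectorization step concrete, here is a minimal sketch on three made-up training messages and one made-up test message (all names and texts are illustrative, not taken from the dataset). It shows that `fit_transform` learns the vocabulary from the training texts only, and that `transform` silently ignores words it has not seen before.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative messages, not taken from the SMS dataset
train_msgs = ["free prize now", "see you now", "free free entry"]
test_msgs = ["free lunch now"]

vec = CountVectorizer()
train_counts = vec.fit_transform(train_msgs)  # learn vocabulary and count tokens
test_counts = vec.transform(test_msgs)        # reuse the learned vocabulary

# Column order of the count matrix: words sorted by their column index
vocab = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

print("Vocabulary:", vocab)
print("Training counts:\n", train_counts.toarray())
print("Test counts:\n", test_counts.toarray())
```

Each row is one message and each column is one vocabulary word. The word "lunch" does not appear in the training messages, so it has no column and contributes nothing to the test row. This is why the vectorizer is fitted on the training set only: the test set must not influence the vocabulary.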

# Model Training

In [6]:
# Train a Multinomial Naive Bayes classifier
clf = MultinomialNB()
clf.fit(X_train_counts, y_train)
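
As an optional variation, the vectorizer and classifier can be bundled into a single scikit-learn pipeline so that raw text goes in and labels come out. The sketch below is not the approach used in this notebook; it uses four made-up training messages and hypothetical names (`toy_texts`, `toy_labels`, `toy_model`) purely to illustrate how new messages could be classified.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Illustrative training data, not taken from the SMS dataset
toy_texts = [
    "win free prize now",
    "free cash prize",
    "see you at lunch",
    "are you home now",
]
toy_labels = ["spam", "spam", "ham", "ham"]

# The pipeline vectorizes the text and then fits the classifier
toy_model = make_pipeline(CountVectorizer(), MultinomialNB())
toy_model.fit(toy_texts, toy_labels)

# Classify previously unseen messages directly from raw text
new_messages = ["free prize", "see you"]
predictions = list(toy_model.predict(new_messages))
print(dict(zip(new_messages, predictions)))
```

With the objects defined in this notebook, the equivalent call for new messages would be `clf.predict(vectorizer.transform(new_messages))`.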

# Prediction and Evaluation

In [7]:
# Predict on the test data
y_pred = clf.predict(X_test_counts)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", report)

Accuracy: 0.9885139985642498
Confusion Matrix:
 [[1203    4]
 [  12  174]]
Classification Report:
               precision    recall  f1-score   support

         ham       0.99      1.00      0.99      1207
        spam       0.98      0.94      0.96       186

    accuracy                           0.99      1393
   macro avg       0.98      0.97      0.97      1393
weighted avg       0.99      0.99      0.99      1393
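
The figures in the classification report can be reproduced by hand from the confusion matrix. In scikit-learn's layout, rows are the true labels and columns are the predicted labels, in sorted order (ham, spam). The sketch below treats spam as the positive class and plugs in the four counts printed above; the variable names are illustrative.

```python
# Counts copied from the confusion matrix above
# (rows = true label, columns = predicted label, order: ham, spam)
tn, fp = 1203, 4    # true ham:  predicted ham, predicted spam
fn, tp = 12, 174    # true spam: predicted ham, predicted spam

total = tn + fp + fn + tp

accuracy = (tp + tn) / total         # share of all messages classified correctly
precision_spam = tp / (tp + fp)      # of messages flagged as spam, how many were spam
recall_spam = tp / (tp + fn)         # of actual spam, how many were caught
f1_spam = 2 * precision_spam * recall_spam / (precision_spam + recall_spam)

print(f"Accuracy:         {accuracy:.4f}")
print(f"Precision (spam): {precision_spam:.2f}")
print(f"Recall (spam):    {recall_spam:.2f}")
print(f"F1-score (spam):  {f1_spam:.2f}")
```

Read this way, the model misses 12 of 186 spam messages (recall 0.94) and wrongly flags 4 of 1207 ham messages. Because ham makes up most of the test set, accuracy alone looks better than the spam recall does.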

