## Inductive Learning: Spam Detection using Naive Bayes

**Dataset:**  
UCI Machine Learning Repository  
*SMSSpamCollection Dataset*

**Attributes:**
- **label:** Whether the message is spam or ham (a legitimate message).
- **message:** The content of the SMS message.


In [69]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset
data = pd.read_csv('SMSSpamCollection', sep='\t', names=['label', 'message'])

In [70]:
data

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


## Data Preprocessing

In [71]:
label_counts = data['label'].value_counts()
print(label_counts)


label
ham     4825
spam     747
Name: count, dtype: int64
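
The counts above show a clear class imbalance. As a minimal sketch (the variable names are illustrative, and the counts are copied by hand from the output above), the following computes the spam share and the accuracy that a trivial always-ham classifier would reach. That figure is a useful floor when reading the accuracy of any model trained on this data.

```python
# Class counts copied from the value_counts() output above.
label_counts = {"ham": 4825, "spam": 747}

total = sum(label_counts.values())
spam_share = label_counts["spam"] / total

# Accuracy of a trivial classifier that always predicts the majority class.
majority_baseline = max(label_counts.values()) / total

print(f"Total messages: {total}")
print(f"Spam share: {spam_share:.3f}")
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

Because roughly 87% of the messages are ham, accuracy alone can look high even when many spam messages are missed, so the per-class recall is worth checking as well.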


In [72]:
# Check for null values in the label column
null_labels = data['label'].isnull().sum()
print(f"Number of null labels: {null_labels}")

# Check for any rows with missing or empty labels
empty_labels = (data['label'] == '').sum()
print(f"Number of empty labels: {empty_labels}")

# Check if there are any rows without labels
rows_without_labels = data[data['label'].isnull() | (data['label'] == '')]
print(f"Rows without labels: {len(rows_without_labels)}")

Number of null labels: 0
Number of empty labels: 0
Rows without labels: 0


In [73]:
# Encode the labels into numerical data
data['label'] = data['label'].map({'ham': 0, 'spam': 1})
data

Unnamed: 0,label,message
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,1,This is the 2nd time we have tried 2 contact u...
5568,0,Will ü b going to esplanade fr home?
5569,0,"Pity, * was in mood for that. So...any other s..."
5570,0,The guy did some bitching but I acted like i'd...


## Dataset Splitting
- Split the data into labeled and unlabeled halves.
- Drop the labels from the unlabeled half.
- Split the labeled half into training and test sets.


In [74]:
# Split the data into labeled and unlabeled sets 50/50
labeled_data, unlabeled_data = train_test_split(data, test_size=0.5, random_state=42, stratify=data['label'])

# Drop the 'label' column from the unlabeled data
unlabeled_data = unlabeled_data.drop('label', axis=1)

# Build the label vector for semi-supervised training:
# true labels for the labeled rows, then -1 for every unlabeled row
labels = list(labeled_data['label']) + [-1] * len(unlabeled_data)

# Show the size of the datasets
print(f"Labeled data size: {labeled_data.shape}")
print(f"Unlabeled data size: {unlabeled_data.shape}")

# Split labeled data into training and test sets
train_data, test_data = train_test_split(labeled_data, test_size=0.3, random_state=42, stratify=labeled_data['label'])


train_count = len(train_data)
test_count = len(test_data)
print("Using Labeled Set:")
print(f"Number of training set: {train_count}")
print(f"Number of test set: {test_count}")


Labeled data size: (2786, 2)
Unlabeled data size: (2786, 1)
Using Labeled Set:
Number of training set: 1950
Number of test set: 836
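
The sizes printed above can be checked by hand. This is a minimal sketch, assuming `train_test_split` rounds a fractional test size up to the next whole row and gives the remainder to the other split (our reading of scikit-learn's behavior, not something the notebook states). The variable names are illustrative.

```python
import math

n_total = 5572  # 4825 ham + 747 spam

# First split: test_size=0.5 holds out half of the rows as "unlabeled".
n_unlabeled = math.ceil(0.5 * n_total)
n_labeled = n_total - n_unlabeled

# Second split: test_size=0.3 of the labeled half becomes the test set.
n_test = math.ceil(0.3 * n_labeled)
n_train = n_labeled - n_test

print(f"Labeled: {n_labeled}, unlabeled: {n_unlabeled}")
print(f"Train: {n_train}, test: {n_test}")
```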


## Model Training

In [75]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score, classification_report

model = make_pipeline(TfidfVectorizer(), MultinomialNB())

# Train the model on the labeled training data
model.fit(train_data['message'], train_data['label'])

# Predict and evaluate on the test set
predictions = model.predict(test_data['message'])
print(f"Inductive Learning Accuracy: {accuracy_score(test_data['label'], predictions)}")
print(classification_report(test_data['label'], predictions))


Inductive Learning Accuracy: 0.94377990430622
              precision    recall  f1-score   support

           0       0.94      1.00      0.97       724
           1       1.00      0.58      0.73       112

    accuracy                           0.94       836
   macro avg       0.97      0.79      0.85       836
weighted avg       0.95      0.94      0.94       836
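
To see where the spam-class figures in the report come from, the sketch below recomputes them from confusion-matrix counts. The four counts are not printed by the notebook. They are inferred from the report's support, recall, and accuracy values, and should be treated as an assumption.

```python
# Confusion-matrix counts for the spam class (label 1).
# Inferred from the report above, not printed by the notebook.
tn, fp, fn, tp = 724, 0, 47, 65

precision = tp / (tp + fp)  # share of predicted spam that really is spam
recall = tp / (tp + fn)     # share of actual spam that was caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.4f}")
```

Under these counts every error is a missed spam message: the model never flags ham as spam, but it lets a sizeable share of spam through.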



In [76]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.semi_supervised import SelfTrainingClassifier


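vectorizer = TfidfVectorizer()

# Combine labeled and unlabeled data (labeled rows first, matching the order of `labels`).
# Caution: labeled_data still contains the test_data rows, so their true labels are seen
# during fitting; combining train_data with unlabeled_data instead would avoid this leakage.
combined_data = pd.concat([labeled_data, unlabeled_data])

# -1 marks a row as unlabeled for SelfTrainingClassifier
labels = list(labeled_data['label']) + [-1] * len(unlabeled_data)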

# Define the model pipeline 
self_training_model = make_pipeline(vectorizer, SelfTrainingClassifier(MultinomialNB()))

# Train on both labeled and unlabeled data
self_training_model.fit(combined_data['message'], labels)

# Predict and evaluate on the labeled test set 
semi_supervised_predictions = self_training_model.predict(test_data['message'])
print(f"Semi-Supervised Inductive Learning Accuracy: {accuracy_score(test_data['label'], semi_supervised_predictions)}")
print(classification_report(test_data['label'], semi_supervised_predictions))


Semi-Supervised Inductive Learning Accuracy: 0.9533492822966507
              precision    recall  f1-score   support

           0       0.95      1.00      0.97       724
           1       1.00      0.65      0.79       112

    accuracy                           0.95       836
   macro avg       0.97      0.83      0.88       836
weighted avg       0.96      0.95      0.95       836
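
One possible variant keeps the test messages out of the self-training fit by combining only the training split with the unlabeled messages. The sketch below illustrates the idea on a handful of made-up messages. The toy messages and all variable names are hypothetical stand-ins for `train_data`, `unlabeled_data`, and `test_data`, and its output says nothing about the real dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.semi_supervised import SelfTrainingClassifier

# Toy stand-ins for the three splits (hypothetical messages).
train_messages = ["win a free prize now", "free entry win cash",
                  "are you coming home", "see you at lunch"]
train_labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham
unlabeled_messages = ["free cash prize", "lunch at home today"]
test_messages = ["win free cash", "see you at home"]

# Only training and unlabeled messages are used for fitting; -1 marks unlabeled rows.
fit_messages = train_messages + unlabeled_messages
fit_labels = np.array(train_labels + [-1] * len(unlabeled_messages))

leak_free_model = make_pipeline(TfidfVectorizer(), SelfTrainingClassifier(MultinomialNB()))
leak_free_model.fit(fit_messages, fit_labels)

# The test messages were never part of the fit.
toy_predictions = leak_free_model.predict(test_messages)
print(toy_predictions)
```

With the real data, the same pattern would pass `train_data['message']` followed by `unlabeled_data['message']`, with `train_data['label']` followed by `-1` entries, and then evaluate on `test_data` as before.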

