In [None]:
Implement e-mail spam filtering using text classification algorithm with appropriate dataset

In [32]:
import pandas as pd
df = pd.read_csv(r"C:\Users\Niyati Patil\Downloads\SMSSpamCollection.csv", sep='\t',names=['label','message'])
print(df.head())

  label                                            message
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...


In [33]:
# Label encoding: spam=1, ham=0
df['label']=df['label'].map({'spam':1,'ham':0})

In [34]:
# train test
from sklearn.model_selection import train_test_split
X_train,X_test,y_train, y_test= train_test_split(df['message'], df['label'], test_size=0.2, random_state= 42)

In [35]:
# text preprocessing
import re
def preprocessing_text(text):
    text =text.lower()
    text = re.sub(r'\d+','',text) #remove numbers
    text = re.sub(r'\W+','',text) #remove punctuations
    return text

In [36]:
# apply preprocessing
X_train=X_train.apply(preprocessing_text)
X_test=X_test.apply(preprocessing_text)

In [37]:
#Tf_IDF vectorization
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer=TfidfVectorizer(stop_words='english',max_features=3000)
X_train_vec=vectorizer.fit_transform(X_train)
X_test_vec=vectorizer.transform(X_test)

In [38]:
# train neive bayes classifire
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(X_train_vec,y_train)

In [39]:
# predict on test_data
y_data= classifier.predict(X_test_vec)

In [40]:
from sklearn.metrics import classification_report 
print(classification_report(y_test,y_data))

              precision    recall  f1-score   support

           0       0.87      1.00      0.93       966
           1       1.00      0.05      0.09       149

    accuracy                           0.87      1115
   macro avg       0.94      0.52      0.51      1115
weighted avg       0.89      0.87      0.82      1115



In [54]:
print(classification_report(y_test, y_data, target_names=['ham', 'spam']))

              precision    recall  f1-score   support

         ham       0.87      1.00      0.93       966
        spam       1.00      0.05      0.09       149

    accuracy                           0.87      1115
   macro avg       0.94      0.52      0.51      1115
weighted avg       0.89      0.87      0.82      1115



In [None]:
The task of e-mail spam filtering using a text classification algorithm aims to
automatically identify and filter out spam emails (unwanted or irrelevant
messages) from legitimate (ham) emails. This problem falls under natural
language processing (NLP) and machine learning (ML), where we use
algorithms to classify email text based on patterns learned from previous
examples. Here’s a breakdown of the problem, its theory, and how it’s typically
implemented:
1. Text Classification in Machine Learning
• Text Classification is a type of supervised learning where each text
sample (email) is assigned a predefined category (spam or ham).
• It requires transforming text into numerical features that machine
learning algorithms can interpret.
• Spam filtering is a binary classification problem, where we categorize
each email as either spam or ham.
2. Key Steps in Text Classification for Spam Filtering
1. Data Collection
o A labeled dataset with a sufficient number of spam and ham
messages is essential. For spam filtering, datasets like the SMS
Spam Collection Dataset or Enron Email Dataset (available on
Kaggle) are commonly used. These datasets contain email
messages labeled as spam or ham, providing the training data for
the model.
2. Data Preprocessing
o Text data must be preprocessed to clean and prepare it for
analysis. Key steps include:
▪ Lowercasing: Standardize text to lowercase.
▪ Tokenization: Splitting text into words (tokens).
▪ Removing Punctuation: Eliminating punctuation marks,
which are usually not helpful in classification.
▪ Stop Words Removal: Removing common words (e.g.,
"the," "is") that don’t contribute to classification.
▪ Stemming/Lemmatization: Reducing words to their root
forms (e.g., "running" to "run").
3. Feature Extraction
o We convert preprocessed text into a numerical form using
methods such as:
▪ Bag of Words (BoW): Represents text as a count of words.
▪ TF-IDF (Term Frequency-Inverse Document Frequency):
Weighs words based on their importance across the dataset.
▪ Word Embeddings: Advanced methods like Word2Vec or
GloVe can also be used for richer representations.
4. Model Selection
o Some algorithms that are well-suited for text classification include:
▪ Naive Bayes: Often used for spam detection, it’s effective
for text data and computationally efficient.
▪ Support Vector Machine (SVM): Can separate spam from
ham by finding the optimal boundary in feature space.
▪ Logistic Regression: A linear classifier that works well for
binary classification.
▪ Deep Learning Models: RNNs, LSTMs, or transformers (like
BERT) can provide high accuracy but require more data and
computational power.
5. Model Training
o We train the model on a subset of data (training set) and test it on
another subset (test set). This allows the model to learn from
labeled data, finding patterns and associations between text
features and labels (spam or ham).
6. Evaluation and Metrics
o Once trained, we evaluate the model’s performance on unseen
data using metrics like:
▪ Accuracy: Proportion of correct predictions out of total
predictions.
▪ Precision: Proportion of predicted spam emails that are
truly spam.
▪ Recall: Proportion of actual spam emails that the model
correctly identifies.
▪ F1 Score: The harmonic mean of precision and recall,
providing a balanced measure.
7. Deployment
o After satisfactory evaluation, the model can be deployed in an
email system to classify incoming messages. It can label emails as
spam or ham in real-time, automatically directing spam to a
separate folder.
3. Naive Bayes Classifier: A Theoretical Overview
• Naive Bayes is a popular algorithm for spam filtering because of its
simplicity and effectiveness in text classification.
• It’s based on Bayes’ theorem, which calculates the probability of a
message being spam given the presence of specific words.
• The model assumes each word contributes independently to the
probability of the email being spam (hence "naive"), which simplifies
calculations.
• For each email, it calculates:
P(spam∣email)=P(email∣spam)⋅P(spam)P(email)P(\text{spam} |
\text{email}) = \frac{P(\text{email} | \text{spam}) \cdot
P(\text{spam})}{P(\text{email})}P(spam∣email)=P(email)P(email∣spam)⋅P(
spam)
• The model compares this probability with the probability of the email
being ham and assigns the label with the higher probability.
4. Challenges and Considerations
• Imbalanced Dataset: In real-world scenarios, spam emails are often
fewer than ham emails, requiring techniques to handle imbalance.
• Feature Selection: Choosing the right features (words or phrases)
impacts accuracy.
• Data Drift: Spammers frequently change tactics, so the model may need
periodic retraining to adapt to new patterns.
• Performance vs. Complexity: Simpler models (like Naive Bayes) perform
surprisingly well, but advanced models may improve accuracy for
complex cases.
❖ Traditional Machine Learning Algorithms
❖ These algorithms are often combined with feature extraction methods
like Bag of Words (BoW) or TF-IDF.
❖ Naive Bayes: Commonly used for spam detection; it’s efficient and often
performs well with text data.
❖ Support Vector Machines (SVM): A powerful linear classifier effective for
text classification tasks.
❖ Logistic Regression: Suitable for binary and multiclass classification;
works well with high-dimensional data.
❖ K-Nearest Neighbors (KNN): Simple but often less effective on large text
datasets due to high-dimensionality.
❖ Decision Trees and Random Forests: Can work for text classification but
are typically less common due to overfitting on sparse text data.
❖ Gradient Boosting Algorithms (e.g., XGBoost, CatBoost): Used
occasionally but less common for text due to high dimensionality
challenges.