<a href="https://colab.research.google.com/github/Manikanta898/Spam-SMS-Detection/blob/main/Spam_SMS_Detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **SMS Spam Detection**



#### Project Name - SMS Spam Detection
#### Project Type - Supervised Learning
#### Project by  - Manikanta Tangi

# **Project Summary -**

In this supervised machine learning project, the objective is to develop a **classification model** that can accurately distinguish between spam and legitimate (ham) SMS messages. The dataset consists of **5,572 messages**, categorized as **spam (1) or ham (0)**.

The project involves **text preprocessing**, including **lowercasing, punctuation removal, stopword removal, and stemming**, to clean and standardize the messages. **TF-IDF vectorization** then converts the text into numerical features, which are used to train a **Naive Bayes classifier**, a commonly used model for text classification.

# **Problem Statement**


The goal of this project is to **classify SMS messages as spam or non-spam** using machine learning techniques. The model will help in **automatically filtering out spam messages**, reducing user inconvenience and improving security. The project focuses on achieving **high accuracy and efficiency** through proper data preprocessing and the use of effective classification algorithms.

### Dataset Loading

In [None]:
# Loading Dataset
import pandas as pd
path='/content/sms_spam.csv'
df = pd.read_csv(path, encoding='latin-1')

# Displaying the first few rows and dataset info
df.info()
df.head()

# **Cleaning the data**


In [None]:
# Dropping unnecessary columns
df = df[['v1', 'v2']]

# Renaming columns for clarity
df = df.rename(columns={'v1': 'label', 'v2': 'message'})

# Converting labels to binary (ham = 0, spam = 1)
df['label'] = df['label'].map({'ham': 0, 'spam': 1})

# Checking the cleaned dataset
df.info()
df.head()

- Kept only relevant columns (v1, v2).
- Renamed them as label, message for clarity.
- Converted the labels (ham → 0, spam → 1).
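
One thing worth checking after the label conversion: `Series.map` with a dictionary returns `NaN` for any value that is not a key, so a stray `Spam` or `ham ` would silently become a missing label. The sketch below uses a hypothetical toy frame (not rows from the dataset) to show the check, and one possible normalization step that is an assumption rather than part of the notebook.

```python
import pandas as pd

# Hypothetical toy rows for illustration; the real data comes from sms_spam.csv.
toy = pd.DataFrame({
    "v1": ["ham", "spam", "ham", "Spam"],
    "v2": ["see you soon", "free prize", "ok thanks", "win cash now"],
})

label_map = {"ham": 0, "spam": 1}

# Values that are not keys of label_map become NaN.
mapped = toy["v1"].map(label_map)
unmapped = int(mapped.isna().sum())
print("Unmapped labels:", unmapped)

# Assumed fix: normalize case and whitespace before mapping.
normalized = toy["v1"].str.strip().str.lower().map(label_map)
print(normalized.tolist())
```

Running the same `isna().sum()` check on `df['label']` in the notebook would confirm that every row received a 0 or 1.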

# **Text Preprocessing**


In [None]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Downloading stopwords
nltk.download('stopwords')

# Initializing the stemmer and stopwords
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))

# Text preprocessing function
def preprocess_text(text):
    # Converting to lowercase
    text = text.lower()
    # Removing special characters, numbers, and punctuation
    text = re.sub(r'[^a-z\s]', '', text)
    # Tokenization, stopword removal, and stemming
    words = text.split()
    words = [ps.stem(word) for word in words if word not in stop_words]
    # Joining the words back into a cleaned sentence
    return ' '.join(words)

# Applying preprocessing to all messages
df['cleaned_message'] = df['message'].apply(preprocess_text)

# Displaying the cleaned dataset
df.head()

- All messages are converted to lowercase, non-letter characters (digits, punctuation, symbols) are removed, and stopwords are filtered out.
- Stemming is applied to reduce words to their root form (e.g., 'running' → 'run').
- The dataset is now ready for feature extraction.
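
To make the cleaning steps above concrete, here is a simplified, dependency-free sketch that applies the same lowercase, regex, and stopword steps to one made-up message. It deliberately skips stemming, and `DEMO_STOP_WORDS` is a small hand-picked set standing in for NLTK's English list, so its output will differ from `preprocess_text`.

```python
import re

# Assumption: a tiny stand-in for NLTK's English stopword list.
DEMO_STOP_WORDS = {"a", "the", "to", "you", "have", "is", "now"}

def clean_without_stemming(text):
    """Lowercase, keep only letters and whitespace, then drop stopwords."""
    text = text.lower()
    # Same pattern as preprocess_text: digits, punctuation, and symbols are deleted.
    text = re.sub(r'[^a-z\s]', '', text)
    return ' '.join(word for word in text.split() if word not in DEMO_STOP_WORDS)

# Hypothetical sample message, not taken from the dataset.
sample = "WINNER!! You have won a £1000 prize. Call 09061701461 now!"
print(clean_without_stemming(sample))  # winner won prize call
```

Note that the phone number and the prize amount disappear entirely, because the regex removes every character that is not a lowercase letter or whitespace.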

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

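# Converting text into numerical features using TF-IDF
# Note: the vectorizer is fitted on all messages before the split, so the
# vocabulary and IDF weights also reflect messages that end up in the test set.
tfidf_vectorizer = TfidfVectorizer(max_features=5000)  # Keep only the 5000 most frequent terms across the corpus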
X = tfidf_vectorizer.fit_transform(df['cleaned_message'])

# Labels (Spam = 1, Ham = 0)
y = df['label']

# Splitting the dataset into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Checking dataset sizes
X_train.shape, X_test.shape, y_train.shape, y_test.shape


- Text successfully converted into numerical features using TF-IDF.
- Dataset split into 80% training and 20% testing.
- The model is now ready for training.
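
The cell above fits the TF-IDF vectorizer on every message before splitting. A common alternative, sketched below as an option rather than as the notebook's method, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn `Pipeline`, so the vocabulary is learned from the training messages only. The toy messages and the name `spam_pipeline` are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical, already-cleaned toy messages; not rows from the dataset.
messages = [
    "free prize call now", "win cash claim prize",
    "urgent free offer call", "claim free cash now",
    "see you at lunch", "call me later today",
    "are we meeting tomorrow", "thanks for the lunch",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the text itself, before any features are computed.
train_msgs, test_msgs, y_train, y_test = train_test_split(
    messages, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training messages only, then trains the model.
spam_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000)),
    ("nb", MultinomialNB()),
])
spam_pipeline.fit(train_msgs, y_train)

# At prediction time the fitted vocabulary is reused; unseen words are ignored.
predictions = spam_pipeline.predict(test_msgs)
print(len(predictions), "test messages classified")
```

In the notebook, the same pattern would take `df['cleaned_message']` and `df['label']` in place of the toy lists. The reported scores could change slightly under this setup.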

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Training a Naive Bayes model
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)

# Predictions on the test set
y_pred = nb_model.predict(X_test)

# Evaluating the model
accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

# Printing the results
print(f"Model Accuracy: {accuracy:.2%}\n")
print("Classification Report:\n", classification_rep)
print("Confusion Matrix:")
print(conf_matrix)

- Algorithm Used: Naive Bayes (MultinomialNB)
- Accuracy Achieved: 96.77%
- Precision (Spam Detection): 99% (Very few false positives)
- Recall (Spam Detection): 77% (Some spam messages missed)
- Confusion Matrix Analysis: 35 spam messages misclassified as ham (false negatives), which accounts for the lower recall.
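
The figures above all derive from the four cells of the confusion matrix. For binary labels 0 and 1, scikit-learn's `confusion_matrix` lays them out as `[[TN, FP], [FN, TP]]`. The sketch below shows the arithmetic with hypothetical counts chosen only for illustration; they are not this notebook's results, and `metrics_from_counts` is an illustrative helper, not a library function.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Derive accuracy, precision, and recall for the positive (spam) class."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        # Precision: of the messages flagged as spam, how many really were spam.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Recall: of the real spam messages, how many were caught.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Hypothetical counts for illustration only.
demo = metrics_from_counts(tn=80, fp=0, fn=5, tp=15)
print(demo)  # {'accuracy': 0.95, 'precision': 1.0, 'recall': 0.75}
```

The example mirrors the pattern reported above: when ham dominates the test set, accuracy can stay high even though a noticeable share of spam is missed, so recall is worth reading alongside accuracy.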

# **Conclusion**


This project showed how to classify text with machine learning, using TF-IDF features and a Naive Bayes classifier. The same approach could be applied to real-world tasks such as filtering spam in SMS and email or detecting fraudulent messages.