# Task
Build a model that detects spam SMS messages using the data in "/content/spam.csv".

## Load data

### Subtask:
Load the data from the CSV file into a pandas DataFrame.


**Reasoning**:
Load the data from the CSV file into a pandas DataFrame and display the head and info.



In [25]:
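import pandas as pd

# latin-1 can decode any byte sequence, so the read does not fail on non-UTF-8 characters.
df = pd.read_csv('/content/spam.csv', encoding='latin-1')
print(df.head())
df.info()  # info() prints its own summary and returns None, so no print() is needed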

     v1                                                 v2 Unnamed: 2  \
0   ham  Go until jurong point, crazy.. Available only ...        NaN   
1   ham                      Ok lar... Joking wif u oni...        NaN   
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...        NaN   
3   ham  U dun say so early hor... U c already then say...        NaN   
4   ham  Nah I don't think he goes to usf, he lives aro...        NaN   

  Unnamed: 3 Unnamed: 4  
0        NaN        NaN  
1        NaN        NaN  
2        NaN        NaN  
3        NaN        NaN  
4        NaN        NaN  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   v1          5572 non-null   object
 1   v2          5572 non-null   object
 2   Unnamed: 2  50 non-null     object
 3   Unnamed: 3  12 non-null     object
 4   Unnamed: 4  6 non-null      object
dtypes: object(5)
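
The `info()` output above shows three almost empty `Unnamed` columns. As a minimal sketch, assuming only `v1` and `v2` are needed, the `usecols` parameter of `pd.read_csv` can skip the extra columns at load time. The in-memory CSV below is a hypothetical stand-in for `spam.csv`, not real data from it.

```python
import io
import pandas as pd

# Hypothetical two-row CSV with the same header layout as the raw file:
# two named columns followed by unnamed, empty trailing columns.
raw_csv = io.StringIO(
    "v1,v2,,\n"
    "ham,see you soon,,\n"
    "spam,WIN a prize now,,\n"
)

# usecols keeps only the listed columns, so the unnamed extras never load.
df_small = pd.read_csv(raw_csv, usecols=["v1", "v2"])
print(df_small)
```

Note that this discards whatever the extra columns contain, exactly as the later `drop` call does.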

## Preprocessing

### Subtask:
Clean and preprocess the text data for model training.


**Reasoning**:
Rename columns, drop unnecessary columns, convert label to numerical, remove duplicates, and print verification information.



In [26]:
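# Give the columns descriptive names and drop the mostly empty extras.
df = df.rename(columns={'v1': 'label', 'v2': 'message'})
df = df.drop(columns=['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'])
# Encode the label as 1 for spam and 0 for anything else (ham).
df['label'] = (df['label'] == 'spam').astype(int)
# Remove rows that are exact duplicates in both label and message.
df = df.drop_duplicates()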

print(f"Number of unique values in 'label': {df['label'].nunique()}")
print(f"Number of rows after dropping duplicates: {df.shape[0]}")

Number of unique values in 'label': 2
Number of rows after dropping duplicates: 5169
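
The cleaning steps can be illustrated on a tiny frame. This is a sketch with hypothetical rows (not taken from `spam.csv`), and it swaps in `Series.map` for the label encoding, which turns an unexpected label into `NaN` rather than treating it as ham.

```python
import pandas as pd

# Hypothetical rows in the raw v1/v2 layout; two of them are exact duplicates.
toy = pd.DataFrame({
    "v1": ["ham", "spam", "ham", "spam"],
    "v2": ["see you soon", "WIN a prize now", "see you soon", "WIN a prize now"],
})

toy = toy.rename(columns={"v1": "label", "v2": "message"})

# map() leaves NaN for any label outside the dictionary, so typos are visible.
toy["label"] = toy["label"].map({"ham": 0, "spam": 1})
assert toy["label"].notna().all(), "unexpected label value found"
toy["label"] = toy["label"].astype(int)

# drop_duplicates() keeps the first occurrence of each repeated row.
deduped = toy.drop_duplicates()
print(f"{len(toy)} rows -> {len(deduped)} rows")
```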


## Split data

### Subtask:
Split the data into training and testing sets.


**Reasoning**:
Split the data into training and testing sets using train_test_split.



In [27]:
from sklearn.model_selection import train_test_split

X = df['message']
y = df['label']

# stratify=y keeps the spam/ham ratio the same in the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of y_test: {y_test.shape}")

Shape of X_train: (4135,)
Shape of X_test: (1034,)
Shape of y_train: (4135,)
Shape of y_test: (1034,)
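
To see what `stratify` does, the following sketch splits a hypothetical imbalanced label array (20 spam among 100 messages, invented for illustration) and compares the spam fraction in each part.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 20 spam (1) and 80 ham (0).
y_toy = np.array([1] * 20 + [0] * 80)
X_toy = np.arange(len(y_toy))  # placeholder features, one per label

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratification both parts keep the overall 20% spam share.
print(f"train spam fraction: {y_tr.mean():.2f}")
print(f"test spam fraction:  {y_te.mean():.2f}")
```

Without `stratify`, a random split of a small, imbalanced dataset can leave the test set with noticeably more or less spam than the training set.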


## Feature extraction

### Subtask:
Convert the text data into numerical features using a suitable technique.


**Reasoning**:
Convert the text data into numerical features using TF-IDF vectorization.



In [28]:
from sklearn.feature_extraction.text import TfidfVectorizer

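tfidf_vectorizer = TfidfVectorizer()
# Fit the vocabulary and IDF weights on the training messages only,
# then reuse them for the test set so no test information leaks into training.
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)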

print(f"Shape of X_train_tfidf: {X_train_tfidf.shape}")
print(f"Shape of X_test_tfidf: {X_test_tfidf.shape}")

Shape of X_train_tfidf: (4135, 7591)
Shape of X_test_tfidf: (1034, 7591)
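
Both matrices have 7591 columns because the vocabulary is learned from the training messages alone. A small sketch on hypothetical messages shows the same behaviour: a word that appears only at transform time is ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical messages, invented for illustration.
train_msgs = ["free prize now", "call me now", "see you at lunch"]
test_msgs = ["free lunch tomorrow"]

vec = TfidfVectorizer()
X_tr = vec.fit_transform(train_msgs)  # learns the vocabulary and IDF weights
X_te = vec.transform(test_msgs)       # reuses them, learns nothing new

print(sorted(vec.vocabulary_))
print(X_tr.shape, X_te.shape)
# "tomorrow" was never seen during fit, so it gets no column.
print("tomorrow" in vec.vocabulary_)
```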


## Train model

### Subtask:
Train machine learning models to classify spam messages.


**Reasoning**:
Import logistic regression, multinomial Naive Bayes, and SVM classifiers and train each one on the TF-IDF training features.



In [29]:
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

lr_model = LogisticRegression()
nb_model = MultinomialNB()
svm_model = SVC()

lr_model.fit(X_train_tfidf, y_train)
nb_model.fit(X_train_tfidf, y_train)
svm_model.fit(X_train_tfidf, y_train)
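
As an alternative to keeping the vectorizer and the classifier in separate variables, scikit-learn's `make_pipeline` can bundle them so that raw text goes in and predictions come out. This is a sketch on hypothetical toy messages, not the notebook's own approach. The name `spam_clf` is illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training messages: 1 = spam, 0 = ham.
toy_msgs = [
    "win free prize now",
    "free cash win",
    "see you at lunch",
    "call me later",
]
toy_labels = [1, 1, 0, 0]

# The pipeline fits the vectorizer and the classifier in one call.
spam_clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
spam_clf.fit(toy_msgs, toy_labels)

# predict() applies the fitted TF-IDF transform before classifying.
pred = spam_clf.predict(["win free prize"])
print(pred)
```

Bundling the steps this way makes it harder to fit the vectorizer on test data by accident, because the pipeline only calls `transform` during prediction.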

## Evaluate model

### Subtask:
Evaluate the performance of the trained models using accuracy, precision, recall, and F1-score.


**Reasoning**:
Calculate evaluation metrics for each model and store them as variables.



In [30]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

lr_predictions = lr_model.predict(X_test_tfidf)
nb_predictions = nb_model.predict(X_test_tfidf)
svm_predictions = svm_model.predict(X_test_tfidf)

lr_accuracy = accuracy_score(y_test, lr_predictions)
lr_precision = precision_score(y_test, lr_predictions)
lr_recall = recall_score(y_test, lr_predictions)
lr_f1 = f1_score(y_test, lr_predictions)

nb_accuracy = accuracy_score(y_test, nb_predictions)
nb_precision = precision_score(y_test, nb_predictions)
nb_recall = recall_score(y_test, nb_predictions)
nb_f1 = f1_score(y_test, nb_predictions)

svm_accuracy = accuracy_score(y_test, svm_predictions)
svm_precision = precision_score(y_test, svm_predictions)
svm_recall = recall_score(y_test, svm_predictions)
svm_f1 = f1_score(y_test, svm_predictions)

print("Logistic Regression Metrics:")
print(f"  Accuracy: {lr_accuracy:.4f}")
print(f"  Precision: {lr_precision:.4f}")
print(f"  Recall: {lr_recall:.4f}")
print(f"  F1-score: {lr_f1:.4f}")

print("\nNaive Bayes Metrics:")
print(f"  Accuracy: {nb_accuracy:.4f}")
print(f"  Precision: {nb_precision:.4f}")
print(f"  Recall: {nb_recall:.4f}")
print(f"  F1-score: {nb_f1:.4f}")

print("\nSVM Metrics:")
print(f"  Accuracy: {svm_accuracy:.4f}")
print(f"  Precision: {svm_precision:.4f}")
print(f"  Recall: {svm_recall:.4f}")
print(f"  F1-score: {svm_f1:.4f}")

Logistic Regression Metrics:
  Accuracy: 0.9671
  Precision: 0.9899
  Recall: 0.7481
  F1-score: 0.8522

Naive Bayes Metrics:
  Accuracy: 0.9516
  Precision: 1.0000
  Recall: 0.6183
  F1-score: 0.7642

SVM Metrics:
  Accuracy: 0.9797
  Precision: 0.9911
  Recall: 0.8473
  F1-score: 0.9136
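
The F1-score is the harmonic mean of precision and recall, so a low recall pulls it down even when precision is near 1. The sketch below recomputes F1 from the rounded precision and recall values printed above. The helper `f1_from` is a hypothetical name, and small differences in the last digit are expected because the inputs are already rounded.

```python
def f1_from(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs copied from the printed metrics, rounded to 4 places.
reported = {
    "Logistic Regression": (0.9899, 0.7481),
    "Naive Bayes": (1.0000, 0.6183),
    "SVM": (0.9911, 0.8473),
}

for name, (p, r) in reported.items():
    print(f"{name}: F1 = {f1_from(p, r):.4f}")
```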
