 ## Hi there! ##

I am Ryan Ancheta, and this is a spam classifier that I built. I trained a Naive Bayes model on TF-IDF features using a dataset of spam and ham emails from Kaggle, with 80% of the data used for training and 20% for testing. In the other file, I show how to use the model in a Python application that takes text input from the user and classifies it as spam or ham.
 




In [29]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score
import pandas as pd

In [30]:
df1 = pd.read_csv('spam_and_ham_classification.csv')


In [31]:
df2 = pd.read_csv('spam1.csv')

In [32]:
df2.rename(columns={'v1': 'label', 'v2': 'text'}, inplace=True)

In [33]:
df2

Unnamed: 0,label,text,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,
...,...,...,...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...,,,
5568,ham,Will Ì_ b going to esplanade fr home?,,,
5569,ham,"Pity, * was in mood for that. So...any other s...",,,
5570,ham,The guy did some bitching but I acted like i'd...,,,


In [34]:
df2.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1, inplace=True)

In [35]:
df2

Unnamed: 0,label,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will Ì_ b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


In [36]:
df = pd.concat([df1, df2], axis=0)

In [44]:
df.to_csv('spam-or-ham.csv', index=False)

A key step before training a machine learning model is to take a look at the data. What does it look like, and does it need cleaning? We also need to consider whether any preprocessing is required before fitting the model.
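A minimal sketch of the kind of checks this step usually involves: class counts, missing values, and exact duplicate rows. The `sample` frame and its contents are made up for illustration and are not the real dataset.

```python
import pandas as pd

# Tiny made-up sample standing in for the real dataset (illustrative only).
sample = pd.DataFrame({
    "label": ["ham", "spam", "ham", "ham", None],
    "text": ["see you at lunch", "win a free prize now", "see you at lunch",
             "call me later", "hello"],
})

# How many messages fall into each class (rows with a missing label are skipped).
label_counts = sample["label"].value_counts()

# Number of missing values per column.
missing_per_column = sample.isna().sum()

# Number of rows that exactly repeat an earlier row.
duplicate_rows = int(sample.duplicated().sum())

print(label_counts)
print(missing_per_column)
print(duplicate_rows)
```

On the real data, the same three calls could be run on `df` right after the two files are combined.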

In [37]:
df

Unnamed: 0,label,text
0,ham,into the kingdom of god and those that are ent...
1,spam,there was flow at hpl meter 1505 on april firs...
2,ham,take a look at this one campaign for bvyhprice...
3,spam,somu wrote actually thats what i was looking f...
4,spam,fathi boudra wrote i fixed the issue in the sv...
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will Ì_ b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


Here we can see that the data already looks clean, and the df.info() output below reports no missing values in either column. Each row also carries a spam or ham label, which is what a supervised classification model needs. So in this case, we can use a Naive Bayes model with TF-IDF vectorization as our machine learning model.

In [38]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 15561 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   label   15561 non-null  object
 1   text    15561 non-null  object
dtypes: object(2)
memory usage: 364.7+ KB
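The `Index: 15561 entries, 0 to 5571` line in the output above suggests that the combined frame kept the row labels of the two source files, so some labels repeat. As far as I can tell this does not affect the training steps in this notebook, which work by position, but it can surprise label-based lookups such as `df.loc[0]`. A small sketch of the difference, using two made-up frames (`part_a` and `part_b` are hypothetical stand-ins for `df1` and `df2`):

```python
import pandas as pd

# Two small made-up frames standing in for df1 and df2 (illustrative only).
part_a = pd.DataFrame({"label": ["ham", "spam"], "text": ["hi there", "free prize"]})
part_b = pd.DataFrame({"label": ["ham", "ham"], "text": ["ok", "see you soon"]})

# Default concat keeps each frame's own row labels, so 0 and 1 appear twice.
stacked = pd.concat([part_a, part_b], axis=0)

# ignore_index=True renumbers the rows 0..n-1 in the combined frame.
renumbered = pd.concat([part_a, part_b], axis=0, ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```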


# Now let's dive into training the model. #


First, we need to initialize a TfidfVectorizer with English stop words to transform our text data into TF-IDF features. This vectorizer is important for our model because it converts the text in the dataset into numerical features that the model algorithm can process. These vectors capture how important each word is to a message relative to the rest of the dataset.


After initializing the vectorizer, let's fit and transform our text column into numerical features, assigning the result to the X variable and the labels to the y variable.

In [39]:
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(df['text'])
y = df['label']
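To make the idea of "importance" concrete, here is a hand-rolled sketch of TF-IDF weighting in plain Python. It uses the smoothed IDF formula and unit-length (L2) scaling that scikit-learn documents as the TfidfVectorizer defaults, but it is a simplification: it skips stop-word removal and scikit-learn's own tokenizer, and the three documents are made up.

```python
import math
from collections import Counter

# Made-up mini corpus (illustrative only), already lowercased.
docs = [
    "free prize call now",
    "call me now",
    "lunch now",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each word appears.
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def smoothed_idf(word):
    # idf = ln((1 + n) / (1 + df)) + 1, so rarer words get larger values.
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1

def tfidf_vector(tokens):
    # Term count times IDF for every word in one document.
    counts = Counter(tokens)
    raw = {word: count * smoothed_idf(word) for word, count in counts.items()}
    # Scale the vector to unit length (L2 norm).
    norm = math.sqrt(sum(weight ** 2 for weight in raw.values()))
    return {word: weight / norm for word, weight in raw.items()}

first = tfidf_vector(tokenized[0])
print(first)
```

In the first document, "free" appears in only one message and so gets a larger weight than "now", which appears in all three.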

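I will split the data into 80% training and 20% testing so that the model is evaluated on messages it did not see during training. This gives an honest estimate of how well it generalizes and helps reveal overfitting. Note that the vectorizer above was fitted on the full dataset before the split, so its vocabulary and IDF weights have already seen the test text.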

In [40]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
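One way to keep the test text completely out of the fitting step is to split the raw text first and wrap the vectorizer and the classifier in a scikit-learn pipeline. This is a sketch of that alternative, not what this notebook does; the messages are made up, and `stratify` is an extra option I am assuming is wanted to keep the spam/ham ratio the same in both parts.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up messages standing in for df['text'] and df['label'] (illustrative only).
texts = [
    "win a free prize now", "free entry to win cash", "claim your free reward",
    "urgent prize waiting call now", "cheap loans apply today",
    "are we still on for lunch", "see you at the meeting", "can you call me later",
    "thanks for the notes", "happy birthday have a great day",
]
labels = ["spam"] * 5 + ["ham"] * 5

# Split the raw text first; stratify keeps the spam/ham ratio in both parts.
text_train, text_test, label_train, label_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# The pipeline fits the vectorizer on the training text only.
pipeline = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
pipeline.fit(text_train, label_train)

# The fitted vocabulary is reused when the test text is transformed.
predictions = pipeline.predict(text_test)
print(len(text_train), len(text_test), list(predictions))
```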

After splitting the data, let's train our model using the Naive Bayes algorithm. We will initialize a MultinomialNB model and fit it to our training data.

In [41]:
model = MultinomialNB()
model.fit(X_train, y_train)

Now that we have trained our model, let's make predictions. Note that we pass X_test (the TF-IDF features of the held-out text) to the model because we want to classify each of those messages as spam or ham.

In [42]:
y_pred = model.predict(X_test)
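The same idea applies to a single new message: it has to go through the already fitted vectorizer with `transform`, not `fit_transform`, so that it is mapped onto the vocabulary the model was trained on. A self-contained sketch with made-up training messages; `classify_message`, `demo_vectorizer`, and `demo_model` are hypothetical names for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training messages (illustrative only).
train_texts = [
    "win a free prize now", "free cash prize claim today",
    "see you at lunch tomorrow", "meeting notes attached thanks",
]
train_labels = ["spam", "spam", "ham", "ham"]

demo_vectorizer = TfidfVectorizer(stop_words="english")
demo_model = MultinomialNB()
demo_model.fit(demo_vectorizer.fit_transform(train_texts), train_labels)

def classify_message(text):
    """Return the predicted label for one raw message (hypothetical helper)."""
    # transform, not fit_transform: reuse the vocabulary learned during training.
    features = demo_vectorizer.transform([text])
    return demo_model.predict(features)[0]

result = classify_message("claim your free prize")
print(result)
```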

In [43]:
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
print(accuracy)
print(report)

0.9572759396080951
              precision    recall  f1-score   support

         ham       0.95      0.99      0.97      2033
        spam       0.98      0.90      0.94      1080

    accuracy                           0.96      3113
   macro avg       0.96      0.94      0.95      3113
weighted avg       0.96      0.96      0.96      3113



# Let's evaluate our model #


## Accuracy ##
Our model achieved an accuracy of 95.73%, which means it classified 95.73% of the messages in the test set correctly. This is a strong score: always predicting ham, the majority class, would reach only about 65.3% (2033 of 3113) on this test set.
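A quick sketch of the arithmetic behind that comparison, using only the support counts and the accuracy printed in the report above. The error count is recovered by rounding, so treat it as approximate.

```python
# Per-class support and accuracy copied from the printed output.
support = {"ham": 2033, "spam": 1080}
reported_accuracy = 0.9572759396080951

total = sum(support.values())

# Accuracy of a model that always predicts the most common class.
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / total

# Approximate number of correct and incorrect test predictions.
correct = round(reported_accuracy * total)
errors = total - correct

print(majority_class, round(baseline_accuracy, 4), correct, errors)
```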

## Precision, Recall, F1-score, and Support for Each Class ##
For the Ham Class: 
-Precision: 0.95, meaning 95% of the messages predicted as ham were actually ham.
-Recall: 0.99, meaning 99% of the actual ham emails were correctly identified as ham.
-F1-score: 0.97. The F1-score is the harmonic mean of precision and recall, so it accounts for both false positives and false negatives; 0.97 indicates strong performance on this class.
-Support: 2033, meaning there are 2033 ham emails in the test set.

For the Spam Class:
-Precision: 0.98, meaning 98% of the messages predicted as spam were actually spam.
-Recall: 0.90, meaning 90% of the actual spam emails were correctly identified as spam, so about 10% of spam was misclassified as ham.
-F1-score: 0.94. Balancing false positives and false negatives, an F1-score of 0.94 indicates good performance on this class, though lower than for ham.
-Support: 1080, meaning there are 1080 spam emails in the test set.
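As a check on the per-class numbers, the F1-score can be recomputed from precision and recall as their harmonic mean. A small sketch using the rounded values printed in the report; `f1_score_from` is a hypothetical helper name.

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall as printed (rounded) in the classification report.
ham_f1 = f1_score_from(0.95, 0.99)
spam_f1 = f1_score_from(0.98, 0.90)

print(round(ham_f1, 2), round(spam_f1, 2))
```

Both values round to the F1-scores shown in the report (0.97 for ham and 0.94 for spam).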

## Overall Metrics ##
Accuracy: 0.96 
Macro Average: 0.96 precision, 0.94 recall, 0.95 F1-score. The macro avg is the unweighted mean of each metric across the two classes.
-A macro average F1-score of 0.95 indicates that the model did well on both classes, although spam recall (0.90) is noticeably lower than ham recall (0.99).
Weighted Average: 0.96 for precision, recall, and F1-score. The weighted avg averages each metric across the classes, weighting each class by its support.
-A weighted average of 0.96, close to the macro average, suggests that the model did well overall and that its performance is not significantly influenced by the imbalance between the spam and ham classes.
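The difference between the two averages is easiest to see by recomputing them. This sketch uses the rounded per-class values from the report, so the results can differ from the printed averages in the last digit; `per_class` is a hypothetical name.

```python
# Per-class values as printed (rounded) in the classification report.
per_class = {
    "ham": {"precision": 0.95, "recall": 0.99, "f1": 0.97, "support": 2033},
    "spam": {"precision": 0.98, "recall": 0.90, "f1": 0.94, "support": 1080},
}
total_support = sum(row["support"] for row in per_class.values())

def macro_average(metric):
    # Every class counts equally.
    return sum(row[metric] for row in per_class.values()) / len(per_class)

def weighted_average(metric):
    # Each class counts in proportion to its support.
    return sum(row[metric] * row["support"] for row in per_class.values()) / total_support

for metric in ("precision", "recall", "f1"):
    print(metric, round(macro_average(metric), 3), round(weighted_average(metric), 3))
```

Because ham has almost twice the support of spam, the weighted averages sit closer to the ham values than the macro averages do.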



