<a href="https://colab.research.google.com/github/Lathakh/Message_Spam_classifier_NLP_Project/blob/main/spam_filter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Data collection**

In [1]:
import numpy as np

In [2]:
import pandas as pd

In [7]:
# the file is tab-separated and has no header row, so column names are supplied here
data = pd.read_csv("/SMSSpamCollection", sep='\t', names=["label", "Messages"])

In [8]:
data

| | label | Messages |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 5567 | spam | This is the 2nd time we have tried 2 contact u... |
| 5568 | ham | Will ü b going to esplanade fr home? |
| 5569 | ham | Pity, * was in mood for that. So...any other s... |
| 5570 | ham | The guy did some bitching but I acted like i'd... |
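
Since the labels are not evenly distributed, it can help to count them right after loading. The following is a minimal sketch, assuming a few made-up rows in the same tab-separated, header-less layout; the names `raw`, `sample`, and `label_counts` are illustrative and not part of the original notebook.

```python
import io

import pandas as pd

# A few made-up rows in the same tab-separated, header-less layout (illustration only).
raw = "ham\tsee you at lunch\nspam\tclaim your free prize now\nham\tcall me later\n"

# Same read_csv arguments as the notebook, but reading from an in-memory string.
sample = pd.read_csv(io.StringIO(raw), sep="\t", names=["label", "Messages"])

# Count how many messages fall under each label.
label_counts = sample["label"].value_counts()
print(label_counts)
```

On the real dataset the same call would be `data["label"].value_counts()`.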


# **Text-Preprocessing**

In [9]:
import nltk

In [10]:
# download only the NLTK resources this notebook uses, instead of fetching
# the whole "all" collection through the interactive downloader
# (newer NLTK releases may also need "punkt_tab" and "averaged_perceptron_tagger_eng")
for resource in ["punkt", "averaged_perceptron_tagger", "wordnet", "stopwords"]:
    nltk.download(resource)

In [11]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk import pos_tag, word_tokenize
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import svm
from sklearn.metrics import confusion_matrix

# **Lemmatizing**

In [13]:
lemmatizer = WordNetLemmatizer()
# use a distinct name so the imported `stopwords` module is not shadowed
stop_words = set(stopwords.words("english"))

In [14]:
def review_messages(msg):
    # convert the message to lowercase
    msg = msg.lower()
    return msg


In [20]:
def alternative_review_messages(msg):
    # convert the message to lowercase
    msg = msg.lower()

    # tag each token once; NLTK returns Penn Treebank tags, while the WordNet
    # lemmatizer expects 'a', 'n', 'r' or 'v', so the tags are translated first
    tagged = pos_tag(word_tokenize(msg))
    wnpos = ['a' if tag[0] == 'J' else tag[0].lower() if tag[0] in ['N', 'R', 'V'] else 'n'
             for _, tag in tagged]
    lemmas = [lemmatizer.lemmatize(word, wnpos[i]) for i, (word, _) in enumerate(tagged)]
    # remove stopwords and return a string, matching what review_messages returns
    return " ".join(word for word in lemmas if word not in stop_words)
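
The tag translation inside `alternative_review_messages` can be easier to follow as a standalone function. Below is a minimal sketch of the same mapping; the helper name `penn_to_wordnet` is hypothetical and is not defined anywhere in the notebook.

```python
def penn_to_wordnet(penn_tag):
    """Map a Penn Treebank tag (as returned by nltk.pos_tag) to a WordNet POS letter."""
    first = penn_tag[:1]
    if first == "J":
        # adjectives: JJ, JJR, JJS -> 'a'
        return "a"
    if first in ("N", "R", "V"):
        # nouns, adverbs and verbs use the lowercased first letter
        return first.lower()
    # everything else falls back to noun, as in the notebook's list comprehension
    return "n"


for tag in ["JJ", "NNS", "RB", "VBD", "IN"]:
    print(tag, "->", penn_to_wordnet(tag))
```

The noun fallback mirrors the default part of speech that `WordNetLemmatizer.lemmatize` uses when no tag is given.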

Apply the lowercase preprocessing to every message before splitting the dataset into train and test sets.

In [22]:
# Processing text messages
data['Messages'] = data['Messages'].apply(review_messages)

# **Training the vectorizer**

In [24]:
# train/test split: hold out 10% of the messages for testing
X_train, X_test, y_train, y_test = train_test_split(data['Messages'], data['label'], test_size=0.1, random_state=1)

# fit the TF-IDF vocabulary and weights on the training messages only
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)

In [25]:
X_train

5240     gud gud..k, chikku tke care.. sleep well gud nyt
544       4 oclock at mine. just to bash out a flat plan.
2653                        no need for the drug anymore.
1139                                    what * u wearing?
2045                  i can send you a pic if you like :)
                              ...                        
905     we're all getting worried over here, derek and...
5192    oh oh... den muz change plan liao... go back h...
3980    ceri u rebel! sweet dreamz me little buddy!! c...
235     text & meet someone sexy today. u can find a d...
5157                              k k:) sms chat with me.
Name: Messages, Length: 5014, dtype: object

In [27]:
X_test

1078                         yep, by the pretty sculpture
4028        yes, princess. are you going to make me moan?
958                            welp apparently he retired
4642                                              havent.
4674    i forgot 2 ask ü all smth.. there's a card on ...
                              ...                        
3529    you are a £1000 winner or guaranteed caller pr...
5488                              k. i will sent it again
5134      sday only joined.so training we started today:)
5       freemsg hey there darling it's been 3 week's n...
1289                             happy new year to u too!
Name: Messages, Length: 558, dtype: object

# **Training the classifier**

In [28]:
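# training the classifier
# named `clf` so the imported `svm` module is not overwritten
clf = svm.SVC(C=1000)
clf.fit(X_train_vec, y_train)

# **Testing against the test set**

In [29]:
# vectorize the test messages with the already-fitted vectorizer
X_test_vec = vectorizer.transform(X_test)
y_pred = clf.predict(X_test_vec)
# rows are true labels, columns are predicted labels (label order: ham, spam)
print(confusion_matrix(y_test, y_pred))

[[488   1]
 [  5  64]]


# **Testing against new messages**

In [31]:
def pred(msg):
    # vectorize a single raw message and return its predicted label
    msg_vec = vectorizer.transform([msg])
    prediction = clf.predict(msg_vec)
    return prediction[0]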
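
The transform-then-predict steps in `pred` can also be bundled into a single object with a scikit-learn pipeline, so that raw text goes in and a label comes out. This is a minimal sketch, not the notebook's method: the toy messages, their labels, and the name `spam_pipeline` are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Tiny made-up corpus, for illustration only (not taken from the dataset).
toy_messages = [
    "win a free prize now",
    "free entry claim your cash prize",
    "urgent claim your free reward",
    "are we still meeting for lunch",
    "see you at home tonight",
    "can you call me after class",
]
toy_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# The pipeline fits the vectorizer and the classifier together on raw strings.
spam_pipeline = make_pipeline(TfidfVectorizer(), SVC(C=1000))
spam_pipeline.fit(toy_messages, toy_labels)

# predict() vectorizes the new message internally before classifying it.
prediction = spam_pipeline.predict(["claim your free prize"])[0]
print(prediction)
```

A pipeline of this kind keeps the vectorizer and the classifier in sync, which helps if the model is later saved and reloaded.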

In [33]:
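# pick a random row; it may come from the training split, so this is a quick sanity check, not an evaluation
rand_index = np.random.randint(0, len(data))
test_sample = data['Messages'].iloc[rand_index]
print(test_sample)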

sorry that was my uncle. i.ll keep in touch


In [34]:
test_sample

'sorry that was my uncle. i.ll keep in touch'

In [35]:
pred(test_sample)

'ham'

In [36]:
print("message is - "+ str(pred(test_sample)))

message is - ham
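
The confusion matrix printed in the test-set evaluation can be summarized as accuracy plus precision and recall for the spam class. The sketch below copies the four counts from that output and assumes scikit-learn's default layout (rows are true labels, columns are predicted labels, labels sorted as ham then spam); the variable names are illustrative.

```python
# Counts copied from the confusion matrix printed in the test-set evaluation
# (rows = true label, columns = predicted label, label order: ham, spam).
matrix = [[488, 1],
          [5, 64]]

ham_as_ham, ham_as_spam = matrix[0]
spam_as_ham, spam_as_spam = matrix[1]
total = sum(sum(row) for row in matrix)

# Share of all test messages that were classified correctly.
accuracy = (ham_as_ham + spam_as_spam) / total
# Of the messages flagged as spam, how many really were spam.
spam_precision = spam_as_spam / (spam_as_spam + ham_as_spam)
# Of the real spam messages, how many were caught.
spam_recall = spam_as_spam / (spam_as_spam + spam_as_ham)

print(f"accuracy={accuracy:.3f} precision={spam_precision:.3f} recall={spam_recall:.3f}")
```

Recall for spam is the lowest of the three figures here, because 5 of the 69 spam messages in the test set were classified as ham.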
