<h3>Problem statement</h3>
<p>The SMS Spam Collection is a public set of labeled SMS messages collected for mobile phone spam research. Given this data, we need to predict whether a received message is spam or not.
</p>

Data source: https://archive.ics.uci.edu/ml/datasets/sms+spam+collection#

In [1]:
import pandas as pd
import nltk
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

In [2]:
df=pd.read_csv("Downloads/smsspamcollection/SMSSpamCollection", sep='\t', names=["label", "message"])
df.head()

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [3]:
df.shape

(5572, 2)

In [5]:
#converting categorical variable to numerical value
df['label'] = df['label'].map({'spam':1 ,'ham':0})
df.head()

Unnamed: 0,label,message
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


In [6]:
y=df['label']
y.head()

0    0
1    0
2    1
3    0
4    0
Name: label, dtype: int64

<h3>Model built using stemming and BOW methods</h3>

In [4]:
# Data cleaning and preprocessing using stemming and Bag of Words
# (the NLTK 'stopwords' corpus must be available, e.g. via nltk.download('stopwords'))
from nltk.corpus import stopwords
import re
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))  # build once instead of once per word
corpus = []
for message in df['message']:
    review = re.sub('[^a-zA-Z]', ' ', message)  # keep letters only
    review = review.lower().split()
    review = [ps.stem(word) for word in review if word not in stop_words]
    corpus.append(' '.join(review))
    
    
# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=2500)
X = cv.fit_transform(corpus).toarray()

X

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
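
To make the cleaning and counting steps in the cell above concrete, here is a minimal pure-Python sketch of the same idea on a toy corpus. The messages, the tiny `TOY_STOPWORDS` set, and the helper names are hypothetical and exist only for illustration. The sketch skips stemming, and it is not how `CountVectorizer` is implemented (for instance, it has no `max_features` cap).

```python
import re
from collections import Counter

# Hypothetical stand-in for NLTK's English stopword list (illustration only).
TOY_STOPWORDS = {"a", "to", "the", "you", "is", "in"}

def clean(message):
    """Keep letters only, lowercase, split on whitespace, drop stopwords."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", message)
    return [w for w in letters_only.lower().split() if w not in TOY_STOPWORDS]

def bag_of_words(messages):
    """Return (sorted vocabulary, one count vector per message)."""
    tokenized = [clean(m) for m in messages]
    vocabulary = sorted({w for tokens in tokenized for w in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[w] for w in vocabulary])
    return vocabulary, vectors

# Toy messages, not taken from the dataset.
toy_messages = [
    "Free entry to win a prize!!",
    "Ok, see you in class",
    "Win a free FREE prize",
]
vocabulary, vectors = bag_of_words(toy_messages)
print(vocabulary)
for row in vectors:
    print(row)
```

Each row has one column per vocabulary word, so most entries are zero once the vocabulary grows, which is why the array printed above shows only zeros at its corners.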

In [9]:
# Split the data into training (80%) and test (20%) sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
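
The split above is random, so the rows that land in the test set change from run to run unless a seed is fixed (in scikit-learn this is the `random_state` parameter of `train_test_split`). The sketch below illustrates the idea of a seeded 80/20 shuffle split in plain Python. The function name `shuffle_split` and the seed value are hypothetical, and this is not scikit-learn's implementation; the test size is rounded up, which agrees with the 1115 test rows implied by the accuracy scores in this notebook.

```python
import math
import random

def shuffle_split(n_rows, test_size=0.2, seed=0):
    """Shuffle row indices with a fixed seed, then cut off a test portion."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)      # same seed -> same order
    n_test = math.ceil(n_rows * test_size)    # round the test size up
    return indices[n_test:], indices[:n_test] # (train indices, test indices)

# 5572 is the number of rows reported by df.shape above; the seed is arbitrary.
train_idx, test_idx = shuffle_split(5572, test_size=0.2, seed=42)
print(len(train_idx), len(test_idx))
```

Reusing one seed (or one set of indices) for every model means all of them are scored on the same held-out messages.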

In [10]:
#now building a model using naive bayes classifier

from sklearn.naive_bayes import MultinomialNB
spam_detect_model_1 = MultinomialNB().fit(X_train, y_train)

In [11]:
#predicting for test data set using trained model
spam_detect_model_1.predict(X_test)

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [12]:
#calculating accuracy of model
spam_detect_model_1.score(X_test,y_test)

0.9802690582959641
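
For a classifier, `score` returns the mean accuracy: the fraction of test messages whose predicted label equals the true label. The sketch below computes that by hand, along with the four confusion counts, on made-up labels (1 = spam, 0 = ham, as in the mapping above). The label lists and function names are hypothetical and are not results from this notebook.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def confusion_counts(y_true, y_pred, positive=1):
    """Return (true_pos, false_pos, false_neg, true_neg) for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

# Made-up labels: 8 ham messages and 2 spam messages.
toy_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
toy_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]   # one spam message is missed
all_ham = [0] * len(toy_true)               # always predicting "ham"

print(accuracy(toy_true, toy_pred))          # 0.9
print(confusion_counts(toy_true, toy_pred))  # (1, 0, 1, 8)
print(accuracy(toy_true, all_ham))           # 0.8
```

In this toy case, always predicting "ham" already reaches 0.8, so the confusion counts help show how much of a high accuracy comes from actually catching spam.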

<h3>Model built using lemmatization and BOW methods</h3>

In [13]:
# Data cleaning and preprocessing using lemmatization and Bag of Words
# (the NLTK 'wordnet' corpus must be available, e.g. via nltk.download('wordnet'))
WNL = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))  # build once instead of once per word
corpus = []
for message in df['message']:
    review = re.sub('[^a-zA-Z]', ' ', message)  # keep letters only
    review = review.lower().split()
    review = [WNL.lemmatize(word) for word in review if word not in stop_words]
    corpus.append(' '.join(review))
    
    
# Creating the Bag of Words model 
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=2500)
X = cv.fit_transform(corpus).toarray()

X

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [14]:
# Split the data into training (80%) and test (20%) sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

In [15]:
#now building a model using naive bayes classifier

from sklearn.naive_bayes import MultinomialNB
spam_detect_model_2 = MultinomialNB().fit(X_train, y_train)

In [16]:
#predicting for test data set using trained model
spam_detect_model_2.predict(X_test)

array([0, 0, 0, ..., 0, 0, 1], dtype=int64)

In [17]:
#calculating accuracy of model
spam_detect_model_2.score(X_test,y_test)

0.9811659192825112

<h3>Model built using stemming and TF-IDF</h3>

In [18]:
# Data cleaning and preprocessing using stemming and TF-IDF
from nltk.corpus import stopwords
import re
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))  # build once instead of once per word
corpus = []
for message in df['message']:
    review = re.sub('[^a-zA-Z]', ' ', message)  # keep letters only
    review = review.lower().split()
    review = [ps.stem(word) for word in review if word not in stop_words]
    corpus.append(' '.join(review))
    
    
# Creating the TF-IDF model
from sklearn.feature_extraction.text import TfidfVectorizer
cv = TfidfVectorizer(max_features=2500)
X = cv.fit_transform(corpus).toarray()

X

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
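
Unlike raw counts, TF-IDF down-weights words that appear in many messages, which is why the array above holds floats rather than integers. The sketch below works through the weighting on a toy corpus in plain Python, assuming the smoothed formula that `TfidfVectorizer` documents for its default settings, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The toy documents and the function name are hypothetical, and this is an illustration rather than the library's implementation.

```python
import math
from collections import Counter

def tfidf(tokenized_docs):
    """Return (sorted vocabulary, one L2-normalized TF-IDF row per document)."""
    n_docs = len(tokenized_docs)
    vocabulary = sorted({w for doc in tokenized_docs for w in doc})
    # Document frequency: number of documents containing each word.
    doc_freq = Counter(w for doc in tokenized_docs for w in set(doc))
    # Smoothed inverse document frequency (assumed default formula).
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocabulary}
    rows = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        raw = [counts[w] * idf[w] for w in vocabulary]
        norm = math.sqrt(sum(v * v for v in raw))
        rows.append([v / norm if norm else 0.0 for v in raw])
    return vocabulary, rows

# Toy, already-cleaned documents (not taken from the dataset).
toy_docs = [["free", "prize", "win"], ["free", "call"], ["call", "later"]]
vocabulary, rows = tfidf(toy_docs)
print(vocabulary)
for row in rows:
    print([round(v, 3) for v in row])
```

In the first toy document, "prize" occurs in only one document while "free" occurs in two, so "prize" receives the larger weight even though both words appear once.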

In [19]:
# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

#model built using naive bayes classifier
from sklearn.naive_bayes import MultinomialNB
spam_detect_model = MultinomialNB().fit(X_train, y_train)

#predicting for test data set using trained model
spam_detect_model.predict(X_test)

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [20]:
#calculating accuracy of model
spam_detect_model.score(X_test,y_test)

0.9874439461883409

<h3>Model built using lemmatization and TF-IDF</h3>

In [21]:
# Data cleaning and preprocessing using lemmatization and TF-IDF
WNL = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))  # build once instead of once per word
corpus = []
for message in df['message']:
    review = re.sub('[^a-zA-Z]', ' ', message)  # keep letters only
    review = review.lower().split()
    review = [WNL.lemmatize(word) for word in review if word not in stop_words]
    corpus.append(' '.join(review))

    
# Creating the TF-IDF model
from sklearn.feature_extraction.text import TfidfVectorizer
cv = TfidfVectorizer(max_features=2500)
X = cv.fit_transform(corpus).toarray()

X    

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [22]:
# Split the data into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

In [23]:
from sklearn.naive_bayes import MultinomialNB
spam_detect_model = MultinomialNB().fit(X_train, y_train)

In [24]:
#predicting for test data set using trained model
spam_detect_model.predict(X_test)

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [25]:
#calculating accuracy of model
spam_detect_model.score(X_test,y_test)

0.9704035874439462

<h4>Comparing the accuracy of all four models (stemming + BOW: 98.0%, lemmatization + BOW: 98.1%, stemming + TF-IDF: 98.7%, lemmatization + TF-IDF: 97.0%), we can conclude that the model built using stemming and TF-IDF has the highest accuracy, 98.7%, and can be considered the best model for the given problem statement and data.</h4>