<a id="top"></a>
## **Table of Contents ⏩**

* [Spam Classifier](#1)

* [Basic Overview of Dataset 📺](#2)

* [Preprocessing and EDA 📊💹](#3)
  * [Data Cleaning using Regex 🧹](#3.1)
  * [Removing stop words](#3.2)
  * [Word Cloud of non-spam messages ☁](#3.3)
  * [Word Cloud of spam messages ☁](#3.4)
 
 
* [Modelling with Bag of words Method 💰 ](#4)

* [Modelling with TF-IDF method ⏩](#5)
  



[Slide to top](#top)
<a id="1"></a>
## **Spam Classifier 🏛**

![spam or ham](https://analyticsindiamag.com/wp-content/uploads/2020/10/spamimage.jpg)

In [None]:
#Importing necessary pre-processing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df= pd.read_csv("../input/sms-spam-collection-dataset/spam.csv",encoding="latin-1")

[Slide to top](#top)
<a id="2"></a>
## **Basic Overview of dataset 📺**

In [None]:
print(df.columns)

In [None]:
print("** First five rows of the dataset **")
print()

#Dropping unnecessary columns from the dataset
df=df.drop(["Unnamed: 2" , "Unnamed: 3" , "Unnamed: 4"] , axis=1)
df=df.rename(columns={"v1": "Target" , "v2": "Text"})
df.head()

In [None]:
print("** Value Counts of Target **")
print()
print(df['Target'].value_counts())
print()
print("** Basic description of dataset **")
print()
df.describe()


In [None]:
print(" ** Basic Information **")
print()
df.info()

[Slide to Top](#top)

<a id=3></a>
## **Preprocessing and EDA 📊💹**

* The data is imbalanced.

* 86.6 % of the messages are "ham" and the remaining 13.4 % are "spam".
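
With a split like this, plain accuracy can look good even for a useless model. The sketch below uses made-up labels in the quoted proportions (not the real rows) to show what a model that always predicts the majority class would score; the variable names are illustrative only.

```python
# Made-up labels in the proportions quoted above (866 ham and 134 spam per 1000);
# these are not the real dataset rows.
labels = ["ham"] * 866 + ["spam"] * 134

# A trivial "model" that always predicts the most frequent class
majority = max(set(labels), key=labels.count)
predictions = [majority] * len(labels)

# Accuracy looks high, but no spam message is ever caught
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
spam_total = labels.count("spam")
spam_recall = sum(p == y == "spam" for p, y in zip(predictions, labels)) / spam_total

print(majority, accuracy, spam_recall)
```

This is why the classification reports printed later (precision and recall per class) are worth reading alongside the accuracy scores.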

In [None]:
def without_hue(data,feature,ax):
    
    total=float(len(data))
    bars_plot=ax.patches
    
    for bars in bars_plot:
        percentage = '{:.1f}%'.format(100 * bars.get_height()/total)
        x = bars.get_x() + bars.get_width()/2.0
        y = bars.get_height()
        #Label each bar with its percentage and raw count
        ax.text(x, y, '{} ({})'.format(percentage, int(y)), ha='center', fontweight='bold', fontsize=14)

In [None]:
#setting theme
sns.set_theme(context='notebook',style='white',font_scale=3)

#setting the background and foreground color
fig=plt.figure(figsize=(10,5))
ax=plt.axes()
ax.set_facecolor("#F2EDD7FF")
fig.patch.set_color("#F2EDD7FF")

#Dealing with spines
for i in ['left','top','right']:
    ax.spines[i].set_visible(False)
    
ax.grid(linestyle="--",axis='y',color='gray')

#countplot
a=sns.countplot(data=df,x='Target',saturation=3,palette='cool')

without_hue(df,'Target',a)

plt.title("Label Distribution",weight='bold',fontsize=15)

In [None]:
#Adding new feature 'message_length': the number of space-separated tokens in each message
df['message_length']=df['Text'].apply(lambda x: len(x.split(" ")))

In [None]:
df

In [None]:
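#Frequency of each message length, computed separately for ham and spam.
#The counts column is named 'message_length' explicitly, since newer pandas versions name it 'count'.
df_ham=df.loc[df["Target"]=="ham","message_length"].value_counts().rename("message_length").to_frame()
df_spam=df.loc[df["Target"]=="spam","message_length"].value_counts().rename("message_length").to_frame()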

In [None]:
df_ham

* From the figure below, we can conclude that spam messages tend to be longer than ham messages
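
One detail about how "length" is measured here: `message_length` is a token count from `split(" ")`, not a character count. The small sketch below (with a made-up message) shows that splitting on a single space counts empty strings when spaces repeat, whereas `split()` with no argument does not.

```python
# Made-up message with a double space after "Free"
msg = "Free  entry in 2 a wkly comp"

# split(" ") keeps the empty string between the two consecutive spaces
by_single_space = len(msg.split(" "))

# split() collapses any run of whitespace
by_whitespace = len(msg.split())

print(by_single_space, by_whitespace)
```

For this dataset the difference is likely small, but it can shift the counts slightly for messages with irregular spacing.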

In [None]:
fig=plt.figure(figsize=(20,10))
fig.patch.set_color("#F2EDD7FF")

ax=plt.axes()
ax.set_facecolor("#F2EDD7FF")
fig.patch.set_color("#F2EDD7FF")

#Dealing with spines
for i in ['left','top','right']:
    ax.spines[i].set_visible(False)
    
ax.grid(linestyle="--",axis='y',color='gray')



sns.scatterplot(data=df_ham,x=df_ham.index,y=df_ham['message_length'],label="ham")
sns.scatterplot(data=df_spam,x=df_spam.index,y=df_spam['message_length'],label='spam')
plt.xlabel("Message Length",fontsize=15,fontweight='bold')
plt.ylabel("Message Length Frequencies",fontsize=15,fontweight='bold')

[Slide to top](#top)

<a id=3.1></a>
### **Data Cleaning 🧹**
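
As a quick illustration of what the regex steps in this section do, the sketch below applies the same patterns to a single made-up string. The final whitespace collapse is an extra step added here for readability and is not part of the notebook's `text_cleaning` function.

```python
import re
import string

# Made-up example message
sample = "WINNER!! Visit http://example.com or www.example.org [ad] <b>now</b> call 0800123 b4 2nite"

text = sample.lower()
text = re.sub(r"\[.*?\]", "", text)                  # text inside square brackets
text = re.sub(r"https?://\S+|www\.\S+", "", text)    # links
text = re.sub(r"<.*?>+", "", text)                   # HTML-style tags
text = re.sub("[%s]" % re.escape(string.punctuation), "", text)  # punctuation
text = re.sub(r"\w*\d\w*", "", text)                 # words containing digits

# Extra step (assumption, not in the notebook): collapse leftover whitespace
cleaned = " ".join(text.split())
print(cleaned)
```

Note that the digit rule also removes SMS shorthand such as "b4" and "2nite", which may or may not be desirable for spam detection.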

In [None]:
import re
import string

In [None]:
#Using regex functions to clean the text

def text_cleaning(text):
    
    #Converting text into lowercase
    text = str(text).lower()
    
    #Removing text enclosed in square brackets (e.g. "[link]")
    text = re.sub(r'\[.*?\]','',text)
    
    
    #Removing links starting with http://, https:// or www.
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    
    #Removing HTML-style tags such as <b>
    text = re.sub(r'<.*?>+','',text)
    
    #Removing punctuation
    text = re.sub("[%s]" % re.escape(string.punctuation),'',text)
    
    #Removing new lines
    text = re.sub("\n",'',text)
    
    #Removing words that contain digits
    text = re.sub(r'\w*\d\w*','',text)
    
    return(text)
        

In [None]:
#Applying 'text_cleaning' function on the dataset
df['cleaned_text']=df['Text'].apply(text_cleaning)


df.head()

In [None]:
print("***** First five sentences of the cleaned and uncleaned text *****")
print()
for i in range(0,5):
    print("Uncleaned sentence ==>",i+1 , ".", df["Text"][i])
    print("Cleaned sentence ==>",i+1,".", df['cleaned_text'][i])
    print()

[Slide to top](#top)

<a id=3.2></a>
### **Removing Stopwords**
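
The idea can be sketched without NLTK using a tiny hand-written stop-word set. The set and the function name below are illustrative assumptions; the notebook itself uses NLTK's full English list and also lemmatizes each remaining word.

```python
# Tiny hand-written stop-word set, for illustration only;
# the notebook uses NLTK's full English list instead.
stop_words = {"i", "am", "the", "to", "a", "you", "is", "on"}

def drop_stop_words(text):
    # Keep only the words that are not in the stop-word set
    return " ".join(w for w in text.split() if w not in stop_words)

example = drop_stop_words("i am on the way to claim a free prize")
print(example)
```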

In [None]:
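from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

#Requires the NLTK 'stopwords' and 'wordnet' corpora (e.g. nltk.download('stopwords'))
wordnet = WordNetLemmatizer()
#Build the stop-word set once instead of once per word
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    #Drop stop words and lemmatize the remaining words
    text = [wordnet.lemmatize(word) for word in text.split() if word not in stop_words]
    return " ".join(text)

df['cleaned_text']=df['cleaned_text'].apply(remove_stopwords)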

In [None]:
print("***** Dataset after lemmatizing and removing stop words *****")
print()
df.head()

In [None]:
df=df.drop(['Text'],axis=1)

In [None]:
df_ham_cleaned= df[df["Target"]=="ham"]
df_spam_cleaned=df[df["Target"]=="spam"]
df_ham_cleaned=pd.DataFrame(df_ham_cleaned)
df_spam_cleaned=pd.DataFrame(df_spam_cleaned)

In [None]:
df_ham_cleaned

In [None]:
para_ham = " ".join([word for word in df_ham_cleaned['cleaned_text']])
para_spam = " ".join([word for word in df_spam_cleaned['cleaned_text']])

<a id=3.3></a>
### **Word Cloud of non-spam messages ☁**

In [None]:
from wordcloud import WordCloud

wordcloud=WordCloud(width=2000,height=1000,background_color='#F2EDD7FF').generate(para_ham)

plt.figure(figsize=(20,30))
plt.imshow(wordcloud)
plt.title("Non_Spam Messages")
plt.show()

<a id=3.4></a>
### **Word Cloud of spam messages ☁**

In [None]:
from wordcloud import WordCloud

wordcloud=WordCloud(width=2000,height=1000,background_color='#F2EDD7FF').generate(para_spam)

plt.figure(figsize=(20,30))
plt.imshow(wordcloud)
plt.title("Spam Messages")
plt.show()

[Slide to Top](#top)
<a id=4></a>
## **Modelling with Bag of Words Method 💰**
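
In a bag-of-words representation each message becomes a vector of word counts over a fixed vocabulary. The pure-Python sketch below is a simplified stand-in for what `CountVectorizer` does (its real tokenizer differs, for example it ignores single-character tokens by default); the documents and helper name are made up.

```python
from collections import Counter

# Made-up training and test messages
train_docs = ["free prize call now", "call me later", "free free entry"]
test_doc = "call now for free cash"

# "fit": the vocabulary comes from the training documents only
vocab = sorted({w for doc in train_docs for w in doc.split()})

def to_counts(doc):
    # "transform": count each vocabulary word; unseen words are dropped
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

train_matrix = [to_counts(d) for d in train_docs]
test_vector = to_counts(test_doc)

print(vocab)
print(test_vector)
```

Fitting the vocabulary on the training split only, as the notebook does, means words that appear only in the test set ("for" and "cash" here) contribute nothing to the test vectors.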

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

In [None]:
le=LabelEncoder()
df['Target']=le.fit_transform(df["Target"])
df.head()

In [None]:
x=df['cleaned_text']
y=df['Target']
x_train,x_test,y_train,y_test=train_test_split(x,y,random_state=42)

In [None]:
print("*** Size of x_train ***",x_train.shape)
print("*** Size of y_train ***",y_train.shape)
print("*** Size of x_test *** ",x_test.shape)
print("*** Size of y_test *** ",y_test.shape)

In [None]:
cv=CountVectorizer()
vect=cv.fit(x_train)
x_train_vector=vect.transform(x_train)
x_test_vector=vect.transform(x_test)

In [None]:
print("**** Shape of training dataset after vectorization ****" , x_train_vector.shape)
print("**** Shape of test dataset after vectorization ****" , x_test_vector.shape)

**Using machine learning algorithms**
* Naive Bayes
* Decision Tree Classifier
* SVM
* Random Forest Classifier
* XGBoost
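
Of the algorithms listed, multinomial Naive Bayes is the simplest to write out by hand. The sketch below is a minimal pure-Python version with add-one (Laplace) smoothing, trained on a handful of made-up messages; it is meant to show the idea, not to reproduce `MultinomialNB` exactly.

```python
import math
from collections import Counter, defaultdict

# Made-up labelled messages
train = [
    ("free prize call now", "spam"),
    ("win free cash", "spam"),
    ("call me later", "ham"),
    ("see you at lunch", "ham"),
    ("are you home now", "ham"),
]

vocab = {w for text, _ in train for w in text.split()}
class_docs = Counter(label for _, label in train)

# Word counts per class
word_counts = defaultdict(Counter)
for text, label in train:
    word_counts[label].update(text.split())

def predict(text, alpha=1.0):
    scores = {}
    for label in class_docs:
        total = sum(word_counts[label].values())
        # log prior + sum of smoothed log likelihoods
        score = math.log(class_docs[label] / len(train))
        for w in text.split():
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][w] + alpha) / (total + alpha * len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("free cash now"), predict("see you later"))
```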

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
import xgboost as xgb
from sklearn import svm
from sklearn.metrics import accuracy_score , classification_report

In [None]:
lis=[]
def modelling(dic):
    for models in dic:
        print("**** Training with", models , "algorithm ****")
        dic[models].fit(x_train_vector,y_train)
        print("**** Predicting with",models , "algorithm ****")
        print("......")
        pred=dic[models].predict(x_test_vector)
        print()
        print("**** Getting Accuracy of" , models , "algorithm ****")
        print("......")
        print(accuracy_score(y_test,pred))
        lis.append(accuracy_score(y_test,pred))
        print("......")
        print("**** Getting Classification report of", models , "algorithm ****")
        print()
        print(classification_report(y_test,pred))
        print("----------------------------------------------------------------")
        print()
        
        

In [None]:
dic={"Naive Bayes": MultinomialNB(),"Decision Tree": DecisionTreeClassifier(random_state=42),"SVM":svm.SVC(),
     "Random Forest":RandomForestClassifier(n_estimators=200,random_state=42),"XGB":xgb.XGBClassifier(n_estimators=80),
      }

In [None]:
modelling(dic)

In [None]:
models_dataframe=pd.DataFrame({
    "Models":["Naive Bayes" , "Decision Tree" , "SVM" , "Random Forest" , "XGBoost"] ,
    "Accuracy_score":[i for i in lis]
})

In [None]:
models_dataframe

In [None]:
plt.figure(figsize=(20,10))
sns.barplot(y=models_dataframe['Models'],x=models_dataframe['Accuracy_score'],palette='rocket')
plt.title("Accuracy of models with BOW method")
plt.show()

[Slide to Top](#top)
<a id=5></a>
## **Modelling with TF-IDF Method ⏩**
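
TF-IDF reweights the raw counts so that words appearing in many messages count for less. The sketch below computes it by hand on made-up documents, assuming the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which is what `TfidfVectorizer` uses with its default settings; tokenization is simplified to whitespace splitting.

```python
import math
from collections import Counter

# Made-up documents
docs = ["free prize call", "call me later", "free free entry"]
vocab = sorted({w for d in docs for w in d.split()})
n = len(docs)

# Document frequency: in how many documents each word appears
doc_freq = {w: sum(w in d.split() for d in docs) for w in vocab}

# Smoothed inverse document frequency
idf = {w: math.log((1 + n) / (1 + doc_freq[w])) + 1 for w in vocab}

def tfidf(doc):
    counts = Counter(doc.split())
    raw = [counts[w] * idf[w] for w in vocab]
    # L2-normalise so every document vector has unit length
    norm = math.sqrt(sum(v * v for v in raw)) or 1.0
    return [v / norm for v in raw]

vec = tfidf(docs[0])
print(dict(zip(vocab, (round(v, 3) for v in vec))))
```

In the first document, "prize" (which appears in one document) ends up with a higher weight than "free" (which appears in two), even though both occur once in that message.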

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

#Using TF-IDF method
tf=TfidfVectorizer()
tf_vect=tf.fit(x_train)
x_train_vector_tf= tf_vect.transform(x_train)
x_test_vector_tf=tf_vect.transform(x_test)

print("**** Shape of training dataset after vectorization ****" , x_train_vector_tf.shape)
print("**** Shape of test dataset after vectorization ****" , x_test_vector_tf.shape)

In [None]:
lis_tf=[]
def modelling_tf(dic):
    for models in dic:
        print("**** Training with", models , "algorithm ****")
        dic[models].fit(x_train_vector_tf,y_train)
        print("**** Predicting with",models , "algorithm ****")
        print("......")
        pred=dic[models].predict(x_test_vector_tf)
        print()
        print("**** Getting Accuracy of" , models , "algorithm ****")
        print("......")
        print(accuracy_score(y_test,pred))
        lis_tf.append(accuracy_score(y_test,pred))
        print("......")
        print("**** Getting Classification report of", models , "algorithm ****")
        print()
        print(classification_report(y_test,pred))
        print("----------------------------------------------------------------")
        print()
        
dic={"Naive Bayes": MultinomialNB(),"Decision Tree": DecisionTreeClassifier(random_state=42),"SVM":svm.SVC(),
     "Random Forest":RandomForestClassifier(n_estimators=200,random_state=42),"XGB":xgb.XGBClassifier(n_estimators=80),
      }



In [None]:
modelling_tf(dic)

In [None]:
models_dataframe=pd.DataFrame({
    "Models":["Naive Bayes" , "Decision Tree" , "SVM" , "Random Forest" , "XGBoost"] ,
    "Accuracy_score":[i for i in lis_tf]
})

models_dataframe

In [None]:
plt.figure(figsize=(20,10))
sns.barplot(y=models_dataframe['Models'],x=models_dataframe['Accuracy_score'],palette='rocket_r')
plt.title("Accuracy of models with TF-IDF method")
plt.show()

**THANK YOU FOR BEING PATIENT AND SCROLLING ALL THE WAY DOWN THIS NOTEBOOK**

**If you like my work, please give it an upvote. Any feedback is appreciated.**

**I am very new to NLP and getting my hands dirty with the basics. I will come up with another notebook containing word embeddings and a deep learning implementation of the "Spam Classifier". STAY TUNED 😉**

**Made with LOVE❤**