<a href="https://colab.research.google.com/github/JeetChauhan17/Spam-Ham-Classifier/blob/main/Spam_Ham_Classification_Model_Made_By_Jeet_Chauhan2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Spam/Ham Classification - Made By Jeet S. Chauhan

This is a spam/ham classification model built in Python with the help of several ML libraries: scikit-learn, NumPy, pandas and NLTK. The body length distribution is plotted at the end with Matplotlib.

This project covers several NLP topics: tokenization, stopword removal, stemming, lemmatization, vectorization and sparse matrices, all of which feed into the final classifier.

The project works by reducing each message to lowercase keywords with punctuation removed, and using those keywords to train a model that can classify new, unseen messages as spam or ham.
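
As a rough illustration of that flow, the sketch below vectorizes a handful of made-up messages and fits a classifier on them. The toy messages, the `MultinomialNB` classifier and the variable names are assumptions chosen for brevity; the notebook itself does not specify which model is trained.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training messages (not taken from the SMS dataset)
train_messages = [
    "Win a free prize now",
    "Free entry, claim your prize",
    "Are we still meeting for lunch?",
    "I'll call you later tonight",
]
train_labels = ["spam", "spam", "ham", "ham"]

# Turn each message into a vector of word counts
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_messages)

# Fit a simple Naive Bayes classifier on the counts (assumed model)
model = MultinomialNB()
model.fit(X_train, train_labels)

# Classify a new, unseen message using the same vocabulary
new_message = ["Claim your free prize"]
prediction = model.predict(vectorizer.transform(new_message))
print(prediction[0])
```

The important detail is that the new message is passed through `transform`, not `fit_transform`, so it is encoded with the vocabulary learned from the training messages.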

NLTK (Natural Language Toolkit) is a widely used open-source Python library for natural language processing tasks.

# Importing Libraries - NLTK, Pandas, Numpy :





In [None]:
# !pip install nltk
!pip install -U scikit-learn
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
import pandas as pd
import numpy as np
import warnings
import sklearn
import matplotlib.pyplot as plt


# Loading The Dataset

In [None]:
datas = pd.read_csv("SMSSpamCollection.tsv", sep="\t", header=None)
datas.columns =['label', 'body_text']
datas.head()


# Checking the Contents using column names and index.

In [None]:
datas['label'][0], datas['body_text'][0]

In [None]:
datas['body_text'][1]

# Shape Of Data :

In [None]:
print("The Dataset has {} Rows and {} Columns".format(len(datas), len(datas.columns)))

# Number Of Spam and Ham Data :

In [None]:
print("There are {} spam messages and {} ham messages out of {} messages in total.".format(len(datas[datas['label']=="spam"]), len(datas[datas['label']=="ham"]),len(datas)))

Number of Missing Data :

In [None]:
print("There are {} missing values in the 'label' column.".format(datas['label'].isnull().sum()))
print("There are {} missing values in the 'body_text' column.".format(datas['body_text'].isnull().sum()))

# Preprocessing Data - Cleaning Up Data :


### Removing Punctuation from body text :

In [None]:
import string

def rem_punct(text):
  nopunct_text = "".join([char for char in text
                          if char not in string.punctuation])
  return nopunct_text



In [None]:
datas['body_clean'] = datas['body_text'].apply(lambda x:rem_punct(x))
datas.head()
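
To see what this step does to a single message, here is a small sketch on a made-up string. It also shows `str.translate` as an alternative that usually runs faster on long texts; the sample message and the names `table` and `cleaned_fast` are illustrative assumptions.

```python
import string

def rem_punct(text):
    # Keep every character that is not ASCII punctuation
    return "".join([char for char in text if char not in string.punctuation])

# Made-up message for illustration
sample = "Free entry!!! Text WIN to 80082, now."
cleaned = rem_punct(sample)
print(cleaned)

# Alternative: a translation table that deletes all punctuation characters
table = str.maketrans("", "", string.punctuation)
cleaned_fast = sample.translate(table)
print(cleaned_fast == cleaned)
```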

### Tokenization - Splitting sentences into tokens or keywords :

In [None]:
import re

def tokenize(text):
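  # Raw string avoids an invalid-escape warning; \W+ treats a run of separators as one split point
  tokens = re.split(r'\W+', text)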
  return tokens

datas['tokenized_text'] = datas['body_clean'].apply(lambda x:tokenize(x.lower()))
datas.head()
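
The choice of split pattern affects the tokens that come out. As a small sketch on a made-up string, compare splitting on a single non-word character with splitting on a run of them:

```python
import re

# Made-up text with a double space, as can remain after punctuation removal
text = "free  entry now"

# Splitting on a single non-word character leaves an empty token
single = re.split(r"\W", text)
print(single)

# Splitting on runs of non-word characters avoids it
runs = re.split(r"\W+", text)
print(runs)
```

Empty tokens are not stopwords, so they would otherwise survive stopword removal and end up as a feature of their own.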

### Removing Stopwords - Removing common words that carry little meaning, such as "the" and "but".

In [None]:
stopwrds = nltk.corpus.stopwords.words('english')

def rem_stopwrds(tokenized_text):
  text = [word for word in tokenized_text if word not in stopwrds]

  return text

In [None]:
datas['no_stop'] = datas['tokenized_text'].apply(lambda x:rem_stopwrds(x))
datas.head()
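
The filtering itself can be tried without downloading anything. The sketch below uses a hypothetical five-word stopword set as a stand-in for NLTK's English list; `sample_stopwords` and `remove_stopwords` are illustrative names, not part of the notebook.

```python
# Hypothetical mini stopword set; the notebook uses NLTK's English list instead
sample_stopwords = {"i", "have", "been", "for", "the"}

def remove_stopwords(tokens, stopwords):
    # Keep only tokens that are not in the stopword collection
    return [word for word in tokens if word not in stopwords]

tokens = ["i", "have", "been", "searching", "for", "the", "right", "words"]
kept = remove_stopwords(tokens, sample_stopwords)
print(kept)
```

Membership tests on a set take constant time on average, so converting the NLTK list with `set(...)` can speed this step up on larger datasets.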

### Stemming - Reducing inflected or derived words to their stem or root :

In [None]:
ps=nltk.PorterStemmer()

def stemming(tokenized_text):
    text=[ps.stem(word) for word in tokenized_text]
    return text

datas['stemmed_text']=datas['no_stop'].apply(lambda x:stemming(x))

datas.head()
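
A quick way to see the effect is to stem a few inflected forms of one verb. This sketch uses the same Porter stemmer from NLTK; the word list is made up for illustration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Inflected forms of the same verb
words = ["run", "runs", "running"]
stems = [stemmer.stem(word) for word in words]
print(stems)
```

All three forms collapse to one stem, so the vectorizer later counts them as a single feature.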

### Lemmatization - Grouping together the inflected forms of a word so they can be analysed as a single term, the word's lemma.

In [None]:
wnl = nltk.WordNetLemmatizer()

def lemmatizing(tokenized_text):
  text = [wnl.lemmatize(word) for word in tokenized_text]
  return text


In [None]:
datas['lemmatized_text'] = datas['no_stop'].apply(lambda x:lemmatizing(x))

datas.head()

### Vectorization - Process of encoding text as integers to create feature vectors.

### Count Vectorization - Creates a document-term matrix in which each cell holds the number of times a word occurred in a document :




In [None]:
from sklearn.feature_extraction.text import CountVectorizer

def clean_text(text):
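  # Lowercase and drop punctuation, character by character
  text = "".join([char.lower() for char in text if char not in string.punctuation])
  # Split on runs of non-word characters (raw string avoids an invalid-escape warning)
  tokens = re.split(r'\W+', text)
  # Stem each token, skipping stopwords
  text = [ps.stem(word) for word in tokens if word not in stopwrds]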
  return text

count_vect = CountVectorizer(analyzer = clean_text)
X_count = count_vect.fit_transform(datas['body_text'])

print(X_count.shape)

Applying Count Vectorization to small sample

In [None]:
data_sample = datas[0:20]

count_vect_sample = CountVectorizer(analyzer=clean_text)
X_count_sample = count_vect_sample.fit_transform(data_sample['body_text'])

print(X_count_sample.shape)

### Sparse Matrix - A matrix in which most entries are zero. To be efficient, it stores only the non-zero entries.

In [None]:
X_count_sample

In [None]:
X_count_df = pd.DataFrame(X_count_sample.toarray())
X_count_df

In [None]:
import warnings
warnings.filterwarnings("ignore")

X_count_df.columns= count_vect_sample.get_feature_names_out()
X_count_df
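
To make the storage idea concrete, the sketch below converts a small made-up dense matrix to SciPy's CSR format, the format `CountVectorizer` returns, and inspects what is actually kept. The values are illustrative only.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A small, mostly-zero count matrix (made-up values)
dense = np.array([
    [0, 0, 3, 0, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [2, 0, 0, 0, 0, 0],
])

sparse = csr_matrix(dense)

# Only the non-zero values and their column positions are stored
print(sparse.nnz)
print(sparse.data)
print(sparse.indices)

# Fraction of cells that are non-zero
density = sparse.nnz / (dense.shape[0] * dense.shape[1])
print(round(density, 3))
```

Calling `toarray()` rebuilds the full dense matrix, which is fine for a 20-message sample but can use a lot of memory on the whole dataset.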

### TF/IDF (Term Frequency, Inverse Document Frequency) - Creates a document-term matrix in which columns represent unigrams and each cell holds a weight reflecting how important the word is to the document.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(analyzer = clean_text)
X_tfidf = tfidf_vect.fit_transform(datas['body_text'])

print(X_tfidf.shape)


Applying TfidfVectorizer to a small sample :



In [None]:
data_sample = datas[0:20]

tfidf_vect_sample = TfidfVectorizer(analyzer = clean_text)
X_tfidf_sample = tfidf_vect_sample.fit_transform(data_sample['body_text'])

print(X_tfidf_sample.shape)
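
To show where the weights come from, the sketch below recomputes them by hand and compares the result with `TfidfVectorizer`. It assumes scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`), under which idf(t) = ln((1 + n) / (1 + df(t))) + 1 and each row is then scaled to unit length. The documents are made up and the default analyzer is used instead of `clean_text`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up documents for illustration
docs = ["free prize now", "call me now", "free free call"]

# Raw term counts per document
counts = CountVectorizer().fit_transform(docs).toarray()
n_docs = counts.shape[0]

# Document frequency: in how many documents each term appears
df = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (scikit-learn's default form)
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the counts, then scale each row to unit L2 norm
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

from_sklearn = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, from_sklearn))
```

Terms that occur in fewer documents get a larger idf, so they weigh more than words that appear almost everywhere.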

# Feature Engineering - Feature Creation :

In [None]:
datas=pd.read_csv("SMSSpamCollection.tsv",sep="\t",header=None)

datas.columns=['label','body_text']

datas.head()

## Feature Creation - Text Message Length :


In [None]:
datas['body_len']=datas["body_text"].apply(lambda x:len(x)-x.count(" "))

datas.head()

## Feature Creation - Percentage of Punctuation :

In [None]:
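def count_punct(text):
    # Percentage of non-space characters that are punctuation
    count=sum([1 for char in text if char in string.punctuation])
    non_space=len(text)-text.count(" ")
    # Guard against messages that contain only spaces
    return round(100*count/non_space,1) if non_space else 0.0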

datas['punct%']=datas['body_text'].apply(lambda x:count_punct(x))

datas.head()
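
As a worked example of both features, the sketch below computes them step by step for one made-up message. The variable names are illustrative.

```python
import string

# Made-up message for illustration
text = "Hi, are you free?"

# Length without spaces
non_space = len(text) - text.count(" ")
print(non_space)

# Number of punctuation characters
punct = sum(1 for char in text if char in string.punctuation)
print(punct)

# Punctuation as a percentage of non-space characters
punct_pct = round(100 * punct / non_space, 1)
print(punct_pct)
```

The message has 17 characters, 3 of them spaces, so the body length is 14. Two of those 14 characters are punctuation, which gives about 14.3%.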

# Plotting :

In [None]:

bins=np.linspace(0,200,40)

plt.hist(datas['body_len'],bins)
plt.title('Body Length Distribution')
plt.show()
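
A single histogram shows the overall distribution but not whether the feature separates the two classes. One possible follow-up, sketched here on a made-up DataFrame that reuses the notebook's column names, is to summarize `body_len` per label. The values are invented and say nothing about the real dataset.

```python
import pandas as pd

# Made-up rows using the notebook's column names
toy = pd.DataFrame({
    "label": ["spam", "spam", "ham", "ham", "ham"],
    "body_len": [120, 140, 30, 45, 60],
})

# Mean body length for each class
mean_len = toy.groupby("label")["body_len"].mean()
print(mean_len)
```

On the real data, the same comparison can be drawn by calling `plt.hist` once per label with `alpha=0.5`, so the two distributions overlap on one set of axes.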