### Classification of spam and non-spam messages

Spam messages are a constant nuisance. To deal with them, data scientists build spam classifiers, which separate spam messages from non-spam messages.

*A real-world example of such a classifier is the spam filter in Gmail, which handles spam mail efficiently.*

* I have used a random forest classifier to classify spam and non-spam messages.

### Import the libraries

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import wordcloud
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))



### Load the data

In [None]:
sms_data = pd.read_csv('/kaggle/input/sms-spam-collection-dataset/spam.csv',encoding = 'latin-1')

In [None]:
# Let's check the head of our dataframe
sms_data.head()

* Here the last three columns show only null values and are not useful to us. We have to drop them first.
* For the first two columns, we just have to change the names for simplicity.

---> We will do both in the EDA step.

### EDA (Exploratory Data Analysis)

In [None]:
# Drop every column after the first two (the unnamed, mostly empty ones)
sms_data.drop(columns = sms_data.columns[2:], inplace = True)

In [None]:
# Let's check our data again
sms_data.head()

In [None]:
sms_data.rename(columns = {'v1': 'label', 'v2' : 'sms'}, inplace = True)

In [None]:
sms_data.head()

In [None]:
# Let's check whether there are null values present in the data or not
sms_data.isnull().any()

No null values are present.

#### Let's see the Distribution of the target variable

In [None]:
sms_data.label.value_counts().plot.bar(rot = 0)
plt.xlabel('Class')
plt.ylabel('Frequency')
plt.title('SMS class distribution')

**Clearly we have imbalanced data, because the ham class has far more examples than the spam class.**

**We should handle this issue; otherwise our model may become biased toward predicting only ham (not spam).**

**But first let's add a numerical label for spam or ham.**

In [None]:
# 1 for spam, 0 for ham (an explicit integer column, independent of the pandas version)
sms_data['spam'] = (sms_data['label'] == 'spam').astype(int)

sms_data.head()
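
To see why the imbalance matters, consider a baseline that always predicts the majority class. The sketch below uses made-up labels (87 ham and 13 spam, chosen only for illustration and not taken from the dataset) to show that such a baseline can score a high accuracy while catching no spam at all.

```python
from collections import Counter

# Hypothetical labels for illustration only: 87 ham, 13 spam.
labels = ['ham'] * 87 + ['spam'] * 13

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]

# Accuracy of a "classifier" that always answers with the majority class.
baseline_accuracy = majority_count / len(labels)

# Recall on spam for the same baseline: it never predicts spam, so it catches none.
spam_recall = 0 / counts['spam']

print(f"majority class: {majority_class}")
print(f"baseline accuracy: {baseline_accuracy:.2f}")
print(f"spam recall: {spam_recall:.2f}")
```

This is why the per-class precision and recall are worth reading alongside the overall accuracy.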

#### Wordcloud for spam and ham sms

In [None]:
data_ham = sms_data[sms_data['spam'] == 0]
data_spam = sms_data[sms_data['spam'] == 1]

In [None]:
def show_wordcloud(data_spam_or_ham, title):
    text = ' '.join(data_spam_or_ham['sms'].astype(str).tolist())
    stopwords = set(wordcloud.STOPWORDS)
    
    fig_wordcloud = wordcloud.WordCloud(stopwords = stopwords,background_color = 'lightgrey',
                    colormap='Accent', width = 800, height = 600).generate(text)
    
    plt.figure(figsize = (10,7), frameon = True)
    plt.imshow(fig_wordcloud)  
    plt.axis('off')
    plt.title(title, fontsize = 20 )
    plt.show()

***Wordcloud for Spam sms***

In [None]:
show_wordcloud(data_spam, 'Spam SMS')

***Wordcloud for Ham sms***

In [None]:
show_wordcloud(data_ham, 'Ham SMS')
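
A word cloud draws each word at a size that roughly follows how often it occurs in the text. As a rough sketch of the counting behind the picture, the block below tallies word frequencies with the standard library. The two messages are hypothetical examples rather than rows from the dataset, and no stopword filtering is applied here.

```python
import re
from collections import Counter

# Hypothetical messages for illustration only; not rows from the dataset.
messages = [
    "Free entry! Call now to claim your free prize",
    "Call me when you are free tonight",
]

# Lowercase the joined text, keep alphabetic tokens only, and count them.
tokens = re.findall(r"[a-z]+", " ".join(messages).lower())
frequencies = Counter(tokens)

# The most frequent words are the ones a word cloud would draw largest.
print(frequencies.most_common(3))
```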

#### Create dependent and independent variables

In [None]:
X = sms_data['sms']
y = sms_data['spam']

### Data Preprocessing

#### Let's clean the messages:
* Remove unwanted characters (everything except letters) from the data.
* Remove stopwords.
* Perform stemming.

In [None]:
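def process_data(message):
    ps = PorterStemmer()   # Porter Stemmer Object
    # Build the stopword set once; reloading the list for every word is very slow
    stop_words = set(stopwords.words('english'))

    corpus = []

    for text in message:
        # Keep letters only, lowercase, and split into words
        review = re.sub('[^A-Za-z]', ' ', text)
        review = review.lower()
        review = review.split()

        # Drop stopwords and stem the remaining words
        review = [ps.stem(word) for word in review if word not in stop_words]
        review = ' '.join(review)
        corpus.append(review)
    return corpus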

In [None]:
corpus = process_data(X)
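
To make the cleaning steps concrete, here is a simplified sketch applied to a single message. It covers only the character removal and the stopword filtering; stemming is left out, and the tiny stopword set and the sample message are assumptions made for illustration (the notebook itself uses NLTK's full English list).

```python
import re

# Tiny hand-picked stopword set for illustration; NLTK's English list is much longer.
STOP_WORDS = {"you", "have", "a", "to", "now", "the"}

def clean_message(text):
    """Keep letters only, lowercase, split on whitespace, and drop stopwords."""
    letters_only = re.sub(r"[^A-Za-z]", " ", text)
    words = letters_only.lower().split()
    return " ".join(word for word in words if word not in STOP_WORDS)

# Hypothetical message; digits and punctuation are replaced by spaces.
cleaned = clean_message("You have WON a $1000 prize!!! Call 09061701461 now.")
print(cleaned)
```

Note that replacing every non-letter with a space also discards digits, so phone numbers and prize amounts never reach the model.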

#### Let's check our corpus

In [None]:
corpus

### Bag of Words model (BOW)
Let's create our Bag of Words model using the TF-IDF vectorizer.

In [None]:
tfidf = TfidfVectorizer(max_features = 4000)
X = tfidf.fit_transform(corpus).toarray()
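
The cell above fits the vectorizer on the whole corpus before any train/test split. An alternative that is often preferred, shown here only as a sketch and not as this notebook's method, is to fit on the training texts and then reuse the learned vocabulary for the test texts. The sample texts below are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical, already-cleaned messages for illustration only.
train_texts = ["free prize call", "call later tonight", "win free cash prize"]
test_texts = ["free cash tonight", "unseen word here"]

tfidf_sketch = TfidfVectorizer(max_features=4000)

# Learn the vocabulary and IDF weights from the training texts only.
X_train_tfidf = tfidf_sketch.fit_transform(train_texts)

# Reuse the fitted vocabulary for the test texts; words not seen in training are ignored.
X_test_tfidf = tfidf_sketch.transform(test_texts)

print(X_train_tfidf.shape, X_test_tfidf.shape)
print(sorted(tfidf_sketch.vocabulary_))
```

Both matrices share the same columns, so a model trained on one can score the other. The results stay sparse here; calling `.toarray()` as in the notebook converts them to dense arrays, which uses more memory.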

Now all of our data is numerical and can be fed into a machine learning model.

But first, let's split it into training and test sets. Note that the class imbalance seen earlier is still present in the data.
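
The code in this notebook does not itself rebalance the classes. One option, offered here as an assumption and not as the author's approach, is the `class_weight='balanced'` setting of `RandomForestClassifier`, which weights each class inversely to its frequency. The sketch uses synthetic data from `make_classification` as a stand-in for the TF-IDF features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, imbalanced toy data (roughly 87% / 13%); a stand-in for the real features.
X_toy, y_toy = make_classification(n_samples=500, n_features=20, weights=[0.87, 0.13],
                                   random_state=101)

# class_weight='balanced' weights each class inversely to its frequency in the labels.
weighted_model = RandomForestClassifier(class_weight='balanced', random_state=101)
weighted_model.fit(X_toy, y_toy)

print(weighted_model.classes_)
print(weighted_model.predict(X_toy[:5]))
```

Whether this helps on the SMS data would need to be checked against the classification report; resampling the training set is another common route.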

### Splitting the data into Train and Test set

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, 
                                                    random_state = 101, shuffle = True)

In [None]:
print(X_train.shape)
print(X_test.shape)
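
With imbalanced labels, a plain random split can leave the train and test sets with slightly different spam ratios. A minimal sketch of a stratified split follows, using the `stratify` argument of `train_test_split` on toy data (the arrays are made up for illustration).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data for illustration only: 100 rows, 20 of them labelled spam (1).
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 80 + [1] * 20)

# stratify=y_toy asks for the same class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=101, stratify=y_toy
)

print(f"spam share in train: {y_tr.mean():.2f}")
print(f"spam share in test:  {y_te.mean():.2f}")
```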

### Train and Test the model

#### Using Random Forest Classifier

In [None]:
rf_model = RandomForestClassifier(random_state = 101)

rf_model.fit(X_train, y_train)
y_pred = rf_model.predict(X_test)
print(metrics.classification_report(y_test, y_pred))

confusion_matrix = metrics.confusion_matrix(y_test,y_pred)
display(pd.DataFrame(data = confusion_matrix, columns = ['Predicted 0', 'Predicted 1'],
            index = ['Actual 0', 'Actual 1']))
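
To read the confusion matrix, it helps to compute precision and recall for the spam class by hand. The counts below are hypothetical and are not the results of this model; the layout follows the table above, with actual classes in rows and predicted classes in columns.

```python
# Hypothetical confusion matrix for illustration only (rows: actual, columns: predicted).
#             Predicted 0  Predicted 1
confusion = [[950, 5],      # Actual 0 (ham)
             [30, 120]]     # Actual 1 (spam)

true_neg, false_pos = confusion[0]
false_neg, true_pos = confusion[1]

# Precision: of the messages flagged as spam, how many really were spam.
precision = true_pos / (true_pos + false_pos)
# Recall: of the real spam messages, how many were caught.
recall = true_pos / (true_pos + false_neg)

print(f"spam precision: {precision:.3f}")
print(f"spam recall:    {recall:.3f}")
```

These are the same quantities that `metrics.classification_report` prints in the row for the spam class.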

**And if you liked the notebook please give an upvote. It will boost my confidence and motivation.**

**And any further suggestions for improving this notebook are most welcome as I will be looking to improve this notebook further.**

**Thank you 😀**