Importing necessary libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
import nltk
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score

Fetching dataset

In [2]:
df = pd.read_csv("spam.csv", encoding = "latin-1")

In [3]:
df.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


Making data shine: drop the three unnamed columns and give the remaining two clearer names ("label" and "text")

In [4]:
df.drop(columns = [df.columns[2], df.columns[3], df.columns[4]], inplace = True)
df.rename(columns = {"v1" : "label", "v2" : "text"}, inplace = True)

In [5]:
df

Unnamed: 0,label,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will Ì_ b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


Replacing "ham" and "spam" with numbers, because the model wants numeric labels. LabelEncoder assigns codes in sorted order, so "ham" becomes 0 and "spam" becomes 1

In [6]:
encoder = LabelEncoder()
df["label"] = encoder.fit_transform(df["label"])
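
To double-check which number means what, here is a minimal sketch on a toy list of labels (the list is made up for illustration and is not the real column). It reads the mapping from the encoder's `classes_` attribute.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the "label" column (hypothetical data)
labels = ["ham", "spam", "ham", "ham", "spam"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted, so position 0 is "ham" and position 1 is "spam"
mapping = {str(cls): idx for idx, cls in enumerate(encoder.classes_)}
print(mapping)
print(list(encoded))

# inverse_transform turns the numbers back into the original strings
decoded = [str(x) for x in encoder.inverse_transform([0, 1])]
print(decoded)
```

So a prediction of 1 later on means "spam".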

Hsekar Data Cleaning Services: Cleaning Up Missing Values & Copycat Records

In [7]:
print(df.isnull().sum(), df.duplicated().sum()) # Checking data health: missing values & sneaky duplicates
if df.duplicated().sum() != 0:
    df = df.drop_duplicates()

label    0
text     0
dtype: int64 403
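
A small sketch of what `duplicated()` and `drop_duplicates()` do, on a made-up four-row frame (not the real data). It also shows that dropping rows leaves gaps in the index, which is why the index can run higher than the number of rows left. Calling `reset_index(drop=True)` is an optional extra that the notebook itself does not use.

```python
import pandas as pd

# Hypothetical mini dataset: row 2 is an exact copy of row 0
toy = pd.DataFrame({
    "label": [0, 1, 0, 0],
    "text": ["hi", "win cash", "hi", "see you"],
})

n_dupes = int(toy.duplicated().sum())   # counts rows that repeat an earlier row
deduped = toy.drop_duplicates()         # keeps the first occurrence of each row

print(n_dupes)
print(list(deduped.index))              # original index labels survive, with a gap

# Optional: renumber the rows 0..n-1 after dropping
clean = deduped.reset_index(drop=True)
print(list(clean.index))
```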


In [8]:
df.head()

Unnamed: 0,label,text
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


**Text Boot Camp:** 
  This function gets the text data in shape for analysis. 
  It lowercases and tokenizes each message, drops punctuation, numbers and English stop words, 
  and stems the remaining words down to their root form.

In [9]:
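from nltk.corpus import stopwords

# Needs the NLTK tokenizer and stop-word data, e.g. nltk.download("punkt") and nltk.download("stopwords")
# (newer NLTK releases may ask for "punkt_tab" instead of "punkt")
stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))  # built once instead of once per token

def text_preprocess(text):
    tokens = nltk.word_tokenize(text.lower())
    # Keep alphabetic tokens only (this already drops punctuation and numbers), then remove stop words
    words = [t for t in tokens if t.isalpha() and t not in stop_words]
    return " ".join(stemmer.stem(w) for w in words)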

Testing the function

In [10]:
text_preprocess("I am waiting for you my dear cute madam garu...! ")

'wait dear cute madam garu'

Text Boot Camp in action! Applying the function to every message puts the text on a concise diet:
each message gets shorter because the unnecessary parts are removed

In [11]:
df = df.copy()  # explicit copy, so pandas does not raise SettingWithCopyWarning after drop_duplicates()
df["text_transformed"] = df["text"].apply(text_preprocess)


 Let's peek at our shiny new column, "text_transformed"! (All that hard work paid off!)

In [12]:
print(df["text_transformed"])

0       go jurong point crazi avail bugi n great world...
1                                   ok lar joke wif u oni
2       free entri wkli comp win fa cup final tkt may ...
3                     u dun say earli hor u c alreadi say
4                    nah think goe usf live around though
                              ...                        
5567    time tri contact u pound prize claim easi call...
5568                                b go esplanad fr home
5569                                    piti mood suggest
5570    guy bitch act like interest buy someth els nex...
5571                                       rofl true name
Name: text_transformed, Length: 5169, dtype: object


Text to features: TF-IDF turns each message into a vector of numbers, keeping only the 3000 most frequent terms (machines love numbers, not words!)

In [13]:
vectorizer = TfidfVectorizer(max_features = 3000)
X = vectorizer.fit_transform(df["text_transformed"]).toarray()
y = df["label"].values

Dividing the data: 80% goes to the training set (the A-team, which does the learning) and 20% to the test set (the B-team, held back to check what was learned)

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
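
A possible variation, not what this notebook does: above, the vectorizer is fitted on all messages before the split, so its vocabulary and IDF weights have already seen the test messages. A stricter setup splits the raw text first and fits the vectorizer on the training part only. The sketch below uses made-up messages, and `stratify` is added on the assumption that we want the same ham/spam ratio in both parts.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical preprocessed messages: 5 spam (1) and 5 ham (0)
texts = [
    "free prize call", "win cash prize", "claim free entri", "urgent win call", "free tone txt",
    "see lunch later", "ok come home", "call later ok", "go home sleep", "meet lunch today",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# Split the raw text first; stratify keeps the class ratio equal in both parts
train_txt, test_txt, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

vec = TfidfVectorizer(max_features=3000)
X_tr = vec.fit_transform(train_txt)   # vocabulary and IDF come from training text only
X_te = vec.transform(test_txt)        # test text is only transformed, never fitted

print(X_tr.shape, X_te.shape)
```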

Training the model and unleashing Sherlock (our text classifier): time to see how well it detects spam by checking its accuracy
and precision score

In [15]:
model = MultinomialNB()
model.fit(X_train, y_train)
y_predicted = model.predict(X_test)
print(accuracy_score(y_test, y_predicted))
print(precision_score(y_test, y_predicted))

0.9729206963249516
0.9915966386554622
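
What the two numbers mean, as a minimal plain-Python sketch on made-up labels (the lists and the helper `confusion_counts` are hypothetical, for illustration only). Accuracy is the share of all messages classified correctly. Precision is the share of messages flagged as spam that really are spam, which is the number to watch if we do not want real messages landing in the spam folder.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn

# Hypothetical labels: 1 = spam, 0 = ham
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)   # correct predictions / all predictions
precision = tp / (tp + fp)           # real spam / everything flagged as spam

print(tp, fp, fn, tn)
print(accuracy, precision)
```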


**Mic Drop Moment!**  Our model achieved a whopping 97.3% accuracy and 99.2% precision!  :-)

Looks like spam can't hide from us anymore. Time to celebrate (with responsibly-sized cupcakes)! 

In [21]:
def predict(msg):
    msg = text_preprocess(msg)
    msg_vec = vectorizer.transform([msg])

    if model.predict(msg_vec)[0] == 1:  # predict returns an array; take its single element
        print("It's a SPAM")
    else:
        print("It's not a SPAM")

In [29]:
msg1 = """
    How are you my dear cute madam garu...?
    I have a lot of love transactions that are in queue(pending).
    Will you please help me to complete that love transactions madam garu...?
"""
predict(msg1)

It's not a SPAM


In [30]:
msg2 = """
  Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles 
  with camera for Free! Call The Mobile Update Co FREE on 08002986030
"""
predict(msg2)

It's a SPAM


In [28]:
msg3 = """
    Congratulations! Your credit score entitles you to a no-interest Visa credit card. Click here to claim: "https://www.rakeshspams.com"
"""
predict(msg3)

It's a SPAM


I need to save this model because I'm too lazy to repeat all these steps again...! :-)

In [25]:
import pickle

In [26]:
with open("spam-classifier-model.pkl", "wb") as f:
    pickle.dump(model, f)

Save the vectorizer too. It holds the vocabulary and IDF weights learned in training school,
so new messages must go through this same fitted vectorizer before the model sees them.

In [27]:
with open("vectorizer.pkl", "wb") as f:
    pickle.dump(vectorizer, f)
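
For the lazy future me, a minimal sketch of loading both files back and classifying a message. To keep it self-contained it trains a tiny stand-in model on made-up messages and writes the pickles to a temporary folder. In real use you would skip that part and open the two `.pkl` files saved above, and run new messages through `text_preprocess` first. Only unpickle files you trust, since loading a pickle can execute arbitrary code.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Stand-in for the trained objects (hypothetical toy data, 1 = spam, 0 = ham)
texts = ["free prize call now", "win free cash prize", "see you at lunch", "ok call me later"]
labels = [1, 1, 0, 0]
vectorizer = TfidfVectorizer()
model = MultinomialNB().fit(vectorizer.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "spam-classifier-model.pkl"
    vec_path = Path(tmp) / "vectorizer.pkl"

    # Save both objects, same idea as the cells above
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    with open(vec_path, "wb") as f:
        pickle.dump(vectorizer, f)

    # Load them back, as a later session would
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)
    with open(vec_path, "rb") as f:
        loaded_vec = pickle.load(f)

# The loaded vectorizer maps new text onto the same columns the model was trained on
prediction = int(loaded_model.predict(loaded_vec.transform(["free cash prize"]))[0])
print("It's a SPAM" if prediction == 1 else "It's not a SPAM")
```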