# Spam Classification

**Data processing**
*   Import the required packages
*   Load the data into train and test variables
*   Remove the unwanted data columns
*   Build a word cloud to see which words are typical of spam and of non-spam messages
*   Remove stop words and punctuation
*   Convert the text data into vectors

**Building a classification model**
*   Split the data into train and test sets
*   Use scikit-learn's built-in classifiers to build the models
*   Train the model on the training data
*   Make predictions on new data

## Import the required packages

In [None]:
import sklearn
import pickle
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split, StratifiedKFold, cross_val_score, learning_curve

## Preprocessing and Exploring the Dataset

In [None]:
# Create data frame from .csv file
data = pd.read_csv("spam.csv", encoding='latin-1', on_bad_lines="skip")
# Display first 5 rows
data.head()

Unnamed: 0,v1,v2,Unnamed: 2
0,ham,"Go until jurong point, crazy.. Available only ...",
1,ham,Ok lar... Joking wif u oni...,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,
3,ham,U dun say so early hor... U c already then say...,
4,ham,"Nah I don't think he goes to usf, he lives aro...",


## Removing unwanted columns

In [None]:
# Remove the empty "Unnamed: 2" column shown in the preview above
data = data.drop(["Unnamed: 2"], axis=1)
# Rename columns
data = data.rename(columns={"v2" : "text", "v1":"label"})
# Display first 5 rows
data.head()

Unnamed: 0,label,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [None]:
# Display the number of examples per label (i.e., how many messages are spam and how many are not)
data['label'].value_counts()

ham     4780
spam     742
Name: label, dtype: int64
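
The counts above show that the classes are imbalanced, so it helps to know how well a trivial classifier would do before judging any model. The sketch below computes the accuracy of always predicting the majority class; the counts are copied by hand from the output above, and the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above
counts = {"ham": 4780, "spam": 742}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of a classifier that always predicts the majority class
baseline_accuracy = counts[majority_label] / total

print(f"total messages: {total}")
print(f"majority class: {majority_label}")
print(f"baseline accuracy: {baseline_accuracy:.3f}")
```

Any model trained later should be compared against this baseline rather than against 50%.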

## Converting words to vectors using Count Vectorizer

In [None]:
# Import the CountVectorizer class
from sklearn.feature_extraction.text import CountVectorizer
# Initialize the CountVectorizer
vectorizer = CountVectorizer()
# Learn the vocabulary and convert each message into a vector of token counts
vectors = vectorizer.fit_transform(data['text'])
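
The outline lists stop-word and punctuation removal, but the cell above uses `CountVectorizer` with its defaults. One possible way to cover that step, shown here as a sketch on two made-up messages rather than on `spam.csv`, is the `stop_words="english"` option; the default tokenizer already ignores punctuation. The names `sample_texts` and `sw_vectorizer` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up messages for illustration, not taken from spam.csv
sample_texts = [
    "Free entry, win the prize!!!",
    "Are you coming to the party?",
]

# stop_words="english" drops common English words such as "the" and "you";
# the default token pattern keeps only alphanumeric tokens of 2+ characters
sw_vectorizer = CountVectorizer(stop_words="english")
sw_vectorizer.fit(sample_texts)

kept_tokens = sorted(sw_vectorizer.vocabulary_)
print(kept_tokens)
```

Whether removing stop words helps on this dataset is an open question; short SMS messages sometimes lose useful signal that way, so it is worth comparing both settings.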

In [None]:
# View the vocabulary of the vectorizer
print(vectorizer.vocabulary_)
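
The full vocabulary is large and hard to read. To see what `vocabulary_` and the count matrix represent, here is a minimal sketch on a three-message toy corpus (made up for illustration, not from the dataset).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from spam.csv
toy_corpus = ["free entry win", "ok see you", "win free prize"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_corpus)

# vocabulary_ maps each token to its column index in the count matrix
toy_vocab = toy_vectorizer.vocabulary_
columns = sorted(toy_vocab, key=toy_vocab.get)
print(columns)

# Dense view: one row per message, one column per token
print(toy_counts.toarray())
```

Each row of the matrix is one message, and each entry counts how often the token for that column occurs in it.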



## Splitting into training and test set

In [None]:
# Split into training (85%) and test (15%) sets; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(vectors, data['label'], test_size=0.15, random_state=111)
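
Note that the vectorizer in this notebook is fitted on all messages before the split, so the vocabulary also contains words that occur only in the test set. An alternative, sketched below on made-up data, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn `Pipeline`, so that the vocabulary is learned from the training messages only. The toy messages and the `stratify` argument are assumptions for illustration, not part of the original notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy messages standing in for data['text'] and data['label']
texts = [
    "free entry win a prize now",
    "win cash claim your free reward",
    "urgent call now to claim prize",
    "free ringtone text win to claim",
    "ok see you at lunch",
    "are you coming home tonight",
    "call me when you get home",
    "see you tomorrow at the office",
]
labels = ["spam"] * 4 + ["ham"] * 4

# Split the raw text first; stratify keeps the class ratio in both parts
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=111, stratify=labels
)

# The pipeline fits the vectorizer on the training messages only
pipeline = make_pipeline(
    CountVectorizer(),
    LogisticRegression(solver="liblinear"),
)
pipeline.fit(text_train, y_train)
print(pipeline.score(text_test, y_test))
```

A pipeline also makes prediction simpler, since raw strings can be passed straight to `pipeline.predict`.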

## Training our NLP Classification Model


In [None]:
# Import the logistic regression classifier from scikit-learn
from sklearn.linear_model import LogisticRegression
# Initialize logistic regression with L1 regularization (supported by the liblinear solver)
model = LogisticRegression(solver='liblinear', penalty='l1')
# Train the model on the training data
model.fit(X_train, y_train)

LogisticRegression(penalty='l1', solver='liblinear')

In [None]:
# Evaluate the model's mean accuracy on the held-out test data
model.score(X_test, y_test)

0.971049457177322
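
Because most messages are ham, accuracy alone can hide how well the spam class is handled. The sketch below shows how a confusion matrix and per-class precision and recall could be computed with `sklearn.metrics`. The label lists are made up for illustration; in this notebook they would be `y_test` and `model.predict(X_test)`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for illustration only; in the notebook these would be
# y_test and model.predict(X_test)
y_true = ["ham", "ham", "ham", "ham", "spam", "spam", "spam", "ham"]
y_pred = ["ham", "ham", "ham", "ham", "spam", "ham", "spam", "ham"]

# Rows are true labels, columns are predicted labels, in the order given
cm = confusion_matrix(y_true, y_pred, labels=["ham", "spam"])
print(cm)

# Precision: share of predicted spam that is really spam
spam_precision = precision_score(y_true, y_pred, pos_label="spam")
# Recall: share of real spam that the model catches
spam_recall = recall_score(y_true, y_pred, pos_label="spam")
print(f"spam precision: {spam_precision:.2f}, spam recall: {spam_recall:.2f}")
```

For a spam filter, recall on the spam class shows how much spam slips through, while precision shows how often legitimate messages are flagged.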

## Making predictions

In [None]:
test_sentence = "FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, 1.50 to rcv"
# Vectorize the sentence with the already fitted vectorizer
vect = vectorizer.transform([test_sentence])
# Predict the label for the sentence
prediction = model.predict(vect)
print(prediction[0])

ham
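
The test sentence reads like spam, yet the model labels it ham. A hard label hides how confident the model is; `predict_proba` returns one probability per class and can show whether a prediction was a close call. The sketch below trains a small stand-in model on made-up messages with the same settings as above, so the numbers it prints say nothing about the real model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy training data, not taken from spam.csv
train_texts = [
    "free prize claim now",
    "win free cash today",
    "claim your free ringtone",
    "see you at lunch",
    "are you home tonight",
    "call me after work",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

toy_vectorizer = CountVectorizer()
toy_model = LogisticRegression(solver="liblinear", penalty="l1")
toy_model.fit(toy_vectorizer.fit_transform(train_texts), train_labels)

message = "free prize waiting for you"
proba = toy_model.predict_proba(toy_vectorizer.transform([message]))[0]

# classes_ gives the label order of the probability columns
for label, p in zip(toy_model.classes_, proba):
    print(f"{label}: {p:.3f}")
```

In the notebook itself, the equivalent call would be `model.predict_proba(vect)`, read against `model.classes_`.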


## Saving the model to disk

In [None]:
# Save the trained model to model.pkl in binary mode
with open("model.pkl", "wb") as model_file:
    pickle.dump(model, model_file)

# Save the fitted vectorizer to vect.pkl in binary mode
with open("vect.pkl", "wb") as vect_file:
    pickle.dump(vectorizer, vect_file)


In [None]:
# Load the model from model.pkl
with open("model.pkl", "rb") as saved_model:
    loaded_model = pickle.load(saved_model)

# Load the vectorizer from vect.pkl
with open("vect.pkl", "rb") as saved_vect:
    loaded_vect = pickle.load(saved_vect)
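
Both objects have to be saved and restored together, because the model only understands vectors produced by the same fitted vectorizer. The sketch below checks a save/load round trip on a small stand-in model trained on made-up messages. It writes to a temporary directory instead of the notebook's working directory, and all names in it are illustrative.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy model standing in for the notebook's model and vectorizer
texts = [
    "free prize claim now",
    "win free cash today",
    "see you at lunch",
    "call me after work",
]
labels = ["spam", "spam", "ham", "ham"]
vec = CountVectorizer()
clf = LogisticRegression(solver="liblinear").fit(vec.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.pkl")
    vect_path = os.path.join(tmp_dir, "vect.pkl")

    # Save both objects; the with-blocks close the files automatically
    with open(model_path, "wb") as f:
        pickle.dump(clf, f)
    with open(vect_path, "wb") as f:
        pickle.dump(vec, f)

    # Load them back
    with open(model_path, "rb") as f:
        restored_clf = pickle.load(f)
    with open(vect_path, "rb") as f:
        restored_vec = pickle.load(f)

# The restored pair should give the same predictions as the original pair
new_messages = ["claim your free prize", "lunch after work"]
original = clf.predict(vec.transform(new_messages))
restored = restored_clf.predict(restored_vec.transform(new_messages))
print(list(restored))
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.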