# Email Spam Classification using scikit-learn

This is a machine learning model that classifies emails as spam or ham (not spam). Email classification is one of the most basic yet important applications of machine learning.
The dataset is a CSV file named "spam.csv", containing text messages labeled as ham or spam, with a total of 5572 examples.

## Cell 0: Importing the necessary libraries

#### Pandas
A Python library for data manipulation and analysis.
It is used here to read the dataset from the CSV file. [Used in cell 1]
#### CountVectorizer
CountVectorizer provides a simple way to tokenize a collection of text documents, build a vocabulary of known words, and encode new documents using that vocabulary.
It is used here to build a vocabulary from the text in the dataset and encode it. [Used in cell 3]
#### SVM
Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression, and outlier detection.
For this classification task, the SVM model is imported from scikit-learn. [Used in cell 4]

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import svm

## Cell 1: Reading the dataset
The dataset is a simple text dataset of shape (5572 x 2), i.e. 5572 examples (rows) and 2 columns (EmailText and Label).

This classification dataset has 2 classes:
(1) Ham - Genuine emails that are not spam
(2) Spam

The dataset is stored in a variable called 'data', as a pandas DataFrame.

In [2]:
data = pd.read_csv("spam.csv")
# print(data) 
# print(type(data)) # Pandas Dataframe
# print(data.shape) # 5572 rows and 2 columns
# print(data.describe())
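
After loading, a quick check of the shape and the number of examples per class can confirm the file was read as expected. The sketch below uses a small in-memory CSV as a stand-in for spam.csv: the column names `EmailText` and `Label` come from the description above, while the rows and the names `csv_text`, `sample` and `label_counts` are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for spam.csv: same column names, made-up rows.
csv_text = """Label,EmailText
ham,Are we still meeting for lunch today?
spam,WINNER! Claim your free prize now
ham,Please send me the report by Monday
spam,Free entry in a weekly competition
ham,See you at home tonight
"""

sample = pd.read_csv(io.StringIO(csv_text))

# Shape is (rows, columns), mirroring data.shape on the real dataset.
print(sample.shape)

# Count how many examples fall into each class.
label_counts = sample["Label"].value_counts()
print(label_counts)
```

On the real dataset, the same `value_counts()` call on `data["Label"]` would show how many ham and spam examples are present.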

## Cell 2: Splitting the dataset
The dataset needs to be split into x and y, i.e. the email text and its label. Each is then split into a training set (the first 75% of the rows) and a testing set (the remaining 25%).

In [3]:
x = data["EmailText"]
y = data["Label"]

### Get the number of examples
data_size = data.shape[0]
### Splitting ratio - 75% training set and 25% testing set
dataset_split_percent = 0.75
data_split = round(data_size*dataset_split_percent)
# print(data_split)

x_train, y_train = x[0:data_split], y[0:data_split]
x_test, y_test = x[data_split:data_size], y[data_split:data_size]

# print(x_train)
# print(y_train)
# print(x_test)
# print(y_test)
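
The slicing above takes the rows in file order, so the result depends on how the CSV happens to be sorted. As an alternative (not what this notebook does), scikit-learn's `train_test_split` can shuffle the rows and keep the class ratio similar in both parts. A minimal sketch on made-up data follows; the frame `toy`, the variable names and the seed `random_state=42` are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with the same column names as the dataset.
toy = pd.DataFrame({
    "EmailText": [f"message {i}" for i in range(16)],
    "Label": ["ham", "ham", "ham", "spam"] * 4,
})

# Shuffle and split 75% / 25%; stratify keeps the ham/spam ratio
# similar in both parts. random_state=42 is an arbitrary seed.
x_tr, x_te, y_tr, y_te = train_test_split(
    toy["EmailText"],
    toy["Label"],
    test_size=0.25,
    stratify=toy["Label"],
    random_state=42,
)

print(len(x_tr), len(x_te))
print(y_te.value_counts())
```

Whether shuffling changes the reported accuracy on spam.csv depends on how the file is ordered, which is not stated here.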

## Cell 3: Feature Extraction
Here, the features used for classification are word counts. Individual words are taken from the training data, i.e. x_train, and the number of times each word occurs in each email is counted. These word counts act as the features used to classify emails.
For this, scikit-learn's CountVectorizer class is used. It converts a collection of text documents to a matrix of token counts.

In [4]:
cv = CountVectorizer()
features = cv.fit_transform(x_train)

# print(features)
# print(features.shape)
### get_feature_names_out returns the feature names of the CountVectorizer, i.e. the unique words in the training data (x_train).
### It is available in scikit-learn 1.0 and later; older versions use get_feature_names instead.
# print(cv.get_feature_names_out())
### length of the above-mentioned array, i.e. the number of unique words in the training data
# print(len(cv.get_feature_names_out()))
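
To make the count matrix concrete, here is a small sketch on three made-up messages. The messages and the names `toy_cv`, `toy_features`, `vocab` and `new_counts` are hypothetical; the calls are the same `fit_transform` and `transform` used in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training messages, for illustration only.
train_messages = [
    "free prize call now",
    "call me now",
    "lunch at noon",
]

toy_cv = CountVectorizer()
# fit_transform learns the vocabulary and returns a sparse count matrix.
toy_features = toy_cv.fit_transform(train_messages)

# vocabulary_ maps each word to its column index; sort words by index.
vocab = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(vocab)

# One row per message, one column per vocabulary word.
print(toy_features.toarray())

# transform reuses the learned vocabulary; unseen words are ignored.
new_counts = toy_cv.transform(["free free lunch tomorrow"]).toarray()
print(new_counts)
```

The word "tomorrow" does not appear in the training messages, so it has no column and is dropped when the new message is encoded. This is why the test set is later encoded with `cv.transform` rather than a second `fit_transform`.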

## Cell 4: Training Model
The model used for the classification is an SVM (support vector machine). This is a basic SVC model with default parameters. The parameters of the model (for example, the kernel or the regularization strength C) can be changed as needed.

In [5]:
### Model creation
model = svm.SVC()
### Model training
model.fit(features, y_train)

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
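
The vectorizer and the classifier can also be chained into a single scikit-learn pipeline, so that the vocabulary learned during training is applied automatically at prediction time. This is an alternative arrangement, not the structure used in this notebook; the messages, labels and the name `spam_pipeline` below are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy messages and labels, for illustration only.
messages = [
    "win a free prize now",
    "free cash claim now",
    "are you coming to lunch",
    "see you at the meeting",
]
labels = ["spam", "spam", "ham", "ham"]

# The pipeline vectorizes the text and then fits the SVM in one call.
spam_pipeline = make_pipeline(CountVectorizer(), SVC())
spam_pipeline.fit(messages, labels)

# Raw strings can be passed directly; the pipeline applies transform first.
predictions = spam_pipeline.predict(["free prize", "lunch meeting"])
print(predictions)
```

With only four training messages the predictions are not meaningful; the point of the sketch is the shape of the workflow, where `fit` and `predict` both accept raw text.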

## Cell 5: Testing and accuracy
The model is tested by running it on the testing dataset and scoring its accuracy, i.e. the fraction of predictions that match the labels of the test data.

In [6]:
test = cv.transform(x_test)
# print(test)
# print(y_test)
accuracy = model.score(test, y_test)
print(accuracy)

0.9834888729361091
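
Accuracy is a single number, and a confusion matrix can show which kind of mistakes lie behind it. The sketch below uses made-up true and predicted labels, not output from the model above; `y_true`, `y_pred`, `acc` and `cm` are hypothetical names.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels, not output from the model above.
y_true = ["ham", "ham", "ham", "spam", "spam", "ham"]
y_pred = ["ham", "ham", "spam", "spam", "ham", "ham"]

# Fraction of predictions that match the true labels
# (the same quantity model.score reports for a classifier).
acc = accuracy_score(y_true, y_pred)
print(acc)

# Rows are true classes, columns are predicted classes,
# both in the order given by `labels`.
cm = confusion_matrix(y_true, y_pred, labels=["ham", "spam"])
print(cm)
```

In this notebook, the equivalent call would take `y_test` and `model.predict(test)` as its two arguments.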
