

This notebook deals with classifying messages as spam or non-spam (= ham). We have already seen the spam data: each record has the email text and a label (the target, either ham or spam). The first part of this notebook is very similar to a previous pandas notebook.


# Data (previously in pandas notebook)

In [6]:
import pandas as pd
import numpy as np

# open the CSV file

# ------------
# We have already seen how to upload a file into Colab. If you are not sure,
# see the previous notebooks or search for 'colab upload file'.
# ------------

spam_data = pd.read_csv('spam.csv')

# display the first 10 rows, they contain text and a column with either 
# 'ham' or 'spam' = target or label for each record
spam_data.head(10)

Unnamed: 0,text,target
0,"Go until jurong point, crazy.. Available only ...",ham
1,Ok lar... Joking wif u oni...,ham
2,Free entry in 2 a wkly comp to win FA Cup fina...,spam
3,U dun say so early hor... U c already then say...,ham
4,"Nah I don't think he goes to usf, he lives aro...",ham
5,FreeMsg Hey there darling it's been 3 week's n...,spam
6,Even my brother is not like to speak with me. ...,ham
7,As per your request 'Melle Melle (Oru Minnamin...,ham
8,WINNER!! As a valued network customer you have...,spam
9,Had your mobile 11 months or more? U R entitle...,spam


In [7]:
# what are the unique labels? 
print(spam_data['target'].unique())

['ham' 'spam']


In [8]:
# it is easier to deal with numerical targets (or labels)
# change the target to 0 and 1; map() with a dictionary is one way to do it
spam_data['target'] = spam_data['target'].map({'spam': 1, 'ham': 0})
spam_data.head(10)

Unnamed: 0,text,target
0,"Go until jurong point, crazy.. Available only ...",0
1,Ok lar... Joking wif u oni...,0
2,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,U dun say so early hor... U c already then say...,0
4,"Nah I don't think he goes to usf, he lives aro...",0
5,FreeMsg Hey there darling it's been 3 week's n...,1
6,Even my brother is not like to speak with me. ...,0
7,As per your request 'Melle Melle (Oru Minnamin...,0
8,WINNER!! As a valued network customer you have...,1
9,Had your mobile 11 months or more? U R entitle...,1
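
One detail of `map` with a dictionary is worth checking: any label that is not a key in the dictionary becomes `NaN`. The following is a minimal sketch on made-up labels (the `labels` Series is hypothetical and not part of `spam.csv`) showing how to count such unmapped values:

```python
import pandas as pd

# toy labels; 'Spam' (capitalised) is deliberately NOT a key in the mapping
labels = pd.Series(['ham', 'spam', 'Spam', 'ham'])
mapped = labels.map({'spam': 1, 'ham': 0})

# labels missing from the dictionary are turned into NaN
n_unmapped = int(mapped.isna().sum())
print(mapped.tolist())
print('unmapped labels:', n_unmapped)
```

If the count is not zero, the label column probably contains spelling variants that should be cleaned before mapping.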


In [9]:
# display a count of how many 0's and how many 1's in the target overall
label_groups = spam_data['target'].groupby(spam_data['target'])
label_groups.count()

target
0    4825
1     747
Name: target, dtype: int64
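
The counts above show that the two classes are not balanced. As a small sketch, the shares can be computed with `value_counts(normalize=True)`; the Series below is rebuilt from the two counts printed above rather than read from the file, so it is only a stand-in for `spam_data['target']`:

```python
import pandas as pd

# stand-in for spam_data['target'], rebuilt from the counts shown above
target = pd.Series([0] * 4825 + [1] * 747)

# normalize=True returns fractions instead of absolute counts
shares = target.value_counts(normalize=True)
print(shares.round(3))
```

Roughly 13% of the records are spam, which is useful to keep in mind when reading the accuracy later on.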

# Train test split

Doc: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html 

Further reading: https://realpython.com/train-test-split-python-data/ 

We split the data into a training set and a test set:

- First, we build the model on the training set.

- Then we test the model on the test set.

- Each set contains texts and their corresponding targets, so we have train data with its labels and test data with its labels.


See also the classification slides for the overall concepts and ideas.


Here I pass pandas objects (two columns of the dataframe, i.e. Series) to train_test_split. You can also convert them with to_numpy() and pass NumPy arrays instead.
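
To illustrate the two input options, here is a minimal sketch on made-up data (the `toy` dataframe and its contents are hypothetical). It suggests that `train_test_split` returns the same kind of object it receives, and that the same `random_state` gives the same split either way:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical toy data: 10 short texts with alternating labels
toy = pd.DataFrame({
    'text': [f'message {i}' for i in range(10)],
    'target': [0, 1] * 5,
})

# pandas Series in, pandas Series out
X_tr, X_te, y_tr, y_te = train_test_split(
    toy['text'], toy['target'], test_size=0.2, random_state=42)

# NumPy arrays in, NumPy arrays out
X_tr_np, X_te_np, y_tr_np, y_te_np = train_test_split(
    toy['text'].to_numpy(), toy['target'].to_numpy(),
    test_size=0.2, random_state=42)

print(type(X_tr).__name__, type(X_tr_np).__name__)
print(len(X_tr), len(X_te))
```

With `test_size=0.2`, 2 of the 10 toy records go to the test set and 8 to the training set.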


In [10]:
 
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(spam_data['text'], 
                                                    spam_data['target'], 
                                                    test_size=0.2,
                                                    random_state=42)

# data is in the "X_" variables and labels are in the "y_" variables

# Question: what are the dimensions of X_train? X_test? y_train? y_test? 

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(4457,) (1115,) (4457,) (1115,)


# BOW and Tf-Idf (earlier material)

Classifiers cannot work on raw text directly, so we need to convert the text into numbers, a form that classifiers "understand".

**Tf-idf** (term frequency–inverse document frequency) is a widely used alternative to absolute frequency counts: it down-weights a word's frequency in a document by how many documents in the corpus contain that word, so words that appear almost everywhere count for less.

For more details, see the material on BOW, including the related notebook.
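
As a rough illustration of the weighting, the sketch below recomputes tf-idf by hand on three made-up documents and compares the result with `TfidfVectorizer`. It assumes scikit-learn's default settings (raw counts as term frequency, smoothed idf `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row). The `docs` list is hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# hypothetical mini corpus; 'now' appears in every document
docs = ['free prize now', 'call me now', 'see you now']

# term frequency: raw counts per document
counts = CountVectorizer().fit_transform(docs).toarray()
n_docs = counts.shape[0]

# document frequency: in how many documents does each term appear?
df = (counts > 0).sum(axis=0)

# smoothed idf (assumed scikit-learn default), then L2-normalise each row
idf = np.log((1 + n_docs) / (1 + df)) + 1
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

vec = TfidfVectorizer()
sk = vec.fit_transform(docs).toarray()
print('manual matches TfidfVectorizer:', np.allclose(manual, sk))

# 'now' occurs in all documents, so it is weighted lower than 'free'
i_now, i_free = vec.vocabulary_['now'], vec.vocabulary_['free']
print('now:', sk[0, i_now].round(3), 'free:', sk[0, i_free].round(3))
```

In the first document both words occur once, yet "free" receives the higher weight because it is rarer in the corpus.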

In [11]:
# use a vectorizer to get the text into a matrix format
# the result is a matrix with as many rows as the train or test set;
# the columns are the unique terms (or words) in the training data
# (the vocabulary, 7735 terms here)
# the vectorizer returns a sparse matrix to save memory;
# .toarray() below converts it to a dense NumPy array

# Fit and transform X_train using Tfidf Vectorizer with default parameters
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()

X_train_tfidf = vectorizer.fit_transform(X_train).toarray()
print('tfidf train shape:', X_train_tfidf.shape)
print('tfidf train type:', X_train_tfidf.dtype)

tfidf train shape: (4457, 7735)
tfidf train type: float64


In [None]:
# use the same as above to transform X_test

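# ATTENTION! this is transform only, not fit AND transform
# Question: what is the difference between transform and fit_transform?
# fit_transform learns the vocabulary (unique terms) and idf weights from the
# training set, then transforms the training set
# transform applies that same vocabulary and those weights to the test data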

X_test_tfidf = vectorizer.transform(X_test).toarray()
print('tfidf test:', X_test_tfidf.shape)
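
The difference between `fit_transform` and `transform` can be seen on a small made-up example (the two lists below are hypothetical). The vocabulary is learned from the training texts only, so the test matrix has the same columns, and a word that never appeared in training is simply ignored:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# hypothetical toy texts; 'tomorrow' appears only in the test text
train_docs = ['win a free prize', 'see you at lunch']
test_docs = ['free lunch tomorrow']

vec = TfidfVectorizer()
train_m = vec.fit_transform(train_docs)  # learns the vocabulary, then transforms
test_m = vec.transform(test_docs)        # reuses the learned vocabulary

print(sorted(vec.vocabulary_))
print(train_m.shape, test_m.shape)
print('tomorrow in vocabulary:', 'tomorrow' in vec.vocabulary_)
```

Calling `fit_transform` on the test set instead would build a different vocabulary, and the columns would no longer line up with what the classifier was trained on.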

# Model or Classifier

Classification: build a model (here using Naive Bayes) on the train data and labels, then predict the labels of the test set.

In [None]:
# Classification: build a model (here using Naive Bayes) 
# based on the *train* set ONLY;
# then use the model to predict the target (labels) in the *test* data ONLY
# see the slides on Classification on Canvas

# train = build the model (fit) on the train set
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(X_train_tfidf, y_train)  
    
# test results = predict on test set data, to get predicted labels
predicted = clf.predict(X_test_tfidf)
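
As an optional alternative, the vectorizer and the classifier can be bundled with scikit-learn's `make_pipeline`, so that `fit` only ever sees training text and `predict` applies `transform` automatically. This is a sketch on made-up messages (all texts and labels below are hypothetical), not a replacement for the cells above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# hypothetical toy training data: 1 = spam, 0 = ham
train_texts = ['win free prize now', 'free cash prize',
               'see you at lunch', 'call me later']
train_labels = [1, 1, 0, 0]

# the pipeline runs fit_transform + fit on train, transform + predict on test
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

toy_pred = model.predict(['free prize', 'see you later'])
print(toy_pred.tolist())
```

On this toy data the first message is labelled spam and the second ham, because their words appear only in the spam and ham training texts respectively.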

In [None]:
# what is the accuracy of the model prediction on the test set?

# basically how many predicted labels match the y_test, which are the 
# labels of the test set (true labels in the data)
from sklearn import metrics

print(metrics.accuracy_score(y_test, predicted))

0.9623318385650225
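
Accuracy is simply the fraction of predicted labels that match the true labels. A minimal sketch on made-up arrays (the values in `y_true` and `y_pred` are hypothetical) compares the hand computation with `accuracy_score`:

```python
import numpy as np
from sklearn import metrics

# hypothetical true and predicted labels; 4 of the 5 agree
y_true = np.array([0, 0, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0])

# fraction of positions where prediction and truth match
manual_acc = (y_true == y_pred).mean()
sk_acc = metrics.accuracy_score(y_true, y_pred)
print(manual_acc, sk_acc)
```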