# Spam Text Classification using kNN Classifier

Data Source:
https://www.kaggle.com/team-ai/spam-text-message-classification

In [1]:
import pandas as pd
df = pd.read_csv('spam.csv')
df.head()

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


## Lemmatization and Stemming

The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.

The two methods differ as follows:
- **Lemmatization** usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
- **Stemming** usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes.

source: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
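
To make the contrast concrete, here is a toy sketch in plain Python. It is not how NLTK implements either step: `toy_stem`, `toy_lemmatize`, the suffix list, and `LEMMA_TABLE` are all hypothetical and invented for illustration. It only shows that a suffix-chopping heuristic can produce non-words and misses irregular forms, while a vocabulary lookup returns dictionary forms for the words it knows.

```python
# Toy illustration only: real stemmers and lemmatizers are far more elaborate.

# Hypothetical mini-vocabulary mapping inflected forms to their lemmas.
LEMMA_TABLE = {"studies": "study", "studying": "study", "went": "go"}

def toy_stem(word):
    """Crude heuristic: chop the first matching suffix off the end of the word."""
    for suffix in ("ing", "ies", "es", "s"):
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Vocabulary lookup: return the dictionary form when the word is known."""
    return LEMMA_TABLE.get(word, word)

words = ["studies", "studying", "went"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]
print(stems)   # ['stud', 'study', 'went']
print(lemmas)  # ['study', 'study', 'go']
```

The stemmer maps "studies" and "studying" to different strings and leaves "went" untouched, whereas the lookup maps all three to their lemmas.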

In [2]:
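import nltk
# nltk.download('wordnet')  # needed the first time this is run

# Lemmatizer
# Note: lemmatize() expects a single word. Each call below receives a whole
# message, so most messages are likely returned unchanged.
lemma = nltk.stem.WordNetLemmatizer()
lem = []
for i in df.Message:
    j = lemma.lemmatize(i)
    lem.append(j)

# Stemmer
# Note: stem() also expects a single word, so here it lowercases each message
# and mainly affects the suffix at the very end of the string.
sno = nltk.stem.SnowballStemmer('english')
stem = []
for i in lem:
    j = sno.stem(i)
    stem.append(j)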

## Vectorization
Vectorization converts a collection of text documents into a matrix of token counts, with one row per document and one column per vocabulary term.
<br/>
<img src="image/vector.jpeg" width=400  />
source: https://towardsdatascience.com/another-twitter-sentiment-analysis-with-python-part-4-count-vectorizer-b3f4944e51b5
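
As a rough sketch of what a count matrix looks like, the snippet below builds one by hand with the standard library. The three messages are made up, and the sketch deliberately omits the stop-word removal and the `min_df` cut-off (terms that appear in fewer than `min_df` documents are dropped) used in the notebook cell.

```python
from collections import Counter

# Made-up messages, already lowercased and whitespace-separated.
docs = ["free entry win win", "ok see you", "win free prize"]

# Vocabulary: every distinct token, sorted so the column order is deterministic.
vocab = sorted({tok for d in docs for tok in d.split()})

# One row per document, one column per vocabulary term.
counts = []
for d in docs:
    c = Counter(d.split())
    counts.append([c[t] for t in vocab])

print(vocab)   # ['entry', 'free', 'ok', 'prize', 'see', 'win', 'you']
for row in counts:
    print(row)
# [1, 1, 0, 0, 0, 2, 0]
# [0, 0, 1, 0, 1, 0, 1]
# [0, 1, 0, 1, 0, 1, 0]
```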

In [3]:
# nltk.download('stopwords')  # needed the first time this is run

from nltk.corpus import stopwords

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(min_df=35, stop_words = stopwords.words('english'))

cv_X = cv.fit_transform(stem)
fea = cv.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2; before 1.0, use get_feature_names()
mat = cv_X.toarray()
print(mat)
print(mat.shape)


[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
(5572, 244)


## TF-IDF
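TF-IDF (term frequency–inverse document frequency) weights each term by how often it occurs in a document (term frequency), scaled down by how many documents in the collection contain it (inverse document frequency). Terms that appear in most documents therefore receive a low weight.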
<br/>
<img src="image/tfidf.png" width=400  />
source: https://www.researchgate.net/publication/319996754_Finding_discriminative_and_interpretable_patterns_in_sequences_of_surgical_activities/figures?lo=1&utm_source=google&utm_medium=organic
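
The weighting can be worked through by hand. The sketch below assumes the default settings of scikit-learn's `TfidfTransformer`: a smoothed idf of `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row. The small count matrix is made up for illustration.

```python
import math

# Made-up count matrix: 3 documents x 3 terms.
counts = [
    [1, 1, 0],
    [0, 1, 0],
    [0, 1, 2],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: number of documents that contain each term.
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency (assumed scikit-learn default).
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in doc_freq]

def tfidf_row(row):
    """Weight raw counts by idf, then scale the row to unit L2 length."""
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm for v in weighted] if norm > 0 else weighted

tfidf = [tfidf_row(r) for r in counts]

print(doc_freq)                      # [1, 3, 1]
print([round(w, 3) for w in idf])    # [1.693, 1.0, 1.693]
```

The middle term occurs in every document, so it gets the smallest idf, while the two rarer terms are weighted more heavily.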

In [4]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidfconverter = TfidfTransformer()
mat = tfidfconverter.fit_transform(mat).toarray()
X = pd.DataFrame(mat, columns = fea)

# categorical mapping: ham -> 1, spam -> 0, so spam is class 0 in the classification reports
di = {'ham': 1, 'spam' : 0}
y = df.Category.map(di)

print(X)

       10  100  1000  150p   16   18  1st   50  500  account  ...     world  \
0     0.0  0.0   0.0   0.0  0.0  0.0  0.0  0.0  0.0      0.0  ...  0.552623   
1     0.0  0.0   0.0   0.0  0.0  0.0  0.0  0.0  0.0      0.0  ...  0.000000   
2     0.0  0.0   0.0   0.0  0.0  0.0  0.0  0.0  0.0      0.0  ...  0.000000   
3     0.0  0.0   0.0   0.0  0.0  0.0  0.0  0.0  0.0      0.0  ...  0.000000   
4     0.0  0.0   0.0   0.0  0.0  0.0  0.0  0.0  0.0      0.0  ...  0.000000   
...   ...  ...   ...   ...  ...  ...  ...  ...  ...      ...  ...       ...   
5567  0.0  0.0   0.0   0.0  0.0  0.0  0.0  0.0  0.0      0.0  ...  0.000000   
5568  0.0  0.0   0.0   0.0  0.0  0.0  0.0  0.0  0.0      0.0  ...  0.000000   
5569  0.0  0.0   0.0   0.0  0.0  0.0  0.0  0.0  0.0      0.0  ...  0.000000   
5570  0.0  0.0   0.0   0.0  0.0  0.0  0.0  0.0  0.0      0.0  ...  0.000000   
5571  0.0  0.0   0.0   0.0  0.0  0.0  0.0  0.0  0.0      0.0  ...  0.000000   

      would  www  xxx

## Cosine Similarity

Cosine similarity measures how similar two sentences or documents are as the cosine of the angle between their vectors. For non-negative TF-IDF vectors it ranges from 0 (no terms in common) to 1 (same direction).
<br/>
<img src="image/cosine.png" width=400  />
source: https://www.researchgate.net/publication/255181106_Xlang_ISCIS/figures?lo=1&utm_source=google&utm_medium=organic
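
A minimal sketch of the computation for a single pair of vectors, using only the standard library. The helper name `cosine` is hypothetical. Returning 0 for an all-zero vector is an assumption made here; it appears consistent with the zeros on the diagonal of the printed matrix, which would correspond to messages containing none of the retained vocabulary terms.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention for all-zero vectors
    return dot / (norm_a * norm_b)

same = cosine([1.0, 0.0], [1.0, 0.0])        # identical direction
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])  # no shared terms
partial = cosine([1.0, 1.0], [1.0, 0.0])     # one shared term out of two
empty = cosine([0.0, 0.0], [1.0, 0.0])       # all-zero vector

print(same, orthogonal, round(partial, 4), empty)  # 1.0 0.0 0.7071 0.0
```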

In [5]:
from sklearn.metrics.pairwise import cosine_similarity
sim = cosine_similarity(X, X)
print(sim)

[[1. 0. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 [0. 0. 1. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 1. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


## kNN classifier

In [6]:
from sklearn.model_selection import train_test_split
# stratify=y keeps the ham/spam proportions the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)


In [7]:

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
    
# Fit and predict
knn.fit(X_train, y_train)            # fit the model to the training data
y_pred_train = knn.predict(X_train)  # predictions on the training data
y_pred = knn.predict(X_test)         # predictions on the test data
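
For intuition, the sketch below shows the basic kNN rule in plain Python: find the `k` nearest training rows and take a majority vote. It is a simplification, not scikit-learn's implementation. The helper `knn_predict` and the toy data are hypothetical. Note that `KNeighborsClassifier` as configured above uses its default Euclidean distance; the `sim` matrix from the cosine similarity cell is not passed to it. Because the TF-IDF rows are L2-normalised, ranking neighbours by Euclidean distance should give the same order as ranking by cosine similarity for non-zero rows.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training rows."""
    # Pair each Euclidean distance with its label, nearest first.
    dists = sorted(
        (math.dist(row, query), label) for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy two-feature data: label 1 clusters near (0, 1), label 0 near (1, 0).
train_X = [[0.0, 1.0], [0.1, 0.9], [1.0, 0.0], [0.9, 0.1], [0.8, 0.3]]
train_y = [1, 1, 0, 0, 0]

pred_a = knn_predict(train_X, train_y, [0.05, 0.95])
pred_b = knn_predict(train_X, train_y, [0.95, 0.05])
print(pred_a, pred_b)  # 1 0
```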

## Model Performance

In [8]:
from sklearn.metrics import classification_report
print('training data classification report: \n' + classification_report(y_train,y_pred_train))
print('testing data classification report: \n' +classification_report(y_test,y_pred))

training data classification report: 
              precision    recall  f1-score   support

           0       0.97      0.85      0.91       598
           1       0.98      1.00      0.99      3859

    accuracy                           0.98      4457
   macro avg       0.97      0.92      0.95      4457
weighted avg       0.98      0.98      0.98      4457

testing data classification report: 
              precision    recall  f1-score   support

           0       0.95      0.83      0.88       149
           1       0.97      0.99      0.98       966

    accuracy                           0.97      1115
   macro avg       0.96      0.91      0.93      1115
weighted avg       0.97      0.97      0.97      1115
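
With the mapping used earlier (`ham` to 1, `spam` to 0), class 0 in these reports is spam. The test recall of 0.83 for class 0 therefore means that roughly 17% of the 149 spam messages in the test set were predicted as ham. The sketch below shows how per-class precision, recall, and F1 are derived from true and predicted labels. The helper `prf` and the label lists are made up for illustration and are not taken from the dataset.

```python
def prf(y_true, y_pred, positive):
    """Precision, recall, and F1 for the class treated as `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Made-up labels: 0 = spam, 1 = ham.
y_true_toy = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred_toy = [0, 0, 1, 1, 1, 1, 1, 0]

spam_scores = prf(y_true_toy, y_pred_toy, positive=0)
ham_scores = prf(y_true_toy, y_pred_toy, positive=1)
print([round(s, 2) for s in spam_scores])  # [0.67, 0.67, 0.67]
print([round(s, 2) for s in ham_scores])   # [0.8, 0.8, 0.8]
```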

