# K-Nearest Neighbors (K-NN)

### Following the course implementation, reach over 90% accuracy on the test set of the datasets_483_982_spam.csv dataset

## Importing the libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import os
import glob
import codecs
import re

## Importing the dataset

In [4]:
dataset = pd.read_csv('./datasets/datasets_483_982_spam.csv', encoding = 'latin-1')

dataset.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [6]:
dataset.shape

(5572, 5)

In [8]:
# Convert the text class labels to integers (spam = 1, ham = 0)

# .copy() makes the two-column selection an independent DataFrame, so the
# assignment below does not trigger pandas' SettingWithCopyWarning
dataset = dataset[['v1', 'v2']].copy()
dataset.columns = ['label', 'content']
dataset['label'] = dataset['label'].apply(func=lambda x: 1 if x == 'spam' else 0)
dataset.head()


Unnamed: 0,label,content
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


### Extracting the message text and the labels

In [11]:
all_data = dataset.to_numpy()

In [14]:
X = all_data[:,1]
Y = all_data[:,0].astype(np.uint8)

print(X[0])
print(Y[0])

Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
0


In [15]:
print('Training Data Examples : \n{}'.format(X[:5]))

Training Data Examples : 
['Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'
 'Ok lar... Joking wif u oni...'
 "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"
 'U dun say so early hor... U c already then say...'
 "Nah I don't think he goes to usf, he lives around here though"]


In [16]:
print('Labeling Data Examples : \n{}'.format(Y[:5]))

Labeling Data Examples : 
[0 0 1 0 0]


### Text preprocessing

In [18]:
from sklearn.metrics import confusion_matrix
from nltk.corpus import stopwords

import nltk

nltk.download('stopwords')

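# Lemmatizer (called below without a POS tag, so WordNet treats each word as a noun)
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer 

lemmatizer = WordNetLemmatizer()

# Clean the text the way the course exercise does, or use your own approach.
# nltk.word_tokenize and WordNetLemmatizer also rely on NLTK's tokenizer and
# WordNet data, which must already be downloaded.
def clean_content(X):
    X_clean = [re.sub('[^a-zA-Z]', ' ', x).lower() for x in X]    # replace non-letters with spaces, then lowercase
    X_word_tokenize = [nltk.word_tokenize(x) for x in X_clean]    # tokenize
    stop_words = set(stopwords.words('english'))                  # English stop words
    X_stopwords_lemmatizer = []
    for content in X_word_tokenize:
        content_clean = []
        for word in content:
            if word not in stop_words:
                word = lemmatizer.lemmatize(word)
            # NOTE: this append sits outside the `if`, so stop words are kept
            # (unlemmatized) rather than removed, as the X[0] output shows.
            content_clean.append(word)
        X_stopwords_lemmatizer.append(content_clean)
    X_output = [' '.join(x) for x in X_stopwords_lemmatizer]
    
    return X_output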

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\aband\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [19]:
X = clean_content(X)

In [21]:
X[0]

'go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat'
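
The cleaned message above still contains stop words such as "until", "only", and "in", because `clean_content` appends every token whether or not it passed the stop-word check. The following is a minimal sketch of a variant that drops them. It rests on several assumptions: `TOY_STOP_WORDS` and `drop_stop_words` are hypothetical names, a small hard-coded set stands in for NLTK's English stop-word list, `str.split()` stands in for `nltk.word_tokenize`, and lemmatization is left out so the example runs without any downloads.

```python
import re

# Hypothetical, tiny stand-in for stopwords.words('english')
TOY_STOP_WORDS = {"until", "only", "in", "there", "the", "a", "to"}

def drop_stop_words(text, stop_words=TOY_STOP_WORDS):
    """Lowercase the text, keep letters only, and drop stop words."""
    letters_only = re.sub("[^a-zA-Z]", " ", text).lower()
    # str.split() stands in for nltk.word_tokenize in this sketch
    kept = [word for word in letters_only.split() if word not in stop_words]
    return " ".join(kept)

sample = "Go until jurong point, crazy.. Available only in bugis"
print(drop_stop_words(sample))  # go jurong point crazy available bugis
```

Switching the notebook to this behavior would change the vocabulary, so the accuracy figures reported further down would have to be re-run rather than reused.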

### Bag of words

In [22]:
from sklearn.feature_extraction.text import CountVectorizer
# max_features caps the number of columns; the most frequent terms across the corpus are kept
cv = CountVectorizer(max_features = 1000)
X = cv.fit_transform(X).toarray()

In [23]:
X.shape

(5572, 1000)
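
To make the `(5572, 1000)` shape concrete: each row is one message and each column is the count of one retained term. The sketch below shows roughly what a bag-of-words matrix with a feature cap looks like, using only the standard library. It is an illustration, not scikit-learn's implementation: `toy_bag_of_words` is a hypothetical helper, tokenization is a plain whitespace split, and ties between equally frequent terms are not handled the way `CountVectorizer` handles them.

```python
from collections import Counter

def toy_bag_of_words(docs, max_features):
    """Count terms per document, keeping only the max_features most frequent terms."""
    corpus_counts = Counter(word for doc in docs for word in doc.split())
    # Keep the most frequent terms, then order the columns alphabetically
    vocabulary = sorted(word for word, _ in corpus_counts.most_common(max_features))
    rows = []
    for doc in docs:
        doc_counts = Counter(doc.split())
        rows.append([doc_counts[word] for word in vocabulary])
    return vocabulary, rows

docs = ["free entry win", "ok free free", "win win win prize"]
vocabulary, rows = toy_bag_of_words(docs, max_features=2)
print(vocabulary)  # ['free', 'win']
print(rows)        # [[1, 1], [2, 0], [0, 3]]
```

Terms outside the cap ("entry", "ok", "prize" here) simply have no column, which is why rarer words cannot influence the neighbor distances later on.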

## Splitting the dataset into the Training set and Test set

In [24]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)

## Training the K-NN model on the Training set

In [26]:
# Start with the default settings; the distance weighting is changed later

from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(weights='uniform')
classifier.fit(X_train, y_train)

KNeighborsClassifier()

## Scoring the Training set and Test set

In [27]:
print('Trainset Accuracy: {}'.format(classifier.score(X_train, y_train)))

Trainset Accuracy: 0.9452546555979359


In [28]:
print('Testset Accuracy: {}'.format(classifier.score(X_test, y_test)))

Testset Accuracy: 0.9228699551569507


## Predicting the Test set results

In [29]:
y_pred = classifier.predict(X_test)

## Making the Confusion Matrix

In [30]:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[949   0]
 [ 86  80]]


0.9228699551569507

## K-NN with distance weighting
- Test-set accuracy improves from 0.9229 to 0.9444.

In [31]:
# Switch the neighbor weighting from uniform to distance

knn_dist = KNeighborsClassifier(weights='distance')
knn_dist.fit(X_train, y_train)

KNeighborsClassifier(weights='distance')

In [35]:
print('Trainset Accuracy: {}'.format(knn_dist.score(X_train, y_train)))

Trainset Accuracy: 0.9997756338344178


In [34]:
print('Testset Accuracy: {}'.format(knn_dist.score(X_test, y_test)))

Testset Accuracy: 0.9443946188340807
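
The two classifiers differ only in how the neighbors vote. With `weights='uniform'` each of the k nearest neighbors counts once; with `weights='distance'` closer neighbors count more. The sketch below illustrates the idea in plain Python. `knn_vote` and the neighbor list are hypothetical, and the inverse-distance weight assumes every distance is greater than zero.

```python
from collections import defaultdict

def knn_vote(neighbors, weights="uniform"):
    """Pick a label from (distance, label) pairs for the k nearest points."""
    scores = defaultdict(float)
    for distance, label in neighbors:
        if weights == "uniform":
            scores[label] += 1.0             # every neighbor counts once
        else:
            scores[label] += 1.0 / distance  # assumes distance > 0
    return max(scores, key=scores.get)

# Hypothetical neighbors: two close spam messages (1), three farther ham messages (0)
neighbors = [(0.5, 1), (1.0, 1), (2.0, 0), (2.0, 0), (4.0, 0)]
print(knn_vote(neighbors, "uniform"))   # 0: three ham votes beat two spam votes
print(knn_vote(neighbors, "distance"))  # 1: spam weight 3.0 beats ham weight 1.25
```

This also suggests why the `X_train` score of about 0.9998 should not be read as evidence of generalization: a training message is at distance zero from itself, so under distance weighting its own label is likely to dominate the vote. The test-set score is the meaningful comparison.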


In [36]:
from sklearn.metrics import confusion_matrix, accuracy_score

y_pred = knn_dist.predict(X_test)
cm_dist = confusion_matrix(y_test, y_pred)
print(cm_dist)
accuracy_score(y_test, y_pred)

[[949   0]
 [ 62 104]]


0.9443946188340807
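
Accuracy alone hides where the two models differ. Reading both confusion matrices as `[[TN, FP], [FN, TP]]` with spam (label 1) as the positive class, the sketch below recomputes accuracy and adds precision and recall for spam. `summarize` is a hypothetical helper; the counts are copied from the matrices printed in this notebook.

```python
def summarize(cm):
    """cm is [[TN, FP], [FN, TP]] with spam (label 1) as the positive class."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "spam_precision": tp / (tp + fp),
        "spam_recall": tp / (tp + fn),
    }

uniform = summarize([[949, 0], [86, 80]])
distance = summarize([[949, 0], [62, 104]])
for name, metrics in (("uniform", uniform), ("distance", distance)):
    print(name, {key: round(value, 4) for key, value in metrics.items()})
```

Both models flag no ham message as spam (precision 1.0), but spam recall is only about 0.48 with uniform weights and about 0.63 with distance weights, so most of the remaining error is spam that slips through as ham.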