# **NLP - Spam Classification**

## **Introduction**

This notebook is a worked example of classifying SMS messages as spam or ham (legitimate) using Natural Language Processing (NLP) techniques.

Dataset: 'SMS Spam Collection Dataset'

## **Setup**

### Libraries

In [0]:
import os

# Must be set before TensorFlow is imported; setting it afterwards does not silence the C++ logs.
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

import itertools
import string

import numpy as np
import pandas as pd

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import LabelBinarizer, LabelEncoder
from termcolor import colored
from tensorflow import keras

layers = keras.layers
models = keras.models

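### Dataset

In [0]:
# read_csv already returns a DataFrame, so no extra pd.DataFrame(...) wrapper is needed.
data = pd.read_csv("spam.csv", encoding="ISO-8859-1")

# Keep only the label column ("state") and the message text column ("context").
columns = ["state", "context"]
dataFrame = data[columns]

# DataFrame.size counts cells (rows x columns), not rows.
print("Dataset Size: ", dataFrame.size)
dataFrame.head()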

Dataset Size:  11144


|   | state | context |
|---|-------|---------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
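The "Dataset Size" printed above comes from `DataFrame.size`, which counts cells (rows times columns), so 11144 corresponds to 5572 messages across the two columns. A minimal sketch of the difference on a toy frame (the rows below are made up for illustration):

```python
import pandas as pd

# Toy frame with the same two columns as the notebook; the rows are illustrative only.
toy = pd.DataFrame({
    "state": ["ham", "spam", "ham"],
    "context": ["see you at 5", "win a free prize now", "ok thanks"],
})

n_cells = toy.size            # rows * columns
n_rows = len(toy)             # number of messages
_, n_cols = toy.shape         # shape is (rows, columns)

print("cells:", n_cells, "rows:", n_rows, "columns:", n_cols)
```

`len(dataFrame)` or `dataFrame.shape[0]` is usually the number to report when talking about how many messages the dataset holds.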


## **Data Preprocessing**

### Cleaning

In [0]:
dataFrame = dataFrame.dropna(how='any',axis=0)
print(dataFrame.size)
dataFrame.head()

11144


|   | state | context |
|---|-------|---------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
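The size stays at 11144 after `dropna`, which suggests that neither column contained missing values. A small sketch of what `dropna(how='any', axis=0)` does when values are missing, using a made-up frame:

```python
import numpy as np
import pandas as pd

# Illustrative frame: row 1 lacks the text, row 2 lacks the label.
toy = pd.DataFrame({
    "state": ["ham", "spam", None],
    "context": ["see you at 5", np.nan, "ok thanks"],
})

# how='any' drops a row if at least one of its values is missing; axis=0 means rows.
cleaned = toy.dropna(how="any", axis=0)

print(len(toy), "rows before,", len(cleaned), "rows after")
```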


### Replacement

In [0]:
# Optional manual label mapping, left disabled; LabelEncoder performs the encoding later on.
# dataFrame['state'] = dataFrame['state'].replace(['ham'], 0)
# dataFrame['state'] = dataFrame['state'].replace(['spam'], 1)
# dataFrame.head()

## **NLP Preprocessing**

### Noisy Entity Removal

In [0]:
print("String Punctuation: ",string.punctuation)
dataFrame.context = dataFrame.context.str.translate(str.maketrans('', '', string.punctuation))
dataFrame.head()

String Punctuation:  !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


|   | state | context |
|---|-------|---------|
| 0 | ham | Go until jurong point crazy Available only in ... |
| 1 | ham | Ok lar Joking wif u oni |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor U c already then say |
| 4 | ham | Nah I dont think he goes to usf he lives aroun... |
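The same translation-table technique can be tried on a single string outside pandas. The helper name `strip_punctuation` is hypothetical; the sample is row 1 of the table above:

```python
import string

# Translation table that maps every character in string.punctuation to None (deletion).
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(text):
    """Return text with all ASCII punctuation characters removed."""
    return text.translate(PUNCT_TABLE)

sample = "Ok lar... Joking wif u oni..."
print(strip_punctuation(sample))
```

Only ASCII punctuation is covered by `string.punctuation`; characters such as curly quotes or currency symbols like `£` pass through unchanged.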


### Lowercasing

In [0]:
dataFrame.context = dataFrame.context.str.lower()
dataFrame.head()

|   | state | context |
|---|-------|---------|
| 0 | ham | go until jurong point crazy available only in ... |
| 1 | ham | ok lar joking wif u oni |
| 2 | spam | free entry in 2 a wkly comp to win fa cup fina... |
| 3 | ham | u dun say so early hor u c already then say |
| 4 | ham | nah i dont think he goes to usf he lives aroun... |


### Tokenization

In [0]:
max_words = 100
tokenize = keras.preprocessing.text.Tokenizer(num_words=max_words, char_level=False)

## **Dataset Operations**

### Train/Test Split

In [0]:
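# Note: only the first 30% of the rows are used for training; the remaining 70% form the test set.
train_size = int(len(dataFrame) * 0.3)
print("Train size: %d" % train_size)
# Use the cleaned frame (not the raw `data`) so both sizes refer to the same rows.
print("Test size: %d" % (len(dataFrame) - train_size))

def train_test_split(series, train_size):
    # Positional split without shuffling: the first train_size rows train, the rest test.
    train = series[:train_size]
    test = series[train_size:]
    return train, test

train_y, test_y = train_test_split(dataFrame['state'], train_size)
train_x, test_x = train_test_split(dataFrame['context'], train_size)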

Train size: 1671
Test size: 3901
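The split above slices the rows in file order. If the file happened to be ordered in some way (for example by date or by label), shuffling before slicing would be safer. A minimal sketch, assuming a hypothetical helper `shuffled_split` and a made-up toy frame; it is not the split used in this notebook:

```python
import numpy as np
import pandas as pd

def shuffled_split(frame, train_fraction, seed=0):
    """Shuffle row positions with a fixed seed, then slice into train and test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(frame))
    train_size = int(len(frame) * train_fraction)
    train = frame.iloc[order[:train_size]]
    test = frame.iloc[order[train_size:]]
    return train, test

# Illustrative data with the same column names as the notebook.
toy = pd.DataFrame({
    "state": ["ham", "spam"] * 5,
    "context": ["message %d" % i for i in range(10)],
})

train, test = shuffled_split(toy, train_fraction=0.3)
print(len(train), len(test))
```

Fixing the seed keeps the split reproducible between runs, which matters when comparing models.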


In [0]:
tokenize.fit_on_texts(train_x)
x_train = tokenize.texts_to_matrix(train_x)
x_test = tokenize.texts_to_matrix(test_x)
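To make the vectorization step more concrete, the following sketch approximates in plain Python what `fit_on_texts` followed by `texts_to_matrix` produces in its default binary mode. It assumes that words are ranked by frequency, that index 0 is left unused, and that only words with an index below `num_words` are kept. The function names are hypothetical, and the real Keras tokenizer additionally filters punctuation and handles more edge cases:

```python
from collections import Counter

import numpy as np

def fit_vocabulary(texts, num_words):
    """Rank words by frequency; index 0 stays unused, indices >= num_words are dropped."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = [word for word, _ in counts.most_common()]
    return {word: i for i, word in enumerate(ranked, start=1) if i < num_words}

def texts_to_binary_matrix(texts, vocabulary, num_words):
    """One row per text; column j is 1.0 if the word with index j occurs in that text."""
    matrix = np.zeros((len(texts), num_words))
    for row, text in enumerate(texts):
        for word in text.lower().split():
            index = vocabulary.get(word)
            if index is not None:  # words outside the vocabulary are ignored
                matrix[row, index] = 1.0
    return matrix

# Illustrative training texts, not taken from the dataset.
train_texts = ["free entry win", "ok see you", "win free prize"]
vocab = fit_vocabulary(train_texts, num_words=5)
matrix = texts_to_binary_matrix(["free prize tomorrow"], vocab, num_words=5)

print(vocab)
print(matrix)
```

With `max_words = 100` the model therefore sees each message as a 100-dimensional 0/1 vector, and every word outside the most frequent training words is discarded. Fitting the vocabulary on the training texts only, as the notebook does, keeps test-set words from leaking into the features.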

### Label Encoding & Categorical Conversion

In [0]:
encoder = LabelEncoder()
encoder.fit(train_y)
y_train = encoder.transform(train_y)
y_test = encoder.transform(test_y)

In [0]:
num_classes = np.max(y_train) + 1
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

# Uncomment to inspect the resulting array shapes.
# print('x_train shape:', x_train.shape)
# print('x_test shape:', x_test.shape)
# print('y_train shape:', y_train.shape)
# print('y_test shape:', y_test.shape)

In [0]:
text_labels = encoder.classes_
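The two label steps above (integer encoding, then one-hot conversion) can be reproduced with NumPy alone, which makes the resulting arrays easy to inspect. This is a sketch of the idea on made-up labels, not a replacement for `LabelEncoder` or `to_categorical`:

```python
import numpy as np

labels = np.array(["ham", "spam", "ham", "ham"])  # illustrative labels

# Sorted unique classes plus the integer index of each label.
classes, encoded = np.unique(labels, return_inverse=True)

# Picking one row of the identity matrix per label yields the one-hot encoding.
num_classes = int(encoded.max()) + 1
one_hot = np.eye(num_classes)[encoded]

print(classes)
print(encoded)
print(one_hot)
```

Because the classes are sorted, `ham` maps to column 0 and `spam` to column 1, which is the order later stored in `text_labels`.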

## **Training/Testing**

Training the model.

### Building the Model

In [0]:
batch_size = 32   # number of samples per gradient update; conventionally a power of 2 (32, 64, 128, ...), though this is not required.
epochs = 10       # number of passes over the training set; 1 epoch means every training sample is seen once.
drop_ratio = 0.5  # dropout rate: fraction of units randomly zeroed during training.

model = models.Sequential()
model.add(layers.Dense(512, input_shape=(max_words,)))
model.add(layers.Activation('relu'))
model.add(layers.Dropout(drop_ratio))
model.add(layers.Dense(100))
model.add(layers.Activation('relu'))
model.add(layers.Dense(num_classes))
model.add(layers.Activation('softmax'))  # softmax turns the output scores into class probabilities that sum to 1.

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=0,
                    validation_split=0.1)
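The final softmax layer and the `categorical_crossentropy` loss can be written out in a few lines of NumPy. The sketch below uses made-up scores and is only meant to show the arithmetic; Keras' own implementation may differ in details such as the clipping constant:

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1 along the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-probability assigned to the true class."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)  # avoid log(0)
    return float(-(y_true * np.log(y_prob)).sum(axis=-1).mean())

logits = np.array([[2.0, -1.0], [0.5, 1.5]])  # illustrative scores for [ham, spam]
y_true = np.array([[1.0, 0.0], [0.0, 1.0]])   # one-hot targets

probs = softmax(logits)
print(probs)
print(categorical_crossentropy(y_true, probs))
```

The loss shrinks toward zero as the probability assigned to the correct class approaches 1, which is what training tries to achieve.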

In [0]:
#model.save("model.h5")

### Evaluation
Measuring the model's metrics on the test data.

In [0]:
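# Class probabilities for the test set, stored under a name that the predict() helper defined later does not overwrite.
predictions = model.predict(x_test)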

score = model.evaluate(x_test, y_test,batch_size=batch_size, verbose=1)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Test loss: 0.18275538086891174
Test accuracy: 0.9618046879768372
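With one-hot targets, accuracy amounts to comparing the argmax of each prediction with the argmax of each label. When one class dominates, accuracy alone can hide weak detection of the rarer class, so it can help to look at recall for spam as well. A minimal sketch with made-up arrays (the helper name is hypothetical and the numbers are not results from this model):

```python
import numpy as np

def accuracy_and_recall(y_true_one_hot, y_prob, positive_class=1):
    """Accuracy over all samples and recall for one class, given one-hot targets."""
    true_idx = y_true_one_hot.argmax(axis=1)
    pred_idx = y_prob.argmax(axis=1)
    accuracy = float((true_idx == pred_idx).mean())
    positives = true_idx == positive_class
    # Assumes at least one sample of the positive class is present.
    recall = float((pred_idx[positives] == positive_class).mean())
    return accuracy, recall

# Three ham messages and one spam message; the spam message is misclassified.
y_true = np.array([[1, 0], [1, 0], [1, 0], [0, 1]])
y_prob = np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]])

acc, spam_recall = accuracy_and_recall(y_true, y_prob)
print("accuracy:", acc, "spam recall:", spam_recall)
```

In this toy case the accuracy looks acceptable even though no spam message is caught, which is the situation the extra metric is meant to expose.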


## **Prediction**

Making predictions on new text supplied to the model from outside the dataset.

In [0]:
def predict(title, actual):
    # Vectorize the raw text with the tokenizer fitted on the training set.
    temp = tokenize.texts_to_matrix([title])
    prediction = model.predict(np.array([temp[0]]))
    # Softmax probability of the winning class: the model's confidence, not an accuracy.
    confidence = prediction[0][np.argmax(prediction)]
    predicted_label = text_labels[np.argmax(prediction)]

    # Green when the confidence exceeds 0.80, red otherwise.
    color = 'green' if confidence > 0.80 else 'red'
    print(colored("Data            : " + title, color))
    print(colored("Actual label    : " + actual, color))
    print(colored("Confidence      : " + str(confidence), color))
    print(colored("Predicted label : " + str(predicted_label), color))
    print("")

In [0]:
print("Labels: ", text_labels)
print("")

print("-Samples to Predict-")
print("--------------------------------")
predict('We will give you $1,000 for sending an e-mail to your friends.  AB Mailing, Inc. is proud to anounce the start of a new contest.  Each day until January, 31 1999, one lucky Internet or AOL user who forwards our advertisement to their friends will be randomly picked to receive $1,000! You could be the winner!','Spam')
predict('Have you finished your paperwork for Kaken and writing academic articles? If you have some free time in the near future, I want to meet you and explain to you our next project.','Normal e-mail')


Labels:  ['ham' 'spam']

-Samples to Predict-
--------------------------------
Data            : We will give you $1,000 for sending an e-mail to your friends.  AB Mailing, Inc. is proud to anounce the start of a new contest.  Each day until January, 31 1999, one lucky Internet or AOL user who forwards our advertisement to their friends will be randomly picked to receive $1,000! You could be the winner!
Actual label    : Spam
Confidence      : 0.9954035
Predicted label : spam

Data            : Have you finished your paperwork for Kaken and writing academic articles? If you have some free time in the near future, I want to meet you and explain to you our next project.
Actual label    : Normal e-mail
Confidence      : 0.9029755
Predicted label : ham

