<a href="https://colab.research.google.com/github/NguyenVuong22/Project-email-classification/blob/feature%2Femail_classification/Project_email_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**1. Download the dataset**

In [1]:
!gdown --id 1N7rk-kfnDFIGMeX0ROVTjKh71gcgx-7R

Downloading...
From: https://drive.google.com/uc?id=1N7rk-kfnDFIGMeX0ROVTjKh71gcgx-7R
To: /content/2cls_spam_text_cls.csv
100% 486k/486k [00:00<00:00, 78.2MB/s]


**2. Import the required libraries:**

In [2]:
import string
import nltk
nltk.download('stopwords')
nltk.download('punkt')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


**3. Read the data**

In [3]:
DATASET_PATH = '/content/2cls_spam_text_cls.csv'
df = pd.read_csv(DATASET_PATH)
print(df)

messages = df['Message'].values.tolist()
labels = df['Category'].values.tolist()

     Category                                            Message
0         ham  Go until jurong point, crazy.. Available only ...
1         ham                      Ok lar... Joking wif u oni...
2        spam  Free entry in 2 a wkly comp to win FA Cup fina...
3         ham  U dun say so early hor... U c already then say...
4         ham  Nah I don't think he goes to usf, he lives aro...
...       ...                                                ...
5567     spam  This is the 2nd time we have tried 2 contact u...
5568      ham               Will ü b going to esplanade fr home?
5569      ham  Pity, * was in mood for that. So...any other s...
5570      ham  The guy did some bitching but I acted like i'd...
5571      ham                         Rofl. Its true to its name

[5572 rows x 2 columns]


**4. Preprocess the labels**

In [4]:
le = LabelEncoder()
y = le.fit_transform(labels)
print(f'Classes: {le.classes_}')
print(f'Encoded labels: {y}')

Classes: ['ham' 'spam']
Encoded labels: [0 0 1 ... 0 0 0]
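
For intuition, the mapping printed above can be reproduced in plain Python: the class names are sorted and each label is replaced by its position in that sorted list. This is only an illustrative sketch using a small hypothetical `sample_labels` list, not a replacement for `LabelEncoder`.

```python
# Hypothetical sample; the notebook itself encodes the full `labels` list.
sample_labels = ['ham', 'ham', 'spam', 'ham']

# Sorted unique classes, matching the order shown in `le.classes_`.
classes = sorted(set(sample_labels))
class_to_index = {name: index for index, name in enumerate(classes)}

# Forward mapping (label -> integer) and inverse mapping (integer -> label).
encoded = [class_to_index[label] for label in sample_labels]
decoded = [classes[index] for index in encoded]

print(classes)   # ['ham', 'spam']
print(encoded)   # [0, 0, 1, 0]
```

The inverse lookup is what `le.inverse_transform` provides later when a numeric prediction is turned back into `ham` or `spam`.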


**5. Preprocess the features**

In [5]:
def lowercase(text):
  return text.lower()

def punctuation_removal(text):
  # Strip ASCII punctuation; non-ASCII symbols such as '£' are kept.
  translator = str.maketrans('', '', string.punctuation)
  return text.translate(translator)

def tokenize(text):
  return nltk.word_tokenize(text)

def remove_stopwords(tokens):
  # A set gives constant-time membership checks.
  stop_words = set(nltk.corpus.stopwords.words('english'))

  return [token for token in tokens if token not in stop_words]

def stemming(tokens):
  stemmer = nltk.PorterStemmer()

  return [stemmer.stem(token) for token in tokens]

def preprocess_text(text):
  # Pipeline: lowercase -> strip punctuation -> tokenize
  # -> drop stopwords -> stem.
  text = lowercase(text)
  text = punctuation_removal(text)
  tokens = tokenize(text)
  tokens = remove_stopwords(tokens)
  tokens = stemming(tokens)

  return tokens


In [15]:
# Coerce any non-string entries to str before preprocessing.
messages = [str(message) for message in messages]
messages = [preprocess_text(message) for message in messages]
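
One side effect of the punctuation step is worth seeing in isolation: punctuation is deleted rather than replaced with a space, so words joined by punctuation fuse together, and `string.punctuation` covers ASCII characters only. This is consistent with tokens such as `'questionstd'` and `'£150'` in the dictionary printed in step 6. The sketch below uses only the standard library; the input fragments are hypothetical examples.

```python
import string

def punctuation_removal(text):
    # Delete every character listed in string.punctuation (ASCII only).
    return text.translate(str.maketrans('', '', string.punctuation))

# Hypothetical fragments chosen for illustration.
fused = punctuation_removal("question(std txt rate)t&c's apply")
pound = punctuation_removal('send £150 now!')

print(fused)  # questionstd txt ratetcs apply
print(pound)  # send £150 now
```

If fused tokens turn out to hurt the model, one option is to map punctuation to spaces instead of deleting it; whether that helps on this dataset would need to be measured.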


**6. Build the dictionary (vocabulary)**

In [8]:
def create_dictionary(messages):
  # Collect unique tokens in order of first appearance.
  dictionary = []

  for tokens in messages:
    for token in tokens:
      if token not in dictionary:
        dictionary.append(token)

  return dictionary

dictionary = create_dictionary(messages)
print(dictionary)

['go', 'jurong', 'point', 'crazi', 'avail', 'bugi', 'n', 'great', 'world', 'la', 'e', 'buffet', 'cine', 'got', 'amor', 'wat', 'ok', 'lar', 'joke', 'wif', 'u', 'oni', 'free', 'entri', '2', 'wkli', 'comp', 'win', 'fa', 'cup', 'final', 'tkt', '21st', 'may', '2005', 'text', '87121', 'receiv', 'questionstd', 'txt', 'ratetc', 'appli', '08452810075over18', 'dun', 'say', 'earli', 'hor', 'c', 'alreadi', 'nah', 'dont', 'think', 'goe', 'usf', 'live', 'around', 'though', 'freemsg', 'hey', 'darl', '3', 'week', 'word', 'back', 'id', 'like', 'fun', 'still', 'tb', 'xxx', 'std', 'chg', 'send', '£150', 'rcv', 'even', 'brother', 'speak', 'treat', 'aid', 'patent', 'per', 'request', 'mell', 'oru', 'minnaminungint', 'nurungu', 'vettam', 'set', 'callertun', 'caller', 'press', '9', 'copi', 'friend', 'winner', 'valu', 'network', 'custom', 'select', 'receivea', '£900', 'prize', 'reward', 'claim', 'call', '09061701461', 'code', 'kl341', 'valid', '12', 'hour', 'mobil', '11', 'month', 'r', 'entitl', 'updat', 'late
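
Because `token not in dictionary` scans a list, building the dictionary this way slows down as the vocabulary grows. A possible alternative, sketched here under the assumption that only first-appearance order matters, keeps a dict from token to index so each membership check is a hash lookup. The name `create_dictionary_indexed` and the sample messages are hypothetical.

```python
def create_dictionary_indexed(messages):
    # Map each token to the position of its first appearance.
    token_to_index = {}
    for tokens in messages:
        for token in tokens:
            if token not in token_to_index:
                token_to_index[token] = len(token_to_index)
    return token_to_index

# Hypothetical pre-tokenized messages.
sample_messages = [['free', 'entri', 'win'], ['ok', 'free', 'u']]
token_to_index = create_dictionary_indexed(sample_messages)

print(token_to_index)
# {'free': 0, 'entri': 1, 'win': 2, 'ok': 3, 'u': 4}
print(list(token_to_index))
# ['free', 'entri', 'win', 'ok', 'u']
```

Since Python dicts preserve insertion order, `list(token_to_index)` yields the same ordering the list-based version would produce.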

**Create features that represent each message**

In [10]:
def create_features(tokens, dictionary):
    features = np.zeros(len(dictionary))

    for token in tokens:
        if token in dictionary:
            features[dictionary.index(token)] += 1

    return features

X = np.array([create_features(tokens, dictionary) for tokens in messages])
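
Each row of `X` is a bag-of-words count vector: position `i` holds how often dictionary token `i` occurs in the message, and tokens outside the dictionary are ignored. The sketch below shows the same idea on a toy input, assuming a hypothetical `token_to_index` mapping in place of `dictionary.index`, which avoids a list scan per token.

```python
import numpy as np

def create_features_indexed(tokens, token_to_index):
    # Count occurrences of each known token; unknown tokens are skipped.
    features = np.zeros(len(token_to_index))
    for token in tokens:
        index = token_to_index.get(token)
        if index is not None:
            features[index] += 1
    return features

# Hypothetical vocabulary and message.
token_to_index = {'free': 0, 'entri': 1, 'win': 2}
vector = create_features_indexed(['free', 'win', 'free', 'cash'], token_to_index)

print(vector)  # [2. 0. 1.]
```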


**7. Split the dataset into train/val/test:**

In [11]:
VAL_SIZE = 0.2
TEST_SIZE = 0.125
SEED = 0

X_train, X_val, y_train, y_val = train_test_split(X, y,
                                                  test_size=VAL_SIZE,
                                                  shuffle=True,
                                                  random_state=SEED)
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train,
                                                    test_size=TEST_SIZE,
                                                    shuffle=True,
                                                    random_state=SEED)
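
The two calls carve out the validation set first and then take the test set from what remains, so `TEST_SIZE = 0.125` applies to the remaining 80% of the data and corresponds to roughly 10% of the full dataset. The arithmetic can be sketched as below. The rounding rule (a float `test_size` is rounded up, the remainder goes to training) is an assumption about `train_test_split`, although the resulting sizes are consistent with the accuracies reported in step 9.

```python
import math

N_SAMPLES = 5572   # rows reported by print(df)
VAL_SIZE = 0.2
TEST_SIZE = 0.125

# Assumption: a float test_size is rounded up; the rest stays in training.
n_val = math.ceil(VAL_SIZE * N_SAMPLES)
n_remaining = N_SAMPLES - n_val
n_test = math.ceil(TEST_SIZE * n_remaining)
n_train = n_remaining - n_test

for name, count in [('train', n_train), ('val', n_val), ('test', n_test)]:
    print(f'{name}: {count} rows ({count / N_SAMPLES:.1%})')
```

Under that assumption the split comes out close to 70% train, 20% validation, and 10% test.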

**8. Train the model**

In [12]:
%%time
model = GaussianNB()
print('Start training...')
model = model.fit(X_train, y_train)
print('Training completed!')

Start training...
Training completed!
CPU times: user 313 ms, sys: 142 ms, total: 455 ms
Wall time: 446 ms


**9. Evaluate the model**

In [13]:
y_val_pred = model.predict(X_val)
y_test_pred = model.predict(X_test)

val_accuracy = accuracy_score(y_val, y_val_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

print(f'Val accuracy: {val_accuracy}')
print(f'Test accuracy: {test_accuracy}')

Val accuracy: 0.8816143497757848
Test accuracy: 0.8602150537634409
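
Accuracy here is simply the fraction of predictions that equal the true label. A minimal NumPy sketch of that definition follows; the helper name `accuracy` and the toy arrays are hypothetical and stand in for `accuracy_score` and the real predictions.

```python
import numpy as np

def accuracy(y_true, y_pred):
    # Fraction of positions where prediction and label agree.
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

# Hypothetical toy labels (0 = ham, 1 = spam).
toy_true = [0, 0, 1, 1]
toy_pred = [0, 1, 1, 1]

print(accuracy(toy_true, toy_pred))  # 0.75
```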


**10. Make a prediction**

In [14]:
def predict(text, model, dictionary):
  processed_text = preprocess_text(text)
  # Build features from the preprocessed tokens, not the raw string.
  features = create_features(processed_text, dictionary)
  features = np.array(features).reshape(1, -1)
  prediction = model.predict(features)
  prediction_cls = le.inverse_transform(prediction)[0]
  return prediction_cls

test_input = 'I am actually thinking a way of doing something useful'
prediction_cls = predict(test_input, model, dictionary)
print(f'Prediction : {prediction_cls}')

Prediction : ham
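
Note that `create_features` iterates over its first argument, so it needs the token list returned by `preprocess_text`. A raw string would be walked character by character, and only single-character dictionary entries (such as `'u'`) could ever match. The sketch below illustrates the difference with a hypothetical three-word dictionary.

```python
import numpy as np

def create_features(tokens, dictionary):
    # Same counting logic as in the notebook.
    features = np.zeros(len(dictionary))
    for token in tokens:
        if token in dictionary:
            features[dictionary.index(token)] += 1
    return features

toy_dictionary = ['u', 'win', 'free']  # hypothetical vocabulary

# Token list: every word is matched against the dictionary.
from_tokens = create_features(['u', 'win', 'free'], toy_dictionary)
# Raw string: iteration yields single characters, so only 'u' matches.
from_string = create_features('u win free', toy_dictionary)

print(from_tokens)  # [1. 1. 1.]
print(from_string)  # [1. 0. 0.]
```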
