# **Data Preprocessing**

## **Download Dataset**

In [1]:
! pip install kaggle

Collecting kaggle
  Using cached kaggle-1.6.17-py3-none-any.whl
Collecting python-slugify (from kaggle)
  Using cached python_slugify-8.0.4-py2.py3-none-any.whl.metadata (8.5 kB)
Collecting bleach (from kaggle)
  Using cached bleach-6.2.0-py3-none-any.whl.metadata (30 kB)
Collecting webencodings (from bleach->kaggle)
  Using cached webencodings-0.5.1-py2.py3-none-any.whl.metadata (2.1 kB)
Collecting text-unidecode>=1.3 (from python-slugify->kaggle)
  Using cached text_unidecode-1.3-py2.py3-none-any.whl.metadata (2.4 kB)
Using cached bleach-6.2.0-py3-none-any.whl (163 kB)
Using cached python_slugify-8.0.4-py2.py3-none-any.whl (10 kB)
Using cached text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
Using cached webencodings-0.5.1-py2.py3-none-any.whl (11 kB)
Installing collected packages: webencodings, text-unidecode, python-slugify, bleach, kaggle
Successfully installed bleach-6.2.0 kaggle-1.6.17 python-slugify-8.0.4 text-unidecode-1.3 webencodings-0.5.1


In [2]:
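import os

# Create the Kaggle config directory portably; the shell form `mkdir ~/.kaggle` fails in the Windows command prompt
os.makedirs(os.path.expanduser('~/.kaggle'), exist_ok=True)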


In [3]:
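# files.upload() only exists inside Google Colab; elsewhere, copy kaggle.json into the working directory by hand
try:
    from google.colab import files
    files.upload()
except ModuleNotFoundError:
    print("Not running in Colab: place kaggle.json in the working directory manually.")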

In [None]:
!mv kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets download gevabriel/indonesian-sms-spam

mv: cannot stat 'kaggle.json': No such file or directory
chmod: cannot access '/root/.kaggle/kaggle.json': No such file or directory
Dataset URL: https://www.kaggle.com/datasets/gevabriel/indonesian-sms-spam
License(s): CC0-1.0
Downloading indonesian-sms-spam.zip to /content
  0% 0.00/58.6k [00:00<?, ?B/s]
100% 58.6k/58.6k [00:00<00:00, 79.2MB/s]


In [None]:
!unzip indonesian-sms-spam.zip

Archive:  indonesian-sms-spam.zip
replace sms_spam_indo.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: n


In [None]:
import pandas as pd
df = pd.read_csv('sms_spam_indo.csv')
df

Unnamed: 0,Kategori,Pesan
0,spam,Plg Yth: Simcard anda mendptkan bonus poin plu...
1,ham,Iya ih ko sedih sih gtau kapan lg ke bandung :(
2,ham,Kalau mau bikin model/controller mending per a...
3,ham,Selamat nama1. Semoga selalu menempuh hidup ya...
4,spam,Tingkatkan nilai isi ulang Anda selanjutnya mi...
...,...,...
1138,ham,Yg ragu sm bulet/datar atau yg pgn ikutan deba...
1139,ham,"Semangat yang ibu gita, ibu putri dan bapak ad..."
1140,ham,"nama1, minta database kamu sama view dan contr..."
1141,spam,Dapatkan GRATIS 1 cappuccino (hot/ice) & Freza...


## **Text Preprocessing**

### Cleaning
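Cleaning removes characters that add noise rather than meaning, such as punctuation, repeated whitespace, and stray
symbols. The function below deletes every character that is neither a word character nor whitespace, collapses runs of
whitespace into a single space, and trims the ends.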

In [None]:
import re

def clean_text(text):
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\s+', ' ', text)
    text = text.strip()
    return text

df['Pesan'] = df['Pesan'].apply(clean_text)
df

Unnamed: 0,Kategori,Pesan
0,spam,Plg Yth Simcard anda mendptkan bonus poin plus...
1,ham,Iya ih ko sedih sih gtau kapan lg ke bandung
2,ham,Kalau mau bikin modelcontroller mending per apa y
3,ham,Selamat nama1 Semoga selalu menempuh hidup yan...
4,spam,Tingkatkan nilai isi ulang Anda selanjutnya mi...
...,...,...
1138,ham,Yg ragu sm buletdatar atau yg pgn ikutan debat...
1139,ham,Semangat yang ibu gita ibu putri dan bapak adi...
1140,ham,nama1 minta database kamu sama view dan contro...
1141,spam,Dapatkan GRATIS 1 cappuccino hotice Freza seti...
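
The output above shows a side effect of deleting punctuation outright: `model/controller` becomes `modelcontroller` and `hot/ice` becomes `hotice`. One possible variant, sketched here as an assumption and not as the notebook's method, replaces each punctuation character with a space before collapsing whitespace, so word boundaries survive. The name `clean_text_keep_boundaries` is hypothetical.

```python
import re

def clean_text_keep_boundaries(text):
    # Replace every character that is neither a word character nor whitespace with a space
    text = re.sub(r'[^\w\s]', ' ', text)
    # Collapse runs of whitespace into a single space and trim the ends
    return re.sub(r'\s+', ' ', text).strip()

print(clean_text_keep_boundaries("Kalau mau bikin model/controller"))
# Kalau mau bikin model controller
```

Whether the merged or the separated form works better for spam detection would need to be checked against the classifier's results.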


### Case Folding
Case folding converts every letter in the messages to lowercase, so that variants such as "GRATIS" and "gratis" are treated as the same word.

In [None]:
df['Pesan'] = df['Pesan'].str.lower()
df.head()

Unnamed: 0,Kategori,Pesan
0,spam,plg yth simcard anda mendptkan bonus poin plus...
1,ham,iya ih ko sedih sih gtau kapan lg ke bandung
2,ham,kalau mau bikin modelcontroller mending per apa y
3,ham,selamat nama1 semoga selalu menempuh hidup yan...
4,spam,tingkatkan nilai isi ulang anda selanjutnya mi...


### Tokenizing
Tokenization splits a sentence into individual pieces, called tokens, which here are words. For example, the
sentence "I like eating fried rice" is split into the tokens "I", "like", "eating", "fried", and "rice".

In [None]:
import nltk
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize

# Tokenize the case-folded messages
df['Pesan'] = df['Pesan'].apply(word_tokenize)

# Show the first few rows after tokenizing
df.head()

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Unnamed: 0,Kategori,Pesan
0,spam,"[plg, yth, simcard, anda, mendptkan, bonus, po..."
1,ham,"[iya, ih, ko, sedih, sih, gtau, kapan, lg, ke,..."
2,ham,"[kalau, mau, bikin, modelcontroller, mending, ..."
3,ham,"[selamat, nama1, semoga, selalu, menempuh, hid..."
4,spam,"[tingkatkan, nilai, isi, ulang, anda, selanjut..."
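
For text that has already been lowercased and stripped of punctuation, splitting on whitespace gives a dependency-free approximation of word tokenization. This is only a sketch for intuition: `simple_tokenize` is a hypothetical helper, and it is not guaranteed to match `word_tokenize` on raw, uncleaned text.

```python
def simple_tokenize(text):
    # str.split() with no argument splits on any run of whitespace
    return text.split()

print(simple_tokenize("iya ih ko sedih sih gtau kapan lg ke bandung"))
# ['iya', 'ih', 'ko', 'sedih', 'sih', 'gtau', 'kapan', 'lg', 'ke', 'bandung']
```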


### Stopword Removal
Stopword removal drops words that carry little meaning on their own. For example, words like "and", "or", and "I"
appear often in text but convey little useful information for analysis. Removing them simplifies the text and keeps
the focus on more relevant words. Here, the Indonesian stopword list from the nltk (Natural Language Toolkit)
library is used.

In [None]:
from nltk.corpus import stopwords
nltk.download('stopwords')

# Indonesian stopwords
stop_words = set(stopwords.words('indonesian'))

# Remove stopwords from a list of tokens
def remove_stopwords(tokens):
    return [word for word in tokens if word not in stop_words]

# Apply stopword removal
df['Pesan'] = df['Pesan'].apply(remove_stopwords)

# Show the first few rows after stopword removal
df.head()


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Unnamed: 0,Kategori,Pesan
0,spam,"[plg, yth, simcard, mendptkan, bonus, poin, pl..."
1,ham,"[iya, ih, ko, sedih, sih, gtau, lg, bandung]"
2,ham,"[bikin, modelcontroller, mending, y]"
3,ham,"[selamat, nama1, semoga, menempuh, hidup, baha..."
4,spam,"[tingkatkan, nilai, isi, ulang, minimal, rp10r..."
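
The filtering itself is a plain membership test. The sketch below reproduces row 1 of the output above with a tiny hand-picked stopword set. The set `demo_stop_words` is an assumption for illustration, chosen from words that disappear between the tokenizing and stopword-removal outputs, and is not the full NLTK list.

```python
# Tiny stand-in for the NLTK Indonesian stopword list (assumption for illustration)
demo_stop_words = {"anda", "kapan", "ke", "kalau", "mau", "per", "apa"}

def remove_stopwords_demo(tokens, stop_words):
    # Keep only the tokens that are not in the stopword set
    return [t for t in tokens if t not in stop_words]

tokens = ["iya", "ih", "ko", "sedih", "sih", "gtau", "kapan", "lg", "ke", "bandung"]
print(remove_stopwords_demo(tokens, demo_stop_words))
# ['iya', 'ih', 'ko', 'sedih', 'sih', 'gtau', 'lg', 'bandung']
```

Using a `set` keeps each lookup at roughly constant time, which matters when the filter runs over every token in the dataset.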


### Stemming
This step strips affixes from the words in each message. Affixes are prefixes and suffixes that modify a word's meaning
or grammatical function. For example, in the English word "playing", the suffix "-ing" is attached to the root "play".
Reducing words to their roots lets different forms of the same word count as a single term. To perform this step, we
use the Sastrawi library, a stemmer for the Indonesian language.

In [None]:
!pip install Sastrawi
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

factory = StemmerFactory()
stemmer = factory.create_stemmer()

def stemming(words):
    return [stemmer.stem(word) for word in words]

df['Pesan'] = df['Pesan'].apply(stemming)
df



KeyboardInterrupt: 
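
The stemming cell ended with a `KeyboardInterrupt`, presumably because it was taking too long. One possible mitigation, offered as an assumption and not something the notebook does, is to cache each word's stem so that the stemmer runs only once per distinct word. `make_cached_stemmer` is a hypothetical helper that accepts any function mapping a word to its stem; in this notebook one would pass `stemmer.stem`. The toy stemmer below is a stand-in used only to show the caching behaviour.

```python
from functools import lru_cache

def make_cached_stemmer(stem_fn):
    # Wrap stem_fn so that each distinct word is stemmed only once
    cached = lru_cache(maxsize=None)(stem_fn)

    def stem_tokens(tokens):
        return [cached(token) for token in tokens]

    return stem_tokens

# Toy stand-in for a real stemmer; it records every call it receives
calls = []
def toy_stem(word):
    calls.append(word)
    return word[:-3] if word.endswith("kan") else word

stem_tokens = make_cached_stemmer(toy_stem)
print(stem_tokens(["dapatkan", "gratis", "dapatkan"]))  # ['dapat', 'gratis', 'dapat']
print(len(calls))  # 2, because the repeated word was served from the cache
```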

### Backup

In [None]:
df_backup = df.copy()
df_backup

Unnamed: 0,Kategori,Pesan
0,spam,"[plg, yth, simcard, mendptkan, bonus, poin, pl..."
1,ham,"[iya, ih, ko, sedih, sih, gtau, lg, bandung]"
2,ham,"[bikin, modelcontroller, mending, y]"
3,ham,"[selamat, nama1, semoga, menempuh, hidup, baha..."
4,spam,"[tingkatkan, nilai, isi, ulang, minimal, rp10r..."
...,...,...
1138,ham,"[yg, ragu, sm, buletdatar, yg, pgn, ikutan, de..."
1139,ham,"[semangat, gita, putri, adison, esok, semoga, ..."
1140,ham,"[nama1, database, view, controller, js, dropdo..."
1141,spam,"[dapatkan, gratis, 1, cappuccino, hotice, frez..."


In [None]:
df = df_backup.copy()

### Removing empty lists & converting datatype

In [None]:
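# Drop rows whose token list is empty; .copy() avoids a SettingWithCopyWarning on the assignment below
df = df[df['Pesan'].apply(lambda x: len(x) > 0)].copy()

# Join each token list back into a single string
df['Pesan'] = df['Pesan'].apply(lambda x: ' '.join(x))

df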

Unnamed: 0,Kategori,Pesan
0,spam,plg yth simcard mendptkan bonus poin plusplus ...
1,ham,iya ih ko sedih sih gtau lg bandung
2,ham,bikin modelcontroller mending y
3,ham,selamat nama1 semoga menempuh hidup bahagia me...
4,spam,tingkatkan nilai isi ulang minimal rp10ribu pa...
...,...,...
1138,ham,yg ragu sm buletdatar yg pgn ikutan debat kusir v
1139,ham,semangat gita putri adison esok semoga terbaik...
1140,ham,nama1 database view controller js dropdown kot...
1141,spam,dapatkan gratis 1 cappuccino hotice freza tran...


## **TF-IDF Weighting**
TF-IDF (Term Frequency-Inverse Document Frequency) scores how important each word is to a message. The weight grows with how often the word appears in that message (term frequency) and shrinks with how many messages in the dataset contain it (inverse document frequency). A word that is frequent in one message but rare elsewhere therefore gets a high weight, because it helps distinguish that message from the rest.
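
To make the weighting concrete, here is a minimal hand-rolled sketch on three made-up messages. It assumes the smoothed IDF `ln((1 + N) / (1 + df)) + 1` followed by L2 normalisation of each document, which is believed to correspond to scikit-learn's default `TfidfVectorizer` settings. The function `tfidf` and the sample messages are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalised)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term occurs in
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Raw weight = term count * smoothed inverse document frequency
        raw = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
               for t, c in counts.items()}
        # Scale the document vector to unit length
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        weights.append({t: w / norm for t, w in raw.items()})
    return weights

docs = ["gratis pulsa gratis", "pulsa habis", "besok kuliah"]
w = tfidf(docs)
print(w[0]["gratis"] > w[0]["pulsa"])  # True
```

In the first message, "gratis" occurs twice and appears in no other message, while "pulsa" occurs once and is shared with the second message, so "gratis" receives the larger weight.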

In [None]:
!pip install scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()

tfidf_matrix = vectorizer.fit_transform(df['Pesan'])

tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())

tfidf_df



Unnamed: 0,0000,00001200,0006,0006kecepatan,0009,001,0016285286552555,002359,00353918,008,...,yudisium,yuk,yuks,yuni,yunit,z10,z1044jt,zalora,zarkasi,zona
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1138,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1139,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1140,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1141,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


# **Data Splitting**

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(tfidf_df, df['Kategori'], test_size=0.2, random_state=42)

print('X_train:')
print(X_train)

print('Y_train:')
print(Y_train)


X_train:
      0000  00001200  0006  0006kecepatan  0009  001  0016285286552555  \
12     0.0       0.0   0.0            0.0   0.0  0.0               0.0   
758    0.0       0.0   0.0            0.0   0.0  0.0               0.0   
636    0.0       0.0   0.0            0.0   0.0  0.0               0.0   
1109   0.0       0.0   0.0            0.0   0.0  0.0               0.0   
743    0.0       0.0   0.0            0.0   0.0  0.0               0.0   
...    ...       ...   ...            ...   ...  ...               ...   
1044   0.0       0.0   0.0            0.0   0.0  0.0               0.0   
1095   0.0       0.0   0.0            0.0   0.0  0.0               0.0   
1130   0.0       0.0   0.0            0.0   0.0  0.0               0.0   
860    0.0       0.0   0.0            0.0   0.0  0.0               0.0   
1126   0.0       0.0   0.0            0.0   0.0  0.0               0.0   

      002359  00353918  008  ...  yudisium  yuk  yuks  yuni  yunit  z10  \
12       0.0       0.0  0.0

# **Machine Learning Final Project**
SMS Spam Detection Application Using the Naive Bayes Algorithm

In [None]:
## Test

import numpy as np

class NaiveBayes:
    """Gaussian Naive Bayes that scores each class in log space."""

    def __init__(self):
        self.prior = {}
        self.mean_std_dev = {}
        self.classes = []

    def fit(self, X, y):
        self.classes = np.unique(y)
        total_docs = len(y)

        for cls in self.classes:
            # Prior probability of the class
            cls_docs = y[y == cls].shape[0]
            self.prior[cls] = cls_docs / total_docs

            cls_index = (y == cls)
            # Per-feature mean within the class
            cls_mean = X[cls_index].mean(axis=0)

            # Per-feature standard deviation within the class
            cls_std_dev = X[cls_index].std(axis=0)

            # Guard against sigma_ik = 0
            cls_std_dev = np.where(cls_std_dev == 0, 1e-6, cls_std_dev)

            self.mean_std_dev[cls] = [cls_mean, cls_std_dev]

    def predict(self, X):
        predictions = []

        for doc in X:
            class_probs = {}

            for cls in self.classes:
                mean, std_dev = self.mean_std_dev[cls]
                log_prob = np.log(self.prior[cls])  # Log prior probability

                for i in range(len(doc)):
                    mu_ik = mean[i]
                    sigma_ik = std_dev[i]

                    # Log of the Gaussian probability density
                    log_coefficient = -np.log(sigma_ik * np.sqrt(2 * np.pi))
                    log_exponent = -((doc[i] - mu_ik) ** 2) / (2 * (sigma_ik ** 2))
                    log_gaussian_prob = log_coefficient + log_exponent

                    log_prob += log_gaussian_prob

                class_probs[cls] = log_prob

            # Predict the class with the highest log probability
            predicted_class = max(class_probs, key=class_probs.get)
            predictions.append(predicted_class)

        return predictions


In [None]:
import numpy as np

class NaiveBayes:
    def __init__(self):
        self.prior = {}
        self.mean_std_dev = {}
        self.classes = []

    def fit(self, X, y):
        self.classes = np.unique(y)
        total_docs = len(y)

        for cls in self.classes:
            # Prior probability of the class
            cls_docs = y[y == cls].shape[0]
            self.prior[cls] = cls_docs / total_docs

            cls_index = (y == cls)
            # Per-feature mean within the class
            cls_mean = X[cls_index].mean(axis=0)

            # Per-feature standard deviation within the class
            cls_std_dev = X[cls_index].std(axis=0)

            # Guard against sigma_ik = 0
            cls_std_dev = np.where(cls_std_dev == 0, 1e-6, cls_std_dev)

            self.mean_std_dev[cls] = [cls_mean, cls_std_dev]

    def predict(self, X):
        predictions = []
        for doc in X:
            class_probs = {}

            for cls in self.classes:
                mean, std_dev = self.mean_std_dev[cls]
                phi_gaussian_prob = 1

                for i in range(len(doc)):
                    mu_ik = mean[i]
                    sigma_ik = std_dev[i]

                    # Gaussian probability density
                    coefficient = 1 / (sigma_ik * np.sqrt(2 * np.pi))
                    exponent = np.exp(-(doc[i] - mu_ik) ** 2 / (2 * (sigma_ik ** 2)))
                    gaussian_prob = coefficient * exponent

                    # Caution: a running product over thousands of features can
                    # underflow to 0 or overflow to inf; summing logs avoids this
                    phi_gaussian_prob *= gaussian_prob

                # Posterior score: prior times the product of Gaussian densities
                class_probs[cls] = self.prior[cls] * phi_gaussian_prob

            # Predict the class with the highest posterior score
            predicted_class = max(class_probs, key=class_probs.get)
            predictions.append(predicted_class)

        return predictions

In [None]:
from sklearn.naive_bayes import MultinomialNB

# Training
nb = NaiveBayes()
nb.fit(X_train.values, Y_train.values)

# Test
nb_test = MultinomialNB()
nb_test.fit(X_train.values, Y_train.values)

In [None]:
# Predicting
y_pred = nb.predict(X_test.values)

y_test_pred = nb_test.predict(X_test.values)

  phi_gaussian_prob *= gaussian_prob


In [None]:
# Evaluating
from sklearn.metrics import classification_report
print(classification_report(Y_test, y_pred))
print(classification_report(Y_test, y_test_pred))

              precision    recall  f1-score   support

         ham       0.48      1.00      0.65       111
        spam       0.00      0.00      0.00       118

    accuracy                           0.48       229
   macro avg       0.24      0.50      0.33       229
weighted avg       0.23      0.48      0.32       229

              precision    recall  f1-score   support

         ham       0.98      0.95      0.97       111
        spam       0.96      0.98      0.97       118

    accuracy                           0.97       229
   macro avg       0.97      0.97      0.97       229
weighted avg       0.97      0.97      0.97       229



  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
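
The first report shows the hand-written classifier predicting "ham" for every test message, and the earlier runtime warnings point at the running product `phi_gaussian_prob *= gaussian_prob`. A plausible explanation, stated here as an assumption and not verified against the notebook's data, is floating-point underflow or overflow: multiplying thousands of per-feature densities leaves every class with the same degenerate score, and `max` then returns the first class. The sketch below uses made-up likelihood values to show the effect and why summing logarithms avoids it.

```python
import math

# 5000 hypothetical per-feature likelihoods, each small but non-zero
likelihoods = [1e-3] * 5000

product = 1.0
for p in likelihoods:
    product *= p  # underflows to exactly 0.0 long before the loop ends

log_sum = sum(math.log(p) for p in likelihoods)  # stays finite

print(product)            # 0.0
print(round(log_sum, 1))  # about -34538.8

# When every class score collapses to the same value, max() keeps the first key
scores = {"ham": 0.0, "spam": 0.0}
winner = max(scores, key=scores.get)
print(winner)  # ham
```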


In [None]:
!pip install flask-ngrok



In [None]:
import pickle

# Save the fitted vectorizer (not the TF-IDF DataFrame) so new messages can be transformed later
with open('vectorizer.pkl', 'wb') as f:
    pickle.dump(vectorizer, f)

# Save the model
with open('model.pkl', 'wb') as f:
    pickle.dump(nb_test, f)

print("Model and vectorizer saved as 'model.pkl' and 'vectorizer.pkl'")

Model and vectorizer saved as 'model.pkl' and 'vectorizer.pkl'


In [None]:
import os

print("Vectorizer exists:", os.path.exists("vectorizer.pkl"))
print("Model exists:", os.path.exists("model.pkl"))


Vectorizer exists: True
Model exists: True
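
The two saved files are only useful together if a new message goes through the same preprocessing as the training data and is then transformed by the fitted vectorizer. The sketch below shows one way that could look. `classify_message` is a hypothetical helper, it uses whitespace splitting as a simplification of `word_tokenize`, and the two small classes are stand-ins for the unpickled vectorizer and model so that the example runs on its own.

```python
import re

def classify_message(text, vectorizer, model, stop_words):
    # Repeat the training-time preprocessing: clean, lowercase, tokenize, drop stopwords
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\s+', ' ', text).strip().lower()
    tokens = [t for t in text.split() if t not in stop_words]
    # The vectorizer must be the fitted object from training so the feature columns line up
    features = vectorizer.transform([' '.join(tokens)])
    return model.predict(features)[0]

class KeywordVectorizer:
    # Stand-in for a fitted vectorizer: counts occurrences of a single keyword
    def transform(self, docs):
        return [[doc.split().count("gratis")] for doc in docs]

class ThresholdModel:
    # Stand-in for a trained classifier
    def predict(self, rows):
        return ["spam" if row[0] > 0 else "ham" for row in rows]

print(classify_message("Dapatkan GRATIS pulsa!", KeywordVectorizer(), ThresholdModel(), {"anda"}))
# spam
```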


In [None]:
from google.colab import files

files.download('vectorizer.pkl')
files.download('model.pkl')


In [None]:
import nltk
nltk.data.path.append('C:\\ML\\Ujicoba\\venv\\nltk_data')
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\asus\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\asus\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

# **Unused Code**

In [None]:
import numpy as np

class NaiveBayes:
  def __init__(self):
    self.prior = {}
    self.mean_std_dev = {}
    self.classes = []

  def fit(self, X, y):
    # np.unique already returns every class label, so nothing needs to be appended
    self.classes = np.unique(y)
    total_docs = len(y)

    for cls in self.classes:
      # Prior probability: number of rows labelled cls divided by all rows
      cls_docs = y[y == cls].shape[0]
      self.prior[cls] = cls_docs / total_docs

      cls_index = (y == cls)
      # Per-feature mean within the class
      cls_mean = (X[cls_index].sum(axis=0)) / cls_docs

      # Per-feature (population) standard deviation within the class
      columns = list(zip(*X[cls_index]))
      cls_std_devs = []
      for column in columns:
        avg = sum(column) / len(column)
        variance = sum((x - avg) ** 2 for x in column) / len(column)
        # Replace a zero standard deviation to avoid division by zero in predict
        cls_std_devs.append(variance ** 0.5 or 1e-6)

      self.mean_std_dev[cls] = [cls_mean, cls_std_devs]

  # predict is a method of the class, not a function nested inside fit
  def predict(self, X):
    predictions = []

    # For every row in X
    for doc in X:
      class_probs = {}

      # Score every class
      for cls in self.classes:
        mean, std_dev = self.mean_std_dev[cls]
        phi_gaussian_prob = 1

        # Iterate over every feature
        for i in range(len(doc)):
          mu_ik = mean[i]
          sigma_ik = std_dev[i]

          # Gaussian probability density
          coefficient = 1 / (sigma_ik * np.sqrt(2 * np.pi))
          exponent = np.exp(-((doc[i] - mu_ik) ** 2) / (2 * (sigma_ik ** 2)))
          gaussian_prob = coefficient * exponent

          # Multiply into the running product
          phi_gaussian_prob *= gaussian_prob

        # Posterior score: prior times the product of Gaussian densities
        class_probs[cls] = self.prior[cls] * phi_gaussian_prob

      # Predict the class with the highest posterior score
      predicted_class = max(class_probs, key=class_probs.get)
      predictions.append(predicted_class)

    return predictions