<a href="https://colab.research.google.com/github/Kenfinn/AIMastery_Orbit/blob/main/Case_Study_Review_NLP_01_%26_NLP_02_Afinzaki_Amiral.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Case Study: SMS Spam Classification
Created by NLP Squad - Orbit Future Academy<br>
Monday, 28 March 2022


![title](https://undangmi.com/wp-content/uploads/2022/03/Screen-Shot-2022-03-26-at-23.14.46.png)

Required steps:
1. Data Acquisition
2. Text Preprocessing
   1. Case folding
   2. Word Normalization
   3. Stopword removal
   4. Stemming
3. Feature Engineering
   1. Feature Extraction - Bag of Words
   2. Feature Extraction - TF-IDF
   3. Feature Selection - Chi-Square
4. Modelling (Machine Learning) (coming soon)
5. Model Evaluation (coming soon)
6. Deployment (coming soon)

# Setup

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

In [None]:
!pip -q install sastrawi

In [None]:
import nltk
nltk.download('stopwords')

In [None]:
# pickle lets us save objects so they can be reused for modelling and deployment
import pickle

In [None]:
# Create a folder on Google Drive for the experiment
# Connect Google Colab to Google Drive
from google.colab import drive

drive.mount('/content/gdrive', force_remount=True)

In [None]:
# Change the working directory to the experiment folder
%cd /content/gdrive/MyDrive/Colab Notebooks/Group 2

In [None]:
!ls

# Data Acquisition
Label descriptions:
- 0 --> Normal SMS
- 1 --> Fraud (scam) SMS
- 2 --> Promo SMS

In [None]:
!wget https://raw.githubusercontent.com/ksnugroho/klasifikasi-spam-sms/master/data/dataset_sms_spam_v1.csv

In [None]:
data = pd.read_csv('dataset_sms_spam_v1.csv')
data.head()

In [None]:
data.info()

In [None]:
print('Total number of SMS:', data.shape[0], 'rows\n')
print('broken down by label:')
print('-- [0] Normal SMS\t:', data[data.label == 0].shape[0], 'rows')
print('-- [1] Fraud SMS\t:', data[data.label == 1].shape[0], 'rows')
print('-- [2] Promo SMS\t:', data[data.label == 2].shape[0], 'rows\n')

In [None]:
# Sort by label value (0, 1, 2) so each bar lines up with its tick label;
# value_counts() on its own orders the counts by frequency
height = data['label'].value_counts().sort_index()
labels = ('Normal SMS', 'Fraud SMS', 'Promo SMS')
y_pos = np.arange(len(labels))

plt.figure(figsize=(7,4), dpi=80)
plt.ylim(0,600)
plt.title('SMS Category Distribution', fontweight='bold')
plt.xlabel('Category', fontweight='bold')
plt.ylabel('Count', fontweight='bold')
plt.bar(y_pos, height, color=['deepskyblue', 'royalblue', 'skyblue'])
plt.xticks(y_pos, labels)
plt.show()
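
The bar heights only match the tick labels when the counts are ordered by label value. `value_counts()` orders by frequency, so sorting by index is the safer choice. A minimal sketch on a hypothetical label series (`toy_labels` is made up for illustration and is not taken from the dataset):

```python
import pandas as pd

# Hypothetical labels: 0 appears three times, 1 once, 2 twice
toy_labels = pd.Series([2, 0, 0, 1, 0, 2])

by_frequency = toy_labels.value_counts()            # ordered by count, descending
by_label = toy_labels.value_counts().sort_index()   # ordered by label value

print(list(by_frequency.index))  # [0, 2, 1]
print(list(by_label.index))      # [0, 1, 2]
print(list(by_label.values))     # [3, 1, 2]
```

With the frequency ordering, the second bar would show the count for label 2 under the tick for label 1.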

# Text Preprocessing

## Case Folding

In [None]:
import re

# Define a function for case folding
def casefolding(text):
  text = text.lower()                                 # Convert to lowercase
  text = re.sub(r'https?://\S+|www\.\S+', '', text)   # Remove URLs
  text = re.sub(r'[-+]?[0-9]+', '', text)             # Remove digits (with an optional sign)
  text = re.sub(r'[^\w\s]', '', text)                 # Remove punctuation
  text = text.strip()                                 # Trim leading and trailing whitespace
  return text

In [None]:
raw_text = data['teks'].iloc[5]
case_folding = casefolding(raw_text)

print('Raw text\t: ', raw_text)
print('Case folding\t: ', case_folding)
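
To see what each substitution contributes, the same regular expressions can be applied one at a time. This is a sketch on a hypothetical message (`sample` is invented for illustration, not a row from the dataset):

```python
import re

# Hypothetical SMS used only for illustration
sample = "PROMO! Dapatkan 50% diskon di https://contoh.example/promo sekarang!!"

step1 = sample.lower()                                 # lowercase
step2 = re.sub(r'https?://\S+|www\.\S+', '', step1)    # drop URLs
step3 = re.sub(r'[-+]?[0-9]+', '', step2)              # drop digits
step4 = re.sub(r'[^\w\s]', '', step3)                  # drop punctuation
step5 = step4.strip()                                  # trim the ends

for name, value in [('lower', step1), ('url', step2), ('digits', step3),
                    ('punctuation', step4), ('strip', step5)]:
    print(f'{name:12s}: {value!r}')
```

Note that removed tokens leave double spaces behind inside the string. That is harmless here because the later steps tokenize with `str.split()`, which treats any run of whitespace as one separator.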

## Word Normalization

In [None]:
# Download the abbreviation corpus
!wget https://raw.githubusercontent.com/ksnugroho/klasifikasi-spam-sms/master/data/key_norm.csv

In [None]:
key_norm = pd.read_csv('key_norm.csv')

# Define a function for word normalization
def text_normalize(text):
  # Replace each word found in the 'singkat' column with its 'hasil' value; keep other words unchanged
  text = ' '.join([key_norm[key_norm['singkat'] == word]['hasil'].values[0] if (key_norm['singkat'] == word).any() else word for word in text.split()])
  text = str.lower(text)
  return text

In [None]:
raw_text = data['teks'].iloc[5]
case_folding = casefolding(raw_text)
word_normalization = text_normalize(case_folding)

print('Raw text\t: ', raw_text)
print('Case folding\t: ', case_folding)
print('Word normalization\t: ', word_normalization)
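
Filtering the DataFrame for every word gets slow on larger corpora. One possible alternative, sketched here under the assumption that each abbreviation maps to a single replacement, is to build a dictionary once and look words up in it. `toy_norm` and `normalize_with_dict` are hypothetical names; only the column names `singkat` and `hasil` come from `key_norm.csv`:

```python
import pandas as pd

# Hypothetical abbreviation table with the same column names as key_norm.csv
toy_norm = pd.DataFrame({
    'singkat': ['yg', 'dgn', 'gk'],
    'hasil': ['yang', 'dengan', 'tidak'],
})

# Build the lookup once: abbreviation -> normalized form
lookup = dict(zip(toy_norm['singkat'], toy_norm['hasil']))

def normalize_with_dict(text):
    """Replace known abbreviations; leave every other word unchanged."""
    return ' '.join(lookup.get(word, word) for word in text.split()).lower()

result = normalize_with_dict('pulsa yg gk masuk dgn benar')
print(result)  # pulsa yang tidak masuk dengan benar
```

If the table contains duplicate abbreviations, `dict(zip(...))` keeps the last entry, whereas the DataFrame filter above uses the first, so deduplicate first if that difference matters.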

## Filtering (Stopwords Removal)

In [None]:
from nltk.corpus import stopwords

stopwords_id = stopwords.words('indonesian')

In [None]:
len(stopwords_id)

In [None]:
# Inspect the Indonesian stopword list provided by NLTK
stopwords_id

In [None]:
# Define a function for stopword removal

def remove_stopwords(text):
  clean_word = []
  all_text = text.split()
  for word in all_text:
    if word not in stopwords_id:
      clean_word.append(word)
  return ' '.join(clean_word)

In [None]:
raw_text = data['teks'].iloc[5]
case_folding = casefolding(raw_text)
word_normalization = text_normalize(case_folding)
stopwords_removal = remove_stopwords(word_normalization)

print('Raw text\t: ', raw_text)
print('Case folding\t: ', case_folding)
print('Word normalization\t: ', word_normalization)
print('Stopwords removal\t: ', stopwords_removal)
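
Membership tests against a list scan it from the start each time. Converting the stopwords to a `set` is a common way to speed this up. The sketch below uses a hypothetical five-word stopword set rather than NLTK's Indonesian list, so it runs without any download:

```python
# Hypothetical mini stopword set; the notebook uses NLTK's Indonesian list instead
toy_stopwords = {'yang', 'di', 'dan', 'untuk', 'anda'}

def remove_stopwords_fast(text, stopword_set):
    """Keep only the words that are not in stopword_set."""
    return ' '.join(word for word in text.split() if word not in stopword_set)

cleaned = remove_stopwords_fast('hadiah untuk anda yang beruntung', toy_stopwords)
print(cleaned)  # hadiah beruntung
```

In the notebook the equivalent would be to pass `set(stopwords_id)`, assuming the list itself is no longer needed in order.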

## Stemming

In [None]:
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

factory = StemmerFactory()
stemmer = factory.create_stemmer()

# Define a function for Indonesian stemming
def stemming(text):
  text = stemmer.stem(text)
  return text

In [None]:
raw_text = data['teks'].iloc[5]
case_folding = casefolding(raw_text)
word_normalization = text_normalize(case_folding)
stopwords_removal = remove_stopwords(word_normalization)
text_stemming = stemming(stopwords_removal)

print('Raw text\t: ', raw_text)
print('Case folding\t: ', case_folding)
print('Word normalization\t: ', word_normalization)
print('Stopwords removal\t: ', stopwords_removal)
print('Stemming\t: ', text_stemming)

## Text Preprocessing Pipeline


In [None]:
# Define a function that chains all text preprocessing steps
def text_preprocessing_process(text):
  text = casefolding(text)
  text = text_normalize(text)
  text = remove_stopwords(text)
  text = stemming(text)
  return text

In [None]:
%%time
data['clean_teks'] = data['teks'].apply(text_preprocessing_process)

# Note how long the text preprocessing takes to run

In [None]:
data.head()

In [None]:
# Save the data after text preprocessing
data.to_csv('clean_data.csv')

# Feature Engineering

In [None]:
# Separate the feature column and the target column
X = data['clean_teks']
y = data['label']

In [None]:
X

In [None]:
y

## Feature Extraction (Bag of Words & N-Gram)
Convert the text into vectors using the Bag of Words (BoW) method.

In [None]:
'''
Convert a collection of text documents to a matrix of token counts.
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
'''

from sklearn.feature_extraction.text import CountVectorizer

# Bag of Words using unigrams
bow = CountVectorizer(ngram_range=(1,1))
bow.fit(X)

In [None]:
# Number of tokens / features produced by Bag of Words
len(bow.get_feature_names_out())

In [None]:
# Inspect the features found in the corpus
bow.get_feature_names_out()

In [None]:
# Inspect the token count matrix
# This result is ready to be fed (fitted) into the modelling (machine learning) step

X_bow = bow.transform(X).toarray()
X_bow

In [None]:
data_bow = pd.DataFrame(X_bow, columns=bow.get_feature_names_out())
data_bow

In [None]:
# Save the Bag of Words result
with open('hasil_bow.pickle', 'wb') as output:
  pickle.dump(X_bow, output)

## Feature Extraction (TF-IDF & N-Gram)
Convert the text into vectors using the TF-IDF method.
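
With its default settings, `TfidfVectorizer` multiplies raw term counts by a smoothed inverse document frequency, `idf = ln((1 + n) / (1 + df)) + 1`, and then scales each row to unit L2 norm. The sketch below reproduces that by hand on a hypothetical corpus and checks it against scikit-learn. It assumes the defaults (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical mini corpus, already preprocessed
toy_corpus = ['menang hadiah pulsa', 'pulsa gratis pulsa', 'jadwal rapat besok']

# Step 1: raw term counts (the Bag of Words matrix)
counts = CountVectorizer().fit_transform(toy_corpus).toarray().astype(float)

# Step 2: smoothed inverse document frequency per term
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Step 3: weight the counts and L2-normalize each document row
manual_tfidf = counts * idf
manual_tfidf /= np.linalg.norm(manual_tfidf, axis=1, keepdims=True)

sklearn_tfidf = TfidfVectorizer().fit_transform(toy_corpus).toarray()
print(np.allclose(manual_tfidf, sklearn_tfidf))  # True
```

Terms that occur in many documents get a lower `idf`, so they contribute less than terms concentrated in a few documents.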

In [None]:
'''
Convert a collection of raw documents to a matrix of TF-IDF features.
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
'''

from sklearn.feature_extraction.text import TfidfVectorizer

tf_idf = TfidfVectorizer(ngram_range=(1,1))
tf_idf.fit(X)

X_tf_idf = tf_idf.transform(X)

In [None]:
# Number of tokens / features produced by TF-IDF
len(tf_idf.get_feature_names_out())

In [None]:
# Inspect the features found in the corpus
tf_idf.get_feature_names_out()

In [None]:
# Inspect the TF-IDF matrix and compare it with the BoW result
# As with BoW, this data is ready to be fed into the modelling step

X_tf_idf = tf_idf.transform(X).toarray()
X_tf_idf

In [None]:
data_tf_idf = pd.DataFrame(X_tf_idf, columns=tf_idf.get_feature_names_out())
data_tf_idf

In [None]:
# Save the TF-IDF result
with open('hasil_tf_idf.pickle', 'wb') as output:
  pickle.dump(X_tf_idf, output)

## Feature Selection

In [None]:
# Convert the tabular TF-IDF data into arrays so it can be used for feature selection
X = np.array(data_tf_idf)
y = np.array(y)

In [None]:
'''
Select features according to the k highest scores.
https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html

Compute chi-squared stats between each non-negative feature and class.
https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html
'''

from sklearn.feature_selection import SelectKBest 
from sklearn.feature_selection import chi2 

# K features with highest chi-squared statistics are selected 
chi2_features = SelectKBest(chi2, k=1000) 
X_kbest_features = chi2_features.fit_transform(X, y) 
  
# Reduced features 
print('Original feature number:', X.shape[1]) 
print('Reduced feature number:', X_kbest_features.shape[1]) 
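
To make the scoring concrete, here is a sketch of `SelectKBest` with `chi2` on a hypothetical count matrix with three terms and two classes. All names and values are invented for illustration. A term that occurs in only one class scores high, while a term spread evenly across classes scores zero:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical data: rows are documents, columns are term counts
toy_terms = np.array(['hadiah', 'rapat', 'pulsa'])
X_toy = np.array([[3, 0, 1],
                  [2, 0, 1],
                  [0, 3, 1],
                  [0, 2, 1]])
y_toy = np.array([1, 1, 0, 0])   # 1 = fraud, 0 = normal

selector = SelectKBest(chi2, k=2)
X_toy_selected = selector.fit_transform(X_toy, y_toy)

toy_scores = selector.scores_
toy_selected = toy_terms[selector.get_support()]

print(toy_scores)    # [5. 5. 0.]
print(toy_selected)  # ['hadiah' 'rapat']
```

`hadiah` and `rapat` each appear in a single class, so they are kept; `pulsa` appears equally in both classes and carries no information about the label.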

In [None]:
# chi2_features.scores_ holds the chi-square scores; a higher score means a stronger association between the feature and the label
data_chi2 = pd.DataFrame(chi2_features.scores_, columns=['nilai'])
data_chi2

In [None]:
# Show each feature alongside its score
feature = tf_idf.get_feature_names_out()
data_chi2['fitur'] = feature
data_chi2

In [None]:
# Sort the features from highest to lowest score
data_chi2.sort_values(by='nilai', ascending=False)

In [None]:
# Show the mask over the features
# False means the feature was not selected; True means it was selected
mask = chi2_features.get_support()
mask

In [None]:
# Collect the selected features using the mask (the top-k chi-square scores)
selected_feature = [f for is_selected, f in zip(mask, feature) if is_selected]
selected_feature

In [None]:
# Show the selected features
# together with their vector values across the whole dataset, ready for the machine learning step

# Only the k features chosen by the k parameter set earlier are kept

data_selected_feature = pd.DataFrame(X_kbest_features, columns=selected_feature)
data_selected_feature

In [None]:
with open('best_feature_chi2.pickle', 'wb') as output:
  pickle.dump(X_kbest_features, output)

# Modelling (Machine Learning)

In [None]:
# Coming Soon

# Model Evaluation

In [None]:
# Coming Soon

# Deployment

In [None]:
# Coming Soon

# WordCloud

In [None]:
# Import the WordCloud library, used to visualize the most frequent words
# Import cv2 to turn an image into a mask for the WordCloud

import cv2
from wordcloud import WordCloud

In [None]:
# Download the mask image
!wget https://raw.githubusercontent.com/ksnugroho/klasifikasi-spam-sms/master/img/cloud.jpg

In [None]:
originalImage = cv2.imread('cloud.jpg')
grayImage = cv2.cvtColor(originalImage, cv2.COLOR_BGR2GRAY)
(thresh, cloud_mask) = cv2.threshold(grayImage, 100, 255, cv2.THRESH_BINARY)

In [None]:
# Display the mask
from google.colab.patches import cv2_imshow

cv2_imshow(cloud_mask)

In [None]:
# WordCloud for the Normal SMS label

sms_normal = data[data.label == 0]

# Join all cleaned texts of this label into one string
normal_string = sms_normal['clean_teks'].str.cat(sep=' ')

wordcloud = WordCloud(width=1600, height=800, margin=10,
                      background_color='white', colormap='Dark2',
                      max_font_size=200, min_font_size=25,
                      mask=cloud_mask, contour_width=10, contour_color='firebrick',
                      max_words=100).generate(normal_string)
plt.figure(figsize=(10,8))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
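
A word cloud sizes words roughly by how often they occur, so it can help to check the underlying counts numerically. This sketch uses `collections.Counter` on hypothetical cleaned texts (`toy_texts` stands in for a column such as `sms_normal.clean_teks`). It does not replicate WordCloud's own tokenization, which by default also considers word pairs:

```python
from collections import Counter

# Hypothetical cleaned texts standing in for one label's clean_teks column
toy_texts = ['pulsa gratis pulsa', 'hadiah pulsa', 'gratis pulsa']

# Count every whitespace-separated token across all texts
word_counts = Counter(' '.join(toy_texts).split())
top_words = word_counts.most_common(2)

print(top_words)  # [('pulsa', 4), ('gratis', 2)]
```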

In [None]:
# WordCloud for the Fraud SMS label

sms_fraud = data[data.label == 1]

# Join all cleaned texts of this label into one string
fraud_string = sms_fraud['clean_teks'].str.cat(sep=' ')

wordcloud = WordCloud(width=1600, height=800, margin=10,
                      background_color='white', colormap='Dark2',
                      max_font_size=200, min_font_size=25,
                      mask=cloud_mask, contour_width=10, contour_color='firebrick',
                      max_words=100).generate(fraud_string)
plt.figure(figsize=(10,8))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

In [None]:
# WordCloud for the Promo SMS label

sms_promo = data[data.label == 2]

# Join all cleaned texts of this label into one string
promo_string = sms_promo['clean_teks'].str.cat(sep=' ')

wordcloud = WordCloud(width=1600, height=800, margin=10,
                      background_color='white', colormap='Dark2',
                      max_font_size=200, min_font_size=25,
                      mask=cloud_mask, contour_width=10, contour_color='firebrick',
                      max_words=100).generate(promo_string)
plt.figure(figsize=(10,8))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

# Student Activity
Repeat the process above using the product review dataset: https://drive.google.com/file/d/1qn5WXp-H95_FL_Rx5oqvfZaflYdHsnrF/view?usp=sharing

Your tasks:
- Decide which text preprocessing steps are appropriate for the dataset above.
- Try different n-gram ranges (`ngram_range`) and observe how the results differ.
- In your opinion, do the terms produced by feature selection (`X_kbest_features`) carry relevant information?

When you are done, write a summary of your experience with text preprocessing and feature engineering.

Submit your assignment at: https://s.id/tugas-nlp-ofa

Submission deadline: Friday, 1 April 2022