<a href="https://colab.research.google.com/github/ArtikaYuzuf/-Art-Yusuf/blob/master/Preprocessing_Bahasa.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Mining with Python

Text mining is the process of extracting implicit knowledge from textual data. Because that knowledge is not stored explicitly anywhere in the source data, text mining should be distinguished from information retrieval, which only fetches what is already stored. Unlike regular machine learning problems, text mining has its own challenges. One of them is transforming text from a sequence of words into a representation that is meaningful and easy for a computer to work with. In this notebook I will try to demonstrate one form of text mining, text classification, to classify whether a tweet is hate speech or not.

Before we start, the code cell below loads the libraries we will be using.

In [None]:
# Import libraries
## Basic libs
import pandas as pd
import numpy as np
import warnings
## Data Visualization
import seaborn as sns
import matplotlib.pyplot as plt

# Configure libraries
warnings.filterwarnings('ignore')
plt.rcParams['figure.figsize'] = (10, 10)
plt.style.use('seaborn')  # on Matplotlib >= 3.6 this style is named 'seaborn-v0_8'
sns.set(rc={'figure.facecolor':'white'})
sns.set_palette('Accent')

# Loading Data

The dataset I will be using is originally taken from ['The Dataset for Hate Speech Detection in the Indonesian Language (Bahasa Indonesia)'](https://github.com/ialfina/id-hatespeech-detection/). I converted the data from TSV to CSV and uploaded it to my GitHub repository to make it easier to load.

In [None]:
import pandas as pd
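# Skip malformed rows. `on_bad_lines='skip'` replaces `error_bad_lines=False`,
# which was deprecated in pandas 1.3 and removed in pandas 2.0
df = pd.read_csv('/content/ShopeedataMkII.csv', on_bad_lines='skip', sep=";")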


print(df.shape)
df.head(1000)

(9400, 1)


Unnamed: 0,TEXT
0,alfamart untuk penguna baru cashback penguna l...
1,alfamart alfamidi kno miris pot casbacknya pot...
2,alfamart belanja pakai cashback belaja
3,alfamart beli barang promo bayar pakai pay sto...
4,alfamart beli coklat delfi buy get harga rban ...
...,...
995,coba gosok baru koin
996,coba hadiah koin
997,coba hasil
998,coba hitung dlm bulan bunga coba kasih tahu ka...


# Text Pre-Processing

Before we can begin to create our model, we first need to pre-process the data. This step ensures that our model receives good data to learn from; as the saying goes, "a model is only as good as its data". This is especially true for text that originated from Twitter, which can be really messy if we don't clean it before passing it to the model.

As I explained above, this process is different from pre-processing structured data. The pre-processing is divided into a few steps, as explained below.

## Checking Class Distribution

Another important thing to check before feeding our data into the model is its class distribution. In our case, where the expected classes are the two outcomes 'HS' and 'No_HS', a class distribution of 50:50 can be considered ideal.

In [None]:
df['TEXT'].value_counts()

he                                                                  2
kepikiran langsung beli                                             2
cashback mantab                                                     2
 checkout                                                           2
mantap                                                              2
                                                                   ..
terjangkau ya ampun dimasa pandemi sekarang                         1
terkadang galau kata free ongkir waktu checkout tapi kena ongkir    1
terkadang lupa punya voucher                                        1
terkait data data terima mohon bantu ditindaklanjuti                1
TRUE                                                                1
Name: TEXT, Length: 9392, dtype: int64

Looks like our class distribution isn't exactly ideal. To solve this problem we can either undersample (remove data from the majority class) or oversample (create new data for the minority class). In this notebook I won't do either, because I personally think the imbalance isn't that bad. [Here](https://towardsdatascience.com/how-i-handled-imbalanced-text-data-ba9b757ab1d8) is a good reference on how to handle imbalanced text data if you are interested in doing so.
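
As a rough illustration of the two ideas above, the sketch below measures class shares and randomly undersamples the majority class using only the standard library. The toy `labels` list, its 7:3 split, and the helper names `class_share` and `undersample` are assumptions made for this example and are not taken from the dataset.

```python
import random
from collections import Counter

# Hypothetical labels: 7 'No_HS' rows and 3 'HS' rows
labels = ['No_HS'] * 7 + ['HS'] * 3

def class_share(labels):
    """Return the fraction of rows that belongs to each class."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}

def undersample(labels, seed=0):
    """Randomly keep only as many rows per class as the smallest class has.

    Returns the sorted row positions to keep.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, cls in enumerate(labels):
        by_class.setdefault(cls, []).append(idx)
    smallest = min(len(rows) for rows in by_class.values())
    keep = []
    for rows in by_class.values():
        keep.extend(rng.sample(rows, smallest))
    return sorted(keep)

print(class_share(labels))
balanced = [labels[i] for i in undersample(labels)]
print(Counter(balanced))
```

With a dataframe, the returned positions could be passed to `df.iloc[...]` to build the balanced subset.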

## Text Cleaning

In the next few code cells we will clean our data into a more usable form for our model to learn from. Before we do that, let's make a copy of the original dataframe with the pandas `.copy()` method so that we don't modify the original dataset.



In [None]:
# Copy original dataframe to avoid messing the original data
df1 = df.copy()

### Tokenize

The first step is tokenizing our data. Tokenizing simply means splitting each sentence into a list of words, for example:

|  | Result |
|-|-|
| Original | @JohnDoe Hai, apa kabar? Aku baru bergabung ke #twitter nih |
| Tokenize | ['@JohnDoe', 'Hai', 'apa', 'kabar', 'aku', 'baru', 'bergabung', 'ke', '#twitter', 'nih'] |

In this step I will be using a function I wrote some time ago. It is built on the NLTK tokenizers and has a `remove_punctuation` option that switches to a regular-expression tokenizer.

In [None]:
import nltk
from nltk.tokenize import RegexpTokenizer
nltk.download('punkt')

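def tokenizeWords(s, remove_punctuation=True):
    if remove_punctuation:
        # Raw string keeps the regex escapes intact. Alternatives, in order:
        # word characters, dollar amounts, any other run of non-space characters
        # (so stray punctuation still comes out as its own token and is
        # filtered out in the next cleaning step)
        tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
        clean_words = tokenizer.tokenize(s)
    else:
        clean_words = nltk.word_tokenize(s)
    return clean_words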

# Tokenize words
df1['tokens'] = df1['TEXT'].apply(tokenizeWords)

df1.head()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Unnamed: 0,TEXT,tokens
0,alfamart untuk penguna baru cashback penguna l...,"[alfamart, untuk, penguna, baru, cashback, pen..."
1,alfamart alfamidi kno miris pot casbacknya pot...,"[alfamart, alfamidi, kno, miris, pot, casbackn..."
2,alfamart belanja pakai cashback belaja,"[alfamart, belanja, pakai, cashback, belaja]"
3,alfamart beli barang promo bayar pakai pay sto...,"[alfamart, beli, barang, promo, bayar, pakai, ..."
4,alfamart beli coklat delfi buy get harga rban ...,"[alfamart, beli, coklat, delfi, buy, get, harg..."
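
To see what the regular expression in `tokenizeWords` matches, here is a minimal sketch that applies the same pattern with the standard `re` module instead of NLTK. Using `re` is an assumption made only to keep the example free of dependencies, and the helper name `tokenize_with_regex` is made up for this illustration.

```python
import re

# Same pattern as in tokenizeWords: words, dollar amounts, other non-space runs
TOKEN_PATTERN = re.compile(r'\w+|\$[\d\.]+|\S+')

def tokenize_with_regex(sentence):
    """Return every non-overlapping match of TOKEN_PATTERN, left to right."""
    return TOKEN_PATTERN.findall(sentence)

sample = '@JohnDoe Hai, apa kabar? Aku baru bergabung ke #twitter nih'
print(tokenize_with_regex(sample))
# ['@JohnDoe', 'Hai', ',', 'apa', 'kabar', '?', 'Aku', 'baru', 'bergabung', 'ke', '#twitter', 'nih']
```

Punctuation that follows a word comes out as its own token here, so it can be dropped in the cleaning step that follows.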


### Remove mentions, URLs, special chars, and number

This second step is a special one, because in most cases it is not required. But since our data comes from Twitter, it contains a lot of @mentions, #hashtags, URLs, and other elements that only clutter the data and won't help the model's performance.

|  | Result |
|-|-|
| Tokenize | ['@JohnDoe', 'Hai', 'apa', 'kabar', 'aku', 'baru', 'bergabung', 'ke', '#twitter', 'nih'] |
| Remove Useless Text | ['Hai', 'apa', 'kabar', 'aku', 'baru', 'bergabung', 'ke', 'nih'] |

Below I wrote a new function that uses the `re` library to match patterns in text with regular expressions, which is really useful for removing the aforementioned elements.

In [None]:
import re

def removeUselessText(tokens):
    new_tokens = []
    for t in tokens:
        # Skip hashtags entirely
        if not t.startswith('#'):
            # Remove leading & trailing whitespace
            t = t.strip()

            # Remove mention
            t = re.sub(r'@[^\s]+', '', t)

            # Remove urls
            t = re.sub(r'\\/', '/', t) # replace escaped character
            t = re.sub(r'(https?://\S+)', '', t) # remove urls
            t = re.sub(r"http\S+", "", t)

            # Remove special character and number
            t = re.sub(r'[^a-zA-Z\s]', '', t)

            new_tokens.append(t)

    # Drop tokens that became empty after cleaning
    return [token for token in new_tokens if token]

df1['no_useless'] = df1['tokens'].apply(removeUselessText)

df1.head()

Unnamed: 0,TEXT,tokens,no_useless
0,alfamart untuk penguna baru cashback penguna l...,"[alfamart, untuk, penguna, baru, cashback, pen...","[alfamart, untuk, penguna, baru, cashback, pen..."
1,alfamart alfamidi kno miris pot casbacknya pot...,"[alfamart, alfamidi, kno, miris, pot, casbackn...","[alfamart, alfamidi, kno, miris, pot, casbackn..."
2,alfamart belanja pakai cashback belaja,"[alfamart, belanja, pakai, cashback, belaja]","[alfamart, belanja, pakai, cashback, belaja]"
3,alfamart beli barang promo bayar pakai pay sto...,"[alfamart, beli, barang, promo, bayar, pakai, ...","[alfamart, beli, barang, promo, bayar, pakai, ..."
4,alfamart beli coklat delfi buy get harga rban ...,"[alfamart, beli, coklat, delfi, buy, get, harg...","[alfamart, beli, coklat, delfi, buy, get, harg..."
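
The effect of these substitutions is easiest to see on the earlier example tweet. The following is a small sketch, not the notebook's function: `clean_token` is a hypothetical helper that condenses the same regular expressions, and the URL in the sample is invented for the illustration.

```python
import re

def clean_token(token):
    """Return the cleaned token, or '' if nothing useful is left."""
    if token.startswith('#'):                    # hashtags are dropped whole
        return ''
    token = token.strip()
    token = re.sub(r'@\S+', '', token)           # mentions
    token = re.sub(r'https?://\S+', '', token)   # URLs
    token = re.sub(r'[^a-zA-Z\s]', '', token)    # digits, punctuation, symbols
    return token

tokens = ['@JohnDoe', 'Hai', ',', 'apa', 'kabar', '?', 'Aku', 'baru',
          'bergabung', 'ke', '#twitter', 'nih', 'https://example.com/a1']
cleaned = [c for c in map(clean_token, tokens) if c]
print(cleaned)
# ['Hai', 'apa', 'kabar', 'Aku', 'baru', 'bergabung', 'ke', 'nih']
```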


In [None]:
import re

def removeUselessText(tokens):
    new_tokens = []
    for t in tokens:
        # Skip hashtags entirely
        if not t.startswith('#'):
            # Remove leading & trailing whitespace
            t = t.strip()

            # Remove mention
            t = re.sub(r'@[^\s]+', '', t)

            # Remove leftover 'tco...' fragments (presumably from t.co links)
            # and filler such as 'wk...', 'haha...', 'hm...'.
            # Each pattern is unanchored and deletes from the match to the end
            # of the token, wherever in the token the match starts
            t = re.sub(r'tco[^\s]+', '', t)
            t = re.sub(r'wk[^\s]+', '', t)
            t = re.sub(r'wq[^\s]+', '', t)
            t = re.sub(r'wwk[^\s]+', '', t)
            t = re.sub(r'haha[^\s]+', '', t)
            t = re.sub(r'hm[^\s]+', '', t)

            # Replace escaped slashes
            t = re.sub(r'\\/', '/', t)

            # Remove special character and number, then lowercase
            t = re.sub(r'[^a-zA-Z\s]', '', t)
            t = t.casefold()

            new_tokens.append(t)

    # Drop tokens that became empty after cleaning
    return [token for token in new_tokens if token]

df1['no_useless1'] = df1['no_useless'].apply(removeUselessText)

df1.head()

Unnamed: 0,TEXT,tokens,no_useless,no_useless1
0,alfamart untuk penguna baru cashback penguna l...,"[alfamart, untuk, penguna, baru, cashback, pen...","[alfamart, untuk, penguna, baru, cashback, pen...","[alfamart, untuk, penguna, baru, cashback, pen..."
1,alfamart alfamidi kno miris pot casbacknya pot...,"[alfamart, alfamidi, kno, miris, pot, casbackn...","[alfamart, alfamidi, kno, miris, pot, casbackn...","[alfamart, alfamidi, kno, miris, pot, casbackn..."
2,alfamart belanja pakai cashback belaja,"[alfamart, belanja, pakai, cashback, belaja]","[alfamart, belanja, pakai, cashback, belaja]","[alfamart, belanja, pakai, cashback, belaja]"
3,alfamart beli barang promo bayar pakai pay sto...,"[alfamart, beli, barang, promo, bayar, pakai, ...","[alfamart, beli, barang, promo, bayar, pakai, ...","[alfamart, beli, barang, promo, bayar, pakai, ..."
4,alfamart beli coklat delfi buy get harga rban ...,"[alfamart, beli, coklat, delfi, buy, get, harg...","[alfamart, beli, coklat, delfi, buy, get, harg...","[alfamart, beli, coklat, delfi, buy, get, harg..."
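
One thing to keep in mind with the filler patterns in the cell above is that they are unanchored, so `hm[^\s]+` also fires inside ordinary words. The sketch below shows that effect and one possible alternative that only drops tokens consisting entirely of filler. The `FILLER` pattern and the `is_filler` helper are assumptions for illustration and are not part of the original pipeline.

```python
import re

# Unanchored pattern from the cell above: removes 'hm' plus everything after it
print(re.sub(r'hm[^\s]+', '', 'rahmat'))   # 'ra'

# Hypothetical alternative: match only tokens made up entirely of filler
FILLER = re.compile(r'(?:wk)+w?|(?:ha){2,}h?|hm+')

def is_filler(token):
    """True when the whole token is laughter or hesitation filler."""
    return FILLER.fullmatch(token) is not None

tokens = ['wkwkwk', 'rahmat', 'hahaha', 'hmm', 'mantap']
print([t for t in tokens if not is_filler(t)])
# ['rahmat', 'mantap']
```

Whether whole-token matching is preferable depends on how much filler is glued to real words in the data, so it is worth checking a sample of tokens before switching.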


### Stemming
### Not Quite Proper

Onto the third step: we will stem every word into its base form by removing its prefixes and suffixes.

|  | Result |
|-|-|
| Remove Useless Text | ['Hai', 'apa', 'kabar', 'aku', 'baru', 'bergabung', 'ke', 'nih'] |
| Stemming | ['Hai', 'apa', 'kabar', 'aku', 'baru', 'gabung', 'ke', 'nih'] |

In this step I will be using the Bahasa Indonesia stemmer from the Sastrawi library.

*Note: this step may take a few minutes; slow stemming is a known performance problem with Sastrawi.*

In [None]:
!pip install Sastrawi
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

# Create the Sastrawi stemmer once and reuse it for every row,
# instead of rebuilding the factory on each call
factory = StemmerFactory()
stemmer = factory.create_stemmer()

def stemmSentence(tokens):
    return [stemmer.stem(t) for t in tokens]

df1['stemmed'] = df1['no_useless1'].apply(stemmSentence)

df1.head()

Collecting Sastrawi
  Downloading Sastrawi-1.0.1-py2.py3-none-any.whl (209 kB)

In [None]:
df1['stemmed']

0        [kocag, sih, asli, kgk, nge, endorse, dia, ber...
1        [ges, siga, brand, pang, hadena, alam, dunya, ...
2        [sorry, ferguso, saat, nya, turun, harga, tung...
3        [enggak, ajar, dari, usaha, yang, lain, suka, ...
4        [jujur, ini, blunder, banget, eiger, parah, gu...
                               ...                        
19475                             [aku, belum, punya, min]
19476                                  [belum, punya, min]
19477                      [sudah, punya, dong, saya, min]
19478    [hobi, nynyir, dari, akun, ini, enggak, berani...
19479    [kalau, gue, sih, kopi, hitam, instan, kopi, s...
Name: stemmed, Length: 19480, dtype: object
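
Since tweets repeat the same words many times, one common way to soften the slow-stemming problem is to cache results per word. The sketch below only illustrates that idea: `slow_stem` is a toy stand-in, not the Sastrawi stemmer, and its single prefix rule is an assumption chosen to reproduce the `bergabung` to `gabung` example.

```python
from functools import lru_cache

call_count = 0

def slow_stem(word):
    """Toy stand-in for an expensive stemmer call (NOT Sastrawi)."""
    global call_count
    call_count += 1
    # Single illustrative rule: strip the 'ber' prefix from longer words
    if word.startswith('ber') and len(word) > 5:
        return word[3:]
    return word

# Each distinct word is stemmed once; repeats are served from the cache
cached_stem = lru_cache(maxsize=None)(slow_stem)

def stem_tokens(tokens):
    return [cached_stem(t) for t in tokens]

print(stem_tokens(['aku', 'baru', 'bergabung', 'aku', 'baru']))
# ['aku', 'baru', 'gabung', 'aku', 'baru']
print(call_count)  # 3
```

In the notebook, the same wrapper could be placed around `stemmer.stem`, so that each distinct word is stemmed only once across the whole column.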

In [None]:
# Combine cleaned text into one string
df1['clean'] = df1['stemmed'].apply(lambda x: ' '.join(x))


df1.head()

Unnamed: 0,ready,tokens,no_useless,no_useless1,stemmed,clean
0,kocag sih asli kgk nge endorse dia keberatan s...,"[kocag, sih, asli, kgk, nge, endorse, dia, keb...","[kocag, sih, asli, kgk, nge, endorse, dia, keb...","[kocag, sih, asli, kgk, nge, endorse, dia, keb...","[kocag, sih, asli, kgk, nge, endorse, dia, ber...",kocag sih asli kgk nge endorse dia berat sama ...
1,ges siga brand pang hadena alam dunya ajih,"[ges, siga, brand, pang, hadena, alam, dunya, ...","[ges, siga, brand, pang, hadena, alam, dunya, ...","[ges, siga, brand, pang, hadena, alam, dunya, ...","[ges, siga, brand, pang, hadena, alam, dunya, ...",ges siga brand pang hadena alam dunya ajih
2,sorry ferguso saat nya turun harga tunggu loh,"[sorry, ferguso, saat, nya, turun, harga, tung...","[sorry, ferguso, saat, nya, turun, harga, tung...","[sorry, ferguso, saat, nya, turun, harga, tung...","[sorry, ferguso, saat, nya, turun, harga, tung...",sorry ferguso saat nya turun harga tunggu loh
3,enggak belajar dari perusahaan2 yang lain suka...,"[enggak, belajar, dari, perusahaan2, yang, lai...","[enggak, belajar, dari, perusahaan, yang, lain...","[enggak, belajar, dari, perusahaan, yang, lain...","[enggak, ajar, dari, usaha, yang, lain, suka, ...",enggak ajar dari usaha yang lain suka bikin bl...
4,jujur ini blunder banget eiger parah gue tadin...,"[jujur, ini, blunder, banget, eiger, parah, gu...","[jujur, ini, blunder, banget, eiger, parah, gu...","[jujur, ini, blunder, banget, eiger, parah, gu...","[jujur, ini, blunder, banget, eiger, parah, gu...",jujur ini blunder banget eiger parah gue tadi ...


In [None]:
df1.to_csv('eiger_all_stemming.csv') 

### Replace Slang Words

One other thing to take note of when dealing with text data in Bahasa Indonesia is slang. Indonesian people love to use slang, especially on the internet. Unfortunately, I was not able to find any ready-to-use Python library to help with this problem.

Instead, I will use the slang word dictionary from [dhitology's GitHub repository](https://github.com/dhitology) and convert it into a Python dictionary to make the replacement easier. Granted, this is not a really good solution, since some slang words in our dataset are still not included in the dictionary, but (personally) I think it's good enough.

If you are interested in a better approach to this problem, here is a good [article](https://medium.com/kata-engineering/mengubah-bahasa-indonesia-informal-menjadi-baku-menggunakan-kecerdasan-buatan-4c6317b00ea5) about a methodology for transforming informal Bahasa Indonesia text into a more formal form.

In [None]:
# `on_bad_lines='skip'` replaces `error_bad_lines=False`, which was removed in pandas 2.0
slang_df = pd.read_csv('/content/replace.csv', on_bad_lines='skip', sep=";")


print(slang_df.shape)
slang_df.head(1000)

(1354, 2)


Unnamed: 0,slang,formal
0,aamiin,amin
1,abis,habis
2,abisinnya,habis
3,abiss,habis
4,acaranya,acara
...,...,...
995,nonton,tonton
996,notif,notifikasi
997,notifnya,notifikasi
998,ntar,sebentar


In [None]:

# Remove leading and trailing whitespace
slang_df['slang'] = slang_df['slang'].apply(lambda x: x.strip())
slang_df['formal'] = slang_df['formal'].apply(lambda x: x.strip())

# Transform into key-value pairs in a dict
slang_dict = dict(zip(slang_df['slang'], slang_df['formal']))

def replaceSlang(tokens):
    # Swap each token for its formal form when it is in the slang dictionary.
    # The list is updated in place, so the lists stored in the column passed
    # to .apply() see the replacements too
    tokens[:] = [slang_dict.get(word, word) for word in tokens]
    return tokens

df1['no_slang'] = df1['no_useless1'].apply(replaceSlang)

df1.head()

Unnamed: 0,TEXT,tokens,no_useless,no_useless1,no_slang
0,alfamart untuk penguna baru cashback penguna l...,"[alfamart, untuk, penguna, baru, cashback, pen...","[alfamart, untuk, penguna, baru, cashback, pen...","[alfamart, untuk, penguna, baru, cashback, pen...","[alfamart, untuk, penguna, baru, cashback, pen..."
1,alfamart alfamidi kno miris pot casbacknya pot...,"[alfamart, alfamidi, kno, miris, pot, casbackn...","[alfamart, alfamidi, kno, miris, pot, casbackn...","[alfamart, alfamidi, kno, miris, pot, casbackn...","[alfamart, alfamidi, kno, miris, pot, casbackn..."
2,alfamart belanja pakai cashback belaja,"[alfamart, belanja, pakai, cashback, belaja]","[alfamart, belanja, pakai, cashback, belaja]","[alfamart, belanja, pakai, cashback, belaja]","[alfamart, belanja, pakai, cashback, belaja]"
3,alfamart beli barang promo bayar pakai pay sto...,"[alfamart, beli, barang, promo, bayar, pakai, ...","[alfamart, beli, barang, promo, bayar, pakai, ...","[alfamart, beli, barang, promo, bayar, pakai, ...","[alfamart, beli, barang, promo, bayar, pakai, ..."
4,alfamart beli coklat delfi buy get harga rban ...,"[alfamart, beli, coklat, delfi, buy, get, harg...","[alfamart, beli, coklat, delfi, buy, get, harg...","[alfamart, beli, coklat, delfi, buy, get, harg...","[alfamart, beli, coklat, delfi, buy, get, harg..."
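
The lookup itself boils down to a dictionary access with a fallback. Here is a minimal sketch using three entries copied from the slang table shown earlier; `replace_slang` is a hypothetical variant that returns a new list instead of modifying its input, and the sample tokens are made up.

```python
# Three slang -> formal pairs taken from the dictionary preview
slang_dict = {'abis': 'habis', 'notif': 'notifikasi', 'ntar': 'sebentar'}

def replace_slang(tokens, mapping=slang_dict):
    """Return a new list with known slang swapped for its formal form."""
    return [mapping.get(token, token) for token in tokens]

original = ['ntar', 'cek', 'notif', 'promo']
print(replace_slang(original))
# ['sebentar', 'cek', 'notifikasi', 'promo']
print(original)
# ['ntar', 'cek', 'notif', 'promo']
```

Returning a new list keeps the input column untouched, which makes it easier to compare the tokens before and after replacement.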


### Remove Stop Words

For the last step of our text cleaning we will remove stop words (words that carry little meaning) from our data. I will use the Bahasa Indonesia stop word list provided in the spaCy library as the reference for which words to remove; the cell below reads the list from a local `stopword.txt` file. In this step I will also remove any word that is less than 3 characters long in order to trim our data and speed up training.

In [None]:
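with open('stopword.txt', 'r') as f:
    # A set makes each membership test a constant-time lookup
    stoplist = {name.rstrip() for name in f}

def removeStopWords(tokens, min_len=2):
    # Keep tokens that are not stop words and are longer than min_len characters
    return [t for t in tokens if t not in stoplist and len(t) > min_len]

# Start from the slang-replaced tokens produced in the previous step
df1['no_stop'] = df1['no_slang'].apply(removeStopWords)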

df1.head()

Unnamed: 0,TEXT,tokens,no_useless,no_useless1,no_slang,no_stop
0,alfamart untuk penguna baru cashback penguna l...,"[alfamart, untuk, penguna, baru, cashback, pen...","[alfamart, untuk, penguna, baru, cashback, pen...","[alfamart, untuk, penguna, baru, cashback, pen...","[alfamart, untuk, penguna, baru, cashback, pen...","[alfamart, penguna, baru, cashback, penguna, k..."
1,alfamart alfamidi kno miris pot casbacknya pot...,"[alfamart, alfamidi, kno, miris, pot, casbackn...","[alfamart, alfamidi, kno, miris, pot, casbackn...","[alfamart, alfamidi, kno, miris, pot, casbackn...","[alfamart, alfamidi, kno, miris, pot, casbackn...","[alfamart, alfamidi, kno, miris, pot, casbackn..."
2,alfamart belanja pakai cashback belaja,"[alfamart, belanja, pakai, cashback, belaja]","[alfamart, belanja, pakai, cashback, belaja]","[alfamart, belanja, pakai, cashback, belaja]","[alfamart, belanja, pakai, cashback, belaja]","[alfamart, belanja, pakai, cashback, belaja]"
3,alfamart beli barang promo bayar pakai pay sto...,"[alfamart, beli, barang, promo, bayar, pakai, ...","[alfamart, beli, barang, promo, bayar, pakai, ...","[alfamart, beli, barang, promo, bayar, pakai, ...","[alfamart, beli, barang, promo, bayar, pakai, ...","[alfamart, beli, barang, promo, bayar, pakai, ..."
4,alfamart beli coklat delfi buy get harga rban ...,"[alfamart, beli, coklat, delfi, buy, get, harg...","[alfamart, beli, coklat, delfi, buy, get, harg...","[alfamart, beli, coklat, delfi, buy, get, harg...","[alfamart, beli, coklat, delfi, buy, get, harg...","[alfamart, beli, coklat, delfi, buy, get, harg..."
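
To make the two filters concrete, here is a short sketch with a made-up three-word stop list. The real list is much longer, and `remove_stop_words` and `stop_words` are hypothetical names used only for this illustration.

```python
# Hypothetical miniature stop list; a set gives fast membership tests
stop_words = {'untuk', 'yang', 'dan'}

def remove_stop_words(tokens, min_len=2):
    """Drop stop words and tokens that are min_len characters or shorter."""
    return [t for t in tokens if t not in stop_words and len(t) > min_len]

tokens = ['alfamart', 'untuk', 'penguna', 'baru', 'ke', 'pay']
print(remove_stop_words(tokens))
# ['alfamart', 'penguna', 'baru', 'pay']
```

With `min_len=2`, two-character tokens such as `ke` are removed, which matches the "less than 3 characters" rule described above.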


In [None]:
# Combine cleaned text into one string
df1['clean'] = df1['no_stop'].apply(lambda x: ' '.join(x))


df1.head()

Unnamed: 0,TEXT,tokens,no_useless,no_useless1,no_slang,no_stop,clean
0,alfamart untuk penguna baru cashback penguna l...,"[alfamart, untuk, penguna, baru, cashback, pen...","[alfamart, untuk, penguna, baru, cashback, pen...","[alfamart, untuk, penguna, baru, cashback, pen...","[alfamart, untuk, penguna, baru, cashback, pen...","[alfamart, penguna, baru, cashback, penguna, k...",alfamart penguna baru cashback penguna kebelie...
1,alfamart alfamidi kno miris pot casbacknya pot...,"[alfamart, alfamidi, kno, miris, pot, casbackn...","[alfamart, alfamidi, kno, miris, pot, casbackn...","[alfamart, alfamidi, kno, miris, pot, casbackn...","[alfamart, alfamidi, kno, miris, pot, casbackn...","[alfamart, alfamidi, kno, miris, pot, casbackn...",alfamart alfamidi kno miris pot casbacknya pot...
2,alfamart belanja pakai cashback belaja,"[alfamart, belanja, pakai, cashback, belaja]","[alfamart, belanja, pakai, cashback, belaja]","[alfamart, belanja, pakai, cashback, belaja]","[alfamart, belanja, pakai, cashback, belaja]","[alfamart, belanja, pakai, cashback, belaja]",alfamart belanja pakai cashback belaja
3,alfamart beli barang promo bayar pakai pay sto...,"[alfamart, beli, barang, promo, bayar, pakai, ...","[alfamart, beli, barang, promo, bayar, pakai, ...","[alfamart, beli, barang, promo, bayar, pakai, ...","[alfamart, beli, barang, promo, bayar, pakai, ...","[alfamart, beli, barang, promo, bayar, pakai, ...",alfamart beli barang promo bayar pakai pay sto...
4,alfamart beli coklat delfi buy get harga rban ...,"[alfamart, beli, coklat, delfi, buy, get, harg...","[alfamart, beli, coklat, delfi, buy, get, harg...","[alfamart, beli, coklat, delfi, buy, get, harg...","[alfamart, beli, coklat, delfi, buy, get, harg...","[alfamart, beli, coklat, delfi, buy, get, harg...",alfamart beli coklat delfi buy get harga rban ...


In [None]:
df1.to_csv('ShopeeMKIII.csv') 