
Dataset source: https://github.com/indobenchmark/indonlu/tree/master/dataset/smsa_doc-sentiment-prosa

In [1]:
import warnings
warnings.filterwarnings('ignore')

Install Sastrawi, the library used here for stemming and stopword removal in Indonesian.

In [2]:
!pip install sastrawi

Collecting sastrawi
  Downloading https://files.pythonhosted.org/packages/6f/4b/bab676953da3103003730b8fcdfadbdd20f333d4add10af949dd5c51e6ed/Sastrawi-1.0.1-py2.py3-none-any.whl (209kB)
Installing collected packages: sastrawi
Successfully installed sastrawi-1.0.1


In [3]:
import pandas as pd
import string
from tqdm import tqdm
from collections import Counter
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory

# Load dataset
Load the training and validation dataframes.

In [4]:
df_train = pd.read_csv('https://raw.githubusercontent.com/indobenchmark/indonlu/master/dataset/smsa_doc-sentiment-prosa/train_preprocess.tsv', sep='\t', names=['text', 'sentiment'])
df_train

Unnamed: 0,text,sentiment
0,warung ini dimiliki oleh pengusaha pabrik tahu...,positive
1,mohon ulama lurus dan k212 mmbri hujjah partai...,neutral
2,lokasi strategis di jalan sumatera bandung . t...,positive
3,betapa bahagia nya diri ini saat unboxing pake...,positive
4,duh . jadi mahasiswa jangan sombong dong . kas...,negative
...,...,...
10995,tidak kecewa,positive
10996,enak rasa masakan nya apalagi kepiting yang me...,positive
10997,hormati partai-partai yang telah berkoalisi,neutral
10998,"pagi pagi di tol pasteur sudah macet parah , b...",negative


In [5]:
df_valid = pd.read_csv('https://raw.githubusercontent.com/indobenchmark/indonlu/master/dataset/smsa_doc-sentiment-prosa/valid_preprocess.tsv', sep='\t', names=['text', 'sentiment'])
df_valid

Unnamed: 0,text,sentiment
0,"meski masa kampanye sudah selesai , bukan bera...",neutral
1,tidak enak,negative
2,restoran ini menawarkan makanan sunda . kami m...,positive
3,lokasi di alun alun masakan padang ini cukup t...,positive
4,betapa bejad kader gerindra yang anggota dprd ...,negative
...,...,...
1255,"film tncfu , tidak cocok untuk penonton yang t...",negative
1256,"indihome ini mahal loh bayar nya . hanya , pen...",negative
1257,"be de gea , cowok cupu yang takut dengan pacar...",negative
1258,valen yang sangat tidak berkualitas . konentat...,negative


# Explanation
The sections below walk through each cleansing step on a sample sentence.

## Normalizing
Normalization consists of three steps:
1. Remove punctuation
2. Case folding
3. Typo handling, based on Peter Norvig's spelling corrector: https://norvig.com/spell-correct.html

A token is treated as a typo only if neither the token itself nor its stem appears in the KBBI (Kamus Besar Bahasa Indonesia) word list.
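
Steps 1 and 2 can be sketched with the standard library alone. The helper name `fold_and_strip` is hypothetical, and the sketch assumes the text is already tokenized with spaces around punctuation, as in the dataset rows shown above. It drops any token made up entirely of punctuation characters, which is an assumption of this sketch rather than a description of the notebook's own check.

```python
import string

def fold_and_strip(text):
    """Lowercase `text` and drop tokens that consist only of punctuation."""
    tokens = text.lower().split()
    kept = [t for t in tokens
            if not all(ch in string.punctuation for ch in t)]
    return " ".join(kept)

print(fold_and_strip("Duh . Jadi mahasiswa jangan sombong dong ."))
```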

In [6]:
kbbi = pd.read_csv('https://raw.githubusercontent.com/Hidayathamir/kata-kbbi-github/main/kbbi.csv')
WORDS = Counter(kbbi['kata'].to_list())

def P(word, N=sum(WORDS.values())): 
    "Probability of `word`."
    return WORDS[word] / N

def correction(word): 
    "Most probable spelling correction for word."
    return max(candidates(word), key=P)

def candidates(word): 
    "Generate possible spelling corrections for word."
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

def known(words): 
    "The subset of `words` that appear in the dictionary of WORDS."
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word): 
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

In [7]:
typo = 'marag'
correction(typo)

'maras'
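
To see how a suggestion such as 'maras' can arise, the sketch below runs the one-edit candidate search against a hypothetical three-word dictionary (`TOY_WORDS` is an assumption, not the KBBI list). If every dictionary entry occurs once in the `Counter`, all candidates get the same probability, so `max` simply returns whichever candidate it encounters first.

```python
# Hypothetical toy dictionary; the notebook uses the full KBBI word list.
TOY_WORDS = {"marah", "maras", "barang"}

def edits1(word):
    """All strings that are one edit away from `word`."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    """The subset of `words` found in the toy dictionary."""
    return {w for w in words if w in TOY_WORDS}

one_edit = sorted(known(edits1("marag")))
print(one_edit)  # ['marah', 'maras']
```

Both 'marah' and 'maras' are one replacement away from 'marag', while 'barang' needs two edits and is therefore not considered at this stage.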

In [8]:
nama = 'syarifah'
correction(nama)

'syarifah'

In [9]:
text = 'hteks ini banyyak kesalahn penulsan'
text

'hteks ini banyyak kesalahn penulsan'

In [10]:
stemmer = StemmerFactory().create_stemmer()

def normalizing(text):
  "Lowercase `text`, drop punctuation tokens, and correct likely typos."
  tokens = []
  for token in text.lower().split():
    if token in string.punctuation:  # Skip punctuation tokens
      continue
    if token in WORDS or stemmer.stem(token) in WORDS:
      # Keep the token when it, or its stem, is a KBBI entry
      tokens.append(token)
    else:
      tokens.append(correction(token))  # Otherwise treat it as a typo
  return ' '.join(tokens)

text = normalizing(text)
text

'teks ini banyak kesalahan penulisan'

In [11]:
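# Debugging aid: uncomment to see which branch of normalizing() handles a token
# token = 'terpilihnya'
# print(token, token in WORDS)  # True for any word listed in KBBI
# print(stemmer.stem(token), stemmer.stem(token) in WORDS)  # Stem lookup succeeds for e.g. 'terpilihnya'
# print(correction(token), correction(token) in WORDS)  # Correction is needed for e.g. 'ketiks'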

## Stopword Removal
How Sastrawi's stopword remover works:

In [12]:
stopword = StopWordRemoverFactory().create_stop_word_remover()

In [13]:
text

'teks ini banyak kesalahan penulisan'

In [14]:
text = stopword.remove(text)
text

'teks banyak kesalahan penulisan'
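
Conceptually, stopword removal filters tokens against a fixed word list. The sketch below illustrates the idea with a hypothetical three-word list (`TOY_STOPWORDS`); it is not Sastrawi's implementation, and Sastrawi's built-in list is much longer.

```python
# Hypothetical stopword subset, for illustration only.
TOY_STOPWORDS = {"ini", "yang", "di"}

def remove_stopwords(text):
    """Drop every whitespace-separated token found in TOY_STOPWORDS."""
    return " ".join(t for t in text.split() if t not in TOY_STOPWORDS)

print(remove_stopwords("teks ini banyak kesalahan penulisan"))
```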

## Stemming

How the Sastrawi stemmer works:

In [15]:
stemmer = StemmerFactory().create_stemmer()

In [16]:
text

'teks banyak kesalahan penulisan'

In [17]:
text = stemmer.stem(text)
text

'teks banyak salah tulis'
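
The reduction of 'kesalahan' to 'salah' and 'penulisan' to 'tulis' comes from stripping affixes and checking the result against known root words. The toy function below is only a sketch of that idea for these two affix patterns. `TOY_ROOTS` and `toy_stem` are hypothetical names, and this is not the algorithm Sastrawi uses.

```python
# Hypothetical root-word dictionary, for illustration only.
TOY_ROOTS = {"teks", "banyak", "salah", "tulis"}

def toy_stem(word):
    """Strip ke-...-an or pen-...-an when the remainder is a known root."""
    if word in TOY_ROOTS:
        return word
    if word.startswith("ke") and word.endswith("an"):
        core = word[2:-2]
        if core in TOY_ROOTS:
            return core
    if word.startswith("pen") and word.endswith("an"):
        core = word[3:-2]
        # The prefix can absorb a root-initial 't', so try restoring it.
        for candidate in (core, "t" + core):
            if candidate in TOY_ROOTS:
                return candidate
    return word  # Leave unknown words untouched

stemmed = " ".join(toy_stem(w) for w in "teks banyak kesalahan penulisan".split())
print(stemmed)
```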

# Apply Cleansing

Apply the full cleansing pipeline to both dataframes.

In [18]:
def clean(text):
  # Normalizing
  text = normalizing(text)
  # Stopword Removal
  text = stopword.remove(text)
  # Stemming
  text = stemmer.stem(text)
  return text

In [19]:
text = 'hteks ini banyyak kesalahn penulsan'
text

'hteks ini banyyak kesalahn penulsan'

In [20]:
text = clean(text)
text

'teks banyak salah tulis'

In [21]:
tqdm.pandas()

In [22]:
df_train['text'] = df_train['text'].progress_apply(clean)

100%|██████████| 11000/11000 [40:19<00:00,  4.55it/s]


In [23]:
df_valid['text'] = df_valid['text'].progress_apply(clean)

100%|██████████| 1260/1260 [03:42<00:00,  5.67it/s]
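
The two runs above take roughly 40 and 4 minutes. One possible speed-up, not used in this notebook, is to memoize the per-token work, since common tokens recur across many rows. The sketch below shows the idea with `functools.lru_cache`. `slow_fix` is a hypothetical stand-in for an expensive per-token step such as `correction`, and the call counter exists only to show the effect.

```python
from functools import lru_cache

calls = 0  # Counts how often the expensive step actually runs

def slow_fix(token):
    """Hypothetical stand-in for an expensive per-token step."""
    global calls
    calls += 1
    return token.lower()

@lru_cache(maxsize=None)
def cached_fix(token):
    """Memoized wrapper: each distinct token is processed once."""
    return slow_fix(token)

def normalize_cached(text):
    return " ".join(cached_fix(t) for t in text.split())

docs = ["tidak enak", "tidak kecewa", "enak sekali"]
cleaned = [normalize_cached(d) for d in docs]
print(cleaned, calls)
```

Here six tokens are processed but the expensive step runs only four times, once per distinct token. Whether this helps on the real data depends on how often tokens repeat.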


Export the cleaned dataframes to CSV.

In [24]:
df_train.to_csv('train.csv', index=False)
df_valid.to_csv('valid.csv', index=False)