# Preliminaries

## Sentence Segmentation

In [16]:
import nltk as nt
mytext = "The Archon War began an indeterminate time in the past, and ended around 2,000 years ago. During this period of time, many gods and archons roamed the land, locked in a bitter struggle for supremacy. It appears that the battles associated with the Archon War are a multitude of local struggles that were grouped together by human history."

my_sentences = nt.sent_tokenize(mytext)
my_sentences

['The Archon War began an indeterminate time in the past, and ended around 2,000 years ago.',
 'During this period of time, many gods and archons roamed the land, locked in a bitter struggle for supremacy.',
 'It appears that the battles associated with the Archon War are a multitude of local struggles that were grouped together by human history.']

In [17]:
for sentence in my_sentences:
    print(nt.word_tokenize(sentence))

['The', 'Archon', 'War', 'began', 'an', 'indeterminate', 'time', 'in', 'the', 'past', ',', 'and', 'ended', 'around', '2,000', 'years', 'ago', '.']
['During', 'this', 'period', 'of', 'time', ',', 'many', 'gods', 'and', 'archons', 'roamed', 'the', 'land', ',', 'locked', 'in', 'a', 'bitter', 'struggle', 'for', 'supremacy', '.']
['It', 'appears', 'that', 'the', 'battles', 'associated', 'with', 'the', 'Archon', 'War', 'are', 'a', 'multitude', 'of', 'local', 'struggles', 'that', 'were', 'grouped', 'together', 'by', 'human', 'history', '.']


## Frequent Steps
Steps that are commonly performed during preprocessing.

### Removing Stopwords
Stopwords are usually function words such as conjunctions and articles that carry little meaning for language processing. Punctuation and numbers are often removed in the same step.

In [5]:
from nltk.corpus import stopwords
from string import punctuation

def preprocess_corpus(texts):
    # choose the stopword list for the target language
    mystopwords = set(stopwords.words("english"))
    
    def remove_stops_digits(tokens):
        # drop tokens that are stopwords, digits, or punctuation.
        # Note: the stopword check runs before lowercasing, so capitalized
        # stopwords such as 'The' are kept; isdigit() also misses '2,000'.
        return [token.lower() for token in tokens
                if token not in mystopwords
                and not token.isdigit()
                and token not in punctuation]
    
    return [remove_stops_digits(nt.word_tokenize(text)) for text in texts]

In [31]:
removed_stopwrd = preprocess_corpus(my_sentences)
print(my_sentences[0])
print(removed_stopwrd[0])

The Archon War began an indeterminate time in the past, and ended around 2,000 years ago.
['the', 'archon', 'war', 'began', 'indeterminate', 'time', 'past', 'ended', 'around', '2,000', 'years', 'ago']
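
The output above still contains `'the'` and `'2,000'`: the stopword test is applied to the original casing, and `isdigit()` returns `False` for a string that contains a comma. The following is a minimal sketch of one possible fix, not the notebook's own code. It assumes a small hand-picked stopword set (`SAMPLE_STOPWORDS`) in place of `stopwords.words("english")` so that it runs without NLTK data, and the helper `is_number` is hypothetical.

```python
from string import punctuation

# Assumption: a tiny illustrative stopword set; in the notebook this would be
# set(stopwords.words("english")) from NLTK.
SAMPLE_STOPWORDS = {"the", "an", "in", "and", "a", "of"}
PUNCTUATION = set(punctuation)

def is_number(token):
    # Hypothetical helper: treat tokens such as "2,000" or "3.5" as numbers
    # by ignoring thousands and decimal separators before calling isdigit().
    return token.replace(",", "").replace(".", "").isdigit()

def remove_stops_digits(tokens, stop_words=SAMPLE_STOPWORDS):
    kept = []
    for token in tokens:
        lowered = token.lower()  # lowercase first, then test membership
        if lowered in stop_words or is_number(lowered) or lowered in PUNCTUATION:
            continue
        kept.append(lowered)
    return kept

tokens = ['The', 'Archon', 'War', 'began', 'an', 'indeterminate', 'time',
          'in', 'the', 'past', ',', 'and', 'ended', 'around', '2,000',
          'years', 'ago', '.']
filtered = remove_stops_digits(tokens)
print(filtered)
```

Lowercasing before the membership test is the key change. With the full NLTK stopword list the result would probably be shorter still, since that list is much larger than the sample set used here.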


### Stemming & Lemmatization
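Stemming reduces a word to a stem by stripping affixes with heuristic rules, so the result is not always a real word (for example, `battles` becomes `battl`). Lemmatization maps a word to its dictionary form (lemma), typically using a vocabulary and the word's part of speech (for example, `better` becomes `good`).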
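
To make the contrast concrete, here is a toy illustration in plain Python. It is deliberately simplified: `toy_stem` strips a few hard-coded suffixes and is not the Porter algorithm, and `toy_lemmatize` looks words up in a tiny hand-written table (`TOY_LEMMAS`) rather than in WordNet. Both functions and the table are hypothetical names used only for this sketch.

```python
# Toy rules for illustration only; real stemmers and lemmatizers are far richer.
TOY_SUFFIXES = ("ing", "ed", "es", "s")
TOY_LEMMAS = {
    "battles": "battle",
    "struggles": "struggle",
    "grouped": "group",
    "better": "good",
    "were": "be",
}

def toy_stem(word):
    # Rule-based: strip the first matching suffix, keeping at least 3 characters.
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def toy_lemmatize(word):
    # Dictionary-based: return the known lemma, or the word unchanged.
    return TOY_LEMMAS.get(word, word)

for word in ["battles", "struggles", "grouped", "better", "were"]:
    print(word, "| stem:", toy_stem(word), "| lemma:", toy_lemmatize(word))
```

The rule-based function produces truncated forms such as `battl`, while the lookup can return irregular forms such as `good` for `better`, which no suffix rule would find.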

#### Stemming

In [37]:
from nltk.stem.porter import PorterStemmer

def stem_sentences(sentences):
    # stem every token in each sentence
    stemmer = PorterStemmer()
    def stemming(sentence):
        return [stemmer.stem(word) for word in sentence]
    return [stemming(sentence) for sentence in sentences]

In [43]:
stemmed_text = stem_sentences(removed_stopwrd)
print(my_sentences[2])
print(stemmed_text[2])

It appears that the battles associated with the Archon War are a multitude of local struggles that were grouped together by human history.
['it', 'appear', 'battl', 'associ', 'archon', 'war', 'multitud', 'local', 'struggl', 'group', 'togeth', 'human', 'histori']


#### Lemmatization

In [51]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize("better", pos='a') # 'a' = adjective

'good'
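
`lemmatize` treats a word as a noun unless a `pos` argument is given, which is why the call above passes `pos='a'`. When the part of speech comes from a tagger that emits Penn Treebank tags (such as `JJ`, `VBD`, `NNS`), the tag has to be converted to one of WordNet's single-letter codes first. A minimal sketch of such a conversion follows; `penn_to_wordnet` is a hypothetical helper, not an NLTK function.

```python
# Hypothetical helper (not part of NLTK): map a Penn Treebank tag to the
# single-letter POS code that WordNetLemmatizer.lemmatize() accepts.
PENN_PREFIX_TO_WORDNET = {
    "J": "a",  # adjectives: JJ, JJR, JJS
    "V": "v",  # verbs: VB, VBD, VBG, ...
    "N": "n",  # nouns: NN, NNS, NNP, ...
    "R": "r",  # adverbs: RB, RBR, RBS
}

def penn_to_wordnet(tag, default="n"):
    # Fall back to noun, which is also the lemmatizer's own default.
    return PENN_PREFIX_TO_WORDNET.get(tag[:1], default)

for tag in ["JJR", "VBD", "NNS", "RB", "DT"]:
    print(tag, "->", penn_to_wordnet(tag))
```

Assuming the tagged tokens come from `nt.pos_tag`, each `(word, tag)` pair could then be lemmatized with `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))`.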

## Advanced Processing

In [2]:
import spacy, en_core_web_lg
nlp = en_core_web_lg.load()
doc = nlp(u'The Archon War began an indeterminate time in the past, and ended around 2,000 years ago. During this period of time, many gods and archons roamed the land, locked in a bitter struggle for supremacy. It appears that the battles associated with the Archon War are a multitude of local struggles that were grouped together by human history.')
for token in doc:
    # spaCy provides the lemma, POS tag, and other attributes for each token automatically
    print("Text: {0}\nLemma: {1}\nPOS: {2}\n".format(token.text, token.lemma_, token.pos_))

Text: The
Lemma: the
POS: DET

Text: Archon
Lemma: Archon
POS: PROPN

Text: War
Lemma: War
POS: PROPN

Text: began
Lemma: begin
POS: VERB

Text: an
Lemma: an
POS: DET

Text: indeterminate
Lemma: indeterminate
POS: ADJ

Text: time
Lemma: time
POS: NOUN

Text: in
Lemma: in
POS: ADP

Text: the
Lemma: the
POS: DET

Text: past
Lemma: past
POS: NOUN

Text: ,
Lemma: ,
POS: PUNCT

Text: and
Lemma: and
POS: CCONJ

Text: ended
Lemma: end
POS: VERB

Text: around
Lemma: around
POS: ADV

Text: 2,000
Lemma: 2,000
POS: NUM

Text: years
Lemma: year
POS: NOUN

Text: ago
Lemma: ago
POS: ADV

Text: .
Lemma: .
POS: PUNCT

Text: During
Lemma: during
POS: ADP

Text: this
Lemma: this
POS: DET

Text: period
Lemma: period
POS: NOUN

Text: of
Lemma: of
POS: ADP

Text: time
Lemma: time
POS: NOUN

Text: ,
Lemma: ,
POS: PUNCT

Text: many
Lemma: many
POS: ADJ

Text: gods
Lemma: god
POS: NOUN

Text: and
Lemma: and
POS: CCONJ

Text: archons
Lemma: archon
POS: NOUN

Text: roamed
Lemma: roam
POS: VERB

Text: the
Lemma: