#CFG Exercise

#Create Rules Manually
Below is an example of production rules written by hand as a plain string.
```
cfg_rules = """
S -> NP VP
NP -> Det N | PropN
Det -> PosPro | Art
VP -> Vt NP
Art -> 'the' | 'a'
PropN -> 'Alice'
N -> 'duck' | 'telescope' | 'park'
Vt -> 'saw'
PosPro -> 'my' | 'her'
"""
```

#Create CFG from String
Next, build a CFG grammar object from the string above. The resulting grammar object has 15 productions.
```
import nltk

cfg = nltk.CFG.fromstring(cfg_rules)
print(cfg)
```
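
As a quick check of the production count stated above, the sketch below rebuilds the same grammar and inspects it. The names `n_productions` and `start_symbol` are illustrative and do not appear in the original notebook.

```python
import nltk

cfg_rules = """
S -> NP VP
NP -> Det N | PropN
Det -> PosPro | Art
VP -> Vt NP
Art -> 'the' | 'a'
PropN -> 'Alice'
N -> 'duck' | 'telescope' | 'park'
Vt -> 'saw'
PosPro -> 'my' | 'her'
"""

cfg = nltk.CFG.fromstring(cfg_rules)

# Every alternative separated by "|" becomes its own production
n_productions = len(cfg.productions())
# The left-hand side of the first rule is used as the start symbol
start_symbol = cfg.start()

print(n_productions, start_symbol)
```

Each `|` alternative is counted separately, which is why nine lines of rules yield more than nine productions.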
#Parsing
Using this grammar, we try to obtain the parse tree of a sentence and display the result.
```
from nltk.parse.chart import BottomUpChartParser
parser = BottomUpChartParser(cfg)
sentence = 'the duck saw Alice'.split()
parsed = list(parser.parse(sentence))
print(parsed[0])
# draw() opens a Tk window, so it needs a local display
parsed[0].draw()
```
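
The parser only accepts words that the grammar lists as terminals. The sketch below, under the assumption that an empty result is an acceptable way to signal "no parse", uses `CFG.check_coverage` to catch unknown words before parsing. The helper `parse_all` and the sentence containing 'Bob' are hypothetical and not part of the original notebook.

```python
import nltk
from nltk.parse.chart import BottomUpChartParser

cfg = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N | PropN
Det -> PosPro | Art
VP -> Vt NP
Art -> 'the' | 'a'
PropN -> 'Alice'
N -> 'duck' | 'telescope' | 'park'
Vt -> 'saw'
PosPro -> 'my' | 'her'
""")
parser = BottomUpChartParser(cfg)

def parse_all(tokens):
    """Return every parse tree for tokens, or [] if a word is unknown."""
    try:
        # check_coverage raises ValueError for words missing from the grammar
        cfg.check_coverage(tokens)
    except ValueError:
        return []
    return list(parser.parse(tokens))

trees = parse_all('the duck saw Alice'.split())
unknown = parse_all('the duck saw Bob'.split())

if trees:
    # Text rendering of the tree; unlike draw(), it needs no display
    trees[0].pretty_print()
```

Checking that the list is non-empty before indexing `[0]` avoids an `IndexError` when a sentence has no parse.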

#Extract Production Rules From Corpus
As an alternative to writing a grammar by hand, we can extract the rules from an annotated corpus. A well-known English corpus is the Penn Treebank. One Indonesian treebank corpus is available at https://github.com/famrashel/idn-treebank.
```
from nltk import Tree

parsed_string = "(S      (NP-SBJ-1    (NN      Kisah) (NP      (NN      pendiri) (NNP      Facebook))) (VP      (VB      dibuat) (S  (NP-SBJ  *-1) (NP-PRD  (NN      komik)))))"
t = Tree.fromstring(parsed_string)
print(t.productions())
```
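
Besides `productions()`, a `Tree` exposes its terminals and their tags directly. The sketch below reuses the bracketed sentence from the cell above; the variable names `prods`, `leaves`, and `tagged` are illustrative.

```python
from nltk import Tree

parsed_string = (
    "(S (NP-SBJ-1 (NN Kisah) (NP (NN pendiri) (NNP Facebook))) "
    "(VP (VB dibuat) (S (NP-SBJ *-1) (NP-PRD (NN komik)))))"
)
t = Tree.fromstring(parsed_string)

# One production per internal node of the tree
prods = t.productions()
# Terminal symbols in sentence order (the trace *-1 is a leaf too)
leaves = t.leaves()
# (word, tag) pairs, where the tag is the label of the parent node
tagged = t.pos()

for p in prods:
    print(p)
```

Note that the trace `*-1` counts as a leaf, so it shows up in the lexicon unless it is filtered out.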

#Create CFG from Production Rules
Once we have the list of production rules from the corpus, we can build a CFG grammar from it. This grammar can then be used to parse test data or new data.
```
from nltk.grammar import CFG, Nonterminal
grammar = CFG(Nonterminal('S'), t.productions())
print(grammar)
```
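
To see such a grammar in use, the sketch below parses a token list with a grammar built from the single tree shown earlier. It assumes the input tokens include the trace `*-1`, because this tiny grammar has no other way to fill the embedded subject; the variable names are illustrative.

```python
from nltk import Tree
from nltk.grammar import CFG, Nonterminal
from nltk.parse.chart import BottomUpChartParser

t = Tree.fromstring(
    "(S (NP-SBJ-1 (NN Kisah) (NP (NN pendiri) (NNP Facebook))) "
    "(VP (VB dibuat) (S (NP-SBJ *-1) (NP-PRD (NN komik)))))"
)
grammar = CFG(Nonterminal('S'), t.productions())
parser = BottomUpChartParser(grammar)

# With the trace token the sentence is derivable from the extracted rules
with_trace = list(parser.parse(['Kisah', 'pendiri', 'Facebook', 'dibuat', '*-1', 'komik']))
# Without it, every word is known but no rule sequence covers the sentence
without_trace = list(parser.parse(['Kisah', 'pendiri', 'Facebook', 'dibuat', 'komik']))

print(len(with_trace), len(without_trace))
```

A grammar extracted from more sentences may contain rules that make the trace unnecessary, so this behaviour depends on the corpus sample.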

#Adding Production Rules to an Existing Grammar
If the corpus grows later on, we can add new production rules to the existing grammar.
```
from nltk.grammar import CFG, Nonterminal, Production

#Create Nonterminal objects
lhs = Nonterminal('S')
rhs1 = Nonterminal('A')
rhs2 = Nonterminal('B')
#A terminal is just a plain string
rhs3 = "saya"
#Create the Production object
newprod = Production(lhs,[rhs1,rhs2,rhs3])

#Add it to the existing grammar by rebuilding the CFG: appending to
#grammar.productions() would not refresh the grammar's internal indexes
grammar = CFG(grammar.start(), list(grammar.productions()) + [newprod])
print(grammar)
```
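
When the new rules come from additional annotated sentences rather than hand-built `Production` objects, one option is to merge both rule lists and construct a fresh `CFG`. This is a minimal sketch under that assumption; both bracketed sentences are shortened, hypothetical examples.

```python
from nltk import Tree
from nltk.grammar import CFG, Nonterminal

# Hypothetical "old" and "new" corpus sentences
old_tree = Tree.fromstring("(S (NP-SBJ (NN Kisah)) (VP (VB dibuat)))")
new_tree = Tree.fromstring(
    "(S (NP-SBJ (NN monyet)) (VP (VB mengganggu) (NP (NN warga))))"
)

grammar = CFG(Nonterminal('S'), old_tree.productions())

# Merge old and new rules; dict.fromkeys drops duplicates and keeps order
merged = list(dict.fromkeys(grammar.productions() + new_tree.productions()))
grammar = CFG(grammar.start(), merged)

# Lookups by left-hand side see the new rules because the CFG was rebuilt
vp_rules = grammar.productions(lhs=Nonterminal('VP'))
print(vp_rules)
```

Rebuilding costs a pass over all rules, which is negligible next to the cost of parsing.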

#Tasks
1. Extract the production rules from at least the first 100 sentences (depending on your machine's capacity) of the corpus https://github.com/famrashel/idn-treebank/blob/master/Indonesian_Treebank.bracket. You may need to connect Google Drive to Colab; see this tutorial: https://www.marktechpost.com/2019/06/07/how-to-connect-google-colab-with-google-drive/

2. Print the 10 most frequent rules. Hint: use collections.Counter.

3. Build a CFG grammar from the extracted production rules.
4. Parse the following sentence:
```
Binatang ini tidak bisa dibunuh karena masyarakat India menganggap mereka suci .
```

5. Print the top-50 lexicon entries (terminal symbols).

For example, in: (S  (PP  (IN	  (Selama)) (NP  (CD	  (bertahun-tahun)))) (NP-SBJ  (NN	  (monyet))) (VP  (VB	  (mengganggu)) (NP  (NN	  (warga)) (NNP	  (Delhi)))) (Z	  (.)))
the lexicon is: Selama, bertahun-tahun, monyet, mengganggu, warga, Delhi, .

6. Write 5 new sentences using the top-50 lexicon, then parse them.
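
The lexicon example in task 5 can be reproduced with a short sketch. It assumes, as in that example, that the corpus wraps every word in its own pair of parentheses, which has to be removed before `Tree.fromstring` can treat the word as a leaf. The regular expression is the one used in the answer section below; the variable names are illustrative.

```python
import re
from nltk import Tree

line = (
    "(S (PP (IN (Selama)) (NP (CD (bertahun-tahun)))) "
    "(NP-SBJ (NN (monyet))) "
    "(VP (VB (mengganggu)) (NP (NN (warga)) (NNP (Delhi)))) "
    "(Z (.)))"
)

# Turn "(NN (monyet))" into "(NN monyet)" by unwrapping single-token leaves
unwrapped = re.sub(r'\(([a-zA-Z0-9-*.:,"]+)\)', r'\1', line)

tree = Tree.fromstring(unwrapped)
lexicon = tree.leaves()
print(lexicon)
```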



# ANSWER

In [61]:
from google.colab import drive

drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [0]:
#example path
path = '/content/gdrive/My Drive/tugas nlp/Indonesian_Treebank.bracket'

In [0]:
import nltk
from nltk import Tree
import re
from collections import Counter
from nltk.grammar import CFG, Nonterminal
from nltk.parse.chart import BottomUpChartParser

###Extract Production Rules

In [0]:
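#here
# Read the treebank: one bracketed sentence per line
with open(path,'r') as file:
  data = file.read().splitlines()
t = []      # production rules from all processed sentences
tleaf = []  # terminal symbols (lexicon) from all processed sentences
# Only the first 35 sentences are processed here; the task asks for at least 100 if the machine allows
for i in range(35):
  # Each word is wrapped in its own parentheses, e.g. (NN (monyet)); unwrap single-token leaves
  data[i] = re.sub(r'\(([a-zA-Z0-9-*.:,"]+)\)',r'\1',data[i])
  rule = Tree.fromstring(data[i].upper())
  t.extend(rule.productions())
  # leaves() belongs to the Tree object, not to the list of productions
  tleaf.extend(rule.leaves())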

In [65]:
t

[NP -> NN SBAR,
 NN -> 'KERA',
 SBAR -> SC S,
 SC -> 'UNTUK',
 S -> NP-SBJ VP,
 NP-SBJ -> '*',
 VP -> VB NP,
 VB -> 'AMANKAN',
 NP -> NN,
 NN -> PESTA,
 PESTA -> 'OLAHRAGA',
 S -> NP-SBJ VP Z,
 NP-SBJ -> NNP NNP NNP,
 NNP -> 'PEMERINTAH',
 NNP -> 'KOTA',
 NNP -> 'DELHI',
 VP -> VB NP SBAR,
 VB -> 'MENGERAHKAN',
 NP -> NN,
 NN -> 'MONYET',
 SBAR -> SC S,
 SC -> 'UNTUK',
 S -> NP-SBJ VP,
 NP-SBJ -> '*',
 VP -> VB NP PP,
 VB -> 'MENGUSIR',
 NP -> NP SBAR,
 NP -> NN JJ,
 NN -> 'MONYET-MONYET',
 JJ -> 'LAIN',
 SBAR -> SC S,
 SC -> 'YANG',
 S -> NP-SBJ VP,
 NP-SBJ -> '*',
 VP -> VB ADJP,
 VB -> 'BERBADAN',
 ADJP -> RB JJ,
 RB -> 'LEBIH',
 JJ -> 'KECIL',
 PP -> IN NP,
 IN -> 'DARI',
 NP -> NN NP,
 NN -> 'ARENA',
 NP -> NNP NNP,
 NNP -> PESTA,
 PESTA -> 'OLAHRAGA',
 NNP -> 'PERSEMAKMURAN',
 Z -> '.',
 S -> NP-SBJ VP Z,
 NP-SBJ -> CD NN,
 CD -> 'BEBERAPA',
 NN -> 'LAPORAN',
 VP -> VB SBAR,
 VB -> 'MENYEBUTKAN',
 SBAR -> '0' S,
 S -> NP-SBJ-1 VP,
 NP-SBJ-1 -> QP NN,
 QP -> RB CD,
 RB -> 'SETIDAK
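
The listing above contains rules such as `NN -> PESTA` and `PESTA -> 'OLAHRAGA'`. These appear when a leaf consists of several words, because the unwrapping pattern only matches single tokens and the leftover parentheses are read as a new subtree. One possible workaround, sketched below, joins the words of such a leaf with an underscore. It assumes every word or phrase in the corpus is wrapped in its own parentheses; the helper `unwrap_leaves` and the sample line are hypothetical.

```python
import re
from nltk import Tree

# Hypothetical line modelled on the rules in the listing above
line = "(NP (NN (Arena)) (NP (NNP (Pesta Olahraga)) (NNP (Persemakmuran))))"

def unwrap_leaves(bracketed):
    """Unwrap innermost parenthesised leaves, joining multi-word leaves with '_'."""
    return re.sub(
        r'\(([^()]+)\)',                        # a group with no nested parentheses
        lambda m: '_'.join(m.group(1).split()),  # "Pesta Olahraga" -> "Pesta_Olahraga"
        bracketed,
    )

tree = Tree.fromstring(unwrap_leaves(line))
leaves = tree.leaves()
rules = [str(p) for p in tree.productions()]
print(rules)
```

With this variant, sentences to be parsed must use the same underscore-joined tokens, so it is a trade-off rather than a drop-in replacement.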

###Print top-10 rules

In [66]:
#here
Counter(t).most_common(10)

[(PP -> IN NP, 39),
 (S -> NP-SBJ VP, 32),
 (Z -> '.', 31),
 (VP -> VB NP, 24),
 (NP-SBJ -> '*', 23),
 (SBAR -> SC S, 20),
 (NP -> NN, 18),
 (VP -> MD VP, 18),
 (IN -> 'DI', 17),
 (NP -> NN NN, 16)]
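
The top-10 list above mixes structural rules with lexical ones such as `Z -> '.'` and `IN -> 'DI'`. If the two kinds should be ranked separately, `Production.is_lexical()` and `Production.is_nonlexical()` can split them. The sketch below uses two short, hypothetical sentences instead of the corpus.

```python
from collections import Counter
from nltk import Tree

# Hypothetical mini-corpus standing in for the treebank
sentences = [
    "(S (NP-SBJ (NN monyet)) (VP (VB mengganggu) (NP (NN warga))) (Z .))",
    "(S (NP-SBJ (NN warga)) (VP (VB mengusir) (NP (NN monyet))) (Z .))",
]
prods = [p for s in sentences for p in Tree.fromstring(s).productions()]

# Rules whose right-hand side contains only nonterminals
structural = Counter(p for p in prods if p.is_nonlexical())
# Rules that introduce a terminal symbol
lexical = Counter(p for p in prods if p.is_lexical())

print(structural.most_common(3))
print(lexical.most_common(3))
```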

###Build the CFG Grammar

In [0]:
#here
# Deduplicate the rules, then pass a list so grammar.productions() stays indexable
grammar = CFG(Nonterminal('S'), list(set(t)))

In [68]:
print(grammar)

Grammar with 459 productions (start state = S)
    NN -> 'MONYET'
    JJ -> 'PANJANG'
    NN -> 'DUNIA'
    RB -> 'BEGITU'
    NNP -> 'PEMERINTAH'
    NN -> 'ACARA'
    VP -> VP SBAR
    ADJP -> Z JJ Z
    VB -> 'ADA'
    VB -> 'MENDAPATKAN'
    NP-SBJ -> '*'
    NP -> NP NP
    NP-SBJ-1 -> NN NN PR
    VB -> 'MENCAPAI'
    JJ -> 'KECIL'
    NN -> 'NAMA-NAMA'
    S-TPC-1 -> NP-SBJ-2 VP
    NP -> NNP
    VB -> BERKEMBANG
    VB -> 'BERKEMBANG'
    REFORMASI -> 'EKONOMI'
    VB -> 'MENYUMBANG'
    S -> NP-1 Z VP
    NN -> REFORMASI
    NP-SBJ -> NP NP
    CD -> '5.000'
    CD -> 'SEPARUH'
    NN -> 'KEBERHASILAN'
    VP -> VB SBAR
    NP-SBJ -> NN NP
    RB -> 'SERING'
    NN -> 'GEDUNG'
    VB -> 'KESULITAN'
    NP-SBJ -> '*-1'
    PR -> 'DEMIKIAN'
    NN -> 'MONYET-MONYET'
    VP -> VB NP
    VP -> Z VB Z
    NP-SBJ-1 -> NN JJ
    NN -> 'PIHAK'
    NNP -> 'DELHI'
    VB -> 'MENYERBU'
    NP -> CD NN
    VB -> 'MEMBERI'
    PERDANA -> 'MENTERI'
    JJ -> 'SEBAIK'
    SC -> 'KALAU'
    N

###Parse the Example Sentence

In [69]:
#here

parser = BottomUpChartParser(grammar)
sentence = 'Binatang ini tidak bisa dibunuh karena masyarakat India menganggap mereka suci .'.upper().split()
parsed = list(parser.parse(sentence))
print(parsed[0])

(S
  (NP-SBJ (NP (NN BINATANG)) (NP (PR INI)))
  (VP
    (VP (NEG TIDAK) (MD BISA) (VP (VB DIBUNUH)))
    (SBAR
      (SC KARENA)
      (S
        (NP-SBJ-1 (NN MASYARAKAT) (NP (NNP INDIA)))
        (VP
          (VB MENGANGGAP)
          (S (NP-SBJ (PRP MEREKA)) (ADJP-PRD (JJ SUCI)))))))
  (Z .))


###Print top-50 Lexicon

In [108]:
#here

countl = Counter(tleaf)
top50 = countl.most_common(50)
top50

[('.', 31),
 ('*', 25),
 ('DI', 17),
 ('*-1', 16),
 ('MONYET', 15),
 (',', 15),
 ('DAN', 13),
 ('"', 10),
 ('UNTUK', 9),
 ('YANG', 8),
 ('AKAN', 8),
 ('0', 7),
 ('INI', 7),
 ('BANYAK', 6),
 ('CINA', 6),
 ('ORANG', 6),
 ('DELHI', 5),
 ('DARI', 5),
 ('MEREKA', 5),
 ('AMAL', 5),
 ('ITU', 5),
 ('MENGUSIR', 4),
 ('KECIL', 4),
 ('PEMKOT', 4),
 ('BESAR', 4),
 ('FACEBOOK', 4),
 ('PEMERINTAH', 3),
 ('LAIN', 3),
 ('INDIA', 3),
 ('DIKERAHKAN', 3),
 ('DENGAN', 3),
 ('STADION', 3),
 ('DUA', 3),
 ('*T*-1', 3),
 ('BISA', 3),
 ('BAGI', 3),
 ('MILIARDER', 3),
 ('KAYA', 3),
 ('JAMUAN', 3),
 ('MALAM', 3),
 ('SUDAH', 3),
 ('MENYUMBANGKAN', 3),
 ('MENGENAI', 3),
 ('DIA', 3),
 ('KISAH', 3),
 ('KOMIK', 3),
 ('ZUCKERBERG', 3),
 ('OLAHRAGA', 2),
 ('KOTA', 2),
 ('MONYET-MONYET', 2)]
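
The list above also contains punctuation and symbols such as `*`, `*-1`, `*T*-1`, and `0`. The sketch below filters them out before counting, under the assumption that tokens starting with `*` and the bare `0` are empty elements (traces) rather than words. The helper `is_word` and the short sample list are hypothetical.

```python
from collections import Counter

# Small sample standing in for the full list of leaves
tleaf = ['.', '*', 'DI', '*-1', 'MONYET', ',', 'MONYET',
         '0', '*T*-1', 'DI', '"', 'DELHI']

def is_word(token):
    """True for real words; False for punctuation and assumed trace symbols."""
    is_trace = token.startswith('*') or token == '0'
    has_alnum = any(ch.isalnum() for ch in token)
    return has_alnum and not is_trace

words = Counter(tok for tok in tleaf if is_word(tok))
print(words.most_common(50))
```

Numbers such as `5.000` are kept because only the exact token `0` is treated as an empty element.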

###Write 5 New Sentences (by Hand) from the Top-50 Lexicon, and Parse Them

In [107]:
#here

sentence = ['Zuckerberg itu mengusir monyet facebook .','Pemerintah India menyumbangkan monyet Delhi .','monyet menyumbangkan facebook .','pemerintah kota Delhi berbadan monyet Cina .','pemerintah Cina banyak menyumbangkan monyet .']
# list_parser = []
for i in range(len(sentence)):
  sntnce = sentence[i].upper().split()
  parsed = list(parser.parse(sntnce))
  # list_parser.extend(parsed[0])
  print("[ Sentence",i+1,"]\n", parsed)

[ Sentence 1 ]
 [Tree('S', [Tree('NP-SBJ', [Tree('NP', [Tree('NNP', ['ZUCKERBERG'])]), Tree('NP', [Tree('PR', ['ITU'])])]), Tree('VP', [Tree('VB', ['MENGUSIR']), Tree('NP', [Tree('NP', [Tree('NN', ['MONYET'])]), Tree('NP', [Tree('NNP', ['FACEBOOK'])])])]), Tree('Z', ['.'])]), Tree('S', [Tree('NP-SBJ', [Tree('NP', [Tree('NNP', ['ZUCKERBERG'])]), Tree('NP', [Tree('PR', ['ITU'])])]), Tree('VP', [Tree('VB', ['MENGUSIR']), Tree('NP', [Tree('NN', ['MONYET']), Tree('NP', [Tree('NNP', ['FACEBOOK'])])])]), Tree('Z', ['.'])]), Tree('S', [Tree('NP-SBJ', [Tree('NP', [Tree('NNP', ['ZUCKERBERG'])]), Tree('NP', [Tree('PR', ['ITU'])])]), Tree('VP', [Tree('VB', ['MENGUSIR']), Tree('NP', [Tree('NN', ['MONYET']), Tree('NNP', ['FACEBOOK'])])]), Tree('Z', ['.'])]), Tree('S', [Tree('NP-SBJ', [Tree('NP', [Tree('NNP', ['ZUCKERBERG'])]), Tree('NP', [Tree('PR', ['ITU'])])]), Tree('VP', [Tree('VB', ['MENGUSIR']), Tree('NP', [Tree('NN', ['MONYET'])]), Tree('NP', [Tree('NNP', ['FACEBOOK'])])]), Tree('Z', ['.'])
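
As the output above shows, a grammar extracted from a treebank is ambiguous and a single sentence can yield many trees. Since `parser.parse` returns an iterator, `itertools.islice` can take just the first few parses. The sketch below uses a small hypothetical grammar that reproduces one of the ambiguities seen above (`NP -> NP NP` versus `NP -> NN NNP`).

```python
from itertools import islice

import nltk
from nltk.parse.chart import BottomUpChartParser

# Hypothetical grammar with two ways to analyse "MONYET FACEBOOK"
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> NP NP | NN | NNP | NN NNP
VP -> VB NP
NN -> 'MONYET'
NNP -> 'FACEBOOK' | 'ZUCKERBERG'
VB -> 'MENGUSIR'
""")
parser = BottomUpChartParser(grammar)
tokens = 'ZUCKERBERG MENGUSIR MONYET FACEBOOK'.split()

# Take only the first parse from the iterator
first_only = list(islice(parser.parse(tokens), 1))
# For comparison, collect every parse
all_trees = list(parser.parse(tokens))

print(len(first_only), len(all_trees))
print(first_only[0])
```

Printing one tree with `print(tree)` is also easier to read than printing the whole list of `Tree(...)` objects.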

###Good Luck