Reference: https://github.com/sagorbrur/bnlp

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
!pip install bnlp_toolkit



# Cool Imports

In [3]:
import gensim
from bnlp.bengali_word2vec import Bengali_Word2Vec
from bnlp.sentencepiece_tokenizer import SP_Tokenizer
from bnlp.basic_tokenizer import BasicTokenizer
from bnlp.ner import NER
from bnlp.glove_wordvector import BN_Glove
from bnlp.nltk_tokenizer import NLTK_Tokenizer
from bnlp.pos import POS
import nltk
from bnlp.bengali_fasttext import Bengali_Fasttext
nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

# Part 1 : BANGLA WORD EMBEDDING



* The SentencePiece, Word2Vec, fastText, and GloVe models were trained on the [Bengali Wikipedia Dump Dataset](https://dumps.wikimedia.org/bnwiki/latest/).
* SentencePiece training vocabulary size = 50000.
* fastText was trained with total words = 20M, vocab size = 1171011, epochs = 50, embedding dimension = 300, and a training loss of 0.318668.
* Word2Vec word embedding dimension = 300.
* For the Bengali GloVe word vectors and their training process, follow this repository.
* The Bengali CRF POS tagger was trained on the nltr dataset with 80% accuracy.
* The Bengali CRF NER tagger was trained on this data with 90% accuracy.



# Load the Pretrained Bengali Word2Vec Model

In [4]:
%%time
model = gensim.models.Word2Vec.load('/content/drive/My Drive/datasets/bn_word2vec/bengali_word2vec.model')
bwv = Bengali_Word2Vec()
model_path = "/content/drive/My Drive/datasets/bn_word2vec/bengali_word2vec.model"
word = 'আমার'
vector = bwv.generate_word_vector(model_path, word)
print(vector.shape)
#print(vector) 





(300,)
CPU times: user 13.9 s, sys: 8.21 s, total: 22.1 s
Wall time: 50.2 s


To train Bengali Word2Vec on our own dataset, we can use the code snippet below.

In [5]:
%%time
bwv = Bengali_Word2Vec(True)
data_file = "/content/drive/My Drive/datasets/News Articles/ebala_articles.txt"
model_name = "test_model.model"
vector_name = "test_vector.vector"
bwv.train_word2vec(data_file, model_name, vector_name)




test_model.model and test_vector.vector saved in your current directory.
CPU times: user 2min 46s, sys: 3.09 s, total: 2min 49s
Wall time: 2min 19s


In [6]:
# Query through model.wv; calling most_similar directly on the model is deprecated in gensim.
words = model.wv.most_similar(positive=['খব'], topn=10)

for w in words:
  print(w[0])  # each result is a (word, similarity) pair



(১৮৯৪-১৯৭৯)
"খুব
কানকান
'তুষারঝড়ের
'দার
দর্শকমণ্ডলী
ষাটটিরও
দামি,
২৮৫,০০০-এর
ততো


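The second result, "খুব (খুব with a stray leading quotation mark), shows that the model places the misspelled খব close to the correct word খুব.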
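
Several of the neighbours above carry stray punctuation (a leading quote, a trailing comma, surrounding parentheses), which suggests the training text was split on whitespace without further cleaning. A minimal sketch of how such tokens could be cleaned before training or before displaying results is shown here. The helper name `strip_edge_punctuation` is hypothetical and is not part of bnlp or gensim.

```python
import unicodedata


def strip_edge_punctuation(token: str) -> str:
    """Remove punctuation characters from both ends of a token.

    Punctuation inside the token (for example the hyphen in a year
    range) is left untouched.
    """
    start, end = 0, len(token)
    # Unicode general categories starting with "P" are punctuation.
    while start < end and unicodedata.category(token[start]).startswith("P"):
        start += 1
    while end > start and unicodedata.category(token[end - 1]).startswith("P"):
        end -= 1
    return token[start:end]


# Tokens copied from the most_similar output above.
for raw in ['"খুব', 'দামি,', '(১৮৯৪-১৯৭৯)']:
    print(raw, '->', strip_edge_punctuation(raw))
```

Cleaning at training time would merge variants such as "খুব and খুব into one vocabulary entry, at the cost of retraining the model.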

In [7]:
# Query through model.wv; calling most_similar directly on the model is deprecated in gensim.
words = model.wv.most_similar(positive=['অমি'], topn=10)

for w in words:
  print(w[0])  # each result is a (word, similarity) pair



শরণ,
ইনসান
বারদেম,
।যিনি
স্বপন,
হাসিন
শ্বেতকেতু,
রায়;
বাগচী,
রাণা,


অমি is not necessarily a misspelled word; it can also be a person's name. Accordingly, the Bengali Word2Vec model returns several Bengali names as its nearest neighbours. Pretty cool stuff, right? :)

In [8]:
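# Query through model.wv; calling most_similar directly on the model is deprecated in gensim.
words = model.wv.most_similar(positive=['আমি'], topn=10)
for w in words:
  print(w[0])  # each result is a (word, similarity) pair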

তুমি
আমাকে
আপনি
আমার
তোমরা
তোমার
"আমি
আমরা
তো
তোমাকে


আমি, on the other hand, is not a name, and the most similar words that the Bengali Word2Vec model returns for it (mostly other pronouns) are good enough, no?

# Bengali FastText
Generate a vector using the pretrained model (the downloaded model is larger than 3 GB) :(

The code below needs a lot of memory, so it is wrapped in a string and not executed!

In [9]:
'''
%%time
bft = Bengali_Fasttext()
word = "গ্রাম"
model_path = "/content/drive/My Drive/datasets/bn_word2vec/bengali_fasttext.bin"
word_vector = bft.generate_word_vector(model_path, word)
print(word_vector.shape)
#print(word_vector)
'''


To train a Bengali fastText model on our own data, we can use the code below:

In [10]:
'''
bft = Bengali_Fasttext()
data = "/content/drive/My Drive/datasets/News Articles/ebala_articles.txt"
model_name = "saved_model.bin"
epoch = 50
bft.train_fasttext(data, model_name, epoch)
'''


# Bengali GloVe Word Vectors

The BNLP authors trained a GloVe model on Bengali data (Wikipedia and news articles) and published the resulting Bengali GloVe word vectors.
We can download them and use them for various machine learning purposes.

In [11]:
%%time
glove_path = "/content/drive/My Drive/datasets/bn_word2vec/bn_glove.39M.300d.txt"
word = "গ্রাম"
bng = BN_Glove()
res = bng.closest_word(glove_path, word)
print(res)
vec = bng.word2vec(glove_path, word)
#print(vec)

['গ্রাম', 'পঞ্চায়েতগুলি', 'পঞ্চায়েতের', 'পঞ্চায়েতে', 'খাদ্যআঁশ', 'পঞ্চায়েতে', 'নিগে', 'নারান্দিয়া', 'অন্তঃপাতী', 'গঞ্জের']
CPU times: user 32.1 s, sys: 951 ms, total: 33.1 s
Wall time: 33.3 s
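
A GloVe `.txt` file stores one word per line, followed by its space-separated vector components. The sketch below shows, under stated assumptions, how a nearest-neighbour lookup over such a file can work. The three 3-dimensional vectors are made up for illustration, the function names are hypothetical, and cosine similarity is swapped in as a common choice; the distance measure used inside `BN_Glove.closest_word` may differ.

```python
import math


def parse_glove_lines(lines):
    """Parse GloVe text lines of the form 'word v1 v2 ... vn' into a dict."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def closest_words(vectors, word, topn=3):
    """Return the topn words most similar to `word`, excluding the word itself."""
    target = vectors[word]
    scored = [(other, cosine(target, vec))
              for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:topn]]


# Toy data standing in for bn_glove.39M.300d.txt (values are invented).
toy_lines = [
    "গ্রাম 0.9 0.1 0.0",
    "শহর 0.8 0.2 0.1",
    "নদী 0.1 0.9 0.2",
]
toy = parse_glove_lines(toy_lines)
print(closest_words(toy, "গ্রাম", topn=2))
```

Unlike the output above, this sketch leaves the query word out of its own result list.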


# Tokenization


Bengali SentencePiece Tokenization

Tokenization using the trained model

In [12]:

bsp = SP_Tokenizer()
model_path = "/content/drive/My Drive/datasets/bnlp-master/model/bn_spm.model"
input_text = "আমি ভাত খাই। সে বাজারে যায়।"
tokens = bsp.tokenize(model_path, input_text)
print(tokens)
text2id = bsp.text2id(model_path, input_text)
print(text2id)
id2text = bsp.id2text(model_path, text2id)
print(id2text)

['▁আমি', '▁ভাত', '▁খাই', '।', '▁সে', '▁বাজারে', '▁যায়', '।']
[914, 5265, 24224, 3, 124, 2244, 41, 3]
আমি ভাত খাই। সে বাজারে যায়।
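
In the token list above, the "▁" character (U+2581) marks the start of a new word. SentencePiece uses this marker so that the original text can be rebuilt from the pieces alone. A minimal sketch of that reconstruction follows; `pieces_to_text` is a hypothetical helper written for illustration, not a bnlp or SentencePiece function.

```python
WORD_BOUNDARY = "\u2581"  # the "▁" marker that SentencePiece puts before each word


def pieces_to_text(pieces):
    """Join SentencePiece pieces and turn word-boundary markers back into spaces."""
    return "".join(pieces).replace(WORD_BOUNDARY, " ").strip()


# Pieces copied from the tokenizer output above.
pieces = ['▁আমি', '▁ভাত', '▁খাই', '।', '▁সে', '▁বাজারে', '▁যায়', '।']
print(pieces_to_text(pieces))
```

Pieces without the marker, such as "।", attach directly to the preceding piece, which is why the punctuation ends up in the right place.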


In [13]:

basic_t = BasicTokenizer()
raw_text = "আমি বাংলায় গান গাই।"
tokens = basic_t.tokenize(raw_text)
print(tokens)

['আমি', 'বাংলায়', 'গান', 'গাই', '।']


Training SentencePiece

In [14]:
'''bsp = SP_Tokenizer()
data = "/content/drive/My Drive/datasets/News Articles/ebala_articles.txt"
model_prefix = "ebala_articles"
vocab_size = 5
bsp.train_bsp(data, model_prefix, vocab_size) '''


The cell above gave me this error:
OSError: Not found: unknown field name "Drive/datasets/News" in TrainerSpec.

I will investigate later if needed. The error is probably caused by the spaces in the file path.
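
The error message is consistent with the training arguments being split on whitespace, so that "My Drive" and "News Articles" break the path into separate fields. One possible workaround, sketched here and untested against bnlp itself, is to copy the corpus to a path without spaces before training. The helper `copy_to_space_free_path` and the `/content/data` target directory are assumptions made for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def copy_to_space_free_path(source, target_dir=None):
    """Copy `source` into `target_dir` under a file name without spaces.

    If no target directory is given, a fresh temporary directory is used.
    Returns the path of the copy.
    """
    source = Path(source)
    if target_dir is None:
        target_dir = Path(tempfile.mkdtemp(prefix="spm_"))
    else:
        target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source.name.replace(" ", "_")
    if " " in str(target):
        raise ValueError(f"target path still contains a space: {target}")
    shutil.copyfile(source, target)
    return target


# Hypothetical usage in the notebook (not executed here):
# data = copy_to_space_free_path(
#     "/content/drive/My Drive/datasets/News Articles/ebala_articles.txt",
#     "/content/data",
# )
# bsp.train_bsp(str(data), model_prefix, vocab_size)
```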

# NLTK Tokenization

In [15]:

text = "আমি ভাত খাই। সে বাজারে যায়। তিনি কি সত্যিই ভালো মানুষ?"
bnltk = NLTK_Tokenizer()
word_tokens = bnltk.word_tokenize(text)
sentence_tokens = bnltk.sentence_tokenize(text)
print(word_tokens)
print(sentence_tokens)


['আমি', 'ভাত', 'খাই', '।', 'সে', 'বাজারে', 'যায়', '।', 'তিনি', 'কি', 'সত্যিই', 'ভালো', 'মানুষ', '?']
['আমি ভাত খাই।', 'সে বাজারে যায়।', 'তিনি কি সত্যিই ভালো মানুষ?']
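
For comparison, the sentence split above can be approximated with a single regular expression that breaks after the danda (।), a question mark, or an exclamation mark. This is only a rough sketch with a hypothetical function name; it does not show how `NLTK_Tokenizer` works internally, and it will mis-split text that uses these characters inside a sentence.

```python
import re

# Split on whitespace that directly follows a sentence-ending character.
SENTENCE_END = re.compile(r"(?<=[।?!])\s+")


def split_sentences(text):
    """Split Bengali text into sentences on '।', '?' and '!'."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]


text = "আমি ভাত খাই। সে বাজারে যায়। তিনি কি সত্যিই ভালো মানুষ?"
print(split_sentences(text))
```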


# Bengali POS Tagging

In [17]:

bn_pos = POS()
model_path = "/content/drive/My Drive/datasets/bnlp-master/model/bn_pos.pkl"
text = "আমি ভাত খাই।"
res = bn_pos.tag(model_path, text)
print(res)

[('আমি', 'PPR'), ('ভাত', 'NC'), ('খাই', 'VM'), ('।', 'PU')]


# Train POS Tag Model

In [18]:

bn_pos = POS()
model_name = "pos_model.pkl"
tagged_sentences = [[('রপ্তানি', 'JJ'), ('দ্রব্য', 'NC'), ('-', 'PU'), ('তাজা', 'JJ'), ('ও', 'CCD'), ('শুকনা', 'JJ'), ('ফল', 'NC'), (',', 'PU'), ('আফিম', 'NC'), (',', 'PU'), ('পশুচর্ম', 'NC'), ('ও', 'CCD'), ('পশম', 'NC'), ('এবং', 'CCD'),('কার্পেট', 'NC'), ('৷', 'PU')], [('মাটি', 'NC'), ('থেকে', 'PP'), ('বড়জোর', 'JQ'), ('চার', 'JQ'), ('পাঁচ', 'JQ'), ('ফুট', 'CCL'), ('উঁচু', 'JJ'), ('হবে', 'VM'), ('৷', 'PU')]]

bn_pos.train(model_name, tagged_sentences)

1
1
Training Started........
it will take time according to your dataset size..
Training Finished!
Evaluating with Test Data...
Accuracy is: 
0.1111111111111111
Model Saved!


In [19]:
model_path = "/content/pos_model.pkl"
text = "আমি ফল খাই।"
res = bn_pos.tag(model_path, text)
print(res)

[('আমি', 'JJ'), ('ফল', 'NC'), ('খাই', 'NC'), ('।', 'PU')]


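I labeled ফল as NC in the training data, and the model tags that word correctly. The tags for আমি (JJ) and খাই (NC) are wrong, which is not surprising with only two training sentences.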
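
The accuracy of 0.1111 reported by the training cell equals 1/9, and the second training sentence has exactly nine tokens. That is consistent with token-level accuracy measured on one held-out sentence, although this reading is an inference from the output and is not confirmed by the BNLP documentation. A minimal sketch of such a metric follows; the predicted tags are invented so that exactly one of nine matches.

```python
def token_accuracy(gold_tags, predicted_tags):
    """Fraction of positions where the predicted tag equals the gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences must have the same length")
    if not gold_tags:
        return 0.0
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)


# Gold tags of the nine-token training sentence above.
gold = ['NC', 'PP', 'JQ', 'JQ', 'JQ', 'CCL', 'JJ', 'VM', 'PU']
# Hypothetical predictions: only the final punctuation tag is right.
predicted = ['JJ', 'NC', 'NC', 'NC', 'NC', 'NC', 'NC', 'NC', 'PU']
print(token_accuracy(gold, predicted))
```

With so little data a single token changes the score by more than ten percentage points, so the number says little about the tagger itself.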

# Bengali NER

In [20]:

bn_ner = NER()
model_path = "/content/drive/My Drive/datasets/bnlp-master/model/bn_ner.pkl"
text = "সে ঢাকায় থাকে।"
result = bn_ner.tag(model_path, text)
print(result)

[('সে', 'O'), ('ঢাকায়', 'S-LOC'), ('থাকে', 'O')]
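
The tags in this section (S-LOC here, and B-PER, I-PER, E-PER in the training data) look like a BIOES-style scheme: B begins an entity, I continues it, E ends it, S marks a single-token entity, and O marks tokens outside any entity. Assuming that scheme, the sketch below groups tagged tokens into entity spans. `extract_entities` is a hypothetical helper, not part of bnlp.

```python
def extract_entities(tagged_tokens):
    """Group (token, tag) pairs in a BIOES-style scheme into (text, type) spans.

    Spans that start with B- but never reach a matching E- are dropped.
    """
    entities = []
    current_tokens, current_type = [], None
    for token, tag in tagged_tokens:
        prefix, _, entity_type = tag.partition("-")
        if prefix == "S":
            # Single-token entity.
            entities.append((token, entity_type))
            current_tokens, current_type = [], None
        elif prefix == "B":
            # Start a new multi-token entity.
            current_tokens, current_type = [token], entity_type
        elif prefix in ("I", "E") and current_type == entity_type:
            current_tokens.append(token)
            if prefix == "E":
                entities.append((" ".join(current_tokens), entity_type))
                current_tokens, current_type = [], None
        else:
            # "O" or an inconsistent tag: discard any open span.
            current_tokens, current_type = [], None
    return entities


# Tagger output copied from the cell above.
print(extract_entities([('সে', 'O'), ('ঢাকায়', 'S-LOC'), ('থাকে', 'O')]))
```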


In [21]:

bn_ner = NER()
model_name = "ner_model.pkl"
tagged_sentences = [[('ত্রাণ', 'O'),('ও', 'O'),('সমাজকল্যাণ', 'O'),('ওমর', 'B-PER'),('সম্পাদক', 'S-PER'),('মোবাশ্বির', 'B-PER'),('রায়', 'I-PER'),('নন্দী', 'E-PER'),('প্রমুখ', 'O'),('সংবাদ', 'O'),('সম্মেলনে', 'O'),('উপস্থিত', 'O'),('ছিলেন', 'O')], [('ত্রাণ', 'O'),('ও', 'O'),('সমাজকল্যাণ', 'O'),('সম্পাদক', 'S-PER'),('মোবাশ্বির', 'B-PER'),('রায়', 'I-PER'),('নন্দী', 'E-PER'),('প্রমুখ', 'O'),('সংবাদ', 'O'),('সম্মেলনে', 'O'),('উপস্থিত', 'O'),('ছিলেন', 'O')], [('ত্রাণ', 'O'),('ও', 'O'),('ওমর', 'B-PER'),('সমাজকল্যাণ', 'O'),('সম্পাদক', 'S-PER'),('মোবাশ্বির', 'B-PER'),('ওমর', 'B-PER'),('রায়', 'I-PER'),('নন্দী', 'E-PER'),('প্রমুখ', 'O'),('সংবাদ', 'O'),('সম্মেলনে', 'O'),('উপস্থিত', 'O'),('ছিলেন', 'O')]]

bn_ner.train(model_name, tagged_sentences)

2
1
Training Started........
It will take time according to your dataset size...
Training Finished!
Evaluating with Test Data...
Accuracy is: 
0.9285714285714286
Model Saved!


# Tag with the Trained NER Model

In [22]:
model_path = "/content/ner_model.pkl"
text = "ওমর মোবাশ্বির"
result = bn_ner.tag(model_path, text)
print(result)

[('ওমর', 'B-PER'), ('মোবাশ্বির', 'B-PER')]


Note that the model had no prior knowledge that ওমর and মোবাশ্বির should be labeled "B-PER". In tagged_sentences I manually labeled both names as "B-PER" entities, and after this quick training run the trained NER model tagged both of them correctly.

For SentencePiece tokenization I followed this video: [Sentencepiece Tokenizer With Offsets For T5, ALBERT, XLM-RoBERTa And Many More](https://www.youtube.com/watch?v=U51ranzJBpY&ab_channel=AbhishekThakur)