# Training Bengali Book Word Vectors

In this notebook, we will use text collected from Bengali books to train a Word2Vec model for Bengali.

Then we will test the model to see how well it performs.

First, we import the packages we need.

In [None]:
import json
import os
import re
import string
import numpy as np

from gensim.models import Word2Vec

Let's define a function that reads a data file and returns its contents.

In our case, we will use the full text of the books for training.

In [None]:
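# list_all_text_of_তারাশঙ্কর বন্দ্যোপাধ্যায়.txt
def extract_text(filename):
    """Read a file from the dataset folder and return its full text as one string."""
    path = os.path.join('/content/drive/MyDrive/Datasets/Bengali_book_dataset', filename)

    # Decode explicitly as UTF-8 so Bengali text reads the same on every platform
    with open(path, 'r', encoding='utf-8') as f:
        content = f.read()

    return content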

Now we define a function to preprocess our data.

The function does the following:
- It applies a list of literal replacements, swapping each unwanted string for our custom text
- It removes all emojis and emoticons from the text
- It removes all Latin letters and ASCII digits (English text and numbers)

In [None]:
def replace_strings(texts, replace):
    new_texts=[]

    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    english_pattern=re.compile('[a-zA-Z0-9]+', flags=re.I)

    for text in texts:
        for r in replace:
            text=text.replace(r[0], r[1])
        text=emoji_pattern.sub(r'', text)
        text=english_pattern.sub(r'', text)
        text=re.sub(r'\s+', ' ', text).strip()
        new_texts.append(text)

    return new_texts
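
As a quick check of the cleaning steps listed above, here is a minimal, self-contained sketch on a toy string. The helper name `clean_text`, the sample sentence, and the shortened emoji pattern (two ranges only) are illustrative assumptions and not part of the notebook's pipeline.

```python
import re

# Shortened emoji pattern (assumption: two ranges are enough for this demo).
EMOJI = re.compile("[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF]+")
# ASCII letters and digits, mirroring english_pattern.
LATIN = re.compile(r"[a-zA-Z0-9]+")

def clean_text(text, replacements):
    """Apply literal replacements, drop emoji and Latin text, collapse whitespace."""
    for old, new in replacements:
        text = text.replace(old, new)
    text = EMOJI.sub("", text)
    text = LATIN.sub("", text)
    return re.sub(r"\s+", " ", text).strip()

sample = "আমি\xa0ভাত খাই 😀 hello 123\nতুমি কেমন আছো"
cleaned = clean_text(sample, [("\xa0", " "), ("\n", " ")])
print(cleaned)
```

Only the Bengali words should remain, separated by single spaces.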

We also need to remove all punctuation from our data. The `remove_punc` function removes the common ASCII punctuation marks plus a few typographic quotes and dashes.

In [None]:
def remove_punc(sentences):
    new_sentences=[]
    # string.punctuation covers ASCII punctuation only; a set gives fast membership tests
    exclude = set(string.punctuation)
    exclude.update(["’", "‘", "—"])
    for sentence in sentences:
        s = ''.join(ch for ch in sentence if ch not in exclude)
        new_sentences.append(s)

    return new_sentences
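
Because `string.punctuation` only contains ASCII characters, the Bengali danda (`।`) is not in the exclusion set and survives this kind of filtering. The sketch below illustrates that on a toy sentence; `strip_chars` and the sample text are hypothetical names used only for this example.

```python
import string

# Same exclusion set as in remove_punc: ASCII punctuation plus three extra marks.
exclude = set(string.punctuation) | {"’", "‘", "—"}

def strip_chars(sentence, excluded):
    """Drop every character that appears in `excluded`."""
    return "".join(ch for ch in sentence if ch not in excluded)

DANDA = "।"
sample = "সে বলল, ‘আমি যাব" + DANDA + "’ তুমি?"

kept = strip_chars(sample, exclude)                # the danda is kept
removed = strip_chars(sample, exclude | {DANDA})   # the danda is removed as well
print(kept)
print(removed)
```

Whether the danda should be kept depends on the next step: it is useful as long as the text still has to be split into sentences.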

Let's strip the punctuation from the whole corpus, one chunk at a time, and then print a sample to see how the data changes throughout the process.

In [None]:
import re

chunksize = 10_000_000  # Characters to read per chunk (about 10 million)
input_file = '/content/drive/MyDrive/Datasets/Bengali_book_dataset/all_text.txt'
output_file = '/content/drive/MyDrive/Datasets/Bengali_book_dataset/all_text_without_pun.txt'

def remove_punctuation(text):
    # The sentence terminators (। ! ?) are kept on purpose:
    # the sentence-splitting step later in the notebook splits on them.
    punctuation_pattern = r'[.,;:\-–—‘’“”(){}\[\]<>«»\'\"`´~]'
    return re.sub(punctuation_pattern, '', text)

# Open the output in 'w' mode so that rerunning the cell does not append a second copy
with open(input_file, 'r', encoding='utf-8') as file:
    with open(output_file, 'w', encoding='utf-8') as outfile:
        while True:
            chunk = file.read(chunksize)
            if not chunk:
                break
            processed_chunk = remove_punctuation(chunk)
            outfile.write(processed_chunk)
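
Processing the file in chunks is safe for this step because the pattern matches one character at a time, so no match can straddle a chunk boundary. The following sketch checks that claim on an in-memory string; `process_in_chunks` and the shortened pattern are assumptions made for the illustration. Multi-character operations such as collapsing `\s+` or calling `.strip()` do not have this property, so they can behave slightly differently at chunk edges.

```python
import io
import re

# Shortened single-character pattern (assumption: enough for the demo).
PUNCT = re.compile(r"[.,;:]")

def process_in_chunks(src, dst, chunk_chars, transform):
    """Read `chunk_chars` characters at a time, transform each chunk, write it out."""
    while True:
        chunk = src.read(chunk_chars)
        if not chunk:
            break
        dst.write(transform(chunk))

text = "এক, দুই; তিন: চার।"
src, dst = io.StringIO(text), io.StringIO()
process_in_chunks(src, dst, chunk_chars=4, transform=lambda c: PUNCT.sub("", c))

chunked = dst.getvalue()
whole = PUNCT.sub("", text)
print(chunked == whole)
```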


In [None]:
book_text = extract_text('all_text.txt')

print("------------------Crawled Unprocessed Text-----------------------")
# extract_text returns one long string, so book_text[12] is a single character
print(book_text[12])



------------------Crawled Unprocessed Text-----------------------
ল


In [None]:
book_text_without_punc = extract_text('all_text_without_pun.txt')
print("------------------Crawled without punctuation Text-----------------------")
# Same position, read from the punctuation-free text
print(book_text_without_punc[12])

------------------Crawled without punctuation Text-----------------------
ল


In [None]:

# replace_strings is defined in an earlier cell; here it is applied to the corpus chunk by chunk.
chunksize = 10_000_000  # Characters to read per chunk (about 10 million)
input_file = '/content/drive/MyDrive/Datasets/Bengali_book_dataset/all_text_without_pun.txt'
output_file = '/content/drive/MyDrive/Datasets/Bengali_book_dataset/all_text_processed_file.txt'

# Zero-width characters, non-breaking spaces and line breaks all become plain spaces
replacements=[('\u200c', ' '),
         ('\u200d', ' '),
        ('\xa0', ' '),
        ('\n', ' '),
        ('\r', ' ')]


# Open the output in 'w' mode so that rerunning the cell does not append a second copy
with open(input_file, 'r', encoding='utf-8') as file:
    with open(output_file, 'w', encoding='utf-8') as outfile:
        while True:
            chunk = file.read(chunksize)
            if not chunk:
                break
            processed_chunk = replace_strings([chunk], replacements)
            outfile.write(processed_chunk[0])


In [None]:
filename = '/content/drive/MyDrive/Datasets/Bengali_book_dataset/all_text_processed_file.txt'
num_words = 10  # Number of words to read

with open(filename, 'r', encoding='utf-8') as file:
    content = file.read()
    words = content.split()[:num_words]

for word in words:
    print(word)


মাথার
ভেতরে
লেখা
অদূরে
রেস্তোরাঁ
আষাঢ়
সেজেছে
খুব
মেঘে
মেঘেমনে


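Next, we load the fully processed text and check how much data we have.

In [None]:
total_text=extract_text('all_text_processed_file.txt')
# len() of a string counts characters, not words or sentences
print(f"Total number of characters in the training data: {len(total_text)}")

Total number of characters in the training data: 333410416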


Finally, we need to split the text into sentences and extract each word from those sentences.

Our final training data looks like this.

In [None]:

filename = '/content/drive/MyDrive/Datasets/Bengali_book_dataset/all_text_processed_file.txt'
output_filename = '/content/drive/MyDrive/Datasets/Bengali_book_dataset/all_text_processed_sentences.txt'
chunk_size = 100_000_000  # Read 100 million characters at a time
sentences = []
max_sentences = None  # Set to a small number (e.g. 10) to write only a sample

with open(filename, 'r', encoding='utf-8') as file:
    while True:
        chunk = file.read(chunk_size)
        if not chunk:
            break
        # Split on the Bengali danda and on ! and ?
        sentences.extend(re.split(r'[।!?]', chunk))

# Write each non-empty sentence on its own line
written = 0
with open(output_filename, 'w', encoding='utf-8') as output_file:
    for sentence in sentences:
        sentence = sentence.strip()
        if sentence:
            output_file.write(sentence + '\n')
            written += 1
            if max_sentences is not None and written >= max_sentences:
                break
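
The splitting step can be illustrated on a short string. This is a minimal sketch under the assumption that the text still contains its sentence terminators; `split_sentences` and the sample text are hypothetical and exist only for this example.

```python
import re

def split_sentences(text):
    """Split on the danda, '!' and '?', dropping empty pieces."""
    parts = re.split(r"[।!?]", text)
    return [p.strip() for p in parts if p.strip()]

demo_text = "আমি ভাত খাই। তুমি কেমন আছো? ভালো!"
demo_sentences = split_sentences(demo_text)

# Word2Vec expects each sentence as a list of word tokens.
demo_tokens = [s.split() for s in demo_sentences]
print(demo_tokens)
```

If the terminators have already been stripped from the text, `re.split` finds nothing to split on and returns the whole input as a single "sentence".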


In [None]:
# Keep only sentences with more than one word, then split each into a list of words
body=[item.strip() for item in sentences if len(item.split())>1]

body=[item.split() for item in body]

print(body[:10])

In [None]:
import gensim

input_filename = '/content/drive/MyDrive/Datasets/Bengali_book_dataset/all_text_processed_sentences.txt'
output_filename = '/content/drive/MyDrive/Datasets/Bengali_book_dataset/all_text_processed_tokenized_sentences.txt'

sentences = []

# Read line by line so that no sentence is cut at a chunk boundary
with open(input_filename, 'r', encoding='utf-8') as input_file:
    with open(output_filename, 'w', encoding='utf-8') as output_file:
        for line in input_file:
            # Split on whitespace. gensim.utils.tokenize(..., deacc=True) strips
            # nonspacing marks such as the hasanta (্), which corrupts Bengali words.
            tokenized_sentence = line.split()
            if not tokenized_sentence:
                continue
            sentences.append(tokenized_sentence)
            output_file.write(' '.join(tokenized_sentence) + '\n')


In [None]:
input_filename = '/content/drive/MyDrive/Datasets/Bengali_book_dataset/all_text_processed_sentences.txt'

lines = []

with open(input_filename, 'r', encoding='utf-8') as file:
    for line in file:
        lines.append(line.strip())

# Show only the start of each line: printing very long lines can exceed the notebook's output rate limit
print([line[:100] for line in lines[:5]])


Now that we have our preprocessed training data, we can start training our model.

We will generate a 100-dimensional embedding for each word, using a context window of up to 5 words on either side of it to learn what the word means.

In [None]:

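# size is the embedding dimension (gensim 4.x renames this parameter to vector_size)
# sg=0 trains CBOW; min_count=1 keeps every word, including those seen only once
model = gensim.models.Word2Vec(sentences=sentences, min_count=1, size=100, window=5, sg=0)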
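
To make the `window` parameter concrete, the sketch below lists which words count as neighbours of each word in a sentence. It is only an illustration: `context_pairs` is a hypothetical helper, and the actual Word2Vec training differs in details (the window is a maximum that is randomly shortened during training, and with `sg=0` the context vectors are combined to predict the centre word rather than used pair by pair).

```python
def context_pairs(tokens, window=5):
    """Yield (center, context) pairs for every word within `window` positions."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]

demo_tokens = ["আমি", "আজ", "বাজারে", "যাব"]

# With window=1 each word only sees its immediate neighbours.
pairs = list(context_pairs(demo_tokens, window=1))
print(pairs)
```

With the default `window=5`, every word in this four-word sentence is a neighbour of every other word.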

In [None]:
print("What are the words most similar to chele")
model.wv.most_similar('ছেলে', topn=5)

What are the words most similar to chele


[('মেয়ে', 0.9104634523391724),
 ('বোন', 0.8716319799423218),
 ('ভাই', 0.8627076148986816),
 ('বাবা', 0.8575456738471985),
 ('বন্ধু', 0.8455409407615662)]

In [None]:
print("What is Father + Girl - Boy =?")
# বাবা = father, মেয়ে = girl, ছেলে = boy
model.wv.most_similar(positive=['বাবা', 'মেয়ে'], negative=['ছেলে'], topn=5)

What is Father + Girl - Boy =?
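
An analogy query of this kind adds and subtracts word vectors and then looks for the closest remaining word by cosine similarity. The sketch below shows the idea with made-up two-dimensional vectors. The vectors, the `analogy` helper, and the extra vocabulary are assumptions for illustration only, and gensim's own implementation differs in details such as normalising the vectors first.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

FATHER, MOTHER, BOY, GIRL, BROTHER = "বাবা", "মা", "ছেলে", "মেয়ে", "ভাই"

# Made-up vectors (assumption): axis 0 ~ gender, axis 1 ~ parent vs. child.
toy_vectors = {
    FATHER: [1.0, 1.0],
    MOTHER: [-1.0, 1.0],
    BOY: [1.0, -1.0],
    GIRL: [-1.0, -1.0],
    BROTHER: [1.0, -0.5],
}

def analogy(vectors, positive, negative):
    """Return the word closest to sum(positive) - sum(negative), excluding the inputs."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, vectors[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, vectors[word])]
    used = set(positive) | set(negative)
    scores = {w: cosine(query, v) for w, v in vectors.items() if w not in used}
    return max(scores, key=scores.get)

answer = analogy(toy_vectors, positive=[FATHER, GIRL], negative=[BOY])
print(answer)
```

Note that every query word must exist in the vocabulary; a word the model has never seen, including an empty string, raises a `KeyError`.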

In [None]:
print('Find the odd one out')
model.wv.doesnt_match("কলকাতা চেন্নাই দিল্লি রবীন্দ্রনাথ".split())

Find the odd one out


'রবীন্দ্রনাথ'

In [None]:
print("How similar are Bengali and sweet?")
model.wv.similarity('বাঙালি', 'মিষ্টি')

How similar are Bengali and sweet?


0.66019154

In [None]:
model.wv.save_word2vec_format('/content/drive/MyDrive/Datasets/bengali_news_data/news_vector_text.txt', binary=False)
model.wv.save_word2vec_format('/content/drive/MyDrive/Datasets/bengali_news_data/news_vector_binary.txt', binary=True)
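
The text variant of the word2vec format is simple: a header line with the vocabulary size and the vector dimension, followed by one line per word holding the word and its values. The usual way to load such a file back is gensim's `KeyedVectors.load_word2vec_format`; the sketch below is only a hand-written reader, with the hypothetical name `read_word2vec_text`, to show what the file contains.

```python
import io

def read_word2vec_text(handle):
    """Parse word2vec text format: a 'count dim' header, then 'word v1 ... vN' rows."""
    vocab_size, dim = (int(x) for x in handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {word!r}, got {len(values)}")
        vectors[word] = values
    if len(vectors) != vocab_size:
        raise ValueError(f"header promised {vocab_size} words, found {len(vectors)}")
    return vectors

BOY, GIRL = "ছেলে", "মেয়ে"
# A tiny in-memory file with two 3-dimensional vectors (made-up values).
sample_file = io.StringIO(f"2 3\n{BOY} 0.1 0.2 0.3\n{GIRL} 0.4 0.5 0.6\n")
loaded = read_word2vec_text(sample_file)
print(len(loaded), loaded[BOY])
```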

In [None]:
print("What about Bihari and Sweets?")
model.wv.similarity('বিহারি', 'মিষ্টি')

What about Bihari and Sweets?


0.51406723