# 0. Description

In this execise, I finetuned `deepseek-ai/deepseek-math-7b-base` which is a SOTA model for solving mathematical problems. I fintuned the model to solve math problems in bengali language.

# 1. Build Dataset 

There is no dataset for bangla language math problems. I translated `hkust-nlp/dart-math-uniform` using `facebook/nllb-200-3.3B`. To achieve better translation sentences are splitted and then fed into translation model

In [1]:
! pip install -q git+https://github.com/mediacloud/sentence-splitter

In [2]:
import pandas as pd
from datasets import load_dataset
from sentence_splitter import split_text_into_sentences

ds = load_dataset("hkust-nlp/dart-math-uniform",verification_mode='no_checks')
df = ds['train'].to_pandas()
sub = df[df['response'].map(lambda p: len(p)) < 300]
sub = sub[sub['query'].map(lambda p: len(p)) < 300]
sub = sub.sample(1000, random_state=27)
sub

Downloading readme:   0%|          | 0.00/11.3k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/115M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/49.4M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/585392 [00:00<?, ? examples/s]

Unnamed: 0,query,response
167855,Let $f : \mathbb{R} \to \mathbb{R}$ be a funct...,"Setting $x = y = 0,$ we get\n\[f(0)^2 - f(0) =..."
199546,Cory made a complete list of the prime numbers...,The smallest prime number between 1 and 25 is ...
28560,Evaluate $(-27)^{5/3}$.,We have $(-27)^{5/3} = ((-3)^3)^{5/3}$.\nUsing...
8415,Simplify $(1)(2a)(3a^2)(4a^3)(5a^4)$.,Multiplying the coefficients: $1\cdot2\cdot3\c...
492063,"Randy had $3,000. Smith gave him another $200....","Randy started with $3,000 from the beginning, ..."
...,...,...
310309,Dan is digging graves for a family killed by C...,"If it takes Dan 3 hours to dig an adult grave,..."
364722,"In a guessing game, Hajar's score is 24 points...","If the score Hajar gets is 24, and the differe..."
7224,Evaluate $\left\lfloor |{-34.1}|\right\rfloor$.,"First, we need to evaluate the absolute value ..."
474340,A pie shop charges $3 per slice of custard pie...,"If each whole pie is cut into 10 slices, and t..."


In [3]:
# Splitting the dataset sentence wise
query = []
len_query = []
for i in sub['query'].values:
    sen = split_text_into_sentences(i, language='en')
    query.extend(sen)
    len_query.append(len(sen))

response = []
len_response = []
for i in sub['response'].values:
    sen = split_text_into_sentences(i, language='en')
    response.extend(sen)
    len_response.append(len(sen))

In [4]:
from torch.utils.data import Dataset
from tqdm.auto import tqdm

class ListDataset(Dataset):

    def __init__(self, original_list):
        self.original_list = original_list

    def __len__(self):
        return len(self.original_list)

    def __getitem__(self, i):
        return self.original_list[i]

query_dataset = ListDataset(query)
res_dataset = ListDataset(response)

In [5]:
from transformers import pipeline

pipe = pipeline("translation", model="facebook/nllb-200-3.3B",src_lang='eng_Latn',tgt_lang='ben_Beng',device="cuda")

2024-07-31 10:27:44.164695: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-07-31 10:27:44.164815: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-07-31 10:27:44.297156: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


config.json:   0%|          | 0.00/808 [00:00<?, ?B/s]

pytorch_model.bin.index.json:   0%|          | 0.00/90.0k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/3 [00:00<?, ?it/s]

pytorch_model-00001-of-00003.bin:   0%|          | 0.00/6.93G [00:00<?, ?B/s]

pytorch_model-00002-of-00003.bin:   0%|          | 0.00/8.55G [00:00<?, ?B/s]

pytorch_model-00003-of-00003.bin:   0%|          | 0.00/2.10G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

  return self.fget.__get__(instance, owner)()


generation_config.json:   0%|          | 0.00/189 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/564 [00:00<?, ?B/s]

sentencepiece.bpe.model:   0%|          | 0.00/4.85M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.3M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/3.55k [00:00<?, ?B/s]

In [6]:
# Translating and merging queries
model_query = [i for i in tqdm(pipe(query_dataset, batch_size=4, max_length=200))]
model_query = [q[0]['translation_text'] for q in model_query]
tr_query = []
idx = 0
for i in len_query:
    tr_query.append(''.join(model_query[idx: idx + i]))
    idx = idx + i
len(tr_query)

  0%|          | 0/626 [00:00<?, ?it/s]

1000

In [7]:
# Translating and merging response
model_res = [i for i in tqdm(pipe(res_dataset, batch_size=4, max_length=300))]
model_res = [q[0]['translation_text'] for q in model_res]
tr_res = []
idx = 0
for i in len_response:
    tr_res.append(''.join(model_query[idx: idx + i]))
    idx = idx + i

  0%|          | 0/1090 [00:00<?, ?it/s]

In [8]:
# saving the dataset
df = pd.DataFrame({
    'query': tr_query,
    'eng_query': sub['query'],
    'response': tr_res,
    'eng_response': sub['response']
})
df.to_csv('dart_math_bangla.csv', index=False)
df

Unnamed: 0,query,eng_query,response,eng_response
167855,$f: \mathbb{R} \to \mathbb{R}$ এমন একটি ফাংশন ...,Let $f : \mathbb{R} \to \mathbb{R}$ be a funct...,$f: \mathbb{R} \to \mathbb{R}$ এমন একটি ফাংশন ...,"Setting $x = y = 0,$ we get\n\[f(0)^2 - f(0) =..."
199546,কোরি ১ থেকে ২৫ এর মধ্যে মৌলিক সংখ্যাগুলোর একটি...,Cory made a complete list of the prime numbers...,তার তালিকার সবচেয়ে ছোট মৌলিক সংখ্যা এবং সবচেয...,The smallest prime number between 1 and 25 is ...
28560,${}-27) ^{5/3}$ এর মান নির্ণয় করুন।,Evaluate $(-27)^{5/3}$.,স্মিথ তাকে আরো ২০০ ডলার দিলো।র্যান্ডি তখন স্যা...,We have $(-27)^{5/3} = ((-3)^3)^{5/3}$.\nUsing...
8415,${1}{2}{3}{2}{4}{3}{4}{4}}{5}{4}}$ সরলীকরণ করুন।,Simplify $(1)(2a)(3a^2)(4a^3)(5a^4)$.,"প্রতিদিন, আমি আমার স্কোয়াডের সংখ্যা আগের দিনে...",Multiplying the coefficients: $1\cdot2\cdot3\c...
492063,র্যান্ডির কাছে ৩০০০ ডলার ছিল।স্মিথ তাকে আরো ২০...,"Randy had $3,000. Smith gave him another $200....",জেমস ৮ টুকরো পিৎজা অর্ডার করেছে।তার বন্ধু 2 টু...,"Randy started with $3,000 from the beginning, ..."
...,...,...,...,...
310309,ড্যান কোভিড-১৯-এর কারণে নিহত একটি পরিবারের জন্...,Dan is digging graves for a family killed by C...,,"If it takes Dan 3 hours to dig an adult grave,..."
364722,"একটি অনুমান খেলায়, হাজারের স্কোর ২৪ পয়েন্ট।এ...","In a guessing game, Hajar's score is 24 points...",,"If the score Hajar gets is 24, and the differe..."
7224,ডান তলায় ডান তলায় ডান তলায় ডান তলায় ডান তলায়,Evaluate $\left\lfloor |{-34.1}|\right\rfloor$.,,"First, we need to evaluate the absolute value ..."
474340,একটি পিট দোকান পিটুর প্রতি টুকরো ৩ ডলার নেয়।ত...,A pie shop charges $3 per slice of custard pie...,,"If each whole pie is cut into 10 slices, and t..."


In [9]:
import gc
del pipe
gc.collect()

12808

**Link to the dataset on Hugging Face Hub:** [https://huggingface.co/datasets/Asib27/dart_math_bangla]