### In this notebook, we fine-tune the Helsinki-NLP model (English to Arabic) on data science content.

First, let's install the required libraries.
 

**Note: Don't forget to restart the runtime after running this cell.**

We pin nltk to 3.4 because it's the only version that worked without errors with the metric we use (google_bleu).

In [1]:
!pip install --quiet transformers==4.7.0
!pip install --quiet sentencepiece==0.1.95
!pip install --upgrade gupload
!pip install datasets 
!pip install nltk==3.4

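
After restarting the runtime, it can help to confirm that the pinned versions are the ones actually installed. The following is a minimal sketch, assuming Python 3.8 or newer (for `importlib.metadata`); the `PINNED` dictionary and the `check_pins` helper are illustrative names, not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions pinned in the install cell; datasets and gupload are left unpinned there.
PINNED = {"transformers": "4.7.0", "sentencepiece": "0.1.95", "nltk": "3.4"}

def check_pins(pins):
    """Return {package: (expected, installed or None)} for every mismatch."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches

print(check_pins(PINNED) or "all pinned versions match")
```

An empty result means every pinned package is present at the expected version; anything else lists what was expected next to what was found.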

In [1]:
import pandas as pd
import re
import string
import torch
from tqdm.notebook import tqdm
tqdm.pandas()
from sklearn.model_selection import train_test_split
import nltk

In this cell, we download the data we'll use directly from Google Drive.

In [2]:
# find the share link of the file/folder on Google Drive
file_share_link = "https://docs.google.com/spreadsheets/d/15kVKP0AVYvN0KGK-RDxVsHFus3lGGxqD"

# extract the ID of the file
file_id = '15kVKP0AVYvN0KGK-RDxVsHFus3lGGxqD'

# append the id to this REST command
file_download_link = "https://docs.google.com/uc?export=download&id=" + file_id 
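
The file ID above is copied by hand from the share link. As a small sketch, the same step can be done with a regular expression, assuming the share link contains the ID right after `/d/` (the `drive_download_url` helper is a hypothetical name introduced here):

```python
import re

def drive_download_url(share_link):
    """Build a direct-download URL from a Google Drive/Docs share link."""
    # The file ID is assumed to be the path segment right after "/d/".
    match = re.search(r"/d/([A-Za-z0-9_-]+)", share_link)
    if match is None:
        raise ValueError("no file ID found in: " + share_link)
    return "https://docs.google.com/uc?export=download&id=" + match.group(1)

print(drive_download_url("https://docs.google.com/spreadsheets/d/15kVKP0AVYvN0KGK-RDxVsHFus3lGGxqD"))
```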

In [3]:
!wget -O Data.xlsx --no-check-certificate "$file_download_link"

--2021-12-06 17:45:22--  https://docs.google.com/uc?export=download&id=15kVKP0AVYvN0KGK-RDxVsHFus3lGGxqD
Resolving docs.google.com (docs.google.com)... 64.233.182.113, 64.233.182.100, 64.233.182.138, ...
Connecting to docs.google.com (docs.google.com)|64.233.182.113|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://doc-08-as-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/0dn9nuib03uk0hmfc6tcfnv7lva2p6s4/1638812700000/15694836540117004366/*/15kVKP0AVYvN0KGK-RDxVsHFus3lGGxqD?e=download [following]
--2021-12-06 17:45:28--  https://doc-08-as-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/0dn9nuib03uk0hmfc6tcfnv7lva2p6s4/1638812700000/15694836540117004366/*/15kVKP0AVYvN0KGK-RDxVsHFus3lGGxqD?e=download
Resolving doc-08-as-docs.googleusercontent.com (doc-08-as-docs.googleusercontent.com)... 74.125.202.132, 2607:f8b0:4001:c06::84
Connecting to doc-08-as-docs.googleusercontent.com (doc-08-as-d

Now we have the data in our environment. It needs to be cleaned and split into train and test sets before we feed it to the model.

In [4]:
def cleaning(df):
    '''Clean the 'AR' and 'EN' columns of df in place (returns None).'''
    # remove apostrophes from the sentences
    df['AR'] = df['AR'].apply(lambda x: re.sub("'","",x))
    df['EN'] = df['EN'].apply(lambda x: re.sub("'","",x))
    exclude = set(string.punctuation)
    # remove punctuation (string.punctuation covers ASCII punctuation only)
    df['AR'] = df['AR'].apply(lambda x: ''.join(ch for ch in x if ch not in exclude))
    df['EN'] = df['EN'].apply(lambda x: ''.join(ch for ch in x if ch not in exclude))
    # remove digits from the sentences (string.digits covers 0-9 only)
    digit = str.maketrans('','',string.digits)
    df['AR'] = df['AR'].apply(lambda x: x.translate(digit))
    df['EN'] = df['EN'].apply(lambda x: x.translate(digit))
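
To see what these three steps do to a single sentence, here is a minimal sketch that applies them to plain strings. The `clean_text` helper and the sample strings are illustrative assumptions. It also shows a limitation: `string.punctuation` and `string.digits` are ASCII-only, so the Arabic comma and Arabic-Indic digits pass through unchanged.

```python
import string

EXCLUDE = set(string.punctuation)                  # ASCII punctuation only
DIGITS = str.maketrans("", "", string.digits)      # ASCII digits 0-9 only

def clean_text(text):
    """Apply the same three steps as cleaning() to a single string."""
    text = text.replace("'", "")                               # drop apostrophes
    text = "".join(ch for ch in text if ch not in EXCLUDE)     # drop punctuation
    return text.translate(DIGITS)                              # drop digits

english = clean_text("Hi, I'm Chris 123!")
arabic = clean_text("مرحباً، أنا ١٢٣")
print(repr(english))                      # 'Hi Im Chris '
print("،" in arabic, "١" in arabic)       # True True
```

If Arabic punctuation should be removed as well, the exclusion set would need to be extended with those characters explicitly.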

In [5]:
data = pd.read_excel("Data.xlsx")
data.head()

| Unnamed: 0.1 | Unnamed: 0 | Arabic_transcript | English_transcript |
|---|---|---|---|
| 0 | 0 | مرحبا بكم في مقدمة لعلوم البيانات مع بايثون. ه... | Welcome to an introduction to Data Science wit... |
| 1 | 1 | مرحباً، أنا (كريس بروكز)، هيئة التدريس هنا بكل... | Hi, I'm Chris Brooks, faculty here at the Univ... |
| 2 | 2 | مرحبا. أريد أن أريكم قليلا عن نظام دفتر جوبيتر... | Hi. I want to show you a little bit about the ... |
| 3 | 3 | في بقية هذه الوحدة، سأقوم بتقديم نظرة عامة أسا... | In the rest of this module, I'm going to provi... |
| 4 | 5 | تحدثنا عن السلاسل عندما تحدثنا عن القوائم والت... | We talked about strings when we talked about l... |


We need to make sure that our data doesn't contain null values.

In [6]:
data = data.rename(columns={'Arabic_transcript': 'AR', 'English_transcript': 'EN'})
# keep only the two text columns and drop rows where either one is missing
data = data[['AR', 'EN']].dropna()
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16809 entries, 0 to 16817
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   AR      16809 non-null  object
 1   EN      16809 non-null  object
dtypes: object(2)
memory usage: 394.0+ KB


In [7]:
# clean the data (modifies `data` in place)
cleaning(data)

# shuffle the rows, because the data was merged from several different sources
data = data.sample(frac=1,random_state=42).reset_index(drop=True)

In [8]:
from sklearn.model_selection import train_test_split

train,test = train_test_split(data, test_size=0.1, random_state=42)
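
As a quick sanity check on the split sizes, the arithmetic can be worked out by hand. This sketch assumes the 16809 rows reported by `data.info()` and relies on scikit-learn rounding a fractional `test_size` up to the next whole row.

```python
import math

n_rows = 16809      # rows left after dropping nulls, per data.info()
test_size = 0.1

n_test = math.ceil(test_size * n_rows)   # the test share is rounded up
n_train = n_rows - n_test                # the remainder goes to train
print(n_train, n_test)                   # 15128 1681
```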

In [9]:
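# index=False avoids writing the row index as an extra 'Unnamed: 0' column
train.to_csv('train.csv', index=False)
test.to_csv('test.csv', index=False)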

The data is all set. Let's load our model and write a function to fine-tune it.

In [10]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
  
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-ar")

model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-ar").to('cuda')


In [11]:
# setting the optimizer and lr
optimizer = torch.optim.AdamW(model.parameters(),lr=0.0001)

In [12]:
def model_train(train):
    '''
    Fine-tune the global `model` on the given dataframe.
    Change the column names to use it with other language pairs.
    Change the batch size, number of epochs, and steps per epoch to suit your application.
    Inputs
    train: dataframe containing the training data (columns 'EN' and 'AR')
    Outputs
    model: the fine-tuned model

    '''

    # put the model into training mode
    model.train()
    total_loss = 0.0
    max_epochs = 54
    batch_size = 8
    steps_per_epoch = 50
    for epoch in tqdm(range(max_epochs)):
        # reshuffle every epoch so each epoch draws a different subset of rows;
        # only steps_per_epoch * batch_size rows are used per epoch because of RAM constraints.
        train = train.sample(frac=1).reset_index(drop=True)
        X = train['EN']
        y = train['AR']
        for i in tqdm(range(steps_per_epoch)):
            # slice out the i-th batch by position
            local_X = X.iloc[i*batch_size:(i+1)*batch_size]
            local_y = y.iloc[i*batch_size:(i+1)*batch_size]
            # tokenize source and target texts into the model's input format
            batch = tokenizer.prepare_seq2seq_batch(list(local_X),list(local_y),return_tensors='pt').to('cuda')
            output = model(**batch)
            # the loss can be taken directly from the model output
            loss = output.loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            # accumulate a plain float so each step's computation graph can be freed
            total_loss += loss.item()
    # average loss per optimization step
    average = total_loss / (max_epochs * steps_per_epoch)
    print('Loss: ' + str(average))

    return model
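
The training loop takes a fixed 50 batches of 8 rows per epoch, so each epoch touches 400 rows of the shuffled training set. For comparison, here is a minimal sketch of how batch boundaries for a full pass over the data could be generated. The `iter_batches` helper is a hypothetical name, and the row count of 15128 assumes a 90% share of the 16809 cleaned rows.

```python
def iter_batches(n_rows, batch_size):
    """Yield (start, stop) positions that cover n_rows in consecutive batches."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)

n_train = 15128   # assumption: 90% of the 16809 cleaned rows
batch_size = 8

rows_per_fixed_epoch = 50 * batch_size                # rows touched by 50 steps of 8
full_pass = list(iter_batches(n_train, batch_size))   # boundaries for one full pass
print(rows_per_fixed_epoch, len(full_pass), full_pass[-1])   # 400 1891 (15120, 15128)
```

Each `(start, stop)` pair could be used for positional slicing, for example `X.iloc[start:stop]`. A full pass is roughly 38 times more steps per epoch than the fixed 50, so the number of epochs would need to be reduced accordingly.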

In [13]:
model_fine = model_train(train)

  0%|          | 0/54 [00:00<?, ?it/s]

  0%|          | 0/50 [00:00<?, ?it/s]




RuntimeError: ignored

In [None]:
model_fine.eval()

In [None]:
# saving the model 
torch.save(model_fine, "helsinki_finetuned.pt")

Now, let's evaluate our model on the test set.

In [None]:
from datasets import load_metric

metric = load_metric('google_bleu')

In [None]:
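source = []
predictions = []
references = []
for index,row in test.iterrows():
    # tokenize a single English sentence as a batch of one
    inputs = tokenizer.prepare_seq2seq_batch([row['EN']],return_tensors='pt').to('cuda')
    encode = model_fine.generate(**inputs)
    output = tokenizer.batch_decode(encode,skip_special_tokens=True)[0]
    source.append(row['EN'])
    # the metric expects tokenized predictions and, per prediction, a list of tokenized references
    predictions.append(output.split())
    references.append([row['AR'].split()])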

In [None]:
result = metric.compute(predictions=predictions, references=references)
result["google_bleu"]
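
To make the score easier to interpret: for a single sentence pair, google_bleu (GLEU) is the minimum of n-gram precision and n-gram recall, counted over 1-grams to 4-grams. The sketch below is a simplified sentence-level version for illustration only, not the library's implementation; the function names and example sentences are made up.

```python
from collections import Counter

def ngram_counts(tokens, min_n=1, max_n=4):
    """Count every n-gram of length min_n..max_n in a token list."""
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def sentence_gleu(reference, hypothesis):
    """min(precision, recall) over 1- to 4-grams for one sentence pair."""
    ref = ngram_counts(reference)
    hyp = ngram_counts(hypothesis)
    overlap = sum((ref & hyp).values())          # clipped n-gram matches
    # dividing by the larger total gives the smaller of precision and recall
    denom = max(sum(ref.values()), sum(hyp.values()))
    return overlap / denom if denom else 0.0

reference = "the cat sat on the mat".split()
hypothesis = "the cat sat on a mat".split()
print(round(sentence_gleu(reference, hypothesis), 4))   # 0.6111
```

A score of 1.0 means the prediction matches the reference exactly; 0.0 means they share no n-grams at all.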

We've finished all the steps. Now we'll upload the model and data directly to Google Drive.

In [None]:
from pydrive.auth import GoogleAuth
from google.colab import auth

# Authenticate and create the PyDrive client.
auth.authenticate_user()

In [None]:
# replace 'ID' with the ID of a Google Drive folder that is shared with edit permission
!gupload --to 'ID' train.csv

In [None]:
!gupload --to 'ID' test.csv

In [None]:
!gupload --to 'ID' helsinki_finetuned.pt

References:

*   [Hugging Face metrics](https://huggingface.co/metrics)
*   [Simple machine translation](https://towardsdatascience.com/simple-machine-translation-yor%C3%B9b%C3%A1-to-english-1b958ccdc8a1)


