# <center>Data Cleaning</center>

- The provided data is not correctly structured: the comma separation of the CSV file shifts many rows, and the Arabic translations are shifted relative to the English utterances and prompts.

- We solve this issue by dropping the rows that cause the problem and realigning every translated utterance and prompt with its corresponding English sentence.

- The cleaning is applied to both the training and the testing data.
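One way to locate the rows broken by the comma separation is to count the fields on each line and compare the count with the header. The following is a minimal sketch on an in-memory sample, not the procedure used in this notebook; the helper name `find_malformed_rows` and the sample text are hypothetical.

```python
import csv
import io

def find_malformed_rows(csv_text):
    """Return (row_number, field_count) for data rows whose width differs from the header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    expected = len(header)
    # Row numbers are 1-based and count data rows only (the header is excluded).
    return [(n, len(row)) for n, row in enumerate(reader, start=1) if len(row) != expected]

# Hypothetical sample: the second data row has an unquoted comma inside the utterance.
sample = (
    "conv_id,utterance,translated_utterance\n"
    "hit:0_conv:1,hello,مرحبا\n"
    "hit:0_conv:1,yes, sure,نعم\n"
)
bad_rows = find_malformed_rows(sample)
print(bad_rows)
```

Rows reported this way are candidates for the manual inspection described in the sections that follow.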

## Install Needed Libraries

In [1]:
# %%capture
# !pip install numpy
# !pip install pandas
# csv is part of the Python standard library and does not need to be installed
# !pip install git-python==1.0.3
# !pip install sacrebleu==1.4.2
# !pip install rouge_score
# !pip install farasapy
# !git clone https://github.com/aub-mind/arabert
# !pip install pyarabic
# !pip install datasets
# !pip install -U transformers==4.5.1
# !pip install awesometkinter

## Import Needed Libraries

In [24]:
import os
import numpy as np
import pandas as pd

# Data Cleaning

## Clean Training Data

In [5]:
# read the training data
train_data = pd.read_csv("./empatheticdialogues_arabic/train_arabic.csv",low_memory=False)
print("Data Size: ", len(train_data))
train_data.sample(5)

Data Size:  79189


Unnamed: 0,conv_id,utterance_idx,context,prompt,translated_prompt,speaker_idx,utterance,translated_utterance,selfeval,tags,Unnamed: 10,Unnamed: 11,Unnamed: 12
47586,hit:7245_conv:14490,3,caring,Sometimes I feel like a dad to my best friend.,أفتقد عندما كانت ابنتي رضيعة ، أتمنى أن أعود ف...,445,nope_comma_ young people know everything alre...,حسنًا ، فهمت الآن.,5|5|5_5|5|5,,,,
29151,hit:4513_conv:9027,4,joyful,I really wanted to go on this trip to Texas an...,أنا أطعم كلبي وسقيها كل يوم.,9,Food is so expensive these days.,من الرائع سماع ذلك ، أتمنى أن يمضي يومه الأول ...,5|5|5_5|5|5,,,,
5689,hit:980_conv:1960,2,annoyed,I go grocery shopping on Saturday morning_comm...,أذهب للتسوق من البقالة صباح يوم السبت ، ويقضي ...,11,Tell him to get out and do something! That mus...,حسنًا ، يدفع إيجارًا لكنه بالتأكيد يستخدم موار...,5|5|5_5|5|5,,,,
52390,hit:7994_conv:15989,3,terrified,I keep hearing strange noises from underneath ...,كان على فتياتي دراسة علم الأحياء العام الماضي....,139,I don't have a dog and I just moved in the hou...,كنت سأضحك بشدة إذا كنت أرى كلبي يفعل ذلك,5|5|5_3|4|5,,,,
55992,hit:8556_conv:17112,1,hopeful,I'm starting an internship in my field in a co...,لقد شاهدت للتو الفيلم الذي رأيته مع زوجتي في أ...,292,I am starting an internship in my field of stu...,أوه لا. هذا ليس جيدا!,5|5|5_5|5|5,,,,


After inspecting the shifting problem in translated_prompt, we found that it originates at a few specific indices. Hence translated_prompt is shifted one step down at each of these indices.

In [6]:
index_of_shift = [7555,9668,15991,22415,50362,55863,57375,65143,66877,67011,73774,74022]
for i in index_of_shift:
    # Shift translated_prompt one row down from row i onward; .loc avoids chained assignment
    train_data.loc[i + 1:, "translated_prompt"] = train_data["translated_prompt"].iloc[i:-1].to_numpy()
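The one-step shift can be easier to follow on a toy frame. The sketch below assumes a default integer index (as produced by `pd.read_csv`); the helper name `shift_down_from` and the toy values are hypothetical and only illustrate the idea.

```python
import pandas as pd

def shift_down_from(df, column, start):
    """Shift `column` one row down from position `start` onward, in place.

    The value at `start` is copied to `start + 1`, and so on; the last value
    of the column is pushed out. Assumes a default RangeIndex.
    """
    df.loc[start + 1:, column] = df[column].iloc[start:-1].to_numpy()

# Row 2 is a broken row: from there on, each translation sits one row too high.
toy = pd.DataFrame({
    "prompt": ["a", "b", "<broken>", "c", "d"],
    "translated_prompt": ["A", "B", "C", "D", "extra"],
})
shift_down_from(toy, "translated_prompt", 2)
print(toy)
```

After the shift, rows 3 and 4 pair `c` with `C` and `d` with `D`; the broken row 2 still has to be dropped afterwards.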

After inspecting the shifting problem in translated_utterance, we found that it comes from rows whose conv_id does not contain "hit:". So we shift translated_utterance one step down at each such index.

In [7]:
index_of_shift = train_data[~train_data.conv_id.str.contains('hit:')].index
for i in index_of_shift:
    if i < 22675:
        # Shift translated_utterance one row down from row i onward
        train_data.loc[i + 1:, "translated_utterance"] = train_data["translated_utterance"].iloc[i:-1].to_numpy()

We stopped at index 22675 because the row at index 23255 also causes a problem that has to be fixed separately. The same procedure is then repeated until the end of the data.

In [8]:
# Shift translated_utterance 8 rows down from row 23247 onward
train_data.loc[23255:, "translated_utterance"] = train_data["translated_utterance"].iloc[23247:-8].to_numpy()

In [9]:
index_of_shift = train_data[~train_data.conv_id.str.contains('hit:')].index
for i in index_of_shift:
    if 22675 < i < 51780:
        # Shift translated_utterance one row down from row i onward
        train_data.loc[i + 1:, "translated_utterance"] = train_data["translated_utterance"].iloc[i:-1].to_numpy()

In [10]:
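# Shift translated_utterance 23 rows up from row 52933 onward (.loc slices include the end label)
train_data.loc[52910:len(train_data) - 24, "translated_utterance"] = train_data["translated_utterance"].iloc[52933:].to_numpy()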

In [11]:
index_of_shift = train_data[~train_data.conv_id.str.contains('hit:')].index
for i in index_of_shift:
    if i > 51780:
        # Shift translated_utterance one row down from row i onward
        train_data.loc[i + 1:, "translated_utterance"] = train_data["translated_utterance"].iloc[i:-1].to_numpy()

In [12]:
# Shift translated_utterance 336 rows down from row 74276 onward
train_data.loc[74612:, "translated_utterance"] = train_data["translated_utterance"].iloc[74276:-336].to_numpy()

Finally, we drop all the rows that cause the shifting problem, along with the unneeded columns.

In [13]:
train_data = train_data.drop(range(23247,23255))
train_data = train_data.drop(range(74276,74612))
train_data = train_data.reset_index(drop=True)
train_data = train_data[train_data.conv_id.str.contains('hit:')]
train_data = train_data[~train_data.utterance.str.contains('hit:')]
train_data = train_data[~train_data.translated_utterance.str.contains('hit:')]
train_data = train_data.drop(columns=['Unnamed: 10', 'Unnamed: 11','Unnamed: 12'])
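A quick sanity check after this kind of realignment is to look for conversations whose rows carry more than one distinct translated prompt. This is only a sketch under the assumption that every row of a conversation repeats the same prompt and therefore the same translation; that assumption is not verified in this notebook, and the helper name `conversations_with_mixed_prompts` is hypothetical.

```python
import pandas as pd

def conversations_with_mixed_prompts(df):
    """Return conv_ids whose rows carry more than one distinct translated_prompt."""
    counts = df.groupby("conv_id")["translated_prompt"].nunique()
    return counts[counts > 1].index.tolist()

# Toy data: the second conversation has two different translated prompts.
toy = pd.DataFrame({
    "conv_id": ["hit:0_conv:1", "hit:0_conv:1", "hit:0_conv:2", "hit:0_conv:2"],
    "translated_prompt": ["A", "A", "B", "C"],
})
suspicious = conversations_with_mixed_prompts(toy)
print(suspicious)
```

An empty result would be consistent with the prompts being aligned; a non-empty one points to conversations worth inspecting by hand.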

In [14]:
print("Data Length: ", len(train_data))
train_data.sample(5)

Data Length:  78773


Unnamed: 0,conv_id,utterance_idx,context,prompt,translated_prompt,speaker_idx,utterance,translated_utterance,selfeval,tags
2873,hit:456_conv:912,1,nostalgic,There are certain times when I hear a song com...,هناك أوقات معينة عندما أسمع أغنية تأتي على الر...,163,I sometimes wish I could go back to high schoo...,أتمنى أحيانًا أن أعود إلى المدرسة الثانوية وأن...,3|5|5_5|5|5,
76289,hit:12046_conv:24092,1,impressed,I was surprised that the Browns managed to bea...,لقد فوجئت بأن براون نجح في التغلب على النسور ف...,10,I was surprised that the Browns managed to bea...,لقد فوجئت بأن براون تمكن من هزيمة النسور في ذل...,5|5|5_5|5|5,
26100,hit:4069_conv:8138,1,terrified,I am starting my first year of college and I a...,أبدأ عامي الأول في الكلية وأنا خائف للغاية.,262,I am about to start college at the end of the ...,أنا على وشك بدء الدراسة الجامعية في نهاية الشهر.,5|5|5_5|5|5,
42120,hit:6493_conv:12986,2,sentimental,I am finding it hard to sell my grandfathers c...,أجد صعوبة في بيع كاديلاك أجدادي,238,put it on offerup,وضعه على العرض,4|5|5_5|5|5,
77265,hit:12190_conv:24381,2,surprised,I had not seen my best friend in over 3 years....,لم أر أفضل صديق لي منذ أكثر من 3 سنوات. عندما ...,139,Sorry to hear that. What happened to him?,آسف لسماع ذلك. ماذا حدث له؟,5|5|5_5|5|5,


Write the cleaned data to disk

In [30]:
path = "./empatheticdialogues_arabic_cleaned"
os.makedirs(path, exist_ok=True)  # create the output directory if it does not exist
train_data.to_csv(os.path.join(path, "train_arabic_clean.csv"), index=False)

## Clean Testing Data

Read the testing data

In [18]:
test_data = pd.read_csv("./empatheticdialogues_arabic/test_arabic.csv",low_memory=False)
print("Data Length: ", len(test_data))
test_data.sample(5)

Data Length:  10957


Unnamed: 0,conv_id,utterance_idx,context,prompt,translated_prompt,speaker_idx,utterance,translated_utterance,selfeval,tags,Unnamed: 10
6866,hit:8700_conv:17400,4,disappointed,I was hoping to go on vacation this summer but...,لقد كنت مجتهدة حقًا في دراستي ، وسأكون في فريق...,366.0,Wow_comma_ that is a long time! That shows how...,لقد كنت مجتهدًا للغاية في دراستي حتى الآن ، وأ...,5|5|5_5|5|5,,My son keeps trying to show me Youtube videos ...
4542,hit:5481_conv:10963,3,surprised,My mother gave me something very unexpected an...,كنت أرغب في غسل الصحون في ذلك اليوم لكنني لم أ...,45.0,Yes_comma_ I didn't have a clue she was going ...,لقد كسرت إصبعًا في ذلك اليوم ولم أستطع غسل أي ...,5|5|5_5|5|5,,
577,hit:556_conv:1112,3,trusting,My mother recently took out a bunch of money o...,أخذت والدتي مؤخرًا مجموعة من المال نيابة عني م...,64.0,Nice people are the best. Empathy is something...,الناس الطيبون هم الأفضل. التعاطف هو شيء نحتاج ...,5|5|5_5|5|5,,
3807,hit:4742_conv:9485,1,impressed,Everybody say a quiet place movie was boring b...,الجميع يقول أن فيلم مكان هادئ كان مملًا لكنني ...,50.0,Everybody say a quiet place movie was boring b...,يا إلهي ، يجب عليك! إنه فيلم فريد من نوعه. لا ...,5|5|5_5|5|5,,
2388,hit:3108_conv:6217,5,confident,I found out that I was one of the top performe...,اكتشفت أنني كنت أحد أفضل الفنانين أداءً في منط...,90.0,How long have you been there?,أنا متأكد من أنك قمت ببعض العمل الجيد - هل سيك...,5|5|5_5|5|5,,


After inspecting the data, we found that the problem is introduced by the rows that do not contain "hit:".
The translated_utterance and translated_prompt columns are shifted one step down at each such index.

In [19]:
index_of_shift = []
for i in range(len(test_data)):
    conv_id = test_data["conv_id"][i]
    # A missing (non-string) conv_id or one without "hit:" marks a shift point
    if not isinstance(conv_id, str) or "hit:" not in conv_id:
        index_of_shift.append(i)
for i in index_of_shift:
    # Shift both translated columns one row down from row i onward; .loc avoids chained assignment
    for col in ["translated_utterance", "translated_prompt"]:
        test_data.loc[i + 1:, col] = test_data[col].iloc[i:-1].to_numpy()
test_data = test_data.drop(index_of_shift)
test_data = test_data.reset_index(drop=True)
test_data = test_data.drop(columns=['Unnamed: 10'])
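The whole test-data procedure (find the break rows, shift the translated columns, drop the break rows) can be sketched end to end on a toy frame. The helper name `realign_and_drop` and the toy values are hypothetical, and the sketch assumes a default integer index; `na=False` treats a missing conv_id as a break row.

```python
import pandas as pd

def realign_and_drop(df, break_rows, columns):
    """Shift each of `columns` one row down at every break row, then drop those rows."""
    df = df.copy()
    for i in break_rows:
        for col in columns:
            df.loc[i + 1:, col] = df[col].iloc[i:-1].to_numpy()
    return df.drop(index=break_rows).reset_index(drop=True)

toy = pd.DataFrame({
    "conv_id": ["hit:1_conv:2", "stray", "hit:1_conv:2", "hit:1_conv:3"],
    "utterance": ["hi", "", "bye", "ok"],
    "translated_utterance": ["HI", "BYE", "OK", "unused"],
})
# Break rows are those whose conv_id is missing or lacks "hit:".
breaks = toy.index[~toy["conv_id"].str.contains("hit:", na=False)].tolist()
cleaned = realign_and_drop(toy, breaks, ["translated_utterance"])
print(cleaned)
```

Working on a copy keeps the original frame untouched, which makes it easier to compare the data before and after the realignment.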

In [20]:
print("Data Length: ", len(test_data))
test_data.sample(5)

Data Length:  10953


Unnamed: 0,conv_id,utterance_idx,context,prompt,translated_prompt,speaker_idx,utterance,translated_utterance,selfeval,tags
9742,hit:11129_conv:22259,2,trusting,I walk out on belief that I will be lead down ...,أسير على اعتقاد أنني سوف أسير في الطريق الصحيح...,632.0,That's a big decision! What are your choices?,هذا قرار كبير! ما هي اختياراتك؟,5|5|5_5|5|5,
7793,hit:9608_conv:19216,1,grateful,There's a woman in town who has started delive...,هناك امرأة في المدينة بدأت في توصيل البقالة إل...,343.0,I get my groceries delivered to my home and am...,أحصل على مشترياتي من البقالة إلى منزلي وأنا مم...,5|5|5_5|5|5,
9908,hit:11238_conv:22477,2,guilty,I snuck some candy out of my daughter's snack ...,لقد تسللت بعض الحلوى من مخزون الوجبات الخفيفة ...,35.0,I understand that feeling. I've had trouble le...,أنا أفهم هذا الشعور. لقد واجهت مشكلة في التخلي...,5|5|5_5|5|5,
10383,hit:11698_conv:23397,2,proud,I just finished first in a foot race! I feel a...,لقد أنهيت للتو المركز الأول في سباق القدم! أشع...,438.0,Oh my goodness! That's amazing! Congratulati...,يا إلهي! هذا مذهل! تهانينا! هل كان هذا أول سبا...,5|5|5_5|5|5,
7920,hit:9722_conv:19445,3,surprised,I walked out to my car and got in and felt som...,خرجت إلى سيارتي ودخلت وشعرت بشيء أصاب وجهي. لق...,313.0,A GIANT SPIDER! It was terrible and I was so s...,عنكبوت عملاق! كان الأمر فظيعًا وكنت مذهولًا جدًا!,5|5|5_5|5|5,


Write the cleaned data to disk

In [31]:
path = "./empatheticdialogues_arabic_cleaned"
os.makedirs(path, exist_ok=True)  # create the output directory if it does not exist
test_data.to_csv(os.path.join(path, "test_arabic_clean.csv"), index=False)
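Since the files contain Arabic text, it can be worth confirming that a written CSV reads back unchanged. The following is a minimal sketch using a temporary directory and a toy frame; the file name `toy_clean.csv` and the explicit `encoding="utf-8"` are illustrative choices, not part of the notebook above.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"utterance": ["hello"], "translated_utterance": ["مرحبا"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "toy_clean.csv")
    # Write and read back with the same explicit encoding.
    toy.to_csv(out_path, index=False, encoding="utf-8")
    restored = pd.read_csv(out_path, encoding="utf-8")

round_trip_ok = restored.equals(toy)
print(round_trip_ok)
```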