# <center>Data Cleaning</center>

- The provided data is not correctly structured: the comma separation of the CSV file shifts many rows, and the Arabic translations are misaligned with their English utterances and prompts.

- We fix this by dropping the rows that cause the problem and realigning every translated utterance and prompt with its corresponding English sentence.

- The cleaning is done to both the training and the testing data.
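To make the shifting problem concrete, here is a minimal sketch using made-up rows (they are not taken from the dataset). It assumes the cause is a comma inside an unquoted field, and shows how that single comma gives the row one field too many, so every later value lands one column to the right.

```python
import csv
import io

# Made-up three-column CSV; the last row has an unquoted comma inside "utterance".
raw = (
    "conv_id,utterance,translated_utterance\n"
    "hit:0_conv:1,Hello there,مرحبا\n"
    "hit:0_conv:1,Well, hello,أهلا\n"
)

header, good, bad = list(csv.reader(io.StringIO(raw)))

# The broken row is split into four fields instead of three.
print(len(header), len(good), len(bad))
print(bad)
```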

## Install Needed Libraries

In [1]:
# %%capture
# !pip install numpy
# !pip install pandas
# The csv module ships with the Python standard library, so it needs no pip install.

## Import Needed Libraries

In [1]:
import os
import numpy as np
import pandas as pd

# Data Cleaning

## Clean Training Data

In [2]:
# read the training data
train_data = pd.read_csv("./empatheticdialogues_arabic/train_arabic.csv",low_memory=False)
print("Data Size: ", len(train_data))
train_data.sample(5)

Data Size:  79189


Unnamed: 0,conv_id,utterance_idx,context,prompt,translated_prompt,speaker_idx,utterance,translated_utterance,selfeval,tags,Unnamed: 10,Unnamed: 11,Unnamed: 12
67987,hit:10636_conv:21272,5,content,I went on vacation_comma_ sat on the beach an...,أشعر بالخجل الشديد لأنني دخلت مكتبي في وقت متأ...,652,Just a couple hours drive. It made me relaxed ...,أنا لست أفضل رجل يبحث على ما يرام. في حفلة ، ذ...,5|5|5_5|5|5,,,,
39447,hit:6107_conv:12214,4,sentimental,I feel something when I think back to where I ...,على الرغم من أن الأمور لم تكن تسير على ما يرام...,117,I can see why you miss it_comma_ as I actually...,لم أذهب إلى صالة الألعاب الرياضية هذا الصباح. ...,4|5|5_3|5|5,,,,
30989,hit:4842_conv:9684,4,prepared,I researched for weeks for my paper. With that...,اكتشفت أنه كان لا بد من إلغاء الإجازة التي خطط...,447,Then it was worth it_comma_ even if it was ard...,تبدو مثيرة. لم أكن هناك مطلقا.,5|5|5_5|5|5,,,,
30453,hit:4745_conv:9490,1,proud,I'm feeling good about how I've handled a lot ...,أصيب أحد أفراد الأسرة المقربين بالمرض بعد تسري...,225,I'm pleased with how I've been handling a larg...,ذات مرة سرقت قالب حلوى من متجر شعرت بالخجل الشديد,5|5|5_5|5|5,,,,
13392,hit:2104_conv:4209,3,anxious,When I was younger_comma_ I had social anxiety...,كنت أتحدث مع خالتي على الهاتف لكنني شعرت بالحر...,293,Slowly. Therapy and medication have helped a lot.,في بعض الأحيان يكون من الصعب التحدث مع الأقارب...,5|5|5_5|5|5,,,,


After inspecting the shifting problem in translated_prompt, we found that it originates at a few specific indices. Hence translated_prompt is shifted one step down from each of these indices onward.

In [3]:
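# Row indices (found by manual inspection) where translated_prompt becomes misaligned
index_of_shift = [7555,9668,15991,22415,50362,55863,57375,65143,66877,67011,73774,74022]
for i in index_of_shift:
    # Move every value from row i onward one row down; .loc avoids chained assignment
    train_data.loc[i+1:, "translated_prompt"] = train_data["translated_prompt"].shift(1).loc[i+1:]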
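The effect of a one-step downward shift is easier to see on a toy frame. The sketch below uses invented values and expresses the move with `Series.shift(1)`, which is one way to write it. Row `i` keeps its value, and each later row receives the value of the row above it.

```python
import pandas as pd

# Toy data: from row 2 onward, each translation sits one row too early.
toy = pd.DataFrame({
    "prompt": ["p1", "p1", "p1", "p2", "p2"],
    "translated_prompt": ["T1", "T1", "T2", "T2", "extra"],
})

i = 1  # hypothetical index where the misalignment starts
col = "translated_prompt"

# Rows i+1.. take the values previously held by rows i..n-2.
toy.loc[i + 1:, col] = toy[col].shift(1).loc[i + 1:]

print(toy[col].tolist())  # ['T1', 'T1', 'T1', 'T2', 'T2']
```

The last original value (`"extra"`) falls off the end, so each shift discards one trailing value.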

After inspecting the shifting problem in translated_utterance, we found that it comes from rows whose conv_id does not contain "hit:". So we shift translated_utterance one step down from each such index.

In [4]:
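# Rows whose conv_id lacks "hit:" mark where translated_utterance goes out of step
index_of_shift = train_data[~train_data.conv_id.str.contains('hit:')].index
for i in index_of_shift:
    if i < 22675:
        # Move every value from row i onward one row down
        train_data.loc[i+1:, "translated_utterance"] = train_data["translated_utterance"].shift(1).loc[i+1:]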
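As a small illustration of how the offending rows are located, the sketch below builds a boolean mask on a toy `conv_id` column with invented values. Passing `na=False` is an addition not used in the cell above; it treats missing ids as invalid instead of propagating `NaN` into the mask.

```python
import pandas as pd

# Toy conv_id column: one stray text fragment and one missing value.
toy = pd.DataFrame({"conv_id": ["hit:0_conv:1", "stray text", "hit:0_conv:1", None]})

# True where the id looks valid; missing values count as invalid.
valid = toy["conv_id"].str.contains("hit:", na=False)

# Index labels of the rows that trigger a shift.
bad_rows = toy.index[~valid].tolist()
print(bad_rows)  # [1, 3]
```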

We stopped there because the row at index 23255 also causes a problem, which is fixed separately in the next cell. We then continue in the same way until the end of the data.

In [5]:
# Rows 23247-23254 are broken: move everything from row 23247 onward 8 rows down
train_data.loc[23255:, "translated_utterance"] = train_data["translated_utterance"].shift(8).loc[23255:]

In [6]:
# Continue the one-step shifts for the offending rows in the next segment
index_of_shift = train_data[~train_data.conv_id.str.contains('hit:')].index
for i in index_of_shift:
    if 22675 < i < 51780:
        train_data.loc[i+1:, "translated_utterance"] = train_data["translated_utterance"].shift(1).loc[i+1:]

In [7]:
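# Move the block starting at row 52933 up by 23 rows; the last 23 rows keep their values
last = len(train_data) - 24
train_data.loc[52910:last, "translated_utterance"] = train_data["translated_utterance"].shift(-23).loc[52910:last]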

In [8]:
# Continue the one-step shifts for the offending rows in the last segment
index_of_shift = train_data[~train_data.conv_id.str.contains('hit:')].index
for i in index_of_shift:
    if i > 51780:
        train_data.loc[i+1:, "translated_utterance"] = train_data["translated_utterance"].shift(1).loc[i+1:]

In [9]:
# Rows 74276-74611 are broken: move everything from row 74276 onward 336 rows down
train_data.loc[74612:, "translated_utterance"] = train_data["translated_utterance"].shift(336).loc[74612:]
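The slice bounds in the three block moves (cells 5, 7 and 9) are easy to get wrong, because the destination uses Python slicing (end excluded) while the source uses `.loc` (end label included). The sketch below only checks the arithmetic: it confirms that source and destination have the same length and reports how far each block moves. The helper `block_offset` is hypothetical and exists only for this check.

```python
n = 79189  # number of rows in the raw training data, as printed in cell 2

def block_offset(dst_start, dst_stop, src_first, src_last):
    """Return how many rows a block moves (positive = down).

    dst_start/dst_stop follow Python slicing (stop excluded);
    src_first/src_last follow .loc slicing (last label included).
    """
    dst = range(dst_start, dst_stop)
    # .loc silently stops at the last existing label, n - 1.
    src = range(src_first, min(src_last, n - 1) + 1)
    assert len(dst) == len(src), "source and destination must have equal length"
    return dst_start - src_first

offsets = [
    block_offset(23255, n, 23247, n - 9),    # cell 5
    block_offset(52910, n - 23, 52933, n),   # cell 7
    block_offset(74612, n, 74276, n - 337),  # cell 9
]
print(offsets)  # [8, -23, 336]
```

The two downward offsets, 8 and 336, equal the sizes of the row ranges `range(23247,23255)` and `range(74276,74612)` that are dropped at the end of the cleaning.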

Finally, we drop all the rows that caused the shifting problem and drop the unneeded columns.

In [10]:
# Drop the two blocks of broken rows (8 and 336 rows)
train_data = train_data.drop(range(23247,23255))
train_data = train_data.drop(range(74276,74612))
train_data = train_data.reset_index(drop=True)
# Keep only rows with a valid conv_id and no stray "hit:" text in the utterance columns
train_data = train_data[train_data.conv_id.str.contains('hit:')]
train_data = train_data[~train_data.utterance.str.contains('hit:')]
train_data = train_data[~train_data.translated_utterance.str.contains('hit:')]
# Drop the unneeded unnamed columns
train_data = train_data.drop(columns=['Unnamed: 10', 'Unnamed: 11','Unnamed: 12'])

In [11]:
print("Data Length: ", len(train_data))
train_data.sample(5)

Data Length:  78773


Unnamed: 0,conv_id,utterance_idx,context,prompt,translated_prompt,speaker_idx,utterance,translated_utterance,selfeval,tags
78477,hit:12373_conv:24747,1,proud,I felt this way when my son took his first ste...,شعرت بهذه الطريقة عندما اتخذ ابني خطواته الأولى.,123,My son just took his first steps.,ابني اتخذ خطواته الأولى للتو.,5|5|5_5|5|5,
31994,hit:5031_conv:10062,3,furious,I was furious to hear that my dental bill was ...,كنت غاضبًا لسماع أن فاتورة الأسنان الخاصة بي ك...,448,Yes_comma_ but I suppose it's worth it in the ...,نعم ، لكني أعتقد أن الأمر يستحق ذلك على المدى ...,4|4|5_5|5|5,
17053,hit:2615_conv:5230,5,excited,I got my promotion_comma_ I am more than happy...,لقد حصلت على ترقيتي ، أنا أكثر من سعيد اليوم,210,A small party with my family. You are also inv...,حفلة صغيرة مع عائلتي. أنت مدعو أيضا :),5|5|5_5|5|5,
59692,hit:9153_conv:18306,1,nostalgic,My family used to go apple picking every year ...,اعتادت عائلتي على قطف التفاح كل عام عندما كنت ...,438,My family used to go apple picking every year ...,اعتادت عائلتي على قطف التفاح كل عام عندما كنت ...,3|5|5_5|5|5,
65293,hit:10063_conv:20126,2,anticipating,I just got a new apartment and I don't move in...,لقد حصلت للتو على شقة جديدة ولم أتحرك فيها لمد...,128,Congratulations_comma_ that sounds like things...,تهانينا ، يبدو أن الأمور تسير بخطى جيدة.,5|5|5_5|5|5,
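Random samples are a useful spot check, but a programmatic one can complement them. The sketch below rests on an assumption that is not stated in this notebook: that all rows of one conversation share the same prompt, so a correctly aligned `translated_prompt` should take exactly one value per `conv_id`. The helper name `misaligned_conversations` and the toy data are hypothetical.

```python
import pandas as pd

def misaligned_conversations(df, key="conv_id", col="translated_prompt"):
    """Return the conversation ids whose rows carry more than one distinct value of `col`."""
    counts = df.groupby(key)[col].nunique()
    return counts[counts > 1].index.tolist()

# Toy data: conversation c2 mixes two different translated prompts.
toy = pd.DataFrame({
    "conv_id": ["c1", "c1", "c2", "c2"],
    "translated_prompt": ["T1", "T1", "T2", "T1"],
})

print(misaligned_conversations(toy))  # ['c2']
```

If that assumption holds for this dataset, an empty list on the cleaned frame would suggest that no shift in `translated_prompt` remains. It would say nothing about `translated_utterance`.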


Write the cleaned data to disk

In [12]:
path = "./empatheticdialogues_arabic_cleaned"
# Create the output directory if it does not exist yet
os.makedirs(path, exist_ok=True)
train_data.to_csv(os.path.join(path, "train_arabic_clean.csv"), index=False)

## Clean Testing Data

Read the testing data

In [13]:
test_data = pd.read_csv("./empatheticdialogues_arabic/test_arabic.csv",low_memory=False)
print("Data Length: ", len(test_data))
test_data.sample(5)

Data Length:  10957


Unnamed: 0,conv_id,utterance_idx,context,prompt,translated_prompt,speaker_idx,utterance,translated_utterance,selfeval,tags,Unnamed: 10
6099,hit:7732_conv:15464,2,disgusted,I came home yesterday to dog poop everywhere!,لقد عدت إلى المنزل أمس لأنبوب الكلب في كل مكان!,397.0,Oh no! Were you gone for long?,يا له من كلب سيء ... آمل ألا يفعلوا ذلك مرة أخرى,5|5|4_5|5|5,,That sounds like you have a great wife. What ...
3155,hit:3722_conv:7445,4,excited,I went to an amusement park recently,بدأت في وضع المال في حساب التوفير,139.0,Yeah I really like Six Flags. They have some g...,أبقه مرتفعا. قريباً سيكون لديك ما يكفي من الما...,5|4|5_5|5|5,,I can imagine_comma_ sorry|That sounds horribl...
6411,hit:8076_conv:16152,4,confident,I teach a lot of students how to skydive_comma...,كل فرد في عائلتي يضحك عليّ لأنني شخص منظم للغا...,209.0,I am in VA. I have been looking into it for qu...,كل فرد في عائلتي يضحك عليّ لأنني منظم للغاية و...,5|5|5_5|5|5,,I'm sorry. Where did she go?|I'm always leery...
2513,hit:3363_conv:6727,4,anticipating,anticipate: can't wait for my baby boy,الرجاء: أدعو الله أن يتم الاهتمام بكل قلق - ضر...,445.0,the buggy is important_comma_ I have a baby b...,هل لديك طفل واحد على الطريق؟,5|5|5_5|5|5,,You should follow up with a few veggies later ...
5849,hit:7103_conv:14207,1,embarrassed,One day I fell at work while I was pregnant. L...,ذات يوم وقعت في العمل وأنا حامل. لحسن الحظ ، ل...,45.0,One day at work while I was pregnant_comma_ I ...,نعم ، لقد كان الأمر كذلك حقًا!,5|5|5_5|5|5,,


After inspecting the data, we found that the problem is introduced by the rows whose conv_id does not contain "hit:".
translated_utterance and translated_prompt are therefore shifted one step down from each such index.

In [14]:
# Rows whose conv_id is missing or lacks "hit:" mark where the translations go out of step
valid = test_data["conv_id"].str.contains("hit:", na=False)
index_of_shift = test_data.index[~valid].tolist()
for i in index_of_shift:
    for col in ["translated_utterance", "translated_prompt"]:
        # Move every value from row i onward one row down; .loc avoids chained assignment
        test_data.loc[i+1:, col] = test_data[col].shift(1).loc[i+1:]
# Drop the broken rows and the unneeded column
test_data = test_data.drop(index_of_shift)
test_data = test_data.reset_index(drop=True)
test_data = test_data.drop(columns=['Unnamed: 10'])
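The whole test-set procedure (find the bad rows, shift the translated columns, drop the bad rows) can be sketched end to end on a toy frame. The function `shift_and_drop` and the toy values are hypothetical; the sketch assumes a default integer index and expresses each move with `Series.shift(1)`.

```python
import pandas as pd

def shift_and_drop(df, cols, marker="hit:"):
    """Shift `cols` one row down after each row whose conv_id lacks `marker`, then drop those rows."""
    df = df.copy()
    valid = df["conv_id"].str.contains(marker, na=False, regex=False)
    bad = df.index[~valid].tolist()
    for i in bad:
        for col in cols:
            # Rows i+1.. take the values previously held by rows i..n-2.
            df.loc[i + 1:, col] = df[col].shift(1).loc[i + 1:]
    return df.drop(index=bad).reset_index(drop=True)

# Toy data: row 1 is a broken row, and it holds the translation that belongs to row 2.
toy = pd.DataFrame({
    "conv_id": ["hit:1", "junk", "hit:1", "hit:2"],
    "utterance": ["u1", "x", "u2", "u3"],
    "translated_utterance": ["t1", "t2", "t3", "spill"],
})

clean = shift_and_drop(toy, ["translated_utterance"])
print(clean)
```

After the call, `u1`, `u2` and `u3` are paired with `t1`, `t2` and `t3`, and the broken row is gone.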

In [16]:
print("Data Length: ", len(test_data))
test_data.sample(5)

Data Length:  10953


Unnamed: 0,conv_id,utterance_idx,context,prompt,translated_prompt,speaker_idx,utterance,translated_utterance,selfeval,tags
6826,hit:8681_conv:17363,1,disgusted,My friend asked me to help clean his grandma's...,طلب مني صديقي المساعدة في تنظيف منزل جدته بعد ...,117.0,When my buddy asked me to help him clean out h...,عندما طلب مني صديقي مساعدته في تنظيف منزل جدته...,2|3|4_5|5|5,
2724,hit:3489_conv:6979,1,sentimental,I feel blue whenever i see a picture of my dog...,أشعر بالحزن كلما رأيت صورة كلبي .. ماتت قبل شه...,416.0,I feel heartbroken whenever i see a picture of...,أشعر بالحزن كلما رأيت صورة كلبي,5|5|5_5|4|4,
3029,hit:3652_conv:7305,1,prepared,I used to live in California. Early one mornin...,كنت أعيش في كاليفورنيا. في وقت مبكر من صباح أح...,343.0,Since that earthquake in the early hours of th...,منذ ذلك الزلزال في الساعات الأولى من الصباح ، ...,5|5|5_5|5|4,
5864,hit:7118_conv:14237,2,surprised,My husband planned a secret weekend trip for u...,لقد خطط زوجي لنا منذ وقت ليس ببعيد لرحلة سرية ...,4.0,Where did he take you to?,إلى أين أخذك إلى؟,5|5|5_5|5|5,
9282,hit:10785_conv:21570,6,faithful,Even though I could've gotten more money at an...,على الرغم من أنه كان بإمكاني الحصول على المزيد...,563.0,I always need convincing to take my yucky medi...,أحتاج دائمًا إلى الإقناع لأخذ دوائي المقزز,5|5|5_5|5|5,


Write the cleaned data to disk

In [17]:
path = "./empatheticdialogues_arabic_cleaned"
# Create the output directory if it does not exist yet
os.makedirs(path, exist_ok=True)
test_data.to_csv(os.path.join(path, "test_arabic_clean.csv"), index=False)