<a href="https://colab.research.google.com/github/KelvinLam05/Text-Data-Augmentation/blob/main/Text_Data_Augmentation_with_Back_Translation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Goal of the project**

When working on Natural Language Processing applications such as text classification, manually collecting enough labeled examples for each category can be difficult. In this notebook, we will go over back translation, a technique for automatically augmenting existing text data.

**Introduction to Back Translation**

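The key idea of back translation is simple: translating a sentence into another language and back again usually yields a paraphrase that preserves the meaning, so the original label still applies. We create an augmented version of a sentence using the following steps: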


1. Take an input text in some source language (e.g. English)

2. Translate this text into a temporary destination language (e.g. English -> Dutch)

3. Translate the result back into the source language (e.g. Dutch -> English)
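
The three steps above can be sketched as a small helper. This is a minimal illustration rather than the notebook's implementation: `back_translate`, `to_pivot`, and `from_pivot` are hypothetical names, and the two translators are passed in as plain callables so that any translation backend can be plugged in. The toy word tables only stand in for real models.

```python
from typing import Callable


def back_translate(text: str,
                   to_pivot: Callable[[str], str],
                   from_pivot: Callable[[str], str]) -> str:
    """Translate text into a pivot language and back into the source language."""
    pivot_text = to_pivot(text)      # step 2: source -> pivot (e.g. English -> Dutch)
    return from_pivot(pivot_text)    # step 3: pivot -> source (e.g. Dutch -> English)


# Toy stand-ins for real translation models (hypothetical word tables, illustration only)
EN_NL = {"bad": "slecht", "food": "eten"}
NL_EN = {"slecht": "poor", "eten": "food"}


def toy_en_nl(text: str) -> str:
    return " ".join(EN_NL.get(word, word) for word in text.split())


def toy_nl_en(text: str) -> str:
    return " ".join(NL_EN.get(word, word) for word in text.split())


augmented = back_translate("bad food", toy_en_nl, toy_nl_en)
print(augmented)  # poor food
```

The round trip returns a sentence with the same meaning but different wording, which is what makes it usable as an extra training example.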

In [1]:
# Importing libraries
import pandas as pd
import numpy as np
import ktrain
import tensorflow as tf
from ktrain import text
from sklearn.model_selection import train_test_split

In [2]:
# Load dataset
df = pd.read_csv('/content/onlinedeliverydata.csv')

In [3]:
# Examine the data
df.head()

Unnamed: 0,Age,Gender,Marital Status,Occupation,Monthly Income,Educational Qualifications,Family size,latitude,longitude,Pin code,Medium (P1),Medium (P2),Meal(P1),Meal(P2),Perference(P1),Perference(P2),Ease and convenient,Time saving,More restaurant choices,Easy Payment option,More Offers and Discount,Good Food quality,Good Tracking system,Self Cooking,Health Concern,Late Delivery,Poor Hygiene,Bad past experience,Unavailability,Unaffordable,Long delivery time,Delay of delivery person getting assigned,Delay of delivery person picking up food,Wrong order delivered,Missing item,Order placed by mistake,Influence of time,Order Time,Maximum wait time,Residence in busy location,Google Maps Accuracy,Good Road Condition,Low quantity low time,Delivery person ability,Influence of rating,Less Delivery time,High Quality of package,Number of calls,Politeness,Freshness,Temperature,Good Taste,Good Quantity,Output,Reviews
0,20,Female,Single,Student,No Income,Post Graduate,4,12.9766,77.5993,560001,Food delivery apps,Web browser,Breakfast,Lunch,Non Veg foods (Lunch / Dinner),Bakery items (snacks),Neutral,Neutral,Neutral,Neutral,Neutral,Neutral,Neutral,Neutral,Neutral,Neutral,Neutral,Neutral,Neutral,Neutral,Agree,Agree,Agree,Agree,Agree,Agree,Yes,Weekend (Sat & Sun),30 minutes,Agree,Neutral,Neutral,Neutral,Neutral,Yes,Moderately Important,Moderately Important,Moderately Important,Moderately Important,Moderately Important,Moderately Important,Moderately Important,Moderately Important,Yes,Nil\n
1,24,Female,Single,Student,Below Rs.10000,Graduate,3,12.977,77.5773,560009,Food delivery apps,Web browser,Snacks,Dinner,Non Veg foods (Lunch / Dinner),Veg foods (Breakfast / Lunch / Dinner),Strongly agree,Strongly agree,Strongly agree,Strongly agree,Strongly agree,Neutral,Agree,Strongly agree,Strongly agree,Agree,Strongly agree,Strongly agree,Strongly agree,Strongly agree,Strongly agree,Strongly agree,Strongly agree,Strongly agree,Strongly agree,Strongly agree,Yes,Anytime (Mon-Sun),30 minutes,Strongly Agree,Neutral,Disagree,Strongly disagree,Agree,Yes,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,Yes,Nil
2,22,Male,Single,Student,Below Rs.10000,Post Graduate,3,12.9551,77.6593,560017,Food delivery apps,Direct call,Lunch,Snacks,Non Veg foods (Lunch / Dinner),Ice cream / Cool drinks,Strongly agree,Strongly agree,Strongly agree,Neutral,Neutral,Disagree,Neutral,Disagree,Neutral,Neutral,Agree,Agree,Agree,Agree,Agree,Agree,Agree,Strongly agree,Agree,Neutral,Yes,Anytime (Mon-Sun),45 minutes,Agree,Strongly Agree,Neutral,Neutral,Agree,Yes,Important,Very Important,Moderately Important,Very Important,Very Important,Important,Very Important,Moderately Important,Yes,"Many a times payment gateways are an issue, so..."
3,22,Female,Single,Student,No Income,Graduate,6,12.9473,77.5616,560019,Food delivery apps,Walk-in,Snacks,Dinner,Veg foods (Breakfast / Lunch / Dinner),Bakery items (snacks),Agree,Agree,Strongly agree,Agree,Strongly agree,Agree,Agree,Agree,Strongly agree,Neutral,Agree,Disagree,Disagree,Neutral,Agree,Agree,Agree,Disagree,Disagree,Neutral,Yes,Anytime (Mon-Sun),30 minutes,Disagree,Agree,Agree,Neutral,Agree,Yes,Very Important,Important,Moderately Important,Very Important,Very Important,Very Important,Very Important,Important,Yes,nil
4,22,Male,Single,Student,Below Rs.10000,Post Graduate,4,12.985,77.5533,560010,Walk-in,Direct call,Lunch,Dinner,Non Veg foods (Lunch / Dinner),Veg foods (Breakfast / Lunch / Dinner),Agree,Agree,Agree,Agree,Agree,Neutral,Neutral,Agree,Strongly agree,Strongly agree,Agree,Strongly agree,Agree,Disagree,Strongly agree,Strongly agree,Neutral,Neutral,Neutral,Disagree,Yes,Weekend (Sat & Sun),30 minutes,Agree,Agree,Agree,Agree,Agree,Yes,Important,Important,Moderately Important,Important,Important,Important,Very Important,Very Important,Yes,NIL


In [4]:
# Overview of all variables, their datatypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 388 entries, 0 to 387
Data columns (total 55 columns):
 #   Column                                     Non-Null Count  Dtype  
---  ------                                     --------------  -----  
 0   Age                                        388 non-null    int64  
 1   Gender                                     388 non-null    object 
 2   Marital Status                             388 non-null    object 
 3   Occupation                                 388 non-null    object 
 4   Monthly Income                             388 non-null    object 
 5   Educational Qualifications                 388 non-null    object 
 6   Family size                                388 non-null    int64  
 7   latitude                                   388 non-null    float64
 8   longitude                                  388 non-null    float64
 9   Pin code                                   388 non-null    int64  
 10  Medium (P1)               

**Preprocessing**

In [5]:
# Keep only the columns we need; .copy() avoids a SettingWithCopyWarning on later assignments
df = df[['Reviews', 'Output']].copy()

In [6]:
# Change column names to lower case
df.columns = df.columns.str.lower()

In [7]:
# Make all strings in the output column lowercase
df['output'] = df['output'].str.lower()


In [8]:
# Checking for missing values
df.isnull().sum().sort_values(ascending = False)

output     0
reviews    0
dtype: int64

In [9]:
# Checking the distribution of classes
df['output'].value_counts() 

yes    301
no      87
Name: output, dtype: int64

The target variable is imbalanced: 301 reviews are labeled yes and only 87 are labeled no.

In [10]:
# Get the first five rows
df['reviews'].head()

0                                                Nil\n
1                                                  Nil
2    Many a times payment gateways are an issue, so...
3                                                  nil
4                                                  NIL
Name: reviews, dtype: object

The dataset has missing values that are not stored as NaN but as placeholder strings such as Nil, nil, or NIL.

In [11]:
# Check for missing values (not NaN)
df[df['reviews'].str[0].isin(['N', 'n'])].value_counts()

reviews                                                                                                             output
NIL                                                                                                                 yes       75
Nil                                                                                                                 yes       54
No                                                                                                                  yes        4
nil                                                                                                                 yes        4
NIL                                                                                                                 no         3
Need quality food delivery. had worst experience of spilled food                                                    no         3
Nil                                                                                                    

In [12]:
# Replace placeholder strings (not NaN) with NaN values
placeholders = ['NIL', 'Nil', 'nil', 'No', 'N0', 'No Comments!', 'None', 'Nill', 'Nil\n', 'NiL']
df['reviews'] = df['reviews'].replace(placeholders, np.nan)

In [13]:
# Checking for missing values
df.isnull().sum().sort_values(ascending = False)

reviews    150
output       0
dtype: int64
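
Listing every spelling of the placeholder by hand is easy to get wrong. A possible alternative, sketched below, normalizes each review before comparing it with a small set of placeholder tokens. The helper name `is_placeholder` and the contents of `PLACEHOLDERS` are assumptions derived from the values replaced above, and should be checked against the actual data.

```python
# Lowercased placeholder tokens (assumed from the values replaced in the cell above)
PLACEHOLDERS = {"nil", "nill", "no", "n0", "none", "no comments!"}


def is_placeholder(review: str) -> bool:
    """Return True if the review is only a 'no comment' placeholder."""
    # strip() removes surrounding whitespace such as a trailing newline
    return review.strip().lower() in PLACEHOLDERS


samples = ["Nil\n", "NIL", "NiL", "No Comments!", "Need quality food delivery."]
flags = [is_placeholder(s) for s in samples]
print(flags)  # [True, True, True, True, False]
```

In the notebook this could be applied with something like `df['reviews'].mask(df['reviews'].map(is_placeholder))`. Because it also catches variants with extra whitespace or different casing, the number of missing values may differ from the count shown above.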

In [14]:
# Drop rows with NaN values
df = df[df['reviews'].notna()]

In [15]:
# Checking the distribution of classes
df['output'].value_counts() 

yes    158
no      80
Name: output, dtype: int64

In [16]:
# Find duplicates
df[df.duplicated(['reviews'], keep = False)]

Unnamed: 0,reviews,output
2,"Many a times payment gateways are an issue, so...",yes
11,Language barrier is also one major issue. Mosl...,yes
17,"Spillage, bad packaging and missing items",yes
36,Now days delivery ?? is improved a lot like fa...,no
48,"Spillage, bad packaging and missing items",yes
...,...,...
375,I had bad quality order delivered twice,no
376,Bad rating doesn't mean that the food tastes b...,yes
377,Order delivered to my location are late,no
378,My location is pretty well built for food deli...,yes


In [17]:
# Keep the first occurrence of each review and drop the later duplicates
df = df.drop_duplicates(subset = ['reviews'], keep = 'first')


In [18]:
# Reset index
df.reset_index(drop = True, inplace = True)

In [19]:
# Checking the distribution of classes
df['output'].value_counts() 

yes    112
no      59
Name: output, dtype: int64

In [20]:
# Find all unique characters and symbols 
all_text = str()

for sentence in df['reviews'].values:
    all_text += sentence
    
''.join(set(all_text))

"!E0laKZMWAx4cvht,z-bgsHD1B?7y+GnVFePYpC '\nijrm9T5LOkNwRUI6SQu.fdqo"

The kind of data we get from customer feedback is usually unstructured. It contains mixed-case letters, digits, punctuation, and newline characters that should be cleaned before the text is passed to a machine learning model.

In [21]:
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer 

In [22]:
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [23]:
stop_words = set(stopwords.words('english'))
lemma = WordNetLemmatizer()

We will now set up our cleaning function.

In [24]:
def clean_review(review_text):

  # Replace every non-letter character (digits, punctuation, newlines) with a space
  review_text = re.sub(r'[^a-zA-Z]', ' ', review_text)
  # Replace one or more whitespace characters with a single space
  review_text = re.sub(r'\s+', ' ', review_text)
  # Convert all characters to lowercase
  review_text = review_text.lower()
  # Tokenization
  review_text = word_tokenize(review_text)
  # Remove stop words
  review_text = [item for item in review_text if item not in stop_words]
  # Lemmatization (every token is treated as a verb)
  review_text = [lemma.lemmatize(word = w, pos = 'v') for w in review_text]
  # Remove words of length <= 2
  review_text = [i for i in review_text if len(i) > 2]
  # Join the list of tokens back into a single string
  review_text = ' '.join(review_text)

  return review_text
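
To make the first three steps of `clean_review` concrete, here is a dependency-free sketch on a made-up review. `strip_to_words` is a hypothetical helper: it uses `str.split` instead of NLTK's `word_tokenize` and skips stop-word removal and lemmatization, so it only approximates the function above.

```python
import re


def strip_to_words(review_text: str) -> list:
    """Keep letters only, collapse whitespace, lowercase, and split into words."""
    letters_only = re.sub(r'[^a-zA-Z]', ' ', review_text)  # digits and punctuation -> spaces
    collapsed = re.sub(r'\s+', ' ', letters_only)          # runs of whitespace -> one space
    return collapsed.lower().split()                       # lowercase, then split on spaces


tokens = strip_to_words("Late delivery!! 2 items missing :(")
print(tokens)  # ['late', 'delivery', 'items', 'missing']
```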

In [25]:
df['reviews'] = df['reviews'].apply(clean_review)

In [26]:
all_text = str()

for sentence in df['reviews'].values:
    all_text += sentence
    
''.join(set(all_text))

'laxcvhtzbgsynep ijrmkwufdqo'

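When working with unstructured text data, we will inevitably find misspelled words. The SpellChecker class from the pyspellchecker package can correct many of them, although it may also change words it does not recognize, such as names.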

In [28]:
from spellchecker import SpellChecker

In [29]:
# Instantiate spell checker
spell = SpellChecker()

In [30]:
# Correct spelling
def correct_spellings(text):

    words = text.split()
    misspelled_words = spell.unknown(words)
    corrected_text = []
    for word in words:
        if word in misspelled_words:
            # correction() may return None when no candidate is found; keep the original word then
            corrected_text.append(spell.correction(word) or word)
        else:
            corrected_text.append(word)

    return ' '.join(corrected_text)

In [31]:
df['reviews'] = df['reviews'].apply(correct_spellings)

In [32]:
# Get the first five rows
df['reviews'].head()

0    many time payment gateways issue get refund su...
1    language barrier also one major issue mostly d...
2                      spillage bad package miss items
3    order ufc get exchange someone else fault deli...
4    feel wiggy good interface users delivery time ...
Name: reviews, dtype: object

In [33]:
# Display full strings
with pd.option_context('display.max_colwidth', None):
  display(df['reviews'])

0                                          many time payment gateways issue get refund surcharge inconvenience
1      language barrier also one major issue mostly delivery boys familiar canada create problem address issue
2                                                                              spillage bad package miss items
3                                                       order ufc get exchange someone else fault delivery boy
4        feel wiggy good interface users delivery time place bangalore take wiggy less tomato order restaurant
                                                        ...                                                   
166                                                                       many bad experience respect delivery
167                                                        love price offer offer wiggy usually rate four plus
168                                                                             good price offer fast delivery
1
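
The cleaned reviews above contain tokens such as `wiggy` and `tomato`, which may be brand names that were altered during spell correction (this is an assumption, since the raw reviews are not fully shown). One way to guard against this is to skip correction for a list of protected words. The sketch below is hypothetical: `correct_with_protected` is an illustrative name, and `correct` stands for any word-level correction function, for example `spell.correction`.

```python
from typing import Callable, Iterable, Optional


def correct_with_protected(text: str,
                           correct: Callable[[str], Optional[str]],
                           protected: Iterable[str]) -> str:
    """Spell-correct each word, leaving protected words untouched."""
    protected_set = {word.lower() for word in protected}
    corrected = []
    for word in text.split():
        if word.lower() in protected_set:
            corrected.append(word)                   # never touch protected words
        else:
            corrected.append(correct(word) or word)  # fall back if no correction is found
    return ' '.join(corrected)


# Toy corrector standing in for a real spell checker (illustration only)
toy_fixes = {"delivry": "delivery", "swiggy": "wiggy"}


def toy_correct(word: str) -> Optional[str]:
    return toy_fixes.get(word)  # returns None for words it cannot fix


result = correct_with_protected("swiggy delivry late", toy_correct, protected=["swiggy"])
print(result)  # swiggy delivery late
```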

In [34]:
# Make a copy of the dataframe
df2 = df.copy(deep = True)

In [35]:
# Drop columns that are not needed
df2 = df2[['reviews', 'output']]

In [36]:
# Make another copy of the dataframe
df3 = df.copy(deep = True)

In [37]:
# Find the length of strings 
df['reviews_length'] = df['reviews'].apply(len)

In [38]:
# Generate descriptive statistics 
df['reviews_length'].describe()

count    171.000000
mean      39.561404
std       20.822514
min        0.000000
25%       27.000000
50%       36.000000
75%       48.500000
max      136.000000
Name: reviews_length, dtype: float64

**Testing for GPU**

In [39]:
import torch

In [40]:
torch.cuda.is_available()

True

In [41]:
device = torch.cuda.current_device() if torch.cuda.is_available() else -1

In [42]:
print(device)

0


In [43]:
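# Device handle for PyTorch tensors; falls back to the CPU when no GPU is available
torch_device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')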

**Do a dry run**

To see how the translation model works, we’ll do a quick dry run.

In [44]:
from transformers import pipeline

In [45]:
task = 'translation'
en_nl_translation_model = 'Helsinki-NLP/opus-mt-en-nl'
translator = pipeline(task, en_nl_translation_model, device = device)

In [46]:
# Translate from English to Dutch
translator('many time payment gateways issue get refund surcharge inconvenience')[0]['translation_text']

'veel tijd betaling gateways probleem krijgen restitutie toeslag ongemak'

In [47]:
task = 'translation'
nl_en_translation_model = 'Helsinki-NLP/opus-mt-nl-en'
translator = pipeline(task, nl_en_translation_model, device = device)

In [48]:
# Translate from Dutch to English
translator('veel tijd betaling gateways probleem krijgen restitutie toeslag ongemak')[0]['translation_text']

'lot of time payment gateways problem getting refund surcharge discomfort'

**Translation model with Helsinki-NLP**

In [49]:
task = 'translation'
en_nl_translation_model = 'Helsinki-NLP/opus-mt-en-nl'
translator = pipeline(task, en_nl_translation_model, device = device)

In [50]:
# Translate from English to Dutch
df3['dutch_translations'] = df3['reviews'].apply(lambda x: translator(x)[0]['translation_text'])

In [51]:
df3.head()

Unnamed: 0,reviews,output,dutch_translations
0,many time payment gateways issue get refund su...,yes,veel tijd betaling gateways probleem krijgen r...
1,language barrier also one major issue mostly d...,yes,taalbarrière ook een groot probleem vooral lev...
2,spillage bad package miss items,yes,Slechte pakket missen items morsen
3,order ufc get exchange someone else fault deli...,yes,bestelling ufc krijgen ruil iemand anders fout...
4,feel wiggy good interface users delivery time ...,yes,feel wiggy goede interface gebruikers levertij...


In [52]:
task = 'translation'
nl_en_translation_model = 'Helsinki-NLP/opus-mt-nl-en'
translator = pipeline(task, nl_en_translation_model, device = device)

In [53]:
# Translate from Dutch to English
df3['english_translations'] = df3['dutch_translations'].apply(lambda x: translator(x)[0]['translation_text'])

In [54]:
df3.head()

Unnamed: 0,reviews,output,dutch_translations,english_translations
0,many time payment gateways issue get refund su...,yes,veel tijd betaling gateways probleem krijgen r...,lot of time payment gateways problem getting r...
1,language barrier also one major issue mostly d...,yes,taalbarrière ook een groot probleem vooral lev...,language barrier also create a big problem esp...
2,spillage bad package miss items,yes,Slechte pakket missen items morsen,Bad package missing items spilling
3,order ufc get exchange someone else fault deli...,yes,bestelling ufc krijgen ruil iemand anders fout...,order ufc get exchange someone else error deli...
4,feel wiggy good interface users delivery time ...,yes,feel wiggy goede interface gebruikers levertij...,feel wiggy good interface users delivery time ...


**Balancing the dataset**

In [55]:
# Extract no values
minority_class = df3.loc[df3['output'] == 'no']

In [56]:
# Checking the distribution of classes
minority_class['output'].value_counts()

no    59
Name: output, dtype: int64

In [57]:
# Drop the original reviews column (reassigning instead of using inplace avoids a SettingWithCopyWarning)
minority_class = minority_class.drop('reviews', axis = 1)


In [58]:
# Rename column header
minority_class = minority_class.rename(columns = {'english_translations': 'reviews'})

In [59]:
# Drop columns that are not needed
minority_class = minority_class[['reviews', 'output']]

In [60]:
# Create a list 
frames = [df2, minority_class]

In [61]:
# Concatenating df2 and minority_class along rows
df_final = pd.concat(frames)
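
The cells above append the back-translated minority-class reviews to the cleaned data. The same idea is sketched below on plain Python tuples so that it runs without pandas. `augment_minority` is a hypothetical helper, and unlike the notebook it also skips back-translations that are identical to the original text, since those would only add exact duplicates. The sample rows are made up.

```python
from collections import Counter
from typing import List, Tuple

Row = Tuple[str, str]  # (review text, label)


def augment_minority(rows: List[Row],
                     back_translated: List[str],
                     minority_label: str) -> List[Row]:
    """Append back-translated copies of minority-class rows to the dataset."""
    extra = [
        (new_text, label)
        for (text, label), new_text in zip(rows, back_translated)
        if label == minority_label and new_text != text  # skip unchanged round trips
    ]
    return rows + extra


rows = [("good price offer", "yes"), ("order arrive late", "no"), ("bad food", "no")]
back = ["good price offer", "order arrived late", "bad food"]

augmented = augment_minority(rows, back, minority_label="no")
counts = Counter(label for _, label in augmented)
print(counts["yes"], counts["no"])  # 1 3
```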

**Conclusion**

In [62]:
# Examine the data
df_final.head()

Unnamed: 0,reviews,output
0,many time payment gateways issue get refund su...,yes
1,language barrier also one major issue mostly d...,yes
2,spillage bad package miss items,yes
3,order ufc get exchange someone else fault deli...,yes
4,feel wiggy good interface users delivery time ...,yes


In [63]:
# Overview of all variables, their datatypes
df_final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 230 entries, 0 to 170
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   reviews  230 non-null    object
 1   output   230 non-null    object
dtypes: object(2)
memory usage: 5.4+ KB


In [64]:
# Checking the distribution of classes
df_final['output'].value_counts()

no     118
yes    112
Name: output, dtype: int64

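After adding the back-translated reviews, the two classes are roughly balanced: 118 reviews are labeled no and 112 are labeled yes.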