# order_review_df

This file documents the transformation steps that were applied to clean the order reviews data.
The steps are as follows:

1. Explore the dataset using shape, info(), describe()
    - The dataset has 100,000 rows and 7 columns
    - Column names: review_id, order_id, review_score, review_comment_title, review_comment_message, review_creation_date, review_answer_timestamp
    - All columns are of object datatype except review_score, which is int64
    - review_score has a mean of 4.07, a minimum of 1, a maximum of 5, a 25th percentile of 4, a 50th percentile of 5 and a 75th percentile of 5
    - No null values except in review_comment_title and review_comment_message, which are written in Brazilian Portuguese

2. Convert review_creation_date and review_answer_timestamp to datetime format

3. Create a translation function using GoogleTranslator from the deep_translator library

4. Create a new column called **cleaned_review_comment**, which removes special characters like \r \n from the column **review_comment_message**

5. Create 2 new columns: **translated_review_comment**, which is translated from **cleaned_review_comment**,
    and **translated_comment_title**, which is translated from **review_comment_title**

6. Export the dataframe to a csv file, **cleaned_orders_reviews.csv**, in the clean_data folder

### Primary key and Foreign Key

- Primary Key: review_id
- Foreign Key: order_id, which references the orders dataset
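
The key roles above can be sanity-checked before loading the data into a database. The sketch below is a minimal illustration on toy data, not part of the original notebook: `reviews_df`, `orders_df` and their values are hypothetical stand-ins for the real tables.

```python
import pandas as pd

# Toy stand-ins for the real tables; names and values are hypothetical.
reviews_df = pd.DataFrame({
    "review_id": ["r1", "r2", "r2", "r3"],
    "order_id": ["o1", "o2", "o3", "o9"],
})
orders_df = pd.DataFrame({"order_id": ["o1", "o2", "o3"]})

# Primary key check: every review_id should appear exactly once.
pk_is_unique = reviews_df["review_id"].is_unique
repeated_ids = reviews_df.loc[
    reviews_df["review_id"].duplicated(keep=False), "review_id"
].unique().tolist()

# Foreign key check: every order_id should exist in the orders table.
orphan_mask = ~reviews_df["order_id"].isin(orders_df["order_id"])
orphan_orders = reviews_df.loc[orphan_mask, "order_id"].tolist()

print("review_id unique:", pk_is_unique)
print("repeated review_id values:", repeated_ids)
print("order_id values missing from orders:", orphan_orders)
```

If the first check fails on the real data, review_id alone may not be usable as a primary key until the repeated values are resolved.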

In [2]:
import pandas as pd

orders_reviews_df = pd.read_csv("data/olist_order_reviews_dataset.csv")

orders_reviews_df.head()

Unnamed: 0,review_id,order_id,review_score,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp
0,7bc2406110b926393aa56f80a40eba40,73fc7af87114b39712e6da79b0a377eb,4,,,2018-01-18 00:00:00,2018-01-18 21:46:59
1,80e641a11e56f04c1ad469d5645fdfde,a548910a1c6147796b98fdf73dbeba33,5,,,2018-03-10 00:00:00,2018-03-11 03:05:13
2,228ce5500dc1d8e020d8d1322874b6f0,f9e4b658b201a9f2ecdecbb34bed034b,5,,,2018-02-17 00:00:00,2018-02-18 14:36:24
3,e64fb393e7b32834bb789ff8bb30750e,658677c97b385a9be170737859d3511b,5,,Recebi bem antes do prazo estipulado.,2017-04-21 00:00:00,2017-04-21 22:02:06
4,f7c4243c7fe1938f181bec41a392bdeb,8e6bfb81e283fa7e4f11123a3fb894f1,5,,Parabéns lojas lannister adorei comprar pela I...,2018-03-01 00:00:00,2018-03-02 10:26:53


In [3]:

print(orders_reviews_df.shape) # check the number of rows and columns
print('\n')
print(orders_reviews_df.info()) # check entry count, column names, non-null counts and dtypes
print('\n')
print(orders_reviews_df.describe()) # check the dataframe's statistical summary

(100000, 7)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 7 columns):
 #   Column                   Non-Null Count   Dtype 
---  ------                   --------------   ----- 
 0   review_id                100000 non-null  object
 1   order_id                 100000 non-null  object
 2   review_score             100000 non-null  int64 
 3   review_comment_title     11715 non-null   object
 4   review_comment_message   41753 non-null   object
 5   review_creation_date     100000 non-null  object
 6   review_answer_timestamp  100000 non-null  object
dtypes: int64(1), object(6)
memory usage: 5.3+ MB
None


        review_score
count  100000.000000
mean        4.070890
std         1.359663
min         1.000000
25%         4.000000
50%         5.000000
75%         5.000000
max         5.000000
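
The null counts reported by info() can also be tabulated directly. This is a small sketch on a toy frame; `sample_df` and its values are made up for illustration, and only the column names are taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with the same column names as the review dataset; values are made up.
sample_df = pd.DataFrame({
    "review_score": [5, 4, 1, 5],
    "review_comment_title": [np.nan, "Recomendo", np.nan, np.nan],
    "review_comment_message": ["Muito bom", np.nan, np.nan, "Adorei"],
})

# Count nulls per column and express them as a share of all rows.
null_counts = sample_df.isna().sum()
null_share = sample_df.isna().mean().round(2)

summary = pd.DataFrame({"null_count": null_counts, "null_share": null_share})
print(summary)
```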


### Amending the review_creation_date and review_answer_timestamp to datetime format

In [4]:
# Convert review_creation_date to datetime
orders_reviews_df['review_creation_date'] = pd.to_datetime(orders_reviews_df['review_creation_date'])

# Convert review_answer_timestamp to datetime
orders_reviews_df['review_answer_timestamp'] = pd.to_datetime(orders_reviews_df['review_answer_timestamp'])

print(orders_reviews_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 7 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   review_id                100000 non-null  object        
 1   order_id                 100000 non-null  object        
 2   review_score             100000 non-null  int64         
 3   review_comment_title     11715 non-null   object        
 4   review_comment_message   41753 non-null   object        
 5   review_creation_date     100000 non-null  datetime64[ns]
 6   review_answer_timestamp  100000 non-null  datetime64[ns]
dtypes: datetime64[ns](2), int64(1), object(4)
memory usage: 5.3+ MB
None
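
Once both columns are datetime64, the gap between them can be computed by subtraction. The sketch below uses a toy frame (`sample_df` is hypothetical, and the third row deliberately contains an invalid date) and passes `errors="coerce"` so that unparseable values become NaT instead of raising. Reading the gap as a response time is an assumption, not something the dataset confirms.

```python
import pandas as pd

# Toy frame; the third creation date is deliberately invalid.
sample_df = pd.DataFrame({
    "review_creation_date": ["2018-01-18 00:00:00", "2018-03-10 00:00:00", "not a date"],
    "review_answer_timestamp": ["2018-01-18 21:46:59", "2018-03-11 03:05:13", "2018-03-12 10:00:00"],
})

# Parse with an explicit format; values that do not match become NaT.
for col in ["review_creation_date", "review_answer_timestamp"]:
    sample_df[col] = pd.to_datetime(sample_df[col], format="%Y-%m-%d %H:%M:%S", errors="coerce")

# Subtracting two datetime columns yields a timedelta column.
sample_df["response_time"] = sample_df["review_answer_timestamp"] - sample_df["review_creation_date"]
sample_df["response_hours"] = sample_df["response_time"].dt.total_seconds() / 3600

print(sample_df[["response_time", "response_hours"]])
```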


## Initial Observation
- The dataset has 100,000 rows and 7 columns
- No null values were found except in review_comment_title and review_comment_message, which hold the free-text comments provided by users. These 2 columns will need closer scrutiny with text analysis at a later stage
- The average rating from users is 4.07 across the ~100,000 reviews
- review_creation_date is the date the review was created
- Assumption: review_answer_timestamp is the time of the seller's reply, so it can be used to analyse how quickly sellers respond
- There are only 99,173 unique review_id values, so some review_id values are duplicated. Rows that are identical across all columns need to be dropped
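
The duplicate check described in the last observation could look roughly like the sketch below. It runs on a toy frame (`sample_df` and its values are hypothetical) and separates rows that are identical in every column from rows that only share a review_id.

```python
import pandas as pd

# Toy frame: rows 0 and 1 are fully identical, rows 2 and 3 only share a review_id.
sample_df = pd.DataFrame({
    "review_id": ["r1", "r1", "r2", "r2", "r3"],
    "order_id": ["o1", "o1", "o2", "o3", "o4"],
    "review_score": [5, 5, 4, 4, 1],
})

n_unique_ids = sample_df["review_id"].nunique()
full_row_dupes = int(sample_df.duplicated().sum())

# Drop only rows that are identical across all columns.
deduped_df = sample_df.drop_duplicates().reset_index(drop=True)

# Rows that still share a review_id after the drop need a separate decision.
remaining_repeats = deduped_df[deduped_df["review_id"].duplicated(keep=False)]

print(n_unique_ids, full_row_dupes, len(deduped_df))
print(remaining_repeats)
```

Rows left in `remaining_repeats` point at different orders, so dropping full-row duplicates alone would not make review_id unique in this toy case.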

### Creating a function called translate_text

In [5]:
from deep_translator import GoogleTranslator
import pandas as pd
import numpy as np

# Function to translate text
def translate_text(text, dest_lang='en'):
    '''
    Translate text from Portuguese if it is not null; otherwise return it unchanged.
    Parameters:
        text: the text to translate
        dest_lang: target language code, English ('en') by default
    '''
    if pd.notnull(text):
        translation = GoogleTranslator(source='pt', target=dest_lang).translate(text)
        return translation
    else:
        return text
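
Because many reviews repeat the same short text, one possible way to reduce the number of translation requests is to translate each distinct value once and map the results back. The sketch below is an assumption-laden illustration, not the notebook's method: `translate_unique` and `fake_translate` are hypothetical helpers, and the fake translator stands in for a real one so the example runs offline.

```python
import numpy as np
import pandas as pd

def translate_unique(series, translate_fn):
    """Translate each distinct non-null value once and map the results back."""
    unique_texts = series.dropna().unique()
    lookup = {text: translate_fn(text) for text in unique_texts}
    # Values without an entry in the lookup (including nulls) stay null.
    return series.map(lookup)

# Offline stand-in for a real translator; the dictionary is made up.
fake_dictionary = {"Muito bom": "Very good", "Recomendo": "I recommend"}
calls = []

def fake_translate(text):
    calls.append(text)
    return fake_dictionary.get(text, text)

titles = pd.Series(["Muito bom", np.nan, "Recomendo", "Muito bom"])
translated = translate_unique(titles, fake_translate)

print(translated.tolist())
print("translator calls:", len(calls))
```

In the notebook, the `translate_text` function defined above could be passed as `translate_fn` in place of the fake translator.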



### Cleaning the dataframe column from special characters like \r \n

In [6]:
# Function to remove special characters
def remove_special_characters(text):
    '''Remove carriage returns and newlines; null values are returned unchanged.'''
    if pd.notnull(text):
        cleaned_text = text.replace('\r', '').replace('\n', '')
        return cleaned_text
    else:
        return text


# Strip \r and \n from review_comment_message and store the result in a new column
orders_reviews_df['cleaned_review_comment'] = orders_reviews_df['review_comment_message'].apply(remove_special_characters)

orders_reviews_df.head()

Unnamed: 0,review_id,order_id,review_score,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp,cleaned_review_comment
0,7bc2406110b926393aa56f80a40eba40,73fc7af87114b39712e6da79b0a377eb,4,,,2018-01-18,2018-01-18 21:46:59,
1,80e641a11e56f04c1ad469d5645fdfde,a548910a1c6147796b98fdf73dbeba33,5,,,2018-03-10,2018-03-11 03:05:13,
2,228ce5500dc1d8e020d8d1322874b6f0,f9e4b658b201a9f2ecdecbb34bed034b,5,,,2018-02-17,2018-02-18 14:36:24,
3,e64fb393e7b32834bb789ff8bb30750e,658677c97b385a9be170737859d3511b,5,,Recebi bem antes do prazo estipulado.,2017-04-21,2017-04-21 22:02:06,Recebi bem antes do prazo estipulado.
4,f7c4243c7fe1938f181bec41a392bdeb,8e6bfb81e283fa7e4f11123a3fb894f1,5,,Parabéns lojas lannister adorei comprar pela I...,2018-03-01,2018-03-02 10:26:53,Parabéns lojas lannister adorei comprar pela I...
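
Removing \r and \n outright joins the words on either side of a line break. As an alternative sketch, and not what the cell above does, the line breaks can be replaced with a single space using the vectorized string methods. The sample comments are made up.

```python
import numpy as np
import pandas as pd

# Made-up comments containing Windows and Unix line breaks.
comments = pd.Series(["Produto bom.\r\nEntrega rápida", np.nan, "Chegou\n\nantes do prazo"])

cleaned = (
    comments
    .str.replace(r"[\r\n]+", " ", regex=True)   # turn each run of line breaks into one space
    .str.replace(r"\s{2,}", " ", regex=True)    # collapse any remaining repeated whitespace
    .str.strip()                                # trim leading and trailing whitespace
)

print(cleaned.tolist())
```

Null values pass through the `.str` methods unchanged, so no explicit null check is needed.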


### Translation

In [8]:
import time

batch_size = 20000  # row index at which the next progress message is printed
total_rows = len(orders_reviews_df)

for idx, row in orders_reviews_df.iterrows():
    try:
        if idx >= batch_size:
            # Report progress at each batch boundary; the row itself is still translated below
            print(f"Translation complete for {idx} rows out of {total_rows}")
            # time.sleep(30)  # optionally pause for 30 seconds between batches
            batch_size += 10000

        # Translate the cleaned review comment for this row
        translated_comment = translate_text(row['cleaned_review_comment'])
        orders_reviews_df.at[idx, 'translated_review_comment'] = translated_comment

        # Translate the review title for this row
        translated_title = translate_text(row['review_comment_title'])
        orders_reviews_df.loc[idx, 'translated_comment_title'] = translated_title
    except IndexError:
        print("Warning: out of bounds error occurred")

print("All Translation complete")

orders_reviews_df.head()

Translation complete for 20000 rows out of 100000
Translation complete for 30000 rows out of 100000
Translation complete for 40000 rows out of 100000
Translation complete for 50000 rows out of 100000
Translation complete for 60000 rows out of 100000
Translation complete for 70000 rows out of 100000
Translation complete for 80000 rows out of 100000
Translation complete for 90000 rows out of 100000
All Translation complete


Unnamed: 0,review_id,order_id,review_score,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp,cleaned_review_comment,translated_review_comment,translated_comment_title
0,7bc2406110b926393aa56f80a40eba40,73fc7af87114b39712e6da79b0a377eb,4,,,2018-01-18,2018-01-18 21:46:59,,,
1,80e641a11e56f04c1ad469d5645fdfde,a548910a1c6147796b98fdf73dbeba33,5,,,2018-03-10,2018-03-11 03:05:13,,,
2,228ce5500dc1d8e020d8d1322874b6f0,f9e4b658b201a9f2ecdecbb34bed034b,5,,,2018-02-17,2018-02-18 14:36:24,,,
3,e64fb393e7b32834bb789ff8bb30750e,658677c97b385a9be170737859d3511b,5,,Recebi bem antes do prazo estipulado.,2017-04-21,2017-04-21 22:02:06,Recebi bem antes do prazo estipulado.,I received it well before the stipulated deadl...,
4,f7c4243c7fe1938f181bec41a392bdeb,8e6bfb81e283fa7e4f11123a3fb894f1,5,,Parabéns lojas lannister adorei comprar pela I...,2018-03-01,2018-03-02 10:26:53,Parabéns lojas lannister adorei comprar pela I...,Congratulations lannister stores I loved shopp...,


In [9]:
orders_reviews_df.tail()

Unnamed: 0,review_id,order_id,review_score,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp,cleaned_review_comment,translated_review_comment,translated_comment_title
99995,f3897127253a9592a73be9bdfdf4ed7a,22ec9f0669f784db00fa86d035cf8602,5,,,2017-12-09,2017-12-11 20:06:42,,,
99996,b3de70c89b1510c4cd3d0649fd302472,55d4004744368f5571d1f590031933e4,5,,"Excelente mochila, entrega super rápida. Super...",2018-03-22,2018-03-23 09:10:43,"Excelente mochila, entrega super rápida. Super...","Excellent backpack, super fast delivery. I hig...",
99997,1adeb9d84d72fe4e337617733eb85149,7725825d039fc1f0ceb7635e3f7d9206,4,,,2018-07-01,2018-07-02 12:59:13,,,
99998,be360f18f5df1e0541061c87021e6d93,f8bd3f2000c28c5342fedeb5e50f2e75,1,,Solicitei a compra de uma capa de retrovisor c...,2017-12-15,2017-12-16 01:29:43,Solicitei a compra de uma capa de retrovisor c...,I requested the purchase of a Celtic/Prisma/Me...,
99999,efe49f1d6f951dd88b51e6ccd4cc548f,90531360ecb1eec2a1fbb265a0db0508,1,,"meu produto chegou e ja tenho que devolver, po...",2017-07-03,2017-07-03 21:01:49,"meu produto chegou e ja tenho que devolver, po...",My product arrived and I already have to retur...,


## If your code stopped during translation
Use the code below, setting the batch size and the start of the range to the row index where translation stopped.

In [None]:
# batch_size = 99940  # row index where translation stopped
# num_rows = len(orders_reviews_df)
# for idx in range(99940, num_rows):
#     if idx >= batch_size:
#         # Report progress; the row itself is still translated below
#         print(f"Translation complete for {idx} rows out of {num_rows}")
#         # time.sleep(20)  # optionally pause for 20 seconds between batches
#         batch_size += 10
#
#     # Translate the cleaned review comment for this row
#     translated_comment = translate_text(orders_reviews_df.iloc[idx]['cleaned_review_comment'])
#     orders_reviews_df.loc[idx, 'translated_review_comment'] = translated_comment
#
#     # Translate the review title for this row
#     translated_title = translate_text(orders_reviews_df.iloc[idx]['review_comment_title'])
#     orders_reviews_df.loc[idx, 'translated_comment_title'] = translated_title
#
# print(f"Translation complete for {num_rows} rows out of {num_rows}")


Translation complete for 99940 rows out of 100000
Translation complete for 99950 rows out of 100000
Translation complete for 99960 rows out of 100000
Translation complete for 99970 rows out of 100000
Translation complete for 99980 rows out of 100000
Translation complete for 99990 rows out of 100000
Translation complete for 100000 rows out of 100000
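
Instead of tracking the row index by hand, a resume step could also select the rows that have source text but no translation yet. The sketch below illustrates this under stated assumptions: `fill_missing_translations` and `fake_translate` are hypothetical helpers, the sample data is made up, and the fake translator stands in for the real one so the example runs offline.

```python
import numpy as np
import pandas as pd

def fill_missing_translations(df, source_col, target_col, translate_fn):
    """Translate only rows that have source text but no translation yet."""
    if target_col not in df.columns:
        df[target_col] = pd.Series(index=df.index, dtype="object")
    pending = df[source_col].notna() & df[target_col].isna()
    for idx in df.index[pending]:
        df.at[idx, target_col] = translate_fn(df.at[idx, source_col])
    return int(pending.sum())

# Offline stand-in for a real translator; the dictionary is made up.
fake_dictionary = {"Recomendo": "I recommend", "Adorei": "I loved it"}

def fake_translate(text):
    return fake_dictionary.get(text, text)

# The first row was already translated before the run stopped.
sample_df = pd.DataFrame({
    "cleaned_review_comment": ["Muito bom", np.nan, "Recomendo", "Adorei"],
    "translated_review_comment": ["Very good", np.nan, np.nan, np.nan],
})

n_translated = fill_missing_translations(
    sample_df, "cleaned_review_comment", "translated_review_comment", fake_translate
)
print(n_translated)
print(sample_df)
```

Running the function again after a further interruption only touches the rows that are still missing, so it can be repeated until nothing is pending.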


In [10]:
orders_reviews_df.tail()

orders_reviews_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 10 columns):
 #   Column                     Non-Null Count   Dtype         
---  ------                     --------------   -----         
 0   review_id                  100000 non-null  object        
 1   order_id                   100000 non-null  object        
 2   review_score               100000 non-null  int64         
 3   review_comment_title       11715 non-null   object        
 4   review_comment_message     41753 non-null   object        
 5   review_creation_date       100000 non-null  datetime64[ns]
 6   review_answer_timestamp    100000 non-null  datetime64[ns]
 7   cleaned_review_comment     41753 non-null   object        
 8   translated_review_comment  41630 non-null   object        
 9   translated_comment_title   11628 non-null   object        
dtypes: datetime64[ns](2), int64(1), object(7)
memory usage: 7.6+ MB


## Export to a csv file for later upload to PostgreSQL

In [11]:
# Save DataFrame to a CSV file
orders_reviews_df.to_csv("clean_data/cleaned_orders_reviews.csv", index=False) 
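
A CSV file stores datetimes as plain text, so the datetime dtypes are lost on export and have to be restored when the file is read back. The sketch below round-trips a toy frame through an in-memory buffer rather than a real file; `sample_df` and its values are made up.

```python
import io

import pandas as pd

# Toy frame with two datetime columns, mirroring the cleaned dataset's date columns.
sample_df = pd.DataFrame({
    "review_id": ["r1", "r2"],
    "review_score": [5, 1],
    "review_creation_date": pd.to_datetime(["2018-01-18", "2018-03-10"]),
    "review_answer_timestamp": pd.to_datetime(["2018-01-18 21:46:59", "2018-03-11 03:05:13"]),
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
sample_df.to_csv(buffer, index=False)
buffer.seek(0)

# Ask read_csv to parse the date columns again on the way back in.
date_cols = ["review_creation_date", "review_answer_timestamp"]
reloaded_df = pd.read_csv(buffer, parse_dates=date_cols)

print(reloaded_df.dtypes)
```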