# Reviews Dataset

Contains the review scores for orders. In some cases it also includes a review written by the buyer.

## Initial column description

|**Column Title**|**review_id -> str** |**order_id -> str** |**review_score -> int** |**review_comment_title -> str** |**review_comment_message -> str** |**review_creation_date -> timestamp** |**review_answer_timestamp -> timestamp** |
|--|--|--|--|--|--|--|--|
|Description |Primary key |Order identifier |Order score from 1 to 5 |Review title |Review message from the user |Review creation timestamp |Review publication timestamp |
|Example |8670d52e15e00043ae7de4c01cc2fe06 |b9bf720beb4ab3728760088589c62129 |4 |recomendo |aparelho eficiente no site a marca do aparelho esta impresso como tresdesinfector e ao chegar esta com outro nomeatualizar com a marca correta uma vez que e o mesmo aparelho |2018-05-22 00:00:00 |2018-05-23 16:45:47 |

### Errors found

+ Some empty cells in review_comment_title and review_comment_message
+ Many special characters and emojis in review_comment_title and review_comment_message
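
Both problems can be counted before any cleaning is applied. The following is a minimal sketch that uses only the standard library and a small inline sample instead of the real CSV; the sample rows and the helper name `profile_text_columns` are assumptions made for illustration.

```python
import csv
import io

# Hypothetical inline sample; the real data lives in olist_order_reviews_dataset.csv
SAMPLE_CSV = """review_id,review_comment_title,review_comment_message
r1,,Produto excelente 👍🏽
r2,Nota 10,
r3,recomendo,aparelho eficiente
"""

TEXT_COLUMNS = ("review_comment_title", "review_comment_message")


def profile_text_columns(rows, columns=TEXT_COLUMNS):
    """Count empty cells and cells containing non-ASCII characters per column."""
    report = {col: {"empty": 0, "non_ascii": 0} for col in columns}
    for row in rows:
        for col in columns:
            value = row.get(col) or ""
            if not value.strip():
                report[col]["empty"] += 1
            elif not value.isascii():
                report[col]["non_ascii"] += 1
    return report


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
report = profile_text_columns(rows)
print(report)
```

Running the same function on the full file would give a rough idea of how many cells the preprocessing below has to touch.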

## Required libraries

In [1]:
# Makes it easy to work with CSV files
import pandas as pd

# Replaces special characters, e.g. ã with a or ç with c
from unidecode import unidecode

## Data Preprocessing

### Specific row

In [None]:


df_2 = _deepnote_execute_sql("""SELECT review_comment_title
FROM '/work/Data/NoClean/olist_order_reviews_dataset.csv'
WHERE order_id = '2d687102eef8e4949a9d2af49e8fa946'
""", 'SQL_DEEPNOTE_DATAFRAME_SQL')
df_2

Unnamed: 0,review_comment_title
0,


In [None]:


df_4 = _deepnote_execute_sql("""SELECT *
FROM '/work/Data/NoClean/olist_geolocation_dataset.csv'
WHERE geolocation_zip_code_prefix = 64120
""", 'SQL_DEEPNOTE_DATAFRAME_SQL')
df_4

Unnamed: 0,geolocation_zip_code_prefix,geolocation_lat,geolocation_lng,geolocation_city,geolocation_state
0,64120,-4.584783,-42.855798,uniao,PI
1,64120,-4.584783,-42.855798,uniao,PI
2,64120,-4.590553,-42.863525,uniao,PI
3,64120,-4.587877,-42.85696,união,PI
4,64120,-4.594149,-42.866918,uniao,PI
5,64120,-4.591259,-42.867454,uniao,PI
6,64120,-4.588754,-42.865009,uniao,PI
7,64120,-4.577519,-42.86467,uniao,PI
8,64120,-4.587079,-42.862017,uniao,PI
9,64120,-4.585312,-42.868971,união,PI


## Column Revision

In [None]:


df_1 = _deepnote_execute_sql("""SELECT review_answer_timestamp
FROM '/work/Data/Clean/olist_order_reviews_dataset.csv'
GROUP BY review_answer_timestamp
ORDER BY review_answer_timestamp
""", 'SQL_DEEPNOTE_DATAFRAME_SQL')
df_1

Unnamed: 0,review_answer_timestamp
0,2016-10-07 18:32:28
1,2016-10-11 14:31:29
2,2016-10-16 03:20:17
3,2016-10-16 15:45:11
4,2016-10-17 21:02:49
...,...
98243,2018-10-24 16:27:36
98244,2018-10-24 18:26:25
98245,2018-10-24 21:34:38
98246,2018-10-26 21:36:41


### Dictionaries for replacements

Keys are the misspelled or unwanted strings to search for; values are their replacements. There are two dictionaries:

**characters** (`diccionary` in the code): contains replacements for single characters:
> For example:
> ~~~
> input: xique-xique
> output: xique xique
> ~~~
> ~~~
> input: alta floresta d'oeste
> output: alta floresta d oeste
> ~~~

**blacklist** (`black_list` in the code): contains replacements for whole strings, matched exactly
> For example:
> ~~~
> input: alta floresta do oeste
> output: alta floresta d oeste
> ~~~
> ~~~
> input: 4º centenario
> output: quarto centenario
> ~~~
> ~~~
> input: xangri-lá
> output: xangri la
> ~~~

For other special characters such as "â" or "ç", [Unidecode](https://pypi.org/project/Unidecode/) substitutes the closest ASCII characters.
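
The three steps described above (exact blacklist match, character replacement, ASCII folding) can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the two dictionaries are shortened, hypothetical versions, and `to_ascii` is a standard-library stand-in for Unidecode that only strips diacritics, so it will not transliterate every character the way Unidecode does.

```python
import unicodedata

# Hypothetical, shortened versions of the two dictionaries described above
characters = {"-": " ", "'": " "}
blacklist = {
    "alta floresta do oeste": "alta floresta d oeste",
    "4º centenario": "quarto centenario",
}


def to_ascii(text):
    """Stand-in for unidecode: drop diacritics via NFKD decomposition."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def clean(text):
    # 1. An exact match against the blacklist takes precedence
    if text in blacklist:
        return blacklist[text]
    # 2. Otherwise replace single characters one by one
    for old, new in characters.items():
        text = text.replace(old, new)
    # 3. Fold remaining accented letters to ASCII and lowercase the result
    return to_ascii(text).lower()


for raw in ["xique-xique", "alta floresta d'oeste", "4º centenario", "xangri-lá"]:
    print(f"{raw!r} -> {clean(raw)!r}")
```

Checking the blacklist first means a whole-string rule such as `4º centenario` is applied before the digit and symbol replacements can alter the string.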

In [2]:
diccionary = {
    "-": " ",
    "'": " ",
    "5": "cinco",
    "!": "",
    ".": "",
    ",": "",
    '"': "",
    "*": "",
    "#": "",
    "+": "",
    "0": "zero",
    "4": "quatro",
    "1": "um",
    "%": "",
    "2": "dois",
    "ª": "",
    "3": "tres",
    "🌟": "",
    "6": "seis",
    "7": "sete",
    "8": "oito",
    "9": "nove",
    ":": "",
    ")": "",
    "?": "",
    "😍": "",
    "_": " ",
    "🤗": "",
    "/": "",
    "👏": "",
    "🚚": "",
    "👍🏽": "",
    "👍": "",
    "👎": "",
    "👏🏻": "",
    "$": "",
    "😀": "",
    "👍🏼": "",
    "💥": "",
    "🔟": "dez"
}

black_list = {
    "10": "dez",
    "***** recomendo 25/07/20": "recomendo",
    "00": "zero",
    "01": "um",
    "02673583082": "",
    "05": "cinco",
    "05/06/2018": "",
    "10 super recomendo": "dez super recomendo",
    "10 top": "dez top",
    "10! Gostei muito!": "dez gostei muito",
    "100": "cem",
    "100% confiável": "cem confiavel",
    "100% original": "cem original",
    "100% recomendado": "cem recomendado",
    "100% satisfeito": "cem satisfeito",
    "1000": "mil",
    "1000000": "um milhao",
    "100000000": "um milhao",
    "10000000000000": "um milhao",
    "1o": "um",
    "353454t": "",
    "40": "quarenta",
    "50% pessima": "cinquenta pessima",
    "99": "noventa nove",
    "Adaptador USB 2.0 Wireles": "adaptador usb wireles",
    "Adaptador de 12 v p/ 110v": "adaptador de tensao",
    "Bomba 12V": "bomba",
    "Cartão SD 64 gb": "cartao sd gb",
    "Comprei 02 e só recebi 01": "comprei dois e so recebi um",
    "Comprei 02 recebi 01 lona": "comprei dois e so recebi um",
    "Comprei 02, recebi 01": "comprei dois e so recebi um",
    "Defeito menos de 30 Dias": "defeito menos de trinta dias",
    "Entrega 100%": "entrega",
    "Entrega nota 10": "entrega nota dez",
    "Entrega nota 10.": "entrega nota dez",
    "Fio elétrico 6mm": "fio eletrico",
    "Kit de rodas aro 29": "kit de rodas aro",
    "LM 327": "lm ci",
    "Loja nota 10": "loja nota dez",
    "Luminaria sobrepor led 18": "luminaria sobrepor led",
    "Mesmo site com 02 FRETES!": "mesmo site com dois fretes",
    "Mochila Denlex DL0011.": "mochila denlex",
    "Muito bom nota 10": "muito bom nota dez",
    "Muito bom nota 10 para am": "muito bom nota dez para am",
    "Nita 10": "nota dez",
    "Nota 10": "nota dez",
    "Nota 10 ": "nota dez",
    "Nota 10 a loja": "nota dez a loja",
    "Nota 10!!!": "nota dez",
    "Nota 10.": "nota dez",
    "Nota 1000": "nota mil",
    "Nota 1000 ": "nota mil",
    "Nota10": "nota dez",
    "Pedi de 1/2 e veio 3/4": "pedido errado",
    "Pedido 02-671663252": "pedido",
    "Pedido com 10 toalhas": "pedido com dez toalhas",
    "Pedido: 01-68698482 de 21": "pedido",
    "Produto NÃO 100% igual.": "produto nao igual",
    "Produto levara 60 dias": "produto levara sessenta dias",
    "Produto não veio 250 ml": "produto nao veio",
    "Quase nota 10": "quase nota dez",
    "Recebimento nota 10": "recebimento nota dez",
    "Recomendo 100%": "recomendo",
    "Recomendo 100%.": "recomendo",
    "Recomendo mas não é 100%": "recomendo",
    "Solicitado 30X30 recebi 4": "solicitado",
    "Squeeze Cantil Inox 750ml": "squeeze cantil inox",
    "Super recomendado 17/05/": "super recomendado",
    "Super recomendo 16/07/201": "super recomendado",
    "TUDO 100%": "tudo certo",
    "Tesoura 5.5": "tesoura",
    "Top 10": "top dez",
    "Tudo perfeito nota 10.": "tudo perfeito nota dez",
    "Vcs sao 10": "vcs sao dez",
    "Ventilador Excelente...10": "ventilador excelente dez",
    "nota 10": "nota dez",
    "nota 10 bem rapido": "nota dez bem rapido",
    "nota 10000": "nota mil",
    "recomend 02-667550502 de ": "recomendo",
    "recomendo 100 por cento": "recomendo cem por cento",
    "recomendo 14/7/2018": "recomendo",
    "super recomendo 01092018": "super recomendo",
    "super recomendo-10": "super recomendo",
    "super recomendo15 /07/18 ": "super recomendo",
    "targaryen 10. lannister 0.": "targaryen dez lannister zero",
    "xx": "",
    "xxx": "",
    "xxxxxxxxxxx": ""
}
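
Long hand-maintained mappings like these are prone to repeated keys. In a Python dict literal a repeated key is not an error: the last value silently wins. The sketch below is one possible way to detect repeats by parsing the literal's source text with the standard `ast` module; the helper name `duplicate_keys` and the sample string are assumptions for illustration.

```python
import ast
from collections import Counter


def duplicate_keys(dict_source):
    """Return the constant keys that occur more than once in a dict literal."""
    node = ast.parse(dict_source, mode="eval").body
    if not isinstance(node, ast.Dict):
        raise ValueError("expected a dict literal")
    keys = [k.value for k in node.keys if isinstance(k, ast.Constant)]
    return [key for key, count in Counter(keys).items() if count > 1]


# Hypothetical excerpt in which "-" is mapped twice with different values
source = '{"-": " ", ",": "", "-": ""}'

print(duplicate_keys(source))
# The evaluated dict keeps only the last value given for "-"
print(ast.literal_eval(source))
```

This matters whenever two entries for the same key disagree, because only the later one takes effect.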

### Replace Function

In [3]:
def replace_without_black_list(word):
    # Keep missing values (NaN) as they are instead of turning them into the string "nan"
    if pd.isna(word):
        return word
    word = str(word)
    # Replace each unwanted character or digit with its substitute
    for old, new in diccionary.items():
        word = word.replace(old, new)
    # Lowercase, then transliterate the remaining characters to ASCII
    return unidecode(word.lower())

def replace(word):
    # An exact match in the blacklist takes precedence over character replacement
    if str(word) in black_list:
        return black_list[str(word)]
    return replace_without_black_list(word)

# Titles use the blacklist; free-text messages only get character replacement
reviews_dataset = pd.read_csv('../../data/raw/olist_order_reviews_dataset.csv')
reviews_dataset['review_comment_title'] = reviews_dataset['review_comment_title'].map(replace)
reviews_dataset['review_comment_message'] = reviews_dataset['review_comment_message'].map(replace_without_black_list)
reviews_dataset.to_csv('../../data/interim/order_reviews_dataset.csv', encoding='utf-8', index=False)

In [None]:


df_3 = _deepnote_execute_sql("""SELECT review_comment_title
FROM '/work/Data/NoClean/olist_order_reviews_dataset.csv'
GROUP BY review_comment_title
ORDER BY review_comment_title
""", 'SQL_DEEPNOTE_DATAFRAME_SQL')
df_3

Unnamed: 0,review_comment_title
0,
1,
2,10
3,4
4,5 estrelas
...,...
4523,👍🏼
4524,👍🏽
4525,👍👍👍👍👍
4526,💥💥💥💥💥


In [4]:
dict_replace = pd.read_csv('../../data/interim/unique_order_id.csv')
# Map each original order_id to its unique_id
replace_dict = dict(zip(dict_replace['order_id'], dict_replace['unique_id']))

def change_id(order_id):
    # Raises KeyError if order_id is missing from the mapping
    return replace_dict[order_id]

dataset = pd.read_csv('../../data/interim/order_reviews_dataset.csv')
dataset['order_id'] = dataset['order_id'].map(change_id)
print(dataset)
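
Because `change_id` indexes the dictionary directly, a single `order_id` that is absent from the mapping stops the whole run with a `KeyError`. A minimal sketch of a more tolerant lookup follows, using plain Python and made-up data: the integer values and the third, unknown ID are assumptions, since the real contents of `unique_order_id.csv` are not shown here.

```python
# Hypothetical lookup table standing in for unique_order_id.csv
id_lookup = {
    "b9bf720beb4ab3728760088589c62129": 1,
    "343146b92daf167255eed8e706a25d2d": 2,
}

order_ids = [
    "343146b92daf167255eed8e706a25d2d",
    "b9bf720beb4ab3728760088589c62129",
    "ffffffffffffffffffffffffffffffff",  # not present in the lookup
]

# dict.get returns None for unknown IDs instead of raising KeyError
mapped = [id_lookup.get(order_id) for order_id in order_ids]

# Collect unknown IDs so they can be inspected rather than silently dropped
missing = [order_id for order_id in order_ids if order_id not in id_lookup]

print(mapped)
print(missing)
```

Whether unknown IDs should fail loudly or be reported separately depends on how complete the mapping file is expected to be.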

## Final Column Description

|**Column Title**|**review_id -> str** |**order_id -> str** |**review_score -> int** |**review_comment_title -> str** |**review_comment_message -> str** |**review_creation_date -> timestamp** |**review_answer_timestamp -> timestamp** |
|--|--|--|--|--|--|--|--|
|Description |Primary key |Order identifier |Order score from 1 to 5 |Review title |Review message from the user |Review creation timestamp |Review publication timestamp |
|Before Preprocessing |419137599a91ca81d4c24e9c6832486a |343146b92daf167255eed8e706a25d2d |5 | |Produto excelente para conservação do console, contra poeira e tempo 👍🏽 |2017-07-18 00:00:00 |2017-07-18 21:07:20 |
|After Preprocessing |419137599a91ca81d4c24e9c6832486a |343146b92daf167255eed8e706a25d2d |5 | |produto excelente para conservacao do console contra poeira e tempo |2017-07-18 00:00:00 |2017-07-18 21:07:20 |