# Analysis of Reviews on Olist

🎯 Now that you are familiar with NLP, let's analyze the reviews of Olist.

👇 Run the following cell to load the reviews dataset.

In [1]:
import pandas as pd

url = "https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ml_olist_nlp_reviews.csv"
df = pd.read_csv(url, low_memory = False)

df.head()

Unnamed: 0.1,Unnamed: 0,review_id,length_review,review_score,order_id,product_category_name,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
0,0,7bc2406110b926393aa56f80a40eba40,0,4,73fc7af87114b39712e6da79b0a377eb,esporte_lazer,,,2018-01-18 00:00:00,2018-01-18 21:46:59,41dcb106f807e993532d446263290104,delivered,2018-01-11 15:30:49,2018-01-11 15:47:59,2018-01-12 21:57:22,2018-01-17 18:42:41,2018-02-02 00:00:00
1,1,80e641a11e56f04c1ad469d5645fdfde,0,5,a548910a1c6147796b98fdf73dbeba33,informatica_acessorios,,,2018-03-10 00:00:00,2018-03-11 03:05:13,8a2e7ef9053dea531e4dc76bd6d853e6,delivered,2018-02-28 12:25:19,2018-02-28 12:48:39,2018-03-02 19:08:15,2018-03-09 23:17:20,2018-03-14 00:00:00
2,2,228ce5500dc1d8e020d8d1322874b6f0,0,5,f9e4b658b201a9f2ecdecbb34bed034b,informatica_acessorios,,,2018-02-17 00:00:00,2018-02-18 14:36:24,e226dfed6544df5b7b87a48208690feb,delivered,2018-02-03 09:56:22,2018-02-03 10:33:41,2018-02-06 16:18:28,2018-02-16 17:28:48,2018-03-09 00:00:00
3,3,e64fb393e7b32834bb789ff8bb30750e,37,5,658677c97b385a9be170737859d3511b,ferramentas_jardim,,Recebi bem antes do prazo estipulado.,2017-04-21 00:00:00,2017-04-21 22:02:06,de6dff97e5f1ba84a3cd9a3bc97df5f6,delivered,2017-04-09 17:41:13,2017-04-09 17:55:19,2017-04-10 14:24:47,2017-04-20 09:08:35,2017-05-10 00:00:00
4,4,f7c4243c7fe1938f181bec41a392bdeb,100,5,8e6bfb81e283fa7e4f11123a3fb894f1,esporte_lazer,,Parabéns lojas lannister adorei comprar pela I...,2018-03-01 00:00:00,2018-03-02 10:26:53,5986b333ca0d44534a156a52a8e33a83,delivered,2018-02-10 10:59:03,2018-02-10 15:48:21,2018-02-15 19:36:14,2018-02-28 16:33:35,2018-03-09 00:00:00


In [133]:
df.review_score.unique()

array([4, 5, 1, 3, 2])

❓ **Question: Analyze the reviews to understand what could be causing the bad review scores** ❓

This challenge is less guided than the previous ones, but here are some questions to ask yourself:

- Are all the reviews relevant?
- What about combining the title and the body of a review?
- Which cleaning operations would you apply to the reviews?
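One way to approach the first question is to keep only the low-scoring reviews that actually contain some text. The sketch below uses a small made-up DataFrame with the same column names as the dataset. The threshold (a score of 1 or 2 counts as "bad") is an assumption, not something the dataset defines.

```python
import pandas as pd

# Toy stand-in for the Olist reviews (values are made up for illustration).
toy = pd.DataFrame({
    "review_score": [5, 1, 2, 4, 1],
    "review_comment_title": ["Ótimo", None, "Não chegou", None, None],
    "review_comment_message": ["Loja nota 10", "Produto veio errado", None, None, None],
})

# Assumption: a "bad" review is one scored 1 or 2.
is_bad = toy["review_score"] <= 2

# A review is only useful for text analysis if it has a title or a message.
has_text = toy[["review_comment_title", "review_comment_message"]].notna().any(axis=1)

bad_reviews = toy[is_bad & has_text]
print(bad_reviews)
```

The same two boolean masks could be applied to `df` before any cleaning, so that the later steps only see reviews that can explain a bad score.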

🇧🇷 Some Brazilian expressions and their translations:

- `produto errado` = wrong product
- `ainda nao` = not yet
- `nao entregue` = not delivered
- `nao veio` = did not come
- `nao gostei` = did not like it
- `produto defeito` = defective product
- `nao funciona` = not working
- `produto diferente` = different product
- `pessima qualidade` = poor quality
- `veio defeito` = arrived defective
- `veio faltando` = arrived with items missing
- `veio errado` = wrong item arrived
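Once the reviews are lowercased and stripped of accents, expressions like these can be counted directly. The following is a minimal sketch on made-up reviews. The helper `count_expressions` and the `EXPRESSIONS` dictionary are hypothetical names introduced for illustration, and plain substring matching is assumed to be good enough.

```python
from collections import Counter

# A few of the expressions listed above, mapped to an English label.
EXPRESSIONS = {
    "produto errado": "wrong product",
    "nao entregue": "not delivered",
    "nao veio": "did not come",
    "veio faltando": "items missing",
}

def count_expressions(reviews, expressions=EXPRESSIONS):
    """Count how many reviews contain each expression (substring match)."""
    counts = Counter()
    for review in reviews:
        for expression, label in expressions.items():
            if expression in review:
                counts[label] += 1
    return counts

# Made-up, already cleaned reviews.
toy_reviews = [
    "produto nao entregue ate agora",
    "veio faltando uma peca",
    "pedido nao entregue",
    "otimo produto",
]

counts = count_expressions(toy_reviews)
print(counts.most_common())
```

Substring matching misses variants such as `nao foi entregue`, so the counts are best read as a lower bound.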

In [62]:
# Keep only the two free-text columns, under shorter names
data = pd.DataFrame()
data['review_title'] = df['review_comment_title']
data['review_comment'] = df['review_comment_message']

In [63]:
data.review_title.dropna()

9                        recomendo
15                 Super recomendo
19         Não chegou meu produto 
22                           Ótimo
36                      Muito bom.
                   ...            
98622                 Nota máxima!
98627                            👍
98631           muito bom produto 
98632    Não foi entregue o pedido
98634               Foto enganosa 
Name: review_title, Length: 11486, dtype: object

In [64]:
# NaN + str is NaN, so dropna() keeps only reviews that have both a title and a body
data = pd.DataFrame((data.review_title + ' ' + data.review_comment).dropna())

In [71]:
data.rename(columns={0:'text'},inplace=True)
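Concatenating the two columns and then calling `dropna()` discards every review that has only a title or only a body. If those reviews should be kept, one possible alternative is to fill missing values with empty strings first. This is a sketch on made-up data, not the approach used in this notebook.

```python
import pandas as pd

# Toy data: one full review, one body only, one title only, one empty.
toy = pd.DataFrame({
    "review_comment_title": ["Ótimo", None, "Não chegou", None],
    "review_comment_message": ["Loja nota 10", "Produto veio errado", None, None],
})

# Replace missing values by empty strings so the concatenation never yields NaN.
title = toy["review_comment_title"].fillna("")
body = toy["review_comment_message"].fillna("")

# Join, trim the stray space left by an empty side, and drop fully empty reviews.
text = (title + " " + body).str.strip()
text = text[text != ""]

combined = text.to_frame(name="text")
print(combined)
```

Naming the column through `to_frame(name="text")` also avoids the separate rename step.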

## Cleaning

In [72]:
data

Unnamed: 0,text
9,recomendo aparelho eficiente. no site a marca ...
15,"Super recomendo Vendedor confiável, produto ok..."
19,Não chegou meu produto Péssimo
22,Ótimo Loja nota 10
36,Muito bom. Recebi exatamente o que esperava. A...
...,...
98622,"Nota máxima! Muito obrigado,\r\n\r\nExcelente ..."
98627,👍 Aprovado!
98631,muito bom produto Ficamos muito satisfeitos c...
98632,Não foi entregue o pedido Bom dia \r\nDas 6 un...


In [120]:
import string
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
import numpy as np

In [177]:
def translator(sentence: str) -> str:
    # Map accented characters to their unaccented equivalents.
    # Characters missing from the table (e.g. 'õ', 'â', 'ô') pass through unchanged.
    translator_alpha = {
        'ã': 'a',
        'á': 'a',
        'é': 'e',
        'í': 'i',
        'ó': 'o',
        'ú': 'u',
        'ç': 'c',
        'à': 'a',
        'ñ': 'n',
        'ê': 'e',
        'ò': 'o'
    }
    return ''.join(translator_alpha.get(char, char) for char in sentence)
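A hand-written mapping only covers the characters it lists. As an alternative sketch, the standard library's `unicodedata` module can strip every combining accent at once. The function name `strip_accents` is hypothetical and is not used elsewhere in this notebook.

```python
import unicodedata

def strip_accents(sentence: str) -> str:
    """Remove diacritics by decomposing characters and dropping combining marks."""
    # NFD splits 'ã' into 'a' followed by a combining tilde (category 'Mn').
    decomposed = unicodedata.normalize("NFD", sentence)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(strip_accents("não chegou, péssimo ônibus"))
```

Because it relies on Unicode decomposition rather than a fixed table, this also handles characters such as `õ`, `â` and `ü`.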

In [178]:
def cleaning(sentence: str) -> str:
    # Lowercase, trim whitespace, and drop digits
    sentence = sentence.lower().strip()
    sentence = ''.join(char for char in sentence if not char.isdigit())

    # Remove ASCII punctuation
    for k in string.punctuation:
        sentence = sentence.replace(k, '')

    tokens = word_tokenize(sentence)

    # Note: WordNet is an English lexicon, so lemmatization mostly leaves
    # Portuguese words unchanged (and can occasionally alter one)
    lemmatizer = WordNetLemmatizer()
    verb_lemmatized = [lemmatizer.lemmatize(word, pos='v') for word in tokens]
    noun_lemmatized = [lemmatizer.lemmatize(word, pos='n') for word in verb_lemmatized]

    # Strip accents last
    return translator(' '.join(noun_lemmatized))

In [179]:
data['text'] = data['text'].map(cleaning)

In [180]:
data

Unnamed: 0,text,important
9,recomendo aparelho eficiente no site a marca d...,
15,super recomendo vendedor confiavel produto ok ...,
19,nao chegou meu produto pessimo,
22,otimo loja nota,
36,muito bom recebi exatamente o que esperava a d...,
...,...,...
98622,nota maximum muito obrigado excelente atendime...,
98627,👍 aprovado,
98631,muito bom produto ficamos muito satisfeitos co...,
98632,nao foi entregue o pedido bom dia da unidades ...,not delivered


In [181]:
df['review_comment_message_and_title']=data['text']

In [182]:
df['review_comment_message_and_title'].dropna()

9        recomendo aparelho eficiente no site a marca d...
15       super recomendo vendedor confiavel produto ok ...
19                          nao chegou meu produto pessimo
22                                         otimo loja nota
36       muito bom recebi exatamente o que esperava a d...
                               ...                        
98622    nota maximum muito obrigado excelente atendime...
98627                                           👍 aprovado
98631    muito bom produto ficamos muito satisfeitos co...
98632    nao foi entregue o pedido bom dia da unidades ...
98634    foto enganosa foto muito diferente principalme...
Name: review_comment_message_and_title, Length: 9760, dtype: object

In [183]:
from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

In [184]:
new_df=df['review_comment_message_and_title'].dropna()

In [190]:
vectorizer = TfidfVectorizer()
vectorized_reviews = vectorizer.fit_transform(new_df)
# Convert the sparse matrix to a dense DataFrame so the vocabulary shows up as columns
vectorized_reviews = pd.DataFrame(
    vectorized_reviews.toarray(),
    columns=vectorizer.get_feature_names_out()
)

In [191]:
vectorized_reviews

Unnamed: 0,aa,aaa,aaprelho,ab,abaixada,abaixo,abajur,abaulada,abdominal,abencoe,...,zenildo,zero,ziper,zippo,zl,zuado,zupin,ômega,ônibus,ünica
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9755,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9756,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9757,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9758,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [194]:
model = LatentDirichletAllocation(n_components=11,max_iter=100)

model.fit(vectorized_reviews)

In [195]:
new_review =['muito bom recebi exatamente mais nao foi entregue']

In [196]:
model.transform(vectorizer.transform(new_review))



array([[0.02452295, 0.02450917, 0.0245156 , 0.02450918, 0.02450917,
        0.02451517, 0.75488206, 0.02450917, 0.02450917, 0.02450917,
        0.02450918]])

🏁 Congratulations! Instead of reading 90K+ reviews, you were able to detect the main reasons for dissatisfaction on Olist.

💾 Don't forget to `git add/commit/push`