# Analysis of Reviews on Olist

🎯 Now that you are familiar with NLP, let's analyze the reviews of Olist.

👇 Run the following cell to load the reviews dataset (uncomment the first line if you need to install `unidecode`)

In [256]:
#!pip install -q unidecode

import pandas as pd

url = "https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ml_olist_nlp_reviews.csv"
df = pd.read_csv(url, low_memory = False)

df.head()

Unnamed: 0.1,Unnamed: 0,review_id,length_review,review_score,order_id,product_category_name,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
0,0,7bc2406110b926393aa56f80a40eba40,0,4,73fc7af87114b39712e6da79b0a377eb,esporte_lazer,,,2018-01-18 00:00:00,2018-01-18 21:46:59,41dcb106f807e993532d446263290104,delivered,2018-01-11 15:30:49,2018-01-11 15:47:59,2018-01-12 21:57:22,2018-01-17 18:42:41,2018-02-02 00:00:00
1,1,80e641a11e56f04c1ad469d5645fdfde,0,5,a548910a1c6147796b98fdf73dbeba33,informatica_acessorios,,,2018-03-10 00:00:00,2018-03-11 03:05:13,8a2e7ef9053dea531e4dc76bd6d853e6,delivered,2018-02-28 12:25:19,2018-02-28 12:48:39,2018-03-02 19:08:15,2018-03-09 23:17:20,2018-03-14 00:00:00
2,2,228ce5500dc1d8e020d8d1322874b6f0,0,5,f9e4b658b201a9f2ecdecbb34bed034b,informatica_acessorios,,,2018-02-17 00:00:00,2018-02-18 14:36:24,e226dfed6544df5b7b87a48208690feb,delivered,2018-02-03 09:56:22,2018-02-03 10:33:41,2018-02-06 16:18:28,2018-02-16 17:28:48,2018-03-09 00:00:00
3,3,e64fb393e7b32834bb789ff8bb30750e,37,5,658677c97b385a9be170737859d3511b,ferramentas_jardim,,Recebi bem antes do prazo estipulado.,2017-04-21 00:00:00,2017-04-21 22:02:06,de6dff97e5f1ba84a3cd9a3bc97df5f6,delivered,2017-04-09 17:41:13,2017-04-09 17:55:19,2017-04-10 14:24:47,2017-04-20 09:08:35,2017-05-10 00:00:00
4,4,f7c4243c7fe1938f181bec41a392bdeb,100,5,8e6bfb81e283fa7e4f11123a3fb894f1,esporte_lazer,,Parabéns lojas lannister adorei comprar pela I...,2018-03-01 00:00:00,2018-03-02 10:26:53,5986b333ca0d44534a156a52a8e33a83,delivered,2018-02-10 10:59:03,2018-02-10 15:48:21,2018-02-15 19:36:14,2018-02-28 16:33:35,2018-03-09 00:00:00


In [257]:
df.shape

(98657, 17)

In [258]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 98657 entries, 0 to 98656
Data columns (total 17 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   Unnamed: 0                     98657 non-null  int64 
 1   review_id                      98657 non-null  object
 2   length_review                  98657 non-null  int64 
 3   review_score                   98657 non-null  int64 
 4   order_id                       98657 non-null  object
 5   product_category_name          98657 non-null  object
 6   review_comment_title           11486 non-null  object
 7   review_comment_message         40439 non-null  object
 8   review_creation_date           98657 non-null  object
 9   review_answer_timestamp        98657 non-null  object
 10  customer_id                    98657 non-null  object
 11  order_status                   98657 non-null  object
 12  order_purchase_timestamp       98657 non-null  object
 13  o

In [259]:
df.isnull().sum() 

Unnamed: 0                           0
review_id                            0
length_review                        0
review_score                         0
order_id                             0
product_category_name                0
review_comment_title             87171
review_comment_message           58218
review_creation_date                 0
review_answer_timestamp              0
customer_id                          0
order_status                         0
order_purchase_timestamp             0
order_approved_at                   13
order_delivered_carrier_date       985
order_delivered_customer_date     2098
order_estimated_delivery_date        0
dtype: int64

❓ **Question: Analyse the reviews to understand what could be the causes of the bad review scores** ❓

This challenge is not as guided as the previous ones. But here are some questions to ask yourself:

- Are all the reviews relevant?
- What about combining the title and the body of a review?
- Which cleaning operations would you apply to the reviews?

🇧🇷 Some Brazilian Portuguese expressions (written without accents) and their translations:

- `produto errado` = wrong product
- `ainda nao` = not yet
- `nao entregue` = not delivered
- `nao veio` = did not come
- `nao gostei` = did not like it
- `produto defeito` = defective product
- `nao funciona` = not working
- `produto diferente` = different product
- `pessima qualidade` = poor quality
- `veio defeito` = arrived defective
- `veio faltando` = arrived with something missing
- `veio errado` = arrived wrong
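
💡 The expressions above are written without accents (`nao` rather than `não`), so they only match reviews whose accents have been stripped first. Here is a minimal sketch of that step using the standard library's `unicodedata` rather than `unidecode`. The helper name `strip_accents` is hypothetical.

```python
import unicodedata

def strip_accents(text):
    """Return `text` with diacritics removed, e.g. 'não' -> 'nao'."""
    # NFKD splits an accented letter into a base letter plus combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks and keep everything else
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("não gostei, péssima qualidade"))
```

`unidecode` goes further and also transliterates non-Latin characters. For Portuguese reviews, both approaches should give the same result.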

# I made a mistake and tried to start everything over; this notebook is a mess


In [260]:
df_negative = df[df['review_score'] < 3]
df_negative

Unnamed: 0.1,Unnamed: 0,review_id,length_review,review_score,order_id,product_category_name,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
5,5,15197aa66ff4d0650b5434f1b46cda19,0,1,b18dcdf73be66366873cd26c5724d1dc,cama_mesa_banho,,,2018-04-13 00:00:00,2018-04-16 00:39:37,eecafc3ff695f031bfe354a9fff9d437,delivered,2018-04-06 22:18:54,2018-04-09 20:10:35,2018-04-11 16:48:35,2018-04-12 17:17:53,2018-05-03 00:00:00
16,16,9314d6f9799f5bfba510cc7bcd468c01,78,2,0dacf04c5ad59fd5a0cc1faa07c34e39,cool_stuff,,"GOSTARIA DE SABER O QUE HOUVE, SEMPRE RECEBI E...",2018-01-18 00:00:00,2018-01-20 21:25:45,db13a417a95ad304e9674468c17ade85,delivered,2017-12-19 13:14:37,2017-12-19 13:32:58,2017-12-20 20:28:58,2018-02-21 01:25:41,2018-01-17 00:00:00
19,19,373cbeecea8286a2b66c97b1b157ec46,7,1,583174fbe37d3d5f0d6661be3aad1786,malas_acessorios,Não chegou meu produto,Péssimo,2018-08-15 00:00:00,2018-08-15 04:10:37,e545e697bb9d1b81e0a702121d4e94d5,canceled,2018-08-04 19:25:07,2018-08-05 19:24:33,,,2018-08-13 00:00:00
29,29,2c5e27fc178bde7ac173c9c62c31b070,35,1,0ce9a24111d850192a933fcaab6fbad3,cama_mesa_banho,,Não gostei ! Comprei gato por lebre,2017-12-13 00:00:00,2017-12-16 07:14:07,5bb8de60ca2ca8b01a5ce471802fe10b,delivered,2017-11-24 01:40:48,2017-11-24 01:49:34,2017-12-06 15:19:09,2017-12-13 00:28:44,2017-12-19 00:00:00
34,34,58044bca115705a48fe0e00a21390c54,173,1,68e55ca79d04a79f20d4bfc0146f4b66,relogios_presentes,,Sempre compro pela Internet e a entrega ocorre...,2018-04-08 00:00:00,2018-04-09 12:22:39,30e6e854c81fa16f46a5d7f3ab025e6f,delivered,2018-03-16 12:51:35,2018-03-16 13:09:21,2018-03-20 18:32:31,2018-04-11 02:12:46,2018-04-06 00:00:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98610,98610,cf0b8c06ba024a8a8d3f2ac51fcd99f4,0,2,fff2cdc825f9fc0ba3c04227cfa02303,telefonia,,,2018-03-09 00:00:00,2018-04-23 17:52:49,9c18c06ccf9b2fafcc5f956c5b145212,delivered,2018-02-02 10:28:41,2018-02-03 02:52:42,2018-02-08 00:26:55,2018-03-12 17:08:53,2018-03-06 00:00:00
98619,98619,6cf47345d15e054dd6df872e929bdb27,0,1,54e6829fe81bc86cf88b12e6d07ea298,moveis_decoracao,,,2017-06-08 00:00:00,2017-06-08 22:52:39,73b659d3fa440f212dde93bf1cba93b1,delivered,2017-05-19 16:02:15,2017-05-19 16:10:12,2017-05-24 10:09:57,2017-06-07 08:33:06,2017-06-12 00:00:00
98634,98634,2ee221b28e5b6fceffac59487ed39348,87,2,f2d12dd37eaef72ed7b1186b2edefbcd,pet_shop,Foto enganosa,Foto muito diferente principalmente a graninha...,2018-03-28 00:00:00,2018-05-25 01:23:26,75b5d720874f58a6f6e2863e378c8575,delivered,2018-03-25 18:01:37,2018-03-25 18:15:29,2018-03-26 20:03:43,2018-03-27 13:48:59,2018-04-06 00:00:00
98637,98637,5085bc489aa6b58a29c4f922d59ff826,197,2,18ed848509774f56cc8c1c0a1903ad7f,construcao_ferramentas_construcao,,Tive um problema na entrega em que o correio c...,2018-02-21 00:00:00,2018-02-23 11:43:12,8f89d962f49f0d7a6d354a4ef3d099c2,delivered,2018-02-05 13:13:28,2018-02-05 13:30:39,2018-02-06 21:43:26,2018-02-20 01:15:50,2018-03-07 00:00:00


In [261]:
# Work on an explicit copy so that new columns are not written to a slice of `df`
df_negative = df_negative.copy()
df_negative['combined_review'] = df_negative['review_comment_title'].fillna('') + " " + df_negative['review_comment_message'].fillna('')


In [262]:
import string

In [236]:
def basic_cleaning(sentence):
    sentence = sentence.strip()
    sentence = sentence.lower()
    sentence = ''.join(char for char in sentence if not char.isdigit())
    sentence = ''.join(char for char in sentence if char not in string.punctuation)
    tokenized_sentence = sentence.split()
    return ' '.join(tokenized_sentence)
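
💡 The same cleaning can be written with `str.translate`, which removes all unwanted characters in a single pass. This is a sketch, not a drop-in replacement: `basic_cleaning_translate` is a hypothetical name, and unlike `char.isdigit()` it only removes the ASCII digits `0-9`.

```python
import string

# Translation table that deletes ASCII punctuation and ASCII digits
_DELETE_TABLE = str.maketrans("", "", string.punctuation + string.digits)

def basic_cleaning_translate(sentence):
    """Lowercase, drop punctuation and digits, and collapse whitespace."""
    sentence = sentence.lower().translate(_DELETE_TABLE)
    return " ".join(sentence.split())

# Message of review 29 in `df_negative`
print(basic_cleaning_translate("Não gostei ! Comprei gato por lebre"))
```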

In [238]:
df_negative['processed_review'] = df_negative['combined_review'].apply(basic_cleaning)
df_negative


Unnamed: 0.1,Unnamed: 0,review_id,length_review,review_score,order_id,product_category_name,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date,combined_review,processed_review
5,5,15197aa66ff4d0650b5434f1b46cda19,0,1,b18dcdf73be66366873cd26c5724d1dc,cama_mesa_banho,,,2018-04-13 00:00:00,2018-04-16 00:39:37,eecafc3ff695f031bfe354a9fff9d437,delivered,2018-04-06 22:18:54,2018-04-09 20:10:35,2018-04-11 16:48:35,2018-04-12 17:17:53,2018-05-03 00:00:00,,
16,16,9314d6f9799f5bfba510cc7bcd468c01,78,2,0dacf04c5ad59fd5a0cc1faa07c34e39,cool_stuff,,"GOSTARIA DE SABER O QUE HOUVE, SEMPRE RECEBI E...",2018-01-18 00:00:00,2018-01-20 21:25:45,db13a417a95ad304e9674468c17ade85,delivered,2017-12-19 13:14:37,2017-12-19 13:32:58,2017-12-20 20:28:58,2018-02-21 01:25:41,2018-01-17 00:00:00,"GOSTARIA DE SABER O QUE HOUVE, SEMPRE RECEBI ...",gostaria de saber o que houve sempre recebi e ...
19,19,373cbeecea8286a2b66c97b1b157ec46,7,1,583174fbe37d3d5f0d6661be3aad1786,malas_acessorios,Não chegou meu produto,Péssimo,2018-08-15 00:00:00,2018-08-15 04:10:37,e545e697bb9d1b81e0a702121d4e94d5,canceled,2018-08-04 19:25:07,2018-08-05 19:24:33,,,2018-08-13 00:00:00,Não chegou meu produto Péssimo,não chegou meu produto péssimo
29,29,2c5e27fc178bde7ac173c9c62c31b070,35,1,0ce9a24111d850192a933fcaab6fbad3,cama_mesa_banho,,Não gostei ! Comprei gato por lebre,2017-12-13 00:00:00,2017-12-16 07:14:07,5bb8de60ca2ca8b01a5ce471802fe10b,delivered,2017-11-24 01:40:48,2017-11-24 01:49:34,2017-12-06 15:19:09,2017-12-13 00:28:44,2017-12-19 00:00:00,Não gostei ! Comprei gato por lebre,não gostei comprei gato por lebre
34,34,58044bca115705a48fe0e00a21390c54,173,1,68e55ca79d04a79f20d4bfc0146f4b66,relogios_presentes,,Sempre compro pela Internet e a entrega ocorre...,2018-04-08 00:00:00,2018-04-09 12:22:39,30e6e854c81fa16f46a5d7f3ab025e6f,delivered,2018-03-16 12:51:35,2018-03-16 13:09:21,2018-03-20 18:32:31,2018-04-11 02:12:46,2018-04-06 00:00:00,Sempre compro pela Internet e a entrega ocorr...,sempre compro pela internet e a entrega ocorre...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98610,98610,cf0b8c06ba024a8a8d3f2ac51fcd99f4,0,2,fff2cdc825f9fc0ba3c04227cfa02303,telefonia,,,2018-03-09 00:00:00,2018-04-23 17:52:49,9c18c06ccf9b2fafcc5f956c5b145212,delivered,2018-02-02 10:28:41,2018-02-03 02:52:42,2018-02-08 00:26:55,2018-03-12 17:08:53,2018-03-06 00:00:00,,
98619,98619,6cf47345d15e054dd6df872e929bdb27,0,1,54e6829fe81bc86cf88b12e6d07ea298,moveis_decoracao,,,2017-06-08 00:00:00,2017-06-08 22:52:39,73b659d3fa440f212dde93bf1cba93b1,delivered,2017-05-19 16:02:15,2017-05-19 16:10:12,2017-05-24 10:09:57,2017-06-07 08:33:06,2017-06-12 00:00:00,,
98634,98634,2ee221b28e5b6fceffac59487ed39348,87,2,f2d12dd37eaef72ed7b1186b2edefbcd,pet_shop,Foto enganosa,Foto muito diferente principalmente a graninha...,2018-03-28 00:00:00,2018-05-25 01:23:26,75b5d720874f58a6f6e2863e378c8575,delivered,2018-03-25 18:01:37,2018-03-25 18:15:29,2018-03-26 20:03:43,2018-03-27 13:48:59,2018-04-06 00:00:00,Foto enganosa Foto muito diferente principalm...,foto enganosa foto muito diferente principalme...
98637,98637,5085bc489aa6b58a29c4f922d59ff826,197,2,18ed848509774f56cc8c1c0a1903ad7f,construcao_ferramentas_construcao,,Tive um problema na entrega em que o correio c...,2018-02-21 00:00:00,2018-02-23 11:43:12,8f89d962f49f0d7a6d354a4ef3d099c2,delivered,2018-02-05 13:13:28,2018-02-05 13:30:39,2018-02-06 21:43:26,2018-02-20 01:15:50,2018-03-07 00:00:00,Tive um problema na entrega em que o correio ...,tive um problema na entrega em que o correio c...


In [239]:
from collections import Counter

In [241]:
all_words = ' '.join(df_negative['processed_review'])

words = all_words.split()

word_freq = Counter(words)

most_common_words = word_freq.most_common(20)
most_common_words

[('o', 8415),
 ('não', 7570),
 ('produto', 6420),
 ('e', 6210),
 ('a', 4949),
 ('de', 4916),
 ('que', 3805),
 ('recebi', 3388),
 ('do', 2417),
 ('com', 2344),
 ('um', 2305),
 ('foi', 1884),
 ('comprei', 1852),
 ('no', 1669),
 ('veio', 1598),
 ('para', 1567),
 ('uma', 1508),
 ('entrega', 1467),
 ('é', 1463),
 ('da', 1455)]
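
💡 Most of the top words are function words (`o`, `e`, `a`, `de`, `que`) that say nothing about why a customer is unhappy. One way to surface informative words is to filter out stop words before counting. Below is a minimal sketch on three made-up reviews with a tiny hand-written stop-word set. `PT_STOPWORDS` and the sample reviews are assumptions for illustration, and a fuller Portuguese stop-word list would be more thorough. Note that `não` is deliberately kept, since negation carries meaning here.

```python
from collections import Counter

# Hypothetical, deliberately short stop-word list; "não" is kept on purpose
PT_STOPWORDS = {"o", "a", "e", "de", "do", "da", "que", "um", "uma",
                "no", "na", "com", "para", "é", "foi"}

# Made-up cleaned reviews, for illustration only
sample_reviews = [
    "não recebi o produto",
    "o produto veio com defeito",
    "ainda não recebi a entrega",
]

# Keep every token that is not a stop word
tokens = [word
          for review in sample_reviews
          for word in review.split()
          if word not in PT_STOPWORDS]

word_freq = Counter(tokens)
print(word_freq.most_common(3))
```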

In [242]:
from sklearn.feature_extraction.text import CountVectorizer

In [243]:
vectorizer = CountVectorizer(ngram_range=(2, 3), stop_words=None)
ngrams = vectorizer.fit_transform(df_negative['processed_review'])

In [244]:
sum_ngrams = ngrams.sum(axis=0)
ngrams_freq = [(word, sum_ngrams[0, idx]) for word, idx in vectorizer.vocabulary_.items()]
ngrams_freq = sorted(ngrams_freq, key=lambda x: x[1], reverse=True)

In [245]:
ngrams_freq[:20]

[('não recebi', 1679),
 ('recebi produto', 996),
 ('foi entregue', 842),
 ('produto não', 804),
 ('ainda não', 760),
 ('não recebi produto', 675),
 ('do produto', 640),
 ('não foi', 609),
 ('meu produto', 583),
 ('não chegou', 499),
 ('não foi entregue', 400),
 ('até agora', 391),
 ('ainda não recebi', 387),
 ('um produto', 376),
 ('não recomendo', 375),
 ('de entrega', 342),
 ('produto veio', 335),
 ('estou aguardando', 299),
 ('veio com', 292),
 ('que não', 285)]
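
💡 Notice `recebi produto` in the list: it most likely comes from phrases such as `recebi o produto`. By default, `CountVectorizer` tokenizes with the pattern `(?u)\b\w\w+\b`, which ignores one-character tokens such as `o`, `e` and `a` before building n-grams. Here is a standard-library sketch of that behaviour. The helper `ngrams_like_countvectorizer` is hypothetical and only mimics the default tokenization, not the full vectorizer.

```python
import re

# Default token pattern of scikit-learn's CountVectorizer: words of 2+ characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def ngrams_like_countvectorizer(text, n_min=2, n_max=3):
    """List the word n-grams of `text`, skipping one-character tokens."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]

print(ngrams_like_countvectorizer("não recebi o produto"))
```

To keep one-character words, a custom `token_pattern` can be passed to the vectorizer.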

In [246]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

In [251]:
tfidf_vectorizer = TfidfVectorizer(max_features=20)
tfidf_matrix = tfidf_vectorizer.fit_transform(df_negative['processed_review'])

tfidf_features = tfidf_vectorizer.get_feature_names_out()
tfidf_scores = tfidf_matrix.toarray().sum(axis=0)

tfidf_results = pd.DataFrame({'Word': tfidf_features, 'TF-IDF Score': tfidf_scores})
tfidf_results = tfidf_results.sort_values(by='TF-IDF Score', ascending=False)

print(tfidf_results.head(20))

        Word  TF-IDF Score
12       não   2285.355551
14   produto   2030.357878
4         de   1565.576337
16    recebi   1400.588530
15       que   1247.817910
5         do    953.698500
1        com    923.937732
17        um    897.240332
2    comprei    789.668868
8        foi    747.888192
19      veio    742.450798
0      ainda    695.956805
6    entrega    681.494698
11        no    678.936539
13      para    647.003662
18       uma    641.822319
3         da    628.447508
7   entregue    609.709379
9        meu    604.259579
10        na    579.089281
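
💡 With `max_features=20`, the vectorizer keeps only the 20 most frequent terms, so the ranking above is still dominated by common words. The IDF part of the score is what down-weights terms that appear in many reviews. Below is a small sketch of the smoothed IDF that scikit-learn applies by default (`smooth_idf=True`): `idf = ln((1 + n) / (1 + df)) + 1`. The function name and the numbers are illustrative assumptions.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 1000  # illustrative corpus size
for doc_freq in (1000, 100, 10):
    # The rarer the term, the larger its IDF weight
    print(doc_freq, round(smoothed_idf(n_docs, doc_freq), 2))
```

A term present in every review gets the minimum weight of 1, and rarer terms get more. Raising `max_features` or passing a stop-word list would let those rarer, more informative terms appear in the ranking.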


In [252]:
topics = {
    'Reception Issues': ['recebi', 'entrega', 'veio', 'entregue'],
    'Product Issues': ['produto', 'comprei', 'ainda'],
    'Service Issues': ['não', 'foi', 'ainda', 'meu']
}

def classify_review(review):
    review_topics = []
    for topic, keywords in topics.items():
        if any(word in review for word in keywords):
            review_topics.append(topic)
    return review_topics

df_negative['topics'] = df_negative['processed_review'].apply(classify_review)
df_negative[['processed_review', 'topics']].head(10)


Unnamed: 0,processed_review,topics
5,,[]
16,gostaria de saber o que houve sempre recebi e ...,[Reception Issues]
19,não chegou meu produto péssimo,"[Product Issues, Service Issues]"
29,não gostei comprei gato por lebre,"[Product Issues, Service Issues]"
34,sempre compro pela internet e a entrega ocorre...,"[Reception Issues, Product Issues, Service Iss..."
41,nada de chegar o meu pedido,[Service Issues]
42,,[]
43,,[]
53,recebi somente controle midea split estilo fal...,[Reception Issues]
70,o produto não chegou no prazo estipulado e cau...,"[Product Issues, Service Issues]"


In [254]:
topic_counts = df_negative['topics'].explode().value_counts()
topic_counts

Product Issues      7043
Service Issues      6808
Reception Issues    6556
Name: topics, dtype: int64
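
💡 `classify_review` tests `word in review` on a string, which is a substring check: the keyword `foi` also matches inside a longer word such as `foice`. Here is a sketch of a whole-word variant that compares against the set of tokens instead. `classify_review_tokens` is a hypothetical name, and the keyword lists are copied from the cell above. Switching to it would change the counts shown above.

```python
topics = {
    'Reception Issues': ['recebi', 'entrega', 'veio', 'entregue'],
    'Product Issues': ['produto', 'comprei', 'ainda'],
    'Service Issues': ['não', 'foi', 'ainda', 'meu'],
}

def classify_review_tokens(review):
    """Return the topics with at least one keyword appearing as a whole word."""
    tokens = set(review.split())
    return [topic for topic, keywords in topics.items()
            if tokens.intersection(keywords)]

print(classify_review_tokens("nada de chegar o meu pedido"))
print(classify_review_tokens("comprei uma foice"))
```

Note also that `ainda` belongs to two topics, so any review containing it is tagged with both.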

In [263]:
def basic_cleaning(sentence):
    sentence = sentence.strip()
    sentence = sentence.lower()
    sentence = ''.join(char for char in sentence if not char.isdigit())
    sentence = ''.join(char for char in sentence if char not in string.punctuation)
    tokenized_sentence = sentence.split()
    return ' '.join(tokenized_sentence)
df['processed_review'] = df['review_comment_message'].fillna('').apply(basic_cleaning)

In [264]:
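from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Binary target: 1 for review scores of 3 or more, 0 for scores below 3
df['label'] = (df['review_score'] >= 3).astype(int)

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(df['processed_review'].fillna(''))  # Fill NaN values with empty strings
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = MultinomialNB()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))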

              precision    recall  f1-score   support

           0       0.84      0.31      0.46      4160
           1       0.90      0.99      0.94     25438

    accuracy                           0.90     29598
   macro avg       0.87      0.65      0.70     29598
weighted avg       0.89      0.90      0.87     29598
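
💡 The 0.90 accuracy looks good, but the classes are imbalanced. According to the support column of the report, a model that always predicts class `1` would already reach about 0.86 accuracy, and the recall of 0.31 on class `0` means that most low-score reviews are missed. The arithmetic below uses only numbers from the report above:

```python
# Numbers taken from the classification report above
support = {0: 4160, 1: 25438}
recall = {0: 0.31, 1: 0.99}

total = sum(support.values())
majority_baseline = max(support.values()) / total  # always predict class 1
macro_recall = sum(recall.values()) / len(recall)  # unweighted mean of recalls
# Approximate, because the recall in the report is rounded to two decimals
missed_negatives = round(support[0] * (1 - recall[0]))

print(f"majority-class baseline accuracy: {majority_baseline:.2f}")
print(f"macro-average recall: {macro_recall:.2f}")
print(f"class 0 reviews predicted as class 1: about {missed_negatives}")
```

For this task, the recall on class `0` is arguably a more useful number to track than overall accuracy.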



🏁 Congratulations! Instead of reading 90K+ reviews, you were able to detect the main reasons for dissatisfaction on Olist.

💾 Don't forget to `git add/commit/push`