# Creating a DataFrame with review evaluations

This document will use the file available at https://raw.githubusercontent.com/olist/work-at-olist-data/master/datasets/olist_order_reviews_dataset.csv to create a dataframe with 41,754 rows and two columns:

- Review Comment Message: the text in Portuguese that customers left after the purchase.
- Is good review: 0 when they didn't like the experience (review score of 3 or lower), 1 when they liked it (review score of 4 or 5).

The resulting dataset can serve as a starting point for further NLP projects.

In [15]:
# As usual, let's invite all the cool guys to join our python party.
import numpy as np
import pandas as pd

As the datasets are stored in the Olist GitHub repository, **we can load them directly from there**, without the need to download the files.

Please note that an auxiliary file, named *prod_cat_trans*, is also provided; I will use it to translate the product categories from Portuguese to English.

In [16]:
# Here we use the link to the raw data hosted at GitHub
order_reviews = pd.read_csv('https://raw.githubusercontent.com/olist/work-at-olist-data/master/datasets/olist_order_reviews_dataset.csv')
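Since only the score and the comment text are needed later on, one option is to parse just those two columns. The sketch below is a minimal illustration, not part of the original notebook: it reads a small in-memory stand-in (made-up rows, column names taken from the `head()` output) instead of the remote file, and uses the `usecols` parameter of `pd.read_csv`.

```python
import io
import pandas as pd

# Hypothetical stand-in for the remote CSV: real column names, made-up rows.
sample_csv = io.StringIO(
    "review_id,order_id,review_score,review_comment_message\n"
    "r1,o1,5,Recebi bem antes do prazo estipulado.\n"
    "r2,o2,2,\n"
)

# usecols restricts parsing to the columns needed downstream.
sample = pd.read_csv(sample_csv, usecols=["review_score", "review_comment_message"])

print(sample.shape)
print(sample.isnull().sum())
```

The same `usecols` argument could be passed when reading from the GitHub URL, assuming the remote file keeps these column names.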

In [17]:
order_reviews.head()

Unnamed: 0,review_id,order_id,review_score,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12
0,7bc2406110b926393aa56f80a40eba40,73fc7af87114b39712e6da79b0a377eb,4,,,2018-01-18 00:00:00,2018-01-18 21:46:59,,,,,,
1,80e641a11e56f04c1ad469d5645fdfde,a548910a1c6147796b98fdf73dbeba33,5,,,2018-03-10 00:00:00,2018-03-11 03:05:13,,,,,,
2,228ce5500dc1d8e020d8d1322874b6f0,f9e4b658b201a9f2ecdecbb34bed034b,5,,,2018-02-17 00:00:00,2018-02-18 14:36:24,,,,,,
3,e64fb393e7b32834bb789ff8bb30750e,658677c97b385a9be170737859d3511b,5,,Recebi bem antes do prazo estipulado.,2017-04-21 00:00:00,2017-04-21 22:02:06,,,,,,
4,f7c4243c7fe1938f181bec41a392bdeb,8e6bfb81e283fa7e4f11123a3fb894f1,5,,Parabéns lojas lannister adorei comprar pela I...,2018-03-01 00:00:00,2018-03-02 10:26:53,,,,,,


In [18]:
# Creating the binary target feature
order_reviews['is_good_review'] = order_reviews['review_score'].apply(lambda x: 1 if x > 3 else 0)
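To make the threshold explicit, here is a small sketch on toy scores (not the real data). It uses a vectorized comparison, which should give the same labels as the `apply` call above: scores of 4 and 5 become 1, and everything else, including the neutral score of 3, becomes 0.

```python
import pandas as pd

# Toy scores covering the whole 1-5 scale (illustrative only).
scores = pd.Series([1, 2, 3, 4, 5], name="review_score")

# Boolean comparison cast to int: True -> 1, False -> 0.
labels = (scores > 3).astype(int)

print(labels.tolist())  # [0, 0, 0, 1, 1]
```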

In [19]:
columns_to_keep = ['review_comment_message', 'is_good_review']
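# .copy() gives an independent DataFrame, so later changes do not trigger SettingWithCopyWarning
reviews_nlp = order_reviews[columns_to_keep].copy()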

reviews_nlp.head()

Unnamed: 0,review_comment_message,is_good_review
0,,1
1,,1
2,,1
3,Recebi bem antes do prazo estipulado.,1
4,Parabéns lojas lannister adorei comprar pela I...,1


In [20]:
reviews_nlp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 2 columns):
review_comment_message    41754 non-null object
is_good_review            100000 non-null int64
dtypes: int64(1), object(1)
memory usage: 1.5+ MB


In [21]:
reviews_nlp.isnull().sum()

review_comment_message    58246
is_good_review                0
dtype: int64

In [22]:
# Dropping all rows with null values:
reviews_nlp = reviews_nlp.dropna(how='any', axis=0)

print(reviews_nlp.isnull().sum())
print(" ")
print(reviews_nlp.info())

review_comment_message    0
is_good_review            0
dtype: int64
 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 41754 entries, 3 to 99999
Data columns (total 2 columns):
review_comment_message    41754 non-null object
is_good_review            41754 non-null int64
dtypes: int64(1), object(1)
memory usage: 978.6+ KB
None
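Note that the index in the output above still runs from 3 to 99999, because `dropna` keeps the original row labels. If a contiguous index is preferred, it can be rebuilt with `reset_index(drop=True)`. The sketch below shows this on a toy frame; the rows are illustrative and this step is not part of the original notebook.

```python
import pandas as pd

# Toy frame with missing comments (illustrative rows only).
toy = pd.DataFrame({
    "review_comment_message": [None, "Recebi bem antes do prazo estipulado.", None, "ok"],
    "is_good_review": [1, 1, 0, 1],
})

# Dropping rows keeps the original labels, leaving gaps in the index.
cleaned = toy.dropna(subset=["review_comment_message"])
index_before = cleaned.index.tolist()

# drop=True discards the old labels instead of adding them as a column.
cleaned = cleaned.reset_index(drop=True)
index_after = cleaned.index.tolist()

print(index_before, index_after)  # [1, 3] [0, 1]
```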


In [23]:
# Saving the file for analysis
reviews_nlp.to_csv('reviews_for_nlp.csv')
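One thing to keep in mind when loading the file later: `to_csv` writes the row index by default, so the saved file carries an extra unnamed leading column. The sketch below illustrates the round trip on a one-row toy frame, using an in-memory buffer instead of a file on disk; passing `index_col=0` to `pd.read_csv` restores the two-column layout.

```python
import io
import pandas as pd

# One-row toy frame with a non-default index, mimicking the cleaned data.
toy = pd.DataFrame(
    {"review_comment_message": ["Recebi bem antes do prazo estipulado."],
     "is_good_review": [1]},
    index=[3],
)

buffer = io.StringIO()
toy.to_csv(buffer)  # default index=True: the index becomes the first column

# Reading back naively turns the saved index into an "Unnamed: 0" column.
buffer.seek(0)
naive = pd.read_csv(buffer)

# index_col=0 tells pandas to treat the first column as the index again.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(list(naive.columns))
print(list(restored.columns))
```

Alternatively, saving with `index=False` avoids the extra column altogether, at the cost of losing the original row labels.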