# Olist product categories bad reviews analysis

## The dataset

Welcome back to the [Olist dataset](https://kitt.lewagon.com/karr/data-lectures.kitt/04-Decision-Science_01-Project-Setup.slides.html?title=Project+Setup&program_id=10#/2/6) from the Decision Science module!

## The task

We want to identify the product categories that receive more bad reviews than the others, and understand why.

## Setup

If you followed the Decision Science module, you already have the `olist` package installed and importable. In that case, skip this section and go straight to the **Data collection** section.

### 1. Import `olist` package

Download a fresh version of the `olist` package:

```bash
mkdir -p ~/code/lewagon
cd ~/code/lewagon
git clone git@github.com:lewagon/olist.git
cd olist
git fetch
git checkout full-package
```

### 2. Download the datasets

- Download the datasets from Kaggle: https://www.kaggle.com/olistbr/brazilian-ecommerce
- Unzip them into the `data/csv` directory of the `olist` package:

```bash
.
├── README.md
├── data
│   └── csv
│       ├── olist_customers_dataset.csv
│       ├── olist_geolocation_dataset.csv
│       ├── olist_order_items_dataset.csv
│       ├── olist_order_payments_dataset.csv
│       ├── olist_order_reviews_dataset.csv
│       ├── olist_orders_dataset.csv
│       ├── olist_products_dataset.csv
│       ├── olist_sellers_dataset.csv
│       └── product_category_name_translation.csv
├── notebooks
├── olist
│   ├── README.md
│   ├── __init__.py
│   ├── data.py
│   ├── order.py
│   ├── product.py
│   ├── product_updated.py
│   ├── review.py
│   ├── seller.py
│   ├── seller_updated.py
│   └── utils.py
├── requirements.txt
└── setup.p
```

### 3. Install the `olist` package

```bash
pip install -e .
```

⚠️ Restart the kernel.

## Data collection

```python
from olist.review import Review
from olist.data import Olist

# Get dataset with product category name
product_category = Review().get_training_data()

product_category.head()
```

| | review_id | length_review | review_score | order_id | product_category_name |
|---|---|---|---|---|---|
| 0 | 7bc2406110b926393aa56f80a40eba40 | 0 | 4 | 73fc7af87114b39712e6da79b0a377eb | esporte_lazer |
| 1 | 80e641a11e56f04c1ad469d5645fdfde | 0 | 5 | a548910a1c6147796b98fdf73dbeba33 | informatica_acessorios |
| 2 | 228ce5500dc1d8e020d8d1322874b6f0 | 0 | 5 | f9e4b658b201a9f2ecdecbb34bed034b | informatica_acessorios |
| 3 | e64fb393e7b32834bb789ff8bb30750e | 37 | 5 | 658677c97b385a9be170737859d3511b | ferramentas_jardim |
| 4 | f7c4243c7fe1938f181bec41a392bdeb | 100 | 5 | 8e6bfb81e283fa7e4f11123a3fb894f1 | esporte_lazer |


```python
# Get dataset with reviews
data = Olist().get_data()

reviews_data = data['order_reviews']

reviews_data.head()
```

| | review_id | order_id | review_score | review_comment_title | review_comment_message | review_creation_date | review_answer_timestamp |
|---|---|---|---|---|---|---|---|
| 0 | 7bc2406110b926393aa56f80a40eba40 | 73fc7af87114b39712e6da79b0a377eb | 4 | | | 2018-01-18 00:00:00 | 2018-01-18 21:46:59 |
| 1 | 80e641a11e56f04c1ad469d5645fdfde | a548910a1c6147796b98fdf73dbeba33 | 5 | | | 2018-03-10 00:00:00 | 2018-03-11 03:05:13 |
| 2 | 228ce5500dc1d8e020d8d1322874b6f0 | f9e4b658b201a9f2ecdecbb34bed034b | 5 | | | 2018-02-17 00:00:00 | 2018-02-18 14:36:24 |
| 3 | e64fb393e7b32834bb789ff8bb30750e | 658677c97b385a9be170737859d3511b | 5 | | Recebi bem antes do prazo estipulado. | 2017-04-21 00:00:00 | 2017-04-21 22:02:06 |
| 4 | f7c4243c7fe1938f181bec41a392bdeb | 8e6bfb81e283fa7e4f11123a3fb894f1 | 5 | | Parabéns lojas lannister adorei comprar pela I... | 2018-03-01 00:00:00 | 2018-03-02 10:26:53 |


```python
# Get dataset with estimated delivery date
data = Olist().get_data()

orders_data = data['orders']

orders_data.head()
```

| | order_id | customer_id | order_status | order_purchase_timestamp | order_approved_at | order_delivered_carrier_date | order_delivered_customer_date | order_estimated_delivery_date |
|---|---|---|---|---|---|---|---|---|
| 0 | e481f51cbdc54678b7cc49136f2d6af7 | 9ef432eb6251297304e76186b10a928d | delivered | 2017-10-02 10:56:33 | 2017-10-02 11:07:15 | 2017-10-04 19:55:00 | 2017-10-10 21:25:13 | 2017-10-18 00:00:00 |
| 1 | 53cdb2fc8bc7dce0b6741e2150273451 | b0830fb4747a6c6d20dea0b8c802d7ef | delivered | 2018-07-24 20:41:37 | 2018-07-26 03:24:27 | 2018-07-26 14:31:00 | 2018-08-07 15:27:45 | 2018-08-13 00:00:00 |
| 2 | 47770eb9100c2d0c44946d9cf07ec65d | 41ce2a54c0b03bf3443c3d931a367089 | delivered | 2018-08-08 08:38:49 | 2018-08-08 08:55:23 | 2018-08-08 13:50:00 | 2018-08-17 18:06:29 | 2018-09-04 00:00:00 |
| 3 | 949d5b44dbf5de918fe9c16f97b45f8a | f88197465ea7920adcdbec7375364d82 | delivered | 2017-11-18 19:28:06 | 2017-11-18 19:45:59 | 2017-11-22 13:39:59 | 2017-12-02 00:28:42 | 2017-12-15 00:00:00 |
| 4 | ad21c59c0840e6cb83a9ceb5573f8159 | 8ab97904e6daea8866dbdbc4fb7aad2c | delivered | 2018-02-13 21:18:39 | 2018-02-13 22:20:29 | 2018-02-14 19:46:34 | 2018-02-16 18:17:02 | 2018-02-26 00:00:00 |


```python
# Build a dataset merging the 3 datasets
```
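
One possible way to build the merged dataset is sketched below on toy stand-ins for the three DataFrames (all values are made up). It assumes `review_id` and `order_id` are the join keys, as the previews above suggest; the name `merged` is only illustrative.

```python
import pandas as pd

# Toy stand-ins for the three DataFrames loaded above (made-up values).
product_category = pd.DataFrame({
    "review_id": ["r1", "r2", "r3"],
    "order_id": ["o1", "o2", "o3"],
    "review_score": [5, 1, 3],
    "product_category_name": ["esporte_lazer", "relogios_presentes", "ferramentas_jardim"],
})
reviews_data = pd.DataFrame({
    "review_id": ["r1", "r2", "r3"],
    "order_id": ["o1", "o2", "o3"],
    "review_score": [5, 1, 3],
    "review_comment_title": [None, "Falso", None],
    "review_comment_message": ["Recebi bem antes do prazo estipulado.", "Produto não é original", None],
})
orders_data = pd.DataFrame({
    "order_id": ["o1", "o2", "o3"],
    "order_status": ["delivered", "delivered", "shipped"],
    "order_delivered_customer_date": ["2017-10-10 21:25:13", "2018-08-20 15:27:45", None],
    "order_estimated_delivery_date": ["2017-10-18 00:00:00", "2018-08-13 00:00:00", "2018-09-04 00:00:00"],
})

# Only bring in the text columns from the reviews: the score is already
# present in product_category, so this avoids duplicated columns.
text_cols = ["review_id", "order_id", "review_comment_title", "review_comment_message"]

merged = (
    product_category
    .merge(reviews_data[text_cols], on=["review_id", "order_id"], how="left")
    .merge(orders_data, on="order_id", how="left")
)

print(merged.columns.tolist())
```

Joining on both `review_id` and `order_id` is a cautious choice in case a review identifier is shared by several orders; joining on `review_id` alone may be enough for your data.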

## Data cleaning

```python
# Remove reviews for orders delivered BEFORE expected

# Remove reviews about undelivered orders
```
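
The two filters above could look like the sketch below, run here on a toy `merged` DataFrame with made-up values. It reads the first comment literally (orders delivered strictly before the estimated date are dropped) and assumes that an undelivered order is one whose `order_status` is not `delivered` or whose delivery date is missing.

```python
import pandas as pd

# Toy merged dataset (made-up values).
merged = pd.DataFrame({
    "order_id": ["o1", "o2", "o3", "o4"],
    "order_status": ["delivered", "delivered", "shipped", "delivered"],
    "order_delivered_customer_date": [
        "2017-10-10 21:25:13", "2018-08-20 15:27:45", None, "2018-02-26 00:00:00",
    ],
    "order_estimated_delivery_date": [
        "2017-10-18 00:00:00", "2018-08-13 00:00:00", "2018-09-04 00:00:00", "2018-02-26 00:00:00",
    ],
})

# Dates are loaded as strings, so convert them before comparing.
date_cols = ["order_delivered_customer_date", "order_estimated_delivery_date"]
for col in date_cols:
    merged[col] = pd.to_datetime(merged[col])

# Remove reviews about undelivered orders.
is_delivered = (merged["order_status"] == "delivered") & merged["order_delivered_customer_date"].notna()
delivered = merged[is_delivered]

# Remove reviews for orders delivered before the estimated date.
delivered_early = delivered["order_delivered_customer_date"] < delivered["order_estimated_delivery_date"]
kept = delivered[~delivered_early]

print(kept["order_id"].tolist())
```

Only the `delivered_early` line needs to change if you decide to treat on-time deliveries differently.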


```python
# Keep only text columns and review score
```

```python
# Combine review title and review message
```
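
A minimal sketch of these two steps on toy data follows. The column name `full_review` is hypothetical, and keeping `order_id` and `product_category_name` alongside the text and score is an assumption made here because later steps group by category and look up sellers.

```python
import pandas as pd

# Toy cleaned dataset (made-up values).
merged = pd.DataFrame({
    "order_id": ["o1", "o2", "o3"],
    "order_status": ["delivered"] * 3,
    "product_category_name": ["esporte_lazer", "relogios_presentes", "relogios_presentes"],
    "review_score": [5, 1, 2],
    "review_comment_title": [None, "Falso", "Atraso"],
    "review_comment_message": ["Recebi bem antes do prazo estipulado.", "Produto não é original", None],
})

# Keep the text columns and the score, plus the keys needed later (assumption).
keep_cols = [
    "order_id", "product_category_name", "review_score",
    "review_comment_title", "review_comment_message",
]
reviews = merged[keep_cols].copy()

# Missing titles or messages become empty strings so that the
# concatenation never produces NaN; strip removes the stray space.
reviews["full_review"] = (
    reviews["review_comment_title"].fillna("")
    + " "
    + reviews["review_comment_message"].fillna("")
).str.strip()

print(reviews["full_review"].tolist())
```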


```python
# Clean reviews text
```
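
What "clean" means is left open here. The sketch below shows one plausible choice using only the standard library: lowercase, strip accents, drop digits and punctuation, collapse whitespace. The function name `clean_text` is hypothetical, and whether accents should be removed from Portuguese text is a judgment call.

```python
import re
import unicodedata


def clean_text(text: str) -> str:
    """Lowercase, strip accents, and keep only letters and single spaces."""
    text = text.lower()
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Replace anything that is not an ASCII letter or whitespace with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Collapse runs of whitespace into a single space.
    return re.sub(r"\s+", " ", text).strip()


print(clean_text("Parabéns, lojas Lannister!! Adorei comprar 2x."))
```

On a DataFrame, this could be applied with something like `reviews["full_review"].apply(clean_text)`, where `full_review` stands for whichever column holds the combined text.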


## Bad reviews analysis per category

```python
# Group by product category and aggregate mean, min, max review scores
```
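
As an illustration on toy data (made-up scores), the aggregation can be written with `groupby` and `agg`; the variable name `category_stats` is only illustrative.

```python
import pandas as pd

# Toy reviews (made-up values).
reviews = pd.DataFrame({
    "product_category_name": [
        "relogios_presentes", "relogios_presentes", "relogios_presentes",
        "esporte_lazer", "esporte_lazer",
    ],
    "review_score": [5, 1, 3, 4, 4],
})

# One row per category, one column per statistic.
category_stats = (
    reviews
    .groupby("product_category_name")["review_score"]
    .agg(["mean", "min", "max"])
    .sort_values("mean")
)

print(category_stats)
```

Sorting by the mean puts the worst-rated categories first, which is convenient for the rest of the analysis.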


```python
# Keep categories that have more than 100 reviews
```
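
The count filter can be sketched as follows. The toy data is far too small for a threshold of 100, so the hypothetical constant `MIN_REVIEWS` is set to 2 here; in the notebook it would be 100.

```python
import pandas as pd

# Toy reviews (made-up values).
reviews = pd.DataFrame({
    "product_category_name": (
        ["esporte_lazer"] * 3 + ["relogios_presentes"] * 3 + ["ferramentas_jardim"]
    ),
    "review_score": [5, 4, 1, 1, 2, 5, 3],
})

MIN_REVIEWS = 2  # 100 in the notebook; lowered for this toy example

# Number of reviews per category.
counts = reviews["product_category_name"].value_counts()

# Categories with strictly more than MIN_REVIEWS reviews.
large_categories = counts[counts > MIN_REVIEWS].index

filtered = reviews[reviews["product_category_name"].isin(large_categories)]

print(sorted(filtered["product_category_name"].unique()))
```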


```python
# Select the relogios_presentes category
```

```python
# Keep only the bad reviews
```
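
These two selections can be combined with boolean masks, as in the sketch below. The cutoff for a "bad" review is not stated in this notebook, so `BAD_SCORE_MAX = 2` is an assumption, and `full_review` is a hypothetical name for the cleaned text column.

```python
import pandas as pd

# Toy reviews (made-up values).
reviews = pd.DataFrame({
    "product_category_name": [
        "relogios_presentes", "relogios_presentes", "esporte_lazer", "relogios_presentes",
    ],
    "review_score": [1, 5, 1, 2],
    "full_review": ["relogio falso", "otimo produto", "nao recebi", "nao e original"],
})

BAD_SCORE_MAX = 2  # assumption: a score of 1 or 2 counts as a bad review

in_category = reviews["product_category_name"] == "relogios_presentes"
is_bad = reviews["review_score"] <= BAD_SCORE_MAX

bad_watch_reviews = reviews[in_category & is_bad]

print(bad_watch_reviews["full_review"].tolist())
```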


```python
# Encode bad reviews text

# Tuned TfidfVectorizer

# Transform text to vectors

# Sum of TF-IDF weights by word

# Get the words and their associated weights

# Sort

# Display the sorted list
```


## Tracking sellers of counterfeit products

```python
# Get seller ID
```
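
Seller identifiers live in the order items dataset (`olist_order_items_dataset.csv`), which has a `seller_id` column. The sketch below uses a toy stand-in for it; in the notebook, it is presumably available as `data['order_items']`, but that key is an assumption.

```python
import pandas as pd

# Toy stand-in for the order items dataset (made-up values).
order_items = pd.DataFrame({
    "order_id": ["o1", "o1", "o2", "o3"],
    "seller_id": ["s1", "s1", "s2", "s1"],
    "price": [100.0, 100.0, 50.0, 75.0],
})

# Toy bad reviews (made-up values).
bad_watch_reviews = pd.DataFrame({
    "order_id": ["o1", "o2"],
    "full_review": ["relogio falso", "nao e original"],
})

# An order can contain several items, so deduplicate the
# (order, seller) pairs before merging to avoid repeating reviews.
sellers_per_order = order_items[["order_id", "seller_id"]].drop_duplicates()

reviews_with_seller = bad_watch_reviews.merge(sellers_per_order, on="order_id", how="left")

print(reviews_with_seller)
```

An order fulfilled by several sellers would still yield one row per seller, so its review would be attributed to each of them.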


```python
# Keep only reviews with words associated with counterfeit watches
```
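
A keyword filter can be written with `Series.str.contains`, as sketched below. The word list is purely hypothetical; in practice, it would come from the highest-weighted words in the TF-IDF ranking.

```python
import re

import pandas as pd

# Toy reviews with seller IDs (made-up values).
reviews_with_seller = pd.DataFrame({
    "order_id": ["o1", "o2", "o3", "o4"],
    "seller_id": ["s1", "s2", "s1", "s3"],
    "full_review": [
        "relogio falso nao e original",
        "produto chegou rapido",
        None,
        "parece uma replica barata",
    ],
})

# Hypothetical keywords hinting at counterfeit products.
counterfeit_words = ["falso", "falsificado", "replica"]

# Whole-word match on any keyword; the non-capturing group avoids
# the pandas warning about regex match groups.
pattern = r"\b(?:" + "|".join(re.escape(word) for word in counterfeit_words) + r")\b"

# na=False treats missing reviews as non-matching.
is_counterfeit = reviews_with_seller["full_review"].str.contains(pattern, regex=True, na=False)
counterfeit_reviews = reviews_with_seller[is_counterfeit]

print(counterfeit_reviews["order_id"].tolist())
```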


```python
# Group by seller ID
```

```python
# Select the one seller with 11 counterfeit-related reviews
```
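
Counting flagged reviews per seller and isolating the top one can be sketched as below. The toy data is made up, so the top seller has 3 flagged reviews here rather than the 11 mentioned above.

```python
import pandas as pd

# Toy counterfeit-related reviews (made-up values).
counterfeit_reviews = pd.DataFrame({
    "seller_id": ["s1", "s2", "s1", "s3", "s1"],
    "full_review": [
        "relogio falso",
        "produto falso",
        "nao e original",
        "replica barata",
        "falso veio quebrado",
    ],
})

# Number of flagged reviews per seller, largest first.
reviews_per_seller = (
    counterfeit_reviews
    .groupby("seller_id")
    .size()
    .sort_values(ascending=False)
)

# Seller with the most flagged reviews, and the reviews themselves.
top_seller = reviews_per_seller.idxmax()
top_seller_reviews = counterfeit_reviews[counterfeit_reviews["seller_id"] == top_seller]

print(top_seller, reviews_per_seller[top_seller])
```

Reading the selected reviews one by one is still worthwhile: a keyword match alone does not prove that a seller ships counterfeit products.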
