# Products

🎯 Our goal is to find `product categories` that repeatedly `underperform` compared to others, and to understand why.

## `product.py` 

🎁 We gave you the solution to `product.py` in your challenge folder

💾 Copy-paste it to your local `olist` folder

👉 `product.py` provides you with aggregated data for each `product_id` sold on Olist.

----
The `get_training_data` method in `olist/product.py` returns a DataFrame with the following features:

| feature_name                 	| type  	| description                                                               	|
|:------------------------------	|:-------:	|:---------------------------------------------------------------------------	|
| `product_id`                 	| str   	| id of the product **UNIQUE**                                              	|
| `category`                   	| str   	| category name (in English)                                                	|
| `product_name_length`        	| float 	| number of characters of a product name                                    	|
| `product_description_length` 	| float 	| number of characters of a product description                             	|
| `product_photos_qty`         	| int   	| number of photos available for a product                                  	|
| `product_weight_g`           	| float 	| weight of the product                                                     	|
| `product_length_cm`          	| float 	| length of the product                                                     	|
| `product_height_cm`          	| float 	| height of the product                                                     	|
| `product_width_cm`           	| float 	| width of the product                                                      	|
| `price`                      	| float 	| average price at which the product is sold                                	|
| `wait_time`                  	| float 	| average wait time (in days) for orders in which the product was sold      	|
| `share_of_five_stars`        	| float 	| share of five-star review_scores for orders in which the product was sold 	|
| `share_of_one_stars`         	| float 	| share of one-star review_scores for orders in which the product was sold  	|
| `review_score`               	| float 	| average review score of the orders in which the product was sold          	|
| `n_orders`                   	| int   	| number of orders in which the product appears                             	|
| `quantity`                   	| int   	| total number of products sold for each product_id                         	|
| `sales`                      	| int   	| total sales (in BRL) for each product_id                                  	|

## Analysis per `product_id`

🎯 Can we predict the average `review_score` per `product_id` ? 

In [0]:
%load_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

### A first glimpse at the `Product().get_training_data()` DataFrame

👉 Let's inspect the new `Product().get_training_data()` DataFrame, for instance by `plotting a histogram of each variable` with `products.hist()`.

In [0]:
from olist.product import Product
products = Product().get_training_data()

In [0]:
products.head()

In [0]:
products.hist()
fig = plt.gcf()
fig.set_size_inches(15,10)

### Predicting the average `review_score` per `product_id`

🚀 Model `review_score` with an `OLS`.

👉  Choose which continuous features you would like to use to predict the review score

* What is the $R^2$ of your model ❓
* Among the features you chose, which ones are the most important/significant ❓

In [0]:
# YOUR CODE HERE

💡 Some features seem interesting:
* `product_photos_qty` : the more pictures available, the more likely a customer is to purchase the product
* `product_volume_cm3` : it is faster to ship a smartphone cable than a convertible sofa... (this feature is not in the table above; it can be derived as `product_length_cm * product_height_cm * product_width_cm`)
* `wait_time` : are Olist's customers patient ?
* `price` : how does the price of a product influence a customer's satisfaction ?
* `n_orders` : it might indicate whether a product is popular or not
* `quantity` : same thing

🤔 `quantity` and `n_orders` are probably correlated, but let's run an OLS on these features now and see what comes out of it !

👇 Your turn: run the OLS with the features listed above

In [0]:
# YOUR CODE HERE
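💡 A minimal sketch of what such a regression could look like. It runs on a synthetic `toy` DataFrame (not Olist data) so that it is self-contained; the column names mirror the feature table, and the relationship between `wait_time` and `review_score` is an assumption baked into the toy data. Standardizing the features (z-scores) is one possible way to make coefficient sizes comparable, not necessarily the approach of the official solution.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for `products` (NOT real Olist data)
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "product_photos_qty": rng.integers(1, 6, n),
    "product_length_cm": rng.uniform(10, 100, n),
    "product_height_cm": rng.uniform(5, 50, n),
    "product_width_cm": rng.uniform(5, 50, n),
    "wait_time": rng.uniform(2, 30, n),
    "price": rng.uniform(10, 500, n),
    "n_orders": rng.integers(1, 50, n),
})
toy["quantity"] = toy["n_orders"] + rng.integers(0, 5, n)
# Assumed ground truth for the toy data: longer waits lower the score
toy["review_score"] = (
    4.8 - 0.05 * toy["wait_time"] + rng.normal(0, 0.5, n)
).clip(1, 5)

# Derived feature: volume = length x height x width
toy["product_volume_cm3"] = (
    toy["product_length_cm"] * toy["product_height_cm"] * toy["product_width_cm"]
)

features = [
    "product_photos_qty", "product_volume_cm3",
    "wait_time", "price", "n_orders", "quantity",
]

# Standardize features so that coefficient magnitudes can be compared
scaled = toy.copy()
scaled[features] = (toy[features] - toy[features].mean()) / toy[features].std()

formula = "review_score ~ " + " + ".join(features)
model = smf.ols(formula=formula, data=scaled).fit()

print(round(model.rsquared, 3))                       # share of variance explained
print(model.params.drop("Intercept").sort_values())   # standardized coefficients
print(model.pvalues.round(3))                         # significance of each coefficient
```

On the real data, replacing `toy` with `products` should be enough; `model.summary()` gives the full regression table.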

### Some insights based on our OLS 👇:

🎉 All the p-values are lower than 5%, which means that all our coefficients are statistically significant !

----

ℹ️ The price has a small positive impact on the review score. Could it be a psychological effect, where customers are reluctant to admit that a product is bad after paying a significant amount for it ?

----

😮 The number of photos is apparently less important than expected, even if there should be a minimum.

----

😮 The quantity has apparently no impact on the review score of a product according to this model...

⚠️⚠️⚠️ Let's not draw conclusions too quickly ⚠️⚠️⚠️

* *Example 1*: for smartphones, it is extremely important that a charging cable does not break after a few weeks. On e-commerce platforms in general, a good-quality cable can be ordered thousands of times and have a high average review score.

* *Example 2*: a product that belongs to a niche market can have a high average review score even if it was bought only a few times.

🧑🏻‍🏫 The lesson here is that even when your coefficients are significant, a model only provides guidance for your business decisions ! Please use your common sense as well !

----

ℹ️ The product volume does not seem to have a big impact on the review score... but still the impact is slightly negative.

----

🔴 The `wait_time` has a huge negative impact on the `review_score`. Customers do not like to wait, even though they already know that some products take longer to ship than others !


## Aggregation operations per product category

### Build an aggregated dataframe with a `get_product_cat` function.

👉 Create a function `get_product_cat` which:
* takes an `aggregating method as an argument` 
* returns a DataFrame with:
    * each `product_category`'s `quantity` summed 
    * all other numerical features aggregated by the chosen method.  


    For instance, `get_product_cat('median')` returns:

      - `quantity` (sum)
      - `wait_time` (median)
      - `review_score` (median)
      - `price` (median)
      - ....

In [0]:
# YOUR CODE HERE
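💡 One possible shape for this aggregation, shown on a tiny hand-made DataFrame. The helper name `aggregate_by_category` and its `df` argument are hypothetical (chosen so the sketch does not clash with the `get_product_cat` signature used by the test cell); it assumes the category column is named `category`, as in the feature table.

```python
import pandas as pd

def aggregate_by_category(df, agg="median"):
    """Sum `quantity` per category; aggregate other numeric columns with `agg`."""
    # Every numeric column except `quantity` uses the requested method
    numeric_cols = df.select_dtypes(include="number").columns.drop("quantity")
    agg_map = {col: agg for col in numeric_cols}
    # `quantity` is always summed
    agg_map["quantity"] = "sum"
    return df.groupby("category").agg(agg_map)

# Toy data (NOT real Olist data)
toy = pd.DataFrame({
    "category": ["books", "books", "furniture"],
    "quantity": [2, 3, 1],
    "price": [10.0, 30.0, 200.0],
    "review_score": [5.0, 4.0, 2.0],
    "wait_time": [5.0, 7.0, 20.0],
})

result = aggregate_by_category(toy, "median")
print(result)
```

A `get_product_cat(agg)` function could then simply call this helper on the `products` DataFrame loaded earlier.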

### Test code

In [0]:
from nbresult import ChallengeResult
product_cat = get_product_cat('mean')
result = ChallengeResult('products',
    shape=product_cat.shape,
    avg_review_score=int(product_cat['review_score'].mean()),
    avg_price=int(product_cat['price'].mean()),
    avg_quantity=int(product_cat['quantity'].mean())
)
result.write()
print(result.check())

### 🧨 Products' Analysis 🧨

How many product categories does Olist have ❓

In [0]:
# YOUR CODE HERE

💪 What are the best performing product categories ❓

In [0]:
# YOUR CODE HERE

👎  What are the worst performing product categories ❓

In [0]:
# YOUR CODE HERE
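💡 A small sketch covering the three questions above on a toy category-level DataFrame. It assumes that "performance" means the average `review_score`; ranking by `sales` or `quantity` would be equally valid readings of the question. The category names and values are made up.

```python
import pandas as pd

# Toy stand-in for the output of get_product_cat('mean') (NOT real Olist data)
toy_cat = pd.DataFrame(
    {
        "review_score": [4.5, 3.4, 4.1, 3.9],
        "sales": [1200, 5400, 800, 15000],
    },
    index=pd.Index(
        ["books", "office_furniture", "toys", "health_beauty"], name="category"
    ),
)

# Number of distinct categories
n_categories = toy_cat.index.nunique()

# Best and worst categories by average review score
best = toy_cat.nlargest(2, "review_score")
worst = toy_cat.nsmallest(2, "review_score")

print(n_categories)
print(best)
print(worst)
```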

👀 Let's try to understand _why_ some categories are performing better than others. 

Using `plotly`, create different scatterplots, varying `x`, `y`, `color` and `size`, to find clues about factors impacting the `review_score`. 

- Do you notice some underperforming product categories?
- Can you think of a strategy to improve Olist's profit margin, as its CEO requested?

<details>
    <summary>Hints</summary>

Plot `product_length_cm` against `wait_time`, with color as `review_score`, and bubble size as "sales" for instance
    
</details>

In [0]:
# YOUR CODE HERE

### Causal inference

☝️ It seems that `large products` like `office_furniture` and `furniture_mattress_and_upholstery`, which happen to take longer to deliver, are performing worse than other products.

🤔 Are consumers disappointed by the products themselves or by the slow delivery time ❓ 

👉 Run an OLS to model `review_score` :
* to isolate the real contribution of each product category on customer satisfaction, 
* by holding `wait_time` constant.

<u>**Questions**</u>:

1️⃣ Which dataset should you use for this regression: `product_cat` or the entire `products` training dataset ?

2️⃣ Which independent variables / features should you use ?

3️⃣ 🕵🏻 Investigate the results: which product categories correlate with a higher `review_score` (holding `wait_time` constant) ?

🎁 Feel free to use the `return_significative_coef(model)` function coded for you in `olist/utils.py`

In [0]:
# YOUR CODE HERE
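💡 A sketch of how a categorical regression that holds `wait_time` constant could be set up with the `statsmodels` formula API. It uses synthetic per-product rows (not Olist data); the category names and the "ground truth" effects are assumptions built into the toy data. Note that it is fitted on product-level rows: with one row per category, the category dummies alone would fit the data perfectly and nothing could be learned.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic per-product data (NOT real Olist data)
rng = np.random.default_rng(42)
n = 600
category = rng.choice(["auto", "books", "office_furniture"], size=n)

# In this toy world, furniture simply takes longer to deliver
wait_time = np.where(
    category == "office_furniture", rng.normal(20, 4, n), rng.normal(8, 3, n)
)

# Assumed ground truth: wait_time hurts, books get a bonus,
# furniture has NO effect of its own
review_score = (
    4.6 - 0.06 * wait_time + 0.3 * (category == "books") + rng.normal(0, 0.4, n)
)

toy = pd.DataFrame(
    {"category": category, "wait_time": wait_time, "review_score": review_score}
)

# C(category) creates one dummy per category; the first level in
# alphabetical order ("auto" here) is the reference level
model = smf.ols("review_score ~ C(category) + wait_time", data=toy).fit()

# Keep only the coefficients significant at the 5% level
significant = model.pvalues[model.pvalues < 0.05].index
print(model.params[significant])
```

Each category coefficient reads as the difference in `review_score` relative to the reference category, at equal `wait_time`.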

☝️ The furniture categories are not in the list of significant coefficients! 

😉 The low `review_score` for furniture may result from the delivery rather than from the product itself! 

💡 In contrast, books consistently receive higher reviews, even after accounting for their generally quicker delivery time. 

🏁 Congratulations on completing this final challenge! 

👉 Don't forget to commit and push your analysis :)