# Sellers

The following analysis of Sellers is similar to that of Products.
Our goal is to find Sellers that repeatedly underperform others, and to understand why.
This will help us shape our recommendations on how to improve Olist's profit margin.

## 0 -  Code `get_training_data` in olist/seller.py

- Create the method `get_training_data` in `olist/seller.py` that will return the following DataFrame:

  - `seller_id` (_str_) _the id of the seller_
  - `seller_state` (_str_) _the state where seller is located_
  - `seller_city` (_str_) _the city where seller is located_
  - `delay_to_carrier` (_float_) _if the order is handed to the carrier after the shipping limit date, the number of days between the two dates, otherwise 0_
  - `wait_time` (_float_) _Average number of days customers waited_
  - `share_of_five_stars` (_float_) _The share of five-star reviews for orders in which the seller was involved_
  - `share_of_one_stars` (_float_) _The share of one-star reviews for orders in which the seller was involved_
  - `review_score` (_float_) _The average review score for orders in which the seller was involved_
  - `n_orders` (_int_) _The number of unique orders the seller was involved with._
  - `quantity` (_int_) _The total number of items sold by this seller_
  - `sales` (_float_) _The total sales associated with this seller (excluding freight value)_
  - `date_first_sale` (_datetime_) _Date of first sales on Olist_
  - `date_last_sale` (_datetime_) _Date of last sales on Olist_
  
Feel free to code all the intermediary methods below if you prefer to break the problem down step by step.

✅ Once your logic is coded, commit and push your new file `seller.py`

In [None]:
%load_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf

### `get_seller_features`
Returns a DataFrame with: 'seller_id', 'seller_city', 'seller_state'

### `get_seller_delay_wait_time`
Returns a DataFrame with: 'seller_id', 'delay_to_carrier', 'wait_time'
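
As a hedged illustration of the two definitions above, here is a minimal sketch on toy data. The column names (`shipping_limit_date`, `order_delivered_carrier_date`, `order_purchase_timestamp`, `order_delivered_customer_date`) are assumptions about what your merged orders and order items tables contain, and averaging per seller is one possible aggregation choice, not the required one.

```python
import pandas as pd

# Toy stand-in for orders merged with order items (column names are assumptions).
ship = pd.DataFrame({
    "seller_id": ["s1", "s1", "s2"],
    "shipping_limit_date": pd.to_datetime(["2018-01-05", "2018-01-10", "2018-01-05"]),
    "order_delivered_carrier_date": pd.to_datetime(["2018-01-07", "2018-01-09", "2018-01-04"]),
    "order_purchase_timestamp": pd.to_datetime(["2018-01-01", "2018-01-06", "2018-01-01"]),
    "order_delivered_customer_date": pd.to_datetime(["2018-01-11", "2018-01-12", "2018-01-09"]),
})

one_day = pd.Timedelta(days=1)

# Days past the shipping limit, floored at 0 for orders handed over on time.
ship["delay_to_carrier"] = (
    (ship["order_delivered_carrier_date"] - ship["shipping_limit_date"]) / one_day
).clip(lower=0)

# Days between purchase and delivery to the customer.
ship["wait_time"] = (
    ship["order_delivered_customer_date"] - ship["order_purchase_timestamp"]
) / one_day

# One row per seller, averaging both metrics (an assumed aggregation).
seller_delay = ship.groupby("seller_id", as_index=False)[
    ["delay_to_carrier", "wait_time"]
].mean()
print(seller_delay)
```

Dividing a timedelta column by `pd.Timedelta(days=1)` yields fractional days as floats, which matches the `_float_` type expected for both columns.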

### `get_active_dates`
Returns a DataFrame with 'seller_id', 'date_first_sale', 'date_last_sale'
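
One possible way to get both dates in a single pass is a named aggregation, sketched below on toy data. Using `order_approved_at` as the date of a sale is an assumption; you may prefer another timestamp column.

```python
import pandas as pd

# Toy orders already joined to their seller (column names are assumptions).
orders = pd.DataFrame({
    "seller_id": ["s1", "s1", "s2"],
    "order_approved_at": pd.to_datetime(["2017-03-01", "2018-06-15", "2017-11-20"]),
})

# Earliest and latest sale date per seller.
active_dates = (
    orders.groupby("seller_id")["order_approved_at"]
    .agg(date_first_sale="min", date_last_sale="max")
    .reset_index()
)
print(active_dates)
```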

### `get_review_score`
Returns a DataFrame with: 'seller_id', 'share_of_five_stars', 'share_of_one_stars', 'review_score'
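
A share is simply the mean of a boolean flag, so the three columns can be computed together. The sketch below uses toy data and assumes one row per (order, seller) pair with its `review_score` already attached.

```python
import pandas as pd

# Toy reviews, one row per (order, seller) pair (an assumed input shape).
reviews = pd.DataFrame({
    "seller_id": ["s1", "s1", "s1", "s1", "s2", "s2"],
    "review_score": [5, 5, 1, 3, 4, 5],
})

# Boolean flags whose per-seller mean is the share we are after.
scores = reviews.assign(
    is_five_star=reviews["review_score"].eq(5),
    is_one_star=reviews["review_score"].eq(1),
)

seller_reviews = (
    scores.groupby("seller_id")
    .agg(
        share_of_five_stars=("is_five_star", "mean"),
        share_of_one_stars=("is_one_star", "mean"),
        review_score=("review_score", "mean"),
    )
    .reset_index()
)
print(seller_reviews)
```

If an order contains several items from the same seller, you may want to drop duplicate (order, seller) pairs first so that the review is not counted more than once.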

### `get_quantity`
Returns a DataFrame with: 'seller_id', 'n_orders', 'quantity'
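
The difference between `n_orders` (unique orders) and `quantity` (items sold) is easy to get wrong, so here is a small sketch on toy data. It assumes an order items table with one row per item sold and columns named `order_id` and `order_item_id`.

```python
import pandas as pd

# Toy order items: order o1 contains two items from seller s1.
items = pd.DataFrame({
    "order_id": ["o1", "o1", "o2", "o3"],
    "order_item_id": [1, 2, 1, 1],
    "seller_id": ["s1", "s1", "s1", "s2"],
})

seller_quantity = (
    items.groupby("seller_id")
    .agg(
        n_orders=("order_id", "nunique"),     # unique orders
        quantity=("order_item_id", "count"),  # rows, i.e. items sold
    )
    .reset_index()
)
print(seller_quantity)
```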

### `get_sales`
Returns a DataFrame with: 'seller_id', 'sales'
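
Once every intermediary method returns one row per seller, `get_training_data` can be little more than a chain of merges on `seller_id`. The sketch below, on toy data, sums `price` only (so freight is excluded) and merges the result with a stand-in for the seller features. The column names `price` and `freight_value` and the choice of an inner join are assumptions.

```python
from functools import reduce

import pandas as pd

# Toy order items (column names are assumptions).
items = pd.DataFrame({
    "seller_id": ["s1", "s1", "s2"],
    "price": [10.0, 20.5, 5.0],
    "freight_value": [2.0, 3.0, 1.0],
})

# Total sales per seller, excluding freight.
sales = (
    items.groupby("seller_id", as_index=False)["price"]
    .sum()
    .rename(columns={"price": "sales"})
)

# Stand-in for the output of another intermediary method.
features = pd.DataFrame({
    "seller_id": ["s1", "s2", "s3"],
    "seller_state": ["SP", "RJ", "MG"],
})

# Merge all per-seller frames on seller_id; s3 has no sales and is dropped.
training = reduce(
    lambda left, right: left.merge(right, on="seller_id", how="inner"),
    [features, sales],
)
print(training)
```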

## 1 - Exploration

❓ Let's start with some exploratory analysis of the seller distributions:

- Plot the distribution of each numerical variable of the dataset in one large figure
- Do you notice any outliers?
- What's the median number of orders per seller? What does the distribution of that variable look like?
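
Alongside the plots, a few numbers can help answer the questions above. The sketch below uses synthetic, right-skewed data (not Olist data) to show how the median, the skewness and a simple 1.5 × IQR fence could be computed for `n_orders`. The IQR rule is only one common convention for flagging outliers.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed order counts, for illustration only.
rng = np.random.default_rng(0)
sellers = pd.DataFrame({
    "n_orders": rng.lognormal(mean=2.0, sigma=1.2, size=500).round().astype(int) + 1
})

median_orders = sellers["n_orders"].median()
mean_orders = sellers["n_orders"].mean()
skewness = sellers["n_orders"].skew()  # > 0 means a long right tail

# Flag values above the upper 1.5 * IQR fence.
q1, q3 = sellers["n_orders"].quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
n_outliers = int((sellers["n_orders"] > upper_fence).sum())

print(median_orders, mean_orders, skewness, n_outliers)
```

A mean well above the median, together with a positive skewness, suggests that a few large sellers account for a large share of the orders.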

----
💡 There seems to be a group of sellers that stands out with very low review scores! Let's investigate a bit more:

❓ Using plotly, create a scatterplot of `review_score` against `n_orders`, varying bubble size by total `sales` for that seller, and coloring by `seller_state`.

- Can you spot categories of underperforming sellers?
- Experiment with other x-axis features, color categories, and also with `share_of_one_stars` instead of `review_score` on the y-axis
- Remember that Olist earns revenue proportional to the sale prices, and incurs a cost penalty for each low review!
- Can you think of a strategy to improve Olist's profit margin, as per the CEO's request? (keep it in mind for later!)

## 2 - Delay to Carrier

The variable `delay_to_carrier` measures the number of days between the shipping limit date imposed by Olist and the date the seller actually handed the order to the carrier (0 if the seller was on time).

- What's the share of sellers that have an average `delay_to_carrier` above 0?
- Model the impact of `delay_to_carrier` on `review_score`. Start with a simple correlation matrix, then move to a multivariate OLS in `statsmodels`: try to isolate the specific impact of this delay, holding other correlated variables constant. What do you conclude?
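
To make the workflow concrete, here is a minimal sketch on synthetic data. The coefficients used to generate the data are made up and say nothing about the real Olist results. It shows a correlation matrix followed by a multivariate OLS in which `wait_time` is included as a control.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic sellers: wait_time is correlated with delay_to_carrier by construction.
rng = np.random.default_rng(42)
n = 400
delay = rng.exponential(1.0, n)
wait = 8 + 2 * delay + rng.normal(0, 2, n)
review = 4.6 - 0.25 * delay - 0.03 * wait + rng.normal(0, 0.3, n)

toy = pd.DataFrame({
    "delay_to_carrier": delay,
    "wait_time": wait,
    "review_score": review,
})

# Step 1: pairwise correlations.
corr = toy.corr()

# Step 2: multivariate OLS, holding wait_time constant.
model = smf.ols("review_score ~ delay_to_carrier + wait_time", data=toy).fit()
coefs = model.params

print(corr.round(2))
print(coefs.round(3))
```

The coefficient of `delay_to_carrier` is the expected change in `review_score` for one extra day of delay at a fixed `wait_time`. To compare the strength of several features, you may want to standardize them first.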

----

<details>
    <summary>💡 Insight!</summary>
Contrary to our analysis at the individual order level, `delay_to_carrier` has an even stronger effect than `wait_time` in driving lower reviews!
</details>

## 3 - Seller location

Let's investigate the performance of each seller state:

- Create an aggregation table `seller_state` aggregating the features of your choice at state level
- What's the share of orders per seller state? Is it concentrated or distributed across Brazil?
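
One possible shape for this table is sketched below on toy data. The choice of aggregated features and the `share_of_orders` column name are illustrative assumptions.

```python
import pandas as pd

# Toy seller-level training data.
sellers = pd.DataFrame({
    "seller_id": ["s1", "s2", "s3", "s4"],
    "seller_state": ["SP", "SP", "RJ", "MG"],
    "n_orders": [60, 20, 15, 5],
    "wait_time": [10.0, 12.0, 14.0, 16.0],
    "review_score": [4.2, 4.0, 3.8, 4.5],
})

# One row per state.
seller_state = sellers.groupby("seller_state").agg(
    n_sellers=("seller_id", "count"),
    n_orders=("n_orders", "sum"),
    wait_time=("wait_time", "mean"),
    review_score=("review_score", "mean"),
)

# Share of all orders handled by sellers of each state.
seller_state["share_of_orders"] = (
    seller_state["n_orders"] / seller_state["n_orders"].sum()
)
seller_state = seller_state.sort_values("n_orders", ascending=False)
print(seller_state)
```

Note that the plain mean of seller-level averages gives each seller the same weight. Weighting by `n_orders` is an alternative if you want order-level averages per state.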


❓ We now want to explore the impact of `seller_state` on `review_score` and `wait_time`. Our hypothesis is that some states, being more remote, can have longer delivery times to the customer.

- Using plotly, create a scatterplot of the states that had more than 100 orders, with `n_orders` on the x-axis and `wait_time` on the y-axis. Color by `review_score`.

- Model the impact of each `seller_state` on `wait_time`. Which locations affect `wait_time` the most?
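
A categorical feature such as `seller_state` can be passed to the `statsmodels` formula API with `C(...)`, which creates one dummy per state relative to a reference state. The sketch below uses synthetic data with made-up state effects, and picking `SP` as the reference is an arbitrary assumption.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic sellers: 100 per state, with made-up average wait times.
rng = np.random.default_rng(1)
states = np.repeat(["SP", "RJ", "AM"], 100)
base = pd.Series(states).map({"SP": 10.0, "RJ": 12.0, "AM": 20.0}).to_numpy()
toy = pd.DataFrame({
    "seller_state": states,
    "wait_time": base + rng.normal(0, 1.5, len(states)),
})

# Each coefficient is the extra wait time (in days) relative to SP.
model = smf.ols(
    "wait_time ~ C(seller_state, Treatment(reference='SP'))", data=toy
).fit()
state_effects = model.params.drop("Intercept").sort_values(ascending=False)
print(state_effects.round(2))
```

Before drawing conclusions on real data, check the p-values in `model.summary()`: states with few sellers can show large but statistically insignificant coefficients.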

❓ All together now! Let's model the impact of each variable on our target variable `review_score`:

Run an OLS model with all the previous numerical features, plus the categorical feature `seller_state`.

Feel free to use `return_significative_coef(model)`, coded for you in `olist/utils.py`, to explore all significant coefficients at once.

Let's now model the impact of each variable on `avg_review_score`:

✅ Congratulations on completing this challenge! Commit and push your notebook before moving on to the next one!