In [None]:
%run ../../_pre_run.ipynb

# Initial Data Filtering

**Selecting the Time Period for Analysis**

Let's count the orders approved each month through March 2017, to see how the months before 2017 compare with the first months of 2017.

In [None]:
df_orders[df_orders.order_approved_dt < '2017-04-01'].groupby(pd.Grouper(key='order_approved_dt', freq='ME')).agg({'order_id': 'nunique'})

**Key Observations:**

- Before 2017, either very few orders were placed or the data is incomplete, so it is reasonable to start the analysis from 2017.
- If we included data before 2017, metrics for that period would be unreliable, since only October 2016 has a reasonably large number of orders. On graphs, the small group sizes would show up as anomalous values in average metrics.
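The small-sample concern above can be sketched on toy data: count unique orders per month and flag months that fall below a minimum. This is an illustrative sketch only; `toy_orders`, its values, and the `MIN_ORDERS` threshold are assumptions, not part of the original analysis.

```python
import pandas as pd

# Toy stand-in for df_orders; the dates are made up for illustration.
toy_orders = pd.DataFrame({
    "order_id": ["a", "b", "c", "d", "e"],
    "order_approved_dt": pd.to_datetime([
        "2016-10-04", "2016-10-20", "2016-12-23", "2017-01-05", "2017-01-17",
    ]),
})

# One bin per calendar month ("MS" labels each bin by its first day).
# Months with no orders still appear, with a count of 0.
monthly = (
    toy_orders
    .groupby(pd.Grouper(key="order_approved_dt", freq="MS"))["order_id"]
    .nunique()
)

# Hypothetical threshold: months below it are too thin for stable averages.
MIN_ORDERS = 2
sparse = monthly[monthly < MIN_ORDERS]
print(sparse)
```

On the real data, the threshold would be a judgment call that depends on which metrics are averaged per month.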

Let's see how many orders were approved after August 2018.

In [None]:
df_orders[df_orders.order_approved_dt > '2018-09-01']

**Key Observations:**  

- Only one order. We conclude that data after August 2018 is incomplete.

Let's examine which orders were created after August 2018.

In [None]:
df_orders[df_orders.order_purchase_dt > '2018-09-01']

**Key Observations:**  

- All orders created after August 2018 were canceled, except for one. This means we can safely cut the data off at the end of August 2018.
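A check like the one behind this observation can be sketched as a status breakdown of the orders past the cutoff. The column name `order_status` and the toy values are assumptions made for illustration; the real table may name the column differently.

```python
import pandas as pd

# Toy stand-in for df_orders; `order_status` is an assumed column name.
toy_orders = pd.DataFrame({
    "order_id": ["a", "b", "c", "d"],
    "order_purchase_dt": pd.to_datetime(
        ["2018-08-30", "2018-09-03", "2018-09-10", "2018-10-01"]
    ),
    "order_status": ["delivered", "canceled", "canceled", "shipped"],
})

# Orders created on or after the cutoff date.
late = toy_orders[toy_orders.order_purchase_dt >= "2018-09-01"]

# Number of late orders in each status.
status_counts = late["order_status"].value_counts()
print(status_counts)
```

If nearly all late orders are canceled, dropping that tail removes little usable information.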

As we determined, there were very few sales before January 2017, or the data is incomplete, so for more accurate analysis, we will consider data from January 2017.

Additionally, there is only one approved order after August 2018. Very few orders were created after that date, and all but one were canceled, so we will analyze data up to and including August 2018 to avoid distorting results with incomplete data.

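At the same time, it's important not to lose rows with a missing order_approved_dt, so we filter on order_purchase_dt instead and also keep any rows where the purchase date itself is missing.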

In [None]:
df_orders = df_orders[
    df_orders.order_purchase_dt.between(pd.to_datetime('2017-01-01'), pd.to_datetime('2018-09-01'), inclusive='left')
    | df_orders.order_purchase_dt.isna()
]

We’ll do the same for the reviews table, but trim only the lower bound, since reviews are created after the orders they refer to.

In [None]:
df_reviews = df_reviews[
    (df_reviews.review_creation_dt >= pd.to_datetime('2017-01-01'))
    | df_reviews.review_creation_dt.isna()
]

---

**Filtering by Order Presence**

To ensure data integrity, we keep only those records in the related tables that have a matching key (customer_id or order_id) in the orders table.
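The idea can be sketched in plain pandas as a semi-join with `isin`: measure what share of rows has a matching key, then keep only those rows. This is an alternative illustration on toy data, not the method used in the cells that follow; the table contents are made up.

```python
import pandas as pd

# Toy tables: one payment refers to an order ("o9") that is not in orders.
toy_orders = pd.DataFrame({"order_id": ["o1", "o2", "o3"]})
toy_payments = pd.DataFrame({
    "order_id": ["o1", "o1", "o2", "o9"],
    "payment_value": [10.0, 5.0, 20.0, 7.0],
})

# Boolean mask: does each payment row have a matching order?
has_order = toy_payments["order_id"].isin(toy_orders["order_id"])

# Share of payment rows covered by the orders table.
coverage = has_order.mean()

# Semi-join: keep matching rows without adding columns or duplicating rows.
filtered = toy_payments[has_order]
print(coverage, len(filtered))
```

Unlike an inner merge, `isin` never duplicates rows of the left table when a key repeats on the right, so it is a useful cross-check when key uniqueness in the orders table has not been verified.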

We’ll keep only customers present in the orders table.

In [None]:
fron.analyze_join_keys(df_customers, df_orders, on='customer_id', how='inner')

In [None]:
df_customers = df_customers.merge(df_orders[['customer_id']], on='customer_id', how='inner')

In [None]:
fron.analyze_join_keys(df_customers, df_orders, on='customer_id', only_coverage=True)

We’ll keep only payments present in the orders table.

In [None]:
fron.analyze_join_keys(df_payments, df_orders, on='order_id', how='inner')

In [None]:
df_payments = df_payments.merge(df_orders[['order_id']], on='order_id', how='inner')

In [None]:
fron.analyze_join_keys(df_payments, df_orders, on='order_id', only_coverage=True)

We’ll keep only reviews present in the orders table.

In [None]:
fron.analyze_join_keys(df_reviews, df_orders, on='order_id', how='inner')

In [None]:
df_reviews = df_reviews.merge(df_orders[['order_id']], on='order_id', how='inner')

In [None]:
fron.analyze_join_keys(df_reviews, df_orders, on='order_id', only_coverage=True)

We’ll keep only order items present in the orders table.

In [None]:
fron.analyze_join_keys(df_items, df_orders, on='order_id', how='inner')

In [None]:
df_items = df_items.merge(df_orders[['order_id']], on='order_id', how='inner')

In [None]:
fron.analyze_join_keys(df_items, df_orders, on='order_id', only_coverage=True)

Orders that exist in the orders table but have no rows in the order items table must not be deleted, so we leave the orders table as it is.

We’ll keep all products to identify which ones were never purchased.

We’ll also keep all sellers to see whose products didn’t sell.
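Keeping the full tables makes it possible to find unmatched rows later with an anti-join. A minimal sketch, assuming a products table keyed by `product_id` and an items table that carries the same column; both names and the toy values are assumptions for illustration.

```python
import pandas as pd

# Toy tables: product "p3" never appears in any order item.
toy_products = pd.DataFrame({"product_id": ["p1", "p2", "p3"]})
toy_items = pd.DataFrame({
    "order_id": ["o1", "o1", "o2"],
    "product_id": ["p1", "p2", "p1"],
})

# Left merge against the distinct purchased ids; `indicator=True` adds a
# `_merge` column telling where each row was found.
flagged = toy_products.merge(
    toy_items[["product_id"]].drop_duplicates(),
    on="product_id",
    how="left",
    indicator=True,
)

# Rows found only on the left side are products with no purchases.
never_purchased = flagged.loc[flagged["_merge"] == "left_only", "product_id"]
print(never_purchased.tolist())  # ['p3']
```

The same pattern would apply to sellers with no sales, or to orders that have no rows in the items table, by swapping the tables and the key.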

In [None]:
%run ../../_post_run.ipynb