# Exploratory Analysis

In [6]:
import numpy as np
import pandas as pd
%load_ext autoreload
%autoreload 2
import os

In [7]:
from olist.data import Olist
data = Olist().get_data()
data.keys()


dict_keys(['order_reviews', 'product_category_name_translation', 'order_payments', 'sellers', 'products', 'customers', 'order_items', 'geolocation', 'orders'])

Each transaction on the Olist ecommerce platform is characterized by:
- a `customer_id`, that would buy...
- various `product_id`...
- from a `seller_id`...
- and leave a `review_id`...
- all this belonging to an `order_id`
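
To make these relationships concrete, here is a minimal sketch on toy data. The values are invented for illustration and are not Olist data; only the key column names come from the dataset. It shows how `order_id` ties the other identifiers together through merges.

```python
import pandas as pd

# Toy stand-ins for three Olist tables (values are invented for illustration).
orders = pd.DataFrame({
    "order_id": ["o1", "o2"],
    "customer_id": ["c1", "c2"],
})
order_items = pd.DataFrame({
    "order_id": ["o1", "o1", "o2"],
    "product_id": ["p1", "p2", "p1"],
    "seller_id": ["s1", "s2", "s1"],
})
order_reviews = pd.DataFrame({
    "review_id": ["r1"],
    "order_id": ["o1"],
})

# order_id is the key that links customers, products, sellers and reviews.
transactions = (
    orders
    .merge(order_items, on="order_id", how="left")
    .merge(order_reviews, on="order_id", how="left")
)
print(transactions)
```

In this toy example, order `o1` contains two items and one review, while order `o2` has no review, so its `review_id` is missing after the left merge.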

## 1 - Run an automated exploratory analysis with [pandas profiling](https://github.com/pandas-profiling/pandas-profiling)

In [8]:
# First, let's install the pandas-profiling package
! pip install --quiet pandas==1.4.4 pandas-profiling==3.3.0


In [9]:
# Then create a "reports" directory (-p avoids an error if it already exists)
!mkdir -p reports


In [10]:
import pandas_profiling
datasets_to_profile = ['orders', 'products', 'sellers',
                  'customers', 'order_reviews',
                  'order_items']

👉 Create and save one `html report` per dataset to profile 

⏳ (It usually takes a few minutes)
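
Since full profiling takes a while, a quick pandas-only overview can serve as a first sanity check. The sketch below is not part of pandas profiling: `quick_overview` is a hypothetical helper written for this example, and the toy frame uses invented values in place of a real Olist table.

```python
import pandas as pd


def quick_overview(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, number of unique values, share of missing values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_unique": df.nunique(),          # NaN is not counted as a value
        "missing_share": df.isna().mean(),  # fraction of missing entries
    })


# Toy frame standing in for one of the Olist tables (invented values).
toy = pd.DataFrame({
    "order_id": ["o1", "o2", "o3"],
    "order_status": ["delivered", "delivered", None],
})
overview = quick_overview(toy)
print(overview)
```

The same function could be applied to each entry of `data`, for example `quick_overview(data["orders"])`.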

In [11]:
# YOUR CODE HERE
# Loop over each dataset and profile it
for dataset in datasets_to_profile:
    # Skip names that are not keys of the data dictionary
    if dataset not in data:
        print(f"Dataset '{dataset}' not found in data dictionary.")
        continue

    # Generate the pandas profiling report for this DataFrame
    report = pandas_profiling.ProfileReport(data[dataset])

    # Save the report as HTML inside the "reports" directory
    report.to_file(f'reports/{dataset}_report.html')


Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]

Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]

Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]

Export report to file:   0%|          | 0/1 [00:00<?, ?it/s]


## 2 - Investigate the cardinalities of your various DataFrames

❓ **How many unique `orders`, `reviews`, `sellers`, `products` and `customers` are there?**  
(You can use pandas profiling or pandas methods in your notebook if you prefer)

In [57]:
# YOUR CODE HERE
orders = data["orders"]
reviews = data["order_reviews"]
sellers = data["sellers"]
products = data["products"]
customers = data["customers"]

# Count the distinct identifiers in each table
unique_orders = orders["order_id"].nunique()
unique_reviews = reviews["review_id"].nunique()
unique_sellers = sellers["seller_id"].nunique()
unique_products = products["product_id"].nunique()
unique_customers = customers["customer_unique_id"].nunique()
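
Why count `customer_unique_id` rather than `customer_id`? To my understanding of the dataset documentation, a new `customer_id` is issued for each order, so a returning customer shows up under several `customer_id` values. The sketch below illustrates the difference on a toy table with invented values.

```python
import pandas as pd

# Toy customers table (invented values): the same person placed two orders,
# so they appear under two customer_id values but one customer_unique_id.
customers = pd.DataFrame({
    "customer_id": ["c1", "c2", "c3"],
    "customer_unique_id": ["u1", "u1", "u2"],
})

# Distinct values per identifier column.
cardinalities = pd.Series({
    "customer_id": customers["customer_id"].nunique(),
    "customer_unique_id": customers["customer_unique_id"].nunique(),
})
print(cardinalities)
```

Counting `customer_id` here would report three customers, while only two distinct people are present.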


❓ **How many reviews are there per order? Do we have reviews for all orders?**
<details>
    <summary markdown='span'>Hints</summary>

This information is not directly available in any single CSV file. You'll need to merge DataFrames.
</details>

In [83]:
# YOUR CODE HERE
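# Left merge keeps every order, including those without a review
merged_df = pd.merge(orders, reviews, on='order_id', how='left')
# Rows in the merged frame (an order with several reviews appears several times)
num_order_items = merged_df['order_id'].count()
# Rows that actually carry a review (count() ignores NaN)
num_reviews = merged_df['review_id'].count()
# Share of merged rows that carry a review
per_order_review = num_reviews / num_order_items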


99224
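
To answer "how many reviews per order" more directly, one option is to group the merged frame by `order_id` and count reviews in each group. This is a sketch on toy data with invented values, not the challenge's reference solution.

```python
import pandas as pd

# Toy tables (invented values): o1 has two reviews, o2 has one, o3 has none.
orders = pd.DataFrame({"order_id": ["o1", "o2", "o3"]})
reviews = pd.DataFrame({
    "review_id": ["r1", "r2", "r3"],
    "order_id": ["o1", "o1", "o2"],
})

merged = orders.merge(reviews, on="order_id", how="left")

# count() ignores NaN, so orders without a review get 0.
reviews_per_order = merged.groupby("order_id")["review_id"].count()

# How many orders have 0, 1, 2, ... reviews.
distribution = reviews_per_order.value_counts().sort_index()
print(distribution)
```

On the real data, the same two lines applied to `merged_df` would show how many orders have zero, one, or several reviews.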

🧪 **Test your code below**

Store the number of orders with missing reviews as `int` in a variable named `n_missing_reviews`

In [85]:
# YOUR CODE HERE
# Orders without a review have a NaN review_id after the left merge
n_missing_reviews = int(merged_df['review_id'].isna().sum())
n_missing_reviews

768
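
The count of orders without reviews can also be cross-checked in two other ways: an anti-join with `isin`, or a left merge with `indicator=True`. The sketch below uses toy tables with invented values; both approaches should agree.

```python
import pandas as pd

# Toy tables (invented values): only o3 has no review.
orders = pd.DataFrame({"order_id": ["o1", "o2", "o3"]})
reviews = pd.DataFrame({
    "review_id": ["r1", "r2", "r3"],
    "order_id": ["o1", "o1", "o2"],
})

# Anti-join: orders whose order_id never appears in the reviews table.
has_review = orders["order_id"].isin(reviews["order_id"])
n_missing_isin = int((~has_review).sum())

# Same count via a left merge with indicator=True.
# Dropping duplicate order_ids first keeps one row per order.
flagged = orders.merge(
    reviews[["order_id"]].drop_duplicates(),
    on="order_id",
    how="left",
    indicator=True,
)
n_missing_merge = int((flagged["_merge"] == "left_only").sum())

print(n_missing_isin, n_missing_merge)
```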

In [86]:
from nbresult import ChallengeResult

result = ChallengeResult('exploratory',
    n=n_missing_reviews
)
result.write()
print(result.check())


platform linux -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /home/saikotdasjoy/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /home/saikotdasjoy/code/Saikot1997/data-exploratory-analysis/tests
plugins: asyncio-0.19.0, anyio-3.6.2
asyncio: mode=strict
collecting ... collected 1 item

test_exploratory.py::TestExploratory::test_n_missing_reviews PASSED      [100%]



💯 You can commit your code:

git add tests/exploratory.pickle

git commit -m 'Completed exploratory step'

git push origin master

