## Spark warmup

🎯 The goal of this exercise is to learn PySpark syntax. It consists of 5 questions, each tested directly in the notebook!

## Setup

The next cell creates the Spark session and loads the data for you.

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Olist Exercises").getOrCreate()

df_customers = spark.read.csv("data/olist_customers_dataset.csv", header=True, inferSchema=True)
df_geolocation = spark.read.csv("data/olist_geolocation_dataset.csv", header=True, inferSchema=True)
df_items = spark.read.csv("data/olist_order_items_dataset.csv", header=True, inferSchema=True)
df_payments = spark.read.csv("data/olist_order_payments_dataset.csv", header=True, inferSchema=True)
df_reviews = spark.read.csv("data/olist_order_reviews_dataset.csv", header=True, inferSchema=True)
df_orders = spark.read.csv("data/olist_orders_dataset.csv", header=True, inferSchema=True)
df_products = spark.read.csv("data/olist_products_dataset.csv", header=True, inferSchema=True)
df_sellers = spark.read.csv("data/olist_sellers_dataset.csv", header=True, inferSchema=True)
df_category_translation = spark.read.csv("data/product_category_name_translation.csv", header=True, inferSchema=True)


**Question 1:** Calculate the count of each order status and save the result as `df_order_status_count`.


In [None]:
# YOUR CODE HERE

🧪 **Testing** Run the cell below

In [None]:
from nbresult import ChallengeResult

pandas_df_order_status_count = df_order_status_count.toPandas()

result = ChallengeResult('order_status_count',
    order_status_count_df=pandas_df_order_status_count
)
result.write()
print(result.check())


**Question 2:** Calculate the Customer Lifetime Value (CLV) for each customer. CLV is calculated as the sum of the payment values for each customer. Save the resulting Spark DataFrame in a variable called `df_clv`. The CLV column should be named `CLV`.
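
To make the definition concrete, here is a small plain-Python sketch of "sum of payment values per customer". The customer names and amounts are invented toy data, not the Olist schema, and this is not the PySpark solution. It only shows the group-then-sum shape that the Spark version needs to reproduce (in PySpark this kind of aggregation is typically written with `groupBy` and `agg`).

```python
from collections import defaultdict

# Toy (customer, payment_value) pairs, invented for illustration only.
payments = [
    ("alice", 20.0),
    ("bob", 5.5),
    ("alice", 12.5),
    ("bob", 4.5),
]

# Group by customer and sum the payment values: this total is the CLV.
clv = defaultdict(float)
for customer, payment_value in payments:
    clv[customer] += payment_value

print(dict(clv))
```

Each customer ends up with a single row holding their total, which is the shape `df_clv` should have.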


In [None]:
# YOUR CODE HERE

🧪 **Testing** Run the cell below

In [None]:
from nbresult import ChallengeResult

df_clv_pd = df_clv.toPandas()

result = ChallengeResult('customer_clv',
    df_clv_pd=df_clv_pd
)
result.write()
print(result.check())


**Question 3:** Calculate the total revenue for each seller. Save the resulting Spark DataFrame in a variable called `df_seller_revenue`. The total revenue column should be named `Total_Revenue`.

In [None]:
# YOUR CODE HERE

🧪 **Testing** Run the cell below

In [None]:
from nbresult import ChallengeResult

df_seller_revenue_pd = df_seller_revenue.toPandas()

result = ChallengeResult('seller_revenue',
    df_seller_revenue_pd=df_seller_revenue_pd
)
result.write()
print(result.check())


**Question 4:** Identify the top 10 product categories with the highest **count** of canceled or unavailable orders. Save the result in a Spark DataFrame called `df_high_cancel_categories`. The column for total instances should be named `Total_Canceled_Unavailable`.


In [None]:
# YOUR CODE HERE

🧪 **Testing** Run the cell below

In [None]:
from nbresult import ChallengeResult

df_high_cancel_categories_pd = df_high_cancel_categories.toPandas()

result = ChallengeResult('high_cancel_categories',
    df_high_cancel_categories_pd=df_high_cancel_categories_pd
)
result.write()
print(result.check())


**Question 5:** Calculate the average extended delivery time for each product category. The extended delivery time is the actual delivery time minus the estimated delivery time. Save the result in a variable called `df_delivery_time`. The `Avg_Extended_Delivery` column should contain the number of hours (as a float). Sort the result in descending order and limit it to the 10 product categories with the highest average extended delivery time.

<details>
    <summary>💡Hint</summary>

Break up the transformation into a few steps! Try creating `actual_delivery_time` and `estimated_delivery_time` columns on a DataFrame where that makes sense.

</details>
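
The arithmetic behind the extended delivery time can be sketched in plain Python for a single order. The timestamps are invented, and measuring both durations from the purchase time is an assumption about how to read the hint, not something the exercise states. This is not the PySpark solution; it only pins down the unit (hours, as a float) and the sign convention.

```python
from datetime import datetime

# Toy timestamps for one order, invented for illustration only.
purchased = datetime(2018, 1, 1, 12, 0)
estimated = datetime(2018, 1, 10, 12, 0)
delivered = datetime(2018, 1, 11, 18, 30)

SECONDS_PER_HOUR = 3600

# Assumption: both durations are measured from the purchase time, in hours.
actual_delivery_time = (delivered - purchased).total_seconds() / SECONDS_PER_HOUR
estimated_delivery_time = (estimated - purchased).total_seconds() / SECONDS_PER_HOUR

# Actual minus estimated: a positive value means the order arrived late.
extended_delivery_time = actual_delivery_time - estimated_delivery_time

print(extended_delivery_time)  # 30.5
```

Here the order arrives 30.5 hours after the estimated date. The Spark version computes the same quantity per order and then averages it per product category.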

In [None]:
# YOUR CODE HERE

🧪 **Testing** Run the cell below

In [None]:
from nbresult import ChallengeResult

df_delivery_time_pd = df_delivery_time.toPandas()

result = ChallengeResult('delivery_time',
    df_delivery_time_pd=df_delivery_time_pd
)
result.write()
print(result.check())


## Finished 🏁

Now that you have finished the notebook, you should have a good grounding in how to use PySpark!

You can now run the tests:

```bash
make test
```

Make sure to commit and push all your changes to GitHub so you can track your progress on Kitt!

Time to move on to the next exercise!