Most Lucrative Products

You have been asked to find the 5 most lucrative products in terms of total revenue for the first half of 2022 (from January to June inclusive).


Output their IDs and the total revenue.

In [2]:
import pandas as pd
import datetime as dt

In [6]:
online_orders = pd.read_csv("CSV/online_orders.csv")
online_orders = online_orders.iloc[:, :6]
online_orders.head(3)

Unnamed: 0,product_id,promotion_id,cost_in_dollars,customer_id,date,units_sold
0,1,1,2,1,2022-04-01,4
1,3,3,6,3,2022-05-24,6
2,1,2,2,10,2022-05-01,3


In [7]:
online_orders["date"] = pd.to_datetime(online_orders["date"])
online_orders.head(3)

Unnamed: 0,product_id,promotion_id,cost_in_dollars,customer_id,date,units_sold
0,1,1,2,1,2022-04-01,4
1,3,3,6,3,2022-05-24,6
2,1,2,2,10,2022-05-01,3


In [8]:
online_orders["month"] = online_orders["date"].dt.month
online_orders.head(3)

Unnamed: 0,product_id,promotion_id,cost_in_dollars,customer_id,date,units_sold,month
0,1,1,2,1,2022-04-01,4,4
1,3,3,6,3,2022-05-24,6,5
2,1,2,2,10,2022-05-01,3,5


In [9]:
quarter_sales = online_orders[
    (online_orders["month"] >= 1) & (online_orders["month"] <= 6)
]
quarter_sales.head(3)

Unnamed: 0,product_id,promotion_id,cost_in_dollars,customer_id,date,units_sold,month
0,1,1,2,1,2022-04-01,4,4
1,3,3,6,3,2022-05-24,6,5
2,1,2,2,10,2022-05-01,3,5


In [10]:
quarter_sales["total"] = (
    quarter_sales["cost_in_dollars"] * quarter_sales["units_sold"]
)
quarter_sales.head(3)

Unnamed: 0,product_id,promotion_id,cost_in_dollars,customer_id,date,units_sold,month,total
0,1,1,2,1,2022-04-01,4,4,8
1,3,3,6,3,2022-05-24,6,5,36
2,1,2,2,10,2022-05-01,3,5,6


In [11]:
products = (
    quarter_sales.groupby(by="product_id")[["total"]].agg(func="sum").reset_index()
)
products

Unnamed: 0,product_id,total
0,1,65
1,2,207
2,3,201
3,4,14
4,5,199
5,6,56
6,8,24
7,9,47
8,10,45
9,11,45


In [12]:
products["ranking"] = products["total"].rank(method="min", ascending=False)
products

Unnamed: 0,product_id,total,ranking
0,1,65,4.0
1,2,207,1.0
2,3,201,2.0
3,4,14,10.0
4,5,199,3.0
5,6,56,5.0
6,8,24,9.0
7,9,47,6.0
8,10,45,7.0
9,11,45,7.0


In [13]:
result = products[products["ranking"] <= 5][["product_id", "total"]].sort_values(
    "total", ascending=False
)

In [14]:
result

Unnamed: 0,product_id,total
1,2,207
2,3,201
4,5,199
0,1,65
5,6,56


Solution Walkthrough
This walkthrough explains how the notebook above finds the 5 most lucrative products by total revenue for the first half of 2022 (January to June inclusive).

We will be using the pandas library to perform various data manipulation and aggregation tasks.

Understanding The Data
Before diving into the code, let's understand the data we will be working with. The online_orders dataset contains information about online orders. After the notebook keeps the first six columns, it has 'product_id', 'promotion_id', 'cost_in_dollars', 'customer_id', 'date', and 'units_sold'. The 'month' column is not part of the raw data; we derive it from 'date'.

We will need to convert the 'date' column to datetime format using the pd.to_datetime() function and extract the 'month' from it using the dt.month property. This will let us keep only the January-to-June rows.

The Problem Statement
The goal is to find the 5 most lucrative products in terms of total revenue for the first half of 2022. We calculate the revenue of each order row by multiplying the 'cost_in_dollars' column with the 'units_sold' column, then sum those row revenues per product to get each product's total revenue.

Once we have calculated the total revenue for each product, we can determine the top 5 products based on their total revenue.

Breaking Down The Code
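The code snippet starts by importing pandas as pd and datetime as dt. Only pandas is actually used: the .dt that appears later is the pandas datetime accessor on a Series, not the datetime module, so the second import could be dropped.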

import pandas as pd
import datetime as dt
Next, it converts the 'date' column in the online_orders DataFrame to datetime format using the pd.to_datetime() function. This allows us to perform datetime operations on the 'date' column.

online_orders["date"] = pd.to_datetime(online_orders["date"])
Then, it extracts the 'month' from the 'date' column and assigns it to a new column called 'month' in the online_orders DataFrame. The dt.month property extracts the month component from each date.

online_orders["month"] = online_orders["date"].dt.month
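
As a small illustration of the two steps above, the sketch below parses the three dates shown in the notebook preview and reads their month and year. The variable names dates and parsed are made up for this example; note that no import of the datetime module is needed for the .dt accessor.

```python
import pandas as pd

# The three dates from the notebook preview, as plain strings.
dates = pd.Series(["2022-04-01", "2022-05-24", "2022-05-01"])

# Parse the strings into datetime64 values.
parsed = pd.to_datetime(dates)

# .dt is the pandas datetime accessor on a Series.
print(parsed.dt.month.tolist())  # [4, 5, 5]
print(parsed.dt.year.tolist())   # [2022, 2022, 2022]
```

The same accessor exposes the year, which is useful when a dataset may span more than one year.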
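After that, it filters the online_orders DataFrame to include only the rows where the 'month' is between 1 and 6 (inclusive). Note that this condition checks only the month, not the year, so it yields the first half of 2022 only if the dataset contains no orders from other years. Despite its name, quarter_sales holds half-year data.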

quarter_sales = online_orders[
    (online_orders["month"] >= 1) & (online_orders["month"] <= 6)
]
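
If the dataset could contain orders from other years, one possible alternative is to filter on the date itself rather than on the month. The sketch below is not the notebook's method; it assumes a small made-up DataFrame (the 2021 and July rows are hypothetical) and uses a half-open date range.

```python
import pandas as pd

# Hypothetical sample rows; the real data comes from online_orders.csv.
orders = pd.DataFrame(
    {
        "product_id": [1, 3, 1, 2],
        "date": pd.to_datetime(
            ["2022-04-01", "2022-05-24", "2021-05-01", "2022-07-15"]
        ),
    }
)

# Keep rows dated from 2022-01-01 up to, but not including, 2022-07-01.
start = pd.Timestamp("2022-01-01")
end = pd.Timestamp("2022-07-01")
first_half_2022 = orders[(orders["date"] >= start) & (orders["date"] < end)].copy()

print(first_half_2022["product_id"].tolist())  # [1, 3]
```

The May 2021 row would pass a month-only filter, but it is excluded here because the comparison includes the year.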
Next, it calculates the total revenue for each row by multiplying the 'cost_in_dollars' column with the 'units_sold' column and assigns it to a new column called 'total' in the quarter_sales DataFrame.

quarter_sales["total"] = (
    quarter_sales["cost_in_dollars"] * quarter_sales["units_sold"]
)
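
Depending on the pandas version, adding a column to a DataFrame that was produced by filtering another one can emit a SettingWithCopyWarning. A common way to avoid that, sketched below under the assumption that we want an independent table, is to call .copy() on the filtered result first. The name half_year is made up for this example; the three rows are the ones shown in the notebook preview.

```python
import pandas as pd

# The three preview rows from the notebook, reduced to the relevant columns.
orders = pd.DataFrame(
    {
        "product_id": [1, 3, 1],
        "cost_in_dollars": [2, 6, 2],
        "units_sold": [4, 6, 3],
        "month": [4, 5, 5],
    }
)

# .copy() makes the filtered frame independent of `orders`.
half_year = orders[orders["month"].between(1, 6)].copy()

# Revenue per order row = unit price * units sold.
half_year["total"] = half_year["cost_in_dollars"] * half_year["units_sold"]

print(half_year["total"].tolist())  # [8, 36, 6]
```

The totals match the 'total' column in the notebook output, and the original orders frame is left without a 'total' column.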
Then, it performs a groupby operation on the 'product_id' column in the quarter_sales DataFrame and calculates the sum of the 'total' column for each product. The result is stored in a new DataFrame called 'products'.

products = (
    quarter_sales.groupby(by="product_id")[["total"]]
    .agg(func="sum")
    .reset_index()
)
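
To make the aggregation concrete, here is a minimal sketch that sums the row revenues of the three preview rows per product. It uses as_index=False as an alternative to reset_index(); the names rows and per_product are made up for this example.

```python
import pandas as pd

# Row-level revenues for the three preview rows in the notebook.
rows = pd.DataFrame({"product_id": [1, 3, 1], "total": [8, 36, 6]})

# One output row per product, with product_id kept as a regular column.
per_product = rows.groupby("product_id", as_index=False)["total"].sum()

print(per_product)
```

Product 1 appears in two rows (8 and 6), so its summed total is 14, while product 3 keeps its single value of 36.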
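After that, it ranks each product by total revenue using the rank() method with ascending=False, so the highest total gets rank 1. With method="min", tied products all receive the lowest rank in their group and the following ranks are skipped: products 10 and 11 both total 45 and both get rank 7, so the next product gets rank 9. The rankings are stored in a new column called 'ranking' in the 'products' DataFrame.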

products["ranking"] = products["total"].rank(
    method="min", ascending=False
)
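
The effect of the tie-handling method is easiest to see side by side. The sketch below rebuilds the product totals from the notebook output and compares method="min" with method="dense"; the dense variant is shown only for contrast and is not used in the solution.

```python
import pandas as pd

# Product totals copied from the notebook output.
products = pd.DataFrame(
    {
        "product_id": [1, 2, 3, 4, 5, 6, 8, 9, 10, 11],
        "total": [65, 207, 201, 14, 199, 56, 24, 47, 45, 45],
    }
)

# "min": tied rows share the lowest rank and the next rank is skipped.
products["min_rank"] = products["total"].rank(method="min", ascending=False)

# "dense": tied rows share a rank and no rank is skipped afterwards.
products["dense_rank"] = products["total"].rank(method="dense", ascending=False)

print(products.sort_values("total", ascending=False))
```

With "min", the two products totalling 45 share rank 7 and the product totalling 24 gets rank 9; with "dense" that product would get rank 8 instead.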
Finally, it filters the 'products' DataFrame to include only the rows where the 'ranking' is less than or equal to 5, selects only the 'product_id' and 'total' columns, and sorts the DataFrame by the 'total' column in descending order. Because tied products share a rank, this filter can return more than five rows if several products tie for fifth place. Here there is no such tie, so the result is exactly the top 5 products with their total revenue.

result = products[products["ranking"] <= 5][
    ["product_id", "total"]
].sort_values("total", ascending=False)
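
When ranks are not needed in the output, DataFrame.nlargest offers a shorter route to the same rows. This is an alternative sketch, not the notebook's approach; it reuses the product totals from the notebook output, and keep="all" is chosen here so that products tied for the last place are all retained, similar to the rank-based filter.

```python
import pandas as pd

# Product totals copied from the notebook output.
products = pd.DataFrame(
    {
        "product_id": [1, 2, 3, 4, 5, 6, 8, 9, 10, 11],
        "total": [65, 207, 201, 14, 199, 56, 24, 47, 45, 45],
    }
)

# nlargest returns the rows with the highest totals, already sorted descending.
top5 = products.nlargest(5, "total", keep="all")[["product_id", "total"]]

print(top5)
```

For this data the selection matches the notebook's result: products 2, 3, 5, 1, and 6.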
Bringing It All Together
To find the 5 most lucrative products in terms of total revenue for the first half of 2022, we follow these steps:

1. Convert the 'date' column to datetime format using pd.to_datetime() and extract the 'month' using dt.month.
2. Filter the data for the first half of 2022 (January to June inclusive).
3. Calculate the revenue of each order row by multiplying 'cost_in_dollars' by 'units_sold'.
4. Group the data by 'product_id' and calculate the sum of the 'total' revenue for each product.
5. Assign a ranking to each product based on its total revenue.
6. Filter the data to include only the products whose ranking is 5 or better.
7. Output the 'product_id' and 'total' revenue for the top 5 products, sorted by total revenue in descending order.
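
The steps above can be sketched as a single function. This is one possible arrangement under stated assumptions, not the notebook's exact code: the function name top_products and its parameter n are hypothetical, the filter uses a date range so that the year is checked as well, and the sample rows beyond the three in the notebook preview are made up.

```python
import pandas as pd


def top_products(orders: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """Return the products ranked n or better by revenue for Jan-Jun 2022."""
    # Steps 1-2: parse dates and keep 2022-01-01 <= date < 2022-07-01.
    dates = pd.to_datetime(orders["date"])
    in_range = (dates >= pd.Timestamp("2022-01-01")) & (
        dates < pd.Timestamp("2022-07-01")
    )
    half = orders.loc[in_range].copy()

    # Step 3: revenue per order row.
    half["total"] = half["cost_in_dollars"] * half["units_sold"]

    # Step 4: total revenue per product.
    totals = half.groupby("product_id", as_index=False)["total"].sum()

    # Steps 5-7: rank, keep the best n ranks, sort by revenue.
    totals["ranking"] = totals["total"].rank(method="min", ascending=False)
    return (
        totals.loc[totals["ranking"] <= n, ["product_id", "total"]]
        .sort_values("total", ascending=False)
        .reset_index(drop=True)
    )


# Hypothetical sample: first three rows mirror the notebook preview.
sample = pd.DataFrame(
    {
        "product_id": [1, 3, 1, 2, 3],
        "cost_in_dollars": [2, 6, 2, 5, 6],
        "units_sold": [4, 6, 3, 10, 1],
        "date": ["2022-04-01", "2022-05-24", "2022-05-01", "2022-02-10", "2022-08-03"],
    }
)

result = top_products(sample, n=2)
print(result)
```

In this sample the August order for product 3 falls outside the range, so product 3 totals 36 and ranks behind product 2 with 50.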
Conclusion
The code provided allows us to find the 5 most lucrative products in terms of total revenue for the first half of 2022. By breaking down the code and understanding each step, we can see how pandas can be used to manipulate, aggregate, and analyze data efficiently.