# Unlocking Insights: Calculating Total Sales by Year using Pandas
In the world of e-commerce and retail, understanding how products perform across different years is crucial for making informed business decisions. Tracking sales trends, identifying top-performing items, and predicting future demand all rely on accurate data analysis.

In this post, we’ll work through a data challenge: calculating the total sales amount of each product for each year from a table of sales periods, average daily sales, and product information. The difficulty is that a single sales period can span several calendar years, so each period has to be split at the year boundaries before the amounts are summed, and the output has to be sorted and ready for reporting.

By the end of this post, you’ll have a step-by-step pandas solution and a reusable pattern for intersecting date ranges. The same technique applies whenever intervals need to be allocated to calendar buckets such as years, quarters, or months.

## Problem Description

### Tables

#### Product Table
| Column Name   | Type    |
|---------------|---------|
| product_id    | int     |
| product_name  | varchar |

- `product_id`: Primary key (unique values).
- `product_name`: Name of the product.

#### Sales Table
| Column Name         | Type    |
|---------------------|---------|
| product_id          | int     |
| period_start        | date    |
| period_end          | date    |
| average_daily_sales | int     |

- `product_id`: Foreign key referencing the product.
- `period_start`: Start date for the sales period (inclusive).
- `period_end`: End date for the sales period (inclusive).
- `average_daily_sales`: Average daily sales amount during the period.
- Sales years range between **2018** and **2020**.

---

### Task
Write a solution to calculate the **total sales amount** for each product in each year.

The result should include the following columns:
- `product_id`
- `product_name`
- `report_year`
- `total_amount`

### Output Requirements
The result table should be **ordered by `product_id` and `report_year`**.

---

### Example

#### Input

**Product Table**
| product_id | product_name  |
|------------|---------------|
| 1          | LC Phone      |
| 2          | LC T-Shirt    |
| 3          | LC Keychain   |

**Sales Table**
| product_id | period_start | period_end  | average_daily_sales |
|------------|--------------|-------------|---------------------|
| 1          | 2019-01-25   | 2019-02-28  | 100                 |
| 2          | 2018-12-01   | 2020-01-01  | 10                  |
| 3          | 2019-12-01   | 2020-01-31  | 1                   |

#### Output

| product_id | product_name  | report_year | total_amount |
|------------|---------------|-------------|--------------|
| 1          | LC Phone      | 2019        | 3500         |
| 2          | LC T-Shirt    | 2018        | 310          |
| 2          | LC T-Shirt    | 2019        | 3650         |
| 2          | LC T-Shirt    | 2020        | 10           |
| 3          | LC Keychain   | 2019        | 31           |
| 3          | LC Keychain   | 2020        | 31           |

---

### Explanation
1. **LC Phone**: Sold from 2019-01-25 to 2019-02-28, a period of 35 days (7 in January and 28 in February).
   - Total amount: `35 * 100 = 3500`.

2. **LC T-Shirt**: Sold from 2018-12-01 to 2020-01-01, which spans three calendar years.
   - 31 days in 2018: `31 * 10 = 310`.
   - 365 days in 2019: `365 * 10 = 3650`.
   - 1 day in 2020: `1 * 10 = 10`.

3. **LC Keychain**: Sold from 2019-12-01 to 2020-01-31, which spans two calendar years.
   - 31 days in 2019: `31 * 1 = 31`.
   - 31 days in 2020: `31 * 1 = 31`.
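The day counts above can be double-checked with the standard library. The following is a minimal sketch; the helper name `inclusive_days` is hypothetical and only serves the illustration.

```python
from datetime import date


def inclusive_days(start: date, end: date) -> int:
    """Count calendar days from start to end, including both endpoints."""
    return (end - start).days + 1


# LC Phone: one period that stays inside 2019.
phone_days = inclusive_days(date(2019, 1, 25), date(2019, 2, 28))

# LC T-Shirt: the period is split at each year boundary.
tshirt_days = {
    2018: inclusive_days(date(2018, 12, 1), date(2018, 12, 31)),
    2019: inclusive_days(date(2019, 1, 1), date(2019, 12, 31)),
    2020: inclusive_days(date(2020, 1, 1), date(2020, 1, 1)),
}

phone_total = phone_days * 100
tshirt_totals = {year: days * 10 for year, days in tshirt_days.items()}

print(phone_total)
print(tshirt_totals)
```

The `+ 1` is what makes both `period_start` and `period_end` count as sales days.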


In [5]:
import pandas as pd
data = [[1, 'LC Phone'],
        [2, 'LC T-Shirt'],
        [3, 'LC Keychain']]
product = pd.DataFrame(data,
                       columns=['product_id',
                                'product_name']).astype({'product_id':'Int64',
                                'product_name':'object'})
display(product)

|   | product_id | product_name |
|---|------------|--------------|
| 0 | 1          | LC Phone     |
| 1 | 2          | LC T-Shirt   |
| 2 | 3          | LC Keychain  |


In [6]:
data = [[1, '2019-01-25', '2019-02-28', 100],
        [2, '2018-12-01', '2020-01-01', 10],
        [3, '2019-12-01', '2020-01-31', 1]]
sales = pd.DataFrame(data,
                     columns=['product_id',
                              'period_start',
                              'period_end',
                              'average_daily_sales']).astype({'product_id':'Int64',
                              'period_start':'datetime64[ns]',
                              'period_end':'datetime64[ns]',
                              'average_daily_sales':'Int64'})
display(sales)

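|   | product_id | period_start | period_end | average_daily_sales |
|---|------------|--------------|------------|---------------------|
| 0 | 1          | 2019-01-25   | 2019-02-28 | 100                 |
| 1 | 2          | 2018-12-01   | 2020-01-01 | 10                  |
| 2 | 3          | 2019-12-01   | 2020-01-31 | 1                   |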


**Step 1. Creating the Year Ranges DataFrame**
- Creates a DataFrame (`year_ranges`) with the start and end date of each reporting year (2018, 2019, and 2020).
- A list of years (`years`) is defined as strings.
- Each year is transformed into a start date (`YYYY-01-01`) and an end date (`YYYY-12-31`) using list comprehensions and `pd.to_datetime`.
- The resulting DataFrame has three columns: `report_year`, `year_start`, and `year_end`.



In [7]:
years = ["2018", "2019", "2020"]
year_ranges = pd.DataFrame({
    'report_year': years,
    'year_start': pd.to_datetime([year+"-01-01" for year in years]),
    'year_end': pd.to_datetime([year+"-12-31" for year in years])
})
display(year_ranges)

|   | report_year | year_start | year_end   |
|---|-------------|------------|------------|
| 0 | 2018        | 2018-01-01 | 2018-12-31 |
| 1 | 2019        | 2019-01-01 | 2019-12-31 |
| 2 | 2020        | 2020-01-01 | 2020-12-31 |
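The list of years is hard-coded because the problem guarantees sales between 2018 and 2020. When that guarantee is missing, the range could instead be derived from the data. The sketch below is one possible way to do that; the names `sales_sample`, `years_found`, and `year_ranges_auto` are hypothetical and not part of the original solution.

```python
import pandas as pd

# Hypothetical sample that mirrors the date columns of the sales table.
sales_sample = pd.DataFrame({
    'period_start': pd.to_datetime(['2019-01-25', '2018-12-01', '2019-12-01']),
    'period_end': pd.to_datetime(['2019-02-28', '2020-01-01', '2020-01-31']),
})

# Earliest and latest calendar year touched by any sales period.
first_year = int(sales_sample['period_start'].dt.year.min())
last_year = int(sales_sample['period_end'].dt.year.max())

# Keep the years as strings, matching the hard-coded list in Step 1.
years_found = [str(year) for year in range(first_year, last_year + 1)]

year_ranges_auto = pd.DataFrame({
    'report_year': years_found,
    'year_start': pd.to_datetime([year + '-01-01' for year in years_found]),
    'year_end': pd.to_datetime([year + '-12-31' for year in years_found]),
})

print(year_ranges_auto)
```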


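**Step 2. Adding a Common Key for the Cartesian Product**
- Creates a Cartesian product of `sales` and `year_ranges`, so that every sales period is paired with every year.
- A dummy column `key` with the constant value 1 is added to both `sales` and `year_ranges`.
- `pd.merge` on `key` combines every row in `sales` with every row in `year_ranges` (3 × 3 = 9 rows here).
- After merging, the `key` column is removed from the result with `.drop('key', axis=1)`. The original `sales` and `year_ranges` DataFrames keep it.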

In [8]:
sales['key'] = 1
year_ranges['key'] = 1
df = pd.merge(sales, year_ranges, on='key').drop('key', axis=1)
display(df)

|   | product_id | period_start | period_end | average_daily_sales | report_year | year_start | year_end   |
|---|------------|--------------|------------|---------------------|-------------|------------|------------|
| 0 | 1          | 2019-01-25   | 2019-02-28 | 100                 | 2018        | 2018-01-01 | 2018-12-31 |
| 1 | 1          | 2019-01-25   | 2019-02-28 | 100                 | 2019        | 2019-01-01 | 2019-12-31 |
| 2 | 1          | 2019-01-25   | 2019-02-28 | 100                 | 2020        | 2020-01-01 | 2020-12-31 |
| 3 | 2          | 2018-12-01   | 2020-01-01 | 10                  | 2018        | 2018-01-01 | 2018-12-31 |
| 4 | 2          | 2018-12-01   | 2020-01-01 | 10                  | 2019        | 2019-01-01 | 2019-12-31 |
| 5 | 2          | 2018-12-01   | 2020-01-01 | 10                  | 2020        | 2020-01-01 | 2020-12-31 |
| 6 | 3          | 2019-12-01   | 2020-01-31 | 1                   | 2018        | 2018-01-01 | 2018-12-31 |
| 7 | 3          | 2019-12-01   | 2020-01-31 | 1                   | 2019        | 2019-01-01 | 2019-12-31 |
| 8 | 3          | 2019-12-01   | 2020-01-31 | 1                   | 2020        | 2020-01-01 | 2020-12-31 |
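In pandas 1.2 and later, the same Cartesian product can be requested directly with `how='cross'`, which avoids the temporary `key` column. The sketch below uses two small stand-in frames (`left` and `right` are hypothetical names) rather than the full tables.

```python
import pandas as pd

# Small stand-ins for the sales and year_ranges tables.
left = pd.DataFrame({'product_id': [1, 2, 3]})
right = pd.DataFrame({'report_year': ['2018', '2019', '2020']})

# how='cross' pairs every row of `left` with every row of `right`.
# No join key is given, and neither input frame is modified.
crossed = pd.merge(left, right, how='cross')

print(len(crossed))
print(crossed.head())
```

Compared with the dummy-key approach, this leaves the input frames untouched, at the cost of requiring a newer pandas version.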


**Step 3. Calculating Overlap Start and End Dates**
- Determines the overlapping date range between `period_start`/`period_end` from `sales` and `year_start`/`year_end` from `year_ranges`.
- `overlap_start`: the later of `period_start` and `year_start`, taken with `.max(axis=1)`.
- `overlap_end`: the earlier of `period_end` and `year_end`, taken with `.min(axis=1)`.

**Step 4. Calculating Overlap Days**
- Computes the number of overlapping days.
- The difference between `overlap_end` and `overlap_start` is converted to days using `.dt.days` and incremented by 1, because both endpoints are inclusive. Pairs that do not overlap produce zero or negative values.

In [9]:
df['overlap_start'] = df[['period_start', 'year_start']].max(axis=1)
df['overlap_end'] = df[['period_end', 'year_end']].min(axis=1)
df['overlap_days'] = (df['overlap_end'] - df['overlap_start']).dt.days + 1
display(df)

|   | product_id | period_start | period_end | average_daily_sales | report_year | year_start | year_end   | overlap_start | overlap_end | overlap_days |
|---|------------|--------------|------------|---------------------|-------------|------------|------------|---------------|-------------|--------------|
| 0 | 1          | 2019-01-25   | 2019-02-28 | 100                 | 2018        | 2018-01-01 | 2018-12-31 | 2019-01-25    | 2018-12-31  | -24          |
| 1 | 1          | 2019-01-25   | 2019-02-28 | 100                 | 2019        | 2019-01-01 | 2019-12-31 | 2019-01-25    | 2019-02-28  | 35           |
| 2 | 1          | 2019-01-25   | 2019-02-28 | 100                 | 2020        | 2020-01-01 | 2020-12-31 | 2020-01-01    | 2019-02-28  | -306         |
| 3 | 2          | 2018-12-01   | 2020-01-01 | 10                  | 2018        | 2018-01-01 | 2018-12-31 | 2018-12-01    | 2018-12-31  | 31           |
| 4 | 2          | 2018-12-01   | 2020-01-01 | 10                  | 2019        | 2019-01-01 | 2019-12-31 | 2019-01-01    | 2019-12-31  | 365          |
| 5 | 2          | 2018-12-01   | 2020-01-01 | 10                  | 2020        | 2020-01-01 | 2020-12-31 | 2020-01-01    | 2020-01-01  | 1            |
| 6 | 3          | 2019-12-01   | 2020-01-31 | 1                   | 2018        | 2018-01-01 | 2018-12-31 | 2019-12-01    | 2018-12-31  | -334         |
| 7 | 3          | 2019-12-01   | 2020-01-31 | 1                   | 2019        | 2019-01-01 | 2019-12-31 | 2019-12-01    | 2019-12-31  | 31           |
| 8 | 3          | 2019-12-01   | 2020-01-31 | 1                   | 2020        | 2020-01-01 | 2020-12-31 | 2020-01-01    | 2020-01-31  | 31           |
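The negative values in `overlap_days` come from pairs where the sales period and the year do not intersect, so `overlap_end` falls before `overlap_start`. As an illustration, the sketch below reproduces the first two rows for LC Phone and shows one alternative way to neutralise such rows, clamping at zero with `Series.clip`. The names `pairs`, `raw_days`, and `clamped_days` are hypothetical.

```python
import pandas as pd

# LC Phone's period paired with 2018 (no overlap) and 2019 (full overlap).
pairs = pd.DataFrame({
    'period_start': pd.to_datetime(['2019-01-25', '2019-01-25']),
    'period_end': pd.to_datetime(['2019-02-28', '2019-02-28']),
    'year_start': pd.to_datetime(['2018-01-01', '2019-01-01']),
    'year_end': pd.to_datetime(['2018-12-31', '2019-12-31']),
})

# Intersection of two closed intervals: latest start, earliest end.
overlap_start = pairs[['period_start', 'year_start']].max(axis=1)
overlap_end = pairs[['period_end', 'year_end']].min(axis=1)

# Inclusive day count; negative when the intervals do not intersect.
raw_days = (overlap_end - overlap_start).dt.days + 1

# Alternative to filtering: treat "no overlap" as zero days.
clamped_days = raw_days.clip(lower=0)

print(raw_days.tolist())
print(clamped_days.tolist())
```

Clamping keeps the non-overlapping rows with a zero amount, whereas filtering removes them. For this problem the output must not contain years without sales, so removing the rows is the better fit.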


**Step 5. Filtering Valid Overlap Rows**
- Keeps only rows where the sales period actually overlaps the year (`overlap_days > 0`).
- Filters out rows where `overlap_days` is 0 or negative, meaning no overlap occurred.

In [10]:
df = df[df['overlap_days'] > 0].copy()
display(df)

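|   | product_id | period_start | period_end | average_daily_sales | report_year | year_start | year_end   | overlap_start | overlap_end | overlap_days |
|---|------------|--------------|------------|---------------------|-------------|------------|------------|---------------|-------------|--------------|
| 1 | 1          | 2019-01-25   | 2019-02-28 | 100                 | 2019        | 2019-01-01 | 2019-12-31 | 2019-01-25    | 2019-02-28  | 35           |
| 3 | 2          | 2018-12-01   | 2020-01-01 | 10                  | 2018        | 2018-01-01 | 2018-12-31 | 2018-12-01    | 2018-12-31  | 31           |
| 4 | 2          | 2018-12-01   | 2020-01-01 | 10                  | 2019        | 2019-01-01 | 2019-12-31 | 2019-01-01    | 2019-12-31  | 365          |
| 5 | 2          | 2018-12-01   | 2020-01-01 | 10                  | 2020        | 2020-01-01 | 2020-12-31 | 2020-01-01    | 2020-01-01  | 1            |
| 7 | 3          | 2019-12-01   | 2020-01-31 | 1                   | 2019        | 2019-01-01 | 2019-12-31 | 2019-12-01    | 2019-12-31  | 31           |
| 8 | 3          | 2019-12-01   | 2020-01-31 | 1                   | 2020        | 2020-01-01 | 2020-12-31 | 2020-01-01    | 2020-01-31  | 31           |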


**Step 6. Calculating Total Amount**
- Calculates the sales amount that falls within each year.
- Multiplies `overlap_days` by `average_daily_sales` and stores the result in a new `total_amount` column.


In [11]:
df['total_amount'] = df['overlap_days'] * df['average_daily_sales']
display(df)

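|   | product_id | period_start | period_end | average_daily_sales | report_year | year_start | year_end   | overlap_start | overlap_end | overlap_days | total_amount |
|---|------------|--------------|------------|---------------------|-------------|------------|------------|---------------|-------------|--------------|--------------|
| 1 | 1          | 2019-01-25   | 2019-02-28 | 100                 | 2019        | 2019-01-01 | 2019-12-31 | 2019-01-25    | 2019-02-28  | 35           | 3500         |
| 3 | 2          | 2018-12-01   | 2020-01-01 | 10                  | 2018        | 2018-01-01 | 2018-12-31 | 2018-12-01    | 2018-12-31  | 31           | 310          |
| 4 | 2          | 2018-12-01   | 2020-01-01 | 10                  | 2019        | 2019-01-01 | 2019-12-31 | 2019-01-01    | 2019-12-31  | 365          | 3650         |
| 5 | 2          | 2018-12-01   | 2020-01-01 | 10                  | 2020        | 2020-01-01 | 2020-12-31 | 2020-01-01    | 2020-01-01  | 1            | 10           |
| 7 | 3          | 2019-12-01   | 2020-01-31 | 1                   | 2019        | 2019-01-01 | 2019-12-31 | 2019-12-01    | 2019-12-31  | 31           | 31           |
| 8 | 3          | 2019-12-01   | 2020-01-31 | 1                   | 2020        | 2020-01-01 | 2020-12-31 | 2020-01-01    | 2020-01-31  | 31           | 31           |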


**Step 7. Grouping and Summing by Product and Year**
- Aggregates the total sales amount by `product_id` and `report_year`.
- Groups the DataFrame by `product_id` and `report_year`.
- Sums the `total_amount` for each group, so that several sales periods of the same product in the same year collapse into one row.
- The grouped result is reset into a flat DataFrame with `.reset_index()`.

In [12]:
df = df.groupby(['product_id', 'report_year'])['total_amount'].sum().reset_index()
display(df)

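|   | product_id | report_year | total_amount |
|---|------------|-------------|--------------|
| 0 | 1          | 2019        | 3500         |
| 1 | 2          | 2018        | 310          |
| 2 | 2          | 2019        | 3650         |
| 3 | 2          | 2020        | 10           |
| 4 | 3          | 2019        | 31           |
| 5 | 3          | 2020        | 31           |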
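In the example data every product has a single sales period, so the grouping step leaves the numbers unchanged. Its effect only shows up when a product has more than one period in the same year. The sketch below uses made-up amounts (the frame `per_period` is hypothetical) to illustrate that case.

```python
import pandas as pd

# Hypothetical per-period amounts: product 1 has two separate periods in 2019.
per_period = pd.DataFrame({
    'product_id': [1, 1, 2],
    'report_year': ['2019', '2019', '2019'],
    'total_amount': [3500, 500, 310],
})

# Same aggregation as in Step 7: one row per (product_id, report_year).
yearly = (
    per_period
    .groupby(['product_id', 'report_year'])['total_amount']
    .sum()
    .reset_index()
)

print(yearly)
```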


**Step 8. Merging with Product Information**
- Adds product details (here, the product name) by merging with the `product` DataFrame.
- Uses `product_id` as the key to join `df` with `product`.

In [13]:
df = pd.merge(df, product, on='product_id')
display(df)

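|   | product_id | report_year | total_amount | product_name |
|---|------------|-------------|--------------|--------------|
| 0 | 1          | 2019        | 3500         | LC Phone     |
| 1 | 2          | 2018        | 310          | LC T-Shirt   |
| 2 | 2          | 2019        | 3650         | LC T-Shirt   |
| 3 | 2          | 2020        | 10           | LC T-Shirt   |
| 4 | 3          | 2019        | 31           | LC Keychain  |
| 5 | 3          | 2020        | 31           | LC Keychain  |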
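This merge relies on `product_id` being unique in the product table, as the problem statement guarantees. If that were not the case, each duplicate would silently multiply rows in the result. As an optional safeguard, `pd.merge` accepts a `validate` argument. The sketch below uses small hypothetical frames (`totals`, `names`, `duplicated_names`) to show how it behaves.

```python
import pandas as pd

totals = pd.DataFrame({'product_id': [1, 2, 2],
                       'total_amount': [3500, 310, 3650]})
names = pd.DataFrame({'product_id': [1, 2],
                      'product_name': ['LC Phone', 'LC T-Shirt']})

# validate='many_to_one' asks pandas to check that product_id
# is unique in the right-hand frame before merging.
labelled = pd.merge(totals, names, on='product_id', validate='many_to_one')

# A product table with a repeated product_id fails that check.
duplicated_names = pd.concat([names, names.iloc[[0]]], ignore_index=True)
try:
    pd.merge(totals, duplicated_names, on='product_id', validate='many_to_one')
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

print(labelled)
print(duplicate_detected)
```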


**Step 9. Reordering and Selecting Relevant Columns**
- Keeps only the required columns in the final DataFrame.
- Columns kept, in order: `product_id`, `product_name`, `report_year`, and `total_amount`.

**Step 10. Sorting Final Data**
- Ensures the final DataFrame is sorted and clean.
- Sorts the DataFrame by `product_id` and `report_year` in ascending order.
- Resets the index and drops the old one with `.reset_index(drop=True)`.

In [14]:
df = df[['product_id', 'product_name', 'report_year', 'total_amount']]
df = df.sort_values(by=['product_id', 'report_year']).reset_index(drop=True)
display(df)

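|   | product_id | product_name | report_year | total_amount |
|---|------------|--------------|-------------|--------------|
| 0 | 1          | LC Phone     | 2019        | 3500         |
| 1 | 2          | LC T-Shirt   | 2018        | 310          |
| 2 | 2          | LC T-Shirt   | 2019        | 3650         |
| 3 | 2          | LC T-Shirt   | 2020        | 10           |
| 4 | 3          | LC Keychain  | 2019        | 31           |
| 5 | 3          | LC Keychain  | 2020        | 31           |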
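For reuse outside a notebook, the ten steps can be collected into one function. The following is a sketch under the same assumptions as the walkthrough (sales years limited to 2018 through 2020). The function name `total_sales_by_year` is hypothetical, and `assign(key=1)` is used instead of in-place assignment so that the caller's DataFrames are left unmodified.

```python
import pandas as pd


def total_sales_by_year(product: pd.DataFrame, sales: pd.DataFrame) -> pd.DataFrame:
    """Return total sales per product and year, sorted by product_id and report_year."""
    # Step 1: one row per reporting year with its first and last day.
    years = ['2018', '2019', '2020']
    year_ranges = pd.DataFrame({
        'report_year': years,
        'year_start': pd.to_datetime([y + '-01-01' for y in years]),
        'year_end': pd.to_datetime([y + '-12-31' for y in years]),
    })

    # Step 2: Cartesian product via a dummy key added to copies of the inputs.
    df = pd.merge(sales.assign(key=1), year_ranges.assign(key=1), on='key')
    df = df.drop(columns='key')

    # Steps 3-4: intersect each period with each year, count inclusive days.
    df['overlap_start'] = df[['period_start', 'year_start']].max(axis=1)
    df['overlap_end'] = df[['period_end', 'year_end']].min(axis=1)
    df['overlap_days'] = (df['overlap_end'] - df['overlap_start']).dt.days + 1

    # Steps 5-6: drop non-overlapping pairs, then compute the amount.
    df = df[df['overlap_days'] > 0].copy()
    df['total_amount'] = df['overlap_days'] * df['average_daily_sales']

    # Steps 7-10: aggregate, attach names, select columns, and sort.
    result = (
        df.groupby(['product_id', 'report_year'], as_index=False)['total_amount']
        .sum()
        .merge(product, on='product_id')
        [['product_id', 'product_name', 'report_year', 'total_amount']]
        .sort_values(by=['product_id', 'report_year'])
        .reset_index(drop=True)
    )
    return result


# Usage with the example data from the problem statement.
product = pd.DataFrame({
    'product_id': [1, 2, 3],
    'product_name': ['LC Phone', 'LC T-Shirt', 'LC Keychain'],
})
sales = pd.DataFrame({
    'product_id': [1, 2, 3],
    'period_start': pd.to_datetime(['2019-01-25', '2018-12-01', '2019-12-01']),
    'period_end': pd.to_datetime(['2019-02-28', '2020-01-01', '2020-01-31']),
    'average_daily_sales': [100, 10, 1],
})

result = total_sales_by_year(product, sales)
print(result)
```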


References: [1] https://leetcode.com/problems/total-sales-amount-by-year/