In [None]:
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Analysis of pricing optimization on CDM Pricing Data

<table align="left">

  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/workbench/pricing_optimization/pricing-optimization.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/workbench/pricing_optimization/pricing-optimization.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/official/workbench/pricing_optimization/pricing-optimization.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>                                                                                               
</table>

## Table of contents
* [Overview](#section-1)
* [Objective](#section-2)
* [Dataset](#section-3)
* [Costs](#section-4)
* [Create a BigQuery dataset](#section-5)
* [Load the dataset from Cloud Storage](#section-6)
* [Data analysis](#section-7)
* [Preprocess the data for training](#section-8)
* [Train the model using BigQuery ML](#section-9)
* [Generate forecasts from the model](#section-10)
* [Interpret the results to choose the best price](#section-11)
* [Clean up](#section-12)


## Overview
<a name="section-1"></a>

This notebook demonstrates analysis of pricing optimization on [CDM Pricing Data](https://github.com/trifacta/trifacta-google-cloud/tree/main/design-pattern-pricing-optimization) and automating the workflow using Vertex AI Workbench managed notebooks.

*Note: This notebook file was developed to run in a [Vertex AI Workbench managed notebooks](https://console.cloud.google.com/vertex-ai/workbench/list/managed) instance using the Python (Local) kernel. Some components of this notebook may not work in other notebook environments.*

Learn more about [Vertex AI Workbench](https://cloud.google.com/vertex-ai/docs/workbench/introduction) and [BigQuery ML](https://cloud.google.com/vertex-ai/docs/beginner/bqml#machine_learning_directly_in).

### Objective
<a name="section-2"></a>

The objective of this notebook is to build a pricing optimization model using BigQuery ML.

This tutorial uses the following Google Cloud ML services and resources:

- Google Cloud Storage
- BigQuery


The steps performed include:

- Load the required dataset from a Cloud Storage bucket.
- Analyze the fields present in the dataset.
- Process the data to build a model.
- Build a BigQuery ML forecast model on the processed data.
- Get forecasted values from the BigQuery ML model.
- Interpret the forecasts to identify the best prices.
- Clean up.


### Dataset
<a name="section-3"></a>

The dataset used in this notebook is a part of the [CDM Pricing dataset](https://github.com/trifacta/trifacta-google-cloud/blob/main/design-pattern-pricing-optimization/CDM_Pricing_large_table.csv), which consists of product sales information on specified dates.

### Costs
<a name="section-4"></a>

This tutorial uses the following billable components of Google Cloud:

- Vertex AI
- BigQuery
- Cloud Storage


Learn about [Vertex AI
pricing](https://cloud.google.com/vertex-ai/pricing), [BigQuery pricing](https://cloud.google.com/bigquery/pricing) and [Cloud Storage
pricing](https://cloud.google.com/storage/pricing), and use the [Pricing
Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

### Install additional packages


In [None]:
! pip3 install --quiet --upgrade pandas-gbq 'google-cloud-bigquery[bqstorage,pandas]' seaborn fsspec gcsfs


### Colab Only: Uncomment the following cell to restart the kernel

In [None]:
# Automatically restart kernel after installs so that your environment can access the new packages
# import IPython

# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)

### Before you begin

#### Set your project ID

**If you don't know your project ID**, try the following:
-  Run `gcloud config list`
-  Run `gcloud projects list`
-  See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)

In [None]:
PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

# set the project id
! gcloud config set project $PROJECT_ID

#### Region

You can also change the `REGION` variable used by Vertex AI. 
Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations).

In [None]:
REGION = "[your-region]"  # @param {type: "string"}

### Authenticate your Google Cloud account

Depending on your Jupyter environment, you may have to manually authenticate. Follow the relevant instructions below.

**1. Vertex AI Workbench** 
- Do nothing as you are already authenticated.

**2. Local JupyterLab Instance,** uncomment and run.

In [None]:
# ! gcloud auth login

**3. Colab,** uncomment and run:

In [None]:
# from google.colab import auth
# auth.authenticate_user()

**4. Service Account or other**
- See all the authentication options here: [Google Cloud Platform Jupyter Notebook Authentication Guide](https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/notebook_authentication_guide.ipynb)

#### UUID

If you are in a live tutorial session, you might be using a shared test account or project. To avoid name collisions between users on resources created, you create a uuid for each instance session, and append it onto the name of resources you create in this tutorial.

In [None]:
import random
import string


# Generate a uuid of a specified length (default=8)
def generate_uuid(length: int = 8) -> str:
    return "".join(random.choices(string.ascii_lowercase + string.digits, k=length))


UUID = generate_uuid()

### Import the required libraries and define constants


In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from google.cloud import bigquery
from google.cloud.bigquery import Client

#### Set the BigQuery dataset ID and table ID

In [None]:
DATASET = "pricing_optimization" + "_" + UUID  # set the BigQuery dataset-id
TRAINING_DATA_TABLE = (
    "training_data_table"  # set the BigQuery table-id to store the training data
)

## Create a BigQuery dataset
<a name="section-5"></a>


If you are using a ***Vertex AI Workbench managed notebooks instance***, every cell that starts with "#@bigquery" runs as a SQL query. In a Vertex AI Workbench user-managed notebooks instance or in Colab, the same cell is rendered as a markdown cell.

#@bigquery
-- create a dataset in BigQuery

CREATE SCHEMA [your-dataset-id]
OPTIONS(
  location="us"
  )

In [None]:
# Construct a BigQuery client object.
client = Client(project=PROJECT_ID)

In [None]:
query = """
CREATE SCHEMA {DATASET}
OPTIONS(
  location="us"
  )
""".format(
    DATASET=DATASET
)
query_job = client.query(query)
print(query_job.result())

## Load the BigQuery table from Cloud Storage
<a name="section-6"></a>


In [None]:
table_id_name = f"{PROJECT_ID}.{DATASET}.data"

In [None]:
table_id = "data"

In [None]:
job_config = bigquery.LoadJobConfig(
    autodetect=True,
    skip_leading_rows=1,
    # The source format defaults to CSV, so the line below is optional.
    source_format=bigquery.SourceFormat.CSV,
)
uri = "gs://cloud-samples-data/ai-platform-unified/datasets/tabular/cdm_pricing_large_table.csv"

load_job = client.load_table_from_uri(
    uri, table_id_name, job_config=job_config
)  # Make an API request.

load_job.result()  # Waits for the job to complete.

destination_table = client.get_table(table_id_name)  # Make an API request.
print("Loaded {} rows.".format(destination_table.num_rows))

You build a forecast model on this data to determine the best price for a product. This type of model needs only a few fields: the ones related to sales and price. For the current exercise, focus on the following fields:

- `Product_ID`
- `Customer_Hierarchy`
- `Fiscal_Date`
- `List_Price_Converged`
- `Invoiced_quantity_in_Pieces`
- `Net_Sales`

## Data Analysis
<a name="section-7"></a>

First, explore the data and distributions.

#### Select the required columns from the dataframe.

In [None]:
id_col = "Product_ID"
date_col = "Fiscal_Date"
categ_cols = ["Customer_Hierarchy"]
num_cols = ["List_Price_Converged", "Invoiced_quantity_in_Pieces", "Net_Sales"]
required_columns = [id_col] + [date_col] + categ_cols + num_cols
required_columns

Create a table that contains only the required columns.

In [None]:
query = """
    CREATE OR REPLACE TABLE {DATASET}.required_columns AS
    ( SELECT Product_ID,Fiscal_Date,Customer_Hierarchy,List_Price_Converged,Invoiced_quantity_in_Pieces,Net_Sales FROM `{DATASET}.{table_id}` )
    
""".format(
    DATASET=DATASET, table_id=table_id
)

query_job = client.query(query)  # Make an API request.
print(query_job.result())

View the data stored in the table.

In [None]:
query = """
    SELECT * FROM {DATASET}.required_columns 
    
""".format(
    DATASET=DATASET
)

query_job = client.query(query)  # Make an API request.
print(query_job.result())

In [None]:
query_job.to_dataframe()

#### Check the column types and null values in the dataframe.

In [None]:
query_job.to_dataframe().info()

This data description reveals that there are no null values in the data. Also, the `Fiscal_Date` field, which holds dates, is loaded as an object type.
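
The two observations above (no null values, and a date column that is not yet a date type) can also be checked explicitly rather than read off the `info()` output. The following is a minimal sketch on a made-up dataframe; the values and the variable names `df`, `null_counts`, `is_datetime_before`, and `is_datetime_after` are assumptions for illustration and are not part of the notebook's data.

```python
import pandas as pd

# Toy frame standing in for the query result; the values are made up.
df = pd.DataFrame(
    {
        "Product_ID": ["SKU 1", "SKU 2"],
        "Fiscal_Date": ["2020-01-01", "2020-01-02"],
        "Net_Sales": [10.5, 20.0],
    }
)

# Count missing values per column; all zeros means no nulls.
null_counts = df.isna().sum()
print(null_counts)

# Date strings are not a datetime dtype until they are parsed.
is_datetime_before = pd.api.types.is_datetime64_any_dtype(df["Fiscal_Date"])

# Parse the strings into datetime values.
df["Fiscal_Date"] = pd.to_datetime(df["Fiscal_Date"])
is_datetime_after = pd.api.types.is_datetime64_any_dtype(df["Fiscal_Date"])
print(is_datetime_before, is_datetime_after)
```

In the notebook itself the conversion is done in SQL, so this pandas version is only a way to confirm the column type on the client side.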

#### Change the type of the date field to date.

Cast the `Fiscal_Date` field to the DATE type and store the resulting data in a view.

In [None]:
query = """
CREATE OR REPLACE VIEW {DATASET}.required_columns_final AS
(
SELECT Product_ID,Customer_Hierarchy,List_Price_Converged,Invoiced_quantity_in_Pieces,Net_Sales,CAST(DATE(Fiscal_Date) AS DATE) AS Fiscal_Date FROM {DATASET}.required_columns    
)
""".format(
    DATASET=DATASET
)

query_job = client.query(query)  # Make an API request.
print(query_job.result())

View the data in the `required_columns_final` view.

In [None]:
query = """
SELECT * FROM {DATASET}.required_columns_final

""".format(
    DATASET=DATASET
)

query_job = client.query(query)  # Make an API request.
print(query_job.result())

In [None]:
required_columns_final_df = query_job.to_dataframe()

In [None]:
required_columns_final_df

#### Plot the distributions for the categorical fields.

In [None]:
for i in categ_cols:
    required_columns_final_df[i].value_counts(normalize=True).plot(kind="bar")
    plt.title(i)
    plt.show()

#### Plot the distributions for the numerical fields.

In [None]:
for i in num_cols:
    _, ax = plt.subplots(1, 2, figsize=(10, 4))
    required_columns_final_df[i].plot(kind="box", ax=ax[0])
    required_columns_final_df[i].plot(kind="hist", ax=ax[1])
    ax[0].set_title(i + "-Boxplot")
    ax[1].set_title(i + "-Histogram")
    plt.show()

#### Check the maximum date and minimum date in Fiscal_Date column.

In [None]:
print(required_columns_final_df["Fiscal_Date"].max())
print(required_columns_final_df["Fiscal_Date"].min())

#### Check the product distribution across each category.

In [None]:
query = """
SELECT Customer_Hierarchy,COUNT(*) as count FROM (SELECT Customer_Hierarchy,Product_ID FROM {DATASET}.required_columns_final GROUP BY Customer_Hierarchy,Product_ID) GROUP BY  Customer_Hierarchy
""".format(
    DATASET=DATASET
)
query_job = client.query(query)
print(query_job.result())

In [None]:
query_job.to_dataframe()

#### Check the percentage changes in the orders based on the percentage changes in the price.

Follow three steps to check the percentage changes in the orders based on the percentage changes in the price.

**Step 1**. First, create a table that has one row for each price a product has had, with information about that price: the number of items ordered at that price, the total net sales associated with it, and the first date on which it appeared.


In [None]:
query = """
create or replace table {DATASET}.price_changes as (
select
       product_id,
       list_price_converged,
       total_ordered_pieces,
       total_net_sales,
       first_price_date,
       lag(list_price_converged) over(partition by product_id order by first_price_date asc) as previous_list,
       lag(total_ordered_pieces) over(partition by product_id order by first_price_date asc) as previous_total_ordered_pieces,
       lag(total_net_sales) over(partition by product_id order by first_price_date asc) as previous_total_net_sales,
       lag(first_price_date) over(partition by product_id order by first_price_date asc) as previous_first_price_date,
       
       
       from (
           select
               product_id,list_price_converged,sum(invoiced_quantity_in_pieces) as total_ordered_pieces, sum(net_sales) as total_net_sales, min(fiscal_date) as first_price_date
           from `{DATASET}.required_columns_final` AS cdm_pricing
           group by 1,2
           order by 1, 2 asc
       )
);

""".format(
    DATASET=DATASET
)
query_job = client.query(query)
print(query_job.result())

In [None]:
query = """
select * from {DATASET}.price_changes order by product_id, first_price_date 
""".format(
    DATASET=DATASET
)
query_job = client.query(query)
print(query_job.result())

In [None]:
df_price_changes = query_job.to_dataframe()

In [None]:
df_price_changes

**Step 2**. Next, with the `price_changes` table in place, calculate the percentage price change across SKUs:

`(list_price_converged-previous_list)/nullif(previous_list,0)*100`

**Step 3**. Then calculate the percentage change in `total_ordered_pieces` across SKUs:

`(total_ordered_pieces-previous_total_ordered_pieces)/nullif(previous_total_ordered_pieces,0)*100`
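
To make the lag and percentage-change arithmetic concrete, here is a minimal pandas sketch of the same idea on a made-up price history for a single product. The data values are invented for illustration; `groupby(...).shift(1)` plays the role of `lag(...) over(partition by product_id ...)`, and replacing zeros with NaN stands in for `nullif(..., 0)`.

```python
import pandas as pd

# Toy price history for one product; the values are illustrative only.
price_changes = pd.DataFrame(
    {
        "product_id": ["SKU 1", "SKU 1", "SKU 1"],
        "first_price_date": pd.to_datetime(
            ["2020-01-01", "2020-02-01", "2020-03-01"]
        ),
        "list_price_converged": [10.0, 11.0, 9.9],
        "total_ordered_pieces": [100.0, 80.0, 120.0],
    }
).sort_values(["product_id", "first_price_date"])

# Previous row's values within each product, like LAG(...) OVER (PARTITION BY ...).
grouped = price_changes.groupby("product_id")
previous_list = grouped["list_price_converged"].shift(1)
previous_pieces = grouped["total_ordered_pieces"].shift(1)

# Turn zero denominators into NaN, mirroring NULLIF(x, 0).
price_changes["price_change_perc"] = (
    (price_changes["list_price_converged"] - previous_list)
    / previous_list.replace(0, float("nan"))
    * 100
)
price_changes["order_change_perc"] = (
    (price_changes["total_ordered_pieces"] - previous_pieces)
    / previous_pieces.replace(0, float("nan"))
    * 100
)
print(price_changes[["price_change_perc", "order_change_perc"]])
```

The first row of each product has no previous price, so both percentages are missing there, just as the SQL version returns NULL for the first price of each product.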

In [None]:
query = """
select *,(list_price_converged-previous_list)/nullif(previous_list,0)*100 as price_change_perc,(total_ordered_pieces-previous_total_ordered_pieces)/nullif(previous_total_ordered_pieces,0)*100 as order_change_perc  from `{DATASET}.price_changes`
""".format(
    DATASET=DATASET
)
query_job = client.query(query)
print(query_job.result())

Load the query result into a dataframe (`df_for_plot`) that has the `price_change_perc` and `order_change_perc` fields.

In [None]:
df_for_plot = query_job.to_dataframe()

In [None]:
# sort values chronologically
df_for_plot.sort_values(by=["product_id", "first_price_date"], inplace=True)
df_for_plot.reset_index(drop=True, inplace=True)

In [None]:
df_for_plot

Finally, you can analyze what happens after a price has been changed by looking at the relationship between each price change and the total amount of items that were ordered:

In [None]:
# plot a scatterplot to visualize the changes
sns.scatterplot(
    x="price_change_perc",
    y="order_change_perc",
    data=df_for_plot,
    hue="product_id",
    legend=False,
)
plt.title("Percentage of change in price vs order")
plt.show()

For most of the products, the percentage change in orders is high where the percentage change in price is low. This suggests that large changes in price can affect the number of orders.

**Note**: There seem to be some outliers in the data, as percentage changes greater than 800 are found. In the current exercise, do not take any manual measures to deal with outliers, because the BigQuery ML time series model you create later already handles them.

## Preprocess the data for training
<a name="section-8"></a>

#### Check which `Product_ID`'s  have the maximum orders.

Create a view that stores the total number of orders for each product within each Customer_Hierarchy.

In [None]:
query = """
CREATE OR REPLACE VIEW {DATASET}.total_orders AS
(
SELECT Customer_Hierarchy,Product_ID,SUM(Invoiced_quantity_in_Pieces) AS Invoiced_quantity_in_Pieces FROM {DATASET}.required_columns_final GROUP BY Customer_Hierarchy,Product_ID

)
""".format(
    DATASET=DATASET
)
query_job = client.query(query)
print(query_job.result())

In [None]:
query = """
SELECT * FROM {DATASET}.total_orders""".format(
    DATASET=DATASET
)
query_job = client.query(query)
print(query_job.result())

In [None]:
# sort values by product ID
df_total_orders = query_job.to_dataframe()
df_total_orders.sort_values(by=["Product_ID"], inplace=True)
df_total_orders.reset_index(drop=True, inplace=True)

In [None]:
df_total_orders

#### Select top products in each Customer_Hierarchy

Below is an example that shows how to find the top products in each Customer_Hierarchy.

Example:
Assume the total_orders view initially contains:

<table>
    <tr>
        <th>
            Customer_Hierarchy
        </th>
        <th> 
            Invoiced_quantity_in_Pieces
        </th>
        <th> 
            Product_ID
        </th>
    </tr>    
    <tr> 
        <td>Food</td>                 
        <td>200</td>                       
        <td>1</td> 
    </tr>
    <tr> 
        <td>Paper</td>                 
        <td>100</td>                       
        <td>2</td> 
    </tr>
    <tr> 
        <td>Food</td>                 
        <td>300</td>                       
        <td>3</td> 
    </tr>
    <tr> 
        <td>Paper</td>                 
        <td>400</td>                       
        <td>4</td> 
    </tr>
   
</table>    
First, partition the total_orders view by Customer_Hierarchy and order each partition by Invoiced_quantity_in_Pieces in descending order.
After partitioning, it becomes:

<table>
    <tr>
        <th>
            Customer_Hierarchy
        </th>
        <th> 
            Invoiced_quantity_in_Pieces
        </th>
        <th> 
            Product_ID
        </th>
    </tr>    
    <tr> 
        <td>Food</td>                 
        <td>300</td>                       
        <td>3</td> 
    </tr>
    <tr> 
        <td>Food</td>                 
        <td>200</td>                       
        <td>1</td> 
    </tr>
    <tr> 
        <td>Paper</td>                 
        <td>400</td>                       
        <td>4</td> 
    </tr>
    <tr> 
        <td>Paper</td>                 
        <td>100</td>                       
        <td>2</td> 
    </tr>
</table>   

Now, within every Customer_Hierarchy, Invoiced_quantity_in_Pieces is in descending order.

Next, apply the ROW_NUMBER function to the table above. It becomes:

<table>
    <tr>
        <th>
            Customer_Hierarchy
        </th>
        <th> 
            Invoiced_quantity_in_Pieces
        </th>
        <th> 
            Product_ID
        </th>
        <th>
            rowNumber
        </th>    
    </tr>    
    <tr> 
        <td>Food</td>                 
        <td>300</td>                       
        <td>3</td>
        <td>1</td>
    </tr>
    <tr> 
        <td>Food</td>                 
        <td>200</td>                       
        <td>1</td>
        <td>2</td>
    </tr>
    <tr> 
        <td>Paper</td>                 
        <td>400</td>                       
        <td>4</td> 
        <td>1</td>
    </tr>
    <tr> 
        <td>Paper</td>                 
        <td>100</td>                       
        <td>2</td> 
        <td>2</td>
    </tr>
</table>   

(Within each Customer_Hierarchy, the row numbering starts again from 1.)
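
The worked example can be reproduced in pandas as a rough analogue of `ROW_NUMBER() OVER(PARTITION BY ... ORDER BY ... DESC)`. This is a sketch on the four example rows only; the names `ranked` and `top_products` are introduced here for illustration.

```python
import pandas as pd

# The four example rows of the total_orders view.
total_orders = pd.DataFrame(
    {
        "Customer_Hierarchy": ["Food", "Paper", "Food", "Paper"],
        "Invoiced_quantity_in_Pieces": [200, 100, 300, 400],
        "Product_ID": [1, 2, 3, 4],
    }
)

# Order each hierarchy by quantity, largest first.
ranked = total_orders.sort_values(
    ["Customer_Hierarchy", "Invoiced_quantity_in_Pieces"],
    ascending=[True, False],
).reset_index(drop=True)

# Number the rows within each hierarchy, starting from 1.
ranked["rowNumber"] = ranked.groupby("Customer_Hierarchy").cumcount() + 1

# Keep the top product of each hierarchy.
top_products = ranked[ranked["rowNumber"] == 1]
print(top_products)
```

For the example rows, this keeps product 3 for Food (300 pieces) and product 4 for Paper (400 pieces).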


In [None]:
query = """
SELECT 
  *,
  ROW_NUMBER() OVER(PARTITION BY Customer_Hierarchy ORDER BY Invoiced_quantity_in_Pieces DESC) rowNumber
  FROM {DATASET}.total_orders
""".format(
    DATASET=DATASET
)
query_job = client.query(query)

In [None]:
query_job.to_dataframe()

As you can see, for the `Paper` Customer_Hierarchy, Invoiced_quantity_in_Pieces is in descending order and rowNumber starts from 1.

In [None]:
# Convert the result once and reuse it for filtering
df_ranked = query_job.to_dataframe()
df_ranked.loc[df_ranked["Customer_Hierarchy"] == "Paper"]

You want the row with the highest Invoiced_quantity_in_Pieces in each Customer_Hierarchy, so select the rows where rowNumber is 1.

In [None]:
query = """
 SELECT A.Product_ID, A.Customer_Hierarchy,A.Invoiced_quantity_in_Pieces
  FROM (
  SELECT 
  *,
  ROW_NUMBER() OVER(PARTITION BY Customer_Hierarchy ORDER BY Invoiced_quantity_in_Pieces DESC) rowNumber
  FROM {DATASET}.total_orders
  )A
  WHERE A.rowNumber =1;
""".format(
    DATASET=DATASET
)
query_job = client.query(query)
print(query_job.result())

In [None]:
query_job.to_dataframe()

From the above result, you can infer the following:

- Under the **Food** category, **SKU 62** has the maximum orders.
- Under the **Manufacturing** category, **SKU 17** has the maximum orders.
- Under the **Paper** category, **SKU 107** has the maximum orders.
- Under the **Publishing** category, **SKU 8** has the maximum orders.
- Under the **Utilities** category, **SKU 140** has the maximum orders.

Because there are many product IDs and most of them have only a few records, consider only the above `Product_ID`s, which have the most orders in their categories.

**Note**: The `Invoiced_quantity_in_Pieces` field seems to be a *float* type rather than an *int* type as it should be. This could be because the data itself might be averaged in the first place.

#### Check the various prices available for these `Product_ID`s.

First, from the `required_columns_final` view, select only the rows that have the desired product ID and customer hierarchy.

In [None]:
query = """
SELECT * FROM {DATASET}.required_columns_final WHERE Product_ID="SKU 62" AND Customer_Hierarchy="Food"
""".format(
    DATASET=DATASET
)
query_job = client.query(query)
df_sku_62 = query_job.to_dataframe()
df_sku_62

Then count how often each price occurs for these `Product_ID`s.

In [None]:
print(df_sku_62["List_Price_Converged"].value_counts())

In [None]:
query = """
SELECT * FROM {DATASET}.required_columns_final WHERE Product_ID="SKU 17" AND Customer_Hierarchy="Manufacturing"
""".format(
    DATASET=DATASET
)
query_job = client.query(query)
df_sku_17 = query_job.to_dataframe()
df_sku_17

In [None]:
print(df_sku_17["List_Price_Converged"].value_counts())

In [None]:
query = """
SELECT * FROM {DATASET}.required_columns_final WHERE Product_ID="SKU 107" AND Customer_Hierarchy="Paper"
""".format(
    DATASET=DATASET
)
query_job = client.query(query)
df_sku_107 = query_job.to_dataframe()
df_sku_107

In [None]:
print(df_sku_107["List_Price_Converged"].value_counts())

In [None]:
query = """
SELECT * FROM {DATASET}.required_columns_final WHERE Product_ID="SKU 8" AND Customer_Hierarchy="Publishing"
""".format(
    DATASET=DATASET
)
query_job = client.query(query)
df_sku_8 = query_job.to_dataframe()
df_sku_8

In [None]:
print(df_sku_8["List_Price_Converged"].value_counts())

In [None]:
query = """
SELECT * FROM {DATASET}.required_columns_final WHERE Product_ID="SKU 140" AND Customer_Hierarchy="Utilities"
""".format(
    DATASET=DATASET
)
query_job = client.query(query)
df_sku_140 = query_job.to_dataframe()

In [None]:
print(df_sku_140["List_Price_Converged"].value_counts())

`SKU 8` (Publishing) and `SKU 17` (Manufacturing) have two or fewer distinct prices in the entire data, so exclude them and consider the rest for building the forecast model. The idea here is to train a forecast model on the time series data for products that were sold at several different prices.
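
The exclusion rule can also be expressed programmatically instead of being read off the `value_counts()` output. The following is a minimal sketch on made-up data; the product names, the prices, and the `min_distinct_prices` threshold variable are assumptions introduced for illustration.

```python
import pandas as pd

# Toy sales rows; products and prices are made up.
sales = pd.DataFrame(
    {
        "Product_ID": ["SKU A", "SKU A", "SKU A", "SKU B", "SKU B", "SKU B"],
        "List_Price_Converged": [1.0, 1.1, 1.2, 2.0, 2.0, 2.5],
    }
)

# Hypothetical threshold: keep products with more than two distinct prices.
min_distinct_prices = 3

# Count the distinct prices seen for each product.
price_counts = sales.groupby("Product_ID")["List_Price_Converged"].nunique()

# Keep only the products that meet the threshold.
kept_products = price_counts[price_counts >= min_distinct_prices].index.tolist()
print(kept_products)
```

Here `SKU A` has three distinct prices and is kept, while `SKU B` has only two and is dropped.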

#### Join the data for all the `Product_ID`s into one dataframe and remove duplicate records.

In [None]:
df_final = pd.concat([df_sku_62, df_sku_107, df_sku_140])
df_final = (
    df_final[
        [
            "Product_ID",
            "Fiscal_Date",
            "Customer_Hierarchy",
            "List_Price_Converged",
            "Invoiced_quantity_in_Pieces",
        ]
    ]
    .drop_duplicates()
    .reset_index(drop=True)
)
df_final

#### Save the data to a BigQuery table.

In [None]:
bq_client = bigquery.Client(project=PROJECT_ID)

job_config = bigquery.LoadJobConfig(
    # Specify a (partial) schema. All columns are always written to the
    # table. The schema is used to assist in data type definitions.
    schema=[
        bigquery.SchemaField("Product_ID", bigquery.enums.SqlTypeNames.STRING),
        bigquery.SchemaField("Fiscal_Date", bigquery.enums.SqlTypeNames.DATE),
        bigquery.SchemaField("List_Price_Converged", bigquery.enums.SqlTypeNames.FLOAT),
        bigquery.SchemaField(
            "Invoiced_quantity_in_Pieces", bigquery.enums.SqlTypeNames.FLOAT
        ),
    ],
    # Optionally, set the write disposition. BigQuery appends loaded rows
    # to an existing table by default, but with WRITE_TRUNCATE write
    # disposition it replaces the table with the loaded data.
    write_disposition="WRITE_TRUNCATE",
)

# save the dataframe to a table in the created dataset
job = bq_client.load_table_from_dataframe(
    df_final,
    "{}.{}.{}".format(PROJECT_ID, DATASET, TRAINING_DATA_TABLE),
    job_config=job_config,
)  # Make an API request.
print(job.result())  # Wait for the job to complete.

## Train the model using BigQuery ML
<a name="section-9"></a>

Train an [ARIMA_PLUS](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create-time-series) model on the data using BigQuery ML.

#@bigquery
create or replace model [your-dataset-id].bqml_arima
options
 (model_type = 'ARIMA_PLUS',
  time_series_timestamp_col = 'Fiscal_Date',
  time_series_data_col = 'Invoiced_quantity_in_Pieces',
  time_series_id_col = 'ID'
 ) as
select
 Fiscal_Date,
 Concat(Product_ID,"_" ,Cast(List_Price_Converged as string)) as ID,
 Invoiced_quantity_in_Pieces
from
 [your-dataset-id].training_data_table


In [None]:
query = """
create or replace model `{PROJECT_ID}.{DATASET}.bqml_arima`
options
 (model_type = 'ARIMA_PLUS',
  time_series_timestamp_col = 'Fiscal_Date',
  time_series_data_col = 'Invoiced_quantity_in_Pieces',
  time_series_id_col = 'ID'
 ) as
select
 Fiscal_Date,
 Concat(Product_ID,"_" ,Cast(List_Price_Converged as string)) as ID,
 Invoiced_quantity_in_Pieces
from
 `{DATASET}.{TRAINING_DATA_TABLE}`""".format(
    PROJECT_ID=PROJECT_ID, DATASET=DATASET, TRAINING_DATA_TABLE=TRAINING_DATA_TABLE
)
query_job = client.query(query)
print(query_job.result())

## Generate forecasts from the model
<a name="section-10"></a>

Predict the sales for the next 30 days for each id and save to a dataframe.

In [None]:
query = '''
DECLARE HORIZON STRING DEFAULT "30"; #number of values to forecast
DECLARE CONFIDENCE_LEVEL STRING DEFAULT "0.90"; ## required confidence level

EXECUTE IMMEDIATE format("""
    SELECT
      *
    FROM 
      ML.FORECAST(MODEL {DATASET}.bqml_arima, 
                  STRUCT(%s AS horizon, 
                         %s AS confidence_level)
                 )
    """,HORIZON,CONFIDENCE_LEVEL)'''.format(
    DATASET=DATASET
)
job = client.query(query)
dfforecast = job.to_dataframe()
dfforecast.head()

## Interpret the results to choose the best price
<a name="section-11"></a>

#### Calculate average forecast values for the forecast duration.

In [None]:
dfforecast_avg = (
    dfforecast[["ID", "forecast_value"]].groupby("ID", as_index=False).mean()
)

#### Extract the ID and Price fields from the ID field.

In [None]:
dfforecast_avg["Product_ID"] = dfforecast_avg["ID"].apply(lambda x: x.split("_")[0])
dfforecast_avg["Price"] = dfforecast_avg["ID"].apply(lambda x: x.split("_")[1])

#### Plot the average forecasted sales vs. the price of the product.

In [None]:
for i in dfforecast_avg["Product_ID"].unique():
    dfforecast_avg[dfforecast_avg["Product_ID"] == i].set_index("Price").sort_values(
        "forecast_value"
    ).plot(kind="bar")
    plt.title("Price vs. Average Sales for " + i)
    plt.show()

Based on the plots of price vs. average forecasted orders, the following prices maximize the forecasted orders for each of the considered `Product_ID`s:

- SKU 107's price range can be from 4.44 - 4.73 units
- SKU 140's price can be 1.95 units
- SKU 62's price can be 4.23 units
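
The price with the highest average forecast can also be picked out in code rather than read from the bar charts. This is a minimal sketch on made-up forecast averages; the IDs and values are invented, and it assumes the `ID` has the form `<Product_ID>_<price>` as built in the training query. Splitting on the last underscore and casting the price to a float are choices made here for illustration.

```python
import pandas as pd

# Toy average forecasts keyed by "<Product_ID>_<price>"; the values are made up.
forecast_avg = pd.DataFrame(
    {
        "ID": ["SKU A_1.5", "SKU A_2.0", "SKU B_3.0", "SKU B_3.5"],
        "forecast_value": [120.0, 90.0, 40.0, 55.0],
    }
)

# Split on the last underscore so the price is always the final part.
parts = forecast_avg["ID"].str.rsplit("_", n=1, expand=True)
forecast_avg["Product_ID"] = parts[0]
forecast_avg["Price"] = parts[1].astype(float)

# For each product, take the row with the highest average forecast.
best_idx = forecast_avg.groupby("Product_ID")["forecast_value"].idxmax()
best_rows = forecast_avg.loc[best_idx]
best_price = best_rows.set_index("Product_ID")["Price"].to_dict()
print(best_price)
```

This follows the same criterion as the plots, namely the highest average forecasted orders. It does not account for revenue or margin, which would need the price multiplied into the comparison.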


## Clean Up
<a name="section-12"></a>

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial. The following code deletes the entire dataset.

In [None]:
# Set dataset_id to the ID of the dataset to delete.
dataset_id = "{PROJECT_ID}.{DATASET}".format(PROJECT_ID=PROJECT_ID, DATASET=DATASET)

# Use the delete_contents parameter to delete a dataset and its contents.
# Use the not_found_ok parameter to not receive an error if the dataset has already been deleted.
client.delete_dataset(
    dataset_id, delete_contents=True, not_found_ok=True
)  # Make an API request.

print("Deleted dataset '{}'.".format(dataset_id))