# Data Science Using Snowpark for Python and Auto Arima

The purpose of this script is to demonstrate simple data science predictions on Snowflake objects using Snowpark for Python and Auto Arima. The intent is to begin with a Snowflake table containing monthly website sales data spanning multiple categories and create a predictive model to approximate future sales.

## Import the various packages

Before we can begin, we must import the required packages.

### Main packages

In [1]:
import pandas
import pmdarima
import snowflake.snowpark

### InterWorks Snowpark package

We must also import the session builder module from the InterWorks Snowpark package and use it to create a Snowflake Snowpark Session object that is connected to our Snowflake environment. Alternatively, you can modify the code to establish a Snowflake Snowpark Session through any method of your choice.

In [2]:
# Import module to build snowpark sessions
from interworks_snowpark.snowpark_session_builder import build_snowpark_session_via_parameters_json as build_snowpark_session

# Generate Snowpark session
snowpark_session = build_snowpark_session()
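
If the InterWorks package is not available, one possible alternative is to build the session directly with `snowflake.snowpark.Session.builder`. The following is only a minimal sketch: the `SNOWFLAKE_*` environment variable names and both helper functions are hypothetical and are not part of the original notebook.

```python
import os


def load_connection_parameters(env=os.environ):
    """Collect connection parameters from (hypothetical) SNOWFLAKE_* environment variables."""
    keys = ["account", "user", "password", "role", "warehouse", "database", "schema"]
    params = {key: env.get(f"SNOWFLAKE_{key.upper()}") for key in keys}
    # Drop anything that has not been set so the builder only receives real values
    return {key: value for key, value in params.items() if value is not None}


def build_session_from_env(env=os.environ):
    """Create a Snowpark session from the parameters collected above."""
    # Imported lazily so that this sketch can be loaded without Snowpark installed
    from snowflake.snowpark import Session

    return Session.builder.configs(load_connection_parameters(env)).create()
```

With the relevant variables set, `snowpark_session = build_session_from_env()` could stand in for the call above.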

## Retrieve data

Before we can train a model, we must retrieve the data that we wish to leverage.

### Create variables that will be fed into the stored procedure

By creating variables now, we can more easily convert our process to a Stored Procedure later.

In [3]:
origin_table = 'WEBSITE_SALES'
destination_table = 'WEBSITE_SALES_PREDICTIONS'

### Retrieve the data from the source table

In [4]:
df_sales_sf = snowpark_session.table(f'"SALES_DB"."CLEAN"."{origin_table}"') 

df_sales_sf.show()

---------------------------------------------------
|"MONTH_OF_OPERATION"  |"CATEGORY"  |"SALES"      |
---------------------------------------------------
|2020-06-01 00:00:00   |HIGH        |4667132.369  |
|2020-07-01 00:00:00   |HIGH        |5537749.13   |
|2020-08-01 00:00:00   |HIGH        |5539887.906  |
|2020-09-01 00:00:00   |HIGH        |4905363.078  |
|2020-10-01 00:00:00   |HIGH        |3318235.872  |
|2020-10-01 00:00:00   |MEDIUM      |584250.14    |
|2020-11-01 00:00:00   |HIGH        |2413273.809  |
|2020-11-01 00:00:00   |MEDIUM      |1395640.868  |
|2020-12-01 00:00:00   |HIGH        |1970506.003  |
|2020-12-01 00:00:00   |MEDIUM      |1581726.646  |
---------------------------------------------------



### Convert data into a Pandas dataframe

Our current dataframe is a Snowpark dataframe, which represents a lazily evaluated query against an object in Snowflake. We wish to download the results into a Pandas dataframe so that we can manipulate them more freely.

In [5]:
df_sales = pandas.DataFrame(data=df_sales_sf.collect()) \
  .sort_values(by=['MONTH_OF_OPERATION', 'CATEGORY'], ignore_index=True)

display(df_sales)

| | MONTH_OF_OPERATION | CATEGORY | SALES |
|---|---|---|---|
| 0 | 2017-01-01 | HIGH | 389788.900 |
| 1 | 2017-01-01 | LOW | 972043.500 |
| 2 | 2017-01-01 | MEDIUM | 2921744.500 |
| 3 | 2017-02-01 | HIGH | 361717.200 |
| 4 | 2017-02-01 | LOW | 127406.600 |
| ... | ... | ... | ... |
| 131 | 2022-05-01 | HIGH | 3800767.616 |
| 132 | 2022-05-01 | MEDIUM | 210168.375 |
| 133 | 2022-06-01 | HIGH | 4750553.049 |
| 134 | 2022-07-01 | HIGH | 5411509.156 |


## Create predictive model

Now that we have our data, we are ready to begin constructing our predictive model.

### Test and Train

Split our data into training and test sets, holding out the final 24 rows, which the code treats as a prediction horizon of 2 years.

In [6]:
pred_periods = 24
split_number = df_sales['SALES'].count() - pred_periods # corresponds to a prediction horizon of 2 years
df_train     = pandas.DataFrame(df_sales['SALES'][:split_number]).rename(columns={'SALES':'y_train'})
df_test      = pandas.DataFrame(df_sales['SALES'][split_number:]).rename(columns={'SALES':'y_test' })
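
The slicing logic above can be illustrated on a plain list. This is a minimal sketch rather than the notebook's own code: the helper `split_train_test` and the toy data are hypothetical, and the guard against an over-long horizon is an addition.

```python
def split_train_test(values, pred_periods):
    """Hold out the final `pred_periods` observations as the test set."""
    split_number = len(values) - pred_periods
    if split_number <= 0:
        raise ValueError("pred_periods must be smaller than the number of observations")
    return values[:split_number], values[split_number:]


# Toy data: 36 ordered observations standing in for the SALES column
observations = list(range(36))
train, test = split_train_test(observations, pred_periods=24)
```

Because the split is positional, it relies on the rows already being sorted chronologically, as they are after the `sort_values` call earlier.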

### Create Auto Arima model

Leverage Auto Arima to create a model fit.

In [7]:
model_fit = pmdarima.auto_arima(df_train, test='adf', 
                         max_p=3, max_d=3, max_q=3, 
                         seasonal=True, m=12,
                         max_P=3, max_D=2, max_Q=3,
                         trace=True,
                         error_action='ignore',  
                         suppress_warnings=True, 
                         stepwise=True)

Performing stepwise search to minimize aic
 ARIMA(2,0,2)(1,0,1)[12] intercept   : AIC=3525.832, Time=0.48 sec
 ARIMA(0,0,0)(0,0,0)[12] intercept   : AIC=3545.533, Time=0.01 sec
 ARIMA(1,0,0)(1,0,0)[12] intercept   : AIC=3539.564, Time=0.06 sec
 ARIMA(0,0,1)(0,0,1)[12] intercept   : AIC=3541.961, Time=0.04 sec
 ARIMA(0,0,0)(0,0,0)[12]             : AIC=3648.498, Time=0.00 sec
 ARIMA(2,0,2)(0,0,1)[12] intercept   : AIC=3540.453, Time=0.12 sec
 ARIMA(2,0,2)(1,0,0)[12] intercept   : AIC=3540.386, Time=0.11 sec
 ARIMA(2,0,2)(2,0,1)[12] intercept   : AIC=3526.661, Time=1.01 sec
 ARIMA(2,0,2)(1,0,2)[12] intercept   : AIC=3525.846, Time=1.03 sec
 ARIMA(2,0,2)(0,0,0)[12] intercept   : AIC=3538.498, Time=0.04 sec
 ARIMA(2,0,2)(0,0,2)[12] intercept   : AIC=3534.069, Time=0.51 sec
 ARIMA(2,0,2)(2,0,0)[12] intercept   : AIC=3530.644, Time=0.41 sec
 ARIMA(2,0,2)(2,0,2)[12] intercept   : AIC=3524.210, Time=1.21 sec
 ARIMA(2,0,2)(3,0,2)[12] intercept   : AIC=3526.011, Time=2.51 sec
 ARIMA(2,0,2)(2,0,3

### Summarise model

If desired, the model can be summarised.

In [8]:
print(model_fit.summary())

                                        SARIMAX Results                                        
Dep. Variable:                                       y   No. Observations:                  112
Model:             SARIMAX(2, 0, 1)x(2, 0, [1, 2], 12)   Log Likelihood               -1752.962
Date:                                 Wed, 07 Sep 2022   AIC                           3523.924
Time:                                         16:38:36   BIC                           3548.390
Sample:                                              0   HQIC                          3533.851
                                                 - 112                                         
Covariance Type:                                   opg                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
intercept   1.104e+06   3.12e-08   3.54e+13      0.000     1.1e+06     1.1
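
The information criteria reported in the summary can be checked by hand from the log likelihood. The sketch below assumes that the model has 9 estimated parameters (an intercept, two AR terms, one MA term, two seasonal AR terms, two seasonal MA terms and the error variance). This count is inferred from the model order, since the coefficient table above is cut off.

```python
import math

log_likelihood = -1752.962  # "Log Likelihood" from the summary
n_obs = 112                 # "No. Observations" from the summary
k_params = 9                # assumption: inferred from the model order

# Standard definitions of the three criteria
aic = 2 * k_params - 2 * log_likelihood
bic = k_params * math.log(n_obs) - 2 * log_likelihood
hqic = 2 * k_params * math.log(math.log(n_obs)) - 2 * log_likelihood

print(f"AIC={aic:.3f}  BIC={bic:.3f}  HQIC={hqic:.3f}")
```

Under that assumption the three values agree with the AIC, BIC and HQIC shown in the summary to within rounding. Auto Arima keeps the candidate with the lowest AIC, so each extra parameter has to improve the fit enough to offset its penalty.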

### Generate in-sample predictions

The parameter `dynamic=False` means that each in-sample prediction is made from the actual lagged values of the series rather than from earlier predictions. In other words, the model uses the observed data up to a given point in the time series and then tries to predict the next value.

In [9]:
# Create the predictions
pred = model_fit.predict_in_sample(dynamic=False) # works only with auto-arima
df_train['y_train_pred'] = pred

# Calculate the percentage difference
df_train['diff_percent'] = abs((df_train['y_train'] - pred) / df_train['y_train'])* 100
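
To make the percentage difference concrete, here is a minimal sketch that applies the same formula to the first three training rows shown in the combined output later in this notebook. The helper `percent_difference` and the averaging step are illustrative additions rather than part of the original process.

```python
def percent_difference(actual, predicted):
    """Absolute percentage difference, mirroring the diff_percent column."""
    return abs((actual - predicted) / actual) * 100


# (SALES, TRAIN_PREDICTION) pairs for the first three rows of the combined output
rows = [
    (389788.9, 2.216705e06),
    (972043.5, 2.381818e06),
    (2921744.5, 2.571555e06),
]

diffs = [percent_difference(actual, predicted) for actual, predicted in rows]

# Averaging the differences gives a mean absolute percentage error for these rows
mean_diff = sum(diffs) / len(diffs)
```

Note that the formula divides by the actual value, so it is undefined for any row where `SALES` is zero.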

### Generate predictions on test data

Generate predictions for the next `pred_periods` periods. Predictions start immediately after the last observation in the training data.

In [10]:
test_pred = model_fit.predict(n_periods=pred_periods, dynamic=False)
df_test['y_test_pred'] = test_pred

### Combine test and train prediction values with original

In [11]:
df_union = pandas.concat([df_sales, df_train, df_test], axis = 1) \
  .rename(columns={'y_train_pred':'TRAIN_PREDICTION', 'y_test_pred': 'TEST_PREDICTION'}) \
  [["MONTH_OF_OPERATION", "CATEGORY", "SALES", "TRAIN_PREDICTION", "TEST_PREDICTION"]]
 
display(df_union)

| | MONTH_OF_OPERATION | CATEGORY | SALES | TRAIN_PREDICTION | TEST_PREDICTION |
|---|---|---|---|---|---|
| 0 | 2017-01-01 | HIGH | 389788.900 | 2.216705e+06 | |
| 1 | 2017-01-01 | LOW | 972043.500 | 2.381818e+06 | |
| 2 | 2017-01-01 | MEDIUM | 2921744.500 | 2.571555e+06 | |
| 3 | 2017-02-01 | HIGH | 361717.200 | 2.117738e+06 | |
| 4 | 2017-02-01 | LOW | 127406.600 | 2.308963e+06 | |
| ... | ... | ... | ... | ... | ... |
| 131 | 2022-05-01 | HIGH | 3800767.616 | | 3.134833e+06 |
| 132 | 2022-05-01 | MEDIUM | 210168.375 | | 1.463570e+06 |
| 133 | 2022-06-01 | HIGH | 4750553.049 | | 4.162073e+06 |
| 134 | 2022-07-01 | HIGH | 5411509.156 | | 4.589970e+06 |
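
The empty cells in this output follow from how `pandas.concat` behaves with `axis=1`: frames are aligned on their index, and any row that is absent from a frame is filled with a missing value. The following minimal sketch shows this with made-up numbers; the toy frames only mimic the shape of `df_sales`, `df_train` and `df_test`.

```python
import pandas

# Toy stand-ins with made-up values; the indexes mirror the train/test split
df_sales = pandas.DataFrame({"SALES": [10.0, 12.0, 11.0, 13.0, 14.0]})
split_number = 3
df_train = pandas.DataFrame(
    {"y_train_pred": [9.5, 11.5, 11.2]}, index=range(0, split_number)
)
df_test = pandas.DataFrame(
    {"y_test_pred": [12.8, 13.9]}, index=range(split_number, 5)
)

# Align all three frames on their shared index
df_union = pandas.concat([df_sales, df_train, df_test], axis=1)

# Training rows have no test prediction, and test rows have no training prediction
train_rows_missing_test = df_union.loc[: split_number - 1, "y_test_pred"].isna().all()
test_rows_missing_train = df_union.loc[split_number:, "y_train_pred"].isna().all()
```

This is why each row of the combined dataframe carries a value in only one of the two prediction columns.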


### Write output back to Snowflake

Upload the combined dataframe into the destination table in Snowflake, creating the table automatically if it does not already exist.

In [17]:
snowpark_session.write_pandas(
    df = df_union
  , table_name = destination_table
  , schema = 'MART'
  , database = 'SALES_DB'
  , auto_create_table = True
)

<snowflake.snowpark.table.Table at 0x185e9004820>

## Create Snowflake Stored Procedure

Now that we have run through the above in steps, we can combine it all into a function and convert it into a stored procedure.

### Create function

The first part of creating a Stored Procedure to deploy to Snowflake is to create the function that will become the Stored Procedure.

In [16]:
def generate_auto_arima_predictions(
    snowpark_session: snowflake.snowpark.Session
  , origin_table: str
  , destination_table: str
) :
  # Retrieve the data from the source table
  df_sales_sf = snowpark_session.table(f'"SALES_DB"."CLEAN"."{origin_table}"')

  # Convert data into a Pandas dataframe
  df_sales = pandas.DataFrame(data=df_sales_sf.collect()) \
    .sort_values(by=['MONTH_OF_OPERATION', 'CATEGORY'], ignore_index=True)

  # Test and train
  pred_periods = 24
  split_number = df_sales['SALES'].count() - pred_periods # corresponds to a prediction horizon of 2 years
  df_train     = pandas.DataFrame(df_sales['SALES'][:split_number]).rename(columns={'SALES':'y_train'})
  df_test      = pandas.DataFrame(df_sales['SALES'][split_number:]).rename(columns={'SALES':'y_test' })

  # Create Auto Arima model
  model_fit = pmdarima.auto_arima(df_train, test='adf', 
                         max_p=3, max_d=3, max_q=3, 
                         seasonal=True, m=12,
                         max_P=3, max_D=2, max_Q=3,
                         trace=True,
                         error_action='ignore',  
                         suppress_warnings=True, 
                         stepwise=True)

  # Generate in-sample predictions
  pred = model_fit.predict_in_sample(dynamic=False) # works only with auto-arima
  df_train['y_train_pred'] = pred

  # Generate predictions on test data
  test_pred = model_fit.predict(n_periods=pred_periods, dynamic=False)
  df_test['y_test_pred'] = test_pred

  # Combine test and train prediction values with original
  df_union = pandas.concat([df_sales, df_train, df_test], axis = 1) \
    .rename(columns={'y_train_pred':'TRAIN_PREDICTION', 'y_test_pred': 'TEST_PREDICTION'}) \
    [["MONTH_OF_OPERATION", "CATEGORY", "SALES", "TRAIN_PREDICTION", "TEST_PREDICTION"]]
  
  # Write output back to Snowflake
  snowpark_session.write_pandas(
      df = df_union
    , table_name = destination_table
    , schema = 'MART'
    , database = 'SALES_DB'
    , auto_create_table = True
  )

  return 'Complete'

### Import any required Snowpark objects

Our stored procedure only requires the data type `StringType`, as all inputs and outputs are strings. We also import `sproc` from `snowflake.snowpark.functions`, although the registration below goes through `snowpark_session.sproc.register`.

In [14]:
from snowflake.snowpark.functions import sproc
from snowflake.snowpark.types import StringType

### Convert function into Stored Procedure

In [18]:
# Add required packages into the session creating our stored procedure 
snowpark_session.add_packages('snowflake-snowpark-python', 'pandas', 'pmdarima')

# Upload SProc to Snowflake
snowpark_session.sproc.register(
    func = generate_auto_arima_predictions
  , return_type = StringType()
  , input_types = [StringType(), StringType()]
  , is_permanent = True
  , name = 'SALES_DB.PROCEDURES.GENERATE_AUTO_ARIMA_FUNCTION'
  , replace = True
  , stage_location = '@SALES_DB.PROCEDURES.MY_STAGE'
)

<snowflake.snowpark.stored_procedure.StoredProcedure at 0x185cf1a4490>
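
Once registered, the stored procedure can be executed from any Snowpark session using `Session.call`. The following is a minimal sketch: the wrapper `run_arima_procedure` is a hypothetical helper, and it assumes the session's role has the privileges needed to call the procedure.

```python
def run_arima_procedure(session, origin_table, destination_table):
    """Call the registered stored procedure and return its result.

    `session` is expected to be a snowflake.snowpark.Session. Snowflake supplies
    the procedure's own session argument, so only the two strings are passed.
    """
    return session.call(
        "SALES_DB.PROCEDURES.GENERATE_AUTO_ARIMA_FUNCTION",
        origin_table,
        destination_table,
    )
```

For example, `run_arima_procedure(snowpark_session, origin_table, destination_table)` should return the string `'Complete'` that the function above returns once the predictions have been written.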