# Data Science Using Snowpark for Python and Auto Arima

The purpose of this script is to demonstrate simple data science predictions on Snowflake objects using Snowpark for Python and Auto Arima. The intent is to begin with a Snowflake table containing monthly website sales data spanning multiple categories and create a predictive model to approximate future sales.

Our final process will iterate over each category in the dataset and then combine the results into a single table.

## Import the various packages

Before we can begin, we must import the required packages.

### Main packages

In [93]:
import pandas
import pmdarima
import snowflake.snowpark

### InterWorks Snowpark package

We must also import the session builder from the InterWorks Snowpark package and use it to create a Snowflake Snowpark Session object that is connected to our Snowflake environment. Alternatively, you can modify the code to establish a Snowflake Snowpark Session through any method of your choice.

In [94]:
## Import module to build snowpark sessions
from interworks_snowpark.snowpark_session_builder import build_snowpark_session_via_parameters_json as build_snowpark_session

## Generate Snowpark session
snowpark_session = build_snowpark_session()
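
As one possible alternative to the InterWorks helper, a session can be built directly with `snowflake.snowpark.Session.builder`. The sketch below assumes a hypothetical JSON file, `snowflake_connection.json`, that holds the connection parameters. The file name, the two helper functions and the set of required keys are illustrative assumptions and are not part of the original notebook.

```python
import json
from pathlib import Path

# Keys treated as mandatory in this sketch (an assumption; adjust to your authentication method)
REQUIRED_KEYS = ("account", "user", "password")

def load_connection_parameters(path):
    """Read connection parameters from a JSON file and check the mandatory keys."""
    params = json.loads(Path(path).read_text())
    missing = [key for key in REQUIRED_KEYS if key not in params]
    if missing:
        raise KeyError(f"Missing connection parameters: {missing}")
    return params

def build_session_from_json(path):
    """Create a Snowpark session from the parameters stored in the JSON file."""
    # Imported here so the parameter loading can be used without Snowpark installed
    from snowflake.snowpark import Session
    return Session.builder.configs(load_connection_parameters(path)).create()

# Hypothetical usage:
# snowpark_session = build_session_from_json("snowflake_connection.json")
```

Keeping the parameter loading separate from the session creation makes it possible to validate the file before any connection attempt is made.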

## Create Snowflake Stored Procedure

Now that we have run through the above in steps, we can combine it all into a function and convert it into a stored procedure.

### Create function

The first part of creating a Stored Procedure to deploy to Snowflake is to create the function that will become the Stored Procedure.

In [128]:
def generate_auto_arima_predictions(
    snowpark_session: snowflake.snowpark.Session
  , origin_table: str
  , destination_table: str
) :

  # Retrieve the data from the source table
  df_sales_sf = snowpark_session.table(f'"SALES_DB"."CLEAN"."{origin_table}"')

  # Convert data into a Pandas dataframe
  df_sales = pandas.DataFrame(data=df_sales_sf.collect()) \
    .sort_values(by=["SALE_MONTH","CATEGORY" ], ignore_index=True)

  # Convert the date field into a timezone-aware (UTC) Pandas datetime
  df_sales["SALE_MONTH"] = pandas.to_datetime(df_sales["SALE_MONTH"]).dt.tz_localize('UTC')
  
  # Define prediction horizon of 2 years
  pred_periods = 24

  # Define final output dataframe
  df_final_output = pandas.DataFrame(columns=["SALE_MONTH", "CATEGORY", "SALES", "TRAIN_PREDICTION", "TEST_PREDICTION"])

  # Iterate through different categories
  for category in df_sales["CATEGORY"].unique():

    # Define dataframe for current category
    df_current_category = df_sales[df_sales["CATEGORY"] == category].reset_index(drop=True)

    # Split into train and test sets, holding out the final pred_periods rows for testing
    split_number = df_current_category['SALES'].count() - pred_periods
    df_train     = pandas.DataFrame(df_current_category['SALES'][:split_number]).rename(columns={'SALES':'y_train'})
    df_test      = pandas.DataFrame(df_current_category['SALES'][split_number:]).rename(columns={'SALES':'y_test' })

    # Create Auto Arima model
    model_fit = pmdarima.auto_arima(df_train, test='adf', 
                          max_p=3, max_d=3, max_q=3, 
                          seasonal=True, m=12,
                          max_P=3, max_D=2, max_Q=3,
                          trace=True,
                          error_action='ignore',  
                          suppress_warnings=True, 
                          stepwise=True)

    # Generate in-sample predictions
    pred = model_fit.predict_in_sample(dynamic=False) # predict_in_sample is a method of the fitted pmdarima model
    df_train['y_train_pred'] = pandas.to_numeric(pred)

    # Generate predictions on test data
    test_pred = model_fit.predict(n_periods=pred_periods, dynamic=False)
    df_test['y_test_pred'] = pandas.to_numeric(test_pred)

    # Combine test and train prediction values with original
    df_combined = pandas.concat([df_current_category, df_train, df_test], axis = 1) \
      .rename(columns={'y_train_pred':'TRAIN_PREDICTION', 'y_test_pred': 'TEST_PREDICTION'}) \
      [["SALE_MONTH", "CATEGORY", "SALES", "TRAIN_PREDICTION", "TEST_PREDICTION"]]

    # Append combined result to final output
    df_final_output = pandas.concat([df_final_output, df_combined], ignore_index = True, sort = False)
  
  # Write output back to Snowflake
  snowpark_session.write_pandas(
      df = df_final_output
    , table_name = destination_table
    , schema = 'MART'
    , database = 'SALES_DB'
    , auto_create_table = True
  )

  return 'Complete'
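
The way the function combines the train and test predictions relies on `pandas.concat` aligning rows by index when `axis=1` is used. The following minimal sketch illustrates this on toy data. The category name, the sales figures and the stand-in prediction values are all made up for illustration, and no model is fitted.

```python
import pandas

# Toy data: six months of sales for one hypothetical category
df_current_category = pandas.DataFrame({
    "SALE_MONTH": pandas.date_range("2022-01-01", periods=6, freq="MS"),
    "CATEGORY": ["Books"] * 6,
    "SALES": [10.0, 12.0, 11.0, 13.0, 15.0, 14.0],
})

# Hold out the last two rows as the test set (the function above holds out 24)
pred_periods = 2
split_number = df_current_category["SALES"].count() - pred_periods
df_train = df_current_category[["SALES"]][:split_number].rename(columns={"SALES": "y_train"})
df_test = df_current_category[["SALES"]][split_number:].rename(columns={"SALES": "y_test"})

# Stand-in predictions; in the real function these come from the fitted model
df_train["y_train_pred"] = [10.5, 11.5, 11.0, 12.5]
df_test["y_test_pred"] = [14.0, 14.5]

# Concatenating on axis=1 aligns rows by index, so each prediction
# column is NaN outside the rows it was generated for
df_combined = pandas.concat([df_current_category, df_train, df_test], axis=1)
print(df_combined[["SALES", "y_train_pred", "y_test_pred"]])
```

Because the test frame keeps its original index values, its predictions land on the last rows of the combined frame, which is why `TRAIN_PREDICTION` and `TEST_PREDICTION` never overlap in the final output.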

### Import any required Snowpark objects

Our stored procedure only requires the data type `StringType`, as all inputs and outputs are strings. We also import `sproc`, the function for creating stored procedures, although the registration below goes through `snowpark_session.sproc.register` instead.

In [122]:
from snowflake.snowpark.functions import sproc
from snowflake.snowpark.types import StringType

### Add the required packages to the session

Add the required packages to the session that creates our stored procedure, so that the stored procedure can use them.

In [123]:
snowpark_session.add_packages('snowflake-snowpark-python', 'pandas', 'pmdarima')

### Convert function into Stored Procedure

In [129]:
snowpark_session.sproc.register(
    func = generate_auto_arima_predictions
  , return_type = StringType()
  , input_types = [StringType(), StringType()]
  , is_permanent = True
  , name = 'SALES_DB.PROCEDURES.GENERATE_AUTO_ARIMA_FUNCTION'
  , replace = True
  , stage_location = '@SALES_DB.PROCEDURES.MY_STAGE'
)

<snowflake.snowpark.stored_procedure.StoredProcedure at 0x1acb411e4c0>
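
Once registered, the stored procedure can be executed from a Snowpark session. The sketch below wraps `Session.call` in a small helper. The helper name and the table names `MONTHLY_SALES` and `SALES_PREDICTIONS` are hypothetical and should be replaced with the tables in your own environment.

```python
def run_auto_arima_procedure(session, origin_table, destination_table):
    """Call the registered stored procedure and return its result."""
    return session.call(
        "SALES_DB.PROCEDURES.GENERATE_AUTO_ARIMA_FUNCTION",
        origin_table,
        destination_table,
    )

# Hypothetical usage with the session created earlier:
# result = run_auto_arima_procedure(snowpark_session, "MONTHLY_SALES", "SALES_PREDICTIONS")
```

The session argument is not listed in `input_types` at registration because Snowflake supplies it automatically, so only the two table names are passed when calling the procedure.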