# 2. ML Feature Transformations

In this notebook, we walk through a few of the transformations included in the Snowpark ML Preprocessing API. <br>
We also build a preprocessing pipeline to be used in the ML modeling notebook. A preprocessing pipeline lets us apply the same transformations, in a standard way, to the training data, the test data, and the data to be scored.

In [None]:
# Snowpark for Python
from snowflake.snowpark import Session
from snowflake.snowpark.version import VERSION
import snowflake.snowpark.functions as F
from snowflake.snowpark.types import DecimalType

# Snowpark ML
import snowflake.ml.modeling.preprocessing as snowml
from snowflake.ml.modeling.pipeline import Pipeline
from snowflake.ml.modeling.metrics.correlation import correlation

# Data Science Libs
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

# Misc
import json
import joblib

# warning suppression
import warnings; warnings.simplefilter('ignore')

## 1. Establish Secure Connection to Snowflake


In [None]:
with open('creds.json') as f:
    connection_parameters = json.load(f)

session = Session.builder.configs(connection_parameters).create()

snowflake_environment = session.sql('SELECT current_user(), current_version()').collect()
snowpark_version = VERSION

# Current Environment Details
print('\nConnection Established with the following parameters:')
print('User                        : {}'.format(snowflake_environment[0][0]))
print('Role                        : {}'.format(session.get_current_role()))
print('Database                    : {}'.format(session.get_current_database()))
print('Schema                      : {}'.format(session.get_current_schema()))
print('Warehouse                   : {}'.format(session.get_current_warehouse()))
print('Snowflake version           : {}'.format(snowflake_environment[0][1]))
print('Snowpark for Python version : {}.{}.{}'.format(snowpark_version[0],snowpark_version[1],snowpark_version[2]))

In [None]:
# Specify the table name where we stored the scooby_doo dataset
DEMO_TABLE = 'scooby_clean'
input_tbl = f"{session.get_current_database()}.{session.get_current_schema()}.{DEMO_TABLE}"

## 2. Data Loading

Load the data from Snowflake and select the features to be used in our ML model. A few things I considered for this selection:
<ol>

<li>For categorical features, I dismissed the ones with very sparse values. Given that we have 604 observations, I dismissed the features with more than 30 categories, so that we have at least 20 observations per category on average. <br>
The selected features are: <code>"FORMAT","NETWORK","SETTING_TERRAIN","MOTIVE","MONSTER_GENDER","CULPRIT_GENDER"</code> </li>

<br>
<li>For numerical features I keep <code>"IMDB"</code> as it's our target feature. <br>
I keep <code>"ENGAGEMENT"</code> and <code>"RUN_TIME"</code>, which is the length of the episode in minutes. <br> <br>

For the rest of the integer features, I require that at least 80% of the observations contain non-null values. <br>
I keep <code>"MONSTER_AMOUNT","SUSPECTS_AMOUNT","CULPRIT_AMOUNT"</code> <br>
I also keep the features that indicate how many times each of these phrases was said during the episode: <br>
<code> "SPLIT_UP","ANOTHER_MYSTERY","SET_A_TRAP","JEEPERS","JINKIES","MY_GLASSES" <br>
,"JUST_ABOUT_WRAPPED_UP","ZOINKS","GROOVY","SCOOBY_DOO_WHERE_ARE_YOU","ROOBY_ROOBY_ROO" </code> </li>

<br>
<li>For boolean features, I likewise keep only the ones that have non-null values in at least 80% of the rows. <br>
Additionally, the caught, captured and unmasked features give very similar information, so we keep only the caught features. <br>
I keep <code>"MONSTER_REAL","CAUGHT_SHAGGY","CAUGHT_SCOOBY","SNACK_SHAGGY","SNACK_SCOOBY","UNMASK_OTHER","CAUGHT_OTHER", <br>
"CAUGHT_NOT","DOOR_GAG","BATMAN","SCOOBY_DUM","SCRAPPY_DOO","HEX_GIRLS","BLUE_FALCON"</code> </li>
</ol>
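
The cardinality rule for categorical features can be sketched in plain Python. This is only an illustration of the arithmetic (604 rows, at least 20 rows per category on average); the helper names and the toy columns are hypothetical and are not part of the notebook or of Snowpark ML.

```python
def max_categories(n_rows, min_rows_per_category):
    """Largest number of categories that still leaves the requested average rows per category."""
    return n_rows // min_rows_per_category


def keep_categorical(values, n_rows, min_rows_per_category=20):
    """Keep a categorical column only if its distinct-value count is within the limit."""
    return len(set(values)) <= max_categories(n_rows, min_rows_per_category)


N_ROWS = 604  # number of observations quoted in the text
limit = max_categories(N_ROWS, 20)

# Hypothetical columns: one with 3 categories, one with a distinct value per row
low_cardinality = ["A", "B", "C"] * 201 + ["A"]
high_cardinality = [f"value_{i}" for i in range(N_ROWS)]

print("Maximum number of categories:", limit)
print("Keep low-cardinality column :", keep_categorical(low_cardinality, N_ROWS))
print("Keep high-cardinality column:", keep_categorical(high_cardinality, N_ROWS))
```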

In [None]:
# First, we read in the data from a Snowflake table into a Snowpark DataFrame
scooby_df = session.table(input_tbl)

### 1. Filter out rows where IMDB is null

In [None]:
scooby_df = scooby_df.filter(F.not_(F.is_null(F.col("IMDB"))))
print(scooby_df.count())
scooby_df.show()

### 2. Select the features we are going to use

In [None]:
# We select the features that we are going to use in the ML model according to the analysis below
features_array = ["IMDB","ENGAGEMENT","RUN_TIME","FORMAT","NETWORK","SETTING_TERRAIN","MOTIVE","MONSTER_GENDER","CULPRIT_GENDER"
,"MONSTER_AMOUNT","SUSPECTS_AMOUNT","CULPRIT_AMOUNT","ZOINKS","GROOVY","SCOOBY_DOO_WHERE_ARE_YOU","ROOBY_ROOBY_ROO"
,"MONSTER_REAL","CAUGHT_SHAGGY","CAUGHT_SCOOBY","SNACK_SHAGGY","SNACK_SCOOBY","UNMASK_OTHER","CAUGHT_OTHER","CAUGHT_NOT"
,"DOOR_GAG","BATMAN","SCOOBY_DUM","SCRAPPY_DOO","HEX_GIRLS","BLUE_FALCON"]

scooby_ml_df = scooby_df.select(features_array)
scooby_ml_df.show()

## Analysis to identify candidate features to use

This analysis is informational only; there is no need to re-run it once the features have been identified.

### Analyse category features to identify the ones that have < 30 categories

In [None]:
# Categorical Columns review
cat_array = ["FORMAT","NETWORK","SETTING_TERRAIN","SETTING_COUNTRY_STATE","MOTIVE"
             ,"MONSTER_GENDER","MONSTER_TYPE","MONSTER_SUBTYPE","MONSTER_SPECIES","CULPRIT_GENDER"]

# Count the distinct values per column in Snowflake instead of collecting every row locally
for c in cat_array:
    print(c + " " + str(scooby_df.select(F.col(c)).distinct().count()))

print(scooby_df.count())
print(set(scooby_df.select(F.col("FORMAT")).collect()))
print(set(scooby_df.select(F.col("NETWORK")).collect()))
print(set(scooby_df.select(F.col("SETTING_TERRAIN")).collect()))
print(set(scooby_df.select(F.col("SETTING_COUNTRY_STATE")).collect()))
print(set(scooby_df.select(F.col("MOTIVE")).collect()))


print(set(scooby_df.select(F.col("MONSTER_GENDER")).collect()))
print(set(scooby_df.select(F.col("MONSTER_TYPE")).collect()))
print(set(scooby_df.select(F.col("MONSTER_SUBTYPE")).collect()))
print(set(scooby_df.select(F.col("MONSTER_SPECIES")).collect()))
print(set(scooby_df.select(F.col("CULPRIT_GENDER")).collect()))

### Numerical and boolean features

For numerical and boolean features, we want at least 80% of the rows to contain a value. Given that we have 604 observations, we need roughly 483 non-null values (0.8 × 604 ≈ 483.2) per feature to consider it. <br>
We can investigate the dataset with the `describe` function.
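
As a rough illustration of the non-null rule, the check below computes the share of non-null values in a column and compares it with the 80% threshold. It is a sketch on hypothetical toy lists, with `None` standing in for SQL NULL; the function names are made up for this example.

```python
def non_null_fraction(values):
    """Share of entries that are not None (None stands in for SQL NULL here)."""
    return sum(v is not None for v in values) / len(values)


def keep_feature(values, min_fraction=0.8):
    """Keep a feature only if enough of its rows carry a value."""
    return non_null_fraction(values) >= min_fraction


# Hypothetical columns with 10 rows each
mostly_filled = [1, 0, 2, None, 3, 1, 0, None, 2, 1]     # 8 of 10 rows have a value
too_sparse = [1, None, None, 0, None, 2, None, 1, 0, 3]  # 6 of 10 rows have a value

print(non_null_fraction(mostly_filled), keep_feature(mostly_filled))
print(non_null_fraction(too_sparse), keep_feature(too_sparse))
```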

In [None]:
# Arrays from the 02_snowpark_ml_data_ingest.ipynb cast types step
int_array = ["ENGAGEMENT","RUN_TIME","MONSTER_AMOUNT","SUSPECTS_AMOUNT","CULPRIT_AMOUNT","SPLIT_UP","ANOTHER_MYSTERY","SET_A_TRAP","JEEPERS","JINKIES","MY_GLASSES"
,"JUST_ABOUT_WRAPPED_UP","ZOINKS","GROOVY","SCOOBY_DOO_WHERE_ARE_YOU","ROOBY_ROOBY_ROO"]
float_array = ["IMDB"]
boolean_array = ["MONSTER_REAL","CAUGHT_FRED","CAUGHT_DAPHNE","CAUGHT_VELMA","CAUGHT_SHAGGY","CAUGHT_SCOOBY"
,"CAPTURED_FRED","CAPTURED_DAPHNE","CAPTURED_VELMA","CAPTURED_SHAGGY","CAPTURED_SCOOBY"
,"UNMASK_FRED","UNMASK_DAPHNE","UNMASK_VELMA","UNMASK_SHAGGY","UNMASK_SCOOBY"
,"SNACK_FRED","SNACK_DAPHNE","SNACK_VELMA","SNACK_SHAGGY","SNACK_SCOOBY","UNMASK_OTHER","CAUGHT_OTHER","CAUGHT_NOT","TRAP_WORK_FIRST","NON_SUSPECT","ARRESTED","DOOR_GAG"
,"BATMAN","SCOOBY_DUM","SCRAPPY_DOO","HEX_GIRLS","BLUE_FALCON"]

In [None]:
scooby_df.select(int_array).describe().show()

In [None]:
# Snowpark dataframe describe doesn't work with boolean data types at the moment, so we create a function to investigate this. We'll dismiss those that have > 120 NULL (20% of 604)
# Open feature request for support of non-numeric and non-string data types: https://github.com/snowflakedb/snowpark-python/issues/1016

# scooby_df.select(boolean_array).describe(include="all").show()
def desc_bools(col):
    print(col)
    scooby_df.group_by(col).count().show()

for c in boolean_array:
    desc_bools(c)

## 3. Feature Transformations

We will illustrate a few of the transformation functions here, but the rest can be found in the [documentation](https://docs.snowflake.com/LIMITEDACCESS/snowflake-ml-preprocessing).

In [None]:
scooby_ml_df.describe().show()

In [None]:
scooby_ml_df.show()

### 1. Categorical values to numerical features
We use the `OneHotEncoder` to transform the categorical values: `FORMAT`, `NETWORK`,`SETTING_TERRAIN`, `MOTIVE`, `MONSTER_GENDER`,`CULPRIT_GENDER`

In [None]:
# Encode categoricals to numeric columns
cat_array2 = ["FORMAT","NETWORK","SETTING_TERRAIN","MOTIVE","MONSTER_GENDER","CULPRIT_GENDER"]
cat_array_ohe = ["FORMAT_OHE","NETWORK_OHE","SETTING_TERRAIN_OHE","MOTIVE_OHE","MONSTER_GENDER_OHE","CULPRIT_GENDER_OHE"]

snowml_ohe = snowml.OneHotEncoder(input_cols=cat_array2, output_cols=cat_array_ohe)
ohe_scooby_df = snowml_ohe.fit(scooby_ml_df).transform(scooby_ml_df)

np.array(ohe_scooby_df.columns)

In [None]:
ohe_scooby_df.show()
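
To make the effect of one-hot encoding concrete, here is a minimal pure-Python sketch of the idea: each distinct category becomes its own 0/1 column. The function and the sample values are hypothetical, and the way Snowpark ML's `OneHotEncoder` names columns and handles nulls or unseen categories may differ from this simplification.

```python
def one_hot_encode(values, prefix):
    """Return one 0/1 column per distinct category, keyed as '<prefix>_<category>'."""
    categories = sorted(set(values))
    return {
        f"{prefix}_{category}": [1.0 if v == category else 0.0 for v in values]
        for category in categories
    }


# Hypothetical sample of an already-cleaned FORMAT column
formats = ["TV_SERIES", "MOVIE", "TV_SERIES", "CROSSOVER"]
encoded = one_hot_encode(formats, prefix="FORMAT_OHE")

for name, column in encoded.items():
    print(name, column)
```

Each row has exactly one column set to 1, which is why the encoded columns can later be used in correlation checks.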

### 2. Normalize numerical features

Use `MinMaxScaler` to normalize the numerical features with large differences between their min and max values: `ENGAGEMENT`, `RUN_TIME`, `ZOINKS`, `GROOVY`, `SCOOBY_DOO_WHERE_ARE_YOU`, 
`ROOBY_ROOBY_ROO`

In [None]:
int_array_2 = ["ENGAGEMENT","RUN_TIME","ZOINKS","GROOVY","SCOOBY_DOO_WHERE_ARE_YOU","ROOBY_ROOBY_ROO"]
int_array_norm = ["ENGAGEMENT_NORM","RUN_TIME_NORM","ZOINKS_NORM","GROOVY_NORM","SCOOBY_DOO_WHERE_ARE_YOU_NORM","ROOBY_ROOBY_ROO_NORM"]

# Normalize the numerical columns listed in int_array_2
snowml_mms = snowml.MinMaxScaler(input_cols=int_array_2, output_cols=int_array_norm)
normalized_scooby_df = snowml_mms.fit(ohe_scooby_df).transform(ohe_scooby_df)

# Reduce the number of decimals
for c in int_array_norm:
    new_col = normalized_scooby_df.col(c).cast(DecimalType(7, 6))
    normalized_scooby_df = normalized_scooby_df.with_column(c, new_col)

In [None]:
normalized_scooby_df.show()
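
Min-max scaling maps each value to `(x - min) / (max - min)`, so the smallest value becomes 0 and the largest becomes 1. The sketch below shows that formula on a hypothetical list of run times and rounds to six decimals to mirror the `DecimalType(7, 6)` cast; it is an illustration of the arithmetic, not the Snowpark ML implementation.

```python
def min_max_scale(values, decimals=6):
    """Scale values linearly into [0, 1] using the column's own min and max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map everything to 0.0 in this sketch
        return [0.0 for _ in values]
    return [round((v - lo) / (hi - lo), decimals) for v in values]


# Hypothetical run times in minutes
run_times = [11, 22, 44]
scaled = min_max_scale(run_times)
print(scaled)
```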

### 3. Build the full preprocessing Pipeline

A preprocessing pipeline is helpful for both training and inference because it standardises the feature transformation steps.

In [None]:
CATEGORICAL_COLUMNS = ["FORMAT","NETWORK","SETTING_TERRAIN","MOTIVE","MONSTER_GENDER","CULPRIT_GENDER"]
CATEGORICAL_COLUMNS_OE = ["FORMAT_OHE","NETWORK_OHE","SETTING_TERRAIN_OHE","MOTIVE_OHE","MONSTER_GENDER_OHE","CULPRIT_GENDER_OHE"] # Names for the one-hot encoded output columns

NUMERICAL_COLUMNS = ["ENGAGEMENT","RUN_TIME","ZOINKS","GROOVY","SCOOBY_DOO_WHERE_ARE_YOU","ROOBY_ROOBY_ROO"]
NUMERICAL_COLUMNS_NORM = ["ENGAGEMENT_NORM","RUN_TIME_NORM","ZOINKS_NORM","GROOVY_NORM","SCOOBY_DOO_WHERE_ARE_YOU_NORM","ROOBY_ROOBY_ROO_NORM"]

# Build the pipeline
preprocessing_pipeline = Pipeline(
    steps=[
            (
                "OHE",
                snowml.OneHotEncoder(
                    input_cols=CATEGORICAL_COLUMNS,
                    output_cols=CATEGORICAL_COLUMNS_OE
                )
            ),
            (
                "MMS",
                snowml.MinMaxScaler(
                    clip=True,
                    input_cols=NUMERICAL_COLUMNS,
                    output_cols=NUMERICAL_COLUMNS_NORM,
                )
            )
    ]
)

PIPELINE_FILE = 'preprocessing_pipeline.joblib'
joblib.dump(preprocessing_pipeline, PIPELINE_FILE) # Pickle it locally first; note that the pipeline is still unfitted at this point

transformed_scooby_df = preprocessing_pipeline.fit(scooby_ml_df).transform(scooby_ml_df)
transformed_scooby_df.show()

In [None]:
# You can also save the pickled object into the stage we created earlier for deployment
session.file.put(PIPELINE_FILE, "@SCOOBY_ASSETS", overwrite=True)
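
The reason to persist a preprocessing step is that values learned from the training data (for example the min and max) must be reused unchanged on new data. The sketch below illustrates that idea with plain functions and the standard-library `pickle` module as a stand-in for `joblib`; the function names and data are hypothetical, and the clipping shown is only assumed to mirror what `clip=True` is meant to do for values outside the training range.

```python
import pickle


def fit_min_max(train_values):
    """Learn the scaling parameters from the training data only."""
    return {"min": min(train_values), "max": max(train_values)}


def transform_min_max(params, values, clip=True):
    """Apply previously learned parameters; optionally clip results into [0, 1]."""
    span = params["max"] - params["min"]
    scaled = [(v - params["min"]) / span if span else 0.0 for v in values]
    if clip:
        scaled = [min(1.0, max(0.0, s)) for s in scaled]
    return scaled


train = [10, 20, 30]
new_data = [5, 25, 40]  # contains values outside the training range

params = fit_min_max(train)
restored = pickle.loads(pickle.dumps(params))  # round trip, as if saved and reloaded
scored = transform_min_max(restored, new_data)
print(scored)
```

Without clipping, the first and last values would fall outside the [0, 1] range because they lie beyond the training min and max.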

## 4. Data Exploration

In [None]:
transformed_scooby_df.columns

In [None]:
ml_features = ["IMDB","ENGAGEMENT_NORM","RUN_TIME_NORM",
"FORMAT_OHE_CROSSOVER","FORMAT_OHE_MOVIE","FORMAT_OHE_MOVIE_THEATRICAL_","FORMAT_OHE_TV_SERIES","FORMAT_OHE_TV_SERIES_SEGMENTED_",
"NETWORK_OHE_ABC","NETWORK_OHE_ADULT_SWIM","NETWORK_OHE_BOOMERANG","NETWORK_OHE_CARTOON_NETWORK","NETWORK_OHE_CBS","NETWORK_OHE_SYNDICATION",
"NETWORK_OHE_TBC","NETWORK_OHE_THE_CW","NETWORK_OHE_THE_WB","NETWORK_OHE_WARNER_BROS_PICTURE","NETWORK_OHE_WARNER_HOME_VIDEO"]

simplified_scooby_df = transformed_scooby_df[ml_features]

In [None]:
simplified_scooby_df.show()

In [None]:
corr_scooby_df = correlation(df=simplified_scooby_df)
corr_scooby_df # This is a Pandas DataFrame 

In [None]:
# Create a heatmap with the features
fig, ax = plt.subplots(figsize=(15, 15))
plt.title('Heatmap for Transformed Scooby Data', fontsize=28)
dataplot = sns.heatmap(corr_scooby_df, cmap="YlGnBu", annot=True)

plt.show()

None of the features is highly correlated with IMDB; the highest correlation, 0.41, is for Format = TV_Series. <br>
We used one-hot encoding to transform the `FORMAT` categorical feature into a series of numeric (0/1) columns, which lets us apply correlation checks to it. <br>
<br>
Because the TV_Series feature is binary (1 or 0), the following scatterplot is not very useful for portraying any kind of correlation. <br>
Another way to calculate the correlation between a binary variable and a continuous one is the point-biserial correlation. <br>
https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.pointbiserialr.html <br>
https://www.statology.org/point-biserial-correlation-python/

In [None]:
# Set up a plot to look at FORMAT_OHE_TV_SERIES and IMDB
counts = simplified_scooby_df.to_pandas().groupby(['IMDB', 'FORMAT_OHE_TV_SERIES']).size().reset_index(name='Count')

fig, ax = plt.subplots(figsize=(10, 6))
ax = sns.scatterplot(data=counts, x='FORMAT_OHE_TV_SERIES', y='IMDB', size='Count', markers='o', alpha=(0.1, .25, 0.5, 0.75, 1))
ax.grid(axis='y')

sns.move_legend(ax, "upper right")
sns.despine(left=True, bottom=True)

Using the scipy.stats library, we can calculate the point-biserial correlation for `FORMAT_OHE_TV_SERIES`. <br>
The correlation is 0.41 and the p-value is 0.05, which we treat as statistically significant. <br> <br>

In the same way, as a second test, we calculate it for `NETWORK_OHE_THE_CW`. <br>
This has a negative correlation of -0.45. <br>

A positive correlation indicates that when the variable x takes the value "1", the variable y tends to take higher values than when x takes the value "0"; a negative correlation indicates the opposite. We can therefore interpret these results as follows: episodes with FORMAT = TV_SERIES tend to have a higher IMDB rating than episodes in other formats, while episodes with NETWORK = THE_CW tend to have a lower IMDB rating than episodes on other networks.

In [None]:
# Fetch all three columns in a single query so that x and y stay row-aligned
# (separate queries have no guaranteed row order)
pbs_df = simplified_scooby_df.select("FORMAT_OHE_TV_SERIES", "NETWORK_OHE_THE_CW", "IMDB").to_pandas()
y = pbs_df["IMDB"].to_numpy()

x = pbs_df["FORMAT_OHE_TV_SERIES"].to_numpy()
print("Correlation for FORMAT_OHE_TV_SERIES")
print(stats.pointbiserialr(x, y))

x = pbs_df["NETWORK_OHE_THE_CW"].to_numpy()
print("Correlation for NETWORK_OHE_THE_CW")
print(stats.pointbiserialr(x, y))
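
For intuition about what the point-biserial coefficient measures, here is a small pure-Python sketch of the textbook formula: the difference between the two group means, divided by the overall (population) standard deviation and weighted by the group sizes. The function name and the toy ratings are hypothetical; for real analysis, `scipy.stats.pointbiserialr` also returns the p-value.

```python
import math


def point_biserial(binary, continuous):
    """Point-biserial correlation between a 0/1 variable and a continuous variable."""
    n = len(binary)
    group1 = [y for x, y in zip(binary, continuous) if x == 1]
    group0 = [y for x, y in zip(binary, continuous) if x == 0]
    n1, n0 = len(group1), len(group0)
    mean1 = sum(group1) / n1
    mean0 = sum(group0) / n0
    mean_all = sum(continuous) / n
    # Population standard deviation of the continuous variable
    std_all = math.sqrt(sum((y - mean_all) ** 2 for y in continuous) / n)
    return (mean1 - mean0) / std_all * math.sqrt(n1 * n0 / n ** 2)


# Hypothetical data: 1 marks one group of episodes, 0 marks the rest
is_in_group = [1, 1, 1, 0, 0, 0]
ratings = [8.0, 7.5, 7.0, 6.0, 6.5, 5.5]

r = point_biserial(is_in_group, ratings)
r_flipped = point_biserial([1 - x for x in is_in_group], ratings)
print(r, r_flipped)
```

Flipping the 0/1 labels flips the sign of the coefficient, which matches the interpretation of positive and negative correlations given above.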

Our target variable `IMDB` is a continuous variable. <br>
The other two naturally continuous variables are `ENGAGEMENT_NORM` and `RUN_TIME_NORM`. <br>
The correlation heatmap indicates a very low negative correlation between IMDB and these two. Let's explore that visually:

In [None]:
# Set up a plot to look at ENGAGEMENT_NORM and IMDB
counts = simplified_scooby_df.to_pandas().groupby(['IMDB', 'ENGAGEMENT_NORM']).size().reset_index(name='Count')

fig, ax = plt.subplots(figsize=(10, 6))
ax = sns.scatterplot(data=counts, x='ENGAGEMENT_NORM', y='IMDB', size='Count', markers='o', alpha=(0.1, .25, 0.5, 0.75, 1))
ax.grid(axis='y')


sns.move_legend(ax, "upper right")
sns.despine(left=True, bottom=True)

In [None]:
# Set up a plot to look at RUN_TIME_NORM and IMDB
counts = simplified_scooby_df.to_pandas().groupby(['IMDB', 'RUN_TIME_NORM']).size().reset_index(name='Count')

fig, ax = plt.subplots(figsize=(10, 6))
ax = sns.scatterplot(data=counts, x='RUN_TIME_NORM', y='IMDB', size='Count', markers='o', alpha=(0.1, .25, 0.5, 0.75, 1))
ax.grid(axis='y')


sns.move_legend(ax, "upper right")
sns.despine(left=True, bottom=True)

In [None]:
# Set up a plot to look at RUN_TIME_NORM and ENGAGEMENT_NORM
counts = simplified_scooby_df.to_pandas().groupby(['ENGAGEMENT_NORM', 'RUN_TIME_NORM']).size().reset_index(name='Count')

fig, ax = plt.subplots(figsize=(10, 6))
ax = sns.scatterplot(data=counts, x='RUN_TIME_NORM', y='ENGAGEMENT_NORM', size='Count', markers='o', alpha=(0.1, .25, 0.5, 0.75, 1))
ax.grid(axis='y')

sns.move_legend(ax, "upper right")
sns.despine(left=True, bottom=True)

In [None]:
session.close()