# Regression Lab

Someone from your Data Science group has been building a regression that fits a line to, and predicts, the average arrival delay at each airport.

Currently, they:
1. Extract the flight data from their source system (using CSV)
1. Load and parse the data in their notebook, converting each column to the proper datatype
1. Sample 20000 flights randomly in memory (this has been known to exhaust memory on their machine)
1. Calculate average delays into a particular airport
1. Use a regression to fit a line to predict what the average delay is for a given month
1. Upload to an internal system for use by other analysts

Now that our historical flight data is centrally located and loaded, we're going to *replace steps 1-4 above* with a single SQL query to Snowflake. That lets them focus on what they enjoy most about their jobs: fitting models and working on their statistical and machine learning algorithms!

## Connect to Snowflake

Using the credentials for our class:
* `snowflake_account` : `sfeducationalservices1_acct982` or similar
* `snowflake_user` : `mongoose` (or whatever your animal is!)
* `snowflake_password` : `EasyToGuess123!` or whatever password you have set for your user


In [None]:
import getpass
from urllib.parse import quote

## Get the Snowflake account name (without .snowflakecomputing.com)
print("Snowflake Account:")
snowflake_account = input()
print("Snowflake Username:")
snowflake_user = input()
## URL-encode the password (including any "/") so it is safe inside the SQLAlchemy URL
snowflake_password = quote(getpass.getpass("Snowflake Password:"), safe="")

### SQL Alchemy

In this example, we are going to use a couple of convenience packages to connect to Snowflake. Of course, we could connect directly with the Snowflake Python connector (`snowflake.connector.connect()`), but for now let's use the helpful `%sql` commands and connect via `SQLAlchemy` to make getting the data to our Data Scientists straightforward.

A SQLAlchemy URL has some specific formatting requirements. Some apply to every SQLAlchemy dialect (that is, to other databases as well) and some are specific to Snowflake's Python connector. You can read more about the SQLAlchemy connector here:

* [SQLAlchemy URL docs](https://docs.sqlalchemy.org/en/13/core/engines.html#database-urls)
* [Snowflake SQLAlchemy Connection instructions](https://docs.snowflake.com/en/user-guide/sqlalchemy.html#snowflake-specific-parameters-and-behavior)

In [None]:
## Used by the SQL magic commands below!  SQLAlchemy!
database_url = f"snowflake://{snowflake_user}:{snowflake_password}@{snowflake_account}/{snowflake_user}_DB?warehouse={snowflake_user}_WH"
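
To see how the pieces of that URL fit together, here is a minimal sketch that builds one from made-up credentials. The helper name `build_snowflake_url` and the sample values are hypothetical; the `{user}_DB` and `{user}_WH` naming simply mirrors the cell above.

```python
from urllib.parse import quote


def build_snowflake_url(user: str, password: str, account: str) -> str:
    """Hypothetical helper: assemble the SQLAlchemy URL used in this lab."""
    # Percent-encode every reserved character in the password, including "/" and "@",
    # so it cannot be mistaken for a URL delimiter.
    safe_password = quote(password, safe="")
    return (
        f"snowflake://{user}:{safe_password}@{account}/"
        f"{user}_DB?warehouse={user}_WH"
    )


# Made-up credentials, for illustration only.
example_url = build_snowflake_url("mongoose", "p@ss/word!", "my_account")
print(example_url)
```

Passing `safe=""` matters here because `quote` leaves `/` unescaped by default, and an unescaped `/` or `@` in the password would change how the URL is parsed.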

### Connect to Snowflake

We're all set up and ready to connect to Snowflake. We will use the `%sql` magic command, which is a convenience for executing SQL statements.

In [None]:
## Connecting to Snowflake via SQLAlchemy
%load_ext sql
%sql $database_url

### Check the connection

Next, let's just do the Snowflake equivalent of `HELLO WORLD!`

We'll issue a trivial query, one that runs under any security role or context and even without a running warehouse, to confirm that we're connected.

In [None]:
%%sql
select current_date();

### Set warehouse size

Now we will set a query tag and our warehouse size before executing any queries.

In [None]:
%%sql
alter session set query_tag='({snowflake_user}) Lab - SCENARIO: Connectors for Data Science Workloads';
alter warehouse {snowflake_user}_WH set warehouse_size = 'xsmall';

Our Data Scientists want a random sample of 1000 flights. We'll use a SQL statement that pulls from our `RAW` table and randomly samples 1000 rows for a single destination, `SEA`.

In [None]:
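%%sql result_set <<
select
    month
    , nvl(arr_delay, 0) as avg_delay
from (
    -- filter to SEA first so that SAMPLE draws 1000 rows from SEA flights only
    select month, arr_delay
    from raw.ONTIME_REPORTING
    where DEST = 'SEA'
) SAMPLE (1000 rows);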
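
One detail worth keeping in mind: as far as Snowflake's documentation describes it, a `SAMPLE` clause attached to a table is applied before the `WHERE` filter, so the order of sampling and filtering changes how many rows come back. The sketch below illustrates the difference locally with NumPy on made-up data; the array and variable names are hypothetical and nothing here touches Snowflake.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Synthetic stand-in for the flights table: 100,000 destination codes (made-up data).
destinations = rng.choice(["SEA", "SFO", "JFK", "ORD"], size=100_000)

# Option 1: sample 1000 rows from the whole table, then keep only SEA.
sampled_idx = rng.choice(destinations.size, size=1000, replace=False)
sample_then_filter = sampled_idx[destinations[sampled_idx] == "SEA"]

# Option 2: keep only SEA rows, then sample 1000 of them.
sea_idx = np.flatnonzero(destinations == "SEA")
filter_then_sample = rng.choice(sea_idx, size=1000, replace=False)

print(len(sample_then_filter), len(filter_then_sample))
```

Only the second option is guaranteed to return 1000 `SEA` rows; the first returns however many of the 1000 sampled rows happen to land in Seattle.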

In [None]:
import numpy as np

df = result_set.DataFrame()
df

We now have our 1000 randomly sampled rows available for use in our regression. To better understand the data, let's take a quick look by plotting it as a scatter chart.

In [None]:
df.plot.scatter(0,1)

NumPy expects arrays in a particular layout for its operations. Let's transpose the dataset (rows to columns) so that each column becomes its own array, ready for NumPy.

In [None]:
result_numpy = df.to_numpy().transpose()

To ensure we have the correct datatypes, let's tell NumPy that our months (1..12) are integers and that our delays are floats.

In [None]:
## xs holds the x values: months as integers
xs = np.array(result_numpy[0]).astype(int)
## xy holds the y values: arrival delays as floats
xy = np.array(result_numpy[1]).astype(float)
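
The transpose-then-cast steps can be sketched on a tiny made-up result set. This assumes the query hands back mixed Python objects such as `Decimal` (an assumption about the connector's output, not something this lab states), which is one reason an explicit cast is useful.

```python
from decimal import Decimal

import numpy as np

# Made-up rows shaped like the query result: (month, delay) pairs with mixed Python types.
rows = np.array(
    [[1, Decimal("12.5")], [2, Decimal("-3")], [12, Decimal("0")]],
    dtype=object,
)

columns = rows.transpose()           # shape (3, 2) becomes (2, 3): one array per column
months = columns[0].astype(int)      # first column as integers
delays = columns[1].astype(float)    # second column as floats

print(columns.shape, months.dtype.kind, delays.dtype.kind)
```

After the cast, both arrays have numeric dtypes, so functions such as `np.polyfit` can operate on them directly.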

Your Data Scientists have done this fitting before; they use NumPy's polynomial fitting (here, a degree-5 polynomial) to make a reasonable guess at the delay for a particular month.

In [None]:
import matplotlib.pyplot as plt

trend = np.polyfit(xs,xy,5)
trendpoly = np.poly1d(trend)

plt.plot(xs,xy,'o')
plt.plot(np.unique(xs),trendpoly(np.unique(xs)))

In [None]:
monthly_estimate = trendpoly(np.unique(xs))
monthly_estimate
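
To make the fitting step easier to check by hand, here is a small sketch that runs the same `np.polyfit` / `np.poly1d` pair on synthetic data generated from a known quadratic. The `*_demo` names and the data are made up for illustration; because a quadratic is a special case of a degree-5 polynomial, the fit should reproduce it almost exactly.

```python
import numpy as np

# Synthetic data: 10 flights per month whose delay follows a known quadratic (made up).
xs_demo = np.repeat(np.arange(1, 13), 10)
ys_demo = 0.5 * (xs_demo - 6) ** 2 + 3.0

coeffs = np.polyfit(xs_demo, ys_demo, 5)   # least-squares fit, highest power first
poly = np.poly1d(coeffs)                   # wrap the coefficients in a callable polynomial

months_demo = np.unique(xs_demo)           # the 12 distinct months, sorted
estimate_demo = poly(months_demo)          # one estimate per month

print(np.round(estimate_demo, 2))
```

With real delay data the points will scatter around the curve rather than sit on it, and a degree-5 polynomial over only 12 distinct months is flexible enough to follow noise, so the estimates are best treated as a rough seasonal trend.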

Snowflake, our cloud data platform, is the central home for this information and for the output of our Data Scientists' work.

We will take our results and push them up to Snowflake, into a table in our `MODELED` zone, for downstream consumption by other applications and dashboards.

In [None]:
## Three rows: airport labels, months, and estimates, sized to the months actually sampled
months = np.unique(xs)
regression_data = np.array([['SEA'] * len(months)
             ,months
             ,monthly_estimate])

In [None]:
for_snowflake = regression_data.transpose()

In [None]:
np.savetxt("sea_regression_data.csv", for_snowflake, delimiter=',', fmt='%s', comments="")
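
For a sense of what the resulting file looks like, the sketch below writes a three-row table to a temporary file and reads it back. The values, the `*_demo` names, and the file name are made up; it casts each column to strings explicitly rather than relying on NumPy's type coercion.

```python
import os
import tempfile

import numpy as np

months_demo = np.array([1, 2, 3])
estimates_demo = np.array([10.5, 11.25, 9.0])   # made-up values

# Cast every column to str explicitly, then stack them side by side as rows.
table = np.column_stack([
    np.full(months_demo.size, "SEA"),
    months_demo.astype(str),
    estimates_demo.astype(str),
])

csv_path = os.path.join(tempfile.mkdtemp(), "demo_regression_data.csv")
np.savetxt(csv_path, table, delimiter=",", fmt="%s")

# Read the file back to inspect one line per (airport, month, estimate) row.
with open(csv_path) as handle:
    csv_lines = handle.read().splitlines()
print(csv_lines)
```

Each line has the airport code, the month, and the estimate, with no header row, which is the plain CSV shape that a default `COPY INTO` load can read.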

In [None]:
%sql use schema modeled;
%sql PUT file://./sea_regression_data.csv @~;

In [None]:
%sql copy into estimated_delays from @~/sea_regression_data.csv.gz purge=true

In [None]:
%sql select * from modeled.estimated_delays;

In [None]:
%sql alter session unset query_tag;