# Linear Regression Using Snowpark for Python and scikit-learn

[Frosty Friday Challenge: Week 18 - Hard - Linear Regression](https://frostyfriday.org/2022/10/14/week-18-linear-regression/)

This script demonstrates a simple linear regression on Snowflake data using Snowpark for Python and scikit-learn.

## Import the various packages

Before we can begin, we must import the required packages.

### Main packages

In [1]:
import pandas
from sklearn.linear_model import LinearRegression
from datetime import date
import snowflake.snowpark



### InterWorks Snowpark package

We also import a session builder from the InterWorks Snowpark package and use it to create a Snowpark Session object that is connected to our Snowflake environment. Alternatively, you can modify the code to establish a Snowpark Session through any method of your choice.

In [2]:
# Import module to build Snowpark sessions
from shared.interworks_snowpark.interworks_snowpark_python.snowpark_session_builder import build_snowpark_session_via_parameters_json as build_snowpark_session

# Generate Snowpark session
snowpark_session = build_snowpark_session()
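
For readers without the InterWorks package, one alternative is Snowpark's own `Session.builder`. The sketch below is a minimal example and not the InterWorks implementation: the file name `connection_parameters.json` and both helper functions are hypothetical, and the JSON file is assumed to hold standard connection keys such as `account`, `user`, `password`, `role` and `warehouse`.

```python
import json
from pathlib import Path


def load_connection_parameters(path):
    """Read Snowflake connection parameters from a JSON file into a dict."""
    return json.loads(Path(path).read_text())


def build_session_from_json(path="connection_parameters.json"):
    """Create a Snowpark session from a JSON file of connection parameters.

    The default file name is a hypothetical placeholder.
    """
    # Imported here so the loader above can be used without Snowpark installed.
    from snowflake.snowpark import Session

    return Session.builder.configs(load_connection_parameters(path)).create()
```

Calling `build_session_from_json()` would then take the place of `build_snowpark_session()` in the cell above.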

## Retrieve data

Before we can train a model, we must retrieve the data that we wish to leverage.

### Retrieve the data from the source table

In [3]:
df_input_sf = snowpark_session.sql('''
  SELECT YEAR("Date") AS "YEAR", "Value" as "MEASURE"
  FROM "SHARE_ECONOMY_DATA_ATLAS"."ECONOMY"."BEANIPA"
  WHERE "Table Name" = 'Price Indexes For Personal Consumption Expenditures By Major Type Of Product'
    AND "Indicator Name" = 'Personal consumption expenditures (PCE)'
    AND "Frequency" = 'A'
    AND "Date" >= '1972-01-01' 
    AND "Date" < '2021-01-01' 
  ORDER BY "Date"
''') 

df_input_sf.show()

----------------------
|"YEAR"  |"MEASURE"  |
----------------------
|1972    |22.542     |
|1973    |23.756     |
|1974    |26.229     |
|1975    |28.415     |
|1976    |29.974     |
|1977    |31.923     |
|1978    |34.145     |
|1979    |37.178     |
|1980    |41.182     |
|1981    |44.871     |
----------------------



### Convert data into a Pandas dataframe

Our current dataframe is a Snowpark dataframe, which represents a query against an object in Snowflake. We download it into a Pandas dataframe so that we can manipulate it more freely.

In [4]:
# df_input = df_input_sf.select(year(col('"Date"')).alias('"Year"'), col('"Value"').alias('PCE') ).to_pandas()
df_input = df_input_sf.to_pandas()

df_input.head()

   YEAR  MEASURE
0  1972   22.542
1  1973   23.756
2  1974   26.229
3  1975   28.415
4  1976   29.974


## Create predictive model

Now that we have our data, we are ready to begin constructing our predictive model.

### Determine inputs

Determine the inputs for our linear regression model: the year is the single input feature (`x`) and the measure is the target (`y`).

In [5]:
#x = df_input.index.to_numpy().reshape(-1, 1)

x = df_input["YEAR"].to_numpy().reshape(-1, 1)
y = df_input["MEASURE"].to_numpy()
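
The `reshape(-1, 1)` call is there because scikit-learn estimators expect the features as a two-dimensional array of shape `(n_samples, n_features)`, while the target may stay one-dimensional. A minimal sketch of the shapes involved, using only the first five rows shown earlier (the variable names `years` and `measures` are illustrative):

```python
import numpy as np

# First five rows of the query output shown earlier.
years = np.array([1972, 1973, 1974, 1975, 1976])
measures = np.array([22.542, 23.756, 26.229, 28.415, 29.974])

x = years.reshape(-1, 1)  # one row per sample, one column per feature
y = measures              # the target can remain one-dimensional

print(years.shape, x.shape, y.shape)  # (5,) (5, 1) (5,)
```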

### Create linear regression model

Fit a scikit-learn LinearRegression model to the inputs.

In [6]:
model = LinearRegression().fit(x, y)
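
With a single feature, the fit reduces to the textbook least-squares formulas for a slope and an intercept. The sketch below works through them in plain Python on only the ten rows displayed earlier (1972 to 1981), so its numbers will not match the model trained on the full data. It is an illustration of the technique, not scikit-learn's internal code, although for one feature the results should agree up to floating-point error.

```python
# The ten rows displayed by df_input_sf.show() earlier (1972 to 1981 only).
years = list(range(1972, 1982))
measures = [22.542, 23.756, 26.229, 28.415, 29.974,
            31.923, 34.145, 37.178, 41.182, 44.871]

n = len(years)
x_mean = sum(years) / n
y_mean = sum(measures) / n

# Ordinary least squares for one feature:
#   slope = S_xy / S_xx, intercept = y_mean - slope * x_mean
s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(years, measures))
s_xx = sum((x - x_mean) ** 2 for x in years)
slope = s_xy / s_xx
intercept = y_mean - slope * x_mean


def predict(year):
    """Evaluate the fitted line at the given year."""
    return intercept + slope * year


print(f"slope={slope:.4f}, intercept={intercept:.2f}")
```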

### Test model

Test the model by predicting the value for a given year.

In [7]:
predictYear = 2021
pce_pred = model.predict([[predictYear]])
# Print the last 5 years
print(df_input.tail())
# Print the prediction for 2021
print('Prediction for ' + str(predictYear) + ': ' + str(round(pce_pred[0], 2)))

    YEAR  MEASURE
44  2016  104.148
45  2017  106.054
46  2018  108.317
47  2019  109.933
48  2020  111.145
Prediction for 2021: 116.22


## Not seeing challenge values

The challenge expects a 2021 value of 116.23 for this prediction, which I am not seeing. Filtering to 1972 onwards, as in the [suggested quickstart](https://quickstarts.snowflake.com/guide/data_apps_summit_lab/), gives a value of 116.18. Also filtering the 2021 actual value out of the input yields a 2021 prediction of 116.22, which is far closer.

Comparing the results and values with the original [Snowflake Quick Starts code](https://github.com/Snowflake-Labs/sfquickstarts/blob/master/site/sfguides/src/data_apps_summit_lab/assets/project_files/my_snowpark_pce.ipynb), it appears the original data itself has changed by very small amounts. For example, the actual value for 2019 is now 109.933 when it used to be 109.922.
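
As a rough check on how such revisions feed into the result, note that a one-feature least-squares prediction is a weighted sum of the training values, with weights that depend only on the years. The sketch below computes those weights for the 49 training years (1972 to 2020) and applies the 2019 weight to the revision noted above. It assumes the same single-feature model as this notebook, and the variable names are illustrative.

```python
# Years in the training data: 1972 to 2020 inclusive (49 rows, as in the notebook).
years = list(range(1972, 2021))
target_year = 2021

n = len(years)
x_mean = sum(years) / n
s_xx = sum((x - x_mean) ** 2 for x in years)

# For a one-feature least-squares line, the prediction at target_year equals
# sum(weights[year] * value_for_year) over the training years.
weights = {
    x: 1 / n + (target_year - x_mean) * (x - x_mean) / s_xx
    for x in years
}

revision_2019 = 109.933 - 109.922  # change in the 2019 value noted above
shift = weights[2019] * revision_2019
print(f"weight of 2019: {weights[2019]:.4f}, shift in 2021 prediction: {shift:.5f}")
```

On its own, the 2019 revision moves the 2021 prediction by a little under 0.001, which suggests that a gap of about 0.01 would come from revisions to several years rather than to one.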

I believe this means my solution is correct and the input data itself has simply changed.