**Store Sales Forecasting** is an ongoing Kaggle competition. I have 
decided to use PySpark to load the data, build a data model, analyse 
it, and then move on to modeling.

In [1]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
import psycopg2

In [2]:
#credentials = "postgresql://{}:{}@{}:{}/{}".format(user,passwd,host,port,db)

#using psycopg2 to test connection since there are no tables
#import psycopg2
#try:
 #   conn = psycopg2.connect(host=host,dbname=db,user=user,password=passwd,port=port)
#except Exception as e:
 #   print(e)
    
#conn.set_session(autocommit=True)

#try:
 #   cur = conn.cursor()
    
#except Exception as e:
 #   print(e)
    
#Helper functions to work with the database
def schemaGen(dataframe, schemaName):
    #Generate the CREATE TABLE statement for the dataframe, then adapt the types and strip quotes/newlines
    localSchema = pd.io.sql.get_schema(dataframe,schemaName)
    localSchema = localSchema.replace('TEXT','VARCHAR(255)').replace('INTEGER','NUMERIC').replace('\n','').replace('"',"")
    return localSchema

#Using pandas read_sql for getting schema
def getSchema(tableName, credentials):
    schema = pd.read_sql("""SELECT * FROM information_schema.columns where table_name='{}'""".format(tableName),con=credentials)
    return schema

#pd.read_sql is meant for reading, so statements that write to the database go through psycopg2
def queryTable(query):
    #Relies on the global cursor `cur` from the commented-out connection code above
    try:
        cur.execute(query)
    except Exception as e:
        print(e)
        
#This doesn't return anything

#Using the pd.read_sql for getting data from db
def queryBase(query):
    requiredTable = pd.read_sql(query,con=credentials)
    return requiredTable

#This returns the dataframe

#I am maintaining the above psycopg code, just in case 
#it is required
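
To see what the replace chain in `schemaGen` does without a database at hand, the sketch below applies the same replacements to a hand-written DDL string. The string `sample_ddl` and the helper name `normalise_ddl` are made up for illustration; the real output of `pd.io.sql.get_schema` may differ in whitespace and quoting.

```python
# Hypothetical DDL, standing in for the output of pd.io.sql.get_schema
sample_ddl = 'CREATE TABLE "stores" (\n"store_nbr" INTEGER,\n"city" TEXT\n)'

def normalise_ddl(ddl):
    """Apply the same string replacements that schemaGen uses."""
    return (ddl.replace('TEXT', 'VARCHAR(255)')
               .replace('INTEGER', 'NUMERIC')
               .replace('\n', '')
               .replace('"', ''))

print(normalise_ddl(sample_ddl))
# CREATE TABLE stores (store_nbr NUMERIC,city VARCHAR(255))
```

Because `str.replace` works on substrings, a column whose name contains `TEXT` or `INTEGER` (for example `CONTEXT`) would be rewritten as well, so the result is worth a quick look before it is executed.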

In [3]:
import pyspark
from pyspark.sql import SparkSession
#Note: the star import shadows Python builtins such as max and min with Spark column functions
from pyspark.sql.functions import *

In [3]:
%%sh
ls

SalesFC_Datamodeling_Pyspark.ipynb
store-sales-time-series-forecasting.zip


In [4]:
#bring in the data

import shutil
shutil.unpack_archive('store-sales-time-series-forecasting.zip')

In [None]:
## That should unpack all the data for our consumption.

In [5]:
%%sh
ls

holidays_events.csv
oil.csv
SalesFC_Datamodeling_Pyspark.ipynb
sample_submission.csv
store-sales-time-series-forecasting.zip
stores.csv
test.csv
train.csv
transactions.csv


In [4]:
#Let's assign variable names to the source files for easy reference

holidays = 'holidays_events.csv'
oil = 'oil.csv'
stores = 'stores.csv'
train = 'train.csv'
txn = 'transactions.csv'
#We won't need the next two for quite some time
test = 'test.csv'
sample = 'sample_submission.csv'

In [5]:
#starting the spark session and getting the database setup.

spark = SparkSession.builder.appName('sales_fc').getOrCreate()
sparkql= spark.sql
sparkreader = spark.read

23/01/01 04:35:24 WARN Utils: Your hostname, codeStation resolves to a loopback address: 127.0.1.1; using 172.17.0.1 instead (on interface docker0)
23/01/01 04:35:24 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


23/01/01 04:35:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [9]:
sparkql("SET spark.sql.warehouse.dir").show(truncate=False)

+-----------------------+-------------------------------------------------------------------------------------------+
|key                    |value                                                                                      |
+-----------------------+-------------------------------------------------------------------------------------------+
|spark.sql.warehouse.dir|file:/run/media/solverbot/repoA/gitFolders/moreDE/store_sales_data_analysis/spark-warehouse|
+-----------------------+-------------------------------------------------------------------------------------------+



In [6]:
#Creating a local database, even though there is no Hive setup
sparkql("CREATE DATABASE IF NOT EXISTS sales_forecast")
sparkql("USE sales_forecast")

DataFrame[]

In [7]:
#Reading in the data
holidays_data = sparkreader.csv(holidays,inferSchema=True,header=True)
oil_data = sparkreader.csv(oil,inferSchema=True,header=True)
stores_data = sparkreader.csv(stores,inferSchema=True,header=True)
train_data = sparkreader.csv(train,inferSchema=True,header=True)
txn_data = sparkreader.csv(txn,inferSchema=True,header=True)

                                                                                

Naming convention: anything outside the database is "data" (the `*_data` DataFrames); 
once it is registered inside the database it is a "table" (the `*_table` views). That keeps the two separate.

In [8]:
#Lets create temp views of the tables first. 
holidays_data.createOrReplaceTempView("holidays_table")
oil_data.createOrReplaceTempView("oil_table")
stores_data.createOrReplaceTempView("stores_table")
train_data.createOrReplaceTempView("train_table")
txn_data.createOrReplaceTempView("txn_table")

The temp views are dropped much like the usual SQL tables, by view name rather than DataFrame name, e.g. sparkql("DROP VIEW holidays_table")

In [93]:
#We have the data inside the spark data base to start manipulation
#using sql
sparkql("SHOW TABLES").show()

+---------+--------------+-----------+
|namespace|     tableName|isTemporary|
+---------+--------------+-----------+
|         |    date_table|       true|
|         |full_oil_table|       true|
|         |holidays_table|       true|
|         |     oil_table|       true|
|         |  stores_table|       true|
|         |   train_table|       true|
|         |     txn_table|       true|
+---------+--------------+-----------+



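Remove the *.csv files from the local directory once the analysis is done. Spark reads CSV files lazily, so queries on the views above can fail after the files are gone unless the data was cached.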

In [94]:
%%sh

rm -f *.csv

### Let's get to know the data... one table at a time

In [15]:
sparkql("select * from holidays_table limit 5").show()


+-------------------+-------+--------+-----------+--------------------+-----------+
|               date|   type|  locale|locale_name|         description|transferred|
+-------------------+-------+--------+-----------+--------------------+-----------+
|2012-03-02 00:00:00|Holiday|   Local|      Manta|  Fundacion de Manta|      false|
|2012-04-01 00:00:00|Holiday|Regional|   Cotopaxi|Provincializacion...|      false|
|2012-04-12 00:00:00|Holiday|   Local|     Cuenca| Fundacion de Cuenca|      false|
|2012-04-14 00:00:00|Holiday|   Local|   Libertad|Cantonizacion de ...|      false|
|2012-04-21 00:00:00|Holiday|   Local|   Riobamba|Cantonizacion de ...|      false|
+-------------------+-------+--------+-----------+--------------------+-----------+



                                                                                

In [20]:
#Observe there are multiple categories, type, locale, locale_name
holidays_data.count()

350

In [25]:
sparkql("""select count(*) as categ_counts, hd.type,
            hd.locale, hd.locale_name
            from holidays_table hd
            group by hd.type, hd.locale, hd.locale_name
            order by hd.locale_name
            """).show()

+------------+----------+--------+-----------+
|categ_counts|      type|  locale|locale_name|
+------------+----------+--------+-----------+
|          12|   Holiday|   Local|     Ambato|
|           6|   Holiday|   Local|    Cayambe|
|           6|   Holiday|Regional|   Cotopaxi|
|           6|   Holiday|   Local|     Cuenca|
|           1|  Transfer|   Local|     Cuenca|
|          40|Additional|National|    Ecuador|
|          56|     Event|National|    Ecuador|
|           8|  Transfer|National|    Ecuador|
|          60|   Holiday|National|    Ecuador|
|           5|  Work Day|National|    Ecuador|
|           5|    Bridge|National|    Ecuador|
|           6|   Holiday|   Local|  El Carmen|
|           6|   Holiday|   Local| Esmeraldas|
|          12|   Holiday|   Local|   Guaranda|
|           5|   Holiday|   Local|  Guayaquil|
|           5|Additional|   Local|  Guayaquil|
|           1|  Transfer|   Local|  Guayaquil|
|           1|  Transfer|   Local|     Ibarra|
|           6

In [26]:
oil_data.count()

1218

In [29]:
%%sh
tail oil.csv

2017-08-18,48.59
2017-08-21,47.39
2017-08-22,47.65
2017-08-23,48.45
2017-08-24,47.24
2017-08-25,47.65
2017-08-28,46.4
2017-08-29,46.46
2017-08-30,45.96
2017-08-31,47.26


### The data is available till Aug'17 starting from Jan'13

In [28]:
sparkql("select * from oil_table limit 5").show()

+-------------------+----------+
|               date|dcoilwtico|
+-------------------+----------+
|2013-01-01 00:00:00|      null|
|2013-01-02 00:00:00|     93.14|
|2013-01-03 00:00:00|     92.97|
|2013-01-04 00:00:00|     93.12|
|2013-01-07 00:00:00|      93.2|
+-------------------+----------+



In [31]:
sparkql("select * from stores_table limit 5").show()

+---------+-------------+--------------------+----+-------+
|store_nbr|         city|               state|type|cluster|
+---------+-------------+--------------------+----+-------+
|        1|        Quito|           Pichincha|   D|     13|
|        2|        Quito|           Pichincha|   D|     13|
|        3|        Quito|           Pichincha|   D|      8|
|        4|        Quito|           Pichincha|   D|      9|
|        5|Santo Domingo|Santo Domingo de ...|   D|      4|
+---------+-------------+--------------------+----+-------+



In [33]:
stores_data.count()

54

In [32]:
sparkql("select * from train_table limit 5").show()

+---+-------------------+---------+----------+-----+-----------+
| id|               date|store_nbr|    family|sales|onpromotion|
+---+-------------------+---------+----------+-----+-----------+
|  0|2013-01-01 00:00:00|        1|AUTOMOTIVE|  0.0|          0|
|  1|2013-01-01 00:00:00|        1| BABY CARE|  0.0|          0|
|  2|2013-01-01 00:00:00|        1|    BEAUTY|  0.0|          0|
|  3|2013-01-01 00:00:00|        1| BEVERAGES|  0.0|          0|
|  4|2013-01-01 00:00:00|        1|     BOOKS|  0.0|          0|
+---+-------------------+---------+----------+-----+-----------+



In [34]:
train_data.count()

                                                                                

3000888

In [35]:
sparkql("SELECT * FROM txn_table LIMIT 5").show()

+-------------------+---------+------------+
|               date|store_nbr|transactions|
+-------------------+---------+------------+
|2013-01-01 00:00:00|       25|         770|
|2013-01-02 00:00:00|        1|        2111|
|2013-01-02 00:00:00|        2|        2358|
|2013-01-02 00:00:00|        3|        3487|
|2013-01-02 00:00:00|        4|        1922|
+-------------------+---------+------------+



In [36]:
txn_data.count()

83488

In [13]:
#Toy data for experimenting with date parsing and date sequences
data = [(123, 1, "01/01/2021",),
        (123, 0, "01/02/2021",),
        (123, 1, "01/03/2021",),
        (123, 0, "01/06/2021",),
        (123, 0, "01/08/2021",),
        (777, 0, "01/01/2021",),
        (777, 1, "01/03/2021",), ]

df = spark.createDataFrame(data, ("ID", "FLAG", "DATE",)) \
        .withColumn("DATE", to_date(col("DATE"), "dd/MM/yyyy"))

In [18]:
df.show(2)

                                                                                

+---+----+----------+
| ID|FLAG|      DATE|
+---+----+----------+
|123|   1|2021-01-01|
|123|   0|2021-02-01|
+---+----+----------+
only showing top 2 rows



In [48]:
#MIN - MAX gives a negative interval; swap the operands for a positive span
sparkql("""SELECT MIN(date) as min_date,MAX(date) as max_date,
                MIN(date) - MAX(date) as interval
                from oil_table""").show()

+-------------------+-------------------+--------------------+
|           min_date|           max_date|            interval|
+-------------------+-------------------+--------------------+
|2013-01-01 00:00:00|2017-08-31 00:00:00|INTERVAL '-1703 0...|
+-------------------+-------------------+--------------------+
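
The size of that interval can be cross-checked with plain Python. The sketch below is only a sanity check, not part of the Spark pipeline: it takes the two dates and the row count of 1218 from the outputs above and assumes that `oil.csv` has at most one row per date.

```python
from datetime import date

min_date = date(2013, 1, 1)    # MIN(date) from the query output
max_date = date(2017, 8, 31)   # MAX(date) from the query output

span_days = (max_date - min_date).days       # later minus earlier: positive
reversed_days = (min_date - max_date).days   # earlier minus later: negative
calendar_days = span_days + 1                # counting both end dates
oil_rows = 1218                              # oil_data.count() reported earlier
days_without_row = calendar_days - oil_rows  # assumes one row per date

print(span_days, reversed_days, calendar_days, days_without_row)
# 1703 -1703 1704 486
```

Under that assumption, 486 calendar days in the range have no row in the oil data at all, which is why a complete date series is needed before joining.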



In [38]:
#One row per month between each ID's first and last date in the toy DataFrame
all_dates_df = df.groupBy("id").agg(
    date_trunc("mm", max(to_date("date", "dd/MM/yyyy"))).\
            alias("max_date"),
    date_trunc("mm", min(to_date("date", "dd/MM/yyyy"))). \
            alias("min_date")). \
    select("id",expr("sequence(min_date, max_date, interval 1 month)").alias("date_seq")). \
        withColumn("date_new",explode("date_seq")). \
        withColumn("date_form",date_format("date_new", "dd/MM/yyyy"))

In [50]:
oil_data.select(date_trunc("mm", max(to_date("date", "dd/MM/yyyy"))).\
            alias("max_date"),
            date_trunc("mm", min(to_date("date", "dd/MM/yyyy"))). \
            alias("min_date")).show()

+-------------------+-------------------+
|           max_date|           min_date|
+-------------------+-------------------+
|2017-08-01 00:00:00|2013-01-01 00:00:00|
+-------------------+-------------------+



### Creating the date sequence that we want

In [74]:
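#NOTE: date_trunc("mm", ...) rounds max_date down to 2017-08-01, so the daily
#sequence stops there and leaves out 2017-08-02 to 2017-08-31
data_date_series = oil_data.select(date_trunc("mm", max(to_date("date", "dd/MM/yyyy"))).\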
            alias("max_date"),
            date_trunc("mm", min(to_date("date", "dd/MM/yyyy"))). \
            alias("min_date")). \
    select(expr("sequence(min_date, max_date, interval 1 day)").alias("date_seq")). \
        withColumn("date_new",explode("date_seq")). \
        withColumn("date_form",date_format("date_new", "yyyy-MM-dd"))

In [75]:
date_series=data_date_series.drop("date_seq","date_new")

In [76]:
date_series.count()

1674
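
The count of 1674 matches a daily range that ends on 2017-08-01, the month-truncated maximum shown earlier, rather than on 2017-08-31. The sketch below mimics Spark's `sequence(..., interval 1 day)` in plain Python to compare the two end dates; `daily_sequence` is a hypothetical helper written for this check, not a Spark function.

```python
from datetime import date, timedelta

def daily_sequence(start, end):
    """Return every calendar day from start to end, both inclusive."""
    return [start + timedelta(days=n) for n in range((end - start).days + 1)]

start = date(2013, 1, 1)
truncated_end = date(2017, 8, 1)   # max date after truncation to the month
actual_end = date(2017, 8, 31)     # last date present in oil.csv

truncated = daily_sequence(start, truncated_end)
full = daily_sequence(start, actual_end)
print(len(truncated), len(full))
# 1674 1704
```

If the last 30 days of August 2017 should be part of the series, the upper bound would have to be taken from the untruncated maximum date.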

In [77]:
date_series.createOrReplaceTempView('date_table')

In [78]:
sparkql("""SELECT date_form 
            from date_table""").show(2)

+----------+
| date_form|
+----------+
|2013-01-01|
|2013-01-02|
+----------+
only showing top 2 rows



In [80]:
sparkql("""SELECT date, dcoilwtico
            from oil_table""").show(2)

+-------------------+----------+
|               date|dcoilwtico|
+-------------------+----------+
|2013-01-01 00:00:00|      null|
|2013-01-02 00:00:00|     93.14|
+-------------------+----------+
only showing top 2 rows



### Build the tables SQL style: not so fast

Spark SQL does not enforce constraints such as primary or foreign (secondary) keys.
A PR for this has already been raised with the ASF, though.

sparkql(""" CREATE TABLE full_oil_table AS
        
        SELECT date_form, COALESCE(dcoilwtico,0) as dcoilwtico
        
        FROM date_table dt LEFT JOIN oil_table ot
        
        ON dt.date_form = ot.date""")
        
        
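The above command requires Hive support and errors out without it. We cannot create fully constrained tables in a Spark context; if constraints are required, that has to be done in an RDBMS environment. A data source table declared with `USING parquet` may avoid the Hive requirement; this notebook sticks to temp views.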
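
Since no key constraint is enforced, uniqueness of a would-be key has to be checked by query. The sketch below shows one way to do that with a `GROUP BY ... HAVING` statement. It uses Python's built-in `sqlite3` purely as a stand-in SQL engine so that it runs without Spark, and the rows are made up, with one date duplicated on purpose.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE oil_table (date TEXT, dcoilwtico REAL)")
# Made-up rows; the second date is duplicated on purpose
conn.executemany(
    "INSERT INTO oil_table VALUES (?, ?)",
    [("2013-01-02", 93.14), ("2013-01-03", 92.97), ("2013-01-03", 92.97)],
)

# Any date returned here would violate a primary key on `date`
duplicates = conn.execute(
    """SELECT date, COUNT(*) AS n
       FROM oil_table
       GROUP BY date
       HAVING COUNT(*) > 1"""
).fetchall()
conn.close()

print(duplicates)
# [('2013-01-03', 2)]
```

The SELECT is plain SQL, so it should also run through `sparkql` against the `oil_table` view; an empty result would mean `date` is safe to treat as a key.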

In [91]:
#Resorting to the Temp view creation route instead
sparkql(""" SELECT date_form, COALESCE(dcoilwtico,0) as dcoilwtico
        FROM date_table dt LEFT JOIN oil_table ot
        ON dt.date_form = ot.date"""). \
    createOrReplaceTempView('full_oil_table')

In [92]:
# Querying the new view, SQL style
sparkql("""SELECT * 
            FROM full_oil_table""").show(2)

+----------+----------+
| date_form|dcoilwtico|
+----------+----------+
|2013-01-01|       0.0|
|2013-01-02|     93.14|
+----------+----------+
only showing top 2 rows
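
The effect of the `LEFT JOIN` plus `COALESCE` can be seen on a handful of rows. The sketch below again uses `sqlite3` as a stand-in for Spark SQL, with table names copied from the views above and the first week of oil prices taken from the earlier outputs. One difference to keep in mind: here both join keys are text, whereas in the Spark view `date_form` is a string and `date` is a timestamp.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE date_table (date_form TEXT)")
conn.execute("CREATE TABLE oil_table (date TEXT, dcoilwtico REAL)")

# Every calendar day of the first week of 2013
conn.executemany(
    "INSERT INTO date_table VALUES (?)",
    [(f"2013-01-0{d}",) for d in range(1, 8)],
)
# Oil rows as in the source: a null on 01-01 and no rows for 01-05 / 01-06
conn.executemany(
    "INSERT INTO oil_table VALUES (?, ?)",
    [("2013-01-01", None), ("2013-01-02", 93.14), ("2013-01-03", 92.97),
     ("2013-01-04", 93.12), ("2013-01-07", 93.2)],
)

rows = conn.execute(
    """SELECT date_form, COALESCE(dcoilwtico, 0) AS dcoilwtico
       FROM date_table dt LEFT JOIN oil_table ot
       ON dt.date_form = ot.date
       ORDER BY date_form"""
).fetchall()
conn.close()

for row in rows:
    print(row)
```

Both kinds of gap end up as zero: the date that has a row with a null price (01-01) and the dates that have no row at all (01-05 and 01-06). A zero in `full_oil_table` therefore marks a missing value, not an actual price.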

