<a href="https://www.kaggle.com/code/kamaljp/store-sales-forecasting-pysparkdatamodeling?scriptVersionId=115252548" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

**Store Sales Forecasting** is an ongoing Kaggle competition. I have
decided to use PySpark to load the data, build the data model, analyse
it, and then move on to modeling.

In [1]:
! pip install pyspark

Collecting pyspark
  Downloading pyspark-3.3.1.tar.gz (281.4 MB)
  Preparing metadata (setup.py) ... done
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.3.1-py2.py3-none-any.whl size=281845513 sha256=515809b127ae1d55b157babd761643bcedd8b09984e23547c3342b40600f6fc4
  Stored in directory: /root/.cache/pip/wheels/42

In [2]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

In [3]:
%%sh
cd /kaggle/input/store-sales-time-series-forecasting/
ls

holidays_events.csv
oil.csv
sample_submission.csv
stores.csv
test.csv
train.csv
transactions.csv


In [4]:
# Let's assign variable names to the source file paths for easy reference

holidays = '/kaggle/input/store-sales-time-series-forecasting/holidays_events.csv'
oil = '/kaggle/input/store-sales-time-series-forecasting/oil.csv'
stores = '/kaggle/input/store-sales-time-series-forecasting/stores.csv'
train = '/kaggle/input/store-sales-time-series-forecasting/train.csv'
txn = '/kaggle/input/store-sales-time-series-forecasting/transactions.csv'
# We won't need these two for quite some time
test = '/kaggle/input/store-sales-time-series-forecasting/test.csv'
sample = '/kaggle/input/store-sales-time-series-forecasting/sample_submission.csv'

In [5]:
#starting the spark session and getting the database setup.

spark = SparkSession.builder.appName('sales_fc').getOrCreate()
sparkql= spark.sql
sparkreader = spark.read

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


23/01/02 01:37:19 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [6]:
sparkql("SET spark.sql.warehouse.dir").show(truncate=False)

+-----------------------+------------------------------------+
|key                    |value                               |
+-----------------------+------------------------------------+
|spark.sql.warehouse.dir|file:/kaggle/working/spark-warehouse|
+-----------------------+------------------------------------+



In [7]:
# Creating a local database, even though there is no Hive file system
sparkql("CREATE DATABASE IF NOT EXISTS sales_forecast")
sparkql("USE sales_forecast")

DataFrame[]

In [8]:
#Reading in the data
holidays_data = sparkreader.csv(holidays,inferSchema=True,header=True)
oil_data = sparkreader.csv(oil,inferSchema=True,header=True)
stores_data = sparkreader.csv(stores,inferSchema=True,header=True)
train_data = sparkreader.csv(train,inferSchema=True,header=True)
txn_data = sparkreader.csv(txn,inferSchema=True,header=True)

                                                                                

Naming convention: anything that is outside the database is named `*_data`
(a DataFrame); once it is inside, it is named `*_table`. That keeps the two apart.

In [9]:
# Let's create temp views of the tables first.
holidays_data.createOrReplaceTempView("holidays_table")
oil_data.createOrReplaceTempView("oil_table")
stores_data.createOrReplaceTempView("stores_table")
train_data.createOrReplaceTempView("train_table")
txn_data.createOrReplaceTempView("txn_table")

Temp views are dropped much like ordinary SQL tables, using the view name rather than the DataFrame name: sparkql("DROP VIEW IF EXISTS holidays_table")

In [10]:
#We have the data inside the spark data base to start manipulation
#using sql
sparkql("SHOW TABLES").show()

+---------+--------------+-----------+
|namespace|     tableName|isTemporary|
+---------+--------------+-----------+
|         |holidays_table|       true|
|         |     oil_table|       true|
|         |  stores_table|       true|
|         |   train_table|       true|
|         |     txn_table|       true|
+---------+--------------+-----------+



### Let's get to know the data... one table at a time

In [11]:
sparkql("select * from holidays_table limit 5").show()

+-------------------+-------+--------+-----------+--------------------+-----------+
|               date|   type|  locale|locale_name|         description|transferred|
+-------------------+-------+--------+-----------+--------------------+-----------+
|2012-03-02 00:00:00|Holiday|   Local|      Manta|  Fundacion de Manta|      false|
|2012-04-01 00:00:00|Holiday|Regional|   Cotopaxi|Provincializacion...|      false|
|2012-04-12 00:00:00|Holiday|   Local|     Cuenca| Fundacion de Cuenca|      false|
|2012-04-14 00:00:00|Holiday|   Local|   Libertad|Cantonizacion de ...|      false|
|2012-04-21 00:00:00|Holiday|   Local|   Riobamba|Cantonizacion de ...|      false|
+-------------------+-------+--------+-----------+--------------------+-----------+



In [12]:
#Observe there are multiple categories, type, locale, locale_name
holidays_data.count()

350

In [13]:
sparkql("""select count(*) as categ_counts, hd.type,
            hd.locale, hd.locale_name
            from holidays_table hd
            group by hd.type, hd.locale, hd.locale_name
            order by hd.locale_name
            """).show()

+------------+----------+--------+-----------+
|categ_counts|      type|  locale|locale_name|
+------------+----------+--------+-----------+
|          12|   Holiday|   Local|     Ambato|
|           6|   Holiday|   Local|    Cayambe|
|           6|   Holiday|Regional|   Cotopaxi|
|           6|   Holiday|   Local|     Cuenca|
|           1|  Transfer|   Local|     Cuenca|
|          40|Additional|National|    Ecuador|
|          56|     Event|National|    Ecuador|
|           8|  Transfer|National|    Ecuador|
|          60|   Holiday|National|    Ecuador|
|           5|  Work Day|National|    Ecuador|
|           5|    Bridge|National|    Ecuador|
|           6|   Holiday|   Local|  El Carmen|
|           6|   Holiday|   Local| Esmeraldas|
|          12|   Holiday|   Local|   Guaranda|
|           5|   Holiday|   Local|  Guayaquil|
|           5|Additional|   Local|  Guayaquil|
|           1|  Transfer|   Local|  Guayaquil|
|           1|  Transfer|   Local|     Ibarra|
|           6

In [14]:
oil_data.count()

1218

### The oil data runs from Jan '13 to Aug '17

In [15]:
sparkql("select * from oil_table limit 5").show()

+-------------------+----------+
|               date|dcoilwtico|
+-------------------+----------+
|2013-01-01 00:00:00|      null|
|2013-01-02 00:00:00|     93.14|
|2013-01-03 00:00:00|     92.97|
|2013-01-04 00:00:00|     93.12|
|2013-01-07 00:00:00|      93.2|
+-------------------+----------+



In [16]:
sparkql("select * from stores_table limit 5").show()

+---------+-------------+--------------------+----+-------+
|store_nbr|         city|               state|type|cluster|
+---------+-------------+--------------------+----+-------+
|        1|        Quito|           Pichincha|   D|     13|
|        2|        Quito|           Pichincha|   D|     13|
|        3|        Quito|           Pichincha|   D|      8|
|        4|        Quito|           Pichincha|   D|      9|
|        5|Santo Domingo|Santo Domingo de ...|   D|      4|
+---------+-------------+--------------------+----+-------+



In [17]:
stores_data.count()

54

In [18]:
sparkql("select * from train_table limit 5").show()

+---+-------------------+---------+----------+-----+-----------+
| id|               date|store_nbr|    family|sales|onpromotion|
+---+-------------------+---------+----------+-----+-----------+
|  0|2013-01-01 00:00:00|        1|AUTOMOTIVE|  0.0|          0|
|  1|2013-01-01 00:00:00|        1| BABY CARE|  0.0|          0|
|  2|2013-01-01 00:00:00|        1|    BEAUTY|  0.0|          0|
|  3|2013-01-01 00:00:00|        1| BEVERAGES|  0.0|          0|
|  4|2013-01-01 00:00:00|        1|     BOOKS|  0.0|          0|
+---+-------------------+---------+----------+-----+-----------+



In [19]:
train_data.count()

                                                                                

3000888

In [20]:
sparkql("SELECT * FROM txn_table LIMIT 5").show()

+-------------------+---------+------------+
|               date|store_nbr|transactions|
+-------------------+---------+------------+
|2013-01-01 00:00:00|       25|         770|
|2013-01-02 00:00:00|        1|        2111|
|2013-01-02 00:00:00|        2|        2358|
|2013-01-02 00:00:00|        3|        3487|
|2013-01-02 00:00:00|        4|        1922|
+-------------------+---------+------------+



In [21]:
txn_data.count()

83488

In [22]:
# Toy DataFrame for trying out date handling before using it on the real tables
data = [(123, 1, "01/01/2021",),
        (123, 0, "01/02/2021",),
        (123, 1, "01/03/2021",),
        (123, 0, "01/06/2021",),
        (123, 0, "01/08/2021",),
        (777, 0, "01/01/2021",),
        (777, 1, "01/03/2021",), ]

df = spark.createDataFrame(data, ("ID", "FLAG", "DATE",)) \
        .withColumn("DATE", to_date(col("DATE"), "dd/MM/yyyy"))

In [23]:
df.show(2)

                                                                                

+---+----+----------+
| ID|FLAG|      DATE|
+---+----+----------+
|123|   1|2021-01-01|
|123|   0|2021-02-01|
+---+----+----------+
only showing top 2 rows



In [24]:
sparkql("""SELECT MIN(date) as min_date,MAX(date) as max_date,
                MIN(date) - MAX(date) as interval
                from oil_table""").show()

+-------------------+-------------------+--------------------+
|           min_date|           max_date|            interval|
+-------------------+-------------------+--------------------+
|2013-01-01 00:00:00|2017-08-31 00:00:00|INTERVAL '-1703 0...|
+-------------------+-------------------+--------------------+
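The interval above is negative because the query subtracts MAX(date) from MIN(date). As a quick cross-check outside Spark, the same arithmetic can be sketched with Python's `datetime` module, using the two endpoints shown in the output. This is only an illustration of the sign and size of the span, not a replacement for the Spark query.

```python
from datetime import date

# Endpoints copied from the query output above
min_date = date(2013, 1, 1)
max_date = date(2017, 8, 31)

# MIN - MAX (the order used in the query) gives a negative span
span_as_written = (min_date - max_date).days

# MAX - MIN gives the covered span as a positive number of days
span_positive = (max_date - min_date).days

print(span_as_written, span_positive)
```

Swapping the operands in the SQL (MAX(date) - MIN(date)) would report the span as a positive interval, which is the form used in the later cells.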



In [25]:
all_dates_df = df.groupBy("id").agg(
    date_trunc("mm", max(to_date("date", "dd/MM/yyyy"))).\
            alias("max_date"),
    date_trunc("mm", min(to_date("date", "dd/MM/yyyy"))). \
            alias("min_date")). \
    select("id",expr("sequence(min_date, max_date, interval 1 month)").alias("date_seq")). \
        withColumn("date_new",explode("date_seq")). \
        withColumn("date_form",date_format("date_new", "dd/MM/yyyy"))

In [26]:
oil_data.select(date_trunc("mm", max(to_date("date", "dd/MM/yyyy"))).\
            alias("max_date"),
            date_trunc("mm", min(to_date("date", "dd/MM/yyyy"))). \
            alias("min_date"),
            (date_trunc("mm", max(to_date("date", "dd/MM/yyyy"))) - \
            date_trunc("mm", min(to_date("date", "dd/MM/yyyy")))).alias('diff_date')).show()

+-------------------+-------------------+--------------------+
|           max_date|           min_date|           diff_date|
+-------------------+-------------------+--------------------+
|2017-08-01 00:00:00|2013-01-01 00:00:00|INTERVAL '1673 00...|
+-------------------+-------------------+--------------------+



### Creating the date sequence that we want

In [27]:
data_date_series = oil_data.select(date_trunc("mm", max(to_date("date", "dd/MM/yyyy"))).\
            alias("max_date"),
            date_trunc("mm", min(to_date("date", "dd/MM/yyyy"))). \
            alias("min_date")). \
    select(expr("sequence(min_date, max_date, interval 1 day)").alias("date_seq")). \
        withColumn("date_new",explode("date_seq")). \
        withColumn("date_form",date_format("date_new", "yyyy-MM-dd"))

In [28]:
date_series=data_date_series.drop("date_seq","date_new")

In [29]:
date_series.count()

1674
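The count of 1674 is worth a second look. The sequence runs between the month-truncated minimum and maximum, so it appears to stop at 2017-08-01 even though the oil data runs to 2017-08-31. The sketch below reproduces the day counts in plain Python; `daily_sequence` is a hypothetical helper written for this illustration, and the end dates are taken from the outputs above.

```python
from datetime import date, timedelta

def daily_sequence(start, end):
    """Return every calendar day from start to end, both inclusive."""
    n_days = (end - start).days + 1
    return [start + timedelta(days=i) for i in range(n_days)]

first_day = date(2013, 1, 1)
actual_end = date(2017, 8, 31)

# Truncating to the month snaps the maximum date back to the first of the month
truncated_end = actual_end.replace(day=1)

truncated_seq = daily_sequence(first_day, truncated_end)
full_seq = daily_sequence(first_day, actual_end)

print(len(truncated_seq), truncated_seq[-1])
print(len(full_seq), full_seq[-1])
```

If the full range is wanted, one option is to truncate only the minimum date, or to skip the truncation altogether when stepping by one day.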

In [30]:
date_series.createOrReplaceTempView('date_table')

In [31]:
sparkql("""SELECT date_form 
            from date_table""").show(2)

+----------+
| date_form|
+----------+
|2013-01-01|
|2013-01-02|
+----------+
only showing top 2 rows



### Build the tables SQL style: not so fast.

Spark SQL does not enforce constraints such as primary and secondary keys.
A PR for this has already been raised with the ASF, though.

sparkql(""" CREATE TABLE full_oil_table AS
        
        SELECT date_form, COALESCE(dcoilwtico,0) as dcoilwtico
        
        FROM date_table dt LEFT JOIN oil_table ot
        
        ON dt.date_form = ot.date""")
        
        
The above command requires Hive support and errors out without it. We cannot create fully constrained tables in a Spark context; if they are required, we have to create them in an RDBMS environment.

In [32]:
#Resorting to the Temp view creation route instead
sparkql(""" SELECT date_form, COALESCE(dcoilwtico,0) as dcoilwtico
        FROM date_table dt LEFT JOIN oil_table ot
        ON dt.date_form = ot.date"""). \
    createOrReplaceTempView('full_oil_table')

In [33]:
# Creating table the sql style
sparkql("""SELECT * 
            FROM full_oil_table""").show(2)

+----------+----------+
| date_form|dcoilwtico|
+----------+----------+
|2013-01-01|       0.0|
|2013-01-02|     93.14|
+----------+----------+
only showing top 2 rows



#### Attempting to join the oil data with holidays_table

-- Doing some recon on the table columns, the ranges and data types

-- Thinking of the format to be used for columns used for joining 

In [34]:
sparkql("""SELECT MAX(date) as max_date,
            MIN(date) as min_date,
            MAX(date) - MIN(date) as avbl_span
            FROM holidays_table""").show(2)

+-------------------+-------------------+--------------------+
|           max_date|           min_date|           avbl_span|
+-------------------+-------------------+--------------------+
|2017-12-26 00:00:00|2012-03-02 00:00:00|INTERVAL '2125 00...|
+-------------------+-------------------+--------------------+



In [35]:
sparkql("""SELECT *
            FROM holidays_table""").show(2)

+-------------------+-------+--------+-----------+--------------------+-----------+
|               date|   type|  locale|locale_name|         description|transferred|
+-------------------+-------+--------+-----------+--------------------+-----------+
|2012-03-02 00:00:00|Holiday|   Local|      Manta|  Fundacion de Manta|      false|
|2012-04-01 00:00:00|Holiday|Regional|   Cotopaxi|Provincializacion...|      false|
+-------------------+-------+--------+-----------+--------------------+-----------+
only showing top 2 rows



In [36]:
# date_trunc is available but not useful here
# to_char is not available
# extract is available, did not try it
# date_format: found it only today, my New Year gift ;)

sparkql("""SELECT date_format(date, 'yyyy-MM-dd') FROM holidays_table""").show(2)

+-----------------------------+
|date_format(date, yyyy-MM-dd)|
+-----------------------------+
|                   2012-03-02|
|                   2012-04-01|
+-----------------------------+
only showing top 2 rows
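The point of `date_format` here is to bring the timestamp column into the same `yyyy-MM-dd` text form as `date_form`, so both join keys look alike. As a rough stand-in in plain Python (`strftime` with `%Y-%m-%d` plays the role of Spark's `'yyyy-MM-dd'` pattern; the variable names are made up for this sketch):

```python
from datetime import datetime

holiday_ts = datetime(2012, 3, 2)  # a timestamp, like holidays_table.date
date_key = "2012-03-02"            # a text key, like date_table.date_form

# Rendered in full, the timestamp text carries a time part and differs from the key
as_full_string = holiday_ts.strftime("%Y-%m-%d %H:%M:%S")

# Formatting down to year-month-day makes both sides directly comparable
as_day_string = holiday_ts.strftime("%Y-%m-%d")

print(as_full_string == date_key, as_day_string == date_key)
```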



In [37]:
sparkql(""" SELECT ot.date_form, ht.date, ht.type, ht.locale,
        ht.locale_name,ot.dcoilwtico
        FROM holidays_table ht JOIN full_oil_table ot
        ON date_format(ht.date,'yyyy-MM-dd') = ot.date_form
""").show(2, truncate=False)
## The tables are joining

+----------+-------------------+--------+--------+-----------+----------+
|date_form |date               |type    |locale  |locale_name|dcoilwtico|
+----------+-------------------+--------+--------+-----------+----------+
|2013-01-01|2013-01-01 00:00:00|Holiday |National|Ecuador    |0.0       |
|2013-01-05|2013-01-05 00:00:00|Work Day|National|Ecuador    |0.0       |
+----------+-------------------+--------+--------+-----------+----------+
only showing top 2 rows



In [38]:
sparkql(""" SELECT ot.date_form, COALESCE(ht.type,'Working') as type, 
        COALESCE(ht.locale,'National') as locale,
        COALESCE(ht.locale_name,'National') as locale_name,
        ot.dcoilwtico
        FROM holidays_table ht RIGHT JOIN full_oil_table ot
        ON date_format(ht.date,'yyyy-MM-dd') = ot.date_form
""").createOrReplaceTempView('full_oil_with_holidays')
## The tables are joining

In [39]:
sparkql("""SELECT * 
            FROM full_oil_with_holidays""").show(2)

+----------+-------+--------+-----------+----------+
| date_form|   type|  locale|locale_name|dcoilwtico|
+----------+-------+--------+-----------+----------+
|2013-01-01|Holiday|National|    Ecuador|       0.0|
|2013-01-02|Working|National|   National|     93.14|
+----------+-------+--------+-----------+----------+
only showing top 2 rows



Validating the table join

- Check if there are extra rows

- Find the extra rows 

- Ensure there is no duplication
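The three checks above can be sketched on toy data. The example uses `sqlite3` from the Python standard library as a stand-in for Spark SQL so that it runs without a Spark session; the table names, rows and locale names are made up for illustration, and only the shape of the queries carries over.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dates (date_form TEXT);
    CREATE TABLE holidays (date TEXT, type TEXT, locale_name TEXT);
    INSERT INTO dates VALUES ('2013-06-24'), ('2013-06-25'), ('2013-06-26');
    -- toy rows: three holiday entries fall on the same day
    INSERT INTO holidays VALUES
        ('2013-06-25', 'Holiday', 'A'),
        ('2013-06-25', 'Holiday', 'B'),
        ('2013-06-25', 'Holiday', 'C');
    CREATE VIEW joined AS
        SELECT d.date_form, COALESCE(h.type, 'Working') AS type
        FROM dates d LEFT JOIN holidays h ON h.date = d.date_form;
""")

# 1. Are there extra rows? Compare total rows with distinct join keys.
total_rows = con.execute("SELECT COUNT(*) FROM joined").fetchone()[0]
distinct_dates = con.execute(
    "SELECT COUNT(DISTINCT date_form) FROM joined").fetchone()[0]
extra_rows = total_rows - distinct_dates

# 2. Find the keys that produce the extra rows.
duplicated = con.execute("""
    SELECT date_form, COUNT(*) FROM joined
    GROUP BY date_form HAVING COUNT(*) > 1
""").fetchall()

print(total_rows, distinct_dates, extra_rows, duplicated)
```

A key that appears in `duplicated` is not necessarily bad data: a one-to-many join fans out by design, so the third check is about deciding whether those repeated dates are expected.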

In [40]:
sparkql(""" SELECT *
        FROM full_oil_with_holidays
""").count()

1704

In [41]:
sparkql(""" SELECT distinct date_form
        FROM full_oil_with_holidays
""").count()

1674

We can observe that dates have been duplicated: 1704 rows against 1674 distinct dates. The reason must be linked to the locales and types, since one date can carry several holiday entries. Running a GROUP BY with those
columns should tell us whether rows are truly duplicated.

In [42]:
sparkql(""" SELECT COUNT(1) as typ_counts, date_form, type
        FROM full_oil_with_holidays
        GROUP BY date_form, type
        HAVING COUNT(1) > 1
""").show()

+----------+----------+-------+
|typ_counts| date_form|   type|
+----------+----------+-------+
|         3|2013-06-25|Holiday|
|         2|2013-07-03|Holiday|
|         3|2014-06-25|Holiday|
|         2|2014-07-03|Holiday|
|         3|2015-06-25|Holiday|
|         2|2015-07-03|Holiday|
|         2|2016-05-08|  Event|
|         3|2016-06-25|Holiday|
|         2|2016-07-03|Holiday|
|         2|2017-04-14|Holiday|
|         3|2017-06-25|Holiday|
|         2|2017-07-03|Holiday|
+----------+----------+-------+



In [43]:
sparkql(""" SELECT ft.date_form, ft.dcoilwtico
        FROM full_oil_with_holidays ft
        EXCEPT
        SELECT ot.date_form, ot.dcoilwtico 
        FROM full_oil_table ot
""").show()

+---------+----------+
|date_form|dcoilwtico|
+---------+----------+
+---------+----------+



Based on the above checks, the table join and the new view creation are successful. Proceeding to the next join.

The stores table shown below looks like a dimension table. The store_nbr column can serve as the unique id. Let's check that.

The store_nbr is an arbitrary number that identifies a particular store. There are multiple stores in the same city, state, type and cluster, so store_nbr is the valid join key.

In [44]:
sparkql(""" SELECT st.*
        FROM stores_table st
""").show(5)

+---------+-------------+--------------------+----+-------+
|store_nbr|         city|               state|type|cluster|
+---------+-------------+--------------------+----+-------+
|        1|        Quito|           Pichincha|   D|     13|
|        2|        Quito|           Pichincha|   D|     13|
|        3|        Quito|           Pichincha|   D|      8|
|        4|        Quito|           Pichincha|   D|      9|
|        5|Santo Domingo|Santo Domingo de ...|   D|      4|
+---------+-------------+--------------------+----+-------+
only showing top 5 rows



In [45]:
sparkql(""" SELECT tt.id, date_format(tt.date,'yyyy-MM-dd') as date,
            tt.store_nbr, tt.family, 
            tt.sales, tt.onpromotion
        FROM train_table tt""").createOrReplaceTempView('full_train_table')

In [46]:
sparkql("""select COUNT(1) as day_data,tt.date
            FROM full_train_table tt
            GROUP BY tt.date
            ORDER BY tt.date""").show(2)



+--------+----------+
|day_data|      date|
+--------+----------+
|    1782|2013-01-01|
|    1782|2013-01-02|
+--------+----------+
only showing top 2 rows



                                                                                

In [47]:
sparkql("""SELECT MAX(date) as max_date,
            MIN(date) as min_date,
            MAX(date) - MIN(date) as avbl_span
            FROM train_table""").show(2)

# The span is 1687 days, which is 14 days more than the 1673-day span
# of the date table built from oil_data.

+-------------------+-------------------+--------------------+
|           max_date|           min_date|           avbl_span|
+-------------------+-------------------+--------------------+
|2017-08-15 00:00:00|2013-01-01 00:00:00|INTERVAL '1687 00...|
+-------------------+-------------------+--------------------+



                                                                                

Remember that SQL executes its clauses in this order:

FROM → JOIN → WHERE → GROUP BY → SELECT → ORDER BY

Given this order, ORDER BY can see the aliases defined in SELECT, but in standard SQL, GROUP BY and WHERE cannot.

Let's join the train table with the stores table and full_oil_table.

In [48]:
sparkql(""" SELECT ftt.*, st.*,fot.*
        FROM full_train_table ftt JOIN stores_table st
        on ftt.store_nbr = st.store_nbr
        join full_oil_table fot
        on fot.date_form = ftt.date
""").show(2)

+---+----------+---------+----------+-----+-----------+---------+-----+---------+----+-------+----------+----------+
| id|      date|store_nbr|    family|sales|onpromotion|store_nbr| city|    state|type|cluster| date_form|dcoilwtico|
+---+----------+---------+----------+-----+-----------+---------+-----+---------+----+-------+----------+----------+
|  0|2013-01-01|        1|AUTOMOTIVE|  0.0|          0|        1|Quito|Pichincha|   D|     13|2013-01-01|       0.0|
|  1|2013-01-01|        1| BABY CARE|  0.0|          0|        1|Quito|Pichincha|   D|     13|2013-01-01|       0.0|
+---+----------+---------+----------+-----+-----------+---------+-----+---------+----+-------+----------+----------+
only showing top 2 rows



In [49]:
sparkql(""" SELECT ftt.*, st.*,fot.*
        FROM full_train_table ftt JOIN stores_table st
        on ftt.store_nbr = st.store_nbr
        RIGHT JOIN full_oil_table fot
        on fot.date_form = ftt.date
""").show(2)



+----+----------+---------+----------+-----+-----------+---------+-----+---------+----+-------+----------+----------+
|  id|      date|store_nbr|    family|sales|onpromotion|store_nbr| city|    state|type|cluster| date_form|dcoilwtico|
+----+----------+---------+----------+-----+-----------+---------+-----+---------+----+-------+----------+----------+
|3564|2013-01-03|        1|AUTOMOTIVE|  3.0|          0|        1|Quito|Pichincha|   D|     13|2013-01-03|     92.97|
|3565|2013-01-03|        1| BABY CARE|  0.0|          0|        1|Quito|Pichincha|   D|     13|2013-01-03|     92.97|
+----+----------+---------+----------+-----+-----------+---------+-----+---------+----+-------+----------+----------+
only showing top 2 rows



                                                                                

Let's validate the join by the usual process of checking the data:

-- Row counts and store numbers of the individual tables and the final
joined table

In [50]:
sparkql("""SELECT * FROM train_table""").count()

                                                                                

3000888

In [51]:
sparkql(""" SELECT ftt.*, st.*,fot.*
        FROM full_train_table ftt JOIN stores_table st
        on ftt.store_nbr = st.store_nbr
        join full_oil_table fot
        on fot.date_form = ftt.date
""").count()

                                                                                

2975940

Hmm, rows have been lost: 3000888 - 2975940 = 24948. I guess full_oil_table
has fewer date rows than the train table...

In [52]:
sparkql("""select COUNT(1) as day_data,tt.date
            FROM full_train_table tt
            GROUP BY tt.date
            ORDER BY tt.date""").count()

                                                                                

1684

In [53]:
sparkql("""select COUNT(1) as day_data,tt.date_form
            FROM full_oil_table tt
            GROUP BY tt.date_form
            ORDER BY tt.date_form""").count()

1674

In [54]:
# 1684 - 1674 = 10 dates at 1782 rows each. That provides part of the answer.
3000888 - 1782 * 10 

2983068
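The figure above (2983068) does not match the inner-join count (2975940), so the 10-date difference cannot be the whole story. A rough reconciliation in plain Python, using only counts already shown in this notebook and assuming the date table ends on 2017-08-01 (the month-truncated max_date printed earlier), points to 14 lost dates instead:

```python
from datetime import date

# Counts copied from the cell outputs in this notebook
train_rows = 3_000_888
inner_join_rows = 2_975_940
rows_per_day = 1_782
train_dates = 1_684

lost_rows = train_rows - inner_join_rows
lost_dates = lost_rows // rows_per_day

# Calendar days covered by the train table, both ends inclusive
first_day, last_train_day = date(2013, 1, 1), date(2017, 8, 15)
calendar_days = (last_train_day - first_day).days + 1
absent_from_train = calendar_days - train_dates

# Assumption: the date table stops at 2017-08-01
past_date_table = (last_train_day - date(2017, 8, 1)).days

print(lost_rows, lost_dates, absent_from_train, past_date_table)
```

Under that assumption the two distinct-date counts differ by only 10 because a few calendar dates are absent from the train table while present in the date table, which hides part of the gap.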

In [55]:
# Let's check the store numbers. The count tallies with the
# 54 stores in stores_table
sparkql("""select COUNT(1) as day_data,tt.store_nbr
            FROM full_train_table tt
            GROUP BY tt.store_nbr
            ORDER BY tt.store_nbr""").count()

                                                                                

54

So there we found the culprit: we had to do a left outer join.
There might be days that are present in train_table but not
in full_oil_table. We need to work on that next.
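The difference between the two join types can be seen on a toy example. As a sketch, the snippet below uses `sqlite3` from the standard library in place of Spark SQL; the tables, rows and the oil price are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE train (id INTEGER, date TEXT);
    CREATE TABLE oil (date_form TEXT, dcoilwtico REAL);
    INSERT INTO train VALUES
        (1, '2017-08-01'), (2, '2017-08-02'), (3, '2017-08-03');
    -- toy oil table that covers only the first train date
    INSERT INTO oil VALUES ('2017-08-01', 50.0);
""")

# Inner join: train rows without a matching oil date are dropped
inner_count = con.execute("""
    SELECT COUNT(*) FROM train t JOIN oil o ON o.date_form = t.date
""").fetchone()[0]

# Left join: every train row is kept, with NULL where oil has no match
left_count = con.execute("""
    SELECT COUNT(*) FROM train t LEFT JOIN oil o ON o.date_form = t.date
""").fetchone()[0]

# Anti-join pattern: list the train rows that found no partner
unmatched = con.execute("""
    SELECT t.id FROM train t LEFT JOIN oil o ON o.date_form = t.date
    WHERE o.date_form IS NULL ORDER BY t.id
""").fetchall()

print(inner_count, left_count, unmatched)
```

The last query is a handy way to list exactly which dates are missing on the right-hand side before deciding how to fill them.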

In [56]:
sparkql(""" SELECT ftt.*, st.*,fot.*
        FROM full_train_table ftt JOIN stores_table st
        on ftt.store_nbr = st.store_nbr
        LEFT JOIN full_oil_table fot
        on fot.date_form = ftt.date""").count()

                                                                                

3000888

In [57]:
sparkql(""" SELECT ftt.id, ftt.date,ftt.store_nbr,ftt.family,
            ftt.sales, ftt.onpromotion, st.city, st.state,st.type,
            st.cluster,fot.dcoilwtico
        FROM full_train_table ftt JOIN stores_table st
        on ftt.store_nbr = st.store_nbr
        LEFT JOIN full_oil_table fot
        on fot.date_form = ftt.date"""). \
        createOrReplaceTempView("train_store_oil_table")

In [58]:
sparkql("""SELECT tsot.*
            FROM train_store_oil_table tsot
            WHERE tsot.date = '2017-08-02'""").show(2)

+-------+----------+---------+----------+-----+-----------+-----+---------+----+-------+----------+
|     id|      date|store_nbr|    family|sales|onpromotion| city|    state|type|cluster|dcoilwtico|
+-------+----------+---------+----------+-----+-----------+-----+---------+----+-------+----------+
|2975940|2017-08-02|        1|AUTOMOTIVE|  4.0|          0|Quito|Pichincha|   D|     13|      null|
|2975941|2017-08-02|        1| BABY CARE|  0.0|          0|Quito|Pichincha|   D|     13|      null|
+-------+----------+---------+----------+-----+-----------+-----+---------+----+-------+----------+
only showing top 2 rows



                                                                                

In [59]:
sparkql("""SELECT date_format(txt.date,'yyyy-MM-dd') as date,
                sum(txt.transactions) as total_txn
                FROM txn_table txt
            GROUP BY date_format(txt.date,'yyyy-MM-dd')
            ORDER BY date""").show(5)

+----------+---------+
|      date|total_txn|
+----------+---------+
|2013-01-01|      770|
|2013-01-02|    93215|
|2013-01-03|    78504|
|2013-01-04|    78494|
|2013-01-05|    93573|
+----------+---------+
only showing top 5 rows



In [60]:
sparkql("""SELECT MAX(date) as max_date,
            MIN(date) as min_date,
            MAX(date) - MIN(date) as avbl_span
            FROM txn_table""").show(2)

+-------------------+-------------------+--------------------+
|           max_date|           min_date|           avbl_span|
+-------------------+-------------------+--------------------+
|2017-08-15 00:00:00|2013-01-01 00:00:00|INTERVAL '1687 00...|
+-------------------+-------------------+--------------------+



In [61]:
sparkql("""SELECT date_format(txt.date,'yyyy-MM-dd') as date,
                sum(txt.transactions) as total_txn
                FROM txn_table txt
            GROUP BY date_format(txt.date,'yyyy-MM-dd')
            ORDER BY date""").tail(5)

[Row(date='2017-08-11', total_txn=89551),
 Row(date='2017-08-12', total_txn=89927),
 Row(date='2017-08-13', total_txn=85993),
 Row(date='2017-08-14', total_txn=85448),
 Row(date='2017-08-15', total_txn=86561)]

Transaction data is missing for some days in the middle of the span, which the min/max and tail checks above will not show. Let's proceed with the join.
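A minimal sketch of how such gaps could be surfaced: compare the observed dates with a full calendar between the first and last date. `missing_dates` is a hypothetical helper and the toy dates are made up; the grid arithmetic at the end uses only counts printed in this notebook.

```python
from datetime import date, timedelta

def missing_dates(observed):
    """Return the calendar days between min and max of observed that are absent."""
    present = set(observed)
    first, last = min(present), max(present)
    n_days = (last - first).days + 1
    calendar = (first + timedelta(days=i) for i in range(n_days))
    return [d for d in calendar if d not in present]

# Toy dates with a hole on 2013-01-03 and 2013-01-04
toy = [date(2013, 1, 1), date(2013, 1, 2), date(2013, 1, 5)]
gaps = missing_dates(toy)

# A complete store-by-day grid would hold (1687 + 1) days * 54 stores rows,
# against the 83488 rows actually present in txn_table
full_grid = (1687 + 1) * 54
shortfall = full_grid - 83_488

print(gaps, full_grid, shortfall)
```

The shortfall only says that some store-day combinations are absent; it does not tell whether whole days or individual stores are missing.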

In [62]:
sparkql("""SELECT tsot.*, date_format(txt.date,'yyyy-MM-dd') as txn_date,
            txt.transactions, txt.store_nbr
            FROM train_store_oil_table tsot LEFT JOIN txn_table txt
            on tsot.date = date_format(txt.date,'yyyy-MM-dd')
        """).show(2)

+---+----------+---------+----------+-----+-----------+-----+---------+----+-------+----------+----------+------------+---------+
| id|      date|store_nbr|    family|sales|onpromotion| city|    state|type|cluster|dcoilwtico|  txn_date|transactions|store_nbr|
+---+----------+---------+----------+-----+-----------+-----+---------+----+-------+----------+----------+------------+---------+
|  0|2013-01-01|        1|AUTOMOTIVE|  0.0|          0|Quito|Pichincha|   D|     13|       0.0|2013-01-01|         770|       25|
|  1|2013-01-01|        1| BABY CARE|  0.0|          0|Quito|Pichincha|   D|     13|       0.0|2013-01-01|         770|       25|
+---+----------+---------+----------+-----+-----------+-----+---------+----+-------+----------+----------+------------+---------+
only showing top 2 rows



In [63]:
sparkql("""SELECT tsot.*, date_format(txt.date,'yyyy-MM-dd') as txn_date,
            txt.transactions, txt.store_nbr
            FROM train_store_oil_table tsot LEFT JOIN txn_table txt
            on tsot.date = date_format(txt.date,'yyyy-MM-dd') and
                tsot.store_nbr = txt.store_nbr
        """).count()

                                                                                

3000888
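Joining on the date alone pairs each train row with every store's transactions for that day, which is why store 1 shows up next to store 25 in the earlier output. Adding store_nbr to the join condition restores one row per train row. A toy illustration with `sqlite3` standing in for Spark SQL (the table layout is simplified for the sketch, and store 3 is deliberately left without a transaction row):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE train (id INTEGER, date TEXT, store_nbr INTEGER);
    CREATE TABLE txn (date TEXT, store_nbr INTEGER, transactions INTEGER);
    INSERT INTO train VALUES
        (0, '2013-01-02', 1), (1, '2013-01-02', 2), (2, '2013-01-02', 3);
    -- store 3 has no transaction row for the day
    INSERT INTO txn VALUES ('2013-01-02', 1, 2111), ('2013-01-02', 2, 2358);
""")

# Date-only key: every train row matches every txn row of that day
date_only = con.execute("""
    SELECT COUNT(*) FROM train t LEFT JOIN txn x ON t.date = x.date
""").fetchone()[0]

# Composite key: at most one txn row per (date, store_nbr)
composite = con.execute("""
    SELECT t.id, COALESCE(x.transactions, 0)
    FROM train t LEFT JOIN txn x
      ON t.date = x.date AND t.store_nbr = x.store_nbr
    ORDER BY t.id
""").fetchall()

print(date_only, composite)
```

When the composite-key join returns exactly as many rows as the left table, as the 3000888 count does here, the key is unique on the right-hand side.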

In [64]:
sparkql("""SELECT date_format(txt.date,'yyyy-MM-dd') as txn_date,
            txt.transactions, txt.store_nbr
            FROM txn_table txt""").count()

83488

In [65]:
# Row count of an unconstrained cross join of txn_table and train_table, for scale
83488 * 3000888

250538137344

In [66]:
sparkql("""SELECT tsot.*, 
            COALESCE(DATE_FORMAT(txt.date,'yyyy-MM-dd'),tsot.date) as txn_date,
            COALESCE(txt.transactions,0) as store_txns, 
            COALESCE(txt.store_nbr, tsot.store_nbr) as store_nbr
            FROM train_store_oil_table tsot LEFT JOIN txn_table txt
            on tsot.date = date_format(txt.date,'yyyy-MM-dd')
            and tsot.store_nbr = txt.store_nbr
        """).createOrReplaceTempView("all_data_joined_data")

In [67]:
sparkql("""SELECT * FROM all_data_joined_data adj
            WHERE adj.date = '2013-01-01'
            and adj.store_txns != 0""").show()

+---+----------+---------+-------------------+---------+-----------+-------+-----------+----+-------+----------+----------+----------+---------+
| id|      date|store_nbr|             family|    sales|onpromotion|   city|      state|type|cluster|dcoilwtico|  txn_date|store_txns|store_nbr|
+---+----------+---------+-------------------+---------+-----------+-------+-----------+----+-------+----------+----------+----------+---------+
|561|2013-01-01|       25|         AUTOMOTIVE|      0.0|          0|Salinas|Santa Elena|   D|      1|       0.0|2013-01-01|       770|       25|
|562|2013-01-01|       25|          BABY CARE|      0.0|          0|Salinas|Santa Elena|   D|      1|       0.0|2013-01-01|       770|       25|
|563|2013-01-01|       25|             BEAUTY|      2.0|          0|Salinas|Santa Elena|   D|      1|       0.0|2013-01-01|       770|       25|
|564|2013-01-01|       25|          BEVERAGES|    810.0|          0|Salinas|Santa Elena|   D|      1|       0.0|2013-01-01|       