<a href="https://colab.research.google.com/github/Brent-Morrison/Misc_scripts/blob/master/PySpark_flow_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PySpark in Google Colab

Installations


In [0]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
!tar xf spark-2.4.5-bin-hadoop2.7.tgz
!pip install -q findspark
!pip install pyspark
!pip install pyarrow

Collecting pyspark
  Downloading https://files.pythonhosted.org/packages/9a/5a/271c416c1c2185b6cb0151b29a91fff6fcaed80173c8584ff6d20e46b465/pyspark-2.4.5.tar.gz (217.8MB)
Collecting py4j==0.10.7
  Downloading https://files.pythonhosted.org/packages/e3/53/c737818eb9a7dc32a7cd4f1396e787bd94200c3997c72c1dbe028587bd76/py4j-0.10.7-py2.py3-none-any.whl (197kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-2.4.5-py2.py3-none-any.whl size=218257927 sha256=91e65afa86e22e8f38bd0797f4b5852b83659a7ce5707fdc6009136248398c87
  Stored in directory: /root/.cache/pip/wheels/bf/db/04/61d66a5939364e756eb1c1be4ec5bdce6e04047fc7929a3c3c
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.7 pyspark-2.4.5


Set the environment variables so that Colab can find Spark

In [0]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.5-bin-hadoop2.7"

Add PySpark to sys.path

PySpark isn't on sys.path by default, but that doesn't mean it can't be used as a regular library. You can address this by either symlinking pyspark into your site-packages, or adding pyspark to sys.path at runtime. [findspark](https://github.com/minrk/findspark) does the latter.

In [0]:
import findspark
findspark.init()
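
For reference, the "add to sys.path at runtime" approach amounts to roughly the following. This is a minimal sketch, not findspark's actual implementation: the helper names `spark_python_paths` and `add_spark_to_sys_path` are hypothetical, and the code assumes the usual layout of a Spark download, with the Python sources under `$SPARK_HOME/python` and a bundled py4j zip in `python/lib`.

```python
import glob
import os
import sys


def spark_python_paths(spark_home):
    """Return the entries a Spark install needs on sys.path.

    Hypothetical helper: assumes <spark_home>/python holds the pyspark
    package and <spark_home>/python/lib/py4j-*.zip holds py4j.
    """
    python_dir = os.path.join(spark_home, "python")
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*.zip")))
    return [python_dir] + py4j_zips


def add_spark_to_sys_path(spark_home):
    """Prepend any missing Spark paths to sys.path and return those added."""
    added = [p for p in spark_python_paths(spark_home) if p not in sys.path]
    sys.path[:0] = added
    return added
```

With the environment variables set as above, a call such as `add_spark_to_sys_path(os.environ["SPARK_HOME"])` would make `import pyspark` resolve to the downloaded distribution.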

Create the Spark session

In [0]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

Test that everything is working by creating a test dataframe.

In [0]:
test_df = spark.createDataFrame([{"hello": "world"} for x in range(100)])

test_df.printSchema()
test_df.show(3)



root
 |-- hello: string (nullable = true)

+-----+
|hello|
+-----+
|world|
|world|
|world|
+-----+
only showing top 3 rows



# The IFRS9 data transformations 

This is a practice data set to look at implementing complex data transformations in PySpark.

## Load the data

Get a feather file from github.  The file loaded below is generated by the processes outlined in [this](https://brentmorrison.netlify.app/post/ifrs9-disclosures-part-2/) blog post.

The intention was to implement this using PyArrow; however, [it looks like](https://issues.apache.org/jira/browse/ARROW-6998?filter=-6) that won't happen.

[This](https://github.com/pandas-dev/pandas/issues/29055) workaround seems to be the way to go.

In [0]:
# As described above, not working
# import pyarrow as pa
# fin_data = pa.feather.read_feather('https://github.com/Brent-Morrison/hugo_website/blob/master/content/post/ifrs9_part2.feather')

import pandas as pd
import requests
import io
resp = requests.get('https://github.com/Brent-Morrison/hugo_website/raw/master/content/post/ifrs9_part2.feather', stream = True)
resp.raw.decode_content = True
mem_fh = io.BytesIO(resp.raw.read())
fin_data = pd.read_feather(mem_fh)
fin_data.head()

  Ticker     me.date                  clust.name  RiskStage  TotalDebt        ECL
0      A  2016-12-31   rev.vol_oa.ta_td.ta_da.ta        1.0     1887.0  11.691196
1      A  2017-01-31  np.ta_td.ta_nca.ta_rev.vol        1.0     1887.0  32.852229
2      A  2017-02-28  np.ta_td.ta_nca.ta_rev.vol        1.0     1887.0  34.309091
3      A  2017-03-31  np.ta_td.ta_nca.ta_rev.vol        1.0     2043.0  38.319819
4      A  2017-04-30  np.ta_td.ta_nca.ta_rev.vol        1.0     2043.0  33.507803


Check the data types

In [0]:
fin_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24236 entries, 0 to 24235
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Ticker      24236 non-null  object 
 1   me.date     24236 non-null  object 
 2   clust.name  24236 non-null  object 
 3   RiskStage   24236 non-null  float64
 4   TotalDebt   24236 non-null  float64
 5   ECL         24236 non-null  float64
dtypes: float64(3), object(3)
memory usage: 1.1+ MB


The date column `me.date` has been imported as an object instead of a date. This needs to be changed before converting to a PySpark dataframe.

In [0]:
fin_data['me.date'] = pd.to_datetime(fin_data['me.date']) 
fin_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24236 entries, 0 to 24235
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   Ticker      24236 non-null  object        
 1   me.date     24236 non-null  datetime64[ns]
 2   clust.name  24236 non-null  object        
 3   RiskStage   24236 non-null  float64       
 4   TotalDebt   24236 non-null  float64       
 5   ECL         24236 non-null  float64       
dtypes: datetime64[ns](1), float64(3), object(2)
memory usage: 1.1+ MB


## Convert to a Spark dataframe

In [0]:
fin_data_spark = spark.createDataFrame(fin_data)

fin_data_spark.printSchema()

root
 |-- Ticker: string (nullable = true)
 |-- me.date: timestamp (nullable = true)
 |-- clust.name: string (nullable = true)
 |-- RiskStage: double (nullable = true)
 |-- TotalDebt: double (nullable = true)
 |-- ECL: double (nullable = true)



Find the first date

In [0]:
from pyspark.sql import functions as F

min_date = fin_data_spark \
  .withColumnRenamed('me.date','date') \
  .withColumn('date', F.to_date(F.col('date'))) \
  .select(F.min('date')).first()

min_date

Row(min(date)=datetime.date(2016, 12, 31))

The code block below implements the first data transform step outlined in [this](https://brentmorrison.netlify.app/post/ifrs9-disclosures-part-3/) post.
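
Before the notebook cell itself, a short aside on one pattern it leans on heavily. The opening-balance columns (`gca_op`, `bal_op`, `ecl_op`, `ctgy_op`) all come from `F.lag(col, 1)` over a window partitioned by customer and ordered by date. The plain-Python sketch below illustrates what that window computes on a few made-up rows. It is only a reading aid: the function name `lag_by_customer` and the sample balances are assumptions for illustration, and Spark evaluates windows in a distributed fashion rather than like this.

```python
from collections import defaultdict


def lag_by_customer(rows, value_key, out_key):
    """Mimic F.lag(value_key, 1).over(W.partitionBy('cust').orderBy('date')).

    rows is a list of dicts with 'cust' and 'date' keys. Each returned row
    gains out_key, holding the previous row's value for the same customer,
    or None on that customer's first date.
    """
    by_cust = defaultdict(list)
    for row in rows:
        by_cust[row["cust"]].append(row)

    result = []
    for cust in sorted(by_cust):
        previous = None  # no prior period for the first row in a partition
        for row in sorted(by_cust[cust], key=lambda r: r["date"]):
            result.append({**row, out_key: previous})
            previous = row[value_key]
    return result


# Made-up sample data (ISO date strings sort chronologically).
sample = [
    {"cust": "B", "date": "2017-01-31", "bal": 50.0},
    {"cust": "A", "date": "2017-02-28", "bal": 120.0},
    {"cust": "A", "date": "2017-01-31", "bal": 100.0},
    {"cust": "B", "date": "2017-02-28", "bal": 40.0},
]

for row in lag_by_customer(sample, "bal", "bal_op"):
    print(row["cust"], row["date"], row["bal"], row["bal_op"])
```

The `None` on each customer's first date corresponds to the nulls that the final `.na.fill(0)` in the notebook cell replaces with zero.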

In [0]:
from pyspark.sql import Window as W
from pyspark.sql import Column as C
from pyspark.sql.types import DoubleType

ifrs9_wide2 = fin_data_spark \
  .withColumnRenamed('me.date', 'date') \
  .withColumnRenamed('Ticker', 'cust') \
  .withColumnRenamed('clust.name', 'unit') \
  .withColumnRenamed('RiskStage', 'stage') \
  .withColumnRenamed('TotalDebt', 'gca') \
  .withColumnRenamed('ECL','ecl') \
  .withColumn('date', F.to_date(F.col('date'))) \
  .withColumn('year', F.year('date')) \
  .withColumn('ccy', F.when(F.col('cust').startswith('G'), 'GBP').otherwise('USD')) \
  .withColumn('type', F.when(F.col('cust').startswith('R'), 'rvlv').otherwise('term')) \
  .withColumn('poci', F.when(F.col('cust').startswith('P'), 'Y').otherwise('N')) \
  .withColumn('bal', F.col('gca')) \
  .withColumn('ecl', F.col('ecl') * -1) \
  .withColumn('wof', F.when((F.col('cust') == 'ORCL') & (F.col('date') == '2018-06-30'), F.lit(96)). \
              otherwise(F.lit(0))) \
  .withColumn('pryr', F.lit(0)) \
  .withColumn('prlt', F.lit(0)) \
  .withColumn('ctgy', 
              F.when((F.col('wof') != F.lit(0)) & (F.col('poci') == 'N'), F.lit(3)) \
              .when((F.col('wof') != F.lit(0)) & (F.col('poci') == 'Y'), F.lit(5)) \
              .when((F.col('wof') != F.lit(0)) & (F.isnull(F.col('poci'))), F.lit(3)) \
              .when((F.col('stage') == F.lit(1)) & (F.col('poci') == 'N'), F.lit(1)) \
              .when((F.col('stage') == F.lit(2)) & (F.col('poci') == 'N'), F.lit(2)) \
              .when((F.col('stage') == F.lit(3)) & (F.col('poci') == 'N'), F.lit(3)) \
              .when((F.isnull(F.col('stage'))) & (F.col('poci') == 'N'), F.lit(1)) \
              .when((F.col('stage') == F.lit(1)) & (F.col('poci') == 'Y'), F.lit(4)) \
              .when((F.col('stage') == F.lit(2)) & (F.col('poci') == 'Y'), F.lit(4)) \
              .when((F.col('stage') == F.lit(3)) & (F.col('poci') == 'Y'), F.lit(5)) \
              .when((F.isnull(F.col('stage'))) & (F.col('poci') == 'Y'), F.lit(4)) \
              .when((F.col('stage') == F.lit(1)) & (F.isnull(F.col('poci'))), F.lit(1)) \
              .when((F.col('stage') == F.lit(2)) & (F.isnull(F.col('poci'))), F.lit(2)) \
              .when((F.col('stage') == F.lit(3)) & (F.isnull(F.col('poci'))), F.lit(3)) \
              .otherwise(F.lit(1))) \
  .withColumn('gca_op', F.lag('gca', 1).over(W.partitionBy('cust').orderBy('date'))) \
  .withColumn('bal_op', F.lag('bal', 1).over(W.partitionBy('cust').orderBy('date'))) \
  .withColumn('ecl_op', F.lag('ecl', 1).over(W.partitionBy('cust').orderBy('date'))) \
  .withColumn('ctgy_op', F.lag('ctgy', 1).over(W.partitionBy('cust').orderBy('date'))) \
  .withColumn('bal_pr', F.first('bal_op', ignorenulls = True).over(W.partitionBy('cust', 'year').orderBy('date'))) \
  .withColumnRenamed('gca', 'gca_cl') \
  .withColumnRenamed('bal', 'bal_cl') \
  .withColumnRenamed('ecl', 'ecl_cl') \
  .withColumnRenamed('ctgy', 'ctgy_cl') \
  .withColumnRenamed('wof', 'wof_cl') \
  .withColumn('wof_cum', F.sum('wof_cl').over(W.partitionBy('cust', 'year').orderBy('date'))) \
  .withColumn('bal_y', F.col('bal_cl') - F.col('bal_pr') + F.col('wof_cum')) \
  .withColumn('bal_y_dd', 
              F.when((F.col('bal_y') > F.lit(0)), F.col('bal_y')) \
              .otherwise(F.lit(0))) \
  .withColumn('bal_y_dd_pr', F.lag('bal_y_dd', 1).over(W.partitionBy('cust').orderBy('date'))) \
  .withColumn('bal_y_rd', 
              F.when((F.col('bal_y') < F.lit(0)), F.col('bal_y')) \
              .otherwise(F.lit(0))) \
  .withColumn('bal_y_rd_pr', F.lag('bal_y_rd', 1).over(W.partitionBy('cust').orderBy('date'))) \
  .withColumn('cover_cl', -F.col('ecl_cl') / F.col('bal_cl')) \
  .withColumn('cover_op', -F.col('ecl_op') / F.col('bal_op')) \
  .withColumn('cover', 
              F.when(F.col('cover_op').isNull(), F.col('cover_cl')) \
              .otherwise(F.col('cover_op'))) \
  .withColumn('cover', 
              F.when(F.col('cover') < F.lit(0), F.lit(0)) \
              .otherwise(F.round(F.col('cover'), 5))) \
  .withColumn('cover', 
              F.when(F.col('cover') > F.lit(1), F.lit(1)) \
              .otherwise(F.round(F.col('cover'), 5))) \
  .withColumn('incr_decr', 
              F.when(F.col('bal_cl') > F.col('bal_op'), 'incr') \
              .when(F.col('bal_cl') < F.col('bal_op'), 'decr') \
              .otherwise('unch')) \
  .withColumn('ctgy_dir', 
            F.when(F.col('ctgy_cl') > F.col('ctgy_op'), 'd') \
            .when(F.col('ctgy_cl') < F.col('ctgy_op'), 'i') \
            .otherwise('u')) \
  .withColumn('pre_post', 
            F.when((F.col('ctgy_dir') == 'i') & (F.col('incr_decr') == 'decr'), 'pre') \
            .when((F.col('ctgy_dir') == 'd') & (F.col('incr_decr') == 'incr'), 'pre') \
            .otherwise('post')) \
  .withColumn('pre_stage', 
            F.when(F.col('pre_post') == 'pre', F.col('ctgy_op')) \
            .otherwise(F.col('ctgy_op'))) \
  .withColumn('gca_m_dd_r', 
            F.when(F.col('type') == 'rvlv', F.col('bal_y_dd') - F.col('bal_y_dd_pr')) \
            .otherwise(F.lit(0))) \
  .withColumn('gca_m_dd_t', 
            F.when((F.col('type') == 'term') & (F.col('incr_decr') == 'incr'), F.col('bal_cl') - F.col('bal_op') + F.col('wof_cl')) \
            .otherwise(F.lit(0))) \
  .withColumn('gca_m_rd_t_f', 
            F.when((F.col('type') == 'term') & (F.col('incr_decr') == 'decr') & (F.col('bal_cl') == F.lit(0)), F.col('bal_cl') - F.col('bal_op') + F.col('wof_cl')) \
            .otherwise(F.lit(0))) \
  .withColumn('gca_m_rd_t', 
            F.when((F.col('type') == 'term') & (F.col('incr_decr') == 'decr') & (F.col('bal_cl') != F.lit(0)), F.col('bal_cl') - F.col('bal_op') + F.col('wof_cl')) \
            .otherwise(F.lit(0))) \
  .withColumn('gca_m_rd_r', 
            F.when(F.col('type') == 'rvlv', F.col('bal_y_rd') - F.col('bal_y_rd_pr')) \
            .otherwise(F.lit(0))) \
  .withColumn('gca_m_oth', (F.col('gca_cl') - F.col('bal_cl')) - (F.col('gca_op') - F.col('bal_op'))) \
  .withColumn('gca_tfr_pre', F.col('gca_op') + F.col('gca_m_dd_r') + F.col('gca_m_dd_t') + F.col('gca_m_rd_t_f') + F.col('gca_m_rd_t') + F.col('gca_m_rd_r')) \
  .withColumn('gca_m_wof', -F.col('wof_cl').cast(DoubleType())) \
  .withColumn('gca_m_tfr_o', 
            F.when((F.col('ctgy_dir') != 'u') & (F.col('pre_post') == 'pre'), -F.col('gca_tfr_pre')) \
            .when((F.col('ctgy_dir') != 'u') & (F.col('pre_post') == 'post'), -F.col('gca_op')) \
            .otherwise(F.lit(0))) \
  .withColumn('gca_m_tfr_i', -F.col('gca_m_tfr_o')) \
  .withColumn('ecl_m_dd_r', F.round(-F.col('cover') * F.col('gca_m_dd_r'), 2)) \
  .withColumn('ecl_m_dd_t', F.round(-F.col('cover') * F.col('gca_m_dd_t'), 2)) \
  .withColumn('ecl_m_rd_t_f', F.round(-F.col('cover') * F.col('gca_m_rd_t_f'), 2)) \
  .withColumn('ecl_m_rd_t', F.round(-F.col('cover') * F.col('gca_m_rd_t'), 2)) \
  .withColumn('ecl_m_rd_r', F.round(-F.col('cover') * F.col('gca_m_rd_r'), 2)) \
  .withColumn('ecl_m_wof', F.col('wof_cl').cast(DoubleType())) \
  .withColumn('ecl_m_prm', 
            F.when((F.col('ctgy_cl') == F.lit(1)) & (F.col('pryr') != F.lit(0)), F.col('ecl_cl') + F.col('pryr')) \
            .when((F.col('ctgy_cl') != F.lit(1)) & (F.col('pryr') != F.lit(0)), F.col('ecl_cl') + F.col('prlt')) \
            .otherwise(F.lit(0))) \
  .withColumn('ecl_m_rem_mig', 
            F.when((F.col('ctgy_dir') != F.lit('u')), F.col('ecl_cl') - F.col('ecl_op') - F.col('ecl_m_dd_r') \
            - F.col('ecl_m_dd_t') - F.col('ecl_m_rd_t_f') - F.col('ecl_m_rd_t') - F.col('ecl_m_rd_r') \
            - F.col('ecl_m_wof') - F.col('ecl_m_prm')) \
            .otherwise(F.lit(0))) \
  .withColumn('ecl_m_rem', 
            F.when((F.col('ctgy_dir') == F.lit('u')), F.col('ecl_cl') - F.col('ecl_op') - F.col('ecl_m_dd_r') \
            - F.col('ecl_m_dd_t') - F.col('ecl_m_rd_t_f') - F.col('ecl_m_rd_t') - F.col('ecl_m_rd_r') \
            - F.col('ecl_m_wof') - F.col('ecl_m_prm')) \
            .otherwise(F.lit(0))) \
  .withColumn('ecl_tfr_pre', F.col('ecl_op') + F.col('ecl_m_dd_r') + F.col('ecl_m_dd_t') + F.col('ecl_m_rd_t_f') + F.col('ecl_m_rd_t') + F.col('ecl_m_rd_r')) \
  .withColumn('ecl_m_tfr_o', 
            F.when((F.col('ctgy_dir') != F.lit('u')) & (F.col('pre_post') == F.lit('pre')), F.col('ecl_tfr_pre')) \
            .when((F.col('ctgy_dir') != F.lit('u')) & (F.col('pre_post') == F.lit('post')), F.col('ecl_op')) \
            .otherwise(F.lit(0))) \
  .withColumn('ecl_m_tfr_i', -F.col('ecl_m_tfr_o')) \
  .na.fill(0)

# Show a sample
ifrs9_wide2[ifrs9_wide2.cust.isin('AAPL', 'ORCL')].show(50)

#  USEFUL

# OPENING BALANCE FROM PRIOR YEAR
# .withColumn('bal_pr', F.when((F.col('date') == min_date[0]), F.col('bal')).otherwise(None)) \

# FILL FORWARD
# .withColumn('bal_pr', F.last('bal_pr', ignorenulls = True).over(W.partitionBy('cust').orderBy('date'))) \

# ALTERNATE FILTERING
# ifrs9_wide2.filter(F.col('cust') == 'AAPL').show(50)
# ifrs9_wide2.where(F.col('cust').isin(['AAPL', 'ORCL'])).show(50)

+----+----------+--------------------+-----+--------+-------------------+----+---+----+----+--------+------+----+----+-------+--------+--------+-------------------+-------+--------+-------+-------+--------+-----------+--------+-----------+--------------------+--------------------+-------+---------+--------+--------+---------+----------+----------+------------+----------+----------+---------+-----------+---------+-----------+-----------+----------+----------+------------+----------+----------+---------+---------+-------------------+-------------------+-------------------+-------------------+------------------+
|cust|      date|                unit|stage|  gca_cl|             ecl_cl|year|ccy|type|poci|  bal_cl|wof_cl|pryr|prlt|ctgy_cl|  gca_op|  bal_op|             ecl_op|ctgy_op|  bal_pr|wof_cum|  bal_y|bal_y_dd|bal_y_dd_pr|bal_y_rd|bal_y_rd_pr|            cover_cl|            cover_op|  cover|incr_decr|ctgy_dir|pre_post|pre_stage|gca_m_dd_r|gca_m_dd_t|gca_m_rd_t_f|gca_m_rd_t|gca_m_rd_r
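
The long `when` chain that derives `ctgy` in the cell above can be hard to read at a glance. Restated in plain Python it reduces to three rules: a write-off forces category 3 (or 5 for POCI), POCI exposures sit in category 4 (or 5 at stage 3), and everything else takes its stage, defaulting to 1. The sketch below is a reading aid only. `closing_category` is a hypothetical name that does not appear in the notebook, and the equivalence with the Spark expression is assumed to hold only when `stage` is 1, 2, 3 or missing and `poci` is 'Y', 'N' or missing.

```python
def closing_category(stage, poci, wof):
    """Return the closing category for one row.

    Assumed inputs: stage is 1, 2, 3 or None; poci is 'Y', 'N' or None;
    wof is a number (0 means no write-off).
    """
    is_poci = poci == "Y"
    if wof != 0:
        # A write-off overrides the stage entirely.
        return 5 if is_poci else 3
    if is_poci:
        # Purchased or originated credit-impaired: 5 at stage 3, else 4.
        return 5 if stage == 3 else 4
    # Non-POCI (or unknown): category follows the stage, defaulting to 1.
    return int(stage) if stage in (2, 3) else 1


# A few spot checks covering each rule.
for args in [(1, "N", 0), (3, "N", 0), (2, "Y", 0), (3, "Y", 0), (1, "N", 96)]:
    print(args, "->", closing_category(*args))
```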

In [0]:
col_list = [
  'date'
  ,'cust'
  ,'ctgy_cl'
  ,'ctgy_op'
  ,'pre_post'
  ,'gca_cl'
  ,'ecl_cl'
  ,'gca_op'
  ,'ecl_op'
  ,'gca_m_dd_r'
  ,'gca_m_dd_t'
  ,'gca_m_rd_t_f'
  ,'gca_m_rd_t'
  ,'gca_m_rd_r'
  ,'gca_m_oth'
  ,'gca_m_wof'
  ,'gca_m_tfr_o'
  ,'gca_m_tfr_i'
  ,'ecl_m_dd_r'
  ,'ecl_m_dd_t'
  ,'ecl_m_rd_t_f'
  ,'ecl_m_rd_t'
  ,'ecl_m_rd_r'
  ,'ecl_m_wof'
  ,'ecl_m_prm'
  ,'ecl_m_rem_mig'
  ,'ecl_m_rem'
  ,'ecl_m_tfr_o'
  ,'ecl_m_tfr_i'
  ]
  
test = ifrs9_wide2[ifrs9_wide2.cust.isin('AAPL', 'ORCL')].na.fill(0).select(col_list)
# You can also do this
# test = ifrs9_wide2[ifrs9_wide2.cust.isin('AAPL', 'ORCL')].na.fill(0).select(ifrs9_wide2.columns[0:5])
# or this
# test = ifrs9_wide2[ifrs9_wide2.cust.isin('AAPL', 'ORCL')].na.fill(0).select('cust', 'date', 'unit', 'stage', 'gca_cl')

test.show()
test.printSchema()

+----------+----+-------+-------+--------+--------+-------------------+--------+-------------------+----------+----------+------------+----------+----------+---------+---------+-----------+-----------+----------+----------+------------+----------+----------+---------+---------+-------------------+-------------------+------------------+-----------------+
|      date|cust|ctgy_cl|ctgy_op|pre_post|  gca_cl|             ecl_cl|  gca_op|             ecl_op|gca_m_dd_r|gca_m_dd_t|gca_m_rd_t_f|gca_m_rd_t|gca_m_rd_r|gca_m_oth|gca_m_wof|gca_m_tfr_o|gca_m_tfr_i|ecl_m_dd_r|ecl_m_dd_t|ecl_m_rd_t_f|ecl_m_rd_t|ecl_m_rd_r|ecl_m_wof|ecl_m_prm|      ecl_m_rem_mig|          ecl_m_rem|       ecl_m_tfr_o|      ecl_m_tfr_i|
+----------+----+-------+-------+--------+--------+-------------------+--------+-------------------+----------+----------+------------+----------+----------+---------+---------+-----------+-----------+----------+----------+------------+----------+----------+---------+---------+----------

Unpivot. Note that all of the columns passed to the `stack` function need to be of the same type.

In [0]:
test.selectExpr("date", "cust", "ctgy_cl", "ctgy_op", "pre_post", \
  "stack(24 \
    ,'gca_cl', gca_cl \
    ,'ecl_cl', ecl_cl \
    ,'gca_op', gca_op \
    ,'ecl_op', ecl_op \
    ,'gca_m_dd_r', gca_m_dd_r \
    ,'gca_m_dd_t', gca_m_dd_t \
    ,'gca_m_rd_t_f', gca_m_rd_t_f \
    ,'gca_m_rd_t', gca_m_rd_t \
    ,'gca_m_rd_r', gca_m_rd_r \
    ,'gca_m_oth', gca_m_oth \
    ,'gca_m_wof', gca_m_wof \
    ,'gca_m_tfr_o', gca_m_tfr_o \
    ,'gca_m_tfr_i', gca_m_tfr_i \
    ,'ecl_m_dd_r', ecl_m_dd_r \
    ,'ecl_m_dd_t', ecl_m_dd_t \
    ,'ecl_m_rd_t_f', ecl_m_rd_t_f \
    ,'ecl_m_rd_t', ecl_m_rd_t \
    ,'ecl_m_rd_r', ecl_m_rd_r \
    ,'ecl_m_wof', ecl_m_wof \
    ,'ecl_m_prm', ecl_m_prm \
    ,'ecl_m_rem_mig', ecl_m_rem_mig \
    ,'ecl_m_rem', ecl_m_rem \
    ,'ecl_m_tfr_o', ecl_m_tfr_o \
    ,'ecl_m_tfr_i', ecl_m_tfr_i \
    ) as (m_ment, tran_ccy)" \
  ) \
  .where("tran_ccy is not null and tran_ccy != 0") \
  .show()

+----------+----+-------+-------+--------+----------+-------------------+
|      date|cust|ctgy_cl|ctgy_op|pre_post|    m_ment|           tran_ccy|
+----------+----+-------+-------+--------+----------+-------------------+
|2016-12-31|AAPL|      1|      0|    post|    gca_cl|            87032.0|
|2016-12-31|AAPL|      1|      0|    post|    ecl_cl|-1612.9299999999923|
|2017-01-31|AAPL|      1|      1|    post|    gca_cl|            87032.0|
|2017-01-31|AAPL|      1|      1|    post|    ecl_cl|-314.12823779193207|
|2017-01-31|AAPL|      1|      1|    post|    gca_op|            87032.0|
|2017-01-31|AAPL|      1|      1|    post|    ecl_op|-1612.9299999999923|
|2017-01-31|AAPL|      1|      1|    post| ecl_m_rem| 1298.8017622080602|
|2017-02-28|AAPL|      1|      1|    post|    gca_cl|            87549.0|
|2017-02-28|AAPL|      1|      1|    post|    ecl_cl| -372.4997093023264|
|2017-02-28|AAPL|      1|      1|    post|    gca_op|            87032.0|
|2017-02-28|AAPL|      1|      1|    p
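
Writing 24 name/column pairs by hand is error-prone, and the same-type requirement is easy to trip over. One possible alternative is to generate the `stack` expression from a list of column names and cast every value to a common type. The sketch below is an assumption-laden illustration, not the notebook's method: `build_stack_expr` is a hypothetical helper, and the choice of `double` as the common type is a guess based on the schema shown earlier.

```python
def build_stack_expr(value_cols, key_name="m_ment", value_name="tran_ccy",
                     cast_to="double"):
    """Build a Spark SQL stack() expression that unpivots value_cols.

    Each column is wrapped in cast(... as <cast_to>) so every stacked
    value has the same type, which stack() requires.
    """
    if not value_cols:
        raise ValueError("value_cols must not be empty")
    pairs = ", ".join(
        "'{0}', cast(`{0}` as {1})".format(col, cast_to) for col in value_cols
    )
    return "stack({n}, {pairs}) as ({key}, {value})".format(
        n=len(value_cols), pairs=pairs, key=key_name, value=value_name
    )


# Example with three of the measure columns.
expr = build_stack_expr(["gca_cl", "ecl_cl", "gca_op"])
print(expr)
```

In this notebook the first five entries of `col_list` are the identifier columns, so something like `test.selectExpr(*col_list[:5], build_stack_expr(col_list[5:]))` should produce the same unpivot as the hand-written version, although that has not been run against the data here.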

Practice unpivoting per [this](https://stackoverflow.com/questions/42465568/unpivot-in-spark-sql-pyspark) Stack Overflow answer.

In [0]:
df = spark.createDataFrame([("G",5,4,2,None),("H",2,3,4,5)],list("AWXYZ"))
df.show()

+---+---+---+---+----+
|  A|  W|  X|  Y|   Z|
+---+---+---+---+----+
|  G|  5|  4|  2|null|
|  H|  2|  3|  4|   5|
+---+---+---+---+----+



In [0]:
df.selectExpr("A", "stack(4, 'W', W, 'X', X, 'Y', Y, 'Z', Z) as (B, C)").where("C is not null").show()

+---+---+---+
|  A|  B|  C|
+---+---+---+
|  G|  W|  5|
|  G|  X|  4|
|  G|  Y|  2|
|  H|  W|  2|
|  H|  X|  3|
|  H|  Y|  4|
|  H|  Z|  5|
+---+---+---+



In [0]:
df.selectExpr("A", "W", "stack(3, 'X', X, 'Y', Y, 'Z', Z) as (B, C)").where("C is not null").show()

+---+---+---+---+
|  A|  W|  B|  C|
+---+---+---+---+
|  G|  5|  X|  4|
|  G|  5|  Y|  2|
|  H|  2|  X|  3|
|  H|  2|  Y|  4|
|  H|  2|  Z|  5|
+---+---+---+---+



In [0]:
df.selectExpr("A", "W", "X", "stack(2, 'Y', Y, 'Z', Z) as (B, C)").where("C is not null").show()

+---+---+---+---+---+
|  A|  W|  X|  B|  C|
+---+---+---+---+---+
|  G|  5|  4|  Y|  2|
|  H|  2|  3|  Y|  4|
|  H|  2|  3|  Z|  5|
+---+---+---+---+---+
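
To make the semantics of these three cells explicit: `stack` emits one output row per name/value pair for every input row, the columns listed before it are carried along unchanged, and the `where` clause then drops pairs whose value is null. A plain-Python analogue, using the same two rows as `df` above, might look like the following. The function name `unpivot` is hypothetical and this is only a sketch of the row-level behaviour, not how Spark executes it.

```python
def unpivot(rows, id_cols, value_cols, key_name="B", value_name="C"):
    """Plain-Python analogue of
    selectExpr(*id_cols, "stack(...) as (B, C)").where("C is not null").
    """
    out = []
    for row in rows:
        for col in value_cols:
            if row[col] is None:
                continue  # mirrors the "C is not null" filter
            new_row = {c: row[c] for c in id_cols}
            new_row[key_name] = col
            new_row[value_name] = row[col]
            out.append(new_row)
    return out


# The same data as the practice dataframe above.
rows = [
    {"A": "G", "W": 5, "X": 4, "Y": 2, "Z": None},
    {"A": "H", "W": 2, "X": 3, "Y": 4, "Z": 5},
]

for r in unpivot(rows, ["A"], ["W", "X", "Y", "Z"]):
    print(r)
```

Moving a column from `value_cols` to `id_cols` reproduces the progressively wider outputs of the second and third cells.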


