<a href="https://colab.research.google.com/github/Brent-Morrison/Misc_scripts/blob/master/PySpark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Using PySpark in Google Colab**

Installations


In [0]:
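# Java 8 is the runtime Spark 2.4.x expects
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# Download and unpack Spark 2.4.5 (if this mirror no longer carries the release,
# archive.apache.org/dist/spark/ keeps older versions)
!wget -q https://www-eu.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
!tar xf spark-2.4.5-bin-hadoop2.7.tgz
# findspark locates the unpacked Spark; pyarrow is needed to read feather files
!pip install -q findspark
!pip install pyspark
!pip install pyarrow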



Set the environment variables so that Colab can find Spark

In [0]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.5-bin-hadoop2.7"
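
If either path is wrong, the failure tends to surface only later, when the Spark session is created. A minimal sketch of an early check, assuming the two variables set above; `missing_spark_paths` is a hypothetical helper, not part of Spark or findspark:

```python
import os


def missing_spark_paths(env=os.environ, names=("JAVA_HOME", "SPARK_HOME")):
    """Return {name: value} for variables that are unset or not a directory.

    Hypothetical helper for illustration; not part of Spark or findspark.
    """
    problems = {}
    for name in names:
        value = env.get(name)
        if value is None or not os.path.isdir(value):
            problems[name] = value
    return problems


# An empty dict means both variables point at existing directories.
print(missing_spark_paths())
```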

Add pyspark to sys.path

PySpark isn't on sys.path by default, but that doesn't mean it can't be used as a regular library. You can address this by either symlinking pyspark into your site-packages, or adding pyspark to sys.path at runtime. [findspark](https://github.com/minrk/findspark) does the latter.
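
The runtime approach amounts to putting Spark's bundled Python sources on `sys.path` before importing `pyspark`. A rough sketch of that idea, assuming the usual layout of a Spark distribution (a `python/` directory and a `py4j-*.zip` under `python/lib/`); `add_pyspark_to_path` is a hypothetical name, and this is an approximation rather than findspark's actual code:

```python
import glob
import os
import sys


def add_pyspark_to_path(spark_home, path=sys.path):
    """Prepend Spark's Python sources (and py4j, if found) to `path`.

    Hypothetical helper; assumes the layout <spark_home>/python and
    <spark_home>/python/lib/py4j-*.zip.
    """
    spark_python = os.path.join(spark_home, "python")
    entries = [spark_python]
    # py4j ships as a zip inside the distribution; its version varies by release.
    entries += sorted(glob.glob(os.path.join(spark_python, "lib", "py4j-*.zip")))
    # Insert in reverse so the entries end up at the front in their listed order.
    for entry in reversed(entries):
        if entry not in path:
            path.insert(0, entry)
    return entries


spark_home = os.environ.get("SPARK_HOME")
if spark_home:
    print(add_pyspark_to_path(spark_home))
```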

In [0]:
import findspark
findspark.init()

Create the Spark session

In [0]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

Test that everything is working by creating a small DataFrame

In [0]:
test_df = spark.createDataFrame([{"hello": "world"} for x in range(100)])

test_df.printSchema()
test_df.show(3)

root
 |-- hello: string (nullable = true)

+-----+
|hello|
+-----+
|world|
|world|
|world|
+-----+
only showing top 3 rows





Get a feather file from GitHub.

The intention was to implement this using PyArrow; however, [it looks like](https://issues.apache.org/jira/browse/ARROW-6998?filter=-6) that won't happen.

[This](https://github.com/pandas-dev/pandas/issues/29055) workaround seems to be the way to go.

In [0]:
# Not working
# import pyarrow as pa
# fin_data = pa.feather.read_feather('https://github.com/Brent-Morrison/hugo_website/blob/master/content/post/ifrs9_part2.feather')

# Working: download the file into memory, then read it with pandas
import pandas as pd
import requests
import io
resp = requests.get('https://github.com/Brent-Morrison/hugo_website/raw/master/content/post/ifrs9_part2.feather', stream=True)
resp.raw.decode_content = True  # have urllib3 undo any gzip/deflate content encoding
mem_fh = io.BytesIO(resp.raw.read())
fin_data = pd.read_feather(mem_fh)
fin_data.head()

   Ticker     me.date                  clust.name  RiskStage  TotalDebt        ECL
0       A  2016-12-31   rev.vol_oa.ta_td.ta_da.ta        1.0     1887.0  11.691196
1       A  2017-01-31  np.ta_td.ta_nca.ta_rev.vol        1.0     1887.0  32.852229
2       A  2017-02-28  np.ta_td.ta_nca.ta_rev.vol        1.0     1887.0  34.309091
3       A  2017-03-31  np.ta_td.ta_nca.ta_rev.vol        1.0     2043.0  38.319819
4       A  2017-04-30  np.ta_td.ta_nca.ta_rev.vol        1.0     2043.0  33.507803
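
One difference between the two snippets in the cell above is the URL: a GitHub `/blob/` address returns the HTML page that displays the file, while `/raw/` returns the file's bytes. A small sketch of that rewrite, assuming the standard `github.com/<user>/<repo>/blob/<branch>/<path>` layout; `github_raw_url` is a hypothetical helper:

```python
from urllib.parse import urlsplit, urlunsplit


def github_raw_url(url):
    """Turn a github.com '/blob/' URL into its '/raw/' equivalent.

    Hypothetical helper; URLs that do not match the layout are returned unchanged.
    """
    parts = urlsplit(url)
    segments = parts.path.split("/")
    # Expected path: /<user>/<repo>/blob/<branch>/<path...>
    if parts.netloc == "github.com" and len(segments) > 3 and segments[3] == "blob":
        segments[3] = "raw"
        parts = parts._replace(path="/".join(segments))
    return urlunsplit(parts)


blob = "https://github.com/Brent-Morrison/hugo_website/blob/master/content/post/ifrs9_part2.feather"
print(github_raw_url(blob))
```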


Check the data types

In [0]:
fin_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24236 entries, 0 to 24235
Data columns (total 6 columns):
Ticker        24236 non-null object
me.date       24236 non-null object
clust.name    24236 non-null object
RiskStage     24236 non-null float64
TotalDebt     24236 non-null float64
ECL           24236 non-null float64
dtypes: float64(3), object(3)
memory usage: 1.1+ MB


The date column 'me.date' has been imported as an object instead of a date. This needs to be changed before converting to a PySpark dataframe.

In [0]:
fin_data['me.date'] = pd.to_datetime(fin_data['me.date']) 
fin_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24236 entries, 0 to 24235
Data columns (total 6 columns):
Ticker        24236 non-null object
me.date       24236 non-null datetime64[ns]
clust.name    24236 non-null object
RiskStage     24236 non-null float64
TotalDebt     24236 non-null float64
ECL           24236 non-null float64
dtypes: datetime64[ns](1), float64(3), object(2)
memory usage: 1.1+ MB


We now convert to a Spark dataframe

In [0]:
fin_data_spark = spark.createDataFrame(fin_data)

fin_data_spark.printSchema()

root
 |-- Ticker: string (nullable = true)
 |-- me.date: timestamp (nullable = true)
 |-- clust.name: string (nullable = true)
 |-- RiskStage: double (nullable = true)
 |-- TotalDebt: double (nullable = true)
 |-- ECL: double (nullable = true)



Rename columns

In [0]:
fin_data_spark.withColumnRenamed('Ticker','cust') \
  .withColumnRenamed('clust.name','unit') \
  .withColumnRenamed('RiskStage','stage') \
  .withColumnRenamed('TotalDebt','gca') \
  .withColumnRenamed('ECL','ecl').printSchema()

root
 |-- cust: string (nullable = true)
 |-- me.date: timestamp (nullable = true)
 |-- unit: string (nullable = true)
 |-- stage: double (nullable = true)
 |-- gca: double (nullable = true)
 |-- ecl: double (nullable = true)



In [0]:
from pyspark.sql import functions as F
from pyspark.sql import Window as W

fin_data_spark.withColumnRenamed('me.date','date') \
  .withColumnRenamed('Ticker','cust') \
  .withColumnRenamed('clust.name','unit') \
  .withColumnRenamed('RiskStage','stage') \
  .withColumnRenamed('TotalDebt','gca') \
  .withColumnRenamed('ECL','ecl') \
  .withColumn('ccy', F.when(F.col('cust').startswith('G'), 'GBP').otherwise('USD')) \
  .withColumn('type', F.when(F.col('cust').startswith('R'), 'rvlv').otherwise('term')) \
  .withColumn('poci', F.when(F.col('cust').startswith('P'), 'Y').otherwise('N')) \
  .withColumn('bal', F.col('gca')) \
  .withColumn('ecl', F.col('ecl') * -1) \
  .withColumn('wof', F.lit(0)) \
  .withColumn('pryr', F.lit(0)) \
  .withColumn('prlt', F.lit(0)) \
  .withColumn('ctgy', 
              F.when((F.col('wof') != F.lit(0)) & (F.col('poci') == 'N'), F.lit(3)) \
              .when((F.col('wof') != F.lit(0)) & (F.col('poci') == 'Y'), F.lit(5)) \
              .when((F.col('wof') != F.lit(0)) & (F.isnull(F.col('poci'))), F.lit(3)) \
              .when((F.col('stage') == F.lit(0)) & (F.col('poci') == 'N'), F.lit(1)) \
              .when((F.col('stage') == F.lit(1)) & (F.col('poci') == 'N'), F.lit(2)) \
              .when((F.col('stage') == F.lit(2)) & (F.col('poci') == 'N'), F.lit(3)) \
              .when((F.isnull(F.col('stage'))) & (F.col('poci') == 'N'), F.lit(1)) \
              .when((F.col('stage') == F.lit(0)) & (F.col('poci') == 'Y'), F.lit(4)) \
              .when((F.col('stage') == F.lit(1)) & (F.col('poci') == 'Y'), F.lit(4)) \
              .when((F.col('stage') == F.lit(2)) & (F.col('poci') == 'Y'), F.lit(5)) \
              .when((F.isnull(F.col('stage'))) & (F.col('poci') == 'Y'), F.lit(4)) \
              .when((F.col('stage') == F.lit(1)) & (F.isnull(F.col('poci'))), F.lit(1)) \
              .when((F.col('stage') == F.lit(2)) & (F.isnull(F.col('poci'))), F.lit(2)) \
              .when((F.col('stage') == F.lit(3)) & (F.isnull(F.col('poci'))), F.lit(3)) \
              .otherwise(F.lit(1))) \
  .withColumn('gca_op', F.lag('gca', 1).over(W.partitionBy('cust').orderBy('date'))) \
  .withColumn('bal_op', F.lag('bal', 1).over(W.partitionBy('cust').orderBy('date'))) \
  .withColumn('ecl_op', F.lag('ecl', 1).over(W.partitionBy('cust').orderBy('date'))) \
  .filter(F.col('ctgy') == F.lit(4)).head(10)
  # lag over a partitioned, ordered window:
  # https://stackoverflow.com/questions/50113504/how-to-use-spark-lag-and-lead-over-group-by-and-order-by

[Row(cust='PKI', date=datetime.datetime(2016, 12, 31, 0, 0), unit='oa.ta_nca.ta_rev.ta_da.ta', stage=1.0, gca=1012.885, ecl=-5.559857880434761, ccy='USD', type='term', poci='Y', bal=1012.885, wof=0, pryr=0, prlt=0, ctgy=4, gca_op=None, bal_op=None, ecl_op=None),
 Row(cust='PKI', date=datetime.datetime(2017, 1, 31, 0, 0), unit='oa.ta_nca.ta_td.ta_da.ta', stage=1.0, gca=1012.885, ecl=-6.155803211252653, ccy='USD', type='term', poci='Y', bal=1012.885, wof=0, pryr=0, prlt=0, ctgy=4, gca_op=1012.885, bal_op=1012.885, ecl_op=-5.559857880434761),
 Row(cust='PKI', date=datetime.datetime(2017, 2, 28, 0, 0), unit='oa.ta_nca.ta_td.ta_da.ta', stage=1.0, gca=1012.885, ecl=-13.972673625792892, ccy='USD', type='term', poci='Y', bal=1012.885, wof=0, pryr=0, prlt=0, ctgy=4, gca_op=1012.885, bal_op=1012.885, ecl_op=-6.155803211252653),
 Row(cust='PKI', date=datetime.datetime(2017, 3, 31, 0, 0), unit='oa.ta_nca.ta_td.ta_da.ta', stage=1.0, gca=1012.885, ecl=-16.038242826780035, ccy='USD', type='term', poc
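
The `ctgy` rules and the `*_op` lag columns are easier to check outside Spark. A plain-Python sketch of both, assuming `None` stands in for a Spark null and that the first matching `when` wins; `ctgy` and `lag_by_group` are hypothetical helpers that mirror the cell above, not Spark APIs:

```python
def ctgy(stage, poci, wof):
    """Mirror of the `ctgy` when-chain; the final return mirrors .otherwise(1)."""
    if wof is not None and wof != 0:
        if poci == "N" or poci is None:
            return 3
        if poci == "Y":
            return 5
    if poci == "N":
        return {0: 1, 1: 2, 2: 3, None: 1}.get(stage, 1)
    if poci == "Y":
        return {0: 4, 1: 4, 2: 5, None: 4}.get(stage, 1)
    if poci is None:
        return {1: 1, 2: 2, 3: 3}.get(stage, 1)
    return 1


def lag_by_group(rows, key, order, col):
    """Add `<col>_op`: the previous value of `col` within each `key` group."""
    previous = {}
    out = []
    for row in sorted(rows, key=lambda r: (r[key], r[order])):
        # The first row of each group has no predecessor, so it gets None.
        out.append({**row, col + "_op": previous.get(row[key])})
        previous[row[key]] = row[col]
    return out


# Two rows taken from the output above.
rows = [
    {"cust": "PKI", "date": "2017-01-31", "ecl": -6.155803211252653},
    {"cust": "PKI", "date": "2016-12-31", "ecl": -5.559857880434761},
]
print(ctgy(1.0, "Y", 0))
print(lag_by_group(rows, "cust", "date", "ecl"))
```

For the PKI rows shown, `ctgy(1.0, "Y", 0)` gives 4, and the first row's `ecl_op` is `None` while the second carries the prior month's `ecl`, in line with the Spark output.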

**References**

Setup

https://colab.research.google.com/github/asifahmed90/pyspark-ML-in-Colab/blob/master/PySpark_Regression_Analysis.ipynb

https://support.treasuredata.com/hc/en-us/articles/360034951753-TD-Python-Spark-Driver-with-Google-Colab

https://medium.com/@sushantgautam_930/apache-spark-in-google-collaboratory-in-3-steps-e0acbba654e6

https://mikestaszel.com/2018/03/07/apache-spark-on-google-colaboratory/

https://mc.ai/practical-data-science-with-apache-spark%E2%80%8A-%E2%80%8Apart-1/

https://github.com/verakai/DS/blob/master/flight_delays.ipynb

https://gist.github.com/ryansmccoy/09b285525789bb355a15249aaeab7498

https://sparkbyexamples.com/