# Lab 3

In [None]:
!sudo apt update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://dlcdn.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
!tar xf spark-3.2.1-bin-hadoop3.2.tgz
!pip install -q findspark
!pip install pyspark
!pip install py4j

[33m0% [Working][0m            Hit:1 http://archive.ubuntu.com/ubuntu jammy InRelease
[33m0% [Waiting for headers] [Waiting for headers] [Waiting for headers] [Connected[0m                                                                               Get:2 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [119 kB]
                                                                               Get:3 http://security.ubuntu.com/ubuntu jammy-security InRelease [110 kB]
                                                                               Get:4 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,626 B]
Get:5 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease [1,581 B]
Get:6 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [109 kB]
Hit:7 https://ppa.launchpadcontent.net/c2d4u.team/c2d4u4.0+/ubuntu jammy InRelease
Hit:8 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:9 http

In [None]:
import os
import sys

The following cell imports *SparkSession* and creates a session for our application. The name of the application is *Example*. If a session for this application already exists (from a previous run), that session is retrieved. Otherwise, a new session is created.

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Example').getOrCreate()

Spark offers us simple primitives to read data. The following example shows how to read a CSV file. The content of the file is parsed and columns are identified. The result is a [**DataFrame**](https://spark.apache.org/docs/3.5.1/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.html) which is stored in our variable named **df**.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

data_folder = '/content/drive/MyDrive/BDT Datasets/Lab 4/'

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
index_data_file = data_folder + 'indexData.csv'
index_info_file = data_folder + 'indexInfo.csv'
index_processed_file = data_folder + 'indexProcessed.csv'

df = spark.read.csv(index_data_file)

A *DataFrame* is a **distributed** collection of data. The data is not loaded into memory until it is actually consumed. If we print the variable, it only shows the name and type of each column.

In [None]:
print(df)

DataFrame[_c0: string, _c1: string, _c2: string, _c3: string, _c4: string, _c5: string, _c6: string, _c7: string]
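
The deferred loading described above can be pictured with a plain-Python generator. This is only an analogy and not how Spark is implemented; `read_rows` and `log` are hypothetical names introduced for the illustration.

```python
# Plain-Python analogy of lazy evaluation (not Spark itself).
log = []  # records each line at the moment it is really processed

def read_rows(lines):
    """Yield parsed rows one by one; nothing runs until a consumer asks."""
    for line in lines:
        log.append(line)            # side effect that marks real work
        yield line.split(',')

rows = read_rows(['NYA,1965-12-31', 'NYA,1966-01-03'])
work_before = len(log)              # defining the pipeline did no work yet

first = next(rows)                  # consuming one row triggers the work
work_after = len(log)               # only the consumed row was processed
```

In the same spirit, printing a DataFrame only reports its schema, while an action such as *show* or *count* forces rows to be read.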


We can retrieve part of the data using the **show** function, as presented below.
If no argument is provided, it prints the first 20 rows of the DataFrame. A parameter can be provided to print a specific number of rows.

In [None]:
df.show(5)

+-----+----------+----------+----------+----------+----------+----------+------+
|  _c0|       _c1|       _c2|       _c3|       _c4|       _c5|       _c6|   _c7|
+-----+----------+----------+----------+----------+----------+----------+------+
|Index|      Date|      Open|      High|       Low|     Close| Adj Close|Volume|
|  NYA|1965-12-31|528.690002|528.690002|528.690002|528.690002|528.690002|     0|
|  NYA|1966-01-03|527.210022|527.210022|527.210022|527.210022|527.210022|     0|
|  NYA|1966-01-04|527.840027|527.840027|527.840027|527.840027|527.840027|     0|
|  NYA|1966-01-05|531.119995|531.119995|531.119995|531.119995|531.119995|     0|
+-----+----------+----------+----------+----------+----------+----------+------+
only showing top 5 rows



### Caution
Please notice that the CSV file contains a header, which is now part of the data (the first row is actually the header). To prevent this, we can pass an additional argument to the *read.csv* function, namely **header=True**.
Additionally, we can use the **inferSchema=True** argument so that column types are detected automatically. Otherwise, all columns will be strings.


In [None]:
df = spark.read.csv(index_data_file, header=True, inferSchema=True)

In [None]:
df.show(5)

+-----+----------+----------+----------+----------+----------+----------+------+
|Index|      Date|      Open|      High|       Low|     Close| Adj Close|Volume|
+-----+----------+----------+----------+----------+----------+----------+------+
|  NYA|1965-12-31|528.690002|528.690002|528.690002|528.690002|528.690002|     0|
|  NYA|1966-01-03|527.210022|527.210022|527.210022|527.210022|527.210022|     0|
|  NYA|1966-01-04|527.840027|527.840027|527.840027|527.840027|527.840027|     0|
|  NYA|1966-01-05|531.119995|531.119995|531.119995|531.119995|531.119995|     0|
|  NYA|1966-01-06|532.070007|532.070007|532.070007|532.070007|532.070007|     0|
+-----+----------+----------+----------+----------+----------+----------+------+
only showing top 5 rows
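
To make the **header=True** and **inferSchema=True** options more concrete, here is a minimal standard-library sketch of treating the first row as a header and guessing a type for each column. It is an illustration only: `infer_type` is a hypothetical helper, and Spark's own inference rules are more elaborate. The sample rows are copied from this dataset, including one whose price field holds the literal text `null`.

```python
import csv
import io

# Header line followed by three rows taken from the dataset.
raw = io.StringIO(
    "Index,Date,Open\n"
    "NYA,1965-12-31,528.690002\n"
    "NYA,1966-01-03,527.210022\n"
    "HSI,1988-02-17,null\n"
)

def infer_type(values):
    """Return 'double' if every value parses as a float, else 'string'."""
    try:
        for value in values:
            float(value)
        return 'double'
    except ValueError:
        return 'string'

# DictReader consumes the first line as column names (like header=True).
rows = list(csv.DictReader(raw))

# Guess the type of 'Open' from the clean rows only, then from all rows.
open_type_clean = infer_type([r['Open'] for r in rows[:2]])
open_type_all = infer_type([r['Open'] for r in rows])
```

With only the two clean rows the column is guessed as numeric, but a single non-numeric value such as the text `null` makes this simple rule fall back to string. That is one plausible reason a numeric-looking column can still be reported as a string after inference.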



We can use the *printSchema()* function to check the type of each column, as presented below.

In [None]:
df.printSchema()

root
 |-- Index: string (nullable = true)
 |-- Date: date (nullable = true)
 |-- Open: string (nullable = true)
 |-- High: string (nullable = true)
 |-- Low: string (nullable = true)
 |-- Close: string (nullable = true)
 |-- Adj Close: string (nullable = true)
 |-- Volume: string (nullable = true)



The DataFrame object has an attribute named *columns*, which returns the column names of the dataset as a list.

In [None]:
df.columns

['Index', 'Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']

We can use the *count* function to check the number of rows in our DataFrame.

In [None]:
df.count()

112457

We can use the *describe* function to show statistics related to the different columns:
- count (number of non-null values in this column)
- mean (average of the values from this column)
- stddev (standard deviation of the values in this column)
- min (minimum value from this column)
- max (maximum value from this column)

We need to call the *show* function because the result is a DataFrame itself.

In [None]:
df.describe().show()

+-------+---------+-----------------+------------------+-----------------+-----------------+------------------+--------------------+
|summary|    Index|             Open|              High|              Low|            Close|         Adj Close|              Volume|
+-------+---------+-----------------+------------------+-----------------+-----------------+------------------+--------------------+
|  count|   112457|           112457|            112457|           112457|           112457|            112457|              112457|
|   mean|     NULL|7658.515221546726|7704.3729612772095|7608.000422337706|7657.545871842853|7657.3517293638315|1.2739751626030312E9|
| stddev|     NULL| 9011.47891296602| 9066.638548034824| 8954.50698125186|9011.510443530395| 9011.608899984878| 4.315783120882288E9|
|    min|000001.SS|              100|               100|              100|              100|               100|                   0|
|    max|     TWII|             null|              null|             
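
As a rough cross-check of what these summary rows mean, the same five statistics can be computed with Python's standard library for the first five *Open* values shown earlier. This is a sketch, not Spark's implementation, and it assumes that *stddev* refers to the sample standard deviation (dividing by n - 1).

```python
import statistics

# First five Open values of the NYA index, copied from the table above.
opens = [528.690002, 527.210022, 527.840027, 531.119995, 532.070007]

summary = {
    'count': len(opens),
    'mean': statistics.mean(opens),
    'stddev': statistics.stdev(opens),  # sample standard deviation (n - 1)
    'min': min(opens),
    'max': max(opens),
}

for name, value in summary.items():
    print(f'{name:>6}: {value}')
```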

We can use the *select* function to retrieve one or multiple columns of the DataFrame. The result of the *select* function is another DataFrame.

In [None]:
df.select('Date').show(5)

+----------+
|      Date|
+----------+
|1965-12-31|
|1966-01-03|
|1966-01-04|
|1966-01-05|
|1966-01-06|
+----------+
only showing top 5 rows



In [None]:
df2 = df.select(['Index', 'Date', 'Open'])
df2.show(7)

+-----+----------+----------+
|Index|      Date|      Open|
+-----+----------+----------+
|  NYA|1965-12-31|528.690002|
|  NYA|1966-01-03|527.210022|
|  NYA|1966-01-04|527.840027|
|  NYA|1966-01-05|531.119995|
|  NYA|1966-01-06|532.070007|
|  NYA|1966-01-07|532.599976|
|  NYA|1966-01-10|533.869995|
+-----+----------+----------+
only showing top 7 rows



We can add columns to the DataFrame using the *withColumn* function.
The original DataFrame is not changed; a new DataFrame containing both the original and the new columns is created.

In [None]:
df.withColumn('NewOpen', df['Open']).show(5)

+-----+----------+----------+----------+----------+----------+----------+------+----------+
|Index|      Date|      Open|      High|       Low|     Close| Adj Close|Volume|   NewOpen|
+-----+----------+----------+----------+----------+----------+----------+------+----------+
|  NYA|1965-12-31|528.690002|528.690002|528.690002|528.690002|528.690002|     0|528.690002|
|  NYA|1966-01-03|527.210022|527.210022|527.210022|527.210022|527.210022|     0|527.210022|
|  NYA|1966-01-04|527.840027|527.840027|527.840027|527.840027|527.840027|     0|527.840027|
|  NYA|1966-01-05|531.119995|531.119995|531.119995|531.119995|531.119995|     0|531.119995|
|  NYA|1966-01-06|532.070007|532.070007|532.070007|532.070007|532.070007|     0|532.070007|
+-----+----------+----------+----------+----------+----------+----------+------+----------+
only showing top 5 rows



We can add columns programmatically, using mathematical expressions in conjunction with columns.
The following code adds a column that contains double the value of the Open column. The result is saved in another DataFrame.

Notice the original DataFrame is not affected.

In [None]:
df2 = df.withColumn('DoubleOpen', df['Open']*2)
df2.show(5)
df.show(5)

+-----+----------+----------+----------+----------+----------+----------+------+-----------+
|Index|      Date|      Open|      High|       Low|     Close| Adj Close|Volume| DoubleOpen|
+-----+----------+----------+----------+----------+----------+----------+------+-----------+
|  NYA|1965-12-31|528.690002|528.690002|528.690002|528.690002|528.690002|     0|1057.380004|
|  NYA|1966-01-03|527.210022|527.210022|527.210022|527.210022|527.210022|     0|1054.420044|
|  NYA|1966-01-04|527.840027|527.840027|527.840027|527.840027|527.840027|     0|1055.680054|
|  NYA|1966-01-05|531.119995|531.119995|531.119995|531.119995|531.119995|     0| 1062.23999|
|  NYA|1966-01-06|532.070007|532.070007|532.070007|532.070007|532.070007|     0|1064.140014|
+-----+----------+----------+----------+----------+----------+----------+------+-----------+
only showing top 5 rows

+-----+----------+----------+----------+----------+----------+----------+------+
|Index|      Date|      Open|      High|       Low|     C
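
The "new object, original untouched" behaviour can be sketched in plain Python with a dictionary of columns. This is only an analogy; `with_column` is a hypothetical helper and not part of the Spark API.

```python
def with_column(table, name, values):
    """Return a new table with one extra column; the input is not modified."""
    new_table = dict(table)          # copy the column mapping
    new_table[name] = list(values)   # add the new column to the copy only
    return new_table

# Two Open values copied from the table above.
table = {'Open': [528.690002, 527.210022]}
doubled = with_column(table, 'DoubleOpen', [v * 2 for v in table['Open']])

print(sorted(table))    # the original still has only its own column
print(sorted(doubled))  # the copy has the original and the new column
```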

We can use the *withColumnRenamed* function to rename a column. This will create a new DataFrame and the original is not affected.

In [None]:
df3 = df2.withColumnRenamed('DoubleOpen', 'DOpen')
df3.show(5)
df2.show(5)

+-----+----------+----------+----------+----------+----------+----------+------+-----------+
|Index|      Date|      Open|      High|       Low|     Close| Adj Close|Volume|      DOpen|
+-----+----------+----------+----------+----------+----------+----------+------+-----------+
|  NYA|1965-12-31|528.690002|528.690002|528.690002|528.690002|528.690002|     0|1057.380004|
|  NYA|1966-01-03|527.210022|527.210022|527.210022|527.210022|527.210022|     0|1054.420044|
|  NYA|1966-01-04|527.840027|527.840027|527.840027|527.840027|527.840027|     0|1055.680054|
|  NYA|1966-01-05|531.119995|531.119995|531.119995|531.119995|531.119995|     0| 1062.23999|
|  NYA|1966-01-06|532.070007|532.070007|532.070007|532.070007|532.070007|     0|1064.140014|
+-----+----------+----------+----------+----------+----------+----------+------+-----------+
only showing top 5 rows

+-----+----------+----------+----------+----------+----------+----------+------+-----------+
|Index|      Date|      Open|      High|     

# Filters
We can use the *filter* function to retain only the rows that satisfy the condition in the filter.
The condition can be written as a string in which column names are combined with comparison operators, as in the example below.

In [None]:
filt_df = df.filter('Open > 600')
filt_df.count()

100726

In [None]:
df4 = df.filter('Open > Close')
df4.count()

43373

We can use an alternative notation for more complex filters. The example below takes the Column object *df['Open']* and compares it with 300.

In [None]:
df.filter(df['Open'] < 300).show(5)

+-----+----------+----------+----------+----------+----------+----------+------+
|Index|      Date|      Open|      High|       Low|     Close| Adj Close|Volume|
+-----+----------+----------+----------+----------+----------+----------+------+
| IXIC|1971-02-05|       100|       100|       100|       100|       100|     0|
| IXIC|1971-02-08|100.839996|100.839996|100.839996|100.839996|100.839996|     0|
| IXIC|1971-02-09|100.760002|100.760002|100.760002|100.760002|100.760002|     0|
| IXIC|1971-02-10|100.690002|100.690002|100.690002|100.690002|100.690002|     0|
| IXIC|1971-02-11|101.449997|101.449997|101.449997|101.449997|101.449997|     0|
+-----+----------+----------+----------+----------+----------+----------+------+
only showing top 5 rows



### Caution
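Logical conjunctions must use the **&** operator (and disjunctions the **|** operator), not Python's *and* / *or* keywords. Each comparison must also be wrapped in parentheses, because **&** binds more tightly than the comparison operators. An example is shown below.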

In [None]:
df.filter((df['Open'] > 500) & (df['Close'] < 500)).show(5)

+-----+----------+----------+----------+----------+----------+----------+------+
|Index|      Date|      Open|      High|       Low|     Close| Adj Close|Volume|
+-----+----------+----------+----------+----------+----------+----------+------+
| KS11|1997-10-28|522.390015|522.390015|492.299988|495.279999|495.279999| 45600|
| KS11|1997-11-17|517.650024|526.119995|495.410004|496.980011|496.980011| 44700|
| KS11|1997-11-20|504.940002|505.709991|475.859985|488.410004|488.410004| 56900|
| KS11|1998-01-16| 512.27002|532.429993|485.829987|488.100006|488.100006|200100|
| KS11|1998-01-22| 503.98999|510.869995| 483.98999| 483.98999| 483.98999|110800|
+-----+----------+----------+----------+----------+----------+----------+------+
only showing top 5 rows
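
The need for parentheses comes from Python's operator precedence, which can be checked with ordinary integers. The values below are hypothetical and chosen only so that the two spellings disagree.

```python
open_price, close_price = 450, 400

# Intended meaning: evaluate both comparisons first, then combine them.
intended = (open_price > 500) & (close_price < 500)

# Without parentheses, & is applied first, so Python reads this as the
# chained comparison: open_price > (500 & close_price) < 500
unparenthesized = open_price > 500 & close_price < 500

print(intended, unparenthesized)
```

The same precedence rules apply when the operands are DataFrame columns, which is why each condition in the filter is written inside its own pair of parentheses.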



## Grouping Data
We can group data based on a column. This returns a *GroupedData* object rather than a DataFrame.

In [None]:
df.groupBy('Index')

GroupedData[grouping expressions: [Index], value: [Index: string, Date: date ... 6 more fields], type: GroupBy]

The *GroupedData* object can be used to call aggregator functions such as *mean*, *min*, *max*, *sum*, etc.
However, on our DataFrame the result contains only the grouping column, because the price columns are of type string and these aggregators apply only to numeric columns.

In [None]:
df.printSchema()
df.groupBy('Index').mean().show()

root
 |-- Index: string (nullable = true)
 |-- Date: date (nullable = true)
 |-- Open: string (nullable = true)
 |-- High: string (nullable = true)
 |-- Low: string (nullable = true)
 |-- Close: string (nullable = true)
 |-- Adj Close: string (nullable = true)
 |-- Volume: string (nullable = true)

+---------+
|    Index|
+---------+
|     NSEI|
|   GSPTSE|
|      NYA|
|      HSI|
|399001.SZ|
|     IXIC|
|000001.SS|
|     KS11|
|     SSMI|
|    GDAXI|
|     TWII|
|     N225|
|     N100|
|  J203.JO|
+---------+



# Casting columns
We can cast our columns using the *cast* function. We need to import the corresponding Spark type.

In [None]:
from pyspark.sql.types import DoubleType, IntegerType, BooleanType, DateType

df2 = df.withColumn('Open', df['Open'].cast(DoubleType()))
df2 = df2.withColumn('Close', df['Close'].cast(DoubleType()))

df2.printSchema()
df2.groupBy('Index').mean().show()

root
 |-- Index: string (nullable = true)
 |-- Date: date (nullable = true)
 |-- Open: double (nullable = true)
 |-- High: string (nullable = true)
 |-- Low: string (nullable = true)
 |-- Close: double (nullable = true)
 |-- Adj Close: string (nullable = true)
 |-- Volume: string (nullable = true)

+---------+------------------+------------------+
|    Index|         avg(Open)|        avg(Close)|
+---------+------------------+------------------+
|     NSEI| 7665.751272509498| 7660.047238088113|
|   GSPTSE|8091.1065434325765|8090.0663048776605|
|      NYA|4451.7781505843395| 4452.174710990798|
|      HSI|15206.355607330794| 15200.60562929962|
|399001.SZ| 7968.340420628268| 7973.831004994244|
|     IXIC|1985.0269605701312|1984.9067945188267|
|000001.SS| 2381.208888480223| 2383.069134972016|
|     KS11|1489.8147043765202| 1489.145146681539|
|     SSMI| 6410.012761798593| 6409.733077902093|
|    GDAXI| 5915.549219565168| 5914.846717036025|
|     TWII| 8037.662437671959| 8028.866141293032|
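
The effect of casting before aggregating can be sketched in plain Python. The sketch assumes that text which is not a number becomes a missing value after the cast, and that the average skips missing values; `to_double` is a hypothetical helper, and Spark's exact behaviour may depend on its version and configuration.

```python
def to_double(text):
    """Return float(text), or None when the text is not a number."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None

# Two Open values from the dataset plus the literal text 'null'.
raw_open = ['528.690002', '527.210022', 'null']

cast_open = [to_double(v) for v in raw_open]       # strings -> numbers/None
valid = [v for v in cast_open if v is not None]    # drop missing values
mean_open = sum(valid) / len(valid)                # average of valid values

print(cast_open, mean_open)
```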


Now we are able to call aggregator functions, which only work on numeric columns.

In [None]:
df2.groupBy('Index').min().show()

+---------+-----------+-----------+
|    Index|  min(Open)| min(Close)|
+---------+-----------+-----------+
|     NSEI|2553.600098|2524.199951|
|   GSPTSE|     1352.0|1346.400024|
|      NYA| 347.769989| 347.769989|
|      HSI|     1950.5|1894.900024|
|399001.SZ| 2533.27002|2534.719971|
|     IXIC|  54.869999|  54.869999|
|000001.SS|1007.901001|1011.499023|
|     KS11| 283.410004|      280.0|
|     SSMI|1288.699951|1287.599976|
|    GDAXI|      936.0|      936.0|
|     TWII|3475.870117| 3446.26001|
|     N225| 1020.48999| 1020.48999|
|     N100| 427.600006| 419.950012|
|  J203.JO|32887.44922|32887.44922|
+---------+-----------+-----------+



In [None]:
df2.groupBy('Index').sum().show()

+---------+--------------------+--------------------+
|    Index|           sum(Open)|          sum(Close)|
+---------+--------------------+--------------------+
|     NSEI| 2.564960375781678E7|2.5630518058642827E7|
|   GSPTSE|  8.51669874761713E7| 8.515603792514226E7|
|      NYA|6.2088949866199784E7| 6.209448069418867E7|
|      HSI| 1.291323718174531E8|1.2908354300401238E8|
|399001.SZ| 4.589764082281882E7| 4.592926658876684E7|
|     IXIC|2.5189992129634965E7|2.5188467222443912E7|
|000001.SS| 1.378958067318897E7|1.3800353360622944E7|
|     KS11|    8982092.85268604|   8978056.089342998|
|     SSMI|4.9171207895757005E7|4.9169062440586954E7|
|    GDAXI| 4.991540431469089E7| 4.990947659834998E7|
|     TWII| 4.717304084669673E7|4.7121415383248806E7|
|     N225| 1.783126192764188E8|1.7826917919808513E8|
|     N100|   4501058.618828013|   4500469.458731014|
|  J203.JO|1.1940372837282091E8|1.1943707898611091E8|
+---------+--------------------+--------------------+



In [None]:
df2.groupBy('Index').agg({'Open':'min', 'Close':'max'}).show()

+---------+-----------+-----------+
|    Index| max(Close)|  min(Open)|
+---------+-----------+-----------+
|     NSEI|15582.79981|2553.600098|
|   GSPTSE|19852.19922|     1352.0|
|      NYA|16590.42969| 347.769989|
|      HSI|33154.12109|     1950.5|
|399001.SZ|19531.15039| 2533.27002|
|     IXIC|14138.78027|  54.869999|
|000001.SS|6092.057129|1007.901001|
|     KS11|3249.300049| 283.410004|
|     SSMI|11426.15039|1288.699951|
|    GDAXI|15519.98047|      936.0|
|     TWII|17595.90039|3475.870117|
|     N225|38915.87109| 1020.48999|
|     N100|1263.619995| 427.600006|
|  J203.JO| 68775.0625|32887.44922|
+---------+-----------+-----------+
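
What a per-column aggregation such as `{'Open': 'min', 'Close': 'max'}` computes can be sketched with a plain dictionary of groups. This is an illustration of the idea, not of Spark's execution; the four sample rows are copied from tables shown earlier.

```python
from collections import defaultdict

# (Index, Open, Close) rows copied from earlier outputs.
rows = [
    ('NYA', 528.690002, 528.690002),
    ('NYA', 527.210022, 527.210022),
    ('KS11', 522.390015, 495.279999),
    ('KS11', 517.650024, 496.980011),
]

# Step 1: collect the rows of each group under its key.
groups = defaultdict(list)
for index, open_price, close_price in rows:
    groups[index].append((open_price, close_price))

# Step 2: apply a different aggregate to each column within every group.
result = {
    index: {
        'min(Open)': min(o for o, _ in values),
        'max(Close)': max(c for _, c in values),
    }
    for index, values in groups.items()
}

for index, stats in result.items():
    print(index, stats)
```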



In [None]:
df.orderBy('Open').show()

+-----+----------+----------+----------+----------+----------+----------+------+
|Index|      Date|      Open|      High|       Low|     Close| Adj Close|Volume|
+-----+----------+----------+----------+----------+----------+----------+------+
| IXIC|1971-02-05|       100|       100|       100|       100|       100|     0|
| IXIC|1973-07-09|100.010002|100.010002|100.010002|100.010002|100.010002|     0|
| IXIC|1977-08-03|100.019997|100.019997|100.019997|100.019997|100.019997|     0|
| IXIC|1977-08-31|100.099998|100.099998|100.099998|100.099998|100.099998|     0|
| IXIC|1977-08-30|100.110001|100.110001|100.110001|100.110001|100.110001|     0|
| IXIC|1977-07-05|100.120003|100.120003|100.120003|100.120003|100.120003|     0|
| IXIC|1977-08-25|100.139999|100.139999|100.139999|100.139999|100.139999|     0|
| IXIC|1973-07-02|100.150002|100.150002|100.150002|100.150002|100.150002|     0|
| IXIC|1978-01-09|100.199997|100.199997|100.199997|100.199997|100.199997|     0|
| IXIC|1973-06-27|100.220001

In [None]:
df.orderBy(df['Open'].desc()).show()

+-----+----------+----+----+----+-----+---------+------+
|Index|      Date|Open|High| Low|Close|Adj Close|Volume|
+-----+----------+----+----+----+-----+---------+------+
| TWII|1997-09-29|null|null|null| null|     null|  null|
|  HSI|1988-02-17|null|null|null| null|     null|  null|
| TWII|1997-10-10|null|null|null| null|     null|  null|
|  HSI|1989-06-19|null|null|null| null|     null|  null|
| TWII|1997-10-31|null|null|null| null|     null|  null|
|  HSI|1988-02-18|null|null|null| null|     null|  null|
| TWII|1998-07-01|null|null|null| null|     null|  null|
|  HSI|1987-12-25|null|null|null| null|     null|  null|
| TWII|1997-11-12|null|null|null| null|     null|  null|
|  HSI|1988-02-19|null|null|null| null|     null|  null|
| SSMI|2021-01-04|null|null|null| null|     null|  null|
|  HSI|1988-04-01|null|null|null| null|     null|  null|
| TWII|1997-12-25|null|null|null| null|     null|  null|
|  HSI|1989-06-08|null|null|null| null|     null|  null|
| SSMI|2021-02-15|null|null|nul
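
Both orderings above are applied to *df*, whose *Open* column is still a string, so values are compared character by character rather than numerically. The plain-Python sketch below shows the difference; `'1000.5'` is a hypothetical value added to make the effect visible, and the other values come from the dataset.

```python
values = ['528.690002', '100', 'null', '100.010002', '1000.5']

# String ordering: compared character by character.
text_ascending = sorted(values)
text_descending = sorted(values, reverse=True)

# Numeric ordering: drop the non-numeric text, then compare as floats.
numeric_ascending = sorted(float(v) for v in values if v != 'null')

print(text_ascending)     # '1000.5' sorts before '528.690002' as text
print(text_descending)    # 'null' comes first because letters sort after digits
print(numeric_ascending)  # 1000.5 is the largest number
```

Sorting a column that has been cast to a numeric type, such as *Open* in *df2*, avoids this effect.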

In [None]:
df2.groupBy(df2['Index']).mean().orderBy('avg(Open)').show()

+---------+------------------+------------------+
|    Index|         avg(Open)|        avg(Close)|
+---------+------------------+------------------+
|     N100| 822.2613479773497| 822.1537191689832|
|     KS11|1489.8147043765202| 1489.145146681539|
|     IXIC|1985.0269605701312|1984.9067945188267|
|000001.SS| 2381.208888480223| 2383.069134972016|
|      NYA|4451.7781505843395| 4452.174710990798|
|    GDAXI| 5915.549219565168| 5914.846717036025|
|     SSMI| 6410.012761798593| 6409.733077902093|
|     NSEI| 7665.751272509498| 7660.047238088113|
|399001.SZ| 7968.340420628268| 7973.831004994244|
|     TWII| 8037.662437671959| 8028.866141293032|
|   GSPTSE|8091.1065434325765|8090.0663048776605|
|     N225|12852.286238750094| 12849.15519663292|
|      HSI|15206.355607330794| 15200.60562929962|
|  J203.JO|50896.729911688366|50910.945859382315|
+---------+------------------+------------------+



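We can use the *drop* function to remove one or more columns. As before, this returns a new DataFrame and the original is not affected.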

In [None]:
df.drop('Open', 'Close').show()

+-----+----------+----------+----------+----------+------+
|Index|      Date|      High|       Low| Adj Close|Volume|
+-----+----------+----------+----------+----------+------+
|  NYA|1965-12-31|528.690002|528.690002|528.690002|     0|
|  NYA|1966-01-03|527.210022|527.210022|527.210022|     0|
|  NYA|1966-01-04|527.840027|527.840027|527.840027|     0|
|  NYA|1966-01-05|531.119995|531.119995|531.119995|     0|
|  NYA|1966-01-06|532.070007|532.070007|532.070007|     0|
|  NYA|1966-01-07|532.599976|532.599976|532.599976|     0|
|  NYA|1966-01-10|533.869995|533.869995|533.869995|     0|
|  NYA|1966-01-11|534.289978|534.289978|534.289978|     0|
|  NYA|1966-01-12|533.340027|533.340027|533.340027|     0|
|  NYA|1966-01-13|534.400024|534.400024|534.400024|     0|
|  NYA|1966-01-14|535.450012|535.450012|535.450012|     0|
|  NYA|1966-01-17|537.460022|537.460022|537.460022|     0|
|  NYA|1966-01-18|538.940002|538.940002|538.940002|     0|
|  NYA|1966-01-19|537.669983|537.669983|537.669983|     