
In [None]:
# Filiz-Yıldız-Part1-Question5

"""
Dataset:
DollarDataset.txt

Goal:
Find the top 5 days with the highest percentage increase in dollar value.

My Approach:
I solved this question in two different ways:

Method 1 (as requested): Using zipWithIndex (RDD-based):
I added an index to each row in the RDD using zipWithIndex().
Then, I paired each value with its previous one by shifting the index, combining both RDDs with union(), grouping by the index, and calculating the percentage change.
This method allowed me to simulate a "lag" operation using only RDD transformations.

Method 2 (extra): Using the lag() window function (DataFrame-based):
I also solved the same problem using Spark DataFrames by applying the lag() window function.
This gave me direct access to the previous row, and I used it to compute percentage increases more easily.

Result:
After calculating percentage changes in both approaches, I sorted the results in descending order
and selected the top 5 days with the highest percentage increases.

This question helped me practice both RDD and DataFrame techniques, and compare their flexibility for time-based analysis.
"""

# Connect colab to my drive account to fetch the dataset stored there
from google.colab import drive
drive.mount('/content/drive')

# Print the names of all files in the folder (optional)
import os
folder_path = "/content/drive/My Drive/datasets"
files = os.listdir(folder_path)
print(files)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
['2.txt', 'Capitals.txt', 'EartquakeData-07032025.txt', 'DollarDataset.txt', 'couples.txt', 'join-actors.txt', 'points-null-values.txt', 'numbers-test.txt', 'join-series.txt', 'points.txt', 'names.txt', 'Lottery.txt', 'JamesJoyce-Ulyses.txt', 'world.txt', 'points-places.txt', 'Iris.csv', 'ml-latest-small']


In [None]:
from pyspark.sql import SparkSession  # Import SparkSession to create and manage Spark applications

# Create a Spark session with the given application name, or reuse one that is already running
# This session is the entry point to the DataFrame and SQL functionality in PySpark
spark = SparkSession.builder.appName("Part1-Question5-DollarDataset").getOrCreate()


In [None]:
# I solved the problem using 'zipWithIndex' first

In [None]:
# Read the file as an RDD (Resilient Distributed Dataset) using Spark
# The textFile() method reads the text file and loads it as an RDD, where each line is an element in the RDD
rdd = spark.sparkContext.textFile("/content/drive/My Drive/datasets/DollarDataset.txt")

# Take the first 5 lines of the RDD to inspect the data
# The 'take(5)' function retrieves the first 5 elements (lines) from the RDD
rdd.take(5)


['1\t02-01-1950\t2,80',
 '2\t03-01-1950\t2,80',
 '3\t04-01-1950\t2,80',
 '4\t05-01-1950\t2,80',
 '5\t06-01-1950\t2,80']

In [None]:
# Split each line of the RDD by the tab character ("\t")
# The map() function applies the lambda function to each line in the RDD, splitting each line into a list based on tab separation
rdd_split = rdd.map(lambda line: line.split("\t"))

# Take the first 5 elements (lines) of the split RDD to check the result
# This will show the first 5 lines, where each line is now split into a list of values
print(rdd_split.take(5))
rdd_split.count()

[['1', '02-01-1950', '2,80'], ['2', '03-01-1950', '2,80'], ['3', '04-01-1950', '2,80'], ['4', '05-01-1950', '2,80'], ['5', '06-01-1950', '2,80']]


17776

In [None]:
# Function to check if a string can be converted to a valid float
# It replaces commas with dots (for decimal consistency) and attempts to convert the value to float
def is_valid_float(val):
    try:
        float(val.replace(",", "."))  # Replace comma with dot and try to convert to float
        return True
    except (ValueError, AttributeError):  # Not numeric, or not a string at all
        return False  # If conversion fails, return False

# Filter the RDD to keep only the rows with 3 columns, where the third column is not empty
# Additionally, the third column must be a valid float value
# This ensures only valid rows (with 3 columns and a valid float value in the third column) are kept
rdd_clean = rdd_split.filter(lambda x: len(x) == 3 and x[2] != '' and is_valid_float(x[2]))

# Show the first 5 rows of the cleaned RDD to verify the result
print(rdd_clean.take(5))
rdd_clean.count()

# Check the type of the third column in the first row (still a string at this point)
print("Third column type in the current rdd : ", type(rdd_clean.take(1)[0][2]))

[['1', '02-01-1950', '2,80'], ['2', '03-01-1950', '2,80'], ['3', '04-01-1950', '2,80'], ['4', '05-01-1950', '2,80'], ['5', '06-01-1950', '2,80']]
Third column type in the current rdd :  <class 'str'>
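
The validation and conversion steps can also be tried without Spark. The following is a minimal plain-Python sketch, not the notebook's own code: `parse_decimal_comma` is a hypothetical helper name, and the second and third sample rows are made up to show what gets filtered out.

```python
from typing import Optional


def parse_decimal_comma(text: str) -> Optional[float]:
    """Return the float for a string like '2,80', or None if it is not numeric."""
    try:
        return float(text.replace(",", "."))
    except ValueError:
        return None


# First row comes from the dataset preview; the other two are invented bad rows.
rows = [
    ["1", "02-01-1950", "2,80"],
    ["2", "03-01-1950", ""],      # empty value
    ["3", "04-01-1950"],          # missing third column
]

# Keep only rows with exactly 3 columns and a numeric third column.
clean = [
    (r[0], r[1], parse_decimal_comma(r[2]))
    for r in rows
    if len(r) == 3 and parse_decimal_comma(r[2]) is not None
]
print(clean)
```

Returning `None` instead of `True`/`False` lets one helper serve both the filter and the conversion, at the cost of parsing each value twice in this sketch.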


In [None]:
# Convert the third column to float and create a new RDD with 3 columns
rdd_float = rdd_clean.map(lambda x: (x[0], x[1], float(x[2].replace(",", "."))))

# Check the type of the third column in the first row (now a float)
print("Third column type in the current rdd : ", type(rdd_float.take(1)[0][2]))

# Take the first 5 rows of rdd_float to inspect the converted values
rdd_float.take(5)

Third column type in the current rdd :  <class 'float'>


[('1', '02-01-1950', 2.8),
 ('2', '03-01-1950', 2.8),
 ('3', '04-01-1950', 2.8),
 ('4', '05-01-1950', 2.8),
 ('5', '06-01-1950', 2.8)]

In [None]:
# I did this extra step for checking and analytical purposes
# Extract the first column (index) from the rdd
indexes = rdd_float.map(lambda x: int(x[0]))  # Assuming the first column is the index and is a string that needs to be converted to an integer

# Check if the indexes are sequential and find the first non-sequential index
sequential_errors = indexes.zipWithIndex().filter(lambda x: x[0] != x[1] + 1).collect()

# Output the result
if sequential_errors:
    print("First non-sequential index:", sequential_errors[0])
else:
    print("The indexes are sequential.")

# This check showed me that I need to use the indexes that I will set with zipWithIndex()
# (The existing indexes are not sequential, and some rows have missing values that must be cleaned out, so the file's own indexes cannot be used)

First non-sequential index: (101, 99)
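
The output pair reads as (index value from the file, position assigned by zipWithIndex). The same check can be sketched in plain Python with `enumerate`; `first_gap` and the sample list are hypothetical and only illustrate the comparison `value != position + 1`.

```python
def first_gap(indexes):
    """Return (value, position) for the first value that is not position + 1, else None."""
    for position, value in enumerate(indexes):
        if value != position + 1:
            return (value, position)
    return None


# Made-up sample: the value 4 is missing, so 5 sits at position 3.
sample = [1, 2, 3, 5, 6]
print(first_gap(sample))  # (5, 3)
```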


In [None]:
# Apply zipWithIndex() to the rdd_float to add an index to each row

rdd_indexed = rdd_float.zipWithIndex() # ((file_index, date, value), new_index)

# Show the first 5 rows of the indexed RDD to verify the result
rdd_indexed.take(5)


[(('1', '02-01-1950', 2.8), 0),
 (('2', '03-01-1950', 2.8), 1),
 (('3', '04-01-1950', 2.8), 2),
 (('4', '05-01-1950', 2.8), 3),
 (('5', '06-01-1950', 2.8), 4)]

In [None]:
# Move the new index to the key position so that groupByKey can be used afterwards
# The file's own index column (x[0][0]) is dropped because it is not sequential
current = rdd_indexed.map(lambda x: (int(x[1]), (x[0][1], x[0][2])))   # (index, (date, value))
current.take(5)

# Each input element is a tuple (original_row, index): x[0] is the row and x[1] is the zipWithIndex index
# Each output element is a tuple (index, (date, value)) where:
# - x[0] is the index
# - x[1][0] is the date
# - x[1][1] is the value (already a float)

[(0, ('02-01-1950', 2.8)),
 (1, ('03-01-1950', 2.8)),
 (2, ('04-01-1950', 2.8)),
 (3, ('05-01-1950', 2.8)),
 (4, ('06-01-1950', 2.8))]

In [None]:
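# Decrease the key by 1 so that each record gets the same key as the record before it in `current`
# (x[0] - 1) → shifts the key back by one
# (x[1][0], x[1][1]) → keeps the date and value unchanged
# Note: despite its name, `prev` at key i holds the following record (i + 1), which is later paired with record i
prev = current.map(lambda x: (x[0] - 1, (x[1][0], x[1][1])))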

# Show the first 5 results
prev.take(5)


[(-1, ('02-01-1950', 2.8)),
 (0, ('03-01-1950', 2.8)),
 (1, ('04-01-1950', 2.8)),
 (2, ('05-01-1950', 2.8)),
 (3, ('06-01-1950', 2.8))]

In [None]:
# Combine current and previous RDDs using union
# Then sort the combined RDD by the key (x[0])
rdd = current.union(prev).sortBy(lambda x: x[0])

# Show the first 5 results
rdd.take(5)

[(-1, ('02-01-1950', 2.8)),
 (0, ('02-01-1950', 2.8)),
 (0, ('03-01-1950', 2.8)),
 (1, ('03-01-1950', 2.8)),
 (1, ('04-01-1950', 2.8))]

In [None]:
# Group values by key so that each day and the following day end up in the same tuple
# Convert the grouped values (which are iterable) into a tuple
# Then sort the result by the key
# Note: groupByKey() does not guarantee the order of values inside a group; the (earlier, later) order is not enforced by this code
resultRDD = rdd.groupByKey().map(lambda x: (x[0], tuple(x[1]))).sortBy(lambda x: x[0])

# Show the first 5 results
resultRDD.take(5)

[(-1, (('02-01-1950', 2.8),)),
 (0, (('02-01-1950', 2.8), ('03-01-1950', 2.8))),
 (1, (('03-01-1950', 2.8), ('04-01-1950', 2.8))),
 (2, (('04-01-1950', 2.8), ('05-01-1950', 2.8))),
 (3, (('05-01-1950', 2.8), ('06-01-1950', 2.8)))]
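
The shift, union, and group steps above can be sketched in plain Python. This is an illustration under assumptions, not the notebook's method: the `"cur"`/`"next"` tags are an addition that makes the order inside each pair explicit instead of relying on grouping order, and the three records are taken from the dataset preview.

```python
from collections import defaultdict

records = [("02-01-1950", 2.8), ("03-01-1950", 2.8), ("04-01-1950", 2.8)]

# Key each record by its position, and a second copy by position - 1.
current = [(i, ("cur", rec)) for i, rec in enumerate(records)]
shifted = [(i - 1, ("next", rec)) for i, rec in enumerate(records)]

# "Union" the two lists and group by key, remembering which side each record came from.
groups = defaultdict(dict)
for key, (tag, rec) in current + shifted:
    groups[key][tag] = rec

# Keep only keys that have both sides; keys -1 and 2 have a single record and are dropped.
pairs = [
    (key, (g["cur"], g["next"]))
    for key, g in sorted(groups.items())
    if "cur" in g and "next" in g
]
print(pairs)
```

In Spark the same tagging idea would let the pairing stay correct even if the values inside a group arrive in a different order.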

In [None]:
# Keep only the records where the value has exactly 2 items (paired entries),
# because the first and last keys have only 1 item,
# which would otherwise cause an index out of range error later
filteredRDD = resultRDD.filter(lambda x: len(x[1]) == 2)

# Show the first 5 results
filteredRDD.take(5)


[(0, (('02-01-1950', 2.8), ('03-01-1950', 2.8))),
 (1, (('03-01-1950', 2.8), ('04-01-1950', 2.8))),
 (2, (('04-01-1950', 2.8), ('05-01-1950', 2.8))),
 (3, (('05-01-1950', 2.8), ('06-01-1950', 2.8))),
 (4, (('06-01-1950', 2.8), ('09-01-1950', 2.8)))]

In [None]:
# Create a new RDD with:
# - a string showing the date range: "from <oldDate> to <newDate>"
# - percentage change: ((new - old) / old) * 100
#   If the old value is 0, return None to safely avoid division by zero
rdd_pct = filteredRDD.map(lambda x: (
    "from " + x[1][0][0] + " to " + x[1][1][0],
    ((x[1][1][1] - x[1][0][1]) / x[1][0][1] * 100) if x[1][0][1] != 0 else None
))

# Show the first 5 results
rdd_pct.take(5)


[('from 02-01-1950 to 03-01-1950', 0.0),
 ('from 03-01-1950 to 04-01-1950', 0.0),
 ('from 04-01-1950 to 05-01-1950', 0.0),
 ('from 05-01-1950 to 06-01-1950', 0.0),
 ('from 06-01-1950 to 09-01-1950', 0.0)]
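
The percentage formula, including the guard for an old value of zero, can be checked on its own. A minimal sketch with a hypothetical helper `pct_change` and made-up input values:

```python
from typing import Optional


def pct_change(old: float, new: float) -> Optional[float]:
    """Percentage change from old to new, or None when old is 0."""
    if old == 0:
        return None
    return (new - old) / old * 100


print(pct_change(2.0, 3.0))  # 50.0
print(pct_change(2.8, 2.8))  # 0.0
print(pct_change(0.0, 1.5))  # None
```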

In [None]:
# Sort the RDD by the percentage change in descending order
# x[1] is the percentage value, and ascending=False gives descending order
# If the percentage is None (old value was 0), treat it as negative infinity so it sorts last
sorted_rdd = rdd_pct.sortBy(lambda x: x[1] if x[1] is not None else float("-inf"), ascending=False)

# Show the top 5 results
sorted_rdd.take(5)

[('from 19-08-1960 to 22-08-1960', 221.42857142857144),
 ('from 24-01-1980 to 25-01-1980', 100.0),
 ('from 07-08-1970 to 10-08-1970', 64.99999999999999),
 ('from 11-06-1979 to 12-06-1979', 32.075471698113205),
 ('from 28-02-1978 to 01-03-1978', 29.87012987012987)]
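
Sorting needs care when some percentages are `None`, because negating or comparing `None` raises a `TypeError` in Python. A small sketch of a None-safe descending sort follows; the labels and numbers are made up and `sort_key` is a hypothetical name.

```python
rows = [("a", 1.5), ("b", None), ("c", 10.0), ("d", -2.0)]


def sort_key(row):
    """Use the percentage as the key, mapping None to negative infinity."""
    return row[1] if row[1] is not None else float("-inf")


# Descending order, then keep the two largest changes.
top = sorted(rows, key=sort_key, reverse=True)[:2]
print(top)  # [('c', 10.0), ('a', 1.5)]
```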

In [None]:
# Print header
print("Date Range                    |  Percentage Change")
print("----------------------------------------------")

# Print the top 5 rows from the sorted RDD
for row in sorted_rdd.take(5):
    print(f"{row[0]} | {row[1]}")

Date Range                    |  Percentage Change
----------------------------------------------
from 19-08-1960 to 22-08-1960 | 221.42857142857144
from 24-01-1980 to 25-01-1980 | 100.0
from 07-08-1970 to 10-08-1970 | 64.99999999999999
from 11-06-1979 to 12-06-1979 | 32.075471698113205
from 28-02-1978 to 01-03-1978 | 29.87012987012987


In [None]:
# Convert the sorted RDD to a DataFrame with proper column names
df = sorted_rdd.toDF(["DateRange", "PercentageChange"])

# Display the first 5 rows without truncating the content
df.show(5, False)

# In the dataset, some dates are missing,
# so the result may show jumps between non-consecutive dates

+-----------------------------+------------------+
|DateRange                    |PercentageChange  |
+-----------------------------+------------------+
|from 19-08-1960 to 22-08-1960|221.42857142857144|
|from 24-01-1980 to 25-01-1980|100.0             |
|from 07-08-1970 to 10-08-1970|64.99999999999999 |
|from 11-06-1979 to 12-06-1979|32.075471698113205|
|from 28-02-1978 to 01-03-1978|29.87012987012987 |
+-----------------------------+------------------+
only showing top 5 rows



In [None]:
################################################################################
########################## EXTRA SECOND WAY SOLUTION ###########################
################################################################################

In [None]:
# Then I wanted to try another method to solve it using 'window' and 'lag'

In [None]:
from pyspark.sql.functions import col, lag, regexp_replace, round  # Import Spark SQL functions for column operations and transformations (note: this round shadows Python's built-in round)
from pyspark.sql.window import Window  # Import windowing functionality to apply functions across rows (like lag)

# Get a Spark session for this analysis
# (getOrCreate() returns the session that is already running, if there is one)
spark = SparkSession.builder.appName("Question5").getOrCreate()

# Read the tab-separated dataset without headers and infer column types
df = (
    spark.read
    .option("inferSchema", "true")   # Automatically detect column data types (e.g., double, int)
    .option("delimiter", "\t")       # Specify tab as the delimiter between columns
    .option("header", "false")       # The dataset does not contain a header row
    .csv("/content/drive/My Drive/datasets/DollarDataset.txt")
    )

# Display the first few rows of the dataset to understand its structure and contents
df.show()


+---+----------+----+
|_c0|       _c1| _c2|
+---+----------+----+
|  1|02-01-1950|2,80|
|  2|03-01-1950|2,80|
|  3|04-01-1950|2,80|
|  4|05-01-1950|2,80|
|  5|06-01-1950|2,80|
|  6|09-01-1950|2,80|
|  7|10-01-1950|2,80|
|  8|11-01-1950|2,80|
|  9|12-01-1950|2,80|
| 10|13-01-1950|2,80|
| 11|16-01-1950|2,80|
| 12|17-01-1950|2,80|
| 13|18-01-1950|2,80|
| 14|19-01-1950|2,80|
| 15|20-01-1950|2,80|
| 16|23-01-1950|2,80|
| 17|24-01-1950|2,80|
| 18|25-01-1950|2,80|
| 19|26-01-1950|2,80|
| 20|27-01-1950|2,80|
+---+----------+----+
only showing top 20 rows



In [None]:
# Rename the default column names (_c0, _c1, _c2) to more meaningful ones
df1 = (
    df.withColumnRenamed("_c0", "Index")     # Rename _c0 to Index (could be row number or unique ID)
      .withColumnRenamed("_c1", "Date")      # Rename _c1 to Date (represents the date of the value)
      .withColumnRenamed("_c2", "Value")     # Rename _c2 to Value (the dollar value for that date)
)

df1.show(20, False)  # Display the renamed DataFrame to verify column names

+-----+----------+-----+
|Index|Date      |Value|
+-----+----------+-----+
|1    |02-01-1950|2,80 |
|2    |03-01-1950|2,80 |
|3    |04-01-1950|2,80 |
|4    |05-01-1950|2,80 |
|5    |06-01-1950|2,80 |
|6    |09-01-1950|2,80 |
|7    |10-01-1950|2,80 |
|8    |11-01-1950|2,80 |
|9    |12-01-1950|2,80 |
|10   |13-01-1950|2,80 |
|11   |16-01-1950|2,80 |
|12   |17-01-1950|2,80 |
|13   |18-01-1950|2,80 |
|14   |19-01-1950|2,80 |
|15   |20-01-1950|2,80 |
|16   |23-01-1950|2,80 |
|17   |24-01-1950|2,80 |
|18   |25-01-1950|2,80 |
|19   |26-01-1950|2,80 |
|20   |27-01-1950|2,80 |
+-----+----------+-----+
only showing top 20 rows



In [None]:
df2 = (
        df1.withColumn("Value",               # Target the 'Value' column
        regexp_replace("Value", ",", ".")     # Replace commas with dots for decimal consistency
        .cast("float")                        # Convert the cleaned string to float
    )
)

df2.show()  # Show the DataFrame to verify numeric conversion
df2.count()

+-----+----------+-----+
|Index|      Date|Value|
+-----+----------+-----+
|    1|02-01-1950|  2.8|
|    2|03-01-1950|  2.8|
|    3|04-01-1950|  2.8|
|    4|05-01-1950|  2.8|
|    5|06-01-1950|  2.8|
|    6|09-01-1950|  2.8|
|    7|10-01-1950|  2.8|
|    8|11-01-1950|  2.8|
|    9|12-01-1950|  2.8|
|   10|13-01-1950|  2.8|
|   11|16-01-1950|  2.8|
|   12|17-01-1950|  2.8|
|   13|18-01-1950|  2.8|
|   14|19-01-1950|  2.8|
|   15|20-01-1950|  2.8|
|   16|23-01-1950|  2.8|
|   17|24-01-1950|  2.8|
|   18|25-01-1950|  2.8|
|   19|26-01-1950|  2.8|
|   20|27-01-1950|  2.8|
+-----+----------+-----+
only showing top 20 rows



17776

In [None]:
df3 = df2.na.drop(subset=["Value"])  # Drop rows where the 'Value' column is null (missing values)
df3.show()  # Display the cleaned DataFrame
df3.count()

+-----+----------+-----+
|Index|      Date|Value|
+-----+----------+-----+
|    1|02-01-1950|  2.8|
|    2|03-01-1950|  2.8|
|    3|04-01-1950|  2.8|
|    4|05-01-1950|  2.8|
|    5|06-01-1950|  2.8|
|    6|09-01-1950|  2.8|
|    7|10-01-1950|  2.8|
|    8|11-01-1950|  2.8|
|    9|12-01-1950|  2.8|
|   10|13-01-1950|  2.8|
|   11|16-01-1950|  2.8|
|   12|17-01-1950|  2.8|
|   13|18-01-1950|  2.8|
|   14|19-01-1950|  2.8|
|   15|20-01-1950|  2.8|
|   16|23-01-1950|  2.8|
|   17|24-01-1950|  2.8|
|   18|25-01-1950|  2.8|
|   19|26-01-1950|  2.8|
|   20|27-01-1950|  2.8|
+-----+----------+-----+
only showing top 20 rows



12895

In [None]:
# Define the window specification to order the data by "Index" (row order)
# This will allow us to compute the previous day's value using the lag() function
# Note: a window without partitionBy() moves all rows into a single partition
windowSpec = Window.orderBy("Index")

# Add a new column "PrevValue" which contains the value of the previous day
# This is done using lag() function over the "windowSpec" which orders the data by "Index"
df4 = df3.withColumn("PrevValue", lag("Value").over(windowSpec)) # I could also have done this with lead()

# Add a new column "PrevDate" that holds the previous day's "Date"
df5 = df4.withColumn("PrevDate", lag("Date").over(windowSpec)) # I could also have done this with lead()

# Show the updated DataFrame to verify the new column
df5.show()

# Count the total number of rows in the DataFrame
df5.count()

+-----+----------+-----+---------+----------+
|Index|      Date|Value|PrevValue|  PrevDate|
+-----+----------+-----+---------+----------+
|    1|02-01-1950|  2.8|     NULL|      NULL|
|    2|03-01-1950|  2.8|      2.8|02-01-1950|
|    3|04-01-1950|  2.8|      2.8|03-01-1950|
|    4|05-01-1950|  2.8|      2.8|04-01-1950|
|    5|06-01-1950|  2.8|      2.8|05-01-1950|
|    6|09-01-1950|  2.8|      2.8|06-01-1950|
|    7|10-01-1950|  2.8|      2.8|09-01-1950|
|    8|11-01-1950|  2.8|      2.8|10-01-1950|
|    9|12-01-1950|  2.8|      2.8|11-01-1950|
|   10|13-01-1950|  2.8|      2.8|12-01-1950|
|   11|16-01-1950|  2.8|      2.8|13-01-1950|
|   12|17-01-1950|  2.8|      2.8|16-01-1950|
|   13|18-01-1950|  2.8|      2.8|17-01-1950|
|   14|19-01-1950|  2.8|      2.8|18-01-1950|
|   15|20-01-1950|  2.8|      2.8|19-01-1950|
|   16|23-01-1950|  2.8|      2.8|20-01-1950|
|   17|24-01-1950|  2.8|      2.8|23-01-1950|
|   18|25-01-1950|  2.8|      2.8|24-01-1950|
|   19|26-01-1950|  2.8|      2.8|

12895
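
What lag() does over an ordered window can be pictured in plain Python: every row receives the value of the row before it, and the first row receives nothing. A minimal sketch with a hypothetical helper `lag1` and made-up values:

```python
def lag1(values):
    """Plain-Python analogue of lag(..., 1): shift the list down by one row."""
    if not values:
        return []
    return [None] + list(values[:-1])


values = [2.0, 3.0, 3.0]
prev_values = lag1(values)
print(prev_values)  # [None, 2.0, 3.0]

# Percentage change per row; the first row has no previous value, so it stays None.
pct = [
    None if prev is None or prev == 0 else (cur - prev) / prev * 100
    for cur, prev in zip(values, prev_values)
]
print(pct)  # [None, 50.0, 0.0]
```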

In [None]:
# Calculate the percentage change between the current value and the previous value
# The formula is: (Current Value - Previous Value) / Previous Value * 100
# Round the result to 2 decimal places using the round() function
df6 = df5.withColumn("PctChange",
                    round(((col("Value") - col("PrevValue")) / col("PrevValue")) * 100, 2))

# Display the DataFrame with the calculated percentage change
df6.show()
df6.count()


+-----+----------+-----+---------+----------+---------+
|Index|      Date|Value|PrevValue|  PrevDate|PctChange|
+-----+----------+-----+---------+----------+---------+
|    1|02-01-1950|  2.8|     NULL|      NULL|     NULL|
|    2|03-01-1950|  2.8|      2.8|02-01-1950|      0.0|
|    3|04-01-1950|  2.8|      2.8|03-01-1950|      0.0|
|    4|05-01-1950|  2.8|      2.8|04-01-1950|      0.0|
|    5|06-01-1950|  2.8|      2.8|05-01-1950|      0.0|
|    6|09-01-1950|  2.8|      2.8|06-01-1950|      0.0|
|    7|10-01-1950|  2.8|      2.8|09-01-1950|      0.0|
|    8|11-01-1950|  2.8|      2.8|10-01-1950|      0.0|
|    9|12-01-1950|  2.8|      2.8|11-01-1950|      0.0|
|   10|13-01-1950|  2.8|      2.8|12-01-1950|      0.0|
|   11|16-01-1950|  2.8|      2.8|13-01-1950|      0.0|
|   12|17-01-1950|  2.8|      2.8|16-01-1950|      0.0|
|   13|18-01-1950|  2.8|      2.8|17-01-1950|      0.0|
|   14|19-01-1950|  2.8|      2.8|18-01-1950|      0.0|
|   15|20-01-1950|  2.8|      2.8|19-01-1950|   

12895

In [None]:
from pyspark.sql.functions import col, concat, lit

# Create a new column "ChangeDate" that combines "PrevDate" and "Date"
# Format: "from <PrevDate> to <Date>"
df7 = df6.withColumn(
    "ChangeDate",
    concat(lit("from "), col("PrevDate"), lit(" to "), col("Date"))
)

# Display the DataFrame (first 20 rows) to verify the percentage change and date range
df7.show(20, False)

# Count the total number of rows in the DataFrame
df7.count()

+-----+----------+-----+---------+----------+---------+-----------------------------+
|Index|Date      |Value|PrevValue|PrevDate  |PctChange|ChangeDate                   |
+-----+----------+-----+---------+----------+---------+-----------------------------+
|1    |02-01-1950|2.8  |NULL     |NULL      |NULL     |NULL                         |
|2    |03-01-1950|2.8  |2.8      |02-01-1950|0.0      |from 02-01-1950 to 03-01-1950|
|3    |04-01-1950|2.8  |2.8      |03-01-1950|0.0      |from 03-01-1950 to 04-01-1950|
|4    |05-01-1950|2.8  |2.8      |04-01-1950|0.0      |from 04-01-1950 to 05-01-1950|
|5    |06-01-1950|2.8  |2.8      |05-01-1950|0.0      |from 05-01-1950 to 06-01-1950|
|6    |09-01-1950|2.8  |2.8      |06-01-1950|0.0      |from 06-01-1950 to 09-01-1950|
|7    |10-01-1950|2.8  |2.8      |09-01-1950|0.0      |from 09-01-1950 to 10-01-1950|
|8    |11-01-1950|2.8  |2.8      |10-01-1950|0.0      |from 10-01-1950 to 11-01-1950|
|9    |12-01-1950|2.8  |2.8      |11-01-1950|0.0      

12895

In [None]:
# Sort by PctChange in descending order to find the top 5 days with the highest percentage increase
# (with desc(), NULL values are placed last, so the first row does not appear in the result)
# Select only the ChangeDate and PctChange columns, and limit the result to the top 5
top5 = df7.orderBy(col("PctChange").desc()).select("ChangeDate", "PctChange").limit(5)

# Display the top 5 rows showing the highest percentage increase in "PctChange"
print("top 5 greatest daily increase ( by percentage )")
top5.show(5, False)

top 5 greatest daily increase ( by percentage )
+-----------------------------+---------+
|ChangeDate                   |PctChange|
+-----------------------------+---------+
|from 19-08-1960 to 22-08-1960|221.43   |
|from 24-01-1980 to 25-01-1980|100.0    |
|from 07-08-1970 to 10-08-1970|65.0     |
|from 11-06-1979 to 12-06-1979|32.08    |
|from 28-02-1978 to 01-03-1978|29.87    |
+-----------------------------+---------+

