# Nequi project

### 1. Based on the link provided, the files are downloaded with the AWS CLI.

In [1]:
!aws s3 cp "s3://nequi-open-data/sample_data_0006_part_00.parquet" .

download: s3://nequi-open-data/sample_data_0006_part_00.parquet to ./sample_data_0006_part_00.parquet


In [2]:
!aws s3 cp "s3://nequi-open-data/sample_data_0007_part_00.parquet" .

download: s3://nequi-open-data/sample_data_0007_part_00.parquet to ./sample_data_0007_part_00.parquet


### 2. Reading the files

#### 2.a Pandas is used first to read the Parquet files

In [1]:
import pandas as pd

In [2]:
sample006 = pd.read_parquet('./sample_data_0006_part_00.parquet')
sample007 = pd.read_parquet('./sample_data_0007_part_00.parquet')

samples = pd.concat([sample006, sample007], ignore_index=True)  # The whole dataset is studied, so the samples are concatenated.

With Pandas it took 45.0 seconds to read and concatenate the files.
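
A timing like the one above can be taken with a small wall-clock timer around the read step. This is a minimal sketch: the helper `timed` and the toy workload are hypothetical stand-ins, not part of the original notebook.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock seconds spent in the `with` body under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("toy workload", timings):
    # Stand-in for pd.read_parquet(...) followed by pd.concat(...).
    total = sum(range(100_000))

print(f"toy workload: {timings['toy workload']:.3f} s")
```

When the same timer is wrapped around PySpark code, keep in mind that `spark.read.parquet` and `union` are lazy, so a fair comparison would time an action such as `samples.count()`.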

#### 2.b PySpark is used to reduce read time

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

In [2]:
spark = SparkSession.builder.appName("NequiProject").getOrCreate()

23/08/09 02:54:10 WARN Utils: Your hostname, Danielas-MacBook-Air.local resolves to a loopback address: 127.0.0.1; using 192.168.1.8 instead (on interface en0)
23/08/09 02:54:10 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/08/09 02:54:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
sample006 = spark.read.parquet('./sample_data_0006_part_00.parquet')
sample007 = spark.read.parquet('./sample_data_0007_part_00.parquet')
samples = sample006.union(sample007)
samples = samples.na.drop()

                                                                                

Since PySpark is more time-efficient here, the rest of the project uses it instead of Pandas.

In [4]:
samples.show(5)

                                                                                

+--------------------+--------------------+--------------------+-------------------+--------------------+--------------------+------------------+----------------+
|         merchant_id|                 _id|          subsidiary|   transaction_date|      account_number|             user_id|transaction_amount|transaction_type|
+--------------------+--------------------+--------------------+-------------------+--------------------+--------------------+------------------+----------------+
|075d178871d8d4850...|aa8dacff663072244...|824b2af470cbe6a65...|2021-09-12 13:32:03|648e257c9d74909a1...|ba42d192a145583ba...|      178.33365037|         CREDITO|
|075d178871d8d4850...|a53bb81bd0bba2ae2...|2d8d34be7509a6b12...|2021-09-12 13:31:58|c0b62f9046c83ea55...|5cfff960ea6d732c1...|       35.66673007|         CREDITO|
|075d178871d8d4850...|79f893ea65c06fe29...|5eeb18254850b21af...|2021-09-12 13:31:56|872d10143fc0ac7d5...|c97e63a92c82c7217...|      142.66692029|         CREDITO|
|075d178871d8d4850...|

In [5]:
samples.printSchema()

root
 |-- merchant_id: string (nullable = true)
 |-- _id: string (nullable = true)
 |-- subsidiary: string (nullable = true)
 |-- transaction_date: timestamp (nullable = true)
 |-- account_number: string (nullable = true)
 |-- user_id: string (nullable = true)
 |-- transaction_amount: decimal(24,8) (nullable = true)
 |-- transaction_type: string (nullable = true)



In [6]:
samples.select("user_id").show(5, truncate=False)

+--------------------------------+
|user_id                         |
+--------------------------------+
|ba42d192a145583ba8e7bf04875f837f|
|5cfff960ea6d732c1ba3e63d24f3be52|
|c97e63a92c82c7217b333635d75928ed|
|fc09bdd00f283222d65eaff4d00a6594|
|213527e8ba94fcaf2f9378969f9f6abc|
+--------------------------------+
only showing top 5 rows



In [7]:
samples.select(F.year("transaction_date").alias("Year")).distinct().show()



+----+
|Year|
+----+
|2021|
|2020|
+----+



                                                                                

In [8]:
samples.select("merchant_id").distinct().show(truncate=False)



+--------------------------------+
|merchant_id                     |
+--------------------------------+
|075d178871d8d48502bf1f54887e52fe|
|817d18cd3c31e40e9bff0566baae7758|
|838a8fa992a4aa2fb5a0cf8b15b63755|
+--------------------------------+



                                                                                

In [9]:
num_users = samples.select("user_id").distinct().count()
num_accounts = samples.select("account_number").distinct().count()

print(f"A total of {num_users:,} unique users have placed at least one transaction, while there are a total of {num_accounts:,} accounts.")

A total of 3,087,217 unique users have placed at least one transaction, while there are a total of 3,099,711 accounts.


                                                                                

In [10]:
# Top 20 users by number of distinct accounts.
accounts = samples.groupBy("user_id").agg(F.countDistinct("account_number").alias("account_count")).orderBy(F.desc("account_count")).limit(20)
# Total and sample standard deviation of the transaction amounts for those users.
amounts = accounts.join(samples, "user_id", "inner") \
    .groupBy("user_id") \
    .agg(F.sum("transaction_amount").alias("sum_transactions"),
         F.stddev("transaction_amount").alias("std_deviation"))
accounts.join(amounts, "user_id", "inner").orderBy(F.desc("account_count")).show(truncate=False)

23/08/09 02:39:42 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
23/08/09 02:39:45 WARN RowBasedKeyValueBatch: Calling spill() on

+--------------------------------+-------------+----------------+------------------+
|user_id                         |account_count|sum_transactions|std_deviation     |
+--------------------------------+-------------+----------------+------------------+
|8c292781ac3e591312d7fc5b767687ca|43           |57794.36941102  |297.1821683130902 |
|6ba67792fbc4e375fa69ed6df0e44854|27           |59241.24976144  |441.69874871545954|
|328550eea11d2441ab258a1f07581dc8|24           |6242.86665343   |91.07442149253173 |
|a9fbdb2d32424dbac15edc0b17c12e23|23           |17405.36427600  |161.0434881880177 |
|62e1cb0b55d1d7b577292916528ef6d8|20           |1190.07989329   |23.55723555824086 |
|16100156a44d52a0ab40b2661dd648cc|17           |5350.00951085   |459.00474438452113|
|94cf8a83c7f7d2b2d33450d9ca233ef6|17           |15539.99429289  |374.8176078533819 |
|a71f0b8fc6267a4687430bea42310468|16           |13470.13505786  |249.89727904121185|
|e066a19e39bbb74e82d0d88702292b7e|13           |4761.50846469   |
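
The three columns in the table above are a distinct count, a sum, and a sample standard deviation per user (`F.stddev` is an alias for the sample standard deviation, `stddev_samp`). A minimal pure-Python sketch of the same aggregates on hypothetical toy rows, to make the definitions concrete:

```python
from collections import defaultdict
from statistics import stdev

# Hypothetical toy rows: (user_id, account_number, transaction_amount).
rows = [
    ("u1", "a1", 10.0),
    ("u1", "a2", 20.0),
    ("u1", "a2", 30.0),
    ("u2", "b1", 5.0),
]

accounts_by_user = defaultdict(set)
amounts_by_user = defaultdict(list)
for user_id, account_number, amount in rows:
    accounts_by_user[user_id].add(account_number)
    amounts_by_user[user_id].append(amount)

summary = {}
for user_id, amounts in amounts_by_user.items():
    summary[user_id] = {
        "account_count": len(accounts_by_user[user_id]),
        "sum_transactions": sum(amounts),
        # The sample standard deviation needs at least two values.
        "std_deviation": stdev(amounts) if len(amounts) > 1 else None,
    }

# Rank users by distinct account count, highest first.
ranked = sorted(summary.items(), key=lambda kv: kv[1]["account_count"], reverse=True)
for user_id, stats in ranked:
    print(user_id, stats)
```

The sketch computes all three aggregates in one pass per user, which suggests the notebook's two joins could probably be folded into a single `groupBy`; that is a possible simplification, not something the notebook does.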

                                                                                

In [11]:
samples.filter(samples["user_id"] == "8c292781ac3e591312d7fc5b767687ca").orderBy('transaction_date').show()

                                                                                

+--------------------+--------------------+--------------------+-------------------+--------------------+--------------------+------------------+----------------+
|         merchant_id|                 _id|          subsidiary|   transaction_date|      account_number|             user_id|transaction_amount|transaction_type|
+--------------------+--------------------+--------------------+-------------------+--------------------+--------------------+------------------+----------------+
|075d178871d8d4850...|2a797d11beb190624...|db413bbd87a3df94b...|2021-01-08 09:10:30|5946e4c935c965c4b...|8c292781ac3e59131...|      118.88910024|         CREDITO|
|838a8fa992a4aa2fb...|84d44445c328f3255...|8c52a2d7745e37bce...|2021-01-14 04:27:51|b1d56a01b7cd5e3c6...|8c292781ac3e59131...|       59.44455012|         CREDITO|
|838a8fa992a4aa2fb...|5d69e18742f6e6024...|ce9fcdc2a3b5e4f32...|2021-01-14 10:53:16|b1d56a01b7cd5e3c6...|8c292781ac3e59131...|       89.16682518|         CREDITO|
|838a8fa992a4aa2fb...|

In [12]:
samples.select("transaction_type").distinct().show(truncate=False)



+----------------+
|transaction_type|
+----------------+
|DEBITO          |
|CREDITO         |
+----------------+



                                                                                

In [13]:
samples.groupBy("subsidiary", F.year("transaction_date").alias("transaction_year")).agg(F.sum("transaction_amount").alias("total_transactions")).orderBy(F.desc("total_transactions")).show(truncate=False)



+--------------------------------+----------------+------------------+
|subsidiary                      |transaction_year|total_transactions|
+--------------------------------+----------------+------------------+
|8c52a2d7745e37bcee79717300f796e3|2021            |22293285.26933247 |
|d4b621a24cc03e3f92155a7e241fa1c3|2021            |18196560.97313392 |
|f54e0b6b32831a6307361ed959903e76|2021            |18027608.57745139 |
|3458b243beebecf55605ca649b6b2ea5|2021            |12223479.69223730 |
|dff70ce33784a932ce4a7efc81a43863|2021            |11941222.46639275 |
|fee20d2f0753125f11b4376da5dbad4c|2021            |9850216.26379212  |
|464139dd69c67ebf50f2f946bc12513e|2021            |8400298.85851027  |
|d4b31b123120a4eefd51ba95975f2ae4|2021            |7712461.85988958  |
|7428212cf0193f799447ec0dfe53e4a0|2021            |5093246.51878560  |
|4ecafb5dcecd6027257e8af4d9c82853|2021            |4801280.52566820  |
|4af00427a95e40c71244a8c66ec00a4b|2021            |3917039.18579283  |
|4f511

                                                                                

In [14]:
samples.groupBy("merchant_id", F.year("transaction_date").alias("transaction_year")).agg(F.sum("transaction_amount").alias("total_transactions")).orderBy(F.desc("total_transactions")).show(truncate=False)



+--------------------------------+----------------+-------------------+
|merchant_id                     |transaction_year|total_transactions |
+--------------------------------+----------------+-------------------+
|817d18cd3c31e40e9bff0566baae7758|2021            |3225043739.22018780|
|075d178871d8d48502bf1f54887e52fe|2021            |550735204.98759165 |
|838a8fa992a4aa2fb5a0cf8b15b63755|2021            |340173600.19185339 |
|817d18cd3c31e40e9bff0566baae7758|2020            |10117.46243062     |
|838a8fa992a4aa2fb5a0cf8b15b63755|2020            |4986.05430744      |
+--------------------------------+----------------+-------------------+



                                                                                

In [15]:
spark.conf.set("spark.sql.legacy.timeParserPolicy","LEGACY")
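
The legacy parser is enabled because the next cell formats dates with the pattern `u`, which numbers days from 1 (Monday) to 7 (Sunday). Spark's built-in `dayofweek` uses a different convention, 1 (Sunday) to 7 (Saturday). The sketch below shows how the two numberings relate; the helper name `iso_day_from_spark_dayofweek` is hypothetical.

```python
from datetime import datetime


def iso_day_from_spark_dayofweek(spark_dow: int) -> int:
    """Map 1 = Sunday ... 7 = Saturday onto 1 = Monday ... 7 = Sunday."""
    return (spark_dow + 5) % 7 + 1


# Timestamp taken from the first sample row shown earlier in the notebook.
ts = datetime(2021, 9, 12, 13, 32, 3)
print(ts.isoweekday())  # 7, i.e. Sunday in the 1 = Monday numbering

# Full mapping from the Sunday-first numbering to the Monday-first numbering.
mapping = {d: iso_day_from_spark_dayofweek(d) for d in range(1, 8)}
print(mapping)
```

With this mapping, `F.dayofweek` could stand in for `date_format(..., "u")` without the legacy setting; that substitution is a suggestion, not what the notebook runs.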

In [16]:
samples.withColumn("day_of_week", F.date_format("transaction_date", "u").cast("int")).groupBy("day_of_week", "merchant_id").agg(F.sum("transaction_amount").alias("total_amount")).orderBy(F.desc("total_amount")).show(truncate=False)



+-----------+--------------------------------+------------------+
|day_of_week|merchant_id                     |total_amount      |
+-----------+--------------------------------+------------------+
|6          |817d18cd3c31e40e9bff0566baae7758|511639944.34732382|
|5          |817d18cd3c31e40e9bff0566baae7758|510000748.98903125|
|2          |817d18cd3c31e40e9bff0566baae7758|498828692.68367606|
|4          |817d18cd3c31e40e9bff0566baae7758|479862825.74442169|
|3          |817d18cd3c31e40e9bff0566baae7758|478057031.08982711|
|1          |817d18cd3c31e40e9bff0566baae7758|450742807.19836309|
|7          |817d18cd3c31e40e9bff0566baae7758|295921806.62997540|
|2          |075d178871d8d48502bf1f54887e52fe|101500653.96417880|
|3          |075d178871d8d48502bf1f54887e52fe|93823515.55129834 |
|5          |075d178871d8d48502bf1f54887e52fe|88593515.06636584 |
|4          |075d178871d8d48502bf1f54887e52fe|88097825.00550824 |
|1          |075d178871d8d48502bf1f54887e52fe|86474062.18459486 |
|6        

                                                                                

In [17]:
samples.withColumn("day_of_week", F.date_format("transaction_date", "u").cast("int")).groupBy("day_of_week").agg(F.sum("transaction_amount").alias("total_amount")).orderBy(F.desc("total_amount")).show(truncate=False)



+-----------+------------------+
|day_of_week|total_amount      |
+-----------+------------------+
|2          |661109882.95579203|
|5          |650657929.53672232|
|6          |631336590.36788792|
|3          |629145744.09150287|
|4          |620943152.52094752|
|1          |590467515.02875521|
|7          |332306833.41476303|
+-----------+------------------+



                                                                                

In [4]:
num_columns = len(samples.columns)
num_rows = samples.count()

print(f"Sample's shape: ({num_rows},{num_columns})")



Sample's shape: (21516918,8)


                                                                                

In [5]:
samples = samples.dropDuplicates().cache()
num_unique = samples.count()
if num_rows > num_unique:
    diff = num_rows - num_unique
    print(f"There are a total of {diff} duplicate samples")

There are a total of 11 duplicate samples


                                                                                

In [6]:
samples.show(5)

                                                                                

+--------------------+--------------------+--------------------+-------------------+--------------------+--------------------+------------------+----------------+
|         merchant_id|                 _id|          subsidiary|   transaction_date|      account_number|             user_id|transaction_amount|transaction_type|
+--------------------+--------------------+--------------------+-------------------+--------------------+--------------------+------------------+----------------+
|075d178871d8d4850...|0ccf1fe8dd3333e6e...|00015fd77a0f4d869...|2021-06-18 07:08:03|a79ff75955445c687...|ebc8d3ae01b8334f3...|      356.66730074|          DEBITO|
|075d178871d8d4850...|b86f53170772d5b15...|00015fd77a0f4d869...|2021-05-05 06:28:04|c790394d677140db4...|05ee34549ddebf8c2...|      350.72284572|          DEBITO|
|075d178871d8d4850...|cfe6f6fb72e5d5ad2...|00015fd77a0f4d869...|2021-07-09 05:48:27|c6a712474dc3e8c59...|de5f51b22e24411f3...|       35.66673007|         CREDITO|
|075d178871d8d4850...|

In [6]:
from pyspark.sql.functions import col, unix_timestamp, lag, when
from pyspark.sql.window import Window

In [7]:
window_spec = Window.partitionBy("account_number").orderBy("transaction_date")

In [8]:
# Seconds elapsed since the previous transaction on the same account (null for the first one).
samples = samples.withColumn("time_diff", unix_timestamp("transaction_date") - unix_timestamp(lag("transaction_date").over(window_spec)))

In [9]:
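# Flag a transaction when its own gap, or the previous transaction's gap, is under 24 hours (86,400 s).
samples = samples.withColumn("fraudulent", when((col("time_diff") < 86400) | (lag("time_diff").over(window_spec) < 86400), 1).otherwise(0)).cache()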

In [10]:
# Keep only the flagged transactions; every aggregate from here on covers this subset.
samples = samples.where(samples["fraudulent"] == 1).cache()
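
The flagging rule built in the cells above can be traced by hand on a single account. The sketch below re-implements it in plain Python, treating a null gap as "not under 24 hours", which is how the `when(...).otherwise(0)` expression behaves. The first three timestamps come from the account shown earlier in the notebook; the fourth is a hypothetical later transaction added for illustration.

```python
from datetime import datetime

DAY_SECONDS = 86_400


def flag_transactions(timestamps):
    """Return (time_diff, flag) per transaction of one account, ordered by time.

    time_diff is the gap in seconds to the previous transaction (None for the first).
    flag is 1 when this gap, or the previous transaction's gap, is under 24 hours.
    """
    ordered = sorted(timestamps)
    diffs = [None]
    for prev, cur in zip(ordered, ordered[1:]):
        diffs.append(int((cur - prev).total_seconds()))

    result = []
    for i, diff in enumerate(diffs):
        prev_diff = diffs[i - 1] if i > 0 else None
        close = diff is not None and diff < DAY_SECONDS
        prev_close = prev_diff is not None and prev_diff < DAY_SECONDS
        result.append((diff, 1 if close or prev_close else 0))
    return result


account = [
    datetime(2021, 1, 14, 4, 27, 51),
    datetime(2021, 1, 14, 10, 53, 16),
    datetime(2021, 1, 14, 11, 33, 29),
    datetime(2021, 1, 20, 11, 33, 29),  # hypothetical: six days after the burst
]
flags = flag_transactions(account)
print(flags)
# [(None, 0), (23125, 1), (2413, 1), (518400, 1)]
```

Two consequences are visible here: the first transaction of a burst is not flagged, and a transaction arriving days after a burst is flagged because the previous gap was short. If the intent is to flag every transaction inside a burst, using `lead` rather than `lag` in the second condition may be closer to it; that reading of the intent is an assumption.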

In [11]:
samples.filter(samples["user_id"] == "8c292781ac3e591312d7fc5b767687ca").orderBy('transaction_date').show()

23/08/09 02:48:10 WARN MemoryStore: Not enough space to cache rdd_28_0 in memory! (computed 14.5 MiB so far)
23/08/09 02:48:10 WARN MemoryStore: Not enough space to cache rdd_28_7 in memory! (computed 14.4 MiB so far)
23/08/09 02:48:12 WARN MemoryStore: Not enough space to cache rdd_28_9 in memory! (computed 1302.2 KiB so far)
23/08/09 02:48:12 WARN MemoryStore: Not enough space to cache rdd_28_10 in memory! (computed 1302.4 KiB so far)
23/08/09 02:48:12 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_28_13 in memory.
23/08/09 02:48:12 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_28_14 in memory.
23/08/09 02:48:12 WARN MemoryStore: Not enough space to cache rdd_28_11 in memory! (computed 1301.6 KiB so far)
23/08/09 02:48:12 WARN MemoryStore: Not enough space to cache rdd_28_8 in memory! (computed 1301.6 KiB so far)
23/08/09 02:48:12 WARN MemoryStore: Not enough space to cache rdd_28

+--------------------+--------------------+--------------------+-------------------+--------------------+--------------------+------------------+----------------+---------+----------+
|         merchant_id|                 _id|          subsidiary|   transaction_date|      account_number|             user_id|transaction_amount|transaction_type|time_diff|fraudulent|
+--------------------+--------------------+--------------------+-------------------+--------------------+--------------------+------------------+----------------+---------+----------+
|838a8fa992a4aa2fb...|5d69e18742f6e6024...|ce9fcdc2a3b5e4f32...|2021-01-14 10:53:16|b1d56a01b7cd5e3c6...|8c292781ac3e59131...|       89.16682518|         CREDITO|    23125|         1|
|838a8fa992a4aa2fb...|6c7436bea08e28d52...|d4b31b123120a4eef...|2021-01-14 11:33:29|b1d56a01b7cd5e3c6...|8c292781ac3e59131...|      237.77820049|         CREDITO|     2413|         1|
|838a8fa992a4aa2fb...|ca01fe25bf00390ca...|dff70ce33784a932c...|2021-01-14 12:28

In [12]:
# Top 20 users by number of distinct accounts, now restricted to the flagged transactions.
accounts = samples.groupBy("user_id").agg(F.countDistinct("account_number").alias("account_count")).orderBy(F.desc("account_count")).limit(20)
# Total and sample standard deviation of the transaction amounts for those users.
amounts = accounts.join(samples, "user_id", "inner") \
    .groupBy("user_id") \
    .agg(F.sum("transaction_amount").alias("sum_transactions"),
         F.stddev("transaction_amount").alias("std_deviation"))
accounts.join(amounts, "user_id", "inner").orderBy(F.desc("account_count")).show(truncate=False)

23/08/09 02:50:50 WARN MemoryStore: Not enough space to cache rdd_54_123 in memory! (computed 3.0 MiB so far)
23/08/09 02:50:50 WARN MemoryStore: Not enough space to cache rdd_54_122 in memory! (computed 2.9 MiB so far)
23/08/09 02:50:55 WARN MemoryStore: Not enough space to cache rdd_54_11 in memory! (computed 2.8 MiB so far)
23/08/09 02:50:55 WARN MemoryStore: Not enough space to cache rdd_54_12 in memory! (computed 2.7 MiB so far)
23/08/09 02:51:01 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_54_83 in memory.
23/08/09 02:51:01 WARN MemoryStore: Not enough space to cache rdd_54_83 in memory! (computed 384.0 B so far)
23/08/09 02:51:01 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_54_93 in memory.
23/08/09 02:51:01 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_54_94 in memory.
23/08/09 02:51:01 WARN MemoryStore: Not enough spac

+--------------------------------+-------------+----------------+------------------+
|user_id                         |account_count|sum_transactions|std_deviation     |
+--------------------------------+-------------+----------------+------------------+
|8c292781ac3e591312d7fc5b767687ca|29           |48663.68651237  |321.5247230417246 |
|6ba67792fbc4e375fa69ed6df0e44854|18           |50596.82328270  |460.6176785325386 |
|328550eea11d2441ab258a1f07581dc8|16           |5090.83127218   |101.8017290745311 |
|7cac676a8d21f4fb7a66d4966dd3a12c|13           |82850.83103378  |374.58528610909485|
|a9fbdb2d32424dbac15edc0b17c12e23|10           |5433.23188123   |179.1143739521465 |
|f0e7c4ab7966fcb88c1834057a4e9538|10           |4641.43047297   |136.40428742818864|
|5e674596af22a66e826bf15b2a363cce|10           |38031.33201560  |9.671990273558986 |
|a71f0b8fc6267a4687430bea42310468|9            |4398.89670909   |194.54913620921513|
|be14ec758eeebe23d51338528f01ec4e|8            |3507.22845724   |

                                                                                

In [13]:
samples.groupBy("merchant_id", F.year("transaction_date").alias("transaction_year")).agg(F.sum("transaction_amount").alias("total_transactions")).show(truncate=False)

23/08/09 02:51:20 WARN MemoryStore: Not enough space to cache rdd_54_23 in memory! (computed 2.7 MiB so far)
23/08/09 02:51:20 WARN MemoryStore: Not enough space to cache rdd_54_27 in memory! (computed 2.7 MiB so far)
23/08/09 02:51:20 WARN MemoryStore: Not enough space to cache rdd_54_37 in memory! (computed 2.7 MiB so far)
23/08/09 02:51:20 WARN MemoryStore: Not enough space to cache rdd_54_39 in memory! (computed 2.8 MiB so far)
23/08/09 02:51:20 WARN MemoryStore: Not enough space to cache rdd_54_40 in memory! (computed 2.9 MiB so far)
23/08/09 02:51:20 WARN MemoryStore: Not enough space to cache rdd_54_45 in memory! (computed 2.8 MiB so far)
23/08/09 02:51:20 WARN MemoryStore: Not enough space to cache rdd_54_43 in memory! (computed 2.7 MiB so far)
23/08/09 02:51:20 WARN MemoryStore: Not enough space to cache rdd_54_44 in memory! (computed 2.8 MiB so far)
23/08/09 02:51:20 WARN MemoryStore: Not enough space to cache rdd_54_53 in memory! (computed 2.9 MiB so far)
23/08/09 02:51:20 W

+--------------------------------+----------------+-------------------+
|merchant_id                     |transaction_year|total_transactions |
+--------------------------------+----------------+-------------------+
|817d18cd3c31e40e9bff0566baae7758|2021            |1056167395.82953939|
|075d178871d8d48502bf1f54887e52fe|2021            |101802394.25971641 |
|838a8fa992a4aa2fb5a0cf8b15b63755|2021            |91317031.23030191  |
|838a8fa992a4aa2fb5a0cf8b15b63755|2020            |2743.09373133      |
|817d18cd3c31e40e9bff0566baae7758|2020            |142.66692028       |
+--------------------------------+----------------+-------------------+



                                                                                

In [14]:
samples.groupBy("subsidiary", F.year("transaction_date").alias("transaction_year")).agg(F.sum("transaction_amount").alias("total_transactions")).orderBy(F.desc("total_transactions")).show(truncate=False)

23/08/09 02:51:25 WARN MemoryStore: Not enough space to cache rdd_54_1 in memory! (computed 2.8 MiB so far)
23/08/09 02:51:25 WARN MemoryStore: Not enough space to cache rdd_54_5 in memory! (computed 2.9 MiB so far)
23/08/09 02:51:25 WARN MemoryStore: Not enough space to cache rdd_54_7 in memory! (computed 2.8 MiB so far)
23/08/09 02:51:25 WARN MemoryStore: Not enough space to cache rdd_54_6 in memory! (computed 2.8 MiB so far)
23/08/09 02:51:25 WARN MemoryStore: Not enough space to cache rdd_54_0 in memory! (computed 3.0 MiB so far)
23/08/09 02:51:25 WARN MemoryStore: Not enough space to cache rdd_54_4 in memory! (computed 2.7 MiB so far)
23/08/09 02:51:28 WARN MemoryStore: Not enough space to cache rdd_54_19 in memory! (computed 2.8 MiB so far)
23/08/09 02:51:28 WARN MemoryStore: Not enough space to cache rdd_54_18 in memory! (computed 2.8 MiB so far)
23/08/09 02:51:28 WARN MemoryStore: Not enough space to cache rdd_54_31 in memory! (computed 2.8 MiB so far)
23/08/09 02:51:28 WARN Me

+--------------------------------+----------------+------------------+
|subsidiary                      |transaction_year|total_transactions|
+--------------------------------+----------------+------------------+
|f54e0b6b32831a6307361ed959903e76|2021            |8874729.11124442  |
|8c52a2d7745e37bcee79717300f796e3|2021            |6801937.90222742  |
|d4b621a24cc03e3f92155a7e241fa1c3|2021            |6101673.09084552  |
|dff70ce33784a932ce4a7efc81a43863|2021            |5114972.10410273  |
|3458b243beebecf55605ca649b6b2ea5|2021            |2784398.76706536  |
|d4b31b123120a4eefd51ba95975f2ae4|2021            |2427512.55325734  |
|fee20d2f0753125f11b4376da5dbad4c|2021            |2308340.97746182  |
|4ecafb5dcecd6027257e8af4d9c82853|2021            |1997788.31421493  |
|464139dd69c67ebf50f2f946bc12513e|2021            |1778263.00882088  |
|4af00427a95e40c71244a8c66ec00a4b|2021            |1595586.83657984  |
|23b6e598e195241b496b81d95652870e|2021            |1543335.07702331  |
|b38bd

                                                                                

In [15]:
spark.conf.set("spark.sql.legacy.timeParserPolicy","LEGACY")

In [16]:
samples.withColumn("day_of_week", F.date_format("transaction_date", "u").cast("int")).groupBy("day_of_week").agg(F.sum("transaction_amount").alias("total_amount")).orderBy(F.desc("total_amount")).show(truncate=False)

23/08/09 02:51:39 WARN MemoryStore: Not enough space to cache rdd_54_7 in memory! (computed 2.8 MiB so far)
23/08/09 02:51:39 WARN MemoryStore: Not enough space to cache rdd_54_0 in memory! (computed 3.0 MiB so far)
23/08/09 02:51:39 WARN MemoryStore: Not enough space to cache rdd_54_5 in memory! (computed 2.9 MiB so far)
23/08/09 02:51:40 WARN MemoryStore: Not enough space to cache rdd_54_22 in memory! (computed 2.7 MiB so far)
23/08/09 02:51:41 WARN MemoryStore: Not enough space to cache rdd_54_24 in memory! (computed 2.6 MiB so far)
23/08/09 02:51:41 WARN MemoryStore: Not enough space to cache rdd_54_33 in memory! (computed 3.0 MiB so far)
23/08/09 02:51:41 WARN MemoryStore: Not enough space to cache rdd_54_45 in memory! (computed 2.8 MiB so far)
23/08/09 02:51:42 WARN MemoryStore: Not enough space to cache rdd_54_50 in memory! (computed 2.7 MiB so far)
23/08/09 02:51:42 WARN MemoryStore: Not enough space to cache rdd_54_59 in memory! (computed 2.8 MiB so far)
23/08/09 02:51:42 WARN

+-----------+------------------+
|day_of_week|total_amount      |
+-----------+------------------+
|5          |200353768.81486865|
|2          |199233760.43795012|
|3          |193374513.64689457|
|6          |192585168.39525262|
|4          |192374911.92304794|
|1          |173471101.77146669|
|7          |97896482.09072873 |
+-----------+------------------+



                                                                                

In [17]:
samples.withColumn("day_of_week", F.date_format("transaction_date", "u").cast("int")).groupBy("day_of_week", "merchant_id").agg(F.sum("transaction_amount").alias("total_amount")).orderBy(F.desc("total_amount")).show(truncate=False)

23/08/09 02:51:47 WARN MemoryStore: Not enough space to cache rdd_54_4 in memory! (computed 2.7 MiB so far)
23/08/09 02:51:47 WARN MemoryStore: Not enough space to cache rdd_54_1 in memory! (computed 2.8 MiB so far)
23/08/09 02:51:47 WARN MemoryStore: Not enough space to cache rdd_54_6 in memory! (computed 2.8 MiB so far)
23/08/09 02:51:47 WARN MemoryStore: Not enough space to cache rdd_54_2 in memory! (computed 2.8 MiB so far)
23/08/09 02:51:48 WARN MemoryStore: Not enough space to cache rdd_54_11 in memory! (computed 2.8 MiB so far)
23/08/09 02:51:48 WARN MemoryStore: Not enough space to cache rdd_54_12 in memory! (computed 2.7 MiB so far)
23/08/09 02:51:48 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_54_15 in memory.
23/08/09 02:51:48 WARN MemoryStore: Not enough space to cache rdd_54_15 in memory! (computed 384.0 B so far)
23/08/09 02:51:48 WARN MemoryStore: Not enough space to cache rdd_54_17 in memory! (computed 2.7 MiB so far

+-----------+--------------------------------+------------------+
|day_of_week|merchant_id                     |total_amount      |
+-----------+--------------------------------+------------------+
|5          |817d18cd3c31e40e9bff0566baae7758|169324400.79524863|
|6          |817d18cd3c31e40e9bff0566baae7758|166363443.97568411|
|2          |817d18cd3c31e40e9bff0566baae7758|164430818.42908537|
|4          |817d18cd3c31e40e9bff0566baae7758|161728148.17985593|
|3          |817d18cd3c31e40e9bff0566baae7758|160529662.82705967|
|1          |817d18cd3c31e40e9bff0566baae7758|144806603.09745423|
|7          |817d18cd3c31e40e9bff0566baae7758|88984461.19207173 |
|2          |075d178871d8d48502bf1f54887e52fe|18570271.49229268 |
|3          |075d178871d8d48502bf1f54887e52fe|17527230.79419976 |
|5          |075d178871d8d48502bf1f54887e52fe|16862870.03593846 |
|4          |075d178871d8d48502bf1f54887e52fe|16415471.40792900 |
|2          |838a8fa992a4aa2fb5a0cf8b15b63755|16232670.51657207 |
|3        

                                                                                

In [18]:
samples.withColumn("month_of_year", F.month("transaction_date")).groupBy("month_of_year").agg(F.sum("transaction_amount").alias("total_amount")).orderBy(F.desc("total_amount")).show(truncate=False)


23/08/09 02:51:54 WARN MemoryStore: Not enough space to cache rdd_54_0 in memory! (computed 3.0 MiB so far)
23/08/09 02:51:54 WARN MemoryStore: Not enough space to cache rdd_54_5 in memory! (computed 2.9 MiB so far)
23/08/09 02:51:54 WARN MemoryStore: Not enough space to cache rdd_54_6 in memory! (computed 2.8 MiB so far)
23/08/09 02:51:54 WARN MemoryStore: Not enough space to cache rdd_54_1 in memory! (computed 2.8 MiB so far)
23/08/09 02:51:54 WARN MemoryStore: Not enough space to cache rdd_54_7 in memory! (computed 2.8 MiB so far)
23/08/09 02:51:54 WARN MemoryStore: Not enough space to cache rdd_54_4 in memory! (computed 2.7 MiB so far)
23/08/09 02:51:54 WARN MemoryStore: Not enough space to cache rdd_54_2 in memory! (computed 2.8 MiB so far)
23/08/09 02:51:54 WARN MemoryStore: Not enough space to cache rdd_54_3 in memory! (computed 2.5 MiB so far)
23/08/09 02:51:54 WARN MemoryStore: Not enough space to cache rdd_54_19 in memory! (computed 2.8 MiB so far)
23/08/09 02:51:54 WARN Memo

+-------------+------------------+
|month_of_year|total_amount      |
+-------------+------------------+
|10           |154958126.11750532|
|11           |140981841.51045328|
|9            |136687827.38637647|
|7            |127314858.32641232|
|8            |126769608.87923124|
|6            |114819181.29481997|
|5            |98725258.28666272 |
|3            |96063429.58330312 |
|2            |94186393.56017729 |
|4            |84114595.20802232 |
|1            |74665701.16659366 |
|12           |2885.76065161     |
+-------------+------------------+



                                                                                

In [19]:
spark.stop()