# Nequi project

### 1. Based on the link provided, the files are downloaded with the AWS CLI.

In [1]:
!aws s3 cp "s3://nequi-open-data/sample_data_0006_part_00.parquet" .

download: s3://nequi-open-data/sample_data_0006_part_00.parquet to ./sample_data_0006_part_00.parquet


In [2]:
!aws s3 cp "s3://nequi-open-data/sample_data_0007_part_00.parquet" .

download: s3://nequi-open-data/sample_data_0007_part_00.parquet to ./sample_data_0007_part_00.parquet


### 2. Reading the file

#### 2.a pandas is used first to read the Parquet files

In [1]:
import pandas as pd

In [2]:
sample006 = pd.read_parquet('./sample_data_0006_part_00.parquet')
sample007 = pd.read_parquet('./sample_data_0007_part_00.parquet')

samples = pd.concat([sample006,sample007], ignore_index=True) # Since the whole data set is studied, the samples are concatenated.

With pandas, reading and concatenating the files took 45.0 seconds.
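
The notebook does not show how the 45.0 seconds were measured. A minimal sketch of one way to time a step, assuming a hypothetical helper `timed` that is not part of the original notebook, could look like this:

```python
import time

def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed_seconds). Hypothetical helper."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload so the sketch runs without the Parquet files.
total, seconds = timed(sum, range(1_000_000))
print(f"Took {seconds:.1f} seconds")
```

In the notebook the same helper could wrap the read, for example `timed(pd.read_parquet, './sample_data_0006_part_00.parquet')`. Spark evaluates lazily, so a fair comparison would probably need to time an action such as `count()` rather than `spark.read.parquet` alone.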

#### 2.b PySpark is used to improve read times

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

In [2]:
spark = SparkSession.builder.appName("NequiProject").getOrCreate()

23/08/09 21:39:47 WARN Utils: Your hostname, Danielas-MacBook-Air.local resolves to a loopback address: 127.0.0.1; using 192.168.1.8 instead (on interface en0)
23/08/09 21:39:47 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/08/09 21:39:49 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
sample006 = spark.read.parquet('./sample_data_0006_part_00.parquet')
sample007 = spark.read.parquet('./sample_data_0007_part_00.parquet')
samples = sample006.union(sample007)
samples = samples.na.drop()

                                                                                

Since PySpark was more time-efficient than pandas here, the rest of the project uses it.

In [4]:
samples.show(5)

                                                                                

+--------------------+--------------------+--------------------+-------------------+--------------------+--------------------+------------------+----------------+
|         merchant_id|                 _id|          subsidiary|   transaction_date|      account_number|             user_id|transaction_amount|transaction_type|
+--------------------+--------------------+--------------------+-------------------+--------------------+--------------------+------------------+----------------+
|075d178871d8d4850...|aa8dacff663072244...|824b2af470cbe6a65...|2021-09-12 13:32:03|648e257c9d74909a1...|ba42d192a145583ba...|      178.33365037|         CREDITO|
|075d178871d8d4850...|a53bb81bd0bba2ae2...|2d8d34be7509a6b12...|2021-09-12 13:31:58|c0b62f9046c83ea55...|5cfff960ea6d732c1...|       35.66673007|         CREDITO|
|075d178871d8d4850...|79f893ea65c06fe29...|5eeb18254850b21af...|2021-09-12 13:31:56|872d10143fc0ac7d5...|c97e63a92c82c7217...|      142.66692029|         CREDITO|
|075d178871d8d4850...|

In [5]:
samples.printSchema()

root
 |-- merchant_id: string (nullable = true)
 |-- _id: string (nullable = true)
 |-- subsidiary: string (nullable = true)
 |-- transaction_date: timestamp (nullable = true)
 |-- account_number: string (nullable = true)
 |-- user_id: string (nullable = true)
 |-- transaction_amount: decimal(24,8) (nullable = true)
 |-- transaction_type: string (nullable = true)



In [5]:
spark.conf.set("spark.sql.legacy.timeParserPolicy","LEGACY")

Sum of transactions per day of the week

In [6]:
samples.withColumn("day_of_week", F.date_format("transaction_date", "u").cast("int")).groupBy("day_of_week").agg(F.sum("transaction_amount").alias("total_amount")).orderBy(F.desc("total_amount")).show(truncate=False)



+-----------+------------------+
|day_of_week|total_amount      |
+-----------+------------------+
|2          |661109882.95579203|
|5          |650657929.53672232|
|6          |631336590.36788792|
|3          |629145744.09150287|
|4          |620943152.52094752|
|1          |590467515.02875521|
|7          |332306833.41476303|
+-----------+------------------+
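
With the `LEGACY` time parser policy, the `u` pattern follows Java's `SimpleDateFormat`, where it is the day number of the week with 1 = Monday and 7 = Sunday. Day 7 in the table above is therefore Sunday. PySpark's built-in `F.dayofweek` uses a different numbering (1 = Sunday, 7 = Saturday). The sketch below shows, in plain Python, how the two numberings relate; the function names are made up for illustration.

```python
from datetime import date

def spark_dayofweek(d: date) -> int:
    """Mimic the numbering of Spark's dayofweek: 1 = Sunday ... 7 = Saturday."""
    return d.isoweekday() % 7 + 1

def to_iso_day(spark_dow: int) -> int:
    """Convert Spark's numbering to 1 = Monday ... 7 = Sunday."""
    return (spark_dow + 5) % 7 + 1

# 2021-09-12 is the date of the first rows printed by samples.show(5).
sample_day = date(2021, 9, 12)
iso_day = to_iso_day(spark_dayofweek(sample_day))
print(sample_day.strftime("%A"), iso_day)
```

Under that assumption, an expression such as `((F.dayofweek("transaction_date") + 5) % 7) + 1` should give the same numbers as the `u` pattern without switching the parser policy, though this was not run against the data set.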



                                                                                

Checking how long a user_id string is

In [6]:
samples.select("user_id").show(5, truncate=False)

+--------------------------------+
|user_id                         |
+--------------------------------+
|ba42d192a145583ba8e7bf04875f837f|
|5cfff960ea6d732c1ba3e63d24f3be52|
|c97e63a92c82c7217b333635d75928ed|
|fc09bdd00f283222d65eaff4d00a6594|
|213527e8ba94fcaf2f9378969f9f6abc|
+--------------------------------+
only showing top 5 rows



Checking which years the data covers

In [7]:
samples.select(F.year("transaction_date").alias("Year")).distinct().show()



+----+
|Year|
+----+
|2021|
|2020|
+----+



                                                                                

Checking unique merchants

In [8]:
samples.select("merchant_id").distinct().show(truncate=False)



+--------------------------------+
|merchant_id                     |
+--------------------------------+
|075d178871d8d48502bf1f54887e52fe|
|817d18cd3c31e40e9bff0566baae7758|
|838a8fa992a4aa2fb5a0cf8b15b63755|
+--------------------------------+



                                                                                

Checking users and accounts

In [9]:
num_users = samples.select("user_id").distinct().count()
num_accounts = samples.select("account_number").distinct().count()

print(f"There are {num_users:,} unique users who have placed at least one transaction and a total of {num_accounts:,} accounts.")

There are 3,087,217 unique users who have placed at least one transaction and a total of 3,099,711 accounts.


                                                                                

Checking the total transaction value and standard deviation for the users with the most accounts

In [10]:
# Top 20 users by number of distinct accounts
accounts = samples.groupBy("user_id").agg(F.countDistinct("account_number").alias("account_count")).orderBy(F.desc("account_count")).limit(20)
# Total and sample standard deviation of the transaction amounts for those users
amounts = accounts.join(samples, "user_id", "inner") \
    .groupBy("user_id") \
    .agg(F.sum("transaction_amount").alias("sum_transactions"),
         F.stddev("transaction_amount").alias("std_deviation"))
accounts.join(amounts, 'user_id', 'inner').orderBy(F.desc("account_count")).show(truncate=False)

23/08/09 02:39:42 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.

+--------------------------------+-------------+----------------+------------------+
|user_id                         |account_count|sum_transactions|std_deviation     |
+--------------------------------+-------------+----------------+------------------+
|8c292781ac3e591312d7fc5b767687ca|43           |57794.36941102  |297.1821683130902 |
|6ba67792fbc4e375fa69ed6df0e44854|27           |59241.24976144  |441.69874871545954|
|328550eea11d2441ab258a1f07581dc8|24           |6242.86665343   |91.07442149253173 |
|a9fbdb2d32424dbac15edc0b17c12e23|23           |17405.36427600  |161.0434881880177 |
|62e1cb0b55d1d7b577292916528ef6d8|20           |1190.07989329   |23.55723555824086 |
|16100156a44d52a0ab40b2661dd648cc|17           |5350.00951085   |459.00474438452113|
|94cf8a83c7f7d2b2d33450d9ca233ef6|17           |15539.99429289  |374.8176078533819 |
|a71f0b8fc6267a4687430bea42310468|16           |13470.13505786  |249.89727904121185|
|e066a19e39bbb74e82d0d88702292b7e|13           |4761.50846469   |

                                                                                

Checking transactions for the user with the most accounts

In [11]:
samples.filter(samples["user_id"] == "8c292781ac3e591312d7fc5b767687ca").orderBy('transaction_date').show()

                                                                                

+--------------------+--------------------+--------------------+-------------------+--------------------+--------------------+------------------+----------------+
|         merchant_id|                 _id|          subsidiary|   transaction_date|      account_number|             user_id|transaction_amount|transaction_type|
+--------------------+--------------------+--------------------+-------------------+--------------------+--------------------+------------------+----------------+
|075d178871d8d4850...|2a797d11beb190624...|db413bbd87a3df94b...|2021-01-08 09:10:30|5946e4c935c965c4b...|8c292781ac3e59131...|      118.88910024|         CREDITO|
|838a8fa992a4aa2fb...|84d44445c328f3255...|8c52a2d7745e37bce...|2021-01-14 04:27:51|b1d56a01b7cd5e3c6...|8c292781ac3e59131...|       59.44455012|         CREDITO|
|838a8fa992a4aa2fb...|5d69e18742f6e6024...|ce9fcdc2a3b5e4f32...|2021-01-14 10:53:16|b1d56a01b7cd5e3c6...|8c292781ac3e59131...|       89.16682518|         CREDITO|
|838a8fa992a4aa2fb...|

Checking transaction_type unique values

In [12]:
samples.select("transaction_type").distinct().show(truncate=False)



+----------------+
|transaction_type|
+----------------+
|DEBITO          |
|CREDITO         |
+----------------+



                                                                                

Checking subsidiary distribution

In [13]:
samples.groupBy("subsidiary", F.year("transaction_date").alias("transaction_year")).agg(F.sum("transaction_amount").alias("total_transactions")).orderBy(F.desc("total_transactions")).show(truncate=False)



+--------------------------------+----------------+------------------+
|subsidiary                      |transaction_year|total_transactions|
+--------------------------------+----------------+------------------+
|8c52a2d7745e37bcee79717300f796e3|2021            |22293285.26933247 |
|d4b621a24cc03e3f92155a7e241fa1c3|2021            |18196560.97313392 |
|f54e0b6b32831a6307361ed959903e76|2021            |18027608.57745139 |
|3458b243beebecf55605ca649b6b2ea5|2021            |12223479.69223730 |
|dff70ce33784a932ce4a7efc81a43863|2021            |11941222.46639275 |
|fee20d2f0753125f11b4376da5dbad4c|2021            |9850216.26379212  |
|464139dd69c67ebf50f2f946bc12513e|2021            |8400298.85851027  |
|d4b31b123120a4eefd51ba95975f2ae4|2021            |7712461.85988958  |
|7428212cf0193f799447ec0dfe53e4a0|2021            |5093246.51878560  |
|4ecafb5dcecd6027257e8af4d9c82853|2021            |4801280.52566820  |
|4af00427a95e40c71244a8c66ec00a4b|2021            |3917039.18579283  |
|4f511

                                                                                

Checking merchants distribution

In [14]:
samples.groupBy("merchant_id", F.year("transaction_date").alias("transaction_year")).agg(F.sum("transaction_amount").alias("total_transactions")).orderBy(F.desc("total_transactions")).show(truncate=False)



+--------------------------------+----------------+-------------------+
|merchant_id                     |transaction_year|total_transactions |
+--------------------------------+----------------+-------------------+
|817d18cd3c31e40e9bff0566baae7758|2021            |3225043739.22018780|
|075d178871d8d48502bf1f54887e52fe|2021            |550735204.98759165 |
|838a8fa992a4aa2fb5a0cf8b15b63755|2021            |340173600.19185339 |
|817d18cd3c31e40e9bff0566baae7758|2020            |10117.46243062     |
|838a8fa992a4aa2fb5a0cf8b15b63755|2020            |4986.05430744      |
+--------------------------------+----------------+-------------------+



                                                                                

Checking merchants distribution along days of the week

In [16]:
samples.withColumn("day_of_week", F.date_format("transaction_date", "u").cast("int")).groupBy("day_of_week", "merchant_id").agg(F.sum("transaction_amount").alias("total_amount")).orderBy(F.desc("total_amount")).show(truncate=False)



+-----------+--------------------------------+------------------+
|day_of_week|merchant_id                     |total_amount      |
+-----------+--------------------------------+------------------+
|6          |817d18cd3c31e40e9bff0566baae7758|511639944.34732382|
|5          |817d18cd3c31e40e9bff0566baae7758|510000748.98903125|
|2          |817d18cd3c31e40e9bff0566baae7758|498828692.68367606|
|4          |817d18cd3c31e40e9bff0566baae7758|479862825.74442169|
|3          |817d18cd3c31e40e9bff0566baae7758|478057031.08982711|
|1          |817d18cd3c31e40e9bff0566baae7758|450742807.19836309|
|7          |817d18cd3c31e40e9bff0566baae7758|295921806.62997540|
|2          |075d178871d8d48502bf1f54887e52fe|101500653.96417880|
|3          |075d178871d8d48502bf1f54887e52fe|93823515.55129834 |
|5          |075d178871d8d48502bf1f54887e52fe|88593515.06636584 |
|4          |075d178871d8d48502bf1f54887e52fe|88097825.00550824 |
|1          |075d178871d8d48502bf1f54887e52fe|86474062.18459486 |
|6        

                                                                                

Checking the shape of the data

In [4]:
num_columns = len(samples.columns)
num_rows = samples.count()

print(f"Sample's shape: ({num_rows},{num_columns})")



Sample's shape: (21516918,8)


                                                                                

Checking number of duplicates

In [5]:
samples = samples.dropDuplicates().cache()
num_unique = samples.count()
if num_rows > num_unique:
    diff = num_rows - num_unique
    print(f"There are a total of {diff} duplicate samples")



There are a total of 11 duplicate samples
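
Called without arguments, `dropDuplicates()` compares whole rows, so the 11 duplicates are rows that match on every column. The plain-Python sketch below illustrates that counting logic on made-up rows; `count_duplicates` and the sample values are hypothetical and only meant to show how the result changes when a subset of columns is compared.

```python
def count_duplicates(rows, key_indexes=None):
    """Count rows that repeat an earlier row.

    With key_indexes=None whole rows are compared; otherwise only the
    columns at the given positions are compared.
    """
    seen = set()
    duplicates = 0
    for row in rows:
        key = row if key_indexes is None else tuple(row[i] for i in key_indexes)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates

# Made-up rows: (_id, user_id, transaction_amount)
rows = [
    ("a1", "u1", 10.0),
    ("a1", "u1", 10.0),   # exact copy of the first row
    ("a2", "u1", 10.0),   # same user and amount, different _id
]
full_row_duplicates = count_duplicates(rows)
same_user_amount = count_duplicates(rows, key_indexes=(1, 2))
print(full_row_duplicates, same_user_amount)
```

In PySpark the subset variant corresponds to passing column names, for example `samples.dropDuplicates(["_id"])`; whether that is appropriate depends on `_id` being a true transaction key, which the notebook does not confirm.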


                                                                                

In [6]:
samples.show(5)

                                                                                

+--------------------+--------------------+--------------------+-------------------+--------------------+--------------------+------------------+----------------+
|         merchant_id|                 _id|          subsidiary|   transaction_date|      account_number|             user_id|transaction_amount|transaction_type|
+--------------------+--------------------+--------------------+-------------------+--------------------+--------------------+------------------+----------------+
|075d178871d8d4850...|0ccf1fe8dd3333e6e...|00015fd77a0f4d869...|2021-06-18 07:08:03|a79ff75955445c687...|ebc8d3ae01b8334f3...|      356.66730074|          DEBITO|
|075d178871d8d4850...|b86f53170772d5b15...|00015fd77a0f4d869...|2021-05-05 06:28:04|c790394d677140db4...|05ee34549ddebf8c2...|      350.72284572|          DEBITO|
|075d178871d8d4850...|cfe6f6fb72e5d5ad2...|00015fd77a0f4d869...|2021-07-09 05:48:27|c6a712474dc3e8c59...|de5f51b22e24411f3...|       35.66673007|         CREDITO|
|075d178871d8d4850...|

Creating a flag column and checking the same information, this time only for fragmented transactions

In [5]:
from pyspark.sql.functions import col, unix_timestamp, lag, when
from pyspark.sql.window import Window

In [6]:
window_spec = Window().partitionBy("user_id").orderBy("transaction_date")

In [7]:
samples = samples.withColumn("time_diff",unix_timestamp("transaction_date") - unix_timestamp(lag("transaction_date").over(window_spec)))

In [8]:
samples = samples.withColumn("fraudulent",when((col("time_diff") < 86400) | (lag("time_diff").over(window_spec) < 86400), 1).otherwise(0)).cache()

In [9]:
samples = samples.where(samples["fraudulent"] == 1).cache()
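
To make the flagging rule easier to follow, here is a plain-Python sketch of the same logic for a single user: `time_diff` is the gap in seconds to the previous transaction, and a row is flagged when its own gap or the previous row's gap is under 24 hours (86400 seconds). The function name is made up, and the four timestamps are the ones printed earlier for user `8c292781ac3e591312d7fc5b767687ca`.

```python
from datetime import datetime

DAY_SECONDS = 86400

def flag_fragments(timestamps):
    """Return (timestamp, time_diff, flag) tuples for one user's transactions."""
    ordered = sorted(timestamps)
    # The first row has no previous transaction, like lag() returning null.
    diffs = [None]
    for prev, cur in zip(ordered, ordered[1:]):
        diffs.append(int((cur - prev).total_seconds()))
    rows = []
    for i, ts in enumerate(ordered):
        own = diffs[i]
        previous = diffs[i - 1] if i > 0 else None
        close = (own is not None and own < DAY_SECONDS) or (
            previous is not None and previous < DAY_SECONDS)
        rows.append((ts, own, 1 if close else 0))
    return rows

user_timestamps = [
    datetime(2021, 1, 8, 9, 10, 30),
    datetime(2021, 1, 14, 4, 27, 51),
    datetime(2021, 1, 14, 10, 53, 16),
    datetime(2021, 1, 14, 11, 33, 29),
]
flagged = flag_fragments(user_timestamps)
for ts, diff, flag in flagged:
    print(ts, diff, flag)
```

In this sketch the 04:27:51 transaction, which starts the burst on 2021-01-14, is not flagged because only earlier gaps are inspected. If the first transaction of a burst should also count, comparing against `lead("time_diff")` instead of `lag("time_diff")` would be one option; that is an assumption about the intent, not something the notebook states.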

In [10]:
samples.filter(samples["user_id"] == "8c292781ac3e591312d7fc5b767687ca").orderBy('transaction_date').show()

                                                                                

+--------------------+--------------------+--------------------+-------------------+--------------------+--------------------+------------------+----------------+---------+----------+
|         merchant_id|                 _id|          subsidiary|   transaction_date|      account_number|             user_id|transaction_amount|transaction_type|time_diff|fraudulent|
+--------------------+--------------------+--------------------+-------------------+--------------------+--------------------+------------------+----------------+---------+----------+
|838a8fa992a4aa2fb...|5d69e18742f6e6024...|ce9fcdc2a3b5e4f32...|2021-01-14 10:53:16|b1d56a01b7cd5e3c6...|8c292781ac3e59131...|       89.16682518|         CREDITO|    23125|         1|
|838a8fa992a4aa2fb...|6c7436bea08e28d52...|d4b31b123120a4eef...|2021-01-14 11:33:29|b1d56a01b7cd5e3c6...|8c292781ac3e59131...|      237.77820049|         CREDITO|     2413|         1|
|838a8fa992a4aa2fb...|ca01fe25bf00390ca...|dff70ce33784a932c...|2021-01-14 12:28

In [11]:
# Same aggregation as before, now restricted to the flagged transactions
accounts = samples.groupBy("user_id").agg(F.countDistinct("account_number").alias("account_count")).orderBy(F.desc("account_count")).limit(20)
amounts = accounts.join(samples, "user_id", "inner") \
    .groupBy("user_id") \
    .agg(F.sum("transaction_amount").alias("sum_transactions"),
         F.stddev("transaction_amount").alias("std_deviation"))
accounts.join(amounts, 'user_id', 'inner').orderBy(F.desc("account_count")).show(truncate=False)

23/08/09 20:49:31 WARN MemoryStore: Not enough space to cache rdd_25_134 in memory! (computed 2.7 MiB so far)
23/08/09 20:49:31 WARN MemoryStore: Not enough space to cache rdd_25_139 in memory! (computed 2.9 MiB so far)
23/08/09 20:49:31 WARN MemoryStore: Not enough space to cache rdd_25_143 in memory! (computed 2.7 MiB so far)
23/08/09 20:49:34 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_25_7 in memory.
23/08/09 20:49:34 WARN MemoryStore: Not enough space to cache rdd_25_7 in memory! (computed 384.0 B so far)
23/08/09 20:49:37 WARN MemoryStore: Not enough space to cache rdd_25_90 in memory! (computed 2.8 MiB so far)
23/08/09 20:49:38 WARN MemoryStore: Not enough space to cache rdd_25_115 in memory! (computed 2.7 MiB so far)
23/08/09 20:49:42 WARN MemoryStore: Not enough space to cache rdd_25_10 in memory! (computed 2.8 MiB so far)
23/08/09 20:49:42 WARN MemoryStore: Not enough space to cache rdd_25_14 in memory! (computed 2.7 MiB 

+--------------------------------+-------------+----------------+------------------+
|user_id                         |account_count|sum_transactions|std_deviation     |
+--------------------------------+-------------+----------------+------------------+
|8c292781ac3e591312d7fc5b767687ca|36           |52475.27106616  |312.07761869093906|
|6ba67792fbc4e375fa69ed6df0e44854|24           |54537.99695582  |455.8318585950875 |
|328550eea11d2441ab258a1f07581dc8|22           |5640.09891523   |93.4755689489242  |
|a9fbdb2d32424dbac15edc0b17c12e23|16           |10331.46281136  |179.88379422520907|
|7cac676a8d21f4fb7a66d4966dd3a12c|13           |89395.67956901  |380.43454266565084|
|f0e7c4ab7966fcb88c1834057a4e9538|11           |4812.63077727   |131.37769223055886|
|a71f0b8fc6267a4687430bea42310468|11           |10700.01902214  |229.8234893855068 |
|16100156a44d52a0ab40b2661dd648cc|11           |1489.68042590   |49.18184116848666 |
|94cf8a83c7f7d2b2d33450d9ca233ef6|10           |11476.36484656  |

In [12]:
samples.groupBy("merchant_id", F.year("transaction_date").alias("transaction_year")).agg(F.sum("transaction_amount").alias("total_transactions")).show(truncate=False)

23/08/09 20:49:47 WARN MemoryStore: Not enough space to cache rdd_25_14 in memory! (computed 2.7 MiB so far)
23/08/09 20:49:47 WARN MemoryStore: Not enough space to cache rdd_25_8 in memory! (computed 2.7 MiB so far)
23/08/09 20:49:48 WARN MemoryStore: Not enough space to cache rdd_25_23 in memory! (computed 2.7 MiB so far)
23/08/09 20:49:48 WARN MemoryStore: Not enough space to cache rdd_25_20 in memory! (computed 2.7 MiB so far)
23/08/09 20:49:48 WARN MemoryStore: Not enough space to cache rdd_25_30 in memory! (computed 2.9 MiB so far)
23/08/09 20:49:48 WARN MemoryStore: Not enough space to cache rdd_25_34 in memory! (computed 2.7 MiB so far)
23/08/09 20:49:48 WARN MemoryStore: Not enough space to cache rdd_25_37 in memory! (computed 2.6 MiB so far)
23/08/09 20:49:49 WARN MemoryStore: Not enough space to cache rdd_25_45 in memory! (computed 2.8 MiB so far)
23/08/09 20:49:49 WARN MemoryStore: Not enough space to cache rdd_25_46 in memory! (computed 2.9 MiB so far)
23/08/09 20:49:49 WA

+--------------------------------+----------------+-------------------+
|merchant_id                     |transaction_year|total_transactions |
+--------------------------------+----------------+-------------------+
|817d18cd3c31e40e9bff0566baae7758|2021            |1056439984.75858185|
|075d178871d8d48502bf1f54887e52fe|2021            |101918185.82453112 |
|838a8fa992a4aa2fb5a0cf8b15b63755|2021            |91375366.24313956  |
|838a8fa992a4aa2fb5a0cf8b15b63755|2020            |2743.09373133      |
|817d18cd3c31e40e9bff0566baae7758|2020            |142.66692028       |
+--------------------------------+----------------+-------------------+



                                                                                

In [13]:
samples.groupBy("subsidiary", F.year("transaction_date").alias("transaction_year")).agg(F.sum("transaction_amount").alias("total_transactions")).orderBy(F.desc("total_transactions")).show(truncate=False)

23/08/09 20:49:56 WARN MemoryStore: Not enough space to cache rdd_25_4 in memory! (computed 2.6 MiB so far)
23/08/09 20:49:56 WARN MemoryStore: Not enough space to cache rdd_25_6 in memory! (computed 2.8 MiB so far)
23/08/09 20:49:56 WARN MemoryStore: Not enough space to cache rdd_25_7 in memory! (computed 2.7 MiB so far)
23/08/09 20:49:56 WARN MemoryStore: Not enough space to cache rdd_25_1 in memory! (computed 2.6 MiB so far)
23/08/09 20:49:56 WARN MemoryStore: Not enough space to cache rdd_25_0 in memory! (computed 2.7 MiB so far)
23/08/09 20:49:56 WARN MemoryStore: Not enough space to cache rdd_25_9 in memory! (computed 2.9 MiB so far)
23/08/09 20:49:56 WARN MemoryStore: Not enough space to cache rdd_25_12 in memory! (computed 2.7 MiB so far)
23/08/09 20:49:57 WARN MemoryStore: Not enough space to cache rdd_25_19 in memory! (computed 2.8 MiB so far)
23/08/09 20:49:57 WARN MemoryStore: Not enough space to cache rdd_25_20 in memory! (computed 2.7 MiB so far)
23/08/09 20:49:57 WARN Me

+--------------------------------+----------------+------------------+
|subsidiary                      |transaction_year|total_transactions|
+--------------------------------+----------------+------------------+
|f54e0b6b32831a6307361ed959903e76|2021            |8890792.33527857  |
|8c52a2d7745e37bcee79717300f796e3|2021            |6803649.54860358  |
|d4b621a24cc03e3f92155a7e241fa1c3|2021            |6103533.20474100  |
|dff70ce33784a932ce4a7efc81a43863|2021            |5116042.10600490  |
|3458b243beebecf55605ca649b6b2ea5|2021            |2787199.79426710  |
|d4b31b123120a4eefd51ba95975f2ae4|2021            |2428375.54545818  |
|fee20d2f0753125f11b4376da5dbad4c|2021            |2309214.81234861  |
|4ecafb5dcecd6027257e8af4d9c82853|2021            |1997835.86985502  |
|464139dd69c67ebf50f2f946bc12513e|2021            |1778381.89792112  |
|4af00427a95e40c71244a8c66ec00a4b|2021            |1595634.39221993  |
|23b6e598e195241b496b81d95652870e|2021            |1543335.07702331  |
|b38bd

In [14]:
spark.conf.set("spark.sql.legacy.timeParserPolicy","LEGACY")

In [15]:
samples.withColumn("day_of_week", F.date_format("transaction_date", "u").cast("int")).groupBy("day_of_week").agg(F.sum("transaction_amount").alias("total_amount")).orderBy(F.desc("total_amount")).show(truncate=False)

23/08/09 20:50:09 WARN MemoryStore: Not enough space to cache rdd_25_7 in memory! (computed 2.7 MiB so far)
23/08/09 20:50:09 WARN MemoryStore: Not enough space to cache rdd_25_2 in memory! (computed 2.6 MiB so far)
23/08/09 20:50:09 WARN MemoryStore: Not enough space to cache rdd_25_13 in memory! (computed 2.8 MiB so far)
23/08/09 20:50:10 WARN MemoryStore: Not enough space to cache rdd_25_19 in memory! (computed 2.8 MiB so far)
23/08/09 20:50:10 WARN MemoryStore: Not enough space to cache rdd_25_24 in memory! (computed 3.0 MiB so far)
23/08/09 20:50:10 WARN MemoryStore: Not enough space to cache rdd_25_30 in memory! (computed 2.9 MiB so far)
23/08/09 20:50:10 WARN MemoryStore: Not enough space to cache rdd_25_45 in memory! (computed 2.8 MiB so far)
23/08/09 20:50:10 WARN MemoryStore: Not enough space to cache rdd_25_43 in memory! (computed 2.8 MiB so far)
23/08/09 20:50:11 WARN MemoryStore: Not enough space to cache rdd_25_56 in memory! (computed 3.0 MiB so far)
23/08/09 20:50:11 WAR

+-----------+------------------+
|day_of_week|total_amount      |
+-----------+------------------+
|5          |200417772.89751891|
|2          |199312059.24311286|
|3          |193446611.65787820|
|6          |192640704.31164829|
|4          |192451290.38153017|
|1          |173534984.15091172|
|7          |97932999.94430399 |
+-----------+------------------+



                                                                                

In [16]:
samples.withColumn("day_of_week", F.date_format("transaction_date", "u").cast("int")).groupBy("day_of_week", "merchant_id").agg(F.sum("transaction_amount").alias("total_amount")).orderBy(F.desc("total_amount")).show(truncate=False)

23/08/09 20:50:16 WARN MemoryStore: Not enough space to cache rdd_25_7 in memory! (computed 2.7 MiB so far)
23/08/09 20:50:16 WARN MemoryStore: Not enough space to cache rdd_25_5 in memory! (computed 2.8 MiB so far)
23/08/09 20:50:16 WARN MemoryStore: Not enough space to cache rdd_25_4 in memory! (computed 2.6 MiB so far)
23/08/09 20:50:16 WARN MemoryStore: Not enough space to cache rdd_25_2 in memory! (computed 2.6 MiB so far)
23/08/09 20:50:16 WARN MemoryStore: Not enough space to cache rdd_25_1 in memory! (computed 2.6 MiB so far)
23/08/09 20:50:16 WARN MemoryStore: Not enough space to cache rdd_25_6 in memory! (computed 2.8 MiB so far)
23/08/09 20:50:16 WARN MemoryStore: Not enough space to cache rdd_25_3 in memory! (computed 2.9 MiB so far)
23/08/09 20:50:16 WARN MemoryStore: Not enough space to cache rdd_25_0 in memory! (computed 2.7 MiB so far)
23/08/09 20:50:17 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_25_23 in memory.
23

+-----------+--------------------------------+------------------+
|day_of_week|merchant_id                     |total_amount      |
+-----------+--------------------------------+------------------+
|5          |817d18cd3c31e40e9bff0566baae7758|169367747.76119812|
|6          |817d18cd3c31e40e9bff0566baae7758|166396768.59048266|
|2          |817d18cd3c31e40e9bff0566baae7758|164481857.51982079|
|4          |817d18cd3c31e40e9bff0566baae7758|161776001.04270479|
|3          |817d18cd3c31e40e9bff0566baae7758|160566114.22519479|
|1          |817d18cd3c31e40e9bff0566baae7758|144842162.82733766|
|7          |817d18cd3c31e40e9bff0566baae7758|89009475.45876332 |
|2          |075d178871d8d48502bf1f54887e52fe|18587560.70191732 |
|3          |075d178871d8d48502bf1f54887e52fe|17550301.93743669 |
|5          |075d178871d8d48502bf1f54887e52fe|16876837.60299148 |
|4          |075d178871d8d48502bf1f54887e52fe|16434496.81452908 |
|2          |838a8fa992a4aa2fb5a0cf8b15b63755|16242641.02137475 |
|3        

                                                                                

In [17]:
samples.withColumn("month_of_year", F.month("transaction_date")).groupBy("month_of_year").agg(F.sum("transaction_amount").alias("total_amount")).orderBy(F.desc("total_amount")).show(truncate=False)


23/08/09 20:50:23 WARN MemoryStore: Not enough space to cache rdd_25_3 in memory! (computed 2.9 MiB so far)
23/08/09 20:50:23 WARN MemoryStore: Not enough space to cache rdd_25_5 in memory! (computed 2.8 MiB so far)
23/08/09 20:50:23 WARN MemoryStore: Not enough space to cache rdd_25_2 in memory! (computed 2.6 MiB so far)
23/08/09 20:50:24 WARN MemoryStore: Not enough space to cache rdd_25_9 in memory! (computed 2.9 MiB so far)
23/08/09 20:50:24 WARN MemoryStore: Not enough space to cache rdd_25_17 in memory! (computed 2.8 MiB so far)
23/08/09 20:50:24 WARN MemoryStore: Not enough space to cache rdd_25_16 in memory! (computed 2.8 MiB so far)
23/08/09 20:50:24 WARN MemoryStore: Not enough space to cache rdd_25_25 in memory! (computed 2.7 MiB so far)
23/08/09 20:50:24 WARN MemoryStore: Not enough space to cache rdd_25_32 in memory! (computed 2.8 MiB so far)
23/08/09 20:50:24 WARN MemoryStore: Not enough space to cache rdd_25_35 in memory! (computed 2.8 MiB so far)
23/08/09 20:50:25 WARN 

+-------------+------------------+
|month_of_year|total_amount      |
+-------------+------------------+
|10           |155010653.09938550|
|11           |141089887.03784305|
|9            |136740682.98367422|
|7            |127350422.81185943|
|8            |126816161.45199047|
|6            |114857102.16223392|
|5            |98753247.81253220 |
|3            |96085401.47910819 |
|2            |94219262.32206514 |
|4            |84138166.75548230 |
|1            |74672548.91007811 |
|12           |2885.76065161     |
+-------------+------------------+



                                                                                

In [19]:
spark.stop()