### Simulated Transactions - Dataset Analysis

The goal of this notebook is to analyze and extract useful information from the [Kaggle simulated-transactions dataset](https://www.kaggle.com/datasets/conorsully1/simulated-transactions).

The dataset contains ~22 GB of data representing randomly generated transactions.
The data was generated with [this notebook](https://github.com/conorosully/medium-articles/blob/master/src/transaction_data_generator.ipynb).

Transactions are generated for 75,000 customers and are classified into 12 expenditure types:

- Groceries
- Clothing
- Housing
- Education
- Health
- Motor/Travel
- Entertainment
- Gambling
- Savings
- Bills and Utilities
- Tax
- Fines

Each transaction is represented by 10 features/columns:

- <strong> CUST_ID</strong>: unique ID for every customer
- <strong> START_DATE</strong>: the month the customer started making transactions
- <strong> END_DATE</strong>: the month the customer stopped making transactions
- <strong> TRANS_ID</strong>: unique ID for every transaction
- <strong> DATE</strong>: the date of the transaction
- <strong> YEAR</strong>: the year of the transaction
- <strong> MONTH</strong>: the month of the transaction
- <strong> DAY</strong>: the day of the transaction
- <strong> EXP_TYPE</strong>: the expenditure type (listed above)
- <strong> AMOUNT</strong>: the amount of the transaction in dollars $

# Cluster configuration

Let's start by enabling access to the Spark UI and setting the proper cluster configuration, as seen in the labs.

- 3 executors with 3 cores each (leave 1 for daemons; and there's also the AMP)
- 8G of memory per executor (recommended 11G, but it exceeds YARN's default maximum allowed in this EMR cluster)
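
As a rough sanity check of this configuration, the sketch below (plain Scala, no Spark required) works out how many tasks can run at the same time and how much executor memory is requested in total. It assumes Spark's default of one core per task; the variable names are illustrative and simply mirror the values listed above.

```scala
// Values taken from the configuration described above
val numExecutors = 3
val executorCores = 3
val executorMemoryGb = 8

// With one core per task (assumed default), each core is one task slot
val taskSlots = numExecutors * executorCores
// Total memory requested for executors, overheads excluded
val totalExecutorMemoryGb = numExecutors * executorMemoryGb

println(s"$taskSlots task slots, $totalExecutorMemoryGb GB of executor memory")
```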

In [1]:
sc.applicationId
"SPARK UI: Enable forwarding of port 20888 and connect to http://localhost:20888/proxy/" + sc.applicationId + "/"

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
0,application_1675957348150_0002,spark,idle,Link,Link,✔


SparkSession available as 'spark'.


res1: String = application_1675957348150_0002
res2: String = SPARK UI: Enable forwarding of port 20888 and connect to http://localhost:20888/proxy/application_1675957348150_0002/


In [2]:
%%configure -f
{"executorMemory":"8G", "numExecutors":3, "executorCores":3, "conf": {"spark.dynamicAllocation.enabled": "false"}}

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
1,application_1675957348150_0003,spark,idle,Link,Link,✔


SparkSession available as 'spark'.



# Dataset Preprocessing 

After analyzing the dataset, I decided to:
- Remove the **DATE** column, because it only combines the YEAR, MONTH and DAY columns, which are more useful for the following jobs.
- Create some smaller datasets to test the correctness of the functions.


In [5]:
// Name of bucket in s3
val bucketname = "unibo-bd2122-arettaroli"
// Paths of datasets
// S3 path of simulated transactions dataset
val s3_path_dataset = "s3a://"+bucketname+"/exam-dataset/transactions.csv"
// S3 path of cleaned dataset without column "date"
val s3_path_dataset_cleaned = "s3a://"+bucketname+"/exam-dataset/transactions-cleaned.csv"
// S3 path of dataset without column "date" with 30% of data for optimization
val s3_path_dataset_small = "s3a://"+bucketname+"/exam-dataset/transactions-small.csv"
// S3 path of dataset without column "date" with 10 rows (to test the correct functioning of the jobs)
val s3_path_dataset_smallest = "s3a://"+bucketname+"/exam-dataset/transactions-smallest.csv"

bucketname: String = unibo-bd2122-arettaroli
s3_path_dataset: String = s3a://unibo-bd2122-arettaroli/exam-dataset/transactions.csv
s3_path_dataset_cleaned: String = s3a://unibo-bd2122-arettaroli/exam-dataset/transactions-cleaned.csv
s3_path_dataset_small: String = s3a://unibo-bd2122-arettaroli/exam-dataset/transactions-small.csv
s3_path_dataset_smallest: String = s3a://unibo-bd2122-arettaroli/exam-dataset/transactions-smallest.csv


**ALERT:** The next three cells need to be executed only the first time.

In [4]:
// Read the simulated transactions dataset, drop the DATE column (_c4), then save the result to another csv.
spark.
  read.
  csv(s3_path_dataset).
  drop("_c4").
  write.
  csv(s3_path_dataset_cleaned)

In [5]:
// Read the cleaned csv and keep only about 30% of the rows, for optimization tests
// Number of transactions: 261969719 => 30% of transactions is 78,590,915.7, rounded down to 78500000
spark.
  read.
  csv(s3_path_dataset_cleaned).
  limit(78500000). 
  write.
  csv(s3_path_dataset_small)

In [6]:
// Save a small set of data to check the results of the following jobs
spark.
  read.
  csv(s3_path_dataset_cleaned).
  limit(10).
  write.
  csv(s3_path_dataset_smallest)

# Dataset preparation


To read the transactions, it is useful to define an object **TransactionsParser** that parses each line of the csv file.

Values in the csv are separated by ',' (comma), so the fields of a line can be retrieved by splitting on commas; the regular expression used below only splits on commas that are not enclosed in double quotes.
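
To illustrate how the quote-aware separator behaves, here is a minimal sketch in plain Scala (no Spark needed). The first line follows the format of the cleaned dataset; the second line, with a quoted field containing a comma, is a hypothetical example that does not come from the dataset.

```scala
// Same regular expression as the one defined in TransactionsParser
val commaRegex = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"

// A line in the format of the cleaned dataset: no quotes, so this is a plain split
val plain = "CPVZ2MIAO3,2010-07-01,2017-01-01,T18P9KZ3O350W73,2012,4,5,Entertainment,21.48"
val plainFields = plain.split(commaRegex)

// Hypothetical line: the comma inside the quoted field must not split it
val quoted = "C1,\"Bills, Utilities\",12.5"
val quotedFields = quoted.split(commaRegex)

println(plainFields.length)
println(quotedFields.mkString(" | "))
```

Note that the surrounding quotes are kept in the resulting field; they would have to be stripped separately if quoted values were present.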


In [3]:
import java.util.Calendar
import org.apache.spark.sql.SaveMode
import org.apache.spark.HashPartitioner

object TransactionsParser {
  // Fields are separated by ','; the regex ignores commas enclosed in double quotes
  val commaRegex = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"
    
  type CustomerId = String
  type StartDate = String
  type EndDate = String
  type TransactionId = String 
  // type TransactionDate = String
  type Year = Int
  type Month = Int
  type Day = Int
  type ExpenditureType = String
  type Amount = Double
  
  def parseTransactionLine(line: String): Option[(CustomerId, StartDate, EndDate, TransactionId, Year, Month, Day, ExpenditureType, Amount)] = {
    try {
      val input = line.split(commaRegex) // Splitting on comma
      if (input(0) == "CUST_ID") { // Discard the header line
        None
      } else {
        Some((input(0).trim, input(1).trim, input(2).trim, input(3).trim,
              input(4).trim.toInt, input(5).trim.toInt, input(6).trim.toInt, input(7).trim, input(8).trim.toDouble))
      }
    } catch {
      case _: Exception => None
    }
  }
}

import java.util.Calendar
import org.apache.spark.sql.SaveMode
import org.apache.spark.HashPartitioner
defined object TransactionsParser
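
Because the parser returns an `Option`, applying it with `flatMap` silently drops the header and any malformed line. The sketch below shows this on a local collection; `parseLine` is a hypothetical, simplified stand-in for `parseTransactionLine` that keeps only three fields, and the third input line is made up to trigger a parsing failure.

```scala
import scala.util.Try

// Hypothetical simplified parser: keeps only (CUST_ID, YEAR, AMOUNT)
def parseLine(line: String): Option[(String, Int, Double)] = {
  val f = line.split(",")
  if (f(0) == "CUST_ID") None // header line
  else Try((f(0).trim, f(4).trim.toInt, f(8).trim.toDouble)).toOption // None on malformed input
}

val lines = Seq(
  "CUST_ID,START_DATE,END_DATE,TRANS_ID,YEAR,MONTH,DAY,EXP_TYPE,AMOUNT",
  "CPVZ2MIAO3,2010-07-01,2017-01-01,T18P9KZ3O350W73,2012,4,5,Entertainment,21.48",
  "this,line,is,malformed"
)

// flatMap flattens the Options: Some(x) contributes x, None contributes nothing
val parsed = lines.flatMap(line => parseLine(line))
parsed.foreach(println)
```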


//TODO: write about partitions and dataset dimension

In [7]:
import org.apache.spark.storage.StorageLevel._

// Each row is flatMapped with parseTransactionLine method

//val rddTransactionsOriginal = sc.textFile(s3_path_dataset).flatMap(TransactionsParser.parseTransactionLine)
//val rddTransactions = sc.textFile(s3_path_dataset_small).flatMap(TransactionsParser.parseTransactionLine).coalesce(60)
val rddTransactions = sc.textFile(s3_path_dataset_cleaned).flatMap(TransactionsParser.parseTransactionLine).coalesce(300)
// Persist on memory and disk with serialization
val diskMemoryRdd = rddTransactions.persist(MEMORY_AND_DISK_SER) 

// Same for the smallest (10 rows)
val rddTransactions_smallest = sc.textFile(s3_path_dataset_smallest).flatMap(TransactionsParser.parseTransactionLine)
// Persist on memory and disk with serialization
val diskMemoryRdd_smallest = rddTransactions_smallest.persist(MEMORY_AND_DISK_SER) 

import org.apache.spark.storage.StorageLevel._
rddTransactions: org.apache.spark.rdd.RDD[(TransactionsParser.CustomerId, TransactionsParser.StartDate, TransactionsParser.EndDate, TransactionsParser.TransactionId, TransactionsParser.Year, TransactionsParser.Month, TransactionsParser.Day, TransactionsParser.ExpenditureType, TransactionsParser.Amount)] = CoalescedRDD[3] at coalesce at <console>:43
diskMemoryRdd: rddTransactions.type = CoalescedRDD[3] at coalesce at <console>:43
rddTransactions_smallest: org.apache.spark.rdd.RDD[(TransactionsParser.CustomerId, TransactionsParser.StartDate, TransactionsParser.EndDate, TransactionsParser.TransactionId, TransactionsParser.Year, TransactionsParser.Month, TransactionsParser.Day, TransactionsParser.ExpenditureType, TransactionsParser.Amount)] = MapPartitionsRDD[6] at flatMap at <console>:40
diskMemoryRdd_smallest: rddTransactions_smallest.type = MapPartitionsRDD[6] at flatMap at <console>:40


In [8]:
//data preview
diskMemoryRdd_smallest.
    collect().
    foreach(transaction => print(transaction + "\n"))

(CPVZ2MIAO3,2010-07-01,2017-01-01,T18P9KZ3O350W73,2012,4,5,Entertainment,21.48)
(CPVZ2MIAO3,2010-07-01,2017-01-01,T2N065QH3L11AZS,2015,5,24,Entertainment,17.25)
(CPVZ2MIAO3,2010-07-01,2017-01-01,TRNQXHP2B5T069M,2014,3,5,Education,412.9)
(CPVZ2MIAO3,2010-07-01,2017-01-01,T9PQMGLRKOAI2SJ,2012,9,13,Entertainment,19.18)
(CPVZ2MIAO3,2010-07-01,2017-01-01,TS5UMAAB2WWPNXW,2016,5,13,Entertainment,18.66)
(CPVZ2MIAO3,2010-07-01,2017-01-01,T48ZCBVO5UW71NC,2013,12,9,Groceries,63.61)
(CPVZ2MIAO3,2010-07-01,2017-01-01,T8K9FUYYV2OY9WB,2015,12,4,Entertainment,45.66)
(CPVZ2MIAO3,2010-07-01,2017-01-01,T2Y5VJ9D579QGTB,2012,8,5,Entertainment,49.35)
(CPVZ2MIAO3,2010-07-01,2017-01-01,T13QP1402CVF9PW,2012,9,22,Groceries,56.19)
(CPVZ2MIAO3,2010-07-01,2017-01-01,TY7N8BTA0ASGXO9,2011,7,18,Groceries,64.94)


**DEBUG: use to see the status of the RDD partitions**: the cell below shows the storage status of the persisted RDDs.

In [24]:
sc.getRDDStorageInfo.foreach(x=> print(x + "\n")) 

RDD "MapPartitionsRDD" (69) StorageLevel: StorageLevel(disk, memory, 1 replicas); CachedPartitions: 2; TotalPartitions: 2; MemorySize: 0.0 B; DiskSize: 1662.0 B
RDD "CoalescedRDD" (73) StorageLevel: StorageLevel(disk, memory, 1 replicas); CachedPartitions: 300; TotalPartitions: 300; MemorySize: 0.0 B; DiskSize: 71.3 GiB
RDD "MapPartitionsRDD" (78) StorageLevel: StorageLevel(disk, memory, 1 replicas); CachedPartitions: 300; TotalPartitions: 300; MemorySize: 0.0 B; DiskSize: 1499.0 MiB
RDD "MapPartitionsRDD" (80) StorageLevel: StorageLevel(disk, memory, 1 replicas); CachedPartitions: 300; TotalPartitions: 300; MemorySize: 0.0 B; DiskSize: 4.3 GiB
RDD "MapPartitionsRDD" (76) StorageLevel: StorageLevel(disk, memory, 1 replicas); CachedPartitions: 2; TotalPartitions: 2; MemorySize: 0.0 B; DiskSize: 1662.0 B
RDD "MapPartitionsRDD" (77) StorageLevel: StorageLevel(disk, memory, 1 replicas); CachedPartitions: 300; TotalPartitions: 300; MemorySize: 0.0 B; DiskSize: 3.9 GiB
RDD "MapPartitionsRDD"

**Use only to clean RDDs in memory**: the cell below unpersists all the RDDs that are currently cached.

In [None]:
sc.getPersistentRDDs.foreach(_._2.unpersist())

# Dataset exploration

In this section, several exploration queries are run on the dataset, along with more complex jobs that extract useful information from the data.

To drastically improve performance, the data used by several jobs is cached in memory and on disk.

In [9]:
// Caching of customers, amounts, years and expenditure types for following jobs to improve performance
val cachedCustomer = rddTransactions.map(x=>x._1).persist(MEMORY_AND_DISK_SER)
val cachedAmount = rddTransactions.map(x=>x._9).persist(MEMORY_AND_DISK_SER)
val cachedYear = rddTransactions.map(x=>x._5).persist(MEMORY_AND_DISK_SER)
val cachedExpenditureType = rddTransactions.map(x=>x._8).persist(MEMORY_AND_DISK_SER)

cachedCustomer: org.apache.spark.rdd.RDD[TransactionsParser.CustomerId] = MapPartitionsRDD[7] at map at <console>:35
cachedAmount: org.apache.spark.rdd.RDD[TransactionsParser.Amount] = MapPartitionsRDD[8] at map at <console>:34
cachedYear: org.apache.spark.rdd.RDD[TransactionsParser.Year] = MapPartitionsRDD[9] at map at <console>:34
cachedExpenditureType: org.apache.spark.rdd.RDD[TransactionsParser.ExpenditureType] = MapPartitionsRDD[10] at map at <console>:34


### Exploring dataset dimensions

To get a sense of the size of the dataset:
1. How many transactions are there?
2. How many distinct customers?
3. How many distinct expenditure types?
4. How many distinct years?
5. What are the minimum and maximum amounts?
6. Which range of years is covered?

In [11]:
"1. Number of transactions: " + diskMemoryRdd.count() // Each row is a transaction => 261.969.719 already without header line
"2. Number of distinct customers: " + cachedCustomer.distinct().count() // 75000
"3. Number of distinct expenditure type: " + cachedExpenditureType.distinct().count() // 12 
"4. Number of distinct years: " + cachedYear.distinct().count() // 11
"5. Range of amount: " + cachedAmount.min() + " $ to " + cachedAmount.max() +" $" //0.12 to 6519.61
"6. From year: " + cachedYear.min() + " to " + cachedYear.max() // 2010 to 2020

res31: String = 1. Number of transactions: 261969718
res32: String = 2. Number of distinct customers: 75000
res33: String = 3. Number of distinct expenditure type: 12
res34: String = 4. Number of distinct years: 11
res35: String = 5. Range of amount: 0.12 $ to 6519.61 $
res36: String = 6. From year: 2010 to 2020


We can learn more about this dataset by answering more complex questions, such as:

7. Which years appear in the dataset, in ascending order?
8. What is the average amount?
9. What is the average amount for each year?
10. What is the average amount for each expenditure type?
11. What is the maximum amount for each expenditure type?

In [12]:
//7. Which years appear in the dataset, in ascending order?
cachedYear.
    distinct().
    collect().
    sorted.
    foreach(year => print(year + "\n"))

2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020


In [16]:
//8. What is the average amount?
"Amount on average: " + cachedAmount.mean()

res46: String = Amount on average: 81.5629738836724


In [15]:
//9. What is the average amount for each year?
diskMemoryRdd.
    map(x => (x._5, x._9)). //take YEAR and AMOUNT
    aggregateByKey((0.0,0.0))((a,v)=>(a._1+v,a._2+1),(a1,a2)=>(a1._1+a2._1,a1._2+a2._2)). //within a partition: add AMOUNT to the first value and count it in the second; then merge the partial (sum, count) pairs across partitions
    map({case(k,v)=>(k,v._1/v._2)}). //compute the average for each YEAR (key)
    sortByKey(). //order by YEAR
    collect().
    foreach(result => print(result + "\n"))

(2010,58.13171661388978)
(2011,59.74314431061093)
(2012,62.611076607244456)
(2013,65.9246247454225)
(2014,69.75024925871851)
(2015,73.98090515674585)
(2016,78.64610882393865)
(2017,83.6344258468586)
(2018,89.35581974762604)
(2019,96.05548560402437)
(2020,103.33282605610562)
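
The `(sum, count)` accumulator used with `aggregateByKey` can be reproduced on a local collection to see what each of the two functions does. The sketch below is plain Scala and only mimics the behaviour: the rows are the (YEAR, AMOUNT) pairs of the 10-row preview shown earlier, and the split into two "partitions" of five rows is an assumption made for illustration.

```scala
// (YEAR, AMOUNT) pairs taken from the 10-row preview
val rows: Seq[(Int, Double)] = Seq(
  (2012, 21.48), (2015, 17.25), (2014, 412.9), (2012, 19.18), (2016, 18.66),
  (2013, 63.61), (2015, 45.66), (2012, 49.35), (2012, 56.19), (2011, 64.94)
)

type Acc = (Double, Long) // (sum of amounts, number of amounts)

// seqOp: fold one value into the accumulator of its key (inside a partition)
def seqOp(acc: Acc, amount: Double): Acc = (acc._1 + amount, acc._2 + 1)
// combOp: merge the accumulators that two partitions built for the same key
def combOp(a: Acc, b: Acc): Acc = (a._1 + b._1, a._2 + b._2)

// Pretend the data is split into two partitions of five rows each
val partitions: Seq[Seq[(Int, Double)]] = rows.grouped(5).toSeq

val perPartition: Seq[Map[Int, Acc]] = partitions.map { part =>
  part.foldLeft(Map.empty[Int, Acc]) { case (m, (year, amount)) =>
    m.updated(year, seqOp(m.getOrElse(year, (0.0, 0L)), amount))
  }
}

val merged: Map[Int, Acc] = perPartition.reduce { (left, right) =>
  right.foldLeft(left) { case (m, (year, acc)) =>
    m.updated(year, m.get(year).map(combOp(_, acc)).getOrElse(acc))
  }
}

// Final step: divide the sum by the count for each key, then sort by year
val averages: Seq[(Int, Double)] =
  merged.toSeq.map { case (year, (sum, n)) => (year, sum / n) }.sortBy(_._1)

averages.foreach(println)
```

Carrying the count alongside the sum is what makes the merge step correct: averaging per-partition averages would give the wrong result whenever partitions hold different numbers of rows for a key.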


In [17]:
// Caching the pair EXPENDITURE_TYPE - AMOUNT for following jobs to improve performance 
val cachedExpenditureTypeAmount = rddTransactions.map(x => (x._8, x._9)).persist(MEMORY_AND_DISK_SER)

cachedExpenditureTypeAmount: org.apache.spark.rdd.RDD[(TransactionsParser.ExpenditureType, TransactionsParser.Amount)] = MapPartitionsRDD[41] at map at <console>:35


In [18]:
//10. What is the average amount for each expenditure type?
cachedExpenditureTypeAmount.
    aggregateByKey((0.0,0.0))((a,v)=>(a._1+v,a._2+1),(a1,a2)=>(a1._1+a2._1,a1._2+a2._2)). //within a partition: add AMOUNT to the first value and count it in the second; then merge the partial (sum, count) pairs across partitions
    map({case(k,v)=>(k,v._1/v._2)}). //compute the average for each EXPENDITURE_TYPE (key)
    sortBy(_._2, false). //order by average AMOUNT descending
    collect().
    foreach(result => print(result + "\n"))

// This also shows which expenditure types are the most expensive on average

(Housing,1558.7478390875365)
(Tax,412.9514105976914)
(Education,281.1921104205165)
(Savings,223.37088318474534)
(Bills and Utilities,208.28440022266733)
(Clothing,179.36878004462235)
(Health,159.31446095332143)
(Fines,159.2007908590807)
(Motor/Travel,133.38064934051442)
(Gambling,105.02011185360807)
(Groceries,80.30824109361191)
(Entertainment,24.03148653971737)


In [19]:
//11. What is the maximum amount for each expenditure type?
cachedExpenditureTypeAmount.
    reduceByKey((x,y)=>{if(x<y) y else x}). //keep the maximum AMOUNT for each EXPENDITURE_TYPE
    sortBy(_._2, false). //order by AMOUNT descending
    collect().
    foreach(result => print(result + "\n"))

(Housing,6519.61)
(Motor/Travel,6334.35)
(Clothing,4319.49)
(Education,2787.69)
(Savings,1996.85)
(Groceries,1952.39)
(Health,1816.22)
(Bills and Utilities,1489.54)
(Tax,1333.29)
(Entertainment,1002.84)
(Fines,988.5)
(Gambling,870.31)


### Analyze transactions

Let's analyze the transactions in more depth by answering the following questions:

1. What is the average number of transactions per customer?
2. What is the average number of transactions per year?
3. What is the average number of transactions per expenditure type?
4. Which year has the most transactions?
5. Which month has the most transactions?
6. Which year-month pair has the most transactions?


In [20]:
// Cache the number of transactions per customer, per year and per expenditure type to improve the performance of the following jobs
// Caching the pair CUSTOMER - NUMBER OF TRANSACTIONS 
val cachedCustomerTransactions = rddTransactions.map(x => (x._1, 1)).reduceByKey(_+_).persist(MEMORY_AND_DISK_SER) //create pair (customer_id, 1) and sum reducing by customer_id
// Caching the pair YEAR - NUMBER OF TRANSACTIONS
val cachedYearTransactions = rddTransactions.map(x => (x._5, 1)).reduceByKey(_+_).persist(MEMORY_AND_DISK_SER) //create pair (year, 1) and  sum reducing by year
// Caching the pair EXPENDITURE TYPE - NUMBER OF TRANSACTIONS
val cachedExpenditureTypeTransactions = rddTransactions.map(x => (x._8, 1)).reduceByKey(_+_).persist(MEMORY_AND_DISK_SER) //create pair (expenditure_type, 1) and sum reducing by expenditure_type

cachedCustomerTransactions: org.apache.spark.rdd.RDD[(TransactionsParser.CustomerId, Int)] = ShuffledRDD[56] at reduceByKey at <console>:36
cachedYearTransactions: org.apache.spark.rdd.RDD[(TransactionsParser.Year, Int)] = ShuffledRDD[58] at reduceByKey at <console>:35
cachedExpenditureTypeTransactions: org.apache.spark.rdd.RDD[(TransactionsParser.ExpenditureType, Int)] = ShuffledRDD[60] at reduceByKey at <console>:35


In [21]:
//1. What is the average number of transactions per customer?

// Two methods
// 1st method, simpler (Long / Long is an integer division, so the result is truncated)
"Number of transactions per customer: " + diskMemoryRdd.count() / cachedCustomer.distinct().count()

// 2nd method
val avgTransactionsPerCustomer = cachedCustomerTransactions.
    aggregate((0,0))((a,v)=>(a._1+v._2, a._2+1),(a1,a2)=>(a1._1+a2._1,a1._2+a2._2)) //build the pair (total_transactions, number_of_customers) by summing over all customers

"Number of transactions per customer: " + (avgTransactionsPerCustomer._1/avgTransactionsPerCustomer._2) // (total transactions / number of distinct customers), again an integer division

res63: String = Number of transactions per customer: 3492
avgTransactionsPerCustomer: (Int, Int) = (261969719,75000)
res68: String = Number of transactions per customer: 3492
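
Both methods report 3492 because dividing two integer values truncates the result. The short sketch below (plain Scala) uses the totals reported in the output above to show the difference between the truncated and the fractional average; the variable names are illustrative.

```scala
// Totals reported by the cell output above
val totalTransactions = 261969719L
val distinctCustomers = 75000L

// Long / Long discards the fractional part
val truncated: Long = totalTransactions / distinctCustomers
// Converting one operand to Double keeps it
val exact: Double = totalTransactions.toDouble / distinctCustomers

println(truncated)
println(exact)
```

If the fractional part matters, converting one operand with `toDouble` before dividing is enough.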


In [22]:
//2. What is the average number of transactions per year?

// Two methods
// 1st method, simpler
"Number of transactions per year: " + diskMemoryRdd.count() / cachedYear.distinct().count()

// 2nd method
val avgTransactionsPerYear = cachedYearTransactions.
    aggregate((0,0))((a,v)=>(a._1+v._2, a._2+1),(a1,a2)=>(a1._1+a2._1,a1._2+a2._2)) //build the pair (total_transactions, number_of_years) by summing over all years

"Number of transactions per year: " + (avgTransactionsPerYear._1/avgTransactionsPerYear._2) // (total transactions / number of distinct years)

res73: String = Number of transactions per year: 23815429
avgTransactionsPerYear: (Int, Int) = (261969719,11)
res77: String = Number of transactions per year: 23815429


In [23]:
//3. What is the average number of transactions per expenditure type?

// Two methods
// 1st method, simpler
"Number of transactions per expenditure type: " + diskMemoryRdd.count() / cachedExpenditureType.distinct().count()

// 2nd method
val avgTransactionsPerExpenditureType = cachedExpenditureTypeTransactions.
    aggregate((0,0))((a,v)=>(a._1+v._2, a._2+1),(a1,a2)=>(a1._1+a2._1,a1._2+a2._2))

"Number of transactions per expenditure type: " + (avgTransactionsPerExpenditureType._1/avgTransactionsPerExpenditureType._2)

res82: String = Number of transactions per expenditure type: 21830809
avgTransactionsPerExpenditureType: (Int, Int) = (261969719,12)
res86: String = Number of transactions per expenditure type: 21830809


In [24]:
// Save results to a csv file
val s3_path_transaction_per_expenditure_type = "s3a://"+bucketname+"/exam-dataset/jobs/transaction-per-expenditure-type.csv"

cachedExpenditureTypeTransactions.
    coalesce(1). // put all in one partition to have only one csv file 
    toDF().
    write.
    format("csv").
    mode(SaveMode.Overwrite).
    save(s3_path_transaction_per_expenditure_type)

s3_path_transaction_per_expenditure_type: String = s3a://unibo-bd2122-arettaroli/exam-dataset/jobs/transaction-per-expenditure-type.csv


In [31]:
// 4. Which year has the most transactions?

val yearWithMostTransactions = cachedYearTransactions.
    sortBy(_._2, false). //order by total number of transactions, descending
    first()

"The year with the most number of transactions is: " + yearWithMostTransactions._1 + " with a total of : "+ yearWithMostTransactions._2 +" transactions"

yearWithMostTransactions: (TransactionsParser.Year, Int) = (2019,34013186)
res112: String = The year with the most number of transactions is: 2019 with a total of : 34013186 transactions
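
Sorting the whole collection only to take its first element works, but a single pass that keeps the larger of two elements yields the same answer. The sketch below compares the two approaches on a local collection; the (year, count) pairs are hypothetical values chosen for illustration, not figures from the dataset.

```scala
// Hypothetical (year, number_of_transactions) pairs, for illustration only
val perYear: Seq[(Int, Int)] = Seq((2018, 120), (2019, 340), (2020, 210))

// Approach used in the cell above: sort by count descending, take the first element
val viaSort = perYear.sortBy(-_._2).head

// Alternative: one pass that keeps the pair with the larger count
val viaReduce = perYear.reduce((a, b) => if (a._2 >= b._2) a else b)

println(viaSort)
println(viaReduce)
```

On an RDD, the same idea could be expressed with `reduce`, or with `max` and an explicit `Ordering` on the count; with only a handful of keys, as here, the practical difference is likely small.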


In [34]:
//5. Which month has the most transactions?
val monthWithMoreTransactions = diskMemoryRdd.map(x => (x._6,1)). //create the pair (month, counter) for each transaction, e.g. (12, 1)
    reduceByKey(_+_).   // sum by month 
    sortBy(_._2, false). //order by total number of transactions, descending
    first()


"The month with the most number of transactions is: " +  monthWithMoreTransactions._1 + " with a total of : "+ monthWithMoreTransactions._2

monthWithMoreTransactions: (TransactionsParser.Month, Int) = (12,22824625)


In [43]:
//6. Which year-month pair has the most transactions?


val pairYearMonthWithMoreTransactions = diskMemoryRdd.map(x => ((x._5, x._6),1)).
    reduceByKey(_+_). //sum the counters of each (year, month) pair
    sortBy(_._2, false). //order by number of transactions, descending
    first()

"The pair Year-Month with the most number of transactions: " + pairYearMonthWithMoreTransactions

pairYearMonthWithMoreTransactions: ((TransactionsParser.Year, TransactionsParser.Month), Int) = ((2019,9),2844778)
res126: String = The pair Year-Month with the most number of transactions: ((2019,9),2844778)
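
The counting pattern used in the last three cells (emit `(key, 1)`, sum per key, pick the largest) can be checked by hand on the 10 transactions of the dataset preview. The sketch below does so with plain Scala collections, using `groupBy` as a local stand-in for `reduceByKey`.

```scala
// (YEAR, MONTH) of the 10 transactions in the dataset preview
val yearMonths: Seq[(Int, Int)] = Seq(
  (2012, 4), (2015, 5), (2014, 3), (2012, 9), (2016, 5),
  (2013, 12), (2015, 12), (2012, 8), (2012, 9), (2011, 7)
)

// Local equivalent of map(k => (k, 1)).reduceByKey(_ + _)
val counts: Map[(Int, Int), Int] =
  yearMonths.groupBy(identity).map { case (key, occurrences) => (key, occurrences.size) }

// Local equivalent of sortBy(_._2, false).first()
val busiest = counts.toSeq.sortBy(-_._2).head

println(busiest)
```

In this tiny sample, September 2012 is the only pair that appears twice, so it comes out on top.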


### Analyze customers

Let's analyze the customers by answering the following questions:

1. Which customer has made the most transactions?
2. Which customer has spent the most in total?


In [25]:
// 1. Which customer has made the most transactions?

// use of sort + first
val customerWithMoreTransactions = cachedCustomerTransactions.
    sortBy(_._2, false). //order by number of transactions, descending
    first()

"Customer: " + customerWithMoreTransactions._1 + " has made: " + customerWithMoreTransactions._2

customerWithMoreTransactions: (TransactionsParser.CustomerId, Int) = (C0YDPQWPBJ,10369)
res94: String = Customer: C0YDPQWPBJ has made: 10369


In [27]:
// Caching the pair (customer, total_amount_spent) to improve performance for following jobs
val cachedCustomersTotalAmounts = diskMemoryRdd.map(x => (x._1, x._9)).
                                         reduceByKey(_+_). //sum AMOUNT spent for transactions of each Customer
                                         persist(MEMORY_AND_DISK_SER) 

cachedCustomersTotalAmounts: org.apache.spark.rdd.RDD[(TransactionsParser.CustomerId, TransactionsParser.Amount)] = ShuffledRDD[82] at reduceByKey at <console>:36


In [28]:
//2. Which customer has spent the most in total?

val customerWithMostAmountSpent = cachedCustomersTotalAmounts.
    sortBy(_._2, false). //order by total amount spent, descending
    first()

"Customer: " + customerWithMostAmountSpent._1 + " with: " + customerWithMostAmountSpent._2


customerWithMostAmountSpent: (TransactionsParser.CustomerId, TransactionsParser.Amount) = (CP2KXQSX9I,2310805.889999996)
res100: String = Customer: CP2KXQSX9I with: 2310805.889999996


# Classification

The goal of this project is to classify customers according to their spending by assigning each of them a score, then to add the score to each row, recompute the total transactions for each expenditure type, and plot the result as a heatmap. In doing so, the following requirements should also be met:

- Classify each customer by the total amount they have spent, using a rank from 1 to 5

- Classify each expenditure type by the total amount spent on it, using a rank from 1 to 5

- Classify each expenditure type by the total amount each customer has spent on it, using a rank from 1 to 5

where:
- 1 means low 
- 2 means mid-low 
- 3 means mid
- 4 means mid-high
- 5 means high



In [46]:
// Number of distinct customers
val totalDistinctCustomers = cachedCustomersTotalAmounts.count() //or cachedCustomer.distinct().count()

"Total distinct customers " + totalDistinctCustomers

totalDistinctCustomers: Long = 75000
res133: String = total75000


In [72]:
// Bounds of the ranking: positions at 20%, 40%, 60% and 80% of the customers sorted by total amount
val lowestBound = (totalDistinctCustomers * 0.2).toInt
val lowBound = (totalDistinctCustomers * 0.4).toInt
val mediumBound = (totalDistinctCustomers * 0.6).toInt
val highBound = (totalDistinctCustomers * 0.8).toInt

lowestBound: Int = 15000
lowBound: Int = 30000
mediumBound: Int = 45000
highBound: Int = 60000


In [74]:
val rankCustomerPerTotalAmount = cachedCustomersTotalAmounts.
    sortBy(x => x._2, true). //sort by total amount ascending
    zipWithIndex(). //attach the zero-based position in the sorted order
    map{ x => //x is ((customer_id, total_amount), index)
        val rank = x._2 match { //v is the customer's position in the ascending-sorted list, i.e. the index x._2
            case v if (v < lowestBound) => 1 //strict comparison: the index is zero-based, so each rank covers one fifth of the customers
            case v if (v < lowBound) => 2
            case v if (v < mediumBound) => 3
            case v if (v < highBound) => 4
            case _ => 5
        }
        (x._1._1, x._1._2, rank) //returns (customer_id, total_amount, rank)
    }

rankCustomerPerTotalAmount: org.apache.spark.rdd.RDD[(TransactionsParser.CustomerId, TransactionsParser.Amount, Int)] = MapPartitionsRDD[144] at map at <console>:45
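
The position-based ranking can be tried out on a small local collection. The sketch below is plain Scala with ten hypothetical customers (identifiers and amounts are made up); it computes the bounds with integer arithmetic and compares them against the zero-based position, so that every rank receives the same number of customers. The helper `rankOf` is an illustrative name, not part of the notebook.

```scala
// Hypothetical (customer_id, total_amount) pairs, deliberately unsorted
val totals: Seq[(String, Double)] =
  (1 to 10).map(i => (s"C$i", i * 100.0)).reverse

val n = totals.size
// Position bounds at 20%, 40%, 60% and 80% of the customers
val bounds: Seq[Int] = Seq(1, 2, 3, 4).map(k => n * k / 5)

// Rank = 1 + number of bounds that the zero-based position has reached
def rankOf(index: Int): Int = 1 + bounds.count(index >= _)

// Sort ascending by amount, attach the position, then map it to a rank
val ranked: Seq[(String, Double, Int)] =
  totals.sortBy(_._2).zipWithIndex.map { case ((id, amount), idx) =>
    (id, amount, rankOf(idx))
  }

ranked.foreach(println)
```

With 75,000 customers the same integer formula gives the bounds 15000, 30000, 45000 and 60000 shown above.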


In [76]:
// Show for each customer the ranking based on the total amount 
rankCustomerPerTotalAmount.collect().foreach(result => print("customer: "+ result._1+" with: total amount spent: "+result._2+ " has rank: "+result._3 + "\n"))

customer: CR3YGBA15O with: total amount spent: 288.36999999999995has rank: 1
customer: C0FVNHTA0Y with: total amount spent: 301.1500000000001has rank: 1
customer: CPTJNAHDEE with: total amount spent: 304.40999999999997has rank: 1
customer: CZ8DQCR5S1 with: total amount spent: 317.74999999999994has rank: 1
customer: C70IOOC40J with: total amount spent: 327.33has rank: 1
customer: CQCAP2YLPT with: total amount spent: 389.1300000000002has rank: 1
customer: CKZZS91H1W with: total amount spent: 394.0400000000003has rank: 1
customer: CL414JQR23 with: total amount spent: 459.6800000000001has rank: 1
customer: CW6UFH41D2 with: total amount spent: 464.10000000000025has rank: 1
customer: C0KBONKK6K with: total amount spent: 483.7299999999999has rank: 1
customer: CAPK383402 with: total amount spent: 492.5999999999999has rank: 1
customer: CJLF1MHC00 with: total amount spent: 510.40000000000015has rank: 1
customer: CSE8Y29ZHH with: total amount spent: 561.47has rank: 1
customer: C958NUAA2X with: to