# DataFrame Operations
Goal is to learn -> Basic Operations, Aggregations, Functions	

- Creating DataFrames
- select, filter, where, orderBy
- groupBy, agg, count, sum, avg
- Window functions
- Joins (inner, outer, cross)
- UDFs

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("dataframe_operations").getOrCreate()
print("You Spark Session is ready!")

You Spark Session is ready!


In [2]:
from pyspark.sql import functions as F
print("Guns are loaded")

Guns are loaded


In [3]:
from pyspark.sql.window import Window
print("Windows are loaded")

Windows are loaded


In [4]:
# List all functions in pyspark.sql.functions
print(dir(F))



In [5]:
transactions_file = "transactions.parquet"
df_transactions = spark.read.parquet(transactions_file)
df_transactions.printSchema()

root
 |-- cust_id: string (nullable = true)
 |-- start_date: string (nullable = true)
 |-- end_date: string (nullable = true)
 |-- txn_id: string (nullable = true)
 |-- date: string (nullable = true)
 |-- year: string (nullable = true)
 |-- month: string (nullable = true)
 |-- day: string (nullable = true)
 |-- expense_type: string (nullable = true)
 |-- amt: string (nullable = true)
 |-- city: string (nullable = true)



In [6]:
customers_file = "customers.parquet"
df_customers = spark.read.parquet(customers_file)
df_customers.printSchema()

root
 |-- cust_id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- age: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- birthday: string (nullable = true)
 |-- zip: string (nullable = true)
 |-- city: string (nullable = true)



## 1.1. Basic Operations (Easy)

1. Load both parquet files and display the first 5 rows of each DataFrame.
2. Count the total number of customers in the customers DataFrame.
3. Display all unique expense types from the transactions DataFrame.
4. Find all transactions with amounts greater than 1000.
5. List all unique cities from both datasets.
6. Count how many male and female customers are in the database.
7. Find all transactions that occurred in 2017.
8. Display the names of customers who live in "New York" city.
9. Count the number of transactions for each expense type.
10. Find the oldest customer in the database.

In [16]:
transactions_file = "transactions.parquet"
df_transactions = spark.read.parquet(transactions_file)
df_transactions.show(5)

customers_file = "customers.parquet"
df_customers = spark.read.parquet(customers_file)
df_customers.show(5)

+----------+----------+----------+---------------+----------+----+-----+---+-------------+------+-----------+
|   cust_id|start_date|  end_date|         txn_id|      date|year|month|day| expense_type|   amt|       city|
+----------+----------+----------+---------------+----------+----+-----+---+-------------+------+-----------+
|C0YDPQWPBJ|2010-07-01|2018-12-01|TZ5SMKZY9S03OQJ|2018-10-07|2018|   10|  7|Entertainment| 10.42|     boston|
|C0YDPQWPBJ|2010-07-01|2018-12-01|TYIAPPNU066CJ5R|2016-03-27|2016|    3| 27| Motor/Travel| 44.34|   portland|
|C0YDPQWPBJ|2010-07-01|2018-12-01|TETSXIK4BLXHJ6W|2011-04-11|2011|    4| 11|Entertainment|  3.18|    chicago|
|C0YDPQWPBJ|2010-07-01|2018-12-01|TQKL1QFJY3EM8LO|2018-02-22|2018|    2| 22|    Groceries|268.97|los_angeles|
|C0YDPQWPBJ|2010-07-01|2018-12-01|TYL6DFP09PPXMVB|2010-10-16|2010|   10| 16|Entertainment|  2.66|    chicago|
+----------+----------+----------+---------------+----------+----+-----+---+-------------+------+-----------+
only showi

In [5]:
# Count the total number of customers in the customers DataFrame.
df_customers.filter(F.col('cust_id')!='').count()

5000

In [6]:
# Display all unique expense types from the transactions DataFrame.
df_transactions.select("expense_type").distinct().show()

+-------------------+
|       expense_type|
+-------------------+
|              Fines|
|          Education|
|      Entertainment|
|            Housing|
|            Savings|
|          Groceries|
|                Tax|
|             Health|
|           Clothing|
|           Gambling|
|       Motor/Travel|
|Bills and Utilities|
+-------------------+



In [20]:
# Find all transactions with amounts greater than 1000.
df_transactions.filter(F.col("amt")>1000).show(5)

+----------+----------+----------+---------------+----------+----+-----+---+------------+-------+-------------+
|   cust_id|start_date|  end_date|         txn_id|      date|year|month|day|expense_type|    amt|         city|
+----------+----------+----------+---------------+----------+----+-----+---+------------+-------+-------------+
|C0YDPQWPBJ|2011-03-01|2019-12-01|TM1A3KGAH6MZM2G|2017-10-08|2017|   10|  8|     Housing|2663.91|    san_diego|
|C0YDPQWPBJ|2011-03-01|2019-12-01|TYD3HUFENAI2QR5|2017-11-02|2017|   11|  2|   Education|1251.05|      seattle|
|C0YDPQWPBJ|2011-03-01|2019-12-01|TNCQ08OEZKED3GV|2018-12-02|2018|   12|  2|   Education|1278.18|  los_angeles|
|C0YDPQWPBJ|2011-03-01|2019-12-01|TNBWA6ZYJ07AD1G|2019-07-08|2019|    7|  8|     Housing|2901.35|      chicago|
|C0YDPQWPBJ|2011-03-01|2019-12-01|THS7NFU1HW4LGQ9|2011-07-02|2011|    7|  2|   Education|1103.59|san_francisco|
+----------+----------+----------+---------------+----------+----+-----+---+------------+-------+-------

In [32]:
# List all unique cities from both datasets.

# Rename the 'city' column in both DataFrames before joining
transactions_renamed = df_transactions.withColumnRenamed("city", "txn_city")
customers_renamed = df_customers.withColumnRenamed("city", "cust_city")

# Perform the join
joined_df = transactions_renamed.join(customers_renamed, on="cust_id", how="left")
joined_df.select("txn_city").distinct().show()


+-------------+
|     txn_city|
+-------------+
|    san_diego|
|      chicago|
|       denver|
|       boston|
|      seattle|
|  los_angeles|
|     new_york|
|san_francisco|
| philadelphia|
|     portland|
+-------------+



In [34]:
# Count how many male and female customers are in the database.
df_customers.groupby(F.col("gender")).count().show()

+------+-----+
|gender|count|
+------+-----+
|Female| 2502|
|  Male| 2498|
+------+-----+



In [41]:
# Find all transactions that occurred in 2017
# Cast 'year' column to an integer and filter transactions from 2017
df_transactions.filter(F.col("year").cast("int") == 2017).show(5)

+----------+----------+----------+---------------+----------+----+-----+---+-------------+------+---------+
|   cust_id|start_date|  end_date|         txn_id|      date|year|month|day| expense_type|   amt|     city|
+----------+----------+----------+---------------+----------+----+-----+---+-------------+------+---------+
|C0YDPQWPBJ|2010-07-01|2018-12-01|T1FSOGKASVCV7FM|2017-07-14|2017|    7| 14|Entertainment|  4.01|   denver|
|C0YDPQWPBJ|2010-07-01|2018-12-01|TODEIOF2REW2GKB|2017-03-25|2017|    3| 25|     Gambling| 164.5| new_york|
|C0YDPQWPBJ|2010-07-01|2018-12-01|TU285PGHVE16XEX|2017-07-24|2017|    7| 24|Entertainment|  4.05|san_diego|
|C0YDPQWPBJ|2010-07-01|2018-12-01|TEY9ROTMZD6AHVI|2017-04-04|2017|    4|  4|    Groceries|129.98| portland|
|C0YDPQWPBJ|2010-07-01|2018-12-01|TXU0JZKYENLM0P5|2017-08-08|2017|    8|  8| Motor/Travel|  43.3|san_diego|
+----------+----------+----------+---------------+----------+----+-----+---+-------------+------+---------+
only showing top 5 rows



In [44]:
# Display the names of customers who live in "New York" city.
# where is an alias of filter function
df_customers.where(F.col("city") == "new_york").show(6)

+----------+--------------+---+------+----------+-----+--------+
|   cust_id|          name|age|gender|  birthday|  zip|    city|
+----------+--------------+---+------+----------+-----+--------+
|C01USDV4EE|   Aaron Blair| 35|Female|  9/9/1974|80078|new_york|
|C02JNTM46B|Aaron Chambers| 51|  Male|  1/6/2001|63337|new_york|
|C04S4IDJV4| Aaron Gilbert| 24|Female|  7/2/2002|13359|new_york|
|C06KZJ6B2B|    Aaron Hall| 36|  Male| 9/13/1977|14297|new_york|
|C08MWU4U7A|  Aaron Jacobs| 38|  Male|12/21/1970|20955|new_york|
|C0BBHD2UQE| Aaron Patrick| 21|  Male| 6/16/1989|05992|new_york|
+----------+--------------+---+------+----------+-----+--------+
only showing top 6 rows



In [48]:
# Count the number of transactions for each expense type.
df_transactions.groupby("expense_type").agg(F.count("txn_id").alias("txn_count")).show()

+-------------------+---------+
|       expense_type|txn_count|
+-------------------+---------+
|              Fines|     7467|
|          Education|   559518|
|      Entertainment| 22417986|
|            Housing|   261668|
|            Savings|   357141|
|          Groceries|  6473528|
|                Tax|   453669|
|             Health|  1136161|
|           Clothing|  1165579|
|           Gambling|   958807|
|       Motor/Travel|  4738090|
|Bills and Utilities|  1260478|
+-------------------+---------+



In [55]:
# Find the oldest customer in the database.
highest_age = df_customers.agg(F.max("age").alias("highest_age")).collect()[0]["highest_age"]

df_customers.filter(F.col("age")== highest_age).show()

+----------+-----------------+---+------+---------+-----+-------------+
|   cust_id|             name|age|gender| birthday|  zip|         city|
+----------+-----------------+---+------+---------+-----+-------------+
|C07OSZNHI9|     Aaron Harper| 65|Female|3/29/2004|27672|      seattle|
|C0FDN25R0K|     Abbie Abbott| 65|  Male|1/11/1970|53095|      seattle|
|C0FQB13ESC|    Abbie Burgess| 65|  Male|9/15/1965|79294|    san_diego|
|C153A5ZXKO|      Adam Atkins| 65|  Male| 9/4/1971|97631|  los_angeles|
|C1AXBX5002|       Adam Hicks| 65|  Male|1/24/2003|39189|     portland|
|C1JGG6YJM4|    Addie Bridges| 65|  Male|3/13/1995|79628| philadelphia|
|C1RJXLKOVJ|    Addie Shelton| 65|Female|11/4/1999|10451|san_francisco|
|C1Y80F4B9K|  Adelaide Harvey| 65|Female|5/22/1979|54573|      chicago|
|C3B55OYDXS|    Agnes Flowers| 65|  Male| 8/8/1962|41089|     new_york|
|C3LVXNC8ID|    Agnes Stanley| 65|  Male| 3/3/1977|66589|       denver|
|C3ONI1HDNM|      Agnes Wells| 65|Female|7/22/2005|23392|san_fra

##### End of basic operations (easy)
----

#
#1.2. **Easy Questio**
1. **Create a DataFrame from `transactions.parquet` and display its schema.**  
   *Hint: Use `spark.read.parquet` and `printSchema`.*

2. **Display the first 10 rows of the `transactions` DataFrame.**  
   *Hint: Use `show` with a limit.*

3. **Select the `cust_id`, `txn_id`, and `amt` columns from the `transactions` DataFrame.**  
   *Hint: Use `select`.*

4. **Filter all transactions where the `amt` is greater than 1000.**  
   *Hint: Use `filter` or `where`.*

5. **Sort the `transactions` DataFrame by the `amt` column in descending order.**  
   *Hint: Use `orderBy`.*

6. **Count the total number of transactions in the `transactions` DataFrame.**  
   *Hint: Use `count`.*

7. **Group the `transactions` DataFrame by `expense_type` and count the number of transactions for each type.**  
   *Hint: Use `groupBy` and `count`.*

8. **Read the `customers.parquet` file and display the distinct cities in the DataFrame.**  
   *Hint: Use `distinct`.*

9. **Find the total amount (`amt`) spent by all customers in the `transactions` DataFrame.**  
   *Hint: Use `agg` with `sum`.*

10. **Rename the `amt` column in the `transctiFrion on any of these questions! ðŸ˜Š

In [10]:
#Sort the transactions DataFrame by the amt column in descending order.
df_transactions.orderBy(["amt"],ascending=False).show(10)

+----------+----------+----------+---------------+----------+----+-----+---+-------------------+------+-------------+
|   cust_id|start_date|  end_date|         txn_id|      date|year|month|day|       expense_type|   amt|         city|
+----------+----------+----------+---------------+----------+----+-----+---+-------------------+------+-------------+
|C0YDPQWPBJ|2012-11-01|2020-11-01|TQS5CGEC5EI915T|2020-10-10|2020|   10| 10|       Motor/Travel|999.99|    san_diego|
|CQN2R97ZSG|2011-11-01|2020-05-01|T2OM31ABM16KCHC|2019-01-04|2019|    1|  4|       Motor/Travel|999.99|san_francisco|
|C3R72MS04F|2011-08-01|      NULL|TACKNPEPG630N0H|2020-03-07|2020|    3|  7|             Health|999.99|       boston|
|C8I6XM3Y2A|2011-12-01|2020-09-01|TLGZFM6PSF4TJ4D|2013-11-05|2013|   11|  5|Bills and Utilities|999.99|     new_york|
|C0YDPQWPBJ|2011-01-01|2020-03-01|TSYJ1TQBYYJAUVS|2011-07-06|2011|    7|  6|Bills and Utilities|999.99|       denver|
|CLE3P7DBPR|2013-04-01|      NULL|TOAZ5LRVHSP991T|2018-1

In [11]:
# Count the total number of transactions in the transactions DataFrame.
df_transactions.count()

39790092

### 2.1. Medium Questions
11. **Join the `transactions` and `customers` DataFrames on `cust_id` to get a single DataFrame with customer details and their transactions.**  
    *Hint: Use `join`.*

12. **Find the average transaction amount (`amt`) for each `expense_type`.**  
    *Hint: Use `groupBy` with `agg` and `avg`.*

13. **Filter the `customers` DataFrame to only include customers who are 25 years old or older.**  
    *Hint: Use `filter` with a condition on the `age` column.*

14. **Add a new column to the `transactions` DataFrame called `txn_year` that extracts the year from the `date` column.**  
    *Hint: Use `withColumn` and PySpark functions like `year`.*

15. **Create a DataFrame with all unique pairs of `cust_id` and `expense_type` from the `transactions` DataFrame.**  
    *Hint: Use `select` and `distinct`.*

16. **Find the maximum transaction amount (`amt`) for each customer.**  
    *Hint: Use `groupBy` and `agg` with `max`.*

17. **Find the top 3 customers who have spent the highest total amount in transactions.**  
    *Hint: Use `groupBy`, `agg`, and `orderBy` with `limit`.*

18. **For each customer, find the total number of transactions they made and their average transaction amount.**  
    *Hint: Use `groupBy` with `count` and `avg`.*

19. **Filter all customers from the `customers` DataFrame who share the same city as their transactions.**  
    *Hint: Use `join` on the `city` column with a filter.*

20. **Find all transactions made in December (month = "12").**  
    *Hint: Use `filter` with a condition on the `month` column.*

---

In [13]:
# Find the average transaction amount (amt) for each expense_type
df_transactions.groupby(F.col("expense_type")).agg(F.round(F.avg("amt"),2).alias("avg_amt")).show()

+-------------------+-------+
|       expense_type|avg_amt|
+-------------------+-------+
|              Fines| 166.27|
|          Education| 282.11|
|      Entertainment|  24.38|
|            Housing|1591.13|
|            Savings| 230.73|
|          Groceries|  82.71|
|                Tax| 427.62|
|             Health| 162.91|
|           Clothing| 178.02|
|           Gambling| 109.74|
|       Motor/Travel| 124.99|
|Bills and Utilities|  220.4|
+-------------------+-------+



In [15]:
# Add a new column to the transactions DataFrame called txn_year that extracts the year from the date column.
df_transactions.withColumn("txn_start_year", F.year(F.col("start_date"))).show(5)

+----------+----------+----------+---------------+----------+----+-----+---+-------------+------+-----------+--------------+
|   cust_id|start_date|  end_date|         txn_id|      date|year|month|day| expense_type|   amt|       city|txn_start_year|
+----------+----------+----------+---------------+----------+----+-----+---+-------------+------+-----------+--------------+
|C0YDPQWPBJ|2010-07-01|2018-12-01|TZ5SMKZY9S03OQJ|2018-10-07|2018|   10|  7|Entertainment| 10.42|     boston|          2010|
|C0YDPQWPBJ|2010-07-01|2018-12-01|TYIAPPNU066CJ5R|2016-03-27|2016|    3| 27| Motor/Travel| 44.34|   portland|          2010|
|C0YDPQWPBJ|2010-07-01|2018-12-01|TETSXIK4BLXHJ6W|2011-04-11|2011|    4| 11|Entertainment|  3.18|    chicago|          2010|
|C0YDPQWPBJ|2010-07-01|2018-12-01|TQKL1QFJY3EM8LO|2018-02-22|2018|    2| 22|    Groceries|268.97|los_angeles|          2010|
|C0YDPQWPBJ|2010-07-01|2018-12-01|TYL6DFP09PPXMVB|2010-10-16|2010|   10| 16|Entertainment|  2.66|    chicago|          2010|


In [16]:
# Create a DataFrame with all unique pairs of cust_id and expense_type from the transactions DataFrame
df_transactions.select(["cust_id","expense_type"]).distinct().show()

+----------+-------------------+
|   cust_id|       expense_type|
+----------+-------------------+
|C1S3GH1FKR|              Fines|
|COTEH9WRII|Bills and Utilities|
|C0HCX8JU7B|          Education|
|C0W0YS75TQ|            Savings|
|CP665RP38K|          Groceries|
|CM8GEPBJW0|           Clothing|
|CHA4JTIZ9R|       Motor/Travel|
|CMY3LTLU4P|            Savings|
|CLTCHYIE8F|            Savings|
|CAJ9AVEA8J|      Entertainment|
|C44F5Y5V9W|           Clothing|
|CGADRJ9CMF|            Housing|
|CEUYCUI1GE|                Tax|
|CJ359NPTTF|           Clothing|
|CE9AFTP6UO|Bills and Utilities|
|CRB5VB5VGF|      Entertainment|
|CJIY3R0IJ2|      Entertainment|
|C0YDPQWPBJ|            Housing|
|CYV0UQGKF1|                Tax|
|C8EKBL7G9T|          Groceries|
+----------+-------------------+
only showing top 20 rows



In [20]:
# Find the maximum transaction amount (amt) for each customer.
df_transactions.groupby("cust_id").agg(F.max("amt").alias("highest_purchase_amt"))\
                .orderBy("highest_purchase_amt", ascending=False).show()

+----------+--------------------+
|   cust_id|highest_purchase_amt|
+----------+--------------------+
|C0YDPQWPBJ|              999.99|
|C3R72MS04F|              999.99|
|C8I6XM3Y2A|              999.99|
|CQN2R97ZSG|              999.99|
|CCFR9ND2HX|              999.98|
|CLE3P7DBPR|              999.98|
|CVVC8ELOXF|              999.98|
|C0I1IG7GEX|              999.97|
|C2CCLKG3EX|              999.97|
|C226O53ID1|              999.97|
|CP73FT1JD5|              999.97|
|C8OQLZ2AVM|              999.96|
|CNXGF8RRAM|              999.96|
|CWX3K09BDQ|              999.96|
|CZ9L6WLYUT|              999.96|
|CLV775F7OR|              999.95|
|CQKVJ5BB83|              999.95|
|CXRL8VPR54|              999.95|
|COGQ9IM7L1|              999.94|
|C2ORXL6547|              999.94|
+----------+--------------------+
only showing top 20 rows



In [21]:
# Find the top 3 customers who have spent the highest total amount in transactions.
df_transactions.groupby("cust_id").agg(F.sum("amt").alias("total_amt"))\
                .orderBy("total_amt", ascending=False)\
                .limit(3).show()

+----------+-------------------+
|   cust_id|          total_amt|
+----------+-------------------+
|C0YDPQWPBJ|1.491545975629971E9|
|CA9UYOQ5DA| 2028950.4899999967|
|C1YG5D12KV| 2014760.8700000087|
+----------+-------------------+



In [23]:
# For each customer, find the total number of transactions they made and their average transaction amount.
# Group by 'cust_id' to find the total number of transactions and the average transaction amount
df_transactions.groupby("cust_id") \
    .agg(
        F.count("txn_id").alias("txns"),  # Count total transactions
        F.avg("amt").alias("avg_amt")    # Average transaction amount
    ) \
    .orderBy("txns", ascending=False) \
    .show() 

+----------+--------+------------------+
|   cust_id|    txns|           avg_amt|
+----------+--------+------------------+
|C0YDPQWPBJ|17539732| 85.03812804152145|
|C3KUDEN3KO|    7999|192.02384048005962|
|CBW3FMEAU7|    7999| 65.58727340917598|
|C89FCEGPJP|    7999| 49.99358669833731|
|CHNFNR89ZV|    7998|20.588202050512546|
|CFTWKHAVFW|    7997|10.566552457171412|
|CZAZIXSCKP|    7995| 50.86983614759223|
|CAKBTN7JJQ|    7995|140.51184615384645|
|C8GCIG2WOM|    7995| 7.448561601000619|
|CXRV8HXIQD|    7994|15.313982987240395|
|CLZND1XQQZ|    7994| 104.4256067050287|
|C9X9HA3C9I|    7994|34.243212409307034|
|CZWMRCSUXO|    7994|30.166269702276665|
|C57PDEGOUJ|    7994| 18.62143857893418|
|CR1DHMX30F|    7994|57.927873405053965|
|CID7KEOKJV|    7993|202.23004253721962|
|CAIJYX09LT|    7993|58.399869886149965|
|CHSQEB7XLF|    7993|213.13004378831454|
|CP6ZEG31D9|    7993| 10.55756411860379|
|CLT59OG1NE|    7992|124.48662912912876|
+----------+--------+------------------+
only showing top

In [5]:
#Filter all customers from the customers DataFrame who share the same city as their transactions.
df_transactions.join(F.broadcast(df_customers), on=["cust_id","city"] ,how = "inner").show()

+----------+------+----------+----------+---------------+----------+----+-----+---+-------------+------+--------+---+------+---------+-----+
|   cust_id|  city|start_date|  end_date|         txn_id|      date|year|month|day| expense_type|   amt|    name|age|gender| birthday|  zip|
+----------+------+----------+----------+---------------+----------+----+-----+---+-------------+------+--------+---+------+---------+-----+
|C0YDPQWPBJ|denver|2010-07-01|2018-12-01|T16BPI2Q485F7XF|2015-08-19|2015|    8| 19|     Gambling| 59.75|Ada Lamb| 32|Female|9/29/2005|22457|
|C0YDPQWPBJ|denver|2010-07-01|2018-12-01|T1FSOGKASVCV7FM|2017-07-14|2017|    7| 14|Entertainment|  4.01|Ada Lamb| 32|Female|9/29/2005|22457|
|C0YDPQWPBJ|denver|2010-07-01|2018-12-01|TGO6A5NNOMI6IT5|2016-04-21|2016|    4| 21|Entertainment|  2.51|Ada Lamb| 32|Female|9/29/2005|22457|
|C0YDPQWPBJ|denver|2010-07-01|2018-12-01|TDEWL11NTJCN8CE|2016-06-02|2016|    6|  2|     Gambling| 59.88|Ada Lamb| 32|Female|9/29/2005|22457|
|C0YDPQWPBJ|d


#2.2. # Intermediate Operations (Medium)

11. Calculate the average transaction amount for each expense type, sorted by average a mount in descending order.

12. Find customers who have made transactions in cities different from their residence city.

13. Calculate the total amount spent by each customer, showing their name and total amount.

14. Find the month with the highest number of transactions in each year.

15. Create a summary showing for each customer:
    - Total number of transactions
    - Average transaction amount
    - First transaction date
    - Last transaction date

16. Find customers who have made transactions in all expense types.

17. Calculate the age distribution of customers (count of customers in different age groups: 18-25, 26-35, 36-45, 46-55, 56+).

18. Find the top 3 customers who spent the most in each city.

19. Calculate thectrunning total oustomer concepts in more detail?

In [7]:
# Find customers who have made transactions in cities different from their residence city.

# left ant-join means that it will return only those rows from left DF that do not have matching data records in right DF
df_transactions.join(F.broadcast(df_customers), on = ["cust_id", "city"], how = "left_anti").show(5)

+----------+-----------+----------+----------+---------------+----------+----+-----+---+-------------+------+
|   cust_id|       city|start_date|  end_date|         txn_id|      date|year|month|day| expense_type|   amt|
+----------+-----------+----------+----------+---------------+----------+----+-----+---+-------------+------+
|C0YDPQWPBJ|     boston|2010-07-01|2018-12-01|TZ5SMKZY9S03OQJ|2018-10-07|2018|   10|  7|Entertainment| 10.42|
|C0YDPQWPBJ|   portland|2010-07-01|2018-12-01|TYIAPPNU066CJ5R|2016-03-27|2016|    3| 27| Motor/Travel| 44.34|
|C0YDPQWPBJ|    chicago|2010-07-01|2018-12-01|TETSXIK4BLXHJ6W|2011-04-11|2011|    4| 11|Entertainment|  3.18|
|C0YDPQWPBJ|los_angeles|2010-07-01|2018-12-01|TQKL1QFJY3EM8LO|2018-02-22|2018|    2| 22|    Groceries|268.97|
|C0YDPQWPBJ|    chicago|2010-07-01|2018-12-01|TYL6DFP09PPXMVB|2010-10-16|2010|   10| 16|Entertainment|  2.66|
+----------+-----------+----------+----------+---------------+----------+----+-----+---+-------------+------+
only showi

In [9]:
# Calculate the total amount spent by each customer, showing their name and total amount.

df_transactions_join_cols = df_transactions.select(F.col("cust_id").alias("txn_cust_id"), "amt")

df_customers_amt = df_transactions_join_cols.join(df_customers, 
                                                  df_transactions_join_cols.txn_cust_id == df_customers.cust_id, 
                                                how = "inner")

df_customers_amt.groupby("cust_id","name").agg(F.sum("amt")).show()

+----------+---------------+------------------+
|   cust_id|           name|          sum(amt)|
+----------+---------------+------------------+
|CG1NEJSAXO|  Bernard Mason|1620561.4300000053|
|CGTWUY5AGZ|  Bertha Sparks|283178.89000000054|
|CV3G07K3WC|  Chris Osborne| 780173.4400000022|
|CA59VTI86N|       Amy Wise| 693919.1100000032|
|CL3B876N0W| Blanche Foster|411599.46000000095|
|CKGIIMAL4T|   Blake Cortez| 191719.0899999989|
|C6G96DT69P|Alfred Buchanan|1641742.7300000028|
|CUWOXYDTSP|  Chris Chapman| 903847.9500000024|
|CW1X61XN86|Christine Clark|254222.44000000038|
|C0EFPK9NVV| Aaron Thornton| 893915.4100000007|
|C34CEP15CV|   Adrian Stone| 660656.7900000014|
|CYEKKU5GV5|Clifford Fuller| 348114.5799999992|
|CM8GEPBJW0|   Brent Conner| 936285.4199999992|
|CMS8N5LMMI|  Brett Lindsey|  590509.289999999|
|CGQEQIV6B1|  Bertha Cooper| 431130.0600000015|
|CGITEKW8I0|   Bernice Rowe| 230427.7499999999|
|CYV0UQGKF1|    Clyde Tyler| 660873.9500000012|
|CAKRBAKXAI|Andrew Schwartz| 1322510.729

In [12]:
# Find the month with the highest number of transactions in each year.
df_txn_year_month =  df_transactions.groupby("year","month").agg(F.count("txn_id").alias("transactions"))

# specify the window
window_spec_txn_yr = Window.partitionBy("year").orderBy(F.col("transactions").desc())

# rank the months using the window and filter for top mnths
df_top_mnths_per_year = df_txn_year_month.withColumn("rank", F.row_number().over(window_spec_txn_yr))\
                                        .filter(F.col("rank")==1)

In [13]:
df_top_mnths_per_year.show(5)

+----+-----+------------+----+
|year|month|transactions|rank|
+----+-----+------------+----+
|2010|   12|      121503|   1|
|2011|   12|      230898|   1|
|2012|   12|      341426|   1|
|2013|   12|      384566|   1|
|2014|    5|      389520|   1|
+----+-----+------------+----+
only showing top 5 rows



In [10]:
'''
Create a summary showing for each customer:

Total number of transactions
Average transaction amount
First transaction date
Last transaction date
'''
df_joined = df_transactions.join(df_customers, on = "cust_id", how="inner")

df_customer_summary = df_joined.groupby("cust_id","name").agg(F.count("txn_id").alias("transactions"), 
                                        F.avg("amt").alias("avg_txn_amt"),
                                       F.min("date").alias("first_txn_date"),
                                       F.max("date").alias("last_txn_date")
                                       )
df_customer_summary.show(5)

+----------+-------------+------------+------------------+--------------+-------------+
|   cust_id|         name|transactions|       avg_txn_amt|first_txn_date|last_txn_date|
+----------+-------------+------------+------------------+--------------+-------------+
|C007YEYTX9| Aaron Abbott|        7445|200.59494425789075|    2012-02-01|   2020-09-27|
|C00B971T1J| Aaron Austin|        7532|133.01896840148697|    2012-10-01|   2020-12-27|
|C00WRSJF1Q| Aaron Barnes|        7777|105.57785007072154|    2012-11-01|   2020-12-27|
|C01AZWQMF3|Aaron Barrett|        7548|17.418098834128244|    2010-10-01|   2019-03-27|
|C01BKUFRHA| Aaron Becker|        7401| 82.77615457370636|    2012-04-01|   2020-09-27|
+----------+-------------+------------+------------------+--------------+-------------+
only showing top 5 rows



In [17]:
# Find the total number of distinct expense types
distinct_expense_types = df_transactions.select("expense_type").distinct().count()

# Step 2: Group by 'cust_id' and count distinct expense types for each customer
result = df_transactions.groupby("cust_id") \
    .agg(F.countDistinct("expense_type").alias("distinct_expense_types")) \
    .filter(F.col("distinct_expense_types") == distinct_expense_types)

# Step 3: Show the result
result.show()

+----------+----------------------+
|   cust_id|distinct_expense_types|
+----------+----------------------+
|C1AAN0AHMK|                    12|
|C0YDPQWPBJ|                    12|
|CO13ZCQKBA|                    12|
|C33RJK3Z4T|                    12|
|CIJ881FF3R|                    12|
+----------+----------------------+



In [21]:
# Calculate the age distribution of customers (count of customers in different age groups: 18-25, 26-35, 36-45, 46-55, 56+).

#IMP: Wrap each condition in parentheses
df_customers =  df_customers.withColumn(
                                        "age_group",
                                        F.when((F.col("age") >= 18) & (F.col("age") <= 25), "18-25")
                                         .when((F.col("age") >= 26) & (F.col("age") <= 35), "26-35")
                                         .when((F.col("age") >= 36) & (F.col("age") <= 45), "36-45")
                                         .when((F.col("age") >= 46) & (F.col("age") <= 55), "46-55")
                                         .otherwise("56+")
                                    )

df_customers.groupby("age_group").agg(F.countDistinct("cust_id").alias("customers")).show()


+---------+---------+
|age_group|customers|
+---------+---------+
|    18-25|      828|
|    26-35|     1049|
|      56+|     1038|
|    46-55|     1019|
|    36-45|     1066|
+---------+---------+



In [31]:
# Find the top 3 customers who spent the most in each city.
df_spend_per_city = df_transactions.groupby("cust_id", "city").agg(F.sum("amt").alias("total_spent"))

window_spec = Window.partitionBy("city").orderBy(F.col("total_spent").desc())

df_spend_per_city.withColumn("rank", F.row_number().over(window_spec))\
                 .filter(F.col("rank").isin(1,2,3))\
                 .show()


+----------+------------+--------------------+----+
|   cust_id|        city|         total_spent|rank|
+----------+------------+--------------------+----+
|C0YDPQWPBJ|      boston|1.4896172264999977E8|   1|
|CA9UYOQ5DA|      boston|  217740.75000000006|   2|
|CV441771EN|      boston|  214680.87999999995|   3|
|C0YDPQWPBJ|     chicago|1.4916500168000048E8|   1|
|CA8M8U39J2|     chicago|  212630.29000000053|   2|
|C7REP7GP36|     chicago|   206931.9399999998|   3|
|C0YDPQWPBJ|      denver|1.4894455308000013E8|   1|
|CGN9VRRD9S|      denver|  225581.85000000015|   2|
|CAAOZWFK5A|      denver|  215381.55000000005|   3|
|C0YDPQWPBJ| los_angeles| 1.492145839800003E8|   1|
|CA9UYOQ5DA| los_angeles|  235140.72999999975|   2|
|CP1LHO57T6| los_angeles|  207523.43999999994|   3|
|C0YDPQWPBJ|    new_york|1.4870774328000045E8|   1|
|CGN9VRRD9S|    new_york|  218306.62999999986|   2|
|C0DNME1XBG|    new_york|  211895.57999999993|   3|
|C0YDPQWPBJ|philadelphia|1.4962093732000005E8|   1|
|CHSW3FQ8WY|

In [34]:
# Calculate the running total of transactions for each customer ordered by date.
# we will create a window over which the amount would be summed
# partitionBy("cust_id") -> means for each customers
# the rowsBetween specifies the window lenght - here it is all the rows before the current row, till the current row
window_spec = Window.partitionBy("cust_id").orderBy(F.col("date")).rowsBetween(Window.unboundedPreceding, Window.currentRow)

df_transactions = df_transactions.withColumn("running_total", F.sum("amt").over(window_spec))

df_transactions.select("cust_id","date","amt","running_total").orderBy("cust_id","date").show()

+----------+----------+-------+------------------+
|   cust_id|      date|    amt|     running_total|
+----------+----------+-------+------------------+
|C007YEYTX9|2012-02-01|  45.73|             45.73|
|C007YEYTX9|2012-02-01|  28.89|             74.62|
|C007YEYTX9|2012-02-02| 252.29|326.90999999999997|
|C007YEYTX9|2012-02-02|  40.82|367.72999999999996|
|C007YEYTX9|2012-02-03|  146.7|            514.43|
|C007YEYTX9|2012-02-04| 3647.9|           4162.33|
|C007YEYTX9|2012-02-05| 147.76|           4310.09|
|C007YEYTX9|2012-02-05|  28.34|           4338.43|
|C007YEYTX9|2012-02-05|  37.53|           4375.96|
|C007YEYTX9|2012-02-05|  47.83|           4423.79|
|C007YEYTX9|2012-02-06| 458.48|           4882.27|
|C007YEYTX9|2012-02-06| 147.24|           5029.51|
|C007YEYTX9|2012-02-06|  42.54|           5072.05|
|C007YEYTX9|2012-02-06|  37.64|5109.6900000000005|
|C007YEYTX9|2012-02-06|  33.77| 5143.460000000001|
|C007YEYTX9|2012-02-07|1075.03| 6218.490000000001|
|C007YEYTX9|2012-02-08|  573.1|

## 3.1. Difficult Questions
21. **Write a PySpark SQL query to find the top 5 cities with the highest total transaction amounts.**  
    *Hint: Use `createOrReplaceTempView` to register a temporary view and write a SQL query.*

22. **Create a new column `is_weekend` in the `transactions` DataFrame that indicates whether the `date` falls on a Saturday or Sunday.**  
    *Hint: Use PySpark functions like `date_format` or `dayofweek`.*

23. **Using a window function, calculate the cumulative transaction amount (`amt`) for each customer, ordered by transaction date.**  
    *Hint: Use `Window` with `partitionBy` and `orderBy`.*

24. **Write a User-Defined Function (UDF) to categorize transactions as `High` (amt > 1000), `Medium` (500 â‰¤ amt â‰¤ 1000), or `Low` (amt < 500). Add this categorization as a new column.**  
    *Hint: Use `udf` and `withColumn`.*

25. **For each `expense_type`, calculate the difference between the maximum and minimum transaction amounts.**  
    *Hint: Use `groupBy` with `agg` and `max`/`min`.*

26. **Find customers who have made transactions in more than one city.**  
    *Hint: Use `groupBy` on `cust_id` with `countDistinct` on `city`.*

27. **Using a window function, find the top transaction for each customer based on the amount (`amt`).**  
    *Hint: Use `Window` with `rank` or `dense_rank`.*

28. **Create a DataFrame showing, for each customer, the first and last transaction dates.**  
    *Hint: Use `groupBy` with `agg` and `min`/`max`.*

29. **Join `customers` and `transactions` DataFrames and find the average transaction amount for male and female customers.**  
    *Hint: Use `join` and `groupBy` on `gender`.*

30. **Using a window function, calculate the rolling average of transaction amounts for each customer over their last 3 transactions.**  
    *Hint: Use `Window` with `rowsBetween`.*

## 3.2. Advanced Operations (Difficult)

21. Calculate the month-over-month percentage change in total transaction amounts for each expense type.

22. Create a cohort analysis showing customer retention based on their first transaction month:
    - Group customers by their first transaction month
    - Show how many were still active in subsequent months

23. Implement a fraud detection feature that flags suspicious transactions based on:
    - More than 3 transactions in a single day
    - Transactions amount > 5x the customer's average transaction amount
    - Transactions in different cities within 24 hours

24. Calculate the customer lifetime value (CLV) for each customer:
    - Total amount spent
    - Average transaction frequency
    - Customer age in the system
    - Trend of spending (increasing/decreasing)

25. Create a product affinity analysis:
    - Find which expense types are commonly seen together in the same month
    - Calculate the correlation between different expense types

26. Implement a dynamic window calculation that shows:
    - Moving average of transaction amounts (7-day window)
    - Transaction amount percentile within their expense type
    - Rank of transaction amount within customer's history

27. Build a customer segmentation analysis using:
    - Recency (days since last transaction)
    - Frequency (number of transactions)
    - Monetary value (total amount spent)
    - Create segments like "High Value", "Medium Value", "Low Value"

28. Create a geographical analysis:
    - Transaction density by city
    - Average transaction amount by city
    - Customer movement patterns between cities
    - City-wise customer demographics

29. Implement a recommendation system:
    - Based on similar customers (age, gender, city)
    - Based on transaction patterns
    - Calculate similarity scores between customers

30. Build a churn prediction feature:
    - Define churn (e.g., no transactions in last 3 months)
    - Calculate churn probability based on:
        - Transaction frequency changes
        - Amount changes
        - Last transaction recency
        - Customer demographics

## 4.1. Bonus Challenge Questions

31. Create a customer journey analysis:
    - Map the progression of expense types over time
    - Identify common paths/patterns in customer spending
    - Calculate the probability of next expense type

32. Implement a seasonal analysis:
    - Detect seasonal patterns in different expense types
    - Account for city-specific seasonal variations
    - Create year-over-year comparison accounting for seasonality
---

In [7]:
# we will work with a smaller dataset hereon out
df_transactions = df_transactions.limit(100000)
df_transactions = df_transactions.repartition(2)

In [7]:
# Write a PySpark SQL query to find the top 5 cities with the highest total transaction amounts.

# two types of views
# 1.createOrReplaceTempView - the created view is only available to the current spark session and not accessible by other spark sessions
# 2.createOrReplaceGlobalTempView - a global view that can be accessed by other spark applications through the global_temp database

df_transactions.createOrReplaceTempView("transactions")

query = '''
        SELECT 
        city,
        SUM(amt) as total_spent
        FROM transactions
        GROUP BY city
        ORDER BY total_spent DESC
        LIMIT 5
        '''

spark.sql(query).show()

+------------+------------------+
|        city|       total_spent|
+------------+------------------+
|    portland|1103259.6099999961|
|     chicago|1059268.4400000009|
|     seattle|1046470.0400000042|
| los_angeles| 1039331.759999998|
|philadelphia|1038915.2799999986|
+------------+------------------+



In [8]:
# Create a new column is_weekend in the transactions DataFrame that indicates whether the date falls on a Saturday or Sunday.
# Lets work with date 
df_transactions.withColumn("is_weekend", 
                            F.when(F.dayofweek("date").isin(1, 7), "yes").otherwise("no")).show()

+----------+----------+----------+---------------+----------+----+-----+---+-------------------+-------+-------------+----------+
|   cust_id|start_date|  end_date|         txn_id|      date|year|month|day|       expense_type|    amt|         city|is_weekend|
+----------+----------+----------+---------------+----------+----+-----+---+-------------------+-------+-------------+----------+
|C9S3C56LPV|2012-11-01|      NULL|TNI2YXRCY8FLQEU|2019-06-15|2019|    6| 15|Bills and Utilities|  31.56|       denver|       yes|
|C4P7AEI6CC|2013-09-01|      NULL|TO4ARSBVH7FJ3G6|2019-06-15|2019|    6| 15|      Entertainment|   3.34|       denver|       yes|
|C0YDPQWPBJ|2012-07-01|      NULL|TTHKIN5IJJB9EEW|2015-07-19|2015|    7| 19|      Entertainment|  12.91|       denver|       yes|
|C4P7AEI6CC|2013-09-01|      NULL|T84K5F5Y77E9GRC|2020-08-07|2020|    8|  7|      Entertainment|   3.35|san_francisco|        no|
|CXMWPDLHKQ|2013-09-01|      NULL|TNBK45PI6OWCYEB|2020-10-07|2020|   10|  7|      Entertai

In [14]:
# Using a window function, calculate the cumulative no. of transactions for each customer, ordered by transaction date.
window_spec = Window.partitionBy("cust_id").orderBy("date").rowsBetween(Window.unboundedPreceding, Window.currentRow)

df_transactions.withColumn("cum_txns", F.count("txn_id").over(window_spec)).show()



+----------+----------+--------+---------------+----------+----+-----+---+-------------+------+-------------+--------+
|   cust_id|start_date|end_date|         txn_id|      date|year|month|day| expense_type|   amt|         city|cum_txns|
+----------+----------+--------+---------------+----------+----+-----+---+-------------+------+-------------+--------+
|C00WRSJF1Q|2012-11-01|    NULL|TLBGW4LK9IRX4G3|2012-11-01|2012|   11|  1|Entertainment| 23.56|      chicago|       1|
|C00WRSJF1Q|2012-11-01|    NULL|T7RBVNNRJH33E96|2012-11-01|2012|   11|  1|Entertainment|337.55|san_francisco|       2|
|C00WRSJF1Q|2012-11-01|    NULL|T5K2L4VNTHJAJVL|2012-11-01|2012|   11|  1|    Education|337.55|      seattle|       3|
|C00WRSJF1Q|2012-11-01|    NULL|T00H6MUJOW1LAW6|2012-11-01|2012|   11|  1| Motor/Travel|713.27|    san_diego|       4|
|C00WRSJF1Q|2012-11-01|    NULL|TTGC9U9CUNQG0VB|2012-11-02|2012|   11|  2|Entertainment| 23.57|  los_angeles|       5|
|C00WRSJF1Q|2012-11-01|    NULL|TCZ47E6G2090WJD|

In [17]:
# Write a User-Defined Function (UDF) to categorize transactions as High (amt > 1000), Medium (500 â‰¤ amt â‰¤ 1000), or Low (amt < 500). Add this categorization as a new column.
from pyspark.sql import types as T

def categorize_transaction(amount):
    if amount > 1000:
        return "High"
    elif amount >= 500 and amount <=1000:
        return "Medium"
    else:
        return "Low"

categorize_transaction_udf = F.udf(categorize_transaction, T.StringType())

df_transactions.withColumn("transaction_category", categorize_transaction_udf(F.col("amt").cast("int"))).show(5)

+----------+----------+----------+---------------+----------+----+-----+---+-------------+------+------------+--------------------+
|   cust_id|start_date|  end_date|         txn_id|      date|year|month|day| expense_type|   amt|        city|transaction_category|
+----------+----------+----------+---------------+----------+----+-----+---+-------------+------+------------+--------------------+
|CHKQQWH8EO|2012-12-01|2020-08-01|TPP59D3CVSNS0M2|2018-11-12|2018|   11| 12| Motor/Travel| 75.08| los_angeles|                 Low|
|CJS78KKO2R|2014-03-01|      NULL|T8WENNS0QMI3WBD|2020-04-23|2020|    4| 23|    Groceries|119.66|philadelphia|                 Low|
|CM8GEPBJW0|2010-04-01|2018-08-01|TE7Q1TL44L963Z8|2011-05-01|2011|    5|  1|          Tax|775.53|   san_diego|              Medium|
|C0YDPQWPBJ|2010-01-01|2019-01-01|THFZS1YHQ77LARE|2016-11-27|2016|   11| 27|Entertainment|  9.41|   san_diego|                 Low|
|C1YTOSZPBB|2013-01-01|      NULL|T0SMOQFPHBZAYBF|2013-10-03|2013|   10|  3|

In [19]:
# Join customers and transactions DataFrames and find the average transaction amount for male and female customers.
df_joined = df_transactions.join(df_customers, on = "cust_id", how = "inner")
df_joined.groupby("gender").agg(F.avg("amt").alias("avg_txn_amt")).show()

+------+-----------+
|gender|avg_txn_amt|
+------+-----------+
|Female|     61.605|
|  Male|    111.945|
+------+-----------+



In [23]:
# Using a window function, calculate the rolling average of transaction amounts for each customer over their last 3 transactions.
# The range -2 specifies the last two rows before the current row (inclusive), making the total window size three (last 3 transactions, including the current one).

window_spec = Window.partitionBy("cust_id").orderBy(F.col("date")).rowsBetween(-2, Window.currentRow)

df_transactions.withColumn("cum_sum_last_3_txn", F.sum("amt").over(window_spec)).show()

+----------+----------+----------+---------------+----------+----+-----+---+-------------------+------+-----------+------------------+
|   cust_id|start_date|  end_date|         txn_id|      date|year|month|day|       expense_type|   amt|       city|cum_sum_last_3_txn|
+----------+----------+----------+---------------+----------+----+-----+---+-------------------+------+-----------+------------------+
|C0YDPQWPBJ|2012-04-01|      NULL|TVC719DW1VM9RDJ|2016-06-26|2016|    6| 26|      Entertainment|  8.73|    chicago|              8.73|
|C0YDPQWPBJ|2012-04-01|      NULL|TAKG4AC0L2QXEPB|2020-08-26|2020|    8| 26|          Groceries| 65.16|   new_york|             73.89|
|C8TZQZJBU9|2011-12-01|2020-06-01|TZR6IQS57CJVWSO|2017-10-13|2017|   10| 13|          Groceries| 17.14|los_angeles|             17.14|
|CMAF9IOXI3|2013-10-01|2020-11-01|TLZ4IY4GKQOAOUY|2016-05-13|2016|    5| 13|          Education| 26.32|    seattle|             26.32|
|CMX7KOJDQN|2013-01-01|      NULL|TGFRORVKGUZ65GI|2016-

In [13]:
# Calculate the month-over-month percentage change in total transaction amounts for each expense type.
# Group by year, month, and expense_type
df_grouped = df_transactions \
    .withColumn("year", F.year(F.col("date"))) \
    .withColumn("month", F.month(F.col("date"))) \
    .groupby("year", "month", "expense_type") \
    .agg(F.sum("amt").alias("total_amt"))

# Define a window specification
window_spec = Window.partitionBy("expense_type").orderBy("year", "month")

df_grouped = df_grouped.withColumn("previous_amt", F.lag("total_amt", 1).over(window_spec))

df_grouped.withColumn("percentage_change", ((F.col("total_amt") - F.col("previous_amt"))/F.col("previous_amt"))*100).show()

+----+-----+-------------------+------------------+------------------+-------------------+
|year|month|       expense_type|         total_amt|      previous_amt|  percentage_change|
+----+-----+-------------------+------------------+------------------+-------------------+
|2010|    7|Bills and Utilities|            467.38|              NULL|               NULL|
|2010|    8|Bills and Utilities|             463.9|            467.38|-0.7445761478882319|
|2010|    9|Bills and Utilities|             468.2|             463.9| 0.9269239060142298|
|2010|   10|Bills and Utilities|458.91999999999996|             468.2| -1.982058949167029|
|2010|   11|Bills and Utilities|            466.44|458.91999999999996| 1.6386298265492982|
|2010|   12|Bills and Utilities|            464.06|            466.44|-0.5102478346625494|
|2011|    1|Bills and Utilities|            994.41|            464.06| 114.28479075981552|
|2011|    2|Bills and Utilities| 988.8199999999999|            994.41| -0.562142375881179|

In [20]:
# Create a cohort analysis showing customer retention based on their first transaction month:
# Group customers by their first transaction month
# Show how many were still active in subsequent months
df_first_txn = df_transactions.groupby("cust_id")\
                .agg(F.min("date").alias("first_txn_date"))\
                .withColumn("first_txn_month", F.date_format(F.col("first_txn_date"), "yyyy-MM"))

df_first_txn.show()


+----------+--------------+---------------+
|   cust_id|first_txn_date|first_txn_month|
+----------+--------------+---------------+
|C0YDPQWPBJ|    2010-07-01|        2010-07|
|C4P7AEI6CC|    2013-09-01|        2013-09|
|C9S3C56LPV|    2012-11-01|        2012-11|
|CL3B876N0W|    2010-09-01|        2010-09|
|CXMWPDLHKQ|    2013-09-01|        2013-09|
|CY3U68IY1F|    2010-12-01|        2010-12|
+----------+--------------+---------------+



In [36]:
df_joined = df_transactions.join(df_first_txn, on = "cust_id", how = "inner")
df_joined = df_joined.withColumn("txn_month", F.date_format(F.col("date"), "yyyy-MM"))\
                    .withColumn("months_btwn_first_last_txn", 
                                F.round(F.months_between(F.col("date"), F.col("first_txn_date")),0).cast("int")
                               )
df_joined.select("cust_id","date","first_txn_date","months_btwn_first_last_txn").show(5)

+----------+----------+--------------+--------------------------+
|   cust_id|      date|first_txn_date|months_btwn_first_last_txn|
+----------+----------+--------------+--------------------------+
|C0YDPQWPBJ|2017-05-24|    2010-07-01|                        83|
|C0YDPQWPBJ|2017-05-24|    2010-07-01|                        83|
|C0YDPQWPBJ|2017-05-24|    2010-07-01|                        83|
|C0YDPQWPBJ|2017-05-24|    2010-07-01|                        83|
|C0YDPQWPBJ|2017-05-24|    2010-07-01|                        83|
+----------+----------+--------------+--------------------------+
only showing top 5 rows



In [42]:
df_retention = df_joined.groupby("first_txn_month", "months_btwn_first_last_txn")\
                        .agg(F.countDistinct("cust_id").alias("active_customers"))

df_retention.orderBy("first_txn_month","months_btwn_first_last_txn").show()
# Show the cohort analysis
df_retention_pivot = df_retention.groupby("first_txn_month").pivot("months_btwn_first_last_txn").agg(F.sum("active_customers"))
df_retention_pivot.select("first_txn_month", "0", "1", "2", "3").show()

+---------------+--------------------------+----------------+
|first_txn_month|months_btwn_first_last_txn|active_customers|
+---------------+--------------------------+----------------+
|        2010-07|                         0|               1|
|        2010-07|                         1|               1|
|        2010-07|                         2|               1|
|        2010-07|                         3|               1|
|        2010-07|                         4|               1|
|        2010-07|                         5|               1|
|        2010-07|                         6|               1|
|        2010-07|                         7|               1|
|        2010-07|                         8|               1|
|        2010-07|                         9|               1|
|        2010-07|                        10|               1|
|        2010-07|                        11|               1|
|        2010-07|                        12|               1|
|       

In [19]:
# Implement a fraud detection feature that flags suspicious transactions based on:
    # More than 3 transactions in a single day
    # Transactions amount > 5x the customer's average transaction amount
    # Transactions in different cities within 24 hours

#let's tackle this one first: Transactions amount > 5x the customer's average transaction amount
df_cust_avg_amt = df_transactions.groupby("cust_id").agg(F.avg("amt").alias("avg_txn_amt"))
df_transactions = df_transactions.join(F.broadcast(df_cust_avg_amt), on="cust_id", how="inner")
df_transactions = df_transactions.withColumn("is_5x_avg_amt", F.when(F.col("amt") > F.col("avg_txn_amt")*5, 1).otherwise(0))

In [20]:
# Now: More than 3 transactions in a single day
# group by cust_id and date to see no. of transactions per day
df_transactions = df_transactions.groupby("cust_id","date").agg(F.countDistinct("txn_id").alias("txns"))\
                .withColumn("is_more_than_3_txn_per_day", F.when(F.col("txns")>3, 1).otherwise(0))

In [29]:
# 1. More than 3 transactions in a single day (you already have this one)
df_transactions_per_day = df_transactions.groupby("cust_id", "date").agg(F.countDistinct("txn_id").alias("txns")) \
    .withColumn("is_more_than_3_txn_per_day", F.when(F.col("txns") > 3, 1).otherwise(0))

# 2. Transactions amount > 5x the customer's average transaction amount (you already have this one)
df_cust_avg_amt = df_transactions.groupby("cust_id").agg(F.avg("amt").alias("avg_txn_amt"))
df_transactions = df_transactions.join(F.broadcast(df_cust_avg_amt), on="cust_id", how="inner")
df_transactions = df_transactions.withColumn("is_5x_avg_amt", F.when(F.col("amt") > F.col("avg_txn_amt") * 5, 1).otherwise(0))

# 3. Transactions in different cities within 24 hours
# Window specification to partition by cust_id and order by date
window_spec = Window.partitionBy("cust_id").orderBy("date")

# Calculate the time difference between the current transaction and the next transaction
df_transactions_with_lag = df_transactions.withColumn("next_city", F.lead("city", 1).over(window_spec)) \
    .withColumn("next_date", F.lead("date", 1).over(window_spec)) \
    .withColumn("time_diff_seconds", 
                F.unix_timestamp(F.col("next_date")) - F.unix_timestamp(F.col("date"))) \
    .withColumn("is_diff_city_within_24hr", 
                F.when((F.col("time_diff_seconds") <= 86400) & (F.col("city") != F.col("next_city")), 1).otherwise(0))

# Final result combining all flags
df_result = df_transactions_with_lag.join(df_transactions_per_day, on="cust_id", how="left") \
    .withColumn("fraud_flag", 
                F.when((F.col("is_5x_avg_amt") == 1) | (F.col("is_more_than_3_txn_per_day") == 1) | 
                       (F.col("is_diff_city_within_24hr") == 1), 1).otherwise(0))

AnalysisException: [AMBIGUOUS_REFERENCE] Reference `avg_txn_amt` is ambiguous, could be: [`avg_txn_amt`, `avg_txn_amt`].

In [26]:
df_result.select("cust_id", "amt", "city", "is_5x_avg_amt", 
                 "is_more_than_3_txn_per_day", "is_diff_city_within_24hr", "fraud_flag").show(5)

+----------+-----+------+-------------+--------------------------+------------------------+----------+
|   cust_id|  amt|  city|is_5x_avg_amt|is_more_than_3_txn_per_day|is_diff_city_within_24hr|fraud_flag|
+----------+-----+------+-------------+--------------------------+------------------------+----------+
|C0YDPQWPBJ|15.84|denver|            0|                         1|                       0|         1|
|C0YDPQWPBJ|15.84|denver|            0|                         0|                       0|         0|
|C0YDPQWPBJ|15.84|denver|            0|                         0|                       0|         0|
|C0YDPQWPBJ|15.84|denver|            0|                         1|                       0|         1|
|C0YDPQWPBJ|15.84|denver|            0|                         1|                       0|         1|
+----------+-----+------+-------------+--------------------------+------------------------+----------+
only showing top 5 rows



In [31]:
'''
Calculate the customer lifetime value (CLV) for each customer:

Total amount spent
Average transaction frequency
Customer age in the system
Trend of spending (increasing/decreasing)
'''

df_transactions_grouped_cust_id = df_transactions.groupby("cust_id").agg(F.sum("amt").alias("total_amt_spent"),
                                                                           F.countDistinct("txn_id").alias("txns"),
                                                                           F.min("date").alias("first_txn_date"),
                                                                           F.max("date").alias("last_txn_date")
                                                                          )
df_joined = df_transactions_grouped_cust_id.join(df_customers, on="cust_id", how="inner")
df_joined = df_joined.withColumn("days_between_first_last_txn", 
                     F.date_diff(F.col("last_txn_date"),F.col("first_txn_date"))
                    )\
            .withColumn("transaction_frequency", 
                        F.when(F.col("days_between_first_last_txn") > 0, F.col("txns") / F.col("days_between_first_last_txn"))
                         .otherwise(0)
                       )

window_spec = Window.partitionBy("cust_id").orderBy("date")
df_transactions.withColumn("cumulative_spend", 
                          F.sum("amt").over(window_spec)
                          )\
                .withColumn("spending_trend",
                            F.when(F.col("cumulative_spend")>F.lag("cumulative_spend",1).over(window_spec),"increasing").otherwise("decreasing")                          
                           ).show()



+----------+----------+----------+---------------+----------+----+-----+---+-------------------+------+-------------+------------------+-------------+------------------+-----------------+--------------+
|   cust_id|start_date|  end_date|         txn_id|      date|year|month|day|       expense_type|   amt|         city|       avg_txn_amt|is_5x_avg_amt|       avg_txn_amt| cumulative_spend|spending_trend|
+----------+----------+----------+---------------+----------+----+-----+---+-------------------+------+-------------+------------------+-------------+------------------+-----------------+--------------+
|C0YDPQWPBJ|2010-07-01|2018-12-01|TTKJ71UL9MC2FIA|2010-07-01|2010|    7|  1|      Entertainment| 17.96|       denver|122.20164913079681|            0|122.20164913079681|           849.06|    decreasing|
|C0YDPQWPBJ|2010-07-01|2018-12-01|T3QJN1UXI21LUMC|2010-07-01|2010|    7|  1|      Entertainment| 15.84|       denver|122.20164913079681|            0|122.20164913079681|           849.06| 

In [48]:
'''
Create a product affinity analysis:
Find which expense types are commonly seen together in the same month
Calculate the correlation between different expense types
'''

# total amount spent for each expense_type in each month
df_aggregated = df_transactions.withColumn("month", F.date_format(F.col("date"), "yyyy-MM"))\
                                .groupBy("expense_type", "month")\
                                .agg(F.sum("amt").alias("total_amt"),
                                     F.countDistinct("txn_id").alias("txns")
                                    )

df_monthly_totals = df_aggregated.groupby("month") \
                              .agg(F.sum("total_amt").alias("monthly_total_amt"),
                                   F.sum("txns").alias("monthly_total_txns")
                                  )

df_joined = df_aggregated.join(df_monthly_totals, on="month", how="inner")


df_percentages = df_joined.withColumn("percentage",
                                      F.round((F.col("txns") / F.col("monthly_total_txns")) * 100, 0)
                                     )


# Once we have the aggregated data, we need to pivot it so that each expense_type becomes a column.
# df.pivot(pivot_col, [list_of_values])

df_pivot = df_percentages.groupBy("month").pivot("expense_type").agg(F.first("percentage"))
df_pivot.toPandas()

Unnamed: 0,month,Bills and Utilities,Clothing,Education,Entertainment,Fines,Gambling,Groceries,Health,Housing,Motor/Travel,Savings,Tax
0,2020-06,3.0,3.0,2.0,57.0,0.0,4.0,12.0,4.0,1.0,12.0,1.0,1.0
1,2013-05,3.0,3.0,2.0,50.0,0.0,5.0,18.0,2.0,1.0,11.0,2.0,2.0
2,2019-10,3.0,3.0,2.0,56.0,0.0,5.0,13.0,1.0,1.0,12.0,1.0,1.0
3,2020-12,3.0,3.0,2.0,55.0,,5.0,14.0,3.0,1.0,12.0,1.0,1.0
4,2018-10,3.0,2.0,3.0,54.0,,4.0,18.0,3.0,1.0,9.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
121,2013-04,4.0,1.0,2.0,51.0,,5.0,16.0,3.0,1.0,14.0,1.0,1.0
122,2017-01,3.0,4.0,2.0,56.0,0.0,4.0,16.0,2.0,0.0,10.0,2.0,1.0
123,2012-03,4.0,3.0,1.0,56.0,0.0,1.0,15.0,3.0,0.0,13.0,2.0,1.0
124,2018-07,4.0,3.0,3.0,47.0,,6.0,15.0,3.0,1.0,17.0,1.0,1.0


In [20]:
'''
Implement a dynamic window calculation that shows:

Moving average of transaction amounts (7-day window)
Transaction amount percentile within their expense type
Rank of transaction amount within customer's history
'''

df_transactions_selected = df_transactions.select("cust_id","txn_id","date","amt")

In [26]:
# Moving average of transaction amounts (7-day window)
# Convert dates to timestamps for window calculation
window_spec_7_avg = (Window
    .orderBy(F.unix_timestamp("date"))
    .rangeBetween(-7 * 24 * 60 * 60, 0)  # 7 days in seconds
)

# Calculate 7-day moving average
df_with_moving_avg = (df_transactions_selected.withColumn("7_day_moving_avg", 
                                                F.avg("amt").over(window_spec_7_avg)
                                                )
                                        )
df_with_moving_avg.orderBy("date").show()

+----------+---------------+----------+------+------------------+
|   cust_id|         txn_id|      date|   amt|  7_day_moving_avg|
+----------+---------------+----------+------+------------------+
|C0YDPQWPBJ|TFKMYCIEKH6VDKY|2010-07-01| 19.53|102.10035290000053|
|C0YDPQWPBJ|TMFZTGDFPD0MVVN|2010-07-01| 17.23|102.10035290000053|
|C0YDPQWPBJ|TTKJ71UL9MC2FIA|2010-07-01| 17.96|102.10035290000053|
|C0YDPQWPBJ|T3QJN1UXI21LUMC|2010-07-01| 15.84|102.10035290000053|
|C0YDPQWPBJ|TCCEYC0LOOYFC46|2010-07-01| 778.5|102.10035290000053|
|C0YDPQWPBJ|TP1MUZ1P7B60SSD|2010-07-02| 32.53|102.10035290000053|
|C0YDPQWPBJ|TEX2I1AW51029I6|2010-07-02| 21.18|102.10035290000053|
|C0YDPQWPBJ|TBO4EJ6HDLGSLNX|2010-07-03| 55.06|102.10035290000053|
|C0YDPQWPBJ|TWJ8IZU7CGDHU34|2010-07-03| 55.34|102.10035290000053|
|C0YDPQWPBJ|TOQ9CAG3Y0B7LKH|2010-07-03|228.39|102.10035290000053|
|C0YDPQWPBJ|T77P5KOH15A5RV3|2010-07-04| 32.38|102.10035290000053|
|C0YDPQWPBJ|T67Q2REI30200D2|2010-07-05| 26.29|102.10035290000053|
|C0YDPQWPB

In [27]:
# Transaction amount percentile within their expense type

# order By amount as we are ranking based on transaction size
window_spec = Window.partitionBy("expense_type").orderBy("amt")

# The percent_rank func will calculate the relative position of each transaction amount (0 to 1) within its expense type
# 0 means lowest amount in that expense type
# 1 means highest amount in that expense type

df_transactions_wt_rank = df_transactions.withColumn("percentile_rank_amt",
                          F.percent_rank().over(window_spec)
                          )

df_transactions_wt_rank.filter(F.col("expense_type")=='Entertainment')\
                        .select("expense_type", "amt", "percentile_rank_amt")\
                        .orderBy("amt").show()

+-------------+----+--------------------+
| expense_type| amt| percentile_rank_amt|
+-------------+----+--------------------+
|Entertainment|1.69|                 0.0|
|Entertainment| 1.7|1.853430700226118...|
|Entertainment|1.71|3.706861400452237E-5|
|Entertainment|1.71|3.706861400452237E-5|
|Entertainment|1.72|7.413722800904474E-5|
|Entertainment|1.72|7.413722800904474E-5|
|Entertainment|1.72|7.413722800904474E-5|
|Entertainment|1.73|1.297401490158283E-4|
|Entertainment|1.73|1.297401490158283E-4|
|Entertainment|1.73|1.297401490158283E-4|
|Entertainment|1.73|1.297401490158283E-4|
|Entertainment|1.75|2.038773770248730...|
|Entertainment|1.75|2.038773770248730...|
|Entertainment|1.77|2.409459910293954E-4|
|Entertainment|1.78|2.594802980316566E-4|
|Entertainment|1.78|2.594802980316566E-4|
|Entertainment|1.78|2.594802980316566E-4|
|Entertainment|1.79|3.150832190384401...|
|Entertainment|1.79|3.150832190384401...|
|Entertainment|1.79|3.150832190384401...|
+-------------+----+--------------

In [28]:
# Rank of transaction amount within customer's history
'''
The difference between rank and dense_rank is that dense_rank leaves no gaps in ranking
sequence when there are ties. That is, if you were ranking a competition using dense_rank
and had three people tie for second place, you would say that all three were in second
place and that the next person came in third. Rank would give me sequential numbers, making
the person that came in third place (after the ties) would register as coming in fifth.
'''
window_spec_cust = Window.partitionBy("cust_id").orderBy(F.col("amt").desc())

df_transactions_selected = df_transactions_selected.withColumn("txn_rank_customer",
                                   F.dense_rank().over(window_spec_cust)
                                   )

df_transactions_selected.filter(F.col("cust_id")=='C0YDPQWPBJ').orderBy("amt", ascending=False).show()

+----------+---------------+----------+------+-----------------+
|   cust_id|         txn_id|      date|   amt|txn_rank_customer|
+----------+---------------+----------+------+-----------------+
|C0YDPQWPBJ|T60WDKTEXEWVU9F|2014-04-10|998.87|                1|
|C0YDPQWPBJ|TPBGFXSHKAFSQEK|2014-03-10|997.22|                2|
|C0YDPQWPBJ|TE0VSZ1C4CJUHE4|2014-02-10|995.58|                3|
|C0YDPQWPBJ|T3R5NS38YAMUCA4|2020-12-05|995.17|                4|
|C0YDPQWPBJ|T8EUZWQK3AB1A0U|2014-01-10|993.94|                5|
|C0YDPQWPBJ|TBP6330UW8QDLI2|2020-11-05|993.53|                6|
|C0YDPQWPBJ|TL6S9N4XSXJ52DX|2013-12-10| 992.3|                7|
|C0YDPQWPBJ|T8YU3FTQUXYEN4Z|2020-10-05|991.89|                8|
|C0YDPQWPBJ|TJ2RD2RQNNLFIA4|2013-11-10|990.66|                9|
|C0YDPQWPBJ|TEBW5PBVDIVMD2P|2020-09-05|990.26|               10|
|C0YDPQWPBJ|TQ3C5WZVRIC3HLZ|2016-10-27| 99.99|               11|
|C0YDPQWPBJ|T73JKOC2MA79RM1|2020-10-25| 99.97|               12|
|C0YDPQWPBJ|TRGL8DST6VWDM

In [37]:
'''
Build a customer segmentation analysis using:

Recency (days since last transaction)
Frequency (number of transactions)
Monetary value (total amount spent)
Create segments like "High Value", "Medium Value", "Low Value"
'''
df_transactions_grouped =  df_transactions.groupby("cust_id")\
                                            .agg(F.countDistinct("txn_id").alias("txns"),
                                                 F.round(F.sum("amt"),0).alias("total_amt_spent")
                                                )
df_transactions_joined = df_transactions.join(df_transactions_grouped, on="cust_id", how="inner")

window_spec_cust = Window.partitionBy("cust_id").orderBy("date")

df_transactions_joined = df_transactions_joined.withColumn("last_transaction_date", 
                                                           F.lag("date").over(window_spec_cust))

# Get the latest transaction date in the dataset (reference date)
max_date = df_transactions.agg(F.max("date").alias("max_date")).collect()[0]["max_date"]

# Calculate Recency
df_transactions_joined = df_transactions_joined.withColumn(
    "recency_days",
    F.when(F.col("last_transaction_date").isNotNull(), 
           F.datediff(F.lit(max_date), F.col("last_transaction_date"))
    ).otherwise(None)  # Handle nulls for first transaction
)

df_transactions_joined.select("cust_id", "txn_id", "last_transaction_date", "recency_days").show()

+----------+---------------+---------------------+------------+
|   cust_id|         txn_id|last_transaction_date|recency_days|
+----------+---------------+---------------------+------------+
|C0YDPQWPBJ|TFKMYCIEKH6VDKY|                 NULL|        NULL|
|C0YDPQWPBJ|TMFZTGDFPD0MVVN|           2010-07-01|        3832|
|C0YDPQWPBJ|TTKJ71UL9MC2FIA|           2010-07-01|        3832|
|C0YDPQWPBJ|T3QJN1UXI21LUMC|           2010-07-01|        3832|
|C0YDPQWPBJ|TCCEYC0LOOYFC46|           2010-07-01|        3832|
|C0YDPQWPBJ|TP1MUZ1P7B60SSD|           2010-07-01|        3832|
|C0YDPQWPBJ|TEX2I1AW51029I6|           2010-07-02|        3831|
|C0YDPQWPBJ|TOQ9CAG3Y0B7LKH|           2010-07-02|        3831|
|C0YDPQWPBJ|TWJ8IZU7CGDHU34|           2010-07-03|        3830|
|C0YDPQWPBJ|TBO4EJ6HDLGSLNX|           2010-07-03|        3830|
|C0YDPQWPBJ|T77P5KOH15A5RV3|           2010-07-03|        3830|
|C0YDPQWPBJ|T67Q2REI30200D2|           2010-07-04|        3829|
|C0YDPQWPBJ|T79V268J85VGI2I|           2

In [38]:
from pyspark.sql.functions import when

# Step 1: Calculate RFM Metrics
rfm_df = df_transactions_grouped.select("cust_id", "total_amt_spent", "txns").join(
    df_transactions_joined.select("cust_id", "recency_days").distinct(),
    on="cust_id",
    how="inner"
)

# Step 2: Assign Scores for Each Metric
# Recency: Lower days = Higher score
rfm_df = rfm_df.withColumn(
    "recency_score",
    when(F.col("recency_days") <= 30, 5)
    .when((F.col("recency_days") > 30) & (F.col("recency_days") <= 60), 4)
    .when((F.col("recency_days") > 60) & (F.col("recency_days") <= 90), 3)
    .when((F.col("recency_days") > 90) & (F.col("recency_days") <= 120), 2)
    .otherwise(1)
)

# Frequency: Higher txns = Higher score
rfm_df = rfm_df.withColumn(
    "frequency_score",
    when(F.col("txns") >= 20, 5)
    .when((F.col("txns") >= 15) & (F.col("txns") < 20), 4)
    .when((F.col("txns") >= 10) & (F.col("txns") < 15), 3)
    .when((F.col("txns") >= 5) & (F.col("txns") < 10), 2)
    .otherwise(1)
)

# Monetary Value: Higher total amount spent = Higher score
rfm_df = rfm_df.withColumn(
    "monetary_score",
    when(F.col("total_amt_spent") >= 1000, 5)
    .when((F.col("total_amt_spent") >= 750) & (F.col("total_amt_spent") < 1000), 4)
    .when((F.col("total_amt_spent") >= 500) & (F.col("total_amt_spent") < 750), 3)
    .when((F.col("total_amt_spent") >= 250) & (F.col("total_amt_spent") < 500), 2)
    .otherwise(1)
)

# Step 3: Create a Combined RFM Score
rfm_df = rfm_df.withColumn(
    "rfm_score",
    F.col("recency_score") + F.col("frequency_score") + F.col("monetary_score")
)

# Step 4: Segment Customers
rfm_df = rfm_df.withColumn(
    "segment",
    when(F.col("rfm_score") >= 13, "High Value")
    .when((F.col("rfm_score") >= 9) & (F.col("rfm_score") < 13), "Medium Value")
    .otherwise("Low Value")
)

# Show Final Segments
rfm_df.groupby("cust_id", "recency_score", "frequency_score", "monetary_score", "rfm_score", "segment").show()


+----------+-------------+---------------+--------------+---------+------------+
|   cust_id|recency_score|frequency_score|monetary_score|rfm_score|     segment|
+----------+-------------+---------------+--------------+---------+------------+
|C4P7AEI6CC|            1|              5|             5|       11|Medium Value|
|C4P7AEI6CC|            5|              5|             5|       15|  High Value|
|C4P7AEI6CC|            5|              5|             5|       15|  High Value|
|C4P7AEI6CC|            5|              5|             5|       15|  High Value|
|C4P7AEI6CC|            5|              5|             5|       15|  High Value|
|C4P7AEI6CC|            5|              5|             5|       15|  High Value|
|C4P7AEI6CC|            5|              5|             5|       15|  High Value|
|C4P7AEI6CC|            5|              5|             5|       15|  High Value|
|C4P7AEI6CC|            5|              5|             5|       15|  High Value|
|C4P7AEI6CC|            5|  

In [39]:
'''
Create a geographical analysis:

Transaction density by city
Average transaction amount by city
Customer movement patterns between cities
City-wise customer demographics
'''
# Transaction density by city
# Average transaction amount by city

df_transactions.groupby("city").agg(F.sum("amt").alias("total_amt"),
                                    F.count_distinct("txn_id").alias("txns")
                                   ).show()

+-------------+------------------+-----+
|         city|         total_amt| txns|
+-------------+------------------+-----+
|    san_diego| 968898.9500000017| 9864|
|      chicago|        1059268.44|10141|
|       denver|1023027.3099999998| 9924|
|       boston| 984069.3899999992| 9800|
|      seattle|1046470.0400000014|10010|
|  los_angeles|1039331.7599999973|10061|
|     new_york| 993799.5100000021| 9959|
|san_francisco| 952994.9999999984|10135|
| philadelphia|1038915.2800000006|10008|
|     portland|        1103259.61|10098|
+-------------+------------------+-----+



In [11]:
# Customer movement patterns between cities
# What do you want to achieve with this
# maybe answer questions like customers from so and so on avg make the highest txn in this city
df_transactions_select = df_transactions.select(F.col("cust_id"),F.col("city").alias("txn_city"),"txn_id","amt","date")
df_customers_select = df_customers.select("cust_id", F.col("city").alias("cust_city"))

df_txn_cust = df_transactions_select.join(df_customers_select, on="cust_id",how="inner")
df_grouped = df_txn_cust.groupby("cust_city","txn_city").agg(F.avg("amt").alias("avg_txn_amt"),
                                               F.count_distinct("txn_id").alias("txns")
                                               )
df_grouped.filter(F.col("cust_city")=='new_york').orderBy("txns", ascending=False).show()

+---------+-------------+-----------------+----+
|cust_city|     txn_city|      avg_txn_amt|txns|
+---------+-------------+-----------------+----+
| new_york|      seattle|75.23041966426867| 834|
| new_york|       denver|84.91699266503655| 818|
| new_york|  los_angeles|84.66374233128838| 815|
| new_york|san_francisco|80.04725925925923| 810|
| new_york|     portland|80.14797011207962| 803|
| new_york| philadelphia|82.38393483709275| 798|
| new_york|    san_diego|79.36619949494947| 792|
| new_york|       boston|81.94199488491047| 782|
| new_york|     new_york|83.34528074866307| 748|
| new_york|      chicago|73.33599190283394| 741|
+---------+-------------+-----------------+----+



In [15]:
# Another way of thinking about it
# Define a window to order transactions by date for each customer
window_spec_movement = Window.partitionBy("cust_id").orderBy("date")

# Add a lagged column to capture the previous city
customer_movement = df_transactions \
    .withColumn("prev_city", F.lag("city").over(window_spec_movement)) \
    .withColumn("next_city", F.lead("city").over(window_spec_movement)) \
    .filter((F.col("city") != F.col("prev_city"))|(F.col("city") != F.col("next_city")))  
    # Filter only if the customer moved to a different city

# Count movements between city pairs
movement_patterns = customer_movement \
    .groupBy("prev_city", "city", "next_city") \
    .agg(F.count("txn_id").alias("movement_count")) \
    .orderBy(F.desc("movement_count"))

movement_patterns.orderBy("movement_count", ascending=False).show()

+-------------+-------------+-------------+--------------+
|    prev_city|         city|    next_city|movement_count|
+-------------+-------------+-------------+--------------+
|san_francisco|  los_angeles|       boston|           141|
| philadelphia|      chicago|       denver|           134|
|     new_york|      chicago|     portland|           130|
|san_francisco|    san_diego|     portland|           129|
|      chicago|       denver| philadelphia|           128|
| philadelphia|      chicago|     portland|           127|
|     new_york|san_francisco|     portland|           126|
|       denver|       denver|     portland|           126|
|       boston|       denver| philadelphia|           125|
|       boston|      chicago|      chicago|           125|
|     portland|     new_york|san_francisco|           125|
|  los_angeles|      chicago|       denver|           124|
|      chicago|     portland|       denver|           124|
|     portland|    san_diego|     portland|           12