## Premium Accounts
You have a dataset that records daily active users for each premium account. A premium account appears in the data every day as long as it remains premium. However, some premium accounts may be temporarily discounted, meaning they are not actively paying. A discounted account is indicated by a final_price of 0.


For each of the first 7 available dates, count the number of premium accounts that were actively paying on that day. Then, track how many of those same accounts remain premium and are still paying exactly 7 days later (regardless of activity in between).


Output three columns:
- The date of initial calculation.
- The number of premium accounts that were actively paying on that day.
- The number of those accounts that remain premium and are still paying after 7 days.
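
To make the definition concrete, here is a minimal plain-Python sketch of the two counts for a single date. The rows are hypothetical and are not taken from the real table, and the helper name `paying_and_retained` is an assumption made for illustration only.

```python
from datetime import date, timedelta

# Hypothetical rows (account_id, entry_date, final_price); not the real dataset.
rows = [
    ("A01", date(2022, 2, 7), 100),
    ("A03", date(2022, 2, 7), 400),
    ("A04", date(2022, 2, 7), 0),     # discounted, so not paying
    ("A01", date(2022, 2, 14), 100),  # still paying 7 days later
    ("A03", date(2022, 2, 14), 0),    # still premium, but discounted
]


def paying_and_retained(rows, day, offset_days=7):
    """Return (accounts paying on `day`, of those still paying `offset_days` later)."""

    def paying_on(target):
        # Set of account ids with a positive final_price on the target date.
        return {acc for acc, entry, price in rows if entry == target and price > 0}

    initial = paying_on(day)
    later = paying_on(day + timedelta(days=offset_days))
    # Only accounts present in both sets count as retained.
    return len(initial), len(initial & later)


counts = paying_and_retained(rows, date(2022, 2, 7))
print(counts)  # (2, 1)
```

A03 is dropped from the second count because it is discounted on the later date, even though it is still present in the data.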

Table: premium_accounts_by_day

In [42]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window 
import pandas as pd

import os
import sys

In [43]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [3]:
os.environ['JAVA_HOME'] = "C:/Program Files/Java/jdk-11"
spark = SparkSession.builder.appName('Premium Accounts').getOrCreate()

In [11]:
df = spark.read.format('csv') \
    .option('header', 'true') \
    .option('inferSchema', 'true') \
    .load('../Data/premium_accounts_by_day.csv')
    
df.show()

+----------+----------+----------------+-----------+---------+
|account_id|entry_date|users_visited_7d|final_price|plan_size|
+----------+----------+----------------+-----------+---------+
|       A01|2022-02-07|               1|        100|       10|
|       A03|2022-02-07|              30|        400|       50|
|       A01|2022-02-08|               3|        100|       10|
|       A03|2022-02-08|              39|        400|       50|
|       A05|2022-02-08|              14|        400|       50|
|       A01|2022-02-09|              12|        100|       10|
|       A03|2022-02-09|              44|        400|       50|
|       A04|2022-02-09|              25|          0|       70|
|       A05|2022-02-09|              32|        400|       50|
|       A01|2022-02-10|              17|        100|       10|
|       A02|2022-02-10|              82|        800|      100|
|       A03|2022-02-10|              60|        400|       50|
|       A04|2022-02-10|              72|          0|   

In [None]:
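# Keep only rows where the account is actually paying (final_price > 0).
paying = df.filter(F.col('final_price') > 0)

# Date on which each account should be re-checked: exactly 7 days later.
paying = paying.withColumn(
    'after_7days',
    F.date_add(F.col('entry_date'), 7)
)

# Left self-join: r1 is the initial day, r2 is the same account 7 days later.
# r2.account_id is null when the account is absent or not paying on that date.
joined = paying.alias('r1').join(
    paying.alias('r2'),
    (F.col('r1.after_7days') == F.col('r2.entry_date'))
    & (F.col('r1.account_id') == F.col('r2.account_id')),
    how='left'
).select(
    F.col('r1.entry_date'),
    F.col('r1.account_id').alias('initial_account_id'),
    F.col('r2.account_id').alias('after7_account_id')
)

# F.count skips nulls, so the second count covers only accounts
# that are still paying 7 days later.
final_result = (
    joined.groupBy('entry_date')
    .agg(
        F.count('initial_account_id').alias('premium_paid_accounts'),
        F.count('after7_account_id').alias('premium_paid_accounts_after_7d')
    )
    .orderBy('entry_date')
    .limit(7)  # first 7 dates that have at least one paying account
    .toPandas()
)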
        


+----------+---------------------+------------------------------+
|entry_date|premium_paid_accounts|premium_paid_accounts_after_7d|
+----------+---------------------+------------------------------+
|2022-02-07|                    2|                             2|
|2022-02-08|                    3|                             2|
|2022-02-09|                    3|                             2|
|2022-02-10|                    4|                             3|
|2022-02-11|                    4|                             1|
|2022-02-12|                    4|                             2|
|2022-02-13|                    4|                             1|
+----------+---------------------+------------------------------+
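
One edge case worth noting: the query above limits to 7 rows after filtering to paying accounts, so a date on which no account is paying would drop out and a later date would take its place. The seven dates in the output above are consecutive, so this does not appear to matter for this data. The plain-Python sketch below illustrates the difference on hypothetical rows; `first_dates` is an illustrative helper, not part of the notebook.

```python
from datetime import date

# Hypothetical rows (account_id, entry_date, final_price); not the real dataset.
rows = [
    ("A04", date(2022, 2, 6), 0),    # only a discounted account on this date
    ("A01", date(2022, 2, 7), 100),
    ("A03", date(2022, 2, 8), 400),
]


def first_dates(rows, n, paying_only):
    """Earliest n distinct dates, optionally ignoring non-paying rows."""
    dates = {entry for _, entry, price in rows if price > 0 or not paying_only}
    return sorted(dates)[:n]


# Dates picked after filtering to paying rows (what the query above does).
from_paying_rows = first_dates(rows, 2, paying_only=True)
# Dates picked from every row in the table.
from_all_rows = first_dates(rows, 2, paying_only=False)

print(from_paying_rows)  # starts at 2022-02-07
print(from_all_rows)     # starts at 2022-02-06
```

If the stricter reading of "first 7 available dates" is intended, one option is to compute the date list from the unfiltered table and join the counts onto it, reporting 0 for dates with no paying accounts.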

