## Most Popular Client For Calls

Select the most popular client_id based on the number of users who individually have at least 50% of their events from the following list: 'video call received', 'video call sent', 'voice call received', 'voice call sent'.

Table: fact_events
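
Before running anything in Spark, the selection rule can be sketched in plain Python. This is only an illustration: it assumes the ratio is computed per (user_id, client_id) pair, as the final query in this notebook does, and `most_popular_call_client` and `sample_events` are hypothetical names and rows, not part of the dataset.

```python
from collections import Counter

# Event types that count as calls, taken from the problem statement.
CALL_EVENTS = {
    'video call received', 'video call sent',
    'voice call received', 'voice call sent',
}

def most_popular_call_client(events):
    """events: iterable of (user_id, client_id, event_type) tuples."""
    totals = Counter()  # all events per (user_id, client_id)
    calls = Counter()   # call events per (user_id, client_id)
    for user_id, client_id, event_type in events:
        key = (user_id, client_id)
        totals[key] += 1
        if event_type in CALL_EVENTS:
            calls[key] += 1

    # Count, per client, the pairs whose call share is at least 50%.
    qualifying = Counter()
    for (user_id, client_id), total in totals.items():
        if calls[(user_id, client_id)] / total >= 0.5:
            qualifying[client_id] += 1

    if not qualifying:
        return None
    return qualifying.most_common(1)[0][0]

# Hypothetical rows for illustration only.
sample_events = [
    ('u1', 'desktop', 'video call sent'),
    ('u1', 'desktop', 'message sent'),
    ('u2', 'desktop', 'voice call received'),
    ('u2', 'mobile', 'file received'),
    ('u3', 'mobile', 'voice call sent'),
    ('u3', 'mobile', 'message sent'),
    ('u3', 'mobile', 'file sent'),
]

print(most_popular_call_client(sample_events))  # desktop
```

In the sample, two desktop pairs reach the 50% threshold (1 of 2 and 1 of 1) and no mobile pair does, so `desktop` is returned.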

In [1]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window
import pandas as pd

import os
import sys

In [6]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [2]:
os.environ["JAVA_HOME"] = "C:/Program Files/Java/jdk-11"
spark = SparkSession.builder.appName("Most Popular Client For Calls").getOrCreate()

In [3]:
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("../Data/Most Popular Client For Calls_fact_events_Tbl.csv")

In [8]:
df.show()  # show() prints only the first 20 rows by default
# df.toPandas()  # did not work in this environment; it appears to require Python 3.11 or lower

+---+-------+----------+----------------+---------+-------------------+--------+
| id|time_id|   user_id|     customer_id|client_id|         event_type|event_id|
+---+-------+----------+----------------+---------+-------------------+--------+
|  1|  43889|3668-QPYBK|          Sendit|  desktop|       message sent|       3|
|  2|  43889|7892-POOKP|       Connectix|   mobile|      file received|       2|
|  3|  43924|9763-GRSKD|          Zoomit|  desktop|video call received|       7|
|  4|  43923|9763-GRSKD|       Connectix|  desktop|video call received|       7|
|  5|  43867|9237-HQITU|          Sendit|  desktop|video call received|       7|
|  6|  43888|8191-XWSZG|       Connectix|  desktop|      file received|       2|
|  7|  43924|9237-HQITU|       Connectix|  desktop|video call received|       7|
|  8|  43891|9237-HQITU|       Connectix|   mobile|   message received|       4|
|  9|  43923|4190-MFLUW|       Connectix|   mobile|video call received|       7|
| 10|  43942|9763-GRSKD|    

In [17]:
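# Count events per (user_id, client_id) pair.
# show() returns None, so its result is not assigned to a variable.
df.groupBy('user_id', 'client_id') \
    .agg(F.count('*')) \
    .show()
# To inspect a single user, add .filter(F.col('user_id') == '3668-QPYBK') before .show().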

+----------+---------+--------+
|   user_id|client_id|count(1)|
+----------+---------+--------+
|3668-QPYBK|  desktop|       7|
|5575-GNVDE|  desktop|       4|
|8091-TTVAX|   mobile|       5|
|9237-HQITU|   mobile|       3|
|7469-LKBCI|  desktop|       5|
|1452-KIOVK|  desktop|       1|
|7892-POOKP|  desktop|       4|
|0280-XJGEX|   mobile|       2|
|1452-KIOVK|   mobile|       4|
|7590-VHVEG|   mobile|       5|
|5575-GNVDE|   mobile|       4|
|7590-VHVEG|  desktop|       4|
|7795-CFOCW|  desktop|       4|
|5129-JLPIS|   mobile|       3|
|6713-OKOMC|  desktop|       5|
|8191-XWSZG|   mobile|       5|
|6713-OKOMC|   mobile|       3|
|3655-SNQYZ|  desktop|       2|
|9237-HQITU|  desktop|       7|
|8191-XWSZG|  desktop|       3|
+----------+---------+--------+
only showing top 20 rows



In [18]:
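# Break down one user's events by client and event type.
# show() returns None, so its result is not assigned to a variable.
df.groupBy('user_id', 'client_id', 'event_type') \
    .agg(F.count('*')) \
    .filter(F.col('user_id') == '5575-GNVDE') \
    .show()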

+----------+---------+-------------------+--------+
|   user_id|client_id|         event_type|count(1)|
+----------+---------+-------------------+--------+
|5575-GNVDE|   mobile|   message received|       1|
|5575-GNVDE|  desktop|voice call received|       1|
|5575-GNVDE|   mobile|          file sent|       1|
|5575-GNVDE|   mobile|voice call received|       2|
|5575-GNVDE|  desktop| video call started|       1|
|5575-GNVDE|  desktop|      file received|       2|
+----------+---------+-------------------+--------+
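
The counts in the output above are enough to work out this user's call share per client by hand. The sketch below repeats that arithmetic in plain Python; `call_ratio_by_client` is a hypothetical helper, and the rows are copied from the table above.

```python
# Event types that count as calls, taken from the problem statement.
CALL_EVENTS = {
    'video call received', 'video call sent',
    'voice call received', 'voice call sent',
}

# (client_id, event_type, count) for user 5575-GNVDE, copied from the output above.
rows = [
    ('mobile', 'message received', 1),
    ('desktop', 'voice call received', 1),
    ('mobile', 'file sent', 1),
    ('mobile', 'voice call received', 2),
    ('desktop', 'video call started', 1),
    ('desktop', 'file received', 2),
]

def call_ratio_by_client(rows):
    """Return {client_id: call events / all events} for the given rows."""
    totals, calls = {}, {}
    for client_id, event_type, n in rows:
        totals[client_id] = totals.get(client_id, 0) + n
        if event_type in CALL_EVENTS:
            calls[client_id] = calls.get(client_id, 0) + n
    return {c: calls.get(c, 0) / totals[c] for c in totals}

ratios = call_ratio_by_client(rows)
print(ratios)  # {'mobile': 0.5, 'desktop': 0.25}
```

On mobile, 2 of 4 events are calls, so this user meets the 50% threshold there. On desktop only 1 of 4 qualifies, because 'video call started' is not in the list of call events.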



In [37]:
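# Event types that count as calls.
call_events = ['video call received', 'video call sent', 'voice call received', 'voice call sent']

# Share of call events per (user_id, client_id) pair.
# The ratio is compared unrounded so that values just under 0.5 are not rounded up.
call_ratio = df.groupBy('user_id', 'client_id').agg(
    (F.sum(F.when(F.col('event_type').isin(call_events), 1).otherwise(0)) / F.count('*')).alias('cnt_ratio')
)

# Keep pairs with at least 50% call events, count them per client, and show the top client.
# show() returns None, so its result is not assigned to a variable.
call_ratio.filter(F.col('cnt_ratio') >= 0.5) \
    .groupBy('client_id') \
    .agg(F.count('*').alias('total_cnt')) \
    .orderBy(F.col('total_cnt').desc()) \
    .limit(1) \
    .select('client_id') \
    .show()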

+---------+
|client_id|
+---------+
|  desktop|
+---------+
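
One caveat with `.limit(1)`: if two clients were tied on the count, only one of them would be shown, and which one is not guaranteed. A minimal plain-Python sketch of returning every tied client instead, using a hypothetical helper `top_clients` and made-up counts that are not taken from the dataset:

```python
def top_clients(counts):
    """Return every client_id whose count equals the maximum, sorted by name."""
    if not counts:
        return []
    best = max(counts.values())
    return sorted(c for c, n in counts.items() if n == best)

# Hypothetical per-client counts, for illustration only.
print(top_clients({'desktop': 12, 'mobile': 12, 'tablet': 3}))  # ['desktop', 'mobile']
```

Whether ties matter depends on the data; with the result shown above there is a single winner.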

