<a href="https://colab.research.google.com/github/Erike-Simon/CESAR-AED/blob/main/ProcDados_spark_assignment_erike_simon.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**CESAR School Recife**

**Course:** *Processamento de Dados em Larga Escala* (Large-Scale Data Processing)

**Students:** *Erike Simon, José Aparecido*

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!pip install --upgrade pyspark

Collecting pyspark
  Downloading pyspark-3.5.0.tar.gz (316.9 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.0-py2.py3-none-any.whl size=317425345 sha256=b23c7b39164e5378b38063c3de5caa7f2ce04927de4de3be7b5cb1ea01da9d47
  Stored in directory: /root/.cache/pip/wheels/41/4e/10/c2cf2467f71c678cfc8a6b9ac9241e5e44a01940da8fbb17fc
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.0


## About the data

The CSV file contains timestamped 'click' or 'view' events generated by users on ads that belong to specific campaigns.

**Columns:**  
timestamp,user_id,action,adId,campaignId

**Sample:**  
2016-09-21 22:11:00,7c74953c-66cc-48bd-9d02-a02bf039cf3f,click,adId_09,campaignId_01  
2016-06-25 18:29:00,676a083e-2f8e-4ff2-9ec2-270f7f9d6033,view,adId_09,campaignId_02  
2016-02-14 19:03:00,77158997-0dfa-48b7-9149-973dc151ef8d,click,adId_02,campaignId_02  
2016-03-26 06:27:00,78aa2467-b502-413b-94e9-04ec8210bd13,click,adId_07,campaignId_03

**CSV file name:**  
data/ad_action.csv

## About the questions

The questions must be answered using one of the Spark APIs, except the "Pandas API on Spark".

When using a Spark action, take care to avoid running out of memory: always assume the code will run against a large volume of data.

Even if you cannot finish a question, please submit it, because partial code may still earn points.

In [None]:
import os
import pyspark.sql.functions as F
import pyspark.sql.types as T

from pyspark.sql import SparkSession

os.environ['PYSPARK_SUBMIT_ARGS'] = '\
    --driver-memory 2G \
    --executor-memory 2G \
    pyspark-shell'

In [None]:
ROOT_DATA_PATH = 'drive/MyDrive/Colab Notebooks/proc-dados-larga-escala/data/ad_action.csv'

In [None]:
spark = SparkSession.builder\
    .master("local[*]")\
    .getOrCreate()
data_spark = spark.read.csv(ROOT_DATA_PATH, header=False, inferSchema=True)\
    .toDF('timestamp', 'user_id', 'action', 'adId', 'campaignId')
data_spark.printSchema()

root
 |-- timestamp: timestamp (nullable = true)
 |-- user_id: string (nullable = true)
 |-- action: string (nullable = true)
 |-- adId: string (nullable = true)
 |-- campaignId: string (nullable = true)



In [None]:
data_spark.show(5)

+-------------------+--------------------+------+-------+-------------+
|          timestamp|             user_id|action|   adId|   campaignId|
+-------------------+--------------------+------+-------+-------------+
|2016-09-21 22:11:00|7c74953c-66cc-48b...| click|adId_09|campaignId_01|
|2016-06-25 18:29:00|676a083e-2f8e-4ff...|  view|adId_09|campaignId_02|
|2016-02-14 19:03:00|77158997-0dfa-48b...| click|adId_02|campaignId_02|
|2016-03-26 06:27:00|78aa2467-b502-413...| click|adId_07|campaignId_03|
|2016-01-02 04:57:00|fef9a98c-d73e-48e...|  view|adId_02|campaignId_02|
+-------------------+--------------------+------+-------+-------------+
only showing top 5 rows



In [None]:
# Uncomment and run to stop the Spark session

#spark.stop()

## 1) Which are the top 3 campaigns that generated the most events? Order by number of events (2.5 points)

In [None]:
data_spark.groupBy('campaignId')\
    .count()\
    .withColumnRenamed('count', 'event_count')\
    .orderBy(F.desc('event_count'))\
    .show()

+-------------+-----------+
|   campaignId|event_count|
+-------------+-----------+
|campaignId_02|      91216|
|campaignId_03|      87036|
|campaignId_01|      76461|
+-------------+-----------+



**Answer:**

1. We group by *'campaignId'* and count the occurrences;
2. We rename the count column to *'event_count'*;
3. We sort in descending order of *'event_count'*.

Result:
1. campaignId_02  ->  91216
2. campaignId_03  ->  87036
3. campaignId_01  ->  76461

Campaign 2 generated the most events of the three.

## 2) Which campaign had the most clicks? (2.5 points)

In [None]:
# Total number of records in the dataset
data_spark.count()

254713

In [None]:
# Keeping only the clicks

clicks_df = data_spark.where(F.col('action') == 'click')
clicks_df.count()

178305

In [None]:
# Keeping only the views

views_df = data_spark.where(F.col('action') == 'view')
views_df.count()

76408

In [None]:
data_spark.where(F.col('action') == 'click')\
    .groupBy('campaignId')\
    .count()\
    .withColumnRenamed('count', 'event_count')\
    .orderBy(F.desc('event_count'))\
    .show()

+-------------+-----------+
|   campaignId|event_count|
+-------------+-----------+
|campaignId_02|      63983|
|campaignId_03|      60947|
|campaignId_01|      53375|
+-------------+-----------+



**Answer:**

1. We filter the records where 'action' = 'click';
2. We group by 'campaignId' and count the events;
3. We rename the count column and sort the data by it in descending order.

Result:
1. campaignId_02 -> 63983
2. campaignId_03 -> 60947
3. campaignId_01 -> 53375

**Campaign 2** had more clicks than the others.
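
The counts reported so far can be cross-checked with simple arithmetic. The sketch below is plain Python, not Spark, and only reuses the numbers printed in the cell outputs above; the variable names are illustrative.

```python
# Counts copied from the cell outputs above.
total_events = 254713
clicks, views = 178305, 76408
events_by_campaign = {"campaignId_02": 91216, "campaignId_03": 87036, "campaignId_01": 76461}
clicks_by_campaign = {"campaignId_02": 63983, "campaignId_03": 60947, "campaignId_01": 53375}

# Every event is either a click or a view, so the two filters partition the dataset.
assert clicks + views == total_events

# Grouping by campaign must neither lose nor duplicate rows.
assert sum(events_by_campaign.values()) == total_events
assert sum(clicks_by_campaign.values()) == clicks

# Campaign with the most clicks.
top_campaign = max(clicks_by_campaign, key=clicks_by_campaign.get)
print(top_campaign)  # campaignId_02
```

All three assertions hold, so the filtered and grouped results are consistent with the total row count.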


## 3) Which month had the highest accumulated total of events? (2.5 points)

In [None]:
from pyspark.sql.functions import month, year

Do all the records belong to the same year?

In [None]:
data_spark.count()

254713

In [None]:
data_spark.groupBy(year('timestamp').alias('year')).count().show()

+----+------+
|year| count|
+----+------+
|2016|254713|
+----+------+



All events are from the same year, so we can group directly by month.
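
The year check matters because a month-only key merges the same month of different years. A minimal plain-Python sketch of that effect follows; the 2017 timestamp is hypothetical, since the real dataset only contains 2016.

```python
from collections import Counter
from datetime import datetime

# Two timestamps taken from the sample rows plus one hypothetical 2017 row.
timestamps = [
    "2016-01-02 04:57:00",
    "2017-01-15 10:00:00",  # hypothetical: not present in the dataset
    "2016-02-14 19:03:00",
]
parsed = [datetime.strptime(ts, "%Y-%m-%d %H:%M:%S") for ts in timestamps]

# Month-only key: January 2016 and January 2017 fall into the same bucket.
by_month = Counter(dt.month for dt in parsed)

# (year, month) key: the two Januaries stay separate.
by_year_month = Counter((dt.year, dt.month) for dt in parsed)

print(by_month[1])               # 2
print(by_year_month[(2016, 1)])  # 1
```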

In [None]:
df_task = data_spark.groupBy(month('timestamp').alias('month'))\
    .count()\
    .cache()

df_task.toPandas()

Unnamed: 0,month,count
0,12,20297
1,1,25800
2,6,20657
3,3,21377
4,5,21224
5,9,20627
6,4,20558
7,8,21362
8,7,21183
9,10,21363


In [None]:
df_task.orderBy(F.desc('count'))\
    .limit(1)\
    .toPandas()

Unnamed: 0,month,count
0,1,25800


**Answer:**

1. We group by the month of *'timestamp'* and count the events;
2. We sort by the event count *'count'* in descending order and keep the first row.

Result:

Month 1 (January) of 2016 had the most accumulated events (25800).

## 4) In cases where a view event is followed by a click event created by the same user on the same ad and campaign, which are the 5 ad and campaign pairs with the lowest average time between the two events? (2.5 points)

In [None]:
df_task = data_spark.orderBy('user_id', 'campaignId','adId', 'timestamp')
df_task.show()

+-------------------+--------------------+------+-------+-------------+
|          timestamp|             user_id|action|   adId|   campaignId|
+-------------------+--------------------+------+-------+-------------+
|2016-01-13 21:09:00|00023420-6ead-463...| click|adId_01|campaignId_01|
|2016-01-14 09:30:00|00023420-6ead-463...| click|adId_01|campaignId_01|
|2016-03-18 11:24:00|00023420-6ead-463...| click|adId_01|campaignId_01|
|2016-03-18 18:25:00|00023420-6ead-463...| click|adId_01|campaignId_01|
|2016-04-02 19:42:00|00023420-6ead-463...| click|adId_01|campaignId_01|
|2016-04-03 13:12:00|00023420-6ead-463...|  view|adId_01|campaignId_01|
|2016-04-03 20:50:00|00023420-6ead-463...| click|adId_01|campaignId_01|
|2016-04-11 09:04:00|00023420-6ead-463...|  view|adId_01|campaignId_01|
|2016-05-25 18:50:00|00023420-6ead-463...|  view|adId_01|campaignId_01|
|2016-05-26 20:45:00|00023420-6ead-463...|  view|adId_01|campaignId_01|
|2016-07-21 19:04:00|00023420-6ead-463...|  view|adId_01|campaig

In [None]:
df_task.select('action').count()

254713

We create a new dataset (df_helper) with two extra columns: 'previous_event', which holds the values of 'action' shifted one step forward (that is, the action immediately preceding each row for the same user, campaign and ad), and 'next_event', which holds the action immediately following it.
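
To make the shift concrete, here is a small plain-Python sketch of the same idea: within each (user, campaign, ad) group, every row is paired with the action before it and the action after it. The user ids `u1` and `u2` are shortened placeholders, and the helper name `add_neighbours` is hypothetical; this only illustrates the logic and is not a Spark implementation.

```python
from itertools import groupby
from operator import itemgetter

# (user, campaign, ad, timestamp, action)
rows = [
    ("u1", "campaignId_01", "adId_01", "2016-04-02 19:42:00", "click"),
    ("u1", "campaignId_01", "adId_01", "2016-04-03 13:12:00", "view"),
    ("u1", "campaignId_01", "adId_01", "2016-04-03 20:50:00", "click"),
    ("u2", "campaignId_03", "adId_07", "2016-02-09 09:57:00", "view"),
]

def add_neighbours(rows):
    """Append (previous_event, next_event) to each row, per (user, campaign, ad) group."""
    out = []
    group_key = itemgetter(0, 1, 2)
    # ISO-formatted timestamps sort correctly as strings.
    ordered = sorted(rows, key=itemgetter(0, 1, 2, 3))
    for _, group in groupby(ordered, key=group_key):
        group = list(group)
        actions = [r[4] for r in group]
        for i, row in enumerate(group):
            prev_action = actions[i - 1] if i > 0 else None
            next_action = actions[i + 1] if i + 1 < len(actions) else None
            out.append(row + (prev_action, next_action))
    return out

result = add_neighbours(rows)
for row in result:
    print(row[3], row[4], row[5], row[6])
```

The first and last row of each group get `None` on the side where no neighbour exists, which corresponds to the NULL values in the Spark output.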

In [None]:
from pyspark.sql.functions import col, lag, lead, when
from pyspark.sql.window import Window

# Partition by user, campaign and ad so the window never moves the whole
# dataset into a single partition; order each partition by time
window = Window.partitionBy('user_id', 'campaignId', 'adId').orderBy('timestamp')

# previous_event: action of the preceding row in the same partition
# (NULL for the first event of each user/campaign/ad)
df_helper = df_task.withColumn('previous_event', lag('action').over(window))

# next_event: action of the following row in the same partition
# (NULL for the last event of each user/campaign/ad)
df_helper = df_helper.withColumn('next_event', lead('action').over(window))

# Sort only for display
df_helper.orderBy('user_id', 'campaignId', 'adId', 'timestamp').show(40)

+-------------------+--------------------+------+-------+-------------+--------------+----------+
|          timestamp|             user_id|action|   adId|   campaignId|previous_event|next_event|
+-------------------+--------------------+------+-------+-------------+--------------+----------+
|2016-01-13 21:09:00|00023420-6ead-463...| click|adId_01|campaignId_01|          NULL|     click|
|2016-01-14 09:30:00|00023420-6ead-463...| click|adId_01|campaignId_01|         click|     click|
|2016-03-18 11:24:00|00023420-6ead-463...| click|adId_01|campaignId_01|         click|     click|
|2016-03-18 18:25:00|00023420-6ead-463...| click|adId_01|campaignId_01|         click|     click|
|2016-04-02 19:42:00|00023420-6ead-463...| click|adId_01|campaignId_01|         click|      view|
|2016-04-03 13:12:00|00023420-6ead-463...|  view|adId_01|campaignId_01|         click|     click|
|2016-04-03 20:50:00|00023420-6ead-463...| click|adId_01|campaignId_01|          view|      view|
|2016-04-11 09:04:00

Filter and sort the pairs of rows such that:
* Both rows of the pair have the same 'user_id', the same 'adId' and the same 'campaignId';
* The first row of the pair has *'action' = 'view'* and *'next_event' = 'click'*; its *'previous_event'* may be *'click'*, *'view'* or *'NULL'*;
* The second row of the pair has *'action' = 'click'* and *'previous_event' = 'view'*; its *'next_event'* may be *'click'*, *'view'* or *'NULL'*.

In [None]:
from pyspark.sql import functions as F

# Filter and sort the pairs of rows
filtered_df = df_helper.filter(((col("action") == "view") & (col("next_event") == "click")) |
                               ((col("action") == "click") & (col("previous_event") == "view")))\
                               .orderBy("user_id", "adId", "campaignId", "timestamp")

filtered_df.show()

+-------------------+--------------------+------+-------+-------------+--------------+----------+
|          timestamp|             user_id|action|   adId|   campaignId|previous_event|next_event|
+-------------------+--------------------+------+-------+-------------+--------------+----------+
|2016-04-03 13:12:00|00023420-6ead-463...|  view|adId_01|campaignId_01|         click|     click|
|2016-04-03 20:50:00|00023420-6ead-463...| click|adId_01|campaignId_01|          view|      view|
|2016-07-21 19:04:00|00023420-6ead-463...|  view|adId_01|campaignId_01|          view|     click|
|2016-09-02 19:26:00|00023420-6ead-463...| click|adId_01|campaignId_01|          view|     click|
|2016-12-01 19:36:00|00023420-6ead-463...|  view|adId_01|campaignId_01|         click|     click|
|2016-12-01 20:42:00|00023420-6ead-463...| click|adId_01|campaignId_01|          view|     click|
|2016-02-09 09:57:00|000f0200-0918-414...|  view|adId_07|campaignId_03|          NULL|     click|
|2016-04-01 10:16:00

Creating the time-difference column



In [None]:
from pyspark.sql.functions import col, lag, unix_timestamp, when

# Convert the timestamp column to seconds since the Unix epoch
filtered_df = filtered_df.withColumn("timestamp_sec", unix_timestamp(col("timestamp")))

# Add a 'previous_timestamp_sec' column using the lag function
filtered_df = filtered_df.withColumn("previous_timestamp_sec", lag("timestamp_sec").over(window))

# Compute the time difference (in seconds) between the view and click events
filtered_df = filtered_df.withColumn("time_diff",
                   when(col("action") == "click", col("timestamp_sec") - col("previous_timestamp_sec"))
                   .otherwise(None))

# Select the relevant columns
result = filtered_df.select("timestamp", "user_id", "action", "adId", "campaignId", "time_diff")

result.show()

+-------------------+--------------------+------+-------+-------------+---------+
|          timestamp|             user_id|action|   adId|   campaignId|time_diff|
+-------------------+--------------------+------+-------+-------------+---------+
|2016-04-03 13:12:00|00023420-6ead-463...|  view|adId_01|campaignId_01|     NULL|
|2016-04-03 20:50:00|00023420-6ead-463...| click|adId_01|campaignId_01|    27480|
|2016-07-21 19:04:00|00023420-6ead-463...|  view|adId_01|campaignId_01|     NULL|
|2016-09-02 19:26:00|00023420-6ead-463...| click|adId_01|campaignId_01|  3716520|
|2016-12-01 19:36:00|00023420-6ead-463...|  view|adId_01|campaignId_01|     NULL|
|2016-12-01 20:42:00|00023420-6ead-463...| click|adId_01|campaignId_01|     3960|
|2016-02-09 09:57:00|000f0200-0918-414...|  view|adId_07|campaignId_03|     NULL|
|2016-04-01 10:16:00|000f0200-0918-414...| click|adId_07|campaignId_03|  4493940|
|2016-09-20 21:14:00|000f0200-0918-414...|  view|adId_07|campaignId_03|     NULL|
|2016-10-18 08:5

In [None]:
result.count()

102772

Note that the first row of each pair (the *'view'*) has a NULL *'time_diff'*, while the second row (the *'click'*) holds the time difference in seconds between the *'view'* and *'click'* events. We therefore filter the dataset again, keeping only the rows where *'time_diff'* is not NULL. Filtering on the value does not depend on row order, which Spark does not guarantee across partitions.

In [None]:
# Keep only the click rows, which carry the view-to-click time difference
filtered_df = result.filter(col("time_diff").isNotNull())

# Select only the desired columns
filtered_result = filtered_df.select("timestamp", "user_id", "action", "adId", "campaignId", "time_diff")

# Sort only for display
filtered_result.orderBy("user_id", "adId", "campaignId", "timestamp").show()

+-------------------+--------------------+------+-------+-------------+---------+
|          timestamp|             user_id|action|   adId|   campaignId|time_diff|
+-------------------+--------------------+------+-------+-------------+---------+
|2016-04-03 20:50:00|00023420-6ead-463...| click|adId_01|campaignId_01|    27480|
|2016-09-02 19:26:00|00023420-6ead-463...| click|adId_01|campaignId_01|  3716520|
|2016-12-01 20:42:00|00023420-6ead-463...| click|adId_01|campaignId_01|     3960|
|2016-04-01 10:16:00|000f0200-0918-414...| click|adId_07|campaignId_03|  4493940|
|2016-10-18 08:54:00|000f0200-0918-414...| click|adId_07|campaignId_03|  2374800|
|2016-03-12 11:24:00|00130041-b283-415...| click|adId_08|campaignId_02|    27060|
|2016-07-12 06:51:00|00130041-b283-415...| click|adId_08|campaignId_02|    35160|
|2016-08-22 10:35:00|00130041-b283-415...| click|adId_08|campaignId_02|   849240|
|2016-09-28 00:33:00|00130041-b283-415...| click|adId_08|campaignId_02|   109800|
|2016-11-27 18:1

Computing the average times:

1. Group by user, campaign and ad;
2. Compute the average time of each ad and campaign pair for each user.
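
As a worked check of these two steps, the sketch below recomputes the time differences and their average for the first user in the outputs above (adId_01, campaignId_01), using only the view and click timestamps already shown. It is plain Python with naive datetimes, so it assumes no daylight-saving shift between the paired events.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# (view timestamp, click timestamp) pairs copied from the filtered output above.
pairs = [
    ("2016-04-03 13:12:00", "2016-04-03 20:50:00"),
    ("2016-07-21 19:04:00", "2016-09-02 19:26:00"),
    ("2016-12-01 19:36:00", "2016-12-01 20:42:00"),
]

# Difference in seconds between each click and the view that precedes it.
diffs = [
    int((datetime.strptime(click, FMT) - datetime.strptime(view, FMT)).total_seconds())
    for view, click in pairs
]
avg_time = sum(diffs) / len(diffs)

print(diffs)     # [27480, 3716520, 3960]
print(avg_time)  # 1249320.0
```

The three differences match the *'time_diff'* values printed earlier, and the mean is the value the grouped output reports for this user.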





In [None]:
from pyspark.sql import functions as F

# Group by user, ad and campaign and compute the average time
average_time = filtered_result.groupBy("user_id", "adId", "campaignId").agg(F.avg("time_diff").alias("avg_time"))

# Display the result
average_time.show()

+--------------------+-------+-------------+---------+
|             user_id|   adId|   campaignId| avg_time|
+--------------------+-------+-------------+---------+
|00023420-6ead-463...|adId_01|campaignId_01|1249320.0|
|000f0200-0918-414...|adId_07|campaignId_03|3434370.0|
|00130041-b283-415...|adId_08|campaignId_02| 208800.0|
|00317295-4aa4-46e...|adId_05|campaignId_02| 316884.0|
|0031aa2d-5988-402...|adId_03|campaignId_03| 270840.0|
|00355f85-a403-4fc...|adId_06|campaignId_03|3289350.0|
|00430a3b-a186-460...|adId_09|campaignId_01| 731145.0|
|00437ba9-82bd-41a...|adId_05|campaignId_01|1414980.0|
|004c6259-a845-49e...|adId_05|campaignId_02|1794264.0|
|004f82bf-4dc9-409...|adId_05|campaignId_03| 544884.0|
|00553186-c886-4f6...|adId_05|campaignId_03|1210065.0|
|0061c33e-b346-4c6...|adId_03|campaignId_02|1624680.0|
|00640825-4c67-4ec...|adId_09|campaignId_01| 547335.0|
|0070c25e-e658-4ff...|adId_07|campaignId_01| 483204.0|
|0078bbc5-2db4-477...|adId_02|campaignId_01| 338570.0|
|0080122b-

Selecting the 5 ad and campaign pairs with the lowest average time between *view* and *click* events for the same user.

In [None]:
# Sort the DataFrame by the avg_time column in ascending order
sorted_df = average_time.orderBy(F.asc("avg_time"))

# Select the first 5 rows
top_5 = sorted_df.limit(5)
top_5.toPandas()

Unnamed: 0,user_id,adId,campaignId,avg_time
0,848eb13c-6881-45a3-bd62-05422be20cad,adId_05,campaignId_02,480.0
1,f4db1c5e-7780-4beb-a36f-1322743d99d5,adId_05,campaignId_02,660.0
2,e929e33e-791a-4b73-8a1b-63cc65238ecc,adId_10,campaignId_03,900.0
3,28c5def5-3ede-4ead-ac85-13c1819ead4b,adId_01,campaignId_01,960.0
4,c67d4a61-a911-4982-a685-33664bdcbd76,adId_06,campaignId_02,1140.0


**Answer:**


The top 5 ad and campaign pairs with the lowest average time between a *view* event followed by a *click* event are therefore the ones in `top_5`, displayed in the previous cell. Each row is a (user, ad, campaign) combination, since the averages were computed per user.
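
Because `average_time` is grouped by `user_id` as well, the ranking is per user. If the question is read as asking for ad and campaign pairs pooled over all users, the view-to-click gaps would instead be averaged per `(adId, campaignId)`, for example by grouping `filtered_result` on those two columns only. The sketch below is a plain-Python reference for that reading, meant for checking a Spark result on a small sample rather than for large data. The function name `smallest_mean_gaps` and all rows are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

def smallest_mean_gaps(events, n=5):
    """events: iterable of (timestamp, user_id, action, ad_id, campaign_id)."""
    # Sort so each user's events on one ad/campaign are contiguous and in time order.
    ordered = sorted(events, key=lambda e: (e[1], e[4], e[3], e[0]))
    gaps = defaultdict(list)
    for prev, curr in zip(ordered, ordered[1:]):
        same_key = (prev[1], prev[3], prev[4]) == (curr[1], curr[3], curr[4])
        if same_key and prev[2] == "view" and curr[2] == "click":
            delta = datetime.strptime(curr[0], FMT) - datetime.strptime(prev[0], FMT)
            # Pool the gap under the ad/campaign pair, ignoring the user.
            gaps[(curr[3], curr[4])].append(delta.total_seconds())
    means = {pair: sum(values) / len(values) for pair, values in gaps.items()}
    return sorted(means.items(), key=lambda kv: kv[1])[:n]

# Hypothetical sample: two users on the same ad/campaign, one user on another.
events = [
    ("2016-12-01 19:36:00", "u1", "view", "adId_01", "campaignId_01"),
    ("2016-12-01 20:42:00", "u1", "click", "adId_01", "campaignId_01"),
    ("2016-05-01 10:00:00", "u2", "view", "adId_01", "campaignId_01"),
    ("2016-05-01 10:10:00", "u2", "click", "adId_01", "campaignId_01"),
    ("2016-03-01 08:00:00", "u2", "view", "adId_05", "campaignId_02"),
    ("2016-03-01 08:08:00", "u2", "click", "adId_05", "campaignId_02"),
]

ranking = smallest_mean_gaps(events)
print(ranking)
```

In this sample the pair (adId_01, campaignId_01) averages the gaps of both users (3960 s and 600 s), which a per-user grouping would keep as two separate rows.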