<img src="https://www.iscte-iul.pt/assets/images/logo_iscte_detailed.svg" style="width: 450px;margin-top:30px;" align ="center">

<div style= "font-size: 40px;  margin-top:40px; font-weight:bold; font-family: 'Avenir Next LT Pro', sans-serif;"><center>Data Joining: <strong>E-Commerce</strong></center></div>
<div style= "font-size: 35px; font-weight:bold; font-family: 'Avenir Next LT Pro', sans-serif;"><center>Merging the two CSV files into a single Parquet file</center></div>

<div style= "font-size: 27px;font-weight:bold;line-height: 1.1; margin-top:40px; font-family: 'Avenir Next LT Pro', sans-serif;"><center>Processamento e Modelação de Big Data 2024/2025</center></div> <br>

   <div style= "font-size: 20px;font-weight:bold; font-family: 'Avenir Next LT Pro', sans-serif;"><center> Group 7:</center></div>
   <div><center> Diogo Freitas | 104841 </center></div>
   <div><center> João Francisco Botas | 104782 </center></div>
   <div><center> Miguel Gonçalves | 105944 </center></div>
   <div><center> Ricardo Galvão | 105285 </center></div>

--- 
## Spark Session

Start the Spark session with the application name `Projeto`.

In [11]:
# Basic imports
import pyspark
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import (
    StructType, StructField,
    StringType, LongType, DoubleType, TimestampType
)

# Create a Spark session
spark = SparkSession.builder \
    .appName("Projeto") \
    .getOrCreate()

---
## Read Data

First we define the schema explicitly, so that Spark does not have to infer the column types when reading the data, which makes the read more efficient.

NOTE: **Adjust the `data` folder path to match your setup**

In [12]:
# read the CSV files from the data folder
data_dir = "../data/"

schema = StructType([
    StructField("event_time", TimestampType(), True),
    StructField("event_type", StringType(), True),
    StructField("product_id", LongType(), True),
    StructField("category_id", LongType(), True),
    StructField("category_code", StringType(), True),
    StructField("brand", StringType(), True),
    StructField("price", DoubleType(), True),
    StructField("user_id", LongType(), True),
    StructField("user_session", StringType(), True)
])

ec_oct = spark.read.csv(data_dir + "2019-Oct.csv", header=True, schema=schema)
ec_nov = spark.read.csv(data_dir + "2019-Nov.csv", header=True, schema=schema)
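
As a side note, the `schema` argument of `spark.read.csv` also accepts a DDL-formatted string instead of a `StructType`. The sketch below builds such a string for the same nine columns. The names `columns` and `schema_ddl` are illustrative and not part of the original notebook, and the Spark call is left as a comment because it depends on the session and files above.

```python
# Column names and Spark SQL type names mirroring the StructType above
# (LongType -> BIGINT, DoubleType -> DOUBLE). `columns` and `schema_ddl`
# are hypothetical names used only for this sketch.
columns = [
    ("event_time", "TIMESTAMP"),
    ("event_type", "STRING"),
    ("product_id", "BIGINT"),
    ("category_id", "BIGINT"),
    ("category_code", "STRING"),
    ("brand", "STRING"),
    ("price", "DOUBLE"),
    ("user_id", "BIGINT"),
    ("user_session", "STRING"),
]

# Join "name TYPE" pairs into a single DDL string
schema_ddl = ", ".join(f"{name} {dtype}" for name, dtype in columns)
print(schema_ddl)

# Assumed usage, equivalent to passing the StructType:
# ec_oct = spark.read.csv(data_dir + "2019-Oct.csv", header=True, schema=schema_ddl)
```

Either form avoids schema inference; the string version is simply shorter to write and to compare against the `printSchema()` output.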

In [13]:
# count rows
print(f"Number of rows in October 2019 file: {ec_oct.count()}")
print(f"Number of rows in November 2019 file: {ec_nov.count()}")

Number of rows in October 2019 file: 42448764
Number of rows in November 2019 file: 67501979



---
## Data Joining

Combine the data from the two files and write the result to a Parquet file for use in later steps.

In [5]:
# stack the two monthly datasets (union matches columns by position; both share the same schema)
ec_total = ec_oct.union(ec_nov)

In [6]:
# write in a parquet file
ec_total.write.parquet(data_dir + "ec_total.parquet", mode="overwrite")
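
Spark writes Parquet output as a directory of `part-*.parquet` files rather than as one file, so `ec_total.parquet` is expected to be a folder. A minimal sketch for inspecting it with the standard library follows; the helper name `summarize_parquet_dir` is hypothetical, and the path assumes the same `../data/` folder used above.

```python
from pathlib import Path


def summarize_parquet_dir(path):
    """Return (number of part files, total size in bytes) for a Parquet output directory."""
    # Only the data files are counted; markers such as _SUCCESS are ignored
    part_files = sorted(Path(path).glob("part-*.parquet"))
    total_bytes = sum(f.stat().st_size for f in part_files)
    return len(part_files), total_bytes


# Assumed location, matching data_dir + "ec_total.parquet" in the notebook
out_dir = Path("../data/ec_total.parquet")
if out_dir.is_dir():
    n_parts, n_bytes = summarize_parquet_dir(out_dir)
    print(f"{n_parts} part files, {n_bytes / 1e9:.2f} GB")
```

The number of part files reflects how the DataFrame was partitioned at write time, which may matter later when the Parquet data is read back.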

---
Read the Parquet file back to check that everything is in order.

In [None]:
# read the parquet file back and draw a random sample of about 5% of the data
ec_total = spark.read.parquet(data_dir + "ec_total.parquet")
# TODO: maybe adjust this, since the row order in the 5% sample may be random
ec_5p = ec_total.sample(fraction=0.05, seed=42)

In [8]:
# count rows
print(f"Number of rows in the 5% sample: {ec_5p.count()}")

Number of rows in the 5% sample: 5495602
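
The count is close to, but not exactly, 5% of the data: `DataFrame.sample` does not guarantee an exact fraction. Assuming each row is kept independently with probability 0.05 (Bernoulli sampling, an assumption about Spark's behaviour rather than something stated in this notebook), the size of the deviation can be checked with plain arithmetic. The total below is the sum of the two CSV row counts printed earlier.

```python
import math

total_rows = 42448764 + 67501979   # October + November row counts from above
fraction = 0.05
observed = 5495602                 # count reported for the 5% sample

expected = total_rows * fraction
# Standard deviation of a binomial count, under the independence assumption
std = math.sqrt(total_rows * fraction * (1 - fraction))
z = (observed - expected) / std

print(f"expected ~ {expected:.0f}, observed = {observed}, difference = {observed - expected:.0f}")
print(f"binomial std ~ {std:.0f}, z-score ~ {z:.2f}")
```

The observed count lies less than one standard deviation from the expected value, so under this assumption the sample size gives no reason for concern.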


In [15]:
print(f"Number of rows in the parquet file: {ec_total.count()}")
print(f"Sum of the row counts of the two CSV files: {42448764 + 67501979}")

Number of rows in the parquet file: 109950743
Sum of the row counts of the two CSV files: 109950743


In [9]:
# show 10 rows
ec_5p.show(10)

+-------------------+----------+----------+-------------------+--------------------+-------------+------+---------+--------------------+
|         event_time|event_type|product_id|        category_id|       category_code|        brand| price|  user_id|        user_session|
+-------------------+----------+----------+-------------------+--------------------+-------------+------+---------+--------------------+
|2019-11-17 08:43:00|      view|   1005253|2053013555631882655|electronics.smart...|       xiaomi|288.04|516404307|a383cb03-2673-446...|
|2019-11-17 08:43:01|      view|  60000003|2162513074060264222|                NULL|geoffanderson|  44.4|515817144|505cf403-f7ce-4ab...|
|2019-11-17 08:43:01|      view|   4700388|2053013560899928785|auto.accessories....|    prestigio| 32.18|572492652|4879a14c-58b3-43a...|
|2019-11-17 08:43:01|      cart|  28718385|2053013565228450757|       apparel.shoes|       rieker|103.99|518296473|8af4a493-9188-43a...|
|2019-11-17 08:43:01|      cart|  2650014

In [14]:
ec_5p.printSchema()

root
 |-- event_time: timestamp (nullable = true)
 |-- event_type: string (nullable = true)
 |-- product_id: integer (nullable = true)
 |-- category_id: long (nullable = true)
 |-- category_code: string (nullable = true)
 |-- brand: string (nullable = true)
 |-- price: double (nullable = true)
 |-- user_id: integer (nullable = true)
 |-- user_session: string (nullable = true)

