*CESAR School*

*Data Analysis Engineering*

*Course: Cloud Computing*

*Student: Erike Simon Costa Cativo do Nascimento*

*Final Challenge*

## Initial setup

Google Drive setup

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Installing Spark

In [None]:
!pip install --upgrade pyspark

Collecting pyspark
  Downloading pyspark-3.5.1.tar.gz (317.0 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.1-py2.py3-none-any.whl size=317488491 sha256=acaa10da488f6a36716b769c17d7538c9911730754ea9cc3e3884b020d1998c7
  Stored in directory: /root/.cache/pip/wheels/80/1d/60/2c256ed38dddce2fdd93be545214a63e02fbd8d74fb0b7f3a6
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.1


Imports and Spark configuration

In [None]:
import os
import pandas as pd
import numpy as np
import pyspark.sql.functions as F

from pyspark.sql import SparkSession

os.environ['PYSPARK_SUBMIT_ARGS'] = '\
      --driver-memory 4G \
      --executor-memory 4G \
      pyspark-shell'

# Use plotly instead of matplotlib as the pandas plotting backend
pd.options.plotting.backend = "plotly"
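
The backslash-newline pairs inside the quoted string above are line continuations, so the environment variable ends up holding a single line of space-separated flags. A minimal sketch of an equivalent construction that may be easier to read and edit follows; `submit_args`, `new_value` and `legacy_value` are illustrative names, not part of the notebook. Since the session in this notebook uses a `local[*]` master, the executor runs inside the driver JVM, so `--driver-memory` is the flag expected to have a practical effect.

```python
# Sketch: build the same PYSPARK_SUBMIT_ARGS value from a list of flags.
# All variable names here are illustrative.
submit_args = [
    "--driver-memory", "4G",
    "--executor-memory", "4G",
    "pyspark-shell",  # kept last, as in the notebook
]
new_value = " ".join(submit_args)

# The notebook's form: a backslash followed by a newline inside a string
# literal continues the line, so no newline character ends up in the value.
legacy_value = '\
      --driver-memory 4G \
      --executor-memory 4G \
      pyspark-shell'

# Both forms carry the same tokens; only the whitespace between them differs.
same_tokens = legacy_value.split() == new_value.split()

print(new_value)
print(same_tokens)

# To apply it, one would assign it before the SparkSession is created:
# os.environ["PYSPARK_SUBMIT_ARGS"] = new_value
```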

In [None]:
# Create a local Spark session: a single executor with as many worker threads as there are available CPU cores

spark = SparkSession.builder\
    .master("local[*]")\
    .getOrCreate()
spark

In [None]:
# Command to stop the Spark session and release its resources
# spark.stop()

## Loading and analyzing the data

In [None]:
ROOT_DATA_PATH = "/content/drive/MyDrive/Colab Notebooks/computacao-nuvem/data"

Movies dataset

In [None]:
movies_df = spark.read.csv(f'{ROOT_DATA_PATH}/movies.csv', header=False, inferSchema=True, sep=';')\
            .toDF('Movie_Id' , 'Title')
movies_df.show(5)

+--------+--------------------+
|Movie_Id|               Title|
+--------+--------------------+
|       1|(Dinosaur Planet,...|
|       2|(Isle of Man TT 2...|
|       3|   (Character, 1997)|
|       4|(Paula Abdul's Ge...|
|       5|(The Rise and Fal...|
+--------+--------------------+
only showing top 5 rows



In [None]:
movies_df.printSchema()

root
 |-- Movie_Id: integer (nullable = true)
 |-- Title: string (nullable = true)



Ratings dataset

In [None]:
customers_df = spark.read.csv(f'{ROOT_DATA_PATH}/customers_rating.csv', header=True, inferSchema=True, sep=';')
customers_df.show(5)

+-------+------+----------+--------+
|Cust_Id|Rating|      Date|Movie_Id|
+-------+------+----------+--------+
|1488844|   3.0|2005-09-06|       1|
| 822109|   5.0|2005-05-13|       1|
| 885013|   4.0|2005-10-19|       1|
|  30878|   4.0|2005-12-26|       1|
| 823519|   3.0|2004-05-03|       1|
+-------+------+----------+--------+
only showing top 5 rows



In [None]:
customers_df.printSchema()

root
 |-- Cust_Id: integer (nullable = true)
 |-- Rating: double (nullable = true)
 |-- Date: date (nullable = true)
 |-- Movie_Id: integer (nullable = true)



Inspecting and cleaning the data

`movies_df`

In [None]:
# Count the number of null values in each column
null_counts = movies_df.select([F.when(F.col(c).isNull(), 1).otherwise(0).alias(c) for c in movies_df.columns]) \
                       .agg(*[F.sum(F.col(c)).alias(c) for c in movies_df.columns])

null_counts.show()

+--------+-----+
|Movie_Id|Title|
+--------+-----+
|       0|    0|
+--------+-----+



In [None]:
# Compute summary statistics
summary_df = movies_df.summary()

summary_df.show()

+-------+------------------+--------------------+
|summary|          Movie_Id|               Title|
+-------+------------------+--------------------+
|  count|              4499|                4499|
|   mean|            2250.0|                NULL|
| stddev|1298.8937600897157|                NULL|
|    min|                 1|('N Sync: 'N the ...|
|    25%|              1125|                NULL|
|    50%|              2250|                NULL|
|    75%|              3375|                NULL|
|    max|              4499|    (s-Cry-ed, 2003)|
+-------+------------------+--------------------+
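
The `Movie_Id` statistics above are consistent with the IDs being the contiguous integers 1 to 4499: for such a sequence the mean is (n + 1) / 2 and the sample standard deviation is sqrt(n (n + 1) / 12). The sketch below checks that arithmetic with the standard library only. It is a sanity check on the reported numbers under the assumption of contiguous IDs, not a proof that no ID is missing or duplicated.

```python
import math
import statistics

n = 4499                     # row count reported by movies_df.summary()
ids = list(range(1, n + 1))  # assumed contiguous IDs 1..n (hypothetical data)

mean = statistics.mean(ids)
stddev = statistics.stdev(ids)             # sample standard deviation
closed_form = math.sqrt(n * (n + 1) / 12)  # closed form for the integers 1..n

print(mean)
print(round(stddev, 4), round(closed_form, 4))
```

Both values can be compared by eye with the `mean` and `stddev` rows of the summary table.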



Note: there are no null values in the `movies_df` dataset.

Extracting the release year from the *'Title'* column

In [None]:
# Extract the four-digit year (capture group 2) from the title into a new 'Year' column
movies_df = movies_df.withColumn('Year', F.regexp_extract('Title', r'\((.*),\s*(\d{4})\)', 2))

# Convert 'Year' from a string to an integer year (the 'yyyy' string is cast to a date, then F.year extracts the year)
movies_df = movies_df.withColumn('Year', F.year('Year'))

# Keep only the movie name (capture group 1) in 'Title'
movies_df = movies_df.withColumn('Title', F.regexp_extract('Title', r'\((.*),\s*\d{4}\)', 1))

movies_df.toPandas().head()

   Movie_Id                         Title  Year
0         1               Dinosaur Planet  2003
1         2    Isle of Man TT 2004 Review  2004
2         3                     Character  1997
3         4  Paula Abdul's Get Up & Dance  1994
4         5      The Rise and Fall of ECW  2004
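
To see what the two capture groups pick up, the same pattern can be exercised on plain strings with Python's `re` module. This is a minimal sketch: `split_title` is a hypothetical helper, and the third sample string is illustrative (not taken from the dataset), chosen to show a title that itself contains a comma. Spark evaluates the pattern with Java's regex engine rather than Python's, although this pattern only uses constructs that the two engines are expected to treat alike.

```python
import re

# Same pattern as in the Spark cell: group 1 = title, group 2 = four-digit year
TITLE_YEAR = re.compile(r'\((.*),\s*(\d{4})\)')

def split_title(raw):
    """Return (title, year) for strings like '(Character, 1997)', else (None, None)."""
    match = TITLE_YEAR.search(raw)
    if match is None:
        return None, None
    return match.group(1), int(match.group(2))

samples = [
    "(Dinosaur Planet, 2003)",
    "(Isle of Man TT 2004 Review, 2004)",           # a year-like number inside the title
    "(Lock, Stock and Two Smoking Barrels, 1998)",  # illustrative: comma inside the title
]
for raw in samples:
    print(split_title(raw))
```

Because `.*` is greedy, the title group extends up to the last `, yyyy)` in the string, which is why commas and four-digit numbers inside a title do not break the split.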


`customers_df`

In [None]:
null_counts = customers_df.select([F.when(F.col(c).isNull(), 1).otherwise(0).alias(c) for c in customers_df.columns]) \
                       .agg(*[F.sum(F.col(c)).alias(c) for c in customers_df.columns])

null_counts.show()

+-------+------+----+--------+
|Cust_Id|Rating|Date|Movie_Id|
+-------+------+----+--------+
|      0|     0|   0|       0|
+-------+------+----+--------+



Note: there are no null values in the `customers_df` dataset.





In [None]:
summary_df = customers_df.summary()

summary_df.show()

+-------+------------------+------------------+------------------+
|summary|           Cust_Id|            Rating|          Movie_Id|
+-------+------------------+------------------+------------------+
|  count|          24053764|          24053764|          24053764|
|   mean|1322285.3422910443|3.5996343025565563|2308.3239047743214|
| stddev| 764577.9360816252|1.0861181978521708|1303.9093031879506|
|    min|                 6|               1.0|                 1|
|    25%|            660822|               3.0|              1180|
|    50%|           1318548|               4.0|              2342|
|    75%|           1984315|               4.0|              3433|
|    max|           2649429|               5.0|              4499|
+-------+------------------+------------------+------------------+



Joining the two datasets with *.join()*

In [None]:
movies = customers_df.join(movies_df, on='Movie_Id', how='inner')
movies = movies.cache()
movies.show(5)

+--------+-------+------+----------+---------------+----+
|Movie_Id|Cust_Id|Rating|      Date|          Title|Year|
+--------+-------+------+----------+---------------+----+
|       1|1488844|   3.0|2005-09-06|Dinosaur Planet|2003|
|       1| 822109|   5.0|2005-05-13|Dinosaur Planet|2003|
|       1| 885013|   4.0|2005-10-19|Dinosaur Planet|2003|
|       1|  30878|   4.0|2005-12-26|Dinosaur Planet|2003|
|       1| 823519|   3.0|2004-05-03|Dinosaur Planet|2003|
+--------+-------+------+----------+---------------+----+
only showing top 5 rows



Saving the data as *.parquet* after analysis and preprocessing




In [None]:
movies.write.parquet(f'{ROOT_DATA_PATH}/movies', mode='overwrite')