<a href="https://colab.research.google.com/github/LaisST/FIAP_202501_HandsOn_data_analytics/blob/main/BigData_Aula1_DataManipulation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Phase 3 - Lesson 1
# Big Data Framework

## 🚀 What is Spark?

Apache Spark is an open-source data processing platform built to be fast, easy to use, and capable of advanced analytics.

🧠 Key characteristics:

- Started in 2009 at the University of California, Berkeley.
- Maintained by the Apache Software Foundation, not tied to any vendor.
- Processes large volumes of data on clusters with thousands of nodes.
- Used by companies such as Netflix, Yahoo, and eBay to handle petabytes of data.

⚙️ Why is it special?

- Addresses performance and parallel-processing problems.
- Runs tasks in memory, which speeds up processing.
- Supports multiple kinds of applications:
    - Big Data
    - Machine Learning
    - Data streaming
    - SQL queries

## Environment Setup

In [None]:
import os

In [None]:
! pip install -q findspark

In [None]:
import findspark

Apache Spark 3.5.7
https://dlcdn.apache.org/spark/spark-3.5.7/spark-3.5.7-bin-hadoop3.tgz

In [None]:
# Install Java
! apt-get install openjdk-8-jdk-headless -qq > /dev/null

In [None]:
# Download, extract, and configure Apache Spark

! wget -q https://dlcdn.apache.org/spark/spark-3.5.7/spark-3.5.7-bin-hadoop3.tgz

In [None]:
! tar xf spark-3.5.7-bin-hadoop3.tgz

In [None]:
! rm -rf spark-3.5.7-bin-hadoop3.tgz

In [None]:
# Configure the environment variables
os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-8-openjdk-amd64'
os.environ['SPARK_HOME'] = '/content/spark-3.5.7-bin-hadoop3'
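
A wrong path in either variable usually only surfaces later, when `findspark.init()` or the session builder fails. Below is a minimal sketch of a pre-flight check, assuming the Colab paths configured above; `missing_dirs` is a hypothetical helper, not part of findspark or PySpark.

```python
import os

def missing_dirs(var_names):
    """Return the variables that are unset or do not point to an existing directory."""
    missing = []
    for name in var_names:
        path = os.environ.get(name)
        if not path or not os.path.isdir(path):
            missing.append(name)
    return missing

# Hypothetical usage: report problems before calling findspark.init()
problems = missing_dirs(['JAVA_HOME', 'SPARK_HOME'])
if problems:
    print('Check these variables:', problems)
```

An empty list means both directories exist, so initialization can proceed.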

In [None]:
# Initialize Spark and create a session

findspark.init()

from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local[*]').appName('Exercicio01').getOrCreate()


In [None]:
# Test
df = spark.sql('''select 'spark' as hello''')
df.show()

+-----+
|hello|
+-----+
|spark|
+-----+



## Importing the Libraries

In [None]:
# Import Spark libraries
from pyspark.sql import Row, DataFrame
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import col, expr, lit, substring, concat, concat_ws, when, coalesce
from pyspark.sql import functions as F
from functools import reduce


## Data Manipulation Using Spark

In [None]:
# Load the dataset
df = spark.read.csv('banklist.csv', sep = ',', inferSchema=True, header=True)

In [None]:
# About the dataset
print('Number of rows (df.count): ', df.count())
print('Number of columns (len(df.columns)): ', len(df.columns))
print('Column names (df.columns): ', df.columns)

Number of rows (df.count):  561
Number of columns (len(df.columns)):  6
Column names (df.columns):  ['Bank Name', 'City', 'ST', 'CERT', 'Acquiring Institution', 'Closing Date']


## Using SQL in PySpark

In [None]:
# Register the DataFrame as a global temporary view (safe to re-run the cell)
df.createOrReplaceGlobalTempView('banklist')

df_check = spark.sql("SELECT `Bank Name`, City, `Closing Date` FROM global_temp.banklist")
df_check.show()



+--------------------+------------------+------------+
|           Bank Name|              City|Closing Date|
+--------------------+------------------+------------+
|The First State Bank|     Barboursville|    3-Apr-20|
|  Ericson State Bank|           Ericson|   14-Feb-20|
|City National Ban...|            Newark|    1-Nov-19|
|       Resolute Bank|            Maumee|   25-Oct-19|
|Louisa Community ...|            Louisa|   25-Oct-19|
|The Enloe State Bank|            Cooper|   31-May-19|
|Washington Federa...|           Chicago|   15-Dec-17|
|The Farmers and M...|           Argonia|   13-Oct-17|
| Fayette County Bank|        Saint Elmo|   26-May-17|
|Guaranty Bank, (d...|         Milwaukee|    5-May-17|
|      First NBC Bank|       New Orleans|   28-Apr-17|
|       Proficio Bank|Cottonwood Heights|    3-Mar-17|
|Seaway Bank and T...|           Chicago|   27-Jan-17|
|Harvest Community...|        Pennsville|   13-Jan-17|
|         Allied Bank|          Mulberry|   23-Sep-16|
|The Woodb

## Basic DataFrame Operations

In [None]:
df.describe().show()

+-------+--------------------+-------+----+-----------------+---------------------+------------+
|summary|           Bank Name|   City|  ST|             CERT|Acquiring Institution|Closing Date|
+-------+--------------------+-------+----+-----------------+---------------------+------------+
|  count|                 561|    561| 561|              561|                  561|         561|
|   mean|                NULL|   NULL|NULL|31685.68449197861|                 NULL|        NULL|
| stddev|                NULL|   NULL|NULL|16446.65659309965|                 NULL|        NULL|
|    min|1st American Stat...|Acworth|  AL|               91|      1st United Bank|    1-Aug-08|
|    max|               ebank|Wyoming|  WY|            58701|  Your Community Bank|    9-Sep-11|
+-------+--------------------+-------+----+-----------------+---------------------+------------+



In [None]:
df.describe('City', 'ST').show()

+-------+-------+----+
|summary|   City|  ST|
+-------+-------+----+
|  count|    561| 561|
|   mean|   NULL|NULL|
| stddev|   NULL|NULL|
|    min|Acworth|  AL|
|    max|Wyoming|  WY|
+-------+-------+----+
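
The `min` and `max` rows in the first summary above report `1-Aug-08` and `9-Sep-11` for `Closing Date`. Because that column is a string, the comparison is lexicographic rather than chronological. The plain-Python sketch below reproduces the effect with four date strings taken from the outputs in this notebook; it only illustrates the ordering and does not use Spark.

```python
from datetime import datetime

# A few 'Closing Date' values in the d-Mon-yy form used by the dataset
closing_dates = ['3-Apr-20', '14-Feb-20', '1-Aug-08', '9-Sep-11']

# String comparison goes character by character
print(min(closing_dates), max(closing_dates))
# 1-Aug-08 9-Sep-11

# Parsing first gives a chronological ordering
parsed = [datetime.strptime(d, '%d-%b-%y') for d in closing_dates]
print(min(parsed).date(), max(parsed).date())
# 2008-08-01 2020-04-03
```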



## Count, Columns, and Schema

In [None]:
print('Total rows: ', df.count())
print('Total columns: ', len(df.columns))
print('Columns: ', df.columns)
print('Data types: ', df.dtypes)
print('Schema: ', df.schema)

Total rows:  561
Total columns:  6
Columns:  ['Bank Name', 'City', 'ST', 'CERT', 'Acquiring Institution', 'Closing Date']
Data types:  [('Bank Name', 'string'), ('City', 'string'), ('ST', 'string'), ('CERT', 'int'), ('Acquiring Institution', 'string'), ('Closing Date', 'string')]
Schema:  StructType([StructField('Bank Name', StringType(), True), StructField('City', StringType(), True), StructField('ST', StringType(), True), StructField('CERT', IntegerType(), True), StructField('Acquiring Institution', StringType(), True), StructField('Closing Date', StringType(), True)])


In [None]:
df.printSchema()

root
 |-- Bank Name: string (nullable = true)
 |-- City: string (nullable = true)
 |-- ST: string (nullable = true)
 |-- CERT: integer (nullable = true)
 |-- Acquiring Institution: string (nullable = true)
 |-- Closing Date: string (nullable = true)



## Removing Duplicates

In [None]:
df = df.dropDuplicates()
print('Total rows: ', df.count())
print('Columns: ', df.columns)

Total rows:  561
Columns:  ['Bank Name', 'City', 'ST', 'CERT', 'Acquiring Institution', 'Closing Date']


## Selecting Specific Columns

In [None]:
df2 = df.select(*['Bank Name', 'City'])
df2.show(2)

+--------------------+--------+
|           Bank Name|    City|
+--------------------+--------+
| First Bank of Idaho| Ketchum|
|Amcore Bank, Nati...|Rockford|
+--------------------+--------+
only showing top 2 rows



## Selecting Multiple Columns

In [None]:
col_l = list(set(df.columns) - {'CERT', 'ST'})
df2 = df.select(*col_l)
df2.show(2)

+---------------------+--------------------+------------+--------+
|Acquiring Institution|           Bank Name|Closing Date|    City|
+---------------------+--------------------+------------+--------+
|      U.S. Bank, N.A.| First Bank of Idaho|   24-Apr-09| Ketchum|
|          Harris N.A.|Amcore Bank, Nati...|   23-Apr-10|Rockford|
+---------------------+--------------------+------------+--------+
only showing top 2 rows
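
Building the column list through a `set` difference does not preserve the original column order, which is why the columns above come out rearranged. A list comprehension is one order-preserving alternative. The sketch works on the plain list of names reported by `df.columns`, so it runs without Spark.

```python
# Column names as reported by df.columns in this notebook
columns = ['Bank Name', 'City', 'ST', 'CERT', 'Acquiring Institution', 'Closing Date']
excluded = {'CERT', 'ST'}

# Keep every column that is not excluded, in the original order
col_l = [c for c in columns if c not in excluded]
print(col_l)
# ['Bank Name', 'City', 'Acquiring Institution', 'Closing Date']
```

The resulting list can be passed to `df.select(*col_l)` exactly as before.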



## Renaming Columns

In [None]:
df2 = df \
  .withColumnRenamed('Bank Name', 'bank_name') \
  .withColumnRenamed('Acquiring Institution', 'acq_institution') \
  .withColumnRenamed('Closing Date', 'closing_date') \
  .withColumnRenamed('ST', 'state') \
  .withColumnRenamed('CERT', 'cert')
df2.show(2)

+--------------------+--------+-----+-----+---------------+------------+
|           bank_name|    City|state| cert|acq_institution|closing_date|
+--------------------+--------+-----+-----+---------------+------------+
| First Bank of Idaho| Ketchum|   ID|34396|U.S. Bank, N.A.|   24-Apr-09|
|Amcore Bank, Nati...|Rockford|   IL| 3735|    Harris N.A.|   23-Apr-10|
+--------------------+--------+-----+-----+---------------+------------+
only showing top 2 rows



## Adding Columns

In [None]:
df2 = df.withColumn('State', col('ST'))
df2.show(2)

+--------------------+--------+---+-----+---------------------+------------+-----+
|           Bank Name|    City| ST| CERT|Acquiring Institution|Closing Date|State|
+--------------------+--------+---+-----+---------------------+------------+-----+
| First Bank of Idaho| Ketchum| ID|34396|      U.S. Bank, N.A.|   24-Apr-09|   ID|
|Amcore Bank, Nati...|Rockford| IL| 3735|          Harris N.A.|   23-Apr-10|   IL|
+--------------------+--------+---+-----+---------------------+------------+-----+
only showing top 2 rows



## Adding a Constant Column
A fixed value (the same for every row).


In [None]:
df2 = df.withColumn('country', lit('US'))
df2.show(2)

+--------------------+--------+---+-----+---------------------+------------+-------+
|           Bank Name|    City| ST| CERT|Acquiring Institution|Closing Date|country|
+--------------------+--------+---+-----+---------------------+------------+-------+
| First Bank of Idaho| Ketchum| ID|34396|      U.S. Bank, N.A.|   24-Apr-09|     US|
|Amcore Bank, Nati...|Rockford| IL| 3735|          Harris N.A.|   23-Apr-10|     US|
+--------------------+--------+---+-----+---------------------+------------+-------+
only showing top 2 rows



## Removing Columns

In [None]:
df2 = df.drop('CERT')
df2.show(2)

+--------------------+--------+---+---------------------+------------+
|           Bank Name|    City| ST|Acquiring Institution|Closing Date|
+--------------------+--------+---+---------------------+------------+
| First Bank of Idaho| Ketchum| ID|      U.S. Bank, N.A.|   24-Apr-09|
|Amcore Bank, Nati...|Rockford| IL|          Harris N.A.|   23-Apr-10|
+--------------------+--------+---+---------------------+------------+
only showing top 2 rows



## Removing Multiple Columns

In [None]:
df2 = df.drop(*['CERT', 'ST'])
df2.show(2)

+--------------------+--------+---------------------+------------+
|           Bank Name|    City|Acquiring Institution|Closing Date|
+--------------------+--------+---------------------+------------+
| First Bank of Idaho| Ketchum|      U.S. Bank, N.A.|   24-Apr-09|
|Amcore Bank, Nati...|Rockford|          Harris N.A.|   23-Apr-10|
+--------------------+--------+---------------------+------------+
only showing top 2 rows



In [None]:
df2 = reduce(DataFrame.drop, ['CERT', 'ST'], df)
df2.show(2)

+--------------------+--------+---------------------+------------+
|           Bank Name|    City|Acquiring Institution|Closing Date|
+--------------------+--------+---------------------+------------+
| First Bank of Idaho| Ketchum|      U.S. Bank, N.A.|   24-Apr-09|
|Amcore Bank, Nati...|Rockford|          Harris N.A.|   23-Apr-10|
+--------------------+--------+---------------------+------------+
only showing top 2 rows



## Filtering the Data

In [None]:
# Equal to ...
df2 = df.where(df['ST'] == 'NE')

# Between values (inclusive on both ends)
df3 = df.where(df['CERT'].between(1000, 2000))

# Is in ...
df4 = df.where(df['ST'].isin('NE', 'IL'))

print('Number of rows in df', df.count())
print('Number of rows in df2', df2.count())
print('Number of rows in df3', df3.count())
print('Number of rows in df4', df4.count())

Number of rows in df 561
Number of rows in df2 4
Number of rows in df3 9
Number of rows in df4 73


## Filtering with Logical Operators

In [None]:
df2 = df.where((df['ST'] == 'NE') & (df['City'] == 'Ericson'))
df2.show()

+------------------+-------+---+-----+---------------------+------------+
|         Bank Name|   City| ST| CERT|Acquiring Institution|Closing Date|
+------------------+-------+---+-----+---------------------+------------+
|Ericson State Bank|Ericson| NE|18265| Farmers and Merch...|   14-Feb-20|
+------------------+-------+---+-----+---------------------+------------+



## Replacing Values

In [None]:
# Before replacement
df.show()

# After replacement
print('Replace 7 with 17')
df.na.replace(7, 17).show()

+--------------------+----------------+---+-----+---------------------+------------+
|           Bank Name|            City| ST| CERT|Acquiring Institution|Closing Date|
+--------------------+----------------+---+-----+---------------------+------------+
| First Bank of Idaho|         Ketchum| ID|34396|      U.S. Bank, N.A.|   24-Apr-09|
|Amcore Bank, Nati...|        Rockford| IL| 3735|          Harris N.A.|   23-Apr-10|
|        Venture Bank|           Lacey| WA|22868| First-Citizens Ba...|   11-Sep-09|
|First State Bank ...|           Altus| OK| 9873|         Herring Bank|   31-Jul-09|
|Valley Capital Ba...|            Mesa| AZ|58399| Enterprise Bank &...|   11-Dec-09|
|Michigan Heritage...|Farmington Hills| MI|34369|       Level One Bank|   24-Apr-09|
|Columbia Savings ...|      Cincinnati| OH|32284| United Fidelity B...|   23-May-14|
|       Fidelity Bank|        Dearborn| MI|33883| The Huntington Na...|   30-Mar-12|
|The Park Avenue Bank|        Valdosta| GA|19797|   Bank of the O

# COVID Dataset Test

In [None]:
import requests

# Download the OWID COVID-19 dataset and save it locally
path = "https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv"
req = requests.get(path)
req.raise_for_status()  # stop early if the download failed

csv_file_name = 'owid-covid-data.csv'
with open(csv_file_name, 'wb') as csv_file:
    csv_file.write(req.content)

df = spark.read.csv('/content/' + csv_file_name, header=True, inferSchema=True)

In [None]:
#Viewing the dataframe schema
df.printSchema()

root
 |-- iso_code: string (nullable = true)
 |-- continent: string (nullable = true)
 |-- location: string (nullable = true)
 |-- date: date (nullable = true)
 |-- total_cases: integer (nullable = true)
 |-- new_cases: integer (nullable = true)
 |-- new_cases_smoothed: double (nullable = true)
 |-- total_deaths: integer (nullable = true)
 |-- new_deaths: integer (nullable = true)
 |-- new_deaths_smoothed: double (nullable = true)
 |-- total_cases_per_million: double (nullable = true)
 |-- new_cases_per_million: double (nullable = true)
 |-- new_cases_smoothed_per_million: double (nullable = true)
 |-- total_deaths_per_million: double (nullable = true)
 |-- new_deaths_per_million: double (nullable = true)
 |-- new_deaths_smoothed_per_million: double (nullable = true)
 |-- reproduction_rate: double (nullable = true)
 |-- icu_patients: integer (nullable = true)
 |-- icu_patients_per_million: double (nullable = true)
 |-- hosp_patients: integer (nullable = true)
 |-- hosp_patients_per_mil

In [None]:
#Converting a date column
df.select(F.to_date(df.date).alias('date'))

DataFrame[date: date]

In [None]:
#Summary stats
df.describe().show()

+-------+--------+-------------+-----------+--------------------+------------------+------------------+------------------+------------------+-------------------+-----------------------+---------------------+------------------------------+------------------------+----------------------+-------------------------------+------------------+------------------+------------------------+------------------+-------------------------+---------------------+---------------------------------+----------------------+----------------------------------+-------------------+------------------+------------------------+----------------------+------------------+-------------------------------+-------------------+------------------+-------------+--------------------+--------------------+-----------------------+--------------------+------------------+-------------------------+------------------------------+-----------------------------+-----------------------------------+--------------------------+-------------

In [None]:
#Simple Group by Function
df.groupBy("location").sum("new_cases").orderBy(F.desc("sum(new_cases)")).show(truncate=False)

+-----------------------------+--------------+
|location                     |sum(new_cases)|
+-----------------------------+--------------+
|World                        |775935057     |
|High-income countries        |429044052     |
|Asia                         |301564180     |
|Europe                       |252916868     |
|Upper-middle-income countries|251756125     |
|European Union (27)          |185822587     |
|North America                |124492698     |
|United States                |103436829     |
|China                        |99373219      |
|Lower-middle-income countries|92019711      |
|South America                |68811012      |
|India                        |45041748      |
|France                       |38997490      |
|Germany                      |38437756      |
|Brazil                       |37511921      |
|South Korea                  |34571873      |
|Japan                        |33803572      |
|Italy                        |26781078      |
|United Kingd