<a href="https://colab.research.google.com/github/Dayrell/covid-brasil/blob/master/Dados_Brasil.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Installing dependencies and configuring Spark**

In [0]:
# install the dependencies (Java 8, Spark 2.4.4, findspark)
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
!tar xf spark-2.4.4-bin-hadoop2.7.tgz
!pip install -q findspark

In [0]:
# set the environment variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.4-bin-hadoop2.7"

# make pyspark importable; use SPARK_HOME so the path does not depend on the working directory
import findspark
findspark.init(os.environ["SPARK_HOME"])
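
Before initializing Spark, it can help to confirm that both environment variables point to directories that actually exist, since a wrong path tends to surface later as a less obvious error. The helper below is a minimal sketch; `missing_paths` is a hypothetical name, not part of findspark or PySpark.

```python
import os

def missing_paths(env_vars):
    """Return the names of variables that are unset or do not point to a directory.

    Hypothetical helper for illustration only.
    """
    return [name for name in env_vars
            if not os.path.isdir(os.environ.get(name, ""))]

# An empty list means both directories were found.
print(missing_paths(["JAVA_HOME", "SPARK_HOME"]))
```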

In [0]:
import pandas as pd
import requests
import io
import csv

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import TimestampType
from pyspark.sql import functions as F

**Fetching the data from Brasil.IO and saving it locally**

In [0]:
url = 'https://brasil.io/dataset/covid19/caso_full/?format=csv'
response = requests.get(url)
response.raise_for_status()  # stop here if the download failed

# Save the CSV bytes unchanged. Splitting each line on ',' and rewriting it
# would corrupt any quoted field that contains a comma.
with open('covid.csv', 'wb') as f:
    f.write(response.content)
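
A note on CSV handling: a plain `str.split(',')` does not respect quoted fields, whereas the `csv` module does. The sketch below uses a made-up line (not taken from the dataset) to show the difference.

```python
import csv
import io

# Made-up CSV line: the second field is quoted because it contains a comma.
line = '2020-06-05,"Example City, District",12'

naive = line.split(',')                       # ignores quoting
parsed = next(csv.reader(io.StringIO(line)))  # respects quoting

print(len(naive), naive)
print(len(parsed), parsed)
```

The naive split produces four pieces with stray quote characters, while `csv.reader` returns the three intended fields.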

**Creating the Spark session and reading the CSV**

In [0]:
def create_spark_session():
    spark = SparkSession \
        .builder \
        .master('local[*]') \
        .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.0") \
        .getOrCreate()
    return spark

In [0]:
spark = create_spark_session()

In [0]:
df = spark.read.csv('covid.csv', header=True, inferSchema=True)

In [0]:
df.printSchema()

root
 |-- epidemiological_week: integer (nullable = true)
 |-- date: timestamp (nullable = true)
 |-- order_for_place: integer (nullable = true)
 |-- state: string (nullable = true)
 |-- city: string (nullable = true)
 |-- city_ibge_code: integer (nullable = true)
 |-- place_type: string (nullable = true)
 |-- last_available_confirmed: integer (nullable = true)
 |-- last_available_confirmed_per_100k_inhabitants: double (nullable = true)
 |-- new_confirmed: integer (nullable = true)
 |-- last_available_deaths: integer (nullable = true)
 |-- new_deaths: integer (nullable = true)
 |-- last_available_death_rate: double (nullable = true)
 |-- estimated_population_2019: integer (nullable = true)
 |-- is_last: boolean (nullable = true)
 |-- is_repeated: boolean (nullable = true)



**Removing null rows**

In [0]:
df = df.where(col("epidemiological_week").isNotNull())

**Creating DataFrames for states and for cities**

In [0]:
df_estados = df.where(col("city").isNull()).sort(col("date"))
df_cidades = df.where(col("city").isNotNull()).sort(col("date"))

In [0]:
# df.agg(F.sum("new_deaths")).collect()
# df.drop_duplicates(['date', 'place_type']).show(truncate=False)

In [0]:
df_estados.select(F.sum('new_deaths')).collect()[0][0]

34706

In [0]:
df_cidades.select(F.sum('new_deaths')).collect()[0][0]

34706
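
Both totals match (34706), which suggests that the state-level rows already aggregate the city-level rows. Under that assumption, summing `new_deaths` over the unsplit `df` would count each death twice. The sketch below illustrates this in plain Python with made-up rows; only the column names come from the schema, and `total_new_deaths` is a hypothetical helper.

```python
# Toy rows with made-up numbers; only the column names come from the schema.
rows = [
    {"place_type": "state", "state": "AA", "city": None, "new_deaths": 5},
    {"place_type": "city", "state": "AA", "city": "City 1", "new_deaths": 3},
    {"place_type": "city", "state": "AA", "city": "City 2", "new_deaths": 2},
]

def total_new_deaths(rows, place_type=None):
    """Sum new_deaths, optionally restricted to one place_type."""
    return sum(r["new_deaths"] for r in rows
               if place_type is None or r["place_type"] == place_type)

states_total = total_new_deaths(rows, "state")
cities_total = total_new_deaths(rows, "city")
all_total = total_new_deaths(rows)  # mixes both levels, so it double counts

print(states_total, cities_total, all_total)  # 5 5 10
```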