# **COVID-19 DATA ANALYSIS PROJECT FOR THE STATE OF SÃO PAULO**

This project analyzes COVID-19 case data for the state of São Paulo from February 2020 to September 2021.

The data are available from the following sites:

https://www.seade.gov.br/coronavirus/#

https://github.com/seade-R/dados-covid-sp

https://www.seade.gov.br/


## **Importing the Data**

In [1]:
import numpy as np
import pandas as pd

In [2]:
covid = pd.read_csv('covid_sp_tratado.csv',
                    sep=';', encoding='utf-8')

In [3]:
covid.head()

Unnamed: 0,indice,municipio,dia,mes,data,casos,casos_novos,casos_pc,casos_mm7d,obitos,obitos_novos,obitos_pc,obitos_mm7d,letalidade,pop,pop_60,area,semana_epidem,densidade
0,1,Adamantina,25,2,2020-02-25,0,0,0.0,0.0,0,0,0.0,0.0,0.0,33894,7398,411.99,9,82.268987
1,2,Adolfo,25,2,2020-02-25,0,0,0.0,0.0,0,0,0.0,0.0,0.0,3447,761,211.06,9,16.331849
2,3,Aguaí,25,2,2020-02-25,0,0,0.0,0.0,0,0,0.0,0.0,0.0,35608,5245,474.55,9,75.035297
3,4,Águas da Prata,25,2,2020-02-25,0,0,0.0,0.0,0,0,0.0,0.0,0.0,7797,1729,142.67,9,54.650592
4,5,Águas de Lindóia,25,2,2020-02-25,0,0,0.0,0.0,0,0,0.0,0.0,0.0,18374,3275,60.13,9,305.571262


In [4]:
covid.shape

(373455, 19)

In [5]:
# Inspect the data type of each column
covid.dtypes

indice             int64
municipio         object
dia                int64
mes                int64
data              object
casos              int64
casos_novos        int64
casos_pc         float64
casos_mm7d       float64
obitos             int64
obitos_novos       int64
obitos_pc        float64
obitos_mm7d      float64
letalidade       float64
pop                int64
pop_60             int64
area             float64
semana_epidem      int64
densidade        float64
dtype: object
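
The `dtypes` output shows `data` stored as `object`, that is, as plain strings. A minimal sketch of converting such a column with `pd.to_datetime`, assuming the dates are ISO-formatted as in the rows displayed above. The small frame `exemplo` is made up for illustration and is not part of the dataset:

```python
import pandas as pd

# Hypothetical miniature of the dataset, with dates stored as strings
exemplo = pd.DataFrame({
    'municipio': ['Campinas', 'Campinas', 'Campinas'],
    'data': ['2020-12-30', '2020-12-31', '2021-01-01'],
    'casos_novos': [10, 20, 30],
})

# Convert the text column to real timestamps
exemplo['data'] = pd.to_datetime(exemplo['data'], format='%Y-%m-%d')

# Date-aware filtering, plus access to date parts through the .dt accessor
exemplo_2021 = exemplo.loc[exemplo['data'] >= '2021-01-01']
anos = exemplo['data'].dt.year.tolist()
```

With a datetime column, comparisons no longer rely on the lexicographic ordering of the strings.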

## Filtering two cities: Campinas and Guarulhos

In [6]:
covid_campinas = covid.loc[covid.municipio == 'Campinas'] 
covid_campinas.head(3)

Unnamed: 0,indice,municipio,dia,mes,data,casos,casos_novos,casos_pc,casos_mm7d,obitos,obitos_novos,obitos_pc,obitos_mm7d,letalidade,pop,pop_60,area,semana_epidem,densidade
108,109,Campinas,25,2,2020-02-25,0,0,0.0,0.0,0,0,0.0,0.0,0.0,1175501,192796,794.57,9,1479.417798
753,754,Campinas,26,2,2020-02-26,0,0,0.0,0.0,0,0,0.0,0.0,0.0,1175501,192796,794.57,9,1479.417798
1398,1399,Campinas,27,2,2020-02-27,0,0,0.0,0.0,0,0,0.0,0.0,0.0,1175501,192796,794.57,9,1479.417798


In [7]:
# Create a column with the percentage of elderly residents
covid_campinas['porcentagem_idosos'] = 100*covid_campinas['pop_60'] / covid_campinas['pop']
covid_campinas.head(3) 

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  covid_campinas['porcentagem_idosos'] = 100*covid_campinas['pop_60'] / covid_campinas['pop']


Unnamed: 0,indice,municipio,dia,mes,data,casos,casos_novos,casos_pc,casos_mm7d,obitos,obitos_novos,obitos_pc,obitos_mm7d,letalidade,pop,pop_60,area,semana_epidem,densidade,porcentagem_idosos
108,109,Campinas,25,2,2020-02-25,0,0,0.0,0.0,0,0,0.0,0.0,0.0,1175501,192796,794.57,9,1479.417798,16.401177
753,754,Campinas,26,2,2020-02-26,0,0,0.0,0.0,0,0,0.0,0.0,0.0,1175501,192796,794.57,9,1479.417798,16.401177
1398,1399,Campinas,27,2,2020-02-27,0,0,0.0,0.0,0,0,0.0,0.0,0.0,1175501,192796,794.57,9,1479.417798,16.401177
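
The `SettingWithCopyWarning` shown above appears because the new column is assigned to a frame obtained by filtering another one. One common way to avoid it, sketched here on a made-up two-row frame (`dados` is hypothetical; the population figures are copied from the rows above), is to take an explicit `.copy()` of the filtered result:

```python
import pandas as pd

# Hypothetical miniature of the dataset
dados = pd.DataFrame({
    'municipio': ['Campinas', 'Guarulhos'],
    'pop': [1175501, 1351275],
    'pop_60': [192796, 162662],
})

# .copy() makes the filtered frame independent of the original,
# so adding a column to it is unambiguous and raises no warning
campinas = dados.loc[dados['municipio'] == 'Campinas'].copy()
campinas['porcentagem_idosos'] = 100 * campinas['pop_60'] / campinas['pop']

porcentagem = round(campinas['porcentagem_idosos'].iloc[0], 6)
```

The original frame is left unchanged; only the copy receives the new column.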


In [8]:
covid_guarulhos = covid.loc[covid.municipio == 'Guarulhos'] 
covid_guarulhos.head(3)

Unnamed: 0,indice,municipio,dia,mes,data,casos,casos_novos,casos_pc,casos_mm7d,obitos,obitos_novos,obitos_pc,obitos_mm7d,letalidade,pop,pop_60,area,semana_epidem,densidade
212,213,Guarulhos,25,2,2020-02-25,0,0,0.0,0.0,0,0,0.0,0.0,0.0,1351275,162662,318.68,9,4240.225304
857,858,Guarulhos,26,2,2020-02-26,0,0,0.0,0.0,0,0,0.0,0.0,0.0,1351275,162662,318.68,9,4240.225304
1502,1503,Guarulhos,27,2,2020-02-27,0,0,0.0,0.0,0,0,0.0,0.0,0.0,1351275,162662,318.68,9,4240.225304


In [9]:
# Create a column with the percentage of elderly residents
covid_guarulhos['porcentagem_idosos'] = 100*covid_guarulhos['pop_60'] / covid_guarulhos['pop']
covid_guarulhos.head(3)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  covid_guarulhos['porcentagem_idosos'] = 100*covid_guarulhos['pop_60'] / covid_guarulhos['pop']


Unnamed: 0,indice,municipio,dia,mes,data,casos,casos_novos,casos_pc,casos_mm7d,obitos,obitos_novos,obitos_pc,obitos_mm7d,letalidade,pop,pop_60,area,semana_epidem,densidade,porcentagem_idosos
212,213,Guarulhos,25,2,2020-02-25,0,0,0.0,0.0,0,0,0.0,0.0,0.0,1351275,162662,318.68,9,4240.225304,12.037668
857,858,Guarulhos,26,2,2020-02-26,0,0,0.0,0.0,0,0,0.0,0.0,0.0,1351275,162662,318.68,9,4240.225304,12.037668
1502,1503,Guarulhos,27,2,2020-02-27,0,0,0.0,0.0,0,0,0.0,0.0,0.0,1351275,162662,318.68,9,4240.225304,12.037668


In [10]:
covid_campinas.shape

(579, 20)

In [11]:
covid_guarulhos.shape

(579, 20)

## *Measures of central tendency*

### Mean
The sum of the values divided by the number of values.

    Use .mean()

In [12]:
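# Mean of the obitos_novos column (bracket access); not displayed, since a cell only shows its last expression
covid_campinas['obitos_novos'].mean()
# Attribute-style access, here on the casos_novos column; this is the value displayed below
covid_campinas.casos_novos.mean()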

194.88946459412782

In [13]:
# Now for Guarulhos
covid_guarulhos['obitos_novos'].mean()


8.404145077720207

In [14]:
round(covid_guarulhos['casos_novos'].mean(),3)

109.496

### Median
The value at the central position of the sorted data.

    Use .median()

In [15]:
campinas = round(covid_campinas['casos_novos'].median(),2)
guarulhos = round(covid_guarulhos['casos_novos'].median(),2)

print(f'The median for Campinas is {campinas} and for Guarulhos is {guarulhos}.')

The median for Campinas is 148.0 and for Guarulhos is 87.0.


In [16]:
campinas = round(covid_campinas['obitos_novos'].median(),2)
guarulhos = round(covid_guarulhos['obitos_novos'].median(),2)

print(f'The median of obitos_novos for Campinas is {campinas} and for Guarulhos is {guarulhos}.')

The median of obitos_novos for Campinas is 5.0 and for Guarulhos is 4.0.


### Mode
The value that appears most frequently.

    Use .mode()

- Amodal - no value is repeated
- Bimodal - two modes
- Trimodal - three modes
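
The three cases in the list can be illustrated with small made-up Series (the values below are hypothetical). Note that pandas has no special result for the amodal case: when every value appears once, all of them tie and `.mode()` returns them all.

```python
import pandas as pd

amodal = pd.Series([1, 2, 3, 4])             # every value appears once
bimodal = pd.Series([1, 2, 2, 3, 3])         # 2 and 3 tie for most frequent
trimodal = pd.Series([1, 1, 2, 2, 3, 3, 4])  # 1, 2 and 3 tie

# .mode() always returns a sorted Series with one entry per tied value
modas_amodal = amodal.mode().tolist()
modas_bimodal = bimodal.mode().tolist()
modas_trimodal = trimodal.mode().tolist()
```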

In [17]:
"""
These are the 31-day months that occur in both 2020 and 2021 within the dataset.
The mode returns the most frequent values, and there is one record per day,
so the more days a month contributes, the more often it appears.
"""
covid_campinas['mes'].mode()        

0    3
1    5
2    7
3    8
Name: mes, dtype: int64
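
This result can be cross-checked without the dataset. The sketch below assumes one record per day for 579 consecutive days starting on 2020-02-25, which matches the row count and first date shown earlier, and counts how many of those days fall in each calendar month:

```python
import pandas as pd

# Assumption: one row per day, 579 consecutive days starting on 2020-02-25
dias = pd.Series(pd.date_range('2020-02-25', periods=579))

# Number of daily records in each calendar month, both years combined
contagem_por_mes = dias.dt.month.value_counts().sort_index()
meses_moda = dias.dt.month.mode().tolist()
```

Under that assumption, March, May, July and August each contribute 62 days, while January, October and December occur only once in the period and contribute 31.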

In [18]:
'''
.describe() returns the count, mean, std (standard deviation), min and max;
the percentage rows are the quartiles
'''
round(covid_campinas.describe(),2)

Unnamed: 0,indice,dia,mes,casos,casos_novos,casos_pc,casos_mm7d,obitos,obitos_novos,obitos_pc,obitos_mm7d,letalidade,pop,pop_60,area,semana_epidem,densidade,porcentagem_idosos
count,579.0,579.0,579.0,579.0,579.0,579.0,579.0,579.0,579.0,579.0,579.0,579.0,579.0,579.0,579.0,579.0,579.0,579.0
mean,186514.0,15.74,6.27,46549.33,194.89,3959.96,194.43,1763.33,7.72,150.01,7.7,0.04,1175501.0,192796.0,794.57,25.79,1479.42,16.4
std,107900.23,8.8,3.02,36212.35,173.93,3080.59,115.73,1427.15,8.96,121.41,5.99,0.01,0.0,0.0,0.0,13.39,0.0,0.0
min,109.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1175501.0,192796.0,794.57,1.0,1479.42,16.4
25%,93311.5,8.0,4.0,12926.0,39.0,1099.62,91.64,540.5,1.0,45.98,2.57,0.04,1175501.0,192796.0,794.57,15.0,1479.42,16.4
50%,186514.0,16.0,6.0,39112.0,148.0,3327.26,216.71,1407.0,5.0,119.69,7.43,0.04,1175501.0,192796.0,794.57,25.0,1479.42,16.4
75%,279716.5,23.0,9.0,77445.5,326.0,6588.3,283.36,3048.5,12.0,259.34,11.0,0.04,1175501.0,192796.0,794.57,36.0,1479.42,16.4
max,372919.0,31.0,12.0,112841.0,1080.0,9599.4,442.43,4471.0,67.0,380.35,29.29,0.15,1175501.0,192796.0,794.57,53.0,1479.42,16.4


In [19]:
round(covid_campinas['casos_novos'].describe(),2)

count     579.00
mean      194.89
std       173.93
min         0.00
25%        39.00
50%       148.00
75%       326.00
max      1080.00
Name: casos_novos, dtype: float64

In [20]:
covid_campinas2021 = covid_campinas.loc[covid_campinas['data'] > '2020-12-31']
covid_campinas2021

Unnamed: 0,indice,municipio,dia,mes,data,casos,casos_novos,casos_pc,casos_mm7d,obitos,obitos_novos,obitos_pc,obitos_mm7d,letalidade,pop,pop_60,area,semana_epidem,densidade,porcentagem_idosos
200703,200704,Campinas,1,1,2021-01-01,43502,92,3700.719948,205.285714,1474,0,125.393343,4.285714,0.033883,1175501,192796,794.57,53,1479.417798,16.401177
201348,201349,Campinas,2,1,2021-01-02,43508,6,3701.230369,203.857143,1475,1,125.478413,4.428571,0.033902,1175501,192796,794.57,53,1479.417798,16.401177
201993,201994,Campinas,3,1,2021-01-03,43561,53,3705.739085,207.428571,1476,1,125.563483,4.285714,0.033884,1175501,192796,794.57,1,1479.417798,16.401177
202638,202639,Campinas,4,1,2021-01-04,43603,42,3709.312030,205.857143,1476,0,125.563483,4.000000,0.033851,1175501,192796,794.57,1,1479.417798,16.401177
203283,203284,Campinas,5,1,2021-01-05,44035,432,3746.062317,186.285714,1487,11,126.499254,3.428571,0.033769,1175501,192796,794.57,1,1479.417798,16.401177
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
370338,370339,Campinas,21,9,2021-09-21,112492,207,9569.706874,192.714286,4449,3,378.476922,3.428571,0.039549,1175501,192796,794.57,38,1479.417798,16.401177
370983,370984,Campinas,22,9,2021-09-22,112624,132,9580.936129,211.000000,4461,12,379.497763,3.428571,0.039610,1175501,192796,794.57,38,1479.417798,16.401177
371628,371629,Campinas,23,9,2021-09-23,112745,121,9591.229612,227.571429,4464,3,379.752973,3.571429,0.039594,1175501,192796,794.57,38,1479.417798,16.401177
372273,372274,Campinas,24,9,2021-09-24,112773,28,9593.611575,77.285714,4467,3,380.008184,3.428571,0.039611,1175501,192796,794.57,38,1479.417798,16.401177


In [21]:
covid_campinas2021.shape

(268, 20)

In [22]:
media = covid_campinas2021['obitos_novos'].mean()
mediana = covid_campinas2021['obitos_novos'].median()
moda = covid_campinas2021['obitos_novos'].mode()

print('For the city of Campinas, we have the following figures:')
print(f'The mean is {media:.0f}.')
print(f'The median is {mediana:.0f}.')
print(f'The mode is {moda}.')

For the city of Campinas, we have the following figures:
The mean is 11.
The median is 9.
The mode is 0    0
Name: obitos_novos, dtype: int64.
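
The mode line in the output above runs over two lines because `.mode()` returns a Series, and the f-string prints the whole Series, including its index and dtype. A possible way to print only the values, sketched with made-up numbers (`obitos` is hypothetical):

```python
import pandas as pd

# Hypothetical daily death counts, for illustration only
obitos = pd.Series([0, 0, 3, 5, 9, 0, 12])

# .mode() returns a Series; join its values so they print on one line
moda = obitos.mode()
texto_moda = ', '.join(str(valor) for valor in moda)
mensagem = f'The mode is {texto_moda}.'
print(mensagem)
```

Joining the values also handles the bimodal and trimodal cases, where the Series has more than one entry.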


### Plotting the results

In [23]:
import plotly.express as px

In [24]:
grafico = px.histogram(covid_campinas2021, x='obitos_novos', nbins=30)  # nbins caps the number of bins; it is not the bar width
grafico.update_layout(width = 400, height = 400, title_text = 'New deaths in Campinas in 2021')
grafico.show()

In [25]:
grafico = px.histogram(covid_campinas2021, x='obitos_novos', nbins=10)
# nbins is the maximum number of classes (bins), not the class width.
# Plotly picks a bin size that produces at most that many bins,
# so a smaller nbins gives fewer, wider classes.
grafico.update_layout(width = 400, height = 400, title_text = 'New deaths in Campinas in 2021')
grafico.show()
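
The relationship between the number of bins and their width can be seen with plain NumPy. This is only an illustration on made-up values: `np.histogram` uses exactly the requested number of equal-width bins, whereas Plotly treats `nbins` as an upper bound and chooses a rounded bin size, so the chart may show fewer bins.

```python
import numpy as np

# Hypothetical daily counts, for illustration only
valores = np.array([0, 1, 2, 4, 7, 9, 12, 15, 18, 21, 25, 30])

# Ask for 3 bins: NumPy splits [min, max] into 3 equal-width classes
contagens, limites = np.histogram(valores, bins=3)

# The width follows from the range and the number of bins: (30 - 0) / 3
largura = limites[1] - limites[0]
```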

## Measures of position

### Outliers

#### Calculating

In [26]:
# First of all, the minimum
covid_campinas['casos_novos'].min()

0

In [27]:
# First quartile
covid_campinas['casos_novos'].quantile(0.25)

39.0

In [28]:
# Second quartile (the median)
covid_campinas['casos_novos'].quantile(0.5)

148.0

In [29]:
# Third quartile
covid_campinas['casos_novos'].quantile(0.75)

326.0

In [30]:
covid_campinas['casos_novos'].max()

1080

In [31]:
covid_campinas['casos_novos'].describe()

count     579.000000
mean      194.889465
std       173.933661
min         0.000000
25%        39.000000
50%       148.000000
75%       326.000000
max      1080.000000
Name: casos_novos, dtype: float64
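
The five values computed one by one above (minimum, quartiles and maximum) can also be requested in a single call by passing a list to `.quantile()`. A short sketch on made-up values (`valores` is hypothetical):

```python
import pandas as pd

# Hypothetical values, for illustration only
valores = pd.Series([0, 10, 20, 30, 40])

# Minimum, Q1, median, Q3 and maximum in one call;
# the result is a Series indexed by the requested quantiles
resumo = valores.quantile([0, 0.25, 0.5, 0.75, 1])
```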

#### Plotting

In [32]:
import plotly.express as px

In [33]:
grafico = px.box(covid_campinas, y='casos_novos')
grafico.show()

In [34]:
# Upper outlier limit: Q3 + 1.5 * IQR, where IQR = Q3 - Q1
outlier_sup = covid_campinas['casos_novos'].quantile(0.75) + 1.5 * (covid_campinas['casos_novos'].quantile(0.75)-covid_campinas['casos_novos'].quantile(0.25))
outlier_sup

756.5

In [35]:
# Lower outlier limit: Q1 - 1.5 * IQR
outlier_inf = covid_campinas['casos_novos'].quantile(0.25) - 1.5 * (covid_campinas['casos_novos'].quantile(0.75)-covid_campinas['casos_novos'].quantile(0.25))
outlier_inf

-391.5
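
Both limits follow the usual 1.5 × IQR rule. Worked out with the Campinas quartiles computed earlier ($Q_1 = 39$, $Q_3 = 326$):

$$
\begin{aligned}
\mathrm{IQR} &= Q_3 - Q_1 = 326 - 39 = 287 \\
\text{upper limit} &= Q_3 + 1.5\,\mathrm{IQR} = 326 + 430.5 = 756.5 \\
\text{lower limit} &= Q_1 - 1.5\,\mathrm{IQR} = 39 - 430.5 = -391.5
\end{aligned}
$$

Since `casos_novos` has a minimum of 0, no observation can fall below the negative lower limit, so only the upper limit matters for this column.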

In [36]:
# Inspecting the outliers is often not worthwhile given the volume of data, but if needed, just filter the table
ex_covid_campinas_s_out_sup = covid_campinas.loc[covid_campinas['casos_novos'] > outlier_sup]
ex_covid_campinas_s_out_sup

Unnamed: 0,indice,municipio,dia,mes,data,casos,casos_novos,casos_pc,casos_mm7d,obitos,obitos_novos,obitos_pc,obitos_mm7d,letalidade,pop,pop_60,area,semana_epidem,densidade,porcentagem_idosos
74283,74284,Campinas,19,6,2020-06-19,5127,797,436.154457,233.857143,203,9,17.269232,9.571429,0.039594,1175501,192796,794.57,25,1479.417798,16.401177
367758,367759,Campinas,17,9,2021-09-17,112232,1080,9547.588645,162.285714,4443,4,377.966501,5.714286,0.039588,1175501,192796,794.57,37,1479.417798,16.401177


Excluding outliers

In [37]:
covid_campinas_s_outliers = covid_campinas.loc[covid_campinas['casos_novos'] <= outlier_sup]
covid_campinas_s_outliers.head(5)

Unnamed: 0,indice,municipio,dia,mes,data,casos,casos_novos,casos_pc,casos_mm7d,obitos,obitos_novos,obitos_pc,obitos_mm7d,letalidade,pop,pop_60,area,semana_epidem,densidade,porcentagem_idosos
108,109,Campinas,25,2,2020-02-25,0,0,0.0,0.0,0,0,0.0,0.0,0.0,1175501,192796,794.57,9,1479.417798,16.401177
753,754,Campinas,26,2,2020-02-26,0,0,0.0,0.0,0,0,0.0,0.0,0.0,1175501,192796,794.57,9,1479.417798,16.401177
1398,1399,Campinas,27,2,2020-02-27,0,0,0.0,0.0,0,0,0.0,0.0,0.0,1175501,192796,794.57,9,1479.417798,16.401177
2043,2044,Campinas,28,2,2020-02-28,0,0,0.0,0.0,0,0,0.0,0.0,0.0,1175501,192796,794.57,9,1479.417798,16.401177
2688,2689,Campinas,29,2,2020-02-29,0,0,0.0,0.0,0,0,0.0,0.0,0.0,1175501,192796,794.57,9,1479.417798,16.401177


In [38]:
grafico = px.box(covid_campinas_s_outliers, y='casos_novos')
grafico.show()

Now Guarulhos

In [39]:
grafico = px.box(covid_guarulhos, y='casos_novos')
grafico.show()

Displaying these outliers

In [40]:
# I hardcoded the value because I am analyzing the
ex_covid_guarulhos_s_outliers = covid_guarulhos.loc[covid_guarulhos['casos_novos'] > 359]
ex_covid_guarulhos_s_outliers

Unnamed: 0,indice,municipio,dia,mes,data,casos,casos_novos,casos_pc,casos_mm7d,obitos,obitos_novos,obitos_pc,obitos_mm7d,letalidade,pop,pop_60,area,semana_epidem,densidade,porcentagem_idosos
167267,167268,Guarulhos,10,11,2020-11-10,23258,410,1721.189247,75.285714,1546,0,114.410464,0.714286,0.066472,1351275,162662,318.68,46,4240.225304,12.037668
258857,258858,Guarulhos,1,4,2021-04-01,41087,465,3040.609794,279.142857,2691,77,199.145252,34.571429,0.065495,1351275,162662,318.68,13,4240.225304,12.037668
262082,262083,Guarulhos,6,4,2021-04-06,42091,381,3114.909992,252.142857,2781,61,205.805628,28.571429,0.066071,1351275,162662,318.68,14,4240.225304,12.037668
263372,263373,Guarulhos,8,4,2021-04-08,42813,388,3168.341011,246.571429,2874,46,212.688017,26.142857,0.067129,1351275,162662,318.68,14,4240.225304,12.037668
267242,267243,Guarulhos,14,4,2021-04-14,44553,466,3297.108287,304.0,3093,53,228.894933,37.857143,0.069423,1351275,162662,318.68,15,4240.225304,12.037668
267887,267888,Guarulhos,15,4,2021-04-15,45079,526,3336.034486,323.714286,3133,40,231.8551,37.0,0.0695,1351275,162662,318.68,15,4240.225304,12.037668
268532,268533,Guarulhos,16,4,2021-04-16,45676,597,3380.214982,363.142857,3158,25,233.705204,36.857143,0.069139,1351275,162662,318.68,15,4240.225304,12.037668
269177,269178,Guarulhos,17,4,2021-04-17,46374,698,3431.869901,423.428571,3192,34,236.221347,32.428571,0.068832,1351275,162662,318.68,15,4240.225304,12.037668
273047,273048,Guarulhos,23,4,2021-04-23,47973,499,3550.202586,328.142857,3325,44,246.063903,23.857143,0.06931,1351275,162662,318.68,16,4240.225304,12.037668


In [41]:
import calculos as cal

print(cal.outlier_sup(covid_guarulhos['casos_novos']))
print(cal.outlier_inf(covid_guarulhos['casos_novos']))

361.25
-172.75
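
The `calculos` module imported above is a local file whose contents are not shown in this notebook. A sketch of what it might contain, assuming its two functions apply the same 1.5 × IQR rule used for Campinas. The function bodies are a guess, not the author's actual code, and the sample values are made up:

```python
import pandas as pd


def outlier_sup(serie: pd.Series) -> float:
    """Upper outlier limit: Q3 + 1.5 * IQR (assumed behaviour)."""
    q1, q3 = serie.quantile(0.25), serie.quantile(0.75)
    return q3 + 1.5 * (q3 - q1)


def outlier_inf(serie: pd.Series) -> float:
    """Lower outlier limit: Q1 - 1.5 * IQR (assumed behaviour)."""
    q1, q3 = serie.quantile(0.25), serie.quantile(0.75)
    return q1 - 1.5 * (q3 - q1)


# Usage with hypothetical values: Q1 = 10, Q3 = 30, IQR = 20
exemplo = pd.Series([0, 10, 20, 30, 40])
limite_sup = outlier_sup(exemplo)
limite_inf = outlier_inf(exemplo)
```

Keeping the rule in one place avoids repeating the long quantile expression for each city.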


In [43]:
covid_guarulhos_s_outlier = covid_guarulhos.loc[covid_guarulhos['casos_novos'] < cal.outlier_sup(covid_guarulhos['casos_novos'])]
covid_guarulhos_s_outlier

Unnamed: 0,indice,municipio,dia,mes,data,casos,casos_novos,casos_pc,casos_mm7d,obitos,obitos_novos,obitos_pc,obitos_mm7d,letalidade,pop,pop_60,area,semana_epidem,densidade,porcentagem_idosos
212,213,Guarulhos,25,2,2020-02-25,0,0,0.000000,0.000000,0,0,0.000000,0.000000,0.000000,1351275,162662,318.68,9,4240.225304,12.037668
857,858,Guarulhos,26,2,2020-02-26,0,0,0.000000,0.000000,0,0,0.000000,0.000000,0.000000,1351275,162662,318.68,9,4240.225304,12.037668
1502,1503,Guarulhos,27,2,2020-02-27,0,0,0.000000,0.000000,0,0,0.000000,0.000000,0.000000,1351275,162662,318.68,9,4240.225304,12.037668
2147,2148,Guarulhos,28,2,2020-02-28,0,0,0.000000,0.000000,0,0,0.000000,0.000000,0.000000,1351275,162662,318.68,9,4240.225304,12.037668
2792,2793,Guarulhos,29,2,2020-02-29,0,0,0.000000,0.000000,0,0,0.000000,0.000000,0.000000,1351275,162662,318.68,9,4240.225304,12.037668
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
370442,370443,Guarulhos,21,9,2021-09-21,63334,28,4686.980814,102.571429,4857,3,359.438308,2.571429,0.076689,1351275,162662,318.68,38,4240.225304,12.037668
371087,371088,Guarulhos,22,9,2021-09-22,63351,17,4688.238885,105.000000,4861,4,359.734325,2.428571,0.076731,1351275,162662,318.68,38,4240.225304,12.037668
371732,371733,Guarulhos,23,9,2021-09-23,63368,17,4689.496957,107.285714,4862,1,359.808329,2.000000,0.076726,1351275,162662,318.68,38,4240.225304,12.037668
372377,372378,Guarulhos,24,9,2021-09-24,63387,19,4690.903036,61.285714,4863,1,359.882333,2.142857,0.076719,1351275,162662,318.68,38,4240.225304,12.037668


In [48]:
grafico = px.box(covid_guarulhos_s_outlier, y = 'casos_novos')
grafico.show()