<a href="https://colab.research.google.com/github/GuilhermeFogolin/Python-Dados-Machine-Learning/blob/main/Tratamento_Projeto03.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project 03: Worldwide COVID-19**

This project explores worldwide COVID-19 data and applies Machine Learning regression models to analyze and forecast the evolution of the disease over the period from 24/02/2020 to 05/09/2020.

The data were retrieved on 04/09/2020 from the World Health Organization:

(https://covid19.who.int/table)

## **DATA ANALYSIS AND EXPLORATION**

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import numpy as np
import pandas as pd

In [3]:
covidmundial = pd.read_csv('/content/drive/MyDrive/Python_Dados_Machine_Learning/covid19_mundial.csv', sep=',')

In [4]:
covidmundial

Unnamed: 0,date,country_code,country,who_region,new_cases,cumulative_cases,new_deaths,cumulative_deaths
0,2020-02-24,AF,Afghanistan,EMRO,5,5,0,0
1,2020-02-25,AF,Afghanistan,EMRO,0,5,0,0
2,2020-02-26,AF,Afghanistan,EMRO,0,5,0,0
3,2020-02-27,AF,Afghanistan,EMRO,0,5,0,0
4,2020-02-28,AF,Afghanistan,EMRO,0,5,0,0
...,...,...,...,...,...,...,...,...
39428,2020-08-31,ZW,Zimbabwe,AFRO,6,6412,0,196
39429,2020-09-01,ZW,Zimbabwe,AFRO,85,6497,6,202
39430,2020-09-02,ZW,Zimbabwe,AFRO,62,6559,1,203
39431,2020-09-03,ZW,Zimbabwe,AFRO,79,6638,3,206


In [5]:
# Cumulative number of deaths per country
totaldeaths = covidmundial.groupby('country').cumulative_deaths.max()

In [6]:
# Sort in descending order
totaldeaths.sort_values(ascending=False).head(15)

country,cumulative_deaths
United States of America,184614
Brazil,123780
India,68472
Mexico,65816
The United Kingdom,41527
Italy,35507
France,30556
Peru,29259
Spain,29234
Iran (Islamic Republic of),21926


In [7]:
# Cumulative cases per country
casos_total = covidmundial.groupby('country').cumulative_cases.max()

In [8]:
# Sort the cumulative cases in descending order
casos_total.sort_values(ascending=False).head(20)

country,cumulative_cases
United States of America,6050444
Brazil,3997865
India,3936747
Russian Federation,1015105
Peru,663437
Colombia,633339
South Africa,633015
Mexico,610957
Spain,488513
Argentina,439172


In [9]:
covidbrasil = covidmundial.loc[covidmundial.country=='Brazil']

In [10]:
covidbrasil

Unnamed: 0,date,country_code,country,who_region,new_cases,cumulative_cases,new_deaths,cumulative_deaths
4896,2020-02-26,BR,Brazil,AMRO,5,5,0,0
4897,2020-02-27,BR,Brazil,AMRO,0,5,0,0
4898,2020-02-28,BR,Brazil,AMRO,0,5,0,0
4899,2020-02-29,BR,Brazil,AMRO,0,5,0,0
4900,2020-03-01,BR,Brazil,AMRO,1,6,0,0
...,...,...,...,...,...,...,...,...
5083,2020-08-31,BR,Brazil,AMRO,41350,3846153,958,120462
5084,2020-09-01,BR,Brazil,AMRO,16158,3862311,366,120828
5085,2020-09-02,BR,Brazil,AMRO,45961,3908272,553,121381
5086,2020-09-03,BR,Brazil,AMRO,42659,3950931,1215,122596
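As a quick sanity check on the table above, each day's `cumulative_cases` should equal the previous day's total plus that day's `new_cases`. The sketch below checks this for the last four Brazil rows shown, with the values copied by hand from the output. The names `rows` and `is_consistent` are illustrative and are not part of the notebook.

```python
# (date, new_cases, cumulative_cases) copied by hand from the Brazil output above
rows = [
    ("2020-08-31", 41350, 3846153),
    ("2020-09-01", 16158, 3862311),
    ("2020-09-02", 45961, 3908272),
    ("2020-09-03", 42659, 3950931),
]


def is_consistent(rows):
    """Return True if every cumulative total equals the previous total plus new cases."""
    for (_, _, prev_total), (_, new, total) in zip(rows, rows[1:]):
        if prev_total + new != total:
            return False
    return True


print(is_consistent(rows))  # True
```

The same idea could be applied to the full dataframe per country, which may help explain rows where `new_cases` is negative.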


In [11]:
covidbrasil.shape

(192, 8)

In [12]:
# Inspect the variable types
covidbrasil.dtypes

column,dtype
date,object
country_code,object
country,object
who_region,object
new_cases,int64
cumulative_cases,int64
new_deaths,int64
cumulative_deaths,int64


In [13]:
# Check for missing values in the Brazil subset
covidbrasil.isnull().sum()

column,missing_values
date,0
country_code,0
country,0
who_region,0
new_cases,0
cumulative_cases,0
new_deaths,0
cumulative_deaths,0


In [14]:
# Check for missing values in the worldwide data
covidmundial.isnull().sum()

column,missing_values
date,0
country_code,175
country,0
who_region,0
new_cases,0
cumulative_cases,0
new_deaths,0
cumulative_deaths,0


**Analysis of Negative Values**

In [15]:
covidmundial.loc[covidmundial.new_cases < 0, :]

Unnamed: 0,date,country_code,country,who_region,new_cases,cumulative_cases,new_deaths,cumulative_deaths
25,2020-03-20,AF,Afghanistan,EMRO,-2,24,0,0
1095,2020-03-24,AG,Antigua and Barbuda,AMRO,-2,3,0,0
1636,2020-03-19,AW,Aruba,AMRO,-3,5,0,0
2414,2020-03-20,BS,Bahamas,AMRO,-3,4,0,0
2960,2020-03-20,BB,Barbados,AMRO,-1,5,0,0
...,...,...,...,...,...,...,...,...
36347,2020-03-19,TT,Trinidad and Tobago,AMRO,-2,9,0,0
37112,2020-05-25,UG,Uganda,AFRO,-8,304,0,0
37140,2020-06-22,UG,Uganda,AFRO,-68,755,0,0
38030,2020-03-21,VI,United States Virgin Islands,AMRO,-1,6,0,0


In [16]:
covidmundial.loc[covidmundial.new_cases < 0, :].count()

column,count
date,82
country_code,81
country,82
who_region,82
new_cases,82
cumulative_cases,82
new_deaths,82
cumulative_deaths,82


In [17]:
covidmundial.loc[covidmundial.new_deaths < 0, :].count()

column,count
date,28
country_code,28
country,28
who_region,28
new_cases,28
cumulative_cases,28
new_deaths,28
cumulative_deaths,28


In [18]:
covidbrasil.loc[covidbrasil.new_cases < 0, :].count()

column,count
date,0
country_code,0
country,0
who_region,0
new_cases,0
cumulative_cases,0
new_deaths,0
cumulative_deaths,0


In [19]:
covidbrasil.loc[covidbrasil.new_deaths < 0, :].count()

column,count
date,0
country_code,0
country,0
who_region,0
new_cases,0
cumulative_cases,0
new_deaths,0
cumulative_deaths,0


In [20]:
# Build a dataframe with the sum over all countries for each date
covidmundial_sum = covidmundial.groupby('date').agg({'new_cases': 'sum','cumulative_cases':'sum','new_deaths': 'sum','cumulative_deaths':'sum'}).reset_index()

In [21]:
covidmundial_sum

Unnamed: 0,date,new_cases,cumulative_cases,new_deaths,cumulative_deaths
0,2020-01-04,1,1,0,0
1,2020-01-05,0,1,0,0
2,2020-01-06,3,4,0,0
3,2020-01-07,0,4,0,0
4,2020-01-08,0,4,0,0
...,...,...,...,...,...
240,2020-08-31,267850,25144425,5417,844473
241,2020-09-01,212698,25357123,3980,848453
242,2020-09-02,248989,25606112,4370,852823
243,2020-09-03,279613,25885725,6317,859140


## **STATISTICAL ANALYSES**

In [None]:
covidbrasil.describe()

In [None]:
covidbrasil.new_cases.mode()

In [None]:
covidbrasil.new_deaths.mode()

**Outlier Analysis**

In [None]:
import plotly.express as px

In [None]:
px.box(covidbrasil, y='cumulative_cases')

In [None]:
px.box(covidbrasil, y='cumulative_deaths')

**Normality Analysis**

In [None]:
import seaborn as sns

In [None]:
sns.histplot(covidbrasil, x='cumulative_cases', bins=20, color="brown", kde=True, stat="count");

In [None]:
sns.histplot(covidbrasil, x='cumulative_deaths', bins=20, color="brown", kde=True, stat="count");

In [None]:
import scipy.stats as stats
import matplotlib.pyplot as plt

In [None]:
stats.probplot(covidbrasil['cumulative_cases'], dist="norm", plot=plt)
plt.title("Normality Analysis")
plt.show()

In [None]:
stats.probplot(covidbrasil['cumulative_deaths'], dist="norm", plot=plt)
plt.title("Normality Analysis")
plt.show()

In [None]:
from statsmodels.stats.diagnostic import lilliefors

In [None]:
# Lilliefors normality test on cumulative cases
estatistica, p = lilliefors(covidbrasil.cumulative_cases, dist='norm')
print('Test statistic (D) =', round(estatistica, 2))
print('p-value =', p)

In [None]:
# Lilliefors normality test on cumulative deaths
estatistica, p = lilliefors(covidbrasil.cumulative_deaths, dist='norm')
print('Test statistic (D) =', round(estatistica, 2))
print('p-value =', p)

**Scatter plots over time**

In [None]:
import plotly.express as px

In [None]:
disp = px.scatter(x=covidbrasil.date, y=covidbrasil.cumulative_cases)
disp.update_layout(width=900, height=400, title_text='CUMULATIVE NUMBER OF CASES IN BRAZIL')
disp.update_xaxes(title='DATE')
disp.update_yaxes(title='CASES')
disp.show()

In [None]:
disp = px.scatter(x=covidbrasil.date, y=covidbrasil.cumulative_deaths)
disp.update_layout(width=900, height=400, title_text='CUMULATIVE NUMBER OF DEATHS IN BRAZIL')
disp.update_xaxes(title='DATE')
disp.update_yaxes(title='DEATHS')
disp.show()

In [None]:
plt.subplots(figsize=(10,5))
plt.stackplot(covidbrasil['date'], [covidbrasil['cumulative_cases'], covidbrasil['cumulative_deaths']],
              labels = ['cumulative_cases', 'cumulative_deaths'])
plt.legend(loc = 'upper left')
plt.title('Comparison of the evolution of cases and deaths in Brazil');

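**Correlation**

In [None]:
# Spearman rank correlation (monotonic association, not strictly linear).
# numeric_only=True skips the text columns, which recent pandas versions would otherwise reject.
correlacoes = covidbrasil.corr(method='spearman', numeric_only=True)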

In [None]:
correlacoes

In [None]:
plt.figure()
sns.heatmap(correlacoes, annot=True);
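Because the matrix above uses `method='spearman'`, it measures monotonic association rather than strictly linear association. The pure-Python sketch below illustrates the difference on toy data: Spearman is the Pearson correlation computed on ranks. The helper names and the toy series are illustrative assumptions, and this is not how pandas implements the calculation internally.

```python
def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5


def ranks(values):
    """Rank positions (1 = smallest); assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def spearman(a, b):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(a), ranks(b))


days = [1, 2, 3, 4, 5, 6]
totals = [d ** 3 for d in days]  # toy series: monotonic but clearly non-linear

print(round(pearson(days, totals), 3))   # below 1: the relationship is not a straight line
print(round(spearman(days, totals), 3))  # 1.0: the ordering is perfectly preserved
```

Cumulative series only ever grow with the date, so high Spearman values between them are expected and do not by themselves indicate a linear relationship.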

## **MACHINE LEARNING**

### **Simple Linear Regression**

In [None]:
covidbrasil

In [None]:
covidbrasil.shape

In [None]:
import plotly.express as px

In [None]:
disp = px.scatter(x=covidbrasil.new_cases, y=covidbrasil.new_deaths)
disp.update_layout(width=900, height=400, title_text='Number of deaths as a function of the number of cases')
disp.update_xaxes(title='New Cases')
disp.update_yaxes(title='New Deaths')
disp.show()

In [None]:
# Column 4 is new_cases (predictor) and column 6 is new_deaths (target)
x = covidbrasil.iloc[:, 4].values
y = covidbrasil.iloc[:, 6].values

In [None]:
x

In [None]:
y

In [None]:
# Reshape into a 2-D column matrix, as scikit-learn expects
x = x.reshape(-1,1)

In [None]:
x

**Splitting the dataset into training and test sets**

In [None]:
from sklearn.model_selection import train_test_split
x_treinamento, x_teste, y_treinamento, y_teste = train_test_split(x, y,
                                                                  test_size = 0.25,
                                                                  random_state = 2)

In [None]:
x_treinamento

In [None]:
x_teste

In [None]:
y_treinamento

In [None]:
y_teste

In [None]:
x_treinamento.size

In [None]:
x_teste.size

In [None]:
y_treinamento.size

In [None]:
y_teste.size
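The sizes printed above follow from the 192 Brazil rows and `test_size = 0.25`. A minimal sketch of the arithmetic, assuming scikit-learn rounds the test share up and assigns the remaining rows to training:

```python
import math

n_samples = 192   # rows in covidbrasil, from covidbrasil.shape above
test_size = 0.25  # fraction passed to train_test_split

n_test = math.ceil(test_size * n_samples)  # test rows, rounded up
n_train = n_samples - n_test               # remaining rows go to training

print(n_train, n_test)  # 144 48
```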

**Creating the linear regression model**

In [None]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_treinamento, y_treinamento)
score = regressor.score(x_treinamento, y_treinamento)

In [None]:
# Plot of the training data with the fitted line
import matplotlib.pyplot as plt
plt.scatter(x_treinamento, y_treinamento)
plt.plot(x_treinamento, regressor.predict(x_treinamento), color = 'red');

In [None]:
score

In [None]:
previsoes = regressor.predict(x_teste)

In [None]:
# Plot of the test data with the fitted line
plt.scatter(x_teste, y_teste)
plt.plot(x_teste, regressor.predict(x_teste), color = 'red');

In [None]:
# Predicted number of new deaths for a day with 80,000 new cases
previsao = regressor.predict(np.array(80000).reshape(1, -1))
previsao

In [None]:
# Intercept
regressor.intercept_

In [None]:
# Slope
regressor.coef_

Equation: deaths = 201.03944 + 0.02188 * cases
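As a worked example, the equation above can be evaluated by hand for the same 80,000 new cases passed to `regressor.predict` earlier. The sketch below uses the rounded coefficients quoted in the equation, so its result may differ slightly from the model's own prediction. The constant and function names are illustrative.

```python
INTERCEPT = 201.03944  # regressor.intercept_, as quoted in the equation
SLOPE = 0.02188        # regressor.coef_[0], as quoted in the equation


def predicted_deaths(new_cases):
    """Apply the fitted line: deaths = intercept + slope * cases."""
    return INTERCEPT + SLOPE * new_cases


# 201.03944 + 0.02188 * 80000 = 201.03944 + 1750.4
print(round(predicted_deaths(80000), 2))  # 1951.44
```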

**Performance Metrics**

In [None]:
# Coefficient of determination (R^2) on the test set
regressor.score(x_teste, y_teste)

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

In [None]:
print('Mean Absolute Error (MAE):', mean_absolute_error(y_teste, previsoes))
print('Mean Squared Error (MSE):', mean_squared_error(y_teste, previsoes))
print('Root Mean Squared Error (RMSE):', np.sqrt(mean_squared_error(y_teste, previsoes)))
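For reference, the metrics used in this section can be written out directly. The sketch below is a pure-Python illustration of the standard definitions on made-up values; the toy numbers are not taken from the notebook, and the function names are illustrative rather than the scikit-learn API.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: average of |true - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error: average of (true - predicted)^2."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return math.sqrt(mse(y_true, y_pred))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy values for illustration only
y_true = [100, 200, 300, 400]
y_pred = [110, 190, 320, 380]

print(mae(y_true, y_pred))           # 15.0
print(mse(y_true, y_pred))           # 250.0
print(round(r2(y_true, y_pred), 2))  # 0.98
```

MAE and RMSE are expressed in the units of the target (deaths per day here), which makes them easier to interpret than MSE.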

### **Polynomial Regression**

In [None]:
covidbrasil

In [None]:
disp = px.scatter(x=covidbrasil.date, y=covidbrasil.cumulative_cases)
disp.update_layout(width=900, height=400, title_text='CUMULATIVE NUMBER OF CASES IN BRAZIL')
disp.update_xaxes(title='DATE')
disp.update_yaxes(title='CASES')
disp.show()

In [None]:
x = covidbrasil.iloc[:, 0].values

In [None]:
x

In [None]:
# Convert the dates into a numeric sequence (1, 2, 3, 4, ...) shaped as a column matrix
x = np.arange(1,len(x)+1).reshape(-1,1)
x

In [None]:
y = covidbrasil.iloc[:, 5].values
y

**Splitting the dataset into training and test sets**

In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split

In [None]:
x_treinamento, x_teste, y_treinamento, y_teste = train_test_split(x, y,test_size = 0.25, random_state = 2)

In [None]:
x_treinamento.size

In [None]:
x_teste.size

In [None]:
y_treinamento.size

In [None]:
y_teste.size

In [None]:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree = 2)
x_treinamento_poly = poly.fit_transform(x_treinamento)
x_teste_poly = poly.transform(x_teste)
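With a single input column and the default settings, `PolynomialFeatures(degree = 2)` expands each day index `x` into three columns: a constant 1, `x`, and `x` squared. The linear model is then fitted on those columns, which is what makes the fitted curve a parabola. A minimal pure-Python sketch of that expansion, where the helper name `degree2_features` is illustrative:

```python
def degree2_features(x):
    """Return [1, x, x**2], the degree-2 expansion of a single input value."""
    return [1, x, x ** 2]


for day in (1, 2, 3):
    print(degree2_features(day))
# [1, 1, 1]
# [1, 2, 4]
# [1, 3, 9]
```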

In [None]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_treinamento_poly, y_treinamento)
score = regressor.score(x_treinamento_poly, y_treinamento)

In [None]:
y_treinamento.size

In [None]:
previsoes = regressor.predict(x_teste_poly)

In [None]:
previsoes.size

In [None]:
# Build the sequence of day indices used for forecasting.
forecast = np.arange(1, len(x) + 21).reshape(-1, 1)  # days 1..len(x), as in training, plus 20 extra days

In [None]:
forecast.shape

In [None]:
x_train_total = poly.transform(forecast)
x_train_total.shape

In [None]:
x_train_total

In [None]:
previsao_total = regressor.predict(x_train_total)
len(previsao_total)

In [None]:
previsao_total

In [None]:
plt.subplots(figsize=(10,5))
plt.plot(forecast[:-20], y, color='red')
plt.plot(forecast, previsao_total, linestyle='dashed')
plt.title('COVID-19 cases in Brazil')
plt.xlabel('Days since 26/02/2020')
plt.ylabel('Number of cases')
plt.legend(['Cumulative cases', 'Forecast']);

In [None]:
previsao_total[100]

**Performance Metrics**

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

In [None]:
# Coefficient of determination (R^2) on the test set
regressor.score(x_teste_poly, y_teste)

In [None]:
poly_teste_pred = regressor.predict(x_teste_poly)

In [None]:
# scikit-learn metrics take the true values first, then the predictions
print('MAE:', mean_absolute_error(y_teste, poly_teste_pred))
print('MSE:', mean_squared_error(y_teste, poly_teste_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_teste, poly_teste_pred)))