# Overview

This forecasting model is a log-polynomial regression trained on metrics for active cases of SARS-CoV-2 infection in other countries. It is a simple model, and its accuracy might be further improved with demographic and economic information. The applicability of such a simplified model to the Brazilian case is, however, supported by some demographic metrics, such as GDP, average years of education and life expectancy, for which Brazil sits near the median of the worldwide ranking.

A log-polynomial regression was chosen because of the characteristics observed in China regarding the number of active cases over time. Quasi-exponential growth is observed once a certain number of cases has been reported, and it continues for some time until the acceleration decreases and a maximum is reached. Then the number of active cases starts to decrease, as more people recover (and probably become immune). Throughout the epidemic, the number of active cases is also reduced by the number of deaths, which might be higher in countries with an older population (e.g. Italy). As observed in China, the number of active cases increases more quickly than it decreases after the maximum. Although it is unclear whether the same behaviour will be observed in Brazil, the model was also able to predict this result.
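As a rough sketch of what a degree-2 log-polynomial implies, write $y(t)$ for the number of active cases on day $t$. The symbols $\beta_0, \beta_1, \beta_2$ are notation assumed here for the fitted coefficients; they do not appear elsewhere in this notebook.

$$
\log y(t) = \beta_0 + \beta_1 t + \beta_2 t^2
\quad\Longrightarrow\quad
y(t) = \exp\left(\beta_0 + \beta_1 t + \beta_2 t^2\right)
$$

If $\beta_1 > 0$ and $\beta_2 < 0$, the curve grows almost exponentially at first, slows down, and reaches a single maximum where the derivative of the exponent vanishes:

$$
t^{*} = -\frac{\beta_1}{2\beta_2},
\qquad
y(t^{*}) = \exp\left(\beta_0 - \frac{\beta_1^{2}}{4\beta_2}\right)
$$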

In [None]:
from sklearn.linear_model import HuberRegressor 
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from matplotlib import pyplot as plt
from datetime import datetime, timedelta
import json
import requests
import pandas as pd
import numpy as np
import plotly.graph_objects as go
import tqdm
import warnings

warnings.simplefilter('ignore')

## Data loading

In [None]:
INPUT_PREFIX = '../input/novel-corona-virus-2019-dataset/time_series_covid_19'

def load_by_country(suffix):
    # Read one time series file and sum the provinces of each country.
    df = pd.read_csv(INPUT_PREFIX + suffix).drop(['Province/State', 'Lat', 'Long'], axis=1)
    return df.groupby(by='Country/Region').sum().reset_index()

df_confirmed = load_by_country('_confirmed.csv')
df_recovered = load_by_country('_recovered.csv')
df_deaths = load_by_country('_deaths.csv')

## Data aggregation

In [None]:
dates = list(df_confirmed.columns)[1::]

In [None]:
# Build one record per (country, date). Collecting dicts in a list avoids
# DataFrame.append, which was removed in pandas 2.0.
records = []

for _, row in tqdm.tqdm(df_confirmed.iterrows(), total=df_confirmed.shape[0]):
    country = row['Country/Region']
    recovered_row = df_recovered[df_recovered['Country/Region']==country]
    deaths_row = df_deaths[df_deaths['Country/Region']==country]
    for date in dates:
        records.append(
            {
                'country': country,
                'date_raw': date,
                'confirmed': row[date],
                'recovered': recovered_row[date].values[0],
                'deaths': deaths_row[date].values[0]
            }
        )

df_merged = pd.DataFrame(records)

# Active cases are the confirmed cases that have neither recovered nor died.
df_merged['active'] = df_merged['confirmed'] - df_merged['recovered'] - df_merged['deaths']
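The same wide-to-long aggregation could also be written without explicit loops. The sketch below is only an illustration on toy data: the helper `to_long` and the `toy_*` frames are hypothetical names that are not part of this notebook, and the values are made up. It assumes each table has one row per country, as is the case after the `groupby` in the loading step.

```python
import pandas as pd

def to_long(df_wide, value_name):
    """Reshape a wide country-by-date table into one row per (country, date)."""
    df_long = df_wide.melt(
        id_vars='Country/Region', var_name='date_raw', value_name=value_name
    )
    return df_long.rename(columns={'Country/Region': 'country'})

# Toy tables with the same layout as the aggregated CSV files (made-up values).
toy_confirmed = pd.DataFrame(
    {'Country/Region': ['A', 'B'], '1/22/20': [10, 5], '1/23/20': [20, 8]}
)
toy_recovered = pd.DataFrame(
    {'Country/Region': ['A', 'B'], '1/22/20': [1, 0], '1/23/20': [3, 1]}
)
toy_deaths = pd.DataFrame(
    {'Country/Region': ['A', 'B'], '1/22/20': [0, 0], '1/23/20': [1, 0]}
)

# Join the three long tables on country and date.
keys = ['country', 'date_raw']
toy_merged = (
    to_long(toy_confirmed, 'confirmed')
    .merge(to_long(toy_recovered, 'recovered'), on=keys)
    .merge(to_long(toy_deaths, 'deaths'), on=keys)
)
toy_merged['active'] = (
    toy_merged['confirmed'] - toy_merged['recovered'] - toy_merged['deaths']
)
print(toy_merged)
```

Unlike the loop, an inner merge silently drops a country that is missing from one of the tables, so the row count is worth checking after the join.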

In [None]:
df_merged['date'] = df_merged['date_raw'].map(lambda d: datetime.strptime(d, '%m/%d/%y'))

In [None]:
df_merged = df_merged[df_merged['country'].isin(['Italy', 'China', 'Spain', 'Portugal', 'France', 'US', 'Japan', 'Brazil'])]

In [None]:
df_merged[df_merged['country']=='Brazil']

## Filtering data

Here we select only the records from the point at which a minimum number of k confirmed cases has been reached (usually referred to as the point where growth becomes exponential).

In [None]:
k = 100

# Copy the slice so that adding a column does not write into a view of df_merged.
df_filtered = df_merged[df_merged['confirmed']>=k].copy()

# Day 1 is the first date on which each country reached k confirmed cases.
first_dates = df_filtered.groupby('country')['date'].transform('min')
df_filtered['days_from_k_case'] = (df_filtered['date'] - first_dates).dt.days + 1

## Grouping countries for regression modelling

In [None]:
X_raw, y_raw = [], []

for country, df_country in df_filtered.groupby(by='country'):
    X_raw += list(df_country['days_from_k_case'].values)
    y_raw += list(df_country['active'].values)
    
X = np.array(X_raw).reshape(-1, 1)
y = np.array(y_raw)

## Model training

In [None]:
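# Degree-2 polynomial on the log of active cases, fitted with a robust (Huber) loss.
model = make_pipeline(PolynomialFeatures(2), HuberRegressor())
model.fit(X, np.log(y))  # requires strictly positive active-case counts
X_pred = np.arange(0, 100).reshape(-1, 1)
y_pred_log = model.predict(X_pred)
y_pred = np.exp(y_pred_log)  # back-transform from the log scale
X_pred = X_pred.ravel()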
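To make the log-transform step concrete, here is a minimal sketch on synthetic, noise-free data. It deliberately swaps the `HuberRegressor` pipeline for an ordinary least-squares fit with `np.polyfit`, so it illustrates the idea rather than reproducing the model above. The coefficients `b0`, `b1`, `b2` are illustrative assumptions, not values fitted to any country.

```python
import numpy as np

# Synthetic curve whose log is exactly quadratic in time (assumed coefficients).
b0, b1, b2 = 4.6, 0.3, -0.00375
days = np.arange(1, 61)
active = np.exp(b0 + b1 * days + b2 * days ** 2)

# Least-squares fit of a degree-2 polynomial to log(active).
# np.polyfit returns the coefficients from the highest degree down.
c2, c1, c0 = np.polyfit(days, np.log(active), 2)

# With c2 < 0 the fitted curve has a single maximum at -c1 / (2 * c2).
peak_day = -c1 / (2 * c2)
peak_value = np.exp(c0 - c1 ** 2 / (4 * c2))

print(f"peak on day {peak_day:.1f} with about {peak_value:.0f} active cases")
```

On real data the counts are noisy and differ by country, which is presumably why a robust regressor is used in the notebook instead of plain least squares.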

## The growth curve in the model and other countries

In [None]:
fig = go.Figure(
    layout=go.Layout(
        paper_bgcolor='rgba(0,0,0,0)',
        plot_bgcolor='rgba(0,0,0,0)'
    )
)

for country, df_country in df_filtered.groupby(by='country'):
    fig.add_trace(
        go.Scatter(
            x=df_country['days_from_k_case'],
            y=df_country['active'],
            mode='lines',
            name=country
        )
    )

fig.add_trace(
    go.Scatter(
        x=X_pred,
        y=y_pred,
        mode='markers',
        name='Prediction'
    )
)

## Currently, how far are Brazil and Italy from the model?

In [None]:
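df_brazil = df_filtered[(df_filtered['country']=='Brazil')]
y_brazil = df_brazil['active']
# Predict on Brazil's own day indices so that observed and predicted values line up.
y_pred_brazil = np.exp(model.predict(df_brazil[['days_from_k_case']].values))
# Mean ratio of observed to predicted active cases (1.0 means a perfect match).
brazil_model_diff = y_brazil / y_pred_brazil
brazil_model_mean_diff = np.mean(brazil_model_diff)
print('The difference between the model and the real data is %.2f %% in Brazil'%((brazil_model_mean_diff-1)*100))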

In [None]:
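df_italy = df_filtered[(df_filtered['country']=='Italy')]
y_italy = df_italy['active']
# Predict on Italy's own day indices so that observed and predicted values line up.
y_pred_italy = np.exp(model.predict(df_italy[['days_from_k_case']].values))
# Mean ratio of observed to predicted active cases (1.0 means a perfect match).
italy_model_diff = y_italy / y_pred_italy
italy_model_mean_diff = np.mean(italy_model_diff)
print('The difference between the model and the real data is %.2f %% in Italy'%((italy_model_mean_diff-1)*100))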

In [None]:
italy_brazil_diff = italy_model_mean_diff/brazil_model_mean_diff
print('The difference between the model and the real data is %.2f %% higher in Italy than in Brazil'%((italy_brazil_diff-1)*100))
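One way to read the percentages printed above is as a mean ratio between observed and predicted values, expressed relative to 1. The sketch below assumes that interpretation; `mean_ratio` is a hypothetical helper and the numbers are made up for illustration.

```python
import numpy as np

def mean_ratio(observed, predicted):
    """Average of observed / predicted; 1.0 means the model matches on average."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(observed / predicted))

# Made-up observed and predicted active cases for three consecutive days.
observed = [110, 240, 450]
predicted = [100, 200, 500]

ratio = mean_ratio(observed, predicted)  # mean of 1.1, 1.2 and 0.9
percent_diff = (ratio - 1) * 100         # positive: observed above the model

print('%.2f %%' % percent_diff)
```

Dividing the ratio of one country by that of another then gives a relative factor between the two, which is how the Italy and Brazil values are compared.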

## The growth curve in Brazil and Italy, and our distance from the maximum UTI (*Unidade de Tratamento Intensivo*, Intensive Care Unit) capacity of the Brazilian health system

**Note:** Values for "Max capacity of UTI in public health system (SUS)" and "Max capacity of UTI in public health system (SUS) + private sector" are based on figures for the Brazilian health system (SUS). Source: Estadão

In [None]:
fig = go.Figure(
    layout=go.Layout(
        paper_bgcolor='rgba(0,0,0,0)',
        plot_bgcolor='rgba(0,0,0,0)'
    )
)

fig.add_trace(
    go.Scatter(
        x=X_pred,
        y=y_pred*brazil_model_mean_diff,
        mode='markers',
        name='Brazil Prediction (active cases)',
        marker={
            'color':'black',
            'size': 2
        }
    )
)

fig.add_trace(
    go.Scatter(
        x=X_pred,
        y=y_pred*0.1*brazil_model_mean_diff,
        mode='markers',
        name='Brazil Prediction (active cases with UTI need)',
        marker={
            'color':'orange',
            'size': 4
        }
    )
)

fig.add_trace(
    go.Scatter(
        x=df_filtered[df_filtered['country']=='Italy']['days_from_k_case'],
        y=df_filtered[df_filtered['country']=='Italy']['active'],
        mode='lines',
        name='Italy',
        line={
            'color':'red',
            'width':5
            
        }
    )
)

fig.add_trace(
    go.Scatter(
        x=df_filtered[df_filtered['country']=='Brazil']['days_from_k_case'],
        y=df_filtered[df_filtered['country']=='Brazil']['active'],
        mode='lines',
        name='Brazil',
        line={
            'color':'black',
            'width':5
            
        }
    )
)

fig.add_trace(
    go.Scatter(
        name="Max capacity of UTI in public health system (SUS)",
        x = [0, 100],
        y = [27400, 27400],
    )
)

fig.add_trace(
    go.Scatter(
        name="Max capacity of UTI in public health system (SUS) + private sector",
        x = [0, 100],
        y = [55100, 55100],
    )
)

### When will Brazil reach the limit of the public health system (SUS)?

In [None]:
# Day index (day 1 is the first date with k cases) on which the predicted active cases first reach the SUS limit.
public_health_limit_reach = int(X_pred[y_pred >= 27400].min())
df_filtered[(df_filtered['country']=='Brazil')]['date'].min() + timedelta(days=public_health_limit_reach - 1)
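The lookup above can be sketched in isolation as follows. The helper `first_day_reaching`, the doubling curve and the start date are all hypothetical and serve only to show how a day index is converted to a calendar date. Only the limit of 27400 is taken from the notebook. The sketch also returns `None` when the limit is never reached, a case the cell above does not handle.

```python
from datetime import date, timedelta
import numpy as np

def first_day_reaching(days, values, limit):
    """Return the first day whose value is at least `limit`, or None if never."""
    days = np.asarray(days)
    values = np.asarray(values)
    reached = values >= limit
    if not reached.any():
        return None
    return int(days[reached][0])

# Made-up curve that doubles every day: 100 on day 1, 200 on day 2, and so on.
days = np.arange(1, 11)
values = 100 * 2.0 ** (days - 1)

day = first_day_reaching(days, values, 27400)

# Day 1 corresponds to the start date itself, hence the "- 1".
start = date(2020, 3, 14)  # hypothetical date of the k-th confirmed case
crossing = start + timedelta(days=day - 1)
print(day, crossing)
```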