## hana_ml Tutorial - COVID-19
Author: SAP TI HA DB ML China

Date: 2020/5/21

In this tutorial, we will use SAP hana_ml to analyze the public dataset of COVID-19.

## Dataset
We use the public COVID-19 dataset from JHU (https://github.com/CSSEGISandData/COVID-19), for tutorial purposes only.

## HANA Connection

First, create a connection to SAP HANA. The connection parameters are read from a config file, config/e2edata.ini. A sample section of the config file is shown below; it contains the HANA URL, port, user, and password.<br>

###################<br>
[hana]<br>
url=host-url<br>
user=username<br>
passwd=userpassword<br>
port=3xx15<br>
###################<br>
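
To make the file layout concrete, the sketch below parses the same `[hana]` section with Python's standard `configparser`. It only illustrates the layout and is not meant to show how `Settings.load_config` is implemented; the helper name `read_hana_section` is hypothetical, and the host, port, and credential values are placeholders.

```python
import configparser

# Placeholder values; replace with your own HANA host, port and credentials.
SAMPLE_INI = """
[hana]
url=host-url
user=username
passwd=userpassword
port=30015
"""

def read_hana_section(text):
    """Return (url, port, user, passwd) from the [hana] section of an ini string."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["hana"]
    return section["url"], int(section["port"]), section["user"], section["passwd"]

url, port, user, pwd = read_hana_section(SAMPLE_INI)
print(url, port, user)
```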


In [None]:
from hana_ml.dataframe import ConnectionContext
from hana_ml.algorithms.pal.utility import DataSets, Settings
url, port, user, pwd = Settings.load_config("../../config/e2edata.ini")
connection_context = ConnectionContext(url, port, user, pwd)

Sample connection functions:

In [None]:
print(connection_context.connection.isconnected())
print(connection_context.has_table(table='T1'))
print(connection_context.get_current_schema())
print(connection_context.hana_version())

## hana_ml DataFrame

### 1. worldwide-aggregated dataset

In [None]:
import pandas as pd
from hana_ml.dataframe import create_dataframe_from_pandas

worldwide_df = DataSets.load_covid_data(connection_context)

Convert to a pandas DataFrame with collect():

In [None]:
worldwide_df.head(3).collect()

In [None]:
worldwide_df.dtypes()

In [None]:
worldwide_df.columns

In [None]:
import matplotlib.pyplot as plt
worldwide = worldwide_df.collect()
fig, ax = plt.subplots()
fig.set_size_inches(10, 7)
ax.set_ylabel('Number', fontsize='x-large')
ax.set_xlabel('Date', fontsize='x-large')
ax.set_title('Global Confirmed COVID-19 Cases', fontsize='xx-large')
ax.plot(worldwide['Date'], worldwide['Confirmed'], 'k--', label='Confirmed')
ax.plot(worldwide['Date'], worldwide['Deaths'], 'b--', label='Deaths')
ax.plot(worldwide['Date'], worldwide['Recovered'], 'g--', label='Recovered')
legend = ax.legend(loc='upper left', shadow=True, fontsize='x-large')
Date = worldwide['Date']
xticks=list(range(0,len(Date),14)) 
xlabels=[Date[x] for x in xticks] 
ax.set_xticks(xticks)
ax.set_xticklabels(xlabels, rotation=40)
plt.show()
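
The tick-thinning logic used in this plot (keep every 14th date so the axis stays readable) recurs in the later plots with a step of 7. A minimal sketch of that step as a reusable helper follows; the name `sparse_ticks` and the sample labels are made up for illustration.

```python
def sparse_ticks(labels, step):
    """Return (positions, tick_labels), keeping every `step`-th label."""
    positions = list(range(0, len(labels), step))
    return positions, [labels[i] for i in positions]

# Made-up sample labels in the same m/d/yy style as the dataset
dates = ["1/%d/20" % d for d in range(22, 32)]
positions, tick_labels = sparse_ticks(dates, 4)
print(positions)    # [0, 4, 8]
print(tick_labels)  # ['1/22/20', '1/26/20', '1/30/20']
```

The two returned lists could then be passed to `ax.set_xticks` and `ax.set_xticklabels` as in the cell above.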

### 2. time_series_covid19_confirmed_US dataset

In [None]:
import numpy as np
# Load data, drop unneeded identifier and lat/long columns, and group by state/province
def loadAndGroup(fileName, groupBy="Province_State", dropColumns=("UID","iso2","iso3","code3","FIPS","Admin2","Country_Region","Lat","Long_","Combined_Key"), extraDrop=()):   #,"Population"
    df=pd.read_csv(fileName)
    df=df.drop(columns=list(dropColumns)+list(extraDrop))
    df=df.groupby(groupBy).sum()
    # Skip the first 30 date columns so the series starts later
    df=df.drop(columns=df.columns[:30])
    return df

def diff(ys):
    res=[0]
    cur=ys[0]
    for y in ys[1:]:
        res.append(y-cur)
        cur=y
    return res

confd         =loadAndGroup('../datasets/time_series_covid19_confirmed_US.csv')
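# Add a nationwide total row named 'US' (DataFrame.append was removed in pandas 2.0)
confd         =pd.concat([confd, confd.sum(axis=0).rename('US').to_frame().T])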
confdDelta    =confd.diff(axis=1).replace(np.nan, 0)

# Preprocess the Data to transpose and add a 'timestamp' column as the first column.
def preprocessData(df):
    df_new = pd.DataFrame(df).T
    id= df_new.index
    col_name=df_new.columns.tolist()
    col_name.insert(0, 'Timestamp')  
    df_new=df_new.reindex(columns=col_name)
    df_new['Timestamp']=id
    return(df_new)

confd_df=preprocessData(confd)
confdDelta_df=preprocessData(confdDelta)

confd_df_hana = create_dataframe_from_pandas(connection_context=connection_context, pandas_df=confd_df, table_name='US_CONFIRMED', force=True, replace=True)
confdDelta_df_hana = create_dataframe_from_pandas(connection_context=connection_context, pandas_df=confdDelta_df, table_name='US_CONFIRMED_DELTA', force=True, replace=True)
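
The step from cumulative counts to daily new cases is a day-over-day difference, which is what both the `diff` helper and `confd.diff(axis=1)` compute. A small pure-Python sketch on made-up numbers (the function name `daily_new` is hypothetical):

```python
def daily_new(cumulative):
    """Day-over-day differences; the first day has no predecessor, so it is 0."""
    if not cumulative:
        return []
    return [0] + [b - a for a, b in zip(cumulative, cumulative[1:])]

cumulative = [1, 3, 6, 6, 10]  # made-up cumulative counts
print(daily_new(cumulative))   # [0, 2, 3, 0, 4]
```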

In [None]:
confd_us = connection_context.table('US_CONFIRMED')
confd_us_delta = connection_context.table('US_CONFIRMED_DELTA')
print(confd_us)
print(confd_us_delta)

In [None]:
confd_us.head(3).collect()

In [None]:
confd_us_delta.collect().head(3)

In [None]:
print(confd_us.count())
print(confd_us_delta.count())

In [None]:
ny_confirmed = confd_us.select('New York').collect()
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
fig.set_size_inches(10, 7)
ax.set_ylabel('Number', fontsize='x-large')
ax.set_xlabel('Date', fontsize='x-large')
ax.set_title('Cumulative COVID-19 Cases in New York', fontsize='xx-large')
ax.plot(confd_us.collect()['Timestamp'], ny_confirmed, 'b-o', label="Confirmed")
legend = ax.legend(loc='upper left', shadow=True, fontsize='x-large')
Date = confd_us.collect()['Timestamp']
xticks=list(range(0,len(Date),7)) 
xlabels=[Date[x] for x in xticks] 
ax.set_xticks(xticks)
ax.set_xticklabels(xlabels, rotation=40)
plt.show()

In [None]:
ny_confirmed_delta = confd_us_delta.select('New York').collect()
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
fig.set_size_inches(10,7)
ax.set_ylabel('Number', fontsize='x-large')
ax.set_xlabel('Date', fontsize='x-large')
ax.set_title('Daily New Cases of COVID-19 in New York', fontsize='xx-large')
ax.plot(confd_us_delta.collect()['Timestamp'], ny_confirmed_delta, 'b-o', label="Daily New York Confirmed")
legend = ax.legend(loc='upper left', shadow=True, fontsize='large')
Date = confd_us_delta.collect()['Timestamp']
xticks=list(range(0,len(Date),7)) 
xlabels=[Date[x] for x in xticks] 
ax.set_xticks(xticks)
ax.set_xticklabels(xlabels, rotation=40)
plt.show()

## Forecast 

### Time Series Forecast Algorithms - Auto ARIMA, Additive Model Forecast

### Auto ARIMA

In [None]:
ny_confd_delta = confd_us_delta.select('New York').add_id('ID').cast('New York', 'INT')
print(ny_confd_delta.head(5).collect())

In [None]:
from hana_ml.algorithms.pal.tsa.auto_arima import AutoARIMA

autoarima = AutoARIMA()
autoarima.fit(ny_confd_delta, key="ID")

print(autoarima.model_.collect().head(5))
print(autoarima.fitted_.collect().head(10))

In [None]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
fig.set_size_inches(13, 8)
ax.set_ylabel('Number', fontsize='x-large')
ax.set_xlabel('Date', fontsize='x-large')
ax.set_title('New Cases of COVID-19 in New York Fitted with Auto ARIMA Model', fontsize='xx-large')
ax.plot(ny_confd_delta.collect()['ID'], ny_confd_delta.collect()['New York'], 'b--', label = 'Daily New Cases')
ax.plot(autoarima.fitted_.collect()['ID'], autoarima.fitted_.collect()['FITTED'],'r--', label='AutoARIMA fitted')
legend = ax.legend(loc='upper left', shadow=True, fontsize='x-large')
Date = confd_us_delta.collect()['Timestamp']
xticks=list(range(0,len(Date),7)) 
xlabels=[Date[x] for x in xticks] 
ax.set_xticks(xticks)
ax.set_xticklabels(xlabels, rotation=40)
plt.show()

### Model Storage

In [None]:
from hana_ml.model_storage import ModelStorage
# model storage must use the same connection as the model
model_storage = ModelStorage(connection_context=connection_context)

# Saves the model
autoarima.name = 'ARIMA model' 
autoarima.version = 1
model_storage.save_model(model=autoarima, if_exists='replace')

autoarima.name = 'ARIMA model' 
autoarima.version = 2
model_storage.save_model(model=autoarima, if_exists='replace')

# Lists models
model_storage.list_models()

In [None]:
model_storage.delete_model('ARIMA model', 2)
model_storage.list_models()

In [None]:
model_storage.list_models()['JSON'].iloc[0]

In [None]:
model = model_storage.load_model(name='ARIMA model', version=1)
print(model.model_.collect().head(5))

### Predict with ARIMA Model 

In [None]:
model.set_conn(connection_context)
result = model.predict(forecast_length=5)
print(result.collect())

In [None]:
id_predict = list(range(90,95))
id_all = list(range(1,94))

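# Series.append was removed in pandas 2.0, so concatenate fitted and forecast values with pd.concat
data_fitted_predict = pd.concat([autoarima.fitted_.collect()['FITTED'], result.collect()['FORECAST']])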
data_fitted = ny_confd_delta.collect()['New York']

fig, ax = plt.subplots()
fig.set_size_inches(13, 8)
ax.set_ylabel('Number', fontsize='x-large')
ax.set_xlabel('Date', fontsize='x-large')
ax.set_title('New Cases of COVID-19 in New York Forecast', fontsize='xx-large')
ax.plot(ny_confd_delta.collect()['ID'], data_fitted, 'k--', label='Daily confirmed')
ax.plot(id_all, data_fitted_predict[1:94], 'r--', label='ARIMA fitted and forecast')
ax.plot(id_predict,  result.collect()['HI80'], 'b--', label='High 80% value')
ax.plot(id_predict,  result.collect()['HI95'], 'g--', label='High 95% value')
ax.plot(id_predict,  result.collect()['LO80'], 'y--', label='Low 80% value')
ax.plot(id_predict,  result.collect()['LO95'], 'c--', label='Low 95% value')
legend = ax.legend(loc='upper left', shadow=True, fontsize='x-large')
Date = confd_us_delta.collect()['Timestamp']
#Date.append(pd.DataFrame(['5/20/20', '5/21/20', '5/22/20', '5/23/20', '5/24/20']))
xticks=list(range(0,len(Date),7)) 
xlabels=[Date[x] for x in xticks] 
ax.set_xticks(xticks)
ax.set_xticklabels(xlabels, rotation=40)
plt.show()
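
The plotting cell hard-codes the ID ranges, for example 90 to 94 for the five forecast points. Assuming IDs are consecutive integers starting at 1, and that the series has 89 observed rows (inferred from the hard-coded ranges), the forecast IDs could be derived instead. The helper name `forecast_ids` is hypothetical.

```python
def forecast_ids(n_observed, forecast_length, first_id=1):
    """IDs that follow the last observed ID, assuming consecutive integer IDs."""
    start = first_id + n_observed
    return list(range(start, start + forecast_length))

print(forecast_ids(89, 5))  # [90, 91, 92, 93, 94]
```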

### Additive Model Forecast

In [None]:
ny_confd_delta = confd_us_delta.select('Timestamp','New York').cast('New York', 'INT')
ny_confd_delta = ny_confd_delta.cast('Timestamp', 'DATE')
print(ny_confd_delta.head(3).collect())
print(ny_confd_delta.dtypes())

Build the DataFrame of dates to forecast (the 'New York' values are placeholders):

In [None]:
from hana_ml.dataframe import create_dataframe_from_pandas
data = {
    'Timestamp':['2020-5-20', '2020-5-21', '2020-5-22', '2020-5-23', '2020-5-24'],
    'New York':[0, 0, 0, 0, 0]
}
predict = pd.DataFrame(data)
predict_df = create_dataframe_from_pandas(connection_context=connection_context, pandas_df= predict, table_name='ADDITIVE_PREDICT_TBL', force=True, replace=True)
predict_df = predict_df.cast('New York', 'DOUBLE')
predict_df = predict_df.cast('Timestamp', 'DATE')
print(predict_df.collect())
print(predict_df.dtypes())
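
The five forecast dates above are typed by hand. As a sketch, they could also be generated from the last observed date with the standard `datetime` module. The last observed date of 2020-05-19 is an assumption inferred from the first forecast date, and `future_dates` is a hypothetical helper name.

```python
from datetime import date, timedelta

def future_dates(last_observed, horizon):
    """ISO-formatted dates for the `horizon` days after `last_observed`."""
    return [(last_observed + timedelta(days=i)).isoformat()
            for i in range(1, horizon + 1)]

print(future_dates(date(2020, 5, 19), 5))
# ['2020-05-20', '2020-05-21', '2020-05-22', '2020-05-23', '2020-05-24']
```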

In [None]:
from hana_ml.algorithms.pal.tsa import additive_model_forecast

amf = additive_model_forecast.AdditiveModelForecast()
amf.fit(ny_confd_delta)

print(amf.model_.collect())

In [None]:
result = amf.predict(predict_df)
print(result.collect())

In [None]:
id_predict = list(range(90,95))
id_all = list(range(1,95))

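# Series.append was removed in pandas 2.0, so concatenate observed and predicted values with pd.concat
data_all = pd.concat([ny_confd_delta.collect()['New York'], result.collect()['YHAT']])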
upper = result.collect()['YHAT_UPPER']
lower = result.collect()['YHAT_LOWER']

fig, ax = plt.subplots()
fig.set_size_inches(13, 8)
ax.set_ylabel('Number', fontsize='x-large')
ax.set_xlabel('Date', fontsize='x-large')
ax.set_title('New Cases of COVID-19 in New York Forecast - Additive Model Forecast', fontsize='xx-large')
ax.plot(id_all[1:89], data_all[1:89], 'k--', label='confirmed')
ax.plot(id_all[89:94], data_all[89:94], 'r--', label='predict data')
ax.plot(id_predict,  upper, 'b--', label='upper bound')
ax.plot(id_predict,  lower, 'c--', label='lower bound')
Date = confd_us_delta.collect()['Timestamp']
#Date.append(pd.DataFrame(['5/20/20', '5/21/20', '5/22/20', '5/23/20', '5/24/20']))
xticks=list(range(0,len(Date),7)) 
xlabels=[Date[x] for x in xticks] 
ax.set_xticks(xticks)
ax.set_xticklabels(xlabels, rotation=40)
# Add the legend before showing the figure so that it appears in the plot
legend = ax.legend(loc='upper left', shadow=True, fontsize='x-large')
plt.show()

### Regression - SVR

In [None]:
ny_confd_delta = confd_us_delta.select('New York').cast('New York', 'INT')
print(ny_confd_delta.head(3).collect())
print(ny_confd_delta.dtypes())

In [None]:
from pandas import DataFrame
from pandas import concat
 
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):    
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]
    # put it all together
    agg = concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg
 
 
values = pd.DataFrame(ny_confd_delta.collect()['New York'])
regression_data = series_to_supervised(values, 5)
print(regression_data.head(5))
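
With `n_in=5` and the default `n_out=1`, `series_to_supervised` turns the series into rows of five lagged values followed by the current value, dropping the leading rows that lack a full history. The same windowing on a made-up list, without pandas (the function name `lag_windows` is hypothetical):

```python
def lag_windows(series, n_in):
    """Rows of (n_in previous values..., current value) for each usable position."""
    return [tuple(series[i - n_in:i + 1]) for i in range(n_in, len(series))]

series = [10, 20, 30, 40, 50, 60, 70]  # made-up daily counts
for row in lag_windows(series, 5):
    print(row)
# (10, 20, 30, 40, 50, 60)
# (20, 30, 40, 50, 60, 70)
```

In each row the first five entries play the role of `var1(t-5)` through `var1(t-1)` and the last entry is `var1(t)`, the regression label.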

In [None]:
svr_df = create_dataframe_from_pandas(connection_context=connection_context, pandas_df=regression_data, table_name='NY_SVR_DATA_TBL', force=True, replace=True)
svr_df =svr_df.add_id('ID')
print(svr_df.head(3).collect())

### Model Training
Split the dataset into train_data and test_data:

In [None]:
from hana_ml.algorithms.pal.partition import train_test_val_split
train_data, test_data, validate_data = train_test_val_split(svr_df, training_percentage=0.8, testing_percentage=0.2, validation_percentage=0)
print(train_data.count())
print(train_data.collect().head(3))
print(test_data.count())
print(test_data.collect().head(3))

In [None]:
from hana_ml.algorithms.pal.svm import SVR
featurs_svr = ['var1(t-5)', 'var1(t-4)', 'var1(t-3)', 'var1(t-2)', 'var1(t-1)']
svr = SVR(kernel = 'rbf',
          scale_info='standardization', 
          gamma = 0.3,
          random_state=10,
          scale_label=True)
svr.fit(train_data, key='ID', features = featurs_svr, label = 'var1(t)')

print(svr.model_.collect())
print(svr.stat_.collect())

### Model Score
Calculate the r2_score:

In [None]:
print(svr.score(test_data, key='ID', features = featurs_svr, label = 'var1(t)'))
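
For reference, the usual coefficient of determination is R² = 1 − SS_res / SS_tot. The sketch below computes it on made-up numbers, assuming `score` returns this standard quantity; it is not the library's implementation.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [3, 5, 7, 9]        # made-up observed values
y_pred = [2.5, 5.5, 7, 9.5]  # made-up predictions
print(round(r2_score(y_true, y_pred), 4))  # 0.9625
```

A value of 1 means perfect predictions, 0 matches always predicting the mean, and negative values are worse than that baseline.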

### Cross Validation

In [None]:
svr_cv = SVR(kernel='rbf', 
             scale_info='standardization', 
             scale_label=True, 
             resampling_method='cv',
             fold_num=10, 
             repeat_times=5,
             random_state=11,
             search_strategy='grid',
             param_range = [('gamma', [0.1, 0.1, 1.0])])
svr_cv.fit(train_data, key='ID', label = 'var1(t)')

print(svr_cv.model_.collect())
print(svr_cv.stat_.collect())
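
The `param_range` entry `('gamma', [0.1, 0.1, 1.0])` is read here as `[start, step, end]`. That reading is an assumption about hana_ml's grid-search convention and should be checked against the library documentation. Under it, the candidate values can be enumerated as follows (the helper name `expand_range` is hypothetical):

```python
def expand_range(start, step, end):
    """Candidate values start, start + step, ... up to and including end."""
    # A small epsilon guards against floating-point error in the division.
    count = int((end - start) / step + 1e-9) + 1
    return [round(start + i * step, 10) for i in range(count)]

gammas = expand_range(0.1, 0.1, 1.0)
print(gammas)       # [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
print(len(gammas))  # 10
```

Under the same assumption, a grid of 10 gamma values with `fold_num=10` and `repeat_times=5` would amount to 10 × 10 × 5 = 500 model fits during the search.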

In [None]:
train_data.collect()

In [None]:
print(svr_cv.score(test_data, key='ID', features = featurs_svr, label = 'var1(t)'))

## Close HANA Connection

In [None]:
connection_context.close()