# Building a training dataset

This script shows how to build a training dataset using our modules.

## Set configuration

The /doc path contains a file named configure.json. The methods in our modules read this file to load configuration values such as the model type, number of nodes, training epochs, and test data size. Data options such as the window size, whether values are accumulated, and whether a moving average is applied are configured there too.

In [1]:
# This is a configure.json file example.
import json

# Load the local configure.json and print it.
# The with-statement closes the file once it has been read.
with open('../doc/configure.json', 'r') as configure_json:
    print(json.dumps(json.load(configure_json), indent=4))

{
    "ncovid": "ML COVID-19 configure file",
    "author": "NatalNet NCovid",
    "published_at": 2021,
    "folder_configs": {
        "docs_path": "../doc/",
        "data_path": "../dbs/",
        "model_path": "fitted_model/",
        "model_path_remote": "https://",
        "glossary_file": "glossary.json"
    },
    "model_configs": {
        "type_used": "Artificial",
        "is_predicting": "False",
        "Artificial": {
            "model": "lstm",
            "nodes": 300,
            "epochs": 100,
            "dropout": 0.1,
            "batch_size": 64,
            "earlystop": 30,
            "is_output_in_input": "True",
            "data_configs": {
                "is_accumulated_values": "False",
                "is_apply_moving_average": "True",
                "window_size": 7,
                "data_test_size_in_days": 35,
                "type_norm": ""
            },
            "Autoregressive": {
                "model": "arima",
                "p": 1,
    

To load this set of configurations, import the configs_manner module (configs_manner.py).

In [2]:
# If this script runs from another folder, adjust the path below so that it points to the src/ folder.
import sys
sys.path.append("../src")

import configs_manner

# Print some configuration variables.
print("Data window size: \n", configs_manner.model_infos["data_window_size"])
print("\n")
print("Is the data accumulated: \n", configs_manner.model_infos["data_is_accumulated_values"])
print("\n")
print("Is the moving average applied: \n", configs_manner.model_infos["data_is_apply_moving_average"])

Data window size: 
 7


Is the data accumulated: 
 False


Is the moving average applied: 
 True
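
Note that configure.json stores flags such as is_accumulated_values as the strings "True" and "False". Whether and how configs_manner converts them is not shown here. If you read the file directly, a conversion along the lines of this sketch avoids a common pitfall: in Python any non-empty string, including "False", is truthy. The helper name parse_flag is hypothetical and is not part of our modules.

```python
def parse_flag(value):
    """Convert a "True"/"False" string (or a real bool) into a bool."""
    if isinstance(value, bool):
        return value
    text = str(value).strip().lower()
    if text == "true":
        return True
    if text == "false":
        return False
    raise ValueError(f"not a boolean flag: {value!r}")


# Flag values copied from the data_configs block of configure.json.
data_configs = {
    "is_accumulated_values": "False",
    "is_apply_moving_average": "True",
}

flags = {key: parse_flag(value) for key, value in data_configs.items()}
print(flags)

# Without conversion, the string "False" still counts as true.
print(bool("False"))
```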


To change any data parameter, edit its value in the configure.json file.
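
The file can be edited by hand, but the change can also be scripted. The sketch below works on a temporary stand-in file that mirrors only part of the structure printed earlier, so it does not touch the real ../doc/configure.json. The new value 14 is an arbitrary example. A module that was already imported may need to be reloaded to see the change; that depends on how configs_manner reads the file, which is not shown here.

```python
import json
import tempfile
from pathlib import Path

# Minimal stand-in for configure.json (only the keys needed for this example).
sample = {
    "model_configs": {
        "type_used": "Artificial",
        "Artificial": {
            "data_configs": {
                "window_size": 7,
                "data_test_size_in_days": 35,
            }
        },
    }
}

with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = Path(tmp_dir) / "configure.json"
    config_path.write_text(json.dumps(sample, indent=4))

    # Load the file, change one value, and write it back.
    config = json.loads(config_path.read_text())
    config["model_configs"]["Artificial"]["data_configs"]["window_size"] = 14
    config_path.write_text(json.dumps(config, indent=4))

    # Read it again to confirm the change was saved.
    updated = json.loads(config_path.read_text())

print(updated["model_configs"]["Artificial"]["data_configs"]["window_size"])
```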

This script makes a remote data request, so before creating a data constructor you need to declare the remote repository, the locality, the features to fetch, and the start and end dates.

In [3]:
# Specific code of the remote data repository.
repo = "p971074907"
# Country and state acronyms, separated by ":".
path = "brl:rn"
# Columns (features) to extract from the database, separated by ":".
feature = "date:newDeaths:newCases:"
# Start date of the data request.
begin = "2020-05-01"
# End date of the data request.
end = "2021-07-01"
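
To make the ":" separator convention concrete, here is a small sketch that splits these strings and sanity-checks the date range before a request is sent. The helper split_fields is hypothetical; how the remote service itself parses the strings is not shown in this script.

```python
from datetime import date


def split_fields(value):
    """Split a ":"-separated string, dropping empty pieces such as a trailing ":"."""
    return [piece for piece in value.split(":") if piece]


path = "brl:rn"
feature = "date:newDeaths:newCases:"
begin = "2020-05-01"
end = "2021-07-01"

country, state = split_fields(path)
columns = split_fields(feature)

# ISO dates (YYYY-MM-DD) can be parsed and compared directly.
begin_date = date.fromisoformat(begin)
end_date = date.fromisoformat(end)
if begin_date >= end_date:
    raise ValueError("begin must be earlier than end")

print(country, state)
print(columns)
print((end_date - begin_date).days, "days between begin and end")
```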

## Load data

In our modules, almost all procedures use classes and their methods. To get a dataset from a remote repository, first create a DataConstructor instance and then call its .collect_dataframe() method. See the [Loading remote data](loading_remote_data.ipynb) notebook.

In [4]:
# Import the data_manner module (assuming the src/ folder is on the import path).
import data_manner

# Create the DataConstructor instance.
data_constructor = data_manner.DataConstructor()
# Collect data from the remote repository.
collected_data = data_constructor.collect_dataframe(path, repo, feature, begin, end)

Internally, some data processing is done to deal with variation in the data, such as applying a moving average or differencing.
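
As an illustration of these two operations, the sketch below applies a first difference and a trailing moving average to a short made-up series. It is not the code used inside data_manner, whose exact variant (trailing or centered window, edge handling) is not shown here; the function names and the sample numbers are assumptions for this example only.

```python
def difference(values):
    """First difference: turns accumulated counts into per-day counts."""
    return [current - previous for previous, current in zip(values, values[1:])]


def moving_average(values, window):
    """Trailing moving average; returns len(values) - window + 1 points."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Made-up accumulated counts for nine consecutive days.
accumulated = [0, 2, 5, 9, 14, 20, 27, 35, 44]

daily = difference(accumulated)
smoothed = moving_average(daily, window=7)

print(daily)
print(smoothed)
```

Both steps shorten the series: differencing drops one point, and a window of 7 drops six more.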

## Build train

With the collected data, call the .build_train() method to create training data in the right shape.

In [5]:
train = data_constructor.build_train(collected_data)

print("Train X and Train target shapes: ", train.x.shape, train.y.shape)

Train X and Train target shapes:  (413, 7, 2) (413, 7, 1)
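
The shapes follow the layout (samples, window size, features): each sample holds 7 days of 2 input features, and each target holds 7 days of 1 feature. The sketch below shows one way such arrays can be produced with a sliding window over random data. It is an assumption about the general technique, not the actual build_train() implementation; the stride of one day, the choice of column 0 as the target, and the function name make_windows are all hypothetical.

```python
import numpy as np


def make_windows(data, window_size, target_column):
    """Pair each input window with the target window that follows it."""
    n_windows = len(data) - 2 * window_size + 1
    # Inputs: window_size consecutive rows with all features.
    x = np.stack([data[i:i + window_size] for i in range(n_windows)])
    # Targets: the next window_size rows, keeping only the target column.
    y = np.stack([
        data[i + window_size:i + 2 * window_size, target_column:target_column + 1]
        for i in range(n_windows)
    ])
    return x, y


# Random stand-in for 30 days of two features (for example deaths and cases).
rng = np.random.default_rng(0)
data = rng.random((30, 2))

x, y = make_windows(data, window_size=7, target_column=0)
print(x.shape, y.shape)
```

With 30 rows and a window of 7, this prints `(17, 7, 2) (17, 7, 1)`, since 30 - 2 * 7 + 1 = 17 window pairs fit in the series.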
