# Names Dataset with ExeTera

This notebook demonstrates how to use ExeTera to load a dataset of given names from the ["Gender by Name Data Set"](https://archive.ics.uci.edu/ml/datasets/Gender+by+Name).

Each row has four fields:
* Name: String
* Gender: M/F (category/string)
* Count: Integer, total number of instances of this name in the dataset
* Probability: Float, probability that a person drawn at random from the population has this name
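
As a rough illustration of these four field types, the sketch below parses CSV text into typed records using only the standard library. The `NameRecord` class, the `parse_rows` helper, and the two sample rows are made up for this example. It also assumes the CSV has a header row with the column names listed above.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class NameRecord:
    """One row of the names dataset (hypothetical helper type)."""
    name: str
    gender: str          # "M" or "F"
    count: int
    probability: float


def parse_rows(text):
    """Parse CSV text that starts with a Name,Gender,Count,Probability header."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        NameRecord(
            name=row["Name"],
            gender=row["Gender"],
            count=int(row["Count"]),
            probability=float(row["Probability"]),
        )
        for row in reader
    ]


# Made-up sample rows, for illustration only (not taken from the dataset).
SAMPLE = "Name,Gender,Count,Probability\nAlex,M,100,0.25\nSam,F,300,0.75\n"

records = parse_rows(SAMPLE)
print(records[0])
```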


In [1]:
import sys
sys.path.insert(0, "..")


from urllib.request import urlretrieve

SRC_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/00591/name_gender_dataset.csv"
FILENAME = "name_gender_dataset.csv"

_ = urlretrieve(SRC_URL, FILENAME)
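
Re-running the cell above downloads the file again each time. A possible variation, sketched below, skips the download when the file is already on disk. `download_if_missing` is a hypothetical helper introduced here, not part of ExeTera or the standard library, and it does not check that an existing file is complete.

```python
import os
from urllib.request import urlretrieve


def download_if_missing(url, filename):
    """Download url to filename unless the file already exists.

    Returns True if a download happened, False if the existing file was reused.
    """
    if os.path.exists(filename):
        return False
    urlretrieve(url, filename)
    return True
```

With this helper, the download line could be written as `download_if_missing(SRC_URL, FILENAME)`.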

In [2]:
%%writefile name_gender_dataset_schema.json

{
  "exetera": {
    "version": "1.0.0"
  },
  "schema": {
    "name_gender_dataset": {
      "primary_keys": [
        "Name"
      ],
      "fields": {
        "Name": {
          "field_type": "string"
        },
        "Gender": {
          "field_type": "categorical",
          "categorical": {
            "value_type": "int8",
            "strings_to_values": {
              "M": 1,
              "F": 2
            }
          }
        },
        "Count": {
          "field_type": "numeric",
          "value_type": "int32"
        },
        "Probability": {
          "field_type": "numeric",
          "value_type": "float32"
        }
      }
    }
  }
}

Overwriting name_gender_dataset_schema.json
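
The `strings_to_values` mapping in the schema describes how the `Gender` strings are stored as small integers. The sketch below illustrates that idea in plain Python. `encode_categorical` is a hypothetical helper, not part of ExeTera, and the importer's actual implementation may differ.

```python
GENDER_TO_VALUE = {"M": 1, "F": 2}  # copied from the schema above

INT8_MIN, INT8_MAX = -128, 127


def encode_categorical(strings, strings_to_values):
    """Map category strings to their integer codes, checking the int8 range."""
    for value in strings_to_values.values():
        if not INT8_MIN <= value <= INT8_MAX:
            raise ValueError(f"{value} does not fit in int8")
    # A KeyError here means the data contains a category missing from the mapping.
    return [strings_to_values[s] for s in strings]


codes = encode_categorical(["M", "F", "F", "M"], GENDER_TO_VALUE)
print(codes)  # [1, 2, 2, 1]
```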


In [3]:
from exetera.io import importer
from exetera.core import session
from datetime import datetime, timezone

with session.Session() as s:
    importer.import_with_schema(
        session=s,
        timestamp=str(datetime.now(timezone.utc)),
        dataset_name="NameGender",
        dest_file_name="name_gender_dataset.hdf5",
        schema_file="name_gender_dataset_schema.json",
        files={"name_gender_dataset": "name_gender_dataset.csv"},
        overwrite=True,
    )

read_file_using_fast_csv_reader: 1 chunks, 147269 accumulated_written_rows parsed in 1.1086580753326416s
completed in 1.1138238906860352 seconds
Total time 1.1140117645263672s
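
The importer log reports 147269 rows written. One way to cross-check that figure is to count the data rows in the source CSV with the standard library. This is a minimal sketch: `count_data_rows` is a hypothetical helper, and it assumes the file is UTF-8 encoded with a single header row.

```python
import csv


def count_data_rows(filename):
    """Count the CSV records in filename, excluding the header row."""
    with open(filename, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row; no error if the file is empty
        return sum(1 for _ in reader)
```

Calling `count_data_rows("name_gender_dataset.csv")` after the download should return a number that matches the `accumulated_written_rows` value in the log, assuming the importer does not drop or filter rows.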
