# Names Dataset with ExeTera

This notebook demonstrates how to use ExeTera to load a dataset of given names, the ["Gender by Name Data Set"](https://archive.ics.uci.edu/ml/datasets/Gender+by+Name).

Each row has four fields:
* Name: String
* Gender: M/F (category/string)
* Count: Integer, total number of instances of this name in the dataset
* Probability: Float, probability that a randomly drawn person from the population has this name

The first step is to download the data in CSV format:

In [1]:
from urllib.request import urlretrieve

SRC_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/00591/name_gender_dataset.csv"
FILENAME = "name_gender_dataset.csv"

_ = urlretrieve(SRC_URL, FILENAME)
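
Before writing the schema, it can help to confirm that the downloaded file has the expected column headers. The following is a minimal sketch using only the standard library; `preview_csv` is a hypothetical helper introduced here for illustration and is not part of ExeTera.

```python
import csv
from itertools import islice


def preview_csv(path, n_rows=5):
    """Return (header, rows): the header line and the first n_rows data rows."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, [])  # empty list if the file has no lines
        rows = list(islice(reader, n_rows))
    return header, rows


# In the notebook this would be called on the downloaded file, for example:
#   header, rows = preview_csv(FILENAME)
#   print(header)
```

If the header does not list `Name`, `Gender`, `Count` and `Probability`, the field names in the schema would need to be adjusted to match.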

A schema is then written to define the structure of the dataset. It contains a single table, `name_gender_dataset`, with the four fields described above:

In [2]:
%%writefile name_gender_dataset_schema.json

{
  "exetera": {
    "version": "1.0.0"
  },
  "schema": {
    "name_gender_dataset": {
      "primary_keys": [
        "Name"
      ],
      "fields": {
        "Name": {
          "field_type": "string"
        },
        "Gender": {
          "field_type": "categorical",
          "categorical": {
            "value_type": "int8",
            "strings_to_values": {
              "M": 1,
              "F": 2
            }
          }
        },
        "Count": {
          "field_type": "numeric",
          "value_type": "int32"
        },
        "Probability": {
          "field_type": "numeric",
          "value_type": "float32"
        }
      }
    }
  }
}

Overwriting name_gender_dataset_schema.json
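
As a quick sanity check, the schema file can be read back with the standard `json` module and summarised per field. This is a sketch that relies only on the key layout shown in the schema above; `load_schema` and `summarize_schema` are hypothetical helpers, not ExeTera functions.

```python
import json


def load_schema(path):
    """Read a schema JSON file into a plain dictionary."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def summarize_schema(schema):
    """Map each table name to {field name: "field_type(value_type)"}."""
    summary = {}
    for table_name, table in schema["schema"].items():
        fields = {}
        for field_name, spec in table["fields"].items():
            field_type = spec["field_type"]
            if field_type == "categorical":
                # Categorical fields keep their storage type in a nested block.
                value_type = spec["categorical"]["value_type"]
            else:
                value_type = spec.get("value_type")  # absent for string fields
            fields[field_name] = (
                field_type if value_type is None else f"{field_type}({value_type})"
            )
        summary[table_name] = fields
    return summary


# In the notebook:
#   summarize_schema(load_schema("name_gender_dataset_schema.json"))
```

A malformed file raises `json.JSONDecodeError` at this point, which is easier to diagnose than a failure partway through the import.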


The data from the CSV file can now be imported into ExeTera to produce the HDF5 file. The same import can be run from the command line with the `exetera import` command; the cell below uses the Python API:

In [3]:
from exetera.io import importer
from exetera.core import session
from datetime import datetime, timezone

with session.Session() as s:
    importer.import_with_schema(
        session=s,
        timestamp=str(datetime.now(timezone.utc)),
        dataset_name="NameGender",
        dest_file_name="name_gender_dataset.hdf5",
        schema_file="name_gender_dataset_schema.json",
        files={"name_gender_dataset": "name_gender_dataset.csv"},
        overwrite=True,
    )

read_file_using_fast_csv_reader: 1 chunks, 147269 accumulated_written_rows parsed in 1.520482063293457s
completed in 1.5250701904296875 seconds
Total time 1.5253255367279053s
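
For reference, the command-line equivalent could be assembled as sketched below. The `-s`, `-i` and `-o` flags (schema, inputs, output) are an assumption based on the ExeTera README and should be checked against `exetera import --help` for the installed version; `build_import_command` is a hypothetical helper.

```python
import shlex


def build_import_command(schema_file, inputs, dest_file):
    """Build an argument list for `exetera import`.

    `inputs` maps schema table names to CSV paths.
    The flag names used here are assumed, not verified against the installed CLI.
    """
    input_arg = ",".join(f"{table}:{path}" for table, path in inputs.items())
    return ["exetera", "import", "-s", schema_file, "-i", input_arg, "-o", dest_file]


cmd = build_import_command(
    "name_gender_dataset_schema.json",
    {"name_gender_dataset": "name_gender_dataset.csv"},
    "name_gender_dataset.hdf5",
)
print(shlex.join(cmd))  # the command as it would be typed in a shell
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` to run the import outside the notebook kernel.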


The imported dataset can now be opened and queried:

In [4]:
with session.Session() as s:
    dat = s.open_dataset("name_gender_dataset.hdf5", "r", "dataset")  # open read-only; "dataset" is the name used within the session

    print(list(dat))  # list the frames

    frame = dat["name_gender_dataset"]  # pull out a frame

    print("Frame type and length:", type(frame), len(frame))

    for name, col in frame.columns.items():
        print(name, type(col).__name__, len(col))

    field = frame["Name"]  # pull out a field of the frame

    print(type(field), len(field))

    print(field.data[:30])

['name_gender_dataset']
Frame type and length: <class 'exetera.core.dataframe.HDF5DataFrame'> 8
Count NumericField 147269
Count_valid NumericField 147269
Gender CategoricalField 147269
Name IndexedStringField 147269
Probability NumericField 147269
Probability_valid NumericField 147269
j_valid_from TimestampField 147269
j_valid_to TimestampField 147269
<class 'exetera.core.fields.IndexedStringField'> 147269
['James', 'John', 'Robert', 'Michael', 'William', 'Mary', 'David', 'Joseph', 'Richard', 'Charles', 'Thomas', 'Christopher', 'Daniel', 'Matthew', 'Elizabeth', 'Patricia', 'Jennifer', 'Anthony', 'George', 'Linda', 'Barbara', 'Donald', 'Paul', 'Mark', 'Andrew', 'Steven', 'Kenneth', 'Edward', 'Joshua', 'Margaret']
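
The `Gender` field is stored as `int8` codes rather than strings, so reading it back requires the inverse of the `strings_to_values` mapping from the schema. The sketch below assumes that `frame["Gender"].data[:]` yields those integer codes, in the same way that `field.data[:30]` yields names above; `tally_categories` is a hypothetical helper.

```python
from collections import Counter

# Inverse of "strings_to_values" in the schema: {"M": 1, "F": 2}
GENDER_LABELS = {1: "M", 2: "F"}


def tally_categories(codes, labels):
    """Count occurrences of each category code and report them by label."""
    counts = Counter(int(code) for code in codes)
    return {labels.get(code, f"unknown({code})"): n for code, n in counts.items()}


# Inside the `with session.Session()` block this might be used as:
#   print(tally_categories(frame["Gender"].data[:], GENDER_LABELS))
```

Codes that are missing from the mapping are reported as `unknown(<code>)` rather than raising, so unexpected values in the source data remain visible.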
