Skip to content

AlexMili/torch-dataframe

Repository files navigation

Licence MIT Build Status codecov

Dataframe

Dataframe is a Torch7 class to load and manipulate tabular data (e.g. Kaggle-style CSVs) inspired from R's and pandas' data frames.

As of release 1.5 it fully supports the torchnet data structure. It also has custom iterators to convenient integration with torchnet's engines, see the mnist example. As of release 1.6 it has changed the internal storage to tensor

For a more detailed look at the changes between the versions have a look at the NEWS file.

Requirements

Installation

You can clone this repository or directly install it through luarocks:

git clone https://github.com/AlexMili/torch-dataframe
cd torch-dataframe
luarocks make rocks/torch-dataframe-scm-1.rockspec

the same in one line :

luarocks install torch-dataframe scm-1

or

luarocks install torch-dataframe

Changelog

Version: 1.7

  • Added faster torch.Tensor functions to fill/stat functions for speed
  • Added mutate function to Dataseries
  • __index__ access for Df_Array
  • More complete documentation for Df_Array and specs
  • Df_Dict elements can be accessed using myDict[index] or myDict["$colname"]
  • Df_Dict key property available. It list the Df_Dict's keys
  • Df_Dict length property available. It list by key, the length of its content
  • Df_Dict check_length() checks if all elements have the same length
  • Df_Dict set_keys(table) replaces every keys by the given table (must be the same size)
  • More complete documentation for Df_Dict and specs
  • More complete documentation for Df_Tbl and specs
  • Internal methods _infer_csvigo_schema() and _infer_data_schema() renamed to _infer_schema()
  • Type inference is now based on type frequences but if it encounter a single double/float in a integer column it will consider the column as double/float
  • it is now possible to directly set a schema for a Dataframe without any checks with set_schema(). Use it wisely
  • Possibility to init a Dataframe with a schema, a column order and a number of rows with internal method _init_with_schema()
  • Added bulk_load_csv() method wich loads large CSVs files using threads but without checking missing values or data integrity. To use with caution. See #28
  • Added load_threadcsv()
  • Added the possiblity to create empty Dataseries
  • Added Dataseries load() method to directly load a tensor or tds.Vec in memory without any check
  • Added iris dataset in /specs/data
  • New specs structure
  • Fixed csv loading when no header and test case according to it
  • Changed assert_is_index return value to true on success instead of self

See NEWS.md file for previous changes.

Usage

Named arguments

The Dataframe relies on argcheck for parsing arguments. This means that you can used named parameters using the function{arg_name=value} syntax. Named arguments are supported by all functions except the constructor and is in certain functions mandatory in order to avoid ambiguity.

The argcheck package also works as the API documentation. It checks arguments and if you happen to provide the function with invalid arguments it will automatically output the function documentation.

Important: Due to limitations in the Lua language the package uses helper classes for separating regular table arguments from tables passed into as arguments. The three classes are:

  • Df_Array - contains only values and no keys
  • Df_Dict - a dictionary table that has named keys that map to all values
  • Df_Tbl - a raw table wrapper that does a shallow argument copy

Load data

Initiate the object:

require 'Dataframe'
df = Dataframe()

Load CSV file:

df:load_csv{path='./data/training.csv', header=true}

Load from table:

df:load_table{data=Df_Dict{firstColumn={1,2,3},
                           secondColumn={4,5,6}}}

You can also instantiate the object with a csv-filename or a table by passing the table or filename as an argument:

require 'Dataframe'
df = Dataframe('./data/training.csv')

Data inspection

You can discover your dataset with the following functions:

-- you can either view the data as a plain text output or itorch html table
df:output() -- prints html if in itorch otherwise prints plain table
df:output{html=true} -- forces html output

df:show() -- prints the head + tail of the table

-- You can also directly call print() on the object
-- and it will print the ascii-table
print(df)

General dataset information can be found using:

df:shape() -- print {rows=3, cols=3}
#df -- gets the number of rows
df:size() -- returns a tensor with the size rows, columns
df.column_order -- table of columns names
df:count_na() -- print all the missing values by column name

If you want to inspect random elements you can use the get_random():

df:get_random(10):output()

Manipulate

You can manipulate it:

df:insert(Df_Dict({['first_column']={7,8,9},['second_column']={10,11,12}}))
df:remove_index(3) -- remove line 3 of the entire dataset

df:has_column('x') -- return true if the column exist
df:get_column('y') -- return column x as table
df["$y"] -- alias for get_column

df:add_column('z', 0) -- Add column with default value 0 at the end (right side of the table)
df:add_column('first_column', 1, 2) -- Add column with default value 2 at the beginning (left side of the table)
df:drop('x') -- delete column
df:rename_column('x', 'y') -- rename column 'x' in 'y'

df:reset_column('my_col', 0) -- reset the given column with 0
df:fill_na('x', 0) -- replace missing values in 'x' column with 0
df:fill_all_na(0) -- replace all missing values with the value 0

df:unique('col_name') -- return table with unique values of the given column
df:unique('col_name', true) -- return table with unique values of the given column as keys

df:where('column_name','my_value') -- find the first row where the column has the given value

-- Customly update all rows filling the condition defined in first lambda
df:update(function(row) row['column'] == 'test' end,
          function(row) row['other_column'] = 'new_value' return row end)

Categorical variables

You can define categorical variables that will be treated internally as numbers ranging from 1 to n levels while displayed as strings. The numeric representation is retained when exporting to_tensor allowing a simpler understanding of a classifier's output:

df:as_categorical('my string column') -- converts a column to categorical
df:get_cat_keys('my string column') -- retreives the keys used to converts
df:to_categorical(Df_Array({1,2,1}), 'my string column') -- converts numbers to the categories

Subsetting

You can subset your data using:

df:head(20) -- print 20 first elements (10 by default)
df:tail(5) -- print 5 last elements (10 by default)
df:show() -- print 10 first and 10 last elements

df[13] -- returns a table with the row values
df["13:17"] -- returns a Dataframe with values in that span
df["13:"] -- returns a Dataframe with values starting from index 13
df[Df_Array(1,3,4)] -- returns a Dataframe with values index 1,3 and 4

Exporting

Finally, you can save your dataset to tensor (only numerical/categorical columns will be taken):

df:to_tensor{filename = './data/train.th7'} -- saves data
data = df:to_tensor{columns = Df_Array('first_column', 'my string column')} -- Converts the two columns into tensor

or to CSV:

df:to_csv('data.csv')

Batch loading

The Dataframe provides a built-in system for handling batch loading. It also has an extensive set of samplers that you can use. See API docs for more on which that are available.

The gist of it is:

  • The main Dataframe is initialized for batch loading via calling the create_subsets. This creates random subsets that have their own samplers. The default is a train 70%, validate 20%, and a test 10% split in the data but you can choose any split and any names.
  • Each subset is a separate dataframe subclass that has two columns, (1) indexes with the corresponding index in the main dataframe, (2) labels that some of the samplers require.
  • When you want to retrieve a batch from a subset you call the subset using my_dataframe:get_subset('train'):get_batch(30) or my_dataframe['/train']:get_batch(30).
  • The batch returned is also a subclass that has a custom to_tensor function that returns the data and corresponding label tensors. You can provide custom functions that will get the full row as an argument allowing you to use e.g. a filename that permits load an external resource.

A simple example:

local df = Dataframe('my_csv'):
	create_subsets()

local batch = df["/train"]:get_batch(10)
local data, label = batch:to_tensor{
	load_data_fn = my_image_loader
}

As of version 1.5 you may also want to consider using th iterators that integrate with the torchnet infrastructure. Take a look at the iterator API and the mnist example for how an implementation may look.

Tests

The package contains an extensive test suite and tries to apply a behavior driven development approach. All features should be accompanied by a test-case.

To launch the tests you need to install busted (See: Olivine-Labs/busted) via luarocks:

luarocks install busted

then you can run all tests via command line:

cd specs/
./run_all.sh

Documentation

The package relies on self-documenting functions via the argcheck package that reside in the doc folder. The GitHub Wiki is intended for more extensive in detail documentation.

To generate the documentation please run:

th doc.lua > /dev/null

Contributing

See CONTRIBUTING.md for further details.