Function Reference Guide

DataFrames

DataFrame(cols::Vector, colnames::Vector{ByteString})

Construct a DataFrame from the columns given by cols, with the column index built from colnames. A DataFrame inherits from Associative{Any,Any}, so Associative operations should work. Columns are vector-like objects. Normally these are AbstractDataVectors (DataVectors or PooledDataVectors), but they can also (currently) include standard Julia Vectors.

DataFrame(cols::Vector)

Construct a DataFrame from the columns given by cols with default column names.

DataFrame()

An empty DataFrame.

copy(df::DataFrame)

A shallow copy of df. Columns are referenced, not copied.

deepcopy(df::DataFrame)

A deep copy of df. Copies of each column are made.

similar(df::DataFrame, nrow)

A new DataFrame with nrow rows and the same column names and types as df.

Basics

size(df), ndims(df)

Same meanings as for Arrays.

has(df, key), get(df, key, default), keys(df), and values(df)

Same meanings as Associative operations. keys are column names; values are column contents.

start(df), done(df,i), and next(df,i)

Methods to iterate over columns.

ncol(df::AbstractDataFrame)

Number of columns in df.

nrow(df::AbstractDataFrame)

Number of rows in df.

length(df::AbstractDataFrame)

Number of columns in df.

isempty(df::AbstractDataFrame)

Whether the number of columns equals zero.

head(df::AbstractDataFrame) and head(df::AbstractDataFrame, i::Int)

The first i rows of df; i defaults to 6.

tail(df::AbstractDataFrame) and tail(df::AbstractDataFrame, i::Int)

The last i rows of df; i defaults to 6.

show(io, df::AbstractDataFrame)

Standard pretty-printer of df. Called by print() and the REPL.

dump(df::AbstractDataFrame)

Show the structure of df. Like R's str.

describe(df::AbstractDataFrame)

Show a description of each column of df.

complete_cases(df::AbstractDataFrame)

A Vector{Bool} with one entry per row of df that is true for complete cases (rows with no NAs).

duplicated(df::AbstractDataFrame)

A Vector{Bool} with one entry per row of df that is true for rows that duplicate an earlier row.

unique(df::AbstractDataFrame)

A DataFrame containing the unique rows of df.
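The three row predicates above can be illustrated without the package. The following is a minimal sketch in plain Julia, not the library's implementation: rows are modeled as tuples, `missing` stands in for NA, and the helper name `duplicated_mask` is hypothetical.

```julia
# Hypothetical table: each tuple is one row; `missing` plays the role of NA.
rows = [(1, "a"), (2, missing), (1, "a"), (3, "c")]

# Like complete_cases: true where a row has no missing entries.
complete = [!any(ismissing, r) for r in rows]

# Like duplicated: true where an identical row appeared earlier.
function duplicated_mask(rows)
    seen = Set{Any}()
    mask = Bool[]
    for r in rows
        push!(mask, r in seen)  # Set membership uses isequal, so missing compares cleanly
        push!(seen, r)
    end
    return mask
end

dup = duplicated_mask(rows)

# Like unique: keep only rows that are not duplicates of earlier rows.
uniq = rows[.!dup]
```

Both masks have one entry per row, so either can be used directly as a row selector.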

Indexing, Assignment, and Concatenation

DataFrames are indexed like a Matrix and like an Associative. Columns may be indexed by column name. Rows do not have names. Referencing with one argument normally indexes by columns: df["col"], df[["col1","col3"]] or df[i]. With two arguments, rows and columns are selected. Indexing along rows works like Matrix indexing. Indexing along columns works like Matrix indexing with the addition of column name access.

getindex(df::DataFrame, ind) or df[ind]

Returns a subset of the columns of df as specified by ind, which may be an Int, a Range, a Vector{Int}, ByteString, or Vector{ByteString}. Columns are referenced, not copied. For a single-element ind, the column by itself is returned.

getindex(df::DataFrame, irow, icol) or df[irow,icol]

Returns a subset of df as specified by irow and icol. irow may be an Int, a Range, or a Vector{Int}. icol may be an Int, a Range, a Vector{Int}, a ByteString, or a Vector{ByteString}. For a single-element icol, the column subset by itself is returned.

index(df::DataFrame)

Returns the column Index for df.

set_group(df::DataFrame, newgroup, names::Vector{ByteString})

get_groups(df::DataFrame)

set_groups(df::DataFrame, gr::Dict)

See the Index section for these operations on column indexes.

colnames(df::DataFrame) or names(df::DataFrame)

The column names as an Array{ByteString}.

setindex!(df::DataFrame, newcol, colname) or df[colname] = newcol

Replace the column named colname, or add it if it does not exist, with contents newcol. Arrays are converted to DataVectors. Values are recycled to match the number of rows in df.

insert!(df::DataFrame, index::Integer, item, name)

Insert a column of name name and with contents item into df at position index.

insert!(df::DataFrame, df2::DataFrame)

Insert the columns of df2 into df.

del!(df::DataFrame, cols)

Delete the columns of df given by cols. cols may use any of the forms by which columns can be referenced.

del(df::DataFrame, cols)

Nondestructive version. Return a DataFrame based on the columns in df after deleting columns specified by cols.

deleterows!(df::DataFrame, inds)

Delete rows at positions specified by inds from the given DataFrame.

cbind(df1, df2, ...) or hcat(df1, df2, ...) or [df1 df2 ...]

Concatenate columns. Duplicated column names are adjusted.

rbind(df1, df2, ...) or vcat(df1, df2, ...) or [df1, df2, ...]

Concatenate rows.

I/O

csvDataFrame(filename, o::Options)

Return a DataFrame from file filename. Options o include colnames ("true", "false", or "check" (the default)) and poolstrings ("check" (default) or "never").

Expression/Function Evaluation in a DataFrame

with(df::AbstractDataFrame, ex::Expr)

Evaluate expression ex with the columns in df.

within(df::AbstractDataFrame, ex::Expr)

Return a copy of df after evaluating expression ex with the columns in df.

within!(df::AbstractDataFrame, ex::Expr)

Modify df by evaluating expression ex with the columns in df.

based_on(df::AbstractDataFrame, ex::Expr)

Return a new DataFrame based on evaluating expression ex with the columns in df. Often used for summarizing operations.

colwise(f::Function, df::AbstractDataFrame)

colwise(f::Vector{Function}, df::AbstractDataFrame)

Apply f (or each function in the vector f) to each column of df, and return the results as an Array{Any}.

colwise(df::AbstractDataFrame, s::Symbol)

colwise(df::AbstractDataFrame, s::Vector{Symbol})

Apply the function specified by Symbol s to each column of df, and return the results as a DataFrame.

SubDataFrames

sub(df::DataFrame, r, c)

sub(df::DataFrame, r)

Return a SubDataFrame with references to rows and columns of df.

sub(sd::SubDataFrame, r, c)

sub(sd::SubDataFrame, r)

Return a SubDataFrame with references to rows and columns of sd.

getindex(sd::SubDataFrame, r, c) or sd[r,c]

getindex(sd::SubDataFrame, c) or sd[c]

Referencing should work the same as for DataFrames.

Grouping

groupby(df::AbstractDataFrame, cols)

Return a GroupedDataFrame based on unique groupings indicated by the columns with one or more names given in cols.

start(gd), done(gd,i), and next(gd,i)

Methods to iterate over GroupedDataFrame groupings.

getindex(gd::GroupedDataFrame, idx) or gd[idx]

Reference a particular grouping. Referencing returns a SubDataFrame.

with(gd::GroupedDataFrame, ex::Expr)

Evaluate expression ex with the columns in gd in each grouping.

within(gd::GroupedDataFrame, ex::Expr)

within!(gd::GroupedDataFrame, ex::Expr)

Return a DataFrame with the results of evaluating expression ex with the columns in gd in each grouping.

based_on(gd::GroupedDataFrame, ex::Expr)

Sweeps along groups and applies based_on to each group. Returns a DataFrame.

map(f::Function, gd::GroupedDataFrame)

Apply f to each grouping of gd and return the results in an Array.

colwise(f::Function, gd::GroupedDataFrame)

colwise(f::Vector{Function}, gd::GroupedDataFrame)

Apply f to each column in each grouping of gd, and return the results as an Array{Any}.

colwise(gd::GroupedDataFrame, s::Symbol)

colwise(gd::GroupedDataFrame, s::Vector{Symbol})

Apply the function specified by Symbol s to each column in each grouping of gd, and return the results as a DataFrame.

by(df::AbstractDataFrame, cols, s::Symbol) or groupby(df, cols) |> s

by(df::AbstractDataFrame, cols, s::Vector{Symbol})

Return a DataFrame with the results of grouping on cols and colwise evaluation based on s. Equivalent to colwise(groupby(df, cols), s).

by(df::AbstractDataFrame, cols, e::Expr) or groupby(df, cols) |> e

Return a DataFrame with the results of grouping on cols and evaluation of e in each grouping. Equivalent to based_on(groupby(df, cols), e).
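The grouping functions in this section follow a split, apply, combine pattern. Here is a minimal sketch of that pattern in plain Julia, assuming one grouping column and one value column held in ordinary Vectors. The helper `group_rows` and the data are hypothetical and do not reflect how GroupedDataFrame is implemented.

```julia
# Hypothetical data: a grouping column and a value column of equal length.
key   = ["a", "b", "a", "b", "a"]
value = [1.0, 2.0, 3.0, 4.0, 5.0]

# Split: collect the row positions of each distinct key, in order of first appearance.
function group_rows(key)
    groups = unique(key)
    positions = Dict(k => Int[] for k in groups)
    for (i, k) in enumerate(key)
        push!(positions[k], i)
    end
    return groups, positions
end

groups, positions = group_rows(key)

# Apply and combine: evaluate a summary within each group, one result row per group.
totals = [(k, sum(value[positions[k]])) for k in groups]
```

Iterating over `groups` and indexing with `positions[k]` corresponds loosely to iterating over a GroupedDataFrame and receiving one row subset per grouping.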

Reshaping / Merge

stack(df::DataFrame, cols)

For conversion from wide to long format. Returns a DataFrame with stacked columns indicated by cols. The result has column "key" with column names from df and column "value" with the values from df. Columns in df not included in cols are duplicated along the stack.

unstack(df::DataFrame, ikey, ivalue, irefkey)

For conversion from long to wide format. Returns a DataFrame. ikey indicates the key column: unique values in column ikey will be column names in the result. ivalue indicates the value column. irefkey is the column holding the unique identifier for each row of the result. Columns not given by ikey, ivalue, or irefkey are currently ignored.
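The wide and long layouts can be sketched in plain Julia as follows. This illustrates the reshaping idea only: the table is modeled as a Dict from column name to Vector, the helpers `stack_rows` and `unstack_rows` are hypothetical, and the row order the library produces may differ.

```julia
# Hypothetical wide table: column name => column contents.
wide = Dict("id" => [1, 2], "x" => [10, 20], "y" => [30, 40])

# Wide to long: one output row per (stacked column, input row).
# The id column is not stacked, so its value is repeated along the stack.
function stack_rows(wide, cols, idcol)
    return [(key = c, value = wide[c][i], id = wide[idcol][i])
            for c in cols for i in eachindex(wide[idcol])]
end

# Long to wide: each distinct key becomes a column, each distinct id a row.
function unstack_rows(long)
    ids = unique(r.id for r in long)
    out = Dict{String,Vector{Int}}("id" => ids)
    for k in unique(r.key for r in long)
        out[k] = [first(r.value for r in long if r.key == k && r.id == id) for id in ids]
    end
    return out
end

long = stack_rows(wide, ["x", "y"], "id")
roundtrip = unstack_rows(long)
```

In this sketch, unstacking the stacked table reproduces the original wide table, which is the sense in which the two operations are inverses.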

merge(df1::DataFrame, df2::DataFrame, bycol)

merge(df1::DataFrame, df2::DataFrame, bycol, jointype)

Return the database join of df1 and df2 based on the column bycol. Currently only a single merge key is supported. Supports jointype of "inner" (the default), "left", "right", or "outer".
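The four join types differ only in which unmatched rows they keep. A minimal sketch in plain Julia, assuming each table is a Vector of (key, payload) tuples and using `missing` where the library would use NA. The function `join_rows` is hypothetical and is not the package's merge.

```julia
# Hypothetical tables: each row is (key, payload).
left  = [(1, "a"), (2, "b"), (3, "c")]
right = [(2, "x"), (3, "y"), (4, "z")]

function join_rows(left, right, jointype::AbstractString = "inner")
    out = Tuple{Int,Union{String,Missing},Union{String,Missing}}[]
    matched_right = falses(length(right))
    for (k, lv) in left
        hits = findall(r -> r[1] == k, right)
        for j in hits
            push!(out, (k, lv, right[j][2]))
            matched_right[j] = true
        end
        # Left and outer joins keep left rows that found no match.
        if isempty(hits) && jointype in ("left", "outer")
            push!(out, (k, lv, missing))
        end
    end
    # Right and outer joins keep right rows that found no match.
    if jointype in ("right", "outer")
        for j in findall(.!matched_right)
            push!(out, (right[j][1], missing, right[j][2]))
        end
    end
    return out
end

inner = join_rows(left, right)
outer = join_rows(left, right, "outer")
```

With these inputs the inner join keeps keys 2 and 3, the left join adds key 1, the right join adds key 4, and the outer join keeps all four keys.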

Index

Index()

Index(s::Vector{ByteString})

An Index with names s. An Index is like an Associative type and is used for column indexing of DataFrames. It maps a ByteString or a Vector{ByteString} to column positions (indices).

length(x::Index), copy(x::Index), has(x::Index, key), keys(x::Index), push!(x::Index, name)

Normal meanings.

del(x::Index, idx::Integer), del(x::Index, s::ByteString)

Delete the name s or name at position idx in x.

names(x::Index)

A Vector{ByteString} with the names of x.

names!(x::Index, nm::Vector{ByteString})

Set names nm in x.

rename(x::Index, f::Function)

rename(x::Index, nd::Associative)

rename(x::Index, from::Vector, to::Vector)

Replace names in x, by applying function f to each name, by mapping old to new names with a dictionary (Associative), or using from and to vectors.

getindex(x::Index, idx) or x[idx]

This does the mapping from name(s) to Indices (positions). idx may be ByteString, Vector{ByteString}, Int, Vector{Int}, Range{Int}, Vector{Bool}, AbstractDataVector{Bool}, or AbstractDataVector{Int}.
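The name-to-position mapping can be sketched as a small type with one method per accepted form of idx. This is an illustrative model in plain Julia, assuming String in place of ByteString. The type `NameIndex` and the function `positions` are hypothetical names, not part of the package.

```julia
# Hypothetical stand-in for Index: a name => position lookup plus the ordered names.
struct NameIndex
    lookup::Dict{String,Int}
    names::Vector{String}
end

NameIndex(names::Vector{String}) =
    NameIndex(Dict(n => i for (i, n) in enumerate(names)), names)

# One method per kind of idx; every method resolves to column positions.
positions(x::NameIndex, name::AbstractString) = x.lookup[name]
positions(x::NameIndex, names::AbstractVector{<:AbstractString}) = [x.lookup[n] for n in names]
positions(x::NameIndex, i::Integer) = i
positions(x::NameIndex, mask::AbstractVector{Bool}) = findall(mask)
positions(x::NameIndex, idx::AbstractVector{<:Integer}) = collect(idx)

idx = NameIndex(["a", "b", "c"])
by_name  = positions(idx, "b")
by_names = positions(idx, ["c", "a"])
by_mask  = positions(idx, [true, false, true])
by_range = positions(idx, 2:3)
```

Resolving every form to integer positions first means the column lookup itself only ever has to handle integers.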

set_group(idx::Index, newgroup, names::Vector{ByteString})

Add a group to idx with name newgroup that includes the names in the vector names.

get_groups(idx::Index)

A Dict that maps the name of each group to the names in the group.

set_groups(idx::Index, gr::Dict)

Set groups in idx based on the mapping given by gr.

Missing Values

Missing value behavior is implemented by the concrete subtypes of the AbstractDataVector abstract type.

NA

A constant indicating a missing value.

isna(x)

Return a Bool or Array{Bool} (if x is an AbstractDataVector) that is true for elements with missing values.

nafilter(x)

Return a copy of x after removing missing values.

nareplace(x, val)

Return a copy of x after replacing missing values with val.

naFilter(x)

Return an object based on x such that future operations like mean will not include missing values. This can be an iterator or other object.

naReplace(x, val)

Return an object based on x such that future operations like mean will replace NAs with val.
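The difference between the eager functions (nafilter, nareplace) and the lazy ones (naFilter, naReplace) can be illustrated with rough analogs from current Julia Base, where `missing` plays the role of NA. These calls are not the functions documented here; they only show the same copy versus wrapper distinction.

```julia
x = [1.0, missing, 3.0]

# Eager, like nafilter: a new vector with the missing entries removed.
filtered = collect(skipmissing(x))

# Eager, like nareplace: a new vector with missing entries replaced by a value.
replaced = coalesce.(x, 0.0)

# Lazy, like naFilter: a wrapper that skips missing entries when a reduction runs.
lazy = skipmissing(x)
total = sum(lazy)
```

The lazy wrapper allocates no copy of x, which matters when the only goal is a single reduction such as a sum or mean.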

na(x)

Return an NA value appropriate for the type of x.

nas(x, dim)

Return an object like x filled with NA values with size dim.

DataVectors

DataArray(x::Vector)

DataArray(x::Vector, m::Vector{Bool})

Create a DataVector from x, with m optionally indicating which values are NA. DataVectors are like Julia Vectors with support for NAs. x may be any type of Vector.

PooledDataArray(x::Vector)

PooledDataArray(x::Vector, m::Vector{Bool})

Create a PooledDataVector from x, with m optionally indicating which values are NA. A PooledDataVector stores a pool of distinct values plus references into that pool, one per element. This makes it useful in much the same way as a factor in R.
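The pool-and-references layout can be sketched in a few lines of plain Julia. This models the storage idea only, under the assumption that references are 1-based positions into the pool. The helper `pool_values` is hypothetical and NA handling is left out.

```julia
# Split a vector into its distinct values (the pool) and one reference per element.
function pool_values(x::AbstractVector)
    pool = unique(x)                                    # distinct values, first-seen order
    lookup = Dict(v => i for (i, v) in enumerate(pool)) # value => position in the pool
    refs = [lookup[v] for v in x]
    return pool, refs
end

pool, refs = pool_values(["lo", "hi", "lo", "lo", "hi"])

# Indexing the pool with the references reconstructs the original vector.
restored = pool[refs]
```

When a column has few distinct values, storing small integer references plus one copy of each value takes less memory than storing every value in full.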

size, length, ndims, getindex, setindex!, start, next, done

All normal Vector operations including array referencing should work.

isna(x), nafilter(x), nareplace(x, val), naFilter(x), naReplace(x, val)

All NA-related methods are supported.

Utilities

cut(x::Vector, breaks::Vector)

Returns a PooledDataVector with length equal to x that divides values in x based on the divisions given by breaks.
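Binning by break points can be sketched with a sorted search. The text above does not say whether intervals are closed on the left or the right, so this sketch assumes sorted breaks and right-closed intervals (as in R's cut). The helper `bin_index` is hypothetical and returns a bin number rather than a pooled label.

```julia
# With sorted breaks b, a value v lands in bin k when b[k-1] < v <= b[k].
# Values above the last break land in bin length(b) + 1.
# Right-closed intervals are an assumption, not documented behavior.
bin_index(v, breaks) = searchsortedfirst(breaks, v)

x = [1, 5, 10, 12]
breaks = [5, 10]
bins = [bin_index(v, breaks) for v in x]
```

Two breaks produce three bins here: values up to 5, values above 5 up to 10, and values above 10.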

Formulas and Models

Formula(ex::Expr)

Return a Formula object based on ex. Formulas are two-sided expressions separated by ~, like :(y ~ w*x + z + i&v).

model_frame(f::Formula, d::AbstractDataFrame)

model_frame(ex::Expr, d::AbstractDataFrame)

Return a ModelFrame built from formula f (or expression ex) and the data in d.

model_matrix(mf::ModelFrame)

model_matrix(f::Formula, d::AbstractDataFrame)

model_matrix(ex::Expr, d::AbstractDataFrame)

Return a ModelMatrix built from mf, from f and d, or from ex and d.

lm(ex::Expr, df::AbstractDataFrame)

Linear model results (type OLSResults) based on formula ex and df.