# `byrow()` in InMemoryDatasets.jl

### `byrow` is a high-performance, multi-threaded function for row-wise operations. It is designed to make tasks such as summing each row simple and fast.
### In short, `byrow` applies the function we specify to each row of a data set, which is where its name comes from.
### Let's start with an example. First we need to load the package.

In [2]:
using InMemoryDatasets

In [None]:
ds1 = Dataset(var1 = [1, -2, 4], var2 = [-1, 3, 5], var3 = [2, -4, 1])

### Suppose we want the sum of each row, which is a common task. The first thing that comes to mind is the `sum` function, but `sum` only works on a matrix or an array, so we would first have to convert the data set into a matrix.

In [None]:
sum(Matrix(ds1),dims = 2)

### With `byrow`, we can apply `sum` to each row directly.

In [None]:
byrow(ds1, sum,1:3)

### This saves time and memory, because the data set never has to be converted to a matrix.
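### For reference, the row sums of `ds1` can be checked in plain Julia without the package. This is only a sketch of the matrix route described above: the matrix `M` is written out by hand and stands in for `Matrix(ds1)`.

```julia
# Rows of ds1 written by hand; M stands in for Matrix(ds1).
M = [ 1 -1  2;
     -2  3 -4;
      4  5  1]

# sum with dims = 2 collapses the columns and returns a 3×1 matrix.
row_sums = sum(M, dims = 2)

# vec turns the 3×1 matrix into a plain vector of row sums.
row_sums_vec = vec(row_sums)   # [2, -3, 10]
```

### The `byrow` call above should give the same three values, as a vector and without the intermediate matrix.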

### Since each column has its own name, we can also refer to the columns by name:

In [None]:
byrow(ds1,sum,[:var1, :var2, :var3])

## Optimised operations

### The general form is `byrow(ds, fun, cols; by, threads)`: the data set, the function to apply, the columns to select, and two optional keyword arguments, `by` and `threads`. By default `threads` is `true`.
### The `by` keyword specifies a function to call on each value before the results are aggregated, and `threads = true` lets `byrow` use all the cores available to Julia for the computation.
### Here is an example.

In [3]:
ds2 = Dataset(g = [1, 1, 1, 2, 2],
                    x1_int = [0, 0, 1, missing, 2],
                    x2_int = [3, 2, 1, 3, -2],
                    x1_float = [1.2, missing, -1.0, 2.3, 10],
                    x2_float = [missing, missing, 3.0, missing, missing],
                    x3_float = [missing, missing, -1.4, 3.0, -100.0])

5×6 Dataset
 Row │ g         x1_int    x2_int    x1_float  x2_float  x3_float
     │ identity  identity  identity  identity  identity  identity
     │ Int64?    Int64?    Int64?    Float64?  Float64?  Float64?
─────┼────────────────────────────────────────────────────────────
   1 │        1         0         3       1.2   missing   missing
   2 │        1         0         2   missing   missing   missing
   3 │        1         1         1      -1.0       3.0      -1.4
   4 │        2   missing         3       2.3   missing       3.0
   5 │        2         2        -2      10.0   missing    -100.0


### `missing` represents a missing value: we don't know the actual value and we must not make one up, so we use `missing` as a placeholder.

### Let's say we want the average of the absolute values of all numbers in each row.
### Start with the plain average. That is simple: just use the `mean` function.

In [4]:
byrow(ds2,mean)

5-element Vector{Float64}:
   1.3
   1.0
   0.6
   2.575
 -17.6

### You may have noticed that the earlier calls all had at least three arguments, while this one has only two. That is because we want every value in the row: when no columns are given, `byrow` uses all of them by default. So the call above is equivalent to the following two.

In [7]:
byrow(ds2,mean,1:6)

5-element Vector{Float64}:
   1.3
   1.0
   0.6
   2.575
 -17.6

### A single colon also selects all columns:

In [8]:
byrow(ds2,mean,:)

5-element Vector{Float64}:
   1.3
   1.0
   0.6
   2.575
 -17.6

### Next we need to take the absolute value of each number before averaging, which is what the `by` keyword is for.

In [9]:
byrow(ds2,mean,by = abs)

5-element Vector{Float64}:
  1.3
  1.0
  1.4000000000000001
  2.575
 23.2

### Setting `by = abs` means that `abs` is applied to each value first, and `mean` is then applied to each row of the results. That solves the problem in a single call.
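### To make the order of operations concrete, here is a plain-Julia sketch that reproduces the numbers above without the package. The helper `rowmean` is hypothetical and is not how `byrow` is implemented; it assumes, as the outputs above suggest, that missing values are skipped when averaging. The columns of `ds2` are written out by hand.

```julia
# Columns of ds2 as plain vectors, kept in a tuple so each keeps its own type.
cols = (
    [1, 1, 1, 2, 2],                             # g
    [0, 0, 1, missing, 2],                       # x1_int
    [3, 2, 1, 3, -2],                            # x2_int
    [1.2, missing, -1.0, 2.3, 10],               # x1_float
    [missing, missing, 3.0, missing, missing],   # x2_float
    [missing, missing, -1.4, 3.0, -100.0],       # x3_float
)

# Hypothetical helper: apply `by` to every non-missing value in row i,
# then average the transformed values.
function rowmean(cols, i; by = identity)
    vals = [by(c[i]) for c in cols if !ismissing(c[i])]
    return sum(vals) / length(vals)
end

plain_means = [rowmean(cols, i) for i in 1:5]            # like byrow(ds2, mean)
abs_means   = [rowmean(cols, i; by = abs) for i in 1:5]  # like byrow(ds2, mean, by = abs)
```

### Up to floating-point rounding, `plain_means` and `abs_means` match the two outputs shown above.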

### Sometimes we want to find interesting features of a data set, for example whether it contains any special values. `byrow` can help here too. For example, to find the rows whose values in the first three columns are all greater than 0, we can use the following code:

In [None]:
byrow(ds2, all, 1:3, by = x->isless(0,x))

### The result shows that the third and fourth rows are the ones we want.

### Note that in Julia `isless` treats `missing` as greater than any other value, so `isless(0, missing)` is `true`. That is why the fourth row qualifies even though `x1_int` is missing there.
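### The behaviour of `isless` with `missing` can be checked in plain Julia. The sketch below also repeats the row test on the first three columns of `ds2`, written out by hand; the names `positive`, `strictly_positive`, `rows_ok` and `rows_strict` are only illustrative.

```julia
a = isless(0, missing)   # true: isless orders missing after every other value
b = isless(missing, 0)   # false
c = 0 < missing          # missing: the ordinary comparison propagates missing

# First three columns of ds2.
g      = [1, 1, 1, 2, 2]
x1_int = [0, 0, 1, missing, 2]
x2_int = [3, 2, 1, 3, -2]

# Same predicate as in the byrow call above.
positive = x -> isless(0, x)
rows_ok = [all(positive, (g[i], x1_int[i], x2_int[i])) for i in 1:5]
# rows_ok == [false, false, true, true, false]

# A stricter predicate that does not count missing as positive.
strictly_positive = x -> !ismissing(x) && x > 0
rows_strict = [all(strictly_positive, (g[i], x1_int[i], x2_int[i])) for i in 1:5]
# rows_strict == [false, false, true, false, false]
```

### If missing values should not count as greater than zero, a predicate like `strictly_positive` is one option to pass as `by`.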

### Similarly, to find the rows that contain at least one missing value in any column, we can use the following code:

In [None]:
byrow(ds2,any,by=ismissing)

### `byrow` has many other optimised operations. To learn how to use any of them, ask Julia for help:

In [15]:
?byrow(select)

```
byrow(ds::AbstractDataset, select, cols; [with, threads])
```

Select value of `with` among `cols`. The `with` must be a vector of column names(`Symbol` or `String`) or column index (relative to column position in `cols`) or a column name which contains this information.

For heterogeneous column types, `byrow` use `promote_type` for the output. If the column select doesn't exist among `cols`, `byrow` returns `missing`.

Passing `threads = false` disables multithreaded computations.

See [`byrow(findfirst)`](@ref), [`byrow(findlast)`](@ref)

# Examples

```jldoctest
julia> ds = Dataset(x1 = [1,2,3,4],
            x2 = [1.5,6.5,3.4,2.4],
            x3 = [true, false, true, false],
            y1 = ["x2", "x1", missing, "x2"],
            y2 = [:x2, :x1, missing, :x2],
            y3 = [3,1,1,2])
4×6 Dataset
 Row │ x1        x2        x3        y1        y2        y3
     │ identity  identity  identity  identity  identity  identity
     │ Int64?    Float64?  Bool?     String?   Symbol?   Int64?
─────┼────────────────────────────────────────────────────────────
   1 │        1       1.5      true  x2        x2               3
   2 │        2       6.5     false  x1        x1               1
   3 │        3       3.4      true  missing   missing          1
   4 │        4       2.4     false  x2        x2               2

julia> byrow(ds, select, 1:2, with = :y1)
4-element Vector{Union{Missing, Float64}}:
 1.5
 2.0
  missing
 2.4

julia> byrow(ds, select, [2,1,3], with = :y3)
4-element Vector{Union{Missing, Float64}}:
 1.0
 6.5
 3.4
 4.0

julia> byrow(ds, select, [2,1,3], with = [3,1,1,2])
4-element Vector{Union{Missing, Float64}}:
 1.0
 6.5
 3.4
 4.0

julia> ds = Dataset(x1 = [1,2,2], x2 = [5,6,7], x3 = [8,9,10])
3×3 Dataset
 Row │ x1        x2        x3
     │ identity  identity  identity
     │ Int64?    Int64?    Int64?
─────┼──────────────────────────────
   1 │        1         5         8
   2 │        2         6         9
   3 │        2         7        10

julia> byrow(ds, select, :, with = byrow(ds, findfirst, :, by = isodd))
3-element Vector{Union{Missing, Int64}}:
 1
 9
 7
```


### The help shows the general syntax of `byrow(select)`, an explanation of each argument, and several examples. We can try it on our own data set:

In [14]:
byrow(ds2,select, 4:6,with = [1,2,3,3,1])

5-element Vector{Union{Missing, Float64}}:
  1.2
   missing
 -1.4
  3.0
 10.0
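### What `select` does in this call can be mimicked in plain Julia: for each row, `with` gives the position (within the selected columns) of the value to pick. This is only a sketch of the idea, with the three float columns of `ds2` written out by hand; the name `picked` is illustrative.

```julia
# The selected columns 4:6 of ds2, in order.
x1_float = [1.2, missing, -1.0, 2.3, 10]
x2_float = [missing, missing, 3.0, missing, missing]
x3_float = [missing, missing, -1.4, 3.0, -100.0]
cols = (x1_float, x2_float, x3_float)

# For row i, take the value from the column at position with[i].
with = [1, 2, 3, 3, 1]
picked = [cols[with[i]][i] for i in eachindex(with)]
# picked is [1.2, missing, -1.4, 3.0, 10.0]
```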

## User defined operations

### In some cases the built-in operations may not cover what you need. You can then write your own function and pass it to `byrow`.
### The only requirement is that the function returns a single value. `byrow` treats each row as a vector, so the function must accept a vector and return a single value.

### Our function is:

### Then we pass it to `byrow`:

### That gives us the result. `byrow` also lets us pass the columns as a tuple, so we can rewrite our function to work with that form:

### And then pass it to `byrow`:

In [None]:
byrow(ds3, f2, (:var1,:var2,:var3))
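
### As an illustration of the contract described in this section (one row in as a vector, one value out), here is a minimal sketch in plain Julia. The function `count_positive` and the hand-written `rows` are hypothetical stand-ins, not the `f2` and `ds3` used above.

```julia
# Hypothetical row function: takes one row as a vector and returns one value,
# here the number of non-missing values greater than zero.
count_positive(row) = count(x -> !ismissing(x) && x > 0, row)

# Rows of ds1 from the start of this tutorial, written by hand.
rows = [[1, -1, 2], [-2, 3, -4], [4, 5, 1]]

# Apply the function to each row, one row at a time.
counts = [count_positive(r) for r in rows]   # [2, 1, 3]
```

### A function of this shape is the kind that can be handed to `byrow` as its second argument.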

## Special operations

### `byrow` also supports a few optimised operations that return a vector of values for each row, rather than a single value.

### Because of this, the main difference from the previous operations is that these return a data set in which each row has been updated by the operation.

In [None]:
ds4 = Dataset(x1 = [missing, 2, 1], x2 = [1, missing, missing], y = [4,5,3])

### Sometimes we want the cumulative sum of each row. For that we can use `cumsum`:

In [None]:
byrow(ds4, cumsum, :)

### Unlike the previous operations, `cumsum` returns a data set here. Another example should make this clearer.

### Suppose we want to fill all the missing values in the first and second columns using the third column. We can use `fill!` for this.
### In other words, the missing values are filled with the values from column `y`.

In [None]:
ds5 = byrow(ds4, fill!, 1:2, with = :y)

In [None]:
ds4

### The important point is that operations ending in `!` modify the original data set and return it, while operations without `!` return a modified copy and leave the original untouched. This is the usual naming convention for functions in Julia.
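
### The effect of filling from `y`, and the difference between the two styles, can be mimicked on plain vectors. This sketch uses Base's `coalesce` rather than `byrow`, so it only illustrates the idea; the vectors are the columns of `ds4` written out by hand.

```julia
x1 = [missing, 2, 1]
x2 = [1, missing, missing]
y  = [4, 5, 3]

# Without !: build new vectors and leave x1 and x2 untouched.
x1_filled = coalesce.(x1, y)         # [4, 2, 1]
x2_filled = coalesce.(x2, y)         # [1, 5, 3]
still_missing = any(ismissing, x1)   # true: x1 itself was not changed

# In the style of a ! function: overwrite x1 in place.
x1 .= coalesce.(x1, y)
now_filled = !any(ismissing, x1)     # true: x1 itself was updated
```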

### In some cases we want to convert a column to another type. For example, to change column `x1` in `ds4` from `Int` to `Float64`, we can combine `byrow` with `modify!`.

In [None]:
modify!(ds4, :x1 =>byrow(Float64))

### `modify!` modifies columns of a data set in place. When a single column is passed to `byrow`, as here, `modify!` replaces that column with the result. To learn more about `modify!`, ask Julia for help by typing `?modify!`.

In [None]:
?modify!

### As you can see, there is a lot of useful information, including examples, to help you understand this function.

### This should give you a good overall understanding of `byrow`. If you still have questions, you can consult the official InMemoryDatasets documentation at any time. Thank you for watching this video, and see you next time.