Sometimes with reticulate
you import Python libraries that require using Pandas DataFrame and Series objects. Currently there is support for Numpy Arrays in reticulate
, but no support for these other objects.
feather
is a great package that allows you to write R tibble
s to disk, and read them into Python as DataFrames
, and vice versa, with type support. Combining reticulate
and feather
results in a way to more easily convert R objects to Pandas DataFrame and Series objects for use with other libraries you might import with reticulate
. It is actually surprisingly fast too!
You can install featherframe
from GitHub with:
# install.packages("devtools")
devtools::install_github("DavisVaughan/featherframe")
Note that you will have to ensure that any Python library you want to use (like in the examples below) is installed on your machine. You also will need to ensure that reticulate
is finding the correct version of Python, and how to do all of that is nicely outlined for you in their vignette. Personally I have set the RETICULATE_PYTHON
environment variable.
tibble
to Pandas DataFrame
and back.
library(dplyr)
library(featherframe)
data("iris")
# data.frame to tibble
iris <- iris %>%
as_tibble()
# tibble to Pandas DataFrame
iris_pd <- iris %>%
as_pandas_DataFrame()
iris_pd
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 0 5.1 3.5 1.4 0.2 setosa
#> 1 4.9 3.0 1.4 0.2 setosa
#> 2 4.7 3.2 1.3 0.2 setosa
#> 3 4.6 3.1 1.5 0.2 setosa
#> 4 5.0 3.6 1.4 0.2 setosa
#> 5 5.4 3.9 1.7 0.4 setosa
#> 6 4.6 3.4 1.4 0.3 setosa
#> 7 5.0 3.4 1.5 0.2 setosa
#> 8 4.4 2.9 1.4 0.2 setosa
#> 9 4.9 3.1 1.5 0.1 setosa
#> 10 5.4 3.7 1.5 0.2 setosa
#> 11 4.8 3.4 1.6 0.2 setosa
#> 12 4.8 3.0 1.4 0.1 setosa
#> 13 4.3 3.0 1.1 0.1 setosa
#> 14 5.8 4.0 1.2 0.2 setosa
#> 15 5.7 4.4 1.5 0.4 setosa
#> 16 5.4 3.9 1.3 0.4 setosa
#> 17 5.1 3.5 1.4 0.3 setosa
#> 18 5.7 3.8 1.7 0.3 setosa
#> 19 5.1 3.8 1.5 0.3 setosa
#> 20 5.4 3.4 1.7 0.2 setosa
#> 21 5.1 3.7 1.5 0.4 setosa
#> 22 4.6 3.6 1.0 0.2 setosa
#> 23 5.1 3.3 1.7 0.5 setosa
#> 24 4.8 3.4 1.9 0.2 setosa
#> 25 5.0 3.0 1.6 0.2 setosa
#> 26 5.0 3.4 1.6 0.4 setosa
#> 27 5.2 3.5 1.5 0.2 setosa
#> 28 5.2 3.4 1.4 0.2 setosa
#> 29 4.7 3.2 1.6 0.2 setosa
#> .. ... ... ... ... ...
#> 120 6.9 3.2 5.7 2.3 virginica
#> 121 5.6 2.8 4.9 2.0 virginica
#> 122 7.7 2.8 6.7 2.0 virginica
#> 123 6.3 2.7 4.9 1.8 virginica
#> 124 6.7 3.3 5.7 2.1 virginica
#> 125 7.2 3.2 6.0 1.8 virginica
#> 126 6.2 2.8 4.8 1.8 virginica
#> 127 6.1 3.0 4.9 1.8 virginica
#> 128 6.4 2.8 5.6 2.1 virginica
#> 129 7.2 3.0 5.8 1.6 virginica
#> 130 7.4 2.8 6.1 1.9 virginica
#> 131 7.9 3.8 6.4 2.0 virginica
#> 132 6.4 2.8 5.6 2.2 virginica
#> 133 6.3 2.8 5.1 1.5 virginica
#> 134 6.1 2.6 5.6 1.4 virginica
#> 135 7.7 3.0 6.1 2.3 virginica
#> 136 6.3 3.4 5.6 2.4 virginica
#> 137 6.4 3.1 5.5 1.8 virginica
#> 138 6.0 3.0 4.8 1.8 virginica
#> 139 6.9 3.1 5.4 2.1 virginica
#> 140 6.7 3.1 5.6 2.4 virginica
#> 141 6.9 3.1 5.1 2.3 virginica
#> 142 5.8 2.7 5.1 1.9 virginica
#> 143 6.8 3.2 5.9 2.3 virginica
#> 144 6.7 3.3 5.7 2.5 virginica
#> 145 6.7 3.0 5.2 2.3 virginica
#> 146 6.3 2.5 5.0 1.9 virginica
#> 147 6.5 3.0 5.2 2.0 virginica
#> 148 6.2 3.4 5.4 2.3 virginica
#> 149 5.9 3.0 5.1 1.8 virginica
#>
#> [150 rows x 5 columns]
# Pandas DataFrame to tibble
iris_tib <- iris_pd %>%
as_tibble()
iris_tib
#> # A tibble: 150 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.10 3.50 1.40 0.200 setosa
#> 2 4.90 3.00 1.40 0.200 setosa
#> 3 4.70 3.20 1.30 0.200 setosa
#> 4 4.60 3.10 1.50 0.200 setosa
#> 5 5.00 3.60 1.40 0.200 setosa
#> 6 5.40 3.90 1.70 0.400 setosa
#> 7 4.60 3.40 1.40 0.300 setosa
#> 8 5.00 3.40 1.50 0.200 setosa
#> 9 4.40 2.90 1.40 0.200 setosa
#> 10 4.90 3.10 1.50 0.100 setosa
#> # ... with 140 more rows
# It works!
identical(iris, iris_tib)
#> [1] TRUE
The original motivation for creating this mini-package is to use to pyfolio library from Quantopian right from R. It provides some neat financial analysis functions I was interested in testing, but requires you to pass in Pandas Series and DataFrame objects.
Specifically, I am going to demonstrate some functionality that requires a Pandas Series of daily returns.
library(reticulate)
library(featherframe)
library(tibbletime) # just for a test data set
library(lubridate)
library(dplyr)
# Use reticulate to import pyfolio
pyfolio <- import("pyfolio")
# Get the FB stock dataset from tibbletime
data(FB)
FB
#> # A tibble: 1,008 x 8
#> symbol date open high low close volume adjusted
#> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 FB 2013-01-02 27.4 28.2 27.4 28.0 69846400 28.0
#> 2 FB 2013-01-03 27.9 28.5 27.6 27.8 63140600 27.8
#> 3 FB 2013-01-04 28.0 28.9 27.8 28.8 72715400 28.8
#> 4 FB 2013-01-07 28.7 29.8 28.6 29.4 83781800 29.4
#> 5 FB 2013-01-08 29.5 29.6 28.9 29.1 45871300 29.1
#> 6 FB 2013-01-09 29.7 30.6 29.5 30.6 104787700 30.6
#> 7 FB 2013-01-10 30.6 31.5 30.3 31.3 95316400 31.3
#> 8 FB 2013-01-11 31.3 32.0 31.1 31.7 89598000 31.7
#> 9 FB 2013-01-14 32.1 32.2 30.6 31.0 98892800 31.0
#> 10 FB 2013-01-15 30.6 31.7 29.9 30.1 173242600 30.1
#> # ... with 998 more rows
# There is an issue with Dates becoming POSIXct so let's avoid that
FB <- mutate(FB, date = as_datetime(date))
# Calculate returns
FB_returns <- FB %>%
mutate(returns = adjusted / lag(adjusted) - 1) %>%
select(date, returns) %>%
na.omit()
# Convert to Pandas Series, specifying the index column
FB_returns_pd <- FB_returns %>%
as_pandas_Series(index = date)
FB_returns_pd
#> date
#> 2013-01-03 00:00:00+00:00 -0.008214
#> 2013-01-04 00:00:00+00:00 0.035650
#> 2013-01-07 00:00:00+00:00 0.022949
#> 2013-01-08 00:00:00+00:00 -0.012237
#> 2013-01-09 00:00:00+00:00 0.052650
#> 2013-01-10 00:00:00+00:00 0.023210
#> 2013-01-11 00:00:00+00:00 0.013419
#> 2013-01-14 00:00:00+00:00 -0.024275
#> 2013-01-15 00:00:00+00:00 -0.027464
#> 2013-01-16 00:00:00+00:00 -0.008306
#> 2013-01-17 00:00:00+00:00 0.009715
#> 2013-01-18 00:00:00+00:00 -0.015926
#> 2013-01-22 00:00:00+00:00 0.036076
#> 2013-01-23 00:00:00+00:00 0.002929
#> 2013-01-24 00:00:00+00:00 0.008436
#> 2013-01-25 00:00:00+00:00 0.014801
#> 2013-01-28 00:00:00+00:00 0.029486
#> 2013-01-29 00:00:00+00:00 -0.051740
#> 2013-01-30 00:00:00+00:00 0.014615
#> 2013-01-31 00:00:00+00:00 -0.008323
#> 2013-02-01 00:00:00+00:00 -0.040349
#> 2013-02-04 00:00:00+00:00 -0.054490
#> 2013-02-05 00:00:00+00:00 0.018854
#> 2013-02-06 00:00:00+00:00 0.014316
#> 2013-02-07 00:00:00+00:00 -0.013769
#> 2013-02-08 00:00:00+00:00 -0.003490
#> 2013-02-11 00:00:00+00:00 -0.010158
#> 2013-02-12 00:00:00+00:00 -0.031493
#> 2013-02-13 00:00:00+00:00 0.019730
#> 2013-02-14 00:00:00+00:00 0.021139
#> ...
#> 2016-11-17 00:00:00+00:00 0.012464
#> 2016-11-18 00:00:00+00:00 -0.006537
#> 2016-11-21 00:00:00+00:00 0.040591
#> 2016-11-22 00:00:00+00:00 -0.002464
#> 2016-11-23 00:00:00+00:00 -0.005187
#> 2016-11-25 00:00:00+00:00 -0.003807
#> 2016-11-28 00:00:00+00:00 0.000249
#> 2016-11-29 00:00:00+00:00 0.003820
#> 2016-11-30 00:00:00+00:00 -0.020270
#> 2016-12-01 00:00:00+00:00 -0.028036
#> 2016-12-02 00:00:00+00:00 0.002606
#> 2016-12-05 00:00:00+00:00 0.017591
#> 2016-12-06 00:00:00+00:00 -0.001022
#> 2016-12-07 00:00:00+00:00 0.005456
#> 2016-12-08 00:00:00+00:00 0.008139
#> 2016-12-09 00:00:00+00:00 0.006475
#> 2016-12-12 00:00:00+00:00 -0.015959
#> 2016-12-13 00:00:00+00:00 0.021567
#> 2016-12-14 00:00:00+00:00 -0.000831
#> 2016-12-15 00:00:00+00:00 0.002995
#> 2016-12-16 00:00:00+00:00 -0.005806
#> 2016-12-19 00:00:00+00:00 -0.005256
#> 2016-12-20 00:00:00+00:00 -0.001258
#> 2016-12-21 00:00:00+00:00 -0.000420
#> 2016-12-22 00:00:00+00:00 -0.013777
#> 2016-12-23 00:00:00+00:00 -0.001107
#> 2016-12-27 00:00:00+00:00 0.006310
#> 2016-12-28 00:00:00+00:00 -0.009237
#> 2016-12-29 00:00:00+00:00 -0.004875
#> 2016-12-30 00:00:00+00:00 -0.011173
#> Name: returns, Length: 1007, dtype: float64
# Use the rolling_volatility function from the current pip version of pyfolio
rolling_vol_pd <- pyfolio$timeseries$rolling_volatility(FB_returns_pd, rolling_vol_window = 10L)
rolling_vol_pd
#> date
#> 2013-01-03 00:00:00+00:00 NaN
#> 2013-01-04 00:00:00+00:00 NaN
#> 2013-01-07 00:00:00+00:00 NaN
#> 2013-01-08 00:00:00+00:00 NaN
#> 2013-01-09 00:00:00+00:00 NaN
#> 2013-01-10 00:00:00+00:00 NaN
#> 2013-01-11 00:00:00+00:00 NaN
#> 2013-01-14 00:00:00+00:00 NaN
#> 2013-01-15 00:00:00+00:00 NaN
#> 2013-01-16 00:00:00+00:00 0.425432
#> 2013-01-17 00:00:00+00:00 0.417229
#> 2013-01-18 00:00:00+00:00 0.403469
#> 2013-01-22 00:00:00+00:00 0.426051
#> 2013-01-23 00:00:00+00:00 0.415865
#> 2013-01-24 00:00:00+00:00 0.327421
#> 2013-01-25 00:00:00+00:00 0.314476
#> 2013-01-28 00:00:00+00:00 0.341510
#> 2013-01-29 00:00:00+00:00 0.420585
#> 2013-01-30 00:00:00+00:00 0.396567
#> 2013-01-31 00:00:00+00:00 0.396582
#> 2013-02-01 00:00:00+00:00 0.452145
#> 2013-02-04 00:00:00+00:00 0.523592
#> 2013-02-05 00:00:00+00:00 0.492085
#> 2013-02-06 00:00:00+00:00 0.501475
#> 2013-02-07 00:00:00+00:00 0.496638
#> 2013-02-08 00:00:00+00:00 0.481741
#> 2013-02-11 00:00:00+00:00 0.430293
#> 2013-02-12 00:00:00+00:00 0.389964
#> 2013-02-13 00:00:00+00:00 0.400238
#> 2013-02-14 00:00:00+00:00 0.431677
#> ...
#> 2016-11-17 00:00:00+00:00 0.276783
#> 2016-11-18 00:00:00+00:00 0.273903
#> 2016-11-21 00:00:00+00:00 0.346276
#> 2016-11-22 00:00:00+00:00 0.332978
#> 2016-11-23 00:00:00+00:00 0.331663
#> 2016-11-25 00:00:00+00:00 0.317450
#> 2016-11-28 00:00:00+00:00 0.306937
#> 2016-11-29 00:00:00+00:00 0.239519
#> 2016-11-30 00:00:00+00:00 0.257021
#> 2016-12-01 00:00:00+00:00 0.294442
#> 2016-12-02 00:00:00+00:00 0.285935
#> 2016-12-05 00:00:00+00:00 0.300280
#> 2016-12-06 00:00:00+00:00 0.201002
#> 2016-12-07 00:00:00+00:00 0.206177
#> 2016-12-08 00:00:00+00:00 0.212715
#> 2016-12-09 00:00:00+00:00 0.215868
#> 2016-12-12 00:00:00+00:00 0.229217
#> 2016-12-13 00:00:00+00:00 0.257645
#> 2016-12-14 00:00:00+00:00 0.232839
#> 2016-12-15 00:00:00+00:00 0.164264
#> 2016-12-16 00:00:00+00:00 0.172491
#> 2016-12-19 00:00:00+00:00 0.159185
#> 2016-12-20 00:00:00+00:00 0.159297
#> 2016-12-21 00:00:00+00:00 0.157991
#> 2016-12-22 00:00:00+00:00 0.168107
#> 2016-12-23 00:00:00+00:00 0.162598
#> 2016-12-27 00:00:00+00:00 0.146658
#> 2016-12-28 00:00:00+00:00 0.092915
#> 2016-12-29 00:00:00+00:00 0.092686
#> 2016-12-30 00:00:00+00:00 0.093274
#> Name: returns, Length: 1007, dtype: float64
# Go back to R now
rolling_vol_pd %>%
as_tibble() %>%
rename(returns_rolling_vol = returns)
#> # A tibble: 1,007 x 2
#> date returns_rolling_vol
#> <dttm> <dbl>
#> 1 2013-01-03 00:00:00 NA
#> 2 2013-01-04 00:00:00 NA
#> 3 2013-01-07 00:00:00 NA
#> 4 2013-01-08 00:00:00 NA
#> 5 2013-01-09 00:00:00 NA
#> 6 2013-01-10 00:00:00 NA
#> 7 2013-01-11 00:00:00 NA
#> 8 2013-01-14 00:00:00 NA
#> 9 2013-01-15 00:00:00 NA
#> 10 2013-01-16 00:00:00 0.425
#> # ... with 997 more rows
The idea is that we can wrap these functions up to take care of all the conversion for us. I believe this is essentially what the keras
package does with the Python keras
library, but in a more efficient way than using feather
.
pyfolio_rolling_volatility <- function(x, rolling_vol_window, ...) {
pyfolio <- import("pyfolio")
stopifnot(is.integer(rolling_vol_window))
as_pandas_Series(x, ...) %>%
pyfolio$timeseries$rolling_volatility(rolling_vol_window) %>%
as_tibble()
}
pyfolio_rolling_volatility(FB_returns, 10L, index = date)
#> # A tibble: 1,007 x 2
#> date returns
#> <dttm> <dbl>
#> 1 2013-01-03 00:00:00 NA
#> 2 2013-01-04 00:00:00 NA
#> 3 2013-01-07 00:00:00 NA
#> 4 2013-01-08 00:00:00 NA
#> 5 2013-01-09 00:00:00 NA
#> 6 2013-01-10 00:00:00 NA
#> 7 2013-01-11 00:00:00 NA
#> 8 2013-01-14 00:00:00 NA
#> 9 2013-01-15 00:00:00 NA
#> 10 2013-01-16 00:00:00 0.425
#> # ... with 997 more rows
data(FANG) # also from tibbletime
pd <- import("pandas")
FANG_pd <- FANG %>%
mutate(date = as_datetime(date)) %>%
as_pandas_DataFrame()
# Use Pandas to take monthly averages of every column, by symbol
# Set index as "date", group by "symbol", For each month and for each symbol...take a mean of every column
# reset_index turns the symbol and date indices back into columns so we can then convert back to R
FANG_pd$set_index("date")$groupby("symbol")$resample("M")$mean()$reset_index() %>%
as_tibble()
#> # A tibble: 192 x 8
#> symbol date open high low close volume adjusted
#> <chr> <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 AMZN 2013-01-31 00:00:00 269 272 265 268 4010781 268
#> 2 AMZN 2013-02-28 00:00:00 264 267 261 264 3613774 264
#> 3 AMZN 2013-03-31 00:00:00 265 268 262 266 2925285 266
#> 4 AMZN 2013-04-30 00:00:00 263 266 260 263 3496191 263
#> 5 AMZN 2013-05-31 00:00:00 262 265 260 263 2684114 263
#> 6 AMZN 2013-06-30 00:00:00 274 277 271 274 2928790 274
#> 7 AMZN 2013-07-31 00:00:00 298 301 295 299 3069859 299
#> 8 AMZN 2013-08-31 00:00:00 292 294 289 291 1989723 291
#> 9 AMZN 2013-09-30 00:00:00 304 307 301 305 2173440 305
#> 10 AMZN 2013-10-31 00:00:00 325 329 321 326 3360670 326
#> # ... with 182 more rows
Currently Date
columns are translated to datetime64[ns]
and then back as POSIXct
with a local-specific timezone. This is...undesirable and should be fixed on the feather
side of things.
ex <- tibble(date_col = as.Date("2018-01-01"))
ex_pd <- as_pandas_DataFrame(ex)
ex_pd
#> date_col
#> 0 2018-01-01
as_tibble(ex_pd)
#> # A tibble: 1 x 1
#> date_col
#> <dttm>
#> 1 2017-12-31 19:00:00
Ordered factors do not stay ordered (but remain factors) when going back and forth.
ex_ordered_fac <- tibble(fct = factor(c("low", "high", "med"),
levels = c("low", "med", "high"),
ordered = TRUE))
ex_ordered_fac
#> # A tibble: 3 x 1
#> fct
#> <ord>
#> 1 low
#> 2 high
#> 3 med
as_tibble(as_pandas_DataFrame(ex_ordered_fac))
#> # A tibble: 3 x 1
#> fct
#> <fct>
#> 1 low
#> 2 high
#> 3 med
Integer columns with NA
values are coerced to numeric (double) column because Pandas does not have support for NaN
values in integer columns.
ex_int <- tibble(int_col = c(1L, 2L, 3L))
ex_int
#> # A tibble: 3 x 1
#> int_col
#> <int>
#> 1 1
#> 2 2
#> 3 3
ex_int_NA <- tibble(int_col = c(1L, 2L, 3L, NA))
ex_int_NA
#> # A tibble: 4 x 1
#> int_col
#> <int>
#> 1 1
#> 2 2
#> 3 3
#> 4 NA
as_tibble(as_pandas_DataFrame(ex_int))
#> # A tibble: 3 x 1
#> int_col
#> <int>
#> 1 1
#> 2 2
#> 3 3
as_tibble(as_pandas_DataFrame(ex_int_NA))
#> # A tibble: 4 x 1
#> int_col
#> <dbl>
#> 1 1.00
#> 2 2.00
#> 3 3.00
#> 4 NA
Rownames are not preserved in the translation from R to Python.