
Implement data passing functions #19

Closed
EntilZha opened this issue Mar 23, 2015 · 8 comments

@EntilZha
Owner

So far the only way to ingest data into ScalaFunctional is to pass in Python data structures built beforehand. It would be helpful to be able to read directly from data formats such as JSON/SQL/CSV.

The target milestone for completing everything is 0.4.0.

This issue will serve as the parent issue for implementing each specific function; a sketch of the target API follows the list below.

Child issues:
#34 seq.open
#35 seq.range
#36 seq.csv
#37 seq.jsonl
#29 seq.json
#30 to_json
#31 to_csv
#32 to_file
#33 to_jsonl
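
For orientation, here is a rough sketch of what the target API might look like once the child issues above land (the entrypoint names are taken from the issue list; none of them exist yet, and the filenames are illustrative):

from functional import seq

seq.range(10)               # 35: like the builtin range
seq.open('events.log')      # 34: lines of a file
seq.csv('people.csv')       # 36: parsed CSV rows
seq.json('people.json')     # 29: parsed JSON
seq.jsonl('events.jsonl')   # 37: one JSON document per line

# Write-side counterparts (#30-#33) as actions on a sequence
seq.csv('people.csv').to_json('people.json')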

@EntilZha EntilZha self-assigned this Mar 23, 2015
@EntilZha EntilZha modified the milestones: 0.3.0, 0.4.0 Mar 23, 2015
@EntilZha EntilZha removed their assignment Apr 24, 2015
@ChuyuHsu
Contributor

Hi @EntilZha,
I really like this project.
Have you stopped updating it, or do you have a better alternative?

@EntilZha
Owner Author

I haven't stopped using the package. I actually still use it quite a lot, so most of the improvements/additions happen when I feel something is missing and just add it. It's been working fairly well for me, although recently I have been thinking about adding something to help with opening/closing files, since I seem to be doing that a lot.

If you have ideas/suggestions for things you think are important, definitely let me know. The project is definitely not dead; it has just reached a place where it is actually working pretty well.

@EntilZha
Owner Author

Here is what I am thinking of doing:

  1. The general abstraction is having multiple input streams/entrypoints instead of only seq.
  2. I want to keep the import down to only from functional import seq, rather than requiring a separate import for each type of input stream or importing a stream module. That is, I want to avoid needing from functional import seq; from functional import streams; streams.json("").map....
  3. This can be resolved by setting attributes on seq so that streams can be accessed via seq.json or seq.range.
  4. However, the streams will be implemented in a separate module so that they are still importable separately or all together. This probably also means moving seq into that same module, which is a more logical place for it than where it currently lives anyway. I need to decide whether to preserve from functional.chain import seq; I am inclined not to, since it is not the official way to import and the package is technically still pre-1.0.
  5. To start, I think a good list of stream sources is: csv, json, reading lines from a file, reading an entire file and splitting on a delimiter, and things like range. A sketch of the resulting import surface follows this list.
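
To make points 2-4 concrete, a quick sketch of the two import surfaces (module and function names mirror the plan above and are not final):

# What I want to avoid: a second import just for the streams module
from functional import streams
streams.json('people.json').map(lambda record: record)

# What I want instead: one import, with streams hanging off seq as attributes
from functional import seq
seq.json('people.json').map(lambda record: record)
seq.range(10).sum()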

@EntilZha
Owner Author

seq.open (and equivalently streams.open) has been implemented and tested in 207b42b
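
For reference, a small usage sketch (assuming, per the plan above, that seq.open reads a file and produces a sequence of its lines; the filename is illustrative):

from functional import seq

# Take the first five stripped lines of a file
seq.open('server.log').map(lambda line: line.strip()).take(5)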

@ChuyuHsu
Contributor

@EntilZha, I have thought about it.
From the OOP perspective, the single responsibility principle is the reason the factory pattern exists.
Suppose you implemented the data-reading methods directly in seq. Then you would have to keep modifying and re-testing the seq code every time you add a new data source.
I think that is the reason an rdd is usually created by a factory, SparkContext, and a pandas.DataFrame is created by pd.read_table, etc.
Even the Scala object definition is usually understood as a "factory".

But I can understand the effort you are making to eliminate multiple imports.
If the list of stream sources stays short, that will be fine. In the long term, however, maintaining seq will be painful.
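
To make the factory alternative concrete, a rough sketch (FunctionalContext is a hypothetical name used only for illustration, analogous to SparkContext):

class FunctionalContext:
    # Each data source gets its own reader on the factory, so adding a
    # new source means adding a method here rather than touching seq.
    def open(self, path):
        pass  # read lines from path and return a sequence

    def csv(self, path):
        pass  # parse CSV rows from path and return a sequence

fc = FunctionalContext()
fc.csv('people.csv')  # analogous to sc.textFile(...) producing an RDD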

@EntilZha
Owner Author

I think there is a slight confusion (if not, I would be happy to be corrected). The definition of the seq function stays the same. However, since functions are objects in Python, I can set attributes on them, so I am setting attributes on seq that point to the functions implementing these other stream operations. Effectively this creates a convenient alias. The short version looks like the code below, but the specific code implementing this is here: https://github.com/EntilZha/ScalaFunctional/blob/master/functional/streams.py

def seq(input):
    # Implementation of seq goes here
    pass

def open(input):
    # Implementation of open goes here
    pass

# Functions are objects, so open can be attached as an attribute of seq
seq.open = open

# In code using functional
from functional import seq
seq.open('filename').....
seq(regular_input).....

I am currently working on implementations of the stream functions, which do the necessary preprocessing and then hand the resulting ordinary Python sequence to seq to turn into a functional.pipelines.Sequence.
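
For example, a stream function for CSV might look roughly like this (a sketch under the assumption that it parses the file and delegates to seq; this is not the actual implementation):

import csv as csv_module

from functional import seq

def csv(csv_file):
    # Preprocessing: parse the file into a plain list of rows
    with open(csv_file) as handle:
        rows = list(csv_module.reader(handle))
    # Hand the ordinary Python sequence to seq to build a Sequence
    return seq(rows)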

I don't know enough about pandas, but at least for Spark (I think) it's mostly that SparkContext carries lots of information about the execution context, which isn't as applicable here.

@ChuyuHsu
Contributor

Okay, got it.
That is beautiful.

P.S. By the Spark part I mentioned earlier, I actually meant that sc.textFile("/path/to/file") creates an RDD, rather than the RDD creating itself.

@EntilZha
Copy link
Owner Author

EntilZha commented Nov 1, 2015

Now closing this since all of its child issues have been implemented and closed. With this resolved, I am getting very close to releasing 0.4.0 after working on the documentation a bit.
