
Integration of hdt-rdf compression module #81

Merged

merged 7 commits into SANSA-Stack:develop on May 10, 2019

Conversation

@abakarboubaa (Contributor) commented on May 4, 2019:

I have created the hdt package as requested, added the TripleOps class, and submitted the corresponding test cases.

@GezimSejdiu self-requested a review on May 6, 2019.

@GezimSejdiu (Member) left a comment:

Hi @abakarboubaa,

many thanks for your contribution. I went through it and left some comments. Please try to resolve them and let me know in case you need more info.
PS: The PR is failing, mostly due to scalastyle violations. Try to resolve those as well.

Best regards,


*/

/*
--input
@GezimSejdiu (Member), May 6, 2019:

Please, remove these comments from the class.


* @return Returns DataFrame [Subject,Object,Predicate]
*/
def convertRDDGraphToDF(triple: RDD[graph.Triple]) = {
logger.info("Converting RDD[Graph] to DataFrame[Row].")
@GezimSejdiu (Member), May 6, 2019:

No need to have this log here.


@GezimSejdiu (Member), May 6, 2019:

And remove all the remaining logs below as well.
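With the logging stripped, the conversion reduces to a one-liner. A minimal sketch, mirroring the Row mapping of the dictionary builders in this PR (the tripleSchema name is assumed here, not taken from the diff):

def convertRDDGraphToDF(triple: RDD[graph.Triple]): DataFrame =
  spark.createDataFrame(
    triple.map(t => Row(t.getSubject.toString, t.getPredicate.toString, t.getObject.toString)),
    tripleSchema) // StructType with Subject/Predicate/Object string columns; name assumed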


logger.info(s"Rdd[Graph] is created. ${rddGraph.count()}")

val triplesDF = convertRDDGraphToDF(rddGraph)
triplesDF.printSchema()
@GezimSejdiu (Member), May 6, 2019:

No need to print.


triplesDF.cache()

objectDF = getDistinctObjectDictDF(rddGraph)
objectDF.printSchema()
@GezimSejdiu (Member), May 6, 2019:

No need to print. Remove it.


* @param outputDir Path to be written
* @param mode SaveMode of Write
*/
def saveAsCSV(outputDir: String, mode: SaveMode): Unit =
@GezimSejdiu (Member), May 6, 2019:

Um, why do we need to save them as a CSV? Aren't they just indexes, which could be kept in any file format (Parquet files, Hadoop files)?


@abakarboubaa (Contributor, Author), May 7, 2019:

That is a future enhancement, to support various target formats and compression types.
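A format-agnostic writer along the lines of the reviewer's suggestion could look like this sketch. The saveAsFormat name and the sub-directory layout are illustrative, not part of the PR; the DataFrames are assumed to be the ones built by TripleOps:

def saveAsFormat(outputDir: String, format: String, mode: SaveMode): Unit = {
  triplesDF.write.mode(mode).format(format).save(s"$outputDir/triples")
  subjectDF.write.mode(mode).format(format).save(s"$outputDir/subjects")
  objectDF.write.mode(mode).format(format).save(s"$outputDir/objects")
  predicateDF.write.mode(mode).format(format).save(s"$outputDir/predicates")
}

// e.g. saveAsFormat("/data/compressed", "parquet", SaveMode.Overwrite)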


val logger = LoggerFactory.getLogger(TripleOps.getClass)


def main(args: Array[String]) {
@GezimSejdiu (Member), May 6, 2019:

Please, remove the main method from here.


@abakarboubaa reopened this on May 7, 2019.

@abakarboubaa (Contributor, Author) commented on May 7, 2019:

The code has been updated as mentioned. I have also updated the code to conform to scalastyle.


* @return DataFrame Subject dictionary of [index,subject]
*/
def getDistinctSubjectDictDF(triples: RDD[graph.Triple]): DataFrame = {
  spark.createDataFrame(triples.map(_.getSubject.toString()).distinct().zipWithIndex().map(t => Row(t._1, t._2)), dictionarySchema).cache()
@GezimSejdiu (Member), May 8, 2019:

Do we need to persist/cache them in advance? I think it would be good to remove caching as the default approach and leave it to the users to decide what is relevant for their use case.
So, remove .cache() from every method where it was added; caching can then be applied by the user.
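Dropping the default caching makes the builder side-effect free and leaves the decision to the caller. A minimal sketch, reusing the spark session, Row mapping, and dictionarySchema already in this PR:

def getDistinctSubjectDictDF(triples: RDD[graph.Triple]): DataFrame =
  spark.createDataFrame(
    triples.map(_.getSubject.toString()).distinct().zipWithIndex().map(t => Row(t._1, t._2)),
    dictionarySchema)

// Callers opt in to caching where it pays off:
// val subjectDF = getDistinctSubjectDictDF(rddGraph).cache()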


* @param registerAsTable If true, registers all the DataFrames as Spark tables
* @return Returns a Tuple4 [IndexDataFrame, SubjectDictDataFrame, ObjectDictDataFrame, PredicateDictDataFrame]
*/
def createOrReadDataSet(input: String, compressedDir: String, registerAsTable: Boolean = true): (DataFrame, DataFrame, DataFrame, DataFrame) = {
@GezimSejdiu (Member), May 8, 2019:

This method requires the compressed data to already be kept on disk (which is usually meant as a pre-processing step). Did you also consider that sometimes there is no need to materialize the process, i.e. the data could be kept in memory without being written back to disk?

I think we should provide another method which allows users to read the data and compress it on the fly, and of course to save it in case they want to do so.


@abakarboubaa (Contributor, Author), May 13, 2019:

"createOrReadDataSet" method does this. If NTriple file is passed, it processes and converts it into index dictionary and if compressed Directory (i.e. already parsed) is passed, then it simply read it as indexed dataFrames. However, I will separate out the function into 2, one for reading from raw NT file and the second one from processed data.

There is already a provision of saving processed (compressed) dataset into a disk using saveAsCSV() and re-read.
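Such a split might look roughly like the following sketch. The dictionary builders are the ones from this PR (getDistinctPredicateDictDF named by analogy); readNTriples stands in for whatever loader yields the RDD[graph.Triple], and the sub-directory names and indexSchema are assumptions, not code from the diff:

// Build the compressed dataset on the fly; nothing is written to disk.
def createDataSet(input: String): (DataFrame, DataFrame, DataFrame, DataFrame) = {
  val rddGraph = readNTriples(input) // hypothetical loader returning RDD[graph.Triple]
  (convertRDDGraphToDF(rddGraph),
    getDistinctSubjectDictDF(rddGraph),
    getDistinctObjectDictDF(rddGraph),
    getDistinctPredicateDictDF(rddGraph))
}

// Read back a dataset that was previously saved, e.g. via saveAsCSV().
def readDataSet(compressedDir: String): (DataFrame, DataFrame, DataFrame, DataFrame) = (
  spark.read.schema(indexSchema).csv(s"$compressedDir/triples"),
  spark.read.schema(dictionarySchema).csv(s"$compressedDir/subjects"),
  spark.read.schema(dictionarySchema).csv(s"$compressedDir/objects"),
  spark.read.schema(dictionarySchema).csv(s"$compressedDir/predicates"))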


@GezimSejdiu (Member), May 13, 2019:

Hi @abakarboubaa,

indeed. There is no need to do such a separation; I already did it here:

See you


@GezimSejdiu merged commit 61976ab into SANSA-Stack:develop on May 10, 2019
1 check failed
