
Integration of hdt-rdf compression module #81

Merged
7 commits merged into SANSA-Stack:develop on May 10, 2019

Conversation

@abakarboubaa (Contributor):

I have created the hdt package as discussed, added a TripleOps class, and submitted the corresponding test cases.

@GezimSejdiu self-requested a review on May 6, 2019, 13:30
@GezimSejdiu (Member) left a comment:

Hi @abakarboubaa,

Many thanks for your contribution. I went through it and left some comments. Please try to resolve them and let me know if you need more information.
PS: The PR build is failing, mostly due to Scalastyle violations. Please fix those as well.

Best regards,

*/

/*
--input
@GezimSejdiu (Member):

Please remove these comments from the class.

* @return Returns DataFrame [Subject,Object,Predicate]
*/
def convertRDDGraphToDF(triple:RDD[graph.Triple])={
logger.info("Converting RDD[Graph] to DataFrame[Row].")
@GezimSejdiu (Member):

No need to have this log here.

@GezimSejdiu (Member):

And remove all the remaining logs below as well.
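For illustration, a minimal sketch of convertRDDGraphToDF with the logging stripped out; spark (a SparkSession in scope) and the three-column string schema are assumptions for the sketch, not the PR's exact code:

import org.apache.jena.graph
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Assumed three-column schema for the triples table.
val tripleSchema = StructType(Seq(
  StructField("subject", StringType),
  StructField("predicate", StringType),
  StructField("object", StringType)))

def convertRDDGraphToDF(triples: RDD[graph.Triple]): DataFrame =
  spark.createDataFrame(
    triples.map(t => Row(t.getSubject.toString, t.getPredicate.toString, t.getObject.toString)),
    tripleSchema)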

logger.info(s"Rdd[Graph] is created. ${rddGraph.count()}")

val triplesDF=convertRDDGraphToDF(rddGraph)
triplesDF.printSchema()
@GezimSejdiu (Member):

No need to print.

triplesDF.cache()

objectDF=getDistinctObjectDictDF(rddGraph)
objectDF.printSchema()
@GezimSejdiu (Member):

No need to print. Remove it.

* @param outputDir Path to be written
* @param mode SaveMode of Write
*/
def saveAsCSV(outputDir:String, mode:SaveMode): Unit =
@GezimSejdiu (Member):

Hm, why do we need to save them as CSV? Aren't they just indexes, which could be kept in any file format (Parquet files, Hadoop files)?

@abakarboubaa (Contributor, Author):

That is planned as a future enhancement: supporting various target formats and compression types.
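For what it's worth, a hedged sketch of a format-agnostic writer along the lines the review suggests (the method name and the parquet default are illustrative, not part of the PR):

import org.apache.spark.sql.{DataFrame, SaveMode}

// Let the caller pick the target format; "parquet", "csv", "orc", and "json"
// are all built-in Spark data sources.
def save(df: DataFrame, outputDir: String, mode: SaveMode, format: String = "parquet"): Unit =
  df.write.mode(mode).format(format).save(outputDir)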

val logger = LoggerFactory.getLogger(TripleOps.getClass)


def main(args: Array[String]) {
@GezimSejdiu (Member) on May 6, 2019:

Please remove the main method from here.

@abakarboubaa reopened this on May 7, 2019
@abakarboubaa (Contributor, Author):

The code has been updated as requested; it now also conforms to Scalastyle.

* @return DataFrame Subject dictionary of [index,subject]
*/
def getDistinctSubjectDictDF(triples : RDD[graph.Triple]) : DataFrame = {
spark.createDataFrame(triples.map(_.getSubject.toString()).distinct().zipWithIndex().map(t => Row(t._1, t._2)), dictionarySchema).cache()
@GezimSejdiu (Member):

Do we need to persist/cache these in advance? I think it would be good to drop caching as the default approach and leave it to users to decide what is relevant for their use case.
So, remove the .cache() call from every method where it is added; callers can apply it themselves.
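For illustration, the same dictionary builder without the built-in cache, leaving persistence to the caller (a sketch based on the snippet above):

def getDistinctSubjectDictDF(triples: RDD[graph.Triple]): DataFrame =
  spark.createDataFrame(
    triples.map(_.getSubject.toString).distinct().zipWithIndex().map(t => Row(t._1, t._2)),
    dictionarySchema)

// A caller opts in to caching only when it pays off for their workload:
// val subjectDict = getDistinctSubjectDictDF(rddGraph).cache()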

* @param registerAsTable If true, it register all the DF as Spark table
* @return Returns the Tuple4 [IndexDataFrame,SubjectDictDataFrame,ObjectDictDataFrame,PredicateDictDataFrame]
*/
def createOrReadDataSet(input : String, compressedDir : String, registerAsTable : Boolean = true) : (DataFrame, DataFrame, DataFrame, DataFrame) = {
@GezimSejdiu (Member):

This method requires the compressed data to already be on disk (which is usually meant to be a pre-processing step). Did you also consider that sometimes there is no need to materialize the result? One may want to keep the compressed data in memory without writing it back to disk.

I think we should provide another method which allows users to read the data and compress it on the fly, and of course to save it afterwards in case they want to do so.

@abakarboubaa (Contributor, Author) on May 13, 2019:

"createOrReadDataSet" method does this. If NTriple file is passed, it processes and converts it into index dictionary and if compressed Directory (i.e. already parsed) is passed, then it simply read it as indexed dataFrames. However, I will separate out the function into 2, one for reading from raw NT file and the second one from processed data.

There is already a provision for saving the processed (compressed) dataset to disk using saveAsCSV() and re-reading it.
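A sketch of the proposed split, reusing the dictionary helpers from this PR; getDistinctPredicateDictDF and buildIndexDF are hypothetical names for the remaining pieces, not the merged API:

// Compress on the fly: build the dictionaries and the index in memory,
// without requiring an already-materialized compressed directory on disk.
def compressOnTheFly(triples: RDD[graph.Triple]): (DataFrame, DataFrame, DataFrame, DataFrame) = {
  val subjectDF   = getDistinctSubjectDictDF(triples)
  val objectDF    = getDistinctObjectDictDF(triples)
  val predicateDF = getDistinctPredicateDictDF(triples)                     // hypothetical, by analogy
  val indexDF     = buildIndexDF(triples, subjectDF, objectDF, predicateDF) // hypothetical join step
  (indexDF, subjectDF, objectDF, predicateDF)
}

// Reading an already-compressed directory back from disk can then stay a
// separate entry point, and saving remains optional.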

@GezimSejdiu (Member):

Hi @abakarboubaa,

Indeed, there is no need for such a separation. I already did it here:

See you

@GezimSejdiu merged commit 61976ab into SANSA-Stack:develop on May 10, 2019
@abakarboubaa (Contributor, Author) commented on May 13, 2019 via email.
