
Conversation

@ahmedlone127
Contributor

Added Push to Hub for Models and Pipelines

Description

Parameters

Name                          Type
name (required)               string
task (required)               string
sparkVersion (required)       string
sparknlpVersion (required)    string
language (required)           string
license                       string ["Open Source", "Licensed"]
tags                          array of strings
supported                     boolean
title (required)              string
dependencies                  string
description (required)        string
predictedEntities             string
howToUse                      string
liveDemo                      string
runInColab                    string
pythonCode (required)         string
scalaCode                     string
nluCode                       string
results                       string
dataSource                    string
includedModels                string
benchmarking                  string

Example Usage

from python.sparknlp.upload_to_hub import PushToHub

# GitToken is assumed to be defined beforehand (a GitHub access token)
sample_upload = {
    "name": "analyze_sentiment_ml",
    "task": "Sentiment Analysis",
    "title": "Analyze Sentiment Machine Learning",
    "sparkVersion": "3.0",
    "sparknlpVersion": "Spark NLP 4.0.0",
    "language": "en",
    "license": "Open Source",
    "description": """The analyze_sentiment_ml is a pretrained pipeline that we can use to process text with a simple pipeline that performs basic processing steps and predicts sentiment.
        It performs most of the common text processing tasks on your dataframe.""",
    "pythonCode": '''from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("analyze_sentiment_ml", "en")

result = pipeline.annotate("""I love johnsnowlabs!""")''',
    "model_zip_path": "pos_ud_bokmaal_nb_3.4.0_3.0_1641902661339.zip",
}

PushToHub.upload_to_modelshub_and_fill_form_API(sample_upload, GitToken)

@maziyarpanahi
Contributor

Thanks @ahmedlone127 for this, let's enrich this and add some restrictions:

  • Let's not expose the raw sample_upload; instead, ask the user only for the required/mandatory info (as a dict) and fill in the rest internally, like the Spark and Spark NLP versions and the license (it must be Open Source, since nobody except JSL members is allowed to upload licensed models/pipelines), etc.
  • Maybe we can zip/archive internally so the user just shares the path to a saved model; if it's already a .zip we skip this step, if not we create the .zip ourselves (see the sketch after this list).
  • Maybe we can also check whether the model/pipeline has metadata-00000 (the same test Models Hub does) to be sure it's an Apache Spark saved model and avoid getting an error in return.
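A minimal sketch of what that internal zipping step could look like (ensure_zipped is a hypothetical name, not part of the PR):

import shutil

def ensure_zipped(model_path: str) -> str:
    """Return a .zip for the given saved-model path, archiving the folder if needed."""
    if model_path.endswith(".zip"):
        return model_path  # already an archive, nothing to do
    # shutil.make_archive appends the .zip extension to the base name itself
    return shutil.make_archive(model_path, "zip", model_path)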

@maziyarpanahi maziyarpanahi changed the base branch from master to release/402-release-candidage July 18, 2022 12:30
@maziyarpanahi maziyarpanahi changed the base branch from release/402-release-candidage to master July 18, 2022 12:32
@ahmedlone127
Contributor Author

Hey @maziyarpanahi, for the first part, how about I add a function called create_docs that takes the required fields as parameters, fills some of them in internally, and has optional arguments such as benchmarking and scalaCode so the user can still add extra info if they want? At the end it calls the original function and uploads to the hub.

@maziyarpanahi
Contributor

  • That's a great idea. Let's have a create_docs to fill in everything required; the output would be a dictionary that can be fed into upload_to_hub (which let's rename to push_to_hub()).
  • However, if someone calls push_to_hub() without the full dictionary and with only a simple dict (name, lang, path to model), we should still allow it to be uploaded. (Those fields are required, but the rest can be done in the PR; see the sketch below.)
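A rough sketch of how create_docs could fill in the defaults (the signature and generated values here are illustrative assumptions, not the PR's final API):

def create_docs(name, language, model_zip_path, **optional):
    """Build the full metadata dictionary that push_to_hub() consumes."""
    docs = {
        "name": name,
        "language": language,
        "model_zip_path": model_zip_path,
        # everything the user didn't supply is filled in internally
        "license": "Open Source",  # only JSL members may upload licensed models
        "supported": False,
        "title": name.replace("_", " ").title(),  # generated default title
    }
    docs.update(optional)  # extras such as benchmarking or scalaCode
    return docs

push_to_hub() would then accept either this full dictionary or the minimal one and fill in the gaps itself.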

@ahmedlone127
Contributor Author

If someone calls push_to_hub directly with a simple dict like that, we can for sure add the other required fields and leave the ones we can't generate empty. But I think we should also make task and pythonCode required: we can generate the title and description, but without those two the model won't make much sense (or be usable).

@maziyarpanahi
Contributor

> If someone calls push_to_hub directly with a simple dict like that, we can for sure add the other required fields and leave the ones we can't generate empty. But I think we should also make task and pythonCode required: we can generate the title and description, but without those two the model won't make much sense (or be usable).

That makes sense; we can have the PyDoc show the minimum required fields and then make those mandatory.
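For example, the PyDoc could spell out the minimum required fields directly (a hypothetical docstring, not the merged one):

def push_to_hub(docs, GIT_TOKEN):
    """Upload a model or pipeline to Models Hub.

    Minimum required keys in `docs`:
        name (str): model/pipeline name, e.g. "analyze_sentiment_ml"
        language (str): language code, e.g. "en"
        model_zip_path (str): path to the saved model (folder or .zip)
        task (str): task name, e.g. "Sentiment Analysis"
        pythonCode (str): snippet showing how to load and use the model

    Remaining fields are generated or defaulted internally.
    """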

@ahmedlone127
Contributor Author

ahmedlone127 commented Jul 20, 2022

Hey, @maziyarpanahi, do we still support Spark version 2? I am asking because for sparknlpVersion we can simply import the library and call sparknlp.version(), but for sparkVersion we would have to start a Spark session, which we could avoid if we keep it at 3 by default. And we would keep the supported field set to False, right?

@maziyarpanahi
Contributor

Hi,

No; by default and until further notice, the Spark version is 3.0 for models/pipelines.
For the Spark NLP version, if it's empty we can take it from the current installation, or else users can specify something else, I guess.
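In code, that defaulting could look roughly like this (fill_versions is a hypothetical helper name):

import sparknlp

def fill_versions(docs: dict) -> dict:
    """Apply the version defaults discussed above."""
    # Spark version stays pinned to 3.0 for models/pipelines until further notice
    docs.setdefault("sparkVersion", "3.0")
    # Spark NLP version comes from the installed library unless the user set one
    docs.setdefault("sparknlpVersion", sparknlp.version())
    return docs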

@ahmedlone127
Contributor Author

That sounds good. I have also added the zip function we talked about to zip folders. I was thinking about what would be a good way to add the last part:
[screenshot of a saved model's metadata folder]
I think we should check for this file in the metadata folder of the given input, and if it exists we assume it's an Apache Spark model.
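Something along these lines, perhaps (looks_like_spark_model is a hypothetical name; it assumes Spark's usual layout of part files inside the metadata folder):

import os

def looks_like_spark_model(model_path: str) -> bool:
    """Cheap sanity check that a folder holds an Apache Spark saved model."""
    metadata_dir = os.path.join(model_path, "metadata")
    return os.path.isdir(metadata_dir) and any(
        name.startswith("part-") for name in os.listdir(metadata_dir)
    )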

@maziyarpanahi
Contributor

> That sounds good. I have also added the zip function we talked about to zip folders. I was thinking about what would be a good way to add the last part: [screenshot] I think we should check for this file in the metadata folder of the given input, and if it exists we assume it's an Apache Spark model.

I will loop in @pabla, who has more insight. @pabla, so we basically want to have a first simple check to avoid Models Hub returning an error when it comes to the format of the saved model in Spark/Spark NLP.

@ahmedlone127
Contributor Author

Hey @maziyarpanahi, I made the changes we discussed. Please review them and let me know if they look good :)

@maziyarpanahi
Contributor

> Hey @maziyarpanahi, I made the changes we discussed. Please review them and let me know if they look good :)

Thanks for this, I have pushed some changes. Can we have a small unit test for this? Obviously you can tag it as slow, but we can simply load a NerDLModel.pretrained(), save it, and use a sample code to upload it to see if it works. (You can leave GIT_TOKEN empty and we will add it manually when we do manual tests.)

@ahmedlone127
Contributor Author

Hey @maziyarpanahi, I made a test:

import unittest

import sparknlp
from sparknlp.annotator import *
from sparknlp.base import *
from sparknlp.common import *
from pyspark.ml import Pipeline
from upload_models_to_hub import PushToHub


class PushToHubTestSpec(unittest.TestCase):

    def test_load_and_upload(self):
        """Loads a pretrained Spark NLP model, saves it, and uploads it to Models Hub."""
        spark = sparknlp.start()
        ner_model = NerDLModel.pretrained("ner_aspect_based_sentiment") \
            .setInputCols(["document", "token", "embeddings"]) \
            .setOutputCol("ner")
        nlp_pipeline = Pipeline(stages=[ner_model])
        # fit on an empty DataFrame just to obtain a saveable PipelineModel
        model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
        model.write().overwrite().save("test")
        PushToHub.push_to_hub(
            "test_model_hub_upload",
            "en",
            "test",
            "Summarization",
            'restaurant_pipeline = PretrainedPipeline("nerdl_restaurant_100d_pipeline", lang="en")',
            GIT_TOKEN="",  # left empty on purpose; added manually for real runs
        )


if __name__ == "__main__":
    unittest.main()

@maziyarpanahi maziyarpanahi changed the base branch from master to release/410-release-candidate August 8, 2022 06:30
@maziyarpanahi maziyarpanahi merged commit 7c029b0 into release/410-release-candidate Aug 22, 2022
@KshitizGIT KshitizGIT deleted the Adding-Push-To-Hub branch March 2, 2023 09:53