[ggma] Add ggma_tokenize.h and implementation #16267

Merged
glistening merged 4 commits into Samsung:master from glistening:tokenizer
Nov 5, 2025

Contributor

@glistening glistening commented Nov 4, 2025

This PR adds the ggma_tokenize API and its implementation.
It supports the SentencePiece library.

ONE-DCO-1.0-Signed-off-by: Sanggyu Lee <sg5.lee@samsung.com>

I will enable the tokenizer in the Tizen build in a separate PR, since it requires the SENTENCEPIECE source tarball.

@glistening glistening requested review from a team, chunseoklee and hseok-oh November 4, 2025 02:48
@glistening glistening added the PR/ready for review It is ready to review. Please review it. label Nov 4, 2025
@glistening glistening force-pushed the tokenizer branch 2 times, most recently from 91945c7 to a675788 Compare November 4, 2025 03:16
@glistening glistening force-pushed the tokenizer branch 2 times, most recently from 2facee6 to 77595bf Compare November 4, 2025 05:22
Comment thread runtime/onert/odc/CMakeLists.txt Outdated
return()
endif(NOT SentencePieceSource_FOUND)

include_directories(${SentencePieceSource_DIR})
Contributor

Is there a reason this line is needed?

Contributor Author

In file included from /z/ONE/runtime/externals/SENTENCEPIECE/src/util.h:32,
                 from /z/ONE/runtime/externals/SENTENCEPIECE/src/flags.cc:27:
/z/ONE/runtime/externals/SENTENCEPIECE/src/sentencepiece_processor.h:25:10: fatal error: third_party/absl/strings/string_view.h: No such file or directory
   25 | #include "third_party/absl/strings/string_view.h"
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.

Contributor

@hseok-oh hseok-oh Nov 4, 2025

It seems v0.1.90 does not support being built via add_subdirectory(). It assumes it is built as the root project or as an external build such as ExternalProject_Add. v0.2.1 seems to support add_subdirectory().

Contributor

@hseok-oh hseok-oh Nov 4, 2025

> It seems v0.1.90 does not support being built via add_subdirectory()

Because of this, fortunately, install(...) in SentencePiece's CMake is not working. If it were working, it would install SentencePiece's libraries, headers, and tools into our install path.

Contributor Author

@glistening glistening Nov 4, 2025

SentencePiece libraries are installed under lib/ggma.

$ tree Product/x86_64-linux.debug/out/lib/ggma
Product/x86_64-linux.debug/out/lib/ggma
├── libggma_api.so
├── libggma_tokenize.so
├── libsentencepiece.so -> libsentencepiece.so.0
├── libsentencepiece.so.0 -> libsentencepiece.so.0.0.0
└── libsentencepiece.so.0.0.0

No SentencePiece header is installed under include/ggma.

$ tree Product/x86_64-linux.debug/out/include/ggma/
Product/x86_64-linux.debug/out/include/ggma/
├── ggma_api.h
├── ggma_context.h
├── ggma_generate.h
├── ggma_tokenize.h
└── ggma_types.h

I don't see any problem. Could you please let me know if something is wrong?

Would you prefer a static library, which would hide libsentencepiece.so inside libggma_tokenize.so?

Contributor

@hseok-oh hseok-oh Nov 4, 2025

As I said, fortunately there is no problem, because SentencePiece's CMake is not working correctly.

> Would you prefer a static library, which would hide libsentencepiece.so inside libggma_tokenize.so?

I have no preference about this.

Contributor Author

@glistening glistening Nov 4, 2025

> It seems v0.1.90 does not support being built via add_subdirectory(). It assumes it is built as the root project or as an external build such as ExternalProject_Add. v0.2.1 seems to support add_subdirectory().

For your information, I used v0.1.90 because our internal translation model does not work with the latest version of sentencepiece. I confirmed that it works with v0.1.90. I hope to move to v0.2.1 or later once the translation model is done.

Contributor

I see. Could you change this to a target-level setting, target_include_directories(sentencepiece PRIVATE ${SentencePieceSource_DIR}), instead of the global setting?
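
A minimal sketch of the change requested here, assuming SentencePiece is pulled in with add_subdirectory() as discussed earlier in this thread. SentencePieceSource_DIR and the sentencepiece target name come from the thread; the binary directory name sentencepiece-build is a hypothetical placeholder and is not taken from the PR.

```cmake
# Sketch only, not the PR's actual CMakeLists.txt.

# Directory-scoped form from the original diff: the path is added to every
# target created in this directory and to subdirectories added afterwards.
# include_directories(${SentencePieceSource_DIR})

# "sentencepiece-build" is a hypothetical binary directory name.
add_subdirectory(${SentencePieceSource_DIR}
                 ${CMAKE_CURRENT_BINARY_DIR}/sentencepiece-build)

# Target-scoped form suggested in the review: only the sentencepiece target
# gets the extra search path needed for "third_party/absl/..." includes.
target_include_directories(sentencepiece PRIVATE ${SentencePieceSource_DIR})
```

With PRIVATE, the path is not propagated to targets that link against sentencepiece. If other SentencePiece targets in the same tree include the same headers, they may need an equivalent line.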

return()
endif()

set_property(TARGET sentencepiece PROPERTY POSITION_INDEPENDENT_CODE ON)
Contributor

Because you do not set SPM_ENABLE_SHARED to FALSE, sentencepiece may be built as a shared library.

  • Is POSITION_INDEPENDENT_CODE required?
  • Is using the shared sentencepiece library intended?

Contributor Author

I built sentencepiece as a shared library.

By default, SPM_ENABLE_SHARED seems to be ON.

runtime/externals/SENTENCEPIECE/CMakeLists.txt

option(SPM_ENABLE_SHARED "Builds shared libaries in addition to static libraries." ON)
Product/x86_64-linux.debug/out/lib/ggma/libsentencepiece.so
Product/x86_64-linux.debug/out/lib/ggma/libsentencepiece.so.0
Product/x86_64-linux.debug/out/lib/ggma/libsentencepiece.so.0.0.0

I preferred shared because we may later choose to load the necessary shared library using dlopen.

If you prefer a static library, could you please let me know why?

Contributor

> If you prefer a static library, could you please let me know why?

No. I commented because line 22 is meaningless if you want to use the shared library. I thought you had added the POSITION_INDEPENDENT_CODE property explicitly because you wanted to use a static library.
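
To illustrate the point about line 22: CMake enables POSITION_INDEPENDENT_CODE by default for SHARED and MODULE library targets, so the property only matters when a static archive is linked into a shared object. The sketch below makes the choice explicit. SPM_ENABLE_SHARED is the option quoted above; the sentencepiece-static target name and the binary directory name are assumptions to check against the SentencePiece version in use.

```cmake
# Sketch only. Set the shared/static choice before SentencePiece's own
# option() runs; option() does not overwrite an existing cache entry.
set(SPM_ENABLE_SHARED ON CACHE BOOL "Build shared SentencePiece libraries" FORCE)

# "sentencepiece-build" is a hypothetical binary directory name.
add_subdirectory(${SentencePieceSource_DIR}
                 ${CMAKE_CURRENT_BINARY_DIR}/sentencepiece-build)

# A SHARED target is built as position-independent code by default,
# so no extra property is needed on it.

# Only relevant if the static archive were linked into libggma_tokenize.so.
# The target name sentencepiece-static is an assumption.
if(TARGET sentencepiece-static)
  set_property(TARGET sentencepiece-static PROPERTY POSITION_INDEPENDENT_CODE ON)
endif()
```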

Comment thread runtime/infra/cmake/packages/SentencePieceConfig.cmake
chunseoklee previously approved these changes Nov 4, 2025
Contributor Author

glistening commented Nov 5, 2025

@hseok-oh Please see the last commit. I've applied your reviews.

After this PR is merged, I will update the Debian build and the Tizen RPM build.

hseok-oh previously approved these changes Nov 5, 2025
Contributor

@hseok-oh hseok-oh left a comment

LGTM

@glistening glistening merged commit 2469255 into Samsung:master Nov 5, 2025
10 checks passed
@glistening glistening deleted the tokenizer branch November 5, 2025 02:13