The citation key displays Chinese pinyin instead of Chinese character #9605

Open
ehehela opened this issue Feb 8, 2023 · 17 comments

@ehehela

ehehela commented Feb 8, 2023

I recommend an alternative citation key generation scheme for Chinese bibliographies: use the pinyin of the authors' names rather than the Chinese characters, which are non-ASCII.

For example:
WanZheng2016 or WanZ2016 is preferred over the default 万征2016.

@Article{万征2016,
author = {万征 and 姚仰平 and 孟达},
journal = {力学学报},
title = {复杂加载下混凝土的弹塑性本构模型},
year = {2016},
issn = {0459-1879},
month = jun,
number = {05},
pages = {1159--1171},
volume = {48},
}
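
A minimal sketch of the proposed key scheme, written in Python for brevity. The table `TOY_PINYIN` and both function names are hypothetical and cover only the example author; a real implementation would delegate to a pinyin library. The sketch also assumes the family name is the first character, which does not hold for compound surnames.

```python
# Toy lookup table (assumption): only the example author's characters are covered.
TOY_PINYIN = {"万": "wan", "征": "zheng"}


def romanize_name(name):
    """Return capitalized pinyin syllables, or None if any character is unknown."""
    syllables = []
    for char in name:
        if char not in TOY_PINYIN:
            return None
        syllables.append(TOY_PINYIN[char].capitalize())
    return syllables


def citation_key(first_author, year, abbreviate_given=False):
    """Build a key such as WanZheng2016 or WanZ2016; fall back to the raw name."""
    syllables = romanize_name(first_author)
    if not syllables:
        # Unknown characters or an empty name: keep the current default behaviour.
        return f"{first_author}{year}"
    family, given = syllables[0], syllables[1:]
    if abbreviate_given:
        given = [syllable[0] for syllable in given]
    return family + "".join(given) + str(year)


print(citation_key("万征", 2016))
print(citation_key("万征", 2016, abbreviate_given=True))
```

Names that the table cannot romanize keep their original characters, so existing keys would not change silently.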

@clzls

clzls commented Jan 6, 2024

I want to mention that no naive automatic converter is reliable, since Hanzi are used in many Asian countries and territories, in many variants. For example, the Hanzi 【出来】 may be read chu lai in Chinese, de ki in Japanese, xuý lãi in Vietnamese, or cheok rae in Korean.

In my workflow, calibre uses such a naive converter, turning my Chinese items into ugly "ASCII equivalents" of its own choosing: sometimes Chinese (Mandarin or Cantonese), sometimes Japanese...

So I suggest letting the user decide which romanization is used.

@koppor
Member

koppor commented Mar 25, 2024

Implementation note: Use https://github.com/houbb/pinyin. I found it via https://search.maven.org/search?q=pinyin4j - and it seems to be the most maintained one. If that does not work, try https://github.com/belerweb/pinyin4j.

I checked https://sourceforge.net/p/pinyin4j/news/ and think, the library needs to be configured to (in bold)

  • uppercase or lowercase
  • v or u (not sure about this one)
  • with tone numbers or without tone
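
The three options above can be sketched as a small formatting step, shown here in Python. The function name, its parameters, and the input form `lu:4` (u-colon plus a trailing tone number) are assumptions made for this illustration and are not taken from either library.

```python
def format_syllable(raw, *, uppercase=False, u_style="v", keep_tone=False):
    """Format one pinyin syllable given as letters plus an optional tone digit.

    raw       -- e.g. "lu:4" (assumed input form: "u:" stands for u-umlaut)
    uppercase -- emit upper case instead of lower case
    u_style   -- "v", "u", or "unicode" for the u-umlaut replacement
    keep_tone -- keep the trailing tone number
    """
    letters = raw.rstrip("012345")
    tone = raw[len(letters):]
    replacement = {"v": "v", "u": "u", "unicode": "ü"}[u_style]
    letters = letters.replace("u:", replacement)
    result = letters + (tone if keep_tone else "")
    return result.upper() if uppercase else result


print(format_syllable("lu:4"))
print(format_syllable("lu:4", u_style="u", keep_tone=True))
```

For citation keys, dropping the tone and avoiding the "unicode" style keeps the result pure ASCII.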

@koppor koppor added the good first issue An issue intended for project-newcomers. Varies in difficulty. label Mar 25, 2024
@github-project-automation github-project-automation bot moved this to Free to take in Good First Issues Mar 25, 2024
@clzls

clzls commented Apr 1, 2024

I would state again that doing this naively in the main program may not be a good idea: Chinese cannot easily be distinguished from other CJK languages without native speakers or AI, and it will definitely BREAK other users' experience, especially for Japanese users, who also use Hanzi (Kanji in Japanese) but with a totally different romanization.
Hanzi are not only the characters of Chinese but the characters of all CJK languages across Asia, just like the Latin characters we are using now. It is a complicated system. I suggest giving this issue low priority.
Locale checks are not reliable, since multiple languages should be allowed in a single library.
(I have seen this in calibre and I don't want to see it happen here, breaking my whole bibliography...)
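
To illustrate why script detection alone cannot decide the language, here is a small Python sketch using only the standard `unicodedata` module. The function name `scripts_used` and the three script labels are made up for this example.

```python
import unicodedata


def scripts_used(text):
    """Return the set of CJK-related scripts found in text, judged by character names."""
    found = set()
    for char in text:
        name = unicodedata.name(char, "")
        if name.startswith("CJK UNIFIED IDEOGRAPH"):
            found.add("han")
        elif name.startswith(("HIRAGANA", "KATAKANA")):
            found.add("kana")
        elif name.startswith("HANGUL"):
            found.add("hangul")
    return found


# Pure Han text: nothing here says whether it is Chinese or Japanese.
print(scripts_used("出来"))
# Kana is a hint for Japanese, but author names are often written without it.
print(scripts_used("出来る"))
```

The Han-only result is identical for a Chinese and a Japanese author name, which is why an explicit per-entry hint or a user setting is needed.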

@koppor
Member

koppor commented Apr 1, 2024

BibTeX allows for using the field "language" to indicate the language of the entry. Maybe, one could use that as input for the citation key generator.

@clzls I assume you are on UTF8 and use the non-ASCII characters also for your citation keys?

The issue is complicated with many different user "profiles". Maybe we need a preference?

@ehehela
Author

ehehela commented Apr 1, 2024

The implementation is complicated since we need to accommodate users with different languages while ensuring a smooth user experience for those accustomed to the current system.

Maybe we could offer users the option to enable this function (default off). Other preconditions are also needed, such as ensuring that romanization only occurs when a valid "language" field is specified (as mentioned by koppor).

Perhaps we can extend romanization support not only to Chinese language but also other languages (Korean, Japanese, etc.).
Allow different "language" fields to use distinct romanization schemes. Therefore, a consistent interface with customization options would be beneficial, allowing users to decide which "language" fields to romanize.
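
A rough sketch of this opt-in design in Python: romanization runs only when the feature is enabled and the entry's "language" field maps to a registered romanizer. All names (`ROMANIZERS`, `toy_pinyin`, `key_for`) and the toy table are hypothetical; entries are modelled as plain dictionaries.

```python
def toy_pinyin(text):
    """Toy romanizer (assumption): knows only the characters of the example key."""
    table = {"万": "Wan", "征": "Zheng"}
    return "".join(table.get(char, char) for char in text)


# Registry from "language" field value to romanizer; other languages could be added.
ROMANIZERS = {"chinese": toy_pinyin}


def key_for(entry, enabled=False, romanizers=ROMANIZERS):
    """Generate a [first author][year] key, romanized only when opted in."""
    raw = entry["author"].split(" and ")[0] + entry["year"]
    if not enabled:
        return raw  # default off: behaviour is unchanged
    language = entry.get("language", "").strip().lower()
    romanizer = romanizers.get(language)
    return romanizer(raw) if romanizer else raw


entry = {"author": "万征 and 姚仰平 and 孟达", "year": "2016", "language": "chinese"}
print(key_for(entry))
print(key_for(entry, enabled=True))
```

Entries without a "language" field, or with a language that has no registered romanizer, keep their non-ASCII key.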

Alternatively, a semi-automatic approach could suffice. We could introduce new options in the right-click menu (see figure below). We could also use “check integrity” to collect those entries with non-ASCII citation keys into one group, followed by “cleanup entries” for this group (see figure below). Thus, this method can also make it convenient for people who are in need.

[Screenshot 2024-04-01 185548]
[Screenshot 2024-04-01 181912]
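
The collection step of the semi-automatic approach could look roughly like this in Python. The function name is hypothetical, and the sketch only shows the filtering, not any integration with "check integrity" or "cleanup entries".

```python
def non_ascii_keys(keys):
    """Return the citation keys that contain at least one non-ASCII character."""
    return [key for key in keys if not key.isascii()]


# Keys taken from examples in this thread.
print(non_ascii_keys(["万征2016", "chen2012", "grüße", "WanZheng2016"]))
```

Note that this also flags keys such as grüße, so the resulting group may contain entries that need no romanization at all.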

@Siedlerchr Siedlerchr removed the good first issue An issue intended for project-newcomers. Varies in difficulty. label Apr 1, 2024
@Siedlerchr Siedlerchr moved this from Free to take to Reserved in Good First Issues Apr 1, 2024
@clzls

clzls commented Apr 1, 2024

> @clzls I assume you are on UTF8 and use the non-ASCII characters also for your citation keys?

Yes I do, and a bunch of my papers were written in Chinese, using tons of packages to tweak LaTeX compilers, to make them happy dealing with non-ASCII characters... (no one would write papers full of something like \symbol{28450}\symbol{35486}, I guess)

> Perhaps we can extend romanization support not only to Chinese language but also other languages (Korean, Japanese, etc.). Allow different "language" fields to use distinct romanization schemes. Therefore, a consistent interface with customization options would be beneficial, allowing users to decide which "language" fields to romanize.

Looks good to me. Implemented this way, it would be an opt-in extension, extensible to any language that needs ASCII equivalents (even European languages such as Danish or Greek may need it, I think). I would go even further and suggest that dynamically loadable custom formatters may be an even better solution, so that everyone would be happy...

@koppor
Member

koppor commented Apr 9, 2024

At a LaTeX conference, I learned from the LaTeX developers that it is now also possible to use Unicode with pdflatex and labels, e.g., \label{sec:grüße}. Moreover, it is also possible for citation keys.

@mlep
Contributor

mlep commented Apr 16, 2024

Does it work for BibTeX too?
(like \cite{grüße})

@koppor
Member

koppor commented Apr 16, 2024

> Does it work for BibTeX too?
> (like \cite{grüße})

According to the LaTeX 3 team: Yes. Just make sure that you run the latest TeX Live 😅

@ehehela
Author

ehehela commented Apr 16, 2024

I use the latest MiKTeX with BibTeX and pdflatex. A citation key like \cite{grüße} works correctly, but a citation key like \cite{任政2018} causes an error.
The error message is:

! Undefined control sequence.
\GenericError ...
! Emergency stop.
\GenericError ...

@koppor
Member

koppor commented Apr 16, 2024

@ehehela I asked LaTeX pros. It works on TeXLive. See https://chat.stackexchange.com/transcript/message/65511308#65511308

OK, it seems, some more "magic" is needed:

\documentclass{article}

\DeclareUnicodeCharacter{4EFB}{CJK Ideograph 4efb}
\DeclareUnicodeCharacter{653F}{CJK Ideograph 653f}

\begin{document}

\cite{任政2018}

\begin{thebibliography}{99}
\bibitem{任政2018} xxxx
\end{thebibliography}
\end{document}

@davidcarlisle

davidcarlisle commented Apr 16, 2024

Actually, that resolves the error, but the cite doesn't work. It doesn't need the definitions, but it does (currently) need something safe as the first token:

\documentclass{article}

\begin{document}

\cite{ 任政2018}

\cite{x任政2018}



\begin{thebibliography}{99}
\bibitem{任政2018} xxxx
\bibitem{x任政2018} xxxx
\end{thebibliography}
\end{document}

The official position, though, is that cite keys should use ASCII characters.

@ehehela
Author

ehehela commented Apr 17, 2024

@koppor and @davidcarlisle Thank you.
I have tested two cases with pdflatex+bibtex+article: the first also loads the ctex package to enable Chinese support, while the second does not.
The results show that the first test only works for an ASCII citation key with a Chinese bibliography entry, while the second test works for a non-ASCII citation key with an English bibliography entry.
Therefore, I think pdflatex+bibtex may not fully support non-ascii characters. xelatex+biber works.

The source code of the first test is:

\documentclass{article}
\usepackage{ctex}

\begin{filecontents*}{ref.bib}
@article{陈骁2012,
  title={afdrgwfdsa},
  author={sdfas and sadfsd and afsa and dasf},
  journal={asdfsd},
  volume={27},
  number={2},
  pages={133--138},
  year={2012}
}
@article{chen2012,
  title={基于电无级变速器的内燃机最优控制策略及整车能量管理},
  author={陈骁 and 黄声华 and 万山明 and 庞珽},
  journal={电工技术学报},
  volume={27},
  number={2},
  pages={133--138},
  year={2012}
}
\end{filecontents*}

\bibliographystyle{plain}

\begin{document}

\cite{ chen2012}  % works for article class with ctex package and Chinese bibliography with ascii citation key
%\cite{ 陈骁2012}  % works for article class and English bibliography with non-ascii citation key

\bibliography{ref}
\end{document} 

The compilation result is:
[Screenshot 2024-04-17 091808]

The source code of the second test is:

\documentclass{article}
%\usepackage{ctex}

\begin{filecontents*}{ref.bib}
@article{陈骁2012,
  title={afdrgwfdsa},
  author={sdfas and sadfsd and afsa and dasf},
  journal={asdfsd},
  volume={27},
  number={2},
  pages={133--138},
  year={2012}
}
@article{chen2012,
  title={基于电无级变速器的内燃机最优控制策略及整车能量管理},
  author={陈骁 and 黄声华 and 万山明 and 庞珽},
  journal={电工技术学报},
  volume={27},
  number={2},
  pages={133--138},
  year={2012}
}
\end{filecontents*}

\bibliographystyle{plain}

\begin{document}

%\cite{ chen2012}  % works for article class with ctex package and Chinese bibliography with ascii citation key
\cite{ 陈骁2012}  % works for article class and English bibliography with non-ascii citation key

\bibliography{ref}
\end{document}

The compilation result is:
[Screenshot 2024-04-17 092010]

@clzls

clzls commented Apr 17, 2024

> Therefore, I think pdflatex+bibtex may not fully support non-ascii characters. xelatex+biber works.

FYI: My thesis uses xelatex+bibtex, and citation keys with CJK characters work fine. ctexbook and a bunch of other packages are used, as it is a production environment.

@ThiloteE
Member

As for JabRef, I am not yet sure what would be the best option to have the correct language in the entry preview, but when it comes to rendering the entry in LaTeX, there seems to be a limitation of pdflatex that can be worked around with xelatex, special commands/syntax or other packages.

Are you aware of Babel?

There is also LuaLaTeX.

@koppor
Member

koppor commented Apr 17, 2024

@ThiloteE lualatex is the way to go :). pdflatex and xelatex should only be used if absolutely necessary :)

@koppor
Member

koppor commented Sep 19, 2024

Idea: Maybe some of the Apache Lucene functionality can be used. It has folding filters, and we used one of them (the ASCIIFoldingFilter) in our LatexAwareAnalyzer.
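
For comparison, a diacritic-folding step can be sketched with Python's standard `unicodedata` module. This is not Lucene's filter, only an analogous technique under the hypothetical name `strip_diacritics`. As far as I understand, folding of this kind maps accented Latin letters to ASCII but has no mapping for Han characters, so it would not replace a romanization step.

```python
import unicodedata


def strip_diacritics(text):
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(char for char in decomposed if not unicodedata.combining(char))


print(strip_diacritics("Müller"))
# Han characters have no decomposition, so they pass through unchanged.
print(strip_diacritics("万征2016"))
```

Characters without a decomposition, such as ß or any Han character, are left untouched by this sketch.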
