A Python package that asynchronously segments JSON data into TEI XML format.

ckip-2-tei

This project segments the title, body, and comments of a post stored as JSON and writes the result as TEI XML. It uses asynchronous programming to speed up processing.

Installation

The source code is currently hosted on GitHub at: https://github.com/Taiwan-Social-Media-Corpus/ckip-2-tei

Binary installers for the latest released version are available at the Python Package Index (PyPI).

pip install ckip2tei

Documentation

1. Import module

from ckip2tei import generate_tei_xml

If you are working in a Jupyter Notebook, add the following two lines beforehand:

import nest_asyncio
nest_asyncio.apply()

ckip2tei is built on Python's asynchronous frameworks, so it cannot run properly in a Jupyter Notebook as-is: Jupyter (IPython ≥ 7.0) is already running an event loop. See the related question on Stack Overflow for further details.
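The conflict can be reproduced with the standard library alone. The sketch below assumes that the failure comes from starting a second event loop while one is already running; whether ckip2tei does exactly this internally is not confirmed here, and segment_stub is a hypothetical stand-in, not part of the package.

```python
import asyncio


async def segment_stub() -> str:
    """Hypothetical stand-in for an asynchronous pipeline step."""
    await asyncio.sleep(0)
    return "done"


async def call_from_running_loop() -> str:
    """Mimic a notebook cell: an event loop is already running here."""
    coro = segment_stub()
    try:
        # asyncio.run refuses to start a new loop inside a running one.
        asyncio.run(coro)
    except RuntimeError as error:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {error}"
    return "no error"


result = asyncio.run(call_from_running_loop())
print(result)
```

nest_asyncio.apply() patches asyncio so that such nested calls are permitted, which is why the two lines above are needed in a notebook.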

2. Run pipeline

Provide the function generate_tei_xml with two arguments:

  • post_data: the data to be segmented
  • media: the source of the data

The post_data argument should be in the following format:

{
    "board": "Soft_Job",
    "id": "ABCD",
    "created_at": 1183186255,
    "title": "[請益] 最愛的程式?",
    "author": "Retr0327",
    "content": "這是一篇測試文章\n我喜歡 Python 和 TypeScript",
    "comment_reaction": {"推 (pos)": 2, "噓 (neg)": 0, "→ (neu)": 0},
    "comments": [
        {
            "type": "pos",
            "author": "Uncle",
            "content": "我愛 TypeScript",
            "order": "1",
        },
        {
            "type": "pos",
            "author": "Bob",
            "content": "我也很愛 Python",
            "order": "2",
        },
    ],
}
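Before running the pipeline, it can help to check that a post has the expected shape. The following is a minimal sketch: find_missing_keys is a hypothetical helper, not part of ckip2tei, and treating every key in the example above as required is an assumption.

```python
# Keys copied from the example above; treating all of them as required is an assumption.
REQUIRED_POST_KEYS = {
    "board", "id", "created_at", "title",
    "author", "content", "comment_reaction", "comments",
}
REQUIRED_COMMENT_KEYS = {"type", "author", "content", "order"}


def find_missing_keys(post_data: dict) -> list:
    """Return a sorted list of keys missing from the post and from each comment."""
    missing = [key for key in REQUIRED_POST_KEYS if key not in post_data]
    for index, comment in enumerate(post_data.get("comments", [])):
        missing.extend(
            f"comments[{index}].{key}"
            for key in REQUIRED_COMMENT_KEYS
            if key not in comment
        )
    return sorted(missing)


incomplete_post = {
    "board": "Soft_Job",
    "id": "ABCD",
    "created_at": 1183186255,
    "title": "[請益] 最愛的程式?",
    "author": "Retr0327",
    "content": "這是一篇測試文章",
    "comment_reaction": {"推 (pos)": 1, "噓 (neg)": 0, "→ (neu)": 0},
    "comments": [{"type": "pos", "author": "Uncle", "content": "我愛 TypeScript"}],
}
print(find_missing_keys(incomplete_post))
```

An empty list means that every key from the example is present.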

With both arguments prepared, call the function as follows:

post_data = {
    "board": "Soft_Job",
    "id": "ABCD",
    "created_at": 1183186255,
    "title": "[請益] 最愛的程式?",
    "author": "Retr0327",
    "content": "這是一篇測試文章\n我喜歡 Python 和 TypeScript",
    "comment_reaction": {"推 (pos)": 2, "噓 (neg)": 0, "→ (neu)": 0},
    "comments": [
        {
            "type": "pos",
            "author": "Uncle",
            "content": "我愛 TypeScript",
            "order": "1",
        },
        {
            "type": "pos",
            "author": "Bob",
            "content": "我也很愛 Python",
            "order": "2",
        },
    ],
}

generate_tei_xml(post_data, "ptt")

This prints:

<TEI.2>
   <teiHeader>
      <metadata name="media">ptt</metadata>
      <metadata name="author">Retr0327</metadata>
      <metadata name="id">ABCD</metadata>
      <metadata name="year">2007</metadata>
      <metadata name="board">Soft_Job</metadata>
      <metadata name="title">[請益] 最愛的程式?</metadata>
   </teiHeader>
   <text>
      <title author="Retr0327">
         <s>
            <w type="PARENTHESISCATEGORY">[</w>
            <w type="VB">請益</w>
            <w type="PARENTHESISCATEGORY">]</w>
            <w type="WHITESPACE"> </w>
            <w type="Dfa">最</w>
            <w type="VL">愛</w>
            <w type="DE">的</w>
            <w type="Na">程式</w>
            <w type="QUESTIONCATEGORY">?</w>
         </s>
      </title>
      <body author="Retr0327">
        ...
   </text>
</TEI.2>
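Output of this shape can be read back with Python's standard library. The sketch below uses a trimmed, well-formed copy of the sample above (the elided body is left out) and a hypothetical helper, read_tei. It also checks that the year metadata matches created_at, assuming that value is a Unix timestamp in seconds.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Trimmed, well-formed copy of the sample output; the elided <body> is omitted.
SAMPLE_TEI = """
<TEI.2>
   <teiHeader>
      <metadata name="id">ABCD</metadata>
      <metadata name="year">2007</metadata>
   </teiHeader>
   <text>
      <title author="Retr0327">
         <s>
            <w type="VB">請益</w>
            <w type="Na">程式</w>
         </s>
      </title>
   </text>
</TEI.2>
"""


def read_tei(xml_text: str):
    """Return the header metadata and the (word, POS tag) pairs of the title."""
    root = ET.fromstring(xml_text)
    metadata = {node.get("name"): node.text for node in root.iter("metadata")}
    title_words = [
        (word.text, word.get("type"))
        for word in root.findall("./text/title/s/w")
    ]
    return metadata, title_words


metadata, title_words = read_tei(SAMPLE_TEI)

# Assumption: created_at is a Unix timestamp in seconds.
year_from_timestamp = datetime.fromtimestamp(1183186255, tz=timezone.utc).year

print(metadata)
print(title_words)
print(year_from_timestamp)
```

Each w element carries one segmented token, and its type attribute holds the part-of-speech tag, so the title comes back as a list of (token, tag) pairs.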

Contact Me

If you have any suggestions or questions, please do not hesitate to email me at lixingyang.dev@gmail.com.
