# Text Pre-Processing for Chinese

<div class="admonition note" name="html-admonition" style="background: lightblue; padding: 10px">
<p class="title">Note</p>
This section, "Working in Languages Beyond English," is co-authored with <a href="http://www.quinndombrowski.com/">Quinn Dombrowski</a>, the Academic Technology Specialist at Stanford University and a leading voice in multilingual digital humanities. I'm grateful to Quinn for helping expand this textbook to serve languages beyond English. 
</div>

This lesson is for anyone who wants to try the [TF-IDF](https://melaniewalsh.github.io/Intro-Cultural-Analytics/Text-Analysis/TF-IDF.html) or [topic modeling](https://melaniewalsh.github.io/Intro-Cultural-Analytics/Text-Analysis/Topic-Modeling.html) lessons on Chinese texts. Before continuing with those lessons, you need to create a *segmented derivative* of your original Chinese text: a copy of the text with spaces inserted between the words. Written Chinese does not put spaces between words, and word-count methods rely on spaces to find word boundaries, so they won't work on unsegmented text.
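To see why segmentation matters, here is a minimal sketch that counts "words" by splitting on whitespace, which is roughly what many word-count tools do. The segmented string below was spaced by hand for illustration; it is not spaCy output, and a real segmenter might draw the boundaries differently.

```python
from collections import Counter

# A short phrase from the example text, as it appears in the original (no spaces)
raw = "我们二万万女同胞"

# The same phrase with spaces inserted by hand (illustrative only, not spaCy output)
segmented = "我们 二万万 女 同胞"

# Splitting on whitespace treats the unsegmented phrase as one long "word"
raw_counts = Counter(raw.split())

# After segmentation, each word is counted separately
segmented_counts = Counter(segmented.split())

print(len(raw_counts), "word found in the raw text")
print(len(segmented_counts), "words found in the segmented text")
```

Without spaces, the whole phrase is counted as a single word, so the frequencies that TF-IDF and topic modeling depend on would be meaningless.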

## Install spaCy

In [1]:
!pip install -U spacy

## Download Language Model

In [2]:
!python -m spacy download zh_core_web_md

Collecting zh-core-web-md==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/zh_core_web_md-3.7.0/zh_core_web_md-3.7.0-py3-none-any.whl (78.0 MB)

## Import Libraries

In [3]:
import spacy

## Load Language Model
Once the model is downloaded, we need to load it. There are two ways to load a spaCy language model.

1. We can import the model as a module and then load it from the module.

In [4]:
import zh_core_web_md
nlp = zh_core_web_md.load()

2. We can load the model by name.

In [None]:
#nlp = spacy.load('zh_core_web_md')

If you just downloaded the model for the first time, it’s advisable to use Option 1. Then you can use the model immediately. Otherwise, you’ll likely need to restart your Jupyter kernel (which you can do by clicking Kernel -> Restart Kernel… in the Jupyter Lab menu).
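If you are not sure whether the model package is available in your current environment, a quick check like the sketch below can help before you try to load it. The helper name `model_is_installed` is our own invention for illustration, not part of spaCy; it only uses Python's standard `importlib` module to ask whether a package with that name can be imported.

```python
import importlib
import importlib.util

def model_is_installed(package_name):
    """Return True if a package with this name can be found for import."""
    # Refresh Python's import caches so newly installed packages can be found
    importlib.invalidate_caches()
    return importlib.util.find_spec(package_name) is not None

if model_is_installed('zh_core_web_md'):
    print('Model package found. You can load it with Option 1 or Option 2.')
else:
    print('Model package not found. Try running the download cell again.')
```

This check only tells you whether the package can be found; it does not load the model or confirm that its version matches your spaCy version.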

## Process Document
To create a derivative text file that we can use with TF-IDF, topic modeling, or other word-count based methods, we need to use spaCy to *segment* the text, artificially inserting spaces between words because most text analysis methods assume that words are separated by spaces.

The example text for Chinese is *敬告中国二万万女同胞* by 秋瑾. (Thanks to Paul Vierthaler for selecting and finding the text.)

Here we open the text and process it with the Chinese spaCy model.

In [17]:
filepath = '../texts/zh.txt'
# Open and read text (the with block closes the file when we're done)
with open(filepath, encoding='utf-8') as f:
    text = f.read()
# Process text with spaCy
document = nlp(text)

秋瑾《敬告中国二万万女同胞》
 
唉!世界上最不平的事，就是我们二万万女同胞了。从小生下来，遇着好老子，还说得过;遇着脾气杂冒、不讲情理的，满嘴连说：“晦气，又是一个没用的。”恨不得拿起来摔死。总抱着“将来是别人家的人”这句话，冷一眼、白一眼地看待;没到几岁，也不问好歹，就把一双雪白粉嫩的天足脚，用白布缠着，连睡觉的时候，也不许放松一点，到了后来肉也烂尽了，骨也折断了，不过讨亲戚、朋友、邻居们一声“某人家姑娘脚小”罢了。这还不说，到了择亲的时光，只凭着两个不要脸媒人的话，只要男家有钱有势，不问身家清白，男人的性情好坏、学问高低，就不知不觉应了。到了过门的时候，用一顶红红绿绿的花轿，坐在里面，连气也不能出。到了那边，要是遇着男人虽不怎么样，却还安分，这就算前生有福今生受了。遇着不好的，总不是说“前生作了孽”，就是说“运气不好”。要是说一二句抱怨的话，或是劝了男人几句，反了腔，就打骂俱下;别人听见还要说：“不贤惠，不晓得妇道呢!”诸位听听，这不是有冤没处诉么?还有一桩不公的事：男子死了，女子就要带三年孝，不许二嫁。女子死了，男人只带几根蓝辫线，有嫌难看的，连带也不带;人死还没三天，就出去偷鸡摸狗;七还未尽，新娘子早已进门了。上天生人，男女原没有分别。试问天下没有女人，就生出这些人来么?为什么这样不公道呢?那些男子，天天说“心是公的，待人是要和平的”，又为什么把女子当作非洲的黑奴一样看待。不公不平，直到这步田地呢?
　　诸位，你要知道天下事靠人是不行的，总要求己为是。当初那些腐儒说什么“男尊女卑”、“女子无才便是德”、“夫为妻纲”这些胡说，我们女子要是有志气的，就应当号召同志与他反对，陈后主兴了这缠足的例子，我们要是有羞耻的，就应当兴师问罪;即不然，难道他捆着我的腿?我不会不缠的么?男子怕我们有知识、有学问、爬上他们的头，不准我们求学，我们难道不会和他分辨，就应了么?这总是我们女子自己放弃责任，样样事体一见男子做了，自己就乐得偷懒，图安乐。男子说我没用，我就没用;说我不行，只要保着眼前舒服，就作奴隶也不问了。自己又看看无功受禄，恐怕行不长久，一听见男子喜欢脚小，就急急忙忙把它缠了，使男人看见喜欢，庶可以藉此吃白饭。至于不叫我们读书、习字，这更是求之不得的，有甚么不赞成呢?诸位想想，天下有享现成福的么?自然是有学问、有见识、出力作事的男人得了权利，我们作他的奴隶了。既作了他

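Then we loop through each token in the processed document, write out the token's text followed by a space, and save the result as our new segmented derivative text file. We use `token.text` rather than `token.lemma_` here: in spaCy v3, the Chinese models do not include a lemmatizer, so `token.lemma_` is an empty string for every token and the output file would contain nothing but spaces.

In [26]:
outname = filepath.replace('.txt', '-segmented.txt')

# Create a segmented version of the original text file
with open(outname, 'w', encoding='utf-8') as out:
    for token in document:
        # Write the token's text (token.lemma_ is empty for the Chinese models)
        out.write(token.text)
        # Insert white space between each token
        out.write(' ')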

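As a variation on the cell above, here is a small sketch that drops whitespace-only tokens (such as line breaks) and joins the rest with single spaces. The function name `write_segmented` is hypothetical, and the token list is typed in by hand so the example runs without spaCy; with a real document you could pass `[token.text for token in document]` instead.

```python
import os
import tempfile

def write_segmented(token_texts, outname):
    """Write tokens to a file separated by single spaces; return how many were written."""
    # Skip tokens that are only whitespace (line breaks, full-width spaces)
    words = [t for t in token_texts if t.strip()]
    with open(outname, 'w', encoding='utf-8') as out:
        out.write(' '.join(words))
    return len(words)

# Hand-made token list for illustration (not actual spaCy output)
tokens = ['我们', '二万万', '女', '同胞', '\n', '了', '。']

# Write to a temporary location so we don't touch the lesson's files
outname = os.path.join(tempfile.gettempdir(), 'zh-segmented-demo.txt')
count = write_segmented(tokens, outname)
print(count, 'tokens written to', outname)
```

Whether to drop line breaks is a judgment call: removing them gives a cleaner file for word counting, but you lose the original paragraph structure.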
## View output
The code cell below prints each token (words and punctuation) on its own line, so you can see how well spaCy identified word boundaries. As above, we print `token.text`, because `token.lemma_` is empty for the Chinese models and would print only blank lines.

In [27]:
for token in document:
    print(token.text)
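Another way to sanity-check the segmentation is to count the most frequent words in the segmented text, since frequent but implausible "words" often point to boundary errors. The sketch below assumes a space-separated string like the one in our derivative file; the helper name `top_words` and the sample string are made up for illustration.

```python
from collections import Counter

def top_words(segmented_text, n=10):
    """Return the n most common space-separated words, skipping pure punctuation."""
    # Keep only tokens that contain at least one letter, digit, or Chinese character
    words = [w for w in segmented_text.split() if any(ch.isalnum() for ch in w)]
    return Counter(words).most_common(n)

# Hand-made sample for illustration (not actual spaCy output)
sample = '我们 女子 ， 我们 男子 。'
print(top_words(sample, n=2))
```

With the real file, you could read the contents of `outname` into a string and pass that string to `top_words` instead of the sample.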