# Preprocessing and analysing Chinese text

The course covers basic Python code that can get you started using programming as a tool for text processing, quantitative analysis and text and data mining.

In more technical terms, we review concepts such as variables and values, the data types text string and list, and loops.
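
As a quick orientation, the sketch below shows these building blocks together: a variable holding a text string, a list, and a for loop. The names 'title', 'tokens' and 'lengths' are made up for illustration and are not used later in the course.

```python
# A variable stores a value; here the value is a text string.
title = "反对逃犯条例修订草案运动"

# A list stores several values in order.
tokens = ["反对", "逃犯", "条例"]

# A for loop visits each element of the list in turn.
lengths = []
for token in tokens:
    lengths.append(len(token))  # len() counts characters

print(title, lengths)
```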

We go through an example of how to retrieve text data, prepare it and use the jieba library. Jieba segments Chinese text into words and also supports Traditional Chinese.

_Source: https://github.com/fxsjy/jieba_

In [None]:
#! pip install jieba

In [None]:
# Web scraping libraries
from bs4 import BeautifulSoup
import requests

# For preprocessing and analysing
import jieba
import nltk
import re

We create a variable to store the URL of the page that we want to scrape.

We will scrape this Wikipedia page: 反对逃犯条例修订草案运动

In [None]:
# store the url in a variable
url_zh = 'https://zh.wikipedia.org/zh-cn/%E5%8F%8D%E5%B0%8D%E9%80%83%E7%8A%AF%E6%A2%9D%E4%BE%8B%E4%BF%AE%E8%A8%82%E8%8D%89%E6%A1%88%E9%81%8B%E5%8B%95'
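
The long sequence of '%' codes in the URL is the percent-encoded form of the page title. As a small aside, the sketch below uses urllib.parse from the standard library to decode the last part of the URL and to encode it again. The names 'encoded_title', 'decoded_title' and 'reencoded_title' are only illustrative.

```python
from urllib.parse import quote, unquote

url_zh = 'https://zh.wikipedia.org/zh-cn/%E5%8F%8D%E5%B0%8D%E9%80%83%E7%8A%AF%E6%A2%9D%E4%BE%8B%E4%BF%AE%E8%A8%82%E8%8D%89%E6%A1%88%E9%81%8B%E5%8B%95'

# Take the part after the last '/' and decode the %XX codes to characters
encoded_title = url_zh.rsplit('/', 1)[1]
decoded_title = unquote(encoded_title)
print(decoded_title)

# quote() goes the other way: characters to %XX codes
reencoded_title = quote(decoded_title)
print(reencoded_title == encoded_title)
```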

We pass the variable name to requests.get() below.

In [None]:
# get data
page = requests.get(url_zh)

# scrape webpage
soup = BeautifulSoup(page.content, 'html.parser')

In [None]:
# find all h1, h3 and p tags (headings and paragraphs)
tags = soup.find_all(['h1', 'h3', 'p'])

# extract the text from each tag and join the resulting list into one string stored in the variable 'text'
text = ' '.join([tag.get_text() for tag in tags]).strip()

In [None]:
text[0:100]

## Preparation of text
### Cleaning

The text consists of Latin letters and Chinese characters.

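If you want to filter out the Latin letters and keep only the Chinese characters, you can use the code below. The regular expression matches runs of characters in the Unicode block CJK Unified Ideographs (U+4E00 to U+9FFF).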

Sources:

https://stackoverflow.com/questions/2718196/find-all-chinese-text-in-a-string-using-python-and-regex

https://unicode-table.com/en/blocks/cjk-unified-ideographs/

In [None]:
chinese_list = re.findall(r'[\u4e00-\u9fff]+', text)

In [None]:
chinese_list[0:20]
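
To see what the pattern keeps and what it drops, here is a small self-contained sketch on a made-up sample string (the string 'sample' is an assumption, not taken from the scraped page). Latin letters, digits and punctuation, including Chinese punctuation such as '，' and '。', fall outside the character range and are left out.

```python
import re

# Made-up sample mixing Latin letters, digits, punctuation and Chinese
sample = "Hong Kong 香港，2019年 protests 示威。"

# One or more consecutive characters from U+4E00 to U+9FFF
chinese_runs = re.findall(r'[\u4e00-\u9fff]+', sample)
print(chinese_runs)  # ['香港', '年', '示威']
```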

The list can then be assembled into a text string again with .join()

_Source: https://www.w3schools.com/python/ref_string_join.asp_

In [None]:
chinese_text = ' '.join(chinese_list)

In [None]:
chinese_text[0:200]

### Text segmentation / tokenisation

We pass jieba.lcut() a text string or, as in this case, a variable containing a text string, and we choose the cut mode. The l in .lcut() indicates that the method returns a list. cut_all=True (full mode) returns all the possible words it can find in the text; it is fast but less accurate. cut_all=False (accurate mode, the default) is more precise and thus better suited to text analysis. _Source: https://github.com/fxsjy/jieba_

In [None]:
seg_list1 = jieba.lcut(chinese_text, cut_all=False)

In [None]:
seg_list1[0:20]

The result contains many items that consist of a single space (' '), because we joined the list with spaces before segmenting.

To remove these items, we add a condition with 'if'. Only items that are not equal to a single space are stored in the variable seg_list2.

In [None]:
seg_list2 = [item for item in seg_list1 if item != ' ']

In [None]:
seg_list2[0:20]
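
Note that the condition item != ' ' only removes items that are exactly one space. If the text could contain other whitespace, such as line breaks or runs of several spaces, a slightly more general filter is to keep only items that are non-empty after .strip(). The list 'sample_tokens' below is made up for illustration.

```python
# Made-up token list with different kinds of whitespace items
sample_tokens = ['反对', ' ', '逃犯', '\n', '条例', '  ']

# Removes only the items that are exactly one space
only_single_spaces_removed = [item for item in sample_tokens if item != ' ']

# Removes every item that is empty or whitespace-only
all_whitespace_removed = [item for item in sample_tokens if item.strip()]

print(only_single_spaces_removed)
print(all_whitespace_removed)
```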

As shown above, lists are made using square brackets ( [ ] ).

You can access the elements in the list by referring to the index number. Again, we can use both positive and negative numbers. Remember that in Python the first index number is 0 and not 1, which means we access the first and last element of the list like this:

In [None]:
print(seg_list2[0])
print(seg_list2[-1])
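
The same square-bracket notation is used for the slices that appear throughout this notebook, such as seg_list2[0:20]. A minimal sketch with a made-up list:

```python
# Made-up list used only to illustrate indexing and slicing
sample_list = ['a', 'b', 'c', 'd', 'e']

first = sample_list[0]          # first element
last = sample_list[-1]          # last element
first_three = sample_list[0:3]  # elements at index 0, 1 and 2 (3 is excluded)
last_two = sample_list[-2:]     # the final two elements

print(first, last, first_three, last_two)
```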

## Part of Speech Tagging (POS)

Jieba's part-of-speech tagger returns each word together with its tag. To use the POS tagger, import it with 'import jieba.posseg as pseg'.

According to the documentation, you call pseg.cut('text_string'). Source: "4. Part of Speech Tagging", https://github.com/fxsjy/jieba

Each returned item holds a word and a tag, available as .word and .flag. The documentation shows how to print words and tags, but I would like to store all words and tags as pairs in a list. Therefore I put each word and tag into a tuple, which is a Python data type, and append each pair to a list which I call 'pos_tags'. Source: Python Tuples, https://www.w3schools.com/python/python_tuples.asp

In [None]:
import jieba.posseg as pseg

# pseg.cut() yields items with the attributes .word and .flag
words = pseg.cut(chinese_text)
pos_tags = []
for w in words:
    # skip items that consist only of whitespace
    if w.word.strip():
        pos_tags.append((w.word, w.flag))

In [None]:
pos_tags[0:20]
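
Storing the pairs as tuples makes them easy to take apart again. The sketch below uses a small hand-written list in the same (word, tag) shape as pos_tags; the words and tags in it are illustrative and not actual jieba output.

```python
# Hand-written sample in the same shape as pos_tags (not real jieba output)
sample_pos_tags = [('警察', 'n'), ('反对', 'v'), ('香港', 'ns')]

# Index into a tuple just like a list
first_word = sample_pos_tags[0][0]
first_tag = sample_pos_tags[0][1]

# Or unpack both elements directly in the loop header
for word, tag in sample_pos_tags:
    print(word, tag)
```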

I write a for loop that contains a condition ('if'). With the loop I go through the list of tuples. If the second element of the tuple ([1]) is equal to 'v', I add the first element ([0]) to the list 'words'.

In [None]:
words = []
for item in pos_tags:
    if item[1] == 'v':
        words.append(item[0])

In [None]:
words[0:20] 
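
The same filter can be wrapped in a small function so that the tag is easy to swap. This is only a sketch: 'words_with_tag' is a hypothetical helper name, and the sample list is hand-written rather than jieba output.

```python
def words_with_tag(tagged_words, wanted_tag):
    """Return the words whose tag equals wanted_tag."""
    return [word for word, tag in tagged_words if tag == wanted_tag]

# Hand-written sample in the same shape as pos_tags (not real jieba output)
sample_pos_tags = [('警察', 'n'), ('反对', 'v'), ('示威', 'v'), ('香港', 'ns')]

verbs = words_with_tag(sample_pos_tags, 'v')
nouns = words_with_tag(sample_pos_tags, 'n')
print(verbs, nouns)
```

With the real data, the call would be words_with_tag(pos_tags, 'v'), and 'v' can be replaced with any other tag that occurs in pos_tags.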

Before looking at the distribution of words with 'v' tags, we need to address the fact that matplotlib's default font cannot display Chinese characters. Therefore, we import 'matplotlib.pyplot as plt' and change font.family to "Microsoft YaHei".

In [None]:
import matplotlib.pyplot as plt
plt.rcParams["font.family"] = "Microsoft YaHei"  # remember: on a Mac, use 'Heiti TC' instead

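After this we can import nltk and use nltk.FreqDist() to plot the 20 most frequent words tagged 'v'.

In [None]:
import nltk
# 'words' holds the words tagged 'v'; pass pos_tags instead to count (word, tag) pairs
nltk.FreqDist(words).plot(20)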
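
If you want the counts as numbers rather than as a plot, note that nltk.FreqDist builds on Python's collections.Counter. The sketch below uses Counter directly on a made-up list to show what is being counted; it is a stand-in for illustration only.

```python
from collections import Counter

# Made-up list of words; in the notebook this would be the list 'words'
sample_words = ['反对', '示威', '反对', '要求', '反对', '示威']

counts = Counter(sample_words)

# most_common(n) returns the n most frequent items with their counts
top_two = counts.most_common(2)
print(top_two)  # [('反对', 3), ('示威', 2)]
```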

## Task: try replacing 'v' with other tags.

## Stop words

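Stop words are very frequent function words, such as particles, pronouns and conjunctions, that carry little meaning on their own and are often removed before analysis.

We therefore need to load a stop word list. It is available online at sciencedata.dk at this link: https://sciencedata.dk/shared/93a217a0533d949d9b2c675cd3c99cfd?download.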

To retrieve the file we use "from urllib.request import urlopen".

In [None]:
from urllib.request import urlopen

target_url = 'https://sciencedata.dk/shared/93a217a0533d949d9b2c675cd3c99cfd?download'

sw_ch = urlopen(target_url).read().decode('utf-8').split()

Now the tokens can be filtered for stop words.

In [None]:
filtered_tokens = []
for word in seg_list2:
    if word not in sw_ch:
        filtered_tokens.append(word)
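
Because sw_ch is a list, every 'word not in sw_ch' check has to scan the list. For longer texts it can be faster to convert the stop words to a set first. A minimal sketch with made-up data ('sample_stop_words' and 'sample_tokens' are illustrative, not the downloaded list):

```python
# Made-up stand-ins for sw_ch and seg_list2
sample_stop_words = ['的', '了', '在', '是']
sample_tokens = ['香港', '的', '示威', '在', '六月']

# Membership tests on a set do not need to scan every element
stop_word_set = set(sample_stop_words)

sample_filtered = [word for word in sample_tokens if word not in stop_word_set]
print(sample_filtered)  # ['香港', '示威', '六月']
```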

We can then plot the most frequent of the remaining words.

In [None]:
# keep the frequency distribution in a variable, then plot it
fdist_filtered = nltk.FreqDist(filtered_tokens)
fdist_filtered.plot(20, title='Most frequent words (without stop words)')

In [None]:
long_tokens = []

for word in filtered_tokens:
    if len(word) > 4:
        long_tokens.append(word)
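
Note that len() counts characters, not bytes, so len(word) > 4 keeps tokens of five or more Chinese characters. A small sketch with made-up tokens:

```python
# Made-up tokens of different lengths
sample_tokens = ['反对', '中华人民共和国', '示威', '香港特别行政区']

# len() counts characters; the UTF-8 encoding uses more bytes per character
characters = len('反对')
utf8_bytes = len('反对'.encode('utf-8'))

# Same condition as in the loop, written as a list comprehension
sample_long = [word for word in sample_tokens if len(word) > 4]
print(characters, utf8_bytes, sample_long)
```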

In [None]:
# frequency distribution of the tokens longer than four characters
fdist_long = nltk.FreqDist(long_tokens)
fdist_long.plot(20, title='Longest words')

## NLTK methods

I have used nltk many times, but never with Chinese text. We experiment and create an nltk text object which should allow us to use various nltk methods.

In [None]:
nltk_text = nltk.Text(seg_list2)

collocation_list() returns a list of the most common word pairs in the text. Note that in some versions of NLTK, collocation_list() does not work. If this is the case, try _collocations()_ instead.

In [None]:
nltk_text.collocation_list()
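
To get a feeling for what a 'word pair' is, the sketch below counts adjacent pairs (bigrams) in a made-up token list using only the standard library. This is a simplification: as far as I know, NLTK ranks pairs with a statistical association measure rather than by raw counts, so the results will not match collocation_list() exactly.

```python
from collections import Counter

# Made-up token list standing in for seg_list2
sample_tokens = ['逃犯', '条例', '修订', '逃犯', '条例', '草案', '逃犯', '条例']

# zip() pairs each token with the one that follows it
bigrams = list(zip(sample_tokens, sample_tokens[1:]))

bigram_counts = Counter(bigrams)
print(bigram_counts.most_common(1))  # [(('逃犯', '条例'), 3)]
```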

The concordance() method shows each occurrence of a given word together with its surrounding context. The amount of output can be adjusted with the width and lines parameters.

In [None]:
nltk_text.concordance('反对', lines=30, width=40)
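
The idea behind a concordance (often called keyword in context) can be sketched in a few lines of plain Python. The function name 'keyword_in_context' and its 'window' parameter are hypothetical and only illustrate the principle; NLTK's own method formats its output differently.

```python
def keyword_in_context(tokens, keyword, window=2):
    """Return (left, keyword, right) for each occurrence of keyword."""
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append((left, token, right))
    return hits

# Made-up token list standing in for seg_list2
sample_tokens = ['市民', '上街', '反对', '修订', '条例', '学生', '也', '反对', '草案']

for left, word, right in keyword_in_context(sample_tokens, '反对'):
    print(' '.join(left), '[' + word + ']', ' '.join(right))
```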

To identify words that appear in a similar context, we can use the similar() method.

I have a notion that the method gives better results the longer the text is.

In [None]:
# words similar to "police"
nltk_text.similar('警察')

You can use the generate() method to generate more or less coherent text based on an existing text.

In [None]:
text_gen = nltk_text.generate(150)