Hi, welcome to this notebook. You can run the following cells step by step, and hopefully you will learn the basics of processing a corpus with Python. Have fun!

## Processing a Single File

### Read the File

In [2]:
filename = 'LCMC_A.xml'
with open(filename, encoding='utf-8') as f:
    read_data = f.read()

The first line declares a variable called `filename`.  
The second line opens the file. Note that `encoding='utf-8'` specifies the file's encoding; *utf-8* is currently the most widely used encoding, so if you don't know a file's encoding, try utf-8 first. (Later I'll write about how to deal with other encodings.)  
The third line reads the entire content and assigns it to the variable `read_data`.
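If a file turns out not to be utf-8, one possible approach is to try a short list of candidate encodings in order. The sketch below is only an illustration: the helper name `read_text` and the fallback `gb18030` (a common encoding for older Chinese text) are my assumptions, not something the LCMC files require.

```python
def read_text(path, encodings=("utf-8", "gb18030")):
    """Return the file's text, trying each candidate encoding in order.

    `read_text` and the default candidate list are illustrative assumptions.
    """
    for enc in encodings:
        try:
            with open(path, encoding=enc) as f:
                return f.read()
        except UnicodeDecodeError:
            continue  # wrong guess, try the next candidate
    raise ValueError(f"none of {encodings} could decode {path}")

# Hypothetical usage with the file from this notebook:
# read_data = read_text('LCMC_A.xml')
```

Put the strictest encoding first: utf-8 rejects most text written in other encodings, whereas a permissive encoding may decode the wrong bytes without raising an error.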

You can run the following cell to check that the file was read successfully.

In [6]:
read_data[:1000]               # print the first 1000 characters

'<?xml version="1.0" encoding="utf-8"?>\n<LCMC ver="character"><header><corpusDesc><corpusName>The Lancaster Corpus of Mandarin Chinese </corpusName><creator>Created by the Department of Linguistics, Lancaster University </creator><funding>Funded by the Economic and social Research Council (ESRC), UK </funding><designer>Designed by Anthony McEnery and Zhonghua Xiao </designer><supervision>Supervised by Anthony McEnery </supervision><textcollect>Texts collected by Zhonghua Xiao </textcollect><proofread>Electronic texts proofread and corrected by Zhonghua Xiao and Xin Huang </proofread><POStag>Segmented and POS-tagged by Zhonghua Xiao </POStag><unicodify>Converted into Unicode by Multilingual Corpus Tools (MLCT) developed by Scott Piao and Andrew Wilson </unicodify></corpusDesc><publication><publisher>Department of Linguistics, Lancaster University, LA1 4YT, UK </publisher><availability region="world"></availability><pubDate>June 2003 </pubDate><contact>Anthony McEnery, a.mcenery@lancast

### Extract Useful Parts

In [7]:
import re
tag = re.findall(r'<w POS="(\w{0,4})">.{0,7}</w>', read_data)
token = re.findall(r'<w POS="\w{0,4}">(.{0,7})</w>', read_data)

The most useful information is the tags and the tokens. The code above extracts them in parallel, into two separate lists. It uses regular expressions (the `re` module), which may look a bit abstruse, but you only need to know what each line does. Recall the form of a word element: `<w POS="a">大</w>`. The second line extracts the content between the quotes (`a`, the tag); the third line extracts the content between the `>` of the opening tag and the `<` of the closing tag (`大`, the token). Because of `\w{0,4}` and `.{0,7}`, the patterns only match tags of up to four characters and tokens of up to seven characters; longer ones are skipped.  
Note that the code above extracts only words, not punctuation marks. If you need to treat punctuation marks as tokens, use the following code instead:  
`tag = re.findall(r'<[wc] POS="(\w{0,4})">.{0,7}</[wc]>', read_data)`,  
`token = re.findall(r'<[wc] POS="\w{0,4}">(.{0,7})</[wc]>', read_data)`.
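To see what these patterns do, you can try them on a tiny string first. The sketch below is a variation, not the code used above: it puts both capture groups into one pattern, so `re.findall` returns (tag, token) pairs in a single pass. The sample string and its tags are made up for illustration and are not taken from the corpus.

```python
import re

# A made-up sentence in the same element format as the corpus (assumption).
sample = '<s n="0001"><w POS="a">大</w><w POS="n">学</w><c POS="ew">。</c></s>'

# With two groups, findall returns a list of (tag, token) tuples.
pairs = re.findall(r'<[wc] POS="(\w{0,4})">(.{0,7})</[wc]>', sample)

tags = [t for t, _ in pairs]
tokens = [w for _, w in pairs]

# The word-only pattern skips the <c> (punctuation) element.
word_pairs = re.findall(r'<w POS="(\w{0,4})">(.{0,7})</w>', sample)

print(pairs)
print(word_pairs)
```

Extracting pairs in one pass guarantees that each tag stays next to its token, which is handy if you later change one of the two patterns and forget the other.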

### Count Frequency

In [8]:
import collections
counter = collections.Counter(token)

In [10]:
counter.most_common(10)

[('的', 4108),
 ('了', 1267),
 ('在', 993),
 ('一', 771),
 ('是', 765),
 ('和', 597),
 ('他', 530),
 ('说', 373),
 ('上', 361),
 ('着', 359)]

Now that we have extracted the useful parts, we can count frequencies. The first line imports the `collections` library, which provides the `Counter` class for counting. The second line counts the items in the token list, and the third line (in the next cell) outputs the 10 most frequent words.
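Raw counts depend on corpus size, so you may also want relative frequencies. Here is a minimal sketch on a made-up token list; the names `toy_tokens`, `toy_counter` and `rel_freq` are only for illustration.

```python
from collections import Counter

# Made-up tokens standing in for the real `token` list (assumption).
toy_tokens = ['的', '了', '的', '在', '的', '了']

toy_counter = Counter(toy_tokens)
top = toy_counter.most_common(2)      # the two most frequent tokens
total = sum(toy_counter.values())     # total number of tokens

# Relative frequency = count / total number of tokens.
rel_freq = {word: count / total for word, count in top}

print(top)
print(rel_freq)
```

The same idea works on the tag list, for example `Counter(tag)`, if you want the distribution of parts of speech instead of words.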

### Indexing Target Words

Well, the frequency list just gives us an overview. Now let's focus on our target words. We need to know where a word occurs in the corpus. For example, the following line finds all indices of the word '是' in the token list.

In [11]:
indices = [i for i, x in enumerate(token) if x == '是']

With these indices, we can perform a variety of tasks, such as viewing the word's concordance or collecting n-grams.  
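As an example of such a task, here is a rough sketch of a concordance (keyword in context). The function name `concordance`, the `width` parameter and the toy sentence are my own assumptions for illustration.

```python
def concordance(tokens, target, width=3):
    """Return (left context, keyword, right context) for every hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = tokens[max(0, i - width):i]   # clamp at the start of the list
            right = tokens[i + 1:i + 1 + width]  # slicing past the end is safe
            lines.append((left, tok, right))
    return lines

# Made-up tokens standing in for the real `token` list (assumption).
toy_tokens = ['他', '是', '学生', '，', '她', '也', '是']
hits = concordance(toy_tokens, '是', width=2)

for left, kw, right in hits:
    print(' '.join(left), '[' + kw + ']', ' '.join(right))
```

Slices never raise an error at the edges of the list, so hits near the beginning or end simply get a shorter context.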

In [12]:
pre_token = [token[i-1] for i in indices]
next_token = [token[i+1] for i in indices]

pre_tag = [tag[i-1] for i in indices]
next_tag = [tag[i+1] for i in indices]

The first two lines store the tokens immediately before and after each '是', and the last two lines store the corresponding tags.
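One caveat with plain indexing: if the target is the very first token, `token[i-1]` quietly returns the last token of the list (index -1), and if the target is the very last token, `token[i+1]` raises an `IndexError`. A possible boundary-safe variant is sketched below; the helper `neighbors` and the use of `None` as a placeholder are assumptions of mine.

```python
from collections import Counter

def neighbors(seq, indices):
    """Return two lists aligned with `indices`: the item before and the item after.

    Positions with no neighbor (start or end of `seq`) get None.
    """
    before = [seq[i - 1] if i > 0 else None for i in indices]
    after = [seq[i + 1] if i + 1 < len(seq) else None for i in indices]
    return before, after

# Made-up tokens standing in for the real `token` list (assumption).
toy_tokens = ['是', '他', '是', '学生', '是']
toy_indices = [i for i, x in enumerate(toy_tokens) if x == '是']

before, after = neighbors(toy_tokens, toy_indices)

# Count what follows the target, ignoring the None placeholders.
after_counts = Counter(x for x in after if x is not None)

print(before, after)
print(after_counts.most_common())
```

The same helper works for the tag list, since it shares its indices with the token list.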