# Data Processing

Qidu Fu

**Contents**
- [0 Import libraries](#0-import-libraries)
- [1 Load data](#1-load-data)
    - [1.1 Get stack exchange data](#11-get-stack-exchange-data)
    - [1.2 Get SEAME data](#12-get-seame-data-development-dataset)
    - [1.3 Get ASCEND data](#13-get-ascend-data)
- [2 Process the stack exchange data: the title column](#2-process-the-stack-exchange-data-the-title-column)
- [3 Process the stack exchange data: the tags and title columns](#3-process-the-stack-exchange-tags-and-title-columns)
- [4 Get basic information: cleaned stack exchange dataset](#4-get-basic-information-cleaned-stack-exchange-dataset)

## 0 Import libraries



In [57]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import ssl
import datasets
import re

## 1 Load data

### 1.1 Get stack exchange data
Observations
- 1. No NULL values in the data
- 2. The dataset includes 12,041 records/rows and 3 columns

In [58]:
# The stack exchange data is from stack exchange API

def get_stack_data(path):
    """
    Load the stack exchange data
    param path: str: path to the stack exchange data
    return: pd.DataFrame: the stack exchange data
    """
    return pd.read_csv(path)

STACK_DF = get_stack_data('private/stack_exchange_all_questions.csv')
print(STACK_DF.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12041 entries, 0 to 12040
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   question_id  12041 non-null  int64 
 1   title        12041 non-null  object
 2   tags         12041 non-null  object
dtypes: int64(1), object(2)
memory usage: 282.3+ KB
None


In [59]:
STACK_DF.head()

Unnamed: 0,question_id,title,tags
0,59815,My translation of Li Bai&#39;s 《三五七言》,"translation, poetry"
1,53654,What do these characters on an antique mural p...,"character-identification, traditional-characte..."
2,59823,Help in translating Li Bai&#39;s 《月下独酌&#183;其二》,"translation, poetry"
3,59821,purpose of using 了 with 要不,grammar
4,59652,Why does the character 的 is pronounced differe...,"pronunciation, songs"


### 1.2 Get SEAME data (development dataset)
Observations
- 1. No NULL values in the data
- 2. The dataset includes 5,321 records/rows and 2 columns

In [60]:
# The SEAME DATASET IS FROM: 
# Zhiping Zeng, Yerbolat Khassanov, Van Tung Pham, Haihua Xu, Eng Siong Chng, 
# and Haizhou Li, “On the End-to-End Solution to Mandarin-English Code-switching 
# Speech Recognition,” arXiv preprint arXiv:1811.00241, 2018.

def get_seame_data(path):
    """
    This function reads the SEAME dataset from the given path 
    and returns a pandas dataframe.
    param path: str: path to the SEAME dataset
    return: pd.DataFrame: SEAME dataset
    """
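    # Temporarily ignore SSL certificate errors, keeping the original
    # context factory so that it can be restored afterwards
    default_context = ssl._create_default_https_context
    ssl._create_default_https_context = ssl._create_unverified_context
    try:
        # Read the file without specifying column names
        df = pd.read_csv(path, sep='\t', header=None)
    finally:
        # Restore the default SSL certificate verification
        ssl._create_default_https_context = default_context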
    # Split the first column into 'ID' and 'sentence'
    df[['ID', 'sentence']] = df[0].str.split(pat=' ', n=1, expand=True)
    # Drop the original column
    df = df.drop(columns=[0])
    return df

SEAME_DF = get_seame_data('https://raw.githubusercontent.com/zengzp0912/' + 
                            'SEAME-dev-set/refs/heads/master/dev_sge/text')

In [61]:
SEAME_DF.head()

Unnamed: 0,ID,sentence
0,nc15m-08nc15mbp_0101-00190-00481,hello hello 可 以
1,nc15m-08nc15mbp_0101-01044-01130,往 下 一 点
2,nc15m-08nc15mbp_0101-02185-02444,那 个 wave 比 较 慢 一 点 啊
3,nc15m-08nc15mbp_0101-02503-02571,okay 可 以
4,nc15m-08nc15mbp_0101-02838-02963,讲 多 一 点 咯 可 以


In [62]:
SEAME_DF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5321 entries, 0 to 5320
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   ID        5321 non-null   object
 1   sentence  5321 non-null   object
dtypes: object(2)
memory usage: 83.3+ KB


In [63]:
# save it to a csv file for future use
SEAME_DF.to_csv('private/seame.csv', index=False)
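
The ID/sentence split used in `get_seame_data` can be tried offline on an in-memory sample. The sketch below assumes two made-up utterance IDs (`utt_0001`, `utt_0002`) laid out as `ID sentence`, like the SEAME `text` file; it is an illustration only and does not download anything.

```python
import io

import pandas as pd

# Hypothetical two-line sample in the "ID sentence" layout
sample = "utt_0001 hello hello 可 以\nutt_0002 往 下 一 点\n"

# With a tab separator and no header, each line lands in a single column 0
sample_df = pd.read_csv(io.StringIO(sample), sep="\t", header=None)

# n=1 splits only at the first space, so the sentence keeps its own spaces
sample_df[["ID", "sentence"]] = sample_df[0].str.split(pat=" ", n=1, expand=True)
sample_df = sample_df.drop(columns=[0])
print(sample_df)
```

Limiting the split to the first space matters because the transcripts separate Chinese characters with spaces.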

### 1.3 Get ASCEND data
Observations
- 1. No NULL values in the data
- 2. The dataset includes 9,869 records/rows and 3 columns

In [64]:
# The third dataset is ASCEND
# The ASCEND DATASET IS FROM:
# Lovenia, H., Cahyawijaya, S., Winata, G., Xu, P., Xu, Y., Liu, Z., Frieske, 
# R., Yu, T., Dai, W., Barezi, E. J., Chen, Q., Ma, X., Shi, B., & Fung, 
# P. (2022, June). ASCEND: A spontaneous Chinese-English dataset 
# for code-switching in multi-turn conversation. Proceedings of the 
# Language Resources and Evaluation Conference, 7259–7268. European 
# Language Resources Association. https://aclanthology.org/2022.lrec-1.788

def get_ascend_data():
    """
    Load the ASCEND dataset
    param: None
    return: tuple of pd.DataFrame: the ASCEND training and development datasets
    """
    # Load the full dataset (the audio feature is dropped below by column selection)
    data = datasets.load_dataset('CAiRE/ascend')

    # Convert dataset to a Pandas DataFrame
    ascend_tr_df = data['train'].to_pandas()
    # Use the development set for now
    ascend_dev_df = data['validation'].to_pandas()

    # Keep only the required columns
    ascend_tr_df = ascend_tr_df[["id", "transcription", "topic"]]
    ascend_dev_df = ascend_dev_df[["id", "transcription", "topic"]]

    return ascend_tr_df, ascend_dev_df

# Load ASCEND dataset with selected columns
ASCEND_DF, ASCEND_DV_DF = get_ascend_data()
ASCEND_DV_DF.head()


Unnamed: 0,id,transcription,topic
0,0,嗯,technology
1,1,小朋友我回想一下when i was in elementary school,technology
2,2,like year three,technology
3,3,i have a phone but it's not a smart phone at t...,technology
4,4,就是其实主要的功能就是打电话然后跟parents联系,technology


In [65]:
# ASCEND_DV_DF.info()

In [66]:
ASCEND_DF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9869 entries, 0 to 9868
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   id             9869 non-null   object
 1   transcription  9869 non-null   object
 2   topic          9869 non-null   object
dtypes: object(3)
memory usage: 231.4+ KB


In [67]:
# save it to a csv file for future use
ASCEND_DF.to_csv('private/ascend.csv', index=False)

## 2 Process the stack exchange data: the title column
The data processing pipeline aims to clean the dataset to the level of the 
SEAME dataset from the published Chinese-English code-switching [paper](https://arxiv.org/abs/1811.00241). 
- 1. Randomly sample 20 rows/records multiple times to check for uncleanliness
- 2. Check for URLs (none present)
    - URLs are not relevant to code-switching. 
- 3. Replace HTML entities with their actual characters, such as `&quot;` with `"`
    - These special symbols are HTML representations of the actual characters. 
    The replacement corrects that.
- 4. Remove punctuation
    - Based on the SEAME dataset, punctuation is not relevant to code-switching.
    - Based on my knowledge as an applied linguist, punctuation may not trigger 
    code-switching for bi-/multi-lingual speakers.
- 5. Remove non-Chinese-or-English characters and tokens that mix numbers and 
    letters (number+letters, letters+numbers)
    - This project focuses on Chinese-English code-switching. 
    Non-Chinese-or-English characters are irrelevant and may be gibberish
    from website scraping. 
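
Steps 3 to 5 can be pictured as a single chained function. The sketch below reuses the regular expressions from the cells that follow, but `clean_title` is a hypothetical name, the final whitespace collapse is an added assumption, and the contraction/possessive handling is left out, so its output will not match the notebook's in every case.

```python
import re


def clean_title(text):
    """Hypothetical one-pass version of steps 3 to 5."""
    # Step 3: decode the two HTML entities handled in this notebook
    text = text.replace("&quot;", '"').replace("&#39;", "'")
    # Step 4: drop punctuation (anything that is not a word char or whitespace)
    text = re.sub(r"[^\w\s]", "", text)
    # Underscores count as word characters, so they need their own rule
    text = re.sub(r"_+", " ", text)
    # Step 5a: drop tokens that contain a digit, e.g. leftovers such as n242ng
    text = re.sub(r"\w*\d\w*", "", text)
    # Step 5b: keep only ASCII and the CJK range U+4E00 to U+9FA5
    text = re.sub(r"[^\x00-\x7F\u4E00-\u9FA5]", "", text)
    # Assumption: collapse the whitespace runs left behind by the removals
    return re.sub(r"\s+", " ", text).strip()


print(clean_title("Why is 弄 (n&#242;ng) in &quot;他终于弄明白了&quot;?"))
```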

In [68]:
def sample_rows(df, n=20):
    """
    This function samples n rows from the given dataframe.
    param df: pd.DataFrame: dataframe to sample from
    param n: int: number of rows to sample
    return: pd.DataFrame: sampled dataframe
    """
    return df.sample(n)

sample_rows(STACK_DF)

Unnamed: 0,question_id,title,tags
8932,23952,Is there any relationship between 田 (ti&#225;n...,characters
5249,34418,What are the characters on this? (Character id...,character-identification
604,58166,what is a real meaning of 致 ？,meaning
1647,53429,What&#39;s the difference between 借代 and 比喻?,rhetoric
9682,18042,How to determine what 即 means in a sentence,grammar
11873,2392,How to cope with classical references?,culture
3029,27270,What is the difference between 于 and 以 to expr...,"difference, word, preposition"
6055,22701,What does 而将活活地被压死 mean?,"translation, meaning"
3573,42954,How do I type 了 with m17n tonepy in ubuntu?,"pinyin, input-methods"
2352,51295,What does 被打败 mean in this context?,meaning-in-context


In [69]:
# Functions to return all the rows that contain a given pattern
def check_pattern_non_regex(df, pattern):
    """
    Return all the rows that contain a given pattern.
    Note: str.contains treats the pattern as a regular expression by default,
    so this function behaves the same as check_pattern_regex.
    param df: pd.Series: column of strings to check
    param pattern: str: pattern to check
    return: pd.Series: rows containing the pattern
    """
    return df[df.str.contains(pattern, na=False)]

def check_pattern_regex(df, pattern):
    """
    Return all the rows that match a given regular expression.
    param df: pd.Series: column of strings to check
    param pattern: str: regular expression to check
    return: pd.Series: rows matching the pattern
    """
    return df[df.str.contains(pattern, na=False, regex=True)]
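
`Series.str.contains` interprets its pattern as a regular expression unless `regex=False` is passed, so the two helpers above return the same rows for the same pattern. A minimal sketch of the difference, using made-up titles:

```python
import pandas as pd

# Made-up titles; None stands in for a missing value
titles = pd.Series(["What does 3.5 mean", "What does 325 mean", None])

# Default (regex=True): '.' matches any character, so both titles match
regex_hits = titles[titles.str.contains("3.5", na=False)]

# regex=False: the pattern is matched literally, so only the first title matches
literal_hits = titles[titles.str.contains("3.5", na=False, regex=False)]

print(len(regex_hits), len(literal_hits))
```

For plain substrings without special characters, such as `q249`, both modes give the same result.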

In [70]:
# Check whether any URL is present in the stack exchange titles
check_pattern_regex(STACK_DF['title'], r'https?://\S+')
# This returns all the rows with a URL in the title column

Series([], Name: title, dtype: object)

In [71]:

# Function to replace A with B using regex
def replace_text(text, pattern, replace_with):
    """
    Replace a pattern in a text with a given string
    param text: str: input text
    param pattern: str: pattern to replace
    param replace_with: str: string to replace the pattern with
    return: str: text with the pattern replaced
    """
    return re.sub(pattern, replace_with, text)

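# Function to replace "&quot;" with '"' and "&#39;" with '"'
def replace_quot(text):
    """
    Replace the HTML entities &quot; and &#39; with "
    Note: &#39; encodes an apostrophe; it is mapped to " here, and the
    resulting "s, "ll, etc. are stripped later by remove_apostrophe.
    param text: str: input text
    return: str: text with &quot; and &#39; replaced with "
    """
    text = replace_text(text, "&quot;", '"')
    text = replace_text(text, "&#39;", '"')
    return text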

# Test the function
print(replace_quot('This is a &quot;test&quot; string'))

This is a "test" string
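
As an alternative to replacing entities one by one, Python's standard library offers `html.unescape`, which decodes named and numeric entities in one call. This is not what the notebook does; the sketch only shows what the decoded titles would look like, including entities such as `&#224;` that the steps above leave in place.

```python
import html

raw = "Help in translating Li Bai&#39;s 《月下独酌&#183;其二》"

# Decodes &#39; to an apostrophe and &#183; to a middle dot
decoded = html.unescape(raw)
print(decoded)

# Numeric entities for accented pinyin letters are decoded as well
print(html.unescape("d&#224;o"))
```

Decoded pinyin diacritics would then be removed by a later non-English/Chinese filter instead of surviving as digit fragments such as `d224o`.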


In [72]:
# Apply the function to the title quotes in the stack exchange dataset
STACK_DF['title'] = STACK_DF['title'].apply(replace_quot)
print(STACK_DF.iloc[10710, 1])
print(STACK_DF.iloc[11633, 1])
print(STACK_DF.iloc[2547, 1])

What is the difference: "和小猫一样可爱" and "像小猫一样可爱"
How do you say "dates back to" in Mandarin?
一 [份/个/项] 工作??? What is the difference?


In [73]:
# Function to remove 's, 'll, and other contraction suffixes
def remove_apostrophe(text):
    """
    Remove contraction suffixes ('ll, 've, 're, 'd, 'm, 'em, 't) and any
    remaining quote characters, for the quote characters ', ", ’ and ‘.
    The possessive s is removed together with ", ’ and ‘ only.
    Caveat: a rule such as "t also fires at the start of a quoted word,
    so "take care" becomes ake care.
    param text: str: input text
    return: str: text with the suffixes and quote characters removed
    """
    suffixes = ["ll", "ve", "re", "d", "m", "em", "t"]
    # Quote characters are processed in this order: ', ", ’, ‘
    for quote in ["'", '"', "’", "‘"]:
        # The possessive rule applies to every quote character except '
        if quote != "'":
            text = text.replace(quote + "s", "")
        for suffix in suffixes:
            text = text.replace(quote + suffix, "")
        # n't -> not (no such rule for "); because the 't suffix is already
        # stripped above, this replacement has nothing left to match
        if quote != '"':
            text = text.replace("n" + quote + "t", "not")
        # Finally drop any remaining quote character
        text = text.replace(quote, "")
    return text

remove_apostrophe("I'm going to the store. He'll be there.")
remove_apostrophe(STACK_DF.iloc[192, 1])

'Understanding and interpretation of 李白 《玉阶怨》'

In [74]:
print(STACK_DF.iloc[192, 1])
STACK_DF['title'] = STACK_DF['title'].apply(remove_apostrophe)
print(STACK_DF.iloc[192, 1])

Understanding and interpretation of 李白"s 《玉阶怨》
Understanding and interpretation of 李白 《玉阶怨》
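
One side effect of the suffix rules is visible in later samples: a quoted word that starts with a suffix letter loses that letter (for example `"take care"` ends up as `ake care`). A possible variant, sketched below under the assumption that a contraction suffix always follows a word character and ends at a word boundary, avoids this. `strip_contractions` is a hypothetical name and not part of the notebook.

```python
import re

# A quote character preceded by a word character and followed by a
# contraction suffix that ends at a word boundary
SUFFIX = re.compile(r"(?<=\w)['’\"](?:s|ll|ve|re|d|m|em|t)\b")
# Any remaining straight or curly quote character
QUOTES = re.compile(r"['\"’‘“”]")


def strip_contractions(text):
    """Hypothetical variant: drop suffixes first, then leftover quotes."""
    text = SUFFIX.sub("", text)
    return QUOTES.sub("", text)


print(strip_contractions('Why 保重 means "take care"'))
print(strip_contractions('李白"s 《玉阶怨》'))
```

Like the notebook's function, this variant still turns `don't` into `don`; expanding negations would need separate rules.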


In [75]:
# Resample again to check other signs of data uncleanliness
sample_rows(STACK_DF)

Unnamed: 0,question_id,title,tags
2837,48642,Is there an ordering to words like 喜欢，喜爱，喜好?,"translation, vocabulary, comparison"
9699,17922,Similarities between connectives 就 and 而?,"meaning, meaning-in-context"
186,59463,Using 有 or 是 to describe amount,"grammar, verbs, quantity"
4913,33834,How did 零 come to mean zero?,"meaning-in-context, etymology"
5313,38866,Are radicals like first letters or letters in ...,radicals
5418,8732,What is the Chinese word for programmer?,"translation, word-requests"
7398,31058,Why is 叫 (call) in 叫我马上到北极?,"meaning-in-context, word"
9074,23255,Identification help,character-identification
1561,53932,Is this sentence correct? 我把梦想献给李白的诗,"translation, meaning-in-context"
11005,9483,Is there a difference between 教训 and 论点 ？,grammar


In [76]:
print('STACK_DF.iloc[7598, 1]: ')
print(STACK_DF.iloc[7598, 1])
print('STACK_DF.iloc[4374, 1]: ')
print(STACK_DF.iloc[4374, 1])
print('STACK_DF.iloc[1041, 1]: ')
print(STACK_DF.iloc[1041, 1])
print('STACK_DF.iloc[1230, 1]: ')
print(STACK_DF.iloc[1230, 1])
print('STACK_DF.iloc[6224, 1]: ')
print(STACK_DF.iloc[6224, 1])
print('STACK_DF.iloc[664, 1]: ')
print(STACK_DF.iloc[664, 1])
print('STACK_DF.iloc[5366, 1]: ')
print(STACK_DF.iloc[5366, 1])

STACK_DF.iloc[7598, 1]: 
Is this [Ivanka Trump] Chinese proverb real? —“Those who say it can not be done, should not interrupt those doing it”
STACK_DF.iloc[4374, 1]: 
What semantic notions underlie o think; to contemplate, with but; however; nevertheless with only?
STACK_DF.iloc[1041, 1]: 
Should the blank in 他毕业以后很快 _____ 找到了一个理想的工作 be filled with 才, 可, or 便?
STACK_DF.iloc[1230, 1]: 
Can you say someone is a ”非常熟的人“？
STACK_DF.iloc[6224, 1]: 
“管你是[surname] + [name]还是[same surname] + [different name]”
STACK_DF.iloc[664, 1]: 
D&#224;o d&#233; and d&#224;od&#233;. Difference between each word separately? And when combined together as a compound word?
STACK_DF.iloc[5366, 1]: 
好客	vs 亲切 - both mean hospitable?


In [77]:
# Remove punctuations from the title column
STACK_DF['title'] = STACK_DF['title'].apply(lambda x: re.sub(r'[^\w\s]', '', x))
print(STACK_DF.iloc[7598, 1])
print(STACK_DF.iloc[4374, 1])
print(STACK_DF.iloc[1041, 1])
print(STACK_DF.iloc[1230, 1])
print(STACK_DF.iloc[6224, 1])
print('STACK_DF.iloc[664, 1]: ')
print(STACK_DF.iloc[664, 1])
print('STACK_DF.iloc[5366, 1]: ')
print(STACK_DF.iloc[5366, 1])
print('STACK_DF.iloc[6782, 1]')
print(STACK_DF.iloc[6782, 1])

Is this Ivanka Trump Chinese proverb real Those who say it can not be done should not interrupt those doing it
What semantic notions underlie o think to contemplate with but however nevertheless with only
Should the blank in 他毕业以后很快 _____ 找到了一个理想的工作 be filled with 才 可 or 便
Can you say someone is a 非常熟的人
管你是surname  name还是same surname  different name
STACK_DF.iloc[664, 1]: 
D224o d233 and d224od233 Difference between each word separately And when combined together as a compound word
STACK_DF.iloc[5366, 1]: 
好客	vs 亲切  both mean hospitable
STACK_DF.iloc[6782, 1]
When handwriting 黄 hu225ng yellow is it incorrect to have a disconnected 草 cǎo grass radical on top


In [78]:
# Check underscores in the title column
check_pattern_regex(STACK_DF['title'], r'_+')

873      How should I understand this teacher objection...
1020     Is it okay to fill in the blanks of 那种汽车 ____ ...
1041     Should the blank in 他毕业以后很快 _____ 找到了一个理想的工作 b...
2859            What does the underscore _ mean in Chinese
5578               Is the o in Let meet from 4 _to_ five 到
5681              How did three characters for _de_ emerge
5943              How to fill in the blanks in 两个队的分数_都很_近
6479     When is it necessary to say _色的 when telling t...
8396                       Do you add 在 before saying ___年
9251                        Can you say 回去 ____ A location
9375                                 Blessed bat pun 蝙蝠__福
10062              Do the notations  ___ LLL mean anything
Name: title, dtype: object

In [79]:
# remove underscores from the title column
STACK_DF['title'] = STACK_DF['title'].apply(lambda x: re.sub(r'_+', ' ', x))
check_pattern_regex(STACK_DF['title'], r'_+')

Series([], Name: title, dtype: object)
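
Underscores needed a separate step because `\w` matches `_`, so the punctuation pattern `[^\w\s]` leaves them in place. A small sketch with one of the titles listed above (the trailing question mark is added here for illustration):

```python
import re

title = "Do you add 在 before saying ___年?"

# The punctuation step removes '?' but keeps the underscores
no_punct = re.sub(r"[^\w\s]", "", title)

# The follow-up step replaces each run of underscores with one space
no_underscore = re.sub(r"_+", " ", no_punct)

print(no_punct)
print(no_underscore)
```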

In [80]:
# Resample again to check other signs of data uncleanliness
sample_rows(STACK_DF)

Unnamed: 0,question_id,title,tags
3494,43468,Idioms Quiz of 23,"idioms, puzzle"
335,40730,How did Middle Chinese h230wk evolve into Mand...,"mandarin, pronunciation, middle-chinese"
1722,53418,Can classifiers be used poetically,measure-word
2161,51869,A 回来用隔离不 when you return do you need to isolat...,"meaning-in-context, humor"
10677,12231,Can someone translate this chinese letter,translation
1094,33348,Why does 葛藤갈등 mean conflict or roubles,"meaning, etymology"
5555,37114,What exactly is an ancient Tang dynasty 壺,"classical-chinese, culture, history, literary-..."
5284,38938,Cantonese song 不可一世 what is the spoken part a...,"cantonese, songs"
2916,48417,How does 味 taste flavor semantically appertain...,etymology
6240,35539,Why does 兴奋 excitement not mean what I think i...,"meaning, usage, meaning-in-context"


In [81]:
print('STACK_DF.iloc[7488, 1]: ')
print(STACK_DF.iloc[7488, 1])
print('STACK_DF.iloc[9375, 1]: ')
print(STACK_DF.iloc[9375, 1])
print('STACK_DF.iloc[6587, 1]: ')
print(STACK_DF.iloc[6587, 1])

STACK_DF.iloc[7488, 1]: 
Why is 把 in 我去洗把脸 wǒ q249 xǐ bǎ liǎn  I going to wash my face
STACK_DF.iloc[9375, 1]: 
Blessed bat pun 蝙蝠 福
STACK_DF.iloc[6587, 1]: 
Why is 弄 n242ng in 他终于弄明白了


In [82]:
check_pattern_non_regex(STACK_DF['title'], 'q249')

7488     Why is 把 in 我去洗把脸 wǒ q249 xǐ bǎ liǎn  I going ...
11613                     Is 去 q249 pronunced tɕʰu or tɕʰy
Name: title, dtype: object

In [83]:
# remove numbers/words with numbers from the title column
print(STACK_DF.iloc[3638, 1])
STACK_DF['title'] = STACK_DF['title'].apply(lambda x: re.sub(r'\w*\d\w*', '', x))
print(STACK_DF.iloc[3638, 1])
print(check_pattern_non_regex(STACK_DF['title'], 'q249'))

Is there any distinction between 二百 232rbǎi and 两百 liǎng bǎi which both mean wo hundred
Is there any distinction between 二百  and 两百 liǎng bǎi which both mean wo hundred
Series([], Name: title, dtype: object)
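
One point to keep in mind about `\w*\d\w*`: in Python 3, `\w` is Unicode-aware and also matches Chinese characters, so a digit written directly next to Chinese text removes that text too. The sketch below compares the default behaviour with `flags=re.ASCII` on a made-up string; neither option is presented as the better one.

```python
import re

DIGIT_TOKEN = r"\w*\d\w*"
text = "第3课 lesson q249 xǐ"  # made-up example

# Default: 第3课 is one "word", so the whole token is removed
unicode_result = re.sub(DIGIT_TOKEN, "", text)

# re.ASCII: \w covers only [a-zA-Z0-9_], so the Chinese characters stay
ascii_result = re.sub(DIGIT_TOKEN, "", text, flags=re.ASCII)

print(repr(unicode_result))
print(repr(ascii_result))
```

The ASCII variant has its own trade-off: for a token such as `232rbǎi` it would stop at the accented letter and leave `ǎi` behind.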


In [84]:
# Function to remove non English or Chinese characters
def remove_non_en_ch(text):
    """
    Remove non-English or Chinese characters
    param text: str: input text
    return: str: text with non-English or Chinese characters removed
    """
    return re.sub(r'[^\x00-\x7F\u4E00-\u9FA5]', '', text)

print(remove_non_en_ch(STACK_DF.iloc[11613, 1]))

Is 去  pronunced tu or ty


In [85]:
print(STACK_DF.iloc[11613, 1])
STACK_DF['title'] = STACK_DF['title'].apply(remove_non_en_ch)
print(STACK_DF.iloc[11613, 1])

Is 去  pronunced tɕʰu or tɕʰy
Is 去  pronunced tu or ty
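
The character class keeps ASCII (`\x00-\x7F`) and the CJK range U+4E00 to U+9FA5, so everything else is dropped, including IPA symbols, pinyin diacritics and Hangul. A short sketch that repeats the function so that it runs on its own:

```python
import re


def remove_non_en_ch(text):
    """Same pattern as in the cell above, repeated to keep the sketch standalone."""
    return re.sub(r"[^\x00-\x7F\u4E00-\u9FA5]", "", text)


# Hangul is removed, the Chinese characters are kept
print(remove_non_en_ch("Why does 葛藤갈등 mean conflict"))
# Accented pinyin letters are removed, leaving fragments behind
print(remove_non_en_ch("wǒ qù"))
```

The second call shows why fragments such as `tu` or `ty` can remain after this step.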


In [86]:
# Resample again to check other signs of data uncleanliness
sample_rows(STACK_DF)

Unnamed: 0,question_id,title,tags
10431,13206,What does 国产地效船双船 mean,meaning
2570,50567,What is the colloquial meaning of 小祖宗,"etymology, idioms"
7365,29715,Help me understandtranslate 不留到晚自习就写完多好 contex...,translation
9761,17542,What is the difference between 涉及 and 涉及到,grammar
2091,27861,How do I type on a keyboard,"cantonese, pinyin, traditional-characters, yal..."
10460,13089,Why 保重 means ake care,"meaning, vocabulary, etymology, phrase"
7379,30742,Under what circumstances can you use 又又,"grammar, usage, sentence-structure"
7064,31943,Function of 吧 in 循环吧代码,"meaning, usage"
4279,41174,Why is 的 used instead of 了 in this sentence,"translation, grammar, simplified-characters"
8401,22352,Need help translating a few words,"translation, word-choice, word-requests"


In [87]:
# check to remove rows with null values on the title column
STACK_DF = STACK_DF.dropna(subset=['title'])
STACK_DF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12041 entries, 0 to 12040
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   question_id  12041 non-null  int64 
 1   title        12041 non-null  object
 2   tags         12041 non-null  object
dtypes: int64(1), object(2)
memory usage: 282.3+ KB


## 3 Process the stack exchange: tags and title columns
The data processing pipeline aims to clean the dataset to the level of the 
SEAME dataset from the published Chinese-English code-switching [paper](https://arxiv.org/abs/1811.00241). 
The tags column is much cleaner than the title column.
- Replace hyphens with spaces, as they are just formatting and not relevant to the topics/tags
- Replace empty strings in the title column with NaN values
- Remove rows with NaN values in the title column
- Rename the columns to `text` and `topic`
- Save the stack exchange data to a CSV file to avoid repeated processing
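
The steps listed above can be sketched on a three-row toy frame. The rows are made up, and the vectorised `.str` methods are used here in place of the `apply` calls in the cells below; this is a compact illustration, not the notebook's code.

```python
import numpy as np
import pandas as pd

# Made-up rows; the second title is whitespace only
toy_df = pd.DataFrame({
    "title": ["purpose of using 了", "   ", "Identification help"],
    "tags": ["grammar", "writing-critique, hsk", "character-identification"],
})

# Replace hyphens in the tags with spaces
toy_df["tags"] = toy_df["tags"].str.replace("-", " ", regex=False)
# Strip whitespace, then turn empty titles into NaN
toy_df["title"] = toy_df["title"].str.strip().replace("", np.nan)
# Drop rows without a title, then rename and select the two columns
toy_df = (
    toy_df.dropna(subset=["title"])
    .rename(columns={"title": "text", "tags": "topic"})[["text", "topic"]]
)
print(toy_df)
```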

In [88]:
# check for non-English-or-Chinese characters
check_pattern_regex(STACK_DF['tags'], r'[^\x00-\x7F\u4E00-\u9FA5]')

Series([], Name: tags, dtype: object)

In [89]:
# check for punctuation other than ,
check_pattern_regex(STACK_DF['tags'], r'[^\w\s,]')

1        character-identification, traditional-characte...
5        translation, character-identification, seal, c...
11                               classical-chinese, poetry
13             translation, character-identification, seal
17        translation, word-choice, terminology, loanwords
                               ...                        
12023                               word-choice, cantonese
12026                                          word-choice
12031                            verbs, meaning-in-context
12032                             mandarin, mainland-china
12038                 characters, character-identification
Name: tags, Length: 4342, dtype: object

In [90]:
# replace - with a space in the tags column
STACK_DF['tags'] = STACK_DF['tags'].apply(lambda x: re.sub(r'-', ' ', x))
check_pattern_non_regex(STACK_DF['tags'], r'-')

Series([], Name: tags, dtype: object)

In [91]:
sample_rows(STACK_DF)

Unnamed: 0,question_id,title,tags
3683,42923,Symbolism Of A Verse From A Chinese Poem,"simplified characters, poetry"
1298,54814,Why it called 蕁麻疹,terminology
589,54208,Can 徐 develop a sound,phonology
9930,16737,Meaning of text on poster,translation
3082,47867,Etymology of 復員,etymology
662,57957,what is the difference between 困惑 and 糊涂,difference
901,56159,How to literally translate this 在赶路时用尽全力地向前冲,"meaning, meaning in context"
3165,3001,Is there a literal meaning of 对不起,meaning
9108,22447,is this translation in Chinese of grammatical ...,translation
3084,47832,Contranyms in the Chinese language,"usage, synonyms, antonyms"


In [92]:
print('STACK_DF.iloc[4763, 2]: ')
print(STACK_DF.iloc[4763, 2])
# the hyphens were already replaced with spaces above, so this is a no-op
STACK_DF['tags'] = STACK_DF['tags'].apply(lambda x: re.sub(r'-', ' ', x))
print(STACK_DF.iloc[4763, 2])

STACK_DF.iloc[4763, 2]: 
traditional characters, software, fonts
traditional characters, software, fonts


In [93]:
# rename the columns, and only select the title and tags columns
STACK_DF = STACK_DF.rename(columns={'title': 'text', 'tags': 'topic'})[['text', 'topic']]
STACK_DF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12041 entries, 0 to 12040
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    12041 non-null  object
 1   topic   12041 non-null  object
dtypes: object(2)
memory usage: 188.3+ KB


In [94]:
# strip surrounding whitespace so that whitespace-only titles become empty strings
print(STACK_DF.iloc[4957, ])
STACK_DF['text'] = STACK_DF['text'].apply(lambda x: x.strip())
print(STACK_DF.iloc[4957, ])
# convert empty strings to NaN
STACK_DF['text'] = STACK_DF['text'].replace('', np.nan)
STACK_DF.info()

text                          
topic    writing critique, hsk
Name: 4957, dtype: object
text                          
topic    writing critique, hsk
Name: 4957, dtype: object
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12041 entries, 0 to 12040
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    12039 non-null  object
 1   topic   12041 non-null  object
dtypes: object(2)
memory usage: 188.3+ KB


In [95]:
# drop rows whose text is NaN (originally empty titles)
STACK_DF = STACK_DF.dropna(subset=['text'])
STACK_DF.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12039 entries, 0 to 12040
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    12039 non-null  object
 1   topic   12039 non-null  object
dtypes: object(2)
memory usage: 282.2+ KB


In [96]:
# save the cleaned df to a csv file
STACK_DF.to_csv('private/stack_exchange_cleaned.csv', index=False)
print('STACK_DF saved to private/stack_exchange_cleaned.csv')

STACK_DF saved to private/stack_exchange_cleaned.csv


In [97]:
# save the first 1000 rows of the cleaned df to a csv file for sample 
STACK_DF.head(1000).to_csv('data_samples/stack_exchange_cleaned_sample.csv', index=False)
print('First 1000 rows of STACK_DF saved to data_samples/stack_exchange_cleaned_sample.csv')

First 1000 rows of STACK_DF saved to data_samples/stack_exchange_cleaned_sample.csv


## 4 Get basic information: cleaned stack exchange dataset

The cleaned dataset contains **12,039 entries** with two columns:

- **text**: Question titles from Stack Exchange (12,039 non-null values, `object` string).
- **topic**: Tags of each question, used as its general discourse domain (12,039 non-null values, `object` string).

Additional information
- The `topic` column still needs processing, such as word tokenization, 
for topic modeling/clustering. At the moment, `value_counts` reports 3,500 
distinct tag combinations.
- The `text` column may still need further processing, such as word tokenization. 
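
One possible first step for the `topic` column is to count individual tags instead of tag combinations. The sketch below assumes the tags are separated by `", "`, as in the samples shown in this notebook, and uses three made-up rows.

```python
import pandas as pd

# Made-up topic values in the same comma-separated layout
topics = pd.Series([
    "translation, poetry",
    "grammar",
    "translation, meaning in context",
])

# Split each row into a list of tags, put one tag per row, then count
tag_counts = topics.str.split(", ").explode().value_counts()
print(tag_counts)
```

Counting single tags usually gives far fewer distinct values than counting combinations, which may be a more convenient starting point for topic clustering.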

In [98]:
CLEANED_STACK_DF = pd.read_csv('private/stack_exchange_cleaned.csv')
CLEANED_STACK_DF.head()

Unnamed: 0,text,topic
0,My translation of Li Bai 三五七言,"translation, poetry"
1,What do these characters on an antique mural p...,"character identification, traditional characte..."
2,Help in translating Li Bai,"translation, poetry"
3,purpose of using 了 with 要不,grammar
4,Why does the character 的 is pronounced differe...,"pronunciation, songs"


In [99]:
CLEANED_STACK_DF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12039 entries, 0 to 12038
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    12039 non-null  object
 1   topic   12039 non-null  object
dtypes: object(2)
memory usage: 188.2+ KB


In [100]:
CLEANED_STACK_DF['topic'].value_counts()

translation                                                            974
grammar                                                                601
meaning                                                                464
meaning in context                                                     238
word choice                                                            217
                                                                      ... 
terms of address, formal                                                 1
meaning in context, etymology, character identification                  1
idioms, colloquialisms                                                   1
character identification, traditional characters, seal, old chinese      1
translation, style                                                       1
Name: topic, Length: 3500, dtype: int64