## Mini-Project: Chinese Idiom Game (Approved by Tutor Terrence Broad)

This project is a word-chain game based on traditional Chinese idioms, presented as a simple chatbot, and is intended as a creative approach to Natural Language Processing (NLP). It uses an external dataset created by [GitHub user crazywhalecc](https://github.com/crazywhalecc/idiom-database), which contains 30895 traditional Chinese idioms that the author has processed and divided into 9 attributes. The project loads the data with the Pandas library and its DataFrame structure, followed by a main game function based on search & match logic. Other chatbot-like features include simple RegEx-based conversations with the user and an idiom generator function based on a Markov chain model. The project aims to be a simple tool both for Chinese learners who want to practise their idiom knowledge and for anyone who wants to have fun in their leisure time.

Keywords: NLP, Chinese idiom solitaire game, search & match, RegEx, Markov Chain, chatbot.


*This project topic has been approved by NLP 23-24 tutor Terrence Broad


### Import the libraries and modules

In [None]:
#only run this if any of the modules imported in this notebook are missing
%pip install pandas
%pip install ipython
%pip install markovify

In [1]:
#import the libraries needed for this project
import pandas as pd
import re
import random
import time
from IPython.display import clear_output

### Set up the data

In [2]:
#read the dataset using the pandas module.
df=pd.read_csv('c_idiom.csv')   #this dataset is retrieved from: https://github.com/crazywhalecc/idiom-database [1]
df

Unnamed: 0,derivation,example,explanation,pinyin,word,abbreviation,pinyin_r,first,last
0,语出《法华经·法师功德品》下至阿鼻地狱。”,但也有少数意志薄弱的……逐步上当，终至堕入～。★《上饶集中营·炼狱杂记》,阿鼻梵语的译音，意译为无间”，即痛苦无有间断之意。常用来比喻黑暗的社会和严酷的牢狱。又比喻无...,ā bí dì yù,阿鼻地狱,abdy,a bi di yu,a,yu
1,三国·魏·曹操《整齐风俗令》阿党比周，先圣所疾也。”,《论语·卫灵公》众恶之，必察焉；众好之，必察焉”何晏集解引三国魏王肃曰或众～，或其人特立不群...,指相互勾结，相互偏袒，结党营私。,ē dǎng bǐ zhōu,阿党比周,edbz,e dang bi zhou,e,zhou
2,《汉书·诸葛丰传》今以四海之大，曾无伏节死谊之臣，率尽苟合取容，阿党相为，念私门之利，忘国家...,无,阿党偏袒、偏私一方。为了谋求私利相互偏袒、包庇。,ē dǎng xiāng wéi,阿党相为,edxw,e dang xiang wei,e,wei
3,鲁迅《我们要批评家》然而新的批评家不开口，类似批评家之流便趁势一笔抹杀‘阿狗阿猫’。”,无,旧时人们常用的小名。引申为任何轻贱的，不值得重视的人或著作。,ā gǒu ā māo,阿狗阿猫,agam,a gou a mao,a,mao
4,见阿家阿翁”。,既然如此，你我两个，便学个不痴不聋的～。★《儿女英雄传》二三回,阿名词的前缀。姑丈夫的母亲。翁丈夫的父亲。指公公婆婆。,ā gū ā wēng,阿姑阿翁,agaw,a gu a weng,a,weng
...,...,...,...,...,...,...,...,...,...
30890,清·李宝嘉《文明小史》第四十四回我们做一天和尚撞一天钟，只要不像从前那位老中堂，摆在面上被人...,敷衍了事，得过且过，～。★毛泽东《反对自由主义》,俗语。过一天算一天，凑合着混日子。比喻遇事敷衍，得过且过。也有无可奈何，勉強从事的意思。,zuò yī tiān hé shàng zhuàng yī tiān zhōng,做一天和尚撞一天钟,zythszytz,zuo yi tian he shang zhuang yi tian zhong,zuo,zhong
30891,宋·释悟明《联灯会要·重显禅师》却顾侍者云‘适来有人看方丈么？’侍者云‘有。’师云‘作贼人心...,这个毛病，起先人家还不知道，这又是他们～弄穿的。★清·吴趼人《二十年目睹之怪现状》第六十回,虚怕。指做了坏事怕人知道，心里老是不安。,zuò zéi xīn xū,做贼心虚,zzxx,zuo zei xin xu,zuo,xu
30892,语出《醒世恒言·卖油郎独占花魁》那些有势有力的不肯出钱，专要讨人便宜。及至肯出几两银子的，女...,[蒋淑真]梳个纵鬓头儿，着件叩身衫子，～，乔模乔样。★《醒世通言蒋淑真刎颈鸳鸯会》,装模作样，故意做出一种姿态。,zuò zhāng zuò shì,做张做势,zzzs,zuo zhang zuo shi,zuo,shi
30893,语出《醒世恒言·卖油郎独占花魁》那些有势有力的不肯出钱，专要讨人便宜。及至肯出几两银子的，女...,沈琼枝看那两个妇人时，一个二十六七岁光景，一个十七八岁，乔素打扮，～的。★清·吴敬梓《儒林外...,犹言装模作样，装腔作势。,zuò zhāng zuò zhì,做张做致,zzzz,zuo zhang zuo zhi,zuo,zhi


### Game mode 1: Chinese idiom solitaire 

In [3]:
#get the column of the idiom word and its pinyin. Referenced from NLP-23-24 Week 6 'classification-lecture.ipynb' [2]
words=df["word"]    
pinyin=df["pinyin"]

print(pinyin[0])

#split each pinyin string in the series into a list of syllables.  Referenced from https://saturncloud.io/blog/how-to-split-one-column-into-multiple-columns-in-pandas-dataframe/#:~:text=use%20the%20pd.-,Series.,list%20as%20a%20new%20column.
pinyin=pinyin.str.split()   
print(pinyin[0])

ā bí dì yù
['ā', 'bí', 'dì', 'yù']
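
The game function below scans the whole `words` and `pinyin` series on every turn. One possible alternative, sketched here under assumptions, is to build two lookup tables once and reuse them. The names `toy_words`, `toy_pinyin` and `build_indexes` are hypothetical, and the four idioms are a toy stand-in for the real dataset:

```python
from collections import defaultdict

# Toy stand-ins for the 'word' column and the split 'pinyin' column (not the real dataset)
toy_words = ["一举两得", "得过且过", "德才兼备", "过河拆桥"]
toy_pinyin = [
    ["yī", "jǔ", "liǎng", "dé"],
    ["dé", "guò", "qiě", "guò"],
    ["dé", "cái", "jiān", "bèi"],
    ["guò", "hé", "chāi", "qiáo"],
]

def build_indexes(words, pinyin):
    """Map each first character / first syllable to the row positions that start with it."""
    by_char = defaultdict(list)
    by_syllable = defaultdict(list)
    for position, (word, syllables) in enumerate(zip(words, pinyin)):
        by_char[word[0]].append(position)
        by_syllable[syllables[0]].append(position)
    return by_char, by_syllable

by_char, by_syllable = build_indexes(toy_words, toy_pinyin)
print(by_char["得"])        # positions of idioms starting with the character 得
print(by_syllable["dé"])   # positions of idioms whose first syllable is dé
```

With tables like these, finding the candidates for a reply becomes a single dictionary lookup instead of a loop over roughly 30000 rows.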


In [4]:
#define the game as a function

def idiom_game():

    game_running=True   #set this to control the while loop below
    nextRound=True  #used to check the state if the player can continue to the next round

    #get a random index (they share the same index since being in the same row)     https://stackoverflow.com/questions/58551425/how-to-feed-random-numbers-as-indices-to-pandas-data-frame
    r_index=words.sample().index[0]  #https://stackoverflow.com/questions/45968529/return-the-index-using-pandas-series-sample

    previous_answer=words[r_index]  #this stores the previous answer from the bot

    pa_char=previous_answer[-1]  #this stores the last character of the previous answer from the bot

    pa_pinyin=pinyin[r_index]  #this stores the pinyin syllables (as a list) of the previous answer from the bot

    pa_pinyin_last=pa_pinyin[-1]  #this stores the last pinyin syllable of the previous answer from the bot

    used_idiom=[]   #this stores the idioms that have been used by the bot and the player

    #the bot starts the game with a random idiom from the dataset and ask for user input, the variables are set to string since the input can only take one parameter https://stackoverflow.com/questions/58223407/im-getting-an-error-saying-raw-input-takes-from-1-to-2-positional-arguments
    user_input=input("Welcome to the Chinese Idiom game. I'll start with: " + str(previous_answer) + " " + " ".join(pa_pinyin) + ". Please enter an idiom: ")

    used_idiom.append(previous_answer)    #add the bot idiom to the used list

    #use a while loop for continuous game play. https://www.programiz.com/python-programming/while-loop
    while game_running:  #if the game_running state is True, the game will continue to run

        #allow the player to quit during the game by typing in any of the following words
        reg = r"exit|quit|bye|goodbye|leave|stop"   #define the regular expression for the quit keywords
        m = re.search(reg, user_input, re.IGNORECASE)  #search for this expression in the user input, and ignore the case
        if m:
            print("You ended the game. You will return to the main menu.")
            game_running=False  #end the game

        #check if the user input is in the idiom dataset and has not been used. Referenced from: #https://saturncloud.io/blog/how-to-check-if-pandas-column-has-value-from-list-of-strings/   
        elif words.isin([user_input]).any() and user_input not in used_idiom:   #user_input is placed in '[]' since the isin() function takes a list as an argument, .any() checks if at least 1 value in 'words' matches
            
            print("You entered: " + user_input)

            #locate the user input in 'words' 
            user_input_index=words[words==user_input].index[0]
            
            #get the first character and the pinyin data of the user input
            user_input_char_1st = user_input[0]
            user_input_pinyin_1st = pinyin[user_input_index][0]

            #get the last character of the user input
            user_input_char_last = user_input[-1]

            #get the last pinyin character of the user input
            user_input_pinyin_last=pinyin[user_input_index][-1] 


            #create two empty lists to store the idioms that have the same first character as the last character of the user input and the idioms that have the same first pinyin character as the last pinyin character of the user input
            word_list=[]
            pinyin_list=[]
            
            #if the first character of user_input is the same as the last character of the bot's previous answer
            if user_input_char_1st == pa_char:     
                print("the character matches")

                #the user input is all good then, the following will be the bot looking for its answer:
                
                #search in the dataset for the index of the idioms that have the same first character as the last character of the user input
                #at the same time, search in the dataset for the index of the idioms that have the same first pinyin character as the last pinyin character of the user input
                
                #this for-loop is searching in 'words' for the idioms that have the same first character as the last character of the user input
                for index, word in words.items():
                    #if any idiom in the dataset has the same first character as the last character of the user input, add the idiom to 'word_list' with its index
                    if word[0]==user_input_char_last:
                        word_list.append((word, index))

                #same logic as above, this for-loop is searching in 'pinyin' for the idioms that have the same first pinyin as the last pinyin of the user input
                for index, pin in pinyin.items():
                    #if any idiom pinyin in the dataset has the same first character as the last character of the user input, add it to 'pinyin_list' with its index
                    if pin[0]==user_input_pinyin_last:   
                        pinyin_list.append((pin, index))

            #if the first pinyin character is the same as the last pinyin character of the previous answer
            elif user_input_pinyin_1st == pa_pinyin_last: 
                print("the pinyin matches")
                
                #perform the same search as the above if statement
                for index, pin in pinyin.items():
            
                    if pin[0]==user_input_pinyin_last:  
                        pinyin_list.append((pin, index)) 

                for index, word in words.items():
                    
                    if word[0]==user_input_char_last:    
                        word_list.append((word, index))  
            
            #if the first character or pinyin of user_input is not the same as the last character of the bot's previous answer, the player loses
            else:   
                print("I don't think your answer matches my idiom. You lose!")
                
                #switch to the end-game states
                nextRound=False
                game_running=False  

            #the bot only looks for an answer if the player's idiom was accepted
            if nextRound:
                used_idiom.append(user_input)   #add the player's idiom to the used list

                #keep only the candidates that have not been used yet, so the bot can never loop forever on a used-up list
                unused_words=[(w, i) for w, i in word_list if w not in used_idiom]
                unused_pinyin=[(words[i], i) for p, i in pinyin_list if words[i] not in used_idiom]

                #a first-character match has higher priority (a better answer!), fall back to a pinyin match otherwise
                candidates=unused_words or unused_pinyin

                if candidates:
                    #randomly select an idiom from the list. Referenced from https://www.w3schools.com/python/ref_random_choices.asp
                    bot_answer=random.choice(candidates)   #bot_answer is an (idiom, index) tuple
                    print("Nice! I'll go: ", bot_answer[0], pinyin[bot_answer[1]])   #show the pinyin as well

                    #update the bot's last idiom data to this new idiom
                    previous_answer=bot_answer[0]
                    pa_char=previous_answer[-1]
                    pa_pinyin_last=pinyin[bot_answer[1]][-1]
                    used_idiom.append(previous_answer)  #add the bot answer to the used list

                    user_input=input("Your turn: ") #wait for user input

                #if both lists are empty, it means the bot cannot find a correct answer, the player wins!
                elif not word_list and not pinyin_list:
                    print("You got me! I can't think of any idioms that can follow " + user_input + ". You win!")
                    game_running=False

                #this means all the character-match and pinyin-match results have been used before, the player wins
                else:
                    print("You got me! I can't think of any idiom that has not been used to answer. You win!")
                    game_running=False  #the player won, end the game
            
        #if the user entered an idiom that has been used before, the player loses
        elif words.isin([user_input]).any() and user_input in used_idiom:
            print("You entered: " + user_input)
            print("Sorry, it seems this idiom has been used by us before. You lose!")
            game_running=False  #end the game, otherwise the loop would repeat this message forever
        
        #if the user input is not in 'words', the player loses as the input is not an idiom
        else:
            print("You entered: " + user_input)
            print("Sorry, I don't think your input is an existing idiom. You lose!")
            game_running=False
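
The chaining rule checked inside `idiom_game()` can also be read in isolation. The following is a minimal sketch of that rule only, assuming each idiom comes with its list of pinyin syllables as prepared earlier; the function name `link_type` is hypothetical and not part of the notebook:

```python
def link_type(prev_word, prev_pinyin, next_word, next_pinyin):
    """Return 'character', 'pinyin', or None for how next_word chains onto prev_word."""
    # Best case: the new idiom starts with the exact last character of the previous one
    if next_word[0] == prev_word[-1]:
        return "character"
    # Otherwise accept a different character that has the same (tone-marked) syllable
    if next_pinyin[0] == prev_pinyin[-1]:
        return "pinyin"
    return None

prev = ("一举两得", ["yī", "jǔ", "liǎng", "dé"])
print(link_type(*prev, "得过且过", ["dé", "guò", "qiě", "guò"]))    # character
print(link_type(*prev, "德才兼备", ["dé", "cái", "jiān", "bèi"]))    # pinyin
print(link_type(*prev, "过河拆桥", ["guò", "hé", "chāi", "qiáo"]))   # None
```

Because the comparison uses the tone-marked `pinyin` column, syllables that differ only in tone do not count as a match in this sketch.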
        

### Game mode 2: idiom-pedia

Using the same search & match logic that `idiom_game()` applies to the user input, the bot can also look up an idiom and show its explanation, source and an example:

In [5]:
#define the idiom knowledge function

def pedia():

    #get the columns of the idiom word, explanation, example and derivation from the idiom dataset
    exps=df["explanation"]
    examples=df["example"]
    sources=df["derivation"]    

    word_to_exp=input("What is the idiom you want to learn about?")

    #check if the user input is an existing chinese idiom first
    if words.isin([word_to_exp]).any():
        user_input_index=words[words==word_to_exp].index[0]
        print(word_to_exp)
        exp=exps[user_input_index]
        example=examples[user_input_index]
        source=sources[user_input_index]
        print("The explanation of the idiom is: ", exp)
        print("This idiom is from: ", source)
        print("An example of using this idiom is: ", example)

    else:
        print("Sorry, I don't think your input is an existing idiom. Please retry.")
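
The lookup in `pedia()` can be illustrated without the dataset. This is a small sketch, assuming the rows are available as plain dictionaries; `toy_rows`, `build_lookup` and `describe` are hypothetical names, and the two records are copied from the dataset preview above:

```python
# Toy records copied from the dataset preview (only three of the nine fields kept)
toy_rows = [
    {"word": "阿狗阿猫", "pinyin": "ā gǒu ā māo",
     "explanation": "旧时人们常用的小名。引申为任何轻贱的，不值得重视的人或著作。"},
    {"word": "做张做势", "pinyin": "zuò zhāng zuò shì",
     "explanation": "装模作样，故意做出一种姿态。"},
]

def build_lookup(rows):
    """Index the records by idiom so that each query is a single dictionary access."""
    return {row["word"]: row for row in rows}

def describe(idiom, lookup):
    """Return a one-line description, or None when the idiom is not in the lookup."""
    row = lookup.get(idiom)
    if row is None:
        return None
    return f'{row["word"]} ({row["pinyin"]}): {row["explanation"]}'

lookup = build_lookup(toy_rows)
print(describe("阿狗阿猫", lookup))
print(describe("不是成语", lookup))   # not in the toy data, so this prints None
```

Returning `None` for an unknown idiom leaves the caller free to decide how to phrase the "please retry" message.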

### Game mode 3: generate random 'idiom'

In [6]:
#Define the markov-chain based idiom generator. Referenced from NLP-23-24 Week 5 'text-generation-with-markov-chains.ipynb' [4]
import markovify

def markov():
    char_list = [list(word) for word in words]    #create a list of lists, each list contains the characters of an idiom
    chain = markovify.Chain(char_list, state_size=3)   #predict the next character based on the previous 3 characters
    
    new_str = ''
    #generate 6 4-character idioms
    for i in range(6):
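        for j in range(4):
            example_output = chain.walk()   #walk the chain once; this returns a list of characters
            new_str += example_output[0]    #keep only the first character of this walk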
        new_str += '\n'  #add a new line after each idiom
    
    time.sleep(2)  #add a delay to simulate 'thinking'
    print("I've generated 6 Chinese idioms for you:\n" + new_str)
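
To make the idea of a character-level Markov chain concrete, here is a minimal standard-library sketch. It is not the `markovify` implementation used above: it assumes a state size of 1 (each character depends only on the one before it), and the names `build_chain` and `generate` and the four-idiom corpus are hypothetical:

```python
import random
from collections import defaultdict

def build_chain(idioms):
    """Record, for every character, the characters that follow it in the corpus."""
    transitions = defaultdict(list)
    starts = []
    for idiom in idioms:
        starts.append(idiom[0])
        for current, following in zip(idiom, idiom[1:]):
            transitions[current].append(following)
    return starts, transitions

def generate(starts, transitions, length=4, rng=random):
    """Walk the chain for up to `length` characters, stopping early at a dead end."""
    current = rng.choice(starts)
    result = [current]
    while len(result) < length and transitions.get(current):
        current = rng.choice(transitions[current])
        result.append(current)
    return "".join(result)

toy_idioms = ["一举两得", "得过且过", "过河拆桥", "两全其美"]
starts, transitions = build_chain(toy_idioms)
print(generate(starts, transitions, rng=random.Random(0)))
```

Every adjacent pair of characters in the output was seen somewhere in the corpus, but the sequence as a whole may be new, which is what makes the result look like a plausible but invented idiom.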

### Make a simple Chatbot using RegEx to take user's instructions:

In [7]:
#Define a simple ChatBot using regex. Codes referenced from NLP-22-23 Week 3 'bonus-exercise-regexp.ipynb' [3]

def simpleBot(msgText):
    
    #start the idiom game
    reg = r"1"
    m = re.search(reg, msgText) 
    if m:
        print("Sure thing! Let's start!")
        time.sleep(1)
        print("In case you are not familiar with the rules, here is a quick recap:\nI will start the game with a random Chinese idiom, with its pinyin next to the word. You need to reply with another idiom that starts with")      
        print("the same character as the last character of my idiom, or has the same pinyin syllable. For example, if I say '一举两得 yī jǔ liǎng dé', you")
        print("can reply with '得过且过 dé guò qiě guò' or '德才兼备 dé cái jiān bèi'. Whoever cannot answer the other with a correct Chinese idiom loses.\n")
        print("Note: you can only use each idiom once or you lose!")
        time.sleep(12)   #add a delay before the game starts for the player to read the rules
        clear_output()   #clear the output. Referenced from https://stackoverflow.com/questions/24816237/ipython-notebook-clear-cell-output-in-code
        
        idiom_game()   #start the idiom game
        return
    
    #run the idiom information function
    reg = r"2" 
    m = re.search(reg, msgText)
    if m:
        pedia()
        return
    
    #run the markov chain idiom generator function
    reg = r"3" 
    m = re.search(reg, msgText)
    if m:
        markov()
        return

    #Tell the user that the bot cannot understand the input yet
    else:
        print("Sorry, I couldn't understand this yet - but I'll keep learning!")
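
The repeated "search, then call a function" pattern in `simpleBot()` could also be written as a table of routes. The sketch below is only one possible arrangement: the handler functions are hypothetical placeholders that return a label instead of running the real game modes:

```python
import re

# Hypothetical placeholder handlers; in the notebook these would be
# idiom_game(), pedia() and markov()
def start_game():
    return "game"

def start_pedia():
    return "pedia"

def start_generator():
    return "generator"

def not_understood():
    return "unknown"

# Each route pairs a compiled pattern with the handler to call on a match
ROUTES = [
    (re.compile(r"1"), start_game),
    (re.compile(r"2"), start_pedia),
    (re.compile(r"3"), start_generator),
]

def route(message, routes=ROUTES, fallback=not_understood):
    """Call the handler of the first pattern found in the message."""
    for pattern, handler in routes:
        if pattern.search(message):
            return handler()
    return fallback()

print(route("2"))       # pedia
print(route("hello"))   # unknown
```

Adding a new mode then only requires one more entry in `ROUTES`. As in the original function, the patterns match anywhere in the message, so an input such as "13" is routed to the first mode.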


### Define the play() function for game starting:

In [8]:
#Define the 'play()' function to start the game by calling it

def play():
    #create a while loop to make sure the conversation continues until the user types 'exit'
    while True:
        #create a simple 'menu' for the user to choose what to do
        msgText = input("Hi there! I'm a simple Chinese Idiom Chain Game chatbot. Just key in 'play' to start! If you want to exit any time, just tell me. ")
        if msgText == 'play':
            clear_output()  #clear the previous cell output each time the user starts the game
            msgText = input("Choose what I should do by typing its number: 1.Chinese idiom solitaire game; 2.Chinese idiom information; 3.Chinese idiom generator. What's your idea? ")
            print("You selected game mode " + msgText)
        
        #exit the game if the user enters any of the words
        reg = r"exit|leave|bye|over|stop"
        m = re.search(reg, msgText, re.IGNORECASE) 
        if m:
            print("Thanks for playing with me. Bye!")
            break
        
        simpleBot(msgText)

Now the structure is completed. Run the code below to interact with the bot!

In [9]:
#let's play!
play()

#if you can't read chinese or can't come up with an answer, then...
#here is a chinese idiom search tool, just copy the last character of the bot's idiom and paste in the search box at this website: https://cy.hwxnet.com/

You entered: 土生土长
the pinyin matches
Nice! I'll go:  长才短驭 ['cháng', 'cái', 'duǎn', 'yù']
You entered: 玉石俱焚
the pinyin matches
Nice! I'll go:  焚香膜拜 ['fén', 'xiāng', 'mó', 'bài']
You entered: 败军之将
the pinyin matches
Nice! I'll go:  将遇良材 ['jiàng', 'yù', 'liáng', 'cái']
You entered: 财源广进
Sorry, I don't think your input is an existing idiom. You lose!
Thanks for playing with me. Bye!


### References List:

[1] Crazywhalecc. (2020). CRAZYWHALECC/idiom-database: 成语数据库，成语接龙数据库，拥有30000+个成语，可直接使用首拼音和尾拼音编写自己的成语接龙 [online] GitHub. Available at: https://github.com/crazywhalecc/idiom-database [Accessed 27 November 2023]. 

[2] Broad, T. and Fiebrink, R. (2023). Week-6-Classification/classification-lecture.ipynb [online] GitHub. Available at: https://git.arts.ac.uk/tbroad/NLP-23-24/blob/main/Week-6-Classification/classification-lecture.ipynb [Accessed 1 December 2023]. 

[3] Broad, T. and Fiebrink, R. (2023). Week-3-Manipulating-text/bonus-exercise-regex.ipynb [online] GitHub. Available at: https://git.arts.ac.uk/tbroad/NLP-23-24/blob/main/Week-6-Classification/classification-lecture.ipynb [Accessed 1 December 2023]. 

[4] Broad, T. and Fiebrink, R. (2023). Week-5-Web-data-and-generative-text/text-generation-with-markov-chains.ipynb [online] GitHub. Available at: https://git.arts.ac.uk/tbroad/NLP-23-24/blob/main/Week-6-Classification/classification-lecture.ipynb [Accessed 3 December 2023]. 