# Summary

* Investigated the **phonotactic patterns that correlate with gender** in Mandarin Chinese given names.

* Found that, on average, **female names** have a **higher proportion of open syllables and high vowel nuclei**, while **male names** have a **higher proportion of back vowel nuclei, round vowel nuclei, obstruent onsets, and non-coronal onsets**. These tendencies conform to cross-linguistic patterns.

# Preparation

In [None]:
! git clone https://github.com/FulangChen/Predict-Gender-from-Mandarin-Names

Cloning into 'Predict-Gender-from-Mandarin-Names'...
remote: Enumerating objects: 9, done.
remote: Counting objects: 100% (9/9), done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 9 (delta 0), reused 9 (delta 0), pack-reused 0
Unpacking objects: 100% (9/9), done.


In [None]:
import re

* Function to remove tone marks from pinyin

In [None]:
def rmv_tone(py):
    """Replace tone-marked pinyin vowels with their plain counterparts."""
    py = re.sub(r'[āáǎà]', 'a', py)
    py = re.sub(r'[ēéěè]', 'e', py)
    py = re.sub(r'[īíǐì]', 'i', py)
    py = re.sub(r'[ōóǒò]', 'o', py)
    py = re.sub(r'[ūúǔù]', 'u', py)
    py = re.sub(r'[ǖǘǚǜ]', 'ü', py)  # include first-tone ǖ as well
    return py
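
As an alternative sketch (not the approach used in this notebook), tone marks can also be stripped through Unicode normalization, which covers all four tones on every vowel without listing each character. The names `strip_tones` and `TONE_MARKS` are hypothetical and introduced only for illustration.

```python
import unicodedata

# Combining marks used for the four Mandarin tones (assumed set):
# macron (tone 1), acute (tone 2), caron (tone 3), grave (tone 4).
TONE_MARKS = {'\u0304', '\u0301', '\u030c', '\u0300'}

def strip_tones(py):
    """Remove tone diacritics from pinyin but keep the diaeresis of ü."""
    decomposed = unicodedata.normalize('NFD', py)   # split letters from marks
    kept = ''.join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize('NFC', kept)       # recompose u + diaeresis

for syllable in ['zhāng', 'wěi', 'lǚ', 'lǖ']:
    print(syllable, '->', strip_tones(syllable))
```

Because the diaeresis is not in `TONE_MARKS`, it survives the filtering step and is recomposed into `ü`, matching what `rmv_tone` returns for those syllables.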

* Load the pinyin-to-IPA (`py2ipa`) dictionary

In [None]:
py2ipa = {}

# Each line of py2ipa.txt holds a pinyin syllable and its IPA form, separated by a comma
with open('/content/Predict-Gender-from-Mandarin-Names/py2ipa.txt', 'r', encoding='utf-8') as f:
    for line in f:
        py, ipa = line.strip().split(',')
        py2ipa[py] = ipa
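
The loader above implies a two-column, comma-separated file. The sketch below reproduces that parsing on an in-memory stand-in so the format is easy to see. The three entries are hypothetical and are not copied from the real `py2ipa.txt`; `py2ipa_demo` is likewise an illustrative name.

```python
import io

# Hypothetical stand-in for py2ipa.txt; the real file's entries may differ.
sample = io.StringIO("ma,ma\nli,li\nzhang,tʂɑŋ\n\n")

py2ipa_demo = {}
for line in sample:
    line = line.strip()
    if not line:              # skip blank lines instead of failing on unpacking
        continue
    py, ipa = line.split(',', 1)
    py2ipa_demo[py] = ipa

print(py2ipa_demo)
```

Skipping empty lines is an addition in this sketch: the loader above would raise a `ValueError` on a blank line, since unpacking into `py, ipa` needs exactly two fields.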

# Phonotactic Features

* **Open-syllable Proportion**: the number of open syllables divided by the total number of syllables in the name

* **High-Nucleus Proportion**: the number of [+high] nuclear vowels divided by the total number of nuclear vowels in the name

* **Back-Nucleus Proportion**: the number of [+back] nuclear vowels divided by the total number of nuclear vowels in the name

* **Round-Nucleus Proportion**: the number of [+round] nuclear vowels divided by the total number of nuclear vowels in the name

* **Obstruent-Onset Proportion**: the number of obstruent onsets divided by the total number of onset consonants in the name
    
* **Non-Coronal-Onset Proportion**: the number of [-coronal] onsets divided by the total number of onset consonants in the name


In [None]:
patterns = {
    'open_syll': r'\w*[^nŋr0]\b', 
    'hi_nuc': r'[iyu]',
    'bk_nuc': r'[ɤɑuo]',
    'rd_nuc': r'[yuo]',
    'has_ons': r'\b[^iyueɛəɤoaɑjɥw0]\w*',
    'obsr_ons': r'\b(pʰ|p|f|tʰ|t|ʐ|tʂʰ|tʂ|ʂ|tsʰ|ts|s|tɕʰ|tɕ|ɕ|kʰ|k|x)\w*',
    'non_cor_ons': r'\b(pʰ|p|f|kʰ|k|x|m|ŋ)\w*',
}

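def proportion(pattern, ipa1, ipa2, denominator):
    """Count how many of the two syllables match `pattern` and divide by `denominator` (0 if the denominator is 0)."""
    ct = sum(1 for ipa in (ipa1, ipa2) if re.search(patterns[pattern], ipa))
    return ct / denominator if denominator != 0 else 0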
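
To make the feature definitions concrete, here is a minimal sketch that applies three of the patterns to two made-up names. The IPA strings are hypothetical transcriptions chosen for illustration, and the `features` helper is not part of the notebook. The patterns and `proportion` are copied so that the block runs on its own.

```python
import re

# Subset of the notebook's patterns, copied so the sketch is self-contained.
patterns = {
    'open_syll': r'\w*[^nŋr0]\b',
    'has_ons': r'\b[^iyueɛəɤoaɑjɥw0]\w*',
    'obsr_ons': r'\b(pʰ|p|f|tʰ|t|ʐ|tʂʰ|tʂ|ʂ|tsʰ|ts|s|tɕʰ|tɕ|ɕ|kʰ|k|x)\w*',
}

def proportion(pattern, ipa1, ipa2, denominator):
    ct = sum(1 for ipa in (ipa1, ipa2) if re.search(patterns[pattern], ipa))
    return ct / denominator if denominator != 0 else 0

def features(ipa1, ipa2):
    """Hypothetical helper: compute three of the proportions for one name."""
    num_syll = 1 if ipa2 == '0' else 2
    ons = proportion('has_ons', ipa1, ipa2, num_syll)
    return {
        'open_syll_p': proportion('open_syll', ipa1, ipa2, num_syll),
        'ons_p': ons,
        # onset-based features are divided by the number of onsets
        'obsr_ons_p': proportion('obsr_ons', ipa1, ipa2, ons * num_syll),
    }

two_syll = features('kʰan', 'li')   # hypothetical two-syllable name
one_syll = features('an', '0')      # hypothetical one-syllable name, no onset
print(two_syll)
print(one_syll)
```

In the second call the name has no onset, so the onset count is zero and `proportion` returns 0 rather than dividing by zero. Such names therefore contribute a 0 to the onset-based group means.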

# Process Corpus


In [None]:
src_path = '/content/Predict-Gender-from-Mandarin-Names/Chinese_Celebrities_Names.csv'

header = ['Given1', 'Given2', 'Gender', 'Py1', 'Py2', 'IPA1', 'IPA2',
          'open_syll_p', 'hi_nuc_p', 'bk_nuc_p', 'rd_nuc_p', 'ons_p',
          'obsr_ons_p', 'non_cor_ons_p']

# The output is the same as /content/Predict-Gender-from-Mandarin-Names/Corpus_processed.csv
with open(src_path, 'r', encoding='utf-8') as f, \
        open('Corpus_processed.csv', 'w', encoding='utf-8') as corpus:
    corpus.write(','.join(header) + '\n')

    f.readline()  # Skip header

    for line in f:
        fields = line.strip().split(',')

        Given1 = fields[6].strip()

        Py1, Py2 = fields[9].strip(), fields[14].strip()

        # Amend systematic data entry errors: when Py2 is '0' the name has
        # a single syllable, so Given2 is set to '0' as well
        Given2 = fields[11].strip() if Py2 != '0' else '0'

        # '女' means female; every other value is coded as male
        Gender = '0' if fields[1] == '女' else '1'

        IPA1 = py2ipa[rmv_tone(Py1)]
        IPA2 = py2ipa[rmv_tone(Py2)] if Py2 != '0' else '0'

        num_syll = 1 if IPA2 == '0' else 2

        P1 = proportion('open_syll', IPA1, IPA2, num_syll)
        P2 = proportion('hi_nuc', IPA1, IPA2, num_syll)
        P3 = proportion('bk_nuc', IPA1, IPA2, num_syll)
        P4 = proportion('rd_nuc', IPA1, IPA2, num_syll)
        P5 = proportion('has_ons', IPA1, IPA2, num_syll)
        # P5 * num_syll is the number of syllables that have an onset
        P6 = proportion('obsr_ons', IPA1, IPA2, P5 * num_syll)
        P7 = proportion('non_cor_ons', IPA1, IPA2, P5 * num_syll)

        new_line = [Given1, Given2, Gender, Py1, Py2, IPA1, IPA2, str(P1), str(P2),
                    str(P3), str(P4), str(P5), str(P6), str(P7)]
        corpus.write(','.join(new_line) + '\n')
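
The loop above splits each row on commas, which works as long as no field contains a comma. If that assumption ever failed, Python's standard `csv` module would handle quoted fields. The following is only a sketch on a hypothetical three-column input; the real CSV has more columns and a different layout.

```python
import csv
import io

# Hypothetical input with quoted commas; not taken from the real corpus.
raw = io.StringIO('Name,Gender,Note\n"A, B",F,"x, y"\nC,M,z\n')

reader = csv.reader(raw)
next(reader)                          # skip the header row
rows = [row for row in reader]        # quoted commas stay inside one field

out = io.StringIO()
writer = csv.writer(out, lineterminator='\n')
writer.writerow(['Name', 'Gender'])
for name, gender, _note in rows:
    writer.writerow([name, gender])   # fields containing commas are re-quoted

print(out.getvalue())
```

A plain `split(',')` would break the first data row into five pieces, whereas `csv.reader` returns three fields.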

# Results

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv("Corpus_processed.csv") 

In [None]:
df.groupby('Gender').mean(numeric_only=True) # 0 = female; 1 = male

| Gender | open_syll_p | hi_nuc_p | bk_nuc_p | rd_nuc_p | ons_p | obsr_ons_p | non_cor_ons_p |
|---|---|---|---|---|---|---|---|
| 0 | 0.494508 | 0.416306 | 0.136912 | 0.169129 | 0.78709 | 0.74679 | 0.274752 |
| 1 | 0.450804 | 0.370862 | 0.204324 | 0.21086 | 0.835473 | 0.845104 | 0.301555 |
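
The direction of each effect can be read off the group means reported above. The sketch below copies those means into plain dictionaries and prints the male minus female difference for every feature. It only compares means and does not test whether the differences are statistically significant. The names `female`, `male`, and `diff` are introduced here for illustration.

```python
# Group means copied from the results above (0 = female, 1 = male).
female = {'open_syll_p': 0.494508, 'hi_nuc_p': 0.416306, 'bk_nuc_p': 0.136912,
          'rd_nuc_p': 0.169129, 'ons_p': 0.78709, 'obsr_ons_p': 0.74679,
          'non_cor_ons_p': 0.274752}
male = {'open_syll_p': 0.450804, 'hi_nuc_p': 0.370862, 'bk_nuc_p': 0.204324,
        'rd_nuc_p': 0.21086, 'ons_p': 0.835473, 'obsr_ons_p': 0.845104,
        'non_cor_ons_p': 0.301555}

# Positive values mean the feature is more frequent in male names.
diff = {k: male[k] - female[k] for k in female}

# List features from the largest to the smallest absolute difference.
for k, d in sorted(diff.items(), key=lambda kv: abs(kv[1]), reverse=True):
    direction = 'male > female' if d > 0 else 'female > male'
    print(f'{k:15s} {d:+.6f}  ({direction})')
```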
