## → Chapter Six 

**Deciphering the Broken Pinyin, Completing the Epic Merge**  

🔺One of the challenges in data cleaning was that global databases like ISBNdb and isbnsearch.org often returned Chinese book metadata in **pinyin** (the phonetic romanization of Chinese) rather than in readable Chinese characters. To fix this, I used a Python library called **Pinyin2Hanzi**, which converts pinyin text into the most likely Chinese characters.
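
Since these databases return pinyin only some of the time, it can help to check what a title contains before converting it. The sketch below is one possible way to do that using only the standard library. The names `contains_hanzi` and `classify_title` and the three labels are hypothetical, and the check covers only the main CJK Unified Ideographs block (U+4E00 to U+9FFF), so rarer characters would be missed.

```python
def contains_hanzi(text):
    """Return True if any character is in the main CJK Unified Ideographs block."""
    return any('\u4e00' <= ch <= '\u9fff' for ch in text)

def classify_title(title):
    """Label a title as 'empty', 'hanzi', or 'latin' (hypothetical labels)."""
    if not isinstance(title, str) or not title.strip():
        return 'empty'   # missing value or blank string
    if contains_hanzi(title):
        return 'hanzi'   # already readable Chinese: no conversion needed
    return 'latin'       # pinyin or another Latin-script title: candidate for conversion

print(classify_title('hong lou meng'))  # latin
print(classify_title('红楼梦'))          # hanzi
```

A 'latin' label does not prove the title is pinyin (it could be English), so it only narrows down which rows are worth passing to the converter.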

```python
import pandas as pd
from Pinyin2Hanzi import DefaultDagParams, dag

# Initialize the pinyin converter with the library's default DAG parameters
dagparams = DefaultDagParams()

def convert_pinyin_to_hanzi(pinyin_string):
    """Convert a space-separated pinyin string to hanzi, or return None.

    Only candidates with a score above 0.1 are accepted.
    """
    # Skip missing values (NaN) so they are not read as the syllable "nan"
    if not isinstance(pinyin_string, str):
        return None
    pinyin_list = pinyin_string.strip().lower().split()
    try:
        # Request up to 5 candidate paths and return the first one above the threshold
        result = dag(dagparams, pinyin_list, path_num=5)
        for item in result:
            if item.score > 0.1:
                return ''.join(item.path)
        return None
    except Exception:
        # Treat any conversion failure as "no conversion"
        return None

# --- Convert isbndbResults.csv ---
isbndb_df = pd.read_csv('isbndbResults.csv')
isbndb_df['converted_title'] = isbndb_df['title'].apply(convert_pinyin_to_hanzi)

# --- Convert zoteroExport.csv ---
# Read ISBN as text so pandas does not parse it as a number
zotero_df = pd.read_csv('zoteroExport.csv', dtype={'ISBN': str})

# Process title
zotero_df['converted_title'] = zotero_df['Title'].apply(convert_pinyin_to_hanzi)

# Clean ISBN column: keep the first 17 characters
# (the length of a hyphenated ISBN-13), then remove hyphens
zotero_df['ISBN'] = zotero_df['ISBN'].str[:17].str.replace('-', '', regex=False)

# Result DataFrames
isbndbConverted = isbndb_df
zoteroConverted = zotero_df
```
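
Keeping the first 17 characters works when the cell starts with a hyphenated ISBN-13, but it can misfire on unhyphenated values or on cells that list an ISBN-10 first. A more defensive variant might look like the sketch below. The helper names `is_valid_isbn13` and `first_isbn13` are hypothetical, and the separators (whitespace, commas, semicolons) are an assumption about how multiple ISBNs appear in the export.

```python
import re

def is_valid_isbn13(digits):
    """Check the ISBN-13 check digit (weights alternate 1, 3 across the 13 digits)."""
    if not re.fullmatch(r'[0-9]{13}', digits):
        return False
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0

def first_isbn13(raw):
    """Return the first valid ISBN-13 in a raw cell value, or None if there is none."""
    if not isinstance(raw, str):
        return None  # NaN or other non-text value
    # Split on assumed separators, then strip hyphens from each candidate
    for token in re.split(r'[\s,;]+', raw.strip()):
        digits = token.replace('-', '')
        if is_valid_isbn13(digits):
            return digits
    return None

print(first_isbn13('978-0-306-40615-7'))  # 9780306406157
```

With pandas, a helper like this could be applied to the column in place of the slice-and-replace step, for example `zotero_df['ISBN'].apply(first_isbn13)`.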

🔺Then, I wrote a script to merge the book data into a single, structured dataset formatted to match the fields required for Shopify product import. The data came from four sources:
- **ISBNdb** 
- **Zotero** 
- **Dangdang.com** 
- **isbnsearch.org**
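
The merge script itself is not shown here, so the following is only a minimal sketch of the general idea: match records by ISBN, let higher-priority sources fill each field first, then rename the fields to Shopify column names. The priority order, the internal field names (`title`, `publisher`, `price`), and the sample records are all assumptions made for illustration. The four Shopify columns are a small subset of the product CSV template and should be checked against the current template before use.

```python
# Assumed priority order: earlier sources win when two sources disagree
SOURCE_PRIORITY = ['dangdang', 'isbndb', 'zotero', 'isbnsearch']

# Hypothetical mapping from internal field names to a few Shopify product CSV columns
SHOPIFY_COLUMNS = {
    'title': 'Title',
    'publisher': 'Vendor',
    'isbn': 'Variant Barcode',
    'price': 'Variant Price',
}

def merge_sources(sources):
    """Merge {source_name: [record, ...]} into one record per ISBN."""
    merged = {}
    for name in SOURCE_PRIORITY:
        for record in sources.get(name, []):
            isbn = record.get('isbn')
            if not isbn:
                continue  # records without an ISBN cannot be matched
            target = merged.setdefault(isbn, {})
            for field, value in record.items():
                # Keep the first non-empty value seen, so higher-priority sources win
                if value not in (None, '') and field not in target:
                    target[field] = value
    return merged

def to_shopify_rows(merged):
    """Rename merged fields to Shopify column names; missing fields become ''."""
    return [
        {column: record.get(field, '') for field, column in SHOPIFY_COLUMNS.items()}
        for record in merged.values()
    ]

# Made-up sample records, for illustration only
sources = {
    'isbndb': [{'isbn': '9780306406157', 'title': 'shi li shu ming', 'publisher': 'Example Press'}],
    'dangdang': [{'isbn': '9780306406157', 'title': '示例书名', 'price': '45.00'}],
}
rows = to_shopify_rows(merge_sources(sources))
print(rows[0]['Title'])  # 示例书名
```

Putting Dangdang first reflects the assumption that a Chinese retailer is the source most likely to have titles in Chinese characters. A different order only requires changing `SOURCE_PRIORITY`.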