### Motivation

We want our **dataset to be fully in English** so that we can evaluate the RAG system ourselves, e.g. by reading the LangChain output. However, our current English dataset was produced by manually running the text through Google Translate, and none of us is fluent enough in Japanese to check it.

Based on this internet [discussion](https://www.reddit.com/r/ChatGPT/comments/12ze6xx/gpt4_is_amazingly_good_at_translating_japanese/) and this [article](https://medium.com/akvelon/is-gpt-4-better-at-translation-than-google-translate-2fd39730af0e), GPT appears to translate better than Google Translate because it accommodates context better. In my opinion, *the dataset is the key*, so I think it is worth spending money to make sure the dataset is of high quality.

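That's why this notebook creates a *better dataset* using the SOTA LLM so far, OpenAI GPT-4. (The `translate` function defined in Part 2 is set to `gpt-3.5-turbo-1106`; change its `model` argument to run the translation with GPT-4.)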

# Part 1: Parse Source Data

There is a simple way and a complex way to get our data:
1. Simply use the provided Japanese dataset from sensei, or
2. Get the source data from [here](https://lifeshiftplatform.com/member) using any Python scraper, which gives us up-to-date data

Well, we want a better dataset, why not complicate ourselves even more? xD

## General Inspection

In [None]:
import requests
from IPython.display import display, HTML

base_url = "https://lifeshiftplatform.com/member/"
page = requests.get(base_url + '1')

display(HTML(page.text))

## Scrape HTML

We see that each person's URL is indexed by a number (`base_url + num`), e.g. https://lifeshiftplatform.com/member/101. But beware that not all of them exist.

Let's find out how many there are.

In [None]:
from bs4 import BeautifulSoup
import time

num = 1  # starting number
num_found = []
not_found_count = 0
max_not_found = 10  # threshold for consecutive failures

while not_found_count < max_not_found:
    url = f"{base_url}{num}"
    try:
        response = requests.get(url, timeout=10)
        if response.status_code == 200:  # process the page
            not_found_count = 0  # reset the counter
            num_found.append(num)
            print(f"Page found for URL: {url}")
        else:
            not_found_count += 1
            print(f"Page NOT found for URL: {url}")
    except requests.RequestException as e:
        # count a failed request as a miss so the loop cannot run forever
        not_found_count += 1
        print(f"Request failed for URL: {url}, Error: {e}")

    num += 1
    time.sleep(0.1)  # delay to avoid overloading the server

print("Scraping ended.")

Page found for URL: https://lifeshiftplatform.com/member/1
Page found for URL: https://lifeshiftplatform.com/member/2
Page found for URL: https://lifeshiftplatform.com/member/3
Page found for URL: https://lifeshiftplatform.com/member/4
Page found for URL: https://lifeshiftplatform.com/member/5
Page found for URL: https://lifeshiftplatform.com/member/6
Page found for URL: https://lifeshiftplatform.com/member/7
Page found for URL: https://lifeshiftplatform.com/member/8
Page found for URL: https://lifeshiftplatform.com/member/9
Page found for URL: https://lifeshiftplatform.com/member/10
Page found for URL: https://lifeshiftplatform.com/member/11
Page found for URL: https://lifeshiftplatform.com/member/12
Page found for URL: https://lifeshiftplatform.com/member/13
Page found for URL: https://lifeshiftplatform.com/member/14
Page found for URL: https://lifeshiftplatform.com/member/15
Page found for URL: https://lifeshiftplatform.com/member/16
Page found for URL: https://lifeshiftplatform.com

In [None]:
len(num_found)

210

## Parse HTML

### Preprocess a sample

Let's see which data we should keep for our dataset by looking at one person first.

In [None]:
page = requests.get(base_url + '1')

soup = BeautifulSoup(page.content, "html.parser")
print(soup.prettify())

<!DOCTYPE html>
<html lang="ja">
 <head>
  <meta charset="utf-8"/>
  <script async="" src="https://www.googletagmanager.com/gtag/js?id=G-1WCL7T2L7Q">
  </script>
  <script>
   window.dataLayer = window.dataLayer || [];
                function gtag(){dataLayer.push(arguments);}
                gtag('js', new Date());
                gtag('config', 'G-1WCL7T2L7Q');
  </script>
  <title>
   黒岩秀行｜Life Shift Platform
  </title>
  <link href="/assets/css/member_detail.css" rel="stylesheet"/>
  <meta content="NEW HORIZON COLLECTIVE, ニューホライズンコレクティブ, ライフシフトプラットフォーム, LIFE SHIFT PLATFORM, 人生100年時代, 電通, dentsu, 個人事業主" name="keywords"/>
  <meta content="1987年　慶大経済学部卒、同年（株）電通入社　1987～1997年　クリエーティブ局にてコピーライター、CMプランナーとして活動。味の素、日本コカ・コーラ、フマキラー、アデラン..." name="description"/>
  <meta content="width=device-width, initial-scale=1.0, minimum-scale=1.0" name="viewport"/>
  <meta content="#F8F8F8" name="theme-color"/>
  <meta content="ミドルシニアのための転職コミュニティ｜Life Shift Platform" property="og:site_name"/>
  <meta cont

Check the fourth line from the bottom. That's basically our data!
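
As a minimal offline sketch of why that line matters: the page appears to embed its data as JSON inside a `<script>` tag with the id `__NEXT_DATA__` (the id the parsing code in this notebook looks for), so it can be decoded without touching the rest of the markup. The HTML string, the `NextDataParser` class and the `extract_member` helper below are made up for illustration, and the standard-library `HTMLParser` is used only to keep the sketch dependency-free; the notebook itself uses BeautifulSoup.

```python
import json
from html.parser import HTMLParser


class NextDataParser(HTMLParser):
    """Collect the text inside <script id="__NEXT_DATA__">."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def extract_member(html):
    """Return the embedded member dictionary, or {} if the tag is missing."""
    parser = NextDataParser()
    parser.feed(html)
    parser.close()
    if not parser.chunks:
        return {}
    payload = json.loads("".join(parser.chunks))
    return payload.get("props", {}).get("pageProps", {}).get("member", {})


# Hypothetical miniature page, not real data from the site
sample_html = (
    '<html><body><script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"pageProps": {"member": {"id": 1, "display_name": "Example"}}}}'
    "</script></body></html>"
)
print(extract_member(sample_html))
```

Returning an empty dictionary when the tag is missing keeps the caller from crashing on a page with a different layout.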

In [None]:
# pretty print dictionary / JSON data
import json

def pdict(t):
    pretty_json = json.dumps(t, indent=4, ensure_ascii=False)
    print(pretty_json)

In [None]:
from IPython.display import display, JSON

script_tag = soup.find('script', {'id': '__NEXT_DATA__'})
json_data = json.loads(script_tag.string)  # a dictionary
member_info = json_data.get('props', {}).get('pageProps', {}).get('member', {})

pdict(member_info)

{
    "id": 86,
    "name": "黒岩 秀行",
    "display_name": "黒岩秀行",
    "displayable": true,
    "company_name": "ジムショDan代表",
    "face_photo": {
        "degrade": {
            "url": "https://s3-p-zbiz19-fsv-00.s3.ap-northeast-1.amazonaws.com/uploads/member/face_photo/86/degrade_NH_Kuroiwa_Hideyuki.png?X-Amz-Expires=600&X-Amz-Date=20231221T093255Z&X-Amz-Security-Token=IQoJb3JpZ2luX2VjEIb%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaDmFwLW5vcnRoZWFzdC0xIkcwRQIgVumhBnrfl6NYTVJ%2BZIDx7d0cBahSLSsf1TX7b8ZhymsCIQCyVtJZk4Z5JID7AnAbflktHadEZXhkHIylSTQC8Wt%2BiCqPBAj%2F%2F%2F%2F%2F%2F%2F%2F%2F%2F%2F8BEAIaDDQ1OTEwMDgzNTY3MyIMz9UHwK8CuxY3WTI9KuMDPR0fjPCwXfxyixM87m%2FKL0urmoGRCdYDU4gBZaB6vzXv0%2BMUOCeVBb3e5m1m1qWt0gnJ6Chd%2BmaovWUds55ZJBUjPwaTXrMvW4yBwXGtv3040Wk7wILoZF0FVG9Xv77tNHmVFWUeokuTUeoi0i1f3om%2FHRY0Kbip885NvnH8zxanSSx48Hwi%2FWQJ%2Bk2VUfQ9xlg%2BhtpIk7tjzwYZ4VHtJiOIgGkGFrYGtRTmYLWwSzH83DUGFxWrOQwj1JmKPB0jOuExqyVbyJp69X%2BP1yoJrlmGCDkGu4icSDqWTu9Lq8%2Bmj7ymK8fGlAGHHuVUxnk2FI8W9bmdyLIQ7R0Dx8%2FEwN4nc4Qle0s

In [None]:
# main fields we want to keep
main_fields = ["display_name", "company_name", "profile", "title"]
main_info = {key: member_info[key] for key in main_fields if key in member_info}

# change key names and manual ordering
main_info["name"] = main_info.pop("display_name")
main_info["company"] = main_info.pop("company_name")
main_info["title"] = main_info.pop("title")
main_info["profile_jp"] = main_info.pop("profile")

pdict(main_info)

{
    "name": "黒岩秀行",
    "company": "ジムショDan代表",
    "title": "プロデューサー、プランニングディレクター",
    "profile_jp": "1987年　慶大経済学部卒、同年（株）電通入社　1987～1997年　クリエーティブ局にてコピーライター、CMプランナーとして活動。味の素、日本コカ・コーラ、フマキラー、アデランスなど担当。電通賞ほか受賞作多数。1997年～2020年　営業局（現ビジネスプロデュース局）にて資生堂、日本生命、HIS、吉本興業、文藝春秋他出版社などを担当として歴任。2005年より営業部長。各大型キャンペーンの企画・実施運営などを中心となり担務。現在、電通子会社ニューホライズンコレクティブ(合）メンバーとしてと業務委託契約を結ぶ。『宣伝会議』編集ライター養成講座第43期卒業生。卒業制作作品において優秀賞受賞。上記広告会社に勤めた34年間の経験を活かしながら、今後は企業の取り組み、あるいは世の中の出来事を記事・文章を通して紹介していきたいと考えます。事業のメルマガ作成から、コンテンツとしての記事制作(取材含む）を承りつつ、企業の価値・課題と人々の持つニーズやウォンツの橋渡しができればと考えます。PR・コミュニケーション関連に特に知見があり、得意な分野としては文化風俗系の記事ですが、クライアントの要望に沿って幅広い領域に対応します。"
}


In [None]:
# probably useful metadata
occupations = [{k: occ[k] for k in ["name", "id"] if k in occ} for occ in member_info.get("occupations", [])]
industries = [{k: ind[k] for k in ["name", "id"] if k in ind} for ind in member_info.get("industries", [])]

metadata = {'occupations': occupations, 'industries': industries}
main_info['metadata'] = metadata

pdict(main_info)

{
    "name": "黒岩秀行",
    "company": "ジムショDan代表",
    "title": "プロデューサー、プランニングディレクター",
    "profile_jp": "1987年　慶大経済学部卒、同年（株）電通入社　1987～1997年　クリエーティブ局にてコピーライター、CMプランナーとして活動。味の素、日本コカ・コーラ、フマキラー、アデランスなど担当。電通賞ほか受賞作多数。1997年～2020年　営業局（現ビジネスプロデュース局）にて資生堂、日本生命、HIS、吉本興業、文藝春秋他出版社などを担当として歴任。2005年より営業部長。各大型キャンペーンの企画・実施運営などを中心となり担務。現在、電通子会社ニューホライズンコレクティブ(合）メンバーとしてと業務委託契約を結ぶ。『宣伝会議』編集ライター養成講座第43期卒業生。卒業制作作品において優秀賞受賞。上記広告会社に勤めた34年間の経験を活かしながら、今後は企業の取り組み、あるいは世の中の出来事を記事・文章を通して紹介していきたいと考えます。事業のメルマガ作成から、コンテンツとしての記事制作(取材含む）を承りつつ、企業の価値・課題と人々の持つニーズやウォンツの橋渡しができればと考えます。PR・コミュニケーション関連に特に知見があり、得意な分野としては文化風俗系の記事ですが、クライアントの要望に沿って幅広い領域に対応します。",
    "metadata": {
        "occupations": [
            {
                "name": "アカウントプランニング・プロデュース",
                "id": 1
            },
            {
                "name": "クリエーティブ（コピーライター・CMプランナー系）",
                "id": 5
            }
        ],
        "industries": [
            {
                "name": "食品・農業",
                "id": 23
            },
           

### Preprocess all samples

In [None]:
def parse(url):
    print(f"Processing {url}")

    # parse data
    data = requests.get(url)
    data = BeautifulSoup(data.content, "html.parser")
    data = data.find('script', {'id': '__NEXT_DATA__'})  # in html script
    data = json.loads(data.string)  # a dictionary
    data = data.get('props', {}).get('pageProps', {}).get('member', {})

    # main fields we want to keep
    main_fields = ["display_name", "company_name", "profile", "title"]
    result = {k: (None if data.get(k) == '' else data.get(k)) for k in main_fields}
    result["name"] = result.pop("display_name", None)
    result["company"] = result.pop("company_name", None)

    # probably useful metadata
    occupations = [{k: (None if occ.get(k) == '' else occ.get(k)) for k in ["name", "id"] if k in occ}
                   for occ in data.get("occupations", [])]
    industries = [{k: (None if ind.get(k) == '' else ind.get(k)) for k in ["name", "id"] if k in ind}
                  for ind in data.get("industries", [])]
    result['metadata'] = {'occupations': occupations, 'industries': industries}

    # manual ordering
    keyorder = ['name', 'company', 'title', 'profile', 'metadata']
    return {k: result.get(k, None) for k in keyorder if k in result}

In [None]:
dataset = []
for num in num_found:
    data = parse(f"{base_url}{num}")
    dataset.append(data)
    time.sleep(0.1)  # delay to avoid overloading the server

Processing https://lifeshiftplatform.com/member/1
Processing https://lifeshiftplatform.com/member/2
Processing https://lifeshiftplatform.com/member/3
Processing https://lifeshiftplatform.com/member/4
Processing https://lifeshiftplatform.com/member/5
Processing https://lifeshiftplatform.com/member/6
Processing https://lifeshiftplatform.com/member/7
Processing https://lifeshiftplatform.com/member/8
Processing https://lifeshiftplatform.com/member/9
Processing https://lifeshiftplatform.com/member/10
Processing https://lifeshiftplatform.com/member/11
Processing https://lifeshiftplatform.com/member/12
Processing https://lifeshiftplatform.com/member/13
Processing https://lifeshiftplatform.com/member/14
Processing https://lifeshiftplatform.com/member/15
Processing https://lifeshiftplatform.com/member/16
Processing https://lifeshiftplatform.com/member/17
Processing https://lifeshiftplatform.com/member/18
Processing https://lifeshiftplatform.com/member/19
Processing https://lifeshiftplatform.com

In [None]:
import datetime
today = datetime.datetime.now().strftime("%Y%m%d")
filename = f"members-jp_v{today}.json"

# write data to a file
try:
    with open(filename, 'w', encoding='utf-8') as file:
        json.dump(dataset, file, ensure_ascii=False)
        print(f"Data successfully saved to {filename}")
except IOError as e:
    print("An error occurred while writing the file:", e)

Data successfully saved to members-jp_v20231221.json


In [None]:
from google.colab import files
files.download(filename)

# Part 2: Translating Data

## Preparing Model

In [None]:
!pip install -U openai typing-extensions

print('Stopping RUNTIME! Please run again.')
import os
os.kill(os.getpid(), 9)

Collecting openai
  Downloading openai-1.6.0-py3-none-any.whl (225 kB)
Collecting typing-extensions
  Downloading typing_extensions-4.9.0-py3-none-any.whl (32 kB)
Collecting httpx<1,>=0.23.0 (from openai)
  Downloading httpx-0.26.0-py3-none-any.whl (75 kB)
Collecting httpcore==1.* (from httpx<1,>=0.23.0->openai)
  Downloading httpcore-1.0.2-py3-none-any.whl (76 kB)

Restart the session after installing the packages above; otherwise you will get `ImportError: cannot import name 'Iterator' from 'typing_extensions'` when running the OpenAI client.

In [None]:
import os
from getpass import getpass

OPENAI_API_KEY = getpass()
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

··········


In [None]:
from openai import OpenAI
client = OpenAI()

def translate(prompt, json_mode=False, token_count=0):
    messages = [
        {"role": "system", "content": f"You are a professional Japanese to English translator.{' Output in JSON.' if json_mode else ''}"},
        {"role": "user", "content": prompt}
    ]
    response_format = {"type": "json_object"} if json_mode else {"type": "text"}

    response = client.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        response_format=response_format,
        messages=messages,
        temperature=0.0
    )

    translated = response.choices[0].message.content
    token_used = token_count + response.usage.total_tokens
    return (translated, token_used)

## Load Data

In [None]:
# pretty print dictionary / JSON data
import json

def pdict(t):
    pretty_json = json.dumps(t, indent=4, ensure_ascii=False)
    print(pretty_json)

In [None]:
# load existing file
filename = "members-jp_v20231221.json"

try:
    with open(filename, 'r', encoding='utf-8') as file:
        dataset = json.load(file)
    print(f"Data successfully loaded from {filename}")
except FileNotFoundError:
    print(f"No file named {filename} was found.")
except json.JSONDecodeError:
    print(f"Error decoding JSON from the file {filename}.")
except Exception as e:
    print(f"An error occurred: {e}")

Data successfully loaded from members-jp_v20231221.json


## Process Metadata

In [None]:
dict_occ, dict_ind = {}, {}

for member in dataset:
    for occupation in member.get('metadata', {}).get("occupations", []):
        dict_occ[occupation["id"]] = occupation["name"]
    for industry in member.get('metadata', {}).get("industries", []):
        dict_ind[industry["id"]] = industry["name"]

dict_occ = dict(sorted(dict_occ.items()))
dict_ind = dict(sorted(dict_ind.items()))

print(dict_occ)
print(dict_ind)

{1: 'アカウントプランニング・プロデュース', 2: 'イベント', 3: 'キャスティング', 4: 'クリエーティブ（アートディレクター・デザイナー系）', 5: 'クリエーティブ（コピーライター・CMプランナー系）', 6: 'グローバル', 7: 'コンテンツビジネス・スポーツビジネス', 8: 'セールスプロモーション', 9: 'デジタル・DX', 10: 'ナレッジシェア', 11: 'ファイナンス・会計・経理', 12: 'マーケティング・コンセプト開発・戦略立案・リサーチ', 13: 'メディア（アウト・オブ・ホーム・メディア）', 14: 'メディア（インターネット）', 15: 'メディア（テレビ）', 16: 'メディア（ラジオ）', 17: 'メディア（新聞）', 18: 'メディア（雑誌）', 19: '事業投資・ファンド運営', 20: '事業開発', 21: '人事・人材育成', 22: '広報・PR', 23: '情報システム・IT', 24: '法務', 25: '海外法人経営', 26: '組織開発', 27: '経営', 28: '経営企画・経営コンサルティング', 29: '総務・管理・秘書', 30: '顧客ソリューション開発'}
{1: 'エネルギー・素材・機械', 2: 'ファッション・アクセサリー', 3: '不動産・住宅設備', 4: '交通・レジャー', 5: '会計', 6: '出版', 7: '化粧品・トイレタリー', 8: '外食・各種サービス', 9: '官公庁・団体', 10: '家庭用品', 11: '家電・AV機器', 12: '情報・通信', 13: '教育・医療サービス・宗教', 14: '案内', 15: '流通・小売業', 16: '精密機器・事務用品', 17: '臨時もの', 18: '自動車・関連品', 19: '芸術・美術', 20: '薬品・医療用品', 21: '趣味・スポーツ用品', 22: '金融・保険', 23: '食品・農業', 24: '飲料・嗜好品'}


Translate metadata labels

In [None]:
occ_res, _ = translate(f"Translate this list of occupations. Convert '・' to '/', and output separated by ' | ': {' | '.join(list(dict_occ.values()))}")
ind_res, _ = translate(f"Translate this list of industry sectors. Convert '・' to '/', and output separated by ' | ': {' | '.join(list(dict_ind.values()))}")

In [None]:
for (key, src), tgt in zip(dict_occ.items(), occ_res.split(' | ')):
    print(f"{key:2d}: {src}  ->  {tgt}")
    dict_occ[key] = tgt

print(dict_occ)

 1: アカウントプランニング・プロデュース  ->  Account Planning/Production
 2: イベント  ->  Events
 3: キャスティング  ->  Casting
 4: クリエーティブ（アートディレクター・デザイナー系）  ->  Creative (Art Director/Designer)
 5: クリエーティブ（コピーライター・CMプランナー系）  ->  Creative (Copywriter/CM Planner)
 6: グローバル  ->  Global
 7: コンテンツビジネス・スポーツビジネス  ->  Content Business/Sports Business
 8: セールスプロモーション  ->  Sales Promotion
 9: デジタル・DX  ->  Digital/DX
10: ナレッジシェア  ->  Knowledge Share
11: ファイナンス・会計・経理  ->  Finance/Accounting/Finance
12: マーケティング・コンセプト開発・戦略立案・リサーチ  ->  Marketing/Concept Development/Strategic Planning/Research
13: メディア（アウト・オブ・ホーム・メディア）  ->  Media (Out-of-Home Media)
14: メディア（インターネット）  ->  Media (Internet)
15: メディア（テレビ）  ->  Media (Television)
16: メディア（ラジオ）  ->  Media (Radio)
17: メディア（新聞）  ->  Media (Newspaper)
18: メディア（雑誌）  ->  Media (Magazine)
19: 事業投資・ファンド運営  ->  Business Investment/Fund Management
20: 事業開発  ->  Business Development
21: 人事・人材育成  ->  Human Resources/Personnel Development
22: 広報・PR  ->  Public Relations/PR
23: 情報システム・IT  -> 

In [None]:
for (key, src), tgt in zip(dict_ind.items(), ind_res.split(' | ')):
    print(f"{key:2d}: {src}  ->  {tgt}")
    dict_ind[key] = tgt

print(dict_ind)

 1: エネルギー・素材・機械  ->  Energy/Materials/Machinery
 2: ファッション・アクセサリー  ->  Fashion/Accessories
 3: 不動産・住宅設備  ->  Real Estate/Home Equipment
 4: 交通・レジャー  ->  Transportation/Leisure
 5: 会計  ->  Accounting
 6: 出版  ->  Publishing
 7: 化粧品・トイレタリー  ->  Cosmetics/Toiletries
 8: 外食・各種サービス  ->  Dining/Various Services
 9: 官公庁・団体  ->  Government/Groups
10: 家庭用品  ->  Household Goods
11: 家電・AV機器  ->  Home Appliances/AV Equipment
12: 情報・通信  ->  Information/Communication
13: 教育・医療サービス・宗教  ->  Education/Medical Services/Religion
14: 案内  ->  Guidance
15: 流通・小売業  ->  Distribution/Retail
16: 精密機器・事務用品  ->  Precision Equipment/Office Supplies
17: 臨時もの  ->  Temporary Items
18: 自動車・関連品  ->  Automobiles/Related Products
19: 芸術・美術  ->  Arts/Fine Arts
20: 薬品・医療用品  ->  Pharmaceuticals/Medical Supplies
21: 趣味・スポーツ用品  ->  Hobbies/Sports Equipment
22: 金融・保険  ->  Finance/Insurance
23: 食品・農業  ->  Food/Agriculture
24: 飲料・嗜好品  ->  Beverages/Confectionery
{1: 'Energy/Materials/Machinery', 2: 'Fashion/Accessories', 3: 'Real
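
One thing to watch in the two loops above: `zip` silently stops at the shorter input, so if the model returns fewer (or more) labels than we sent, the ids and translations can drift apart without any error. A minimal sketch of a stricter mapping step follows; `apply_translations` is a hypothetical helper, not part of the notebook, and the two sample labels are only for illustration.

```python
def apply_translations(labels, translated, sep="|"):
    """Return {id: English label}, refusing to guess when the counts differ."""
    parts = [part.strip() for part in translated.split(sep)]
    if len(parts) != len(labels) or any(not part for part in parts):
        raise ValueError(
            f"expected {len(labels)} non-empty labels, got {parts!r}"
        )
    # dicts keep insertion order, so ids and translations line up one to one
    return dict(zip(labels.keys(), parts))


labels = {1: "イベント", 2: "出版"}
print(apply_translations(labels, "Events | Publishing"))

try:
    apply_translations(labels, "Events")
except ValueError as err:
    print("rejected:", err)
```

Splitting on the bare `|` and stripping whitespace also tolerates a reply that drops the spaces around the separator.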

## Main Translation

### Translate a sample

In [None]:
prompt_template = """\
Given following information of a person:
name: {name}
company: {company}
title: {title}
profile: {profile}

Translate his/her 'title' and 'profile' information and \
format the output as a JSON object with the following keys, detail follows:
`title`: Translate the equivalent job position in English without losing context.
`profile`: Translate the full profile carefully without losing details.

Think for the translation step-by-step and very carefully.\
"""

print(prompt_template)

Given following information of a person:
name: {name}
company: {company}
title: {title}
profile: {profile}

Translate his/her 'title' and 'profile' information and format the output as a JSON object with the following keys, detail follows:
`title`: Translate the equivalent job position in English without losing context.
`profile`: Translate the full profile carefully without losing details.

Think for the translation step-by-step and very carefully.


In [None]:
def mini_tr(id):
    data = dataset[id]
    translated, token_used = translate(prompt_template.format(
        name=data["name"], company=data["company"],
        title=data["title"], profile=data["profile"],
    ), json_mode=True)

    res = json.loads(translated)
    res = {key: (None if val=='' or val=='None' else val) for key, val in res.items()}

    print(token_used)
    print(data["company"])
    print(data["title"])
    print(data["profile"])
    pdict(res)

In [None]:
# random sample
mini_tr(35)

438
COMPASS
プロジェクトプロデューサー、ビジネスプロデューサー
電通時代は、営業・ビジネスプロデューサーとして、様々な広告キャンペーンやブランディングプロジェクト、ビジネスデザインやDXプロジェクトに携わり、クライアント企業の成長支援をしてきました。 独立後も、クライアント企業の潜在力／可能性を最大化すべく、電通時代に得たナレッジや人脈をフルに活かしたプロジェクトによって、多くの中堅企業のみなさまと伴走させていただいております。
{
    "title": "Project Producer, Business Producer",
    "profile": "During my time at Dentsu, I worked as a sales and business producer, engaging in various advertising campaigns, branding projects, business design, and DX projects, supporting the growth of client companies. Even after becoming independent, I continue to work alongside many mid-sized companies, maximizing the potential and possibilities of client companies by fully utilizing the knowledge and connections I gained during my time at Dentsu."
}


In [None]:
# case with no company
mini_tr(4)

264
None
マーケティングコンサルタント
飲食店の開業を目指しスキルアップと準備を進めています。またこれまでのビジネスプロデュース経験も活かして、飲食店等のマーケティングコンサルティングを行なっています。
{
    "title": "Marketing Consultant",
    "profile": "I am working on improving my skills and preparing to open a restaurant. In addition, I am utilizing my previous business production experience to provide marketing consulting for restaurants and other food and beverage establishments."
}


In [None]:
# these often throw a parse error and take a long time to translate
mini_tr(93)
mini_tr(113)

1352
Action Creative 
クリエイティブディレクター、フォトグラファー
1989年4月  株式会社電通入社　コピーライター配属。1995年4月 電通九州　CMプランナー2018年4月 ５CRP局 部長・クリエイティブ・ディレクター / フォトグラファー(電通CR史上初のジョブタイトル)2020年12月 電通退社 2021年1月 Action Creative 設立http://action-creative.com/ ●クライアントのマーケティングのコンサルティングから、広告の企画・制作・検証まで、すべてのクリエイティブ作業をディレクションします。CMやYoutubeなどのトップ・ファネルから、ミドル・ファネル、ボトムファネルまでフル設計し、商品が動くクリエイティブを仕掛けることが得意です。●写真家「宮下五郎」として、スチール撮影（タレント・人物・空間・フードなど）も実施します。NHプロフェッショナル撮影チームのリーダー。http://www.goromiyashita.com/ [受賞歴]ロンドン国際広告賞　IBA賞　ニューヨークフェスティバル　TCC新人賞　ACC賞　毎日広告デザイン賞　準グランプリ・ダブル受賞　全日本CMコンクール優秀賞　広告電通賞　　全日本CMフェスティバル　FCC賞　福岡広告協会賞　今年を代表するCM賞　光文社カラー広告コンクール　消費者のためになった広告コンクールJAA会長賞　その他国内外の受賞多数　写真新世紀入賞（＊写真作家　宮下五郎として）[著書]「海外経験ゼロ。それでもTOEIC900点」　PHP研究所「失敗の教科書」扶桑社「部きっ長さん」　＊朝日新聞土曜版beに2016年から2019年まで毎週連載 [写真集]「WINDOWS_NY]「MY FAVORITE PLACE」[講師歴]青山学院大学　昭和女子大学　御茶ノ水女子大学　跡見学園女子大学　十文字女子大学　武蔵大学　九州産業大学
{
    "title": "Creative Director, Photographer",
    "profile": "Yusuke Miyashita started his career at Dentsu in April 1989 as a copywriter. In April 1995, he became a

### Translate all samples

In [None]:
len(dataset)

210

In [None]:
def batch_tr(new_dataset, total_token, id_start, id_end):
    id = id_start
    max_retries = 3

    for data in dataset[id_start:id_end]:
        print(f"Translating data {id}...", end="")
        new_data = data.copy()

        # extract and map metadata id
        occupations = [dict_occ.get(item['id']) for item in data['metadata']['occupations']]
        industries = [dict_ind.get(item['id']) for item in data['metadata']['industries']]
        new_data['metadata'] = {'occupations': occupations, 'industries': industries}

        for attempt in range(max_retries):
            try:
                # translation
                translated, token_used = translate(prompt_template.format(
                    name=data["name"], company=data["company"],
                    title=data["title"], profile=data["profile"],
                ), json_mode=True)
                print(" Now parsing...", end="")
                result = json.loads(translated)
                result = {key: (None if val=='' or val=='None' else val) for key, val in result.items()}

                # store our translation
                new_data['title'] = result['title']
                new_data['profile'] = result['profile']

                new_dataset.append(new_data)
                print(" Done.")
                total_token += token_used
                break  # exit the retry loop on successful translation
            except Exception as e:
                print(f"\nError on attempt {attempt + 1}: {e}.", end="")
                if attempt == max_retries - 1:
                    print(f"\nFailed to translate data no. {id} after {max_retries} attempts.")
        id += 1

    return new_dataset, total_token
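
Note that `batch_tr` skips an entry when all retries fail, so from that point on the positions in `new_dataset` no longer match the positions in `dataset`. A minimal sketch of one way to keep track of this, keyed by the original index, is shown here. `translate_all` and `fake_translate` are hypothetical names; `fake_translate` stands in for the real API call so the sketch runs offline.

```python
import json


def translate_all(records, translate_fn, max_retries=3):
    """Translate each record; return ({index: parsed_json}, [failed indices])."""
    results, failed = {}, []
    for index, record in enumerate(records):
        for _ in range(max_retries):
            try:
                results[index] = json.loads(translate_fn(record))
                break  # success, stop retrying
            except (json.JSONDecodeError, TypeError):
                continue  # malformed reply, try again
        else:
            # every attempt failed: remember the original position
            failed.append(index)
    return results, failed


def fake_translate(record):
    # Stand-in for the real API call: returns broken JSON for one record.
    if record["title"] == "bad":
        return '{"title": "oops"'
    return json.dumps({"title": record["title"].upper()})


records = [{"title": "a"}, {"title": "bad"}, {"title": "c"}]
results, failed = translate_all(records, fake_translate)
print(results)
print(failed)
```

With the failed indices in hand, those entries can be retried or translated by hand without guessing which ones are missing.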

In [None]:
new_dataset = []
total_token = 0

new_dataset, total_token = batch_tr(new_dataset, total_token, 0, 10)

Translating data 0... Now parsing... Done.
Translating data 1... Now parsing... Done.
Translating data 2... Now parsing... Done.
Translating data 3... Now parsing... Done.
Translating data 4... Now parsing... Done.
Translating data 5... Now parsing...
Error on attempt 1: Expecting ',' delimiter: line 3 column 1088 (char 1147). Now parsing... Done.
Translating data 6... Now parsing... Done.
Translating data 7... Now parsing... Done.
Translating data 8... Now parsing... Done.
Translating data 9... Now parsing... Done.


In [None]:
new_dataset_cp, total_token_cp = new_dataset.copy(), total_token
new_dataset, total_token = batch_tr(new_dataset, total_token, 10, 50)

Translating data 10... Now parsing... Done.
Translating data 11... Now parsing... Done.
Translating data 12... Now parsing... Done.
Translating data 13... Now parsing... Done.
Translating data 14... Now parsing... Done.
Translating data 15... Now parsing... Done.
Translating data 16... Now parsing... Done.
Translating data 17... Now parsing... Done.
Translating data 18... Now parsing... Done.
Translating data 19... Now parsing... Done.
Translating data 20... Now parsing... Done.
Translating data 21... Now parsing... Done.
Translating data 22... Now parsing... Done.
Translating data 23... Now parsing... Done.
Translating data 24... Now parsing... Done.
Translating data 25... Now parsing... Done.
Translating data 26... Now parsing... Done.
Translating data 27... Now parsing... Done.
Translating data 28... Now parsing... Done.
Translating data 29... Now parsing... Done.
Translating data 30... Now parsing... Done.
Translating data 31... Now parsing... Done.
Translating data 32... Now parsi

In [None]:
new_dataset_cp, total_token_cp = new_dataset.copy(), total_token
new_dataset, total_token = batch_tr(new_dataset, total_token, 50, 90)

Translating data 50... Now parsing... Done.
Translating data 51... Now parsing... Done.
Translating data 52... Now parsing... Done.
Translating data 53... Now parsing... Done.
Translating data 54... Now parsing... Done.
Translating data 55... Now parsing... Done.
Translating data 56... Now parsing... Done.
Translating data 57... Now parsing... Done.
Translating data 58... Now parsing... Done.
Translating data 59... Now parsing... Done.
Translating data 60... Now parsing... Done.
Translating data 61... Now parsing... Done.
Translating data 62... Now parsing... Done.
Translating data 63... Now parsing... Done.
Translating data 64... Now parsing... Done.
Translating data 65... Now parsing... Done.
Translating data 66... Now parsing... Done.
Translating data 67... Now parsing... Done.
Translating data 68... Now parsing... Done.
Translating data 69... Now parsing... Done.
Translating data 70... Now parsing... Done.
Translating data 71... Now parsing... Done.
Translating data 72... Now parsi

In [None]:
new_dataset_cp, total_token_cp = new_dataset.copy(), total_token
new_dataset, total_token = batch_tr(new_dataset, total_token, 90, 120)

Translating data 90... Now parsing... Done.
Translating data 91... Now parsing... Done.
Translating data 92... Now parsing... Done.
Translating data 93... Now parsing... Done.
Translating data 94... Now parsing... Done.
Translating data 95... Now parsing... Done.
Translating data 96... Now parsing... Done.
Translating data 97... Now parsing... Done.
Translating data 98... Now parsing... Done.
Translating data 99... Now parsing... Done.
Translating data 100... Now parsing... Done.
Translating data 101... Now parsing... Done.
Translating data 102... Now parsing... Done.
Translating data 103... Now parsing... Done.
Translating data 104... Now parsing... Done.
Translating data 105... Now parsing... Done.
Translating data 106... Now parsing... Done.
Translating data 107... Now parsing... Done.
Translating data 108... Now parsing... Done.
Translating data 109... Now parsing... Done.
Translating data 110... Now parsing... Done.
Translating data 111... Now parsing... Done.
Translating data 112

In [None]:
new_dataset_cp, total_token_cp = new_dataset.copy(), total_token
new_dataset, total_token = batch_tr(new_dataset, total_token, 120, 150)

Translating data 120... Now parsing... Done.
Translating data 121... Now parsing... Done.
Translating data 122... Now parsing... Done.
Translating data 123... Now parsing... Done.
Translating data 124... Now parsing... Done.
Translating data 125... Now parsing... Done.
Translating data 126... Now parsing... Done.
Translating data 127... Now parsing... Done.
Translating data 128... Now parsing... Done.
Translating data 129... Now parsing... Done.
Translating data 130... Now parsing... Done.
Translating data 131... Now parsing... Done.
Translating data 132... Now parsing... Done.
Translating data 133... Now parsing... Done.
Translating data 134... Now parsing... Done.
Translating data 135... Now parsing... Done.
Translating data 136... Now parsing... Done.
Translating data 137... Now parsing... Done.
Translating data 138... Now parsing... Done.
Translating data 139... Now parsing... Done.
Translating data 140... Now parsing... Done.
Translating data 141... Now parsing... Done.
Translatin

In [None]:
new_dataset_cp, total_token_cp = new_dataset.copy(), total_token
new_dataset, total_token = batch_tr(new_dataset, total_token, 150, 180)

Translating data 150... Now parsing... Done.
Translating data 151... Now parsing... Done.
Translating data 152... Now parsing... Done.
Translating data 153... Now parsing... Done.
Translating data 154... Now parsing... Done.
Translating data 155... Now parsing... Done.
Translating data 156... Now parsing... Done.
Translating data 157... Now parsing... Done.
Translating data 158... Now parsing... Done.
Translating data 159... Now parsing... Done.
Translating data 160... Now parsing... Done.
Translating data 161... Now parsing... Done.
Translating data 162... Now parsing... Done.
Translating data 163... Now parsing... Done.
Translating data 164... Now parsing... Done.
Translating data 165... Now parsing... Done.
Translating data 166... Now parsing... Done.
Translating data 167... Now parsing... Done.
Translating data 168... Now parsing... Done.
Translating data 169... Now parsing... Done.
Translating data 170... Now parsing... Done.
Translating data 171... Now parsing... Done.
Translatin

In [None]:
new_dataset_cp, total_token_cp = new_dataset.copy(), total_token
new_dataset, total_token = batch_tr(new_dataset, total_token, 180, len(dataset))

Translating data 180... Now parsing... Done.
Translating data 181... Now parsing... Done.
Translating data 182... Now parsing... Done.
Translating data 183... Now parsing... Done.
Translating data 184... Now parsing... Done.
Translating data 185... Now parsing... Done.
Translating data 186... Now parsing... Done.
Translating data 187... Now parsing... Done.
Translating data 188... Now parsing... Done.
Translating data 189... Now parsing... Done.
Translating data 190... Now parsing... Done.
Translating data 191... Now parsing... Done.
Translating data 192... Now parsing... Done.
Translating data 193... Now parsing... Done.
Translating data 194... Now parsing... Done.
Translating data 195... Now parsing... Done.
Translating data 196... Now parsing... Done.
Translating data 197... Now parsing... Done.
Translating data 198... Now parsing... Done.
Translating data 199... Now parsing... Done.
Translating data 200... Now parsing... Done.
Translating data 201... Now parsing... Done.
Translatin

In [None]:
print(len(new_dataset))
print(total_token)

210
102630


How much did we spend on translating all of them? The estimate below is based on the OpenAI pricing [here](https://openai.com/pricing).

In [None]:
# flat rate of $0.0015 per 1K tokens, applied to prompt and completion tokens alike
print(f"It costs roughly around ${total_token/1000 * 0.0015}")

It costs roughly around $0.153945
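
The figure above applies a single rate to all tokens. Prompt and completion tokens are usually priced differently, and the API response reports them separately as `response.usage.prompt_tokens` and `response.usage.completion_tokens`, so a slightly finer estimate is possible. In the sketch below, `estimate_cost` is a hypothetical helper, the two rates are placeholder values that should be checked against the pricing page, and the token split is made up, since the notebook only records the total.

```python
def estimate_cost(prompt_tokens, completion_tokens,
                  input_rate_per_1k, output_rate_per_1k):
    """Estimate the cost in USD from separate prompt and completion counts."""
    return (prompt_tokens / 1000 * input_rate_per_1k
            + completion_tokens / 1000 * output_rate_per_1k)


# Placeholder rates in USD per 1K tokens (assumptions, not quoted prices)
INPUT_RATE = 0.0010
OUTPUT_RATE = 0.0020

# Made-up split of the 102630 total tokens, for illustration only
print(f"${estimate_cost(60_000, 42_630, INPUT_RATE, OUTPUT_RATE):.4f}")
```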


### Post-processing

For a better dataset, we still need manual evaluation. This will not scale well to larger datasets, but for 210 entries it should be manageable.

In [None]:
# back up new_dataset first; deepcopy so the in-place fixes below do not alter the backup
import copy
new_dataset_backup = copy.deepcopy(new_dataset)
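Note that `list.copy()` is a shallow copy: the new list holds references to the same member dicts, so in-place fixes such as `new_dataset[12]['title'] = ...` would also show up in a backup made that way. A small self-contained illustration with toy data (not the real dataset), comparing it with `copy.deepcopy`:

```python
import copy

# Toy stand-in for the dataset: a list of dicts, as in new_dataset.
data = [{"title": {"english": "Producer"}}]

shallow = data.copy()        # new list, same inner dicts
deep = copy.deepcopy(data)   # new list and new inner dicts

# Fix the entry in place, the same way the manual corrections do.
data[0]["title"] = data[0]["title"]["english"]

print(shallow[0]["title"])  # the shallow copy sees the change
print(deep[0]["title"])     # the deep copy keeps the original value
```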

In [None]:
check = []
for idx, member in enumerate(new_dataset):  # idx avoids shadowing the built-in id()
    check.append({"id": idx, "title": member["title"], "profile": member["profile"]})
pdict(check)

[
    {
        "id": 0,
        "title": "Producer, Planning Director",
        "profile": "Graduated from Keio University's Faculty of Economics in 1987, and joined Dentsu Co., Ltd. in the same year. From 1987 to 1997, worked as a copywriter and CM planner in the Creative Department, handling clients such as Ajinomoto, Coca-Cola Japan, Fumakilla, and Adelans. Received numerous awards including the Dentsu Award. From 1997 to 2020, held various positions in the Sales Department (currently Business Produce Department), working with clients such as Shiseido, Nippon Life, HIS, Yoshimoto Kogyo, and various publishing companies including Bungeishunju. Became the Sales Department Manager in 2005, focusing on planning and implementing large-scale campaigns. Currently, engaged as a member of the subsidiary of Dentsu, New Horizon Collective, under a business outsourcing contract. Graduated from the 43rd class of the 'Advertising Conference' editorial writer training course, receiving an award f

Six entries need correction: IDs 12, 20, 22, 65, 66, and 120.
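For a larger dataset, a scan like the one below could shortlist candidates before reading everything by eye. It is only a sketch: `find_suspect_ids` is a hypothetical helper, and its two heuristics (a field that is not a plain string, or a string that still contains kana or kanji) are assumptions about what "needs correction" means here.

```python
import re

# Hiragana, katakana, and the common CJK ideograph block.
JAPANESE_RE = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")


def find_suspect_ids(dataset, fields=("title", "profile")):
    """Return indices whose fields are not plain strings or still contain Japanese."""
    suspects = []
    for idx, member in enumerate(dataset):
        for field in fields:
            value = member.get(field)
            if not isinstance(value, str) or JAPANESE_RE.search(value):
                suspects.append(idx)
                break  # one bad field is enough to flag the entry
    return suspects


# Toy data mimicking the failure modes seen in this notebook.
sample = [
    {"title": "Producer", "profile": "Worked at Dentsu."},
    {"title": {"english": "Consultant"}, "profile": "Fine."},
    {"title": "Marketing Consultant", "profile": "2021年1月より起業、現在に至る。"},
]
print(find_suspect_ids(sample))
```

A scan like this only narrows the list; translation quality itself still needs a human read.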

#### Manual correction
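Three of the six entries (IDs 12, 22, and 65) fail in a similar way: the model wrapped the value in a dict keyed by language. A hedged sketch of a helper that unwraps this case follows; `unwrap_english` is a hypothetical name, and matching the key case-insensitively is an assumption made so that both `english` and `English` are handled.

```python
def unwrap_english(value):
    """Return the English string from a language-keyed dict, else the value unchanged."""
    if isinstance(value, dict):
        for key, inner in value.items():
            if key.lower() == "english" and isinstance(inner, str):
                return inner
    return value


# Toy entry shaped like the raw output for ID 12.
entry = {
    "title": {"english": "Business Producer"},
    "profile": "Already a plain string.",
}
for field in ("title", "profile"):
    entry[field] = unwrap_english(entry[field])
print(entry)
```

Dicts without an English key, such as the ones for IDs 20 and 120, pass through unchanged and still need the manual handling shown below.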

##### ID 12

In [None]:
pdict(check[12])

{
    "id": 12,
    "title": {
        "english": "Business Producer, Mental Health Counselor, Coffee Coordinator"
    },
    "profile": {
        "english": "Shintaro Fujii holds a qualification as a mental health counselor and works as a business producer who is capable of attentive listening. His motto is to create businesses that benefit people, society, and children."
    }
}


In [None]:
new_dataset[12]['title'] = new_dataset[12]['title']['english']
new_dataset[12]['profile'] = new_dataset[12]['profile']['english']
pdict(new_dataset[12])

{
    "name": "藤井慎太郎",
    "company": "株式会社 藤家　代表取締役",
    "title": "Business Producer, Mental Health Counselor, Coffee Coordinator",
    "profile": "Shintaro Fujii holds a qualification as a mental health counselor and works as a business producer who is capable of attentive listening. His motto is to create businesses that benefit people, society, and children.",
    "metadata": {
        "occupations": [
            "Account Planning/Production",
            "Media (Television)",
            "Digital/DX"
        ],
        "industries": [
            "Energy/Materials/Machinery",
            "Food/Agriculture",
            "Beverages/Confectionery",
            "Pharmaceuticals/Medical Supplies",
            "Cosmetics/Toiletries",
            "Fashion/Accessories",
            "Precision Equipment/Office Supplies",
            "Automobiles/Related Products",
            "Household Goods",
            "Hobbies/Sports Equipment",
            "Information/Communication",
            "

##### ID 20

In [None]:
pdict(check[20])

{
    "id": 20,
    "title": "Business Development Consultant, Management Planning Consultant",
    "profile": {
        "経歴": "① Worked at Dentsu for 34 years. Conducted marketing services domestically and internationally. Particularly, stationed in 3 countries (Singapore, Indonesia, India) over 10 years, with a rich local network. ② Established a personal company in January 2021. Mainly supports client companies in new business development and overseas business expansion.",
        "現在の業務領域": "① Various planning services for resource-limited companies. Can provide comprehensive service from information gathering to strategy formulation and planning for any type of planning, such as communication planning, business planning, and management planning. Aiming to be easy to understand even for first-timers. ② Support for overseas business expansion. Leveraging 15 years of experience in overseas business and overseas assignments, can support in planning and implementation of export, overse

In [None]:
# each source string already ends with a period, so none is added before the newline
new_dataset[20]['profile'] = f"Career: {new_dataset[20]['profile']['経歴']}\nCurrent business area: {new_dataset[20]['profile']['現在の業務領域']}"
pdict(new_dataset[20])

{
    "name": "福本浩一",
    "company": "合同会社ブライトビジョン",
    "title": "Business Development Consultant, Management Planning Consultant",
    "profile": "Career: ① Worked at Dentsu for 34 years. Conducted marketing services domestically and internationally. Particularly, stationed in 3 countries (Singapore, Indonesia, India) over 10 years, with a rich local network. ② Established a personal company in January 2021. Mainly supports client companies in new business development and overseas business expansion.\nCurrent business area: ① Various planning services for resource-limited companies. Can provide comprehensive service from information gathering to strategy formulation and planning for any type of planning, such as communication planning, business planning, and management planning. Aiming to be easy to understand even for first-timers. ② Support for overseas business expansion. Leveraging 15 years of experience in overseas business and overseas assignments, can support in planning and 

##### ID 22

In [None]:
pdict(check[22])

{
    "id": 22,
    "title": {
        "english": "Project Handler, Consultant, Account Manager, Printing Director"
    },
    "profile": {
        "english": "I have experience in a wide range of fields beyond advertising and promotion, from small jobs worth a few thousand yen to system development projects worth tens of billions of yen, and even serving as the coordinator for national projects worth hundreds of billions of yen. I take joy in working with various people and aim to satisfy them by completing the work together. I pride myself on being knowledgeable about cashless and energy-related matters, but I see myself more as a project producer and manager rather than a specialist, and consider myself a jack-of-all-trades."
    }
}


In [None]:
new_dataset[22]['title'] = new_dataset[22]['title']['english']
new_dataset[22]['profile'] = new_dataset[22]['profile']['english']
pdict(new_dataset[22])

{
    "name": "後藤康夫",
    "company": "個人事業主/株式会社DETOURNER",
    "title": "Project Handler, Consultant, Account Manager, Printing Director",
    "profile": "I have experience in a wide range of fields beyond advertising and promotion, from small jobs worth a few thousand yen to system development projects worth tens of billions of yen, and even serving as the coordinator for national projects worth hundreds of billions of yen. I take joy in working with various people and aim to satisfy them by completing the work together. I pride myself on being knowledgeable about cashless and energy-related matters, but I see myself more as a project producer and manager rather than a specialist, and consider myself a jack-of-all-trades.",
    "metadata": {
        "occupations": [
            "Account Planning/Production",
            "Marketing/Concept Development/Strategic Planning/Research",
            "Events",
            "Sales Promotion",
            "Public Relations/PR",
            "Medi

##### ID 65

In [None]:
pdict(check[65])

{
    "id": 65,
    "title": {
        "Japanese": "代表取締役CEO、戦略プロデューサー、動物取扱責任者",
        "English": "Representative Director and CEO, Strategic Producer, Animal Handling Officer"
    },
    "profile": "Kishi Takayoshi joined Dentsu Inc. in 1992 and worked mainly in the sales field. He retired from the company in 2020. During his time at Dentsu, he served as a business producer for a wide range of clients from various industries such as Honda, Takeda Pharmaceutical, and Asics, as well as international sports events such as the Olympics and F1. He also has extensive experience in advertising production in overseas markets such as China, Southeast Asia, the Middle East, and Europe. Skilled in proposal-based production across all areas from business planning to product and service development, marketing planning, and creative output."
}


In [None]:
new_dataset[65]['title'] = new_dataset[65]['title']['English']
pdict(new_dataset[65])

{
    "name": "岸貴義",
    "company": "株式会社キスオブライフ　/　エアモビリティ株式会社　/　株式会社フォーステック",
    "title": "Representative Director and CEO, Strategic Producer, Animal Handling Officer",
    "profile": "Kishi Takayoshi joined Dentsu Inc. in 1992 and worked mainly in the sales field. He retired from the company in 2020. During his time at Dentsu, he served as a business producer for a wide range of clients from various industries such as Honda, Takeda Pharmaceutical, and Asics, as well as international sports events such as the Olympics and F1. He also has extensive experience in advertising production in overseas markets such as China, Southeast Asia, the Middle East, and Europe. Skilled in proposal-based production across all areas from business planning to product and service development, marketing planning, and creative output.",
    "metadata": {
        "occupations": [
            "Account Planning/Production",
            "Content Business/Sports Business",
            "Marketing/Concept Devel

##### ID 66

In [None]:
pdict(check[66])

{
    "id": 66,
    "title": "Marketing Consultant",
    "profile": "(株)電通にて、流通企業の営業として長年従事したのち、電通内デジタル関連部門、㈱電通デジタル、楽天データマーケティング㈱の設立に経営幹部として参画し、ビジネスモデルの構築及び各社の成長に貢献する。デジタルマーケティングに対する知見に加え、リアル店舗流通、Eコマース、フランチャイズビジネス、PR領域についての豊富な経験と知見を持つ。2021年1月より起業、現在に至る。中小企業診断士 独立行政法人 中小企業基盤整備機構　中小企業アドバイザー（経営支援）"
}


In [None]:
mini_tr(66)

696
株式会社ザイサク
マーケティングコンサルタント
(株)電通にて、流通企業の営業として長年従事したのち、電通内デジタル関連部門、㈱電通デジタル、楽天データマーケティング㈱の設立に経営幹部として参画し、ビジネスモデルの構築及び各社の成長に貢献する。デジタルマーケティングに対する知見に加え、リアル店舗流通、Eコマース、フランチャイズビジネス、PR領域についての豊富な経験と知見を持つ。2021年1月より起業、現在に至る。中小企業診断士 独立行政法人 中小企業基盤整備機構　中小企業アドバイザー（経営支援）
{
    "title": "Marketing Consultant",
    "profile": "(株)電通にて、流通企業の営業として長年従事したのち、電通内デジタル関連部門、㈱電通デジタル、楽天データマーケティング㈱の設立に経営幹部として参画し、ビジネスモデルの構築及び各社の成長に貢献する。デジタルマーケティングに対する知見に加え、リアル店舗流通、Eコマース、フランチャイズビジネス、PR領域についての豊富な経験と知見を持つ。2021年1月より起業、現在に至る。中小企業診断士 独立行政法人 中小企業基盤整備機構　中小企業アドバイザー（経営支援）"
}
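The returned profile above is identical to the Japanese input. A check on the output could catch this automatically and trigger a retry or a manual fallback. The sketch below assumes a hypothetical `translate` callable that takes Japanese text and returns the model's answer; it is not the `mini_tr` function used above, whose internals are not shown here.

```python
import re

# Hiragana, katakana, and the common CJK ideograph block.
JAPANESE_RE = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")


def translate_with_retry(text, translate, max_attempts=3):
    """Call `translate` until the result has no Japanese characters.

    Returns (result, ok). ok is False if every attempt still contained
    Japanese, in which case the entry should be handled manually.
    """
    result = text
    for _ in range(max_attempts):
        result = translate(text)
        if not JAPANESE_RE.search(result):
            return result, True
    return result, False


# Stand-in translator that echoes its input, imitating the ID 66 behavior.
def echo(text):
    return text


print(translate_with_retry("マーケティングコンサルタント", echo))
```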


Strangely, the model does not translate this entry even after multiple attempts. Let's translate it manually with ChatGPT (GPT-4) using the same prompt template.

In [None]:
new_dataset[66]["profile"] = "After a long tenure as a sales professional in distribution companies at Dentsu Inc., he/she participated as an executive in the establishment of Dentsu's digital-related divisions, Dentsu Digital Inc., and Rakuten Data Marketing Inc., contributing to the development of business models and the growth of each company. Possessing expertise in digital marketing, as well as extensive experience and knowledge in physical store distribution, e-commerce, franchise businesses, and public relations. Began entrepreneurship in January 2021 and continues to present. SME Diagnostician, Independent Administrative Agency Small and Medium Enterprise Infrastructure Development Organization - SME Advisor (Business Support)."
pdict(new_dataset[66])

{
    "name": "岸本直樹",
    "company": "株式会社ザイサク",
    "title": "Marketing Consultant",
    "profile": "After a long tenure as a sales professional in distribution companies at Dentsu Inc., he/she participated as an executive in the establishment of Dentsu's digital-related divisions, Dentsu Digital Inc., and Rakuten Data Marketing Inc., contributing to the development of business models and the growth of each company. Possessing expertise in digital marketing, as well as extensive experience and knowledge in physical store distribution, e-commerce, franchise businesses, and public relations. Began entrepreneurship in January 2021 and continues to present. SME Diagnostician, Independent Administrative Agency Small and Medium Enterprise Infrastructure Development Organization - SME Advisor (Business Support).",
    "metadata": {
        "occupations": [
            "Account Planning/Production",
            "Marketing/Concept Development/Strategic Planning/Research",
            "Events",

##### ID 120

In [None]:
pdict(check[120])

{
    "id": 120,
    "title": {
        "ITコンサルタント": "IT Consultant",
        "プロデューサー": "Producer",
        "ライター": "Writer",
        "カメラマン": "Photographer",
        "フードコンサルタント": "Food Consultant"
    },
    "profile": "IT/Communication Consulting: With 27 years of experience at Dentsu, including in the system department and digital promotion department, I bring a unique perspective as a specialist in both advertising and IT to solve client problems. Writing: Known for precise content understanding and clear writing, with contributions to business book reviews in Dentsu Report and specialized PC magazines. I also provide support for reporting and review creation based on insights and technical evaluations, as well as simplification of manuals and instruction manuals."
}


In [None]:
new_dataset[120]['title'] = ', '.join([v for v in new_dataset[120]['title'].values()])
pdict(new_dataset[120])

{
    "name": "大木天馬",
    "company": "株式会社フォルメノス",
    "title": "IT Consultant, Producer, Writer, Photographer, Food Consultant",
    "profile": "IT/Communication Consulting: With 27 years of experience at Dentsu, including in the system department and digital promotion department, I bring a unique perspective as a specialist in both advertising and IT to solve client problems. Writing: Known for precise content understanding and clear writing, with contributions to business book reviews in Dentsu Report and specialized PC magazines. I also provide support for reporting and review creation based on insights and technical evaluations, as well as simplification of manuals and instruction manuals.",
    "metadata": {
        "occupations": [
            "Account Planning/Production",
            "Marketing/Concept Development/Strategic Planning/Research",
            "Events",
            "Sales Promotion",
            "Information Systems/IT"
        ],
        "industries": [
          

#### Finalization

Let's check once more, then save the data.

In [None]:
check = []
for idx, member in enumerate(new_dataset):  # idx avoids shadowing the built-in id()
    check.append({"id": idx, "title": member["title"], "profile": member["profile"]})
pdict(check)

[
    {
        "id": 0,
        "title": "Producer, Planning Director",
        "profile": "Graduated from Keio University's Faculty of Economics in 1987, and joined Dentsu Co., Ltd. in the same year. From 1987 to 1997, worked as a copywriter and CM planner in the Creative Department, handling clients such as Ajinomoto, Coca-Cola Japan, Fumakilla, and Adelans. Received numerous awards including the Dentsu Award. From 1997 to 2020, held various positions in the Sales Department (currently Business Produce Department), working with clients such as Shiseido, Nippon Life, HIS, Yoshimoto Kogyo, and various publishing companies including Bungeishunju. Became the Sales Department Manager in 2005, focusing on planning and implementing large-scale campaigns. Currently, engaged as a member of the subsidiary of Dentsu, New Horizon Collective, under a business outsourcing contract. Graduated from the 43rd class of the 'Advertising Conference' editorial writer training course, receiving an award f

Then, we're all done!

In [None]:
import datetime
import json

today = datetime.datetime.now().strftime("%Y%m%d")
filename = f"members-en_v{today}.json"

# write data to a file
try:
    with open(filename, 'w', encoding='utf-8') as file:
        json.dump(new_dataset, file, ensure_ascii=False)
    # report success only after the file has been closed
    print(f"Data successfully saved to {filename}")
except OSError as e:
    print("An error occurred while writing the file:", e)

Data successfully saved to members-en_v20231221.json
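A quick way to confirm that the saved file is intact is to load it back and compare it with the data in memory. The sketch below uses a toy list and a temporary directory instead of the real `new_dataset` and `filename`:

```python
import json
import os
import tempfile

# Toy data with non-ASCII text, like the name and company fields.
data = [{"name": "藤井慎太郎", "title": "Business Producer"}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "members-en_test.json")
    with open(path, "w", encoding="utf-8") as file:
        json.dump(data, file, ensure_ascii=False)
    with open(path, encoding="utf-8") as file:
        reloaded = json.load(file)

# The round trip should give back exactly the same structure.
assert reloaded == data
print(f"Round trip OK: {len(reloaded)} entries")
```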


In [None]:
from google.colab import files
files.download(filename)
