---
title: "Building Dataset"
jupyter: python3
format:
  html:
    grid: 
      body-width: 1000px
      sidebar-width: 300px
      margin-width: 300px
    toc: true
    toc-title: Contents
    page-layout: full
    code-overflow: wrap
    
number-sections: true
reference-location: margin
citation-location: margin
---

# Scraping Data

The old version of the BCAS website doesn't use JavaScript animations, so the classic `BeautifulSoup` library is enough at this stage.

## Issues

First we should investigate the structure of the website. The article data we are looking for is in the 'Past Issues' (过刊目录) section. Each issue page contains links to the full articles in PDF or HTML, and to the abstracts (摘要), which we will explore next.

Our final goal is to retrieve the data on the articles. The website structure shows that the links to the articles can be found inside issues. So, our first move is to get the links to the issues.

In [2]:
# | echo: false
url = "http://old2022.bulletin.cas.cn/zgkxyyk/ch/reader/issue_browser.aspx"
issues = 'data/bcas_issues.txt'

Let's build a function to extract all issue URLs from the catalogue page.

In [3]:
from bs4 import BeautifulSoup
import requests


def get_urls(url):
    try:
        response = requests.get(url)
        soup = BeautifulSoup(response.content, "html.parser")
        urls = []
        # links to issues have 'a' tag
        links = soup.find_all('a')

        # extract the href from each 'a', skipping anchors without one
        for link in links:
            href = link.get('href')
            if href:
                urls.append(href)
        return urls

    except Exception as e:
        print("Error:", str(e))
        return None

Once the links are retrieved we need to save them in a txt file.

In [4]:
def save_urls(urls, txt):
    with open(txt, 'w') as file:
        for url in urls:
            if url.startswith('issue_list.aspx?year_id='):
                file.write(
                    'http://old2022.bulletin.cas.cn/zgkxyyk/ch/reader/' + url + '\n')

    print("Issues URLs saved to", txt)
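String concatenation works here because every issue link is relative to the same directory. As a more general sketch, the standard library's `urljoin` can resolve a relative `href` against the page it was found on. The `year_id` value below is only an illustrative assumption.

```python
from urllib.parse import urljoin

# page the links were scraped from
catalogue_url = "http://old2022.bulletin.cas.cn/zgkxyyk/ch/reader/issue_browser.aspx"

# an example relative link (the year_id value is illustrative)
href = "issue_list.aspx?year_id=2024"

# urljoin replaces the last path segment of the base with the relative link
issue_url = urljoin(catalogue_url, href)
print(issue_url)
```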

Finally, we can execute the two functions and get the links to the issues.

In [None]:
# get URLs
urls = get_urls(url)
# save to txt
save_urls(urls, issues)

## Articles

Once we get the links to the issues, we can iterate through them and retrieve the links to the desired articles.

In [None]:
articles = 'data/bcas_articles.txt'

with open(articles, "a") as output:
    # read the list of issues pages from issues.txt
    with open(issues, "r") as file:
        issue_urls = file.read().splitlines()

    # iterate through each URL
    for url in issue_urls:
        article_urls = get_urls(url)
        if article_urls:
            # save article URLs by appending to the file
            for article_url in article_urls:
                if article_url.startswith('view_abstract.aspx?file_no='):
                    output.write(
                        'http://old2022.bulletin.cas.cn/zgkxyyk/ch/reader/' + article_url + '\n')

print("Article URLs saved to", articles)

In [None]:
# | echo: false
def remove_duplicates(file_path):
    with open(file_path, 'r') as file:
        lines = file.readlines()

    # remove duplicates while preserving order
    unique_lines = []
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            unique_lines.append(line)

    # Write the unique lines back to the file
    with open(file_path, 'w') as file:
        file.writelines(unique_lines)

    print(f"Duplicates removed. Check {file_path}")


remove_duplicates(articles)
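The same order-preserving deduplication can be sketched more compactly with `dict.fromkeys`, which keeps the first occurrence of every key (dictionaries preserve insertion order since Python 3.7). The sample lines are made up for illustration.

```python
# hypothetical lines as they might appear in the articles file
lines = [
    "view_abstract.aspx?file_no=A\n",
    "view_abstract.aspx?file_no=B\n",
    "view_abstract.aspx?file_no=A\n",
]

# dict keys are unique and keep insertion order, so the first copy wins
unique_lines = list(dict.fromkeys(lines))
print(unique_lines)
```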

## Article Data

Now we can scrape the data for each BCAS article.
To do so we need to analyze the HTML structure of the pages and determine CSS selectors for the desired elements.
Article pages contain a lot of data about publications:

- Title
- Date
- Issue
- Authors
- Affiliations
- Abstracts
- Keywords
- Associated fund projects
- Views and downloads statistics

### Define Elements

We can start with a function that gets the desired element using BeautifulSoup functionality.

In [220]:
# get text of an element if it exists
def get_element(soup, selector):
    element = soup.select_one(selector)
    return element.get_text(strip=True, separator=",") if element else ""

Next we locate each element through CSS selectors and parse with "html.parser". The function returns a dictionary with the extracted data.

The parsing process may take some time, and connection errors (on either side) can interrupt it. That's why it is better to wrap the parsing in a try-except block. If any exception occurs during the process, the function prints an error message with the URL and the specific error, and then returns None.

In [225]:
# function to extract text using BeautifulSoup
def get_data(url):
    try:
        response = requests.get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, "html.parser")

        # extracting data using CSS selectors
        title_cn = get_element(soup, 'span#FileTitle')
        title_en = get_element(soup, 'span#EnTitle')
        author_cn = get_element(soup, 'div.cn_author')
        author_en = get_element(soup, 'div.en_author')
        org_cn = get_element(soup, 'div.cn_unit')
        org_en = get_element(soup, 'div.en_unit')
        abstract_cn = get_element(soup, 'div.zw_zhaiyao')
        abstract_en = get_element(soup, 'div.yw_zhaiyao')
        keywords_cn = get_element(soup, 'div.zw_gjc')
        keywords_en = get_element(soup, 'div.yw_gjc')
        fund_project = get_element(soup, 'div.jjxm')
        date = get_element(soup, 'div.d_deta.fr')
        views = get_element(soup, 'span#ClickNum')
        downloads = get_element(soup, 'span#PDFClickNum')

        return {
            "url": url,
            "title_cn": title_cn,
            "title_en": title_en,
            "author_cn": author_cn,
            "author_en": author_en,
            "org_cn": org_cn,
            "org_en": org_en,
            "abstract_cn": abstract_cn,
            "abstract_en": abstract_en,
            "keywords_cn": keywords_cn,
            "keywords_en": keywords_en,
            "fund_project": fund_project,
            "date": date,
            "views": views,
            "downloads": downloads,
        }

    except Exception as e:
        print(f"An error occurred while processing {url}: {e}")
        return None

In [6]:
# | echo: false
# dataset in csv
dataset = "data/bcas_dataset.csv"

Next we need to write the scraped data into a csv file.

First, we open a csv file in write mode and define the fieldnames, which match the keys of the dictionary returned by the get_data() function. We use a pipe "|" as the delimiter.

Next we open the articles.txt with the URLs and iterate through them, retrieving the data, which is then stored in the csv.


In [None]:
import csv


with open(dataset, mode="w", newline="", encoding="utf-8") as csv_file:
    fieldnames = [
        "url", "date", "views", "downloads",
        "author_cn", "author_en",
        "title_cn", "title_en",
        "org_cn", "org_en",
        "abstract_cn", "abstract_en",
        "keywords_cn", "keywords_en",
        "fund_project"
    ]
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames, delimiter="|")
    writer.writeheader()

    # read the list of URLs from the file
    with open(articles, "r") as file:
        article_urls = file.read().splitlines()

        # iterate through each url, extract data, and write to csv
        for url in article_urls:
            data = get_data(url)
            if data:
                writer.writerow(data)

print(f"Data has been extracted and saved to {dataset}")

In [1]:
# | echo: false
dataset_fixed = "data/bcas_dataset_fixed.csv"

Due to some formatting issues, the parsed data can contain line breaks ("\n") in some cases, which split a record across several lines. We can fix this issue with a simple function which removes all line breaks (\n) from the content and then adds a new line break before each occurrence of 'http://old2022.bulletin.cas.cn/'.

In [59]:
# fix broken lines in the csv
def fix_lines(dataset_csv, dataset_csv_fixed):
    with open(dataset_csv, mode='r', encoding='utf-8') as file:
        content = file.read()

    # replace all line endings with empty string to remove them
    content = content.replace('\n', '')

    # add a new line before each 'http' to separate URLs
    content = content.replace(
        'http://old2022.bulletin.cas.cn/', '\nhttp://old2022.bulletin.cas.cn/')

    with open(dataset_csv_fixed, mode='w', encoding='utf-8') as outfile:
        outfile.write(content)


fix_lines(dataset, dataset_fixed)
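An alternative sketch, assuming the stray line breaks come from the scraped field text itself: collapse all whitespace inside each field before the row is written, so every record stays on one physical line. The helper name `clean_row` and the sample record are hypothetical.

```python
import csv
import io
import re


def clean_row(row):
    # collapse runs of whitespace (including \r and \n) into one space
    return {key: re.sub(r"\s+", " ", value).strip() for key, value in row.items()}


# made-up record with a line break inside the abstract
row = {"url": "http://example.com/a", "abstract_en": "First line\nsecond line"}

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["url", "abstract_en"], delimiter="|")
writer.writeheader()
writer.writerow(clean_row(row))

# one header line plus one record line
print(buffer.getvalue().splitlines())
```

Cleaning at write time keeps the csv readable line by line, at the cost of losing the original paragraph breaks inside abstracts.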

### Similar Articles

Our goal is to cluster the articles using the available texts -- so the more useful texts we have, the better. On the article pages there is a section called "Similar Articles" (相似文献). The titles of similar publications may be useful for topic modeling, because they increase the chances that articles with similar references appear in the same cluster.

Accessing this data requires clicking a "Similar Articles" button; otherwise the text is not present on the page. We can simulate clicking (and do a lot of other useful things) using the `selenium` library.

In [None]:
import time
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.common.by import By

# output file for the similar-article titles
similar_csv = 'data/bcas_similar.csv'

with open(similar_csv, mode="w", newline="", encoding="utf-8") as csv_file:
    fieldnames = ["url", "similar"]
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames, delimiter="|")
    writer.writeheader()


def initialize_driver():
    firefox_options = webdriver.FirefoxOptions()
    firefox_options.add_argument("--headless")
    return webdriver.Firefox(options=firefox_options)


driver = initialize_driver()

data = []

with open(articles, 'r') as file:
    urls = file.read().splitlines()

for url in urls:
    try:
        driver.get(url)

        WebDriverWait(driver, 10).until(EC.element_to_be_clickable(
            (By.XPATH, '//div[text()="相似文献"]'))).click()
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CLASS_NAME, "ArticleList")))

        html = driver.page_source
        soup = BeautifulSoup(html, 'html.parser')

        # find all <a> with class "ArticleList"
        links = soup.find_all('a', class_='ArticleList')

        # extract the text from each element
        similar_list = [link.text.strip() for link in links]

    except TimeoutException:
        similar_list = ['time_error']

    except WebDriverException:
        max_retries = 5
        retry_count = 0
        while retry_count < max_retries:
            try:
                time.sleep(3)
                driver.quit()
                driver = initialize_driver()
                driver.get(url)
                WebDriverWait(driver, 2).until(EC.element_to_be_clickable(
                    (By.XPATH, '//div[text()="相似文献"]'))).click()
                WebDriverWait(driver, 2).until(
                    EC.presence_of_element_located((By.CLASS_NAME, "ArticleList")))
                html = driver.page_source
                soup = BeautifulSoup(html, 'html.parser')
                links = soup.find_all('a', class_='ArticleList')
                similar_list = [link.text.strip() for link in links]
                break

            except (WebDriverException, TimeoutException):
                retry_count += 1
                if retry_count == max_retries:
                    similar_list = ['web_error']

    finally:
        data.append({'url': url, 'similar': similar_list})

        with open(similar_csv, 'a', newline='', encoding='utf-8') as csvfile:
            writer = csv.DictWriter(csvfile, fieldnames=[
                                    'url', 'similar'], delimiter="|")
            writer.writerow({'url': url, 'similar': similar_list})

driver.quit()

print(f"Data extracted and saved to {similar_csv}.")
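The retry logic in the loop above can also be factored into a small helper. This is a minimal sketch under stated assumptions, not the code used for the scrape: `with_retries` and `flaky` are hypothetical names, and a plain `RuntimeError` stands in for selenium's `WebDriverException`.

```python
import time


def with_retries(action, max_retries=5, delay=0.0, fallback=None):
    """Call action() until it succeeds or max_retries attempts have failed."""
    for _ in range(max_retries):
        try:
            return action()
        except RuntimeError:
            # wait before the next attempt
            time.sleep(delay)
    return fallback


calls = {"count": 0}


def flaky():
    # fails twice, then succeeds
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return ["title A", "title B"]


result = with_retries(flaky, fallback=["web_error"])
print(result, calls["count"])
```

Keeping the retry policy in one place means the main loop only has to decide what to record when every attempt fails.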

### Merge Dataset

Once we have one dataset with article metadata and another with similar articles, we can combine them into one.
A quick and easy way to do this is with pandas.

We create two dataframes and merge them with a SQL-like function.

In [5]:
import pandas as pd

df = pd.read_csv(dataset_fixed, sep='|')
similar_df = pd.read_csv(similar_csv, sep='|')

We can explore the dataset a bit. 

In [6]:
# | echo: false
# | output: true

df.loc[[3]]

Unnamed: 0,url,date,views,downloads,author_cn,author_en,title_cn,title_en,org_cn,org_en,abstract_cn,abstract_en,keywords_cn,keywords_en,fund_project
3,http://old2022.bulletin.cas.cn/zgkxyyk/ch/read...,"中国科学院院刊:2024,39(1):17-26",470,448,"谭光明,,,贾伟乐,,,王展,,,元国军,,,邵恩,,,孙凝晖*","TAN Guangming,,,JIA Weile,,,WANG Zhan,,,YUAN G...",面向模拟智能的计算系统,Computing system for simulation intelligence,(中国科学院计算技术研究所 北京 100190),"(Institute of Computing Technology, Chinese Ac...","中文摘要:,科学研究中的计算机模拟称为科学模拟（scientific simulation）...","Abstract:,This study refers computer simulatio...","中文关键词:,科学模拟,模拟智能,人工智能,计算系统,Z级计算","keywords:,scientific simulation,simulation int...","基金项目:,国家杰出青年科学基金（T2125013）"


And this is what the dataset with references looks like.

In [16]:
# | echo: false
# | output: true

similar_df.head()

Unnamed: 0,url,similar
0,http://old2022.bulletin.cas.cn/zgkxyyk/ch/read...,[]
1,http://old2022.bulletin.cas.cn/zgkxyyk/ch/read...,"['科研信息化发展态势和思考', '数据科学与计算智能：内涵、范式与机遇', '人工智能驱动..."
2,http://old2022.bulletin.cas.cn/zgkxyyk/ch/read...,"['人工智能驱动的科学研究新范式：从AI4S到智能科学', 'GPT技术变革对基础科学研究的..."
3,http://old2022.bulletin.cas.cn/zgkxyyk/ch/read...,"['信息化:从计算机科学到计算科学', '科学大数据智能分析软件的现状与趋势', '中国高通..."
4,http://old2022.bulletin.cas.cn/zgkxyyk/ch/read...,"['人工智能驱动的科学研究新范式：从AI4S到智能科学', '适度超前推动科研基础平台建设 ..."




There are multiple ways to combine these two datasets. For now, the SQL-like merge() function will do just fine.

In [None]:
df = pd.merge(df, similar_df, on='url', how='left')
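To make the behaviour of the left join concrete, here is a small sketch on toy data (the URLs and titles are invented). With `how='left'` every row of the left frame is kept, and `indicator=True` adds a `_merge` column showing which rows found a match.

```python
import pandas as pd

# toy frames with invented values
left = pd.DataFrame({"url": ["u1", "u2"], "title_cn": ["A", "B"]})
right = pd.DataFrame({"url": ["u1"], "similar": ["['X', 'Y']"]})

merged = pd.merge(left, right, on="url", how="left", indicator=True)
print(merged)
```

Rows marked `left_only` are articles for which no similar-article record exists, so their `similar` value is missing.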

In [19]:
# | echo: false
# | output: true
df.loc[[3]]

Unnamed: 0,url,date,views,downloads,author_cn,author_en,title_cn,title_en,org_cn,org_en,abstract_cn,abstract_en,keywords_cn,keywords_en,fund_project,similar
3,http://old2022.bulletin.cas.cn/zgkxyyk/ch/read...,"中国科学院院刊:2024,39(1):17-26",470,448,"谭光明,,,贾伟乐,,,王展,,,元国军,,,邵恩,,,孙凝晖*","TAN Guangming,,,JIA Weile,,,WANG Zhan,,,YUAN G...",面向模拟智能的计算系统,Computing system for simulation intelligence,(中国科学院计算技术研究所 北京 100190),"(Institute of Computing Technology, Chinese Ac...","中文摘要:,科学研究中的计算机模拟称为科学模拟（scientific simulation）...","Abstract:,This study refers computer simulatio...","中文关键词:,科学模拟,模拟智能,人工智能,计算系统,Z级计算","keywords:,scientific simulation,simulation int...","基金项目:,国家杰出青年科学基金（T2125013）","['信息化:从计算机科学到计算科学', '科学大数据智能分析软件的现状与趋势', '中国高通..."


# Data Cleaning

The data we got is quite messy: there are lots of missing values, extra commas and other punctuation marks, and some columns can be split in two to make more sense.

In these cases we can do some cleaning with `pandas` and regular expressions.

In [3]:
# | echo: false
import pandas as pd
import regex as re

In [4]:
# | echo: false
dataset_fixed = "data/bcas_dataset_fixed.csv"
similar_csv = 'data/bcas_similar.csv'

df = pd.read_csv(dataset_fixed, sep='|')
similar_df = pd.read_csv(similar_csv, sep='|')

df = pd.merge(df, similar_df, on='url', how='left')

## Date & Issue

First, we should transform strings like '2024,39(1):0-0' into a more meaningful form.

This string contains three features -- year, issue, and pages. The pattern separating these features is consistent throughout the dataset, so with a simple split by comma and colon we can create the columns 'date', 'issue', and 'page'.

In [77]:
# 中国科学院院刊:2024,39(1):0-0
import regex as re

df['date'] = df['date'].str.replace('中国科学院院刊:', '', regex=True)
df[['date', 'issue']] = df['date'].str.split(',', n=1, expand=True)
df[['issue', 'page']] = df['issue'].str.split(':', n=1, expand=True)
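The same split can be sketched as a single regular expression with named groups, which also makes strings that do not fit the pattern easy to spot. The group names simply mirror the column names, and the standard `re` module is used here instead of `regex`.

```python
import re

# year , volume(number) : pages
pattern = re.compile(
    r"(?P<date>\d{4}),(?P<issue>\d+\(\d+\)):(?P<page>.+)$"
)

raw = "中国科学院院刊:2024,39(1):17-26"
match = pattern.search(raw)

# None signals a string that does not follow the expected layout
parts = match.groupdict() if match else None
print(parts)
```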

In [78]:
# | echo: false
# | output: true
df.loc[:, ['date', 'issue', 'page']].head()

Unnamed: 0,date,issue,page
0,2024,39(1),0-0
1,2024,39(1),1-9
2,2024,39(1),10-16
3,2024,39(1),17-26
4,2024,39(1),27-33


## Abstracts & Keywords

Next we can remove irrelevant words like '中文摘要:', 'Abstract:', '中文关键词:', 'keywords:', '基金项目:'. This can be done through regular expression replacement. 

In [36]:
# remove redundant text and strip commas
df['abstract_cn'] = df['abstract_cn'].str.replace(
    '中文摘要:', '', regex=True).str.lstrip(',')
df['abstract_en'] = df['abstract_en'].str.replace(
    'Abstract:', '', regex=True).str.lstrip(',')
df['keywords_cn'] = df['keywords_cn'].str.replace(
    '中文关键词:', '', regex=True).str.lstrip(',')
df['keywords_en'] = df['keywords_en'].str.replace(
    'keywords:', '', regex=True).str.lstrip(',')
df['fund_project'] = df['fund_project'].str.replace(
    '基金项目:', '', regex=True).str.lstrip(',')
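As a compact variant, the five labels can be removed with one pattern anchored at the start of the string, so a label that happens to appear inside the text is left alone. This is a sketch: the helper name `strip_label` is hypothetical, and accepting a full-width colon as well is an assumption about the data.

```python
import re

# labels that precede the actual field content
labels = ["中文摘要", "Abstract", "中文关键词", "keywords", "基金项目"]

# label at the start, an ASCII or full-width colon, then any leading commas
prefix = re.compile(r"^(?:" + "|".join(map(re.escape, labels)) + r")[:：],*")


def strip_label(text):
    return prefix.sub("", text)


print(strip_label("keywords:,AI4R,emergence"))
print(strip_label("基金项目:"))
```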

In [35]:
#| echo: false
#| output: true
df.loc[[1], ['abstract_cn', 'abstract_en',
             'keywords_cn', 'keywords_en', 'fund_project']]

Unnamed: 0,abstract_cn,abstract_en,keywords_cn,keywords_en,fund_project
1,"中文摘要:,文章将“智能化科研”（AI4R）称为第五科研范式，概括它的一系列特征包括：（1）...","Abstract:,This article refers to “AI for Resea...","中文关键词:,智能化科研,涌现,组合爆炸问题,非确定计算,大科学模型,科研大平台","keywords:,AI4R,emergence,combinatorial explosi...",基金项目:


## Authors 

We need to clean the author column as well and remove extra commas, numbers and other symbols like "*".

In [36]:
def normalize_text(text):
    # remove numbers
    text = re.sub(r'\d+', '', text)

    # replace multiple commas with one
    text = re.sub(r',+', ',', text)
    # remove asterisks
    text = re.sub(r'\*', '', text)
    # remove any leading or trailing commas and whitespace
    text = text.strip(',').strip()

    return text
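As a quick check, this is how the function behaves on the raw author string shown earlier. The block repeats the function so that it runs on its own, and it uses the standard `re` module, which is enough for these patterns.

```python
import re


def normalize_text(text):
    text = re.sub(r'\d+', '', text)   # remove numbers
    text = re.sub(r',+', ',', text)   # collapse repeated commas
    text = re.sub(r'\*', '', text)    # remove asterisks
    return text.strip(',').strip()    # trim commas and whitespace


raw = "谭光明,,,贾伟乐,,,王展,,,元国军,,,邵恩,,,孙凝晖*"
print(normalize_text(raw))
```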

Some articles do not have a specified name of the author, but each publication is written by a person, so instead of NaN we fill missing values with "not_specified" text.

In [42]:
# | echo: false
# | output: show
df['author_en'] = df['author_en'].fillna('not_specified')
df['author_en'] = df['author_en'].apply(normalize_text)
df['author_cn'] = df['author_cn'].fillna('not_specified')
df['author_cn'] = df['author_cn'].apply(normalize_text)
df.loc[:, ['author_cn']].head()

Unnamed: 0,author_cn
0,not_specified
1,李国杰
2,鄂维南
3,"谭光明,贾伟乐,王展,元国军,邵恩,孙凝晖"
4,"王飞跃,王雨桐"


## Organizations

Cleaning the organization data is the tricky part, because it is the most inconsistent and messy.
An example of an organization description: (1.北京大学 北京 100871;2.北京科学智能研究院 北京 100084). One article can be written by several people from different institutions. In many cases the affiliation info also includes the city, postal code and job titles.

In [42]:
orgs = df[['url', 'org_cn']].dropna()

In [None]:
# replace '!' with a space and split on ';'
orgs['org_cn'] = orgs['org_cn'].str.replace(
    '!', ' ', regex=True).str.split(';')

# explode and strip whitespace and parentheses
orgs_expld = orgs.explode('org_cn')
orgs_expld['org_cn'] = orgs_expld['org_cn'].str.strip().str.strip('()')

# remove sequences of 4 to 8 digits at the end and strip again
orgs_expld['org_cn'] = orgs_expld['org_cn'].str.replace(
    r'\s*\d{4,8}\s*$', ' ', regex=True).str.strip()

# remove digits and spaces or commas at any position and leading non-Chinese text followed by a space
orgs_expld['org_cn'] = orgs_expld['org_cn'].str.replace(
    r'^\d+\.\d*[\s,]*|\d+[\s,]*|^.+?\s+', '', regex=True)
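To see what these steps do to a single value, here is a simplified plain-Python sketch applied to the example string above. It is not the pandas pipeline itself: `split_affiliations` is a hypothetical helper, and the last pattern only removes a leading index such as "1." rather than reproducing the full expression.

```python
import re


def split_affiliations(text):
    parts = text.replace("!", " ").split(";")
    cleaned = []
    for part in parts:
        part = part.strip().strip("()")
        # drop a trailing postal code (4 to 8 digits)
        part = re.sub(r"\s*\d{4,8}\s*$", "", part).strip()
        # drop a leading index such as "1."
        part = re.sub(r"^\d+\.\s*", "", part)
        cleaned.append(part)
    return cleaned


example = "(1.北京大学 北京 100871;2.北京科学智能研究院 北京 100084)"
print(split_affiliations(example))
```

Each resulting item still ends with the city name, which is handled separately in the next step.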

We don't need the postal codes and job titles for our research, but the city data can be useful for geospatial analysis.

We can get the relevant cities by checking which city from a list of Chinese cities (obtained from Baidu) appears in the string.

In [43]:
# | echo: false
city_df = pd.read_csv('data/cities.csv')

In [44]:
# create a regex from the city list
cities_list = city_df['city_cn'].tolist()
cities_pattern = r'\s*(?<=)(\s*' + \
    '|'.join(map(re.escape, cities_list)) + r')\b'

# compile the regex pattern
city_regex = re.compile(cities_pattern)

# extract the city from a string


def extract_city(text):
    match = city_regex.search(text)
    if match:
        return match.group(0)  # get the match
    return None


orgs_expld['city_cn'] = orgs_expld['org_cn'].apply(extract_city)
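The role of the trailing `\b` is easier to see on a reduced example. The sketch below uses a simplified pattern, a tiny illustrative city list, a hypothetical helper `find_city`, and the standard `re` module. A city name that continues into a longer word, as in 北京大学, is not treated as a city.

```python
import re

# a tiny illustrative city list; the real one comes from cities.csv
cities = ["北京", "澳门", "上海"]

# the trailing \b rejects a city name that continues into a longer word
city_pattern = re.compile("(" + "|".join(map(re.escape, cities)) + r")\b")


def find_city(text):
    match = city_pattern.search(text)
    return match.group(1) if match else None


print(find_city("北京大学 北京"))
print(find_city("北京大学"))
```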

In [None]:
# | echo: false
orgs_expld['city_cn'] = orgs_expld['city_cn'].str.strip()

In [47]:
# | echo: false
mapping_dict = dict(zip(city_df['city_cn'], city_df['city_en']))
orgs_expld['city_en'] = orgs_expld['city_cn'].map(mapping_dict)

In [48]:
# | echo: false
orgs_expld['org_cn'] = orgs_expld['org_cn'].apply(lambda x: city_regex.sub('', x).strip())

Next we need to remove the job titles from the strings.

For this purpose we can create a list of job titles which appear in the dataset.

In [49]:
job_titles = [
    "所长", "研究员",
    "院长", "校长",
    "主席", "总经理",
    "教授", "院士",
    "博士", "学部委员",
    "委员长", "组长",
    "主任", "处长",
    "部长", "主任",
    "党委书记", "秘书",
    "局长", "总裁",
    "台长", "名誉",
    "特邀顾问", "执行",
    "主管", "工程师",
    "专利代理人", "导师",
    "助理", "书记", "理事长", "馆长"
]

job_titles_pattern = '|'.join([fr'\s*{title}\s*' for title in job_titles])


def remove_job_titles(text):
    return re.sub(job_titles_pattern, '', text).strip()


orgs_expld['org_cn'] = orgs_expld['org_cn'].apply(remove_job_titles)

# remove some characters and 副 from job titles
orgs_expld['org_cn'] = orgs_expld['org_cn'].str.replace(
    r"[、《》副]", "", regex=True)
# remove empty parentheses
orgs_expld['org_cn'] = orgs_expld['org_cn'].str.replace(
    r"\(\)", "", regex=True)

Now the strings will look something like this: 

In [52]:
# | echo: false
# | output: true
orgs_expld = orgs_expld[orgs_expld.org_cn != '']
orgs_expld.head()

Unnamed: 0,url,org_cn,city_cn,city_en
2,http://old2022.bulletin.cas.cn/zgkxyyk/ch/read...,北京大学,北京,Beijing
2,http://old2022.bulletin.cas.cn/zgkxyyk/ch/read...,北京科学智能研究院,北京,Beijing
4,http://old2022.bulletin.cas.cn/zgkxyyk/ch/read...,中国科学院自动化研究所 复杂系统管理与控制国家重点实验室,北京,Beijing
4,http://old2022.bulletin.cas.cn/zgkxyyk/ch/read...,澳门科技大学 创新工程学院,澳门,Macau
4,http://old2022.bulletin.cas.cn/zgkxyyk/ch/read...,中国科学院自动化研究所 多模态人工智能系统全国重点实验室,北京,Beijing


The next important issue is the level of detail about organizations we are interested in. In some cases the affiliation data includes both the name of the head organization and a subdivision (lab, office, group, etc.). For now we will focus on the head organizations, e.g. a string like '中国科学院自动化研究所 复杂系统管理与控制国家重点实验室' is counted as '中国科学院自动化研究所'.

This can also be achieved through regular expressions.

In [53]:
# List of head organizations
head_org_endings = ["科学院", "研究所", "研究院", "大学", "学院"]

# Create regex
pattern = '|'.join(head_org_endings)

# Regex pattern to detect Chinese characters
chinese_char_pattern = re.compile(r'[\u4e00-\u9fff]')


def extract_head_org(text):
    if not text.strip():
        return text  # Return the original text if it is empty or only whitespace

    # Check if the text contains any Chinese characters
    if not chinese_char_pattern.search(text):
        return text  # Return the original text if it contains no Chinese characters

    # Check specifically for "中国科学院大学" ignoring other symbols or characters
    if re.search(r"中国科学院大学", text):
        return "中国科学院大学"

    if "中国科学院院刊" in text:
        return "中国科学院院刊"

    if "上海天文台" in text:
        return "中国科学院上海天文台"

    if "北京天文台" in text:
        return "中国科学院北京天文台"

    if "南京天文台" in text:
        return "中国科学院南京天文台"

    if "国家天文台" in text:
        return "中国科学院国家天文台"

    # Special case for "中国科学院" followed by "研究所", "中心", or "研究院"
    zky_with_suffix_match = re.search(r"(中国科学院.*?(研究所|中心|研究院))", text)
    if zky_with_suffix_match:
        return zky_with_suffix_match.group(1)

    # Search for the first occurrence of any of the common endings in the full text
    match = re.search(fr"(.+?({pattern}))", text)
    if match:
        # Extract and return the head organization
        return match.group(1)
    else:
        # No common ending found, consider the first part as the head organization
        return text.split()[0] if ' ' in text else text


# Apply the function to the DataFrame
orgs_expld['org_cn_head'] = orgs_expld['org_cn'].apply(extract_head_org)
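The special case for 中国科学院 matters because the generic pattern is non-greedy and stops at the first ending it finds. A short sketch with the two patterns in isolation, using the standard `re` module and non-capturing groups for brevity:

```python
import re

endings = ["科学院", "研究所", "研究院", "大学", "学院"]
generic = re.compile("(.+?(?:" + "|".join(endings) + "))")
cas_unit = re.compile(r"(中国科学院.*?(?:研究所|中心|研究院))")

text = "中国科学院自动化研究所 复杂系统管理与控制国家重点实验室"

# the generic pattern stops at the first ending it meets
print(generic.search(text).group(1))
# the CAS-specific pattern keeps the institute name
print(cas_unit.search(text).group(1))
# for other organizations the generic pattern is sufficient
print(generic.search("澳门科技大学 创新工程学院").group(1))
```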

In [55]:
orgs_head = orgs_expld['org_cn_head'].unique()
orgs_head_df = pd.DataFrame(orgs_head, columns=['org_cn_head'])
orgs_head_df.to_csv('data/bcas_orgs_head.csv', index=False)

In [56]:
# | echo: false
orgs_head_clean_df = pd.read_csv('data/bcas_orgs_head_clean.csv')
orgs_head_clean_df.head()

Unnamed: 0,org_cn_head,org_cn_head_clean
0,北京师范大学,北京师范大学
1,“万种园”项目首席科学家,“万种园”项目
2,“中国科学与人文论坛”长,中国科学与人文论坛
3,“论坛”处宣传,“论坛”处宣传
4,)中国科学院,中国科学院


The next step is to translate the data to English. Note that we need official English titles, so doing everything through machine translation is not the best fit.

We can try to retrieve some English titles from Baidu Baike using Beautiful Soup again.

In [None]:
def fetch_english_title(org_cn):
    base_url = 'https://baike.baidu.com/item/'
    url = base_url + org_cn
    print(f"Processing: {org_cn}")
    try:
        response = requests.get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, "html.parser")

        # find divs with class "itemWrapper_yPk3z" or "itemWrapper_qgaYJ"
        divs = soup.find_all(
            'div', class_=['itemWrapper_yPk3z', 'itemWrapper_qgaYJ'])
        for div in divs:
            # find dt with "外文名"
            dt = div.find('dt', class_='basicInfoItem_hdTH0 itemName_iCg2R')
            if dt and '外文名' in dt.text:
                # find the dd element with class "basicInfoItem_hdTH0 itemValue_rxziX"
                dd = div.find(
                    'dd', class_='basicInfoItem_hdTH0 itemValue_rxziX')
                if dd:
                    # extract English title
                    span = dd.find('span', class_='text_v1llE')
                    if span:
                        english_title = span.text.strip()
                        print(
                            f"Found English title for {org_cn}: {english_title}")
                        return english_title
        print(f"No English title found for {org_cn}")
    except Exception as e:
        print(f"Error fetching data for {org_cn}: {e}")
    return None


orgs_head_clean_df['org_cn_head_en'] = orgs_head_clean_df['org_cn_head_clean'].apply(
    fetch_english_title)

In [None]:
# | echo: false
orgs_head_clean_df.to_csv('data/bcas_orgs_head_clean_en.csv', index=False)

In [None]:
#| echo: false
#| output: false
orgs_head_clean_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1259 entries, 0 to 1258
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   org_cn_head        1259 non-null   object
 1   org_cn_head_clean  1259 non-null   object
 2   org_cn_head_en     600 non-null    object
dtypes: object(3)
memory usage: 29.6+ KB


Unfortunately, not all organizations are present on Baidu or have an English translation there. We can translate the rest with Google Translate through the `deep_translator` library.

In [None]:
# | output: false

from deep_translator import GoogleTranslator
from tqdm import tqdm

def get_translation(text):
    try:
        return GoogleTranslator(source='auto', target='en').translate(str(text))
    except KeyboardInterrupt as e:
        raise e
    except Exception as e:
        print(f"Error translating text: {text}. Error: {e}")
        return 'error'


tqdm.pandas()

# translate only the rows where 'org_cn_head_en' is null
orgs_head_clean_df.loc[orgs_head_clean_df['org_cn_head_en'].isnull(
), 'org_cn_head_en'] = orgs_head_clean_df.loc[orgs_head_clean_df['org_cn_head_en'].isnull(), 'org_cn_head_clean'].progress_apply(get_translation)

100%|██████████| 659/659 [19:41<00:00,  1.79s/it]
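The masked assignment above is dense, so here is the same idea on toy data with the mask stored in a variable. To keep the sketch offline, a hypothetical `fake_translate` dictionary lookup stands in for `get_translation()`.

```python
import pandas as pd

# toy data; None marks a title that was not found
orgs = pd.DataFrame({
    "org_cn_head_clean": ["北京大学", "中国科学院"],
    "org_cn_head_en": ["Peking University", None],
})

# hypothetical stand-in for get_translation(), so no network call is needed
lookup = {"中国科学院": "Chinese Academy of Sciences"}


def fake_translate(text):
    return lookup.get(text, "error")


# boolean mask of the rows that still lack an English title
missing = orgs["org_cn_head_en"].isnull()
orgs.loc[missing, "org_cn_head_en"] = (
    orgs.loc[missing, "org_cn_head_clean"].apply(fake_translate)
)
print(orgs)
```

Only the masked rows are passed to the translator, so titles already retrieved are never overwritten.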


This way we can get the translation of all organization titles:

In [None]:
# | echo: false
# | output: false
orgs_head_clean_df.head()

Unnamed: 0,org_cn_head,org_cn_head_clean,org_cn_head_en
0,北京师范大学,北京师范大学,Beijing Normal University
1,“万种园”项目首席科学家,“万种园”项目,"""Ten Thousand Plants Garden"" Project"
2,“中国科学与人文论坛”长,中国科学与人文论坛,China Science and Humanities Forum
3,“论坛”处宣传,“论坛”处宣传,Publicity at the Forum
4,)中国科学院,中国科学院,Chinese Academy of Sciences


We add some final touches in Excel/Google Sheets and update the dataframe. This dataset is small, so it is faster to just manually remove irrelevant symbols than to craft an automated solution.

In [8]:
# | echo: false
# | output: true
orgs_head_clean_df = pd.read_csv('data/bcas_orgs_head_clean_en_upd_v2.csv')
orgs_head_clean_df.head().style.hide(axis="index").relabel_index(["Head Organization", "Head Organization Cleaned", "Head Organization English"], axis=1)

Head Organization,Head Organization Cleaned,Head Organization English
北京师范大学,北京师范大学,Beijing Normal University
“万种园”项目首席科学家,“万种园”项目,"""Ten Thousand Plants Garden"" Project"
“中国科学与人文论坛”长,中国科学与人文论坛,China Science and Humanities Forum
“论坛”处宣传,“论坛”处宣传,Publicity at the Forum
)中国科学院,中国科学院,CAS


In [61]:
# | echo: false

mapping_dict = dict(
    zip(orgs_head_clean_df['org_cn_head'], orgs_head_clean_df['org_cn_head_en']))
orgs_expld['org_cn_head_en'] = orgs_expld['org_cn_head'].map(mapping_dict)

city_cn_mapping = orgs_expld.dropna(subset=['city_cn']).drop_duplicates(
    subset=['org_cn_head_en']).set_index('org_cn_head_en')['city_cn'].to_dict()
city_en_mapping = orgs_expld.dropna(subset=['city_en']).drop_duplicates(
    subset=['org_cn_head_en']).set_index('org_cn_head_en')['city_en'].to_dict()

orgs_expld['city_cn'] = orgs_expld.apply(
    lambda row: city_cn_mapping.get(row['org_cn_head_en'], row['city_cn']), axis=1)
orgs_expld['city_en'] = orgs_expld.apply(
    lambda row: city_en_mapping.get(row['org_cn_head_en'], row['city_en']), axis=1)

title_dict = df.set_index('url')['title_cn'].to_dict()
orgs_expld['title_cn'] = orgs_expld['url'].map(title_dict)

year_dict = df.set_index('url')['date'].to_dict()
orgs_expld['year'] = orgs_expld['url'].map(year_dict)

orgs_expld = orgs_expld.rename(columns={'org_cn_head_en': 'orgs_head'})
orgs_expld['orgs_head'] = orgs_expld['orgs_head'].str.title()
orgs_expld['orgs_head'] = orgs_expld['orgs_head'].str.replace(
    'Cas', 'CAS', regex=True)
# match whole words only, so that e.g. 'Office' is left intact
orgs_expld['orgs_head'] = orgs_expld['orgs_head'].str.replace(
    r'\bOf\b', 'of', regex=True)
orgs_expld['orgs_head'] = orgs_expld['orgs_head'].str.replace(
    r'\bAnd\b', 'and', regex=True)
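A word of caution on this kind of case fix-up: a pattern without word boundaries also matches inside longer words. The sketch below, on an invented organization name, compares a plain replacement with one anchored by `\b`.

```python
import re

# invented name, title-cased the same way as the real column
title = "office of science and technology".title()

# plain substring replacement also rewrites the start of "Office"
naive = title.replace("Of", "of")

# \b restricts the match to whole words
fixed = re.sub(r"\bOf\b", "of", title)
fixed = re.sub(r"\bAnd\b", "and", fixed)

print(naive)
print(fixed)
```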

Finally, we map years, cities and article titles to the dataframe and format the titles (change case, add abbreviations). The final result will look like this:

In [23]:
# | echo: false
# | output: true
orgs_expld = pd.read_csv('data/bcas_orgs.csv')
orgs_expld = orgs_expld[['org_cn', 'org_cn_head', 'orgs_head', 'city_cn', 'city_en', 'title_cn', 'year']]
orgs_expld.head(5).style.hide(axis="index").relabel_index(["Organization", "Head Organization (CN)", "Head Organization (EN)", "City (CN)", "City (EN)", "Article", "Year"], axis=1)

Organization,Head Organization (CN),Head Organization (EN),City (CN),City (EN),Article,Year
北京大学,北京大学,Peking University,北京,Beijing,AI助力打造科学研究新范式,2024
北京科学智能研究院,北京科学智能研究院,Beijing Institute of Scientific and Intelligent Technology,北京,Beijing,AI助力打造科学研究新范式,2024
中国科学院自动化研究所 复杂系统管理与控制国家重点实验室,中国科学院自动化研究所,"Institute of Automation, CAS",北京,Beijing,数字科学家与平行科学：AI4S和S4AI的本源与目标,2024
澳门科技大学 创新工程学院,澳门科技大学,Macau University of Science and Technology,澳门,Macau,数字科学家与平行科学：AI4S和S4AI的本源与目标,2024
中国科学院自动化研究所 多模态人工智能系统全国重点实验室,中国科学院自动化研究所,"Institute of Automation, CAS",北京,Beijing,数字科学家与平行科学：AI4S和S4AI的本源与目标,2024


## Fund Projects

Finally, we need to clean the data about associated fund projects. 

In [None]:
# | echo: false
# | output: false

fund = df[['url', 'date', 'title_cn', 'fund_project']]
fund.shape

(7216, 4)

In [None]:
# | echo: false
# | output: false

fund = fund[fund['fund_project'].str.strip() != '']
fund.shape

(1054, 4)

In [None]:
fund['fund_project'] = fund['fund_project'].str.replace('基金项目：', '', regex=True)
fund['fund_project'] = (fund['fund_project'].str.replace(',', '，', regex=False)
                                            .str.replace(';', '，', regex=False)
                                            .str.replace(r'\(', '（', regex=True)
                                            .str.replace(r'\)', '）', regex=True)
                                            .str.replace('!', '，', regex=False)
                                            .str.replace('；', '，', regex=False)
                        )

In [None]:
def replace_commas_in_brackets(text):
    return re.sub(r'[（(](.*?)[）)]', lambda m: re.sub(r'[!,;，；]', '、', m.group()), text)

fund['fund_project'] = fund['fund_project'].apply(replace_commas_in_brackets)
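A worked example shows why the separators inside brackets are replaced first: afterwards a split on '，' separates the projects without cutting a list of grant numbers in half. The raw string is assembled from two projects in the dataset and is only an assumption about how the unsplit value looks. The function is repeated with the standard `re` module so the block runs on its own.

```python
import re


def replace_commas_in_brackets(text):
    # inside each bracketed span, turn separators into the enumeration comma
    return re.sub(r'[（(](.*?)[）)]',
                  lambda m: re.sub(r'[!,;，；]', '、', m.group()), text)


raw = "国家自然科学基金（61925208，U22A2028），中国科学院稳定支持基础研究领域青年团队计划（YSBR-029）"
projects = replace_commas_in_brackets(raw).split('，')
print(projects)
```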

In [None]:
# | echo: false
# | output: false

fund['fund_project'] = fund['fund_project'].str.split('，')

In [None]:
# | echo: false
# | output: false

fund_expld = fund.explode('fund_project').reset_index()
fund_expld['fund_project'] = fund_expld['fund_project'].str.replace('）、', '），')
fund_expld['fund_project'] = fund_expld['fund_project'].str.split('，')
fund_expld = fund_expld.explode('fund_project')

In [None]:
# | echo: false
# | output: false

# drop irrelevant rows
fund_expld = fund_expld.drop(index=[1643] + list(range(1654, 1667)))

In [None]:
# | echo: false
# | output: false

fund_expld.to_csv('data/bcas_fund_projects.csv', index=False)

The final dataframe looks like this:

In [109]:
# | echo: false
# | output: false
fund_expld = fund_expld.drop(index=[1645, 1646, 1839])

In [115]:
# | echo: false
# | output: false
fund_expld = fund_expld[['date', 'title_cn', 'fund_project']]
fund_expld.to_csv('data/bcas_fund_projects_fin.csv')

In [117]:
# | echo: false
# | output: true
fund_expld.head().style.hide(axis="index").relabel_index(["Year", "Article", "Fund Project"], axis=1)

Year,Article,Fund Project
2024,面向模拟智能的计算系统,国家杰出青年科学基金（T2125013）
2024,数字科学家与平行科学：AI4S和S4AI的本源与目标,澳门科学技术发展基金（0093/2023/RIA2）
2024,数字科学家与平行科学：AI4S和S4AI的本源与目标,国家自然科学基金（61533019）
2024,AI for Technology：技术智能在高技术领域的应用实践与未来展望,国家自然科学基金（61925208、U22A2028）
2024,AI for Technology：技术智能在高技术领域的应用实践与未来展望,中国科学院稳定支持基础研究领域青年团队计划（YSBR-029）
