---
title: "Topic Analysis"
jupyter: python3
format:
  html:
    grid: 
      body-width: 1000px
      sidebar-width: 300px
      margin-width: 300px
    toc: true
    toc-title: Contents
    page-layout: full
    code-overflow: wrap
    

number-sections: true
reference-location: margin
citation-location: margin
---

In [329]:
#| echo: false
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import pandas as pd
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 150)
pd.set_option('display.max_colwidth', 100)

# load flattened dataset
df = pd.read_csv('data/bcas_dataset_topics_flat.csv', sep='|')
df = df.rename(columns={'multiple_topics': 'topic'})

#original full dataset
df_full = pd.read_csv('data/bcas_dataset_multiple_topics.csv', sep='|')
df_full = df_full[df_full['year'] < 2024]

df = df.dropna(subset='topic')
df.topic = df.topic.astype('int')

#| echo: false
topic_info = pd.read_csv('data/topic_label_ontology_full.csv')
df = pd.merge(df, topic_info, how='left', on='topic')

We’ve enriched the data with a semantic layer, enabling us to delve into the evolving content of BCAS publications over the years.

With this enhancement, we can now:

- Track shifts in the prevalence of specific topics.
- Gauge interdisciplinarity across articles.
- Analyze the distribution of topics.
- Explore how reader engagement varies by views and downloads.
- Uncover trends in what authors focus on over time.
- Investigate organizational collaboration across different fields.
- Assess the impact of funded projects.

This opens up a comprehensive view of how research priorities and collaborations have transformed within BCAS.

# Research Variety: Topic Numbers over Time

In [330]:
#| echo: false

pio.renderers.default = "plotly_mimetype+notebook_connected"

In [331]:
#| echo: false

import plotly.express as px 
import plotly.graph_objects as go
import plotly.io as pio
pio.templates.default = "ggplot2"
pio.renderers.default = "plotly_mimetype+notebook_connected"


def set_layout():
    return {
        'width': 700,
        'height': 400,
        'xaxis': {
            'title': '',
            'titlefont': {'size': 14, 'family': "Verdana"},
            'tickmode': 'array',
            'showline': True,
            'linecolor': 'black'
        },
        'yaxis': {
            'title':'',
            'showline': True,
            'linecolor': 'black'
        },
        'font': {
            'size': 12,
            'family': "Verdana"
        },
        'margin':{'r': 20, 'l':20, 't': 50, 'b':50}
    }

layout = set_layout()

color_palette = {
    'Economic & Social Sciences': '#CCBFFF', 
    'Natural Sciences': '#77dd77', 
    'Applied Sciences': '#654CFF', 
    'Arts & Humanities': '#E51932',
    'Health Sciences':'#19B2FF'
    }

We observe that Applied Sciences have consistently led in the number of topics throughout the entire period, with Natural Sciences following closely behind. The total number of topics has steadily increased over the years, peaking in 2019-2023. The Applied Sciences domain shows the most variety, which reflects China's strategic emphasis on S&T and its focus on applied research.

In contrast, the dynamics of topics within the Economic & Social Sciences domain remain stable.

Meanwhile, the presence of Arts & Humanities has dwindled to near non-existence. Previously, these topics were reflected in articles covering CAS historical matters and biographies of prominent scientists. The shift towards more application-oriented publications suggests a broader move away from these areas.

In [332]:
#| echo: false
#| output: true
#| fig-cap: "Number of Topics by Year, 1986-2023."
#| label: fig-topics-year

gp = df.groupby(['year', 'domain_en'])['topic_en'].nunique().reset_index()

fig = px.area(
    gp,
    x='year',
    y='topic_en',
    color='domain_en',
    color_discrete_map=color_palette,
)

fig.update_layout(
    layout,
    width=800,
    xaxis=dict(minor=dict(ticks="inside", showgrid=True)),
    yaxis=dict(title="Topic Count"),
    #title='Number of Topics per Year, 1986-2023',
    showlegend=True,
    hovermode='x unified',
    legend=dict(title='Domain'),
)

fig.update_traces(
    hovertemplate='%{fullData.name}: %{y}<extra></extra>'
)

fig.show()

In [333]:
#| echo: false

# aggregate views by topic
topic_views = df.groupby('topic')['views'].agg(['sum', 'mean'])
topic_views.columns = ['views_total', 'views_avg']

# aggregate downloads by topic
topic_downloads = df.groupby('topic')['downloads'].agg(['sum', 'mean'])
topic_downloads.columns = ['downloads_total', 'downloads_avg']

# merge
topic_stats = pd.merge(topic_views, topic_downloads, left_index=True, right_index=True)

# calculate views and downloads shares
topic_stats['views_share'] = topic_stats['views_total'] / topic_stats['views_total'].sum()
topic_stats['downloads_share'] = topic_stats['downloads_total'] / topic_stats['downloads_total'].sum()

# add article count and share
topic_stats['article_count'] = df.groupby('topic').size().values
topic_stats['article_share'] = (df.topic.value_counts(normalize=True)).values

topic_stats = topic_stats.reset_index()
topic_stats = topic_stats[['topic', 'article_count', 'article_share',
                            'views_total', 'views_avg', 'views_share',
                            'downloads_total', 'downloads_avg', 'downloads_share']]

topic_stats = pd.merge(topic_stats, topic_info, how='left', on='topic')

# Interdesciplinarity: Category Overlap

Since we’ve tagged each article with multiple topics, some overlap is inevitable. By counting how often different topics show up together in the same articles, we can spot where they intersect. This will help us see how closely related different topics are and understand the connections between them better.

The 235 overlaps between topics like China’s S&T strategy and talent cultivation suggest a strong connection. Basically, discussions on innovation and strategic goals often go hand-in-hand with talks about nurturing scientific talent.

An overlap of 143 shows that open science and key labs are often discussed alongside policy research. So, when people talk about open science, it frequently ties into policy and innovation in national labs.

For the Belt and Road Initiative, we see an overlap of 74, which means that conversations about this initiative often include both S&T cooperation and sustainability. It's clear that the initiative's scientific side is linked with both collaborative efforts and long-term goals.

The 65 overlaps in sustainable development and big data point to a close relationship between these topics. It seems that discussions on sustainability often involve data-driven strategies.

When it comes to Optical and Spectroscopic Detection Instruments and High-Power Solid-State Lasers, they overlap 49 times. This suggests a strong focus on advanced imaging and detection tech in these areas.

The overlap of 43 between renewable energy and carbon neutrality shows that these topics are closely connected. It highlights how renewable energy is key to reaching carbon neutrality goals.

Finally, the 34 overlaps between biodiversity conservation and the CAS Field Observation Station Network suggest that these discussions are tied to monitoring and restoring ecosystems.

Overall, these overlaps make sense. Many of these topics are closely related, which is why we see a lot of connections between them.

In [334]:
#| echo: false
#| output: true
#| tbl-cap: 'Topic Overlap'
#| tbl-cap-location: top
#| label: tbl-topic-overlap

from itertools import combinations

topic_title_sets = {topic: set(df[df['topic_en'] == topic]['title_cn']) for topic in df['topic_en'].unique()}

topics = list(topic_title_sets.keys())
overlap_data = []

for topic1, topic2 in combinations(topics, 2):
    overlap = len(topic_title_sets[topic1].intersection(topic_title_sets[topic2]))
    overlap_data.append({'topic1': topic1, 'topic2': topic2, 'overlap': overlap})

overlap_df = pd.DataFrame(overlap_data)
overlap_df.sort_values(by='overlap', ascending=False).head(20).style.hide(axis="index").relabel_index(["Topic 1", "Topic 2", "Overalp"], axis=1)

Topic 1,Topic 2,Overalp
S&T Innovation and Superpower Strategy,S&T Talent Cultivation,235
Open Science and National Key Laboratories,Open Science and S&T Innovation Policy Research,143
CAS Academicians and Academic Divisions Work,CAS Divisions and Academicians: History and Development,91
Development History and Scientific Achievements of the Chinese Academy of Sciences,CAS Leaders Appointments and Profiles,83
Belt and Road Initiative: S&T Cooperation,Belt and Road Initiative: S&T Innovation and Sustainable Development,74
Sustainable Development in China,Earth Big Data for Sustainable Development,65
Development History and Scientific Achievements of the Chinese Academy of Sciences,Open Science and National Key Laboratories,56
Development History and Scientific Achievements of the Chinese Academy of Sciences,Open Science and S&T Innovation Policy Research,56
S&T Talent Cultivation,Graduate Education and Talent Cultivation,51
Optical and Spectroscopic Detection Instruments,High-Power Solid-State Lasers and Deep Ultraviolet Spectral Imaging Systems,49


We may find more interesting inside moving up to the subfield level.

For example, we can see a tstrong connection between Strategic Studeis and Education subfields. 

Environmental Sciences show notable overlaps with several key subfields:

- **Ecology**: Highlighting a strong connection between environmental sciences and ecological research.
- **Environmental Engineering**: Emphasizing a significant link between these two areas.
- **Energy**: Indicating that energy studies are closely tied to environmental science.
- **Agronomy & Agriculture**: Showing the importance of environmental sciences in agricultural research.
- **Information Systems**: Reflecting the intersection of environmental science with information technology.
- **Science Studies**: Demonstrating how environmental sciences are integrated into broader scientific research.
- **Education**: Suggesting that environmental science research contributes to educational topics.
- **Meteorology & Atmospheric Sciences**: Revealing a relationship between environmental sciences and atmospheric research.

These overlaps underscore the interdisciplinary nature of environmental sciences and their relevance across various fields.

In [335]:
#| echo: false
#| output: true
#| tbl-cap: 'Subfield Overlap'
#| tbl-cap-location: top
#| label: tbl-subfield-overlap

subfield_df = df[
    ['year', 'views', 'downloads', 'title_cn', 'author_cn',
    'subfield_en', 'subfield_cn', 'subfield_ru',
    'field_en', 'field_cn', 'field_ru',
    'domain_en', 'domain_cn', 'domain_ru']
]
subfield_df = subfield_df.drop_duplicates()
from itertools import combinations

subfield_title_sets = {subfield: set(subfield_df[subfield_df['subfield_en'] == subfield]['title_cn']) for subfield in subfield_df['subfield_en'].unique()}

subfields = list(subfield_title_sets.keys())
overlap_data = []

for subfield1, subfield2 in combinations(subfields, 2):
    overlap = len(subfield_title_sets[subfield1].intersection(subfield_title_sets[subfield2]))
    overlap_data.append({'subfield1': subfield1, 'subfield2': subfield2, 'overlap': overlap})

overlap_subfield_df = pd.DataFrame(overlap_data)
overlap_subfield_df.sort_values(by='overlap', ascending=False).head(20).style.hide(axis="index").relabel_index(["Subfield 1", "Subfield 2", "Overalp"], axis=1)

Subfield 1,Subfield 2,Overalp
"Strategic, Defence & Security Studies",Education,240
Education,Science Studies,127
Science Studies,"History of Science, Technology & Medicine",104
Education,"History of Science, Technology & Medicine",76
Environmental Sciences,Ecology,62
"Strategic, Defence & Security Studies",Science Studies,61
Environmental Sciences,Environmental Engineering,55
Energy,Environmental Sciences,51
Environmental Sciences,Agronomy & Agriculture,48
Environmental Sciences,Information Systems,47


In [336]:
#| echo: false
#| output: false

import numpy as np

subfields = sorted(overlap_subfield_df['subfield1'].unique())

heatmap_df = overlap_subfield_df.pivot(index='subfield1', columns='subfield2', values='overlap')
heatmap_df = heatmap_df.reindex(index=subfields, columns=subfields)
heatmap_df = heatmap_df.fillna(0)
heatmap_df = heatmap_df + heatmap_df.T


fig = px.imshow(heatmap_df,
                labels=dict(x="Subfield", y="Subfield", color="Overlap"),
                color_continuous_scale="Blues",
                text_auto='.0f',
                template='plotly_white')

fig.update_layout(
    layout,
    #title="Overlap Between Subfields",
    width=1000,
    height=900,
    xaxis_showgrid=False,
    yaxis_showgrid=False
)

# Show the plot
fig.show()


At the field level, we see some interesting patterns. The most notable overlap is between **Enabling & Strategic Technologies** and **Social Sciences**, indicating a strong connection between technological advancements and social impacts. 

**Earth & Environmental Sciences** also show significant overlap with **Enabling & Strategic Technologies**, reflecting how environmental concerns are closely tied to technological developments. Additionally, there's a considerable intersection with **Information & Communication Technologies**, **Engineering**, and **Social Sciences**, underscoring the broad relevance of environmental science across different sectors.

While there is a connection between **Earth & Environmental Sciences** and **Agriculture**, it’s not as pronounced, but it still points to an interplay between environmental and agricultural research.

In [337]:
#| echo: false
#| output: false
field_df = df[
    ['year', 'views', 'downloads', 'title_cn', 'author_cn',
    'field_en', 'field_cn', 'field_ru',
    'domain_en', 'domain_cn', 'domain_ru']
]
field_df = field_df.drop_duplicates()

field_df = field_df[field_df['year'] > 2012]

field_title_sets = {field: set(field_df[field_df['field_en'] == field]['title_cn']) for field in field_df['field_en'].unique()}

fields = list(field_title_sets.keys())
overlap_data = []

for field1, field2 in combinations(fields, 2):
    overlap = len(field_title_sets[field1].intersection(field_title_sets[field2]))
    overlap_data.append({'field1': field1, 'field2': field2, 'overlap': overlap})

overlap_field_df = pd.DataFrame(overlap_data)

In [338]:
#| echo: false
#| output: true
#| fig-cap: 'Overlap Between Fields'
#| label: fig-field-overlap

import numpy as np

fields = sorted(overlap_field_df['field1'].unique())

heatmap_df = overlap_field_df.pivot(index='field1', columns='field2', values='overlap')
heatmap_df = heatmap_df.reindex(index=fields, columns=fields)
heatmap_df = heatmap_df.fillna(0)
heatmap_df = heatmap_df + heatmap_df.T

np.fill_diagonal(heatmap_df.values, 0)

fig = px.imshow(heatmap_df,
                text_auto='.0f',
                labels=dict(x="field", y="field", color="Overlap"),
                color_continuous_scale="Blues")

fig.update_layout(
    layout,
    #title="Overlap Between Fields",
    xaxis_tickangle=-270,
    width=850,
    height=800,
    xaxis_showgrid=False,
    yaxis_showgrid=False
)

fig.show()


At the domain level, Applied Sciences shows the strongest connections, with significant overlaps with Natural Sciences and Economic & Social Sciences. The largest overlap is with Natural Sciences (687), indicating a close relationship between these domains. There is also a substantial connection with Economic & Social Sciences (631), reflecting how applied research intersects with economic and social aspects.

Applied Sciences also overlaps with Health Sciences (168) and Arts & Humanities (92), though these connections are less pronounced.

Among the other domains, Economic & Social Sciences and Arts & Humanities show a notable overlap (171), while Economic & Social Sciences and Health Sciences have a smaller, yet significant, overlap (62). Connections between Health Sciences and Arts & Humanities are minimal (7), and overlaps between Natural Sciences and Arts & Humanities are also limited (36).

In [339]:
#| echo: false
#| output: false

domain_df = df[
    ['year', 'views', 'downloads', 'title_cn', 'author_cn',
    'domain_en', 'domain_cn', 'domain_ru']
]
domain_df = domain_df.drop_duplicates()

domain_title_sets = {domain: set(domain_df[domain_df['domain_en'] == domain]['title_cn']) for domain in domain_df['domain_en'].unique()}

domains = list(domain_title_sets.keys())
overlap_data = []

for domain1, domain2 in combinations(domains, 2):
    overlap = len(domain_title_sets[domain1].intersection(domain_title_sets[domain2]))
    overlap_data.append({'domain1': domain1, 'domain2': domain2, 'overlap': overlap})

overlap_domain_df = pd.DataFrame(overlap_data)

overlap_domain_df.sort_values(by='domain1', ascending=True).head(20).style.hide(axis="index").relabel_index(["Domain 1", "Domain 2", "Overalp"], axis=1)

Domain 1,Domain 2,Overalp
Applied Sciences,Natural Sciences,687
Applied Sciences,Economic & Social Sciences,631
Applied Sciences,Health Sciences,168
Applied Sciences,Arts & Humanities,92
Economic & Social Sciences,Health Sciences,62
Economic & Social Sciences,Arts & Humanities,171
Health Sciences,Arts & Humanities,7
Natural Sciences,Economic & Social Sciences,205
Natural Sciences,Health Sciences,86
Natural Sciences,Arts & Humanities,36


In [340]:
#| echo: false
#| output: true
#| fig-cap: 'Overlap Between Domains'
#| label: fig-domain-overlap

domain_df = df[
    ['year', 'views', 'downloads', 'title_cn', 'author_cn',
    'domain_en', 'domain_cn', 'domain_ru']
]
domain_df = domain_df.drop_duplicates()

domain_title_sets = {domain: set(domain_df[domain_df['domain_en'] == domain]['title_cn']) for domain in domain_df['domain_en'].unique()}

domains = list(domain_title_sets.keys())
overlap_data = []

for domain1, domain2 in combinations(domains, 2):
    overlap = len(domain_title_sets[domain1].intersection(domain_title_sets[domain2]))
    overlap_data.append({'domain1': domain1, 'domain2': domain2, 'overlap': overlap})

overlap_domain_df = pd.DataFrame(overlap_data)

domains = sorted(overlap_domain_df['domain1'].unique())

heatmap_df = overlap_domain_df.pivot(index='domain1', columns='domain2', values='overlap')
heatmap_df = heatmap_df.reindex(index=domains, columns=domains)
heatmap_df = heatmap_df.fillna(0)
heatmap_df = heatmap_df + heatmap_df.T

np.fill_diagonal(heatmap_df.values, 0)

fig = px.imshow(heatmap_df,
                text_auto='.0f',
                labels=dict(x="domain", y="domain", color="Overlap"),
                color_continuous_scale="Blues")

fig.update_layout(
    layout,
    #title="Overlap Between domains",
    xaxis_tickangle=-270,
    width=700,
    height=600,
    xaxis_showgrid=False,
    yaxis_showgrid=False
)

fig.show()

# Article Distributions

Another question is about article distributions: what are the areas of focus in expert research and analysis? While a diverse range of topics within a domain suggests breadth, it doesn't necessarily indicate depth--or that these topics are the most frequently discussed. Examining the volume of articles across different topics will reveal which subjects experts are most engaged with, providing a clearer picture of research priorities and trends.

Analysis of article distribution across domains reveals a clear dominance of Applied Sciences, particularly since 2020, when its share reached 54%. Concurrently, Social Sciences experienced a significant decline, with its share shrinking from 33.6% in 2009 to 18.2% in the most recent year. Arts & Humanities maintained a substantial presence before 2005 but has since shown a downward trend. In contrast, Health Sciences has seen an increase in its share over the past three years.

In [341]:
#| echo: false
#| output: false
#| fig-cap: 'Distribution of Articles by Domain'
#| label: fig-domain-distribution

# Count unique articles per domain per year
domain = df.groupby(['year', 'domain_en'])['title_cn'].nunique().reset_index(name='count')
domain = domain.pivot_table(index='year', columns='domain_en', values='count', fill_value=0)
domain = domain.div(domain.sum(axis=1), axis=0).reset_index()

fig = px.area(domain, x='year', y=[col for col in domain.columns if col != 'year'],
              color_discrete_map=color_palette)

fig.update_layout(
    layout,
    width=800,
    #title='Distribution of Articles by Domain, 1986-2023',
    xaxis=dict(
        minor=dict(ticks="inside", showgrid=True),
        #type='category'
    ),
    yaxis=dict(
        title="Share of Articles",
        tickformat='.0%'
    ),
    showlegend=True,
    legend=dict(title='Domain'),
    hovermode='x unified',
    #margin=dict(b=0, r=100)
)

fig.update_traces(
    hovertemplate='%{fullData.name}: %{y:.1%}<extra></extra>'
)

fig.show()

Analysis of field dynamics from 2013 to 2023 reveals a significant shift in the journal's focus. 

While Earth & Environmental Sciences remains a predominant field, Enabling & Strategic Technologies has emerged as the dominant area, claiming a 33.53% share in 2023. 

The years 2020-2021 saw a notable increase in publications related to Clinical Medicine and Public Health & Health Services, likely driven by the COVID-19 pandemic—a trend corroborated by our exploratory data analysis of keyword frequencies.

Engineering consistently represents a key research vector, maintaining a share between 11-13% from 2019 to 2023 (with the exception of 2022). Social Sciences, after peaking at 20-22% in 2015-2016, experienced a substantial decline, settling at 15.55% in recent years.

Information & Communication Technologies publications surged in 2021-2022, coinciding with two major events: the U.S. chip export ban on China and the emergence of advanced AI like ChatGPT. This period also saw a decline in science management topics. The shift suggests a pivot from theoretical discussions to practical applications, with researchers focusing more on developing technologies than debating how science should be organized.

In [342]:
#| echo: false
#| output: true
#| fig-cap: 'Distribution of Articles by Field, 2013-2023'
#| label: fig-field-distribution

gp = df.groupby(['year', 'field_en'])['title_cn'].nunique().reset_index(name='count')
gp['share'] = gp['count'] / gp.groupby('year')['count'].transform('sum')*100
gp = gp[gp['year'] > 2012]

# pivot the df to create a matrix & fill nan values with 0
heatmap_data = gp.pivot(index='field_en', columns='year', values='share')
heatmap_data = heatmap_data.fillna(0)

fig = px.imshow(heatmap_data,
                text_auto = '.2f',
                color_continuous_scale='Blues',
                labels = dict(x = "Year", y = "", color = "Share, %"))

fig.update_layout(
    layout,
    #title='Distribution of Articles by Field, 2013-2023',
    width=850,
    height=600,
    xaxis=dict(
        side = "bottom",
        tickmode='array', 
        tickvals=gp['year'].unique()
    )
)

fig.update_traces(
    hovertemplate='Year: %{x}<br>Field: %{y}<br>Share: %{z:.2f}%<extra></extra>'
)

fig.show()

During Xi Jinping's era, the research articles predominantly centered around subfields like Environmental Sciences, Agronomy & Agriculture, Computer Hardware & Architecture, Ecology, Economics, Education, Electrical & Electronic Engineering, Energy, Environmental Engineering, Information Systems, Optoelectronics & Photonics, Science Studies, and Strategic, Defence & Security Studies.

In 2023, Optoelectronics & Photonics emerged as the most represented subfield, making up 14.29% of the topics. It was followed by Environmental Sciences at 12.22%, Strategic, Defence & Security Studies at 7.45%, Science Studies at 5.80%, and International Relations at 4.97%. While these numbers suggest a strong focus on these areas, there's an inconsistency: despite the apparent emphasis on Optoelectronics & Photonics, Strategic Studies, and Environmental Sciences, other important and emerging fields like Materials, Meteorology & Atmospheric Sciences, Nuclear & Particle Physics, Microscopy, and Nuclear Medicine & Medical Imaging are underrepresented. 

This underrepresentation could indicate a gap between the evolving scientific priorities and the actual research output, pointing to potential areas for future growth and exploration.

In [343]:
#| echo: false
#| output: true
#| fig-cap: 'Distribution of Articles by Subfield, 2013-2023'
#| label: fig-subfield-distribution

gp = df.groupby(['year', 'subfield_en'])['title_cn'].nunique().reset_index(name='count')
gp['share'] = gp['count'] / gp.groupby('year')['count'].transform('sum')*100
gp = gp[gp['year'] > 2012]

# pivot the df to create a matrix & fill nan values with 0
heatmap_data = gp.pivot(index='subfield_en', columns='year', values='share')
heatmap_data = heatmap_data.fillna(0)

fig = px.imshow(heatmap_data,
                text_auto = '.2f',
                template='plotly_white',
                color_continuous_scale='Blues',
                labels = dict(x = "Year", y = "", color = "Share, %"))

fig.update_layout(
    layout,
    width=850,
    height=1500,
    #title='Distribution of Articles by Field, 2013-2023',
    xaxis=dict(
        side = "bottom",
        tickmode='array', 
        tickvals=gp['year'].unique()
    )
)

fig.update_traces(
    hovertemplate='Year: %{x}<br>Subfield: %{y}<br>Share: %{z:.2f}%<extra></extra>'
)

fig.show()

# Topic Engagement: Views & Downloads

In [344]:
#| echo: false

# aggregate views by topic
topic_views_past_decade = df[df['year'] > 2012].groupby('topic')['views'].agg(['sum', 'mean'])
topic_views_past_decade.columns = ['views_total', 'views_avg']

# aggregate downloads by topic
topic_downloads_past_decade = df[df['year'] > 2012].groupby('topic')['downloads'].agg(['sum', 'mean'])
topic_downloads_past_decade.columns = ['downloads_total', 'downloads_avg']

# merge
topic_stats_past_decade = pd.merge(topic_views_past_decade, topic_downloads_past_decade, left_index=True, right_index=True)

# calculate views and downloads shares
topic_stats_past_decade['views_share'] = topic_stats_past_decade['views_total'] / topic_stats_past_decade['views_total'].sum()
topic_stats_past_decade['downloads_share'] = topic_stats_past_decade['downloads_total'] / topic_stats_past_decade['downloads_total'].sum()

# add article count and share
topic_stats_past_decade['article_count'] = df.groupby('topic').size().values
topic_stats_past_decade['article_share'] = (df.topic.value_counts(normalize=True)).values

topic_stats_past_decade = topic_stats_past_decade.reset_index()
topic_stats_past_decade = topic_stats_past_decade[['topic', 'article_count', 'article_share',
                            'views_total', 'views_avg', 'views_share',
                            'downloads_total', 'downloads_avg', 'downloads_share']]

topic_stats_past_decade = pd.merge(topic_stats_past_decade, topic_info, how='left', on='topic')

By analyzing Views and Downloads statistics for the past decade, we can gain insights into what topics truly attract the attention of the Chinese scientific community and officials, revealing the priorities and interests of both researchers and decision-makers.

In [345]:
#| echo: false
#| output: false
correlation = topic_stats_past_decade.downloads_total.corr(topic_stats_past_decade.views_total)
print(f'Correlation between downloads total and views total: {round(correlation, 2)}')

Correlation between downloads total and views total: 0.99


We can start with calculating correlation between total views and downloads for each topic. Unsurprisingly, we get a correlation of 0.99, which basically means that the more a paper is viewed, the more it's downloaded—who would've thought?

In [346]:
#| echo: false
#| output: true
#| fig-cap: 'Views Total vs. Downloads Total'
#| label: fig-views-downloads-scatter

fig = px.scatter(
    topic_stats_past_decade,
    x='downloads_total',
    y='views_total',
    trendline="ols",
    color_discrete_map=color_palette,
    hover_data=['topic_en'],
    log_x=False
)

fig.update_layout(
    layout,
    width=600,
    xaxis=dict(title="Downloads"),
    yaxis=dict(title="Views"),
    legend=dict(title='Domain')
)

fig.update_traces(
    marker=dict(size=10, color='#0E86D4', opacity=0.7),
    hovertemplate='Topic: %{customdata[0]}<br>Downloads: %{x}<br>Views: %{y}<extra></extra>'
)

fig.show()

Next, we can analyze the relationship between views and the number of articles for each topic. This analysis will allow us to identify research areas that are over-performing—where views are disproportionately high compared to the number of articles published—and those that are under-performing, where the volume of publications does not correspond to a similarly high level of interest. This approach provides valuable insights into which topics are getting significant attention from the scientific community and which may not be as impactful despite a higher output of publications.

In [347]:
#| echo: false
#| output: false
correlation = topic_stats_past_decade.article_count.corr(topic_stats_past_decade.views_total)
print(f'Correlation between article count and views total: {round(correlation, 2)}')

Correlation between article count and views total: 0.75


In [348]:
#| echo: false
#| output: true
#| fig-cap: 'Topic Popularity: Articles vs. Views'
#| label: fig-article-views-scatter

fig = px.scatter(
    topic_stats_past_decade,
    x='views_total',
    y='article_count',
    trendline="ols",

    color_discrete_map=color_palette,
    hover_data=['topic_en'],
    log_x=False
    )

fig.update_layout(
    layout,
    width=600,
    xaxis=dict(title="Views"),
    yaxis=dict(title="Article Count"),
    legend=dict(title='Domain')
)

fig.update_traces(
    marker=dict(size=10, color='#0E86D4', opacity=0.7),
    hovertemplate='Topic: %{customdata[0]}<br>Views: %{x}<br>Articles: %{y}<extra></extra>'
)

fig.show()

Analysis of article count versus view numbers reveals a positive correlation of 0.75, indicating that generally, more articles lead to more views. However, significant outliers exist, highlighting disparities in topic popularity:

Underperforming topics (low views despite high article count):

1. CAS Divisions and Academicians: History and Development - 207 articles, 15,866 views
2. CAS Academicians and Academic Divisions Work - 172 articles, 12,239 views
3. Development History and Scientific Achievements of CAS - 340 articles, 95,401 views

These topics, primarily focused on institutional history and structure, attract fewer views relative to their article count, suggesting lower reader interest in internal organizational matters.

Overperforming topics (high views despite low article count):
1. Ecological Civilization Construction - 81 articles, 248,882 views
2. Belt and Road Initiative:
   - S&T Cooperation - 105 articles, 228,602 views
   - S&T Innovation and Sustainable Development - 140 articles, 297,314 views

These topics, centered on current global initiatives and environmental concerns, garner more attention per article, indicating high reader engagement with contemporary, outward-facing issues.

This disparity in viewership suggests that while institutional topics are frequently published, reader interest leans towards current global and environmental themes.

In [349]:
#| echo: false
#| output: false
topic_stats['views_efficiency'] = topic_stats.views_share/topic_stats.article_share
topic_stats_past_decade['views_efficiency'] = topic_stats_past_decade.views_share/topic_stats_past_decade.article_share

To assess topic performance more effectively, we introduce a "view efficiency" metric, calculated by dividing a topic's view share by its article share. This ratio reveals which topics generate disproportionate interest relative to their publication volume. Topics with an efficiency score above 1 indicate higher-than-average reader engagement, while those below 1 suggest lower interest compared to their prevalence. 

In [350]:
#| echo: false
#| output: true
#| fig-cap: 'Views Efficiency Distribution'
#| label: fig-views-efficiency-article-share-boxplot

fig = px.box(topic_stats_past_decade, y="views_efficiency", )

fig.update_layout(
    layout,
    yaxis=dict(
        title='Views Efficiency'
    )
)

fig.update_traces(
    marker=dict(color='#0E86D4', opacity=0.7),
    hovertemplate = 'Views Efficiency: %{y:.2f}'
)

fig.show()

The following list presents the top 10 topics ranked by view efficiency:

In [351]:
#| echo: false
#| output: true

# top topics by view efficiency
topic_stats_past_decade[['topic_en', 'views_efficiency']].sort_values(by='views_efficiency', ascending=False).head(10).style.hide(axis="index").relabel_index(["Topic", "Views Efficiency"], axis=1)

Topic,Views Efficiency
Belt and Road Initiative: S&T Cooperation,8.291509
Ecological Civilization Construction,3.659625
Belt and Road Initiative: S&T Innovation and Sustainable Development,3.171685
Earth Big Data for Sustainable Development,2.842358
High-Power Solid-State Lasers and Deep Ultraviolet Spectral Imaging Systems,2.675303
Lunar and Deep Space Exploration,2.511525
S&T Innovation and Superpower Strategy,2.427604
Disaster Mitigation,2.337339
Microbiology and Ecological Environment Research,2.234833
Advanced Medical Imaging and Diagnostic Technologies,2.063377


And 10 least efficient topics:

In [352]:
#| echo: false
#| output: true

topic_stats_past_decade[['topic_en', 'views_efficiency']].sort_values(by='views_efficiency', ascending=False).tail(10).style.hide(axis="index").relabel_index(["Topic", "Views Efficiency"], axis=1)

Topic,Views Efficiency
Advanced Nuclear Energy Technology,0.373404
High Power Laser Physics and Technology,0.364596
CAS Development History and Scientific Cooperation,0.34835
Ultra-Intense Laser Technology and Terahertz Detector,0.34569
"Big Science Facilities: Particle Physics, Colliders, and Neutrino Research",0.321998
Chemical Sciences and Technology,0.309196
Development History and Scientific Achievements of the Chinese Academy of Sciences,0.186035
Automated Theorem Proving,0.135855
Gold Deposit Geology,0.122504
CAS Academicians and Academic Divisions Work,0.101632


In [353]:
#| echo: false
#| output: true
#| fig-cap: 'Views Efficiency vs Article Share by Topic'
#| label: fig-views-efficiency-article-share-scatter

fig = px.scatter(
    topic_stats_past_decade,
    y='views_efficiency',
    x='article_share',
    color_discrete_map=color_palette,
    hover_data=['topic_en'],
    log_x=False
    )

fig.update_layout(
    layout,
    width=600,
    #title='Views Efficiency vs Article Share by Topic',
    yaxis=dict(
        title="Views Efficiency",
        tickformat='.2',
        ),
    xaxis=dict(
        title="Article Share",
        tickformat=',.0%',
        hoverformat='.2%'
        ),
)

fig.update_traces(
    marker=dict(size=10, color='#0E86D4', opacity=0.7),
    hovertemplate='Topic: %{customdata[0]}<br>Views Efficiency: %{y}<br>Article Share: %{x}<extra></extra>'
)

fig.show()

# Organizations

Next we can identify how different research areas are represented by organizations.

In [354]:
#| echo: false
#| output: false

orgs = pd.read_csv('data/orgs_flat.csv')
#1986-2023
orgs = orgs[(orgs['year'] < 2024) & (orgs['year'] > 2012)]

topics_orgs = pd.merge(
    orgs,
    df[['title_cn', 'topic_en', 'subfield_en', 'field_en', 'domain_en']],
    on='title_cn', how='left')

topics_orgs = topics_orgs.dropna(subset='topic_en')
topics_orgs = topics_orgs.dropna(subset='orgs_head')
topics_orgs = topics_orgs.drop_duplicates()

gp = topics_orgs.groupby('topic_en')['orgs_head'].nunique().reset_index()
mapping_dict = dict(zip(gp['topic_en'], gp['orgs_head']))
topic_stats['org_count'] = topic_stats['topic_en'].map(mapping_dict)
topic_stats['org_share'] = topic_stats['org_count']/topic_stats['org_count'].sum()

The highest variety of organizations is found in Applied Sciences, with 441 organizations. This is followed by Natural Sciences, which has 378 organizations. Economics and Social topics are represented by 247 organizations, while Health Sciences has 131 organizations, which is 100 fewer. The mos insignificant number of organizations (7) is in Arts and Humanities. 

In [355]:
#| echo: false
#| output: true
#| fig-cap: 'Number of Organizations by Domain, 2013-2023'
#| label: fig-orgs-domain

#one organization can fit into multiple categories
gp = topics_orgs.groupby('domain_en')['orgs_head'].nunique().sort_values(ascending=False).reset_index(name='count')

fig = px.bar(gp,
             y='domain_en',
             x='count',
             orientation='h',
             color='domain_en',
             color_discrete_map=color_palette)

fig.update_layout(
    layout,
    xaxis=dict(
        title='Count',
        range=[0, 500]
        ),
    #title='Number of Organizations by Domain, 2013-2023',
    showlegend=False,
)

fig.update_traces(
    textposition='outside', 
    texttemplate='%{x}',
    textfont=dict(color='black'), 
    opacity=0.7,
    hovertemplate='Domain: %{y}<br>Organizations: %{x}<extra></extra>'
)

fig.show()

The frequency count of organizations indicates that the University of CAS leads across all domains, with general CAS affiliations also being common.

The Institutes of Science and Development ranks as the second top contributor in Economic and Social Sciences, third in Applied Sciences, and fifth in Natural Sciences.

Additionally, the Institute of Geographic Sciences and Natural Resources Research is active in Applied, Natural, and Economic and Social Sciences.

In [356]:
#| echo: false
#| output: true
#| fig-cap: 'Views Efficiency vs Article Share by Topic'
#| label: fig-orgs-top-domain

from plotly.subplots import make_subplots

domains = topics_orgs['domain_en'].dropna().unique()

fig = make_subplots(rows=len(domains), cols=1, subplot_titles=domains, vertical_spacing=0.1)

for i, domain in enumerate(domains, 1):
    domain_data = topics_orgs[topics_orgs['domain_en'] == domain]
    top_topics_orgs = domain_data['orgs_head'].value_counts().nlargest(5).reset_index()
    top_topics_orgs.columns = ['orgs_head', 'count']
    
    fig.add_trace(
        go.Bar(
            y=top_topics_orgs['orgs_head'],
            x=top_topics_orgs['count'],
            orientation='h',
            marker_color=color_palette.get(domain, 'blue'),
            opacity=0.7,
            text=top_topics_orgs['count'],
            textposition='outside',
        ),
        row=i, col=1
    )
    
    fig.update_xaxes(
        title="Count",
        showline=True,
        linewidth=1,
        linecolor='black',
        range=[0, 300],      
        row=i, col=1
    )
    fig.update_yaxes(
        autorange='reversed',
        title="",
        showline=True,
        linewidth=1,
        linecolor='black',
        row=i, col=1
    )

fig.update_layout(
    layout,
    width=800,
    height=1200,
    font=dict(color='black', size=14, family='Verdana'),
    showlegend=False,
)

fig.update_traces(
    textposition='outside', 
    texttemplate='%{x}',
    textfont=dict(color='black'), 
    opacity=0.7,
    hovertemplate='Organization: %{y}<br>Affiliations: %{x}<extra></extra>'
)

fig.show()

When categorizing organizations by fields, we find that the largest number is in Earth and Environmental Sciences, followed by Social Sciences. The next largest domain is Enabling and Strategic Technologies.

Information and Communication Technologies ranks next, followed by Engineering, Biology, and Agriculture.

In [357]:
#| echo: false
#| output: true
#| fig-cap: 'Number of Organizations by Field, 2013-2023'
#| label: fig-orgs-field

# One organization can fit into multiple categories
gp = topics_orgs.groupby(['field_en', 'domain_en'])['orgs_head'].nunique().sort_values(ascending=True).reset_index(name='count')

fig = px.bar(gp,
             y='field_en',
             x='count',
             orientation='h',
             color='domain_en',
             color_discrete_map=color_palette
             )

fig.update_layout(
    layout,
    #title="Number of Organizations by Field, 2013-2023",
    width=900,
    height=700,
    xaxis=dict(
        title="Organizations",
        range=[0, 400]
    ),
    yaxis=dict(
        #autorange='reversed',
        title="",
    ),
    showlegend=True,
    legend_title=dict(
        text="Domain",
        #font=dict(
        #    size=14,
        #    family="Verdana",
        #    color="black"
        #    )
        )
)

fig.update_traces(
    textposition='outside',  
    texttemplate='%{x}', 
    opacity=0.7,
    textfont=dict(color='black'),
    hovertemplate='Field: %{y}<br>Affiliations: %{x}<extra></extra>' 
)

fig.show()

The most represented subfields are:

- Environmental Sciences
- Ecology
- Information Systems
- Agronomy and Agriculture
- Strategic, Defence, and Security Studies
- Environmental Engineering
- International Relations

In [358]:
#| echo: false
#| output: true
#| fig-cap: 'Number of Organizations by Subfield, 2013-2023'
#| label: fig-orgs-subfield-count

# One organization can fit into multiple categories
gp = topics_orgs.groupby(['subfield_en', 'domain_en'])['orgs_head'].nunique().sort_values(ascending=True).reset_index(name='count')

fig = px.bar(gp,
             y='subfield_en',
             x='count',
             orientation='h',
             color='domain_en',
             color_discrete_map=color_palette
             )

fig.update_layout(
    layout,
    #title="Number of Organizations by Subfield, 2013-2023",
    width=900,
    height=1000,
    xaxis=dict(
        title="Organizations",
        range=[0, 350]
    ),
    showlegend=True,
    legend_title=dict(
        text="Domain",
        font=dict(size=14, family="Verdana", color="black")
        )
)

fig.update_traces(
    textposition='outside',  
    texttemplate='%{x}',  
    opacity=0.7,
    textfont=dict(color='black'),
    hovertemplate='Domain: %{y}<br>Affiliations: %{x}<extra></extra>' 
)

fig.show()

# Collaboration Network

In [359]:
#| echo: false
#| output: true

# keep only the rows where the number of unique values in the 'orgs_head' column, grouped by the 'title_cn' column, is greater than 1
orgs_filtered = topics_orgs[topics_orgs.groupby('title_cn')['orgs_head'].transform('nunique') > 1]
orgs_filtered = orgs_filtered.drop(columns='topic_en')
orgs_filtered = orgs_filtered.drop_duplicates()

In [360]:
#| echo: false
#| output: true

from itertools import combinations

gp = orgs_filtered.groupby(['field_en', 'title_cn']).orgs_head.unique().reset_index()

# initialize list to hold collaboration pairs
collaboration_pairs = []

# iterate through the grouped data
for idx, row in gp.iterrows():
    orgs = row['orgs_head']
    title = row['title_cn']
    field = row['field_en']
    # generate all possible unique pairs of organizations
    if len(orgs) > 1:
        for pair in combinations(orgs, 2):
            collaboration_pairs.append((field, title, pair[0], pair[1]))

# create a df from the pairs
collaboration_df = pd.DataFrame(collaboration_pairs, columns=['field_en', 'title_cn', 'org1', 'org2'])

In [361]:
#| echo: false
#| output: true

# calculate collaboration strength (=number of occurrences for a pair)
# group by field_en and organization pair, then count collaborations
collaboration_strength = collaboration_df.groupby(['field_en', 'org1', 'org2']).size().reset_index(name='strength')

# ensure each pair appears only once per field_en ((A, B) = (B, A))
collaboration_strength['org_pair'] = collaboration_strength.apply(lambda row: tuple(sorted([row['org1'], row['org2']])), axis=1)
collaboration_strength = collaboration_strength.groupby(['field_en', 'org_pair'])['strength'].sum().reset_index()

# split the org_pair back into separate columns
collaboration_strength[['org1', 'org2']] = pd.DataFrame(collaboration_strength['org_pair'].tolist(), index=collaboration_strength.index)

# drop the org_pair column
collaboration_strength = collaboration_strength.drop('org_pair', axis=1)

# sort
collaboration_strength = collaboration_strength.sort_values(['field_en', 'strength'], ascending=[True, False])

# reset the index
collaboration_strength = collaboration_strength.reset_index(drop=True)

In [362]:
#| echo: false
#| output: false
collaboration_strength.sort_values(by='strength', ascending=False).head(10)

Unnamed: 0,field_en,strength,org1,org2
2859,Social Sciences,67,"Institutes of Science and Development, CAS",University of CAS
1549,Enabling & Strategic Technologies,43,"Institutes of Science and Development, CAS",University of CAS
2860,Social Sciences,24,"Institute of Geographic Sciences and Natural Resources Research, CAS",University of CAS
2861,Social Sciences,23,CAS,University of CAS
580,Earth & Environmental Sciences,18,"Institutes of Science and Development, CAS",University of CAS
2862,Social Sciences,17,CAS,"Institutes of Science and Development, CAS"
581,Earth & Environmental Sciences,17,"Institute of Geographic Sciences and Natural Resources Research, CAS",University of CAS
441,Built Environment & Design,15,"Institutes of Science and Development, CAS",University of CAS
2316,Information & Communication Technologies,15,"Institutes of Science and Development, CAS",University of CAS
582,Earth & Environmental Sciences,14,"Northwest Institute of Eco-Environment and Resources, CAS",University of CAS


In [363]:
#| echo: false
#| output: false

import networkx as nx


def process_data(collaboration_strength):
    fields = sorted(collaboration_strength['field_en'].unique())
    graphs_by_field = {}
    all_nodes = set()
    
    for field in fields:
        field_data = collaboration_strength[collaboration_strength['field_en'] == field]
        G = nx.Graph()
        
        for _, row in field_data.iterrows():
            G.add_edge(row['org1'], row['org2'], weight=row['strength'])
            all_nodes.add(row['org1'])
            all_nodes.add(row['org2'])
        
        graphs_by_field[field] = G
    
    return graphs_by_field, all_nodes, fields

def plot_graph(G, pos, fig):
    edge_x, edge_y = [], []
    for edge in G.edges():
        x0, y0 = pos[edge[0]]
        x1, y1 = pos[edge[1]]
        edge_x.extend([x0, x1, None])
        edge_y.extend([y0, y1, None])
    
    edge_trace = go.Scatter(
        x=edge_x, y=edge_y,
        line=dict(color='#0E86D4', width=1),
        hoverinfo='none',
        mode='lines'
    )
    
    node_x, node_y = [], []
    for node in G.nodes():
        x, y = pos[node]
        node_x.append(x)
        node_y.append(y)
    
    node_adjacencies = [len(list(G.adj[node])) for node in G.nodes()]
    node_text = [f'{node}<br>Connections: {adj}' for node, adj in zip(G.nodes(), node_adjacencies)]

    node_trace = go.Scatter(
        x=node_x, y=node_y,
        mode='markers',
        hoverinfo='text',
        marker=dict(
            showscale=True,
            colorscale='Blues',
            reversescale=False,
            color=node_adjacencies,
            size=10,
            colorbar=dict(thickness=15, title='Connections', xanchor='left', titleside='right'),
            line_width=2
        ),
        text=node_text
    )

    fig.add_trace(edge_trace)
    fig.add_trace(node_trace)
    
    return fig

An analysis of collaboration statistics across fields reveals distinct patterns:

- Earth & Environmental Sciences boasts the largest network, but with low density (2.29%), indicating many small clusters. The University of CAS is central, followed by institutes focused on eco-environment, geography, and atmospheric physics.
- Social Sciences, with 211 organizations, also centers around the University of CAS. The Institutes of Science and Development and the Institute of Remote Sensing and Digital Earth play key roles, reflecting the field's strategic focus.
- Enabling & Strategic Technologies encompasses 187 organizations, with the University of CAS and Institutes of Science and Development as main hubs. Most organizations have fewer than 10 connections, resulting in low network density (2.28%).
- Engineering, despite fewer organizations (121), has a denser network (5%), suggesting higher topic complexity and more interconnected collaborations.

In [364]:
#| echo: false
#| output: true
#| page-layout: full
#| fig-cap: 'Collaboration Network by Field, 2013-2023'
#| fig-subcap: 
#|  - "Agriculture, Fisheries & Forestry"
#|  - "Biology"
#|  - "Biomedical Research"
#|  - "Built Environment & Design"
#|  - "Clinical Medicine"
#|  - "Earth & Environmental Sciences"
#|  - "Economics & Business"
#|  - "Enabling & Strategic Technologies"
#|  - "Engineering"
#|  - "Historical Studies"
#|  - "Information & Communication Technologies"
#|  - "Physics & Astronomy"
#|  - "Psychology & Cognitive Sciences"
#|  - "Public Health & Health Services"
#|  - "Social Sciences"
#| fig-column: page-right
#| warning: false
#| label: fig-field-network
#| fig-align: center


def visualize_network(collaboration_strength):
    graphs_by_field, all_nodes, fields = process_data(collaboration_strength)
    
    G_combined = nx.Graph()
    for G in graphs_by_field.values():
        G_combined = nx.compose(G_combined, G)
    
    pos = nx.spring_layout(G_combined, k=0.5, iterations=30)
    
    figures = []
    
    for field in fields:
        G = graphs_by_field[field]
        
        fig = go.Figure()
        fig = plot_graph(G=G, pos=pos, fig=fig)
        
        network_density = nx.density(G)
        
        fig.update_layout(
            layout,
            template='plotly_white',
            width=800,
            height=500,
            #title=f'{field}',
            showlegend=False,
            hovermode='closest',
            annotations=[dict(
                text=f"Number of organizations: {len(G.nodes())}<br>Network Density: {network_density:.2%}",
                showarrow=False,
                xref="paper", yref="paper",
                x=0.005, y=-0.05,
                font=dict(family="Verdana")
            )],
            xaxis=dict(showgrid=False, showline=False, zeroline=False, showticklabels=False),
            yaxis=dict(showgrid=False, showline=False, zeroline=False, showticklabels=False),
        )
        
        figures.append(fig)
    
    return figures

figures = visualize_network(collaboration_strength)

for fig in figures:
    fig.show()

In [365]:
#| echo: false
#| output: false
topics_orgs.groupby('orgs_head')['field_en'].value_counts().reset_index()

Unnamed: 0,orgs_head,field_en,count
0,"Academic Division, CAS-Tsinghua University Center For Collaborative Development of Science and S...",Social Sciences,1
1,Academy of Macroeconomic Research,Earth & Environmental Sciences,2
2,Academy of Macroeconomic Research,Biology,1
3,Academy of Macroeconomic Research,Enabling & Strategic Technologies,1
4,Academy of Macroeconomic Research,Engineering,1
...,...,...,...
1535,中国科学院预测中心,"Agriculture, Fisheries & Forestry",2
1536,中国科学院预测中心,Social Sciences,2
1537,中国科学院预测中心,Earth & Environmental Sciences,1
1538,中国科学院预测中心,Information & Communication Technologies,1


# Authors

In [366]:
#| echo: false
#| output: false

authors_flat = pd.read_csv('data/authors_flat.csv')
authors_flat.shape

(12837, 4)

In [367]:
#| echo: false
#| output: false
exclude_list = ['not_specified', '本刊编辑部', '本刊特约评论员', '《中国科学院院刊》编辑部']
authors_flat = authors_flat.drop(authors_flat[authors_flat['author_cn'].isin(exclude_list)].index)
authors_flat.shape

(10155, 4)

In [368]:
#| echo: false
#| output: false
authors_topics = pd.merge(authors_flat[['year', 'title_cn', 'author_cn']],
                          df[['title_cn', 'topic_en', 'domain_en', 'field_en', 'subfield_en']],
                          on='title_cn')
authors_topics.shape

(15314, 7)

In [369]:
#| echo: false
#| output: false
authors_topics = authors_topics.dropna(subset='topic_en')
authors_topics.shape

(15314, 7)

In [370]:
#| echo: false
#| output: false
#calculate author shares by topic, all time
author_topic_count = authors_topics.groupby('topic_en')['author_cn'].nunique().reset_index()
topic_stats = pd.merge(topic_stats, author_topic_count, on='topic_en')
topic_stats = topic_stats.rename(columns={'author_cn': 'author_count'})
topic_stats['author_share'] = topic_stats['author_count'] / topic_stats['author_count'].sum()

Analysis of author counts by field shows significant growth across all domains except Humanities since 2016, with numbers peaking in 2023.

In [371]:
#| echo: false
#| output: true
#| fig-cap: 'Number of Authors by Domain, 1986-2023'
#| label: fig-authors-domain

gp = authors_topics.groupby(['year', 'domain_en'])['author_cn'].nunique().reset_index()

fig = px.area(
    gp,
    x='year',
    y='author_cn',
    color='domain_en',
    color_discrete_map=color_palette,
)

fig.update_layout(
    layout,
    width=800,
    xaxis=dict(minor=dict(ticks="inside", showgrid=True)),
    yaxis=dict(title="Author Count"),
    #title='Number of Authors by Domain, 1986-2023',
    showlegend=True,
    hovermode='x unified',
    legend=dict(title='Domain'),
)

fig.update_traces(
    hovertemplate='%{fullData.name}: %{y}<extra></extra>'
)

fig.show()

Applied Sciences consistently dominates, accounting for about 40% of authors.
Economic & Social Sciences shows fluctuation, declining through the 2010s but rebounding in 2022-2023 to 26.5%.
Health Sciences maintains a smaller author share, but saw a sharp increase from 1.9% to 14.9% between 2019 and 2020, likely due to pandemic-related research.
We can also observe the death of Humanities, which experienced a significant decline, effectively disappearing by 2010.

In [372]:
#| echo: false
#| output: true
#| fig-cap: 'Distribution of Authors by Domain, 1986-2023'
#| label: fig-author-domain-share

# Count unique articles per domain per year
gp = authors_topics.groupby(['year', 'domain_en'])['author_cn'].nunique().reset_index(name='count')
gp = gp.pivot_table(index='year', columns='domain_en', values='count', fill_value=0)
gp = gp.div(gp.sum(axis=1), axis=0).reset_index()

fig = px.area(gp, x='year', y=[col for col in gp.columns if col != 'year'],
              color_discrete_map=color_palette)

fig.update_layout(
    layout,
    width=800,
    #title='Distribution of Authors by Domain, 1986-2023',
    xaxis=dict(
        minor=dict(ticks="inside", showgrid=True),
    ),
    yaxis=dict(
        title="Share of Authors",
        tickformat='.0%'
    ),
    showlegend=True,
    legend=dict(title='Domain'),
    hovermode='x unified',
    )

fig.update_traces(
    hovertemplate='%{fullData.name}: %{y:.1%}<extra></extra>'
)


fig.show()

In [373]:
#| echo: false
#| output: false
#| fig-cap: 'Number of Authors by Field, 2013-2023'
#| label: fig-author-field-count

gp = authors_topics.groupby(['year', 'field_en'])['author_cn'].nunique().reset_index(name='count')
gp = gp[gp['year'] > 2012]

heatmap_data_count = gp.pivot(index='field_en', columns='year', values='count')
heatmap_data_count = heatmap_data_count.fillna(0)

fig = px.imshow(heatmap_data_count,
                text_auto='.0f',
                color_continuous_scale='Blues',
                labels=dict(x="Year", y="", color="Count"))

fig.update_layout(
    layout,
    width=900,
    height=600,
    #title='Number of Authors by Field, 2013-2023',
    xaxis=dict(
        side="bottom",
        tickmode='array',
        tickvals=gp['year'].unique()
    )
)

fig.update_traces(
    hovertemplate='Year: %{x}<br>Field: %{y}<br>Authors: %{z:,.0f}<extra></extra>'
)

fig.show()

At the field level, we see a distribution of expertise across various domains. Earth & Environmental Sciences topped the list, followed closely by Social Sciences and Enabling & Strategic Technologies. Engineering also drew a significant number of experts. A notable trend emerged in 2022, with Information and Communication Technology (ICT) experiencing a remarkable surge, jumping from 3.87% to 13.78%. This spike correlates with the rise of generative AI and the release of ChatGPT. Engineering followed suit with a similar boost in 2023, rising from 2.08% in 2022 to 11.96%. Additionally, Public Health & Health Services saw a significant uptick in 2020, highlighting its growing importance.

In [374]:
#| echo: false
#| output: true
#| fig-cap: 'Distribution of Authors by Field, 2013-2023'
#| label: fig-author-field-share

gp = authors_topics.groupby(['year', 'field_en'])['author_cn'].nunique().reset_index(name='count')
gp['share'] = gp['count'] / gp.groupby('year')['count'].transform('sum')*100
gp = gp[gp['year'] > 2012]

heatmap_data = gp.pivot(index='field_en', columns='year', values='share')
heatmap_data = heatmap_data.fillna(0)

fig = px.imshow(heatmap_data,
                text_auto = '.2f',
                color_continuous_scale='Blues',
                labels = dict(x = "Year", y = "", color = "Share, %"))

fig.update_layout(
    layout,
    width=900,
    height=600,
    #title='Distribution of Authors by Field, 2013-2023',
    xaxis=dict(
        side = "bottom",
        tickmode='array', 
        tickvals=gp['year'].unique()
    )
)

fig.update_traces(
    hovertemplate='Year: %{x}<br>Field: %{y}<br>Share: %{z:.2f}%<extra></extra>'
)

fig.show()

In [375]:
#| echo: false
#| output: false
#| fig-cap: 'Number of Authors by Subfield, 2013-2023'
#| label: fig-author-subfield-count

gp = authors_topics.groupby(['year', 'subfield_en'])['author_cn'].nunique().reset_index(name='count')
gp = gp[gp['year'] > 2012]

# Create a pivot table for absolute count values
heatmap_data_count = gp.pivot(index='subfield_en', columns='year', values='count')
heatmap_data_count = heatmap_data_count.fillna(0)

fig = px.imshow(heatmap_data_count,
                text_auto='.0f',  # Display whole numbers
                color_continuous_scale='Blues',
                labels=dict(x="Year", y="", color="Count"))

fig.update_layout(
    layout,
    width=900,
    height=1200,
    #title='Number of Authors by Subfield, 2013-2023',
    xaxis=dict(
        side="bottom",
        tickmode='array',
        tickvals=gp['year'].unique()
    )
)

# Update hovertemplate to show only the count
fig.update_traces(
    hovertemplate='Year: %{x}<br>Subfield: %{y}<br>Authors: %{z:,.0f}<extra></extra>'
)

fig.show()

Here we can observe the distributions by fields:

In [376]:
#| echo: false
#| output: true
#| fig-cap: 'Number of Authors by Subfield, 2013-2023'
#| label: fig-author-subfield-share

gp = authors_topics.groupby(['year', 'subfield_en'])['author_cn'].nunique().reset_index(name='count')
gp['share'] = gp['count'] / gp.groupby('year')['count'].transform('sum')*100
gp = gp[gp['year'] > 2012]

# pivot the df to create a matrix & fill nan values with 0
heatmap_data = gp.pivot(index='subfield_en', columns='year', values='share')
heatmap_data = heatmap_data.fillna(0)

fig = px.imshow(heatmap_data,
                text_auto = '.2f',
                color_continuous_scale='Blues',
                labels = dict(x = "Year", y = "", color = "Share, %"))

fig.update_layout(
    layout,
    width=900,
    height=1200,
    #title='Share of Authors by Subfield, 2013-2023',
    xaxis=dict(
        side = "bottom",
        tickmode='array', 
        tickvals=gp['year'].unique()
    )
)

fig.update_traces(
    hovertemplate='Year: %{x}<br>Subfield: %{y}<br>Share: %{z:.2f}%<extra></extra>'
)

fig.show()

# Fund Projects

In [377]:
#| echo: false
#| output: false


fund = pd.read_csv('data/bcas_fund_projects.csv')
fund = fund[fund['date'] < 2024]
fund = fund.rename(columns={'date':'year'})

fund_topics = pd.merge(fund, df[['title_cn', 'topic_en', 'topic_ru', 'domain_en', 'domain_ru', 'field_en', 'field_ru', 'subfield_en', 'subfield_ru']],  on='title_cn', how='left')
fund_topics = fund_topics.dropna(subset=['fund_project', 'topic_en'])
fund_topics_count = fund_topics.groupby('topic_en')['fund_project'].count().reset_index()
mapping_dict = dict(zip(fund_topics_count['topic_en'], fund_topics_count['fund_project']))
topic_stats['fund_project_count'] = topic_stats['topic_en'].map(mapping_dict)
topic_stats['fund_project_count'] = topic_stats['fund_project_count'].fillna(0)
topic_stats['fund_project_share'] = topic_stats['fund_project_count']/topic_stats['fund_project_count'].sum()*100

In [378]:
#| echo: false
#| output: false
#| fig-cap: 'Number of Fund Projects by Domain, 2013-2023'
#| label: fig-fund-count

gp = fund_topics.groupby(['year', 'domain_en'])['title_cn'].nunique().reset_index(name='count')
gp = gp[(gp['year'] > 2012) & (gp['domain_en'] != 'Arts & Humanities')]

fig = px.area(
    gp,
    x='year',
    y='count',
    color='domain_en',
    color_discrete_map=color_palette,
)

fig.update_layout(
    layout,
    xaxis=dict(
        type='category',
        categoryorder='category ascending',
    ),
    yaxis=dict(
        title="Fund Project Count",
    ),
    #title='Number of Fund Projects by Domain, 2013-2023',
    showlegend=True,
    legend=dict(title='Domain'),
    hovermode='x unified'
)

fig.update_traces(
    opacity=0.7,
    hovertemplate='%{fullData.name}: %{y}<extra></extra>'
)

fig.show()

The distribution of funded projects by field reveals a clear favorite: Applied and Natural Sciences, which isn’t too surprising.

In [379]:
#| echo: false
#| output: true
#| fig-cap: 'Distribution of Fund Projects by Domain, 2013-2023'
#| label: fig-fund-share

gp = fund_topics.groupby(['year', 'domain_en'])['title_cn'].nunique().reset_index(name='count')
gp = gp[(gp['year'] > 2012) & (gp['domain_en'] != 'Arts & Humanities')]

gp = gp.pivot_table(index='year', columns='domain_en', values='count', fill_value=0)
gp = gp.div(gp.sum(axis=1), axis=0).reset_index()

fig = px.area(gp, x='year', y=[col for col in gp.columns if col != 'year'],
              color_discrete_map=color_palette)

fig.update_layout(
    layout,
    width=800,
    #title='Distribution of Fund Projects by Domain, 2013-2023',
    xaxis=dict(
        minor=dict(ticks="inside", showgrid=True), type='category'
    ),
    yaxis=dict(
        title="Share of Funded Articles",
        tickformat='.0%'
    ),
    showlegend=True,
    legend=dict(title='Domain'),
    hovermode='x unified',
)

fig.update_traces(
    hovertemplate='%{fullData.name}: %{y:.1%}<extra></extra>'
)

fig.show()

Between 2013-2023 the largest amount of funding was associated with Earth & Environmental Sciences, followed by Enabling & Strategic Technologies and Social Sciences.

In [380]:
#| echo: false
#| output: true
#| fig-cap: 'Number of Fund Projects by Field, 2013-2023'
#| label: fig-fund-field-count

gp = fund_topics[(fund_topics['year'] > 2012) & (fund_topics['domain_en'] != 'Arts & Humanities')]
gp = gp.groupby(['domain_en','field_en'])['title_cn'].nunique().sort_values(ascending=True).reset_index(name='count')

fig = px.bar(gp,
             y='field_en',
             x='count',
             orientation='h',
             color='domain_en',
             color_discrete_map=color_palette
             )

fig.update_layout(
    layout,
    #title="Number of Fund Projects by Field, 2013-2023",
    width=800,
    height=600,
    xaxis=dict(
        title="Fund Projects",
        range=[0, 300]
    ),
    yaxis=dict(
        title="",
        showline=True,
        linewidth=1,
        linecolor='black'
    ),
    showlegend=True,
    legend_title=dict(text="Domain", font=dict(size=14, family="Verdana", color="black"))
)

fig.update_traces(
    textposition='outside',  
    texttemplate='%{x}',  
    opacity=0.7,
    textfont=dict(color='black'), 

)

fig.show()

A year-by-year count shows a noticeable increse in ICT funding in 2022, along with a peak in Social Sciences during the same year. Additionally, it's clear that Earth & Environmental Sciences, Enabling & Strategic Technologies, and Agriculture consistently receive more funding compared to other fields.

In [381]:
#| echo: false
#| output: true
#| fig-cap: 'Number of Fund Projects by Field, 2013-2023'
#| label: fig-fund-field-count-year

gp = fund_topics[(fund_topics['year'] > 2012) & (fund_topics['domain_en'] != 'Arts & Humanities')]
gp = gp.groupby(['year', 'domain_en','field_en'])['title_cn'].nunique().sort_values(ascending=True).reset_index(name='count')

heatmap_data_count = gp.pivot(index='field_en', columns='year', values='count')
heatmap_data_count = heatmap_data_count.fillna(0)

fig = px.imshow(heatmap_data_count,
                text_auto='.0f',
                color_continuous_scale='Blues',
                labels=dict(x="Year", y="", color="Count"))

fig.update_layout(
    layout,
    width=800,
    height=600,
    #title='Number of Fund Projects by Field, 2013-2023',
    xaxis=dict(
        side="bottom",
        tickmode='array',
        tickvals=gp['year'].unique()
    )
)

fig.update_traces(
    hovertemplate='Year: %{x}<br>Field: %{y}<br>Authors: %{z:,.0f}<extra></extra>'
)

fig.show()

# Summary

Topic modeling has revealed key themes in BCAS publications, offering valuable insights into the evolving landscape of Chinese scientific research. Over time, the journal's content has diversified, with a noticeable shift towards practical applied research and natural sciences. While social and economic research maintains a significant presence, historical and biographical genres have largely faded from prominence.

The analysis unveils a clear connection between BCAS publications and the broader political and economic environment. For instance, the surge in Public Health publications in 2020 and the increased focus on ICT-related topics following the China Chip Export Ban and ChatGPT's release demonstrate the journal's responsiveness to current events. This alignment suggests that BCAS serves as a platform for China's academia to address pressing national issues.

The social economics discourse within BCAS predominantly centers on strategic issues and cutting-edge technologies. The overlap in topics indicates a strong interplay between strategy, enabling technologies, engineering, and education. These themes mirror the core elements of Chinese tech discourse: human capital development, leadership in emerging technologies, and industrial policy modernization. 

Further evidence of this connection is found in the prominence of Earth & Environmental Sciences research. The consistent high volume of publications in this field throughout the years aligns with Xi Jinping's emphasis on green development and carbon neutrality goals. Notably, Environmental Sciences topics attract the highest number of contributing organizations, with some institutions extending their reach into other fields, highlighting a trend towards interdisciplinary research.

Topics in these categories are more frequently associated with funded research, underscoring their strategic importance. Our analysis also identified the most significant contributing organizations between 2013-2023:

- University of CAS
- Institutes of Science and Development
- Institute of Geographic Sciences and Natural Resources Research
- Northwest Institute of Eco-Environment and Resources
- Peking University

However, it's crucial to recognize that while this bird's-eye view is informative, it has its limitations. For a deeper, more nuanced understanding of these topics and the specific research being conducted, there is no substitute for reading the articles themselves. This analysis proves that these materials are indeed valuable and worth reading for anyone seeking to gain a comprehensive understanding of CAS opinions and research vectors. 

The trends and patterns identified here can guide researchers and policymakers towards the most relevant and impactful articles, organizations and experts, helping to navigate the vast landscape of Chinese scientific literature more effectively. 