# Data collection: Chinese Stack Exchange

**Contents**
1. [Experiment and understand API](#1-experiment-and-understand-api)
2. [Collect Chinese Stack Exchange data](#2-collect-chinese-stack-exchange-data)

The participants in this study are users of the Chinese-language Stack 
Exchange site, whose posts are used to explore Chinese-English code-switching. 
The core group is Chinese-English bilinguals: people fluent in both languages 
who code-switch as part of everyday communication. These users often use both 
languages in the same post, which makes their questions and answers a direct 
record of language mixing. Chinese students studying abroad also contribute, 
as they frequently switch between Chinese and English in academic and social 
settings and seek clarification in either language.

Another important group is non-Chinese learners of Chinese, who are primarily 
focused on learning the language but may also code-switch when asking questions 
or discussing language-related topics in a bilingual context. Language 
enthusiasts and educators with expertise in Chinese or bilingualism contribute 
as well, often reflecting on the linguistic and cultural aspects of 
code-switching. Together, these varied linguistic backgrounds provide rich data 
for understanding Chinese-English code-switching in a global, online learning 
environment.

In [1]:
# Import libraries
import requests
import pandas as pd
import numpy as np
import time
import private.config as config

## 1. Experiment and understand API

In [2]:
# Helper that requests a URL and returns the parsed JSON response
def get_stack_exchange_data(url):
    response = requests.get(url)
    # check if the response is successful
    if response.status_code != 200:
        raise ValueError(f"Invalid response: {response.status_code}")
    return response.json()

stack_exchange_url = "https://api.stackexchange.com/2.3/questions?site=chinese"
data_stack_ex = get_stack_exchange_data(stack_exchange_url)
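
The request URL above is written out by hand. As the query grows (page number, page size, API key), it can be easier to build it from a dictionary of parameters. The following is a minimal sketch using only the standard library; `build_url` is a hypothetical helper, not part of the original notebook.

```python
from urllib.parse import urlencode

# Same base URL as the one used in this notebook
BASE_URL = "https://api.stackexchange.com/2.3"

def build_url(endpoint, **params):
    """Return BASE_URL/endpoint with params appended as an encoded query string.

    Hypothetical helper for illustration only.
    """
    return f"{BASE_URL}/{endpoint}?{urlencode(params)}"

print(build_url("questions", site="chinese", pagesize=100, page=1))
# https://api.stackexchange.com/2.3/questions?site=chinese&pagesize=100&page=1
```

`urlencode` percent-encodes special characters, so a value such as a filter string starting with `!` is sent as `%21...`; the server decodes it back.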

In [3]:
# understand the structure of the data
print(data_stack_ex.keys())

dict_keys(['items', 'has_more', 'quota_max', 'quota_remaining'])


In [4]:
data_stack_ex['items'][0]

{'tags': ['translation', 'poetry'],
 'owner': {'account_id': 1347685,
  'reputation': 1524,
  'user_id': 11269,
  'user_type': 'registered',
  'accept_rate': 58,
  'profile_image': 'https://www.gravatar.com/avatar/c6223a208ab21a3a745704c6823aa2c2?s=256&d=identicon&r=PG',
  'display_name': 'Starnuto di topo',
  'link': 'https://chinese.stackexchange.com/users/11269/starnuto-di-topo'},
 'is_answered': True,
 'view_count': 31,
 'answer_count': 1,
 'score': 0,
 'last_activity_date': 1739899113,
 'creation_date': 1739799170,
 'question_id': 59815,
 'content_license': 'CC BY-SA 4.0',
 'link': 'https://chinese.stackexchange.com/questions/59815/my-translation-of-li-bais-%e4%b8%89%e4%ba%94%e4%b8%83%e8%a8%80',
 'title': 'My translation of Li Bai&#39;s 《三五七言》'}

In [5]:
data_stack_ex['items'][0].keys()

dict_keys(['tags', 'owner', 'is_answered', 'view_count', 'answer_count', 'score', 'last_activity_date', 'creation_date', 'question_id', 'content_license', 'link', 'title'])
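
The `creation_date` and `last_activity_date` fields are plain integers. Assuming they are Unix epoch seconds in UTC (which is how the Stack Exchange API documents its dates), they can be converted to readable timestamps as sketched below. `epoch_to_utc` is a hypothetical helper name.

```python
from datetime import datetime, timezone

def epoch_to_utc(seconds):
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

# Values copied from the sample item shown above
sample = {"creation_date": 1739799170, "last_activity_date": 1739899113}

for field, value in sample.items():
    print(field, epoch_to_utc(value).isoformat())
# creation_date 2025-02-17T13:32:50+00:00
# last_activity_date 2025-02-18T17:18:33+00:00
```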

In [6]:
print('data_stack_ex[items][0][tags]')
print(data_stack_ex['items'][0]['tags'])
print('data_stack_ex[items][0][title]')
print(data_stack_ex['items'][0]['title'])

data_stack_ex[items][0][tags]
['translation', 'poetry']
data_stack_ex[items][0][title]
My translation of Li Bai&#39;s 《三五七言》
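
The title comes back with HTML entities (`&#39;` in place of an apostrophe). If the titles are later used for text analysis, decoding them first is probably worthwhile. A minimal sketch with the standard library:

```python
import html

# Title string copied from the API output above
raw_title = "My translation of Li Bai&#39;s 《三五七言》"

# html.unescape replaces HTML character references with the characters they stand for
clean_title = html.unescape(raw_title)
print(clean_title)  # My translation of Li Bai's 《三五七言》
```

Non-ASCII characters such as 《三五七言》 are left untouched; only the entity is decoded.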


In [7]:
print('data_stack_ex[items] number: ')
print(len(data_stack_ex['items']))

data_stack_ex[items] number: 
30


In [8]:
# Test the API key
API_KEY = config.STACK_EXCHANGE_API_KEY  
TEST_URL = f"https://api.stackexchange.com/2.3/info?site=chinese&key={API_KEY}"

response = requests.get(TEST_URL)
print(response.status_code, response.json())

200 {'items': [{'new_active_users': 0, 'total_users': 32226, 'badges_per_minute': 0.01, 'total_badges': 41320, 'total_votes': 109886, 'total_comments': 58981, 'answers_per_minute': 0.0, 'questions_per_minute': 0.0, 'total_answers': 30141, 'total_accepted': 7065, 'total_unanswered': 192, 'total_questions': 12041, 'api_revision': '2025.2.12.45337'}], 'has_more': False, 'quota_max': 10000, 'quota_remaining': 9999}
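
The response wrapper reports `quota_max` and `quota_remaining`. The Stack Exchange API documentation also describes an optional `backoff` field giving a number of seconds to wait before the next request; it does not appear in the output above, so treating it as present is an assumption. The sketch below reads both signals from a wrapper dictionary. `throttle_hint` and the `min_quota` threshold are hypothetical.

```python
def throttle_hint(wrapper, min_quota=100):
    """Return (seconds_to_wait, quota_is_low) for a response wrapper dict.

    `backoff` is assumed to be optional; a missing field means no wait.
    `min_quota` is an arbitrary illustrative threshold.
    """
    seconds_to_wait = int(wrapper.get("backoff", 0))
    quota_is_low = wrapper.get("quota_remaining", 0) < min_quota
    return seconds_to_wait, quota_is_low

# Wrapper fields copied from the /info response above (items omitted)
wrapper = {"items": [], "has_more": False, "quota_max": 10000, "quota_remaining": 9999}
print(throttle_hint(wrapper))  # (0, False)
```

A collection loop could sleep for `seconds_to_wait` and stop early when `quota_is_low` is true, instead of relying only on random pauses.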


## 2. Collect Chinese Stack Exchange data

In [None]:
# Base URL for Stack Exchange API (Chinese Stack Exchange)
BASE_URL = "https://api.stackexchange.com/2.3"
API_KEY = config.STACK_EXCHANGE_API_KEY  

def fetch_questions_with_tags(site="chinese", page_size=100, retries=3):
    """
    Fetches all questions with their tags from the Stack Exchange API.

    :param site: The Stack Exchange site (default: "chinese")
    :param page_size: Number of questions per page (max 100)
    :param retries: Number of retry attempts for failed requests
    :return: List of all questions with tags
    """
    questions = []
    page = 1

    while True:
        url = (f"{BASE_URL}/questions?order=desc&sort=activity&site={site}" +
                f"&pagesize={page_size}&page={page}&filter=!nKzQURF6Y5&key={API_KEY}")

        attempt = 0

        while attempt < retries:
            try:
                response = requests.get(url, timeout=20)
                if response.status_code == 200:
                    data = response.json()
                    for item in data.get("items", []):
                        questions.append({
                            "question_id": item["question_id"],
                            "title": item["title"],
                            "tags": ", ".join(item["tags"])
                        })

                    if not data.get("has_more", False):
                        print(f"Finished fetching all questions at page {page}")
                        return questions

                    break  # Break retry loop if successful

                else:
                    print(f"Error fetching page {page} (attempt {attempt+1}): "
                          f"{response.status_code}")
                    return questions  # Return what we have so far

            except requests.exceptions.RequestException as e:
                print(f"Request failed for page {page} "
                      f"(attempt {attempt+1}): {e}")

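            attempt += 1
            # Random sleep to prevent rate limiting
            time.sleep(np.random.randint(3, 10))
        else:
            # All retries failed: stop here rather than silently skipping the page
            print(f"Giving up on page {page} after {retries} attempts")
            return questions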

        page += 1  # Move to the next page
        time.sleep(np.random.randint(3, 10))  

    return questions
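
The paging-and-retry logic of `fetch_questions_with_tags` can be exercised without touching the network by passing the page-fetching step in as a function. The sketch below is not the notebook's implementation; `collect_pages`, `fetch_page`, and `fake_fetch` are hypothetical names, and the fake source only imitates the `items` / `has_more` wrapper shape seen earlier.

```python
import time

def collect_pages(fetch_page, retries=3, pause=0.0):
    """Collect items from a paginated source.

    fetch_page(page) must return a dict with "items" and "has_more",
    or raise an exception on failure.
    Returns (items, completed), where completed is False if a page
    could not be fetched after all retries.
    """
    items = []
    page = 1
    while True:
        for attempt in range(1, retries + 1):
            try:
                data = fetch_page(page)
                break  # success: leave the retry loop
            except Exception as exc:  # broad on purpose in this sketch
                print(f"page {page}, attempt {attempt} failed: {exc}")
                time.sleep(pause)
        else:
            # Every attempt failed: stop rather than skip the page
            return items, False

        items.extend(data.get("items", []))
        if not data.get("has_more", False):
            return items, True
        page += 1
        time.sleep(pause)


# Fake two-page source that fails once, on its second call
calls = {"n": 0}

def fake_fetch(page):
    calls["n"] += 1
    if calls["n"] == 2:
        raise RuntimeError("simulated timeout")
    return {"items": [f"q{page}"], "has_more": page < 2}

print(collect_pages(fake_fetch))  # (['q1', 'q2'], True)
```

Returning a `completed` flag alongside the items makes it explicit whether the result covers every page, which matters when the output is saved as the full dataset.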


In [10]:
# Function to run the full data collection pipeline
def run_data_collection_pipeline(site="chinese", page_size=100):
    """
    Runs the data collection pipeline, fetching all questions and saving them to a CSV file.

    :param site: The Stack Exchange site (default: "chinese")
    :param page_size: Number of questions per page (max 100)
    """
    start_time = time.time()
    questions = fetch_questions_with_tags(site=site, page_size=page_size)
    end_time = time.time()

    # Calculate and print the time taken
    elapsed_time = end_time - start_time
    elapsed_str = time.strftime("%H:%M:%S", time.gmtime(elapsed_time))
    print(f"Data collection completed in {elapsed_str}")

    # Convert to DataFrame and save to CSV
    questions_df = pd.DataFrame(questions)
    file_path = "private/stack_exchange_all_questions.csv"
    # Save the data to a CSV file, and overwrite if it already exists
    questions_df.to_csv(file_path, index=False)
    print(f"Data saved to {file_path}")

# Run the full data collection pipeline
run_data_collection_pipeline(page_size=100)

Finished fetching all questions at page 121
Data collection completed in 00:12:51
Data saved to private/stack_exchange_all_questions.csv
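
Because the tags are stored as a single `", "`-joined string per question, a later analysis step has to split them again. The sketch below counts tag frequencies from a CSV with the same three columns, using only the standard library. `count_tags` is a hypothetical helper, and the sample rows are made up for illustration (only the `translation, poetry` tag pair comes from the output above).

```python
import csv
import io
from collections import Counter

def count_tags(csv_file):
    """Count tag occurrences in a CSV with a comma-and-space separated 'tags' column."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        # Skip empty strings so questions without tags do not add a blank tag
        counts.update(tag for tag in row["tags"].split(", ") if tag)
    return counts

# Made-up sample in the same layout as the saved file
sample_csv = io.StringIO(
    "question_id,title,tags\n"
    '59815,Sample title A,"translation, poetry"\n'
    "1,Sample title B,translation\n"
)
print(count_tags(sample_csv).most_common())
# [('translation', 2), ('poetry', 1)]
```

For the real file, the same function could be called on `open("private/stack_exchange_all_questions.csv", newline="", encoding="utf-8")`, assuming the file was written with pandas' default UTF-8 encoding.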
