This folder contains scripts for downloading data from the CSMAR database using the official Python API.
- download_all_chinese_listed_firms.py - ⭐ NEW Download ALL data from Chinese Listed Firms Research Series (2010-2024)
- explore_csmar_tables.py - Explore available CSMAR tables and columns
- download_csmar_classifications.py - Download stock classification data (6 types)
- test_csmar_api.py - Test script to verify CSMAR API installation
- setup_csmar_api.sh - Bash script to install Python dependencies
Use download_all_chinese_listed_firms.py to get complete Chinese Listed Firms data:
- All financial statements (balance sheet, income, cash flow)
- All financial ratios (ROA, ROE, Tobin's Q, profitability, liquidity, leverage)
- Board characteristics (size, independence, meetings)
- Shareholder structure (ownership concentration, top shareholders)
- Executive compensation, M&A, IPO/delisting data
- Time period: 2010-2024 (15 years)
- Expected size: 2-5 GB
- Expected time: 1-3 hours
Jump to Complete Download Instructions
Use download_csmar_classifications.py for stock classifications only:
- Market type, ST status, industry codes, area codes
- Time period: All historical data
- Expected size: <100 MB
- Expected time: 5-15 minutes
Jump to Classification Download Instructions
- CSMAR Account - Personal registered account (not institutional)
- Python 3.6+ - Check with python3 --version
- Windows OS - CSMAR-PYTHON currently only works on Windows
cd /Users/ekd/Desktop/Desktop/Others/Patience/code
chmod +x setup_csmar_api.sh
./setup_csmar_api.sh
This installs:
- urllib3
- websocket
- websocket_client
- pandas
- prettytable
- Login to CSMAR: https://www.gtarsc.com/
- Navigate to: Data Download → API Interface
- Download: CSMAR-PYTHON.zip
- Extract to: [Python Installation]/Lib/site-packages/
# Find Python installation
python3 -c "import sys; print(sys.executable)"
# Example: /usr/local/bin/python3
# Extract to site-packages
# Example: /usr/local/lib/python3.9/site-packages/csmarapi/
python3 test_csmar_api.py
Expected output:
======================================================================
CSMAR API Installation Test
======================================================================
Test 1: Checking if CSMAR API is installed...
✅ CSMAR API modules found!
Test 2: Checking required dependencies...
✅ urllib3
✅ websocket
✅ pandas
✅ prettytable
✅ All dependencies installed!
Test 3: Creating CSMAR service instance...
✅ CsmarService instance created!
======================================================================
✅ ALL TESTS PASSED!
======================================================================
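If the test script is not at hand, a similar check can be reproduced manually. The sketch below is not the contents of test_csmar_api.py; it only assumes that the modules named in the expected output above (plus the `csmarapi` package folder) should be importable, and `check_modules` is a hypothetical helper name.

```python
import importlib.util

# Module names taken from the expected output above; "csmarapi" is the
# package folder extracted from CSMAR-PYTHON.zip.
REQUIRED_MODULES = ["csmarapi", "urllib3", "websocket", "pandas", "prettytable"]


def check_modules(names):
    """Return {module_name: True/False} without importing the modules."""
    status = {}
    for name in names:
        try:
            status[name] = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Treat any lookup failure as "not installed".
            status[name] = False
    return status


if __name__ == "__main__":
    for name, found in check_modules(REQUIRED_MODULES).items():
        print(("OK      " if found else "MISSING ") + name)
```

A module reported as MISSING here points to either a skipped dependency install or a CSMAR-PYTHON.zip extracted outside site-packages.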
Edit download_csmar_classifications.py:
nano download_csmar_classifications.py
Update lines 44-46:
CSMAR_USERNAME = "enoch.dongbo@stu.ujn.edu.cn" # Your CSMAR login
CSMAR_PASSWORD = "your_password_here" # Your password
LANGUAGE = "1" # 0=Chinese, 1=English
python3 download_csmar_classifications.py
The download_all_chinese_listed_firms.py script downloads EVERYTHING from CSMAR's "China Listed Firms Research Series":
- Company profiles (name, listing date, location, registered capital)
- Listing status (trading status, market type, board type)
- ST status (special treatment stocks)
- Industry classifications (CSRC, SWS)
- Balance sheet (assets, liabilities, equity)
- Income statement (revenue, profit, expenses, R&D)
- Cash flow statement (operating, investing, financing)
- Financial indicators (liquidity ratios, leverage ratios)
- Profitability ratios (ROA, ROE, profit margins)
- Growth ratios (YoY growth rates)
- Board characteristics (size, independence, meetings)
- Shareholder structure (top shareholders, ownership concentration)
- Executive compensation (salary, bonuses, equity)
- Subsidiary structure (active holdings, exits) and core staff roster
- Institutional ownership (funds, QFII, social security)
- IPO data (offering details, pricing, allocation)
- SEO data (seasoned equity offerings)
- Dividends (cash dividends, stock dividends, payout ratio)
- Debt financing (bond issuance, bank loans)
- M&A activities (mergers, acquisitions, restructuring)
- Related party transactions
- Litigation and disputes
- Corporate name changes
- Digital transformation indicators
- ESG metrics (environmental, social, governance)
- Innovation metrics (patents, R&D intensity)
- Annual report sentiment analysis
- Management discussion & analysis (MD&A) topics
- Market valuation (Tobin's Q, P/E, P/B ratios)
- Trading summary (volume, turnover, volatility)
- Stock returns (annual returns, beta)
- Start: 2010-01-01
- End: 2024-12-31
- Total: 15 years of data
Create a .env file in project root:
cd /Users/ekd/Desktop/Desktop/Others/Patience/code
nano .env
Add these lines:
# CSMAR API Credentials
CSMAR_USERNAME=your_email@example.com
CSMAR_PASSWORD=your_password_here
CSMAR_LANGUAGE=1 # 0=Chinese, 1=English
# Date range (2010-2024)
CSMAR_START_DATE=2010-01-01
CSMAR_END_DATE=2024-12-31
Save and exit (Ctrl+X, Y, Enter)
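How the download script reads this file is not shown here. As an illustration only, a .env file in the format above can be parsed with the standard library; `load_env` is a hypothetical helper, and it assumes that inline comments are introduced by a space followed by `#`.

```python
from pathlib import Path


def load_env(path=".env"):
    """Parse KEY=VALUE lines, ignoring blanks, comment lines and inline comments."""
    settings = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Drop an inline comment such as "1 # 0=Chinese, 1=English".
        value = value.split(" #", 1)[0].strip()
        settings[key.strip()] = value
    return settings
```

With this parsing rule, a password that itself contains a space followed by `#` would be truncated, so such a value would need a different comment convention.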
cd /Users/ekd/Desktop/Desktop/Others/Patience/code
python3 src/python/data_acquisition/download_all_chinese_listed_firms.py
What happens:
- Script prompts for confirmation (shows expected size and time)
- Logs in to CSMAR API
- Downloads data from 9 major categories
- Each category saved to dataset/csmar_data/{category}/
- Progress shown for each table
- Summary statistics displayed at end
================================================================================
DOWNLOADING COMPLETE CHINESE LISTED FIRMS RESEARCH SERIES
================================================================================
Time period: 2010-01-01 to 2024-12-31
Stock filter: A-shares only (0%, 3%, 6% codes)
Output: dataset/csmar_data/
================================================================================
Progress: 1/40 tables
--------------------------------------------------------------------------------
📊 DOWNLOADING: Balance sheet - annual
--------------------------------------------------------------------------------
Category: financial_data
Table: FS_Balance_Sheet
Columns: Stkcd, Date, TotalAssets, TotalLiabilities, ...
Time-varying: True
Expected records: 75,000
Downloading... ✅ SUCCESS
Records: 74,856
Size: 125.43 MB
Columns: ['Stkcd', 'Date', 'TotalAssets', ...]
...
================================================================================
DOWNLOAD SUMMARY
================================================================================
Successful downloads: 38/40
Failed downloads: 2/40 (table names may need updating)
Total records: 3,245,678
Total size: 2,847.52 MB (2.78 GB)
================================================================================
dataset/csmar_data/
├── basic_info/
│ ├── company_profile_20251016_143052.csv
│ ├── listing_status_20251016_143052.csv
│ ├── st_status_20251016_143052.csv
│ └── industry_classification_20251016_143052.csv
├── financial_data/
│ ├── balance_sheet_20251016_143052.csv
│ ├── income_statement_20251016_143052.csv
│ ├── cash_flow_20251016_143052.csv
│ ├── financial_indicators_20251016_143052.csv
│ ├── profitability_ratios_20251016_143052.csv
│ └── growth_ratios_20251016_143052.csv
├── equity_governance/
│ ├── board_characteristics_20251016_143052.csv
│ ├── shareholder_structure_20251016_143052.csv
│ ├── executive_compensation_20251016_143052.csv
│ └── institutional_ownership_20251016_143052.csv
├── financing_distribution/
│ ├── ipo_data_20251016_143052.csv
│ ├── seo_data_20251016_143052.csv
│ ├── dividends_20251016_143052.csv
│ └── debt_financing_20251016_143052.csv
├── major_events/
│ ├── ma_activities_20251016_143052.csv
│ ├── related_party_20251016_143052.csv
│ └── litigation_20251016_143052.csv
├── feature_topics/
│ ├── digital_transformation_20251016_143052.csv
│ ├── esg_metrics_20251016_143052.csv
│ └── innovation_20251016_143052.csv
├── text_analysis/
│ ├── sentiment_analysis_20251016_143052.csv
│ └── mda_topics_20251016_143052.csv
├── stock_market/
│ ├── market_value_20251016_143052.csv
│ ├── trading_summary_20251016_143052.csv
│ └── stock_returns_20251016_143052.csv
└── manifest_20251016_143052.json # Download metadata
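The layout above follows a `{category}/{table}_{YYYYMMDD_HHMMSS}.csv` pattern. The sketch below shows one way such a path could be built; `output_path` is a hypothetical helper, not a function taken from the download script.

```python
from datetime import datetime
from pathlib import Path


def output_path(category, table_label, when, root="dataset/csmar_data"):
    """Build root/category/label_YYYYMMDD_HHMMSS.csv for one downloaded table."""
    stamp = when.strftime("%Y%m%d_%H%M%S")
    return Path(root) / category / f"{table_label}_{stamp}.csv"


run_time = datetime(2025, 10, 16, 14, 30, 52)
print(output_path("financial_data", "balance_sheet", run_time).as_posix())
# prints: dataset/csmar_data/financial_data/balance_sheet_20251016_143052.csv
```

Sharing one timestamp across a whole run, as the listing does, keeps the files of a single download easy to group and to match against the manifest.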
Cause: Table names in script don't match your CSMAR subscription
Solution: Explore available tables first:
python3 src/python/data_acquisition/explore_csmar_tables.py
This shows all tables you have access to. Update table names in download_all_chinese_listed_firms.py accordingly.
Cause: 2-5 GB is a lot of data
Solutions:
- Run overnight - Let it complete while you sleep
- Use faster internet - University/office connection is better
- Download in batches - Comment out some categories in the script
- Target specific tables - Only download what you need
Cause: Loading large tables into RAM
Solution: The script processes tables one at a time and saves to disk immediately, so this is rare. If it happens:
# Edit download_all_chinese_listed_firms.py
# Add chunking for very large tables (>1GB)
Running download_csmar_classifications.py will:
- Login to CSMAR API
- Download 6 classification types:
- Stock Market Classification (Shanghai/Shenzhen)
- ST & Non-ST stocks
- CSRC Industry 2012
- Area Classification
- SWS Industry 2021
- CSRC Industry 2001
- Save individual CSVs to ../../dataset/
- Merge all into one master file
Expected time: 5-15 minutes depending on network speed
All files saved to dataset/ folder:
- csmar_market_classification_YYYYMMDD_HHMMSS.csv (~5,607 records)
- csmar_st_classification_YYYYMMDD_HHMMSS.csv (~80,000+ records)
- csmar_csrc_industry_2012_YYYYMMDD_HHMMSS.csv (~80,000+ records)
- csmar_area_classification_YYYYMMDD_HHMMSS.csv (~5,607 records)
- csmar_sws_industry_2021_YYYYMMDD_HHMMSS.csv (~80,000+ records)
- csmar_csrc_industry_2001_YYYYMMDD_HHMMSS.csv (~80,000+ records)
- csmar_classifications_merged_YYYYMMDD_HHMMSS.csv (master file)
Edit the download_all_classifications() method:
# Example: Add market index data
df_index = self.download_classification_data(
table_name='STK_Market_Index', # Check with getListTables()
columns=['Date', 'IndexCode', 'ClosePrice'], # Check with getListFields()
condition="IndexCode like 'SH000001'", # Shanghai Composite
description="Market Index Data"
)
datasets['market_index'] = df_index
Edit lines 41-42:
START_DATE = "2015-01-01" # Change start date
END_DATE = "2023-12-31" # Change end date
Modify the condition parameter:
# Original (all A-shares)
condition="Stkcd like '0%' or Stkcd like '3%' or Stkcd like '6%'"
# Only Shanghai Stock Exchange
condition="Stkcd like '6%'"
# Only specific company
condition="Stkcd='000001'"
# Multiple specific companies
condition="Stkcd in ('000001', '000002', '600000')"
Cause: CSMAR-PYTHON not in Python's site-packages
Solution:
# Check if csmarapi folder exists
ls [Python]/Lib/site-packages/csmarapi
# If not found, re-download and extract CSMAR-PYTHON.zip
Possible causes:
- Wrong username/password
- Using institutional account (need personal account)
- Account not verified
- Network issue
Solution: Verify credentials at https://www.gtarsc.com/
Possible causes:
- Wrong table name
- Wrong column names
- No permission for that table
Solution: Explore available tables:
from csmarapi.CsmarService import CsmarService
csmar = CsmarService()
csmar.login('username', 'password', '1')
# List all databases
databases = csmar.getListDbs()
# List tables in a database
tables = csmar.getListTables('China Stock Market Series')
# List fields in a table
fields = csmar.getListFields('STK_MKT_Type')
Cause: CSMAR rate limiting
Solution: Wait 30 minutes before running the identical query again
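To avoid hitting the cooldown by accident, a script can remember when each query last ran. This is a sketch under assumptions: how CSMAR decides that two queries are "identical" is not documented here, so the fingerprint below simply combines all query arguments, and `query_key` and `seconds_until_allowed` are hypothetical helper names.

```python
import hashlib
import time

COOLDOWN_SECONDS = 30 * 60  # the 30-minute wait mentioned above


def query_key(columns, condition, table_name, start_date, end_date):
    """Fingerprint a query so identical requests map to the same key."""
    text = "|".join([",".join(columns), condition, table_name, start_date, end_date])
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def seconds_until_allowed(key, last_run, now=None):
    """Return 0.0 if the query may run now, else the remaining wait in seconds."""
    now = time.time() if now is None else now
    elapsed = now - last_run.get(key, float("-inf"))
    return max(0.0, COOLDOWN_SECONDS - elapsed)
```

A caller would store `last_run[key] = time.time()` after each successful query and sleep for the returned number of seconds before repeating it.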
Cause: CSMAR API has 200K record limit per query
Solution: Use pagination (automatically handled in script, but you can customize):
# First batch
df1 = csmar.query_df(columns, "Stkcd like '0%' limit 0,200000", table_name)
# Second batch
df2 = csmar.query_df(columns, "Stkcd like '0%' limit 200000,200000", table_name)
# Combine (requires: import pandas as pd)
df = pd.concat([df1, df2], ignore_index=True)
See full documentation in: ../../docs/CSMAR_DATA_DOWNLOAD_GUIDE.md
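The two-batch pattern above generalizes to any number of pages. The sketch below only generates the condition strings, using the same `limit offset,count` suffix as the example; `paged_conditions` is a hypothetical helper, and the total would typically come from a record count such as `queryCount`.

```python
PAGE_SIZE = 200_000  # per-query record limit stated above


def paged_conditions(condition, total_records, page_size=PAGE_SIZE):
    """Yield the condition with a 'limit offset,count' suffix for each page."""
    for offset in range(0, total_records, page_size):
        yield f"{condition} limit {offset},{page_size}"


for cond in paged_conditions("Stkcd like '0%'", 450_000):
    print(cond)
# prints:
# Stkcd like '0%' limit 0,200000
# Stkcd like '0%' limit 200000,200000
# Stkcd like '0%' limit 400000,200000
```

Each generated string can be passed as the condition argument of `query_df`, and the resulting DataFrames concatenated as shown above.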
from csmarapi.CsmarService import CsmarService
csmar = CsmarService()
# Login
csmar.login('username', 'password', '1') # 1=English, 0=Chinese
# List databases
databases = csmar.getListDbs()
# List tables
tables = csmar.getListTables('China Stock Market Series')
# List fields
fields = csmar.getListFields('STK_MKT_Type')
# Count records
count = csmar.queryCount(columns, condition, table_name, start_date, end_date)
# Download data as DataFrame
df = csmar.query_df(columns, condition, table_name, start_date, end_date)
- CSMAR Website: https://www.gtarsc.com/
- API Documentation: https://www.gtarsc.com/api/ (login required)
- Technical Support: service@gtarsc.com
- Phone: +86 (0755) 8670 3017
- Windows Only: CSMAR-PYTHON currently only works on Windows
- Personal Accounts Only: Institutional accounts don't support API access
- Rate Limiting: 30-minute cooldown between identical queries
- Record Limit: 200,000 records per query (use pagination for more)
- Subscription Required: You must have active CSMAR subscription
After downloading classification data:
- Verify downloads:
ls -lh ../../dataset/csmar_*.csv
wc -l ../../dataset/csmar_*.csv
- Integrate into pipeline:
cd ../..
python3 regenerate_analysis_dataset.py
- Run baseline regression:
python3 run_corrected_baseline_analysis.py
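The verification step above relies on `ls` and `wc`, which are not available in a plain Windows command prompt. A rough Python equivalent is sketched below; `count_rows` is a hypothetical helper, and UTF-8 encoding of the CSV files is an assumption that may need adjusting.

```python
import csv
from pathlib import Path


def count_rows(folder, pattern="csmar_*.csv"):
    """Return {file_name: data_row_count}, excluding the header row."""
    counts = {}
    for path in sorted(Path(folder).glob(pattern)):
        with path.open(newline="", encoding="utf-8") as handle:
            rows = sum(1 for _ in csv.reader(handle))
        counts[path.name] = max(rows - 1, 0)
    return counts


if __name__ == "__main__":
    for name, n in count_rows("../../dataset").items():
        print(f"{n:>10,}  {name}")
```

The counts can then be compared against the approximate record numbers listed for each classification file.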
# 1. Download ALL data (2010-2024)
python3 src/python/data_acquisition/download_all_chinese_listed_firms.py
# 2. Merge with existing master dataset
python3 merge_csmar_complete_data.py # To be created
# 3. Re-run analysis with complete data
python3 run_corrected_baseline_analysis.py
# 1. Download classifications only
python3 src/python/data_acquisition/download_csmar_classifications.py
# 2. Integrate into pipeline
python3 regenerate_analysis_dataset.py
# 3. Run baseline regression
python3 run_corrected_baseline_analysis.py
Last Updated: October 16, 2025
Time Period: 2010-2024 (15 years)
Project: MSCI Digital Transformation DID Analysis
Author: Enoch Dongbo (enoch.dongbo@stu.ujn.edu.cn)