## Chinese SOEs list

- Source: [State-owned Assets Supervision and Administration Commission of the State Council](http://en.sasac.gov.cn/n_688.htm)
- Runs only on Google Colab (it imports `google.colab`). Do NOT run this script locally; you might lose access to the website.

### Issues
- China Silk Corporation is now under Poly Corp. It might not count as a separate SOE, but it is kept in the file because it still appears on the English website.
- Nokia Bell appears in the English list but not in the Chinese one. There seems to be no evidence or report that Nokia Bell has any relationship with the Chinese authorities.

In [123]:
import numpy as np
from bs4 import BeautifulSoup as bs
import time
import pandas as pd
import requests as rq
from google.colab import files

In [124]:
eng = []
link = []

# First listing page: collect each company's English name and website link
req = rq.get('http://en.sasac.gov.cn/n_688.htm').text
soup = bs(req,'html.parser')

for s in soup.find_all('div',class_="lis_t"):
  eng.append(s.text.strip())
  link.append(s.find('a').get('href'))

# Total number of pages, read from the second-to-last link in the pager
pages = int(soup.find('div',id="displaypagenum").find_all('a')[-2].text)

# Remaining pages follow the pattern n_688_<page>.htm
for i in range(2,pages+1):
  url = 'http://en.sasac.gov.cn/n_688_' + str(i) + '.htm'

  req = rq.get(url).text
  soup = bs(req,'html.parser')

  for s in soup.find_all('div',class_="lis_t"):
    eng.append(s.text.strip())
    link.append(s.find('a').get('href'))

  time.sleep(3)  # pause between requests

df = pd.DataFrame({'english':eng,'link':link})
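
The first page and the follow-up pages are fetched by two near-identical loops above. The sketch below shows one way to generate every listing URL from the same pattern and fetch them with a pause in between. `listing_urls`, `fetch_pages`, and the injectable `get` and `sleep` parameters are hypothetical names introduced here for illustration; they are not part of the original notebook.

```python
import time

BASE = 'http://en.sasac.gov.cn/n_688'  # listing URL stem taken from the cell above


def listing_urls(pages, base=BASE):
    """Yield base.htm, then base_2.htm ... base_<pages>.htm."""
    yield base + '.htm'
    for i in range(2, pages + 1):
        yield f'{base}_{i}.htm'


def fetch_pages(pages, get, delay=3.0, sleep=time.sleep):
    """Fetch every listing page with `get`, pausing `delay` seconds between requests."""
    htmls = []
    for n, url in enumerate(listing_urls(pages)):
        if n:  # no pause before the first request
            sleep(delay)
        htmls.append(get(url))
    return htmls
```

In the notebook, `get` could be something like `lambda u: rq.get(u, timeout=30).text`, and each returned page would then go through the same `find_all('div', class_="lis_t")` parsing as above.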

In [125]:
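domain = []

# Use the host label just before the first '.com...' or '.cn...' part of the URL as the key.
# Note: these are plain substring tests, so a host such as 'comac.cc' also matches 'com'
# and can yield the wrong label; the manual assignments in later cells patch such rows.
for i in df.link:
  if 'com' in i:
    urls = i.split('.')
    domain.append(urls[[urls.index(u) for u in urls if u.startswith('com')][0]-1])
  elif 'cn' in i:
    urls = i.split('.')
    domain.append(urls[[urls.index(u) for u in urls if u.startswith('cn')][0]-1])
  else :
    domain.append(np.nan)

df['domain'] = domain
df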

| | english | link | domain |
|---|---|---|---|
| 0 | China National Nuclear Corporation | http://en.cnnc.com.cn/index.html | cnnc |
| 1 | China Aerospace Science and Technology Corpora... | http://english.spacechina.com/n16421/index.html | spacechina |
| 2 | China Aerospace Science and Industry Corporati... | http://www.casic.com/ | casic |
| 3 | Aviation Industry Corporation of China, Ltd | http://www.avic.com/en/aboutus/index.shtml | avic |
| 4 | China North Industries Group Corporation Limit... | http://en.norincogroup.com.cn/ | norincogroup |
| ... | ... | ... | ... |
| 89 | Overseas Chinese City Group | http://www.chinaoct.com/ | chinaoct |
| 90 | Nam Kwong (Group) Company Limited | http://www.namkwong.com.mo/en/ | namkwong |
| 91 | China XD Group | http://www.xd.com.cn/structure/eng/index.htm | xd |
| 92 | China Railway Materials Company Limited | https://www.crmsc.com.cn/crmscnewEn/index.asp | crmsc |


In [126]:
df[df.domain.isna()].english.to_list()

['CRRC Corporation Limited',
 'China Information and Communication Technologies Group Corporation (CICT)',
 'China TravelSky Holding Company']

In [127]:
df.loc[df.english=='CRRC Corporation Limited','domain'] = 'crrcgc'
df.loc[df.english=='China Information and Communication Technologies Group Corporation (CICT)','domain'] = 'cict'
df.loc[df.english=='China TravelSky Holding Company','domain'] = 'travelsky'

In [128]:
df.loc[df.english=='China Telecommunications Corporation','domain'] = 'chinatelecom'
df.loc[df.english=='China Mobile Communications Group Co Ltd','domain'] = '10086'
df.loc[df.english=='Commercial Aircraft Corporation of China, Ltd','domain'] = 'comac'
df.loc[df.english=='China National Chemical Engineering Group Corporation','domain'] = 'cncec'
df.loc[df.english=='China Energy Conservation and Environmental Protection Group','domain'] = 'cecep'
df.loc[df.english=='China Energy Corporation','domain'] = 'ceic'
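
The manual assignments above match company names by exact string equality, so a typo in a name would silently change nothing. As a hedged sketch, the same overrides can be kept in a dictionary and applied in a loop that reports names it could not find. `apply_overrides` and the small `demo` frame are hypothetical and only illustrate the pattern; the three name and domain pairs are the ones assigned in the notebook.

```python
import pandas as pd

# Name -> domain pairs copied from the manual assignments in the notebook
overrides = {
    'CRRC Corporation Limited': 'crrcgc',
    'China Information and Communication Technologies Group Corporation (CICT)': 'cict',
    'China TravelSky Holding Company': 'travelsky',
}


def apply_overrides(frame, overrides, name_col='english', target_col='domain'):
    """Set target_col for each named row in place; return the names that matched no row."""
    missing = []
    for name, value in overrides.items():
        mask = frame[name_col] == name
        if mask.any():
            frame.loc[mask, target_col] = value
        else:
            missing.append(name)
    return missing


# Toy frame standing in for df (hypothetical data, domains not yet filled in)
demo = pd.DataFrame({'english': list(overrides), 'domain': [None, None, None]})
not_found = apply_overrides(demo, overrides)
```

An empty `not_found` list means every override matched a row; any name that appears in it is worth checking against the scraped spelling.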

In [129]:
req = rq.get('http://www.sasac.gov.cn/n2588035/n2641579/n2641645/index.html').content
soup = bs(req,'lxml')

In [130]:
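chinese = []
link_cn = []

# Each company sits in a light-grey table cell; a few cells have no link
for i in soup.find_all('td',{'bgcolor':"#f0f0f0"}):
  chinese.append(i.text.strip())
  a = i.find('a')
  # Record NaN when the cell has no <a> tag, rather than catching every exception
  link_cn.append(a.get('href') if a is not None else np.nan)

df2 = pd.DataFrame({'chinese':chinese,'link_cn':link_cn})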

In [131]:
df2[df2.link_cn.isna()]

| | chinese | link_cn |
|---|---|---|
| 18 | 中国融通资产管理集团有限公司 | NaN |
| 77 | 中国安能建设集团有限公司 | NaN |


In [132]:
domain = []

# Same label extraction as for the English list, with '.cc' hosts added.
# The 'com' test runs first and is a substring check, so 'http://www.comac.cc/'
# matches it and yields 'http://www'; that row is fixed by hand in a later cell.
for i in df2.link_cn:
  if pd.isna(i):  # covers both None and NaN
    domain.append(np.nan)
  elif 'com' in i:
    urls = i.split('.')
    domain.append(urls[[urls.index(u) for u in urls if u.startswith('com')][0]-1])
  elif 'cn' in i:
    urls = i.split('.')
    domain.append(urls[[urls.index(u) for u in urls if u.startswith('cn')][0]-1])
  elif 'cc' in i:
    urls = i.split('.')
    domain.append(urls[[urls.index(u) for u in urls if u.startswith('cc')][0]-1])
  else :
    domain.append(np.nan)

df2['domain'] = domain
df2

| | chinese | link_cn | domain |
|---|---|---|---|
| 0 | 中国核工业集团有限公司 | http://www.cnnc.com.cn/ | cnnc |
| 1 | 中国商用飞机有限责任公司 | http://www.comac.cc/ | http://www |
| 2 | 中国航天科技集团有限公司 | http://www.spacechina.com/ | spacechina |
| 3 | 中国节能环保集团有限公司 | http://www.cecep.cn | cecep |
| 4 | 中国航天科工集团有限公司 | http://www.casic.com.cn/ | casic |
| ... | ... | ... | ... |
| 92 | 招商局集团有限公司 | http://www.cmhk.com/ | cmhk |
| 93 | 中国国新控股有限责任公司 | http://www.crhc.cn/ | crhc |
| 94 | 华润（集团）有限公司 | http://www.crc.com.hk/ | crc |
| 95 | 中国检验认证（集团）有限公司 | http://www.ccic.com/ | ccic |
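
The `http://www` value in the comac row comes from testing for `'com'` anywhere in the URL. A minimal alternative sketch, not the notebook's method, parses the host name with `urllib.parse` and strips trailing suffix labels before taking the last remaining label. `site_label` and `SUFFIX_LABELS` are hypothetical names, and the suffix set is a small hand-picked assumption covering the hosts seen here, not the full Public Suffix List.

```python
from urllib.parse import urlparse

# Assumed, hand-picked suffix labels; extend as needed for other hosts
SUFFIX_LABELS = {'com', 'cn', 'cc', 'hk', 'mo', 'net', 'org'}


def site_label(url):
    """Return the host label left of the suffix, e.g. 'cnnc' for www.cnnc.com.cn."""
    if not isinstance(url, str):  # NaN or None from a missing link
        return None
    host = urlparse(url).hostname
    if not host:
        return None
    labels = host.lower().split('.')
    # Drop trailing suffix labels such as 'com' and 'cn'
    while labels and labels[-1] in SUFFIX_LABELS:
        labels.pop()
    return labels[-1] if labels else None


examples = [
    'http://www.cnnc.com.cn/',
    'http://www.comac.cc/',
    'http://www.cecep.cn',
    'http://www.crc.com.hk/',
]
labels = [site_label(u) for u in examples]
```

Because the function works on the parsed host name rather than the raw URL string, a host that merely starts with `com` is no longer mistaken for a `.com` suffix.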


In [133]:
df2[df2.domain.isna()]

| | chinese | link_cn | domain |
|---|---|---|---|
| 18 | 中国融通资产管理集团有限公司 | NaN | NaN |
| 77 | 中国安能建设集团有限公司 | NaN | NaN |


In [134]:
df2.loc[df2.chinese=='中国化学工程集团有限公司','domain'] = 'cncec'
df2.loc[df2.chinese=='中国商用飞机有限责任公司','domain'] = 'comac'

In [135]:
seo = pd.merge(df, df2, on='domain', how='outer')
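
Since the two lists are joined on a hand-derived `domain` key, an outer merge keeps rows that found no partner on the other side (the issues list notes, for example, that Nokia Bell has no Chinese entry). One way to surface those rows is the `indicator=True` option of `pd.merge`, which adds a `_merge` column. The frames below are toy data with made-up names, used only to show the check.

```python
import pandas as pd

# Toy stand-ins for df and df2 (hypothetical rows, not scraped data)
en = pd.DataFrame({'english': ['A Corp', 'B Corp', 'English-only Co'],
                   'domain': ['a', 'b', 'n']})
cn = pd.DataFrame({'chinese': ['甲', '乙', '丙'],
                   'domain': ['a', 'b', 'c']})

# indicator=True labels each row as 'both', 'left_only', or 'right_only'
merged = pd.merge(en, cn, on='domain', how='outer', indicator=True)

# Rows present in only one list are the ones that need a manual look
unmatched = merged[merged['_merge'] != 'both']
```

Inspecting `unmatched` before exporting shows which rows still need an English or Chinese name filled in by hand.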

In [136]:
seo.loc[seo.chinese=='中国船舶集团有限公司','english'] = 'China State Shipbuilding Co Ltd'
seo.loc[seo.chinese=='国家石油天然气管网集团有限公司','english'] = 'China Oil & Gas Pipeline Network Corporation(PipeChina)'
seo.loc[seo.chinese=='中国检验认证（集团）有限公司','english'] = 'China Certification & Inspection Group (CCIC)'

In [137]:
seo.to_csv('china_ceos.csv',encoding='utf-8-sig')
files.download('china_ceos.csv')
