# Step 2: Data Selection from the HK.HSI Constituents (Hang Seng Index constituent stocks)

---
**What does this notebook do?**
- Step 1 downloads the data for all 64 Hang Seng Index (HSI) constituents.
- This step selects which of those stocks to use for the research.
- The project needs stock price data from January 1, 2012 to January 1, 2022 for every stock, so this notebook keeps only the stocks that meet this requirement.
- **You can SKIP this notebook if you plan to use the stock data provided in [DataSource](../DataSource). In that case, go directly to Step 3.**

**How to choose the stocks?**
- Hang Seng Bank has published a report on the Hang Seng Index; the [full report can be found here](https://www.hsi.com.hk/static/uploads/contents/en/dl_centre/factsheets/hsie.pdf).
- The report summarizes the top 50 stocks, including industry, share type, and weighting.
- The list of these 50 stocks is stored as a CSV file: [HSI50_Stock_list in the DataSource](../DataSource/HSI50_Stock_list.csv).
- This research uses the stocks among these 50 that also meet the timing requirement.

**To-Do List**
1. Find the stocks with a listing date before 2012-01-01 among the 64 HSI constituents
2. Find the stocks in the HSI top 50 that were listed before 2012-01-01
3. Decide which group of data to use
---

# DataSource Overview

In [1]:
import pandas as pd
HSI_64 = pd.read_csv('../DataSource/full_hsi_stock_list.csv')
HSI_top_50 = pd.read_csv('../DataSource/HSI50_Stock_list.csv')

In [2]:
HSI_64.head()

Unnamed: 0,code,lot_size,stock_name,stock_owner,stock_child_type,stock_type,list_time,stock_id,main_contract,last_trade_time
0,HK.00001,500,长和,,,STOCK,2015-03-18,4440996184065,False,
1,HK.00002,500,中电控股,,,STOCK,1970-01-01,2,False,
2,HK.00003,1000,香港中华煤气,,,STOCK,1970-01-01,3,False,
3,HK.00005,400,汇丰控股,,,STOCK,1970-01-01,5,False,
4,HK.00006,500,电能实业,,,STOCK,1976-08-16,6,False,


In [3]:
HSI_top_50.head()

Unnamed: 0,Stock Code,ISIN CODE,Company Name,Industry Classification,Share Type,Weighting (%)
0,700,KYG875721634,TENCENT,Information Technology,Other HK-listed Mainland Co.,8.0
1,5,GB0005405286,HSBC HOLDINGS,Financials,HK Ordinary,7.71
2,3690,KYG596691041,MEITUAN-W,Information Technology,Other HK-listed Mainland Co.,7.62
3,1299,HK0000069689,AIA,Financials,HK Ordinary,7.53
4,9988,KYG017191142,BABA-SW,Information Technology,Other HK-listed Mainland Co.,7.13


# Select the Top 50 HSI Constituents Listed Before 2012-01-01

In [4]:
def formatting_HSI_top_50_stock_code():
    """
    Convert the 50 stock codes from the Hang Seng Bank report to the
    stock code format used in the historical data downloaded via the
    FUTU Open API, e.g. 700 -> 'HK.00700'.
    Note: this updates the global HSI_top_50 DataFrame in place.
    """
    # Zero-pad each numeric code to five digits and add the 'HK.' prefix
    HSI_top_50['Stock Code'] = ['HK.' + str(code).zfill(5) for code in HSI_top_50['Stock Code']]
    return HSI_top_50

In [5]:
HSI_top_50 = formatting_HSI_top_50_stock_code()
HSI_top_50.head()

Unnamed: 0,Stock Code,ISIN CODE,Company Name,Industry Classification,Share Type,Weighting (%)
0,HK.00700,KYG875721634,TENCENT,Information Technology,Other HK-listed Mainland Co.,8.0
1,HK.00005,GB0005405286,HSBC HOLDINGS,Financials,HK Ordinary,7.71
2,HK.03690,KYG596691041,MEITUAN-W,Information Technology,Other HK-listed Mainland Co.,7.62
3,HK.01299,HK0000069689,AIA,Financials,HK Ordinary,7.53
4,HK.09988,KYG017191142,BABA-SW,Information Technology,Other HK-listed Mainland Co.,7.13
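
The same code conversion can also be written without an explicit loop. The snippet below is a minimal sketch, not the notebook's own implementation: `to_futu_code` is a hypothetical helper name, and the sample codes are taken from the `head()` output above.

```python
import pandas as pd

def to_futu_code(codes: pd.Series) -> pd.Series:
    """Zero-pad numeric HK stock codes to five digits and add the 'HK.' prefix."""
    return "HK." + codes.astype(str).str.zfill(5)

# Toy input mirroring the 'Stock Code' column of the top-50 list
sample = pd.Series([700, 5, 3690, 9988])
formatted = to_futu_code(sample)
print(formatted.tolist())
```

Because the helper returns a new Series instead of modifying a global DataFrame, it can be applied to any code column and tested on its own.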


In [6]:
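df = HSI_top_50.merge(HSI_64, how='left', left_on='Stock Code', right_on='code')
# list_time is an ISO 'YYYY-MM-DD' string, so the cutoff must be zero-padded
# for the string comparison to follow chronological order
listed_after = df[df['list_time'] > '2012-01-01']
print(f"There are {len(listed_after)} stocks listed after 2012-1-1, out of top 50 HSI Constituent Stock\nListed here below")
listed_after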

There are 11 stocks listed after 2012-1-1, out of top 50 HSI Constituent Stock
Listed here below


Unnamed: 0,Stock Code,ISIN CODE,Company Name,Industry Classification,Share Type,Weighting (%),code,lot_size,stock_name,stock_owner,stock_child_type,stock_type,list_time,stock_id,main_contract,last_trade_time
2,HK.03690,KYG596691041,MEITUAN-W,Information Technology,Other HK-listed Mainland Co.,7.62,HK.03690,100,美团-W,,,STOCK,2018-09-20,76364518526570,False,
4,HK.09988,KYG017191142,BABA-SW,Information Technology,Other HK-listed Mainland Co.,7.13,HK.09988,100,阿里巴巴-SW,,,STOCK,2019-11-26,78224239372036,False,
8,HK.02269,KYG970081173,WUXI BIO,Healthcare,HK Ordinary,2.64,HK.02269,500,药明生物,,,STOCK,2017-06-13,74371653699805,False,
9,HK.01810,KYG9830T1067,XIAOMI-W,Information Technology,Other HK-listed Mainland Co.,2.63,HK.01810,200,小米集团-W,,,STOCK,2018-07-09,76033806042898,False,
18,HK.09618,KYG8208B1014,JD-SW,Information Technology,Other HK-listed Mainland Co.,1.34,HK.09618,50,京东集团-SW,,,STOCK,2020-06-18,79100412700050,False,
23,HK.00001,KYG217651051,CKH HOLDINGS,Conglomerates,HK Ordinary,1.07,HK.00001,500,长和,,,STOCK,2015-03-18,4440996184065,False,
33,HK.01113,KYG2177B1014,CK ASSET,Properties & Construction,HK Ordinary,0.78,HK.01113,500,长实集团,,,STOCK,2015-06-03,71244917507161,False,
38,HK.06098,KYG2453A1085,CG SERVICES,Properties & Construction,Other HK-listed Mainland Co,0.65,HK.06098,1000,碧桂园服务,,,STOCK,2018-06-19,75965086570450,False,
39,HK.09999,KYG6427A1022,NTES-S,Information Technology,Other HK-listed Mainland Co.,0.64,HK.09999,100,网易-S,,,STOCK,2020-06-11,79083232831247,False,
43,HK.01997,KYG9593A1040,WHARF REIC,Properties & Construction,HK Ordinary,0.57,HK.01997,1000,九龙仓置业,,,STOCK,2017-11-23,75067438401485,False,
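
The listing-date filter compares `list_time` as text. A text comparison only follows chronological order when both sides are zero-padded ISO dates; with an unpadded cutoff such as `'2012-1-1'`, a listing from early 2012 sorts before the cutoff. The sketch below illustrates the difference on made-up data (the codes `HK.A`, `HK.B`, `HK.C` and their dates are hypothetical) and shows that parsing with `pd.to_datetime` removes the dependency on string formatting.

```python
import pandas as pd

# Hypothetical listing dates; the middle one falls inside 2012
listings = pd.DataFrame({
    "code": ["HK.A", "HK.B", "HK.C"],
    "list_time": ["2010-10-29", "2012-05-01", "2018-09-20"],
})

cutoff = pd.Timestamp("2012-01-01")
list_dates = pd.to_datetime(listings["list_time"])

# Unpadded string cutoff: '2012-05-01' compares as smaller than '2012-1-1'
by_string = listings.loc[listings["list_time"] > "2012-1-1", "code"].tolist()

# Timestamp comparison is chronological regardless of how the cutoff is written
by_date = listings.loc[list_dates > cutoff, "code"].tolist()

print(by_string)
print(by_date)
```

Only the timestamp-based filter flags the 2012 listing as being after the cutoff.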


In [7]:
# Zero-padded cutoff so that the string comparison on ISO dates is chronological
df = df[df['list_time']<='2012-01-01'].reset_index(drop=True)
df = df.drop(columns=['code', 'stock_owner', 'stock_child_type', 'stock_type', 'stock_id', 'main_contract', 'last_trade_time'])
df.to_csv('../DataSource/research_use_39_stocks.csv', index=False)
df

Unnamed: 0,Stock Code,ISIN CODE,Company Name,Industry Classification,Share Type,Weighting (%),lot_size,stock_name,list_time
0,HK.00700,KYG875721634,TENCENT,Information Technology,Other HK-listed Mainland Co.,8.0,100,腾讯控股,2004-06-16
1,HK.00005,GB0005405286,HSBC HOLDINGS,Financials,HK Ordinary,7.71,400,汇丰控股,1970-01-01
2,HK.01299,HK0000069689,AIA,Financials,HK Ordinary,7.53,200,友邦保险,2010-10-29
3,HK.00939,CNE1000002H1,CCB,Financials,H Share,4.63,1000,建设银行,2005-10-27
4,HK.00388,HK0388045442,HKEX,Financials,HK Ordinary,4.35,100,香港交易所,2000-06-27
5,HK.02318,CNE1000003X6,PING AN,Financials,H Share,2.82,500,中国平安,2004-06-24
6,HK.01398,CNE1000003G1,ICBC,Financials,H Share,2.57,1000,工商银行,2006-10-27
7,HK.00941,HK0941009539,CHINA MOBILE,Telecommunications,Red Chip,2.28,500,中国移动,1997-10-23
8,HK.03968,CNE1000002M1,CM BANK,Financials,H Share,1.87,500,招商银行,2006-09-22
9,HK.00669,HK0669013440,TECHTRONIC IND,Consumer Discretionary,HK Ordinary,1.8,500,创科实业,1990-12-17
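
Listing date is a proxy for the actual requirement, which is price data covering 2012-01-01 to 2022-01-01. A direct check on a downloaded price history could look like the sketch below. Everything here is an assumption for illustration: `covers_window` is a hypothetical helper, the 7-day slack (to allow for weekends and holidays at the window edges) is an arbitrary choice, and the toy frames stand in for real price files. Only the `time_key` column name comes from the downloaded data.

```python
import pandas as pd

def covers_window(prices: pd.DataFrame, start: str, end: str,
                  date_col: str = "time_key", slack_days: int = 7) -> bool:
    """Return True if the price history spans [start, end], allowing a few
    days of slack at each edge for non-trading days."""
    dates = pd.to_datetime(prices[date_col])
    slack = pd.Timedelta(days=slack_days)
    starts_in_time = dates.min() <= pd.Timestamp(start) + slack
    ends_in_time = dates.max() >= pd.Timestamp(end) - slack
    return bool(starts_in_time and ends_in_time)

# Toy histories: one spans the research window, one starts too late
full_history = pd.DataFrame({"time_key": ["2012-01-03 00:00:00", "2021-12-31 00:00:00"]})
late_listing = pd.DataFrame({"time_key": ["2018-09-20 00:00:00", "2021-12-31 00:00:00"]})

print(covers_window(full_history, "2012-01-01", "2022-01-01"))
print(covers_window(late_listing, "2012-01-01", "2022-01-01"))
```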


# Data Formatting Before Prediction & Optimization

In [None]:
def format_date():
    """
    Clean each price file in place before prediction and optimization:
    rename 'time_key' to 'date', keep only the calendar date, and drop
    the leftover 'Unnamed: 0' index column.
    """
    import pandas as pd
    stock_list = pd.read_csv('../DataSource/research_use_39_stocks.csv')['Stock Code'].tolist()
    # Add the Hang Seng Index itself (HK.800000) to the stock_list
    stock_list.append('HK.800000')
    data_file_location_list = ['../DataSource/StockData/' + code + '.csv' for code in stock_list]

    for data_file in data_file_location_list:
        df = pd.read_csv(data_file)
        df = df.rename(columns={'time_key': 'date'})
        # Keep only the date part of the timestamp
        df['date'] = pd.to_datetime(df['date']).dt.date
        df = df.drop(columns=[col for col in df.columns if 'Unnamed: 0' in col])
        # Overwrite the original file with the cleaned data
        df.to_csv(data_file, index=False)

In [None]:
format_date()
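
Since `format_date` overwrites the files on disk, it can help to try the per-file transformation on an in-memory frame first. The following is a minimal sketch under assumptions: `normalise_price_frame` is a hypothetical helper, and the `close` column and its values are invented for illustration. The `time_key` and `Unnamed: 0` column names are the ones handled by `format_date`.

```python
import pandas as pd

def normalise_price_frame(raw: pd.DataFrame) -> pd.DataFrame:
    """Rename 'time_key' to 'date', keep only the calendar date, and drop
    leftover 'Unnamed' index columns. The input frame is not modified."""
    out = raw.rename(columns={"time_key": "date"})
    out["date"] = pd.to_datetime(out["date"]).dt.date
    return out.loc[:, ~out.columns.str.startswith("Unnamed")]

# Toy frame shaped like a downloaded price file (values are made up)
raw = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "time_key": ["2012-01-03 00:00:00", "2012-01-04 00:00:00"],
    "close": [10.0, 10.5],
})
clean = normalise_price_frame(raw)
print(clean)
```

Keeping the transformation separate from the file I/O makes it possible to check the result before any file is overwritten.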