Stock-Model/data_prepare at master · Lizn-zn/Stock-Model

README.md
data_discribtion.txt
data_label.py
data_prepare.py
data_preprocess.py
data_price.py
data_tushare.py
hist.py

README.md

Data Prepare

Put these two .py files under the GPSJ folder. data_prepare.py generates .csv files by stock as well as by feature. A .txt file that describes the features is also available. hist.py generates historical stock data, which is used to label the examples.

Work to Be Completed

1. Data missing

For some features, some stocks do not have a single record. For example, stock 000001.SZ has no ROIC data at all.
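
To get a feel for how widespread the problem is, the empty (stock, feature) pairs can be listed directly. The following is only a minimal sketch: it assumes the data has been loaded into a pandas DataFrame with one row per stock and date and one column per feature. The column names and values are made up for illustration and are not taken from the .csv files that data_prepare.py produces.

```python
import numpy as np
import pandas as pd

# Hypothetical table: one row per (stock, date), one column per feature.
df = pd.DataFrame({
    "stock": ["000001.SZ", "000001.SZ", "000002.SZ", "000002.SZ"],
    "date": ["2017-03-31", "2017-06-30", "2017-03-31", "2017-06-30"],
    "ROIC": [np.nan, np.nan, 0.08, np.nan],
    "ROE": [0.11, 0.12, np.nan, 0.09],
})

feature_cols = ["ROIC", "ROE"]

# Number of non-missing records per stock and feature.
counts = df.groupby("stock")[feature_cols].count()

# (stock, feature) pairs that do not have a single record.
empty_pairs = [
    (stock, feature)
    for stock in counts.index
    for feature in feature_cols
    if counts.loc[stock, feature] == 0
]
print(empty_pairs)
```

In this toy table only 000001.SZ lacks every ROIC record; 000002.SZ has gaps but still has at least one value per feature.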

In machine learning, there are four common ways to handle this problem:

a. Replace the missing data with the median, the mode, or some other value. This doesn't work well, because it adds noise to the data.

b. Build a prediction model that estimates the missing values from the other variables. There is a fundamental dilemma here: on the one hand, if the other variables are unrelated to the missing variable, the predictions are meaningless; on the other hand, if the predictions are very accurate, the variable is largely redundant and does not need to be added to the model.

c. Map the variable into a higher-dimensional space. For example, a gender variable with three cases (male, female, missing) is mapped to three variables: whether male, whether female, whether missing. Continuous variables can be handled the same way. The advantage is that all the information in the original data is retained, with no need to treat missing values specially or to consider linear inseparability. The disadvantage is that the amount of computation increases greatly.

d. Just delete the whole record, because we already have a lot of data.
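
Methods (a), (c) and (d) can each be written in a line or two of pandas. This is a rough sketch under stated assumptions, not code from this repository: the DataFrame, its column names and its values are invented for illustration, and filling with 0 in (c) is just one possible placeholder.

```python
import numpy as np
import pandas as pd

# Hypothetical feature table; values are illustrative only.
df = pd.DataFrame({
    "ROIC": [0.10, np.nan, 0.30, 0.20],
    "ROE": [0.05, 0.07, np.nan, 0.09],
})

# (a) Replace missing values with the column median.
filled = df.fillna(df.median())

# (c) Categorical case: gender with male / female / missing becomes
#     three 0/1 variables.
gender = pd.Series(["male", "female", None])
gender_vars = pd.DataFrame({
    "is_male": (gender == "male").astype(int),
    "is_female": (gender == "female").astype(int),
    "is_missing": gender.isna().astype(int),
})

# (c) Continuous case: add a 0/1 "missing" indicator per feature and put a
#     placeholder (0 here) in the gap, so the indicator carries the information.
indicators = df.isna().astype(int).add_suffix("_missing")
expanded = pd.concat([df.fillna(0.0), indicators], axis=1)

# (d) Drop every row that has at least one missing value.
dropped = df.dropna()
```

Note how (c) doubles the number of columns, which is the computational cost mentioned above, while (d) halves the number of rows in this small example.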

I think we will end up using method (d), but we may use (b) to restore some data, since we have many factors and won't put all of them into our model.
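
One way the combination of (b) and (d) could look is sketched below. It assumes a plain least-squares line as the prediction model and uses ROE as the predictor for ROIC; both choices, as well as the numbers, are assumptions made for illustration and do not come from this project.

```python
import numpy as np
import pandas as pd

# Hypothetical data: ROIC is missing in one row, ROE is complete.
df = pd.DataFrame({
    "ROE": [0.10, 0.20, 0.30, 0.40],
    "ROIC": [0.05, 0.10, np.nan, 0.20],
})

known = df["ROIC"].notna()

# (b) Fit ROIC ~ slope * ROE + intercept on the rows where ROIC is observed.
X = np.column_stack([df.loc[known, "ROE"], np.ones(known.sum())])
coef, *_ = np.linalg.lstsq(X, df.loc[known, "ROIC"].to_numpy(), rcond=None)
slope, intercept = coef

# Predict the missing ROIC values from ROE.
restored = df.copy()
restored.loc[~known, "ROIC"] = slope * df.loc[~known, "ROE"] + intercept

# (d) Anything that is still missing afterwards is dropped.
restored = restored.dropna()
```

Here ROIC is an exact linear function of ROE, so the restored value is perfect. That is the dilemma described under (b): a factor that can be recovered this well adds little to the model.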

2. Factor Analysis

Collinearity, etc.
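
A first pass at spotting collinear factors could be a pairwise correlation check. The sketch below is only one possible approach, not a method the project has settled on; the factor names, the values and the 0.95 threshold are all hypothetical.

```python
import pandas as pd

# Hypothetical factor table; f2 is almost an exact multiple of f1.
factors = pd.DataFrame({
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [2.1, 3.9, 6.0, 8.1, 9.9],
    "f3": [5.0, 1.0, 4.0, 2.0, 3.0],
})

# Absolute Pearson correlation between every pair of factors.
corr = factors.corr().abs()
threshold = 0.95  # illustrative cut-off, not a value from the project

# Keep each unordered pair once (a before b in column order).
cols = list(corr.columns)
collinear_pairs = [
    (a, b)
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if corr.loc[a, b] > threshold
]
print(collinear_pairs)
```

Pairwise correlation only catches collinearity between two factors at a time; a factor that is a combination of several others would need a different check.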

3. Other important data, which should be calculated in the future (no rush; there are 70 factors to work through)
