# Stock Data Preparation Demo

Today we will do a simple experiment on a stock prediction case. 

To save time preparing the data, I'd like to introduce **TuShare**. Feel free to use any other data source if you don't want to work with the Chinese stock market or you don't read Chinese (TuShare's documentation is entirely in Chinese).

In addition, you can have as many features as you want, as long as you keep the close price (the one we want to predict!) as the last column in your Pandas DataFrame.

**This is just a demo of how to use TuShare and handle its data. You can find a separate (and simple) script with instructions for fetching more detailed data.**
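To make the last-column convention concrete, here is a minimal sketch of how a later step might split such a DataFrame into features and target. The helper name `split_features_target` and the variables `features` and `target` are illustrative assumptions, not part of TuShare or of the separate script; the two sample rows are copied from the output shown in this demo.

```python
import pandas as pd


def split_features_target(frame):
    """Treat every column except the last as a feature and the last as the target."""
    features = frame.iloc[:, :-1]
    target = frame.iloc[:, -1]
    return features, target


# Two rows in the final layout used by this demo: 'close' is the last column.
sample = pd.DataFrame({
    'open': [20.92, 21.0],
    'high': [20.92, 21.15],
    'low': [20.6, 20.72],
    'volume': [21.850597, 26.910139],
    'close': [20.64, 20.94],
})

features, target = split_features_target(sample)
print(features.columns.tolist())
print(target.name)
```

Because the split is purely positional, any number of feature columns works as long as `close` stays last.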

In [1]:
import tushare as ts # TuShare is a utility for fetching historical data on Chinese stocks
import pandas as pd

print(ts.__version__)

0.7.4


In [2]:
stock_index = '000002' # You can enter the stock you are interested in
csv_name = 'stock-{}'.format(stock_index)
start_date = '2016-01-01'
end_date = None # None means today is used as the end date; specify one if you want

In [3]:
df = ts.get_h_data(stock_index, start=start_date, end=end_date, autype=None) # autype=None requests unadjusted prices
df.head()

[Getting data:]#####

date,open,high,close,low,volume,amount
2017-04-14,20.92,20.92,20.64,20.6,21850597.0,452007100.0
2017-04-13,21.0,21.15,20.94,20.72,26910139.0,562839600.0
2017-04-12,20.7,21.57,21.02,20.7,64585536.0,1363421000.0
2017-04-11,20.6,20.7,20.7,20.2,45886018.0,938204300.0
2017-04-10,20.72,20.75,20.6,20.51,27459940.0,565963400.0


In [4]:
df = df.reset_index(drop=True)
df.head()

,open,high,close,low,volume,amount
0,20.92,20.92,20.64,20.6,21850597.0,452007100.0
1,21.0,21.15,20.94,20.72,26910139.0,562839600.0
2,20.7,21.57,21.02,20.7,64585536.0,1363421000.0
3,20.6,20.7,20.7,20.2,45886018.0,938204300.0
4,20.72,20.75,20.6,20.51,27459940.0,565963400.0


## Change Column Order

We will keep 'close' as our last column.

In [5]:
col_list = df.columns.tolist()
col_list

['open', 'high', 'close', 'low', 'volume', 'amount']

In [6]:
col_list.remove('close')
col_list.remove('amount') # Dropped here only for simplicity; keep it in a real experiment
col_list.append('close')
col_list

['open', 'high', 'low', 'volume', 'close']

In [7]:
df = df[col_list]
df.head()

,open,high,low,volume,close
0,20.92,20.92,20.6,21850597.0,20.64
1,21.0,21.15,20.72,26910139.0,20.94
2,20.7,21.57,20.7,64585536.0,21.02
3,20.6,20.7,20.2,45886018.0,20.7
4,20.72,20.75,20.51,27459940.0,20.6
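The three cells above can also be folded into one small helper. The following is only a sketch: `move_column_to_end` and its `drop` parameter are hypothetical names introduced here for illustration, and the single sample row is copied from the output above.

```python
import pandas as pd


def move_column_to_end(frame, column, drop=()):
    """Return a copy of `frame` with `column` last and any `drop` columns removed."""
    kept = [name for name in frame.columns if name != column and name not in drop]
    # .copy() gives an independent DataFrame, so later column assignments
    # do not act on a slice of the original.
    return frame[kept + [column]].copy()


# One row in the column order returned by the fetch step above.
raw = pd.DataFrame({
    'open': [20.92],
    'high': [20.92],
    'close': [20.64],
    'low': [20.6],
    'volume': [21850597.0],
    'amount': [452007100.0],
})

reordered = move_column_to_end(raw, 'close', drop=('amount',))
print(reordered.columns.tolist())
```

Passing an empty `drop` keeps `amount` as a feature, which is what you would want outside of a simplified demo.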


In [8]:
df['volume'] = df['volume'] / 1000000 # Express volume in millions so it is closer in scale to the prices
df.head()

,open,high,low,volume,close
0,20.92,20.92,20.6,21.850597,20.64
1,21.0,21.15,20.72,26.910139,20.94
2,20.7,21.57,20.7,64.585536,21.02
3,20.6,20.7,20.2,45.886018,20.7
4,20.72,20.75,20.51,27.45994,20.6


## Save Data

In [9]:
df.to_csv(csv_name, index=False)

Let's double-check that the data was saved properly.

In [10]:
validate_df = pd.read_csv(csv_name)
validate_df.head()

,open,high,low,volume,close
0,20.92,20.92,20.6,21.850597,20.64
1,21.0,21.15,20.72,26.910139,20.94
2,20.7,21.57,20.7,64.585536,21.02
3,20.6,20.7,20.2,45.886018,20.7
4,20.72,20.75,20.51,27.45994,20.6
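Eyeballing `head()` is usually enough for a demo, but the same check can be automated. The sketch below is an assumption-laden illustration rather than part of the original workflow: it writes to an in-memory buffer instead of `csv_name` so it runs without touching the disk, and it uses two sample rows copied from the output above.

```python
import io

import pandas as pd
from pandas.testing import assert_frame_equal

# Two rows in the final layout saved by this demo.
original = pd.DataFrame({
    'open': [20.92, 21.0],
    'high': [20.92, 21.15],
    'low': [20.6, 20.72],
    'volume': [21.850597, 26.910139],
    'close': [20.64, 20.94],
})

# Write the CSV to memory, then read it back the same way as the cells above.
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# Raises AssertionError if columns, dtypes, or values differ; by default
# float values are compared with a small tolerance rather than exactly.
assert_frame_equal(original, restored)
print('Round trip OK')
```

To check the real file instead, replace the buffer with `csv_name` in both the `to_csv` and `read_csv` calls.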


**Great, it works!**