# Assignment 3: Crawl XueQiu (雪球) Investor Portfolio Rebalance History and Do Simple Analysis


* [雪球 (XueQiu)](https://xueqiu.com/) is one of the most popular stock information sharing forums in China.
* XueQiu hosts hundreds of thousands of portfolios (A-share stocks), and investors rebalance their portfolios from time to time.
* In this assignment, you are asked to crawl each portfolio's entire rebalance history (from the time the portfolio was created to its latest rebalance) and do some simple analysis.
* This is meaningful because each rebalance decision reflects the investor's personal experience and investment philosophy. The entire rebalance history of a successful portfolio (high return, low variance) can help us learn how to invest. We can also simply follow successful portfolios and invest accordingly, which likewise requires being able to track their rebalance history.
* You may have to register an account and log in first to see the entire rebalance history.
* There are 3 tasks in this assignment, introduced below:

**Hint:** There are similar tasks on the web that you can use as a reference.

## Task 1: Crawl XueQiu User's Rebalance History

* Each portfolio has an id. For example, the portfolio shown below has id `ZH010218`.
* You can use the URL template "https://xueqiu.com/portfolio_id" to access a portfolio's webpage. For example, the page of the portfolio below is "https://xueqiu.com/ZH010218".
* On a XueQiu user's page, you will see figures like the ones below, which record the user's rebalance history and the corresponding profit curve.

Time Series Profit              | Investment Portfolio  | 
:-------------------------:|:-------------------------:|
<img src="xueqiu1.jpeg" width = "300" height = "550"/>  |  <img src="xueqiu2.jpeg" width = "300" height = "550"/>  | 

* In this assignment, you are asked to crawl each portfolio's entire rebalance history.
* Specifically, you need to return a dictionary with the following structure:

>```Python
{
    portfolio1_id:
    {
        time_1: {'cash_value':val, 'position':{stock1_symbol:{'volume':val, 'price':val}, stock2_symbol:...}},
        time_2: {'cash_value':val, 'position':{stock3_symbol:{'volume':val, 'price':val}, stock4_symbol:...}},
        ...
        time_n: {...}
    },
    portfolio2_id:
    {
        ...
    }
}
```

**Explanation**

* **portfolio_id**: introduced above.
* **time**: the time when each rebalance is done. 
* **cash_value**: the portfolio's total value. It is made up of two parts, `cash` plus the market value of the stocks held, as in the following equation, where $n$ denotes the number of stocks the user holds at that rebalance.
\begin{equation}
cash\_value = cash + \sum\limits_{i=1}^{n} price_i \times volume_i
\end{equation}
* **stock_symbol**: each stock is associated with a symbol; for example, in the figure above, "民生银行" (China Minsheng Bank) has symbol "SH600016".
* **volume**: the number of shares of the stock in the investor's portfolio at the time of the rebalance.
* **price**: the stock's price at the time of the rebalance.
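
As a small illustration of the `cash_value` formula above, the sketch below computes it for one rebalance record. The function name `portfolio_cash_value`, the second stock symbol, and all numbers are illustrative assumptions, not part of the assignment.

```python
def portfolio_cash_value(cash, position):
    """Return cash plus the market value of every held stock.

    `position` maps stock_symbol -> {'volume': ..., 'price': ...},
    the same shape as the 'position' entry in the dictionary above.
    """
    return cash + sum(s['price'] * s['volume'] for s in position.values())


# Made-up numbers, for illustration only.
example_position = {
    'SH600016': {'volume': 100, 'price': 8.0},
    'SZ000001': {'volume': 50, 'price': 10.0},
}
example_total = portfolio_cash_value(200.0, example_position)  # 200 + 800 + 500
```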

## Task 2: 

* Return the id of the portfolio with the highest profit among the ids from "ZH010000" to "ZH020000".

## Task 3:

* Show the id of the portfolio with the highest return that satisfies the following two constraints: (1) its latest rebalance happened after May 1st, 2018, and (2) its rebalance history spans more than 2 years.
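
The two constraints can be expressed as a filter over the Task 1 dictionary. The following is a minimal sketch under stated assumptions: "return" is measured by the final `cash_value`, "2 years" is approximated as 730 days, and the function name `best_active_portfolio` and the ids `P1` to `P3` are hypothetical.

```python
from datetime import datetime, timedelta

TIME_FORMAT = '%Y-%m-%d %H:%M:%S'


def best_active_portfolio(data, cutoff=datetime(2018, 5, 1),
                          min_span=timedelta(days=730)):
    """Return the id with the largest final cash_value among portfolios whose
    latest rebalance is after `cutoff` and whose history spans more than
    `min_span`. Returns None when no portfolio qualifies."""
    def parse(t):
        return datetime.strptime(t, TIME_FORMAT)

    best_id, best_value = None, float('-inf')
    for portfolio_id, history in data.items():
        if not history:
            continue
        keys = sorted(history, key=parse)  # oldest rebalance first
        first, last = parse(keys[0]), parse(keys[-1])
        if last <= cutoff or last - first <= min_span:
            continue
        final_value = history[keys[-1]]['cash_value']
        if final_value > best_value:
            best_id, best_value = portfolio_id, final_value
    return best_id


# Made-up data: only P1 is both recent enough and long enough.
sample = {
    'P1': {'2015-01-05 10:00:00': {'cash_value': 1.0, 'position': {}},
           '2018-06-01 10:00:00': {'cash_value': 1.8, 'position': {}}},
    'P2': {'2018-04-01 10:00:00': {'cash_value': 1.0, 'position': {}},
           '2018-06-01 10:00:00': {'cash_value': 2.5, 'position': {}}},
    'P3': {'2014-01-05 10:00:00': {'cash_value': 1.0, 'position': {}},
           '2017-06-01 10:00:00': {'cash_value': 3.0, 'position': {}}},
}
winner = best_active_portfolio(sample)
```

Here `P2` is rejected because its history is too short and `P3` because its latest rebalance predates the cutoff.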

## Submission

* For Task 1, you need to extract all portfolios in the range from **ZH010000** (included) to **ZH020000** (excluded) and organize the data in the format of the dictionary above. Once done, save it as a pickle file named **portfolio.pkl** in the current directory.

* For Tasks 2 and 3, you can put your code below and directly **show the results in this notebook**.
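
The half-open id range described above maps directly onto Python's `range`. A minimal sketch, where the variable name `portfolio_ids` is an illustrative assumption:

```python
# ZH010000 is included and ZH020000 is excluded, matching range(10000, 20000).
portfolio_ids = ['ZH0%d' % n for n in range(10000, 20000)]
```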

## Put your source code below

### Task 1: Save the pickle file in the current directory.

In [6]:
import requests
from bs4 import BeautifulSoup
import json

In [7]:
import time
def timestamp_datetime(value):
    # Convert a millisecond Unix timestamp to a 'YYYY-MM-DD HH:MM:SS' string in local time.
    fmt = '%Y-%m-%d %H:%M:%S'
    local = time.localtime(value/1000)
    return time.strftime(fmt, local)
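
`time.localtime` uses the time zone of the machine running the notebook, so the same `created_at` value can produce different keys on different machines. A minimal sketch of a variant that pins the zone to UTC+8 (China Standard Time) follows; the function name `timestamp_to_cst` is a hypothetical choice, and it assumes `created_at` is a Unix timestamp in milliseconds, as the division by 1000 above suggests.

```python
from datetime import datetime, timedelta, timezone

CST = timezone(timedelta(hours=8))  # China Standard Time, fixed UTC+8 offset


def timestamp_to_cst(value_ms):
    """Convert a millisecond Unix timestamp to 'YYYY-MM-DD HH:MM:SS' in UTC+8."""
    dt = datetime.fromtimestamp(value_ms / 1000, tz=CST)
    return dt.strftime('%Y-%m-%d %H:%M:%S')
```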

In [8]:
headers={'Cookie':'s=fn12fl4mj2; device_id=0fb96ecfbdf1a34091da6def003fd5bd; bid=ec89bc263ff2b7b4a45b582a7633900a_jhbqy125; xqat=51549a82ad3c092a65b1354e9d945a34e68b873a; xq_token_expire=Tue%20Jun%2012%202018%2017%3A58%3A37%20GMT%2B0800%20(CST); aliyungf_tc=AQAAAGCzon9QsgYABZJ6ndcxwIutl5tV; remember=1; remember.sig=K4F3faYzmVuqC0iXIERCQf55g2Y; xq_a_token=a82b6ffff852a10c6407d73a5544cf41649d4e4c; xq_a_token.sig=AyIhFYxlX49XWpS5dhMGBYr0wbI; xq_r_token=97f539838b44ca7d8c7f1f7cbc32d85a88d4f578; xq_r_token.sig=1E1os5YuAWYn4Q36NuiM1xD50Ys; xq_is_login=1; xq_is_login.sig=J3LxgPVPUzbBg3Kee_PquUfih7Q; u=6172339619; u.sig=fBw14PVizEzsouaJ8zqGOdonkJQ; _ga=GA1.2.1209913552.1526631346; _gid=GA1.2.1709210208.1527234740; __utma=1.1209913552.1526631346.1526635342.1527234737.4; __utmb=1.2.10.1527234737; __utmc=1; __utmz=1.1526631346.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); Hm_lvt_1db88642e346389874251b5a1eded6e3=1526631346,1527234737; Hm_lpvt_1db88642e346389874251b5a1eded6e3=1527234770',"Accept":"text/html,application/xhtml+xml,application/xml;", "Accept-Encoding":"gzip", "Accept-Language":"zh-CN,zh;q=0.8", "Referer":"https://xueqiu.com/", "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36" } 

In [42]:
url = 'https://xueqiu.com/cubes/rebalancing/history.json?cube_symbol=ZH0{}&count=20&page={}'
Data = {}
# Only ZH010000..ZH010100 are crawled here; the full assignment range is range(10000, 20000).
for number in range(10000, 10101):
    portfolio_id = 'ZH0%d' % number
    Data[portfolio_id] = {}
    context = []
    stock_symbol = {}  # running position, carried over from one rebalance to the next
    # Only the first two pages (up to 40 records) are requested per portfolio.
    for page in range(1, 3):
        page_body = requests.get(url.format(number, page), headers=headers)
        jd = json.loads(page_body.text)
        try:
            context += jd['list']
        except (KeyError, TypeError):
            # Responses without a usable 'list' entry are skipped.
            pass
    context = context[::-1]  # reverse so that records are processed oldest first
    for n in context:
        if n['status'] != 'failed':
            when = timestamp_datetime(n['created_at'])
            cash_value = n['cash']
            for stock in n['rebalancing_histories']:
                if stock['volume'] is None or stock['volume'] == 0 or stock['price'] is None:
                    # The stock is no longer held: drop it from the running position.
                    stock_symbol.pop(stock['stock_name'], None)
                else:
                    cash_value += stock['volume'] * stock['price']
                    stock_symbol[stock['stock_name']] = {'volume': stock['volume'], 'price': stock['price']}
            Data[portfolio_id][when] = {'position': stock_symbol.copy(), 'cash_value': cash_value}
    # Drop portfolios for which no successful rebalance was found.
    if len(Data[portfolio_id]) == 0:
        del Data[portfolio_id]
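
The loop above requests a fixed two pages per portfolio. One way to collect the entire history is to keep requesting pages until the server returns an empty `list`. The sketch below shows that idea with the HTTP call factored out into a `fetch_page` callable so the logic can be exercised without network access. `fetch_all_records`, `fetch_page`, and the `max_pages` safety cap are hypothetical names, and the stop-on-empty-page behaviour is an assumption about the endpoint rather than documented behaviour.

```python
def fetch_all_records(fetch_page, max_pages=1000):
    """Collect records from page 1 onward until a page has no 'list' entries.

    `fetch_page(page)` must return the decoded JSON for that page as a dict.
    `max_pages` is a safety cap in case the endpoint never returns an empty page.
    """
    records = []
    for page in range(1, max_pages + 1):
        payload = fetch_page(page)
        batch = payload.get('list') or []
        if not batch:
            break
        records.extend(batch)
    return records


# Stand-in for the real endpoint: two pages of fake records, then an empty page.
fake_pages = {1: {'list': ['r1', 'r2']}, 2: {'list': ['r3']}, 3: {'list': []}}
all_records = fetch_all_records(lambda page: fake_pages.get(page, {}))
```

With `requests`, `fetch_page` could be something like `lambda page: requests.get(url.format(number, page), headers=headers).json()`, reusing the `url` and `headers` defined above.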

In [43]:
Data

{'ZH010000': {'2014-11-25 11:17:33': {'position': {'贵绳股份': {'volume': 0.01773648,
     'price': 11.84}},
   'cash_value': 79.2099999232},
  '2014-11-25 11:19:21': {'position': {'南京熊猫': {'volume': 0.04139806,
     'price': 12.08},
    '熊猫金控': {'volume': 0.02237533, 'price': 22.35}},
   'cash_value': 1.0001771903},
  '2014-11-25 11:20:41': {'position': {'南京熊猫': {'volume': 0.01241942,
     'price': 12.08},
    '熊猫金控': {'volume': 0.00671259, 'price': 22.35},
    '卧龙电气': {'volume': 0.01012325, 'price': 9.88},
    '卧龙地产': {'volume': 0.02016486, 'price': 4.96}},
   'cash_value': 50.50008839570001},
  '2014-12-01 13:14:11': {'position': {'南京熊猫': {'volume': 0.00851379,
     'price': 12.11},
    '熊猫金控': {'volume': 0.00379191, 'price': 27.19},
    '卧龙电气': {'volume': 0.01047785, 'price': 9.84},
    '卧龙地产': {'volume': 0.02112747, 'price': 4.88},
    '峨眉山A': {'volume': 0.00527646, 'price': 19.54}},
   'cash_value': 50.5155101558},
  '2015-04-29 14:14:00': {'position': {'南京熊猫': {'volume': 0.00706271,

In [44]:
import pickle

In [45]:
with open('portfolio.pkl','wb') as write_file:
    pickle.dump(Data,write_file)
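
To check that a saved file can be read back, a round trip like the following can be used. This is a minimal sketch: it writes a small made-up dictionary to a temporary directory instead of touching the real `portfolio.pkl`, and the sample id and values are illustrative.

```python
import os
import pickle
import tempfile

# Made-up data in the same shape as the Task 1 dictionary.
sample = {'ZH000000': {'2018-01-01 09:30:00': {'cash_value': 1.0, 'position': {}}}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'portfolio.pkl')
    with open(path, 'wb') as f:
        pickle.dump(sample, f)
    with open(path, 'rb') as f:
        restored = pickle.load(f)
```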

### Task 2: 

In [46]:
import pandas as pd
import numpy as np

In [47]:
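# Use each portfolio's most recent cash_value as its profit measure.
list1=[]  # portfolio ids
list2=[]  # cash_value at the latest rebalance
for name,n in Data.items():
    list1.append(name)
    list2.append(list(n.values())[-1]['cash_value'])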

In [48]:
df_price=pd.DataFrame(list2,index=list1,columns=['price'])
df_price.sort_values(by='price',ascending=False)[:1]

portfolio_id | price |
:-----------:|:---------:|
ZH010075 | 98.764756 |


### Task 3:

In [49]:
import datetime

In [50]:
def check_2018(day):
    date=datetime.datetime.strptime("2018-05-01 00:00:00","%Y-%m-%d %H:%M:%S")
    date2=datetime.datetime.strptime(day,"%Y-%m-%d %H:%M:%S")
    return date2>date
def tran_day(day):
    return datetime.datetime.strptime(day,"%Y-%m-%d %H:%M:%S")

In [51]:
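list1=[]  # portfolio ids
list2=[]  # cash_value at the latest rebalance
for name,n in Data.items():
    times=list(n.values())
    day=list(n.keys())
    # Keep portfolios whose latest rebalance is after 2018-05-01.
    if len(day)!=0 and check_2018(day[-1]):
        # NOTE: this threshold is 30 days, whereas the task statement asks for a history of more than 2 years.
        if tran_day(day[-1])-tran_day(day[0])>datetime.timedelta(days=30):
            list1.append(name)
            list2.append(times[-1]['cash_value'])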

In [52]:
df_price=pd.DataFrame(list2,index=list1,columns=['price'])
df_price.sort_values(by='price',ascending=False)[:1]

portfolio_id | price |
:-----------:|:---------:|
ZH010031 | 79.622023 |
