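# Monetary Policy News Aggregator
The goal of this course is to build an aggregator for the latest international monetary policy news.    
It collects the latest monetary policy news from central bank websites and publishes it as a single RSS feed.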

## Prerequisites
1. Python 3 basics: [https://www.python.org](https://www.python.org)
2. Python HTTP library: [requests](http://www.python-requests.org/en/master/)
3. Python HTML parsing library: [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/)
4. Python web framework: [flask](http://flask.pocoo.org/)
5. [SQLite](https://www.sqlite.org/index.html)
6. [SQLAlchemy](https://www.sqlalchemy.org/)
7. [feedparser](https://pythonhosted.org/feedparser/index.html)
8. PyRSS2Gen, imported in the last section to generate the RSS XML

## Data sources
1. [Federal Reserve](https://www.federalreserve.gov/feeds/press_all.xml) 
2. [European Central Bank](https://www.ecb.europa.eu/rss/press.xml)
3. [Bank of Japan](https://www.boj.or.jp/en/rss/whatsnew.xml)
4. Bank of England    
   4.1 [Events](https://www.bankofengland.co.uk/rss/events)    
   4.2 [News](https://www.bankofengland.co.uk/rss/news)
5. [Bank of Canada](https://www.bankofcanada.ca/content_type/press-releases/feed/)
6. [Reserve Bank of New Zealand](https://www.rbnz.govt.nz/feeds/news)
7. Reserve Bank of Australia    
   7.1 [Bulletin](https://www.rba.gov.au/rss/rss-cb-bulletin.xml)    
   7.2 [Monetary policy statements](https://www.rba.gov.au/rss/rss-cb-smp.xml)
   
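## Data storage
### Creating the database and the table
In each user's project directory, we create a sqlite3 database file and the table that will hold the news records: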

    ~/rss$ sqlite3 rss.db    
    SQLite version 3.23.1 2018-04-10 17:39:29     
    Enter ".help" for usage hints.    
    sqlite>create table news(    
        id integer primary key autoincrement,    
        source varchar(32) not null,    
        category varchar(128) not null,    
        title varchar(256) not null,      
        link varchar(256) unique not null,     
        description text not null,     
        pubDate datetime not null     
    );  
    
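- id: primary key, auto-incremented
- source: news source, string
- category: news category, string
- title: title, string
- link: link to the news item, string
- description: further details, text
- pubDate: time the news item was published, datetime

**Note: `link` has a unique constraint so that the same record is never saved twice.**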
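
To see what the unique constraint does in practice, the following minimal sketch recreates the `news` table in an in-memory database and tries to insert the same link twice. The in-memory database and the `insert_news` helper are assumptions made for illustration; they are not part of the course code.

```python
import sqlite3

# In-memory database so the sketch leaves no file behind (assumption for illustration)
conn = sqlite3.connect(':memory:')
conn.execute("""
    create table news(
        id integer primary key autoincrement,
        source varchar(32) not null,
        category varchar(128) not null,
        title varchar(256) not null,
        link varchar(256) unique not null,
        description text not null,
        pubDate datetime not null
    )
""")

def insert_news(row):
    """Hypothetical helper: insert one row, return False if the link already exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                'insert into news(source, category, title, link, description, pubDate) '
                'values (?, ?, ?, ?, ?, ?)', row)
        return True
    except sqlite3.IntegrityError:
        # Raised by SQLite when the unique constraint on link is violated
        return False

row = ('us', 'cate1', 'titletest', 'https://www.federalreserve.gov',
       'details', '2018-10-17 14:00:00')
first = insert_news(row)
second = insert_news(row)
count = conn.execute('select count(*) from news').fetchone()[0]
print(first, second, count)
```

The second insert is rejected by the database itself, so the fetching code does not need to check for duplicates before saving.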

### Inserting news records into the database with SQLAlchemy

In [1]:
from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, String, Text, DateTime
from datetime import datetime

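# Four slashes after "sqlite:" denote an absolute path. The convert_unicode flag
# was dropped here because recent SQLAlchemy releases no longer accept it.
engine = create_engine('sqlite:////home/jupyter-bcm/rss/rss.db')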
db_session = scoped_session(sessionmaker(autocommit=False,
                                         autoflush=False,
                                         bind=engine))
Base = declarative_base()
Base.query = db_session.query_property()

class News(Base):
    __tablename__ = 'news'
    
    id = Column(Integer, primary_key=True)
    source = Column(String(32), nullable=False)
    category = Column(String(128), nullable=False)
    title = Column(String(256), nullable=False)
    link = Column(String(256), unique=True, nullable=False)
    description = Column(Text(), nullable=False)
    pubDate = Column(DateTime(), nullable=False)
    
    def __init__(self, source=None, category=None, title=None, link=None, description=None, pubDate=None):
        self.source = source
        self.category = category
        self.title = title
        self.link = link
        self.description = description
        self.pubDate = pubDate
    
    def __repr__(self):
        return 'title:{}|link:{}'.format(self.title, self.link)
try:
    n = News('us', 'cate1', 'titletest', 'https://www.federalreserve.gov', 'details', datetime.now())
    db_session.add(n)
    db_session.commit()
except Exception as err:
    #print(err)
    db_session.rollback()
print('done')

done


## Fetching the data
### Initial environment setup

In [2]:
import random

user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.1.2 Safari/605.1.15',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.1 Safari/605.1.15',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/604.4.7 (KHTML, like Gecko) Version/11.0.2 Safari/604.4.7',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/601.7.8 (KHTML, like Gecko) Version/9.1.3 Safari/601.7.8',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11) AppleWebKit/601.1.56 (KHTML, like Gecko) Version/9.0 Safari/601.1.56'
]

def get_user_agent():
    # random.choice stays correct even if the list length changes
    return random.choice(user_agents)

print(get_user_agent())
print(get_user_agent())
print(get_user_agent())


Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.1.2 Safari/605.1.15
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.1.2 Safari/605.1.15
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36


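### Time format conversion
The source timestamps come from different time zones; here they are all parsed into `datetime` objects. (No time zone conversion is done here; interested students can implement it themselves.)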

In [3]:
from datetime import datetime

def get_timestamp(dtstring, dtformat):
    return datetime.strptime(dtstring, dtformat)

print(get_timestamp('Wed, 17 Oct 2018 14:00:00 GMT', '%a, %d %b %Y %H:%M:%S GMT'))
    

2018-10-17 14:00:00
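
As one possible way to add the time zone conversion that is left out above, the sketch below parses a timestamp together with its offset and converts it to Beijing time. The `to_beijing` helper and the `BEIJING` constant are hypothetical names, and treating a timestamp without an offset as UTC is an assumption that may not hold for every feed.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

BEIJING = timezone(timedelta(hours=8))  # fixed UTC+8 offset

def to_beijing(dtstring):
    """Hypothetical helper: parse a feed timestamp and convert it to Beijing time."""
    try:
        # ISO 8601 style, e.g. '2018-10-17T14:00:00+10:00'
        dt = datetime.fromisoformat(dtstring)
    except ValueError:
        # RFC 822 style, e.g. 'Wed, 17 Oct 2018 14:00:00 GMT'
        dt = parsedate_to_datetime(dtstring)
    if dt.tzinfo is None:
        # No offset in the string: assume UTC (an assumption, adjust per feed)
        dt = dt.replace(tzinfo=timezone.utc)
    # Drop tzinfo so the value matches the naive datetimes stored in the table
    return dt.astimezone(BEIJING).replace(tzinfo=None)

print(to_beijing('Wed, 17 Oct 2018 14:00:00 GMT'))
print(to_beijing('Wed, 17 Oct 2018 14:00:00 +0900'))
print(to_beijing('2018-10-17T14:00:00+10:00'))
```

Because the offset is read from the string itself, a single helper covers feeds that publish in GMT, +0900, or +10:00 without one format string per offset.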


### Fetching Federal Reserve data

In [4]:
import feedparser

feedparser.USER_AGENT = get_user_agent()
d = feedparser.parse('https://www.federalreserve.gov/feeds/press_all.xml')
for item in d.entries:
    try:
        n = News('federalreserve.gov', item.category, item.title, item.link, item.description, get_timestamp(item.published, '%a, %d %b %Y %H:%M:%S GMT'))
        db_session.add(n)
        db_session.commit()
    except Exception as err:
        #print(err)
        db_session.rollback()
        continue
print('done')

done


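### Fetching European Central Bank data
This feed returns no category or description fields, so they are filled in with a simple workaround:    
category is set to: other    
description is replaced by title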

In [5]:
import feedparser

feedparser.USER_AGENT = get_user_agent()
d = feedparser.parse('https://www.ecb.europa.eu/rss/press.xml')
for item in d.entries:
    try:
        n = News('ecb.europa.eu', 'other', item.title, item.link, item.title, get_timestamp(item.published, '%a, %d %b %Y'))
        db_session.add(n)
        db_session.commit()
    except Exception as err:
        #print(err)
        db_session.rollback()
        continue
print('done')

done
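
The fallback rule described for this feed can also be written once as a small function instead of being hard-coded per source. The sketch below is a hypothetical helper, `entry_fields`; it uses plain dicts in place of feed entries, on the assumption that feedparser entries support the same dictionary-style `.get` access.

```python
def entry_fields(entry, default_category='other'):
    """Hypothetical helper: return (category, description) with fallbacks."""
    title = entry.get('title', '')
    # Fall back to a fixed category when the feed provides none
    category = entry.get('category') or default_category
    # Fall back to the title when the feed provides no description
    description = entry.get('description') or title
    return category, description

# Plain dicts stand in for feed entries here
print(entry_fields({'title': 'Rate decision'}))
print(entry_fields({'title': 'Speech', 'category': 'Speeches', 'description': 'Full text'}))
```

With such a helper, feeds that do provide the fields keep their own values, and only the missing ones are replaced.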


### Fetching Bank of Japan data
This feed returns no category or description fields, so they are filled in with a simple workaround:    
category is set to: other    
description is replaced by title

In [6]:
import feedparser

feedparser.USER_AGENT = get_user_agent()
d = feedparser.parse('https://www.boj.or.jp/en/rss/whatsnew.xml')
for item in d.entries:
    try:
        n = News('boj.or.jp', 'other', item.title, item.link, item.title, get_timestamp(item.published, '%a, %d %b %Y %H:%M:%S +0900'))
        db_session.add(n)
        db_session.commit()
    except Exception as err:
        #print(err)
        db_session.rollback()
        continue
print('done')

done


### Fetching Bank of England data
This feed returns no category field, so it is filled in with a simple workaround:    
category is set to: other

In [7]:
import feedparser

feedparser.USER_AGENT = get_user_agent()

d = feedparser.parse('https://www.bankofengland.co.uk/rss/events')
for item in d.entries:
    try:
        if ' Z' in item.published: 
            n = News('bankofengland.co.uk', 'other', item.title, item.link, item.description, get_timestamp(item.published, '%a, %d %b %Y %H:%M:%S Z'))
        else:
            n = News('bankofengland.co.uk', 'other', item.title, item.link, item.description, get_timestamp(item.published, '%a, %d %b %Y %H:%M:%S +0100'))        
        db_session.add(n)
        db_session.commit()
    except Exception as err:
        #print(err)
        db_session.rollback()
        continue
    
d = feedparser.parse('https://www.bankofengland.co.uk/rss/news')
for item in d.entries:
    try:
        if ' Z' in item.published: 
            n = News('bankofengland.co.uk', 'other', item.title, item.link, item.description, get_timestamp(item.published, '%a, %d %b %Y %H:%M:%S Z'))
        else:
            n = News('bankofengland.co.uk', 'other', item.title, item.link, item.description, get_timestamp(item.published, '%a, %d %b %Y %H:%M:%S +0100'))
        db_session.add(n)
        db_session.commit()
    except Exception as err:
        #print(err)
        db_session.rollback()
        continue
print('done')

done


### Fetching Bank of Canada data

In [8]:
import feedparser

feedparser.USER_AGENT = get_user_agent()
d = feedparser.parse('https://www.bankofcanada.ca/content_type/press-releases/feed/')
for item in d.entries:
    try: 
        n = News('bankofcanada.ca', 'other', item.title, item.link, item.description, get_timestamp(item.updated, '%Y-%m-%dT%H:%M:%S+00:00'))
        db_session.add(n)
        db_session.commit()
    except Exception as err:
        #print(err)
        db_session.rollback()
        continue
print('done')

done


### Fetching Reserve Bank of New Zealand data

In [9]:
import feedparser

feedparser.USER_AGENT = get_user_agent()
d = feedparser.parse('https://www.rbnz.govt.nz/feeds/news')
for item in d.entries:
    try: 
        if '+1200' in item.published:
            n = News('rbnz.govt.nz', 'other', item.title, item.link, item.description, get_timestamp(item.published, '%a, %d %b %Y %H:%M:%S +1200'))
        else:
            n = News('rbnz.govt.nz', 'other', item.title, item.link, item.description, get_timestamp(item.published, '%a, %d %b %Y %H:%M:%S +1300'))
        db_session.add(n)
        db_session.commit()
    except Exception as err:
        #print(err)
        db_session.rollback()
        continue
print('done')

done


### Fetching Reserve Bank of Australia data

In [10]:
import feedparser

feedparser.USER_AGENT = get_user_agent()

d = feedparser.parse('https://www.rba.gov.au/rss/rss-cb-bulletin.xml')
for item in d.entries:
    try: 
        n = News('rba.gov.au', 'other', item.title, item.link, item.description, get_timestamp(item.updated, '%Y-%m-%dT%H:%M:%S+10:00'))
        db_session.add(n)
        db_session.commit()
    except Exception as err:
        #print(err)
        db_session.rollback()
        continue
    
d = feedparser.parse('https://www.rba.gov.au/rss/rss-cb-smp.xml')
for item in d.entries:
    try: 
        n = News('rba.gov.au', 'other', item.title, item.link, item.description, get_timestamp(item.updated, '%Y-%m-%dT%H:%M:%S+10:00'))
        db_session.add(n)
        db_session.commit()
    except Exception as err:
        #print(err)
        db_session.rollback()
        continue
print('done')

done
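
The fetching cells above repeat the same loop for every feed. One way to reduce the repetition is a table of feeds plus a single driver function, sketched below. The names `FEEDS`, `collect`, `fake_parse`, and `remember` are hypothetical. The parser and the save function are passed in as arguments so the sketch runs without network access or a database; in the notebook they would presumably be `feedparser.parse` and a function that builds and commits a `News` object.

```python
from types import SimpleNamespace

# (source, url) pairs taken from the cells above
FEEDS = [
    ('federalreserve.gov', 'https://www.federalreserve.gov/feeds/press_all.xml'),
    ('ecb.europa.eu', 'https://www.ecb.europa.eu/rss/press.xml'),
    ('boj.or.jp', 'https://www.boj.or.jp/en/rss/whatsnew.xml'),
    ('bankofengland.co.uk', 'https://www.bankofengland.co.uk/rss/events'),
    ('bankofengland.co.uk', 'https://www.bankofengland.co.uk/rss/news'),
    ('bankofcanada.ca', 'https://www.bankofcanada.ca/content_type/press-releases/feed/'),
    ('rbnz.govt.nz', 'https://www.rbnz.govt.nz/feeds/news'),
    ('rba.gov.au', 'https://www.rba.gov.au/rss/rss-cb-bulletin.xml'),
    ('rba.gov.au', 'https://www.rba.gov.au/rss/rss-cb-smp.xml'),
]

def collect(feeds, parse, save):
    """Hypothetical driver: parse every feed and hand each entry to save().

    parse(url) must return an object with an .entries list;
    save(source, entry) should return True when a new row was stored.
    """
    saved = 0
    for source, url in feeds:
        for entry in parse(url).entries:
            try:
                if save(source, entry):
                    saved += 1
            except Exception as err:
                # One bad entry should not stop the remaining feeds
                print('skipped entry from {}: {}'.format(source, err))
    return saved

# Stand-ins for demonstration: every feed yields the same link twice
def fake_parse(url):
    entry = {'title': 'demo', 'link': url + '#1'}
    return SimpleNamespace(entries=[entry, dict(entry)])

seen = set()

def remember(source, entry):
    # Mimics the unique constraint on link
    if entry['link'] in seen:
        return False
    seen.add(entry['link'])
    return True

total = collect(FEEDS, fake_parse, remember)
print(total)
```

Adding a new central bank then only requires a new row in `FEEDS`, provided the save function can handle that feed's date format and missing fields.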


## Generating a new aggregated RSS feed
We use flask to read the data from the database and generate the RSS feed.

In [None]:
# -*- coding: utf-8 -*-

from flask import Flask
from flask import request
from sqlalchemy import desc
import PyRSS2Gen
import datetime

app = Flask(__name__)

@app.route("/rss")
def rss():
    items = News.query.order_by(desc(News.pubDate)).limit(100)
    rss_items = []
    for item in items:
        ri = PyRSS2Gen.RSSItem(
            title = item.title,
            link = item.link,
            description = item.description,
            guid = PyRSS2Gen.Guid(item.link),
            pubDate = item.pubDate
        )
        rss_items.append(ri)
    
    rss = PyRSS2Gen.RSS2(
        title = 'Recent Monetary News', 
        link = request.url,
        description = 'Recent Monetary News',
        lastBuildDate = datetime.datetime.now(),
        items = rss_items)
    
    return rss.to_xml(), 200, {'Content-Type': 'application/xml'}

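# Every user's application instance runs on the same server, so students are advised to replace the last two digits of 9000 with the last two digits of their student ID, e.g. 9022
# If everything runs correctly, the RSS feed is then available at: http://lab.ftcourse.cn:9022/rss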
app.run(host='0.0.0.0', port=9000)

 * Serving Flask app "__main__" (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: off


 * Running on http://0.0.0.0:9000/ (Press CTRL+C to quit)
45.124.24.36 - - [22/Oct/2018 06:52:14] "GET /rss HTTP/1.1" 200 -


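## Possible improvements and suggestions
This application is incomplete in several places. Interested students can try improving it along the following lines:

1. Time zone conversion. As you can see, I made no deliberate effort to handle the time zones of dates coming from different countries. All timestamps could be converted to Beijing time before being stored.
2. Handling missing fields. Some feeds do not return the category and description fields defined by the RSS specification, and the current code only applies a simple workaround. You could try fetching the detail page behind each link, tokenizing its text, and extracting useful information to fill in the missing fields.
3. For item 2, interested students can also explore NLP (natural language processing), an area where Python is particularly strong.
4. The jupyter environment lacks scheduling and a way to interact with the system. If the code were rewritten as a standalone program, it could fetch each feed on a schedule and pick up newly published items.
5. For item 4, jupyter notebook files can currently be executed with the nbconvert command-line tool; interested students can consult the official jupyter documentation.

Two recommended sites:
- [nlp (NLTK)](https://www.nltk.org)
- [jupyter command tool (nbconvert)](https://nbconvert.readthedocs.io/en/latest/usage.html#)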
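
For the scheduling idea in item 4, a standalone script could wrap the fetching code in a simple loop. The sketch below is only one option, and a system scheduler such as cron may be more robust. The `run_periodically` function and its parameters are hypothetical, and the lambda stands in for the real fetching code.

```python
import time

def run_periodically(task, interval_seconds, max_runs=None):
    """Hypothetical scheduler: call task() every interval_seconds.

    max_runs=None loops forever; a number stops after that many runs.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        try:
            task()
        except Exception as err:
            # Keep the loop alive if one refresh fails
            print('task failed:', err)
        runs += 1
        if max_runs is None or runs < max_runs:
            time.sleep(interval_seconds)
    return runs

# Demonstration with a zero interval and three runs
calls = []
completed = run_periodically(lambda: calls.append(time.time()),
                             interval_seconds=0, max_runs=3)
print(completed, len(calls))
```

In real use the interval would be minutes or hours rather than zero, and `max_runs` would be left at `None`.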
