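The data we crawled earlier contains no historical records. What can we do?


Someone on GitHub has already crawled the data comprehensively; here is the [link](https://github.com/BlankerL/DXY-COVID-19-Data).

We will use their dataset for the analysis that follows.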

# Importing the data

In [1]:
from datetime import datetime

In [2]:
import pandas as pd
data = pd.read_csv('DXYArea.csv')  # read the data

Let's get to know this dataset.

In [3]:
data.head()

Which columns does it have?

In [4]:
data.columns

How many rows are there?

In [5]:
data.shape

In [6]:
data['updateTime'].unique()

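Clearly, the data was crawled several times a day, so there are redundant rows that we need to remove. To do that, we first have to bring the timestamps into a uniform format.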


First, a way to convert the timestamp format: keep only the first 10 characters, which hold the date.

In [7]:
data['updateTime'][0][0:10]

In [8]:
def changeTime(TimeStr):
    return TimeStr[0:10]
data['updateTime'] = data['updateTime'].apply(changeTime)
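
The slice-based helper above works as long as every timestamp begins with a `YYYY-MM-DD` date. As a minimal sketch of a stricter alternative, the string can be parsed with `datetime.strptime`, so a malformed value raises an error instead of being silently truncated. The function name `to_day`, the assumed timestamp layout, and the sample values are illustrative assumptions, not taken from `DXYArea.csv`.

```python
from datetime import datetime

def to_day(time_str):
    """Parse a 'YYYY-MM-DD HH:MM:SS[.fff]' string and return 'YYYY-MM-DD'."""
    # Assumed input layout; any fractional seconds are cut off before parsing.
    parsed = datetime.strptime(time_str[:19], '%Y-%m-%d %H:%M:%S')
    return parsed.strftime('%Y-%m-%d')

# Made-up sample values for illustration only
samples = ['2020-03-03 09:12:45', '2020-03-03 21:40:01.123']
days = [to_day(s) for s in samples]
print(days)
```

If this fits the real data, it could be applied the same way as `changeTime`, that is, through `data['updateTime'].apply(to_day)`.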

In [9]:
# Let's see what the data looks like after the change
data['updateTime'].unique()

As we can see, the data covers every day from January 22 through March 19.


Let's look at Wuhan (武汉) as an example.

In [10]:
data[(data['updateTime'] == '2020-03-03') & (data['cityName'] == '武汉')]

We find that the repeated crawls produced duplicate rows. The `drop_duplicates` method removes them; here it keeps one row per city per day.

In [11]:
data = data.drop_duplicates(['cityName', 'updateTime'])
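
To make the behaviour of `drop_duplicates` concrete, here is a small sketch on a toy frame. The city names and counts are made up. By default (`keep='first'`) pandas keeps the first row it encounters for each key, so which crawl survives depends on the row order of the file, which this sketch does not verify for `DXYArea.csv`.

```python
import pandas as pd

# Toy rows (made-up numbers): two crawls of city 'A' on the same day
toy = pd.DataFrame({
    'cityName': ['A', 'A', 'B'],
    'updateTime': ['2020-03-03', '2020-03-03', '2020-03-03'],
    'city_confirmedCount': [120, 100, 7],
})

# keep='first' is the default: the first row seen for each
# (cityName, updateTime) pair survives, later ones are dropped
deduped = toy.drop_duplicates(['cityName', 'updateTime'], keep='first')
print(deduped)
```

If the latest crawl of each day is the one that should be kept, sorting by the full timestamp before truncating it would be one way to control that.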

We can check that the data no longer contains duplicates.

In [12]:
data[data['cityName'] =='武汉' ]

In [13]:
len(data['cityName'])

# Visualizing the epidemic data

In [None]:
data_China = data[(data['provinceName'] != '中国') & (data['countryName'] == '中国')]
data_China

In [16]:
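# Sum the numeric columns per province and day; numeric_only=True skips text columns
data2 = data_China.groupby(['provinceName','updateTime']).sum(numeric_only=True).reset_index()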
data2.head()
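
The aggregation step can be illustrated on a toy frame: city-level counts are summed into one row per province and day. The province names, city names, and numbers below are placeholders, and the sketch selects only `city_confirmedCount` rather than summing every column.

```python
import pandas as pd

# Toy city-level rows (made-up values)
toy = pd.DataFrame({
    'provinceName': ['P1', 'P1', 'P2'],
    'cityName': ['a', 'b', 'c'],
    'updateTime': ['2020-02-01'] * 3,
    'city_confirmedCount': [10, 5, 3],
})

# One row per (province, day), with the city counts added together
province_totals = (
    toy.groupby(['provinceName', 'updateTime'])['city_confirmedCount']
       .sum()
       .reset_index()
)
print(province_totals)
```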

In [17]:
import plotly.express as px
fig = px.line(data2,
             x='updateTime',  # date on the x-axis
             y='city_confirmedCount',  # confirmed cases on the y-axis
             color='provinceName') # one colored line per province
fig.show()

# Visualizing geographic data

In [20]:
import json
with open('china.json', 'r', encoding='utf-8') as f:
    geofile = json.load(f)

Name_ID_Dict = {}
for item in geofile['features']:
    item['id'] = item['properties']['id']
    item['name'] = item['properties']['name']
    Name_ID_Dict[item['properties']['name']] = item['properties']['id']
Name_ID_Dict['澳门']    = 82
Name_ID_Dict['香港']    = 81
Name_ID_Dict['台湾']    = 71

In [21]:
New_ID = []
for i in range(len(data2)):
    New_ID.append(Name_ID_Dict[data2['provinceName'][i]])
data2['ID'] = New_ID
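
The lookup loop above raises a `KeyError` as soon as a province name in the data has no exact match in the GeoJSON names. As a hedged sketch, the unmatched names can be listed first so they can be added to the dictionary by hand. The miniature GeoJSON and the names `toy_geo`, `name_to_id`, and `province_names` are hypothetical stand-ins for `geofile`, `Name_ID_Dict`, and `data2['provinceName']`.

```python
# Hypothetical miniature of the GeoJSON structure assumed above
toy_geo = {'features': [
    {'properties': {'id': 11, 'name': 'Alpha'}},
    {'properties': {'id': 12, 'name': 'Beta'}},
]}

# Same name -> id mapping as Name_ID_Dict, built with a dict comprehension
name_to_id = {f['properties']['name']: f['properties']['id']
              for f in toy_geo['features']}

province_names = ['Alpha', 'Beta', 'Gamma']  # 'Gamma' has no geometry

# Names present in the data but absent from the GeoJSON
missing = sorted(set(province_names) - set(name_to_id))

# .get() returns None instead of raising KeyError for unmatched names
ids = [name_to_id.get(name) for name in province_names]
print(missing, ids)
```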

In [22]:
data3 = data2[data2['updateTime'] =='2020-02-23']
data3.head()

In [23]:
data2['provinceName'].unique()

In [24]:
import plotly.express as px
px.set_mapbox_access_token('pk.eyJ1IjoidG9uZ3hpbnJlbiIsImEiOiJjazZnM2phcXEwdTJ5M2pxcHQ3MDRteHNlIn0.ci2XKyZQRC_tAEcvxVIeAQ')
fig = px.choropleth_mapbox(data2, 
                           geojson=geofile, 
                           locations='ID', 
                           color='city_confirmedCount',
                           color_continuous_scale="Reds",
                           range_color=(0, 1000),
                           hover_name = 'provinceName',
                           animation_frame = 'updateTime',
                           #mapbox_style="carto-positron",
                           zoom=2.5, 
                           center = {"lat": 35, "lon": 110},
                           #opacity=0.5,
                           #labels={'unemp':'unemployment rate'}
                          )
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

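Because some days are missing for some provinces, below we provide a way to fill in the gaps so that the map can be displayed in full: each province is merged with a complete calendar of dates, and the missing days are forward-filled from the last known values.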

In [25]:
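frames = []  # one completed frame per province
# Full calendar of dates to cover; built once and reused for every province
helper = pd.DataFrame({'date': pd.date_range('2020-01-21','2020-03-19')})
for province in data2['provinceName'].unique():
    print(province)
    # .copy() avoids SettingWithCopyWarning when the 'date' column is added
    df = data2[data2['provinceName'] == province].copy()
    df['date'] = pd.to_datetime(df['updateTime'])
    # The outer merge adds a row for every calendar date this province lacks
    d = pd.merge(df,helper,how = 'outer',on='date')
    d = d.sort_values('date')
    d = d.ffill()  # carry the last known values forward into the gaps
    frames.append(d)
new_df = pd.concat(frames, ignore_index=True)
#new_df = new_df.dropna(how='any')
new_df['date'] = new_df['date'].astype(str)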
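
The gap-filling idea can be checked on a toy province with two reported days. The values below are made up. Note that the forward fill has nothing to carry into dates before the first report, so those leading rows stay empty.

```python
import pandas as pd

# Toy province with reports on two days only (made-up values)
toy = pd.DataFrame({
    'provinceName': ['P1', 'P1'],
    'updateTime': ['2020-01-22', '2020-01-25'],
    'city_confirmedCount': [4, 9],
})
toy['date'] = pd.to_datetime(toy['updateTime'])

# Complete calendar covering the period of interest
calendar = pd.DataFrame({'date': pd.date_range('2020-01-21', '2020-01-26')})

filled = (
    pd.merge(toy, calendar, how='outer', on='date')  # add the missing dates
      .sort_values('date')
      .ffill()                                       # copy last known values forward
      .reset_index(drop=True)
)
print(filled[['date', 'provinceName', 'city_confirmedCount']])
```

In this sketch, January 23 and 24 inherit the count from January 22, January 26 inherits the count from January 25, and January 21 remains `NaN`.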

In [28]:
import plotly.express as px
import chart_studio
chart_studio.tools.set_credentials_file(username='RENTONGXIN', api_key='RZbo4eiGrc0Hn1ZOWkFh')
import chart_studio.plotly as py
px.set_mapbox_access_token('pk.eyJ1IjoidG9uZ3hpbnJlbiIsImEiOiJjazZnM2phcXEwdTJ5M2pxcHQ3MDRteHNlIn0.ci2XKyZQRC_tAEcvxVIeAQ')
fig = px.choropleth_mapbox(new_df, 
                           geojson=geofile, 
                           locations='ID', 
                           color='city_confirmedCount',
                           color_continuous_scale="Reds",
                           range_color=(0, 1000),
                           hover_name = 'provinceName',
                           animation_frame = 'date',
                           #mapbox_style="carto-positron",
                           zoom=2.5, 
                           center = {"lat": 35, "lon": 110},
                           #opacity=0.5,
                           labels={'city_confirmedCount':'Confirmed cases'}
                          )
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
# py.plot(fig, filename = 'china_map', auto_open=False)