# **Data collection and cleaning**

#### *For our dataset, we chose an API that provides the latest data on the spread and fatality rates of the coronavirus as it appears globally. We look at how the virus spread and grew in China to see whether that trend has any utility in predicting the spread of COVID-19 in the United States.*

##### In this notebook, we take the COVID-19 data from an open-source API, look carefully at the time-series data, and clean it so that it is more readily interpretable.


---



Add the necessary imports and a link to the API site.

In [0]:
import pandas as pd
import numpy as np

import requests
import json
from pandas import json_normalize  # pandas >= 1.0; the pandas.io.json import path is deprecated
import time

# https://lab.isaaclin.cn/nCoV/en

Read all data from the API's /area endpoint, from the latest entries back to January 21st, the first date on which the virus was tracked in this database.

In [0]:
req = requests.get("https://lab.isaaclin.cn/nCoV/api/area?latest=0")
req_data = req.json()["results"]
df_coronavirus = json_normalize(req_data)
df_coronavirus.head()

Unnamed: 0,provinceName,currentConfirmedCount,confirmedCount,suspectedCount,curedCount,deadCount,comment,operator,locationId,countryShortCode,countryFullName,continentName,countryName,provinceShortName,continentEnglishName,countryEnglishName,provinceEnglishName,updateTime,statisticsData,cities,createTime,modifyTime,cityName
0,关岛,3.0,3,0.0,0,0,,chend,0,GU,Guam,北美洲,关岛,关岛,North America,Guam,Guam,1584487275191,,,,,
1,美属维尔京群岛,2.0,2,0.0,0,0,,chend,0,USVI,United States Virgin Islands,北美洲,美属维尔京群岛,美属维尔京群岛,North America,,,1584487275191,,,,,
2,以色列,332.0,337,0.0,5,0,,chend,955009,ISR,Israel,亚洲,以色列,以色列,Asia,Israel,Israel,1584484401599,https://file1.dxycdn.com/2020/0315/982/3402160...,,,,
3,斯洛伐克,97.0,97,0.0,0,0,,chend,963007,SVK,Slovakia,欧洲,斯洛伐克,斯洛伐克,Europe,Slovakia,Slovakia,1584483818068,https://file1.dxycdn.com/2020/0315/353/3402160...,,,,
4,冈比亚,1.0,1,0.0,0,0,,chend,982005,GMB,Gambia,非洲,冈比亚,冈比亚,Africa,Gambia,Gambia,1584483753903,,,,,
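The request above assumes that the API answers with a JSON body containing a `results` list. A minimal sketch of a more defensive parsing step might look like the following; the helper name `results_to_frame` and the sample payload are hypothetical, and the function covers only the parsing so that it can be tried without network access.

```python
import pandas as pd

def results_to_frame(payload):
    """Turn a decoded API response into a DataFrame, failing loudly if malformed."""
    results = payload.get("results")
    if not isinstance(results, list):
        raise ValueError("response has no 'results' list")
    return pd.json_normalize(results)

# Hypothetical payload shaped like one row of the /area response shown above.
sample_payload = {"results": [
    {"provinceName": "关岛", "countryEnglishName": "Guam",
     "confirmedCount": 3, "updateTime": 1584487275191},
]}
df_sample = results_to_frame(sample_payload)
print(df_sample.shape)
```

With a live request, calling `req.raise_for_status()` before `req.json()` would surface HTTP errors before any parsing is attempted.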


In [0]:
df_coronavirus.to_csv("coronavirus_raw.csv",encoding='utf-8-sig', index=False)

We need a way to draw meaning from the "updateTime" column. Right now it holds values like "1584297854611", a Unix timestamp in milliseconds, which is hard to interpret and hard to use as a variable for grouping or comparing entries. To solve this, we use Python's "datetime" library to turn the timestamp into a formatted date (dividing by 1000 converts milliseconds to the seconds that `fromtimestamp` expects). The results are stored in a new column called "updateTime_cleaned", shown below.

In [0]:
from datetime import datetime

cleaned = []
for entry in range(len(df_coronavirus)):
  cleaned.append(datetime.fromtimestamp(df_coronavirus['updateTime'][entry]/1000)
                  .strftime("%x"))
  
df_coronavirus['updateTime_cleaned'] = np.array(cleaned)
df_coronavirus

Unnamed: 0,provinceName,currentConfirmedCount,confirmedCount,suspectedCount,curedCount,deadCount,comment,operator,locationId,countryShortCode,countryFullName,continentName,countryName,provinceShortName,continentEnglishName,countryEnglishName,provinceEnglishName,updateTime,statisticsData,cities,createTime,modifyTime,cityName,updateTime_cleaned
0,关岛,3.0,3,0.0,0,0,,chend,0,GU,Guam,北美洲,关岛,关岛,North America,Guam,Guam,1584487275191,,,,,,03/17/20
1,美属维尔京群岛,2.0,2,0.0,0,0,,chend,0,USVI,United States Virgin Islands,北美洲,美属维尔京群岛,美属维尔京群岛,North America,,,1584487275191,,,,,,03/17/20
2,以色列,332.0,337,0.0,5,0,,chend,955009,ISR,Israel,亚洲,以色列,以色列,Asia,Israel,Israel,1584484401599,https://file1.dxycdn.com/2020/0315/982/3402160...,,,,,03/17/20
3,斯洛伐克,97.0,97,0.0,0,0,,chend,963007,SVK,Slovakia,欧洲,斯洛伐克,斯洛伐克,Europe,Slovakia,Slovakia,1584483818068,https://file1.dxycdn.com/2020/0315/353/3402160...,,,,,03/17/20
4,冈比亚,1.0,1,0.0,0,0,,chend,982005,GMB,Gambia,非洲,冈比亚,冈比亚,Africa,Gambia,Gambia,1584483753903,,,,,,03/17/20
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15965,辽宁省,,0,1.0,0,0,,zyyun,210000,,,亚洲,中国,辽宁,Asia,China,Liaoning,1579634890131,,,1.579626e+12,1.579626e+12,,01/21/20
15966,台湾,,1,0.0,0,0,,huanshi,710000,,,亚洲,中国,台湾,Asia,China,Taiwan,1579634890131,,,1.579617e+12,1.579617e+12,,01/21/20
15967,香港,,0,117.0,0,0,,huanshi,810000,,,亚洲,中国,香港,Asia,Hongkong,Hongkong,1579634890131,,,1.579617e+12,1.579617e+12,,01/21/20
15968,黑龙江省,,0,1.0,0,0,,huanshi,230000,,,亚洲,中国,黑龙江,Asia,China,Heilongjiang,1579634890131,,,1.579621e+12,1.579621e+12,,01/21/20
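The same conversion can also be written without an explicit loop. The sketch below is an alternative, not the method used in this notebook: it uses `pd.to_datetime` with `unit="ms"` on two timestamps taken from the table above. One difference to keep in mind is that this version interprets the timestamps in UTC, while `datetime.fromtimestamp` uses the local time zone of the machine.

```python
import pandas as pd

# Two millisecond timestamps taken from the table above.
ts = pd.Series([1584487275191, 1579634890131])

# unit="ms" tells pandas the integers are milliseconds since the Unix epoch.
dates = pd.to_datetime(ts, unit="ms")

# An explicit format avoids the locale dependence of "%x".
formatted = dates.dt.strftime("%m/%d/%y")
print(formatted.tolist())
```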


We also noted that for places outside China the cities field was either None or NaN, which would cause an error when parsing information about each city in China. We decided to fill the missing values with the placeholder string "[]" (standing in for an empty list) so that these rows can be recognized and skipped when parsing.

In [0]:
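# Assign the result back; calling fillna(inplace=True) on a selected column is unreliable in newer pandas
df_coronavirus['cities'] = df_coronavirus['cities'].fillna('[]')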
df_coronavirus[df_coronavirus['cities']=='[]']

Unnamed: 0,provinceName,currentConfirmedCount,confirmedCount,suspectedCount,curedCount,deadCount,comment,operator,locationId,countryShortCode,countryFullName,continentName,countryName,provinceShortName,continentEnglishName,countryEnglishName,provinceEnglishName,updateTime,statisticsData,cities,createTime,modifyTime,cityName,updateTime_cleaned
0,关岛,3.0,3,0.0,0,0,,chend,0,GU,Guam,北美洲,关岛,关岛,North America,Guam,Guam,1584487275191,,[],,,,03/17/20
1,美属维尔京群岛,2.0,2,0.0,0,0,,chend,0,USVI,United States Virgin Islands,北美洲,美属维尔京群岛,美属维尔京群岛,North America,,,1584487275191,,[],,,,03/17/20
2,以色列,332.0,337,0.0,5,0,,chend,955009,ISR,Israel,亚洲,以色列,以色列,Asia,Israel,Israel,1584484401599,https://file1.dxycdn.com/2020/0315/982/3402160...,[],,,,03/17/20
3,斯洛伐克,97.0,97,0.0,0,0,,chend,963007,SVK,Slovakia,欧洲,斯洛伐克,斯洛伐克,Europe,Slovakia,Slovakia,1584483818068,https://file1.dxycdn.com/2020/0315/353/3402160...,[],,,,03/17/20
4,冈比亚,1.0,1,0.0,0,0,,chend,982005,GMB,Gambia,非洲,冈比亚,冈比亚,Africa,Gambia,Gambia,1584483753903,,[],,,,03/17/20
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15965,辽宁省,,0,1.0,0,0,,zyyun,210000,,,亚洲,中国,辽宁,Asia,China,Liaoning,1579634890131,,[],1.579626e+12,1.579626e+12,,01/21/20
15966,台湾,,1,0.0,0,0,,huanshi,710000,,,亚洲,中国,台湾,Asia,China,Taiwan,1579634890131,,[],1.579617e+12,1.579617e+12,,01/21/20
15967,香港,,0,117.0,0,0,,huanshi,810000,,,亚洲,中国,香港,Asia,Hongkong,Hongkong,1579634890131,,[],1.579617e+12,1.579617e+12,,01/21/20
15968,黑龙江省,,0,1.0,0,0,,huanshi,230000,,,亚洲,中国,黑龙江,Asia,China,Heilongjiang,1579634890131,,[],1.579621e+12,1.579621e+12,,01/21/20
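Filling with the string "[]" leaves a column that mixes real lists with strings. If a uniform column is preferred, one option is to substitute actual empty lists. The following is a small sketch on made-up rows (the frame `toy` and the column `n_cities` are illustrative names, not part of the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "provinceEnglishName": ["Liaoning", "Guam"],
    "cities": [[{"cityEnglishName": "Dandong"}], None],
})

# Replace anything that is not a list (None, NaN) with a real empty list.
toy["cities"] = toy["cities"].apply(lambda v: v if isinstance(v, list) else [])

# Every entry now supports len(), so counting cities needs no special case.
toy["n_cities"] = toy["cities"].apply(len)
print(toy[["provinceEnglishName", "n_cities"]])
```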


We see here that there are about 100 entries for which the API provides no English country name.

In [0]:
df_coronavirus[df_coronavirus['countryEnglishName'].isna()]

Unnamed: 0,provinceName,currentConfirmedCount,confirmedCount,suspectedCount,curedCount,deadCount,comment,operator,locationId,countryShortCode,countryFullName,continentName,countryName,provinceShortName,continentEnglishName,countryEnglishName,provinceEnglishName,updateTime,statisticsData,cities,createTime,modifyTime,cityName,updateTime_cleaned
1,美属维尔京群岛,2.0,2,0.0,0,0,,chend,0,USVI,United States Virgin Islands,北美洲,美属维尔京群岛,美属维尔京群岛,North America,,,1584487275191,,[],,,,03/17/20
13,黑山,2.0,2,0.0,0,0,,chend,965018,MNE,Montenegro,欧洲,黑山,黑山,Europe,,,1584477329205,,[],,,,03/17/20
180,瓜德罗普岛,6.0,6,0.0,0,0,,yuyb,0,GLP,Guadeloupe,北美洲,瓜德罗普岛,瓜德罗普岛,North America,,,1584461151471,https://file1.dxycdn.com/2020/0317/354/3402535...,[],,,,03/17/20
187,圣巴泰勒米岛,3.0,3,0.0,0,0,,yuyb,0,BL,Saint Barthelemy,北美洲,圣巴泰勒米岛,圣巴泰勒米岛,North America,,,1584461151471,https://file1.dxycdn.com/2020/0317/743/3402535...,[],,,,03/17/20
190,库拉索岛,2.0,2,0.0,0,0,,yuyb,0,CW,Curaçao,北美洲,库拉索岛,库拉索岛,North America,,,1584461151471,https://file1.dxycdn.com/2020/0317/893/3402535...,[],,,,03/17/20
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2186,刚果（布）,1.0,1,0.0,0,0,,shenjia,0,COG,Congo,非洲,刚果（布）,刚果（布）,Africa,,,1584341625409,,[],,,,03/16/20
2332,刚果（布）,1.0,1,0.0,0,0,,yaoyanbo,0,COG,Congo,非洲,刚果（布）,刚果（布）,Africa,,,1584340146703,,[],,,,03/16/20
2482,刚果（布）,1.0,1,0.0,0,0,,wz.ebd,0,COG,Congo,非洲,刚果（布）,刚果（布）,Africa,,,1584336552358,,[],,,,03/16/20
2497,刚果（布）,1.0,1,0.0,0,0,,wangbingbing,0,COG,Congo,非洲,刚果（布）,刚果（布）,Africa,,,1584331156204,,[],,,,03/16/20


These countries are not translated into English, so we drop them to keep the dataset unambiguous.

In [0]:
df_coronavirus_filtered = df_coronavirus[
    df_coronavirus['countryEnglishName'].notna()]
df_coronavirus_filtered

Unnamed: 0,provinceName,currentConfirmedCount,confirmedCount,suspectedCount,curedCount,deadCount,comment,operator,locationId,countryShortCode,countryFullName,continentName,countryName,provinceShortName,continentEnglishName,countryEnglishName,provinceEnglishName,updateTime,statisticsData,cities,createTime,modifyTime,cityName,updateTime_cleaned
0,关岛,3.0,3,0.0,0,0,,chend,0,GU,Guam,北美洲,关岛,关岛,North America,Guam,Guam,1584487275191,,[],,,,03/17/20
2,以色列,332.0,337,0.0,5,0,,chend,955009,ISR,Israel,亚洲,以色列,以色列,Asia,Israel,Israel,1584484401599,https://file1.dxycdn.com/2020/0315/982/3402160...,[],,,,03/17/20
3,斯洛伐克,97.0,97,0.0,0,0,,chend,963007,SVK,Slovakia,欧洲,斯洛伐克,斯洛伐克,Europe,Slovakia,Slovakia,1584483818068,https://file1.dxycdn.com/2020/0315/353/3402160...,[],,,,03/17/20
4,冈比亚,1.0,1,0.0,0,0,,chend,982005,GMB,Gambia,非洲,冈比亚,冈比亚,Africa,Gambia,Gambia,1584483753903,,[],,,,03/17/20
5,阿曼,21.0,33,0.0,12,0,,chend,955013,OMN,Oman,亚洲,阿曼,阿曼,Asia,Oman,Oman,1584483497693,https://file1.dxycdn.com/2020/0315/945/3402160...,[],,,,03/17/20
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15965,辽宁省,,0,1.0,0,0,,zyyun,210000,,,亚洲,中国,辽宁,Asia,China,Liaoning,1579634890131,,[],1.579626e+12,1.579626e+12,,01/21/20
15966,台湾,,1,0.0,0,0,,huanshi,710000,,,亚洲,中国,台湾,Asia,China,Taiwan,1579634890131,,[],1.579617e+12,1.579617e+12,,01/21/20
15967,香港,,0,117.0,0,0,,huanshi,810000,,,亚洲,中国,香港,Asia,Hongkong,Hongkong,1579634890131,,[],1.579617e+12,1.579617e+12,,01/21/20
15968,黑龙江省,,0,1.0,0,0,,huanshi,230000,,,亚洲,中国,黑龙江,Asia,China,Heilongjiang,1579634890131,,[],1.579621e+12,1.579621e+12,,01/21/20
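In the historical table, several of the rows without a "countryEnglishName" (for example Montenegro) still carry a "countryFullName". As an alternative to dropping them, the name could be backfilled from that column. This is only a sketch on made-up rows, and it assumes "countryFullName" is present, which is not the case for every endpoint:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "countryFullName": ["Montenegro", "Israel", None],
    "countryEnglishName": [np.nan, "Israel", "China"],
})

# Use countryFullName only where countryEnglishName is missing.
toy["countryEnglishName"] = toy["countryEnglishName"].fillna(toy["countryFullName"])
print(toy["countryEnglishName"].tolist())
```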


Next, we build a DataFrame that tracks the spread of the virus through the United States from January 21st to the present.

In [0]:
df_US = df_coronavirus_filtered[
    df_coronavirus_filtered["countryEnglishName"] == "United States of America"]
df_US.head()

Unnamed: 0,provinceName,currentConfirmedCount,confirmedCount,suspectedCount,curedCount,deadCount,comment,operator,locationId,countryShortCode,countryFullName,continentName,countryName,provinceShortName,continentEnglishName,countryEnglishName,provinceEnglishName,updateTime,statisticsData,cities,createTime,modifyTime,cityName,updateTime_cleaned
23,美国,5538.0,5709,0.0,74,97,,chend,971002,USA,United States of America,北美洲,美国,美国,North America,United States of America,United States of America,1584476307899,https://file1.dxycdn.com/2020/0315/553/3402160...,[],,,,03/17/20
92,美国,5538.0,5709,0.0,74,97,,yuyb,971002,USA,United States of America,北美洲,美国,美国,North America,United States of America,United States of America,1584470183217,https://file1.dxycdn.com/2020/0315/553/3402160...,[],,,,03/17/20
99,美国,4994.0,5139,0.0,48,97,,yuyb,971002,USA,United States of America,北美洲,美国,美国,North America,United States of America,United States of America,1584468710736,https://file1.dxycdn.com/2020/0315/553/3402160...,[],,,,03/17/20
118,美国,4865.0,5010,0.0,48,97,,yuyb,971002,USA,United States of America,北美洲,美国,美国,North America,United States of America,United States of America,1584461151471,https://file1.dxycdn.com/2020/0315/553/3402160...,[],,,,03/17/20
210,美国,4523.0,4661,0.0,48,90,,wz.ebd,971002,USA,United States of America,北美洲,美国,美国,North America,United States of America,United States of America,1584456865525,https://file1.dxycdn.com/2020/0315/553/3402160...,[],,,,03/17/20
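The US table contains several updates per day, so a daily time series needs one row per date. A possible approach, sketched here on a toy frame, is to keep the last update of each day. The first two rows reuse values from the table above; the third row is invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "updateTime": [1584476307899, 1584456865525, 1584370000000],
    "confirmedCount": [5709, 4661, 4000],  # last value is made up
})

# Derive a calendar date (UTC) from the millisecond timestamp.
toy["date"] = pd.to_datetime(toy["updateTime"], unit="ms").dt.date

# Sort chronologically, then keep the final update within each date.
daily = (toy.sort_values("updateTime")
            .groupby("date", as_index=False)
            .last())
print(daily[["date", "confirmedCount"]])
```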


In [0]:
df_US.to_csv("US_coronavirus.csv",encoding='utf-8-sig', index=False)

We now request only the latest data on the coronavirus for every country.

In [0]:
req = requests.get("https://lab.isaaclin.cn/nCoV/api/area?latest=1")
req_data = req.json()["results"]
df_coronavirus_latest = json_normalize(req_data)
df_coronavirus_latest.head()

Unnamed: 0,locationId,continentName,continentEnglishName,countryName,countryEnglishName,provinceName,provinceShortName,provinceEnglishName,currentConfirmedCount,confirmedCount,suspectedCount,curedCount,deadCount,cities,comment,updateTime
0,0,北美洲,North America,关岛,Guam,关岛,关岛,Guam,3,3,0,0,0,,,1584487275191
1,0,北美洲,North America,美属维尔京群岛,,美属维尔京群岛,美属维尔京群岛,,2,2,0,0,0,,,1584487275191
2,955009,亚洲,Asia,以色列,Israel,以色列,以色列,Israel,332,337,0,5,0,,,1584484401599
3,963007,欧洲,Europe,斯洛伐克,Slovakia,斯洛伐克,斯洛伐克,Slovakia,97,97,0,0,0,,,1584483818068
4,982005,非洲,Africa,冈比亚,Gambia,冈比亚,冈比亚,Gambia,1,1,0,0,0,,,1584483753903


In [0]:
df_coronavirus_latest.to_csv("coronavirus_latest_raw.csv",encoding='utf-8-sig',
                             index=False)

In [0]:
df_coronavirus_latest['cities'] = df_coronavirus_latest['cities'].fillna('[]')
df_coronavirus_latest[df_coronavirus_latest['cities']=='[]'].head()
df_coronavirus_latest_filtered = df_coronavirus_latest[
    df_coronavirus_latest['countryEnglishName'].notna()].copy()
df_coronavirus_latest_filtered.head()

Unnamed: 0,locationId,continentName,continentEnglishName,countryName,countryEnglishName,provinceName,provinceShortName,provinceEnglishName,currentConfirmedCount,confirmedCount,suspectedCount,curedCount,deadCount,cities,comment,updateTime
0,0,北美洲,North America,关岛,Guam,关岛,关岛,Guam,3,3,0,0,0,[],,1584487275191
2,955009,亚洲,Asia,以色列,Israel,以色列,以色列,Israel,332,337,0,5,0,[],,1584484401599
3,963007,欧洲,Europe,斯洛伐克,Slovakia,斯洛伐克,斯洛伐克,Slovakia,97,97,0,0,0,[],,1584483818068
4,982005,非洲,Africa,冈比亚,Gambia,冈比亚,冈比亚,Gambia,1,1,0,0,0,[],,1584483753903
5,955013,亚洲,Asia,阿曼,Oman,阿曼,阿曼,Oman,21,33,0,12,0,[],,1584483497693


Here we drop the row that holds the overall counts for China, because it would be double counted when we group by country name and sum the columns to get the numbers for every country.

In [0]:
china = df_coronavirus_latest_filtered[
                    df_coronavirus_latest_filtered['locationId']==951001].index
df_coronavirus_latest_filtered.drop(china, inplace=True)

In [0]:
count_columns = ["currentConfirmedCount",
                 "confirmedCount",
                 "suspectedCount",
                 "curedCount",
                 "deadCount"]
# Select the count columns before summing so that text columns are never
# summed; GroupBy.sum() does not take an axis argument.
df_by_countries_filtered = df_coronavirus_latest_filtered.groupby(
                               'countryEnglishName')[count_columns].sum()
df_by_countries_filtered.head()

countryEnglishName,currentConfirmedCount,confirmedCount,suspectedCount,curedCount,deadCount
Afghanistan,22,22,0,0,0
Albania,54,55,0,0,1
Algeria,45,60,0,10,5
Andorra,14,15,0,1,0
Antigua and Barbuda,1,1,0,0,0
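To illustrate why the country-level row has to be removed before grouping, here is a toy example with invented numbers. It assumes, for this toy data only, that the country-level row is the one whose province name equals the country name (the notebook itself identifies the row by its `locationId`).

```python
import pandas as pd

toy = pd.DataFrame({
    "countryEnglishName": ["China", "China", "China", "Israel"],
    "provinceEnglishName": ["China", "Hubei", "Liaoning", "Israel"],
    "confirmedCount": [110, 100, 10, 5],  # made-up counts
})
cols = ["confirmedCount"]

# Summing everything counts China twice: once per province, once as a total.
with_total = toy.groupby("countryEnglishName")[cols].sum()

# Drop the country-level China row, then sum the provinces only.
is_china_total = ((toy["countryEnglishName"] == "China")
                  & (toy["provinceEnglishName"] == "China"))
without_total = toy[~is_china_total].groupby("countryEnglishName")[cols].sum()

print(with_total.loc["China", "confirmedCount"],
      without_total.loc["China", "confirmedCount"])
```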


In [0]:
df_by_countries_filtered.to_csv("covid_by_country_filtered.csv",
                                encoding='utf-8-sig',
                                index=True)

In [0]:
df_by_countries_filtered.filter(like="China",axis=0)

countryEnglishName,currentConfirmedCount,confirmedCount,suspectedCount,curedCount,deadCount
China,9084,81135,0,68820,3231


Here we get the counts for each city within each Chinese province so that we can graph the spread of the coronavirus across cities in China.

In [0]:
req_data_cities = []
province = []
country = []
# req_data still holds the parsed "results" list from the latest=1 request,
# so the response is not re-parsed on every iteration.
for i, entry in enumerate(req_data):
  cities = entry.get("cities") or []  # None or missing: no city breakdown
  req_data_cities.extend(cities)
  province.extend(len(cities)*\
                  [df_coronavirus_latest.iloc[i]['provinceEnglishName']])
  country.extend(len(cities)*\
                  [df_coronavirus_latest.iloc[i]['countryEnglishName']])
df_cities = json_normalize(req_data_cities)
df_cities['provinceName'] = np.array(province)
df_cities['countryName'] = np.array(country)
df_cities.head()

Unnamed: 0,cityName,currentConfirmedCount,confirmedCount,suspectedCount,curedCount,deadCount,locationId,cityEnglishName,provinceName,countryName
0,丹东,2,11,0,9,0,210600,Dandong,Liaoning,China
1,沈阳,1,28,0,27,0,210100,Shenyang,Liaoning,China
2,朝阳,1,6,0,5,0,211300,Chaoyang,Liaoning,China
3,大连,0,19,0,19,0,210200,Dalian,Liaoning,China
4,锦州,0,12,0,12,0,210700,Jinzhou,Liaoning,China


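We drop the rows that do not have a cityEnglishName. For example, "境外输入人员" (imported cases from overseas) is not a city within Yunnan province; it is a category for cases arriving from abroad, so it cannot be placed on a city map. Note that `dropna()` removes rows with a missing value in any column, not only cityEnglishName.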

In [0]:
df_cities_filtered = df_cities.dropna().copy()
df_cities_filtered.head()

Unnamed: 0,cityName,currentConfirmedCount,confirmedCount,suspectedCount,curedCount,deadCount,locationId,cityEnglishName,provinceName,countryName
0,丹东,2,11,0,9,0,210600,Dandong,Liaoning,China
1,沈阳,1,28,0,27,0,210100,Shenyang,Liaoning,China
2,朝阳,1,6,0,5,0,211300,Chaoyang,Liaoning,China
3,大连,0,19,0,19,0,210200,Dalian,Liaoning,China
4,锦州,0,12,0,12,0,210700,Jinzhou,Liaoning,China
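If the intent is to drop rows only when the English city name is missing, `dropna` can be restricted to that column with `subset`. The sketch below contrasts the two behaviors on a toy frame; the always-empty `comment` column is hypothetical and only there to show the difference.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "cityName": ["丹东", "境外输入人员", "沈阳"],
    "cityEnglishName": ["Dandong", np.nan, "Shenyang"],
    "comment": [np.nan, np.nan, np.nan],  # hypothetical, always empty
})

# Plain dropna() removes every row here, because "comment" is always missing.
all_columns = toy.dropna()

# Restricting to cityEnglishName removes only the non-city row.
name_only = toy.dropna(subset=["cityEnglishName"])
print(len(all_columns), len(name_only))
```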


One remaining gap in our dataset is that it lacks the latitudes and longitudes of the cities, so we use another API, OpenCage, which returns coordinates for a given city and country name.

In [0]:
!pip install opencage
from opencage.geocoder import OpenCageGeocode
apikey = "82f1aabf1bd249488b9394ed2ab1dd8f"

geocoder = OpenCageGeocode(apikey)



In [0]:
lat = []
long = []

for city, country in zip(df_cities_filtered['cityEnglishName'],
                         df_cities_filtered['countryName']):
  query = city + ", " + country
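  results = geocoder.geocode(query)
  if results:  # an unknown place yields no results; avoid indexing into nothing
    lat.append(results[0]['geometry']['lat'])
    long.append(results[0]['geometry']['lng'])
  else:
    lat.append(np.nan)
    long.append(np.nan)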

In [0]:
df_cities_filtered['lat'] = np.array(lat)
df_cities_filtered['long'] = np.array(long)
df_cities_filtered

Unnamed: 0,cityName,currentConfirmedCount,confirmedCount,suspectedCount,curedCount,deadCount,locationId,cityEnglishName,provinceName,countryName,lat,long
0,丹东,2,11,0,9,0,210600,Dandong,Liaoning,China,40.128668,124.386340
1,沈阳,1,28,0,27,0,210100,Shenyang,Liaoning,China,41.804109,123.427636
2,朝阳,1,6,0,5,0,211300,Chaoyang,Liaoning,China,41.575477,120.439074
3,大连,0,19,0,19,0,210200,Dalian,Liaoning,China,38.918171,121.628294
4,锦州,0,12,0,12,0,210700,Jinzhou,Liaoning,China,41.108528,121.119422
...,...,...,...,...,...,...,...,...,...,...,...,...
425,三明,0,14,0,14,0,350400,Sanming,Fujian,China,26.236795,117.603849
426,龙岩,0,6,0,6,0,350800,Longyan,Fujian,China,25.097439,117.015116
427,西宁,0,15,0,15,0,630100,Xining,Qinghai,China,36.622532,101.772196
428,海北州,0,3,0,3,0,632200,Haibei,Qinghai,China,37.719189,100.456559
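Because each lookup is a network call against a rate-limited service, it can help to reuse results for repeated queries and to tolerate places the geocoder does not recognize. The sketch below shows one way to do this. The names `geocode_pairs` and `fake_geocode` are hypothetical; `fake_geocode` stands in for `geocoder.geocode` so the example runs without an API key, and it assumes the result shape used above (a list of dicts with a `geometry` entry).

```python
import math

def geocode_pairs(pairs, geocode_fn):
    """Return a (lat, lng) tuple per (city, country) pair, caching repeated queries."""
    cache = {}
    coords = []
    for city, country in pairs:
        query = f"{city}, {country}"
        if query not in cache:
            results = geocode_fn(query)
            if results:
                geometry = results[0]["geometry"]
                cache[query] = (geometry["lat"], geometry["lng"])
            else:
                # No match: record NaN instead of raising on results[0].
                cache[query] = (math.nan, math.nan)
        coords.append(cache[query])
    return coords

# Stand-in for geocoder.geocode; records each call so caching is visible.
calls = []
def fake_geocode(query):
    calls.append(query)
    known = {"Dandong, China": [{"geometry": {"lat": 40.128668, "lng": 124.386340}}]}
    return known.get(query, [])

coords = geocode_pairs(
    [("Dandong", "China"), ("Dandong", "China"), ("Nowhere", "China")],
    fake_geocode)
print(coords, len(calls))
```

In the notebook, `geocoder.geocode` would be passed as `geocode_fn` in place of the stand-in.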


In [0]:
df_cities_filtered.to_csv("chinese_cities_cleaned.csv",encoding='utf-8-sig',
                          index=False)

# **Data Cleaning that we were not able to apply (data too new)**

*We also looked into a new endpoint for news data, for possible use in sentiment analysis.*

*Issue: this API feature was released on 3/19/2020, the day of our presentation. We were not able to use it, but some simple cleaning steps are shown below.*

In [28]:
req_news = requests.get("https://lab.isaaclin.cn/nCoV/api/news?lang=en&num=300")
req_news_data = req_news.json()["results"]
df_cv_news = json_normalize(req_news_data)
df_cv_news

Unnamed: 0,title,summary,infoSource,sourceUrl,pubDate,provinceName,provinceId
0,The London Underground will close a number of ...,Dozens of stations on the London Underground w...,CNN,https://edition.cnn.com/world/live-news/corona...,1584600410000,,
1,Thailand now requires health certificates from...,Thailand will require health certificates for ...,CNN,https://edition.cnn.com/world/live-news/corona...,1584598853000,,
2,Coronavirus: Australia and New Zealand ban non...,Australia and New Zealand will ban entry to al...,BBC,https://www.bbc.com/news/world-australia-51957...,1584598582000,,
3,Worship in churches and mosques suspended in K...,Religious leaders in Kenya have suspended wors...,BBC,https://www.bbc.com/news/topics/cyz0z8w0ydwt/c...,1584598298000,,
4,Australia bans entry to foreign citizens and n...,"Starting tomorrow, Australia will no longer al...",CNN,https://edition.cnn.com/world/live-news/corona...,1584597392000,,
...,...,...,...,...,...,...,...
295,"2,800 coronavirus cases now reported in the US...","There are at least 2,816 cases of the novel co...",CNN,https://edition.cnn.com/world/live-news/corona...,1584235080000,,
296,Detroit Pistons player tests positive for coro...,A player for the Detroit Pistons tested positi...,CNN,https://twitter.com/DrDingxiang/status/1239006...,1584231720000,,
297,"Begoña Gómez, the wife of Prime Minister Pedro...","Begoña Gómez, the wife of Prime Minister Pedro...",The New York Times,https://twitter.com/DrDingxiang/status/1238981...,1584230400000,,
298,The White House physician says President Trump...,The White House physician says President Trump...,The New York Times,https://twitter.com/DrDingxiang/status/1238977...,1584230400000,,


In [32]:
df_cv_news = df_cv_news[df_cv_news.keys()[:5]].copy()
df_cv_news

Unnamed: 0,title,summary,infoSource,sourceUrl,pubDate
0,The London Underground will close a number of ...,Dozens of stations on the London Underground w...,CNN,https://edition.cnn.com/world/live-news/corona...,1584600410000
1,Thailand now requires health certificates from...,Thailand will require health certificates for ...,CNN,https://edition.cnn.com/world/live-news/corona...,1584598853000
2,Coronavirus: Australia and New Zealand ban non...,Australia and New Zealand will ban entry to al...,BBC,https://www.bbc.com/news/world-australia-51957...,1584598582000
3,Worship in churches and mosques suspended in K...,Religious leaders in Kenya have suspended wors...,BBC,https://www.bbc.com/news/topics/cyz0z8w0ydwt/c...,1584598298000
4,Australia bans entry to foreign citizens and n...,"Starting tomorrow, Australia will no longer al...",CNN,https://edition.cnn.com/world/live-news/corona...,1584597392000
...,...,...,...,...,...
295,"2,800 coronavirus cases now reported in the US...","There are at least 2,816 cases of the novel co...",CNN,https://edition.cnn.com/world/live-news/corona...,1584235080000
296,Detroit Pistons player tests positive for coro...,A player for the Detroit Pistons tested positi...,CNN,https://twitter.com/DrDingxiang/status/1239006...,1584231720000
297,"Begoña Gómez, the wife of Prime Minister Pedro...","Begoña Gómez, the wife of Prime Minister Pedro...",The New York Times,https://twitter.com/DrDingxiang/status/1238981...,1584230400000
298,The White House physician says President Trump...,The White House physician says President Trump...,The New York Times,https://twitter.com/DrDingxiang/status/1238977...,1584230400000


In [33]:
cleaned = []
for entry in range(len(df_cv_news)):
  cleaned.append(datetime.fromtimestamp(df_cv_news['pubDate'][entry]/1000)
                  .strftime("%x"))
  
df_cv_news['pubDate_cleaned'] = np.array(cleaned)
df_cv_news

Unnamed: 0,title,summary,infoSource,sourceUrl,pubDate,pubDate_cleaned
0,The London Underground will close a number of ...,Dozens of stations on the London Underground w...,CNN,https://edition.cnn.com/world/live-news/corona...,1584600410000,03/19/20
1,Thailand now requires health certificates from...,Thailand will require health certificates for ...,CNN,https://edition.cnn.com/world/live-news/corona...,1584598853000,03/19/20
2,Coronavirus: Australia and New Zealand ban non...,Australia and New Zealand will ban entry to al...,BBC,https://www.bbc.com/news/world-australia-51957...,1584598582000,03/19/20
3,Worship in churches and mosques suspended in K...,Religious leaders in Kenya have suspended wors...,BBC,https://www.bbc.com/news/topics/cyz0z8w0ydwt/c...,1584598298000,03/19/20
4,Australia bans entry to foreign citizens and n...,"Starting tomorrow, Australia will no longer al...",CNN,https://edition.cnn.com/world/live-news/corona...,1584597392000,03/19/20
...,...,...,...,...,...,...
295,"2,800 coronavirus cases now reported in the US...","There are at least 2,816 cases of the novel co...",CNN,https://edition.cnn.com/world/live-news/corona...,1584235080000,03/15/20
296,Detroit Pistons player tests positive for coro...,A player for the Detroit Pistons tested positi...,CNN,https://twitter.com/DrDingxiang/status/1239006...,1584231720000,03/15/20
297,"Begoña Gómez, the wife of Prime Minister Pedro...","Begoña Gómez, the wife of Prime Minister Pedro...",The New York Times,https://twitter.com/DrDingxiang/status/1238981...,1584230400000,03/15/20
298,The White House physician says President Trump...,The White House physician says President Trump...,The New York Times,https://twitter.com/DrDingxiang/status/1238977...,1584230400000,03/15/20
