## 2. Data collection
---

### 2.0. Import necessary libraries

In [67]:
import requests
import pandas as pd
import datetime
import json
import urllib3
import warnings
from unidecode import unidecode
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)

### 2.1. Get coordinates of different places in Vietnam

We aim to analyze the air pollution situation across different places in Vietnam, so we first get the coordinates (`latitude` and `longitude`) of all 63 provinces/cities of __Vietnam__.

In [68]:
# get the 63 provinces/cities in Vietnam
coords = pd.read_csv('../data/Vietnam_provinces.csv')
locations = coords.drop(columns=['Region code', 'Country/Region'])
locations

Unnamed: 0,Province/State,Lat,Long
0,An Giang,10.5216,105.1259
1,Ba Ria - Vung Tau,10.4963,107.1684
2,Bac Giang,21.2670,106.2000
3,Bac Kan,22.1333,105.8333
4,Bac Lieu,9.2804,105.7200
...,...,...,...
58,Tra Vinh,9.9340,106.3340
59,Tuyen Quang,21.8180,105.2110
60,Vinh Long,10.2560,105.9640
61,Vinh Phuc,21.3610,105.5470


Specify the beginning and ending dates to retrieve data from the API.

Given that the OpenWeather API stores records from the __two most recent years__ as of the current date, we'll gather data from _January 1, 2021_, to _December 31, 2022_. Feel free to adjust the dates if any data appears to be missing, since OpenWeather may remove older records over time.

In [69]:
# start (first day of 2021): 01/01/2021
# (07:00 presumably corresponds to 00:00 UTC on a machine set to Vietnam time, UTC+7)
START_DATE = datetime.datetime(2021, 1, 1, 7, 0, 0)
# end (last day of 2022): 31/12/2022
END_DATE = datetime.datetime(2022, 12, 31, 23, 59, 59)

# the API URL expects Unix timestamps, so we convert the datetimes
# (naive datetimes are interpreted in the machine's local time zone)
START = int(START_DATE.timestamp())
END = int(END_DATE.timestamp())
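
Because `START_DATE` and `END_DATE` are naive datetimes, `.timestamp()` interprets them in the local time zone of the machine running the notebook. A minimal sketch of a time-zone-explicit variant follows, assuming the intended zone is Vietnam time (UTC+7). The names `VN_TZ`, `start_vn`, `end_vn`, `start_ts`, and `end_ts` are illustrative and not part of the original notebook.

```python
import datetime

# Assumption: the intended local time zone is Vietnam time (UTC+7, no DST).
VN_TZ = datetime.timezone(datetime.timedelta(hours=7))

# Same wall-clock values as START_DATE / END_DATE, but time-zone aware.
start_vn = datetime.datetime(2021, 1, 1, 7, 0, 0, tzinfo=VN_TZ)
end_vn = datetime.datetime(2022, 12, 31, 23, 59, 59, tzinfo=VN_TZ)

# Aware datetimes convert to the same Unix timestamp on any machine.
start_ts = int(start_vn.timestamp())
end_ts = int(end_vn.timestamp())

print(start_ts)  # 1609459200, i.e. 2021-01-01 00:00:00 UTC
print(end_ts)    # 1672505999, i.e. 2022-12-31 16:59:59 UTC
```

With this variant the result no longer depends on where the notebook is executed.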

### 2.2. Collect data from the [OpenWeather API](https://openweathermap.org/api/air-pollution)

To access data from the OpenWeather API, we first need to create an account and obtain an API key from the website. This key is then included in the URL of every request.

In [70]:
# this is my API key, feel free to change it to yours if necessary
API_KEY = "3c2583f290fcb524ca83f54fb8500db2"
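
As an alternative to writing the key directly in the notebook, it could be read from an environment variable. The sketch below is only one possible approach; the variable name `OPENWEATHER_API_KEY` and the helper `load_api_key` are hypothetical and not something OpenWeather or the original notebook defines.

```python
import os

def load_api_key(var_name="OPENWEATHER_API_KEY", environ=os.environ):
    """Return the API key stored in an environment variable.

    `var_name` is a hypothetical name chosen for this example.
    """
    key = environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key

# Demonstration with an explicit mapping instead of the real environment.
demo_key = load_api_key(environ={"OPENWEATHER_API_KEY": " my-secret-key "})
print(demo_key)
```

In the notebook, `API_KEY = load_api_key()` would then replace the hard-coded string.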

Then, let's request the data from the API using the following URL format:

https://api.openweathermap.org/data/2.5/air_pollution/history?lat={lat}&lon={long}&start={START}&end={END}&appid={API_KEY}

With:
- `lat`, `long`: the latitude and longitude of the place we want to collect data for (we created the `locations` dataframe for this purpose).
- `START`, `END`: the start and end of the time range of the returned data, as Unix timestamps. Note that the API returns one record per hour, which is more than we need, so we skip most of them and keep only the first record of each day.
- `API_KEY`: the personal OpenWeather API key discussed above.
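
To make the URL format concrete, here is a small sketch that assembles the query string with the standard library. The helper name `build_history_url` and the placeholder key `YOUR_API_KEY` are illustrative; the coordinates are those of An Giang from the `locations` table.

```python
from urllib.parse import urlencode

HISTORY_ENDPOINT = "https://api.openweathermap.org/data/2.5/air_pollution/history"

def build_history_url(lat, lon, start, end, api_key):
    """Assemble the history URL from its five query parameters."""
    query = urlencode({
        "lat": lat,
        "lon": lon,
        "start": start,   # Unix timestamp
        "end": end,       # Unix timestamp
        "appid": api_key,
    })
    return f"{HISTORY_ENDPOINT}?{query}"

example_url = build_history_url(10.5216, 105.1259, 1609459200, 1672505999, "YOUR_API_KEY")
print(example_url)
```

Using `urlencode` also takes care of escaping, should a parameter ever contain special characters.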

In [71]:
air_dict = {
    'location': [],
    'dt': [],
    "co": [],
    "no": [],
    "no2": [],
    "o3": [],
    "so2": [],
    "pm2_5": [],
    "pm10": [],
    "nh3": [],
    'aqi': []
}

# loop over all locations (every province/city in Vietnam)
for _, loc in locations.iterrows():
    lat = loc['Lat']
    long = loc['Long']
    name = loc['Province/State']

    # build the request URL for this location
    url = "https://api.openweathermap.org/data/2.5/air_pollution/history?" # base
    url += f"lat={lat}&lon={long}&start={START}&end={END}&appid={API_KEY}"
    data = requests.get(url)
    data.raise_for_status() # stop early on an HTTP error (e.g. an invalid API key)
    data_json = data.json() # parse the JSON response body
    
    tmp_date = None
    
    for s in data_json['list']:
        # calendar day (UTC) of this record
        current = datetime.datetime.fromtimestamp(s['dt'], datetime.timezone.utc).strftime(r'%Y-%m-%d')
        if tmp_date and tmp_date == current: # the API returns hourly data; keep only the first record of each day
            continue
        tmp_date = current
        # save the data to the dictionary
        air_dict['location'].append(name)
        air_dict['dt'].append(s['dt'])
        air_dict['aqi'].append(s['main']['aqi'])
        air_dict["co"].append(s['components']['co'])
        air_dict["no"].append(s['components']['no'])
        air_dict["no2"].append(s['components']['no2'])
        air_dict["o3"].append(s['components']['o3'])
        air_dict["so2"].append(s['components']['so2'])
        air_dict["pm2_5"].append(s['components']['pm2_5'])
        air_dict["pm10"].append(s['components']['pm10'])
        air_dict["nh3"].append(s['components']['nh3'])
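
The day-level subsampling in the loop above can be illustrated in isolation. The sketch below runs the same idea on a tiny hand-written payload shaped like the API response (`dt`, `main.aqi`, `components`). The helper `first_record_per_day` is illustrative, and the second record (one hour after the first) is made up for the example; the other two reuse values from the An Giang rows in the output below.

```python
import datetime

def first_record_per_day(records):
    """Keep only the first record of each UTC calendar day."""
    rows = []
    last_day = None
    for rec in records:
        day = datetime.datetime.fromtimestamp(rec["dt"], tz=datetime.timezone.utc).date()
        if day == last_day:
            continue  # same day as the previous kept record: skip
        last_day = day
        # flatten the nested record into one flat row
        rows.append({"dt": rec["dt"], "aqi": rec["main"]["aqi"], **rec["components"]})
    return rows

sample_records = [
    {"dt": 1609459200, "main": {"aqi": 2}, "components": {"co": 507.36, "pm2_5": 15.00}},
    {"dt": 1609462800, "main": {"aqi": 2}, "components": {"co": 500.00, "pm2_5": 14.00}},  # made-up hourly record
    {"dt": 1609545600, "main": {"aqi": 2}, "components": {"co": 400.54, "pm2_5": 14.48}},
]

daily_rows = first_record_per_day(sample_records)
print(len(daily_rows))
```

This assumes, as the loop does, that the records arrive sorted by time.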

In [72]:
# convert to a dataframe to make processing easier
df = pd.DataFrame(air_dict)
df

Unnamed: 0,location,dt,co,no,no2,o3,so2,pm2_5,pm10,nh3,aqi
0,An Giang,1609459200,507.36,0.00,6.34,65.80,5.42,15.00,19.00,6.27,2
1,An Giang,1609545600,400.54,0.02,5.66,60.08,4.83,14.48,18.26,6.08,2
2,An Giang,1609632000,500.68,0.01,7.97,42.20,3.82,19.29,23.05,8.04,2
3,An Giang,1609718400,654.22,0.06,12.51,24.68,4.35,23.56,29.31,9.63,3
4,An Giang,1609804800,714.30,0.04,14.22,18.95,4.23,22.98,25.93,4.88,3
...,...,...,...,...,...,...,...,...,...,...,...
45859,Yen Bai,1672099200,714.30,0.00,13.19,55.07,12.52,138.08,156.96,0.94,5
45860,Yen Bai,1672185600,607.49,0.00,3.60,72.24,11.80,124.45,126.85,0.66,5
45861,Yen Bai,1672272000,487.33,0.00,3.60,41.13,2.44,26.11,27.11,1.44,3
45862,Yen Bai,1672358400,407.22,0.00,1.97,48.64,1.64,18.41,19.14,1.41,2


It seems like our data is good now. 

Let's just save it to a file; we will process it later.

In [73]:
df.to_csv('../data/air_raw.csv', index=False)

### 2.3. Crawling data from the [GIS on Population and Development](https://gis.gso.gov.vn/) website

The air-related data obtained from the OpenWeather API above may not be particularly engaging on its own for the data analysis process (though it is still valuable for other purposes). To better understand the various locations in Vietnam, we gather additional information such as demographics, GRDP, HDI, etc., for each city/province by scraping the website mentioned above. This gives a more comprehensive view of the situation across different regions of the country.

This is the URL format of the website's API _(hidden)_. In the code below, the same parameters are sent as the JSON body of a POST request:

https://apigis.gso.gov.vn/api/web/exportdetail?province_code={province_code}&years={years}&import_type={import_type}

With:
- `province_code`: the code of the province/city we want to collect data for. Here we use the code _00_ to collect data for all provinces/cities in Vietnam.
- `years`: the list of years we want data for. Note that due to some errors with the API for the year 2022, we have to request the data for 2021 and 2022 separately.
- `import_type`: a number identifying the information (feature) that we want. Here we will use:
    - __Total population in urban and rural__: 3
    - __HDI__: 59
    - __Percentage of poor households__: 27
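
As a rough sketch of how these three parameters fit together, the request body for one feature and one year can be built as a plain dictionary. The helper `build_payload` is an illustrative name, not part of the website's API.

```python
import json

def build_payload(import_type, year, province_code="00"):
    """Return the request body for one feature and one year."""
    return {
        "province_code": province_code,  # "00" = all provinces/cities
        "years": [year],                 # one year per request
        "import_type": import_type,
    }

payload = build_payload(import_type=59, year=2021)  # 59 = HDI
print(json.dumps(payload))
```

The dictionary would then be passed to `requests.post(..., json=payload)`, as the cells below do.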

Let's define some constants (determined by the website's API) before requesting the data.

In [74]:
BASE_URL = 'https://apigis.gso.gov.vn/api/web/exportdetail'
PROVINCE_CODE = "00" # represents all provinces/cities in Vietnam
# import type of the features we want to get
IT_URBAN_RURAL = 3
IT_HDI = 59
IT_POOR_HOUSEHOLDS = 27

# to temporarily store the data
first_year = {
    'Total population in urban': {}, 
    'Total population in rural': {}, 
    'HDI': {}, 
    'Percentage of poor households': {}
}
second_year = {
    'Total population in urban': {}, 
    'Total population in rural': {}, 
    'HDI': {}, 
    'Percentage of poor households': {}
}

Scraping the total urban/rural population data in various places in Vietnam.

In [75]:
# first year
headers = {"province_code":PROVINCE_CODE, "years":[START_DATE.year],"import_type":IT_URBAN_RURAL}
response = requests.post(BASE_URL, json=headers, verify=False)
res = response.content.decode('unicode-escape')
res = json.loads(res)
tp_df1 = pd.DataFrame(res['data']['dataExport']).drop(0, axis=1)
tp_df1 = tp_df1.drop([0,1])

# second year
headers["years"] = [END_DATE.year]
response = requests.post(BASE_URL, json=headers, verify=False)
res = response.content.decode('unicode-escape')
res = json.loads(res)
tp_df2 = pd.DataFrame(res['data']['dataExport']).drop(0, axis=1)
tp_df2 = tp_df2.drop([0,1])

# save data to dictionary
for i in range(tp_df1.shape[0]):
    value1 = tp_df1.iloc[i, :]
    value2 = tp_df2.iloc[i, :]
    province_name = unidecode(value1[1])
    
    first_year['Total population in urban'][province_name] = value1[2]
    first_year['Total population in rural'][province_name] = value1[3]
    
    second_year['Total population in urban'][province_name] = value2[2]
    second_year['Total population in rural'][province_name] = value2[3]
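
The cell above passes each province name through `unidecode`, presumably so that the accented names returned by the website line up with the unaccented names in `locations`. For readers without that package, a rough standard-library stand-in could look like the sketch below. `to_ascii` is a hypothetical helper that only covers the diacritics handled here; it is not a full replacement for `unidecode`.

```python
import unicodedata

def to_ascii(name):
    """Strip Vietnamese diacritics (a rough stand-in for unidecode)."""
    # The letter Đ/đ has no Unicode decomposition, so map it by hand.
    name = name.replace("Đ", "D").replace("đ", "d")
    # Split accented letters into base letter + combining marks, then drop the marks.
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(to_ascii("Hà Nội"))   # Ha Noi
print(to_ascii("Đà Nẵng"))  # Da Nang
```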

Scraping data about Human Development Index (HDI)

In [76]:
# first year
headers = {"province_code":PROVINCE_CODE, "years":[START_DATE.year],"import_type":IT_HDI}
response = requests.post(BASE_URL, json=headers, verify=False)
res = response.content.decode('unicode-escape')
res = json.loads(res)
tp_df1 = pd.DataFrame(res['data']['dataExport']).drop(0, axis=1)
tp_df1 = tp_df1.drop([0])

# second year
headers["years"] = [END_DATE.year]
response = requests.post(BASE_URL, json=headers, verify=False)
res = response.content.decode('unicode-escape')
res = json.loads(res)
tp_df2 = pd.DataFrame(res['data']['dataExport']).drop(0, axis=1)
tp_df2 = tp_df2.drop([0])

# save data to dictionary
for i in range(tp_df1.shape[0]):
    value1 = tp_df1.iloc[i, :]
    value2 = tp_df2.iloc[i, :]
    province_name = unidecode(value1[1])
    
    first_year['HDI'][province_name] = value1[2]
    second_year['HDI'][province_name] = value2[2]

Scraping data about Percentage of poor households in every province

In [77]:
# first year
headers = {"province_code":PROVINCE_CODE, "years":[START_DATE.year],"import_type":IT_POOR_HOUSEHOLDS}
response = requests.post(BASE_URL, json=headers, verify=False)
res = response.content.decode('unicode-escape')
res = json.loads(res)
tp_df1 = pd.DataFrame(res['data']['dataExport']).drop(0, axis=1)
tp_df1 = tp_df1.drop([0])

# second year
headers["years"] = [END_DATE.year]
response = requests.post(BASE_URL, json=headers, verify=False)
res = response.content.decode('unicode-escape')
res = json.loads(res)
tp_df2 = pd.DataFrame(res['data']['dataExport']).drop(0, axis=1)
tp_df2 = tp_df2.drop([0])

# save data to dictionary
for i in range(tp_df1.shape[0]):
    value1 = tp_df1.iloc[i, :]
    value2 = tp_df2.iloc[i, :]
    province_name = unidecode(value1[1])
    
    first_year['Percentage of poor households'][province_name] = value1[2]
    second_year['Percentage of poor households'][province_name] = value2[2]
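
The three scraping cells above repeat the same steps: skip the header rows, then map each province name to one value column. A possible way to factor this out is sketched below on hand-written rows. The helper `rows_to_feature`, the header row, and the assumption that `dataExport` is a list of row lists holding string values are all illustrative; only the Ha Noi and Ha Giang figures come from the 2021 output in this section.

```python
def rows_to_feature(rows, n_header_rows, value_col, name_col=1):
    """Map province name -> value from a `dataExport`-style list of rows."""
    feature = {}
    for row in rows[n_header_rows:]:  # skip the header rows
        feature[row[name_col]] = row[value_col]
    return feature

# Hypothetical rows: column 0 is a running index, column 1 the province name.
sample_rows = [
    ["#", "Province", "Urban", "Rural"],  # made-up header row
    [1, "Ha Noi", "4095366", "4235468"],
    [2, "Ha Giang", "140327", "746759"],
]

urban = rows_to_feature(sample_rows, n_header_rows=1, value_col=2)
rural = rows_to_feature(sample_rows, n_header_rows=1, value_col=3)
print(urban)
print(rural)
```

Keying the result by name, rather than by row position, also avoids relying on both years returning the provinces in the same order.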
    

Let's print out the result.

In [78]:
first_year = pd.DataFrame(first_year)
second_year = pd.DataFrame(second_year)
print("First year (2021):")
display(first_year)
print("Second year (2022):")
display(second_year)

First year (2021):


Unnamed: 0,Total population in urban,Total population in rural,HDI,Percentage of poor households
Ha Noi,4095366,4235468,0.81000000000000005,0.40000000000000002
Ha Giang,140327,746759,0.58999999999999997,25.0
Cao Bang,138178,404039,0.65000000000000002,24.5
Bac Kan,73114,250598,0.68000000000000005,20.600000000000001
Tuyen Quang,,,,
...,...,...,...,...
Can Tho,876923,370070,0.71999999999999997,1.8
Hau Giang,212686,517202,0.68999999999999995,5.2000000000000002
Soc Trang,391396,815423,0.65000000000000002,4.7000000000000002
Bac Lieu,254940,663570,0.66000000000000003,5.7999999999999998


Second year (2022):


Unnamed: 0,Total population in urban,Total population in rural,HDI,Percentage of poor households
Ha Noi,4138505,4297147,0.81999999999999995,0.10000000000000001
Ha Giang,142345,750378,0.59999999999999998,31.600000000000001
Cao Bang,138465,404587,0.66000000000000003,23.600000000000001
Bac Kan,73565,250788,0.68999999999999995,20.100000000000001
Tuyen Quang,,,,
...,...,...,...,...
Can Tho,882856,369492,0.73999999999999999,1.0
Hau Giang,204991,524476,0.68999999999999995,5.2999999999999998
Soc Trang,405650,792173,0.67000000000000004,6.0999999999999996
Bac Lieu,255891,665918,0.67000000000000004,5.2000000000000002


The data is not perfect (for example, Tuyen Quang has no values), but we will handle this in the next stage (preprocessing).
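
Before preprocessing, it can help to list which provinces are affected. The sketch below does this on a small dictionary keyed by province. The helper `provinces_with_missing` is illustrative, the Ha Noi values are rounded from the 2021 output above, and how the website actually encodes missing values (`None` or an empty string) is an assumption.

```python
def provinces_with_missing(features):
    """Return the provinces whose row has at least one empty value."""
    missing = []
    for name, values in features.items():
        if any(v is None or v == "" for v in values.values()):
            missing.append(name)
    return missing

# Illustrative slice of the 2021 table; missing entries are assumed to be None.
sample_features = {
    "Ha Noi": {"HDI": "0.81", "Percentage of poor households": "0.4"},
    "Tuyen Quang": {"HDI": None, "Percentage of poor households": None},
}

missing_provinces = provinces_with_missing(sample_features)
print(missing_provinces)  # ['Tuyen Quang']
```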

Let's just save them to separate files first.

In [79]:
# save the additional features to CSV files (one per year)
first_year.to_csv('../data/new_features_year1.csv')
second_year.to_csv('../data/new_features_year2.csv')