You can download AQI data using two methods:

## 1- Data from doe.ir
The data at `https://aqicn.org` is convenient for downloading everything at once, but compared with the data at `https://aqms.doe.ir/` it is incomplete and contains errors. Because the latter site is the `DOE` reference, it is better to obtain the data from there.
Downloading data from `https://aqms.doe.ir/` takes several steps:

1- Install the Chrome extension `Print Friendly & PDF`.</br>
2- Open the site in `English`.</br>
3- Define the `Date`.</br>
4- Export the data as `PDF`.</br>
5- Convert the PDF files to a Python `list` with the `tabula-py` library.</br>
6- Convert the list to a `Pandas DataFrame`.</br>
7- The exported values are `AQI` and must be converted to `Concentration`; the [AQI Calculator](https://www.airnow.gov/aqi/aqi-calculator-concentration/) can be used for this.
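
Steps 5 and 6 can be sketched as follows. This is only a minimal sketch: it assumes the exported PDF contains tables that `tabula-py` can detect (the library also needs a Java runtime), and the helper names and the file name `aqms_export.pdf` are hypothetical.

```python
import pandas as pd


def read_pdf_tables(pdf_path):
    """Return every table tabula-py detects in the PDF as a list of DataFrames."""
    import tabula  # imported lazily: tabula-py needs a Java runtime

    return tabula.read_pdf(pdf_path, pages="all")


def tables_to_frame(tables):
    """Stack a list of per-page DataFrames into one DataFrame."""
    if not tables:
        return pd.DataFrame()
    return pd.concat(tables, ignore_index=True)


# Hypothetical usage (the file name is an assumption):
# df = tables_to_frame(read_pdf_tables("aqms_export.pdf"))
```

Tables extracted from PDFs often need extra cleaning (repeated header rows, merged cells), so the result should be inspected before it is used.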

## 2- Data from aqicn.org

1- AQICN Token

The first step is to make sure to acquire your own token for all API access.</br>
You can get your token from the [data-platform token](https://aqicn.org/data-platform/token/) page. 

2- Downloading Lat-Lon from [www.aqicn.org](https://aqicn.org/)</br>

Historical data downloaded manually from [aqicn](https://aqicn.org/) does not include the `latitude` and `longitude` of each station. To obtain this information we can use the [API](https://aqicn.org/json-api/doc/#api-Map_Queries-GetMapStations).</br>
The general form of the request for all stations inside a bounding box is:</br>

```
/map/bounds?token=:token&latlng=:latlng
```
We used the following request to obtain the station data (`name`, `lat`, `lon` and `uid`):</br>

```
https://api.waqi.info/v2/map/bounds?latlng=11.88,25.74,42.57,63.45&networks=all&token=f8a5365329b83961a149e7e4ada3946b84a63528
```
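
The same request can be built and saved from Python. The sketch below uses only the standard library; the function names are hypothetical, the token must be replaced with your own, and the output file name `bounds_API.json` is taken from the next section.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_ROOT = "https://api.waqi.info/v2/map/bounds"


def build_bounds_url(token, lat1, lon1, lat2, lon2, networks="all"):
    """Build the map/bounds URL for the box given by two corner points."""
    latlng = f"{lat1},{lon1},{lat2},{lon2}"
    # Keep the commas unescaped so the URL looks like the one shown above
    query = urlencode(
        {"latlng": latlng, "networks": networks, "token": token}, safe=","
    )
    return f"{API_ROOT}?{query}"


def fetch_stations(url, out_file="bounds_API.json"):
    """Download the JSON response and store it on disk for later steps."""
    with urlopen(url) as response:
        payload = json.load(response)
    with open(out_file, "w", encoding="utf-8") as f:
        json.dump(payload, f, ensure_ascii=False)
    return payload


# Hypothetical usage (performs a network request):
# url = build_bounds_url("YOUR_TOKEN", 11.88, 25.74, 42.57, 63.45)
# payload = fetch_stations(url)
```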


## Reading JSON file

https://towardsdatascience.com/how-to-convert-json-into-a-pandas-dataframe-100b2ae1e0d8

In [None]:
import pandas as pd

path = '/mnt/864424144424098F/Anaconda_Projects/Air_Pollution_Projects/Middle_East_Prj/AQICN_Dataset/Lat_Long/'
df = pd.read_json(path + 'bounds_API.json')
df.head()

Unnamed: 0,status,data
0,ok,"{'lat': 32.243608, 'lon': 48.813886, 'uid': 11..."
1,ok,"{'lat': 35.034016, 'lon': 50.368211, 'uid': 10..."
2,ok,"{'lat': 41.5462, 'lon': 45.0145, 'uid': 12472,..."
3,ok,"{'lat': 32.536037, 'lon': 35.863945, 'uid': 11..."
4,ok,"{'lat': 41.197475, 'lon': 32.623379, 'uid': 10..."


After reading this JSON, we can see that the nested list ends up in a single column, **data**. How can we flatten it? One solution is to apply a custom function that flattens the values in **data**.

That works, but it requires extra code to get the data into the form we need. The Pandas `json_normalize()` function solves this more effectively.
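
To make the comparison concrete, here is a small sketch of both approaches on a made-up payload with the same shape as the API response (the station values are invented for illustration, and `flatten_manually` is a hypothetical helper):

```python
import pandas as pd

# Made-up payload that mimics the structure of the map/bounds response
payload = {
    "status": "ok",
    "data": [
        {"lat": 32.24, "lon": 48.81, "uid": 1, "aqi": "35",
         "station": {"name": "Station A, Province A, Country A",
                     "time": "2022-11-19T04:30:00+09:00"}},
        {"lat": 35.03, "lon": 50.37, "uid": 2, "aqi": "57",
         "station": {"name": "Station B, Province B, Country B",
                     "time": "2022-11-19T04:30:00+09:00"}},
    ],
}


def flatten_manually(payload):
    """Custom flattening: copy the nested `station` keys to the top level."""
    rows = []
    for record in payload["data"]:
        row = {k: v for k, v in record.items() if k != "station"}
        for key, value in record["station"].items():
            row[f"station.{key}"] = value
        rows.append(row)
    return pd.DataFrame(rows)


manual = flatten_manually(payload)

# The same result in a single call
normalized = pd.json_normalize(payload, record_path=["data"])
print(normalized.columns.tolist())
```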

## Flattening nested list from JSON object

In [None]:
import json
# Load the data with the Python json module
with open(path + 'bounds_API.json', 'r') as f:
    data = json.load(f)
    
# Flatten data
df_nested_list = pd.json_normalize(data, record_path =['data'])
df_nested_list.head()

Unnamed: 0,lat,lon,uid,aqi,station.name,station.time
0,32.243608,48.813886,11254,35,"Gotvand, Khouzestan, Iran (گتوند خوزستان)",2022-11-19T04:30:00+09:00
1,35.034016,50.368211,10640,57,"Saveh, Markazy, Iran (ساوه مركزی)",2022-11-19T04:30:00+09:00
2,41.5462,45.0145,12472,76,"Batumi Str. 19, Rustavi, Georgia",2022-11-19T03:00:00+09:00
3,32.536037,35.863945,11420,42,"Al-Hassan Sport City, Irbid, Jordan (ﻣﺪﯾﻨﺔ اﻟﺤ...",2022-11-19T03:00:00+09:00
4,41.197475,32.623379,10677,168,"Toren Alani, Turkey",2022-11-19T04:00:00+09:00


## Splitting `station.name`

In [None]:
splitted = df_nested_list['station.name'].str.split(',')
df_nested_list['country'] = splitted.str[-1]
df_nested_list['province'] = splitted.str[-2]
df_nested_list['station'] = splitted.str[-3]
                                         
df_nested_list.head()

Unnamed: 0,lat,lon,uid,aqi,station.name,station.time,country,province,station
0,32.243608,48.813886,11254,35,"Gotvand, Khouzestan, Iran (گتوند خوزستان)",2022-11-19T04:30:00+09:00,Iran (گتوند خوزستان),Khouzestan,Gotvand
1,35.034016,50.368211,10640,57,"Saveh, Markazy, Iran (ساوه مركزی)",2022-11-19T04:30:00+09:00,Iran (ساوه مركزی),Markazy,Saveh
2,41.5462,45.0145,12472,76,"Batumi Str. 19, Rustavi, Georgia",2022-11-19T03:00:00+09:00,Georgia,Rustavi,Batumi Str. 19
3,32.536037,35.863945,11420,42,"Al-Hassan Sport City, Irbid, Jordan (ﻣﺪﯾﻨﺔ اﻟﺤ...",2022-11-19T03:00:00+09:00,Jordan (ﻣﺪﯾﻨﺔ اﻟﺤﺴﻦ اﻟﺮﯾﺎﺿﯿﺔ/ارﺑﺪ),Irbid,Al-Hassan Sport City
4,41.197475,32.623379,10677,168,"Toren Alani, Turkey",2022-11-19T04:00:00+09:00,Turkey,Toren Alani,


In [None]:
df_nested_list['country'] = df_nested_list['country'].str.split('(').str[0]

The values in the `country` column may have white space before them, after them, or both. To get clean data, remove this white space with `strip()`.

In [None]:
# Remove white space before and after words
df_nested_list['country'] = df_nested_list['country'].str.strip()
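
The split, bracket removal and `strip()` steps can be tried on a few names in isolation. This is a minimal sketch on a hand-written Series (the bracketed local name is replaced by a placeholder); it also shows that a two-part name such as `Toren Alani, Turkey` leaves `station` empty and puts the place name in `province`.

```python
import pandas as pd

names = pd.Series([
    "Gotvand, Khouzestan, Iran (local name)",
    "Batumi Str. 19, Rustavi, Georgia",
    "Toren Alani, Turkey",
])

# Split on commas and read the parts from the right-hand side
parts = names.str.split(",")
parsed = pd.DataFrame({
    # drop the bracketed local name, then trim white space
    "country": parts.str[-1].str.split("(").str[0].str.strip(),
    "province": parts.str[-2].str.strip(),
    # names with only two parts have no third part, so this is NaN
    "station": parts.str[-3].str.strip(),
})
print(parsed)
```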

In [None]:
# Fix values that were not parsed into a country name (for example, the one below)
df_nested_list.loc[df_nested_list["country"] == 'שפלה פנימית)', "country"] = 'Israel'

In [None]:
# Checking unique values in `country` column
df_nested_list['country'].unique()

array(['Iran', 'Georgia', 'Jordan', 'Turkey', 'Kuwait', 'UAE', 'Cyprus',
       'Egypt', 'Israel', 'Qatar', 'Greece', 'Bulgaria', 'Iraq',
       'Palestine', 'Saudi Arabia', 'Turkmenistan', 'Azerbaijan', 'Sudan'],
      dtype=object)

In [None]:
# Remove countries outside the Middle East
not_middle_east = ['Cyprus', 'Greece', 'Turkmenistan', 'Sudan',
                   'Bulgaria', 'Georgia', 'Azerbaijan']
df_ME = df_nested_list[~df_nested_list['country'].isin(not_middle_east)]
df_ME['country'].unique()

array(['Iran', 'Jordan', 'Turkey', 'Kuwait', 'UAE', 'Egypt', 'Israel',
       'Qatar', 'Iraq', 'Palestine', 'Saudi Arabia'], dtype=object)

In [None]:
# Save initial cleaned dataframe as df_middle_east.csv
df_ME.to_csv('df_middle_east.csv', index=False)

In [None]:
# Get Null (None) values index in `country` column.
#non = df_nested_list.loc[pd.isna(df_nested_list["country"]), :].index.tolist()

# Replace None with Country names
#for i in non:
#    txt = df_nested_list.iloc[i]['province']
#    df_nested_list['country'].replace('None', str(txt), inplace=True)
#    df_nested_list.loc[i] = str(txt)

#df_nested_list.head()

## Mapping Stations

### *Plotting with Folium*

In [None]:
# Import Libraries
import pandas as pd
import geopandas as gpd
import folium
import matplotlib.pyplot as plt

In [None]:
# Reading Data
df_me = pd.read_csv('df_middle_east.csv')
df_me.head()

Unnamed: 0,lat,lon,uid,aqi,station.name,station.time,country,province,station
0,32.243608,48.813886,11254,35,"Gotvand, Khouzestan, Iran (گتوند خوزستان)",2022-11-19T04:30:00+09:00,Iran,Khouzestan,Gotvand
1,35.034016,50.368211,10640,57,"Saveh, Markazy, Iran (ساوه مركزی)",2022-11-19T04:30:00+09:00,Iran,Markazy,Saveh
2,32.536037,35.863945,11420,42,"Al-Hassan Sport City, Irbid, Jordan (ﻣﺪﯾﻨﺔ اﻟﺤ...",2022-11-19T03:00:00+09:00,Jordan,Irbid,Al-Hassan Sport City
3,41.197475,32.623379,10677,168,"Toren Alani, Turkey",2022-11-19T04:00:00+09:00,Turkey,Toren Alani,
4,38.674596,39.211668,4058,39,"Elazig, Turkey",2022-11-19T04:00:00+09:00,Turkey,Elazig,


In [None]:
# DataFrame info
df_me.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 309 entries, 0 to 308
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   lat           309 non-null    float64
 1   lon           309 non-null    float64
 2   uid           309 non-null    int64  
 3   aqi           309 non-null    object 
 4   station.name  309 non-null    object 
 5   station.time  309 non-null    object 
 6   country       309 non-null    object 
 7   province      297 non-null    object 
 8   station       125 non-null    object 
dtypes: float64(2), int64(1), object(6)
memory usage: 21.9+ KB


In [None]:
# Keep only relevant columns
df_me.drop(columns = ['aqi', 'station.name', 'station.time'], inplace=True)
df_me.head()

Unnamed: 0,lat,lon,uid,country,province,station
0,32.243608,48.813886,11254,Iran,Khouzestan,Gotvand
1,35.034016,50.368211,10640,Iran,Markazy,Saveh
2,32.536037,35.863945,11420,Jordan,Irbid,Al-Hassan Sport City
3,41.197475,32.623379,10677,Turkey,Toren Alani,
4,38.674596,39.211668,4058,Turkey,Elazig,


In [None]:
df_me['country'].unique()

array(['Iran', 'Jordan', 'Turkey', 'Kuwait', 'UAE', 'Egypt', 'Israel',
       'Qatar', 'Iraq', 'Palestine', 'Saudi Arabia'], dtype=object)

In [None]:
# Create point geometries
geometry = gpd.points_from_xy(df_me.lon, df_me.lat)
geo_df = gpd.GeoDataFrame(
    df_me[["uid", "station", 'province', 'country', 'lat', 'lon']], geometry=geometry
)

geo_df.head()

Unnamed: 0,uid,station,province,country,lat,lon,geometry
0,11254,Gotvand,Khouzestan,Iran,32.243608,48.813886,POINT (48.81389 32.24361)
1,10640,Saveh,Markazy,Iran,35.034016,50.368211,POINT (50.36821 35.03402)
2,11420,Al-Hassan Sport City,Irbid,Jordan,32.536037,35.863945,POINT (35.86395 32.53604)
3,10677,,Toren Alani,Turkey,41.197475,32.623379,POINT (32.62338 41.19747)
4,4058,,Elazig,Turkey,38.674596,39.211668,POINT (39.21167 38.67460)
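
With the coordinates in place, the stations can be drawn on a Folium map. The sketch below is one possible way to do it, not the notebook's original plotting code: the function names, marker style and zoom level are assumptions, and it works directly on the `lat`, `lon`, `country` and `uid` columns.

```python
import pandas as pd


def map_center(df):
    """Mean latitude/longitude, used as the initial map centre."""
    return [float(df["lat"].mean()), float(df["lon"].mean())]


def build_station_map(df, zoom_start=5):
    """Return a folium.Map with one circle marker per station."""
    import folium  # imported lazily so map_center works without folium

    station_map = folium.Map(location=map_center(df), zoom_start=zoom_start)
    for row in df.itertuples(index=False):
        folium.CircleMarker(
            location=[row.lat, row.lon],
            radius=4,
            popup=f"{row.country} (uid {row.uid})",
        ).add_to(station_map)
    return station_map


# Hypothetical usage:
# build_station_map(df_me).save("stations.html")
```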


# Convert AQI of PM10 to PM10 Concentration

AQI data from [aqicn](https://aqicn.org/) should be converted to raw concentrations using the US EPA standard.
In this project we need the concentration of `PM10`, so we convert `PM10 AQI` to `PM10 Concentration`, as the [AQI Calculator](https://www.airnow.gov/aqi/aqi-calculator/) does.

# How does the AQI work?

(srcs: [1](https://www.airnow.gov/aqi/aqi-basics/), [2](https://en.wikipedia.org/wiki/Air_quality_index#United_States))

Think of the `AQI` as a yardstick that runs from 0 to 500. The higher the `AQI` value, the greater the level of air pollution and the greater the health concern. For example, an `AQI` value of 50 or below represents good air quality, while an `AQI` value over 300 represents hazardous air quality.

For each pollutant an `AQI` value of 100 generally corresponds to an ambient air concentration that equals the level of the short-term national ambient air quality standard for protection of public health. `AQI` values at or below 100 are generally thought of as satisfactory. When `AQI` values are above 100, air quality is unhealthy: at first for certain sensitive groups of people, then for everyone as `AQI` values get higher.

The `AQI` is divided into six categories. Each category corresponds to a different level of health concern. Each category also has a specific color. The color makes it easy for people to quickly determine whether air quality is reaching unhealthy levels in their communities.

| ![aqi_category](images/aqi_categories.png)| 
|:--:| 
| *AQI Categories* |

| ![aqi_epa_usa](images/aqi_epa_usa.png) | 
|:--:| 
| *AQI of EPA Categories* |

To convert `AQI values` to `Concentration values`, we can use the [python-aqi](https://github.com/NILICK/python-aqi) Python library.
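
For reference, the conversion inverts the EPA piecewise-linear formula: within a category, C = C_lo + (I - I_lo) * (C_hi - C_lo) / (I_hi - I_lo). The sketch below is not the `python-aqi` implementation. It assumes the classic EPA 24-hour PM10 breakpoints (breakpoint tables are revised from time to time, so check the current one) and truncates the result to a whole number, which agrees with the `PM10_ug/m3` values shown in this notebook. The name `aqi_to_pm10` is hypothetical.

```python
# (aqi_low, aqi_high, conc_low, conc_high); PM10 24-hour average in ug/m3
PM10_BREAKPOINTS = [
    (0, 50, 0, 54),
    (51, 100, 55, 154),
    (101, 150, 155, 254),
    (151, 200, 255, 354),
    (201, 300, 355, 424),
    (301, 400, 425, 504),
    (401, 500, 505, 604),
]


def aqi_to_pm10(aqi_value):
    """Convert a PM10 AQI value (0-500) to a truncated concentration in ug/m3."""
    i = int(aqi_value)
    for i_lo, i_hi, c_lo, c_hi in PM10_BREAKPOINTS:
        if i_lo <= i <= i_hi:
            # Linear interpolation inside the category, truncated to an integer
            return c_lo + (i - i_lo) * (c_hi - c_lo) // (i_hi - i_lo)
    raise ValueError(f"AQI value out of range: {aqi_value}")


print(aqi_to_pm10(24))  # 25, as in the first row of the converted table
```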

# Using `python-aqi` in pandas

In [None]:
# import libraries
import pandas as pd
import numpy as np
import aqi

In [None]:
# Reading aqi file in csv format
df = pd.read_csv("./Cleand_data/Turkey/tr_adana_catalan.csv") # read csv
df['date'] = pd.to_datetime(df['date'], format='%Y/%m/%d') # change `date` column to 'datetime'.
df.set_index(['date'], inplace=True) # setting index column.
df = df.sort_index() # sort date based on `datetime`.
df.head()

date,pm25,pm10,o3,no2,so2,co,lat,lon,country,station
2013-12-31,,,12.0,1.0,1.0,,37.186701,35.264586,Turkey,Adana catalan
2014-01-01,,24.0,11.0,1.0,3.0,,37.186701,35.264586,Turkey,Adana catalan
2014-01-02,,24.0,15.0,1.0,1.0,,37.186701,35.264586,Turkey,Adana catalan
2014-01-03,,24.0,18.0,1.0,1.0,,37.186701,35.264586,Turkey,Adana catalan
2014-01-04,,23.0,14.0,1.0,4.0,,37.186701,35.264586,Turkey,Adana catalan


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 3069 entries, 2013-12-31 to 2022-11-16
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   pm25     698 non-null    float64
 1   pm10     2959 non-null   float64
 2   o3       2954 non-null   float64
 3   no2      2057 non-null   float64
 4   so2      2370 non-null   float64
 5   co       1126 non-null   float64
 6   lat      3069 non-null   float64
 7   lon      3069 non-null   float64
 8   country  3069 non-null   object 
 9   station  3069 non-null   object 
dtypes: float64(8), object(2)
memory usage: 263.7+ KB


In [None]:
len(df)

3069

In [None]:
# Subset the dataframe to the ['pm10', 'lat', 'lon', 'country', 'station'] columns
def pm10(df):
    if 'pm10' in df.columns:
        df_pm10 = df[['pm10', 'lat', 'lon', 'country', 'station']]
        return df_pm10
    else:
        return None

In [None]:
df_pm10 = pm10(df)
df_pm10.head()

date,pm10,lat,lon,country,station
2013-12-31,,37.186701,35.264586,Turkey,Adana catalan
2014-01-01,24.0,37.186701,35.264586,Turkey,Adana catalan
2014-01-02,24.0,37.186701,35.264586,Turkey,Adana catalan
2014-01-03,24.0,37.186701,35.264586,Turkey,Adana catalan
2014-01-04,23.0,37.186701,35.264586,Turkey,Adana catalan


In [None]:
# Checking `NaN` values in columns
df_pm10.columns[df_pm10.isna().any()].tolist()

['pm10']

In [None]:
# Drop 'NaN' rows (reassign instead of `inplace=True`, because `df_pm10` is a slice of `df`)
df_pm10 = df_pm10.dropna()

In [None]:
len(df_pm10)

2959

In [None]:
df_pm10['pm10'].max()

746.0

In [None]:
# The maximum AQI is 500, so rows with larger values are dropped
dfpm10 = df_pm10[df_pm10['pm10'] <= 500].copy()
dfpm10.pm10.max()

433.0

In [None]:
len(dfpm10)

2957

In [None]:
# Define a function to calculate the concentration value `cc` from an AQI value.
def aqi_to_cc_pm10(x):
    cc = aqi.to_cc(aqi.POLLUTANT_PM10, x, algo=aqi.ALGO_EPA) # calculate PM10 concentration.
    pm10_cc = pd.to_numeric(cc) # convert the returned value to a number.
    return pm10_cc

In [None]:
# Apply the function and store the PM10 concentration in a new column `PM10_ug/m3`.
dfpm10['PM10_ug/m3'] = dfpm10['pm10'].apply(aqi_to_cc_pm10)

In [None]:
dfpm10.head()

date,pm10,lat,lon,country,station,PM10_ug/m3
2014-01-01,24.0,37.186701,35.264586,Turkey,Adana catalan,25.0
2014-01-02,24.0,37.186701,35.264586,Turkey,Adana catalan,25.0
2014-01-03,24.0,37.186701,35.264586,Turkey,Adana catalan,25.0
2014-01-04,23.0,37.186701,35.264586,Turkey,Adana catalan,24.0
2014-01-05,29.0,37.186701,35.264586,Turkey,Adana catalan,31.0


In [None]:
len(dfpm10)

2957

In [None]:
dfpm10[220:230] # checking by index

date,pm10,lat,lon,country,station,PM10_ug/m3
2014-09-10,25.0,37.186701,35.264586,Turkey,Adana catalan,27.0
2014-09-11,26.0,37.186701,35.264586,Turkey,Adana catalan,28.0
2014-09-12,29.0,37.186701,35.264586,Turkey,Adana catalan,31.0
2014-09-13,34.0,37.186701,35.264586,Turkey,Adana catalan,36.0
2014-09-14,27.0,37.186701,35.264586,Turkey,Adana catalan,29.0
2014-09-15,27.0,37.186701,35.264586,Turkey,Adana catalan,29.0
2014-09-16,28.0,37.186701,35.264586,Turkey,Adana catalan,30.0
2014-09-17,20.0,37.186701,35.264586,Turkey,Adana catalan,21.0
2014-09-18,20.0,37.186701,35.264586,Turkey,Adana catalan,21.0
2014-09-19,16.0,37.186701,35.264586,Turkey,Adana catalan,17.0


## Function to convert AQI to Concentration

In [None]:
import pandas as pd
iran = pd.read_csv('./AQICN_Dataset/Cleand_AQI_data/Iran/ir_abdanan.csv')
iran.head()

Unnamed: 0,date,pm25,lat,lon,country,station
0,2019-01-01,,32.989,47.414305,Iran,Abdanan
1,2019-01-02,,32.989,47.414305,Iran,Abdanan
2,2019-01-03,,32.989,47.414305,Iran,Abdanan
3,2019-01-04,,32.989,47.414305,Iran,Abdanan
4,2019-01-06,,32.989,47.414305,Iran,Abdanan


In [None]:
cyprus = pd.read_csv('./AQICN_Dataset/Cleand_AQI_data/Cyprus/cy_larnaca_traffic.csv')
cyprus.head()

Unnamed: 0,date,pm25,pm10,o3,no2,so2,co,lat,lon,country,station
0,2022-11-01,58.0,36.0,,12.0,1.0,2.0,34.9159,33.62749,CY,larnaca-traffic
1,2022-11-02,58.0,40.0,,13.0,1.0,3.0,34.9159,33.62749,CY,larnaca-traffic
2,2022-11-03,65.0,33.0,,13.0,1.0,2.0,34.9159,33.62749,CY,larnaca-traffic
3,2022-11-04,60.0,34.0,,14.0,1.0,3.0,34.9159,33.62749,CY,larnaca-traffic
4,2022-11-05,64.0,34.0,,8.0,1.0,2.0,34.9159,33.62749,CY,larnaca-traffic


In [None]:
# import libraries
import pandas as pd
import numpy as np
import aqi
import os
from glob import glob
import warnings
warnings.filterwarnings('ignore')

# Subset the dataframe to the ['date', 'pm10', 'lat', 'lon', 'country', 'station'] columns.
## If there is no `pm10` column, return an empty dataframe so the loop moves on to the next file.
def cols(df):
    if 'pm10' in df.columns:
        #print(df.columns)
        df_pm10 = df[['date', 'pm10', 'lat', 'lon', 'country', 'station']]
        return df_pm10
    else:
        #print(df.columns)
        df_empty = pd.DataFrame(columns=df.columns)
        return df_empty

# Define a function to calculate the concentration value `cc` from an AQI value.
def aqi_to_cc_pm10(x):
    cc = aqi.to_cc(aqi.POLLUTANT_PM10, x, algo=aqi.ALGO_EPA) # calculate PM10 concentration.
    pm10_cc = pd.to_numeric(cc) # convert the returned value to a number.
    return pm10_cc

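# Apply the conversion to every CSV file in a folder
def aqi_to_cc(path):
    # Skip `*_pm10.csv` files written by an earlier run so they are not converted again
    files = [f for f in glob(path + '*.csv') if not f.endswith('_pm10.csv')]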
    for csvfile in files:
        df = pd.read_csv(csvfile, sep = ',') # read csv
        #df['date'] = pd.to_datetime(df['date'], format='%Y/%m/%d') # change `date` column to 'datetime'.
        #df.set_index(['date'], inplace=True) # setting index column.
        #df = df.sort_index() # sort date based on `datetime`.
        # Keep necessary columns
        df_cols = cols(df)
        if df_cols.empty:
            continue
        else:
            # Drop 'NaN' rows
            df_cols = df_cols.dropna()
            df_cols.reset_index(drop=True, inplace=True)
        
            # The maximum AQI is 500, so rows with larger values are dropped
            df_pm10 = df_cols[df_cols['pm10'] <= 500].copy()
            # Convert AQI to CC and store the PM10 concentration in a new column `PM10_ug/m3`.
            df_pm10['PM10_ug/m3'] = df_pm10['pm10'].apply(aqi_to_cc_pm10)

            # Save final dataframe
            df_pm10.to_csv(os.path.splitext(csvfile)[0] + '_pm10.csv',
                           index=False)
            
            # Delete variables
            del df_cols

        del (df, df_pm10)

In [None]:
path = './AQICN_Dataset/Cleand_AQI_data/Iran/'
aqi_to_cc(path)
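
An alternative sketch of the batch step, using `pathlib`, is shown below. It is a variation rather than the notebook's method: the converter is passed in as an argument (for example the `aqi_to_cc_pm10` function above), only rows with a missing `pm10` value are dropped, and files that already end in `_pm10` are skipped so the function can be run repeatedly. The name `convert_folder` is hypothetical.

```python
from pathlib import Path

import pandas as pd


def convert_folder(folder, convert, suffix="_pm10"):
    """Add a `PM10_ug/m3` column to every CSV in `folder` that has `pm10`.

    `convert` maps one AQI value to a concentration. Returns the paths written.
    """
    written = []
    for csv_path in sorted(Path(folder).glob("*.csv")):
        if csv_path.stem.endswith(suffix):
            continue  # output of an earlier run
        df = pd.read_csv(csv_path)
        if "pm10" not in df.columns:
            continue  # nothing to convert in this file
        out = df.dropna(subset=["pm10"])
        out = out[out["pm10"] <= 500].copy()  # AQI is capped at 500
        out["PM10_ug/m3"] = out["pm10"].apply(convert)
        out_path = csv_path.with_name(csv_path.stem + suffix + ".csv")
        out.to_csv(out_path, index=False)
        written.append(out_path)
    return written


# Hypothetical usage:
# convert_folder('./AQICN_Dataset/Cleand_AQI_data/Iran/', aqi_to_cc_pm10)
```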