<i><p style="font-size:24px; background-color: #ff9933; border: 2px dotted black; margin: 20px; padding: 20px;">This kernel is dedicated to all the data preparation, transformation and feature engineering that was previously done in [ChaiEDA: India's Air Quality 2015-20 🇮🇳](https://www.kaggle.com/neomatrix369/chaieda-india-s-air-quality-2015-20). That kernel will now contain the visualisations and narrations, and will make use of the datasets prepared via this kernel. The extended [Air Quality India dataset can be found here](https://www.kaggle.com/neomatrix369/air-quality-data-in-india-extended). Please feel free to use/share it with your own notebooks.

_(The background colour is saffron, which may give you a hint as to why I chose it.)_
    

![](https://nirvanabeing.com/wp-content/uploads/2018/04/iaq_blog_1.jpg)

In [None]:
DATASET_UPLOAD_FOLDER='/kaggle/working/upload'

In [None]:
%%bash
UPLOAD_FOLDER=/kaggle/working/upload
mkdir -p ${UPLOAD_FOLDER}
cp /kaggle/input/air-quality-data-in-india/*.csv ${UPLOAD_FOLDER} || true
cp /kaggle/input/air-quality-data-in-india-extended/*.csv ${UPLOAD_FOLDER} || true
cp /kaggle/input/air-quality-data-in-india-extended/*.fth ${UPLOAD_FOLDER} || true

In [None]:
import os
import warnings
import numpy as np
import pandas as pd
from math import pi
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
from IPython.display import HTML,display

sns.set(style="whitegrid", font_scale=1.75)


# prettify plots
plt.rcParams['figure.figsize'] = [20.0, 5.0]
    
%matplotlib inline

warnings.filterwarnings("ignore")

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
%%time
df_station_hour = pd.read_csv("/kaggle/input/air-quality-data-in-india/station_hour.csv")
df_city_hour    = pd.read_csv("/kaggle/input/air-quality-data-in-india/city_hour.csv")
df_station_day  = pd.read_csv("/kaggle/input/air-quality-data-in-india/station_day.csv")
df_city_day     = pd.read_csv("/kaggle/input/air-quality-data-in-india/city_day.csv")
df_stations     = pd.read_csv("/kaggle/input/air-quality-data-in-india/stations.csv")

In [None]:
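def force_regenerate_dataset(force_regenerate: bool, dataset_name: str, target_dataframe: pd.DataFrame):
    """Write the dataframe as .csv and .fth unless both files already exist and no refresh is forced."""
    csv_filename = f"{DATASET_UPLOAD_FOLDER}/{dataset_name}.csv"
    fth_filename = f"{DATASET_UPLOAD_FOLDER}/{dataset_name}.fth"
    both_files_exist = os.path.exists(fth_filename) and os.path.exists(csv_filename)
    if force_regenerate or not both_files_exist:
        # Report the actual reason, since a forced refresh can happen even when the files exist
        reason = "regeneration forced" if force_regenerate else "file(s) NOT found"
        print(f"{dataset_name}: {reason}, will regenerate the dataset")
        target_dataframe.to_feather(fth_filename)
        target_dataframe.to_csv(csv_filename)
    else:
        print(f"{fth_filename} found, will NOT regenerate the dataset")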

In [None]:
print('Below is a list of columns of tables just as they are loaded:')
print('~~~')
print(f'df_stations: {list(df_stations.columns)}')
print('~~~')
print(f'df_station_day: {list(df_station_day.columns)}')
print('~~~')
print(f'df_station_hour: {list(df_station_hour.columns)}')
print('~~~')
print(f'df_city_day: {list(df_city_day.columns)}')
print('~~~')
print(f'df_city_hour: {list(df_city_hour.columns)}')
print('~~~')

<i><p style="font-size:16px; background-color: #66cdde; border: 2px dotted black; margin: 20px; padding: 20px;">For initial EDA steps to view datasets, please refer to the [original kernel](https://www.kaggle.com/frtgnn). Here we start with some Data transformation and Feature engineering to summarise the information provided.

## Feature Engineering

Why introduce feature engineering already, when we are not going to build a model during EDA? Because the process of feature engineering helps us understand the data much better: summarising it and creating classes brings out structure that is hidden in the granular data.

Then, using these classes and features as stepping stones, we can climb up the layers and get a **"bird's-eye view"**, i.e. a **big picture** of the table of data or the datasets provided.

Such summarisation processes also magnify or isolate bugs or issues or anomalies in the data itself (quality of the data or correctness of the data can be revealed via such processes).
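As a small illustration of this point, here is a sketch on made-up data (the cities `A` and `B` and their readings are hypothetical, not taken from the dataset). Scrolling through the rows does not make the gaps obvious, but a per-city summary does.

```python
import pandas as pd

# Made-up readings for two hypothetical cities (not real data).
toy = pd.DataFrame({
    "City": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "AQI": [80.0, 95.0, 120.0, None, None, None, None, 310.0],
})

# Summarised view: share of missing AQI readings per city.
missing_share = toy["AQI"].isna().groupby(toy["City"]).mean()
print(missing_share)
```

A city with most of its readings missing stands out immediately in the summary, which is the kind of data-quality signal the paragraph above refers to.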

In [None]:
fields_to_show = ['City','AQI_Bucket']

In [None]:
fields_to_ignore = ['StationId', 'StationName', 'State', 'Status', 'Region', 'Month', 'Year', 'Season', 'City', 'Date', 'AQI', 'AQI_Bucket']
# sorted() keeps the column order stable between runs (set order is arbitrary)
names_of_pollutants = sorted(set(df_city_day.columns) - set(fields_to_ignore))
print(f"Names of Pollutants: {names_of_pollutants}")

### Filling in AQI_Bucket missing values across all tables

In [None]:
%%time
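# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about and may not apply to the dataframe
df_station_day['AQI_Bucket'] = df_station_day['AQI_Bucket'].fillna('Unknown')
df_station_hour['AQI_Bucket'] = df_station_hour['AQI_Bucket'].fillna('Unknown')
df_city_day['AQI_Bucket'] = df_city_day['AQI_Bucket'].fillna('Unknown')
df_city_hour['AQI_Bucket'] = df_city_hour['AQI_Bucket'].fillna('Unknown')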

### Adding Regions to the Stations table

Using the classifications mentioned in https://en.wikipedia.org/wiki/Administrative_divisions_of_India

In [None]:
regions = ['1. Northern', '2. North Eastern', '3. Central', '4. Eastern', '5. Western', '6. Southern']
# Each state is assigned to its Zonal Council (or the North Eastern Council)
state_to_region_mapping = {
    'Andhra Pradesh': regions[5], 'Assam': regions[1], 'Bihar': regions[3], 'Chandigarh': regions[0],
    'Delhi': regions[0], 'Gujarat': regions[4], 'Haryana': regions[0], 'Jharkhand': regions[3],
    'Karnataka': regions[5], 'Kerala': regions[5], 'Madhya Pradesh': regions[2], 'Maharashtra': regions[4],
    'Meghalaya': regions[1], 'Mizoram': regions[1], 'Odisha': regions[3], 'Punjab': regions[0],
    'Rajasthan': regions[0], 'Tamil Nadu': regions[5], 'Telangana': regions[5], 'Uttar Pradesh': regions[2],
    'West Bengal': regions[3]
}

def state_to_region(state):
    if state in state_to_region_mapping:
        return state_to_region_mapping[state]
    return 'None'

In [None]:
%%time
df_stations['Region'] = df_stations['State'].apply(state_to_region)
df_stations
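The dictionary lookup above can also be written with `Series.map`, which has the side benefit of making unmapped states easy to list before they are labelled `'None'`. This is only a sketch: `toy_mapping` and `toy_stations` are hypothetical stand-ins for the real mapping and table.

```python
import pandas as pd

# Hypothetical two-entry mapping and toy table, for illustration only.
toy_mapping = {"Delhi": "1. Northern", "Kerala": "6. Southern"}
toy_stations = pd.DataFrame({"State": ["Delhi", "Kerala", "Goa"]})

# Series.map returns NaN for states that are missing from the dict ...
toy_stations["Region"] = toy_stations["State"].map(toy_mapping)
unmapped_states = sorted(toy_stations.loc[toy_stations["Region"].isna(), "State"].unique())

# ... which are then given the same 'None' label that state_to_region returns.
toy_stations["Region"] = toy_stations["Region"].fillna("None")
print(unmapped_states)
```

Listing `unmapped_states` first is a cheap check that no state in the stations table was forgotten in the mapping.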

### Filling in missing values in the Stations table
The `Status` field has NaN values for about 97 of the stations. We fill these with `Unknown`, so that the proportion of active, inactive and unknown-status stations can be compared across the various Regions.

In [None]:
df_stations['Status'] = df_stations['Status'].fillna('Unknown')
df_stations
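One possible way to look at the proportion of station statuses per Region is a normalised cross-tabulation. The sketch below uses a made-up `toy_stations` table (four hypothetical stations), so the numbers say nothing about the real data.

```python
import pandas as pd

# Four hypothetical stations, for illustration only.
toy_stations = pd.DataFrame({
    "Region": ["1. Northern", "1. Northern", "6. Southern", "6. Southern"],
    "Status": ["Active", None, "Active", "Inactive"],
})
toy_stations["Status"] = toy_stations["Status"].fillna("Unknown")

# normalize="index" turns the counts in each row (Region) into shares that sum to 1.
status_share = pd.crosstab(toy_stations["Region"], toy_stations["Status"], normalize="index")
print(status_share)
```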

In [None]:
%%time
force_regenerate_dataset(False, 'stations_transformed', df_stations)

### Adding states and regions to the City_day, Station_day, City_hour, Station_hour tables

In [None]:
%%time
df_city_day = df_city_day.merge(df_stations)
df_city_day[fields_to_show + list(df_stations.columns)]
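One thing to keep in mind with a merge like this: when a city has more than one station, joining a city-level table to the stations table on `City` repeats every city-level row once per station. Whether that is wanted depends on the analysis. The sketch below, on made-up tables (`toy_city_day` and `toy_stations` are hypothetical), shows the effect and one way to avoid it by merging against a de-duplicated city lookup.

```python
import pandas as pd

# Hypothetical city with two stations and two daily readings.
toy_city_day = pd.DataFrame({
    "City": ["X", "X"],
    "Date": ["2020-01-01", "2020-01-02"],
    "AQI": [100, 120],
})
toy_stations = pd.DataFrame({
    "StationId": ["S1", "S2"],
    "City": ["X", "X"],
    "State": ["St", "St"],
    "Region": ["1. Northern", "1. Northern"],
})

# Joining on a key that is not unique on the right repeats the left rows.
naive = toy_city_day.merge(toy_stations, on="City")

# One row per city, so each daily reading is kept exactly once.
city_lookup = toy_stations[["City", "State", "Region"]].drop_duplicates()
deduped = toy_city_day.merge(city_lookup, on="City", how="left", validate="many_to_one")

print(len(naive), len(deduped))
```

The `validate="many_to_one"` argument makes pandas raise an error if the lookup unexpectedly contains the same city twice.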

In [None]:
%%time
df_station_day = df_station_day.merge(df_stations)
df_station_day[fields_to_show + list(df_stations.columns)]

In [None]:
%%time
df_city_hour = df_city_hour.merge(df_stations)
df_city_hour[fields_to_show + list(df_stations.columns)]

In [None]:
%%time
df_station_hour = df_station_hour.merge(df_stations)
df_station_hour[fields_to_show + list(df_stations.columns)]

### Adding AQ_Acceptability, Holidays, Day kind, Month, Year and Seasons to the City_day and Station_day tables

See https://en.wikipedia.org/wiki/Climate_of_India for details on seasons.

Holiday information is based on the PyPI package `holidays`.

`AQ_Acceptability` field is computed by using the idea (from David During) to separate the `AQI_Bucket` values into two different categories i.e. _Acceptable_ (for AQI_Bucket values of Good and Satisfactory) and _Unacceptable_ (all others).


In [None]:
old_and_new_fields_to_show = list(set(['Region', 'Season', 'Year', 'Month', 
                                       'Weekday_or_weekend', 'Regular_day_or_holiday', 'AQ_Acceptability'] + fields_to_show) 
                                  - set(['StationId', 'Date']))

In [None]:
# The country's meteorological department follows the international standard of four seasons with some local adjustments: 
# - winter (January and February)
# - summer (March, April and May) 
# - monsoon (rainy) season (June to September)
# - post-monsoon period (October to December)

date_to_season_mapping = {'1. Winter': [1, 2], '2. Summer': [3, 5], '3. Monsoon': [6, 9],  '4. Post-Monsoon': [10, 12]}

def date_to_season(dates):
    results = []
    date_values = dates.values
    for date in date_values:
        month = int(date.split('-')[1])
        result = 'None'
        for each_season in date_to_season_mapping:
            start, end = date_to_season_mapping[each_season]
            if ((start < end) and (start <= month <= end)) or \
               ((start > end) and ((month >= start) or (month <= end))):
                result = each_season
                break

        results.append(result)
    return results
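For larger tables, the same season lookup could be done without a Python loop by expanding the month ranges into a month-to-season dictionary once and mapping it over the parsed dates. This is a sketch under the same season boundaries as above; `season_ranges`, `month_to_season` and `toy_dates` are names introduced only for this example.

```python
import pandas as pd

# Same inclusive month ranges as in the mapping above.
season_ranges = {
    "1. Winter": (1, 2), "2. Summer": (3, 5),
    "3. Monsoon": (6, 9), "4. Post-Monsoon": (10, 12),
}

# Expand each range into individual months: {1: '1. Winter', 2: '1. Winter', 3: '2. Summer', ...}
month_to_season = {
    month: season
    for season, (start, end) in season_ranges.items()
    for month in range(start, end + 1)
}

toy_dates = pd.Series(["2019-01-15", "2019-04-01", "2019-09-30", "2019-12-25"])
toy_seasons = pd.to_datetime(toy_dates).dt.month.map(month_to_season)
print(list(toy_seasons))
```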

In [None]:
month_no_to_name_mapping = [
    '01. Jan', '02. Feb', '03. Mar', '04. Apr', '05. May', '06. Jun', '07. Jul', 
    '08. Aug', '09. Sep', '10. Oct', '11. Nov', '12. Dec'
]

def date_to_month_name(dates):
    month_values = pd.DatetimeIndex(dates).month.values
    results = []
    for month in month_values:
        result = month_no_to_name_mapping[month - 1]
        results.append(result)
    return results

def weekday_or_weekend(dates):
    results = []
    for date_value in pd.DatetimeIndex(dates.values):
        weekno = date_value.weekday()
        result = "Weekday" if weekno < 5 else "Weekend"
        results.append(result)
    return results
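The weekday/weekend split can likewise be sketched in vectorised form. In pandas, `dt.dayofweek` numbers Monday as 0 and Sunday as 6, so values below 5 are weekdays. The dates in `toy_dates` are arbitrary examples.

```python
import numpy as np
import pandas as pd

# Friday, Saturday, Sunday and Monday in March 2020.
toy_dates = pd.Series(["2020-03-06", "2020-03-07", "2020-03-08", "2020-03-09"])
parsed = pd.to_datetime(toy_dates)

# dayofweek: Monday=0 ... Sunday=6
day_kind = np.where(parsed.dt.dayofweek < 5, "Weekday", "Weekend")
print(list(day_kind))
```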


In [None]:
import holidays
holidays_india = holidays.India()

def regular_day_or_holiday(dates):
    results = []
    for date_value in pd.DatetimeIndex(dates.values):
        result = "Holiday (or Festival)" if date_value.date() in holidays_india else "Regular day"
        results.append(result)
    return results
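On hourly tables the same date shows up many times, so the holiday check could also be done on normalised dates with `isin`. The sketch below does not use the `holidays` package: `toy_holidays` is a hand-written stand-in containing only 26 January (Republic Day) and 15 August (Independence Day), purely to show the shape of the approach.

```python
import numpy as np
import pandas as pd

# Stand-in holiday calendar with two national holidays only (assumption for this sketch).
toy_holidays = pd.to_datetime(["2020-01-26", "2020-08-15"])

toy_datetimes = pd.Series([
    "2020-01-26 09:00:00", "2020-01-27 09:00:00", "2020-08-15 23:00:00",
])

# normalize() drops the time of day, so every hour of a holiday matches.
is_holiday = pd.to_datetime(toy_datetimes).dt.normalize().isin(toy_holidays)
day_labels = np.where(is_holiday, "Holiday (or Festival)", "Regular day")
print(list(day_labels))
```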

In [None]:
def aq_acceptability(aqi_bucket):
    results = []
    for each_aqi_bucket in aqi_bucket.values:
        result = "Acceptable" if each_aqi_bucket \
                in ["Good", "Satisfactory"] else "Unacceptable"
        results.append(result)
    return results
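Note that, with the rule above, rows whose `AQI_Bucket` was filled in as `Unknown` also end up as _Unacceptable_, since they fall under "all others". The sketch below reproduces the two-way split in vectorised form and, as an optional variation that is not part of the original idea, shows a three-way split that keeps `Unknown` separate.

```python
import numpy as np
import pandas as pd

toy_buckets = pd.Series(["Good", "Satisfactory", "Moderate", "Severe", "Unknown"])
is_acceptable = toy_buckets.isin(["Good", "Satisfactory"])

# Two-way split, same rule as aq_acceptability above.
two_way = np.where(is_acceptable, "Acceptable", "Unacceptable")

# Optional variation (assumption): keep missing buckets in their own category.
three_way = np.select(
    [is_acceptable, toy_buckets.eq("Unknown")],
    ["Acceptable", "Unknown"],
    default="Unacceptable",
)
print(list(two_way))
print(list(three_way))
```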

In [None]:
%%time
df_city_day['Month'] = date_to_month_name(df_city_day['Date'])
df_city_day['Year'] = pd.DatetimeIndex(df_city_day['Date']).year
df_city_day['Season'] = date_to_season(df_city_day['Date'])
df_city_day['Weekday_or_weekend'] = weekday_or_weekend(df_city_day['Date'])
df_city_day['Regular_day_or_holiday'] = regular_day_or_holiday(df_city_day['Date'])
df_city_day['AQ_Acceptability'] = aq_acceptability(df_city_day['AQI_Bucket'])

In [None]:
df_city_day[old_and_new_fields_to_show]

In [None]:
%%time
force_regenerate_dataset(False, 'city_day_transformed', df_city_day)

In [None]:
%%time
df_station_day['Month'] = date_to_month_name(df_station_day['Date'])
df_station_day['Year'] = pd.DatetimeIndex(df_station_day['Date']).year
df_station_day['Season'] = date_to_season(df_station_day['Date'])
df_station_day['Weekday_or_weekend'] = weekday_or_weekend(df_station_day['Date'])
df_station_day['Regular_day_or_holiday'] = regular_day_or_holiday(df_station_day['Date'])
df_station_day['AQ_Acceptability'] = aq_acceptability(df_station_day['AQI_Bucket'])

In [None]:
df_station_day[old_and_new_fields_to_show]

In [None]:
%%time
force_regenerate_dataset(False, 'station_day_transformed', df_station_day)

### Adding Holidays, Day kind, Day period to the City_hour and Station_hour tables

Definitions of morning, afternoon, evening, and night as per Wikipedia, with the hour ranges used in this kernel in brackets:
- [Morning](https://en.wikipedia.org/wiki/Morning): the period of time from sunrise to noon (4am to 11:59am)
- [Afternoon](https://en.wikipedia.org/wiki/Afternoon): the time of the day between noon and evening (12pm to 5:59pm)
- [Evening](https://en.wikipedia.org/wiki/Evening): the period of time from the end of the afternoon to the beginning of night (6pm to 7:59pm)
- [Night or nighttime](https://en.wikipedia.org/wiki/Night): the period of ambient darkness from sunset to sunrise (8pm to 3:59am)

Holiday information is based on the PyPI package `holidays`.

`AQ_Acceptability` field is computed by using the idea (from David During) to separate the `AQI_Bucket` values into two different categories i.e. _Acceptable_ (for AQI_Bucket values of Good and Satisfactory) and _Unacceptable_ (all others).

In [None]:
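# Inclusive [start_hour, end_hour] ranges; Night wraps past midnight.
# Night ends at hour 3 because hour 4 already belongs to Morning.
date_to_day_period_mapping = {'1. Morning': [4, 11], '2. Afternoon': [12, 17], 
                              '3. Evening': [18, 19], '4. Night': [20, 3]}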
def date_to_day_period(datetimes):
    results = []
    datetime_values = datetimes.values
    for datetime in datetime_values:
        _, time_of_day = datetime.split(' ')
        hour, _, _ = time_of_day.split(':')
        hour = int(hour)
        result = 'None'
        for each_day_period in date_to_day_period_mapping:
            start, end = date_to_day_period_mapping[each_day_period]
            if ((start < end) and (start <= hour <= end)) or \
               ((start > end) and ((hour >= start) or (hour <= end))):
                result = each_day_period
                break

        results.append(result)
    return results
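As with the seasons, the day-period lookup could be sketched as a precomputed hour-to-period dictionary. The ranges below give the same result as the mapping above for every hour of the day (hour 4 counts as Morning); `hours_in`, `hour_to_period` and `toy_datetimes` are names introduced only for this example.

```python
import pandas as pd

period_ranges = {
    "1. Morning": (4, 11), "2. Afternoon": (12, 17),
    "3. Evening": (18, 19), "4. Night": (20, 3),
}

def hours_in(start, end):
    """Hours covered by an inclusive range, wrapping past midnight when start > end."""
    if start <= end:
        return list(range(start, end + 1))
    return list(range(start, 24)) + list(range(0, end + 1))

hour_to_period = {
    hour: period
    for period, (start, end) in period_ranges.items()
    for hour in hours_in(start, end)
}

toy_datetimes = pd.Series([
    "2020-01-01 03:00:00", "2020-01-01 04:00:00", "2020-01-01 17:00:00",
    "2020-01-01 18:00:00", "2020-01-01 23:00:00",
])
toy_periods = pd.to_datetime(toy_datetimes).dt.hour.map(hour_to_period)
print(list(toy_periods))
```

Building the dictionary first also makes it easy to check that all 24 hours are covered exactly once.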

In [None]:
%%time
df_city_hour['Day_period'] = date_to_day_period(df_city_hour['Datetime'])
df_city_hour['Month'] = date_to_month_name(df_city_hour['Datetime'])
df_city_hour['Year'] = pd.DatetimeIndex(df_city_hour['Datetime']).year
df_city_hour['Season'] = date_to_season(df_city_hour['Datetime'])
df_city_hour['Weekday_or_weekend'] = weekday_or_weekend(df_city_hour['Datetime'])
df_city_hour['Regular_day_or_holiday'] = regular_day_or_holiday(df_city_hour['Datetime'])
df_city_hour['AQ_Acceptability'] = aq_acceptability(df_city_hour['AQI_Bucket'])

In [None]:
# Index with a list: recent pandas versions reject a set as a column indexer
df_city_hour[old_and_new_fields_to_show + ["Day_period"]]

In [None]:
%%time
force_regenerate_dataset(False, 'city_hour_transformed', df_city_hour)

In [None]:
%%time
df_station_hour['Day_period'] = date_to_day_period(df_station_hour['Datetime'])
df_station_hour['Month'] = date_to_month_name(df_station_hour['Datetime'])
df_station_hour['Year'] = pd.DatetimeIndex(df_station_hour['Datetime']).year
df_station_hour['Season'] = date_to_season(df_station_hour['Datetime'])
df_station_hour['Weekday_or_weekend'] = weekday_or_weekend(df_station_hour['Datetime'])
df_station_hour['Regular_day_or_holiday'] = regular_day_or_holiday(df_station_hour['Datetime'])
df_station_hour['AQ_Acceptability'] = aq_acceptability(df_station_hour['AQI_Bucket'])

In [None]:
# Index with a list: recent pandas versions reject a set as a column indexer
df_station_hour[old_and_new_fields_to_show + ["Day_period"]]

In [None]:
%%time
force_regenerate_dataset(False, 'station_hour_transformed', df_station_hour)

## Uploading newly created/updated csv to your Kaggle Dataset

Set up your local environment with your Kaggle login details (`KAGGLE_KEY` and `KAGGLE_USERNAME`).

In [None]:
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()

import os
os.environ['KAGGLE_KEY'] = user_secrets.get_secret("KAGGLE_KEY")
os.environ['KAGGLE_USERNAME'] = user_secrets.get_secret("KAGGLE_USERNAME")

Using the `kaggle` Python client, log in to your account from within the kernel.

In [None]:
import kaggle
kaggle.api.authenticate()

Get the metadata for the dataset you have already created manually. It is best to create the dataset manually and upload the initial csv file(s) into it first, to avoid subsequent issues with updating the dataset (as seen during my own end-to-end cycle).

Before saving the metadata as a json file, add/update two keys, `id` and `id_no`, with the respective details as shown below.

In [None]:
OWNER_SLUG='neomatrix369'
DATASET_SLUG='air-quality-data-in-india-extended'
dataset_metadata = kaggle.api.metadata_get(OWNER_SLUG, DATASET_SLUG)
dataset_metadata['id'] = dataset_metadata["ownerUser"] + "/" + dataset_metadata['datasetSlug']
dataset_metadata['id_no'] = dataset_metadata['datasetId']
import json
with open(f'{DATASET_UPLOAD_FOLDER}/dataset-metadata.json', 'w') as file:
    json.dump(dataset_metadata, file, indent=4)
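To make the two added keys concrete, here is an offline sketch of the same steps on a hand-written dictionary. The key names `ownerUser`, `datasetSlug` and `datasetId` are the ones used in the cell above; the numeric id `0` is a placeholder, and the real metadata returned by the API will contain more fields than shown here.

```python
import json
import os
import tempfile

# Hand-written stand-in for the fetched metadata; the numeric id is a placeholder.
toy_metadata = {
    "ownerUser": "neomatrix369",
    "datasetSlug": "air-quality-data-in-india-extended",
    "datasetId": 0,
}

# Same two keys as above: "<owner>/<slug>" and the numeric dataset id.
toy_metadata["id"] = toy_metadata["ownerUser"] + "/" + toy_metadata["datasetSlug"]
toy_metadata["id_no"] = toy_metadata["datasetId"]

# Write to a temporary folder and read it back to check the file content.
with tempfile.TemporaryDirectory() as upload_folder:
    metadata_path = os.path.join(upload_folder, "dataset-metadata.json")
    with open(metadata_path, "w") as file:
        json.dump(toy_metadata, file, indent=4)
    with open(metadata_path) as file:
        reloaded = json.load(file)

print(reloaded["id"])
```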

Finally, call the `dataset_create_version()` API and pass it the folder that contains the metadata file together with the `.csv` and `.fth` file(s) you would like to upload into your existing Dataset (as a new version).

In [None]:
%%time
# !kaggle datasets version -m "Updating datasets" -p /kaggle/working/upload
kaggle.api.dataset_create_version(DATASET_UPLOAD_FOLDER, 'Updating datasets')

## Credits

- Forked from [Firat Gonen](https://www.kaggle.com/frtgnn)'s [Clean Air? India's Air Quality 🇮🇳](https://www.kaggle.com/frtgnn/clean-air-india-s-air-quality) kernel - thanks for the foundation work
- David During for all the insights during the ChaiEDA sessions, and also building on his idea of the KPI based on the AQI Index
