<i><p style="font-size:20px; background-color: #ff9933; border: 2px dotted black; margin: 20px; padding: 20px;">This kernel is dedicated to all the visualisations and narrations of Air Quality from the Cities of India perspective, prepared to submit to the [Task](https://www.kaggle.com/rohanrao/air-quality-data-in-india/tasks?taskId=1877) created by [David Dirring](https://www.kaggle.com/romandovega).

<i><p style="font-size:20px; background-color: #ff9936; border: 2px dotted black; margin: 20px; padding: 20px;">It makes use of the datasets prepared via [ChaiEDA: India's Air Quality 2015-20 🇮🇳: Data Prep](https://www.kaggle.com/neomatrix369/chaieda-india-s-air-quality-2015-20-data-prep/). The extended [Air Quality India  dataset can be found here](https://www.kaggle.com/neomatrix369/air-quality-data-in-india-extended).

<i><p style="font-size:20px; background-color: #ff9936; border: 2px dotted black; margin: 20px; padding: 20px;">Some of the content comes from its predecessor [ChaiEDA: India's Air Quality 2015-20 🇮🇳](https://www.kaggle.com/neomatrix369/chaieda-india-s-air-quality-2015-20), where you can also see other perspectives across both locations (Regions, States) and time periods (Years, Seasons, Months, Days and Day Periods).
    

<a id='ToC'></a>

----------

# Table of contents

- [Installing packages and libraries](#installation)
- [Loading datasets](#loading_datasets)
- [Commonly used variables and functions](#common_functions)
- [Summary](#summary)
- [Filling missing values (rest of the four tables)](#missing_values_four_tables)
- [Filling missing values (Stations table)](#missing_values)
   - After filling in missing values in the original Stations table
- [Station's status analysis](#station_status_analysis)
   - city_day table: Inactive and Unknown status Stations across the Cities
   - city_hour table: Inactive and Unknown status Stations across the Cities
   - station_day table: Inactive and Unknown status Stations across the Cities
   - station_hour table: Inactive and Unknown status Stations across the Cities
- [Pivot tables, visualisations and charts: CO levels and AQI Bucket/AQI Acceptability](#pivots_visualisations_charts)
   - CO levels across Cities and Years (using city_day table)
   - CO levels across Cities and Seasons (using city_day table)
   - CO levels across Cities and Months (using city_day table)
   - CO levels across Cities and Weekday/weekend (using city_hour table)
   - CO levels across Cities and Holiday/Regular day (using city_hour table)
   - CO levels across Cities and Day period (using city_hour table)
- [Trying to understand the interaction of AQI/AQI Bucket/AQ Acceptability and Cities](#aqi_bucket_and_acceptability)
   - AQI Bucket and Cities
   - AQI Bucket and Cities: readings available across all bucket types
   - AQI Bucket and Cities: readings available across one or more bucket types (but not all)
   - AQ Acceptability and Cities: listed by Acceptable AQ
   - AQ Acceptability and Cities: listed by Unacceptable AQ

<a id='installation'></a>

----------
## Installing packages and libraries

In [None]:
import os
import warnings
import numpy as np
import pandas as pd
from math import pi
import seaborn as sns
import plotly.express as px
import matplotlib
import matplotlib.pyplot as plt
from IPython.display import HTML,display

sns.set(style="whitegrid", font_scale=1.75)


# prettify plots
plt.rcParams['figure.figsize'] = [20.0, 5.0]
    
%matplotlib inline

warnings.filterwarnings("ignore")

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

<a id='loading_datasets'></a>

----------

## Loading datasets

In [None]:
%%time
DATASET_INPUT_DIR = '/kaggle/input/air-quality-data-in-india-extended'
df_station_hour = pd.read_feather(f"{DATASET_INPUT_DIR}/station_hour_transformed.fth")
df_city_hour    = pd.read_feather(f"{DATASET_INPUT_DIR}/city_hour_transformed.fth")
df_station_day  = pd.read_feather(f"{DATASET_INPUT_DIR}/station_day_transformed.fth")
df_city_day     = pd.read_feather(f"{DATASET_INPUT_DIR}/city_day_transformed.fth")
df_stations     = pd.read_feather(f"{DATASET_INPUT_DIR}/stations_transformed.fth")

In [None]:
print('Below is a list of columns of tables just as they are loaded:')
print('~~~')
print(f'df_stations: {list(df_stations.columns)}')
print('~~~')
print(f'df_station_day: {list(df_station_day.columns)}')
print('~~~')
print(f'df_station_hour: {list(df_station_hour.columns)}')
print('~~~')
print(f'df_city_day: {list(df_city_day.columns)}')
print('~~~')
print(f'df_city_hour: {list(df_city_hour.columns)}')
print('~~~')

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

<a id='common_functions'></a>

----------

## Commonly used variables and functions

In [None]:
fields_to_show = ['City','AQI_Bucket']

In [None]:
fields_to_ignore = ['StationId', 'StationName', 'State', 'Status', 'Region', 'Month', 'Year', 'Season', 'City', 'Date', 'AQI', 'AQI_Bucket']
names_of_pollutants = list(set(df_city_day.columns) - set(fields_to_ignore))
print(f"Names of Pollutants: {list(names_of_pollutants)}")

In [None]:
def plot_chart(dataframe, width=20.0, height=7.0, 
               title='<No title assigned>', xlabel_title='<No xlabel title assigned>', 
               ylabel_title='<No ylabel title assigned>', stacked=False):
    plt.rcParams['figure.figsize'] = [width, height]
    font = {'size': 16}
    matplotlib.rc('font', **font)
    ax = dataframe.plot(kind='barh', stacked=stacked, title=title)
    ax.set_ylabel(ylabel_title)
    ax.set_xlabel(xlabel_title)

In [None]:
def plot_sns_chart(dataframe, y_axis, class_separator=None, width=20.0, height=9.0, 
                   font_scale=2, xlabel_title="Xlabel title not set", ylabel_title="Ylabel title not set"):
    plt.rcParams['figure.figsize'] = [width, height]
    sns.set(font_scale=font_scale)
    g = sns.countplot(y=y_axis, hue=class_separator, data=dataframe)
    g.set(xlabel=xlabel_title, ylabel=ylabel_title)

In [None]:
def calculate_percentage(dataframe: pd.DataFrame) -> pd.DataFrame:
    """Express each value as a percentage of the sum of the positive values in its row."""
    # Missing values count as 0 (writing to rows yielded by iterrows() is not guaranteed to persist)
    values = dataframe.fillna(0.0)
    # Only positive values contribute to the row total
    row_totals = values.where(values > 0.0, 0.0).sum(axis=1)
    # Rows whose total is 0 would divide by zero; they become all zeros instead
    safe_totals = row_totals.where(row_totals > 0.0)
    return values.div(safe_totals, axis=0).fillna(0.0) * 100
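
As a quick sanity check of what `calculate_percentage` is meant to produce, here is a minimal sketch on a made-up pivot table. The city names and values are purely illustrative and do not come from the dataset. It re-implements the row-share idea with vectorised pandas calls instead of calling the function above, so it can be run on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical pivot table: mean CO per city (rows) and year (columns)
toy_pivot = pd.DataFrame(
    {2015: [2.0, np.nan], 2016: [6.0, 0.0], 2017: [2.0, 5.0]},
    index=["CityA", "CityB"],
)

# Treat missing values as 0, then express each value as a share of its row total
values = toy_pivot.fillna(0.0)
row_totals = values.sum(axis=1)
toy_share = values.div(row_totals, axis=0) * 100

print(toy_share)
```

Each row of `toy_share` adds up to 100, which is what lets the stacked bar charts further down compare cities on the same scale.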

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

<a id='summary'></a>

----------

> ## Summary

> - the dataset provided has missing values and low-reading counts across various aspects
>   - multiple tables have missing values for features like pollutants, AQI and AQI Bucket (reducing resolution and reliability)
>   - there is a discrepancy in the statuses of the stations providing readings (although might be a less serious issue)
>   - readings provided by inactive or unknown status stations - these could be considered less reliable, erroneous or inconsistent
>   - number of readings between cities and states vary due to the available stations assigned to each location (impacts resolution and reliability)
> - it is indeed tricky to make conclusions given the above situation with the data provided
> - we have two groups of cities that could be helped to improve the Air Quality:
>   - cities with known low-rate of Acceptable AQ (low AQI values) or high-rate of Unacceptable AQ (high AQI values)
>     - **Guwahati, Ahmedabad and Delhi**
>   - cities whose Acceptable AQ or Unacceptable AQ values cannot be easily determined due to one or more of the issues outlined above
>     - **Ernakulam, Kochi and Lucknow (getting readings from Inactive and Unknown status stations)**
>     - **Aizawl, Shillong, Coimbatore, Ernakulam, Mumbai, Thiruvananthapuram, Visakhapatnam, Kochi, Brajrajnagar, Amaravati, Bhopal, Chandigarh (missing AQI Bucket across one or more bucket types)**
> - so on one hand we have cities whose readings and metrics show an urgent need for help with their AQ, while on the other hand we are struggling to determine the same for those that don't have reliable infrastructure
> - unreliability and uncertainty could also be due to negligence or lack of resources to maintain a stable station capturing important AQ measurements
> - while it may appear that cities with bigger AQ issues are more important, fixes there could also be expensive and technically or politically hard to implement
> - on the other hand, smaller cities that might be struggling to keep up with the latest infrastructure and a systematic approach might be more feasible and less expensive to fix
> - **bigger cities are harder to change and keep up with and the lessons are not learnt easily for a long time**
> - while smaller cities may be relatively easier to guide and bring up to speed and discipline, and these might spread the word to neighbouring cities and towns and help foster better well-being in those areas
> - also these small cities when they grow and get better connected with other big and small cities will be better equipped to handle such situations
> - **lessons learnt from simpler implementations can be passed on to other experiments and implementations across the country, sometimes fixing small and simple things from the top five list can resolve many other items in the rest of the list**
> - we haven't come up with a way to assess the correctness of the data and the accuracy of our analysis and interpretations
> - we need more data about the regions, states and cities to understand a bit more about what is leading to the Air Quality
> - and we haven't come up with a way to assess which cities have a bigger need for help (these could be a mix of the cities from the two groups)



<a id='missing_values_four_tables'></a>

----------

## Filling missing values (rest of the four tables)

In [None]:
df_station_hour['AQI_Bucket'] = df_station_hour['AQI_Bucket'].fillna('Unknown')
df_city_hour['AQI_Bucket']    = df_city_hour['AQI_Bucket'].fillna('Unknown')
df_station_day['AQI_Bucket']  = df_station_day['AQI_Bucket'].fillna('Unknown')
df_city_day['AQI_Bucket']     = df_city_day['AQI_Bucket'].fillna('Unknown')
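
Before filling, it can help to check how many `AQI_Bucket` values are actually missing. The sketch below does this on toy data. It also covers one assumption worth labelling: if the column happens to be stored as a pandas categorical (the dtype after reading the feather files is not shown in this notebook), the new `Unknown` label has to be registered as a category first, otherwise `fillna` raises an error.

```python
import pandas as pd

# Toy column standing in for AQI_Bucket, stored as a categorical (an assumption)
toy = pd.DataFrame({"AQI_Bucket": pd.Categorical(["Good", None, "Poor", None])})

# Count the gaps before filling
missing_before = int(toy["AQI_Bucket"].isna().sum())

# Register the new label as a category, then fill the gaps with it
toy["AQI_Bucket"] = toy["AQI_Bucket"].cat.add_categories("Unknown").fillna("Unknown")
missing_after = int(toy["AQI_Bucket"].isna().sum())

print(missing_before, missing_after)
```

For a plain string (object) column the `add_categories` step is not needed and the one-line `fillna('Unknown')` used above is enough.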

<a id='missing_values'></a>

----------

## Filling missing values (Stations table)

### After filling in missing values in the original Stations table
The `Status` field has a number of NaN values (about 97 of the stations). The charts below also show the proportion of inactive, active and unknown status stations across the various Cities.

In [None]:
plot_sns_chart(df_stations, 'Status', 
               xlabel_title="Number of stations by Status",
               ylabel_title="Station Status", height=5.0)

In [None]:
active_stations_filter = df_stations['Status'] == 'Active'
plot_sns_chart(df_stations[active_stations_filter].sort_values(by='City'), 'City', 'Status', 
               xlabel_title="Number of Active stations per City", ylabel_title = "Cities",
               height=30.0, font_scale=2)

In [None]:
inactive_stations_filter = df_stations['Status'] == 'Inactive'
plot_sns_chart(df_stations[inactive_stations_filter].sort_values(by='City'), 'City', 'Status', 
               xlabel_title="Number of Inactive stations per City", ylabel_title = "Cities",
               height=3, font_scale=2)

In [None]:
unknown_stations_filter = df_stations['Status'] == 'Unknown'
plot_sns_chart(df_stations[unknown_stations_filter].sort_values(by='City'), 'City', 'Status', 
               xlabel_title="Number of Unknown status stations per City", ylabel_title = "Cities",
               height=35.0, font_scale=2)

<i><p style="font-size:16px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">From the above graphs we can see that there is a big divide between the spread of `Active` and `Unknown` status stations in various cities across the country. As we already know, only ~52% of the stations are `Active`, while the rest are of `Inactive` or `Unknown` status. Also, it's the smaller cities that have more `Inactive` or `Unknown` status stations compared to the larger or main cities of the various regions. India being a vast country, this makes it harder to get a closer and more accurate picture of the Air Quality across the country in a uniform and fair fashion throughout the year, across all the seasons.
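
The share of stations per status can also be computed directly rather than read off the charts. Below is a minimal sketch on a toy stations table. The `City` and `Status` column names match the notebook, but the rows are made up, so the numbers will not reproduce the ~52% figure quoted above.

```python
import pandas as pd

# Toy stand-in for the stations table (rows are illustrative only)
toy_stations = pd.DataFrame({
    "City": ["CityA", "CityA", "CityB", "CityB", "CityC"],
    "Status": ["Active", "Unknown", "Active", "Inactive", "Unknown"],
})

# Percentage of stations per status across the whole table
status_share = toy_stations["Status"].value_counts(normalize=True) * 100

# Number of stations per status within each city
status_by_city = pd.crosstab(toy_stations["City"], toy_stations["Status"])

print(status_share)
print(status_by_city)
```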

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

<a id='station_status_analysis'></a>

----------

## Station's status analysis

### Let's check the status of the Inactive and Unknown status Stations across the Cities again

We will be checking this across the four tables: city_day, city_hour, station_day and station_hour

### city_day table: Inactive and Unknown status Stations across the Cities

In [None]:
filter_not_active = df_city_day['Status'] != 'Active'
plot_sns_chart(df_city_day[filter_not_active], 'Status', 
               xlabel_title="Number of readings from the stations by Status", ylabel_title = "Station Status",
               width=15.0, height=3.0)

### city_hour table: Inactive and Unknown status Stations across the Cities

In [None]:
filter_not_active = df_city_hour['Status'] != 'Active'
plot_sns_chart(df_city_hour[filter_not_active], 'Status',
               xlabel_title="Number of readings from the stations by Status", ylabel_title = "Station Status",
               width=15.0, height=3.0)

In [None]:
plot_sns_chart(df_city_hour[filter_not_active].sort_values(by='City'), 'City', 'Status',
               xlabel_title="Number of readings from the stations by Status per City", ylabel_title = "Cities",
               width=15.0, height=5.0)

### station_day table: Inactive and Unknown status Stations across the Cities

In [None]:
filter_not_active = df_station_day['Status'] != 'Active'
plot_sns_chart(df_station_day[filter_not_active], 'Status',
               xlabel_title="Number of readings from the stations by Status", ylabel_title = "Station Status",
               width=15.0, height=3.0)

In [None]:
plot_sns_chart(df_station_day[filter_not_active].sort_values(by='City'), 'City', 'Status',
               xlabel_title="Number of readings from the stations by Status per City", ylabel_title = "Cities",
               width=15.0, height=5.0)

### station_hour table: Inactive and Unknown status Stations across the Cities

In [None]:
filter_not_active = df_station_hour['Status'] != 'Active'
plot_sns_chart(df_station_hour[filter_not_active], 'Status',
               xlabel_title="Number of readings from the stations by Status", ylabel_title = "Station Status",
               width=15.0, height=3.0)

In [None]:
plot_sns_chart(df_station_hour[filter_not_active].sort_values(by='City'), 'City', 'Status',
               xlabel_title="Number of readings from the stations by Status per City", ylabel_title = "Cities",
               width=15.0, height=5.0)

<i><p style="font-size:18px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">One thing is clear: only a handful of cities across the country are getting readings from `Inactive` or `Unknown` status stations. The question remains why `Inactive` stations would give readings. Is it that they gave readings for a period and then stopped? To be precise, we have readings from stations with Inactive/Unknown status in Delhi, Lucknow, Ernakulam and Kochi. The number of readings provided is also not very large, which means that if they are unreliable then the impact is mixed, depending on the total number of Active stations contributing readings to these regions.
    
<i><p style="font-size:18px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">The same goes for the `Unknown` status stations: is it a similar situation, where the dataset wasn't kept up-to-date? In that case, can we mark the stations with both `Inactive` and `Unknown` status back to `Active`, since they are producing readings? Do we need to inquire with the sources before doing this? For some reason, many of the stations in the **Northern** cities of the country have an `Unknown` status. A very small number of stations in the **Southern** cities are `Inactive` as well. 
    
<i><p style="font-size:24px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">_How should funds be deployed to improve the Air Quality of select cities, which cities and why? Should the investments be made in improving the infrastructure that enables capturing data across the cities and states? -- Such investment decisions are quite reliant on the quality of the information provided to make them!_
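
One way to probe whether non-active stations "gave readings for a period and then stopped" is to look at the first and last reading date per station. The sketch below uses a toy table. The `StationId`, `Status` and `Date` column names are assumed to match the station tables, and the rows are made up.

```python
import pandas as pd

# Toy stand-in for a station-level readings table (rows are illustrative only)
toy_readings = pd.DataFrame({
    "StationId": ["S1", "S1", "S1", "S2", "S2", "S3"],
    "Status": ["Inactive"] * 3 + ["Active"] * 2 + ["Unknown"],
    "Date": pd.to_datetime(["2015-01-01", "2015-06-01", "2016-03-01",
                            "2019-01-01", "2020-05-01", "2018-02-01"]),
})

# Keep only the readings that come from stations that are not Active
not_active = toy_readings[toy_readings["Status"] != "Active"]

# First reading, last reading and number of readings per station
reading_span = (
    not_active.groupby(["StationId", "Status"])["Date"]
    .agg(first_reading="min", last_reading="max", readings="count")
    .reset_index()
)

print(reading_span)
```

If the last reading of an `Inactive` station lies well before the end of the dataset, that would be consistent with a station that reported for a while and was then switched off. It would not prove it, though, so checking with the data source remains the safer route.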

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

<a id='pivots_visualisations_charts'></a>

----------


## Pivot tables, visualisations and charts: CO levels and AQI Bucket/AQI Acceptability

Below we picked just one of the pollutants (`CO`) to establish that a similar approach could be applied to the other pollutants. For now, let's also assume that the other pollutants behave in the same way; this hasn't been verified and could be false, but it will become apparent once each of them is checked.

### CO levels across Cities, Years, Seasons, Months, Day periods (i.e. Morning, Evening, etc...), Weekend/weekday, Holidays or regular days

Here we are using the city_day and city_hour tables, as the station_xxx tables are too specific to stations and a summarised view of the City is more than sufficient.
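
To illustrate how the same approach could be reused for other pollutants, here is a minimal sketch of a helper that builds the pivot table and the row percentages in one go. The name `pollutant_share_by` is hypothetical and not part of this notebook, the toy rows are made up, and `NO2` is only assumed to be one of the pollutant columns.

```python
import numpy as np
import pandas as pd

def pollutant_share_by(dataframe, pollutant, group_column):
    """Hypothetical helper: mean of `pollutant` per City and `group_column`, as row percentages."""
    pivot = dataframe.pivot_table(values=pollutant, index="City",
                                  columns=group_column, aggfunc="mean")
    pivot = pivot.fillna(0.0)
    totals = pivot.sum(axis=1)
    # Rows with a zero total become all zeros instead of dividing by zero
    return pivot.div(totals.where(totals > 0), axis=0).fillna(0.0) * 100

# Toy stand-in for the city_day table (rows are illustrative only)
toy_city_day = pd.DataFrame({
    "City": ["CityA", "CityA", "CityA", "CityB", "CityB"],
    "Year": [2015, 2015, 2016, 2015, 2016],
    "CO": [1.0, 3.0, 2.0, 4.0, np.nan],
    "NO2": [10.0, 30.0, 60.0, 5.0, 15.0],
})

co_share = pollutant_share_by(toy_city_day, "CO", "Year")
no2_share = pollutant_share_by(toy_city_day, "NO2", "Year")

print(co_share)
print(no2_share)
```

Unlike the cells below, this sketch averages only the readings that exist and fills the gaps afterwards. Filling with 0 before averaging would pull the means down for cities with many missing readings.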

### CO levels across Cities and Years (using city_day table)

In [None]:
%%time
df_city_day_fill_missing_values = df_city_day.fillna(0.00)
df_city_year_pivot_table = df_city_day_fill_missing_values.pivot_table(values='CO', index='City', columns='Year', aggfunc=np.mean)
df_city_year_pivot_table = calculate_percentage(df_city_year_pivot_table)

In [None]:
plot_chart(df_city_year_pivot_table.sort_values(by=[2015, 2016, 2017]), 
           title='CO levels across Cities and Years', 
           xlabel_title=f'{df_city_year_pivot_table.shape[0]} cities in total',
           ylabel_title='Cities', 
           height=20.0,
           stacked=True)

<i><p style="font-size:20px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">It's visible that not all cities have readings for all the years; we will explore this further when looking at the breakdown for each type of missing category.

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

In [None]:
co_levels_all_readings_filter = True
for each_year in list(df_city_year_pivot_table.columns):
    co_levels_all_readings_filter = co_levels_all_readings_filter & (df_city_year_pivot_table[each_year] > 0.0)
filtered_dataset = df_city_year_pivot_table[co_levels_all_readings_filter].sort_values(by=[2015, 2020])
plot_chart(filtered_dataset.sort_values(by=[2015, 2016, 2017]), 
           title='CO levels across Cities and Years (with readings for all years)', 
           xlabel_title=f'CO levels across {filtered_dataset.shape[0]} out of {df_city_year_pivot_table.shape[0]} cities',
           ylabel_title='Cities', 
           height=10.0,
           stacked=True)

<i><p style="font-size:20px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">Here are the handful of cities, from among the **26** cities, that have readings for all the years between 2015 and 2020: about 8 of the 26 cities, i.e. only ~31% of the total. 
    
<i><p style="font-size:20px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">Among the cities where a trend is visible, Bengaluru, Delhi, Lucknow and Chennai show marked improvements from 2015 through 2020 (although we don't have all the data for 2020).
    
<i><p style="font-size:20px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">On the other hand, cities like Ahmedabad, Gurugram, Patna and Hyderabad do not give a clear indication of improvement from 2015 through 2020 (although we don't have all the data for 2020). If observed closely, we see a rise and fall rather than a consistent drop in the levels. This is something to look into and might be specific to these kinds of cities, even though they are spread far apart across the country.
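
The "readings for all years" filter built in the loop above can also be written as a single vectorised expression. The sketch below shows this on a made-up percentage table (city names and values are illustrative only) and computes the share of complete cities in the same way as the 8-of-26 figure, which works out to about 30.8%.

```python
import pandas as pd

# Toy stand-in for the city/year percentage table (0 means no reading for that year)
toy_year_pivot = pd.DataFrame(
    {2015: [40.0, 0.0, 30.0], 2016: [35.0, 60.0, 0.0], 2017: [25.0, 40.0, 70.0]},
    index=["CityA", "CityB", "CityC"],
)

# True for cities that have a positive value in every year column
has_all_years = (toy_year_pivot > 0.0).all(axis=1)

complete_cities = toy_year_pivot.index[has_all_years].tolist()
complete_share = has_all_years.mean() * 100  # percentage of cities with all years

print(complete_cities, round(complete_share, 1))
```

Negating the mask (`~has_all_years`) gives the complementary group of cities with readings for some, but not all, years.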

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

In [None]:
co_levels_any_readings_filter = ~co_levels_all_readings_filter
filtered_dataset = df_city_year_pivot_table[co_levels_any_readings_filter].sort_values(by=[2015, 2020])
plot_chart(filtered_dataset.sort_values(by=[2015, 2016, 2017]), 
           title='CO levels across Cities and Years (with readings for one or more years but not all years)', 
           xlabel_title=f'CO levels across {filtered_dataset.shape[0]} out of {df_city_year_pivot_table.shape[0]} cities',
           ylabel_title='Cities', 
           height=10.0,
           stacked=True)

<i><p style="font-size:20px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">Here are the cities with readings for one or more years between 2015 and 2020, but not for all the years: about 18 of the 26 cities, i.e. ~69% of the total. This means we can only rely on readings from ~31% of the cities when trying to decide where help is most needed.

<i><p style="font-size:20px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">Firstly, it's clear we do not have enough information about `CO` levels across the years 2015 to 2020 for many cities, so it's hard to say whether levels in some regions have improved, deteriorated or remained unchanged. Another perspective: in addition to helping the cities where the levels are visible, we could first improve the quality and availability of the readings in the other cities and, once their situation is established, compare and analyse them against other cities before making a decision.

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

### CO levels across Cities and Seasons (using city_day table)

In [None]:
%%time
df_city_day_fill_missing_values = df_city_day.fillna(0.00)
df_city_season_pivot_table = df_city_day_fill_missing_values.sort_values(by=['City'], ascending=True) \
                                .pivot_table(values='CO', index='City', columns='Season', aggfunc=np.mean)
df_city_season_pivot_table = df_city_season_pivot_table.sort_values(by='1. Winter')
df_city_season_pivot_table = calculate_percentage(df_city_season_pivot_table)

In [None]:
plot_chart(df_city_season_pivot_table.sort_values(['1. Winter', '2. Summer']), 
           title='CO levels across Cities and Seasons', 
           xlabel_title=f'CO levels across {df_city_season_pivot_table.shape[0]} cities',
           ylabel_title='Cities', height=20.0, stacked=True)

<i><p style="font-size:20px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">Clearly, cities like Gurugram, Lucknow, Shillong and Guwahati, among a handful of others, are impacted by `CO` in the air during the **Winter** and **Post-Monsoon** months, and this could be due to various factors: industrialisation, altitude, population, vehicle usage, forests/vegetation, etc. These could be analysed separately. This also leads to the question of which season(s) should be considered when deciding where to help, as opposed to other seasons when the levels are comparatively lower. Aizawl does not have readings for all the seasons, which makes it hard to assess where it stands with regard to `CO` levels during some parts of the year or during a particular season. The same goes for other cities that lack data for one or more seasons.
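
To make the "lack of data across one or more seasons" concrete, the sketch below lists the seasons with no reading for each city. It works on a toy pivot table where gaps are still NaN. In this notebook the gaps are filled with 0 before pivoting, so the equivalent check there would look for zeros instead. The season labels follow the `'1. Winter'` and `'2. Summer'` naming used above, and the cities and values are made up.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the city/season pivot table, with NaN where there are no readings
toy_season_pivot = pd.DataFrame(
    {"1. Winter": [1.2, np.nan], "2. Summer": [0.8, 0.5]},
    index=["CityA", "CityB"],
)

# For each city, collect the seasons that have no reading at all
missing_seasons = {
    city: [season for season in toy_season_pivot.columns if pd.isna(row[season])]
    for city, row in toy_season_pivot.iterrows()
}

# Keep only the cities that are missing at least one season
missing_seasons = {city: seasons for city, seasons in missing_seasons.items() if seasons}

print(missing_seasons)
```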

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

### CO levels across Cities and Months  (using city_day table)

In [None]:
%%time
df_city_day_fill_missing_values = df_city_day.fillna(0.00)
df_city_month_pivot_table = df_city_day_fill_missing_values.sort_values(by=['City'], ascending=False) \
                                .pivot_table(values='CO', index='City', columns='Month', aggfunc=np.mean)
df_city_month_pivot_table = calculate_percentage(df_city_month_pivot_table)

In [None]:
plot_chart(df_city_month_pivot_table.sort_values(by=['01. Jan', '02. Feb', '03. Mar']), 
           title='CO levels across Cities and Months', 
           xlabel_title='CO levels across 12 months', ylabel_title='Cities', 
           height=20.0, stacked=True)

<i><p style="font-size:20px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">The month-to-month trend from the above appears to be a mixed bag for different cities. When the cities are ordered by their levels in the first month of the year, this tallies with the other trends we have seen so far.

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

### CO levels across Cities and Weekday/weekend (using city_hour table)

In [None]:
%%time
df_city_hour_pivot_table = df_city_hour.sort_values(by=['City'], ascending=True) \
                                .pivot_table(values='CO', index='City', columns='Weekday_or_weekend', aggfunc=np.mean)
df_city_hour_pivot_table = df_city_hour_pivot_table.sort_values(by='Weekday')

In [None]:
plot_chart(df_city_hour_pivot_table.sort_values('Weekday'), 
           title='CO levels across Cities and Weekday/weekend', 
           xlabel_title='CO Levels',
           ylabel_title='Cities',
           stacked=False,
           height=30.0)

<i><p style="font-size:20px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">We can see why the Western Region is the biggest contributor: one particular city, **Ahmedabad**, is the biggest contributor, followed by other cities. The difference between the levels of `CO` on Weekdays and Weekends isn't large: for some cities it's a bit higher on weekends and for others a bit higher on weekdays. Weekdays seem to be slightly higher than Weekends in most cases, except for Northern and Central cities. The Western cities clearly show a larger contribution of `CO` during the weekdays; Ahmedabad alone contributes about as much as most, if not all, of the other cities put together. We can also see groupings among the other cities where the levels are similar in proportion.
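
Since the weekday/weekend differences are small, they are easier to judge as numbers than as bar lengths. Here is a minimal sketch on a toy pivot table. The `Weekday` column label appears in the code above, the `Weekend` label is assumed, and the cities and values are made up.

```python
import pandas as pd

# Toy stand-in for the city_hour pivot table (mean CO per city)
toy_hour_pivot = pd.DataFrame(
    {"Weekday": [2.0, 1.0, 0.5], "Weekend": [1.5, 1.2, 0.5]},
    index=["CityA", "CityB", "CityC"],
)

# Positive values: higher on weekdays; negative values: higher on weekends
difference = (toy_hour_pivot["Weekday"] - toy_hour_pivot["Weekend"]).sort_values()

higher_on_weekends = difference[difference < 0].index.tolist()
higher_on_weekdays = difference[difference > 0].index.tolist()

print(difference)
```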

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

### CO levels across Cities and Holiday/Regular day (using city_hour table)

In [None]:
%%time
df_city_hour_pivot_table = df_city_hour.sort_values(by=['City'], ascending=True) \
                                .pivot_table(values='CO', index='City', columns='Regular_day_or_holiday', aggfunc=np.mean)
df_city_hour_pivot_table = calculate_percentage(df_city_hour_pivot_table)
df_city_hour_pivot_table = df_city_hour_pivot_table.sort_values(by='Holiday (or Festival)')

In [None]:
plot_chart(df_city_hour_pivot_table, 
           title='CO levels across City and Holiday/Regular day', 
           xlabel_title='CO levels',
           ylabel_title='Cities',
           stacked=True,
           height=25.0)

<i><p style="font-size:20px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">Everywhere in the country a Holiday or Festival day means a rise in `CO` (although the relative differences are not large), except for cities in the Western region, where regular days contribute a lot more `CO` than Holiday or Festival days. For the Western region this could be mainly due to less usage of vehicles, less traffic, fewer public services, etc. during holidays. It's not clear why it is the opposite for the rest of the country; this could be investigated and analysed further. Interestingly, in Ahmedabad, Ernakulam, Kochi and the like, regular days contribute more `CO` than Holidays or Festivals, while Gurugram, Guwahati, Visakhapatnam, Kolkata and the like experience higher levels during Holidays.

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

### CO levels across Cities and Day period (using city_hour table)

In [None]:
%%time
df_city_hour_pivot_table = df_city_hour.sort_values(by=['City'], ascending=True) \
                                .pivot_table(values='CO', index='City', columns='Day_period', aggfunc=np.mean)
df_city_hour_pivot_table = calculate_percentage(df_city_hour_pivot_table)
df_city_hour_pivot_table = df_city_hour_pivot_table.sort_values(by=['4. Night', '1. Morning', '2. Afternoon'])

In [None]:
plot_chart(df_city_hour_pivot_table, title='CO levels across Cities and Day period', 
           xlabel_title='CO levels across the day', 
           ylabel_title='Cities', 
           stacked=True, 
           height=20.0)

<i><p style="font-size:20px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">During night time there is, more or less, a lot more `CO` released in all cities across the country compared to the morning. It appears that the level of `CO` rises from morning to afternoon to evening and then into the night for most of the regions, before dropping again at the end of the cycle. Cities like Ahmedabad, Guwahati, Kolkata and the like are impacted by this. Does it have to do with their geographic location, climate, altitude, population or other factors? This is best found out by further investigation.

<i><p style="font-size:18px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">Just from the summarised fields (newly added features) we can already start making assessments about the time of the year and/or time of the day and/or city when the levels of any single pollutant or group of pollutants are high, medium or low, or at healthy or unhealthy levels. 
    
<i><p style="font-size:18px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">Also the status of the stations across the various Cities can be known as well. In fact if we dig deeper we could see during which periods which stations gave readings and which didn't. Although this path may be of less importance than the analysis of the Air Quality information itself.

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

<a id='aqi_bucket_and_acceptability'></a>

----------


## Trying to understand the interaction of AQI/AQI Bucket/AQ Acceptability and Cities

### AQI Bucket and Cities

In [None]:
df_station_day_pivot_table = df_station_day.sort_values(by=['AQI', 'AQI_Bucket'], ascending=False) \
                                .pivot_table(values='AQI', index='City', columns='AQI_Bucket', aggfunc=np.mean)
df_station_day_pivot_table = calculate_percentage(df_station_day_pivot_table)
df_station_day_pivot_table = df_station_day_pivot_table.sort_values(by=['Good', 'Satisfactory'])

In [None]:
plot_chart(df_station_day_pivot_table, title='AQI Bucket by Cities',
           xlabel_title='AQI Bucket', ylabel_title='Cities',
           stacked=True, height=20.0)

<i><p style="font-size:18px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">Just as a number of cities do not have full readings across the year, the AQI Bucket field is incomplete too: the chart above makes clear that many cities do not have all the buckets populated (missing/NaN values are present). This makes it relatively hard to assess a city's current standing and to compare it with other cities across the country. If such features are then used to compute other features, those dependent features inherit varying levels of inaccuracy. Further on we can see the extent to which this missing information affects us.
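To put a number on how incomplete each city is, the empty buckets can be counted per row of the pivot table. The sketch below uses a small made-up table in place of `df_station_day_pivot_table` and assumes that an unpopulated bucket shows up as `NaN`; the city names and values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the City x AQI_Bucket pivot table (values are invented).
toy_pivot = pd.DataFrame(
    {
        "Good": [10.0, np.nan, 5.0],
        "Moderate": [40.0, 30.0, np.nan],
        "Poor": [50.0, 70.0, np.nan],
    },
    index=pd.Index(["CityA", "CityB", "CityC"], name="City"),
)

# Number of AQI buckets without any value, for each city.
missing_buckets_per_city = toy_pivot.isna().sum(axis=1)

# Only the cities with at least one empty bucket, most incomplete first.
incomplete_cities = missing_buckets_per_city[missing_buckets_per_city > 0]
incomplete_cities = incomplete_cities.sort_values(ascending=False)
print(incomplete_cities)
```

If the percentage step fills gaps with zeros rather than `NaN`, the same count could be taken with `(toy_pivot.fillna(0) == 0).sum(axis=1)` instead.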

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

### AQI Bucket and Cities: readings available across all bucket types

In [None]:
all_readings_available_filter = True
columns = list(set(df_station_day_pivot_table.columns) - set(['Unknown']))
for each_rating in columns:
    all_readings_available_filter = all_readings_available_filter & (df_station_day_pivot_table[each_rating] > 0)

In [None]:
plot_chart(df_station_day_pivot_table[all_readings_available_filter], title='AQI Bucket by Cities (all readings available)', 
           xlabel_title=f'AQI Bucket from ' \
                        f'{df_station_day_pivot_table[all_readings_available_filter].shape[0]} out of {df_station_day_pivot_table.shape[0]} cities', 
           ylabel_title='Cities', 
           stacked=True, height=12.0)

<i><p style="font-size:20px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">The chart above shows the cities that have readings for all the AQI Bucket types: 14 of the **26** cities, i.e. only ~54% of the total. This again tallies with the previous observation that absent readings lead to absent AQI Bucket types.
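The row filter and the share quoted above can also be derived in one vectorised step. This is a sketch on a made-up table rather than the real pivot table, so the names and numbers are assumptions; it mirrors the rule used in the loop, where a bucket counts as present only if its value is greater than zero and the `Unknown` column is ignored.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the City x AQI_Bucket pivot table (values are invented).
toy_pivot = pd.DataFrame(
    {
        "Good": [10.0, np.nan, 5.0, 0.0],
        "Moderate": [40.0, 30.0, 45.0, 60.0],
        "Poor": [50.0, 70.0, 50.0, 40.0],
        "Unknown": [np.nan, np.nan, np.nan, np.nan],
    },
    index=pd.Index(["CityA", "CityB", "CityC", "CityD"], name="City"),
)

# Every bucket except 'Unknown' must be present for a city to qualify.
bucket_columns = [column for column in toy_pivot.columns if column != "Unknown"]

# NaN > 0 evaluates to False, so missing buckets disqualify a city as well.
has_all_buckets = (toy_pivot[bucket_columns] > 0).all(axis=1)

complete_count = int(has_all_buckets.sum())
total_count = len(toy_pivot)
complete_share = 100.0 * complete_count / total_count
print(f"{complete_count} of {total_count} cities ({complete_share:.0f}%)")
```

Applied to the figures in the text, the same arithmetic gives `100 * 14 / 26`, which rounds to 54.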

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

### AQI Bucket and Cities: readings available across one or more bucket types (but not all)

In [None]:
one_or_more_readings_filter = ~all_readings_available_filter
plot_chart(df_station_day_pivot_table[one_or_more_readings_filter], title='AQI Bucket by Cities (one or more readings available)', 
           xlabel_title=f'AQI Bucket from ' \
                        f'{df_station_day_pivot_table[one_or_more_readings_filter].shape[0]} out of {df_station_day_pivot_table.shape[0]} cities', 
           ylabel_title='Cities', 
           stacked=True, height=10.0)

<i><p style="font-size:20px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">The chart above shows the remaining cities, which have readings for one or more of the AQI Bucket types but not for all of them: 12 of the **26** cities, i.e. ~46% of the total. These are the cities that will be hard for us to assess when it comes to making investment decisions or helping to improve the situation there.

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

### AQ Acceptability and Cities: listed by Acceptable AQ

In [None]:
# Mean AQI for each city within each AQ Acceptability category (station-level daily data)
df_station_day_pivot_table = df_station_day.sort_values(by=['AQI', 'AQ_Acceptability'], ascending=False) \
                                .pivot_table(values='AQI', index='City', columns='AQ_Acceptability', aggfunc='mean')
df_station_day_pivot_table = calculate_percentage(df_station_day_pivot_table)
df_station_day_pivot_table = df_station_day_pivot_table.sort_values(by='Acceptable')

In [None]:
plot_chart(df_station_day_pivot_table, 
           title='AQ Acceptability by Cities (sorted by Acceptable AQ: most to least)', 
           xlabel_title='AQ Acceptability',  ylabel_title='Cities', 
           stacked=True, height=20.0)

<i><p style="font-size:20px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">Again, the above shows clearly that cities like Ahmedabad, Guwahati, Kolkata, Lucknow, Delhi and Gurugram seem to contribute to **Unacceptable AQ** much more than cities from other regions of the country. Improvements in these cities are not visible, or are marginal in some cases. It is the cities in the Southern region that have a better acceptable rate, although work needs to be done across all regions. We also need to check how much data has been collected from each region before drawing conclusions about the current state of things; for example, we won't judge whether the Central region has better AQ Acceptability, for that same reason. These comparisons are not simple or straightforward.

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

### AQ Acceptability and Cities: listed by Unacceptable AQ

In [None]:
# Mean AQI for each city within each AQ Acceptability category (city-level daily data)
df_city_day_pivot_table = df_city_day.sort_values(by=['AQI', 'AQ_Acceptability'], ascending=False) \
                                .pivot_table(values='AQI', index='City', columns='AQ_Acceptability', aggfunc='mean')
df_city_day_pivot_table = df_city_day_pivot_table.dropna()
df_city_day_pivot_table = calculate_percentage(df_city_day_pivot_table)
df_city_day_pivot_table = df_city_day_pivot_table.sort_values(by='Unacceptable')

In [None]:
plot_chart(df_city_day_pivot_table, 
           title='AQ Acceptability by Cities (sorted by Unacceptable AQ: most to least)', 
           xlabel_title='AQ Acceptability', ylabel_title='Cities', 
           stacked=True, height=20.0)

<i><p style="font-size:20px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">The top five cities with **Unacceptable AQ** are Ahmedabad, Guwahati, Kolkata, Delhi and Gurugram, while the top five cities with **Acceptable AQ** are Ernakulam, Coimbatore, Kochi, Bengaluru and Jaipur, which suggests that many Southern cities take their **AQ Acceptability** more seriously. This matches the previous analysis, where the Southern cities were doing a better job than the Northern ones. In general, though, the cities have bad or **Unacceptable AQ** the majority of the time: none of them has **Acceptable AQ** even 50% of the time. For every city in the list, **Unacceptable AQ** moments overshadow **Acceptable AQ** moments, occurring ~2 to ~5 times as often.
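The "~2 to ~5 times" comparison can be made explicit by dividing the two columns for each city. The sketch below assumes a table with `Acceptable` and `Unacceptable` percentage columns indexed by city, like the one plotted above; the city names and figures are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the City x AQ_Acceptability percentage table (invented values).
toy_acceptability = pd.DataFrame(
    {
        "Acceptable": [20.0, 30.0, 25.0],
        "Unacceptable": [80.0, 70.0, 75.0],
    },
    index=pd.Index(["CityA", "CityB", "CityC"], name="City"),
)

# How many times larger the Unacceptable share is than the Acceptable share.
unacceptable_to_acceptable = (
    toy_acceptability["Unacceptable"] / toy_acceptability["Acceptable"]
).sort_values(ascending=False)

# The cities with the highest ratio, i.e. the worst balance.
worst_cities = unacceptable_to_acceptable.head(2).index.tolist()

# Cities where the Acceptable share reaches at least half (none in this toy data).
mostly_acceptable = toy_acceptability[toy_acceptability["Acceptable"] >= 50.0]
print(unacceptable_to_acceptable)
```

A ratio view like this makes cities directly comparable even when their absolute shares differ.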

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

<i><p style="font-size:16px; background-color: #008000; border: 2px solid black; margin: 20px; padding: 20px;">There are plenty of insights that can be extracted by combining fields across the different tables. As per our discussion in the study group, it would be best to start with high-level questions, or else we might meander down various paths and not answer focussed questions.

## Credits

- Forked from [Firat Gonen](https://www.kaggle.com/frtgnn)'s [Clean Air? India's Air Quality 🇮🇳](https://www.kaggle.com/frtgnn/clean-air-india-s-air-quality) kernel - thanks for the foundation work.
- [David Dirring](https://www.kaggle.com/romandovega) for all the insights during the ChaiEDA sessions, and also building on his idea of the KPI based on the AQI Index

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>