<i><p style="font-size:24px; background-color: #ff9933; border: 2px dotted black; margin: 20px; padding: 20px;">This kernel is dedicated to all the visualisations and narrations and makes use of the datasets prepared via [ChaiEDA: India's Air Quality 2015-20 🇮🇳: Data Prep](https://www.kaggle.com/neomatrix369/chaieda-india-s-air-quality-2015-20-data-prep/). The extended [Air Quality India dataset can be found here](https://www.kaggle.com/neomatrix369/air-quality-data-in-india-extended). For Air Quality in the Cities of India perspective, see [ChaiEDA: Air Quality in Indian Cities 2015-20](https://www.kaggle.com/neomatrix369/chaieda-air-quality-in-indian-cities-2015-20/).

![](https://www.nextwanderlust.com/wp-content/uploads/2017/12/Incredible-India.jpg)

![](https://nirvanabeing.com/wp-content/uploads/2018/04/iaq_blog_1.jpg)

https://waqi.info

In [None]:
import os
import warnings
import numpy as np
import pandas as pd
from math import pi
import seaborn as sns
import plotly.express as px
import matplotlib
import matplotlib.pyplot as plt
from IPython.display import HTML,display

sns.set(style="whitegrid", font_scale=1.75)


# prettify plots
plt.rcParams['figure.figsize'] = [20.0, 5.0]
    
%matplotlib inline

warnings.filterwarnings("ignore")

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
%%time
DATASET_INPUT_DIR = '/kaggle/input/air-quality-data-in-india-extended'
df_station_hour = pd.read_feather(f"{DATASET_INPUT_DIR}/station_hour_transformed.fth")
df_city_hour    = pd.read_feather(f"{DATASET_INPUT_DIR}/city_hour_transformed.fth")
df_station_day  = pd.read_feather(f"{DATASET_INPUT_DIR}/station_day_transformed.fth")
df_city_day     = pd.read_feather(f"{DATASET_INPUT_DIR}/city_day_transformed.fth")
df_stations     = pd.read_feather(f"{DATASET_INPUT_DIR}/stations_transformed.fth")

In [None]:
print('Below is a list of columns of tables just as they are loaded:')
print('~~~')
print(f'df_stations: {list(df_stations.columns)}')
print('~~~')
print(f'df_station_day: {list(df_station_day.columns)}')
print('~~~')
print(f'df_station_hour: {list(df_station_hour.columns)}')
print('~~~')
print(f'df_city_day: {list(df_city_day.columns)}')
print('~~~')
print(f'df_city_hour: {list(df_city_hour.columns)}')
print('~~~')

<i><p style="font-size:18px; background-color: #66cdde; border: 2px dotted black; margin: 20px; padding: 20px;">For initial EDA steps to view datasets and other introductory steps, please refer to the [source kernel from where this kernel has been forked](https://www.kaggle.com/frtgnn). Another nice kernel to look at before getting started is [Parul Pandey's](https://www.kaggle.com/parulpandey): [😷 Breathe India: COVID-19 effect on Pollution](https://www.kaggle.com/parulpandey/breathe-india-covid-19-effect-on-pollution). In this current kernel, we start with visualisations and narrations based on the [transformed datasets](https://www.kaggle.com/neomatrix369/chaieda-india-s-air-quality-2015-20-data-prep/output) which are based on the [original datasets](https://www.kaggle.com/rohanrao/air-quality-data-in-india) provided.

In [None]:
fields_to_show = ['City','AQI_Bucket']

In [None]:
fields_to_ignore = ['StationId', 'StationName', 'State', 'Status', 'Region', 'Month', 'Year', 'Season', 'City', 'Date', 'AQI', 'AQI_Bucket']
names_of_pollutants = list(set(df_city_day.columns) - set(fields_to_ignore))
print(f"Names of Pollutants: {list(names_of_pollutants)}")

In [None]:
def plot_chart(dataframe, width=20.0, height=7.0, title='<No title assigned>', 
               ylabel_title='<No ylabel title assigned>', stacked=False):
    plt.rcParams['figure.figsize'] = [width, height]
    font = {'size': 16}
    matplotlib.rc('font', **font)
    ax = dataframe.plot(kind='barh', stacked=stacked, title=title)
    ax.set_ylabel(ylabel_title)


### After filling in missing values in the original Stations table
The Status field has a number of NaN values, affecting about 97 of the stations. Below we also look at the proportion of inactive, active and unknown status stations across the various Regions.

In [None]:
def plot_sns_chart(dataframe, y_axis, class_separator=None, width=20.0, height=9.0, 
                   font_scale=2, xlabel_title="Xlabel title not set", ylabel_title="Ylabel title not set"):
    plt.rcParams['figure.figsize'] = [width, height]
    sns.set(font_scale=font_scale)
    g = sns.countplot(y=y_axis, hue=class_separator, data=dataframe)
    g.set(xlabel=xlabel_title, ylabel=ylabel_title)

In [None]:
plot_sns_chart(df_stations, 'Status', 
               xlabel_title="Number of stations by Status",
               ylabel_title="Station Status", height=7.0)

<i><p style="font-size:16px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">From the above we can see that about `130 stations` are active across the various regions, nearly `98 stations` have a status of `Unknown`, and a few are marked as `Inactive`. In other words, out of `230 stations` only about `56%` are known to be actively providing readings.
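The share quoted above can also be computed rather than read off the chart. The following is a minimal sketch, assuming a `Status` column like the one in `df_stations`; the `toy_stations` frame and the `status_shares` helper are hypothetical and only stand in for the real table.

```python
import pandas as pd

# Hypothetical toy stand-in for df_stations (the real table is loaded earlier in the kernel).
toy_stations = pd.DataFrame({
    "StationId": ["S1", "S2", "S3", "S4", "S5"],
    "Status": ["Active", "Active", "Active", "Unknown", "Inactive"],
})

def status_shares(stations: pd.DataFrame) -> pd.Series:
    """Return the percentage of stations per Status value, rounded to 1 decimal."""
    # value_counts(normalize=True) returns fractions; NaN statuses are dropped by default.
    return (stations["Status"].value_counts(normalize=True) * 100).round(1)

shares = status_shares(toy_stations)
print(shares)
```

Calling `status_shares(df_stations)` on the real table should give the corresponding figures, provided the missing statuses have already been filled in as `Unknown`.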

In [None]:
plot_sns_chart(df_stations.sort_values(by='Region'), 'Region', 'Status', 
               xlabel_title = "Number of stations by Status per Region", ylabel_title = "Regions")

In [None]:
plot_sns_chart(df_stations.sort_values(by='State'), 'State', 'Status', 
               xlabel_title="Number of stations by Status per State", ylabel_title = "States",
               height=18.0, font_scale=2.5)

<i><p style="font-size:16px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">From the above two graphs we can say that across the northern Regions and States we tend to have fewer `Active` stations. The statuses of the remaining stations tend to be `Unknown` or `Inactive`. Let us verify this at a later stage when we have merged the **Stations** table with the other four tables.

### Let's check the status of the Inactive and Unknown status Stations across the Regions and States again

We will be checking this across the four tables: city_day, city_hour, station_day and station_hour

### city_day table: Inactive and Unknown status Stations across the Regions

In [None]:
filter_not_active = df_city_day['Status'] != 'Active'
plot_sns_chart(df_city_day[filter_not_active], 'Status', 
               xlabel_title="Number of readings from the stations by Status", ylabel_title = "Station Status",
               width=15.0, height=3.0)

In [None]:
plot_sns_chart(df_city_day[filter_not_active].sort_values(by='Region'), 'Region', 'Status',
               xlabel_title="Number of readings from the stations by Status per Region", ylabel_title = "Regions",
               width=15.0, height=5.0)

### city_hour table: Inactive and Unknown status Stations across the Regions

In [None]:
filter_not_active = df_city_hour['Status'] != 'Active'
plot_sns_chart(df_city_hour[filter_not_active], 'Status',
               xlabel_title="Number of readings from the stations by Status", ylabel_title = "Station Status",
               width=15.0, height=3.0)

In [None]:
plot_sns_chart(df_city_hour[filter_not_active].sort_values(by='Region'), 'Region', 'Status',
               xlabel_title="Number of readings from the stations by Status per Region", ylabel_title = "Regions",
               width=15.0, height=5.0)

### station_day table: Inactive and Unknown status Stations across the Regions

In [None]:
filter_not_active = df_station_day['Status'] != 'Active'
plot_sns_chart(df_station_day[filter_not_active], 'Status',
               xlabel_title="Number of readings from the stations by Status", ylabel_title = "Station Status",
               width=15.0, height=3.0)

In [None]:
plot_sns_chart(df_station_day[filter_not_active].sort_values(by='Region'), 'Region', 'Status',
               xlabel_title="Number of readings from the stations by Status per Region", ylabel_title = "Regions",
               width=15.0, height=5.0)

### station_hour table: Inactive and Unknown status Stations across the Regions

In [None]:
filter_not_active = df_station_hour['Status'] != 'Active'
plot_sns_chart(df_station_hour[filter_not_active], 'Status',
               xlabel_title="Number of readings from the stations by Status", ylabel_title = "Station Status",
               width=15.0, height=3.0)

In [None]:
plot_sns_chart(df_station_hour[filter_not_active].sort_values(by='Region'), 'Region', 'Status',
               xlabel_title="Number of readings from the stations by Status per Region", ylabel_title = "Regions",
               width=15.0, height=5.0)

<i><p style="font-size:16px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">So now it's unclear again why `Inactive` stations would give readings. Did they give readings for a period and then stop? Is it a similar situation for the `Unknown` status stations, with the dataset simply not kept up to date? In that case, can we mark the stations with both `Inactive` and `Unknown` status back to `Active`, since they are producing readings (as seen in all four tables above), or do we need to inquire with the sources before doing this? For some reason, many of the stations whose status is `Unknown` are in the **Northern** region of the country. A very small number of stations from the **Southern** region are `Inactive` as well.
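One way to probe the question of whether non-active stations reported for a while and then stopped is to look at the first and last reading date per station. This is only a sketch on made-up data: `toy_station_day` and `reading_windows` are hypothetical, and it assumes the real table has `StationId`, `Status` and a datetime `Date` column.

```python
import pandas as pd

# Hypothetical toy stand-in for df_station_day.
toy_station_day = pd.DataFrame({
    "StationId": ["S1", "S1", "S2", "S2", "S3"],
    "Status": ["Active", "Active", "Inactive", "Inactive", "Unknown"],
    "Date": pd.to_datetime(
        ["2019-01-01", "2020-06-30", "2015-03-01", "2016-08-15", "2018-05-05"]
    ),
})

def reading_windows(readings: pd.DataFrame) -> pd.DataFrame:
    """First/last reading date and row count for every station that is not Active."""
    not_active = readings[readings["Status"] != "Active"]
    return (
        not_active
        # observed=True avoids empty groups if the columns are categorical
        .groupby(["StationId", "Status"], observed=True)["Date"]
        .agg(first_reading="min", last_reading="max", n_readings="count")
        .reset_index()
    )

windows = reading_windows(toy_station_day)
print(windows)
```

A `last_reading` well before the end of the dataset would be consistent with a station that stopped reporting, while a recent one would suggest the status field is simply out of date.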

## Pivot tables, visualisations and charts

Below we pick just one of the pollutants, CO, to establish an approach that could be applied to the other pollutants as well. For now, let's assume that the other pollutants behave in the same way; this hasn't been verified and could be false, but it will become clear once each of them is checked.
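Applying the same approach to the other pollutants could look like the sketch below, which builds one pivot table per pollutant. The `toy_city_day` frame and the `pivot_per_pollutant` helper are hypothetical; in the kernel the list of columns would come from something like `names_of_pollutants`.

```python
import pandas as pd

# Hypothetical toy stand-in for df_city_day with two pollutant columns.
toy_city_day = pd.DataFrame({
    "Region": ["North", "North", "South", "South"],
    "Year": [2015, 2016, 2015, 2016],
    "CO": [2.0, 1.0, 0.5, 0.7],
    "NO2": [40.0, 35.0, 20.0, 22.0],
})

def pivot_per_pollutant(df, pollutants, index="Year", columns="Region"):
    """Return a dict mapping each pollutant name to its mean-value pivot table."""
    return {
        pollutant: df.pivot_table(
            values=pollutant, index=index, columns=columns, aggfunc="mean"
        )
        for pollutant in pollutants
    }

pivots = pivot_per_pollutant(toy_city_day, ["CO", "NO2"])
print(pivots["NO2"])
```

Each resulting table could then be passed to `plot_chart` in the same way as the CO tables.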

### CO levels across Regions, Seasons, Months, Day periods (i.e. Morning, Evening, etc...), Weekend/weekday, Holidays or regular days

Here we are using the city_day and city_hour tables, as the station_xxx tables are too specific to stations and a summarised view of the State/Region is more than sufficient.

### CO levels across Region and Years (using city_day table)

In [None]:
%%time
df_city_year_pivot_table = df_city_day.pivot_table(values='CO', index='Year', columns='Region', aggfunc=np.mean)
df_city_year_pivot_table

In [None]:
plot_chart(df_city_year_pivot_table, title='CO levels across Region and Years', ylabel_title='Years', stacked=True)

<i><p style="font-size:16px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">From 2015 to 2017, `CO` levels have been rising and falling in the Northern, Western and Southern regions. Compared to 2015, however, the other years show signs of improvement; it's possible that countrywide action was taken on this matter.

### CO levels across Regions and Seasons (using city_day table)

In [None]:
%%time
df_city_season_pivot_table = df_city_day.sort_values(by=['Region'], ascending=True) \
                                .pivot_table(values='CO', index='Region', columns='Season', aggfunc=np.mean)
df_city_season_pivot_table = df_city_season_pivot_table.sort_values(by='1. Winter')

In [None]:
plot_chart(df_city_season_pivot_table, title='CO levels across Regions and Seasons', ylabel_title='Regions', stacked=True)

<i><p style="font-size:16px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">Clearly the Western region dominates over all other regions put together when it comes to `CO` in the air. This could be due to various factors: industrialisation, altitude, population, vehicle usage, forests/vegetation, etc. These could be analysed separately.

### CO levels across Regions and Months (using city_day table)

In [None]:
%%time
df_city_month_pivot_table = df_city_day.sort_values(by=['Region'], ascending=False) \
                                .pivot_table(values='CO', index='Month', columns='Region', aggfunc=np.mean)
df_city_month_pivot_table

In [None]:
plot_chart(df_city_month_pivot_table, title='CO levels across Regions and Months', ylabel_title='Months', height=15.0, stacked=True)

<i><p style="font-size:16px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">The month-to-month trend is pretty clear: July/August/September have the lowest levels, then the level keeps rising from October through January, where it peaks, and then drops through the following months up to June/July, after which the cycle continues. This is also related to the seasons and activities across the year, the wind directions, etc.

### CO levels across Regions and Weekday/weekend (using city_hour table)

In [None]:
%%time
df_city_hour_pivot_table = df_city_hour.sort_values(by=['Region'], ascending=True) \
                                .pivot_table(values='CO', index='Region', columns='Weekday_or_weekend', aggfunc=np.mean)
df_city_hour_pivot_table = df_city_hour_pivot_table.sort_values(by='Weekday')

In [None]:
plot_chart(df_city_hour_pivot_table, title='CO levels across Regions and Weekday/weekend', ylabel_title='Regions')

<i><p style="font-size:16px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">Again the Western region of the country is the biggest contributor, followed by the Northern region, although the difference in `CO` levels between Weekday and Weekend isn't large. Weekday seems to be slightly higher than Weekend in most cases, except for Northern and Central. The Western region clearly shows noticeably more `CO` during the weekdays.

### CO levels across Regions and Holiday/Regular day (using city_hour table)

In [None]:
%%time
df_city_hour_pivot_table = df_city_hour.sort_values(by=['Region'], ascending=True) \
                                .pivot_table(values='CO', index='Region', columns='Regular_day_or_holiday', aggfunc=np.mean)
df_city_hour_pivot_table = df_city_hour_pivot_table.sort_values(by='Holiday (or Festival)')
df_city_hour_pivot_table

In [None]:
plot_chart(df_city_hour_pivot_table, title='CO levels across Regions and Holiday/Regular day', ylabel_title='Regions')

<i><p style="font-size:16px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">Everywhere in the country a Holiday or Festival day means a rise in `CO` (although the difference is not large), except in the Western region, where regular days contribute a lot more `CO` than Holidays or Festival days. In the Western region this could be mainly due to less vehicle usage, less traffic, fewer public services, etc. during holidays. It's not clear why it is the opposite for the rest of the country; this could be investigated and analysed further.

### CO levels across Regions and Day period (using city_hour table)

In [None]:
%%time
df_city_hour_pivot_table = df_city_hour.sort_values(by=['Region'], ascending=True) \
                                .pivot_table(values='CO', index='Region', columns='Day_period', aggfunc=np.mean)
df_city_hour_pivot_table = df_city_hour_pivot_table.sort_values(by='1. Morning')
df_city_hour_pivot_table

In [None]:
plot_chart(df_city_hour_pivot_table, title='CO levels across Regions and Day period', ylabel_title='Regions', stacked=True)

<i><p style="font-size:16px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">More or less all across the country, a lot more `CO` is released during the night than in the morning. It appears that the level of `CO` rises from morning to afternoon to evening and then into the night for most of the regions. Again the Western region contributes the most, but its sequence of rise and fall is a bit strange: the level rises from around noon, falls during the afternoon, and then starts to rise again into the evening and night.

<i><p style="font-size:16px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">Just from the summarised fields (newly added features) we can already start making assessments about the time of the year, the time of the day and/or the region in which the levels of any single pollutant or group of pollutants are high, medium or low, or at healthy or unhealthy levels.
    
<i><p style="font-size:16px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">Also the status of the stations across the various Regions and States can be known as well. In fact if we dig deeper we could see during which periods which stations gave readings and which didn't. Although this path may be of less importance than the analysis of the Air Quality information itself.

## Trying to understand the interaction of AQI/AQI Bucket/AQ Acceptability, Year, Seasons, Regions, States and Cities

<i><p style="font-size:16px; background-color: #66cdde; border: 2px solid black; margin: 20px; padding: 20px;">Just as a side-note, this whole section on AQI, AQI Bucket and AQ Acceptability requires further checking and reformatting/presenting. But the current presentation could still be useful to give some coarse views on things (slightly low on accuracy but quick insights).

### AQI/AQI Bucket/AQ Acceptability and Year

In [None]:
df_station_day_pivot_table = df_station_day.sort_values(by=['AQI', 'AQI_Bucket'], ascending=False) \
                                .pivot_table(values='AQI', index='Year', columns='AQI_Bucket', aggfunc=np.mean)
df_station_day_pivot_table

In [None]:
plot_chart(df_station_day_pivot_table, title='AQI Rating by Year', ylabel_title='Years', stacked=True, height=9.0)

In [None]:
df_station_day_pivot_table = df_station_day.sort_values(by=['AQI', 'AQ_Acceptability'], ascending=False) \
                                .pivot_table(values='AQI', index='Year', columns='AQ_Acceptability', aggfunc=np.mean)
df_station_day_pivot_table['Acceptable'] = df_station_day_pivot_table['Acceptable'] / (df_station_day_pivot_table['Acceptable'] + df_station_day_pivot_table['Unacceptable']) * 100
df_station_day_pivot_table['Unacceptable'] = 100.0 - df_station_day_pivot_table['Acceptable']
df_station_day_pivot_table

In [None]:
plot_chart(df_station_day_pivot_table, title='AQ Acceptability by Year (percentage)', ylabel_title='Years', stacked=True, height=9.0)

<i><p style="font-size:16px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">The above shows clearly that although AQ is in general way below acceptable levels, there have been year-on-year improvements since 2015. But moments of unacceptable AQ still loom. The data for 2020 isn't complete, as the year is still in progress, but the other years show there is much work to be done.
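The percentages in the cells above are derived from the mean AQI of each acceptability class, not from how often each class occurs. As a complementary and hedged sketch (not the method used in this kernel), the share of readings per class can be obtained with `pd.crosstab`. The `toy_readings` frame and the `acceptability_share` helper are hypothetical; only the `AQ_Acceptability` column name is taken from the tables above.

```python
import pandas as pd

# Hypothetical toy stand-in for df_station_day: one row per reading.
toy_readings = pd.DataFrame({
    "Year": [2015, 2015, 2015, 2015, 2016, 2016],
    "AQ_Acceptability": [
        "Unacceptable", "Unacceptable", "Unacceptable", "Acceptable",
        "Acceptable", "Unacceptable",
    ],
})

def acceptability_share(df: pd.DataFrame, by: str) -> pd.DataFrame:
    """Percentage of readings per AQ_Acceptability class within each group of `by`."""
    # normalize="index" makes every row sum to 1 before scaling to percent.
    return pd.crosstab(df[by], df["AQ_Acceptability"], normalize="index") * 100

share_by_year = acceptability_share(toy_readings, "Year")
print(share_by_year)
```

The same helper would work with `by="Region"`, `"Season"` or `"State"`, and the result can be handed to `plot_chart(..., stacked=True)` for a like-for-like comparison with the charts above.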

### AQI/AQI Bucket/AQ Acceptability and Regions

In [None]:
df_station_day_pivot_table = df_station_day.sort_values(by=['AQI', 'AQI_Bucket'], ascending=False) \
                                .pivot_table(values='AQI', index='Region', columns='AQI_Bucket', aggfunc=np.mean)
df_station_day_pivot_table = df_station_day_pivot_table.sort_values(by=['Good', 'Satisfactory'])
df_station_day_pivot_table

In [None]:
plot_chart(df_station_day_pivot_table, title='AQI Rating by Region', ylabel_title='Regions', stacked=True, height=10.0)

In [None]:
df_station_day_pivot_table = df_station_day.sort_values(by=['AQI', 'AQ_Acceptability'], ascending=False) \
                                .pivot_table(values='AQI', index='Region', columns='AQ_Acceptability', aggfunc=np.mean)
df_station_day_pivot_table['Acceptable'] = df_station_day_pivot_table['Acceptable'] / (df_station_day_pivot_table['Acceptable'] + df_station_day_pivot_table['Unacceptable']) * 100
df_station_day_pivot_table['Unacceptable'] = 100.0 - df_station_day_pivot_table['Acceptable']
df_station_day_pivot_table = df_station_day_pivot_table.sort_values(by='Acceptable')
df_station_day_pivot_table

In [None]:
plot_chart(df_station_day_pivot_table, title='AQ Acceptability by Region (percentage)', ylabel_title='Regions', stacked=True, height=10.0)

<i><p style="font-size:16px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">Again, the above shows clearly that the Northern and Western regions seem to contribute to unacceptable AQ much more than the other regions, and no improvement can be seen. The Southern region has a better acceptable rate, although work needs to be done across all regions. We also need to check how much data has been collected from each region to justify the current state of things; for example, for that reason we won't judge whether the Central region has better AQ Acceptability or not. These comparisons are not simple or straightforward.
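Checking how much data each region contributes can be sketched as follows. The `toy_station_day` frame and the `coverage_by_region` helper are hypothetical; the sketch assumes `Region`, `StationId` and `AQI` columns like those in `df_station_day`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy stand-in for df_station_day.
toy_station_day = pd.DataFrame({
    "Region": ["Northern", "Northern", "Northern", "Central", "Southern", "Southern"],
    "StationId": ["N1", "N1", "N2", "C1", "S1", "S2"],
    "AQI": [310.0, 280.0, np.nan, 150.0, 90.0, 70.0],
})

def coverage_by_region(df: pd.DataFrame) -> pd.DataFrame:
    """Rows, non-missing AQI readings and distinct stations per Region."""
    return (
        df.groupby("Region", observed=True)
        .agg(
            n_rows=("AQI", "size"),          # all rows, including missing AQI
            n_aqi=("AQI", "count"),          # rows with a non-missing AQI
            n_stations=("StationId", "nunique"),
        )
        .sort_values("n_aqi")
    )

coverage = coverage_by_region(toy_station_day)
print(coverage)
```

Regions near the top of this table have the fewest usable readings, so their averages deserve the most caution.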

### AQI/AQI Bucket/AQ Acceptability and Seasons

In [None]:
df_station_day_pivot_table = df_station_day.sort_values(by=['AQI', 'AQI_Bucket'], ascending=False) \
                                .pivot_table(values='AQI', index='Season', columns='AQI_Bucket', aggfunc=np.mean)
df_station_day_pivot_table

In [None]:
plot_chart(df_station_day_pivot_table, title='AQI Rating by Season', ylabel_title='Seasons', stacked=True)

In [None]:
df_station_day_pivot_table = df_station_day.sort_values(by=['AQI', 'AQ_Acceptability'], ascending=False) \
                                .pivot_table(values='AQI', index='Season', columns='AQ_Acceptability', aggfunc=np.mean)
df_station_day_pivot_table['Acceptable'] = df_station_day_pivot_table['Acceptable'] / (df_station_day_pivot_table['Acceptable'] + df_station_day_pivot_table['Unacceptable']) * 100
df_station_day_pivot_table['Unacceptable'] = 100.0 - df_station_day_pivot_table['Acceptable']
df_station_day_pivot_table = df_station_day_pivot_table.sort_values(by='Acceptable')
df_station_day_pivot_table

In [None]:
plot_chart(df_station_day_pivot_table, title='AQ Acceptability Rating by Season (percentage)', ylabel_title='Seasons', stacked=True)

<i><p style="font-size:16px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">The above makes sense logically: Monsoon is the best time of the year for AQ Acceptability, followed by Summer, while in Post-Monsoon and Winter AQ Acceptability gets worse due to factors like lack of air movement, cold temperatures, smog, etc. As we know, rain does a good job of cleaning the air and the environment: it washes away what is in the air and the ground absorbs it. It is also a bit windier during these two seasons in many areas compared to Post-Monsoon and Winter. All of these macro and micro factors can be seen reflected above and play a role in positively or negatively impacting AQ Acceptability.

### AQI/AQI Bucket/AQ Acceptability and States

In [None]:
df_station_day_pivot_table = df_station_day.sort_values(by=['AQI', 'AQI_Bucket'], ascending=False) \
                                .pivot_table(values='AQI', index='State', columns='AQI_Bucket', aggfunc=np.mean)
df_station_day_pivot_table

In [None]:
plot_chart(df_station_day_pivot_table, title='AQI Rating by States', ylabel_title='States', stacked=True, height=18.0)

In [None]:
df_station_day_pivot_table = df_station_day.sort_values(by=['AQI', 'AQ_Acceptability'], ascending=False) \
                                .pivot_table(values='AQI', index='State', columns='AQ_Acceptability', aggfunc=np.mean)
df_station_day_pivot_table['Acceptable'] = df_station_day_pivot_table['Acceptable'] / (df_station_day_pivot_table['Acceptable'] + df_station_day_pivot_table['Unacceptable']) * 100
df_station_day_pivot_table['Unacceptable'] = 100.0 - df_station_day_pivot_table['Acceptable']
df_station_day_pivot_table = df_station_day_pivot_table.sort_values(by='Unacceptable')
df_station_day_pivot_table

In [None]:
plot_chart(df_station_day_pivot_table, 
           title='AQ Acceptability by States (sorted by Unacceptable AQ: ASC) (percentage)', ylabel_title='States', 
           stacked=True, height=18.0)

In [None]:
df_station_day_pivot_table = df_station_day_pivot_table.sort_values(by='Acceptable')
plot_chart(df_station_day_pivot_table, 
           title='AQ Acceptability by States (sorted by Acceptable AQ: ASC) (percentage)', ylabel_title='States', 
           stacked=True, height=18.0)

<i><p style="font-size:16px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;"> **Unacceptable AQ**: the top three States are Gujarat, Assam and West Bengal, while for **Acceptable AQ** the top three States are Kerala, Madhya Pradesh and Rajasthan, which makes sense given the other analyses. The Southern states are doing a better job than the Northern ones. Almost all the states have bad or **Unacceptable AQ** the majority of the time, while a few states have it around half of the time. There is also a long list of states where the **Acceptable AQ** moments are so few that they are greatly overshadowed by the moments when **AQ is Unacceptable**. We will be ignoring Mizoram from the list, as its data isn't clean (it has NaN or 0.00 values): it has no data for **Unacceptable AQ**, which could be a data issue.

### AQI/AQI Bucket/AQ Acceptability and Cities

See [ChaiEDA: India's Air Quality 2015-20 (cities)](https://www.kaggle.com/neomatrix369/chaieda-india-s-air-quality-2015-20-cities/)

## Trying to understand the interaction of AQI/AQI Bucket/AQ Acceptability, Month, Day periods, Weekday/Weekend and Holiday/Regular day

### AQI/AQI Bucket/AQ Acceptability and Month

In [None]:
df_city_day_pivot_table = df_city_day.sort_values(by=['AQI', 'AQI_Bucket'], ascending=False) \
                                .pivot_table(values='AQI', index='Month', columns='AQI_Bucket', aggfunc=np.mean)
df_city_day_pivot_table

In [None]:
plot_chart(df_city_day_pivot_table, title='AQI Rating by Month', ylabel_title='Months', stacked=True, height=10.0)

In [None]:
df_city_day_pivot_table = df_city_day.sort_values(by=['AQI', 'AQ_Acceptability'], ascending=False) \
                                .pivot_table(values='AQI', index='Month', columns='AQ_Acceptability', aggfunc=np.mean)
# Scale to percentages, consistent with the chart title and the other Acceptability cells
df_city_day_pivot_table['Acceptable'] = df_city_day_pivot_table['Acceptable'] / (df_city_day_pivot_table['Acceptable'] + df_city_day_pivot_table['Unacceptable']) * 100
df_city_day_pivot_table['Unacceptable'] = 100.0 - df_city_day_pivot_table['Acceptable']
df_city_day_pivot_table

In [None]:
plot_chart(df_city_day_pivot_table, title='AQ Acceptability by Month (percentage)', ylabel_title='Months', stacked=True, height=15.0)

<i><p style="font-size:16px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">As with the earlier analysis of Seasons, we can see that the rise and fall of **Acceptable AQ** and **Unacceptable AQ** is cyclical in nature. As seen in the other analyses, there is a rise in July/August/September, then a drop until January, then a rise from February/March, followed by a drop again into June/July, and so on.

### AQI/AQI Bucket/AQ Acceptability and Day periods

In [None]:
df_city_hour_pivot_table = df_city_hour.sort_values(by=['AQI', 'AQI_Bucket'], ascending=False) \
                                .pivot_table(values='AQI', index='Day_period', 
                                             columns='AQI_Bucket', aggfunc=np.mean)
df_city_hour_pivot_table

In [None]:
plot_chart(df_city_hour_pivot_table, 
           title='AQI Rating by Day period (Morning, Afternoon, etc...)', ylabel_title='Day period', 
           stacked=True, height=5.0)

In [None]:
df_city_hour_pivot_table = df_city_hour.sort_values(by=['AQI', 'AQ_Acceptability'], ascending=False) \
                                .pivot_table(values='AQI', index='Day_period', 
                                             columns='AQ_Acceptability', aggfunc=np.mean)
df_city_hour_pivot_table['Acceptable'] = df_city_hour_pivot_table['Acceptable'] / (df_city_hour_pivot_table['Acceptable'] + df_city_hour_pivot_table['Unacceptable']) * 100
df_city_hour_pivot_table['Unacceptable'] = 100.0 - df_city_hour_pivot_table['Acceptable']
df_city_hour_pivot_table = df_city_hour_pivot_table.sort_values(by='Unacceptable')
df_city_hour_pivot_table

In [None]:
plot_chart(df_city_hour_pivot_table, stacked=True, 
           title='AQ Acceptability by Day period (Morning, Afternoon, etc...) (percentage)', ylabel_title='Day period'
          )

<i><p style="font-size:16px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">The Day period analysis tells a different story: it's hard to tell at which time of the day AQ is more or less **Acceptable**. It appears that Afternoons and Evenings are better than Nights and Mornings. But the overall trend is that AQ is **Unacceptable** during most of each day period, and the AQ is **Acceptable** in less than, say, 25% of the moments in a day period (1 in 4 moments).

### AQI/AQI Bucket/AQ Acceptability and Weekday/Weekend

In [None]:
df_city_hour_pivot_table = df_city_hour.sort_values(by=['AQI', 'AQI_Bucket'], ascending=False) \
                                .pivot_table(values='AQI', index='AQI_Bucket', 
                                             columns='Weekday_or_weekend', aggfunc=np.mean)
df_city_hour_pivot_table

In [None]:
plot_chart(df_city_hour_pivot_table, title='AQI Rating by Weekday/Weekend', ylabel_title='AQI Bucket', height=12.0)

In [None]:
df_city_hour_pivot_table = df_city_hour.sort_values(by=['AQI', 'AQ_Acceptability'], ascending=False) \
                                .pivot_table(values='AQI', index='Weekday_or_weekend', 
                                             columns='AQ_Acceptability', aggfunc=np.mean)
df_city_hour_pivot_table['Acceptable'] = df_city_hour_pivot_table['Acceptable'] / (df_city_hour_pivot_table['Acceptable'] + df_city_hour_pivot_table['Unacceptable']) * 100
df_city_hour_pivot_table['Unacceptable'] = 100.0 - df_city_hour_pivot_table['Acceptable']
df_city_hour_pivot_table = df_city_hour_pivot_table.sort_values(by='Unacceptable')
df_city_hour_pivot_table

In [None]:
plot_chart(df_city_hour_pivot_table, 
           title='AQ Acceptability by Weekday/Weekend (percentage)', ylabel_title='Weekday or Weekend', 
           stacked=True)

<i><p style="font-size:16px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">The Weekday/Weekend analysis tells a similar story to the Day period analysis: it's hard to tell on which days of the week the AQ is more or less **Acceptable**. But the overall trend is that AQ is **Unacceptable** throughout the week: the AQ is **Acceptable** in less than, say, 25% of the moments in the week (1 in 4 moments), and the rest of the time it's **Unacceptable**. The difference in AQ (both types) between weekday and weekend is also negligible at this scale, meaning the trend is the same irrespective of the day of the week.

### AQI/AQI Bucket/AQ Acceptability and Holiday/Regular day

In [None]:
df_city_hour_pivot_table = df_city_hour.sort_values(by=['AQI', 'AQI_Bucket'], ascending=False) \
                                .pivot_table(values='AQI', index='Regular_day_or_holiday', 
                                             columns='AQI_Bucket', aggfunc=np.mean)
df_city_hour_pivot_table = df_city_hour_pivot_table.sort_values(by=['Good', 'Satisfactory'])
df_city_hour_pivot_table

In [None]:
plot_chart(df_city_hour_pivot_table, title='AQI Rating by Regular day/Holiday', 
           ylabel_title='Regular day or Holiday (Festival)', height=12.0)

In [None]:
df_city_hour_pivot_table = df_city_hour.sort_values(by=['AQI', 'AQ_Acceptability'], ascending=False) \
                                .pivot_table(values='AQI', index='Regular_day_or_holiday', 
                                             columns='AQ_Acceptability', aggfunc=np.mean)
df_city_hour_pivot_table['Acceptable'] = df_city_hour_pivot_table['Acceptable'] / (df_city_hour_pivot_table['Acceptable'] + df_city_hour_pivot_table['Unacceptable']) * 100
df_city_hour_pivot_table['Unacceptable'] = 100.0 - df_city_hour_pivot_table['Acceptable']
df_city_hour_pivot_table = df_city_hour_pivot_table.sort_values(by='Acceptable')
df_city_hour_pivot_table

In [None]:
plot_chart(df_city_hour_pivot_table, stacked=True, 
           title='AQ Acceptability by Regular day/Holiday (percentage)', ylabel_title='Regular day or Holiday (Festival)')

<i><p style="font-size:16px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">The Holiday/Regular day analysis tells a similar story to the Weekday/Weekend analysis: it's hard to tell during which of the two the AQ is more or less **Acceptable**. But the overall trend is that AQ is **Unacceptable** throughout: the AQ is **Acceptable** in less than, say, 25% of the moments (1 in 4 moments), and the rest of the time it's **Unacceptable**. The difference in AQ (both types) between Holiday (or Festival) and Regular days is also negligible at this scale.

<i><p style="font-size:16px; background-color: #008000; border: 2px solid black; margin: 20px; padding: 20px;">There are plenty of insights that can be extracted from various combinations of fields across the different tables. As per our discussion in the study group, it would be best to start with high-level questions, or else we might meander down various paths and not answer focussed questions.


<i><p style="font-size:16px; background-color: #008000; border: 2px solid black; margin: 20px; padding: 20px;">Although this is a good point, omitting to verify the integrity of the data using various techniques isn't a good start, and during such phases we must take short journeys on the back of low-hanging fruit like the ones attempted above. Many more things need to be done before starting with any important question(s), or we might reach less accurate conclusions when answering those questions.
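As one example of such an integrity check, the share of missing values per column is a cheap place to start. This is only a sketch: the `toy_table` frame and the `missing_share` helper are hypothetical, and the real tables would be passed in instead.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with some missing readings.
toy_table = pd.DataFrame({
    "State": ["Kerala", "Gujarat", "Assam", "Mizoram"],
    "AQI": [80.0, 320.0, np.nan, np.nan],
    "CO": [0.4, 2.1, 1.3, np.nan],
})

def missing_share(df: pd.DataFrame) -> pd.Series:
    """Percentage of missing values per column, highest first."""
    # isna() gives booleans; their mean is the fraction of missing entries.
    return (df.isna().mean() * 100).sort_values(ascending=False)

missing = missing_share(toy_table)
print(missing)
```

Columns or states with a high missing share are candidates for exclusion or closer inspection before drawing conclusions from them.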

<i><p style="font-size:16px; background-color: #008000; border: 2px solid black; margin: 20px; padding: 20px;">At the moment the higher-level concepts of time (Years, Seasons, Months) and location (Regions and States) have been used, and this helps keep a bird's-eye view of things during our EDA.

## Credits

- Forked from [Firat Gonen](https://www.kaggle.com/frtgnn)'s [Clean Air? India's Air Quality 🇮🇳](https://www.kaggle.com/frtgnn/clean-air-india-s-air-quality) kernel - thanks for the foundation work.
- [David Dirring](https://www.kaggle.com/romandovega) for all the insights during the ChaiEDA sessions, and also building on his idea of the KPI based on the AQI Index
