# Data ingestion

#### Importing important libraries

In [1]:
import import_ipynb
import pandas as pd
import numpy as np
import gdown
%run loading_dataset.ipynb


In [5]:
# build the dataset URL and download the CSV
dataset_url = generating_url()
filename = "air_quality_data.csv"
gdown.download(dataset_url, filename)


Downloading...
From: https://drive.google.com/uc?export=download&id=1QV9TppYEslb7kbi1cCYaXIIYsQ53J411
To: g:\metro project\notebooks\air_quality_data.csv
100%|██████████| 68.2M/68.2M [00:03<00:00, 22.1MB/s]


'air_quality_data.csv'
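
Because the file is about 68 MB, re-running the cell above downloads it again each time. A minimal sketch of a guard that skips the download when the file is already present is shown below. The helper name `download_once` and its `fetch` parameter are hypothetical (not part of the notebook or of `gdown`); by default it falls back to `gdown.download(url, output)`, the same call used above.

```python
from pathlib import Path


def download_once(url, filename, fetch=None):
    """Download `url` to `filename` unless a non-empty file already exists.

    `fetch` is a hypothetical hook: any callable taking (url, output_path).
    When omitted, gdown.download is used, as in the cell above.
    """
    path = Path(filename)
    if path.exists() and path.stat().st_size > 0:
        return str(path)  # reuse the existing copy, no network access

    if fetch is None:
        import gdown  # imported lazily, only when a real download is needed
        fetch = gdown.download

    fetch(url, str(path))
    return str(path)
```

In this notebook it could be called as `filename = download_once(dataset_url, "air_quality_data.csv")`.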

In [6]:
# load the downloaded CSV into a pandas DataFrame
df = pd.read_csv(filename)
df.head(3)

Unnamed: 0,City,Datetime,PM2.5,PM10,NO,NO2,NOx,NH3,CO,SO2,O3,Benzene,Toluene,Xylene,AQI,AQI_Bucket
0,Ahmedabad,2015-01-01 01:00:00,,,1.0,40.01,36.37,,1.0,122.07,,0.0,0.0,0.0,,
1,Ahmedabad,2015-01-01 02:00:00,,,0.02,27.75,19.73,,0.02,85.9,,0.0,0.0,0.0,,
2,Ahmedabad,2015-01-01 03:00:00,,,0.08,19.32,11.08,,0.08,52.83,,0.0,0.0,0.0,,


# About dataset

<b>Context</b>
Air quality is a significant factor in an individual's health. Monitoring it by measuring and documenting the concentration levels of different pollutants is therefore important.

<b>Source</b>
The dataset has been derived from the Central Pollution Control Board of India: https://cpcb.nic.in/

<b>Inspiration</b>
This dataset aims to document pollutant concentration levels in different cities of India at different dates and times during the period 2015-2020. The pollutant concentration levels can be used to determine the Air Quality Index (AQI) and to draw conclusions about air quality in India over that period. The dataset is intended to be updated annually with up-to-date values and credible information.

Dataset Reference: [click Here](https://www.kaggle.com/datasets/amandeepvasistha/air-quality-data)


# Data Profiling and Inspection

In [7]:
# checking datatypes of dataset columns
df.dtypes

City           object
Datetime       object
PM2.5         float64
PM10          float64
NO            float64
NO2           float64
NOx           float64
NH3           float64
CO            float64
SO2           float64
O3            float64
Benzene       float64
Toluene       float64
Xylene        float64
AQI           float64
AQI_Bucket     object
dtype: object
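
The output above shows `Datetime` stored as `object` (plain strings), so time-based filtering or resampling would not work directly. A minimal sketch of the conversion with `pd.to_datetime` follows. It runs on a small stand-in frame (`sample`, a hypothetical name) built from the three rows previewed earlier, and it assumes every row uses the `YYYY-MM-DD HH:MM:SS` format seen there.

```python
import pandas as pd

# Stand-in frame with the same Datetime format as the dataset preview
sample = pd.DataFrame({
    "City": ["Ahmedabad", "Ahmedabad", "Ahmedabad"],
    "Datetime": [
        "2015-01-01 01:00:00",
        "2015-01-01 02:00:00",
        "2015-01-01 03:00:00",
    ],
    "NO2": [40.01, 27.75, 19.32],
})

# Parse the strings; errors="coerce" turns unparseable values into NaT
# instead of raising, so bad rows can be inspected afterwards.
sample["Datetime"] = pd.to_datetime(
    sample["Datetime"], format="%Y-%m-%d %H:%M:%S", errors="coerce"
)

print(sample.dtypes)
print("unparseable rows:", sample["Datetime"].isna().sum())
```

The same call could be applied to `df["Datetime"]`; checking the count of `NaT` values afterwards would show whether the format assumption holds for the full file.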

In [8]:
# overall information about the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 737406 entries, 0 to 737405
Data columns (total 16 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   City        737406 non-null  object 
 1   Datetime    737406 non-null  object 
 2   PM2.5       587720 non-null  float64
 3   PM10        429529 non-null  float64
 4   NO          617192 non-null  float64
 5   NO2         616699 non-null  float64
 6   NOx         609997 non-null  float64
 7   NH3         454536 non-null  float64
 8   CO          648830 non-null  float64
 9   SO2         603179 non-null  float64
 10  O3          604176 non-null  float64
 11  Benzene     568137 non-null  float64
 12  Toluene     508758 non-null  float64
 13  Xylene      263468 non-null  float64
 14  AQI         603645 non-null  float64
 15  AQI_Bucket  603645 non-null  object 
dtypes: float64(13), object(3)
memory usage: 90.0+ MB
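
The `+` in `memory usage: 90.0+ MB` means pandas did not count the string contents of the `object` columns, so the real footprint is larger. A minimal sketch of measuring it with `memory_usage(deep=True)` and of shrinking a repetitive text column by converting it to `category` is shown below. It uses a small stand-in series (`cities`, a hypothetical name); whether the conversion is worthwhile for `City` or `AQI_Bucket` in the real data is an assumption based on those columns having few distinct values.

```python
import pandas as pd

# Stand-in for a text column with few distinct values repeated many times
cities = pd.Series(["Ahmedabad", "Delhi"] * 500, name="City")

# deep=True also counts the memory held by the string values themselves
before = cities.memory_usage(deep=True)

# category stores each distinct string once plus small integer codes
as_category = cities.astype("category")
after = as_category.memory_usage(deep=True)

print(f"before: {before} bytes, after: {after} bytes")
```

On the full frame, `df.info(memory_usage="deep")` reports the exact total without the `+`.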


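### From the above information
* The dataset has 737,406 entries (rows).
* There are 16 columns in total: 13 are `float64` and 3 are `object` (`City`, `Datetime`, `AQI_Bucket`).
* `City` and `Datetime` have no missing values; every other column has fewer than 737,406 non-null entries.
* `Datetime` is stored as `object` (string) rather than as a datetime type.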



In [9]:
# Checking missing data 
df.isna().sum()

City               0
Datetime           0
PM2.5         149686
PM10          307877
NO            120214
NO2           120707
NOx           127409
NH3           282870
CO             88576
SO2           134227
O3            133230
Benzene       169269
Toluene       228648
Xylene        473938
AQI           133761
AQI_Bucket    133761
dtype: int64

In [11]:
# finding percentage of missing data
df.isna().sum() / len(df) * 100

City           0.000000
Datetime       0.000000
PM2.5         20.298994
PM10          41.751355
NO            16.302281
NO2           16.369137
NOx           17.277999
NH3           38.360144
CO            12.011836
SO2           18.202591
O3            18.067388
Benzene       22.954655
Toluene       31.007071
Xylene        64.270971
AQI           18.139397
AQI_Bucket    18.139397
dtype: float64
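
The two checks above (counts and percentages) can be combined into one table sorted by severity, which makes columns such as `Xylene` (about 64% missing) stand out. A minimal sketch on a small stand-in frame follows. The names `sample`, `missing`, and `high_missing` are hypothetical, and the 50% cut-off is an illustrative assumption, not a threshold taken from the notebook.

```python
import numpy as np
import pandas as pd

# Stand-in frame; in the notebook this would be df
sample = pd.DataFrame({
    "City": ["A", "A", "B", "B"],
    "PM2.5": [np.nan, 10.0, 12.0, 15.0],
    "Xylene": [np.nan, np.nan, np.nan, 0.5],
})

# isna().mean() is the fraction of missing values per column,
# equivalent to isna().sum() / len(sample)
missing = pd.DataFrame({
    "missing_count": sample.isna().sum(),
    "missing_pct": (sample.isna().mean() * 100).round(2),
}).sort_values("missing_pct", ascending=False)

# Illustrative cut-off (assumption): flag columns with more than 50% missing
high_missing = missing[missing["missing_pct"] > 50].index.tolist()

print(missing)
print("columns above 50% missing:", high_missing)
```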