In [1]:
# Install the libraries (if using binder)
# !pip install numpy
# !pip install pandas
# !pip install matplotlib
# !pip install seaborn
# !pip install pylab

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pylab as plot

### Reading the CSV file into a dataframe

In [3]:
df = pd.read_csv("pollution_us_2000_2016.csv")

In [7]:
print(f'The dataframe has {len(df)} rows and {df.shape[1]} columns')

The dataframe has 1746661 rows and 29 columns


With about 1.75 million rows and 29 columns, the dataframe corresponds to a very large dataset <em>(Big Data)</em>. Displaying it in full is therefore not advisable, as that would be slow and memory-intensive. One needs to perform the analysis in a smarter way, so as to get the desired results without putting too much load on the memory.
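To make the memory concern concrete, one can measure how many bytes each column occupies. The following is a minimal sketch on a toy dataframe (the name `toy` and its values are illustrative stand-ins, not the real data); the conversion to the `category` dtype is an optional technique that the original analysis does not use.

```python
import pandas as pd

# Toy stand-in for the real data: a few repeated state names and one numeric column.
toy = pd.DataFrame({
    "State": ["Arizona", "California", "Texas", "New York"] * 2500,
    "NO2 Mean": [19.041667, 22.958333, 3.0, 1.958333] * 2500,
})

# Bytes used by each column; deep=True also counts the string contents.
per_column = toy.memory_usage(deep=True)
total_mb = per_column.sum() / 1024**2
print(per_column)
print(f"Total: {total_mb:.2f} MB")

# Optional: a column with few distinct values is usually smaller as a category.
state_as_category = toy["State"].astype("category")
bytes_before = toy["State"].memory_usage(deep=True)
bytes_after = state_as_category.memory_usage(deep=True)
print(f"State column: {bytes_before} bytes -> {bytes_after} bytes as category")
```

On the real dataframe, the same `memory_usage(deep=True)` call would show which columns dominate the footprint.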

In [5]:
print(f'The columns in the dataframe are given by \n{df.columns}')

The columns in the dataframe are given by 
Index(['Unnamed: 0', 'State Code', 'County Code', 'Site Num', 'Address',
       'State', 'County', 'City', 'Date Local', 'NO2 Units', 'NO2 Mean',
       'NO2 1st Max Value', 'NO2 1st Max Hour', 'NO2 AQI', 'O3 Units',
       'O3 Mean', 'O3 1st Max Value', 'O3 1st Max Hour', 'O3 AQI', 'SO2 Units',
       'SO2 Mean', 'SO2 1st Max Value', 'SO2 1st Max Hour', 'SO2 AQI',
       'CO Units', 'CO Mean', 'CO 1st Max Value', 'CO 1st Max Hour', 'CO AQI'],
      dtype='object')


One would now like to get rid of the columns that are not useful for the analysis.

In [6]:
# columns to be dropped
drop_cols = ['Unnamed: 0', 'State Code', 'County Code', 'Site Num', 'Address', 'City', 'NO2 1st Max Value', 'NO2 1st Max Hour', 'NO2 AQI', 'O3 1st Max Value', 'O3 1st Max Hour', 'O3 AQI', 'SO2 1st Max Value', 'SO2 1st Max Hour', 'SO2 AQI', 'CO 1st Max Value', 'CO 1st Max Hour', 'CO AQI']

In [8]:
df.drop(drop_cols, axis=1, inplace=True)

In [9]:
print(f'The dataframe now contains {len(df)} rows and {df.shape[1]} columns')

The dataframe now contains 1746661 rows and 11 columns
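An alternative to loading all 29 columns and then dropping 18 of them is to load only the needed columns in the first place. The sketch below assumes a tiny in-memory CSV with made-up values in place of `pollution_us_2000_2016.csv`; `keep_cols` and `small_df` are hypothetical names.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for the real CSV file (values are made up).
csv_text = io.StringIO(
    ",State Code,State,County,Date Local,NO2 Units,NO2 Mean,NO2 AQI\n"
    "0,4,Arizona,Maricopa,2000-01-01,Parts per billion,19.041667,46\n"
    "1,4,Arizona,Maricopa,2000-01-02,Parts per billion,22.958333,34\n"
)

keep_cols = ["State", "County", "Date Local", "NO2 Units", "NO2 Mean"]

# usecols makes read_csv parse only the listed columns.
small_df = pd.read_csv(csv_text, usecols=keep_cols)
print(small_df.shape)
print(small_df.columns.tolist())
```

With this approach the unwanted columns never reach memory, which matters more as the file grows.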



To get some idea about the entries in the dataframe, instead of looking at the entire dataframe, one peeks at only a few rows.

In [10]:
df.head()

| | State | County | Date Local | NO2 Units | NO2 Mean | O3 Units | O3 Mean | SO2 Units | SO2 Mean | CO Units | CO Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Arizona | Maricopa | 2000-01-01 | Parts per billion | 19.041667 | Parts per million | 0.0225 | Parts per billion | 3.0 | Parts per million | 1.145833 |
| 1 | Arizona | Maricopa | 2000-01-01 | Parts per billion | 19.041667 | Parts per million | 0.0225 | Parts per billion | 3.0 | Parts per million | 0.878947 |
| 2 | Arizona | Maricopa | 2000-01-01 | Parts per billion | 19.041667 | Parts per million | 0.0225 | Parts per billion | 2.975 | Parts per million | 1.145833 |
| 3 | Arizona | Maricopa | 2000-01-01 | Parts per billion | 19.041667 | Parts per million | 0.0225 | Parts per billion | 2.975 | Parts per million | 0.878947 |
| 4 | Arizona | Maricopa | 2000-01-02 | Parts per billion | 22.958333 | Parts per million | 0.013375 | Parts per billion | 1.958333 | Parts per million | 0.85 |


### Conversion of Units to achieve uniformity

By peeking at the above tiny dataframe, one sees that the units are not uniform across the gases. Before converting any units, one must check whether all the entries in a particular units column are the same, i.e., for a gas <em>x</em>, one would like to make sure that it is always measured in the same units.

In [11]:
def same_units(df, col):
    # True if every entry in the column uses the same unit
    return len(df[col].unique()) == 1

def get_units(df, col):
    # Distinct units that appear in the column
    return df[col].unique()

In [21]:
same_units(df, 'NO2 Units')

True

In [22]:
get_units(df, 'NO2 Units')

array(['Parts per billion'], dtype=object)

In [23]:
same_units(df, 'SO2 Units')

True

In [24]:
get_units(df, 'SO2 Units')

array(['Parts per billion'], dtype=object)

In [25]:
same_units(df, 'O3 Units')

True

In [26]:
get_units(df, 'O3 Units')

array(['Parts per million'], dtype=object)

In [27]:
same_units(df, 'CO Units')

True

In [28]:
get_units(df, 'CO Units')

array(['Parts per million'], dtype=object)
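The eight separate calls above can also be condensed into a single loop over the units columns. This is a minimal sketch on a toy dataframe that mirrors the four units columns; `toy` and `units` are illustrative names and not part of the original notebook.

```python
import pandas as pd

# Toy frame with the same four units columns as the real data.
toy = pd.DataFrame({
    "NO2 Units": ["Parts per billion"] * 3,
    "SO2 Units": ["Parts per billion"] * 3,
    "O3 Units": ["Parts per million"] * 3,
    "CO Units": ["Parts per million"] * 3,
})

unit_cols = ["NO2 Units", "SO2 Units", "O3 Units", "CO Units"]

# Map each units column to the list of distinct values it contains.
units = {col: list(toy[col].unique()) for col in unit_cols}

for col, values in units.items():
    status = "uniform" if len(values) == 1 else "mixed"
    print(f"{col}: {values} ({status})")
```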


Note that $NO_2$ and $SO_2$ are **always** measured in parts per billion (ppb), while $O_3$ and $CO$ are **always** measured in parts per million (ppm).

To put all four gases on a common scale, one would like to convert $NO_2$ and $SO_2$ into parts per million (ppm) as well. Since 1 ppm = 1000 ppb, this amounts to dividing the values by 1000.

In [29]:
df['NO2 Mean'] = df['NO2 Mean']/1000
df['SO2 Mean'] = df['SO2 Mean']/1000
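The division by 1000 follows from 1 ppm = 1000 ppb. As a small worked example, the sketch below converts the first values shown earlier on a toy dataframe; the constant `PPB_PER_PPM` and the frame `toy_ppm` are hypothetical names. It writes into a copy, on the assumption that one wants to avoid dividing twice if the cell is run again.

```python
import pandas as pd

PPB_PER_PPM = 1000  # 1 ppm = 1000 ppb

# Toy values taken from the rows displayed earlier (in ppb).
toy = pd.DataFrame({
    "NO2 Mean": [19.041667, 22.958333],
    "SO2 Mean": [3.0, 1.958333],
})

# Build a converted copy instead of overwriting the original values.
ppb_cols = ["NO2 Mean", "SO2 Mean"]
toy_ppm = toy.copy()
toy_ppm[ppb_cols] = toy[ppb_cols] / PPB_PER_PPM
print(toy_ppm)
```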


The columns describing the units are now irrelevant; hence, one would like to discard all of them to get an even leaner dataframe.


In [30]:
unit_cols = ['NO2 Units', 'SO2 Units', 'O3 Units', 'CO Units']
df.drop(unit_cols, axis=1, inplace=True)

Before moving ahead, it is good practice to check whether there are any NaN (i.e., missing) values in the dataframe.

In [31]:
df.isna().any()

State         False
County        False
Date Local    False
NO2 Mean      False
O3 Mean       False
SO2 Mean      False
CO Mean       False
dtype: bool

Therefore, none of the above columns contains any missing values. Great!!
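When a column does contain missing values, it is often more informative to count them than to get only a True/False flag. A minimal sketch, assuming a toy dataframe with one deliberately missing entry (`toy` and its values are illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame with one deliberately missing value.
toy = pd.DataFrame({
    "State": ["Arizona", "Arizona", "Texas"],
    "CO Mean": [1.145833, np.nan, 0.85],
})

missing_per_column = toy.isna().sum()   # number of NaN entries per column
has_missing = toy.isna().any().any()    # True if any cell at all is missing
print(missing_per_column)
print(has_missing)
```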