# **Step 1: Download the Dataset**

Visit the dataset page: https://tn.data.gov.in/resource/location-wise-daily-ambient-air-quality-tamil-nadu-year-2014
Download the dataset, preferably as CSV; the steps below assume a CSV file.

# **Step 2: Import Necessary Libraries**

Make sure Python and the pandas library are installed. If pandas is missing, install it with pip:

In [57]:
%pip install pandas



# **Step 3: Load the Dataset**
With the dataset downloaded as a CSV file, upload it to the Colab session and load it into a pandas DataFrame:

In [58]:
import pandas as pd


In [59]:
from google.colab import files
upload = files.upload()  # opens a file picker and saves the chosen file in the Colab session

Saving cpcb_dly_aq_tamil_nadu-2014.csv to cpcb_dly_aq_tamil_nadu-2014.csv


In [60]:
df = pd.read_csv('/content/cpcb_dly_aq_tamil_nadu-2014.csv')
df.head()

Unnamed: 0,Stn Code,Sampling Date,State,City/Town/Village/Area,Location of Monitoring Station,Agency,Type of Location,SO2,NO2,RSPM/PM10,PM 2.5
0,38,01-02-14,Tamil Nadu,Chennai,"Kathivakkam, Municipal Kalyana Mandapam, Chennai",Tamilnadu State Pollution Control Board,Industrial Area,11.0,17.0,55.0,
1,38,01-07-14,Tamil Nadu,Chennai,"Kathivakkam, Municipal Kalyana Mandapam, Chennai",Tamilnadu State Pollution Control Board,Industrial Area,13.0,17.0,45.0,
2,38,21-01-14,Tamil Nadu,Chennai,"Kathivakkam, Municipal Kalyana Mandapam, Chennai",Tamilnadu State Pollution Control Board,Industrial Area,12.0,18.0,50.0,
3,38,23-01-14,Tamil Nadu,Chennai,"Kathivakkam, Municipal Kalyana Mandapam, Chennai",Tamilnadu State Pollution Control Board,Industrial Area,15.0,16.0,46.0,
4,38,28-01-14,Tamil Nadu,Chennai,"Kathivakkam, Municipal Kalyana Mandapam, Chennai",Tamilnadu State Pollution Control Board,Industrial Area,13.0,14.0,42.0,


# **Step 4: Preprocess the Data**
Data preprocessing may involve various tasks, such as handling missing values, data cleaning, and feature engineering. Here are some common preprocessing steps:

**Handle Missing Values:**
If the dataset has missing values, either fill them with `fillna()` or drop the affected rows or columns with `dropna()`.
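
As a minimal sketch of both options, the snippet below works on a small hypothetical frame (the values are made up; only the column names mirror the dataset). It counts the gaps, drops a column that holds no data at all, and then shows filling versus dropping side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; values are illustrative, not taken from the dataset
sample = pd.DataFrame({
    "SO2": [11.0, np.nan, 12.0, 15.0],
    "NO2": [17.0, 17.0, np.nan, 16.0],
    "PM 2.5": [np.nan, np.nan, np.nan, np.nan],
})

# Count missing values per column
missing_counts = sample.isnull().sum()

# Drop columns that contain no data at all
no_empty_cols = sample.dropna(axis=1, how="all")

# Option 1: fill the remaining gaps with a constant
filled = no_empty_cols.fillna(0)

# Option 2: drop every row that still has a gap
dropped = no_empty_cols.dropna()

print(missing_counts)
print(filled)
print(dropped)
```

Filling keeps every row but changes the column statistics, while dropping keeps the statistics honest at the cost of fewer rows; which trade-off is acceptable depends on the analysis.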

**Data Cleaning:**
Check for and clean any outliers or incorrect data.
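
One common way to flag outliers is the interquartile range (IQR) rule. The sketch below applies it to a short hypothetical series of RSPM/PM10 readings; the 1.5 multiplier is a convention rather than something this notebook prescribes, and flagged values still need a manual check before removal.

```python
import pandas as pd

# Hypothetical RSPM/PM10 readings; 269.0 is deliberately extreme
pm10 = pd.Series(
    [55.0, 45.0, 50.0, 46.0, 42.0, 43.0, 51.0, 269.0],
    name="RSPM/PM10",
)

# Quartiles and interquartile range
q1 = pm10.quantile(0.25)
q3 = pm10.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outlier_mask = (pm10 < lower) | (pm10 > upper)
outliers = pm10[outlier_mask]

print(outliers)
```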

**Feature Engineering:**
Create new features or transform existing ones if needed.
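
For instance, a month feature could be derived from `Sampling Date`. The sketch below assumes the dates are in day-month-year order; that is an assumption, and the first rows of the dataset (for example `01-02-14` next to `21-01-14`) suggest the column may mix orders, so verify the format before relying on the result. The `Month` column name is hypothetical.

```python
import pandas as pd

# Hypothetical sample using unambiguous day-month-year strings
sample = pd.DataFrame({"Sampling Date": ["21-01-14", "23-01-14", "13-02-14"]})

# Assumption: day-month-year order; strings that do not match become NaT
sample["Sampling Date"] = pd.to_datetime(
    sample["Sampling Date"], format="%d-%m-%y", errors="coerce"
)

# New feature: calendar month of each sample
sample["Month"] = sample["Sampling Date"].dt.month

print(sample)
```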

**Data Exploration:**
Use `df.describe()` for summary statistics of the numeric columns, `df.info()` for dtypes and non-null counts, and `df['column_name'].value_counts()` for the frequency of each value in a column.
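
The three calls can be tried on a small hypothetical frame before running them on the full dataset. The values below are illustrative only.

```python
import pandas as pd

# Hypothetical miniature frame; column names mirror the dataset
sample = pd.DataFrame({
    "Type of Location": [
        "Industrial Area",
        "Industrial Area",
        "Residential, Rural and other Areas",
    ],
    "SO2": [11.0, 13.0, 9.0],
})

# Summary statistics for the numeric columns
summary = sample.describe()

# Frequency of each category in a text column
location_counts = sample["Type of Location"].value_counts()

sample.info()  # prints dtypes and non-null counts
print(summary)
print(location_counts)
```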

In [61]:
print(df.isnull().sum())

Stn Code                             0
Sampling Date                        0
State                                0
City/Town/Village/Area               0
Location of Monitoring Station       0
Agency                               0
Type of Location                     0
SO2                                 11
NO2                                 13
RSPM/PM10                            4
PM 2.5                            2879
dtype: int64


In [62]:
print(df.describe())

          Stn Code          SO2          NO2    RSPM/PM10  PM 2.5
count  2879.000000  2868.000000  2866.000000  2875.000000     0.0
mean    475.750261    11.503138    22.136776    62.494261     NaN
std     277.675577     5.051702     7.128694    31.368745     NaN
min      38.000000     2.000000     5.000000    12.000000     NaN
25%     238.000000     8.000000    17.000000    41.000000     NaN
50%     366.000000    12.000000    22.000000    55.000000     NaN
75%     764.000000    15.000000    25.000000    78.000000     NaN
max     773.000000    49.000000    71.000000   269.000000     NaN


In [63]:
df = df.drop(columns=["PM 2.5"])  # column is entirely empty (2879 missing values in 2879 rows)

In [64]:
df.head(50)

Unnamed: 0,Stn Code,Sampling Date,State,City/Town/Village/Area,Location of Monitoring Station,Agency,Type of Location,SO2,NO2,RSPM/PM10
0,38,01-02-14,Tamil Nadu,Chennai,"Kathivakkam, Municipal Kalyana Mandapam, Chennai",Tamilnadu State Pollution Control Board,Industrial Area,11.0,17.0,55.0
1,38,01-07-14,Tamil Nadu,Chennai,"Kathivakkam, Municipal Kalyana Mandapam, Chennai",Tamilnadu State Pollution Control Board,Industrial Area,13.0,17.0,45.0
2,38,21-01-14,Tamil Nadu,Chennai,"Kathivakkam, Municipal Kalyana Mandapam, Chennai",Tamilnadu State Pollution Control Board,Industrial Area,12.0,18.0,50.0
3,38,23-01-14,Tamil Nadu,Chennai,"Kathivakkam, Municipal Kalyana Mandapam, Chennai",Tamilnadu State Pollution Control Board,Industrial Area,15.0,16.0,46.0
4,38,28-01-14,Tamil Nadu,Chennai,"Kathivakkam, Municipal Kalyana Mandapam, Chennai",Tamilnadu State Pollution Control Board,Industrial Area,13.0,14.0,42.0
5,38,30-01-14,Tamil Nadu,Chennai,"Kathivakkam, Municipal Kalyana Mandapam, Chennai",Tamilnadu State Pollution Control Board,Industrial Area,14.0,18.0,43.0
6,38,02-04-14,Tamil Nadu,Chennai,"Kathivakkam, Municipal Kalyana Mandapam, Chennai",Tamilnadu State Pollution Control Board,Industrial Area,12.0,17.0,51.0
7,38,02-06-14,Tamil Nadu,Chennai,"Kathivakkam, Municipal Kalyana Mandapam, Chennai",Tamilnadu State Pollution Control Board,Industrial Area,13.0,16.0,46.0
8,38,02-11-14,Tamil Nadu,Chennai,"Kathivakkam, Municipal Kalyana Mandapam, Chennai",Tamilnadu State Pollution Control Board,Industrial Area,10.0,19.0,50.0
9,38,13-02-14,Tamil Nadu,Chennai,"Kathivakkam, Municipal Kalyana Mandapam, Chennai",Tamilnadu State Pollution Control Board,Industrial Area,15.0,14.0,48.0


In [65]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2879 entries, 0 to 2878
Data columns (total 10 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Stn Code                        2879 non-null   int64  
 1   Sampling Date                   2879 non-null   object 
 2   State                           2879 non-null   object 
 3   City/Town/Village/Area          2879 non-null   object 
 4   Location of Monitoring Station  2879 non-null   object 
 5   Agency                          2879 non-null   object 
 6   Type of Location                2879 non-null   object 
 7   SO2                             2868 non-null   float64
 8   NO2                             2866 non-null   float64
 9   RSPM/PM10                       2875 non-null   float64
dtypes: float64(3), int64(1), object(6)
memory usage: 225.0+ KB


In [66]:
df.columns

Index(['Stn Code', 'Sampling Date', 'State', 'City/Town/Village/Area',
       'Location of Monitoring Station', 'Agency', 'Type of Location', 'SO2',
       'NO2', 'RSPM/PM10'],
      dtype='object')

In [67]:
df.shape

(2879, 10)

In [68]:
df.dtypes

Stn Code                            int64
Sampling Date                      object
State                              object
City/Town/Village/Area             object
Location of Monitoring Station     object
Agency                             object
Type of Location                   object
SO2                               float64
NO2                               float64
RSPM/PM10                         float64
dtype: object

In [69]:
df.index

RangeIndex(start=0, stop=2879, step=1)

In [70]:
df["SO2"] = df["SO2"].fillna(0)  # assign back; in-place fillna on a selected column is deprecated

In [71]:
df["NO2"] = df["NO2"].fillna(0)  # assign back; in-place fillna on a selected column is deprecated

In [72]:
df["RSPM/PM10"] = df["RSPM/PM10"].fillna(0)  # assign back; in-place fillna on a selected column is deprecated

In [73]:
print(df.isnull().sum())

Stn Code                          0
Sampling Date                     0
State                             0
City/Town/Village/Area            0
Location of Monitoring Station    0
Agency                            0
Type of Location                  0
SO2                               0
NO2                               0
RSPM/PM10                         0
dtype: int64


In [74]:
df['SO2'].unique()

array([11., 13., 12., 15., 14., 10., 16., 19.,  9., 20., 17., 18., 25.,
       21., 23., 26., 24., 32., 27., 30., 22.,  0.,  8., 31., 28., 29.,
        6., 49.,  3.,  7.,  5.,  2.,  4., 39.])

In [82]:
df.agg(['min', 'max'])

Unnamed: 0,Stn Code,Sampling Date,State,City/Town/Village/Area,Location of Monitoring Station,Agency,Type of Location,SO2,NO2,RSPM/PM10
min,38,01-02-14,Tamil Nadu,Chennai,"AVM Jewellery Building, Tuticorin",National Environmental Engineering Research In...,Industrial Area,0.0,0.0,0.0
max,773,31-12-14,Tamil Nadu,Trichy,"Thiyagaraya Nagar, Chennai",Tamilnadu State Pollution Control Board,"Residential, Rural and other Areas",49.0,71.0,269.0


In [76]:
df.mean(numeric_only=True)

Stn Code     475.750261
SO2           11.459187
NO2           22.036818
RSPM/PM10     62.407433
dtype: float64

In [77]:
df.median(numeric_only=True)

Stn Code     366.0
SO2           12.0
NO2           21.0
RSPM/PM10     55.0
dtype: float64

In [78]:
df.var(numeric_only=True)

Stn Code     77103.726003
SO2             25.925974
NO2             52.792251
RSPM/PM10      988.051105
dtype: float64

In [79]:
df.std(numeric_only=True)

Stn Code     277.675577
SO2            5.091756
NO2            7.265828
RSPM/PM10     31.433280
dtype: float64

In [80]:
df.kurtosis(numeric_only=True)

Stn Code    -1.714667
SO2          2.200113
NO2          3.643779
RSPM/PM10    3.345704
dtype: float64

# **Step 5: Save the Preprocessed Data**

To keep the preprocessed dataset for later use, write it to a new CSV file with `to_csv()`. Passing `index=False` leaves the row index out of the file:

In [81]:
df.to_csv('preprocessed_dataset.csv', index=False)
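
A quick way to confirm the file was written as expected is to read it back and compare it with the frame in memory. The sketch below does this with a small hypothetical frame and a temporary directory, so it does not touch the real output file.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the preprocessed DataFrame
sample = pd.DataFrame({"Stn Code": [38, 38], "SO2": [11.0, 13.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "preprocessed_dataset.csv")
    sample.to_csv(path, index=False)  # write without the row index
    reloaded = pd.read_csv(path)      # read the file back

# The round trip should preserve shape and column names
same_shape = reloaded.shape == sample.shape
same_columns = list(reloaded.columns) == list(sample.columns)

print(same_shape, same_columns)
```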
