<a href="https://colab.research.google.com/github/Pranav-Bhatlapenumarthi/Data-Preprocessing-Air-Quality-Analyisis/blob/main/Data_Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Preprocessing for Air Quality Analysis

Disclaimer: The original .xlsx file has been renamed and converted to a .csv file.

Please run the entire notebook to get final results.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import MinMaxScaler
import seaborn as sns

## Handling Missing Values

In this dataset, entries tagged as "-200" are missing values.

Step 1: Locate the missing values

Step 2: Replace the "-200" entries with null values

Step 3: Replace the null entries with the mean value of the column. This keeps the filled-in values within the range of the values already in the column.

In [None]:
df = pd.read_csv("/content/AirQualityUCI.csv - AirQualityUCI.csv") #Loading the CSV file
missing_val = (df == -200).sum()
print(missing_val) #Locating the number of missing values in each column

df.replace(-200, np.nan, inplace = True) # Replacing missing values with NA in the original file. This helps accurate calculation of the mean
df.fillna(df.mean(numeric_only=True), inplace = True) # Replacing the NA values with the mean in numeric valued columns

new_missing_val = df.isna().sum() # Checking if any null values still exist after the update (NaN never equals itself, so df == np.nan would always count zero)
print(new_missing_val) # Printing out the number of missing values after the update
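
The three steps above can be tried on a small toy frame. This is a minimal sketch, assuming hypothetical columns `a` and `b` rather than the real dataset; it also shows why a remaining-NaN check is usually written with `isna()` instead of a comparison against `np.nan`.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical columns; -200 marks a missing reading
toy = pd.DataFrame({"a": [1.0, -200, 3.0], "b": [-200, 5.0, 7.0]})

# Step 1: count the -200 markers per column
marker_count = (toy == -200).sum()

# Step 2: turn the markers into NaN so they do not distort the mean
toy = toy.replace(-200, np.nan)

# NaN never compares equal to anything, so this count is always zero
wrong_count = int((toy == np.nan).sum().sum())

# isna() counts the missing entries reliably
nan_before = int(toy.isna().sum().sum())

# Step 3: fill each NaN with the mean of its own column
toy = toy.fillna(toy.mean(numeric_only=True))
nan_after = int(toy.isna().sum().sum())

print(marker_count)
print(wrong_count, nan_before, nan_after)
```

Mean imputation keeps every row, but it places all filled entries at a single value per column, which can later show up as a flat band in plots.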

## Analysing the data (a little bit...)

We first need to get a sense of what the data looks like. This information (such as the skewness of the data) will help us decide which methods to apply for further preprocessing.

In [None]:
data_headers = ['CO(GT)', 'PT08.S1(CO)', 'NMHC(GT)', 'C6H6(GT)',
       'PT08.S2(NMHC)', 'NOx(GT)', 'PT08.S3(NOx)', 'NO2(GT)', 'PT08.S4(NO2)',
       'PT08.S5(O3)', 'T', 'RH', 'AH'] # Features of interest

for feature in data_headers:
  print(feature)
  print("Mean: ", df[feature].mean(), ", Min: ", df[feature].min(), ", Max: ", df[feature].max())
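
Since skewness is mentioned above as something worth knowing, it can be measured directly instead of being read off the mean, minimum and maximum. A minimal sketch on toy data (the column names and values are hypothetical); on the real frame the same call would be `df[data_headers].skew()`.

```python
import pandas as pd

# Toy data (hypothetical values): one symmetric and one right-skewed column
toy = pd.DataFrame({
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0],
    "right_skewed": [1.0, 1.0, 2.0, 2.0, 20.0],
})

# Sample skewness per column: close to 0 for symmetric data,
# positive when the distribution has a long right tail
skewness = toy.skew()
print(skewness)
```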

## Outlier Detection and Handling
This is a crucial part of data preprocessing: it detects values that deviate significantly from the rest of the dataset and may adversely affect the predictions of a Machine Learning model.

In the given dataset, the entries are clearly skewed (non-symmetric). Hence we use the IQR (Interquartile Range) method to detect the outliers: a value is flagged when it lies more than 1.5 × IQR below the first quartile or above the third quartile.

In [None]:
def detect_outliers(df, columns):
  outliers = {}

  for col in columns:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1

    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    outlier_indices = df[(df[col] < lower_bound) | (df[col] > upper_bound)].index.tolist()
    outliers[col] = outlier_indices

  return outliers

outliers = detect_outliers(df, data_headers)
for k in outliers:
  print(k,outliers[k])
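
The cell above only detects outliers. One possible way to handle them, shown here as a sketch and not as the method used in this notebook, is to clip values to the same IQR bounds. The helper `clip_outliers_iqr` and the toy column `x` are hypothetical names introduced for illustration.

```python
import pandas as pd

def clip_outliers_iqr(df, columns, k=1.5):
    """Return a copy of df with values outside [Q1 - k*IQR, Q3 + k*IQR] clipped to those bounds."""
    clipped = df.copy()
    for col in columns:
        q1 = clipped[col].quantile(0.25)
        q3 = clipped[col].quantile(0.75)
        iqr = q3 - q1
        # Values below the lower bound or above the upper bound are moved onto the bound
        clipped[col] = clipped[col].clip(lower=q1 - k * iqr, upper=q3 + k * iqr)
    return clipped

# Toy column with one extreme value
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 100.0]})
toy_clipped = clip_outliers_iqr(toy, ["x"])
print(toy_clipped)
```

Clipping keeps every row, which matters for time-ordered sensor data; dropping the flagged rows would be an alternative if gaps in the series are acceptable.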


## Feature Scaling
Feature scaling standardises numerical features so that they all share a uniform range.

For the given dataset, Min-Max Scaling is used since the result is easy to interpret (every value lies between 0 and 1) and the units do not vary too much across features.

In [None]:
print(df.columns) #To check all the feature headings
toScale = ['CO(GT)', 'PT08.S1(CO)', 'NMHC(GT)', 'C6H6(GT)',
       'PT08.S2(NMHC)', 'NOx(GT)', 'PT08.S3(NOx)', 'NO2(GT)', 'PT08.S4(NO2)',
       'PT08.S5(O3)', 'T', 'RH', 'AH']

scaler = MinMaxScaler() #Initialising the scaler
df[toScale] = scaler.fit_transform(df[toScale]) #Scaling only selected features
df.head() #Checking the current status of the dataset
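
For reference, Min-Max scaling maps each value to `(x - min) / (max - min)`, which is what `MinMaxScaler` computes per column with its default range of 0 to 1. A minimal sketch of the formula on a toy array (the values are hypothetical):

```python
import numpy as np

# Toy column with hypothetical values
x = np.array([10.0, 20.0, 25.0, 50.0])

# Min-Max scaling: the minimum maps to 0, the maximum to 1
x_scaled = (x - x.min()) / (x.max() - x.min())
print(x_scaled)
```

Because the minimum and maximum define the range, a single extreme value compresses all other scaled values into a narrow interval. This hand-written version would also divide by zero on a constant column.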

## Exploratory Data Analysis (EDA)

In [None]:
#print(df.head(50)) # Displays first 50 records
print(df.describe()) # To give overall summary statistics (printed explicitly, since only the last expression of a cell is displayed)
selected_features = ['CO(GT)', 'PT08.S1(CO)', 'NMHC(GT)', 'C6H6(GT)',
       'PT08.S2(NMHC)', 'NOx(GT)', 'PT08.S3(NOx)', 'NO2(GT)', 'PT08.S4(NO2)',
       'PT08.S5(O3)', 'T', 'RH', 'AH']

df_copy = df[selected_features].copy()
plt.figure(figsize = (16,10))
sns.heatmap(df_copy.corr(), cmap="RdBu",annot=True)
plt.show()
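
A heatmap with this many features can be hard to read at a glance, so it may help to list the feature pairs ranked by absolute correlation. A minimal sketch on a toy frame with hypothetical columns `a`, `b` and `c`; on the real data, `corr` would be `df_copy.corr()`.

```python
from itertools import combinations

import pandas as pd

# Toy frame (hypothetical values): b rises with a, c tends to fall as a rises
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.5],
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],
})

corr = toy.corr()

# Visit each unordered pair of columns once and record its correlation
pairs = {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}

# Strongest relationships first, regardless of sign
ranked = sorted(pairs.items(), key=lambda kv: abs(kv[1]), reverse=True)
for (a, b), r in ranked:
    print(f"{a} vs {b}: {r:.3f}")
```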

In [None]:
# Select features to visualize
selected_features = ["CO(GT)", "NOx(GT)", "T", "RH", "PT08.S1(CO)"]

# Create scatter plots
plt.figure(figsize=(12, 6))
for i, feature in enumerate(selected_features):
    plt.subplot(2, len(selected_features) // 2 + 1, i + 1)
    sns.scatterplot(x=df.index, y=df[feature])
    plt.title(f"Scatterplot of {feature}")
    plt.xlabel("Index")
    plt.ylabel(feature)

plt.tight_layout()
plt.show()

## Data Analysis

CO(GT) Scatterplot:

* The concentration spikes at multiple points, indicating potential pollution events.
* There seems to be a periodic pattern, with values fluctuating significantly.

NOx(GT) Scatterplot:

* The trend follows a structure similar to CO(GT), with increased values in specific regions.
* There is a cluster of low values, which might indicate sensor inaccuracies or missing values.

T (Temperature) Scatterplot:

* The values follow a cyclic pattern, suggesting periodic environmental changes (see Key Takeaways).

PT08.S1(CO) Scatterplot:

* The pattern resembles CO(GT), which aligns with its high correlation in the heatmap.
* The values fluctuate widely, possibly due to pollution events or measurement variations.

Key Takeaways:

1. Cyclic or Seasonal Trends: The features T and RH display cyclic patterns, suggesting periodic environmental changes.
2. High Correlation Features: The scatter plots for CO(GT) and PT08.S1(CO) suggest a direct relationship (also supported by the heatmap).
3. Potential Data Issues: Some plots show horizontal bands, which might indicate a few sensor errors. Such bands may also come from the missing values that were replaced with the column mean earlier.