## Exploratory Data Analysis
Exploratory Data Analysis is a crucial step in understanding our dataset and making informed decisions about feature engineering, model selection, and more.

### Objective
- Create visualisations, with as many graphs as are useful, and state a conclusion after each step.
- Identify patterns in the data.
- Suggest what feature engineering can be done.

In [None]:
# Import the libraries needed for exploratory data analysis

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

In [None]:
# Read the data into a pandas DataFrame
df = pd.read_csv('gujarat_aqi.csv')

In [None]:
# Print basic information about the data
df.info()
df.head()

As we can see here, some values are missing in the NO2 and SO2 columns, so we will have to deal with that during data preprocessing. For now I will remove those rows, since 7 rows is a negligible fraction of the dataset.

In [None]:
df = df.dropna(subset=['NO2', 'SO2'])
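
Before dropping rows like this, it can help to count the missing values per column and check what share of the data would be lost. A minimal sketch on a small made-up frame (the values below are illustrative only and are not taken from `gujarat_aqi.csv`):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; the values are made up for illustration
toy = pd.DataFrame({
    'NO2': [12.0, np.nan, 30.5, 18.2, 22.1],
    'SO2': [8.0, 9.5, np.nan, 7.1, 10.3],
    'RSPM/PM10': [95.0, 110.0, 87.0, 120.0, 101.0],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Rows that dropna(subset=['NO2', 'SO2']) would remove, and their share
rows_with_gaps = toy[['NO2', 'SO2']].isna().any(axis=1)
share_dropped = rows_with_gaps.mean()

print(missing_per_column)
print(f"Rows that would be dropped: {rows_with_gaps.sum()} ({share_dropped:.0%})")
```

Running the same two checks on `df` would show whether the rows being removed really are a small fraction.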

Here we can see that this column has only one unique value, 'Gujarat', which is expected because the data covers only Gujarat. We'll have to remove this column, as it will serve no purpose in the model.

The SPM column is entirely empty, perhaps because the measurements weren't available, so we will have to remove this column as well.
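
Columns like these two can also be found programmatically rather than by eye. The sketch below uses a hypothetical helper, `find_uninformative_columns`, on a made-up frame; the column name `State` is an assumption, since the notebook does not name the single-valued column.

```python
import numpy as np
import pandas as pd

def find_uninformative_columns(frame: pd.DataFrame) -> list:
    """Return columns that are entirely empty or hold a single unique value."""
    # nunique() ignores NaN by default, so an all-NaN column has 0 unique values
    unique_counts = frame.nunique()
    return unique_counts[unique_counts <= 1].index.tolist()

# Toy frame with made-up values; 'State' is an assumed column name
toy = pd.DataFrame({
    'State': ['Gujarat', 'Gujarat', 'Gujarat'],
    'SPM': [np.nan, np.nan, np.nan],
    'NO2': [12.0, 30.5, 18.2],
})

to_drop = find_uninformative_columns(toy)
cleaned = toy.drop(columns=to_drop)

print("Dropping:", to_drop)
print(cleaned.columns.tolist())
```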

Here are some graphs plotted to help visualize the given data:

In [None]:
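# Sample once so the plotted points and their colours come from the same rows
sample = df.sample(n=min(1935, len(df)), random_state=0)
# Industrial areas in red, every other location type in blue
sample = sample.assign(Area=np.where(sample['Type of Location'] == 'Industrial Area', 'Industrial Area', 'Other'))
sns.scatterplot(x='NO2',
                y='SO2',
                data=sample,
                hue='Area',
                palette={'Industrial Area': 'red', 'Other': 'blue'},
                marker='o',
                s = 20
                )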
plt.xlabel('NO2')
plt.ylabel('SO2')
plt.title('Scatter Plot: NO2 vs. SO2')
plt.legend()
plt.grid(True)
plt.show()

Conclusions derived from the above graph: here we can clearly see that the non-industrial points (rural and residential areas, in blue) have lower levels of NO2 and SO2, both of which are gases associated with pollution and chemical processes. Because these points seem to form a visible cluster, clustering should be possible, which makes models like k-means and HDBSCAN viable for our project.

In [None]:
# Colour by location type through `hue` so each colour is tied to its own row
area = np.where(df['Type of Location'] == 'Industrial Area', 'Industrial Area', 'Other')
sns.scatterplot(x='SO2', y='RSPM/PM10', data=df.assign(Area=area), hue='Area', palette={'Industrial Area': 'red', 'Other': 'blue'}, marker='o', s=20)
plt.xlabel('SO2')
plt.ylabel('RSPM/PM10')
plt.title('Scatter Plot: SO2 vs. RSPM/PM10')
plt.legend()
plt.grid(True)
plt.show()

# Colour by location type through `hue` so each colour is tied to its own row
area = np.where(df['Type of Location'] == 'Industrial Area', 'Industrial Area', 'Other')
sns.scatterplot(x='NO2', y='RSPM/PM10', data=df.assign(Area=area), hue='Area', palette={'Industrial Area': 'red', 'Other': 'blue'}, marker='o', s=20)
plt.xlabel('NO2')
plt.ylabel('RSPM/PM10')
plt.title('Scatter Plot: NO2 vs. RSPM/PM10')
plt.legend()
plt.grid(True)
plt.show()


selected_columns = ['SO2', 'NO2', 'RSPM/PM10']
correlation_matrix = df[selected_columns].corr()

plt.figure(figsize=(4, 3))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title("Correlation Matrix")
plt.show()

The conclusions derived from the above graphs are as follows: they show a positive correlation between the three pollutants, which makes sense, as areas with high SO2 pollution are also the most likely to have gases like NO2. The correlation matrix also clearly shows a fairly strong correlation between the three pollutants.

In [None]:
custom_colors = {
    'Ahmedabad': 'red',
    'Ankleshwar': 'blue',
    'Jamnagar': 'green',
    'Rajkot': 'orange',
    'Surat': 'purple',
    'Vadodara': 'pink',
    'Vapi': 'brown'
}

# Sample once and let seaborn colour each point by city through `hue`
city_sample = df.sample(n=min(1000, len(df)), random_state=0)

sns.scatterplot(x='NO2',
                y='SO2',
                data=city_sample,
                hue='City/Town/Village/Area',
                palette=custom_colors,
                marker='o',
                s = 15
                )
plt.xlabel('NO2')
plt.ylabel('SO2')
plt.title('Scatter Plot: NO2 vs. SO2')
plt.grid(True)
plt.show()

The conclusion derived from the above graph: it shows the NO2 and SO2 levels from different cities, and the data indicates that the levels are fairly evenly balanced among most cities.
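
A per-city summary table is one way to check this impression numerically. The sketch below is only an illustration on made-up values; the column names follow the notebook, but none of the numbers come from the real dataset.

```python
import pandas as pd

# Toy data with made-up values; column names follow the notebook
toy = pd.DataFrame({
    'City/Town/Village/Area': ['Surat', 'Surat', 'Vapi', 'Vapi', 'Rajkot', 'Rajkot'],
    'NO2': [20.0, 24.0, 30.0, 34.0, 18.0, 22.0],
    'SO2': [10.0, 12.0, 16.0, 20.0, 9.0, 11.0],
})

# Median level and number of samples per city for each pollutant
city_summary = (
    toy.groupby('City/Town/Village/Area')[['NO2', 'SO2']]
       .agg(['median', 'count'])
)

print(city_summary)
```

The median is used here rather than the mean so that a few extreme readings do not dominate the comparison.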

In [None]:
# In order to perform time series analysis, split the date column into three columns: 'Day', 'Month' and 'Year'
df[['Day', 'Month', 'Year']] = df['Sampling Date'].str.split('/', expand=True)

# Convert the columns to integers, as they are 'object' dtype by default, which slows down computation a bit
df['Day'] = df['Day'].astype(int)
df['Month'] = df['Month'].astype(int)
# 'Year' is not used in the plots below, so drop it
df = df.drop(columns = 'Year')

In [None]:
sns.scatterplot(x='Month', y='RSPM/PM10', data=df, color='b', marker='o',s = 20, label='Month vs. RSPM/PM10')
plt.xlabel('Month')
plt.ylabel('RSPM/PM10')
plt.title('Scatter Plot: Month vs. RSPM/PM10')
plt.legend()
plt.grid(True)
plt.show()

sns.scatterplot(x='Month', y='NO2', data=df, color='b', marker='o',s = 20, label='Month vs. NO2')
plt.xlabel('Month')
plt.ylabel('NO2')
plt.title('Scatter Plot: Month vs. NO2')
plt.legend()
plt.grid(True)
plt.show()

sns.scatterplot(x='Month', y='SO2', data=df, color='b', marker='o',s = 20, label='Month vs. SO2')
plt.xlabel('Month')
plt.ylabel('SO2')
plt.title('Scatter Plot: Month vs. SO2')
plt.legend()
plt.grid(True)
plt.show()

The above graphs show how the different pollutants change across months. We can see a sort of zig-zag pattern, with a high in SO2 and NO2 during the summer months. However, the patterns aren't very prominent, so it's not possible to conclude anything with certainty.
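
Because a raw scatter plot by month is hard to read, aggregating per month may make any seasonal pattern easier to judge. A minimal sketch, assuming a `Month` column like the one created above; the values are made up for illustration.

```python
import pandas as pd

# Toy data with made-up values; 'Month' mirrors the column created above
toy = pd.DataFrame({
    'Month': [1, 1, 4, 4, 8, 8],
    'SO2': [10.0, 12.0, 18.0, 20.0, 9.0, 11.0],
    'NO2': [20.0, 22.0, 30.0, 34.0, 18.0, 20.0],
})

# Mean and standard deviation of each pollutant per month
monthly = toy.groupby('Month')[['SO2', 'NO2']].agg(['mean', 'std'])

print(monthly)
```

Comparing the gap between monthly means with the monthly standard deviations gives a rough sense of whether a peak stands out from the noise.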

In [None]:
sns.scatterplot(x='Day', y='RSPM/PM10', data=df, color='b', marker='o',s = 20, label='Days vs. RSPM/PM10')
plt.xlabel('Day')
plt.ylabel('RSPM/PM10')
plt.title('Scatter Plot: Days vs. RSPM/PM10')
plt.legend()
plt.grid(True)
plt.show()

sns.scatterplot(x='Day', y='NO2', data=df, color='b', marker='o',s = 20, label='Days vs. NO2')
plt.xlabel('Day')
plt.ylabel('NO2')
plt.title('Scatter Plot: Days vs. NO2')
plt.legend()
plt.grid(True)
plt.show()

sns.scatterplot(x='Day', y='SO2', data=df, color='b', marker='o',s = 20, label='Days vs. SO2')
plt.xlabel('Day')
plt.ylabel('SO2')
plt.title('Scatter Plot: Days vs. SO2')
plt.legend()
plt.grid(True)
plt.show()

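In the above graphs, we can see that on average the amount of pollutants tends to rise and fall periodically, roughly once every 7 days. This could be related to weekends, when the factories are perhaps closed, but I can't say that with complete certainty, especially since 'Day' here is the day of the month and does not line up with the same weekday from one month to the next.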
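
One way to test the weekend idea more directly would be to derive the day of the week from the full date. The sketch below assumes the dates are in day/month/year order with a four-digit year (the split on `/` above suggests this ordering, but the exact format is an assumption), and the dates and readings are made up.

```python
import pandas as pd

# Toy data; the dates and SO2 readings are made up for illustration
toy = pd.DataFrame({
    'Sampling Date': ['01/01/2018', '06/01/2018', '07/01/2018', '08/01/2018'],
    'SO2': [14.0, 9.0, 8.0, 15.0],
})

# Parse the dates, assuming a day/month/year format
dates = pd.to_datetime(toy['Sampling Date'], format='%d/%m/%Y')

# dayofweek runs from Monday=0 to Sunday=6, so 5 and 6 are the weekend
toy['Weekday'] = dates.dt.day_name()
toy['Is Weekend'] = dates.dt.dayofweek >= 5

# Average reading on weekdays versus weekends
weekend_means = toy.groupby('Is Weekend')['SO2'].mean()

print(toy)
print(weekend_means)
```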

From the above observations I can conclude the following:

Conclusion 1: The data indicates a sort of zig-zag pattern in SO2 and NO2 levels, with peaks during the summer months, but these patterns are not very prominent.

Conclusion 2: On average, pollutant levels tend to decrease approximately once every 7 days, possibly related to weekends, when factories may be closed or fewer cars are on the roads.

Conclusion 3: There seems to be an equal balance among most cities in terms of NO2 and SO2 levels.

Conclusion 4: There is a positive correlation between the three pollutants, which makes sense as areas with high SO2 pollution are more likely to have NO2 as well.

Conclusion 5: Rural and residential areas have lower levels of NO2 and SO2, forming a visible cluster. 

Given these conclusions, here are some analysis approaches I'm interested in:

Time Series Analysis: models like Prophet could help to further investigate the zig-zag patterns and the weekly variations.

Clustering models: as we've identified clusters in rural and residential areas, implementing k-means or HDBSCAN clustering can help in understanding regional pollution trends.
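
As a rough illustration of the clustering idea, the sketch below runs scikit-learn's `KMeans` on synthetic NO2/SO2 readings. The two groups are generated artificially, and the choice of two clusters and of standardising the features first are assumptions, not results from the real data.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic data: one low-pollution group and one high-pollution group
rng = np.random.default_rng(0)
low = rng.normal(loc=[15.0, 8.0], scale=2.0, size=(50, 2))
high = rng.normal(loc=[40.0, 25.0], scale=2.0, size=(50, 2))
toy = pd.DataFrame(np.vstack([low, high]), columns=['NO2', 'SO2'])

# Standardise so both pollutants contribute equally to the distances
scaled = StandardScaler().fit_transform(toy)

# Fit k-means with two clusters (an assumed value for this illustration)
model = KMeans(n_clusters=2, n_init=10, random_state=0)
toy['Cluster'] = model.fit_predict(scaled)

# Average pollutant level inside each cluster
print(toy.groupby('Cluster')[['NO2', 'SO2']].mean())
```

On the real data, the number of clusters would need to be chosen with something like an elbow plot or silhouette scores instead of being fixed in advance.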

Regression and classification models: models like logistic regression and random forest can help us classify the type of location.
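
A classification experiment along these lines could be sketched as follows. The data is synthetic, the 'Industrial Area' versus 'Other' labels mirror the colouring used in the scatter plots, and the random forest settings are assumptions chosen only to keep the example small.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic pollutant readings for two location types (made-up distributions)
rng = np.random.default_rng(0)
n = 100
industrial = rng.normal(loc=[40.0, 25.0, 150.0], scale=4.0, size=(n, 3))
other = rng.normal(loc=[18.0, 10.0, 100.0], scale=4.0, size=(n, 3))

X = pd.DataFrame(np.vstack([industrial, other]),
                 columns=['NO2', 'SO2', 'RSPM/PM10'])
y = np.array(['Industrial Area'] * n + ['Other'] * n)

# 5-fold cross-validated accuracy of a random forest classifier
clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)

print("Accuracy per fold:", scores)
print("Mean accuracy:", scores.mean())
```

The synthetic groups here are deliberately well separated, so the accuracy says nothing about how separable the real location types are.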