# About This Notebook

Hey there, and a warm welcome to my EDA journey! 🚀 This notebook, published on GitHub, takes a first exploratory look at a world air-quality dataset. Every step is commented, so it can double as a starting point for crafting your own notebooks. Don't fret if things seem a bit complex at first: the code is simpler than it looks, and I'm here to keep you on the right track! Let's embark on this data adventure together. 🌟

In [None]:
import numpy as np
import pandas as pd
import missingno as msno
import seaborn as sns
import matplotlib.pyplot as plt
import geopandas as gpd

import warnings
warnings.filterwarnings('ignore')

In [None]:
air_quality = pd.read_csv('/kaggle/input/air-quality-in-biggest-cities-of-the-world/world_air_quality_with_locations.csv', index_col=0)
air_quality.sample(2)

> Let's dive into the CSV and discover the hidden treasures within! 📊 First things first: we'll count the NaN values (fear not, there are only a few 🕵️‍♂️) and check the data types, converting just one column (`Last Updated`) to a datetime so it is easier to visualize. The `City` column in particular plays a bit coy with NaN values, so I highly recommend focusing on the `Location` column to unlock the visual wonders of GEO data. Let's unravel the story this data holds! 🌍✨

In [None]:
# count NaN values per column
air_quality.isna().sum()

In [None]:
# summary of the data
air_quality.info()

In [None]:
air_quality['Last Updated'] = pd.to_datetime(air_quality['Last Updated'])
air_quality['Last Updated'].info()
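
With its default settings, `pd.to_datetime` raises an error as soon as it meets a value it cannot parse. A minimal sketch of a more forgiving variant is shown here on a made-up three-row Series; the sample strings are an assumption for illustration, not values taken from the dataset.

```python
import pandas as pd

# Made-up timestamps for illustration; the real column may use another format.
raw = pd.Series(['2023-05-01T10:00:00+00:00', 'not a date', None], name='Last Updated')

# errors='coerce' turns unparseable entries into NaT instead of raising;
# utc=True normalises the result to timezone-aware UTC timestamps.
parsed = pd.to_datetime(raw, errors='coerce', utc=True)

print(parsed.dtype)
print(parsed.isna().sum())  # number of entries that could not be parsed
```

Counting the resulting `NaT` values afterwards is a quick way to see how many rows would drop out of any time-based plot.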

In [None]:
msno.matrix(air_quality)
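
To put a number on "but a few", the raw NaN counts can be turned into a share per column. This is only a sketch on a toy frame whose column names mirror the dataset but whose values are invented; on the real data the same expression would be applied to `air_quality`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for air_quality; all values are invented.
toy = pd.DataFrame({
    'City': ['Delhi', np.nan, 'Lima', np.nan],
    'Location': ['A', 'B', 'C', 'D'],
    'Value': [12.0, 7.5, np.nan, 3.1],
})

# isna().mean() gives the fraction of missing entries per column;
# multiplying by 100 turns it into a percentage.
missing_pct = (toy.isna().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```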

*Hold on tight, for we're about to venture into the world of vibrant visualizations, a quintessential element of my EDA notebooks! 🌈 Today is no different. We'll plot the latitude and longitude of the measurement locations in a simple yet enlightening scatterplot, colored by pollutant. 📈 My initial excitement was met with a twinge of disappointment: Latin America (Latam) has fewer measurements than I anticipated, and I was eager to see how the region performs. Still, the scatterplot reveals more than you'd imagine. Let's unravel these visual tales! 🌏✨*

In [None]:
sns.set_style('darkgrid')
plt.figure(figsize=(10, 8))
# axes-level scatterplot, so it draws on the figure created above
sns.scatterplot(data=air_quality, x='Long', y='Lat', hue='Pollutant')
plt.title('Monitored Areas')
plt.show()

***The intrigue deepens! Latam coverage turned out thin, but my curiosity for regional detail persists. 🌎🔍 So I set my sights on India, a standout country for the number of measurements in this dataset. I repeat the earlier scatter plot and take it up a notch by overlaying it on a country map with GeoPandas, the champion of visualizing GEO data! 🗺️✨ There's something mysterious about that region, and I'm determined to unravel its secrets. Let's forge ahead on this visual expedition! 🚀***

In [None]:
# note: gpd.datasets was removed in GeoPandas 1.0, so this line needs geopandas < 1.0
world = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))
chosen_country = 'India'
country_geometry = world[world['name'] == chosen_country].geometry
df_chosen_country = air_quality[air_quality['Country Label'] == chosen_country]

# set the style first so it applies to the axes created next
sns.set_style('whitegrid')
fig, ax = plt.subplots(figsize=(10, 8))

# plot the geometry of the chosen country
country_geometry.plot(ax=ax, color='lightgrey')
# create the scatterplot on the same axes
sns.scatterplot(data=df_chosen_country, x='Long', y='Lat', hue='Pollutant', ax=ax)
# plot country borders
country_geometry.boundary.plot(ax=ax, linewidth=1.2, color='black')

ax.set_title(f'Monitored Areas in {chosen_country}')
plt.show()

> Hold tight! We're unmasking the metric masters. 📊 The three bar charts below rank the top five data sources, countries, and cities by how many rows they contribute to the dataset: the longer the bar, the more measurements. Let's bask in this visual feast. ✨



In [None]:
air_quality['Source Name'].value_counts(ascending=False).head().plot(kind='barh', figsize=(10, 4), color='green')
plt.xlabel('Metrics Count')
plt.ylabel('Metric Source')
plt.title('Top 5 Metric Sources')
plt.show()

In [None]:
air_quality['Country Label'].value_counts(ascending=False).head().plot(kind='barh', figsize=(10, 4), color='red')
plt.xlabel('Metrics Count')
plt.ylabel('Country')
plt.title('Top 5 Countries by Number of Metrics')
plt.show()

In [None]:
air_quality['City'].value_counts(ascending=False).head().plot(kind='barh', figsize=(10, 4), color='blue')
plt.xlabel('Metrics Count')
plt.ylabel('City')
plt.title('Top 5 Cities by Number of Metrics')
plt.show()
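
The three cells above repeat the same counting pattern, so it could be wrapped in a small helper. The function name `top_n_counts` and the toy data are hypothetical, introduced only for this sketch; they are not part of the original notebook.

```python
import pandas as pd

def top_n_counts(df, column, n=5):
    """Return the n most frequent values of `column` with their row counts."""
    # value_counts() sorts in descending order and skips NaN by default.
    return df[column].value_counts().head(n)

# Invented rows standing in for air_quality.
toy = pd.DataFrame({'Country Label': ['India', 'India', 'Chile', 'India', 'Chile', 'Peru']})
counts = top_n_counts(toy, 'Country Label', n=2)
print(counts)
```

The returned Series can be passed straight to `.plot(kind='barh')`, as in the cells above.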

***And that's a wrap! My first deep dive into ecology data - an exciting journey. If I've missed or misinterpreted anything, do share. Your feedback matters as I navigate this new realm. Cheers to your future notebooks! I'm all ears, all eyes. Let's keep the data tales spinning! 🚀🌿***