This repository contains a data analysis and visualization notebook built using Python libraries like Pandas, NumPy, Seaborn, and Matplotlib. The dataset used contains information about pollution levels across various cities, stations, and countries.
The dataset used in this project is a .csv file (example: 3b01bcb8-0b14-4abf-b6f2-c1bfd384ba69.csv) that includes fields such as:
- station
- city
- country
- pollutant_avg
- latitude
- longitude

Tools and libraries used:
- Python
- Pandas
- NumPy
- Matplotlib
- Seaborn
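
As a rough illustration of how the dataset might be loaded with Pandas, the sketch below reads a CSV and checks that the fields listed above are present. The helper name `load_pollution_data`, the `EXPECTED_COLUMNS` list, and the sample rows are assumptions made for this example; they are not taken from the notebook.

```python
import io

import pandas as pd

# Hypothetical sample rows standing in for the real CSV file; all values are made up.
SAMPLE_CSV = """station,city,country,pollutant_avg,latitude,longitude
Station A,City X,Country 1,42.0,10.5,20.1
Station B,City Y,Country 1,,11.2,21.4
Station C,City Z,Country 2,58.5,12.8,22.9
"""

# Fields named in this README (the real file may contain more columns).
EXPECTED_COLUMNS = ["station", "city", "country", "pollutant_avg", "latitude", "longitude"]


def load_pollution_data(source):
    """Read the CSV and fail early if an expected column is missing."""
    df = pd.read_csv(source)
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    return df


# Replace the StringIO object with the path to your own CSV file.
df = load_pollution_data(io.StringIO(SAMPLE_CSV))
print(df.dtypes)
print(df.isna().sum())  # count of missing values per column
```

Checking the columns up front makes a mismatched file fail with a clear message instead of a `KeyError` later in the plotting code.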
This project generates multiple types of visualizations to understand the dataset:
- Bar Plot: Average pollutant by station
- Pie Chart: Distribution of entries by city
- Histogram: Distribution of pollutant_avg values
- Scatter Plot: Pollutant Average vs Latitude with country-wise hue
- Line Plot: Pollutant Average across Longitudes
- Correlation Heatmap: Relationship among numeric columns
- Box Plot: Pollutant Average distribution per country
- Pair Plot: Country-wise scatter relationships
- Outlier-Free Box Plot: Refined pollutant averages by country
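
Two of the plots above, the bar plot by station and the scatter plot with a country-wise hue, could be produced roughly as follows. This is a minimal sketch, not the notebook's actual code: the sample data is invented, and the figure layout and titles are assumptions.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical sample data; all values are made up for illustration.
df = pd.DataFrame(
    {
        "station": ["Station A", "Station A", "Station B", "Station C"],
        "country": ["Country 1", "Country 1", "Country 1", "Country 2"],
        "pollutant_avg": [40.0, 50.0, 30.0, 60.0],
        "latitude": [10.5, 10.5, 11.2, 12.8],
    }
)

# Average pollutant per station, highest first.
station_means = (
    df.groupby("station", as_index=False)["pollutant_avg"]
    .mean()
    .sort_values("pollutant_avg", ascending=False)
)

fig, (ax_bar, ax_scatter) = plt.subplots(1, 2, figsize=(12, 4))

# Bar plot: average pollutant by station.
sns.barplot(data=station_means, x="station", y="pollutant_avg", ax=ax_bar)
ax_bar.set_title("Average pollutant by station")
ax_bar.tick_params(axis="x", rotation=45)

# Scatter plot: pollutant average vs latitude, colored by country.
sns.scatterplot(data=df, x="latitude", y="pollutant_avg", hue="country", ax=ax_scatter)
ax_scatter.set_title("Pollutant average vs latitude")

fig.tight_layout()
```

In a notebook the figure renders inline; in a script, call `plt.show()` or `fig.savefig(...)` with an interactive or file-writing backend.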

Data preprocessing:
- Missing numeric values are filled with column means.
- Missing categorical values are filled with the mode.
- Outliers are handled using the IQR (Interquartile Range) method.
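
The three cleaning steps above can be sketched as below. The sample data is invented, and two details are assumptions rather than facts about the notebook: the conventional 1.5 × IQR fences, and the choice to drop outlier rows instead of capping them.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value per column and one extreme reading.
df = pd.DataFrame(
    {
        "country": ["Country 1", "Country 1", None, "Country 2", "Country 1", "Country 2"],
        "pollutant_avg": [40.0, 42.0, 44.0, np.nan, 46.0, 500.0],
    }
)

# 1. Fill missing numeric values with the column mean.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

# 2. Fill missing categorical values with the most frequent value (mode).
categorical_cols = df.select_dtypes(exclude="number").columns
for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode().iloc[0])

# 3. Drop rows outside the IQR fences (1.5 multiplier is an assumption).
q1 = df["pollutant_avg"].quantile(0.25)
q3 = df["pollutant_avg"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
cleaned = df[df["pollutant_avg"].between(lower, upper)]

print(cleaned)
```

Note that mean imputation happens before the outlier filter here, so an extreme value pulls the imputed mean upward; filtering first, or imputing with the median, is a reasonable alternative if that matters for your data.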
The notebook also prints descriptive statistics using df.describe() to give insight into the dataset's distribution, central tendency, and spread.
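
For reference, a small example of what `df.describe()` returns; the data here is made up, and the row selection at the end is just one way to pull out specific statistics.

```python
import pandas as pd

# Hypothetical sample data; values are made up for illustration.
df = pd.DataFrame(
    {
        "pollutant_avg": [30.0, 40.0, 50.0, 60.0],
        "latitude": [10.0, 11.0, 12.0, 13.0],
    }
)

# describe() reports count, mean, std, min, quartiles, and max per numeric column.
summary = df.describe()
print(summary)

# Individual statistics can be selected by row label.
print(summary.loc[["mean", "50%", "max"], "pollutant_avg"])
```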
To run this project:
- Clone the repository or download the .ipynb/.py file.
- Make sure you have the required libraries installed:
  pip install pandas numpy matplotlib seaborn
- Replace the CSV path in the read_csv() method with your dataset path.
- Run the script or notebook.

Key outcomes:
- Visual plots for easy understanding of trends
- Insights into pollution averages by location
- Detection and removal of outliers
- Heatmaps showing correlation between numerical columns
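
A correlation heatmap over the numeric columns might be built along these lines. This is a sketch with invented data; the color map, the fixed -1 to 1 color range, and the title are assumptions, not the notebook's settings.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical sample data; all values are made up for illustration.
df = pd.DataFrame(
    {
        "country": ["Country 1", "Country 1", "Country 2", "Country 2"],
        "pollutant_avg": [40.0, 55.0, 30.0, 65.0],
        "latitude": [10.5, 11.2, 12.8, 13.1],
        "longitude": [20.1, 21.4, 22.9, 19.7],
    }
)

# Keep only numeric columns; text columns such as "country" cannot be correlated.
corr = df.select_dtypes(include="number").corr()

fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation among numeric columns")
fig.tight_layout()
```

Fixing the color range to [-1, 1] keeps the colors comparable across datasets, since Pearson correlation always falls in that interval.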