<a href="https://colab.research.google.com/github/AdwaaNasser/DataMiningProject/blob/main/phase1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Project Goal:** This project analyzes and predicts air quality using two complementary data mining techniques: classification and clustering. Classification builds a predictive model that assigns each record to one of the dataset's four air quality levels (Good, Moderate, Poor, and Hazardous) based on pollutant concentrations (PM2.5, PM10, NO2, SO2, CO), weather conditions (temperature and humidity), proximity to industrial areas, and population density. Clustering groups records with similar environmental characteristics to uncover hidden patterns and typical pollution profiles. Together, the two techniques provide insight into air quality trends and support more informed environmental decision-making.


**Dataset Source:** The dataset used in this study was obtained from Kaggle: https://www.kaggle.com/datasets/mujtabamatin/air-quality-and-pollution-assessment


**Dataset Description:**

Number of objects (records): 5000

Number of attributes: 10

Data type of each attribute:


| Attribute                     | Data Type |
|---------------------|--------|
| Temperature                   | float64   |
| Humidity                      | float64   |
| PM2.5                         | float64   |
| PM10                          | float64   |
| NO2                           | float64   |
| SO2                           | float64   |
| CO                            | float64   |
| Proximity_to_Industrial_Areas | float64   |
| Population_Density            | int64     |
| Air Quality                   | object    |
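The data types above suggest a simple way to separate the nine numeric predictors from the class label. The following is a minimal sketch, assuming a small DataFrame built from three rows of the raw dataset sample; the names `sample`, `feature_columns`, and `label_columns` are illustrative and do not come from the original notebook.

```python
import pandas as pd

# Three rows copied from the sample of the raw dataset (illustrative subset).
sample = pd.DataFrame({
    "Temperature": [29.8, 27.1, 40.6],
    "Humidity": [59.1, 39.1, 74.1],
    "PM2.5": [5.2, 6.1, 116.0],
    "PM10": [17.9, 6.3, 126.7],
    "NO2": [18.9, 13.5, 45.5],
    "SO2": [9.2, 5.3, 25.7],
    "CO": [1.72, 1.15, 2.11],
    "Proximity_to_Industrial_Areas": [6.3, 11.1, 2.8],
    "Population_Density": [319, 551, 765],
    "Air Quality": ["Moderate", "Good", "Hazardous"],
})

# Numeric columns (float64 and int64) are the candidate predictors.
feature_columns = sample.select_dtypes(include="number").columns.tolist()

# Whatever remains is the non-numeric class label.
label_columns = sample.columns.difference(feature_columns).tolist()

print(feature_columns)
print(label_columns)
```

Selecting by data type rather than by hard-coded column names keeps the split valid if the column order changes, although it assumes the label is the only non-numeric column.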

Detailed description:

Temperature (°C): A continuous numerical attribute indicating the average regional temperature.

Humidity (%): A continuous numerical attribute representing the relative humidity level in the region.

PM2.5 Concentration (µg/m³): A continuous numerical attribute measuring fine particulate matter in the air.

PM10 Concentration (µg/m³): A continuous numerical attribute measuring coarse particulate matter in the air.

NO2 Concentration (ppb): A continuous numerical attribute representing nitrogen dioxide concentration.

SO2 Concentration (ppb): A continuous numerical attribute representing sulfur dioxide concentration.

CO Concentration (ppm): A continuous numerical attribute representing carbon monoxide concentration.

Proximity to Industrial Areas (km): A continuous numerical attribute specifying the distance to the nearest industrial zone.

Population Density (people/km²): A discrete numerical attribute (stored as int64) representing the number of people living per square kilometer in the region.

Air Quality: A categorical attribute and the class label, with four levels: Good, Moderate, Poor, and Hazardous.

Count of instances for each label:

| Air Quality | Count |
|-------------|-------|
| Good        | 2000  |
| Moderate    | 1500  |
| Poor        | 1000  |
| Hazardous   | 500   |
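The four counts sum to 5000, which matches the number of records, and they show that the classes are imbalanced. A short worked calculation, using only the counts from the table above (the variable names are illustrative), makes the proportions explicit:

```python
# Label counts copied from the table above.
label_counts = {"Good": 2000, "Moderate": 1500, "Poor": 1000, "Hazardous": 500}

# Total number of labeled records.
total = sum(label_counts.values())

# Share of the dataset held by each class.
shares = {label: count / total for label, count in label_counts.items()}

for label, share in shares.items():
    print(f"{label:<10} {share:.0%}")

# Ratio between the largest and the smallest class.
imbalance_ratio = max(label_counts.values()) / min(label_counts.values())
print(f"Imbalance ratio: {imbalance_ratio:.1f}")
```

Good covers 40% of the records and Hazardous only 10%, so the largest class is four times the size of the smallest. This is worth keeping in mind when splitting the data and when interpreting classification accuracy.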

A sample of the raw dataset (the first five and last five rows):


| Temperature | Humidity | PM2.5 | PM10 |  NO2 | SO2 |  CO  | Proximity_to_Industrial_Areas | Population_Density | Air Quality |
|-------------|----------|-------|------|------|-----|------|-------------------------------|---------------------|--------------|
|        29.8 |     59.1 |   5.2 | 17.9 | 18.9 | 9.2 | 1.72 |                           6.3 |                 319 |     Moderate |
|        28.3 |     75.6 |   2.3 | 12.2 | 30.8 | 9.7 | 1.64 |                           6.0 |                 611 |     Moderate |
|        23.1 |     74.7 |  26.7 | 33.8 | 24.4 |12.6 | 1.63 |                           5.2 |                 619 |     Moderate |
|        27.1 |     39.1 |   6.1 |  6.3 | 13.5 | 5.3 | 1.15 |                          11.1 |                 551 |         Good |
|        26.5 |     70.7 |   6.9 | 16.0 | 21.9 | 5.6 | 1.01 |                          12.7 |                 303 |         Good |
|        40.6 |     74.1 | 116.0 |126.7 | 45.5 |25.7 | 2.11 |                           2.8 |                 765 |   Hazardous |
|        28.1 |     96.9 |   6.9 | 25.0 | 25.3 |10.8 | 1.54 |                           5.7 |                 709 |     Moderate |
|        25.9 |     78.2 |  14.2 | 22.1 | 34.8 | 7.8 | 1.63 |                           9.6 |                 379 |     Moderate |
|        25.3 |     44.4 |  21.4 | 29.0 | 23.7 | 5.7 | 0.89 |                          11.6 |                 241 |         Good |
|        24.1 |     77.9 |  81.7 | 94.3 | 23.2 |10.5 | 1.38 |                           8.3 |                 461 |     Moderate |


Code snippets:

In [4]:
import pandas as pd
df = pd.read_csv('updated_pollution_dataset.csv')

# Get the data type of each attribute
attribute_types = pd.DataFrame({
    'Attribute': df.columns,
    'Data Type': df.dtypes.values
})
print(attribute_types)
print()


                       Attribute Data Type
0                    Temperature   float64
1                       Humidity   float64
2                          PM2.5   float64
3                           PM10   float64
4                            NO2   float64
5                            SO2   float64
6                             CO   float64
7  Proximity_to_Industrial_Areas   float64
8             Population_Density     int64
9                    Air Quality    object



In [7]:
# Number of attributes (columns)
print(f"The number of attributes : {df.shape[1]}")
print()
# Number of objects (rows)
print(f"The number of objects (records) : {df.shape[0]}")
# Class attribute values with the count of instances for each label
label_column = 'Air Quality'
if label_column in df.columns:
    print("\nCount of instances for each label :")
    print(df[label_column].value_counts())

The number of attributes : 10

The number of objects (records) : 5000

Count of instances for each label :
Air Quality
Good         2000
Moderate     1500
Poor         1000
Hazardous     500
Name: count, dtype: int64
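Because the class attribute is stored as text (`object`), a later classification step may need it in numeric form. The sketch below shows one possible encoding with `pd.Categorical`. The severity ordering Good < Moderate < Poor < Hazardous is an assumption made here for illustration; it is not stated in the dataset or in the notebook, and the names `severity_order`, `labels`, and `encoded` are hypothetical.

```python
import pandas as pd

# Assumed ordering from least to most severe (not given by the dataset).
severity_order = ["Good", "Moderate", "Poor", "Hazardous"]

# A few label values taken from the dataset sample.
labels = pd.Series(
    ["Moderate", "Good", "Hazardous", "Moderate", "Good"],
    name="Air Quality",
)

# Ordered categorical: each label maps to its position in severity_order.
encoded = pd.Categorical(labels, categories=severity_order, ordered=True)

print(encoded.codes.tolist())
```

An ordered encoding preserves the ranking between levels; if the levels should be treated as unordered classes instead, the same call with `ordered=False` would be the closer fit.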


In [6]:
print("\nSample of the raw dataset:")
print(df.head())
print(df.tail())


Sample of the raw dataset:
   Temperature  Humidity  PM2.5  PM10   NO2   SO2    CO  \
0         29.8      59.1    5.2  17.9  18.9   9.2  1.72   
1         28.3      75.6    2.3  12.2  30.8   9.7  1.64   
2         23.1      74.7   26.7  33.8  24.4  12.6  1.63   
3         27.1      39.1    6.1   6.3  13.5   5.3  1.15   
4         26.5      70.7    6.9  16.0  21.9   5.6  1.01   

   Proximity_to_Industrial_Areas  Population_Density Air Quality  
0                            6.3                 319    Moderate  
1                            6.0                 611    Moderate  
2                            5.2                 619    Moderate  
3                           11.1                 551        Good  
4                           12.7                 303        Good  
      Temperature  Humidity  PM2.5   PM10   NO2   SO2    CO  \
4995         40.6      74.1  116.0  126.7  45.5  25.7  2.11   
4996         28.1      96.9    6.9   25.0  25.3  10.8  1.54   
4997         25.9      78.