##HALL TICKET - 2303A51097
##B-02
##AIR QUALITY
##PREDICTION OF AIR QUALITY IN ITALIAN CITIES.
##1. Identify the top 5 reasons for air quality issues
##2. Identify the day of the week with the most air quality issues
##3. Find the max and min air quality levels
##4. Identify the temperatures at the highest and lowest air quality levels
##5. Apply either a classification model or a clustering model to evaluate the dataset, and give the Python code for it

In [16]:
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

In [17]:
# Load the dataset
file_path = "AirQualityUCI.csv"  # Update this with the correct path to your file
data = pd.read_csv(file_path, sep=';')
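
The file uses ';' as the field separator and ',' as the decimal mark, so the numeric columns can also be parsed while reading instead of being converted afterwards. A minimal sketch of that alternative; the two-row `sample` string is made up for illustration and only stands in for `AirQualityUCI.csv`:

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same layout as the CSV
# (';' separator, ',' decimal mark, trailing ';;' producing empty columns).
sample = (
    "Date;Time;CO(GT);T;RH;;\n"
    "10/03/2004;18.00.00;2,6;13,6;48,9;;\n"
    "10/03/2004;19.00.00;2;13,3;47,7;;\n"
)

# decimal=',' makes pandas parse "2,6" as the float 2.6 while reading.
df = pd.read_csv(io.StringIO(sample), sep=";", decimal=",")

# Drop columns and rows that are completely empty (the trailing ';;' columns).
df = df.dropna(axis=1, how="all").dropna(axis=0, how="all")

print(df.dtypes)
```

With this approach the per-column string replacement is no longer needed, because the affected columns already arrive as floats.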

In [18]:
# Data Cleaning
# Drop the two empty columns produced by the trailing ';;' on each CSV row
data_cleaned = data.drop(columns=["Unnamed: 15", "Unnamed: 16"], errors='ignore')

# Replace decimal commas with dots in the numeric columns and convert them to float
columns_to_clean = ["CO(GT)", "C6H6(GT)", "T", "RH", "AH"]
for col in columns_to_clean:
    data_cleaned[col] = pd.to_numeric(
        data_cleaned[col].str.replace(',', '.', regex=False), errors='coerce'
    )

# Drop rows that contain NaN (e.g. fully empty rows in the file).
# Note: the dataset marks missing readings with -200, which is a regular number
# to pandas, so dropna() does not remove those readings.
data_cleaned = data_cleaned.dropna()

# Convert Date to datetime and add a Day of Week column
data_cleaned["Date"] = pd.to_datetime(data_cleaned["Date"], format="%d/%m/%Y", errors='coerce')
data_cleaned["Day_of_Week"] = data_cleaned["Date"].dt.day_name()

# 1. Identify the top 5 reasons for air quality issues
# (ranked by absolute Pearson correlation with CO(GT); correlation does not prove causation)
numeric_columns = data_cleaned.select_dtypes(include='number').columns
correlations = data_cleaned[numeric_columns].corr()
top_5_reasons = correlations["CO(GT)"].drop("CO(GT)").abs().sort_values(ascending=False).head(5)
print("Top 5 reasons for air quality issues:")
print(top_5_reasons)

# 2. Identify the day of the week with the most air quality issues
day_with_most_issues = data_cleaned.groupby("Day_of_Week")["CO(GT)"].mean().sort_values(ascending=False)
print("\nDay of the week with most air quality issues:")
print(day_with_most_issues)

# 3. Find the max and min air quality levels
max_air_quality = data_cleaned["CO(GT)"].max()
min_air_quality = data_cleaned["CO(GT)"].min()
print("\nMaximum air quality level (CO(GT)):", max_air_quality)
print("Minimum air quality level (CO(GT)):", min_air_quality)

# 4. Identify the temperatures recorded at the highest and lowest CO(GT) readings
max_temperature = data_cleaned.loc[data_cleaned["CO(GT)"].idxmax(), "T"]
min_temperature = data_cleaned.loc[data_cleaned["CO(GT)"].idxmin(), "T"]
print("\nTemperature associated with maximum air quality:", max_temperature)
print("Temperature associated with minimum air quality:", min_temperature)

# 5. Clustering to evaluate air quality patterns
# Select relevant features for clustering
features = ["CO(GT)", "NO2(GT)", "NOx(GT)", "NMHC(GT)", "T", "RH"]
data_for_clustering = data_cleaned[features]

# Standardize the features for clustering
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data_for_clustering)

# Apply K-Means Clustering
kmeans = KMeans(n_clusters=3, random_state=42)
data_cleaned["Cluster"] = kmeans.fit_predict(scaled_data)

# Analyze cluster characteristics
cluster_centers = pd.DataFrame(scaler.inverse_transform(kmeans.cluster_centers_), columns=features)
cluster_distribution = data_cleaned["Cluster"].value_counts()

print("\nCluster Distribution:")
print(cluster_distribution)

print("\nCluster Characteristics (Cluster Centers):")
print(cluster_centers)


Top 5 reasons for air quality issues:
NO2(GT)         0.671127
NOx(GT)         0.526451
NMHC(GT)        0.128351
PT08.S3(NOx)    0.089981
PT08.S5(O3)     0.080310
Name: CO(GT), dtype: float64

Day of the week with most air quality issues:
Day_of_Week
Friday      -24.583259
Saturday    -27.126414
Monday      -30.063820
Sunday      -35.432292
Thursday    -35.806176
Tuesday     -41.773864
Wednesday   -44.917647
Name: CO(GT), dtype: float64

Maximum air quality level (CO(GT)): 11.9
Minimum air quality level (CO(GT)): -200.0
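
A minimum of -200.0 and negative weekday averages are not physical CO concentrations: the UCI description of this dataset states that missing values are tagged with -200. A minimal sketch of masking that marker before computing statistics, assuming every -200 in a numeric column is a missing reading; the small `demo` frame is made up for illustration, and the figures printed for the real file are not recomputed here:

```python
import numpy as np
import pandas as pd

# Hypothetical readings; -200 marks a missing measurement.
demo = pd.DataFrame({
    "Day_of_Week": ["Monday", "Monday", "Tuesday", "Tuesday"],
    "CO(GT)": [2.0, -200.0, 1.0, 3.0],
    "T": [12.0, 11.0, -200.0, 14.0],
})

# Turn the marker into NaN in the numeric columns only.
numeric_cols = demo.select_dtypes(include="number").columns
masked = demo.copy()
masked[numeric_cols] = masked[numeric_cols].replace(-200, np.nan)

# pandas skips NaN by default, so the statistics now use real readings only.
co_min = masked["CO(GT)"].min()
co_max = masked["CO(GT)"].max()
weekday_mean = masked.groupby("Day_of_Week")["CO(GT)"].mean()

print(co_min, co_max)
print(weekday_mean)
```

Masking per column rather than calling `dropna()` on whole rows keeps a row usable for CO(GT) even when another sensor on that row has no reading.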

Temperature associated with maximum air quality: 12.4
Temperature associated with minimum air quality: 10.1

Cluster Distribution:
Cluster
1    6995
0    1996
2     366
Name: count, dtype: int64
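
The code above fixes `n_clusters=3` and clusters on values that still contain the -200 missing-value marker, so the groups may partly reflect which sensors had no reading. One common check is to mask the marker first and compare silhouette scores for several values of k. A minimal sketch under those assumptions, using synthetic stand-in data rather than the real file; the feature subset and the range of k are illustrative choices:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: two well-separated groups plus some -200 markers.
rng = np.random.default_rng(0)
low = rng.normal(loc=[1.0, 60.0, 15.0], scale=[0.2, 5.0, 2.0], size=(100, 3))
high = rng.normal(loc=[5.0, 180.0, 25.0], scale=[0.2, 5.0, 2.0], size=(100, 3))
frame = pd.DataFrame(np.vstack([low, high]), columns=["CO(GT)", "NO2(GT)", "T"])
frame.iloc[:5, 0] = -200  # pretend five CO readings are missing

# Mask the marker and keep only rows with all features present.
clean = frame.replace(-200, np.nan).dropna()
scaled = StandardScaler().fit_transform(clean)

# Score each candidate k; a higher silhouette means tighter, better-separated clusters.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(scaled)
    scores[k] = silhouette_score(scaled, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("Best k by silhouette:", best_k)
```

On the real data, a feature that is missing in most rows would remove most rows at the `dropna()` step, so it may be preferable to leave such a feature out of the clustering.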

Cluster Characteristics (Cluster Centers):
       CO(GT)     NO2(GT)     NOx(GT)    NMHC(GT)           T          RH
0 -159.372695 -145.002505 -142.587174 -192.561623   20.168136   50.667685
1    0.621658  114.212152  251.155969 -149.041029   17.789850   48.825161
2  -17.26885