
# **HT No-2303A52268**

# **Batch no-36**

# **Question No-12**

**Prediction of Air Quality in Italian Cities**

1. Identify the top 5 factors affecting air quality

2. Identify the day of the week with the most air quality issues

3. Find the maximum and minimum air quality levels

4. Identify the highest and lowest temperatures recorded

5. Identify the highest educational qualification of the employees.

6. Apply either a classification model or a clustering model to evaluate the dataset

Dataset: Air Quality (UCI ML Repo)

**Load and Inspect the Dataset**

In [None]:
import pandas as pd

# Load the dataset
file_path = '/content/AirQualityUCI.xlsx'  # Replace with the correct path in Colab
air_quality_data = pd.read_excel(file_path, sheet_name='AirQualityUCI')

# Display the first few rows to inspect the data
print(air_quality_data.head())


        Date      Time  CO(GT)  PT08.S1(CO)  NMHC(GT)   C6H6(GT)  \
0 2004-03-10  18:00:00     2.6      1360.00       150  11.881723   
1 2004-03-10  19:00:00     2.0      1292.25       112   9.397165   
2 2004-03-10  20:00:00     2.2      1402.00        88   8.997817   
3 2004-03-10  21:00:00     2.2      1375.50        80   9.228796   
4 2004-03-10  22:00:00     1.6      1272.25        51   6.518224   

   PT08.S2(NMHC)  NOx(GT)  PT08.S3(NOx)  NO2(GT)  PT08.S4(NO2)  PT08.S5(O3)  \
0        1045.50    166.0       1056.25    113.0       1692.00      1267.50   
1         954.75    103.0       1173.75     92.0       1558.75       972.25   
2         939.25    131.0       1140.00    114.0       1554.50      1074.00   
3         948.25    172.0       1092.00    122.0       1583.75      1203.25   
4         835.50    131.0       1205.00    116.0       1490.00      1110.00   

       T         RH        AH  
0  13.60  48.875001  0.757754  
1  13.30  47.700000  0.725487  
2  11.90  53.975000 

**Data Cleaning and Feature Engineering**

In [None]:
# Convert 'Date' column to datetime
air_quality_data['Date'] = pd.to_datetime(air_quality_data['Date'], errors='coerce')

# Extract the day of the week
air_quality_data['DayOfWeek'] = air_quality_data['Date'].dt.day_name()

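# Replace the invalid-value marker (-200) with NaN for analysis.
# float('nan') keeps numeric columns numeric; pd.NA would turn them into object dtype.
air_quality_data.replace(-200, float('nan'), inplace=True)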
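
The cell above masks the value -200 because it marks an invalid reading; left in place, it would distort every statistic computed later. A minimal sketch of that effect on made-up rows (the values below are illustrative and not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Made-up readings for illustration; -200 stands in for an invalid measurement.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2004-03-10", "2004-03-11", "2004-03-12"]),
    "CO(GT)": [2.6, -200.0, 1.6],
})

# With the sentinel left in, the mean is dragged far below zero.
raw_mean = toy["CO(GT)"].mean()

# Replacing the sentinel with NaN keeps the column as float64,
# and mean() skips NaN by default.
cleaned = toy.copy()
cleaned["CO(GT)"] = cleaned["CO(GT)"].replace(-200, np.nan)
clean_mean = cleaned["CO(GT)"].mean()

# Day names derived from the datetime column, as in the cell above.
day_names = cleaned["Date"].dt.day_name().tolist()

print(raw_mean, clean_mean, day_names)
```

The same masking applies to every sensor column, which is why the notebook calls `replace` on the whole DataFrame rather than on one column.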


**Top 5 Factors Correlated with CO(GT)**

In [None]:
# Calculate correlations with CO(GT)
# Convert numeric-looking object columns to numeric, coercing errors.
# 'Date', 'Time' and 'DayOfWeek' are skipped: coercing the day names would turn
# them all into NaN and leave the later groupby on 'DayOfWeek' empty.
non_numeric = ['Date', 'Time', 'DayOfWeek']
for col in air_quality_data.columns.difference(non_numeric):
    if air_quality_data[col].dtype == 'object':  # Check if column is of type object
        air_quality_data[col] = pd.to_numeric(air_quality_data[col], errors='coerce')

numeric_columns = air_quality_data.select_dtypes(include=['number'])  # Select only numeric columns
# head(6) because the first entry is CO(GT) itself (correlation 1.0); the next five are the top factors
correlations = numeric_columns.corr()['CO(GT)'].sort_values(ascending=False).head(6)
print("Top 5 correlations with CO(GT):\n", correlations)

Top 5 correlations with CO(GT):
 CO(GT)           1.000000
C6H6(GT)         0.931091
PT08.S2(NMHC)    0.915519
NMHC(GT)         0.889734
PT08.S1(CO)      0.879292
PT08.S5(O3)      0.854183
Name: CO(GT), dtype: float64
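
The first row of this output is CO(GT) correlated with itself, so the five factors are the rows below it. Sorting by the signed value also pushes strong negative correlations to the bottom of the list. A sketch on synthetic columns of ranking by absolute correlation with the target excluded; the helper name `top_correlated` and the column names are hypothetical, not part of the notebook:

```python
import numpy as np
import pandas as pd

def top_correlated(df: pd.DataFrame, target: str, n: int = 5) -> pd.Series:
    """Return the n columns most correlated (by absolute value) with target."""
    # Correlate numeric columns only, then drop the target's self-correlation.
    corr = df.select_dtypes(include="number").corr()[target].drop(target)
    # Rank by magnitude but report the signed values.
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order].head(n)

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
demo = pd.DataFrame({
    "target": x,
    "strong_pos": x + rng.normal(scale=0.1, size=200),
    "strong_neg": -x + rng.normal(scale=0.5, size=200),
    "noise": rng.normal(size=200),
})

result = top_correlated(demo, "target", n=2)
print(result)
```

Applied to the notebook's data, the call would be `top_correlated(air_quality_data, 'CO(GT)')`. Correlation measures association only, so the ranked columns are candidates for "reasons", not demonstrated causes.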


**Day of the Week with Most Air Quality Issues**

In [None]:
# Find the day of the week with the highest average CO levels
day_air_quality_issues = air_quality_data.groupby('DayOfWeek')['CO(GT)'].mean().sort_values(ascending=False)
print("Average CO(GT) levels by day of the week:\n", day_air_quality_issues)


Average CO(GT) levels by day of the week:
 Series([], Name: CO(GT), dtype: float64)
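
An empty Series like the one recorded above is what this groupby returns when every `DayOfWeek` value is NaN, which is what happens once the day-name strings have gone through `pd.to_numeric(..., errors='coerce')`. A minimal sketch on made-up rows (not dataset values) that reproduces the empty result and shows the same groupby with the day names intact:

```python
import pandas as pd

# Made-up rows for illustration; not taken from the dataset.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2004-03-10", "2004-03-10", "2004-03-11"]),
    "CO(GT)": [2.0, 4.0, 1.0],
})
toy["DayOfWeek"] = toy["Date"].dt.day_name()

# Coercing the day names to numbers turns every one of them into NaN ...
broken = toy.copy()
broken["DayOfWeek"] = pd.to_numeric(broken["DayOfWeek"], errors="coerce")
# ... and groupby drops NaN keys by default, so nothing is left to group.
empty_result = broken.groupby("DayOfWeek")["CO(GT)"].mean()

# With the string day names intact, the same groupby gives one mean per day.
by_day = toy.groupby("DayOfWeek")["CO(GT)"].mean().sort_values(ascending=False)

print(empty_result)
print(by_day)
```

In the notebook, recomputing `DayOfWeek` from `Date` immediately before the groupby, or keeping that column out of the numeric conversion, should give one average per weekday.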


**Max and Min Air Quality Levels**

In [None]:
# Maximum and minimum CO(GT)
max_co_level = air_quality_data['CO(GT)'].max()
min_co_level = air_quality_data['CO(GT)'].min()
print("Max CO(GT):", max_co_level)
print("Min CO(GT):", min_co_level)


Max CO(GT): 11.9
Min CO(GT): 0.1


**Highest and Lowest Temperatures**

In [None]:
# Maximum and minimum temperature
max_temperature = air_quality_data['T'].max()
min_temperature = air_quality_data['T'].min()
print("Max Temperature:", max_temperature)
print("Min Temperature:", min_temperature)


Max Temperature: 44.60000038147
Min Temperature: -1.8999999761581


**Clustering Model (K-Means)**

In [None]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Select features for clustering
features = air_quality_data[['CO(GT)', 'NOx(GT)', 'NO2(GT)', 'T', 'RH']].dropna()

# Standardize features
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

# Apply K-means clustering
kmeans = KMeans(n_clusters=3, random_state=42)
# Get cluster labels for the scaled features
cluster_labels = kmeans.fit_predict(features_scaled)

# Attach the cluster labels to the original DataFrame by index;
# rows dropped by dropna() above keep NaN in the 'Cluster' column
air_quality_data.loc[features.index, 'Cluster'] = cluster_labels

print("Cluster centers:\n", kmeans.cluster_centers_)

Cluster centers:
 [[-0.57640962 -0.44417968 -0.52773987 -0.47861842  0.52285796]
 [ 1.2395222   1.40378062  1.14577564 -0.47274687  0.38480176]
 [-0.11100112 -0.41637812 -0.10967735  1.0155135  -1.01229549]]
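
These centers are in standardized units, because k-means was fit on `features_scaled`: each number is a count of standard deviations from that feature's mean, in the column order CO(GT), NOx(GT), NO2(GT), T, RH. A sketch, on synthetic two-feature data, of mapping centers back to the original units with `StandardScaler.inverse_transform`; the data and the explicit `n_init=10` are assumptions made for the example:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature data for illustration: two well-separated groups.
rng = np.random.default_rng(42)
group_a = rng.normal(loc=[1.0, 10.0], scale=0.2, size=(50, 2))
group_b = rng.normal(loc=[5.0, 30.0], scale=0.2, size=(50, 2))
X = np.vstack([group_a, group_b])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
kmeans.fit(X_scaled)

# Centers live in standardized space; map them back to the original units.
centers_original = scaler.inverse_transform(kmeans.cluster_centers_)
# Sort by the first feature so the row order does not depend on label numbering.
centers_original = centers_original[np.argsort(centers_original[:, 0])]
print(centers_original)
```

In the notebook the equivalent call would be `scaler.inverse_transform(kmeans.cluster_centers_)`, which makes it easier to describe each cluster in terms of actual pollutant levels, temperature, and humidity.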
