# Analyzing Air Quality In China

This notebook explores a synthetic dataset containing air pollution data from five major cities in China: Beijing, Shanghai, Guangzhou, Chengdu, and Shenzhen.

The data spans 2015 to 2025 and includes measurements of various air pollutants along with weather conditions.

### Objectives:
- **Visualize** air quality trends over time across cities
- **Compare** pollution and particulate matter levels between cities, times of day, seasons, and weather conditions
- **Explore** data using dropdowns and animations with `ipywidgets` and `plotly.express`
- **Forecast** AQI levels using regression models from `scikit-learn`
- **Evaluate** and compare different machine learning models for accuracy and reliability when predicting AQI

This analysis builds upon core concepts covered in OMAT5100M: Programming for Data Science, while also incorporating additional techniques and tools beyond the scope of the course to demonstrate independent learning and application.

**Dataset Source:**
Air Pollution in China (2015-2025)
Available on [Kaggle](https://www.kaggle.com/datasets/khushikyad001/air-pollution-in-china-2015-2025)
Accessed on: April 13, 2025.


## Introduction

We'll start by importing all the required libraries.

In [170]:
import numpy as np
import pandas as pd
import plotly.express as px
from ipywidgets import widgets

### Loading & Previewing Dataset

We'll use the pandas `read_csv` function to read the data and the `info()` method to get information including data types, null counts, column names, and the dataframe shape. This will help us understand the data and approach our analysis accordingly.

In [171]:
pollution = pd.read_csv("air_pollution_china.csv")
pollution.info() # getting a summary of the dataframe

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 24 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   PM2.5 (µg/m³)       3000 non-null   float64
 1   PM10 (µg/m³)        3000 non-null   float64
 2   NO2 (µg/m³)         3000 non-null   float64
 3   SO2 (µg/m³)         3000 non-null   float64
 4   CO (mg/m³)          3000 non-null   float64
 5   O3 (µg/m³)          3000 non-null   float64
 6   Temperature (°C)    3000 non-null   float64
 7   Humidity (%)        3000 non-null   float64
 8   Wind Speed (m/s)    3000 non-null   float64
 9   Wind Direction (°)  3000 non-null   float64
 10  Pressure (hPa)      3000 non-null   float64
 11  Precipitation (mm)  3000 non-null   float64
 12  Visibility (km)     3000 non-null   float64
 13  AQI                 3000 non-null   int64  
 14  Season              3000 non-null   object 
 15  City                3000 non-null   object 
 16  Latitu

Upon inspection, we can see that none of the values in the dataframe are null values. In this case, we can proceed with an exploratory data analysis.
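The absence of nulls can also be checked programmatically rather than read off the `info()` listing. Below is a minimal sketch on a small made-up frame (`toy` is hypothetical and deliberately contains gaps so the check has something to find); on the real data the same two lines would be applied to `pollution`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "AQI": [120, 85, np.nan],
    "City": ["Beijing", None, "Shanghai"],
})

# Count missing values per column; all zeros would mean no cleaning is needed.
null_counts = toy.isna().sum()
has_nulls = bool(null_counts.any())

print(null_counts)
print("Any nulls:", has_nulls)
```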

## Exploratory Data Analysis

We'll also use the `describe` method to check the summary statistics of each numerical column, including the mean, standard deviation, and median, to get a sense of its distribution.

In [172]:
pollution.describe() # Checking summary statistics for each column

| | PM2.5 (µg/m³) | PM10 (µg/m³) | NO2 (µg/m³) | SO2 (µg/m³) | CO (mg/m³) | O3 (µg/m³) | Temperature (°C) | Humidity (%) | Wind Speed (m/s) | Wind Direction (°) | Pressure (hPa) | Precipitation (mm) | Visibility (km) | AQI | Latitude | Longitude | Hour | Month | Year | Station ID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3000.0 | 3000.0 | 3000.0 | 3000.0 | 3000.0 | 3000.0 | 3000.0 | 3000.0 | 3000.0 | 3000.0 | 3000.0 | 3000.0 | 3000.0 | 3000.0 | 3000.0 | 3000.0 | 3000.0 | 3000.0 | 3000.0 | 3000.0 |
| mean | 130.072742 | 158.818398 | 52.948697 | 25.426127 | 2.562514 | 105.613365 | 15.11706 | 54.737773 | 5.137389 | 179.396411 | 1014.894063 | 24.880798 | 9.900773 | 249.639667 | 35.037328 | 110.072445 | 11.289667 | 6.537333 | 2019.555333 | 50.164667 |
| std | 68.14005 | 80.915207 | 26.905603 | 14.177923 | 1.40957 | 54.054354 | 14.719677 | 26.18767 | 2.891444 | 104.076767 | 20.34841 | 14.638862 | 5.751453 | 143.396118 | 8.666335 | 11.604091 | 6.835057 | 3.441204 | 2.850669 | 28.807838 |
| min | 10.039019 | 20.017134 | 5.005047 | 1.014693 | 0.102207 | 10.094551 | -9.997818 | 10.049686 | 0.002052 | 0.017217 | 980.020253 | 0.002508 | 0.103173 | 0.0 | 20.005034 | 90.006533 | 0.0 | 1.0 | 2015.0 | 1.0 |
| 25% | 72.409621 | 86.982912 | 29.885955 | 12.720078 | 1.362659 | 59.561719 | 2.123508 | 31.915813 | 2.601402 | 87.632382 | 997.419156 | 12.036467 | 4.997298 | 126.0 | 27.495695 | 100.001894 | 5.75 | 4.0 | 2017.0 | 26.0 |
| 50% | 128.858172 | 157.174723 | 53.404879 | 25.677423 | 2.553778 | 106.706302 | 15.334326 | 53.993836 | 5.206242 | 178.820576 | 1014.29695 | 24.785057 | 9.850522 | 248.0 | 35.148201 | 110.181877 | 11.0 | 7.0 | 2020.0 | 49.0 |
| 75% | 188.579931 | 228.60391 | 75.955623 | 37.755735 | 3.785613 | 151.29302 | 28.016646 | 77.409183 | 7.750817 | 268.835379 | 1032.671602 | 37.617215 | 14.808608 | 374.0 | 42.612931 | 119.977826 | 17.0 | 10.0 | 2022.0 | 75.0 |
| max | 249.847473 | 299.702293 | 99.979511 | 49.98985 | 4.998658 | 199.936387 | 39.959957 | 99.998108 | 9.991277 | 359.881532 | 1049.997553 | 49.984126 | 19.998597 | 499.0 | 49.998873 | 129.989284 | 23.0 | 12.0 | 2024.0 | 99.0 |


We'll also check the distribution of these pollutant values visually.

First, we must select only the relevant data and convert it into a format that is easier to plot.

In [173]:
# We'll select pollutants only and reformat the dataframe for easier plotting
pollutants_only_df = pollution[['PM2.5 (µg/m³)', 'PM10 (µg/m³)', 'NO2 (µg/m³)', 'SO2 (µg/m³)', 'CO (mg/m³)', 'O3 (µg/m³)', 'AQI']]
pollutants_only_df = pollutants_only_df.melt(var_name='Pollutant', value_name='Concentration')
pollutants_only_df.head()

| | Pollutant | Concentration |
|---|---|---|
| 0 | PM2.5 (µg/m³) | 94.437337 |
| 1 | PM2.5 (µg/m³) | 194.17479 |
| 2 | PM2.5 (µg/m³) | 45.037661 |
| 3 | PM2.5 (µg/m³) | 76.131857 |
| 4 | PM2.5 (µg/m³) | 204.127929 |
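To make the wide-to-long reshaping concrete, here is a minimal sketch of `melt` on a two-column frame. The frame `wide` and its values are made up for illustration; only the `melt` call mirrors the cell above.

```python
import pandas as pd

# Hypothetical wide frame: one column per pollutant, one row per reading.
wide = pd.DataFrame({"PM2.5": [94.4, 194.2], "AQI": [120, 310]})

# melt stacks the columns on top of each other: each original cell becomes
# one row holding the column name (Pollutant) and its value (Concentration).
long = wide.melt(var_name="Pollutant", value_name="Concentration")

print(long)
```

The long format has one row per (reading, pollutant) pair, which is the shape `facet_col='Pollutant'` expects.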


Now we can proceed with the visual distribution of each pollutant. We'll use histograms to understand the distribution.

In [174]:
fig = px.histogram(pollutants_only_df, x='Concentration', facet_col='Pollutant', facet_col_wrap=2, facet_col_spacing=0.08)
fig.update_xaxes(matches=None, showticklabels=True) # matches none to avoid sharing x-axis for different pollutants
fig.update_yaxes(matches=None, showticklabels=True)
fig.update_layout(height=1500, width=1000)
fig.show()

As we can see, the distributions vary considerably within each pollutant, with no major discernible patterns from which to draw conclusions.

### Pollution Trends Over Time

In this section, we'll investigate pollution trends over time, particularly focusing on PM2.5 particulate matter & AQI across different cities to inspect any increase or decrease.

In [175]:
# We'll group the data by city and year, calculate the mean PM2.5 and AQI, and reformat again for easier plotting
city_year_grouped_df = pollution.groupby(["City", "Year"])[["PM2.5 (µg/m³)", 'AQI']].mean().reset_index()

# NOTE: reset_index is used to convert the grouped index levels (City, Year) back into columns

city_year_grouped_df = city_year_grouped_df.melt(id_vars=[ "City", "Year"], var_name='Pollutant', value_name='Concentration')
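The effect of `reset_index` after a `groupby` is easier to see on a tiny example. The sketch below uses a made-up frame (`toy`); the grouping keys first become index levels and are then restored as ordinary columns.

```python
import pandas as pd

# Hypothetical readings (values are made up for illustration).
toy = pd.DataFrame({
    "City": ["Beijing", "Beijing", "Shanghai", "Shanghai"],
    "Year": [2020, 2020, 2020, 2021],
    "AQI": [100, 200, 150, 50],
})

# After groupby().mean(), City and Year live in the index, not in columns.
grouped = toy.groupby(["City", "Year"])[["AQI"]].mean()

# reset_index moves them back into regular columns, which plotting
# functions such as px.line can then reference by name.
flat = grouped.reset_index()

print(grouped.index.names)
print(flat)
```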

In [176]:
# Plotting the line graph to visualize the trends by year for each city
fig = px.line(city_year_grouped_df, x = 'Year', y = 'Concentration', facet_col='City', facet_row='Pollutant', markers=True, title='PM2.5 and AQI Trends by City Over Time')
print('Tip: Hover over the datapoints to preview data labels')
fig.update_yaxes(matches=None, showticklabels=True)
fig.update_layout(height=800, width=2000)
fig.show()

Tip: Hover over the datapoints to preview data labels


From this chart, we can see that Shanghai experienced an increase in PM2.5 from 2021 to 2023, followed by an improvement in 2024 compared with the year before.

The cities of Chengdu, Shanghai, and Shenzhen saw an average increase in AQI relative to 2020. This may be due to the strict lockdown policies introduced during COVID and later relaxed, causing a drop and then a surge in traffic. Further analysis is needed to draw conclusions.

There seems to be a lot of year-on-year variability in PM2.5 concentration. It would be good to understand the variability, the median, and the extremes in PM2.5 readings for each city using a box plot.

### City-Wise Pollution Comparison

In this section, we'll investigate the distribution of PM2.5 matter across different cities.

In [177]:
# We'll plot box charts to visualize the distribution of PM2.5 concentration by city. Note we could just as easily plot AQI.
fig = px.box(pollution, x = 'City', y = 'PM2.5 (µg/m³)', color='City', title='PM2.5 Distribution by City')
fig.show()

The median PM2.5 value seems to be highest in Shanghai and lowest in Beijing, but only by a small margin. Overall, the cities have similar medians and interquartile ranges.
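The quantities a box plot summarizes can also be computed directly. Below is a minimal sketch on made-up readings for two placeholder cities (`A` and `B`); note that quartile conventions differ between tools, so numbers computed this way may differ slightly from the boxes drawn by a plotting library.

```python
import pandas as pd

# Hypothetical readings for two placeholder cities (values are made up).
toy = pd.DataFrame({
    "City": ["A"] * 5 + ["B"] * 5,
    "PM2.5": [10, 20, 30, 40, 50, 5, 15, 25, 35, 100],
})

# Quartiles per city; unstack turns the quantile level into columns.
quartiles = toy.groupby("City")["PM2.5"].quantile([0.25, 0.5, 0.75]).unstack()

# Interquartile range: the height of the box in a box plot.
quartiles["IQR"] = quartiles[0.75] - quartiles[0.25]

print(quartiles)
```

In this toy data the single extreme reading of 100 in city `B` leaves the median and IQR unchanged, which is why box plots are a robust way to compare cities.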

It would also be good to check PM2.5 matter concentrations for different seasons and cities.

In [178]:
city_season_df = pollution[['City', 'Season', 'PM2.5 (µg/m³)']] # Selecting relevant columns

# Creating a multi select filter using ipywidgets. This will allow us to select cities above our chart.
city_filter = widgets.SelectMultiple(
    options=city_season_df['City'].unique(),
    value=['Beijing'],
    description='City',
    layout=widgets.Layout(width='50%'))

In [179]:
# Creating a function to update the plot based on selected cities. This will be passed as an argument to create the interactive plot.
def update_plot(input_cities):
    filtered_df = city_season_df[city_season_df['City'].isin(input_cities)]
    fig = px.box(
        filtered_df,
        x = 'Season',
        y = 'PM2.5 (µg/m³)',
        title = 'PM2.5 (µg/m³) concentration by season'
    )
    fig.update_traces(marker_color='#ff7f0e', line_color='#ff7f0e') # changing the color of the boxplot
    fig.update_xaxes(categoryorder='category ascending') # ordering the season labels on the x axis alphabetically
    fig.show()

print("Tip: Hold Cmd (or Ctrl on Windows) and click to select multiple cities.")
widgets.interactive(update_plot, input_cities=city_filter)

Tip: Hold Cmd (or Ctrl on Windows) and click to select multiple cities.



In aggregate across all cities, spring seems to have the highest median PM2.5 concentration, but this trend doesn't hold for individual cities. For example, Shanghai has higher PM2.5 concentrations during autumn and summer.

Another thing to note is that Chengdu seems to have the highest median PM2.5 concentration of 160 µg/m³ in Spring.
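Readings like these can be confirmed numerically instead of by hovering over each box. Below is a minimal sketch that tabulates the median per city and season and picks the season with the highest median for each city; the frame `toy` and all its values are made up for illustration.

```python
import pandas as pd

# Hypothetical readings (values are made up, not taken from the dataset).
toy = pd.DataFrame({
    "City": ["Chengdu", "Chengdu", "Chengdu", "Shanghai", "Shanghai", "Shanghai"],
    "Season": ["Spring", "Spring", "Autumn", "Spring", "Autumn", "Autumn"],
    "PM2.5": [150.0, 170.0, 90.0, 100.0, 130.0, 150.0],
})

# One row per city, one column per season, each cell a median.
medians = toy.groupby(["City", "Season"])["PM2.5"].median().unstack("Season")

# Season with the highest median for each city.
top_season = medians.idxmax(axis=1)

print(medians)
print(top_season)
```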

### Pollution levels and weather

In this section, we'll explore patterns between weather conditions and different pollutants to understand whether specific weather conditions correlate with higher or lower pollutant concentrations. We'll also explore the correlation between different pollutants.

In [180]:

import plotly.graph_objects as go

# Listing all weather and pollutants for filters
weather_conditions = ['Temperature (°C)', 'Humidity (%)', 'Precipitation (mm)', 'Wind Speed (m/s)']
pollutants = ['PM2.5 (µg/m³)', 'PM10 (µg/m³)', 'NO2 (µg/m³)', 'SO2 (µg/m³)', 'CO (mg/m³)', 'O3 (µg/m³)', 'AQI']

# Creating filter to select pollutant
pollutant_filter = widgets.Dropdown(
    options=pollutants,
    value='PM2.5 (µg/m³)',
    description='Pollutant:',
    disabled=False,
)

# Creating filter to select city
city_filter = widgets.Dropdown(
    options=pollution['City'].unique(),
    value='Beijing',
    description='City:',
    disabled=False,
)

# Creating filter to select weather
weather_filter = widgets.Dropdown(
    options=weather_conditions,
    value='Temperature (°C)',
    description='Weather:',
    disabled=False,
)

# Similar to the previous update plot function, this will take multiple filters as input and update the plot.
def update_weather_pollution_plot(input_pollutant, input_city, input_weather):

    # filtering data based on user selected input
    filtered_df = pollution[pollution['City'] == input_city][['City', input_pollutant, input_weather, 'Season', 'Year']].sort_values(by=['Year'])

    # Creating a scatter plot and faceting on Season. Animation frame is set to Year to show trends over time.
    fig = px.scatter(
        filtered_df,
        x = input_weather,
        y = input_pollutant,
        facet_row='Season',
        animation_frame='Year',
        trendline='ols',
    )

    fig.update_layout(height=1000, width=800, title_text=f"{input_pollutant} vs {input_weather} for {input_city}")
    fig.show()

print('Tip: Press the Play button at the bottom to view trends by year. You may also use the slider to select a specific year.')

# Each parameter for the update_weather_pollution_plot function is passed as an argument to the interactive function.
widgets.interactive(update_weather_pollution_plot, input_city=city_filter, input_pollutant=pollutant_filter, input_weather=weather_filter)

Tip: Press the Play button at the bottom to view trends by year. You may also use the slider to select a specific year.



We'll also create a correlation matrix to understand the correlation between each pollutant. 

In [181]:
# Checking correlation between different pollutants
pollutants_data = pollution[pollutants]

corr_matrix = pollutants_data.corr() # Creating correlation matrix

# Using heatmap to visualize the correlation matrix
fig = px.imshow(
    corr_matrix,
    text_auto=True,
    color_continuous_scale='balance',
    zmin=-1, zmax=1,
    title='Correlation Matrix Heatmap'
)

fig.show()

There seems to be no discernible correlation between the various pollutants. Given that PM2.5 and PM10 are [primary drivers of AQI](https://www.iqair.com/ca/newsroom/what-is-aqi), it is interesting that there is no significant positive correlation between them or with AQI. This may be due to the synthetic nature of the data.
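To illustrate how independently generated columns end up uncorrelated, here is a small sketch. It assumes a hypothetical generator that draws each pollutant independently from a uniform range; this is only an assumption for illustration, since we do not know how the Kaggle dataset was actually produced.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption: each pollutant is drawn independently from a uniform range.
pm25 = rng.uniform(10, 250, size=3000)
pm10 = rng.uniform(20, 300, size=3000)
r_independent = np.corrcoef(pm25, pm10)[0, 1]

# For contrast: PM10 built from PM2.5 plus some extra coarse particles,
# which is closer to how the two would relate physically.
pm10_linked = pm25 + rng.uniform(0, 50, size=3000)
r_linked = np.corrcoef(pm25, pm10_linked)[0, 1]

print(f"independent: {r_independent:.3f}, linked: {r_linked:.3f}")
```

A heatmap full of near-zero off-diagonal values is what the first case would produce, which is consistent with the matrix above.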

### Pollutant Concentration by Different Timeframes

In this section we'll investigate pollutant concentration for different timeframes of the day to identify rush hour or late night pollutant patterns.

In [182]:
# We'll select relevant columns from the original dataframe.
pollution_by_hour_month = pollution[['Hour', 'Month', 'PM2.5 (µg/m³)','AQI', 'PM10 (µg/m³)', 'NO2 (µg/m³)', 'SO2 (µg/m³)', 'CO (mg/m³)', 'O3 (µg/m³)']].copy()

# Ensuring the scale for the CO (mg/m³) is in µg/m³ for consistency with other pollutants.
pollution_by_hour_month.rename(columns={'CO (mg/m³)': 'CO (µg/m³)'}, inplace=True)
pollution_by_hour_month.loc[:, 'CO (µg/m³)'] = pollution_by_hour_month['CO (µg/m³)']*1000
 
# Dropping Month before melting so that it is not treated as a pollutant
pollution_by_hour = pollution_by_hour_month.drop(columns=['Month']).melt(id_vars=['Hour'], var_name='Pollutant', value_name='Concentration')

pollution_by_hour = pollution_by_hour.groupby(['Hour', 'Pollutant']).median().reset_index()

fig = px.line(
    pollution_by_hour,
    x = 'Hour',
    y = 'Concentration',
    facet_row='Pollutant',
    title='Hourly Pollutant Concentration'
)
fig.update_yaxes(matches=None)
fig.update_layout(height=1500, width=1200, title_text="Pollutant Concentration by Hour")
fig.show()

We can also check by month to understand whether any specific months, and consequently seasons, see a spike in any pollutant or AQI.

In [183]:
# Dropping Hour before melting so that it is not treated as a pollutant
pollution_by_month = pollution_by_hour_month.drop(columns=['Hour']).melt(id_vars=['Month'], var_name='Pollutant', value_name='Concentration')
pollution_by_month = pollution_by_month.groupby(['Month', 'Pollutant']).median().reset_index()

fig = px.line(
    pollution_by_month,
    x = 'Month',
    y = 'Concentration',
    facet_row='Pollutant',
    title='Monthly Pollutant Concentration'
)
fig.update_yaxes(matches=None)
fig.update_layout(height=1500, width=1200, title_text="Pollutant Concentration by Month")
fig.show()

While there are no major repeated patterns, AQI appears to spike from June through September. It is unusual for AQI to rise while none of the pollutants show a similar pattern, given that these pollutants directly affect AQI. This may be due to the artificial nature of the data.

## Modeling

We'll use regression models offered by sklearn to predict AQI value using numerical features only. 

Note: For the sake of simplicity, and because of my limited knowledge of handling categorical variables in machine learning, I have excluded categorical variables from the features. This allowed me to focus on the numerical features and the regression techniques taught in the course.

### Linear Regression

In [184]:
# Importing all required modules
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.ensemble import RandomForestRegressor

In [185]:
features = ['PM2.5 (µg/m³)', 'PM10 (µg/m³)', 'NO2 (µg/m³)', 'SO2 (µg/m³)', 'CO (mg/m³)', 'O3 (µg/m³)', 'Temperature (°C)', 'Humidity (%)', 'Precipitation (mm)', 'Wind Speed (m/s)', 'Pressure (hPa)', 'Visibility (km)'] # each column listed once

y = pollution.loc[:, 'AQI'] # we'll use the AQI value as the variable to predict, i.e. the target
X = pollution.loc[:, features] # we'll use the selected numerical columns as input variables, i.e. the features

# Splitting the data into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.80)
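Because `train_test_split` is called without a fixed seed here, each run of the notebook produces a different split and therefore slightly different scores. The sketch below mimics a seeded 80/20 split with plain NumPy; `split_indices` is a hypothetical helper written for illustration, not part of scikit-learn. In scikit-learn itself, passing `random_state` to `train_test_split` serves the same purpose.

```python
import numpy as np

def split_indices(n_rows, train_fraction, seed):
    """Return (train, test) row positions from a seeded shuffle (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)           # shuffled row positions
    n_train = int(n_rows * train_fraction)    # size of the training portion
    return order[:n_train], order[n_train:]

train_idx, test_idx = split_indices(3000, 0.80, seed=42)
repeat_train_idx, _ = split_indices(3000, 0.80, seed=42)

print(len(train_idx), len(test_idx))
print("Same split on repeat:", np.array_equal(train_idx, repeat_train_idx))
```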


We'll now scale the data.

This process ensures all the data is on the same scale. 

"Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data" ([scikit-learn, 2025](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#:~:text=Standardization%20of%20a%20dataset%20is%20a%20common%20requirement%20for%20many%20machine%20learning%20estimators%3A%20they%20might%20behave%20badly%20if%20the%20individual%20features%20do%20not%20more%20or%20less%20look%20like%20standard%20normally%20distributed%20data)). 

The equation for scaling is:

$z = \frac{x - \mu}{\sigma}$
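As a worked example of the formula, the sketch below standardizes a small made-up feature by hand. The mean and standard deviation are computed from the training values only and then reused for the test values, which is the same fit-then-transform pattern used with `StandardScaler`. The arrays are hypothetical.

```python
import numpy as np

# Hypothetical values of a single feature.
x_train = np.array([2.0, 4.0, 6.0, 8.0])
x_test = np.array([5.0, 10.0])

# Statistics come from the training data only.
mu = x_train.mean()
sigma = x_train.std()  # population standard deviation (ddof=0)

# z = (x - mu) / sigma, applied to both sets with the same mu and sigma.
z_train = (x_train - mu) / sigma
z_test = (x_test - mu) / sigma

print(z_train, z_test)
```

After scaling, the training values have mean 0 and standard deviation 1; the test values generally do not, because they are scaled with the training statistics.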



In [186]:
# Scaling the data

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [187]:
# We'll define a helper function to help us apply various regression models 

def apply_regression_model(model, x_train_data, y_train_data, x_test_data, y_test_data):
    
    # Fitting the model and using it to make predictions on test data
    model.fit(x_train_data, y_train_data)
    y_pred = model.predict(x_test_data)
    
    mse = round(mean_squared_error(y_test_data, y_pred),2)
    r2 = round(r2_score(y_test_data, y_pred), 2)
    
    return mse, r2, y_pred

In [188]:
linear_regressor = LinearRegression()

mse, r2, y_pred = apply_regression_model(linear_regressor, x_train_data=X_train_scaled, y_train_data=y_train, x_test_data=X_test_scaled, y_test_data=y_test)

print(f"For the linear regression model, the mean squared error is {mse} and the coefficient of determination is {r2}")

For the linear regression model, the mean squared error is 20797.68 and the coefficient of determination is -0.01


Given the coefficient of determination is approximately 0 (slightly negative, at -0.01), our model performs no better than simply predicting the mean AQI for every observation.

We'll now plot the predictions against our actual test data to visually analyze the performance of the model.
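To see why a score near zero corresponds to a mean-only baseline, the coefficient of determination can be computed by hand as $R^2 = 1 - SS_{res} / SS_{tot}$. The sketch below uses made-up target values; `r_squared` is a hypothetical helper written for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y = np.array([100.0, 200.0, 300.0, 400.0])  # hypothetical AQI values

r2_mean = r_squared(y, np.full_like(y, y.mean()))  # always predict the mean
r2_perfect = r_squared(y, y)                       # predict every value exactly
r2_bad = r_squared(y, np.full_like(y, 150.0))      # a constant that is off-center

print(r2_mean, r2_perfect, r2_bad)
```

Always predicting the mean scores exactly 0, a perfect model scores 1, and anything worse than the mean baseline goes negative.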

In [189]:
# We'll plot the predictions against the actual values, with a dashed y = x reference line (perfect predictions) overlaid to visualize the performance.
results_df = pd.DataFrame({
    'Actual': y_test,
    'Predicted': y_pred
})

fig = px.scatter(results_df, x='Actual', y='Predicted', title='Actual vs Predicted AQI')

fig.add_shape(
    type='line',
    x0=results_df['Actual'].min(), y0=results_df['Actual'].min(),
    x1=results_df['Actual'].max(), y1=results_df['Actual'].max(),
    line=dict(color='red', dash='dash')
)

fig.update_layout(
    xaxis_title='Actual',
    yaxis_title='Predicted',
    width=700,
    height=500,
    xaxis_range=[-50, 550],
    yaxis_range=[200,300]
)

fig.show()

### Other Alternatives

Given the linear regression model has a poor fit, we'll test other regression models to check if we can get better accuracy. Here are some of the models we'll try out:

- **Polynomial Regression** 
- **Random Forest Regressor** 
- **Ridge Regression** 
- **Lasso Regression** 

### Polynomial Regression

We'll apply polynomial regression in case the relationship between the target and the features is non-linear. The model takes the form below (for the simple case of one feature and degree 2):

$$
y = \beta_0 + \beta_1 x + \beta_2 x^2
$$

In [190]:
# We'll transform the features to polynomials and then apply linear regression. 
def apply_polynomial_regression(degrees):
    poly_features = PolynomialFeatures(degree=degrees)
    X_train_poly = poly_features.fit_transform(X_train)
    X_test_poly = poly_features.transform(X_test)

    # We'll now scale these
    poly_scaler = StandardScaler()
    X_train_poly_scaled = poly_scaler.fit_transform(X_train_poly)
    X_test_poly_scaled = poly_scaler.transform(X_test_poly)

    model = LinearRegression()
    
    mse, r2, y_pred_poly = apply_regression_model(model, x_train_data=X_train_poly_scaled, y_train_data=y_train, x_test_data=X_test_poly_scaled, y_test_data=y_test)

    print(f"Polynomial Regression (degree={degrees}) — MSE: {mse:.2f}, R²: {r2:.2f}")
    
    return r2

We'll test different degrees of the polynomial to find a good fit while also being wary of overfitting.

In [191]:
# We'll try different degrees of polynomial regression to see which one performs the best. Anything more than 5 causes significant time to compute so we'll go as high as 5. 
r2_polynomial = []
for i in range(1,6):
    r2_polynomial.append(apply_polynomial_regression(i))

px.line(x=list(range(1,6)), y=r2_polynomial)

Polynomial Regression (degree=1) — MSE: 20797.68, R²: -0.01
Polynomial Regression (degree=2) — MSE: 20888.30, R²: -0.01
Polynomial Regression (degree=3) — MSE: 24187.31, R²: -0.17
Polynomial Regression (degree=4) — MSE: 121230.71, R²: -4.87
Polynomial Regression (degree=5) — MSE: 220786.52, R²: -9.69


Performance worsens as the degree increases (sharply beyond degree 3): the test MSE rises and R² becomes strongly negative, which points to overfitting. Given this and the computing time required, polynomial regression up to degree 5 does not appear to be a good approach for this data.
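A rough feature count helps explain the collapse at higher degrees. For n input columns and degree d, a full polynomial expansion (including the constant term) has C(n + d, d) columns. The sketch below assumes 13 input columns and a 2,400-row training set (80% of 3,000 rows); both numbers are assumptions for illustration, and a different column count shifts the figures but not the pattern.

```python
from math import comb

n_features = 13                   # assumed number of input columns
n_train_rows = int(3000 * 0.80)   # assumed training-set size

# Number of columns after a full polynomial expansion of each degree.
expanded_columns = {d: comb(n_features + d, d) for d in range(1, 6)}

for degree, n_columns in expanded_columns.items():
    note = "more columns than training rows" if n_columns > n_train_rows else ""
    print(degree, n_columns, note)
```

Once the column count approaches or exceeds the number of training rows, an unregularized linear fit can match the training data almost exactly, which is consistent with the sharply negative test R² at degrees 4 and 5.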

We'll now apply the following regression models to check if any of them lead to better predictions.

### Random Forest Regression

A random forest is a meta estimator that fits a number of decision tree regressors on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting (scikit-learn, 2025).

**Reference:**

Scikit-learn developers. (2025). *Random forest regression — scikit-learn 1.6.1 documentation*. Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html

### Ridge Regression

This model solves a regression model where the loss function is the linear least squares function and regularization is given by the l2-norm. Also known as Ridge Regression or Tikhonov regularization. This estimator has built-in support for multi-variate regression (i.e., when y is a 2d-array of shape (n_samples, n_targets)) (scikit-learn, 2025).

**Reference:**

Scikit-learn developers. (2025). *Ridge regression — scikit-learn 1.6.1 documentation*. Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html
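To illustrate what the l2 penalty does, the sketch below solves the penalized least-squares problem in closed form, $w = (X^T X + \alpha I)^{-1} X^T y$, on made-up data without an intercept. This is a textbook formulation written for illustration, not scikit-learn's implementation; `ridge_coefficients` and the data are hypothetical.

```python
import numpy as np

def ridge_coefficients(X, y, alpha):
    """Closed-form L2-penalized least squares (no intercept); alpha=0 is plain OLS."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Hypothetical data: three features with known true coefficients plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

w_ols = ridge_coefficients(X, y, alpha=0.0)
w_ridge = ridge_coefficients(X, y, alpha=100.0)

print("OLS:  ", w_ols)
print("Ridge:", w_ridge)
```

A larger `alpha` shrinks the coefficients toward zero, trading some bias for lower variance.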

### Lasso Regression

The Lasso is a linear model that estimates sparse coefficients. It is useful in some contexts due to its tendency to prefer solutions with fewer non-zero coefficients, effectively reducing the number of features upon which the given solution is dependent. (scikit-learn, 2025)

**Reference:**

Scikit-learn developers. (2025). *Lasso regression — scikit-learn 1.6.1 documentation*. Retrieved from https://scikit-learn.org/stable/modules/linear_model.html#lasso:~:text=References-,1.1.3.%20Lasso%23,-The%20Lasso%20is
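One way to see why the l1 penalty yields exact zeros: in the idealized case of orthonormal features, the Lasso solution is the least-squares coefficient passed through a soft-thresholding operator. The sketch below applies that operator to made-up coefficients; `soft_threshold`, the coefficient values, and the threshold `lam` are hypothetical and chosen for illustration.

```python
import numpy as np

def soft_threshold(z, lam):
    """Shrink each value toward zero by lam; values within lam of zero become zero."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# Hypothetical least-squares coefficients for four features.
ols_coefficients = np.array([3.0, -0.4, 0.2, -2.5])

lasso_like = soft_threshold(ols_coefficients, lam=0.5)

print(lasso_like)
print("Non-zero coefficients:", np.count_nonzero(lasso_like))
```

Small coefficients are set exactly to zero while large ones are only shrunk, which is the sparsity behaviour described in the quoted passage.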


In [192]:
models = {
    'Linear Regression': LinearRegression(),
    'Random Forest Regression': RandomForestRegressor(),
    'Ridge Regression': Ridge(),
    'Lasso Regression': Lasso(),
}

final_results = []

for model_name, model in models.items():
    
    mse, r2, _ = apply_regression_model(model, x_train_data=X_train_scaled, y_train_data=y_train, x_test_data=X_test_scaled, y_test_data=y_test)
    
    final_results.append([model_name, mse, r2])
    
final_results = pd.DataFrame(final_results, columns=['Model', 'Mean Squared Error', 'R2'])
final_results

Unnamed: 0,Model,Mean Squared Error,R2
0,Linear Regression,20797.68,-0.01
1,Random Forest Regression,20841.4,-0.01
2,Ridge Regression,20797.58,-0.01
3,Lasso Regression,20733.16,-0.0


## Results

Overall, none of the models tested led to accurate predictions. This could be due to the following factors:

- **Synthetic Data:** Since the dataset is synthetic, model performance heavily depends on how the data was generated. If it was based on real-world data with added noise, the models would likely capture underlying patterns more effectively. However, if the data was generated from scratch without any meaningful structure, it's possible that no real correlations exist for the models to learn from. This aligns with the correlation heatmap, where all variables showed little to no correlation with each other except themselves.

- **Disregarding Categorical Variables:** Categorical variables such as City, Season, or Weather Condition were excluded from the models for simplicity. If these variables held predictive power, their exclusion may have significantly limited the model’s ability to identify patterns and make accurate predictions.

- **Target Variable Selection:** We could test the models with a different target variable such as PM2.5.

- **Chosen Models:** Given we didn't test an exhaustive list of regression models, it is possible that one of the other models may be better suited to this data.
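As a possible follow-up to the point about categorical variables, one common way to make such columns usable by a regression model is one-hot encoding. Below is a minimal sketch using `pd.get_dummies` on a made-up frame; the rows in `toy` are hypothetical, and this step was not part of the analysis above.

```python
import pandas as pd

# Hypothetical rows (values are made up for illustration).
toy = pd.DataFrame({
    "City": ["Beijing", "Shanghai", "Beijing"],
    "Season": ["Winter", "Summer", "Spring"],
    "PM2.5": [120.0, 80.0, 95.0],
})

# Each category becomes its own 0/1 indicator column.
encoded = pd.get_dummies(toy, columns=["City", "Season"], dtype=int)

print(encoded)
```

The resulting indicator columns are numerical, so they could be appended to the feature matrix alongside the pollutant and weather columns.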

## Final Conclusion

In this assignment, I analyzed synthetic air pollution data from five major cities in China with the goal of uncovering insights and reinforcing core concepts from OMAT5100M.

Throughout the assignment, I applied key learnings from the course while also incorporating additional techniques outside the standard curriculum:
- I applied fundamental concepts, including the use of **NumPy** and **Pandas**, to clean, organize, and explore the data effectively.
- I leveraged the theoretical foundations from the visualization lessons to conceptualize relevant plots.
- I used **Plotly** to create **interactive** and **animated** visualizations, providing a more dynamic and engaging exploration of the data compared to static plots.
- I explored **regression models beyond basic linear regression**, including **Random Forest Regression**, **Ridge Regression**, **Lasso Regression**, and **Polynomial Regression**, to better capture potential non-linear relationships within the data.
- I introduced **ipywidgets** to set up interactive elements, enabling dynamic filtering and selection within the visualizations.

Overall, this project demonstrates the application of core programming concepts, effective communication of data analysis, and independent exploration of new tools and techniques beyond the course materials.