<a href="https://colab.research.google.com/github/RoseJared/AI-ML/blob/main/FinalProjectDemo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Code Demonstration Excerpts
This code runs the top two models identified in the full code:
- Random Forest Regression
- Gradient Boosting Regression

This code runs in **4 min 22 sec**.

### Importing Visualization Libraries

In this section, we import two popular Python libraries for creating visualizations:

- **`matplotlib.pyplot`**: This is a core plotting library for creating charts such as line graphs, bar charts, and scatter plots. We import it as `plt` to make it easier to use in our code.
- **`seaborn`**: This library is built on top of `matplotlib` and makes it easier to create visually appealing and informative charts, especially when working with tabular data (like pandas DataFrames).

We'll use these tools later to explore the data and better understand the relationships between different features (columns) in our dataset. The same cell also imports the pandas, NumPy, SciPy, scikit-learn, and XGBoost tools used in the rest of the notebook.


In [None]:
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Data handling
import numpy as np
import pandas as pd
from scipy.stats import randint, uniform

# Modeling and evaluation
import xgboost as xgb
from sklearn import tree
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

### Loading the Datasets

In this section, we use the **`pandas`** library (imported as `pd`) to load data from Google Drive. Two datasets are described below, but the cell in this excerpt loads only `beijing_df`:

- **`beijing_df`**: This dataset contains air quality data for Beijing, including pollutant levels, weather conditions, station names, and timestamps.
- **`worldcities_df`**: This dataset contains information about cities around the world, such as their names, countries, and geographic coordinates (latitude and longitude).

To load a file:
1. We start with the shared Google Drive URL for the file.
2. We rewrite the URL into a direct-download form that `pandas.read_csv()` can access. `read_csv()` then reads the data into a structured table called a **DataFrame**.

DataFrames are useful because they allow us to easily view, analyze, and manipulate large tables of data using Python.


In [None]:
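import pandas as pd

# Shared Google Drive link to the Beijing air quality CSV
url = 'https://drive.google.com/file/d/1BFefZwgVG5eBv0f4ocR2tfWKINJ7XzWb/view?usp=sharing'
# Extract the file ID (second-to-last path segment) and build a direct-download URL
url = 'https://drive.google.com/uc?id=' + url.split('/')[-2]
beijing_df = pd.read_csv(url)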
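
The URL rewrite described above can be wrapped in a small helper. This is a minimal sketch: the function name `drive_share_to_direct` is hypothetical, and it assumes share links of the form `.../file/d/<id>/view?...` like the one used in this notebook.

```python
def drive_share_to_direct(share_url: str) -> str:
    """Turn a Drive share link into a direct-download link (hypothetical helper)."""
    # The file ID is the path segment just before the trailing "view?..." part
    file_id = share_url.split('/')[-2]
    return 'https://drive.google.com/uc?id=' + file_id


share_url = 'https://drive.google.com/file/d/1BFefZwgVG5eBv0f4ocR2tfWKINJ7XzWb/view?usp=sharing'
direct_url = drive_share_to_direct(share_url)
print(direct_url)
```

The resulting string is what gets passed to `pd.read_csv()`.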

### Handling Missing Data

In this step, we clean the dataset by filling in missing values:

1. **Filling Missing Values in Numerical Columns**:
   - For each numerical feature (like `PM2.5`, `TEMP`, `RAIN`, etc.), we replace missing values with the **median** of that column.
   - The median is used because it's less affected by extreme values (outliers) than the average, making it a reliable choice for filling in missing data.

2. **Filling Missing Values in Categorical Column**:
   - For the `wd` column (which stands for wind direction and contains text values), we fill in missing entries using the **most frequent value** (also known as the mode).

3. **Verifying the Fixes**:
   - After filling in the missing values, we check again to confirm that there are no missing values left in the dataset.

Cleaning missing data is an essential step before building any machine learning model, as many algorithms cannot handle empty or incomplete values.

In [None]:
# Fill numerical columns with their medians
numerical_cols = ['PM2.5', 'PM10', 'SO2', 'NO2', 'CO', 'O3', 'TEMP', 'PRES', 'DEWP', 'RAIN', 'WSPM']
for col in numerical_cols:
    beijing_df[col] = beijing_df[col].fillna(beijing_df[col].median())

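# Fill wind direction (categorical) with most common value
beijing_df['wd'] = beijing_df['wd'].fillna(beijing_df['wd'].mode()[0])

# Verify the fix: count the missing values remaining in each column
print(beijing_df.isnull().sum())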
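
To see why the median is preferred over the mean here, the following sketch runs the same fill strategy on a tiny made-up table. The values are invented for illustration and are not taken from the Beijing dataset.

```python
import pandas as pd

# Toy data (made up): one extreme reading plus one missing entry per column
toy = pd.DataFrame({
    'PM2.5': [10.0, 12.0, None, 11.0, 500.0],
    'wd': ['N', 'NE', None, 'N', 'N'],
})

mean_fill = toy['PM2.5'].mean()      # pulled upward by the 500.0 outlier
median_fill = toy['PM2.5'].median()  # stays close to the typical readings

# Same strategy as the notebook: median for numbers, mode for categories
toy['PM2.5'] = toy['PM2.5'].fillna(median_fill)
toy['wd'] = toy['wd'].fillna(toy['wd'].mode()[0])

# Total number of missing cells left in the table
missing_left = int(toy.isnull().sum().sum())
print(mean_fill, median_fill, missing_left)
```

Here the mean (133.25) would be a poor stand-in for a typical reading, while the median (11.5) is not distorted by the single extreme value.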


### Calculating the Air Quality Index (AQI)

In this section, we compute the Air Quality Index (AQI) for each row in the dataset based on standard pollutant concentration levels:

1. **Unit Conversion**:
   - The `CO` and `O3` columns are converted from micrograms per cubic meter (μg/m³) to the units used by the AQI breakpoints: parts per million (ppm) for CO and parts per billion (ppb) for O3.

2. **AQI Breakpoints**:
   - Each pollutant has defined concentration ranges (called breakpoints) that map to corresponding AQI values. The values used here appear to follow the U.S. EPA AQI breakpoint tables.

3. **AQI Calculation**:
   - For each row of data, we calculate the AQI for all six pollutants and keep the highest one. This represents the overall AQI for that time and location.

4. **AQI Category Labeling**:
   - The final AQI value is mapped to a category like "Good", "Moderate", or "Unhealthy", which helps communicate the quality of the air in a way that’s easier to understand.

5. **Result Preview**:
   - A few key columns (station, date/time, AQI, and category) are shown to confirm the AQI calculation worked correctly.

This process transforms raw pollutant measurements into a single, easy-to-understand indicator of air quality.
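
The unit conversion in step 1 can be checked by hand. The sketch below assumes the usual molar volume of 24.45 L/mol (25 °C, 1 atm) and molecular weights of 28.01 g/mol for CO and 48 g/mol for O3, which are the constants that appear in the code cell. The helper names and the input concentrations (300 and 77 μg/m³) are chosen for illustration.

```python
MOLAR_VOLUME_L = 24.45  # litres per mole of gas at 25 °C and 1 atm


def ugm3_to_ppm(conc_ugm3: float, mol_weight: float) -> float:
    """Convert µg/m³ to parts per million for a gas of the given molecular weight."""
    return conc_ugm3 * MOLAR_VOLUME_L / (mol_weight * 1000)


def ugm3_to_ppb(conc_ugm3: float, mol_weight: float) -> float:
    """Convert µg/m³ to parts per billion (1 ppm = 1000 ppb)."""
    return conc_ugm3 * MOLAR_VOLUME_L / mol_weight


co_ppm = ugm3_to_ppm(300.0, 28.01)  # illustrative CO input
o3_ppb = ugm3_to_ppb(77.0, 48.0)    # illustrative O3 input
print(round(co_ppm, 6), round(o3_ppb, 6))
```

These inputs give about 0.261871 ppm and 39.221875 ppb, the same CO and O3 values that appear in the first row of the preview output.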


In [None]:
# Convert CO and O3 to the units used by the AQI breakpoints
# Reference: https://www.freeonlinecalc.com/air-quality-index-aqi-calculation-review-and-formulas.html
# ppm = µg/m³ * 24.45 / (molecular weight * 1000); 24.45 L/mol is the molar volume at 25 °C and 1 atm

# Convert CO from µg/m³ to ppm (molecular weight 28.01 g/mol)
beijing_df['CO'] = beijing_df['CO'] * (24.45 / 28010)

# Convert O3 from µg/m³ to ppb (molecular weight 48 g/mol; ppm * 1000 = ppb)
beijing_df['O3'] = (beijing_df['O3'] * 24.45 / (48 * 1000)) * 1000

# Define AQI breakpoints for each pollutant
aqi_breakpoints = {
    'PM2.5': [(0, 12), (12.1, 35.4), (35.5, 55.4), (55.5, 150.4), (150.5, 250.4), (250.5, 500.4)],
    'PM10': [(0, 54), (55, 154), (155, 254), (255, 354), (355, 424), (425, 604)],
    'SO2': [(0, 35), (36, 75), (76, 185), (186, 304), (305, 604), (605, 1004)],
    'NO2': [(0, 53), (54, 100), (101, 360), (361, 649), (650, 1249), (1250, 2049)],
    'CO': [(0.0, 4.4), (4.5, 9.4), (9.5, 12.4), (12.5, 15.4), (15.5, 30.4), (30.5, 50.4)],
    'O3': [(0, 54), (55, 70), (71, 85), (86, 105), (106, 200), (201, 404)]
}

# Corresponding AQI index ranges
aqi_indices = [(0, 50), (51, 100), (101, 150), (151, 200), (201, 300), (301, 500)]

# Function to calculate individual AQI for a pollutant by linear interpolation
def calculate_individual_aqi(concentration, breakpoints, aqi_range):
    for (low_bp, high_bp), (low_idx, high_idx) in zip(breakpoints, aqi_range):
        if low_bp <= concentration <= high_bp:
            aqi = ((high_idx - low_idx) / (high_bp - low_bp)) * (concentration - low_bp) + low_idx
            return round(aqi)
    # Fallback of 500: reached by values above the highest range, and also by
    # values that fall in the gap between two adjacent ranges (e.g. PM2.5 = 12.05)
    return 500

# Function to calculate AQI for each row
def calculate_aqi(row):
    aqi_list = []
    for pollutant in aqi_breakpoints.keys():
        concentration = row[pollutant]
        breakpoints = aqi_breakpoints[pollutant]
        aqi_list.append(calculate_individual_aqi(concentration, breakpoints, aqi_indices))
    return max(aqi_list)

# Apply AQI calculation to each row
beijing_df['AQI'] = beijing_df.apply(calculate_aqi, axis=1)

# Map AQI to AQI Category
def categorize_aqi(aqi):
    if aqi <= 50:
        return 'Good'
    elif aqi <= 100:
        return 'Moderate'
    elif aqi <= 150:
        return 'Unhealthy for Sensitive Groups'
    elif aqi <= 200:
        return 'Unhealthy'
    elif aqi <= 300:
        return 'Very Unhealthy'
    else:
        return 'Hazardous'

beijing_df['AQI_Category'] = beijing_df['AQI'].apply(categorize_aqi)

# Preview the resulting DataFrame
print(beijing_df[['station', 'year', 'month', 'day', 'hour',
                  'PM2.5', 'PM10', 'SO2', 'NO2', 'CO', 'O3',
                  'AQI', 'AQI_Category']])

              station  year  month  day  hour  PM2.5  PM10   SO2   NO2  \
0        Aotizhongxin  2013      3    1     0    4.0   4.0   4.0   7.0   
1        Aotizhongxin  2013      3    1     1    8.0   8.0   4.0   7.0   
2        Aotizhongxin  2013      3    1     2    7.0   7.0   5.0  10.0   
3        Aotizhongxin  2013      3    1     3    6.0   6.0  11.0  11.0   
4        Aotizhongxin  2013      3    1     4    3.0   3.0  12.0  12.0   
...               ...   ...    ...  ...   ...    ...   ...   ...   ...   
420763  Wanshouxigong  2017      2   28    19   11.0  32.0   3.0  24.0   
420764  Wanshouxigong  2017      2   28    20   13.0  32.0   3.0  41.0   
420765  Wanshouxigong  2017      2   28    21   14.0  28.0   4.0  38.0   
420766  Wanshouxigong  2017      2   28    22   12.0  23.0   4.0  30.0   
420767  Wanshouxigong  2017      2   28    23   13.0  19.0   4.0  38.0   

              CO         O3  AQI AQI_Category  
0       0.261871  39.221875   36         Good  
1       0.26187
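
The interpolation inside `calculate_individual_aqi` can be written as a formula. The symbols are introduced here for illustration: $C$ is the measured concentration, $[C_{lo}, C_{hi}]$ is the breakpoint range that contains it, and $[I_{lo}, I_{hi}]$ is the matching AQI index range. The worked line uses the O3 value from the first preview row, which falls in the range (0, 54) mapped to (0, 50).

```latex
\mathrm{AQI} = \frac{I_{hi} - I_{lo}}{C_{hi} - C_{lo}}\,(C - C_{lo}) + I_{lo}
\qquad
\mathrm{AQI}_{O_3} = \frac{50 - 0}{54 - 0}\,(39.221875 - 0) + 0 \approx 36.3 \;\rightarrow\; 36
```

This agrees with the AQI of 36 shown for that row, because O3 gives the largest of the six sub-indices there.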

In [None]:
test_data = {
    'PM2.5': 5.0,  # Expected AQI = Good
    'PM10': 50.0,  # Expected AQI = Good
    'SO2': 30.0,   # Expected AQI = Good
    'NO2': 50.0,   # Expected AQI = Good
    'CO': 1.0,     # Expected AQI = Good
    'O3': 30.0     # Expected AQI = Good
}

test_df = pd.DataFrame([test_data])
test_df['AQI'] = test_df.apply(calculate_aqi, axis=1)
test_df['AQI_Category'] = test_df['AQI'].apply(categorize_aqi)

print(test_df)

   PM2.5  PM10   SO2   NO2   CO    O3  AQI AQI_Category
0    5.0  50.0  30.0  50.0  1.0  30.0   47         Good
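
One caveat of `calculate_individual_aqi` is that the breakpoint ranges have small gaps (for example between 12 and 12.1 for PM2.5), so a continuous value such as 12.05 matches no range and falls through to 500. The sketch below shows one possible gap-tolerant variant that matches on the upper bound only. It is an assumption for illustration, not the method used in this notebook, and the function name is hypothetical.

```python
def calculate_individual_aqi_no_gaps(concentration, breakpoints, aqi_range):
    """Hypothetical variant: use the first range whose upper bound is not exceeded."""
    for (low_bp, high_bp), (low_idx, high_idx) in zip(breakpoints, aqi_range):
        if concentration <= high_bp:
            # Clamp gap values up to the lower bound, so they get the range's lowest index
            c = max(concentration, low_bp)
            aqi = ((high_idx - low_idx) / (high_bp - low_bp)) * (c - low_bp) + low_idx
            return round(aqi)
    return 500  # only reached above the highest range


# Same PM2.5 breakpoints and index ranges as in the notebook
pm25_breakpoints = [(0, 12), (12.1, 35.4), (35.5, 55.4), (55.5, 150.4), (150.5, 250.4), (250.5, 500.4)]
aqi_indices = [(0, 50), (51, 100), (101, 150), (151, 200), (201, 300), (301, 500)]

gap_value = calculate_individual_aqi_no_gaps(12.05, pm25_breakpoints, aqi_indices)
in_range_value = calculate_individual_aqi_no_gaps(5.0, pm25_breakpoints, aqi_indices)
print(gap_value, in_range_value)
```

With this variant a PM2.5 reading of 12.05 maps to the bottom of the "Moderate" band instead of 500, while in-range values are computed exactly as before.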


### Preparing Data for AQI Prediction

This section sets up the data for building a machine learning model that can predict AQI based on environmental measurements:

1. **Library Imports**:
   - The tools used here come from `scikit-learn` and were imported at the top of the notebook: `train_test_split` for splitting data, `StandardScaler` for scaling features, and `mean_squared_error` and `r2_score` for evaluating performance.

2. **Feature Selection**:
   - We choose a set of predictor variables (features) that affect air quality. This excerpt uses the six pollutant concentrations (`PM2.5`, `PM10`, `SO2`, `NO2`, `CO`, `O3`); the weather columns (`TEMP`, `PRES`) are commented out in the code.
   - The variable we want to predict is `AQI`, which serves as our **target**.

3. **Preparing Input and Output**:
   - `X` holds the features (input data), and `y` holds the target (AQI values).

4. **Train-Test Split**:
   - The data is split into two parts using a common 80/20 ratio:
     - **Training set** (80%): Used to train the model.
     - **Testing set** (20%): Used to evaluate how well the model performs on new, unseen data.

Splitting the data this way helps ensure that the model can generalize well and is not just memorizing the training data.


In [None]:
# Assumes 'beijing_df' already contains the computed 'AQI' column
# Select relevant features for AQI prediction
features = ['PM2.5', 'PM10', 'SO2', 'NO2', 'CO', 'O3']
# Weather features 'TEMP' and 'PRES' are left out of this run
target = 'AQI'

# Prepare the features and target
X = beijing_df[features]
y = beijing_df[target]

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features (optional but often helps with performance)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
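
The split-then-scale pattern can be checked on synthetic data. In this sketch the arrays `X_toy` and `y_toy` are made up for illustration. It shows that an 80/20 split of 100 rows gives 80 training and 20 test rows, and that `StandardScaler` applies the training set's mean and standard deviation to the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(100, 2))  # synthetic features
y_toy = X_toy.sum(axis=1)                                # synthetic target

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

toy_scaler = StandardScaler().fit(X_tr)   # statistics come from the training rows only
X_te_scaled = toy_scaler.transform(X_te)

# The same transform written out by hand: subtract the training mean and
# divide by the training standard deviation (population form, ddof=0)
manual = (X_te - X_tr.mean(axis=0)) / X_tr.std(axis=0)
matches = bool(np.allclose(manual, X_te_scaled))
print(X_tr.shape, X_te.shape, matches)
```

Fitting the scaler on the training rows alone keeps information from the test set out of the preprocessing step.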

### Random Forest Regression Model

We train a Random Forest Regressor to predict AQI:

- Random Forest is an ensemble method that builds multiple decision trees and averages their predictions for more accurate results.
- The model uses 100 trees (`n_estimators=100`) and is trained on the original (unscaled) training data.
- After training, predictions are made on the test set.
- The model's performance is measured using Mean Squared Error (MSE) and the R² score.

Because it combines many non-linear trees, this model often captures complex patterns better than linear models.

#### How Random Forest Differs from a Decision Tree

- A Decision Tree is a single model that splits data into branches based on feature values to make predictions. It is simple and easy to interpret, but it can overfit the training data.

- A Random Forest is an ensemble of many decision trees. Each tree is trained on a random (bootstrap) sample of the rows and can consider a random subset of the features at each split. The final prediction is the average of all tree predictions (for regression).
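
The averaging step can be verified directly in scikit-learn. The sketch below uses synthetic data (`X_toy` and `y_toy` are made up for illustration) and compares the forest's prediction with the mean of its individual trees, which are exposed through the fitted `estimators_` attribute.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 3))                # synthetic features
y_toy = X_toy[:, 0] ** 2 + rng.normal(0, 1, size=200)    # non-linear synthetic target

forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_toy, y_toy)

# Predict the first five rows with every individual tree, then average
per_tree = np.array([t.predict(X_toy[:5]) for t in forest.estimators_])
manual_average = per_tree.mean(axis=0)

forest_prediction = forest.predict(X_toy[:5])
same = bool(np.allclose(manual_average, forest_prediction))
print(per_tree.shape, same)
```

Averaging over trees trained on different samples is what reduces the variance of a single, easily overfit tree.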


In [None]:
# Initialize and train the Random Forest Regressor model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred_rf = rf_model.predict(X_test)

# Evaluate the Random Forest model
mse_rf = mean_squared_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)

print(f"Random Forest - MSE: {mse_rf}, R²: {r2_rf}")

Random Forest - MSE: 89.57417025096846, R²: 0.9886476132453791
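
For reference, both metrics can be computed by hand. This sketch uses plain Python and a made-up set of four true and predicted AQI values. The function names `mse` and `r2` are chosen here and are not the scikit-learn functions.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares around the mean)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true_toy = [50, 100, 150, 200]   # made-up true AQI values
y_pred_toy = [60, 90, 160, 190]    # made-up predictions, each off by 10

toy_mse = mse(y_true_toy, y_pred_toy)
toy_r2 = r2(y_true_toy, y_pred_toy)
print(toy_mse, toy_r2)
```

Because MSE is in squared units, its square root is easier to read: the Random Forest MSE of about 89.57 corresponds to a typical error of roughly 9.5 AQI points.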


### Gradient Boosting Regression

We train a Gradient Boosting Regressor to predict AQI:

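- This model builds an ensemble of trees sequentially, each one correcting the errors of the trees before it.
- It is known for high accuracy on structured (tabular) data.
- Here it uses 200 trees (`n_estimators=200`) with a learning rate of 0.1 and is trained on the standardized features (`X_train_scaled`).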

**Evaluation**:
- MSE and R² scores show how well the model performs on the test set.

Gradient Boosting is a strong performer and often used in competitions and production systems.
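
The idea of each tree correcting the errors of the previous ones can be sketched by hand for squared-error loss. This is a simplified illustration on synthetic data, not the internals of `GradientBoostingRegressor`: each shallow tree is fitted to the current residuals and added with a small learning rate.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 1))                       # synthetic feature
y_toy = np.sin(X_toy[:, 0]) * 10 + rng.normal(0, 1, size=200)   # synthetic target

learning_rate = 0.1
prediction = np.full_like(y_toy, y_toy.mean())   # start from the mean of the target
mse_history = [float(np.mean((y_toy - prediction) ** 2))]

for _ in range(50):
    residuals = y_toy - prediction                # errors left by the current ensemble
    small_tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_toy, residuals)
    prediction += learning_rate * small_tree.predict(X_toy)   # small corrective step
    mse_history.append(float(np.mean((y_toy - prediction) ** 2)))

print(mse_history[0], mse_history[-1])
```

The training error shrinks as trees are added. A smaller learning rate makes each correction more cautious, which usually calls for more trees.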


In [None]:
# Rerun of the best-performing model
from sklearn.ensemble import GradientBoostingRegressor

# Train Gradient Boosting Regressor on the standardized features
gb_model = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1, random_state=42)
gb_model.fit(X_train_scaled, y_train)

y_pred_gb = gb_model.predict(X_test_scaled)

# Evaluate the Gradient Boosting model
mse_gb = mean_squared_error(y_test, y_pred_gb)
r2_gb = r2_score(y_test, y_pred_gb)
print(f"Gradient Boosting - MSE: {mse_gb}, R²: {r2_gb}")

Gradient Boosting - MSE: 70.99888625782124, R²: 0.9910017942260826
