# COSC2968|COSC3053 - Foundations of Artificial Intelligence for STEM
# Assignment 3: Option A - Machine Learning
# Project: Weather Prediction Model for New York City
#### Team 2
- Nguyen Ngoc Dung (s3978535)
- Le Dam Quan (s4031504)
- Nguyen Tran Ha Phan (s3977970)
- Phan Tri Hung (s)
- Tran Quoc Hung (s4045608)

### Introduction
This notebook implements several Machine Learning models in a pipeline to forecast the maximum daily temperature in New York City from historical weather data. The dataset includes temperature, humidity, wind speed, cloud cover, and other weather-related variables. Our approach involves the following steps:
- Data visualization and preprocessing
- Feature engineering
- Implementation of diverse machine learning models through a pipeline
- Fine-tuning of the models
- Evaluation of the models
- Generation of future predictions

### Implementation Code

#### Import Libraries and Functions

In [None]:
# In[0]: IMPORT AND FUNCTIONS

# Standard Libraries
import os  # For file system operations such as creating directories.
import warnings  # For controlling warning messages.
from datetime import timedelta  # For handling time-related operations.

# Third-party Libraries

# Joblib for saving/loading models
import joblib  # For serializing Python objects (like models).

# Numpy and Pandas for numerical and data manipulation
import numpy as np  # For numerical computations, especially with arrays.
import pandas as pd  # For data manipulation and dataframes.

# Seaborn and Matplotlib for visualizations
import seaborn as sns  # Advanced data visualization library based on matplotlib.
import matplotlib.pyplot as plt  # Basic plotting library.

# Scipy for statistical distributions (used in RandomizedSearchCV)
from scipy.stats import randint, uniform, loguniform  # For specifying parameter search distributions.

# Sklearn utilities and transformers
from sklearn.base import BaseEstimator, TransformerMixin  # Base classes for creating custom transformers/estimators.
from sklearn.compose import ColumnTransformer  # For applying different preprocessing pipelines to specific columns.
from sklearn.pipeline import Pipeline, make_pipeline  # For creating machine learning workflows.

# Sklearn preprocessors and imputers
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler  # For encoding, feature generation, and scaling.
from sklearn.impute import SimpleImputer  # For handling missing values in datasets.

# Sklearn model evaluation and hyperparameter tuning
from sklearn.model_selection import (KFold, RandomizedSearchCV, cross_val_predict, cross_val_score, train_test_split)  # For cross-validation, hyperparameter tuning, and splitting datasets.
from sklearn.metrics import mean_squared_error  # For evaluating model performance with error metrics like MSE.

# Sklearn models
from sklearn.linear_model import BayesianRidge, Ridge, LinearRegression, Lasso  # Linear models for regression tasks.
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor  # Ensemble models for regression.
from sklearn.svm import SVR  # Support Vector Regression.
from sklearn.neural_network import MLPRegressor  # Neural network regressor model.

# XGBoost and LightGBM - Gradient Boosting models
from xgboost import XGBRegressor  # XGBoost regressor for gradient boosting.
from lightgbm import LGBMRegressor  # LightGBM regressor for fast gradient boosting.

# Optuna for hyperparameter optimization
import optuna  # Framework for hyperparameter optimization.

# Warnings related to model convergence
from sklearn.exceptions import ConvergenceWarning  # To suppress warnings related to non-convergence.


#### Settings

In [10]:
# Suppress specific warnings for cleaner output
# - ConvergenceWarning: typically related to models not reaching convergence
# - UserWarning: general warnings that may not be critical
warnings.filterwarnings("ignore", category=ConvergenceWarning)
warnings.filterwarnings("ignore", category=UserWarning)

# Create necessary directories to store outputs like figures, trained models, and other saved objects.
# 'figures': for storing plots and visual outputs
# 'models': for saving trained machine learning models
# 'saved_objects': for storing additional objects like results, pre-processed data, or intermediaries
os.makedirs('figures', exist_ok=True)  # Create 'figures' directory if it doesn't already exist
os.makedirs('models', exist_ok=True)   # Create 'models' directory for storing models
os.makedirs('saved_objects', exist_ok=True)  # Create 'saved_objects' directory for miscellaneous saved objects


#### Loading and Preprocessing Data
Here, we load the dataset and perform the initial preprocessing steps: handling missing values and removing outliers. Time-based features are created later, in the feature engineering step.
1. Loading the data: The data is read from a CSV file and stored in a pandas DataFrame.
Minor adjustments are also made before further processing:
- The `datetime` column is parsed into pandas datetime values.
- Unnecessary columns (columns that do not influence the label) are removed: 'name', 'icon', 'stations' and 'description'.
2. Handling missing values: We first count the missing values per column using `isnull().sum()`, so that the result can be compared after the imputation. The imputation itself, which the code runs after the outlier removal, takes one of two actions:
- If the data is categorical, missing values are replaced with the string 'Unknown'.
- If the data is numerical, missing values are filled with the column's median.
 We then count the missing values again to verify that all of them have been handled.
3. Removing outliers: The function `remove_outliers()` drops the rows whose value in a specified column (here `tempmax`) is an outlier.
To classify outliers, we use the Interquartile Range (IQR) method:
- IQR = 75th percentile (Q3) - 25th percentile (Q1)
- The lower bound is Q1 - factor × IQR and the upper bound is Q3 + factor × IQR, with factor = 1.5.
- Outliers are values that are either smaller than the lower bound or larger than the upper bound.
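
As a small worked illustration of the IQR rule described above, the sketch below applies it to a made-up series of eight temperatures. The values and variable names are hypothetical and are not taken from the New York dataset.

```python
import pandas as pd

# Hypothetical toy series of daily maximum temperatures; 60.0 is an obvious outlier.
temps = pd.Series([20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 60.0])

factor = 1.5
q1 = temps.quantile(0.25)   # 25th percentile
q3 = temps.quantile(0.75)   # 75th percentile
iqr = q3 - q1
lower_bound = q1 - factor * iqr
upper_bound = q3 + factor * iqr

# Keep only the values that fall inside [lower_bound, upper_bound].
kept = temps[(temps >= lower_bound) & (temps <= upper_bound)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Bounds: [{lower_bound}, {upper_bound}]")
print("Kept values:", kept.tolist())
```

With pandas' default linear interpolation, Q1 is 21.75 and Q3 is 25.25, so the bounds are 16.5 and 30.5 and only the value 60.0 is discarded.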

In [None]:
raw_data = pd.read_csv('datasets/NewYork.csv')
raw_data['datetime'] = pd.to_datetime(raw_data['datetime'])
raw_data.drop(columns=["name", "icon", "stations", "description"], inplace=True)

# Check for missing values
print("\nMissing values before imputation:")
print(raw_data.isnull().sum())

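# Remove outliers based on 'tempmax'
def remove_outliers(df, column, factor=1.5):
    """Keep only the rows whose `column` value lies within the IQR-based bounds."""
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - factor * IQR
    upper_bound = Q3 + factor * IQR
    # .copy() returns an independent DataFrame, so later column assignments do not act on a slice
    return df[(df[column] >= lower_bound) & (df[column] <= upper_bound)].copy()

raw_data = remove_outliers(raw_data, 'tempmax')
print("\nShape of data after removing outliers:", raw_data.shape)

# Impute missing values: 'Unknown' for text columns, the median for all other columns.
# Assigning the result back avoids the chained-assignment pitfalls of fillna(inplace=True).
for column in raw_data.columns:
    if raw_data[column].dtype == 'object':
        raw_data[column] = raw_data[column].fillna('Unknown')
    else:
        raw_data[column] = raw_data[column].fillna(raw_data[column].median())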

print("\nMissing values after imputation:")
print(raw_data.isnull().sum())

#### Data Discovery
In this section, we explore the dataset through various visualizations. We analyze the distributions of the numeric features and their correlations with the target variable (`tempmax`).
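
To complement the heatmap, the correlations with the target can also be ranked numerically. The following is a minimal sketch on synthetic data; the toy DataFrame and its values are assumptions made for illustration, and only the column names mirror the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the weather data (values are made up for illustration).
rng = np.random.default_rng(0)
n = 200
temp = rng.normal(15, 8, n)
toy = pd.DataFrame({
    "tempmax": temp + 4 + rng.normal(0, 0.5, n),  # strongly tied to temp
    "temp": temp,
    "humidity": rng.uniform(30, 90, n),           # unrelated to tempmax here
    "conditions": ["Clear"] * n,                  # non-numeric, excluded below
})

# Correlation of every numeric feature with the target, sorted by absolute strength.
corr_with_target = (
    toy.select_dtypes(include=[np.number])
       .corr()["tempmax"]
       .drop("tempmax")
       .sort_values(key=np.abs, ascending=False)
)
print(corr_with_target)
```

Sorting by absolute value keeps strongly negative correlations near the top, which a plain descending sort would push to the bottom.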

In [None]:
print('\n____________ Dataset info ____________')
raw_data.info()  # info() prints its report directly and returns None
print('\n____________ Some first data examples ____________')
print(raw_data.head(3))
print('\n____________ Statistics of numeric features ____________')
print(raw_data.describe())

# Correlation heatmap
plt.figure(figsize=(12, 10))
corr = raw_data.select_dtypes(include=[np.number]).corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title("Correlation Heatmap")
plt.tight_layout()
plt.savefig('figures/correlation_heatmap.png', format='png', dpi=300)
plt.close()

# Histograms of all numeric features
numeric_features = raw_data.select_dtypes(include=[np.number]).columns
n_features = len(numeric_features)
n_rows = (n_features + 1) // 2
plt.figure(figsize=(15, 5 * n_rows))
for i, feature in enumerate(numeric_features, 1):
    plt.subplot(n_rows, 2, i)
    sns.histplot(raw_data[feature], kde=True)
    plt.title(f'Distribution of {feature}')
plt.tight_layout()
plt.savefig('figures/hist_raw_data.png', format='png', dpi=300)
plt.close()

# Pairplot of main features
main_features = ['tempmax', 'temp', 'humidity', 'windspeed', 'cloudcover']
# pairplot creates its own figure, so no separate plt.figure() call is needed
pair_grid = sns.pairplot(raw_data[main_features], diag_kind='kde')
pair_grid.figure.suptitle("Pairplot of Main Features", y=1.02)
pair_grid.savefig('figures/pairplot_main_features.png', format='png', dpi=300)
plt.close(pair_grid.figure)

#### Data Preparation and Feature Engineering
In this step, we prepare the data for model training by applying feature transformations. We create additional features based on date information, such as day of year, month, and whether the day is a weekend.
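
The date-based features can be illustrated on three hand-picked dates. This is a small sketch on a toy DataFrame (the dates are arbitrary examples), using the same pandas `.dt` accessors as the transformer in the next cell.

```python
import pandas as pd

# Three arbitrary example dates from a leap year.
dates = pd.DataFrame({
    "datetime": pd.to_datetime(["2024-01-01", "2024-03-02", "2024-12-31"])
})

features = pd.DataFrame({
    "day_of_year": dates["datetime"].dt.dayofyear,
    "month": dates["datetime"].dt.month,
    "day_of_week": dates["datetime"].dt.dayofweek,  # Monday=0 ... Sunday=6
})
# Saturday (5) and Sunday (6) count as weekend days.
features["is_weekend"] = features["day_of_week"].isin([5, 6]).astype(int)

print(features)
```

Here 1 January 2024 is a Monday, 2 March 2024 is a Saturday (day 62 of the leap year), and 31 December 2024 is a Tuesday (day 366), so only the second row is flagged as a weekend.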

In [13]:
class EnhancedFeatureAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_features=True):
        self.add_features = add_features
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        X_ = X.copy()
        if self.add_features:
            X_['day_of_year'] = X_['datetime'].dt.dayofyear
            X_['month'] = X_['datetime'].dt.month
            X_['day_of_week'] = X_['datetime'].dt.dayofweek
            X_['is_weekend'] = X_['day_of_week'].isin([5, 6]).astype(int)
        return X_.drop('datetime', axis=1)

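# Define numeric and categorical features.
# The engineered date features are listed here as well; otherwise the ColumnTransformer
# (remainder='drop' by default) would silently discard them.
numeric_features = ['temp', 'humidity', 'precip', 'windspeed', 'cloudcover',
                    'day_of_year', 'month', 'day_of_week', 'is_weekend']
categorical_features = ['conditions', 'preciptype']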

# Define preprocessing pipelines
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Combine feature engineering and preprocessing in a pipeline
enhanced_pipeline = Pipeline([
    ('feature_adder', EnhancedFeatureAdder()),
    ('preprocessor', preprocessor)
])

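# Prepare data for training
X = raw_data.drop('tempmax', axis=1)
y = raw_data['tempmax']

# Train-test split first, so that the preprocessing is fitted on the training rows only
# and no statistics from the test rows leak into the imputer, scaler, or encoder
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train = enhanced_pipeline.fit_transform(X_train_raw, y_train)
X_test = enhanced_pipeline.transform(X_test_raw)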

#### Model Training and Initial Evaluation
In this step, we train various machine learning models on the training data. We first evaluate the models without hyperparameter tuning to establish a baseline.
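
When establishing a baseline, it helps to keep in mind that RMSE measured on the training rows is usually more optimistic than a cross-validated RMSE. The sketch below shows the gap on synthetic data; the data, the forest size, and the 5-fold setting are assumptions chosen for illustration and are not the notebook's configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

# Synthetic regression problem: one informative feature plus noise.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 3))
y_toy = 2.0 * X_toy[:, 0] + rng.normal(scale=1.0, size=200)

model = RandomForestRegressor(n_estimators=50, random_state=42)

# RMSE on the same rows the model was trained on.
model.fit(X_toy, y_toy)
train_rmse = np.sqrt(mean_squared_error(y_toy, model.predict(X_toy)))

# Cross-validated RMSE: each fold is scored on rows the model has not seen.
cv_mse = -cross_val_score(model, X_toy, y_toy, cv=5,
                          scoring="neg_mean_squared_error")
cv_rmse = np.sqrt(cv_mse.mean())

print(f"Training RMSE:        {train_rmse:.3f}")
print(f"Cross-validated RMSE: {cv_rmse:.3f}")
```

Because of this gap, a training RMSE and a cross-validated RMSE are not directly comparable when deciding whether tuning improved a model.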

In [None]:
models = {
    'RandomForestReg': RandomForestRegressor(random_state=42),
    'GradientBoostingReg': GradientBoostingRegressor(random_state=42),
    'LGBMReg': LGBMRegressor(random_state=42),
    'XGBBoost': XGBRegressor(random_state=42),
    'BayesianRidge': BayesianRidge(),
    'LinearReg': LinearRegression(),
    'Ridge': Ridge(random_state=42),
    'Lasso': Lasso(random_state=42),
    'SVR': SVR(),
    'MLPRegressor': MLPRegressor(random_state=42),
    'PolynomialReg': make_pipeline(PolynomialFeatures(), LinearRegression())
}

def evaluate_model(model, data, labels): 
    prediction = model.predict(data)
    rmse = np.sqrt(mean_squared_error(labels, prediction))
    return rmse

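# Store RMSE for untuned models
rmse_before_tuning = {}

print('\n____________ Train and Evaluate Models ____________')
for name, model in models.items():
    # Cross-validated RMSE (5 folds), so that the baseline is measured on unseen rows
    # and can be compared with the cross-validated score obtained after tuning
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
    rmse = np.sqrt(-scores.mean())
    rmse_before_tuning[name] = rmse
    model.fit(X_train, y_train)  # fit on the full training set for later use
    print(f'{name:<20} CV RMSE before tuning: {rmse:.4f}')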

#### Hyperparameter Tuning Using RandomizedSearchCV
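Now, we fine-tune the models using RandomizedSearchCV. For each model, up to 100 parameter combinations are sampled from the distributions defined below, and each combination is scored by cross-validation with the negative mean squared error.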
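
The search distributions come from `scipy.stats`, whose argument conventions are easy to misread: `uniform(loc, scale)` covers the interval from `loc` to `loc + scale`, and `randint(low, high)` excludes `high`. The sketch below samples from three distributions of the kind used in the next cell to make the ranges concrete; the sample size and seed are arbitrary.

```python
from scipy.stats import loguniform, randint, uniform

learning_rate = uniform(0.01, 0.2)   # loc=0.01, scale=0.2 -> values in [0.01, 0.21]
n_estimators = randint(100, 2000)    # integers from 100 to 1999 (upper bound excluded)
alpha = loguniform(1e-3, 1e2)        # spread evenly on a log scale over [0.001, 100]

lr_samples = learning_rate.rvs(size=1000, random_state=0)
n_samples = n_estimators.rvs(size=1000, random_state=0)
alpha_samples = alpha.rvs(size=1000, random_state=0)

print(f"learning_rate: {lr_samples.min():.4f} .. {lr_samples.max():.4f}")
print(f"n_estimators:  {n_samples.min()} .. {n_samples.max()}")
print(f"alpha:         {alpha_samples.min():.4g} .. {alpha_samples.max():.4g}")
```

Read this way, `uniform(0.5, 0.5)` for `subsample` means values between 0.5 and 1.0, not a constant 0.5.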

In [None]:
print('\n____________ Fine-tune models ____________')

warnings.filterwarnings("ignore", category=ConvergenceWarning)
warnings.filterwarnings("ignore", category=UserWarning)

# Extra standardization pass over all features; used only for the MLPRegressor below
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

param_grids = {
    'RandomForestReg': {
        'n_estimators': randint(100, 2000),
        'max_depth': randint(5, 50),
        'min_samples_split': randint(2, 20),
        'min_samples_leaf': randint(1, 20),
        'max_features': uniform(0.1, 0.9)
    },
    'GradientBoostingReg': {
        'n_estimators': randint(100, 2000),
        'learning_rate': uniform(0.01, 0.2),
        'max_depth': randint(3, 20),
        'min_samples_split': randint(2, 20),
        'min_samples_leaf': randint(1, 20),
        'subsample': uniform(0.5, 0.5)
    },
    'LGBMReg': {
        'num_leaves': randint(20, 200),
        'learning_rate': uniform(0.01, 0.2),
        'n_estimators': randint(100, 2000),
        'min_child_samples': randint(1, 50),
        'subsample': uniform(0.5, 0.5),
        'colsample_bytree': uniform(0.5, 0.5),
        'verbosity': [-1]
    },
    'XGBBoost': {
        'n_estimators': randint(100, 2000),
        'learning_rate': uniform(0.01, 0.2),
        'max_depth': randint(3, 20),
        'min_child_weight': randint(1, 10),
        'subsample': uniform(0.5, 0.5),
        'colsample_bytree': uniform(0.5, 0.5),
        'gamma': uniform(0, 0.5)
    },
    'BayesianRidge': {
        'alpha_1': uniform(0.001, 1),
        'alpha_2': uniform(0.001, 1),
        'lambda_1': uniform(0.001, 1),
        'lambda_2': uniform(0.001, 1)
    },
    'LinearReg': {
        'fit_intercept': [True, False],
        'copy_X': [True, False],
        'positive': [True, False]
    },
    'Ridge': {
        'alpha': loguniform(1e-3, 1e2),
        'max_iter': [5000, 10000]
    },
    'Lasso': {
        'alpha': loguniform(1e-3, 1e2),
        'max_iter': [5000, 10000]
    },
    'SVR': {
        'C': loguniform(1e-2, 1e2),
        'epsilon': loguniform(1e-3, 1),
        'kernel': ['linear', 'poly', 'rbf', 'sigmoid']
    },
    'MLPRegressor': {
        'hidden_layer_sizes': [(50,50,50), (50,100,50), (100,), (100,100), (100,50,100)],
        'activation': ['tanh', 'relu', 'logistic'],
        'solver': ['sgd', 'adam'],
        'alpha': loguniform(1e-4, 1e-1),
        'learning_rate': ['constant','adaptive'],
        'learning_rate_init': loguniform(1e-4, 1e-1),
        'max_iter': [200, 500, 1000],
        'early_stopping': [True, False],
        'momentum': uniform(0.0, 1.0),
        'nesterovs_momentum': [True, False]
    },
    'PolynomialReg': {
        'polynomialfeatures__degree': randint(2, 5),
        'linearregression__fit_intercept': [True, False]
    }
}

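best_models = {}

# `tscv` is used as the cross-validation splitter below but was never defined.
# Assumption: a 5-fold TimeSeriesSplit was intended (the number of splits is a guess).
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)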

for name, model in models.items():
    print(f"\nFine-tuning {name}")
    
    grid_search = RandomizedSearchCV(model, param_distributions=param_grids.get(name, {}), 
                                     n_iter=100, cv=tscv, 
                                     scoring='neg_mean_squared_error', n_jobs=-1, 
                                     random_state=42, verbose=1, error_score='raise')
    try:
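        if name == 'MLPRegressor':
            # The MLP is tuned on the re-scaled features, so its later predictions
            # need X_test_scaled (inputs passed through `scaler`) rather than X_test.
            grid_search.fit(X_train_scaled, y_train)
        else:
            grid_search.fit(X_train, y_train)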
        
        best_rmse = np.sqrt(-grid_search.best_score_)
        print(f"Best RMSE for {name}: {best_rmse:.4f}")
        
        if best_rmse < rmse_before_tuning[name]:
            print(f"{name} is improved after tuning. Using tuned version.")
            best_models[name] = (grid_search.best_estimator_, best_rmse)
        else:
            print(f"{name} is not improved after tuning. Using untuned version.")
            best_models[name] = (model, rmse_before_tuning[name])
        
        joblib.dump(grid_search, f'saved_objects/{name}_gridsearch.pkl')
    except Exception as e:
        print(f"An error occurred while tuning {name}: {str(e)}")
        print("Skipping this model and continuing with the next one.")

#### Select and Evaluate the Best Model
We evaluate the best performing model on the test data to assess its generalization performance.
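
Alongside the test RMSE, the residuals can be summarized by their mean (a systematic bias) and their standard deviation (the spread). The squared RMSE equals the squared bias plus the residual variance, which the sketch below checks on made-up numbers; the arrays are hypothetical and not results from this notebook.

```python
import numpy as np

# Hypothetical true values and predictions (not results from this notebook).
y_true_toy = np.array([20.0, 22.0, 25.0, 30.0, 18.0])
y_pred_toy = np.array([21.0, 22.5, 23.0, 31.0, 19.5])

residuals_toy = y_true_toy - y_pred_toy
rmse_toy = np.sqrt(np.mean(residuals_toy ** 2))
bias_toy = residuals_toy.mean()     # negative: the model over-predicts on average
spread_toy = residuals_toy.std()    # population standard deviation (ddof=0)

print(f"RMSE:   {rmse_toy:.4f}")
print(f"Bias:   {bias_toy:.4f}")
print(f"Spread: {spread_toy:.4f}")
print("RMSE^2 == bias^2 + spread^2:",
      np.isclose(rmse_toy ** 2, bias_toy ** 2 + spread_toy ** 2))
```

A residual histogram centred away from zero corresponds to a non-zero bias term, while a wide histogram corresponds to a large spread.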

In [None]:
best_model_name = min(best_models, key=lambda name: best_models[name][1])
best_model, best_rmse = best_models[best_model_name]

print(f"\nBest model after fine-tuning: {best_model_name} with RMSE: {best_rmse:.4f}")

# Save the best model
joblib.dump(best_model, 'models/SOLUTION_model.pkl')

# Evaluate on test data
y_pred = best_model.predict(X_test)
test_rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f'\nPerformance on test data: RMSE: {test_rmse:.4f}')

# Static visualization of residuals
residuals = y_test - y_pred
plt.figure(figsize=(10, 6))
sns.histplot(residuals, kde=True)
plt.title('Residuals Distribution')
plt.tight_layout()
plt.savefig('figures/residuals_distribution.png', format='png', dpi=300)
plt.close()

#### Future Predictions
We use the best model to make predictions for the next 200 days.
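
Since the future weather inputs are unknown, the cell below borrows them from roughly the same calendar date one year earlier. The date arithmetic can be sketched with the standard library alone; the helper name `same_day_previous_year` is hypothetical and used only for this illustration.

```python
from datetime import date

def same_day_previous_year(d):
    """Return the same calendar day one year earlier (hypothetical helper)."""
    try:
        return d.replace(year=d.year - 1)
    except ValueError:
        # 29 February has no counterpart in a non-leap year; fall back to 28 February.
        return d.replace(year=d.year - 1, day=28)

print(same_day_previous_year(date(2025, 7, 4)))    # 2024-07-04
print(same_day_previous_year(date(2024, 2, 29)))   # 2023-02-28
```

This reuse of last year's conditions is a simplification: the forecast can only be as good as the assumption that the other weather variables repeat from year to year.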

In [None]:
last_date = raw_data['datetime'].max()
future_dates = pd.date_range(start=last_date + timedelta(days=1), periods=200)
future_data = pd.DataFrame({'datetime': future_dates})

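def find_closest_date(target_date, data):
    """Return the historical row closest to the same calendar date one year earlier."""
    try:
        target_date = target_date.replace(year=target_date.year - 1)
    except ValueError:
        # 29 February has no counterpart in a non-leap year; fall back to 28 February
        target_date = target_date.replace(year=target_date.year - 1, day=28)
    # idxmin() returns the index label of the smallest gap, which stays valid
    # even after rows were dropped from the DataFrame (e.g. by the outlier filter)
    closest_label = (data['datetime'] - target_date).abs().idxmin()
    return data.loc[closest_label]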

for col in raw_data.columns:
    if col not in ['datetime', 'tempmax']:
        future_data[col] = future_data['datetime'].apply(lambda x: find_closest_date(x, raw_data)[col])

future_processed = enhanced_pipeline.transform(future_data)
future_pred = best_model.predict(future_processed)
future_data['predicted_tempmax'] = future_pred

future_data[['datetime', 'predicted_tempmax']].to_csv('future_predictions_200days.csv', index=False)
print("Future predictions have been saved to 'future_predictions_200days.csv'")

# Static visualization of future predictions
plt.figure(figsize=(20, 10))
plt.plot(future_data['datetime'], future_data['predicted_tempmax'], label='Predicted Max Temperature', alpha=0.7)
plt.title('Predicted Maximum Temperature for the Next 200 Days')
plt.xlabel('Date')
plt.ylabel('Temperature (°C)')
plt.legend()
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('figures/future_predictions_200days_plot.png', format='png', dpi=300)
plt.close()

print("\nPredicted Max Temperature:")
print(f"Min: {future_data['predicted_tempmax'].min():.2f}°C")
print(f"Max: {future_data['predicted_tempmax'].max():.2f}°C")
print(f"Avg: {future_data['predicted_tempmax'].mean():.2f}°C")

#### Conclusion

In [None]:
print("\n____________ CONCLUSION ____________")
print("""In this notebook, we built a Machine Learning pipeline that uses historical weather data from New York City to forecast the maximum temperature (tempmax).
The workflow covered data preprocessing, feature engineering, model training, and hyperparameter tuning:""")
print(f"""
1. Data Preprocessing:
   - Removed outliers and handled missing values.
   - Added engineered features: day of year, month, day of week, is_weekend.
   - Applied scaling and one-hot encoding.

2. Model Selection and Hyperparameter Tuning:
   - Evaluated multiple models using RMSE as the sole metric.
   - Used RandomizedSearchCV for hyperparameter optimization.
   - The best performing model was: {best_model_name}

3. Model Performance:
   - Best model RMSE on test data: {test_rmse:.4f}

4. Future Predictions:
   - Generated predictions for the next 200 days.
   - The predicted temperatures range from {future_data['predicted_tempmax'].min():.2f}°C to {future_data['predicted_tempmax'].max():.2f}°C.

Suggestions for Further Improvement:
1. Collect more historical data or external data sources.
2. Experiment with more advanced time series models.
3. Implement online learning for continuous model updates.
4. Consider using deep learning models for complex pattern capture.
5. Analyze prediction intervals for more robust forecasts.
""")

print("\nPrediction and analysis complete.")

# CONTRIBUTION TABLE

| Student ID | Student Name | Contribution Rate (1-100%) | Responsible for (Parts, Cells...) | Note |
|------------|--------------|----------------------------|------------------------------------|------|
| s3978535   | Nguyen Ngoc Dung    |                            |                                    |      |
| s4031504   | Le Dam Quan         |                            |                                    |      |
| s3977970   | Nguyen Tran Ha Phan |                            |                                    |      |
|            | Phan Tri Hung       |                            |                                    |      |
| s4045608   | Tran Quoc Hung      |                            |                                    |      |
