# Question 1 
In the lecture we implemented a few ML models, including one that uses TensorFlow, to predict the next day's price of MTR stock (0066.HK).

This question asks you to implement yet another ML model using any regression method
available in the scikit-learn library, excluding Linear Regression and Neural Networks,
trained on the same set of stock prices (i.e., 0066.HK between "2010-01-01" and
"2020-06-30").

You should submit a Jupyter notebook that includes the
three ML models (i.e., the TensorFlow implementation from the lectures and your
implementations using sklearn) and compares their accuracy in predicting the
price of 0066.HK between "2021-01-01" and "2021-04-30".
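
To compare several models on the same test period, it helps to compute the same error metrics for each set of predictions. The helper below is a minimal sketch of that idea: `compare_models` and the toy numbers are hypothetical and are not part of the lecture code, and the values are not real 0066.HK prices.

```python
import numpy as np

def compare_models(y_true, predictions):
    """Return {model_name: {"MAE": ..., "RMSE": ...}} for each prediction array."""
    y_true = np.asarray(y_true, dtype=float)
    results = {}
    for name, y_pred in predictions.items():
        errors = np.asarray(y_pred, dtype=float) - y_true
        results[name] = {
            "MAE": float(np.mean(np.abs(errors))),
            "RMSE": float(np.sqrt(np.mean(errors ** 2))),
        }
    return results

# Toy numbers only, chosen so the two metrics disagree
actual = [10.0, 11.0, 12.0, 13.0]
toy_predictions = {
    "model_a": [10.5, 11.5, 12.5, 13.5],  # off by 0.5 on every day
    "model_b": [10.0, 11.0, 12.0, 15.0],  # exact except for one miss of 2.0
}
scores = compare_models(actual, toy_predictions)
for name, metrics in scores.items():
    print(f"{name}: MAE={metrics['MAE']:.3f}, RMSE={metrics['RMSE']:.3f}")
```

Both toy models have the same MAE, but RMSE penalizes the single large miss more heavily, which is why reporting both metrics can be informative.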

## Requirements
1. numpy
2. pandas
3. matplotlib
4. scikit-learn
5. tensorflow
6. akshare

In [None]:
import tensorflow as tf
import pandas as pd
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, mean_absolute_error
import matplotlib.pyplot as plt

%pip install akshare
import akshare as ak # for getting stock data

In [None]:
# get stock data
symbol = "00066"
start = "2010-01-01"
end = "2020-06-30"
predict_start = "2021-01-01"
predict_end = "2021-04-30"
stock_train = ak.stock_hk_hist(symbol=symbol, start_date=start, end_date=end, adjust='') # without adjust
stock_predict_actual = ak.stock_hk_hist(symbol=symbol, start_date=predict_start, end_date=predict_end, adjust='') # without adjust

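# use GradientBoostingRegressor to predict stock price
# stock_train columns: 日期 (date) / 开盘 (open) / 收盘 (close) / 最高 (high) / 最低 (low) /
# 成交量 (volume) / 成交额 (turnover) / 振幅 (amplitude) / 涨跌幅 (change %) /
# 涨跌额 (change amount) / 换手率 (turnover rate)
# use the close price (收盘) as the prediction target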

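# add lag and rolling features; every feature uses only closes from previous days
def create_features(data, lag_days=3, roll_days=3):
    data = data.copy()  # avoid mutating the caller's DataFrame
    for lag in range(1, lag_days + 1):
        data[f'lag_{lag}'] = data['收盘'].shift(lag)
    # shift by one day so the rolling window excludes the close being predicted
    data['rolling_mean'] = data['收盘'].shift(1).rolling(window=roll_days).mean()
    data['rolling_std'] = data['收盘'].shift(1).rolling(window=roll_days).std()
    return data.dropna()  # drop the initial rows that lack a full history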

# create features
stock_train = create_features(stock_train)
stock_predict_actual = create_features(stock_predict_actual)

# form up X_train, y_train, X_test, y_test
# keep only the lag/rolling features: same-day columns (open, high, low, change, ...)
# are not known before the close and would leak the target into the inputs
feature_cols = [c for c in stock_train.columns if c.startswith(('lag_', 'rolling_'))]
X_train = stock_train[feature_cols]
y_train = stock_train['收盘']
X_test = stock_predict_actual[feature_cols]
y_test = stock_predict_actual['收盘']

# define pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', GradientBoostingRegressor())
])

# define hyperparameters grid
param_grid = {
    'model__n_estimators': [100, 200],
    'model__learning_rate': [0.01, 0.05, 0.1],
    'model__max_depth': [3, 5, 7]
}

# use TimeSeriesSplit for cross-validation to avoid data leakage
tscv = TimeSeriesSplit(n_splits=5)
grid_search = GridSearchCV(pipeline, param_grid, cv=tscv, scoring='neg_mean_squared_error', n_jobs=-1)

# fit the model and find the best parameters
grid_search.fit(X_train, y_train)
print("Best Parameters:", grid_search.best_params_)

# predict using the best model
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

# evaluation
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

print(f"Optimized MAE: {mae}")
print(f"Optimized MSE: {mse}")
print(f"Optimized RMSE: {rmse}")

# plot the predicted vs actual price against the trading date
dates = pd.to_datetime(stock_predict_actual['日期'])
plt.figure(figsize=(12, 6))
plt.plot(dates, y_test, label="Actual Price", color="blue")
plt.plot(dates, y_pred, label="Predicted Price", color="red")
plt.xlabel("Date")
plt.ylabel("Price")
plt.title("Optimized Gradient Boosting Regressor - Predicted vs Actual Price")
plt.legend()
plt.show()
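
The grid search above relies on `TimeSeriesSplit` so that each validation fold lies strictly after the data used for training. The sketch below illustrates that ordering on a toy array of 12 time-ordered observations; the names `X_toy`, `tscv_demo`, and `folds` are made up for this illustration and the data is not stock data.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_toy = np.arange(12).reshape(-1, 1)  # 12 observations in time order
tscv_demo = TimeSeriesSplit(n_splits=3)
folds = list(tscv_demo.split(X_toy))

for fold, (train_idx, val_idx) in enumerate(folds, start=1):
    # every training index precedes every validation index
    print(f"fold {fold}: train={train_idx.tolist()} validate={val_idx.tolist()}")
```

Unlike a shuffled K-fold split, no fold trains on observations that come after the ones it is validated on, which mirrors how the model would be used on future prices.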


# Question 2
Choose a suitable method (other than a neural network) from sklearn to train a
machine learning model on the MNIST dataset for handwritten digit classification.
Provide a brief explanation of your chosen method and why it is suitable for this task.

## Explanation of SVM
A Support Vector Machine (SVM) is a supervised machine learning algorithm used for both classification and regression tasks. In classification, SVM works by finding the hyperplane that best separates the data points of different classes. For linearly separable data, this hyperplane maximizes the margin between the two classes. For data that is not linearly separable, SVM uses the kernel trick to implicitly map the data into a higher-dimensional space, where a separating hyperplane is easier to find. For a multi-class problem such as the ten MNIST digits, scikit-learn's `SVC` combines binary classifiers using a one-vs-one scheme.
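
The effect of the kernel can be seen on a small synthetic problem. The following is a minimal sketch, assuming a toy dataset of two concentric circles (not MNIST), that compares a linear kernel with an RBF kernel; all variable names are chosen for this illustration only.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric circles: no straight line can separate the classes
X_c, y_c = make_circles(n_samples=400, noise=0.05, factor=0.4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_c, y_c, test_size=0.25, random_state=0)

linear_acc = SVC(kernel="linear").fit(X_tr, y_tr).score(X_te, y_te)
rbf_acc = SVC(kernel="rbf", gamma="scale").fit(X_tr, y_tr).score(X_te, y_te)

print(f"linear kernel accuracy: {linear_acc:.2f}")
print(f"RBF kernel accuracy:    {rbf_acc:.2f}")
```

The linear kernel is limited to a straight-line boundary in the original two dimensions, while the RBF kernel can separate the inner circle from the outer one.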

## Key Reasons Why SVM is Suitable for MNIST
1. Effective in High-Dimensional Spaces: MNIST consists of 28x28 pixel images, which translates to a 784-dimensional feature space when each pixel is treated as an individual feature. SVM is known for its effectiveness in handling high-dimensional data.

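2. Well-Suited for Small to Medium-Size Datasets: With 70,000 images, MNIST is small enough for a kernel SVM to train in a practical amount of time, although training cost grows faster than linearly with the number of samples. SVM also has few hyperparameters to tune (mainly `C` and `gamma` for the RBF kernel).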

3. Ability to Handle Non-Linear Boundaries: With a non-linear kernel (e.g., the RBF kernel), SVM can model the curved decision boundaries present in complex datasets like MNIST.

4. Robustness to Overfitting: By maximizing the margin between classes, SVM tends to generalize well, reducing the risk of overfitting.
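
As a quick check of these points before training on the full MNIST data, the same approach can be tried on scikit-learn's much smaller built-in digits dataset (8x8 images, so 64 features instead of 784). This is only a sketch on a stand-in dataset, and its accuracy says nothing definite about MNIST itself; the variable names are chosen for this illustration.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

digits = load_digits()
# Flatten each 8x8 image into a 64-dimensional feature vector,
# the same way a 28x28 MNIST image becomes 784 features
X_d = digits.images.reshape(len(digits.images), -1)
y_d = digits.target

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_d, y_d, test_size=0.2, random_state=42
)

# Scaling and the classifier share one pipeline so the scaler is fit on training data only
digit_clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="scale", C=1.0))
digit_clf.fit(Xd_train, yd_train)
digit_acc = digit_clf.score(Xd_test, yd_test)
print(f"Feature dimension: {X_d.shape[1]}, test accuracy: {digit_acc:.3f}")
```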

In [None]:
from sklearn.datasets import fetch_openml
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

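# Load the MNIST dataset as NumPy arrays (70,000 images, 784 pixel features each)
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X, y = mnist.data, mnist.target.astype(int)  # labels arrive as strings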

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the data to zero mean and unit variance
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Initialize the SVM model with RBF kernel
svm_model = SVC(kernel='rbf', gamma='scale', C=1.0)

# Train the SVM model (this may take several minutes on the 56,000 training images)
svm_model.fit(X_train, y_train)

# Predict on the test set
y_pred = svm_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy on the test set: {accuracy * 100:.2f}%")
print("\nClassification Report:\n", classification_report(y_test, y_pred))

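# Visualize the first 10 test images with their predicted labels
# (X_test holds standardized pixels, so contrast differs from the raw images)
plt.figure(figsize=(10, 5))
for i in range(10):
    plt.subplot(2, 5, i + 1)
    plt.imshow(X_test[i].reshape(28, 28), cmap='gray')
    plt.title(f"Prediction: {y_pred[i]}")
    plt.axis('off')
plt.show()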
