# COMMODITY RISK ASSESSMENT PREDICTOR AND RECOMMENDER (CRAPR)

CRAPR is a machine-learning model that addresses the lack of personalized commodity metals investment recommendations, particularly for investors seeking to diversify their portfolios. It provides foundational background on commodity metals (Gold, Silver and Copper) and on Risk Portraits. By answering a questionnaire, investors gain clearer insight into their risk appetite and investment goals, so that they can make better-informed investment decisions that suit their profile.

## Install and import required Python packages and libraries

Step 0: Install package

Step 1: Import Required Libraries

In [1]:
# import Python packages needed for loading data and Machine Learning
import pandas as pd
import pycaret
from pycaret.classification import * # Classification module from Pycaret

## Characteristics / Risk Profile / Suitability / Considerations for each Metal
The table below gives an overview of each metal, detailing key attributes that can help investors assess which metal best aligns with their risk tolerance and investment strategy.


| Metal  | Characteristics                                                                 | Risk Profile | Suitability                                        | Considerations                                                                 |
|--------|---------------------------------------------------------------------------------|--------------|---------------------------------------------------|--------------------------------------------------------------------------------|
| Gold   | - Safe-haven asset<br> - Low volatility<br> - Inflation hedge<br> - Highly liquid | Low Risk     | Conservative investors<br>Long-term wealth preservation | - Use as a hedge against economic uncertainty<br> - Maintain as a stable store of value<br> - Suitable for wealth preservation |
| Silver | - Dual demand: industrial and investment<br> - Higher volatility<br> - Correlates with economic cycles<br> - Relatively liquid | Moderate Risk | Moderately conservative to moderate investors<br>Seeking higher returns | - Balances safe-haven and industrial demand<br> - Higher potential returns with increased risk<br> - Suitable for portfolio diversification |
| Copper | - Primarily industrial use<br> - High volatility<br> - Economic indicator<br> - Good liquidity | High Risk    | Moderate to aggressive investors<br>Short to medium-term investments | - Capitalize on global economic growth<br> - Manage closely due to high volatility<br> - Suitable for those willing to take higher risks |


## Risk Portrait Overview

The table below describes each Risk Portrait to give investors deeper insight into risk-sensitive investing in terms of time horizon, growth, risk-return trade-off and ideal metal(s).

| **Risk Portrait**         | **Characteristics**                                                                                                                                                 | **Suitability**                                                         | **Investment Focus** |
|---------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------|-------------------------------|
| **Conservative**           | - Prioritizes capital preservation<br>- Low tolerance for risk<br>- Seeks stable, predictable returns<br>- Focus on long-term wealth protection                       | Investors seeking safety over growth<br>Focus on retirement or wealth protection | Gold                          |
| **Moderately Conservative**| - Some willingness to accept modest risk<br>- Prefers a mix of stability and slight growth<br>- Conservative, but seeking returns higher than inflation             | Investors looking for gradual growth while preserving capital           | Gold, Silver                   |
| **Moderate**               | - Balanced risk and return<br>- Seeks moderate growth over time<br>- Comfortable with occasional market fluctuations                                               | Investors aiming for steady long-term growth                            | Silver                         |
| **Moderately Aggressive**  | - Willing to accept higher risk for potentially higher returns<br>- Focus on growth with some downside protection<br>- Accepts volatility as a path to larger gains  | Investors looking for significant growth but not extreme risk            | Silver, Copper                 |
| **Aggressive**             | - High tolerance for risk<br>- Focuses on maximizing returns<br>- Accepts significant market volatility and possible losses                                         | Investors aiming for maximum returns in shorter time frames              | Copper                         |
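
As a minimal sketch, the "Investment Focus" column of the table above can be expressed as a lookup. The names `INVESTMENT_FOCUS` and `focus_metals` are illustrative assumptions, not part of the CRAPR code. The lookup ignores letter case because the dataset labels are written as "Moderately conservative" rather than "Moderately Conservative".

```python
# Investment focus per risk portrait, copied from the table above.
INVESTMENT_FOCUS = {
    "Conservative": ["Gold"],
    "Moderately Conservative": ["Gold", "Silver"],
    "Moderate": ["Silver"],
    "Moderately Aggressive": ["Silver", "Copper"],
    "Aggressive": ["Copper"],
}


def focus_metals(risk_portrait):
    """Return the ideal metal(s) for a risk portrait, ignoring letter case."""
    lookup = {name.lower(): metals for name, metals in INVESTMENT_FOCUS.items()}
    return lookup[risk_portrait.strip().lower()]


print(focus_metals("Moderately conservative"))  # ['Gold', 'Silver']
```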

## Dataset Analysis
## Prepare the Risk portrait dataset for machine learning

To generate a dataset that yields a risk portrait from the total points accumulated in the Questionnaire, we first assign a points range to each class, from the lowest range (Conservative) to the highest (Aggressive).

| Points  | Risk Profile              |
|---------|---------------------------|
| 09-20   | Conservative               |
| 21-31   | Moderately Conservative    |
| 32-41   | Moderate                   |
| 42-51   | Moderately Aggressive      |
| >=52    | Aggressive                 |
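
The points table above can be read as a simple range lookup. The following is a minimal sketch under that reading. The names `POINT_RANGES` and `risk_profile` are assumptions made for illustration, and totals below 9 are treated as invalid input because the table does not cover them.

```python
# Points ranges copied from the table above: (low, high, profile).
# None as the upper bound means "no upper limit" (the >=52 row).
POINT_RANGES = [
    (9, 20, "Conservative"),
    (21, 31, "Moderately Conservative"),
    (32, 41, "Moderate"),
    (42, 51, "Moderately Aggressive"),
    (52, None, "Aggressive"),
]


def risk_profile(total_points):
    """Map a questionnaire total to the risk profile of its points range."""
    for low, high, profile in POINT_RANGES:
        if total_points >= low and (high is None or total_points <= high):
            return profile
    raise ValueError(f"total_points={total_points} is below the lowest range")


print(risk_profile(35))  # Moderate
```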


Next, we generate a series of randomized answers for each question (columns "q1" to "q12"). Points are allocated accordingly in columns "q1-points" to "q12-points", with the lowest points assigned to the lowest-numbered choice and so on. The last column of the dataset, "risk_portrait", shows the risk portrait that results from the total points.

Each row represents one investor's answers to all 12 questions. We created 160 rows in total, producing risk portrait results for every class with the aim of building a balanced dataset for training and testing accuracy, which is covered later in the report.

Each question has 3 to 5 choices, and responses are recorded in columns "q1" to "q12" as numbers 1 to 5 accordingly.
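
The generation step described above might look roughly like the sketch below. It is not the script used to build "data-risk-portrait.csv". The scoring scheme (`POINTS_PER_CHOICE`, five choices per question, choice k worth k points) is a hypothetical simplification, and `make_row` and `label` are illustrative names. Only the column names and the points ranges come from this report.

```python
import random

N_QUESTIONS = 12
# HYPOTHETICAL scoring: every question has 5 choices and choice k is worth
# k points. The real questionnaire has 3-5 choices per question and its own
# points per choice.
POINTS_PER_CHOICE = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5}
# Lower bound of each points range from the table above, highest first.
THRESHOLDS = [
    (52, "Aggressive"),
    (42, "Moderately Aggressive"),
    (32, "Moderate"),
    (21, "Moderately Conservative"),
    (9, "Conservative"),
]


def label(total):
    """Return the risk portrait whose points range contains the total."""
    return next(name for low, name in THRESHOLDS if total >= low)


def make_row(rng):
    """Build one investor row: 12 answers, 12 point values and the label."""
    row, total = {}, 0
    for q in range(1, N_QUESTIONS + 1):
        choice = rng.randint(1, 5)
        points = POINTS_PER_CHOICE[choice]
        row[f"q{q}"] = choice
        row[f"q{q}-points"] = points
        total += points
    row["risk_portrait"] = label(total)
    return row


rng = random.Random(123)  # fixed seed so the rows are reproducible
rows = [make_row(rng) for _ in range(160)]
print(len(rows), len(rows[0]))  # 160 25
```

Passing `rows` to `pd.DataFrame(rows)` would give a table with the same 25 columns as the dataset. Note that uniformly random answers cluster around the middle points ranges, so a balanced dataset would need answers sampled separately for each target class.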


In [2]:
# Read csv file on the investor risk portrait data
df = pd.read_csv('data-risk-portrait.csv')
df

Unnamed: 0,q1,q1-points,q2,q2-points,q3,q3-points,q4,q4-points,q5,q5-points,...,q8-points,q9,q9-points,q10,q10-points,q11,q11-points,q12,q12-points,risk_portrait
0,2,3,3,5,3,5,3,5,4,4,...,5,4,5,4,5,3,5,4,5,Aggressive
1,3,5,3,5,3,5,3,5,5,5,...,5,3,3,3,3,2,3,4,5,Aggressive
2,2,3,3,5,3,5,3,5,4,4,...,5,4,5,4,5,3,5,4,5,Aggressive
3,3,5,3,5,3,5,3,5,5,5,...,5,3,3,3,3,2,3,4,5,Aggressive
4,2,3,3,5,3,5,3,5,4,4,...,5,4,5,4,5,3,5,4,5,Aggressive
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
155,1,1,1,1,2,3,2,3,2,2,...,1,2,1,2,1,1,1,2,1,Moderately conservative
156,1,1,1,1,2,3,2,3,2,2,...,1,2,1,2,1,1,1,2,1,Moderately conservative
157,1,1,1,1,2,3,2,3,2,2,...,1,2,1,2,1,1,1,2,1,Moderately conservative
158,1,1,1,1,2,3,2,3,2,2,...,1,2,1,2,1,1,1,2,1,Moderately conservative


Step 2: Load Data

## Exploratory Data Analysis (EDA)

### A) Welcome Page
Let's explore the Predictor Model, which uses the dataset to generate a recommended portfolio based on an investor's Questionnaire responses.

To start, we look at the "Welcome" page, which gives a brief introduction to what users can expect. It also includes a comparison summary table of each metal as a quick reference before starting.

Once ready, the user clicks the "Click here to Start" button to begin the Questionnaire.

![CRAPR welcome page](CRAPR_welcome_2.jpg)


![Comparison Summary](Comparison_summary.jpg)

### B1) Questionnaire format

As mentioned earlier, each question has 3 to 5 choices for users to select from.

There are 12 questions in total, each based on a specific topic (Time Horizon, Investment Knowledge, Budget and Risk Tolerance) that influences the user's investment strategy. Each choice carries a number of points: the first choice from the top has the lowest points, and the points increase down the list so that the last choice has the highest.

This ordering runs from lowest risk to highest risk and applies to all 12 questions. Referring to the points allocation table above, the lower the total points, the more conservative the investor. Users therefore select the choice that matches their risk preferences.


![Questionnaire](Questionnaire.jpg)

### B2) Questionnaire format (Last question)

Once users reach the final question, they finish the questionnaire by clicking the "Finish" button. The model then records the choices from all questions, calculates the total points, and predicts the expected risk portrait and the recommended portfolio.

![Questionnaire Final](Questionnaire_final.jpg)

### C) Portfolio Results

The results page shows the expected risk portrait for the total points calculated, along with the recommended portfolio, presented as a pie chart of the metal allocation. Let's look at some examples.

### C1) Portfolio breakdown of a Conservative Investor

![Your Ideal Portfolio](Portfolio%20breakdownv2%20(Conservative).jpg)

The chart above shows the allocation for a "Conservative" investor. Following the risk portrait table above, Gold is the investment focus and therefore holds the majority of the portfolio. It is followed by some Silver, which has higher volatility but is relatively liquid and carries low-to-medium risk. Copper holds the smallest percentage because of its high risk, so it does not significantly affect the portfolio.

This allocation is in line with the investor's risk profile, which calls for the least amount of risk.


### C2) Portfolio breakdown of a Moderately Aggressive Investor

![Your Ideal Portfolio](Portfolio_Moderately_Aggressive.jpg)

The next example is the breakdown for a "Moderately Aggressive" investor. As this profile is more accepting of risk, Copper holds the largest allocation because of its high risk-return ratio. Because the investor is looking for significant growth but not extreme risk, a decent Silver allocation, along with Gold, helps mitigate the risk exposure through its "safe-haven" features.

## Prepare Data for Machine learning

Now that we have covered the Predictor Model that uses the dataset to generate results, let's look at how the data is prepared before it is used for training and testing in machine learning.

This preparation is essential to ensure that the predictor model can accurately generate the required results from the dataset.

Machine learning is crucial because it enables us to create predictive models that automatically learn patterns from data and make decisions without explicit programming. By training a model on relevant data, we can predict outcomes that help in various scenarios, such as the classification task in this case.

Here, the machine learning model is trained to predict an individual's risk portrait. The process is as follows:

- The dataset is loaded from a CSV file (data-risk-portrait.csv), which contains the investor profiles and associated features described earlier when preparing the Risk portrait dataset.
  
- The target variable for prediction is risk_portrait, which categorizes investors based on their risk preferences or tolerance.

- A machine learning classification model is being used to predict this risk portrait using the PyCaret library's classification module, which simplifies model comparison, training, and testing.
  
- The process includes splitting the data into training and test sets, performing 5-fold cross-validation, and selecting the best model based on performance.
  
- After training, the best model is used to predict risk portraits for new data.

  
In summary, the prediction task focuses on classifying investors into different risk profiles, which can guide investment strategies or portfolio decisions based on their risk tolerance.
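
To make the split and the 5-fold cross-validation concrete, here is a minimal sketch of the row bookkeeping only. It assumes a 70/30 train-test split, which is inferred from the 112 and 48 rows that `setup` reports later in the notebook. The helpers `train_test_indices` and `k_folds` are illustrative names. PyCaret performs its own shuffling and stratifies by class, which this sketch does not attempt.

```python
import random


def train_test_indices(n_rows, train_fraction, seed):
    """Shuffle row indices and split them into train and test parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_train = round(n_rows * train_fraction)
    return indices[:n_train], indices[n_train:]


def k_folds(indices, k):
    """Split indices into k folds; the first len % k folds get one extra row."""
    base, extra = divmod(len(indices), k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(indices[start:start + size])
        start += size
    return folds


train_idx, test_idx = train_test_indices(160, 0.7, seed=123)
folds = k_folds(train_idx, 5)
print(len(train_idx), len(test_idx), [len(f) for f in folds])
# 112 48 [23, 23, 22, 22, 22]
```

In each of the 5 rounds, one fold is held out for validation and the other four are used for training, so every training row is validated exactly once.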

Step 3: Prepare Data for ML (Preprocessing)

In [3]:
# Setup the data including train-test split and missing value imputation
# data = dataframe, target = "Prediction column"
# Set up the machine learning environment with the given dataset 'df', targeting 'risk_portrait' for prediction, using 5-fold cross-validation and a random seed for reproducibility
setup(data = df, target = 'risk_portrait', session_id = 123, fold=5)

Unnamed: 0,Description,Value
0,Session id,123
1,Target,risk_portrait
2,Target type,Multiclass
3,Target mapping,"Aggressive: 0, Conservative: 1, Moderate: 2, Moderately aggressive: 3, Moderately conservative: 4"
4,Original data shape,"(160, 25)"
5,Transformed data shape,"(160, 25)"
6,Transformed train set shape,"(112, 25)"
7,Transformed test set shape,"(48, 25)"
8,Numeric features,24
9,Preprocess,True


<pycaret.classification.oop.ClassificationExperiment at 0x1a80245a410>

## Train Data to show results of classification models via data set

After preparing the data, we use PyCaret's "compare_models" function to train multiple classification models and compare their performance on accuracy and other metrics. Passing errors="raise" makes any model that fails raise an exception instead of being skipped silently. After 5-fold cross-validation, the best-performing model is selected and stored in the variable "best_model".

Step 4:  Apply Machine Learning

In [4]:
# Train classification models on the data set
# Show the Training results
# Store the best performing model
best_model = compare_models(errors="raise", fold=5)

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lda,Linear Discriminant Analysis,0.883,0.0,0.883,0.8787,0.8746,0.8448,0.8508,0.008
nb,Naive Bayes,0.8206,0.9516,0.8206,0.8383,0.8167,0.7561,0.7621,0.464
lr,Logistic Regression,0.8198,0.0,0.8198,0.8265,0.817,0.7621,0.765,0.788
rf,Random Forest Classifier,0.8032,0.9151,0.8032,0.8278,0.7889,0.7259,0.7401,0.038
lightgbm,Light Gradient Boosting Machine,0.7949,0.918,0.7949,0.8198,0.791,0.725,0.733,0.084
et,Extra Trees Classifier,0.7941,0.9105,0.7941,0.8278,0.7808,0.7113,0.7291,0.034
dt,Decision Tree Classifier,0.7415,0.8248,0.7415,0.7652,0.7375,0.6606,0.6684,0.482
gbc,Gradient Boosting Classifier,0.7229,0.0,0.7229,0.7361,0.7197,0.634,0.6391,0.072
knn,K Neighbors Classifier,0.6953,0.8921,0.6953,0.7026,0.6845,0.5956,0.6015,0.486
ridge,Ridge Classifier,0.6605,0.0,0.6605,0.6712,0.6485,0.5535,0.5646,0.012



## Test Data to show best model.
## Best model: Linear Discriminant Analysis

After identifying the best model, we test it on the held-out test set (the 48 rows not used for training) using "predict_model". The predictions are stored in a variable called "predictions" and returned as a DataFrame, which includes the predicted risk profile for each test case.

Using the model comparison table generated in the training step, we take the top 3 models and compare them on train and test accuracy, accuracy difference and training run time. Here "train accuracy" is the mean 5-fold cross-validated accuracy on the training set. Based on the comparison table below, Linear Discriminant Analysis is the best-performing model overall:

- Highest train accuracy of 88.3%, meaning it fits the patterns in the training data better than Naive Bayes and Logistic Regression.
- Small train-test accuracy difference of 0.03, suggesting that the model is not overfitting and performs consistently on both the training and testing data. Only Naive Bayes has a smaller difference (0.01), but with lower accuracy on both sets. Logistic Regression scores the highest test accuracy (0.8750) but has the largest difference (0.06).
- Fastest training run time of 8 ms, making it more computationally efficient than the other two models.


| **Model**                     | **Train Accuracy** | **Test Accuracy** | **Accuracy Difference** | **Train Run Time [Sec]** |
|-------------------------------|--------------------|-------------------|-------------------------|--------------------------|
| Linear Discriminant Analysis   | **0.8830**         | 0.8542            | 0.03                    | **0.0080**                |
| Naive Bayes                    | 0.8206             | 0.8125            | 0.01                    | 0.4380                   |
| Logistic Regression            | 0.8198             | **0.8750**        | 0.06                    | 0.6560                   |
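
The "Accuracy Difference" column can be checked with a few lines of arithmetic. This sketch assumes the column is the absolute difference between train and test accuracy, rounded to two decimals. The name `accuracy_gap` is illustrative.

```python
# (train accuracy, test accuracy) copied from the comparison table above.
results = {
    "Linear Discriminant Analysis": (0.8830, 0.8542),
    "Naive Bayes": (0.8206, 0.8125),
    "Logistic Regression": (0.8198, 0.8750),
}


def accuracy_gap(train_acc, test_acc):
    """Absolute train-test accuracy difference, rounded to two decimals."""
    return round(abs(train_acc - test_acc), 2)


gaps = {model: accuracy_gap(*scores) for model, scores in results.items()}
for model, gap in gaps.items():
    print(f"{model}: {gap:.2f}")
# Linear Discriminant Analysis: 0.03
# Naive Bayes: 0.01
# Logistic Regression: 0.06
```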

In [5]:
# Show the Test results
# Show the predictions in a dataframe
# Test the Best model
predictions = predict_model(best_model)

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Linear Discriminant Analysis,0.8542,0.9707,0.8542,0.858,0.8491,0.8021,0.8066


In [9]:
# Test the 2nd best model (Naive Bayes) on the hold-out test set
nb = create_model("nb")
predictions = predict_model(nb)

Fold,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.8696,0.9703,0.8696,0.8962,0.8618,0.8222,0.8334
1,0.8696,0.9854,0.8696,0.8783,0.8706,0.824,0.8262
2,0.7727,0.9549,0.7727,0.7826,0.7677,0.6901,0.6942
3,0.8182,0.938,0.8182,0.8278,0.813,0.7479,0.755
4,0.7727,0.9095,0.7727,0.8068,0.7706,0.6961,0.702
Mean,0.8206,0.9516,0.8206,0.8383,0.8167,0.7561,0.7621
Std,0.0433,0.0263,0.0433,0.0428,0.0436,0.0583,0.0591


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Naive Bayes,0.8125,0.9132,0.8125,0.849,0.8001,0.7388,0.7585
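
The "Mean" and "Std" rows of the Naive Bayes fold table above summarize the five per-fold scores. As a minimal sketch, they can be recomputed from the displayed accuracies. Because the fold values are already rounded to four decimals, the result agrees with the table only up to rounding. The population form of the standard deviation (dividing by the number of folds) is the one that comes out at about 0.043.

```python
from statistics import mean, pstdev

# Per-fold accuracy of Naive Bayes, copied from the fold table above.
fold_accuracy = [0.8696, 0.8696, 0.7727, 0.8182, 0.7727]

cv_mean = mean(fold_accuracy)
cv_std = pstdev(fold_accuracy)  # population standard deviation (divide by n)

print(f"mean={cv_mean:.4f} std={cv_std:.3f}")  # mean=0.8206 std=0.043
```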


In [10]:
# Test the 3rd best model (Logistic Regression) on the hold-out test set
lr = create_model("lr")
predictions = predict_model(lr)

Fold,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.913,0.0,0.913,0.913,0.913,0.8838,0.8838
1,0.913,0.0,0.913,0.9283,0.9105,0.8827,0.8873
2,0.9545,0.0,0.9545,0.9659,0.9545,0.9391,0.9417
3,0.6818,0.0,0.6818,0.6875,0.6822,0.5792,0.5809
4,0.6364,0.0,0.6364,0.6377,0.6248,0.5256,0.5314
Mean,0.8198,0.0,0.8198,0.8265,0.817,0.7621,0.765
Std,0.1328,0.0,0.1328,0.1358,0.1356,0.1732,0.1725


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Logistic Regression,0.875,0.9682,0.875,0.905,0.8627,0.8271,0.8442


## Save and export best model and predictions in a .pkl file

After training and selecting the best-performing model, we save it to a .pkl file so that it can be reused without repeating the preprocessing and training steps.

The Predictor Model then uses this file to predict the risk portrait from a user's questionnaire responses and to generate the portfolio recommendation.


![Screenshot](Screenshot.jpg)

Step 5: Export Model and predictions

In [8]:
# Save the trained model 'best_model' to a file named "model" for later use (the model can be loaded back when needed)
# save_model(best_model, "name of file")
# Export file as model.pkl
save_model(best_model, "model")


Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('label_encoding',
                  TransformerWrapperWithInverse(exclude=None, include=None,
                                                transformer=LabelEncoder())),
                 ('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['q1', 'q1-points', 'q2',
                                              'q2-points', 'q3', 'q3-points',
                                              'q4', 'q4-points', 'q5',
                                              'q5-points', 'q6', 'q6-points',
                                              'q7', 'q7-points', 'q8',
                                              'q8-points', 'q9'...
                                                               missing_values=nan,
                                                               strategy='most_frequent'))),
                 ('clean_column_names',
                  Transformer

# Conclusions

Overall, the CRAPR model was successfully deployed with Streamlit through the "app.py" file. Users can now use it to begin their commodity metals investment journey.

![deployment](deployment.jpg)

## Improvements

Currently, the dataset used for machine learning is somewhat imbalanced: some risk portrait classes contain a higher percentage of users than the rest.

Future improvements can therefore be made to the "data-risk-portrait" CSV file by increasing the number of rows and balancing them across all risk classes, to improve the model's predictions.

![output](output.png)
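
A quick way to quantify the imbalance is to compute the share of rows per class. The sketch below uses a small hypothetical label list purely for illustration, not the real dataset, and `class_shares` is an assumed helper name.

```python
from collections import Counter


def class_shares(labels):
    """Return the fraction of rows per class, largest class first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.most_common()}


# HYPOTHETICAL labels for illustration only, not the real dataset.
sample = ["Aggressive"] * 5 + ["Moderate"] * 3 + ["Conservative"] * 2
for name, share in class_shares(sample).items():
    print(f"{name}: {share:.0%}")
# Aggressive: 50%
# Moderate: 30%
# Conservative: 20%
```

With the real data, the same figures would come from passing the "risk_portrait" column to this helper, or from `df["risk_portrait"].value_counts(normalize=True)` in pandas.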

### END