# The General Process
The task is to predict which types of vulnerabilities Nation State Threat Actors (NSTAs) are most likely to exploit. The general process is outlined below:
1. Data Collection: Define the problem and find data that matches our needs.
2. Data Preparation: Clean the dataset by removing duplicates and null values, correcting errors, handling type conversions, merging and filtering attributes and observations, and shuffling records to remove any ordering bias introduced while gathering and processing the data. This step also includes general pattern observation through Exploratory Data Analysis in order to acquaint ourselves with the data's character. Finally, we split the data into training and test sets.
3. Model Selection: Choose a model that will best handle the scope and particularities of the problem to solve and the data we've collected.
4. Model Training: Use our chosen algorithm to train the model on the training data.
5. Model Evaluation: Develop metrics to judge the objective performance of the model.
6. Hyperparameter Tuning: Fine-tune the hyperparameters given to the model to improve predictive accuracy. This includes things like the number of training steps, learning rate, initialization values and distribution, etc.
7. Make Predictions: Use test data for which labels are known to gauge the predictive ability of the model.
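
As a rough illustration of the Data Preparation step, the sketch below deduplicates records, drops rows with missing values, shuffles them, and splits them into training and test sets using only the Python standard library. The record layout (dictionaries with `cve_id`, `cvss`, and `nsta_exploited` keys), the function name `prepare_and_split`, and the toy records are all assumptions made for this example, not part of any real dataset.

```python
import random


def prepare_and_split(records, test_fraction=0.2, seed=42):
    """Deduplicate, drop incomplete rows, shuffle, and split records.

    `records` is a list of dicts; the keys used here are hypothetical.
    Returns a (train, test) tuple of lists.
    """
    seen = set()
    cleaned = []
    for row in records:
        # Drop rows that contain any null value.
        if any(value is None for value in row.values()):
            continue
        # Drop duplicates, keyed on the CVE ID.
        if row["cve_id"] in seen:
            continue
        seen.add(row["cve_id"])
        cleaned.append(row)

    # Shuffle with a fixed seed so the split is reproducible.
    rng = random.Random(seed)
    rng.shuffle(cleaned)

    n_test = int(len(cleaned) * test_fraction)
    return cleaned[n_test:], cleaned[:n_test]


# Toy records invented for illustration only.
records = [
    {"cve_id": "CVE-0000-0001", "cvss": 9.8, "nsta_exploited": 1},
    {"cve_id": "CVE-0000-0001", "cvss": 9.8, "nsta_exploited": 1},  # duplicate
    {"cve_id": "CVE-0000-0002", "cvss": None, "nsta_exploited": 0},  # null value
    {"cve_id": "CVE-0000-0003", "cvss": 5.3, "nsta_exploited": 0},
    {"cve_id": "CVE-0000-0004", "cvss": 7.5, "nsta_exploited": 1},
    {"cve_id": "CVE-0000-0005", "cvss": 4.3, "nsta_exploited": 0},
    {"cve_id": "CVE-0000-0006", "cvss": 8.1, "nsta_exploited": 0},
]
train, test = prepare_and_split(records)
print(len(train), len(test))
```

In a real project a library such as pandas or scikit-learn would likely handle these steps; the point here is only the order of operations.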


## How to investigate CVEs in IoT devices exploited by nation-state threat actors
### Data Collection and Preprocessing
Good data is the foundation of the project. Sources like the National Vulnerability Database (NVD) or MITRE's CVE database catalog CVEs and provide metadata such as severity scores, descriptions, and **affected products**. Using MITRE's ATT&CK framework, we can associate technique IDs (e.g. `T1078`) with CVEs, and we can narrow the dataset to IoT by searching descriptions for terms like "IoT," "embedded device," "smart device," "sensor," "connected device," etc. **Product tags and vendors** may prove useful here.
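
A minimal sketch of the keyword-based filtering described above might look like the following. The keyword list comes from the terms in the paragraph; the function name `is_iot_related` and the sample descriptions are hypothetical, and a real filter would probably also use product tags and vendor fields.

```python
import re

# Keyword list taken from the terms mentioned above; extend as needed.
IOT_KEYWORDS = ["IoT", "embedded device", "smart device", "sensor", "connected device"]

# Word boundaries keep "IoT" from matching inside words such as "patriot".
# The optional trailing "s" also catches simple plurals like "sensors".
IOT_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in IOT_KEYWORDS) + r")s?\b",
    re.IGNORECASE,
)


def is_iot_related(description):
    """Return True if a CVE description mentions any IoT keyword."""
    return IOT_PATTERN.search(description) is not None


# Descriptions invented for illustration only.
descriptions = {
    "CVE-0000-0001": "Buffer overflow in the firmware of a smart device hub.",
    "CVE-0000-0002": "SQL injection in a web-based payroll application.",
    "CVE-0000-0003": "Hard-coded credentials in an industrial temperature sensor.",
}
iot_cves = [cve for cve, text in descriptions.items() if is_iot_related(text)]
print(iot_cves)
```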

Threat intelligence feeds like those from Recorded Future, Mandiant, or ThreatConnect can provide context on **nation-state actors linked to specific CVEs**. Natural Language Processing (NLP) may be necessary to extract meaningful features with which to build a dataset.

We could organize the data we find into a table like the following:

| CVE ID | CVE Description | CWE Code | Affected IoT Products | TTP Correlation (e.g. `T1078`) | CVSS Score | Nation-State Attribution | Publication Date | Exploitation Date | Associated Threat Actor |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
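
One way to carry the columns above into code is a small record type. The sketch below is an assumption about how the fields might be typed, not a fixed schema; the class name `CveRecord`, the helper `days_to_exploitation`, and all sample values are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class CveRecord:
    """One row of the table above; field types are assumptions."""
    cve_id: str
    description: str
    cwe_code: Optional[str] = None
    affected_iot_products: List[str] = field(default_factory=list)
    ttp_ids: List[str] = field(default_factory=list)  # e.g. ["T1078"]
    cvss_score: Optional[float] = None
    nation_state_attribution: Optional[str] = None
    publication_date: Optional[date] = None
    exploitation_date: Optional[date] = None
    associated_threat_actor: Optional[str] = None

    def days_to_exploitation(self) -> Optional[int]:
        """Days between publication and first known exploitation, if both are set."""
        if self.publication_date is None or self.exploitation_date is None:
            return None
        return (self.exploitation_date - self.publication_date).days


# Placeholder values invented for illustration only.
record = CveRecord(
    cve_id="CVE-0000-0001",
    description="Hard-coded credentials in a smart device hub.",
    cwe_code="CWE-798",
    affected_iot_products=["Example Hub"],
    ttp_ids=["T1078"],
    cvss_score=9.8,
    nation_state_attribution="Example Country",
    publication_date=date(2024, 1, 10),
    exploitation_date=date(2024, 2, 9),
    associated_threat_actor="Example Group",
)
print(record.days_to_exploitation())
```

Keeping both dates on the record makes it easy to derive a time-to-exploitation value later, should that prove useful as a feature.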

### Feature Engineering
We could focus on CVE severity scores (such as CVSS), authentication requirements, or build features based on known tactics, techniques, and procedures (TTPs) of specific nation-state actors. This type of data can be found in the MITRE ATT&CK framework. We also need features based on the specific vulnerabilities of IoT devices and their firmware versions and network configurations. We could also look at the frequency of similar attacks in the past.
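
As a hedged example of severity-based features, the sketch below turns a CVSS v3.x vector string into a few binary features. It uses the Privileges Required (`PR`) metric as a stand-in for authentication requirements; the function names and the choice of features are assumptions for illustration.

```python
def parse_cvss_vector(vector):
    """Split a CVSS v3.x vector string into a metric -> value dict."""
    parts = vector.split("/")
    if not parts[0].startswith("CVSS:3"):
        raise ValueError(f"unsupported vector: {vector!r}")
    return dict(part.split(":", 1) for part in parts[1:])


def cvss_features(vector):
    """Derive binary features from the exploitability metrics of a vector."""
    metrics = parse_cvss_vector(vector)
    return {
        "network_reachable": int(metrics["AV"] == "N"),       # Attack Vector: Network
        "low_complexity": int(metrics["AC"] == "L"),          # Attack Complexity: Low
        "no_privileges_required": int(metrics["PR"] == "N"),  # Privileges Required: None
        "no_user_interaction": int(metrics["UI"] == "N"),     # User Interaction: None
    }


features = cvss_features("CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H")
print(features)
```

Features like these could sit alongside the numeric CVSS score, TTP indicators, and IoT-specific attributes in the final feature matrix.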

### Machine Learning Techniques
- **Supervised learning** could help classify whether a CVE is likely to be exploited by a nation-state actor. Models like random forest, gradient boosting, or support vector machines can be trained on labeled data where the outcome of nation-state involvement is known.
- **Text classification**: Leveraging models like BERT or LSTM to classify threat reports and CVE descriptions based on text data could help predict the likelihood of nation-state involvement.
- **Unsupervised learning** on network traffic or IoT device behavior data, using anomaly detection algorithms like isolation forest or DBSCAN, could help identify unusual patterns that might suggest nation-state actor involvement.
- We could use **time-series analysis** to detect unusual temporal patterns in CVE exploitation, which might correlate with known nation-state activities.
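
To make the time-series idea concrete, here is a very simple sketch that flags months whose exploitation count is far above the recent average, using a rolling z-score. This is a deliberately basic stand-in for fuller time-series methods; the function name `flag_spikes`, the window and threshold defaults, and the monthly counts are all assumptions for illustration.

```python
from statistics import mean, stdev


def flag_spikes(counts, window=6, threshold=3.0):
    """Return indices whose count is more than `threshold` standard
    deviations above the mean of the preceding `window` values."""
    flagged = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # A flat history has no spread to compare against; skip it.
            continue
        if (counts[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Monthly exploitation counts invented for illustration only.
monthly_exploits = [2, 3, 2, 4, 3, 2, 3, 15, 3, 2]
print(flag_spikes(monthly_exploits))
```

Flagged months could then be compared against dates of known nation-state campaigns to see whether the spikes line up.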
