# Introduction
This project uses the RT-IoT2022 dataset from the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/dataset/942/rt-iot2022). The data is derived from a real-time Internet of Things (IoT) infrastructure and represents both normal network traffic and simulated attack scenarios.

There are 123,117 instances within this dataset with 83 different features. There are no missing values.

## Questions being asked:
1. What is the question you hope to answer? 
    - In light of advancements in AI, cybersecurity is needed more than ever. The question I hope to answer is: what attack patterns and indicators might a threat actor exhibit when launching an attack on a network?
2. What data are you planning to use to answer that question? 
    - The RT-IoT2022 dataset, which is real-world data collected to help train Intrusion Detection Systems (IDS).
3. What do you know about the data you're using so far? 
    - I recognize some of the features and attributes from my studies as an IT specialist, and I have an idea of how these features and attributes may relate to each other.
4. Why did you choose this topic? 
    - I am continuing to expand my studies in the IT field and am currently studying for the CompTIA A+ certification exam. Understanding and modeling this dataset may give me further insight into the inner workings of networks and security.

# Data Wrangling

Wrangling will consist of two parts: acquisition of the data, then preparation of the data.

In acquisition, I will download and unpack the data, then build functions to import the data into the wrangle notebook. In this step, I will also perform a brief analysis of the data.

In preparation, I will split the data into training, validation, and test sets. I will then conduct light, surface-level research into the variables to determine how relevant each one is to predicting that a cyber attack may be happening.

## Acquisition: *April 26*
- [X] Download data and check formatting
- [X] Import data into notebook and identify target variable
- [X] Create function to import the data
- Date of completion: April 26, 2024
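
The import function from the checklist above could look something like the following. This is a minimal sketch, not the project's actual code: the function name `acquire_data`, the default file name `RT_IOT2022.csv`, and the target column name `Attack_type` are all assumptions and should be checked against the unpacked download.

```python
from pathlib import Path

import pandas as pd


def acquire_data(csv_path="RT_IOT2022.csv", target="Attack_type"):
    """Read the unpacked CSV into a DataFrame and confirm the target exists.

    Both default values are assumptions about the download, not verified facts.
    """
    path = Path(csv_path)
    if not path.exists():
        raise FileNotFoundError(f"Expected the unpacked dataset at {path}")

    df = pd.read_csv(path)

    # Fail early if the assumed target column is not in the file.
    if target not in df.columns:
        raise KeyError(f"Target column {target!r} not found in {path.name}")
    return df
```

Calling `acquire_data()` from the wrangle notebook and then inspecting `df.shape` and `df.isna().sum()` is one quick way to confirm the instance count, feature count, and absence of missing values stated in the introduction.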

## Preparation: *April 27-29*
- [X] Split the data into train, validate, and test sets
- [X] Create (or recycle) a function to split the data
- [ ] Identify the variables and their roles in the training set
- [ ] Identify the correlation of the variables with the target variable
- [ ] Develop questions about the data (5-10 questions)
- [ ] Make predictions about the data
- [ ] Create a function that uses the import function, then cleans and prepares the data
- [ ] Export everything to wrangle.py
- Date of completion:
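
The train, validate, and test split listed above can be sketched with two calls to scikit-learn's `train_test_split`. The function name `split_data`, the 60/20/20 proportions, and the seed are illustrative assumptions; stratifying on the target is a suggestion for keeping attack classes proportionally represented in each set, not a decision recorded in this plan.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_data(df, target, val_size=0.2, test_size=0.2, seed=42):
    """Split a DataFrame into train, validate, and test sets.

    val_size and test_size are fractions of the full dataset.
    """
    # First carve off the test set.
    train_val, test = train_test_split(
        df, test_size=test_size, random_state=seed, stratify=df[target]
    )

    # Rescale so the validation set is val_size of the ORIGINAL data,
    # not of the remaining train_val portion.
    relative_val = val_size / (1 - test_size)
    train, validate = train_test_split(
        train_val,
        test_size=relative_val,
        random_state=seed,
        stratify=train_val[target],
    )
    return train, validate, test
```

Note that stratified splitting needs at least two rows per class in each split step, so very rare attack types may need to be grouped or handled separately.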

# Exploration And Pre-Processing
In these two steps, I will use the prepared data to answer the questions I developed during the preparation and planning stage. Having answered these questions, I will then move on to the pre-processing stage, during which I will use functions from scikit-learn as well as prior knowledge to build functions that encode the data for modeling.

The two steps will be carried out in two separate notebooks.

## Exploration: *April 30-May 7*
- [ ] Utilize the questions to perform targeted exploration
- [ ] Create 3-5 different graphs and statistical tests to answer questions
- [ ] Turn each question answered into a function
- [ ] Export each function into explore.py
- Date of completion:
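
As an illustration of turning one exploration question into a function, the sketch below runs a chi-square test of independence between a categorical feature and the target using `scipy.stats.chi2_contingency`. The function name, the 0.05 significance level, and the example column names are assumptions; the actual questions and tests will come out of the preparation stage.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def chi2_feature_vs_target(train, feature, target, alpha=0.05):
    """Test whether a categorical feature is independent of the target.

    Null hypothesis: the feature and the target are independent.
    """
    # Contingency table of observed counts for each category pair.
    observed = pd.crosstab(train[feature], train[target])
    chi2, p_value, dof, _expected = chi2_contingency(observed)
    return {
        "chi2": float(chi2),
        "p_value": float(p_value),
        "dof": int(dof),
        "reject_null": bool(p_value < alpha),
    }
```

A call such as `chi2_feature_vs_target(train, "proto", "Attack_type")` would then answer a question like "is the protocol related to the attack type?", assuming those column names exist in the training set. The test should be run on the training set only so that validation and test data stay unseen.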

## Pre-processing: *May 8-May 10*
- [ ] Process datasets for modeling
- [ ] Run analysis to determine potential irrelevant features
- [ ] Drop features judged unnecessary
- [ ] Encode datasets for modeling
- [ ] Create functions for pre-processing
- [ ] Add functions to model.py
- Date of completion:
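
One possible shape for the encoding step is a scikit-learn `ColumnTransformer` that one-hot encodes categorical columns and scales numeric ones. This is a sketch under assumptions: the function name `build_preprocessor` is hypothetical, the column lists are passed in by the caller, and the choice of `OneHotEncoder` plus `StandardScaler` is only one reasonable option.

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_preprocessor(categorical_cols, numeric_cols):
    """Build an unfitted transformer that encodes and scales the features."""
    return ColumnTransformer(
        transformers=[
            # Categories unseen during fitting are encoded as all zeros
            # instead of raising an error.
            ("categorical", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
            ("numeric", StandardScaler(), numeric_cols),
        ],
        # Always return a dense array, which is simpler to inspect.
        sparse_threshold=0,
    )
```

The transformer should be fitted on the training set only (`fit_transform(train)`) and then applied to the validation and test sets with `transform`, so that no information from those sets leaks into the encoding.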

# Modeling

## *May 11-12*
In this final step of analysis, I will go through a selection process to choose the algorithms best suited to handling the data. The process will include choosing 10 algorithms and testing them with small, simple configuration changes, after which I will select the best 3 and conduct more extensive hyperparameter testing.

- [ ] Make a selection of 10 algorithms to test (using simple algorithm configurations)
- [ ] Select 3 algorithms with best average performance and apply advanced algorithm configurations
- [ ] Use cross validation and pipeline to evaluate models
- [ ] Isolate the best model and use on test dataset
- [ ] Build functions for modeling 
- [ ] Add functions to model.py 
- Date of completion:
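
The first-pass comparison described above can be sketched with a scikit-learn `Pipeline` and `cross_val_score`. Everything named here is illustrative: `rank_models` and `CANDIDATES` are hypothetical names, only three of the planned ten algorithms are shown, and `balanced_accuracy` is an assumed metric chosen because attack classes may be unevenly represented.

```python
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Simple configurations for the first pass; the real list would hold 10.
CANDIDATES = {
    "baseline": DummyClassifier(strategy="most_frequent"),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=42),
}


def rank_models(X, y, candidates, cv=5, scoring="balanced_accuracy"):
    """Cross-validate each candidate and return results, best first."""
    results = []
    for name, estimator in candidates.items():
        # Scaling lives inside the pipeline so it is refit on each fold.
        pipe = Pipeline([("scale", StandardScaler()), ("model", estimator)])
        scores = cross_val_score(pipe, X, y, cv=cv, scoring=scoring)
        results.append((name, scores.mean(), scores.std()))
    return sorted(results, key=lambda row: row[1], reverse=True)
```

The top three names from `rank_models(X_train, y_train, CANDIDATES)` would then go on to more extensive hyperparameter testing, for example with `GridSearchCV`, and only the single best model would be evaluated once on the test set.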

# Delivery
- [ ] Build notebook presentation ***May 13***
- [ ] Prepare speech ***May 14***
- [ ] Practice presentation ***May 15***
- [ ] Conduct presentation ***May 16, 2024***