# Power Up Research Software Development with GitHub Copilot


In this notebook, you will have the creative freedom to work with any dataset of your interest. Below are some sources for datasets that may be fun to work with.

- [RODA](https://registry.opendata.aws/) - The Registry of Open Data on AWS (RODA) makes it easy for people to find datasets that are publicly available through AWS.

- [UCI Machine Learning Repository](https://archive.ics.uci.edu/datasets) - The UCI Machine Learning Repository is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms.

- [scikit-learn](https://scikit-learn.org/stable/datasets.html) - Scikit-learn is a popular machine learning library in Python. It provides various datasets for practice and experimentation, often used in tutorials and examples to demonstrate machine learning algorithms and techniques.

### 1.0 Set-up

In [None]:
import pandas as pd
import numpy as np

In [None]:
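pd.set_option("display.max_columns", None)  # None shows every column; use a cap such as 1000 to limit output
pd.set_option("display.max_rows", None)  # None shows every row; use a cap such as 1000 to limit output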

In [None]:
!aws s3 ls --no-sign-request s3://cse-cic-ids2018/

In [None]:
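import subprocess

# List the processed CSV files; --no-sign-request allows anonymous access to the public bucket.
# Passing the command as a list avoids shell quoting issues with the spaces in the prefix.
output = subprocess.check_output(
    ["aws", "s3", "ls", "--no-sign-request", "s3://cse-cic-ids2018/Processed Traffic Data for ML Algorithms/"],
    text=True,
)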

# Split the output by newline character
lines = output.split('\n')

# Iterate over each line and print the file size in GB
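for line in lines:
    if line:
        # Each object row is "<date> <time> <size in bytes> <name>";
        # maxsplit=3 keeps names that contain spaces in one piece
        file_info = line.split(maxsplit=3)
        if len(file_info) == 4:  # Skip rows without a size and name, such as "PRE <prefix>/" rows
            file_size = int(file_info[2])
            file_size_gb = file_size / (1024**3)  # Convert bytes to GB (1 GB taken as 1024**3 bytes)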
            print(f"File: {file_info[3]}, Size: {file_size_gb:.2f} GB")
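
The parsing loop above can also be wrapped in a small reusable function, which makes it easier to test without calling the AWS CLI. The sketch below is one possible shape, not part of the original notebook: the name `parse_s3_listing` is hypothetical, and the sample listing is made-up text that only imitates the `aws s3 ls` row layout.

```python
def parse_s3_listing(listing):
    """Return (file name, size in GB) pairs from `aws s3 ls` text output."""
    results = []
    for line in listing.splitlines():
        # Object rows look like "<date> <time> <size in bytes> <name>".
        # maxsplit=3 keeps names that contain spaces in one piece.
        parts = line.split(maxsplit=3)
        if len(parts) < 4 or not parts[2].isdigit():
            continue  # skip blank lines and "PRE <prefix>/" rows
        results.append((parts[3], int(parts[2]) / 1024**3))
    return results


# Made-up sample text for illustration only (not real bucket contents).
sample = (
    "                           PRE subfolder/\n"
    "2018-10-11 12:00:00 1073741824 example-a.csv\n"
    "2018-10-11 12:05:00  536870912 example b.csv\n"
)

for name, size_gb in parse_s3_listing(sample):
    print(f"File: {name}, Size: {size_gb:.2f} GB")
```

In the notebook, the same function could be called on the `output` string returned by `subprocess.check_output`.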


In [None]:
import io
import boto3
from botocore.config import Config
from botocore import UNSIGNED


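# UNSIGNED skips request signing, so no AWS credentials are needed for this public bucket
client = boto3.client('s3', config=Config(signature_version=UNSIGNED))
cyber_bucket = 'cse-cic-ids2018'
cyber_prefix = 'Processed Traffic Data for ML Algorithms'
cyber_key = cyber_prefix + '/' + 'Thursday-15-02-2018_TrafficForML_CICFlowMeter.csv'

# Download one day's processed traffic file; the whole object is read into memory
obj = client.get_object(Bucket=cyber_bucket, Key=cyber_key)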
df = pd.read_csv(io.BytesIO(obj['Body'].read()), encoding='utf8')
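
Because the cell above reads the whole object into memory, it can help to look at just a few rows first. The sketch below is a minimal illustration under stated assumptions: `preview_csv` is a hypothetical helper, and an in-memory `io.BytesIO` with made-up columns stands in for the S3 response body. It relies on `pd.read_csv` accepting a file-like object with a `read()` method and on its `nrows` parameter; passing `obj['Body']` directly in place of the stand-in is assumed to work the same way but is not verified here.

```python
import io

import pandas as pd


def preview_csv(stream, nrows=5):
    """Read only the first `nrows` data rows from a binary file-like object."""
    return pd.read_csv(stream, nrows=nrows, encoding="utf8")


# Stand-in for the S3 response body: a tiny CSV with made-up columns.
fake_body = io.BytesIO(b"col_a,col_b\n1,x\n2,y\n3,z\n")

preview = preview_csv(fake_body, nrows=2)
print(preview)
```

Checking column names and dtypes on a small preview is a cheap way to confirm the file looks as expected before loading all of it.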

### 2.0 Data analysis

#### 2.1 Data exploration

#### 2.2 Data processing

#### 2.3 Data visualization

#### 2.4 Additional analysis

### 3.0 Data modelling