# CICDDoS2019 Dataset Analysis
## Phase 2: Data Exploration and Preprocessing

This notebook explores and preprocesses the CICDDoS2019 dataset for our federated learning DDoS detection project.

### Objectives:
1. Load and explore the CICDDoS2019 dataset
2. Perform data cleaning and preprocessing
3. Analyze features and their relevance for DDoS detection
4. Visualize data distributions and attack patterns
5. Prepare data for federated learning simulation

---

In [None]:
# Import Required Libraries
import sys
import os
sys.path.append('../src')

# Data manipulation and analysis
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Machine learning and preprocessing
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.feature_selection import SelectKBest, f_classif

# Custom modules
from data.data_loader import CICDDoS2019Loader
from data.preprocessing import DataPreprocessor
from data.federated_split import FederatedDataDistributor

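# Suppress all library warnings to keep notebook output readable
# (remove this filter when debugging, since it also hides deprecation notices)
import warnings
warnings.filterwarnings('ignore')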

# Set plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("✅ All libraries imported successfully!")

## 1. Load CICDDoS2019 Dataset

Let's start by loading the dataset and exploring its basic structure.

In [None]:
# Initialize data loader
loader = CICDDoS2019Loader(data_path="../data/raw/CSV-01-12/01-12")

# Get dataset information
print("📊 CICDDoS2019 Dataset Information")
print("=" * 50)

info = loader.get_dataset_info()
print(f"Total files: {info['total_files']}")
print(f"Attack types: {len(info['attack_types'])}")
print("\nFile sizes:")
total_records = 0
for filename, size in info['file_sizes'].items():
    print(f"  {filename}: {size:,} records")
    total_records += size

print(f"\nTotal records across all files: {total_records:,}")

# Load a sample for initial exploration (1000 records per file to start)
print("\n🔄 Loading sample data...")
sample_df = loader.load_all_data(sample_size=1000)
print(f"Sample dataset loaded: {sample_df.shape}")
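
The internals of `CICDDoS2019Loader.load_all_data` are not shown in this notebook. As a rough illustration of what per-file sampling can look like, here is a minimal sketch using plain pandas. The helper name `load_sample_per_file`, its parameters, and the added `source_file` column are all hypothetical, and the header clean-up assumes the raw CSVs may carry stray whitespace in column names.

```python
from pathlib import Path

import pandas as pd


def load_sample_per_file(data_dir, sample_size=1000, seed=42):
    """Hypothetical helper: random sample of up to `sample_size` rows per CSV."""
    frames = []
    for csv_path in sorted(Path(data_dir).glob("*.csv")):
        # Reads the whole file before sampling; simple, but memory-hungry for large CSVs.
        df = pd.read_csv(csv_path, low_memory=False)
        # Strip stray whitespace from header names (assumed to occur in raw exports).
        df.columns = df.columns.str.strip()
        if len(df) > sample_size:
            df = df.sample(n=sample_size, random_state=seed)
        # Record provenance so per-file label distributions can be checked later.
        df["source_file"] = csv_path.name
        frames.append(df)
    if not frames:
        raise FileNotFoundError(f"No CSV files found in {data_dir}")
    return pd.concat(frames, ignore_index=True)
```

Passing `nrows=sample_size` to `pd.read_csv` would be much cheaper, but it takes the first rows of each file rather than a random sample, so the label mix may not reflect the full file.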

## 2. Explore Dataset Structure

Let's examine the dataset structure in detail.

In [None]:
# Basic dataset information
print("🔍 Dataset Structure Analysis")
print("=" * 50)
print(f"Shape: {sample_df.shape}")
print(f"Columns: {len(sample_df.columns)}")
print(f"Memory usage: {sample_df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

print("\n📋 Column Information:")
print("-" * 30)
for i, col in enumerate(sample_df.columns, start=1):
    print(f"{i:2d}. {col}")

print("\n📊 Data Types:")
print("-" * 20)
print(sample_df.dtypes.value_counts())

print("\n🎯 Target Variable Distribution:")
print("-" * 30)
print("Label distribution:")
print(sample_df['Label'].value_counts())
print("\nBinary label distribution:")
print(sample_df['Binary_Label'].value_counts())
# The mean of a 0/1 column is the fraction of 1s (assumes Binary_Label encodes attack as 1)
print(f"Attack ratio: {sample_df['Binary_Label'].mean():.2%}")

print("\n🔍 Missing Values:")
print("-" * 20)
missing_data = sample_df.isnull().sum()
missing_percent = 100 * missing_data / len(sample_df)
missing_df = pd.DataFrame({
    'Missing Count': missing_data,
    'Percentage': missing_percent
}).sort_values('Missing Count', ascending=False)

# Show only columns with missing values
if missing_df['Missing Count'].sum() > 0:
    print(missing_df[missing_df['Missing Count'] > 0])
else:
    print("✅ No missing values found!")
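
One caveat on the check above: `isnull()` flags only NaN, so infinite values pass through uncounted. Rate-style flow features can become infinite when a flow's duration is zero, so it may be worth counting them too. The sketch below is one possible way to do that; the helper name `summarize_non_finite` and the demo column names are illustrative assumptions, and whether `sample_df` actually contains infinities is not established here.

```python
import numpy as np
import pandas as pd


def summarize_non_finite(df: pd.DataFrame) -> pd.DataFrame:
    """Count NaN and +/-inf values per numeric column; keep only affected columns."""
    numeric = df.select_dtypes(include=[np.number])
    # Cast to float so integer columns can be tested with np.isinf uniformly.
    values = numeric.to_numpy(dtype="float64")
    report = pd.DataFrame({
        "NaN Count": numeric.isna().sum(),
        "Inf Count": pd.Series(np.isinf(values).sum(axis=0), index=numeric.columns),
    })
    return report[(report["NaN Count"] > 0) | (report["Inf Count"] > 0)]


# Small synthetic frame for demonstration (not real dataset values)
demo = pd.DataFrame({
    "Flow Bytes/s": [1.0, np.inf, np.nan],
    "Flow Packets/s": [2.0, 3.0, -np.inf],
    "Total Fwd Packets": [1, 2, 3],
})
report = summarize_non_finite(demo)
print(report)
```

In the notebook this could be called as `summarize_non_finite(sample_df)`. A common follow-up is to replace infinities with NaN via `df.replace([np.inf, -np.inf], np.nan)` and then handle both together, though the right treatment depends on the preprocessing choices made later.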