# CICIDS2017 Data Selection and Exploratory Analysis

This notebook performs comprehensive data selection, exploration, and visualization of the CICIDS2017 dataset.

## Objectives:
1. Load and compare both dataset versions (GeneratedLabelledFlows vs MachineLearningCSV)
2. Understand the structure and characteristics of each dataset
3. Perform exploratory data analysis with visualizations
4. Analyze data quality and class distribution
5. Make an informed decision on which dataset to use for ML modeling

## 1. Setup and Imports

In [9]:
# Install required packages if running on Google Colab
import sys
if 'google.colab' in sys.modules:
    print("Running on Google Colab - installing packages...")
    !pip install -q pandas numpy matplotlib seaborn plotly
else:
    print("Not running on Colab - assuming packages are installed")

Running on Google Colab - installing packages...


In [10]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
import os

# Visualization settings
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
%matplotlib inline
warnings.filterwarnings('ignore')

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', 1000)

print("✓ All libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

✓ All libraries imported successfully!
Pandas version: 2.2.2
NumPy version: 2.0.2


In [11]:

# Setup instructions for Google Colab
import sys
IS_COLAB = 'google.colab' in sys.modules

if IS_COLAB:
    print("=" * 80)
    print("GOOGLE COLAB SETUP INSTRUCTIONS")
    print("=" * 80)
    print("""
This notebook is configured to work on Google Colab in two ways:

OPTION 1: Using Google Drive (Recommended)
───────────────────────────────────────
1. Upload the ML-CICIDS-project folder to your Google Drive root
2. When prompted in the next cell, authorize Drive access
3. The notebook will automatically find the project

OPTION 2: Clone from GitHub
────────────────────────────
1. The notebook will attempt to clone the repo automatically
2. This requires the GitHub repo to be public and accessible
3. Dataset must be uploaded separately to /content/ML-CICIDS-project/

Note: The next cell will handle the setup automatically.
If you encounter issues, uncomment and run the manual setup below:

# Manual Setup Option A: Mount Google Drive
# from google.colab import drive
# drive.mount('/content/drive')

# Manual Setup Option B: Clone from GitHub
# !git clone https://github.com/Tesfay-Hagos/ML-CICIDS-project.git /content/ML-CICIDS-project

# Manual Setup Option C: Download data_config.py
# !wget -O data_config.py https://raw.githubusercontent.com/Tesfay-Hagos/ML-CICIDS-project/main/data_config.py
""")
    print("=" * 80 + "\n")
else:
    print("✓ Running on local machine (not Colab)")


## 2. Configure Data Paths

In [13]:
# Base path - adjust if needed
import sys
import os
from pathlib import Path

# Check if running on Google Colab
IS_COLAB = 'google.colab' in sys.modules

print(f"Running on Colab: {IS_COLAB}")

if IS_COLAB:
    # For Colab: mount Google Drive, or fall back to cloning from GitHub
    import subprocess

    print("\n📂 Setting up Colab environment...\n")
    
    # Try to mount Google Drive first
    try:
        from google.colab import drive
        drive.mount('/content/drive')
        # Assume the project is in Google Drive
        colab_project_path = Path("/content/drive/MyDrive/ML-CICIDS-project")
        if colab_project_path.exists() and (colab_project_path / "data_config.py").exists():
            project_root = colab_project_path
            print(f"✓ Found project in Google Drive: {project_root}")
        else:
            print("⚠️  Project not found in Google Drive at /content/drive/MyDrive/ML-CICIDS-project")
            project_root = None
    except Exception as e:
        print(f"⚠️  Could not mount Google Drive: {e}")
        project_root = None
    
    # If not found in Drive, try cloning from GitHub (alternative)
    if project_root is None:
        print("\n📥 Attempting to clone from GitHub...\n")
        try:
            os.chdir('/content')
            # check=True raises CalledProcessError when git fails;
            # os.system would return a non-zero status without raising.
            subprocess.run(
                ['git', 'clone', 'https://github.com/Tesfay-Hagos/ML-CICIDS-project.git'],
                check=True,
            )
            project_root = Path("/content/ML-CICIDS-project")
            if (project_root / "data_config.py").exists():
                print(f"✓ Successfully cloned project: {project_root}")
            else:
                print("⚠️  Clone succeeded but data_config.py not found")
                project_root = None
        except Exception as e:
            print(f"⚠️  Clone failed: {e}")
            project_root = None
else:
    # Local environment (not Colab)
    current_dir = Path(os.getcwd())
    project_root = None
    
    # Search current and parent directories for data_config.py
    for parent in [current_dir] + list(current_dir.parents):
        if (parent / "data_config.py").exists():
            project_root = parent
            break
    
    # Fallback to known path if automatic discovery fails
    if project_root is None:
        known_path = Path("/home/tesfayh/Artificial_inteligence/ML/CICDS/ML-CICIDS-project/")
        if known_path.exists() and (known_path / "data_config.py").exists():
            project_root = known_path

# Add to system path and import
if project_root:
    if str(project_root) not in sys.path:
        sys.path.insert(0, str(project_root))
    print(f"✓ Added to path: {project_root}")
    
    from data_config import DataConfig
    
    # Initialize configuration
    config = DataConfig(base_path=str(project_root))
    config.print_summary()
    
    # Get file lists from config
    flow_files = config.flow_files
    ml_files = config.ml_files
    
    # Display file names
    print("\n📄 Available files (GeneratedLabelledFlows):")
    for i, f in enumerate(flow_files, 1):
        print(f"   {i}. {f.name}")
    
    print("\n📄 Available files (MachineLearningCSV):")
    for i, f in enumerate(ml_files, 1):
        print(f"   {i}. {f.name}")
else:
    print("❌ ERROR: Could not find project root or data_config.py")
    print("\nFor Colab, you have two options:")
    print("1. Mount Google Drive with the project folder")
    print("2. Manually upload data_config.py to Colab or update the Google Drive path")
    print("\nAlternatively, create data_config.py in this Colab cell:")
    print("   !wget https://raw.githubusercontent.com/Tesfay-Hagos/ML-CICIDS-project/main/data_config.py")

Running on Colab: True

📂 Setting up Colab environment...

⚠️  Could not mount Google Drive: mount failed

📥 Attempting to clone from GitHub...

⚠️  Clone succeeded but data_config.py not found
❌ ERROR: Could not find project root or data_config.py

For Colab, you have two options:
1. Mount Google Drive with the project folder
2. Manually upload data_config.py to Colab or update the Google Drive path

Alternatively, create data_config.py in this Colab cell:
   !wget https://raw.githubusercontent.com/Tesfay-Hagos/ML-CICIDS-project/main/data_config.py
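
The local-environment branch above walks up from the working directory until it finds `data_config.py`. A minimal sketch of that search as a reusable function is shown below; `find_project_root` is a hypothetical helper name (not part of the project), and the demo runs on a throwaway directory tree rather than the real project layout.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_project_root(start: Path, marker: str = "data_config.py") -> Optional[Path]:
    """Return the nearest directory at or above `start` that contains `marker`."""
    start = Path(start).resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_file():
            return candidate
    return None


# Demo on a temporary directory tree (an assumption for illustration only).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "data_config.py").touch()
    nested = root / "notebooks" / "eda"
    nested.mkdir(parents=True)
    demo_found = find_project_root(nested) == root

print(demo_found)
```

In the notebook, a `None` result could be turned into an explicit `FileNotFoundError`, so that execution stops at this cell instead of failing later with a `NameError` on `config`.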

## 3. Dataset Comparison: Structure and Columns

In [13]:
# Load first file from each dataset to compare structure
sample_file = "Monday-WorkingHours.pcap_ISCX.csv"

print("Loading sample files for comparison...\n")
# Use config to load files
flow_sample = config.load_file(sample_file, dataset='flow', nrows=1000)
ml_sample = config.load_file(sample_file, dataset='ml', nrows=1000)

print("=" * 80)
print("DATASET STRUCTURE COMPARISON")
print("=" * 80)

print(f"\n📊 GeneratedLabelledFlows:")
print(f"   - Shape: {flow_sample.shape}")
print(f"   - Columns: {len(flow_sample.columns)}")
print(f"   - Memory: {flow_sample.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

print(f"\n📊 MachineLearningCSV:")
print(f"   - Shape: {ml_sample.shape}")
print(f"   - Columns: {len(ml_sample.columns)}")
print(f"   - Memory: {ml_sample.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# Column differences
flow_cols = set(flow_sample.columns)
ml_cols = set(ml_sample.columns)

common_cols = flow_cols & ml_cols
flow_only = flow_cols - ml_cols
ml_only = ml_cols - flow_cols

print(f"\n🔍 Column Analysis:")
print(f"   - Common columns: {len(common_cols)}")
print(f"   - Only in GeneratedLabelledFlows: {len(flow_only)}")
print(f"   - Only in MachineLearningCSV: {len(ml_only)}")

if flow_only:
    print(f"\n   Columns ONLY in GeneratedLabelledFlows:")
    for col in sorted(flow_only):
        print(f"      • {col}")

if ml_only:
    print(f"\n   Columns ONLY in MachineLearningCSV:")
    for col in sorted(ml_only):
        print(f"      • {col}")

Loading sample files for comparison...



NameError: name 'config' is not defined
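
Later cells in this notebook handle column names both with and without a leading space (for example `' Label'` versus `'Label'`), so a raw set comparison may report the same column as a difference. The sketch below compares names after stripping whitespace. `compare_columns` is a hypothetical helper, and the two toy frames are invented for illustration; they are not the real CICIDS2017 headers.

```python
import pandas as pd


def compare_columns(left: pd.DataFrame, right: pd.DataFrame) -> dict:
    """Compare column names of two frames, ignoring surrounding whitespace."""
    left_cols = {str(c).strip() for c in left.columns}
    right_cols = {str(c).strip() for c in right.columns}
    return {
        "common": sorted(left_cols & right_cols),
        "left_only": sorted(left_cols - right_cols),
        "right_only": sorted(right_cols - left_cols),
    }


# Toy headers (assumed, for illustration only).
flow_demo = pd.DataFrame(columns=["Flow ID", " Source IP", " Flow Duration", " Label"])
ml_demo = pd.DataFrame(columns=[" Flow Duration", "Label"])

diff = compare_columns(flow_demo, ml_demo)
print(diff)
```

Without the `strip()`, `' Label'` and `'Label'` would land in the "only in" sets even though they name the same field.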

### Visualize Column Differences

In [None]:
# Create visualization of column distribution
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

# Pie chart showing column composition
column_data = [len(common_cols), len(flow_only), len(ml_only)]
labels = ['Common Columns', 'Flow Only', 'ML Only']
colors = ['#2ecc71', '#3498db', '#e74c3c']
explode = (0.05, 0.05, 0.05)

ax1.pie(column_data, labels=labels, colors=colors, autopct='%1.1f%%', 
        startangle=90, explode=explode, shadow=True)
ax1.set_title('Column Distribution Across Datasets', fontsize=14, fontweight='bold')

# Bar chart comparing total columns
datasets = ['GeneratedLabelledFlows', 'MachineLearningCSV']
column_counts = [len(flow_sample.columns), len(ml_sample.columns)]
bars = ax2.bar(datasets, column_counts, color=['#3498db', '#e74c3c'], alpha=0.7, edgecolor='black')

# Add value labels on bars
for bar in bars:
    height = bar.get_height()
    ax2.text(bar.get_x() + bar.get_width()/2., height,
            f'{int(height)}',
            ha='center', va='bottom', fontsize=12, fontweight='bold')

ax2.set_ylabel('Number of Columns', fontsize=12)
ax2.set_title('Total Columns Per Dataset', fontsize=14, fontweight='bold')
ax2.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

## 4. Data Overview and Statistics

In [None]:
# Load full Monday dataset for detailed analysis
print("Loading Monday dataset (full) for detailed analysis...\n")

# Using MachineLearningCSV as it's preprocessed for ML
monday_data = config.load_file(sample_file, dataset='ml')

print("✓ Data loaded successfully!\n")
print("=" * 80)
print("DATASET OVERVIEW")
print("=" * 80)
print(f"Shape: {monday_data.shape[0]:,} rows × {monday_data.shape[1]} columns")
print(f"Memory usage: {monday_data.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print(f"\nFirst few rows:")
display(monday_data.head())

print("\n" + "="*80)
print("DATA TYPES")
print("="*80)
print(monday_data.dtypes.value_counts())

In [None]:
# Statistical summary
print("📊 Statistical Summary of Numerical Features:\n")
display(monday_data.describe())

## 5. Data Quality Assessment

In [None]:
# Check for missing values
missing_data = monday_data.isnull().sum()
missing_percent = (missing_data / len(monday_data)) * 100
missing_df = pd.DataFrame({
    'Missing Count': missing_data,
    'Percentage': missing_percent
})
missing_df = missing_df[missing_df['Missing Count'] > 0].sort_values('Missing Count', ascending=False)

print("=" * 80)
print("MISSING DATA ANALYSIS")
print("=" * 80)

if len(missing_df) > 0:
    print(f"\n⚠️  Found {len(missing_df)} columns with missing values:\n")
    display(missing_df.head(20))
else:
    print("\n✓ No missing values found!")

# Check for infinite values in numerical columns
numerical_cols = monday_data.select_dtypes(include=[np.number]).columns
inf_counts = {}

for col in numerical_cols:
    inf_count = np.isinf(monday_data[col]).sum()
    if inf_count > 0:
        inf_counts[col] = inf_count

if inf_counts:
    print(f"\n⚠️  Found infinite values in {len(inf_counts)} columns:")
    for col, count in sorted(inf_counts.items(), key=lambda x: x[1], reverse=True)[:10]:
        print(f"   • {col}: {count:,} infinite values")
else:
    print("\n✓ No infinite values found!")
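
The per-column loop above can also be written as a single vectorized pass that reports NaN and infinite counts side by side. This is a minimal sketch; `count_non_finite` is a hypothetical helper and the demo frame is made up for illustration.

```python
import numpy as np
import pandas as pd


def count_non_finite(df: pd.DataFrame) -> pd.DataFrame:
    """Count NaN and +/-inf values per numeric column, keeping only affected columns."""
    numeric = df.select_dtypes(include=[np.number])
    summary = pd.DataFrame({
        "nan": numeric.isna().sum(),
        "inf": np.isinf(numeric).sum(),
    })
    return summary[(summary["nan"] > 0) | (summary["inf"] > 0)]


# Toy data (assumed): one column with an infinity and a NaN, one clean column.
demo = pd.DataFrame({
    "Flow Bytes/s": [1.0, np.inf, np.nan, 4.0],
    "Flow Duration": [10, 20, 30, 40],
    "Label": ["BENIGN"] * 4,
})

report = count_non_finite(demo)
print(report)

# One way to neutralize infinities before modeling: map them to NaN first,
# then impute or drop in a later preprocessing step.
cleaned = demo.replace([np.inf, -np.inf], np.nan)
```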

In [None]:
# Visualize missing data
if len(missing_df) > 0:
    fig, ax = plt.subplots(figsize=(12, 6))
    
    top_missing = missing_df.head(15)
    bars = ax.barh(range(len(top_missing)), top_missing['Percentage'], color='coral', edgecolor='black')
    ax.set_yticks(range(len(top_missing)))
    ax.set_yticklabels(top_missing.index)
    ax.set_xlabel('Missing Percentage (%)', fontsize=12)
    ax.set_title('Top 15 Columns with Missing Data', fontsize=14, fontweight='bold')
    ax.grid(axis='x', alpha=0.3)
    
    # Add percentage labels
    for i, bar in enumerate(bars):
        width = bar.get_width()
        ax.text(width, bar.get_y() + bar.get_height()/2.,
               f'{width:.2f}%',
               ha='left', va='center', fontsize=9)
    
    plt.tight_layout()
    plt.show()
else:
    print("✓ No missing data to visualize!")

## 6. Label Distribution Analysis

In [None]:
# Analyze label distribution
label_col = ' Label' if ' Label' in monday_data.columns else 'Label'

label_counts = monday_data[label_col].value_counts()
label_percentages = monday_data[label_col].value_counts(normalize=True) * 100

label_summary = pd.DataFrame({
    'Count': label_counts,
    'Percentage': label_percentages
})

print("=" * 80)
print("LABEL DISTRIBUTION (Monday Dataset)")
print("=" * 80)
print(f"\nTotal unique labels: {len(label_counts)}\n")
display(label_summary)

# Check for class imbalance
if len(label_counts) > 1:
    imbalance_ratio = label_counts.max() / label_counts.min()
    print(f"\n⚖️  Class Imbalance Ratio: {imbalance_ratio:.2f}:1")
    if imbalance_ratio > 10:
        print("   ⚠️  Significant class imbalance detected! Consider using:")
        print("      • SMOTE (Synthetic Minority Over-sampling Technique)")
        print("      • Class weights in model training")
        print("      • Stratified sampling")
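
As one concrete form of the "class weights" suggestion, the sketch below computes weights with the common "balanced" heuristic, `n_samples / (n_classes * class_count)`, using pandas only. `balanced_class_weights` is a hypothetical helper and the label counts are toy values, not measurements from CICIDS2017.

```python
import pandas as pd


def balanced_class_weights(labels: pd.Series) -> pd.Series:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = labels.value_counts()
    return len(labels) / (len(counts) * counts)


# Toy labels (assumed): a heavily imbalanced three-class sample.
demo_labels = pd.Series(["BENIGN"] * 90 + ["DDoS"] * 9 + ["Bot"] * 1)

weights = balanced_class_weights(demo_labels)
print(weights)
```

Rare classes receive proportionally larger weights, so each class contributes the same total weight to a weighted loss.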

In [None]:
# Visualize label distribution
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Bar plot
colors_list = plt.cm.Set3(range(len(label_counts)))
bars = ax1.bar(range(len(label_counts)), label_counts.values, color=colors_list, edgecolor='black', alpha=0.8)
ax1.set_xticks(range(len(label_counts)))
ax1.set_xticklabels(label_counts.index, rotation=45, ha='right')
ax1.set_ylabel('Count', fontsize=12)
ax1.set_title('Label Distribution (Count)', fontsize=14, fontweight='bold')
ax1.grid(axis='y', alpha=0.3)

# Add count labels on bars
for bar in bars:
    height = bar.get_height()
    ax1.text(bar.get_x() + bar.get_width()/2., height,
            f'{int(height):,}',
            ha='center', va='bottom', fontsize=9, rotation=0)

# Pie chart
if len(label_counts) <= 10:  # Only show pie chart if not too many labels
    wedges, texts, autotexts = ax2.pie(label_counts.values, labels=label_counts.index, 
                                         autopct='%1.1f%%', startangle=90, 
                                         colors=colors_list, 
                                         explode=[0.05] * len(label_counts))
    ax2.set_title('Label Distribution (Percentage)', fontsize=14, fontweight='bold')
    
    # Make percentage text more readable
    for autotext in autotexts:
        autotext.set_color('white')
        autotext.set_fontweight('bold')
else:
    # If too many labels, show log scale bar plot
    label_counts.plot(kind='bar', ax=ax2, color=colors_list, edgecolor='black', alpha=0.8, logy=True)
    ax2.set_title('Label Distribution (Log Scale)', fontsize=14, fontweight='bold')
    ax2.set_ylabel('Count (log scale)', fontsize=12)
    ax2.set_xlabel('Label', fontsize=12)
    ax2.tick_params(axis='x', rotation=45)
    ax2.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

## 7. Complete Dataset Label Distribution

In [None]:
# Analyze labels across all files
print("Loading all files to analyze complete label distribution...\n")
print("This may take a moment...\n")

all_labels = []
file_info = []

for csv_file in ml_files:
    print(f"Processing: {csv_file.name}")
    try:
        df = pd.read_csv(csv_file)
        labels = df[label_col].value_counts()
        
        file_info.append({
            'File': csv_file.name,
            'Total Rows': len(df),
            'Unique Labels': len(labels),
            'Labels': ', '.join(labels.index.tolist())
        })
        
        all_labels.extend(df[label_col].tolist())
    except Exception as e:
        print(f"   ⚠️  Error: {e}")

# Create summary DataFrame
file_summary = pd.DataFrame(file_info)

print("\n" + "="*80)
print("FILE-LEVEL SUMMARY")
print("="*80)
display(file_summary)

# Overall label distribution
overall_labels = pd.Series(all_labels).value_counts()
overall_percentages = pd.Series(all_labels).value_counts(normalize=True) * 100

overall_summary = pd.DataFrame({
    'Count': overall_labels,
    'Percentage': overall_percentages
})

print("\n" + "="*80)
print("OVERALL LABEL DISTRIBUTION (All Files Combined)")
print("="*80)
print(f"Total samples: {len(all_labels):,}")
print(f"Unique labels: {len(overall_labels)}\n")
display(overall_summary)
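
The cell above reads every column of every file and keeps all labels in a Python list. If memory becomes a constraint, a lighter alternative is to read only the label column in chunks and accumulate counts. The sketch below assumes the label column is named `Label`, possibly with surrounding whitespace; `count_labels` is a hypothetical helper, not part of `data_config.py`.

```python
from collections import Counter

import pandas as pd


def count_labels(csv_paths, chunksize=100_000) -> pd.Series:
    """Accumulate label counts across CSV files without loading them fully."""
    totals = Counter()
    for path in csv_paths:
        # Read just the header to find the label column, with or without padding.
        header = pd.read_csv(path, nrows=0).columns
        label_col = next(c for c in header if c.strip() == "Label")
        for chunk in pd.read_csv(path, usecols=[label_col], chunksize=chunksize):
            totals.update(chunk[label_col].value_counts().to_dict())
    return pd.Series(totals, dtype="int64").sort_values(ascending=False)


# Possible usage in this notebook (requires `ml_files` from the config cell):
# overall_labels = count_labels(ml_files)
```

The result has the same shape as `overall_labels` above (labels as index, counts as values), so the downstream summary and plot cells could consume it unchanged.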

In [None]:
# Visualize overall label distribution
fig, ax = plt.subplots(figsize=(14, 7))

colors_overall = plt.cm.tab20(range(len(overall_labels)))
bars = ax.bar(range(len(overall_labels)), overall_labels.values, 
              color=colors_overall, edgecolor='black', alpha=0.8)

ax.set_xticks(range(len(overall_labels)))
ax.set_xticklabels(overall_labels.index, rotation=45, ha='right', fontsize=10)
ax.set_ylabel('Count', fontsize=12, fontweight='bold')
ax.set_xlabel('Attack Type', fontsize=12, fontweight='bold')
ax.set_title('Complete CICIDS2017 Dataset - Label Distribution (All Files)', 
             fontsize=16, fontweight='bold', pad=20)
ax.grid(axis='y', alpha=0.3)
ax.set_yscale('log')  # Log scale for better visibility

# Add count labels
for i, bar in enumerate(bars):
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height,
           f'{int(overall_labels.values[i]):,}',
           ha='center', va='bottom', fontsize=8, rotation=45)

plt.tight_layout()
plt.show()

## 8. Feature Distribution Analysis

In [None]:
# Select numerical features for analysis (excluding label)
feature_cols = [col for col in monday_data.select_dtypes(include=[np.number]).columns 
                if col != label_col]

print(f"Analyzing {len(feature_cols)} numerical features...\n")

# Sample of key features to visualize
key_features = [
    'Flow Duration',
    ' Total Fwd Packets',
    ' Total Backward Packets',
    'Total Length of Fwd Packets',
    ' Total Length of Bwd Packets',
    ' Flow Bytes/s',
    ' Flow Packets/s'
]

# Filter to features that exist
available_features = [f for f in key_features if f in monday_data.columns]

if len(available_features) >= 4:
    # Plot distributions
    fig, axes = plt.subplots(2, 2, figsize=(16, 10))
    axes = axes.ravel()
    
    for i, feature in enumerate(available_features[:4]):
        # Remove infinite values for visualization
        data = monday_data[feature].replace([np.inf, -np.inf], np.nan).dropna()
        
        axes[i].hist(data, bins=50, color='steelblue', edgecolor='black', alpha=0.7)
        axes[i].set_xlabel(feature, fontsize=11)
        axes[i].set_ylabel('Frequency', fontsize=11)
        axes[i].set_title(f'Distribution: {feature}', fontsize=12, fontweight='bold')
        axes[i].grid(alpha=0.3)
        
        # Add statistics
        mean_val = data.mean()
        median_val = data.median()
        axes[i].axvline(mean_val, color='red', linestyle='--', linewidth=2, label=f'Mean: {mean_val:.2e}')
        axes[i].axvline(median_val, color='green', linestyle='--', linewidth=2, label=f'Median: {median_val:.2e}')
        axes[i].legend()
    
    plt.tight_layout()
    plt.show()
else:
    print("Not enough features available for visualization.")

## 9. Feature Correlation Analysis

In [None]:
# Calculate correlation matrix for a subset of features
print("Computing correlation matrix...\n")

# Select subset of features (to avoid overwhelming visualization)
sample_features = monday_data[available_features[:10]].replace([np.inf, -np.inf], np.nan)
correlation_matrix = sample_features.corr()

# Plot heatmap
fig, ax = plt.subplots(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', 
            center=0, square=True, linewidths=1, cbar_kws={"shrink": 0.8},
            ax=ax)
ax.set_title('Feature Correlation Matrix (Sample Features)', fontsize=14, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

# Find highly correlated features
high_corr = []
for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        if abs(correlation_matrix.iloc[i, j]) > 0.8:
            high_corr.append({
                'Feature 1': correlation_matrix.columns[i],
                'Feature 2': correlation_matrix.columns[j],
                'Correlation': correlation_matrix.iloc[i, j]
            })

if high_corr:
    print("\n⚠️  Highly correlated features (|r| > 0.8):")
    high_corr_df = pd.DataFrame(high_corr).sort_values('Correlation', key=abs, ascending=False)
    display(high_corr_df)
    print("\n💡 Consider removing one feature from each pair to reduce multicollinearity.")
else:
    print("\n✓ No highly correlated features found (|r| > 0.8)")
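
The nested loop above scales quadratically in Python as more features are included. A vectorized sketch of the same idea masks the correlation matrix to its strict upper triangle and filters by threshold. `high_correlation_pairs` is a hypothetical helper, and the demo values are invented so that exactly one pair is strongly correlated.

```python
import numpy as np
import pandas as pd


def high_correlation_pairs(df: pd.DataFrame, threshold: float = 0.8) -> pd.DataFrame:
    """Return feature pairs whose absolute Pearson correlation exceeds `threshold`."""
    corr = df.corr()
    # Keep only the strict upper triangle so each pair appears once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().reset_index()
    pairs.columns = ["Feature 1", "Feature 2", "Correlation"]
    pairs = pairs[pairs["Correlation"].abs() > threshold]
    return pairs.sort_values("Correlation", key=abs, ascending=False).reset_index(drop=True)


# Toy data (assumed): `b` is nearly 2 * `a`; `c` is only weakly related to both.
demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.5],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

pairs = high_correlation_pairs(demo)
print(pairs)
```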

## 10. Dataset Selection Recommendation

In [None]:
print("=" * 80)
print("DATASET SELECTION RECOMMENDATION")
print("=" * 80)
print("""
Based on the analysis, here are the recommendations:

📌 FOR MACHINE LEARNING MODEL TRAINING:
   ✓ USE: MachineLearningCSV/MachineLearningCVE
   
   Reasons:
   • Preprocessed and optimized for ML algorithms
   • Removes identifying information (IPs, ports, timestamps)
   • Focuses on statistical flow features
   • Privacy-preserving (no personal/network identifiers)
   • Smaller memory footprint
   • Industry standard for intrusion detection research

📌 FOR NETWORK FORENSICS & DETAILED ANALYSIS:
   ✓ USE: GeneratedLabelledFlows
   
   Reasons:
   • Contains complete flow information
   • Includes Flow ID, Source/Dest IPs, Ports, Timestamps
   • Useful for tracking specific flows
   • Better for investigating attack patterns
   • Correlate with original PCAP files

📊 DATASET STATISTICS:
   • Total Samples: {:,}
   • Unique Attack Types: {}
   • Features (ML version): {}
   • Class Imbalance: Present (consider SMOTE or class weights)

⚠️  KEY CONSIDERATIONS:
   • Significant class imbalance exists - use appropriate techniques
   • Some features contain infinite values - handle during preprocessing
   • High correlation between some features - consider dimensionality reduction
   • Stratified sampling recommended for train/test split

🎯 NEXT STEPS:
   1. Data Preprocessing (handle infinities, normalize features)
   2. Feature Selection/Engineering
   3. Handle Class Imbalance (SMOTE, class weights)
   4. Train/Test Split (stratified)
   5. Model Selection and Training
   6. Evaluation with appropriate metrics (F1, Precision, Recall)
""".format(
    len(all_labels),
    len(overall_labels),
    len(ml_sample.columns)
))

print("=" * 80)
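
The recommendation to use a stratified train/test split can be sketched with pandas alone, sampling the same fraction from each label group. `stratified_split` is a hypothetical helper; the sketch assumes a unique index, and the toy frame is invented for illustration.

```python
import pandas as pd


def stratified_split(df: pd.DataFrame, label_col: str, test_frac: float = 0.2, seed: int = 42):
    """Split `df` so each label keeps roughly the same share in train and test.

    Assumes `df` has a unique index, since test rows are removed by index.
    """
    test = df.groupby(label_col, group_keys=False).sample(frac=test_frac, random_state=seed)
    train = df.drop(test.index)
    return train, test


# Toy data (assumed): 90 benign rows and 10 attack rows.
demo = pd.DataFrame({
    "Flow Duration": range(100),
    "Label": ["BENIGN"] * 90 + ["DDoS"] * 10,
})

train, test = stratified_split(demo, "Label")
print(test["Label"].value_counts().to_dict())
```

For very small classes, `frac * group_size` can round down to zero test rows, so rare attack types may need a minimum per-class sample count or a dedicated splitting utility.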

## 11. Export Analysis Summary

In [None]:
# Create summary report
summary_report = {
    'analysis_date': pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S'),
    'total_samples': len(all_labels),
    'unique_labels': len(overall_labels),
    'label_distribution': overall_summary.to_dict(),
    'files_analyzed': len(ml_files),
    'file_summary': file_summary.to_dict(),
    'recommended_dataset': 'MachineLearningCSV/MachineLearningCVE',
    'features_count': len(ml_sample.columns)
}

# Save to JSON
import json
# Wrap in Path so this works whether DataConfig stores base_path as a str or a Path
output_path = Path(config.base_path) / 'data_analysis_summary.json'

with open(output_path, 'w') as f:
    json.dump(summary_report, f, indent=2, default=str)

print(f"✓ Analysis summary saved to: {output_path}")