# 01 - Exploratory Data Analysis (EDA)

This notebook performs comprehensive exploratory data analysis on the processed network anomaly detection dataset.

## Overview
- Load and examine processed data
- Analyze class distribution and data quality
- Visualize feature distributions and relationships
- Check for missing values and data integrity

## Dataset
The processed dataset contains network flow features that have been cleaned, encoded, and scaled for machine learning.


In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("Libraries imported successfully!")


## 1. Load Processed Data

First, let's load the processed dataset and examine its basic structure.




In [None]:
# Load processed data
data_path = Path("../data/processed/processed.csv")

if not data_path.exists():
    # Stop here: every later cell depends on `df`, so failing fast is clearer
    # than a NameError further down the notebook.
    raise FileNotFoundError(
        f"❌ Processed data not found at {data_path}. Please run preprocessing first.\n"
        "Run: python ../src/preprocess.py --input ../data/raw/sample.csv --output ../data/processed/processed.csv"
    )

df = pd.read_csv(data_path)
print("✅ Data loaded successfully!")
print(f"Dataset shape: {df.shape}")


In [None]:
# Display first few rows
print("First 5 rows of the dataset:")
df.head()


In [None]:
# Basic dataset information
print("Dataset Info:")
print(f"Rows: {df.shape[0]}")
print(f"Columns: {df.shape[1]}")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print("\nData types:")
print(df.dtypes.value_counts())


## 2. Class Distribution Analysis

Let's examine the distribution of classes in our dataset to understand the balance between normal and anomalous traffic.


In [None]:
# Check if label column exists
if 'label' in df.columns:
    print("Label column found!")
    label_counts = df['label'].value_counts()
    print("Label distribution:")
    print(label_counts)
    print("\nLabel distribution (percentages):")
    print(label_counts / label_counts.sum() * 100)
else:
    print("❌ Label column not found in dataset")
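
The counts and percentages above can be condensed into a single imbalance figure. The following is a minimal sketch, not part of the original pipeline: `imbalance_summary` is a hypothetical helper, the toy labels are made up for illustration, and the "imbalance ratio" here is simply the majority count divided by the minority count.

```python
import pandas as pd


def imbalance_summary(labels: pd.Series) -> dict:
    """Summarize class balance for a label column (hypothetical helper)."""
    counts = labels.value_counts()
    return {
        "n_classes": int(counts.size),
        "majority_class": counts.idxmax(),
        "minority_class": counts.idxmin(),
        # Majority count divided by minority count; 1.0 means perfectly balanced
        "imbalance_ratio": float(counts.max() / counts.min()),
    }


# Toy labels (made up): 0 = normal traffic, 1 = anomalous traffic
toy_labels = pd.Series([0] * 90 + [1] * 10, name="label")
summary = imbalance_summary(toy_labels)
print(summary)
```

On the real data, `imbalance_summary(df['label'])` would give the same summary. A large ratio suggests that accuracy alone may be a misleading metric later on.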


In [None]:
# Visualize class distribution
if 'label' in df.columns:
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
    
    # Count plot
    df['label'].value_counts().plot(kind='bar', ax=ax1, color=['skyblue', 'salmon'])
    ax1.set_title('Class Distribution (Counts)')
    ax1.set_xlabel('Class')
    ax1.set_ylabel('Count')
    ax1.tick_params(axis='x', rotation=45)
    
    # Pie chart
    df['label'].value_counts().plot(kind='pie', ax=ax2, autopct='%1.1f%%', startangle=90)
    ax2.set_title('Class Distribution (Percentages)')
    ax2.set_ylabel('')
    
    plt.tight_layout()
    plt.show()


## 3. Feature Analysis

Let's examine the features in our dataset to understand their distributions and characteristics.


In [None]:
# Separate features and target
if 'label' in df.columns:
    features = df.drop(columns=['label'])
    target = df['label']
    
    print(f"Number of features: {features.shape[1]}")
    print(f"Feature names: {list(features.columns)}")
else:
    features = df
    target = None
    print("No target column found, analyzing all columns as features")


In [None]:
# Basic statistics for numeric features
print("Basic Statistics for Numeric Features:")
print(features.describe())
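
Since the dataset is described as already scaled, the summary statistics can double as a sanity check on that claim. The sketch below assumes the scaling was either z-score standardization or min-max scaling to [0, 1]; the notebook does not say which, so both are checked. `scaling_report`, the `tol` parameter, and the toy columns are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd


def scaling_report(frame: pd.DataFrame, tol: float = 0.1) -> pd.DataFrame:
    """Flag numeric columns that look standardized or min-max scaled."""
    numeric = frame.select_dtypes(include=[np.number])
    report = pd.DataFrame({
        "mean": numeric.mean(),
        "std": numeric.std(ddof=0),  # population std, matching a typical z-score scaler
        "min": numeric.min(),
        "max": numeric.max(),
    })
    # Standardized: mean near 0 and std near 1 (within tol)
    report["looks_standardized"] = (
        (report["mean"].abs() <= tol) & ((report["std"] - 1).abs() <= tol)
    )
    # Min-max scaled: range is exactly [0, 1] up to floating-point error
    report["looks_minmax"] = (
        np.isclose(report["min"], 0.0) & np.isclose(report["max"], 1.0)
    )
    return report


# Toy data (made up): one column per scaling scheme, plus an unscaled one
rng = np.random.default_rng(0)
raw = rng.normal(loc=50, scale=5, size=500)
toy = pd.DataFrame({
    "zscore_feature": (raw - raw.mean()) / raw.std(),
    "minmax_feature": (raw - raw.min()) / (raw.max() - raw.min()),
    "unscaled_feature": raw,
})
report = scaling_report(toy)
print(report)
```

Columns flagged by neither check are worth a second look, although binary or one-hot encoded columns will often fail the standardization check legitimately.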


In [None]:
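# Select a subset of numeric features for visualization (to avoid overcrowding).
# Restricting to numeric columns keeps the histogram and correlation cells
# from failing if any non-numeric column survived preprocessing.
numeric_features = features.select_dtypes(include=[np.number])
n_features_to_plot = min(10, numeric_features.shape[1])
selected_features = numeric_features.columns[:n_features_to_plot]

print(f"Plotting distributions for first {n_features_to_plot} numeric features:")
print(selected_features.tolist())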


In [None]:
# Plot histograms for selected features
fig, axes = plt.subplots(2, 5, figsize=(20, 10))
axes = axes.ravel()

for i, feature in enumerate(selected_features):
    if i < len(axes):
        axes[i].hist(features[feature], bins=30, alpha=0.7, edgecolor='black')
        axes[i].set_title(f'{feature}', fontsize=10)
        axes[i].set_xlabel('Value')
        axes[i].set_ylabel('Frequency')
        axes[i].grid(True, alpha=0.3)

# Hide unused subplots
for i in range(len(selected_features), len(axes)):
    axes[i].set_visible(False)

plt.tight_layout()
plt.show()
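
The histogram cell uses a fixed 2x5 grid, which works because at most 10 features are selected. If the number of plotted features changes, the grid could be derived from it instead. A minimal sketch, where `grid_shape` and `max_cols` are hypothetical names:

```python
import math


def grid_shape(n_plots: int, max_cols: int = 5) -> tuple:
    """Return (rows, cols) for a subplot grid that fits n_plots panels."""
    if n_plots < 1:
        raise ValueError("n_plots must be at least 1")
    n_cols = min(n_plots, max_cols)
    n_rows = math.ceil(n_plots / n_cols)
    return n_rows, n_cols


print(grid_shape(10))  # two rows of five
print(grid_shape(3))   # a single row of three
print(grid_shape(7))   # two rows of five, with three panels left unused
```

The result could be passed as `plt.subplots(*grid_shape(len(selected_features)), squeeze=False)`; `squeeze=False` keeps `axes` two-dimensional even for a single row, so `axes.ravel()` still works.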


## 4. Correlation Analysis

Let's examine the correlations between features to understand relationships in our data.


In [None]:
# Calculate correlation matrix for selected features
correlation_matrix = features[selected_features].corr()

# Plot correlation heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, 
            annot=True, 
            cmap='coolwarm', 
            center=0,
            square=True,
            fmt='.2f',
            cbar_kws={'shrink': 0.8})
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.show()
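
A heatmap becomes hard to read as the number of features grows, so it can help to list the strongly correlated pairs explicitly. The sketch below is one way to do that; `high_correlation_pairs` and the 0.9 threshold are illustrative assumptions, not values taken from this project.

```python
import numpy as np
import pandas as pd


def high_correlation_pairs(frame: pd.DataFrame, threshold: float = 0.9) -> list:
    """List feature pairs whose absolute Pearson correlation meets the threshold."""
    corr = frame.select_dtypes(include=[np.number]).corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (self-correlation) is excluded.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    # NaN entries (if any survive stacking) compare as False and are dropped here
    pairs = pairs[pairs >= threshold].sort_values(ascending=False)
    return [(a, b, float(value)) for (a, b), value in pairs.items()]


# Toy data (made up): "b" is a linear function of "a", "c" is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({"a": a, "b": 2 * a + 1, "c": rng.normal(size=200)})
pairs = high_correlation_pairs(toy, threshold=0.9)
print(pairs)
```

Pairs reported this way are candidates for dropping one of the two features, though whether that helps depends on the model used later.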


## 5. Missing Values and Data Quality

Let's check for missing values and assess data quality.


In [None]:
# Check for missing values
missing_values = df.isnull().sum()
missing_percentage = (missing_values / len(df)) * 100

missing_df = pd.DataFrame({
    'Column': missing_values.index,
    'Missing_Count': missing_values.values,
    'Missing_Percentage': missing_percentage.values
})

missing_df = missing_df[missing_df['Missing_Count'] > 0].sort_values('Missing_Percentage', ascending=False)

if len(missing_df) > 0:
    print("❌ Missing values found:")
    print(missing_df)
else:
    print("✅ No missing values found in the dataset!")


In [None]:
# Check for infinite values
infinite_values = np.isinf(features.select_dtypes(include=[np.number])).sum()
infinite_df = pd.DataFrame({
    'Column': infinite_values.index,
    'Infinite_Count': infinite_values.values
})

infinite_df = infinite_df[infinite_df['Infinite_Count'] > 0]

if len(infinite_df) > 0:
    print("❌ Infinite values found:")
    print(infinite_df)
else:
    print("✅ No infinite values found in numeric features!")


In [None]:
# Check for duplicate rows
duplicate_count = df.duplicated().sum()
print(f"Duplicate rows: {duplicate_count}")

if duplicate_count > 0:
    print(f"Percentage of duplicates: {(duplicate_count / len(df)) * 100:.2f}%")
else:
    print("✅ No duplicate rows found!")
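
The three checks above (missing values, infinite values, duplicates) can also be collected into one summary that is easy to log or compare between preprocessing runs. This is a minimal sketch; `data_quality_summary` is a hypothetical helper and the toy frame is made up.

```python
import numpy as np
import pandas as pd


def data_quality_summary(frame: pd.DataFrame) -> dict:
    """Count missing cells, infinite cells, and duplicate rows in one pass."""
    numeric = frame.select_dtypes(include=[np.number])
    return {
        "rows": len(frame),
        "missing_cells": int(frame.isna().sum().sum()),
        # Infinite values can only occur in numeric columns
        "infinite_cells": int(np.isinf(numeric).sum().sum()),
        # Rows identical to an earlier row (the first occurrence is not counted)
        "duplicate_rows": int(frame.duplicated().sum()),
    }


# Toy frame (made up): one NaN, two infinities, and one repeated row
toy = pd.DataFrame({
    "duration": [1.0, 2.0, np.nan, 2.0],
    "bytes": [10.0, np.inf, 30.0, np.inf],
})
quality = data_quality_summary(toy)
print(quality)
```

Calling `data_quality_summary(df)` on the loaded data would return the same four counts for the real dataset.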


## 6. Feature Distribution by Class

Let's examine how features are distributed across different classes to identify patterns.


In [None]:
# Plot feature distributions by class (if target exists)
if target is not None and len(target.unique()) > 1:
    # Select a few key features for visualization
    key_features = selected_features[:4]  # slicing already handles fewer than 4 features
    
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    axes = axes.ravel()
    
    for i, feature in enumerate(key_features):
        if i < len(axes):
            for class_label in target.unique():
                class_data = features[target == class_label][feature]
                axes[i].hist(class_data, alpha=0.6, label=f'Class {class_label}', bins=20)
            
            axes[i].set_title(f'{feature} by Class')
            axes[i].set_xlabel('Value')
            axes[i].set_ylabel('Frequency')
            axes[i].legend()
            axes[i].grid(True, alpha=0.3)
    
    # Hide unused subplots
    for i in range(len(key_features), len(axes)):
        axes[i].set_visible(False)
    
    plt.tight_layout()
    plt.show()
else:
    print("Cannot plot by class - target variable not available or has only one class")
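
Overlaid histograms show separation visually; a numeric ranking can complement them. The sketch below ranks features by the absolute standardized mean difference between two classes (a Cohen's d-style effect size with an unweighted pooled variance). It assumes a binary target, and `class_separation` and the toy data are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd


def class_separation(features: pd.DataFrame, target: pd.Series) -> pd.Series:
    """Rank numeric features by absolute standardized mean difference (binary target)."""
    classes = sorted(target.unique())
    if len(classes) != 2:
        raise ValueError("This sketch only handles a binary target")
    numeric = features.select_dtypes(include=[np.number])
    group_a = numeric[target == classes[0]]
    group_b = numeric[target == classes[1]]
    # Unweighted average of the two sample variances; a feature that is
    # constant within both classes would give a zero denominator here.
    pooled_std = np.sqrt((group_a.var(ddof=1) + group_b.var(ddof=1)) / 2)
    effect = (group_b.mean() - group_a.mean()).abs() / pooled_std
    return effect.sort_values(ascending=False)


# Toy data (made up): one feature separates the classes, the other does not
toy_features = pd.DataFrame({
    "separating": [0, 1, 2, 3, 10, 11, 12, 13],
    "overlapping": [0, 1, 2, 3, 0, 1, 2, 3],
})
toy_target = pd.Series([0, 0, 0, 0, 1, 1, 1, 1], name="label")
ranking = class_separation(toy_features, toy_target)
print(ranking)
```

Features near the top of the ranking differ most in their class means. This says nothing about non-linear or multi-feature patterns, so it is a rough screen rather than a feature selection method.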


## 7. Summary and Next Steps

### Key Findings:
1. **Dataset Size**: [To be filled based on actual data]
2. **Class Distribution**: [To be filled based on actual data]
3. **Feature Quality**: [To be filled based on actual data]
4. **Missing Values**: [To be filled based on actual data]

### Next Steps:
1. **Feature Engineering**: Consider creating new features based on domain knowledge
2. **Feature Selection**: Identify the most important features for classification
3. **Model Training**: Proceed to train baseline ML models
4. **Deep Learning**: Explore deep learning approaches for sequence data
5. **Hyperparameter Tuning**: Optimize model parameters for better performance
6. **Model Evaluation**: Implement comprehensive evaluation metrics
7. **Deployment**: Prepare models for production deployment


