 # PE Malware Classification Analysis - Student Lab Notebook



 **Course:** AI in Cybersecurity - Class 07

 **Instructor:** Steve Smith

 **Duration:** 3-4 hours



 ## Learning Objectives

 By the end of this lab, you will be able to:

 - Distinguish between static and dynamic malware analysis techniques

 - Load and explore PE file malware datasets for security analysis

 - Apply advanced feature engineering techniques for malware detection

 - Train and compare multiple machine learning models

 - Evaluate model performance using cybersecurity-appropriate metrics

 - Interpret results and provide actionable security recommendations

 ---

 ## Part 1: Understanding Malware Analysis Fundamentals 

 ### Exercise 1.1: Static vs. Dynamic Analysis Discussion



 **Before we start coding, let's understand the fundamentals:**



 #### Discussion Questions:

 1. **Static Analysis Advantages:**

    - Fast execution

    - Safe (no malware execution)

    - Can analyze file structure



 2. **Dynamic Analysis Advantages:**

    - Reveals runtime behavior

    - Can expose code hidden by packing or obfuscation once it runs



 **Write your thoughts below on when you would choose each approach:**

 **Your Analysis:**



 *Static Analysis is best for:*

 - [Write your answer here]



 *Dynamic Analysis is best for:*

 - [Write your answer here]



 *Hybrid approaches might include:*

 - [Write your answer here]

 ---

 ## Part 2: Data Loading and Initial Exploration

 ### Exercise 2.1: Setup and Data Loading

In [None]:
# Import all necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import RobustScaler
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
from sklearn.feature_selection import SelectKBest, f_classif, VarianceThreshold
import warnings

# Set up visualization style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
warnings.filterwarnings('ignore')

print("🔍 Malware Classification Analysis")
print("=" * 50)


In [None]:
# Load the dataset
# TODO: Replace 'dataset_malwares.csv' with the correct path to your dataset
df = pd.read_csv('dataset_malwares.csv')

# Display basic information about the dataset
print(f"Dataset shape: {df.shape}")
print(f"Features: {df.shape[1] - 1}")
print(f"Samples: {df.shape[0]:,}")


 ### Exercise 2.2: Initial Data Inspection

In [None]:
# Display first few rows
print("📊 First 5 rows of the dataset:")
df.head()


In [None]:
# Get basic information about the dataset
print("\n📋 Dataset Info:")
df.info()


In [None]:
# Check for missing values
missing_values = df.isnull().sum()
print(f"\n🔍 Missing values: {missing_values.sum()}")
print("\nMissing values per column:")
print(missing_values[missing_values > 0])


 ### Exercise 2.3: Target Variable Analysis

In [None]:
# Analyze target distribution
target_dist = df['Malware'].value_counts()
print(f"\n🎯 Target Distribution:")
print(f"Malware (1): {target_dist[1]:,} ({target_dist[1]/len(df)*100:.1f}%)")
print(f"Benign (0): {target_dist[0]:,} ({target_dist[0]/len(df)*100:.1f}%)")


In [None]:
# Create pie chart for target distribution
plt.figure(figsize=(8, 6))
# Sort by label (0, 1) so the slices line up with the labels and colors below
target_counts = df['Malware'].value_counts().sort_index()
colors = ['#2E8B57', '#DC143C']  # Green for benign, red for malware
plt.pie(target_counts.values, labels=['Benign', 'Malware'], autopct='%1.1f%%', 
        colors=colors, startangle=90)
plt.title('Dataset Distribution', fontweight='bold')
plt.show()


 **Analysis Questions:**

 1. Is this dataset balanced? **[Write your answer here]**

 2. What implications does this have for model training? **[Write your answer here]**

 3. What metrics should we prioritize? **[Write your answer here]**
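
 To make the metrics question concrete, the sketch below uses synthetic labels (not the lab dataset) with an assumed 90/10 class split and a baseline that always predicts the majority class. It only illustrates why accuracy on its own can mislead when classes are imbalanced.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels, assumed for illustration: 90% class 1, 10% class 0
y_demo = np.array([1] * 90 + [0] * 10)
X_demo = np.zeros((len(y_demo), 1))  # features are irrelevant for this baseline

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
y_hat = baseline.predict(X_demo)

accuracy = accuracy_score(y_demo, y_hat)
minority_recall = recall_score(y_demo, y_hat, pos_label=0)

print(f"Accuracy: {accuracy:.2f}")
print(f"Recall on the minority class: {minority_recall:.2f}")
```

 The baseline scores well on accuracy while never identifying a single minority-class sample, which is why per-class recall, precision, and AUC are worth reporting alongside it.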

 ---

 ## Part 3: Comprehensive Exploratory Data Analysis

 ### Exercise 3.1: PE Header Feature Analysis

In [None]:
# Analyze key PE header characteristics
def create_pe_analysis_plots():
    """Analyze key PE header characteristics"""
    
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    
    # Number of Sections Analysis
    sns.boxplot(data=df, x='Malware', y='NumberOfSections', ax=axes[0,0])
    axes[0,0].set_title('Number of Sections Distribution')
    axes[0,0].set_xlabel('Malware (0=Benign, 1=Malware)')
    
    # Image Size Analysis
    sns.boxplot(data=df, x='Malware', y='SizeOfImage', ax=axes[0,1])
    axes[0,1].set_title('Image Size Distribution')
    axes[0,1].set_xlabel('Malware (0=Benign, 1=Malware)')
    axes[0,1].set_yscale('log')
    
    # Entry Point Analysis
    sns.boxplot(data=df, x='Malware', y='AddressOfEntryPoint', ax=axes[1,0])
    axes[1,0].set_title('Entry Point Address Distribution')
    axes[1,0].set_xlabel('Malware (0=Benign, 1=Malware)')
    axes[1,0].set_yscale('log')
    
    # Entropy Analysis
    entropy_data = df[df['SectionMinEntropy'] > 0]
    sns.boxplot(data=entropy_data, x='Malware', y='SectionMinEntropy', ax=axes[1,1])
    axes[1,1].set_title('Section Minimum Entropy')
    axes[1,1].set_xlabel('Malware (0=Benign, 1=Malware)')
    
    plt.tight_layout()
    plt.show()

# Run the analysis
create_pe_analysis_plots()


 **Analysis Questions:**

 1. Do malware files tend to have more or fewer sections than benign files? **[Your answer]**

 2. What patterns do you observe in image sizes between classes? **[Your answer]**

 3. Why might entry point addresses differ between malware and benign files? **[Your answer]**

 4. What does entropy tell us about the content of file sections? **[Your answer]**
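
 As background for the entropy question, here is a minimal sketch of Shannon entropy over raw bytes. The `shannon_entropy` helper and the sample byte strings are hypothetical and written for this illustration; the dataset's entropy columns were presumably computed with a similar measure, but that is an assumption.

```python
import math
import os
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of a byte string in bits per byte (0 to 8)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    # Sum p * log2(1/p) over the byte values that actually occur
    return sum((c / total) * math.log2(total / c) for c in counts.values())

padding = b"\x00" * 1024          # a single repeated byte
uniform = bytes(range(256)) * 4   # every byte value equally often
random_like = os.urandom(4096)    # stands in for packed or encrypted content

print(f"padding:     {shannon_entropy(padding):.2f}")
print(f"uniform:     {shannon_entropy(uniform):.2f}")
print(f"random-like: {shannon_entropy(random_like):.2f}")
```

 Values near 0 indicate highly repetitive content, while values approaching 8 bits per byte are typical of compressed or encrypted data.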

 ### Exercise 3.2: Suspicious Features Investigation

In [None]:
# TODO: Analyze suspicious features
# Create a bar plot showing the mean values of suspicious features by class
suspicious_features = ['SuspiciousImportFunctions', 'SuspiciousNameSection']

# Hint: Use groupby('Malware').mean() and plot(kind='bar')
# Your code here:



 **Discussion:** Why are these features particularly important for malware detection?

 **[Write your analysis here]**
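
 One possible approach to the bar plot in this exercise is sketched below on a toy DataFrame. The values are invented for illustration; in the lab you would use `df` in place of `demo`.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy stand-in for df; the values are invented for illustration only
demo = pd.DataFrame({
    "Malware": [0, 0, 0, 1, 1, 1],
    "SuspiciousImportFunctions": [0, 1, 2, 4, 6, 8],
    "SuspiciousNameSection": [0, 0, 0, 1, 2, 0],
})
suspicious_features = ["SuspiciousImportFunctions", "SuspiciousNameSection"]

# Mean of each feature per class (rows: class, columns: feature)
class_means = demo.groupby("Malware")[suspicious_features].mean()

# Transpose so each feature gets one group of bars, with one bar per class
ax = class_means.T.plot(kind="bar", figsize=(8, 5), rot=0)
ax.set_ylabel("Mean value")
ax.set_title("Suspicious features by class (toy data)")
ax.legend(title="Malware")
plt.tight_layout()
plt.show()
```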

 ### Exercise 3.3: Feature Correlation Analysis

In [None]:
# Calculate top feature correlations
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
numeric_cols.remove('Malware')

# Rank features by absolute correlation with the target.
# The target itself is excluded so it does not show up as its own top "feature".
target_corr = df[numeric_cols].corrwith(df['Malware']).abs().sort_values(ascending=False)
top_features = target_corr.head(15).index.tolist()

print("Top 15 features by correlation with malware classification:")
for i, feature in enumerate(top_features, start=1):
    print(f"{i:2}. {feature}: {target_corr[feature]:.4f}")


In [None]:
# Create correlation heatmap
plt.figure(figsize=(12, 10))
correlation_matrix = df[top_features].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, 
            square=True, cbar_kws={'shrink': 0.8})
plt.title('Top 15 Features Correlation Matrix', fontweight='bold')
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()


In [None]:
# TODO: Create a horizontal bar plot of feature importance by correlation
# Hint: Use target_corr.head(15).plot(kind='barh')
# Your code here:



 **Key Questions:**

 1. Which features show the strongest correlation with malware classification? **[List top 3]**

 2. Are there any highly correlated features that might cause multicollinearity? **[Your analysis]**

 3. Which PE characteristics appear most discriminative? **[Your insights]**
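
 For the multicollinearity question, one way to list strongly correlated feature pairs is sketched below on synthetic data. The 0.9 cutoff is an assumption, not a course requirement; in the lab the same steps could be applied to `df[top_features]`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)

# Synthetic features: "b" is nearly a scaled copy of "a", "c" is independent
demo = pd.DataFrame({
    "a": base,
    "b": base * 2 + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})

corr = demo.corr().abs()
# Keep only the upper triangle so each pair appears once and the diagonal is dropped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # assumed cutoff; pick one that suits your data
high_pairs = [
    (row, col, round(float(upper.loc[row, col]), 3))
    for row in upper.index
    for col in upper.columns
    if upper.loc[row, col] > threshold
]
print(high_pairs)
```

 Pairs flagged this way are candidates for dropping one member or combining both into a single feature.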

 ---

 ## Part 4: Feature Engineering and Preprocessing 

 ### Exercise 4.1: Feature Selection and Cleaning

In [None]:
# Separate features and target
feature_columns = [col for col in df.columns if col not in ['Name', 'Malware']]
X = df[feature_columns].copy()
y = df['Malware'].copy()

print(f"Features selected: {len(feature_columns)}")
print(f"First 10 features: {feature_columns[:10]}")


In [None]:
# Handle missing values using median imputation
X = X.fillna(X.median())
print(f"Missing values after imputation: {X.isnull().sum().sum()}")


In [None]:
# Remove low-variance features
variance_threshold = VarianceThreshold(threshold=0.01)
X_variance_filtered = variance_threshold.fit_transform(X)
selected_features = variance_threshold.get_support()
feature_names = [name for name, keep in zip(feature_columns, selected_features) if keep]

print(f"Features after variance filtering: {len(feature_names)}")
# Keep the original index so X stays aligned with y
X = pd.DataFrame(X_variance_filtered, columns=feature_names, index=X.index)


 **Discussion:** Why do we remove low-variance features? What impact does this have on model performance?

 **[Write your answer here]**
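
 The small sketch below shows what `VarianceThreshold` does with the same 0.01 threshold used above. The column names and values are made up for illustration.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

demo = pd.DataFrame({
    "constant": [1, 1, 1, 1, 1, 1],            # variance 0
    "almost_constant": [0, 0, 0, 0, 0, 0.1],   # variance well below 0.01
    "informative": [1, 5, 2, 8, 3, 9],
})

selector = VarianceThreshold(threshold=0.01)
selector.fit(demo)

kept = demo.columns[selector.get_support()].tolist()
print("Variances:", dict(zip(demo.columns, selector.variances_.round(4))))
print("Kept:", kept)
```

 Note that raw variance depends on each feature's scale, so a fixed threshold can also discard a useful feature whose values simply span a small numeric range.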

 ### Exercise 4.2: Data Splitting and Scaling

In [None]:
# Split the data with stratification so both sets keep the original class ratio
# (test_size=0.2, random_state=42, stratify=y)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {X_train.shape[0]:,} samples")
print(f"Test set: {X_test.shape[0]:,} samples")

# Verify stratification worked
print(f"\nTraining set class distribution:")
print(y_train.value_counts(normalize=True))
print(f"\nTest set class distribution:")
print(y_test.value_counts(normalize=True))


In [None]:
# Scale features using RobustScaler (better for outliers)
scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Feature scaling completed using RobustScaler")
print(f"Training data shape: {X_train_scaled.shape}")
print(f"Test data shape: {X_test_scaled.shape}")


 **Key Concepts:**

 - Why use stratified splitting for imbalanced datasets? **[Your answer]**

 - Why choose RobustScaler over StandardScaler? **[Your answer]**
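
 To explore the scaler question, the sketch below compares the two scalers on a single synthetic feature containing one extreme outlier. The numbers are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# One synthetic feature with a single extreme outlier
values = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [1000.0]])

# StandardScaler centers on the mean and divides by the standard deviation
standard = StandardScaler().fit_transform(values).ravel()
# RobustScaler centers on the median and divides by the interquartile range
robust = RobustScaler().fit_transform(values).ravel()

print("StandardScaler:", standard.round(2))
print("RobustScaler:  ", robust.round(2))
```

 With `StandardScaler` the outlier inflates the mean and standard deviation, so the five ordinary values are squeezed together. `RobustScaler` keeps them spread out because the median and interquartile range are barely affected by the outlier.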

 ### Exercise 4.3: Advanced Feature Selection

In [None]:
# Statistical feature selection
selector_stats = SelectKBest(score_func=f_classif, k=30)
X_train_selected = selector_stats.fit_transform(X_train_scaled, y_train)
X_test_selected = selector_stats.transform(X_test_scaled)

selected_feature_indices = selector_stats.get_support(indices=True)
selected_feature_names = [feature_names[i] for i in selected_feature_indices]

print(f"Top 30 features selected: {selected_feature_names[:10]}...")
print(f"Training data shape after feature selection: {X_train_selected.shape}")


 ---

 ## Part 5: Machine Learning Model Training 

 ### Exercise 5.1: Model Selection and Training

In [None]:
# Define models for comparison
models = {
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'SVM': SVC(random_state=42, probability=True)
}

print("Models defined:")
for name in models.keys():
    print(f"  • {name}")


In [None]:
# Train and evaluate models
model_results = {}

for name, model in models.items():
    print(f"\nTraining {name}...")
    
    # Train model
    model.fit(X_train_selected, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test_selected)
    y_pred_proba = model.predict_proba(X_test_selected)[:, 1]
    
    # Calculate AUC score
    auc_score = roc_auc_score(y_test, y_pred_proba)
    
    # Store results
    model_results[name] = {
        'model': model,
        'predictions': y_pred,
        'probabilities': y_pred_proba,
        'auc_score': auc_score
    }
    
    print(f"{name} AUC Score: {auc_score:.4f}")


 **Understanding model.fit():**

 The `model.fit()` method is where the magic happens:

 - **Training:** Adjusts the model's parameters to capture patterns in the training data

 - **Learning:** Searches for parameter values that minimize an error (loss) function

 - **Preparation:** Once fitted, the model can make predictions on new data
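
 The following sketch walks through those three points on synthetic data generated with `make_classification` (not the lab dataset). It is only meant to show that learned parameters such as `coef_` exist after `fit()` and not before.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data, not the lab dataset
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

model = LogisticRegression(max_iter=1000)
fitted_before = hasattr(model, "coef_")  # learned weights do not exist yet
print("Has learned weights before fit():", fitted_before)

model.fit(X_tr, y_tr)  # estimates the weights from the training data
print("Learned weights:", model.coef_.round(2))

# The fitted model can now score data it has not seen
test_accuracy = model.score(X_te, y_te)
print("Test accuracy:", round(test_accuracy, 3))
```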

 ### Exercise 5.2: Model Performance Comparison

In [None]:
# Create performance comparison
print("\n🏆 Model Performance Ranking:")
for name, results in sorted(model_results.items(), key=lambda x: x[1]['auc_score'], reverse=True):
    print(f"   • {name}: AUC = {results['auc_score']:.4f}")


In [None]:
# TODO: Create a bar plot comparing AUC scores
# Hint: Extract auc_scores and model_names, then use plt.bar()
# Your code here:



 ### Exercise 5.3: Detailed Model Evaluation

In [None]:
# Find best performing model
best_model_name = max(model_results.keys(), key=lambda k: model_results[k]['auc_score'])
best_model = model_results[best_model_name]

print(f"Best performing model: {best_model_name}")
print(f"Best AUC score: {best_model['auc_score']:.4f}")


In [None]:
# Create ROC curves for all models
plt.figure(figsize=(10, 8))
plt.plot([0, 1], [0, 1], 'k--', alpha=0.5, label='Random Classifier')
colors = ['#FF6B6B', '#4ECDC4', '#45B7D1', '#96CEB4']

for i, (name, results) in enumerate(model_results.items()):
    fpr, tpr, _ = roc_curve(y_test, results['probabilities'])
    plt.plot(fpr, tpr, color=colors[i], linewidth=2, 
             label=f'{name} (AUC: {results["auc_score"]:.3f})')

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves Comparison', fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()


In [None]:
# TODO: Create confusion matrix for best model
# Hint: Use sklearn.metrics.confusion_matrix and seaborn.heatmap
# Your code here:



In [None]:
# Detailed classification report for best model
best_predictions = best_model['predictions']
report = classification_report(y_test, best_predictions, 
                             target_names=['Benign', 'Malware'])
print(f"\nDetailed Classification Report - {best_model_name}:")
print(report)


 **Key Evaluation Metrics for Cybersecurity:**

 - **Precision:** Of all positive predictions, how many were correct? (TP / (TP + FP))

 - **Recall:** Of all actual positives, how many were identified? (TP / (TP + FN))

 - **F1-Score:** Harmonic mean of precision and recall

 - **AUC-ROC:** Overall model discriminative ability
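
 The formulas above can be checked by hand against a confusion matrix. The sketch below uses made-up labels and predictions, then compares the manual results with scikit-learn's own metric functions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Made-up labels and predictions (1 = malware, 0 = benign)
y_true = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 1, 0, 0, 1, 0, 0, 0])

# For binary labels [0, 1], ravel() returns TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")

# Cross-check against scikit-learn's implementations
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```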

 **Analysis Questions:**

 1. Which model performed best and why? **[Your analysis]**

 2. What are the precision and recall trade-offs? **[Your analysis]**

 3. How would you choose a threshold for production deployment? **[Your analysis]**
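
 As one way to approach the threshold question, the sketch below picks the highest threshold that still meets a target recall. The labels, probabilities, and the 0.75 recall target are all assumptions made for illustration; in the lab you could pass `y_test` and the best model's probabilities instead.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Made-up labels and malware probabilities
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])
y_prob = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.75, 0.85, 0.95])

precision, recall, thresholds = precision_recall_curve(y_true, y_prob)

# precision and recall have one more entry than thresholds; drop the final point
target_recall = 0.75  # assumed operational requirement
meets_target = recall[:-1] >= target_recall

# Among thresholds that meet the recall target, take the highest one
chosen = thresholds[meets_target].max()
y_pred = (y_prob >= chosen).astype(int)

print(f"Chosen threshold: {chosen:.2f}")
print(f"Files flagged as malware: {y_pred.sum()} of {len(y_pred)}")
```

 Raising the threshold generally reduces false positives at the cost of recall, so the target should come from the operational cost of a missed detection versus a false alarm.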

 ### Exercise 5.4: Feature Importance Analysis

In [None]:
# Feature importance for Random Forest
if 'Random Forest' in model_results:
    rf_model = model_results['Random Forest']['model']
    feature_importance = pd.DataFrame({
        'feature': selected_feature_names,
        'importance': rf_model.feature_importances_
    }).sort_values('importance', ascending=False)
    
    print("Top 15 Most Important Features (Random Forest):")
    print(feature_importance.head(15))


In [None]:
# TODO: Create a horizontal bar plot of top 15 feature importances
# Your code here:



 ---

 ## Part 6: Hyperparameter Tuning

 ### Exercise 6.1: Advanced Model Optimization

In [None]:
# Hyperparameter tuning for best model
if best_model_name == 'Random Forest':
    print("Tuning Random Forest hyperparameters...")
    
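    # Parameter grid for Random Forest
    # (the values below are illustrative starting points, not tuned choices)
    param_grid = {
        'n_estimators': [100, 200],
        'max_depth': [None, 10, 20],
        'min_samples_split': [2, 5],
        'min_samples_leaf': [1, 2]
    }
    
    # Grid search with 3-fold cross-validation, scored by ROC AUC
    grid_search = GridSearchCV(
        RandomForestClassifier(random_state=42),
        param_grid, cv=3, scoring='roc_auc', n_jobs=-1
    )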
    
    # Fit grid search
    print("Starting grid search (this may take a few minutes)...")
    grid_search.fit(X_train_selected, y_train)
    
    print(f"Best parameters: {grid_search.best_params_}")
    print(f"Best CV score: {grid_search.best_score_:.4f}")


In [None]:
# Train final optimized model
if 'grid_search' in locals():
    final_model = grid_search.best_estimator_
    final_predictions = final_model.predict(X_test_selected)
    final_probabilities = final_model.predict_proba(X_test_selected)[:, 1]
    final_auc = roc_auc_score(y_test, final_probabilities)
    
    print(f"Final optimized model AUC: {final_auc:.4f}")
    print(f"Improvement over baseline: {final_auc - best_model['auc_score']:.4f}")


 **Hyperparameter Concepts:**

 - **Parameters:** Learned during training (weights, biases)

 - **Hyperparameters:** Set before training (learning rate, tree depth)

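 - **Grid Search:** Exhaustive search over every combination of the listed hyperparameter values

 - **Cross-Validation:** Scores each candidate on several held-out folds, which gives a more reliable performance estimate than a single split and helps detect overfitting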
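
 The sketch below ties these four concepts together on synthetic data. It uses logistic regression with a small, assumed grid for `C` so that it runs quickly; it is separate from the Random Forest tuning in Exercise 6.1.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data, not the lab dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Hyperparameter: C (inverse regularization strength) is chosen before training
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=3,               # 3-fold cross-validation for every candidate
    scoring="roc_auc",
)
search.fit(X_demo, y_demo)

print("Best hyperparameters:", search.best_params_)
print("Mean CV AUC:", round(search.best_score_, 4))
# Parameters: the coefficients learned by the refit best model
print("Learned coefficients:", search.best_estimator_.coef_.round(2))
```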

 ---

 ## Part 7: Results Analysis and Cybersecurity Implications

 ### Exercise 7.1: Comprehensive Results Summary

In [None]:
# Print comprehensive results summary
print("\n📋 Malware Classification Analysis Results")
print("=" * 50)

print(f"🎯 Dataset Overview:")
print(f"   • Total samples: {len(df):,}")
print(f"   • Malware samples: {target_dist[1]:,} ({target_dist[1]/len(df)*100:.1f}%)")
print(f"   • Benign samples: {target_dist[0]:,} ({target_dist[0]/len(df)*100:.1f}%)")
print(f"   • Features analyzed: {len(feature_names)}")

print(f"\n🏆 Model Performance Ranking:")
for name, results in sorted(model_results.items(), key=lambda x: x[1]['auc_score'], reverse=True):
    print(f"   • {name}: AUC = {results['auc_score']:.4f}")


In [None]:
# TODO: Complete the analysis with your findings
print(f"\n🔍 Key Security Findings:")
print(f"   • Best performing model: {best_model_name}")
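# Report the majority class from the data instead of hard-coding it
majority_label = 'malware' if target_dist[1] > target_dist[0] else 'benign'
print(f"   • Majority class: {majority_label} ({target_dist.max()/len(df)*100:.1f}% of samples)")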
# Add more findings based on your analysis

print(f"\n💡 Production Deployment Recommendations:")
print(f"   • Deploy {best_model_name} for initial implementation")
# Add more recommendations based on your analysis


 ### Exercise 7.2: Security Operations Discussion

In [None]:
# TODO: Create a prediction probability distribution plot
# Show how well the model separates malware from benign files
# Hint: Use histograms of prediction probabilities for each class
# Your code here:



 **Critical Questions for Production Deployment:**



 1. **False Positive Impact:** How would false positives affect business operations?

    **[Your analysis]**



 2. **False Negative Risk:** What is the cost of missing actual malware?

    **[Your analysis]**



 3. **Model Drift:** How do we detect when new malware families emerge?

    **[Your analysis]**



 4. **Explainability:** Can we explain why a file was classified as malware?

    **[Your analysis]**



 5. **Performance:** What are the real-time classification requirements?

    **[Your analysis]**
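
 For the model drift question, one option among several is to compare a feature's distribution at training time with its distribution in production using a two-sample Kolmogorov-Smirnov test. The sketch below assumes SciPy is installed, uses synthetic values, and treats the 0.01 significance level as an arbitrary choice.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# Synthetic stand-ins for one feature: training-time values vs. later samples
train_values = rng.normal(loc=5.0, scale=1.0, size=1000)
same_values = rng.normal(loc=5.0, scale=1.0, size=1000)     # no drift
shifted_values = rng.normal(loc=6.0, scale=1.0, size=1000)  # simulated drift

alpha = 0.01  # assumed significance level for raising an alert
results = {}
for label, sample in [("same", same_values), ("shifted", shifted_values)]:
    results[label] = ks_2samp(train_values, sample)
    drifted = results[label].pvalue < alpha
    print(f"{label}: KS={results[label].statistic:.3f} "
          f"p={results[label].pvalue:.3g} drift={drifted}")
```

 A check like this only flags that inputs have changed; deciding whether the model needs retraining still requires labeled samples or analyst review.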

 ### Exercise 7.3: Feature Engineering Innovation (Optional)

In [None]:
# TODO: Create new engineered features
# Try creating ratios, combinations, or statistical measures from existing features
# Examples:
# - Ratio of suspicious imports to total imports
# - Entropy variance across sections  
# - Section size distribution statistics

# Example starter code:
# df['sections_per_mb'] = df['NumberOfSections'] / (df['SizeOfImage'] / 1000000)
# df['import_density'] = df['DirectoryEntryImportSize'] / df['SizeOfImage']

# Your innovative features here:



 ---

 ## Assignment Submission Checklist

 ### Assignment #04: Feature Engineering & Visualization (15 points)

 **Due: August 12, 11:59 PM ET**



 **Before submitting, ensure you have:**



 - [ ] **Complete Jupyter Notebook** with all sections filled out

 - [ ] **Quality Visualizations** using seaborn and matplotlib (boxplots, heatmaps, ROC curves)

 - [ ] **Feature Analysis** identifying and discussing top 10 most important features

 - [ ] **Model Comparison** with performance metrics and analysis

 - [ ] **Security Discussion** on production deployment implications

 - [ ] **Code Comments** explaining your analysis decisions

 - [ ] **Clear Conclusions** answering all discussion questions



 **Grading Breakdown:**

 - Code Quality (4 pts): Clean, commented, functional code

 - Visualizations (4 pts): Clear, insightful, well-labeled plots

 - Feature Analysis (3 pts): Identification and interpretation of key features

 - Model Evaluation (2 pts): Appropriate metrics and comparison

 - Security Insights (2 pts): Real-world implications discussion

 ---

 ## Final Reflection

 **Complete this reflection on your learning:**



 1. **What was the most surprising finding in your analysis?**

    [Your answer here]



 2. **Which PE file features were most important for malware detection and why?**

    [Your answer here]



 3. **How would you improve this analysis for a production security system?**

    [Your answer here]



 4. **What additional data sources might enhance malware detection?**

    [Your answer here]



 5. **How does this lab connect to real-world cybersecurity operations?**

    [Your answer here]

 ---

 ## Congratulations! 🎉



 You've completed a comprehensive malware classification analysis using machine learning. The skills you've developed here are directly applicable to:



 - **Security Operations Centers (SOCs):** Automated threat detection

 - **Threat Intelligence:** Malware family classification

 - **Incident Response:** Rapid malware identification

 - **Security Product Development:** AV/EDR solutions



 **Next Steps:**

 - Review your analysis and ensure all sections are complete

 - Double-check your visualizations and interpretations

 - Submit your completed notebook for grading

 - Continue exploring advanced topics in adversarial AI and security automation



 **Great work! You're now ready to apply AI to cybersecurity challenges! 🛡️🤖**