# Aircraft Modification Data Preprocessing

This notebook demonstrates comprehensive data preprocessing for aircraft modification descriptions, including:
- Environment setup and data loading
- Text cleaning and normalization
- Feature extraction using TF-IDF and SBERT
- Aviation-specific pattern recognition
- Data visualization and analysis

**Goal**: Prepare aircraft modification data for machine learning models that will classify modifications, map regulations, and predict certification requirements.

## 1. Environment Setup and Library Imports

First, let's install and import all required libraries for text processing, machine learning, and visualization.

In [None]:
# Install required packages (uncomment if running for the first time)
# !pip install pandas numpy scikit-learn nltk spacy transformers sentence-transformers
# !pip install plotly seaborn matplotlib streamlit faiss-cpu
# !python -m spacy download en_core_web_sm

import warnings
warnings.filterwarnings('ignore')

# Core libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Text processing
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.chunk import ne_chunk
from nltk.tag import pos_tag

# Machine learning
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Deep learning and transformers
try:
    from sentence_transformers import SentenceTransformer
    import torch
    TRANSFORMERS_AVAILABLE = True
except ImportError:
    print("sentence-transformers (or torch) is not installed. Embedding features will be skipped.")
    TRANSFORMERS_AVAILABLE = False

# Utility
import os
import sys
import pickle
from collections import Counter, defaultdict
from datetime import datetime
import json

# Add utils to path
sys.path.append('../utils')

print("✅ All libraries imported successfully!")
print(f"🤖 Transformers available: {TRANSFORMERS_AVAILABLE}")
print(f"📊 Pandas version: {pd.__version__}")
print(f"🔢 NumPy version: {np.__version__}")

In [None]:
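# Download required NLTK data.
# 'punkt_tab' is included because recent NLTK releases look for it instead of 'punkt'.
nltk_downloads = ['punkt', 'punkt_tab', 'stopwords', 'wordnet',
                  'averaged_perceptron_tagger', 'maxent_ne_chunker', 'words']
nltk_categories = ['tokenizers', 'corpora', 'taggers', 'chunkers']

def nltk_resource_available(name):
    """Return True if the resource is found under any NLTK data category."""
    for category in nltk_categories:
        try:
            nltk.data.find(f'{category}/{name}')
            return True
        except LookupError:
            continue
    return False

# Download only the resources that are not already present locally
for dataset in nltk_downloads:
    if not nltk_resource_available(dataset):
        nltk.download(dataset, quiet=True)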

# Set up matplotlib and seaborn
plt.style.use('default')
sns.set_palette("husl")

# Configuration
np.random.seed(42)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 100)

print("✅ NLTK data downloaded and configurations set!")
print("📈 Visualization libraries configured")

## 2. Data Loading and Initial Exploration

Load the aircraft modification dataset and perform initial exploration to understand the data structure, quality, and characteristics.
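
Before exploring, it can help to confirm that the columns the later cells rely on are actually present. The sketch below is one way to do that; `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names introduced here, and the column list is inferred from the cells in this notebook rather than from a documented schema.

```python
# Columns read by later cells in this notebook (inferred, not an official schema).
REQUIRED_COLUMNS = (
    "mod_description", "mod_type", "loi",
    "aircraft_type", "regulations", "approval_date",
)

def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent, in declaration order."""
    present = set(columns)
    return [name for name in required if name not in present]

# Example with a plain list of header names.
example_header = ["mod_description", "mod_type", "regulations"]
print(missing_columns(example_header))
```

In the notebook itself the same helper could be called as `missing_columns(df.columns)` right after loading, raising or warning if the returned list is not empty.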

In [None]:
# Load the aircraft modification dataset
data_path = '../data/mods_dataset.csv'

try:
    df = pd.read_csv(data_path)
    print("✅ Dataset loaded successfully!")
    print(f"📊 Dataset shape: {df.shape}")
    print(f"📋 Columns: {list(df.columns)}")
except FileNotFoundError:
    print("❌ Dataset not found. Let's create sample data...")
    # Generate sample data if the file doesn't exist
    from generate_sample_data import SampleDataGenerator
    
    generator = SampleDataGenerator()
    df = generator.generate_dataset(100)
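    # Make sure the target directory exists before writing the sample file
    os.makedirs(os.path.dirname(data_path), exist_ok=True)
    df.to_csv(data_path, index=False)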
    print("✅ Sample dataset created and saved!")

# Display basic information
print("\n" + "="*50)
print("DATASET OVERVIEW")
print("="*50)

print(f"Total modifications: {len(df)}")
print(f"Date range: {df['approval_date'].min()} to {df['approval_date'].max()}")
print(f"Unique modification types: {df['mod_type'].nunique()}")
print(f"Unique aircraft types: {df['aircraft_type'].nunique()}")

# Display first few rows
print("\n📋 First 5 rows:")
display(df.head())

# Data types and missing values
print("\n📊 Data Info:")
df.info()  # info() prints directly and returns None, so it is not wrapped in print()

In [None]:
# Statistical overview
print("📈 STATISTICAL OVERVIEW")
print("="*50)

# Missing values
missing_values = df.isnull().sum()
print("Missing values per column:")
for col, missing in missing_values.items():
    if missing > 0:
        print(f"  {col}: {missing} ({missing/len(df)*100:.1f}%)")
    else:
        print(f"  {col}: ✅ No missing values")

print("\n🏷️ Categorical Variables:")
categorical_cols = ['mod_type', 'loi', 'aircraft_type']
for col in categorical_cols:
    if col in df.columns:
        print(f"\n{col}:")
        value_counts = df[col].value_counts()
        for value, count in value_counts.head(10).items():
            print(f"  {value}: {count} ({count/len(df)*100:.1f}%)")

# Text length analysis
print("\n📝 Text Analysis:")
df['description_length'] = df['mod_description'].str.len()
df['word_count'] = df['mod_description'].str.split().str.len()
df['regulation_count'] = df['regulations'].str.split(',').str.len()

text_stats = {
    'Description Length (chars)': df['description_length'].describe(),
    'Word Count': df['word_count'].describe(),
    'Regulation Count': df['regulation_count'].describe()
}

for stat_name, stats in text_stats.items():
    print(f"\n{stat_name}:")
    print(f"  Mean: {stats['mean']:.1f}")
    print(f"  Median: {stats['50%']:.1f}")
    print(f"  Min: {stats['min']:.0f}, Max: {stats['max']:.0f}")
    print(f"  Std: {stats['std']:.1f}")

In [None]:
# Create visualizations for data exploration
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=('Modification Types', 'Level of Involvement', 
                   'Text Length Distribution', 'Regulations per Modification'),
    specs=[[{"type": "xy"}, {"type": "xy"}],
           [{"type": "xy"}, {"type": "xy"}]]
)

# Modification types
mod_counts = df['mod_type'].value_counts()
fig.add_trace(
    go.Bar(x=mod_counts.values, y=mod_counts.index, orientation='h', 
           name='Mod Types', showlegend=False),
    row=1, col=1
)

# Level of Involvement
loi_counts = df['loi'].value_counts()
colors = ['green', 'orange', 'red']
fig.add_trace(
    go.Bar(x=loi_counts.index, y=loi_counts.values,
           marker=dict(color=colors[:len(loi_counts)]),
           name='LOI', showlegend=False),
    row=1, col=2
)

# Text length distribution
fig.add_trace(
    go.Histogram(x=df['description_length'], nbinsx=20,
                name='Text Length', showlegend=False),
    row=2, col=1
)

# Regulations per modification
fig.add_trace(
    go.Histogram(x=df['regulation_count'], nbinsx=10,
                name='Regulation Count', showlegend=False),
    row=2, col=2
)

fig.update_layout(height=800, title_text="Aircraft Modification Dataset Overview")
fig.show()

# Word cloud of modification descriptions (if wordcloud available)
try:
    from wordcloud import WordCloud
    
    # Combine all descriptions (skip missing values so join() only sees strings)
    all_text = ' '.join(df['mod_description'].dropna().astype(str))
    
    # Create word cloud
    wordcloud = WordCloud(width=800, height=400, 
                         background_color='white',
                         max_words=100).generate(all_text)
    
    plt.figure(figsize=(12, 6))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title('Most Common Words in Modification Descriptions', fontsize=16)
    plt.tight_layout()
    plt.show()
    
except ImportError:
    print("WordCloud not available. Install with: pip install wordcloud")

## 3. Text Preprocessing and Cleaning

Implement comprehensive text preprocessing specifically designed for aircraft modification descriptions, including:
- Text cleaning and normalization
- Aviation-specific pattern recognition
- Tokenization and lemmatization
- Stop word removal with domain-specific additions
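
One detail matters when writing patterns for aviation-specific pattern recognition: `re.findall` returns only the captured group when a pattern contains exactly one capturing group, so a pattern like `(CS|AMC)\s*...` yields just the prefix. The sketch below contrasts a capturing and a non-capturing version of a regulation pattern; the sample sentence is illustrative only.

```python
import re

# Capturing group: findall returns only what the group matched.
capturing = r'\b(CS|AMC)\s*[\d\-\.]+\b'
# Non-capturing group (?:...): findall returns the whole match.
non_capturing = r'\b(?:CS|AMC)\s*[\d\-\.]+\b'

text = "Antenna installation per CS 25.1309 and AMC 20-151."

prefixes_only = re.findall(capturing, text)
full_references = re.findall(non_capturing, text)

print(prefixes_only)
print(full_references)
```

The same applies to unit patterns such as measurements or voltages: with a capturing group around the unit, only the unit is returned and the numeric value is lost.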

In [None]:
class AviationTextPreprocessor:
    """Advanced text preprocessor for aviation domain"""
    
    def __init__(self):
        self.lemmatizer = WordNetLemmatizer()
        self.stemmer = PorterStemmer()
        
        # Standard stop words + aviation-specific terms
        self.stop_words = set(stopwords.words('english'))
        aviation_stopwords = {
            'aircraft', 'airplane', 'flight', 'aviation', 'system', 'systems',
            'installation', 'installed', 'modify', 'modification', 'mod',
            'equipment', 'component', 'device', 'unit', 'assembly', 'new',
            'improved', 'enhanced', 'advanced', 'latest', 'current'
        }
        self.stop_words.update(aviation_stopwords)
        
        # Aviation-specific regex patterns.
        # Groups are non-capturing (?:...) so that re.findall returns the full
        # match (e.g. 'CS 25.1309') instead of only the group (e.g. 'CS').
        self.patterns = {
            'regulations': r'\b(?:CS|AMC)\s*[\d\-\.]+\b',
            'part_numbers': r'\b[A-Z]{2,4}[\d\-]{3,10}\b',
            'aircraft_models': r'\b(?:A\d{3}|B\d{3}|ATR|CRJ|ERJ)\w*\b',
            'measurements': r'\d+\s*(?:mm|cm|m|ft|in|kg|lb|psi|bar|kts|mach)\b',
            'frequencies': r'\d+\s*(?:hz|khz|mhz|ghz)\b',
            'voltages': r'\d+\s*(?:v|volt|volts|vdc|vac)\b'
        }
        
        # Technical term mapping
        self.tech_terms = {
            'vhf': 'very_high_frequency',
            'uhf': 'ultra_high_frequency',
            'gps': 'global_positioning_system',
            'ils': 'instrument_landing_system',
            'tcas': 'traffic_collision_avoidance_system',
            'egpws': 'enhanced_ground_proximity_warning_system'
        }
    
    def extract_aviation_entities(self, text):
        """Extract aviation-specific entities from text"""
        entities = {}
        
        for entity_type, pattern in self.patterns.items():
            matches = re.findall(pattern, text, re.IGNORECASE)
            entities[entity_type] = list(set(matches))
        
        return entities
    
    def clean_text(self, text):
        """Comprehensive text cleaning for aviation domain"""
        if not isinstance(text, str):
            return ""
        
        # Convert to lowercase
        text = text.lower()
        
        # Replace technical abbreviations with full terms
        for abbrev, full_term in self.tech_terms.items():
            text = re.sub(rf'\b{abbrev}\b', full_term, text)
        
        # Normalize whitespace
        text = re.sub(r'\s+', ' ', text)
        
        # Remove URLs and email addresses
        text = re.sub(r'http[s]?://\S+', '', text)
        text = re.sub(r'\S+@\S+', '', text)
        
        # Keep only letters, numbers, and important punctuation
        text = re.sub(r'[^\w\s\-\.\(\)/]', ' ', text)
        
        # Remove extra spaces
        text = ' '.join(text.split())
        
        return text.strip()
    
    def tokenize_and_process(self, text, use_lemmatization=True):
        """Tokenize and process text with domain-specific handling"""
        # Clean text first
        cleaned = self.clean_text(text)
        
        # Tokenize
        tokens = word_tokenize(cleaned)
        
        # Filter tokens
        processed_tokens = []
        for token in tokens:
            # Skip tokens that are too short or consist only of punctuation
            if len(token) < 2 or all(ch in string.punctuation for ch in token):
                continue
            
            # Skip stop words
            if token.lower() in self.stop_words:
                continue
            
            # Apply lemmatization or stemming
            if use_lemmatization:
                token = self.lemmatizer.lemmatize(token)
            else:
                token = self.stemmer.stem(token)
            
            processed_tokens.append(token)
        
        return processed_tokens
    
    def analyze_text_complexity(self, text):
        """Analyze text complexity and characteristics"""
        sentences = sent_tokenize(text)
        words = word_tokenize(text)
        
        analysis = {
            'sentence_count': len(sentences),
            'word_count': len(words),
            'avg_sentence_length': len(words) / len(sentences) if sentences else 0,
            'unique_words': len(set(word.lower() for word in words)),
            'lexical_diversity': len(set(word.lower() for word in words)) / len(words) if words else 0,
            'avg_word_length': np.mean([len(word) for word in words]) if words else 0
        }
        
        # Extract aviation entities
        entities = self.extract_aviation_entities(text)
        analysis['aviation_entities'] = entities
        
        return analysis

# Initialize preprocessor
preprocessor = AviationTextPreprocessor()

print("✅ Aviation Text Preprocessor initialized!")
print("🔧 Features available:")
print("  - Text cleaning and normalization")
print("  - Aviation-specific entity extraction")
print("  - Domain-aware tokenization")
print("  - Text complexity analysis")

In [None]:
# Test preprocessing on sample texts
sample_texts = [
    "Installation of a new VHF antenna on the dorsal fuselage affecting structural and avionics systems according to CS 25.1309.",
    "Retrofit of LED cabin lighting system replacing existing fluorescent lights with 28VDC power supply.",
    "Modification of TCAS II system for enhanced collision avoidance using AMC 20-151 guidelines.",
    "Integration of GPS/WAAS navigation system in A320 aircraft for RNAV approaches per AMC 20-115."
]

print("🧪 PREPROCESSING DEMONSTRATION")
print("="*60)

for i, text in enumerate(sample_texts, 1):
    print(f"\n📝 Sample {i}:")
    print(f"Original: {text}")
    
    # Clean text
    cleaned = preprocessor.clean_text(text)
    print(f"Cleaned:  {cleaned}")
    
    # Tokenize
    tokens = preprocessor.tokenize_and_process(text)
    print(f"Tokens:   {tokens[:10]}...")  # Show first 10 tokens
    
    # Extract entities
    entities = preprocessor.extract_aviation_entities(text)
    print(f"Entities: {entities}")
    
    # Analyze complexity
    analysis = preprocessor.analyze_text_complexity(text)
    print(f"Analysis: Words={analysis['word_count']}, "
          f"Unique={analysis['unique_words']}, "
          f"Diversity={analysis['lexical_diversity']:.2f}")
    
    print("-" * 40)

In [None]:
# Apply preprocessing to the entire dataset
print("🔄 Processing entire dataset...")

# Create processed versions of the text
df['description_cleaned'] = df['mod_description'].apply(preprocessor.clean_text)
df['description_tokens'] = df['mod_description'].apply(preprocessor.tokenize_and_process)
df['token_count'] = df['description_tokens'].apply(len)

# Extract aviation entities for each modification
print("🔍 Extracting aviation entities...")
aviation_entities = []
for desc in df['mod_description']:
    entities = preprocessor.extract_aviation_entities(desc)
    aviation_entities.append(entities)

df['aviation_entities'] = aviation_entities

# Analyze text complexity
print("📊 Analyzing text complexity...")
complexity_data = []
for desc in df['mod_description']:
    analysis = preprocessor.analyze_text_complexity(desc)
    complexity_data.append(analysis)

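# Convert to separate columns, aligned on the main DataFrame's index
complexity_df = pd.DataFrame(complexity_data, index=df.index)
# 'word_count' and 'aviation_entities' already exist in df; drop them here so the
# concat does not create duplicate column names that break later df[...] lookups
complexity_df = complexity_df.drop(
    columns=[col for col in complexity_df.columns if col in df.columns]
)
df = pd.concat([df, complexity_df], axis=1)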

# Summary statistics
print("\n📈 PREPROCESSING RESULTS")
print("="*50)
print(f"✅ Processed {len(df)} modifications")
print(f"📝 Average tokens per description: {df['token_count'].mean():.1f}")
print(f"🔤 Average word length: {df['avg_word_length'].mean():.1f}")
print(f"📚 Lexical diversity range: {df['lexical_diversity'].min():.2f} - {df['lexical_diversity'].max():.2f}")

# Entity extraction summary
entity_types = ['regulations', 'part_numbers', 'aircraft_models', 'measurements']
print(f"\n🏷️ ENTITY EXTRACTION SUMMARY")
for entity_type in entity_types:
    count = sum(1 for entities in aviation_entities if entities.get(entity_type))
    percentage = count / len(df) * 100
    print(f"  {entity_type}: {count} modifications ({percentage:.1f}%)")

# Save processed data
processed_path = '../data/mods_processed.csv'
df.to_csv(processed_path, index=False)
print(f"\n💾 Processed data saved to: {processed_path}")
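
Note that list and dict columns such as `description_tokens` and `aviation_entities` are written to CSV as their string representation, so they come back as plain strings when the file is read again. The following is a minimal sketch of that round trip using only the standard library; it assumes the cell values contain nothing but Python literals, and the sample row is made up for illustration.

```python
import ast
import csv
import io

# A row shaped like the processed dataset: one list column, one dict column.
row = {
    "description_tokens": ["retrofit", "led", "cabin", "lighting"],
    "aviation_entities": {"regulations": ["CS 25.1309"], "voltages": ["28VDC"]},
}

# Writing to CSV stores str(value) for non-string cells.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)

# Reading back yields plain strings ...
buffer.seek(0)
loaded = next(csv.DictReader(buffer))
loaded_types = {key: type(value).__name__ for key, value in loaded.items()}

# ... which ast.literal_eval turns back into the original objects.
restored = {key: ast.literal_eval(value) for key, value in loaded.items()}
print(loaded_types)
print(restored == row)
```

With pandas, the same idea can be applied at load time by passing `converters={'description_tokens': ast.literal_eval, 'aviation_entities': ast.literal_eval}` to `pd.read_csv`.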

## 4. Feature Extraction and Vectorization

Create numerical features from the processed text using multiple approaches:
- TF-IDF vectorization for traditional ML models
- Sentence embeddings using SBERT for semantic similarity
- Custom feature engineering for aviation domain
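
To make the TF-IDF step less of a black box, here is a small pure-Python sketch of the weighting. It assumes the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation, which is how scikit-learn's `TfidfVectorizer` behaves with its default settings as far as we are aware. The three toy documents are illustrative, and the whitespace tokenizer is a simplification of the real vectorizer.

```python
import math
from collections import Counter

docs = [
    "vhf antenna installation",
    "led cabin lighting retrofit",
    "vhf radio antenna upgrade",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Return L2-normalised TF-IDF weights for one document (smoothed IDF)."""
    counts = Counter(tokens)
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

weights = tfidf(tokenized[0])
for term, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{term}: {weight:.3f}")
```

In the first document, `installation` occurs in only one document and therefore receives a higher weight than `vhf` or `antenna`, which each occur in two.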

In [None]:
# TF-IDF Vectorization
print("🔢 TF-IDF VECTORIZATION")
print("="*50)

# Prepare text data for TF-IDF
texts_for_tfidf = df['description_cleaned'].tolist()

# Create TF-IDF vectorizer with optimized parameters for aviation domain
tfidf_vectorizer = TfidfVectorizer(
    max_features=2000,          # Limit vocabulary size
    ngram_range=(1, 2),         # Use unigrams and bigrams
    min_df=2,                   # Ignore terms that appear in less than 2 documents
    max_df=0.8,                 # Ignore terms that appear in more than 80% of documents
    stop_words='english',       # Use English stop words
    lowercase=True,
    strip_accents='ascii',
    token_pattern=r'\b[a-zA-Z][a-zA-Z0-9_]{2,}\b'  # Custom token pattern
)

# Fit and transform the text data
print("🔄 Fitting TF-IDF vectorizer...")
tfidf_matrix = tfidf_vectorizer.fit_transform(texts_for_tfidf)

print(f"✅ TF-IDF matrix shape: {tfidf_matrix.shape}")
print(f"📊 Vocabulary size: {len(tfidf_vectorizer.vocabulary_)}")
print(f"💾 Matrix sparsity: {(1 - tfidf_matrix.nnz / (tfidf_matrix.shape[0] * tfidf_matrix.shape[1])) * 100:.1f}%")

# Get feature names
feature_names = tfidf_vectorizer.get_feature_names_out()

# Find most important features
feature_importance = np.array(tfidf_matrix.sum(axis=0)).flatten()
top_features_idx = np.argsort(feature_importance)[::-1][:20]

print(f"\n🔝 Top 20 TF-IDF features:")
for i, idx in enumerate(top_features_idx, 1):
    print(f"  {i:2d}. {feature_names[idx]} (score: {feature_importance[idx]:.2f})")

# Save TF-IDF components (pickle and os are already imported in the setup cell)
tfidf_path = '../models/tfidf_vectorizer.pkl'
os.makedirs('../models', exist_ok=True)

with open(tfidf_path, 'wb') as f:
    pickle.dump(tfidf_vectorizer, f)

print(f"\n💾 TF-IDF vectorizer saved to: {tfidf_path}")

In [None]:
# Sentence Embeddings using SBERT
print("\n🤖 SENTENCE EMBEDDINGS (SBERT)")
print("="*50)

if TRANSFORMERS_AVAILABLE:
    try:
        # Load sentence transformer model
        print("🔄 Loading sentence transformer model...")
        model_name = 'all-MiniLM-L6-v2'  # Lightweight but effective model
        sentence_model = SentenceTransformer(model_name)
        
        # Generate embeddings for all descriptions
        print("🔄 Generating sentence embeddings...")
        sentence_embeddings = sentence_model.encode(
            df['description_cleaned'].tolist(),
            show_progress_bar=True,
            convert_to_numpy=True,
            batch_size=32
        )
        
        print(f"✅ Sentence embeddings shape: {sentence_embeddings.shape}")
        print(f"📏 Embedding dimension: {sentence_embeddings.shape[1]}")
        
        # Save embeddings
        embeddings_path = '../models/sentence_embeddings.npy'
        np.save(embeddings_path, sentence_embeddings)
        
        print(f"💾 Sentence embeddings saved to: {embeddings_path}")
        
        # Calculate embedding statistics
        embedding_stats = {
            'mean': np.mean(sentence_embeddings),
            'std': np.std(sentence_embeddings),
            'min': np.min(sentence_embeddings),
            'max': np.max(sentence_embeddings)
        }
        
        print(f"\n📊 Embedding statistics:")
        for stat, value in embedding_stats.items():
            print(f"  {stat}: {value:.4f}")
        
        # Visualize embeddings using t-SNE (sample for performance)
        if len(sentence_embeddings) > 100:
            sample_idx = np.random.choice(len(sentence_embeddings), 100, replace=False)
            sample_embeddings = sentence_embeddings[sample_idx]
            sample_labels = df.iloc[sample_idx]['mod_type'].values
        else:
            sample_embeddings = sentence_embeddings
            sample_labels = df['mod_type'].values
        
        print(f"\n🎨 Creating t-SNE visualization with {len(sample_embeddings)} samples...")
        tsne = TSNE(n_components=2, random_state=42, perplexity=min(30, len(sample_embeddings)-1))
        embeddings_2d = tsne.fit_transform(sample_embeddings)
        
        # Create interactive plot
        fig = px.scatter(
            x=embeddings_2d[:, 0], 
            y=embeddings_2d[:, 1],
            color=sample_labels,
            title="t-SNE Visualization of Sentence Embeddings",
            labels={'x': 't-SNE 1', 'y': 't-SNE 2', 'color': 'Modification Type'}
        )
        fig.update_traces(marker_size=8)
        fig.show()
        
    except Exception as e:
        print(f"❌ Error generating sentence embeddings: {e}")
        sentence_embeddings = None
        
else:
    print("⚠️ Sentence transformers not available. Skipping sentence embeddings.")
    sentence_embeddings = None

In [None]:
# Custom Feature Engineering for Aviation Domain
print("\n🛠️ CUSTOM FEATURE ENGINEERING")
print("="*50)

def create_aviation_features(df):
    """Create domain-specific features for aircraft modifications"""
    
    features = pd.DataFrame(index=df.index)
    
    # Text-based features
    features['text_length'] = df['mod_description'].str.len()
    features['word_count'] = df['word_count']
    features['sentence_count'] = df['sentence_count']
    features['avg_word_length'] = df['avg_word_length']
    features['lexical_diversity'] = df['lexical_diversity']
    features['token_count'] = df['token_count']
    
    # Aviation entity features
    features['has_regulations'] = df['aviation_entities'].apply(
        lambda x: len(x.get('regulations', [])) > 0
    )
    features['regulation_count'] = df['aviation_entities'].apply(
        lambda x: len(x.get('regulations', []))
    )
    features['has_part_numbers'] = df['aviation_entities'].apply(
        lambda x: len(x.get('part_numbers', [])) > 0
    )
    features['has_aircraft_models'] = df['aviation_entities'].apply(
        lambda x: len(x.get('aircraft_models', [])) > 0
    )
    features['has_measurements'] = df['aviation_entities'].apply(
        lambda x: len(x.get('measurements', [])) > 0
    )
    
    # Keyword-based features (binary indicators and counts).
    # description_cleaned has abbreviations expanded by clean_text, so 'gps'
    # appears there as 'global_positioning_system'; both forms are listed.
    aviation_keywords = {
        'safety_related': ['emergency', 'safety', 'evacuation', 'fire', 'oxygen', 'escape'],
        'avionics_related': ['radio', 'radar', 'navigation', 'gps', 'global_positioning_system',
                             'antenna', 'communication'],
        'structural_related': ['wing', 'fuselage', 'door', 'frame', 'structural', 'reinforcement'],
        'system_related': ['hydraulic', 'fuel', 'air', 'conditioning', 'pump', 'valve'],
        'cabin_related': ['passenger', 'seat', 'galley', 'lavatory', 'lighting', 'entertainment'],
        'propulsion_related': ['engine', 'thrust', 'propulsion', 'turbine', 'combustor']
    }
    
    for category, keywords in aviation_keywords.items():
        # Match whole words (with an optional plural "s") so that 'air' does not
        # fire on 'aircraft' or 'repair', while 'door' still matches 'doors'
        pattern = re.compile(r'\b(?:' + '|'.join(re.escape(k) for k in keywords) + r')s?\b')
        counts = df['description_cleaned'].apply(
            lambda text: len(pattern.findall(text.lower()))
        )
        features[f'has_{category}'] = counts > 0
        features[f'count_{category}'] = counts
    
    # Complexity features
    features['technical_density'] = (
        features['regulation_count'] + 
        features['count_avionics_related'] + 
        features['count_system_related']
    ) / features['word_count']
    
    # Modification urgency indicators
    urgency_keywords = ['critical', 'urgent', 'immediate', 'mandatory', 'required']
    features['urgency_score'] = df['description_cleaned'].apply(
        lambda text: sum(text.lower().count(keyword) for keyword in urgency_keywords)
    )
    
    return features

# Generate custom features
print("🔄 Generating aviation-specific features...")
aviation_features = create_aviation_features(df)

print(f"✅ Created {aviation_features.shape[1]} custom features")
print(f"📊 Feature summary:")

# Display feature statistics
feature_stats = aviation_features.describe()
print(feature_stats.round(2))

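# Correlation analysis
print(f"\n🔗 Feature correlations with modification type:")
# Encode mod_type for correlation. LabelEncoder assigns arbitrary integer codes to
# nominal categories, so treat these values as a rough screen, not an effect size.
le = LabelEncoder()
mod_type_encoded = le.fit_transform(df['mod_type'])

correlations = []
# Include the boolean indicator columns as well (np.number alone excludes them)
for col in aviation_features.select_dtypes(include=[np.number, 'bool']).columns:
    corr = np.corrcoef(aviation_features[col].astype(float), mod_type_encoded)[0, 1]
    if not np.isnan(corr):
        correlations.append((col, abs(corr)))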

# Sort by absolute correlation
correlations.sort(key=lambda x: x[1], reverse=True)

print("Top 10 features correlated with modification type:")
for i, (feature, corr) in enumerate(correlations[:10], 1):
    print(f"  {i:2d}. {feature}: {corr:.3f}")

# Save custom features
features_path = '../models/aviation_features.csv'
aviation_features.to_csv(features_path, index=False)
print(f"\n💾 Custom features saved to: {features_path}")

## 5. Multi-label Encoding for Regulations

Prepare regulation data for multi-label classification, where each modification can be associated with multiple regulations.
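
As a rough illustration of what multi-label encoding produces, the sketch below builds a 0/1 matrix by hand: one column per distinct regulation (sorted), one row per modification. `multi_hot` is a hypothetical helper written for this example, and the regulation strings are illustrative; the notebook itself uses scikit-learn's `MultiLabelBinarizer` for this step.

```python
def multi_hot(label_lists):
    """Encode lists of labels as 0/1 rows over the sorted set of all labels."""
    classes = sorted({label for labels in label_lists for label in labels})
    index = {label: i for i, label in enumerate(classes)}
    rows = []
    for labels in label_lists:
        row = [0] * len(classes)
        for label in labels:
            row[index[label]] = 1
        rows.append(row)
    return classes, rows

# Three modifications: two regulations, one regulation, and none.
regs = [["CS 25.1309", "CS 25.1431"], ["CS 25.1309"], []]
classes, matrix = multi_hot(regs)

print(classes)
for row in matrix:
    print(row)
```

A modification with no regulations becomes an all-zero row, which is worth keeping in mind when interpreting the sparsity of the encoded matrix.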

In [None]:
# Multi-label Encoding for Regulations
print("🏷️ MULTI-LABEL REGULATION ENCODING")
print("="*50)

# Parse regulations from comma-separated strings
def parse_regulations(reg_string):
    """Parse comma-separated regulation string into list"""
    if pd.isna(reg_string) or reg_string == '':
        return []
    
    regulations = [reg.strip() for reg in reg_string.split(',')]
    # Clean up regulations (remove empty strings)
    regulations = [reg for reg in regulations if reg]
    return regulations

# Apply parsing to all regulations
df['regulation_list'] = df['regulations'].apply(parse_regulations)

# Analyze regulation distribution
all_regulations = []
for reg_list in df['regulation_list']:
    all_regulations.extend(reg_list)

regulation_counts = Counter(all_regulations)
print(f"📊 Total unique regulations: {len(regulation_counts)}")
print(f"📈 Total regulation instances: {len(all_regulations)}")
print(f"🔢 Average regulations per modification: {len(all_regulations) / len(df):.1f}")

# Display most common regulations
print(f"\n🔝 Top 15 most common regulations:")
for i, (regulation, count) in enumerate(regulation_counts.most_common(15), 1):
    percentage = count / len(df) * 100
    print(f"  {i:2d}. {regulation}: {count} ({percentage:.1f}%)")

# Create multi-label binary encoding
print(f"\n🔄 Creating multi-label binary encoding...")

mlb = MultiLabelBinarizer()
regulation_binary = mlb.fit_transform(df['regulation_list'])

print(f"✅ Binary encoding shape: {regulation_binary.shape}")
print(f"📋 Encoded regulations: {len(mlb.classes_)}")

# Create DataFrame with regulation columns
regulation_df = pd.DataFrame(
    regulation_binary, 
    columns=[f"reg_{reg.replace(' ', '_').replace('.', '_')}" for reg in mlb.classes_],
    index=df.index
)

print(f"📊 Regulation matrix sparsity: {(1 - regulation_binary.sum() / regulation_binary.size) * 100:.1f}%")

# Analyze regulation co-occurrence
print(f"\n🔗 Regulation co-occurrence analysis:")

# Calculate pairwise correlations for top regulations
top_regs = regulation_counts.most_common(10)
top_reg_names = [reg for reg, _ in top_regs]

# Create correlation matrix for top regulations
top_reg_cols = [f"reg_{reg.replace(' ', '_').replace('.', '_')}" for reg in top_reg_names]
available_cols = [col for col in top_reg_cols if col in regulation_df.columns]

if len(available_cols) > 1:
    corr_matrix = regulation_df[available_cols].corr()
    
    # Plot correlation heatmap
    plt.figure(figsize=(12, 8))
    sns.heatmap(
        corr_matrix, 
        annot=True, 
        cmap='coolwarm', 
        center=0,
        square=True,
        fmt='.2f'
    )
    plt.title('Regulation Co-occurrence Correlation Matrix')
    plt.xticks(rotation=45, ha='right')
    plt.yticks(rotation=0)
    plt.tight_layout()
    plt.show()

# Save multi-label encoding components
mlb_path = '../models/regulation_mlb.pkl'
with open(mlb_path, 'wb') as f:
    pickle.dump(mlb, f)

regulation_binary_path = '../models/regulation_binary.npy'
np.save(regulation_binary_path, regulation_binary)

print(f"\n💾 Multi-label binarizer saved to: {mlb_path}")
print(f"💾 Binary regulation matrix saved to: {regulation_binary_path}")

# Add regulation features to main dataframe
df['total_regulations'] = df['regulation_list'].apply(len)
df['regulation_diversity'] = df['regulation_list'].apply(lambda x: len(set(x)))

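print(f"\n📋 Regulation statistics added to main DataFrame:")
# Print each summary on its own lines; embedding describe() in an f-string
# produces a misaligned multi-line block
print("Total regulations per mod:")
print(df['total_regulations'].describe())
print("Regulation diversity per mod:")
print(df['regulation_diversity'].describe())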

## 6. Summary and Next Steps

### What We've Accomplished

✅ **Data Loading & Exploration**
- Loaded and analyzed aircraft modification dataset
- Explored data distribution and characteristics
- Identified key patterns and statistics

✅ **Text Preprocessing**
- Implemented aviation-specific text cleaning
- Created domain-aware tokenization and lemmatization
- Extracted aviation entities (regulations, part numbers, etc.)

✅ **Feature Engineering**
- Generated TF-IDF vectors for traditional ML models
- Created sentence embeddings using SBERT for semantic similarity
- Built custom aviation-domain features
- Implemented multi-label encoding for regulations

✅ **Data Preparation**
- Processed text for machine learning models
- Created multiple feature representations
- Saved preprocessed data and model components

### Generated Assets

📁 **Saved Files:**
- `../data/mods_processed.csv` - Preprocessed dataset
- `../models/tfidf_vectorizer.pkl` - TF-IDF vectorizer
- `../models/sentence_embeddings.npy` - SBERT embeddings
- `../models/aviation_features.csv` - Custom features
- `../models/regulation_mlb.pkl` - Multi-label binarizer
- `../models/regulation_binary.npy` - Binary regulation matrix

### Next Steps

🎯 **Ready for Model Development:**

1. **Modification Classification** (`2_mod_classification.ipynb`)
   - Use TF-IDF + custom features
   - Train Random Forest and Logistic Regression models
   - Evaluate classification performance

2. **Regulation Mapping** (`3_regulation_mapping.ipynb`)
   - Multi-label classification for regulation prediction
   - Use binary relevance and multi-label KNN approaches
   - Evaluate using Hamming loss and Precision@K

3. **LOI Prediction** (`4_loi_prediction.ipynb`)
   - Predict Level of Involvement using all features
   - Try Decision Tree, Random Forest, and XGBoost
   - Analyze feature importance

4. **Similarity Search** (`5_similarity_search.ipynb`)
   - Build FAISS index from sentence embeddings
   - Implement cosine similarity search
   - Create similarity ranking system
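
The cosine similarity ranking mentioned in step 4 can be sketched in a few lines of plain Python. This is only an illustration of the scoring idea on toy vectors; `cosine_similarity` and `top_k` are hypothetical helpers, and a real index over the SBERT embeddings would use a vectorised library such as FAISS instead.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query, vectors, k=2):
    """Return (index, score) pairs of the k vectors most similar to the query."""
    scored = [(cosine_similarity(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, score) for score, i in scored[:k]]

# Toy 3-dimensional "embeddings"; real SBERT vectors have hundreds of dimensions.
toy_embeddings = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
results = top_k([1.0, 0.0, 0.0], toy_embeddings)
print(results)
```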

The preprocessing pipeline is now complete and ready for model development! 🚀