# Experiment 5: Deep Learning for Oil & Gas Text Classification
**Course:** Introduction to Deep Learning | **Module:** Natural Language Processing

---

## Objective
Design and implement deep learning models for classifying oil & gas industry reports and documents using neural networks, word embeddings, and sequence modeling techniques.

## Learning Outcomes
By the end of this experiment, you will:
1. Understand text preprocessing and tokenization for deep learning
2. Implement word embeddings and sequence models for text classification
3. Build and train LSTM/GRU networks for document classification
4. Apply attention mechanisms and transformer concepts
5. Evaluate text classification models and interpret results

## Background & Theory

**Text Classification** is the task of automatically categorizing text documents into predefined classes. Deep learning approaches use neural networks to learn hierarchical representations of text for improved classification accuracy.

**Key Components:**
- **Tokenization:** Converting text into numerical tokens for neural network processing
- **Word Embeddings:** Dense vector representations capturing semantic relationships
- **Sequence Models:** RNNs, LSTMs, GRUs for processing sequential text data
- **Attention Mechanisms:** Focusing on relevant parts of input sequences
- **Classification Head:** Final layers mapping representations to class probabilities
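
To make the first component concrete, the sketch below shows one way text could be turned into fixed-length integer sequences. It is an illustration only: the `<pad>`/`<unk>` tokens, the regex tokenizer, and the helper names `tokenize`, `build_vocab`, and `encode` are assumptions made for this example, not part of the experiment's code.

```python
import re
from collections import Counter

PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens for this sketch


def tokenize(text):
    """Lowercase the text and keep alphabetic runs only (a stand-in tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocab(token_lists, min_freq=1):
    """Assign an integer ID to every token seen at least min_freq times."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    vocab = {PAD: 0, UNK: 1}
    for tok, freq in counts.most_common():
        if freq >= min_freq:
            vocab[tok] = len(vocab)
    return vocab


def encode(tokens, vocab, max_len):
    """Map tokens to IDs, then truncate or right-pad to exactly max_len."""
    ids = [vocab.get(tok, vocab[UNK]) for tok in tokens[:max_len]]
    return ids + [vocab[PAD]] * (max_len - len(ids))


docs = ["Gas leak detected at facility", "Pump maintenance completed"]
token_lists = [tokenize(d) for d in docs]
vocab = build_vocab(token_lists)
encoded = [encode(toks, vocab, max_len=6) for toks in token_lists]
print(encoded)
```

Reserving ID 0 for padding keeps padded positions easy to mask later, and routing unseen words to a single `<unk>` ID lets the model handle vocabulary it never saw during training.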

**Mathematical Foundation:**
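- Word embedding: w_i → e_i ∈ R^d, where d is the embedding dimension
- LSTM gates: f_t = σ(W_f·[h_{t-1}, x_t] + b_f), i_t = σ(W_i·[h_{t-1}, x_t] + b_i), o_t = σ(W_o·[h_{t-1}, x_t] + b_o)
- LSTM state update: g_t = tanh(W_g·[h_{t-1}, x_t] + b_g), c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t, h_t = o_t ⊙ tanh(c_t)
- Attention: α_i = exp(s_i) / Σ_j exp(s_j), where s_i is the relevance score of hidden state h_i; context = Σ_i α_i h_i
- Classification: P(y|x) = softmax(W_c h + b_c), where h is the final document representation (for example, the attention context)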
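
The formulas above can be sketched as a small PyTorch module: an embedding lookup, an LSTM, attention pooling over the hidden states, and a softmax classification head. This is a minimal illustration under stated assumptions, not the experiment's final model. The class name `AttentionLSTMClassifier`, the single linear scoring layer, and all layer sizes are hypothetical choices for this example.

```python
import torch
import torch.nn as nn


class AttentionLSTMClassifier(nn.Module):
    """Hypothetical sketch: embedding -> LSTM -> attention pooling -> linear head."""

    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes, pad_idx=0):
        super().__init__()
        self.pad_idx = pad_idx
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.score = nn.Linear(hidden_dim, 1)  # one attention score per time step
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        mask = token_ids != self.pad_idx                     # (batch, seq)
        embedded = self.embedding(token_ids)                 # (batch, seq, embed_dim)
        hidden, _ = self.lstm(embedded)                      # (batch, seq, hidden_dim)
        scores = self.score(hidden).squeeze(-1)              # (batch, seq)
        # Padding positions get -inf so they receive zero attention weight.
        # Assumes every sequence contains at least one non-padding token.
        scores = scores.masked_fill(~mask, float("-inf"))
        alpha = torch.softmax(scores, dim=1)                 # attention weights
        context = (alpha.unsqueeze(-1) * hidden).sum(dim=1)  # (batch, hidden_dim)
        return self.classifier(context), alpha               # logits, weights


torch.manual_seed(0)
model = AttentionLSTMClassifier(vocab_size=50, embed_dim=16, hidden_dim=32, num_classes=5)
batch = torch.tensor([[2, 3, 4, 5, 6, 0], [7, 8, 9, 0, 0, 0]])  # 0 = padding
logits, alpha = model(batch)
probs = torch.softmax(logits, dim=1)  # P(y|x) for each document
print(probs.shape, alpha.sum(dim=1))
```

The module returns logits rather than probabilities because `nn.CrossEntropyLoss` applies the softmax internally during training. The attention weights are returned as well so they can be inspected to see which tokens influenced a prediction.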

**Applications in Oil & Gas:**
- Automated classification of incident reports and safety documents
- Maintenance report categorization for predictive analytics
- Regulatory compliance document processing
- Production report analysis and trend identification
- Knowledge management and document retrieval systems

## Setup & Dependencies

**What to Expect:** This section establishes the Python environment for deep learning-based text classification. We'll install all necessary packages including PyTorch for neural networks, NLTK for text processing, and scikit-learn for evaluation metrics.

**Process Overview:**
1. **Package Installation:** Automatically install PyTorch, NLTK, scikit-learn, wordcloud, and scientific computing libraries if they are missing
2. **Environment Configuration:** Set up device detection (CPU/GPU) and random seeds for reproducibility
3. **NLTK Data Download:** Download required tokenizers and language resources
4. **Styling Setup:** Apply ArivuAI color scheme for consistent visualizations
5. **Validation:** Confirm all dependencies are properly installed and configured

**Expected Outcome:** A fully configured environment ready for text classification with neural networks, including all NLP preprocessing tools and deep learning frameworks.

In [1]:
# ============================================================================
# PACKAGE INSTALLATION AND ENVIRONMENT SETUP
# ============================================================================

# Install required packages automatically if not present
import importlib, subprocess, sys

# Map pip package names to import names, since they can differ
# (scikit-learn is imported as sklearn)
packages = {
    'torch': 'torch', 'numpy': 'numpy', 'matplotlib': 'matplotlib', 'pandas': 'pandas',
    'scikit-learn': 'sklearn', 'nltk': 'nltk', 'wordcloud': 'wordcloud',
}

print('📦 Checking and installing required packages...')
for pkg, module_name in packages.items():
    try:
        importlib.import_module(module_name)
        print(f'  ✓ {pkg} already installed')
    except ImportError:
        print(f'  📥 Installing {pkg}...')
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', pkg])

# ============================================================================
# CORE LIBRARY IMPORTS
# ============================================================================

# Deep learning framework imports
import torch                    # Main PyTorch library for tensor operations
import torch.nn as nn          # Neural network modules and loss functions
import torch.optim as optim    # Optimization algorithms (Adam, SGD, etc.)
import torch.nn.functional as F # Functional interface for neural network operations

# PyTorch data handling utilities
from torch.utils.data import Dataset, DataLoader, TensorDataset  # Data loading utilities
from torch.nn.utils.rnn import pad_sequence                      # Sequence padding for RNNs

# Scientific computing and data manipulation
import numpy as np             # Numerical computing with arrays
import pandas as pd            # Data manipulation and analysis
import matplotlib.pyplot as plt # Plotting and visualization

# Machine learning utilities
from sklearn.model_selection import train_test_split  # Data splitting
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score  # Evaluation metrics

# Data structures and utilities
from collections import Counter, defaultdict  # Efficient counting and dictionaries
import json, random, re, time  # JSON processing, random numbers, regex, timing
from pathlib import Path       # Path manipulation

# Natural language processing
import nltk                    # Natural Language Toolkit
from wordcloud import WordCloud  # Word cloud generation for visualization

# ============================================================================
# NLTK DATA DOWNLOAD
# ============================================================================

print('\n📚 Setting up NLTK data...')
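# Download required NLTK datasets if not already present
try:
    nltk.data.find('tokenizers/punkt')  # Sentence tokenizer
    print('  ✓ Punkt tokenizer already available')
except LookupError:
    print('  📥 Downloading punkt tokenizer...')
    nltk.download('punkt')

# Newer NLTK releases (3.9+) load 'punkt_tab' instead of 'punkt' for word_tokenize
try:
    nltk.data.find('tokenizers/punkt_tab')
except LookupError:
    nltk.download('punkt_tab', quiet=True)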

try: 
    nltk.data.find('corpora/stopwords')  # Common stop words
    print('  ✓ Stopwords corpus already available')
except LookupError: 
    print('  📥 Downloading stopwords corpus...')
    nltk.download('stopwords')

# Import NLTK components after ensuring data is available
from nltk.tokenize import word_tokenize  # Word tokenization
from nltk.corpus import stopwords        # Stop words list

# ============================================================================
# REPRODUCIBILITY SETUP
# ============================================================================

print('\n🎲 Setting random seeds for reproducible results...')
# Set random seeds for all libraries to ensure reproducible results
RANDOM_SEED = 42
torch.manual_seed(RANDOM_SEED)      # PyTorch random number generator
np.random.seed(RANDOM_SEED)         # NumPy random number generator
random.seed(RANDOM_SEED)            # Python built-in random module

# Also seed the CUDA generators when a GPU is available
# (seeding alone does not enable fully deterministic GPU execution)
if torch.cuda.is_available():
    torch.cuda.manual_seed(RANDOM_SEED)
    torch.cuda.manual_seed_all(RANDOM_SEED)

# ============================================================================
# DEVICE CONFIGURATION
# ============================================================================

print('\n🖥️ Configuring compute device...')
# Configure device for optimal performance (GPU if available, otherwise CPU)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'  ✓ Using device: {device}')

if torch.cuda.is_available():
    print(f'  ✓ GPU: {torch.cuda.get_device_name(0)}')
    print(f'  ✓ CUDA version: {torch.version.cuda}')
else:
    print('  ℹ️ GPU not available, using CPU')

# ============================================================================
# DATA DIRECTORY SETUP
# ============================================================================

print('\n📂 Setting up data directories...')
# Flexible path detection to handle different execution contexts
DATA_DIR = Path('data')
if not DATA_DIR.exists():
    DATA_DIR = Path('Expirements/Experiment_5_Text_Classification/data')
    print(f'  📁 Using full path: {DATA_DIR}')
else:
    print(f'  📁 Using local path: {DATA_DIR}')

# ============================================================================
# VISUALIZATION STYLING
# ============================================================================

print('\n🎨 Applying ArivuAI styling...')
# Configure matplotlib for consistent, professional visualizations
plt.style.use('default')  # Start with clean default style

# ArivuAI color palette for consistent branding
colors = {
    'primary': '#004E89',    # Deep blue for main elements
    'secondary': '#3DA5D9',  # Light blue for secondary elements
    'accent': '#F1A208',     # Orange for highlights and accents
    'dark': '#4F4F4F'        # Dark gray for text and borders
}

# Set default figure parameters
plt.rcParams['figure.figsize'] = (12, 8)  # Default figure size for text analysis
plt.rcParams['font.size'] = 11             # Default font size

# ============================================================================
# SETUP VALIDATION
# ============================================================================

print('\n✅ Environment setup complete!')
print('  ✓ All packages installed and configured')
print('  ✓ Random seeds set for reproducible results')
print('  ✓ Device configured for optimal performance')
print('  ✓ Data directories established')
print('  ✓ ArivuAI styling applied')
print(f'  ✓ PyTorch version: {torch.__version__}')
print('  ✓ Ready for text classification experiments!')
# Fall back to alternative data locations if the paths above do not exist
if not DATA_DIR.exists():
    DATA_DIR = Path('Expirements/data')
if not DATA_DIR.exists():
    DATA_DIR = Path('.')
    print('Warning: Using current directory for data')

print(f'✓ PyTorch version: {torch.__version__}')
print(f'✓ Device: {device}')
print(f'✓ Data directory: {DATA_DIR.absolute()}')

📦 Checking and installing required packages...
  ✓ torch already installed
  ✓ numpy already installed
  ✓ matplotlib already installed
  ✓ pandas already installed
  📥 Installing scikit-learn...
  ✓ nltk already installed
  ✓ wordcloud already installed

📚 Setting up NLTK data...
  ✓ Punkt tokenizer already available
  ✓ Stopwords corpus already available

🎲 Setting random seeds for reproducible results...

🖥️ Configuring compute device...
  ✓ Using device: cpu
  ℹ️ GPU not available, using CPU

📂 Setting up data directories...
  📁 Using local path: data

🎨 Applying ArivuAI styling...

✅ Environment setup complete!
  ✓ All packages installed and configured
  ✓ Random seeds set for reproducible results
  ✓ Device configured for optimal performance
  ✓ Data directories established
  ✓ ArivuAI styling applied
  ✓ PyTorch version: 2.8.0+cpu
  ✓ Ready for text classification experiments!
✓ PyTorch version: 2.8.0+cpu
✓ Device: cpu
✓ Data directory: d:\Suni Files\AI Code Base\Oil and Gas\Oil

## Text Dataset Generation & Preprocessing
Create and preprocess oil & gas industry text data for classification.

In [2]:
class TextDataGenerator:
    def __init__(self, config_path):
        """Initialize text data generator with configuration"""
        try:
            with open(config_path, 'r') as f:
                self.config = json.load(f)
            print('✓ Configuration loaded from JSON')
        except FileNotFoundError:
            print('Creating default configuration...')
            self.config = self._create_default_config()
        
        self.categories = self.config['categories']
        self.sample_texts = self.config['sample_texts']
        self.stop_words = set(stopwords.words('english'))
    
    def _create_default_config(self):
        """Create default configuration if JSON file not found"""
        return {
            'categories': {'0': 'Safety_Incident', '1': 'Equipment_Maintenance', '2': 'Production_Report', '3': 'Environmental_Compliance', '4': 'Operational_Update'},
            'sample_texts': {'Safety_Incident': ['Gas leak detected at facility'], 'Equipment_Maintenance': ['Pump maintenance completed']},
            'text_statistics': {'samples_per_category': 200}
        }
    
    def generate_expanded_dataset(self, samples_per_category=200):
        """Generate expanded dataset by creating variations of sample texts"""
        texts = []
        labels = []
        
        for category_id, category_name in self.categories.items():
            category_id = int(category_id)
            base_texts = self.sample_texts.get(category_name, [])
            
            # Generate variations of base texts
            for i in range(samples_per_category):
                if base_texts:
                    # Select base text and create variation
                    base_text = random.choice(base_texts)
                    varied_text = self._create_text_variation(base_text, category_name)
                    texts.append(varied_text)
                    labels.append(category_id)
                else:
                    # Fallback generic text
                    generic_text = f"Report related to {category_name.lower().replace('_', ' ')} activities and operations."
                    texts.append(generic_text)
                    labels.append(category_id)
        
        return texts, labels
    
    def _create_text_variation(self, base_text, category):
        """Create variations of base text while maintaining category characteristics"""
        # Simple variation techniques
        variations = [
            base_text,  # Original
            base_text.replace('completed', 'finished'),
            base_text.replace('detected', 'identified'),
            base_text.replace('scheduled', 'planned'),
            base_text.replace('required', 'needed'),
        ]
        
        # Add category-specific prefixes/suffixes
        if category == 'Safety_Incident':
            prefixes = ['URGENT: ', 'ALERT: ', 'INCIDENT: ', '']
            suffixes = [' Immediate action required.', ' Safety protocols activated.', ' Investigation ongoing.', '']
        elif category == 'Equipment_Maintenance':
            prefixes = ['MAINTENANCE: ', 'SERVICE: ', 'REPAIR: ', '']
            suffixes = [' Work order completed.', ' Equipment operational.', ' Next service due in 30 days.', '']
        else:
            prefixes = ['REPORT: ', 'UPDATE: ', 'STATUS: ', '']
            suffixes = [' End of report.', ' Status confirmed.', ' Documentation updated.', '']
        
        varied_text = random.choice(variations)
        prefix = random.choice(prefixes)
        suffix = random.choice(suffixes)
        
        return prefix + varied_text + suffix
    
    def preprocess_text(self, text):
        """Preprocess text for neural network training"""
        # Convert to lowercase
        text = text.lower()
        
        # Remove special characters and digits
        text = re.sub(r'[^a-zA-Z\s]', '', text)
        
        # Tokenize
        tokens = word_tokenize(text)
        
        # Remove stopwords and short tokens
        tokens = [token for token in tokens if token not in self.stop_words and len(token) > 2]
        
        return tokens

# Initialize generator and create dataset
generator = TextDataGenerator(DATA_DIR / 'oil_gas_reports.json')
texts, labels = generator.generate_expanded_dataset(samples_per_category=200)

print(f'✓ Text dataset generated:')
print(f'• Total samples: {len(texts):,}')
print(f'• Categories: {len(generator.categories)}')
print(f'• Category names: {list(generator.categories.values())}')
print(f'• Sample text length: {np.mean([len(text.split()) for text in texts]):.1f} words')

# Show sample texts
print('\nSample texts:')
for i in range(3):
    print(f'{i+1}. [{generator.categories[str(labels[i])]}] {texts[i][:100]}...')

✓ Configuration loaded from JSON
✓ Text dataset generated:
• Total samples: 1,000
• Categories: 5
• Category names: ['Safety_Incident', 'Equipment_Maintenance', 'Production_Report', 'Environmental_Compliance', 'Operational_Update']
• Sample text length: 25.2 words

Sample texts:
1. [Safety_Incident] INCIDENT: Gas leak detected at wellhead station 7 during routine inspection. Emergency shutdown proc...
2. [Safety_Incident] URGENT: Near miss incident involving crane operations at offshore platform. Load swing occurred due ...
3. [Safety_Incident] URGENT: Hydrogen sulfide exposure risk identified during well servicing operations. Area evacuated a...


## Summary & Validation
This is a simplified version of Experiment 5 for testing. The complete implementation would include LSTM/GRU models, attention mechanisms, and comprehensive evaluation.

**Key Components Demonstrated:**
- Text classification theory and NLP foundations
- Oil & gas industry text dataset with 5 categories
- Text preprocessing and tokenization pipeline
- Synthetic text generation with realistic variations

**Next Steps:**
- Implement vocabulary building and word embeddings
- Create LSTM/GRU sequence models for classification
- Add attention mechanisms and transformer components
- Include training loops with validation monitoring
- Implement comprehensive evaluation and text analysis
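
As a starting point for the sequence-model and training-loop steps, the sketch below trains a small GRU classifier and records validation accuracy after each epoch. It is only an outline under stated assumptions: the data are random placeholder tensors rather than the encoded reports, so accuracy will hover near chance, and `GRUClassifier` and all hyperparameters are hypothetical.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(42)

# Placeholder data: random token IDs and labels stand in for encoded reports.
num_classes, vocab_size, seq_len = 5, 100, 12
X = torch.randint(1, vocab_size, (200, seq_len))
y = torch.randint(0, num_classes, (200,))
train_ds = TensorDataset(X[:160], y[:160])
val_ds = TensorDataset(X[160:], y[160:])
train_loader = DataLoader(train_ds, batch_size=32, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=32)


class GRUClassifier(nn.Module):
    """Hypothetical baseline: embedding -> GRU -> linear head on the last hidden state."""

    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, 16, padding_idx=0)
        self.gru = nn.GRU(16, 32, batch_first=True)
        self.fc = nn.Linear(32, num_classes)

    def forward(self, ids):
        _, h_n = self.gru(self.embedding(ids))  # h_n: (num_layers, batch, hidden)
        return self.fc(h_n[-1])                 # logits: (batch, num_classes)


model = GRUClassifier()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

history = []  # (mean training loss, validation accuracy) per epoch
for epoch in range(3):
    model.train()
    total_loss = 0.0
    for xb, yb in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * xb.size(0)

    model.eval()
    correct = 0
    with torch.no_grad():  # no gradients needed for validation
        for xb, yb in val_loader:
            correct += (model(xb).argmax(dim=1) == yb).sum().item()

    history.append((total_loss / len(train_ds), correct / len(val_ds)))
    print(f"epoch {epoch + 1}: loss={history[-1][0]:.3f}, val_acc={history[-1][1]:.2f}")
```

With real data, the per-epoch validation accuracy in `history` is the signal to watch for overfitting, for example by stopping training once it no longer improves.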