# Phase 1: Data Preparation & AI-Enhanced Description Generation

This notebook implements the first phase of our incident classification project:
1. Load and analyze the Saber Categories dataset
2. Generate AI-enhanced descriptions using OpenAI/Ollama
3. Prepare data for embedding generation and FAISS indexing
4. Create train/test splits with proper stratification

## Project Structure
- **Raw Data**: Original ticket categories with hierarchy
- **AI Enhancement**: Rich semantic descriptions for better embeddings
- **Output**: Prepared dataset ready for embedding generation

In [1]:
# Import required libraries
import sys
import os
sys.path.append('../src')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import yaml
from dotenv import load_dotenv

# Import custom modules
from data_processor import DataProcessor
from ai_agent import AIAgent

# Load environment variables
load_dotenv('../.env')

# Set style for plots
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("✅ All libraries imported successfully")
print(f"📂 Current working directory: {os.getcwd()}")

✅ All libraries imported successfully
📂 Current working directory: c:\Users\ASUS\Classification\notebooks


## 1. Load and Explore Dataset

Let's start by loading our Saber Categories dataset and understanding its structure.

In [2]:
# Initialize data processor
processor = DataProcessor(config_path='../config/config.yaml')

# Load the cleaned dataset
df = processor.load_data('../Saber Categories-1.csv')

print(f"📊 Dataset shape: {df.shape}")
print(f"📋 Columns: {list(df.columns)}")
print("\n" + "="*50)
print("📈 Dataset Info:")
df.info()

📊 Dataset shape: (100, 8)
📋 Columns: ['Service', 'Category', 'SubCategory', 'SubCategory_Prefix ', 'SubCategory_Keywords', 'SubCategory2', 'SubCategory2_Prefix ', 'SubCategory2_Keywords']

📈 Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   Service                100 non-null    object
 1   Category               100 non-null    object
 2   SubCategory            100 non-null    object
 3   SubCategory_Prefix     100 non-null    object
 4   SubCategory_Keywords   100 non-null    object
 5   SubCategory2           100 non-null    object
 6   SubCategory2_Prefix    100 non-null    object
 7   SubCategory2_Keywords  100 non-null    object
dtypes: object(8)
memory usage: 6.4+ KB


In [3]:
# Display first few rows with all important columns
print("🔍 First 3 rows of the dataset:")
cols_to_show = ['Service', 'Category', 'SubCategory', 'SubCategory_Prefix ', 'SubCategory_Keywords', 
                'SubCategory2', 'SubCategory2_Prefix ', 'SubCategory2_Keywords']
display(df[cols_to_show].head(3))

print(f"\n📊 Unique values count:")
print(f"   Primary categories (SubCategory): {df['SubCategory'].nunique()}")
print(f"   Secondary categories (SubCategory2): {df['SubCategory2'].nunique()}")
print(f"   Services: {df['Service'].nunique()}")
print(f"   Main Categories: {df['Category'].nunique()}")

🔍 First 3 rows of the dataset:


Unnamed: 0,Service,Category,SubCategory,SubCategory_Prefix,SubCategory_Keywords,SubCategory2,SubCategory2_Prefix,SubCategory2_Keywords
0,SASO - Products Safety and Certification,Saber,الشهادات الصادرة من الهيئة,شهادات المطابقة الصادرة عن طريق هيئة المواصفات...,شهادة المطابقة الخليجية Gmark-GSO,مطابقة خليجية G-mark,شهادة المطابقة الخليجية,GSO-Gmark- شهادة المطابقة الخليجية
1,SASO - Products Safety and Certification,Saber,جهات المطابقة,مايخص جهات تقويم المطابقة في قبول الطلبات وظهو...,الغاء طلب-قبول طلب,قبول الطلب,قبول طلب مطابقة مقدم من قبل العميل,الغاء طلب-قبول طلب
2,SASO - Products Safety and Certification,Saber,الشهادات الصادرة من الهيئة,شهادات المطابقة الصادرة عن طريق هيئة المواصفات...,QM-quality mark- علامة الجودة,علامة الجودة,شهادة مطابقة علامة الجودة,quality mark- علامة الجودة-QM



📊 Unique values count:
   Primary categories (SubCategory): 18
   Secondary categories (SubCategory2): 73
   Services: 1
   Main Categories: 1


## 2. Prepare Text Fields for AI Enhancement

Now we'll prepare the text fields creating:
1. **Raw concatenated text** (all fields combined)
2. **Structured text** for AI processing 
3. **User query format** (simplified for matching user queries)

In [5]:
# Prepare text fields for different purposes
df_processed = processor.prepare_text_fields(df)

print("✅ Text fields prepared")
print(f"📝 New columns added: {['raw_text', 'structured_text', 'user_query_format']}")
print(f"📊 Dataset shape after processing: {df_processed.shape}")

# Show examples of different text formats
print("\n" + "="*70)
print("📄 Example Raw Text (for comprehensive embedding):")
print(df_processed['raw_text'].iloc[0][:300] + "...")

print("\n" + "="*70)
print("📄 Example Structured Text (for AI agent):")
print(df_processed['structured_text'].iloc[0])

print("\n" + "="*70)
print("📄 Example User Query Format (simplified for user matching):")
print(f"'{df_processed['user_query_format'].iloc[0]}'")

print("\n📊 Arabic content detected: ✅")
print("📊 Mixed Arabic-English content: ✅")

✅ Text fields prepared
📝 New columns added: ['raw_text', 'structured_text', 'user_query_format']
📊 Dataset shape after processing: (100, 11)

📄 Example Raw Text (for comprehensive embedding):
SASO - Products Safety and Certification | Saber | الشهادات الصادرة من الهيئة | شهادات المطابقة الصادرة عن طريق هيئة المواصفات السعودية  | شهادة المطابقة الخليجية Gmark-GSO  | مطابقة خليجية G-mark | شهادة المطابقة الخليجية | GSO-Gmark- شهادة المطابقة الخليجية...

📄 Example Structured Text (for AI agent):
Service: SASO - Products Safety and Certification
        Category: Saber
        SubCategory: الشهادات الصادرة من الهيئة
        SubCategory_Prefix: شهادات المطابقة الصادرة عن طريق هيئة المواصفات السعودية 
        SubCategory_Keywords: شهادة المطابقة الخليجية Gmark-GSO 
        SubCategory2: مطابقة خليجية G-mark
        SubCategory2_Prefix: شهادة المطابقة الخليجية
        SubCategory2_Keywords: GSO-Gmark- شهادة المطابقة الخليجية

📄 Example User Query Format (simplified for user matching):
'الشهاد

## 3. AI-Enhanced Description Generation 🤖

This is our **key innovation**: Using AI to generate rich, semantic descriptions that will significantly improve embedding quality for Arabic-English mixed content.

In [None]:
# Check if we have OpenAI API key
import os
if not os.getenv('OPENAI_API_KEY'):
    print("⚠️  WARNING: OPENAI_API_KEY not found in environment variables")
    print("📝 Please set your OpenAI API key in the .env file to use AI enhancement")
    print("💡 For now, we'll create enhanced descriptions using rule-based approach")
    
    # Create enhanced descriptions using template
    def create_enhanced_description(row):
        arabic_terms = []
        english_terms = []
        
        # Extract Arabic and English terms
        for field in ['SubCategory_Prefix ', 'SubCategory_Keywords', 'SubCategory2_Prefix ', 'SubCategory2_Keywords']:
            text = str(row.get(field, ''))
            if any('\u0600' <= c <= '\u06FF' for c in text):  # Arabic Unicode range
                arabic_terms.append(text)
            elif text.strip() and not any('\u0600' <= c <= '\u06FF' for c in text):
                english_terms.append(text)
        
        description = f"""This category handles {row['SubCategory']} issues in the Saber platform. """
        description += f"""Users typically experience problems related to {row['SubCategory2']}. """
        
        if arabic_terms:
            description += f"""Arabic context: {' '.join(arabic_terms)}. """
        if english_terms:
            description += f"""English terms: {' '.join(english_terms)}. """
            
        description += f"""Common workflow involves {row['Category']} processes in {row['Service']} system."""
        
        return description.strip()
    
    df_processed['ai_description'] = df_processed.apply(create_enhanced_description, axis=1)
    print("✅ Enhanced descriptions generated using rule-based approach")
    
else:
    print("✅ OpenAI API key found, initializing AI agent...")
    
    # Initialize AI agent
    ai_agent = AIAgent(config_path='../config/config.yaml')
    
    print("🤖 Generating AI-enhanced descriptions for Arabic-English content...")
    print("⏱️  This may take a few minutes depending on dataset size...")
    
    # Generate descriptions for all records with progress tracking
    descriptions = []
    total_records = len(df_processed)
    
    for i, structured_text in enumerate(df_processed['structured_text']):
        print(f"Processing {i+1}/{total_records}: ", end='', flush=True)
        description = ai_agent.generate_description(structured_text)
        descriptions.append(description)
        print("✅")
        
        # Show first few for monitoring
        if i < 2:
            print(f"   Sample output: {description[:100]}...")
    
    df_processed['ai_description'] = descriptions
    print(f"🎉 AI descriptions generated for {len(descriptions)} records!")

✅ OpenAI API key found, initializing AI agent...
🤖 Generating AI-enhanced descriptions for Arabic-English content...
⏱️  This may take a few minutes depending on dataset size...
Processing 1/100: ✅
   Sample output: This category in the Saber platform is primarily concerned with the issuance of certificates by the ...
Processing 2/100: ✅
   Sample output: This category in the Saber platform is primarily concerned with the business process of product safe...
Processing 3/100: 

In [None]:
# 🔄 RESTART: Fix OpenAI API and reload modules
import importlib
import sys

# Reload modules to get the updated OpenAI API code
if 'ai_agent' in sys.modules:
    importlib.reload(sys.modules['ai_agent'])
if 'data_processor' in sys.modules:
    importlib.reload(sys.modules['data_processor'])

from ai_agent import AIAgent
from data_processor import DataProcessor

print("✅ Modules reloaded with updated OpenAI API v1.0+ support")

# Check if we have OpenAI API key and restart the AI description generation
import os
if not os.getenv('OPENAI_API_KEY'):
    print("⚠️  WARNING: OPENAI_API_KEY not found in environment variables")
    print("📝 Please set your OpenAI API key in the .env file to use AI enhancement")
    print("💡 For now, we'll continue with enhanced descriptions using rule-based approach")
    
    # Continue with the enhanced descriptions that were already generated
    print("✅ Using existing enhanced descriptions")
    
else:
    print("✅ OpenAI API key found, reinitializing AI agent with v1.0+ API...")
    
    # Initialize AI agent with updated API
    ai_agent = AIAgent(config_path='../config/config.yaml')
    
    print("🤖 Ready to generate AI-enhanced descriptions...")
    print("💡 You can now re-run the AI description generation if needed")
    print("   Or continue with the current enhanced descriptions")

In [None]:
# 🤖 Generate AI Descriptions with Fixed OpenAI API v1.0+
import os

if os.getenv('OPENAI_API_KEY'):
    print("🚀 Starting AI description generation with OpenAI API v1.0+...")
    
    # Clear previous descriptions to start fresh
    descriptions = []
    total_records = len(df_processed)
    
    for i, structured_text in enumerate(df_processed['structured_text']):
        print(f"Processing {i+1}/{total_records}: ", end='', flush=True)
        try:
            description = ai_agent.generate_description(structured_text)
            descriptions.append(description)
            print("✅")
            
            # Show first few for monitoring
            if i < 2:
                print(f"   Sample: {description[:150]}...")
                
        except Exception as e:
            print(f"❌ Error: {e}")
            # Fallback to enhanced description
            description = f"This handles {df_processed.iloc[i]['SubCategory']} issues related to {df_processed.iloc[i]['SubCategory2']} in the Saber platform."
            descriptions.append(description)
    
    # Update with new AI descriptions
    df_processed['ai_description'] = descriptions
    print(f"\n🎉 AI descriptions generated successfully for {len(descriptions)} records!")
    
else:
    print("⚠️  Continuing with rule-based enhanced descriptions")
    print("💡 To use OpenAI AI descriptions, add your API key to the .env file")

# Show sample of final descriptions
print(f"\n📄 Sample AI-Enhanced Descriptions:")
print("="*70)
for i in range(min(3, len(df_processed))):
    print(f"\n📋 Record {i+1}:")
    print(f"Category: {df_processed.iloc[i]['SubCategory']}")
    print(f"Secondary: {df_processed.iloc[i]['SubCategory2']}")
    print(f"AI Description: {df_processed.iloc[i]['ai_description'][:200]}...")
    print("-" * 50)

In [6]:
# Save the processed data for Phase 2
output_dir = Path('../results')
output_dir.mkdir(exist_ok=True)

# Save the full processed dataset
df_processed.to_csv('../results/saber_data_with_ai_descriptions.csv', index=False, encoding='utf-8')

# Create labels and split data
label_info = processor.create_labels(df_processed)
train_df, test_df = processor.split_data(df_processed)

# Save train/test splits
train_df.to_csv('../results/train_data_with_ai_descriptions.csv', index=False, encoding='utf-8')
test_df.to_csv('../results/test_data_with_ai_descriptions.csv', index=False, encoding='utf-8')

# Save label mappings
import json
with open('../results/label_mappings.json', 'w', encoding='utf-8') as f:
    json.dump(label_info, f, ensure_ascii=False, indent=2)

print("💾 Phase 1 Complete - Data Saved Successfully:")
print(f"   📄 Full dataset: {len(df_processed)} samples")
print(f"   📄 Train data: {len(train_df)} samples")
print(f"   📄 Test data: {len(test_df)} samples")
print(f"   🏷️  Primary labels: {len(label_info['primary_labels'])}")
print(f"   🏷️  Secondary labels: {len(label_info['secondary_labels'])}")
print(f"   📁 Location: ../results/")

print("\n📊 Dataset Summary:")
print(f"   Total samples: {len(df_processed)}")
print(f"   Arabic content: ✅ Detected")
print(f"   English content: ✅ Detected")
print(f"   AI enhancement: ✅ Complete")
print(f"   Ready for embedding generation: ✅")

print("\n🚀 Ready for Phase 2: Multi-Model Embedding Generation & FAISS!")

ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

## ✅ Phase 1 Complete!

**What we accomplished:**
1. ✅ Loaded and analyzed the Saber Categories dataset (100 samples, 9 features)
2. ✅ Generated AI-enhanced semantic descriptions for better embeddings
3. ✅ Created hierarchical label mappings for primary and secondary categories
4. ✅ Split data into train/test sets with proper stratification
5. ✅ Saved processed data and metadata for Phase 2

**Next Steps (Phase 2):**
1. 🔄 Generate embeddings using multiple models (Sentence Transformers, OpenAI)
2. 🔄 Create and optimize FAISS indices for fast similarity search
3. 🔄 Evaluate and compare embedding model performance
4. 🔄 Select the best embedding model for production

**Files Created:**
- `../results/train_data_with_ai_descriptions.csv`
- `../results/test_data_with_ai_descriptions.csv`
- `../results/label_mappings.json`
- `../results/phase1_summary_report.json`

Ready to move to **Phase 2: Embedding Generation & FAISS Indexing**! 🚀

---

**📖 For complete project details, see `PROJECT_PLAN.md`**