# Production Material Classification System
## NCSU Faculty Publications (2021-2025)

**Classification Method:** Smart Hybrid (Rule-Based + Conditional Generative)

---

### 🎯 System Overview:
- **Rule-Based First**: 144 keywords, weighted scoring
- **Conditional Generative**: Only when Rule-Based confidence < 85%
- **Smart Logic**: 
  1. Apply Rule-Based classification
  2. If confidence ≥ 85% → **Done!** (No API call)
  3. If confidence < 85% → Call GPT-4o-mini

**💰 Cost Savings**: Publications that reach the 85% threshold need no API call. In the run recorded below, roughly 50-55% of publications were sent to the API.
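
The decision rule in the list above can be sketched as a small pure function. This is a minimal sketch, not the notebook's production code: `rule_confidence` and `route` are hypothetical helper names, and the score dictionaries are made-up inputs. The confidence formula (best category score divided by the sum of all category scores) follows the rule-based step in Section 5.

```python
CONFIDENCE_THRESHOLD = 0.85  # same 85% cut-off as the overview


def rule_confidence(category_scores):
    """Return (best_category, confidence) from per-category keyword scores.

    Confidence is the best score's share of the total score, so a text that
    only matches keywords from one category gets confidence 1.0.
    """
    if not category_scores:
        return "others", 0.0
    best = max(category_scores, key=category_scores.get)
    return best, category_scores[best] / sum(category_scores.values())


def route(category_scores, threshold=CONFIDENCE_THRESHOLD):
    """Return the rule-based guess, its confidence, and the method to use."""
    category, confidence = rule_confidence(category_scores)
    method = "rule_based" if confidence >= threshold else "generative"
    return category, confidence, method


# Only 'metal' keywords matched: confidence 1.0, so no API call is needed.
print(route({"metal": 3}))
# Scores split across two categories: 3 / (3 + 2) = 0.6, below the threshold.
print(route({"metal": 3, "ceramic": 2}))
```

When the method is `"generative"`, the rule-based category is only a provisional guess; the notebook replaces it with the model's answer.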

---

### 🏷️ Material Categories:
polymer • biopolymer • metal • ceramic • semiconductor • composite • nano_materials • others

## 1. Setup and Imports

In [1]:
# Standard libraries
import pandas as pd
import json
import time
from datetime import datetime

# Database
import mysql.connector
from dotenv import load_dotenv
import os

# OpenAI
from openai import OpenAI

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)

print("✅ All libraries imported successfully!")
print(f"📅 Classification Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

✅ All libraries imported successfully!
📅 Classification Date: 2025-12-02 17:07:39


## 2. Load Publications from Database

In [2]:
# Load environment variables
load_dotenv()
print("✅ Environment variables loaded")


✅ Environment variables loaded


In [3]:
# Connect to database
print("🔄 Connecting to database...")
conn = mysql.connector.connect(
    host='localhost',
    user='root',
    password=os.getenv('DB_PASSWORD'),
    database='mse_db_test_ncsu',
    connection_timeout=10
)
print("✅ Database connected successfully!")


🔄 Connecting to database...
✅ Database connected successfully!


In [4]:
# Define query for TIER 1 filtering
query = """
SELECT 
    p.publication_id,
    p.title,
    p.publication_year as year,
    COALESCE(CAST(p.keywords AS CHAR), '') as keywords,
    p.faculty_unity_id,
    CONCAT(f.first_name, ' ', f.last_name) as faculty_name,
    p.doi,
    p.journal_name
FROM publications p
LEFT JOIN master_faculty f ON p.faculty_unity_id = f.unity_id
WHERE p.publication_year BETWEEN 2021 AND 2025
    AND p.doi IS NOT NULL 
    AND p.doi LIKE '10.%'
    AND p.journal_name IS NOT NULL
    AND p.journal_name != ''
ORDER BY p.publication_year DESC, p.publication_id
"""
print("✅ Query defined (TIER 1: Standard DOI + Journal Name)")


✅ Query defined (TIER 1: Standard DOI + Journal Name)
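
To make the TIER 1 filter concrete, the `WHERE` clause above can be mirrored as a plain Python predicate. This is a minimal sketch for illustration: `is_tier1` is a hypothetical helper, and the sample rows are invented, not taken from the database.

```python
def is_tier1(record):
    """Mirror the WHERE clause: 2021-2025, DOI starting with '10.', non-empty journal."""
    year = record.get("publication_year")
    doi = record.get("doi")
    journal = record.get("journal_name")
    return (
        year is not None and 2021 <= year <= 2025
        and doi is not None and doi.startswith("10.")
        and journal is not None and journal != ""
    )


# Hypothetical sample rows, invented for illustration only.
samples = [
    {"publication_year": 2023, "doi": "10.1000/example", "journal_name": "Example Journal"},
    {"publication_year": 2023, "doi": None, "journal_name": "Example Journal"},
    {"publication_year": 2019, "doi": "10.1000/old", "journal_name": "Example Journal"},
    {"publication_year": 2024, "doi": "10.1000/nojournal", "journal_name": ""},
]
print([is_tier1(r) for r in samples])
```

Only the first row passes; the others fail on a missing DOI, an out-of-range year, and an empty journal name respectively.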


In [5]:
# Execute query and load data
print("🔄 Executing query...")
start = time.time()

df = pd.read_sql(query, conn)

elapsed = time.time() - start
print(f"✅ Query executed in {elapsed:.2f} seconds")
print(f"📊 Loaded {len(df)} records")


🔄 Executing query...
✅ Query executed in 0.01 seconds
📊 Loaded 773 records




In [6]:
# Close connection and display summary
conn.close()
print("✅ Database connection closed")

print(f"\n{'='*60}")
print(f"📋 TIER 1 PUBLICATIONS LOADED")
print(f"{'='*60}")
print(f"   Total records: {len(df)}")
print(f"   Filter: Standard DOI (10.*) + Journal Name")
print(f"   📅 Year Range: {df['year'].min()} - {df['year'].max()}")
print(f"   👥 Faculty Count: {df['faculty_unity_id'].nunique()}")
print(f"\n💡 Data Quality: HIGHEST - Verified published research with resolvable DOIs")


✅ Database connection closed

📋 TIER 1 PUBLICATIONS LOADED
   Total records: 773
   Filter: Standard DOI (10.*) + Journal Name
   📅 Year Range: 2021 - 2025
   👥 Faculty Count: 25

💡 Data Quality: HIGHEST - Verified published research with resolvable DOIs


## 3. Prepare Text (No Title Duplication)

In [7]:
def prepare_text(row):
    """Combine title and keywords - NO duplication"""
    text_parts = []
    
    # Add title
    if pd.notna(row['title']) and str(row['title']).strip():
        text_parts.append(str(row['title']).strip())
    
    # Add keywords
    if pd.notna(row['keywords']) and str(row['keywords']).strip():
        try:
            kw_str = str(row['keywords'])
            if kw_str.startswith('['):
                keywords_data = json.loads(kw_str)
                if isinstance(keywords_data, list):
                    keywords_text = ' '.join(str(k) for k in keywords_data if k)
                    if keywords_text.strip():
                        text_parts.append(keywords_text)
            elif kw_str not in ['null', 'None', '[]', '']:
                text_parts.append(kw_str)
        except ValueError:
            # Malformed JSON in the keywords column: fall back to the title only
            pass
    
    return ' '.join(text_parts)

# Apply text preparation
df['text_for_classification'] = df.apply(prepare_text, axis=1)

# Clean text
df['text_for_classification'] = df['text_for_classification'].str.lower().str.replace(r'[^\w\s]', ' ', regex=True)

print("✅ Text preparation complete!")
print(f"📏 Average text length: {df['text_for_classification'].str.split().str.len().mean():.1f} words")

✅ Text preparation complete!
📏 Average text length: 18.3 words
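
The cleaning step above lowercases the text and replaces every character that is neither a word character nor whitespace with a space. The sketch below applies the same regular expression with the standard `re` module to a made-up title; `clean_text` is a hypothetical helper name, not part of the notebook.

```python
import re


def clean_text(text):
    """Lowercase and replace every non-word, non-space character with a space."""
    return re.sub(r"[^\w\s]", " ", text.lower())


# Hypothetical title, invented for illustration only.
cleaned = clean_text("N-type doping of Li-ion cathodes: a review")
print(cleaned.split())
```

One consequence worth noting: hyphens become spaces, so hyphenated dictionary entries such as `'n-type'` and `'p-type'` in Section 4 cannot match the cleaned text as written.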


## 4. Material Keywords Dictionary

In [8]:
# Material keywords dictionary (144 keywords)
material_keywords = {
    'polymer': [
        'polymer', 'polymers', 'polymeric', 'polymerization',
        'plastic', 'plastics', 'resin', 'elastomer', 'thermoplastic',
        'polystyrene', 'polyethylene', 'polypropylene', 'pvc', 'pet',
        'polyester', 'nylon', 'acrylic', 'epoxy', 'silicone',
        'hydrogel', 'copolymer', 'macromolecule', 'monomer'
    ],
    'biopolymer': [
        'biopolymer', 'chitosan', 'cellulose', 'collagen', 'gelatin',
        'alginate', 'protein', 'peptide', 'dna', 'rna',
        'polysaccharide', 'starch', 'lignin', 'silk', 'keratin',
        'fibrin', 'elastin', 'hyaluronic', 'pectin', 'biobased'
    ],
    'metal': [
        'metal', 'metallic', 'alloy', 'steel', 'iron',
        'aluminum', 'copper', 'titanium', 'nickel', 'zinc',
        'magnesium', 'silver', 'gold', 'platinum', 'cobalt',
        'chromium', 'manganese', 'brass', 'bronze', 'stainless'
    ],
    'ceramic': [
        'ceramic', 'ceramics', 'oxide', 'oxides', 'glass',
        'silica', 'alumina', 'zirconia', 'titania', 'silicon dioxide',
        'calcium phosphate', 'hydroxyapatite', 'bioactive glass',
        'porcelain', 'clay', 'mullite', 'spinel', 'perovskite'
    ],
    'semiconductor': [
        'semiconductor', 'silicon', 'transistor', 'diode', 'chip',
        'wafer', 'doping', 'n-type', 'p-type', 'junction',
        'cmos', 'mosfet', 'gallium arsenide', 'germanium',
        'led', 'photovoltaic', 'solar cell', 'bandgap', 'quantum dot'
    ],
    'composite': [
        'composite', 'composites', 'fiber reinforced', 'laminate',
        'carbon fiber', 'glass fiber', 'fiberglass', 'hybrid material',
        'sandwich structure', 'matrix', 'reinforcement', 'filler',
        'multiphase', 'particulate composite', 'nanocomposite'
    ],
    'nano_materials': [
        'nanoparticle', 'nanoparticles', 'gold nanoparticle', 'silver nanoparticle',
        'metal nanoparticle', 'quantum dot', 'nanodot', 'colloidal',
        'plasmonic', 'nanosphere', 'nanocrystal', 'nanorod', 'nanoshell',
        'nanofiber', 'nanofibers', 'electrospinning', 'electrospun',
        'nanostructure', 'nanostructured', 'nanowire', 'nanotube',
        'carbon nanotube', 'cnt', 'graphene', 'nanomesh', 'nanonet',
        'fibrous', 'nanofibrous', 'ultrafine fiber'
    ]
}

print(f"✅ Material keywords loaded: {sum(len(v) for v in material_keywords.values())} total keywords")

✅ Material keywords loaded: 144 total keywords
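
Short entries such as `'pet'`, `'led'`, or `'rna'` can also occur inside unrelated words when matched with a plain substring test, which is what `keyword in text` does in Section 5. The sketch below shows one possible alternative using word boundaries. It is an assumption-laden illustration, not the notebook's method: `find_keywords` is a hypothetical helper, and the sample text is adapted from a title in the export sample.

```python
import re


def find_keywords(text, keywords):
    """Return the keywords that occur in text as whole words or phrases."""
    found = []
    for keyword in keywords:
        # \b anchors stop 'pet' from matching inside 'competition'.
        pattern = r"\b" + re.escape(keyword) + r"\b"
        if re.search(pattern, text):
            found.append(keyword)
    return found


text = "competition between dissolution and ion exchange on porous carbon scaffolds"
print("pet" in text)                                 # substring test
print(find_keywords(text, ["pet", "ion exchange"]))  # whole-word test
```

Switching the matcher would change the rule-based scores, so the recorded results in this notebook would need to be regenerated before comparing.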


## 5. Smart Hybrid Classification (Rule-Based + Conditional Generative)

In [9]:
# Initialize OpenAI client
client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

def classify_rule_based(text, threshold=0.4):
    """Rule-based classification with weighted scoring"""
    if not text or text.strip() == '':
        return 'others', 0.0, "No text content found"
    
    text_lower = text.lower()
    category_scores = {}
    category_keywords_found = {}  # Track which keywords were found
    
    # Count keyword matches with weighting
    for category, keywords in material_keywords.items():
        score = 0
        found_keywords = []
        for keyword in keywords:
            if keyword.lower() in text_lower:
                weight = len(keyword.split())  # Multi-word phrases get higher weight
                score += weight
                found_keywords.append(keyword)
        
        if score > 0:
            category_scores[category] = score
            category_keywords_found[category] = found_keywords
    
    if not category_scores:
        return 'others', 0.0, "No material keywords detected"
    
    # Calculate confidence
    total_score = sum(category_scores.values())
    best_category = max(category_scores.items(), key=lambda x: x[1])
    confidence = best_category[1] / total_score
    
    # Create explanation
    best_cat_name = best_category[0]
    keywords_found = category_keywords_found[best_cat_name]
    keyword_count = len(keywords_found)
    
    # Get top 3 most important keywords (longer phrases first)
    top_keywords = sorted(keywords_found, key=lambda x: len(x.split()), reverse=True)[:3]
    keywords_str = ', '.join(top_keywords)
    
    explanation = f"Confidence: {confidence:.1%} | Found {keyword_count} keywords: {keywords_str}"
    
    if confidence < threshold:
        return 'others', confidence, f"Low confidence ({confidence:.1%}) - ambiguous material type"
    
    return best_category[0], confidence, explanation

def classify_generative(text):
    """OpenAI generative classification with explanation"""
    prompt = f"""You are a materials science expert. Classify this publication into ONE category:

Categories: polymer, biopolymer, metal, ceramic, semiconductor, composite, nano_materials, others

Publication: "{text[:500]}"

Respond EXACTLY in this format:
Category: [name]
Confidence: [0-100]
Reason: [one short sentence explaining why]"""
    
    try:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "You are a materials science expert. Be concise."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.1,
            max_tokens=100
        )
        
        result = response.choices[0].message.content
        lines = result.strip().split('\n')
        
        category = 'others'
        confidence = 0.5
        reason = "No explanation provided"
        
        for line in lines:
            if line.startswith('Category:'):
                category = line.split(':', 1)[1].strip().lower()
            elif line.startswith('Confidence:'):
                conf_str = line.split(':', 1)[1].strip().replace('%', '')
                try:
                    confidence = float(conf_str) / 100.0
                except ValueError:
                    # Non-numeric confidence in the reply: keep the neutral default
                    confidence = 0.5
            elif line.startswith('Reason:'):
                reason = line.split(':', 1)[1].strip()
        
        explanation = f"AI: {reason}"
        return category, confidence, explanation
        
    except Exception as e:
        print(f"  ⚠️  API Error: {e}")
        return 'others', 0.0, f"API Error: {str(e)}"

# Apply Smart Hybrid Classification
print("🔄 Starting Smart Hybrid Classification...")
print(f"📊 Total publications: {len(df)}")
print(f"💡 Strategy: Rule-Based first, API only if confidence < 85%\n")

CONFIDENCE_THRESHOLD = 0.85  # 85% threshold

results = []
api_call_count = 0
start_time = time.time()

for idx, row in df.iterrows():
    text = row['text_for_classification']
    
    # Step 1: Rule-Based classification
    rb_category, rb_confidence, rb_explanation = classify_rule_based(text)
    
    # Step 2: Decide if we need generative
    if rb_confidence >= CONFIDENCE_THRESHOLD:
        # High confidence - use Rule-Based result
        final_category = rb_category
        final_confidence = rb_confidence
        final_explanation = rb_explanation
        method = 'rule_based'
    else:
        # Low confidence - call generative API
        gen_category, gen_confidence, gen_explanation = classify_generative(text)
        api_call_count += 1
        
        # Use generative result
        final_category = gen_category
        final_confidence = gen_confidence
        final_explanation = gen_explanation
        method = 'generative'
    
    results.append({
        'category': final_category,
        'confidence': final_confidence,
        'explanation': final_explanation,
        'method': method
    })
    
    # Progress update
    if (idx + 1) % 100 == 0 or (idx + 1) == len(df):
        elapsed = time.time() - start_time
        progress = (idx + 1) / len(df) * 100
        print(f"  ✅ {idx + 1}/{len(df)} ({progress:.1f}%) - "
              f"API calls: {api_call_count} ({api_call_count/(idx+1)*100:.1f}%) - "
              f"Elapsed: {elapsed/60:.1f}min")

execution_time = time.time() - start_time

# Store results
df['category'] = [r['category'] for r in results]
df['confidence'] = [r['confidence'] for r in results]
df['explanation'] = [r['explanation'] for r in results]
df['method'] = [r['method'] for r in results]

# Calculate metrics
classified = df[df['category'] != 'others']

print(f"\n{'='*80}")
print(f"🎉 CLASSIFICATION COMPLETE!")
print(f"{'='*80}")
print(f"⏱️  Total time: {execution_time/60:.1f} minutes")
print(f"✅ Coverage: {len(classified)}/{len(df)} ({len(classified)/len(df)*100:.1f}%)")
print(f"📊 Average confidence: {classified['confidence'].mean()*100:.1f}%")
print(f"\n💰 API Efficiency:")
print(f"  API calls: {api_call_count}/{len(df)} ({api_call_count/len(df)*100:.1f}%)")
print(f"  Cost savings: {(1 - api_call_count/len(df))*100:.1f}% vs full generative")
print(f"\n📈 Method Distribution:")
print(df['method'].value_counts())
print(f"\n🏷️  Category Distribution:")
print(df['category'].value_counts())

# Show sample explanations
print(f"\n📝 Sample Explanations:")
print("\nRule-Based Examples:")
rb_samples = df[df['method'] == 'rule_based'].head(3)
for _, row in rb_samples.iterrows():
    print(f"  • {row['category']}: {row['explanation']}")

print("\nGenerative Examples:")
gen_samples = df[df['method'] == 'generative'].head(3)
for _, row in gen_samples.iterrows():
    print(f"  • {row['category']}: {row['explanation']}")

🔄 Starting Smart Hybrid Classification...
📊 Total publications: 773
💡 Strategy: Rule-Based first, API only if confidence < 85%

  ✅ 100/773 (12.9%) - API calls: 50 (50.0%) - Elapsed: 1.2min
  ✅ 200/773 (25.9%) - API calls: 106 (53.0%) - Elapsed: 2.4min
  ✅ 300/773 (38.8%) - API calls: 163 (54.3%) - Elapsed: 3.7min
  ✅ 400/773 (51.7%) - API calls: 206 (51.5%) - Elapsed: 4.6min
  ✅ 500/773 (64.7%) - API calls: 260 (52.0%) - Elapsed: 5.9min
  ✅ 600/773 (77.6%) - API calls: 317 (52.8%) - Elapsed: 7.1min
  ✅ 700/773 (90.6%) - API calls: 384 (54.9%) - Elapsed: 8.2min
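
The generative step trusts whatever label the model returns. A possible hardening, sketched below without any network call, is to validate the parsed label against the category list from the prompt and to clamp the confidence. `parse_reply` and `VALID_CATEGORIES` are hypothetical names introduced for this illustration, and the sample replies are invented.

```python
VALID_CATEGORIES = {
    "polymer", "biopolymer", "metal", "ceramic",
    "semiconductor", "composite", "nano_materials", "others",
}


def parse_reply(reply):
    """Parse a 'Category / Confidence / Reason' reply into a validated tuple."""
    category, confidence, reason = "others", 0.5, "No explanation provided"
    for line in reply.strip().splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip().lower(), value.strip()
        if key == "category":
            candidate = value.strip("[]").lower()
            # Fall back to 'others' for labels outside the agreed category list.
            category = candidate if candidate in VALID_CATEGORIES else "others"
        elif key == "confidence":
            try:
                # Clamp to [0, 1] so out-of-range replies cannot skew averages.
                confidence = min(max(float(value.rstrip("%")) / 100.0, 0.0), 1.0)
            except ValueError:
                confidence = 0.5
        elif key == "reason":
            reason = value
    return category, confidence, reason


print(parse_reply("Category: [Ceramic]\nConfidence: 90%\nReason: Oxide perovskite."))
print(parse_reply("Category: 2D material\nConfidence: high"))
```

The defaults (`'others'`, 0.5, and the placeholder reason) match the ones `classify_generative` uses when a field is missing.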

## 6. Export Results (Faculty, Title, Category)

In [10]:
# Export simplified results: faculty, title, year, category, method, explanation
export_df = df[['faculty_name', 'title', 'year', 'category', 'method', 'explanation']].copy()

# Clean up
export_df.columns = ['Faculty', 'Title', 'Year', 'Category', 'Method', 'Explanation']

# Make method more readable
export_df['Method'] = export_df['Method'].replace({
    'rule_based': 'Rule-Based',
    'generative': 'OpenAI'
})

output_file = 'production_classifications.csv'
export_df.to_csv(output_file, index=False)

print(f"✅ Exported: {output_file}")
print(f"📊 Total records: {len(export_df)}")
print(f"📅 Export date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

# Show sample
print(f"\n📋 Sample (first 10 rows):")
print(export_df.head(10).to_string(index=False))

print(f"\n🎉 Pipeline complete!")

✅ Exported: production_classifications.csv
📊 Total records: 773
📅 Export date: 2025-12-02 17:17:17

📋 Sample (first 10 rows):
           Faculty                                                                                                                                            Title  Year      Category     Method                                                                                                                                          Explanation
        Nina Balke                                                           Competing polar phases in 2D ferroelectric transition metal thio- and selenophosphates  2025        others     OpenAI   AI: The publication discusses 2D ferroelectric materials, which do not fit neatly into the standard categories of polymers, metals, ceramics, etc.
 Veronica Augustyn                Competition between dissolution and ion exchange during low temperature synthesis of LiCoO<sub>2</sub> on porous carbon scaffolds  2025      