# Chat Information Extraction System

## Project Description
This notebook extracts user information from conversational chat data.
It combines regex-based pattern matching with LLM-based extraction (through the Groq API) to identify:
- Names
- Email addresses
- Phone numbers
- Locations
- Ages

## Installing Dependencies
Install the `groq` Python client, the only third-party package this notebook imports.

In [1]:
!pip install groq -q
print(" Dependencies installed successfully!")

 Dependencies installed successfully!


## Import Libraries
Import the standard-library modules for JSON and regex handling, plus the Groq client.

In [2]:
import json
import re
from groq import Groq

## Configuration and API Setup
Set the Groq API key and the extraction parameters.

In [None]:
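# API Configuration
# Read the key from the environment so it is never hard-coded in the notebook.
# If the variable is unset, the key is an empty string and LLM extraction is skipped.
import os
GROQ_API_KEY = os.environ.get("GROQ_API_KEY", "")

# Initialize Groq client
client = Groq(api_key=GROQ_API_KEY)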

# Configuration settings
CONFIG = {
    "min_name_length": 2,
    "max_age": 120,
    "min_age": 13,
    "temperature": 0.1,
    "model": "llama-3.1-8b-instant"
}

print(" Configuration loaded:")
for key, value in CONFIG.items():
    print(f"   {key}: {value}")

api_status = " Enabled" if GROQ_API_KEY else " Disabled (add API key for LLM extraction)"
print(f"\n LLM Extraction Status: {api_status}")

 Configuration loaded:
   min_name_length: 2
   max_age: 120
   min_age: 13
   temperature: 0.1
   model: llama-3.1-8b-instant

 LLM Extraction Status:  Enabled


## Test Data Preparation
Five mock conversations, each paired with the fields the extractor is expected to find.

In [6]:
mock_conversations = [
    {
        "title": " Simple Introduction",
        "conversation": [
            {"role": "user", "content": "hey this is sruthi"},
            {"role": "assistant", "content": "Hi Sruthi! Nice to meet you. How's your day going?"},
            {"role": "user", "content": "good, just chilling"},
            {"role": "assistant", "content": "That sounds relaxing! Anything interesting happen today?"},
            {"role": "user", "content": "not much, just work stuff"}
        ],
        "expected": {"name": "Sruthi"}
    },
    {
        "title": " Professional Chat", 
        "conversation": [
            {"role": "user", "content": "Hi, I'm Alex Johnson from the marketing team"},
            {"role": "assistant", "content": "Hello Alex! Nice to meet you. What can I help you with today?"},
            {"role": "user", "content": "I need help with some data analysis. My email is alex.johnson@company.com if you need to send me anything"},
            {"role": "assistant", "content": "Perfect! I've noted your email. What kind of data analysis are you working on?"},
            {"role": "user", "content": "Customer segmentation for our NYC branch. I'm 28 and this is my first time doing this type of analysis"}
        ],
        "expected": {"name": "Alex Johnson", "email": "alex.johnson@company.com", "age": 28, "location": "NYC"}
    },
    {
        "title": " Casual Friend Chat",
        "conversation": [
            {"role": "user", "content": "wassup!"},
            {"role": "assistant", "content": "Hey there! Not much, just here to help. What's going on?"},
            {"role": "user", "content": "I'm Maya, just moved to San Francisco last week"},
            {"role": "assistant", "content": "Cool! How are you liking SF so far? Big change?"},
            {"role": "user", "content": "Yeah it's crazy expensive lol. I'm 24 so still figuring things out"},
            {"role": "assistant", "content": "Haha yeah SF is known for that! You'll get the hang of it though."},
            {"role": "user", "content": "hope so! btw if you need to reach me my number is 555-123-4567"}
        ],
        "expected": {"name": "Maya", "location": "San Francisco", "age": 24, "phone": "555-123-4567"}
    },
    {
        "title": " Support Context",
        "conversation": [
            {"role": "user", "content": "hello, I need some help"},
            {"role": "assistant", "content": "Of course! I'm here to help. What do you need assistance with?"},
            {"role": "user", "content": "I'm having trouble with my account. This is Dr. Sarah Chen"},
            {"role": "assistant", "content": "Hello Dr. Chen! I'd be happy to help with your account. Can you tell me more about the issue?"},
            {"role": "user", "content": "I can't log in. My registered email should be s.chen@medcenter.org"},
            {"role": "assistant", "content": "Let me check that for you. Are you currently in Boston or have you moved recently?"},
            {"role": "user", "content": "Still in Boston, been here for 5 years now. I'm 34 if that matters for the account"}
        ],
        "expected": {"name": "Dr. Sarah Chen", "email": "s.chen@medcenter.org", "location": "Boston", "age": 34}
    },
    {
        "title": " Gaming Buddy Chat",
        "conversation": [
            {"role": "user", "content": "yo what's up"},
            {"role": "assistant", "content": "Hey! Not much, just hanging out. What's going on with you?"},
            {"role": "user", "content": "call me Phoenix, I'm looking for someone to play with"},
            {"role": "assistant", "content": "Nice to meet you Phoenix! What games are you into?"},
            {"role": "user", "content": "mostly FPS games. I'm from Austin, Texas and I'm 19"},
            {"role": "assistant", "content": "Cool! Austin's got a great gaming scene. What's your favorite FPS right now?"},
            {"role": "user", "content": "been playing a lot of Valorant lately"}
        ],
        "expected": {"name": "Phoenix", "location": "Austin, Texas", "age": 19}
    }
]

print(f" Prepared {len(mock_conversations)} test conversations:")
for i, conv in enumerate(mock_conversations, 1):
    print(f"   {i}. {conv['title']}")

 Prepared 5 test conversations:
   1.  Simple Introduction
   2.  Professional Chat
   3.  Casual Friend Chat
   4.  Support Context
   5.  Gaming Buddy Chat


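## Core Function 1: Regex-Based Extraction
Regex patterns that pull the name, email, phone number, age, and location out of the chat text. This path needs no API key, so it also serves as the fallback when the LLM is unavailable.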

In [7]:
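def fallback_extraction(chat_text, return_result=True):
    """Extract name, email, phone, location, and age using regex patterns.

    `return_result` is currently unused; the result dict is always returned.
    """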
    extracted_fallback = {
        "name": None,
        "email": None, 
        "phone": None,
        "location": None,
        "age": None
    }
    
    # Enhanced name extraction patterns
    name_patterns = [
        r"(?:hey|hi|hello|yo)\s+this\s+is\s+([A-Za-z\s]+?)(?:\.|,|$|\s+from|\s+and)",
        r"(?:i'm|i am)\s+([A-Za-z\s]+?)(?:\.|,|$|\s+from|\s+and|\s+in)",
        r"this\s+is\s+((?:Dr\.|Mr\.|Ms\.|Mrs\.)\s*[A-Za-z\s]+?)(?:\.|,|$)",
        r"call\s+me\s+([A-Za-z\s]+?)(?:\.|,|$|\s+and|\s+i)",
        r"my\s+name\s+is\s+([A-Za-z\s]+?)(?:\.|,|$)",
        r"(?:i'm|this is)\s+(Dr\.\s+[A-Za-z\s]+?)(?:\s+from|\s+and|,|$)"
    ]
    
    for pattern in name_patterns:
        match = re.search(pattern, chat_text, re.IGNORECASE)
        if match:
            name = match.group(1).strip().title()
            if len(name) >= CONFIG["min_name_length"] and name.lower() not in ['the', 'a', 'an', 'is', 'am', 'are', 'from']:
                extracted_fallback["name"] = name
                break
    
    # Email extraction (the TLD class is [A-Za-z]; a "|" inside [...] would match a literal pipe)
    email_match = re.search(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', chat_text)
    if email_match:
        extracted_fallback["email"] = email_match.group()
    
    # Phone extraction
    phone_patterns = [
        r'\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b',
        r'\(\d{3}\)\s?\d{3}[-.\s]?\d{4}',
        r'\b\d{10}\b'
    ]
    
    for pattern in phone_patterns:
        match = re.search(pattern, chat_text)
        if match:
            extracted_fallback["phone"] = match.group()
            break
    
    # Age extraction
    age_patterns = [
        r"(?:i'm|i am|my age is)\s+(\d{1,3})\s*(?:years?|yr|y\.o\.?)?(?:\s+and|\s+so|\s+if|$)",
        r"(\d{1,3})\s*(?:years?\s*old|yr\s*old|y\.o\.)",
        r"i'm\s+(\d{1,3})(?:\s+and|\s+so|\s+if|$)"
    ]
    
    for pattern in age_patterns:
        match = re.search(pattern, chat_text.lower())
        if match:
            age = int(match.group(1))
            if CONFIG["min_age"] <= age <= CONFIG["max_age"]:  # Reasonable age range
                extracted_fallback["age"] = age
                break
    
    # Location extraction
    location_patterns = [
        r"(?:i live in|i'm from|from|in)\s+([A-Za-z\s,]+?)(?:\.|,|$|\s+and|\s+but|\s+for)",
        r"my location is\s+([A-Za-z\s,]+?)(?:\.|,|$)",
        r"moved to\s+([A-Za-z\s,]+?)(?:\s+last|\s+recently|\.|,|$)",
        r"currently in\s+([A-Za-z\s,]+?)(?:\s+or|\.|,|$)",
        r"(?:still in|been in)\s+([A-Za-z\s,]+?)(?:\.|,|$|\s+for)",
        r"(?:our|the)\s+([A-Za-z\s,]+?)\s+branch"
    ]
    
    for pattern in location_patterns:
        match = re.search(pattern, chat_text.lower())
        if match:
            location = match.group(1).strip().title()
            # Clean up common false positives
            if (len(location) > 1 and 
                location.lower() not in ['the', 'a', 'an', 'is', 'am', 'are', 'here', 'there', 'time'] and
                not location.lower().startswith('still') and
                not location.isdigit()):
                extracted_fallback["location"] = location
                break
    
    return extracted_fallback
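
The interactive run recorded at the end of this notebook shows the regex path returning `None` for the name in "hi this is ram". One plausible cause, offered here as an assumption rather than a confirmed diagnosis: the name patterns end on `$`, which without `re.MULTILINE` matches only at the very end of the string, while the joined chat text continues onto the next `assistant:` line. The sketch below reproduces this with the first name pattern from `fallback_extraction`; the two-turn sample text is made up for illustration.

```python
import re

# First name pattern from fallback_extraction, copied verbatim
NAME_PATTERN = r"(?:hey|hi|hello|yo)\s+this\s+is\s+([A-Za-z\s]+?)(?:\.|,|$|\s+from|\s+and)"

# Illustrative two-turn chat, joined with "\n" the way extract_user_info joins messages
chat_text = "user: hi this is ram\nassistant: Hello! Nice to meet you."

# Without re.MULTILINE, `$` cannot match at the end of the first line
single_line = re.search(NAME_PATTERN, chat_text, re.IGNORECASE)

# With re.MULTILINE, `$` also matches at the end of each line
multi_line = re.search(NAME_PATTERN, chat_text, re.IGNORECASE | re.MULTILINE)

print(single_line)
print(multi_line.group(1).strip().title() if multi_line else None)
```

Because the capture group uses `\s`, it can still run across a newline when no terminator appears on the first line, so adding the flag is only a partial remedy.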

## Core Function 2: LLM-Based Extraction
Sends the chat text to a Groq-hosted Llama model with a few-shot prompt and parses the JSON object it returns.

In [8]:
def llm_extraction(chat_text):
    """Extract user fields with the Groq LLM; returns {} without an API key or on failure."""
    if not client.api_key:
        return {}
        
    try:
        response = client.chat.completions.create(
            model=CONFIG["model"],
            messages=[
                {
                    "role": "system",
                    "content": """Extract user information from the conversation. Return only valid JSON.

Examples:
- "hey this is sruthi" → {"name": "Sruthi"}
- "I'm Alex, email is alex@test.com" → {"name": "Alex", "email": "alex@test.com"}
- "call me Maya, I'm 24 from NYC" → {"name": "Maya", "age": 24, "location": "NYC"}

Return only fields with actual values as valid JSON."""
                },
                {"role": "user", "content": f"Extract info from: {chat_text}"}
            ],
            temperature=CONFIG["temperature"]
        )
        
        response_content = response.choices[0].message.content.strip()
        json_match = re.search(r'\{[^}]*\}', response_content)
        
        if json_match:
            return json.loads(json_match.group())
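    except Exception:
        # Any API or JSON-parsing failure falls through to the empty result,
        # so the caller can rely on the regex output alone
        pass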
    
    return {}
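
The `\{[^}]*\}` search in `llm_extraction` stops at the first closing brace, so it would cut off a reply that contains a nested object. A minimal alternative sketch, assuming the model may wrap its JSON in extra prose: scan for each `{` and let the standard library's `json.JSONDecoder.raw_decode` parse from that position. The helper name `parse_first_json_object` and the sample reply are hypothetical and not part of the notebook.

```python
import json

def parse_first_json_object(text):
    """Return the first JSON object embedded in `text`, or {} if none parses."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value beginning at `start`
            # and ignores whatever text follows it
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        # Not valid JSON from this brace; try the next one
        start = text.find("{", start + 1)
    return {}

# Made-up model reply with surrounding prose and a nested object
reply = 'Here is the result: {"name": "Maya", "contact": {"phone": "555-123-4567"}} Hope that helps!'
print(parse_first_json_object(reply))
print(parse_first_json_object("no JSON here"))
```

The few-shot prompt asks for flat objects, so the original regex is adequate when the model follows it; this variant only matters if the reply deviates.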


## Core Function 3: Combined Extraction System
Runs both extractors and merges their results field by field: the LLM value wins when present, otherwise the regex value is used.

In [9]:
def extract_user_info(conversation_data):
    """Main extraction function"""
    # Convert conversation to text
    chat_text = "\n".join([f"{msg['role']}: {msg['content']}" for msg in conversation_data])
    
    # Run both extraction methods
    fallback_result = fallback_extraction(chat_text)
    llm_result = llm_extraction(chat_text) if client.api_key else {}
    
    # Merge results (LLM takes priority)
    final_result = {
        "name": llm_result.get("name") or fallback_result.get("name"),
        "email": llm_result.get("email") or fallback_result.get("email"),
        "phone": llm_result.get("phone") or fallback_result.get("phone"),
        "location": llm_result.get("location") or fallback_result.get("location"),
        "age": llm_result.get("age") or fallback_result.get("age")
    }
    
    return final_result, fallback_result, llm_result
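
The merge in `extract_user_info` relies on `or`, so a missing or empty LLM value falls through to the regex value. The same rule can be written as a loop that also records which extractor supplied each field. This is an illustrative sketch; `merge_with_sources`, `FIELDS`, and the sample dictionaries are assumptions, not part of the notebook.

```python
FIELDS = ("name", "email", "phone", "location", "age")

def merge_with_sources(llm_result, fallback_result):
    """Merge two extraction results, preferring the LLM, and track provenance."""
    merged, sources = {}, {}
    for field in FIELDS:
        llm_value = llm_result.get(field)
        regex_value = fallback_result.get(field)
        if llm_value:  # same truthiness rule as `llm_value or regex_value`
            merged[field], sources[field] = llm_value, "llm"
        elif regex_value:
            merged[field], sources[field] = regex_value, "regex"
        else:
            # Neither extractor found this field
            merged[field], sources[field] = None, None
    return merged, sources

# Made-up results for illustration
llm = {"name": "Ram", "location": "Guntur"}
regex = {"name": None, "email": "ram@example.com", "phone": None, "location": None, "age": None}

merged, sources = merge_with_sources(llm, regex)
print(merged)
print(sources)
```

Keeping the source per field makes it easier to see where the regex fallback actually contributes.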

## Demo and Testing System
Runs the extractor on every mock conversation and compares the result with the expected fields.

In [10]:
def run_demo():
    print(" CHAT INFORMATION EXTRACTION DEMO")
    print("=" * 50)
    
    api_status = " Enabled" if client.api_key else " Disabled (add API key for LLM extraction)"
    print(f" LLM Extraction: {api_status}")
    print(" Regex Extraction:  Always Available")
    print()
    
    for i, demo in enumerate(mock_conversations, 1):
        print(f"\n DEMO {i}: {demo['title']}")
        print("-" * 40)
        
        # Show conversation
        print("💬 Conversation:")
        for msg in demo['conversation']:
            role_emoji = "👤" if msg['role'] == 'user' else "🤖"
            print(f"  {role_emoji} {msg['content']}")
        
        print("\n Extraction Results:")
        
        # Run extraction
        final_result, fallback_result, llm_result = extract_user_info(demo['conversation'])
        
        # Show results
        extracted_fields = {k: v for k, v in final_result.items() if v is not None}
        expected_fields = demo['expected']
        
        print(f"   Found: {json.dumps(extracted_fields, indent=6)}")
        print(f"   Expected: {json.dumps(expected_fields, indent=9)}")
        
        # Show accuracy
        matches = 0
        total_expected = len(expected_fields)
        
        for key, expected_value in expected_fields.items():
            if str(final_result.get(key, '')).lower() == str(expected_value).lower():
                matches += 1
        
        accuracy = (matches / total_expected * 100) if total_expected > 0 else 0
        
        
        print(f"   Accuracy: {matches}/{total_expected} ({accuracy:.0f}%)")
        
        # Show which method found what
        if any(fallback_result.values()):
            fallback_fields = [k for k, v in fallback_result.items() if v is not None]
            print(f"  🔧 Regex found: {fallback_fields}")
        
        if any(llm_result.values()):
            llm_fields = [k for k, v in llm_result.items() if v is not None]
            print(f"  🤖 LLM found: {llm_fields}")
        
        print()
    
    # Summary
    print(" DEMO SUMMARY")
    print("-" * 20)
    print(f" Tested {len(mock_conversations)} different conversation styles")
    print(" Extraction handles casual, professional, and support contexts")
    print(" Regex extraction provides reliable fallback")
    print(" LLM extraction enhances accuracy when available")
    print("\n The system gracefully degrades when LLM is unavailable!")
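
The accuracy figure printed by `run_demo` comes from a case-insensitive string comparison of each expected field. Pulled out into a helper, the same check can be exercised on its own. This is a sketch under that reading of the code; the function name `score_fields` and the sample values are hypothetical.

```python
def score_fields(expected, actual):
    """Count expected fields whose value matches `actual`, ignoring case."""
    matches = sum(
        1
        for key, expected_value in expected.items()
        if str(actual.get(key, "")).lower() == str(expected_value).lower()
    )
    total = len(expected)
    accuracy = (matches / total * 100) if total > 0 else 0
    return matches, total, accuracy

# Made-up example: name differs only in case, location differs in wording
expected = {"name": "Maya", "location": "San Francisco", "age": 24}
actual = {"name": "maya", "location": "SF", "age": 24, "phone": None}

matches, total, accuracy = score_fields(expected, actual)
print(f"{matches}/{total} ({accuracy:.0f}%)")
```

Only keys present in `expected` are checked, so extra fields in `actual` (such as `phone` here) never lower the score.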

## Interactive Testing Interface
Type your own conversation, then run the extractor on the full chat when you finish.

In [16]:
def interactive_chat_extraction():
    """Chat interactively, then extract user info from the whole conversation."""
    print(" INTERACTIVE CHAT EXTRACTION")
    print("=" * 35)
    print()    
    print("\n Start chatting (assistant will reply each turn). Type 'exit' to finish and extract data.\n")
    
    conversation = []
    
    while True:
        user_input = input("You: ")
        
        # Check for the exit command first so "exit" is not stored as a chat message
        if user_input.strip().lower() == "exit":
            print("\n Ending session — extracting info from entire chat...\n")
            break
        
        conversation.append({"role": "user", "content": user_input})
        
        # Simple assistant reply (you can enhance this)
        assistant_msg = "I understand. Please continue..."
        if any(greeting in user_input.lower() for greeting in ['hi', 'hello', 'hey']):
            assistant_msg = "Hello! Nice to meet you. How can I help you today?"
        elif any(word in user_input.lower() for word in ['good', 'fine', 'okay']):
            assistant_msg = "That's great to hear! What's on your mind?"
        
        print("Assistant:", assistant_msg)
        conversation.append({"role": "assistant", "content": assistant_msg})
    
    # Extract information
    final_result, fallback_result, llm_result = extract_user_info(conversation)
    
    print(" Extraction Results:")
    print(f"  🔧 Regex: {fallback_result}")
    if llm_result:
        print(f"  🤖 LLM: {llm_result}")
    
    print("\n=== FINAL EXTRACTED DATA ===")
    print(json.dumps(final_result, indent=2))
    
    found_fields = len([v for v in final_result.values() if v is not None])
    print(f"\n Extraction complete. Found {found_fields} field(s) out of 5.")
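
The canned replies in the loop detect greetings with substring checks, so `'hi' in user_input.lower()` is also true for words such as "this". If that matters, a word-boundary regex is one way to tighten it. A small sketch; `is_greeting` and `GREETING_RE` are hypothetical names, not part of the notebook.

```python
import re

# \b anchors make each greeting match only as a whole word
GREETING_RE = re.compile(r"\b(?:hi|hello|hey)\b", re.IGNORECASE)

def is_greeting(text):
    """Return True if `text` contains a standalone greeting word."""
    return GREETING_RE.search(text) is not None

print("hi" in "what is this".lower())   # substring check
print(is_greeting("what is this"))      # whole-word check
print(is_greeting("Hey, this is Ram"))
```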

## Run Demo

In [12]:
run_demo()

 CHAT INFORMATION EXTRACTION DEMO
 LLM Extraction:  Enabled
 Regex Extraction:  Always Available


 DEMO 1:  Simple Introduction
----------------------------------------
💬 Conversation:
  👤 hey this is sruthi
  🤖 Hi Sruthi! Nice to meet you. How's your day going?
  👤 good, just chilling
  🤖 That sounds relaxing! Anything interesting happen today?
  👤 not much, just work stuff

 Extraction Results:
   Found: {
      "name": "Sruthi"
}
   Expected: {
         "name": "Sruthi"
}
   Accuracy: 1/1 (100%)
  🤖 LLM found: ['name']


 DEMO 2:  Professional Chat
----------------------------------------
💬 Conversation:
  👤 Hi, I'm Alex Johnson from the marketing team
  🤖 Hello Alex! Nice to meet you. What can I help you with today?
  👤 I need help with some data analysis. My email is alex.johnson@company.com if you need to send me anything
  🤖 Perfect! I've noted your email. What kind of data analysis are you working on?
  👤 Customer segmentation for our NYC branch. I'm 28 and this is my first ti

## Run Interactive Chat

In [17]:
interactive_chat_extraction()

 INTERACTIVE CHAT EXTRACTION

 Start chatting (assistant will reply each turn). Type 'exit' to finish and extract data.

You:  hi this is ram
Assistant: Hello! Nice to meet you. How can I help you today?
You:  i just want to have a casual convo, felt bore after returning from college in guntur
Assistant: I understand. Please continue...
You:  exit

 Ending session — extracting info from entire chat...

 Extraction Results:
  🔧 Regex: {'name': None, 'email': None, 'phone': None, 'location': None, 'age': None}
  🤖 LLM: {'name': 'Ram', 'location': 'Guntur'}

=== FINAL EXTRACTED DATA ===
{
  "name": "Ram",
  "email": null,
  "phone": null,
  "location": "Guntur",
  "age": null
}

 Extraction complete. Found 2 field(s) out of 5.
