# 🎉 Joinnus Events Extraction Pipeline

**Complete end-to-end pipeline for extracting comprehensive event data from Joinnus.**

## 🎯 Pipeline Features

This notebook provides a complete workflow:
1. **🌐 Event URL Extraction** - Extract all event URLs from Joinnus categories with pagination
2. **Data Storage** - Save extracted URLs to JSON and CSV files in the `data/` directory
3. **🔍 Comprehensive Extraction** - Extract detailed event information from each event page
4. **Results** - Generate detailed JSON and CSV files with all event data

## 📋 Output Files

- `events_TIMESTAMP.csv` - ID, URL, and category for all 603+ events
- `events_TIMESTAMP.json` - The same ID, URL, and category records in JSON
- `analysis_TIMESTAMP.json` - Per-category event counts and percentages
- `events_detailed_TIMESTAMP.csv` - Summary view of all extracted data
- `events_detailed_TIMESTAMP.json` - Complete detailed data for all events (images, tags, prices)
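
Because the glob `events_*.csv` also matches `events_detailed_*.csv`, picking the Step 2 input with a plain glob can select the wrong file. The following is a minimal sketch, assuming the `YYYYMMDD_HHMMSS` timestamp format used when Step 1 saves its files. `latest_events_csv` is a hypothetical helper, not part of the notebook.

```python
import re
from pathlib import Path

# Matches events_YYYYMMDD_HHMMSS.csv but not events_detailed_*.csv
_EVENTS_CSV = re.compile(r"^events_(\d{8}_\d{6})\.csv$")


def latest_events_csv(data_dir):
    """Return the newest Step 1 CSV in data_dir, or None if there is none."""
    candidates = []
    for path in Path(data_dir).glob("events_*.csv"):
        match = _EVENTS_CSV.match(path.name)
        if match:
            candidates.append((match.group(1), path))
    if not candidates:
        return None
    # The timestamp sorts chronologically as a plain string
    return max(candidates, key=lambda item: item[0])[1]
```

A call such as `latest_events_csv(DATA_DIR)` could then supply the path for the detailed extraction step.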

## 🚀 Quick Start

1. Run **Step 1** to extract all event URLs
2. Run **Step 2** to extract detailed event information from those URLs
3. All results are saved to the `data/` folder


## 📚 Import Required Libraries

Import all necessary libraries for web scraping, data processing, and analysis.

In [None]:
import requests
import time
import json
import csv
import re
import datetime
from pathlib import Path
from bs4 import BeautifulSoup
from html import unescape
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import logging
import pandas as pd
from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError, ConnectionFailure

print("✅ All libraries imported successfully")
print("🔧 Setting up logging configuration...")

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

✅ All libraries imported successfully
🔧 Setting up logging configuration...


## 🔧 Configuration Setup

Configure scraping parameters, directories, and settings.

In [2]:
# Configuration
JOINNUS_CONFIG = {
    'base_domain': 'https://www.joinnus.com',
    'classic_domain': 'https://classic.joinnus.com',
    'headers': {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    }
}

URLS_CONFIG = {
    'csv_file': Path('../notebook/urls.csv'),
}

SCRAPING_CONFIG = {
    'timing': {
        'page_load_wait': 10,
        'element_wait': 5,
    },
    'pagination': {
        'max_pages': 100,
    },
    'browser': {
        'headless': True,
        'disable_gpu': True,
        'window_size': '1920,1080',
        'disable_images': False
    },
}

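# MongoDB Configuration
# The connection string contains credentials, so it is read from the
# environment instead of being hardcoded in the notebook.
# MONGODB_URI is an assumed variable name; set it before running this cell.
import os

MONGODB_CONFIG = {
    'uri': os.environ.get('MONGODB_URI', ''),
    'database': 'recommendations-system',
    'collection': 'events',
}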

# Data directory in same location as notebook
NOTEBOOK_DIR = Path(__file__).parent if '__file__' in dir() else Path.cwd()
DATA_DIR = NOTEBOOK_DIR / 'data'
DATA_DIR.mkdir(exist_ok=True)

print("✅ Configuration loaded")
print(f"📂 Data directory: {DATA_DIR}")
print(f"🗄️  MongoDB: {MONGODB_CONFIG['database']}.{MONGODB_CONFIG['collection']}")


✅ Configuration loaded
📂 Data directory: c:\Scrapping\joinnus\notebook\data
🗄️  MongoDB: recommendations-system.events
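
`URLS_CONFIG['csv_file']` points at the category list that Step 1 reads. The extraction code uses the `url` and `category` columns after stripping whitespace from the header names, so a file of the following shape should work. This is a minimal sketch using only the standard library. The two rows are placeholder values, not real Joinnus category URLs, and `load_categories` is a hypothetical helper.

```python
import csv
import io

# Placeholder rows: replace with real category URLs and names
SAMPLE_URLS_CSV = """url, category
https://www.joinnus.com/example-category-a, Example A
https://www.joinnus.com/example-category-b, Example B
"""


def load_categories(text):
    """Parse the category CSV, stripping whitespace from headers and values."""
    reader = csv.DictReader(io.StringIO(text))
    rows = []
    for row in reader:
        rows.append({
            key.strip(): (value or "").strip()
            for key, value in row.items()
            if key is not None  # ignore surplus cells beyond the header
        })
    return rows


categories = load_categories(SAMPLE_URLS_CSV)
print(categories[0]["category"])
```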


## ✅ Working Pagination Logic

This is the **proven pagination logic** that correctly handles numbered page buttons (1, 2, 3, etc.) instead of generic prev/next buttons. It tracks event IDs to detect duplicates and knows when pagination is complete.

Key features:
- **Finds numbered page buttons** (e.g., button with text "2" for page 2)
- **Compares event IDs** between pages to detect when we've reached the end
- **Uses JavaScript click** for reliable button activation
- **8-second wait** for AJAX page load to complete
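
The end-of-pagination check in the list above reduces to a set comparison on event IDs. Here is a minimal sketch of that check in isolation, with no browser involved. `page_is_repeat` is a hypothetical name, and the event dictionaries only need the `id` key that the listing extractor produces.

```python
def page_is_repeat(previous_ids, page_events):
    """Return (is_repeat, current_ids) for a freshly scraped page.

    A page counts as a repeat when it is empty or carries exactly the
    same event IDs as the previous page.
    """
    current_ids = {event["id"] for event in page_events}
    is_repeat = not current_ids or current_ids == previous_ids
    return is_repeat, current_ids


page_1 = [{"id": "101"}, {"id": "102"}]
page_2 = [{"id": "103"}, {"id": "102"}]

repeat_1, ids_1 = page_is_repeat(set(), page_1)   # first page: not a repeat
repeat_2, ids_2 = page_is_repeat(ids_1, page_2)   # overlaps, but not identical
repeat_3, _ = page_is_repeat(ids_2, page_2)       # identical: stop here
```

Partially overlapping pages are not treated as the end, so the combined list can still contain duplicate IDs. Deduplicating by `id` before saving is one way to handle that.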


In [3]:
# ✅ WORKING PAGINATION LOGIC
# This logic correctly handles pagination by:
# 1. Looking for numbered page buttons (1, 2, 3, etc.)
# 2. Comparing event IDs between pages to detect duplicates
# 3. Stopping when no new events are found

def paginate_category(driver, category_name, extract_event_urls_func):
    """
    Paginate through all pages of a category and extract event URLs.
    
    Parameters:
    - driver: Selenium WebDriver
    - category_name: Name of the category
    - extract_event_urls_func: Function to extract events from HTML
    
    Returns:
    - List of all events from all pages
    """
    all_events = []
    
    try:
        # Extract events from first page
        events = extract_event_urls_func(driver.page_source, category_name)
        print(f"   Found {len(events)} events on page 1")
        all_events.extend(events)
        
        # Track previous page event IDs for comparison
        previous_ids = set(e['id'] for e in events)
        
        # Pagination loop - iterate through numbered page buttons
        page = 2
        while True:
            try:
                # Find all pagination buttons
                pag_buttons = driver.find_elements(By.XPATH, "//div[contains(@class, 'space-x-2')]//button")
                
                # Buttons are: [prev, 1, 2, 3, 4, ..., next]
                # We need to find the button with text matching current page number
                page_btn = None
                for btn in pag_buttons:
                    btn_text = btn.text.strip()
                    if btn_text == str(page):
                        page_btn = btn
                        break
                
                if not page_btn:
                    # No button for this page number, we've reached the end
                    print(f"   Page {page}: Button not found - reached end")
                    break
                
                # Click page button using JavaScript
                print(f"   Clicking page {page}...", end=" ", flush=True)
                driver.execute_script("arguments[0].scrollIntoView(true);", page_btn)
                time.sleep(1)
                driver.execute_script("arguments[0].click();", page_btn)
                time.sleep(8)  # Wait for AJAX page load
                
                # Scroll to load lazy-loaded content
                driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
                time.sleep(2)
                
                # Extract events from this page
                page_events = extract_event_urls_func(driver.page_source, category_name)
                
                if page_events:
                    # Get set of event IDs on this page
                    current_ids = set(e['id'] for e in page_events)
                    
                    # Check if events are different from previous page
                    if current_ids == previous_ids:
                        print(f"Same events as page {page-1} - reached end")
                        break
                    
                    new_count = len(current_ids - previous_ids)
                    print(f"Found {len(page_events)} events ({new_count} new)")
                    all_events.extend(page_events)
                    previous_ids = current_ids
                else:
                    print(f"No events found - stopping pagination")
                    break
                
                page += 1
                
            except Exception as e:
                print(f"Pagination error: {e}")
                break
        
        print(f"   ✅ Total events: {len(all_events)}")
        return all_events
        
    except Exception as e:
        print(f"   ❌ Error in pagination: {e}")
        return all_events


print("✅ Pagination function defined - ready to use")
print("   Use: events = paginate_category(driver, category_name, extract_func)")


✅ Pagination function defined - ready to use
   Use: events = paginate_category(driver, category_name, extract_func)


## 🤖 Selenium Browser Setup

Configure and initialize the Selenium WebDriver with anti-detection measures.

In [4]:
class JonnusWebDriver:
    """Manages Selenium WebDriver with anti-detection measures"""
    
    def __init__(self):
        self.driver = None
        print("🤖 Initializing Joinnus WebDriver")
    
    def setup_driver(self):
        """Setup Chrome WebDriver with anti-detection options"""
        try:
            chrome_options = Options()
            
            # Anti-detection measures
            if SCRAPING_CONFIG['browser']['headless']:
                chrome_options.add_argument("--headless")
            
            chrome_options.add_argument("--no-sandbox")
            chrome_options.add_argument("--disable-dev-shm-usage")
            
            if SCRAPING_CONFIG['browser']['disable_gpu']:
                chrome_options.add_argument("--disable-gpu")
            
            chrome_options.add_argument(f"--window-size={SCRAPING_CONFIG['browser']['window_size']}")
            chrome_options.add_argument(f"user-agent={JOINNUS_CONFIG['headers']['User-Agent']}")
            chrome_options.add_argument("--disable-blink-features=AutomationControlled")
            chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
            chrome_options.add_experimental_option('useAutomationExtension', False)
            
            # Disable image loading for faster scraping (only when configured)
            if SCRAPING_CONFIG['browser']['disable_images']:
                prefs = {"profile.managed_default_content_settings.images": 2}
                chrome_options.add_experimental_option("prefs", prefs)
            
            # Initialize driver
            service = Service(ChromeDriverManager().install())
            self.driver = webdriver.Chrome(service=service, options=chrome_options)
            
            print("✅ Chrome WebDriver initialized successfully")
            return self.driver
            
        except Exception as e:
            print(f"❌ Error initializing WebDriver: {e}")
            raise
    
    def quit(self):
        """Close the WebDriver"""
        if self.driver:
            self.driver.quit()
            print("✅ WebDriver closed")
    
    def wait_for_element(self, selector, by=By.CSS_SELECTOR, timeout=None):
        """Wait for element to be present"""
        timeout = timeout or SCRAPING_CONFIG['timing']['element_wait']
        try:
            WebDriverWait(self.driver, timeout).until(
                EC.presence_of_element_located((by, selector))
            )
            return True
        except Exception:  # includes TimeoutException; avoids swallowing KeyboardInterrupt
            return False
    
    def scroll_to_bottom(self):
        """Scroll to bottom of page to trigger lazy loading"""
        try:
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(2)
            return True
        except Exception:  # avoids swallowing KeyboardInterrupt
            return False

print("✅ JonnusWebDriver class defined")

✅ JonnusWebDriver class defined


## 🔍 Data Extraction Methods

Define methods to extract event information from Joinnus pages.

In [5]:
class JonnusEventExtractor:
    """Extract event data from Joinnus pages"""
    
    def __init__(self):
        self.logger = logging.getLogger(__name__)
        print("🔍 Joinnus Event Extractor initialized")
    
    def extract_event_urls_from_listing(self, html_content, category):
        """Extract event URLs and IDs from listing page using aria-label selector"""
        try:
            soup = BeautifulSoup(html_content, 'html.parser')
            events = []
            
            # Find all links with aria-label="Ver detalle del evento"
            event_links = soup.find_all('a', {'aria-label': 'Ver detalle del evento'})
            
            for link in event_links:
                try:
                    href = link.get('href')
                    if not href:
                        continue
                    
                    # Extract ID from URL (last number in the URL)
                    # URL format: .../event-name-12345
                    match = re.search(r'-(\d+)$', href)
                    if match:
                        event_id = match.group(1)
                    else:
                        # Try alternate format or use last part
                        parts = href.strip('/').split('-')
                        event_id = parts[-1] if parts[-1].isdigit() else None
                    
                    if not event_id:
                        continue
                    
                    # Convert relative URLs to absolute
                    if href.startswith('/'):
                        href = JOINNUS_CONFIG['base_domain'] + href
                    elif not href.startswith('http'):
                        href = JOINNUS_CONFIG['base_domain'] + '/' + href
                    
                    event_data = {
                        'id': event_id,
                        'url': href,
                        'category': category
                    }
                    events.append(event_data)
                    
                except Exception as e:
                    self.logger.error(f"Error extracting event from link: {e}")
                    continue
            
            print(f"✅ Found {len(events)} event URLs on page")
            return events
            
        except Exception as e:
            self.logger.error(f"Error extracting event links: {e}")
            return []

print("✅ JonnusEventExtractor class defined")


def extract_and_store_events():
    """
    Step 1: Extract events from all Joinnus categories and store in data/ directory
    """
    print("🚀 STEP 1: EXTRACTING EVENTS FROM ALL CATEGORIES")
    print("=" * 70)
    
    web_driver_manager = None
    all_events = []
    
    try:
        # Load categories from CSV
        urls_df = pd.read_csv(URLS_CONFIG['csv_file'])
        urls_df.columns = urls_df.columns.str.strip()
        categories = urls_df.to_dict('records')
        
        print(f"📂 Found {len(categories)} categories to process\n")
        
        # Initialize driver
        web_driver_manager = JonnusWebDriver()
        driver = web_driver_manager.setup_driver()
        
        # Process each category
        for idx, category_data in enumerate(categories, 1):
            category_url = category_data['url'].strip()
            category_name = category_data['category'].strip()
            
            print(f"[{idx}/{len(categories)}] 📍 {category_name}")
            
            try:
                # Navigate to category
                driver.get(category_url)
                time.sleep(5)
                web_driver_manager.scroll_to_bottom()
                
                # Extract events using the pagination function
                extractor = JonnusEventExtractor()
                category_events = paginate_category(driver, category_name, extractor.extract_event_urls_from_listing)
                
                all_events.extend(category_events)
                print(f"   ✅ Total from this category: {len(category_events)}\n")
                
            except Exception as e:
                print(f"   ❌ Error: {str(e)}\n")
                continue
        
        # Save extracted events
        if all_events:
            timestamp = datetime.datetime.now().strftime('%Y%m%d_%H%M%S')
            
            # Save as JSON
            json_file = DATA_DIR / f"events_{timestamp}.json"
            with open(json_file, 'w', encoding='utf-8') as f:
                json.dump(all_events, f, ensure_ascii=False, indent=2)
            print(f"✅ Saved JSON: {json_file.name} ({len(all_events)} events)")
            
            # Save as CSV
            csv_file = DATA_DIR / f"events_{timestamp}.csv"
            df = pd.DataFrame(all_events)
            df.to_csv(csv_file, index=False, encoding='utf-8')
            print(f"✅ Saved CSV: {csv_file.name}")
            
            print(f"\n📊 EXTRACTION SUMMARY")
            print("=" * 70)
            print(f"   Total events extracted: {len(all_events)}")
            print(f"   Categories processed: {len(categories)}")
            print(f"   Files saved to: {DATA_DIR}")
            
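            return all_events, json_file, csv_file
        
        # No events were collected: return the same shape as the error path
        # so the caller can always unpack three values.
        print("❌ No events extracted - nothing to save")
        return [], None, None
        
    except Exception as e:
        print(f"❌ Extraction failed: {e}")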
        import traceback
        traceback.print_exc()
        return [], None, None
    
    finally:
        if web_driver_manager:
            web_driver_manager.quit()
            print("\n🌐 Browser closed")


def process_and_analyze_events(csv_file):
    """
    Step 2: Read CSV and extract important data to JSON format
    """
    print("\n🔄 STEP 2: PROCESSING AND ANALYZING EVENTS")
    print("=" * 70)
    
    try:
        # Read the CSV file
        df = pd.read_csv(csv_file)
        print(f"📖 Read {len(df)} events from CSV\n")
        
        # Extract important data
        analysis = {
            'extraction_date': datetime.datetime.now().isoformat(),
            'total_events': len(df),
            'total_categories': df['category'].nunique() if 'category' in df.columns else 0,
            'categories': {},
            'sample_events': df.head(10).to_dict('records')
        }
        
        # Analyze by category
        if 'category' in df.columns:
            category_counts = df['category'].value_counts().to_dict()
            for cat, count in sorted(category_counts.items()):
                analysis['categories'][cat] = {
                    'event_count': int(count),
                    'percentage': round((count / len(df)) * 100, 2)
                }
            
            print("📊 Events by Category:")
            for cat, info in analysis['categories'].items():
                print(f"   {cat}: {info['event_count']} events ({info['percentage']}%)")
        
        # Save analysis to JSON
        timestamp = datetime.datetime.now().strftime('%Y%m%d_%H%M%S')
        analysis_file = DATA_DIR / f"analysis_{timestamp}.json"
        
        with open(analysis_file, 'w', encoding='utf-8') as f:
            json.dump(analysis, f, ensure_ascii=False, indent=2)
        
        print(f"\n✅ Saved analysis: {analysis_file.name}")
        
        print(f"\n📈 ANALYSIS SUMMARY")
        print("=" * 70)
        print(f"   Total events: {analysis['total_events']}")
        print(f"   Total categories: {analysis['total_categories']}")
        print(f"   Files saved to: {DATA_DIR}")
        
        return analysis, analysis_file
        
    except Exception as e:
        print(f"❌ Processing failed: {e}")
        import traceback
        traceback.print_exc()
        return None, None


print("✅ Extraction and processing functions defined")


✅ JonnusEventExtractor class defined
✅ Extraction and processing functions defined
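
The ID parsing inside `extract_event_urls_from_listing` can be checked without a browser. The sketch below reuses the same regular expression and fallback. The sample hrefs are made up for illustration, and `parse_event_id` is a hypothetical name. It also shows one limitation: an href that carries a query string yields no ID, so such links would be skipped.

```python
import re


def parse_event_id(href):
    """Return the trailing numeric ID of an event href, or None."""
    match = re.search(r"-(\d+)$", href)
    if match:
        return match.group(1)
    # Fallback: tolerate a trailing slash
    last_part = href.strip("/").split("-")[-1]
    return last_part if last_part.isdigit() else None


# Hypothetical hrefs, not taken from the site
print(parse_event_id("/events/concerts/lima-some-show-12345"))      # 12345
print(parse_event_id("/events/concerts/lima-some-show-12345/"))     # 12345
print(parse_event_id("/events/concerts/lima-some-show-12345?x=1"))  # None
print(parse_event_id("/events/concerts/no-id-here"))                # None
```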


## ▶️ Run Complete Workflow

Execute the full extraction → storage → analysis pipeline in one go.


In [6]:
# ▶️ RUN THE COMPLETE WORKFLOW
# This will:
# 1. Extract events from all 19 categories (15-30 minutes)
# 2. Save to data/events_TIMESTAMP.json and .csv
# 3. Analyze and save stats to data/analysis_TIMESTAMP.json

print("🎯 JOINNUS EVENTS EXTRACTION & ANALYSIS PIPELINE")
print("=" * 70)
print("This process will:")
print("  1️⃣ Extract events from all Joinnus categories")
print("  2️⃣ Store results in 'data/' folder (JSON + CSV)")
print("  3️⃣ Analyze results and save stats to JSON")
print("=" * 70)
print()

# Step 1: Extract and store events
events, events_json_file, events_csv_file = extract_and_store_events()

# Step 2: Process and analyze (only if extraction succeeded)
if events_csv_file:
    analysis, analysis_file = process_and_analyze_events(events_csv_file)

    # process_and_analyze_events returns (None, None) on failure
    if analysis:
        print("\n🎉 WORKFLOW COMPLETE!")
        print("=" * 70)
        print(f"✅ All files saved to: {DATA_DIR}")
        print("   📄 Events JSON: events_*.json")
        print("   📊 Events CSV: events_*.csv")
        print("   📈 Analysis: analysis_*.json")
        print("\n📊 Quick Stats:")
        print(f"   Total events: {len(events)}")
        print(f"   Total categories: {analysis['total_categories']}")
        print("   Top categories:")
        sorted_cats = sorted(analysis['categories'].items(), key=lambda x: x[1]['event_count'], reverse=True)
        for cat, info in sorted_cats[:5]:
            print(f"      • {cat}: {info['event_count']} events")
    else:
        print("\n❌ Workflow stopped - analysis failed")
else:
    print("\n❌ Workflow stopped - extraction failed")



🎯 JOINNUS EVENTS EXTRACTION & ANALYSIS PIPELINE
This process will:
  1️⃣ Extract events from all Joinnus categories
  2️⃣ Store results in 'data/' folder (JSON + CSV)
  3️⃣ Analyze results and save stats to JSON

🚀 STEP 1: EXTRACTING EVENTS FROM ALL CATEGORIES
📂 Found 19 categories to process

🤖 Initializing Joinnus WebDriver


2025-11-02 13:12:23,853 - WDM - INFO - Get LATEST chromedriver version for google-chrome
2025-11-02 13:12:25,087 - WDM - INFO - WebDriver version 141.0.7390.122 selected
2025-11-02 13:12:25,091 - WDM - INFO - Modern chrome version https://storage.googleapis.com/chrome-for-testing-public/141.0.7390.122/win32/chromedriver-win32.zip
2025-11-02 13:12:25,092 - WDM - INFO - About to download new driver from https://storage.googleapis.com/chrome-for-testing-public/141.0.7390.122/win32/chromedriver-win32.zip


🌐 Browser closed


KeyboardInterrupt: 

## 📋 Step 2: Extract Comprehensive Event Data

Extract detailed information from each event page (title, description, images, pricing, tags, etc.)
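
Several of these fields come from the page's JSON-LD block. Schema.org markup is not always a single object: it can be a list of objects, or an object with an `@graph` array, and `@type` can itself be a list. The following is a minimal sketch of a lookup that tolerates those shapes, assuming the payload has already been parsed with `json.loads`. `find_event_node` is a hypothetical helper, and whether Joinnus pages use these variants has not been verified here.

```python
import json


def find_event_node(payload):
    """Return the first dict whose @type is (or includes) 'Event', else {}."""
    if isinstance(payload, list):
        for item in payload:
            found = find_event_node(item)
            if found:
                return found
        return {}
    if not isinstance(payload, dict):
        return {}
    node_type = payload.get("@type")
    types = node_type if isinstance(node_type, list) else [node_type]
    if "Event" in types:
        return payload
    # Some sites nest their nodes under "@graph"
    return find_event_node(payload.get("@graph", []))


raw = '{"@graph": [{"@type": "Organization"}, {"@type": ["Event"], "name": "Demo"}]}'
event = find_event_node(json.loads(raw))
print(event.get("name"))
```

Subtypes such as `MusicEvent` are not matched by this sketch. Extending the comparison would be a separate design decision.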

In [None]:
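import time
import re
# Note: this rebinds the name `datetime` from the module (imported in the
# first cell) to the class. After this cell runs, the Step 1 functions that
# call `datetime.datetime.now()` will fail until the first cell is re-run.
from datetime import datetime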

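def extract_from_json_ld(soup):
    """Extract structured data from JSON-LD schema"""
    scripts = soup.find_all('script', {'type': 'application/ld+json'})
    for script in scripts:
        try:
            # script.string is None when the tag has no single text node
            data = json.loads(script.string or '')
        except (json.JSONDecodeError, TypeError):
            # Skip malformed blocks instead of abandoning the remaining ones
            continue
        if isinstance(data, dict) and data.get('@type') == 'Event':
            return data
    return {}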


def extract_event_details(driver, url, event_id, category):
    """
    Extract comprehensive event data from a Joinnus event page.
    
    Returns:
        dict: Extracted event data with all fields
    """
    try:
        driver.get(url)
        time.sleep(2)
        
        # Scroll to load lazy-loaded content
        for _ in range(3):
            driver.execute_script("window.scrollBy(0, window.innerHeight);")
            time.sleep(0.3)
        
        html = driver.page_source
        soup = BeautifulSoup(html, 'html.parser')
        json_ld = extract_from_json_ld(soup)
        
        data = {
            'event_id': event_id,
            'url': url,
            'category': category,
            'title': None,
            'description': None,
            'city': None,
            'location_venue': None,
            'address': None,
            'rating': None,
            'event_type': None,
            'price_min': None,
            'price_currency': None,
            'tags': [],
            'images': [],
            'start_date': None,
            'end_date': None,
            'times': [],
            'extracted_at': datetime.now().isoformat()
        }
        
        # Extract title
        h2_tag = soup.find('h2', class_='text-xl')
        if h2_tag:
            data['title'] = h2_tag.get_text(strip=True)
        elif json_ld.get('name'):
            data['title'] = json_ld['name']
        else:
            og_title = soup.find('meta', {'property': 'og:title'})
            if og_title:
                data['title'] = og_title.get('content', '').split('|')[0].strip()
        
        # Extract description (strip HTML tags and decode HTML entities)
        og_desc = soup.find('meta', {'property': 'og:description'})
        if og_desc:
            raw_desc = og_desc.get('content')
        elif json_ld.get('description'):
            raw_desc = json_ld['description']
        else:
            raw_desc = None

        if raw_desc:
            try:
                cleaned = BeautifulSoup(raw_desc, 'html.parser').get_text(separator=' ', strip=True)
                cleaned = unescape(cleaned)
                data['description'] = cleaned
            except Exception:
                data['description'] = raw_desc
        
        # Extract city and location from spans
        spans = soup.find_all('span', class_='h-full')
        span_texts = [s.get_text(strip=True) for s in spans]
        if len(span_texts) > 1:
            data['city'] = span_texts[1]
        
        # Extract location venue (Real Plaza Angamos, MALI, etc.)
        for i, span in enumerate(spans):
            prev_text = span.find_previous(string=True)
            if prev_text and 'Ubicación' in prev_text:
                next_h_full = span.find_next('span', class_='h-full')
                if next_h_full:
                    data['location_venue'] = next_h_full.get_text(strip=True)
                break
        
        # Extract address from JSON-LD
        if json_ld.get('location', {}).get('address', {}).get('streetAddress'):
            data['address'] = json_ld['location']['address']['streetAddress']
        
        # Extract rating
        rating_span = soup.find('span', class_=['base-rating'])
        if rating_span:
            data['rating'] = rating_span.get_text(strip=True)
        
        # Extract price info from JSON-LD
        if json_ld.get('offers', {}).get('lowPrice') is not None:
            data['price_min'] = json_ld['offers']['lowPrice']
            data['price_currency'] = json_ld['offers'].get('priceCurrency')
        
        # Extract event type/audience
        audience_span = soup.find('span', class_=re.compile(r'text-\[0\.625rem\]'))
        if audience_span:
            text = audience_span.get_text(strip=True)
            if 'Apto' in text or 'General' in text:
                data['event_type'] = text
        
        # Extract tags/categories from all spans
        all_h_full_spans = soup.find_all('span', class_='h-full')
        for span in all_h_full_spans:
            text = span.get_text(strip=True)
            if text and text not in ['Lima', data.get('city', ''), 'Descubrir']:
                if 2 < len(text) < 40 and len(text.split()) <= 3:
                    if text not in data['tags']:
                        data['tags'].append(text)
        
        # Tags are already deduplicated above; slicing keeps page order
        data['tags'] = data['tags'][:20]
        
        # Extract images: prefer S3-hosted event images on classic routes and avoid generic site assets
        images = []
        img_tags = soup.find_all('img')
        candidate_srcs = []
        for img in img_tags:
            src = img.get('src') or img.get('data-src') or img.get('data-lazy-src') or img.get('data-original')
            if not src:
                continue
            # Normalize protocol-relative and relative URLs
            if src.startswith('//'):
                src = 'https:' + src
            if src.startswith('/') and not src.startswith('//') and 'http' not in src:
                base = JOINNUS_CONFIG.get('base_domain','').rstrip('/')
                src = base + src
            candidate_srcs.append(src)
        # Deduplicate preserving order
        seen = set()
        candidate_srcs = [x for x in candidate_srcs if not (x in seen or seen.add(x))]
        
        def is_generic_asset(s):
            low = s.lower()
            if 'maps-preview.png' in low:
                return True
            if '/profile/' in low:
                return True
            if '/files/' in low:
                return True
            if '/libro-de-reclamaciones' in low:
                return True
            if '/icons/' in low or '/icon-' in low:
                return True
            if 'placeholder' in low or 'avatar' in low:
                return True
            return False
        
        # Identify S3 and CDN candidates
        s3_images = [s for s in candidate_srcs if 's3.' in s and 'joinnus.com' in s and not is_generic_asset(s)]
        cdn_images = [s for s in candidate_srcs if 'cdn.joinnus.com' in s and not is_generic_asset(s)]
        
        # Determine if this is a classic route (only apply S3-priority for classic pages)
        is_classic = False
        classic_domain = JOINNUS_CONFIG.get('classic_domain','')
        try:
            if classic_domain and classic_domain.replace('https://','').replace('http://','') in url:
                is_classic = True
            elif 'classic.joinnus.com' in url:
                is_classic = True
        except Exception:
            is_classic = False
        
        # Ordering: for classic pages prefer S3 images first, otherwise prefer CDN images first
        selected = []
        if is_classic:
            for s in s3_images:
                if s not in selected:
                    selected.append(s)
            for s in cdn_images:
                if s not in selected:
                    selected.append(s)
        else:
            for s in cdn_images:
                if s not in selected:
                    selected.append(s)
            for s in s3_images:
                if s not in selected:
                    selected.append(s)
        
        # Fallback: include any non-data images not matching generic patterns
        for s in candidate_srcs:
            if s not in selected and not s.startswith('data:') and not is_generic_asset(s):
                selected.append(s)
        
        data['images'] = selected[:5]
        
        # Extract dates and times from JSON-LD
        if json_ld.get('startDate'):
            data['start_date'] = json_ld['startDate']
        if json_ld.get('endDate'):
            data['end_date'] = json_ld['endDate']
        
        date_time_ps = soup.find_all('p', class_=['flex', 'gap-1'])
        for p in date_time_ps:
            text = p.get_text(strip=True)
            if re.search(r'\d{1,2}:\d{2}', text):
                if text not in data['times']:
                    data['times'].append(text)
        
        return data
        
    except Exception as e:
        logger.error(f"Error extracting details from {url}: {e}")
        return None


print("✅ Event detail extraction functions defined")

✅ Event detail extraction functions defined
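
The image loop above normalizes protocol-relative and root-relative `src` values by hand. `urllib.parse.urljoin` from the standard library covers both cases, plus paths relative to the current page. This is a sketch of that alternative, not the notebook's method. `normalize_image_src` is a hypothetical name and the sample paths are placeholders.

```python
from urllib.parse import urljoin


def normalize_image_src(page_url, src):
    """Resolve an <img> src against the page URL; return None for data: URIs."""
    if not src or src.startswith("data:"):
        return None
    return urljoin(page_url, src)


page = "https://www.joinnus.com/events/example-1"
print(normalize_image_src(page, "//cdn.joinnus.com/img/a.jpg"))
print(normalize_image_src(page, "/img/b.jpg"))
print(normalize_image_src(page, "https://cdn.joinnus.com/img/c.jpg"))
```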


In [87]:
def extract_detailed_event_data(csv_file_path, check_mongodb=True):
    """
    Extract comprehensive event data from all events in CSV file.
    - Skips already extracted events by checking data/events/ folder
    - Also checks MongoDB if available
    - Saves each event immediately to data/events/ folder as individual JSON files
    - Creates combined JSON at end in data/ directory
    
    Args:
        csv_file_path: Path to CSV with event URLs
        check_mongodb: If True, also check MongoDB for existing event IDs
        
    Returns:
        List of extracted event data dictionaries
    """
    print("\n" + "=" * 80)
    print("COMPREHENSIVE EVENT DATA EXTRACTION")
    print("=" * 80)
    
    try:
        # Read CSV file
        df = pd.read_csv(csv_file_path)
        total_events = len(df)
        print(f"\n" + "=" * 80)
        print("CSV ANALYSIS (BEFORE EXTRACTION)")
        print("=" * 80)
        print(f"📖 Loaded {total_events} events from CSV")
        print(f"📂 Processing started at {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
        
        # Normalize column names (strip whitespace)
        df.columns = df.columns.str.strip()
        print(f"   Available columns: {list(df.columns)}\n")
        
        # Show CSV stats BEFORE extraction
        print(f"Total rows in CSV: {total_events}")
        print(f"Columns: {list(df.columns)}")
        
        # Determine which columns to use (handle different naming variations)
        id_col = None
        url_col = None
        category_col = None
        
        # Find ID column
        for col in ['id', 'event_id', 'ID', 'Event ID']:
            if col in df.columns:
                id_col = col
                break
        
        # Find URL column
        for col in ['url', 'URL', 'event_url', 'link']:
            if col in df.columns:
                url_col = col
                break
        
        # Find category column
        for col in ['category', 'Category', 'event_category']:
            if col in df.columns:
                category_col = col
                break
        
        if not id_col or not url_col:
            print(f"❌ Error: CSV missing required columns")
            print(f"   Need: 'id' (or 'event_id') and 'url'")
            print(f"   Found columns: {list(df.columns)}")
            return [], None, None
        
        print(f"✓ Using columns: id='{id_col}', url='{url_col}', category='{category_col}'\n")
        
        # Create events directory for individual files
        events_dir = DATA_DIR / 'events'
        events_dir.mkdir(exist_ok=True)
        
        # Get list of already extracted event IDs from local files
        already_extracted = set()
        for event_file in events_dir.glob('event_*.json'):
            event_id = event_file.stem.replace('event_', '')
            already_extracted.add(event_id)
        
        # Also check MongoDB if requested
        mongodb_ids = set()
        if check_mongodb:
            try:
                mongo_store = MongoDBEventStore(MONGODB_CONFIG)
                if mongo_store.connect():
                    mongodb_ids = mongo_store.get_existing_ids()
                    mongo_store.close()
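            except Exception as e:
                # MongoDB is optional here: fall back to local files only, but say so
                logger.warning(f"MongoDB check skipped: {e}")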
        
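        # Combine both sets of IDs to skip
        skip_ids = already_extracted | mongodb_ids
        # Count only IDs that appear in the CSV, so stray local files, MongoDB
        # records, or duplicate rows cannot push the remaining count below zero
        csv_ids = set(df[id_col].astype(str))
        total_skip = len(csv_ids & skip_ids)
        to_extract = len(csv_ids - skip_ids)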
        
        # Display skip analysis before extraction
        print(f"\n📊 SKIP ANALYSIS:")
        print(f"   Already in data/events/ folder: {len(already_extracted)}")
        print(f"   Already in MongoDB: {len(mongodb_ids)}")
        print(f"   Total to skip: {total_skip}")
        print(f"   New events to extract: {to_extract}")
        
        if total_events > 0:
            skip_pct = (total_skip / total_events) * 100
            extract_pct = (to_extract / total_events) * 100
            print(f"\n   Breakdown: {skip_pct:.1f}% skip | {extract_pct:.1f}% extract")
        
        print(f"\n" + "=" * 80)
        print("STARTING EXTRACTION")
        print("=" * 80 + "\n")
        
        # Initialize WebDriver
        web_driver = JonnusWebDriver()
        driver = web_driver.setup_driver()
        
        all_events = []
        failed_events = []
        skipped_count = 0
        duplicated_count = 0

        repeat = set()
        try:
            for idx, row in df.iterrows():
                url = str(row[url_col])
                event_id = str(row[id_col])
                category = str(row[category_col]) if category_col else 'Unknown'

                if event_id in repeat:
                    duplicated_count += 1
                    skipped_count += 1
                    continue
                repeat.add(event_id)
                
                # Skip if already extracted (local) or in database (MongoDB)
                if event_id in skip_ids:
                    skipped_count += 1
                    continue
                
                # Progress indicator (adjusted for skipped events)
                processed_count = idx + 1 - skipped_count
                total_to_process = to_extract  # same figure the skip analysis printed
                progress_pct = (processed_count / total_to_process * 100) if total_to_process > 0 else 0
                
                print(f"[{processed_count:4d}/{total_to_process}] ({progress_pct:5.1f}%) Event {event_id}...", end='', flush=True)
                
                # Extract detailed data
                event_data = extract_event_details(driver, url, event_id, category)
                
                if event_data:
                    all_events.append(event_data)
                    
                    # Save individual event immediately to events/ folder
                    event_file = events_dir / f"event_{event_id}.json"
                    with open(event_file, 'w', encoding='utf-8') as f:
                        json.dump(event_data, f, ensure_ascii=False, indent=2)
                    
                    print(f" ✓ {event_data['title'][:40] if event_data['title'] else 'Unknown'}")
                else:
                    failed_events.append({'id': event_id, 'url': url})
                    print(f" ⚠️ Failed")
                
                # Restart browser every 50 events for stability
                if (processed_count) % 50 == 0 and processed_count > 0:
                    print(f"\n  🔄 Restarting browser for stability...\n")
                    driver.quit()
                    driver = web_driver.setup_driver()
            
        finally:
            driver.quit()
            print("\n✓ Browser closed")
        
        # Load already extracted events for combined file
        print(f"\n🔄 Loading previously extracted events...")
        for event_id in already_extracted:
            event_file = events_dir / f"event_{event_id}.json"
            if event_file.exists():
                try:
                    with open(event_file, 'r', encoding='utf-8') as f:
                        event_data = json.load(f)
                        all_events.append(event_data)
                except Exception as e:
                    logger.error(f"Error loading {event_file}: {e}")
        
        all_events.sort(key=lambda x: int(x['event_id']))
        
        # Print summary statistics
        print(f"\n" + "=" * 80)
        print("EXTRACTION SUMMARY")
        print("=" * 80)
        print(f"Total processed (new): {len(all_events) - len(already_extracted)}/{total_events - len(skip_ids)}")
        print(f"Previously extracted (local): {len(already_extracted)}")
        print(f"Already in MongoDB: {len(mongodb_ids)}")
        print(f"Total unique events: {len(all_events)}")
        print(f"Duplicates in CSV skipped: {duplicated_count}")
        print(f"Failed extractions: {len(failed_events)}")
        if total_events > 0:
            success_rate = (len(all_events) / total_events * 100)
            print(f"Overall success rate: {success_rate:.1f}%")
        
        print(f"\n📁 Individual Event Files Created:")
        print(f"   Location: {events_dir.name}/ ({len(all_events)} event_*.json files)")
        
        if failed_events:
            failed_file = DATA_DIR / f'events_failed_{datetime.now().strftime("%Y%m%d_%H%M%S")}.json'
            with open(failed_file, 'w', encoding='utf-8') as f:
                json.dump(failed_events, f, ensure_ascii=False, indent=2)
            print(f"⚠️  Failed events: {failed_file.name} ({len(failed_events)} events)")
        
        print(f"\n💡 Note: Run the combining cell to generate combined JSON/CSV files")
        
        return all_events, None, None
        
    except Exception as e:
        print(f"\n❌ Error: {e}")
        import traceback
        traceback.print_exc()
        return [], None, None


print("✅ Comprehensive extraction function defined")
print("   Use: all_events, json_file, csv_file = extract_detailed_event_data(csv_path, check_mongodb=True)")

✅ Comprehensive extraction function defined
   Use: all_events, json_file, csv_file = extract_detailed_event_data(csv_path, check_mongodb=True)
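
The resume logic in `extract_detailed_event_data` comes down to set arithmetic over three ID sources: the CSV rows, the files in `data/events/`, and the IDs already in MongoDB. A minimal sketch of that planning step, assuming all IDs are compared as strings; `plan_extraction` and the sample IDs are hypothetical and not part of the notebook:

```python
def plan_extraction(csv_ids, local_ids, db_ids):
    """Split CSV event IDs into those to skip and those still to extract."""
    # De-duplicate the CSV IDs while keeping their original order.
    seen = set()
    unique_csv_ids = []
    for raw in csv_ids:
        eid = str(raw)
        if eid not in seen:
            seen.add(eid)
            unique_csv_ids.append(eid)

    # Anything already saved locally or stored in the database is skipped.
    skip_ids = {str(i) for i in local_ids} | {str(i) for i in db_ids}

    return {
        'duplicates': len(csv_ids) - len(unique_csv_ids),
        'to_skip': [eid for eid in unique_csv_ids if eid in skip_ids],
        'to_extract': [eid for eid in unique_csv_ids if eid not in skip_ids],
    }


# Hypothetical IDs: 2 is duplicated in the CSV, 99 exists only in the database.
plan = plan_extraction([1, 2, 2, 3], local_ids=['2'], db_ids=[3, 99])
print(plan)
```

Counting against the CSV's own IDs keeps the "to extract" figure from going negative when the events folder or the database holds IDs that the current CSV does not list.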


## ▶️ Execute Step 2: Comprehensive Extraction

Run this cell to extract detailed data from all 603+ events.
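
The cell below picks the newest `events_*.csv` in the data directory. A minimal sketch of that selection, under two assumptions that differ from the cell: it sorts by modification time (`st_mtime`) rather than `getctime`, which means creation time on Windows but metadata-change time on Unix, and it takes a hypothetical `exclude` parameter so that derived files such as `events_combined.csv` can be left out:

```python
from pathlib import Path


def latest_csv(data_dir, pattern='events_*.csv', exclude=()):
    """Return the most recently modified matching CSV, or None if there is none."""
    candidates = [p for p in Path(data_dir).glob(pattern) if p.name not in exclude]
    if not candidates:
        return None
    # max() with a key avoids sorting the whole list just to take one element.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

Calling `latest_csv(DATA_DIR, exclude=('events_combined.csv',))` would return the newest URL list while ignoring the combined output, assuming `DATA_DIR` is the notebook's data directory.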

In [88]:
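# Find the most recent events CSV file in data directory
import os
from pathlib import Path
from datetime import datetime  # needed below for datetime.fromtimestamp

# Look for the CSV created in previous extraction step.
# Note: the pattern also matches events_combined.csv written by the combining cell,
# and getctime is creation time on Windows but metadata-change time on Unix.
csv_files = sorted(DATA_DIR.glob('events_*.csv'), key=os.path.getctime, reverse=True)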

if csv_files:
    # Use the most recent CSV file
    latest_csv = csv_files[0]
    print(f"📂 Using CSV file: {latest_csv.name}")
    print(f"📅 Created: {datetime.fromtimestamp(latest_csv.stat().st_ctime)}\n")
    
    # Run comprehensive extraction (all analysis and skip logic is inside the function)
    detailed_events, detailed_json, detailed_csv = extract_detailed_event_data(latest_csv)
    
    if detailed_events:
        print(f"\n🎉 EXTRACTION COMPLETE!")
        print(f"   Total events extracted: {len(detailed_events)}")
        print(f"   JSON file: {detailed_json.name if detailed_json else 'N/A'}")
        print(f"   CSV file: {detailed_csv.name if detailed_csv else 'N/A'}")
else:
    print("❌ No events CSV file found in data directory")
    print("   Please run the initial extraction pipeline first")




📂 Using CSV file: events_combined.csv
📅 Created: 2025-10-24 14:29:12.327344


COMPREHENSIVE EVENT DATA EXTRACTION

CSV ANALYSIS (BEFORE EXTRACTION)
📖 Loaded 518 events from CSV
📂 Processing started at 2025-10-24 14:43:58

   Available columns: ['event_id', 'title', 'category', 'city', 'location_venue', 'address', 'rating', 'price_min', 'price_currency', 'image_count', 'tag_count', 'time_slots', 'url']

Total rows in CSV: 518
Columns: ['event_id', 'title', 'category', 'city', 'location_venue', 'address', 'rating', 'price_min', 'price_currency', 'image_count', 'tag_count', 'time_slots', 'url']
✓ Using columns: id='event_id', url='url', category='category'

🗄️ Initializing MongoDB Event Store
✅ Connected to MongoDB
   Database: recommendations-system
   Collection: events
📊 Found 0 existing events in MongoDB
✅ MongoDB connection closed

📊 SKIP ANALYSIS:
   Already in data/events/ folder: 518
   Already in MongoDB: 0
   Total to skip: 518
   New events to extra

2025-10-24 14:43:58,894 - WDM - INFO - Get LATEST chromedriver version for google-chrome
2025-10-24 14:43:59,108 - WDM - INFO - Get LATEST chromedriver version for google-chrome
2025-10-24 14:43:59,322 - WDM - INFO - Driver [C:\Users\Singoe\.wdm\drivers\chromedriver\win64\141.0.7390.122\chromedriver-win32/chromedriver.exe] found in cache


✅ Chrome WebDriver initialized successfully

✓ Browser closed

🔄 Loading previously extracted events...

EXTRACTION SUMMARY
Total processed (new): 0/0
Previously extracted (local): 518
Already in MongoDB: 0
Total unique events: 518
Duplicates in CSV skipped: 0
Failed extractions: 0
Overall success rate: 100.0%

📁 Individual Event Files Created:
   Location: events/ (518 event_*.json files)

💡 Note: Run the combining cell to generate combined JSON/CSV files

🎉 EXTRACTION COMPLETE!
   Total events extracted: 518
   JSON file: N/A
   CSV file: N/A

In [93]:
# Combine all individual event JSON files into a single master file
import os
from pathlib import Path
from datetime import datetime as dt

print("=" * 80)
print("COMBINING ALL EVENT JSON FILES")
print("=" * 80)

# Path to individual event files
events_dir = DATA_DIR / 'events'
all_events = []

# Define important fields to keep in combined JSON
IMPORTANT_FIELDS = [
    'event_id', 'url', 'category', 'title', 'description',
    'city', 'location_venue', 'address', 'rating',
    'price_min', 'price_currency', 'tags', 'images',
    'start_date', 'end_date', 'times', 'extracted_at'
]

if events_dir.exists() and events_dir.is_dir():
    print(f"\n📁 Events directory: {events_dir}")
    
    # Find all individual event JSON files
    event_files = sorted(events_dir.glob('event_*.json'))
    print(f"📖 Found {len(event_files)} individual event JSON files\n")
    
    # Load all event files and filter to important fields only
    for idx, event_file in enumerate(event_files, 1):
        try:
            with open(event_file, 'r', encoding='utf-8') as f:
                event_data = json.load(f)
                # Keep only important fields
                filtered_event = {field: event_data.get(field) for field in IMPORTANT_FIELDS}
                all_events.append(filtered_event)
            
            if idx % 100 == 0:
                print(f"   Loaded {idx}/{len(event_files)} files...")
        except Exception as e:
            print(f"   ⚠️ Error loading {event_file.name}: {e}")
            continue
    
    if all_events:
        # Sort by event_id
        all_events.sort(key=lambda x: int(x['event_id']))
        print(f"\n✅ Loaded {len(all_events)} total events\n")
        print(f"📋 Fields in combined JSON: {', '.join(IMPORTANT_FIELDS)}\n")
        
        # Save combined JSON file with only important fields (overwrites previous file)
        json_file = DATA_DIR / 'events_combined.json'
        with open(json_file, 'w', encoding='utf-8') as f:
            json.dump(all_events, f, ensure_ascii=False, indent=2)
        print(f"✅ Saved combined JSON: {json_file.name}")
        print(f"   Location: {DATA_DIR}")
        print(f"   File size: {json_file.stat().st_size / 1024 / 1024:.2f} MB")
        
        # Also save as CSV for easy viewing
        csv_data = []
        for event in all_events:
            csv_data.append({
                'event_id': event.get('event_id'),
                'title': event.get('title'),
                'category': event.get('category'),
                'city': event.get('city'),
                'location_venue': event.get('location_venue'),
                'address': event.get('address'),
                'rating': event.get('rating'),
                'price_min': event.get('price_min'),
                'price_currency': event.get('price_currency'),
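                # Filtering with .get(field) stores None for missing fields, and
                # len(None) raises TypeError, so fall back to an empty list
                'image_count': len(event.get('images') or []),
                'tag_count': len(event.get('tags') or []),
                'time_slots': len(event.get('times') or []),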
                'url': event.get('url')
            })
        
        csv_df = pd.DataFrame(csv_data)
        csv_file = DATA_DIR / 'events_combined.csv'
        csv_df.to_csv(csv_file, index=False, encoding='utf-8')
        print(f"✅ Saved combined CSV: {csv_file.name}")
        print(f"   Location: {DATA_DIR}")
        
        # Print data quality summary
        print(f"\n" + "=" * 80)
        print("COMBINED DATA SUMMARY")
        print("=" * 80)
        print(f"Total events: {len(all_events)}")
        
        print(f"\nData Quality by Field:")
        fields = ['title', 'description', 'city', 'location_venue', 'address',
                 'price_min', 'images', 'times', 'tags', 'rating', 'start_date']
        for field in fields:
            count = sum(1 for e in all_events if e.get(field) and 
                       (isinstance(e[field], (int, float)) or 
                        (isinstance(e[field], (str, list)) and len(str(e[field])) > 0)))
            pct = (count / len(all_events) * 100) if all_events else 0
            print(f"  {field:20s}: {count:4d}/{len(all_events)} ({pct:5.1f}%)")
        
        print(f"\n📂 Output files created:")
        print(f"   JSON: {json_file.name}")
        print(f"   CSV:  {csv_file.name}")
        print(f"\n✅ COMBINE COMPLETE!")
    else:
        print("❌ No events loaded from files")
else:
    print(f"❌ Events directory not found: {events_dir}")
    print("   Please run Step 2 (comprehensive extraction) first")


COMBINING ALL EVENT JSON FILES

📁 Events directory: c:\Scrapping\joinnus\notebook\data\events
📖 Found 518 individual event JSON files

   Loaded 100/518 files...
   Loaded 200/518 files...
   Loaded 300/518 files...
   Loaded 400/518 files...
   Loaded 500/518 files...

✅ Loaded 518 total events

📋 Fields in combined JSON: event_id, url, category, title, description, city, location_venue, address, rating, price_min, price_currency, tags, images, start_date, end_date, times, extracted_at

✅ Saved combined JSON: events_combined.json
   Location: c:\Scrapping\joinnus\notebook\data
   File size: 0.72 MB
✅ Saved combined CSV: events_combined.csv
   Location: c:\Scrapping\joinnus\notebook\data

COMBINED DATA SUMMARY
Total events: 518

Data Quality by Field:
  title               :  518/518 (100.0%)
  description         :  518/518 (100.0%)
  city                :  331/518 ( 63.9%)
  location_venue      :  176/518 ( 34.0%)
  address             :  516/518 ( 99.6%)
  price_min  
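
The combined CSV stores counts of images, tags, and time slots rather than the lists themselves. Because the combining cell builds each record with `event_data.get(field)`, a field missing from the source file is stored as `None`, so a count helper has to treat `None` as empty. A minimal sketch with hypothetical helper names and a reduced set of columns:

```python
def count_items(value):
    """Length of a list-like field, treating None or any non-list value as empty."""
    return len(value) if isinstance(value, (list, tuple)) else 0


def to_csv_row(event):
    """Flatten one event dictionary into a summary row (subset of columns)."""
    return {
        'event_id': event.get('event_id'),
        'title': event.get('title'),
        'image_count': count_items(event.get('images')),
        'tag_count': count_items(event.get('tags')),
        'time_slots': count_items(event.get('times')),
    }


# Hypothetical event: tags is explicitly None and times is absent.
row = to_csv_row({'event_id': '1', 'title': 'Demo', 'images': ['a.jpg', 'b.jpg'], 'tags': None})
print(row)
```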

In [90]:
from collections import Counter
from datetime import datetime

# Data integrity analysis for events_combined.json
# Saves a JSON report and a CSV summary of field completeness to DATA_DIR


def _parse_iso(dt_str):
    if not dt_str or not isinstance(dt_str, str):
        return None
    s = dt_str.strip()
    # Handle trailing Z or timezone-less strings
    if s.endswith('Z'):
        s = s[:-1] + '+00:00'
    try:
        return datetime.fromisoformat(s)
    except Exception:
        # try removing milliseconds if any odd format
        try:
            return datetime.fromisoformat(s.split('.')[0])
        except Exception:
            return None

# locate combined JSON (use existing variable if available)
combined_path = globals().get('json_file', None) or (DATA_DIR / 'events_combined.json')
if not combined_path.exists():
    raise FileNotFoundError(f"Combined JSON not found: {combined_path}")

with open(combined_path, 'r', encoding='utf-8') as f:
    events = json.load(f)

total = len(events)
report = {
    'file': str(combined_path),
    'checked_at': datetime.now().isoformat(),
    'total_events': total,
    'unique_event_ids': 0,
    'duplicate_event_ids': [],
    'field_completeness': {},
    'field_type_issues': {},
    'dates_parsing_issues': 0,
    'url_domain_issues': 0,
    'price_stats': {},
    'images_tags_times_stats': {},
    'rating_distribution': {}
}

# basic id & duplicates
ids = [str(e.get('event_id')) for e in events]
unique_ids = set(ids)
report['unique_event_ids'] = len(unique_ids)
dupes = [eid for eid, cnt in Counter(ids).items() if cnt > 1]
report['duplicate_event_ids'] = dupes

# fields to analyze
fields = ['event_id','url','category','title','description','city','location_venue',
          'address','rating','event_type','price_min',
          'price_currency','tags','images','start_date','end_date','times','extracted_at']

# completeness
comp_rows = []
for field in fields:
    present = 0
    non_empty = 0
    for e in events:
        if field in e:
            present += 1
            v = e.get(field)
            if v is None:
                pass
            elif isinstance(v, str) and v.strip() == '':
                pass
            elif isinstance(v, (list, dict)) and len(v) == 0:
                pass
            else:
                non_empty += 1
    pct_present = (present / total) * 100 if total else 0
    pct_non_empty = (non_empty / total) * 100 if total else 0
    report['field_completeness'][field] = {
        'present_count': present,
        'present_pct': round(pct_present,2),
        'non_empty_count': non_empty,
        'non_empty_pct': round(pct_non_empty,2)
    }
    comp_rows.append({
        'field': field,
        'present_count': present,
        'present_pct': round(pct_present,2),
        'non_empty_count': non_empty,
        'non_empty_pct': round(pct_non_empty,2)
    })

# type checks & domain checks & date parsing & price collection
price_vals = []
price_parse_issues = 0
date_parse_issues = 0
url_issues = 0
domains_ok = (JOINNUS_CONFIG['base_domain'], JOINNUS_CONFIG['classic_domain'], 'https://prime.joinnus.com')
rating_counter = Counter()
images_counts = []
tags_counts = []
times_counts = []

for e in events:
    # event_id should be numeric-ish
    eid = e.get('event_id')
    if eid is not None:
        if not str(eid).isdigit():
            report['field_type_issues'].setdefault('event_id',0)
            report['field_type_issues']['event_id'] += 1

    # url domain
    url = e.get('url') or ''  # a key that is present but None would break startswith
    if not any(url.startswith(d) for d in domains_ok):
        url_issues += 1

    # price checks
    pm = e.get('price_min')
    if pm is not None and pm != '':
        try:
            price_vals.append(float(pm))
        except Exception:
            price_parse_issues += 1

    # dates
    sd = e.get('start_date')
    ed = e.get('end_date')
    ps = _parse_iso(sd)
    pe = _parse_iso(ed)
    if sd and not ps:
        date_parse_issues += 1
    if ed and not pe:
        date_parse_issues += 1

    # rating distribution
    rating_counter.update([str(e.get('rating')) if e.get('rating') is not None else 'NULL'])

    # lists counts
    imgs = e.get('images') or []
    tags = e.get('tags') or []
    times = e.get('times') or []
    images_counts.append(len(imgs) if isinstance(imgs, (list,tuple)) else 0)
    tags_counts.append(len(tags) if isinstance(tags, (list,tuple)) else 0)
    times_counts.append(len(times) if isinstance(times, (list,tuple)) else 0)

report['dates_parsing_issues'] = date_parse_issues
report['url_domain_issues'] = url_issues

# price stats
if price_vals:
    s = pd.Series(price_vals)
    report['price_stats'] = {
        'count': int(s.count()),
        'min': float(s.min()),
        '25%': float(s.quantile(0.25)),
        'median': float(s.median()),
        '75%': float(s.quantile(0.75)),
        'max': float(s.max()),
        'mean': float(s.mean())
    }
else:
    report['price_stats'] = {'count': 0}

# images/tags/times stats
def _summary_list_stats(lst):
    s = pd.Series(lst)
    return {
        'count': int(s.count()),
        'mean': float(s.mean()) if len(s) else 0,
        'median': float(s.median()) if len(s) else 0,
        'max': int(s.max()) if len(s) else 0,
        'pct_zero': float((s==0).sum() / len(s) * 100) if len(s) else 0
    }

report['images_tags_times_stats'] = {
    'images': _summary_list_stats(images_counts),
    'tags': _summary_list_stats(tags_counts),
    'times': _summary_list_stats(times_counts)
}

# rating distribution
report['rating_distribution'] = dict(rating_counter.most_common())

# record type issues counts found earlier
if price_parse_issues:
    report['field_type_issues']['price_min_parse_errors'] = price_parse_issues

# sample offending records (small samples)
sample_issues = {
    'duplicate_event_ids_sample': dupes[:10],
    'url_domain_issues_sample': [],
    'date_parse_issues_sample': [],
    'price_parse_issues_sample': []
}

for e in events:
    if len(sample_issues['url_domain_issues_sample']) < 10:
        u = e.get('url') or ''
        if not any(u.startswith(d) for d in domains_ok):
            sample_issues['url_domain_issues_sample'].append({'event_id': e.get('event_id'), 'url': u})
    if len(sample_issues['date_parse_issues_sample']) < 10:
        sd = e.get('start_date')
        ed = e.get('end_date')
        if (sd and not _parse_iso(sd)) or (ed and not _parse_iso(ed)):
            sample_issues['date_parse_issues_sample'].append({'event_id': e.get('event_id'), 'start_date': sd, 'end_date': ed})
    if len(sample_issues['price_parse_issues_sample']) < 10:
        pm = e.get('price_min')
        if pm not in (None, ''):
            try:
                float(pm)
            except Exception:
                sample_issues['price_parse_issues_sample'].append({'event_id': e.get('event_id'), 'price_min': pm})

report['samples'] = sample_issues

# save report
ts = datetime.now().strftime('%Y%m%d_%H%M%S')
out_json = DATA_DIR / f'integrity_report_{ts}.json'
out_csv = DATA_DIR / f'integrity_field_completeness_{ts}.csv'

with open(out_json, 'w', encoding='utf-8') as f:
    json.dump(report, f, ensure_ascii=False, indent=2)

pd.DataFrame(comp_rows).to_csv(out_csv, index=False, encoding='utf-8')

# Print concise summary
print(f"Integrity check saved: {out_json.name}")
print(f"Field completeness CSV: {out_csv.name}")
print(f"Total events: {total}")
print(f"Unique IDs: {report['unique_event_ids']}")
print(f"Duplicate event IDs: {len(report['duplicate_event_ids'])} (sample: {report['duplicate_event_ids'][:5]})")
print(f"Fields with non-trivial missingness (non-empty% < 90%):")
for frow in comp_rows:
    if frow['non_empty_pct'] < 90:
        print(f"  - {frow['field']}: {frow['non_empty_pct']}% non-empty")

print(f"Date parse issues: {report['dates_parsing_issues']}")
print(f"URL domain issues: {report['url_domain_issues']}")
print(f"Price numeric parse issues: {report['field_type_issues'].get('price_min_parse_errors', 0)}")
print(f"Images avg: {report['images_tags_times_stats']['images']['mean']:.2f}, tags avg: {report['images_tags_times_stats']['tags']['mean']:.2f}")

# expose report variable for interactive inspection
report


Integrity check saved: integrity_report_20251024_144408.json
Field completeness CSV: integrity_field_completeness_20251024_144408.csv
Total events: 518
Unique IDs: 518
Duplicate event IDs: 0 (sample: [])
Fields with non-trivial missingness (non-empty% < 90%):
  - city: 63.9% non-empty
  - location_venue: 33.98% non-empty
  - rating: 64.29% non-empty
  - event_type: 64.29% non-empty
  - tags: 64.29% non-empty
  - times: 64.29% non-empty
Date parse issues: 0
URL domain issues: 0
Price numeric parse issues: 0
Images avg: 2.42, tags avg: 5.49


{'file': 'c:\\Scrapping\\joinnus\\notebook\\data\\events_combined.json',
 'checked_at': '2025-10-24T14:44:08.074295',
 'total_events': 518,
 'unique_event_ids': 518,
 'duplicate_event_ids': [],
 'field_completeness': {'event_id': {'present_count': 518,
   'present_pct': 100.0,
   'non_empty_count': 518,
   'non_empty_pct': 100.0},
  'url': {'present_count': 518,
   'present_pct': 100.0,
   'non_empty_count': 518,
   'non_empty_pct': 100.0},
  'category': {'present_count': 518,
   'present_pct': 100.0,
   'non_empty_count': 518,
   'non_empty_pct': 100.0},
  'title': {'present_count': 518,
   'present_pct': 100.0,
   'non_empty_count': 518,
   'non_empty_pct': 100.0},
  'description': {'present_count': 518,
   'present_pct': 100.0,
   'non_empty_count': 518,
   'non_empty_pct': 100.0},
  'city': {'present_count': 518,
   'present_pct': 100.0,
   'non_empty_count': 331,
   'non_empty_pct': 63.9},
  'location_venue': {'present_count': 518,
   'present_pct': 100.0,
   'non_empty_count': 17
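
The report separates fields that are *present* (the key exists) from fields that are *non-empty* (the key holds a usable value), which is why `city` can be 100% present yet only 63.9% non-empty. A minimal sketch of that distinction on a hypothetical four-event sample; the helper names are illustrative and not the notebook's:

```python
def is_empty(value):
    """True for None, blank strings, and empty lists or dicts."""
    if value is None:
        return True
    if isinstance(value, str):
        return value.strip() == ''
    if isinstance(value, (list, dict)):
        return len(value) == 0
    return False


def field_completeness(events, field):
    """Count how often a field exists and how often it holds a real value."""
    total = len(events)
    present = sum(1 for e in events if field in e)
    non_empty = sum(1 for e in events if field in e and not is_empty(e[field]))

    def pct(n):
        return round(n / total * 100, 2) if total else 0

    return {
        'present_count': present,
        'present_pct': pct(present),
        'non_empty_count': non_empty,
        'non_empty_pct': pct(non_empty),
    }


# Hypothetical sample: one real city, one blank, one None, one missing key.
sample_events = [{'city': 'Lima'}, {'city': ''}, {'city': None}, {}]
print(field_completeness(sample_events, 'city'))
```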

## 📦 Step 3: Store Events in MongoDB

Upload extracted event data to MongoDB database, checking by event ID to avoid duplicates.
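
Checking by event ID amounts to an upsert keyed on `event_id`: insert when the key is new, replace when it already exists. The sketch below uses a plain dictionary as a stand-in for the collection (it is not pymongo, and `upsert_events` is a hypothetical name) to show why re-running an upload does not create duplicates, provided the key is normalized to a single type:

```python
def upsert_events(store, events):
    """Insert or replace events in an in-memory store keyed by string event_id."""
    stats = {'inserted': 0, 'updated': 0}
    for event in events:
        # Normalize the key so that 1 and '1' refer to the same event.
        key = str(event['event_id'])
        if key in store:
            stats['updated'] += 1
        else:
            stats['inserted'] += 1
        store[key] = {**event, 'event_id': key}
    return stats


store = {}
first = upsert_events(store, [{'event_id': 1, 'title': 'a'}, {'event_id': '2', 'title': 'b'}])
second = upsert_events(store, [{'event_id': '1', 'title': 'a (edited)'}])
print(first, second, len(store))
```

With a real collection the same caveat applies: a filter on a string ID does not match a document whose `event_id` is stored as an integer, so the types used in the filter and in the stored document need to agree.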


In [8]:
class MongoDBEventStore:
    """Manages MongoDB operations for event storage"""
    
    def __init__(self, mongo_config):
        self.config = mongo_config
        self.client = None
        self.db = None
        self.collection = None
        self.logger = logging.getLogger(__name__)
        print("🗄️ Initializing MongoDB Event Store")
    
    def connect(self):
        """Establish connection to MongoDB"""
        try:
            self.client = MongoClient(self.config['uri'], serverSelectionTimeoutMS=5000)
            # Test connection
            self.client.admin.command('ping')
            self.db = self.client[self.config['database']]
            self.collection = self.db[self.config['collection']]
            
            # Create index on event_id for faster lookups
            self.collection.create_index('event_id', unique=False)
            
            print(f"✅ Connected to MongoDB")
            print(f"   Database: {self.config['database']}")
            print(f"   Collection: {self.config['collection']}")
            return True
            
        except ConnectionFailure as e:
            print(f"❌ Failed to connect to MongoDB: {e}")
            print(f"   Make sure MongoDB is running at {self.config['uri']}")
            return False
        except Exception as e:
            print(f"❌ Error connecting to MongoDB: {e}")
            return False
    
    def get_existing_ids(self):
        """Get all existing event IDs in collection"""
        try:
            if self.collection is None:
                return set()
            existing_ids = set()
            for doc in self.collection.find({}, {'event_id': 1}):
                if 'event_id' in doc:
                    existing_ids.add(str(doc['event_id']))
            print(f"📊 Found {len(existing_ids)} existing events in MongoDB")
            return existing_ids
        except Exception as e:
            self.logger.error(f"Error fetching existing IDs: {e}")
            return set()
    
    def store_event(self, event_data):
        """Store a single event in MongoDB"""
        try:
            if self.collection is None:
                return False
            
            # Use replace_one with upsert to avoid duplicates
            result = self.collection.replace_one(
                {'event_id': event_data['event_id']},
                event_data,
                upsert=True
            )
            return True
        except Exception as e:
            self.logger.error(f"Error storing event {event_data.get('event_id')}: {e}")
            return False
    
    def store_events_batch(self, events, skip_existing=True):
        """
        Store multiple events in MongoDB with duplicate checking
        
        Args:
            events: List of event dictionaries
            skip_existing: If True, skip events already in database
            
        Returns:
            dict: Statistics about insertion
        """
        if self.collection is None:
            print("❌ Not connected to MongoDB")
            return None
        
        # Get existing IDs if we're skipping
        existing_ids = set()
        if skip_existing:
            existing_ids = self.get_existing_ids()
        
        stats = {
            'total': len(events),
            'inserted': 0,
            'updated': 0,
            'skipped': 0,
            'failed': 0
        }
        
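        print(f"\n📤 Starting batch upload to MongoDB...")
        print(f"   Total events: {stats['total']}")
        if skip_existing:
            # Count only incoming events that are missing from the database;
            # total minus len(existing_ids) goes wrong when the DB holds other events
            new_count = sum(1 for e in events if str(e.get('event_id')) not in existing_ids)
            print(f"   Existing in DB: {len(existing_ids)}")
            print(f"   New events: {new_count}\n")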
        
        for idx, event in enumerate(events, 1):
            event_id = str(event.get('event_id'))
            
            # Skip if already exists
            if skip_existing and event_id in existing_ids:
                stats['skipped'] += 1
                continue
            
            # Show progress
            progress_pct = (idx / stats['total'] * 100)
            print(f"[{idx:4d}/{stats['total']}] ({progress_pct:5.1f}%) Event {event_id}...", end='', flush=True)
            
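            try:
                # Store event (upsert = insert if not exists, update if exists).
                # Filter on the value as stored in the document: a stringified ID
                # never matches an integer event_id, which would insert a
                # duplicate on every run.
                result = self.collection.replace_one(
                    {'event_id': event.get('event_id')},
                    event,
                    upsert=True
                )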
                
                if result.upserted_id:
                    stats['inserted'] += 1
                    print(f" ✓ Inserted")
                else:
                    stats['updated'] += 1
                    print(f" ↻ Updated")
                    
            except Exception as e:
                stats['failed'] += 1
                print(f" ⚠️ Error: {str(e)[:50]}")
                self.logger.error(f"Error storing event {event_id}: {e}")
        
        return stats
    
    def close(self):
        """Close MongoDB connection"""
        if self.client is not None:
            self.client.close()
            print("✅ MongoDB connection closed")
    
    def get_stats(self):
        """Get collection statistics"""
        try:
            if self.collection is None:
                return None
            count = self.collection.count_documents({})
            return {
                'total_documents': count,
                'collection_name': self.config['collection'],
                'database_name': self.config['database']
            }
        except Exception as e:
            self.logger.error(f"Error getting stats: {e}")
            return None


print("✅ MongoDBEventStore class defined")

✅ MongoDBEventStore class defined
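
The skip logic in `store_events_batch` compares each event's ID, as a string, against the set of IDs already stored. The sketch below isolates that comparison in plain Python. The helper name `split_new_and_existing` and the sample IDs are hypothetical, and it assumes `get_existing_ids()` returns a set of string IDs. It also shows why the number of new events should be counted per batch rather than derived from `total - len(existing_ids)`: the database may hold IDs that are not in the batch at all.

```python
def split_new_and_existing(events, existing_ids):
    """Partition events into (new, already_stored) by string event_id."""
    new_events, stored_events = [], []
    for event in events:
        # IDs are compared as strings, mirroring store_events_batch
        event_id = str(event.get("event_id"))
        if event_id in existing_ids:
            stored_events.append(event)
        else:
            new_events.append(event)
    return new_events, stored_events


# Illustrative data only; these IDs are not taken from a real run
batch = [{"event_id": 52591}, {"event_id": "58430"}, {"event_id": 70001}]
already_in_db = {"52591", "99999"}  # "99999" is stored but absent from this batch

new_events, stored_events = split_new_and_existing(batch, already_in_db)
print(f"New: {len(new_events)}, already stored: {len(stored_events)}")
```

Here two events are new and one is already stored, whereas `len(batch) - len(already_in_db)` would report only one new event.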


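## ▶️ Execute Step 3: Upload to MongoDB

Run this cell to load the extracted events and store them in MongoDB. It reads the individual `event_*.json` files from `data/events/` when that directory exists, and otherwise falls back to the most recent `events_detailed_*.json` file.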
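
The loading step can also be wrapped in a function so it is easier to reuse and test. This is a minimal sketch rather than the notebook's own code: the name `load_event_files` is hypothetical, it assumes each file holds a single JSON object, and unlike the cell that follows it places events with a missing or non-numeric `event_id` last instead of raising during the sort.

```python
import json
import tempfile
from pathlib import Path


def load_event_files(events_dir):
    """Load every event_*.json file in events_dir, skipping unreadable ones."""
    events = []
    for path in sorted(Path(events_dir).glob("event_*.json")):
        try:
            with open(path, "r", encoding="utf-8") as f:
                events.append(json.load(f))
        except (OSError, json.JSONDecodeError) as exc:
            print(f"Skipping {path.name}: {exc}")

    def sort_key(event):
        # Numeric IDs sort by value; missing or non-numeric IDs go last
        raw = str(event.get("event_id", ""))
        return (0, int(raw)) if raw.isdecimal() else (1, 0)

    events.sort(key=sort_key)
    return events


# Usage with throwaway files (contents are illustrative)
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    (tmp_path / "event_58430.json").write_text('{"event_id": "58430"}', encoding="utf-8")
    (tmp_path / "event_2025.json").write_text('{"event_id": 2025}', encoding="utf-8")
    (tmp_path / "event_bad.json").write_text("{not json", encoding="utf-8")
    loaded = load_event_files(tmp_path)

print([e["event_id"] for e in loaded])
```

The malformed file is reported and skipped, and the two valid events come back ordered by numeric ID.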


In [9]:
# Find the most recent events_detailed JSON file or load all individual events
import os
from pathlib import Path
from datetime import datetime as dt

# First, try to load all individual event files from data/events/
events_dir = DATA_DIR / 'events'
all_events = []

if events_dir.exists() and events_dir.is_dir():
    print(f"📁 Found events directory: {events_dir}")
    event_files = sorted(events_dir.glob('event_*.json'))
    print(f"📖 Found {len(event_files)} individual event JSON files\n")
    
    # Load all individual event files
    for event_file in event_files:
        try:
            with open(event_file, 'r', encoding='utf-8') as f:
                event_data = json.load(f)
                all_events.append(event_data)
        except Exception as e:
            logger.error(f"Error loading {event_file.name}: {e}")
    
    # Sort by event_id
    all_events.sort(key=lambda x: int(x['event_id']))
    print(f"✅ Loaded {len(all_events)} events from individual files\n")
else:
    # Fallback: load from combined events_detailed JSON file
    json_files = sorted(DATA_DIR.glob('events_detailed_*.json'), key=os.path.getctime, reverse=True)
    
    if json_files:
        latest_json = json_files[0]
        print(f"📄 Found events file: {latest_json.name}")
        print(f"📅 Created: {dt.fromtimestamp(latest_json.stat().st_ctime)}\n")

        try:
            with open(latest_json, 'r', encoding='utf-8') as f:
                all_events = json.load(f)
            print(f"✅ Loaded {len(all_events)} events from JSON\n")
        except json.JSONDecodeError as e:
            print(f"❌ Error reading JSON file: {e}")
            all_events = []
    else:
        print("❌ No event files found")
        print("   Please run Step 2 (comprehensive extraction) first")
        all_events = []

if all_events:
    # Initialize MongoDB connection
    mongo_store = MongoDBEventStore(MONGODB_CONFIG)
    
    # Connect to MongoDB
    if mongo_store.connect():
        # Store events in MongoDB (skip existing by ID)
        stats = mongo_store.store_events_batch(all_events, skip_existing=True)
        
        if stats:
            print(f"\n" + "=" * 80)
            print("MONGODB UPLOAD SUMMARY")
            print("=" * 80)
            print(f"Total processed: {stats['total']}")
            print(f"  ✓ Inserted: {stats['inserted']}")
            print(f"  ↻ Updated: {stats['updated']}")
            print(f"  ⊘ Skipped (already exists): {stats['skipped']}")
            print(f"  ✗ Failed: {stats['failed']}")

            # Get final statistics
            db_stats = mongo_store.get_stats()
            if db_stats:
                print("\n📊 Database Statistics:")
                print(f"   Total events in collection: {db_stats['total_documents']}")
                print(f"   Database: {db_stats['database_name']}")
                print(f"   Collection: {db_stats['collection_name']}")

            print("\n🎉 UPLOAD COMPLETE!")
        else:
            print("❌ Upload failed")
        
        # Close connection
        mongo_store.close()
    else:
        print("❌ Failed to connect to MongoDB")
        print("   Make sure MongoDB is running on localhost:27017")
        print("   Or update MONGODB_CONFIG with your MongoDB URI")

📁 Found events directory: c:\Scrapping\joinnus\notebook\data\events
📖 Found 518 individual event JSON files

✅ Loaded 518 events from individual files

🗄️ Initializing MongoDB Event Store
✅ Connected to MongoDB
   Database: recommendations-system
   Collection: events
📊 Found 0 existing events in MongoDB

📤 Starting batch upload to MongoDB...
   Total events: 518
   Existing in DB: 0
   New events: 518

[   1/518] (  0.2%) Event 2025... ✓ Inserted
[   2/518] (  0.4%) Event 52591... ✓ Inserted
[   3/518] (  0.6%) Event 58430... ✓ Inserted
[   4/518] (  0.8%) Event 58431... ✓ Inserted

In [99]:
from bson import json_util
from datetime import datetime as dt  # alias avoids shadowing the `datetime` module imported earlier

# Connect to MongoDB and dump the entire collection to a JSON file
# (bson.json_util serializes ObjectId and datetime values)
mongo_store = MongoDBEventStore(MONGODB_CONFIG)
if not mongo_store.connect():
    raise SystemExit("❌ Failed to connect to MongoDB. Check MONGODB_CONFIG and network.")

docs = list(mongo_store.collection.find({}))
mongo_store.close()

out_file = DATA_DIR / f'events_from_mongo_{dt.now().strftime("%Y%m%d_%H%M%S")}.json'
with open(out_file, 'w', encoding='utf-8') as fh:
    fh.write(json_util.dumps(docs, indent=2))

print(f"✅ Exported {len(docs)} documents to: {out_file}")

🗄️ Initializing MongoDB Event Store
✅ Connected to MongoDB
   Database: recommendations-system
   Collection: events
✅ MongoDB connection closed
✅ Exported 518 documents to: c:\Scrapping\joinnus\notebook\data\events_from_mongo_20251024_150003.json
