# NEO Data Fetching and Processing Pipeline

This notebook contains a production-ready pipeline to fetch, parse, validate and process data for Near-Earth Objects (NEOs) from the [NeoDyS-2](https://newton.spacedys.com/neodys/) service.

The pipeline is organized into three main sections:
1.  **Configuration & Imports:** Sets the number of objects to process (`N_OBJECTS`) and imports all necessary libraries.
2.  **Helper Functions:** Contains all the functions for fetching, parsing, and processing the data.
3.  **Execution Pipeline:** Runs the main logic to fetch the object list, verify data, process the objects, and export the final dataset.

The final output is a single CSV file named `DataSet_NEO_VSA_from1to8nights.csv` containing the processed data for the specified number of objects.
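
As a quick sanity check on the exported file, the sketch below verifies that a DataFrame has the column layout written by the export step (7 basic fields, 6 Keplerian elements, 6 equinoctial elements). The helper name `check_schema` and the one-row sample are illustrative assumptions, not part of the notebook, and the sample values are made up.

```python
import io
import pandas as pd

# Column layout written by the export step of this notebook.
BASIC_COLS = ['object', 'night', 't_fit', 'alpha', 'delta', 'alpha_dot', 'delta_dot']
KEPLERIAN_COLS = ['a_kep', 'e', 'i', 'Omega', 'omega', 'M']
EQUINOCTIAL_COLS = ['a_equ', 'h', 'k', 'p', 'q', 'lambda']
EXPECTED_COLS = BASIC_COLS + KEPLERIAN_COLS + EQUINOCTIAL_COLS


def check_schema(df):
    """Return the expected columns that are missing from df (hypothetical helper)."""
    return [col for col in EXPECTED_COLS if col not in df.columns]


# Illustrative one-row CSV with made-up values, read from memory instead of disk.
header = ",".join(EXPECTED_COLS)
values = ",".join(["EXAMPLE_OBJ", "1"] + ["0.0"] * 17)
sample_df = pd.read_csv(io.StringIO(header + "\n" + values + "\n"))

missing = check_schema(sample_df)
print("Missing columns:", missing)
```

The same check could be pointed at the real CSV with `pd.read_csv(output_path)` once the pipeline has run.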

In [None]:
# --- Imports ---
import requests
import pandas as pd
import numpy as np
import re
import random
from bs4 import BeautifulSoup
from tqdm.notebook import tqdm
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter
from astropy.time import Time
import urllib.parse
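try:
    from numpy.exceptions import RankWarning  # NumPy >= 2.0
except ImportError:
    from numpy import RankWarning  # older NumPy, where np.polyfit emits np.RankWarning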
import warnings
warnings.filterwarnings("ignore", category=RankWarning)

# --- Configuration ---
N_OBJECTS = 9000 # Number of objects to process
RANDOM_SAMPLING = True # Use random sampling instead of first N_OBJECTS
RANDOM_SEED = 42 # Random seed for reproducible results (set to None for true randomness)

# t0 = 60800.0 #[MJD] Initial time for the database, according to NeoDys-2 website

# NeoDyS-2 query URL for the dataset extraction
page_url = "https://newton.spacedys.com/neodys/index.php?pc=3.2.1&pc0=3.2&lra=&ura=&lde=&ude=&lvm=&uvm=&lel=&uel=&lph=&uph=&lgl=&ugl=&ldfe=&udfe=&ldfs=&udfs=&lmo=&umo=&lspu=&uspu=&ldu=&udu=&sb=12&lal=1&ual=8"
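
The helper cell below estimates the angular rates `alpha_dot` and `delta_dot` for each night by fitting a polynomial to the observations and differentiating it at the mid-observation time `t_fit`. A minimal sketch of that idea follows, with one deliberate variation: the times are centered on the reference time before fitting, which tends to reduce the ill-conditioning that raw MJD values (around 6e4) cause in a quadratic fit. The function name `fit_rate` and the synthetic data are assumptions made for illustration.

```python
import numpy as np


def fit_rate(t, y, t_ref):
    """Estimate dy/dt at t_ref from a polynomial fit (hypothetical helper).

    Uses a quadratic when at least 4 points are available, otherwise a line,
    mirroring the thresholds used in this notebook.
    """
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    tau = t - t_ref  # centered time: the derivative at t_ref is the linear coefficient
    degree = 2 if len(t) >= 4 else 1
    coeffs = np.polyfit(tau, y, degree)  # coefficients ordered from highest power down
    return coeffs[-2]  # coefficient of tau**1


# Synthetic night: five observations spread over about an hour near MJD 60800.
t_ref = 60800.25
t = t_ref + np.array([-0.02, -0.01, 0.0, 0.01, 0.02])
true_rate = 0.05  # rad/day, made-up value
alpha = 1.0 + true_rate * (t - t_ref) + 0.3 * (t - t_ref) ** 2

rate = fit_rate(t, alpha, t_ref)
print(f"estimated rate: {rate:.6f} rad/day")
```

With centered times the quadratic term drops out of the derivative at `t_ref`, so the same expression covers both the linear and the quadratic case.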

In [None]:

def parse_rwo(rwo_content):
    """Parse RWO content from NeoDyS-2 format using dynamic field detection for robustness."""
    nights_data = {}
    if not rwo_content: 
        return nights_data

    lines = rwo_content.split('\n')
    for idx, line in enumerate(lines):
        # Skip header and empty lines
        if (line.startswith('!') or not line.strip() or 
            line.startswith('version') or line.startswith('errmod') or 
            line.startswith('RMS') or line.startswith('END_OF_HEADER')):
            continue
        parts = line.split()
        # Find year (4-digit number), month (1-2 digit), day_frac (float)
        year_idx, month_idx, day_frac_idx = None, None, None
        for i, val in enumerate(parts):
            if year_idx is None and re.match(r'^\d{4}$', val):
                year_idx = i
                # Try to get month and day_frac right after year
                if i+2 < len(parts):
                    month_idx = i+1
                    day_frac_idx = i+2
                break
        if year_idx is None or month_idx is None or day_frac_idx is None:
            continue
        year_val = parts[year_idx]
        month_val = parts[month_idx]
        day_frac_val = parts[day_frac_idx]
        # Check if year and month are integers and day_frac is float
        try:
            year = int(year_val) if year_val is not None and year_val.isdigit() else None
            month = int(month_val) if month_val is not None and month_val.isdigit() else None
            day_frac = float(day_frac_val) if day_frac_val is not None else None
        except Exception as e:
            print(f"DEBUG: Line {idx+1} invalid date fields: year={year_val}, month={month_val}, day_frac={day_frac_val} | {e}")
            continue
        if year is None or month is None or day_frac is None:
            print(f"DEBUG: Line {idx+1} missing or invalid date fields: year={year_val}, month={month_val}, day_frac={day_frac_val}")
            continue

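        # Locate RA and Dec by position: the field right after day_frac is skipped (assumed to be
        # the time-accuracy column), RA follows as 3 fields (hours, min, sec), and Dec is the first
        # integer-like field after RA plus the 2 fields that follow it (deg, min, sec)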
        ra_start = day_frac_idx + 2
        dec_start = None
        # Find RA indices (should be float/int values)
        if ra_start + 2 < len(parts):
            ra_h_idx, ra_m_idx, ra_s_idx = ra_start, ra_start+1, ra_start+2
        else:
            continue
        # Find Dec indices: look for first value after RA that matches degree pattern (+/-dd or dd)
        for j in range(ra_s_idx+1, len(parts)-2):
            if re.match(r'^[+-]?\d+$', parts[j]):
                dec_start = j
                break
        if dec_start is not None and dec_start+2 < len(parts):
            dec_d_idx, dec_m_idx, dec_s_idx = dec_start, dec_start+1, dec_start+2
        else:
            continue

        try:
            day_int = int(day_frac)
            frac = day_frac - day_int
            total_seconds = frac * 24 * 3600
            hours = int(total_seconds // 3600)
            minutes = int((total_seconds % 3600) // 60)
            seconds = int(total_seconds % 60)

            ra_h = float(parts[ra_h_idx]) if parts[ra_h_idx] is not None else 0.0
            ra_m = float(parts[ra_m_idx]) if parts[ra_m_idx] is not None else 0.0
            ra_s = float(parts[ra_s_idx]) if parts[ra_s_idx] is not None else 0.0
            alpha = np.deg2rad(15.0 * (ra_h + ra_m / 60.0 + ra_s / 3600.0))

            dec_d = float(parts[dec_d_idx]) if parts[dec_d_idx] is not None else 0.0
            dec_m = float(parts[dec_m_idx]) if parts[dec_m_idx] is not None else 0.0
            dec_s = float(parts[dec_s_idx]) if parts[dec_s_idx] is not None else 0.0
            dec_sign = -1 if str(parts[dec_d_idx]).startswith('-') else 1
            delta_deg = dec_sign * (abs(dec_d) + dec_m / 60.0 + dec_s / 3600.0)
            delta = np.deg2rad(delta_deg)

            iso_str = f"{year:04d}-{month:02d}-{day_int:02d}T{hours:02d}:{minutes:02d}:{seconds:02d}"
            epoch = Time(iso_str, format='isot', scale='utc').mjd
            night_date = f"{year}-{month:02d}-{day_int:02d}"
            if night_date not in nights_data:
                nights_data[night_date] = []
            nights_data[night_date].append({
                'mjd': epoch,
                'alpha': alpha,
                'delta': delta
            })

        except (ValueError, IndexError, KeyError) as e:
            print(f"DEBUG: Line {idx+1} parse error: {repr(line)} | {e}")
            continue

    return nights_data

def fetch_kep_elements(object_name):
    """Fetches and parses real Keplerian elements using table row order and symbol labels."""
    encoded_name = urllib.parse.quote(object_name)
    url = f"https://newton.spacedys.com/neodys/index.php?pc=1.1.1&n={encoded_name}"
    html_content = get_content(url)
    if not html_content:
        return {}
    try:
        soup = BeautifulSoup(html_content, 'html.parser')
        tables = soup.find_all('table')
        for table in tables:
            if "keplerian" in table.get_text(strip=True).lower():
                rows = table.find_all('tr')
                values = []
                for row in rows:
                    cells = row.find_all(['td', 'th'])
                    if len(cells) >= 2:
                        val = cells[1].get_text(strip=True)
                        numeric_match = re.search(r'([-+]?\d*\.?\d+)', val)
                        if numeric_match:
                            values.append(float(numeric_match.group(1)))
                if len(values) >= 6:
                    # Assign by order: a, e, i, Omega (symbol), omega (symbol), M
                    elements = {
                        'a_kep': values[0],
                        'e': values[1],
                        'i': values[2],
                        'Omega': values[3],
                        'omega': values[4],
                        'M': values[5]
                    }
                    return elements
    except Exception as e:
        pass
    return {}

def process_object(neo_name):
    """
    Fetches, parses, and processes all data for a single NEO object and validates
    that the files contain usable data (e.g., RWO has valid nights, elements are present).
    This function is designed to be executed in a concurrent manner.

    Args:
        neo_name (str): The name of the NEO object to process.

    Returns:
        tuple: A tuple containing:
            - list: A list of dictionaries, each representing a processed observation night.
            - str: The name of the object, for tracking.
            - str: The status of the processing ('processed' or 'rejected').
    """

    ALPHA_DOT_THRESHOLD = np.pi/2  # radians/day
    DELTA_DOT_THRESHOLD = np.pi/2  # radians/day

    ALPHA_MIN, ALPHA_MAX = 0.0, 2 * np.pi
    DELTA_MIN, DELTA_MAX = -np.pi / 2, np.pi / 2

    # Step 1: Fetch all required data for the object
    rwo_result = fetch_rwo(neo_name)
    kep_elements = fetch_kep_elements(neo_name)
    eq0_elements = fetch_equ0_elements(neo_name)

    # Step 2: Validate that ALL required data was fetched successfully
    # Require RWO data AND Keplerian elements AND EQU elements
    if not rwo_result or not rwo_result.get('content') or not kep_elements or not eq0_elements:
        print(f"REJECTED: Missing data for {neo_name}")
        return [], neo_name, 'rejected'

    # Step 3: Parse the RWO data to get observation nights
    nights_data = parse_rwo(rwo_result['content'])
    if not nights_data:
        print(f"REJECTED: No valid nights for {neo_name}")
        return [], neo_name, 'rejected'

    # Step 4: Process each valid night of observation to create data rows
    object_rows = []
    sorted_nights = sorted(nights_data.items())
    for night_index, (night_date, obs_list) in enumerate(sorted_nights, 1):
        alpha_dot, delta_dot = 0.0, 0.0
        a_fit, d_fit = None, None
        if len(obs_list) < 2:
            print(f"Skipping night {night_index} ({night_date}) due to insufficient observations: {len(obs_list)}")
            continue

        obs_list.sort(key=lambda o: o.get('mjd', 0))
        mid_obs = obs_list[len(obs_list) // 2]
        t_fit = mid_obs.get('mjd', 0.0)

        # Calculate alpha_dot and delta_dot using np.polyfit (same as loader)
        if len(obs_list) >= 4:
            t = np.array([obs['mjd'] for obs in obs_list])
            alpha = np.array([obs['alpha'] for obs in obs_list])
            delta = np.array([obs['delta'] for obs in obs_list])

            with warnings.catch_warnings(record=True) as w:
                warnings.simplefilter('always', RankWarning)
                try:
                    a_fit = np.polyfit(t, alpha, 2)
                    d_fit = np.polyfit(t, delta, 2)
                    rankwarn = any(isinstance(warn.message, RankWarning) for warn in w)
                    if rankwarn:
                        a_fit = np.polyfit(t, alpha, 1)
                        d_fit = np.polyfit(t, delta, 1)
                        alpha_dot = a_fit[0]
                        delta_dot = d_fit[0]
                    else:
                        alpha_dot = 2 * a_fit[0] * t_fit + a_fit[1]
                        delta_dot = 2 * d_fit[0] * t_fit + d_fit[1]
                except Exception as fit_err:
                    print(f"Fit error in night {night_index}: {fit_err}")
                    a_fit, d_fit = None, None
                    alpha_dot, delta_dot = 0.0, 0.0

        elif len(obs_list) <= 3:
            t = np.array([obs['mjd'] for obs in obs_list])
            alpha = np.array([obs['alpha'] for obs in obs_list])
            delta = np.array([obs['delta'] for obs in obs_list])
            try:
                a_fit = np.polyfit(t, alpha, 1)
                d_fit = np.polyfit(t, delta, 1)
                alpha_dot = a_fit[0]
                delta_dot = d_fit[0]
            except Exception as fit_err:
                print(f"Linear fit error in night {night_index}: {fit_err}")
                a_fit, d_fit = None, None
                alpha_dot, delta_dot = 0.0, 0.0

        row = {
            'object': neo_name, 'night': night_index, 't_fit': t_fit,
            'alpha': mid_obs.get('alpha', 0.0), 'delta': mid_obs.get('delta', 0.0),
            'alpha_dot': alpha_dot, 'delta_dot': delta_dot, 
            **kep_elements, **eq0_elements
        }

        if (-ALPHA_DOT_THRESHOLD < alpha_dot < ALPHA_DOT_THRESHOLD and 
            -DELTA_DOT_THRESHOLD < delta_dot < DELTA_DOT_THRESHOLD and
            ALPHA_MIN <= row['alpha'] <= ALPHA_MAX and
            DELTA_MIN <= row['delta'] <= DELTA_MAX ):

            object_rows.append(row)
        else:
            print(f"Skipping night {night_index} for {neo_name} due to non-physical fit estimations")
            continue

    if not object_rows:
        print(f"REJECTED: No valid nights with enough observations for {neo_name}")
        return [], neo_name, 'rejected'

    return object_rows, neo_name, 'processed'

def get_content(url, timeout=30, max_retries=3):
    """
    Fetches content from a URL with improved error handling and retries.
    Updated with robust session configuration to avoid HTTP 403 errors.
    """
    session = requests.Session()
    
    # Configure retry strategy
    retry_strategy = Retry(
        total=max_retries,
        backoff_factor=2,
        status_forcelist=[403, 429, 500, 502, 503, 504],
        allowed_methods=["HEAD", "GET", "OPTIONS"]
    )
    
    adapter = HTTPAdapter(max_retries=retry_strategy)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    
    # Enhanced headers to mimic real browser behavior (fixes 403 errors)
    session.headers.update({
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
        'Accept-Language': 'en-US,en;q=0.9',
        'Accept-Encoding': 'gzip, deflate',  # 'br' omitted: Brotli is only decoded if a brotli package is installed
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'Sec-Fetch-Dest': 'document',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Site': 'none',
        'Sec-Fetch-User': '?1',
        'Cache-Control': 'max-age=0',
        'DNT': '1',
        'Sec-CH-UA': '"Chromium";v="122", "Not(A:Brand";v="24", "Google Chrome";v="122"',
        'Sec-CH-UA-Mobile': '?0',
        'Sec-CH-UA-Platform': '"Linux"'
    })
    
    try:
        response = session.get(url, timeout=timeout)  # Removed verify=False
        response.raise_for_status()
        return response.text
    except requests.exceptions.Timeout:
        return None
    except requests.exceptions.ConnectionError:
        return None
    except requests.exceptions.HTTPError as e:
        if e.response.status_code != 404:
            print(f"🔥 HTTP error {e.response.status_code} fetching {url}")
        return None
    except Exception:
        return None
    finally:
        session.close()

def is_valid_neo_name(name):
    """Validate NEO object names to filter out invalid entries."""
    if not name or len(name) < 4:
        return False
    
    # Clean the name
    name = name.strip()
    
    # More precise patterns for NEO names
    patterns = [
        r'^\d{4}\s+[A-Z]{1,3}\d*$',     # e.g., "2023 BU", "1991 VG"
        r'^\d{4}[A-Z]{1,3}\d*$',        # e.g., "2015RN35", "2019EA2" (no space)
        r'^\(\d+\)\s*[A-Za-z\s]*$',     # e.g., "(433) Eros", "(1) Ceres"
        r'^[A-Z]{2,}\d+$',              # e.g., "AA29" (two or more letters followed by digits)
    ]
    
    # Check if it matches any valid pattern
    if not any(re.match(pattern, name) for pattern in patterns):
        return False
    
    # Additional filters to exclude obviously invalid names
    invalid_keywords = [
        'object', 'name', 'http', 'www', 'com', 'neodys', 'spacedys',
        'index', 'php', 'table', 'cell', 'row', 'column', 'data',
        'the', 'and', 'or', 'but', 'for', 'with', 'without'
    ]
    
    name_lower = name.lower()
    if any(keyword in name_lower for keyword in invalid_keywords):
        return False
    
    # Must contain at least one digit and one letter
    has_digit = any(c.isdigit() for c in name)
    has_letter = any(c.isalpha() for c in name)
    
    return has_digit and has_letter

def get_all_prefixes(page_url):
    """Fetch the list of all NEO objects from the main NeoDyS-2 page."""
    url = page_url
    html_content = get_content(url)
    
    if html_content:
        soup = BeautifulSoup(html_content, 'html.parser')
        neo_objects = []
        
        # Strategy 1: Look for links containing object names
        for link in soup.find_all('a', href=True):
            if 'pc=1.1.1&n=' in link.get('href', ''):
                # Extract the object name from the URL parameter
                href = link.get('href', '')
                if '&n=' in href:
                    name = href.split('&n=')[1].split('&')[0]  # Get name from URL
                    name = name.replace('%20', ' ')  # Decode URL encoding
                    if is_valid_neo_name(name):
                        neo_objects.append(name)
        
        # Strategy 2: Look for text patterns that match NEO names
        page_text = soup.get_text()
        # Find year-letter-number patterns (e.g., 2015 RN35, 2019EA2)
        neo_patterns = [
            r'\b(\d{4}\s+[A-Z]{1,3}\d+)\b',  # 2015 RN35
            r'\b(\d{4}[A-Z]{1,3}\d+)\b',     # 2015RN35
            r'\b(\(\d+\)\s*[A-Za-z\s]*)\b'   # (433) Eros
        ]
        
        for pattern in neo_patterns:
            matches = re.findall(pattern, page_text)
            for match in matches:
                name = match.strip()
                if is_valid_neo_name(name):
                    neo_objects.append(name)
        
        if neo_objects:
            unique_neos = list(set(neo_objects))
            print(f"Successfully extracted {len(unique_neos)} NEOs from the website.")
            return unique_neos  # Return all extracted NEOs, not limited
    
    # Fallback: try a different page or approach
    print("Primary extraction failed. Trying alternative approach...")
    
    # Try the main NEA list page
    alt_url = "https://newton.spacedys.com/neodys/index.php?pc=1.1.0"
    html_content = get_content(alt_url)
    
    if html_content:
        soup = BeautifulSoup(html_content, 'html.parser')
        neo_objects = []
        
        # Look for any NEO-like patterns in the text
        page_text = soup.get_text()
        neo_patterns = [
            r'\b(\d{4}\s+[A-Z]{1,3}\d+)\b',
            r'\b(\d{4}[A-Z]{1,3}\d+)\b'
        ]
        
        for pattern in neo_patterns:
            matches = re.findall(pattern, page_text)
            for match in matches:
                name = match.strip()
                if is_valid_neo_name(name):
                    neo_objects.append(name)
        
        if neo_objects:
            unique_neos = list(set(neo_objects))
            print(f"Alternative extraction found {len(unique_neos)} NEOs.")
            return unique_neos  # Return all found NEOs, not limited
    
    print("❌ Could not extract NEO names from website. Please check the NeoDyS-2 site structure.")
    return []

def fetch_rwo(object_name):
    """Fetch RWO file content from NeoDyS-2."""
    # URL-encode the object name once (quote() already turns spaces into %20)
    encoded_name = urllib.parse.quote(object_name)
    url = f"https://newton.spacedys.com/~neodys2/mpcobs/{encoded_name}.rwo"
    content = get_content(url)
    if content and not content.lower().startswith('<html>'):
        return {'content': content}
    else:
        return None

def fetch_equ0_elements(object_name):
    """Fetch EQU elements from .eq0 file.
    Returns exactly 6 equinoctial elements: a_equ, h, k, p, q, lambda
    """
    encoded_name = urllib.parse.quote(object_name)
    url = f"https://newton.spacedys.com/~neodys2/epoch/{encoded_name}.eq0"
    response_text = get_content(url)
    if not response_text: 
        return {}
    try:
        for line in response_text.strip().split('\n'):
            if line.strip().startswith('EQU'):
                parts = line.split()
                # Extract exactly 6 equinoctial elements: a_equ, h, k, p, q, lambda
                if len(parts) >= 7:  # 'EQU' label plus the 6 elements (extra fields are ignored)
                    result = {
                        'a_equ': float(parts[1]),      # Semi-major axis (equinoctial)
                        'h': float(parts[2]),          # h = e * sin(omega + Omega)
                        'k': float(parts[3]),          # k = e * cos(omega + Omega)
                        'p': float(parts[4]),          # p = tan(i/2) * sin(Omega)
                        'q': float(parts[5]),          # q = tan(i/2) * cos(Omega)
                        'lambda': float(parts[6])      # lambda = M + omega + Omega
                    }
                    return result
    except (ValueError, IndexError) as e: 
        pass
    return {}

def create_dataset(neo_list):
    """
    Processes a list of NEO names concurrently to fetch their data, including RWO, 
    Keplerian elements, and corrected equinoctial elements.

    Args:
        neo_list (list): A list of NEO names to process.

    Returns:
        tuple: A tuple containing:
            - list: A list of dictionaries, where each dictionary represents a row in the final dataset.
            - int: The number of objects that were rejected due to missing data.
    """
    all_rows = []
    rejected_count = 0
    
    # Use ThreadPoolExecutor to process objects in parallel
    with ThreadPoolExecutor(max_workers=20) as executor:
        # Create a future for each NEO object processing task
        future_to_neo = {executor.submit(process_object, neo_name): neo_name for neo_name in neo_list}
        
        # Use tqdm to create a progress bar for the concurrent tasks
        pbar = tqdm(as_completed(future_to_neo), total=len(neo_list), desc="Processing NEOs")
        
        for future in pbar:
            try:
                # Get the result from the completed future
                object_rows, neo_name, status = future.result()
                
                if status == 'processed' and object_rows:
                    all_rows.extend(object_rows)
                else:
                    rejected_count += 1
            except Exception as e:
                # Handle exceptions that might occur within the thread
                neo_name = future_to_neo[future]
                print(f"REJECTED: Unexpected error while processing {neo_name}: {e}")
                rejected_count += 1

    return all_rows, rejected_count

def check_data_availability(neo_name):
    """
    Lightweight check to verify only if all three data sources (RWO, Keplerian, EQU) are available,
    but not their quality. Uses HEAD requests or minimal content fetching for efficiency.
    Updated with robust session to avoid HTTP 403 errors.
    
    Args:
        neo_name (str): The NEO name to check
        
    Returns:
        bool: True if all three data sources are available, False otherwise
    """
	    
    # Prepare URLs for all three data sources
    encoded_name = urllib.parse.quote(neo_name)
    
    urls_to_check = {
        'rwo': f"https://newton.spacedys.com/~neodys2/mpcobs/{encoded_name}.rwo",
        'equ': f"https://newton.spacedys.com/~neodys2/epoch/{encoded_name}.eq0",
        'kep': f"https://newton.spacedys.com/neodys/index.php?pc=1.1.1&n={urllib.parse.quote(neo_name)}"
    }
    
    session = requests.Session()
    # Use the same robust headers that fixed the 403 errors
    session.headers.update({
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
        'Accept-Language': 'en-US,en;q=0.9',
        'Accept-Encoding': 'gzip, deflate',  # 'br' omitted: Brotli is only decoded if a brotli package is installed
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'Sec-Fetch-Dest': 'document',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Site': 'none',
        'Sec-Fetch-User': '?1',
        'Cache-Control': 'max-age=0'
    })
    
    available_sources = 0
    
    try:
        # Check RWO and EQU file availability (HEAD requests avoid downloading the body)
        for key in ('rwo', 'equ'):
            try:
                response = session.head(urls_to_check[key], timeout=5)
                if response.status_code == 200:
                    available_sources += 1
            except requests.RequestException:
                pass

        # Check Keplerian elements page (a GET is needed to inspect the content)
        try:
            response = session.get(urls_to_check['kep'], timeout=5)
            if response.status_code == 200 and 'keplerian' in response.text.lower():
                available_sources += 1
        except requests.RequestException:
            pass
    finally:
        session.close()
    
    # Return True only if all 3 sources are available
    return available_sources >= 3

def validate_object(all_neo_names, target_count, use_random_sampling=True, random_seed=None):
    """
    Validates NEO names and checks data availability to select up to target_count
    valid NEOs with complete data (fewer if not enough candidates pass the check).
    
    Args:
        all_neo_names (list): List of all NEO names from the website
        target_count (int): Number of validated NEOs needed
        use_random_sampling (bool): Whether to randomly sample or take first N objects
        random_seed (int): Random seed for reproducible results (None for true randomness)
        
    Returns:
        list: List of at most target_count NEO names with complete data available
    """
    sampling_method = "random sampling" if use_random_sampling else "sequential selection"
    print(f"🔍 Validating NEOs with data availability check (target: {target_count}, method: {sampling_method})...")
    
    # Step 1: Filter for valid NEO names first
    valid_names = [name for name in all_neo_names if is_valid_neo_name(name)]
    print(f"📋 Found {len(valid_names)} NEOs with valid name patterns")
    
    # Step 2: Apply sampling strategy
    if use_random_sampling:
        # Set random seed for reproducibility if provided
        if random_seed is not None:
            random.seed(random_seed)
            print(f"🎲 Using random seed {random_seed} for reproducible sampling")
        
        # Randomly shuffle the valid names for diverse sampling
        random.shuffle(valid_names)
        print(f"🎲 Randomly shuffled NEO list for diverse sampling")
        
        # Check more NEOs than needed to account for failures (random sampling)
        check_limit = min(len(valid_names), target_count * 5)
    else:
        # Sequential selection - take first N objects
        print(f"📝 Using sequential selection (first {target_count} valid NEOs)")
        check_limit = min(len(valid_names), target_count * 3)  # Smaller buffer for sequential
    
    validated_neos = []
    checked_count = 0
    
    # Step 3: Use ThreadPoolExecutor for concurrent validation
    with ThreadPoolExecutor(max_workers=10) as executor:
        # Create futures for validation tasks
        future_to_name = {}
        
        for name in valid_names[:check_limit]:
            future = executor.submit(check_data_availability, name)
            future_to_name[future] = name
        
        print(f"🚀 Started validation of {len(future_to_name)} NEOs using {sampling_method}...")
        
        # Process completed validation tasks
        for future in as_completed(future_to_name):
            name = future_to_name[future]
            checked_count += 1
            
            try:
                has_complete_data = future.result()
                if has_complete_data:
                    validated_neos.append(name)
                    print(f"✅ {len(validated_neos)}/{target_count}: {name} (checked {checked_count})")
                    
                    # Stop when we have enough validated NEOs
                    if len(validated_neos) >= target_count:
                        # Cancel remaining futures to save resources
                        for remaining_future in future_to_name:
                            if not remaining_future.done():
                                remaining_future.cancel()
                        break
                else:
                    print(f"❌ {name}: Missing data sources (checked {checked_count})")
                    
            except Exception as e:
                print(f"⚠️ {name}: Validation error - {str(e)[:50]}...")
                
    print(f"🎯 Validation complete: Found {len(validated_neos)} NEOs with complete data (checked {checked_count} total)")
    return validated_neos
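
The Keplerian and equinoctial columns describe the same orbit, so one set can serve as a cross-check on the other. The sketch below applies the relations quoted in the comments of `fetch_equ0_elements`: h = e sin(ω+Ω), k = e cos(ω+Ω), p = tan(i/2) sin Ω, q = tan(i/2) cos Ω, λ = M+ω+Ω. It assumes all angles are in degrees and that both element sets refer to the same epoch. Neither assumption is verified here, and `kep_to_equ` is a hypothetical helper with made-up inputs.

```python
import math


def kep_to_equ(a, e, i_deg, Omega_deg, omega_deg, M_deg):
    """Convert Keplerian elements to equinoctial ones (angles assumed in degrees)."""
    lon_peri = math.radians(omega_deg + Omega_deg)  # longitude of perihelion
    Omega = math.radians(Omega_deg)
    tan_half_i = math.tan(math.radians(i_deg) / 2.0)
    return {
        'a_equ': a,
        'h': e * math.sin(lon_peri),
        'k': e * math.cos(lon_peri),
        'p': tan_half_i * math.sin(Omega),
        'q': tan_half_i * math.cos(Omega),
        'lambda': (M_deg + omega_deg + Omega_deg) % 360.0,  # mean longitude, degrees
    }


# Made-up elements, for illustration only.
equ = kep_to_equ(a=1.5, e=0.2, i_deg=10.0, Omega_deg=80.0, omega_deg=30.0, M_deg=300.0)
print(equ)
```

Comparing these values with the `h`, `k`, `p`, `q` columns row by row could help flag objects whose two element sets do not describe the same orbit at the same epoch.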


In [None]:
# FINAL PIPELINE RUN
# 1. Get all NEO names from the website
print("🛰️ Fetching all NEO names from NeoDyS-2...")
all_neo_names = get_all_prefixes(page_url)
print(f"🌍 Found {len(all_neo_names)} total NEOs.")

# 2. Validate NEOs to get a list of N_OBJECTS with all required data
print(f"\n🔍 Validating NEOs to find {N_OBJECTS} objects with complete data...")
validated_neo_names = validate_object(
    all_neo_names, 
    N_OBJECTS, 
    use_random_sampling=RANDOM_SAMPLING,
    random_seed=RANDOM_SEED
)
print(f"✅ Validated {len(validated_neo_names)} NEOs.")

# 3. Process the validated NEOs using the corrected pipeline
print(f"\n⚙️ Processing {len(validated_neo_names)} validated NEOs in parallel...")
final_dataset_rows, rejected_count = create_dataset(validated_neo_names)

# 4. Create the final DataFrame and save to CSV
print(f"\n💾 Creating final DataFrame and saving to CSV...")
if final_dataset_rows:
    final_df = pd.DataFrame(final_dataset_rows)
    
    # Report results
    successful_count = len(validated_neo_names) - rejected_count
    print(f"\n📊 PROCESSING SUMMARY:")
    print(f"   🎯 Target NEOs: {N_OBJECTS}")
    print(f"   ✅ Successfully processed: {successful_count}")
    print(f"   ❌ Rejected: {rejected_count}")
    print(f"   📊 Total data rows: {len(final_df)}")
    print(f"   🎯 Success rate: {(successful_count/len(validated_neo_names)*100):.1f}%")
    
    # Ensure consistent column order (basic + 6 Keplerian + 6 Equinoctial)
    available_cols = final_df.columns.tolist()

    # Column sets as per specification (a_kep and a_equ are kept as separate columns):
    # Keplerian: a_kep, e, i, Omega, omega, M
    # Equinoctial: a_equ, h, k, p, q, lambda
    basic_cols = ['object', 'night', 't_fit', 'alpha', 'delta', 'alpha_dot', 'delta_dot']
    keplerian_cols = ['a_kep', 'e', 'i', 'Omega', 'omega', 'M']
    equinoctial_cols = ['a_equ', 'h', 'k', 'p', 'q', 'lambda']

    # Final column order: basic + 6 Keplerian + 6 Equinoctial (no shared columns)
    final_column_order = basic_cols + keplerian_cols + equinoctial_cols
    
    # Ensure all required columns exist, add missing with default values
    missing_cols = [col for col in final_column_order if col not in available_cols]
    if missing_cols:
        print(f"⚠️  Adding missing columns with default values: {missing_cols}")
        for col in missing_cols:
            final_df[col] = 0.0
    
    final_df = final_df[final_column_order]
    
    print(f"\n📋 Final dataset structure:")
    print(f"   Basic fields: {len(basic_cols)}")
    print(f"   Keplerian elements: 6 (a_kep, e, i, Omega, omega, M)")
    print(f"   Equinoctial elements: 6 (a_equ, h, k, p, q, lambda)")
    print(f"   Total columns: {len(final_column_order)}")
    print(f"   Note: Separate semi-major axes: a_kep (Keplerian) and a_equ (Equinoctial)")
    
    # Save to CSV
    output_path = "./DataSet_NEO_VSA_from1to8nights.csv"
    final_df.to_csv(output_path, index=False)
    
    print(f"\n🎉 SUCCESS! Pipeline complete.")
    print(f"Dataset saved to: {output_path}")
    print(f"📋 Columns: {final_column_order}")
else:
    print("❌ No data was processed. The final dataset is empty.")
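
The column-ordering step in the cell above fills any missing column with `0.0`. As a hedged alternative, the sketch below uses `DataFrame.reindex`, which orders the columns and fills missing ones with `NaN` in a single call, so an absent orbital element cannot be mistaken for a genuine zero. The helper name `order_columns`, the upper-case constants, and the demo row are illustrative assumptions and are not part of the pipeline.

```python
import pandas as pd

# Column sets copied from the cell above.
BASIC_COLS = ['object', 'night', 't_fit', 'alpha', 'delta', 'alpha_dot', 'delta_dot']
KEPLERIAN_COLS = ['a_kep', 'e', 'i', 'Omega', 'omega', 'M']
EQUINOCTIAL_COLS = ['a_equ', 'h', 'k', 'p', 'q', 'lambda']
FINAL_COLUMN_ORDER = BASIC_COLS + KEPLERIAN_COLS + EQUINOCTIAL_COLS


def order_columns(df):
    """Hypothetical helper: return df with exactly the expected columns, in order.

    Columns absent from df are created and filled with NaN;
    columns not listed in FINAL_COLUMN_ORDER are dropped.
    """
    return df.reindex(columns=FINAL_COLUMN_ORDER)


# Made-up single-row frame that lacks most columns (illustrative values only).
demo = pd.DataFrame([{'object': 'demo_object', 'night': 1, 'e': 0.3}])
ordered = order_columns(demo)

print(ordered.columns.tolist() == FINAL_COLUMN_ORDER)  # column order is enforced
print(ordered['a_kep'].isna().all())                   # missing column is NaN, not 0.0
```

Whether `NaN` or `0.0` is the better fill value depends on how the CSV is consumed downstream; `NaN` is written as an empty field by `to_csv`.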

🛰️ Fetching all NEO names from NeoDyS-2...
Successfully extracted 9859 NEOs from the website.
🌍 Found 9859 total NEOs.

🔍 Validating NEOs to find 9000 objects with complete data...
🔍 Validating NEOs with data availability check (target: 9000, method: random sampling)...
📋 Found 9859 NEOs with valid name patterns
🎲 Using random seed 42 for reproducible sampling
🎲 Randomly shuffled NEO list for diverse sampling
🚀 Started validation of 9859 NEOs using random sampling...
❌ 2022UM4: Missing data sources (checked 1)
❌ 2019HS2: Missing data sources (checked 2)
❌ 2013PA39: Missing data sources (checked 3)
✅ 1/9000: 2014WA5 (checked 4)
✅ 2/9000: 2015XQ261 (checked 5)
✅ 3/9000: 2023HB4 (checked 6)
✅ 4/9000: 2022UA28 (checked 7)
✅ 5/9000: 2019YF2 (checked 8)
✅ 6/9000: 2025CL1 (checked 9)
✅ 7/9000: 2002TA58 (checked 10)
✅ 8/9000: 2014JX24 (checked 11)
✅ 9/9000: 2015DS198 (checked 12)
✅ 10/9000: 2004MN1 (checked 13)
❌ 2025FR2: Missing data sources (checked 14)
✅ 11/9000: 2021CS1 (checked 15)
✅ 12/9

Processing NEOs:   0%|          | 0/7210 [00:00<?, ?it/s]

Skipping night 1 (2014-05-05) due to insufficient observations: 1
Skipping night 2 for 2021CC6 due to non-physical fit estimations
Skipping night 1 (2020-04-18) due to insufficient observations: 1
Skipping night 5 (2019-01-31) due to insufficient observations: 1
Skipping night 3 (2014-03-25) due to insufficient observations: 1
Skipping night 5 (2014-03-27) due to insufficient observations: 1
REJECTED: Missing data for 2023TD8
Skipping night 2 for 2005TQ45 due to non-physical fit estimations
Skipping night 3 for 2017RV17 due to non-physical fit estimations
Skipping night 6 (2017-10-05) due to insufficient observations: 1
Skipping night 3 (2011-03-08) due to insufficient observations: 1
REJECTED: Missing data for 2020FJ4
Skipping night 1 (2024-01-01) due to insufficient observations: 1
Skipping night 2 (2022-02-03) due to insufficient observations: 1
Skipping night 3 (2011-08-07) due to insufficient observations: 1
REJECTED: Missing data for 2021LN2
Skipping night 2 for 2019JJ3 due to no

  a_fit = np.polyfit(t, alpha, 1)
  d_fit = np.polyfit(t, delta, 1)


Skipping night 1 for 2016QA2 due to non-physical fit estimations
Skipping night 2 for 2016QA2 due to non-physical fit estimations
REJECTED: Missing data for 2022UD1
Skipping night 6 (2020-01-30) due to insufficient observations: 1
Skipping night 2 (2020-04-17) due to insufficient observations: 1
Skipping night 4 (2012-10-07) due to insufficient observations: 1
Skipping night 4 for 2004XB45 due to non-physical fit estimations
Skipping night 4 for 2016CG18 due to non-physical fit estimations
REJECTED: Missing data for 2014WB498
Skipping night 2 for 2022GZ1 due to non-physical fit estimations
Skipping night 2 for 2023ST2 due to non-physical fit estimations
Skipping night 3 for 2019SG1 due to non-physical fit estimations
Skipping night 3 (2008-04-30) due to insufficient observations: 1
Skipping night 1 (2019-08-22) due to insufficient observations: 1
Skipping night 1 for 2022UR4 due to non-physical fit estimations
REJECTED: No valid nights with enough observations for 2022UR4
Skipping nigh

Skipping night 3 for 2013GJ69 due to non-physical fit estimations
Skipping night 3 (2018-01-22) due to insufficient observations: 1
REJECTED: Missing data for 2006WD129
REJECTED: Missing data for 2015SR20
REJECTED: Missing data for 2004HR56
Skipping night 1 (2018-09-15) due to insufficient observations: 1
Skipping night 2 for 2018SJ1 due to non-physical fit estimations
REJECTED: No valid nights with enough observations for 2018SJ1
REJECTED: Missing data for 2009WP106


Skipping night 3 for 2024UO5 due to non-physical fit estimations
REJECTED: Missing data for 2020QO3
REJECTED: Missing data for 2021TT10
REJECTED: Missing data for 2004TJ10
Skipping night 3 (2022-10-25) due to insufficient observations: 1
Skipping night 1 (2020-05-28) due to insufficient observations: 1
Skipping night 5 (2001-09-28) due to insufficient observations: 1
REJECTED: Missing data for 2023QM56
REJECTED: Missing data for 2022EC1
REJECTED: Missing data for 2021RZ9
Skipping night 1 for 2014WR7 due to non-physical fit estimations
REJECTED: Missing data for 1999TY2
Skipping night 1 (2020-10-14) due to insufficient observations: 1
Skipping night 1 (2020-09-22) due to insufficient observations: 1
Skipping night 3 for 2024RY13 due to non-physical fit estimations
Skipping night 4 (2018-12-14) due to insufficient observations: 1
REJECTED: Missing data for 2022RD1
Skipping night 4 for 2012XB112 due to non-physical fit estimations
Skipping night 4 for 2010AL30 due to non-physical fit esti

Skipping night 7 for 2016HD3 due to non-physical fit estimations
REJECTED: Missing data for 2020JD2
REJECTED: Missing data for 2017WE15
Skipping night 2 (2018-03-18) due to insufficient observations: 1
Skipping night 5 (2017-03-26) due to insufficient observations: 1
REJECTED: Missing data for 2020RF8
Skipping night 4 for 2024UQ1 due to non-physical fit estimations
Skipping night 4 for 2001XX103 due to non-physical fit estimations
Skipping night 4 (2017-11-26) due to insufficient observations: 1
Skipping night 6 for 2022KM6 due to non-physical fit estimations
Skipping night 7 for 2022KM6 due to non-physical fit estimations
Skipping night 1 for 2022QH3 due to non-physical fit estimations
Skipping night 5 (2000-03-04) due to insufficient observations: 1
Skipping night 5 (2025-04-30) due to insufficient observations: 1
Skipping night 3 for 2014YE15 due to non-physical fit estimations
Skipping night 4 for 2023TT16 due to non-physical fit estimations
Skipping night 1 (2022-04-04) due to ins

REJECTED: Missing data for 2016NC56
Skipping night 3 for 2024VM1 due to non-physical fit estimations
Skipping night 4 (2024-11-07) due to insufficient observations: 1
Skipping night 3 (2012-04-02) due to insufficient observations: 1
Skipping night 4 (2012-03-28) due to insufficient observations: 1
Skipping night 5 for 2022EM6 due to non-physical fit estimations
REJECTED: Missing data for 2022GP2
Skipping night 3 for 2017SS12 due to non-physical fit estimations
Skipping night 3 (2019-10-29) due to insufficient observations: 1
Skipping night 1 (2022-06-26) due to insufficient observations: 1
Skipping night 3 (2019-07-03) due to insufficient observations: 1
Skipping night 2 (2019-08-28) due to insufficient observations: 1
Skipping night 4 for 2025DJ22 due to non-physical fit estimations
Skipping night 3 (2022-09-28) due to insufficient observations: 1
Skipping night 1 for 2016UR41 due to non-physical fit estimations
Skipping night 4 for 2005AX28 due to non-physical fit estimations
Skippin

Skipping night 1 for 2021RG12 due to non-physical fit estimations
Skipping night 1 (2018-10-04) due to insufficient observations: 1
Skipping night 3 for 2016GY2 due to non-physical fit estimations
Skipping night 4 for 2009DD45 due to non-physical fit estimations
Skipping night 4 (2022-01-28) due to insufficient observations: 1
REJECTED: Missing data for 2023XK16


Skipping night 7 for 2017MQ7 due to non-physical fit estimations
REJECTED: Missing data for 2018RE1
Skipping night 1 for 2024TS2 due to non-physical fit estimations
Skipping night 4 (2006-12-25) due to insufficient observations: 1
REJECTED: Missing data for 2024BR2
Skipping night 1 (2023-01-21) due to insufficient observations: 1
Skipping night 2 for 2024EJ2 due to non-physical fit estimations
Skipping night 4 for 2015YQ1 due to non-physical fit estimations
Skipping night 2 (2017-09-27) due to insufficient observations: 1
Skipping night 7 for 2017SX17 due to non-physical fit estimations
REJECTED: Missing data for 2018TW1
Skipping night 1 for 2012SL50 due to non-physical fit estimations
Skipping night 4 (2023-08-21) due to insufficient observations: 1
Skipping night 2 (2018-04-13) due to insufficient observations: 1
Skipping night 1 (2011-12-23) due to insufficient observations: 1
Skipping night 1 (2019-02-04) due to insufficient observations: 1
Skipping night 2 for 2006BH99 due to non-

Skipping night 4 for 2014HM4 due to non-physical fit estimations
Skipping night 3 (2001-11-15) due to insufficient observations: 1
REJECTED: Missing data for 2025DD15
Skipping night 1 (2021-08-27) due to insufficient observations: 1
Skipping night 5 (2021-05-15) due to insufficient observations: 1
Skipping night 4 for 2017MO8 due to non-physical fit estimations
Skipping night 3 (2024-03-15) due to insufficient observations: 1
Skipping night 4 (2024-03-16) due to insufficient observations: 1
Skipping night 3 (2013-06-12) due to insufficient observations: 1
Skipping night 5 (2014-08-30) due to insufficient observations: 1
REJECTED: Missing data for 2025EN2
Skipping night 2 for 2021SZ1 due to non-physical fit estimations
REJECTED: Missing data for 2024QK2
REJECTED: Missing data for 2015YH1
Skipping night 3 (2023-12-09) due to insufficient observations: 1
Skipping night 6 (2017-10-23) due to insufficient observations: 1
Skipping night 4 for 2023VD3 due to non-physical fit estimations
Skipp

Skipping night 6 for 2015OQ21 due to non-physical fit estimations
REJECTED: Missing data for 2014FK38
Skipping night 2 (2022-12-16) due to insufficient observations: 1
Skipping night 3 for 2023CL1 due to non-physical fit estimations
Skipping night 1 for 2019XL3 due to non-physical fit estimations
Skipping night 2 (2010-10-07) due to insufficient observations: 1
Skipping night 2 (2025-05-25) due to insufficient observations: 1
Skipping night 3 (2025-05-26) due to insufficient observations: 1
REJECTED: Missing data for 2021UW3

💾 Creating final DataFrame and saving to CSV...

📊 PROCESSING SUMMARY:
   🎯 Target NEOs: 9000
   ✅ Successfully processed: 7042
   ❌ Rejected: 168
   📊 Total data rows: 25404
   🎯 Success rate: 97.7%

📋 Final dataset structure:
   Basic fields: 7
   Keplerian elements: 6 (a_kep, e, i, Omega, omega, M)
   Equinoctial elements: 6 (a_equ, h, k, p, q, lambda)
   Total columns: 19
   Note: Separate semi-major axes: a_kep (Keplerian) and a_equ (Equinoctial)

🎉 SUCCESS! 
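
The log above skips nights with a single observation, and the `np.polyfit(t, alpha, 1)` lines in the output are the source lines echoed by NumPy's fit warnings. A degree-1 fit has two unknowns (slope and intercept), so it needs at least two observations at distinct times; otherwise the least-squares system is rank-deficient. The sketch below illustrates that guard under stated assumptions: the helper name `fit_night`, the `min_obs` threshold, and the choice of the mean epoch as `t_fit` are hypothetical and may differ from the pipeline's own functions. Right-ascension wrap-around at 0°/360° is not handled.

```python
import numpy as np


def fit_night(t, alpha, delta, min_obs=2):
    """Hypothetical helper: linear fit of alpha(t) and delta(t) for one night.

    Returns None when the night cannot support a degree-1 fit
    (fewer than min_obs observations, or all at the same epoch).
    """
    t = np.asarray(t, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    delta = np.asarray(delta, dtype=float)

    if t.size < min_obs or np.unique(t).size < 2:
        return None  # would trigger a rank-deficient polyfit

    # Centre the time axis on the mean epoch: improves conditioning for
    # MJD-sized values and makes the intercept the fitted value at t_fit.
    t_fit = t.mean()
    dt = t - t_fit

    # np.polyfit returns coefficients from highest degree down: [slope, intercept].
    alpha_dot, alpha_fit = np.polyfit(dt, alpha, 1)
    delta_dot, delta_fit = np.polyfit(dt, delta, 1)

    return {
        't_fit': t_fit,
        'alpha': alpha_fit,
        'delta': delta_fit,
        'alpha_dot': alpha_dot,
        'delta_dot': delta_dot,
    }


# Synthetic observations lying exactly on a line (made-up values, MJD-like epochs).
t_obs = np.array([60000.0, 60000.1, 60000.2])
alpha_obs = 10.0 + 2.0 * (t_obs - 60000.0)
delta_obs = -5.0 + 0.5 * (t_obs - 60000.0)

fit = fit_night(t_obs, alpha_obs, delta_obs)
single = fit_night([60000.0], [10.0], [-5.0])  # one observation: night is skipped
```

Checking the observation count before calling `np.polyfit` avoids the warning at its source, rather than relying on a warnings filter to hide it.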