# AiScore Tennis Scraper - Interactive Notebook

This notebook provides an interactive interface for scraping tennis match data from aiscore.com using Selenium.

## Features
- ✅ Scrape matches by date
- ✅ Filter by status (finished, live, scheduled)
- ✅ Extract match URLs for detailed scraping
- ✅ Analyze scraped data with pandas
- ✅ Interactive visualizations


## 1. Setup and Imports


## Create Selenium Driver


In [1]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager

# Configure Chrome options
chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-dev-shm-usage")
chrome_options.add_argument("--window-size=1920,1080")
# chrome_options.add_argument("--headless")  # Uncomment for headless mode

# Create driver
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service, options=chrome_options)

# Set timeouts
driver.set_page_load_timeout(30)
driver.implicitly_wait(10)

print("✅ Selenium driver created successfully!")
print(f"Driver: {driver}")


✅ Selenium driver created successfully!
Driver: <selenium.webdriver.chrome.webdriver.WebDriver (session="2b41b83412a00b524fcdb263de2fa0d2")>


## Use the Driver


✅ Navigated to: https://www.aiscore.com/tennis/20251102
Page title: 20251102 SUN tennis schedule and matches results - AiScore


In [4]:
from selenium.webdriver.common.by import By

In [15]:
# Navigate to a page
url = "https://www.aiscore.com/tennis/20251001"
driver.get(url)

print(f"✅ Navigated to: {url}")
print(f"Page title: {driver.title}")

# Wait for initial page load
import time
time.sleep(3)

# Collect match links WHILE scrolling (page uses virtual scrolling - removes old elements!)
print("\n🔄 Scrolling and collecting links...")

match_links = []  # Collect links as we go!

scroll_pause_time = 2.0
scroll_increment = 500
no_change_count = 0
max_no_change = 8  # Stop after 8 scrolls with no new matches

while no_change_count < max_no_change:
    # Get current position
    current_position = driver.execute_script("return window.pageYOffset;")
    
    # Collect links currently visible in DOM (before they disappear!)
    links_before = len(match_links)
    
    for link in driver.find_elements(By.CSS_SELECTOR, "a[href*='/tennis/match']"):
        href = link.get_attribute('href')
        if href and href not in match_links:
            match_links.append(href)
    
    links_after = len(match_links)
    new_links_found = links_after - links_before
    
    print(f"   Position: {current_position}px | Total unique: {links_after} (+{new_links_found} new)")
    
    # Check if new links were found
    if new_links_found == 0:
        no_change_count += 1
    else:
        no_change_count = 0  # Reset counter if new links found
    
    # Scroll down by increment
    driver.execute_script(f"window.scrollBy(0, {scroll_increment});")
    time.sleep(scroll_pause_time)
    
    # Check if we've reached the actual bottom
    at_bottom = driver.execute_script(
        "return (window.innerHeight + window.pageYOffset) >= document.body.scrollHeight;"
    )
    if at_bottom and no_change_count >= 3:
        print("   ✓ Reached bottom of page")
        break

# Final collection at the end
for link in driver.find_elements(By.CSS_SELECTOR, "a[href*='/tennis/match']"):
    href = link.get_attribute('href')
    if href and href not in match_links:
        match_links.append(href)

print(f"\n✅ Finished! Found {len(match_links)} unique match links\n")

# Display the links
print("🎾 Match Links:")
print("=" * 80)
for idx, link in enumerate(match_links, 1):
    print(f"{idx}. {link}")

✅ Navigated to: https://www.aiscore.com/tennis/20251001
Page title: 20251001 WED tennis schedule and matches results - AiScore

🔄 Scrolling and collecting links...
   Position: 0px | Total unique: 15 (+15 new)
   Position: 500px | Total unique: 20 (+5 new)
   Position: 1000px | Total unique: 32 (+12 new)
   Position: 1500px | Total unique: 36 (+4 new)
   Position: 2000px | Total unique: 45 (+9 new)
   Position: 2500px | Total unique: 49 (+4 new)
   Position: 3000px | Total unique: 57 (+8 new)
   Position: 3500px | Total unique: 65 (+8 new)
   Position: 4000px | Total unique: 72 (+7 new)
   Position: 4500px | Total unique: 82 (+10 new)
   Position: 5000px | Total unique: 84 (+2 new)
   Position: 5500px | Total unique: 92 (+8 new)
   Position: 6000px | Total unique: 101 (+9 new)
   Position: 6500px | Total unique: 103 (+2 new)
   Position: 7000px | Total unique: 113 (+10 new)
   Position: 7500px | Total unique: 138 (+25 new)
   Position: 8000px | Total unique: 138 (+0 new)
   Positi

## Close the Driver (Run this when done)


In [None]:
# Close the browser
driver.quit()
print("✅ Driver closed")


In [2]:
# Import required libraries
import pandas as pd
import time
from datetime import datetime, timedelta

# Import scraper functions
from scraper import scrape_matches, scrape_match_details, print_matches_summary
from utils import (
    save_to_csv, 
    get_timestamp, 
    filter_matches_by_status,
    build_date_url,
    create_directories
)
import config

# Create data directories
create_directories()

print("✅ Imports successful!")
print(f"Current date: {datetime.now().strftime('%Y-%m-%d')}")


Created directories: data, data/raw, data/processed
✅ Imports successful!
Current date: 2025-11-02


## 2. Basic Scraping - Get Finished Matches for a Specific Date

This will scrape all finished matches for a given date.


In [None]:
# Set the date to scrape (YYYY-MM-DD format)
TARGET_DATE = "2025-11-01"  # Change this to your desired date

# Scrape finished matches
print(f"🎾 Scraping finished matches for {TARGET_DATE}...")
print("="*60)

matches = scrape_matches(
    headless=True,           # Run in background (set False to see browser)
    date=TARGET_DATE,
    status_filter='finished' # Only get finished matches
)

# Show summary
print_matches_summary(matches)

# Display count
print(f"\n✅ Found {len(matches)} finished matches")


## 3. Convert to DataFrame and Analyze


In [None]:
# Convert to pandas DataFrame
df = pd.DataFrame(matches)

# Display basic info
print(f"📊 Data Shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
print("\n" + "="*60)

# Show first few matches
print("\n🎾 Sample Matches:")
df.head()


## 4. Data Analysis and Filtering


In [None]:
# Count matches by tournament
print("📈 Matches by Tournament:")
print(df['tournament'].value_counts().head(10))
print("\n" + "="*60)

# Count by status
print("\n📊 Matches by Status:")
print(df['status'].value_counts())
print("\n" + "="*60)

# Filter matches with valid URLs
matches_with_urls = df[df['match_url'].notna() & (df['match_url'] != '')]
print(f"\n🔗 Matches with URLs: {len(matches_with_urls)}/{len(df)}")

# Show some match URLs
if len(matches_with_urls) > 0:
    print("\n📎 Sample Match URLs:")
    for idx, row in matches_with_urls.head(3).iterrows():
        print(f"  {row['player1']} vs {row['player2']}")
        print(f"  → {row['match_url']}")
        print()


## 5. Save Data to CSV


In [None]:
# Save to CSV
filename = f"tennis_matches_{TARGET_DATE.replace('-', '')}_{get_timestamp('file')}.csv"
filepath = save_to_csv(matches, filename)

print(f"✅ Saved {len(matches)} matches to:")
print(f"   {filepath}")


## 6. Scrape Match Details (Advanced)

Scrape detailed statistics for a specific match using its URL.


In [None]:
# Get first match with valid URL
if len(matches_with_urls) > 0:
    first_match = matches_with_urls.iloc[0]
    
    print(f"🔍 Scraping details for:")
    print(f"   {first_match['player1']} vs {first_match['player2']}")
    print(f"   Tournament: {first_match['tournament']}")
    print(f"   URL: {first_match['match_url']}")
    print("\n" + "="*60)
    
    # Scrape match details
    details = scrape_match_details(
        match_url=first_match['match_url'],
        headless=True  # Set False to see browser
    )
    
    if details:
        print("\n✅ Successfully scraped match details!")
        print(f"\n📊 Available data:")
        print(f"   - Statistics: {len(details.get('statistics', {}))} items")
        print(f"   - H2H data: {len(details.get('h2h', {}))} items")
        print(f"   - Odds: {len(details.get('odds', {}))} bookmakers")
        
        # Display statistics if available
        if details.get('statistics'):
            print("\n📈 Match Statistics:")
            for key, value in list(details['statistics'].items())[:5]:
                print(f"   {key}: {value}")
    else:
        print("\n⚠️  Failed to scrape match details")
else:
    print("⚠️  No matches with URLs found. Cannot scrape details.")


## 7. Advanced: Scrape Multiple Dates

Scrape finished matches for a range of dates.


In [None]:
# Define date range
start_date = datetime(2025, 11, 1)
num_days = 3  # Scrape 3 days

all_matches = []

print(f"🎾 Scraping {num_days} days of matches...")
print("="*60 + "\n")

for i in range(num_days):
    current_date = start_date + timedelta(days=i)
    date_str = current_date.strftime('%Y-%m-%d')
    
    print(f"📅 Scraping {date_str}...")
    
    try:
        matches = scrape_matches(
            headless=True,
            date=date_str,
            status_filter='finished'
        )
        
        # Add date column
        for match in matches:
            match['scrape_date'] = date_str
        
        all_matches.extend(matches)
        print(f"   ✅ Found {len(matches)} matches\n")
        
        # Delay between dates
        if i < num_days - 1:
            print("   ⏳ Waiting 5 seconds...\n")
            time.sleep(5)
            
    except Exception as e:
        print(f"   ❌ Error: {e}\n")
        continue

print("="*60)
print(f"\n✅ Total matches scraped: {len(all_matches)}")

# Convert to DataFrame
df_all = pd.DataFrame(all_matches)
print(f"📊 Final dataset shape: {df_all.shape}")


## 8. Filter by Player or Tournament

Search for specific players or tournaments in your scraped data.


In [None]:
# Search for specific player (case insensitive)
player_name = "Zverev"  # Change to search for different player

player_matches = df[
    df['player1'].str.contains(player_name, case=False, na=False) | 
    df['player2'].str.contains(player_name, case=False, na=False)
]

print(f"🎾 Matches with '{player_name}':")
print(f"   Found {len(player_matches)} matches\n")

if len(player_matches) > 0:
    print("="*60)
    for idx, match in player_matches.head(5).iterrows():
        print(f"\n{match['player1']} vs {match['player2']}")
        print(f"Tournament: {match['tournament']}")
        print(f"Score: {match['score']}")
        print(f"Status: {match['status']}")
        # NaN is truthy, so check for a missing URL explicitly
        if pd.notna(match.get('match_url')) and match['match_url']:
            print(f"URL: {match['match_url']}")

# Search by tournament
print("\n\n" + "="*60)
tournament_keyword = "ATP"  # Change to search for different tournament

tournament_matches = df[
    df['tournament'].str.contains(tournament_keyword, case=False, na=False)
]

print(f"\n🏆 '{tournament_keyword}' Tournaments:")
print(f"   Found {len(tournament_matches)} matches")
print(f"\n   Tournaments:")
for tournament in tournament_matches['tournament'].unique()[:5]:
    count = len(tournament_matches[tournament_matches['tournament'] == tournament])
    print(f"   - {tournament} ({count} matches)")


## 9. Quick Reference - Useful Commands

Common scraping patterns and configurations.


In [None]:
"""
QUICK REFERENCE - Copy and modify these as needed

# 1. Scrape yesterday's finished matches
yesterday = (datetime.now() - timedelta(days=1)).strftime('%Y-%m-%d')
matches = scrape_matches(headless=True, date=yesterday, status_filter='finished')

# 2. Scrape live matches (current)
live_matches = scrape_matches(headless=True, status_filter='live')

# 3. Scrape all matches (no filter)
all_matches = scrape_matches(headless=True, date='2025-11-01', status_filter='all')

# 4. Scrape with visible browser (for debugging)
matches = scrape_matches(headless=False, date='2025-11-01')

# 5. Scrape match details by URL
details = scrape_match_details(
    match_url='https://www.aiscore.com/tennis/match-player1-player2/abc123',
    headless=True
)

# 6. Save to custom filename
save_to_csv(matches, 'my_custom_filename.csv')

# 7. Filter matches after scraping
from utils import filter_matches_by_status
finished = filter_matches_by_status(matches, 'finished')
live = filter_matches_by_status(matches, 'live')

# 8. Get date-specific URL
from utils import build_date_url
url = build_date_url('2025-11-01')
print(url)  # https://www.aiscore.com/tennis/20251101

# 9. Load previously saved data
df_saved = pd.read_csv('data/raw/tennis_matches_20251101_xxxxxx.csv')

# 10. Scrape with custom timeout (edit config.py first)
# config.PAGE_LOAD_TIMEOUT = 60  # Increase if getting timeouts
"""

print("✅ Quick reference loaded!")
print("💡 Copy the examples above into a new cell and run them as needed")


## 10. Export Data in Different Formats


In [None]:
# Export to different formats
timestamp = get_timestamp('file')

# CSV (already done above)
csv_path = f'data/raw/matches_{timestamp}.csv'
df.to_csv(csv_path, index=False)
print(f"✅ CSV: {csv_path}")

# Excel (requires openpyxl: pip install openpyxl)
try:
    excel_path = f'data/raw/matches_{timestamp}.xlsx'
    df.to_excel(excel_path, index=False)
    print(f"✅ Excel: {excel_path}")
except ImportError:
    print("⚠️  Excel export requires: pip install openpyxl")

# JSON
json_path = f'data/raw/matches_{timestamp}.json'
df.to_json(json_path, orient='records', indent=2)
print(f"✅ JSON: {json_path}")

# HTML (for viewing in browser)
html_path = f'data/raw/matches_{timestamp}.html'
df.to_html(html_path, index=False)
print(f"✅ HTML: {html_path}")

print(f"\n📁 All files saved in: data/raw/")


## 📝 Notes and Tips

### Scraping Tips:
- **Headless mode** (`headless=True`) is faster but you can't see what's happening
- **Visible mode** (`headless=False`) is useful for debugging
- Always add delays when scraping multiple pages (`time.sleep()`)
- Check `TROUBLESHOOTING.md` if you encounter errors

### Status Filters:
- `'finished'` - Only completed matches
- `'live'` - Only ongoing matches  
- `'scheduled'` - Only upcoming matches
- `'all'` - All matches (no filter)
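
As a rough illustration of what these filters mean on already-scraped data, the sketch below filters a list of match dictionaries by their `status` key. The `filter_by_status` helper and the sample records are hypothetical; the real `filter_matches_by_status` in `utils` may be implemented differently.

```python
def filter_by_status(matches, status="all"):
    """Return the matches whose 'status' equals `status` (hypothetical helper)."""
    if status == "all":
        return list(matches)  # no filtering, just a copy
    wanted = status.lower()
    return [m for m in matches if str(m.get("status", "")).lower() == wanted]


# Made-up records, only to show the shape of the data
sample = [
    {"player1": "A", "player2": "B", "status": "finished"},
    {"player1": "C", "player2": "D", "status": "live"},
    {"player1": "E", "player2": "F", "status": "scheduled"},
]

finished_only = filter_by_status(sample, "finished")
print(len(finished_only), "finished match(es)")
```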

### Date Formats Accepted:
- `'2025-11-01'` (YYYY-MM-DD)
- `'20251101'` (YYYYMMDD)
- Or use Python datetime objects
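
One way these three input forms could be normalized to the `YYYYMMDD` path segment seen in the AiScore URLs is sketched below. The names `to_yyyymmdd` and `date_url` are hypothetical and only illustrate the idea; the actual `build_date_url` in `utils` may work differently.

```python
from datetime import date, datetime

# Base path taken from the URLs shown elsewhere in this notebook
BASE_URL = "https://www.aiscore.com/tennis"


def to_yyyymmdd(value):
    """Normalize 'YYYY-MM-DD', 'YYYYMMDD', or a date/datetime to 'YYYYMMDD'."""
    if isinstance(value, date):  # datetime is a subclass of date
        return value.strftime("%Y%m%d")
    for fmt in ("%Y-%m-%d", "%Y%m%d"):
        try:
            return datetime.strptime(value, fmt).strftime("%Y%m%d")
        except ValueError:
            continue  # try the next accepted format
    raise ValueError(f"Unrecognized date format: {value!r}")


def date_url(value):
    """Build a date-specific listing URL (hypothetical helper)."""
    return f"{BASE_URL}/{to_yyyymmdd(value)}"


print(date_url("2025-11-01"))
```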

### Common Issues:
1. **Timeout errors** → Increase `PAGE_LOAD_TIMEOUT` in `config.py`
2. **No matches found** → Website structure may have changed, or no matches for that date
3. **Empty match_url** → Some matches don't have clickable links on the listing page
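
For occasional timeouts, retrying with a growing delay is another option alongside a larger `PAGE_LOAD_TIMEOUT`. The sketch below assumes the failure surfaces as a Python exception; `retry` and `flaky` are hypothetical names, and whether `scrape_matches` raises on timeout is not confirmed here.

```python
import time


def retry(func, attempts=3, base_delay=2.0, exceptions=(Exception,), sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**n seconds, then try again."""
    for n in range(attempts):
        try:
            return func()
        except exceptions:
            if n == attempts - 1:
                raise  # out of attempts, surface the last error
            sleep(base_delay * 2 ** n)


# Simulated scraper call that fails twice before succeeding
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


result = retry(flaky, base_delay=0.01, exceptions=(TimeoutError,))
print(result, "after", calls["n"], "attempts")
```

In the notebook this could wrap a real call, for example `retry(lambda: scrape_matches(headless=True, date=TARGET_DATE))`, provided the scraper raises rather than returning an empty list on failure.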

### Performance:
- Scraping 1 date takes ~10-30 seconds
- Scraping match details takes ~5-10 seconds per match
- Use delays between requests to avoid being blocked

### Next Steps:
- Run the cells in order from top to bottom
- Modify `TARGET_DATE` variable to scrape different dates
- Check the scraped data in `data/raw/` folder
- Use the DataFrame `df` for further analysis
