# Google News Scraping (Best Practices)

This notebook demonstrates how to scrape today's top stories from Google News, following best practices and respecting the site's robots.txt policy. We will set browser-like headers and add delays to avoid being blocked.
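The header-and-delay idea described above can be sketched as follows. This is a minimal illustration using only the standard library, not necessarily the client this notebook relies on; the header values, the helper names `build_request` and `fetch_all`, and the 2-second default delay are all assumptions chosen for the example.

```python
import time
import urllib.request

# Hypothetical browser-like headers; the exact values are assumptions.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
}


def build_request(url, headers=BROWSER_HEADERS):
    """Attach the browser-like headers to a request for `url`."""
    return urllib.request.Request(url, headers=headers)


def fetch_all(urls, delay_seconds=2.0, opener=urllib.request.urlopen):
    """Fetch each URL in order, sleeping between consecutive requests."""
    pages = []
    for i, url in enumerate(urls):
        if i > 0:
            # Pause before every request except the first one.
            time.sleep(delay_seconds)
        with opener(build_request(url), timeout=10) as response:
            pages.append(response.read())
    return pages
```

The `opener` parameter exists only so the fetch step can be swapped out, for example with a stub during testing.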

In [14]:
import pandas as pd

# Wikipedia URL
url = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"

# Read all tables from the page
tables = pd.read_html(url)

# The page contains several tables; only one holds the per-country GDP data.
# Print every table's shape to see which one that is.
for i, table in enumerate(tables):
    print(f"Table {i}: {table.shape}")

# Total number of tables found
print(f"Number of tables found: {len(tables)}")
# Preview the GDP table: index 2 is the only table with one row per country
# (adjust the index if the page layout changes)
gdp_df = tables[2]
gdp_df.head()

Table 0: (1, 1)
Table 1: (1, 3)
Table 2: (210, 7)
Table 3: (9, 2)
Table 4: (8, 2)
Table 5: (13, 2)
Table 6: (2, 2)
Number of tables found: 7


| | Country/Territory | IMF[1][12] Forecast | IMF[1][12] Year | World Bank[13] Estimate | World Bank[13] Year | United Nations[14] Estimate | United Nations[14] Year |
|---|---|---|---|---|---|---|---|
| 0 | World | 113795678 | 2025 | 105435540 | 2023 | 100834796 | 2022 |
| 1 | United States | 30507217 | 2025 | 27360935 | 2023 | 25744100 | 2022 |
| 2 | China | 19231705 | [n 1]2025 | 17794782 | [n 3]2023 | 17963170 | [n 1]2022 |
| 3 | Germany | 4744804 | 2025 | 4456081 | 2023 | 4076923 | 2022 |
| 4 | India | 4187017 | 2025 | 3549919 | 2023 | 3465541 | 2022 |
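The preview shows two things that usually need cleaning before analysis: two-level column headers carrying footnote markers such as `[1][12]`, and year values prefixed with notes such as `[n 1]`. One possible cleanup is sketched below on a small stand-in DataFrame built from two of the previewed rows. The helper names `flatten_columns` and `strip_notes` are hypothetical, and the sketch assumes the real table has the same seven-column layout.

```python
import re

import pandas as pd

# Stand-in for the scraped table, using two rows from the preview above.
columns = pd.MultiIndex.from_tuples([
    ("Country/Territory", "Country/Territory"),
    ("IMF[1][12]", "Forecast"),
    ("IMF[1][12]", "Year"),
    ("World Bank[13]", "Estimate"),
    ("World Bank[13]", "Year"),
    ("United Nations[14]", "Estimate"),
    ("United Nations[14]", "Year"),
])
sample = pd.DataFrame(
    [
        ["China", 19231705, "[n 1]2025", 17794782, "[n 3]2023", 17963170, "[n 1]2022"],
        ["Germany", 4744804, "2025", 4456081, "2023", 4076923, "2022"],
    ],
    columns=columns,
)

BRACKETED = r"\[[^\]]*\]"  # matches markers like [1], [12], or [n 1]


def flatten_columns(df):
    """Join the two header levels and drop bracketed footnote markers."""
    names = []
    for top, bottom in df.columns:
        top = re.sub(BRACKETED, "", top).strip()
        names.append(top if top == bottom else f"{top} {bottom}")
    out = df.copy()
    out.columns = names
    return out


def strip_notes(series):
    """Remove bracketed notes and convert to numbers (unparseable -> NaN)."""
    text = series.astype(str).str.replace(BRACKETED, "", regex=True).str.strip()
    return pd.to_numeric(text, errors="coerce")


clean = flatten_columns(sample)
for col in clean.columns.drop("Country/Territory"):
    clean[col] = strip_notes(clean[col])

print(clean)
```

Using `errors="coerce"` is a deliberate choice here: any cell that still fails to parse becomes `NaN` instead of raising, which makes leftover oddities easy to find with `isna()`.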


## Notes
- This approach targets Google News top stories, extracting headlines and descriptions using lxml and XPath.
- We use browser headers and a delay between requests to avoid being blocked.
- For large-scale scraping, consider using proxies and more advanced rate limiting.
- Always check the site's terms of service and robots.txt before scraping.
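The robots.txt check in the last note can be automated with the standard library's `urllib.robotparser`. The sketch below parses an in-memory sample so it runs without network access. The sample rules, the `example.com` URLs, the user agent string `my-notebook-bot`, and the helper name `make_parser` are all hypothetical. Against a live site you would call `set_url()` and `read()` on the parser instead of `parse()`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def make_parser(robots_txt):
    """Build a parser from robots.txt text that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = make_parser(SAMPLE_ROBOTS_TXT)
user_agent = "my-notebook-bot"  # hypothetical user agent name

# True when the rules allow this user agent to fetch the URL.
allowed_public = parser.can_fetch(user_agent, "https://example.com/public/page")
allowed_private = parser.can_fetch(user_agent, "https://example.com/private/page")

# Crawl-delay in seconds if the file declares one, otherwise None.
delay = parser.crawl_delay(user_agent)

print(allowed_public, allowed_private, delay)
```

A declared crawl delay is a natural value to use for the pause between requests. Note that robots.txt does not cover a site's terms of service, which still need to be read separately.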