# Walk the World: A Land Journey Explorer
**Author**: Sahar Imani 

This project implements a program that determines whether a land route exists between two cities anywhere in the world. If one exists, it finds the route with the fewest border crossings and reports key details about the countries along the way. The project uses **web scraping**, **graph traversal algorithms**, and **data analysis** to do so.


## Project Overview

### Objectives:
1. **Determine Connectivity**: Check if two cities can be connected on foot without crossing oceans.
2. **Shortest Path**: Find the route with the minimum number of border crossings.
3. **Country Insights**: Provide information about the largest, most populous, and richest countries along the route.

### Features:
- **Input Validation**: Confirms that both cities exist on Wikipedia.
- **Ocean Separation Check**: Identifies if the cities are separated by an ocean using their respective continents.
- **Shortest Path Calculation**: Uses a breadth-first search (BFS) algorithm on a graph of countries to determine the shortest route.
- **Country Data Extraction**: Scrapes Wikipedia for detailed information about countries along the route.


## Methodology

### 1. Input Validation
- Verifies if Wikipedia pages exist for the input cities using **requests** and **BeautifulSoup**.
- Ends with an error message if a city's page does not exist.

In [1]:
import requests
from bs4 import BeautifulSoup
from lxml import html
from collections import deque


In [2]:
def check_wikipedia_page(city_name):
    """
    Check whether a Wikipedia page exists for the given city by parsing its HTML content.
    """
    # Build the article URL; Wikipedia uses underscores in place of spaces in its URLs.
    url = f"https://en.wikipedia.org/wiki/{city_name.replace(' ', '_')}"
    # Request the page and parse the response HTML into a BeautifulSoup object.
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Wikipedia uses the "No article text" template (noarticletext) for pages that do not exist.
    if soup.find(class_='noarticletext') or soup.find(id='noarticletext'):
        return False

    return True

def validate_input(start_city, end_city):
    """
    Validates if both start and end cities have Wikipedia pages.
    """
    # Check that both cities exist on Wikipedia
    if not check_wikipedia_page(start_city):
        return f"Error:No Wikipedia page found for {start_city}."
    
    if not check_wikipedia_page(end_city):
        return f"Error:No Wikipedia page found for {end_city}."

    return "Input validated:Both cities have Wikipedia pages."
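
The functions above build the article URL by replacing spaces with underscores. City names that contain non-ASCII or reserved characters may also need percent-encoding. The following is a minimal sketch using only the standard library; the helper name `wikipedia_url` and the choice of "safe" characters are assumptions for illustration, not part of the original notebook.

```python
from urllib.parse import quote

WIKI_BASE = "https://en.wikipedia.org/wiki/"

def wikipedia_url(title):
    """Build an English Wikipedia article URL for a page title (hypothetical helper)."""
    # Wikipedia uses underscores in place of spaces in article URLs.
    slug = title.strip().replace(" ", "_")
    # Percent-encode anything else that is not URL-safe; keep common title punctuation.
    return WIKI_BASE + quote(slug, safe="_,.()-'")

print(wikipedia_url("Los Angeles"))
print(wikipedia_url("Washington, D.C."))
print(wikipedia_url("São Paulo"))
```

A helper like this could replace the repeated f-string in `check_wikipedia_page` and `extract_country`, so the URL rule lives in one place.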


In [3]:
print(validate_input("Tehran","Los Angeles"))

Input validated:Both cities have Wikipedia pages.


### 2. Country Extraction
- Extracts the country of a city from the infobox on its Wikipedia page using **CSS selectors**.
- Checks if the two cities belong to the same country, ending the program if true.


In [4]:
def extract_country(city_name):
    """
    Extracting the country of the given city from its Wikipedia page.
    """
    url = f"https://en.wikipedia.org/wiki/{city_name.replace(' ', '_')}"
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # Look for a 'td.infobox-data' cell directly containing the country
    country_info = soup.select_one('td.infobox-data')
    if country_info and country_info.text.strip():
        return country_info.text.strip()

    # Fall back to an 'a' element within the infobox whose title names the country
    country_link = soup.select_one('.infobox a[title]')
    if country_link and country_link['title']:
        return country_link['title']

    return "Country not found"

In [5]:
print(extract_country("Washington, D.C."))
print(extract_country("Los Angeles"))
print(extract_country("London"))
print(extract_country("Tehran"))

United States
United States
United Kingdom
Iran


In [6]:
def same_country(start_city, end_city):
    """
    Checks if both cities belong to the same country.
    """
    start_country = extract_country(start_city)
    end_country = extract_country(end_city)

    if start_country == end_country:
        return f"Both cities are in the same country:{start_country}."
    else:
        return f"Different countries:{start_city} is in {start_country}, {end_city} is in {end_country}."

In [7]:
print(same_country("Washington, D.C.","Los Angeles"))
print(same_country("Tehran","Berlin"))

Both cities are in the same country:United States.
Different countries:Tehran is in Iran, Berlin is in Germany.


### 3. Ocean Separation
- Maps countries to their respective continents using the "List of sovereign states and dependent territories by continent" Wikipedia page.
- Determines if the cities are separated by an ocean based on their continents.

In [8]:
def is_valid_country(country_name):
    """
    Determine whether a given string is a valid country name.
    Excludes entries like "Member states of" or "Template:".
    """
    return not ("Member states of" in country_name or "Template:" in country_name)

def country_continent_mapping():
    """
    Map countries to their respective continents.
    """
    url = "https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_by_continent"
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    country_continent_map = {}
    continents = ['Africa', 'Asia', 'Europe', 'North America', 'South America', 'Oceania', 'Antarctica']

    for continent in continents:
        continent_headline_id = continent.replace(' ', '_')
        continent_headline = soup.find('span', id=continent_headline_id)

        if continent_headline:
            table = continent_headline.find_next('table')
            for row in table.find_all('tr')[1:]:
                country_cell = row.find('td')
                if country_cell and country_cell.a and country_cell.a.has_attr('title'):
                    country_name = country_cell.a['title']
                    if is_valid_country(country_name):
                        country_continent_map[country_name] = continent

    return country_continent_map

# Create the country-continent mapping
country_continent_map = country_continent_mapping()


In [9]:
# Try the function
country_count = 0
for country, continent in country_continent_map.items():
    print(f"{country}: {continent}")
    country_count += 1

print(f"Total number of countries listed: {country_count}")


Algeria: Africa
Angola: Africa
Benin: Africa
Botswana: Africa
Burkina Faso: Africa
Burundi: Africa
Cameroon: Africa
Cape Verde: Africa
Central African Republic: Africa
Chad: Africa
Comoros: Africa
Democratic Republic of the Congo: Africa
Republic of the Congo: Africa
Djibouti: Africa
Egypt: Asia
Equatorial Guinea: Africa
Eritrea: Africa
Eswatini: Africa
Ethiopia: Africa
Gabon: Africa
The Gambia: Africa
Ghana: Africa
Guinea: Africa
Guinea-Bissau: Africa
Ivory Coast: Africa
Kenya: Africa
Lesotho: Africa
Liberia: Africa
Libya: Africa
Madagascar: Africa
Malawi: Africa
Mali: Africa
Mauritania: Africa
Mauritius: Africa
Morocco: Africa
Mozambique: Africa
Namibia: Africa
Niger: Africa
Nigeria: Africa
Rwanda: Africa
São Tomé and Príncipe: Africa
Senegal: Africa
Seychelles: Africa
Sierra Leone: Africa
Somalia: Africa
South Africa: Africa
South Sudan: Africa
Sudan: Africa
Tanzania: Africa
Togo: Africa
Tunisia: Africa
Uganda: Africa
Zambia: Africa
Zimbabwe: Africa
Sahrawi Arab Democratic Republic:

In [10]:

def are_separated_by_ocean(continent1, continent2):
    """
    Determining if two continents are separated by an ocean.
    """
    oceans_between_continents = {
        ('North America','Europe'):'Atlantic Ocean',
        ('South America', 'Africa'):'Atlantic Ocean',
        ('Europe','Africa'):'Atlantic Ocean',
        ('Asia', 'North America'): 'Pacific Ocean',
        ('Oceania', 'South America'): 'Pacific Ocean',
        ('Oceania', 'North America'): 'Pacific Ocean',
        ('Africa', 'Oceania'): 'Indian Ocean',
        ('Asia', 'Oceania'): 'Indian Ocean',
        # Antarctica is bordered only by the Southern Ocean, so every Antarctica pair uses it.
        ('North America', 'Antarctica'): 'Southern Ocean',
        ('Europe', 'Antarctica'): 'Southern Ocean',
        ('Asia', 'Antarctica'): 'Southern Ocean',
        ('Antarctica', 'Africa'): 'Southern Ocean',
        ('Antarctica', 'Oceania'): 'Southern Ocean',
        ('Antarctica', 'South America'): 'Southern Ocean'
        }
    return oceans_between_continents.get((continent1, continent2)) or oceans_between_continents.get((continent2, continent1))
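
The lookup above tries both orderings of the tuple key. An order-independent alternative is to key the table by `frozenset`, so each pair of continents can map to only one ocean regardless of argument order. The sketch below is illustrative only: the names `OCEAN_BETWEEN` and `ocean_between` are hypothetical, and the table holds just a subset of the pairs.

```python
# Illustrative subset of continent pairs; not the full table.
OCEAN_BETWEEN = {
    frozenset({"North America", "Europe"}): "Atlantic Ocean",
    frozenset({"South America", "Africa"}): "Atlantic Ocean",
    frozenset({"Asia", "North America"}): "Pacific Ocean",
    frozenset({"Africa", "Oceania"}): "Indian Ocean",
    frozenset({"Antarctica", "Africa"}): "Southern Ocean",
}

def ocean_between(continent1, continent2):
    """Return the ocean separating two continents, or None if the pair is not listed."""
    return OCEAN_BETWEEN.get(frozenset({continent1, continent2}))

print(ocean_between("Europe", "North America"))
print(ocean_between("Asia", "Europe"))
print(ocean_between("Asia", "Asia"))
```

Because `frozenset({"Asia", "Asia"})` collapses to a single element, two cities on the same continent fall through to `None` without a special case.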



In [11]:
def continents_and_ocean(city1, city2, country_continent_map):
    """
    Checking if two cities are separated by an ocean.
    """
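    country1 = extract_country(city1)
    country2 = extract_country(city2)
    # extract_country returns the string "Country not found" on failure, never a falsy value.
    if "Country not found" in (country1, country2):
        return "Country not found for one or both cities."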
    continent1 = country_continent_map.get(country1)
    continent2 = country_continent_map.get(country2)
    if not continent1 or not continent2:
        return "Continent information not found for one or both countries."
    ocean = are_separated_by_ocean(continent1, continent2)
    if ocean:
        return f"{city1} (in {continent1}) and {city2} (in {continent2}) are separated by the {ocean}."
    else:
        return f"{city1} (in {continent1}) and {city2} (in {continent2}) are not separated by an ocean."

In [12]:
#Test
city1 = "Tehran"
city2 = "Los Angeles"
print(continents_and_ocean(city1, city2, country_continent_map))


Tehran (in Asia) and Los Angeles (in North America) are separated by the Pacific Ocean.


In [13]:
#Test
city1 = "Tehran"
city2 = "Paris"
print(continents_and_ocean(city1, city2, country_continent_map))


Tehran (in Asia) and Paris (in Europe) are not separated by an ocean.


### 4. Shortest Land Route
- Constructs a graph of countries and their neighbors from the "List of countries and territories by number of land borders" Wikipedia page.
- Uses BFS to find the shortest route between the countries of the two cities.


In [14]:
url = "https://en.wikipedia.org/wiki/List_of_countries_and_territories_by_number_of_land_borders"
response = requests.get(url)
soup = BeautifulSoup(response.content,'html.parser')

In [15]:
def parse_country_data(soup):
    graph = {}

    table = soup.find('table', {'class': 'wikitable'})
    if not table:
        return graph

    for row in table.find_all('tr')[1:]:  # Skipping the header row
        columns = row.find_all('td')
        if not columns:
            continue

        country = columns[0].get_text().strip()
        neighbors = set()

        # Extracting neighbors from the last column
        neighbor_data = columns[-1]
        for neighbor_info in neighbor_data.find_all('a', title=True):
            neighbor = neighbor_info.get_text().strip()
            if neighbor and neighbor != country:  # Exclude self-references
                neighbors.add(neighbor)

        graph[country] = neighbors

    return graph



In [16]:
country_graph=parse_country_data(soup)
print(country_graph)

{'Abkhazia': {'Georgia', 'Russia'}, 'Afghanistan': {"People's Republic of China", 'Tajikistan', 'Uzbekistan', 'Pakistan', 'Turkmenistan', 'Iran'}, 'Albania': {'Kosovo', 'Montenegro', 'Greece', 'North Macedonia'}, 'Algeria': {'Morocco', 'Mali', 'Western Sahara', 'Mauritania', 'Tunisia', 'Niger', 'Libya'}, 'Andorra': {'Spain', 'France'}, 'Angola': {'Democratic Republic of the Congo', 'Namibia', 'Republic of the Congo', 'Zambia'}, 'Antigua and Barbuda': set(), 'Argentina': {'Chile', 'Paraguay', 'Brazil', 'Bolivia', 'Uruguay'}, 'Armenia': {'Turkey', 'Georgia', 'Iran', 'Azerbaijan'}, 'Australia': set(), 'Austria': {'Germany', 'Slovakia', 'Hungary', 'Switzerland', 'Czech Republic', 'Liechtenstein', 'Slovenia', 'Italy'}, 'Azerbaijan': {'Russia', 'Iran', 'Turkey', 'Georgia', 'Armenia'}, 'Bahamas': set(), 'Bahrain': set(), 'Bangladesh': {'India', 'Dahagram-Angarpota', 'Myanmar'}, 'Barbados': set(), 'Belarus': {'Russia', 'Poland', 'Latvia', 'Lithuania', 'Ukraine'}, 'Belgium': {'France', 'Luxembo
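
A search over this dictionary treats each entry as a directed adjacency list, so a border that appears on one country's row but not on its neighbor's could only be crossed in one direction. The following is a minimal sketch of making the graph undirected before searching, assuming the dict-of-sets shape returned by `parse_country_data`. The helper name `symmetrize` and the toy data are hypothetical.

```python
def symmetrize(graph):
    """Return a copy of a dict-of-sets graph in which every edge exists in both directions."""
    undirected = {country: set(neighbors) for country, neighbors in graph.items()}
    for country, neighbors in graph.items():
        for neighbor in neighbors:
            # setdefault also adds neighbors that never appeared as a row of their own.
            undirected.setdefault(neighbor, set()).add(country)
    return undirected

# Toy data (not scraped): Spain lists Andorra, but Andorra's row is missing Spain.
toy = {"Spain": {"France", "Andorra"}, "France": {"Spain", "Andorra"}, "Andorra": {"France"}}
fixed = symmetrize(toy)
print(sorted(fixed["Andorra"]))
```

The input dictionary is left untouched, so the scraped graph and the symmetrized copy can be compared to spot inconsistent rows.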

In [17]:
def bfs(graph, start, goal):
    """Return the shortest path (fewest edges) from start to goal as a list, or None."""
    # Track visited nodes and keep a queue of partial paths, starting with [start].
    visited = set()
    queue = deque([[start]])

    while queue:
        path = queue.popleft()
        node = path[-1]

        if node == goal:
            return path

        if node not in visited:
            visited.add(node)

            for neighbor in graph.get(node, []):
                new_path = list(path)
                new_path.append(neighbor)
                queue.append(new_path)

    return None
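
BFS explores countries in order of distance from the start, so the first time it reaches the goal it has used the fewest border crossings. The sketch below shows this on a small hand-written graph (not scraped data). It is a variant of the function above, not a replacement: it marks nodes as seen when they are enqueued and rebuilds the path from parent pointers instead of copying path lists. The name `shortest_route` is hypothetical.

```python
from collections import deque

def shortest_route(graph, start, goal):
    """Breadth-first search returning a path with the fewest edges, or None."""
    parents = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the parent pointers back to the start, then reverse.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in sorted(graph.get(node, ())):  # sorted for a reproducible result
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None

toy_graph = {
    "A": {"B", "C"},
    "B": {"A", "D"},
    "C": {"A", "D"},
    "D": {"B", "C", "E"},
    "E": {"D"},
    "Island": set(),
}
route = shortest_route(toy_graph, "A", "E")
print(route, "crossings:", len(route) - 1)
print(shortest_route(toy_graph, "A", "Island"))
```

The number of border crossings is the path length minus one, and a node with no neighbors (like `"Island"` here) is simply never reached.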


In [18]:
def find_route(start_country, end_country, graph):
    route=bfs(graph,start_country,end_country)
    if route:
        return " -> ".join(route)
    else:
        return "No route found"

In [19]:
#Test
start_country = "Italy"
end_country = "Morocco"
route = find_route(start_country, end_country, country_graph)
print(route)


Italy -> Austria -> Hungary -> Romania -> Bulgaria -> Turkey -> Syria -> Israel -> Egypt -> Libya -> Algeria -> Morocco


In [20]:
start_country = "Malaysia"
end_country = "France"
route =find_route(start_country, end_country, country_graph)
print(route)

Malaysia -> Thailand -> Myanmar -> India -> Pakistan -> Iran -> Azerbaijan -> Russia -> Poland -> Germany -> France


### 5. Country Insights
- For each country in the route, extracts:
  - Total area
  - Population
  - GDP
- Compares the countries to identify the largest, most populous, and richest.
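
The comparison in the last bullet needs numbers rather than scraped strings. The following is a rough sketch of that step, assuming the values look like the ones printed in this notebook's example output (for instance `643,801 km`, `68,042,591`, `$3.868 trillion`). The helper names `parse_number` and `top_by` and the row layout are hypothetical; the sample values are copied from the Tehran to Paris run shown later in this document.

```python
import re

SCALE = {"million": 1e6, "billion": 1e9, "trillion": 1e12}

def parse_number(text):
    """Parse strings such as '643,801 km' or '$3.868 trillion'; return None if no number is found."""
    match = re.search(r"(\d[\d,]*(?:\.\d+)?)\s*(million|billion|trillion)?", text)
    if not match:
        return None
    value = float(match.group(1).replace(",", ""))
    return value * SCALE.get(match.group(2), 1)

def top_by(rows, key):
    """Name of the country with the largest parseable value for `key`, or None."""
    parsed = [(parse_number(row[key]), row["name"]) for row in rows]
    usable = [(value, name) for value, name in parsed if value is not None]
    return max(usable)[1] if usable else None

rows = [
    {"name": "Iran", "area": "1,648,195 km", "population": "87,590,873", "gdp": "$1.726 trillion"},
    {"name": "Azerbaijan", "area": "86,600 km", "population": "10,353,296", "gdp": "$192.146 billion"},
    {"name": "Russia", "area": "17,098,246 km", "population": "(", "gdp": "$5.056 trillion"},
    {"name": "Germany", "area": "357,600 km", "population": "84,482,267", "gdp": "$5.537 trillion"},
]
print(top_by(rows, "area"), top_by(rows, "population"), top_by(rows, "gdp"))
```

Values that cannot be parsed, such as the `(` scraped for Russia's population, are skipped, so the "most populous" result only covers countries whose figure was extracted cleanly.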

In [21]:
def get_country_info(country_url):
    """
    Scrape a country's Wikipedia infobox and summarize its name, area, population, and GDP.
    """
    response = requests.get(country_url)
    tree = html.fromstring(response.content)

    def extract_first_or_none(xpath_query):
        results = tree.xpath(xpath_query)
        return ' '.join(results[0].strip().split()) if results else "Not available"

    country_name_xpath = "//div[@class='fn org country-name']/text()"
    country_name = extract_first_or_none(country_name_xpath)

    area_xpath = "//th[contains(., 'Total')]/following-sibling::td/text()"
    area = extract_first_or_none(area_xpath)

    population_xpath = "//th[contains(., 'Population')]/../following-sibling::tr[1]//td/text()"
    population = extract_first_or_none(population_xpath)

    gdp_xpath = "//th[contains(., 'Total') and following-sibling::td[contains(text(), '$')]]/following-sibling::td/text()"
    gdp = extract_first_or_none(gdp_xpath)

    return f"The country {country_name} has an area of {area}, a population of {population}, and a total GDP of {gdp}."

In [22]:
#Test
france_url = 'https://en.wikipedia.org/wiki/France'
get_country_info(france_url)



'The country French Republic has an area of 643,801 km, a population of 68,042,591, and a total GDP of $3.868 trillion.'

In [23]:
#Using all previous steps here
def walk_the_world(start_city, end_city):
    # Validate city existence
    if not check_wikipedia_page(start_city) or not check_wikipedia_page(end_city):
        return "Error: One or both cities do not have a Wikipedia page."

    # Extracting countries
    start_country = extract_country(start_city)
    end_country = extract_country(end_city)

    # Checking if in the same country
    if start_country == end_country:
        return f"Both cities are in the same country: {start_country}."

    # Determining continents and ocean separation
    country_continent_map = country_continent_mapping()
    ocean_check = continents_and_ocean(start_city, end_city, country_continent_map)
    if "separated by the" in ocean_check:
        return ocean_check  # Ocean separates the cities

    # Finding the walking route
    country_graph = parse_country_data(soup)  # soup should be defined with the appropriate URL
    route = find_route(start_country, end_country, country_graph)
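    # find_route returns the string "No route found" on failure, so compare explicitly.
    if route == "No route found":
        return "No walkable route found."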

    # Gathering country information along the route
    route_info = ""
    for country in route.split(" -> "):
        country_url = f'https://en.wikipedia.org/wiki/{country.replace(" ", "_")}'
        route_info += get_country_info(country_url) + "\n"

    return f"To walk from {start_city} in {start_country} to {end_city} in {end_country},the shortest land route is through {route}. \n\nCountry Information:\n{route_info}"
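
`walk_the_world` requests each city's page more than once: in `check_wikipedia_page`, in `extract_country`, and again inside `continents_and_ocean`. One possible way to avoid the repeated downloads is to route all requests through a cached helper. The sketch below runs offline: `download` is a stand-in for `requests.get(url).text`, and both it and `fetch_page` are hypothetical names, not part of the original notebook.

```python
from functools import lru_cache

calls = []  # records every simulated download, for demonstration only

def download(url):
    """Stand-in for requests.get(url).text so the sketch runs offline."""
    calls.append(url)
    return f"<html>{url}</html>"

@lru_cache(maxsize=None)
def fetch_page(url):
    """Return the page body for `url`, downloading it at most once per process."""
    return download(url)

url = "https://en.wikipedia.org/wiki/Tehran"
fetch_page(url)  # first call performs the (simulated) download
fetch_page(url)  # second call is served from the cache
print(len(calls))
```

The cache lives only for the current process, so a fresh run still sees up-to-date pages.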



In [24]:
# Example
print(walk_the_world("Tehran", "Paris"))

To walk from Tehran in Iran to Paris in France,the shortest land route is through Iran -> Azerbaijan -> Russia -> Poland -> Germany -> France. 

Country Information:
The country Islamic Republic of Iran has an area of 1,648,195 km, a population of 87,590,873, and a total GDP of $1.726 trillion.
The country Republic of Azerbaijan has an area of 86,600 km, a population of 10,353,296, and a total GDP of $192.146 billion.
The country Russian Federation has an area of 17,098,246 km, a population of (, and a total GDP of $5.056 trillion.
The country Republic of Poland has an area of 312,700 km, a population of 38,036,118, and a total GDP of $1.712 trillion.
The country Federal Republic of Germany has an area of 357,600 km, a population of 84,482,267, and a total GDP of $5.537 trillion.
The country French Republic has an area of 643,801 km, a population of 68,042,591, and a total GDP of $3.868 trillion.



## Conclusion
The `Walk the World` project demonstrates:
- How to use web scraping to extract structured geographic data from Wikipedia.
- How graph traversal (BFS) can be applied to real-world problems like route planning.
- Insights into global geography, such as land border connections and continent separation by oceans.

This project highlights the power of combining programming with geography for solving intriguing problems.
