I was analyzing data for an e-commerce business named `Alpha Stores` and found that my data was missing a location table for the states of Brazil, so I decided to scrape it from `https://www.distancelatlong.com/country/brazil/`.

The code block below scrapes structured data from that webpage and organizes it into DataFrames for analysis:

- **Import libraries**:  
  `requests`, `BeautifulSoup`, and `pandas` are imported for HTTP requests, HTML parsing, and data manipulation, respectively.

- **Initialize `details` list**:  
  A list is created to store extracted data, though it's unused in this specific block.

- **Define `url` and fetch webpage**:  
  The URL of the target webpage (`https://www.distancelatlong.com/country/brazil/`) is defined, and `requests.get` retrieves the HTML content. BeautifulSoup parses the content into a navigable structure.

- **Locate tables on the page**:  
  `soup.find_all` identifies all `<table>` elements with the specified CSS classes (`table table-striped setBorder`) for data extraction.

1. **Parse general country information (first table)**:  
   - A dictionary, `country_info`, is populated by iterating over rows of the first table.  
   - Labels (first column) and values (second column) are extracted and stored in key-value pairs.  
   - A pandas DataFrame (`country_info_df`) is created to organize this data.

2. **Parse distances to other countries (second table)**:  
   - A list, `distances`, is created to store country-distance pairs.  
   - The code iterates over the table rows (skipping the header) to extract each country's name and its distance from Brazil.  
   - Data is converted into a DataFrame (`distance_df`) with columns "Country" and "Distance to Brazil."

3. **Parse states with latitude and longitude (third table)**:  
   - A list, `states`, is created to store state name, latitude, and longitude.  
   - The code iterates over the table rows (skipping the header) to extract each state's name, latitude, and longitude.  
   - Data is organized into a DataFrame (`states_df`) with columns "State," "Latitude," and "Longitude."

- **Display extracted DataFrames**:  
  Prints the `country_info_df`, `distance_df`, and `states_df` DataFrames to the console for review.

- **Optional: Save DataFrames to CSV**:  
  Comments indicate the option to save each DataFrame to a CSV file for further use. File names and paths can be customized as needed.
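
The row-by-row parsing pattern described above can be tried offline before running it against the live page. The sketch below is only an illustration: `SAMPLE_HTML` is a made-up stand-in for the states table (the real page's markup may differ), and `parse_rows` is a hypothetical helper that is not part of the notebook's code.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical stand-in for the states table; the real page's markup may differ.
SAMPLE_HTML = """
<table class="table table-striped setBorder">
  <tr><th>State</th><th>Latitude</th><th>Longitude</th></tr>
  <tr><td>Acre (3)</td><td>-9.070003236</td><td>-68.66997929</td></tr>
  <tr><td>Alagoas (5)</td><td>-9.48000405</td><td>-35.83996769</td></tr>
</table>
"""


def parse_rows(table, expected_cols):
    """Return the stripped cell text of every row with exactly `expected_cols` <td> cells."""
    rows = []
    for tr in table.find_all('tr'):
        cells = tr.find_all('td')
        # Header rows in this sample use <th>, so they have zero <td> cells and are skipped
        if len(cells) == expected_cols:
            rows.append([cell.get_text(strip=True) for cell in cells])
    return rows


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
tables = soup.find_all('table', class_='table table-striped setBorder')

sample_states_df = pd.DataFrame(
    parse_rows(tables[0], 3),
    columns=['State', 'Latitude', 'Longitude'],
)
print(sample_states_df)
```

Checking the number of `<td>` cells per row is what lets the loop ignore rows that do not match the expected layout, which is the same guard the notebook code uses with `len(cols)`.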


In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Initialize list to hold details
details = []

# URL to scrape
url = 'https://www.distancelatlong.com/country/brazil/'
response = requests.get(url, timeout=30)
response.raise_for_status()  # Stop early if the request failed
soup = BeautifulSoup(response.text, 'html.parser')

# Find all the tables on the page
tables = soup.find_all('table', class_='table table-striped setBorder')

# Parse the first table (general country information)
country_info = {}
general_info_table = tables[0]
for row in general_info_table.find_all('tr'):
    cols = row.find_all('td')
    if len(cols) == 2:
        label = cols[0].text.strip()
        value = cols[1].text.strip()
        country_info[label] = value

# Convert the general info into a DataFrame
country_info_df = pd.DataFrame([country_info])

# Parse the second table (distances to other countries)
distances = []
distance_table = tables[1]
for row in distance_table.find_all('tr')[1:]:  # Skip header row
    cols = row.find_all('td')
    if len(cols) == 2:
        country = cols[0].text.strip()
        distance = cols[1].text.strip()
        distances.append([country, distance])

# Convert the distances to a DataFrame
distance_df = pd.DataFrame(distances, columns=['Country', 'Distance to Brazil'])

# Parse the third table (states with latitude and longitude)
states = []
states_table = tables[2]
for row in states_table.find_all('tr')[1:]:  # Skip header row
    cols = row.find_all('td')
    if len(cols) == 3:
        state = cols[0].text.strip()
        latitude = cols[1].text.strip()
        longitude = cols[2].text.strip()
        states.append([state, latitude, longitude])

# Convert the states to a DataFrame
states_df = pd.DataFrame(states, columns=['State', 'Latitude', 'Longitude'])

# Display the DataFrames
print("Country Info:")
print(country_info_df)
print("\nDistances:")
print(distance_df)
print("\nStates:")
print(states_df)

# Optionally, you can save the dataframes to CSV files:
# country_info_df.to_csv('country_info.csv', index=False)
# distance_df.to_csv('distances.csv', index=False)
# states_df.to_csv('states.csv', index=False)


Country Info:
    CAPITAL DIAL CODE   POPULATION           AREA COAST LINE MOBILE USERS  \
0  Brasília       +55  204,259,812  8,514,877 KM2   7,491 KM    1,774,725   

  INTERNET USERS  
0    120,111,118  

Distances:
         Country Distance to Brazil
0       Paraguay            1231 km
1        Bolivia            1271 km
2  French Guiana            2024 km
3       Suriname            2068 km
4        Uruguay            2070 km

States:
                      State      Latitude     Longitude
0                  Acre (3)  -9.070003236  -68.66997929
1               Alagoas (5)   -9.48000405  -35.83996769
2                 Amapa (5)  -0.039598369  -51.17998743
3             Amazonas (16)  -3.289580873   -60.6199797
4                Bahia (31)  -16.28000242   -39.0299797
5                Ceara (19)   -2.89999225  -40.85002364
6      Distrito Federal (1)  -15.78334023  -47.91605229
7        Espirito Santo (5)  -20.85000771  -41.12998071
8                Goias (18)  -17.73004311  -49.10998

**Removing the numbers that follow the state names in the dataset**

In [None]:
states_df['State'] = states_df['State'].str.replace(r'\s\(\d+\)', '', regex=True)

# Display the updated DataFrame
print(states_df)
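
To see what the regular expression does without re-running the scraper, here is a minimal sketch on a few state names copied from the output above. The `sample` Series is only an illustration and stands in for `states_df['State']`.

```python
import pandas as pd

# Sample values copied from the scraped output shown earlier
sample = pd.Series(['Acre (3)', 'Distrito Federal (1)', 'Amazonas (16)'])

# \s matches the space before the bracket, \( and \) match the literal
# parentheses, and \d+ matches the one or more digits between them
cleaned = sample.str.replace(r'\s\(\d+\)', '', regex=True)

print(cleaned.tolist())  # ['Acre', 'Distrito Federal', 'Amazonas']
```

Spaces inside a name, such as the one in "Distrito Federal", are left alone because the pattern only matches a space that is immediately followed by a parenthesized number.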



**Save the cleaned states DataFrame to CSV**:  
  The cell below writes `states_df` to `states.csv` without the index column. The file name and path can be customized as needed.


In [None]:
states_df.to_csv('states.csv', index=False)
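
One thing worth knowing when the file is loaded again later: the scraped latitude and longitude are stored as text, but `pd.read_csv` will normally infer them as floats. The sketch below illustrates this round trip with two rows taken from the output above. It writes to an in-memory buffer rather than `states.csv`, and `sample_states` is a stand-in for the cleaned `states_df`.

```python
import io

import pandas as pd

# Two rows in the same shape as states_df after cleaning (values from the output above)
sample_states = pd.DataFrame(
    [['Acre', '-9.070003236', '-68.66997929'],
     ['Alagoas', '-9.48000405', '-35.83996769']],
    columns=['State', 'Latitude', 'Longitude'],
)

# Write to an in-memory buffer instead of a file so the sketch leaves nothing on disk
buffer = io.StringIO()
sample_states.to_csv(buffer, index=False)
buffer.seek(0)

# Reading the CSV back lets pandas infer numeric types for the coordinate columns
reloaded = pd.read_csv(buffer)
print(reloaded.dtypes)
```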