## Scraping The Geographical Data Of Brazil

#### **What I want to do:**
I want to scrape geographical information about Brazil, in particular the Brazilian states with their corresponding latitude and longitude.

#### **Why I want to do it:**
This information is valuable for understanding Brazil's geographical context and the layout of its states. It is useful because my Alpha store data lacks location data.

#### **How I want to do it:**
- Initialize an empty list called `details` to hold the scraped data.
- Define the URL for the webpage containing the geographical information about Brazil and send a GET request to retrieve the page content.
- Use BeautifulSoup to parse the HTML content of the page.
- Find all the tables on the page using `soup.find_all()`, specifically looking for tables with the class `table table-striped setBorder`.
- Parse the first table to extract general country information:
  - Create a dictionary called `country_info` to store the label-value pairs from the table rows.
- Convert the `country_info` dictionary into a Pandas DataFrame for easier manipulation and display.
- Parse the second table to extract distances to other countries:
  - Create a list called `distances` to hold country-distance pairs and convert it into a DataFrame.
- Parse the third table to extract states with their latitude and longitude:
  - Create a list called `states` and convert it into a DataFrame.
- Print the collected DataFrames to display the information.
- Optionally, save the DataFrames to CSV files for future reference (commented out in the code).



In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Initialize list to hold details
details = []

# URL to scrape
url = 'https://www.distancelatlong.com/country/brazil/'
response = requests.get(url, timeout=30)
response.raise_for_status()  # Stop early if the request failed
soup = BeautifulSoup(response.text, 'html.parser')

# Find all the tables on the page
tables = soup.find_all('table', class_='table table-striped setBorder')

# Parse the first table (general country information)
country_info = {}
general_info_table = tables[0]
for row in general_info_table.find_all('tr'):
    cols = row.find_all('td')
    if len(cols) == 2:
        label = cols[0].text.strip()
        value = cols[1].text.strip()
        country_info[label] = value

# Convert the general info into a DataFrame
country_info_df = pd.DataFrame([country_info])

# Parse the second table (distances to other countries)
distances = []
distance_table = tables[1]
for row in distance_table.find_all('tr')[1:]:  # Skip header row
    cols = row.find_all('td')
    if len(cols) == 2:
        country = cols[0].text.strip()
        distance = cols[1].text.strip()
        distances.append([country, distance])

# Convert the distances to a DataFrame
distance_df = pd.DataFrame(distances, columns=['Country', 'Distance to Brazil'])

# Parse the third table (states with latitude and longitude)
states = []
states_table = tables[2]
for row in states_table.find_all('tr')[1:]:  # Skip header row
    cols = row.find_all('td')
    if len(cols) == 3:
        state = cols[0].text.strip()
        latitude = cols[1].text.strip()
        longitude = cols[2].text.strip()
        states.append([state, latitude, longitude])

# Convert the states to a DataFrame
states_df = pd.DataFrame(states, columns=['State', 'Latitude', 'Longitude'])

# Display the DataFrames
print("Country Info:")
print(country_info_df)
print("\nDistances:")
print(distance_df)
print("\nStates:")
print(states_df)

# Optionally, you can save the dataframes to CSV files:
# country_info_df.to_csv('country_info.csv', index=False)
# distance_df.to_csv('distances.csv', index=False)
# states_df.to_csv('states.csv', index=False)


Country Info:
    CAPITAL DIAL CODE   POPULATION           AREA COAST LINE MOBILE USERS  \
0  Brasília       +55  204,259,812  8,514,877 KM2   7,491 KM    1,774,725   

  INTERNET USERS  
0    120,111,118  

Distances:
         Country Distance to Brazil
0       Paraguay            1231 km
1        Bolivia            1271 km
2  French Guiana            2024 km
3       Suriname            2068 km
4        Uruguay            2070 km

States:
                      State      Latitude     Longitude
0                  Acre (3)  -9.070003236  -68.66997929
1               Alagoas (5)   -9.48000405  -35.83996769
2                 Amapa (5)  -0.039598369  -51.17998743
3             Amazonas (16)  -3.289580873   -60.6199797
4                Bahia (31)  -16.28000242   -39.0299797
5                Ceara (19)   -2.89999225  -40.85002364
6      Distrito Federal (1)  -15.78334023  -47.91605229
7        Espirito Santo (5)  -20.85000771  -41.12998071
8                Goias (18)  -17.73004311  -49.10998

#### **Result:**
I successfully got all the states in Brazil and their respective latitude and longitude.
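
Because the values are read with `.text.strip()`, the `Latitude` and `Longitude` columns hold strings rather than numbers. If numeric coordinates are needed later, a conversion along the following lines may help. This is only a sketch: `sample_df` is a hypothetical stand-in built from two rows of the output above, and `errors='coerce'` is an assumed choice that turns unparseable values into `NaN` instead of raising an error.

```python
import pandas as pd

# Hypothetical stand-in for states_df, using two rows from the output above
sample_df = pd.DataFrame({
    'State': ['Acre (3)', 'Alagoas (5)'],
    'Latitude': ['-9.070003236', '-9.48000405'],
    'Longitude': ['-68.66997929', '-35.83996769'],
})

# Convert the coordinate columns from text to floats;
# values that cannot be parsed become NaN (assumed choice)
for col in ['Latitude', 'Longitude']:
    sample_df[col] = pd.to_numeric(sample_df[col], errors='coerce')

print(sample_df.dtypes)
```

The same loop could be applied to `states_df` directly once the scrape has run.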


---



## Partially Cleaning The Scraped Data

#### **What I want to do**:
I want to clean the 'State' column in the `states_df` DataFrame by removing any numerical information that may be present in parentheses.

#### **Why I want to do it:**
The presence of numerical information in parentheses (e.g., population or code) may not be relevant for certain analyses or visualizations. By removing this information, I can ensure that the 'State' column contains only the names of the states, making it cleaner and more useful for further processing or analysis.

#### **How I want to do it:**
- Use the `str.replace()` method on the 'State' column of the `states_df` DataFrame.
- Apply a regular expression (`r'\s\(\d+\)'`) to match any whitespace followed by a number enclosed in parentheses.
- Set `regex=True` to indicate that the first argument is a regex pattern, allowing for the removal of the matched patterns from the 'State' names.



In [None]:
states_df['State'] = states_df['State'].str.replace(r'\s\(\d+\)', '', regex=True)


#### **Result:**
The trailing numbers in parentheses are dropped from the state names.
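
The effect of the pattern can be checked on a small sample without re-running the scrape. The sketch below uses a hypothetical `sample_df` with three state names copied from the scraped output and applies the same regular expression as the cell above.

```python
import pandas as pd

# Hypothetical sample, with names copied from the scraped output
sample_df = pd.DataFrame({
    'State': ['Acre (3)', 'Distrito Federal (1)', 'Espirito Santo (5)']
})

# Same pattern as above: one whitespace character, then digits in parentheses
sample_df['State'] = sample_df['State'].str.replace(r'\s\(\d+\)', '', regex=True)

print(sample_df['State'].tolist())
# ['Acre', 'Distrito Federal', 'Espirito Santo']
```

Multi-word names such as `Distrito Federal` keep their internal space because the pattern only matches whitespace that is directly followed by a parenthesized number.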



---







## Optional: Save DataFrames To CSV
The scraping code contains commented-out lines for saving each DataFrame to a CSV file for further use. File names and paths can be customized as needed.


In [None]:
states_df.to_csv('states.csv', index=False)
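
As one way of customizing the path, the file can be written into a dedicated folder and read back to confirm that it was saved correctly. This is a sketch under assumptions: the `output` directory name and the `states_sample` DataFrame are hypothetical and only stand in for the real `states_df`.

```python
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the cleaned states_df
states_sample = pd.DataFrame({
    'State': ['Acre', 'Alagoas'],
    'Latitude': ['-9.070003236', '-9.48000405'],
    'Longitude': ['-68.66997929', '-35.83996769'],
})

# Hypothetical output folder; created if it does not exist yet
output_dir = Path('output')
output_dir.mkdir(parents=True, exist_ok=True)

# Save without the index column, as in the cell above
csv_path = output_dir / 'states.csv'
states_sample.to_csv(csv_path, index=False)

# Read the file back to check that the rows and columns survived
restored = pd.read_csv(csv_path)
print(restored.head())
```

Note that `pd.read_csv` infers column types, so the coordinates come back as numbers even though they were saved as text.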