## Web Scraping Project: Global Indicators Data Extraction

### Project Overview
The goal of this project was to scrape the [United Nations Database - Global Indicators](https://data.un.org/default.aspx) and collect global indicator data for every country in the world, as well as for various regions. The project focused on four categories of indicators:
- **Economic Indicators**
- **Environmental and Infrastructural Indicators**
- **General Information**
- **Social Indicators**

### Data Collection Process
For each country and region, the web scraping tool was programmed to navigate the respective web pages of the United Nations Database and extract data for each of the four indicators. The process was repeated for all countries and the following regions:
- **Global** (The whole world)
- **Africa** (Northern Africa, Sub-Saharan Africa, Eastern Africa, Middle Africa, Southern Africa, Western Africa)
- **Americas** (Northern America, Latin America and Caribbean, Caribbean, Central America, South America)
- **Asia** (Central Asia, Eastern Asia, Southern Asia, South-Eastern Asia, Western Asia)
- **Europe** (Northern Europe, Southern Europe, Western Europe, Eastern Europe)
- **Oceania** (Australia and New Zealand, Melanesia, Micronesia, Polynesia)

### Output
The data for each of the four indicators was saved into separate CSV files. For every country and region, a folder named after the country or region contained four CSV files, each corresponding to one of the indicators. For example, a country's folder would include:
- `Economic_Indicators.csv`
- `Environmental_and_Infrastructure_Indicators.csv`
- `General_Information.csv`
- `Social_Indicators.csv`

Each region's folder followed the same four-file layout, giving a structured and organized dataset for future analysis.
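The layout described above can be sketched as a small helper that lists the four CSV paths expected for one country or region and reports which of them are missing. This is only an illustration: the `output` root folder, the `Nigeria` folder name, and the helper names `expected_paths` and `missing_files` are assumptions and are not part of the original script.

```python
import os

# File names as listed above: one CSV per indicator category.
INDICATOR_FILES = [
    "Economic_Indicators.csv",
    "Environmental_and_Infrastructure_Indicators.csv",
    "General_Information.csv",
    "Social_Indicators.csv",
]


def expected_paths(root, place_name):
    """Return the four CSV paths expected for one country or region (hypothetical helper)."""
    return [os.path.join(root, place_name, name) for name in INDICATOR_FILES]


def missing_files(root, place_name):
    """Return the expected CSV paths that do not exist on disk yet (hypothetical helper)."""
    return [path for path in expected_paths(root, place_name) if not os.path.isfile(path)]


# "output" and "Nigeria" are placeholder names used only for this example.
for path in expected_paths("output", "Nigeria"):
    print(path)
```

A check like `missing_files` could be run after scraping to spot countries or regions whose folder is incomplete.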


### Importation of Necessary Libraries

In [65]:
# -- imports the Beautiful Soup Library for Parsing HTML Code.

from bs4 import BeautifulSoup

# -- imports the requests library for making HTTP requests to a website.

import requests

# -- imports the regex library for parsing of data

import re

# -- imports the pandas library for creating and working with dataframes

import pandas as pd

# -- imports operating system library

import os

# -- imports tkinter library for incorporating GUI

import tkinter as tk

# -- imports ttk (themed widgets such as the progress bar), Entry (single-line input field) and Label (text or image display)

from tkinter import ttk, Entry, Label

### Important Things to Note
- A successful response from a website has status code 200 (i.e. "OK": the request has succeeded). See the [HTTP Request Response Summary](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) for the meaning of other possible status codes.
- The Python classes in this script use class inheritance so that parent class methods can be used in a child class.
- Most of the class methods in this script use function chaining: they take the output of previously defined methods as their input.
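The last two points can be illustrated with a minimal sketch. The classes `Access`, `Scraping` and `Extraction` below are simplified stand-ins invented for this example (they are not the classes defined later in this script), and the "page" is a plain string rather than a real HTTP response.

```python
class Access:
    """Parent class: provides the lowest-level step."""

    def fetch(self, page):
        # Stand-in for an HTTP request: returns the "raw" text unchanged.
        return page


class Scraping(Access):
    """Child class: reuses fetch() from Access through inheritance."""

    def rows(self, page):
        # Chaining: the output of fetch() is the input of this method.
        return [line.split(",") for line in self.fetch(page).splitlines()]


class Extraction(Scraping):
    """Grandchild class: inherits both fetch() and rows()."""

    def non_empty_rows(self, page):
        # Chaining again: builds on rows(), which builds on fetch().
        return [row for row in self.rows(page) if any(cell.strip() for cell in row)]


extractor = Extraction()
print(extractor.non_empty_rows("GDP,2020\n,\nPopulation,2021"))
# prints [['GDP', '2020'], ['Population', '2021']]
```

Calling only the outermost method is enough to run the whole chain, which is the same pattern the scraping classes in this script rely on.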

### Class for Graphical User Interface (GUI)

In [68]:
class Gui:
    
    def __init__(self):
        """Class Initialization."""
        self.root = tk.Tk()  # creates the main root widget
        self.entry = self.create_entry()  # creates and sets up the entry widget
        self.root.title("WebScraper") # sets the title of the GUI window
        try:
            self.root.iconbitmap("web_scraping_icon.ico") # sets the icon of the GUI window
        except tk.TclError:
            pass # continues without an icon if the .ico file is missing or unsupported on this platform
        self.root.geometry("400x200")  # sets the size of the root window (width x height)
        

    
    # --- method to create an entry widget with default attributes
    def create_entry(self):
        
        e = Entry(self.root, width=50, borderwidth=3, foreground="black", background="white") # creates the widget using the specifications provided
        global default_txt
        default_txt = "Enter the link to the website you'd like to scrape"
        e.insert(0, default_txt)  # inserts the default text at position 0 (the start of the entry)
        e.place(relx=0.5, rely=0.5, anchor='center')  # positions the entry widget in the middle of the window using place
        
        # binds the events to the handlers that run when a specific action occurs
        e.bind("<FocusIn>", self.clear_default_text) # clears the default text when the entry gains focus (e.g. when clicked on)
        e.bind("<Return>", self.retrieve_link) # processes the input when the Enter key is pressed
        return e # returns the entry widget
        

    # --- method to create progress bar widget of the data scraping
    def create_progress_bar(self):
        
        progress = ttk.Progressbar(self.root, orient='horizontal', length=300, mode='determinate')
        progress.place(relx=0.5, rely=0.55, anchor='center')  # positions the progress bar widget below the entry widget in the window using place
        return progress # returns the progress bar
        

    # --- method to clear the default text when the user starts typing
    def clear_default_text(self, event):
        
        # checks whether the default text is still present in the entry widget
        if self.entry.get() == default_txt:
            self.entry.delete(0, 'end')  # Clear the default text if true
            

    # --- method to retrieve the link entered by the user in the entry widget
    def retrieve_link(self, event):
        link = self.entry.get() # gets the link entered by the user and stores it in the "link" variable
        self.processed_link = link # store the link for later use
        self.root.destroy() # close the GUI window
        

    # --- method to return the retrieved link
    def get_link(self):
        
        return getattr(self, 'processed_link', None) # returns the value of 'processed_link' attribute if it exists; otherwise, returns None.

        

    # --- method to run the tkinter main loop
    def run(self):
        
        self.root.mainloop() # starts the Tkinter main event loop, keeping the GUI window active and responsive.

    


### Class for Accessing the Website

In [70]:
class WebsiteAccess:

    def __init__(self):
        """Class Initialization"""
        self.gui = Gui() # create an instance of the GUI class
        self.gui.run() # Run the GUI        

    # url1 = "https://data.un.org/en/reg/g1.html"
    # url2 = "https://data.un.org/default.aspx"
    

    # --- class method that takes the URL entered by the user and returns the website's response
    
    def website_response(self):
    
        # url = input("Kindly enter the url of the website you want to scrape: ") # requests website link from the user
        url = self.gui.get_link() # retrieves the link entered by the user by calling the get_link method (None if no link was entered)
    
        # try-except block to handle any unexpected error that might occur
        try:
            response = requests.get(url) # sends a request to the website and gets a response, 200 means success
            
            response.raise_for_status() # raises an HTTPError if the response code was unsuccessful
            
        except requests.exceptions.RequestException as e:
            print(f"Failed to get response from the website: {e}") # prints the error that occurred
            return None # no usable response, so there is nothing to return
            
        # returns the website's response object (its raw HTML is available as response.text)
        return response


    # --- class method to get the selected table using its index
    
    def table_html_code(self, table_index, ech_webpage_soup):
        
        table = ech_webpage_soup.find_all("details")[table_index] # gets the html code for the selected table      
        return table # returns the html code for the current table


### Class for Scraping the Website with Parent Class "WebsiteAccess"

In [72]:

class WebsiteScraping(WebsiteAccess):
    
    def __init__(self):
        """Class Initialization. Overrides WebsiteAccess.__init__, so creating this object does not open the GUI."""
        pass

    
    #  --- class method to get the name of the table
   
    def get_table_name(self, ech_webpage_soup, table_index, title_tag = "summary"):
        
        table_info = self.table_html_code(table_index, ech_webpage_soup) # calls the table_html_code method to get the html doc of the selected (or current) table
        table_title = table_info.find(title_tag) # uses the "summary" html tag to get the title of the selected (or current) table 
        table_name = table_title.text.strip("\n") # extracts the text data (type) element from the "summary" tag while stripping off newline characters 
        return table_name # returns the name of the table

        
    # --- class method to get all rows in the table
    
    def table_data(self, table_index : int, ech_webpage_soup, header_title_tag : str = "th", row_data : str = "tr"): 
        
        table_info = self.table_html_code(table_index, ech_webpage_soup) # calls the table_html_code method and assigns the table specified by index to the "table_info" variable
        table = table_info.find("table") # uses the "table" html tag to get all the contents of the selected (or current) table
        all_row_data = [] # creates a list object and assigns it to the variable "all_row_data"
        if table is None:
            return all_row_data # returns an empty list if the "details" tag holds no table
        table_rows = table.find_all(row_data) # uses the "tr" html tag (the default value of "row_data") to get all rows in the table
        
        # loops through each row in the table and extracts the text data (type)
        for row in table_rows:
            each_row = row.find_all("td") # uses the "td" html tag to get the data in each row (or each data cell)
            each_row_data = [data.text.strip("\n") for data in each_row]  # extracts the text from each "td" cell while stripping off newline characters and stores the results in a list using list comprehension
            
            all_row_data.append(each_row_data) # appends the data in each row to the list "all_row_data" (i.e a list of lists)
    
        return all_row_data # returns all the rows in the table in a list object "all_row_data"
    
    
    

### Class for Extracting Data from the Website with Parent Class "WebsiteScraping"

In [74]:

class WebsiteDataExtraction(WebsiteScraping):
    
    def __init__(self):
        """Class Initialization. Overrides the inherited __init__; the GUI is opened later, inside get_all_tables."""
        pass

    
    # --- class method to create dataframe
    
    def get_into_df(self, table_index : int, ech_webpage_soup):
        
        table_df = pd.DataFrame(self.table_data(table_index, ech_webpage_soup)) # creates a new data frame for the table using the output of the "table_data" class method (i.e the "all_row_data" list)
        return table_df # returns the created dataframe
        
    
    # --- class method to write dataframe to a csv file
    
    def convert_df2_csv(self, section_name, folder_name, table_index, ech_webpage_soup): 
   
        folder_path = os.path.join(section_name, folder_name) # creates path to folder

        # creates the folder directory if it does not exist
        os.makedirs(folder_path, exist_ok=True)
            
        file_path = os.path.join(folder_path, f"{self.get_table_name(ech_webpage_soup, table_index)}.csv") # creates path where each csv file will be stored
            
        # try-except block to handle any errors that may arise
        try:
            self.get_into_df(table_index, ech_webpage_soup).to_csv(file_path, index=False) # writes the dataframe created by the "get_into_df" class method to a csv file
            return file_path # returns the path of the created csv file (to_csv itself returns None when given a path)
            
        except PermissionError:
            print("Permission Error: The file you are trying to modify is currently in use. Kindly close it or use another file") # prints this out in case a Permission Error is raised
            
        except Exception as e:
            print(f"An Error Occurred: {e}") # prints this in case any other error is raised.
            

    # --- class method to write all available tables in the website to csv files.
    
    def get_all_tables(self):

        website_access = WebsiteAccess() # creates an instance of the WebsiteAccess class (this opens the GUI)
        response = website_access.website_response() # calls the website_response method in the WebsiteAccess class to get the website's response and assigns the result to the "response" variable
        
        if response is None:
            return # stops if no valid response was received from the website

        # response = self.website_response() # class method call for the website's response and assigns the result to the "response" variable
        soup = BeautifulSoup(response.text, 'html.parser') # parses the raw html of the response into a BeautifulSoup object
        div_tags = soup.find_all('div', class_='Left', style='width: 32.5%;') # retrieves all <div> tags that contain links to different countries' global indicators.
        
        for section, div_tag in enumerate(div_tags, start=1):
            section_name = "Section_" + str(section) # creates the name of the current section's folder (Section_1, Section_2, ...)
            
            li_tags = div_tag.find("ul").find_all("li") # uses the "ul" and "li" html tags to find the list of countries with their respective names and hyperlinks

            for li_tag in li_tags:
                
                folder_name = li_tag.text.strip() # extracts the name of each subfolder in the current section, without surrounding whitespace
                link_tag = li_tag.find("a") # gets the content of the "a" html tag from within the "li" tag
                hyper_link = link_tag.get("href") # gets the hyper link from the link_tag (i.e a tag)
                hyper_link = "https://data.un.org/" + hyper_link
                # print(hyper_link)
                
                # try-except block to handle any unexpected error that might occur
                try:
                    ech_hyper_lnk_response = requests.get(hyper_link) # sends a request to the website and gets a response, 200 means success
                    ech_hyper_lnk_response.raise_for_status() # raises an HTTPError if the response code was unsuccessful
                    
                except requests.exceptions.RequestException as e:
                    print(f"Failed to get response from the website: {e}") # prints the error that occurred
                    continue # skips this page so that a missing or stale response is never parsed
                
                ech_webpage_soup = BeautifulSoup(ech_hyper_lnk_response.text, 'html.parser') # gets the raw response in html for each subwebpage 
    
                # conditional to check whether the current page contains any tables
                # NOTE: From inspecting the website, the "details" tag is what holds each table
                if ech_webpage_soup.find_all("details"):            
                    total_no_of_tables = len(ech_webpage_soup.find_all("details")) # total number of tables on the current page
                    
                    # for loop to iterate through all tables and extract only the tables that are NOT empty
                    for table_index in range(total_no_of_tables):
                        
                        # conditional statement to check that the table is NOT empty
                        if not self.get_into_df(table_index, ech_webpage_soup).empty:
                            self.convert_df2_csv(section_name, folder_name, table_index, ech_webpage_soup) # writes the current table dataframe to a csv file; empty tables are skipped
                        
                else:
                    print("There's no tabular data in this website. Please provide another link.")
                
       

#### Note:
- The "get_all_tables" method initiates the whole web scraping process by getting the response from the website and creating a soup object.
- Because of function chaining, the "get_all_tables" method writes each table to a CSV file by calling only the "convert_df2_csv" method, which in turn calls the methods that locate, parse, and name the table.
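Inside "get_all_tables", the URL of each country page is built by prefixing `https://data.un.org/` to the `href` taken from the link tag. As a hedged alternative, the standard library's `urljoin` resolves relative, root-relative and absolute links in the same way. The helper name `absolute_link` and the example path `en/iso/ng.html` are assumptions made for illustration, not values taken from the website.

```python
from urllib.parse import urljoin

BASE_URL = "https://data.un.org/"  # base URL used by the script when building page links


def absolute_link(href, base=BASE_URL):
    """Resolve an href taken from an <a> tag against the site's base URL (hypothetical helper)."""
    return urljoin(base, href)


# The path below is a made-up example of what an href might look like.
print(absolute_link("en/iso/ng.html"))                      # relative path
print(absolute_link("/en/iso/ng.html"))                     # path with a leading slash
print(absolute_link("https://data.un.org/en/iso/ng.html"))  # already absolute
```

All three calls resolve to the same address, whereas plain string concatenation would produce a double slash in the second case and a malformed URL in the third.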

### Creating the Website Extraction Class Object

In [79]:
# creates an instance of the WebsiteDataExtraction class 
extraction = WebsiteDataExtraction()

# calls the get_all_tables() method
extraction.get_all_tables()