# <h1 style='Text-align: center;'>**Web Scraping**</h1>

`Created by: Erick Eduardo Robledo Montes`

---

<p style='Text-align: justify;'><i>Description:</i> This notebook scrapes the Mexican government's open data portal, located at https://datos.gob.mx/, using the Python libraries BeautifulSoup and Selenium. Selenium drives the browser and loads each page of search results, while BeautifulSoup parses the resulting HTML to extract each dataset's title and file formats. </p>

*Link: [https://datos.gob.mx/](https://datos.gob.mx/)*
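
To make the extraction step concrete before the full class, here is a minimal sketch that runs BeautifulSoup on a small, made-up HTML snippet instead of the live portal. The class names `dataset-item` and `dataset-format` are the ones the scraper below searches for; the sample markup and the helper name `extract_rows` are illustrative assumptions, not the portal's actual page structure.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for one page of search results (an assumption, not real portal HTML).
SAMPLE_HTML = """
<div class="dataset-item">
  <h3>Dataset A</h3>
  <span class="dataset-format">CSV</span>
  <span class="dataset-format">JSON</span>
</div>
<div class="dataset-item">
  <h3>Dataset B</h3>
</div>
"""


def extract_rows(html):
    """Return one (title, file_format) pair per format found in each dataset block."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.find_all("div", class_="dataset-item"):
        heading = item.find("h3")
        title = heading.get_text(strip=True) if heading else "No Title."
        formats = [
            span.get_text(strip=True)
            for span in item.find_all("span", class_="dataset-format")
        ]
        # A dataset without any format still yields one row, with a placeholder value.
        for file_format in formats or ["No Format."]:
            rows.append((title, file_format))
    return rows


rows = extract_rows(SAMPLE_HTML)
for title, file_format in rows:
    print(title, "|", file_format)
```

Each dataset yields one row per file format, so a title can appear several times; this is why the scraper later counts unique titles rather than rows.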

In [None]:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import WebDriverException
import pandas as pd
import re

## Classes
---
* The `DataScraper` class is responsible for scraping data from the Mexican government's open data portal. It defines the following methods:

    - The `__init__` method initializes the class: it sets the output file name, the portal URL, the search keyword, and the lists that collect the results, and it opens the Selenium web driver.

    - The `scrape_data` method extracts the title and file formats of each dataset in the search results. It uses the Selenium web driver to load each page of results and the BeautifulSoup library to parse the page's HTML, then stores the rows in a pandas DataFrame and saves them to a CSV file.

    - The `display_stats` method prints the number of unique dataset titles and unique file formats that were scraped.

    - The `close_driver` method closes the Selenium web driver instance and ends the scraping process.
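
Moving through the pages of results comes down to requesting one URL per page. The sketch below builds those URLs with the standard library only. The `busca/dataset` path and the `q` and `page` query parameters are the ones used by the class; the helper names `build_search_url` and `iter_search_urls`, the keyword `salud`, and the assumption that `page=1` returns the first page are illustrative.

```python
from urllib.parse import urlencode, urljoin

HOME_LINK = "https://datos.gob.mx/"


def build_search_url(keyword, page=1):
    """Return the search URL for `keyword` at the given 1-based page number."""
    # urlencode escapes spaces and accented characters in the keyword.
    query = urlencode({"q": keyword, "page": page})
    return urljoin(HOME_LINK, "busca/dataset") + "?" + query


def iter_search_urls(keyword, page_count):
    """Yield the URL of every results page, from page 1 to `page_count`."""
    for page in range(1, page_count + 1):
        yield build_search_url(keyword, page)


urls = list(iter_search_urls("salud", 3))
for url in urls:
    print(url)
```

Building each URL from the page number keeps the loop independent of the "next" button in the page, which may be missing on the last page of results.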

In [None]:
from selenium.webdriver.chrome.service import Service


class DataScraper:
    """
    This class uses Selenium and BeautifulSoup to scrape data from the website "https://datos.gob.mx/".
    It navigates to the website, searches for datasets with a specific keyword, and extracts the title and file format of each dataset.
    The extracted data is then stored in a pandas dataframe and saved to a csv file.
    """
    def __init__(self):
        """
        Initializes the class variables and opens a webdriver for Selenium.
        """
        self.data_name = "datos_gob.csv"
        # Path to the ChromeDriver executable (no extra quotes inside the string).
        self.path = "../chromedriver/chromedriver.exe"
        # Selenium 4 expects the driver path to be wrapped in a Service object.
        self.driver = webdriver.Chrome(service=Service(self.path))
        self.home_link = 'https://datos.gob.mx/'
        self.search_kw = ''
        self.search_url = '/busca/dataset?q='+self.search_kw+'&'
        self.search_title = []
        self.search_format = []
        self.df = pd.DataFrame(columns=["dataset_title", "file_type"])
        
    def scrape_data(self):
        """
        Navigates to the website, iterates through the pages of search results, and extracts the title and file format of each dataset.
        """
        base_url = self.home_link.rstrip('/') + '/busca/dataset?q=' + self.search_kw
        self.driver.get(base_url)
        page = BeautifulSoup(self.driver.page_source, 'html.parser')

        # The second-to-last pagination link is taken to hold the last page number
        # (the last link is the "next" button). Without pagination there is one page.
        pagination = page.find('div', attrs={'class': 'pagination'})
        links = pagination.find_all('a') if pagination else []
        pg_amount = int(links[-2].get_text(strip=True)) if len(links) >= 2 else 1

        for page_number in range(1, pg_amount + 1):
            # The first page is already loaded; later pages are requested by URL.
            if page_number > 1:
                self.driver.get(base_url + '&page=' + str(page_number))
                page = BeautifulSoup(self.driver.page_source, 'html.parser')

            for search in page.find_all('div', attrs={'class': 'dataset-item'}):
                heading = search.find('h3')
                title = heading.get_text(strip=True) if heading else 'No Title.'
                file_formats = search.find_all('span', attrs={'class': 'dataset-format'})
                if file_formats:
                    # One row per file format, so both lists always stay the same length.
                    for file_format in file_formats:
                        self.search_title.append(title)
                        self.search_format.append(file_format.get_text(strip=True))
                else:
                    self.search_title.append(title)
                    self.search_format.append('No Format.')

        self.df = pd.DataFrame({
            "dataset_title": self.search_title,
            "file_type": self.search_format
            })
        self.df = self.df.drop_duplicates()
        self.df.to_csv(self.data_name, index=False, header=True, encoding='utf-8-sig')

    def display_stats(self):
        """
        Prints the number of unique titles and file formats found in the search results.
        """
        num_unique_titles = self.df["dataset_title"].nunique()
        print(f"Unique dataset titles: {num_unique_titles}")
        num_unique_files = self.df["file_type"].nunique()
        print(f"Unique file formats: {num_unique_files}")

    def close_driver(self):
        """
        Closes the browser and ends the webdriver session.
        """
        self.driver.quit()

## Scraping Data with the `DataScraper` Class
---
* The following code creates an instance of the `DataScraper` class by calling its constructor, which also opens the browser.

* It then calls the `scrape_data` method on the `data_scraper` object to start the scraping process. This method uses the Selenium web driver to load each page of search results and the BeautifulSoup library to parse the HTML and extract each dataset's title and file formats.

* After the scraping is done, it calls the `display_stats` method on the `data_scraper` object to print the number of unique dataset titles and unique file formats found.

* Finally, it calls the `close_driver` method on the `data_scraper` object to close the Selenium web driver instance and end the scraping process.

* In short, the script scrapes the portal's search results, saves them to a CSV file, prints summary statistics, and closes the web driver.
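
To show what the two numbers printed by `display_stats` mean, here is a small pandas-free sketch over hand-written sample rows. The class itself uses `drop_duplicates()` and `nunique()` on a DataFrame; the helper name `summarize` and the sample data are illustrative assumptions, and plain sets stand in for the pandas calls.

```python
def summarize(rows):
    """Return (unique_titles, unique_formats) for a list of (title, file_type) pairs."""
    unique_rows = set(rows)  # plays the role of drop_duplicates()
    titles = {title for title, _ in unique_rows}
    formats = {file_type for _, file_type in unique_rows}
    return len(titles), len(formats)


# Hypothetical scraped rows: one row per (dataset, format) pair.
rows = [
    ("Dataset A", "CSV"),
    ("Dataset A", "JSON"),
    ("Dataset A", "CSV"),  # repeated row, removed by de-duplication
    ("Dataset B", "CSV"),
    ("Dataset C", "No Format."),
]

num_titles, num_formats = summarize(rows)
print(f"Unique dataset titles: {num_titles}")
print(f"Unique file formats: {num_formats}")
```

Note that the placeholder `No Format.` counts as a distinct value, so the reported number of file formats can be one higher than the number of real formats.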

In [None]:
data_scraper = DataScraper()
try:
    data_scraper.scrape_data()
    data_scraper.display_stats()
finally:
    # Always release the browser, even if scraping raises an error.
    data_scraper.close_driver()