<div style="width: 100%; clear: both;">
<div style="float: left; width: 50%;">
<img src="http://www.uoc.edu/portal/_resources/common/imatges/marca_UOC/UOC_Masterbrand.jpg" align="left">
</div>
<div style="float: right; width: 50%;">
<p style="margin: 0; padding-top: 22px; text-align:right;">M2.852 · Tipología y ciclo de vida de los datos · PRA1</p>
<p style="margin: 0; text-align:right;">2022-2 · Máster universitario en Ciencia de datos (Data science)</p>
<p style="margin: 0; text-align:right; padding-bottom: 100px;">Estudios de Informática, Multimedia y Telecomunicación</p>
</div>
</div>
<div style="width:100%;">&nbsp;</div>


# PRA1: Web scraping - Testing file

## 1. Settings

In this first section we describe the main libraries used in the notebook and why we need them:

- `requests`:
	Python library that allows the user to send HTTP/1.1 requests easily (POST, GET, PUT, etc.).
	It is used to get the main content of the URLs (GET request).
 
 
- `builtwith`:
	Python library that detects the technologies used by a website (Apache, jQuery, WordPress),
	its servers and other relevant information.
	It is applied to detect the technology used to build the website, as this
	determines the Web Scraping approach that needs to be applied.
    
    
- `beautifulsoup4`:
	Python library to scrape information from web pages easily through their HTML or XML files.
	It is the main library for scraping all the information from the main URL.
    
    
- `python-whois`:
	Python library that produces parsed WHOIS data for a given domain and is able to extract data for all
	the popular TLDs (com, org, net, …). It also queries a WHOIS server directly instead of
	going through an intermediate web service.
	This library lets us know the owner of the website we want to scrape and look for any
	scraping blockers.
    
    
- `re`:
	Python standard-library module for searching text with regular expressions.
	It will help us extract the relevant information from the HTML text.
    
    
- `pandas`:
	Python package that provides fast, flexible, and expressive data structures designed to make working with
	"relational" or "labeled" data easy and intuitive.
	This package is fundamental to the final steps of this project, where the final dataset is created
	and exported.
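
As a minimal sketch of how `re` can extract a value from a piece of text, the snippet below looks for a ranking position after the word "World", the same idea the scraper applies later to the text of a page. The sample sentence and the helper name `ranking_position` are assumptions made for illustration; the real text on the website may be worded differently.

```python
import re

def ranking_position(text):
    """Return the first standalone 1- or 2-digit number after 'World', or None."""
    if "World" not in text:
        return None
    # Keep only the text that follows the first occurrence of "World"
    tail = text.split("World", 1)[1].strip()
    match = re.search(r"\b(\d{1,2})\b", tail)
    return match.group(1) if match else None

# Hypothetical sentence, not copied from the website
sample = "Ranking in the World: 12 out of 90 countries"
print(ranking_position(sample))
```

Returning `None` when nothing matches lets the caller decide how to handle a page whose text does not follow the expected pattern.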

### 1.1 Installations

If any of the aforementioned libraries is not installed on the user's machine, this piece of code will install them:

In [None]:
!pip install requests
!pip install builtwith
!pip install beautifulsoup4
!pip install python-whois
!pip install pandas

### 1.2 Imports

Once the libraries are all installed in the machine, it is time to import them to this notebook:

In [1]:
import requests
import builtwith
import whois
from bs4 import BeautifulSoup
import re
import pandas as pd

## 2. Information about the main URL

In [None]:
# Let's check the technologies of the webpage we want to scrape
builtwith.parse('https://www.expatistan.com/cost-of-living/country/ranking')

In [None]:
# Let's download the content of the webpage
page = requests.get("https://www.expatistan.com/cost-of-living/country/ranking")

In [None]:
# Let's see the owner of the domain
print(whois.whois("expatistan.com"))

In [None]:
# Let's analyse the structure of the HTML code
# (the parser is named explicitly to avoid the "no parser specified" warning)
soup = BeautifulSoup(page.content, "html.parser")
#print(soup.prettify())

## 3. Web Scraping 

### 3.1 Creation of the main classes

In [13]:
class ExpatistanScraper():
    """
    Class to carry out the Web Scraping of the Country links from the original URL 
    www.expatistan.com
    ...

    Private Methods
    ---------------
        __init__():
            Constructor of the class.
        __get_HTML(url):
            Returns the parsed HTML code of the given url
        __get_countries_links(html):
            Returns all the available links related to the Countries in the main page
        __get_ranking_pos(html):
            Returns the ranking position of a specific Country in the Countries' Ranking
        __scraping_single_country(url, country):
            Extracts the relevant information of a single Country link from its HTML file
        __saving():
            Creates a Pandas dataframe and saves it as a CSV dataset
    
    Public Methods
    --------------
        scraping():
            Main function of the class that starts the web scraping  
    """
    
    def __init__(self):
        """
        Constructs all the necessary attributes for the Expatistan Web Scraper.
            
        """
        self.original_url = "https://www.expatistan.com/cost-of-living/country/ranking"
        
        # As a first instance, the dataset is stored as a dictionary of lists (one list per column)
        self.dataset =  {
            "Ranking position": [],
            'Country': [],
            'Category': [],
            'Items': [],
            'Original Currency': [],
            'Original Currency Value': [],
            'Exchanged Currency': [],
            'Exchanged Currency Value': []
        }
         
            
    def __get_HTML(self, url):
        '''
        Returns the HTML file of a webpage by its link

        Parameters
        ----------
            self (class): Instance of the class that is invoking the method
            url (str): link of the webpage

        Returns
        -------
            HTML file of the webpage
        '''
        # First, we need to request the contents of the webpage
        page = requests.get(url)
        
        # Then we return its parsed HTML, naming the parser explicitly
        return BeautifulSoup(page.content, "html.parser")
        
        
    def __get_countries_links(self, html):
        """
        Returns a list of the links to be scraped from the main webpage

        Parameters
        ----------
            self (class): Instance of the class that is invoking the method
            html (BeautifulSoup): parsed HTML code of the original page

        Returns
        -------
            List of string values which are the links to the Countries' pages

        """
        # From the HTML file, we collect all the <td> tags that have a class = "country-name"
        td_tags = html.find_all('td', {"class": "country-name"})
        
        # Then, for every <td> tag, we get the hyperlink in its <a> tag and add the extra currency reference
        countries_links = [td.find('a').get('href') + "?currency=EUR" for td in td_tags]
        return countries_links
    
    
    def __get_ranking_pos(self, html):
        """
        Returns the ranking position of a particular Country

        Parameters
        ----------
            self (class): Instance of the class that is invoking the method
            html (BeautifulSoup): parsed HTML code of the Country's page

        Returns
        -------
            Number of the position in string format, or None if it is not found

        """
        # First, we get all <li> tags that have a class = "key-point"
        li_tags = html.find_all('li', {'class': 'key-point'})
        
        # Default value, so the method does not fail when no position is found
        pos = None
        
        # For every <li> tag
        for li in li_tags:
            # If the word "World" is in its text
            if 'World' in li.text:
                # We split the whole text by the word "World", keeping the side after it
                text = li.text.split("World", 1)[1].strip()
                # From the remaining text we get the first standalone 1- or 2-digit number
                match = re.search(r'\b(\d{1,2})\b', text)
                if match:
                    pos = match.group(1)
                
        return pos

    
    def scraping(self):
        """
        Main method of the class that starts the Web Scraping of the main webpage

        Parameters
        ----------
            self (class): Instance of the class that is invoking the method

        """
        # 1. Get the original HTML file using Beautiful Soup
        html = self.__get_HTML(self.original_url)
        
        # 2. Get all the Country links to scrape 
        country_urls = self.__get_countries_links(html)
        
        # 3. For each country, let's scrape all the information
        for url in country_urls:
            # From the link, we get the Country name, i.e. the part right after "country/"
            country = re.search(r"country/([^/?]*)", url).group(1)
            country = re.sub("-", " ", country).title()
            #print("Scraping country " + country)
            self.__scraping_single_country(url, country)

        # 4. Finally, we save all the information in a CSV file
        print("\nScraping finished!")
        self.__saving()
            
        
    def __scraping_single_country(self, url, country):
        """
        Runs the Web Scraping of a particular Country URL

        Parameters
        ----------
            self (class): Instance of the class that is invoking the method
            url (str): link to the Country's webpage
            country (str): name of the Country to be scraped

        """
        # First we need to get the HTML file for this link
        country_html = self.__get_HTML(url)
        
        # Then we get the Ranking position of the Country
        pos = self.__get_ranking_pos(country_html)

        # All the information is under <tr> tags, let's find them all
        all_tr_tags = country_html.find_all('tr')

        # But not all of them are valuable to the project, so we need to filter them
        # The last one regards the settings of the view, not information
        all_tr_tags = all_tr_tags[:-1]

        # We build a new list instead of popping while iterating, because removing
        # items during the loop makes it skip the element after each removal
        tr_tags = []
        for tr in all_tr_tags:
            # We get the first <td> and/or <th> tag that the <tr> tag holds
            td = tr.find("td")
            th = tr.find("th")

            # Then we check if it is one of the non-valuable types of row by its attributes
            # If so, we leave it out
            if td and td.get("colspan") == "4":
                continue
            if th and th.get("class") == ["ranking"]:
                continue
            # Some rows hold a hyperlink tag, <a>, to city pages inside their first <td>;
            # those are left out as well
            if td and td.find("a") is not None:
                continue

            tr_tags.append(tr)

        # Setting the common variables for the scraping
        current_category = ""
        current_orig_currency = ""
        
        # For every <tr> tag
        for tr in tr_tags:
            # If the tag has the class = "categoryHeader"
            if "categoryHeader" in tr.get("class", []):
                # We get the first <th> tag 
                first_th_tag = tr.find("th")
                # And extract its text as the Category of the Dataset's row
                current_category = first_th_tag.text
                
            else:
                # If not, it means it has <td> tags and we retrieve them all
                td_tags = tr.findAll("td")
                
                # Depending on how many <td> tags there are, we scrape different information
                # 2 tags means that we can retrieve the currency being used in the Country
                if len(td_tags) == 2:
                    currency = tr.find("td")
                    current_orig_currency = currency.text

                # 3 tags means we are scraping a European country that uses the Euro as its currency,
                # so the exchanged currency is going to be the same one
                elif len(td_tags) == 3: 
                    self.dataset['Ranking position'].append(pos)
                    self.dataset['Country'].append(country)
                    self.dataset['Category'].append(current_category)
                    self.dataset['Items'].append(td_tags[1].text.strip())
                    self.dataset['Original Currency'].append("EUR")
                    self.dataset['Original Currency Value'].append((td_tags[2].text.strip()))
                    self.dataset['Exchanged Currency'].append("EUR")
                    self.dataset['Exchanged Currency Value'].append((td_tags[2].text.strip()))
                
                # 4 tags means we are scraping a Country whose currency is not the Euro,
                # so we scrape its own currency and its exchanged value in Euros
                elif len(td_tags) == 4:
                    self.dataset['Ranking position'].append(pos)
                    self.dataset['Country'].append(country)
                    self.dataset['Category'].append(current_category)
                    self.dataset['Items'].append(td_tags[1].text.strip())
                    self.dataset['Original Currency'].append(current_orig_currency)
                    self.dataset['Original Currency Value'].append((td_tags[2].text.strip()))
                    self.dataset['Exchanged Currency'].append("EUR")
                    self.dataset['Exchanged Currency Value'].append((td_tags[3].text.strip()))
                    
            
    def __saving(self):
        """
        Creates a Pandas Dataframe from the scraped data to save it as a CSV dataset

        Parameters
        ----------
            self (class): Instance of the class that is invoking the method

        """
        # Let's create a Pandas Dataframe with the obtained dataset
        expatistan_df = pd.DataFrame(self.dataset)
        
        # Now we save it as a CSV file with no index column
        expatistan_df.to_csv("expatistan.csv", index = False)
        
        print("\nDataset saved as CSV file!")
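
When rows are filtered before extraction, as in the row-filtering step of `__scraping_single_country`, one detail is worth keeping in mind: removing items from a list while iterating over it makes the loop skip the element that follows each removal. The sketch below illustrates this with hypothetical row labels (plain strings, not real `<tr>` tags) and compares it with building a new list.

```python
# Hypothetical row labels standing in for <tr> tags
rows = ["header", "header", "data-1", "note", "data-2"]
unwanted = {"header", "note"}

# Popping while iterating: the element right after each removed one is skipped
popped = list(rows)
for i, row in enumerate(popped):
    if row in unwanted:
        popped.pop(i)

# Building a new list checks every element exactly once
filtered = [row for row in rows if row not in unwanted]

print(popped)    # ['header', 'data-1', 'data-2']
print(filtered)  # ['data-1', 'data-2']
```

The second `"header"` survives in `popped` because it moved into the position the loop had already visited.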


In [None]:
scraper = ExpatistanScraper()
scraper.scraping()
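
The dataset is kept as a dictionary with one list per column, which only converts cleanly to a table when every list has the same length. The sketch below shows that structure being written as CSV text. It is an illustration under stated assumptions: it uses the standard-library `csv` module instead of pandas so it has no third-party dependency, the helper name `dataset_to_csv` is hypothetical, and the sample rows are invented, not scraped values.

```python
import csv
import io

# Invented sample data with a subset of the dataset's columns
dataset = {
    "Country": ["Examplia", "Examplia"],
    "Items": ["Basic lunch", "Monthly rent"],
    "Exchanged Currency": ["EUR", "EUR"],
    "Exchanged Currency Value": ["10", "800"],
}

def dataset_to_csv(columns):
    """Write a dict of equally long column lists as CSV text."""
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError("All columns must have the same number of rows")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns.keys())            # header row
    writer.writerows(zip(*columns.values()))   # one row per position in the lists
    return buffer.getvalue()

csv_text = dataset_to_csv(dataset)
print(csv_text)
```

The length check mirrors why every branch of the scraper appends to all columns at once: a column that falls out of step would make the table impossible to build.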