# Dataset sourcing

## API method

Salary information for the years 1997 to 2018 can be accessed through the IMSS API endpoint, which has the following structure:

```
http://datos.imss.gob.mx/api/action/datastore/search.json?resource_id={resource_id}
```

Where __resource_id__ is the identifier for the files stored for each year. 

The API `GET` method is not reliable, since some endpoints return blank JSON responses and some files have not been uploaded.
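
Given that unreliability, it can help to check each response before relying on it. The following is a minimal sketch using only the standard library. The helper names `build_api_url`, `is_blank` and `fetch_resource` are hypothetical, and nothing is assumed about the shape of a valid payload: an HTTP error, an empty body, invalid JSON or an empty JSON container is simply treated as "not available".

```python
import json
from urllib.request import urlopen

# Endpoint structure taken from the section above
API_TEMPLATE = (
    "http://datos.imss.gob.mx/api/action/datastore/search.json"
    "?resource_id={resource_id}"
)


def build_api_url(resource_id):
    """Fill the resource identifier into the endpoint template."""
    return API_TEMPLATE.format(resource_id=resource_id)


def is_blank(payload):
    """Return True for None and for empty dicts, lists or strings."""
    if payload is None:
        return True
    return isinstance(payload, (dict, list, str)) and len(payload) == 0


def fetch_resource(resource_id, timeout=30):
    """Return the decoded JSON payload, or None if it is missing or blank."""
    try:
        with urlopen(build_api_url(resource_id), timeout=timeout) as response:
            payload = json.loads(response.read().decode("utf-8"))
    except (OSError, ValueError):
        # OSError covers network and HTTP errors, ValueError covers bad JSON
        return None
    return None if is_blank(payload) else payload
```

A caller could loop over the known resource identifiers and fall back to the web scraping method for every year where `fetch_resource` returns `None`.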

## Web scraping method

Salary information for the years 2019 to the current year (2022) is not yet available through the API endpoint, so it has to be downloaded with a web scraper that accesses the following URL:

```
http://datos.imss.gob.mx/dataset/asg-{year}
```

Where __year__ corresponds to the target year.

After accessing the target URL, the next step is to gather the relative URLs of the 12 files for that year.

The HTML element for each target file follows this structure:

```
<a href="/dataset/asg{year}/resource/asg-{year}-{month}-{day}" class="heading" title="asg-{year}-{month}-{day}" property="dcat:accessURL">asg-{year}-{month}-{day}<span class="format-label" property="dc:format" data-format="csv">csv</span></a>
```

Where __year__ is the target year, __month__ is the target month of that year, and __day__ is the last day of that month.
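
As an illustration of the element structure described above, the sketch below collects the relative URLs from anchors with the `heading` class, and separately builds the resource name a given month would be expected to have. It uses only the standard library. The names `ResourceLinkParser`, `extract_resource_links` and `resource_slug` are hypothetical, and the zero-padded two-digit month and day are an assumption that should be checked against the real page.

```python
import calendar
from html.parser import HTMLParser


class ResourceLinkParser(HTMLParser):
    """Collect hrefs of <a class="heading"> elements that point to a resource."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        href = attributes.get("href") or ""
        if "heading" in classes and "/resource/" in href:
            self.links.append(href)


def extract_resource_links(page_html):
    """Return the relative resource URLs found in the page, in document order."""
    parser = ResourceLinkParser()
    parser.feed(page_html)
    return parser.links


def resource_slug(year, month):
    """Build asg-{year}-{month}-{day}, where day is the last day of the month.

    Assumption: month and day are zero-padded to two digits.
    """
    last_day = calendar.monthrange(year, month)[1]
    return f"asg-{year}-{month:02d}-{last_day:02d}"
```

Comparing the extracted links against the expected names is one way to notice months that are missing from a year page.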

After joining the base URL with the relative URL extracted from the HTML elements, the result is the following:

```
http://datos.imss.gob.mx/dataset/asg{year}/resource/asg-{year}-{month}-{day}
```
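
The joining step can be sketched with `urllib.parse.urljoin` from the standard library. The function name `resource_page_urls` is hypothetical, and the relative URLs in the usage lines are illustrative values rather than links confirmed to exist on the site.

```python
from urllib.parse import urljoin


def resource_page_urls(year_page_url, relative_links):
    """Turn root-relative hrefs into absolute URLs on the same host."""
    # A href starting with "/" replaces the whole path of the page it was found on
    return [urljoin(year_page_url, href) for href in relative_links]


# Illustrative values only
year_page = "http://datos.imss.gob.mx/dataset/asg-2021"
links = ["/dataset/asg2021/resource/asg-2021-01-31"]
absolute_links = resource_page_urls(year_page, links)
```

Joining against the page URL instead of concatenating strings avoids doubled or missing slashes and also handles hrefs that are already absolute.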

On that page, the next step is to locate the following HTML element:

```
<a href="/node/{id}/download" class="btn-primary btn"><i class="icon-large icon-download"></i> Descargar</a>
```

Where __id__ is the number of the CSV file, usually a four-digit number. Joining this relative URL with the base URL gives the final path for downloading the required CSV.
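
One possible way to pull that link out of the page source is a regular expression that matches the `/node/{id}/download` pattern. This is a sketch under the assumption that the href appears in the HTML exactly as shown above, in double quotes. `find_download_url` is a hypothetical helper name.

```python
import re
from urllib.parse import urljoin

BASE_URL = "http://datos.imss.gob.mx"

# Matches href="/node/<digits>/download" and captures the relative URL
DOWNLOAD_HREF = re.compile(r'href="(/node/\d+/download)"')


def find_download_url(page_html):
    """Return the absolute download URL, or None if the page has no such link."""
    match = DOWNLOAD_HREF.search(page_html)
    if match is None:
        return None
    return urljoin(BASE_URL, match.group(1))
```

Matching on the `/download` suffix, rather than on `node` alone, keeps other links under `/node/{id}/` on the same page from being picked up by mistake.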

Each of these CSV files is around 400 megabytes. Assuming all 12 monthly files for each of the 26 years (1997 to 2022) have the same size, the total would be roughly 125 gigabytes (26 × 12 × 400 MB).
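
With files of this size, it is safer to stream each download to disk than to hold it in memory. The sketch below reproduces the size estimate and shows a chunked download using only the standard library. `estimate_total_gb` and `download_csv` are hypothetical helper names, and the 1 MiB chunk size is an arbitrary choice.

```python
import shutil
from urllib.request import urlopen


def estimate_total_gb(years=26, files_per_year=12, mb_per_file=400):
    """Total size in gigabytes, assuming every file has the same size."""
    return years * files_per_year * mb_per_file / 1000


def download_csv(url, destination, chunk_size=1024 * 1024):
    """Copy the response body to a local file one chunk at a time."""
    with urlopen(url) as response, open(destination, "wb") as out_file:
        shutil.copyfileobj(response, out_file, chunk_size)
```

Because `shutil.copyfileobj` reads and writes in fixed-size chunks, memory use stays close to `chunk_size` regardless of how large the CSV is.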

```python
# Import Selenium, Splinter, BeautifulSoup, Pandas and supporting libraries
import datetime as dt

import pandas as pd
import requests
from bs4 import BeautifulSoup as soup
from selenium import webdriver
from splinter import Browser
from webdriver_manager.chrome import ChromeDriverManager
```

```python
def scrape_year(target_year, target_month='All'):
    # Create a Chrome options instance
    options = webdriver.ChromeOptions()

    # Download preferences (Chrome prefs take boolean values)
    prefs = {"download.default_directory": "/IMSS_Files/",
             "download.directory_upgrade": True,
             "download.prompt_for_download": False,
             "disable-popup-blocking": False}

    # Attach the preferences to the options instance
    options.add_experimental_option("prefs", prefs)

    # Install the Chrome driver and keep the path to its executable
    executable_path = {'executable_path': ChromeDriverManager().install()}

    # Start the automated browser; headless is False so the run can be watched while debugging
    browser = Browser('chrome', **executable_path, headless=False, options=options)

    try:
        scrape_month(browser, target_year, target_month)
    finally:
        # Close the browser even if scraping fails
        browser.quit()
```

```python
def scrape_month(browser, year, month):

    # Set the target url according to the year argument
    target_url = f'http://datos.imss.gob.mx/dataset/asg-{year}'

    # Access the url for the target year
    browser.visit(target_url)

    if month == 'All':

        # Store the links matching asg-{year} as strings to avoid stale element errors
        links_for_each_month = [
            str(link['href'])
            for link in browser.links.find_by_partial_text(f'asg-{year}')
        ]

        # Loop through the resource page of each month
        for month_link in links_for_each_month:

            # Visit the new url
            browser.visit(month_link)

            # Keep the last link whose href contains 'node'; stays None if nothing matches.
            # A page may hold several 'node' links (for example /api next to /download),
            # so the last match is not guaranteed to be the csv download link.
            csv_link = None
            for node_link in browser.links.find_by_partial_href('node'):
                csv_link = str(node_link['href'])

            # Debug print of the link that was found
            print(csv_link)

    else:

        # Scraping a single month is not implemented yet
        return ''
```

```python
scrape_year(2021)
```

```
http://datos.imss.gob.mx/node/1113/api
http://datos.imss.gob.mx/node/1116/api
http://datos.imss.gob.mx/node/1118/api
http://datos.imss.gob.mx/node/1120/api
http://datos.imss.gob.mx/node/1125/api
http://datos.imss.gob.mx/node/1129/api
http://datos.imss.gob.mx/node/1131/api
http://datos.imss.gob.mx/node/1133/api
http://datos.imss.gob.mx/node/1135/api
http://datos.imss.gob.mx/node/1137/api
http://datos.imss.gob.mx/node/1142/api
http://datos.imss.gob.mx/node/1146/api
```
