<h1 align="center">Data Science 1: Foundations | Final Report</h1> 
&ensp;

<h3 align="center">Alvaro Jose Altamirano Montoya</h3> 

<h3 align="center">Word Count: 2720</h3> 


#### I. Introduction

The advent of large-scale, non-traditional data sources, combined with a range of data science techniques including Machine Learning (ML), has made it possible to tackle previously unsolvable development economics concerns. Technologies such as Machine Learning, however, come with new hurdles, mainly related to each project’s customization requirements and governance considerations, but most importantly related to their usefulness as a guide for public policy. In a recent piece analyzing the Latin American and Caribbean experience of using job vacancies data (Altamirano & Amaral, 2020), we argued that beyond technicalities (data, methods, models), it is important to have a clear objective for the information a project will produce and the decisions it will help make, whether at the level of the individual or for public policy. This report constitutes the first exploration of a dataset of 4 million observations of job ads for 18 countries in Latin America and the Caribbean.

I have long-run and short-run objectives for this online job boards data project. In the long run, I plan to focus my research on exploring whether the Covid-19 crisis will accelerate technological change as seen through vacancies information, developing metrics for the concentration of firms and occupations, and examining the regional evolution of the skills content and occupational distribution of postings. For this report in particular, I ask what the determinants of offered salaries in job vacancies are for two Latin American countries: Mexico and Guatemala. For this task I rely mainly on a battery of Machine Learning regression models that predict the natural logarithm of advertised monthly salaries in each country for the years 2020 and 2021.

#### II. Problem Statement and Background

Traditional sources of information regarding labor markets in developing countries are scarce and limited. Household and labor market surveys do not allow the different characteristics of sectoral employment to be disaggregated in a statistically representative way. For the most part, these surveys are designed to study population aggregates of employment, unemployment, informality, or poverty, and they do not allow the occupations or industries of each country to be studied in a more granular fashion. Moreover, they usually do not include information about the skill and task content of jobs. In addition, there is a time lag of between one and two years in the publication of official surveys. Population censuses, in turn, are carried out every 10 years, and economic censuses, which vary in periodicity (every 5 or 10 years in most countries), are too sporadic to be continuously relevant for current public policy.

With these limitations in mind, in October 2019 I started web-scraping data on online job vacancies for 18 countries in Latin America. My main inspirations to start this project have been CEDEFOP’s Online Vacancy Analysis [Tool for Europe](https://www.cedefop.europa.eu/en/tools/skills-online-vacancies), Burning Glass-based [papers for the USA](https://papers.ssrn.com/sol3/Papers.cfm?abstract_id=3664265), and a few national projects on job portals data implemented by national statistics institutes (Mexico, [Ecuador](https://www.ecuadorencifras.gob.ec/documentos/web-inec/Bibliotecas/Revista_Estadistica/Elaboracion%20de%20estadisticas%20de%20vacantes.pdf), Chile). A showcase of my data is displayed on the Inter-American Development Bank’s coronavirus labor markets monitor [dashboard](https://observatoriolaboral.iadb.org/en/vacantes/). Figure 1 illustrates the cities in Latin America and the Caribbean for which my data have vacancy information, showing how most job ads in the region concentrate in urban areas and capital metropolitan regions.

<h3 align="center">Figure 1. Map representation of Latin America and the Caribbean cities included in the vacancies' regional dataset</h3>
<img src="/../assets/figure1_map.png" width="500" align = "center">


#### III. Data

The regional dataset currently comprises ~4 million unique observations for all 18 countries, coming from three job portals: [Computrabajo](https://www.computrabajo.com/), [Tecoloco](https://www.tecoloco.com/), and [Infojobs](https://www.infojobs.com.br/). It is important to mention that the greatest volume of observations is highly concentrated in Mexico, Colombia, Brazil, Chile, Venezuela, and Peru. I have been downloading new postings weekly and saving country-level datasets with an average of 15 variables per country, including a rich set of sociodemographic characteristics (age, gender, education, experience, etc.). Currently I have two years of weekly ingested relational CSV files (October 2019 - November 2021). The code for the web-scraping process was developed in Python 3, using BeautifulSoup, and it is documented in [this repository](https://github.com/AlvaroAltamiranoM/Webscrapping-job-vacancies-in-Latin-America). All scripts, data, and asset files (figures) are included in the following [GitHub repository](https://github.com/AlvaroAltamiranoM/PPOL564_Final_Project_Fall_2021).

This report uses a subset of the data for the comparison of model results. This initial exploration is based on data from two countries: Mexico and Guatemala. There are a few reasons for not using the whole dataset for now. In June 2021 I ran text analysis exercises to identify words associated with *telework* for a subset of the larger countries, and the processing for that exercise was extremely slow. I suspect that using the whole set of countries would require more time both to run and calibrate the models, and to provide readers with descriptive statistics at the country level. Therefore, I prefer to create the set-up to run these analyses next semester (2022 I), using AWS, Azure, or another cloud-computing platform.

Another consideration is the further work I have to do on cross-country comparisons, because some variables change across countries. Guatemala and Mexico are good candidates for a first exploration, since both have labor markets with similar levels of labor formality and skills requirements (both host a majority of small, informal, low-productivity firms), and the contrast between a very big and a very small dataset can shed light on differences across model results.

For Mexico the dataset consists of 1.1 million unique observations. For Guatemala the dataset contains approximately 50 thousand unique observations. For the deduplication process I relied on each posting's URL, which constitutes a natural unique identifier. The dependent variable is the natural logarithm of advertised monthly salaries. Figures 2a and 2b detail the percentage of missing values for the whole set of variables in each country, respectively.

<img src="/../assets/fig2a.png" width="800" align = "center">
<img src="/../assets/fig2b.png" width="800" align = "center">
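
The deduplication and missing-value summary described above can be sketched roughly as follows. This is a minimal illustration on toy data, not the project's actual script: the column names (`url`, `salary`, `age`) and the values are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Toy postings table; column names and values are illustrative assumptions.
postings = pd.DataFrame(
    {
        "url": ["site/a", "site/b", "site/a", "site/c"],
        "salary": [8000.0, np.nan, 8000.0, 12000.0],
        "age": [np.nan, np.nan, np.nan, 30.0],
    }
)

# Keep the first occurrence of each URL, treating it as the posting's unique identifier.
unique_postings = postings.drop_duplicates(subset="url", keep="first")

# Share of missing values per variable, in percent.
missing_pct = unique_postings.isna().mean().mul(100).round(1)
print(missing_pct)
```

Sorting `missing_pct` and plotting it as a bar chart would give a figure in the spirit of Figures 2a and 2b.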

#### IV. Analysis

Machine learning models can be used to classify (e.g., True or False, Poor or Non-poor) or to estimate mean predicted values (e.g., Sales, Income). I created regression models for the prediction of the natural logarithm of monthly wages with the aid of scikit-learn's Machine Learning Application Programming Interface (API). This Machine Learning regression analysis also builds upon broad data analytics methods learned during class.

The first step of the analysis involves data wrangling for the descriptive exploration. This task was performed with Python's built-in modules, and with popular data science libraries such as NumPy and pandas. Some of the methods used for the initial data wrangling include, among others: `groupby`, `to_datetime`, `str.extract`, `pivot_table`, and lambda transformations. Figures 3a and 3b are results of that data wrangling process, as they display the time series of the weekly flow of **new job ads** for both countries. These figures are useful for understanding the high volatility of labor demand during 2020 and 2021. However, a longer series (3 to 4 years) is still necessary to identify the seasonal and cyclical components of labor demand in Mexico and Guatemala. Overall, Mexico is the country with the greatest volume of vacancies, with a weekly average of about 4 thousand new postings (equivalent to a flow of about 15-20 thousand *new* job ads per month). Given the smaller size of its labor market, Guatemala has a more modest flow of new postings, approximately 800 new job ads posted each month.

<img src="/../assets/fig3a.png" width="800" align = "center">
<img src="/../assets/fig3b.png" width="800" align = "center">
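
A weekly series of new job ads like the one plotted above could be built along the following lines. This is only a sketch on toy data: `posted_at` is a hypothetical name for the posting-date column, and the weekly bins (weeks ending on Sunday) are an assumption.

```python
import pandas as pd

# Toy data; "posted_at" is a hypothetical name for the posting-date column.
ads = pd.DataFrame(
    {
        "url": [f"site/{i}" for i in range(6)],
        "posted_at": [
            "2020-03-02", "2020-03-03", "2020-03-08",
            "2020-03-09", "2020-03-10", "2020-03-17",
        ],
    }
)
ads["posted_at"] = pd.to_datetime(ads["posted_at"])

# Count new postings per calendar week (weekly bins ending on Sunday).
weekly_flow = ads.groupby(pd.Grouper(key="posted_at", freq="W"))["url"].count()
print(weekly_flow)
```

Calling `weekly_flow.plot()` would then draw the time series.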

The process of data wrangling continued with creating dummy variables from the categorical variables, transforming the dependent variable (advertised salaries) into natural logarithms, defining a threshold of 2 standard deviations for trimming outliers (which covers about 95% of the sample given the approximately normal distribution of our continuous variables), and filling missing values of the continuous variables with their mean. Following these activities, I built a pipeline of 11 different Machine Learning regression models: Linear Regression, Lasso, Elastic Net, Bayesian Ridge, LightGBM (LGBM), Gradient Boosting, SVR, XGBoost, CatBoost, SGD Regressor, and Kernel Ridge. This battery of models was nested inside a scikit-learn Pipeline that used the Mean Absolute Error (MAE), which measures the average absolute distance between fitted and observed values, as the scoring criterion for selecting the best model (`neg_mean_absolute_error`). The data were split into training and test sets using a standard 75/25 ratio.
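
The preprocessing steps listed above can be illustrated with a small sketch on synthetic data. The column names are invented for the example, and applying the 2-standard-deviation trim to log salaries is an assumption, since the text does not say which variables were trimmed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000

# Synthetic stand-in for the postings data; all column names are illustrative.
df = pd.DataFrame(
    {
        "salary": rng.lognormal(mean=9.0, sigma=0.5, size=n),
        "age": rng.normal(30, 5, size=n),
        "contract_type": rng.choice(["full_time", "part_time"], size=n),
    }
)
df.loc[rng.choice(n, size=100, replace=False), "age"] = np.nan

# Dependent variable: natural logarithm of the advertised monthly salary.
df["log_salary"] = np.log(df["salary"])

# Trim observations more than 2 standard deviations from the mean log salary
# (assumption: the trim is applied to the dependent variable).
z = (df["log_salary"] - df["log_salary"].mean()) / df["log_salary"].std()
trimmed = df.loc[z.abs() <= 2].copy()

# Fill missing values of continuous variables with the column mean.
trimmed["age"] = trimmed["age"].fillna(trimmed["age"].mean())

# Turn categorical variables into dummy (0/1) columns.
model_data = pd.get_dummies(trimmed, columns=["contract_type"], dtype=int)

print(f"share kept after trimming: {len(trimmed) / len(df):.3f}")
```

For a roughly normal variable, the share kept should be close to 95%, which is where the figure quoted in the text comes from.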

A generic functional form of the regressions is detailed in Equation 1:

$\log(W_i) = f(\text{Age}_i, \text{Gender}_i, \text{Schooling}_i, \text{Experience}_i, \text{ContractType}_i, \text{FirmSize}_i, \text{DateControl}_i, \text{Skills}_i, \text{TeleworkDummy}_i)$
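
One way to compare several candidate estimators for $f$ inside a single scikit-learn Pipeline is sketched below. It is not the project's actual code: the data are synthetic, only three scikit-learn estimators stand in for the full battery, the hyperparameter grids are arbitrary, and the scaling step is an assumption.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features and log-salary target, standing in for the real data.
X = rng.normal(size=(400, 5))
y = 9.0 + 0.3 * X[:, 0] - 0.2 * X[:, 1] + rng.normal(scale=0.1, size=400)

# 75/25 training/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The "model" step is a placeholder that the grid search swaps out.
pipe = Pipeline([("scale", StandardScaler()), ("model", LinearRegression())])

# One dictionary per candidate estimator, each with its own small grid.
param_grid = [
    {"model": [LinearRegression()]},
    {"model": [Lasso()], "model__alpha": [0.001, 0.01, 0.1]},
    {
        "model": [GradientBoostingRegressor(random_state=0)],
        "model__n_estimators": [50, 100],
    },
]

# Candidates are ranked by cross-validated (negative) mean absolute error.
search = GridSearchCV(pipe, param_grid, scoring="neg_mean_absolute_error", cv=5)
search.fit(X_train, y_train)

best_name = type(search.best_estimator_.named_steps["model"]).__name__
print(best_name, -search.best_score_)
```

Because scikit-learn maximizes scores, the MAE enters with a negative sign, and `-search.best_score_` recovers the cross-validated MAE of the winning model.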


#### V. Results

Before presenting the results for the models that best fitted the regression in both countries, it is important to briefly introduce a group of features that are unique to online job ads data. In particular, the availability of a 'description' column depicting the main functions and tasks required for each advertised job posting allows for the creation of additional skills-dummy features (via `pandas.Series.str.extract` and `pandas.get_dummies`) found in the pertinent literature. A relevant academic reference is Deming & Kahn's 2018 paper using data from online job vacancies in the U.S. These authors analyze the heterogeneity of skill requirements within occupations and across locations in the U.S. Their main definition of skills, displayed in Table 1 in the Annex, depends on strings that can be extracted from the text using pandas' `str.extract` method.

Two translated examples of such a descriptive feature, for Guatemala and Mexico, are the following:

Guatemala 2021 (May): **"Define the strategy for the implementation, compliance and updating of the management methodology for Money Laundering and Terrorism Financing Risks, in accordance with the provisions of the legislation on prevention of Money Laundering and Terrorism Financing in force and as notified by the Superintendency of Banks, through the Intendancy of Special Verification, in order to identify the level of risk Requirements Bachelor's Degree in Economic, Legal and Social Sciences or careers at the end 3-year experience in similar positions Solid knowledge in prevention issues of Money Laundering and Terrorism Financing Knowledge in risk management issues Competencies: Analytical thinking Strategic planning Leadership in team development and management."**

Mexico 2020 (August): **"Senior Executive of a multinational company (preferably Central America) Between 8 to 10 years of experience in the Regional Customer Service Area. Experience in total management of Customer Service Departments and their staff involved. Preferable with experience in business and multilevel companies and direct sales by catalog. We offer: Salary from Q12,0000 to Q16,000 plus benefits Hours from Monday to Friday Additional benefits Job growth Excellent pleasant environment We require; High level of English language. Bachelor of Business Administration or Industrial Engineering. Preferable with Master's Degree in Business Administration, Finance."**
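
Skill dummies of this kind can be derived from free-text descriptions roughly as follows. The keyword lists are illustrative assumptions only, not the lists in Deming & Kahn (2018) or the ones used in this project, and the descriptions are shortened toy examples.

```python
import re

import pandas as pd

# Toy job descriptions; the real input is the free-text 'description' column.
ads = pd.DataFrame(
    {
        "description": [
            "Analytical thinking, strategic planning and leadership in team management.",
            "Experience in customer service departments. High level of English.",
            "Knowledge of spreadsheets and computer software.",
        ]
    }
)

# Illustrative keyword patterns per skill category (assumed for this sketch).
skill_patterns = {
    "cognitive": r"(analytical|problem solving|research|statistics)",
    "people_management": r"(leadership|supervis|staff|team management)",
    "customer_service": r"(customer|client|sales)",
    "computer": r"(computer|spreadsheet|software)",
}

# A skill dummy equals 1 when any keyword of the category appears in the text.
for skill, pattern in skill_patterns.items():
    match = ads["description"].str.extract(pattern, flags=re.IGNORECASE, expand=False)
    ads[f"skill_{skill}"] = match.notna().astype(int)

print(ads.filter(like="skill_"))
```

For the original Spanish and Portuguese postings, the patterns would need to be written in those languages.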

The results in terms of best-model fit did not differ by country. For both countries, the best estimator across all searched parameters was an XGBRegressor (eXtreme Gradient Boosting model). For Guatemala, the model produced a Root Mean Square Error (transformed from the `mean_squared_error`) of 0.24 and 0.26 on the training and test data, respectively. For this same regression, R-squared values (measuring how much of the variance of our log salaries is explained by the model) were 0.43 and 0.33 in the training and test splits, respectively. For the much bigger dataset, Mexico, the best model resulted in Root Mean Square Errors of 0.32 and 0.33 in the training and test splits. For Mexico, R-squared coefficients were 0.35 and 0.30 for the training and test datasets.
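
As a reminder of how these two metrics are obtained, the sketch below computes them on a handful of made-up values. The numbers are purely illustrative and unrelated to the results reported above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy observed and predicted log salaries (illustrative numbers only).
y_true = np.array([9.0, 9.2, 8.8, 9.5, 9.1])
y_pred = np.array([9.1, 9.0, 8.9, 9.3, 9.1])

# RMSE is the square root of scikit-learn's mean squared error.
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

# R-squared: share of the variance of log salaries explained by the predictions.
r2 = r2_score(y_true, y_pred)

print(f"RMSE = {rmse:.3f}, R-squared = {r2:.3f}")
```

Since the target is in natural logarithms, an RMSE of 0.26 corresponds to a typical multiplicative error of about exp(0.26) ≈ 1.30 in salary levels, that is, roughly 30 percent.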

The regression results in terms of fitted versus actual values, along with a fitted regression line, are presented in Figures 4a and 4b (in natural logarithm units). Overall, we can say the model's fit is good, as the Root Mean Square Errors lie within what is considered an optimal range (0.2 to 0.5), along with considerably high R-squared values when taking into account that we are dealing with highly heterogeneous individual-level microdata (Wooldridge, 2010).

<img src="/../assets/fig4a.png" width="800" align = "center">
<img src="/../assets/fig4b.png" width="800" align = "center">

A more interesting type of Machine Learning post-estimation analysis, however, is the permutation analysis we studied this semester. For both models, the permutation importance object was created using 25 repetitions. The most important relationships between salary and other features in the data are illustrated in Figures 5a and 5b. For Mexico, this permutation exercise indicates that the main 6 features were age, experience, college education, high education, customer service skills, and computer skills. For Guatemala, by contrast, college education, age, month, work experience, and people management skills were the most important features of the model.

<img src="/../assets/fig5a.png" width="800" align = "center">
<img src="/../assets/fig5b.png" width="800" align = "center">
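
A permutation-importance exercise with 25 repetitions can be sketched as follows. This is a simplified illustration: the data are synthetic, `GradientBoostingRegressor` stands in for the selected XGBRegressor, and both the scoring metric and the use of the test split are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500

# Synthetic features; names are illustrative. "noise" is unrelated to the target.
X = pd.DataFrame(
    {
        "age": rng.normal(30, 5, size=n),
        "experience": rng.integers(0, 10, size=n),
        "noise": rng.normal(size=n),
    }
)
y = 8.0 + 0.03 * X["age"] + 0.05 * X["experience"] + rng.normal(scale=0.05, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# GradientBoostingRegressor stands in for the XGBRegressor selected in the report.
model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

# Shuffle each feature 25 times and record the average drop in the score.
result = permutation_importance(
    model,
    X_test,
    y_test,
    n_repeats=25,
    random_state=0,
    scoring="neg_mean_absolute_error",
)

importances = pd.Series(result.importances_mean, index=X.columns).sort_values(
    ascending=False
)
print(importances)
```

Features whose shuffling barely changes the score, such as the pure-noise column here, end up at the bottom of the ranking.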


#### VI. Discussion

A potentially positive aspect of new and disruptive technologies is that they provide access to massive, real-time information. This *new future* promises to complement traditional sources such as household surveys and population censuses with a myriad of new sources of information derived from social networks and the internet, sensors and cameras, satellites, cell phones, etc. I plan to demonstrate the potential of these data sources to help inform policy and labor market research in Latin America and the Caribbean. They can help us, for example, anticipate the demand for sectoral and local employment, with the implications such a prognosis holds for the near-real-time adaptation of education curricula. Specifically, this project has demonstrated the use of data web-scraped from online job portals for the creation of semantic labor market indicators and inferential analysis.

The significance of the skill sets defined by Deming & Kahn (2018) in the permutation analysis is part of the project's success. We now know they help identify which skills have been more important or more affected in these two labor markets during the pandemic. According to Altamirano et al. (2020), in the Americas, the first labor market impacts resulting from the pandemic were observed starting in April 2020 for most countries. Another indication of success was the selection of a particular model or group of models that better fit this regression task, and a better understanding of job ad dynamics in both Guatemala and Mexico. A further task, for which I will surely seek advice from professors and peers, is the creation of a robust Extract, Transform, and Load (ETL) pipeline to automate the web scraping, Natural Language Processing, and regression and classification ML analyses using Airflow or Luigi. I plan on working on this ETL project during 2022.

Finally, a food-for-thought commentary on the topic. We all know the future cannot be predicted, but it can be planned. Regardless of the deadlines and the type of disruptions that the penetration of new technologies and/or pandemics bring to Latin American and Caribbean countries (and to all developing countries for that matter), it is essential to be prepared. The literature that exposes future scenarios for the world of work usually brings the lesson of recent technological revolutions: you have to worry about workers, not so much about jobs. Policies must focus on ensuring that people are prepared for the changes to come, that regulations protect workers displaced by automation, and that traditionally excluded groups are not disproportionately affected.


#### VII. References

Altamirano, A., & Amaral, N. (2020). *A Skills Taxonomy for LAC: Lessons Learned and a Roadmap for Future Users*. Washington, DC: Inter-American Development Bank. doi: http://dx.doi.org/10.18235/0002898

Altamirano, A., Azuara, O., & González, S. (2020). *¿Cómo impactará la COVID-19 al empleo? Posibles escenarios para América Latina y el Caribe*. Banco Interamericano de Desarrollo.

Deming, D., & Kahn, L. B. (2018). Skill requirements across firms and labor markets: Evidence from job postings for professionals. *Journal of Labor Economics*, 36(S1), S337-S369.

Wooldridge, J. M. (2010). *Econometric analysis of cross section and panel data*. MIT Press.


#### VIII. Annex

##### Table 1: Deming & Kahn's 2018 definition of skills for the US
<img src="/../assets/DK2018.png" width="700" align = "left">