<h1> The determinants of offered salaries in Mexico and Guatemala: an analysis of online job vacancies </h1>

<br>
<br>
<h4> Alvaro Altamirano Montoya
<br>
<br>
Final Project – Proposal Presentation
<br>
<br>
PPOL564 | Data Science 1
</h4>

[GitHub repo](https://github.com/AlvaroAltamiranoM/PPOL564_Final_Project_Fall_2021)




<table border="1">
<h2 align = "left"> Background (I: The Project) </h2>
<tr>
    <img src="figure1_map.png" width="500" align = "right">
 </tr>
 <tr>
 </tr>
<br>
<br>
&#9679; In 2019 I started web-scraping data on online job vacancies for 18 countries in Latin America and the Caribbean.
<br>
<br>
&#9679; Regional dataset of ~4 million unique observations.
<br>

&#9679; [CEDEFOP](https://www.cedefop.europa.eu/en/tools/skills-online-vacancies), Burning Glass, [Mexico, Chile, Colombia, Costa Rica].
<br>
<br>
&#9679; IDB’s coronavirus labor markets monitor [dashboard](https://observatoriolaboral.iadb.org/en/vacantes/).

<h2 align = "left" left-margin= "140"> Background (II: Long/short goals) </h2>
<br>
<br>
&#9679; In the long run: study the COVID-19 crisis and technological change as seen through vacancy data.

&#9679; While the time series grows: plan the AWS setup and the models for classifying occupations and skills.


<h2 align = "left" left-margin= "140"> Data (I: Subset for now) </h2>
<br>
<br>
&#9679; Only two countries for now: Mexico (**~ 1 M**) & Guatemala (**~ 50 K**).

i) Previous analysis & computational setup, 


ii) Harmonization of variables for cross-country comparisons, 


iii) Similar labor markets.

<h2 align = "left" left-margin= "140"> Data (II: Web-scraping configuration) </h2>
<br>
<br>
<img src="scraping.png" width="700" align = "left">

<h2 align = "left" left-margin= "140"> Data (III: Job ad example) </h2>

<img src="example.png" width="700" align = "left">

<h2 align = "left" left-margin= "140"> Methods (I: defining Y and Xs) </h2>
<br>
<br>
<table border="1">
<tr>
    <img src="DK2018.png" width="500" align = "right">
 </tr>
 <tr>
 </tr>
&#9679; Specifically for this project: Machine Learning (ML) to predict the log of the offered monthly wage in each job advert.
<br>
<br>
$\log(W_i) = f(\text{Age}_i, \text{Gender}_i, \text{Schooling}_i, \text{Experience}_i, \text{Skills}_i)$
<br>
<br>
&#9679; Deming and Kahn (2018, Journal of Labor Economics):
<br>
<br>

<table border="1">
<h2 align = "left"> Methods (II: staging the analysis) </h2>
<tr>
    <img src="methods.png" width="400" align = "right">
 </tr>
 <tr>
 </tr>
<br>
<br>
&#9679; Data wrangling and descriptive statistics (numpy, pandas, etc.).


&#9679; ML Pipeline (scikit-learn models: KNN, RF, LM, LS, SVM).


&#9679; Post-estimation visuals as a second step (ggplot, plotly, seaborn, etc.).
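
A minimal sketch of how the pipeline stage might look, assuming scikit-learn and reading KNN, RF, and LM as k-nearest neighbors, random forest, and a linear model. The data are synthetic, and the column names (`schooling`, `experience`, `computer`, `gender`) are hypothetical stand-ins rather than the project's actual variables. Each model is wrapped in the same preprocessing steps and scored on the same cross-validation folds, so the error figures are comparable across specifications.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the job-ad data; all column names are hypothetical.
rng = np.random.default_rng(0)
n = 300
ads = pd.DataFrame({
    "schooling": rng.integers(6, 18, n),
    "experience": rng.integers(0, 15, n),
    "computer": rng.integers(0, 2, n),
    "gender": rng.choice(["any", "female", "male"], n),
})
# Made-up wage process, used only so that the example has a target to predict.
wage = 3000 * np.exp(
    0.08 * ads["schooling"]
    + 0.03 * ads["experience"]
    + 0.10 * ads["computer"]
    + rng.normal(0, 0.2, n)
)
y = np.log(wage)  # target: log of the offered monthly wage

# Shared preprocessing: scale numeric columns, one-hot encode categorical ones.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["schooling", "experience", "computer"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["gender"]),
])

models = {
    "LM": LinearRegression(),
    "KNN": KNeighborsRegressor(n_neighbors=10),
    "RF": RandomForestRegressor(n_estimators=100, random_state=0),
}

# The same folds are reused for every model so the scores are comparable.
folds = KFold(n_splits=5, shuffle=True, random_state=0)
scores = {}
for name, model in models.items():
    pipe = Pipeline([("prep", preprocess), ("model", model)])
    cv = cross_val_score(pipe, ads, y, cv=folds,
                         scoring="neg_mean_squared_error")
    scores[name] = -cv.mean()  # mean squared error on log wages

print(scores)
```

`scores` maps each model label to its mean cross-validated squared error on the log wage; lower is better.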

In [4]:
# An overview of the relational datasets: Guatemala.
import pandas as pd
import plotly.express as px
from plotly.offline import plot, init_notebook_mode, iplot
df = pd.read_csv(r'C:\Users\unily\Documents\Georgetown\PPOL 564 - Intro to Data Science\project\PPOL564_Final_Project_Fall_2021\gt.csv')[0:40000]
# Percent of missing values per variable, sorted from most to least missing
df_merged_pctna = (df.isnull().sum() * 100 / len(df)).round(1).sort_values(ascending = False)
# Bar chart of the percent missing in each variable
fig = px.bar(df_merged_pctna, x=df_merged_pctna.index.values, y=df_merged_pctna,
             template="simple_white", text = df_merged_pctna,
             title= 'Preliminary results (I, Variables): <br>  <br> Percent of missing values in each variable')
fig.update_layout(showlegend=False,
                xaxis_title="Variables",
                yaxis_title="Percent missing (%)",
                font_family="Arial",
                title_font_family="Arial Black",
                yaxis_title_font_family ="Arial Black",
                xaxis_title_font_family ="Arial Black",
                title_font_color="black",
                title_font_size=19,
                legend_title_font_color="green")

iplot(fig)

In [5]:
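# Total vacancies per posting date
df2 = df.groupby('date_posted')['count'].sum().reset_index()
# Moving average over the last 7 rows (min_periods=1 keeps the first rows)
df2['conteo_MA'] = df2['count'].rolling(7, min_periods=1).mean()

# Keep observations from 2020 onwards
df2 = df2.loc[pd.to_datetime(df2.date_posted).dt.year>2019]
# Create figure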
fig2 = px.line(df2, x='date_posted', y='conteo_MA', 
                  title = 'Preliminary results (II): The Pandemic <br>  <br> Weekly new vacancies for Guatemala',
               template ='simple_white'
                 )
fig2.update_layout(showlegend=False,
                xaxis_title="Date (weeks in ISO 8601)",
                yaxis_title="New downloaded vacancies",
                font_family="Arial",
                title_font_family="Arial Black",
                yaxis_title_font_family ="Arial Black",
                xaxis_title_font_family ="Arial Black",
                title_font_color="black",
                title_font_size=19,
                legend_title_font_color="green",
                uniformtext_minsize=14, 
                uniformtext_mode=False)
fig2.update_traces(textposition='middle right')
fig2.update_xaxes(showline=True, linewidth=2, linecolor='black', showspikes=True)
fig2.update_yaxes(showline=True, linewidth=2, linecolor='black', showspikes=True)

iplot(fig2)
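
The moving average in the cell above is row-based: `rolling(7, 1)` averages the last seven rows, whatever dates they cover. If some dates are missing from the scrape, a time-based window may be closer to the intended seven-day average. The sketch below contrasts the two on made-up counts; only the column names `date_posted` and `count` come from the dataset, and the values are illustrative.

```python
import pandas as pd

# Made-up daily counts with a gap: nothing was posted on 4-9 January.
daily = pd.DataFrame({
    "date_posted": ["2020-01-01", "2020-01-02", "2020-01-03",
                    "2020-01-10", "2020-01-11"],
    "count": [10, 20, 30, 40, 50],
})
daily["date_posted"] = pd.to_datetime(daily["date_posted"])
daily = daily.set_index("date_posted").sort_index()

# Row-based window: mean of the last 7 rows, regardless of their dates.
daily["ma_rows"] = daily["count"].rolling(7, min_periods=1).mean()

# Time-based window: mean of the rows inside the last 7 calendar days.
daily["ma_7d"] = daily["count"].rolling("7D").mean()

print(daily)
```

After the gap, the row-based average still mixes in the early January values, while the seven-day window uses only the recent rows.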

In [6]:
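# Mean of the 'computer' column per posting date
df3 = df.groupby('date_posted')['computer'].mean().reset_index()
# Moving average over the last 7 rows (min_periods=1 keeps the first rows)
df3['conteo_MA'] = df3['computer'].rolling(7, min_periods=1).mean()

# Keep observations from 2020 onwards
df3 = df3.loc[pd.to_datetime(df3.date_posted).dt.year>2019]
# Create figure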
fig3 = px.line(df3, x='date_posted', y='conteo_MA', 
                  title = 'Preliminary results (III): The Skills <br>  <br> Guatemala, % of jobs requiring computer skills',
               template ='simple_white'
                 )
fig3.update_layout(showlegend=False,
                xaxis_title="Date (weeks in ISO 8601)",
                yaxis_title="Percent of total weekly ads",
                font_family="Arial",
                title_font_family="Arial Black",
                yaxis_title_font_family ="Arial Black",
                xaxis_title_font_family ="Arial Black",
                title_font_color="black",
                title_font_size=19,
                legend_title_font_color="green",
                uniformtext_minsize=14, 
                uniformtext_mode=False)
fig3.update_traces(textposition='middle right')
fig3.update_xaxes(showline=True, linewidth=2, linecolor='black', showspikes=True)
fig3.update_yaxes(showline=True, linewidth=2, linecolor='black', showspikes=True)

iplot(fig3)

<table border="1">
<h2 align = "left"> Lessons learned (I) </h2>
<tr>
    <img src="text.png" width="300" align = "right">
 </tr>
 <tr>
 </tr>
<br>
<br>
&#9679; Text data.
<br>
<br>
&#9679; Datetime-constrained time series.
<br>
<br>
&#9679; ML Pipeline will need neat datasets.
<br>
<br>

<table border="1">
<h2 align = "left"> Lessons learned (II) </h2>
<tr>
    <img src="ahead.png" width="300" align = "right">
 </tr>
 <tr>
 </tr>
<br>
<br>
&#9679; Skills matter.
<br>
<br>
&#9679; How to make ML specifications comparable.
<br>
<br>
&#9679; Delimit the variables and semantic indicators for the report.
<br>
<br>
THANK YOU!