In [4]:
import os
import numpy as np
import csv
import re
from bs4 import BeautifulSoup


files = os.listdir(".")

experiences = []
educations = []

profiles = []

tag = ['h3', 'span']
classes = ['education__item education__item--degree-info','pv-entity__comma-item','pv-entity__school-name t-16 t-black t-bold', 't-16 t-black t-bold','profile-section-card__title']


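for f in files:
    if f.endswith(".html"):
        with open(f, encoding='utf-8') as fp:
            # name the parser explicitly so the result does not depend on which parsers are installed
            soup = BeautifulSoup(fp, 'html.parser')
            # first <section> whose class attribute contains "experience"
            regex = re.compile('.*experience.*')
            experienceSection = soup.find('section', {'class':regex})
            # remove the 'visually-hidden' spans so their text is not scraped
            for el in experienceSection.find_all('span', {'class':'visually-hidden'}): el.decompose()
            # try every tag/class combination and keep the first one that matches something
            experienceTags = (experienceSection.find_all(t, {'class': c}) for t in tag for c in classes)
            expText = next((exp for exp in experienceTags if len(exp) > 0), "")
            experience = [el.get_text().strip() for el in expText if el != ""]
            # same approach for the education section, searching <span> tags only
            regexEd = re.compile('.*education.*')
            educationSection = soup.find('section', {'class':regexEd})
            educationTags = (educationSection.find_all(tag[1], {'class':c}) for c in classes)
            edText = next((ed for ed in educationTags if len(ed) > 0), "")
            education = [el.get_text().strip() for el in edText if el != ""]
            # one row per profile: [list of jobs, list of education entries]
            profiles.append([experience, education])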


In the first, long block of code above I built a scraper almost from scratch using Beautiful Soup, a Python library for pulling data out of HTML, and ran it on the LinkedIn profile pages I had manually downloaded into the current local directory. In a nutshell, I looped over almost 500 profile pages and, for each file, extracted the elements of interest from the education and work experience sections. To do that, I queried the relevant HTML elements by tag name and class name and stored their text in a list of lists, which I then saved to a CSV file for persistence and loaded into a pandas DataFrame for manipulation.  

In [5]:
# saving the scraped profiles to a .csv file

import csv

header = ['Job','Education']

csv_file = 'linkedin_profiles.csv'
with open(csv_file, 'w', newline='') as fp:
    writer = csv.writer(fp)
    writer.writerow(header)
    writer.writerows(profiles)

In [1]:
#reporting data
import pandas as pd
import numpy as np

df = pd.read_csv('linkedin_profiles.csv', encoding='ISO-8859-1')
df.head()

Unnamed: 0,Job,Education
0,"['Training Specialist', 'Training Specialist',...","['Master di I livello', 'Politiche di Sicurezz..."
1,"[""Esperienze di Lavoro all'Estero come Care As...","['Scienze Politiche,Sociali ed Internazionali'..."
2,"['docente di lettere', 'docente di lettere', '...",['Laurea Magistrale LM in Scienze Umanistiche...
3,"['Junior Software Developer', 'Promoter vendit...","['Laurea triennale', 'Ingegneria informatica',..."
4,"['IoT Edge Developer', 'Sviluppatore front-end']","['Laurea Magistrale LM', 'Ingegneria informat..."


In [2]:
df.dtypes

Job          object
Education    object
dtype: object

The dataset, whose first rows are shown by the pandas head() method, still contains fairly raw data. The two columns hold, as lists, everything scraped from the HTML pages of the LinkedIn profiles: all the jobs listed in the experience section and all the education entries, which can range from high school to Ph.D. degrees. From this first look, the dataset needs some trimming to keep only the information relevant to this work: education titles limited to bachelor's or master's degrees from the University of Bologna, and the last job in each list of jobs, since the aim is to understand which job people with a specific degree are most likely to do.

In [3]:
df.shape

(496, 2)

In [4]:
from ast import literal_eval

df.Job = df.Job.apply(literal_eval)
df.Education = df.Education.apply(literal_eval)

In the cell above, since the lists stored in the CSV file are read back as single strings, I applied literal_eval to turn each string back into the corresponding list of jobs or education entries. In the cell below, having noticed in the CSV file that some fields were empty lists, I dropped the rows whose Education list is empty.  
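
To see why the conversion is needed, here is a minimal sketch of the round trip. It uses an in-memory buffer in place of linkedin_profiles.csv and two made-up rows (both are assumptions for illustration): csv.writer stores each Python list as its string representation, and ast.literal_eval parses that string back into a list.

```python
import csv
import io
from ast import literal_eval

# Made-up rows in the same shape as `profiles`: [jobs, educations]
rows = [
    [["Junior Software Developer", "Promoter vendite"],
     ["Laurea triennale", "Ingegneria informatica"]],
    [["Tutor privato"], []],
]

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Job", "Education"])
writer.writerows(rows)

# Read the rows back: every cell is now a plain string such as "['Tutor privato']"
buffer.seek(0)
reader = csv.reader(buffer)
next(reader)  # skip the header row
raw = list(reader)

# literal_eval only accepts Python literals, so it is a safer choice than eval here
parsed = [[literal_eval(cell) for cell in row] for row in raw]

print(type(raw[0][0]).__name__, type(parsed[0][0]).__name__)
```

After parsing, an empty education list is a real empty list again, which is what makes a length-based filter on that column possible.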

In [5]:
df.head()

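# keep only the profiles with at least one education entry;
# .copy() makes df1 an independent DataFrame rather than a view of df
df1 = df[df.Education.map(len) > 0].copy()
df1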

Unnamed: 0,Job,Education
0,"[Training Specialist, Training Specialist, Leg...","[Master di I livello, Politiche di Sicurezza e..."
1,[Esperienze di Lavoro all'Estero come Care Ass...,"[Scienze Politiche,Sociali ed Internazionali, ..."
2,"[docente di lettere, docente di lettere, docen...","[Laurea Magistrale LM in Scienze Umanistiche,..."
3,"[Junior Software Developer, Promoter vendite, ...","[Laurea triennale, Ingegneria informatica, 86/..."
4,"[IoT Edge Developer, Sviluppatore front-end]","[Laurea Magistrale LM, Ingegneria informatica..."
...,...,...
491,"[Professore di Archeologia del Paesaggio, Ph.D...","[Master, Bioarcheologia, Paleopatologia Antrop..."
492,"[Insegnante di lettere miur, Coworker logistic...","[laurea quadriennale in lettere moderne, lette..."
493,[Tutor privato],"[Laurea Magistrale LM, Filologia classica, 11..."
494,[Ricercatore],"[1° ciclo - Laurea L, storia dei Paesi Afrosi..."


In [6]:
import json

jobWrapperFile = 'jobWrapper.json'
educationWrapperFile = 'educationWrapper.json'

jobWrapper = {}
educationWrapper = {} 

with open(jobWrapperFile) as json_job_wrapper:
    jobWrapper = json.load(json_job_wrapper)

with open(educationWrapperFile) as json_edu_wrapper:
    educationWrapper = json.load(json_edu_wrapper)
    
print({k:jobWrapper[k] for k in list(jobWrapper.keys())[:2]})
print({k:educationWrapper[k] for k in list(educationWrapper.keys())[:2]})

{'Risorse umane': ['Responsabile Area Espansione', 'Senior Recruitment Officer', 'Analista Direzione Sviluppo Persone e Organizzazione', 'Training Specialist', 'HR'], 'Ristorazione': ['Assistente enologo', 'Barman', 'barista']}
{'Giurisprudenza': ['Giurisprudenza'], 'Scienze della comunicazione': ['Semiotica', 'Marketing', 'PubblicitÃ\xa0', 'Communication', 'Comunicazione', 'comunicazione', "comunicazione pubblica, d'impresa e pubblicitÃ\xa0", 'Mass media e Politica', 'Brand Strategy and Marketing', 'Scienze della comunicazione', 'Comunicazione e Digital Media', 'Scienze della Comunicazione', 'Comunicazione']}


As anticipated, the education and job fields fetched from the LinkedIn HTML pages can be entered under different names even when they belong to a single category, so I applied a basic remapping. To get a cleaner and more meaningful dataset to inspect, I grouped semantically similar jobs or degrees under one label. For example, every profile whose work experience was Developer, Web Developer, Sviluppatore software and so on was mapped to the single keyword Sviluppatore. To do so, I wrote two JSON files of key-value pairs, each linking a list of synonyms to a unique key. The cell above loads the contents of those two files into two dictionaries and prints the first two entries of each to give a glimpse of the fields.
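
As a small illustration of the file layout, the sketch below writes and reloads a wrapper with the same shape as jobWrapper.json and educationWrapper.json. The file name sampleWrapper.json and its entries are hypothetical. It also passes encoding='utf-8' explicitly: garbled strings such as 'PubblicitÃ\xa0' in the printed output are what UTF-8 text typically looks like when it is decoded with a single-byte codec, although I have not verified that this is the cause here. If the encoding is changed, it would have to be changed consistently for the CSV as well, otherwise the synonyms would no longer match the scraped strings.

```python
import json
import os
import tempfile

# Hypothetical categories and synonyms, in the same shape as the wrapper files:
# each key is a category, each value is the list of strings mapped to it
sample = {
    "Sviluppatore": ["Developer", "Web Developer", "Sviluppatore software"],
    "Scienze della comunicazione": ["Comunicazione", "Pubblicità"],
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sampleWrapper.json")
    # ensure_ascii=False keeps accented letters readable in the file
    with open(path, "w", encoding="utf-8") as fp:
        json.dump(sample, fp, ensure_ascii=False, indent=2)
    # reading with the same explicit encoding avoids platform-dependent decoding
    with open(path, encoding="utf-8") as fp:
        loaded = json.load(fp)

print(loaded["Scienze della comunicazione"])
```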

In [7]:
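def inList(array, dictionary):
    """Return the key of the first dictionary entry with a synonym contained in an element of array."""
    for lval in array:
        for key, val in dictionary.items():
            for v in val:
                # substring match: the synonym v only has to appear inside lval
                if v in lval:
                    return key
    # falls through and returns None when nothing matches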

Next comes the definition of a basic function that loops over the elements of a profile's list of jobs or education entries and over the keys and values of a dictionary. It checks whether one of the dictionary's values is contained in an element of the list and returns the associated key at the first match (or None when nothing matches).  
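
The behaviour is easier to see on a toy input. The sketch below restates the same logic under the hypothetical name in_list so that it runs on its own, and uses a made-up two-category wrapper.

```python
def in_list(array, dictionary):
    """Same logic as inList, restated so the example is self-contained."""
    for lval in array:
        for key, val in dictionary.items():
            for v in val:
                if v in lval:
                    return key

# Made-up wrapper with two categories
wrapper = {
    "Sviluppatore": ["Developer", "Sviluppatore"],
    "Insegnante": ["docente", "Insegnante"],
}

# The first list element already matches, so later elements are never examined
first = in_list(["Junior Software Developer", "docente di lettere"], wrapper)

# The first element matches nothing, so the second one decides the category
second = in_list(["Promoter vendite", "docente di lettere"], wrapper)

# Matching is case-sensitive: "developer" does not contain "Developer"
third = in_list(["junior developer"], wrapper)

print(first, second, third)  # Sviluppatore Insegnante None
```

Two consequences follow from this design: the category of a profile is decided by the earliest list element that matches anything, and unmatched profiles end up as None, which shows up as missing values once the result is stored in a DataFrame.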

In [8]:
import warnings
from functools import partial
warnings.filterwarnings('ignore')

mappedJobs = list(map(partial(inList, dictionary=jobWrapper), df1.Job))
mappedEdus = list(map(partial(inList, dictionary=educationWrapper), df1.Education))

df2 = df1.copy()  # work on a copy so that df1 keeps the original lists
df2['Job'] = mappedJobs
df2['Education'] = mappedEdus

Once the values are mapped, I built another dataset replacing the old values with the new ones; its first rows are shown in the following cell. After this small processing step the dataset, now filled with more meaningful and easy-to-read data, is ready to be explored for the first insights.  

In [9]:
df2.head()

Unnamed: 0,Job,Education
0,Risorse umane,Giurisprudenza
1,Ristorazione,Scienze politiche
2,Insegnante,Lettere e Filosofia
3,Sviluppatore,Ingegneria Informatica
4,Sviluppatore,Ingegneria Informatica


In [12]:
from pandas_profiling import ProfileReport

df2.reset_index(drop = True,inplace=True)

profile = ProfileReport(df2,missing_diagrams={'bar':False,'matrix':False,'heatmap':False,'dendrogram':False})
profile.to_file("report.html")

In the chunk of code above, to present a report of the data in as rich a form as possible, I used the pandas-profiling module and saved the report to an HTML file. Essentially it provides an enriched version of pandas' describe() method, which likewise generates descriptive statistics about the features of a dataset. Since both features here are nominal, the report describes them through counts and frequencies, including the most common value (the top metric) and the first and last values. It also reports how the features are associated, using for example Cramér's V as a measure of association between the two nominal features. Finally, the report gives a glimpse of the missing data and a sample of the dataset.  
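
For reference, the plain form of Cramér's V is V = sqrt(χ² / (n · min(r − 1, c − 1))), where χ² is the chi-squared statistic of the r × c contingency table and n is the number of rows. The sketch below computes it with the standard library only, on made-up labels. The function name cramers_v is my own, and the report may use a bias-corrected variant, so its values need not coincide exactly with this version.

```python
import math
from collections import Counter

def cramers_v(xs, ys):
    """Plain (uncorrected) Cramér's V between two equally long sequences of labels."""
    n = len(xs)
    pair_counts = Counter(zip(xs, ys))  # observed count of each (x, y) cell
    x_counts = Counter(xs)              # row totals
    y_counts = Counter(ys)              # column totals
    chi2 = 0.0
    for x, x_total in x_counts.items():
        for y, y_total in y_counts.items():
            expected = x_total * y_total / n
            observed = pair_counts.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(x_counts), len(y_counts)) - 1
    if k == 0:
        return 0.0  # only one category on one side: treated as no association here
    return math.sqrt(chi2 / (n * k))

# Made-up data: the degree fully determines the job
jobs = ["Sviluppatore", "Sviluppatore", "Insegnante", "Insegnante"]
edus = ["Ingegneria Informatica", "Ingegneria Informatica",
        "Lettere e Filosofia", "Lettere e Filosofia"]
perfect = cramers_v(jobs, edus)

# Made-up data: every job/degree combination occurs equally often
independent = cramers_v(["a", "a", "b", "b"], ["x", "y", "x", "y"])

print(perfect, independent)  # 1.0 0.0
```

Values close to 1 would mean that knowing the degree almost determines the job category, while values close to 0 would mean the two features are nearly independent.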