# Benchmark analysis notebook EDA

## Introduction

The purpose of this notebook is to apply exploratory data analysis (EDA) to an NLP dataset. It walks through the steps used to clean the dataset and analyze the content of the job descriptions, so that the result can later be used to build a NER (named entity recognition) model that finds the job skills required in each job description.

## Packages

All the required packages are listed in the `requirements.txt` file.

## Import libraries required for this project

In [None]:
!python -m spacy download en_core_web_lg

In [1]:
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
import numpy as np
import spacy
import unidecode
from spacy.matcher import Matcher,PhraseMatcher
import re
import json
from langdetect import detect,detect_langs,DetectorFactory
import string
import random
from sklearn.feature_extraction.text import CountVectorizer

nltk.download("punkt")
nlp = spacy.load('en_core_web_lg')
nltk.download('stopwords')



[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\migue\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\migue\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Code

### Create the class
We create the class used throughout the EDA and initialize it with its respective JSON files.

In [2]:
class DatasetCleaner:
    
    def __init__(self,
                 nlp,
                 work_times_json_path,
                 techs_json_path,
                 seniorities_json_path,
                 fields_json_path,
                 companies_json_path,
                 reeplacements_json_path
                ):
        
        """
        Build the matchers and the replacement dictionary used during cleaning.

        Parameters:
        nlp: spaCy pipeline created with spacy.load()
        work_times_json_path: path to the JSON file with the work-time patterns (Full-time, Internship, etc.)
        techs_json_path: path to the JSON file with the technology patterns (spark, scala, aws, etc.)
        seniorities_json_path: path to the JSON file with the seniority patterns (Senior, Junior, etc.)
        fields_json_path: path to the JSON file with the field patterns (engineer, manager, developer)
        companies_json_path: path to the JSON file with the company patterns (Apex, Amazon, etc.)
        reeplacements_json_path: path to the JSON file with the typos to replace in the strings

        Stores the vocabulary object and a dictionary of matchers used to find keywords in the descriptions.
        Also creates the reeplacer dictionary, which is used to replace misspelled words.

        """
        self.nlp = nlp
        self.vocab = self.nlp.vocab
        self.matcher_dict = {}
        self.stemmer = LancasterStemmer()
        #self.stemmer = PorterStemmer()
        self.stop_words = set(stopwords.words('english'))
        
        self.techs_json_path = techs_json_path

        self.add_work_times_matcher(work_times_json_path)
        self.add_techs_matcher(techs_json_path)
        self.add_seniority_matcher(seniorities_json_path)
        self.add_fields_matcher(fields_json_path)
        self.add_companies_matcher(companies_json_path)
        self.add_url_matcher()

        self.reeplacer = self.set_reeplacer(reeplacements_json_path)
        
        self.vectorizer = CountVectorizer(analyzer='word')
        DetectorFactory.seed = 0
        
    def add_matcher_by_name_and_json_path(self,matcher_name,json_path):
        """
        Create a matcher from the patterns stored in a JSON file and save it in self.matcher_dict.

        Parameters:
        matcher_name: key of the matcher in self.matcher_dict
        json_path: path to the JSON file with the patterns

        """
        patterns = self.get_pattern_list_by_json_path(json_path)
        self.add_matcher(matcher_name,patterns)
        
        
    def set_reeplacer(self,json_path):
        """
        Load the replacement dictionary from a JSON file.

        Parameters:
        json_path: path to the JSON file with the replacements

        Returns the dictionary; __init__ stores it in self.reeplacer.

        """
        return self.get_json(json_path)
    
    def add_url_matcher(self):
        """
        Add the URL matcher to self.matcher_dict.

        """
        pattern_urls = [
            [{"LIKE_URL": True}]
        ]
        
        self.add_matcher("UrlMatcher",pattern_urls)
        
    def add_companies_matcher(self,company_json_path):
        """
        Add the company matcher to self.matcher_dict.

        """
        self.add_matcher_by_name_and_json_path("CompanyMatcher",company_json_path)
    
    def add_work_times_matcher(self,work_times_json_path):
        """
        Add the work-time matcher to self.matcher_dict.

        """
        self.add_matcher_by_name_and_json_path("WorkTimeMatcher",work_times_json_path)
    
    def add_seniority_matcher(self,seniorities_json_path):
        """
        Add the seniority matcher to self.matcher_dict.

        """
        self.add_matcher_by_name_and_json_path("SeniorityMatcher",seniorities_json_path)
        
    def add_techs_matcher(self,techs_json_path):
        """
        Add the technology matcher to self.matcher_dict.

        """
        self.add_matcher_by_name_and_json_path("MatcherTechs",techs_json_path)
    
    def add_fields_matcher(self,fields_json_path):
        """
        Add the field matcher to self.matcher_dict.

        """
        self.add_matcher_by_name_and_json_path("FieldMatcher",fields_json_path)
        
    
    def get_dataset(self,*args,**kwargs):
        """
        Thin wrapper around pd.read_csv; accepts the same arguments.

        """
        
        df = pd.read_csv(*args,**kwargs)
        
        return df
    
    def drop_records_with_na(self,df):
        """
        Drop the rows that contain nulls from df (in place), print how many rows were removed
        and reset the index. The previous index is kept as a new column.

        """
        
        len_df = len(df)
        df.dropna(inplace = True)
        len_df_after_drop = len(df)
        
        print(
            "Dataset row count before dropping NAs {}, now it is {}. {} records removed"
            .format(str(len_df),str(len_df_after_drop),str(len_df - len_df_after_drop)))
        
        df.reset_index(inplace = True)
        
    def remove_unicode_in_column(self,string):
        """
        Transliterate a string to plain ASCII, for example México -> Mexico

        """
        return unidecode.unidecode(string)
        
    
    def group_by_df(self,df,list_of_cols):
        """
        Return the row count for each combination of values in list_of_cols (a group-by count)

        """
        return df.groupby(list_of_cols).size()
        
    def add_matcher(self,matcher_name,patterns_list):
        """
        Create a spaCy Matcher from patterns_list and store it in self.matcher_dict under matcher_name.
        Pattern examples:
        https://spacy.io/usage/rule-based-matching

        """
        self.matcher_dict[matcher_name] = Matcher(self.vocab)
        self.matcher_dict.get(matcher_name).add(matcher_name, patterns_list)
        
    def get_json(self,json_path):
        """
        Read a JSON file and return its contents

        """
        data = None
        with open(json_path) as f:
            data = json.load(f)
        return data
        
        
    def get_pattern_list_by_json_path(self,json_path):
        return self.get_json(json_path)
    
    def get_matcher(self,matcher_name):
        """
        Look up a matcher in self.matcher_dict and return it.
        Raises an exception if the matcher is not in the dictionary

        """
        matcher = self.matcher_dict.get(matcher_name)
        if matcher is None:
            raise Exception("No matcher name found!!")
        return matcher
          
    def get_first_match(self,string,matcher_name):
        """
        Return the text of the first match that the matcher named matcher_name finds in string.
        Returns an empty string when there is no match

        """
        doc = self.nlp(string)
        matcher = self.get_matcher(matcher_name)

        matches = matcher(doc)
        # Hardcoded to the first match
        if len(matches) == 0:
            return ''
        work_time = doc[matches[0][1]:matches[0][2]]

        return work_time.text
        
    def get_work_time(self,string):
        return self.get_first_match(string,"WorkTimeMatcher")
    
    def get_all_matches(self,doc,matcher_name,remove_duplicates = True, trim_string = True,matcher = None):
        """
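        Return every match of the matcher named matcher_name (or of the matcher object passed as matcher) in doc.
        By default the matches are stripped of surrounding whitespace and returned as a set without duplicates;
        both behaviours can be turned off, in which case a list is returned. No lowercasing happens here, so
        lowercase the text beforehand if needed
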
        """
        if matcher is None:
            matcher = self.get_matcher(matcher_name)
        matches = matcher(doc)
        return_matches = []
        
        for match in matches:
            match_local = doc[match[1]:match[2]].text            
            if trim_string == True:
                match_local = match_local.strip()
            
            return_matches.append(match_local)
        
        if remove_duplicates == True:
            return_matches = set(return_matches)
        
        return return_matches
        

    def replace_words_in_string(self,string):
        """
        Delete the "/" character from the string and then replace words using the reeplacer dictionary

        """
        
        string = self.remove_punctuations(string,punctuations = None,string_exceptions = "/")
        
        for key in self.reeplacer:
            string = string.replace(key,self.reeplacer.get(key))
        
        return string
        
    def remove_if_word_is_inside_another_word_within_list(self,list_v):
        """
        Drop the words that are contained in another word of the same list.
        Example:
        From [bi, power bi, business intelligence] to [power bi, business intelligence], because the shorter term
        would otherwise produce duplicated keywords

        """
        all_words = list()

        for i in range(len(list_v)):
            word1 = list_v[i]
            add = True
            for j in range(len(list_v)):
                word2 = list_v[j]
                if word1 == word2:
                    continue

                if word1 in word2:
                    add = False

            if add == True:
                all_words.append(word1) 
        
        return all_words
    
    def list_to_string(self,string_separator,list_to_str):
        return string_separator.join(list_to_str)
        
    
    def get_keywords_from_title_role(self,string,to_lower_string = True):
        """
        Using the technology, seniority and field matchers, extract the keywords from a role title and
        return them as a single string without repeated words.

        The keywords appear in the following order:
        technologies, seniorities and fields mentioned in the role title

        """
        
        
        if to_lower_string == True:
            string = string.lower()
        
        string = self.replace_words_in_string(string)

        doc = self.nlp(string)

        techs_set = self.get_all_matches(doc,"MatcherTechs")
        seniority_set = self.get_all_matches(doc,"SeniorityMatcher")
        field_set = self.get_all_matches(doc,"FieldMatcher")
        
        join_all = list(sorted(techs_set)) + list(sorted(seniority_set)) + list(sorted(field_set))

        all_words = self.remove_if_word_is_inside_another_word_within_list(join_all)

        return self.list_to_string(" ",all_words)
    
    def is_url(self,url):
        """
        Return the first URL-like token found in the string, or an empty string when there is none,
        so the result is truthy only when a URL was found

        """
        return self.get_first_match(url,"UrlMatcher")
    
    def drop_column(self,df,column_name):
        """
        Drop a column from df in place

        """
        df.drop(column_name, axis=1, inplace=True)
    
    def get_company_name(self,string):
        """
        Standardize some company names.
        For example:
        Accenture Mexico becomes Accenture so that all of its jobs fall under the Accenture category.
        Names without a match are returned unchanged

        """
        result = self.get_first_match(string,"CompanyMatcher")
        
        return result if result != "" else string
        
    def sort_columns(self,df,order_cols):
        """
        Return df with only the columns in order_cols, in that order

        """
        return df[order_cols]
    
    def detect_language(self,string):
        """
        Detect the languages of the given string and return their codes as a comma-separated string, one per detected language.
        For more information about the returned codes see https://pypi.org/project/langdetect/

        """
        list_langs_detected = detect_langs(string)

        # Each result is a langdetect Language object; keep only its language code
        return ','.join([lang.lang for lang in list_langs_detected])
    
    def string_to_lower(self,string):
        """
        Lowercase a string

        """
        return string.lower()
    
    def remove_punctuations(self,string_replace,punctuations = string.punctuation,string_exceptions = None):
        """
        Remove punctuation (characters such as @#$%^&) from a string.
        Every character listed in punctuations is replaced with a blank space.
        Every character listed in string_exceptions is deleted outright, leaving no space behind

        """
        if punctuations is not None:
            string_replace = re.sub('[%s]' % re.escape(punctuations)," ",string_replace)
        if string_exceptions is not None:
            for punctuation in string_exceptions:
                # Escape the character so regex metacharacters such as ^ or ] are treated literally
                string_replace = re.sub('[%s]' % re.escape(punctuation),'',string_replace)
                
        return string_replace
    
    def create_year_range_string (self,string,regex):
        """
        Rewrite the year ranges that match the given regex. For example, the string 1 - 2 years becomes 1 year range 2 years
        so that it survives stop-word removal; if it were rewritten as 1 to 2 years, the word "to" would be removed with the
        stop words

        """
        def to_year_range(match):
            # group(0) is the full matched text, even when the regex contains capturing groups
            numbers = re.findall(r"\d+",match.group(0))
            if len(numbers) < 2:
                return match.group(0)
            return numbers[0] + " year range " + numbers[1] + " years"

        string = re.sub(regex,to_year_range,string)
        
        return string
        
    
    def format_year_ranges(self,string):
        """
        Rewrite phrases such as 7+ years as 7 plus years, then run create_year_range_string on the
        "N to M years" and "N - M years" forms

        """
        # Replace every "N+" with "N plus" in a single pass
        replace_string = re.sub(r"(\d+)\+",r"\1 plus",string)

        # Non-capturing group so the whole range is matched, with either a hyphen or an en dash
        replace_string = self.create_year_range_string(replace_string,r"\d+ to \d+ years")
        replace_string = self.create_year_range_string(replace_string,r"\d+ (?:-|–) \d+ years")
  
        return replace_string
    
    def deformat_year_ranges(self,string):
        """
        Turn the 1 year range 2 years format back into 1 to 2 years. Preferably call this after removing the
        stop words

        """

        replace_string = string
        for match in re.findall(r"\d+ year range \d+ years",string):
            year_range_string = re.findall(r"\d+",match)[0] + " to " + re.findall(r"\d+",match)[1] + " years"
            replace_string = replace_string.replace(match,year_range_string)
        
        return replace_string
 
    def basic_clean_descriptions(self,description):
        """
        Use replace_words_in_string to replace words in the description

        """
        replaced_description = self.replace_words_in_string(description)

        return replaced_description
    
    def intermediate_clean_description_lowercased_without_punctuations(self,description):
        """
        Run basic_clean_descriptions on the lowercased description and remove punctuation

        """
        replaced_description = self.basic_clean_descriptions(self.string_to_lower(description))

        description_without_punctuations = self.remove_punctuations(
            replaced_description,
            punctuations = string.punctuation + "–’",
            string_exceptions = "-")
        
        return description_without_punctuations
    
    def intermediate_clean_description_lowercased_without_punctuations_formating_year_ranges(self,description,with_stop_words = True):
        """
        Run basic_clean_descriptions on the lowercased description, remove punctuation, format the year
        ranges and, when with_stop_words is False, remove the stop words

        """
        replaced_description = self.basic_clean_descriptions(self.string_to_lower(description))
        
        description_date_ranges = self.format_year_ranges(replaced_description)

        description_without_punctuations = self.remove_punctuations(
            description_date_ranges,
            punctuations = string.punctuation + "–’",
            string_exceptions = "-")
        
        if with_stop_words == False:
            description_without_punctuations = ' '.join([word for word in word_tokenize(description_without_punctuations) if word not in self.stop_words])
            
        
        description_date_ranges_deformated = self.deformat_year_ranges(description_without_punctuations)
        
        return description_date_ranges_deformated
    
    def stem_description(self,description):
        """
        Apply stemming to the whole description

        """
        return ' '.join([self.stemmer.stem(word) for word in word_tokenize(description)])
    
    def lemma_description(self,description):
        """
        Apply lemmatization to the whole description

        """
        doc = self.nlp(description)
        return ' '.join([word.lemma_ for word in doc])
    
    
    def create_bow_df(self,df,index_col,lemmatized_col):
        """
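        Build the bag of words (document-term matrix) from the descriptions, preferably the lemmatized ones.
        Rows that share the same index_col value are joined into a single document
        """
        df_grouped = df[[index_col,lemmatized_col]].groupby(by=index_col).agg(lambda x:' '.join(x))
        
        data = self.vectorizer.fit_transform(df_grouped[lemmatized_col])
        # get_feature_names_out() requires scikit-learn >= 1.0; get_feature_names() was removed in 1.2
        df_dtm = pd.DataFrame(data.toarray(), columns=self.vectorizer.get_feature_names_out())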
        
        return df_dtm
    
    def pandas_df_to_csv(self,df,directory):
        df.to_csv(directory, index = False)
        
    

In [3]:
work_time_json_path = "data/tiempo_de_contratacion.json"
techs_json_path = "data/tecnologias.json"
seniorities_json_path = "data/seniorities.json"
fields_json_path = "data/fields.json"
companies_json_path = "data/companias.json"
reeplacements_json_path = "data/reemplazos.json"


eda_obj = DatasetCleaner(nlp,work_time_json_path,techs_json_path,seniorities_json_path,fields_json_path,companies_json_path,reeplacements_json_path)
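
The JSON files passed to `DatasetCleaner` are not shown in this notebook. Since `add_matcher` hands their contents straight to spaCy's `Matcher.add`, each pattern file presumably holds a list of token patterns. The following is only a sketch of that shape; the seniority terms below are hypothetical and are not the project's actual data.

```python
import json

# Hypothetical contents for a pattern file such as data/seniorities.json.
# Each inner list is one spaCy token pattern; each dict describes one token.
seniority_patterns = [
    [{"LOWER": "senior"}],
    [{"LOWER": "junior"}],
    [{"LOWER": "entry"}, {"LOWER": "level"}],
]

# Round-trip through JSON text, mirroring what get_json() does with a file.
json_text = json.dumps(seniority_patterns, indent=2)
loaded_patterns = json.loads(json_text)

print(json_text)
```

The replacements file is read the same way, but `replace_words_in_string` iterates over it as a dictionary, so it would map each typo to its corrected form rather than hold a list of patterns.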

### Dataset

In [4]:
df = eda_obj.get_dataset("data/linkedin_etl_15-12-2022-21-28.csv",index_col="Unnamed: 0")
df.head()

Unnamed: 0,Title_role,Company,City,Work_place,Description,Date_published,Work_times,Date_scrapped,Source
0,Senior Software Engineer,Chubb,"Monterrey, Nuevo León, Mexico",On-site,About the job\nJob Requirements\n\nKEY OBJECTI...,1 week ago,Full-time · Mid-Senior level,2022-12-15,https://www.linkedin.com/jobs/view/3395197492/...
1,Programador y sistemas,OFIK S.A. DE C.V.,"Santa Catarina, Nuevo León, Mexico",Hybrid,About the job\nPuesto de programación y sistem...,3 hours ago,Full-time,2022-12-15,https://www.linkedin.com/jobs/view/3353688157/...
2,Data Engineering,Accenture,"Monterrey, Nuevo León, Mexico",On-site,About the job\nDARE TO BE A PART OF THE CHALLE...,1 week ago,Full-time · Entry level,2022-12-15,https://www.linkedin.com/jobs/view/3375598097/...
3,Power BI Developer,Johnson Controls,"San Pedro Garza García, Nuevo León, Mexico",On-site,About the job\nJob Details\n\nWhat You Will Do...,1 week ago,Full-time · Mid-Senior level,2022-12-15,https://www.linkedin.com/jobs/view/3391869575/...
4,Data intern,Schneider Electric,"Ciudad Apodaca, Nuevo León, Mexico",On-site,About the job\nSchneider Electric esta buscand...,1 week ago,Temporary · Internship,2022-12-15,https://www.linkedin.com/jobs/view/3390790194/...


### Remove rows with null fields

In [5]:
eda_obj.drop_records_with_na(df)

Dataset row count before dropping NAs 481, now it is 452. 29 records removed


### Standardize words by removing accents

In [6]:
df["City"] = df["City"].apply(eda_obj.remove_unicode_in_column)
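
The notebook relies on the `unidecode` package for this step. For Latin-script accents like the ones in the `City` column, a rough standard-library equivalent can be sketched as follows. The helper name `strip_accents` is hypothetical, and unlike `unidecode` it does not transliterate non-Latin scripts.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Drop combining marks after NFKD decomposition (stdlib-only sketch)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("Monterrey, Nuevo León, Mexico"))
# prints: Monterrey, Nuevo Leon, Mexico
```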

### EDA 1

We apply a group by to see how clean the fields of our dataset are.
We need to make sure these fields can be made as categorical as possible.

In [7]:
eda_obj.group_by_df(df,["Work_place"])   

Work_place
Hybrid      29
On-site    114
Remote     309
dtype: int64

In [8]:
eda_obj.group_by_df(df,["Work_times"])  

Work_times
Contract · Entry level                                         2
Contract · Mid-Senior level                                    8
Full-time                                                     33
Full-time · Associate                                         18
Full-time · Director                                           5
Full-time · Entry level                                      165
Full-time · Executive                                          9
Full-time · Internship                                         1
Full-time · Mid-Senior level                                 202
Internship · Internship                                        3
MX$600,000/yr - MX$1,080,000/yr · Full-time · Entry level      2
Part-time · Entry level                                        2
Temporary · Internship                                         2
dtype: int64

In [9]:
df["Work_times"] = df["Work_times"].apply(eda_obj.get_work_time)

In [10]:
eda_obj.group_by_df(df,["Work_times"]) 

Work_times
Contract       10
Full-time     435
Internship      3
Part-time       2
Temporary       2
dtype: int64
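
The counts above are consistent with keeping only the leftmost contract type found in each value. As an illustration of that idea without spaCy, here is a regex-based sketch. The list of contract types is an assumption, since the real patterns live in `data/tiempo_de_contratacion.json`, and `first_work_time` is a hypothetical helper.

```python
import re

# Assumed contract types; the notebook loads its patterns from a JSON file.
WORK_TIME_PATTERN = re.compile(
    r"full-time|part-time|contract|temporary|internship", re.IGNORECASE
)

def first_work_time(raw: str) -> str:
    """Return the leftmost contract type in raw, or '' when none is found."""
    found = WORK_TIME_PATTERN.search(raw)
    return found.group(0) if found else ""

print(first_work_time("MX$600,000/yr - MX$1,080,000/yr · Full-time · Entry level"))
# prints: Full-time
```

Because only the first hit is kept, a value such as `Temporary · Internship` is counted as `Temporary`, which matches the grouped output.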

### EDA 2
1. We detect the language of each job description and keep only the ones written in English.
2. We extract the keywords from the position titles to categorize each position as much as possible.
3. We check that every source really is a URL.
4. We standardize the company name of each position.

In [11]:
eda_obj.group_by_df(df,["Title_role"]) 

Title_role
.NET Developer - Remote!!!                                                                      1
Acceleration Center - Managed Services - Salesforce Developer - Senior Assoc.                   1
Acceleration Center, Products & Technology - Digital - Database Developer - Senior Associate    2
Account Architect Data                                                                          1
Analista Información BI                                                                         1
Analista de Bases de Datos Oracle                                                               1
Analista de Dados Sênior - Trabalho Remoto |LATAM|                                              1
Analista de Datos                                                                               2
Analista de Información y Estadística                                                           1
Analyst, Data and Analytics                                                                     1
Analytics

In [12]:
df["Description_language"] = df["Description"].apply(eda_obj.detect_language)
eda_obj.group_by_df(df,["Description_language"])

Description_language
en    381
es     64
pt      7
dtype: int64

In [13]:
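# Copy the filtered rows so that later column assignments do not trigger SettingWithCopyWarning
df = df[df["Description_language"] == "en"].copy()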

In [14]:
df["Title_role_keywords"] = df["Title_role"].apply(eda_obj.get_keywords_from_title_role)

In [15]:
eda_obj.group_by_df(df,["Title_role_keywords"])

Title_role_keywords
                                           8
analyst                                    3
analytics aws senior engineer              1
analytics bi engineer                      1
analytics bi specialist                    1
analytics cloud data solution architect    1
analytics data                             1
analytics data analyst                     1
analytics data architect                   1
analytics data developer                   1
analytics data engineer                    1
analytics data intern                      1
analytics data lead                        2
analytics data senior engineer             1
analytics engineer                         2
analytics intern                           1
analytics senior engineer                  1
analytics software engineer                1
application support                        1
architect                                  1
aws azure data engineer                    1
aws cloud architect                
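
Part of the keyword extraction is dropping terms that are contained in a longer term from the same title. A standalone sketch of that step, equivalent in spirit to `remove_if_word_is_inside_another_word_within_list` (the function name below is hypothetical):

```python
def drop_contained_terms(terms):
    """Keep only the terms that are not substrings of a different term."""
    kept = []
    for candidate in terms:
        contained = any(
            candidate != other and candidate in other for other in terms
        )
        if not contained:
            kept.append(candidate)
    return kept

print(drop_contained_terms(["bi", "power bi", "business intelligence"]))
# prints: ['power bi', 'business intelligence']
```

Note that the check is a plain substring test, so a list containing both `java` and `javascript` would lose `java`. Whether that matters depends on the patterns in the technology file.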

In [16]:
df["Source"] = df["Source"].apply(eda_obj.is_url)
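
Here `is_url` replaces each source with the URL-like text that the matcher finds, or with an empty string when nothing matches. As a point of comparison, a stricter standard-library check could look like the sketch below. It is not the spaCy `LIKE_URL` heuristic used in the class, and `looks_like_url` is a hypothetical helper.

```python
from urllib.parse import urlparse

def looks_like_url(value: str) -> bool:
    """True when value has an http(s) scheme and a network location."""
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

print(looks_like_url("https://www.linkedin.com/jobs/view/3395197492/"))
# prints: True
```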

In [17]:
df.head()

Unnamed: 0,index,Title_role,Company,City,Work_place,Description,Date_published,Work_times,Date_scrapped,Source,Description_language,Title_role_keywords
0,0,Senior Software Engineer,Chubb,"Monterrey, Nuevo Leon, Mexico",On-site,About the job\nJob Requirements\n\nKEY OBJECTI...,1 week ago,Full-time,2022-12-15,https://www.linkedin.com/jobs/view/3395197492/...,en,senior software engineer
2,2,Data Engineering,Accenture,"Monterrey, Nuevo Leon, Mexico",On-site,About the job\nDARE TO BE A PART OF THE CHALLE...,1 week ago,Full-time,2022-12-15,https://www.linkedin.com/jobs/view/3375598097/...,en,data engineer
3,3,Power BI Developer,Johnson Controls,"San Pedro Garza Garcia, Nuevo Leon, Mexico",On-site,About the job\nJob Details\n\nWhat You Will Do...,1 week ago,Full-time,2022-12-15,https://www.linkedin.com/jobs/view/3391869575/...,en,power bi developer
5,5,Data Analyst,Honeywell,"Monterrey, Nuevo Leon, Mexico",On-site,About the job\nJoin a team recognized for lead...,5 days ago,Full-time,2022-12-15,https://www.linkedin.com/jobs/view/3366337825/...,en,data analyst
8,8,Data Architect,GM Financial,"Monterrey, Nuevo Leon, Mexico",On-site,About the job\nOverview\n\nThe Data Architect ...,2 days ago,Full-time,2022-12-15,https://www.linkedin.com/jobs/view/3396144916/...,en,data architect


In [18]:
eda_obj.drop_column(df,"Date_published")
df.reset_index(inplace = True)

In [19]:
eda_obj.group_by_df(df,["Company"])

Company
AFL                                                         1
AI Fund                                                     1
AXEN IT Consulting                                          1
Accenture                                                   5
Acklen Avenue                                               2
ActiveSoft, Inc                                             1
AdventInfotech                                              2
Alio IT Solutions                                           1
Alluxi                                                      1
Angi                                                        1
Anthology Inc                                               3
Apex Systems                                                1
Atos                                                        1
AutoZone                                                    2
Autodesk                                                    2
Axented                                                     4


In [20]:
df["Company"] = df["Company"].apply(eda_obj.get_company_name)

In [21]:
eda_obj.group_by_df(df,["Company"])

Company
AFL                                                         1
AI Fund                                                     1
AXEN IT Consulting                                          1
Accenture                                                   5
Acklen Avenue                                               2
ActiveSoft, Inc                                             1
AdventInfotech                                              2
Alio IT Solutions                                           1
Alluxi                                                      1
Angi                                                        1
Anthology Inc                                               3
Apex Systems                                                1
Atos                                                        1
AutoZone                                                    2
Autodesk                                                    2
Axented                                                     4


We improve the presentation of our dataset by reordering the columns.

In [22]:
df = eda_obj.sort_columns(
    df,[
        "index",
        "Title_role",
        "Title_role_keywords",
        "Company",
        "City",
        "Work_place",
        "Work_times",
        "Description_language",
        "Date_scrapped",
        "Description",
        "Source"]
)

df.head()

Unnamed: 0,index,Title_role,Title_role_keywords,Company,City,Work_place,Work_times,Description_language,Date_scrapped,Description,Source
0,0,Senior Software Engineer,senior software engineer,Chubb,"Monterrey, Nuevo Leon, Mexico",On-site,Full-time,en,2022-12-15,About the job\nJob Requirements\n\nKEY OBJECTI...,https://www.linkedin.com/jobs/view/3395197492/...
1,2,Data Engineering,data engineer,Accenture,"Monterrey, Nuevo Leon, Mexico",On-site,Full-time,en,2022-12-15,About the job\nDARE TO BE A PART OF THE CHALLE...,https://www.linkedin.com/jobs/view/3375598097/...
2,3,Power BI Developer,power bi developer,Johnson Controls,"San Pedro Garza Garcia, Nuevo Leon, Mexico",On-site,Full-time,en,2022-12-15,About the job\nJob Details\n\nWhat You Will Do...,https://www.linkedin.com/jobs/view/3391869575/...
3,5,Data Analyst,data analyst,Honeywell,"Monterrey, Nuevo Leon, Mexico",On-site,Full-time,en,2022-12-15,About the job\nJoin a team recognized for lead...,https://www.linkedin.com/jobs/view/3366337825/...
4,8,Data Architect,data architect,GM Financial,"Monterrey, Nuevo Leon, Mexico",On-site,Full-time,en,2022-12-15,About the job\nOverview\n\nThe Data Architect ...,https://www.linkedin.com/jobs/view/3396144916/...


### Cleaning the descriptions
We standardize the descriptions so that the named entity recognition step is easier.

Six new columns are derived from the description:
1. description_formated: words replaced using the replacement dictionary
2. description_formated2: words replaced using the replacement dictionary, lowercased, and without punctuation
3. description_formated3: words replaced using the replacement dictionary, lowercased, without punctuation, and with year ranges formatted
4. description_formated4: same as 3 but without stop words
5. description_formated4_stemmed: stemmed version of description_formated4
6. description_formated4_lemma: lemmatized version of description_formated4

In [23]:
df["description_formated"] = df["Description"].apply(eda_obj.basic_clean_descriptions)

In [24]:
df["description_formated2"] = df["Description"].apply(eda_obj.intermediate_clean_description_lowercased_without_punctuations)

In [25]:
df["description_formated3"] = df["Description"].apply(eda_obj.intermediate_clean_description_lowercased_without_punctuations_formating_year_ranges)

In [26]:
df["description_formated4"] = df["Description"].apply(lambda x:eda_obj.intermediate_clean_description_lowercased_without_punctuations_formating_year_ranges(x,with_stop_words = False))

In [27]:
df["description_formated4_stemmed"] = df["description_formated4"].apply(eda_obj.stem_description)

In [28]:
df["description_formated4_lemma"] = df["description_formated4"].apply(eda_obj.lemma_description)
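The first four cleaning stages can be illustrated with a small standard-library sketch. This is not the `DatasetCleaner` implementation: the replacement dictionary, the stop-word set, the year-range rule ("3-5 years" becomes "3 to 5 years"), and the function names are all assumptions made for illustration. Unlike the notebook's output, this sketch also collapses newlines into single spaces.

```python
import re
import string

# Hypothetical replacement dictionary; the notebook loads its own from a JSON file.
REPLACEMENTS = {"Sr.": "Senior", "yrs": "years"}

# Tiny illustrative stop-word set; the notebook relies on NLTK's English list.
STOP_WORDS = {"a", "an", "and", "of", "the", "with", "we", "for"}

# Maps every punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def apply_replacements(text):
    """Swap known variants for a canonical form, matching whole tokens only."""
    for old, new in REPLACEMENTS.items():
        pattern = r"(?<!\w)" + re.escape(old) + r"(?!\w)"
        text = re.sub(pattern, new, text)
    return text


def format_year_ranges(text):
    """Assumed rule: rewrite '3-5 years' as '3 to 5 years'."""
    return re.sub(r"(\d+)\s*-\s*(\d+)\s+years", r"\1 to \2 years", text)


def clean(text, with_stop_words=True):
    """Replacements, year ranges, lowercase, punctuation, optional stop-word removal."""
    text = apply_replacements(text)
    text = format_year_ranges(text)  # must run before the hyphen is stripped
    text = text.lower().translate(PUNCT_TABLE)
    tokens = text.split()  # also collapses repeated whitespace
    if not with_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return " ".join(tokens)


raw = "We need a Sr. Engineer with 3-5 yrs of Python, SQL & Spark!"
print(clean(raw))
print(clean(raw, with_stop_words=False))
```

The year-range step runs before punctuation removal because the hyphen that marks the range would otherwise be lost.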

### Language of the lemmatized descriptions

We noticed that some descriptions contain two languages, so to standardize the data as much as possible we keep only the descriptions written entirely in English.

In [29]:
df["description_formated4_lemma_language"] = df["description_formated4_lemma"].apply(eda_obj.detect_language)
eda_obj.group_by_df(df,["description_formated4_lemma_language"])

description_formated4_lemma_language
en       359
en,fr     17
en,it      3
es,en      1
fr         1
dtype: int64

In [30]:
df = df[df["description_formated4_lemma_language"] == "en"]
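The filter uses an exact match on `"en"` rather than a substring check. A minimal sketch of the difference, assuming the labels are comma-separated language codes as they appear in the grouped output (the helper name `is_english_only` is hypothetical):

```python
def is_english_only(label):
    """True only when the comma-separated label contains the single code 'en'."""
    codes = [code.strip() for code in label.split(",")]
    return codes == ["en"]


labels = ["en", "en,fr", "en,it", "es,en", "fr"]

# Exact match: mixed-language rows are dropped.
english_only = [label for label in labels if is_english_only(label)]

# Substring check: mixed-language rows would slip through.
contains_english = [label for label in labels if "en" in label]

print(english_only)
print(contains_english)
```

A substring check would keep every mixed-language row, which is what the filter is meant to avoid.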

In [31]:
df = eda_obj.sort_columns(
    df,[
        "index",
        "Title_role",
        "Title_role_keywords",
        "Company",
        "City",
        "Work_place",
        "Work_times",
        "Description_language",
        "Date_scrapped",
        "Description",
        "description_formated",
        "description_formated2",
        "description_formated3",
        "description_formated4",
        "description_formated4_stemmed",
        "description_formated4_lemma",
        "description_formated4_lemma_language",
        "Source"]
)

df.head()

Unnamed: 0,index,Title_role,Title_role_keywords,Company,City,Work_place,Work_times,Description_language,Date_scrapped,Description,description_formated,description_formated2,description_formated3,description_formated4,description_formated4_stemmed,description_formated4_lemma,description_formated4_lemma_language,Source
0,0,Senior Software Engineer,senior software engineer,Chubb,"Monterrey, Nuevo Leon, Mexico",On-site,Full-time,en,2022-12-15,About the job\nJob Requirements\n\nKEY OBJECTI...,About the job\nJob Requirements\n\nKEY OBJECTI...,about the job\njob requirements\n\nkey objecti...,about the job\njob requirements\n\nkey objecti...,job job requirements key objective looking sr ...,job job requir key object look sr softw engin ...,job job requirement key objective look sr soft...,en,https://www.linkedin.com/jobs/view/3395197492/...
1,2,Data Engineering,data engineer,Accenture,"Monterrey, Nuevo Leon, Mexico",On-site,Full-time,en,2022-12-15,About the job\nDARE TO BE A PART OF THE CHALLE...,About the job\nDARE TO BE A PART OF THE CHALLE...,about the job\ndare to be a part of the challe...,about the job\ndare to be a part of the challe...,job dare part challenge come join team togethe...,job dar part challeng com join team togeth mak...,job dare part challenge come join team togethe...,en,https://www.linkedin.com/jobs/view/3375598097/...
2,3,Power BI Developer,power bi developer,Johnson Controls,"San Pedro Garza Garcia, Nuevo Leon, Mexico",On-site,Full-time,en,2022-12-15,About the job\nJob Details\n\nWhat You Will Do...,About the job\nJob Details\n\nWhat You Will Do...,about the job\njob details\n\nwhat you will do...,about the job\njob details\n\nwhat you will do...,job job details responsible analyzing bi needs...,job job detail respons analys bi nee org recom...,job job detail responsible analyze bi need org...,en,https://www.linkedin.com/jobs/view/3391869575/...
3,5,Data Analyst,data analyst,Honeywell,"Monterrey, Nuevo Leon, Mexico",On-site,Full-time,en,2022-12-15,About the job\nJoin a team recognized for lead...,About the job\nJoin a team recognized for lead...,about the job\njoin a team recognized for lead...,about the job\njoin a team recognized for lead...,job join team recognized leadership innovation...,job join team recogn lead innov divers join co...,job join team recognize leadership innovation ...,en,https://www.linkedin.com/jobs/view/3366337825/...
4,8,Data Architect,data architect,GM Financial,"Monterrey, Nuevo Leon, Mexico",On-site,Full-time,en,2022-12-15,About the job\nOverview\n\nThe Data Architect ...,About the job\nOverview\n\nThe Data Architect ...,about the job\noverview\n\nthe data architect ...,about the job\noverview\n\nthe data architect ...,job overview data architect ii responsible des...,job overview dat architect ii respons design d...,job overview datum architect ii responsible de...,en,https://www.linkedin.com/jobs/view/3396144916/...


### Bag of words

Build the bag-of-words dataset, where each value is the number of times the word appears in the description.

In [32]:
df_bow = eda_obj.create_bow_df(df,"index","description_formated4_lemma")
df_bow.head()

Unnamed: 0,00,000,0088oe,00mxn,00pm,03f1qietxt,052022,062022,07,0auth,...,yrs,yxoo6xu5sm,z20m9pnics,zabbix,zapier,zero,zf,zone,zookeeper,zqwyqyj5of
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
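To make the table above concrete, here is a minimal bag-of-words sketch using only the standard library. It is an illustration of the idea, not the body of `create_bow_df`, which presumably builds on the `CountVectorizer` imported at the top of the notebook. The function name `bag_of_words` and the sample documents are assumptions.

```python
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, rows), where rows[i][j] counts vocabulary[j] in docs[i]."""
    counts = [Counter(doc.split()) for doc in docs]
    # One column per distinct token, sorted alphabetically.
    vocabulary = sorted(set().union(*counts))
    rows = [[doc_counts[term] for term in vocabulary] for doc_counts in counts]
    return vocabulary, rows


docs = [
    "job data engineer data pipeline",
    "job data analyst",
]
vocabulary, rows = bag_of_words(docs)
print(vocabulary)
for row in rows:
    print(row)
```

Each row corresponds to one description and each column to one token, so most cells are zero once the vocabulary grows, as in the output above. Note that `CountVectorizer` with its default settings ignores single-character tokens, which this sketch does not.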


In [34]:
eda_obj.pandas_df_to_csv(df,"data/eda_cleaned_result.csv")