# Proyecto DataScience III - Análisis de publicaciones de ofertas de trabajo y fraude

En esta notebook se analizarán las ofertas de trabajo de un dataset de la siguiente fuente: [Link](https:/https://www.kaggle.com/datasets/shivamb/real-or-fake-fake-jobposting-prediction/).

La misma cuenta con un dataset de 18 columnas x 18000 filas (aprox) con distintas ofertas de trabajo de las cuales algunas son fraudulentas. Se tienen las siguientes columnas:

*  Job_id : ID de la oferta de trabajo.
*  title : Titulo de la oferta de empleo
*  location: ubicación geográfica de la oferta
*  department: departamento de la compañía
*  salary_range: Rango del salario
*  company_profile: Perfil de la compañia
*  description: Descripción de la oferta de trabajo
*  requirements: requisitos del puesto
*  benefits: beneficios del puesto de trabajo
*  telecommuting: Indica si el trabajo es presencial o remoto - 1 = remoto; 0 = presencial.
*  has_company_logo: indica si la oferta contaba con el logo de la empresa. 1 = contaba. 0 = no cuenta.
*  has_questions: Verdadero si contaba con preguntas filtro
*  employment_type: Full time, part-time, contrato, etc.
*  required_experience: Experiencia requerida
*  required_education: Educación requerida
*  Industry: Industria en la que se basa.
*  function: Función del trabajo
*  fraudulent: indica si es fraudulenta o no la oferta. 0 no es fraudulenta. 1 es fraudulenta - variable objetivo.


Importación de librerías

In [1]:
%%time
import gzip
import json
import string
import pandas as pd
import plotly
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
import zipfile
import nltk
nltk.download('stopwords')
nltk.download('punkt')
import zipfile
import requests
import os

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


CPU times: user 5.91 s, sys: 1.11 s, total: 7.02 s
Wall time: 22.3 s


Importación de archivo

In [2]:
url = 'https://github.com/RodrigoTGonzalez/Gonzalez_DataScienceIII_Proyecto/raw/main/fake_job_postings.csv.zip'
zip_file_name = 'fake_job_postings.csv.zip'
response = requests.get(url)
with open(zip_file_name, 'wb') as file:
    file.write(response.content)
with zipfile.ZipFile(zip_file_name, 'r') as zip_ref:
    zip_ref.extractall()
csv_file_name = 'fake_job_postings.csv'
df = pd.read_csv(csv_file_name)
os.remove(zip_file_name)

In [3]:
df

Unnamed: 0,job_id,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent
0,1,Marketing Intern,"US, NY, New York",Marketing,,"We're Food52, and we've created a groundbreaki...","Food52, a fast-growing, James Beard Award-winn...",Experience with content management systems a m...,,0,1,0,Other,Internship,,,Marketing,0
1,2,Customer Service - Cloud Video Production,"NZ, , Auckland",Success,,"90 Seconds, the worlds Cloud Video Production ...",Organised - Focused - Vibrant - Awesome!Do you...,What we expect from you:Your key responsibilit...,What you will get from usThrough being part of...,0,1,0,Full-time,Not Applicable,,Marketing and Advertising,Customer Service,0
2,3,Commissioning Machinery Assistant (CMA),"US, IA, Wever",,,Valor Services provides Workforce Solutions th...,"Our client, located in Houston, is actively se...",Implement pre-commissioning and commissioning ...,,0,1,0,,,,,,0
3,4,Account Executive - Washington DC,"US, DC, Washington",Sales,,Our passion for improving quality of life thro...,THE COMPANY: ESRI – Environmental Systems Rese...,"EDUCATION: Bachelor’s or Master’s in GIS, busi...",Our culture is anything but corporate—we have ...,0,1,0,Full-time,Mid-Senior level,Bachelor's Degree,Computer Software,Sales,0
4,5,Bill Review Manager,"US, FL, Fort Worth",,,SpotSource Solutions LLC is a Global Human Cap...,JOB TITLE: Itemization Review ManagerLOCATION:...,QUALIFICATIONS:RN license in the State of Texa...,Full Benefits Offered,0,1,1,Full-time,Mid-Senior level,Bachelor's Degree,Hospital & Health Care,Health Care Provider,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17875,17876,Account Director - Distribution,"CA, ON, Toronto",Sales,,Vend is looking for some awesome new talent to...,Just in case this is the first time you’ve vis...,To ace this role you:Will eat comprehensive St...,What can you expect from us?We have an open cu...,0,1,1,Full-time,Mid-Senior level,,Computer Software,Sales,0
17876,17877,Payroll Accountant,"US, PA, Philadelphia",Accounting,,WebLinc is the e-commerce platform and service...,The Payroll Accountant will focus primarily on...,- B.A. or B.S. in Accounting- Desire to have f...,Health &amp; WellnessMedical planPrescription ...,0,1,1,Full-time,Mid-Senior level,Bachelor's Degree,Internet,Accounting/Auditing,0
17877,17878,Project Cost Control Staff Engineer - Cost Con...,"US, TX, Houston",,,We Provide Full Time Permanent Positions for m...,Experienced Project Cost Control Staff Enginee...,At least 12 years professional experience.Abil...,,0,0,0,Full-time,,,,,0
17878,17879,Graphic Designer,"NG, LA, Lagos",,,,Nemsia Studios is looking for an experienced v...,1. Must be fluent in the latest versions of Co...,Competitive salary (compensation will be based...,0,0,1,Contract,Not Applicable,Professional,Graphic Design,Design,0


Analisis de dataset

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17880 entries, 0 to 17879
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   job_id               17880 non-null  int64 
 1   title                17880 non-null  object
 2   location             17534 non-null  object
 3   department           6333 non-null   object
 4   salary_range         2868 non-null   object
 5   company_profile      14572 non-null  object
 6   description          17879 non-null  object
 7   requirements         15184 non-null  object
 8   benefits             10668 non-null  object
 9   telecommuting        17880 non-null  int64 
 10  has_company_logo     17880 non-null  int64 
 11  has_questions        17880 non-null  int64 
 12  employment_type      14409 non-null  object
 13  required_experience  10830 non-null  object
 14  required_education   9775 non-null   object
 15  industry             12977 non-null  object
 16  func

Me quedo solo con las columnas title, company_profile, description y fraudulent ya que buscaremos hacer un análsis de texto.

In [5]:
df=df[["company_profile","description","fraudulent"]]

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17880 entries, 0 to 17879
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   company_profile  14572 non-null  object
 1   description      17879 non-null  object
 2   fraudulent       17880 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 419.2+ KB


Son pocos nulos, asi que los relleno con un espacio.

In [7]:
df = df.fillna(" ")

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17880 entries, 0 to 17879
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   company_profile  17880 non-null  object
 1   description      17880 non-null  object
 2   fraudulent       17880 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 419.2+ KB


In [9]:
df.shape

(17880, 3)

Paso todas las columnas de texto a minuscula

In [10]:
df['company_profile'] = df['company_profile'].str.lower()
df['description'] = df['description'].str.lower()

Elimino los carácteres no deseados

In [11]:
df['company_profile'] = df['company_profile'].str.replace(r'[^a-záéíóúñü\s]', '', regex=True)
df['description'] = df['description'].str.replace(r'[^a-záéíóúñü\s]', '', regex=True)

In [12]:
df

Unnamed: 0,company_profile,description,fraudulent
0,were food and weve created a groundbreaking an...,food a fastgrowing james beard awardwinning on...,0
1,seconds the worlds cloud video production ser...,organised focused vibrant awesomedo you hav...,0
2,valor services provides workforce solutions th...,our client located in houston is actively seek...,0
3,our passion for improving quality of life thro...,the company esri environmental systems resear...,0
4,spotsource solutions llc is a global human cap...,job title itemization review managerlocation f...,0
...,...,...,...
17875,vend is looking for some awesome new talent to...,just in case this is the first time youve visi...,0
17876,weblinc is the ecommerce platform and services...,the payroll accountant will focus primarily on...,0
17877,we provide full time permanent positions for m...,experienced project cost control staff enginee...,0
17878,,nemsia studios is looking for an experienced v...,0


Tokenización

In [13]:
df['tokens-1'] = df['company_profile'].apply(word_tokenize)
df['tokens-2'] = df['description'].apply(word_tokenize)
df[['company_profile','description','tokens-1','tokens-2']].head()

Unnamed: 0,company_profile,description,tokens-1,tokens-2
0,were food and weve created a groundbreaking an...,food a fastgrowing james beard awardwinning on...,"[were, food, and, weve, created, a, groundbrea...","[food, a, fastgrowing, james, beard, awardwinn..."
1,seconds the worlds cloud video production ser...,organised focused vibrant awesomedo you hav...,"[seconds, the, worlds, cloud, video, productio...","[organised, focused, vibrant, awesomedo, you, ..."
2,valor services provides workforce solutions th...,our client located in houston is actively seek...,"[valor, services, provides, workforce, solutio...","[our, client, located, in, houston, is, active..."
3,our passion for improving quality of life thro...,the company esri environmental systems resear...,"[our, passion, for, improving, quality, of, li...","[the, company, esri, environmental, systems, r..."
4,spotsource solutions llc is a global human cap...,job title itemization review managerlocation f...,"[spotsource, solutions, llc, is, a, global, hu...","[job, title, itemization, review, managerlocat..."


Llamo a las stopwords del vocabulario en ingles

In [14]:
stop_words = set(stopwords.words('english'))

Elimino las stopwords

In [15]:
# Eliminamos las stopwords
df['tokens_sin_stopwords-1'] = df['tokens-1'].apply(lambda x: [word for word in x if word not in stop_words])
df['tokens_sin_stopwords-2'] = df['tokens-2'].apply(lambda x: [word for word in x if word not in stop_words])

# Mostramos el resultado
df[['tokens-1', 'tokens_sin_stopwords-1']].head()

Unnamed: 0,tokens-1,tokens_sin_stopwords-1
0,"[were, food, and, weve, created, a, groundbrea...","[food, weve, created, groundbreaking, awardwin..."
1,"[seconds, the, worlds, cloud, video, productio...","[seconds, worlds, cloud, video, production, se..."
2,"[valor, services, provides, workforce, solutio...","[valor, services, provides, workforce, solutio..."
3,"[our, passion, for, improving, quality, of, li...","[passion, improving, quality, life, geography,..."
4,"[spotsource, solutions, llc, is, a, global, hu...","[spotsource, solutions, llc, global, human, ca..."


Lematizo

In [16]:
!pip install nltk spacy sklearn
!python -m spacy download en_core_web_sm

Collecting sklearn
  Downloading sklearn-0.0.post12.tar.gz (2.6 kB)
  [1;31merror[0m: [1msubprocess-exited-with-error[0m
  
  [31m×[0m [32mpython setup.py egg_info[0m did not run successfully.
  [31m│[0m exit code: [1;36m1[0m
  [31m╰─>[0m See above for output.
  
  [1;35mnote[0m: This error originates from a subprocess, and is likely not a problem with pip.
  Preparing metadata (setup.py) ... [?25l[?25herror
[1;31merror[0m: [1mmetadata-generation-failed[0m

[31m×[0m Encountered error while generating package metadata.
[31m╰─>[0m See above for output.

[1;35mnote[0m: This is an issue with the package mentioned above, not pip.
[1;36mhint[0m: See above for details.
Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.8/12.8 MB[0m [31m82.0 MB/s[0m eta [36m0:00:00[0m
[

In [17]:
# Cargamos el modelo de spaCy en ingles
nlp = spacy.load('en_core_web_sm')

# Lematización usando spaCy
def lemmatize_text(text):
    doc = nlp(' '.join(text))
    return [token.lemma_ for token in doc]

df['lemmas-1'] = df['tokens_sin_stopwords-1'].apply(lemmatize_text)
df['lemmas-2'] = df['tokens_sin_stopwords-2'].apply(lemmatize_text)

# Mostramos las palabras lematizadas
df[['tokens_sin_stopwords-1', 'lemmas-1']].head()

Unnamed: 0,tokens_sin_stopwords-1,lemmas-1
0,"[food, weve, created, groundbreaking, awardwin...","[food, we, ve, create, groundbreaking, awardwi..."
1,"[seconds, worlds, cloud, video, production, se...","[second, world, cloud, video, production, serv..."
2,"[valor, services, provides, workforce, solutio...","[valor, service, provide, workforce, solution,..."
3,"[passion, improving, quality, life, geography,...","[passion, improve, quality, life, geography, h..."
4,"[spotsource, solutions, llc, global, human, ca...","[spotsource, solution, llc, global, human, cap..."


APLICO TF-IDF

In [18]:
df['lemmas_str-1'] = df['lemmas-1'].apply(lambda x: ' '.join(x))
df['lemmas_str-2'] = df['lemmas-2'].apply(lambda x: ' '.join(x))

In [22]:
tfidf_vectorizer = TfidfVectorizer(max_features=100, stop_words=stopwords.words('english'))
tfidf_matrix1 = tfidf_vectorizer.fit_transform(df['lemmas_str-1'])
tfidf_matrix2 = tfidf_vectorizer.fit_transform(df['lemmas_str-2'])


tfidf_df1 = pd.DataFrame(tfidf_matrix1.toarray(), columns=tfidf_vectorizer.get_feature_names_out())


tfidf_df1.head()

Unnamed: 0,ability,able,account,also,amp,application,apply,base,build,business,...,time,use,user,want,web,website,well,within,work,year
0,0.0,0.0,0.178039,0.0,0.0,0.188716,0.0,0.0,0.0,0.0,...,0.0,0.177207,0.0,0.0,0.178513,0.0,0.165731,0.0,0.0,0.0
1,0.0,0.149902,0.0,0.0,0.0,0.0,0.0,0.0,0.284013,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.095628,0.492646,0.0
2,0.190691,0.0,0.0,0.160699,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.170132,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.240683,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.405699,0.0
4,0.147259,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [21]:
tfidf_df2 = pd.DataFrame(tfidf_matrix2.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

tfidf_df2.head()

Unnamed: 0,ability,able,account,also,amp,application,apply,base,build,business,...,time,use,user,want,web,website,well,within,work,year
0,0.0,0.0,0.0,0.0,0.257422,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.337069,0.0,0.0,0.312997,0.0
1,0.0,0.0,0.133461,0.0,0.104566,0.0,0.0,0.101043,0.0,0.255843,...,0.107986,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.127141,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.720407,0.0,0.0,0.0,0.061417,0.0,0.057125,0.276202,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.124081,0.034315,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.184451,0.0,0.0,0.0,0.0,0.0,0.344761,0.0,0.108585,0.0
