# Capstone - Fake Job Posting - Data Wrangling and Exploratory Data Analysis

### Table of contents
1. [Background](#Background)
     -   1.1 [Data Source](#Data-Source)
     -   1.2 [Objective](#Objective)
     
     
2. [Loading Data](#Loading-Data)
     -   2.1 [Load libraries](#Load-libraries)
     -   2.2 [Load Dataset](#Load-Dataset)
     
     
3. [Data Quality Check](#Data-Quality-Check)
     -   3.1 [Duplicate Values](#Duplicate-Values)
     -   3.2 [Missing Values](#Missing-Values)
     
     
4. [Data Wrangling](#Data-Wrangling)
     -   4.1 [Split the location in Country, State and City](#Split-the-location-in-Country-State-and-City)
     -   4.2 [Split the salary range column to minimum and maximum](#Split-the-salary-range-column-to-minimum-and-maximum)
     -   4.3 [Dealing with missing values](#Dealing-with-missing-values)

## Background
    
Scammers advertise jobs the same way legitimate employers do—online (in ads, on job sites, college employment sites, and social media), in newspapers, and sometimes on TV and radio. They promise you a job, but what they want is your money and your personal information.

Fake Job or Employment Scams occur when criminal actors deceive victims into believing they have a job or a potential job. Criminals leverage their position as “employers” to persuade victims to provide them with personally identifiable information (PII), become unwitting money mules, or to send them money.

Fake Job Scams have existed for a long time but technology has made this scam easier and more lucrative. Cyber criminals now pose as legitimate employers by spoofing company websites and posting fake job openings on popular online job boards. They conduct false interviews with unsuspecting applicant victims, then request PII and/or money from these individuals. 


https://www.fbi.gov/contact-us/field-offices/elpaso/news/press-releases/fbi-warns-cyber-criminals-are-using-fake-job-listings-to-target-applicants-personally-identifiable-information

## What is a Fake Job Posting?

A fake job posting is a (rarely) smartly designed type of scam aimed at job seekers for a variety of unprofessional reasons. Still, these scams can look legit to an unsuspicious person scrolling through the vast pool of jobs. And although most tech talents aren’t actively looking for a new employer, falling for a phantom ad is still realistic. How so?

Scammers will sometimes go the extra mile to draw the attention of their target audience, more often than not, by offering incredibly high salary ranges or another sort of advantage that seems too good to be true. So, make sure to remember this: when a JD seems like a dream come true, do a thorough background check on the company or recruitment agency advertising it. Search through their website, social media, and various job boards before you take a leap of faith and end up wasting your time on a dead-end hiring process, or worse. 

https://www.omnesgroup.com/fake-job-posting/

#### 1.1 Data Source

This dataset contains 18K job descriptions out of which about 800 are fake. The data consists of both textual information and meta-information about the jobs. The dataset can be used to create classification models which can learn the job descriptions which are fraudulent.

https://www.kaggle.com/datasets/shivamb/real-or-fake-fake-jobposting-prediction

#### 1.2 Problem Statement / Objective
To predict fraudulent job postings in the dataset.

## Loading Data 

#### 2.1 Load Libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
sns.set_theme(style='darkgrid')
import matplotlib.pyplot as plt
import collections, re

<div class="alert alert-success">
  <strong>Success!</strong> Successfully loaded all the required libraries.
</div>

#### 2.2 Load Data Set 

In [None]:
df = pd.read_csv('fake_job_postings.csv')

<div class="alert alert-success">
  <strong>Success!</strong> Successfully loaded data.
</div>

In [None]:
df.head(2)

In [None]:
df.shape

In [None]:
df.describe()

In [None]:
df.info()

## Data Quality Check

#### 3.1 Duplicate Values

In [None]:
# duplicated() marks rows that repeat an earlier row across all columns
if df.duplicated().any():
    print("There are duplicated records in the fake job posting dataset. Please identify the reasons and fix them.")
else:
    print("There are no duplicated records in the fake job posting dataset.")

<div class="alert alert-success">
  <strong>Success!</strong> There are no duplicate records in the dataset.
</div>
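
A full-row duplicate check cannot flag anything when every row carries a different identifier. The sketch below, on a toy frame with hypothetical values, shows how the result changes when the identifier column is left out of the comparison. It assumes `job_id` is unique per row, which the notebook confirms in the Job ID step.

```python
import pandas as pd

# Toy frame (hypothetical values): job_id is unique, but rows 0 and 2 share all other fields.
toy = pd.DataFrame({
    "job_id": [1, 2, 3],
    "title": ["Data Analyst", "Engineer", "Data Analyst"],
    "description": ["Analyse data", "Build things", "Analyse data"],
})

# Full-row check: job_id differs on every row, so nothing is flagged.
full_row_dupes = int(toy.duplicated().sum())

# Content check: ignore the identifier and compare the remaining columns.
content_cols = [c for c in toy.columns if c != "job_id"]
content_dupes = int(toy.duplicated(subset=content_cols).sum())

print(full_row_dupes, content_dupes)
```

On the real data the same `subset=` call could be run on `df` to see whether any postings repeat apart from their identifier.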

#### 3.2 Missing Values

In [None]:
print(" \nCount total NaN at each column in a DataFrame : \n\n",
df.isnull().sum())

#### 3.3  Graphical representation of missing values

In [None]:
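null_values = df.isnull().sum()
plt.figure(figsize=(10, 10))
# recent seaborn versions require x and y as keyword arguments
sns.barplot(x=null_values.index, y=null_values.values, color='grey')
plt.ylabel('Missing values count', fontsize=15)
# rotation must be a number (or 'vertical'), not the string '90'
plt.xticks(rotation=90, fontsize=15)
plt.show()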
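
Raw counts are easier to compare across columns when expressed as a share of all rows. A minimal sketch on a toy frame (hypothetical values; the column names only mirror the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "department": [np.nan, "sales", np.nan, np.nan],
    "salary_range": [np.nan, np.nan, "10-20", "30-40"],
})

# isnull().mean() is the fraction of missing entries per column
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

Applied to `df`, this would rank the columns by how incomplete they are, which helps decide between filling and dropping.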

## Data Wrangling 

#### Job ID

In [None]:
# number of unique job_id 
len(pd.unique(df['job_id']))

<div class="alert alert-info">
  <strong>Info!</strong> job_id is a unique identifier in the dataset.
</div>

#### Location 

In [None]:
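# Split "country, state, city" into three columns and strip the space left after each comma
location = df["location"].str.split(",", expand=True, n=2)
location = location.apply(lambda col: col.str.strip())
location.columns = ["country", "state", "city"]
df[["country", "state", "city"]] = location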

In [None]:
#drop the original location column from the dataset
df = df.drop(columns= "location")

<div class="alert alert-success">
  <strong>Success!</strong> Dropped the original location column and split location into country, state and city.
</div>
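
The behaviour of the split on incomplete entries is easy to miss. The sketch below uses toy values in an assumed "country, state, city" format (hypothetical rows, not taken from the dataset) and shows that short or missing locations end up as missing values once whitespace is stripped and empty strings are replaced.

```python
import numpy as np
import pandas as pd

# Hypothetical location strings, including an empty state and a missing value
toy = pd.DataFrame({"location": ["US, NY, New York", "GB, , London", "US", None]})

# n=2 caps the split at three parts, so commas inside the city part are kept
parts = toy["location"].str.split(",", n=2, expand=True)
parts = parts.apply(lambda col: col.str.strip())
parts = parts.replace("", np.nan)
parts.columns = ["country", "state", "city"]
print(parts)
```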

#### Department

In [None]:
# number of postings per department
print(df['department'].value_counts())

In [None]:
df['department'] = df['department'].fillna("None")

In [None]:
df['department'] = df['department'].str.lower()
# raw string avoids the invalid "\|" escape; this pattern is not used below
regex = re.compile(r'[@_!#$%^&*()<>?/\|}{~:]')

In [None]:
def clean_string(subject):
    clean_tokens = re.findall(r"(?i)\b[a-z]+\b", subject)
    clean_s = ' '.join(clean_tokens)
    return clean_s

# source: https://github.com/TommyJiang91/Fake_Job_Posting_Detection/blob/master/Data_Cleaning_and_Salary_Matching_Final.ipynb

In [None]:
df['department'] = df['department'].apply(clean_string)

In [None]:
print(df['department'].value_counts())

#### Salary Range

In [None]:
salary = df["salary_range"].str.split("-", expand= True, n= 1)
df[["min_salary", "max_salary"]] = salary

In [None]:
df = df.drop(columns= "salary_range")

<div class="alert alert-success">
  <strong>Success!</strong> Split salary_range into minimum and maximum columns and dropped the original salary_range column.
</div>
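
After the split, both salary columns still hold strings. A minimal sketch of converting them to numbers, assuming non-numeric fragments should become missing values (the toy rows, including the `"Oct-20"` entry, are hypothetical):

```python
import pandas as pd

# Hypothetical salary strings: a clean range, a non-numeric fragment, a missing value, a single number
toy = pd.DataFrame({"salary_range": ["20000-28000", "Oct-20", None, "50000"]})

bounds = toy["salary_range"].str.split("-", n=1, expand=True)

# errors="coerce" turns anything that is not a number into NaN instead of raising
toy["min_salary"] = pd.to_numeric(bounds[0], errors="coerce")
toy["max_salary"] = pd.to_numeric(bounds[1], errors="coerce")
print(toy)
```

With numeric columns, summary statistics and comparisons such as `max_salary >= min_salary` become possible.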

In [None]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer
from nltk.stem.wordnet import WordNetLemmatizer

#### Description

In [None]:
# change the datatype
df.description = df.description.astype('str')

In [None]:
#reference: https://www.andyfitzgeraldconsulting.com/writing/keyword-extraction-nlp/

In [None]:
# Create a list of stop words from nltk
stop_words = set(stopwords.words("english"))
print(sorted(stop_words))

In [None]:
# Pre-process dataset to get a cleaned and normalised text corpus
# Add word count for description column in the dataset
corpus_desc = []
df['desc_word_count'] = df['description'].apply(lambda x: len(str(x).split(" ")))
ds_count = len(df.desc_word_count)

# Create the lemmatiser once instead of on every iteration
lem = WordNetLemmatizer()

for raw_text in df['description']:
    # Remove tags (must run before punctuation is stripped, or "<" and ">" are already gone)
    text = re.sub("</?.*?>", " ", str(raw_text))

    # Remove punctuation, digits and other non-letter characters
    text = re.sub('[^a-zA-Z]', ' ', text)

    # Convert to lowercase
    text = text.lower()

    # Convert to list from string
    text = text.split()

    # Lemmatisation, skipping stop words
    text = [lem.lemmatize(word) for word in text if word not in stop_words]
    corpus_desc.append(" ".join(text))

In [None]:
# Generate word cloud for description 
from os import path
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
%matplotlib inline
wordcloud_desc = WordCloud(
                          background_color='white',
                          stopwords=stop_words,
                          max_words=200,
                          max_font_size=50, 
                          random_state=42
                         ).generate(" ".join(corpus_desc))
print(wordcloud_desc)
plt.figure( figsize=(20,10) )
plt.imshow(wordcloud_desc)
plt.axis('off')
plt.show()

In [None]:
# Tokenize the text and build a vocabulary of known words
from sklearn.feature_extraction.text import CountVectorizer
# scikit-learn expects stop_words as a list (or 'english'), not a set
cv = CountVectorizer(max_df=0.8, stop_words=list(stop_words), max_features=10000, ngram_range=(1, 3))
X = cv.fit_transform(corpus_desc)

In [None]:
# Sample the returned vector encoding the length of the entire vocabulary
list(cv.vocabulary_.keys())[:10]

In [None]:
# View most frequently occurring keywords
def get_top_n_words(corpus, n=None):
    # use the corpus passed in rather than the global corpus_desc
    vec = CountVectorizer().fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in      
                   vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], 
                       reverse=True)
    return words_freq[:n]

# Convert most freq words to dataframe for plotting bar plot, save as CSV
top_words = get_top_n_words(corpus_desc, n=20)
top_df = pd.DataFrame(top_words)
top_df.columns=["Keyword", "Frequency"]


# Barplot of most freq words
import seaborn as sns
sns.set(rc={'figure.figsize':(13,8)})
g = sns.barplot(x="Keyword", y="Frequency", data=top_df, palette="Blues_d")
g.set_xticklabels(g.get_xticklabels(), rotation=90)

In [None]:
# Most frequently occurring bigrams
def get_top_n2_words(corpus, n=None):
    # use the corpus passed in rather than the global corpus_desc
    vec1 = CountVectorizer(ngram_range=(2,2),
            max_features=2000).fit(corpus)
    bag_of_words = vec1.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in     
                  vec1.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], 
                reverse=True)
    return words_freq[:n]

# Convert most freq bigrams to dataframe for plotting bar plot, save as CSV
top2_words = get_top_n2_words(corpus_desc, n=20)
top2_df = pd.DataFrame(top2_words)
top2_df.columns=["Bi-gram", "Frequency"]


# Barplot of most freq Bi-grams
import seaborn as sns
sns.set(rc={'figure.figsize':(13,8)})
h=sns.barplot(x="Bi-gram", y="Frequency", data=top2_df, palette="Blues_d")
h.set_xticklabels(h.get_xticklabels(), rotation=90)

In [None]:
# Most frequently occurring tri-grams
def get_top_n3_words(corpus_desc, n=None):
    vec1 = CountVectorizer(ngram_range=(3,3), 
           max_features=2000).fit(corpus_desc)
    bag_of_words = vec1.transform(corpus_desc)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in     
                  vec1.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], 
                reverse=True)
    return words_freq[:n]

# Convert most freq trigrams to dataframe for plotting bar plot, save as CSV
top3_words = get_top_n3_words(corpus_desc, n=20)
top3_df = pd.DataFrame(top3_words)
top3_df.columns=["Tri-gram", "Frequency"]

# Barplot of most freq Tri-grams
import seaborn as sns
sns.set(rc={'figure.figsize':(13,8)})
j=sns.barplot(x="Tri-gram", y="Frequency", data=top3_df, palette="Blues_d")
j.set_xticklabels(j.get_xticklabels(), rotation=90)

<div class="alert alert-info">
  <strong>Info!</strong> No meaningful keywords emerged from the descriptions. It would probably be a better idea to extract keywords by industry. Due to limited time, I will leave this for future work.
</div>
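
One possible shape for the per-industry idea is to count words separately for each group. The sketch below uses only the standard library and hypothetical (group, cleaned text) pairs; the helper name `top_words_by_group` and the use of an industry-like grouping column are assumptions, not part of the notebook.

```python
from collections import Counter, defaultdict

# Hypothetical (industry, cleaned description) pairs
rows = [
    ("it", "python developer build api"),
    ("it", "python engineer cloud api"),
    ("retail", "store sales customer service"),
    ("retail", "customer sales associate"),
]

def top_words_by_group(pairs, n=2):
    """Return the n most common words for each group label."""
    counts = defaultdict(Counter)
    for group, text in pairs:
        counts[group].update(text.split())
    return {group: counter.most_common(n) for group, counter in counts.items()}

top = top_words_by_group(rows)
print(top)
```

On the real data the pairs could come from zipping a grouping column with `corpus_desc`, provided both are in the same row order.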

#### Requirements

#### Benefits

In [None]:
# number of postings per distinct benefits text
print(df['benefits'].value_counts())

###### Extracting Keywords from Benefits

In [None]:
# Pre-process dataset to get a cleaned and normalised text corpus
corpus = []
df['word_count_benefits'] = df['benefits'].apply(lambda x: len(str(x).split(" ")))
ds_count = len(df.word_count_benefits)

# Create the lemmatiser once instead of on every iteration
lem = WordNetLemmatizer()

for raw_text in df['benefits']:
    # Remove tags (must run before punctuation is stripped, or "<" and ">" are already gone)
    # Note: str() turns a missing value into the literal string 'nan'
    text = re.sub("</?.*?>", " ", str(raw_text))

    # Remove punctuation, digits and other non-letter characters
    text = re.sub('[^a-zA-Z]', ' ', text)

    # Convert to lowercase
    text = text.lower()

    # Convert to list from string
    text = text.split()

    # Lemmatisation, skipping stop words
    text = [lem.lemmatize(word) for word in text if word not in stop_words]
    corpus.append(" ".join(text))

In [None]:
#View sample pre-processed corpus item
corpus[10]

In [None]:
# Generate word cloud
from os import path
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
%matplotlib inline
wordcloud = WordCloud(
                          background_color='white',
                          stopwords=stop_words,
                          max_words=100,
                          max_font_size=50, 
                          random_state=42
                         ).generate(" ".join(corpus))
print(wordcloud)
plt.figure( figsize=(20,10) )
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

In [None]:
# Tokenize the text and build a vocabulary of known words
from sklearn.feature_extraction.text import CountVectorizer
# scikit-learn expects stop_words as a list (or 'english'), not a set
cv = CountVectorizer(max_df=0.8, stop_words=list(stop_words), max_features=10000, ngram_range=(1, 3))
X = cv.fit_transform(corpus)

In [None]:
# Sample the returned vector encoding the length of the entire vocabulary
list(cv.vocabulary_.keys())[:10]

In [None]:
# View most frequently occurring keywords
def get_top_n_words(corpus, n=None):
    vec = CountVectorizer().fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in      
                   vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], 
                       reverse=True)
    return words_freq[:n]

# Convert most freq words to dataframe for plotting bar plot, save as CSV
top_words = get_top_n_words(corpus, n=20)
top_df = pd.DataFrame(top_words)
top_df.columns=["Keyword", "Frequency"]

# Barplot of most freq words
import seaborn as sns
sns.set(rc={'figure.figsize':(13,8)})
g = sns.barplot(x="Keyword", y="Frequency", data=top_df, palette="Blues_d")
g.set_xticklabels(g.get_xticklabels(), rotation=90)

In [None]:
# Most frequently occurring bigrams
def get_top_n2_words(corpus, n=None):
    vec1 = CountVectorizer(ngram_range=(2,2),  
            max_features=2000).fit(corpus)
    bag_of_words = vec1.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in     
                  vec1.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], 
                reverse=True)
    return words_freq[:n]

# Convert most freq bigrams to dataframe for plotting bar plot, save as CSV
top2_words = get_top_n2_words(corpus, n=20)
top2_df = pd.DataFrame(top2_words)
top2_df.columns=["Bi-gram", "Frequency"]

# Barplot of most freq Bi-grams
import seaborn as sns
sns.set(rc={'figure.figsize':(13,8)})
h=sns.barplot(x="Bi-gram", y="Frequency", data=top2_df, palette="Blues_d")
h.set_xticklabels(h.get_xticklabels(), rotation=90)

In [None]:
# Most frequently occurring tri-grams
def get_top_n3_words(corpus, n=None):
    vec1 = CountVectorizer(ngram_range=(3,3), 
           max_features=2000).fit(corpus)
    bag_of_words = vec1.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in     
                  vec1.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], 
                reverse=True)
    return words_freq[:n]

# Convert most freq trigrams to dataframe for plotting bar plot, save as CSV
top3_words = get_top_n3_words(corpus, n=20)
top3_df = pd.DataFrame(top3_words)
top3_df.columns=["Tri-gram", "Frequency"]

# Barplot of most freq Tri-grams
import seaborn as sns
sns.set(rc={'figure.figsize':(13,8)})
j=sns.barplot(x="Tri-gram", y="Frequency", data=top3_df, palette="Blues_d")
j.set_xticklabels(j.get_xticklabels(), rotation=90)

In [None]:
# Get TF-IDF (term frequency/inverse document frequency) -- 
# TF-IDF lists word frequency scores that highlight words that 
# are more important to the context rather than those that 
# appear frequently across documents

from sklearn.feature_extraction.text import TfidfTransformer 
tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True)
tfidf_transformer.fit(X)

# Get feature names (get_feature_names() was removed in scikit-learn 1.2)
feature_names = cv.get_feature_names_out()
 
# Fetch document for which keywords needs to be extracted
doc=corpus[ds_count-1]
 
# Generate tf-idf for the given document
tf_idf_vector=tfidf_transformer.transform(cv.transform([doc]))

In [None]:
# Sort tf_idf in descending order
from scipy.sparse import coo_matrix
def sort_coo(coo_matrix):
    tuples = zip(coo_matrix.col, coo_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)
 
def extract_topn_from_vector(feature_names, sorted_items, topn=25):
    
    # Use only topn items from vector
    sorted_items = sorted_items[:topn]
    score_vals = []
    feature_vals = []
    
    # Word index and corresponding tf-idf score
    for idx, score in sorted_items:
        
        # Keep track of feature name and its corresponding score
        score_vals.append(round(score, 3))
        feature_vals.append(feature_names[idx])
 
    # Create tuples of feature,score
    # Results = zip(feature_vals,score_vals)
    results= {}
    for idx in range(len(feature_vals)):
        results[feature_vals[idx]]=score_vals[idx]
    return results

# Sort the tf-idf vectors by descending order of scores
sorted_items=sort_coo(tf_idf_vector.tocoo())

# Extract only the top n; n here is 25
keywords=extract_topn_from_vector(feature_names,sorted_items,25)
 
# Print the results, save as CSV
print("\nAbstract:")
print(doc)
print("\nKeywords:")
for k in keywords:
    print(k,keywords[k])

# import csv
# with open(file_prefix + 'td_idf.csv', 'w', newline="") as csv_file:  
#     writer = csv.writer(csv_file)
#     writer.writerow(["Keyword", "Importance"])
#     for key, value in keywords.items():
#        writer.writerow([key, value])

<div class="alert alert-info">
  <strong>Info!</strong> The benefits keywords are 'benefits', 'mental', 'dental', 'vision', 'life insurance', 'health insurance', 'work life balance', 'long term disability' and 'stock options'.
</div>
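
To make the TF-IDF scores less opaque, here is a hand-rolled sketch of the weighting on three toy documents. It assumes the smoothed formula documented for scikit-learn's `TfidfTransformer` with `smooth_idf=True`, namely idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation of each document vector. It handles unigrams only and is meant as an illustration, not a replacement for the transformer.

```python
import math

# Hypothetical cleaned benefit texts
docs = [
    "health insurance dental",
    "health insurance stock options",
    "flexible hours",
]

n_docs = len(docs)
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

# Document frequency: in how many documents each word appears
df_counts = {w: sum(w in doc for doc in tokenized) for w in vocab}

# Smoothed inverse document frequency
idf = {w: math.log((1 + n_docs) / (1 + df_counts[w])) + 1 for w in vocab}

def tfidf(doc_tokens):
    """Term count times idf, scaled so the vector has unit L2 norm."""
    raw = {w: doc_tokens.count(w) * idf[w] for w in set(doc_tokens)}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: v / norm for w, v in raw.items()}

scores = tfidf(tokenized[0])
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```

In the first toy document, "dental" outranks "health" and "insurance" because it appears in fewer documents, which is the effect the keyword extraction relies on.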

In [None]:
benefits_kw = ['benefits', 'mental', 'dental', 'vision', 'life insurance', 'health insurance', 'work life balance', 'long term disability', 'stock options']

In [None]:
# Fill in the missing values
df['benefits'] = df['benefits'].fillna("None")

# Make sure the column is of string type
df.benefits = df.benefits.astype('str')

# Count the keyword matches in each posting and store them in a new column
df['benefit_kw_count'] = df['benefits'].str.findall('|'.join(benefits_kw)).str.len()

df['benefit_kw_count'].value_counts()

<div class="alert alert-info">
  <strong>Info!</strong> 14184 job postings do not contain any of the benefits keywords; the remaining 3705 contain one or more.
</div>
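
The keyword match above is case-sensitive, so "Dental" and "dental" are treated differently. A small sketch of a case-insensitive variant on toy text (hypothetical values, shortened keyword list), with each keyword escaped so it is matched literally:

```python
import re
import pandas as pd

# Shortened, hypothetical keyword list
keywords = ["dental", "health insurance", "stock options"]
pattern = "|".join(re.escape(k) for k in keywords)

benefits = pd.Series([
    "Health Insurance, Dental and stock options",
    "Free snacks",
    None,
])

# fillna("") keeps missing texts from producing NaN counts
kw_count = (
    benefits.fillna("")
    .str.findall(pattern, flags=re.IGNORECASE)
    .str.len()
)
print(kw_count.tolist())
```

Whether case should be ignored is a modelling choice; it would change the counts reported for the real data.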

#### Employment Type

In [None]:
# fill missing employment type with 'None'
df['employment_type'] = df['employment_type'].fillna("None")

In [None]:
# counts per employment type
print(df['employment_type'].value_counts())

<div class="alert alert-success">
  <strong>Success!</strong> Filled the missing employment type with "None".
</div>

#### Required Experience 

In [None]:
df['required_experience'].isnull().sum()

In [None]:
# fill missing required experience with 'none'
df['required_experience'] =df['required_experience'].fillna("None")

In [None]:
# types of required experience 
print(df['required_experience'].value_counts())

#### Required Education

In [None]:
df['required_education'].isnull().sum()

In [None]:
# fill missing required education with 'none'
df['required_education'] =df['required_education'].fillna("None")

In [None]:
# counts per required education level
print(df['required_education'].value_counts())

In [None]:
# renaming some of the required education levels for consistency
education_map = {
    "Vocational - Degree": "Vocational",
    "Vocational - HS Diploma": "High School",
    "Some High School Coursework": "High School",
    "High School or equivalent": "High School",
    "Some College Coursework Completed": "Associate",
    "Unspecified": "None",
    "Bachelor's Degree": "Bachelor's",
    "Master's Degree": "Master's",
    "Associate Degree": "Associate",
}
df["required_education"] = df["required_education"].replace(education_map)

In [None]:
# counts per required education level after renaming
print(df['required_education'].value_counts())

### Exploratory Data Analysis

#### 5.1 Count of real and fraudulent job postings

In [None]:
# pass the column by keyword; recent seaborn versions do not accept it positionally
sns.countplot(x='fraudulent', data=df, palette=["#0b5394", "#9fc5e8"]).set_title('Real & Fraudulent')
df.groupby('fraudulent').count()['title'].reset_index().sort_values(by='title',ascending=False)

#### 5.4 Count of real and fraudulent job posting by department 

In [None]:
sns.countplot(x='department', data=df, hue="fraudulent", palette=["#0b5394", "#9fc5e8"], order=df['department'].value_counts().iloc[:10].index)
plt.xticks(rotation=90)
plt.show()

#### 5.2 Count of real and fraudulent job posting by Country 

In [None]:
sns.countplot(x='country', data=df, hue="fraudulent", palette=["#0b5394", "#9fc5e8"], order=df['country'].value_counts().iloc[:10].index)
plt.xticks(rotation=90)
plt.show()

#### 5.3 Count of real and fraudulent job posting by Employment Type

In [None]:
# Cross-tabulate employment_type against the target variable (raw counts, not normalized)
employment_type_cross = pd.crosstab(df["employment_type"], df["fraudulent"])

In [None]:
employment_type_cross

In [None]:
sns.countplot(x='employment_type', data=df, hue="fraudulent", palette=["#0b5394", "#9fc5e8"], order=df['employment_type'].value_counts().iloc[:10].index)
plt.xticks(rotation=90)
plt.show()
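
Because real postings far outnumber fake ones, raw counts can hide which categories are proportionally riskier. A minimal sketch of a row-normalised cross table on toy data (hypothetical values):

```python
import pandas as pd

# Hypothetical postings: 1 marks a fraudulent posting
toy = pd.DataFrame({
    "employment_type": ["Full-time", "Full-time", "Full-time", "Full-time",
                        "Part-time", "Part-time"],
    "fraudulent": [0, 0, 0, 1, 0, 1],
})

# normalize="index" divides each row by its total, so every row sums to 1
rates = pd.crosstab(toy["employment_type"], toy["fraudulent"], normalize="index")
print(rates)
```

The column labelled `1` is then the share of fraudulent postings within each employment type.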

#### 5.5 Count of real and fraudulent job posting by telecommuting 

In [None]:
sns.countplot(x='telecommuting', data=df, hue="fraudulent", palette=["#0b5394", "#9fc5e8"], order=df['telecommuting'].value_counts().iloc[:10].index)
plt.xticks()
plt.show()

#### 5.6 Count of real and fraudulent job posting by has_company_logo

In [None]:
sns.countplot(x='has_company_logo', data=df, hue="fraudulent", palette=["#0b5394", "#9fc5e8"], order=df['has_company_logo'].value_counts().iloc[:10].index)
plt.xticks()
plt.show()

#### 5.7 Count of real and fraudulent job posting by experience 

In [None]:
sns.set(rc={"figure.figsize":(12, 6)}) # width=12, height=6
sns.countplot(x='required_experience', data=df, hue="fraudulent", palette=["#0b5394", "#9fc5e8"], order=df['required_experience'].value_counts().iloc[:10].index)
plt.xticks()
plt.show()

#### Count of real and Fake job posting by education 

In [None]:
sns.set(rc={"figure.figsize":(12, 6)}) # width=12, height=6
sns.countplot(x='required_education', data=df, hue="fraudulent", palette=["#0b5394", "#9fc5e8"], order=df['required_education'].value_counts().iloc[:10].index)
plt.xticks()
plt.show()

#### 5.8 Count of real and fraudulent job posting by Function 

In [None]:
sns.set(rc={"figure.figsize":(15, 6)}) # width=15, height=6
sns.countplot(x='function', data=df, hue="fraudulent", palette=["#0b5394", "#9fc5e8"], order=df['function'].value_counts().iloc[:10].index)
plt.xticks()
plt.show()

#### 5.9 Count of benefits keywords in Real and Fake job postings

In [None]:
# plotting the comparison between benefit keyword counts in fake and real job postings
sns.set(rc={"figure.figsize":(8,6)}) # width=8, height=6
sns.countplot(x='benefit_kw_count', data=df, hue="fraudulent", palette=["#0b5394", "#9fc5e8"], order=df['benefit_kw_count'].value_counts().iloc[:4].index)
plt.xticks(rotation=90)
plt.show()

<div class="alert alert-info">
  <strong>Info!</strong> Most real and fake job postings contain no benefits keywords. However, fake job postings contain only unigram keywords.
</div>

### Comparing number of characters in Real and Fake job postings 

In [None]:
"""Extracting Text Features"""

text_df = df[["title", "company_profile", "description", "requirements", "benefits","fraudulent"]]
text_df.head()

#### 6.1 Characters in Descriptions

In [None]:
fig,(ax1,ax2)= plt.subplots(ncols=2, figsize=(17, 5), dpi=100)
length=text_df[text_df["fraudulent"]==1]['description'].str.len()
ax1.hist(length,bins = 20,color='#0b5394')
ax1.set_title('Fake Post')
length=text_df[text_df["fraudulent"]==0]['description'].str.len()
ax2.hist(length, bins = 20,color = '#9fc5e8')
ax2.set_title('Real Post')
fig.suptitle('Characters in description')
plt.show()

The distributions of characters in the descriptions of fake and real posts are similar, but some fake posts run to 6000-6500 characters.

#### 6.2 Characters in Requirements

In [None]:
fig,(ax1,ax2)= plt.subplots(ncols=2, figsize=(17, 5), dpi=100)
length=text_df[text_df["fraudulent"]==1]['requirements'].str.len()
ax1.hist(length,bins = 20,color='#0b5394')
ax1.set_title('Fake Post')
length=text_df[text_df["fraudulent"]==0]['requirements'].str.len()
ax2.hist(length,bins = 20, color = '#9fc5e8')
ax2.set_title('Real Post')
fig.suptitle('Characters in requirements')
plt.show()
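
Histograms show the shape of each distribution; a small summary table makes the two classes directly comparable. A sketch on toy data (hypothetical values), where missing text is left out of the statistics:

```python
import pandas as pd

# Hypothetical postings: 1 marks a fraudulent posting
toy = pd.DataFrame({
    "requirements": ["abc", "abcdef", None, "abcdefghij"],
    "fraudulent": [1, 0, 0, 0],
})

# str.len() returns NaN for missing text, which the aggregations skip
lengths = toy["requirements"].str.len()
summary = lengths.groupby(toy["fraudulent"]).agg(["count", "median", "max"])
print(summary)
```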