# ML Final Project

In this final project, I will analyze a dataset containing both real and fake job postings. The goal is to classify each job posting as real or fake based on its features. There are 17,880 observations, and the classes are heavily imbalanced: 17,014 real postings and 866 fake ones (about 4.8%). The dataset has 18 columns, including a unique ID (job_id) and the target (fraudulent). The columns are described below.
 
| Columns	| Description |
| :-----------: | :-----------: |
| job_id	| Unique Job ID |
| title	| The title of the job posting entry. |
| location	| Geographical location of the job posting. |
| department	| Corporate department (e.g., sales). |
| salary_range	| Indicative salary range (e.g., $ 50,000-60,000). |
| company_profile	| A brief company description. |
| description	| The detailed description of the job ad. |
| requirements	| Listed requirements for the job opening. |
| benefits	| Listed benefits offered by the employer. |
| telecommuting	| True for telecommuting positions. |
| has_company_logo	| True if company logo is present. |
| has_questions	| True if screening questions are present. |
| employment_type	| Full-time, Part-time, Contract, etc. |
| required_experience	| Executive, Entry level, Intern, etc. |
| required_education	| Doctorate, Master’s Degree, Bachelor, etc. |
| industry	| Automotive, IT, Health care, Real estate, etc. |
| function	| Consulting, Engineering, Research, Sales, etc. |
| fraudulent	| The target and binary classification attribute. |

In [8]:
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
# import tensorflow as tf
# import keras

In [9]:
# read in the file
df = pd.read_csv("fake_job_postings.csv")

In [10]:
# glance at the dataset
df.head()

Unnamed: 0,job_id,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent
0,1,Marketing Intern,"US, NY, New York",Marketing,,"We're Food52, and we've created a groundbreaki...","Food52, a fast-growing, James Beard Award-winn...",Experience with content management systems a m...,,0,1,0,Other,Internship,,,Marketing,0
1,2,Customer Service - Cloud Video Production,"NZ, , Auckland",Success,,"90 Seconds, the worlds Cloud Video Production ...",Organised - Focused - Vibrant - Awesome!Do you...,What we expect from you:Your key responsibilit...,What you will get from usThrough being part of...,0,1,0,Full-time,Not Applicable,,Marketing and Advertising,Customer Service,0
2,3,Commissioning Machinery Assistant (CMA),"US, IA, Wever",,,Valor Services provides Workforce Solutions th...,"Our client, located in Houston, is actively se...",Implement pre-commissioning and commissioning ...,,0,1,0,,,,,,0
3,4,Account Executive - Washington DC,"US, DC, Washington",Sales,,Our passion for improving quality of life thro...,THE COMPANY: ESRI – Environmental Systems Rese...,"EDUCATION: Bachelor’s or Master’s in GIS, busi...",Our culture is anything but corporate—we have ...,0,1,0,Full-time,Mid-Senior level,Bachelor's Degree,Computer Software,Sales,0
4,5,Bill Review Manager,"US, FL, Fort Worth",,,SpotSource Solutions LLC is a Global Human Cap...,JOB TITLE: Itemization Review ManagerLOCATION:...,QUALIFICATIONS:RN license in the State of Texa...,Full Benefits Offered,0,1,1,Full-time,Mid-Senior level,Bachelor's Degree,Hospital & Health Care,Health Care Provider,0


In [11]:
# take a look at the column names
df.columns

Index(['job_id', 'title', 'location', 'department', 'salary_range',
       'company_profile', 'description', 'requirements', 'benefits',
       'telecommuting', 'has_company_logo', 'has_questions', 'employment_type',
       'required_experience', 'required_education', 'industry', 'function',
       'fraudulent'],
      dtype='object')

In [12]:
# there is a lot of missing data across the features and most of the features are categorical
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17880 entries, 0 to 17879
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   job_id               17880 non-null  int64 
 1   title                17880 non-null  object
 2   location             17534 non-null  object
 3   department           6333 non-null   object
 4   salary_range         2868 non-null   object
 5   company_profile      14572 non-null  object
 6   description          17879 non-null  object
 7   requirements         15185 non-null  object
 8   benefits             10670 non-null  object
 9   telecommuting        17880 non-null  int64 
 10  has_company_logo     17880 non-null  int64 
 11  has_questions        17880 non-null  int64 
 12  employment_type      14409 non-null  object
 13  required_experience  10830 non-null  object
 14  required_education   9775 non-null   object
 15  industry             12977 non-null  object
 16  func

In [13]:
# gauge the shape of the dataset
df.shape

(17880, 18)

In [14]:
# the breakdown between real and fraudulent job postings is not even (17,014 real and 866 fake, respectively)

df.iloc[:,-1].value_counts()

0    17014
1      866
Name: fraudulent, dtype: int64
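
To put the imbalance in perspective, the class shares can be worked out directly from the counts above. The short sketch below uses only those two numbers; the variable names are illustrative and not part of the original notebook.

```python
# class counts taken from the value_counts() output above
real_count = 17014
fake_count = 866
total = real_count + fake_count

# share of each class as a percentage
real_share = round(real_count / total * 100, 2)
fake_share = round(fake_count / total * 100, 2)

print(f"real: {real_share}%  fake: {fake_share}%")
```

A model that labels every posting as real would already be right about 95% of the time, so accuracy alone is a weak yardstick for this task.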

## Feature Engineering

While there are 18 features to choose from, I do not believe that they are all important for my classification problem. As we can see from the subset of unique locations below, the location strings follow a "country code, state, city" pattern, but the state and city parts are not consistently filled in or formatted. We can grab the first two characters to create a country column. This way we can see the proportion of fraudulent job postings in different countries.

We also see that many of the columns have a lot of null values (for example, salary_range is missing in 15,012 of the 17,880 rows). To keep the model simple, I will set these sparse features aside for the classification. The most important feature is likely the description, which we can analyze with natural language processing (NLP) to find patterns that distinguish fake job postings.
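
As a minimal sketch of the country-extraction idea, assuming each location string follows the "country code, state, city" pattern seen in the sample values: the helper `extract_country_code` and the example series are hypothetical, and missing locations are kept as missing rather than turned into text.

```python
import pandas as pd

def extract_country_code(location):
    """Return the first comma-separated field, or None if it is not a 2-letter code."""
    if not isinstance(location, str):
        return None  # missing values (None/NaN) are not strings
    code = location.split(",")[0].strip().upper()
    return code if len(code) == 2 and code.isalpha() else None

# illustrative values in the same style as the location column
locations = pd.Series(["US, NY, New York", "NZ, , Auckland", None, "gb, WSX, Chichester"])
codes = locations.apply(extract_country_code)
print(codes.tolist())
```

Splitting on the comma instead of slicing the first two characters makes the assumption about the format explicit and rejects values that do not start with a two-letter code.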

In [15]:
df['location'].unique()

array(['US, NY, New York', 'NZ, , Auckland', 'US, IA, Wever', ...,
       'US, CA, los Angeles', 'CA, , Ottawa', 'GB, WSX, Chichester'],
      dtype=object)

In [16]:
# countries mapped to country codes for matching later

country_dict = {'Afghanistan': 'AF',
 'Albania': 'AL',
 'Algeria': 'DZ',
 'American Samoa': 'AS',
 'Andorra': 'AD',
 'Angola': 'AO',
 'Anguilla': 'AI',
 'Antarctica': 'AQ',
 'Antigua and Barbuda': 'AG',
 'Argentina': 'AR',
 'Armenia': 'AM',
 'Aruba': 'AW',
 'Australia': 'AU',
 'Austria': 'AT',
 'Azerbaijan': 'AZ',
 'Bahamas': 'BS',
 'Bahrain': 'BH',
 'Bangladesh': 'BD',
 'Barbados': 'BB',
 'Belarus': 'BY',
 'Belgium': 'BE',
 'Belize': 'BZ',
 'Benin': 'BJ',
 'Bermuda': 'BM',
 'Bhutan': 'BT',
 'Bolivia, Plurinational State of': 'BO',
 'Bonaire, Sint Eustatius and Saba': 'BQ',
 'Bosnia and Herzegovina': 'BA',
 'Botswana': 'BW',
 'Bouvet Island': 'BV',
 'Brazil': 'BR',
 'British Indian Ocean Territory': 'IO',
 'Brunei Darussalam': 'BN',
 'Bulgaria': 'BG',
 'Burkina Faso': 'BF',
 'Burundi': 'BI',
 'Cambodia': 'KH',
 'Cameroon': 'CM',
 'Canada': 'CA',
 'Cape Verde': 'CV',
 'Cayman Islands': 'KY',
 'Central African Republic': 'CF',
 'Chad': 'TD',
 'Chile': 'CL',
 'China': 'CN',
 'Christmas Island': 'CX',
 'Cocos (Keeling) Islands': 'CC',
 'Colombia': 'CO',
 'Comoros': 'KM',
 'Congo': 'CG',
 'Congo, the Democratic Republic of the': 'CD',
 'Cook Islands': 'CK',
 'Costa Rica': 'CR',
 'Croatia': 'HR',
 'Cuba': 'CU',
 'Curaçao': 'CW',
 'Cyprus': 'CY',
 'Czech Republic': 'CZ',
 "Côte d'Ivoire": 'CI',
 'Denmark': 'DK',
 'Djibouti': 'DJ',
 'Dominica': 'DM',
 'Dominican Republic': 'DO',
 'Ecuador': 'EC',
 'Egypt': 'EG',
 'El Salvador': 'SV',
 'Equatorial Guinea': 'GQ',
 'Eritrea': 'ER',
 'Estonia': 'EE',
 'Ethiopia': 'ET',
 'Falkland Islands (Malvinas)': 'FK',
 'Faroe Islands': 'FO',
 'Fiji': 'FJ',
 'Finland': 'FI',
 'France': 'FR',
 'French Guiana': 'GF',
 'French Polynesia': 'PF',
 'French Southern Territories': 'TF',
 'Gabon': 'GA',
 'Gambia': 'GM',
 'Georgia': 'GE',
 'Germany': 'DE',
 'Ghana': 'GH',
 'Gibraltar': 'GI',
 'Greece': 'GR',
 'Greenland': 'GL',
 'Grenada': 'GD',
 'Guadeloupe': 'GP',
 'Guam': 'GU',
 'Guatemala': 'GT',
 'Guernsey': 'GG',
 'Guinea': 'GN',
 'Guinea-Bissau': 'GW',
 'Guyana': 'GY',
 'Haiti': 'HT',
 'Heard Island and McDonald Islands': 'HM',
 'Holy See (Vatican City State)': 'VA',
 'Honduras': 'HN',
 'Hong Kong': 'HK',
 'Hungary': 'HU',
 'Iceland': 'IS',
 'India': 'IN',
 'Indonesia': 'ID',
 'Iran, Islamic Republic of': 'IR',
 'Iraq': 'IQ',
 'Ireland': 'IE',
 'Isle of Man': 'IM',
 'Israel': 'IL',
 'Italy': 'IT',
 'Jamaica': 'JM',
 'Japan': 'JP',
 'Jersey': 'JE',
 'Jordan': 'JO',
 'Kazakhstan': 'KZ',
 'Kenya': 'KE',
 'Kiribati': 'KI',
 "Korea, Democratic People's Republic of": 'KP',
 'Korea, Republic of': 'KR',
 'Kuwait': 'KW',
 'Kyrgyzstan': 'KG',
 "Lao People's Democratic Republic": 'LA',
 'Latvia': 'LV',
 'Lebanon': 'LB',
 'Lesotho': 'LS',
 'Liberia': 'LR',
 'Libya': 'LY',
 'Liechtenstein': 'LI',
 'Lithuania': 'LT',
 'Luxembourg': 'LU',
 'Macao': 'MO',
 'Macedonia, the former Yugoslav Republic of': 'MK',
 'Madagascar': 'MG',
 'Malawi': 'MW',
 'Malaysia': 'MY',
 'Maldives': 'MV',
 'Mali': 'ML',
 'Malta': 'MT',
 'Marshall Islands': 'MH',
 'Martinique': 'MQ',
 'Mauritania': 'MR',
 'Mauritius': 'MU',
 'Mayotte': 'YT',
 'Mexico': 'MX',
 'Micronesia, Federated States of': 'FM',
 'Moldova, Republic of': 'MD',
 'Monaco': 'MC',
 'Mongolia': 'MN',
 'Montenegro': 'ME',
 'Montserrat': 'MS',
 'Morocco': 'MA',
 'Mozambique': 'MZ',
 'Myanmar': 'MM',
 'Namibia': 'NA',
 'Nauru': 'NR',
 'Nepal': 'NP',
 'Netherlands': 'NL',
 'New Caledonia': 'NC',
 'New Zealand': 'NZ',
 'Nicaragua': 'NI',
 'Niger': 'NE',
 'Nigeria': 'NG',
 'Niue': 'NU',
 'Norfolk Island': 'NF',
 'Northern Mariana Islands': 'MP',
 'Norway': 'NO',
 'Oman': 'OM',
 'Pakistan': 'PK',
 'Palau': 'PW',
 'Palestine, State of': 'PS',
 'Panama': 'PA',
 'Papua New Guinea': 'PG',
 'Paraguay': 'PY',
 'Peru': 'PE',
 'Philippines': 'PH',
 'Pitcairn': 'PN',
 'Poland': 'PL',
 'Portugal': 'PT',
 'Puerto Rico': 'PR',
 'Qatar': 'QA',
 'Romania': 'RO',
 'Russian Federation': 'RU',
 'Rwanda': 'RW',
 'Réunion': 'RE',
 'Saint Barthélemy': 'BL',
 'Saint Helena, Ascension and Tristan da Cunha': 'SH',
 'Saint Kitts and Nevis': 'KN',
 'Saint Lucia': 'LC',
 'Saint Martin (French part)': 'MF',
 'Saint Pierre and Miquelon': 'PM',
 'Saint Vincent and the Grenadines': 'VC',
 'Samoa': 'WS',
 'San Marino': 'SM',
 'Sao Tome and Principe': 'ST',
 'Saudi Arabia': 'SA',
 'Senegal': 'SN',
 'Serbia': 'RS',
 'Seychelles': 'SC',
 'Sierra Leone': 'SL',
 'Singapore': 'SG',
 'Sint Maarten (Dutch part)': 'SX',
 'Slovakia': 'SK',
 'Slovenia': 'SI',
 'Solomon Islands': 'SB',
 'Somalia': 'SO',
 'South Africa': 'ZA',
 'South Georgia and the South Sandwich Islands': 'GS',
 'South Sudan': 'SS',
 'Spain': 'ES',
 'Sri Lanka': 'LK',
 'Sudan': 'SD',
 'Suriname': 'SR',
 'Svalbard and Jan Mayen': 'SJ',
 'Swaziland': 'SZ',
 'Sweden': 'SE',
 'Switzerland': 'CH',
 'Syrian Arab Republic': 'SY',
 'Taiwan, Province of China': 'TW',
 'Tajikistan': 'TJ',
 'Tanzania, United Republic of': 'TZ',
 'Thailand': 'TH',
 'Timor-Leste': 'TL',
 'Togo': 'TG',
 'Tokelau': 'TK',
 'Tonga': 'TO',
 'Trinidad and Tobago': 'TT',
 'Tunisia': 'TN',
 'Turkey': 'TR',
 'Turkmenistan': 'TM',
 'Turks and Caicos Islands': 'TC',
 'Tuvalu': 'TV',
 'Uganda': 'UG',
 'Ukraine': 'UA',
 'United Arab Emirates': 'AE',
 'United Kingdom': 'GB',
 'United States': 'US',
 'United States Minor Outlying Islands': 'UM',
 'Uruguay': 'UY',
 'Uzbekistan': 'UZ',
 'Vanuatu': 'VU',
 'Venezuela, Bolivarian Republic of': 'VE',
 'Viet Nam': 'VN',
 'Virgin Islands, British': 'VG',
 'Virgin Islands, U.S.': 'VI',
 'Wallis and Futuna': 'WF',
 'Western Sahara': 'EH',
 'Yemen': 'YE',
 'Zambia': 'ZM',
 'Zimbabwe': 'ZW',
 'Åland Islands': 'AX'}

In [17]:
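# function to match a two-letter country code to its country name
def match_code_to_country(code):
    for key, value in country_dict.items():
        if code == value:  # exact match; `in` would also accept substrings such as "" or "U"
            return key
    return None  # no matching code (e.g. "na" from a missing location)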
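
An alternative to looping over the dictionary on every call is to invert it once and let pandas do the lookup. This is only a sketch: `sample_country_dict` is a small hypothetical stand-in for the full `country_dict`, and `code_to_country` is a name introduced here.

```python
import pandas as pd

# small stand-in for the full name -> code dictionary
sample_country_dict = {"United States": "US", "New Zealand": "NZ", "Canada": "CA"}

# invert once so that each code points to its country name
code_to_country = {code: name for name, code in sample_country_dict.items()}

codes = pd.Series(["US", "NZ", "CA", "na"])
countries = codes.map(code_to_country)  # codes without a match become NaN
print(countries.tolist())
```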

In [18]:
# create a new country column using the first two characters of the location column and put it next to location

df["location"] = df["location"].apply(str) # ensure that location is a string

location_index = df.columns.get_loc('location') # location column index

df['country_code'] = df['location'].str[:2] # grab first two characters from the string

df.insert((location_index+1),'country', df['country_code'].apply(match_code_to_country)) # insert country next to location

df['country']

0        United States
1          New Zealand
2        United States
3        United States
4        United States
             ...      
17875           Canada
17876    United States
17877    United States
17878          Nigeria
17879      New Zealand
Name: country, Length: 17880, dtype: object

In [19]:
# remove the country code column as we don't need it

df.pop('country_code')

0        US
1        NZ
2        US
3        US
4        US
         ..
17875    CA
17876    US
17877    US
17878    NG
17879    NZ
Name: country_code, Length: 17880, dtype: object

In [20]:
# look at the unique countries
df['country'].unique()

array(['United States', 'New Zealand', 'Germany', 'United Kingdom',
       'Australia', 'Singapore', 'Israel', 'United Arab Emirates',
       'Canada', 'India', 'Egypt', 'Poland', 'Greece', None, 'Pakistan',
       'Belgium', 'Brazil', 'Saudi Arabia', 'Denmark',
       'Russian Federation', 'South Africa', 'Cyprus', 'Hong Kong',
       'Turkey', 'Ireland', 'Lithuania', 'Japan', 'Netherlands',
       'Austria', 'Korea, Republic of', 'France', 'Estonia', 'Thailand',
       'Panama', 'Kenya', 'Mauritius', 'Mexico', 'Romania', 'Malaysia',
       'Finland', 'China', 'Spain', 'Sweden', 'Chile', 'Ukraine', 'Qatar',
       'Italy', 'Latvia', 'Iraq', 'Bulgaria', 'Philippines',
       'Czech Republic', 'Virgin Islands, U.S.', 'Malta', 'Hungary',
       'Bangladesh', 'Kuwait', 'Luxembourg', 'Nigeria', 'Serbia',
       'Belarus', 'Viet Nam', 'Indonesia', 'Zambia', 'Norway', 'Bahrain',
       'Uganda', 'Switzerland', 'Trinidad and Tobago', 'Sudan',
       'Slovakia', 'Argentina', 'Taiwan, Province 

In [21]:
# there are still 346 missing country values
df['country'].isna().sum()

346

In [22]:
# replace the None values with "Unknown" so they are easier to read
df['country'] = df['country'].fillna("Unknown")
df['country'].unique()

array(['United States', 'New Zealand', 'Germany', 'United Kingdom',
       'Australia', 'Singapore', 'Israel', 'United Arab Emirates',
       'Canada', 'India', 'Egypt', 'Poland', 'Greece', 'Unknown',
       'Pakistan', 'Belgium', 'Brazil', 'Saudi Arabia', 'Denmark',
       'Russian Federation', 'South Africa', 'Cyprus', 'Hong Kong',
       'Turkey', 'Ireland', 'Lithuania', 'Japan', 'Netherlands',
       'Austria', 'Korea, Republic of', 'France', 'Estonia', 'Thailand',
       'Panama', 'Kenya', 'Mauritius', 'Mexico', 'Romania', 'Malaysia',
       'Finland', 'China', 'Spain', 'Sweden', 'Chile', 'Ukraine', 'Qatar',
       'Italy', 'Latvia', 'Iraq', 'Bulgaria', 'Philippines',
       'Czech Republic', 'Virgin Islands, U.S.', 'Malta', 'Hungary',
       'Bangladesh', 'Kuwait', 'Luxembourg', 'Nigeria', 'Serbia',
       'Belarus', 'Viet Nam', 'Indonesia', 'Zambia', 'Norway', 'Bahrain',
       'Uganda', 'Switzerland', 'Trinidad and Tobago', 'Sudan',
       'Slovakia', 'Argentina', 'Taiwan, Prov

In [23]:
# check how many null values each column has
df.isnull().sum()

job_id                     0
title                      0
location                   0
country                    0
department             11547
salary_range           15012
company_profile         3308
description                1
requirements            2695
benefits                7210
telecommuting              0
has_company_logo           0
has_questions              0
employment_type         3471
required_experience     7050
required_education      8105
industry                4903
function                6455
fraudulent                 0
dtype: int64

In [24]:
# find the columns without null values and add them to a list
no_null_columns = df.columns[df.notna().all()].tolist()
no_null_columns

['job_id',
 'title',
 'location',
 'country',
 'telecommuting',
 'has_company_logo',
 'has_questions',
 'fraudulent']

In [25]:
# let us remove location since we have country already and add description because it is important
important_features = no_null_columns.copy()  # copy so that no_null_columns is left unchanged
important_features[important_features.index('location')] = 'description'

In [26]:
# check to make sure it worked correctly
important_features

['job_id',
 'title',
 'description',
 'country',
 'telecommuting',
 'has_company_logo',
 'has_questions',
 'fraudulent']
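
One thing to watch for when building a list from another list: plain assignment gives the same list object a second name, so an in-place change is visible through both names. The sketch below uses hypothetical column lists to show the difference between aliasing and copying.

```python
columns = ["job_id", "location", "fraudulent"]

alias = columns            # same list object under a second name
alias[1] = "description"   # also changes `columns`
print(columns)

columns = ["job_id", "location", "fraudulent"]
independent = columns.copy()     # a separate list
independent[1] = "description"   # `columns` keeps "location"
print(columns)
```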

In [27]:
df2 = df[important_features].copy()

In [28]:
df2

Unnamed: 0,job_id,title,description,country,telecommuting,has_company_logo,has_questions,fraudulent
0,1,Marketing Intern,"Food52, a fast-growing, James Beard Award-winn...",United States,0,1,0,0
1,2,Customer Service - Cloud Video Production,Organised - Focused - Vibrant - Awesome!Do you...,New Zealand,0,1,0,0
2,3,Commissioning Machinery Assistant (CMA),"Our client, located in Houston, is actively se...",United States,0,1,0,0
3,4,Account Executive - Washington DC,THE COMPANY: ESRI – Environmental Systems Rese...,United States,0,1,0,0
4,5,Bill Review Manager,JOB TITLE: Itemization Review ManagerLOCATION:...,United States,0,1,1,0
...,...,...,...,...,...,...,...,...
17875,17876,Account Director - Distribution,Just in case this is the first time you’ve vis...,Canada,0,1,1,0
17876,17877,Payroll Accountant,The Payroll Accountant will focus primarily on...,United States,0,1,1,0
17877,17878,Project Cost Control Staff Engineer - Cost Con...,Experienced Project Cost Control Staff Enginee...,United States,0,0,0,0
17878,17879,Graphic Designer,Nemsia Studios is looking for an experienced v...,Nigeria,0,0,1,0


In [29]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17880 entries, 0 to 17879
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   job_id            17880 non-null  int64 
 1   title             17880 non-null  object
 2   description       17879 non-null  object
 3   country           17880 non-null  object
 4   telecommuting     17880 non-null  int64 
 5   has_company_logo  17880 non-null  int64 
 6   has_questions     17880 non-null  int64 
 7   fraudulent        17880 non-null  int64 
dtypes: int64(5), object(3)
memory usage: 1.1+ MB


In [30]:
# since there is one null value in description, let us drop it
df2.dropna(inplace=True)

In [31]:
df2

Unnamed: 0,job_id,title,description,country,telecommuting,has_company_logo,has_questions,fraudulent
0,1,Marketing Intern,"Food52, a fast-growing, James Beard Award-winn...",United States,0,1,0,0
1,2,Customer Service - Cloud Video Production,Organised - Focused - Vibrant - Awesome!Do you...,New Zealand,0,1,0,0
2,3,Commissioning Machinery Assistant (CMA),"Our client, located in Houston, is actively se...",United States,0,1,0,0
3,4,Account Executive - Washington DC,THE COMPANY: ESRI – Environmental Systems Rese...,United States,0,1,0,0
4,5,Bill Review Manager,JOB TITLE: Itemization Review ManagerLOCATION:...,United States,0,1,1,0
...,...,...,...,...,...,...,...,...
17875,17876,Account Director - Distribution,Just in case this is the first time you’ve vis...,Canada,0,1,1,0
17876,17877,Payroll Accountant,The Payroll Accountant will focus primarily on...,United States,0,1,1,0
17877,17878,Project Cost Control Staff Engineer - Cost Con...,Experienced Project Cost Control Staff Enginee...,United States,0,0,0,0
17878,17879,Graphic Designer,Nemsia Studios is looking for an experienced v...,Nigeria,0,0,1,0


In [32]:
country_comp = pd.crosstab(df2["country"],df2["fraudulent"])

In [33]:
country_comp["Total"] = country_comp[1] + country_comp[0]
country_comp["Percentage"] = round((country_comp[1] / country_comp["Total"])*100,2)

In [34]:
country_comp = country_comp[[1, 0, "Total", "Percentage"]]
country_comp.rename(columns={1:"Fake", 0:"Real"},inplace=True)

The majority of the job postings in the dataset are from the United States. We see below that a few countries had at least as many fake postings as real ones, but those countries have very few postings in total (for example, Malaysia has 12 fake postings out of 21). Overall, given how skewed the dataset is towards the US, we may want to remove country as a feature when we build our classification model.

In [35]:
# sort the values by highest fake postings percentage
country_comp.sort_values(by=["Percentage"],ascending=False, inplace=True)
country_comp.head(20)

country,Fake,Real,Total,Percentage
Malaysia,12,9,21,57.14
Bahrain,5,4,9,55.56
"Taiwan, Province of China",2,2,4,50.0
Qatar,6,15,21,28.57
Australia,40,174,214,18.69
Indonesia,1,12,13,7.69
United States,730,9926,10656,6.85
Saudi Arabia,1,14,15,6.67
Unknown,19,327,346,5.49
Pakistan,1,26,27,3.7
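
Percentages based on a handful of postings are noisy, so it can help to look at the fake rate together with the number of postings per country. The following is a sketch on made-up toy data, not the real dataset; the `min_postings` threshold is an arbitrary assumption.

```python
import pandas as pd

# toy data: 1 = fake, 0 = real
toy = pd.DataFrame({
    "country": ["A", "A", "A", "A", "B", "B", "C"],
    "fraudulent": [1, 0, 0, 0, 1, 1, 0],
})

# the mean of a 0/1 column is the share of fake postings; count is the sample size
summary = toy.groupby("country")["fraudulent"].agg(fake_rate="mean", postings="count")
summary["fake_pct"] = (summary["fake_rate"] * 100).round(2)

# keep only countries with enough postings for the percentage to be meaningful
min_postings = 2  # arbitrary threshold for this toy example
reliable = summary[summary["postings"] >= min_postings]
print(reliable.sort_values("fake_pct", ascending=False))
```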


In [36]:
countries = country_comp.index

In [37]:
countries

Index(['Malaysia', 'Bahrain', 'Taiwan, Province of China', 'Qatar',
       'Australia', 'Indonesia', 'United States', 'Saudi Arabia', 'Unknown',
       'Pakistan', 'Brazil', 'Poland', 'Canada', 'South Africa', 'Egypt',
       'United Arab Emirates', 'Spain', 'India', 'Estonia', 'United Kingdom',
       'Philippines', 'Panama', 'Peru', 'Nigeria', 'Nicaragua', 'Portugal',
       'New Zealand', 'Romania', 'Russian Federation', 'Netherlands', 'Norway',
       'Albania', 'Serbia', 'Singapore', 'Virgin Islands, U.S.', 'Viet Nam',
       'Ukraine', 'Uganda', 'Turkey', 'Tunisia', 'Trinidad and Tobago',
       'Thailand', 'Switzerland', 'Sweden', 'Sudan', 'Sri Lanka', 'Mexico',
       'Slovenia', 'Slovakia', 'Morocco', 'Lithuania', 'Mauritius', 'Malta',
       'Finland', 'El Salvador', 'Denmark', 'Czech Republic', 'Cyprus',
       'Croatia', 'Colombia', 'China', 'Chile', 'Cameroon', 'Cambodia',
       'Bulgaria', 'Belgium', 'Belarus', 'Bangladesh', 'Austria', 'Armenia',
       'France', 'German

Out of curiosity, I mapped the breakdown of job postings across the globe. It is a cool visual but not useful for our model.

In [39]:
import geocoder
import folium

In [40]:
# geocode each country once and store its coordinates
lat = []
long = []
for c in countries:
    result = geocoder.arcgis(c).json  # one request per country instead of two
    lat.append(result['lat'])
    long.append(result['lng'])
country_comp["lat"] = lat
country_comp["long"] = long
country_comp

country,Fake,Real,Total,Percentage,lat,long
Malaysia,12,9,21,57.14,2.500000,112.500000
Bahrain,5,4,9,55.56,26.051859,50.564427
"Taiwan, Province of China",2,2,4,50.00,32.328120,118.763840
Qatar,6,15,21,28.57,25.285445,51.193148
Australia,40,174,214,18.69,-25.709932,134.484031
...,...,...,...,...,...,...
Iraq,0,10,10,0.00,33.038975,43.777172
Iceland,0,2,2,0.00,64.986957,-18.582382
Hungary,0,14,14,0.00,47.167089,19.424532
Hong Kong,0,77,77,0.00,22.351958,114.119386


In [41]:
ad_map = folium.Map(location=[20,0], tiles="OpenStreetMap", zoom_start=2)

In [42]:
for i in range(0,len(country_comp)):
    html=f"""
        <h1> {country_comp.index[i]}</h1>
        <ul>
            <li>Fake Ads: {country_comp.iloc[i]['Fake']}</li>
            <li>Real Ads: {country_comp.iloc[i]['Real']}</li>
            <li>Total Ads: {country_comp.iloc[i]['Total']}</li>
            <li><strong>% Fake Ads: {country_comp.iloc[i]['Percentage']}</strong></li>
        </ul>
        """
    iframe = folium.IFrame(html=html, width=200, height=200)
    popup = folium.Popup(iframe, max_width=2650)
    folium.Marker(
        location=[country_comp.iloc[i]['lat'], country_comp.iloc[i]['long']],
        popup=popup,
        icon=folium.DivIcon(html=f"""
            <div><svg>
                <circle cx="50" cy="50" r="40" fill="#69b3a2" opacity=".4"/>
                <rect x="35" y="35" width="30" height="30" fill="red" opacity=".3"/>
            </svg></div>""")
    ).add_to(ad_map)

In [43]:
ad_map

## Preprocessing

Now that we see that country is probably not a great predictor of our class based on the above analysis, we can begin preprocessing the rest of the data. To prepare the data for classification, we need to analyze the description feature which is likely the biggest determinant of our class.

In [44]:
df2

Unnamed: 0,job_id,title,description,country,telecommuting,has_company_logo,has_questions,fraudulent
0,1,Marketing Intern,"Food52, a fast-growing, James Beard Award-winn...",United States,0,1,0,0
1,2,Customer Service - Cloud Video Production,Organised - Focused - Vibrant - Awesome!Do you...,New Zealand,0,1,0,0
2,3,Commissioning Machinery Assistant (CMA),"Our client, located in Houston, is actively se...",United States,0,1,0,0
3,4,Account Executive - Washington DC,THE COMPANY: ESRI – Environmental Systems Rese...,United States,0,1,0,0
4,5,Bill Review Manager,JOB TITLE: Itemization Review ManagerLOCATION:...,United States,0,1,1,0
...,...,...,...,...,...,...,...,...
17875,17876,Account Director - Distribution,Just in case this is the first time you’ve vis...,Canada,0,1,1,0
17876,17877,Payroll Accountant,The Payroll Accountant will focus primarily on...,United States,0,1,1,0
17877,17878,Project Cost Control Staff Engineer - Cost Con...,Experienced Project Cost Control Staff Enginee...,United States,0,0,0,0
17878,17879,Graphic Designer,Nemsia Studios is looking for an experienced v...,Nigeria,0,0,1,0


There are a few rows where the description is essentially missing. We can safely drop those without it affecting our model.

In [45]:
# check to see if there are really short descriptions that are not useful
df2[df2['description'].apply(len) < 9]

Unnamed: 0,job_id,title,description,country,telecommuting,has_company_logo,has_questions,fraudulent
3030,3031,mobile apps for Android/iOS developer,#NAME?,Viet Nam,0,0,0,0
3951,3952,Senior Java Developer with Hadoop Exp,#NAME?,United States,0,1,0,0
4184,4185,Senior Architect / Technician,#NAME?,Philippines,0,1,0,0
7612,7613,Mobile Marketing Specialist,#NAME?,Hong Kong,0,1,1,0
11893,11894,Junior Specialist - Seed Production and Harvest,#NAME?,Philippines,0,0,0,0
12189,12190,Researcher - Nutrient and Crop Management,#NAME?,Philippines,0,0,0,0
13529,13530,Full-time Web Developer,#NAME?,Philippines,0,1,0,0
16690,16691,Marketing Lead,#NAME?,Unknown,0,1,0,0


In [46]:
df2.drop(df2[df2['description'].apply(len) < 9].index, inplace=True)

In [47]:
# check if they were all deleted successfully (regex=False so that "?" is matched literally)
df2['description'].str.contains('#NAME?', regex=False).sum()

0
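
The filtering step above can also be written as a boolean mask, which keeps the condition in one place. A sketch on toy data follows; the threshold of 9 characters is taken from the cell above, while the example rows are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["Web Developer", "Marketing Lead", "Analyst"],
    "description": ["#NAME?", "We are hiring a marketing lead.", "Join our analytics team today."],
})

# rows whose description is shorter than 9 characters are treated as missing
too_short = toy["description"].str.len() < 9
cleaned = toy[~too_short].copy()

# regex=False matches "#NAME?" literally instead of treating "?" as a regex operator
remaining = cleaned["description"].str.contains("#NAME?", regex=False).sum()
print(len(cleaned), remaining)
```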

Now that we determined that the country is not going to be important for our model, we can drop it from our dataframe.

In [48]:
df2.pop("country")

0        United States
1          New Zealand
2        United States
3        United States
4        United States
             ...      
17875           Canada
17876    United States
17877    United States
17878          Nigeria
17879      New Zealand
Name: country, Length: 17871, dtype: object

Let us now experiment with the Natural Language Toolkit (NLTK) library to transform the description column. This will allow us to remove unnecessary words and home in on those that can differentiate real job ads from fake ones. We will first test on a single description until we find a method that works well, and then apply it to the whole description column.

In [50]:
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

In [82]:
dummy_text = df2["description"][0]
dummy_text

'Food52, a fast-growing, James Beard Award-winning online food community and crowd-sourced and curated recipe hub, is currently interviewing full- and part-time unpaid interns to work in a small team of editors, executives, and developers in its New York City headquarters.Reproducing and/or repackaging existing Food52 content for a number of partner sites, such as Huffington Post, Yahoo, Buzzfeed, and more in their various content management systemsResearching blogs and websites for the Provisions by Food52 Affiliate ProgramAssisting in day-to-day affiliate program support, such as screening affiliates and assisting in any affiliate inquiriesSupporting with PR &amp; Events when neededHelping with office administrative work, such as filing, mailing, and preparing for meetingsWorking with developers to document bugs and suggest improvements to the siteSupporting the marketing and executive staff'

In [52]:
nltk.download('omw-1.4')
nltk.download("stopwords")

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\stanl\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\stanl\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [53]:
# create a function to transform the text by removing punctuation, stop words, and lemmatizing the words

def transform_text(text):
    # first, lowercase the text and keep only alphanumeric tokens (this drops punctuation)
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(text.casefold())

    # next, remove the stop words
    stop_words = set(stopwords.words("english"))
    filtered_list = [word for word in tokens if word not in stop_words]

    # finally, lemmatize the remaining tokens to reduce each word to its base form
    lemmatizer = WordNetLemmatizer()
    lemmatized = [lemmatizer.lemmatize(w) for w in filtered_list]

    return " ".join(lemmatized)

In [54]:
dummy = transform_text(dummy_text)
test = dummy
dummy

'food52 fast growing james beard award winning online food community crowd sourced curated recipe hub currently interviewing full part time unpaid intern work small team editor executive developer new york city headquarters reproducing repackaging existing food52 content number partner site huffington post yahoo buzzfeed various content management systemsresearching blog website provision food52 affiliate programassisting day day affiliate program support screening affiliate assisting affiliate inquiriessupporting pr amp event neededhelping office administrative work filing mailing preparing meetingsworking developer document bug suggest improvement sitesupporting marketing executive staff'

In [55]:
string = 'food52 fast growing james beard award winning online food community crowd sourced curated recipe hub currently interviewing full part time unpaid intern work small team editor executive developer new york city headquarters reproducing repackaging existing food52 content number partner site huffington post yahoo buzzfeed various content management systemsresearching blog website provision food52 affiliate programassisting day day affiliate program support screening affiliate assisting affiliate inquiriessupporting pr amp event neededhelping office administrative work filing mailing preparing meetingsworking developer document bug suggest improvement sitesupporting marketing executive staff'

In [56]:
string

'food52 fast growing james beard award winning online food community crowd sourced curated recipe hub currently interviewing full part time unpaid intern work small team editor executive developer new york city headquarters reproducing repackaging existing food52 content number partner site huffington post yahoo buzzfeed various content management systemsresearching blog website provision food52 affiliate programassisting day day affiliate program support screening affiliate assisting affiliate inquiriessupporting pr amp event neededhelping office administrative work filing mailing preparing meetingsworking developer document bug suggest improvement sitesupporting marketing executive staff'

In [57]:
import re

# remove e-mail addresses; str.replace treats the pattern as literal text, so use re.sub
dummy = re.sub(r'\S*@\S*\s?', '', dummy)
dummy

'food52 fast growing james beard award winning online food community crowd sourced curated recipe hub currently interviewing full part time unpaid intern work small team editor executive developer new york city headquarters reproducing repackaging existing food52 content number partner site huffington post yahoo buzzfeed various content management systemsresearching blog website provision food52 affiliate programassisting day day affiliate program support screening affiliate assisting affiliate inquiriessupporting pr amp event neededhelping office administrative work filing mailing preparing meetingsworking developer document bug suggest improvement sitesupporting marketing executive staff'
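
Since the sample description contains no e-mail address, the effect of the pattern is easier to see on a made-up sentence. The sketch below (with a hypothetical helper `strip_emails`) contrasts `str.replace`, which looks for the pattern as literal text, with `re.sub`, which interprets it as a regular expression.

```python
import re

EMAIL_PATTERN = r'\S*@\S*\s?'

def strip_emails(text):
    """Remove any whitespace-delimited token containing '@'."""
    return re.sub(EMAIL_PATTERN, '', text)

sample = "send your resume to jobs@example.com today"

literal = sample.replace(EMAIL_PATTERN, '')  # no change: the literal pattern text is not in the string
cleaned = strip_emails(sample)               # the e-mail token is removed

print(literal)
print(cleaned)
```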

In [58]:
import language_tool_python
tool = language_tool_python.LanguageTool('en-US') 

Downloading LanguageTool 5.7: 100% | 225M/225M [02:39<00:00, 1.41MB/s]
Unzipping C:\Users\stanl\AppData\Local\Temp\tmpzohgtj61.zip to C:\Users\stanl\.cache\language_tool_python.
Downloaded https://www.languagetool.org/download/LanguageTool-5.7.zip to C:\Users\stanl\.cache\language_tool_python.


In [95]:
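from nltk.tokenize import RegexpTokenizer
# \w+ keeps runs of word characters, so punctuation is dropped
tokenizer = RegexpTokenizer(r'\w+')
# str.title() capitalizes every token, hence the Title Case output
punct_re = lambda x: " ".join(tokenizer.tokenize(x.title()))
review = punct_re(dummy_text)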

In [96]:
tool.correct(review)

'Food52 A Fast Growing James Beard Award Winning Online Food Community And Crowd Sourced And Curated Recipe Hub Is Currently Interviewing Full And Part Time Unpaid Interns To Work In A Small Team Of Editors Executives And Developers In Its New York City Headquarters Reproducing And Or Repackaging Existing Food52 Content For A Number Of Partner Sites Such As Huffington Post Yahoo BuzzFeed And More In Their Various Content Management Systems researching Blogs And Websites For The Provisions By Food52 Affiliate Program assisting In Day To Day Affiliate Program Support Such As Screening Affiliates And Assisting In Any Affiliate Inquiries supporting With Pr Amp Events When Needed helping With Office Administrative Work Such As Filing Mailing And Preparing For Meetings working With Developers To Document Bugs And Suggest Improvements To The Site supporting The Marketing And Executive Staff'

In [80]:
tool.correct(df2["description"][0])

'Food52, a fast-growing, James Beard Award-winning online food community and crowdsourced and curated recipe hub, is currently interviewing full- and part-time unpaid interns to work in a small team of editors, executives, and developers in its New York City headquarters. Reproducing and/or repackaging existing Food52 content for a number of partner sites, such as Huffington Post, Yahoo, BuzzFeed, and more in their various content management systemsResearching blogs and websites for the Provisions by Food52 Affiliate ProgramAssisting in day-to-day affiliate program support, such as screening affiliates and assisting in any affiliate inquiriesSupporting with PR camp; Events when neededHelping with office administrative work, such as filing, mailing, and preparing for meetingsWorking with developers to document bugs and suggest improvements to the siteSupporting the marketing and executive staff'

In [81]:
df2["description"][0]

'Food52, a fast-growing, James Beard Award-winning online food community and crowd-sourced and curated recipe hub, is currently interviewing full- and part-time unpaid interns to work in a small team of editors, executives, and developers in its New York City headquarters.Reproducing and/or repackaging existing Food52 content for a number of partner sites, such as Huffington Post, Yahoo, Buzzfeed, and more in their various content management systemsResearching blogs and websites for the Provisions by Food52 Affiliate ProgramAssisting in day-to-day affiliate program support, such as screening affiliates and assisting in any affiliate inquiriesSupporting with PR &amp; Events when neededHelping with office administrative work, such as filing, mailing, and preparing for meetingsWorking with developers to document bugs and suggest improvements to the siteSupporting the marketing and executive staff'
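The raw description contains the HTML entity `&amp;`, which the corrected output above turns into `camp;`, and it has list items fused without whitespace (for example `systemsResearching`). One possible pre-cleaning step to run before the grammar tool is sketched below using only the standard library. The helper name `preclean` and the splitting heuristic are assumptions, not part of the notebook; the heuristic would also split camel-case names such as `BuzzFeed`, so it may need an exception list.

```python
import html
import re

# Zero-width position between a lowercase letter (or a period) and an
# uppercase letter, e.g. "systems|Researching" or "headquarters.|Reproducing".
FUSED_RE = re.compile(r'(?<=[a-z.])(?=[A-Z])')

def preclean(text):
    """Decode HTML entities, then separate fused words (hypothetical helper)."""
    text = html.unescape(text)       # "&amp;" -> "&"
    return FUSED_RE.sub(' ', text)   # insert a space at each fused boundary

# Fragments taken from the description printed above
fragment = "PR &amp; Events when neededHelping with office administrative work"
print(preclean(fragment))
print(preclean("headquarters.Reproducing and/or repackaging"))
```

Passing `preclean(df2["description"][0])` to `tool.correct` would then be one way to try the correction on cleaner input.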