# Assignment 2: Fake Job Posting Detection

## Dataset: Fake Job Postings

This notebook contains the analysis for detecting fraudulent job postings using machine learning.


In [2]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)

print("Libraries imported successfully!")


Libraries imported successfully!


## Step 1: Load and Explore the Data


In [3]:
# Load the dataset
df = pd.read_csv('fake_job_postings.csv')

print(f"Dataset shape: {df.shape}")
print("\nColumn names:")
print(df.columns.tolist())
print("\nFirst few rows:")
df.head()


Dataset shape: (17880, 18)

Column names:
['job_id', 'title', 'location', 'department', 'salary_range', 'company_profile', 'description', 'requirements', 'benefits', 'telecommuting', 'has_company_logo', 'has_questions', 'employment_type', 'required_experience', 'required_education', 'industry', 'function', 'fraudulent']

First few rows:


Unnamed: 0,job_id,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent
0,1,Marketing Intern,"US, NY, New York",Marketing,,"We're Food52, and we've created a groundbreaking and award-winning cooking site. We support, con...","Food52, a fast-growing, James Beard Award-winning online food community and crowd-sourced and cu...",Experience with content management systems a major plus (any blogging counts!)Familiar with the ...,,0,1,0,Other,Internship,,,Marketing,0
1,2,Customer Service - Cloud Video Production,"NZ, , Auckland",Success,,"90 Seconds, the worlds Cloud Video Production Service.90 Seconds is the worlds Cloud Video Produ...",Organised - Focused - Vibrant - Awesome!Do you have a passion for customer service? Slick typing...,"What we expect from you:Your key responsibility will be to communicate with the client, 90 Secon...",What you will get from usThrough being part of the 90 Seconds team you will gain:experience work...,0,1,0,Full-time,Not Applicable,,Marketing and Advertising,Customer Service,0
2,3,Commissioning Machinery Assistant (CMA),"US, IA, Wever",,,Valor Services provides Workforce Solutions that meet the needs of companies across the Private ...,"Our client, located in Houston, is actively seeking an experienced Commissioning Machinery Assis...",Implement pre-commissioning and commissioning procedures for rotary equipment.Execute all activi...,,0,1,0,,,,,,0
3,4,Account Executive - Washington DC,"US, DC, Washington",Sales,,Our passion for improving quality of life through geography is at the heart of everything we do....,THE COMPANY: ESRI – Environmental Systems Research InstituteOur passion for improving quality of...,"EDUCATION: Bachelor’s or Master’s in GIS, business administration, or a related field, or equiva...","Our culture is anything but corporate—we have a collaborative, creative environment; phone direc...",0,1,0,Full-time,Mid-Senior level,Bachelor's Degree,Computer Software,Sales,0
4,5,Bill Review Manager,"US, FL, Fort Worth",,,SpotSource Solutions LLC is a Global Human Capital Management Consulting firm headquartered in M...,"JOB TITLE: Itemization Review ManagerLOCATION: Fort Worth, TX ...","QUALIFICATIONS:RN license in the State of TexasDiploma or Bachelors of Science in Nursing, requi...",Full Benefits Offered,0,1,1,Full-time,Mid-Senior level,Bachelor's Degree,Hospital & Health Care,Health Care Provider,0


In [4]:
# Basic information about the dataset
print("Dataset Info:")
df.info()  # info() prints its summary directly and returns None, so no print() wrapper
print("\n" + "="*50)
print("\nMissing values:")
print(df.isnull().sum())
print("\n" + "="*50)
print("\nTarget variable distribution:")
print(df['fraudulent'].value_counts())
print(f"\nFraudulent rate: {df['fraudulent'].mean():.2%}")


Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17880 entries, 0 to 17879
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   job_id               17880 non-null  int64 
 1   title                17880 non-null  object
 2   location             17534 non-null  object
 3   department           6333 non-null   object
 4   salary_range         2868 non-null   object
 5   company_profile      14572 non-null  object
 6   description          17879 non-null  object
 7   requirements         15184 non-null  object
 8   benefits             10668 non-null  object
 9   telecommuting        17880 non-null  int64 
 10  has_company_logo     17880 non-null  int64 
 11  has_questions        17880 non-null  int64 
 12  employment_type      14409 non-null  object
 13  required_experience  10830 non-null  object
 14  required_education   9775 non-null   object
 15  industry             12977 non-null  ob

## Step 2: Simple Data Cleaning

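Based on the initial exploration, we'll perform three basic cleaning steps: check for duplicate rows, normalize empty strings to NaN in the text columns, and confirm that the target variable is an integer.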


In [5]:
# Create a copy for cleaning
df_clean = df.copy()

# Check for duplicate rows (full-row comparison, including job_id)
n_duplicates = df_clean.duplicated().sum()
print(f"Number of duplicate rows: {n_duplicates}")

# Remove duplicates if any
if n_duplicates > 0:
    df_clean = df_clean.drop_duplicates()
    print(f"Removed duplicates. New shape: {df_clean.shape}")
else:
    print("No duplicates found.")


Number of duplicate rows: 0
No duplicates found.
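
A note on the check above: it compares entire rows, `job_id` included. If `job_id` is unique per posting (the sequential values in the preview suggest this, but it is an assumption), a full-row comparison cannot flag postings that repeat the same content under different IDs. The sketch below shows a content-level check on a small toy frame. The toy values and the `content_cols` name are illustrative only, not results from the real dataset.

```python
import pandas as pd

# Toy data (illustrative only): rows 0 and 2 share content but differ in job_id.
toy = pd.DataFrame({
    "job_id": [1, 2, 3],
    "title": ["Data Analyst", "Sales Rep", "Data Analyst"],
    "description": ["Analyze data.", "Sell things.", "Analyze data."],
    "fraudulent": [0, 0, 0],
})

# Full-row check: the unique job_id makes every row distinct.
full_row_dupes = int(toy.duplicated().sum())

# Content-level check: ignore the identifier column when comparing rows.
content_cols = [c for c in toy.columns if c != "job_id"]
content_dupes = int(toy.duplicated(subset=content_cols).sum())

print(f"Full-row duplicates: {full_row_dupes}")
print(f"Duplicates ignoring job_id: {content_dupes}")
```

Whether content-level duplicates should be dropped is a modeling decision; repeated postings may themselves be a signal worth keeping.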


In [6]:
# Normalize missing values: replace empty strings with NaN for consistency
text_columns = ['title', 'location', 'department', 'company_profile', 
                'description', 'requirements', 'benefits', 'employment_type',
                'required_experience', 'required_education', 'industry', 'function']

for col in text_columns:
    if col in df_clean.columns:
        # Replace empty strings with NaN
        df_clean[col] = df_clean[col].replace('', np.nan)

print("Missing values after cleaning:")
print(df_clean[text_columns].isnull().sum())


Missing values after cleaning:
title                      0
location                 346
department             11547
company_profile         3308
description                1
requirements            2696
benefits                7212
employment_type         3471
required_experience     7050
required_education      8105
industry                4903
function                6455
dtype: int64
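
These counts match the non-null counts reported by `df.info()`, which is expected: `pd.read_csv` already parses empty fields as NaN by default, and `replace('', np.nan)` only matches strings that are exactly empty. Whitespace-only values would still slip through. A minimal sketch of a stricter normalization, using a regex on a toy frame (the toy values are illustrative assumptions, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy data (illustrative only) mixing empty, whitespace-only, and None values.
toy = pd.DataFrame({
    "department": ["Sales", "", "   ", None],
    "benefits": ["Full Benefits Offered", "\t", "401k", ""],
})

# Treat empty and whitespace-only strings as missing.
cleaned = toy.replace(r"^\s*$", np.nan, regex=True)

# Count and share of missing values per column.
missing_counts = cleaned.isnull().sum()
missing_share = cleaned.isnull().mean()

print(missing_counts)
print(missing_share.map("{:.0%}".format))
```

Reporting the share alongside the count makes it easier to spot columns such as `department` that are mostly empty.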


In [7]:
# Check data types and ensure target variable is correct
print("Data types:")
print(df_clean.dtypes)
print(f"\nTarget variable (fraudulent) type: {df_clean['fraudulent'].dtype}")
print(f"Target variable unique values: {df_clean['fraudulent'].unique()}")

# Ensure target is integer (should already be, but just to be safe)
df_clean['fraudulent'] = df_clean['fraudulent'].astype(int)

print(f"\nFinal dataset shape: {df_clean.shape}")
print(f"Final fraudulent rate: {df_clean['fraudulent'].mean():.2%}")


Data types:
job_id                  int64
title                  object
location               object
department             object
salary_range           object
company_profile        object
description            object
requirements           object
benefits               object
telecommuting           int64
has_company_logo        int64
has_questions           int64
employment_type        object
required_experience    object
required_education     object
industry               object
function               object
fraudulent              int64
dtype: object

Target variable (fraudulent) type: int64
Target variable unique values: [0 1]

Final dataset shape: (17880, 18)
Final fraudulent rate: 4.84%
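
A fraudulent rate of 4.84% means the classes are heavily imbalanced, so plain accuracy can be misleading. The sketch below builds a synthetic label vector with roughly the same size and rate, then scores a trivial "always legitimate" predictor. The fraud count is reconstructed from the rounded rate printed above, so treat it as approximate rather than the exact count in the data.

```python
import numpy as np

n_rows = 17880
fraud_rate = 0.0484  # rounded rate printed above

# Approximate number of fraudulent rows (assumption: derived from the rounded rate).
n_fraud = round(n_rows * fraud_rate)

# Synthetic labels: 1 = fraudulent, 0 = legitimate.
y = np.zeros(n_rows, dtype=int)
y[:n_fraud] = 1

# Trivial baseline: predict "legitimate" for every posting.
y_pred = np.zeros_like(y)

accuracy = (y_pred == y).mean()
recall = (y_pred[y == 1] == 1).mean()  # share of fraudulent rows caught

print(f"Baseline accuracy: {accuracy:.2%}")
print(f"Baseline recall on fraudulent class: {recall:.2%}")
```

The baseline scores high on accuracy while catching no fraud at all, which is why metrics such as recall, precision, or F1 on the fraudulent class are usually more informative for this kind of task.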
