## Cleaning the Fake Job Posting Dataset

## 1) Problem Statement
With the rise of online job portals, job seekers are increasingly exposed to fraudulent job postings that aim to scam applicants. These fake listings often mimic legitimate companies and job descriptions, making it difficult for users to distinguish between real and fraudulent opportunities.

## 2) Data Collection

- Dataset Source - https://www.kaggle.com/datasets/shivamb/real-or-fake-fake-jobposting-prediction
- The dataset consists of 18 columns and 17,880 rows

In [1]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
## importing csv file 
df = pd.read_csv('fake_job_postings.csv')

In [3]:
df.head()

Unnamed: 0,job_id,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent
0,1,Marketing Intern,"US, NY, New York",Marketing,,"We're Food52, and we've created a groundbreaki...","Food52, a fast-growing, James Beard Award-winn...",Experience with content management systems a m...,,0,1,0,Other,Internship,,,Marketing,0
1,2,Customer Service - Cloud Video Production,"NZ, , Auckland",Success,,"90 Seconds, the worlds Cloud Video Production ...",Organised - Focused - Vibrant - Awesome!Do you...,What we expect from you:Your key responsibilit...,What you will get from usThrough being part of...,0,1,0,Full-time,Not Applicable,,Marketing and Advertising,Customer Service,0
2,3,Commissioning Machinery Assistant (CMA),"US, IA, Wever",,,Valor Services provides Workforce Solutions th...,"Our client, located in Houston, is actively se...",Implement pre-commissioning and commissioning ...,,0,1,0,,,,,,0
3,4,Account Executive - Washington DC,"US, DC, Washington",Sales,,Our passion for improving quality of life thro...,THE COMPANY: ESRI – Environmental Systems Rese...,"EDUCATION: Bachelor’s or Master’s in GIS, busi...",Our culture is anything but corporate—we have ...,0,1,0,Full-time,Mid-Senior level,Bachelor's Degree,Computer Software,Sales,0
4,5,Bill Review Manager,"US, FL, Fort Worth",,,SpotSource Solutions LLC is a Global Human Cap...,JOB TITLE: Itemization Review ManagerLOCATION:...,QUALIFICATIONS:RN license in the State of Texa...,Full Benefits Offered,0,1,1,Full-time,Mid-Senior level,Bachelor's Degree,Hospital & Health Care,Health Care Provider,0


In [4]:
df.shape

(17880, 18)

## Information about the dataset

- job_id -> unique identifier of the posting
- title -> the job title as posted (e.g., "Software Engineer", "Marketing Intern").
- location -> location of the job, written as country, state, city (e.g., "US, NY, New York")
- department -> department the job is posted for (e.g., Marketing, Sales)
- salary_range -> salary range offered to the applicant
- company_profile -> description of the company
- description -> description of the job
- requirements -> skills and experience required for the job
- benefits -> benefits offered with the job
- telecommuting -> indicates whether the job allows remote work or working from home (1 = yes, 0 = no).
- has_company_logo -> indicates whether the posting includes a company logo (1 = yes, 0 = no).
- has_questions -> indicates whether the posting contains additional application questions (1 = yes, 0 = no).
- employment_type -> type of employment offered (e.g., Full-time, Part-time, Contract, Temporary, Other).
- required_experience -> level of experience required for the job (e.g., Entry level, Mid-Senior level, Director).
- required_education -> level of education required for the job (e.g., Bachelor’s Degree, Master’s Degree, High School or equivalent).
- industry -> industry the company and job belong to (e.g., Computer Software, Hospital & Health Care, Marketing and Advertising).
- function -> functional area of the role within the company (e.g., Sales, Marketing, Customer Service).
- fraudulent -> target label (1 = fake, 0 = real).
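
As a quick sanity check on the flag columns described above, something like the following could confirm that they hold only 0 and 1 and show how unbalanced the target is. This is a minimal sketch on a tiny hypothetical frame: `toy` and `check_binary` are illustrative names that are not part of the notebook, and on the real data `df` would be passed instead.

```python
import pandas as pd

# Hypothetical stand-in for df, using the same flag column names as the dataset.
toy = pd.DataFrame({
    "telecommuting":    [0, 1, 0, 0],
    "has_company_logo": [1, 1, 0, 1],
    "has_questions":    [0, 1, 1, 0],
    "fraudulent":       [0, 0, 1, 0],
})

def check_binary(frame, columns):
    """Return the columns that contain anything other than 0 or 1."""
    return [c for c in columns if not frame[c].isin([0, 1]).all()]

flag_cols = ["telecommuting", "has_company_logo", "has_questions", "fraudulent"]
bad_cols = check_binary(toy, flag_cols)
print("non-binary columns:", bad_cols)

# Share of real (0) and fake (1) postings in the toy frame.
class_share = toy["fraudulent"].value_counts(normalize=True)
print(class_share)
```

An empty `bad_cols` list means every flag column is strictly binary; the class shares give a first impression of how rare the positive class is.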

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17880 entries, 0 to 17879
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   job_id               17880 non-null  int64 
 1   title                17880 non-null  object
 2   location             17534 non-null  object
 3   department           6333 non-null   object
 4   salary_range         2868 non-null   object
 5   company_profile      14572 non-null  object
 6   description          17879 non-null  object
 7   requirements         15184 non-null  object
 8   benefits             10668 non-null  object
 9   telecommuting        17880 non-null  int64 
 10  has_company_logo     17880 non-null  int64 
 11  has_questions        17880 non-null  int64 
 12  employment_type      14409 non-null  object
 13  required_experience  10830 non-null  object
 14  required_education   9775 non-null   object
 15  industry             12977 non-null  object
 16  func

In [6]:
df.isnull().sum()

job_id                     0
title                      0
location                 346
department             11547
salary_range           15012
company_profile         3308
description                1
requirements            2696
benefits                7212
telecommuting              0
has_company_logo           0
has_questions              0
employment_type         3471
required_experience     7050
required_education      8105
industry                4903
function                6455
fraudulent                 0
dtype: int64
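
The raw counts above are easier to compare as a share of all rows. The following is a minimal sketch of such a summary, run on a small hypothetical frame; `toy` and `missing_summary` are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df with a few missing values.
toy = pd.DataFrame({
    "title":        ["a", "b", "c", "d"],
    "salary_range": [np.nan, np.nan, np.nan, "10-20"],
    "department":   ["x", np.nan, np.nan, "y"],
})

def missing_summary(frame):
    """Missing count and percentage per column, most sparse column first."""
    summary = pd.DataFrame({
        "missing": frame.isnull().sum(),
        "percent": frame.isnull().mean().mul(100).round(1),
    })
    return summary.sort_values("percent", ascending=False)

summary = missing_summary(toy)
print(summary)
```

Applied to the real `df`, the same function would express the 15,012 missing `salary_range` values as roughly 84% of the 17,880 rows.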

In [7]:
df['employment_type'].value_counts()

employment_type
Full-time    11620
Contract      1524
Part-time      797
Temporary      241
Other          227
Name: count, dtype: int64

In [8]:
df['required_experience'].value_counts()

required_experience
Mid-Senior level    3809
Entry level         2697
Associate           2297
Not Applicable      1116
Director             389
Internship           381
Executive            141
Name: count, dtype: int64

In [9]:
df['required_education'].value_counts() 

required_education
Bachelor's Degree                    5145
High School or equivalent            2080
Unspecified                          1397
Master's Degree                       416
Associate Degree                      274
Certification                         170
Some College Coursework Completed     102
Professional                           74
Vocational                             49
Some High School Coursework            27
Doctorate                              26
Vocational - HS Diploma                 9
Vocational - Degree                     6
Name: count, dtype: int64

In [10]:
cols = [
    "location",
    "department",
    "employment_type",
    "required_experience",
    "required_education",
    "industry",
    "function"
]

for col in cols:
    n_unique = df[col].dropna().str.lower().str.strip().nunique()
    print(f"{col}: {n_unique} unique values")


location: 2817 unique values
department: 1224 unique values
employment_type: 5 unique values
required_experience: 7 unique values
required_education: 13 unique values
industry: 131 unique values
function: 37 unique values


In [11]:
print(df['salary_range'].isnull().sum())

15012


## salary_range and department can be dropped
- department -> largely overlaps with function and is missing in 11,547 of 17,880 rows (about 65%)
- salary_range -> too sparse and noisy: missing in 15,012 of 17,880 rows (about 84%)

In [12]:
df.drop(["department", "salary_range"], axis=1, inplace=True)


In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17880 entries, 0 to 17879
Data columns (total 16 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   job_id               17880 non-null  int64 
 1   title                17880 non-null  object
 2   location             17534 non-null  object
 3   company_profile      14572 non-null  object
 4   description          17879 non-null  object
 5   requirements         15184 non-null  object
 6   benefits             10668 non-null  object
 7   telecommuting        17880 non-null  int64 
 8   has_company_logo     17880 non-null  int64 
 9   has_questions        17880 non-null  int64 
 10  employment_type      14409 non-null  object
 11  required_experience  10830 non-null  object
 12  required_education   9775 non-null   object
 13  industry             12977 non-null  object
 14  function             11425 non-null  object
 15  fraudulent           17880 non-null  int64 
dtypes: i

In [14]:
df.isnull().sum()

job_id                    0
title                     0
location                346
company_profile        3308
description               1
requirements           2696
benefits               7212
telecommuting             0
has_company_logo          0
has_questions             0
employment_type        3471
required_experience    7050
required_education     8105
industry               4903
function               6455
fraudulent                0
dtype: int64

In [15]:
# Fill missing text columns with empty strings
text_cols = ['company_profile', 'description', 'requirements', 'benefits']
df[text_cols] = df[text_cols].fillna('')

# Fill missing categorical columns with 'Unknown'
cat_cols = ['employment_type', 'required_experience', 'required_education', 'industry', 'function', 'location']
df[cat_cols] = df[cat_cols].fillna('Unknown')

# Verify no more missing values (except possibly in parsed features later)
df.isnull().sum()

job_id                 0
title                  0
location               0
company_profile        0
description            0
requirements           0
benefits               0
telecommuting          0
has_company_logo       0
has_questions          0
employment_type        0
required_experience    0
required_education     0
industry               0
function               0
fraudulent             0
dtype: int64

In [16]:
# Function to parse location
def parse_location(loc):
    if loc == 'Unknown' or pd.isna(loc):
        return pd.Series(['Unknown', 'Unknown', 'Unknown'])
    parts = [part.strip() for part in loc.split(',')]
    country = parts[0] if len(parts) > 0 else 'Unknown'
    state = parts[1] if len(parts) > 1 else 'Unknown'
    city = parts[2] if len(parts) > 2 else 'Unknown'
    return pd.Series([country, state, city])

# Apply parsing and create new columns
df[['country', 'state', 'city']] = df['location'].apply(parse_location)

# Drop original location column
df.drop('location', axis=1, inplace=True)

df[['country', 'state', 'city']].head()

Unnamed: 0,country,state,city
0,US,NY,New York
1,NZ,,Auckland
2,US,IA,Wever
3,US,DC,Washington
4,US,FL,Fort Worth
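
Note that the row for "NZ, , Auckland" ends up with an empty state rather than 'Unknown', because the parser above only substitutes 'Unknown' when a part is absent, not when it is blank. A possible variant that treats blank parts as unknown is sketched below; `split_location` is a hypothetical helper, not the function used in this notebook, and it is shown on a few sample strings only.

```python
import pandas as pd

def split_location(value):
    """Split 'country, state, city' and map missing or blank parts to 'Unknown'."""
    if not isinstance(value, str):
        return ("Unknown", "Unknown", "Unknown")
    parts = [p.strip() for p in value.split(",")]
    parts = (parts + ["", "", ""])[:3]  # pad to three parts, ignore extras
    return tuple(p if p else "Unknown" for p in parts)

# Sample inputs: complete, blank state, country only, and missing.
locations = pd.Series(["US, NY, New York", "NZ, , Auckland", "GB", None])

parsed = pd.DataFrame(
    [split_location(v) for v in locations],
    columns=["country", "state", "city"],
    index=locations.index,
)
print(parsed)
```

With this variant, blank states and cities would be counted together with the other unknown values instead of forming a separate empty category.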


In [17]:
# Standardize categorical columns
for col in ['employment_type', 'required_experience', 'required_education', 'industry', 'function', 'country', 'state', 'city']:
    df[col] = df[col].str.lower().str.strip()

# Re-check unique counts after standardization
for col in ['employment_type', 'required_experience', 'required_education', 'industry', 'function']:
    print(f"{col}: {df[col].nunique()} unique values")

employment_type: 6 unique values
required_experience: 8 unique values
required_education: 14 unique values
industry: 132 unique values
function: 38 unique values


In [18]:
df['country'].value_counts()

country
us    10656
gb     2384
gr      940
ca      457
de      383
      ...  
hr        1
sv        1
jm        1
kz        1
kh        1
Name: count, Length: 91, dtype: int64

In [19]:
df['state'].value_counts()

state
       2140
ca     2051
ny     1259
lnd     992
tx      975
       ... 
der       1
iow       1
dud       1
sn        1
nle       1
Name: count, Length: 326, dtype: int64

In [20]:
df['city'].value_counts()

city
                      1628
london                1109
new york               699
athens                 560
san francisco          494
                      ... 
abbotsford               1
sw chicago suburbs       1
collegedale              1
dearborn                 1
melton mowbray           1
Name: count, Length: 2027, dtype: int64

In [21]:
df.head()

Unnamed: 0,job_id,title,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent,country,state,city
0,1,Marketing Intern,"We're Food52, and we've created a groundbreaki...","Food52, a fast-growing, James Beard Award-winn...",Experience with content management systems a m...,,0,1,0,other,internship,unknown,unknown,marketing,0,us,ny,new york
1,2,Customer Service - Cloud Video Production,"90 Seconds, the worlds Cloud Video Production ...",Organised - Focused - Vibrant - Awesome!Do you...,What we expect from you:Your key responsibilit...,What you will get from usThrough being part of...,0,1,0,full-time,not applicable,unknown,marketing and advertising,customer service,0,nz,,auckland
2,3,Commissioning Machinery Assistant (CMA),Valor Services provides Workforce Solutions th...,"Our client, located in Houston, is actively se...",Implement pre-commissioning and commissioning ...,,0,1,0,unknown,unknown,unknown,unknown,unknown,0,us,ia,wever
3,4,Account Executive - Washington DC,Our passion for improving quality of life thro...,THE COMPANY: ESRI – Environmental Systems Rese...,"EDUCATION: Bachelor’s or Master’s in GIS, busi...",Our culture is anything but corporate—we have ...,0,1,0,full-time,mid-senior level,bachelor's degree,computer software,sales,0,us,dc,washington
4,5,Bill Review Manager,SpotSource Solutions LLC is a Global Human Cap...,JOB TITLE: Itemization Review ManagerLOCATION:...,QUALIFICATIONS:RN license in the State of Texa...,Full Benefits Offered,0,1,1,full-time,mid-senior level,bachelor's degree,hospital & health care,health care provider,0,us,fl,fort worth



- The industry column has 132 unique values. This notebook assumes that industry on its own is a weak indicator of fraud and that a column with this many categories would add more noise than signal, so it is excluded. The assumption has not been tested here and is worth checking against the fraud rate per industry.
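
One way to test an assumption like this before dropping a column is to compare the fraud rate across its categories. The sketch below uses a small made-up frame; `toy` and `fraud_rate_by` are hypothetical names, and the numbers say nothing about the real dataset.

```python
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({
    "industry":   ["it", "it", "it", "oil", "oil", "unknown"],
    "fraudulent": [0, 0, 1, 1, 1, 0],
})

def fraud_rate_by(frame, column, target="fraudulent"):
    """Number of postings and share of fraudulent ones per category."""
    stats = frame.groupby(column)[target].agg(postings="count", fraud_rate="mean")
    return stats.sort_values("fraud_rate", ascending=False)

rates = fraud_rate_by(toy, "industry")
print(rates)
```

If the rates on the real data turn out to be similar across categories, dropping the column is easier to justify; large differences in well-populated categories would suggest keeping it, possibly with rare categories grouped together.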

In [23]:
df.drop(columns=['industry'], inplace=True)

- The function column describes the job role, and fraud can occur in any function. With 38 unique values it is assumed to add little predictive power, so it is excluded as well.

In [24]:
df.drop(columns=['function'], inplace=True)

In [25]:
df['required_education'].value_counts()

required_education
unknown                              8105
bachelor's degree                    5145
high school or equivalent            2080
unspecified                          1397
master's degree                       416
associate degree                      274
certification                         170
some college coursework completed     102
professional                           74
vocational                             49
some high school coursework            27
doctorate                              26
vocational - hs diploma                 9
vocational - degree                     6
Name: count, dtype: int64

- In the counts above, unknown and unspecified mean the same thing, so they are merged into unknown.

In [26]:
df['required_education'] = df['required_education'].replace({
    "unspecified": "unknown",
})


In [27]:
df['required_education'].value_counts()

required_education
unknown                              9502
bachelor's degree                    5145
high school or equivalent            2080
master's degree                       416
associate degree                      274
certification                         170
some college coursework completed     102
professional                           74
vocational                             49
some high school coursework            27
doctorate                              26
vocational - hs diploma                 9
vocational - degree                     6
Name: count, dtype: int64

In [28]:
# Standardize categorical columns
for col in ['employment_type', 'required_experience', 'required_education', 'country', 'state', 'city']:
    df[col] = df[col].str.lower().str.strip()

# Re-check unique counts after standardization
for col in ['employment_type', 'required_experience', 'required_education']:
    print(f"{col}: {df[col].nunique()} unique values")

employment_type: 6 unique values
required_experience: 8 unique values
required_education: 13 unique values


In [29]:
df.columns

Index(['job_id', 'title', 'company_profile', 'description', 'requirements',
       'benefits', 'telecommuting', 'has_company_logo', 'has_questions',
       'employment_type', 'required_experience', 'required_education',
       'fraudulent', 'country', 'state', 'city'],
      dtype='object')

- Country, state, and city are assumed to be weak predictors of fraud, since fake postings can list any location and the location text itself can be manipulated, so they are dropped.

In [30]:
df.drop(columns=['country', 'state', 'city'], inplace=True)


In [31]:
df.columns

Index(['job_id', 'title', 'company_profile', 'description', 'requirements',
       'benefits', 'telecommuting', 'has_company_logo', 'has_questions',
       'employment_type', 'required_experience', 'required_education',
       'fraudulent'],
      dtype='object')

- The columns title, company_profile, description, requirements, and benefits are all textual information describing the job. Combining them into a single text column simplifies preprocessing, reduces the number of features, and better captures the overall content, which reflects how a user would provide job details in practice.

In [32]:
df['full_text'] = df['title'].fillna('') + ' ' + \
                  df['company_profile'].fillna('') + ' ' + \
                  df['description'].fillna('') + ' ' + \
                  df['requirements'].fillna('') + ' ' + \
                  df['benefits'].fillna('')
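
Because empty fields leave double spaces in the concatenated string, a variant that also collapses whitespace may be convenient. This is only a sketch on made-up rows; `TEXT_COLS`, `toy`, and `combine_text` are hypothetical names rather than part of the notebook.

```python
import pandas as pd

TEXT_COLS = ["title", "company_profile", "description", "requirements", "benefits"]

# Made-up rows with some empty text fields.
toy = pd.DataFrame({
    "title":           ["Marketing Intern", "Bill Review Manager"],
    "company_profile": ["We're Food52", ""],
    "description":     ["Fast-growing team", "Review bills"],
    "requirements":    ["", "RN license"],
    "benefits":        ["", "Full Benefits Offered"],
})

def combine_text(frame, columns):
    """Join the text columns with spaces, then collapse repeated whitespace."""
    joined = frame[columns].fillna("").astype(str).apply(" ".join, axis=1)
    return joined.str.replace(r"\s+", " ", regex=True).str.strip()

full_text = combine_text(toy, TEXT_COLS)
print(full_text.tolist())
```

The row-wise join is slower than adding the columns directly, so on the full dataset the direct concatenation followed by the same `str.replace` call may be the better fit.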


In [33]:
df.drop(columns=['title', 'company_profile', 'description', 'requirements', 'benefits'], inplace=True)


In [34]:
df.head()

Unnamed: 0,job_id,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,fraudulent,full_text
0,1,0,1,0,other,internship,unknown,0,"Marketing Intern We're Food52, and we've creat..."
1,2,0,1,0,full-time,not applicable,unknown,0,Customer Service - Cloud Video Production 90 S...
2,3,0,1,0,unknown,unknown,unknown,0,Commissioning Machinery Assistant (CMA) Valor ...
3,4,0,1,0,full-time,mid-senior level,bachelor's degree,0,Account Executive - Washington DC Our passion ...
4,5,0,1,1,full-time,mid-senior level,bachelor's degree,0,Bill Review Manager SpotSource Solutions LLC i...


- Missing employment types (filled as unknown) are merged into the existing other category.

In [37]:
df['employment_type'] = df['employment_type'].replace({
    'unknown': 'other',
})


In [38]:
df['employment_type'].value_counts()

employment_type
full-time    11620
other         3698
contract      1524
part-time      797
temporary      241
Name: count, dtype: int64


In [40]:
df['required_experience'].value_counts()

required_experience
unknown             7050
mid-senior level    3809
entry level         2697
associate           2297
not applicable      1116
director             389
internship           381
executive            141
Name: count, dtype: int64

In [41]:
df['required_education'].value_counts()

required_education
unknown                              9502
bachelor's degree                    5145
high school or equivalent            2080
master's degree                       416
associate degree                      274
certification                         170
some college coursework completed     102
professional                           74
vocational                             49
some high school coursework            27
doctorate                              26
vocational - hs diploma                 9
vocational - degree                     6
Name: count, dtype: int64

In [43]:
# Standardize categorical columns
for col in ['employment_type', 'required_experience', 'required_education']:
    df[col] = df[col].str.lower().str.strip()

# Re-check unique counts after standardization
for col in ['employment_type', 'required_experience', 'required_education']:
    print(f"{col}: {df[col].nunique()} unique values")

employment_type: 5 unique values
required_experience: 8 unique values
required_education: 13 unique values


In [44]:
df.head()

Unnamed: 0,job_id,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,fraudulent,full_text
0,1,0,1,0,other,internship,unknown,0,"Marketing Intern We're Food52, and we've creat..."
1,2,0,1,0,full-time,not applicable,unknown,0,Customer Service - Cloud Video Production 90 S...
2,3,0,1,0,other,unknown,unknown,0,Commissioning Machinery Assistant (CMA) Valor ...
3,4,0,1,0,full-time,mid-senior level,bachelor's degree,0,Account Executive - Washington DC Our passion ...
4,5,0,1,1,full-time,mid-senior level,bachelor's degree,0,Bill Review Manager SpotSource Solutions LLC i...


In [45]:
df.to_csv('fake_job_posting_cleaned.csv', index=False)
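
After saving, it can be worth reading the file back to confirm that it round-trips cleanly. The sketch below does this with a small made-up frame written to a temporary directory; `toy` and the file name `toy_cleaned.csv` are hypothetical and stand in for the real `df` and output file.

```python
import os
import tempfile

import pandas as pd

# Made-up frame with the same kind of columns as the cleaned dataset.
toy = pd.DataFrame({
    "job_id":          [1, 2],
    "employment_type": ["other", "full-time"],
    "fraudulent":      [0, 0],
    "full_text":       ["Marketing Intern We're Food52", "Customer Service 90 Seconds"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_cleaned.csv")
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# The reloaded frame should have the same shape and no missing values.
same_shape = reloaded.shape == toy.shape
no_missing = int(reloaded.isnull().sum().sum()) == 0
print(same_shape, no_missing)
```

One thing to keep in mind is that `pd.read_csv` reads empty fields back as NaN by default, so any text column that is an empty string at save time would show up as missing again after reloading.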
