## Cleaning the Fake Job Posting Dataset

## 1) Problem Statement
With the rise of online job portals, job seekers are increasingly exposed to fraudulent job postings that aim to scam applicants. These fake listings often mimic legitimate companies and job descriptions, making it difficult for users to distinguish between real and fraudulent opportunities.

## 2) Data Collection

- Dataset Source - https://www.kaggle.com/datasets/shivamb/real-or-fake-fake-jobposting-prediction
- The dataset consists of 18 columns and 17,880 rows

In [1]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
## importing csv file 
df = pd.read_csv('fake_job_postings.csv')

In [3]:
df.head()

Unnamed: 0,job_id,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent
0,1,Marketing Intern,"US, NY, New York",Marketing,,"We're Food52, and we've created a groundbreaki...","Food52, a fast-growing, James Beard Award-winn...",Experience with content management systems a m...,,0,1,0,Other,Internship,,,Marketing,0
1,2,Customer Service - Cloud Video Production,"NZ, , Auckland",Success,,"90 Seconds, the worlds Cloud Video Production ...",Organised - Focused - Vibrant - Awesome!Do you...,What we expect from you:Your key responsibilit...,What you will get from usThrough being part of...,0,1,0,Full-time,Not Applicable,,Marketing and Advertising,Customer Service,0
2,3,Commissioning Machinery Assistant (CMA),"US, IA, Wever",,,Valor Services provides Workforce Solutions th...,"Our client, located in Houston, is actively se...",Implement pre-commissioning and commissioning ...,,0,1,0,,,,,,0
3,4,Account Executive - Washington DC,"US, DC, Washington",Sales,,Our passion for improving quality of life thro...,THE COMPANY: ESRI – Environmental Systems Rese...,"EDUCATION: Bachelor’s or Master’s in GIS, busi...",Our culture is anything but corporate—we have ...,0,1,0,Full-time,Mid-Senior level,Bachelor's Degree,Computer Software,Sales,0
4,5,Bill Review Manager,"US, FL, Fort Worth",,,SpotSource Solutions LLC is a Global Human Cap...,JOB TITLE: Itemization Review ManagerLOCATION:...,QUALIFICATIONS:RN license in the State of Texa...,Full Benefits Offered,0,1,1,Full-time,Mid-Senior level,Bachelor's Degree,Hospital & Health Care,Health Care Provider,0


In [4]:
df.shape

(17880, 18)

## Information about the dataset

- job_id -> unique identifier of the job posting.
- title -> the job title as posted (e.g., "Software Engineer", "Marketing Intern").
- location -> location of the job, in "country, state, city" form (e.g., "US, NY, New York").
- department -> department the job belongs to (e.g., Marketing, Sales).
- salary_range -> salary range offered to the applicant.
- company_profile -> short description of the company.
- description -> description of the job.
- requirements -> skills, qualifications, and experience required for the job.
- benefits -> benefits offered with the job.
- telecommuting -> whether the job allows remote work or working from home (1 = yes, 0 = no).
- has_company_logo -> whether the posting includes the company logo (1 = yes, 0 = no).
- has_questions -> whether the posting contains additional application questions (1 = yes, 0 = no).
- employment_type -> type of employment offered (Full-time, Part-time, Contract, Temporary, Other).
- required_experience -> experience level required for the job (e.g., Entry level, Associate, Mid-Senior level).
- required_education -> minimum education level required (e.g., Bachelor's Degree, Master's Degree, High School or equivalent).
- industry -> industry the company belongs to (e.g., Computer Software, Hospital & Health Care).
- function -> functional area of the role (e.g., Marketing, Sales, Customer Service).
- fraudulent -> target label (1 = fake, 0 = real).
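The four flag columns are described as taking only the values 0 and 1, which is worth confirming before relying on them. A minimal sketch, using a toy frame `demo` in place of `df`; the helper name `non_binary_columns` and the constant `BINARY_COLS` are hypothetical, not part of the notebook:

```python
import pandas as pd

# Columns the description above says should hold only 0 or 1
BINARY_COLS = ["telecommuting", "has_company_logo", "has_questions", "fraudulent"]

def non_binary_columns(frame: pd.DataFrame, cols: list) -> list:
    """Return the columns that contain any value other than 0 or 1."""
    return [c for c in cols if not frame[c].isin([0, 1]).all()]

# Toy stand-in for df; has_questions deliberately contains a bad value (2)
demo = pd.DataFrame({
    "telecommuting": [0, 1, 0],
    "has_company_logo": [1, 1, 0],
    "has_questions": [0, 2, 1],
    "fraudulent": [0, 0, 1],
})

bad = non_binary_columns(demo, BINARY_COLS)
print(bad)
```

On the real data, `non_binary_columns(df, BINARY_COLS)` returning an empty list would confirm the encoding.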

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17880 entries, 0 to 17879
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   job_id               17880 non-null  int64 
 1   title                17880 non-null  object
 2   location             17534 non-null  object
 3   department           6333 non-null   object
 4   salary_range         2868 non-null   object
 5   company_profile      14572 non-null  object
 6   description          17879 non-null  object
 7   requirements         15184 non-null  object
 8   benefits             10668 non-null  object
 9   telecommuting        17880 non-null  int64 
 10  has_company_logo     17880 non-null  int64 
 11  has_questions        17880 non-null  int64 
 12  employment_type      14409 non-null  object
 13  required_experience  10830 non-null  object
 14  required_education   9775 non-null   object
 15  industry             12977 non-null  object
 16  function             11425 non-null  object
 17  fraudulent           17880 non-null  int64 
dtypes: int64(5), object(13)

In [6]:
df.isnull().sum()

job_id                     0
title                      0
location                 346
department             11547
salary_range           15012
company_profile         3308
description                1
requirements            2696
benefits                7212
telecommuting              0
has_company_logo           0
has_questions              0
employment_type         3471
required_experience     7050
required_education      8105
industry                4903
function                6455
fraudulent                 0
dtype: int64
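Raw missing counts are easier to compare as a share of all rows. A minimal sketch, using a toy frame `demo` in place of `df`; the helper name `missing_report` is hypothetical:

```python
import numpy as np
import pandas as pd

def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the missing count and percentage per column, most-missing first."""
    counts = frame.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    return report.sort_values("missing", ascending=False)

# Toy stand-in for df
demo = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "salary_range": [np.nan, np.nan, np.nan, "10-20"],
    "department": ["x", np.nan, np.nan, "y"],
})

report = missing_report(demo)
print(report)
```

Applied to the counts listed above, salary_range is missing in 15,012 of 17,880 rows (about 84%) and department in 11,547 (about 65%).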

In [7]:
df['employment_type'].value_counts()

employment_type
Full-time    11620
Contract      1524
Part-time      797
Temporary      241
Other          227
Name: count, dtype: int64

In [8]:
df['required_experience'].value_counts()

required_experience
Mid-Senior level    3809
Entry level         2697
Associate           2297
Not Applicable      1116
Director             389
Internship           381
Executive            141
Name: count, dtype: int64

In [9]:
df['required_education'].value_counts() 

required_education
Bachelor's Degree                    5145
High School or equivalent            2080
Unspecified                          1397
Master's Degree                       416
Associate Degree                      274
Certification                         170
Some College Coursework Completed     102
Professional                           74
Vocational                             49
Some High School Coursework            27
Doctorate                              26
Vocational - HS Diploma                 9
Vocational - Degree                     6
Name: count, dtype: int64
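By default `value_counts()` leaves out missing values, so the counts above sum to the non-null total rather than to the full row count. A small sketch of including them with `dropna=False`, using a toy Series `demo` in place of a real column:

```python
import pandas as pd

# Toy stand-in for a column such as df['employment_type']
demo = pd.Series(["Full-time", "Full-time", None, "Contract"])

# dropna=False keeps missing values as their own category
counts = demo.value_counts(dropna=False)
print(counts)
```

This makes the size of the missing group visible next to the real categories.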

In [10]:
cols = [
    "location",
    "department",
    "employment_type",
    "required_experience",
    "required_education",
    "industry",
    "function"
]

for col in cols:
    n_unique = df[col].dropna().str.lower().str.strip().nunique()
    print(f"{col}: {n_unique} unique values")


location: 2817 unique values
department: 1224 unique values
employment_type: 5 unique values
required_experience: 7 unique values
required_education: 13 unique values
industry: 131 unique values
function: 37 unique values
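The loop above lowercases and strips each value before counting. Comparing that figure with the raw count shows how many categories differ only by case or surrounding whitespace. A minimal sketch on a toy Series; the helper name `unique_counts` is hypothetical:

```python
import pandas as pd

def unique_counts(series: pd.Series) -> tuple:
    """Return (raw, normalized) numbers of distinct non-missing values."""
    raw = series.nunique()  # missing values are excluded by default
    normalized = series.dropna().str.lower().str.strip().nunique()
    return raw, normalized

# Toy stand-in for a categorical column with inconsistent spelling
demo = pd.Series(["Sales", "sales ", "SALES", "Marketing", None])

raw, normalized = unique_counts(demo)
print(raw, normalized)
```

A large gap between the two numbers for a column would suggest normalizing its values before any encoding step.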


In [11]:
print(df['salary_range'].isnull().sum())

15012


## I think salary_range and department are not needed because:
- department -> redundant with function, and missing in 11,547 of 17,880 rows (about 65%)
- salary_range -> too sparse and noisy: missing in 15,012 of 17,880 rows (about 84%)
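The overlap between department and function can be measured rather than assumed. A rough sketch, using a toy frame `demo` in place of `df` and a hypothetical helper `agreement_share`. It counts only exact matches after lowercasing and trimming, so it would understate looser overlap such as "Sales Team" versus "Sales":

```python
import pandas as pd

def agreement_share(frame: pd.DataFrame, a: str, b: str) -> float:
    """Among rows where both columns are present, return the share
    whose normalized values are identical."""
    both = frame[[a, b]].dropna()
    left = both[a].str.lower().str.strip()
    right = both[b].str.lower().str.strip()
    return float((left == right).mean())

# Toy stand-in for df; the last row is ignored because department is missing
demo = pd.DataFrame({
    "department": ["Marketing", "Sales ", "Success", None],
    "function": ["marketing", "Sales", "Customer Service", "Sales"],
})

share = agreement_share(demo, "department", "function")
print(f"{share:.2f}")
```

A low share on the real data would not by itself argue for keeping department, since the column is also mostly empty, but it would show that the two columns are not interchangeable.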

In [12]:
# Drop the two columns judged unnecessary above
df = df.drop(columns=["department", "salary_range"])


In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17880 entries, 0 to 17879
Data columns (total 16 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   job_id               17880 non-null  int64 
 1   title                17880 non-null  object
 2   location             17534 non-null  object
 3   company_profile      14572 non-null  object
 4   description          17879 non-null  object
 5   requirements         15184 non-null  object
 6   benefits             10668 non-null  object
 7   telecommuting        17880 non-null  int64 
 8   has_company_logo     17880 non-null  int64 
 9   has_questions        17880 non-null  int64 
 10  employment_type      14409 non-null  object
 11  required_experience  10830 non-null  object
 12  required_education   9775 non-null   object
 13  industry             12977 non-null  object
 14  function             11425 non-null  object
 15  fraudulent           17880 non-null  int64 
dtypes: int64(5), object(11)

In [14]:
df.isnull().sum()

job_id                    0
title                     0
location                346
company_profile        3308
description               1
requirements           2696
benefits               7212
telecommuting             0
has_company_logo          0
has_questions             0
employment_type        3471
required_experience    7050
required_education     8105
industry               4903
function               6455
fraudulent                0
dtype: int64