INTRODUCTION

This dataset contains **17,880 job postings** with details such as job title, location, company profile, description, requirements, and benefits. It also includes binary indicators for whether the job allows telecommuting, whether the posting has a company logo, and whether it includes screening questions. The target is the **"fraudulent"** column, which labels each posting as real (0) or fake (1). The file loaded in this notebook holds 8,999 of these rows. The dataset is useful for analyzing job market trends and detecting fraudulent job listings.

IMPORT LIBRARIES AND LOAD DATASET

In [49]:
# imports: data handling, plotting, preprocessing, and model evaluation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.metrics import accuracy_score

df=pd.read_csv(r"C:\Data Science\data_set\fake_job_postings.csv")
print(df)





      job_id                                      title             location  \
0          1                           Marketing Intern     US, NY, New York   
1          2  Customer Service - Cloud Video Production       NZ, , Auckland   
2          3    Commissioning Machinery Assistant (CMA)        US, IA, Wever   
3          4          Account Executive - Washington DC   US, DC, Washington   
4          5                        Bill Review Manager   US, FL, Fort Worth   
...      ...                                        ...                  ...   
8994    8995                Senior interaction designer         DE, , Berlin   
8995    8996    English Teacher Abroad (Conversational)   US, IL, Charleston   
8996    8997                        Marketing Associate  US, CA, Chula Vista   
8997    8998                      Senior JAVA Developer   US, MA, Burlington   
8998    8999              Data analyst intership (paid)            GB, LND,    

     department  salary_range  \
0     

DATA CLEANING

In [50]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8999 entries, 0 to 8998
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   job_id               8999 non-null   int64 
 1   title                8999 non-null   object
 2   location             8816 non-null   object
 3   department           3261 non-null   object
 4   salary_range         1455 non-null   object
 5   company_profile      7092 non-null   object
 6   description          8999 non-null   object
 7   requirements         7799 non-null   object
 8   benefits             5456 non-null   object
 9   telecommuting        8999 non-null   int64 
 10  has_company_logo     8999 non-null   int64 
 11  has_questions        8999 non-null   int64 
 12  employment_type      7008 non-null   object
 13  required_experience  5396 non-null   object
 14  required_education   5019 non-null   object
 15  industry             6412 non-null   object
 16  functi

In [51]:
df.isna().sum()

job_id                    0
title                     0
location                183
department             5738
salary_range           7544
company_profile        1907
description               0
requirements           1200
benefits               3543
telecommuting             0
has_company_logo          0
has_questions             0
employment_type        1991
required_experience    3603
required_education     3980
industry               2587
function               3247
fraudulent                0
dtype: int64

In [52]:
df.fillna(0,inplace=True)

In [53]:
df.duplicated().sum()

0

In [54]:
df.isnull().sum()

job_id                 0
title                  0
location               0
department             0
salary_range           0
company_profile        0
description            0
requirements           0
benefits               0
telecommuting          0
has_company_logo       0
has_questions          0
employment_type        0
required_experience    0
required_education     0
industry               0
function               0
fraudulent             0
dtype: int64
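
Filling every gap with 0 puts an integer into columns that otherwise hold text. As an alternative, here is a minimal sketch on a made-up three-row frame that fills only the text columns with a readable label. The `toy` data and the "Unknown" placeholder are assumptions for illustration, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the job-postings frame (values invented for illustration)
toy = pd.DataFrame({
    "department": ["Sales", np.nan, "IT"],
    "salary_range": [np.nan, "20000-30000", np.nan],
    "telecommuting": [0, 1, 0],
})

# Fill only the text columns; the numeric flag column is left untouched
text_cols = ["department", "salary_range"]
filled = toy.copy()
filled[text_cols] = filled[text_cols].fillna("Unknown")

missing_after = int(filled.isna().sum().sum())
print(filled)
print("missing values left:", missing_after)
```

Keeping a string placeholder means every text column stays a single type, so a later encoding step does not need a cast to `str`.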

In [55]:
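# Label-encode the categorical columns; missing values were filled with 0 above,
# so each column is cast to str first to give the encoder a single type.
le=LabelEncoder()
column_to_convert=['department','salary_range','company_profile','employment_type','telecommuting','required_experience','required_education','industry']
for col in column_to_convert:
    df[col]=df[col].astype(str)
    df[col]=le.fit_transform(df[col])  # le is refit per column, so only the last mapping is kept
df.head(10)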



Unnamed: 0,job_id,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent
0,1,Marketing Intern,"US, NY, New York",519,0,1149,"Food52, a fast-growing, James Beard Award-winn...",Experience with content management systems a m...,0,0,1,0,3,5,0,0,Marketing,0
1,2,Customer Service - Cloud Video Production,"NZ, , Auckland",780,0,38,Organised - Focused - Vibrant - Awesome!Do you...,What we expect from you:Your key responsibilit...,What you will get from usThrough being part of...,0,1,0,2,7,0,73,Customer Service,0
2,3,Commissioning Machinery Assistant (CMA),"US, IA, Wever",6,0,1046,"Our client, located in Houston, is actively se...",Implement pre-commissioning and commissioning ...,0,0,1,0,0,0,0,0,0,0
3,4,Account Executive - Washington DC,"US, DC, Washington",716,0,734,THE COMPANY: ESRI – Environmental Systems Rese...,"EDUCATION: Bachelor’s or Master’s in GIS, busi...",Our culture is anything but corporate—we have ...,0,1,0,2,6,2,22,Sales,0
4,5,Bill Review Manager,"US, FL, Fort Worth",6,0,909,JOB TITLE: Itemization Review ManagerLOCATION:...,QUALIFICATIONS:RN license in the State of Texa...,Full Benefits Offered,0,1,1,2,6,2,49,Health Care Provider,0
5,6,Accounting Clerk,"US, MD,",6,0,26,Job OverviewApex is an environmental consultin...,0,0,0,0,0,0,0,0,0,0,0
6,7,Head of Content (m/f),"DE, BE, Berlin",31,198,425,Your Responsibilities: Manage the English-spea...,Your Know-How: ...,Your Benefits: Being part of a fast-growing co...,0,1,1,2,6,6,85,Management,0
7,8,Lead Guest Service Specialist,"US, CA, San Francisco",6,0,93,Who is Airenvy?Hey there! We are seasoned entr...,"Experience with CRM software, live chat, and p...",Competitive Pay. You'll be able to eat steak e...,0,1,1,0,0,0,0,0,0
8,9,HP BSM SME,"US, FL, Pensacola",6,0,899,Implementation/Configuration/Testing/Training ...,MUST BE A US CITIZEN.An active TS/SCI clearanc...,0,0,1,1,2,1,0,56,0,0
9,10,Customer Service Associate - Part Time,"US, AZ, Phoenix",6,0,699,The Customer Service Associate will be based i...,Minimum Requirements:Minimum of 6 months custo...,0,0,1,0,4,3,5,39,Customer Service,0
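
Because a single `LabelEncoder` is refit for each column, the integer codes in the table above cannot be mapped back to their original categories afterwards. One possible variation, sketched here on invented data, keeps one encoder per column in a dictionary. The `toy` frame and the `encoders` dictionary are hypothetical names used only for this example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical columns (values invented for illustration)
toy = pd.DataFrame({
    "employment_type": ["Full-time", "Part-time", "Full-time", "Contract"],
    "industry": ["IT", "Retail", "IT", "Health"],
})

encoders = {}          # one fitted encoder per column
encoded = toy.copy()
for col in toy.columns:
    enc = LabelEncoder()
    encoded[col] = enc.fit_transform(toy[col].astype(str))
    encoders[col] = enc

# Codes can be turned back into the original labels when needed
decoded = encoders["industry"].inverse_transform(encoded["industry"])
print(encoded)
print(list(decoded))
```

Note that label encoding assigns codes in alphabetical order, so the numbers carry no real ranking. Distance-based models such as KNN will still treat them as ordered.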


EXTRACTING INDEPENDENT AND DEPENDENT VARIABLES

In [56]:
# Features: the label-encoded categorical columns
x=df[['department','salary_range','company_profile','employment_type','telecommuting','required_experience','required_education','industry']]
# Target: 1 = fraudulent, 0 = real (kept as a one-column DataFrame)
y=df[['fraudulent']]

SPLITTING DATA INTO TRAIN AND TEST SETS

In [57]:

x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=.2,random_state=0)
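
When fake postings are rare, a plain random split can leave the train and test sets with different shares of them. A minimal sketch of a stratified split follows, using synthetic data with an assumed 5% positive rate (the arrays `X` and `y` are invented here and are not the notebook's `x` and `y`):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data: 50 positives out of 1000 rows (assumed rate)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.array([1] * 50 + [0] * 950)

# stratify=y keeps the class proportions the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print("positive rate in train:", train_rate)
print("positive rate in test:", test_rate)
```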

MODEL BUILDING AND EVALUATION

MODEL OF KNN

In [58]:
from sklearn.neighbors import KNeighborsClassifier
classifier=KNeighborsClassifier(n_neighbors=5,metric='minkowski',p=2)
classifier.fit(x_train,y_train)

In [59]:
y_predict=classifier.predict(x_test)
print(y_test)
print(y_predict)

      fraudulent
4424           0
1726           0
7894           0
7803           0
1986           0
...          ...
7127           0
4394           0
7192           0
6280           0
4905           0

[1800 rows x 1 columns]
[0 0 0 ... 0 0 0]


In [60]:
print("mae value of knn:",metrics.mean_absolute_error(y_test,y_predict))
print("accuracy of knn:",metrics.accuracy_score(y_test,y_predict))

mae value of knn: 0.03277777777777778
accuracy of knn: 0.9672222222222222
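
Accuracy alone can look strong when one class dominates. The sketch below uses made-up labels (95 real and 5 fake, an assumed split rather than the dataset's) and a model that always predicts "real". It shows how a confusion matrix and recall expose what accuracy hides.

```python
import numpy as np
from sklearn import metrics

# Invented labels: 95 real postings (0) and 5 fake ones (1)
y_true = np.array([0] * 95 + [1] * 5)
# A trivial "model" that always predicts the majority class
y_pred = np.zeros(100, dtype=int)

acc = metrics.accuracy_score(y_true, y_pred)
cm = metrics.confusion_matrix(y_true, y_pred)   # rows: true class, columns: predicted class
rec = metrics.recall_score(y_true, y_pred)      # share of fake postings that were caught
prec = metrics.precision_score(y_true, y_pred, zero_division=0)

print("accuracy:", acc)
print("confusion matrix:\n", cm)
print("recall on fake postings:", rec)
print("precision on fake postings:", prec)
```

Here the accuracy is high even though no fake posting is detected, which is why recall, precision, or F1 on the fraudulent class are worth reporting next to accuracy.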


MODEL OF LOGISTIC REGRESSION

In [61]:
from sklearn.linear_model import LogisticRegression
model=LogisticRegression(multi_class='multinomial',random_state=0)
model.fit(x_train,y_train)

In [62]:
y_predict=model.predict(x_test)
print(y_test)
print(y_predict)

      fraudulent
4424           0
1726           0
7894           0
7803           0
1986           0
...          ...
7127           0
4394           0
7192           0
6280           0
4905           0

[1800 rows x 1 columns]
[0 0 0 ... 0 0 0]


In [63]:
print("mse value of logistic regression:",metrics.mean_squared_error(y_test,y_predict))
print("mae value of logistic regression:",metrics.mean_absolute_error(y_test,y_predict))
print("accuracy in logistic regression:",metrics.accuracy_score(y_test,y_predict))

mse value of logistic regression: 0.04388888888888889
mae value of logistic regression: 0.04388888888888889
accuracy in logistic regression: 0.9561111111111111
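
The MSE and MAE values are identical, and both equal 1 minus the accuracy. For 0/1 labels every error is exactly 1, so squaring it changes nothing, and the mean error is simply the share of misclassified rows. A small check on invented labels:

```python
import numpy as np
from sklearn import metrics

# Invented binary labels with two mistakes out of eight predictions
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([0, 1, 1, 0, 0, 1, 0, 0])

mse = metrics.mean_squared_error(y_true, y_pred)
mae = metrics.mean_absolute_error(y_true, y_pred)
acc = metrics.accuracy_score(y_true, y_pred)

print("mse:", mse)
print("mae:", mae)
print("1 - accuracy:", 1 - acc)
```

For a classifier, these two regression metrics therefore add no information beyond accuracy.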


MODEL OF RANDOM FOREST

In [64]:
from sklearn.ensemble import RandomForestClassifier
model2=RandomForestClassifier(n_estimators=10,criterion='entropy',random_state=67)
model2.fit(x_train,y_train)


In [65]:
y_predict=model2.predict(x_test)
print(y_test)
print(y_predict)

      fraudulent
4424           0
1726           0
7894           0
7803           0
1986           0
...          ...
7127           0
4394           0
7192           0
6280           0
4905           0

[1800 rows x 1 columns]
[0 0 0 ... 0 0 0]


In [66]:
print("mse value of random forest:",metrics.mean_squared_error(y_test,y_predict))
print("mae value of random forest:",metrics.mean_absolute_error(y_test,y_predict))
print("accuracy in random forest:",metrics.accuracy_score(y_test,y_predict))

mse value of random forest: 0.021666666666666667
mae value of random forest: 0.021666666666666667
accuracy in random forest: 0.9783333333333334


MODEL OF DECISION TREE

In [67]:
from sklearn.tree import DecisionTreeClassifier
model3=DecisionTreeClassifier(criterion='entropy',random_state=56)
model3.fit(x_train,y_train)




In [68]:
y_predict=model3.predict(x_test)
print(y_test)
print(y_predict)

      fraudulent
4424           0
1726           0
7894           0
7803           0
1986           0
...          ...
7127           0
4394           0
7192           0
6280           0
4905           0

[1800 rows x 1 columns]
[0 0 0 ... 0 0 0]


In [69]:
print("mse value of decision tree:",metrics.mean_squared_error(y_test,y_predict))
print("mae value of decision tree:",metrics.mean_absolute_error(y_test,y_predict))
print("accuracy in decision tree:",metrics.accuracy_score(y_test,y_predict))

mse value of decision tree: 0.02666666666666667
mae value of decision tree: 0.02666666666666667
accuracy in decision tree: 0.9733333333333334


MODEL OF SVM

In [70]:
from sklearn.svm import SVC
# Use a new name so the fitted random forest in model2 is not overwritten
model4=SVC(kernel='poly',random_state=67)
model4.fit(x_train,y_train)

In [71]:
y_predict=model4.predict(x_test)
print(y_test)
print(y_predict)

      fraudulent
4424           0
1726           0
7894           0
7803           0
1986           0
...          ...
7127           0
4394           0
7192           0
6280           0
4905           0

[1800 rows x 1 columns]
[0 0 0 ... 0 0 0]


In [72]:
print("mse value of svm:",metrics.mean_squared_error(y_test,y_predict))
print("mae value of svm:",metrics.mean_absolute_error(y_test,y_predict))
print("accuracy in svm:",metrics.accuracy_score(y_test,y_predict))

mse value of svm: 0.04388888888888889
mae value of svm: 0.04388888888888889
accuracy in svm: 0.9561111111111111


SUMMARY

This project aimed to predict the "fraudulent" column, which labels each job posting as real or fake, in order to detect fraudulent job listings. After cleaning the data and label-encoding the categorical columns, we tested five models: KNN, logistic regression, random forest, decision tree, and SVM. The random forest achieved the highest test accuracy at 97.83%, followed by the decision tree (97.33%), KNN (96.72%), and logistic regression and SVM (95.61% each), making random forest the best-performing model on this train/test split.
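
The five-model comparison could also be run in a single loop with cross-validation. The sketch below is one way to do that. It uses synthetic data from `make_classification` as a stand-in for the job-postings features, and the sample size, class weights, and `balanced_accuracy` scoring are assumptions chosen for illustration, not settings taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (roughly 10% positives, assumed)
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.9, 0.1], random_state=0
)

models = {
    "knn": KNeighborsClassifier(n_neighbors=5),
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=10, random_state=0),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "svm": SVC(kernel="poly"),
}

# Mean 5-fold score per model; balanced accuracy weights both classes equally
results = {}
for name, clf in models.items():
    scores = cross_val_score(clf, X, y, cv=5, scoring="balanced_accuracy")
    results[name] = scores.mean()

for name, score in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Averaging over several folds makes the ranking less dependent on one particular train/test split.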