<a href="https://colab.research.google.com/github/PrithuVerma/Exploratory-Data-Analysis/blob/main/Fake%26Real_Job_Posting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Fraud Detection & Marketplace Trust**
**Problem Statement** - How do fake job postings differ from real job postings, and which observable patterns can be used to flag potentially fraudulent jobs?

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [3]:
data = pd.read_csv('/content/fake_real_job_postings_3000x25.csv')
data.columns

Index(['job_id', 'job_title', 'job_description', 'requirements', 'benefits',
       'company_name', 'company_profile', 'industry', 'employment_type',
       'location', 'salary_range', 'required_experience_years',
       'education_level', 'department', 'posting_date', 'application_deadline',
       'contact_email', 'company_website', 'has_logo', 'num_open_positions',
       'job_function', 'telecommuting', 'fraud_reason', 'text_length',
       'is_fake'],
      dtype='object')

In [7]:
data.size

75000

In [11]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 25 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   job_id                     3000 non-null   int64 
 1   job_title                  3000 non-null   object
 2   job_description            3000 non-null   object
 3   requirements               3000 non-null   object
 4   benefits                   3000 non-null   object
 5   company_name               1528 non-null   object
 6   company_profile            3000 non-null   object
 7   industry                   3000 non-null   object
 8   employment_type            3000 non-null   object
 9   location                   3000 non-null   object
 10  salary_range               3000 non-null   object
 11  required_experience_years  3000 non-null   int64 
 12  education_level            3000 non-null   object
 13  department                 3000 non-null   object
 14  posting_

In [8]:
data.isna().sum().sort_values(ascending=False)

column,missing_count
fraud_reason,1528
company_website,1472
company_name,1472
job_id,0
job_title,0
benefits,0
company_profile,0
requirements,0
job_description,0
employment_type,0


In [12]:
data.drop(columns='fraud_reason', inplace = True)

# Why was the 'fraud_reason' column dropped?
'fraud_reason' describes why a posting was judged fraudulent, so it only exists once a posting has already been labelled. It is post-label explanatory data: keeping it would introduce target leakage and make any insight unrealistic in a real-world detection setting, where all we need to know is whether a posting is fake or not.
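
A quick way to see the leakage is to check whether 'fraud_reason' is filled in exactly when a posting is fake. The sketch below does this on a tiny hand-made frame (the rows and reason strings are illustrative assumptions, not taken from the dataset); on the real data the check would have to run before the drop.

```python
import pandas as pd

# Toy rows for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    "fraud_reason": ["upfront fee requested", None, "no company details", None],
    "is_fake": [1, 0, 1, 0],
})

# If a reason is present exactly when the posting is fake, the column
# simply restates the label and would leak it into any analysis.
reason_present = toy["fraud_reason"].notna()
leaks_label = bool((reason_present == (toy["is_fake"] == 1)).all())
print(leaks_label)
```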

In [17]:
data["has_company_website"] = data["company_website"].notna().astype(int)
data["has_company_website"].value_counts()

has_company_website,count
1,1528
0,1472


The code block above adds a new binary column, has_company_website, which is 1 when a posting lists a website URL and 0 when the URL is missing. The original company_website column is left unchanged.
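
To see how such a flag relates to the label, it can be cross-tabulated against is_fake. This is a minimal sketch on made-up rows (the URLs and labels are assumptions for illustration); the same `pd.crosstab` call should work on `data` directly.

```python
import pandas as pd

# Toy rows for illustration only; URLs and labels are made up.
toy = pd.DataFrame({
    "company_website": ["https://example.com", None, None, "https://example.org", None],
    "is_fake": [0, 1, 1, 0, 1],
})

# Same transformation as in the notebook: 1 if a URL is present, else 0.
toy["has_company_website"] = toy["company_website"].notna().astype(int)

# Rows: has_company_website, columns: is_fake, cells: number of postings.
website_vs_label = pd.crosstab(toy["has_company_website"], toy["is_fake"])
print(website_vs_label)
```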

In [27]:
name_counts = data["company_name"].value_counts()
data["company_name_frequency"] = data["company_name"].map(name_counts)
data.groupby("is_fake")["company_name_frequency"].mean()


is_fake,company_name_frequency
0,2.22644
1,NaN


# Key Points


*   All fake-job rows have company_name_frequency missing (NaN), which is why the mean for is_fake = 1 is empty
*   Fake postings don't have company names

Real job postings include identifiable company names that repeat modestly across listings (the name on a real posting appears about 2.2 times on average). In contrast, every fraudulent posting in this dataset omits the company name, making frequency-based analysis impossible for that group.
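
Because the frequency is undefined for missing names, a missing-name indicator is easier to compare across the two groups. The sketch below shows both on toy rows (the company names are placeholders, and `company_name_missing` is a hypothetical helper column, not part of the notebook).

```python
import pandas as pd

# Toy rows for illustration only; company names are placeholders.
toy = pd.DataFrame({
    "company_name": ["Acme", "Acme", "Globex", None, None],
    "is_fake": [0, 0, 0, 1, 1],
})

# Frequency of each name; value_counts() skips missing names,
# so rows without a name map to NaN.
name_counts = toy["company_name"].value_counts()
toy["company_name_frequency"] = toy["company_name"].map(name_counts)
freq_mean = toy.groupby("is_fake")["company_name_frequency"].mean()

# Hypothetical helper column: 1 when the name is missing, else 0.
toy["company_name_missing"] = toy["company_name"].isna().astype(int)
missing_rate = toy.groupby("is_fake")["company_name_missing"].mean()
print(freq_mean)
print(missing_rate)
```

The share of missing names per group stays well defined even when every name in a group is absent.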

In [36]:
data['employment_type'].value_counts()

employment_type,count
Part-time,615
Full-time,614
Internship,605
Contract,598
Temporary,568


In [40]:
data.groupby('employment_type')['is_fake'].sum().sort_values(ascending=False)

employment_type,is_fake
Part-time,305
Internship,298
Contract,292
Full-time,292
Temporary,285
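
Raw fake counts depend on how many postings each employment type has, so a rate is easier to compare. The sketch below divides the fake counts by the totals, both copied from the two outputs above; on the DataFrame itself, `data.groupby('employment_type')['is_fake'].mean()` should give the same rates.

```python
import pandas as pd

# Counts copied from the two outputs above.
total = pd.Series({"Part-time": 615, "Full-time": 614, "Internship": 605,
                   "Contract": 598, "Temporary": 568})
fake = pd.Series({"Part-time": 305, "Internship": 298, "Contract": 292,
                  "Full-time": 292, "Temporary": 285})

# Division aligns on the index labels, so the ordering above does not matter.
fake_rate = (fake / total).sort_values(ascending=False).round(3)
print(fake_rate)
```

All five rates fall within a few percentage points of each other, so employment type separates fake from real postings only weakly.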


In [41]:
data['has_logo'].value_counts()

has_logo,count
0,2262
1,738


In [46]:
data[data['has_logo'] == 0].is_fake.value_counts()

is_fake,count
1,1472
0,790


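The number of fake postings without a logo (1472) matches exactly the number of postings that have neither a company website nor a company name (1472). Note that 790 real postings also lack a logo, so a missing logo on its own does not mark a posting as fake.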
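
Matching counts do not by themselves prove that the same rows are involved. A row-level comparison settles it; the sketch below shows the idea on made-up rows (names and URLs are placeholders), and the same boolean masks can be built from `data`.

```python
import pandas as pd

# Toy rows for illustration only; names and URLs are placeholders.
toy = pd.DataFrame({
    "company_name": ["Acme", None, None, "Globex"],
    "company_website": ["https://acme.example", None, None, "https://globex.example"],
    "has_logo": [1, 0, 0, 0],
    "is_fake": [0, 1, 1, 0],
})

# Mask 1: fake postings without a logo.
fake_no_logo = (toy["has_logo"] == 0) & (toy["is_fake"] == 1)
# Mask 2: postings with neither a company name nor a website.
no_name_or_site = toy["company_name"].isna() & toy["company_website"].isna()

# True only if both masks select exactly the same rows.
same_rows = bool((fake_no_logo == no_name_or_site).all())
print(same_rows)
```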

In [47]:
data.columns

Index(['job_id', 'job_title', 'job_description', 'requirements', 'benefits',
       'company_name', 'company_profile', 'industry', 'employment_type',
       'location', 'salary_range', 'required_experience_years',
       'education_level', 'department', 'posting_date', 'application_deadline',
       'contact_email', 'company_website', 'has_logo', 'num_open_positions',
       'job_function', 'telecommuting', 'text_length', 'is_fake',
       'has_company_website', 'company_name_frequency'],
      dtype='object')

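# Key Takeaway
*   Postings without a company name are fake: all 1472 fake postings lack a name, and the same number of postings lack a website
*   Every posting with a logo is real, but a missing logo is a weaker signal, since 790 of the 2262 postings without a logo are real
*   Part-time has the most fake postings (305), but the counts are close across all employment types (285 to 305), so employment type alone is a weak signal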
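
To gauge how strong each signal is, the counts already shown in this notebook can be turned into shares. This is plain arithmetic on those figures; the variable names are only illustrative.

```python
# Counts copied from the outputs earlier in this notebook.
fake_total = 305 + 298 + 292 + 292 + 285   # fake postings summed over employment types
no_logo_total = 2262                        # postings with has_logo == 0
no_logo_fake = 1472                         # of those, postings with is_fake == 1
part_time_fake = 305                        # fake postings that are Part-time

# Share of no-logo postings that are fake.
no_logo_fake_share = round(no_logo_fake / no_logo_total, 3)
# Share of all fake postings that are Part-time.
part_time_share_of_fakes = round(part_time_fake / fake_total, 3)

print(fake_total, no_logo_fake_share, part_time_share_of_fakes)
```

Roughly two in three no-logo postings are fake, while Part-time accounts for only about a fifth of all fake postings.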
