## Introduction
#### Fraudulent information, such as fake news, employment scams, and false advertising, is on the rise as a result of the proliferation of data. Because finding a job through online postings has been a significant problem for me, I decided to examine job posts and identify what distinguishes genuine ones from fake ones.
#### The [Real or Fake]: Fake Job Description Prediction dataset, which originated from the Laboratory of Information & Communication Systems Security at the University of the Aegean (Greece), was retrieved from Kaggle. It contains 17,880 real-life job advertisements posted between 2012 and 2014, of which 17,014 are genuine and 866 are fake. The data set has 18 columns and 17,880 rows.

## Downloading the Dataset
#### Let's download the dataset directly from the Kaggle website using the dataset's URL. This way, there is no need to download the CSV file and upload it back to Jupyter.

## Task:
#### The dataset contains unclean data. I will clean it and write the result to an Excel file so that it can be analysed in MS Excel.

In [2]:
import pandas as pd
import os
import numpy as np

In [83]:
dataset = pd.read_csv(r'C:\Users\osakue\Desktop\fake_job_postings.csv')
dataset.head(5)

Unnamed: 0,job_id,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent
0,1,Marketing Intern,"US, NY, New York",Marketing,,"We're Food52, and we've created a groundbreaki...","Food52, a fast-growing, James Beard Award-winn...",Experience with content management systems a m...,,0,1,0,Other,Internship,,,Marketing,0
1,2,Customer Service - Cloud Video Production,"NZ, , Auckland",Success,,"90 Seconds, the worlds Cloud Video Production ...",Organised - Focused - Vibrant - Awesome!Do you...,What we expect from you:Your key responsibilit...,What you will get from usThrough being part of...,0,1,0,Full-time,Not Applicable,,Marketing and Advertising,Customer Service,0
2,3,Commissioning Machinery Assistant (CMA),"US, IA, Wever",,,Valor Services provides Workforce Solutions th...,"Our client, located in Houston, is actively se...",Implement pre-commissioning and commissioning ...,,0,1,0,,,,,,0
3,4,Account Executive - Washington DC,"US, DC, Washington",Sales,,Our passion for improving quality of life thro...,THE COMPANY: ESRI – Environmental Systems Rese...,"EDUCATION: Bachelor’s or Master’s in GIS, busi...",Our culture is anything but corporate—we have ...,0,1,0,Full-time,Mid-Senior level,Bachelor's Degree,Computer Software,Sales,0
4,5,Bill Review Manager,"US, FL, Fort Worth",,,SpotSource Solutions LLC is a Global Human Cap...,JOB TITLE: Itemization Review ManagerLOCATION:...,QUALIFICATIONS:RN license in the State of Texa...,Full Benefits Offered,0,1,1,Full-time,Mid-Senior level,Bachelor's Degree,Hospital & Health Care,Health Care Provider,0


In [10]:
dataset.shape

(17880, 18)

Let's load the country code dataset and rename its columns. We need this dataset to map country codes to complete country names. It is available online, so we will download the file with the urlretrieve function from the urllib.request module.

In [17]:
from urllib.request import urlretrieve
urlretrieve('https://pkgstore.datahub.io/core/country-list/data_csv/data/d7c9d7cfb42cb69f4422dec222dbbaa8/data_csv.csv', 
            'country_codes.csv')
countrycodes = pd.read_csv('country_codes.csv')
countrycodes.head(5)

Unnamed: 0,Name,Code
0,Afghanistan,AF
1,Åland Islands,AX
2,Albania,AL
3,Algeria,DZ
4,American Samoa,AS


In [84]:
countrycodes=countrycodes.rename({'Name':'Country', 'Code':'countrycode'}, axis = 1)
countrycodes.head(5)

Unnamed: 0,Country,countrycode
0,Afghanistan,AF
1,Åland Islands,AX
2,Albania,AL
3,Algeria,DZ
4,American Samoa,AS


We've now loaded the datasets, and we are ready to move on to the next step of preprocessing and cleaning the data for our analysis.

## Cleaning and Preparation of Data
#### In this step we will prepare and clean the data for analysis by checking for erroneous data and missing values. Before cleaning, we should look closely at what each column contains and, if necessary, filter out information we do not need.

Although the rich text columns would support more in-depth research, let's keep our focus on the following areas:

- Geolocation of job listings, including state and country
- Industry and the prevalence of fake job postings
- Type of employment and information about the company

Let's look at some of the data frame's fundamental details.

In [21]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17880 entries, 0 to 17879
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   job_id               17880 non-null  int64 
 1   title                17880 non-null  object
 2   location             17534 non-null  object
 3   department           6333 non-null   object
 4   salary_range         2868 non-null   object
 5   company_profile      14572 non-null  object
 6   description          17879 non-null  object
 7   requirements         15185 non-null  object
 8   benefits             10670 non-null  object
 9   telecommuting        17880 non-null  int64 
 10  has_company_logo     17880 non-null  int64 
 11  has_questions        17880 non-null  int64 
 12  employment_type      14409 non-null  object
 13  required_experience  10830 non-null  object
 14  required_education   9775 non-null   object
 15  industry             12977 non-null  object
 16  func

Most columns have the data type object because they hold text; missing entries in them appear as NaN. The Non-Null count for several columns (for example department, salary_range, and required_education) is lower than the total number of rows (17,880), so those columns have missing values. We'll need to handle empty values and, if necessary, change the data type for some columns.
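The missing-value check described above can be sketched as follows. The `toy` frame is a hypothetical stand-in for `dataset`, and its values are made up for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `dataset`; values are invented.
toy = pd.DataFrame({
    "title": ["Marketing Intern", "Account Executive", "Bill Review Manager"],
    "department": ["Marketing", np.nan, np.nan],
    "salary_range": [np.nan, np.nan, np.nan],
})

# Number of missing values per column.
missing = toy.isna().sum()

# Share of rows that are missing, per column (mean of a boolean mask).
missing_share = toy.isna().mean().round(2)

print(pd.DataFrame({"missing": missing, "share": missing_share}))
```

Running the same two expressions on `dataset` would rank the columns by how much data they lack, which helps decide which ones to drop or fill.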

Five columns (job_id, telecommuting, has_company_logo, has_questions, and fraudulent) were identified as numeric. One more column, salary_range, most likely holds numeric values but is stored as object.
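If salary_range were needed as numbers, one possible approach is to split it into a lower and an upper bound. This is only a sketch: the "min-max" text layout, the sample values, and the column names `salary_min` and `salary_max` are assumptions, not something taken from the notebook.

```python
import pandas as pd

# Hypothetical sample values; the "min-max" layout is an assumption.
salary = pd.Series(["20000-28000", None, "50000-60000", "Oct-15"],
                   name="salary_range")

# Capture the two numbers with named groups; rows that do not match
# the pattern (missing or malformed values) come back as NaN.
bounds = salary.str.extract(r"^(?P<salary_min>\d+)-(?P<salary_max>\d+)$")

# Convert the captured text to numeric columns.
bounds = bounds.apply(pd.to_numeric)

print(bounds)
```

Malformed entries are left as missing rather than guessed, so they can be inspected separately.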

Let's view the descriptive statistics of the numeric columns.

In [22]:
dataset.describe()

Unnamed: 0,job_id,telecommuting,has_company_logo,has_questions,fraudulent
count,17880.0,17880.0,17880.0,17880.0,17880.0
mean,8940.5,0.042897,0.795302,0.491723,0.048434
std,5161.655742,0.202631,0.403492,0.499945,0.214688
min,1.0,0.0,0.0,0.0,0.0
25%,4470.75,0.0,1.0,0.0,0.0
50%,8940.5,0.0,1.0,0.0,0.0
75%,13410.25,0.0,1.0,1.0,0.0
max,17880.0,1.0,1.0,1.0,1.0
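Because fraudulent is a 0/1 flag, its mean in the table above is the share of fake postings. A quick check with the counts quoted in the introduction:

```python
# Counts quoted in the introduction of this notebook.
total_posts = 17880
fake_posts = 866

# The mean of a 0/1 column equals the fraction of rows equal to 1.
fraud_rate = fake_posts / total_posts

print(round(fraud_rate, 6))   # 0.048434, matching the mean reported by describe()
print(f"{fraud_rate:.2%}")    # 4.84%
```

Fewer than 5% of the postings are fake, so the two classes are heavily imbalanced.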


Let's choose the subset of columns that is relevant to our investigation. We will copy those columns into a new data frame called jobsdf so that we can alter it without affecting the original data frame.

In [48]:
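# .copy() makes jobsdf independent of dataset, so the column assignments
# made later do not trigger a SettingWithCopyWarning.
jobsdf = dataset[['title','location','salary_range','company_profile','description','requirements',
                   'benefits','has_company_logo','employment_type','required_experience','required_education',
                   'industry','function','fraudulent']].copy()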
jobsdf.head(5)

Unnamed: 0,title,location,salary_range,company_profile,description,requirements,benefits,has_company_logo,employment_type,required_experience,required_education,industry,function,fraudulent
0,Marketing Intern,"US, NY, New York",,"We're Food52, and we've created a groundbreaki...","Food52, a fast-growing, James Beard Award-winn...",Experience with content management systems a m...,,1,Other,Internship,,,Marketing,0
1,Customer Service - Cloud Video Production,"NZ, , Auckland",,"90 Seconds, the worlds Cloud Video Production ...",Organised - Focused - Vibrant - Awesome!Do you...,What we expect from you:Your key responsibilit...,What you will get from usThrough being part of...,1,Full-time,Not Applicable,,Marketing and Advertising,Customer Service,0
2,Commissioning Machinery Assistant (CMA),"US, IA, Wever",,Valor Services provides Workforce Solutions th...,"Our client, located in Houston, is actively se...",Implement pre-commissioning and commissioning ...,,1,,,,,,0
3,Account Executive - Washington DC,"US, DC, Washington",,Our passion for improving quality of life thro...,THE COMPANY: ESRI – Environmental Systems Rese...,"EDUCATION: Bachelor’s or Master’s in GIS, busi...",Our culture is anything but corporate—we have ...,1,Full-time,Mid-Senior level,Bachelor's Degree,Computer Software,Sales,0
4,Bill Review Manager,"US, FL, Fort Worth",,SpotSource Solutions LLC is a Global Human Cap...,JOB TITLE: Itemization Review ManagerLOCATION:...,QUALIFICATIONS:RN license in the State of Texa...,Full Benefits Offered,1,Full-time,Mid-Senior level,Bachelor's Degree,Hospital & Health Care,Health Care Provider,0


Let's split the location column into country code, state, and city for our geographical analysis.

In [51]:
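# Split on the first two commas only; n must be passed by keyword in recent pandas.
Location = dataset['location'].str.split(',', n=2, expand=True)
Location = Location.rename({0: "countrycode", 1: "state", 2: "city"}, axis=1)
# Keep the original string next to the parts for comparison.
Location['location'] = dataset['location']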
Location.head(5)

Unnamed: 0,countrycode,state,city,location
0,US,NY,New York,"US, NY, New York"
1,NZ,,Auckland,"NZ, , Auckland"
2,US,IA,Wever,"US, IA, Wever"
3,US,DC,Washington,"US, DC, Washington"
4,US,FL,Fort Worth,"US, FL, Fort Worth"
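Splitting on a bare comma leaves the space that follows each comma inside the state and city values. The sketch below shows one way to trim it and to treat empty parts as missing. The sample series, including the `None` entry, is a hypothetical stand-in for `dataset['location']`.

```python
import pandas as pd

# Hypothetical sample standing in for dataset['location'].
location = pd.Series(["US, NY, New York", "NZ, , Auckland", None],
                     name="location")

# Split on the first two commas only, so any later commas stay in the city part.
parts = location.str.split(",", n=2, expand=True)
parts.columns = ["countrycode", "state", "city"]

# Remove the whitespace that followed each comma.
parts = parts.apply(lambda col: col.str.strip())

# Keep non-empty values; empty strings (e.g. a blank state) become missing.
parts = parts.where(parts != "")

print(parts)
```

Trimming matters mostly for later grouping or joining on state and city, where " NY" and "NY" would otherwise count as different values.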


In [79]:
jobsdf['state']=Location['state']
jobsdf['city']=Location['city']
jobsdf['countrycode']=Location['countrycode']
jobsdf=jobsdf.drop(['location'], axis=1)

In [82]:
jobsdf= jobsdf.merge(countrycodes, on="countrycode")
jobsdf.head(5)

Unnamed: 0,title,salary_range,company_profile,description,requirements,benefits,has_company_logo,employment_type,required_experience,required_education,industry,function,fraudulent,state,city,countrycode,Country_x,Country_y
0,Marketing Intern,,"We're Food52, and we've created a groundbreaki...","Food52, a fast-growing, James Beard Award-winn...",Experience with content management systems a m...,,1,Other,Internship,,,Marketing,0,NY,New York,US,United States,United States
1,Commissioning Machinery Assistant (CMA),,Valor Services provides Workforce Solutions th...,"Our client, located in Houston, is actively se...",Implement pre-commissioning and commissioning ...,,1,,,,,,0,IA,Wever,US,United States,United States
2,Account Executive - Washington DC,,Our passion for improving quality of life thro...,THE COMPANY: ESRI – Environmental Systems Rese...,"EDUCATION: Bachelor’s or Master’s in GIS, busi...",Our culture is anything but corporate—we have ...,1,Full-time,Mid-Senior level,Bachelor's Degree,Computer Software,Sales,0,DC,Washington,US,United States,United States
3,Bill Review Manager,,SpotSource Solutions LLC is a Global Human Cap...,JOB TITLE: Itemization Review ManagerLOCATION:...,QUALIFICATIONS:RN license in the State of Texa...,Full Benefits Offered,1,Full-time,Mid-Senior level,Bachelor's Degree,Hospital & Health Care,Health Care Provider,0,FL,Fort Worth,US,United States,United States
4,Accounting Clerk,,,Job OverviewApex is an environmental consultin...,,,0,,,,,,0,MD,,US,United States,United States
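By default, merge performs an inner join, so postings whose country code is missing or absent from the lookup table are dropped. pandas also adds the _x and _y suffixes when both frames already share a non-key column name, which is what Country_x and Country_y in the output above suggest happened. A minimal sketch of a left join that keeps every posting and flags unmatched ones, using hypothetical toy frames in place of jobsdf and countrycodes:

```python
import pandas as pd

# Hypothetical toy frames standing in for jobsdf and countrycodes.
jobs = pd.DataFrame({
    "title": ["Marketing Intern", "Customer Service", "Unknown Location Job"],
    "countrycode": ["US", "NZ", None],
})
codes = pd.DataFrame({
    "Country": ["United States", "New Zealand"],
    "countrycode": ["US", "NZ"],
})

# Default (inner) join: the posting without a country code is dropped.
inner = jobs.merge(codes, on="countrycode")

# Left join keeps every posting; indicator=True adds a _merge column
# telling whether a match was found in the lookup table.
merged = jobs.merge(codes, on="countrycode", how="left", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]

print(len(inner), len(merged), len(unmatched))
```

Comparing the row counts before and after the merge is a cheap way to notice silently dropped postings.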


In [87]:
jobsdf.to_excel('C:\\Users\\osakue\\Desktop\\newjobs.xlsx')