# NLP in Pyspark's MLlib Project

## Fake Job Posting Predictions

Indeed.com has just hired you to create a system that automatically flags suspicious job postings on it's website. It has recently seen an influx of fake job postings that is negativley impacting it's customer experience. Becuase of the high volume of job postings it receives everyday, their employees do have the capacity to check every posting so they would like prioritize which postings to review before deleting it. 

#### Your task
Use the attached dataset with NLP to create an alogorthim which automatically flags suspicious posts for review. 

#### The data
This dataset contains 18K job descriptions out of which about 800 are fake. The data consists of both textual information and meta-information about the jobs.

**Data Source:** https://www.kaggle.com/shivamb/real-or-fake-fake-jobposting-prediction

#### Have fun!

In [63]:
import findspark
findspark.init()

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('NLP').getOrCreate()

In [64]:
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer, IDF, StringIndexer, VectorAssembler
from pyspark.ml.classification import NaiveBayes
from pyspark.ml import Pipeline
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

from sklearn.metrics import classification_report

In [65]:
df = spark.read.csv('Datasets/fake_job_postings.csv',inferSchema=True,header=True)
df.limit(5).toPandas()

Unnamed: 0,job_id,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent
0,1,Marketing Intern,"US, NY, New York",Marketing,,"We're Food52, and we've created a groundbreaki...","Food52, a fast-growing, James Beard Award-winn...",Experience with content management systems a m...,,0,1,0,Other,Internship,,,Marketing,0
1,2,Customer Service - Cloud Video Production,"NZ, , Auckland",Success,,"90 Seconds, the worlds Cloud Video Production ...",Organised - Focused - Vibrant - Awesome!Do you...,What we expect from you:Your key responsibilit...,What you will get from usThrough being part of...,0,1,0,Full-time,Not Applicable,,Marketing and Advertising,Customer Service,0
2,3,Commissioning Machinery Assistant (CMA),"US, IA, Wever",,,Valor Services provides Workforce Solutions th...,"Our client, located in Houston, is actively se...",Implement pre-commissioning and commissioning ...,,0,1,0,,,,,,0
3,4,Account Executive - Washington DC,"US, DC, Washington",Sales,,Our passion for improving quality of life thro...,THE COMPANY: ESRI – Environmental Systems Rese...,"EDUCATION: Bachelor’s or Master’s in GIS, busi...",Our culture is anything but corporate—we have ...,0,1,0,Full-time,Mid-Senior level,Bachelor's Degree,Computer Software,Sales,0
4,5,Bill Review Manager,"US, FL, Fort Worth",,,SpotSource Solutions LLC is a Global Human Cap...,JOB TITLE: Itemization Review ManagerLOCATION:...,QUALIFICATIONS:RN license in the State of Texa...,Full Benefits Offered,0,1,1,Full-time,Mid-Senior level,Bachelor's Degree,Hospital & Health Care,Health Care Provider,0


In [66]:
df.printSchema()

root
 |-- job_id: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- location: string (nullable = true)
 |-- department: string (nullable = true)
 |-- salary_range: string (nullable = true)
 |-- company_profile: string (nullable = true)
 |-- description: string (nullable = true)
 |-- requirements: string (nullable = true)
 |-- benefits: string (nullable = true)
 |-- telecommuting: string (nullable = true)
 |-- has_company_logo: string (nullable = true)
 |-- has_questions: string (nullable = true)
 |-- employment_type: string (nullable = true)
 |-- required_experience: string (nullable = true)
 |-- required_education: string (nullable = true)
 |-- industry: string (nullable = true)
 |-- function: string (nullable = true)
 |-- fraudulent: string (nullable = true)



In [67]:
df.count()

17880

In [68]:
df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).toPandas().T.sort_values(0, ascending=False)

Unnamed: 0,0
salary_range,15011
department,11547
required_education,7748
benefits,6966
required_experience,6723
function,6317
industry,4831
company_profile,3308
employment_type,3292
requirements,2573


In [69]:
cleaned_df = df.select(
    'title',
    'description',
    'requirements',
    'has_company_logo',
    'has_questions',
    'telecommuting',
    'fraudulent'
    ).dropna()

cleaned_df.show(5)

+--------------------+--------------------+--------------------+----------------+-------------+-------------+----------+
|               title|         description|        requirements|has_company_logo|has_questions|telecommuting|fraudulent|
+--------------------+--------------------+--------------------+----------------+-------------+-------------+----------+
|    Marketing Intern|Food52, a fast-gr...|Experience with c...|               1|            0|            0|         0|
|Customer Service ...|Organised - Focus...|What we expect fr...|               1|            0|            0|         0|
|Commissioning Mac...|Our client, locat...|Implement pre-com...|               1|            0|            0|         0|
|Account Executive...|THE COMPANY: ESRI...|EDUCATION: Bachel...|               1|            0|            0|         0|
| Bill Review Manager|JOB TITLE: Itemiz...|QUALIFICATIONS:RN...|               1|            1|            0|         0|
+--------------------+----------

In [70]:
cleaned_df.groupBy('fraudulent').count().orderBy(desc('count')).show(5)

+-----------------+-----+
|       fraudulent|count|
+-----------------+-----+
|                0|13659|
|                1|  736|
|        Full-time|   73|
|Bachelor's Degree|   50|
|      Engineering|   20|
+-----------------+-----+
only showing top 5 rows



In [71]:
cleaned_df = cleaned_df.filter(
    (df.fraudulent.isin([0,1]))
    & (df.has_company_logo.isin([0,1]))
    & (df.has_questions.isin([0,1]))
    & (df.telecommuting.isin([0,1]))
    )

cleaned_df.groupBy('fraudulent').count().orderBy(desc('count')).show(5)

+----------+-----+
|fraudulent|count|
+----------+-----+
|         0|13591|
|         1|  682|
+----------+-----+



In [72]:
cleaned_df = cleaned_df.withColumn('has_company_logo', cleaned_df.has_company_logo.cast(IntegerType())) \
    .withColumn('has_questions', cleaned_df.has_questions.cast(IntegerType())) \
    .withColumn('telecommuting', cleaned_df.telecommuting.cast(IntegerType())) \
    .withColumn('fraudulent', cleaned_df.fraudulent.cast(IntegerType()))

In [73]:
cleaned_df.printSchema()

root
 |-- title: string (nullable = true)
 |-- description: string (nullable = true)
 |-- requirements: string (nullable = true)
 |-- has_company_logo: integer (nullable = true)
 |-- has_questions: integer (nullable = true)
 |-- telecommuting: integer (nullable = true)
 |-- fraudulent: integer (nullable = true)



In [74]:
cleaned_df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in cleaned_df.columns]).toPandas().T.sort_values(0, ascending=False)

Unnamed: 0,0
title,0
description,0
requirements,0
has_company_logo,0
has_questions,0
telecommuting,0
fraudulent,0


In [75]:
cleaned_df = cleaned_df.withColumn('text',concat_ws(' ','title','description','requirements'))

In [76]:
# NLP tools
tokenizer = Tokenizer(inputCol='text', outputCol='token_text')
stop_rm = StopWordsRemover(inputCol='token_text', outputCol='stop_token')
count_vect = CountVectorizer(inputCol='stop_token', outputCol=('c_vec'), vocabSize=20000)
idf = IDF(inputCol='c_vec', outputCol='tf_idf')

# Vector assembler
featured_data = VectorAssembler(inputCols=[
    'tf_idf',
    'has_company_logo',
    'has_questions',
    'telecommuting',
    ], outputCol='features')

# Pre-process data
pre_processor = Pipeline(stages=[tokenizer, stop_rm, count_vect, idf, featured_data])

cleaner = pre_processor.fit(cleaned_df)
cleaned_df = cleaner.transform(cleaned_df)

In [77]:
df_for_ml = cleaned_df.select('features','fraudulent')
df_for_ml.printSchema()

root
 |-- features: vector (nullable = true)
 |-- fraudulent: integer (nullable = true)



In [78]:
df_for_ml.show(5)

+--------------------+----------+
|            features|fraudulent|
+--------------------+----------+
|(20003,[0,1,2,6,9...|         0|
|(20003,[0,1,2,3,4...|         0|
|(20003,[1,8,22,28...|         0|
|(20003,[0,1,2,3,4...|         0|
|(20003,[0,1,2,3,6...|         0|
+--------------------+----------+
only showing top 5 rows



In [79]:
train, test = df_for_ml.randomSplit([0.7,0.3])

# Evaluator precises the target column and the kind of metrics to use 
evaluator = MulticlassClassificationEvaluator(labelCol='fraudulent', metricName='f1')

# Define transformtions stages to throw in the pipeline
nb = NaiveBayes(featuresCol='features', labelCol='fraudulent')

# Definition of pipeline
pipeline_nb = Pipeline(stages=[nb])

# Definition of the grid parameters
paramGrid = ParamGridBuilder().\
            addGrid(nb.modelType, ["multinomial"]).\
            build()

# Definition of the cross validator
cv = CrossValidator(
  estimator=pipeline_nb,
  estimatorParamMaps=paramGrid, 
  evaluator=evaluator, 
  numFolds=3)

# Train the model
model = cv.fit(train)

# Predict classes on test part
predictions = model.transform(test)
print(predictions.show(5))

predictions_pd = predictions.toPandas()
print(classification_report(predictions_pd.prediction, predictions_pd.fraudulent))

+--------------------+----------+--------------------+--------------------+----------+
|            features|fraudulent|       rawPrediction|         probability|prediction|
+--------------------+----------+--------------------+--------------------+----------+
|(20003,[0,1,2,3,4...|         0|[-6212.6779657376...|[1.0,1.0110119636...|       0.0|
|(20003,[0,1,2,3,4...|         0|[-10995.457086747...|[1.0,5.7780640703...|       0.0|
|(20003,[0,1,2,3,4...|         0|[-3492.7624269515...|[1.0,3.0333622710...|       0.0|
|(20003,[0,1,2,3,4...|         1|[-6491.3744842387...|[1.0,5.2431242827...|       0.0|
|(20003,[0,1,2,3,4...|         0|[-11706.139477914...|           [1.0,0.0]|       0.0|
+--------------------+----------+--------------------+--------------------+----------+
only showing top 5 rows

None
              precision    recall  f1-score   support

         0.0       0.97      0.99      0.98      3997
         1.0       0.80      0.56      0.65       293

    accuracy           

In [80]:
result_test = predictions.withColumn('prediction_results', predictions.fraudulent - predictions.prediction)
res = udf(lambda x: True if x==0 else False)
result_test = result_test.withColumn('prediction_results', res(result_test.prediction_results))

print('Prediction of genuine offers:')
print(result_test.select('fraudulent','prediction','prediction_results').orderBy(rand()).filter('fraudulent == 0').show(10))
print('\nPrediction of fraudulent offers:')
print(result_test.select('fraudulent','prediction','prediction_results').orderBy(rand()).filter('fraudulent == 1').show(10))

Prediction of genuine offers:
+----------+----------+------------------+
|fraudulent|prediction|prediction_results|
+----------+----------+------------------+
|         0|       0.0|              true|
|         0|       0.0|              true|
|         0|       0.0|              true|
|         0|       0.0|              true|
|         0|       0.0|              true|
|         0|       0.0|              true|
|         0|       0.0|              true|
|         0|       0.0|              true|
|         0|       0.0|              true|
|         0|       0.0|              true|
+----------+----------+------------------+
only showing top 10 rows

None

Prediction of fraudulent offers:
+----------+----------+------------------+
|fraudulent|prediction|prediction_results|
+----------+----------+------------------+
|         1|       1.0|              true|
|         1|       1.0|              true|
|         1|       1.0|              true|
|         1|       1.0|              true|
| 