# NLP in PySpark's MLlib Project

## Fake Job Posting Predictions

Indeed.com has just hired you to create a system that automatically flags suspicious job postings on its website. It has recently seen an influx of fake job postings that is negatively impacting its customer experience. Because of the high volume of job postings it receives every day, its employees do not have the capacity to check every posting, so they would like to prioritize which postings to review before deleting them.

#### Your task
Use the attached dataset with NLP to create an algorithm which automatically flags suspicious posts for review.

#### The data
This dataset contains 18K job descriptions out of which about 800 are fake. The data consists of both textual information and meta-information about the jobs.

**Data Source:** https://www.kaggle.com/shivamb/real-or-fake-fake-jobposting-prediction

#### Have fun!

In [None]:
!pip install pyspark

import pyspark  # when running locally with findspark, call findspark.init() first
from pyspark.sql import SparkSession
# May take a while locally
spark = SparkSession.builder.appName("NLP").getOrCreate()

spark


In [None]:
from pyspark.ml.feature import * 
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.ml.classification import *
from pyspark.ml.evaluation import *
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# For pipeline development
from pyspark.ml import Pipeline 

# Import the data

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
path ="drive/MyDrive/5. Spark/spark-scripts/section3/Datasets/"

df = spark.read.csv(path+'fake_job_postings.csv',inferSchema=True,header=True)

In [None]:
df.show(5,truncate = False)


# Checking data consistency




In [None]:
df.printSchema()

root
 |-- job_id: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- location: string (nullable = true)
 |-- department: string (nullable = true)
 |-- salary_range: string (nullable = true)
 |-- company_profile: string (nullable = true)
 |-- description: string (nullable = true)
 |-- requirements: string (nullable = true)
 |-- benefits: string (nullable = true)
 |-- telecommuting: string (nullable = true)
 |-- has_company_logo: string (nullable = true)
 |-- has_questions: string (nullable = true)
 |-- employment_type: string (nullable = true)
 |-- required_experience: string (nullable = true)
 |-- required_education: string (nullable = true)
 |-- industry: string (nullable = true)
 |-- function: string (nullable = true)
 |-- fraudulent: string (nullable = true)



In [None]:
df.count()

17880

In [None]:
df.show(5,truncate = False)


In [None]:
df.select(countDistinct('fraudulent')).show()

+--------------------------+
|count(DISTINCT fraudulent)|
+--------------------------+
|                       258|
+--------------------------+



In [None]:
num = df.count() - df.filter('fraudulent=1 or fraudulent=0').count()
print('Number of rows with labels not in [0,1]: ', num)

Number of rows with labels not in [0,1]:  914


### The label column contains some text values, so I will remove those rows

In [None]:
df = df.filter('fraudulent=1 or fraudulent=0')

In [None]:
df.groupBy('fraudulent').count().show()

+----------+-----+
|fraudulent|count|
+----------+-----+
|         0|16080|
|         1|  886|
+----------+-----+
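
A possible cause of the text in the label column (an assumption, not verified against this file) is that some free-text fields contain commas and doubled quotes that were not treated as part of a quoted field, which shifts later columns to the right. The plain-Python sketch below uses a made-up row to show the effect. In Spark, the CSV reader's `quote`, `escape`, and `multiLine` options would be the settings to experiment with.

```python
import csv
import io

# Made-up header and row for illustration: the description holds a comma and doubled quotes
header = "job_id,description,fraudulent"
row = '1,"We need a ""rock star"", apply now",0'
label_index = header.split(",").index("fraudulent")

# Naive comma splitting ignores quoting, so a text fragment lands in the label position
naive_fields = row.split(",")

# A quote-aware parser keeps the whole description in one field
parsed_fields = next(csv.reader(io.StringIO(row)))

print(len(naive_fields), repr(naive_fields[label_index]))
print(len(parsed_fields), repr(parsed_fields[label_index]))
```

If this is what happens here, fixing the read options would recover rows that the filter above throws away.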



### Checking consistency of binary (0,1) columns

In [None]:
bin_features = ['telecommuting','has_company_logo','has_questions']

In [None]:
def checkNumericConsistency(dataframe):
  for i in bin_features:
    num = dataframe.count() - dataframe.filter( (dataframe[i].isNull()) | (dataframe[i] == '0') |  (dataframe[i] == '1') ).count()
    if num ==0:
      print(i, "doesn't contain any strings")
      continue
    print('Found', num, 'rows in ', i ,'containing strings')

checkNumericConsistency(df)

Found 122 rows in  telecommuting containing strings
Found 122 rows in  has_company_logo containing strings
Found 122 rows in  has_questions containing strings


I will simply remove any row that contains an unexpected value (anything other than 0, 1, or null) in these binary columns

In [None]:
filtered_df = df
for i in bin_features:
  # Keep only rows where the flag is null, '0', or '1'
  filtered_df = filtered_df.filter( col(i).isNull() | col(i).isin('0', '1') )

checkNumericConsistency(filtered_df)

telecommuting doesn't contain any strings
has_company_logo doesn't contain any strings
has_questions doesn't contain any strings


Changing their datatype to integers

In [None]:
checked_df = filtered_df
for i in bin_features:
  checked_df = checked_df.withColumn(i,checked_df[i].cast(IntegerType()))

checked_df.printSchema()

root
 |-- job_id: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- location: string (nullable = true)
 |-- department: string (nullable = true)
 |-- salary_range: string (nullable = true)
 |-- company_profile: string (nullable = true)
 |-- description: string (nullable = true)
 |-- requirements: string (nullable = true)
 |-- benefits: string (nullable = true)
 |-- telecommuting: integer (nullable = true)
 |-- has_company_logo: integer (nullable = true)
 |-- has_questions: integer (nullable = true)
 |-- employment_type: string (nullable = true)
 |-- required_experience: string (nullable = true)
 |-- required_education: string (nullable = true)
 |-- industry: string (nullable = true)
 |-- function: string (nullable = true)
 |-- fraudulent: string (nullable = true)



### Checking consistency of the other features

**title, company_profile, description, requirements, benefits**: will stay as free text



**salary_range**: will be examined


I found some date-like values in the salary_range column

In [None]:
checked_df.filter( (~col('salary_range').rlike(r'^[0-9\-]+$')) & (col('salary_range').isNotNull()) ) \
                  .groupBy('salary_range').count().show(20)

+------------+-----+
|salary_range|count|
+------------+-----+
|       4-Apr|    1|
|       4-Jun|    1|
|      Oct-15|    1|
|       3-Apr|    1|
|       9-Dec|    1|
|       8-Sep|    1|
|      10-Oct|    6|
|       2-Apr|    1|
|      Jun-18|    1|
|       2-Jun|    1|
|      11-Dec|    1|
|      Oct-20|    2|
|      11-Nov|    2|
|      10-Nov|    5|
|      Dec-25|    1|
+------------+-----+



I will remove these rows 


In [None]:
# Keep rows whose salary_range is null or made up only of digits and hyphens
final_df = checked_df.filter( col('salary_range').isNull() | col('salary_range').rlike(r'^[0-9\-]+$') )

In [None]:
final_df.filter( (~col('salary_range').rlike(r'^[0-9\-]+$')) & (col('salary_range').isNotNull()) ) \
                  .groupBy('salary_range').count().show(20)

+------------+-----+
|salary_range|count|
+------------+-----+
+------------+-----+
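
The pattern used for salary_range accepts only strings made of digits and hyphens. A small plain-Python sketch of the same check is shown here; the sample values are partly taken from the table above and partly invented, and Python's `re` stands in for the Java regex engine that `rlike` uses (the two agree for this simple pattern).

```python
import re

# Same pattern as in the Spark filter: digits and hyphens only, from start to end
salary_pattern = re.compile(r'^[0-9\-]+$')

# '10-Oct' and 'Dec-25' appear in the output above; the numeric ranges are invented examples
samples = ['20000-28000', '10-Oct', 'Dec-25', '0-0']

kept = [s for s in samples if salary_pattern.search(s)]
dropped = [s for s in samples if not salary_pattern.search(s)]

print('kept:', kept)
print('dropped:', dropped)
```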



# Performing some EDA

In [None]:
bin_features = ['telecommuting','has_company_logo','has_questions']
string_features = ['location', 'department', 'salary_range', 'employment_type', 'required_experience', 'required_education', 'industry', 'function']
text_features = ['title', 'company_profile', 'description', 'requirements', 'benefits']

Checking number of unique values in string features that will be indexed

In [None]:
print('Number of unique values in: ')
for i in string_features:
  print('   ',i,": ",final_df.groupBy(i).count().count())

Number of unique values in: 
    location :  2972
    department :  1256
    salary_range :  817
    employment_type :  6
    required_experience :  8
    required_education :  14
    industry :  131
    function :  38


Since the **location**, **department**, and **industry** columns contain many unique values, I will move them into text_features so that they are concatenated with the other text columns later

In [None]:
bin_features = ['telecommuting','has_company_logo','has_questions']
string_features = ['salary_range', 'employment_type', 'required_experience', 'required_education', 'function']
text_features = ['title','location', 'department','company_profile', 'description', 'requirements', 'benefits', 'industry']

Checking class imbalance

In [None]:
final_df.groupBy('fraudulent').count().show()

+----------+-----+
|fraudulent|count|
+----------+-----+
|         0|15987|
|         1|  831|
+----------+-----+
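
To put numbers on the imbalance, the short sketch below computes the share of fraudulent rows and a set of inverse-frequency class weights from the two counts above. Weighting is one common way to handle imbalance; it is not something this notebook does, and the formula `total / (n_classes * class_count)` is just one conventional choice.

```python
# Class counts copied from the groupBy output above
counts = {0: 15987, 1: 831}

total = sum(counts.values())
fraud_share = counts[1] / total

# Inverse-frequency weights: the rarer the class, the larger its weight
weights = {label: total / (len(counts) * n) for label, n in counts.items()}

print(f"fraud share: {fraud_share:.2%}")
print({label: round(w, 2) for label, w in weights.items()})
```

Weights like these could be written to a column and passed to a classifier through its `weightCol` parameter, where the chosen classifier supports one. If you paste the sketch into this notebook, keep in mind that the wildcard import from `pyspark.sql.functions` replaces built-ins such as `sum` and `round`.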


Checking null values

In [None]:
nullPct = (final_df.count()-final_df.na.drop().count())/final_df.count()
nullPct = nullPct *100
print('Percentage of rows containing null values: ', nullPct,'%')

Percentage of rows containing null values:  95.70698061600666 %


As you can see, about 96% of the rows contain a null value in at least one column. Let's deal with them

In [None]:
def null_value_calc(df):
    null_columns_counts = []
    numRows = df.count()
    for k in bin_features + string_features:
        nullRows = df.where(col(k).isNull()).count()
        if(nullRows > 0):
            temp = k,nullRows,(nullRows/numRows)*100
            null_columns_counts.append(temp)
    return(null_columns_counts)

In [None]:
null_columns_calc_list = null_value_calc(final_df)
spark.createDataFrame(null_columns_calc_list, ['Column_Name', 'Null_Values_Count','Null_Value_Percent']) \
                                              .orderBy(col('Null_Value_Percent').desc()).show(15)

+-------------------+-----------------+------------------+
|        Column_Name|Null_Values_Count|Null_Value_Percent|
+-------------------+-----------------+------------------+
|       salary_range|            14153| 84.15388274467833|
| required_education|             7642| 45.43941015578547|
|required_experience|             6667|39.642050184326315|
|           function|             6139| 36.50255678439767|
|    employment_type|             3272|19.455345463194195|
+-------------------+-----------------+------------------+



As we can see, bin_features don't contain any null values

Let's deal with the rest

# Data preprocessing

Since salary_range is null in about 84% of the rows, I'll simply drop it

Next time I will check for missing values before checking data consistency, to avoid unnecessary effort

In [None]:
bin_features = ['telecommuting','has_company_logo','has_questions']
string_features = ['employment_type', 'required_experience', 'required_education', 'function']
text_features = ['title','location', 'department','company_profile', 'description', 'requirements', 'benefits', 'industry']

In [None]:
final_df = final_df.drop('salary_range')
final_df.printSchema()

root
 |-- job_id: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- location: string (nullable = true)
 |-- department: string (nullable = true)
 |-- company_profile: string (nullable = true)
 |-- description: string (nullable = true)
 |-- requirements: string (nullable = true)
 |-- benefits: string (nullable = true)
 |-- telecommuting: integer (nullable = true)
 |-- has_company_logo: integer (nullable = true)
 |-- has_questions: integer (nullable = true)
 |-- employment_type: string (nullable = true)
 |-- required_experience: string (nullable = true)
 |-- required_education: string (nullable = true)
 |-- industry: string (nullable = true)
 |-- function: string (nullable = true)
 |-- fraudulent: string (nullable = true)



Since none of the bin_features contain nulls, I'll fill the null values in the string columns with ' ' (a single space)

In [None]:
final_df = final_df.fillna(' ')
final_df.count()

16818

Next, concatenate all text features into one column

In [None]:
final_df = final_df.select(concat_ws(' ',*text_features).alias('text_feature'),*string_features,*bin_features,'fraudulent')
final_df.show(2)

+--------------------+---------------+-------------------+------------------+---------------+-------------+----------------+-------------+----------+
|        text_feature|employment_type|required_experience|required_education|       function|telecommuting|has_company_logo|has_questions|fraudulent|
+--------------------+---------------+-------------------+------------------+---------------+-------------+----------------+-------------+----------+
|English Teacher A...|       Contract|                   | Bachelor's Degree|      Education|            0|               1|            1|         0|
|HR Administrator ...|      Full-time|                   |                  |Human Resources|            0|               1|            0|         0|
+--------------------+---------------+-------------------+------------------+---------------+-------------+----------------+-------------+----------+
only showing top 2 rows



### Now use StringIndexer on string_features to encode them

In [None]:
indexed = final_df
string_inputs = []
for column in string_features:
  indexer = StringIndexer(inputCol=column, outputCol=column+"_num") 
  indexed = indexer.fit(indexed).transform(indexed)
  new_col_name = column+"_num"
  string_inputs.append(new_col_name)

In [None]:
indexed.show(2)

+--------------------+---------------+-------------------+------------------+---------------+-------------+----------------+-------------+----------+-------------------+-----------------------+----------------------+------------+
|        text_feature|employment_type|required_experience|required_education|       function|telecommuting|has_company_logo|has_questions|fraudulent|employment_type_num|required_experience_num|required_education_num|function_num|
+--------------------+---------------+-------------------+------------------+---------------+-------------+----------------+-------------+----------+-------------------+-----------------------+----------------------+------------+
|English Teacher A...|       Contract|                   | Bachelor's Degree|      Education|            0|               1|            1|         0|                2.0|                    0.0|                   1.0|         8.0|
|HR Administrator ...|      Full-time|                   |                  |Hum
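
The indices above follow StringIndexer's default ordering, in which the most frequent label gets index 0. That is why the blank placeholder, which is common in these columns, shows up as 0.0. A rough plain-Python sketch of that rule follows; the toy values are invented, and the alphabetical tie-break is an assumption based on the Spark 3 documentation.

```python
from collections import Counter

def frequency_desc_index(values):
    """Map each label to an index, most frequent first (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Invented toy column: the blank placeholder is the most common value
toy_column = [' ', ' ', ' ', 'Full-time', 'Full-time', 'Contract']
mapping = frequency_desc_index(toy_column)
print(mapping)
```

Note that the resulting numbers are arbitrary category codes, not an ordering of the categories.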

Now that the string_features are encoded, let's change their datatype to integer

In [None]:
string_features = string_inputs

In [None]:
for i in string_features:
  indexed = indexed.withColumn(i,indexed[i].cast(IntegerType()))
indexed = indexed.withColumn('fraudulent',col('fraudulent').cast(IntegerType()))
indexed.printSchema()

root
 |-- text_feature: string (nullable = false)
 |-- employment_type: string (nullable = false)
 |-- required_experience: string (nullable = false)
 |-- required_education: string (nullable = false)
 |-- function: string (nullable = false)
 |-- telecommuting: integer (nullable = true)
 |-- has_company_logo: integer (nullable = true)
 |-- has_questions: integer (nullable = true)
 |-- fraudulent: integer (nullable = true)
 |-- employment_type_num: integer (nullable = true)
 |-- required_experience_num: integer (nullable = true)
 |-- required_education_num: integer (nullable = true)
 |-- function_num: integer (nullable = true)



### Now let's process the text column

In [None]:
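# Because the replacement string is shorter than the matching string, translate() turns '.' into a space
# and deletes the other listed characters (# $ \ / ( ))
indexed = indexed.withColumn("text_feature",translate(col("text_feature"), ".#$\\/()", " "))
# Replace each run of non-alphanumeric characters that is followed by a space with a single space
indexed = indexed.withColumn("text_feature",regexp_replace(col('text_feature'), '[^A-Za-z0-9]+ ', ' '))
# Collapse repeated spaces, then lowercase
indexed = indexed.withColumn("text_feature",regexp_replace(col('text_feature'), ' +', ' '))
indexed = indexed.withColumn("text_feature",lower(col('text_feature')))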
indexed.select('text_feature').show(3,truncate = False)

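
To see what the four cleaning steps do to a single string, here is a plain-Python imitation. It is only a sketch: `clean_text` is a hypothetical helper, the sample sentence is invented, and Python's `str.translate` and `re` stand in for the Spark functions.

```python
import re

# Mirrors translate(col, ".#$\/()", " "): '.' becomes a space, the other characters are dropped
TRANSLATION = str.maketrans({'.': ' ', '#': None, '$': None, '\\': None,
                             '/': None, '(': None, ')': None})

def clean_text(text: str) -> str:
    text = text.translate(TRANSLATION)
    # Runs of non-alphanumeric characters followed by a space become one space
    text = re.sub(r'[^A-Za-z0-9]+ ', ' ', text)
    # Collapse any remaining repeated spaces
    text = re.sub(r' +', ' ', text)
    return text.lower()

cleaned = clean_text('Sr. Dev (C#/Java),  $100k')
print(cleaned)
```

One consequence worth noticing is that words separated only by a slash are glued together (here `C#/Java` ends up as one token), which may or may not be what is wanted.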

In [None]:
regex_tokenizer = RegexTokenizer(inputCol="text_feature", outputCol="words", pattern="\\W") 
tokenized = regex_tokenizer.transform(indexed)

remover = StopWordsRemover(inputCol="words", outputCol="filtered")
feature_data = remover.transform(tokenized)

feature_data.show(2,False)


In [None]:
hashingTF = HashingTF(inputCol="filtered", outputCol="rawfeatures", numFeatures=1000)
feature_data = hashingTF.transform(feature_data)
idf = IDF(inputCol="rawfeatures", outputCol="text")
feature_data = idf.fit(feature_data).transform(feature_data)
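
For intuition about what HashingTF followed by IDF produces, here is a small plain-Python sketch. It makes several assumptions: CRC32 is a stand-in hash (Spark uses MurmurHash3, so the bucket indices will differ), the three token lists are invented, and the smoothed formula `log((m + 1) / (df + 1))` is the one given in the Spark MLlib documentation.

```python
import math
import zlib
from collections import Counter

NUM_FEATURES = 1000  # same number of buckets as the HashingTF call above

def bucket(term: str) -> int:
    # Stand-in hash for illustration only; not the hash Spark uses
    return zlib.crc32(term.encode('utf-8')) % NUM_FEATURES

def term_frequencies(tokens):
    # Count how many tokens fall into each bucket
    return Counter(bucket(t) for t in tokens)

def idf_weights(docs_tf):
    m = len(docs_tf)
    doc_freq = Counter()
    for tf in docs_tf:
        doc_freq.update(tf.keys())  # each document counts once per bucket
    return {idx: math.log((m + 1) / (df + 1)) for idx, df in doc_freq.items()}

# Invented token lists; 'job' appears in every document
docs = [['job', 'remote', 'data', 'entry'],
        ['job', 'data', 'engineer'],
        ['job', 'earn', 'cash', 'remote']]

docs_tf = [term_frequencies(d) for d in docs]
idf = idf_weights(docs_tf)
tfidf = [{idx: count * idf[idx] for idx, count in tf.items()} for tf in docs_tf]

print(idf[bucket('job')])  # a term present in every document gets weight 0
```

With only 1000 buckets, different words can collide in the same index, which is a trade-off of the hashing approach.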

In [None]:
feature_data = feature_data.select('text',*string_features,*bin_features, 'fraudulent')
feature_data.show(2) 

+--------------------+-------------------+-----------------------+----------------------+------------+-------------+----------------+-------------+----------+
|                text|employment_type_num|required_experience_num|required_education_num|function_num|telecommuting|has_company_logo|has_questions|fraudulent|
+--------------------+-------------------+-----------------------+----------------------+------------+-------------+----------------+-------------+----------+
|(1000,[7,12,32,35...|                  2|                      0|                     1|           8|            0|               1|            1|         0|
|(1000,[7,12,18,20...|                  0|                      0|                     0|          13|            0|               1|            0|         0|
+--------------------+-------------------+-----------------------+----------------------+------------+-------------+----------------+-------------+----------+
only showing top 2 rows



In [None]:
assembler = VectorAssembler(inputCols=['text']+ string_features + bin_features  ,outputCol='features')

output = assembler.transform(feature_data).select('features','fraudulent')

In [None]:
output = output.withColumn('label',col('fraudulent'))
output.show(2,False)


In [None]:
output = output.select('features','label')
output.show(2)

+--------------------+-----+
|            features|label|
+--------------------+-----+
|(1007,[7,12,32,35...|    0|
|(1007,[7,12,18,20...|    0|
+--------------------+-----+
only showing top 2 rows



Scale the features vector with min-max scaling (here to the range 0 to 50)

In [None]:
scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures",min = 0, max = 50)

scalerModel = scaler.fit(output)
scaled_data = scalerModel.transform(output)

feature_data = scaled_data.select('label','scaledFeatures')
feature_data = feature_data.withColumnRenamed('scaledFeatures','features')


In [None]:
feature_data.show(2,False)

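
The scaling step rescales each feature column independently. A minimal sketch of the formula from the Spark documentation, applied to one invented column at a time (`min_max_scale` is a hypothetical helper, not part of Spark):

```python
def min_max_scale(column, new_min=0.0, new_max=50.0):
    """Rescale one feature column to [new_min, new_max]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Constant column: the documented result is the midpoint of the target range
        return [0.5 * (new_max + new_min)] * len(column)
    return [(x - lo) / (hi - lo) * (new_max - new_min) + new_min for x in column]

scaled = min_max_scale([0, 1, 2])
constant = min_max_scale([3, 3])
print(scaled)
print(constant)
```

Tree-based models such as random forests split on thresholds, so they are generally insensitive to this kind of rescaling; the step matters more for distance-based or gradient-based models.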

# Modeling

In [None]:
train, test = feature_data.randomSplit([0.7, 0.3],seed = 11)
model = RandomForestClassifier()
modelfit = model.fit(train)
prediction = modelfit.transform(test)

In [None]:
prediction.show(2)

+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|    0|(1007,[0,1,2,3,7,...|[19.2582868498734...|[0.96291434249367...|       0.0|
|    0|(1007,[0,1,2,3,7,...|[19.2704201771085...|[0.96352100885542...|       0.0|
+-----+--------------------+--------------------+--------------------+----------+
only showing top 2 rows



In [None]:
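# Note: rawPredictionCol='prediction' scores the hard 0/1 predictions rather than the class scores in 'rawPrediction'
evaluator = BinaryClassificationEvaluator(rawPredictionCol='prediction')
auc = evaluator.evaluate(prediction)

In [None]:
print(auc)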

0.5020242914979757


I know, this is about the worst score possible :')  
but I'm just showing how the whole pipeline is done using Spark.

In a real project I would change the whole strategy and the features used, especially the concatenation part :')
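
One detail about the score above: the evaluator was given the hard 0/1 `prediction` column, so the AUC reflects only those labels. When a model predicts the majority class for nearly every row, that number sits near 0.5 even if the underlying scores rank the rows reasonably well. The plain-Python sketch below illustrates this with invented labels and scores, using the textbook pairwise definition of AUC rather than Spark's own implementation.

```python
def pairwise_auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented example: four real postings, two fake ones
labels = [0, 0, 0, 0, 1, 1]
fraud_scores = [0.10, 0.20, 0.30, 0.40, 0.35, 0.45]

# Thresholding at 0.5 labels every row as real
hard_predictions = [1 if s >= 0.5 else 0 for s in fraud_scores]

auc_hard = pairwise_auc(labels, hard_predictions)
auc_scores = pairwise_auc(labels, fraud_scores)
print(auc_hard, auc_scores)
```

Whether the model in this notebook ranks postings better than its 0.50 suggests is not something the outputs shown here can tell us; evaluating on the `rawPrediction` column would be the way to find out.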

