# NLP in Pyspark's MLlib Project

## Fake Job Posting Predictions

Indeed.com has just hired you to create a system that automatically flags suspicious job postings on it's website. It has recently seen an influx of fake job postings that is negativley impacting it's customer experience. Becuase of the high volume of job postings it receives everyday, their employees do have the capacity to check every posting so they would like prioritize which postings to review before deleting it. 

#### Your task
Use the attached dataset with NLP to create an alogorthim which automatically flags suspicious posts for review. 

#### The data
This dataset contains 18K job descriptions out of which about 800 are fake. The data consists of both textual information and meta-information about the jobs.

**Data Source:** https://www.kaggle.com/shivamb/real-or-fake-fake-jobposting-prediction

#### Have fun!

In [1]:
# First let's create our PySpark instance
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("NLP").getOrCreate()

cores = spark._jsc.sc().getExecutorMemoryStatus().keySet().size()
print("You are working with", cores, "core(s)")
spark

You are working with 1 core(s)


In [2]:
from pyspark.ml.feature import * #CountVectorizer,StringIndexer, RegexTokenizer,StopWordsRemover, vectorassembler
from pyspark.sql.functions import * #col,udf,regexp_replace,isnull
from pyspark.sql.types import * #StringType,IntegerType
from pyspark.ml.classification import *
from pyspark.ml.evaluation import *
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# For pipeline development
from pyspark.ml import Pipeline 

In [3]:
# To display ALL columns
import pandas as pd
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [4]:
#reading our dataset
path ="../Datasets/"
df = spark.read.csv(path+'fake_job_postings.csv',inferSchema=True,header=True)

In [5]:
df.limit(3).toPandas()

Unnamed: 0,job_id,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent
0,1,Marketing Intern,"US, NY, New York",Marketing,,"We're Food52, and we've created a groundbreaking and award-winning cooking site. We support, connect, and celebrate home cooks, and give them everything they need in one place.We have a top editorial, business, and engineering team. We're focused on using technology to find new and better ways to connect people around their specific food interests, and to offer them superb, highly curated information about food and cooking. We attract the most talented home cooks and contributors in the country; we also publish well-known professionals like Mario Batali, Gwyneth Paltrow, and Danny Meyer. And we have partnerships with Whole Foods Market and Random House.Food52 has been named the best food website by the James Beard Foundation and IACP, and has been featured in the New York Times, NPR, Pando Daily, TechCrunch, and on the Today Show.We're located in Chelsea, in New York City.","Food52, a fast-growing, James Beard Award-winning online food community and crowd-sourced and curated recipe hub, is currently interviewing full- and part-time unpaid interns to work in a small team of editors, executives, and developers in its New York City headquarters.Reproducing and/or repackaging existing Food52 content for a number of partner sites, such as Huffington Post, Yahoo, Buzzfeed, and more in their various content management systemsResearching blogs and websites for the Provisions by Food52 Affiliate ProgramAssisting in day-to-day affiliate program support, such as screening affiliates and assisting in any affiliate inquiriesSupporting with PR &amp; Events when neededHelping with office administrative work, such as filing, mailing, and preparing for meetingsWorking with developers to document bugs and suggest improvements to the siteSupporting the marketing and executive staff","Experience with content management systems a major plus (any blogging counts!)Familiar with the Food52 editorial voice and aestheticLoves food, appreciates the importance of home cooking and cooking with the seasonsMeticulous editor, perfectionist, obsessive attention to detail, maddened by typos and broken links, delighted by finding and fixing themCheerful under pressureExcellent communication skillsA+ multi-tasker and juggler of responsibilities big and smallInterested in and engaged with social media like Twitter, Facebook, and PinterestLoves problem-solving and collaborating to drive Food52 forwardThinks big picture but pitches in on the nitty gritty of running a small company (dishes, shopping, administrative support)Comfortable with the realities of working for a startup: being on call on evenings and weekends, and working long hours",,0,1,0,Other,Internship,,,Marketing,0
1,2,Customer Service - Cloud Video Production,"NZ, , Auckland",Success,,"90 Seconds, the worlds Cloud Video Production Service.90 Seconds is the worlds Cloud Video Production Service enabling brands and agencies to get high quality online video content shot and produced anywhere in the world. 90 Seconds makes video production fast, affordable, and all managed seamlessly in the cloud from purchase to publish. http://90#URL_fbe6559afac620a3cd2c22281f7b8d0eef56a73e3d9a311e2f1ca13d081dd630#90 Seconds removes the hassle, cost, risk and speed issues of working with regular video production companies by managing every aspect of video projects in a beautiful online experience. With a growing global network of over 2,000 rated video professionals in over 50 countries managed by dedicated production success teams in 5 countries, 90 Seconds provides a 100% success guarantee.90 Seconds has produced almost 4,000 videos in over 30 Countries for over 500 Global brands including some of the worlds largest including Paypal, L’Oreal, Sony and Barclays and has offices in Auckland, London, Sydney, Tokyo and Singapore.http://90#URL_fbe6559afac620a3cd2c22281f7b8d0eef56a73e3d9a311e2f1ca13d081dd630# | http://90#URL_e2ad0bde3f09a0913a486abdbb1e6ac373bb3310f64b1fbcf550049bcba4a17b# | http://90#URL_8c5dd1806f97ab90876d9daebeb430f682dbc87e2f01549b47e96c7bff2ea17e#","Organised - Focused - Vibrant - Awesome!Do you have a passion for customer service? Slick typing skills? Maybe Account Management? ...And think administration is cooler than a polar bear on a jetski? Then we need to hear you! We are the Cloud Video Production Service and opperating on a glodal level. Yeah, it's pretty cool. Serious about delivering a world class product and excellent customer service.Our rapidly expanding business is looking for a talented Project Manager to manage the successful delivery of video projects, manage client communications and drive the production process. Work with some of the coolest brands on the planet and learn from a global team that are representing NZ is a huge way!We are entering the next growth stage of our business and growing quickly internationally. Therefore, the position is bursting with opportunity for the right person entering the business at the right time. 90 Seconds, the worlds Cloud Video Production Service - http://90#URL_fbe6559afac620a3cd2c22281f7b8d0eef56a73e3d9a311e2f1ca13d081dd630#90 Seconds is the worlds Cloud Video Production Service enabling brands and agencies to get high quality online video content shot and produced anywhere in the world. Fast, affordable, and all managed seamlessly in the cloud from purchase to publish. 90 Seconds removes the hassle, cost, risk and speed issues of working with regular video production companies by managing every aspect of video projects in a beautiful online experience. With a growing network of over 2,000 rated video professionals in over 50 countries and dedicated production success teams in 5 countries guaranteeing video project success 100%. It's as easy as commissioning a quick google adwords campaign.90 Seconds has produced almost 4,000 videos in over 30 Countries for over 500 Global brands including some of the worlds largest including Paypal, L'oreal, Sony and Barclays and has offices in Auckland, London, Sydney, Tokyo &amp; Singapore.Our Auckland office is based right in the heart of the Wynyard Quarter Innovation Precinct - GridAKL!","What we expect from you:Your key responsibility will be to communicate with the client, 90 Seconds team and freelance community throughout the video production process including, shoot planning, securing freelance talent, managing workflow and the online production management system. The aim is to manage each video project effectively so that we produce great videos that our clients love.Key attributesClient focused - excellent customer service and communication skillsOnline - oustanding computer knowledge and experience using online software and project management toolsOrganised - manage workload and able to multi-task100% attention to detailMotivated - self-starter with a passion for doing excellent work and achieving great resultsAdaptable - show initiative and think on your feet as this is a constantly evolving atmosphereFlexible - fast turnaround work and after hours availabilityEasy going &amp; upbeat - dosen't get bogged down and loves the challengeSense of Humour - have a laugh and know that working in a startup takes guts!Ability to deliver - including meeting project deadlines and budgetAttitude is more important than experience at 90 Seconds, however previous experience in customer service and/or project management is beneficialPlease view our platform / website at #URL_395a8683a907ce95f49a12fb240e6e47ad8d5a4f96d07ebbd869c4dd4dea1826# and get a clear understand about what we do before reaching out.","What you will get from usThrough being part of the 90 Seconds team you will gain:experience working on projects located around the world with an international brandexperience working with a variety of clients and on a large range of projectsopportunity to drive and grow production function and teama positive working environment with a great teamPay$40,000-$55,000Applying for this role with a VIDEOBeing a video business, we understand that one of the quickest ways that we can assess your suitability for this role, and one of the quickest ways that you can apply for it, is for you to submit a 60-90 second long video telling us about yourself, your experience and why you think you would be perfect for the role. It’s not about being a filmmaker or making a really creative video. A simple video filmed with a smart phone or web cam will be fine. Please also include where you are based and when you can start.You can upload the video onto YouTube or Vimeo (or similar) as a Draft or Live link.APPLICATIONS DUE by 5pm on Wednesday 18th July 2014 - Once you have a video ready, apply for this role via the following link together with a cover letter and your CV. After we have watched your video and get an idea of your suitability for the role, we will email the shortlisted candidates",0,1,0,Full-time,Not Applicable,,Marketing and Advertising,Customer Service,0
2,3,Commissioning Machinery Assistant (CMA),"US, IA, Wever",,,"Valor Services provides Workforce Solutions that meet the needs of companies across the Private Sector, with a special focus on the Oil &amp; Gas Industry. Valor Services will be involved with you throughout every step of the hiring process and remain in contact with you all the way through the final step of signing of the employment contract with your new employer. Valor Services was founded with the vision of employing the unique skills, experiences, and qualities of America’s finest veterans to provide Private Sector companies with precise and concerted value-added services – and America’s finest Veterans with an optimized career opportunity.We are eager to get the word out to veterans that there are ample opportunities for employment in the private sector and that you are the ideal candidates to fill those positions. Valor Services Your Success is Our Mission. ™","Our client, located in Houston, is actively seeking an experienced Commissioning Machinery Assistant that possesses strong supervisory skills and has an attention to detail. A strong dedication to safety is a must. The ideal candidate will execute all activities while complying with quality requirements and health, environmental, and safety regulations.","Implement pre-commissioning and commissioning procedures for rotary equipment.Execute all activities with subcontractor’s assigned crew that pertains to the discipline.Ensure effective utilization of commissioning manpower and consumables.Ensure the execution of vendor specialists' field activities with the assigned resources from the sub-contractor per vendor’s representative plans.Carry out equipment inspections with client representatives and ensure proper certification is produced.Prepare forms for all pending tests and submit signed certificates for final hand over to the certification engineer for QA and QC.Coordinate in the field with vendor representatives.Keep records of all activities.Ensure that safety practices are strictly followed during the execution of activities.Report progress and constraints to the mechanical supervisor.Possible authorization by site manager to receive or issue a Permit To Work according to project Permit To Work procedures.Assist supervisor to expedite pending punch-list items in accordance with the commissioning manager’s priorities.Assist supervisor to coordinate and supervise construction-support activities during pre-commissioning and commissioning activities.Company Overview:Our client is a premiere engineering, construction, and procurement company that executes large-scale projects internationally.",,0,1,0,,,,,,0


In [6]:
df.count()

17880

In [7]:
df.printSchema()

root
 |-- job_id: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- location: string (nullable = true)
 |-- department: string (nullable = true)
 |-- salary_range: string (nullable = true)
 |-- company_profile: string (nullable = true)
 |-- description: string (nullable = true)
 |-- requirements: string (nullable = true)
 |-- benefits: string (nullable = true)
 |-- telecommuting: string (nullable = true)
 |-- has_company_logo: string (nullable = true)
 |-- has_questions: string (nullable = true)
 |-- employment_type: string (nullable = true)
 |-- required_experience: string (nullable = true)
 |-- required_education: string (nullable = true)
 |-- industry: string (nullable = true)
 |-- function: string (nullable = true)
 |-- fraudulent: string (nullable = true)



## PreProcessing 

Checking our target (dependent variable) column..

In [8]:
df.groupBy("fraudulent").count().orderBy(desc("count")).show()

+--------------------+-----+
|          fraudulent|count|
+--------------------+-----+
|                   0|16080|
|                   1|  886|
|                null|  176|
|           Full-time|   73|
|Hospital & Health...|   55|
|   Bachelor's Degree|   53|
|         Engineering|   26|
| perform quality ...|   17|
|         Unspecified|   15|
|    Mid-Senior level|   15|
|           Associate|   14|
|               Sales|   14|
|Information Techn...|   13|
|           Marketing|   13|
| passionate about...|   13|
|            Internet|   12|
|   Computer Software|   12|
|      Not Applicable|   11|
|We offer an excel...|   11|
| además con el fi...|   10|
+--------------------+-----+
only showing top 20 rows



The fraudulent column which we want to predict has many unwanted (noisy) rows, we can't use these rows for training or testing so we will filter our data so it only contains either 0 or 1 for fraudulent.

In [9]:
#first, for handling the most important column, the fraudulent column.
df = df.filter("fraudulent IN(0,1)")
# Make sure it worked
df.groupBy("fraudulent").count().orderBy(col("count").desc()).show(truncate=False)

+----------+-----+
|fraudulent|count|
+----------+-----+
|0         |16080|
|1         |886  |
+----------+-----+



In [10]:
df.count()

16966

After making sure our dependent variable column is as we want, We need to clean our data before starting NLP.

First, we will extract the country from the location column and use it instead of location as it is a better representation of the location's diversity.

In [11]:
df = df.withColumn('country', split(col('location'), ',')[0])

df.limit(4).toPandas()

Unnamed: 0,job_id,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent,country
0,1,Marketing Intern,"US, NY, New York",Marketing,,"We're Food52, and we've created a groundbreaking and award-winning cooking site. We support, connect, and celebrate home cooks, and give them everything they need in one place.We have a top editorial, business, and engineering team. We're focused on using technology to find new and better ways to connect people around their specific food interests, and to offer them superb, highly curated information about food and cooking. We attract the most talented home cooks and contributors in the country; we also publish well-known professionals like Mario Batali, Gwyneth Paltrow, and Danny Meyer. And we have partnerships with Whole Foods Market and Random House.Food52 has been named the best food website by the James Beard Foundation and IACP, and has been featured in the New York Times, NPR, Pando Daily, TechCrunch, and on the Today Show.We're located in Chelsea, in New York City.","Food52, a fast-growing, James Beard Award-winning online food community and crowd-sourced and curated recipe hub, is currently interviewing full- and part-time unpaid interns to work in a small team of editors, executives, and developers in its New York City headquarters.Reproducing and/or repackaging existing Food52 content for a number of partner sites, such as Huffington Post, Yahoo, Buzzfeed, and more in their various content management systemsResearching blogs and websites for the Provisions by Food52 Affiliate ProgramAssisting in day-to-day affiliate program support, such as screening affiliates and assisting in any affiliate inquiriesSupporting with PR &amp; Events when neededHelping with office administrative work, such as filing, mailing, and preparing for meetingsWorking with developers to document bugs and suggest improvements to the siteSupporting the marketing and executive staff","Experience with content management systems a major plus (any blogging counts!)Familiar with the Food52 editorial voice and aestheticLoves food, appreciates the importance of home cooking and cooking with the seasonsMeticulous editor, perfectionist, obsessive attention to detail, maddened by typos and broken links, delighted by finding and fixing themCheerful under pressureExcellent communication skillsA+ multi-tasker and juggler of responsibilities big and smallInterested in and engaged with social media like Twitter, Facebook, and PinterestLoves problem-solving and collaborating to drive Food52 forwardThinks big picture but pitches in on the nitty gritty of running a small company (dishes, shopping, administrative support)Comfortable with the realities of working for a startup: being on call on evenings and weekends, and working long hours",,0,1,0,Other,Internship,,,Marketing,0,US
1,2,Customer Service - Cloud Video Production,"NZ, , Auckland",Success,,"90 Seconds, the worlds Cloud Video Production Service.90 Seconds is the worlds Cloud Video Production Service enabling brands and agencies to get high quality online video content shot and produced anywhere in the world. 90 Seconds makes video production fast, affordable, and all managed seamlessly in the cloud from purchase to publish. http://90#URL_fbe6559afac620a3cd2c22281f7b8d0eef56a73e3d9a311e2f1ca13d081dd630#90 Seconds removes the hassle, cost, risk and speed issues of working with regular video production companies by managing every aspect of video projects in a beautiful online experience. With a growing global network of over 2,000 rated video professionals in over 50 countries managed by dedicated production success teams in 5 countries, 90 Seconds provides a 100% success guarantee.90 Seconds has produced almost 4,000 videos in over 30 Countries for over 500 Global brands including some of the worlds largest including Paypal, L’Oreal, Sony and Barclays and has offices in Auckland, London, Sydney, Tokyo and Singapore.http://90#URL_fbe6559afac620a3cd2c22281f7b8d0eef56a73e3d9a311e2f1ca13d081dd630# | http://90#URL_e2ad0bde3f09a0913a486abdbb1e6ac373bb3310f64b1fbcf550049bcba4a17b# | http://90#URL_8c5dd1806f97ab90876d9daebeb430f682dbc87e2f01549b47e96c7bff2ea17e#","Organised - Focused - Vibrant - Awesome!Do you have a passion for customer service? Slick typing skills? Maybe Account Management? ...And think administration is cooler than a polar bear on a jetski? Then we need to hear you! We are the Cloud Video Production Service and opperating on a glodal level. Yeah, it's pretty cool. Serious about delivering a world class product and excellent customer service.Our rapidly expanding business is looking for a talented Project Manager to manage the successful delivery of video projects, manage client communications and drive the production process. Work with some of the coolest brands on the planet and learn from a global team that are representing NZ is a huge way!We are entering the next growth stage of our business and growing quickly internationally. Therefore, the position is bursting with opportunity for the right person entering the business at the right time. 90 Seconds, the worlds Cloud Video Production Service - http://90#URL_fbe6559afac620a3cd2c22281f7b8d0eef56a73e3d9a311e2f1ca13d081dd630#90 Seconds is the worlds Cloud Video Production Service enabling brands and agencies to get high quality online video content shot and produced anywhere in the world. Fast, affordable, and all managed seamlessly in the cloud from purchase to publish. 90 Seconds removes the hassle, cost, risk and speed issues of working with regular video production companies by managing every aspect of video projects in a beautiful online experience. With a growing network of over 2,000 rated video professionals in over 50 countries and dedicated production success teams in 5 countries guaranteeing video project success 100%. It's as easy as commissioning a quick google adwords campaign.90 Seconds has produced almost 4,000 videos in over 30 Countries for over 500 Global brands including some of the worlds largest including Paypal, L'oreal, Sony and Barclays and has offices in Auckland, London, Sydney, Tokyo &amp; Singapore.Our Auckland office is based right in the heart of the Wynyard Quarter Innovation Precinct - GridAKL!","What we expect from you:Your key responsibility will be to communicate with the client, 90 Seconds team and freelance community throughout the video production process including, shoot planning, securing freelance talent, managing workflow and the online production management system. The aim is to manage each video project effectively so that we produce great videos that our clients love.Key attributesClient focused - excellent customer service and communication skillsOnline - oustanding computer knowledge and experience using online software and project management toolsOrganised - manage workload and able to multi-task100% attention to detailMotivated - self-starter with a passion for doing excellent work and achieving great resultsAdaptable - show initiative and think on your feet as this is a constantly evolving atmosphereFlexible - fast turnaround work and after hours availabilityEasy going &amp; upbeat - dosen't get bogged down and loves the challengeSense of Humour - have a laugh and know that working in a startup takes guts!Ability to deliver - including meeting project deadlines and budgetAttitude is more important than experience at 90 Seconds, however previous experience in customer service and/or project management is beneficialPlease view our platform / website at #URL_395a8683a907ce95f49a12fb240e6e47ad8d5a4f96d07ebbd869c4dd4dea1826# and get a clear understand about what we do before reaching out.","What you will get from usThrough being part of the 90 Seconds team you will gain:experience working on projects located around the world with an international brandexperience working with a variety of clients and on a large range of projectsopportunity to drive and grow production function and teama positive working environment with a great teamPay$40,000-$55,000Applying for this role with a VIDEOBeing a video business, we understand that one of the quickest ways that we can assess your suitability for this role, and one of the quickest ways that you can apply for it, is for you to submit a 60-90 second long video telling us about yourself, your experience and why you think you would be perfect for the role. It’s not about being a filmmaker or making a really creative video. A simple video filmed with a smart phone or web cam will be fine. Please also include where you are based and when you can start.You can upload the video onto YouTube or Vimeo (or similar) as a Draft or Live link.APPLICATIONS DUE by 5pm on Wednesday 18th July 2014 - Once you have a video ready, apply for this role via the following link together with a cover letter and your CV. After we have watched your video and get an idea of your suitability for the role, we will email the shortlisted candidates",0,1,0,Full-time,Not Applicable,,Marketing and Advertising,Customer Service,0,NZ
2,3,Commissioning Machinery Assistant (CMA),"US, IA, Wever",,,"Valor Services provides Workforce Solutions that meet the needs of companies across the Private Sector, with a special focus on the Oil &amp; Gas Industry. Valor Services will be involved with you throughout every step of the hiring process and remain in contact with you all the way through the final step of signing of the employment contract with your new employer. Valor Services was founded with the vision of employing the unique skills, experiences, and qualities of America’s finest veterans to provide Private Sector companies with precise and concerted value-added services – and America’s finest Veterans with an optimized career opportunity.We are eager to get the word out to veterans that there are ample opportunities for employment in the private sector and that you are the ideal candidates to fill those positions. Valor Services Your Success is Our Mission. ™","Our client, located in Houston, is actively seeking an experienced Commissioning Machinery Assistant that possesses strong supervisory skills and has an attention to detail. A strong dedication to safety is a must. The ideal candidate will execute all activities while complying with quality requirements and health, environmental, and safety regulations.","Implement pre-commissioning and commissioning procedures for rotary equipment.Execute all activities with subcontractor’s assigned crew that pertains to the discipline.Ensure effective utilization of commissioning manpower and consumables.Ensure the execution of vendor specialists' field activities with the assigned resources from the sub-contractor per vendor’s representative plans.Carry out equipment inspections with client representatives and ensure proper certification is produced.Prepare forms for all pending tests and submit signed certificates for final hand over to the certification engineer for QA and QC.Coordinate in the field with vendor representatives.Keep records of all activities.Ensure that safety practices are strictly followed during the execution of activities.Report progress and constraints to the mechanical supervisor.Possible authorization by site manager to receive or issue a Permit To Work according to project Permit To Work procedures.Assist supervisor to expedite pending punch-list items in accordance with the commissioning manager’s priorities.Assist supervisor to coordinate and supervise construction-support activities during pre-commissioning and commissioning activities.Company Overview:Our client is a premiere engineering, construction, and procurement company that executes large-scale projects internationally.",,0,1,0,,,,,,0,US
3,4,Account Executive - Washington DC,"US, DC, Washington",Sales,,"Our passion for improving quality of life through geography is at the heart of everything we do. Esri’s geographic information system (GIS) technology inspires and enables governments, universities and businesses worldwide to save money, lives and our environment through a deeper understanding of the changing world around them.Carefully managed growth and zero debt give Esri stability that is uncommon in today's volatile business world. Privately held, we offer exceptional benefits, competitive salaries, 401(k) and profit-sharing programs, opportunities for personal and professional growth, and much more.","THE COMPANY: ESRI – Environmental Systems Research InstituteOur passion for improving quality of life through geography is at the heart of everything we do. Esri’s geographic information system (GIS) technology inspires and enables governments, universities and businesses worldwide to save money, lives and our environment through a deeper understanding of the changing world around them.Carefully managed growth and zero debt give Esri stability that is uncommon in today's volatile business world. Privately held, we offer exceptional benefits, competitive salaries, 401(k) and profit-sharing programs, opportunities for personal and professional growth, and much more.THE OPPORTUNITY: Account ExecutiveAs a member of the Sales Division, you will work collaboratively with an account team in order to sell and promote adoption of Esri’s ArcGIS platform within an organization. As part of an account team, you will be responsible for facilitating the development and execution of a set of strategies for a defined portfolio of accounts. When executing these strategies you will utilize your experience in enterprise sales to help customers leverage geospatial information and technology to achieve their business goals. Specifically…Prospect and develop opportunities to partner with key stakeholders to envision, develop, and implement a location strategy for their organizationClearly articulate the strength and value proposition of the ArcGIS platformDevelop and maintain a healthy pipeline of opportunities for business growthDemonstrate a thoughtful understanding of insightful industry knowledge and how GIS applies to initiatives, trends, and triggersUnderstand the key business drivers within an organization and identify key business stakeholdersUnderstand your customers’ budgeting and acquisition processesSuccessfully execute the account management process including account prioritization, account resourcing, and account planningSuccessfully execute the sales process for all opportunitiesLeverage and lead an account team consisting of sales and other cross-divisional resources to define and execute an account strategyEffectively utilize and leverage the CRM to manage opportunities and drive the buying processPursue professional and personal development to ensure competitive knowledge of the real estate industryLeverage social media to successfully prospect and build a professional networkParticipate in trade shows, workshops, and seminars (as required)Support visual story telling through effective whiteboard sessionsBe resourceful and takes initiative to resolve issues","EDUCATION: Bachelor’s or Master’s in GIS, business administration, or a related field, or equivalent work experience, depending on position levelEXPERIENCE: 5+ years of enterprise sales experience providing platform solutions to businessesDemonstrated experience in managing the sales cycle including prospecting, proposing, and closingAbility to adapt to new technology trends and translate them into solutions that address customer needsDemonstrated experience with strong partnerships and advocacy with customersExcellent presentation, white boarding, and negotiation skills including good listening, probing, and qualification abilitiesExperience executing insight selling methodologiesDemonstrated understanding and mitigation of competitive threatsExcellent written and verbal communication and interpersonal skillsAbility to manage and prioritize your activitiesDemonstrated experience to lead executive engagements to provide services and sell to the real estate industryKnowledge of the real estate industry fiscal year, budgeting, and procurement cycleHighly motivated team player with a mature, positive attitude and passion to meet the challenges and opportunities of a businessAbility to travel domestically and/or internationally up to 50%General knowledge of spatial analysis and problem solvingResults oriented; ability to write and craft smart, attainable, realistic, time-driven goals with clear lead indicators","Our culture is anything but corporate—we have a collaborative, creative environment; phone directories organized by first name; a relaxed dress code; and open-door policies.A Place to ThrivePassionate people who strive to make a differenceCasual dress codeFlexible work schedulesSupport for continuing educationCollege-Like CampusA network of buildings amid lush landscaping and numerous outdoor patio areasOn-site café including a Starbucks coffee bar and lounge areaFitness center available 24/7Comprehensive reference library and GIS bibliographyState-of-the-art conference center to host staff and guest speakers Green InitiativesSolar rooftop panels reduce carbon emissionsElectric vehicles provide on-campus transportationHundreds of trees reduce the cost of cooling buildings",0,1,0,Full-time,Mid-Senior level,Bachelor's Degree,Computer Software,Sales,0,US


In [12]:
df.groupBy("country").count().orderBy(desc("count")).show(5)

+-------+-----+
|country|count|
+-------+-----+
|     US|10170|
|     GB| 2253|
|     GR|  862|
|     CA|  424|
|     DE|  368|
+-------+-----+
only showing top 5 rows



Now, we will check for nulls and see how we can use them for feature selection and how can we fill them.

### Checking for Nulls

In [13]:
null_counts = df.select([count(when(col(c).isNull(),c)).alias(c) for c in df.columns])
null_counts.toPandas()

Unnamed: 0,job_id,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent,country
0,0,0,337,11039,14258,3206,0,2571,6949,0,0,0,3273,6675,7661,4667,6158,0,337


In [14]:
def null_value_calc(df):
    null_columns_counts = []
    cols_to_keep = []
    numRows = df.count()
    for k in df.columns:
        nullRows = df.where(col(k).isNull()).count()
        if(nullRows > 0):
            null_percent = (nullRows/numRows)*100
            temp = k,nullRows,null_percent
            null_columns_counts.append(temp)
            if (null_percent < 20):
                cols_to_keep.append(k)
        else:
            cols_to_keep.append(k)
    return(null_columns_counts), cols_to_keep

null_columns_calc_list, cols_to_keep = null_value_calc(df)
spark.createDataFrame(null_columns_calc_list, ['Column_Name', 'Null_Values_Count','Null_Value_Percent']).show()

+-------------------+-----------------+------------------+
|        Column_Name|Null_Values_Count|Null_Value_Percent|
+-------------------+-----------------+------------------+
|           location|              337|1.9863255923611929|
|         department|            11039| 65.06542496758222|
|       salary_range|            14258| 84.03866556642697|
|    company_profile|             3206| 18.89661676293764|
|       requirements|             2571|15.153837085936578|
|           benefits|             6949| 40.95838736296122|
|    employment_type|             3273| 19.29152422492043|
|required_experience|             6675| 39.34339266768832|
| required_education|             7661|  45.1550159141813|
|           industry|             4667|27.507957090651892|
|           function|             6158|36.296121655074856|
|            country|              337|1.9863255923611929|
+-------------------+-----------------+------------------+



### Features Selection

We will be dropping any column whose null values are greater than 20%, so we will focus on the following columns as calculated from the previous function.

In [15]:
print(cols_to_keep)

['job_id', 'title', 'location', 'company_profile', 'description', 'requirements', 'telecommuting', 'has_company_logo', 'has_questions', 'employment_type', 'fraudulent', 'country']


As location column is in the cols_to_keep, we need to remove it as we already has the country column.

In [16]:
cols_to_keep.remove("location")
cols_to_keep

['job_id',
 'title',
 'company_profile',
 'description',
 'requirements',
 'telecommuting',
 'has_company_logo',
 'has_questions',
 'employment_type',
 'fraudulent',
 'country']

In [17]:
df = df.select(cols_to_keep)
df.limit(4).toPandas()

Unnamed: 0,job_id,title,company_profile,description,requirements,telecommuting,has_company_logo,has_questions,employment_type,fraudulent,country
0,1,Marketing Intern,"We're Food52, and we've created a groundbreaking and award-winning cooking site. We support, connect, and celebrate home cooks, and give them everything they need in one place.We have a top editorial, business, and engineering team. We're focused on using technology to find new and better ways to connect people around their specific food interests, and to offer them superb, highly curated information about food and cooking. We attract the most talented home cooks and contributors in the country; we also publish well-known professionals like Mario Batali, Gwyneth Paltrow, and Danny Meyer. And we have partnerships with Whole Foods Market and Random House.Food52 has been named the best food website by the James Beard Foundation and IACP, and has been featured in the New York Times, NPR, Pando Daily, TechCrunch, and on the Today Show.We're located in Chelsea, in New York City.","Food52, a fast-growing, James Beard Award-winning online food community and crowd-sourced and curated recipe hub, is currently interviewing full- and part-time unpaid interns to work in a small team of editors, executives, and developers in its New York City headquarters.Reproducing and/or repackaging existing Food52 content for a number of partner sites, such as Huffington Post, Yahoo, Buzzfeed, and more in their various content management systemsResearching blogs and websites for the Provisions by Food52 Affiliate ProgramAssisting in day-to-day affiliate program support, such as screening affiliates and assisting in any affiliate inquiriesSupporting with PR &amp; Events when neededHelping with office administrative work, such as filing, mailing, and preparing for meetingsWorking with developers to document bugs and suggest improvements to the siteSupporting the marketing and executive staff","Experience with content management systems a major plus (any blogging counts!)Familiar with the Food52 editorial voice and aestheticLoves food, appreciates the importance of home cooking and cooking with the seasonsMeticulous editor, perfectionist, obsessive attention to detail, maddened by typos and broken links, delighted by finding and fixing themCheerful under pressureExcellent communication skillsA+ multi-tasker and juggler of responsibilities big and smallInterested in and engaged with social media like Twitter, Facebook, and PinterestLoves problem-solving and collaborating to drive Food52 forwardThinks big picture but pitches in on the nitty gritty of running a small company (dishes, shopping, administrative support)Comfortable with the realities of working for a startup: being on call on evenings and weekends, and working long hours",0,1,0,Other,0,US
1,2,Customer Service - Cloud Video Production,"90 Seconds, the worlds Cloud Video Production Service.90 Seconds is the worlds Cloud Video Production Service enabling brands and agencies to get high quality online video content shot and produced anywhere in the world. 90 Seconds makes video production fast, affordable, and all managed seamlessly in the cloud from purchase to publish. http://90#URL_fbe6559afac620a3cd2c22281f7b8d0eef56a73e3d9a311e2f1ca13d081dd630#90 Seconds removes the hassle, cost, risk and speed issues of working with regular video production companies by managing every aspect of video projects in a beautiful online experience. With a growing global network of over 2,000 rated video professionals in over 50 countries managed by dedicated production success teams in 5 countries, 90 Seconds provides a 100% success guarantee.90 Seconds has produced almost 4,000 videos in over 30 Countries for over 500 Global brands including some of the worlds largest including Paypal, L’Oreal, Sony and Barclays and has offices in Auckland, London, Sydney, Tokyo and Singapore.http://90#URL_fbe6559afac620a3cd2c22281f7b8d0eef56a73e3d9a311e2f1ca13d081dd630# | http://90#URL_e2ad0bde3f09a0913a486abdbb1e6ac373bb3310f64b1fbcf550049bcba4a17b# | http://90#URL_8c5dd1806f97ab90876d9daebeb430f682dbc87e2f01549b47e96c7bff2ea17e#","Organised - Focused - Vibrant - Awesome!Do you have a passion for customer service? Slick typing skills? Maybe Account Management? ...And think administration is cooler than a polar bear on a jetski? Then we need to hear you! We are the Cloud Video Production Service and opperating on a glodal level. Yeah, it's pretty cool. Serious about delivering a world class product and excellent customer service.Our rapidly expanding business is looking for a talented Project Manager to manage the successful delivery of video projects, manage client communications and drive the production process. Work with some of the coolest brands on the planet and learn from a global team that are representing NZ is a huge way!We are entering the next growth stage of our business and growing quickly internationally. Therefore, the position is bursting with opportunity for the right person entering the business at the right time. 90 Seconds, the worlds Cloud Video Production Service - http://90#URL_fbe6559afac620a3cd2c22281f7b8d0eef56a73e3d9a311e2f1ca13d081dd630#90 Seconds is the worlds Cloud Video Production Service enabling brands and agencies to get high quality online video content shot and produced anywhere in the world. Fast, affordable, and all managed seamlessly in the cloud from purchase to publish. 90 Seconds removes the hassle, cost, risk and speed issues of working with regular video production companies by managing every aspect of video projects in a beautiful online experience. With a growing network of over 2,000 rated video professionals in over 50 countries and dedicated production success teams in 5 countries guaranteeing video project success 100%. It's as easy as commissioning a quick google adwords campaign.90 Seconds has produced almost 4,000 videos in over 30 Countries for over 500 Global brands including some of the worlds largest including Paypal, L'oreal, Sony and Barclays and has offices in Auckland, London, Sydney, Tokyo &amp; Singapore.Our Auckland office is based right in the heart of the Wynyard Quarter Innovation Precinct - GridAKL!","What we expect from you:Your key responsibility will be to communicate with the client, 90 Seconds team and freelance community throughout the video production process including, shoot planning, securing freelance talent, managing workflow and the online production management system. The aim is to manage each video project effectively so that we produce great videos that our clients love.Key attributesClient focused - excellent customer service and communication skillsOnline - oustanding computer knowledge and experience using online software and project management toolsOrganised - manage workload and able to multi-task100% attention to detailMotivated - self-starter with a passion for doing excellent work and achieving great resultsAdaptable - show initiative and think on your feet as this is a constantly evolving atmosphereFlexible - fast turnaround work and after hours availabilityEasy going &amp; upbeat - dosen't get bogged down and loves the challengeSense of Humour - have a laugh and know that working in a startup takes guts!Ability to deliver - including meeting project deadlines and budgetAttitude is more important than experience at 90 Seconds, however previous experience in customer service and/or project management is beneficialPlease view our platform / website at #URL_395a8683a907ce95f49a12fb240e6e47ad8d5a4f96d07ebbd869c4dd4dea1826# and get a clear understand about what we do before reaching out.",0,1,0,Full-time,0,NZ
2,3,Commissioning Machinery Assistant (CMA),"Valor Services provides Workforce Solutions that meet the needs of companies across the Private Sector, with a special focus on the Oil &amp; Gas Industry. Valor Services will be involved with you throughout every step of the hiring process and remain in contact with you all the way through the final step of signing of the employment contract with your new employer. Valor Services was founded with the vision of employing the unique skills, experiences, and qualities of America’s finest veterans to provide Private Sector companies with precise and concerted value-added services – and America’s finest Veterans with an optimized career opportunity.We are eager to get the word out to veterans that there are ample opportunities for employment in the private sector and that you are the ideal candidates to fill those positions. Valor Services Your Success is Our Mission. ™","Our client, located in Houston, is actively seeking an experienced Commissioning Machinery Assistant that possesses strong supervisory skills and has an attention to detail. A strong dedication to safety is a must. The ideal candidate will execute all activities while complying with quality requirements and health, environmental, and safety regulations.","Implement pre-commissioning and commissioning procedures for rotary equipment.Execute all activities with subcontractor’s assigned crew that pertains to the discipline.Ensure effective utilization of commissioning manpower and consumables.Ensure the execution of vendor specialists' field activities with the assigned resources from the sub-contractor per vendor’s representative plans.Carry out equipment inspections with client representatives and ensure proper certification is produced.Prepare forms for all pending tests and submit signed certificates for final hand over to the certification engineer for QA and QC.Coordinate in the field with vendor representatives.Keep records of all activities.Ensure that safety practices are strictly followed during the execution of activities.Report progress and constraints to the mechanical supervisor.Possible authorization by site manager to receive or issue a Permit To Work according to project Permit To Work procedures.Assist supervisor to expedite pending punch-list items in accordance with the commissioning manager’s priorities.Assist supervisor to coordinate and supervise construction-support activities during pre-commissioning and commissioning activities.Company Overview:Our client is a premiere engineering, construction, and procurement company that executes large-scale projects internationally.",0,1,0,,0,US
3,4,Account Executive - Washington DC,"Our passion for improving quality of life through geography is at the heart of everything we do. Esri’s geographic information system (GIS) technology inspires and enables governments, universities and businesses worldwide to save money, lives and our environment through a deeper understanding of the changing world around them.Carefully managed growth and zero debt give Esri stability that is uncommon in today's volatile business world. Privately held, we offer exceptional benefits, competitive salaries, 401(k) and profit-sharing programs, opportunities for personal and professional growth, and much more.","THE COMPANY: ESRI – Environmental Systems Research InstituteOur passion for improving quality of life through geography is at the heart of everything we do. Esri’s geographic information system (GIS) technology inspires and enables governments, universities and businesses worldwide to save money, lives and our environment through a deeper understanding of the changing world around them.Carefully managed growth and zero debt give Esri stability that is uncommon in today's volatile business world. Privately held, we offer exceptional benefits, competitive salaries, 401(k) and profit-sharing programs, opportunities for personal and professional growth, and much more.THE OPPORTUNITY: Account ExecutiveAs a member of the Sales Division, you will work collaboratively with an account team in order to sell and promote adoption of Esri’s ArcGIS platform within an organization. As part of an account team, you will be responsible for facilitating the development and execution of a set of strategies for a defined portfolio of accounts. When executing these strategies you will utilize your experience in enterprise sales to help customers leverage geospatial information and technology to achieve their business goals. Specifically…Prospect and develop opportunities to partner with key stakeholders to envision, develop, and implement a location strategy for their organizationClearly articulate the strength and value proposition of the ArcGIS platformDevelop and maintain a healthy pipeline of opportunities for business growthDemonstrate a thoughtful understanding of insightful industry knowledge and how GIS applies to initiatives, trends, and triggersUnderstand the key business drivers within an organization and identify key business stakeholdersUnderstand your customers’ budgeting and acquisition processesSuccessfully execute the account management process including account prioritization, account resourcing, and account planningSuccessfully execute the sales process for all opportunitiesLeverage and lead an account team consisting of sales and other cross-divisional resources to define and execute an account strategyEffectively utilize and leverage the CRM to manage opportunities and drive the buying processPursue professional and personal development to ensure competitive knowledge of the real estate industryLeverage social media to successfully prospect and build a professional networkParticipate in trade shows, workshops, and seminars (as required)Support visual story telling through effective whiteboard sessionsBe resourceful and takes initiative to resolve issues","EDUCATION: Bachelor’s or Master’s in GIS, business administration, or a related field, or equivalent work experience, depending on position levelEXPERIENCE: 5+ years of enterprise sales experience providing platform solutions to businessesDemonstrated experience in managing the sales cycle including prospecting, proposing, and closingAbility to adapt to new technology trends and translate them into solutions that address customer needsDemonstrated experience with strong partnerships and advocacy with customersExcellent presentation, white boarding, and negotiation skills including good listening, probing, and qualification abilitiesExperience executing insight selling methodologiesDemonstrated understanding and mitigation of competitive threatsExcellent written and verbal communication and interpersonal skillsAbility to manage and prioritize your activitiesDemonstrated experience to lead executive engagements to provide services and sell to the real estate industryKnowledge of the real estate industry fiscal year, budgeting, and procurement cycleHighly motivated team player with a mature, positive attitude and passion to meet the challenges and opportunities of a businessAbility to travel domestically and/or internationally up to 50%General knowledge of spatial analysis and problem solvingResults oriented; ability to write and craft smart, attainable, realistic, time-driven goals with clear lead indicators",0,1,0,Full-time,0,US


For Handling Nulls that are still present in these columns we will:
- For the columns location, company_profile, requirements we will replace nulls by "unspecified".
- For employment type we will replace nulls  and any category other than (full-time, contract, part-time) by other.
<br><br>
And we will further handle the string columns using NLP cleaning methods.

In [18]:
#now for handeling other text columns 
# replace nulls with "unspecified" in columns location, company_profile and requiremnets
df = df.withColumn("country", when(df["country"].isNull(), "unspecified").otherwise(df["country"]))\
    .withColumn("company_profile", when(df["company_profile"].isNull(), "unspecified").otherwise(df["company_profile"]))\
    .withColumn("requirements", when(df["requirements"].isNull(), "unspecified").otherwise(df["requirements"]))\
    .withColumn("employment_type", when(df["employment_type"].isNull(), "other").otherwise(df["employment_type"]))

# show the resulting dataframe
df.show()

+------+--------------------+--------------------+--------------------+--------------------+-------------+----------------+-------------+---------------+----------+-------+
|job_id|               title|     company_profile|         description|        requirements|telecommuting|has_company_logo|has_questions|employment_type|fraudulent|country|
+------+--------------------+--------------------+--------------------+--------------------+-------------+----------------+-------------+---------------+----------+-------+
|     1|    Marketing Intern|We're Food52, and...|Food52, a fast-gr...|Experience with c...|            0|               1|            0|          Other|         0|     US|
|     2|Customer Service ...|90 Seconds, the w...|Organised - Focus...|What we expect fr...|            0|               1|            0|      Full-time|         0|     NZ|
|     3|Commissioning Mac...|Valor Services pr...|Our client, locat...|Implement pre-com...|            0|               1|            

In [19]:
#check that everything is going well, all nulls in these columns are replaced.
print(df.where(col("country").isNull()).count())
print(df.where(col("company_profile").isNull()).count())
print(df.where(col("requirements").isNull()).count())
print(df.where(col("employment_type").isNull()).count())

0
0
0
0


In [20]:
df.groupBy("employment_type").count().orderBy(desc("count")).show(10)

+--------------------+-----+
|     employment_type|count|
+--------------------+-----+
|           Full-time|10923|
|               other| 3273|
|            Contract| 1496|
|           Part-time|  709|
|           Temporary|  233|
|               Other|  210|
|              France|   16|
| the London Inter...|   10|
| have full-time a...|    9|
| have full-time a...|    9|
+--------------------+-----+
only showing top 10 rows



In [21]:
df = df.withColumn("employment_type", when((df.employment_type != "Full-time") & (df.employment_type != "Contract") & (df.employment_type != "Part-time")\
                                           , "other").otherwise(df.employment_type))

In [22]:
df.groupBy("employment_type").count().orderBy(desc("count")).show(10)

+---------------+-----+
|employment_type|count|
+---------------+-----+
|      Full-time|10923|
|          other| 3838|
|       Contract| 1496|
|      Part-time|  709|
+---------------+-----+



### Checking for duplicates

In [23]:
#duplicated Rows
num_duplicates = df.count() - df.dropDuplicates().count()
print(f"Number of duplicates: {num_duplicates}")

Number of duplicates: 0


In [24]:
#check their is no duplicated job id
df.count() - df.dropDuplicates(['job_id']).count()

0

We have no duplicates in the whole rows or on the job_id.<br>
Now, our data has no nulls and no duplicates, time for cleaning the text columns.

### Validating data types

In [25]:
df.printSchema()

root
 |-- job_id: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- company_profile: string (nullable = true)
 |-- description: string (nullable = true)
 |-- requirements: string (nullable = true)
 |-- telecommuting: string (nullable = true)
 |-- has_company_logo: string (nullable = true)
 |-- has_questions: string (nullable = true)
 |-- employment_type: string (nullable = true)
 |-- fraudulent: string (nullable = true)
 |-- country: string (nullable = true)



We need the telecommuting, has_company_logo, has_questions and fraudulent columns to be integers. we have to change their types.

In [26]:
# Change the data type of a column
df = df.withColumn("telecommuting1", col("telecommuting").cast("integer"))\
    .withColumn("has_company_logo1", col("has_company_logo").cast("integer"))\
    .withColumn("has_questions1", col("has_questions").cast("integer"))\
    .withColumn("fraudulent1", col("fraudulent").cast("integer"))

In [27]:
df.printSchema()

root
 |-- job_id: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- company_profile: string (nullable = true)
 |-- description: string (nullable = true)
 |-- requirements: string (nullable = true)
 |-- telecommuting: string (nullable = true)
 |-- has_company_logo: string (nullable = true)
 |-- has_questions: string (nullable = true)
 |-- employment_type: string (nullable = true)
 |-- fraudulent: string (nullable = true)
 |-- country: string (nullable = true)
 |-- telecommuting1: integer (nullable = true)
 |-- has_company_logo1: integer (nullable = true)
 |-- has_questions1: integer (nullable = true)
 |-- fraudulent1: integer (nullable = true)



In [28]:
df.columns[10:]

['country',
 'telecommuting1',
 'has_company_logo1',
 'has_questions1',
 'fraudulent1']

In [29]:
cols = df.columns[0:5] + [df.columns[8]] + df.columns[10:]
df = df.select(cols)
df.printSchema()

root
 |-- job_id: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- company_profile: string (nullable = true)
 |-- description: string (nullable = true)
 |-- requirements: string (nullable = true)
 |-- employment_type: string (nullable = true)
 |-- country: string (nullable = true)
 |-- telecommuting1: integer (nullable = true)
 |-- has_company_logo1: integer (nullable = true)
 |-- has_questions1: integer (nullable = true)
 |-- fraudulent1: integer (nullable = true)



### Cleaning text

In this section we will:
- remove hashtags and links
- remove un wanted charachters and numbers
- remove multiple spaces
- lower casing all text
- tokenizing 
- remove stop words

In [30]:
df.select("title","company_profile","description","requirements").limit(2).toPandas()

Unnamed: 0,title,company_profile,description,requirements
0,Marketing Intern,"We're Food52, and we've created a groundbreaking and award-winning cooking site. We support, connect, and celebrate home cooks, and give them everything they need in one place.We have a top editorial, business, and engineering team. We're focused on using technology to find new and better ways to connect people around their specific food interests, and to offer them superb, highly curated information about food and cooking. We attract the most talented home cooks and contributors in the country; we also publish well-known professionals like Mario Batali, Gwyneth Paltrow, and Danny Meyer. And we have partnerships with Whole Foods Market and Random House.Food52 has been named the best food website by the James Beard Foundation and IACP, and has been featured in the New York Times, NPR, Pando Daily, TechCrunch, and on the Today Show.We're located in Chelsea, in New York City.","Food52, a fast-growing, James Beard Award-winning online food community and crowd-sourced and curated recipe hub, is currently interviewing full- and part-time unpaid interns to work in a small team of editors, executives, and developers in its New York City headquarters.Reproducing and/or repackaging existing Food52 content for a number of partner sites, such as Huffington Post, Yahoo, Buzzfeed, and more in their various content management systemsResearching blogs and websites for the Provisions by Food52 Affiliate ProgramAssisting in day-to-day affiliate program support, such as screening affiliates and assisting in any affiliate inquiriesSupporting with PR &amp; Events when neededHelping with office administrative work, such as filing, mailing, and preparing for meetingsWorking with developers to document bugs and suggest improvements to the siteSupporting the marketing and executive staff","Experience with content management systems a major plus (any blogging counts!)Familiar with the Food52 editorial voice and aestheticLoves food, appreciates the importance of home cooking and cooking with the seasonsMeticulous editor, perfectionist, obsessive attention to detail, maddened by typos and broken links, delighted by finding and fixing themCheerful under pressureExcellent communication skillsA+ multi-tasker and juggler of responsibilities big and smallInterested in and engaged with social media like Twitter, Facebook, and PinterestLoves problem-solving and collaborating to drive Food52 forwardThinks big picture but pitches in on the nitty gritty of running a small company (dishes, shopping, administrative support)Comfortable with the realities of working for a startup: being on call on evenings and weekends, and working long hours"
1,Customer Service - Cloud Video Production,"90 Seconds, the worlds Cloud Video Production Service.90 Seconds is the worlds Cloud Video Production Service enabling brands and agencies to get high quality online video content shot and produced anywhere in the world. 90 Seconds makes video production fast, affordable, and all managed seamlessly in the cloud from purchase to publish. http://90#URL_fbe6559afac620a3cd2c22281f7b8d0eef56a73e3d9a311e2f1ca13d081dd630#90 Seconds removes the hassle, cost, risk and speed issues of working with regular video production companies by managing every aspect of video projects in a beautiful online experience. With a growing global network of over 2,000 rated video professionals in over 50 countries managed by dedicated production success teams in 5 countries, 90 Seconds provides a 100% success guarantee.90 Seconds has produced almost 4,000 videos in over 30 Countries for over 500 Global brands including some of the worlds largest including Paypal, L’Oreal, Sony and Barclays and has offices in Auckland, London, Sydney, Tokyo and Singapore.http://90#URL_fbe6559afac620a3cd2c22281f7b8d0eef56a73e3d9a311e2f1ca13d081dd630# | http://90#URL_e2ad0bde3f09a0913a486abdbb1e6ac373bb3310f64b1fbcf550049bcba4a17b# | http://90#URL_8c5dd1806f97ab90876d9daebeb430f682dbc87e2f01549b47e96c7bff2ea17e#","Organised - Focused - Vibrant - Awesome!Do you have a passion for customer service? Slick typing skills? Maybe Account Management? ...And think administration is cooler than a polar bear on a jetski? Then we need to hear you! We are the Cloud Video Production Service and opperating on a glodal level. Yeah, it's pretty cool. Serious about delivering a world class product and excellent customer service.Our rapidly expanding business is looking for a talented Project Manager to manage the successful delivery of video projects, manage client communications and drive the production process. Work with some of the coolest brands on the planet and learn from a global team that are representing NZ is a huge way!We are entering the next growth stage of our business and growing quickly internationally. Therefore, the position is bursting with opportunity for the right person entering the business at the right time. 90 Seconds, the worlds Cloud Video Production Service - http://90#URL_fbe6559afac620a3cd2c22281f7b8d0eef56a73e3d9a311e2f1ca13d081dd630#90 Seconds is the worlds Cloud Video Production Service enabling brands and agencies to get high quality online video content shot and produced anywhere in the world. Fast, affordable, and all managed seamlessly in the cloud from purchase to publish. 90 Seconds removes the hassle, cost, risk and speed issues of working with regular video production companies by managing every aspect of video projects in a beautiful online experience. With a growing network of over 2,000 rated video professionals in over 50 countries and dedicated production success teams in 5 countries guaranteeing video project success 100%. It's as easy as commissioning a quick google adwords campaign.90 Seconds has produced almost 4,000 videos in over 30 Countries for over 500 Global brands including some of the worlds largest including Paypal, L'oreal, Sony and Barclays and has offices in Auckland, London, Sydney, Tokyo &amp; Singapore.Our Auckland office is based right in the heart of the Wynyard Quarter Innovation Precinct - GridAKL!","What we expect from you:Your key responsibility will be to communicate with the client, 90 Seconds team and freelance community throughout the video production process including, shoot planning, securing freelance talent, managing workflow and the online production management system. The aim is to manage each video project effectively so that we produce great videos that our clients love.Key attributesClient focused - excellent customer service and communication skillsOnline - oustanding computer knowledge and experience using online software and project management toolsOrganised - manage workload and able to multi-task100% attention to detailMotivated - self-starter with a passion for doing excellent work and achieving great resultsAdaptable - show initiative and think on your feet as this is a constantly evolving atmosphereFlexible - fast turnaround work and after hours availabilityEasy going &amp; upbeat - dosen't get bogged down and loves the challengeSense of Humour - have a laugh and know that working in a startup takes guts!Ability to deliver - including meeting project deadlines and budgetAttitude is more important than experience at 90 Seconds, however previous experience in customer service and/or project management is beneficialPlease view our platform / website at #URL_395a8683a907ce95f49a12fb240e6e47ad8d5a4f96d07ebbd869c4dd4dea1826# and get a clear understand about what we do before reaching out."


In [31]:
# Removing hashtags and links (starts with http or https)
df = df.withColumn("title",regexp_replace(col('title'), r'#\w+|http\S+', ''))\
    .withColumn("company_profile",regexp_replace(col('company_profile'), r'#\w+|http\S+', ''))\
    .withColumn("description",regexp_replace(col('description'), r'#\w+|http\S+', ''))\
    .withColumn("requirements",regexp_replace(col('requirements'), r'#\w+|http\S+', ''))
df.select("title","company_profile","description","requirements").limit(2).toPandas()

Unnamed: 0,title,company_profile,description,requirements
0,Marketing Intern,"We're Food52, and we've created a groundbreaking and award-winning cooking site. We support, connect, and celebrate home cooks, and give them everything they need in one place.We have a top editorial, business, and engineering team. We're focused on using technology to find new and better ways to connect people around their specific food interests, and to offer them superb, highly curated information about food and cooking. We attract the most talented home cooks and contributors in the country; we also publish well-known professionals like Mario Batali, Gwyneth Paltrow, and Danny Meyer. And we have partnerships with Whole Foods Market and Random House.Food52 has been named the best food website by the James Beard Foundation and IACP, and has been featured in the New York Times, NPR, Pando Daily, TechCrunch, and on the Today Show.We're located in Chelsea, in New York City.","Food52, a fast-growing, James Beard Award-winning online food community and crowd-sourced and curated recipe hub, is currently interviewing full- and part-time unpaid interns to work in a small team of editors, executives, and developers in its New York City headquarters.Reproducing and/or repackaging existing Food52 content for a number of partner sites, such as Huffington Post, Yahoo, Buzzfeed, and more in their various content management systemsResearching blogs and websites for the Provisions by Food52 Affiliate ProgramAssisting in day-to-day affiliate program support, such as screening affiliates and assisting in any affiliate inquiriesSupporting with PR &amp; Events when neededHelping with office administrative work, such as filing, mailing, and preparing for meetingsWorking with developers to document bugs and suggest improvements to the siteSupporting the marketing and executive staff","Experience with content management systems a major plus (any blogging counts!)Familiar with the Food52 editorial voice and aestheticLoves food, appreciates the importance of home cooking and cooking with the seasonsMeticulous editor, perfectionist, obsessive attention to detail, maddened by typos and broken links, delighted by finding and fixing themCheerful under pressureExcellent communication skillsA+ multi-tasker and juggler of responsibilities big and smallInterested in and engaged with social media like Twitter, Facebook, and PinterestLoves problem-solving and collaborating to drive Food52 forwardThinks big picture but pitches in on the nitty gritty of running a small company (dishes, shopping, administrative support)Comfortable with the realities of working for a startup: being on call on evenings and weekends, and working long hours"
1,Customer Service - Cloud Video Production,"90 Seconds, the worlds Cloud Video Production Service.90 Seconds is the worlds Cloud Video Production Service enabling brands and agencies to get high quality online video content shot and produced anywhere in the world. 90 Seconds makes video production fast, affordable, and all managed seamlessly in the cloud from purchase to publish. Seconds removes the hassle, cost, risk and speed issues of working with regular video production companies by managing every aspect of video projects in a beautiful online experience. With a growing global network of over 2,000 rated video professionals in over 50 countries managed by dedicated production success teams in 5 countries, 90 Seconds provides a 100% success guarantee.90 Seconds has produced almost 4,000 videos in over 30 Countries for over 500 Global brands including some of the worlds largest including Paypal, L’Oreal, Sony and Barclays and has offices in Auckland, London, Sydney, Tokyo and Singapore.","Organised - Focused - Vibrant - Awesome!Do you have a passion for customer service? Slick typing skills? Maybe Account Management? ...And think administration is cooler than a polar bear on a jetski? Then we need to hear you! We are the Cloud Video Production Service and opperating on a glodal level. Yeah, it's pretty cool. Serious about delivering a world class product and excellent customer service.Our rapidly expanding business is looking for a talented Project Manager to manage the successful delivery of video projects, manage client communications and drive the production process. Work with some of the coolest brands on the planet and learn from a global team that are representing NZ is a huge way!We are entering the next growth stage of our business and growing quickly internationally. Therefore, the position is bursting with opportunity for the right person entering the business at the right time. 90 Seconds, the worlds Cloud Video Production Service - Seconds is the worlds Cloud Video Production Service enabling brands and agencies to get high quality online video content shot and produced anywhere in the world. Fast, affordable, and all managed seamlessly in the cloud from purchase to publish. 90 Seconds removes the hassle, cost, risk and speed issues of working with regular video production companies by managing every aspect of video projects in a beautiful online experience. With a growing network of over 2,000 rated video professionals in over 50 countries and dedicated production success teams in 5 countries guaranteeing video project success 100%. It's as easy as commissioning a quick google adwords campaign.90 Seconds has produced almost 4,000 videos in over 30 Countries for over 500 Global brands including some of the worlds largest including Paypal, L'oreal, Sony and Barclays and has offices in Auckland, London, Sydney, Tokyo &amp; Singapore.Our Auckland office is based right in the heart of the Wynyard Quarter Innovation Precinct - GridAKL!","What we expect from you:Your key responsibility will be to communicate with the client, 90 Seconds team and freelance community throughout the video production process including, shoot planning, securing freelance talent, managing workflow and the online production management system. The aim is to manage each video project effectively so that we produce great videos that our clients love.Key attributesClient focused - excellent customer service and communication skillsOnline - oustanding computer knowledge and experience using online software and project management toolsOrganised - manage workload and able to multi-task100% attention to detailMotivated - self-starter with a passion for doing excellent work and achieving great resultsAdaptable - show initiative and think on your feet as this is a constantly evolving atmosphereFlexible - fast turnaround work and after hours availabilityEasy going &amp; upbeat - dosen't get bogged down and loves the challengeSense of Humour - have a laugh and know that working in a startup takes guts!Ability to deliver - including meeting project deadlines and budgetAttitude is more important than experience at 90 Seconds, however previous experience in customer service and/or project management is beneficialPlease view our platform / website at # and get a clear understand about what we do before reaching out."


In [32]:
# Removing anything that is not a letter (numbers or symbols)
df = df.withColumn("title",regexp_replace(col('title'), '[^A-Za-z ]+', ' '))\
    .withColumn("company_profile",regexp_replace(col('company_profile'), '[^A-Za-z ]+', ' '))\
    .withColumn("description",regexp_replace(col('description'), '[^A-Za-z ]+', ' '))\
    .withColumn("requirements",regexp_replace(col('requirements'), '[^A-Za-z ]+', ' '))
df.select("title","company_profile","description","requirements").limit(2).toPandas()

Unnamed: 0,title,company_profile,description,requirements
0,Marketing Intern,We re Food and we ve created a groundbreaking and award winning cooking site We support connect and celebrate home cooks and give them everything they need in one place We have a top editorial business and engineering team We re focused on using technology to find new and better ways to connect people around their specific food interests and to offer them superb highly curated information about food and cooking We attract the most talented home cooks and contributors in the country we also publish well known professionals like Mario Batali Gwyneth Paltrow and Danny Meyer And we have partnerships with Whole Foods Market and Random House Food has been named the best food website by the James Beard Foundation and IACP and has been featured in the New York Times NPR Pando Daily TechCrunch and on the Today Show We re located in Chelsea in New York City,Food a fast growing James Beard Award winning online food community and crowd sourced and curated recipe hub is currently interviewing full and part time unpaid interns to work in a small team of editors executives and developers in its New York City headquarters Reproducing and or repackaging existing Food content for a number of partner sites such as Huffington Post Yahoo Buzzfeed and more in their various content management systemsResearching blogs and websites for the Provisions by Food Affiliate ProgramAssisting in day to day affiliate program support such as screening affiliates and assisting in any affiliate inquiriesSupporting with PR amp Events when neededHelping with office administrative work such as filing mailing and preparing for meetingsWorking with developers to document bugs and suggest improvements to the siteSupporting the marketing and executive staff,Experience with content management systems a major plus any blogging counts Familiar with the Food editorial voice and aestheticLoves food appreciates the importance of home cooking and cooking with the seasonsMeticulous editor perfectionist obsessive attention to detail maddened by typos and broken links delighted by finding and fixing themCheerful under pressureExcellent communication skillsA multi tasker and juggler of responsibilities big and smallInterested in and engaged with social media like Twitter Facebook and PinterestLoves problem solving and collaborating to drive Food forwardThinks big picture but pitches in on the nitty gritty of running a small company dishes shopping administrative support Comfortable with the realities of working for a startup being on call on evenings and weekends and working long hours
1,Customer Service Cloud Video Production,Seconds the worlds Cloud Video Production Service Seconds is the worlds Cloud Video Production Service enabling brands and agencies to get high quality online video content shot and produced anywhere in the world Seconds makes video production fast affordable and all managed seamlessly in the cloud from purchase to publish Seconds removes the hassle cost risk and speed issues of working with regular video production companies by managing every aspect of video projects in a beautiful online experience With a growing global network of over rated video professionals in over countries managed by dedicated production success teams in countries Seconds provides a success guarantee Seconds has produced almost videos in over Countries for over Global brands including some of the worlds largest including Paypal L Oreal Sony and Barclays and has offices in Auckland London Sydney Tokyo and Singapore,Organised Focused Vibrant Awesome Do you have a passion for customer service Slick typing skills Maybe Account Management And think administration is cooler than a polar bear on a jetski Then we need to hear you We are the Cloud Video Production Service and opperating on a glodal level Yeah it s pretty cool Serious about delivering a world class product and excellent customer service Our rapidly expanding business is looking for a talented Project Manager to manage the successful delivery of video projects manage client communications and drive the production process Work with some of the coolest brands on the planet and learn from a global team that are representing NZ is a huge way We are entering the next growth stage of our business and growing quickly internationally Therefore the position is bursting with opportunity for the right person entering the business at the right time Seconds the worlds Cloud Video Production Service Seconds is the worlds Cloud Video Production Service enabling brands and agencies to get high quality online video content shot and produced anywhere in the world Fast affordable and all managed seamlessly in the cloud from purchase to publish Seconds removes the hassle cost risk and speed issues of working with regular video production companies by managing every aspect of video projects in a beautiful online experience With a growing network of over rated video professionals in over countries and dedicated production success teams in countries guaranteeing video project success It s as easy as commissioning a quick google adwords campaign Seconds has produced almost videos in over Countries for over Global brands including some of the worlds largest including Paypal L oreal Sony and Barclays and has offices in Auckland London Sydney Tokyo amp Singapore Our Auckland office is based right in the heart of the Wynyard Quarter Innovation Precinct GridAKL,What we expect from you Your key responsibility will be to communicate with the client Seconds team and freelance community throughout the video production process including shoot planning securing freelance talent managing workflow and the online production management system The aim is to manage each video project effectively so that we produce great videos that our clients love Key attributesClient focused excellent customer service and communication skillsOnline oustanding computer knowledge and experience using online software and project management toolsOrganised manage workload and able to multi task attention to detailMotivated self starter with a passion for doing excellent work and achieving great resultsAdaptable show initiative and think on your feet as this is a constantly evolving atmosphereFlexible fast turnaround work and after hours availabilityEasy going amp upbeat dosen t get bogged down and loves the challengeSense of Humour have a laugh and know that working in a startup takes guts Ability to deliver including meeting project deadlines and budgetAttitude is more important than experience at Seconds however previous experience in customer service and or project management is beneficialPlease view our platform website at and get a clear understand about what we do before reaching out


In [33]:
# Remove multiple spaces 
df = df.withColumn("title",regexp_replace(col('title'), ' +', ' '))\
    .withColumn("company_profile",regexp_replace(col('company_profile'), ' +', ' '))\
    .withColumn("description",regexp_replace(col('description'), ' +', ' '))\
    .withColumn("requirements",regexp_replace(col('requirements'), ' +', ' '))
df.select("title","company_profile","description","requirements").limit(2).toPandas()

Unnamed: 0,title,company_profile,description,requirements
0,Marketing Intern,We re Food and we ve created a groundbreaking and award winning cooking site We support connect and celebrate home cooks and give them everything they need in one place We have a top editorial business and engineering team We re focused on using technology to find new and better ways to connect people around their specific food interests and to offer them superb highly curated information about food and cooking We attract the most talented home cooks and contributors in the country we also publish well known professionals like Mario Batali Gwyneth Paltrow and Danny Meyer And we have partnerships with Whole Foods Market and Random House Food has been named the best food website by the James Beard Foundation and IACP and has been featured in the New York Times NPR Pando Daily TechCrunch and on the Today Show We re located in Chelsea in New York City,Food a fast growing James Beard Award winning online food community and crowd sourced and curated recipe hub is currently interviewing full and part time unpaid interns to work in a small team of editors executives and developers in its New York City headquarters Reproducing and or repackaging existing Food content for a number of partner sites such as Huffington Post Yahoo Buzzfeed and more in their various content management systemsResearching blogs and websites for the Provisions by Food Affiliate ProgramAssisting in day to day affiliate program support such as screening affiliates and assisting in any affiliate inquiriesSupporting with PR amp Events when neededHelping with office administrative work such as filing mailing and preparing for meetingsWorking with developers to document bugs and suggest improvements to the siteSupporting the marketing and executive staff,Experience with content management systems a major plus any blogging counts Familiar with the Food editorial voice and aestheticLoves food appreciates the importance of home cooking and cooking with the seasonsMeticulous editor perfectionist obsessive attention to detail maddened by typos and broken links delighted by finding and fixing themCheerful under pressureExcellent communication skillsA multi tasker and juggler of responsibilities big and smallInterested in and engaged with social media like Twitter Facebook and PinterestLoves problem solving and collaborating to drive Food forwardThinks big picture but pitches in on the nitty gritty of running a small company dishes shopping administrative support Comfortable with the realities of working for a startup being on call on evenings and weekends and working long hours
1,Customer Service Cloud Video Production,Seconds the worlds Cloud Video Production Service Seconds is the worlds Cloud Video Production Service enabling brands and agencies to get high quality online video content shot and produced anywhere in the world Seconds makes video production fast affordable and all managed seamlessly in the cloud from purchase to publish Seconds removes the hassle cost risk and speed issues of working with regular video production companies by managing every aspect of video projects in a beautiful online experience With a growing global network of over rated video professionals in over countries managed by dedicated production success teams in countries Seconds provides a success guarantee Seconds has produced almost videos in over Countries for over Global brands including some of the worlds largest including Paypal L Oreal Sony and Barclays and has offices in Auckland London Sydney Tokyo and Singapore,Organised Focused Vibrant Awesome Do you have a passion for customer service Slick typing skills Maybe Account Management And think administration is cooler than a polar bear on a jetski Then we need to hear you We are the Cloud Video Production Service and opperating on a glodal level Yeah it s pretty cool Serious about delivering a world class product and excellent customer service Our rapidly expanding business is looking for a talented Project Manager to manage the successful delivery of video projects manage client communications and drive the production process Work with some of the coolest brands on the planet and learn from a global team that are representing NZ is a huge way We are entering the next growth stage of our business and growing quickly internationally Therefore the position is bursting with opportunity for the right person entering the business at the right time Seconds the worlds Cloud Video Production Service Seconds is the worlds Cloud Video Production Service enabling brands and agencies to get high quality online video content shot and produced anywhere in the world Fast affordable and all managed seamlessly in the cloud from purchase to publish Seconds removes the hassle cost risk and speed issues of working with regular video production companies by managing every aspect of video projects in a beautiful online experience With a growing network of over rated video professionals in over countries and dedicated production success teams in countries guaranteeing video project success It s as easy as commissioning a quick google adwords campaign Seconds has produced almost videos in over Countries for over Global brands including some of the worlds largest including Paypal L oreal Sony and Barclays and has offices in Auckland London Sydney Tokyo amp Singapore Our Auckland office is based right in the heart of the Wynyard Quarter Innovation Precinct GridAKL,What we expect from you Your key responsibility will be to communicate with the client Seconds team and freelance community throughout the video production process including shoot planning securing freelance talent managing workflow and the online production management system The aim is to manage each video project effectively so that we produce great videos that our clients love Key attributesClient focused excellent customer service and communication skillsOnline oustanding computer knowledge and experience using online software and project management toolsOrganised manage workload and able to multi task attention to detailMotivated self starter with a passion for doing excellent work and achieving great resultsAdaptable show initiative and think on your feet as this is a constantly evolving atmosphereFlexible fast turnaround work and after hours availabilityEasy going amp upbeat dosen t get bogged down and loves the challengeSense of Humour have a laugh and know that working in a startup takes guts Ability to deliver including meeting project deadlines and budgetAttitude is more important than experience at Seconds however previous experience in customer service and or project management is beneficialPlease view our platform website at and get a clear understand about what we do before reaching out


In [34]:
# Lower case everything 
df = df.withColumn("title",lower(col("title")))\
    .withColumn("company_profile",lower(col("company_profile")))\
    .withColumn("description",lower(col("description")))\
    .withColumn("requirements",lower(col("requirements")))
df.select("title","company_profile","description","requirements").limit(2).toPandas()

Unnamed: 0,title,company_profile,description,requirements
0,marketing intern,we re food and we ve created a groundbreaking and award winning cooking site we support connect and celebrate home cooks and give them everything they need in one place we have a top editorial business and engineering team we re focused on using technology to find new and better ways to connect people around their specific food interests and to offer them superb highly curated information about food and cooking we attract the most talented home cooks and contributors in the country we also publish well known professionals like mario batali gwyneth paltrow and danny meyer and we have partnerships with whole foods market and random house food has been named the best food website by the james beard foundation and iacp and has been featured in the new york times npr pando daily techcrunch and on the today show we re located in chelsea in new york city,food a fast growing james beard award winning online food community and crowd sourced and curated recipe hub is currently interviewing full and part time unpaid interns to work in a small team of editors executives and developers in its new york city headquarters reproducing and or repackaging existing food content for a number of partner sites such as huffington post yahoo buzzfeed and more in their various content management systemsresearching blogs and websites for the provisions by food affiliate programassisting in day to day affiliate program support such as screening affiliates and assisting in any affiliate inquiriessupporting with pr amp events when neededhelping with office administrative work such as filing mailing and preparing for meetingsworking with developers to document bugs and suggest improvements to the sitesupporting the marketing and executive staff,experience with content management systems a major plus any blogging counts familiar with the food editorial voice and aestheticloves food appreciates the importance of home cooking and cooking with the seasonsmeticulous editor perfectionist obsessive attention to detail maddened by typos and broken links delighted by finding and fixing themcheerful under pressureexcellent communication skillsa multi tasker and juggler of responsibilities big and smallinterested in and engaged with social media like twitter facebook and pinterestloves problem solving and collaborating to drive food forwardthinks big picture but pitches in on the nitty gritty of running a small company dishes shopping administrative support comfortable with the realities of working for a startup being on call on evenings and weekends and working long hours
1,customer service cloud video production,seconds the worlds cloud video production service seconds is the worlds cloud video production service enabling brands and agencies to get high quality online video content shot and produced anywhere in the world seconds makes video production fast affordable and all managed seamlessly in the cloud from purchase to publish seconds removes the hassle cost risk and speed issues of working with regular video production companies by managing every aspect of video projects in a beautiful online experience with a growing global network of over rated video professionals in over countries managed by dedicated production success teams in countries seconds provides a success guarantee seconds has produced almost videos in over countries for over global brands including some of the worlds largest including paypal l oreal sony and barclays and has offices in auckland london sydney tokyo and singapore,organised focused vibrant awesome do you have a passion for customer service slick typing skills maybe account management and think administration is cooler than a polar bear on a jetski then we need to hear you we are the cloud video production service and opperating on a glodal level yeah it s pretty cool serious about delivering a world class product and excellent customer service our rapidly expanding business is looking for a talented project manager to manage the successful delivery of video projects manage client communications and drive the production process work with some of the coolest brands on the planet and learn from a global team that are representing nz is a huge way we are entering the next growth stage of our business and growing quickly internationally therefore the position is bursting with opportunity for the right person entering the business at the right time seconds the worlds cloud video production service seconds is the worlds cloud video production service enabling brands and agencies to get high quality online video content shot and produced anywhere in the world fast affordable and all managed seamlessly in the cloud from purchase to publish seconds removes the hassle cost risk and speed issues of working with regular video production companies by managing every aspect of video projects in a beautiful online experience with a growing network of over rated video professionals in over countries and dedicated production success teams in countries guaranteeing video project success it s as easy as commissioning a quick google adwords campaign seconds has produced almost videos in over countries for over global brands including some of the worlds largest including paypal l oreal sony and barclays and has offices in auckland london sydney tokyo amp singapore our auckland office is based right in the heart of the wynyard quarter innovation precinct gridakl,what we expect from you your key responsibility will be to communicate with the client seconds team and freelance community throughout the video production process including shoot planning securing freelance talent managing workflow and the online production management system the aim is to manage each video project effectively so that we produce great videos that our clients love key attributesclient focused excellent customer service and communication skillsonline oustanding computer knowledge and experience using online software and project management toolsorganised manage workload and able to multi task attention to detailmotivated self starter with a passion for doing excellent work and achieving great resultsadaptable show initiative and think on your feet as this is a constantly evolving atmosphereflexible fast turnaround work and after hours availabilityeasy going amp upbeat dosen t get bogged down and loves the challengesense of humour have a laugh and know that working in a startup takes guts ability to deliver including meeting project deadlines and budgetattitude is more important than experience at seconds however previous experience in customer service and or project management is beneficialplease view our platform website at and get a clear understand about what we do before reaching out


In [35]:
#tokenization
cols = ["title","company_profile","description","requirements"]
raw_words = df
for col in cols:
    regex_tokenizer = RegexTokenizer(inputCol=col, outputCol=col+"_tokenized", pattern="\W")
    raw_words = regex_tokenizer.transform(raw_words)

raw_words.limit(1).toPandas()

Unnamed: 0,job_id,title,company_profile,description,requirements,employment_type,country,telecommuting1,has_company_logo1,has_questions1,fraudulent1,title_tokenized,company_profile_tokenized,description_tokenized,requirements_tokenized
0,1,marketing intern,we re food and we ve created a groundbreaking and award winning cooking site we support connect and celebrate home cooks and give them everything they need in one place we have a top editorial business and engineering team we re focused on using technology to find new and better ways to connect people around their specific food interests and to offer them superb highly curated information about food and cooking we attract the most talented home cooks and contributors in the country we also publish well known professionals like mario batali gwyneth paltrow and danny meyer and we have partnerships with whole foods market and random house food has been named the best food website by the james beard foundation and iacp and has been featured in the new york times npr pando daily techcrunch and on the today show we re located in chelsea in new york city,food a fast growing james beard award winning online food community and crowd sourced and curated recipe hub is currently interviewing full and part time unpaid interns to work in a small team of editors executives and developers in its new york city headquarters reproducing and or repackaging existing food content for a number of partner sites such as huffington post yahoo buzzfeed and more in their various content management systemsresearching blogs and websites for the provisions by food affiliate programassisting in day to day affiliate program support such as screening affiliates and assisting in any affiliate inquiriessupporting with pr amp events when neededhelping with office administrative work such as filing mailing and preparing for meetingsworking with developers to document bugs and suggest improvements to the sitesupporting the marketing and executive staff,experience with content management systems a major plus any blogging counts familiar with the food editorial voice and aestheticloves food appreciates the importance of home cooking and cooking with the seasonsmeticulous editor perfectionist obsessive attention to detail maddened by typos and broken links delighted by finding and fixing themcheerful under pressureexcellent communication skillsa multi tasker and juggler of responsibilities big and smallinterested in and engaged with social media like twitter facebook and pinterestloves problem solving and collaborating to drive food forwardthinks big picture but pitches in on the nitty gritty of running a small company dishes shopping administrative support comfortable with the realities of working for a startup being on call on evenings and weekends and working long hours,other,US,0,1,0,0,"[marketing, intern]","[we, re, food, and, we, ve, created, a, groundbreaking, and, award, winning, cooking, site, we, support, connect, and, celebrate, home, cooks, and, give, them, everything, they, need, in, one, place, we, have, a, top, editorial, business, and, engineering, team, we, re, focused, on, using, technology, to, find, new, and, better, ways, to, connect, people, around, their, specific, food, interests, and, to, offer, them, superb, highly, curated, information, about, food, and, cooking, we, attract, the, most, talented, home, cooks, and, contributors, in, the, country, we, also, publish, well, known, professionals, like, mario, batali, gwyneth, paltrow, and, danny, meyer, and, we, have, ...]","[food, a, fast, growing, james, beard, award, winning, online, food, community, and, crowd, sourced, and, curated, recipe, hub, is, currently, interviewing, full, and, part, time, unpaid, interns, to, work, in, a, small, team, of, editors, executives, and, developers, in, its, new, york, city, headquarters, reproducing, and, or, repackaging, existing, food, content, for, a, number, of, partner, sites, such, as, huffington, post, yahoo, buzzfeed, and, more, in, their, various, content, management, systemsresearching, blogs, and, websites, for, the, provisions, by, food, affiliate, programassisting, in, day, to, day, affiliate, program, support, such, as, screening, affiliates, and, assisting, in, any, affiliate, inquiriessupporting, with, pr, ...]","[experience, with, content, management, systems, a, major, plus, any, blogging, counts, familiar, with, the, food, editorial, voice, and, aestheticloves, food, appreciates, the, importance, of, home, cooking, and, cooking, with, the, seasonsmeticulous, editor, perfectionist, obsessive, attention, to, detail, maddened, by, typos, and, broken, links, delighted, by, finding, and, fixing, themcheerful, under, pressureexcellent, communication, skillsa, multi, tasker, and, juggler, of, responsibilities, big, and, smallinterested, in, and, engaged, with, social, media, like, twitter, facebook, and, pinterestloves, problem, solving, and, collaborating, to, drive, food, forwardthinks, big, picture, but, pitches, in, on, the, nitty, gritty, of, running, a, small, company, dishes, shopping, administrative, support, comfortable, ...]"


In [36]:
raw_words.printSchema() 

root
 |-- job_id: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- company_profile: string (nullable = true)
 |-- description: string (nullable = true)
 |-- requirements: string (nullable = true)
 |-- employment_type: string (nullable = true)
 |-- country: string (nullable = true)
 |-- telecommuting1: integer (nullable = true)
 |-- has_company_logo1: integer (nullable = true)
 |-- has_questions1: integer (nullable = true)
 |-- fraudulent1: integer (nullable = true)
 |-- title_tokenized: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- company_profile_tokenized: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- description_tokenized: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- requirements_tokenized: array (nullable = true)
 |    |-- element: string (containsNull = true)



In [37]:
df_tokenized_cols = raw_words.columns[5:]
df_tokenized = raw_words.select(df_tokenized_cols)
df_tokenized.printSchema()

root
 |-- employment_type: string (nullable = true)
 |-- country: string (nullable = true)
 |-- telecommuting1: integer (nullable = true)
 |-- has_company_logo1: integer (nullable = true)
 |-- has_questions1: integer (nullable = true)
 |-- fraudulent1: integer (nullable = true)
 |-- title_tokenized: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- company_profile_tokenized: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- description_tokenized: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- requirements_tokenized: array (nullable = true)
 |    |-- element: string (containsNull = true)



In [38]:
#removing stop words
cols = ["title_tokenized","company_profile_tokenized","description_tokenized","requirements_tokenized"]
filtered_df = df_tokenized
for col in cols:
    remover = StopWordsRemover(inputCol=col, outputCol=col+"_filtered")
    filtered_df = remover.transform(filtered_df)
filtered_df.limit(1).toPandas()

Unnamed: 0,employment_type,country,telecommuting1,has_company_logo1,has_questions1,fraudulent1,title_tokenized,company_profile_tokenized,description_tokenized,requirements_tokenized,title_tokenized_filtered,company_profile_tokenized_filtered,description_tokenized_filtered,requirements_tokenized_filtered
0,other,US,0,1,0,0,"[marketing, intern]","[we, re, food, and, we, ve, created, a, groundbreaking, and, award, winning, cooking, site, we, support, connect, and, celebrate, home, cooks, and, give, them, everything, they, need, in, one, place, we, have, a, top, editorial, business, and, engineering, team, we, re, focused, on, using, technology, to, find, new, and, better, ways, to, connect, people, around, their, specific, food, interests, and, to, offer, them, superb, highly, curated, information, about, food, and, cooking, we, attract, the, most, talented, home, cooks, and, contributors, in, the, country, we, also, publish, well, known, professionals, like, mario, batali, gwyneth, paltrow, and, danny, meyer, and, we, have, ...]","[food, a, fast, growing, james, beard, award, winning, online, food, community, and, crowd, sourced, and, curated, recipe, hub, is, currently, interviewing, full, and, part, time, unpaid, interns, to, work, in, a, small, team, of, editors, executives, and, developers, in, its, new, york, city, headquarters, reproducing, and, or, repackaging, existing, food, content, for, a, number, of, partner, sites, such, as, huffington, post, yahoo, buzzfeed, and, more, in, their, various, content, management, systemsresearching, blogs, and, websites, for, the, provisions, by, food, affiliate, programassisting, in, day, to, day, affiliate, program, support, such, as, screening, affiliates, and, assisting, in, any, affiliate, inquiriessupporting, with, pr, ...]","[experience, with, content, management, systems, a, major, plus, any, blogging, counts, familiar, with, the, food, editorial, voice, and, aestheticloves, food, appreciates, the, importance, of, home, cooking, and, cooking, with, the, seasonsmeticulous, editor, perfectionist, obsessive, attention, to, detail, maddened, by, typos, and, broken, links, delighted, by, finding, and, fixing, themcheerful, under, pressureexcellent, communication, skillsa, multi, tasker, and, juggler, of, responsibilities, big, and, smallinterested, in, and, engaged, with, social, media, like, twitter, facebook, and, pinterestloves, problem, solving, and, collaborating, to, drive, food, forwardthinks, big, picture, but, pitches, in, on, the, nitty, gritty, of, running, a, small, company, dishes, shopping, administrative, support, comfortable, ...]","[marketing, intern]","[re, food, ve, created, groundbreaking, award, winning, cooking, site, support, connect, celebrate, home, cooks, give, everything, need, one, place, top, editorial, business, engineering, team, re, focused, using, technology, find, new, better, ways, connect, people, around, specific, food, interests, offer, superb, highly, curated, information, food, cooking, attract, talented, home, cooks, contributors, country, also, publish, well, known, professionals, like, mario, batali, gwyneth, paltrow, danny, meyer, partnerships, whole, foods, market, random, house, food, named, best, food, website, james, beard, foundation, iacp, featured, new, york, times, npr, pando, daily, techcrunch, today, show, re, located, chelsea, new, york, city]","[food, fast, growing, james, beard, award, winning, online, food, community, crowd, sourced, curated, recipe, hub, currently, interviewing, full, part, time, unpaid, interns, work, small, team, editors, executives, developers, new, york, city, headquarters, reproducing, repackaging, existing, food, content, number, partner, sites, huffington, post, yahoo, buzzfeed, various, content, management, systemsresearching, blogs, websites, provisions, food, affiliate, programassisting, day, day, affiliate, program, support, screening, affiliates, assisting, affiliate, inquiriessupporting, pr, amp, events, neededhelping, office, administrative, work, filing, mailing, preparing, meetingsworking, developers, document, bugs, suggest, improvements, sitesupporting, marketing, executive, staff]","[experience, content, management, systems, major, plus, blogging, counts, familiar, food, editorial, voice, aestheticloves, food, appreciates, importance, home, cooking, cooking, seasonsmeticulous, editor, perfectionist, obsessive, attention, detail, maddened, typos, broken, links, delighted, finding, fixing, themcheerful, pressureexcellent, communication, skillsa, multi, tasker, juggler, responsibilities, big, smallinterested, engaged, social, media, like, twitter, facebook, pinterestloves, problem, solving, collaborating, drive, food, forwardthinks, big, picture, pitches, nitty, gritty, running, small, company, dishes, shopping, administrative, support, comfortable, realities, working, startup, call, evenings, weekends, working, long, hours]"


In [39]:
filtered_df.printSchema()

root
 |-- employment_type: string (nullable = true)
 |-- country: string (nullable = true)
 |-- telecommuting1: integer (nullable = true)
 |-- has_company_logo1: integer (nullable = true)
 |-- has_questions1: integer (nullable = true)
 |-- fraudulent1: integer (nullable = true)
 |-- title_tokenized: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- company_profile_tokenized: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- description_tokenized: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- requirements_tokenized: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- title_tokenized_filtered: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- company_profile_tokenized_filtered: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- description_tokenized_filtered: array (nullable = true)
 |    |-- element: string (containsNull = tru

In [40]:
filtered_df.columns[:6]

['employment_type',
 'country',
 'telecommuting1',
 'has_company_logo1',
 'has_questions1',
 'fraudulent1']

In [41]:
df_final_cols = filtered_df.columns[:6] + filtered_df.columns[10:]
df_final_cols

['employment_type',
 'country',
 'telecommuting1',
 'has_company_logo1',
 'has_questions1',
 'fraudulent1',
 'title_tokenized_filtered',
 'company_profile_tokenized_filtered',
 'description_tokenized_filtered',
 'requirements_tokenized_filtered']

In [42]:
df_final = filtered_df.select(df_final_cols)
df_final.limit(1).toPandas()

Unnamed: 0,employment_type,country,telecommuting1,has_company_logo1,has_questions1,fraudulent1,title_tokenized_filtered,company_profile_tokenized_filtered,description_tokenized_filtered,requirements_tokenized_filtered
0,other,US,0,1,0,0,"[marketing, intern]","[re, food, ve, created, groundbreaking, award, winning, cooking, site, support, connect, celebrate, home, cooks, give, everything, need, one, place, top, editorial, business, engineering, team, re, focused, using, technology, find, new, better, ways, connect, people, around, specific, food, interests, offer, superb, highly, curated, information, food, cooking, attract, talented, home, cooks, contributors, country, also, publish, well, known, professionals, like, mario, batali, gwyneth, paltrow, danny, meyer, partnerships, whole, foods, market, random, house, food, named, best, food, website, james, beard, foundation, iacp, featured, new, york, times, npr, pando, daily, techcrunch, today, show, re, located, chelsea, new, york, city]","[food, fast, growing, james, beard, award, winning, online, food, community, crowd, sourced, curated, recipe, hub, currently, interviewing, full, part, time, unpaid, interns, work, small, team, editors, executives, developers, new, york, city, headquarters, reproducing, repackaging, existing, food, content, number, partner, sites, huffington, post, yahoo, buzzfeed, various, content, management, systemsresearching, blogs, websites, provisions, food, affiliate, programassisting, day, day, affiliate, program, support, screening, affiliates, assisting, affiliate, inquiriessupporting, pr, amp, events, neededhelping, office, administrative, work, filing, mailing, preparing, meetingsworking, developers, document, bugs, suggest, improvements, sitesupporting, marketing, executive, staff]","[experience, content, management, systems, major, plus, blogging, counts, familiar, food, editorial, voice, aestheticloves, food, appreciates, importance, home, cooking, cooking, seasonsmeticulous, editor, perfectionist, obsessive, attention, detail, maddened, typos, broken, links, delighted, finding, fixing, themcheerful, pressureexcellent, communication, skillsa, multi, tasker, juggler, responsibilities, big, smallinterested, engaged, social, media, like, twitter, facebook, pinterestloves, problem, solving, collaborating, drive, food, forwardthinks, big, picture, pitches, nitty, gritty, running, small, company, dishes, shopping, administrative, support, comfortable, realities, working, startup, call, evenings, weekends, working, long, hours]"


## Feature vectorization

We will:
- Use string indexer to vectorize the employment_type column.
- Use string indexer to vectorize the country column.
- Use:
    - Hashing TF Vectorizer
    - TFIDF Vectorizer
    - WordtoVec Vectorizer 
    <br>to vectorize the text columns (title, company_profile, description, requirements)

In [43]:
#vectorizing employment type column
indexer = StringIndexer(inputCol="employment_type", outputCol="employment_type_indexed")
feature_data = indexer.fit(df_final).transform(df_final)
feature_data.limit(2).toPandas()

Unnamed: 0,employment_type,country,telecommuting1,has_company_logo1,has_questions1,fraudulent1,title_tokenized_filtered,company_profile_tokenized_filtered,description_tokenized_filtered,requirements_tokenized_filtered,employment_type_indexed
0,other,US,0,1,0,0,"[marketing, intern]","[re, food, ve, created, groundbreaking, award, winning, cooking, site, support, connect, celebrate, home, cooks, give, everything, need, one, place, top, editorial, business, engineering, team, re, focused, using, technology, find, new, better, ways, connect, people, around, specific, food, interests, offer, superb, highly, curated, information, food, cooking, attract, talented, home, cooks, contributors, country, also, publish, well, known, professionals, like, mario, batali, gwyneth, paltrow, danny, meyer, partnerships, whole, foods, market, random, house, food, named, best, food, website, james, beard, foundation, iacp, featured, new, york, times, npr, pando, daily, techcrunch, today, show, re, located, chelsea, new, york, city]","[food, fast, growing, james, beard, award, winning, online, food, community, crowd, sourced, curated, recipe, hub, currently, interviewing, full, part, time, unpaid, interns, work, small, team, editors, executives, developers, new, york, city, headquarters, reproducing, repackaging, existing, food, content, number, partner, sites, huffington, post, yahoo, buzzfeed, various, content, management, systemsresearching, blogs, websites, provisions, food, affiliate, programassisting, day, day, affiliate, program, support, screening, affiliates, assisting, affiliate, inquiriessupporting, pr, amp, events, neededhelping, office, administrative, work, filing, mailing, preparing, meetingsworking, developers, document, bugs, suggest, improvements, sitesupporting, marketing, executive, staff]","[experience, content, management, systems, major, plus, blogging, counts, familiar, food, editorial, voice, aestheticloves, food, appreciates, importance, home, cooking, cooking, seasonsmeticulous, editor, perfectionist, obsessive, attention, detail, maddened, typos, broken, links, delighted, finding, fixing, themcheerful, pressureexcellent, communication, skillsa, multi, tasker, juggler, responsibilities, big, smallinterested, engaged, social, media, like, twitter, facebook, pinterestloves, problem, solving, collaborating, drive, food, forwardthinks, big, picture, pitches, nitty, gritty, running, small, company, dishes, shopping, administrative, support, comfortable, realities, working, startup, call, evenings, weekends, working, long, hours]",1.0
1,Full-time,NZ,0,1,0,0,"[customer, service, cloud, video, production]","[seconds, worlds, cloud, video, production, service, seconds, worlds, cloud, video, production, service, enabling, brands, agencies, get, high, quality, online, video, content, shot, produced, anywhere, world, seconds, makes, video, production, fast, affordable, managed, seamlessly, cloud, purchase, publish, seconds, removes, hassle, cost, risk, speed, issues, working, regular, video, production, companies, managing, every, aspect, video, projects, beautiful, online, experience, growing, global, network, rated, video, professionals, countries, managed, dedicated, production, success, teams, countries, seconds, provides, success, guarantee, seconds, produced, almost, videos, countries, global, brands, including, worlds, largest, including, paypal, l, oreal, sony, barclays, offices, auckland, london, sydney, tokyo, singapore]","[organised, focused, vibrant, awesome, passion, customer, service, slick, typing, skills, maybe, account, management, think, administration, cooler, polar, bear, jetski, need, hear, cloud, video, production, service, opperating, glodal, level, yeah, pretty, cool, serious, delivering, world, class, product, excellent, customer, service, rapidly, expanding, business, looking, talented, project, manager, manage, successful, delivery, video, projects, manage, client, communications, drive, production, process, work, coolest, brands, planet, learn, global, team, representing, nz, huge, way, entering, next, growth, stage, business, growing, quickly, internationally, therefore, position, bursting, opportunity, right, person, entering, business, right, time, seconds, worlds, cloud, video, production, service, seconds, worlds, cloud, video, production, service, enabling, brands, ...]","[expect, key, responsibility, communicate, client, seconds, team, freelance, community, throughout, video, production, process, including, shoot, planning, securing, freelance, talent, managing, workflow, online, production, management, system, aim, manage, video, project, effectively, produce, great, videos, clients, love, key, attributesclient, focused, excellent, customer, service, communication, skillsonline, oustanding, computer, knowledge, experience, using, online, software, project, management, toolsorganised, manage, workload, able, multi, task, attention, detailmotivated, self, starter, passion, excellent, work, achieving, great, resultsadaptable, show, initiative, think, feet, constantly, evolving, atmosphereflexible, fast, turnaround, work, hours, availabilityeasy, going, amp, upbeat, dosen, get, bogged, loves, challengesense, humour, laugh, know, working, startup, takes, guts, ability, deliver, including, meeting, project, ...]",0.0


In [44]:
feature_data = feature_data.drop("employment_type")

In [45]:
feature_data.printSchema()

root
 |-- country: string (nullable = true)
 |-- telecommuting1: integer (nullable = true)
 |-- has_company_logo1: integer (nullable = true)
 |-- has_questions1: integer (nullable = true)
 |-- fraudulent1: integer (nullable = true)
 |-- title_tokenized_filtered: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- company_profile_tokenized_filtered: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- description_tokenized_filtered: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- requirements_tokenized_filtered: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- employment_type_indexed: double (nullable = false)



In [46]:
#vectorizing country column
indexer = StringIndexer(inputCol="country", outputCol="country_indexed")
feature_data = indexer.fit(feature_data).transform(feature_data)
feature_data.limit(2).toPandas()

Unnamed: 0,country,telecommuting1,has_company_logo1,has_questions1,fraudulent1,title_tokenized_filtered,company_profile_tokenized_filtered,description_tokenized_filtered,requirements_tokenized_filtered,employment_type_indexed,country_indexed
0,US,0,1,0,0,"[marketing, intern]","[re, food, ve, created, groundbreaking, award, winning, cooking, site, support, connect, celebrate, home, cooks, give, everything, need, one, place, top, editorial, business, engineering, team, re, focused, using, technology, find, new, better, ways, connect, people, around, specific, food, interests, offer, superb, highly, curated, information, food, cooking, attract, talented, home, cooks, contributors, country, also, publish, well, known, professionals, like, mario, batali, gwyneth, paltrow, danny, meyer, partnerships, whole, foods, market, random, house, food, named, best, food, website, james, beard, foundation, iacp, featured, new, york, times, npr, pando, daily, techcrunch, today, show, re, located, chelsea, new, york, city]","[food, fast, growing, james, beard, award, winning, online, food, community, crowd, sourced, curated, recipe, hub, currently, interviewing, full, part, time, unpaid, interns, work, small, team, editors, executives, developers, new, york, city, headquarters, reproducing, repackaging, existing, food, content, number, partner, sites, huffington, post, yahoo, buzzfeed, various, content, management, systemsresearching, blogs, websites, provisions, food, affiliate, programassisting, day, day, affiliate, program, support, screening, affiliates, assisting, affiliate, inquiriessupporting, pr, amp, events, neededhelping, office, administrative, work, filing, mailing, preparing, meetingsworking, developers, document, bugs, suggest, improvements, sitesupporting, marketing, executive, staff]","[experience, content, management, systems, major, plus, blogging, counts, familiar, food, editorial, voice, aestheticloves, food, appreciates, importance, home, cooking, cooking, seasonsmeticulous, editor, perfectionist, obsessive, attention, detail, maddened, typos, broken, links, delighted, finding, fixing, themcheerful, pressureexcellent, communication, skillsa, multi, tasker, juggler, responsibilities, big, smallinterested, engaged, social, media, like, twitter, facebook, pinterestloves, problem, solving, collaborating, drive, food, forwardthinks, big, picture, pitches, nitty, gritty, running, small, company, dishes, shopping, administrative, support, comfortable, realities, working, startup, call, evenings, weekends, working, long, hours]",1.0,0.0
1,NZ,0,1,0,0,"[customer, service, cloud, video, production]","[seconds, worlds, cloud, video, production, service, seconds, worlds, cloud, video, production, service, enabling, brands, agencies, get, high, quality, online, video, content, shot, produced, anywhere, world, seconds, makes, video, production, fast, affordable, managed, seamlessly, cloud, purchase, publish, seconds, removes, hassle, cost, risk, speed, issues, working, regular, video, production, companies, managing, every, aspect, video, projects, beautiful, online, experience, growing, global, network, rated, video, professionals, countries, managed, dedicated, production, success, teams, countries, seconds, provides, success, guarantee, seconds, produced, almost, videos, countries, global, brands, including, worlds, largest, including, paypal, l, oreal, sony, barclays, offices, auckland, london, sydney, tokyo, singapore]","[organised, focused, vibrant, awesome, passion, customer, service, slick, typing, skills, maybe, account, management, think, administration, cooler, polar, bear, jetski, need, hear, cloud, video, production, service, opperating, glodal, level, yeah, pretty, cool, serious, delivering, world, class, product, excellent, customer, service, rapidly, expanding, business, looking, talented, project, manager, manage, successful, delivery, video, projects, manage, client, communications, drive, production, process, work, coolest, brands, planet, learn, global, team, representing, nz, huge, way, entering, next, growth, stage, business, growing, quickly, internationally, therefore, position, bursting, opportunity, right, person, entering, business, right, time, seconds, worlds, cloud, video, production, service, seconds, worlds, cloud, video, production, service, enabling, brands, ...]","[expect, key, responsibility, communicate, client, seconds, team, freelance, community, throughout, video, production, process, including, shoot, planning, securing, freelance, talent, managing, workflow, online, production, management, system, aim, manage, video, project, effectively, produce, great, videos, clients, love, key, attributesclient, focused, excellent, customer, service, communication, skillsonline, oustanding, computer, knowledge, experience, using, online, software, project, management, toolsorganised, manage, workload, able, multi, task, attention, detailmotivated, self, starter, passion, excellent, work, achieving, great, resultsadaptable, show, initiative, think, feet, constantly, evolving, atmosphereflexible, fast, turnaround, work, hours, availabilityeasy, going, amp, upbeat, dosen, get, bogged, loves, challengesense, humour, laugh, know, working, startup, takes, guts, ability, deliver, including, meeting, project, ...]",0.0,6.0


In [47]:
feature_data = feature_data.drop("country")
feature_data.printSchema()

root
 |-- telecommuting1: integer (nullable = true)
 |-- has_company_logo1: integer (nullable = true)
 |-- has_questions1: integer (nullable = true)
 |-- fraudulent1: integer (nullable = true)
 |-- title_tokenized_filtered: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- company_profile_tokenized_filtered: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- description_tokenized_filtered: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- requirements_tokenized_filtered: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- employment_type_indexed: double (nullable = false)
 |-- country_indexed: double (nullable = false)



In [48]:
feature_data.columns[4:8]

['title_tokenized_filtered',
 'company_profile_tokenized_filtered',
 'description_tokenized_filtered',
 'requirements_tokenized_filtered']

In [49]:
# Hashing TF
cols = feature_data.columns[4:8]
HTFfeaturizedData = feature_data

for col in cols:
    hashingTF = HashingTF(inputCol=col, outputCol=col+"_rawfeatures", numFeatures=50)
    HTFfeaturizedData = hashingTF.transform(HTFfeaturizedData)
    HTFfeaturizedData = HTFfeaturizedData.drop(col)
HTFfeaturizedData.limit(2).toPandas()

Unnamed: 0,telecommuting1,has_company_logo1,has_questions1,fraudulent1,employment_type_indexed,country_indexed,title_tokenized_filtered_rawfeatures,company_profile_tokenized_filtered_rawfeatures,description_tokenized_filtered_rawfeatures,requirements_tokenized_filtered_rawfeatures
0,0,1,0,0,1.0,0.0,"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0)","(2.0, 3.0, 1.0, 5.0, 1.0, 1.0, 2.0, 2.0, 2.0, 1.0, 5.0, 0.0, 5.0, 4.0, 0.0, 4.0, 0.0, 3.0, 3.0, 1.0, 2.0, 1.0, 0.0, 0.0, 3.0, 1.0, 0.0, 1.0, 3.0, 0.0, 3.0, 1.0, 0.0, 4.0, 0.0, 4.0, 2.0, 2.0, 1.0, 1.0, 3.0, 1.0, 5.0, 1.0, 1.0, 3.0, 0.0, 2.0, 3.0, 1.0)","(2.0, 5.0, 0.0, 4.0, 1.0, 4.0, 1.0, 2.0, 0.0, 1.0, 1.0, 2.0, 3.0, 2.0, 0.0, 0.0, 3.0, 2.0, 2.0, 1.0, 1.0, 3.0, 2.0, 3.0, 3.0, 0.0, 1.0, 2.0, 0.0, 2.0, 1.0, 1.0, 1.0, 3.0, 2.0, 1.0, 1.0, 0.0, 5.0, 0.0, 2.0, 1.0, 5.0, 1.0, 2.0, 1.0, 0.0, 2.0, 2.0, 0.0)","(1.0, 1.0, 1.0, 5.0, 1.0, 0.0, 0.0, 1.0, 2.0, 3.0, 1.0, 2.0, 4.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 0.0, 0.0, 1.0, 1.0, 2.0, 2.0, 1.0, 1.0, 1.0, 0.0, 1.0, 2.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 3.0, 2.0, 2.0, 1.0, 4.0, 3.0, 2.0, 0.0, 4.0, 1.0, 5.0, 1.0)"
1,0,1,0,0,0.0,6.0,"(0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(0.0, 1.0, 3.0, 6.0, 2.0, 0.0, 1.0, 5.0, 3.0, 0.0, 3.0, 2.0, 1.0, 2.0, 1.0, 0.0, 3.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 7.0, 4.0, 1.0, 0.0, 3.0, 0.0, 5.0, 1.0, 2.0, 1.0, 2.0, 2.0, 1.0, 4.0, 1.0, 2.0, 10.0, 0.0, 2.0, 0.0, 5.0, 4.0, 1.0, 0.0, 0.0, 1.0)","(0.0, 2.0, 7.0, 6.0, 3.0, 3.0, 1.0, 10.0, 3.0, 2.0, 5.0, 3.0, 2.0, 5.0, 3.0, 2.0, 3.0, 2.0, 3.0, 3.0, 2.0, 1.0, 0.0, 3.0, 13.0, 6.0, 3.0, 3.0, 4.0, 0.0, 9.0, 2.0, 4.0, 4.0, 2.0, 7.0, 2.0, 7.0, 2.0, 2.0, 13.0, 3.0, 3.0, 1.0, 9.0, 8.0, 2.0, 3.0, 1.0, 5.0)","(5.0, 3.0, 5.0, 3.0, 1.0, 1.0, 1.0, 3.0, 1.0, 3.0, 2.0, 3.0, 1.0, 3.0, 3.0, 4.0, 1.0, 3.0, 1.0, 1.0, 3.0, 2.0, 0.0, 1.0, 4.0, 0.0, 6.0, 2.0, 1.0, 1.0, 4.0, 5.0, 1.0, 2.0, 2.0, 7.0, 2.0, 1.0, 1.0, 1.0, 6.0, 6.0, 2.0, 2.0, 3.0, 0.0, 3.0, 1.0, 3.0, 0.0)"


In [50]:
# TF-IDF
TFIDFfeaturizedData = HTFfeaturizedData
cols = HTFfeaturizedData.columns[6:10]
for col in cols:
    idf = IDF(inputCol=col, outputCol=col+"_tfidf")
    idfModel = idf.fit(HTFfeaturizedData)
    TFIDFfeaturizedData = idfModel.transform(TFIDFfeaturizedData)
    TFIDFfeaturizedData = TFIDFfeaturizedData.drop(col)
TFIDFfeaturizedData.name = 'TFIDFfeaturizedData'
TFIDFfeaturizedData.limit(2).toPandas()

Unnamed: 0,telecommuting1,has_company_logo1,has_questions1,fraudulent1,employment_type_indexed,country_indexed,title_tokenized_filtered_rawfeatures_tfidf,company_profile_tokenized_filtered_rawfeatures_tfidf,description_tokenized_filtered_rawfeatures_tfidf,requirements_tokenized_filtered_rawfeatures_tfidf
0,0,1,0,0,1.0,0.0,"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.596198158881319, 0.0, 0.0, 0.0, 2.7872533956440284, 0.0, 0.0, 0.0)","(1.6539129241456119, 2.267133799935574, 0.6601615508551427, 2.5581571675046466, 0.7994445933226298, 0.8167673405398773, 1.24819074574283, 0.7733298560926558, 1.5041575412723518, 0.733252353808031, 2.1636942350977386, 0.0, 3.0570741783287634, 1.853088198074099, 0.0, 3.0621896460876177, 0.0, 2.151557387812281, 1.9349829314014195, 0.8700493707683976, 1.1509247592513716, 1.107968112477655, 0.0, 0.0, 1.633224445595979, 0.6917925259369071, 0.0, 0.7489578642892835, 1.8818723754626365, 0.0, 2.062339441355095, 0.7145329546372695, 0.0, 2.487151225316324, 0.0, 1.9074081577069577, 1.8754132247553965, 1.5975786564889385, 0.7550840321016392, 0.6923812807375465, 1.5544229225732566, 0.6699031899822944, 2.035878853347798, 0.762005345832527, 0.558629064439829, 2.0672661286793423, 0.0, 1.0107187255265202, 2.9122872440149603, 0.8783845182690638)","(0.764154336861333, 1.5658533195851905, 0.0, 0.8636813409900508, 0.41099104525011354, 1.3791262003606026, 0.39336789111912723, 0.2780215317298331, 0.0, 0.4226149243396903, 0.20291284886784092, 0.8584038191636133, 0.5890932419897865, 0.3945945667098162, 0.0, 0.0, 1.030104366879011, 0.5810696092920499, 0.49152886543337254, 0.5734733708568909, 0.3522978707609938, 1.0722737358966812, 0.8007598558276658, 0.5905998603806727, 0.5567761825542852, 0.0, 0.2572849481892588, 0.6670593465854314, 0.0, 1.190077377233556, 0.3131706639170381, 0.32688785477214544, 0.17276052551872845, 0.742950475010308, 1.0702115117393591, 0.232216828334358, 0.4210871774687386, 0.0, 1.8456030919770918, 0.0, 0.25525957689458506, 0.2944038923420912, 0.8323779431477479, 0.36232365085153156, 0.5219015068493438, 0.370997912231163, 0.0, 0.4786091179587886, 0.6716718187722954, 0.0)","(0.8711754970135327, 0.741383481519428, 0.6720937713621129, 3.450729270966911, 0.9869183617135769, 0.0, 0.0, 0.47145455461248503, 1.6612028410927244, 2.568236282264297, 0.595466221757108, 1.91156519260407, 2.9694943224408457, 0.5951455547670266, 0.620910132109274, 1.6944778483712495, 0.7158584955917359, 0.772414172990075, 1.3032881909344087, 0.0, 0.0, 0.8715981215179566, 0.5935437604152533, 1.317359969365888, 1.0207084627640668, 1.0576747824604187, 0.7856440878756971, 0.6567458581342744, 0.0, 1.0075277659796968, 1.542023117595481, 1.364372158105036, 0.4193824531763066, 0.8218489240877762, 0.8476515505583042, 0.3293426469384993, 0.9380077265022259, 0.0, 2.4331442971228134, 1.7032981493263548, 0.7123229475058264, 0.8010189835717383, 1.9860338254861418, 1.675578090233334, 1.5430427454534053, 0.0, 2.632898237686968, 0.8189038081705347, 3.6344612704546746, 0.7082904373664832)"
1,0,1,0,0,0.0,6.0,"(0.0, 0.0, 1.5349006274688988, 2.442612291269019, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.254095276753278, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.4590173071587516, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.136125097838184, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(0.0, 0.7557112666451913, 1.980484652565428, 3.069788601005576, 1.5988891866452597, 0.0, 0.624095372871415, 1.9333246402316395, 2.256236311908528, 0.0, 1.2982165410586433, 1.8166727225638832, 0.6114148356657527, 0.9265440990370495, 0.789530606564978, 0.0, 2.0395709276054865, 0.0, 0.0, 0.0, 0.5754623796256858, 1.107968112477655, 0.0, 0.7079314783437833, 3.810857039723951, 2.7671701037476284, 0.7580983523945618, 0.0, 1.8818723754626365, 0.0, 3.4372324022584917, 0.7145329546372695, 1.2598860402819676, 0.621787806329081, 2.050887109242624, 0.9537040788534789, 0.9377066123776983, 3.195157312977877, 0.7550840321016392, 1.384762561475093, 5.181409741910855, 0.0, 0.8143515413391192, 0.0, 2.7931453221991447, 2.756354838239123, 0.76580089694785, 0.0, 0.0, 0.8783845182690638)","(0.0, 0.6263413278340761, 1.5252798883035048, 1.295522011485076, 1.2329731357503406, 1.034344650270452, 0.39336789111912723, 1.3901076586491656, 1.2851624002541784, 0.8452298486793806, 1.0145642443392047, 1.28760572874542, 0.392728827993191, 0.9864864167745404, 1.1007195975123945, 0.8032242110065614, 1.030104366879011, 0.5810696092920499, 0.7372932981500588, 1.7204201125706726, 0.7045957415219876, 0.35742457863222704, 0.0, 0.5905998603806727, 2.412696791068569, 2.3869942778879683, 0.7718548445677764, 1.000589019878147, 1.2077924049696676, 0.0, 2.8185359752533428, 0.6537757095442909, 0.6910421020749138, 0.9906006333470774, 1.0702115117393591, 1.625517798340506, 0.8421743549374772, 2.5427884926011415, 0.7382412367908368, 0.6067679323974952, 1.6591872498148028, 0.8832116770262736, 0.49942676588864876, 0.36232365085153156, 2.348556780822047, 2.967983297849304, 0.7204180761897104, 0.7179136769381829, 0.3358359093861477, 1.7964001160301448)","(4.355877485067664, 2.2241504445582843, 3.3604688568105647, 2.070437562580147, 0.9869183617135769, 0.6846374898406433, 0.910677939989779, 1.414363663837455, 0.8306014205463622, 2.568236282264297, 1.190932443514216, 2.867347788906105, 0.7423735806102114, 1.7854366643010797, 1.8627303963278221, 3.388955696742499, 0.7158584955917359, 2.317242518970225, 0.6516440954672044, 0.9010540686857316, 2.2822300703116216, 1.7431962430359131, 0.0, 0.658679984682944, 2.0414169255281336, 0.0, 4.713864527254183, 1.3134917162685489, 0.8689245112572138, 1.0075277659796968, 3.084046235190962, 3.4109303952625902, 0.4193824531763066, 1.6436978481755524, 1.6953031011166084, 2.305398528569495, 1.8760154530044517, 0.8721618988408527, 0.8110480990409378, 0.8516490746631774, 2.1369688425174793, 4.80611390143043, 0.9930169127430709, 1.117052060155556, 2.3145641181801078, 0.0, 1.9746736782652259, 0.8189038081705347, 2.180676762272805, 0.0)"


In [52]:
# Word2Vec
cols = feature_data.columns[4:8]
W2VfeaturizedData = feature_data

for col in cols:
    word2Vec = Word2Vec(vectorSize=5, minCount=0, inputCol=col, outputCol=col+"_w2v")
    model = word2Vec.fit(W2VfeaturizedData)
    W2VfeaturizedData = model.transform(W2VfeaturizedData)
    W2VfeaturizedData = W2VfeaturizedData.drop(col)
W2VfeaturizedData.limit(2).toPandas()

Unnamed: 0,telecommuting1,has_company_logo1,has_questions1,fraudulent1,employment_type_indexed,country_indexed,title_tokenized_filtered_w2v,company_profile_tokenized_filtered_w2v,description_tokenized_filtered_w2v,requirements_tokenized_filtered_w2v
0,0,1,0,0,1.0,0.0,"[-0.06928297132253647, -0.3702474907040596, -0.16312932595610619, -0.37926886044442654, -0.2184382677078247]","[-0.14248051975873557, 0.20141157888846986, -0.1762768022636784, -0.389535442945805, 0.37078204979758433]","[-0.07412367839632289, 0.09655342376624633, 0.33162052989272134, 0.037503075414514614, 0.021555425371930358]","[-0.07241791500347099, 0.23837789629477188, -0.26856790416474857, 0.13539069683545016, -0.1046861531333877]"
1,0,1,0,0,0.0,6.0,"[-0.5204122811555862, -0.1517421631142497, 0.02989270463585854, -0.17545882016420367, -0.426320093870163]","[-0.13137909273960088, 0.1920246264456134, -0.3858934829109593, 0.0811426564058485, 0.2116717520427253]","[-0.11637427035020664, 0.0043905874008487444, 0.3997167061655394, -0.13870391027376172, 0.010491414275899539]","[0.022067648882997067, 0.35713057887624017, -0.26549652126268486, 0.20456188595853744, -0.11866273788036778]"


In [53]:
# W2Vec Dataframes typically has negative values so we will correct for that here so that we can use the Naive Bayes classifier
cols = W2VfeaturizedData.columns[6:10]
for col in cols:
    scaler = MinMaxScaler(inputCol=col, outputCol=col+"_scaledFeatures")

    # Compute summary statistics and generate MinMaxScalerModel
    scalerModel = scaler.fit(W2VfeaturizedData)

    # rescale each feature to range [min, max].
    W2VfeaturizedData = scalerModel.transform(W2VfeaturizedData)
    W2VfeaturizedData = W2VfeaturizedData.drop(col)

W2VfeaturizedData.name = 'W2VfeaturizedData' # We will need this to print later
W2VfeaturizedData.limit(2).toPandas()

Unnamed: 0,telecommuting1,has_company_logo1,has_questions1,fraudulent1,employment_type_indexed,country_indexed,title_tokenized_filtered_w2v_scaledFeatures,company_profile_tokenized_filtered_w2v_scaledFeatures,description_tokenized_filtered_w2v_scaledFeatures,requirements_tokenized_filtered_w2v_scaledFeatures
0,0,1,0,0,1.0,0.0,"[0.5938849866325313, 0.31394627216468474, 0.594669685839558, 0.24351221495283287, 0.41688990325425856]","[0.3949858217777207, 0.3925060384878866, 0.34471420262940267, 0.27259337839238384, 0.6389375600462016]","[0.42664604919089694, 0.7414088181486949, 0.8237246426849907, 0.4828342812984385, 0.34483377183585395]","[0.3832848011107217, 0.5665807784903685, 0.3191296196720339, 0.46690713155779584, 0.6301159056496086]"
1,0,1,0,0,0.0,6.0,"[0.47400191609691034, 0.3964122989606692, 0.6435826036577118, 0.302910470271507, 0.3471509405930398]","[0.39934470980149117, 0.38853941088296007, 0.1967346984048519, 0.5194790452139721, 0.5453165554492275]","[0.4064482249423874, 0.715136710383798, 0.8521189722909471, 0.41835198845595584, 0.341916719094687]","[0.4069349089677808, 0.6179740657221018, 0.32043532968433974, 0.4889262604410459, 0.6250656377405553]"


**Last check**, id the column has more than 32 distict values we will have to drop it to use the decision tree and its family

In [55]:
for col in W2VfeaturizedData.columns:
    distinct_count = W2VfeaturizedData.select(col).distinct().count()
    print(col, distinct_count)

telecommuting1 3
has_company_logo1 3
has_questions1 3
fraudulent1 2
employment_type_indexed 4
country_indexed 91
title_tokenized_filtered_w2v_scaledFeatures 9747
company_profile_tokenized_filtered_w2v_scaledFeatures 1623
description_tokenized_filtered_w2v_scaledFeatures 13548
requirements_tokenized_filtered_w2v_scaledFeatures 10987


We see that the country column has 91 distict values so we will not use it, but if we have time, we will change handeling this column and can cast the less repeated value to one value.

In [58]:
W2VfeaturizedData.drop("country_indexed")

DataFrame[telecommuting1: int, has_company_logo1: int, has_questions1: int, fraudulent1: int, employment_type_indexed: double, title_tokenized_filtered_w2v_scaledFeatures: vector, company_profile_tokenized_filtered_w2v_scaledFeatures: vector, description_tokenized_filtered_w2v_scaledFeatures: vector, requirements_tokenized_filtered_w2v_scaledFeatures: vector]

In [59]:
TFIDFfeaturizedData.drop("country_indexed")

DataFrame[telecommuting1: int, has_company_logo1: int, has_questions1: int, fraudulent1: int, employment_type_indexed: double, title_tokenized_filtered_rawfeatures_tfidf: vector, company_profile_tokenized_filtered_rawfeatures_tfidf: vector, description_tokenized_filtered_rawfeatures_tfidf: vector, requirements_tokenized_filtered_rawfeatures_tfidf: vector]

In [60]:
HTFfeaturizedData.drop("country_indexed")

DataFrame[telecommuting1: int, has_company_logo1: int, has_questions1: int, fraudulent1: int, employment_type_indexed: double, title_tokenized_filtered_rawfeatures: vector, company_profile_tokenized_filtered_rawfeatures: vector, description_tokenized_filtered_rawfeatures: vector, requirements_tokenized_filtered_rawfeatures: vector]

Now we have 3 datasets each with a different type of vectorizer (TF, TFIDF, WordToVec), it is time to train our model and see which is the best one to use.
<br> For that, we need to combine our features into one column called features and our target label as a column named label for each of the 3 datasets.

In [61]:
#for tf dataset
features_list = HTFfeaturizedData.columns
features_list.remove('fraudulent1')

# Create your vector assembler object
assembler = VectorAssembler(inputCols=features_list,outputCol='features',handleInvalid="skip")
output_tf = assembler.transform(HTFfeaturizedData).select('features','fraudulent1')

output_tf = output_tf.withColumnRenamed('fraudulent1','label')
output_tf.name = 'tf'
output_tf.limit(3).toPandas()

Unnamed: 0,features,label
0,"(0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 2.0, 3.0, 1.0, 5.0, 1.0, 1.0, 2.0, 2.0, 2.0, 1.0, 5.0, 0.0, 5.0, 4.0, 0.0, 4.0, 0.0, 3.0, 3.0, 1.0, 2.0, 1.0, 0.0, 0.0, 3.0, 1.0, 0.0, 1.0, 3.0, 0.0, 3.0, 1.0, 0.0, 4.0, 0.0, 4.0, 2.0, 2.0, 1.0, 1.0, 3.0, 1.0, 5.0, 1.0, 1.0, ...)",0
1,"[0.0, 1.0, 0.0, 0.0, 6.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 3.0, 6.0, 2.0, 0.0, 1.0, 5.0, 3.0, 0.0, 3.0, 2.0, 1.0, 2.0, 1.0, 0.0, 3.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 7.0, 4.0, 1.0, 0.0, 3.0, 0.0, 5.0, 1.0, 2.0, 1.0, 2.0, 2.0, 1.0, 4.0, 1.0, 2.0, 10.0, 0.0, 2.0, 0.0, 5.0, ...]",0
2,"(0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 3.0, 2.0, 4.0, 5.0, 1.0, 2.0, 0.0, 3.0, 4.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 7.0, 4.0, 0.0, 0.0, 0.0, 2.0, 5.0, 3.0, 1.0, 2.0, 1.0, 2.0, 3.0, 0.0, 1.0, 1.0, 2.0, 0.0, 1.0, 0.0, 2.0, 3.0, 2.0, 3.0, 1.0, 1.0, ...)",0


In [62]:
#for tfidf dataset
features_list = TFIDFfeaturizedData.columns
features_list.remove('fraudulent1')

# Create your vector assembler object
assembler = VectorAssembler(inputCols=features_list,outputCol='features',handleInvalid="skip")
output_tfidf = assembler.transform(TFIDFfeaturizedData).select('features','fraudulent1')

output_tfidf = output_tfidf.withColumnRenamed('fraudulent1','label')
output_tfidf.name = 'tfidf'
output_tfidf.limit(3).toPandas()

Unnamed: 0,features,label
0,"(0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.596198158881319, 0.0, 0.0, 0.0, 2.7872533956440284, 0.0, 0.0, 0.0, 1.6539129241456119, 2.267133799935574, 0.6601615508551427, 2.5581571675046466, 0.7994445933226298, 0.8167673405398773, 1.24819074574283, 0.7733298560926558, 1.5041575412723518, 0.733252353808031, 2.1636942350977386, 0.0, 3.0570741783287634, 1.853088198074099, 0.0, 3.0621896460876177, 0.0, 2.151557387812281, 1.9349829314014195, 0.8700493707683976, 1.1509247592513716, 1.107968112477655, 0.0, 0.0, 1.633224445595979, 0.6917925259369071, 0.0, 0.7489578642892835, 1.8818723754626365, 0.0, 2.062339441355095, 0.7145329546372695, 0.0, 2.487151225316324, 0.0, 1.9074081577069577, 1.8754132247553965, 1.5975786564889385, 0.7550840321016392, 0.6923812807375465, 1.5544229225732566, 0.6699031899822944, 2.035878853347798, 0.762005345832527, 0.558629064439829, ...)",0
1,"[0.0, 1.0, 0.0, 0.0, 6.0, 0.0, 0.0, 1.5349006274688988, 2.442612291269019, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.254095276753278, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.4590173071587516, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.136125097838184, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.7557112666451913, 1.980484652565428, 3.069788601005576, 1.5988891866452597, 0.0, 0.624095372871415, 1.9333246402316395, 2.256236311908528, 0.0, 1.2982165410586433, 1.8166727225638832, 0.6114148356657527, 0.9265440990370495, 0.789530606564978, 0.0, 2.0395709276054865, 0.0, 0.0, 0.0, 0.5754623796256858, 1.107968112477655, 0.0, 0.7079314783437833, 3.810857039723951, 2.7671701037476284, 0.7580983523945618, 0.0, 1.8818723754626365, 0.0, 3.4372324022584917, 0.7145329546372695, 1.2598860402819676, 0.621787806329081, 2.050887109242624, 0.9537040788534789, 0.9377066123776983, 3.195157312977877, 0.7550840321016392, 1.384762561475093, 5.181409741910855, 0.0, 0.8143515413391192, 0.0, 2.7931453221991447, ...]",0
2,"(0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.230256423071258, 2.254095276753278, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.608126729746593, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.398190005919665, 0.0, 0.0, 0.0, 0.0, 0.8269564620728059, 0.0, 0.6601615508551427, 1.534894300502788, 1.5988891866452597, 3.2670693621595093, 3.1204768643570753, 0.3866649280463279, 1.5041575412723518, 0.0, 1.2982165410586433, 3.6333454451277665, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.2899886209342797, 6.090345595378784, 2.301849518502743, 0.0, 0.0, 0.0, 1.088816297063986, 3.4589626296845353, 2.2742950571836853, 0.7489578642892835, 1.2545815836417578, 1.1319916646269126, 1.3748929609033966, 2.1435988639118086, 0.0, 0.621787806329081, 1.025443554621312, 0.9537040788534789, 0.0, 0.7987893282444692, 0.0, 1.384762561475093, 1.5544229225732566, 1.3398063799645887, 1.221527312008679, 0.762005345832527, 0.558629064439829, ...)",0


In [63]:
#for w2v
features_list = W2VfeaturizedData.columns
features_list.remove('fraudulent1')

# Create your vector assembler object
assembler = VectorAssembler(inputCols=features_list,outputCol='features',handleInvalid="skip")
output_w2v = assembler.transform(W2VfeaturizedData).select('features','fraudulent1')

output_w2v = output_w2v.withColumnRenamed('fraudulent1','label')
output_w2v.name = 'w2v'
output_w2v.limit(3).toPandas()

Unnamed: 0,features,label
0,"[0.0, 1.0, 0.0, 1.0, 0.0, 0.5938849866325313, 0.31394627216468474, 0.594669685839558, 0.24351221495283287, 0.41688990325425856, 0.3949858217777207, 0.3925060384878866, 0.34471420262940267, 0.27259337839238384, 0.6389375600462016, 0.42664604919089694, 0.7414088181486949, 0.8237246426849907, 0.4828342812984385, 0.34483377183585395, 0.3832848011107217, 0.5665807784903685, 0.3191296196720339, 0.46690713155779584, 0.6301159056496086]",0
1,"[0.0, 1.0, 0.0, 0.0, 6.0, 0.47400191609691034, 0.3964122989606692, 0.6435826036577118, 0.302910470271507, 0.3471509405930398, 0.39934470980149117, 0.38853941088296007, 0.1967346984048519, 0.5194790452139721, 0.5453165554492275, 0.4064482249423874, 0.715136710383798, 0.8521189722909471, 0.41835198845595584, 0.341916719094687, 0.4069349089677808, 0.6179740657221018, 0.32043532968433974, 0.4889262604410459, 0.6250656377405553]",0
2,"[0.0, 1.0, 0.0, 1.0, 0.0, 0.5667715002251741, 0.43237919593240515, 0.6067006880714229, 0.3197604987262238, 0.4949714104866441, 0.25973427111289293, 0.7272436866294264, 0.714207900664917, 0.4087479597348681, 0.11788445304304353, 0.5149665676111005, 0.7540698089556455, 0.673108253446897, 0.4509247071689377, 0.2025246116421931, 0.3993938540149158, 0.5589437107398724, 0.4375731020558867, 0.4901143212943974, 0.5117774668994118]",0


## Training 

After preparing our data for the model, we will try different models and parameters to find the best one for classification of our label, on the 3 datasets which were encoded by different ways.

In [64]:
def ClassTrainEval(classifier,features,classes,train,test):

    def FindMtype(classifier):
        # Intstantiate Model
        M = classifier
        # Learn what it is
        Mtype = type(M).__name__
        
        return Mtype
    
    Mtype = FindMtype(classifier)
    

    def IntanceFitModel(Mtype,classifier,classes,features,train):
        
        if Mtype == "OneVsRest":
            # instantiate the base classifier.
            lr = LogisticRegression()
            # instantiate the One Vs Rest Classifier.
            OVRclassifier = OneVsRest(classifier=lr)
#             fitModel = OVRclassifier.fit(train)
            # Add parameters of your choice here:
            paramGrid = ParamGridBuilder() \
                .addGrid(lr.regParam, [0.1, 0.01]) \
                .build()
            #Cross Validator requires the following parameters:
            crossval = CrossValidator(estimator=OVRclassifier,
                                      estimatorParamMaps=paramGrid,
                                      evaluator=MulticlassClassificationEvaluator(),
                                      numFolds=2) # 3 is best practice
            # Run cross-validation, and choose the best set of parameters.
            fitModel = crossval.fit(train)
            return fitModel
        if Mtype == "MultilayerPerceptronClassifier":
            # specify layers for the neural network:
            # input layer of size features, two intermediate of features+1 and same size as features
            # and output of size number of classes
            # Note: crossvalidator cannot be used here
            features_count = len(features[0][0])
            layers = [features_count, features_count+1, features_count, classes]
            MPC_classifier = MultilayerPerceptronClassifier(maxIter=100, layers=layers, blockSize=128, seed=1234)
            fitModel = MPC_classifier.fit(train)
            return fitModel
        if Mtype in("LinearSVC","GBTClassifier") and classes != 2: # These classifiers currently only accept binary classification
            print(Mtype," could not be used because PySpark currently only accepts binary classification data for this algorithm")
            return
        if Mtype in("LogisticRegression","NaiveBayes","RandomForestClassifier","GBTClassifier","LinearSVC","DecisionTreeClassifier"):
  
            # Add parameters of your choice here:
            if Mtype in("LogisticRegression"):
                paramGrid = (ParamGridBuilder() \
#                              .addGrid(classifier.regParam, [0.1, 0.01]) \
                             .addGrid(classifier.maxIter, [10, 15,20])
                             .build())
                
            # Add parameters of your choice here:
            if Mtype in("NaiveBayes"):
                paramGrid = (ParamGridBuilder() \
                             .addGrid(classifier.smoothing, [0.0, 0.2, 0.4, 0.6]) \
                             .build())
                
            # Add parameters of your choice here:
            if Mtype in("RandomForestClassifier"):
                paramGrid = (ParamGridBuilder() \
                               .addGrid(classifier.maxDepth, [2, 5, 10])
#                                .addGrid(classifier.maxBins, [5, 10, 20])
#                                .addGrid(classifier.numTrees, [5, 20, 50])
                             .build())
                
            # Add parameters of your choice here:
            if Mtype in("GBTClassifier"):
                paramGrid = (ParamGridBuilder() \
#                              .addGrid(classifier.maxDepth, [2, 5, 10, 20, 30]) \
#                              .addGrid(classifier.maxBins, [10, 20, 40, 80, 100]) \
                             .addGrid(classifier.maxIter, [10, 15,50,100])
                             .build())
                
            # Add parameters of your choice here:
            if Mtype in("LinearSVC"):
                paramGrid = (ParamGridBuilder() \
                             .addGrid(classifier.maxIter, [10, 15]) \
                             .addGrid(classifier.regParam, [0.1, 0.01]) \
                             .build())
            
            # Add parameters of your choice here:
            if Mtype in("DecisionTreeClassifier"):
                paramGrid = (ParamGridBuilder() \
#                              .addGrid(classifier.maxDepth, [2, 5, 10, 20, 30]) \
                             .addGrid(classifier.maxBins, [99, 100]) \
                             .build())
            
            #Cross Validator requires all of the following parameters:
            crossval = CrossValidator(estimator=classifier,
                                      estimatorParamMaps=paramGrid,
                                      evaluator=MulticlassClassificationEvaluator(),
                                      numFolds=2) # 3 + is best practice
            # Fit Model: Run cross-validation, and choose the best set of parameters.
            fitModel = crossval.fit(train)
            return fitModel
    
    fitModel = IntanceFitModel(Mtype,classifier,classes,features,train)
    
    # Print feature selection metrics
    if fitModel is not None:
        
        if Mtype in("OneVsRest"):
            # Get Best Model
            BestModel = fitModel.bestModel
            print(" ")
            print('\033[1m' + Mtype + '\033[0m')
            # Extract list of binary models
            models = BestModel.models
            for model in models:
                print('\033[1m' + 'Intercept: '+ '\033[0m',model.intercept,'\033[1m' + '\nCoefficients:'+ '\033[0m',model.coefficients)

        if Mtype == "MultilayerPerceptronClassifier":
            print("")
            print('\033[1m' + Mtype," Weights"+ '\033[0m')
            print('\033[1m' + "Model Weights: "+ '\033[0m',fitModel.weights.size)
            print("")

        if Mtype in("DecisionTreeClassifier", "GBTClassifier","RandomForestClassifier"):
            # FEATURE IMPORTANCES
            # Estimate of the importance of each feature.
            # Each feature’s importance is the average of its importance across all trees 
            # in the ensemble The importance vector is normalized to sum to 1. 
            # Get Best Model
            BestModel = fitModel.bestModel
            print(" ")
            print('\033[1m' + Mtype," Feature Importances"+ '\033[0m')
            print("(Scores add up to 1)")
            print("Lowest score is the least important")
            print(" ")
            print(BestModel.featureImportances)
            
            if Mtype in("DecisionTreeClassifier"):
                global DT_featureimportances
                DT_featureimportances = BestModel.featureImportances.toArray()
                global DT_BestModel
                DT_BestModel = BestModel
            if Mtype in("GBTClassifier"):
                global GBT_featureimportances
                GBT_featureimportances = BestModel.featureImportances.toArray()
                global GBT_BestModel
                GBT_BestModel = BestModel
            if Mtype in("RandomForestClassifier"):
                global RF_featureimportances
                RF_featureimportances = BestModel.featureImportances.toArray()
                global RF_BestModel
                RF_BestModel = BestModel

        if Mtype in("LogisticRegression"):
            # Get Best Model
            BestModel = fitModel.bestModel
            print(" ")
            print('\033[1m' + Mtype," Coefficient Matrix"+ '\033[0m')
            print("You should compares these relative to eachother")
            print("Coefficients: \n" + str(BestModel.coefficientMatrix))
            print("Intercept: " + str(BestModel.interceptVector))
            global LR_coefficients
            LR_coefficients = BestModel.coefficientMatrix.toArray()
            global LR_BestModel
            LR_BestModel = BestModel

        if Mtype in("LinearSVC"):
            # Get Best Model
            BestModel = fitModel.bestModel
            print(" ")
            print('\033[1m' + Mtype," Coefficients"+ '\033[0m')
            print("You should compares these relative to eachother")
            print("Coefficients: \n" + str(BestModel.coefficients))
            global LSVC_coefficients
            LSVC_coefficients = BestModel.coefficients.toArray()
            global LSVC_BestModel
            LSVC_BestModel = BestModel
        
   
    # Set the column names to match the external results dataframe that we will join with later:
    columns = ['Classifier', 'Result']
    
    if Mtype in("LinearSVC","GBTClassifier") and classes != 2:
        Mtype = [Mtype] # make this a list
        score = ["N/A"]
        result = spark.createDataFrame(zip(Mtype,score), schema=columns)
    else:
        predictions = fitModel.transform(test)
        MC_evaluator = MulticlassClassificationEvaluator(metricName="accuracy") # redictionCol="prediction",
        accuracy = (MC_evaluator.evaluate(predictions))*100
        Mtype = [Mtype] # make this a string
        score = [str(accuracy)] #make this a string and convert to a list
        result = spark.createDataFrame(zip(Mtype,score), schema=columns)
        result = result.withColumn('Result',result.Result.substr(0, 5))
        
    return result
    #Also returns the fit model important scores or p values

In [71]:
classifiers = [
                LogisticRegression()
                ,OneVsRest()
               ,LinearSVC()
               ,NaiveBayes()
               ,MultilayerPerceptronClassifier()
              ] 

featureDF_list = [output_tf,output_tfidf,output_w2v]

In [72]:
for featureDF in featureDF_list:
    print('\033[1m' + featureDF.name," Results:"+ '\033[0m')
    train, test = featureDF.randomSplit([0.7, 0.3],seed = 11)
    features = featureDF.select(['features']).collect()
    # Learn how many classes there are in order to specify evaluation type based on binary or multi and turn the df into an object
    class_count = featureDF.select(countDistinct("label")).collect()
    classes = class_count[0][0]

    #set up your results table
    columns = ['Classifier', 'Result']
    vals = [("Place Holder","N/A")]
    results = spark.createDataFrame(vals, columns)

    for classifier in classifiers:
        new_result = ClassTrainEval(classifier,features,classes,train,test)
        results = results.union(new_result)
    results = results.where("Classifier!='Place Holder'")
    print(results.show(truncate=False))

[1mtf  Results:[0m
 
[1mLogisticRegression  Coefficient Matrix[0m
You should compares these relative to eachother
Coefficients: 
DenseMatrix([[ 1.09894541e-01, -2.45507097e+00, -6.10700759e-01,
              -1.38285757e-02, -1.27960008e-02,  4.95890121e-01,
               5.04311905e-01,  3.03284933e-01,  2.05625123e-02,
              -3.09914050e-01,  2.27942795e-02, -6.75220216e-01,
               4.76626693e-01, -2.52758553e-01, -2.30759433e-01,
              -4.49558788e-01, -8.09215981e-02,  6.41269127e-01,
               8.61154494e-01, -2.99556239e-01,  5.25510280e-01,
               1.20917135e-01, -2.64486306e-01,  9.22015315e-01,
              -2.52950270e-02, -2.94457935e-01, -1.16246280e+00,
               7.63184468e-02,  3.72905958e-02,  4.84385345e-01,
              -2.98578181e-02, -1.32286832e-01,  2.32450833e-01,
               6.26355655e-01, -3.01285945e-01,  5.14480397e-01,
              -8.41828190e-01,  3.73322618e-01, -4.40794759e-01,
              -2.52205

We see that Multilayer Perceptron Classifier with tfidf vectorization has the highest accuracy which is 97%.<br>
So, we will use that to do prediction on the test data.

In [77]:
output_tfidf

train, test = featureDF.randomSplit([0.7, 0.3],seed = 11)
features = featureDF.select(['features']).collect()
features_count = len(features[0][0])
layers = [features_count, features_count+1, features_count, classes]
classifier = MultilayerPerceptronClassifier(maxIter=200, layers=layers, blockSize=128, seed=1234)
model = classifier.fit(output_tfidf)

In [79]:
predictions = model.transform(test)
predictions.show()

+--------------------+-----+--------------------+--------------------+----------+
|            features|label|       rawPrediction|         probability|prediction|
+--------------------+-----+--------------------+--------------------+----------+
|(205,[0,1,2,3,16,...|    0|[32.4394991019324...|[1.0,1.3552636765...|       0.0|
|(205,[0,1,2,3,33,...|    0|[34.8121172465717...|[1.0,9.0784053169...|       0.0|
|(205,[0,1,2,8,42,...|    0|[35.6300149882356...|[1.0,1.9750962562...|       0.0|
|(205,[0,1,2,9,13,...|    0|[30.1074214508040...|[1.0,1.3770302060...|       0.0|
|(205,[0,1,2,14,21...|    0|[17.6072621780993...|[0.99999999999999...|       0.0|
|(205,[0,1,2,15,52...|    0|[28.2125437109945...|[1.0,5.6230031911...|       0.0|
|(205,[0,1,2,44,50...|    0|[36.0735556254588...|[1.0,6.9261187227...|       0.0|
|(205,[0,1,3,8,20,...|    0|[33.1071873497003...|[1.0,3.0648619332...|       0.0|
|(205,[0,1,3,16,46...|    0|[29.1333301296530...|[1.0,8.3664844045...|       0.0|
|(205,[0,1,7,8,2