# NLP Basics: Implementing a pipeline to clean text

### Pre-processing text data

Cleaning the text data highlights the attributes you want your machine learning system to pick up on. Cleaning (or pre-processing) typically consists of the following steps:
1. **Remove punctuation**
2. **Tokenization**
3. **Remove stopwords**
4. Lemmatize/Stem

The first three steps are covered in this chapter because they appear in almost every text-cleaning pipeline. Lemmatizing and stemming are helpful but not critical; they are covered in the next chapter, and only a basic lemmatizer is applied here.

In [24]:
import pandas as pd
pd.set_option('display.max_colwidth', 100)

# Job postings to classify, and the client's labor-category descriptions
data = pd.read_csv("total_jobs.csv")
client = pd.read_csv("Client_Descriptions.csv")

client.head()

Unnamed: 0,Labor Category,Description
0,IT Project Manager I,Manages projects and development teams executing in a range of methodologies including waterfall...
1,IT Project Manager III,Manages projects and development teams executing in a range of methodologies including waterfall...
2,Senior Computer Security Systems Specialist,"Analyzes and defines security requirements for Multilevel Security (MLS) issues. Designs, develo..."
3,Senior Security Analyst,"Analyzes security measures for more than one IT functional area (e.g., data, systems, network an..."
4,Cloud Engineer,"Experience with cloud services - including open source technology, software development, system ..."



### Remove punctuation

In [8]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [9]:
def remove_punct(text):
    # Drop every character that appears in string.punctuation
    return "".join(char for char in text if char not in string.punctuation)

data['body_text_clean'] = data['summary'].apply(remove_punct)

data.head()


Unnamed: 0.1,Unnamed: 0,title,company,summary,location,salary,Experience,posted_date,body_text_clean
0,0,Software Developer (No Prior Experience Required),Revature,College degree (Associates or Bachelors).\nMust be authorized to work in the US.\nStrong desire ...,"Washington, DC",58387.0,0,2018-12-12,College degree Associates or Bachelors\nMust be authorized to work in the US\nStrong desire to l...
1,1,SQL Application Developer,ServiceSource Inc.,Assist with the management and coordination between the tracking system vendor and software and ...,"Capitol Heights, MD",162720.0,0,2019-08-10,Assist with the management and coordination between the tracking system vendor and software and ...
2,2,GIS Developer,Leidos,Train other team members on GIS and other GIS technologies and provide guidance as needed • Proa...,"Washington, DC 20090 (South West area)+1 location",197394.0,0,2020-04-23,Train other team members on GIS and other GIS technologies and provide guidance as needed • Proa...
3,3,Jr Software Developer,KSquare Inc.,BS/ MS in IT or equivalent is required with strong academic background.\nRecent graduates are en...,"Greenbelt, MD 20770",256733.0,0,2017-08-30,BS MS in IT or equivalent is required with strong academic background\nRecent graduates are enco...
4,4,HP059 Junior Software Developer,"ADNET Systems, Inc.",Experience with using a Git workflow for software development.\nWe are seeking an early-career s...,"Greenbelt, MD 20771",100701.0,2,2020-07-30,Experience with using a Git workflow for software development\nWe are seeking an earlycareer sof...
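
The output above shows a side effect of deleting punctuation outright: hyphenated words are fused ("early-career" becomes "earlycareer"). If that is unwanted, one option is to replace punctuation with a space instead. The following is a minimal sketch, not part of the original pipeline; the name `remove_punct_spaced` is hypothetical.

```python
import string

# Translation table that maps every ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def remove_punct_spaced(text):
    """Replace punctuation with spaces, then collapse runs of whitespace."""
    return " ".join(text.translate(_PUNCT_TO_SPACE).split())

print(remove_punct_spaced("We are seeking an early-career software developer."))
# We are seeking an early career software developer
```

Note that collapsing whitespace also turns the `\n` line breaks in the summaries into single spaces, which is harmless for the tokenizer used next.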


### Tokenization

In [17]:
import re

def tokenize(text):
    # Split on runs of non-word characters; the raw string avoids an invalid-escape warning
    return re.split(r'\W+', text)

data['body_text_tokenized'] = data['body_text_clean'].apply(lambda x: tokenize(x.lower()))

data.head(100)

Unnamed: 0.1,Unnamed: 0,title,company,summary,location,salary,Experience,posted_date,body_text_clean,body_text_tokenized,body_text_nostop,body_text_lemmatized
0,0,Software Developer (No Prior Experience Required),Revature,College degree (Associates or Bachelors).\nMust be authorized to work in the US.\nStrong desire ...,"Washington, DC",58387.0,0,2018-12-12,College degree Associates or Bachelors\nMust be authorized to work in the US\nStrong desire to l...,"[college, degree, associates, or, bachelors, must, be, authorized, to, work, in, the, us, strong...","[college, degree, associates, bachelors, must, authorized, work, us, strong, desire, learn, code...","[college, degree, associate, bachelor, must, authorized, work, u, strong, desire, learn, code, p..."
1,1,SQL Application Developer,ServiceSource Inc.,Assist with the management and coordination between the tracking system vendor and software and ...,"Capitol Heights, MD",162720.0,0,2019-08-10,Assist with the management and coordination between the tracking system vendor and software and ...,"[assist, with, the, management, and, coordination, between, the, tracking, system, vendor, and, ...","[assist, management, coordination, tracking, system, vendor, software, installation, mobile, tra...","[assist, management, coordination, tracking, system, vendor, software, installation, mobile, tra..."
2,2,GIS Developer,Leidos,Train other team members on GIS and other GIS technologies and provide guidance as needed • Proa...,"Washington, DC 20090 (South West area)+1 location",197394.0,0,2020-04-23,Train other team members on GIS and other GIS technologies and provide guidance as needed • Proa...,"[train, other, team, members, on, gis, and, other, gis, technologies, and, provide, guidance, as...","[train, team, members, gis, gis, technologies, provide, guidance, needed, proactively, identify,...","[train, team, member, gi, gi, technology, provide, guidance, needed, proactively, identify, issu..."
3,3,Jr Software Developer,KSquare Inc.,BS/ MS in IT or equivalent is required with strong academic background.\nRecent graduates are en...,"Greenbelt, MD 20770",256733.0,0,2017-08-30,BS MS in IT or equivalent is required with strong academic background\nRecent graduates are enco...,"[bs, ms, in, it, or, equivalent, is, required, with, strong, academic, background, recent, gradu...","[bs, ms, equivalent, required, strong, academic, background, recent, graduates, encouraged, appl...","[b, m, equivalent, required, strong, academic, background, recent, graduate, encouraged, apply, ..."
4,4,HP059 Junior Software Developer,"ADNET Systems, Inc.",Experience with using a Git workflow for software development.\nWe are seeking an early-career s...,"Greenbelt, MD 20771",100701.0,2,2020-07-30,Experience with using a Git workflow for software development\nWe are seeking an earlycareer sof...,"[experience, with, using, a, git, workflow, for, software, development, we, are, seeking, an, ea...","[experience, using, git, workflow, software, development, seeking, earlycareer, software, develo...","[experience, using, git, workflow, software, development, seeking, earlycareer, software, develo..."
...,...,...,...,...,...,...,...,...,...,...,...,...
95,95,Entry Level Software Developer,EAI Technologies,Discover the feelings of Camaraderie and Family while being technically challenged to grow!\nCom...,"Vienna, VA 22182•Temporarily Remote",70000.0,1,2019-10-28,Discover the feelings of Camaraderie and Family while being technically challenged to grow\nCome...,"[discover, the, feelings, of, camaraderie, and, family, while, being, technically, challenged, t...","[discover, feelings, camaraderie, family, technically, challenged, grow, come, see, many, employ...","[discover, feeling, camaraderie, family, technically, challenged, grow, come, see, many, employe..."
96,96,Software Engineer,Emerson Automation Solutions,We are looking for a meticulous leader and software engineer passionate about ensuring quality p...,"Washington, DC",91000.0,2,2018-07-25,We are looking for a meticulous leader and software engineer passionate about ensuring quality p...,"[we, are, looking, for, a, meticulous, leader, and, software, engineer, passionate, about, ensur...","[looking, meticulous, leader, software, engineer, passionate, ensuring, quality, production, sof...","[looking, meticulous, leader, software, engineer, passionate, ensuring, quality, production, sof..."
97,97,Front End Web Developer,StraitSys,StraitSys is looking for a Web Developer to help support one of our Federal Customers Washington...,"Washington, DC 20001 (Shaw area)",144444.0,2,2018-04-14,StraitSys is looking for a Web Developer to help support one of our Federal Customers Washington...,"[straitsys, is, looking, for, a, web, developer, to, help, support, one, of, our, federal, custo...","[straitsys, looking, web, developer, help, support, one, federal, customers, washington, dc, str...","[straitsys, looking, web, developer, help, support, one, federal, customer, washington, dc, stra..."
98,98,Entry Level Appian Developer,Technology Solutions Provider Inc.,TSPi provides the services required to enable our customers to streamline their business process...,"Reston, VA 20191 (South Lakes Dr - Soapstone Dr area)",188077.0,2,2017-08-04,TSPi provides the services required to enable our customers to streamline their business process...,"[tspi, provides, the, services, required, to, enable, our, customers, to, streamline, their, bus...","[tspi, provides, services, required, enable, customers, streamline, business, processes, take, a...","[tspi, provides, service, required, enable, customer, streamline, business, process, take, advan..."
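
One detail of `re.split` worth knowing: when the text starts or ends with a non-word character, the result contains an empty string at that edge. The sketch below compares it with `re.findall(r'\w+', ...)`, which never produces empty tokens. The function names are hypothetical and the sample string is made up for illustration.

```python
import re

def tokenize_split(text):
    # Same approach as tokenize(): split on runs of non-word characters.
    return re.split(r'\W+', text)

def tokenize_findall(text):
    # Alternative: collect runs of word characters directly.
    return re.findall(r'\w+', text)

sample = "needed • proactively identify issues "
print(tokenize_split(sample))    # ['needed', 'proactively', 'identify', 'issues', '']
print(tokenize_findall(sample))  # ['needed', 'proactively', 'identify', 'issues']
```

Either approach works for this pipeline; the difference only matters if empty tokens should not reach later steps.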


### Remove stopwords

In [11]:
import nltk

# Requires the stopwords corpus; run nltk.download('stopwords') once if it is missing
stopword = nltk.corpus.stopwords.words('english')

In [12]:
def remove_stopwords(tokenized_list):
    # Drop stopwords, then strip any character that is not a lowercase letter
    text = [re.sub(r"[^a-z]", "", word) for word in tokenized_list if word not in stopword]
    return text

data['body_text_nostop'] = data['body_text_tokenized'].apply(remove_stopwords)

data.head(15)

Unnamed: 0.1,Unnamed: 0,title,company,summary,location,salary,Experience,posted_date,body_text_clean,body_text_tokenized,body_text_nostop
0,0,Software Developer (No Prior Experience Required),Revature,College degree (Associates or Bachelors).\nMust be authorized to work in the US.\nStrong desire ...,"Washington, DC",58387.0,0,2018-12-12,College degree Associates or Bachelors\nMust be authorized to work in the US\nStrong desire to l...,"[college, degree, associates, or, bachelors, must, be, authorized, to, work, in, the, us, strong...","[college, degree, associates, bachelors, must, authorized, work, us, strong, desire, learn, code..."
1,1,SQL Application Developer,ServiceSource Inc.,Assist with the management and coordination between the tracking system vendor and software and ...,"Capitol Heights, MD",162720.0,0,2019-08-10,Assist with the management and coordination between the tracking system vendor and software and ...,"[assist, with, the, management, and, coordination, between, the, tracking, system, vendor, and, ...","[assist, management, coordination, tracking, system, vendor, software, installation, mobile, tra..."
2,2,GIS Developer,Leidos,Train other team members on GIS and other GIS technologies and provide guidance as needed • Proa...,"Washington, DC 20090 (South West area)+1 location",197394.0,0,2020-04-23,Train other team members on GIS and other GIS technologies and provide guidance as needed • Proa...,"[train, other, team, members, on, gis, and, other, gis, technologies, and, provide, guidance, as...","[train, team, members, gis, gis, technologies, provide, guidance, needed, proactively, identify,..."
3,3,Jr Software Developer,KSquare Inc.,BS/ MS in IT or equivalent is required with strong academic background.\nRecent graduates are en...,"Greenbelt, MD 20770",256733.0,0,2017-08-30,BS MS in IT or equivalent is required with strong academic background\nRecent graduates are enco...,"[bs, ms, in, it, or, equivalent, is, required, with, strong, academic, background, recent, gradu...","[bs, ms, equivalent, required, strong, academic, background, recent, graduates, encouraged, appl..."
4,4,HP059 Junior Software Developer,"ADNET Systems, Inc.",Experience with using a Git workflow for software development.\nWe are seeking an early-career s...,"Greenbelt, MD 20771",100701.0,2,2020-07-30,Experience with using a Git workflow for software development\nWe are seeking an earlycareer sof...,"[experience, with, using, a, git, workflow, for, software, development, we, are, seeking, an, ea...","[experience, using, git, workflow, software, development, seeking, earlycareer, software, develo..."
5,5,Software Developer,Creative Systems and Consulting,"To enhance our consulting practice, Creative is currently seeking self-motivated Software Develo...","McLean, VA 22102",107200.0,2,2017-11-29,To enhance our consulting practice Creative is currently seeking selfmotivated Software Develope...,"[to, enhance, our, consulting, practice, creative, is, currently, seeking, selfmotivated, softwa...","[enhance, consulting, practice, creative, currently, seeking, selfmotivated, software, developer..."
6,6,Entry Level Software Developer,Revature,College degree (Associates or Bachelors).\nMust be authorized to work in the US.\nStrong desire ...,"Washington, DC",219865.0,0,2017-04-01,College degree Associates or Bachelors\nMust be authorized to work in the US\nStrong desire to l...,"[college, degree, associates, or, bachelors, must, be, authorized, to, work, in, the, us, strong...","[college, degree, associates, bachelors, must, authorized, work, us, strong, desire, learn, code..."
7,7,Software Engineer,PMcertDC,Software Engineer- Must have ability to gain a clearance (citizenship required).\nYou must be sk...,"Washington, DC 20500 (Downtown area)",140551.0,2,2017-08-12,Software Engineer Must have ability to gain a clearance citizenship required\nYou must be skille...,"[software, engineer, must, have, ability, to, gain, a, clearance, citizenship, required, you, mu...","[software, engineer, must, ability, gain, clearance, citizenship, required, must, skilled, least..."
8,8,Front End Web Developer,National Security Agency,Must also demonstrate knowledge (through coursework or experience) in using web-authoring softwa...,"Fort Meade, MD",87198.0,0,2019-02-08,Must also demonstrate knowledge through coursework or experience in using webauthoring software ...,"[must, also, demonstrate, knowledge, through, coursework, or, experience, in, using, webauthorin...","[must, also, demonstrate, knowledge, coursework, experience, using, webauthoring, software, grap..."
9,9,Entry Level Software Developer,EAI Technologies,"Work in a fast-paced, hands-on capacity in a client-facing role, seeing first-hand the impact of...","Vienna, VA 22182•Temporarily Remote",70000.0,1,2020-01-21,Work in a fastpaced handson capacity in a clientfacing role seeing firsthand the impact of youyo...,"[work, in, a, fastpaced, handson, capacity, in, a, clientfacing, role, seeing, firsthand, the, i...","[work, fastpaced, handson, capacity, clientfacing, role, seeing, firsthand, impact, youyour, tea..."
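
Two refinements are worth considering for this step. The stopword list is a Python list, so every `word not in stopword` check scans it; a set makes the lookup constant-time. Also, stripping non-letters can leave an empty string (a token such as "2020" becomes ""). The sketch below handles both. It is an illustration only: `STOPWORDS` is a tiny hypothetical stand-in for the NLTK list so the example runs without the corpus, and `remove_stopwords_strict` is a hypothetical name.

```python
import re

# Hypothetical stand-in for set(nltk.corpus.stopwords.words('english')).
STOPWORDS = frozenset({"a", "an", "and", "in", "of", "or", "the", "to", "with"})

def remove_stopwords_strict(tokens, stopwords=STOPWORDS):
    """Drop stopwords, keep lowercase letters only, and discard empty results."""
    kept = []
    for token in tokens:
        if token in stopwords:
            continue
        letters_only = re.sub(r"[^a-z]", "", token)
        if letters_only:  # skip tokens that were digits or symbols only
            kept.append(letters_only)
    return kept

print(remove_stopwords_strict(["experience", "with", "python3", "and", "2020", "sql"]))
# ['experience', 'python', 'sql']
```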


### Lemmatize

In [13]:
import nltk

# Requires the WordNet data; run nltk.download('wordnet') once if it is missing
wn = nltk.WordNetLemmatizer()

In [14]:
def lemmatizing(tokenized_text):
    # WordNetLemmatizer treats every token as a noun unless a part of speech is given
    text = [wn.lemmatize(word) for word in tokenized_text]
    return text

data['body_text_lemmatized'] = data['body_text_nostop'].apply(lemmatizing)

data.head(10)

Unnamed: 0.1,Unnamed: 0,title,company,summary,location,salary,Experience,posted_date,body_text_clean,body_text_tokenized,body_text_nostop,body_text_lemmatized
0,0,Software Developer (No Prior Experience Required),Revature,College degree (Associates or Bachelors).\nMust be authorized to work in the US.\nStrong desire ...,"Washington, DC",58387.0,0,2018-12-12,College degree Associates or Bachelors\nMust be authorized to work in the US\nStrong desire to l...,"[college, degree, associates, or, bachelors, must, be, authorized, to, work, in, the, us, strong...","[college, degree, associates, bachelors, must, authorized, work, us, strong, desire, learn, code...","[college, degree, associate, bachelor, must, authorized, work, u, strong, desire, learn, code, p..."
1,1,SQL Application Developer,ServiceSource Inc.,Assist with the management and coordination between the tracking system vendor and software and ...,"Capitol Heights, MD",162720.0,0,2019-08-10,Assist with the management and coordination between the tracking system vendor and software and ...,"[assist, with, the, management, and, coordination, between, the, tracking, system, vendor, and, ...","[assist, management, coordination, tracking, system, vendor, software, installation, mobile, tra...","[assist, management, coordination, tracking, system, vendor, software, installation, mobile, tra..."
2,2,GIS Developer,Leidos,Train other team members on GIS and other GIS technologies and provide guidance as needed • Proa...,"Washington, DC 20090 (South West area)+1 location",197394.0,0,2020-04-23,Train other team members on GIS and other GIS technologies and provide guidance as needed • Proa...,"[train, other, team, members, on, gis, and, other, gis, technologies, and, provide, guidance, as...","[train, team, members, gis, gis, technologies, provide, guidance, needed, proactively, identify,...","[train, team, member, gi, gi, technology, provide, guidance, needed, proactively, identify, issu..."
3,3,Jr Software Developer,KSquare Inc.,BS/ MS in IT or equivalent is required with strong academic background.\nRecent graduates are en...,"Greenbelt, MD 20770",256733.0,0,2017-08-30,BS MS in IT or equivalent is required with strong academic background\nRecent graduates are enco...,"[bs, ms, in, it, or, equivalent, is, required, with, strong, academic, background, recent, gradu...","[bs, ms, equivalent, required, strong, academic, background, recent, graduates, encouraged, appl...","[b, m, equivalent, required, strong, academic, background, recent, graduate, encouraged, apply, ..."
4,4,HP059 Junior Software Developer,"ADNET Systems, Inc.",Experience with using a Git workflow for software development.\nWe are seeking an early-career s...,"Greenbelt, MD 20771",100701.0,2,2020-07-30,Experience with using a Git workflow for software development\nWe are seeking an earlycareer sof...,"[experience, with, using, a, git, workflow, for, software, development, we, are, seeking, an, ea...","[experience, using, git, workflow, software, development, seeking, earlycareer, software, develo...","[experience, using, git, workflow, software, development, seeking, earlycareer, software, develo..."
5,5,Software Developer,Creative Systems and Consulting,"To enhance our consulting practice, Creative is currently seeking self-motivated Software Develo...","McLean, VA 22102",107200.0,2,2017-11-29,To enhance our consulting practice Creative is currently seeking selfmotivated Software Develope...,"[to, enhance, our, consulting, practice, creative, is, currently, seeking, selfmotivated, softwa...","[enhance, consulting, practice, creative, currently, seeking, selfmotivated, software, developer...","[enhance, consulting, practice, creative, currently, seeking, selfmotivated, software, developer..."
6,6,Entry Level Software Developer,Revature,College degree (Associates or Bachelors).\nMust be authorized to work in the US.\nStrong desire ...,"Washington, DC",219865.0,0,2017-04-01,College degree Associates or Bachelors\nMust be authorized to work in the US\nStrong desire to l...,"[college, degree, associates, or, bachelors, must, be, authorized, to, work, in, the, us, strong...","[college, degree, associates, bachelors, must, authorized, work, us, strong, desire, learn, code...","[college, degree, associate, bachelor, must, authorized, work, u, strong, desire, learn, code, p..."
7,7,Software Engineer,PMcertDC,Software Engineer- Must have ability to gain a clearance (citizenship required).\nYou must be sk...,"Washington, DC 20500 (Downtown area)",140551.0,2,2017-08-12,Software Engineer Must have ability to gain a clearance citizenship required\nYou must be skille...,"[software, engineer, must, have, ability, to, gain, a, clearance, citizenship, required, you, mu...","[software, engineer, must, ability, gain, clearance, citizenship, required, must, skilled, least...","[software, engineer, must, ability, gain, clearance, citizenship, required, must, skilled, least..."
8,8,Front End Web Developer,National Security Agency,Must also demonstrate knowledge (through coursework or experience) in using web-authoring softwa...,"Fort Meade, MD",87198.0,0,2019-02-08,Must also demonstrate knowledge through coursework or experience in using webauthoring software ...,"[must, also, demonstrate, knowledge, through, coursework, or, experience, in, using, webauthorin...","[must, also, demonstrate, knowledge, coursework, experience, using, webauthoring, software, grap...","[must, also, demonstrate, knowledge, coursework, experience, using, webauthoring, software, grap..."
9,9,Entry Level Software Developer,EAI Technologies,"Work in a fast-paced, hands-on capacity in a client-facing role, seeing first-hand the impact of...","Vienna, VA 22182•Temporarily Remote",70000.0,1,2020-01-21,Work in a fastpaced handson capacity in a clientfacing role seeing firsthand the impact of youyo...,"[work, in, a, fastpaced, handson, capacity, in, a, clientfacing, role, seeing, firsthand, the, i...","[work, fastpaced, handson, capacity, clientfacing, role, seeing, firsthand, impact, youyour, tea...","[work, fastpaced, handson, capacity, clientfacing, role, seeing, firsthand, impact, youyour, tea..."
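
The lemmatized column shows that short acronyms get truncated: "us" becomes "u", "gis" becomes "gi", and "bs"/"ms" become "b"/"m". If those tokens matter for matching, one option is to skip the lemmatizer for very short or explicitly protected tokens. The following is a minimal sketch under that assumption. `lemmatize_with_guard` is a hypothetical helper, and `naive_lemmatize` is only a toy stand-in so the example runs without WordNet; in the notebook you would pass `wn.lemmatize` instead.

```python
def lemmatize_with_guard(tokens, lemmatize, protected=frozenset(), min_len=3):
    """Lemmatize tokens, leaving protected or very short tokens untouched."""
    return [
        tok if tok in protected or len(tok) < min_len else lemmatize(tok)
        for tok in tokens
    ]

def naive_lemmatize(word):
    # Toy stand-in for wn.lemmatize: strip one trailing "s".
    return word[:-1] if word.endswith("s") else word

tokens = ["gis", "members", "us", "bs", "graduates"]
print(lemmatize_with_guard(tokens, naive_lemmatize, protected={"gis"}))
# ['gis', 'member', 'us', 'bs', 'graduate']
```

The protected set and the length threshold are judgment calls that depend on the vocabulary of the job postings.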


In [20]:
def clean_text(text):
    # 1. Lowercase and drop punctuation characters
    text = "".join([char.lower() for char in text if char not in string.punctuation])
    # 2. Split on runs of non-word characters
    tokens = re.split(r'\W+', text)
    # 3. Drop stopwords, keep lowercase letters only, then lemmatize
    text = [wn.lemmatize(re.sub(r"[^a-z]", "", word)) for word in tokens if word not in stopword]
    return text

In [49]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Use clean_text as the analyzer so the vectorizer applies the full cleaning pipeline
tfidf_vect = TfidfVectorizer(analyzer=clean_text)
# Fit on the client descriptions only; fit() returns the fitted vectorizer itself
client_vectorizer = tfidf_vect.fit(client['Description'])

In [44]:
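# Vocabulary learned from the client descriptions.
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
dictionary = list(tfidf_vect.get_feature_names_out())
dictionary.remove('')  # clean_text can emit an empty token, which becomes a feature
len(dictionary)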


234

In [54]:
# Project both corpora onto the vocabulary learned from the client descriptions
client_tfidf = client_vectorizer.transform(client['Description'])
data_tfidf = client_vectorizer.transform(data['summary'])
print(client_tfidf.shape)
print(data_tfidf.shape)


(10, 235)
(45093, 235)
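
Both matrices have 235 columns because the vocabulary comes from the client descriptions alone; any word in a job summary that is not in that vocabulary is ignored. To make the weighting concrete, here is a small pure-Python sketch of TF-IDF using the formula scikit-learn applies by default (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization). The helper names and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_fit(docs):
    """Learn smoothed IDF weights from a list of tokenized documents."""
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + df)) + 1 for term, df in doc_freq.items()}

def tfidf_transform(doc, idf):
    """Weight one tokenized document; terms outside the fitted vocabulary are dropped."""
    counts = Counter(term for term in doc if term in idf)
    weights = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()} if norm else {}

client_docs = [["cloud", "engineer"], ["security", "analyst"]]
idf = tfidf_fit(client_docs)

job = ["cloud", "developer", "cloud"]  # "developer" is not in the client vocabulary
print({term: round(w, 3) for term, w in tfidf_transform(job, idf).items()})
# {'cloud': 1.0}
```

A summary that shares no words with the client vocabulary ends up as an all-zero vector, which matters for the similarity step.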


In [96]:
from sklearn.metrics.pairwise import cosine_similarity
# Rows are job postings, columns are client descriptions; keep the result sparse
cs = cosine_similarity(data_tfidf, client_tfidf, dense_output=False)

In [97]:
cs.shape

(45093, 10)
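
Each of the 45093 rows holds one job posting's similarity to each of the 10 client descriptions. As a reminder of what a single entry means, here is a minimal pure-Python sketch of cosine similarity on sparse vectors stored as dictionaries. The function name and the toy vectors are hypothetical; an all-zero vector is given similarity 0, which is consistent with the zero entries in the output.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as {term: weight} dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # no shared vocabulary at all
    return dot / (norm_u * norm_v)

job = {"cloud": 0.6, "python": 0.8}
client = {"cloud": 1.0}
print(round(cosine(job, client), 3))  # 0.6
print(cosine({}, client))             # 0.0
```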

In [98]:
cs_df=pd.DataFrame(cs.toarray())
cs_df.head(100)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,0.000000,0.000000,0.000000,0.000000,0.110350,0.280505,0.000000,0.000000,0.000000,0.058131
1,0.072129,0.072129,0.000000,0.032209,0.096892,0.030215,0.106144,0.118392,0.118392,0.000000
2,0.062841,0.062841,0.093335,0.004956,0.061990,0.072676,0.104557,0.038244,0.038244,0.005676
3,0.000000,0.000000,0.000000,0.000000,0.000000,0.183754,0.000000,0.000000,0.000000,0.000000
4,0.045839,0.045839,0.037024,0.005974,0.207947,0.124111,0.090552,0.265028,0.265028,0.030625
...,...,...,...,...,...,...,...,...,...,...
95,0.062067,0.062067,0.050132,0.036210,0.054464,0.033969,0.066005,0.066549,0.066549,0.041467
96,0.000000,0.000000,0.050243,0.000000,0.172121,0.000000,0.000000,0.200093,0.200093,0.331681
97,0.049400,0.049400,0.011146,0.083192,0.125131,0.007553,0.181423,0.091137,0.091137,0.033004
98,0.026006,0.026006,0.043047,0.031093,0.165228,0.080725,0.078429,0.186985,0.186985,0.044629


In [99]:
cs_df.shape

(45093, 10)

In [100]:
category=['IT Project Manager I','IT Project Manager III','Senior Computer Security Systems Specialist',
         'Senior Security Analyst','Cloud Engineer','Senior Data Scientist','User Experience (UX) Developer',
         'Software Developer I','Software Developer III','Test Automation Engineer']

In [101]:
import numpy as np
# Column position of the highest similarity in each row (ties resolve to the first column)
cs_df['cat_index'] = np.argmax(cs_df.to_numpy(), axis=1)


In [102]:
cs_df.head(20)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,cat_index
0,0.0,0.0,0.0,0.0,0.11035,0.280505,0.0,0.0,0.0,0.058131,5
1,0.072129,0.072129,0.0,0.032209,0.096892,0.030215,0.106144,0.118392,0.118392,0.0,7
2,0.062841,0.062841,0.093335,0.004956,0.06199,0.072676,0.104557,0.038244,0.038244,0.005676,6
3,0.0,0.0,0.0,0.0,0.0,0.183754,0.0,0.0,0.0,0.0,5
4,0.045839,0.045839,0.037024,0.005974,0.207947,0.124111,0.090552,0.265028,0.265028,0.030625,7
5,0.03562,0.03562,0.115573,0.078965,0.235701,0.005446,0.130817,0.152063,0.152063,0.077602,4
6,0.0,0.0,0.0,0.0,0.11035,0.280505,0.0,0.0,0.0,0.058131,5
7,0.0,0.0,0.039798,0.037556,0.199451,0.132736,0.0,0.105662,0.105662,0.03292,4
8,0.0,0.0,0.070826,0.019578,0.242238,0.219591,0.175567,0.14393,0.14393,0.0,4
9,0.104054,0.104054,0.096112,0.0,0.0,0.0,0.069996,0.126356,0.126356,0.0,7
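
Two things stand out in this table. Columns 0 and 1 carry the same values in every displayed row, as do columns 7 and 8, so `argmax` always picks the first of each pair and the second can never be assigned. A row of all zeros would also be assigned to column 0 even though nothing matched. The sketch below is one hedged way to surface both cases; `best_matches` is a hypothetical helper, and the sample row is row 1 as printed above (rounded values).

```python
def best_matches(scores, tol=1e-9):
    """Return every index tied for the top score, or [] if nothing matched."""
    top = max(scores)
    if top <= 0:
        return []
    return [i for i, score in enumerate(scores) if abs(score - top) <= tol]

row_1 = [0.072129, 0.072129, 0.0, 0.032209, 0.096892,
         0.030215, 0.106144, 0.118392, 0.118392, 0.0]
print(best_matches(row_1))       # [7, 8]
print(best_matches([0.0] * 10))  # []
```

How to resolve a tie, or whether to label unmatched postings separately, depends on how the categories are used downstream.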


In [104]:
def category_name(index):
    # Labor categories, listed in the row order of the client descriptions
    categories = ['IT Project Manager I','IT Project Manager III','Senior Computer Security Systems Specialist',
         'Senior Security Analyst','Cloud Engineer','Senior Data Scientist','User Experience (UX) Developer',
         'Software Developer I','Software Developer III','Test Automation Engineer']
    return categories[index]

cs_df['new_category'] = cs_df['cat_index'].apply(category_name)
cs_df.head(100)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,cat_index,new_category
0,0.000000,0.000000,0.000000,0.000000,0.110350,0.280505,0.000000,0.000000,0.000000,0.058131,5,Senior Data Scientist
1,0.072129,0.072129,0.000000,0.032209,0.096892,0.030215,0.106144,0.118392,0.118392,0.000000,7,Software Developer I
2,0.062841,0.062841,0.093335,0.004956,0.061990,0.072676,0.104557,0.038244,0.038244,0.005676,6,User Experience (UX) Developer
3,0.000000,0.000000,0.000000,0.000000,0.000000,0.183754,0.000000,0.000000,0.000000,0.000000,5,Senior Data Scientist
4,0.045839,0.045839,0.037024,0.005974,0.207947,0.124111,0.090552,0.265028,0.265028,0.030625,7,Software Developer I
...,...,...,...,...,...,...,...,...,...,...,...,...
95,0.062067,0.062067,0.050132,0.036210,0.054464,0.033969,0.066005,0.066549,0.066549,0.041467,7,Software Developer I
96,0.000000,0.000000,0.050243,0.000000,0.172121,0.000000,0.000000,0.200093,0.200093,0.331681,9,Test Automation Engineer
97,0.049400,0.049400,0.011146,0.083192,0.125131,0.007553,0.181423,0.091137,0.091137,0.033004,6,User Experience (UX) Developer
98,0.026006,0.026006,0.043047,0.031093,0.165228,0.080725,0.078429,0.186985,0.186985,0.044629,7,Software Developer I


In [112]:
newdata = pd.concat([data[['title', 'salary','Experience','posted_date']].reset_index(drop=True),
                     pd.DataFrame(cs_df['new_category'])], axis=1)
newdata.head(1000)

Unnamed: 0,title,salary,Experience,posted_date,new_category
0,Software Developer (No Prior Experience Required),58387.0,0,2018-12-12,Senior Data Scientist
1,SQL Application Developer,162720.0,0,2019-08-10,Software Developer I
2,GIS Developer,197394.0,0,2020-04-23,User Experience (UX) Developer
3,Jr Software Developer,256733.0,0,2017-08-30,Senior Data Scientist
4,HP059 Junior Software Developer,100701.0,2,2020-07-30,Software Developer I
...,...,...,...,...,...
995,Sr. Software Engineer - Cloud - TS/SCI Poly,154498.0,2,2019-01-21,Cloud Engineer
996,ML Platform - Senior Full Stack Software Engineer,138810.0,1,2018-02-22,Cloud Engineer
997,Java Developer - Top Secret Clearance,120000.0,0,2017-12-07,User Experience (UX) Developer
998,Sr Linux Systems Engineer,165834.0,2,2018-04-25,Cloud Engineer


In [114]:
#client_vect = pd.concat([client[['Labor Category']].reset_index(drop=True), 
           #pd.DataFrame(client_tfidf.toarray())], axis=1)
#data_vect = pd.concat([data[['title', 'salary','Experience','posted_date']].reset_index(drop=True), 
           #pd.DataFrame(data_tfidf.toarray())], axis=1)

#client_vect.head(10)


In [116]:
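# index=False keeps the row index out of the file, so re-reading it adds no "Unnamed: 0" column
newdata.to_csv('NewData.csv', index=False)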