# Junior Job Data Cleaning

This notebook cleans job-posting data before it goes into Tableau. Several columns hold free text, so the cleaning is done in Python. The end goal is to filter by job type and see what trends occur in junior-level positions.

In [1]:
import numpy as np
import pandas as pd
import string
import re
import ast
import nltk
from nltk.corpus import stopwords
from collections import Counter

In [2]:
orig_df = pd.read_csv('Jobs_Data.csv')

#want to keep original just in case, since a lot of changes will occur
df = orig_df.copy()

df.head()

Unnamed: 0,company,description,focus,jobTitle,link,location,postedDate,remote,salaryCurrency,salaryRange,type,yearsOfExperience
0,Novovu,[b]We are looking for a talented frontend web ...,Frontend Developer,"Frontend Developer (HTML, CSS, and JS) - Remote",https://lensa.com/frontend-developer-html-css-...,"Fort Washington, PA, USA","Jul 14, 2023 8:00 am",100% Remote,$,[],Full-time,0-1 year
1,Coalition Technologies,[heading2]WHY YOU SHOULD APPLY:[/heading2]\r\n...,Frontend Developer,Front End Developer,https://www.virtualvocations.com/job/remote-fr...,,"Jul 14, 2023 8:01 am",100% Remote,$,"[15,35]",Full-time,0-1 year
2,Get It Recruit - Information Technology,[heading2]Job Description[/heading2]\r\n\r\nAs...,Backend Developer,Software Developer - Remote | WFH,https://www.linkedin.com/jobs/view/software-de...,"King George, VA 22485, USA","Jul 14, 2023 8:03 am",100% In-Office,$,"[72000,182000]",Full-time,0-1 year
3,SiLo,"[b]Based in Downtown Nashville, Simple Logisti...",Fullstack Developer,Junior .NET Full Stack Developer,https://jobs.wjhl.com/jobs/junior-.net-full-st...,"Nashville, TN, USA","Jul 14, 2023 8:04 am",100% In-Office,$,[],Full-time,0-1 year
4,eDiligence,[center][size=4][b]Entry Level Full Stack Deve...,Fullstack Developer,Entry Level Software Developer,https://jooble.org/jdp/-2571818786162603564/En...,"Los Angeles, CA, USA","Jul 14, 2023 8:05 am",Hybrid Work,$,[],Full-time,0-1 year


In [3]:
df.describe()

Unnamed: 0,company,description,focus,jobTitle,link,location,postedDate,remote,salaryCurrency,salaryRange,type,yearsOfExperience
count,1426,1422,1426,1426,1426,1057,1426,1426,1424,1426,1426,1426
unique,943,1202,85,845,1244,367,895,3,1,261,3,4
top,Guidehouse,[b]Position Summary...[/b]\r\n\r\nWhat you'll ...,Data Analyst,Data Analyst,https://www.ziprecruiter.com/c/Reesby/Job/UI-U...,United States,"Aug 1, 2023 8:00 pm",Hybrid Work,$,[],Full-time,0-1 year
freq,27,15,210,48,10,54,17,595,1424,1032,1353,879


In [4]:
countCompany = Counter(df['company'])

max(countCompany.values())

27

In [5]:
def get_key_by_value(dictionary, target_value):
    for key, value in dictionary.items():
        if value == target_value:
            return key
    return None

In [6]:
get_key_by_value(countCompany, 27)

'Guidehouse'

The most common company is Guidehouse, with only 27 occurrences out of 1,426 postings, so company is not a significant feature to look at, especially once the data is broken down by job title.
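`Counter.most_common` returns the top entries together with their counts, so the separate reverse lookup is not strictly needed. A minimal sketch, using a made-up list of company names rather than the notebook's data:

```python
from collections import Counter

# Hypothetical sample standing in for df['company']
companies = ['Guidehouse', 'Novovu', 'Guidehouse', 'SiLo', 'Guidehouse']

count_company = Counter(companies)

# most_common(1) returns a list holding one (value, count) tuple
top_company, top_count = count_company.most_common(1)[0]

# Share of all rows taken by the most frequent company
share = top_count / len(companies)

print(top_company, top_count, round(share, 2))
```

In the notebook the same call would be `Counter(df['company']).most_common(1)`.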

Get the hour of the job posting.

In [7]:
#note: x[-7] keeps only the last digit of the hour, so 10, 11 and 12 o'clock show up as 0, 1 and 2
df['postingHour'] = df['postedDate'].apply(lambda x: x[-7] + ' ' + x[-2:])

In [8]:
Counter(df['postingHour'])

Counter({'8 am': 1094,
         '9 am': 149,
         '0 am': 66,
         '5 pm': 10,
         '4 am': 4,
         '1 pm': 17,
         '6 am': 7,
         '2 am': 3,
         '3 pm': 6,
         '1 am': 5,
         '7 pm': 6,
         '8 pm': 33,
         '0 pm': 3,
         '7 am': 3,
         '6 pm': 3,
         '5 am': 2,
         '2 pm': 6,
         '3 am': 3,
         '9 pm': 2,
         '4 pm': 4})
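The `'0 am'` and `'0 pm'` keys above come from two-digit hours: the slice picks up a single character, so `10:00` contributes only its `0`. One possible alternative is to parse the timestamp first. The sketch below uses only the standard library and assumes every `postedDate` follows the `"Jul 14, 2023 8:00 am"` pattern seen in `df.head()`; `posting_hour` is a hypothetical helper name.

```python
from datetime import datetime

def posting_hour(posted_date):
    """Return the posting hour as e.g. '8 am' or '10 pm'."""
    # %I is the 12-hour clock hour, %p the am/pm marker
    parsed = datetime.strptime(posted_date, '%b %d, %Y %I:%M %p')
    hour_12 = parsed.hour % 12 or 12           # map 0 -> 12 and 13..23 -> 1..11
    suffix = 'am' if parsed.hour < 12 else 'pm'
    return f'{hour_12} {suffix}'

print(posting_hour('Jul 14, 2023 8:00 am'))
print(posting_hour('Aug 1, 2023 10:00 pm'))    # made-up two-digit example

# In the notebook this could be applied as:
# df['postingHour'] = df['postedDate'].apply(posting_hour)
```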

In [9]:
df.head()

Unnamed: 0,company,description,focus,jobTitle,link,location,postedDate,remote,salaryCurrency,salaryRange,type,yearsOfExperience,postingHour
0,Novovu,[b]We are looking for a talented frontend web ...,Frontend Developer,"Frontend Developer (HTML, CSS, and JS) - Remote",https://lensa.com/frontend-developer-html-css-...,"Fort Washington, PA, USA","Jul 14, 2023 8:00 am",100% Remote,$,[],Full-time,0-1 year,8 am
1,Coalition Technologies,[heading2]WHY YOU SHOULD APPLY:[/heading2]\r\n...,Frontend Developer,Front End Developer,https://www.virtualvocations.com/job/remote-fr...,,"Jul 14, 2023 8:01 am",100% Remote,$,"[15,35]",Full-time,0-1 year,8 am
2,Get It Recruit - Information Technology,[heading2]Job Description[/heading2]\r\n\r\nAs...,Backend Developer,Software Developer - Remote | WFH,https://www.linkedin.com/jobs/view/software-de...,"King George, VA 22485, USA","Jul 14, 2023 8:03 am",100% In-Office,$,"[72000,182000]",Full-time,0-1 year,8 am
3,SiLo,"[b]Based in Downtown Nashville, Simple Logisti...",Fullstack Developer,Junior .NET Full Stack Developer,https://jobs.wjhl.com/jobs/junior-.net-full-st...,"Nashville, TN, USA","Jul 14, 2023 8:04 am",100% In-Office,$,[],Full-time,0-1 year,8 am
4,eDiligence,[center][size=4][b]Entry Level Full Stack Deve...,Fullstack Developer,Entry Level Software Developer,https://jooble.org/jdp/-2571818786162603564/En...,"Los Angeles, CA, USA","Jul 14, 2023 8:05 am",Hybrid Work,$,[],Full-time,0-1 year,8 am


The office location (remote), type, and years of experience features need no cleaning:

In [10]:
Counter(df['remote'])

Counter({'100% Remote': 333, '100% In-Office': 498, 'Hybrid Work': 595})

In [11]:
Counter(df['type'])

Counter({'Full-time': 1353, 'Part-time': 38, 'Internship': 35})

In [12]:
Counter(df['yearsOfExperience'])

Counter({'0-1 year': 879, '1-3 years': 545, '5+ years': 1, '3-5 years': 1})

Every non-null salary currency is $ (1,424 rows) and only two rows are NaN, so this feature carries no information and we can ignore it.

In [13]:
Counter(df['salaryCurrency'])

Counter({'$': 1424, nan: 2})

Next, we want to split the salary range into two features: a minimum and a maximum salary.

In [14]:
#since the original feature is a string
df['salaryRange'] = df['salaryRange'].apply(lambda x: ast.literal_eval(x) if pd.notna(x) and x != '[]' else np.nan)

In [15]:
#after the previous cell each entry is either a [min, max] list or NaN
df['minSalary'] = df['salaryRange'].apply(lambda x: x[0] if isinstance(x, list) else np.nan)
df['maxSalary'] = df['salaryRange'].apply(lambda x: x[1] if isinstance(x, list) else np.nan)
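The parse and the split could also be folded into one helper that works directly on the raw string. This is only a sketch: `parse_salary_range` is a hypothetical name, and it assumes every non-empty value is a two-element list like the ones shown in `df.head()`.

```python
import ast
import math

def parse_salary_range(raw):
    """Turn a string such as '[72000,182000]' into (min, max).

    Missing values and the empty list '[]' both give (nan, nan).
    """
    if not isinstance(raw, str):
        return (math.nan, math.nan)
    values = ast.literal_eval(raw)
    if len(values) != 2:
        return (math.nan, math.nan)
    low, high = values
    return (float(low), float(high))

print(parse_salary_range('[72000,182000]'))
print(parse_salary_range('[]'))

# Applied to the original string column, this could look like:
# mins, maxes = zip(*df['salaryRange'].map(parse_salary_range))
# df['minSalary'], df['maxSalary'] = list(mins), list(maxes)
```

The sample rows contain both `[15,35]` and `[72000,182000]`, which suggests that hourly and annual figures may be mixed in this column; that would be worth checking before aggregating salaries.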

In [16]:
df.head(7)

Unnamed: 0,company,description,focus,jobTitle,link,location,postedDate,remote,salaryCurrency,salaryRange,type,yearsOfExperience,postingHour,minSalary,maxSalary
0,Novovu,[b]We are looking for a talented frontend web ...,Frontend Developer,"Frontend Developer (HTML, CSS, and JS) - Remote",https://lensa.com/frontend-developer-html-css-...,"Fort Washington, PA, USA","Jul 14, 2023 8:00 am",100% Remote,$,,Full-time,0-1 year,8 am,,
1,Coalition Technologies,[heading2]WHY YOU SHOULD APPLY:[/heading2]\r\n...,Frontend Developer,Front End Developer,https://www.virtualvocations.com/job/remote-fr...,,"Jul 14, 2023 8:01 am",100% Remote,$,"[15, 35]",Full-time,0-1 year,8 am,15.0,35.0
2,Get It Recruit - Information Technology,[heading2]Job Description[/heading2]\r\n\r\nAs...,Backend Developer,Software Developer - Remote | WFH,https://www.linkedin.com/jobs/view/software-de...,"King George, VA 22485, USA","Jul 14, 2023 8:03 am",100% In-Office,$,"[72000, 182000]",Full-time,0-1 year,8 am,72000.0,182000.0
3,SiLo,"[b]Based in Downtown Nashville, Simple Logisti...",Fullstack Developer,Junior .NET Full Stack Developer,https://jobs.wjhl.com/jobs/junior-.net-full-st...,"Nashville, TN, USA","Jul 14, 2023 8:04 am",100% In-Office,$,,Full-time,0-1 year,8 am,,
4,eDiligence,[center][size=4][b]Entry Level Full Stack Deve...,Fullstack Developer,Entry Level Software Developer,https://jooble.org/jdp/-2571818786162603564/En...,"Los Angeles, CA, USA","Jul 14, 2023 8:05 am",Hybrid Work,$,,Full-time,0-1 year,8 am,,
5,Leidos Holding,[center][size=5]Description[/size][/center]\r\...,Fullstack Developer,Junior Full Stack Developer,https://clearedcareers.com/job/319461/junior-f...,"Colorado Springs, CO, USA","Jul 14, 2023 8:06 am",Hybrid Work,$,"[53300, 110700]",Full-time,0-1 year,8 am,53300.0,110700.0
6,BigBear.ai,[heading2]Overview[/heading2]\r\n\r\nBigBear.a...,Fullstack Developer,Jr. Full Stack Developer,https://getwork.com/details/bcdec3380ac6383c52...,"Washington, MA, USA","Jul 14, 2023 8:07 am",100% Remote,$,,Full-time,0-1 year,8 am,,


Now looking at the focus column: many rows list more than one focus.

In [17]:
Counter(df['focus'])

Counter({'Frontend Developer': 82,
         'Backend Developer': 92,
         'Fullstack Developer': 189,
         'UX Designer': 114,
         'Data Analyst': 210,
         'Data Scientist': 100,
         'IT Support': 114,
         'Penetration Tester': 50,
         'Security Analyst': 155,
         'UI Designer': 56,
         'Outbound Sales Representative': 1,
         'UI Designer , UX Designer': 24,
         'Frontend Developer , Backend Developer': 2,
         'Backend Developer , Business Analyst': 1,
         'Fullstack Developer , Scrum': 26,
         'Fullstack Developer , Project Management': 2,
         'UX Designer , UX Researcher': 5,
         'UI Designer , Scrum': 4,
         'UI Designer , Frontend Developer': 1,
         'UI Designer , UX Researcher , UX Designer , Product Manager': 1,
         'UI Designer , Strategy': 1,
         'UI Designer , UX Designer , Product Manager': 3,
         'Data Analyst , Finance , Business Analyst': 1,
         'Data Analyst , Busin

Most combinations of multiple focuses show up only once. Judging by the dictionary above, taking the first focus as the focus should balance out and not create any major bias.

So we want to put all focuses, if there are several, into a list in a new column, and then replace each list with its first focus.

In [18]:
#splitting the focus feature by commas so that if there is more than one focus,
#they are put in a new column as a list, with white space removed
df['list_of_focus'] = df['focus'].str.split(',').apply(lambda x: [item.strip() for item in x])

df.head()

Unnamed: 0,company,description,focus,jobTitle,link,location,postedDate,remote,salaryCurrency,salaryRange,type,yearsOfExperience,postingHour,minSalary,maxSalary,list_of_focus
0,Novovu,[b]We are looking for a talented frontend web ...,Frontend Developer,"Frontend Developer (HTML, CSS, and JS) - Remote",https://lensa.com/frontend-developer-html-css-...,"Fort Washington, PA, USA","Jul 14, 2023 8:00 am",100% Remote,$,,Full-time,0-1 year,8 am,,,[Frontend Developer]
1,Coalition Technologies,[heading2]WHY YOU SHOULD APPLY:[/heading2]\r\n...,Frontend Developer,Front End Developer,https://www.virtualvocations.com/job/remote-fr...,,"Jul 14, 2023 8:01 am",100% Remote,$,"[15, 35]",Full-time,0-1 year,8 am,15.0,35.0,[Frontend Developer]
2,Get It Recruit - Information Technology,[heading2]Job Description[/heading2]\r\n\r\nAs...,Backend Developer,Software Developer - Remote | WFH,https://www.linkedin.com/jobs/view/software-de...,"King George, VA 22485, USA","Jul 14, 2023 8:03 am",100% In-Office,$,"[72000, 182000]",Full-time,0-1 year,8 am,72000.0,182000.0,[Backend Developer]
3,SiLo,"[b]Based in Downtown Nashville, Simple Logisti...",Fullstack Developer,Junior .NET Full Stack Developer,https://jobs.wjhl.com/jobs/junior-.net-full-st...,"Nashville, TN, USA","Jul 14, 2023 8:04 am",100% In-Office,$,,Full-time,0-1 year,8 am,,,[Fullstack Developer]
4,eDiligence,[center][size=4][b]Entry Level Full Stack Deve...,Fullstack Developer,Entry Level Software Developer,https://jooble.org/jdp/-2571818786162603564/En...,"Los Angeles, CA, USA","Jul 14, 2023 8:05 am",Hybrid Work,$,,Full-time,0-1 year,8 am,,,[Fullstack Developer]


In [19]:
df['list_of_focus'] = df['list_of_focus'].apply(lambda x: x[0])

In [20]:
Counter(df['list_of_focus'])

Counter({'Frontend Developer': 84,
         'Backend Developer': 100,
         'Fullstack Developer': 229,
         'UX Designer': 131,
         'Data Analyst': 299,
         'Data Scientist': 148,
         'IT Support': 116,
         'Penetration Tester': 52,
         'Security Analyst': 172,
         'UI Designer': 92,
         'Outbound Sales Representative': 1,
         'Strategy': 1,
         'Finance': 1})

We have 3 one-offs. Looking at them above, Strategy and Finance both have IT Support as a second focus, so let's change them to that. Outbound Sales Representative has no second focus and is not a tech job, so let's drop that row.

In [21]:
#getting the index
df[df['list_of_focus'] == 'Strategy']['list_of_focus']

726    Strategy
Name: list_of_focus, dtype: object

In [22]:
df[df['list_of_focus'] == 'Finance']['list_of_focus']

730    Finance
Name: list_of_focus, dtype: object

In [23]:
df.at[730, 'list_of_focus'] = 'IT Support'
df.at[726, 'list_of_focus'] = 'IT Support'

In [24]:
df[df['list_of_focus'] == 'Outbound Sales Representative']['list_of_focus']

86    Outbound Sales Representative
Name: list_of_focus, dtype: object

In [25]:
df.drop(86, inplace=True)

Double check list of focuses.

In [26]:
Counter(df['list_of_focus'])

Counter({'Frontend Developer': 84,
         'Backend Developer': 100,
         'Fullstack Developer': 229,
         'UX Designer': 131,
         'Data Analyst': 299,
         'Data Scientist': 148,
         'IT Support': 118,
         'Penetration Tester': 52,
         'Security Analyst': 172,
         'UI Designer': 92})
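The manual fixes above follow a simple rule: keep the first listed focus that belongs to one of the ten main categories, and drop the row if there is none. A sketch of that rule without hard-coded row indices is shown below. `MAIN_FOCUSES` and `primary_focus` are hypothetical names, and the `'Strategy , IT Support'` string is an assumed example.

```python
# The ten categories that remain in the final count above
MAIN_FOCUSES = {
    'Frontend Developer', 'Backend Developer', 'Fullstack Developer',
    'UX Designer', 'UI Designer', 'Data Analyst', 'Data Scientist',
    'IT Support', 'Penetration Tester', 'Security Analyst',
}

def primary_focus(focus):
    """Return the first listed focus that is a main category, else None."""
    for item in focus.split(','):
        item = item.strip()
        if item in MAIN_FOCUSES:
            return item
    return None

print(primary_focus('UI Designer , UX Designer'))
print(primary_focus('Strategy , IT Support'))          # assumed example string
print(primary_focus('Outbound Sales Representative'))

# In the notebook, rows that map to None could then be dropped:
# df['list_of_focus'] = df['focus'].apply(primary_focus)
# df = df.dropna(subset=['list_of_focus'])
```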

Now that it looks good, replace the focus column with the list_of_focus column.

In [27]:
df.drop(columns = 'focus', inplace = True)

In [28]:
df.head()

Unnamed: 0,company,description,jobTitle,link,location,postedDate,remote,salaryCurrency,salaryRange,type,yearsOfExperience,postingHour,minSalary,maxSalary,list_of_focus
0,Novovu,[b]We are looking for a talented frontend web ...,"Frontend Developer (HTML, CSS, and JS) - Remote",https://lensa.com/frontend-developer-html-css-...,"Fort Washington, PA, USA","Jul 14, 2023 8:00 am",100% Remote,$,,Full-time,0-1 year,8 am,,,Frontend Developer
1,Coalition Technologies,[heading2]WHY YOU SHOULD APPLY:[/heading2]\r\n...,Front End Developer,https://www.virtualvocations.com/job/remote-fr...,,"Jul 14, 2023 8:01 am",100% Remote,$,"[15, 35]",Full-time,0-1 year,8 am,15.0,35.0,Frontend Developer
2,Get It Recruit - Information Technology,[heading2]Job Description[/heading2]\r\n\r\nAs...,Software Developer - Remote | WFH,https://www.linkedin.com/jobs/view/software-de...,"King George, VA 22485, USA","Jul 14, 2023 8:03 am",100% In-Office,$,"[72000, 182000]",Full-time,0-1 year,8 am,72000.0,182000.0,Backend Developer
3,SiLo,"[b]Based in Downtown Nashville, Simple Logisti...",Junior .NET Full Stack Developer,https://jobs.wjhl.com/jobs/junior-.net-full-st...,"Nashville, TN, USA","Jul 14, 2023 8:04 am",100% In-Office,$,,Full-time,0-1 year,8 am,,,Fullstack Developer
4,eDiligence,[center][size=4][b]Entry Level Full Stack Deve...,Entry Level Software Developer,https://jooble.org/jdp/-2571818786162603564/En...,"Los Angeles, CA, USA","Jul 14, 2023 8:05 am",Hybrid Work,$,,Full-time,0-1 year,8 am,,,Fullstack Developer


Now let's rename the list_of_focus column. It could simply be called focus, but jobTitle describes it better, so we'll use that name. That means the existing jobTitle column has to be removed first. We also remove the company, link, postedDate, location, salaryCurrency, and salaryRange columns, since we won't need them.

In [29]:
df.drop(columns = ['company', 'link', 'jobTitle', 'location', 'postedDate', 'salaryCurrency', 'salaryRange'], inplace = True)

In [30]:
df.rename(columns={'list_of_focus': 'jobTitle'}, inplace=True)

In [31]:
df.head()

Unnamed: 0,description,remote,type,yearsOfExperience,postingHour,minSalary,maxSalary,jobTitle
0,[b]We are looking for a talented frontend web ...,100% Remote,Full-time,0-1 year,8 am,,,Frontend Developer
1,[heading2]WHY YOU SHOULD APPLY:[/heading2]\r\n...,100% Remote,Full-time,0-1 year,8 am,15.0,35.0,Frontend Developer
2,[heading2]Job Description[/heading2]\r\n\r\nAs...,100% In-Office,Full-time,0-1 year,8 am,72000.0,182000.0,Backend Developer
3,"[b]Based in Downtown Nashville, Simple Logisti...",100% In-Office,Full-time,0-1 year,8 am,,,Fullstack Developer
4,[center][size=4][b]Entry Level Full Stack Deve...,Hybrid Work,Full-time,0-1 year,8 am,,,Fullstack Developer


## Description/Text Cleaning

Now we can handle the description part. 

In [32]:
sum(df.description.isnull())

4

Only 4 null values, not much to worry about here.

To get the skills, we probably want to tokenize the descriptions, grouped by jobTitle (previously known as the focus). Hopefully the most frequent tokens will include the technical skills.

In [75]:
df.description[1]

"[heading2]WHY YOU SHOULD APPLY:[/heading2]\r\n\r\nCoalition Technologies is devoted to doing the highest quality of work for our clients while maintaining a fun, thriving environment for our team. Along with the opportunity to grow with our team, we are excited to offer:\r\n\r\n- The most competitive profit-sharing bonus plans in the industry. We pay up to 50% of all profits monthly to all full-time employees!\r\n- Joining our Coalition means you also get to enjoy paid time off and subsidized gym memberships.\r\n- Our US-Based team members can enjoy our medical, dental, vision, and life insurance packages in all US states.\r\n- Our international team members have the opportunity to participate in our International Insurance Reimbursement Program, a benefit unique to Coalition.\r\n\r\n100% of our team works remotely with the support of time tracking software. Our company culture has specialized in supporting remote team members for over a decade. We welcome your application, wherever i

Grouping by job title:

In [84]:
#filling nulls first so the join below works; assigning back avoids a chained inplace call
df['description'] = df['description'].fillna('')

grouped_df = df.groupby('jobTitle')['description'].apply(lambda x: ' '.join(x)).reset_index()

In [85]:
grouped_df

Unnamed: 0,jobTitle,description
0,Backend Developer,[heading2]Job Description[/heading2]\r\n\r\nAs...
1,Data Analyst,[center][size=4][b]Entry Level Data Analyst[/b...
2,Data Scientist,[heading2]Job Description[/heading2]\r\n\r\n• ...
3,Frontend Developer,[b]We are looking for a talented frontend web ...
4,Fullstack Developer,"[b]Based in Downtown Nashville, Simple Logisti..."
5,IT Support,[b]This is a remote position.[/b]\r\n\r\n[b]Wh...
6,Penetration Tester,"[b]This a Full Remote job, the offer is availa..."
7,Security Analyst,[b]We are thrilled to welcome a talented Junio...
8,UI Designer,[b]Job Description:[/b]\r\n\r\nWe are currentl...
9,UX Designer,[b]Job Description:[/b]\r\n\r\nWe are currentl...


We have a lot of markup to get rid of, as seen below. Anything in square brackets should be removed, and the \r and \n characters should go as well.

In [86]:
grouped_df.iloc[0]['description']

'[heading2]Job Description[/heading2]\r\n\r\nAs a full stack developer, you can resolve a problem with a complete end-to-end solution in a fast, Agile environment. If you’re looking for the chance to not just develop software, but to help create a system that will make a difference, we need you on our team. We’re looking for a developer like you with an appetite to learn and the skills needed to develop software and systems from vision to production ready.\r\n\r\nThis role is more than just coding. As a full stack developer at Booz Allen, you’ll use your passion to learn new tools and techniques and identify needed system improvements. You’ll help clients overcome their most difficult challenges using the latest architectural approaches, tools, and technologies. You’ll help make sure the solution developed by the team considers the current architecture and operating environment, as well as future functionality and enhancements.\r\n\r\nWork with us as we shape systems for the better.\r\

In [87]:
#pattern to get rid of anything in brackets
pattern = r'\[.*?\]'

grouped_df['description'] = grouped_df['description'].apply(lambda x: re.sub(pattern, '', x))

In [94]:
grouped_df.iloc[0]['description']

'Job Description\r\n\r\nAs a full stack developer, you can resolve a problem with a complete end-to-end solution in a fast, Agile environment. If you’re looking for the chance to not just develop software, but to help create a system that will make a difference, we need you on our team. We’re looking for a developer like you with an appetite to learn and the skills needed to develop software and systems from vision to production ready.\r\n\r\nThis role is more than just coding. As a full stack developer at Booz Allen, you’ll use your passion to learn new tools and techniques and identify needed system improvements. You’ll help clients overcome their most difficult challenges using the latest architectural approaches, tools, and technologies. You’ll help make sure the solution developed by the team considers the current architecture and operating environment, as well as future functionality and enhancements.\r\n\r\nWork with us as we shape systems for the better.\r\n\r\nJoin us. The wor
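The non-greedy pattern `\[.*?\]` matches each bracketed tag separately. One thing to watch is the replacement string: with `''`, words separated only by a tag get joined, while `' '` keeps them apart at the cost of extra spaces that have to be collapsed later. A small sketch on a made-up string (not taken from the data):

```python
import re

TAG_PATTERN = re.compile(r'\[.*?\]')

# Hypothetical description fragment where tags are the only separators
sample = '[b]Requirements:[/b]Python and SQL[list][*]Git[/list]'

removed = TAG_PATTERN.sub('', sample)    # tags deleted outright
spaced = TAG_PATTERN.sub(' ', sample)    # tags replaced by a space
collapsed = ' '.join(spaced.split())     # collapse runs of white space

print(removed)
print(collapsed)
```

Here `removed` ends in `SQLGit`, while `collapsed` keeps `SQL` and `Git` as separate words.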

From a quick scan, it seems that only punctuation, \r and \n still need to be removed.

In [95]:
grouped_df['description'] = grouped_df['description'].apply(lambda x: x.replace('\r', ' '))
grouped_df['description'] = grouped_df['description'].apply(lambda x: x.replace('\n', ' '))

Now just the punctuation.

In [98]:
#map every ASCII punctuation character to a space in a single pass
punctuation_to_space = str.maketrans({punc: ' ' for punc in string.punctuation})

grouped_df['description'] = grouped_df['description'].apply(lambda x: x.translate(punctuation_to_space))

In [99]:
grouped_df.iloc[0]['description']

'Job Description    As a full stack developer  you can resolve a problem with a complete end to end solution in a fast  Agile environment  If you’re looking for the chance to not just develop software  but to help create a system that will make a difference  we need you on our team  We’re looking for a developer like you with an appetite to learn and the skills needed to develop software and systems from vision to production ready     This role is more than just coding  As a full stack developer at Booz Allen  you’ll use your passion to learn new tools and techniques and identify needed system improvements  You’ll help clients overcome their most difficult challenges using the latest architectural approaches  tools  and technologies  You’ll help make sure the solution developed by the team considers the current architecture and operating environment  as well as future functionality and enhancements     Work with us as we shape systems for the better     Join us  The world can’t wait   

One more pass: the bullet character (•) isn't in string.punctuation, so it has to be removed separately.

In [100]:
grouped_df['description'] = grouped_df['description'].apply(lambda x: x.replace('•', ' '))
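`string.punctuation` only covers ASCII, which is why the bullet needed its own step. The curly apostrophe (’) is also left in the text and shows up as its own token in the frequency lists further down. One way to catch such characters in general is to check the Unicode category of each character. This is a sketch, not what the notebook does, and `strip_punctuation` is a hypothetical helper name:

```python
import string
import unicodedata

def strip_punctuation(text):
    """Replace ASCII punctuation and any Unicode punctuation with a space."""
    cleaned = []
    for ch in text:
        # Unicode categories starting with 'P' are punctuation (•, ’, “ ...);
        # string.punctuation additionally covers ASCII symbols such as $ and +
        is_punct = ch in string.punctuation or unicodedata.category(ch).startswith('P')
        cleaned.append(' ' if is_punct else ch)
    return ''.join(cleaned)

print(strip_punctuation('• you’re end-to-end (C#, SQL)').split())
```

Note that this also breaks up terms such as `C#`, just as the ASCII punctuation step does, so skills written with symbols would need special handling.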

In [101]:
grouped_df.iloc[0]['description']

'Job Description    As a full stack developer  you can resolve a problem with a complete end to end solution in a fast  Agile environment  If you’re looking for the chance to not just develop software  but to help create a system that will make a difference  we need you on our team  We’re looking for a developer like you with an appetite to learn and the skills needed to develop software and systems from vision to production ready     This role is more than just coding  As a full stack developer at Booz Allen  you’ll use your passion to learn new tools and techniques and identify needed system improvements  You’ll help clients overcome their most difficult challenges using the latest architectural approaches  tools  and technologies  You’ll help make sure the solution developed by the team considers the current architecture and operating environment  as well as future functionality and enhancements     Work with us as we shape systems for the better     Join us  The world can’t wait   

Removing extra spaces:

In [105]:
grouped_df['description'] = grouped_df['description'].apply(lambda x: x.split())
grouped_df['description'] = grouped_df['description'].apply(lambda x: ' '.join(x))

grouped_df.iloc[0]['description']

'Job Description As a full stack developer you can resolve a problem with a complete end to end solution in a fast Agile environment If you’re looking for the chance to not just develop software but to help create a system that will make a difference we need you on our team We’re looking for a developer like you with an appetite to learn and the skills needed to develop software and systems from vision to production ready This role is more than just coding As a full stack developer at Booz Allen you’ll use your passion to learn new tools and techniques and identify needed system improvements You’ll help clients overcome their most difficult challenges using the latest architectural approaches tools and technologies You’ll help make sure the solution developed by the team considers the current architecture and operating environment as well as future functionality and enhancements Work with us as we shape systems for the better Join us The world can’t wait You Have Experience with modern

In [109]:
#let's also lowercase everything now
grouped_df['description'] = grouped_df['description'].apply(lambda x: x.lower())

grouped_df.iloc[0]['description']

'job description as a full stack developer you can resolve a problem with a complete end to end solution in a fast agile environment if you’re looking for the chance to not just develop software but to help create a system that will make a difference we need you on our team we’re looking for a developer like you with an appetite to learn and the skills needed to develop software and systems from vision to production ready this role is more than just coding as a full stack developer at booz allen you’ll use your passion to learn new tools and techniques and identify needed system improvements you’ll help clients overcome their most difficult challenges using the latest architectural approaches tools and technologies you’ll help make sure the solution developed by the team considers the current architecture and operating environment as well as future functionality and enhancements work with us as we shape systems for the better join us the world can’t wait you have experience with modern

Now let's remove stop words.

In [107]:
nltk.download('stopwords')
#set of stopwords to search through
stop_words = set(stopwords.words('english'))

#tokenize and then remove stop words (word_tokenize also needs NLTK's punkt tokenizer data to be installed)
grouped_df['description'] = grouped_df['description'].apply(lambda x: nltk.word_tokenize(x))
grouped_df['description'] = grouped_df['description'].apply(lambda x: [word for word in x if word not in stop_words])

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\RaviB\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Hard technical skills aren't among the most common tokens, so the easiest thing to do here is to look manually for the top 5 skills for each job title and note them down in another dictionary or DataFrame.

In [139]:
Counter(grouped_df.iloc[0]['description']).most_common()

[('experience', 550),
 ('team', 279),
 ('work', 249),
 ('data', 220),
 ('development', 218),
 ('backend', 204),
 ('working', 183),
 ('’', 169),
 ('software', 159),
 ('skills', 144),
 ('services', 137),
 ('design', 136),
 ('applications', 133),
 ('web', 131),
 ('code', 125),
 ('new', 116),
 ('developer', 115),
 ('end', 114),
 ('technologies', 113),
 ('cloud', 109),
 ('knowledge', 109),
 ('ability', 107),
 ('us', 106),
 ('strong', 106),
 ('role', 105),
 ('systems', 103),
 ('job', 100),
 ('solutions', 100),
 ('using', 97),
 ('requirements', 97),
 ('engineering', 97),
 ('application', 95),
 ('opportunity', 95),
 ('environment', 93),
 ('years', 93),
 ('including', 89),
 ('time', 89),
 ('building', 89),
 ('product', 89),
 ('business', 85),
 ('technical', 85),
 ('build', 84),
 ('company', 84),
 ('quality', 81),
 ('help', 79),
 ('engineers', 76),
 ('understanding', 74),
 ('position', 72),
 ('technology', 72),
 ('teams', 72),
 ('engineer', 69),
 ('back', 68),
 ('status', 68),
 ('employment', 67

In [141]:
backend_top_skills = {'java': 67, 'aws': 63, 'apis': 60, 'python': 59, 'react': 48}

In [142]:
Counter(grouped_df.iloc[1]['description']).most_common()

[('data', 3919),
 ('business', 1261),
 ('experience', 1222),
 ('work', 921),
 ('analysis', 642),
 ('skills', 631),
 ('team', 616),
 ('’', 508),
 ('ability', 493),
 ('support', 483),
 ('analytics', 483),
 ('knowledge', 470),
 ('tools', 451),
 ('quality', 423),
 ('information', 406),
 ('related', 401),
 ('science', 385),
 ('analyst', 377),
 ('management', 365),
 ('requirements', 356),
 ('job', 353),
 ('including', 345),
 ('required', 344),
 ('time', 343),
 ('development', 338),
 ('position', 324),
 ('qualifications', 318),
 ('company', 315),
 ('sql', 315),
 ('working', 314),
 ('degree', 295),
 ('using', 295),
 ('new', 293),
 ('reports', 293),
 ('solutions', 292),
 ('years', 280),
 ('technical', 277),
 ('strong', 276),
 ('based', 274),
 ('preferred', 259),
 ('role', 258),
 ('insights', 251),
 ('benefits', 250),
 ('systems', 247),
 ('customer', 247),
 ('provide', 245),
 ('opportunity', 239),
 ('environment', 238),
 ('communication', 233),
 ('analytical', 229),
 ('statistics', 228),
 ('rese

In [144]:
data_analyst_top_skills = {'sql': 315, 'python': 219, 'excel': 193, 'tableau': 164, 'power': 114}

In [145]:
Counter(grouped_df.iloc[2]['description']).most_common()

[('data', 1926),
 ('experience', 706),
 ('work', 438),
 ('science', 392),
 ('analysis', 356),
 ('business', 324),
 ('skills', 305),
 ('analytics', 278),
 ('team', 254),
 ('ability', 230),
 ('job', 216),
 ('position', 215),
 ('support', 213),
 ('learning', 209),
 ('development', 208),
 ('information', 204),
 ('including', 204),
 ('scientist', 195),
 ('’', 194),
 ('tools', 192),
 ('time', 185),
 ('required', 185),
 ('solutions', 181),
 ('degree', 180),
 ('technical', 167),
 ('benefits', 161),
 ('opportunity', 160),
 ('develop', 160),
 ('techniques', 154),
 ('research', 154),
 ('knowledge', 153),
 ('working', 151),
 ('years', 150),
 ('models', 150),
 ('related', 149),
 ('requirements', 149),
 ('statistical', 146),
 ('using', 146),
 ('employment', 145),
 ('provide', 145),
 ('machine', 144),
 ('strong', 142),
 ('python', 141),
 ('apply', 138),
 ('status', 137),
 ('new', 135),
 ('may', 134),
 ('complex', 132),
 ('engineering', 132),
 ('us', 132),
 ('must', 132),
 ('environment', 130),
 ('qua

In [146]:
data_scientist_top_skills = {'python': 141, 'sql': 121, 'r': 93, 'tableau': 49, 'aws': 32}

In [None]:
#if wanted to automate getting most common skills, could make a list of most common ones manually, and search for them
backend_skills = ['python', 'java', 'ruby', 'node', 'django', 'flask', 'sql', 'nosql', 'linux', 'aws', 'docker',
                  'api', 'git', 'php', 'express', 'json', 'xml', 'github']

frontend_skills = ['html', 'css', 'javascript', 'react', 'angular', 'vue', 'ux', 'ui', 'redux', 'vuex', 'sass', 'less',
                'jquery', 'version', 'testing', 'unit', 'stylus', 'gui']

#fullstack is both of the above combined

#testing here is a/b testing; removing punctuation may have broken up 'a/b', leaving only 'testing'
#google refers to Google Sheets and power to Power BI
data_analyst_skills = ['python', 'r', 'tableau', 'power', 'matplotlib', 'ggplot2', 'sql', 'excel', 'google',
                       'statistics', 'statistical', 'cleaning', 'testing', 'visualization', 'business']

#Machine Learning, Deep Learning, Predictive Modeling are the full version of first three, and vision is computer vision
data_scientist_skills = ['machine', 'deep', 'predictive', 'spark', 'nlp', 'vision', 'ai', 'scikit', 'tensorflow', 
                         'pytorch', 'matlab', 'algebra', 'calculus']


it_support_skills = ['troubleshooting', 'operating', 'windows', 'macos', 'linux', 'network', 'cybersecurity', 
                     'desktop', 'aws', 'azure', 'gcp']

penetration_tester_skills = [
    "Ethical Hacking",
    "Network Scanning",
    "Vulnerability Assessment",
    "Exploitation Techniques",
    "Web Application Testing",
    "Wireless Network Security",
    "Security Tools (Metasploit, Wireshark, Nmap)",
    "Report Writing",
    "Compliance Standards (ISO 27001, NIST)",
    "Social Engineering"]

security_analyst_skills = penetration_tester_skills

ux_designer_skills = [
    "User Research",
    "Information Architecture",
    "Wireframing and Prototyping",
    "Usability Testing",
    "Interaction Design",
    "Persona Development",
    "User Flows",
    "Visual Design Principles",
    "Design Thinking",
    "UX/UI Tools (Sketch, Figma, Adobe XD)"
]

ui_designer_skills = [
    "Visual Design",
    "Graphic Design Software (Adobe Creative Suite)",
    "Color Theory",
    "Typography",
    "Icon Design",
    "User Interface Patterns",
    "Responsive Design",
    "Design Systems",
    "Animation",
    "Collaboration with Developers"
]
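A rough sketch of how the automated search could work once such lists exist. It assumes the descriptions are already lowercased token lists, as they are after the stop-word step. Single-word skills are matched against tokens and two-word skills against adjacent token pairs. Longer phrases such as the Title Case entries in the penetration tester and designer lists would not match as written and would need shortening. `count_skills` is a hypothetical helper, and the token list below is made up.

```python
from collections import Counter

def count_skills(tokens, skills):
    """Count how often each candidate skill occurs in a list of tokens.

    One-word skills are looked up among the tokens; two-word skills
    (e.g. 'machine learning') among pairs of adjacent tokens.
    """
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(' '.join(pair) for pair in zip(tokens, tokens[1:]))
    found = Counter()
    for skill in skills:
        key = skill.lower()
        n = bigram_counts[key] if ' ' in key else unigram_counts[key]
        if n:
            found[key] = n
    return found

# Made-up token list standing in for one row of grouped_df['description']
tokens = 'experience python sql power bi machine learning python aws'.split()
skills = ['Python', 'SQL', 'Power BI', 'Machine Learning', 'Java']

result = count_skills(tokens, skills)
print(result.most_common(5))

# In the notebook this could be called as:
# count_skills(grouped_df.iloc[0]['description'], backend_skills).most_common(5)
```

Matching pairs of tokens avoids counting `power` or `machine` when they appear outside `power bi` or `machine learning`.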
