# Junior Job Data Cleaning

This notebook cleans the job-posting data that will go into Tableau. Several columns are free text, so we need something like Python to clean them first. The end goal is to filter by job type and see what trends occur in junior-level positions.

In [109]:
import numpy as np
import pandas as pd
import ast
from collections import Counter

In [140]:
orig_df = pd.read_csv('Jobs_Data.csv')

#want to keep original just in case, since a lot of changes will occur
df = orig_df.copy()

df.head()

Unnamed: 0,company,description,focus,jobTitle,link,location,postedDate,remote,salaryCurrency,salaryRange,type,yearsOfExperience
0,Novovu,[b]We are looking for a talented frontend web ...,Frontend Developer,"Frontend Developer (HTML, CSS, and JS) - Remote",https://lensa.com/frontend-developer-html-css-...,"Fort Washington, PA, USA","Jul 14, 2023 8:00 am",100% Remote,$,[],Full-time,0-1 year
1,Coalition Technologies,[heading2]WHY YOU SHOULD APPLY:[/heading2]\r\n...,Frontend Developer,Front End Developer,https://www.virtualvocations.com/job/remote-fr...,,"Jul 14, 2023 8:01 am",100% Remote,$,"[15,35]",Full-time,0-1 year
2,Get It Recruit - Information Technology,[heading2]Job Description[/heading2]\r\n\r\nAs...,Backend Developer,Software Developer - Remote | WFH,https://www.linkedin.com/jobs/view/software-de...,"King George, VA 22485, USA","Jul 14, 2023 8:03 am",100% In-Office,$,"[72000,182000]",Full-time,0-1 year
3,SiLo,"[b]Based in Downtown Nashville, Simple Logisti...",Fullstack Developer,Junior .NET Full Stack Developer,https://jobs.wjhl.com/jobs/junior-.net-full-st...,"Nashville, TN, USA","Jul 14, 2023 8:04 am",100% In-Office,$,[],Full-time,0-1 year
4,eDiligence,[center][size=4][b]Entry Level Full Stack Deve...,Fullstack Developer,Entry Level Software Developer,https://jooble.org/jdp/-2571818786162603564/En...,"Los Angeles, CA, USA","Jul 14, 2023 8:05 am",Hybrid Work,$,[],Full-time,0-1 year


In [141]:
df.describe()

Unnamed: 0,company,description,focus,jobTitle,link,location,postedDate,remote,salaryCurrency,salaryRange,type,yearsOfExperience
count,1426,1422,1426,1426,1426,1057,1426,1426,1424,1426,1426,1426
unique,943,1202,85,845,1244,367,895,3,1,261,3,4
top,Guidehouse,[b]Position Summary...[/b]\r\n\r\nWhat you'll ...,Data Analyst,Data Analyst,https://www.ziprecruiter.com/c/Reesby/Job/UI-U...,United States,"Aug 1, 2023 8:00 pm",Hybrid Work,$,[],Full-time,0-1 year
freq,27,15,210,48,10,54,17,595,1424,1032,1353,879


In [142]:
countCompany = Counter(df['company'])

max(countCompany.values())

27

In [143]:
def get_key_by_value(dictionary, target_value):
    for key, value in dictionary.items():
        if value == target_value:
            return key
    return None

In [144]:
get_key_by_value(countCompany, 27)

'Guidehouse'

The most common company is Guidehouse, with only 27 occurrences out of 1,426 postings, so company is not a significant feature to look at, especially once broken down by job title.
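As a side note, the lookup above could also be done in one step with `Counter.most_common`. This is only a minimal sketch on made-up data: the `companies` list is a hypothetical stand-in for `df['company']`.

```python
from collections import Counter

# Hypothetical toy data standing in for df['company']
companies = ['Guidehouse', 'Novovu', 'Guidehouse', 'SiLo', 'Guidehouse']

count_company = Counter(companies)

# most_common(1) returns a list holding one (value, count) pair
top_company, top_count = count_company.most_common(1)[0]
print(top_company, top_count)

# Share of all postings that belong to the most common company
share = top_count / sum(count_company.values())
print(f"{share:.1%}")
```

This avoids the separate reverse lookup by value, and the share makes it easier to judge whether the feature is worth keeping.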

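Get the hour of the job posting. Note that `x[-7]` picks up a single character, so two-digit hours lose their leading digit: 10 o'clock is counted under `0`, while 11 and 12 o'clock fall in with 1 and 2.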

In [145]:
df['postingHour'] = df['postedDate'].apply(lambda x: x[-7] + ' ' + x[-2:])

In [146]:
Counter(df['postingHour'])

Counter({'8 am': 1094,
         '9 am': 149,
         '0 am': 66,
         '5 pm': 10,
         '4 am': 4,
         '1 pm': 17,
         '6 am': 7,
         '2 am': 3,
         '3 pm': 6,
         '1 am': 5,
         '7 pm': 6,
         '8 pm': 33,
         '0 pm': 3,
         '7 am': 3,
         '6 pm': 3,
         '5 am': 2,
         '2 pm': 6,
         '3 am': 3,
         '9 pm': 2,
         '4 pm': 4})
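The `'0 am'` and `'0 pm'` buckets above come from two-digit hours being cut down to their last digit. One way to keep the full hour is a small regular expression. This is a sketch, assuming every `postedDate` ends in `H:MM am` or `H:MM pm` as in the rows shown; `posting_hour` is a hypothetical helper name, and the two-digit sample strings are made up.

```python
import re

# Matches the trailing "H:MM am/pm" part of strings like "Jul 14, 2023 8:00 am"
_TIME_RE = re.compile(r'(\d{1,2}):\d{2}\s*([ap]m)$', re.IGNORECASE)

def posting_hour(posted_date):
    """Return the hour and am/pm marker, e.g. '10 am', or None if no match."""
    match = _TIME_RE.search(posted_date.strip())
    if match is None:
        return None
    return f"{int(match.group(1))} {match.group(2).lower()}"

# The first string follows the format seen in the data;
# the other two are made-up two-digit-hour cases
print(posting_hour("Jul 14, 2023 8:00 am"))
print(posting_hour("Aug 1, 2023 10:30 am"))
print(posting_hour("Aug 1, 2023 12:05 pm"))
```

It could be applied with `df['postedDate'].apply(posting_hour)`, which would give 10, 11, and 12 o'clock their own buckets.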

In [147]:
df.head()

Unnamed: 0,company,description,focus,jobTitle,link,location,postedDate,remote,salaryCurrency,salaryRange,type,yearsOfExperience,postingHour
0,Novovu,[b]We are looking for a talented frontend web ...,Frontend Developer,"Frontend Developer (HTML, CSS, and JS) - Remote",https://lensa.com/frontend-developer-html-css-...,"Fort Washington, PA, USA","Jul 14, 2023 8:00 am",100% Remote,$,[],Full-time,0-1 year,8 am
1,Coalition Technologies,[heading2]WHY YOU SHOULD APPLY:[/heading2]\r\n...,Frontend Developer,Front End Developer,https://www.virtualvocations.com/job/remote-fr...,,"Jul 14, 2023 8:01 am",100% Remote,$,"[15,35]",Full-time,0-1 year,8 am
2,Get It Recruit - Information Technology,[heading2]Job Description[/heading2]\r\n\r\nAs...,Backend Developer,Software Developer - Remote | WFH,https://www.linkedin.com/jobs/view/software-de...,"King George, VA 22485, USA","Jul 14, 2023 8:03 am",100% In-Office,$,"[72000,182000]",Full-time,0-1 year,8 am
3,SiLo,"[b]Based in Downtown Nashville, Simple Logisti...",Fullstack Developer,Junior .NET Full Stack Developer,https://jobs.wjhl.com/jobs/junior-.net-full-st...,"Nashville, TN, USA","Jul 14, 2023 8:04 am",100% In-Office,$,[],Full-time,0-1 year,8 am
4,eDiligence,[center][size=4][b]Entry Level Full Stack Deve...,Fullstack Developer,Entry Level Software Developer,https://jooble.org/jdp/-2571818786162603564/En...,"Los Angeles, CA, USA","Jul 14, 2023 8:05 am",Hybrid Work,$,[],Full-time,0-1 year,8 am


The remote (office location), type, and yearsOfExperience features need no cleaning:

In [148]:
Counter(df['remote'])

Counter({'100% Remote': 333, '100% In-Office': 498, 'Hybrid Work': 595})

In [149]:
Counter(df['type'])

Counter({'Full-time': 1353, 'Part-time': 38, 'Internship': 35})

In [150]:
Counter(df['yearsOfExperience'])

Counter({'0-1 year': 879, '1-3 years': 545, '5+ years': 1, '3-5 years': 1})

Every non-missing salary is in $ (the other two values are nan), so we can ignore this feature.

In [151]:
Counter(df['salaryCurrency'])

Counter({'$': 1424, nan: 2})

Next, we want to split the salary range into two features: a minimum and a maximum.

In [152]:
#the original feature is a string such as '[72000,182000]', so parse it into a list;
#missing values and empty ranges ('[]') become nan
df['salaryRange'] = df['salaryRange'].apply(lambda x: ast.literal_eval(x) if pd.notna(x) and x != '[]' else np.nan)

In [153]:
#a type check is more robust than an identity comparison against np.nan
df['minSalary'] = df['salaryRange'].apply(lambda x: x[0] if isinstance(x, list) else np.nan)
df['maxSalary'] = df['salaryRange'].apply(lambda x: x[1] if isinstance(x, list) else np.nan)

In [154]:
df.head(7)

Unnamed: 0,company,description,focus,jobTitle,link,location,postedDate,remote,salaryCurrency,salaryRange,type,yearsOfExperience,postingHour,minSalary,maxSalary
0,Novovu,[b]We are looking for a talented frontend web ...,Frontend Developer,"Frontend Developer (HTML, CSS, and JS) - Remote",https://lensa.com/frontend-developer-html-css-...,"Fort Washington, PA, USA","Jul 14, 2023 8:00 am",100% Remote,$,,Full-time,0-1 year,8 am,,
1,Coalition Technologies,[heading2]WHY YOU SHOULD APPLY:[/heading2]\r\n...,Frontend Developer,Front End Developer,https://www.virtualvocations.com/job/remote-fr...,,"Jul 14, 2023 8:01 am",100% Remote,$,"[15, 35]",Full-time,0-1 year,8 am,15.0,35.0
2,Get It Recruit - Information Technology,[heading2]Job Description[/heading2]\r\n\r\nAs...,Backend Developer,Software Developer - Remote | WFH,https://www.linkedin.com/jobs/view/software-de...,"King George, VA 22485, USA","Jul 14, 2023 8:03 am",100% In-Office,$,"[72000, 182000]",Full-time,0-1 year,8 am,72000.0,182000.0
3,SiLo,"[b]Based in Downtown Nashville, Simple Logisti...",Fullstack Developer,Junior .NET Full Stack Developer,https://jobs.wjhl.com/jobs/junior-.net-full-st...,"Nashville, TN, USA","Jul 14, 2023 8:04 am",100% In-Office,$,,Full-time,0-1 year,8 am,,
4,eDiligence,[center][size=4][b]Entry Level Full Stack Deve...,Fullstack Developer,Entry Level Software Developer,https://jooble.org/jdp/-2571818786162603564/En...,"Los Angeles, CA, USA","Jul 14, 2023 8:05 am",Hybrid Work,$,,Full-time,0-1 year,8 am,,
5,Leidos Holding,[center][size=5]Description[/size][/center]\r\...,Fullstack Developer,Junior Full Stack Developer,https://clearedcareers.com/job/319461/junior-f...,"Colorado Springs, CO, USA","Jul 14, 2023 8:06 am",Hybrid Work,$,"[53300, 110700]",Full-time,0-1 year,8 am,53300.0,110700.0
6,BigBear.ai,[heading2]Overview[/heading2]\r\n\r\nBigBear.a...,Fullstack Developer,Jr. Full Stack Developer,https://getwork.com/details/bcdec3380ac6383c52...,"Washington, MA, USA","Jul 14, 2023 8:07 am",100% Remote,$,,Full-time,0-1 year,8 am,,
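The parse and split steps could also be folded into one helper. This is a minimal sketch, assuming the raw values are strings like `'[72000,182000]'` or `'[]'`; `split_salary_range` is a hypothetical name, not something defined elsewhere in this notebook.

```python
import ast
import math

def split_salary_range(raw):
    """Parse a string such as '[72000,182000]' into a (min, max) tuple.

    Returns (nan, nan) for missing values and for the empty list '[]'.
    """
    if not isinstance(raw, str):
        return (math.nan, math.nan)
    values = ast.literal_eval(raw)
    if len(values) != 2:
        return (math.nan, math.nan)
    return (float(values[0]), float(values[1]))

print(split_salary_range('[72000,182000]'))
print(split_salary_range('[]'))
```

The rows above also appear to mix units: `[15, 35]` looks like an hourly rate while `[72000, 182000]` looks annual, so the two may need separating before salaries are compared in Tableau.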


Now looking at the focus column: many rows list more than one focus, separated by commas.

In [155]:
Counter(df['focus'])

Counter({'Frontend Developer': 82,
         'Backend Developer': 92,
         'Fullstack Developer': 189,
         'UX Designer': 114,
         'Data Analyst': 210,
         'Data Scientist': 100,
         'IT Support': 114,
         'Penetration Tester': 50,
         'Security Analyst': 155,
         'UI Designer': 56,
         'Outbound Sales Representative': 1,
         'UI Designer , UX Designer': 24,
         'Frontend Developer , Backend Developer': 2,
         'Backend Developer , Business Analyst': 1,
         'Fullstack Developer , Scrum': 26,
         'Fullstack Developer , Project Management': 2,
         'UX Designer , UX Researcher': 5,
         'UI Designer , Scrum': 4,
         'UI Designer , Frontend Developer': 1,
         'UI Designer , UX Researcher , UX Designer , Product Manager': 1,
         'UI Designer , Strategy': 1,
         'UI Designer , UX Designer , Product Manager': 3,
         'Data Analyst , Finance , Business Analyst': 1,
         'Data Analyst , Busin

Judging by the dictionary above, combinations of multiple focuses mostly show up only once, so taking the first listed focus as the focus should balance out and not create any major bias.

So we want to put all the focuses into a list in a new column, then replace each list with its first focus.

In [156]:
#splitting the focus feature by commas so if there are more than one focus, 
#this will put them in a new column as a list, and remove white space
df['list_of_focus'] = df['focus'].str.split(',').apply(lambda x: [item.strip() for item in x])

df.head()

Unnamed: 0,company,description,focus,jobTitle,link,location,postedDate,remote,salaryCurrency,salaryRange,type,yearsOfExperience,postingHour,minSalary,maxSalary,list_of_focus
0,Novovu,[b]We are looking for a talented frontend web ...,Frontend Developer,"Frontend Developer (HTML, CSS, and JS) - Remote",https://lensa.com/frontend-developer-html-css-...,"Fort Washington, PA, USA","Jul 14, 2023 8:00 am",100% Remote,$,,Full-time,0-1 year,8 am,,,[Frontend Developer]
1,Coalition Technologies,[heading2]WHY YOU SHOULD APPLY:[/heading2]\r\n...,Frontend Developer,Front End Developer,https://www.virtualvocations.com/job/remote-fr...,,"Jul 14, 2023 8:01 am",100% Remote,$,"[15, 35]",Full-time,0-1 year,8 am,15.0,35.0,[Frontend Developer]
2,Get It Recruit - Information Technology,[heading2]Job Description[/heading2]\r\n\r\nAs...,Backend Developer,Software Developer - Remote | WFH,https://www.linkedin.com/jobs/view/software-de...,"King George, VA 22485, USA","Jul 14, 2023 8:03 am",100% In-Office,$,"[72000, 182000]",Full-time,0-1 year,8 am,72000.0,182000.0,[Backend Developer]
3,SiLo,"[b]Based in Downtown Nashville, Simple Logisti...",Fullstack Developer,Junior .NET Full Stack Developer,https://jobs.wjhl.com/jobs/junior-.net-full-st...,"Nashville, TN, USA","Jul 14, 2023 8:04 am",100% In-Office,$,,Full-time,0-1 year,8 am,,,[Fullstack Developer]
4,eDiligence,[center][size=4][b]Entry Level Full Stack Deve...,Fullstack Developer,Entry Level Software Developer,https://jooble.org/jdp/-2571818786162603564/En...,"Los Angeles, CA, USA","Jul 14, 2023 8:05 am",Hybrid Work,$,,Full-time,0-1 year,8 am,,,[Fullstack Developer]


In [157]:
df['list_of_focus'] = df['list_of_focus'].apply(lambda x: x[0])
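The split and take-first steps above could also be written as a single chain of pandas string methods. This is just a sketch on a made-up Series (`focus` is hypothetical sample data using strings of the form seen in the Counter output).

```python
import pandas as pd

# Hypothetical sample using focus strings of the form seen in the Counter output
focus = pd.Series([
    'Frontend Developer',
    'UI Designer , UX Designer',
    'Fullstack Developer , Scrum',
])

# Split on commas, keep the first piece, and trim surrounding whitespace
first_focus = focus.str.split(',').str[0].str.strip()
print(first_focus.tolist())
```

The intermediate list column is still handy when we want to inspect the secondary focuses, so the two-step version above is not wrong, only longer.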

In [158]:
Counter(df['list_of_focus'])

Counter({'Frontend Developer': 84,
         'Backend Developer': 100,
         'Fullstack Developer': 229,
         'UX Designer': 131,
         'Data Analyst': 299,
         'Data Scientist': 148,
         'IT Support': 116,
         'Penetration Tester': 52,
         'Security Analyst': 172,
         'UI Designer': 92,
         'Outbound Sales Representative': 1,
         'Strategy': 1,
         'Finance': 1})

We have three one-offs. Looking at the counts above, Strategy and Finance each have IT Support as a second focus, so let's relabel them as IT Support. Outbound Sales Representative doesn't have a second focus and is not a tech job, so let's drop that row.

In [159]:
#getting the index
df[df['list_of_focus'] == 'Strategy']['list_of_focus']

726    Strategy
Name: list_of_focus, dtype: object

In [162]:
df[df['list_of_focus'] == 'Finance']['list_of_focus']

730    Finance
Name: list_of_focus, dtype: object

In [163]:
df.at[730, 'list_of_focus'] = 'IT Support'
df.at[726, 'list_of_focus'] = 'IT Support'

In [165]:
df[df['list_of_focus'] == 'Outbound Sales Representative']['list_of_focus']

86    Outbound Sales Representative
Name: list_of_focus, dtype: object

In [166]:
df.drop(86, inplace=True)
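The same relabel-and-drop steps can be done by value instead of by hard-coded row index, which keeps working if the row order changes. This is a minimal sketch on a made-up frame (`jobs` is hypothetical; the labels mirror the ones above).

```python
import pandas as pd

# Hypothetical frame; the labels mirror those in the notebook
jobs = pd.DataFrame({'list_of_focus': ['Strategy', 'Data Analyst', 'Finance',
                                       'Outbound Sales Representative']})

# Relabel the one-off focuses without looking up their row indices first
jobs['list_of_focus'] = jobs['list_of_focus'].replace(
    {'Strategy': 'IT Support', 'Finance': 'IT Support'})

# Keep every row except the non-tech posting
jobs = jobs[jobs['list_of_focus'] != 'Outbound Sales Representative']
print(jobs['list_of_focus'].tolist())
```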

Double check list of focuses.

In [169]:
Counter(df['list_of_focus'])

Counter({'Frontend Developer': 84,
         'Backend Developer': 100,
         'Fullstack Developer': 229,
         'UX Designer': 131,
         'Data Analyst': 299,
         'Data Scientist': 148,
         'IT Support': 118,
         'Penetration Tester': 52,
         'Security Analyst': 172,
         'UI Designer': 92})

Now that it looks good, drop the original focus column so that list_of_focus can take its place.

In [170]:
df.drop(columns = 'focus', inplace = True)

In [172]:
df.head()

Unnamed: 0,company,description,jobTitle,link,location,postedDate,remote,salaryCurrency,salaryRange,type,yearsOfExperience,postingHour,minSalary,maxSalary,list_of_focus
0,Novovu,[b]We are looking for a talented frontend web ...,"Frontend Developer (HTML, CSS, and JS) - Remote",https://lensa.com/frontend-developer-html-css-...,"Fort Washington, PA, USA","Jul 14, 2023 8:00 am",100% Remote,$,,Full-time,0-1 year,8 am,,,Frontend Developer
1,Coalition Technologies,[heading2]WHY YOU SHOULD APPLY:[/heading2]\r\n...,Front End Developer,https://www.virtualvocations.com/job/remote-fr...,,"Jul 14, 2023 8:01 am",100% Remote,$,"[15, 35]",Full-time,0-1 year,8 am,15.0,35.0,Frontend Developer
2,Get It Recruit - Information Technology,[heading2]Job Description[/heading2]\r\n\r\nAs...,Software Developer - Remote | WFH,https://www.linkedin.com/jobs/view/software-de...,"King George, VA 22485, USA","Jul 14, 2023 8:03 am",100% In-Office,$,"[72000, 182000]",Full-time,0-1 year,8 am,72000.0,182000.0,Backend Developer
3,SiLo,"[b]Based in Downtown Nashville, Simple Logisti...",Junior .NET Full Stack Developer,https://jobs.wjhl.com/jobs/junior-.net-full-st...,"Nashville, TN, USA","Jul 14, 2023 8:04 am",100% In-Office,$,,Full-time,0-1 year,8 am,,,Fullstack Developer
4,eDiligence,[center][size=4][b]Entry Level Full Stack Deve...,Entry Level Software Developer,https://jooble.org/jdp/-2571818786162603564/En...,"Los Angeles, CA, USA","Jul 14, 2023 8:05 am",Hybrid Work,$,,Full-time,0-1 year,8 am,,,Fullstack Developer


Now let's rename the list_of_focus column. Rather than focus, let's call it jobTitle, since that is really what it is. That means the existing jobTitle column has to be removed first, and let's remove the company, link, postedDate, location, salaryCurrency, and salaryRange columns as well since we won't need them.

In [173]:
df.drop(columns = ['company', 'link', 'jobTitle', 'location', 'postedDate', 'salaryCurrency', 'salaryRange'], inplace = True)

In [174]:
df.rename(columns={'list_of_focus': 'jobTitle'}, inplace=True)
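The drop and rename can also be chained so that the old jobTitle column is gone before its name is reused. This is a sketch on a made-up one-row frame (`jobs` is hypothetical and holds only a subset of the columns).

```python
import pandas as pd

# Hypothetical one-row frame with a subset of the notebook's columns
jobs = pd.DataFrame({
    'company': ['Novovu'],
    'jobTitle': ['Frontend Developer (HTML, CSS, and JS) - Remote'],
    'remote': ['100% Remote'],
    'list_of_focus': ['Frontend Developer'],
})

# Drop the unwanted columns first, then reuse the freed-up name
jobs = (jobs
        .drop(columns=['company', 'jobTitle'])
        .rename(columns={'list_of_focus': 'jobTitle'}))
print(list(jobs.columns))
```

Doing the drop first matters: renaming while the old jobTitle column still exists would leave two columns with the same name.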

In [175]:
df.head()

Unnamed: 0,description,remote,type,yearsOfExperience,postingHour,minSalary,maxSalary,jobTitle
0,[b]We are looking for a talented frontend web ...,100% Remote,Full-time,0-1 year,8 am,,,Frontend Developer
1,[heading2]WHY YOU SHOULD APPLY:[/heading2]\r\n...,100% Remote,Full-time,0-1 year,8 am,15.0,35.0,Frontend Developer
2,[heading2]Job Description[/heading2]\r\n\r\nAs...,100% In-Office,Full-time,0-1 year,8 am,72000.0,182000.0,Backend Developer
3,"[b]Based in Downtown Nashville, Simple Logisti...",100% In-Office,Full-time,0-1 year,8 am,,,Fullstack Developer
4,[center][size=4][b]Entry Level Full Stack Deve...,Hybrid Work,Full-time,0-1 year,8 am,,,Fullstack Developer


## Description/Text Cleaning

Now we can handle the description column.

In [179]:
df['description'].isna().sum()

4

Only 4 descriptions are null, so there is not much to worry about here.

To get the skills, we probably want to tokenize the descriptions and group them by jobTitle (previously known as the focus); the most frequent tokens will hopefully be the technical skills.

In [181]:
df.description[0]

'[b]We are looking for a talented frontend web developer to help us create Novovu![/b]\r\n\r\nThis person will be tasked with coding in HTML, CSS, and JavaScript for the frontend portions of Novovu while coordinating with our backend teams and other developers.\r\n\r\nIs a remote position, must have Discord, and excellent communication skills.\r\n\r\n[b]Novovu is building a next-generation UGC Sandbox Platform and is looking for talent to help accomplish that![/b]\r\n\r\n[b]Must have experience:[/b]\r\n• Discord\r\n• HTML, CSS, and client-side JavaScript\r\n\r\n[b]Bonus Experience:[/b]\r\n• Trello\r\n• GitLab Infrastructure\r\n\r\nWill report to the Lead Web Developer @ Novovu. This position is preferred to be volunteer, but will pay at fixed rates if need be.\r\n\r\nIf you have any questions, please ask Novovu management via Discord or email.'
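The description above carries BBCode-style tags such as `[b]` and `[/b]`, which would need stripping before counting tokens. Here is a rough sketch of the tokenize-and-group idea on made-up rows. It is not the final approach: `rows`, `TAG_RE`, `TOKEN_RE`, and `tokenize` are all hypothetical names, and the tag pattern is an assumption based on the tags visible in the sample rows.

```python
import re
from collections import Counter, defaultdict

# Hypothetical (jobTitle, description) pairs with BBCode-style markup
rows = [
    ('Frontend Developer',
     '[b]Must have experience:[/b]\r\n• HTML, CSS, and client-side JavaScript'),
    ('Frontend Developer',
     '[heading2]Skills[/heading2]\r\nJavaScript and CSS'),
    ('Data Analyst',
     '[b]Requirements[/b]\r\nSQL, Excel, and Python'),
]

# Assumed tag shapes: [b], [/b], [heading2], [size=4]
TAG_RE = re.compile(r'\[/?[a-z0-9]+(?:=[^\]]*)?\]', re.IGNORECASE)
# A token starts with a letter and may contain digits, '+' or '#' (e.g. c++, c#)
TOKEN_RE = re.compile(r'[a-z][a-z0-9+#]*')

def tokenize(description):
    """Strip markup tags, lowercase the text, and return the word tokens."""
    text = TAG_RE.sub(' ', description).lower()
    return TOKEN_RE.findall(text)

# One Counter of tokens per job title
token_counts = defaultdict(Counter)
for title, description in rows:
    token_counts[title].update(tokenize(description))

print(token_counts['Frontend Developer'].most_common(3))
```

Common words such as "and" will rank alongside the skills, so a stop-word list would probably be needed before the most frequent tokens are useful.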