# Conclusion

- Imported data from the file 'postings_v3.csv'
- Created a new field, 'has_salary'
- Filled missing 'salary_range' values from 'description', or with "Not Available" when no salary was found
- No null values remain in the dataframe


# Import Libraries & Packages

In [45]:
import pandas as pd
import numpy as np
import nltk
import re

# Import Dataset

In [46]:
df = pd.read_csv("./postings_v3.csv")
df.head(3)

Unnamed: 0,title,salary_range,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent,Country,State,City,has_company_profile,has_required_education
0,Marketing Intern,,"Food52, a fast-growing, James Beard Award-winn...",Experience with content management systems a m...,No Listed Benefits,0,1,0,Other,Internship,Unknown,Unknown,Marketing,0,US,NY,New York,1,0
1,Customer Service - Cloud Video Production,,Organised - Focused - Vibrant - Awesome!Do you...,What we expect from you:Your key responsibilit...,What you will get from usThrough being part of...,0,1,0,Full-time,Not Applicable,Unknown,Marketing and Advertising,Customer Service,0,NZ,,Auckland,1,0
2,Commissioning Machinery Assistant (CMA),,"Our client, located in Houston, is actively se...",Implement pre-commissioning and commissioning ...,No Listed Benefits,0,1,0,Unknown,Not Applicable,Unknown,Unknown,Assistant,0,US,IA,Wever,1,0


In [47]:
# Create a new feature that flags whether a posting has a salary.
# Initialize it with 0, meaning the posting has no salary.
df['has_salary'] = 0

In [48]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17880 entries, 0 to 17879
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   title                   17880 non-null  object
 1   salary_range            2868 non-null   object
 2   description             17880 non-null  object
 3   requirements            17880 non-null  object
 4   benefits                17880 non-null  object
 5   telecommuting           17880 non-null  int64 
 6   has_company_logo        17880 non-null  int64 
 7   has_questions           17880 non-null  int64 
 8   employment_type         17880 non-null  object
 9   required_experience     17880 non-null  object
 10  required_education      17880 non-null  object
 11  industry                17880 non-null  object
 12  function                17880 non-null  object
 13  fraudulent              17880 non-null  int64 
 14  Country                 17880 non-null  object
 15  St

# Extract salary from column 'description'
#### Run the extraction only if salary_range has no value
#### If a salary is found in column 'description', set has_salary to 1
#### If a salary is found in column 'description', write it to salary_range
#### If no salary is found, set salary_range to "Not Available"
#### If salary_range already has a value, set has_salary to 1
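The rules above can also be written without an explicit loop. The sketch below is only an illustration on a made-up three-row frame: `toy` and its values are hypothetical, and a plain dollar-amount regex stands in for the NLTK-based extraction used in this notebook.

```python
import pandas as pd

# Toy frame standing in for the real postings data (values are made up).
toy = pd.DataFrame({
    "salary_range": ["20000-28000", None, None],
    "description": [
        "Full-time role.",
        "The salary for this job is $60,000 per year.",
        "No salary is mentioned in this job posting.",
    ],
})

# Hypothetical stand-in for the extraction step: dollar amounts only.
extracted = toy["description"].str.extract(r"(\$\d+(?:,\d{3})*)", expand=False)

# Only rows with a missing salary_range receive an extracted value.
missing = toy["salary_range"].isna()
toy.loc[missing, "salary_range"] = extracted[missing]

# Flag rows that now have a salary, then fill the rest with the placeholder.
toy["has_salary"] = toy["salary_range"].notna().astype(int)
toy["salary_range"] = toy["salary_range"].fillna("Not Available")

print(toy[["salary_range", "has_salary"]])
```

Setting `has_salary` before the `fillna` call matters: once the placeholder is written, every row is non-null and the flag could no longer be derived from `notna()`.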

In [49]:
size = df.shape[0]

# Define a regular expression pattern to match salary information
salary_pattern = r'\$?\d+(?:,\d{3})*(?:\.\d+)?(?:\s*[kKmMbB])?(?:\s*\(\s*[kK]\s*\))?'

# Chunk grammar used to pick out candidate salary phrases; built once, outside the loop
grammar = "salary: {(<NN.*>|<CD>)<.*>*<IN>?<.*>*<CD><.*>*}"
cp = nltk.RegexpParser(grammar)

# Loop over each row in the dataset
for i in range(size):
    if pd.isnull(df.loc[i, 'salary_range']):
        description = df.loc[i, 'description']

        # Tokenize the description and tag each token with its part of speech
        tokens = nltk.word_tokenize(description)
        tagged_tokens = nltk.pos_tag(tokens)

        # Parse the tagged tokens using the grammar
        tree = cp.parse(tagged_tokens)

        # Extract salary information from the parsed tree
        salary_info = []
        for subtree in tree.subtrees():
            if subtree.label() == 'salary':
                salary_phrase = ' '.join(token[0] for token in subtree.leaves())
                salary = re.findall(salary_pattern, salary_phrase)
                if salary:
                    salary_info.append(salary[0])

        # Store the first extracted salary, or mark the row as "Not Available".
        # Assign through .loc so the write goes to df itself, not a temporary copy.
        if salary_info:
            df.loc[i, 'has_salary'] = 1
            df.loc[i, 'salary_range'] = salary_info[0]
        else:
            df.loc[i, 'salary_range'] = "Not Available"
    else:
        # salary_range was already present in the source data
        df.loc[i, 'has_salary'] = 1
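Because `salary_pattern` makes both the `$` prefix and the magnitude suffix optional, it accepts any run of digits. A quick check on plain strings shows what that means for `has_salary`. The sketch below applies the same pattern with `re.findall` to four sentences and skips the NLTK chunking step entirely. The last two sentences are made up for illustration, and the `startswith("$")` filter is only an assumption about one possible tightening, not part of this notebook's pipeline.

```python
import re

# Same pattern as in the extraction cell.
salary_pattern = r'\$?\d+(?:,\d{3})*(?:\.\d+)?(?:\s*[kKmMbB])?(?:\s*\(\s*[kK]\s*\))?'

samples = [
    "The salary for this job is $60,000 per year.",
    "We offer a competitive salary of $75k per annum.",
    "At least 5 years of experience is required.",   # made-up sentence
    "We offer 401(k) and profit-sharing programs.",  # made-up sentence
]

# Everything the pattern matches in each sentence.
matches = {text: re.findall(salary_pattern, text) for text in samples}

# Illustrative tightening (an assumption): keep only matches with a dollar sign.
dollar_only = {text: [m for m in found if m.startswith("$")]
               for text, found in matches.items()}

for text in samples:
    print(matches[text], "->", dollar_only[text], "|", text)
```

The last two sentences contain no salary, yet the pattern still returns `5` and `401(k)`, so a row can end up with `has_salary` set to 1 on the strength of any number in its description.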


In [50]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17880 entries, 0 to 17879
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   title                   17880 non-null  object
 1   salary_range            17880 non-null  object
 2   description             17880 non-null  object
 3   requirements            17880 non-null  object
 4   benefits                17880 non-null  object
 5   telecommuting           17880 non-null  int64 
 6   has_company_logo        17880 non-null  int64 
 7   has_questions           17880 non-null  int64 
 8   employment_type         17880 non-null  object
 9   required_experience     17880 non-null  object
 10  required_education      17880 non-null  object
 11  industry                17880 non-null  object
 12  function                17880 non-null  object
 13  fraudulent              17880 non-null  int64 
 14  Country                 17880 non-null  object
 15  St

In [51]:
# Percentages of Null values in dataframe df
round(df.isnull().mean() * 100, 2)

title                     0.0
salary_range              0.0
description               0.0
requirements              0.0
benefits                  0.0
telecommuting             0.0
has_company_logo          0.0
has_questions             0.0
employment_type           0.0
required_experience       0.0
required_education        0.0
industry                  0.0
function                  0.0
fraudulent                0.0
Country                   0.0
State                     0.0
City                      0.0
has_company_profile       0.0
has_required_education    0.0
has_salary                0.0
dtype: float64

In [52]:
# Write a new CSV file that includes has_salary
# and the filled-in salary_range
df.to_csv('postings_v4.csv', index=False)

In [56]:
size = df.shape[0]

# Count the rows whose salary_range holds the "Not Available" placeholder
count = int((df['salary_range'] == "Not Available").sum())

print("Total 'Not Available' Salary Range:", count, "/", size, "=", (count / size) * 100)

Total 'Not Available' Salary Range: 7714 / 17880 = 43.143176733780756
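The counts printed in this notebook can be combined to see where the salary values came from. The sketch below uses only figures shown above (17880 rows, 2868 non-null `salary_range` values before extraction, 7714 "Not Available" rows). It assumes no original `salary_range` value was literally "Not Available", and the variable names are illustrative.

```python
total_rows = 17880        # RangeIndex size reported by df.info()
originally_filled = 2868  # non-null salary_range before extraction
not_available = 7714      # rows set to "Not Available"

# Rows that went through the extraction step.
originally_missing = total_rows - originally_filled

# Rows where the extraction step wrote a value into salary_range.
filled_from_description = originally_missing - not_available

# Rows that end up with has_salary == 1 (original plus extracted).
rows_with_salary_flag = total_rows - not_available

print("Originally missing:", originally_missing)            # 15012
print("Filled from description:", filled_from_description)  # 7298
print("Rows with has_salary == 1:", rows_with_salary_flag)  # 10166
print("Share flagged: %.2f%%" % (100 * rows_with_salary_flag / total_rows))
```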


# Code Sample

In [None]:
# Look at one real description to add to the sample data below and test
df.description[3]

# Define a sample dataset
data = [{'description': 'The salary for this job is $60,000 per year.'},
        {'description': 'We offer a competitive salary of $75k per annum.'},
        {'description': 'No salary is mentioned in this job posting.'},
        {'description': 'THE COMPANY: ESRI – Environmental Systems Research InstituteOur passion for improving quality of life through geography is at the heart of everything we do.\xa0 Esri’s geographic information system (GIS) technology inspires and enables governments, universities and businesses worldwide to save money, lives and our environment through a deeper understanding of the changing world around them.Carefully managed growth and zero debt give Esri stability that is uncommon in today\'s volatile business world.\xa0 Privately held, we offer exceptional benefits, competitive salaries, 401(k) and profit-sharing programs, opportunities for personal and professional growth, and much more.'}]

# Define a regular expression pattern to match salary information
salary_pattern = r'\$?\d+(?:,\d{3})*(?:\.\d+)?(?:\s*[kKmMbB])?(?:\s*\(\s*[kK]\s*\))?'

# Loop over each item in the dataset
for item in data:
    description = item['description']
    
    # Tokenize the description
    tokens = nltk.word_tokenize(description)
    
    # Perform part-of-speech tagging on the tokens
    tagged_tokens = nltk.pos_tag(tokens)
    
    # Define a grammar to identify salary information
    grammar = "salary: {(<NN.*>|<CD>)<.*>*<IN>?<.*>*<CD><.*>*}"
    cp = nltk.RegexpParser(grammar)
    
    # Parse the tagged tokens using the grammar
    tree = cp.parse(tagged_tokens)
    
    # Extract salary information from the parsed tree
    salary_info = []
    for subtree in tree.subtrees():
        if subtree.label() == 'salary':
            salary_phrase = ' '.join([token[0] for token in subtree.leaves()])
            salary = re.findall(salary_pattern, salary_phrase)
            if salary:
                salary_info.append(salary[0])
    
    # Print the extracted salary information for this item
    if salary_info:
        print(f"Salary information found: {salary_info[0]}")
    else:
        print("No salary information found.")