### What kind of skills should you obtain to get lucrative jobs?

OK, I admit it: I am greedy, and I want a good salary. So I decided to use natural language processing techniques to find out which skills I should have in order to earn a high salary.

#### Strategy

- Check the distribution of salaries of jobs in New York City.
- Check text descriptions of the requirements for jobs and preferred skills.
- Identify words specific to jobs whose salaries are in the top 25% (above the 75th percentile), using TF-IDF.

#### Reference

- Sebastian Raschka and Vahid Mirjalili, Python Machine Learning: Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow, 2nd Edition. (Chapter 8, Applying Machine Learning to Sentiment Analysis)

## 1. Import libraries

In [None]:
import numpy as np
import pandas as pd
import sklearn as sk
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
import re

## 2. Load data

In [None]:
df = pd.read_csv('../input/nyc-jobs.csv')
df.info()

In [None]:
pd.set_option("display.max_columns", 28)
df.head()

## 3. Feature selection

We use only the rows that satisfy all of the following conditions.

- 'Salary Range From' > 0
- 'Salary Range To' > 0
- 'Salary Frequency' = Annual
- 'Full-Time/Part-Time indicator' = F (full time)
- 'Minimum Qual Requirements' is not NaN
- 'Preferred Skills' is not NaN

In [None]:
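# Drop rows with missing values in any of the columns we rely on.
df = df.dropna(subset=['Salary Range From', 'Salary Range To', 'Salary Frequency', 'Full-Time/Part-Time indicator', 'Minimum Qual Requirements', 'Preferred Skills'])
# Keep annually paid, full-time jobs with a positive salary range; copy so that adding columns later does not trigger chained-assignment warnings.
df = df[(df['Salary Frequency'] == 'Annual') & (df['Full-Time/Part-Time indicator'] == 'F') & (df['Salary Range From'] > 0) & (df['Salary Range To'] > 0)].copy()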

df.info()

We got 2271 samples.

#### Check the distribution of salaries

In [None]:
df.describe()

In [None]:
plt.hist(df['Salary Range From'], bins=50, alpha=0.5, color='r', label='Salary Range From')
plt.hist(df['Salary Range To'], bins=50, alpha=0.5, color='b', label='Salary Range To')
plt.xlabel('Salary ($/year)')
plt.title('Distribution of salary')

plt.axvline(df['Salary Range From'].quantile(.75), color='r')
plt.axvline(df['Salary Range To'].quantile(.75), color='b')

plt.legend()
plt.show()

We can see the following results. The vertical lines in the plot mark the 75th percentiles.

- The 75th percentile of "Salary Range From" is $73576/year.

- The 75th percentile of "Salary Range To" is $108653/year.

Therefore, we decided to identify the following words.

1. Words in "Minimum Qual Requirements" for jobs whose minimum salary is more than $73576/year.

1. Words in "Minimum Qual Requirements" for jobs whose maximum salary is more than $108653/year.

1. Words in "Preferred Skills" for jobs whose minimum salary is more than $73576/year.

1. Words in "Preferred Skills" for jobs whose maximum salary is more than $108653/year.

#### Tagging

We add the following two tags to the dataframe.

- Min_Salary75: 1 for "Salary Range From" > $73576/year, otherwise 0.

- Max_Salary75: 1 for "Salary Range To" > $108653/year, otherwise 0.
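
As a side note, the thresholds do not have to be typed in by hand. The sketch below, on a small made-up table (the salary values are hypothetical, not taken from the NYC dataset), shows one way to derive the 75th-percentile thresholds with `quantile` and build the 1/0 tags from a boolean comparison.

```python
import pandas as pd

# Toy salaries (hypothetical values, not from the NYC jobs data).
toy = pd.DataFrame({
    "Salary Range From": [40000, 50000, 60000, 70000, 90000],
    "Salary Range To": [60000, 70000, 90000, 110000, 150000],
})

# Derive the thresholds from the data instead of hard-coding them.
min75 = toy["Salary Range From"].quantile(0.75)
max75 = toy["Salary Range To"].quantile(0.75)

# A boolean comparison cast to int gives the 1/0 tag in one step.
toy["Min_Salary75"] = (toy["Salary Range From"] > min75).astype(int)
toy["Max_Salary75"] = (toy["Salary Range To"] > max75).astype(int)

print(min75, max75)
print(toy)
```

Because the comparison is strict (`>`), a job whose salary equals the threshold is tagged 0, which matches the definition of the tags above.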

In [None]:
# 75th-percentile thresholds found above.
min75 = 73576
max75 = 108653

# Tag each job: 1 if the salary exceeds the threshold, otherwise 0.
df['Min_Salary75'] = (df['Salary Range From'] > min75).astype(int)
df['Max_Salary75'] = (df['Salary Range To'] > max75).astype(int)

df.head()

## 4. Text cleansing

Here we apply the following preprocessing to the text in "Minimum Qual Requirements" and "Preferred Skills".

- Replace every character other than a letter or a space with a space.
- Convert all characters to lowercase.
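
To make the effect of these two steps concrete, here is a minimal sketch on a single made-up sentence (the sample text is hypothetical). It uses the same regular expression as the cell below.

```python
import pandas as pd

# A made-up requirement text with digits, punctuation and mixed case.
sample = pd.Series(["Proficiency in C++, SQL & .NET (3+ years)"])

# Replace everything that is not a letter or a space, then lowercase.
cleaned = sample.replace("[^a-zA-Z ]", " ", regex=True).str.lower()

# split() collapses the runs of spaces left behind by the replacement.
print(cleaned[0].split())
```

Note the side effects: digits disappear, and names that rely on symbols lose them ("C++" is reduced to "c", ".NET" to "net").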

In [None]:
def clensing(series):
    # Replace everything except letters and spaces with a space, then lowercase.
    cleaned = series.replace('[^a-zA-Z ]', ' ', regex=True)
    cleaned = cleaned.str.lower()
    return cleaned

df['MinQualReq'] = clensing(df['Minimum Qual Requirements'])
df['PrefSkills'] = clensing(df['Preferred Skills'])

In [None]:
df['PrefSkills'].head()

## 5. Calculate tf-idf and identify important words

Calculate tf-idf for the following four patterns, and determine the important words.

1. Words in "Minimum Qual Requirements" for jobs whose minimum salary is more than $73576/year.

1. Words in "Minimum Qual Requirements" for jobs whose maximum salary is more than $108653/year.

1. Words in "Preferred Skills" for jobs whose minimum salary is more than $73576/year.

1. Words in "Preferred Skills" for jobs whose maximum salary is more than $108653/year.
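
The idea behind each of these comparisons can be sketched on toy data. In the example below, the two strings are hypothetical stand-ins for "all text from high-salary jobs" and "all text from low-salary jobs". The vectorizer settings mirror the ones used in this notebook, and the `table` layout is just one possible way to arrange the result.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Two hypothetical documents, one per salary group.
docs = [
    "sql java security sql experience",       # high-salary group
    "clerical filing experience experience",  # low-salary group
]

count = CountVectorizer()
tfidf = TfidfTransformer(use_idf=True, norm="l2", smooth_idf=True)
weights = tfidf.fit_transform(count.fit_transform(docs)).toarray()

# Order the words by their column id so they line up with the matrix columns.
words = sorted(count.vocabulary_, key=count.vocabulary_.get)

table = pd.DataFrame(weights.T, index=words, columns=["high", "low"])
table["diff"] = table["high"] - table["low"]
print(table.sort_values("diff", ascending=False))
```

Words that are frequent only in the high-salary text end up with a large positive `diff`, while a word that is shared by both groups but more frequent in the low-salary text gets a negative one.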

#### CASE 1: Minimum qualification requirements for jobs with high minimum salary.

First, we calculate tf-idf.
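
For reference, with the settings passed to `TfidfTransformer` in this notebook (`use_idf=True`, `smooth_idf=True`, `norm='l2'`), scikit-learn documents the weighting as follows. Here $n$ is the number of documents ($n = 2$ in our case), $\mathrm{tf}(t, d)$ is the raw count of term $t$ in document $d$, $\mathrm{df}(t)$ is the number of documents containing $t$, and $v_d$ is the vector of tf-idf values for document $d$.

```latex
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d)\,\mathrm{idf}(t),
\qquad
v_d \leftarrow \frac{v_d}{\lVert v_d \rVert_2}
```

With only two documents, a word that appears in both gets $\mathrm{idf} = 1$, while a word that appears in only one gets $\ln(3/2) + 1 \approx 1.41$. Each document vector is then scaled to unit length.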

In [None]:
def calc_tfidf(docs, count, tfidf):
    # Build the bag-of-words counts, then convert them to tf-idf weights.
    bag = count.fit_transform(docs)
    t = tfidf.fit_transform(bag)
    return bag, t

def conc_text(texts, flags):
    # Concatenate all texts into two documents:
    # flag > 0 (high salary) and everything else (low salary).
    pos = " ".join(texts[flags > 0])
    neg = " ".join(texts[~(flags > 0)])
    return [pos, neg]

tfidf = TfidfTransformer(use_idf = True, norm ='l2', smooth_idf = True)
count = CountVectorizer()

docs1 = conc_text(df['MinQualReq'], df['Min_Salary75'])
bag1, tfidf1 = calc_tfidf(docs1, count, tfidf)

Check the shape of the bag-of-words matrix.

In [None]:
bag1.shape

We found that the vocabulary contains 1636 distinct words.

We can check the tf-idf values of each word for the two text types: high-salary jobs and low-salary jobs.

In [None]:
print(tfidf1.toarray())

Then combine the vocabulary and the tf-idf values into a single dataframe.

In [None]:
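def stats(count, tfidf):
    # One row per word, sorted by column id so rows line up with the tf-idf matrix.
    df1 = pd.DataFrame(list(count.vocabulary_.items()), columns=['word', 'id'])
    df1 = df1.sort_values('id').reset_index(drop=True)
    # Transpose so that each row is a word and each column is a document.
    dfx = pd.DataFrame(tfidf.toarray().T)
    dfx.columns = ['tf-idf for high salary', 'tf-idf for low salary']
    df1 = pd.concat([df1, dfx], axis=1)
    # Positive diff: the word is weighted more heavily in high-salary jobs.
    df1['diff'] = df1['tf-idf for high salary'] - df1['tf-idf for low salary']
    return df1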

df1 = stats(count,tfidf1)

In [None]:
df1.head()

Identify the words with the largest difference in tf-idf values between high-salary and low-salary jobs.

In [None]:
df1.nlargest(20,'diff')

#### Consideration

- We found a lot of words related to IT systems (computer, data, programming, professional, software).

#### CASE 2: Minimum qualification requirements for jobs with high maximum salary.

In [None]:
docs2 = conc_text(df['MinQualReq'], df['Max_Salary75'])
bag2, tfidf2 = calc_tfidf(docs2, count, tfidf)
df2 = stats(count, tfidf2)
df2.nlargest(20,'diff')

#### Consideration

- We found a lot of words related to management (managerial, executive, administrative, management).
- There are also some words related to engineering (engineering, engineer, data, computer).

#### CASE 3: Preferred skills for jobs with high minimum salary.

In [None]:
docs3 = conc_text(df['PrefSkills'], df['Min_Salary75'])
bag3, tfidf3 = calc_tfidf(docs3, count, tfidf)
df3 = stats(count, tfidf3)
df3.nlargest(20,'diff')

#### Consideration

- There are a lot of words related to IT systems (security, development, SQL, application, web, server, systems, services, net, java, applications, developing).

#### CASE 4: Preferred skills for jobs with high maximum salary.

In [None]:
docs4 = conc_text(df['PrefSkills'], df['Max_Salary75'])
bag4, tfidf4 = calc_tfidf(docs4, count, tfidf)
df4 = stats(count, tfidf4)
df4.nlargest(20,'diff')

#### Consideration

- We found a lot of words related to management (management, supervisory, business, managerial).
- We also have many words related to engineering (security, application, engineering, systems).

## 6. Conclusion

We found the following patterns.

- Software engineering skills appear to help in getting a good minimum salary. Extensive knowledge of specific areas (e.g. SQL, Java, security) seems to help even more.
- If you want a higher maximum salary, engineering skills alone are not enough. You also need managerial/administrative skills, knowledge, and experience.

In short, it turned out that there is no way for me to get easy money... I should work really hard at software engineering and management. Or should I buy some lottery tickets?