<a href="https://colab.research.google.com/github/EduHdzVillasana/Technical-Test-Torre/blob/main/Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Technical Test Torre
----
*Eduardo Alan Hernandez Villasana*

## Data Extraction

The data was downloaded from [data.world](https://data.world/promptcloud/50000-job-board-records-from-reed-uk).

Reed is one of the top employment agencies based in the United Kingdom. This dataset contains 50,000 records of the latest job postings on Reed UK.

The data was extracted on March 13th, 2018 and contains job postings from the preceding 15 days. The following data fields are included in the dataset:

* category
* city
* state
* company name
* job title
* job description
* job requirement
* job type
* salary offered
* posting date

In [1]:
# Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
data_url = "https://query.data.world/s/rut5pr5nm5i4onzlfp6xjgcsw3rvu4"
df_raw = pd.read_csv(data_url)

## Data Exploration

In [3]:
df_raw.sample(5)

| | category | city | company_name | geo | job_board | job_description | job_requirements | job_title | job_type | post_date | salary_offered | state |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11494 | recruitment consultancy jobs | Birmingham | Amanda Wright Recruitment | uk | reed | Apply now Education Recruitment Consultant Bi... | NaN | Education Recruitment Consultant | Permanent, full-time | 3/8/2018 | £25,000 - £35,000 per annum, negotiable | West Midlands |
| 27956 | hr jobs | Woodley | Mabella Recruitment | uk | reed | Apply now HR Assistant / HR Administrator Woo... | Required skills Administrative Support Contra... | HR Admin / HR Assistant | Permanent, full-time | 3/6/2018 | £18,000 - £23,000 per annum | Berkshire |
| 28270 | estate agent jobs | Bromley | 10Ten Recruitment | uk | reed | Apply now This is a fantastic opportunity to ... | Required skills Experienced Property Manager | Property Manager | Permanent, full-time | 2/26/2018 | £22,000 - £26,000 per annum | Kent |
| 21821 | fmcg jobs | East London | Lime Talent Ltd | uk | reed | Apply now This is a field based role, so cand... | NaN | Key Account Manager- Drinks- East London | Permanent, full-time | 2/28/2018 | £28,000 - £35,000 per annum | London |
| 723 | retail jobs | Stockport | Zachary Daniels | uk | reed | Apply now We are exclusively recruiting for P... | NaN | Sales Assistant - NEW SHOWROOM | Permanent, full-time | 3/6/2018 | £8.00 - £9.00 per hour | Cheshire |


In [4]:
df_raw.dtypes

category            object
city                object
company_name        object
geo                 object
job_board           object
job_description     object
job_requirements    object
job_title           object
job_type            object
post_date           object
salary_offered      object
state               object
dtype: object

In [5]:
df_raw.isnull().sum()

category                0
city                    0
company_name            0
geo                     0
job_board               0
job_description         0
job_requirements    29452
job_title               0
job_type                0
post_date               0
salary_offered          0
state                  20
dtype: int64

In [6]:
df_raw.shape

(50000, 12)

In [7]:
df_raw.describe()

| | category | city | company_name | geo | job_board | job_description | job_requirements | job_title | job_type | post_date | salary_offered | state |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 20548 | 50000 | 50000 | 50000 | 50000 | 49980 |
| unique | 37 | 2918 | 5166 | 1 | 1 | 42057 | 14887 | 29155 | 9 | 66 | 7345 | 167 |
| top | health jobs | London | Hays Specialist Recruitment Limited | uk | reed | Apply on employer's website Add an annual tur... | Required skills Recruitment | Administrator | Permanent, full-time | 3/7/2018 | Salary negotiable | London |
| freq | 1930 | 4349 | 1830 | 50000 | 50000 | 85 | 123 | 162 | 36864 | 8472 | 4539 | 5900 |


In [8]:
df_raw["city"].sort_values().unique()[[937,1012]]

array(['FRANKFURT', 'Frankfurt'], dtype=object)
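
The `FRANKFURT` / `Frankfurt` pair above was picked out by position. As a rough sketch of a more general check, the distinct spellings can be grouped by their lower-cased form. The toy series below is an assumption standing in for `df_raw["city"]`; only the two Frankfurt spellings come from the output above.

```python
import pandas as pd

# Toy stand-in for df_raw["city"]; apart from the Frankfurt pair, values are illustrative.
cities = pd.Series(["FRANKFURT", "Frankfurt", "London", "london", "Leeds"])

# Collect the distinct spellings that share the same lower-cased form.
variants = cities.drop_duplicates().groupby(cities.str.lower()).agg(list)

# Keep only the names that occur with more than one spelling.
case_duplicates = variants[variants.str.len() > 1]
print(case_duplicates)
```

On the real column, the same idea would list every city whose name differs only by capitalization.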

In [9]:
df_raw["job_type"].unique()

array(['Permanent, full-time', 'Permanent, full-time or part-time',
       'Permanent, part-time', 'Contract, full-time',
       'Temporary, part-time', 'Temporary, full-time or part-time',
       'Temporary, full-time', 'Contract, full-time or part-time',
       'Contract, part-time'], dtype=object)

## Data Cleaning
* The `geo` and `job_board` columns will be dropped because they contain a single unique value across all rows.
* Convert cities and states to lower case, because some values appear more than once with different capitalization (for example `FRANKFURT` and `Frankfurt`).
* Extract key words from `job_requirements` to get a list of requirements.
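
The first point can also be checked programmatically rather than read off `describe()`. The following is a minimal sketch on a toy frame (an assumption; only the column names come from the notebook) that finds columns with a single distinct value and drops them.

```python
import pandas as pd

# Toy frame standing in for df_raw; only the column names come from the notebook.
toy = pd.DataFrame({
    "geo": ["uk", "uk", "uk"],
    "job_board": ["reed", "reed", "reed"],
    "city": ["London", "Leeds", "Bromley"],
})

# A column with one distinct value (NaN counted as a value) carries no information.
constant_columns = [col for col in toy.columns if toy[col].nunique(dropna=False) == 1]
trimmed = toy.drop(columns=constant_columns)
print(constant_columns)
```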

### Dropping unnecessary columns

In [10]:
df = df_raw.drop(columns=["geo", "job_board"])

### Transforming string columns to lower case.

In [11]:
df["city"] = df["city"].str.lower()
df["state"] = df["state"].str.lower()
df["job_requirements"] = df["job_requirements"].str.lower()
df["job_title"] = df["job_title"].str.lower()
df["company_name"] = df["company_name"].str.lower()
df["category"] = df["category"].str.lower()

In [12]:
len(df[df["salary_offered"] == " Salary not specified "])

317

In [13]:
df["salary_offered"] = df["salary_offered"].str.strip()

In [14]:
len(df[df["salary_offered"] == "Salary not specified"])

317
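
The two counts above show that `salary_offered` values carry leading and trailing spaces. As a hedged sketch, the same check can be run over several columns at once by comparing each value with its stripped form. The toy frame is an assumption, not data from the notebook.

```python
import pandas as pd

# Toy frame; the padded salary strings mimic the pattern seen in the cell above.
toy = pd.DataFrame({
    "salary_offered": [" Salary not specified ", " Salary negotiable "],
    "city": ["London", "Leeds"],
})

# Count, per column, how many values change when surrounding whitespace is removed.
padded = {col: int((toy[col] != toy[col].str.strip()).sum()) for col in toy.columns}
print(padded)
```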

### Getting key words of `job_requirements` 

In [17]:
df["job_requirements"].sample(3)

10058                                                  NaN
11684                                                  NaN
32816     required skills rgn senior dementia staff nur...
Name: job_requirements, dtype: object

The prefix "required skills" is repeated in all non-NaN rows.

In [18]:
df["job_requirements"] = df["job_requirements"].str.replace("required skills ","")

In [19]:
df["job_requirements"].sample(5)

21053                                                  NaN
12434     branch manager recruitment consultant interna...
10237                                                  NaN
44285                    eyfs teacher early years teacher 
24617                                                  NaN
Name: job_requirements, dtype: object
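
The replacement above removes the phrase wherever it occurs in the string. If only the leading prefix should go, one option is an anchored regular expression. This is a sketch on a toy series (the values are shortened from the samples shown), not the method used in the notebook.

```python
import pandas as pd

# Toy stand-in for df["job_requirements"] after lower-casing.
reqs = pd.Series([
    "required skills rgn senior dementia staff",
    None,
    "eyfs teacher early years teacher ",
])

# ^ anchors the match to the start, so later occurrences would be left alone.
cleaned = reqs.str.replace(r"^\s*required skills\s*", "", regex=True)
print(cleaned)
```

Missing values pass through unchanged, so this step can run before or after the `fillna` call.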

In [21]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
import re
from nltk.corpus import stopwords

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [23]:
df["job_requirements"] = df["job_requirements"].fillna("Unrequirement")

In [26]:
df["job_requirements"].sample(5)

681                                          Unrequirement
19129     sia reception patrols distribution gatehouse ...
10844                              clinical nurse nursing 
30793                                        Unrequirement
11473                                        Unrequirement
Name: job_requirements, dtype: object

In [27]:
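df_req = df["job_requirements"]
df_req = df_req.str.lower()
df_req = df_req.str.strip()
# In pandas 2.0 and later str.replace matches literally by default, so regex=True is explicit
df_req = df_req.str.replace(r'[^\w\s]', '', regex=True)  # drop punctuation
df_req = df_req.str.replace(r'\d', '', regex=True)       # drop digits
df_req = df_req.str.replace(r'\n', '', regex=True)       # drop newlines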

In [28]:
df_req.sample(3)

39535    loans operations administration regulatory sta...
20062    finance fundraising legal management marketing...
1946                                                    hr
Name: job_requirements, dtype: object
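
To make the effect of each cleaning step visible, here is a small sketch of the same kind of pipeline on two made-up strings (both are assumptions, not rows from the dataset). It differs from the cell above in one respect: the last step collapses runs of whitespace instead of deleting newlines, so words separated only by a newline are not glued together.

```python
import pandas as pd

# Made-up requirement strings, used only to illustrate the steps.
raw = pd.Series(["  Clinical Nurse, Nursing (Band 5)\n", "C++ / SQL; 3 years"])

clean = (
    raw.str.lower()
       .str.strip()
       .str.replace(r"[^\w\s]", "", regex=True)  # drop punctuation
       .str.replace(r"\d", "", regex=True)       # drop digits
       .str.replace(r"\s+", " ", regex=True)     # collapse whitespace runs
       .str.strip()
)
print(clean.tolist())
```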

In [29]:
tokenized = df_req.apply(nltk.word_tokenize)

In [30]:
tokenized.sample(5)

37899                                      [unrequirement]
21324    [spa, beauty, therapy, massage, spa, therapist...
22847    [vna, flt, counter, balance, counterbalance, f...
2855                                       [unrequirement]
47860    [body, communication, contract, management, co...
Name: job_requirements, dtype: object

In [31]:
all_words = tokenized.sum()
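
Summing a series of lists builds a new list at every step, which can become slow on 50,000 rows. A possible alternative, sketched here on a toy series (an assumption standing in for `tokenized`), is `itertools.chain.from_iterable`, which produces the same flat list in a single pass.

```python
from itertools import chain

import pandas as pd

# Toy stand-in for the tokenized series.
tokenized = pd.Series([["spa", "beauty", "therapy"], ["unrequirement"], ["hr"]])

# Flatten the per-row token lists into one list without repeated list copies.
all_words = list(chain.from_iterable(tokenized))
print(all_words)
```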

In [32]:
english_stop_words = stopwords.words('english')
all_words_except_stop_words = [word for word in all_words if word not in english_stop_words]

In [35]:
len(all_words_except_stop_words)

180812
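
`stopwords.words('english')` returns a list, so each `not in` test scans it from the start. Converting it to a set first gives the same result with faster lookups. The sketch below uses a four-word stand-in for the NLTK list (an assumption, chosen so the example runs without downloading the corpus).

```python
# Hypothetical stand-in for stopwords.words('english'); the real NLTK list is much longer.
english_stop_words = ["and", "of", "the", "in"]
stop_word_set = set(english_stop_words)

# Made-up token list for illustration.
all_words = ["head", "of", "sales", "and", "marketing", "in", "the", "nhs"]

# Same filter as in the notebook, with set membership instead of list membership.
filtered = [word for word in all_words if word not in stop_word_set]
print(filtered)
```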

In [36]:
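# freq_dist is not defined in any cell shown (this cell raised a NameError); assuming an nltk.FreqDist was intended
freq_dist = nltk.FreqDist(all_words_except_stop_words)
most_common_20 = np.array(list(map(lambda x: list(x), freq_dist.most_common(20))))
y = np.array(most_common_20[:,1], dtype=int)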
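
One thing to keep in mind when counting: every row that had no requirements was filled with the placeholder `Unrequirement`, so the token `unrequirement` appears once for each of the 29,452 missing rows and would likely top any frequency list. The sketch below uses `collections.Counter` (which `nltk.FreqDist` subclasses) on a made-up word list to show one way of excluding the placeholder before ranking. The variable names are illustrative, not taken from the notebook.

```python
from collections import Counter

# Made-up tokens; "unrequirement" mimics the placeholder used for missing rows.
words = ["nurse", "unrequirement", "sales", "nurse", "unrequirement",
         "nurse", "sales", "unrequirement", "hr"]

# Count every token except the placeholder, then take the most frequent ones.
counts = Counter(word for word in words if word != "unrequirement")
top_two = counts.most_common(2)
print(top_two)
```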