# Job Salary Prediction
_Predict the salary of any UK job ad based on its contents_

### Job Data

- **Id**: Identificador para cada job.

- **Title**: Texto livre com o titulo ou resumo da vaga.

- **FullDescription**: Descrição da vaga sem qualquer informação salarial.

- **LocationRaw**: Localização da vaga em texto livre.

- **LocationNormalized**: Localização aproximada a partir da convesao do texto livre.

- **ContractType**: full_time ou part_time.

- **ContractTime**: permanent or contract.

- **Company**: Nome da empresa.

- **Category**: Qual das 30 categorias de trabalho padrão esse anúncio se encaixa, inferida de uma maneira muito confusa com base na origem da origem do anúncio. Sabemos que há muito barulho e erro nesse campo.

- **SalaryRaw**: Descrição salarial em texto livre.

- **SalaryNormalised**: Salario bruto anual. Valor que estamos tentando prever.

- **SourceName**: Nome do site ou anunciante da vaga.

### Location Tree

Este é um conjunto de dados suplementares que descreve o relacionamento hierárquico entre os diferentes locais normalizados mostrados nos dados do trabalho. É provável que existam relações significativas entre os salários dos empregos em uma área geográfica semelhante, por exemplo, os salários médios em Londres e no Sudeste são mais altos do que no resto do Reino Unido.

### Saida


    Id,SalaryNormalized
    13656201,36205
    14663195,74570
    16530664,31910.50
    ... 
    

## Imports

In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

## Dataset

In [2]:
df_location_tree = pd.read_csv('dataset/Location_Tree.csv', header=None, names=['rel'])

In [3]:
df_job_data = pd.read_csv('dataset/Train_rev1.csv')

In [4]:
df_test_rev1 = pd.read_csv('dataset/Test_rev1.csv')

## Informações

### Location Tree

In [95]:
df_location_tree.head()

Unnamed: 0,rel
0,UK~London~East London~Mile End
1,UK~London~East London~Shadwell
2,UK~London~East London~Spitalfields
3,UK~London~East London~Stepney
4,UK~London~East London~Wapping


In [6]:
df_location_tree.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31763 entries, 0 to 31762
Data columns (total 1 columns):
rel    31763 non-null object
dtypes: object(1)
memory usage: 248.2+ KB


### Job Data

In [7]:
df_job_data.head(n=2)

Unnamed: 0,Id,Title,FullDescription,LocationRaw,LocationNormalized,ContractType,ContractTime,Company,Category,SalaryRaw,SalaryNormalized,SourceName
0,12612628,Engineering Systems Analyst,Engineering Systems Analyst Dorking Surrey Sal...,"Dorking, Surrey, Surrey",Dorking,,permanent,Gregory Martin International,Engineering Jobs,20000 - 30000/annum 20-30K,25000,cv-library.co.uk
1,12612830,Stress Engineer Glasgow,Stress Engineer Glasgow Salary **** to **** We...,"Glasgow, Scotland, Scotland",Glasgow,,permanent,Gregory Martin International,Engineering Jobs,25000 - 35000/annum 25-35K,30000,cv-library.co.uk


In [8]:
df_job_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244768 entries, 0 to 244767
Data columns (total 12 columns):
Id                    244768 non-null int64
Title                 244767 non-null object
FullDescription       244768 non-null object
LocationRaw           244768 non-null object
LocationNormalized    244768 non-null object
ContractType          65442 non-null object
ContractTime          180863 non-null object
Company               212338 non-null object
Category              244768 non-null object
SalaryRaw             244768 non-null object
SalaryNormalized      244768 non-null int64
SourceName            244767 non-null object
dtypes: int64(2), object(10)
memory usage: 22.4+ MB


### Test

In [9]:
df_test_rev1.head(n=2)

Unnamed: 0,Id,Title,FullDescription,LocationRaw,LocationNormalized,ContractType,ContractTime,Company,Category,SourceName
0,11888454,Business Development Manager,The Company: Our client is a national training...,"Tyne Wear, North East",Newcastle Upon Tyne,,permanent,Asset Appointments,Teaching Jobs,cv-library.co.uk
1,11988350,Internal Account Manager,The Company: Founded in **** our client is a U...,"Tyne and Wear, North East",Newcastle Upon Tyne,,permanent,Asset Appointments,Consultancy Jobs,cv-library.co.uk


In [10]:
df_test_rev1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 122463 entries, 0 to 122462
Data columns (total 10 columns):
Id                    122463 non-null int64
Title                 122463 non-null object
FullDescription       122463 non-null object
LocationRaw           122463 non-null object
LocationNormalized    122463 non-null object
ContractType          33013 non-null object
ContractTime          90702 non-null object
Company               106202 non-null object
Category              122463 non-null object
SourceName            122463 non-null object
dtypes: int64(1), object(9)
memory usage: 9.3+ MB


## Pré-processamento

#### SalaryRaw

In [47]:
del df_job_data['SalaryRaw']

#### Remove ContractType

Grande quantidade de valores null

In [37]:
del df_job_data['ContractType']
del df_test_rev1['ContractType']

#### Remove ContractTime

In [45]:
del df_job_data['ContractTime']
del df_test_rev1['ContractTime']

#### LocationNormalized

In [64]:
df_job_data = normalizeTextField(df_job_data, 'LocationNormalized')

In [65]:
df_job_data.shape

(181599, 303)

#### Extraindo label

In [13]:
y = np.array(df_job_data['SalaryNormalized'])
y[1:5]

array([30000, 30000, 27500, 25000])

#### Retirando label

In [14]:
del df_job_data['SalaryNormalized']
df_job_data.head()

Unnamed: 0,Id,Title,FullDescription,LocationRaw,LocationNormalized,ContractType,ContractTime,Company,Category,SalaryRaw,SourceName
0,12612628,Engineering Systems Analyst,Engineering Systems Analyst Dorking Surrey Sal...,"Dorking, Surrey, Surrey",Dorking,,permanent,Gregory Martin International,Engineering Jobs,20000 - 30000/annum 20-30K,cv-library.co.uk
1,12612830,Stress Engineer Glasgow,Stress Engineer Glasgow Salary **** to **** We...,"Glasgow, Scotland, Scotland",Glasgow,,permanent,Gregory Martin International,Engineering Jobs,25000 - 35000/annum 25-35K,cv-library.co.uk
2,12612844,Modelling and simulation analyst,Mathematical Modeller / Simulation Analyst / O...,"Hampshire, South East, South East",Hampshire,,permanent,Gregory Martin International,Engineering Jobs,20000 - 40000/annum 20-40K,cv-library.co.uk
3,12613049,Engineering Systems Analyst / Mathematical Mod...,Engineering Systems Analyst / Mathematical Mod...,"Surrey, South East, South East",Surrey,,permanent,Gregory Martin International,Engineering Jobs,25000 - 30000/annum 25K-30K negotiable,cv-library.co.uk
4,12613647,"Pioneer, Miser Engineering Systems Analyst","Pioneer, Miser Engineering Systems Analyst Do...","Surrey, South East, South East",Surrey,,permanent,Gregory Martin International,Engineering Jobs,20000 - 30000/annum 20-30K,cv-library.co.uk


#### Removendo Category

In [15]:
del df_job_data['Category']

In [16]:
df_job_data.head()

Unnamed: 0,Id,Title,FullDescription,LocationRaw,LocationNormalized,ContractType,ContractTime,Company,SalaryRaw,SourceName
0,12612628,Engineering Systems Analyst,Engineering Systems Analyst Dorking Surrey Sal...,"Dorking, Surrey, Surrey",Dorking,,permanent,Gregory Martin International,20000 - 30000/annum 20-30K,cv-library.co.uk
1,12612830,Stress Engineer Glasgow,Stress Engineer Glasgow Salary **** to **** We...,"Glasgow, Scotland, Scotland",Glasgow,,permanent,Gregory Martin International,25000 - 35000/annum 25-35K,cv-library.co.uk
2,12612844,Modelling and simulation analyst,Mathematical Modeller / Simulation Analyst / O...,"Hampshire, South East, South East",Hampshire,,permanent,Gregory Martin International,20000 - 40000/annum 20-40K,cv-library.co.uk
3,12613049,Engineering Systems Analyst / Mathematical Mod...,Engineering Systems Analyst / Mathematical Mod...,"Surrey, South East, South East",Surrey,,permanent,Gregory Martin International,25000 - 30000/annum 25K-30K negotiable,cv-library.co.uk
4,12613647,"Pioneer, Miser Engineering Systems Analyst","Pioneer, Miser Engineering Systems Analyst Do...","Surrey, South East, South East",Surrey,,permanent,Gregory Martin International,20000 - 30000/annum 20-30K,cv-library.co.uk


#### Removendo Location Raw

In [17]:
del df_job_data['LocationRaw']
del df_test_rev1['LocationRaw']
df_job_data.head()

Unnamed: 0,Id,Title,FullDescription,LocationNormalized,ContractType,ContractTime,Company,SalaryRaw,SourceName
0,12612628,Engineering Systems Analyst,Engineering Systems Analyst Dorking Surrey Sal...,Dorking,,permanent,Gregory Martin International,20000 - 30000/annum 20-30K,cv-library.co.uk
1,12612830,Stress Engineer Glasgow,Stress Engineer Glasgow Salary **** to **** We...,Glasgow,,permanent,Gregory Martin International,25000 - 35000/annum 25-35K,cv-library.co.uk
2,12612844,Modelling and simulation analyst,Mathematical Modeller / Simulation Analyst / O...,Hampshire,,permanent,Gregory Martin International,20000 - 40000/annum 20-40K,cv-library.co.uk
3,12613049,Engineering Systems Analyst / Mathematical Mod...,Engineering Systems Analyst / Mathematical Mod...,Surrey,,permanent,Gregory Martin International,25000 - 30000/annum 25K-30K negotiable,cv-library.co.uk
4,12613647,"Pioneer, Miser Engineering Systems Analyst","Pioneer, Miser Engineering Systems Analyst Do...",Surrey,,permanent,Gregory Martin International,20000 - 30000/annum 20-30K,cv-library.co.uk


#### Title

- Removendo linha com titulo null

In [18]:
df_job_data.dropna(subset=['Title'], inplace = True)

- Discretizando title

In [19]:
def normalizeTextField(df, field):
    vectorizer = CountVectorizer(max_features=100)
    fields = vectorizer.fit_transform(df[field]).toarray()
    fcols = np.vectorize(lambda x: field + str(x))(np.arange(100))
    df_fields = pd.DataFrame(fields, columns=fcols)
    df = pd.concat([df, df_fields], join ='inner',axis=1)
    del df[field]
    return df

In [20]:
df_job_data = normalizeTextField(df_job_data, 'Title')

In [23]:
df_job_data.head(n=1)

Unnamed: 0,Id,FullDescription,LocationNormalized,ContractType,ContractTime,Company,SalaryRaw,SourceName,Title0,Title1,...,Title90,Title91,Title92,Title93,Title94,Title95,Title96,Title97,Title98,Title99
0,12612628,Engineering Systems Analyst Dorking Surrey Sal...,Dorking,,permanent,Gregory Martin International,20000 - 30000/annum 20-30K,cv-library.co.uk,0,0,...,0,0,0,0,0,0,0,0,0,0


#### Full Description

In [24]:
df_job_data = normalizeTextField(df_job_data, 'FullDescription')

In [26]:
df_job_data.head(n=1)

Unnamed: 0,Id,LocationNormalized,ContractType,ContractTime,Company,SalaryRaw,SourceName,Title0,Title1,Title2,...,FullDescription90,FullDescription91,FullDescription92,FullDescription93,FullDescription94,FullDescription95,FullDescription96,FullDescription97,FullDescription98,FullDescription99
0,12612628,Dorking,,permanent,Gregory Martin International,20000 - 30000/annum 20-30K,cv-library.co.uk,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### Discretizando Source Name

In [27]:
df_job_data.dropna(subset=['SourceName'], inplace = True)

In [28]:
_, sources = np.unique(df_job_data['SourceName'], return_inverse=True)

In [29]:
df_job_data['SourceName'] = sources

In [30]:
df_job_data.head(n=2)

Unnamed: 0,Id,LocationNormalized,ContractType,ContractTime,Company,SalaryRaw,SourceName,Title0,Title1,Title2,...,FullDescription90,FullDescription91,FullDescription92,FullDescription93,FullDescription94,FullDescription95,FullDescription96,FullDescription97,FullDescription98,FullDescription99
0,12612628,Dorking,,permanent,Gregory Martin International,20000 - 30000/annum 20-30K,42,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,12612830,Glasgow,,permanent,Gregory Martin International,25000 - 35000/annum 25-35K,42,0,0,0,...,0,3,2,0,1,2,0,0,6,1


#### Discretizar Company

Será removido todos os valores null

In [69]:
df_job_data.dropna(subset=['Company'], inplace = True)

In [61]:
_, comps = np.unique(df_job_data['Company'], return_inverse=True)

In [67]:
df_job_data['Company'] = comps

In [70]:
df_job_data.head()

Unnamed: 0,Id,Company,SourceName,Title0,Title1,Title2,Title3,Title4,Title5,Title6,...,LocationNormalized90,LocationNormalized91,LocationNormalized92,LocationNormalized93,LocationNormalized94,LocationNormalized95,LocationNormalized96,LocationNormalized97,LocationNormalized98,LocationNormalized99
0,12612628,7757,42,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
1,12612830,7757,42,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,12612844,7757,42,0,0,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
3,12613049,7757,42,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,12613647,7757,42,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0


In [71]:
df_job_data.tail()

Unnamed: 0,Id,Company,SourceName,Title0,Title1,Title2,Title3,Title4,Title5,Title6,...,LocationNormalized90,LocationNormalized91,LocationNormalized92,LocationNormalized93,LocationNormalized94,LocationNormalized95,LocationNormalized96,LocationNormalized97,LocationNormalized98,LocationNormalized99
212332,72229944,13327,158,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
212333,72229945,4592,158,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
212334,72229953,13296,158,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
212335,72229958,9296,158,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
212336,72229974,5702,158,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Criando modelos