# Job Salary Prediction
_Predict the salary of any UK job ad based on its contents_

### Job Data

- **Id**: Unique identifier for each job ad.

- **Title**: Free text with the title or a summary of the position.

- **FullDescription**: Description of the position with all salary information removed.

- **LocationRaw**: Location of the position as free text.

- **LocationNormalized**: Approximate location derived by converting the free-text location.

- **ContractType**: full_time or part_time.

- **ContractTime**: permanent or contract.

- **Company**: Name of the employer.

- **Category**: Which of the 30 standard job categories the ad fits into, inferred in a very messy way from the source of the ad. This field is known to contain a lot of noise and errors.

- **SalaryRaw**: Salary description as free text.

- **SalaryNormalized**: Gross annual salary. This is the value we are trying to predict.

- **SourceName**: Name of the website or advertiser that published the ad.

### Location Tree

This is a supplementary dataset that describes the hierarchical relationship between the normalized locations that appear in the job data. Salaries for jobs in the same geographic area are likely to be related; for example, average salaries in London and the South East are higher than in the rest of the UK.
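
One way to use such a hierarchy is to map each normalized location to a broader region and feed that region to the model as a feature. The sketch below assumes each tree entry is a path string whose levels are separated by `~`; the separator, the sample paths, and the helper name `build_region_lookup` are assumptions made for illustration and are not taken from the description above.

```python
# Hypothetical hierarchy paths; the "~" separator is an assumption
sample_paths = [
    "UK~London~Central London",
    "UK~South East England~Surrey~Dorking",
    "UK~Scotland~Glasgow",
]

def build_region_lookup(paths, sep="~", level=1):
    """Map every location at or below `level` on a path to its ancestor at `level`."""
    lookup = {}
    for path in paths:
        parts = path.split(sep)
        if len(parts) <= level:
            continue  # path is too short to have an entry at this level
        for name in parts[level:]:
            lookup[name] = parts[level]
    return lookup

region_of = build_region_lookup(sample_paths)
print(region_of["Dorking"])
print(region_of["Glasgow"])
```

A lookup like this could replace or complement the bag-of-words encoding of `LocationNormalized`, since it groups many rare towns under a handful of regions.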

### Output


    Id,SalaryNormalized
    13656201,36205
    14663195,74570
    16530664,31910.50
    ... 
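
A file in this two-column layout can be produced from an array of ids and an array of predictions. The following is a minimal sketch with made-up values; `ids` and `preds` are hypothetical stand-ins for the test ids and for the output of a fitted model.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical ids and predictions, for illustration only
ids = np.array([13656201, 14663195, 16530664])
preds = np.array([36205.0, 74570.0, 31910.5])

submission = pd.DataFrame({"Id": ids, "SalaryNormalized": preds})

# Written to an in-memory buffer here; pass a file path instead to write to disk
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```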
    

## Imports

In [76]:
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor

## Dataset

In [2]:
df_job_data = pd.read_csv('dataset/Train_rev1.csv')

In [3]:
df_test_rev1 = pd.read_csv('dataset/Test_rev1.csv')

## Information

### Job Data

In [4]:
df_job_data.head(n=2)

Unnamed: 0,Id,Title,FullDescription,LocationRaw,LocationNormalized,ContractType,ContractTime,Company,Category,SalaryRaw,SalaryNormalized,SourceName
0,12612628,Engineering Systems Analyst,Engineering Systems Analyst Dorking Surrey Sal...,"Dorking, Surrey, Surrey",Dorking,,permanent,Gregory Martin International,Engineering Jobs,20000 - 30000/annum 20-30K,25000,cv-library.co.uk
1,12612830,Stress Engineer Glasgow,Stress Engineer Glasgow Salary **** to **** We...,"Glasgow, Scotland, Scotland",Glasgow,,permanent,Gregory Martin International,Engineering Jobs,25000 - 35000/annum 25-35K,30000,cv-library.co.uk


In [5]:
df_job_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244768 entries, 0 to 244767
Data columns (total 12 columns):
Id                    244768 non-null int64
Title                 244767 non-null object
FullDescription       244768 non-null object
LocationRaw           244768 non-null object
LocationNormalized    244768 non-null object
ContractType          65442 non-null object
ContractTime          180863 non-null object
Company               212338 non-null object
Category              244768 non-null object
SalaryRaw             244768 non-null object
SalaryNormalized      244768 non-null int64
SourceName            244767 non-null object
dtypes: int64(2), object(10)
memory usage: 22.4+ MB


### Test

In [9]:
df_test_rev1.head(n=2)

Unnamed: 0,Id,Title,FullDescription,LocationRaw,LocationNormalized,ContractType,ContractTime,Company,Category,SourceName
0,11888454,Business Development Manager,The Company: Our client is a national training...,"Tyne Wear, North East",Newcastle Upon Tyne,,permanent,Asset Appointments,Teaching Jobs,cv-library.co.uk
1,11988350,Internal Account Manager,The Company: Founded in **** our client is a U...,"Tyne and Wear, North East",Newcastle Upon Tyne,,permanent,Asset Appointments,Consultancy Jobs,cv-library.co.uk


In [10]:
df_test_rev1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 122463 entries, 0 to 122462
Data columns (total 10 columns):
Id                    122463 non-null int64
Title                 122463 non-null object
FullDescription       122463 non-null object
LocationRaw           122463 non-null object
LocationNormalized    122463 non-null object
ContractType          33013 non-null object
ContractTime          90702 non-null object
Company               106202 non-null object
Category              122463 non-null object
SourceName            122463 non-null object
dtypes: int64(1), object(9)
memory usage: 9.3+ MB


## Preprocessing

In [11]:
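def normalizeTextField(df, field, max_features=100):
    # Bag-of-words counts for the most frequent terms of a text column
    vectorizer = CountVectorizer(max_features=max_features)
    counts = vectorizer.fit_transform(df[field]).toarray()
    # Name columns by position; the vocabulary may hold fewer than max_features terms
    fcols = [field + str(i) for i in range(counts.shape[1])]
    # Reuse df's index so the new columns line up with the existing rows
    df_fields = pd.DataFrame(counts, columns=fcols, index=df.index)
    # Replace the original text column with its count columns
    return pd.concat([df.drop(columns=field), df_fields], axis=1)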
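
To make the effect of `max_features` concrete, here is a small sketch on made-up job titles. The titles and the choice of three features are illustrative assumptions; only the `CountVectorizer` usage mirrors the function above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up titles: "engineer" appears 4 times, "software" 3, "senior" 2, "manager" 1
titles = [
    "Senior Software Engineer",
    "Software Engineer",
    "Senior Engineer",
    "Software Engineer Manager",
]

# Keep only the 3 most frequent terms across all titles
vectorizer = CountVectorizer(max_features=3)
counts = vectorizer.fit_transform(titles).toarray()

print(sorted(vectorizer.vocabulary_))  # the terms that were kept
print(counts.shape)                    # one row per title, one column per kept term
```

The least frequent term, "manager", is dropped, so the last title is represented only by its "software" and "engineer" counts.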

### SalaryRaw

In [12]:
del df_job_data['SalaryRaw']

### Remove ContractType

This column has a large number of null values.

In [13]:
del df_job_data['ContractType']
del df_test_rev1['ContractType']
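
In the training data shown earlier, `ContractType` is non-null in 65442 of 244768 rows, so roughly 73% of its values are missing. A check like the one below can make that kind of decision explicit; the toy frame and the 0.5 cut-off are hypothetical and serve only as an illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_job_data; all values are made up
toy = pd.DataFrame({
    "Title": ["Engineer", "Analyst", "Nurse", "Chef"],
    "ContractType": [np.nan, "full_time", np.nan, np.nan],
    "ContractTime": ["permanent", np.nan, "permanent", "contract"],
})

null_share = toy.isna().mean()   # fraction of missing values per column
threshold = 0.5                  # hypothetical cut-off
to_drop = null_share[null_share > threshold].index.tolist()
reduced = toy.drop(columns=to_drop)

print(null_share)
print(to_drop)
```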

### Remove ContractTime

In [14]:
del df_job_data['ContractTime']
del df_test_rev1['ContractTime']

### Remove Category

In [15]:
del df_job_data['Category']
del df_test_rev1['Category']

### Remove LocationRaw

In [16]:
del df_job_data['LocationRaw']
del df_test_rev1['LocationRaw']

### Company

In [37]:
del df_job_data['Company']

In [38]:
del df_test_rev1['Company']

### Dropping rows with NULL values

In [17]:
df_job_data.dropna(subset=['Title'], inplace = True)

In [18]:
df_job_data.dropna(subset=['SourceName'], inplace = True)

### Extracting the label

In [51]:
y = df_job_data['SalaryNormalized'].values

In [53]:
y

array([25000, 30000, 30000, ..., 15000, 20000, 19000])

### Extracting the IDs

In [59]:
idx_job = df_job_data['Id'].values

In [61]:
idx_job

array([12612628, 12612830, 12612844, ..., 72703444, 72703454, 72703459])

In [62]:
idx_test = df_test_rev1['Id'].values

In [63]:
idx_test

array([11888454, 11988350, 12612558, ..., 72705210, 72705214, 72705218])

### Combining train and test

In [24]:
df_job_tuple = df_job_data.shape
df_job_tuple

(212337, 7)

In [25]:
df_test_tuple = df_test_rev1.shape
df_test_tuple

(122463, 6)

In [22]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
df = pd.concat([df_job_data, df_test_rev1])

In [23]:
df.shape

(334800, 7)

#### LocationNormalized

In [26]:
df = normalizeTextField(df, 'LocationNormalized')

In [27]:
df.shape

(334800, 106)

#### Title

In [28]:
df = normalizeTextField(df, 'Title')

In [29]:
df.shape

(334800, 205)

#### Full Description

In [30]:
df = normalizeTextField(df, 'FullDescription')

In [31]:
df.shape

(334800, 304)

#### Source Name

In [32]:
_, sources = np.unique(df['SourceName'], return_inverse=True)

In [33]:
sources.shape

(334800,)

In [34]:
df['SourceName'] = sources

In [35]:
df.shape

(334800, 304)
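
The `return_inverse=True` option of `np.unique` is what turns each source name into an integer code. A minimal sketch on made-up values follows; the second site name is hypothetical.

```python
import numpy as np

# Made-up source names, for illustration only
names = np.array(["totaljobs.com", "cv-library.co.uk", "totaljobs.com"])

# labels: sorted unique values; codes: position of each element within labels
labels, codes = np.unique(names, return_inverse=True)

print(labels)
print(codes)
# Indexing labels with codes reconstructs the original array
print(labels[codes])
```

Note that the codes follow alphabetical order, which carries no meaning about the sources themselves. Tree-based models tend to cope with that, while linear or distance-based models may read the codes as an ordered quantity, so one-hot encoding is a common alternative for them.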

In [65]:
df.head()

Unnamed: 0,SourceName,LocationNormalized0,LocationNormalized1,LocationNormalized2,LocationNormalized3,LocationNormalized4,LocationNormalized5,LocationNormalized6,LocationNormalized7,LocationNormalized8,...,FullDescription90,FullDescription91,FullDescription92,FullDescription93,FullDescription94,FullDescription95,FullDescription96,FullDescription97,FullDescription98,FullDescription99
0,42,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,42,0,0,0,0,0,0,0,0,0,...,0,3,2,0,1,2,0,0,6,1
2,42,0,0,0,0,0,0,0,0,0,...,0,3,4,0,2,3,0,0,0,0
3,42,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,42,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [66]:
df.tail()

Unnamed: 0,SourceName,LocationNormalized0,LocationNormalized1,LocationNormalized2,LocationNormalized3,LocationNormalized4,LocationNormalized5,LocationNormalized6,LocationNormalized7,LocationNormalized8,...,FullDescription90,FullDescription91,FullDescription92,FullDescription93,FullDescription94,FullDescription95,FullDescription96,FullDescription97,FullDescription98,FullDescription99
122458,95,0,0,0,0,0,0,0,0,0,...,0,2,3,1,0,1,0,1,16,10
122459,95,0,0,0,0,0,0,0,0,0,...,0,2,1,0,0,0,0,0,5,3
122460,64,0,0,0,0,0,0,0,0,0,...,1,10,6,2,0,4,0,0,8,4
122461,64,0,0,0,0,0,0,0,0,0,...,0,10,3,2,2,3,0,0,7,2
122462,64,0,0,0,0,0,0,0,0,0,...,1,6,5,3,4,2,0,0,8,6


### Splitting Train and Test

In [68]:
# Training rows come first in the combined frame; keep every column
X_train = df.values[:df_job_tuple[0]]

In [69]:
# Test rows are the ones that follow the training rows in the combined frame
X_test = df.values[df_job_tuple[0]:]

In [70]:
X_train.shape, X_test.shape

((212337, 301), (122463, 301))
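
Because the training rows were stacked on top of the test rows, the two parts can be recovered by slicing at the number of training rows. The toy arrays below are made up to show that the slice after `n_train` returns exactly the rows that were stacked second.

```python
import numpy as np

train_part = np.array([[1, 1], [2, 2], [3, 3]])   # 3 toy "training" rows
test_part = np.array([[9, 9], [8, 8]])            # 2 toy "test" rows
stacked = np.vstack([train_part, test_part])

n_train = train_part.shape[0]
X_tr = stacked[:n_train]    # the first n_train rows
X_te = stacked[n_train:]    # everything after them

print(X_tr.shape, X_te.shape)
```

Slicing `stacked[:len(test_part)]` instead would return training rows, even though the resulting shape looks plausible.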

## Standardizing the data

In [91]:
scaler = StandardScaler().fit(X_train)



In [92]:
# The scaler is already fitted on X_train, so transform is enough
X_std_train = scaler.transform(X_train)



In [93]:
# Reuse the statistics learned from the training set; do not refit on test data
X_std_test = scaler.transform(X_test)



In [94]:
X_std_train.shape, y.shape

((212337, 301), (212337,))
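
The usual pattern is to learn the mean and variance from the training rows only and apply the same transformation to both sets. A minimal sketch on synthetic data, where the array sizes and distribution parameters are arbitrary assumptions:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_tr = rng.normal(loc=50.0, scale=10.0, size=(200, 3))   # synthetic training rows
X_te = rng.normal(loc=50.0, scale=10.0, size=(50, 3))    # synthetic test rows

scaler_demo = StandardScaler().fit(X_tr)    # statistics come from training rows only
X_tr_std = scaler_demo.transform(X_tr)
X_te_std = scaler_demo.transform(X_te)      # same mean and variance reused

print(X_tr_std.mean(axis=0))   # close to 0 by construction
print(X_te_std.mean(axis=0))   # near 0, but not exactly
```

Fitting a second time on the test set would put the two matrices on slightly different scales, so a model trained on one would see shifted inputs on the other.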

### Creating folds

In [95]:
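from sklearn.model_selection import KFold

n_splits = 10
# SalaryNormalized is a continuous target and stratification expects class labels,
# so a shuffled K-fold is used; the name skfold is kept for the training cells below
skfold = KFold(n_splits=n_splits, shuffle=True, random_state=1)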

## Creating models

In [83]:
rf_model = RandomForestRegressor(n_estimators=50, min_samples_split=30, random_state=1)

In [84]:
gb_model = GradientBoostingRegressor(min_samples_split=30, random_state=1)

In [85]:
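# Note: LogisticRegression is a classifier; it treats each distinct salary as a class
# and cross_validate scores it by accuracy rather than R^2
lgr_model = LogisticRegression(random_state=1)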

In [86]:
ada_model = AdaBoostRegressor(random_state=1)

In [87]:
knn_model = KNeighborsRegressor()

In [88]:
lnr_model = LinearRegression()

In [89]:
svr_model = SVR()

## Training

In [None]:
cv_rf = cross_validate(estimator=rf_model, X=X_std_train, y=y, cv=skfold)

In [None]:
cv_gb = cross_validate(estimator=gb_model, X=X_std_train, y=y, cv=skfold)

In [None]:
cv_lgr = cross_validate(estimator=lgr_model, X=X_std_train, y=y, cv=skfold)

In [None]:
cv_ada = cross_validate(estimator=ada_model, X=X_std_train, y=y, cv=skfold)

In [None]:
cv_knn = cross_validate(estimator=knn_model, X=X_std_train, y=y, cv=skfold)

In [None]:
cv_lnr = cross_validate(estimator=lnr_model, X=X_std_train, y=y, cv=skfold)

In [None]:
cv_svr = cross_validate(estimator=svr_model, X=X_std_train, y=y, cv=skfold)
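
`cross_validate` returns a dict of per-fold arrays (`fit_time`, `score_time`, `test_score`). When no `scoring` argument is given it falls back to the estimator's own `score` method, which is R² for regressors. The sketch below shows one way to summarize the folds on synthetic data; using mean absolute error as the metric is an assumption made here, as are the data and the 5-fold split.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic regression problem, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 4))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5, 1.0]) + rng.normal(scale=0.1, size=120)

cv_demo = KFold(n_splits=5, shuffle=True, random_state=1)
result = cross_validate(
    LinearRegression(), X_demo, y_demo, cv=cv_demo,
    scoring="neg_mean_absolute_error",
)

# scikit-learn reports the negated error so that higher is always better
mae_per_fold = -result["test_score"]
print(mae_per_fold.mean(), mae_per_fold.std())
```

Applying the same summary to each of the `cv_*` results gives a per-model mean and spread that can be compared directly.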