# Job Salary Prediction
_Predict the salary of any UK job ad based on its contents_

### Job Data

- **Id**: Unique identifier for each job ad.

- **Title**: Free text with the title or a summary of the position.

- **FullDescription**: Description of the position, without any salary information.

- **LocationRaw**: Free-text location of the position.

- **LocationNormalized**: Approximate location derived by converting the free-text location.

- **ContractType**: full_time or part_time.

- **ContractTime**: permanent or contract.

- **Company**: Name of the company.

- **Category**: Which of the 30 standard job categories this ad fits into, inferred in a very messy way based on the source the ad came from. This field is known to contain a lot of noise and error.

- **SalaryRaw**: Free-text salary description.

- **SalaryNormalized**: Gross annual salary. This is the value we are trying to predict.

- **SourceName**: Name of the website or advertiser that published the ad.

### Location Tree

This is a supplementary dataset that describes the hierarchical relationship between the different normalized locations shown in the job data. There are likely to be meaningful relationships between the salaries of jobs in similar geographical areas; for example, average salaries in London and the South East are higher than in the rest of the UK.

### Output


    Id,SalaryNormalized
    13656201,36205
    14663195,74570
    16530664,31910.50
    ... 
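
As a minimal sketch of how a file in this layout could be produced with pandas: the ids and predictions below are placeholder values for illustration only, and `example_ids`, `example_preds` and `submission` are hypothetical names that do not appear in the notebook.

```python
import io

import numpy as np
import pandas as pd

# Placeholder ids and predictions, used only to illustrate the file layout
example_ids = np.array([13656201, 14663195, 16530664])
example_preds = np.array([36205.0, 74570.0, 31910.5])

submission = pd.DataFrame({"Id": example_ids, "SalaryNormalized": example_preds})

# Write to an in-memory buffer here; pass a file path instead to save to disk
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing `index=False` keeps the DataFrame index out of the file, so the header row contains only the two expected columns.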
    

## Imports

In [2]:
import numpy as np
import pandas as pd
from sklearn.svm import SVR
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import KFold, cross_validate
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor


## Dataset

In [67]:
# Install a Drive FUSE wrapper.
# https://github.com/astrada/google-drive-ocamlfuse
# !apt-get install -y -qq software-properties-common python-software-properties module-init-tools
# !add-apt-repository -y ppa:alessandro-strada/ppa 2>&1 > /dev/null
# !apt-get update -qq 2>&1 > /dev/null
# !apt-get -y install -qq google-drive-ocamlfuse fuse

gpg: keybox '/tmp/tmpfhvsk8t_/pubring.gpg' created
gpg: /tmp/tmpfhvsk8t_/trustdb.gpg: trustdb created
gpg: key AD5F235DF639B041: public key "Launchpad PPA for Alessandro Strada" imported
gpg: Total number processed: 1
gpg:               imported: 1


In [72]:
# Generate auth tokens for Colab
# from google.colab import auth
# auth.authenticate_user()

In [0]:
# Generate creds for the Drive FUSE library.
# from oauth2client.client import GoogleCredentials
# creds = GoogleCredentials.get_application_default()
# import getpass
# !google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret} < /dev/null 2>&1 | grep URL
# vcode = getpass.getpass()
# !echo {vcode} | google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret}

In [12]:
# Create a directory and mount Google Drive using that directory.
# !mkdir -p drive
# !google-drive-ocamlfuse drive

In [13]:
# !cd drive/

In [3]:
!ls

Job Salary Prediction.ipynb		    README.md
List_12__Clustering.ipynb		    Train_rev1.csv
List_13__Clusterization_Hierarchical.ipynb  Train_rev1.zip
Não confirmado 712499.crdownload


In [15]:
# df_job_data = pd.read_csv('drive/MachineLearning/JobSalaryPredict/Train_rev1.csv')

In [5]:
df_job_data = pd.read_csv('Train_rev1.csv')

In [16]:
# df_test_rev1 = pd.read_csv('drive/MachineLearning/JobSalaryPredict/Test_rev1.csv')

In [6]:
df_test_rev1 = pd.read_csv('Test_rev1.csv')

## Information

### Job Data

In [7]:
df_job_data.head(n=2)

Unnamed: 0,Id,Title,FullDescription,LocationRaw,LocationNormalized,ContractType,ContractTime,Company,Category,SalaryRaw,SalaryNormalized,SourceName
0,12612628,Engineering Systems Analyst,Engineering Systems Analyst Dorking Surrey Sal...,"Dorking, Surrey, Surrey",Dorking,,permanent,Gregory Martin International,Engineering Jobs,20000 - 30000/annum 20-30K,25000,cv-library.co.uk
1,12612830,Stress Engineer Glasgow,Stress Engineer Glasgow Salary **** to **** We...,"Glasgow, Scotland, Scotland",Glasgow,,permanent,Gregory Martin International,Engineering Jobs,25000 - 35000/annum 25-35K,30000,cv-library.co.uk


In [8]:
df_job_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244768 entries, 0 to 244767
Data columns (total 12 columns):
Id                    244768 non-null int64
Title                 244767 non-null object
FullDescription       244768 non-null object
LocationRaw           244768 non-null object
LocationNormalized    244768 non-null object
ContractType          65442 non-null object
ContractTime          180863 non-null object
Company               212338 non-null object
Category              244768 non-null object
SalaryRaw             244768 non-null object
SalaryNormalized      244768 non-null int64
SourceName            244767 non-null object
dtypes: int64(2), object(10)
memory usage: 22.4+ MB
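
The `info()` output above shows that `ContractType` is populated in only 65,442 of 244,768 rows, so roughly 73% of its values are missing. A small sketch of how the missing share per column could be computed and used to pick columns to drop follows. The toy frame and the 0.5 threshold are assumptions for illustration, not values taken from the notebook.

```python
import pandas as pd

# Toy frame standing in for the job data; values are made up
toy = pd.DataFrame({
    "Title": ["Engineer", None, "Analyst", "Nurse"],
    "ContractType": [None, None, "full_time", None],
    "Company": ["A", "B", None, "C"],
})

# Fraction of missing values per column, largest first
null_fraction = toy.isnull().mean().sort_values(ascending=False)
print(null_fraction)

# Assumed cut-off: drop columns where more than half of the values are missing
threshold = 0.5
cols_to_drop = null_fraction[null_fraction > threshold].index.tolist()
print(cols_to_drop)
```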


### Test

In [9]:
df_test_rev1.head(n=2)

Unnamed: 0,Id,Title,FullDescription,LocationRaw,LocationNormalized,ContractType,ContractTime,Company,Category,SourceName
0,11888454,Business Development Manager,The Company: Our client is a national training...,"Tyne Wear, North East",Newcastle Upon Tyne,,permanent,Asset Appointments,Teaching Jobs,cv-library.co.uk
1,11988350,Internal Account Manager,The Company: Founded in **** our client is a U...,"Tyne and Wear, North East",Newcastle Upon Tyne,,permanent,Asset Appointments,Consultancy Jobs,cv-library.co.uk


In [10]:
df_test_rev1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 122463 entries, 0 to 122462
Data columns (total 10 columns):
Id                    122463 non-null int64
Title                 122463 non-null object
FullDescription       122463 non-null object
LocationRaw           122463 non-null object
LocationNormalized    122463 non-null object
ContractType          33013 non-null object
ContractTime          90702 non-null object
Company               106202 non-null object
Category              122463 non-null object
SourceName            122463 non-null object
dtypes: int64(1), object(9)
memory usage: 9.3+ MB


## Preprocessing

In [44]:
def normalizeTextField(df, field):
    # Bag-of-words counts over the 100 most frequent tokens of the field
    vectorizer = CountVectorizer(max_features=100)
    fields = vectorizer.fit_transform(df[field]).toarray()
    # Generate the names of the two new columns, e.g. Title0 and Title1
    fcols = [field + str(i) for i in range(2)]
    # Reduce the dimensionality to 2
    pca = PCA(n_components=2)
    components = pca.fit_transform(fields)
    # Drop the text field and attach the components by position, so rows
    # stay matched even when df's index has gaps or repeated labels
    df = df.drop(columns=[field])
    for i, col in enumerate(fcols):
        df[col] = components[:, i]
    return df
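
To make the transformation easier to follow, here is a self-contained sketch of the same idea on a toy column: count tokens with `CountVectorizer`, then project the count matrix onto two principal components. The four job titles are invented for illustration.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer

# Invented titles, just enough to give the vectorizer a small vocabulary
toy = pd.DataFrame({"Title": [
    "senior java developer",
    "java developer",
    "staff nurse",
    "senior staff nurse",
]})

# One row per title, one column per token in the vocabulary
counts = CountVectorizer(max_features=100).fit_transform(toy["Title"]).toarray()

# Two dense numeric features per row
components = PCA(n_components=2).fit_transform(counts)

toy["Title0"] = components[:, 0]
toy["Title1"] = components[:, 1]
print(toy)
```

Two components keep the feature matrix small, at the cost of discarding most of the information in the token counts.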

### SalaryRaw

In [12]:
del df_job_data['SalaryRaw']

### Remove ContractType

Most values are null: only 65,442 of the 244,768 training rows have a value.

In [13]:
del df_job_data['ContractType']
del df_test_rev1['ContractType']

### Remove ContractTime

In [17]:
del df_job_data['ContractTime']
del df_test_rev1['ContractTime']

### Remove Category

In [18]:
del df_job_data['Category']
del df_test_rev1['Category']

### Remove LocationRaw

In [19]:
del df_job_data['LocationRaw']
del df_test_rev1['LocationRaw']

### Company

In [20]:
del df_job_data['Company']

In [21]:
del df_test_rev1['Company']

### Remove rows with NULL values

In [23]:
df_job_data.dropna(subset=['Title'], inplace = True)

In [24]:
df_job_data.dropna(subset=['SourceName'], inplace = True)

### Extracting the label

In [25]:
y = df_job_data['SalaryNormalized'].values

In [26]:
y

array([25000, 30000, 30000, ..., 22800, 22800, 42500])

### Extracting the IDs

In [27]:
idx_job = df_job_data['Id'].values

In [28]:
idx_job

array([12612628, 12612830, 12612844, ..., 72705213, 72705216, 72705235])

In [29]:
idx_test = df_test_rev1['Id'].values

In [30]:
idx_test

array([11888454, 11988350, 12612558, ..., 72705210, 72705214, 72705218])

### Combining train and test

In [31]:
df_job_tuple = df_job_data.shape
df_job_tuple

(244766, 6)

In [32]:
df_test_tuple = df_test_rev1.shape
df_test_tuple

(122463, 5)

In [38]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
df = pd.concat([df_job_data, df_test_rev1], sort=False)

In [45]:
df.shape

(367229, 6)

#### LocationNormalized

In [46]:
df = normalizeTextField(df, 'LocationNormalized')

In [47]:
df.shape

(367229, 7)

In [48]:
df.head()

Unnamed: 0,Id,Title,FullDescription,SalaryNormalized,SourceName,LocationNormalized0,LocationNormalized1
0,12612628,Engineering Systems Analyst,Engineering Systems Analyst Dorking Surrey Sal...,25000.0,cv-library.co.uk,-0.11679,-0.229172
1,12612830,Stress Engineer Glasgow,Stress Engineer Glasgow Salary **** to **** We...,30000.0,cv-library.co.uk,-0.118995,-0.237572
2,12612844,Modelling and simulation analyst,Mathematical Modeller / Simulation Analyst / O...,30000.0,cv-library.co.uk,-0.120516,-0.241914
3,12613049,Engineering Systems Analyst / Mathematical Mod...,Engineering Systems Analyst / Mathematical Mod...,27500.0,cv-library.co.uk,-0.122604,-0.249312
4,12613647,"Pioneer, Miser Engineering Systems Analyst","Pioneer, Miser Engineering Systems Analyst Do...",25000.0,cv-library.co.uk,-0.122604,-0.249312


#### Title

In [49]:
df = normalizeTextField(df, 'Title')

In [50]:
df.shape

(367229, 8)

In [51]:
df.head()

Unnamed: 0,Id,FullDescription,SalaryNormalized,SourceName,LocationNormalized0,LocationNormalized1,Title0,Title1
0,12612628,Engineering Systems Analyst Dorking Surrey Sal...,25000.0,cv-library.co.uk,-0.11679,-0.229172,-0.211711,0.010125
1,12612830,Stress Engineer Glasgow Salary **** to **** We...,30000.0,cv-library.co.uk,-0.118995,-0.237572,-0.379567,-0.578679
2,12612844,Mathematical Modeller / Simulation Analyst / O...,30000.0,cv-library.co.uk,-0.120516,-0.241914,-0.204023,0.064291
3,12613049,Engineering Systems Analyst / Mathematical Mod...,27500.0,cv-library.co.uk,-0.122604,-0.249312,-0.211711,0.010125
4,12613647,"Pioneer, Miser Engineering Systems Analyst Do...",25000.0,cv-library.co.uk,-0.122604,-0.249312,-0.211711,0.010125


#### Full Description

In [52]:
df = normalizeTextField(df, 'FullDescription')

In [53]:
df.shape

(367229, 9)

In [54]:
df.head()

Unnamed: 0,Id,SalaryNormalized,SourceName,LocationNormalized0,LocationNormalized1,Title0,Title1,FullDescription0,FullDescription1
0,12612628,25000.0,cv-library.co.uk,-0.11679,-0.229172,-0.211711,0.010125,-18.530014,2.881801
1,12612830,30000.0,cv-library.co.uk,-0.118995,-0.237572,-0.379567,-0.578679,1.115408,-2.899838
2,12612844,30000.0,cv-library.co.uk,-0.120516,-0.241914,-0.204023,0.064291,-1.111251,2.198476
3,12613049,27500.0,cv-library.co.uk,-0.122604,-0.249312,-0.211711,0.010125,-18.890457,3.393422
4,12613647,25000.0,cv-library.co.uk,-0.122604,-0.249312,-0.211711,0.010125,-19.451188,2.751042


#### Source Name

In [55]:
_, sources = np.unique(df['SourceName'], return_inverse=True)

In [56]:
sources.shape

(367229,)

In [57]:
df['SourceName'] = sources
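
As a small illustration of what `np.unique(..., return_inverse=True)` returns: the unique values come back sorted, and the inverse array holds, for each row, the position of its value in that sorted list. Apart from `cv-library.co.uk`, the site names below are placeholders.

```python
import numpy as np

# cv-library.co.uk appears in the data; the other two names are placeholders
names = np.array(["cv-library.co.uk", "totaljobs.com", "cv-library.co.uk", "jobserve.com"])

labels, codes = np.unique(names, return_inverse=True)
print(labels)  # sorted unique names
print(codes)   # integer code of each row

# The original values can be recovered from the codes
restored = labels[codes]
```

These integer codes imply an ordering between sites that has no real meaning. Tree-based models tend to cope with that, while distance-based or linear models may be better served by one-hot encoding.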

In [58]:
df.shape

(367229, 9)

In [59]:
df.head(n=2)

Unnamed: 0,Id,SalaryNormalized,SourceName,LocationNormalized0,LocationNormalized1,Title0,Title1,FullDescription0,FullDescription1
0,12612628,25000.0,42,-0.11679,-0.229172,-0.211711,0.010125,-18.530014,2.881801
1,12612830,30000.0,42,-0.118995,-0.237572,-0.379567,-0.578679,1.115408,-2.899838


In [60]:
df.tail(n=2)

Unnamed: 0,Id,SalaryNormalized,SourceName,LocationNormalized0,LocationNormalized1,Title0,Title1,FullDescription0,FullDescription1
122461,72705214,,64,-0.11679,-0.229172,0.868987,-0.102754,-3.389519,-0.760345
122462,72705218,,64,-0.118635,-0.235408,-0.168568,0.034971,-13.765711,-0.120908


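### Splitting train and test

In [61]:
# Training rows come first in df.
# Caution: df still holds the Id and SalaryNormalized columns, so the
# target is part of this feature matrix.
X_train = df.values[:df_job_tuple[0], :]

In [62]:
# Test rows are the ones that follow the training rows
X_test = df.values[df_job_tuple[0]:, :]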

In [63]:
X_train.shape, X_test.shape

((244766, 9), (122463, 9))
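
Both matrices have 9 columns, which matches every column of `df`, including `Id` and `SalaryNormalized`. A sketch of one way to split the combined frame without carrying the target into the features is shown below. It uses a toy frame, and `feat`, `combined` and `features` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the train and test frames
train = pd.DataFrame({
    "Id": [1, 2, 3],
    "feat": [0.1, 0.2, 0.3],
    "SalaryNormalized": [25000, 30000, 27500],
})
test = pd.DataFrame({"Id": [4, 5], "feat": [0.4, 0.5]})

n_train = len(train)
combined = pd.concat([train, test], sort=False, ignore_index=True)

# Keep the identifier and the target out of the feature matrix
features = combined.drop(columns=["Id", "SalaryNormalized"])

X_train = features.values[:n_train]   # first n_train rows
X_test = features.values[n_train:]    # remaining rows
y_train = train["SalaryNormalized"].values
print(X_train.shape, X_test.shape, y_train.shape)
```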

## Scaling the data

In [64]:
scaler = StandardScaler()

## Creating the folds

In [132]:
n_splits = 10
kfold = KFold(n_splits=n_splits)
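
Without `shuffle=True`, `KFold` cuts the data into contiguous blocks in row order. The Id arrays printed earlier look sorted, which suggests (but does not prove) that the rows follow some ordering, and in that case shuffled folds may be more representative. A minimal sketch on toy data, with an assumed `random_state`:

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy matrix with 10 rows
X_toy = np.arange(20).reshape(10, 2)

# Shuffle the row order before cutting it into folds
shuffled_kfold = KFold(n_splits=5, shuffle=True, random_state=1)

fold_sizes = []
seen_test_rows = []
for train_idx, test_idx in shuffled_kfold.split(X_toy):
    fold_sizes.append((len(train_idx), len(test_idx)))
    seen_test_rows.extend(test_idx.tolist())

print(fold_sizes)
```

Every row still lands in exactly one test fold; only the assignment of rows to folds changes.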

## Function to run the models

In [137]:
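def cross_validation(model, X, y):
    # Both scores are negated errors, so values closer to zero are better
    scoring = ['neg_mean_absolute_error', 'neg_mean_squared_error']
    # The scaler sits inside the pipeline, so it is fit on each training fold only
    pipeline = Pipeline([('transformer', scaler), ('estimator', model)])

    return cross_validate(pipeline, X=X, y=y, cv=kfold, n_jobs=1, verbose=5, scoring=scoring, return_train_score=True)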
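
Since scikit-learn reports these metrics as negated errors, it can help to flip the sign before reading them. The sketch below runs the same kind of pipeline on synthetic data from `make_regression` and converts the scores into MAE and RMSE. The synthetic data and the choice of `LinearRegression` are assumptions made only to keep the example self-contained.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem standing in for the salary data
X_syn, y_syn = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1)

pipeline = Pipeline([
    ("transformer", StandardScaler()),
    ("estimator", LinearRegression()),
])

scores = cross_validate(
    pipeline, X_syn, y_syn,
    cv=KFold(n_splits=5),
    scoring=["neg_mean_absolute_error", "neg_mean_squared_error"],
    return_train_score=True,
)

# Flip the sign to get errors; take the square root of MSE to get RMSE
mae = -scores["test_neg_mean_absolute_error"]
rmse = np.sqrt(-scores["test_neg_mean_squared_error"])
print(mae.mean(), rmse.mean())
```

RMSE is in the same unit as the target, which makes it easier to compare with MAE than the raw squared error.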

## Creating the models

In [66]:
rf_model = RandomForestRegressor(n_estimators=50, min_samples_split=30, random_state=1)

In [67]:
gb_model = GradientBoostingRegressor(min_samples_split=30, random_state=1)

In [68]:
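# Note: LogisticRegression is a classifier, so each distinct salary value
# is treated as a separate class label rather than a continuous target
lgr_model = LogisticRegression(random_state=1)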
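
`LogisticRegression` is a classification model, while `LinearRegression` (already imported at the top of the notebook) is its regression counterpart. The toy sketch below illustrates the practical difference: the classifier can only return a value it saw during training, whereas the linear model can interpolate. The four data points are invented.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Invented, perfectly linear toy data
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = np.array([20000, 25000, 30000, 35000])

lin = LinearRegression().fit(X_toy, y_toy)
clf = LogisticRegression(random_state=1).fit(X_toy, y_toy)

x_new = np.array([[2.5]])
lin_pred = lin.predict(x_new)[0]  # a continuous value between the neighbours
clf_pred = clf.predict(x_new)[0]  # always one of the training salaries
print(lin_pred, clf_pred)
```

With thousands of distinct salaries, the classifier has to fit one class per value, which may also explain much longer fit times than a linear regression would need.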

In [69]:
ada_model = AdaBoostRegressor(random_state=1)

In [70]:
knn_model = KNeighborsRegressor()

## Training

In [134]:
_X_train = X_train[0:10000, :]
_y = y[0:10000]

In [135]:
_X_train.shape, _y.shape

((10000, 9), (10000,))
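
Taking the first 10,000 rows is quick, but if the file is ordered in any way the subset may not represent the full data. A sketch of drawing a random subset instead, using NumPy's `default_rng`, is shown below. The toy arrays and the seed are assumptions for illustration.

```python
import numpy as np

# Toy stand-ins for the full feature matrix and target
X_full = np.arange(100).reshape(50, 2)
y_full = np.arange(50)

# Draw row positions without replacement, with a fixed seed for repeatability
rng = np.random.default_rng(1)
idx = rng.choice(len(X_full), size=10, replace=False)

# Index features and target with the same positions so they stay aligned
X_sub, y_sub = X_full[idx], y_full[idx]
print(X_sub.shape, y_sub.shape)
```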

### KNN

In [142]:
cv_knn = cross_validation(model=knn_model, X=_X_train, y=_y)
cv_knn

[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-2550.122, neg_mean_squared_error=-15744662.243280001, total=   0.2s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.7s remaining:    0.0s


[CV]  , neg_mean_absolute_error=-2214.6792, neg_mean_squared_error=-12458117.76312, total=   0.1s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    3.1s remaining:    0.0s


[CV]  , neg_mean_absolute_error=-2828.7947999999997, neg_mean_squared_error=-15848798.990559999, total=   0.2s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    4.4s remaining:    0.0s


[CV]  , neg_mean_absolute_error=-2497.3994000000002, neg_mean_squared_error=-13610335.29028, total=   0.2s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    5.8s remaining:    0.0s


[CV]  , neg_mean_absolute_error=-2165.3876, neg_mean_squared_error=-9908923.28816, total=   0.2s
[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-2624.421, neg_mean_squared_error=-15562216.90532, total=   0.2s
[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-2649.2385999999997, neg_mean_squared_error=-16014176.654759998, total=   0.2s
[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-2482.4066000000003, neg_mean_squared_error=-14431853.60396, total=   0.2s
[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-2747.912, neg_mean_squared_error=-16478976.41624, total=   0.2s
[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-2782.3858, neg_mean_squared_error=-16022491.66852, total=   0.2s


[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:   13.8s finished


{'fit_time': array([0.08238339, 0.01304674, 0.01403999, 0.01553154, 0.01260948,
        0.01632166, 0.01613665, 0.01424146, 0.01343989, 0.01306725]),
 'score_time': array([0.16094398, 0.11355901, 0.19534731, 0.19469833, 0.17406321,
        0.16777301, 0.15837765, 0.14983416, 0.15908694, 0.1451602 ]),
 'test_neg_mean_absolute_error': array([-2550.122 , -2214.6792, -2828.7948, -2497.3994, -2165.3876,
        -2624.421 , -2649.2386, -2482.4066, -2747.912 , -2782.3858]),
 'train_neg_mean_absolute_error': array([-1954.66437778, -1887.07488889, -1791.60766667, -1796.35971111,
        -1870.7512    , -1875.34964444, -1819.23557778, -1833.56475556,
        -1832.91535556, -1807.4956    ]),
 'test_neg_mean_squared_error': array([-15744662.24328, -12458117.76312, -15848798.99056, -13610335.29028,
         -9908923.28816, -15562216.90532, -16014176.65476, -14431853.60396,
        -16478976.41624, -16022491.66852]),
 'train_neg_mean_squared_error': array([-8701229.5046    , -8395795.83823111, -791

### ADA

In [143]:
cv_ada = cross_validation(model=ada_model, X=_X_train, y=_y)
cv_ada

[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-1257.442047543157, neg_mean_squared_error=-2447096.3696461488, total=   0.6s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.7s remaining:    0.0s


[CV]  , neg_mean_absolute_error=-1189.1709356958243, neg_mean_squared_error=-2223189.859184863, total=   0.8s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    1.5s remaining:    0.0s


[CV]  , neg_mean_absolute_error=-1083.9454787092502, neg_mean_squared_error=-1855631.928882781, total=   1.2s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    2.8s remaining:    0.0s


[CV]  , neg_mean_absolute_error=-1487.3715982455426, neg_mean_squared_error=-3761588.0604206715, total=   0.5s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    3.3s remaining:    0.0s


[CV]  , neg_mean_absolute_error=-1310.5972536349068, neg_mean_squared_error=-2692964.3258552663, total=   0.9s
[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-1802.1741363665485, neg_mean_squared_error=-4902974.649374635, total=   0.8s
[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-1380.4973449396166, neg_mean_squared_error=-3982149.8791560703, total=   0.7s
[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-1441.7396684492999, neg_mean_squared_error=-3051915.462881405, total=   0.6s
[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-1543.7290467987125, neg_mean_squared_error=-3290416.9288304797, total=   0.7s
[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-1091.8701408406612, neg_mean_squared_error=-1966254.4885299

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    8.5s finished


{'fit_time': array([0.61782336, 0.76314497, 1.151968  , 0.49003673, 0.90698767,
        0.7666502 , 0.65726519, 0.63609624, 0.66434479, 1.16955066]),
 'score_time': array([0.00834894, 0.00970173, 0.01907849, 0.00723267, 0.01216912,
        0.00931382, 0.00864649, 0.010921  , 0.01091027, 0.01821923]),
 'test_neg_mean_absolute_error': array([-1257.44204754, -1189.1709357 , -1083.94547871, -1487.37159825,
        -1310.59725363, -1802.17413637, -1380.49734494, -1441.73966845,
        -1543.7290468 , -1091.87014084]),
 'train_neg_mean_absolute_error': array([-1294.58419399, -1288.47175685, -1047.69419239, -1429.17868539,
        -1241.46957057, -1663.71923839, -1315.87213727, -1460.39681608,
        -1527.93340331, -1070.47354605]),
 'test_neg_mean_squared_error': array([-2447096.36964615, -2223189.85918486, -1855631.92888278,
        -3761588.06042067, -2692964.32585527, -4902974.64937463,
        -3982149.87915607, -3051915.4628814 , -3290416.92883048,
        -1966254.48852993]),
 'trai

### Gradient Boosting

In [144]:
cv_gb = cross_validation(model=gb_model, X=_X_train, y=_y)
cv_gb

[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-67.30092815795007, neg_mean_squared_error=-80788.45260609436, total=   1.1s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.1s remaining:    0.0s


[CV]  , neg_mean_absolute_error=-60.46498147038024, neg_mean_squared_error=-9971.01019403888, total=   1.1s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    2.2s remaining:    0.0s


[CV]  , neg_mean_absolute_error=-52.47695261015145, neg_mean_squared_error=-46967.30204160595, total=   1.0s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    3.3s remaining:    0.0s


[CV]  , neg_mean_absolute_error=-54.373526364346745, neg_mean_squared_error=-9806.417986480847, total=   1.1s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    4.4s remaining:    0.0s


[CV]  , neg_mean_absolute_error=-44.49609575903625, neg_mean_squared_error=-7428.717945544002, total=   1.1s
[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-48.717534495005694, neg_mean_squared_error=-7953.6024637814835, total=   1.0s
[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-45.175241995388674, neg_mean_squared_error=-11583.960540118911, total=   1.0s
[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-47.44889754410903, neg_mean_squared_error=-8361.091747525212, total=   1.0s
[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-55.6102562079959, neg_mean_squared_error=-13680.677461129439, total=   1.1s
[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-56.90067018536043, neg_mean_squared_error=-74653.60282511871, t

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:   10.9s finished


{'fit_time': array([1.05953479, 1.05363297, 0.99702692, 1.09057975, 1.05202579,
        1.03469467, 1.01335287, 1.00400591, 1.08423162, 1.00319076]),
 'score_time': array([0.005548  , 0.00565481, 0.0057559 , 0.00591278, 0.00648952,
        0.00572777, 0.00562668, 0.00620365, 0.00620508, 0.00677609]),
 'test_neg_mean_absolute_error': array([-67.30092816, -60.46498147, -52.47695261, -54.37352636,
        -44.49609576, -48.7175345 , -45.175242  , -47.44889754,
        -55.61025621, -56.90067019]),
 'train_neg_mean_absolute_error': array([-45.31289308, -44.90129377, -49.11102351, -43.15026228,
        -48.24343662, -46.25471808, -47.28219049, -48.53999471,
        -47.85831803, -47.69498149]),
 'test_neg_mean_squared_error': array([-80788.45260609,  -9971.01019404, -46967.30204161,  -9806.41798648,
         -7428.71794554,  -7953.60246378, -11583.96054012,  -8361.09174753,
        -13680.67746113, -74653.60282512]),
 'train_neg_mean_squared_error': array([-7631.01840535, -7553.43067035, -7

### Random Forest

In [138]:
cv_rf = cross_validation(X=_X_train, y=_y, model=rf_model)
cv_rf

[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-55.562509551141346, neg_mean_squared_error=-586325.2316016416, total=   2.6s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    2.8s remaining:    0.0s


[CV]  , neg_mean_absolute_error=-22.963020819693092, neg_mean_squared_error=-5444.335279026209, total=   2.6s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    5.5s remaining:    0.0s


[CV]  , neg_mean_absolute_error=-35.89334843694616, neg_mean_squared_error=-93258.77280102817, total=   2.6s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    8.3s remaining:    0.0s


[CV]  , neg_mean_absolute_error=-81.37884162785562, neg_mean_squared_error=-771229.6259854292, total=   2.7s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:   11.1s remaining:    0.0s


[CV]  , neg_mean_absolute_error=-41.08348085124478, neg_mean_squared_error=-713339.6473190195, total=   2.7s
[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-34.737701291830305, neg_mean_squared_error=-47771.83273553999, total=   2.6s
[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-114.57181423539531, neg_mean_squared_error=-1913367.856971233, total=   2.6s
[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-63.0794198892261, neg_mean_squared_error=-606083.2693849637, total=   2.6s
[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-39.05665633500016, neg_mean_squared_error=-40281.6753056752, total=   2.6s
[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-50.68000791179162, neg_mean_squared_error=-146792.16902688204, tota

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:   27.6s finished


In [139]:
# cv_rf

{'fit_time': array([2.58192205, 2.59777188, 2.53510451, 2.63707209, 2.62536192,
        2.57012105, 2.55237556, 2.60491347, 2.58832979, 2.55537987]),
 'score_time': array([0.02659726, 0.02875805, 0.02907968, 0.02380204, 0.02556014,
        0.02531815, 0.028651  , 0.02699685, 0.02937818, 0.02381301]),
 'test_neg_mean_absolute_error': array([ -55.56250955,  -22.96302082,  -35.89334844,  -81.37884163,
         -41.08348085,  -34.73770129, -114.57181424,  -63.07941989,
         -39.05665634,  -50.68000791]),
 'train_neg_mean_absolute_error': array([-42.04888757, -47.43308463, -48.87857052, -45.34740659,
        -44.90841964, -47.67803349, -39.09535815, -43.04473874,
        -47.48205401, -45.71524366]),
 'test_neg_mean_squared_error': array([ -586325.23160164,    -5444.33527903,   -93258.77280103,
         -771229.62598543,  -713339.64731902,   -47771.83273554,
        -1913367.85697123,  -606083.26938496,   -40281.67530568,
         -146792.16902688]),
 'train_neg_mean_squared_error': arr

### Logistic Regression

In [140]:
cv_lgr = cross_validation(model=lgr_model, X=_X_train, y=_y)
cv_lgr

[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-9388.959, neg_mean_squared_error=-298186838.629, total=  59.3s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   59.6s remaining:    0.0s


[CV]  , neg_mean_absolute_error=-5111.648, neg_mean_squared_error=-58447057.352, total=  58.8s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  2.0min remaining:    0.0s


[CV]  , neg_mean_absolute_error=-6677.879, neg_mean_squared_error=-102045168.657, total= 1.0min
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  3.0min remaining:    0.0s


[CV]  , neg_mean_absolute_error=-6078.356, neg_mean_squared_error=-91384368.7, total= 1.0min
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:  4.0min remaining:    0.0s


[CV]  , neg_mean_absolute_error=-5315.759, neg_mean_squared_error=-57498299.619, total= 1.0min
[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-5778.033, neg_mean_squared_error=-58494403.317, total=  58.7s
[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-6840.943, neg_mean_squared_error=-119362198.729, total= 1.0min
[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-7490.043, neg_mean_squared_error=-130218972.545, total= 1.0min
[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-7911.921, neg_mean_squared_error=-140723642.127, total=  59.9s
[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-7854.64, neg_mean_squared_error=-135600951.766, total=  59.9s


[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed: 10.1min finished


In [141]:
# cv_lgr

{'fit_time': array([59.27159739, 58.77196527, 60.06457376, 60.46583629, 61.50546384,
        58.65249777, 61.27168941, 60.19165301, 59.83763933, 59.86498761]),
 'score_time': array([0.01460624, 0.0123167 , 0.01384878, 0.013659  , 0.01427674,
        0.0137887 , 0.01442432, 0.01476526, 0.0124917 , 0.01391506]),
 'test_neg_mean_absolute_error': array([-9388.959, -5111.648, -6677.879, -6078.356, -5315.759, -5778.033,
        -6840.943, -7490.043, -7911.921, -7854.64 ]),
 'train_neg_mean_absolute_error': array([-6396.45722222, -6387.49566667, -6002.62811111, -5871.79622222,
        -5904.23644444, -6118.644     , -5911.43333333, -5749.06744444,
        -6107.43955556, -6118.50133333]),
 'test_neg_mean_squared_error': array([-2.98186839e+08, -5.84470574e+07, -1.02045169e+08, -9.13843687e+07,
        -5.74982996e+07, -5.84944033e+07, -1.19362199e+08, -1.30218973e+08,
        -1.40723642e+08, -1.35600952e+08]),
 'train_neg_mean_squared_error': array([-1.01885621e+08, -9.74869099e+07, -9.05698
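
To compare the five runs side by side, the per-fold test MAE arrays printed above can be collected into one table. The sketch below copies those arrays from the outputs and reports their mean and standard deviation; `summary` and `fold_mae` are hypothetical names. The ranking should be read with caution, because the feature matrix used in these runs still contained the `SalaryNormalized` column.

```python
import numpy as np
import pandas as pd

# Negated test MAE per fold, copied from the cross-validation outputs above
fold_mae = {
    "KNN": [-2550.122, -2214.6792, -2828.7948, -2497.3994, -2165.3876,
            -2624.421, -2649.2386, -2482.4066, -2747.912, -2782.3858],
    "AdaBoost": [-1257.44204754, -1189.1709357, -1083.94547871, -1487.37159825,
                 -1310.59725363, -1802.17413637, -1380.49734494, -1441.73966845,
                 -1543.7290468, -1091.87014084],
    "Gradient Boosting": [-67.30092816, -60.46498147, -52.47695261, -54.37352636,
                          -44.49609576, -48.7175345, -45.175242, -47.44889754,
                          -55.61025621, -56.90067019],
    "Random Forest": [-55.56250955, -22.96302082, -35.89334844, -81.37884163,
                      -41.08348085, -34.73770129, -114.57181424, -63.07941989,
                      -39.05665634, -50.68000791],
    "Logistic Regression": [-9388.959, -5111.648, -6677.879, -6078.356, -5315.759,
                            -5778.033, -6840.943, -7490.043, -7911.921, -7854.64],
}

# Flip the sign to get positive errors, then summarise per model
summary = pd.DataFrame({
    name: {"mean_mae": -np.mean(vals), "std_mae": np.std(vals)}
    for name, vals in fold_mae.items()
}).T.sort_values("mean_mae")
print(summary)
```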