# Job Salary Prediction
_Predict the salary of any UK job ad based on its contents_

### Job Data

- **Id**: Identificador para cada job.

- **Title**: Texto livre com o titulo ou resumo da vaga.

- **FullDescription**: Descrição da vaga sem qualquer informação salarial.

- **LocationRaw**: Localização da vaga em texto livre.

- **LocationNormalized**: Localização aproximada a partir da convesao do texto livre.

- **ContractType**: full_time ou part_time.

- **ContractTime**: permanent or contract.

- **Company**: Nome da empresa.

- **Category**: Qual das 30 categorias de trabalho padrão esse anúncio se encaixa, inferida de uma maneira muito confusa com base na origem da origem do anúncio. Sabemos que há muito barulho e erro nesse campo.

- **SalaryRaw**: Descrição salarial em texto livre.

- **SalaryNormalised**: Salario bruto anual. Valor que estamos tentando prever.

- **SourceName**: Nome do site ou anunciante da vaga.

### Location Tree

Este é um conjunto de dados suplementares que descreve o relacionamento hierárquico entre os diferentes locais normalizados mostrados nos dados do trabalho. É provável que existam relações significativas entre os salários dos empregos em uma área geográfica semelhante, por exemplo, os salários médios em Londres e no Sudeste são mais altos do que no resto do Reino Unido.

### Saida


    Id,SalaryNormalized
    13656201,36205
    14663195,74570
    16530664,31910.50
    ... 
    

## Imports

In [2]:
import numpy as np
import pandas as pd
from sklearn.svm import SVR
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import KFold, cross_validate
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor


## Dataset

In [67]:
# Install a Drive FUSE wrapper.
# https://github.com/astrada/google-drive-ocamlfuse
# !apt-get install -y -qq software-properties-common python-software-properties module-init-tools
# !add-apt-repository -y ppa:alessandro-strada/ppa 2>&1 > /dev/null
# !apt-get update -qq 2>&1 > /dev/null
# !apt-get -y install -qq google-drive-ocamlfuse fuse

gpg: keybox '/tmp/tmpfhvsk8t_/pubring.gpg' created
gpg: /tmp/tmpfhvsk8t_/trustdb.gpg: trustdb created
gpg: key AD5F235DF639B041: public key "Launchpad PPA for Alessandro Strada" imported
gpg: Total number processed: 1
gpg:               imported: 1


In [72]:
# Generate auth tokens for Colab
# from google.colab import auth
# auth.authenticate_user()

In [0]:
# Generate creds for the Drive FUSE library.
# from oauth2client.client import GoogleCredentials
# creds = GoogleCredentials.get_application_default()
# import getpass
# !google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret} < /dev/null 2>&1 | grep URL
# vcode = getpass.getpass()
# !echo {vcode} | google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret}

In [12]:
# Create a directory and mount Google Drive using that directory.
# !mkdir -p drive
# !google-drive-ocamlfuse drive

In [13]:
# !cd drive/

In [3]:
!ls

Job Salary Prediction.ipynb		    README.md
List_12__Clustering.ipynb		    Train_rev1.csv
List_13__Clusterization_Hierarchical.ipynb  Train_rev1.zip
Não confirmado 712499.crdownload


In [15]:
# df_job_data = pd.read_csv('drive/MachineLearning/JobSalaryPredict/Train_rev1.csv')

In [5]:
df_job_data = pd.read_csv('Train_rev1.csv')

In [16]:
# df_test_rev1 = pd.read_csv('drive/MachineLearning/JobSalaryPredict/Test_rev1.csv')

In [6]:
df_test_rev1 = pd.read_csv('Test_rev1.csv')

## Informações

### Job Data

In [7]:
df_job_data.head(n=2)

Unnamed: 0,Id,Title,FullDescription,LocationRaw,LocationNormalized,ContractType,ContractTime,Company,Category,SalaryRaw,SalaryNormalized,SourceName
0,12612628,Engineering Systems Analyst,Engineering Systems Analyst Dorking Surrey Sal...,"Dorking, Surrey, Surrey",Dorking,,permanent,Gregory Martin International,Engineering Jobs,20000 - 30000/annum 20-30K,25000,cv-library.co.uk
1,12612830,Stress Engineer Glasgow,Stress Engineer Glasgow Salary **** to **** We...,"Glasgow, Scotland, Scotland",Glasgow,,permanent,Gregory Martin International,Engineering Jobs,25000 - 35000/annum 25-35K,30000,cv-library.co.uk


In [8]:
df_job_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244768 entries, 0 to 244767
Data columns (total 12 columns):
Id                    244768 non-null int64
Title                 244767 non-null object
FullDescription       244768 non-null object
LocationRaw           244768 non-null object
LocationNormalized    244768 non-null object
ContractType          65442 non-null object
ContractTime          180863 non-null object
Company               212338 non-null object
Category              244768 non-null object
SalaryRaw             244768 non-null object
SalaryNormalized      244768 non-null int64
SourceName            244767 non-null object
dtypes: int64(2), object(10)
memory usage: 22.4+ MB


### Test

In [9]:
df_test_rev1.head(n=2)

Unnamed: 0,Id,Title,FullDescription,LocationRaw,LocationNormalized,ContractType,ContractTime,Company,Category,SourceName
0,11888454,Business Development Manager,The Company: Our client is a national training...,"Tyne Wear, North East",Newcastle Upon Tyne,,permanent,Asset Appointments,Teaching Jobs,cv-library.co.uk
1,11988350,Internal Account Manager,The Company: Founded in **** our client is a U...,"Tyne and Wear, North East",Newcastle Upon Tyne,,permanent,Asset Appointments,Consultancy Jobs,cv-library.co.uk


In [10]:
df_test_rev1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 122463 entries, 0 to 122462
Data columns (total 10 columns):
Id                    122463 non-null int64
Title                 122463 non-null object
FullDescription       122463 non-null object
LocationRaw           122463 non-null object
LocationNormalized    122463 non-null object
ContractType          33013 non-null object
ContractTime          90702 non-null object
Company               106202 non-null object
Category              122463 non-null object
SourceName            122463 non-null object
dtypes: int64(1), object(9)
memory usage: 9.3+ MB


## Pré-processamento

In [44]:
def normalizeTextField(df, field):
    vectorizer = CountVectorizer(max_features=100)
    fields = vectorizer.fit_transform(df[field]).toarray()
    # Generate field names
    fcols = np.vectorize(lambda x: field + str(x))(np.arange(2))
    # Reduz a dimensionalidade para 2 
    pca = PCA(n_components = 2)
    _df = pd.DataFrame(pca.fit_transform(fields), columns=fcols)
    # Concatena o dataframe com o novo
    df = pd.concat([df, _df], join ='inner', axis=1)
    del df[field]
    return df

### SalaryRaw

In [12]:
del df_job_data['SalaryRaw']

### Remove ContractType

Grande quantidade de valores null

In [13]:
del df_job_data['ContractType']
del df_test_rev1['ContractType']

### Remove ContractTime

In [17]:
del df_job_data['ContractTime']
del df_test_rev1['ContractTime']

### Removendo Category

In [18]:
del df_job_data['Category']
del df_test_rev1['Category']

### Removendo Location Raw

In [19]:
del df_job_data['LocationRaw']
del df_test_rev1['LocationRaw']

### Company

In [20]:
del df_job_data['Company']

In [21]:
del df_test_rev1['Company']

### Removendo linhas com valores NULL

In [23]:
df_job_data.dropna(subset=['Title'], inplace = True)

In [24]:
df_job_data.dropna(subset=['SourceName'], inplace = True)

### Retirando Label

In [25]:
y = df_job_data['SalaryNormalized'].values

In [26]:
y

array([25000, 30000, 30000, ..., 22800, 22800, 42500])

### Retirando IDS

In [27]:
idx_job = df_job_data['Id'].values

In [28]:
idx_job

array([12612628, 12612830, 12612844, ..., 72705213, 72705216, 72705235])

In [29]:
idx_test = df_test_rev1['Id'].values

In [30]:
idx_test

array([11888454, 11988350, 12612558, ..., 72705210, 72705214, 72705218])

### Juntando conteudo

In [31]:
df_job_tuple = df_job_data.shape
df_job_tuple

(244766, 6)

In [32]:
df_test_tuple = df_test_rev1.shape
df_test_tuple

(122463, 5)

In [38]:
df = df_job_data.append(df_test_rev1, sort=False)

In [45]:
df.shape

(367229, 6)

#### LocationNormalized

In [46]:
df = normalizeTextField(df, 'LocationNormalized')

In [47]:
df.shape

(367229, 7)

In [48]:
df.head()

Unnamed: 0,Id,Title,FullDescription,SalaryNormalized,SourceName,LocationNormalized0,LocationNormalized1
0,12612628,Engineering Systems Analyst,Engineering Systems Analyst Dorking Surrey Sal...,25000.0,cv-library.co.uk,-0.11679,-0.229172
1,12612830,Stress Engineer Glasgow,Stress Engineer Glasgow Salary **** to **** We...,30000.0,cv-library.co.uk,-0.118995,-0.237572
2,12612844,Modelling and simulation analyst,Mathematical Modeller / Simulation Analyst / O...,30000.0,cv-library.co.uk,-0.120516,-0.241914
3,12613049,Engineering Systems Analyst / Mathematical Mod...,Engineering Systems Analyst / Mathematical Mod...,27500.0,cv-library.co.uk,-0.122604,-0.249312
4,12613647,"Pioneer, Miser Engineering Systems Analyst","Pioneer, Miser Engineering Systems Analyst Do...",25000.0,cv-library.co.uk,-0.122604,-0.249312


#### Title

In [49]:
df = normalizeTextField(df, 'Title')

In [50]:
df.shape

(367229, 8)

In [51]:
df.head()

Unnamed: 0,Id,FullDescription,SalaryNormalized,SourceName,LocationNormalized0,LocationNormalized1,Title0,Title1
0,12612628,Engineering Systems Analyst Dorking Surrey Sal...,25000.0,cv-library.co.uk,-0.11679,-0.229172,-0.211711,0.010125
1,12612830,Stress Engineer Glasgow Salary **** to **** We...,30000.0,cv-library.co.uk,-0.118995,-0.237572,-0.379567,-0.578679
2,12612844,Mathematical Modeller / Simulation Analyst / O...,30000.0,cv-library.co.uk,-0.120516,-0.241914,-0.204023,0.064291
3,12613049,Engineering Systems Analyst / Mathematical Mod...,27500.0,cv-library.co.uk,-0.122604,-0.249312,-0.211711,0.010125
4,12613647,"Pioneer, Miser Engineering Systems Analyst Do...",25000.0,cv-library.co.uk,-0.122604,-0.249312,-0.211711,0.010125


#### Full Description

In [52]:
df = normalizeTextField(df, 'FullDescription')

In [53]:
df.shape

(367229, 9)

In [54]:
df.head()

Unnamed: 0,Id,SalaryNormalized,SourceName,LocationNormalized0,LocationNormalized1,Title0,Title1,FullDescription0,FullDescription1
0,12612628,25000.0,cv-library.co.uk,-0.11679,-0.229172,-0.211711,0.010125,-18.530014,2.881801
1,12612830,30000.0,cv-library.co.uk,-0.118995,-0.237572,-0.379567,-0.578679,1.115408,-2.899838
2,12612844,30000.0,cv-library.co.uk,-0.120516,-0.241914,-0.204023,0.064291,-1.111251,2.198476
3,12613049,27500.0,cv-library.co.uk,-0.122604,-0.249312,-0.211711,0.010125,-18.890457,3.393422
4,12613647,25000.0,cv-library.co.uk,-0.122604,-0.249312,-0.211711,0.010125,-19.451188,2.751042


#### Source Name

In [55]:
_, sources = np.unique(df['SourceName'], return_inverse=True)

In [56]:
sources.shape

(367229,)

In [57]:
df['SourceName'] = sources

In [58]:
df.shape

(367229, 9)

In [59]:
df.head(n=2)

Unnamed: 0,Id,SalaryNormalized,SourceName,LocationNormalized0,LocationNormalized1,Title0,Title1,FullDescription0,FullDescription1
0,12612628,25000.0,42,-0.11679,-0.229172,-0.211711,0.010125,-18.530014,2.881801
1,12612830,30000.0,42,-0.118995,-0.237572,-0.379567,-0.578679,1.115408,-2.899838


In [60]:
df.tail(n=2)

Unnamed: 0,Id,SalaryNormalized,SourceName,LocationNormalized0,LocationNormalized1,Title0,Title1,FullDescription0,FullDescription1
122461,72705214,,64,-0.11679,-0.229172,0.868987,-0.102754,-3.389519,-0.760345
122462,72705218,,64,-0.118635,-0.235408,-0.168568,0.034971,-13.765711,-0.120908


### Separando Train e Test 

In [150]:
del df['Id']
del df['SalaryNormalized']

In [151]:
df.head()

Unnamed: 0,SourceName,LocationNormalized0,LocationNormalized1,Title0,Title1,FullDescription0,FullDescription1
0,42,-0.11679,-0.229172,-0.211711,0.010125,-18.530014,2.881801
1,42,-0.118995,-0.237572,-0.379567,-0.578679,1.115408,-2.899838
2,42,-0.120516,-0.241914,-0.204023,0.064291,-1.111251,2.198476
3,42,-0.122604,-0.249312,-0.211711,0.010125,-18.890457,3.393422
4,42,-0.122604,-0.249312,-0.211711,0.010125,-19.451188,2.751042


In [152]:
X_train = df.values[:df_job_tuple[0], :df_job_tuple[0]]

In [154]:
X_test = df.values[:df_test_tuple[0], :df_test_tuple[0]]

In [155]:
X_train.shape, X_test.shape

((244766, 7), (122463, 7))

## Estratificando dados

In [156]:
scaler = StandardScaler()

## Criando Folds

In [157]:
n_splits = 10
kfold = KFold(n_splits=n_splits)

## Função para executar modelos

In [158]:
def cross_validation(model, X, y):
    scoring = [ 'neg_mean_absolute_error', 'neg_mean_squared_error']
    pipeline = Pipeline([('transformer', scaler), ('estimator', model)])
    
    return cross_validate(pipeline, X=X, y=y, cv=kfold, n_jobs=1, verbose=5, scoring=scoring, return_train_score=True)

## Criando modelos

In [159]:
rf_model = RandomForestRegressor(n_estimators=50, min_samples_split=30, random_state=1)

In [160]:
gb_model = GradientBoostingRegressor(min_samples_split=30, random_state=1)

In [161]:
lgr_model = LogisticRegression(random_state=1)

In [162]:
ada_model = AdaBoostRegressor(random_state=1)

In [163]:
knn_model = KNeighborsRegressor()

## Treinamento

### KNN

In [174]:
cv_knn = cross_validation(model=knn_model, X=X_train, y=y)
cv_knn

[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-12733.64104261143, neg_mean_squared_error=-307943906.0020215, total=   8.7s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   51.5s remaining:    0.0s


[CV]  , neg_mean_absolute_error=-13537.325007149568, neg_mean_squared_error=-331869801.1730604, total=   5.8s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  1.6min remaining:    0.0s


[CV]  , neg_mean_absolute_error=-13820.028255096622, neg_mean_squared_error=-325121743.29920167, total=   4.8s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  2.3min remaining:    0.0s


[CV]  , neg_mean_absolute_error=-13511.166180495977, neg_mean_squared_error=-325750042.36574256, total=   5.5s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:  3.0min remaining:    0.0s


[CV]  , neg_mean_absolute_error=-13446.533063692445, neg_mean_squared_error=-326687714.3590391, total=   5.4s
[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-13315.152780160968, neg_mean_squared_error=-326429621.8620077, total=   5.9s
[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-13295.320975649618, neg_mean_squared_error=-324666501.2103073, total=   5.6s
[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-13573.228051969276, neg_mean_squared_error=-336299433.247127, total=   5.3s
[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-13662.834343847033, neg_mean_squared_error=-331905233.28483576, total=   5.1s
[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-13516.282464454976, neg_mean_squared_error=-331258782.8601537,

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:  7.3min finished


{'fit_time': array([1.38229799, 0.85033631, 0.83714104, 0.8260119 , 0.80018258,
        0.92071152, 0.93332052, 0.85319853, 0.81079412, 0.84515548]),
 'score_time': array([7.34822392, 4.99898529, 3.96101522, 4.67936158, 4.59908962,
        4.99185944, 4.67876554, 4.40313911, 4.32390618, 4.59865427]),
 'test_neg_mean_absolute_error': array([-12733.64104261, -13537.32500715, -13820.0282551 , -13511.1661805 ,
        -13446.53306369, -13315.15278016, -13295.32097565, -13573.22805197,
        -13662.83434385, -13516.28246445]),
 'train_neg_mean_absolute_error': array([-10929.33158079, -10831.40853061, -10820.82836819, -10827.4884075 ,
        -10837.17296188, -10844.06816409, -10859.00943756, -10821.21451814,
        -10813.06045032, -10824.33693677]),
 'test_neg_mean_squared_error': array([-3.07943906e+08, -3.31869801e+08, -3.25121743e+08, -3.25750042e+08,
        -3.26687714e+08, -3.26429622e+08, -3.24666501e+08, -3.36299433e+08,
        -3.31905233e+08, -3.31258783e+08]),
 'train_neg_me

### ADA

In [175]:
cv_ada = cross_validation(model=ada_model, X=X_train, y=y)
cv_ada

[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-17329.61162205793, neg_mean_squared_error=-406408496.5251206, total=   9.3s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    9.9s remaining:    0.0s


[CV]  , neg_mean_absolute_error=-17831.891197009496, neg_mean_squared_error=-431803548.2915109, total=  11.1s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   21.8s remaining:    0.0s


[CV]  , neg_mean_absolute_error=-14536.257642826653, neg_mean_squared_error=-306124162.3752512, total=   6.5s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:   28.8s remaining:    0.0s


[CV]  , neg_mean_absolute_error=-14216.09692198184, neg_mean_squared_error=-306231680.13089, total=   5.5s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:   34.6s remaining:    0.0s


[CV]  , neg_mean_absolute_error=-14141.235030478612, neg_mean_squared_error=-304483101.4509801, total=   7.3s
[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-14210.360811278259, neg_mean_squared_error=-308934397.8279484, total=   6.6s
[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-13765.519680757305, neg_mean_squared_error=-293585642.544772, total=   5.0s
[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-14486.368063944072, neg_mean_squared_error=-319019650.3453199, total=   7.5s
[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-14104.098581094147, neg_mean_squared_error=-304453196.59844565, total=   5.1s
[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-14127.456515985272, neg_mean_squared_error=-308041234.94734216

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:  1.2min finished


{'fit_time': array([ 9.27835393, 11.04466414,  6.48367429,  5.50058055,  7.21869135,
         6.50926232,  4.98939395,  7.41086197,  5.04104948,  5.96006632]),
 'score_time': array([0.06127882, 0.08084559, 0.04047608, 0.03484178, 0.04428959,
        0.04147696, 0.03230834, 0.0484705 , 0.03206539, 0.03724194]),
 'test_neg_mean_absolute_error': array([-17329.61162206, -17831.89119701, -14536.25764283, -14216.09692198,
        -14141.23503048, -14210.36081128, -13765.51968076, -14486.36806394,
        -14104.09858109, -14127.45651599]),
 'train_neg_mean_absolute_error': array([-16555.64898307, -17857.2183484 , -14418.23666158, -13949.5311151 ,
        -14098.83491127, -14252.95978534, -13985.97304756, -14341.66685935,
        -14009.32454021, -13991.02832823]),
 'test_neg_mean_squared_error': array([-4.06408497e+08, -4.31803548e+08, -3.06124162e+08, -3.06231680e+08,
        -3.04483101e+08, -3.08934398e+08, -2.93585643e+08, -3.19019650e+08,
        -3.04453197e+08, -3.08041235e+08]),
 'tr

### Gradient Boosting

In [176]:
cv_gb = cross_validation(model=gb_model, X=X_train, y=y)
cv_gb

[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-11103.428348327567, neg_mean_squared_error=-235727362.57104668, total=  24.4s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   25.1s remaining:    0.0s


[CV]  , neg_mean_absolute_error=-11764.01484388264, neg_mean_squared_error=-252027212.83401963, total=  23.7s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   49.4s remaining:    0.0s


[CV]  , neg_mean_absolute_error=-12071.558423036431, neg_mean_squared_error=-252010695.6776696, total=  24.5s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  1.2min remaining:    0.0s


[CV]  , neg_mean_absolute_error=-11696.462284706546, neg_mean_squared_error=-249974864.99507242, total=  24.0s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:  1.7min remaining:    0.0s


[CV]  , neg_mean_absolute_error=-11684.342587998824, neg_mean_squared_error=-250667371.07004234, total=  24.9s
[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-11396.64044126633, neg_mean_squared_error=-244658255.91327146, total=  23.8s
[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-11617.342779489149, neg_mean_squared_error=-249437395.17325613, total=  24.4s
[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-11824.865124412932, neg_mean_squared_error=-256893163.21951717, total=  23.8s
[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-11853.573757634615, neg_mean_squared_error=-252299507.50623816, total=  24.6s
[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-12170.125813880111, neg_mean_squared_error=-267176339.0380

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:  4.2min finished


{'fit_time': array([24.34532571, 23.60189414, 24.41157103, 23.94695377, 24.8332665 ,
        23.73723745, 24.28669786, 23.72914934, 24.49623656, 25.29009271]),
 'score_time': array([0.06644773, 0.06725311, 0.07075977, 0.0655942 , 0.06925273,
        0.06957865, 0.06806302, 0.06797767, 0.0685904 , 0.07125211]),
 'test_neg_mean_absolute_error': array([-11103.42834833, -11764.01484388, -12071.55842304, -11696.46228471,
        -11684.342588  , -11396.64044127, -11617.34277949, -11824.86512441,
        -11853.57375763, -12170.12581388]),
 'train_neg_mean_absolute_error': array([-11742.27837555, -11649.85522723, -11630.45281123, -11674.13646185,
        -11678.89934433, -11730.28971197, -11701.20322227, -11675.6896385 ,
        -11690.39968086, -11614.57676922]),
 'test_neg_mean_squared_error': array([-2.35727363e+08, -2.52027213e+08, -2.52010696e+08, -2.49974865e+08,
        -2.50667371e+08, -2.44658256e+08, -2.49437395e+08, -2.56893163e+08,
        -2.52299508e+08, -2.67176339e+08]),
 'tr

### Random Forest

In [177]:
cv_rf = cross_validation(model=rf_model, X=X_train, y=y)
cv_rf

[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-10877.503380347332, neg_mean_squared_error=-230223446.90212265, total= 1.5min
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  1.6min remaining:    0.0s


[CV]  , neg_mean_absolute_error=-11476.277420387503, neg_mean_squared_error=-237524249.61838746, total= 1.5min
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  3.3min remaining:    0.0s


[CV]  , neg_mean_absolute_error=-12013.191263039213, neg_mean_squared_error=-250168156.26033938, total= 1.5min
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  4.9min remaining:    0.0s


[CV]  , neg_mean_absolute_error=-11552.737114350337, neg_mean_squared_error=-245255634.9223653, total= 1.5min
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:  6.5min remaining:    0.0s


[CV]  , neg_mean_absolute_error=-11361.188738070869, neg_mean_squared_error=-236623800.3453667, total= 1.5min
[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-11080.206942309804, neg_mean_squared_error=-230333823.08544135, total= 1.5min
[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-11411.255762304185, neg_mean_squared_error=-240712765.02964288, total= 1.5min
[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-11544.69617147723, neg_mean_squared_error=-246733282.8710564, total= 1.5min
[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-11560.640168013115, neg_mean_squared_error=-242315932.7496565, total= 1.5min
[CV]  ................................................................
[CV]  , neg_mean_absolute_error=-11995.228282612998, neg_mean_squared_error=-258588333.5637847

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed: 16.2min finished


{'fit_time': array([92.05588675, 90.76209569, 90.65457511, 90.51180816, 92.24868631,
        90.52988768, 91.44932055, 90.99893904, 89.74034119, 92.15906477]),
 'score_time': array([0.6005044 , 0.61848903, 0.65210533, 0.6102097 , 0.66494894,
        0.64939761, 0.61932898, 0.63908172, 0.63714027, 0.61887097]),
 'test_neg_mean_absolute_error': array([-10877.50338035, -11476.27742039, -12013.19126304, -11552.73711435,
        -11361.18873807, -11080.20694231, -11411.2557623 , -11544.69617148,
        -11560.64016801, -11995.22828261]),
 'train_neg_mean_absolute_error': array([-9017.1409959 , -8968.36270123, -8908.36588036, -8953.75484995,
        -8965.3873861 , -8999.87729123, -8968.29601776, -8944.11550602,
        -8949.75831218, -8891.43074246]),
 'test_neg_mean_squared_error': array([-2.30223447e+08, -2.37524250e+08, -2.50168156e+08, -2.45255635e+08,
        -2.36623800e+08, -2.30333823e+08, -2.40712765e+08, -2.46733283e+08,
        -2.42315933e+08, -2.58588334e+08]),
 'train_neg_me

### Logistic Regression

In [None]:
cv_lgr = cross_validation(model=lgr_model, X=X_train, y=y)
cv_lgr