# Salary Predictions
Hello. In this notebook I will be exploring and utilizing data science techinques on both numerical and text based data.

The dataset that I will be using is taken from the Kaggle competition https://www.kaggle.com/c/job-salary-prediction

I will use a combination of Naive Bayes and a Random Forest to make an ensemble of voters. I'm not promising accurate and good predictions in this notebook, but I am promising exploration of different methods and uses of ensembles and the relative effectiveness of those methods.

## First look at Data

Let's load in the data and see what we're working with here

In [64]:
import pandas as pd
salary_table = pd.read_csv('C:/Users/Snapu/Downloads/CIS datasets/Train_rev1/Train_rev1.csv')
salary_table.head(5)

Unnamed: 0,Id,Title,FullDescription,LocationRaw,LocationNormalized,ContractType,ContractTime,Company,Category,SalaryRaw,SalaryNormalized,SourceName
0,12612628,Engineering Systems Analyst,Engineering Systems Analyst Dorking Surrey Sal...,"Dorking, Surrey, Surrey",Dorking,,permanent,Gregory Martin International,Engineering Jobs,20000 - 30000/annum 20-30K,25000,cv-library.co.uk
1,12612830,Stress Engineer Glasgow,Stress Engineer Glasgow Salary **** to **** We...,"Glasgow, Scotland, Scotland",Glasgow,,permanent,Gregory Martin International,Engineering Jobs,25000 - 35000/annum 25-35K,30000,cv-library.co.uk
2,12612844,Modelling and simulation analyst,Mathematical Modeller / Simulation Analyst / O...,"Hampshire, South East, South East",Hampshire,,permanent,Gregory Martin International,Engineering Jobs,20000 - 40000/annum 20-40K,30000,cv-library.co.uk
3,12613049,Engineering Systems Analyst / Mathematical Mod...,Engineering Systems Analyst / Mathematical Mod...,"Surrey, South East, South East",Surrey,,permanent,Gregory Martin International,Engineering Jobs,25000 - 30000/annum 25K-30K negotiable,27500,cv-library.co.uk
4,12613647,"Pioneer, Miser Engineering Systems Analyst","Pioneer, Miser Engineering Systems Analyst Do...","Surrey, South East, South East",Surrey,,permanent,Gregory Martin International,Engineering Jobs,20000 - 30000/annum 20-30K,25000,cv-library.co.uk


In [65]:
print(len(salary_table))
salary_table.isnull().sum()

244768


Id                         0
Title                      1
FullDescription            0
LocationRaw                0
LocationNormalized         0
ContractType          179326
ContractTime           63905
Company                32430
Category                   0
SalaryRaw                  0
SalaryNormalized           0
SourceName                 1
dtype: int64

In [66]:
print("ContractType null percentage: " +str(179326.0/244768))
print("ContractTime null percentage: " +str(63905.0/244768))
print("Company null percentage: " + str(32430.0/244768))

ContractType null percentage: 0.732636619166
ContractTime null percentage: 0.261083965224
Company null percentage: 0.132492809518


In [67]:
salary_table.describe(include='all')

Unnamed: 0,Id,Title,FullDescription,LocationRaw,LocationNormalized,ContractType,ContractTime,Company,Category,SalaryRaw,SalaryNormalized,SourceName
count,244768.0,244767,244768,244768,244768,65442,180863,212338,244768,244768,244768.0,244767
unique,,135435,242138,20986,2732,2,2,20812,29,97286,,167
top,,Business Development Manager,What is expected of you as a Registered Nurse ...,London,UK,full_time,permanent,UKStaffsearch,IT Jobs,"50,000-74,999 yearly",,totaljobs.com
freq,,921,18,15605,41093,57538,151521,4997,38483,1923,,48149
mean,69701420.0,,,,,,,,,,34122.577576,
std,3129813.0,,,,,,,,,,17640.543124,
min,12612630.0,,,,,,,,,,5000.0,
25%,68695500.0,,,,,,,,,,21500.0,
50%,69937000.0,,,,,,,,,,30000.0,
75%,71626060.0,,,,,,,,,,42500.0,


It seems that the ContractType isn't very helpful with only 2 unique values and 73.26% missing values. I think it's safe to say we can drop that column and not worry about it. The missing company values are tricky to recover, Im not sure right now how I would go about that. The contract time isn't very important I think so I won't worry too much about the generation of that data.

In [68]:
salary_table = salary_table.drop(['Id'], axis=1)
salary_table = salary_table.drop(['ContractType'], axis=1)
salary_table = salary_table.drop(['SourceName'], axis=1)
salary_table.head(5)

Unnamed: 0,Title,FullDescription,LocationRaw,LocationNormalized,ContractTime,Company,Category,SalaryRaw,SalaryNormalized
0,Engineering Systems Analyst,Engineering Systems Analyst Dorking Surrey Sal...,"Dorking, Surrey, Surrey",Dorking,permanent,Gregory Martin International,Engineering Jobs,20000 - 30000/annum 20-30K,25000
1,Stress Engineer Glasgow,Stress Engineer Glasgow Salary **** to **** We...,"Glasgow, Scotland, Scotland",Glasgow,permanent,Gregory Martin International,Engineering Jobs,25000 - 35000/annum 25-35K,30000
2,Modelling and simulation analyst,Mathematical Modeller / Simulation Analyst / O...,"Hampshire, South East, South East",Hampshire,permanent,Gregory Martin International,Engineering Jobs,20000 - 40000/annum 20-40K,30000
3,Engineering Systems Analyst / Mathematical Mod...,Engineering Systems Analyst / Mathematical Mod...,"Surrey, South East, South East",Surrey,permanent,Gregory Martin International,Engineering Jobs,25000 - 30000/annum 25K-30K negotiable,27500
4,"Pioneer, Miser Engineering Systems Analyst","Pioneer, Miser Engineering Systems Analyst Do...","Surrey, South East, South East",Surrey,permanent,Gregory Martin International,Engineering Jobs,20000 - 30000/annum 20-30K,25000


I would like to use the normalized columns to make more general predictions as I think the loss of detail will make a stronger prediction. Lets take a look at the amount of unique values in the SalaryNormalized column.

In [69]:
unique_salaries = salary_table.SalaryNormalized.unique()
print("Range of Salaries: "+str(sorted(unique_salaries)[:1]) + " "+str(sorted(unique_salaries)[-1:]))
print("Sample Mean of Salaries: " + str(salary_table.SalaryNormalized.mean()))
print("Sample Standard Deviation of Salaries: " + str((salary_table.SalaryNormalized.var())**.5))
len(unique_salaries)

Range of Salaries: [5000] [200000]
Sample Mean of Salaries: 34122.5775755
Sample Standard Deviation of Salaries: 17640.5431239


8454

So we can see that we have 8454 different values now this is a big problem because its not very useful to us with all those outcomes. I will round the values to the nearest thousands that way we can get a general prediction to do. We could use a regression method, but that is outside the scope of this class. I will try to reduce the number of unique values. 

In [70]:
SalaryNormalized = list(salary_table.SalaryNormalized)
small_salaries = [ elem/1000 for elem in SalaryNormalized ]
rounded_salaries = [ '%.1f' % elem for elem in small_salaries ]
big_salaries = [ float(elem)*1000 for elem in rounded_salaries ]
set_salaries = set(big_salaries)
print(set_salaries)
print(len(set_salaries))
salary_table['BinnedSalaries'] = pd.Series(big_salaries)
salary_table.head(5)


set([64000.0, 96000.0, 21000.0, 42000.0, 7000.0, 110000.0, 63000.0, 84000.0, 78000.0, 41000.0, 162000.0, 156000.0, 62000.0, 85000.0, 19000.0, 40000.0, 92000.0, 130000.0, 61000.0, 18000.0, 35000.0, 39000.0, 182000.0, 60000.0, 74000.0, 153000.0, 17000.0, 124000.0, 38000.0, 81000.0, 200000.0, 59000.0, 16000.0, 120000.0, 37000.0, 58000.0, 31000.0, 135000.0, 15000.0, 115000.0, 36000.0, 70000.0, 86000.0, 170000.0, 57000.0, 14000.0, 77000.0, 99000.0, 138000.0, 56000.0, 20000.0, 13000.0, 80000.0, 168000.0, 34000.0, 91000.0, 190000.0, 55000.0, 12000.0, 98000.0, 132000.0, 33000.0, 54000.0, 73000.0, 129000.0, 75000.0, 32000.0, 144000.0, 114000.0, 53000.0, 10000.0, 87000.0, 125000.0, 152000.0, 95000.0, 52000.0, 94000.0, 9000.0, 30000.0, 5000.0, 175000.0, 51000.0, 8000.0, 76000.0, 88000.0, 172000.0, 29000.0, 50000.0, 83000.0, 71000.0, 82000.0, 140000.0, 28000.0, 90000.0, 49000.0, 134000.0, 97000.0, 108000.0, 192000.0, 27000.0, 163000.0, 48000.0, 72000.0, 105000.0, 69000.0, 160000.0, 26000.0, 79000.

Unnamed: 0,Title,FullDescription,LocationRaw,LocationNormalized,ContractTime,Company,Category,SalaryRaw,SalaryNormalized,BinnedSalaries
0,Engineering Systems Analyst,Engineering Systems Analyst Dorking Surrey Sal...,"Dorking, Surrey, Surrey",Dorking,permanent,Gregory Martin International,Engineering Jobs,20000 - 30000/annum 20-30K,25000,25000.0
1,Stress Engineer Glasgow,Stress Engineer Glasgow Salary **** to **** We...,"Glasgow, Scotland, Scotland",Glasgow,permanent,Gregory Martin International,Engineering Jobs,25000 - 35000/annum 25-35K,30000,30000.0
2,Modelling and simulation analyst,Mathematical Modeller / Simulation Analyst / O...,"Hampshire, South East, South East",Hampshire,permanent,Gregory Martin International,Engineering Jobs,20000 - 40000/annum 20-40K,30000,30000.0
3,Engineering Systems Analyst / Mathematical Mod...,Engineering Systems Analyst / Mathematical Mod...,"Surrey, South East, South East",Surrey,permanent,Gregory Martin International,Engineering Jobs,25000 - 30000/annum 25K-30K negotiable,27500,27000.0
4,"Pioneer, Miser Engineering Systems Analyst","Pioneer, Miser Engineering Systems Analyst Do...","Surrey, South East, South East",Surrey,permanent,Gregory Martin International,Engineering Jobs,20000 - 30000/annum 20-30K,25000,25000.0


In [71]:
for i in range(len(salary_table)):
    row = salary_table.iloc[i]
    if isinstance(row['Title'],float):
        print(i)

1588


This was the instance of the missing title that we saw earlier in the notebook, luckily the full description contained the title. I just filled it in that way we could do predictions easily

In [72]:
salary_table.at[1588,'Title'] = "Quality Improvement Manager"
salary_table.iloc[1588]

Title                                       Quality Improvement Manager
FullDescription       Quality Improvement Manager North West England...
LocationRaw                                       Liverpool, Merseyside
LocationNormalized                                            Liverpool
ContractTime                                                        NaN
Company                                                             NaN
Category                                      Healthcare & Nursing Jobs
SalaryRaw                                     40,000 to 45,000 per year
SalaryNormalized                                                  42500
BinnedSalaries                                                    42000
Name: 1588, dtype: object

133 is small enough I think, because we have a large enough data set to get good counts for each of the bins
Now that we have made our data more consice we are ready to begin working with our dataset to create models that will hopefully give us strong predictions.

## Wrangling for Models

This is some hot encoding for the forest. I defined a cut in the dataset, putting it into two bins, the reason is dicussed later in the notebook

In [73]:
mean  = salary_table["BinnedSalaries"].mean()
sd = salary_table["BinnedSalaries"].var()**.5

In [74]:
ohe_type =  pd.get_dummies(salary_table['ContractTime'], prefix = 'type', dummy_na=True)
salary_table = salary_table.join(ohe_type)
ohe_cat =  pd.get_dummies(salary_table['Category'], prefix = 'cat', dummy_na=False)
salary_table = salary_table.join(ohe_cat)
salary_table['normal'] = salary_table.apply(lambda row: 1 if ((row.BinnedSalaries < mean+sd) and (row.BinnedSalaries > mean-sd)) else 0, axis = 1)

In [77]:
salary_table.iloc[1000]

Title                                                              Theatre Manager Wirral
FullDescription                         UK Healthcare Professionals are currently recr...
LocationRaw                                                        The Wirral, Merseyside
LocationNormalized                                                                     UK
ContractTime                                                                          NaN
Company                                                                               NaN
Category                                                        Healthcare & Nursing Jobs
SalaryRaw                                                   40000.00 to 45000.00 per year
SalaryNormalized                                                                    42500
BinnedSalaries                                                                      42000
type_contract                                                                           0
type_perma

Note that the normal column should have about 68% of the data within it, this comes from an assumption that the salaries are going to be normally distributed due to the central limit theorum.

Big bag for big predictions

initialize a dict with binned salaires and 0 values then for each word that is associated with that value just drop the +1 in that one. The dict is gonna have 133 keys, but there is a lot of words. Should be interesting.

In [23]:
#sentence_wrangler here
import re
import string
from nltk.corpus import stopwords
from nltk.tokenize import WordPunctTokenizer
word_punct_tokenizer = WordPunctTokenizer()

def sentence_wrangler(sentence, swords, legal_chars):
    removed_words= []
    result = []
    word_tokes = word_punct_tokenizer.tokenize(sentence.lower())
    for item in word_tokes:
        x = re.findall(legal_chars, item)
        if item in swords:
            removed_words.append(x)
        elif len(x) > 0:
            result.append(x)
        return result, removed_words

reference = sorted(list(set_salaries))
empty_reference = [0] * 133


# -*- coding: utf-8 -*-
import re
caps = "([A-Z])"
prefixes = "(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = "(Inc|Ltd|Jr|Sr|Co)"
starters = "(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = "[.](com|net|org|io|gov)"

def split_into_sentences(text):
    text = " " + text + "  "
    text = text.replace("\n"," ")
    text = re.sub(prefixes,"\\1<prd>",text)
    text = re.sub(websites,"<prd>\\1",text)
    if "Ph.D" in text: text = text.replace("Ph.D.","Ph<prd>D<prd>")
    text = re.sub("\s" + caps + "[.] "," \\1<prd> ",text)
    text = re.sub(acronyms+" "+starters,"\\1<stop> \\2",text)
    text = re.sub(caps + "[.]" + caps + "[.]" + caps + "[.]","\\1<prd>\\2<prd>\\3<prd>",text)
    text = re.sub(caps + "[.]" + caps + "[.]","\\1<prd>\\2<prd>",text)
    text = re.sub(" "+suffixes+"[.] "+starters," \\1<stop> \\2",text)
    text = re.sub(" "+suffixes+"[.]"," \\1<prd>",text)
    text = re.sub(" " + caps + "[.]"," \\1<prd>",text)
    if "”" in text: text = text.replace(".”","”.")
    if "\"" in text: text = text.replace(".\"","\".")
    if "!" in text: text = text.replace("!\"","\"!")
    if "?" in text: text = text.replace("?\"","\"?")
    text = text.replace(".",".<stop>")
    text = text.replace("?","?<stop>")
    text = text.replace("!","!<stop>")
    text = text.replace("<prd>",".")
    sentences = text.split("<stop>")
    sentences = sentences[:-1]
    sentences = [s.strip() for s in sentences]
    return sentences

In [24]:
import time

start = time.time()

bag_of_fulldesc = {}
bag_of_loc = {}
bag_of_title = {}

for i in range(len(salary_table)):
    row = salary_table.iloc[i]
    text1 = row['FullDescription']
    text1 = split_into_sentences(text1)
    for sentence in text1:
        jobdesc = sentence_wrangler(sentence, stopwords.words('english'), r'^[a-z]+$')[0]
        for words in jobdesc:
            for word in words:
                if word not in bag_of_fulldesc:
                    bag_of_fulldesc[word] = [0] * 133
                bag_of_fulldesc[word][reference.index(row['BinnedSalaries'])] +=1
                
    loc = salary_table.loc[i, 'LocationNormalized'].lower()
    loc = loc.split(" ")
    if len(loc)>1:
        for words in loc:
            if word not in bag_of_loc:
                bag_of_loc[word] = [0] * 133
            bag_of_loc[word][reference.index(row['BinnedSalaries'])] +=1
    else:
        if loc[0] not in bag_of_loc:
            bag_of_loc[loc[0]] = [0] * 133
        bag_of_loc[loc[0]][reference.index(row['BinnedSalaries'])] +=1
            
    title = salary_table.loc[i, 'Title'].lower()
    title = title.split(" ")
    for word in title:
        if word not in bag_of_title:
            bag_of_title[word] = [0] * 133
        bag_of_title[word][reference.index(row['BinnedSalaries'])] +=1
                
                         
    if i%4000 == 0: print('4000 more')
                
end = time.time()
print(end - start)

4000 more


  


4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
1270.12000012


Hm only took about 20 minutes to run, I would say that's pretty efficient for the ~250,000 columns in the data set. Now lets just make sure the numbers are reasonable. 

In [39]:
print(bag_of_fulldesc.items()[100:101])
print(bag_of_loc.items()[100:101])
print(bag_of_title.items()[100:101])

[('louthcaritasrecruitment', [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])]
[('activating', [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])]
[('information/data', [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0,

Looks good. Next I'll generate up the counts that we need to use the Naive Bayes ensemble on the full description, normalized location and the title of the job. I need to count up the number of occurances of the Binned Salaries and then I can use that as probabilities in the formula.

In [26]:
useful_counts = {'salary_count': len(salary_table)}
class_counts = [0] * 133
for i in range(len(salary_table)):
    salary = salary_table.iloc[i]['BinnedSalaries']
    class_counts[reference.index(salary_table.iloc[i]['BinnedSalaries'])] += 1
    
    salary_sum = float(sum(class_counts))
    class_prob = [0] * 133
    for i in range(len(class_counts)):
        class_prob[i] = class_counts[i]/salary_sum
        
    useful_counts['class_count'] = class_counts
    useful_counts['class_prob'] = class_prob
    
    useful_counts

 ## Modeling 
    
   We wrangled the data and set up what we need, now we can create some models. I will be mixing the random forest model on the catagorical columns and then use Naive Bayes on the language based columns. I will then but them together in an ensemble and have them vote for which of the salaries the a new job belongs in.

   To make the processing faster, I will have the locations be used from the Naive Bayes instead of the random forests for now, but if the predictions aren't as strong as I would like I think using random forests to be split using different features and then have the location for the leaves will produce good results. The computation time is just a very large worry. 

### Naive Bayes "Regression"


   For the Naive Bayes I will use unique words that occur in the job descriptions to speed up computation time. I think each job description is roughly 200+ words. If we didn't utilize the unique words, then we are just enforcing similar probabilties onto the data, which could result in bias estimations."

       

In [27]:
#useful functions for random forest modeling
import os
import sys
home_path =  os.path.expanduser('~')
sys.path.append(home_path + '\\Documents\\CIS 399\\Winter Term\\datascience_1')
from week7 import *
%who function

  

accuracy	 build_pred	 build_tree_iter	 caser	 closest_centroid	 compute_mean	 compute_prediction	 compute_training	 euclidean_distance	 
f1	 find_best_splitter	 forest_builder	 generate_table	 gig	 gini	 informedness	 initialize_centroids	 k_fold	 
k_means	 phase_1	 phase_2	 predictor_case	 probabilities	 row_to_vect	 seeder	 sentence_wrangler	 split_into_sentences	 
tree_predictor	 vote_taker	 


In [28]:
import time
start = time.time()
  
all_predictions = []

def naive_bayes_list(lis, bag, counts):
    listo = []
    for i in range(133):
        ego = 1.0
        for word in lis:
            ego *= ((bag[word[0]][i]+0.0)/counts['class_count'][i])
        p = ego*counts['class_prob'][i]+0.0
        listo.append(p)
    return listo

def naive_bayes_loc(word, bag, counts):
    listo = []
    for i in range(133):
        ego = 1.0
        ego *= ((bag[word[0]][i]+0.0)/counts['class_count'][i])
        p = ego*counts['class_prob'][i]+0.0
        listo.append(p)
    return listo

for i in range(len(salary_table)):
    sub_list = []

    unique1 = []
    text1 = salary_table.loc[i, 'FullDescription']
    text1 = split_into_sentences(text1)
    for sentence in text1:
        wrangled_text = sentence_wrangler(sentence, stopwords.words('english'), r'^[a-z]+$')[0]
        for word in wrangled_text:
            if word not in unique1: 
                unique1.append(word)
    result1 = naive_bayes_list(unique1, bag_of_fulldesc, useful_counts)
    sub_list.append(result1.index(max(result1)))
        
    unique2 = []
        
    text2 = salary_table.loc[i, 'LocationNormalized'].lower()
    text2 = text2.split(" ")
    if len(text2) == 1:
        result2 = naive_bayes_loc(text2,bag_of_loc, useful_counts)
    else:
        for word in text2:
            if word not in unique2: 
                unique3.append(text2)
        result2 = naive_bayes_list(text2, bag_of_title, useful_counts)
    sub_list.append(result2.index(max(result2)))
            
    unique3 = []
        
    text3 = salary_table.loc[i, 'Title'].lower()
    text3 = text3.split(" ")
    if len(text3) == 1:
        result3 = naive_bayes_loc(text3,bag_of_title, useful_counts)
    else:
        for word in text3:
            if word not in unique3: 
                unique3.append(text3)
        result3 = naive_bayes_list(unique3, bag_of_title, useful_counts)
    sub_list.append(result3.index(max(result3)))
        
        
        
    all_predictions.append(tuple(sub_list))
    if i%4000 == 0: print('4000 more')
end = time.time()
print(end - start) 

4000 more


  


4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
4000 more
1248.99100018


Once again a nice 20 minutes run time. I'd say im quite please with the efficiency of this code to turn out three predictions that quickly, especially considering the size of the data.

In [29]:
import json
with open('nb_predictions.txt', 'w') as file:
    file.write(json.dumps(all_predictions))

In [30]:
all_predictions = json.load(open("nb_predictions.txt"))


In [31]:
print(all_predictions[:10])

[[30, 15, 109], [35, 13, 66], [35, 30, 99], [50, 30, 109], [30, 30, 20], [45, 15, 109], [40, 25, 96], [17, 30, 15], [22, 12, 15], [30, 45, 83]]


Now, looking at the values returned for the first 10, there are numerous things to consider. Since we are focusing on the use of NLP to determine the salary data, this allows us to do a few things with the random forests on the catagorical variables.

First, we can use the Random Forest to have those come up with it's own predictions based on the data and then put all of the predictions together and then come up with a way to get a single prediction for the salary based on all the predictions, but how can we get a random forest to predict multiple values, this seems pretty complicated and would require quite a bit of changing to the algorithmns at hand. Perhaps this isn't the best idea.

Second, we could use the random forest as a mean to refine the predictions that we have gotten. I think making the random forests check to see if the predicted values are in or out of a certain range would be a very cool way to use it. Hopefully the implementation is effective. 

Third, how are we going to test how well this model does in predicting the data. I think if we interpret the Naive Bayes and Random Forest matrix to be sort of a linear regression of sorts we can assess it's predictive power based on similar concepts. We can calculate something called the Root Mean Squared Prediction Error and see how well this concept works. 

### Random Forest

Since the data is already wrangled and we have imported our functions, I will create a random forest in which will return predicted values similar to the ones. 

In [79]:
splitter_columns = ['type_contract',
       'type_permanent', 'type_nan', 'cat_Accounting & Finance Jobs',
       'cat_Admin Jobs', 'cat_Charity & Voluntary Jobs',
       'cat_Consultancy Jobs', 'cat_Creative & Design Jobs',
       'cat_Customer Services Jobs', 'cat_Domestic help & Cleaning Jobs',
       'cat_Energy, Oil & Gas Jobs', 'cat_Engineering Jobs',
       'cat_Graduate Jobs', 'cat_HR & Recruitment Jobs',
       'cat_Healthcare & Nursing Jobs', 'cat_Hospitality & Catering Jobs',
       'cat_IT Jobs', 'cat_Legal Jobs', 'cat_Logistics & Warehouse Jobs',
       'cat_Maintenance Jobs', 'cat_Manufacturing Jobs',
       'cat_Other/General Jobs', 'cat_PR, Advertising & Marketing Jobs',
       'cat_Part time Jobs', 'cat_Property Jobs', 'cat_Retail Jobs',
       'cat_Sales Jobs', 'cat_Scientific & QA Jobs',
       'cat_Social work Jobs', 'cat_Teaching Jobs',
       'cat_Trade & Construction Jobs', 'cat_Travel Jobs']

In [81]:
import time
start = time.time()

forest1 = forest_builder(salary_table, splitter_columns, 'normal', hypers={'total-trees':5})
salary_table['forest_1'] = salary_table.apply(lambda row: vote_taker(row, forest1), axis=1)
salary_table['forest_1_type'] = salary_table.apply(lambda row: predictor_case(row, pred='forest_1', target='normal'), axis=1)
forest1_types = salary_table['forest_1_type'].value_counts()
print((accuracy(forest1_types), f1(forest1_types), informedness(forest1_types)))

end = time.time()
print(end - start)

(0.7322893515492221, 0.8454584690419307, 0.0)
136.38499999


In [84]:
import time
start = time.time()

forest2 = forest_builder(salary_table, splitter_columns, 'normal', hypers={'total-trees':11})
salary_table['forest2'] = salary_table.apply(lambda row: vote_taker(row, forest2), axis=1)
salary_table['forest2_type'] = salary_table.apply(lambda row: predictor_case(row, pred='forest2', target='normal'), axis=1)
forest2_types = salary_table['forest2_type'].value_counts()
print((accuracy(forest2_types), f1(forest2_types), informedness(forest2_types)))

end = time.time()
print(end - start)

(0.7322893515492221, 0.8454584690419307, 0.0)
0.0480000972748


I'm a little surprised that the random forest scores as well as it does under these conditions. We can see that the size of the forest after 5 does not improve the ability of prediction. I tried to introduce the normlaized location as a catagorical variable in the random forest and kept coming up with errors, so I will not include it into the random forest as it's already captured in the Naive Bayes.

Now that we have models for our data now we have to assess the ensemble together and see how to get a singular value, as well as assessing how well our model performs to the test set.

## All together now

So I think putting more weight to the fulldescription prediction is going to yeild the best results, I would like to find some way to weigh the predictive power of each of predictions and then get a weighted average of sorts. 

I think allowing for the location and the title to affect the prediction on the full description is going to be best. 

Using the Random Forest to check if the predictions given the type of work as well as the sector of work is good will be a sort of verification of the prediction given by naive bayes. If the predictions of the Naive Bayes doesn't correspond to the random forest then the weights of the other predictions will be changed to hopefully account for the differences in the sector and the longevity of the position.

First things first, lets get all the predictions together into a matrix so we can more efficiently do our calculations

In [99]:
for i in range(len(salary_table)):
    all_predictions[i].append(salary_table.iloc[i]['forest_1'])
all_predictions[:5]


[[30, 15, 109, 1],
 [35, 13, 66, 1],
 [35, 30, 99, 1],
 [50, 30, 109, 1],
 [30, 30, 20, 1]]

Let's begin with weighting it as 85%,10%,5% to start for the Naive Bayes Predictions, maybe I can find a better weighting that will minimize the RMSPE. After I'll check to see if it's valid, if it's not we will add or subtract a std. dev to the predictions and see how that performs. I chose to move it by a single standard deviation under some sort of intuition that if the original prediction of the random forest says it's outside of the normal standard deiviation around the mean then the model predictions should reflect this. I think this will make a tighter fit around the actual values and increase prediction power.

In [106]:
mean 
print(sd)
reference.index(35000)

17683.6768553


30

In [140]:
def prediction_condenser(predictions, weights):
    cp = []
    preds = []
    for i in range(len(predictions)):
        cp.append((int(predictions[i][0])*weights[0])+(int(predictions[i][1])*weights[1])+(int(predictions[i][2])*weights[2]))
        rounded = round(cp[i],0)
        ps = reference[int(rounded)] 
        if predictions[3] == 1:
            if (ps > mean): 
                ps -= sd
            else:
                ps += sd
        if predictions[3] == 0:
            if ((ps < mean) and (ps > mean)):
                if predictions[0]<30:
                    ps -= sd
                else:
                    ps += sd
        preds.append(ps)
    return preds

pred_reg = prediction_condenser(all_predictions, [.85,.1,.05])
pred_reg[:10]

[37000.0,
 39000.0,
 43000.0,
 56000.0,
 35000.0,
 50000.0,
 46000.0,
 23000.0,
 26000.0,
 39000.0]

In [136]:
def rmse(predictions, targets):
    diffs = 0
    for i in range(len(predictions)):
        diffs += (predictions[i] - targets[i])**2
    mean_diffs = diffs/len(predictions)
    rmspe = (mean_diffs)**.5
    return rmspe 

print(rmse(pred_reg, big_salaries))
print(sd)

14720.601765
17683.6768553


Our RMSE is better than the standard deviation of the underlying dataset, I would say we have done good here. Perhaps an $R^2$ would be a better measure of fit. I would feel confident using the $R^2$ statistic as I think our results of predictions fit the qualifications of a linear model. Perhaps not a very traditional OLS model, but none-the-less. 

In [141]:
def r2(predictions, targets):
    num = 0
    for i in range(len(predictions)):
        num += (predictions[i] - targets[i])**2
    dem = 0
    mean = sum(targets)/len(targets)
    for i in range(len(targets)):
        dem += (targets[i] - mean)**2
    return (num/dem)
r2(pred_reg, big_salaries)

0.6929593544812951

Hm, we are capturing about 70% of the variation of Salaries from our model. Not bad, Im sure we can do better with some more tweaking to the weights.

In [143]:
pred_reg = prediction_condenser(all_predictions, [.90,.05,.05])
print(".90,.05,.05 -> R2:"+str(r2(pred_reg, big_salaries)))

pred_reg = prediction_condenser(all_predictions, [.95,.04,.01])
print(".95,.04,.01 -> R2:"+str(r2(pred_reg, big_salaries)))

pred_reg = prediction_condenser(all_predictions, [.95,.01,.04])
print(".95,.01,.04 -> R2:"+str(r2(pred_reg, big_salaries)))

pred_reg = prediction_condenser(all_predictions, [.98,.01,.01])
print(".98,.01,.01 -> R2:"+str(r2(pred_reg, big_salaries)))

pred_reg = prediction_condenser(all_predictions, [1,.00,.00])
print("1.0,.00,.00 -> R2:"+str(r2(pred_reg, big_salaries)))

.90,.05,.05 -> R2:0.719595638522
.95,.04,.01 -> R2:0.755323818323
.95,.01,.04 -> R2:0.751398375485
.98,.01,.01 -> R2:0.778603669749
1.0,.00,.00 -> R2:0.794102558184


Hm, it seems I gave too much importance to the Location and the Title. Either way with this "Naive Bayes Regression" of sorts, having an $R^2$ value of .794 is more one that I didn't think would be so large.

I think it may be worth seeing how just the full description predictions measure up to the power and see if our methods of tweaking the predictions on the standard deviations helped or hurt.

In [139]:
nb_fulldesc=[]
for i in range(len(all_predictions)):
    ps = reference[all_predictions[i][0]] 
    nb_fulldesc.append(ps)
    
r2(nb_fulldesc, big_salaries)

0.7941025581838882

## Conclusions

It appears that the naive bayes on the full description of the job was the best performance I could have done. I would have thought that other factors would have counted more, or should be weighted more, such as location and sectors. It seems the Naive Bayes did a good job at catching and accounting for outliers, which I thought the random forest would help correct. The location was causing better prediction then the Title for the Naive Bayes, but the fulldescription itself did a very good job at the prediction alone. It seems the ensemble approach failed in this instance.

I think the ensemble approach may not have been the best course of action due to the full description typically containing the location and the title, but I felt like those would have a larger impact on the salary then just the occurance of the words themselves. 