# Group work - Assessment 2

In this assignment, we will focus on salary prediction. The data set for this assignment includes information on job descriptions and salaries. Use this data set to see if you can predict the salary of a job posting (i.e., the `Salary` column in the data set) based on the job description. This matters because such a model can recommend a salary as soon as a job description is entered into a system.

## Description of Variables

The descriptions of the variables are provided in "Jobs - Data Dictionary.docx".

## Goal

Use the **jobs_alldata.csv** data set and build models to predict **salary**.

**Be careful: this is a REGRESSION task**

## Submission:

Please save and submit this Jupyter notebook file. The correctness of the code matters for your grade. **Readability and organization of your code are also important.** You may lose points for submitting unreadable or undecipherable code. Therefore, use markdown cells to create sections, and use comments where necessary.


## Recommended roles for group members:

**Section 1:** to be completed by both group members

**Section 2:** first three models to be completed by the first group member and checked by the second; last two models to be completed by the second group member and checked by the first.

**Discussion:** to be completed by both group members

**Important notes:**
- Both group members will get the same grade. Therefore, you should check the work of your group member. If they make a mistake, you will be responsible for that mistake too.
- Both group members must put in their fair share of effort. Otherwise, those who don't contribute to the assignment will not receive any grade.


# Section 1: (8 points in total)

## Data Prep (6 points)

In [3]:
import pandas as pd
import numpy as np
import os

In [4]:
# get csv files from current directory
csv_files = [f for f in os.listdir('.') if f.endswith('.csv')]
csv_files

['jobs_alldata.csv']

In [5]:
# read csv files
df = pd.read_csv(csv_files[0])
df.head()

Unnamed: 0,Salary,Job Description,Location,Min_years_exp,Technical,Comm,Travel
0,67206,Civil Service Title: Regional Director Mental ...,Remote,5,2,3,0
1,88313,The New York City Comptroller’s Office Burea...,Remote,5,2,4,10-15
2,81315,With minimal supervision from the Deputy Commi...,East campus,5,3,3,5-10
3,76426,OPEN TO CURRENT BUSINESS PROMOTION COORDINATOR...,East campus,1,1,3,0
4,55675,Only candidates who are permanent in the Princ...,Southeast campus,1,1,3,5-10


In [7]:
from sklearn.model_selection import train_test_split

# define train and test sets from data
X_train, X_test, y_train, y_test = train_test_split(
    df.drop('Salary', axis=1), df['Salary'], test_size=0.2, random_state=0)


In [10]:
# check for null values
print(X_train.isnull().sum())

Job Description    0
Location           0
Min_years_exp      0
Technical          0
Comm               0
Travel             0
dtype: int64


In [12]:
print(X_test.isnull().sum())

Job Description    0
Location           0
Min_years_exp      0
Technical          0
Comm               0
Travel             0
dtype: int64


In [17]:
# print data types
print(X_train.dtypes)

Job Description    object
Location           object
Min_years_exp       int64
Technical           int64
Comm                int64
Travel             object
dtype: object


In [19]:
# drop all object columns
X_train = X_train.select_dtypes(exclude=['object'])
X_test = X_test.select_dtypes(exclude=['object'])

In [20]:
X_train.head()

Unnamed: 0,Min_years_exp,Technical,Comm
459,1,1,3
2006,5,5,1
951,1,2,3
1598,4,3,4
1458,4,3,3


## Importing Necessary Preprocessing Libraries

In [21]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import FunctionTransformer
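
One way the transformers imported above could fit together is sketched below. The toy frame, its values, and the names `numeric_cols`, `categorical_cols`, and `preprocessor` are illustrative assumptions, not the assignment data or a required design.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Hypothetical toy frame that mimics a few columns of the data set
toy = pd.DataFrame({
    "Min_years_exp": [1, 5, 4, 1],
    "Technical": [1, 5, 3, 2],
    "Location": ["Remote", "East campus", "Remote", "Southeast campus"],
})

numeric_cols = ["Min_years_exp", "Technical"]
categorical_cols = ["Location"]

# Numeric columns: fill gaps with the median, then standardize
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: fill gaps with the most frequent value, then one-hot encode
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Route each group of columns through its own pipeline
preprocessor = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    ("cat", categorical_pipe, categorical_cols),
])

features = preprocessor.fit_transform(toy)
print(features.shape)
```

Fitting the preprocessor on the training set only and reusing it on the test set keeps test information out of the imputation and scaling statistics.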

### Feature Engineering: Defining a New Column Based on the Average of Our Numerical Data (Min years of experience, Technical, Comm)

In [26]:
# define new column 'Combined_Aptitude' by averaging 'Min_years_exp' , 'Technical' and 'Comm'
def combine_aptitude(row):
    val = ((row['Min_years_exp'] + row['Technical'] + row['Comm']) / 3)
    return round(val, 2)
# insert new column 'Combined_Aptitude'
X_train['Combined_Aptitude'] = X_train.apply(combine_aptitude, axis=1)
X_test['Combined_Aptitude'] = X_test.apply(combine_aptitude, axis=1)

In [27]:
X_train.head()

Unnamed: 0,Min_years_exp,Technical,Comm,Combined_Aptitude
459,1,1,3,1.67
2006,5,5,1,3.67
951,1,2,3,2.0
1598,4,3,4,3.67
1458,4,3,3,3.33


## Feature Engineering (1 point)

Create one NEW feature from existing data. You can either transform a single variable or create a new variable from existing ones.

Grading: 
- 0.5 points for creating the new feature correctly
- 0.5 points for the justification of the new feature (i.e., why did you create this new feature)
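
As an illustration of transforming a single variable, the sketch below turns a travel range string into one number. It assumes `Travel` holds either a single number such as `0` or a range such as `10-15`, as in the rows shown earlier; the helper `travel_midpoint` and the column name `Travel_mid` are hypothetical.

```python
import pandas as pd

def travel_midpoint(value):
    """Return the midpoint of a 'lo-hi' range, or the number itself."""
    parts = str(value).split("-")
    numbers = [float(p) for p in parts]
    return sum(numbers) / len(numbers)

# Toy values in the format seen in the df.head() output
toy = pd.DataFrame({"Travel": ["0", "10-15", "5-10"]})
toy["Travel_mid"] = toy["Travel"].apply(travel_midpoint)
print(toy)
```

A transform like this would have to run before the object columns are dropped, since `Travel` is stored as an object column.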

## Find the Baseline (1 point)
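
A common baseline for a regression task is to predict the mean of the training target for every row. The sketch below assumes that choice and assumes RMSE as the error measure; the arrays `y_train_toy` and `y_test_toy` are stand-ins built from a few salaries in the `df.head()` output, not the real split.

```python
import numpy as np

# Stand-ins for y_train / y_test (a few salaries from the preview above)
y_train_toy = np.array([67206, 88313, 81315])
y_test_toy = np.array([76426, 55675])

# The baseline "model" always predicts the training mean
baseline_pred = y_train_toy.mean()

def rmse(y_true, y_pred):
    """Root mean squared error between targets and predictions."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

print("baseline prediction:", round(baseline_pred, 2))
print("train RMSE:", round(rmse(y_train_toy, baseline_pred), 2))
print("test RMSE:", round(rmse(y_test_toy, baseline_pred), 2))
```

scikit-learn's `DummyRegressor(strategy="mean")` produces the same constant prediction and can be scored like any other model.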

# Section 2: (7 points in total)

Build the following models:


## Decision Tree: (1 point)

## Voting Regressor: (2 points)

The voting regressor should have at least 3 individual models
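
A minimal sketch of a voting regressor with three member models follows. The synthetic data from `make_regression` stands in for the prepared features and salaries, and the three estimators chosen here are only one possible combination.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the prepared features and target
X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# The ensemble averages the predictions of its three members
voter = VotingRegressor(estimators=[
    ("lr", LinearRegression()),
    ("tree", DecisionTreeRegressor(max_depth=4, random_state=0)),
    ("knn", KNeighborsRegressor(n_neighbors=5)),
])
voter.fit(X_tr, y_tr)

# score() returns R^2 for regressors
print("train R^2:", round(voter.score(X_tr, y_tr), 3))
print("test R^2:", round(voter.score(X_te, y_te), 3))
```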

## A Boosting model: (1 point)

Build either an Adaboost or a GradientBoost model

## Neural network: (1 point)

## Grid search (2 points)

Perform either a full or a randomized grid search on any model you want. The search must cover at least two parameters.
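
The sketch below shows a full grid search over two parameters of a decision tree. The synthetic data, the parameter values, and the choice of model are assumptions made for illustration only.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the prepared features and target
X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Two parameters, three candidate values each: 9 combinations in total
param_grid = {
    "max_depth": [2, 4, 6],
    "min_samples_leaf": [1, 5, 10],
}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid=param_grid,
    cv=3,            # 3-fold cross-validation on the training set
    scoring="r2",
)
search.fit(X_tr, y_tr)

print("best parameters:", search.best_params_)
print("test R^2 of best model:", round(search.score(X_te, y_te), 3))
```

For the randomized variant, `RandomizedSearchCV` takes the candidates as `param_distributions` and samples `n_iter` combinations instead of trying all of them.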

# Discussion (5 points in total)


## List the train and test values of each model you built (2 points)

## Which model performs the best and why? (0.5 points) 
## How does it compare to baseline? (0.5 points)

Hint: The best model is the one that has the highest TEST score (regardless of any of the training values). If you select your model based on TRAIN values, you will lose points.
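
The selection rule in the hint can be sketched as follows. The model names and scores are made-up placeholders, not results from this data set, and the sketch assumes a score where higher is better.

```python
# Placeholder scores only: (train score, test score) per model
results = {
    "model_a": (0.95, 0.60),
    "model_b": (0.70, 0.65),
    "model_c": (0.80, 0.62),
}

# Rank on the TEST score; the train score plays no part in the selection
best_name = max(results, key=lambda name: results[name][1])
print(best_name)
```

Here `model_a` has the highest train score but is not selected, because `model_b` scores higher on the test set.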

## Is there any evidence of overfitting in the best model, why or why not? If there is, what did you do about it? (1 point)

## Is there any overfitting in the other models (besides the best model), why or why not? If there is, what did you do about it? (1 point)
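
A large gap between the train and test score is the usual sign of overfitting. The sketch below compares an unrestricted decision tree with a depth-limited one on synthetic data; the data and the `max_depth=3` setting are assumptions, and limiting depth is only one possible remedy.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Noisy synthetic stand-in for the prepared features and target
X, y = make_regression(n_samples=200, n_features=4, noise=30.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# An unrestricted tree can memorize the training rows; a shallow one cannot
deep = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
shallow = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_tr, y_tr)

for name, model in [("unrestricted", deep), ("max_depth=3", shallow)]:
    train_r2 = model.score(X_tr, y_tr)
    test_r2 = model.score(X_te, y_te)
    gap = train_r2 - test_r2
    print(f"{name}: train R^2={train_r2:.3f}, test R^2={test_r2:.3f}, gap={gap:.3f}")
```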