## CSCI 470 Activities and Case Studies

1. For all activities, you are allowed to collaborate with a partner. 
1. For case studies, you should work individually and are **not** allowed to collaborate.

By filling out this notebook and submitting it, you acknowledge that you are aware of the above policies and are agreeing to comply with them.

Some considerations with regard to how these notebooks will be graded:

1. You can add more notebook cells or edit existing notebook cells other than "# YOUR CODE HERE" to test out or debug your code. We actually highly recommend you do so to gain a better understanding of what is happening. However, during grading, **these changes are ignored**. 
2. You must ensure that all your code for the particular task is available in the cells that say "# YOUR CODE HERE"
3. Every cell that says "# YOUR CODE HERE" is followed by a "raise NotImplementedError". You need to remove that line. During grading, if an error occurs then you will not receive points for your work in that section.
4. If your code passes the "assert" statements, then no output will result. If your code fails the "assert" statements, you will get an "AssertionError". Getting an assertion error means you will not receive points for that particular task.
5. If you edit the "assert" statements to make your code pass, they will still fail when they are graded since the "assert" statements will revert to the original. Make sure you don't edit the assert statements.
6. We may sometimes have "hidden" tests for grading. This means that passing the visible "assert" statements is not sufficient. The "assert" statements are there as a guide but you need to make sure you understand what you're required to do and ensure that you are doing it correctly. Passing the visible tests is necessary but not sufficient to get the grade for that cell.
7. When you are asked to define a function, make sure you **don't** use any variables outside of the parameters passed to the function. You can think of the parameters being passed to the function as a hint. Make sure you're using all of those variables.
8. Finally, **make sure you run "Kernel > Restart and Run All"** and pass all the asserts before submitting. If you don't restart the kernel, there may be some code that you ran and deleted that is still being used and that was why your asserts were passing.

# Job Performance Prediction

You work for a software startup, Predict All The Things Inc. (PALT), and are approached by the CEO to build an algorithm that can help sift through resumes. PALT just closed a $3 million Series A round of funding and the CEO just landed a deal with a national retailer, SellsALOT, to help them with hiring Sales Associates.

They are able to obtain data on all the employees that work as Sales Associates throughout their stores as well as customer satisfaction and sales performance scores.

In this case study, you are tasked with building a model to predict job performance to assist HR in selecting applicants to interview.

The data was provided to you by the new HR intern, Keegan. This is the email you got from Keegan with the attached data.

>Hi!
>
>I hope you're doing well. I've attached the data we have about all employees. Please ensure this data stays confidential and is not shared with anyone who has not signed the NDA. The columns have all the information we have about our employees and the scoring rating that they've received from our performance monitors. We also have some employees who were fired and I have included those as well.
>
>I was also able to dig up some more information about our employees that I found on the internet. It took a lot of time but I hope it helps in making the model even better. Can't wait to see this thing in action. Everyone here is very excited about our collaboration with you and we look forward to this making hiring a lot easier for us.
>
>Thanks,
>
>Keegan Thiel
>
>HR Intern
>
>Human Resources
>
>SellsALOT


Data is available in the `employees.csv` file provided. 


SellsALOT is an Equal Opportunity Employer which is an employer who agrees not to discriminate against any employee or job applicant based on race, color, religion, national origin, sex, physical or mental disability, or age.


## Import packages that are likely to be useful

### However, __do not__ use TensorFlow to build your models.

Below we import packages that are needed or may be useful. You may import additional packages as you see fit, with the exception of TensorFlow. You may use scikit-learn. **Ensure that you import additional packages in cells that say "### YOUR CODE HERE".**

In [1]:
import pandas as pd
import matplotlib
import numpy as np
import sklearn as sk
from sklearn.model_selection import train_test_split
import datetime
from datetime import date

## Data Cleaning

First, let's investigate the data that we received from Keegan.

If you are using colab, **Make sure you upload the employees.csv file** so it can be loaded in the next cell. In order to do so, click the file folder icon in the colab sidebar. You will see the contents of the current directory, which will include a "sample_data" folder. Click the upload icon (piece of paper with upward pointing arrow, just below the word "Files"). Locate the employees.csv file that you downloaded from Canvas to your local machine and open/upload the file.

In [2]:
df = pd.read_csv("employees.csv")

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
df.head(10)

Unnamed: 0,First Name,Last Name,Date of Birth,Address,Zipcode,Gender,Race / Ethnicity,English Fluency,Spanish Fluency,Education,High School GPA,College GPA,Years of Experience,Years of Volunteering,Myers Briggs Type,Twitter followers,Instagram Followers,Requires Sponsorship,Customer Satisfaction Rating,Sales Rating,Fired
0,Sarah,Chang,1989-12-24,764 Howard Tunnel,30167,Female,Black,Fluent,Basic,High School,3.1,2.52,8.8,0.0,ISTJ,693,1108,False,2.21,2.07,Current Employee
1,Daniel,Taylor,1985-03-15,4892 Jessica Turnpike Suite 781,86553,Male,Black,Fluent,Basic,High School,3.02,3.9,13.7,0.0,ISFJ,507,1259,False,3.37,2.98,Current Employee
2,Heather,Stewart,1993-09-20,778 Linda Orchard Apt. 609,30167,Female,Black,Proficient,Basic,High School,2.95,2.63,5.2,0.0,INFP,599,868,False,1.5,1.36,Current Employee
3,Katherine,Dillon,1986-12-22,139 Linda Crossroad Suite 115,30167,Female,Black,Basic,Basic,High School,3.99,3.88,12.5,0.0,ISFP,1321,889,True,2.89,2.62,Current Employee
4,Sheri,Bolton,1991-02-24,1858 Lauren Orchard,60531,Female,Black,Proficient,Proficient,High School,3.82,3.3,7.0,0.0,ISFJ,414,13760,True,1.94,1.78,Current Employee
5,Donna,Davis,1996-05-26,4232 Tina Forks,86553,Female,Black,Proficient,Basic,Associates,2.05,3.14,1.6,0.0,ESTJ,495,2401,False,0.78,0.87,Current Employee
6,Benjamin,Shelton,1985-01-15,186 Warren Mount Apt. 396,30167,Male,Black,Proficient,Basic,Associates,2.12,3.51,13.1,0.0,ESFJ,1696,1158,False,3.41,3.11,Current Employee
7,Kevin,Hayes,1994-01-21,515 Tucker Plaza Suite 304,59010,Male,Black,Fluent,Basic,High School,2.09,2.92,4.1,0.0,ESTP,319,538,False,1.3,1.29,Current Employee
8,Autumn,Robinson,1996-05-05,0123 Audrey Union,60531,Female,Black,Fluent,Basic,,,,3.0,0.0,ISFJ,988,510,False,0.89,0.63,Current Employee
9,Kimberly,Becker,1983-04-12,91615 Wilson Place,60531,Female,Black,Fluent,Basic,High School,2.99,2.97,14.8,0.0,INTP,1169,2254,False,3.59,3.4,Current Employee


In [5]:
df.describe(include="all")

Unnamed: 0,First Name,Last Name,Date of Birth,Address,Zipcode,Gender,Race / Ethnicity,English Fluency,Spanish Fluency,Education,High School GPA,College GPA,Years of Experience,Years of Volunteering,Myers Briggs Type,Twitter followers,Instagram Followers,Requires Sponsorship,Customer Satisfaction Rating,Sales Rating,Fired
count,2000,2000,2000,2000,2000.0,2000,2000,2000,2000,2000,1645.0,1645.0,2000.0,2000.0,2000,2000.0,2000.0,2000,2000.0,2000.0,2000
unique,477,696,1701,2000,,2,3,3,3,5,,,,,16,,,2,,,2
top,Michael,Smith,1990-05-09,5747 Smith Lakes,,Female,Black,Fluent,Basic,High School,,,,,ISFJ,,,False,,,Current Employee
freq,42,45,4,1,,1201,1000,1183,1795,881,,,,,218,,,1813,,,1844
mean,,,,,53300.7405,,,,,,3.010359,3.251465,8.04255,0.263,,1065.1445,9586.589,,2.220175,2.06511,
std,,,,,17455.384226,,,,,,0.584502,0.430466,4.674366,0.907487,,7499.266485,226956.3,,1.058473,0.97946,
min,,,,,24310.0,,,,,,2.0,2.5,0.0,0.0,,300.0,500.0,,0.0,0.0,
25%,,,,,43357.0,,,,,,2.51,2.87,3.9,0.0,,364.0,671.0,,1.3075,1.2575,
50%,,,,,55864.0,,,,,,3.02,3.27,8.1,0.0,,466.5,992.5,,2.25,2.06,
75%,,,,,60531.0,,,,,,3.51,3.61,12.1,0.0,,769.25,2042.0,,3.08,2.86,


In [6]:
print("The columns of data are:")
list(df.columns)

The columns of data are:


['First Name',
 'Last Name',
 'Date of Birth',
 'Address',
 'Zipcode',
 'Gender',
 'Race / Ethnicity',
 'English Fluency',
 'Spanish Fluency',
 'Education',
 'High School GPA',
 'College GPA',
 'Years of Experience',
 'Years of Volunteering',
 'Myers Briggs Type',
 'Twitter followers',
 'Instagram Followers',
 'Requires Sponsorship',
 'Customer Satisfaction Rating',
 'Sales Rating',
 'Fired']

Before building any models, your manager has asked you to **convert all the feature data into formats that can easily be used for training and testing a variety of models**. This means:
1. Splitting the 16 Myers Briggs types into 4 subtypes
2. Converting categorical features into dummy binary features
3. Calculating age based on date of birth
4. Dealing with missing (NaN) values in the data

In addition, you should remove columns that contain redundant information after going through the process above, e.g., removing the 'Date of Birth' column after an 'Age' column is added.


### MBTI Splitting

The [Myers Briggs Type Indicator](https://en.wikipedia.org/wiki/Myers%E2%80%93Briggs_Type_Indicator) (MBTI) descibes people as one of two types for each of:

* extraversion (E) or introversion (I)
* sensing (S) or intuition (N)
* thinking (T) or feeling (F)
* judgment (J) or perception (P)

It would make more sense for us to represent people as one or the other of these instead of creating all the possible cases. That way a model can learn based on each of those factors as well as their combination. 

Your next task is to split the MBTI column into four columns in the dataframe, with the following column names and values:

* MBTI_EI with value `E` or `I`
* MBTI_SN with value `S` or `N`
* MBTI_TF with value `T` or `F`
* MBTI_JP with value `J` or `P`

that correspond to the same row's Myers Briggs Type, and add those columns to your DataFrame, ```df```. Consider using the Series ```apply()``` method.

Afterwards, remove the original "Myers Briggs Type" column.

In [7]:
df['MBTI_EI'] = df['Myers Briggs Type'].apply(lambda x: x[0])
df['MBTI_SN'] = df['Myers Briggs Type'].apply(lambda x: x[1])
df['MBTI_TF'] = df['Myers Briggs Type'].apply(lambda x: x[2])
df['MBTI_JP'] = df['Myers Briggs Type'].apply(lambda x: x[3])
df.drop('Myers Briggs Type', axis=1, inplace=True)
# raise NotImplementedError()

In [8]:
assert len(set(df["MBTI_EI"])) == 2
assert "E" in set(df["MBTI_EI"]) and "I" in set(df["MBTI_EI"])
assert len(set(df["MBTI_SN"])) == 2
assert "S" in set(df["MBTI_SN"]) and "N" in set(df["MBTI_SN"])
assert len(set(df["MBTI_TF"])) == 2
assert "T" in set(df["MBTI_TF"]) and "F" in set(df["MBTI_TF"])
assert len(set(df["MBTI_JP"])) == 2
assert "J" in set(df["MBTI_JP"]) and "P" in set(df["MBTI_JP"])
assert "Myers Briggs Type" not in list(df.columns)

1. ~~Splitting the 16 Myers Briggs types into 4 subtypes~~
2. Converting categorical features into dummy binary features
3. Calculating age based on date of birth
4. Dealing with missing (NaN) values in the data

### Categorical to Dummy Variables

Dummy variables are variables that allow us to convert a category into several binary variables. For example, if we had a color value that we were storing and we knew it could only have the values `red`, `green`, and `blue`, then instead of storing the color as those strings, we can store three binary variables: `is_red`, `is_green`, and `is_blue`. 

We can do this in pandas easily by using [`get_dummies`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html).

In [9]:
# Review the DataFrame columns and identify the columns that contain categorical
# features and save them to a list called "categorical_columns".

# Categorical here means that there is a discrete (albeit large in some cases)
# number of possible options for the column that are not just 0 or 1

categorical_columns = ['Gender', 'Race / Ethnicity', 'English Fluency', 'Spanish Fluency', 'Education', 'Requires Sponsorship', 'Fired', 'MBTI_EI', 'MBTI_SN', 'MBTI_TF', 'MBTI_JP']
# raise NotImplementedError()

In [10]:
assert len(categorical_columns) > 8
for category in categorical_columns:
    assert category in df.columns

In [11]:
# Before we get the dummy variables, we need to make sure that all these 
# categorical columns are actually recognized by pandas to be of 'category' type.
for column in categorical_columns:
    df[column] = df[column].astype('category')

In [12]:
# For every column in the categorical_columns,
# calculate the dummy variables and add them to the dataframe

df = pd.concat([df, pd.get_dummies(df[categorical_columns])], axis=1)
# raise NotImplementedError()

In [13]:
assert len(list(df.columns)) > 45

In [14]:
print("The current columns are:")
list(df.columns)

The current columns are:


['First Name',
 'Last Name',
 'Date of Birth',
 'Address',
 'Zipcode',
 'Gender',
 'Race / Ethnicity',
 'English Fluency',
 'Spanish Fluency',
 'Education',
 'High School GPA',
 'College GPA',
 'Years of Experience',
 'Years of Volunteering',
 'Twitter followers',
 'Instagram Followers',
 'Requires Sponsorship',
 'Customer Satisfaction Rating',
 'Sales Rating',
 'Fired',
 'MBTI_EI',
 'MBTI_SN',
 'MBTI_TF',
 'MBTI_JP',
 'Gender_Female',
 'Gender_Male',
 'Race / Ethnicity_Black',
 'Race / Ethnicity_Caucasian',
 'Race / Ethnicity_Hispanic',
 'English Fluency_Basic',
 'English Fluency_Fluent',
 'English Fluency_Proficient',
 'Spanish Fluency_Basic',
 'Spanish Fluency_Fluent',
 'Spanish Fluency_Proficient',
 'Education_Associates',
 'Education_Graduate',
 'Education_High School',
 'Education_None',
 'Education_Undergraduate',
 'Requires Sponsorship_False',
 'Requires Sponsorship_True',
 'Fired_Current Employee',
 'Fired_Fired',
 'MBTI_EI_E',
 'MBTI_EI_I',
 'MBTI_SN_N',
 'MBTI_SN_S',
 'MBTI_

In [15]:
# Now drop all the categorical features columns from the dataframe
# So that we don't have duplicate information stored
df.drop(categorical_columns, axis=1, inplace=True)
# raise NotImplementedError()

In [16]:
print("The current columns are:")
list(df.columns)

The current columns are:


['First Name',
 'Last Name',
 'Date of Birth',
 'Address',
 'Zipcode',
 'High School GPA',
 'College GPA',
 'Years of Experience',
 'Years of Volunteering',
 'Twitter followers',
 'Instagram Followers',
 'Customer Satisfaction Rating',
 'Sales Rating',
 'Gender_Female',
 'Gender_Male',
 'Race / Ethnicity_Black',
 'Race / Ethnicity_Caucasian',
 'Race / Ethnicity_Hispanic',
 'English Fluency_Basic',
 'English Fluency_Fluent',
 'English Fluency_Proficient',
 'Spanish Fluency_Basic',
 'Spanish Fluency_Fluent',
 'Spanish Fluency_Proficient',
 'Education_Associates',
 'Education_Graduate',
 'Education_High School',
 'Education_None',
 'Education_Undergraduate',
 'Requires Sponsorship_False',
 'Requires Sponsorship_True',
 'Fired_Current Employee',
 'Fired_Fired',
 'MBTI_EI_E',
 'MBTI_EI_I',
 'MBTI_SN_N',
 'MBTI_SN_S',
 'MBTI_TF_F',
 'MBTI_TF_T',
 'MBTI_JP_J',
 'MBTI_JP_P']

In [17]:
assert 55 > len(list(df.columns)) > 30

1. ~~Splitting the 16 Myers Briggs types into 4 subtypes~~
2. ~~Converting categorical features into dummy binary features~~
3. Calculating age based on date of birth
4. Dealing with missing (NaN) values in the data

### Age Calculation

In [18]:
def calculate_age(born):
    """Calculates age based on date of birth using https://stackoverflow.com/a/9754466/818687

    Args:
        born (datetime): The date of birth

    Returns:
        int: The age based on date of birth
    """
    today = datetime.datetime.strptime("2020-11-20", "%Y-%m-%d")
    return today.year - born.year - ((today.month, today.day) < (born.month, born.day))

Add an "Age" column to the dataframe, with the help of the ```calculate_age()``` function above. Afterwards, remove the "Date of Birth" column.

The input to ```calculate_age()``` should be a datetime object. Review the ```datetime.datetme.strptime()```
function and [format codes](https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior) to determine how to convert the "Date of Birth" date string into a datetime object.

In [19]:
mylist = []
df['Date of Birth'] = pd.to_datetime(df['Date of Birth'], format="%Y-%m-%d")
for i in df['Date of Birth']:
  mylist.append(calculate_age(i))
df['Age'] = mylist
df.drop('Date of Birth', axis=1, inplace=True)

# raise NotImplementedError()

In [20]:
assert df["Age"].min() == 22
assert df["Age"].max() == 38
assert df["Age"].median() == 30
assert "Date of Birth" not in list(df.columns)

1. ~~Splitting the 16 Myers Briggs types into 4 subtypes~~
2. ~~Converting categorical features into dummy binary features~~
3. ~~Calculating age based on date of birth~~
4. Dealing with missing (NaN) values in the data

## Handle NaN values

In [21]:
# Create a list of columns that contain NaN values
nan_columns = df.columns[df.isna().any()].tolist()
print(nan_columns)

['High School GPA', 'College GPA']


We see that data is not truly "missing" any values, but for people that did not attend or complete high school or college, there are no GPA values.

How should you deal with this? If you had a large number of people without GPAs, you might consider making separate models for people with GPAs and for people without. For this case, your manager asks you to make sure there's one model for everyone. She recommends one of the following options:

1. Replace NaN values with the mean value of all the non-NaN values.
1. Replace NaN values with 0
1. Replace NaN values with some other value
1. Create a model to predict people's GPA values from other attributes and fill them in with those values

Consider the assumptions of each approach:
1. Replacing with the mean assumes that that person would receive the average of others who work at this company.
1. Replacing with 0 assumes that that person would fail if they attended high school or college.
1. Replacing with some arbitrary value will have assumptions based on what that value is
1. Creating a model to predict people's GPA values from the other attributes in the data assumes that those attributes are predictive of GPA. 


Regardless of the approach you take, just make sure there are no more NaN values. 

In [22]:
# For each of the two columns that contain NaN values, replace the NaN values
# with numerical values, using one of the approaches above, or some other approach
# that you devise yourself.

df[nan_columns] = df[nan_columns].fillna(0)
# raise NotImplementedError()

In [23]:
for col in nan_columns:
    assert not df[col].isna().any()

In [24]:
# Describe the approach you chose and why and save that as a string called nan_filling_approach
nan_filling_approach = '''
I chose approach two since I felt that, even though the person may not have 
technically failed any classes, they did fail to complete their degree and therefore
for when it comes to gpa this is like an equivalent of having no gpa which can be 
distinguished by storing as a gpa of zero.
'''
# raise NotImplementedError()
print(nan_filling_approach)


I chose approach two since I felt that, even though the person may not have 
technically failed any classes, they did fail to complete their degree and therefore
for when it comes to gpa this is like an equivalent of having no gpa which can be 
distinguished by storing as a gpa of zero.



In [25]:
assert len(nan_filling_approach) > 30

## Modeling

### Interviewing model(s)

Having completed the conversion of the data into a format that can be used with machine learning models, your manager asks that you build three seperate models which predict the following three targets, respectively:

1. Customer Satisfaction
1. Sales Performance
1. Fired

In [26]:
# Save the names of columns we are trying to predict to a list called "targets".
# Make sure that if we had a categorical column, that you use the dummy representation(s)
targets = ['Customer Satisfaction Rating', 'Sales Rating', 'Fired_Fired']
# raise NotImplementedError()

In [27]:
assert len(targets) == 3
for target in targets:
    assert target in df.columns

Ultimately, the predictions of your models will be used to rank applicants for interviews with HR.

**Which features will you select to use in your model?** You will specify them below.

In [28]:
print("The available columns are:")
list(df)

The available columns are:


['First Name',
 'Last Name',
 'Address',
 'Zipcode',
 'High School GPA',
 'College GPA',
 'Years of Experience',
 'Years of Volunteering',
 'Twitter followers',
 'Instagram Followers',
 'Customer Satisfaction Rating',
 'Sales Rating',
 'Gender_Female',
 'Gender_Male',
 'Race / Ethnicity_Black',
 'Race / Ethnicity_Caucasian',
 'Race / Ethnicity_Hispanic',
 'English Fluency_Basic',
 'English Fluency_Fluent',
 'English Fluency_Proficient',
 'Spanish Fluency_Basic',
 'Spanish Fluency_Fluent',
 'Spanish Fluency_Proficient',
 'Education_Associates',
 'Education_Graduate',
 'Education_High School',
 'Education_None',
 'Education_Undergraduate',
 'Requires Sponsorship_False',
 'Requires Sponsorship_True',
 'Fired_Current Employee',
 'Fired_Fired',
 'MBTI_EI_E',
 'MBTI_EI_I',
 'MBTI_SN_N',
 'MBTI_SN_S',
 'MBTI_TF_F',
 'MBTI_TF_T',
 'MBTI_JP_J',
 'MBTI_JP_P',
 'Age']

In [29]:
# Enter all the features you want to use in a list and save it to "interview_features".
# These are the features for the models that will predict the targets, and the
# predictions will be used to rank applicants for **interviews**.
interview_features = ['High School GPA',
                      'College GPA',
                      'Years of Experience',
                      'Years of Volunteering',
                      'English Fluency_Basic',
                      'English Fluency_Fluent',
                      'English Fluency_Proficient',
                      'Spanish Fluency_Basic',
                      'Spanish Fluency_Fluent',
                      'Spanish Fluency_Proficient',
                      'Education_Associates',
                      'Education_High School',
                      'Education_Undergraduate',
                      'MBTI_EI_E',
                      'MBTI_EI_I',
                      'MBTI_SN_N',
                      'MBTI_SN_S',
                      'MBTI_TF_F',
                      'MBTI_TF_T',
                      'MBTI_JP_J',
                      'MBTI_JP_P',
                      ]
# raise NotImplementedError()

Why did you choose the features you did?

In [30]:
## Save your reasoning in a string to the variable interview_reason

interview_reason = '''
for my feature selection, I tried to pick features that seem to be taken into consideration for jobs 
that I have applied for in real life and this can sometimes include a personality test which is why 
I included the personality types. I also tried to stick to features that are metrics which hiring managers 
can use legally to evaluate candidates to try and remove the more biased, opinion based, metrics.
'''
# raise NotImplementedError()

In [31]:
assert isinstance(interview_reason, str)
assert len(interview_reason) > 20

In [32]:
# Perform a train and test split on the data with the variable names:
# interview_x_train for the training features
# interview_x_test for the testing features
# interview_y_train for the training targets
# interview_y_test for the testing targets
# The test dataset should be 20% of the total dataset

interview_x_train, interview_x_test, interview_y_train, interview_y_test = train_test_split(df[interview_features], df[targets], test_size=0.2)

# raise NotImplementedError()

In [33]:
assert (len(interview_x_train) / (len(interview_x_test) + len(interview_x_train))) == 0.8
assert (len(interview_y_train) / (len(interview_y_test) + len(interview_y_train))) == 0.8
assert len(interview_x_train) == len(interview_y_train)
assert len(interview_x_test) == len(interview_y_test)

Build and train your interviewing models.

In [34]:
# Select models of your choosing, import them here, and perform a
# hyperparameter search while training them on each of the targets.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_regression, RFE
from sklearn.linear_model import LinearRegression, LassoCV
from sklearn.metrics import classification_report, confusion_matrix, f1_score
from sklearn import metrics
# Do not use Tensorflow to build a model - you may use scikit-learn.

def param(mi_trans):
  for feature, importance in zip(interview_features, mi_trans.scores_):
    print(f"The MI score for {feature} is {importance}")
  return

kval = 14

mi_x_train1 = SelectKBest(k=kval).fit_transform(interview_x_train, interview_y_train['Customer Satisfaction Rating'])
mi_x_test1 = SelectKBest(k=kval).fit_transform(interview_x_test, interview_y_test['Customer Satisfaction Rating'])
mi_transformer1 = SelectKBest(k=kval).fit(interview_x_train, interview_y_train['Customer Satisfaction Rating'])
interview_model_target1 = LinearRegression().fit(interview_x_train, interview_y_train['Customer Satisfaction Rating'])

mi_x_train2 = SelectKBest(k=kval).fit_transform(interview_x_train, interview_y_train['Sales Rating'])
mi_x_test2 = SelectKBest(k=kval).fit_transform(interview_x_test, interview_y_test['Sales Rating'])
mi_transformer2 = SelectKBest(k=kval).fit(interview_x_train, interview_y_train['Sales Rating'])
interview_model_target2 = LinearRegression().fit(interview_x_train, interview_y_train['Sales Rating'])

interview_model_target3 = KNeighborsClassifier(n_neighbors=16).fit(interview_x_train, interview_y_train['Fired_Fired'])

# Determine an appropriate metric for measuring your performance for each
# model/target, and report the test score for that metric. The metric may be
# different for each model/target.

#ac_model1

pred1 = interview_model_target2.predict(interview_x_test)
ac_model1 = np.sqrt(metrics.mean_squared_error(interview_y_test['Customer Satisfaction Rating'], pred1))

#ac_model2

pred2 = interview_model_target2.predict(interview_x_test)
ac_model2 = np.sqrt(metrics.mean_squared_error(interview_y_test['Sales Rating'], pred2))

# ac_model3
param3 = "a K value of 16"
pred3 = interview_model_target3.predict(interview_x_test)
ac_model3 = classification_report(interview_y_test['Fired_Fired'], pred3, output_dict=True).get('accuracy')


# Save your models in a list, with models ordered in the same manner as the
# targets they are predicting in the list "targets" you created above.
# Call the list "my_hiring_models", e.g.
my_interview_models = [interview_model_target1,
                      interview_model_target2,
                      interview_model_target3]
type_models = ['LinearRegression', 'LinearRegression', 'KNN']
int_ac_models = [ac_model1, ac_model2, ac_model3]
meas = ['mean square error', 'mean square error', 'accuracy']
# You should use multiple print messages to print something like the
# following for each of your models/targets:
#
# To predict the target (target), I trained a (model) model
# and determined the best hyperparameters as (param1 = p1), (param2 = p2)...
# resulting in a (metric) score of (score).

for i in range(3):
  if i == 0:
    print("To predict the target ", str(targets[i]), ", I trained a ", str(type_models[i]), " model and determined the best hyperparameters as:")
    param(mi_transformer1)
    print("\n resulting in a mean square error of: ", str(int_ac_models[i]), "\n")
  if i == 1:
    print("To predict the target ", str(targets[i]), ", I trained a ", str(type_models[i]), " model and determined the best hyperparameters as:")
    param(mi_transformer2)
    print("\n resulting in a mean square error of: ", str(int_ac_models[i]), "\n")
  if i == 2:
    print("To predict the target ", str(targets[i]), ", I trained a ", str(type_models[i]), " model and determined the best hyperparameters as: \n",  param3, "\n resulting in a accuracy score of: ", str(int_ac_models[i]), "\n")
# raise NotImplementedError()

To predict the target  Customer Satisfaction Rating , I trained a  LinearRegression  model and determined the best hyperparameters as:
The MI score for High School GPA is 1.2874304345043608
The MI score for College GPA is 1.246806144966678
The MI score for Years of Experience is 45.07757727508887
The MI score for Years of Volunteering is 1.6961390467519553
The MI score for English Fluency_Basic is 1.1790732748640382
The MI score for English Fluency_Fluent is 1.06359102797297
The MI score for English Fluency_Proficient is 0.9668210245973864
The MI score for Spanish Fluency_Basic is 0.9256402248027101
The MI score for Spanish Fluency_Fluent is 1.0081687641143733
The MI score for Spanish Fluency_Proficient is 0.9308409746029191
The MI score for Education_Associates is 0.8542426523425956
The MI score for Education_High School is 1.1633169808725512
The MI score for Education_Undergraduate is 1.1983144374952706
The MI score for MBTI_EI_E is 0.9452340635580201
The MI score for MBTI_EI_I is 0.

In [35]:
assert len(my_interview_models)==len(targets)

### Hiring model(s)

You manager tells you that SellsALOT has decided they wish to consider doing away with interviews altogether, in order to save money. SellsALOT would like a model that will be used to rank candidates for directly hiring them, rather than for interviewing them.

Will your choice of features changes?

**Which features will you select to use in that model?** You will specify them below.

In [36]:
print("The available columns are:")
list(df)

The available columns are:


['First Name',
 'Last Name',
 'Address',
 'Zipcode',
 'High School GPA',
 'College GPA',
 'Years of Experience',
 'Years of Volunteering',
 'Twitter followers',
 'Instagram Followers',
 'Customer Satisfaction Rating',
 'Sales Rating',
 'Gender_Female',
 'Gender_Male',
 'Race / Ethnicity_Black',
 'Race / Ethnicity_Caucasian',
 'Race / Ethnicity_Hispanic',
 'English Fluency_Basic',
 'English Fluency_Fluent',
 'English Fluency_Proficient',
 'Spanish Fluency_Basic',
 'Spanish Fluency_Fluent',
 'Spanish Fluency_Proficient',
 'Education_Associates',
 'Education_Graduate',
 'Education_High School',
 'Education_None',
 'Education_Undergraduate',
 'Requires Sponsorship_False',
 'Requires Sponsorship_True',
 'Fired_Current Employee',
 'Fired_Fired',
 'MBTI_EI_E',
 'MBTI_EI_I',
 'MBTI_SN_N',
 'MBTI_SN_S',
 'MBTI_TF_F',
 'MBTI_TF_T',
 'MBTI_JP_J',
 'MBTI_JP_P',
 'Age']

In [37]:
# Enter all the features you want to use in a list and save it to "hire_features".
# These are the features for the models that will predict the targets, and the
# predictions will be used to rank applicants for **hiring**.

hire_features = interview_features
# raise NotImplementedError()

Why did you choose the features you did?

In [38]:
## Save your reasoning in a string to the variable hire_reason

hire_reason = '''
I ended up going with essentially the same features as before. 
In real life a personality test is often included in the hiring process as well which is why 
I included the personality types. And here I also tried to stick to features that are metrics which hiring managers 
can use legally to evaluate candidates to try and remove the more biased, opinion based, metrics as well.
'''
# raise NotImplementedError()

In [39]:
assert isinstance(hire_reason, str)
assert len(hire_reason) > 20

Why was your choice different from or the same as the interviewing features?


In [40]:
# Save your reasoning in a string to the variable
# same_reason if the features are the same, or
# different_reason if the features are different.
same_reason = '''
I believe that the reason for getting an interview and the reason 
for someone getting hired are essentially the same. The interview I feel is more to 
see how well you can communicate and how strongly you may state your case for why
you should get hired. But if the company has the same information that they would of had
for the interview I think all that information is still useful in determining whether
or not someone gets hired.
'''
# raise NotImplementedError()

In [41]:
if all([rf in hire_features for rf in interview_features]) and all([sf in interview_features for sf in hire_features]):
    print("Your features for interviewing and hiring are the same.")
    assert isinstance(same_reason, str)
    assert len(same_reason) > 20
else:
    print("Your features for interviewing and hiring are different.")
    assert isinstance(different_reason, str)
    assert len(different_reason) > 20

Your features for interviewing and hiring are the same.


In [42]:
# Perform a train and test split on the data with the variable names:
# hire_x_train for the training features
# hire_x_test for the testing features
# hire_y_train for the training targets
# hire_y_test for the testing targets
# The test dataset should be 20% of the total dataset

hire_x_train, hire_x_test, hire_y_train, hire_y_test = train_test_split(df[hire_features], df[targets], test_size=0.2)
# raise NotImplementedError()

In [43]:
assert (len(hire_x_train) / (len(hire_x_test) + len(hire_x_train))) == 0.8
assert (len(hire_y_train) / (len(hire_y_test) + len(hire_y_train))) == 0.8
assert len(hire_x_train) == len(hire_y_train)
assert len(hire_x_test) == len(hire_y_test)

Build and train your hiring models.

Do you expect this model to perform differently?

In [44]:
# Select models of your choosing, import them here, and perform a
# hyperparameter search while training them on each of the targets.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_regression, RFE
from sklearn.linear_model import LinearRegression, LassoCV
from sklearn.metrics import classification_report, confusion_matrix, f1_score
from sklearn import metrics
# Do not use Tensorflow to build a model - you may use scikit-learn.

def param(mi_trans):
  for feature, importance in zip(hire_features, mi_trans.scores_):
    print(f"The MI score for {feature} is {importance}")
  return

kval = 14

mi_x_train1 = SelectKBest(k=kval).fit_transform(hire_x_train, hire_y_train['Customer Satisfaction Rating'])
mi_x_test1 = SelectKBest(k=kval).fit_transform(hire_x_test, hire_y_test['Customer Satisfaction Rating'])
mi_transformer1 = SelectKBest(k=kval).fit(hire_x_train, hire_y_train['Customer Satisfaction Rating'])
hire_model_target1 = LinearRegression().fit(hire_x_train, hire_y_train['Customer Satisfaction Rating'])

mi_x_train2 = SelectKBest(k=kval).fit_transform(hire_x_train, hire_y_train['Sales Rating'])
mi_x_test2 = SelectKBest(k=kval).fit_transform(hire_x_test, hire_y_test['Sales Rating'])
mi_transformer2 = SelectKBest(k=kval).fit(hire_x_train, hire_y_train['Sales Rating'])
hire_model_target2 = LinearRegression().fit(hire_x_train, hire_y_train['Sales Rating'])

hire_model_target3 = KNeighborsClassifier(n_neighbors=16).fit(hire_x_train, hire_y_train['Fired_Fired'])

# Determine an appropriate metric for measuring your performance for each
# model/target, and report the test score for that metric. The metric may be
# different for each model/target.

#ac_model1

pred1 = hire_model_target2.predict(hire_x_test)
ac_model1 = np.sqrt(metrics.mean_squared_error(hire_y_test['Customer Satisfaction Rating'], pred1))

#ac_model2

pred2 = hire_model_target2.predict(hire_x_test)
ac_model2 = np.sqrt(metrics.mean_squared_error(hire_y_test['Sales Rating'], pred2))

# ac_model3
param3 = "a K value of 16"
pred3 = hire_model_target3.predict(hire_x_test)
ac_model3 = classification_report(hire_y_test['Fired_Fired'], pred3, output_dict=True).get('accuracy')


# Save your models in a list, with models ordered in the same manner as the
# targets they are predicting in the list "targets" you created above.
# Call the list "my_hiring_models", e.g.
my_hiring_models = [hire_model_target1,
                  hire_model_target2,
                  hire_model_target3]
type_models = ['LinearRegression', 'LinearRegression', 'KNN']
hire_ac_models = [ac_model1, ac_model2, ac_model3]
# You should use multiple print messages to print something like the
# following for each of your models/targets:
#
# To predict the target (target), I trained a (model) model
# and determined the best hyperparameters as (param1 = p1), (param2 = p2)...
# resulting in a (metric) score of (score).

for i in range(3):
  if i == 0:
    print("To predict the target ", str(targets[i]), ", I trained a ", str(type_models[i]), " model and determined the best hyperparameters as:")
    param(mi_transformer1)
    print("\n resulting in a mean square error of: ", str(hire_ac_models[i]), "\n")
  if i == 1:
    print("To predict the target ", str(targets[i]), ", I trained a ", str(type_models[i]), " model and determined the best hyperparameters as:")
    param(mi_transformer2)
    print("\n resulting in a mean square error of: ", str(hire_ac_models[i]), "\n")
  if i == 2:
    print("To predict the target ", str(targets[i]), ", I trained a ", str(type_models[i]), " model and determined the best hyperparameters as: \n",  param3, "\n resulting in a accuracy score of: ", str(hire_ac_models[i]), "\n")
# raise NotImplementedError()

To predict the target  Customer Satisfaction Rating , I trained a  LinearRegression  model and determined the best hyperparameters as:
The MI score for High School GPA is 1.176511306728623
The MI score for College GPA is 1.206028311619181
The MI score for Years of Experience is 42.689289743435104
The MI score for Years of Volunteering is 1.4904992319709545
The MI score for English Fluency_Basic is 1.0937792009921217
The MI score for English Fluency_Fluent is 1.014811502273786
The MI score for English Fluency_Proficient is 0.9488811718525213
The MI score for Spanish Fluency_Basic is 1.0444834694883742
The MI score for Spanish Fluency_Fluent is 1.0412007672102295
The MI score for Spanish Fluency_Proficient is 1.140212587096581
The MI score for Education_Associates is 0.9072134024018822
The MI score for Education_High School is 1.0533928787174978
The MI score for Education_Undergraduate is 1.1126458485614548
The MI score for MBTI_EI_E is 0.8807836261019236
The MI score for MBTI_EI_I is 0.

In [45]:
assert len(my_hiring_models)==len(targets)

In [46]:
# Follow this up with a comparison between the performance (test scores) on your
# two sets of models.
#
# You should print something like, for each of the targets:
#   Using interview features for target (target) the model scored (score)
#   versus using the hiring features where it scored (score)
print("using interview features: ")
for i in range(3):
    print("the resulting ", str(meas[i]), " was: ", str(int_ac_models[i]))
print("using hiring features: ")
for i in range(3):
    print("the resulting ", str(meas[i]), " was: ", str(hire_ac_models[i]))
# raise NotImplementedError()

using interview features: 
the resulting  mean square error  was:  0.21304768897901027
the resulting  mean square error  was:  0.05005810775616038
the resulting  accuracy  was:  0.9525
using hiring features: 
the resulting  mean square error  was:  0.20688921880273578
the resulting  mean square error  was:  0.04807897266693205
the resulting  accuracy  was:  0.9425


## Model Evaluation

In this section we'll create example applicants and see how they would fare based on their applications and your models. First, let's create some example applications. We've created four applicants, and you'll need to create a fifth one in the cell below.

In [47]:
applicant_1 = {
    'First Name': "Stefon",
    'Last Name': "Smith",
    'Date of Birth': "1989-12-24",
    'Address': "4892 Jessica Turnpike Suite 781",
    'Zipcode': 86553,
    'Gender': "Male",
    'Race / Ethnicity': "Caucasian",
    'English Fluency': "Proficient",
    'Spanish Fluency': "Basic",
    'Education': "Associates",
    'High School GPA': 2.9,
    'College GPA': 3.1,
    'Years of Experience': 5,
    'Years of Volunteering': 2,
    'Myers Briggs Type': "ESFJ",
    'Twitter followers': 524,
    'Instagram Followers': 857,
    'Requires Sponsorship': True
}
applicant_2 = {
    'First Name': "Sarah",
    'Last Name': "Chang",
    'Date of Birth': "1995-04-13",
    'Address': "9163 Rebecca Loop",
    'Zipcode': 43711,
    'Gender': "Female",
    'Race / Ethnicity': "Hispanic",
    'English Fluency': "Fluent",
    'Spanish Fluency': "Fluent",
    'Education': "Undergraduate",
    'High School GPA': 4.0,
    'College GPA': 3.8,
    'Years of Experience': 5,
    'Years of Volunteering': 0,
    'Myers Briggs Type': "ISTJ",
    'Twitter followers': 97,
    'Instagram Followers': 204,
    'Requires Sponsorship': False
}
applicant_3 = {
    'First Name': "Daniel",
    'Last Name': "Richardson",
    'Date of Birth': "1998-10-23",
    'Address': "436 Lauren Stream",
    'Zipcode': 54821,
    'Gender': "Male",
    'Race / Ethnicity': "Black",
    'English Fluency': "Fluent",
    'Spanish Fluency': "Proficient",
    'Education': "Undergraduate",
    'High School GPA': 3.0,
    'College GPA': 3.2,
    'Years of Experience': 1,
    'Years of Volunteering': 1,
    'Myers Briggs Type': "ENFJ",
    'Twitter followers': 2087,
    'Instagram Followers': 3211,
    'Requires Sponsorship': False
}

applicant_4 = {
    'First Name': "Billy",
    'Last Name': "Bob",
    'Date of Birth': "1999-11-03",
    'Address': "412 Railway Stream",
    'Zipcode': 43711,
    'Gender': "Male",
    'Race / Ethnicity': "Caucasian",
    'English Fluency': "Basic",
    'Spanish Fluency': "Fluent",
    'Education': "Undergraduate",
    'High School GPA': 2.0,
    'College GPA': 3.5,
    'Years of Experience': 1,
    'Years of Volunteering': 1,
    'Myers Briggs Type': "ENFJ",
    'Twitter followers': 207,
    'Instagram Followers': 309,
    'Requires Sponsorship': False
}

# Create a fictional applicant by copying the attributes above from any of the
# other applicants and/or adding example values that you would be curious to
# see how your model treats. For example, create an applicant you'd be sure to
# reject or sure to hire.

applicant_5 = {
    'First Name': "Milly",
    'Last Name': "Mob",
    'Date of Birth': "1999-11-03",
    'Address': "412 Railway Stream",
    'Zipcode': 43711,
    'Gender': "Male",
    'Race / Ethnicity': "Caucasian",
    'English Fluency': "Fluent",
    'Spanish Fluency': "Basic",
    'Education': "High School",
    'High School GPA': 2.0,
    'College GPA': 1.0,
    'Years of Experience': 0,
    'Years of Volunteering': 0,
    'Myers Briggs Type': "INFP",
    'Twitter followers': 0,
    'Instagram Followers': 0,
    'Requires Sponsorship': False
}
# raise NotImplementedError()

In [48]:
for key in applicant_4.keys():
    assert key in applicant_5.keys()

In [49]:
new_people = [applicant_1, applicant_2, applicant_3, applicant_4, applicant_5]
new_people_df = pd.DataFrame.from_records(new_people)

In [50]:
new_people_df

Unnamed: 0,First Name,Last Name,Date of Birth,Address,Zipcode,Gender,Race / Ethnicity,English Fluency,Spanish Fluency,Education,High School GPA,College GPA,Years of Experience,Years of Volunteering,Myers Briggs Type,Twitter followers,Instagram Followers,Requires Sponsorship
0,Stefon,Smith,1989-12-24,4892 Jessica Turnpike Suite 781,86553,Male,Caucasian,Proficient,Basic,Associates,2.9,3.1,5,2,ESFJ,524,857,True
1,Sarah,Chang,1995-04-13,9163 Rebecca Loop,43711,Female,Hispanic,Fluent,Fluent,Undergraduate,4.0,3.8,5,0,ISTJ,97,204,False
2,Daniel,Richardson,1998-10-23,436 Lauren Stream,54821,Male,Black,Fluent,Proficient,Undergraduate,3.0,3.2,1,1,ENFJ,2087,3211,False
3,Billy,Bob,1999-11-03,412 Railway Stream,43711,Male,Caucasian,Basic,Fluent,Undergraduate,2.0,3.5,1,1,ENFJ,207,309,False
4,Milly,Mob,1999-11-03,412 Railway Stream,43711,Male,Caucasian,Fluent,Basic,High School,2.0,1.0,0,0,INFP,0,0,False


### Future Applicants Data Cleaning



In [51]:
# Apply all the cleaning and dummy variable creation you did above to this new
# DataFrame. You can copy your code from above and modify it to apply to
# new_people_df instead of df.

new_people_df['MBTI_EI'] = new_people_df['Myers Briggs Type'].apply(lambda x: x[0])
new_people_df['MBTI_SN'] = new_people_df['Myers Briggs Type'].apply(lambda x: x[1])
new_people_df['MBTI_TF'] = new_people_df['Myers Briggs Type'].apply(lambda x: x[2])
new_people_df['MBTI_JP'] = new_people_df['Myers Briggs Type'].apply(lambda x: x[3])
new_people_df.drop(columns=['Myers Briggs Type'])

categorical_columns = ['Gender', 'Race / Ethnicity', 'English Fluency', 'Spanish Fluency', 'Education', 'Requires Sponsorship', 'MBTI_EI', 'MBTI_SN', 'MBTI_TF', 'MBTI_JP']
new_people_df = pd.concat([new_people_df, pd.get_dummies(new_people_df[categorical_columns])], axis=1)
new_people_df.drop(columns=categorical_columns)
# raise NotImplementedError()

Unnamed: 0,First Name,Last Name,Date of Birth,Address,Zipcode,High School GPA,College GPA,Years of Experience,Years of Volunteering,Myers Briggs Type,Twitter followers,Instagram Followers,Gender_Female,Gender_Male,Race / Ethnicity_Black,Race / Ethnicity_Caucasian,Race / Ethnicity_Hispanic,English Fluency_Basic,English Fluency_Fluent,English Fluency_Proficient,Spanish Fluency_Basic,Spanish Fluency_Fluent,Spanish Fluency_Proficient,Education_Associates,Education_High School,Education_Undergraduate,MBTI_EI_E,MBTI_EI_I,MBTI_SN_N,MBTI_SN_S,MBTI_TF_F,MBTI_TF_T,MBTI_JP_J,MBTI_JP_P
0,Stefon,Smith,1989-12-24,4892 Jessica Turnpike Suite 781,86553,2.9,3.1,5,2,ESFJ,524,857,0,1,0,1,0,0,0,1,1,0,0,1,0,0,1,0,0,1,1,0,1,0
1,Sarah,Chang,1995-04-13,9163 Rebecca Loop,43711,4.0,3.8,5,0,ISTJ,97,204,1,0,0,0,1,0,1,0,0,1,0,0,0,1,0,1,0,1,0,1,1,0
2,Daniel,Richardson,1998-10-23,436 Lauren Stream,54821,3.0,3.2,1,1,ENFJ,2087,3211,0,1,1,0,0,0,1,0,0,0,1,0,0,1,1,0,1,0,1,0,1,0
3,Billy,Bob,1999-11-03,412 Railway Stream,43711,2.0,3.5,1,1,ENFJ,207,309,0,1,0,1,0,1,0,0,0,1,0,0,0,1,1,0,1,0,1,0,1,0
4,Milly,Mob,1999-11-03,412 Railway Stream,43711,2.0,1.0,0,0,INFP,0,0,0,1,0,1,0,0,1,0,1,0,0,0,1,0,0,1,1,0,1,0,0,1


In [52]:
new_people_df

Unnamed: 0,First Name,Last Name,Date of Birth,Address,Zipcode,Gender,Race / Ethnicity,English Fluency,Spanish Fluency,Education,High School GPA,College GPA,Years of Experience,Years of Volunteering,Myers Briggs Type,Twitter followers,Instagram Followers,Requires Sponsorship,MBTI_EI,MBTI_SN,MBTI_TF,MBTI_JP,Requires Sponsorship.1,Gender_Female,Gender_Male,Race / Ethnicity_Black,Race / Ethnicity_Caucasian,Race / Ethnicity_Hispanic,English Fluency_Basic,English Fluency_Fluent,English Fluency_Proficient,Spanish Fluency_Basic,Spanish Fluency_Fluent,Spanish Fluency_Proficient,Education_Associates,Education_High School,Education_Undergraduate,MBTI_EI_E,MBTI_EI_I,MBTI_SN_N,MBTI_SN_S,MBTI_TF_F,MBTI_TF_T,MBTI_JP_J,MBTI_JP_P
0,Stefon,Smith,1989-12-24,4892 Jessica Turnpike Suite 781,86553,Male,Caucasian,Proficient,Basic,Associates,2.9,3.1,5,2,ESFJ,524,857,True,E,S,F,J,True,0,1,0,1,0,0,0,1,1,0,0,1,0,0,1,0,0,1,1,0,1,0
1,Sarah,Chang,1995-04-13,9163 Rebecca Loop,43711,Female,Hispanic,Fluent,Fluent,Undergraduate,4.0,3.8,5,0,ISTJ,97,204,False,I,S,T,J,False,1,0,0,0,1,0,1,0,0,1,0,0,0,1,0,1,0,1,0,1,1,0
2,Daniel,Richardson,1998-10-23,436 Lauren Stream,54821,Male,Black,Fluent,Proficient,Undergraduate,3.0,3.2,1,1,ENFJ,2087,3211,False,E,N,F,J,False,0,1,1,0,0,0,1,0,0,0,1,0,0,1,1,0,1,0,1,0,1,0
3,Billy,Bob,1999-11-03,412 Railway Stream,43711,Male,Caucasian,Basic,Fluent,Undergraduate,2.0,3.5,1,1,ENFJ,207,309,False,E,N,F,J,False,0,1,0,1,0,1,0,0,0,1,0,0,0,1,1,0,1,0,1,0,1,0
4,Milly,Mob,1999-11-03,412 Railway Stream,43711,Male,Caucasian,Fluent,Basic,High School,2.0,1.0,0,0,INFP,0,0,False,I,N,F,P,False,0,1,0,1,0,0,1,0,1,0,0,0,1,0,0,1,1,0,1,0,0,1


In [53]:
for feature in interview_features:
    assert feature in new_people_df.columns
for feature in hire_features:
    assert feature in new_people_df.columns

### Future Applicant Model(s) Predictions

Now let's predict what the applicants' scores would be. Use your `best_interview_model` and `best_hire_model` to predict their scores.

In [54]:
# Save your predictions as new_people_interview and new_people_hire.
# Each of these should be a list of dictionaries, with one dictionery for
# each applicant. The keys of the dictionaries should be the same as the
# elements/strings in the "targets" list you created above.

new_people_interview_features = new_people_df[interview_features]
new_people_hire_features = new_people_df[hire_features]

new_people_interview = []
new_people_hire = []

# print(new_people_interview_features.iloc[0])

def get_dic_int(data):
  new_interview = {
      'Customer Satisfaction Rating': my_interview_models[0].predict([data]),
      'Sales Rating': my_interview_models[1].predict([data]),
      'Fired_Fired': my_interview_models[2].predict([data])
      }
  return new_interview

for i in range(5):
  new_people_interview.append(get_dic_int(new_people_interview_features.iloc[i]))

def get_dic_hire(data):
  new_hire = {
      'Customer Satisfaction Rating': my_hiring_models[0].predict([data]),
      'Sales Rating': my_hiring_models[1].predict([data]),
      'Fired_Fired': my_hiring_models[2].predict([data])
      }
  return new_hire

for j in range(5):
  new_people_hire.append(get_dic_hire(new_people_hire_features.iloc[j]))

# raise NotImplementedError()

In [55]:
for new_person_interview in new_people_interview:
    for key in targets:
        assert key in new_person_interview.keys()

for new_person_hire in new_people_hire:
    for key in targets:
        assert key in new_person_hire.keys()

### Ranking Evaluation

Your manager notes that given that you might have more than one prediction target, the model predictions aren't really ranking or selecting people. There is no "best" person because there's more than one metric to look through. A human still needs to look through the predictions so your models don't yet really do what SellsALOT has asked for.

Your manager asks you to create a synthetic scalar variable that is calculated from the multiple target predictions of an individual person. That way we'll have one metric by which we can rank people. You need to create that synthetic metric (score).

Some candidate approaches:

1. Incorporating a binary value, x:
    - You can multiply x by some arbitrary value and add/subtract it to/from the total score:
      - score = t1 + t2 * x
    - You can multiple your entire score output by the binary value to say something like "if not x, then  score is 0", e.g.:
      - score = x * (t1 + t2)
1. Balancing between different target values:
    - You can balance between different values by adding a multiplier (if t1 is twice as important as t2, then the score can be something like:
     - score = 2 * t1 + t2
1. Some combination of the items above
1. Something creative you devise on your own!

In [56]:
def calculate_synthetic_metric(targets):
    """Calculates a synthetic matric based on the targets of an individual
    Your metric should result in a higher score being a better one

    Args:
      targets (dict): The dictionary with keys as the target names and
                      values as the target values/predictions

    Returns:
      float: The synthetic score produced from 
    """
    one = targets.get('Customer Satisfaction Rating')
    two = targets.get('Sales Rating')
    three = targets.get('Fired_Fired')
    score = 0.5*one+0.5*two-0.5*three
    return score
    # raise NotImplementedError()

Let's try out the synthetic metric on the original data and see if you're happy with the result based on the past data.

In [57]:
# Add a column named "Metric" to the **original** DataFrame with the synthetic metric applied to each row
my_metrics = []
for index,row in df.iterrows():
  my_score = calculate_synthetic_metric(get_dic_int(df[interview_features].iloc[index]))
  my_metrics.append(my_score[0])
df['Metric'] = my_metrics

# raise NotImplementedError()


In [58]:
df.head()

Unnamed: 0,First Name,Last Name,Address,Zipcode,High School GPA,College GPA,Years of Experience,Years of Volunteering,Twitter followers,Instagram Followers,Customer Satisfaction Rating,Sales Rating,Gender_Female,Gender_Male,Race / Ethnicity_Black,Race / Ethnicity_Caucasian,Race / Ethnicity_Hispanic,English Fluency_Basic,English Fluency_Fluent,English Fluency_Proficient,Spanish Fluency_Basic,Spanish Fluency_Fluent,Spanish Fluency_Proficient,Education_Associates,Education_Graduate,Education_High School,Education_None,Education_Undergraduate,Requires Sponsorship_False,Requires Sponsorship_True,Fired_Current Employee,Fired_Fired,MBTI_EI_E,MBTI_EI_I,MBTI_SN_N,MBTI_SN_S,MBTI_TF_F,MBTI_TF_T,MBTI_JP_J,MBTI_JP_P,Age,Metric
0,Sarah,Chang,764 Howard Tunnel,30167,3.1,2.52,8.8,0.0,693,1108,2.21,2.07,1,0,1,0,0,0,1,0,1,0,0,0,0,1,0,0,1,0,1,0,0,1,0,1,0,1,1,0,30,2.071932
1,Daniel,Taylor,4892 Jessica Turnpike Suite 781,86553,3.02,3.9,13.7,0.0,507,1259,3.37,2.98,0,1,1,0,0,0,1,0,1,0,0,0,0,1,0,0,1,0,1,0,0,1,0,1,1,0,1,0,35,3.24704
2,Heather,Stewart,778 Linda Orchard Apt. 609,30167,2.95,2.63,5.2,0.0,599,868,1.5,1.36,1,0,1,0,0,0,0,1,1,0,0,0,0,1,0,0,1,0,1,0,0,1,1,0,1,0,0,1,27,1.373467
3,Katherine,Dillon,139 Linda Crossroad Suite 115,30167,3.99,3.88,12.5,0.0,1321,889,2.89,2.62,1,0,1,0,0,1,0,0,1,0,0,0,0,1,0,0,0,1,1,0,0,1,0,1,1,0,0,1,33,2.812127
4,Sheri,Bolton,1858 Lauren Orchard,60531,3.82,3.3,7.0,0.0,414,13760,1.94,1.78,1,0,1,0,0,0,0,1,0,0,1,0,0,1,0,0,0,1,1,0,0,1,0,1,1,0,1,0,29,1.855247


In [59]:
assert "Metric" in df.columns
assert np.issubdtype(df["Metric"].dtype, np.number)

Are you happy with the synthetic score based on the values for each person here? Go back and update it until you're satisfied with this score.

In [60]:
# Explain the logic behind your synthetic scoring mechanism and save it as synthetic_score_reasoning
synthetic_score_reasoning = '''
the reasoning behind my scoring was simple. I wanted 
half of the score weight to be determined by the 
'Customer Satisfaction Rating' and the other portion 
determined by 'Sales_Rating' and if the person was listed as Fired
then that would be -0.5 points. 
'''
# raise NotImplementedError()

In [61]:
assert len(synthetic_score_reasoning) > 100

Now let's calculate the synthetic scores for the new people (applicants) and see if you're satisfied with your models' rankings for interviewing and hiring.

In [62]:
new_people_interview_score = [calculate_synthetic_metric(target_values) for target_values in new_people_interview]
new_people_hire_score = [calculate_synthetic_metric(target_values) for target_values in new_people_hire]

In [63]:
best_interview_person = new_people[new_people_interview_score.index(max(new_people_interview_score))]
best_hire_person = new_people[new_people_hire_score.index(max(new_people_hire_score))]

Based on these scores, your model selected the following people:

In [64]:
print(f"""
Your interviewing model selected {best_interview_person['First Name']} {best_interview_person['Last Name']} as the person to interview.

Your hiring model selected {best_hire_person['First Name']} {best_hire_person['Last Name']} as the person to hire.
""")


Your interviewing model selected Stefon Smith as the person to interview.

Your hiring model selected Stefon Smith as the person to hire.



Are you happy with these results? Feel free to modify the `applicant_5`'s attributes and see how your model performs based on changing these values. 

In [65]:
# Describe your level of satisfaction with your models
# Did you edit your model based on the results? What did you change?
# What general conclusions did you get from the exercise
# Save your answer to the above questions as conclusions
conclusions = '''
I am very satisfied with the outcomes of this project. 
It makes sense that the person we chose is the same for both the 
interviewing and hiring models since they were based on the 
same features and this is how I wanted it to behave. The outcome 
also generally makes sense just looking at the stats of the person
who was chosen.
'''
# raise NotImplementedError()
print(conclusions)


I am very satisfied with the outcomes of this project. 
It makes sense that the person we chose is the same for both the 
interviewing and hiring models since they were based on the 
same features and this is how I wanted it to behave. The outcome 
also generally makes sense just looking at the stats of the person
who was chosen.



In [66]:
assert len(conclusions) > 100

## Feedback

In [67]:
def feedback():
    """Provide feedback on the contents of this exercise
    
    Returns:
        string
    """
    return("not easy but not too hard either")
    # raise NotImplementedError()