# Job Performance Prediction

You work for a software startup, Predict All The Things Inc. (PALT), and are approached by the CEO to build an algorithm that can help sift through resumes. PALT just closed a $3 million Series A round of funding and the CEO just landed a deal with a national retailer, SellsALOT, to help them with hiring Sales Associates.

They are able to obtain data on all the employees that work as Sales Associates throughout their stores as well as customer satisfaction and sales performance scores.

In this case study, you are tasked to build a model to predict job performance to assist HR in selecting applicants to interview.

The data was provided to you by the new HR intern, Keegan. This is the email you got from Keegan with the attached data.

>Hi!
>
>I hope you're doing well. I've attached the data we have about all employees. Please ensure this data stays confidential and is not shared with anyone who has not signed the NDA. The columns have all the information we have about our employees and the scoring rating that they've received from our performance monitors. We also have some employees that were fired and I have included those as well.
>
>I was also able to dig up some more information about our employees that I found on the internet. It took a lot of time but I hope it helps in making the model even better. Can't wait to see this thing in action. Everyone here is very excited about our collaboration with you and we look forward to this making hiring a lot easier for us.
>
>Thanks,
>
>Keegan Thiel
>
>HR Intern
>
>Human Resources
>
>SellsALOT


Data is available in the `employees.csv` file provided. 


SellsALOT is an Equal Opportunity Employer which is an employer who agrees not to discriminate against any employee or job applicant because of race, color, religion, national origin, sex, physical or mental disability, or age.


## Data Cleaning

First, let's investigate the data that we received from Keegan.



In [0]:
import pandas as pd
import matplotlib
import numpy as np
import sklearn as sk
from sklearn.model_selection import train_test_split
import datetime
from datetime import date

In [0]:
df = pd.read_csv("employees.csv")

In [195]:
df.head()

Unnamed: 0,First Name,Last Name,Date of Birth,Address,Zipcode,Gender,Race / Ethnicity,English Fluency,Spanish Fluency,Education,...,College GPA,Years of Experience,Years of Volunteering,Myers Briggs Type,Twitter followers,Instagram Followers,Requires Sponsorship,Customer Satisfaction Rating,Sales Rating,Fired
0,Sarah,Chang,1989-12-24,764 Howard Tunnel,30167,Female,Black,Fluent,Basic,High School,...,2.52,8.8,0.0,ISTJ,693,1108,False,2.21,2.07,Current Employee
1,Daniel,Taylor,1985-03-15,4892 Jessica Turnpike Suite 781,86553,Male,Black,Fluent,Basic,High School,...,3.9,13.7,0.0,ISFJ,507,1259,False,3.37,2.98,Current Employee
2,Heather,Stewart,1993-09-20,778 Linda Orchard Apt. 609,30167,Female,Black,Proficient,Basic,High School,...,2.63,5.2,0.0,INFP,599,868,False,1.5,1.36,Current Employee
3,Katherine,Dillon,1986-12-22,139 Linda Crossroad Suite 115,30167,Female,Black,Basic,Basic,High School,...,3.88,12.5,0.0,ISFP,1321,889,True,2.89,2.62,Current Employee
4,Sheri,Bolton,1991-02-24,1858 Lauren Orchard,60531,Female,Black,Proficient,Proficient,High School,...,3.3,7.0,0.0,ISFJ,414,13760,True,1.94,1.78,Current Employee


In [196]:
df.describe(include="all")
df['Myers Briggs Type'].unique()

array(['ISTJ', 'ISFJ', 'INFP', 'ISFP', 'ESTJ', 'ESFJ', 'ESTP', 'INTP',
       'ISTP', 'ESFP', 'ENFP', 'INFJ', 'INTJ', 'ENTJ', 'ENFJ', 'ENTP'],
      dtype=object)

In [197]:
print("The columns of data are:")
list(df.columns)

The columns of data are:


['First Name',
 'Last Name',
 'Date of Birth',
 'Address',
 'Zipcode',
 'Gender',
 'Race / Ethnicity',
 'English Fluency',
 'Spanish Fluency',
 'Education',
 'High School GPA',
 'College GPA',
 'Years of Experience',
 'Years of Volunteering',
 'Myers Briggs Type',
 'Twitter followers',
 'Instagram Followers',
 'Requires Sponsorship',
 'Customer Satisfaction Rating',
 'Sales Rating',
 'Fired']

Based on the data above we need to do the following:

1. Split Myers Briggs into subtypes
1. Convert categorical columns to dummy variables
1. Calculate Age based on date of birth

The [Myers Briggs Type Indicator](https://en.wikipedia.org/wiki/Myers%E2%80%93Briggs_Type_Indicator) (MBTI) descibes people as one of two types for each of:

* extraversion (E) or introversion (I)
* sensing (S) or intuition (N)
* thinking (T) or feeling (F)
* judgment (J) or perception (P)

It would make more sense for us to represent people as one or the other of these instead of creating all the possible cases. That way a model can learn based on each of those factors as well as their combination. 

Your next task is to split the MBTI column into four columns in the dataframe:

* MBTI_EI with value `E` or `I`
* MBTI_SN with value `S` or `N`
* MBTI_TF with value `T` or `F`
* MBTI_JP with value `J` or `P`

That correspond to the same row's Myers Briggs Type.

In [0]:
# YOUR CODE HERE
df['MBTI_EI'] = df.apply(lambda row: row['Myers Briggs Type'][0:1], axis=1)
df['MBTI_SN'] = df.apply(lambda row: row['Myers Briggs Type'][1:2], axis=1)
df['MBTI_TF'] = df.apply(lambda row: row['Myers Briggs Type'][2:3], axis=1)
df['MBTI_JP'] = df.apply(lambda row: row['Myers Briggs Type'][3:4], axis=1)

In [0]:
assert len(set(df["MBTI_EI"])) == 2
assert "E" in set(df["MBTI_EI"]) and "I" in set(df["MBTI_EI"])
assert len(set(df["MBTI_SN"])) == 2
assert "S" in set(df["MBTI_SN"]) and "N" in set(df["MBTI_SN"])
assert len(set(df["MBTI_TF"])) == 2
assert "T" in set(df["MBTI_TF"]) and "F" in set(df["MBTI_TF"])
assert len(set(df["MBTI_JP"])) == 2
assert "J" in set(df["MBTI_JP"]) and "P" in set(df["MBTI_JP"])

1. ~~Split Myers Briggs into subtypes~~
1. Convert categorical columns to dummy variables
1. Calculate Age based on date of birth

Dumy variables are variables that allow us to convert a category into several binary variables. For example, if we had a color value that we were storing and we knew it could only have the values `red`, `green`, and `blue`, then instead of storing the color as those strings, we can store three binary variables: `is_red`, `is_green`, and `is_blue`. 

We can do this in pandas easily by using [`get_dummies`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html).

In [200]:
# Determine all the categorical columns and save them to categorical_columns
# Note that this does not include the binary features

# YOUR CODE HERE

categorical_columns = pd.DataFrame(data = df['MBTI_EI'])
categorical_columns['MBTI_SN'] = df['MBTI_SN']
categorical_columns['MBTI_TF'] = df['MBTI_TF']
categorical_columns['MBTI_JP'] = df['MBTI_JP']
categorical_columns ['Gender'] = df['Gender']
categorical_columns['College GPA'] = df['College GPA']
categorical_columns['High School GPA'] =  df['High School GPA']
categorical_columns ['Race / Ethnicity']=  df['Race / Ethnicity']
categorical_columns.head()


Unnamed: 0,MBTI_EI,MBTI_SN,MBTI_TF,MBTI_JP,Gender,College GPA,High School GPA,Race / Ethnicity
0,I,S,T,J,Female,2.52,3.1,Black
1,I,S,F,J,Male,3.9,3.02,Black
2,I,N,F,P,Female,2.63,2.95,Black
3,I,S,F,P,Female,3.88,3.99,Black
4,I,S,F,J,Female,3.3,3.82,Black


In [0]:
assert len(categorical_columns) > 8
for category in categorical_columns:
    assert category in df.columns

In [202]:
# For every column in the categorical features
# Calculate the dummy variables and add them to the dataframe
# YOUR CODE HERE
df['ISTJ'] = df.apply(lambda row: int(row['Myers Briggs Type']=="ISTJ"), axis=1)
df['ISFJ'] = df.apply(lambda row: int(row['Myers Briggs Type']=="ISFJ"), axis=1)
df['INFP'] = df.apply(lambda row: int(row['Myers Briggs Type']=="INFP"), axis=1)
df['ISFP'] = df.apply(lambda row: int(row['Myers Briggs Type']=="ISFP"), axis=1)
df['ESTJ'] = df.apply(lambda row: int(row['Myers Briggs Type']=="ESTJ"), axis=1)
df['ESFJ'] = df.apply(lambda row: int(row['Myers Briggs Type']=="ESFJ"), axis=1)
df['ESTP'] = df.apply(lambda row: int(row['Myers Briggs Type']=="ESTP"), axis=1)
df['INTP'] = df.apply(lambda row: int(row['Myers Briggs Type']=="INTP"), axis=1)
df['ISTP'] = df.apply(lambda row: int(row['Myers Briggs Type']=="ISTP"), axis=1)
df['ESFP'] = df.apply(lambda row: int(row['Myers Briggs Type']=="ESFP"), axis=1)
df['ENFP'] = df.apply(lambda row: int(row['Myers Briggs Type']=="ENFP"), axis=1)
df['INFJ'] = df.apply(lambda row: int(row['Myers Briggs Type']=="INFJ"), axis=1)
df['INTJ'] = df.apply(lambda row: int(row['Myers Briggs Type']=="INTJ"), axis=1)
df['ENTJ'] = df.apply(lambda row: int(row['Myers Briggs Type']=="ENTJ"), axis=1)
df['ENFJ'] = df.apply(lambda row: int(row['Myers Briggs Type']=="ENFJ"), axis=1)
df['ENTP'] = df.apply(lambda row: int(row['Myers Briggs Type']=="ENTP"), axis=1)
df['is_Black'] = df.apply(lambda row: int(row['Race / Ethnicity']=="Black"), axis=1)
df['is_Hispanic'] = df.apply(lambda row: int(row['Race / Ethnicity']=="Hispanic"), axis=1)
df['is_Caucasian'] = df.apply(lambda row: int(row['Race / Ethnicity']=="Caucasian"), axis=1)
df['is_Male'] = df.apply(lambda row: int(row['Gender']=="Male"), axis=1)
df['is_Female'] = df.apply(lambda row: int(row['Gender']=="Female"), axis=1)
df['is_Requires_Sponsorship'] = df.apply(lambda row: int(row['Requires Sponsorship']=="True"), axis=1)

categorical_columns['is_i'] = df['MBTI_EI']
categorical_columns['is_e'] = df['MBTI_EI']
df['is_i'] = df['MBTI_EI']
df['is_e'] = df['MBTI_EI']

categorical_columns['is_s'] = df['MBTI_SN']
categorical_columns['is_n'] = df['MBTI_SN']
df['is_s'] = df['MBTI_SN']
df['is_n'] = df['MBTI_SN']

categorical_columns['is_t'] = df['MBTI_TF']
categorical_columns['is_f'] = df['MBTI_TF']
df['is_t'] = df['MBTI_TF']
df['is_f'] = df['MBTI_TF']

categorical_columns['is_j'] = df['MBTI_JP']
categorical_columns['is_p'] = df['MBTI_JP']
df['is_j'] = df['MBTI_JP']
df['is_p'] = df['MBTI_JP']


df1 = pd.get_dummies(categorical_columns['is_i'])
df['is_i'] = df1[['I', 'E']]
df1 = pd.get_dummies(categorical_columns['is_e'])
df['is_e'] = df1[['E', 'I']]

df1 = pd.get_dummies(categorical_columns['is_s'])
df['is_s'] = df1[['S', 'N']]
df1 = pd.get_dummies(categorical_columns['is_n'])
df['is_n'] = df1[['N', 'S']]

df1 = pd.get_dummies(categorical_columns['is_t'])
df['is_t'] = df1[['T', 'F']]
df1 = pd.get_dummies(categorical_columns['is_f'])
df['is_f'] = df1[['F', 'T']]

df1 = pd.get_dummies(categorical_columns['is_j'])
df['is_j'] = df1[['J', 'P']]
df1 = pd.get_dummies(categorical_columns['is_p'])
df['is_p'] = df1[['P', 'J']]

print(df.head())

  First Name Last Name Date of Birth                          Address  \
0      Sarah     Chang    1989-12-24                764 Howard Tunnel   
1     Daniel    Taylor    1985-03-15  4892 Jessica Turnpike Suite 781   
2    Heather   Stewart    1993-09-20       778 Linda Orchard Apt. 609   
3  Katherine    Dillon    1986-12-22    139 Linda Crossroad Suite 115   
4      Sheri    Bolton    1991-02-24              1858 Lauren Orchard   

   Zipcode  Gender Race / Ethnicity English Fluency Spanish Fluency  \
0    30167  Female            Black          Fluent           Basic   
1    86553    Male            Black          Fluent           Basic   
2    30167  Female            Black      Proficient           Basic   
3    30167  Female            Black           Basic           Basic   
4    60531  Female            Black      Proficient      Proficient   

     Education  ...   is_Female  is_Requires_Sponsorship  is_i  is_e is_s  \
0  High School  ...           1                        0 

In [0]:
assert len(list(df.columns)) > 45

In [204]:
print("The current columns are:")
list(df.columns)

The current columns are:


['First Name',
 'Last Name',
 'Date of Birth',
 'Address',
 'Zipcode',
 'Gender',
 'Race / Ethnicity',
 'English Fluency',
 'Spanish Fluency',
 'Education',
 'High School GPA',
 'College GPA',
 'Years of Experience',
 'Years of Volunteering',
 'Myers Briggs Type',
 'Twitter followers',
 'Instagram Followers',
 'Requires Sponsorship',
 'Customer Satisfaction Rating',
 'Sales Rating',
 'Fired',
 'MBTI_EI',
 'MBTI_SN',
 'MBTI_TF',
 'MBTI_JP',
 'ISTJ',
 'ISFJ',
 'INFP',
 'ISFP',
 'ESTJ',
 'ESFJ',
 'ESTP',
 'INTP',
 'ISTP',
 'ESFP',
 'ENFP',
 'INFJ',
 'INTJ',
 'ENTJ',
 'ENFJ',
 'ENTP',
 'is_Black',
 'is_Hispanic',
 'is_Caucasian',
 'is_Male',
 'is_Female',
 'is_Requires_Sponsorship',
 'is_i',
 'is_e',
 'is_s',
 'is_n',
 'is_t',
 'is_f',
 'is_j',
 'is_p']

In [0]:
# Now let's drop all the categorical features columns from the dataframe
# So that we don't have duplicate information stored
# YOUR CODE HERE

df = df.drop(['MBTI_EI', 'MBTI_SN','MBTI_TF','MBTI_JP','College GPA', 'is_i', 'is_e', 'is_s', 'is_n', 'is_t', 'is_f', 'is_j', 'is_p','Gender','Race / Ethnicity','High School GPA'], axis=1)

In [206]:
print("The current columns are:")
list(df.columns)

The current columns are:


['First Name',
 'Last Name',
 'Date of Birth',
 'Address',
 'Zipcode',
 'English Fluency',
 'Spanish Fluency',
 'Education',
 'Years of Experience',
 'Years of Volunteering',
 'Myers Briggs Type',
 'Twitter followers',
 'Instagram Followers',
 'Requires Sponsorship',
 'Customer Satisfaction Rating',
 'Sales Rating',
 'Fired',
 'ISTJ',
 'ISFJ',
 'INFP',
 'ISFP',
 'ESTJ',
 'ESFJ',
 'ESTP',
 'INTP',
 'ISTP',
 'ESFP',
 'ENFP',
 'INFJ',
 'INTJ',
 'ENTJ',
 'ENFJ',
 'ENTP',
 'is_Black',
 'is_Hispanic',
 'is_Caucasian',
 'is_Male',
 'is_Female',
 'is_Requires_Sponsorship']

In [0]:
assert 45 > len(list(df.columns)) > 30

1. ~~Split Myers Briggs into subtypes~~
1. ~~Convert categorical columns to dummy variables~~
1. Calculate Age based on date of birth

In [0]:
def calculate_age(born):
    """Calculates age based on date of birth using https://stackoverflow.com/a/9754466/818687

    Args:
        born (datetime): The date of birth

    Returns:
        int: The age based on date of birth
    """
    today = date.today()
    return today.year - born.year - ((today.month, today.day) < (born.month, born.day))

In [0]:
# Add an "Age" column to the dataframe that calculates people's ages based on their date of birth
# YOUR CODE HERE
df["Age"] = df.apply(lambda x: calculate_age(pd.to_datetime(x["Date of Birth"])), axis=1)


In [0]:
assert df["Age"].min() == 20
assert df["Age"].max() == 36
assert df["Age"].median() == 28

## Modelling

Based on your understanding of the data, select the features that you want to use to predict:

1. Customer Satisfaction
1. Sales Performance
1. Fired

In [0]:
# Save the columns we are trying to predict to targets
# Make sure that if we had a categorical column, that you use the dummy representation(s)
# YOUR CODE HERE
targets =  pd.DataFrame()
targets['Customer Satisfaction Rating'] = [0,0,0]
targets["Sales Rating"] = [0,0,0]
targets["Fired"] = [0,0,0]


In [0]:
assert len(targets) == 3
for target in targets:
    assert target in df.columns

Your prediction will be used to rank applicants for interviews with HR. **Which features will you select to use in your model?**

In [213]:
print("The available columns are:")
list(df)

The available columns are:


['First Name',
 'Last Name',
 'Date of Birth',
 'Address',
 'Zipcode',
 'English Fluency',
 'Spanish Fluency',
 'Education',
 'Years of Experience',
 'Years of Volunteering',
 'Myers Briggs Type',
 'Twitter followers',
 'Instagram Followers',
 'Requires Sponsorship',
 'Customer Satisfaction Rating',
 'Sales Rating',
 'Fired',
 'ISTJ',
 'ISFJ',
 'INFP',
 'ISFP',
 'ESTJ',
 'ESFJ',
 'ESTP',
 'INTP',
 'ISTP',
 'ESFP',
 'ENFP',
 'INFJ',
 'INTJ',
 'ENTJ',
 'ENFJ',
 'ENTP',
 'is_Black',
 'is_Hispanic',
 'is_Caucasian',
 'is_Male',
 'is_Female',
 'is_Requires_Sponsorship',
 'Age']

In [0]:
# Enter all the features you want to use in a list and save it to rank_features
# These are the features for the model that will rank applicants for interviews
# YOUR CODE HERE
rank_features = ['College GPA', 'High School GPA','Years of Experience', 'Years of Volunteering', 'ISTJ','ISFJ','INFP','ISFP','ESTJ','ESFJ','ESTP','INTP','ISTP','ESFP', 'ENFP','INFJ','INTJ','ENTJ','ENFJ','ENTP','Age','is_Black',
 'is_Hispanic', 'is_Caucasian', 'is_Male', 'is_Female', 'is_Requires_Sponsorship']

Why did you choose the features you did?

In [0]:
## Save your reasoning in a string to the variable ranking_reason

# YOUR CODE HERE
ranking_reason = "I chose these features because this is the ideal employee. Experience matters because you would like a worker who is mature to handle difficult situation. Additionall, College GPA and High School GPA also matters to quantify a person's hard work and overall intellect. Gender may play a role in specific fields as well."

In [0]:
assert isinstance(ranking_reason, str)
assert len(ranking_reason) > 20

In [228]:
# Perform a train and test split on the data with the variable names:
# rank_x_train for the training features
# rank_x_test for the testing features
# rank_y_train for the training targets
# rank_y_test for the testing targets
# The test dataset should be 20% of the total dataset

# YOUR CODE HERE

new_df = pd.DataFrame(data = df)
new_df=new_df.dropna()

rankFeatures =  new_df[['Instagram Followers', 'Years of Experience', 'Years of Volunteering', 'ISTJ','ISFJ','INFP','ISFP','ESTJ','ESFJ','ESTP','INTP','ISTP','ESFP', 'ENFP','INFJ','INTJ','ENTJ','ENFJ','ENTP','Age','is_Black','is_Hispanic', 'is_Caucasian', 'is_Male', 'is_Female', 'is_Requires_Sponsorship','Twitter followers']]
targets=new_df[[ 'Fired']]
rank_x_train, rank_x_test, rank_y_train, rank_y_test = train_test_split(rankFeatures,targets,test_size=0.2)

print(df["Fired"].head())

0    Current Employee
1    Current Employee
2    Current Employee
3    Current Employee
4    Current Employee
Name: Fired, dtype: object


In [0]:
assert (len(rank_x_train) / (len(rank_x_test) + len(rank_x_train))) == 0.8
assert (len(rank_y_train) / (len(rank_y_test) + len(rank_y_train))) == 0.8
assert len(rank_x_train) == len(rank_y_train)
assert len(rank_x_test) == len(rank_y_test)

In [0]:
# Select models of your choosing, import them and perform a parameter search to train them on each of the targets
# Determine an appropriate metric for measuring your performance and report that.

# YOUR CODE HERE
from sklearn.neighbors import KNeighborsClassifier
ks = [1,2,3,4,5,6,7,8,9,10,15,18,19,20,21,22,30,32,33,34,45,48,50,100]
def get_knn_validation_scores(ks, model_features, model_labels, validation_features, validation_labels):
    d= dict()
    for k in ks:
      knn = KNeighborsClassifier(n_neighbors=k)
      knn.fit(validation_features, validation_labels)
      validationPredictions = knn.predict(validation_features)
      f1 = f1_score(validation_labels, validationPredictions, average="weighted")
      d[k] = f1 
    return d
  
def get_knn_training_scores(ks, model_features, model_labels):
    
    d= dict()
    for k in ks:
      knn = KNeighborsClassifier(n_neighbors=k)
      knn.fit(model_features, model_labels)
      validationPredictions = knn.predict(model_features)
      f1 = f1_score(model_labels, validationPredictions, average="weighted")
      d[k] = f1 
    return d


In [230]:
from sklearn import preprocessing
from sklearn.metrics import classification_report, confusion_matrix, f1_score

lab_enc = preprocessing.LabelEncoder()
rank_y_train = lab_enc.fit_transform(rank_y_train)


rank_y_test = lab_enc.fit_transform(rank_y_test)
training_scores = get_knn_training_scores(ks, rank_x_train, rank_y_train)
validation_scores = get_knn_validation_scores(ks, rank_x_train, rank_y_train, rank_x_test, rank_y_test)


  'precision', 'predicted', average, warn_for)


Would your feature choice change if HR was going to use your model to directly hire applicants without an interview? **Which features will you select to use in that model?**

In [232]:
scores = []
from sklearn import metrics,preprocessing,datasets,tree
from sklearn.metrics import accuracy_score, classification_report
from sklearn.neighbors import KNeighborsClassifier


clf5= KNeighborsClassifier(n_neighbors=20)
clf5.fit(rank_x_train, rank_y_train)
y_pred = clf5.predict(rank_x_test)
scores = metrics.accuracy_score(rank_y_test, y_pred)
scores = np.array(scores)
print('mean: ',scores.max())
print('avg: ',scores.mean())

print(clf5)
conf=confusion_matrix(rank_y_test,y_pred)
print(confusion_matrix(rank_y_test,y_pred))
print(classification_report(rank_y_test,y_pred))

('mean: ', 0.93)
('avg: ', 0.93)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=20, p=2,
           weights='uniform')
[[372   0]
 [ 28   0]]
              precision    recall  f1-score   support

           0       0.93      1.00      0.96       372
           1       0.00      0.00      0.00        28

   micro avg       0.93      0.93      0.93       400
   macro avg       0.47      0.50      0.48       400
weighted avg       0.86      0.93      0.90       400



  'precision', 'predicted', average, warn_for)


In [231]:
print("The available columns are:")
list(df)

The available columns are:


['First Name',
 'Last Name',
 'Date of Birth',
 'Address',
 'Zipcode',
 'English Fluency',
 'Spanish Fluency',
 'Education',
 'Years of Experience',
 'Years of Volunteering',
 'Myers Briggs Type',
 'Twitter followers',
 'Instagram Followers',
 'Requires Sponsorship',
 'Customer Satisfaction Rating',
 'Sales Rating',
 'Fired',
 'ISTJ',
 'ISFJ',
 'INFP',
 'ISFP',
 'ESTJ',
 'ESFJ',
 'ESTP',
 'INTP',
 'ISTP',
 'ESFP',
 'ENFP',
 'INFJ',
 'INTJ',
 'ENTJ',
 'ENFJ',
 'ENTP',
 'is_Black',
 'is_Hispanic',
 'is_Caucasian',
 'is_Male',
 'is_Female',
 'is_Requires_Sponsorship',
 'Age']

In [0]:

targets=new_df[[ 'Customer Satisfaction Rating']]
rank_x_train, rank_x_test, rank_y_train, rank_y_test = train_test_split(rankFeatures,targets,test_size=0.2)


In [236]:
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
regr = linear_model.LinearRegression()
regr.fit(rank_x_train, rank_y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

np.mean((regr.predict(rank_x_test) - rank_y_test)**2)
regr.score(rank_x_test, rank_y_test) 


0.9809033506638087

In [237]:
targets=new_df[[ 'Sales Rating']]
rank_x_train, rank_x_test, rank_y_train, rank_y_test = train_test_split(rankFeatures,targets,test_size=0.2)

from sklearn import linear_model
from sklearn.linear_model import LinearRegression
regr = linear_model.LinearRegression()
regr.fit(rank_x_train, rank_y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

np.mean((regr.predict(rank_x_test) - rank_y_test)**2)
regr.score(rank_x_test, rank_y_test) 




0.9612807769704205

(1)  For the Costomer Satisfaction Rating, I trained the model using Linear Regression. The resulting mean square is 0.98

(2)For the Sales Rating, I trained the model using Linear Regression. The resulting mean square is 0.96

(3) For the Fired I trained the model using the K-Nearest Neighbor with a score of 0.93

In [0]:
# Enter all the features you want to use in a list and save it to selection_features
# These are the features for the model that will directly hire the top applicants
# YOUR CODE HERE
selection_features = new_df[[ 'Years of Experience', 'Years of Volunteering', 'Education']]

Why did you choose the features you did?

In [0]:
## Save your reasoning in a string to the variable selection_reason

# YOUR CODE HERE
selection_reason = 'Employees who has highest years of experience and volunteering typically has a better experience dealing with technical and emotional challanges when dealing with customer. Education is important because it defines a trained background of an employee.'

In [0]:
assert isinstance(selection_reason, str)
assert len(selection_reason) > 20

Why was your choice different from or the same as the ranking features?


In [0]:
# Save your reasoning in a string to the variable
# same_reason if the features are the same
# different_reason if the features are different
# YOUR CODE HERE
different_reason = "I removed gender because I believe a person's gender does not determine their success in a field."

In [248]:
if all([rf in selection_features for rf in rank_features]) and all([sf in rank_features for sf in selection_features]):
    print("Your features for ranking and selection are the same.")
    assert isinstance(same_reason, str)
    assert len(same_reason) > 20
else:
    print("Your features for ranking and selection are different.")
    assert isinstance(different_reason, str)
    assert len(different_reason) > 20

Your features for ranking and selection are different.


In [251]:
# Perform a train and test split on the data with the variable names:
# selection_x_train for the training features
# selection_x_test for the testing features
# selection_y_train for the training targets
# selection_y_test for the testing targets
# The test dataset should be 20% of the total dataset

# YOUR CODE HERE
targets=new_df[['Sales Rating']]
selection_x_train, selection_x_test, selection_y_train, selection_y_test = train_test_split(selection_features,targets,test_size=0.2)
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
regr = linear_model.LinearRegression()
regr.fit(selection_x_train, selection_y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
np.mean((regr.predict(selection_x_test) - selection_y_test)**2)
regr.score(selection_x_test, selection_y_test) 


0.967774323141


In [0]:
assert (len(selection_x_train) / (len(selection_x_test) + len(selection_x_train))) == 0.8
assert (len(selection_y_train) / (len(selection_y_test) + len(selection_y_train))) == 0.8
assert len(selection_x_train) == len(selection_y_train)
assert len(selection_x_test) == len(selection_y_test)

Now let's see if the model performs differently.

In [254]:
# Select models of your choosing, import them and perform a parameter search to train them on each of the targets
# Determine an appropriate metric for measuring your performance and report that.

# YOUR CODE HERE
targets=new_df[[ 'Customer Satisfaction Rating']]
selection_x_train, selection_x_test, selection_y_train, selection_y_test = train_test_split(selection_features,targets,test_size=0.2)
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
regr = linear_model.LinearRegression()
regr.fit(selection_x_train, selection_y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
np.mean((regr.predict(selection_x_test) - selection_y_test)**2)
regr.score(selection_x_test, selection_y_test) 


0.944157445537


In [257]:
targets=new_df[[ 'Fired']]

selection_x_train, selection_x_test, selection_y_train, selection_y_test = train_test_split(rankFeatures,targets,test_size=0.2)

lab_enc = preprocessing.LabelEncoder()
selection_y_train = lab_enc.fit_transform(selection_y_train)

selection_y_test = lab_enc.fit_transform(selection_y_test)
training_scores = get_knn_training_scores(ks, selection_x_train, selection_y_train)
validation_scores = get_knn_validation_scores(ks, selection_x_train, selection_y_train, selection_x_test, selection_y_test)


scores = []
from sklearn import metrics,preprocessing,datasets,tree
from sklearn.metrics import accuracy_score, classification_report
from sklearn.neighbors import KNeighborsClassifier


clf6= KNeighborsClassifier(n_neighbors=10)
clf6.fit(selection_x_train, selection_y_train)
y_pred = clf6.predict(selection_x_test)
scores = metrics.accuracy_score(selection_y_test, y_pred)
scores = np.array(scores)
print('mean: ',scores.max())
print('avg: ',scores.mean())
conf=confusion_matrix(selection_y_test,y_pred)
print(confusion_matrix(selection_y_test,y_pred))
print(classification_report(selection_y_test,y_pred))



('mean: ', 0.9275)
('avg: ', 0.9275)
[[371   0]
 [ 29   0]]
              precision    recall  f1-score   support

           0       0.93      1.00      0.96       371
           1       0.00      0.00      0.00        29

   micro avg       0.93      0.93      0.93       400
   macro avg       0.46      0.50      0.48       400
weighted avg       0.86      0.93      0.89       400





In [0]:
# Follow this up with a comparison between the performance on your 2 models using the different features.
# You should print something like
# Using rank features for target (target) the model scored (score)
# versus using the selection features where it scored (score)
# YOUR CODE HERE



(1)  For the Costomer Satisfaction Rating, I trained the model using Linear Regression. The resulting mean square is 0.944
​
(2)For the Sales Rating, I trained the model using Linear Regression. The resulting mean square is 0.9677
​
(3) For the Fired I trained the model using the K-Nearest Neighbor with a score of 0.9275

## Feedback

In [259]:
#@title
def feedback():
    """Provide feedback on the contents of this exercise
    
    Returns:
        string
    """
    # YOUR CODE HERE
print('This assignment is too long to do for just 1 week right before finals...')

This assignment is too long to do for just 1 week right before finals...


In [0]:
#@title
