## **PREDICTING SALARIES FOR EMPLOYEES**

In this exercise, I analyse the `hiring.csv` dataset. The file contains hiring statistics for a firm: each candidate's experience, written test score, and personal interview score. HR decides the salary based on these three factors. The goal is to build a machine learning model that helps the HR department decide salaries for future candidates, and then use it to predict salaries for the following candidates:

`2 yr experience, 9 test score, 6 interview score`

`12 yr experience, 10 test score, 10 interview score`

In [2]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import math
from sklearn.linear_model import LinearRegression
from word2number import w2n

In [3]:
hiring = pd.read_csv('hiring.csv')
hiring

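| | experience | test_score(out of 10) | interview_score(out of 10) | salary($) |
|---|---|---|---|---|
| 0 | NaN | 8.0 | 9 | 50000 |
| 1 | NaN | 8.0 | 6 | 45000 |
| 2 | five | 6.0 | 7 | 60000 |
| 3 | two | 10.0 | 10 | 65000 |
| 4 | seven | 9.0 | 6 | 70000 |
| 5 | three | 7.0 | 10 | 62000 |
| 6 | ten | NaN | 7 | 72000 |
| 7 | eleven | 7.0 | 8 | 80000 |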


In [4]:
hiring.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 4 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   experience                  6 non-null      object 
 1   test_score(out of 10)       7 non-null      float64
 2   interview_score(out of 10)  8 non-null      int64  
 3   salary($)                   8 non-null      int64  
dtypes: float64(1), int64(2), object(1)
memory usage: 384.0+ bytes


I have loaded the dataset and, as observed, two columns have missing values. <br> The experience column also contains number words, which I have to convert to numeric digits.

In [5]:
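# Convert number words (e.g. "five") to digit strings (e.g. "5")
def convert_word_to_digits(change_word):
    try:
        return str(w2n.word_to_num(change_word))
    except ValueError:
        # Anything word2number cannot parse, including missing values, is returned unchanged
        return change_word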

# Apply the conversion function to the "experience" column
hiring['experience'] = hiring['experience'].apply(convert_word_to_digits)
hiring


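| | experience | test_score(out of 10) | interview_score(out of 10) | salary($) |
|---|---|---|---|---|
| 0 | NaN | 8.0 | 9 | 50000 |
| 1 | NaN | 8.0 | 6 | 45000 |
| 2 | 5.0 | 6.0 | 7 | 60000 |
| 3 | 2.0 | 10.0 | 10 | 65000 |
| 4 | 7.0 | 9.0 | 6 | 70000 |
| 5 | 3.0 | 7.0 | 10 | 62000 |
| 6 | 10.0 | NaN | 7 | 72000 |
| 7 | 11.0 | 7.0 | 8 | 80000 |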
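
The helper above returns strings, so the column keeps the `object` dtype until it is cast later. As a hedged alternative, the sketch below maps the words straight to numbers. `WORD_TO_NUM` is a hypothetical lookup table that stands in for `word2number` so the example runs without that package; it covers only the words that occur in this dataset.

```python
import pandas as pd

# Hypothetical lookup table standing in for word2number (assumption):
# it only covers the number words that appear in this dataset.
WORD_TO_NUM = {"two": 2, "three": 3, "five": 5, "seven": 7, "ten": 10, "eleven": 11}

# The raw experience column, with the two missing entries as None
experience = pd.Series([None, None, "five", "two", "seven", "three", "ten", "eleven"])

# Series.map looks each value up in the dict; values with no match
# (here, the missing entries) become NaN, so the result is numeric.
experience_numeric = experience.map(WORD_TO_NUM)

print(experience_numeric.tolist())
print(experience_numeric.dtype)
```

With this approach the missing entries stay `NaN`, so `fillna` and a later integer cast work the same way as in the notebook.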


I have converted the experience column to numeric digits. <br> Now, I'll deal with the missing values.

In [6]:
hiring.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 4 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   experience                  6 non-null      object 
 1   test_score(out of 10)       7 non-null      float64
 2   interview_score(out of 10)  8 non-null      int64  
 3   salary($)                   8 non-null      int64  
dtypes: float64(1), int64(2), object(1)
memory usage: 384.0+ bytes


In [7]:
hiring.isna().sum()

experience                    2
test_score(out of 10)         1
interview_score(out of 10)    0
salary($)                     0
dtype: int64

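Only two columns have missing values: `experience` (2 missing) and `test_score(out of 10)` (1 missing). I am going to fill them: `experience` with 0 and the test score with the column median.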

In [8]:
# Fill the experience column with 0
hiring['experience'] = hiring['experience'].fillna(0)

# Fill the test_score column with the median
test_median = hiring['test_score(out of 10)'].median()
hiring['test_score(out of 10)'] = hiring['test_score(out of 10)'].fillna(test_median)
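
To make the median fill concrete, here is a small standalone sketch that uses only the test scores from the table. It assumes nothing beyond pandas; `test_scores` and `filled` are names introduced for illustration.

```python
import pandas as pd

# Test scores from the dataset; the missing score (row 6) is None
test_scores = pd.Series([8.0, 8.0, 6.0, 10.0, 9.0, 7.0, None, 7.0])

# Series.median skips NaN by default, so it is computed over the
# seven known scores: 6, 7, 7, 8, 8, 9, 10
test_median = test_scores.median()

# Replace the missing score with that median
filled = test_scores.fillna(test_median)

print(test_median)
print(filled.tolist())
```

The middle value of the seven sorted scores is 8, which is why row 6 shows a test score of 8.0 after the fill.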

In [9]:
hiring

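| | experience | test_score(out of 10) | interview_score(out of 10) | salary($) |
|---|---|---|---|---|
| 0 | 0 | 8.0 | 9 | 50000 |
| 1 | 0 | 8.0 | 6 | 45000 |
| 2 | 5 | 6.0 | 7 | 60000 |
| 3 | 2 | 10.0 | 10 | 65000 |
| 4 | 7 | 9.0 | 6 | 70000 |
| 5 | 3 | 7.0 | 10 | 62000 |
| 6 | 10 | 8.0 | 7 | 72000 |
| 7 | 11 | 7.0 | 8 | 80000 |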


In [10]:
hiring.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 4 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   experience                  8 non-null      object 
 1   test_score(out of 10)       8 non-null      float64
 2   interview_score(out of 10)  8 non-null      int64  
 3   salary($)                   8 non-null      int64  
dtypes: float64(1), int64(2), object(1)
memory usage: 384.0+ bytes


In [11]:
# Change the data type of the experience column from object to integer
hiring['experience'] = hiring['experience'].astype(int)

In [12]:
hiring.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 4 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   experience                  8 non-null      int32  
 1   test_score(out of 10)       8 non-null      float64
 2   interview_score(out of 10)  8 non-null      int64  
 3   salary($)                   8 non-null      int64  
dtypes: float64(1), int32(1), int64(2)
memory usage: 352.0 bytes


In [13]:
hiring

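| | experience | test_score(out of 10) | interview_score(out of 10) | salary($) |
|---|---|---|---|---|
| 0 | 0 | 8.0 | 9 | 50000 |
| 1 | 0 | 8.0 | 6 | 45000 |
| 2 | 5 | 6.0 | 7 | 60000 |
| 3 | 2 | 10.0 | 10 | 65000 |
| 4 | 7 | 9.0 | 6 | 70000 |
| 5 | 3 | 7.0 | 10 | 62000 |
| 6 | 10 | 8.0 | 7 | 72000 |
| 7 | 11 | 7.0 | 8 | 80000 |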


Now the dataset is clean and ready for modeling.

### **MODELING**

In [14]:
linreg = LinearRegression()
linreg.fit(hiring[['experience','test_score(out of 10)','interview_score(out of 10)']], hiring['salary($)'])

LinearRegression()
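
A fitted linear regression is a weighted sum: salary = intercept + (coefficient × feature) summed over the three features. The sketch below is one way to inspect those weights. It rebuilds the cleaned table so it can run on its own; `features`, `candidate`, `manual`, and `from_model` are names introduced here for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Cleaned data, copied from the table above
features = ["experience", "test_score(out of 10)", "interview_score(out of 10)"]
hiring = pd.DataFrame({
    "experience": [0, 0, 5, 2, 7, 3, 10, 11],
    "test_score(out of 10)": [8.0, 8.0, 6.0, 10.0, 9.0, 7.0, 8.0, 7.0],
    "interview_score(out of 10)": [9, 6, 7, 10, 6, 10, 7, 8],
    "salary($)": [50000, 45000, 60000, 65000, 70000, 62000, 72000, 80000],
})

linreg = LinearRegression().fit(hiring[features], hiring["salary($)"])

# One learned weight per feature, plus the intercept
for name, coef in zip(features, linreg.coef_):
    print(f"{name}: {coef:.2f}")
print(f"intercept: {linreg.intercept_:.2f}")

# Rebuild one prediction by hand and compare it with predict()
candidate = np.array([2, 9, 6])
manual = linreg.intercept_ + candidate @ linreg.coef_
from_model = linreg.predict(pd.DataFrame([candidate], columns=features))[0]
print(manual, from_model)
```

The two printed values should agree up to floating-point error, which confirms that `predict` is just this weighted sum.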

In [15]:
linreg.predict([[2, 9, 6]])

array([53205.96797671])

#### For 2 yr experience, 9 test score, 6 interview score, the predicted salary is about **$53,206**

In [16]:
linreg.predict([[12, 10, 10]])

array([92002.18340611])

#### For 12 yr experience, 10 test score, 10 interview score, the predicted salary is about **$92,002**
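
Both candidates can also be scored in a single call. The sketch below passes a DataFrame with the same column names used for fitting, which avoids the feature-name warning scikit-learn raises when a model fitted on named columns receives a plain list. The `candidates` frame and the `predicted_salary($)` column are names assumed for this example.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Cleaned data, copied from the final table in the cleaning section
features = ["experience", "test_score(out of 10)", "interview_score(out of 10)"]
hiring = pd.DataFrame({
    "experience": [0, 0, 5, 2, 7, 3, 10, 11],
    "test_score(out of 10)": [8.0, 8.0, 6.0, 10.0, 9.0, 7.0, 8.0, 7.0],
    "interview_score(out of 10)": [9, 6, 7, 10, 6, 10, 7, 8],
    "salary($)": [50000, 45000, 60000, 65000, 70000, 62000, 72000, 80000],
})
linreg = LinearRegression().fit(hiring[features], hiring["salary($)"])

# One row per candidate, with the same column names used for fitting
candidates = pd.DataFrame([[2, 9, 6], [12, 10, 10]], columns=features)

# Predict both salaries at once and round to cents
candidates["predicted_salary($)"] = linreg.predict(candidates[features]).round(2)
print(candidates)
```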

### **Model Persistence**

In [18]:
# Using Pickle

import pickle

# Dump the model to a file
with open('hiring_pickle', 'wb') as f:
    pickle.dump(linreg, f)


#Load 
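
A minimal sketch of a full save-and-load round trip with `pickle`. This is an illustration under stated assumptions, not the original cell: it fits a tiny stand-in model on made-up data and writes to a temporary directory so it can run on its own. With the notebook's `linreg` and the `hiring_pickle` file, only the `open(..., 'rb')` and `pickle.load` lines would be needed.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in model fitted on made-up data (assumption), so the sketch is self-contained
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
model = LinearRegression().fit(X, y)

# Write the pickle into a temporary directory instead of the working directory
path = os.path.join(tempfile.mkdtemp(), "hiring_pickle")

# Dump the fitted model to disk
with open(path, "wb") as f:
    pickle.dump(model, f)

# Load it back; note the 'rb' (read binary) mode
with open(path, "rb") as f:
    loaded = pickle.load(f)

# The restored model should give the same predictions as the original
same = bool(np.allclose(model.predict(X), loaded.predict(X)))
print(same)
```

Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code.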