### AccelerateAI - Model Selection and Deployment

***Problem Statement***
<br>
The HR function at Unitech Pvt Ltd wants to create a system that predicts the annual salaries of employees from the potential explanatory variables in the historical data file MLR_Q13_EmpSalary.csv.

1) Estimate the appropriate multiple linear regression equation to estimate the salary of a Unitech employee using all explanatory variables.<br>
2) What would be the estimated salary of a Sr. Data Scientist (joining engineering) with 10 years of work experience? This woman has 18 years of total education and will be supervising 4 junior employees.<br>
3) Train and save a model as pickle file for deployment on Cloud, and make it accessible to all HR recruiters.

In [2]:
# required libraries 
import numpy as np                  # math calculations
import pandas as pd                 # loading data
import statsmodels.api as sm        # modeling 
import seaborn as sbn               # visualization

In [3]:
# Read the EmpSalary dataset 
salary_df = pd.read_csv("MLR_Q13_EmpSalary.csv")       
salary_df.head(5) 

Unnamed: 0,Employee,Salary,PreviousExp,YearsEmployed,YearsEducation,DirectRepotees,Female,Male,Engineering,Sales,Other
0,1,"$65,487",0,27,22,44,0,1,1,0,0
1,2,"$46,184",3,20,14,1,1,0,1,0,0
2,3,"$32,782",1,0,17,0,1,0,0,1,0
3,4,"$54,899",5,12,18,0,0,1,1,0,0
4,5,"$34,869",5,7,14,1,0,1,1,0,0


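Male/Female and Engineering/Sales/Other are dummy-encoded variables.
We drop the identifier column Employee and one column from each dummy group (Female and Other), because keeping every dummy alongside the intercept would make the columns perfectly collinear. Salary is the target, so it is excluded from X as well.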
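
To see why one column per dummy group has to go, the sketch below builds a tiny design matrix with a constant plus both Female and Male columns. The rows are hypothetical toy values, not the EmpSalary data. Because Female + Male equals the constant in every row, the three columns carry only two independent directions; dropping one dummy leaves a matrix with full column rank.

```python
import numpy as np

# Hypothetical toy rows: [const, Female, Male]; Female + Male == const in every row
full = np.array([
    [1, 0, 1],
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 1],
])

# Same rows with the Female column dropped: [const, Male]
reduced = full[:, [0, 2]]

rank_full = np.linalg.matrix_rank(full)        # 3 columns, only 2 independent
rank_reduced = np.linalg.matrix_rank(reduced)  # 2 columns, both independent

print("columns:", full.shape[1], "rank:", rank_full)
print("columns:", reduced.shape[1], "rank:", rank_reduced)
```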

In [4]:
X = salary_df.drop(["Employee", "Salary","Female", "Other"], axis=1)

In [5]:
X.head()

Unnamed: 0,PreviousExp,YearsEmployed,YearsEducation,DirectRepotees,Male,Engineering,Sales
0,0,27,22,44,1,1,0
1,3,20,14,1,0,1,0
2,1,0,17,0,0,0,1
3,5,12,18,0,1,1,0
4,5,7,14,1,1,1,0


In [6]:
# Convert Salary from strings such as "$65,487" to integers
salary_df["Salary"] = salary_df["Salary"].apply(lambda x: int(x.replace('$','').replace(',',''))) 

In [7]:
# Check for correlation among X variables
X.corr()

Unnamed: 0,PreviousExp,YearsEmployed,YearsEducation,DirectRepotees,Male,Engineering,Sales
PreviousExp,1.0,0.031277,0.080169,0.216198,-0.217145,-0.032948,0.156045
YearsEmployed,0.031277,1.0,0.607486,0.345444,-0.209393,0.076349,0.033222
YearsEducation,0.080169,0.607486,1.0,0.504609,-0.192692,0.10304,-0.012239
DirectRepotees,0.216198,0.345444,0.504609,1.0,-0.100337,0.178719,-0.083201
Male,-0.217145,-0.209393,-0.192692,-0.100337,1.0,-0.003799,-0.082572
Engineering,-0.032948,0.076349,0.10304,0.178719,-0.003799,1.0,-0.483046
Sales,0.156045,0.033222,-0.012239,-0.083201,-0.082572,-0.483046,1.0


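None seem to be **strongly correlated**; the largest pairwise correlation is 0.61, between YearsEmployed and YearsEducation. Pairwise correlations can miss dependence that involves several variables at once, so next we check for multicollinearity using variance inflation factors (VIF).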

In [8]:
# Check for multicollinearity
from statsmodels.stats.outliers_influence import variance_inflation_factor
pd.Series([variance_inflation_factor(X.values, i) 
               for i in range(X.shape[1])], 
               index=X.columns)

PreviousExp       2.756377
YearsEmployed     4.320725
YearsEducation    9.802773
DirectRepotees    1.620989
Male              1.923437
Engineering       2.401963
Sales             1.704261
dtype: float64

**YearsEducation** has the highest VIF (9.8), close to the common rule-of-thumb cutoff of 10, which indicates high multicollinearity. So we drop that variable from the model.
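
One caveat worth checking: `variance_inflation_factor` regresses each column on the others exactly as they are passed in, so with no constant column in the design matrix the values can come out larger than under the usual intercept-based definition. The sketch below computes the textbook VIF, 1 / (1 - R²), with an intercept in each auxiliary regression, using NumPy only. The helper name `vif_with_intercept` and the synthetic columns are assumptions for illustration, not part of the original analysis.

```python
import numpy as np

def vif_with_intercept(X):
    """VIF_j = 1 / (1 - R2_j), where R2_j comes from regressing column j
    on the remaining columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])   # add the intercept column
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        ss_res = resid @ resid
        ss_tot = ((y - y.mean()) ** 2).sum()
        r2 = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Hypothetical data: b is nearly a copy of a, c is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + 0.1 * rng.normal(size=200)
c = rng.normal(size=200)

vifs = vif_with_intercept(np.column_stack([a, b, c]))
print(dict(zip("abc", vifs.round(2))))
```

In this toy setup the two near-duplicate columns get large VIFs while the independent column stays close to 1. Running the same helper on `X.values` would show whether YearsEducation still stands out once an intercept is included.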

In [9]:
X.drop("YearsEducation", axis=1, inplace=True)
X.head()

Unnamed: 0,PreviousExp,YearsEmployed,DirectRepotees,Male,Engineering,Sales
0,0,27,44,1,1,0
1,3,20,1,0,1,0
2,1,0,0,0,0,1
3,5,12,0,1,1,0
4,5,7,1,1,1,0


In [10]:
# Fit an OLS model
Y = salary_df["Salary"]
X1 = sm.add_constant(X)

model1 = sm.OLS(Y, X1).fit()
print(model1.summary())

                            OLS Regression Results                            
Dep. Variable:                 Salary   R-squared:                       0.756
Model:                            OLS   Adj. R-squared:                  0.718
Method:                 Least Squares   F-statistic:                     20.11
Date:                Sat, 10 Sep 2022   Prob (F-statistic):           1.50e-10
Time:                        12:04:27   Log-Likelihood:                -460.40
No. Observations:                  46   AIC:                             934.8
Df Residuals:                      39   BIC:                             947.6
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
const           3.087e+04   2499.928     12.

**PreviousExp** has the highest p-value. We will remove it and retrain the model. 

In [11]:
# Refit the OLS model without PreviousExp
Y = salary_df["Salary"]
X2 = X1.drop(columns="PreviousExp")

model2 = sm.OLS(Y, X2).fit()
print(model2.summary())

                            OLS Regression Results                            
Dep. Variable:                 Salary   R-squared:                       0.755
Model:                            OLS   Adj. R-squared:                  0.724
Method:                 Least Squares   F-statistic:                     24.65
Date:                Sat, 10 Sep 2022   Prob (F-statistic):           3.01e-11
Time:                        12:04:27   Log-Likelihood:                -460.47
No. Observations:                  46   AIC:                             932.9
Df Residuals:                      40   BIC:                             943.9
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
const           3.044e+04   2151.287     14.

**Engineering** has a high p-value. We will remove it and retrain the model.

In [12]:
# Refit the OLS model without Engineering
Y = salary_df["Salary"]
X3 = X2.drop(columns="Engineering")

model3 = sm.OLS(Y, X3).fit()
print(model3.summary())

                            OLS Regression Results                            
Dep. Variable:                 Salary   R-squared:                       0.752
Model:                            OLS   Adj. R-squared:                  0.728
Method:                 Least Squares   F-statistic:                     31.12
Date:                Sat, 10 Sep 2022   Prob (F-statistic):           6.20e-12
Time:                        12:04:27   Log-Likelihood:                -460.73
No. Observations:                  46   AIC:                             931.5
Df Residuals:                      41   BIC:                             940.6
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
const           3.113e+04   1868.963     16.

**Male** has a high p-value. We will remove it and retrain the model.

In [13]:
# Refit the OLS model without Male
Y = salary_df["Salary"]
X4 = X3.drop(columns="Male")

model4 = sm.OLS(Y, X4).fit()
print(model4.summary())

                            OLS Regression Results                            
Dep. Variable:                 Salary   R-squared:                       0.741
Model:                            OLS   Adj. R-squared:                  0.723
Method:                 Least Squares   F-statistic:                     40.10
Date:                Sat, 10 Sep 2022   Prob (F-statistic):           2.15e-12
Time:                        12:04:27   Log-Likelihood:                -461.73
No. Observations:                  46   AIC:                             931.5
Df Residuals:                      42   BIC:                             938.8
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
const           2.962e+04   1513.081     19.

The remaining variables YearsEmployed, DirectRepotees and Sales are all significant, so model4 is the final model.

1) Estimate the appropriate multiple linear regression equation to predict the salary of a Unitech employee using all explanatory variables.<br>
    **Regression Eqn: Salary = 29620 + 988.7 * YearsEmployed + 284.5 * DirectRepotees - 7464 * Sales**

2) What would be the estimated salary of a Sr. Data Scientist (joining engineering) with 10 years of work experience? This woman has 18 years of total education and will be supervising 4 junior employees.

In [14]:
new_emp = {
    'const' : 1,
    'YearsEmployed' :10,
    'DirectRepotees' : 4,
    'Sales'  : 0 
    }
  
x = pd.DataFrame(new_emp, index=[0])

predicted_sal = model4.predict(x)

print("Predicted Salary:$",predicted_sal[0].round(1))

Predicted Salary:$ 40645.2
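
The prediction can be checked by hand from the fitted coefficients. The sketch below uses the intercept from the model4 summary (const = 2.962e+04) and the rounded slopes from the regression equation, so the result agrees with the model output only up to rounding. The variable names are illustrative.

```python
# Coefficients as reported for model4 (rounded)
intercept = 29620.0          # const = 2.962e+04
b_years_employed = 988.7
b_direct_reports = 284.5
b_sales = -7464.0

# New hire: 10 years, 4 direct reports, not in Sales
salary = (intercept
          + b_years_employed * 10
          + b_direct_reports * 4
          + b_sales * 0)

print(round(salary, 1))
```

The hand calculation gives about 40,645, in line with the 40645.2 returned by `model4.predict`.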


### 3)  Train and save a model as pickle file for deployment on Cloud.

In [16]:
# Fitting the Multiple Linear Regression in the Training set
from sklearn.linear_model import LinearRegression

X_Train = salary_df[["YearsEmployed", "DirectRepotees", "Sales"]]
Y_Train = salary_df["Salary"]

regressor = LinearRegression()
lr_model = regressor.fit(X_Train, Y_Train)

In [18]:
# Save the model as pkl file
import pickle

# open lrmodel.pkl for binary writing and serialize the fitted model
with open('lrmodel.pkl', 'wb') as f:
    pickle.dump(lr_model, f)
    
# load saved model
with open('lrmodel.pkl' , 'rb') as f:
    saved_model = pickle.load(f)

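# check prediction for YearsEmployed = 10, DirectRepotees = 4, Sales = 0
# (a DataFrame with the training column names avoids sklearn's feature-name warning)
x = pd.DataFrame([[10, 4, 0]], columns=X_Train.columns)
predicted_sal = saved_model.predict(x)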

print("Predicted Salary:$",predicted_sal[0].round(1))

Predicted Salary:$ 40645.2
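
To make the saved model usable by recruiters, a thin wrapper around the pickle can accept named inputs and build the feature row in the training column order. This is a sketch under stated assumptions: the helper `predict_salary`, the toy training data, and the temporary file path are all hypothetical, and the toy model is fitted on an exact linear rule so the result is easy to verify.

```python
import os
import pickle
import tempfile

import pandas as pd
from sklearn.linear_model import LinearRegression

FEATURES = ["YearsEmployed", "DirectRepotees", "Sales"]  # training column order

def predict_salary(model_path, years_employed, direct_reports, in_sales):
    """Load a pickled regressor and predict the salary for one employee."""
    with open(model_path, "rb") as f:
        model = pickle.load(f)
    row = pd.DataFrame([[years_employed, direct_reports, int(in_sales)]],
                       columns=FEATURES)
    return float(model.predict(row)[0])

# Toy data generated from a known linear rule (not the EmpSalary data)
toy = pd.DataFrame({
    "YearsEmployed":  [0, 5, 10, 20, 3, 8],
    "DirectRepotees": [0, 1, 4, 10, 2, 0],
    "Sales":          [0, 1, 0, 0, 1, 1],
})
toy_salary = (30000 + 1000 * toy["YearsEmployed"]
              + 300 * toy["DirectRepotees"] - 7000 * toy["Sales"])

# Save a toy model to a temporary location, then query it through the wrapper
path = os.path.join(tempfile.mkdtemp(), "toy_model.pkl")
with open(path, "wb") as f:
    pickle.dump(LinearRegression().fit(toy[FEATURES], toy_salary), f)

estimate = predict_salary(path, years_employed=10, direct_reports=4, in_sales=False)
print(round(estimate, 1))
```

Unpickling executes code from the file, so a deployed service should only load model files from a trusted source, ideally with the same scikit-learn version that was used to save them.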




***