# Q13
An HR analyst at Unitech Pvt Ltd wants to predict the annual salaries of employees using the potential explanatory variables in the file MLR_Q13_EmpSalary.csv. File link: https://drive.google.com/drive/folders/1ILKastUTJWccxaxIpJpjqCJDpsMJ-oC8

    1) Estimate the appropriate multiple linear regression equation to predict the salary of a Unitech employee using all explanatory variables.
    2) Do we need to exclude certain columns? Why?
    3) Which department employees are paid the highest? By how much?
    4) Do you see any discrimination in salaries earned by male and female employees? 
    5) What would be the estimated salary of a Sr. Data Scientist (joining engineering) with 10 years of work experience? This woman has 18 years of total education and will be supervising 4 junior employees.

In [2]:
import numpy as np
import pandas as pd 

import seaborn as sns 
import matplotlib.pyplot as plt

import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

In [3]:
df = pd.read_csv("MLR_Q13_EmpSalary.csv")\
       .assign(Salary = lambda df: df['Salary'].str.replace('\\$|,', '', regex=True).astype(int))

df.head(2)

Unnamed: 0,Employee,Salary,PreviousExp,YearsEmployed,YearsEducation,DirectRepotees,Female,Male,Engineering,Sales,Other
0,1,65487,0,27,22,44,0,1,1,0,0
1,2,46184,3,20,14,1,1,0,1,0,0


In [4]:
#sns.pairplot(data = df[["Salary", "PreviousExp", "YearsEmployed", "YearsEducation", 
#                        "DirectRepotees", "Female", "Male", "Engineering", "Sales", "Other"
#                       ]], 
#             diag_kind='kde')
#plt.show()

In [5]:
def fit_lin_reg_with_intercept(X, Y):
    X = sm.add_constant(X)  # add an intercept (constant) column
    reg_model = sm.OLS(Y,X).fit()
    return reg_model

In [6]:
x_var = ["PreviousExp", "YearsEmployed", "YearsEducation", "DirectRepotees", "Female", "Male", "Engineering",
         "Sales", "Other"
        ]
reg_model = fit_lin_reg_with_intercept(X=df[x_var], Y=df[["Salary"]])
print(reg_model.summary())

                            OLS Regression Results                            
Dep. Variable:                 Salary   R-squared:                       0.846
Model:                            OLS   Adj. R-squared:                  0.817
Method:                 Least Squares   F-statistic:                     29.78
Date:                Thu, 19 May 2022   Prob (F-statistic):           1.48e-13
Time:                        07:15:02   Log-Likelihood:                -449.82
No. Observations:                  46   AIC:                             915.6
Df Residuals:                      38   BIC:                             930.3
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
const           4051.4713   2505.392      1.
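The summary above reports `Df Model: 7` even though nine regressors were passed in. That is what one would expect if `Female + Male` and `Engineering + Sales + Other` each add up to 1 for every employee, because both sums then duplicate the constant column. The sketch below illustrates the effect on a small made-up design matrix (the six rows are hypothetical, not taken from the CSV) by comparing the number of columns with the matrix rank.

```python
import numpy as np

# Made-up dummy data for six hypothetical employees (not from the CSV).
female = np.array([1, 0, 1, 0, 1, 0])
male = 1 - female                        # Female + Male == 1 in every row
engineering = np.array([1, 1, 0, 0, 0, 0])
sales = np.array([0, 0, 1, 1, 0, 0])
other = np.array([0, 0, 0, 0, 1, 1])     # the three department dummies sum to 1
const = np.ones(6, dtype=int)

# Full design: intercept plus every dummy (6 columns).
X_full = np.column_stack([const, female, male, engineering, sales, other])
# Reduced design: drop one dummy per group, so Male and Engineering
# become the baseline categories absorbed by the intercept.
X_reduced = np.column_stack([const, female, sales, other])

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)
print("full:   ", X_full.shape[1], "columns, rank", rank_full)
print("reduced:", X_reduced.shape[1], "columns, rank", rank_reduced)
```

The full matrix has more columns than its rank, so its coefficients are not uniquely determined; the reduced matrix has full column rank.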

In [7]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Strong multicollinearity among Male, Female, Engineering, Sales and Other
#   -> remove Male (Male becomes the baseline for gender)
# Strong multicollinearity remains among Female, Engineering, Sales and Other,
# with Engineering showing the highest VIF
#   -> remove Engineering (Engineering becomes the baseline department)
#   -> remove PreviousExp because of its high p-value


x_var = ["YearsEmployed", "YearsEducation", "DirectRepotees", "Female",
         "Sales", "Other"
        ]
X = df[x_var]
pd.Series([variance_inflation_factor(X.values, i) 
               for i in range(X.shape[1])], index=X.columns)

YearsEmployed     4.128813
YearsEducation    6.600779
DirectRepotees    1.618599
Female            2.214391
Sales             1.485294
Other             1.706149
dtype: float64
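One caveat about the figures above: as far as I know, `variance_inflation_factor` regresses each column on the other columns exactly as supplied, and `X` here has no constant column. The values are therefore uncentered VIFs, which can come out larger than the usual definition for variables whose mean is far from zero. The sketch below is a hand-rolled NumPy version of the usual definition, VIF = 1 / (1 - R²) with an intercept in each auxiliary regression. The helper name `vif_with_intercept` and the synthetic data are assumptions made for illustration; this is not the notebook's method.

```python
import numpy as np

def vif_with_intercept(X):
    """Return 1 / (1 - R^2) for each column of X, where R^2 comes from
    regressing that column on the remaining columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])   # intercept + other columns
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        centered = y - y.mean()
        r2 = 1.0 - (resid @ resid) / (centered @ centered)
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic data: x2 is nearly a copy of x1, x3 is unrelated to both.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)
vifs = vif_with_intercept(np.column_stack([x1, x2, x3]))
print(vifs.round(2))
```

With statsmodels, the analogous step would be to pass `sm.add_constant(X).values` to `variance_inflation_factor` and ignore the entry for the constant.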

In [8]:
reg_model = fit_lin_reg_with_intercept(X=df[x_var], Y=df[["Salary"]])
print(reg_model.summary())

                            OLS Regression Results                            
Dep. Variable:                 Salary   R-squared:                       0.845
Model:                            OLS   Adj. R-squared:                  0.822
Method:                 Least Squares   F-statistic:                     35.56
Date:                Thu, 19 May 2022   Prob (F-statistic):           2.47e-14
Time:                        07:15:02   Log-Likelihood:                -449.88
No. Observations:                  46   AIC:                             913.8
Df Residuals:                      39   BIC:                             926.6
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
const           9352.2329   4473.683      2.

### Answer
    1) Estimate the appropriate multiple linear regression equation to predict the salary of a Unitech employee using all explanatory variables.
       Salary = 9352.2329 + 67.1579 * YearsEmployed + 1601.9920 * YearsEducation + 124.3685 * DirectRepotees + 1790.9146 * Female - 8171.3772 * Sales - 1271.7246 * Other
       (Male, Engineering and PreviousExp are excluded; see 2.)
    2) Do we need to exclude certain columns? Why?
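       We must remove 'Male' and 'Engineering' to address multicollinearity. Each set of dummy columns (Female/Male and Engineering/Sales/Other) adds up to 1 for every employee, so keeping all of them alongside the intercept makes the regressors perfectly collinear and the individual coefficients cannot be estimated reliably. We also remove PreviousExp: its p-value was high and its coefficient came out negative, which did not make sense.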
    3) Which department employees are paid the highest? By how much?
    The Engineering department is paid the most, if everything else remains the same. Engineering is the baseline category, so the negative coefficients mean that Engineering employees earn 8171.3772 more than Sales employees and 1271.7246 more than employees in Other. For Sales, the p-value of the coefficient is not significant at the 95% confidence level, hence we may have to do some other analysis to confirm this.
    4) Do you see any discrimination in salaries earned by male and female employees? 
    From the positive regression coefficient (1790.9146), female employees' salaries appear higher than male employees', but the coefficient is not statistically significant. Hence it's not possible to conclude this based on the regression. So based on this data, we can't say we have evidence of discrimination in salaries between male and female employees.
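    5) What would be the estimated salary of a Sr. Data Scientist (joining engineering) with 10 years of work experience? This woman has 18 years of total education and will be supervising 4 junior employees.
    41,148.06, taking YearsEmployed = 10, YearsEducation = 18, DirectRepotees = 4, Female = 1 and Sales = Other = 0 (Engineering is the baseline department).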

In [11]:
round(9352.2329 +  67.1579 * 10 + 1601.9920 * 18 + 124.3685 * 4 + 1790.9146 * 1, 2)

41148.06
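The arithmetic for answers 3 and 5 can also be wrapped in a small helper so that the same profile is easy to re-score for another department. This is only a sketch: the coefficients are copied from the reduced-model equation in answer 1, and the names `COEF` and `predict_salary` are hypothetical, not part of the notebook.

```python
# Coefficients copied from the reduced-model equation in answer 1.
COEF = {
    "const": 9352.2329,
    "YearsEmployed": 67.1579,
    "YearsEducation": 1601.9920,
    "DirectRepotees": 124.3685,
    "Female": 1790.9146,
    "Sales": -8171.3772,
    "Other": -1271.7246,
}

def predict_salary(**features):
    """Intercept plus coefficient * value; omitted features count as 0."""
    unknown = set(features) - (set(COEF) - {"const"})
    if unknown:
        raise KeyError(f"unknown feature(s): {sorted(unknown)}")
    return COEF["const"] + sum(COEF[name] * value for name, value in features.items())

# Profile from question 5; Engineering is the baseline, so Sales = Other = 0.
profile = dict(YearsEmployed=10, YearsEducation=18, DirectRepotees=4, Female=1)
engineering = predict_salary(**profile)
sales = predict_salary(**profile, Sales=1)
other = predict_salary(**profile, Other=1)

print(round(engineering, 2))
print(round(engineering - sales, 4), round(engineering - other, 4))
```

Because the department effects are plain dummy coefficients, the gap between Engineering and another department is the same for every profile: it equals the absolute value of that department's coefficient.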