<b>Tested on NVIDIA GPU Computing Platform</b>

<b>Objective</b>

Download the employee retention dataset from https://www.kaggle.com/giripujar/hr-analytics.

<ol>
    <li>Do some exploratory data analysis to figure out which variables have a direct and clear impact on employee retention (i.e. whether employees leave the company or continue to work there)</li>
    <li>Plot bar charts showing the impact of employee salaries on retention</li>
    <li>Plot bar charts showing the correlation between department and employee retention</li>
    <li>Build a logistic regression model using the variables that were narrowed down in step 1</li>
    <li>Measure the accuracy of the model</li>
</ol>

In [None]:
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline

In [None]:
df = pd.read_csv("HR_comma_sep.csv")
df.head()

<h2 style="color:purple">Data exploration and visualization</h2>

In [None]:
left = df[df.left==1]
left.shape

In [None]:
retained = df[df.left==0]
retained.shape

**Average numbers for all columns** 

In [None]:
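# numeric_only=True skips the text columns (salary, Department), which cannot be averaged
df.groupby('left').mean(numeric_only=True)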

From the table above we can draw the following conclusions:
<ol>
    <li><b>Satisfaction Level</b>: Satisfaction level is noticeably lower among employees who left the firm (0.44) than among those who were retained (0.66)</li>
    <li><b>Average Monthly Hours</b>: Average monthly hours are higher among employees who left the firm (207 vs 199)</li>
    <li><b>Promotion Last 5 Years</b>: Employees who were given a promotion are more likely to be retained at the firm</li>
</ol>
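
One point that may help when reading the promotion column: for a 0/1 column, the group mean is simply the share of rows equal to 1. The sketch below illustrates this on a small made-up frame (the values in `toy` are hypothetical and are not taken from HR_comma_sep.csv).

```python
import pandas as pd

# Hypothetical toy data, not taken from HR_comma_sep.csv
toy = pd.DataFrame({
    "left": [0, 0, 0, 0, 1, 1, 1, 1],
    "promotion_last_5years": [1, 0, 1, 0, 0, 0, 0, 1],
})

# The mean of a 0/1 column is the fraction of rows that equal 1,
# so this gives the share of promoted employees in each group
promoted_share = toy.groupby("left")["promotion_last_5years"].mean()
print(promoted_share)
```

In this toy frame half of the retained group was promoted, compared with a quarter of the group that left.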

**Impact of salary on employee retention**

In [None]:
pd.crosstab(df.salary,df.left).plot(kind='bar')

The bar chart above shows that employees with high salaries are less likely to leave the company
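
Raw counts can be hard to compare when the salary groups differ in size. As a rough sketch, `pd.crosstab` accepts `normalize="index"` to turn each row into proportions, so the `1` column becomes the leave rate for that salary band. The frame `toy` below is made up for illustration and is not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data, not taken from HR_comma_sep.csv
toy = pd.DataFrame({
    "salary": ["low", "low", "low", "low",
               "medium", "medium", "medium", "medium",
               "high", "high"],
    "left":   [1, 1, 1, 0,
               1, 0, 0, 0,
               0, 0],
})

# normalize="index" divides each row by its row total,
# so column 1 holds the fraction of employees who left per salary band
rates = pd.crosstab(toy.salary, toy.left, normalize="index")
print(rates)
```

The same call on the real data would be `pd.crosstab(df.salary, df.left, normalize="index")`, and the result can be passed to `.plot(kind='bar')` as before.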

**Department-wise employee retention rate**

In [None]:
pd.crosstab(df.Department,df.left).plot(kind='bar')

The chart above suggests that department has some impact on employee retention, but it is not major, so we will ignore department in our analysis

<h3 style="color:purple">From the data analysis so far we can conclude that we will use the following variables as independent variables in our model</h3>
<ol>
    <li><b>Satisfaction Level</b></li>
    <li><b>Average Monthly Hours</b></li>
    <li><b>Promotion Last 5 Years</b></li>
    <li><b>Salary</b></li>
</ol>

In [None]:
subdf = df[['satisfaction_level','average_montly_hours','promotion_last_5years','salary']]
subdf.head()

**Tackle salary dummy variable**

The salary column contains only text data. It needs to be converted to numbers, and we will use dummy variables for that. Check my one hot encoding tutorial to understand the purpose behind dummy variables.
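
As a quick illustration of what dummy variables look like, here is a minimal sketch on a made-up series (`toy_salary` is hypothetical, not the real column). Each distinct category becomes its own indicator column, and exactly one indicator is set per row.

```python
import pandas as pd

# Hypothetical toy series standing in for the salary column
toy_salary = pd.Series(["low", "high", "medium", "low"], name="salary")

# One indicator column per category, named <prefix>_<category>
dummies = pd.get_dummies(toy_salary, prefix="salary")
print(dummies)

# Optional variant: drop_first=True omits the first category (salary_high here),
# since it is implied when all the remaining indicators are 0
reduced = pd.get_dummies(toy_salary, prefix="salary", drop_first=True)
print(reduced)
```

This notebook keeps all three indicator columns. The `drop_first=True` variant is shown only as an option.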

In [None]:
salary_dummies = pd.get_dummies(subdf.salary, prefix="salary")

In [None]:
df_with_dummies = pd.concat([subdf,salary_dummies],axis='columns')

In [None]:
df_with_dummies.head()

Now we need to remove the salary column, which is text data. It has already been replaced by dummy variables, so we can safely remove it

In [None]:
df_with_dummies.drop('salary',axis='columns',inplace=True)
df_with_dummies.head()

In [None]:
X = df_with_dummies
X.head()

In [None]:
y = df.left

In [None]:
from sklearn.model_selection import train_test_split
# train_size=0.3 trains on 30% of the rows; the remaining 70% form the test set
X_train, X_test, y_train, y_test = train_test_split(X,y,train_size=0.3)

In [None]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()

In [None]:
model.fit(X_train, y_train)

In [None]:
model.predict(X_test)

**Accuracy of the model**

In [None]:
model.score(X_test,y_test)
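
An accuracy figure is easier to interpret next to a simple baseline. One common choice, sketched below, is a classifier that always predicts the most frequent class. The data here is synthetic (`X_toy` and `y_toy` are made up and unrelated to the HR dataset), and the `random_state` and `stratify` arguments are additions for reproducibility that the notebook above does not use.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced toy data (an assumption for illustration only)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(400, 2))
y_toy = (X_toy[:, 0] + rng.normal(scale=0.5, size=400) > 0.8).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.3, random_state=0, stratify=y_toy
)

# Baseline: always predict the majority class seen during training
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
clf = LogisticRegression().fit(X_tr, y_tr)

baseline_acc = baseline.score(X_te, y_te)
model_acc = clf.score(X_te, y_te)
print(f"baseline accuracy: {baseline_acc:.3f}")
print(f"model accuracy:    {model_acc:.3f}")
```

On the real data, the same comparison would fit `DummyClassifier(strategy="most_frequent")` on `X_train, y_train` and score it on `X_test, y_test` next to `model.score(X_test,y_test)`.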