In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

In [None]:
df = pd.read_csv("../input/WA_Fn-UseC_-HR-Employee-Attrition.csv")
df.head()

## Data Analysis
Although the main task is to explore **Attrition**, it's also interesting to explore other features such as **MonthlyIncome**, **JobSatisfaction** and **WorkLifeBalance**. I will focus on these four features throughout this notebook.

Let's first quickly check the quality of the dataset. 

In [None]:
df.isnull().any()

Great! There are no missing values, so we can move on to the next step: understanding the distributions of our targeted features. Note that **Attrition**, **JobSatisfaction** and **WorkLifeBalance** are categorical variables, while **MonthlyIncome** is a numerical variable. Categorical and numerical variables should be treated differently.
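Because **JobSatisfaction** and **WorkLifeBalance** are stored as integer codes, the column dtype alone will not flag them as categorical. Here is a minimal sketch of one way to separate the two kinds of columns. The helper name `split_feature_types` and the `max_levels` threshold of 10 are assumptions made for illustration, not part of the dataset or of pandas.

```python
import pandas as pd


def split_feature_types(frame: pd.DataFrame, max_levels: int = 10):
    """Split columns into categorical-like and numerical ones.

    A column is treated as categorical-like if it is non-numeric, or if it is
    numeric but has at most `max_levels` distinct values (an assumed cutoff).
    """
    categorical, numerical = [], []
    for col in frame.columns:
        is_numeric = pd.api.types.is_numeric_dtype(frame[col])
        if not is_numeric or frame[col].nunique() <= max_levels:
            categorical.append(col)
        else:
            numerical.append(col)
    return categorical, numerical
```

In this notebook it could be called as `categorical, numerical = split_feature_types(df)`. The result is only a first guess and is worth checking by eye.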

In [None]:
f,((ax1,ax2),(ax3,ax4)) = plt.subplots(2,2,figsize=(12,9))
sns.countplot(x='Attrition',data=df,ax=ax1)
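# distplot is deprecated in recent seaborn releases; histplot with kde=True is the replacement
sns.histplot(df['MonthlyIncome'],kde=True,ax=ax2)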
sns.countplot(x="JobSatisfaction",data=df,ax=ax3)
sns.countplot(x="WorkLifeBalance",data=df,ax=ax4)

**Take away**

From the distributions of the three categorical variables, we get the straightforward impression that employees at IBM seem to have a fairly good life. In general, the "happy" group is larger than the "sad" group.
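The impression from the count plots can be backed up with numbers by turning counts into shares. This is a small sketch; the helper name `level_shares` is made up for illustration.

```python
import pandas as pd


def level_shares(frame: pd.DataFrame, column: str) -> pd.Series:
    """Return the fraction of rows in each level of `column`, ordered by level."""
    return frame[column].value_counts(normalize=True).sort_index()
```

For example, `level_shares(df, "Attrition")` would give the fraction of employees in each attrition class, and the same call works for `"JobSatisfaction"` and `"WorkLifeBalance"`.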

### Correlation of features
Next we want to explore the correlations between features, which will produce a lot of cool visualisations. This step can also help us choose the right features for each model.

#### ***MonthlyIncome***

 - **Relationship with numerical features**

Before we do anything, we can first make an assumption about which features matter most. Maybe **Age**, **TotalWorkingYears** or **YearsAtCompany**? There are other interesting features as well, like **EmployeeNumber** and **YearsSinceLastPromotion**. For numerical features, we can use a scatter plot plus a regression line to see the trend. Seaborn is such a powerful package that it can do most of this for us very easily.

In [None]:
# pairplot creates its own figure, so no separate plt.figure() call is needed
cols = ["MonthlyIncome","Age","TotalWorkingYears","EmployeeNumber","YearsSinceLastPromotion"]
sns.pairplot(df[cols],diag_kind="kde",kind="reg")

**Take away**

Luckily, our assumption seems to make sense. **MonthlyIncome** has strong positive correlations with **Age** and **TotalWorkingYears** and a slight positive correlation with **YearsSinceLastPromotion**. It looks like **EmployeeNumber** cannot tell us much. Of course, this is not enough. Let's explore more!
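To put numbers on what the pair plot suggests, the Pearson correlation of each feature with **MonthlyIncome** can be computed directly. The sketch below uses a hypothetical helper, `correlations_with`, and ranks features by the absolute value of the coefficient.

```python
import pandas as pd


def correlations_with(frame: pd.DataFrame, target: str, features: list) -> pd.Series:
    """Pearson correlation of each feature with `target`, strongest first."""
    corr = frame[features].corrwith(frame[target])
    # Order by absolute strength so strong negative correlations rank high too.
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)
```

With the columns used above, this could be called as `correlations_with(df, "MonthlyIncome", ["Age", "TotalWorkingYears", "EmployeeNumber", "YearsSinceLastPromotion"])`.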

- **Relationship with categorical features**

Similarly, we can make an assumption again. Don't worry: we are not HR experts, so our assumptions will never be exactly right. But that's the reason for data analysis, right? In this case, there are more categorical variables. We don't need to show all of them right now, only the ones I think may matter a lot, such as **Education**, **JobLevel** and **JobSatisfaction**. I would also like to check other interesting features like **Gender** and **MaritalStatus**. To make it more fun, I will use box plots to show the trend.

In [None]:
f,((ax1,ax2,ax3),(ax4,ax5,ax6)) = plt.subplots(2,3,figsize=(12,8))
sns.boxplot(x=df['Education'],y=df['MonthlyIncome'],ax=ax1)
sns.boxplot(x=df['JobLevel'],y=df['MonthlyIncome'],ax=ax2)
sns.boxplot(x=df['JobSatisfaction'],y=df['MonthlyIncome'],ax=ax3)
sns.boxplot(x=df['Gender'],y=df['MonthlyIncome'],ax=ax4)
sns.boxplot(x=df['MaritalStatus'],y=df['MonthlyIncome'],ax=ax5)
# Only five features are plotted, so hide the unused sixth panel
ax6.axis('off')
plt.tight_layout()

**Take away**

Wow, this time I found something really surprising. First of all, **JobLevel** has an extremely strong effect on income: a higher job level clearly means a higher income.
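The centre line of each box is the median, so the trend seen in the box plots can also be read off as a small table of medians per level. The helper name `median_by_level` below is an assumption for illustration.

```python
import pandas as pd


def median_by_level(frame: pd.DataFrame, category: str, value: str) -> pd.Series:
    """Median of `value` within each level of `category`, ordered by level."""
    return frame.groupby(category)[value].median().sort_index()
```

For instance, `median_by_level(df, "JobLevel", "MonthlyIncome")` would list the median income at each job level, which makes it easy to compare levels without reading values off the plot.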

In [None]:
sns.jointplot(x="YearsAtCompany",y="MonthlyIncome",data=df,kind="hex")

In [None]:
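# Restrict to numeric columns: recent pandas versions raise an error on string columns in corr()
corrmat = df.select_dtypes(include="number").corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True, ax=ax);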

**Take away**

Generally, when training a model, we avoid selecting features that are strongly correlated with each other, because this causes a multicollinearity problem. A heatmap is a good way to detect this kind of situation. In this case, **YearsAtCompany**, **YearsInCurrentRole**, **YearsSinceLastPromotion** and **YearsWithCurrManager** are strongly correlated with each other.
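Reading strongly correlated pairs off a heatmap by eye can be error-prone, so the same information can be listed explicitly. The following is a minimal sketch: the helper name `strongly_correlated_pairs` and the default cutoff of 0.7 are assumptions, and a sensible threshold depends on the model being trained.

```python
import numpy as np
import pandas as pd


def strongly_correlated_pairs(frame: pd.DataFrame, threshold: float = 0.7) -> pd.Series:
    """List numeric column pairs whose absolute Pearson correlation exceeds `threshold`."""
    corr = frame.select_dtypes(include="number").corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (self-correlation) is excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # The comparison also discards NaN entries, e.g. from constant columns.
    return pairs[pairs > threshold].sort_values(ascending=False)
```

Calling `strongly_correlated_pairs(df)` returns a Series indexed by column pairs. From each listed pair, one feature is a candidate for removal before fitting a model that is sensitive to multicollinearity.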