
# Portfolio Tuesday

We will use the turnover dataset

https://github.com/CALDISS-AAU/sdsphd19_coursematerials/raw/master/data/turnover.csv

for this portfolio, which covers the following tasks:

## Unsupervised ML

- Prepare the dataset (select the columns that are useful)
- Preprocess (Scale)
- Reduce dimensionality and perhaps make a scatterplot of the data
- Examine the reduced data

## Supervised ML

- Preprocess the data
- Try to predict "churn" using a simple model (Logistic Regression)
- Try a more advanced model (e.g. Random Forest)
- Evaluate your model
- Predict "satisfaction" (note: this is a regression task, not a classification task)
- Evaluate the regression model
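
The supervised steps listed above could be sketched roughly as follows. This is a minimal illustration, not a solution to the exercise: the data are a randomly generated stand-in that only borrows a few column names from the turnover dataset, the rule that creates `churn` is hypothetical, and settings such as the 80/20 split, `n_estimators=100`, and the choice of `LinearRegression` for the regression step are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; swap in the real turnover DataFrame when working on the portfolio.
rng = np.random.default_rng(42)
n = 500
demo = pd.DataFrame({
    "satisfaction": rng.uniform(0.1, 1.0, n),
    "evaluation": rng.uniform(0.3, 1.0, n),
    "number_of_projects": rng.integers(2, 8, n),
    "average_montly_hours": rng.integers(96, 311, n),
})
# Hypothetical labelling rule so that there is something to learn.
demo["churn"] = (demo["satisfaction"] < 0.4).astype(int)

# --- Classification: predict churn ---
X = demo.drop(columns="churn")
y = demo["churn"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Simple model: scaling + logistic regression in one pipeline.
log_reg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
log_reg.fit(X_train, y_train)

# More flexible model: random forest (tree models do not need scaling).
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

acc_log_reg = accuracy_score(y_test, log_reg.predict(X_test))
acc_forest = accuracy_score(y_test, forest.predict(X_test))

# --- Regression: predict satisfaction ---
X_reg = demo.drop(columns="satisfaction")
y_reg = demo["satisfaction"]
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)
lin_reg = make_pipeline(StandardScaler(), LinearRegression())
lin_reg.fit(Xr_train, yr_train)
r2 = r2_score(yr_test, lin_reg.predict(Xr_test))

print(f"logistic regression accuracy: {acc_log_reg:.3f}")
print(f"random forest accuracy:       {acc_forest:.3f}")
print(f"linear regression R^2:        {r2:.3f}")
```

Accuracy is only one possible classification metric; for an imbalanced target, a confusion matrix or `classification_report` from `sklearn.metrics` may be more informative.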



In [0]:
import pandas as pd
pd.set_option('display.float_format', lambda x: '%.3f' % x)

In [0]:
turnover = pd.read_csv('https://github.com/CALDISS-AAU/sdsphd19_coursematerials/raw/master/data/turnover.csv')

In [4]:
turnover.head()

,satisfaction,evaluation,number_of_projects,average_montly_hours,time_spend_company,work_accident,churn,promotion,department,salary
0,0.38,0.53,2,157,3,0,1,0,sales,low
1,0.8,0.86,5,262,6,0,1,0,sales,medium
2,0.11,0.88,7,272,4,0,1,0,sales,medium
3,0.72,0.87,5,223,5,0,1,0,sales,low
4,0.37,0.52,2,159,3,0,1,0,sales,low


### Prepare the Dataset

#### Select the Data

In [0]:
# Select the variables needed for dimensionality reduction

In [0]:
X = turnover[...]

In [10]:
turnover.salary.unique()

array(['low', 'medium', 'high'], dtype=object)

In [0]:
salary_mapper = {'low':0, 'medium':1, 'high':2}

In [0]:
turnover['salary_recode'] = turnover.salary.map(salary_mapper)

In [13]:
turnover['salary_recode'].unique()

array([0, 1, 2])
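
One thing worth checking after a recode like this: `Series.map` with a dictionary returns `NaN` for any value that is missing from the mapping. The small sketch below uses a toy series (not the real data) to show a quick sanity check.

```python
import pandas as pd

# Toy stand-in for turnover.salary
salary = pd.Series(['low', 'medium', 'high', 'medium'])
salary_mapper = {'low': 0, 'medium': 1, 'high': 2}

salary_recode = salary.map(salary_mapper)

# Categories absent from the dict would become NaN, so count them.
n_unmapped = int(salary_recode.isna().sum())
print(salary_recode.tolist(), n_unmapped)
```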

In [15]:
turnover.department.value_counts(normalize=True)

sales          0.276018
technical      0.181345
support        0.148610
IT             0.081805
product_mng    0.060137
marketing      0.057204
RandD          0.052470
accounting     0.051137
hr             0.049270
management     0.042003
Name: department, dtype: float64

In [0]:
dep_matrix = pd.get_dummies(turnover.department)

In [0]:
data = pd.concat([turnover, dep_matrix], axis = 1)

In [0]:
data.drop(['department', 'salary', 'churn'], axis=1, inplace=True)

In [33]:
data

,satisfaction,evaluation,number_of_projects,average_montly_hours,time_spend_company,work_accident,promotion,salary_recode,IT,RandD,accounting,hr,management,marketing,product_mng,sales,support,technical
0,0.38,0.53,2,157,3,0,0,0,0,0,0,0,0,0,0,1,0,0
1,0.80,0.86,5,262,6,0,0,1,0,0,0,0,0,0,0,1,0,0
2,0.11,0.88,7,272,4,0,0,1,0,0,0,0,0,0,0,1,0,0
3,0.72,0.87,5,223,5,0,0,0,0,0,0,0,0,0,0,1,0,0
4,0.37,0.52,2,159,3,0,0,0,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14994,0.40,0.57,2,151,3,0,0,0,0,0,0,0,0,0,0,0,1,0
14995,0.37,0.48,2,160,3,0,0,0,0,0,0,0,0,0,0,0,1,0
14996,0.37,0.53,2,143,3,0,0,0,0,0,0,0,0,0,0,0,1,0
14997,0.11,0.96,6,280,4,0,0,0,0,0,0,0,0,0,0,0,1,0
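
As a side note on the dummy-coding step above, `pd.get_dummies` can also encode a column of a DataFrame directly via its `columns` argument, which combines the encode, concat, and drop steps. In recent pandas versions the dummy columns are boolean by default, and passing `dtype=int` gives 0/1 integers instead. The sketch below uses a toy frame, and the resulting column names carry a `department_` prefix, which differs from the unprefixed names in the output above.

```python
import pandas as pd

# Toy stand-in for the turnover data
toy = pd.DataFrame({
    'satisfaction': [0.38, 0.80, 0.11],
    'department': ['sales', 'technical', 'sales'],
})

# Encode 'department' in place of the original column, as 0/1 integers.
encoded = pd.get_dummies(toy, columns=['department'], dtype=int)
print(encoded)
```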


### Scaling

In [0]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

In [0]:
data_scaled = scaler.fit_transform(data)

In [38]:
pd.DataFrame(data_scaled, columns = data.columns).describe()

,satisfaction,evaluation,number_of_projects,average_montly_hours,time_spend_company,work_accident,promotion,salary_recode,IT,RandD,accounting,hr,management,marketing,product_mng,sales,support,technical
count,14999.0,14999.0,14999.0,14999.0,14999.0,14999.0,14999.0,14999.0,14999.0,14999.0,14999.0,14999.0,14999.0,14999.0,14999.0,14999.0,14999.0,14999.0
mean,-0.0,-0.0,-0.0,-0.0,-0.0,0.0,0.0,0.0,0.0,-0.0,0.0,-0.0,-0.0,0.0,0.0,0.0,0.0,0.0
std,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
min,-2.103,-2.08,-1.463,-2.103,-1.026,-0.411,-0.147,-0.933,-0.298,-0.235,-0.232,-0.228,-0.209,-0.246,-0.253,-0.617,-0.418,-0.471
25%,-0.695,-0.912,-0.652,-0.902,-0.341,-0.411,-0.147,-0.933,-0.298,-0.235,-0.232,-0.228,-0.209,-0.246,-0.253,-0.617,-0.418,-0.471
50%,0.109,0.023,0.16,-0.021,-0.341,-0.411,-0.147,0.636,-0.298,-0.235,-0.232,-0.228,-0.209,-0.246,-0.253,-0.617,-0.418,-0.471
75%,0.833,0.899,0.971,0.88,0.344,-0.411,-0.147,0.636,-0.298,-0.235,-0.232,-0.228,-0.209,-0.246,-0.253,1.62,-0.418,-0.471
max,1.557,1.659,2.594,2.182,4.453,2.432,6.784,2.206,3.35,4.25,4.308,4.393,4.776,4.06,3.953,1.62,2.394,2.125
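
To make the scaling step concrete: `StandardScaler` subtracts each column's mean and divides by its standard deviation, which is why every column above has mean 0 and standard deviation 1. The sketch below reproduces this by hand on a small toy array (the values are illustrative only). Note that `StandardScaler` uses the population standard deviation (`ddof=0`), whereas `DataFrame.describe()` reports the sample standard deviation (`ddof=1`), so the two can differ slightly on small datasets.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy array: two columns on very different scales
toy = np.array([[157.0, 2.0],
                [262.0, 5.0],
                [272.0, 7.0],
                [223.0, 5.0]])

scaled = StandardScaler().fit_transform(toy)

# Manual z-score; np.std defaults to ddof=0, matching StandardScaler.
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(np.allclose(scaled, manual))
```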


#### Dimensionality Reduction

In [0]:
from sklearn.decomposition import PCA

pca = PCA(n_components = ...)

In [0]:
X_reduced = pca.fit_transform(...)
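
One possible way to fill in the placeholders above is sketched below. It assumes two components (convenient for a 2D scatterplot) and runs on randomly generated stand-in data rather than the real `data_scaled`; both choices are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared feature matrix
rng = np.random.default_rng(0)
demo = rng.normal(size=(200, 5))
demo_scaled = StandardScaler().fit_transform(demo)

# n_components=2 is an assumption; any value up to the number of features works.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(demo_scaled)

print(X_reduced.shape)
```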

#### Exploring the PCA
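
A fitted PCA can be examined through the share of variance each component explains and through the loadings of the original variables on each component. The sketch below is a minimal example under stated assumptions: the data are synthetic, only four column names are borrowed from the turnover dataset, and a correlation is injected artificially so that the first component has something to pick up.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with an artificial correlation between two columns
rng = np.random.default_rng(0)
cols = ['satisfaction', 'evaluation', 'number_of_projects', 'average_montly_hours']
demo = pd.DataFrame(rng.normal(size=(200, 4)), columns=cols)
demo['evaluation'] += demo['satisfaction']

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(StandardScaler().fit_transform(demo))

# Share of total variance captured by each component
explained = pca.explained_variance_ratio_

# Loadings: how strongly each original variable contributes to each component
loadings = pd.DataFrame(pca.components_.T, index=demo.columns, columns=['PC1', 'PC2'])

print(explained.round(3))
print(loadings.round(3))
```

For the scatterplot mentioned in the task list, the two columns of `X_reduced` can be passed to a plotting function, for example `plt.scatter(X_reduced[:, 0], X_reduced[:, 1])` with matplotlib.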

#### Clustering

In [0]:
from sklearn.cluster import KMeans

clusterer = KMeans(n_clusters = ...)

In [0]:
clusterer.fit(...)
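
A hedged sketch of the clustering step follows. It uses synthetic blobs from `make_blobs` instead of the reduced turnover data, and `n_clusters=3` is an assumption chosen to match the synthetic data, not a recommendation for the real dataset. The silhouette score is one common way to compare different choices of `n_clusters`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the reduced data (e.g. X_reduced)
X_demo, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=42)

# n_clusters=3 is an assumption; n_init and random_state make the run reproducible.
clusterer = KMeans(n_clusters=3, n_init=10, random_state=42)
clusterer.fit(X_demo)

labels = clusterer.labels_                   # cluster assignment per row
score = silhouette_score(X_demo, labels)     # between -1 and 1, higher is better

print(np.bincount(labels), round(score, 3))
```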