## Predicting Employee Churn with Logistic Regression

## What does **`churn`** mean?


**Customer Churn**: The rate at which customers leave or stop paying for a product or service. It's a critical figure in many businesses, as acquiring new customers is a lot more costly than retaining existing ones.

**Employee Churn**: The same idea applied to a company's workforce, where the *customer* is the *employee* of that company. Modelling it helps predict which employees are likely to leave, and when.

**GOAL**: Predict which employees will leave, based on a given set of attributes.
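
As a rough illustration of churn as a rate, the sketch below computes it from a made-up list of stay/leave flags. The data and the variable names are assumptions for illustration only, not values from the session's dataset.

```python
# Hypothetical records: 1 = employee left, 0 = employee stayed.
left = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]

# Churn rate = employees who left / total employees observed.
churn_rate = sum(left) / len(left)
print(f"Churn rate: {churn_rate:.0%}")
```

The same 0/1 flag is what a classifier would later be asked to predict for each employee.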

## Why is Employee Churn important?

Cons of an employee leaving the company:
- People with niche skills are hard to replace
- It disrupts the ongoing workflow of other workers
- An incoming employee may take time to get acquainted with the vacated role

Using analytics, we can predict employee churn, which in turn can:
- Help management design strategies accordingly
- Improve overall working environment

## Agenda

- Exploratory Data Analysis
- Data preparation
- Classification and its types
- **Problem statement**
- **Solving classification problems with linear regression**
- Building blocks of Logistic Regression
    - **Sigmoid**
    - Odds ratio
    - **Decision boundary stability**
- Cost function intuition
- Model building with scikit-learn
- Evaluation metrics

## Session takeaways
- Dataset preparation
- Exploratory analysis
- Nuts and bolts of logistic regression
- Choice of evaluation metrics

## What is Classification?

- Assign data points to groups (classes) based on their attributes. Ex:
    - Predict genre of movie as **horror/suspense/thriller** based on plot, duration, actors etc.
    - Classify emails as **spam/ham** based on email subject, email message, attachments etc.
- `Different from regression`: Predicts categories (discrete values), not continuous values.
- Classification is a supervised learning task; a related unsupervised task that also groups data, but without known labels, is clustering.
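
To make the contrast with regression concrete, here is a minimal sketch using scikit-learn. The toy feature (a satisfaction-like score) and both targets are invented for illustration; only the shape of the outputs matters.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical toy data: one feature per employee (e.g. a score from 0 to 1).
X = np.array([[0.1], [0.2], [0.3], [0.4], [0.6], [0.7], [0.8], [0.9]])

# Classification target: discrete labels (1 = left, 0 = stayed).
y_class = np.array([1, 1, 1, 1, 0, 0, 0, 0])
# Regression target: a continuous quantity (e.g. monthly hours worked).
y_reg = np.array([250.0, 240.0, 230.0, 215.0, 190.0, 180.0, 170.0, 160.0])

clf = LogisticRegression().fit(X, y_class)
reg = LinearRegression().fit(X, y_reg)

X_new = np.array([[0.15], [0.85]])
class_pred = clf.predict(X_new)  # each value comes from the label set {0, 1}
reg_pred = reg.predict(X_new)    # each value is an arbitrary real number

print("Predicted classes:", class_pred)
print("Predicted values: ", reg_pred)
```

The classifier can only ever return one of the labels it was trained on, while the regressor is free to return any real number.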

**Supervised vs Unsupervised learning**

<img src='../images/supervised.png'>

- **`Unsupervised`**: We don't know the targets of the data points, and the goal is to group the data into a finite number of categories based on the given features
- **`Supervised`**: The true targets are known (shown by colored points), and the goal is to predict that information (a number or a category) for unseen data
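
The difference can be sketched in code: a supervised model is fit on features and targets, while an unsupervised one sees features only. The example below uses scikit-learn's `LogisticRegression` and `KMeans` on invented 2-D points; the data and the choice of `KMeans` as the clustering method are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: two well-separated groups of points in 2-D.
X = np.array([[1.0, 1.2], [0.8, 1.0], [1.1, 0.9],
              [5.0, 5.1], [5.2, 4.9], [4.8, 5.0]])
# True targets, available only in the supervised setting.
y = np.array([0, 0, 0, 1, 1, 1])

# Supervised: fit on features AND known targets, then predict unseen points.
clf = LogisticRegression().fit(X, y)
supervised_pred = clf.predict(np.array([[0.9, 1.1], [5.1, 5.0]]))

# Unsupervised: fit on features only; the model assigns its own group ids.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_ids = km.labels_

print("Supervised predictions:", supervised_pred)
print("Cluster assignments:   ", cluster_ids)
```

Note that the cluster ids are arbitrary: the clustering may call the first group `1` and the second `0`, since it never sees the true targets.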


