# _GLOBAL TURNOVER ASSET_
&nbsp;
- **Author:** *Varun V*
- **Team:** *People Analytics*
- **Location:** *GCC*
- **Date:** *Oct 2018*
- **Description:** *Transition Documentation for GCC handover*
---

## Environment:
&nbsp;
1. **OS** - Windows 10
2. **Python** - Anaconda version 3.6.5
3. **Packages** - *Described in the requirements file of the project (requirements.txt in the project dir)*
4. **Deployed Location** - Azure VM; with data in Azure Datalake and production running on an Azure Databricks cluster
5. **Structure** - Five Python notebooks in three groups
    1. Data Preparation
        1. Raw_ADS_preparation.ipynb
        2. Final_ADS_preparation.ipynb
    2. Modelling FW
        1. Turnover_modelling_FW.ipynb
    3. Prediction and Factor analysis
        1. Predict.ipynb
        2. Factor_analysis.ipynb
---

## **Contents**
&nbsp;
1. Purpose of the Document
2. Primary Steps in the Asset
    1. Data Procurement
    2. Data Preparation
        1. Monthly/Yearly dataset preparation
        2. TRAIN and VALID datasets preparation
    3. Turnover Modelling Framework
        1. Data Preprocessing
        2. Classification Module
    4. Model Interpretation and Deployment
        1. Prediction
        2. Factor analysis
3. Consumption Guidelines
    1. Click (Stay Interviews)
    2. Feedback Module
        1. Adoption Module
            1. Trends/Insights Generation (High level stakeholder dashboard)
            2. Feedback Incorporation
            3. Benefit/ROI calculator
        2. Health Module
            1. Data Pollution Module
            2. Model retrain/refresh cadence determiner
4. Closure Module
    1. Transition process
    2. Leverage Module
    3. Appendix and Glossary
    4. References

## **Purpose**
&nbsp;
- To ensure seamless transition of the entire pipeline to another entity
- Complete technical documentation of the entire Turnover process from scratch to final adoption/health monitoring
- Complete documentation of the manual processes involved in the pipeline
- To create modules that can be scaled/leveraged across the organization in other problems
- To ensure a viable consumption pipeline that maximizes the ROI potential of the whole process
    - The generic process flow across the various individual modules
    - ***Perspective***:
        - **Level** - Actionable steps
        - **Order** - Chronological order along the time and implementation dimensions (linear; no parallel steps)
        - **Audience** - Primarily intended for a Data Scientist (few existing steps will be translated to a Data Engineer/BI developer)

## **Primary Steps**
&nbsp;
1. **Data Procurement**
    1. Mapping all the individual data sources
        1. flatfile reports (from the corresponding data owners; mostly year end reports (headcount, turnover))
            - manual
        2. datalake
            - automated (currently not implemented)
        3. sharpops/navigate/payroll (target, movement, opr, salary, demographics)
            - automated (currently not implemented)
    2. Identification and mapping of the different owners for the above
        - the manually procured ones
        - the automated ones
    3. The sources and owners table is displayed below:
    4. The cadence for the above data sources is:
        1. Every Monday
            - headcount
            - movements
            - PBP mapping
        2. 1st Monday of every month
            - headcount
            - movements
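
The refresh cadence above can be expressed as a small date check. This is a minimal sketch, not part of the asset: the function name `sources_due` and the dictionary layout are hypothetical, and only the source names and the Monday / first-Monday rule come from the list above.

```python
from datetime import date

# Source names follow the cadence list above; everything else is illustrative.
WEEKLY_SOURCES = ["headcount", "movements", "PBP mapping"]
MONTHLY_SOURCES = ["headcount", "movements"]


def sources_due(day: date) -> dict:
    """Return the data sources to refresh on `day`, keyed by cadence."""
    due = {"weekly": [], "monthly": []}
    if day.weekday() == 0:  # Monday
        due["weekly"] = list(WEEKLY_SOURCES)
        if day.day <= 7:  # the first Monday of a month always falls on day 1-7
            due["monthly"] = list(MONTHLY_SOURCES)
    return due
```

Calling `sources_due(date.today())` at the start of a procurement run would show which manual reports need to be requested from their owners that day.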

2. **Data Preparation**
    1. Monthly/Yearly dataset preparation
        1. Notebook
            1. Name - Raw_ADS_preparation.ipynb
            2. Location - Azure VM xxx
        2. Purpose
            1. Pull in all the different data sources
                - via an automated pull, or by reading the flatfile reports saved in the Azure Datalake
            2. Preprocess and clean with business rules
            3. Aggregate and create the final raw ADS
                - Level = Employee and Time
                    - Time level = year (i.e. employee + year)
                    - Time level = monthly (i.e. employee + year_month)
        3. Features - Raw base features (may have missing values, violate business rules)
            - target
            - opr
            - salary
            - personal attributes (age, tenure, hire date, position, location)
            - manager attributes (same as personal), etc.
            - ~90 base features
    2. TRAIN and VALID datasets preparation
        1. Notebook
            1. Name - Final_ADS_preparation.ipynb
            2. Location - Azure VM xxx
        2. Purpose - Depending on the level
            1. For yearly datasets - simply appending them
            2. For monthly datasets - passing them through the Random Sampling module
        3. Labels - The response variable (0/1) for the two datasets is assigned using user-customizable logic
            - 0 stands for active, while 1 stands for leaver
            - Demarcation = Time dimension based (to differentiate between what is TRAIN and VALID)
                - 2017 Dec 31st
            - tuning the label definition is recommended
                - for the TRAIN dataset, positive labels could be the last n (n=3/4/5/6 records) of the leavers
                - for the VALID dataset, positive labels are all the people who left between 1st Jan and 30th June 2018
                - for both datasets, zero labels are all the remaining records (all for still active, remaining for the leavers)
        4. Output
            - A custom logic is applied to select a subset of the monthly datasets
            - Final TRAIN and VALID datasets are prepared and placed in the Azure Datalake
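
The labelling rules above can be sketched in pandas as follows. This is an illustration under stated assumptions, not the logic in Final_ADS_preparation.ipynb: the column names `employee_id`, `snapshot_date`, `leave_date` and `label` are hypothetical, TRAIN leavers are assumed to be those with a leave date on or before the cutoff, and the VALID frame is assumed to hold one row per employee.

```python
import pandas as pd

CUTOFF = pd.Timestamp("2017-12-31")  # TRAIN/VALID demarcation from the text


def label_train(df: pd.DataFrame, n: int = 3) -> pd.DataFrame:
    """Label the last `n` records of each leaver as 1 and every other record as 0."""
    out = df[df["snapshot_date"] <= CUTOFF]
    out = out.sort_values(["employee_id", "snapshot_date"]).copy()
    # Assumption: a TRAIN leaver has a leave date on or before the cutoff.
    is_leaver = out["leave_date"].notna() & (out["leave_date"] <= CUTOFF)
    # Position of each record counted from the end of the employee's history
    # (0 = most recent record).
    from_end = out.groupby("employee_id").cumcount(ascending=False)
    out["label"] = (is_leaver & (from_end < n)).astype(int)
    return out


def label_valid(df: pd.DataFrame) -> pd.DataFrame:
    """Label employees who left between 1 Jan and 30 Jun 2018 as 1, all others as 0."""
    out = df.copy()
    left_in_window = out["leave_date"].between(
        pd.Timestamp("2018-01-01"), pd.Timestamp("2018-06-30")
    )
    out["label"] = left_in_window.astype(int)
    return out
```

Keeping `n` as a parameter makes it straightforward to compare the n=3/4/5/6 label definitions mentioned above during tuning.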

3. **Turnover Modelling Framework**
    1. Notebook
        1. Name - Turnover_modelling_FW.ipynb
        2. Location - Azure VM xxx
        3. Content
            1. Data Preprocessing
                1. Steps
                    1. Cleaning
                        - Table/Field header cleaning
                        - Missing value treatment module
                    2. Processing
                        - Correlation analysis
                        - Categorical feature encoding
                        - Feature scaling/transformations
                2. Output
                    - The processed TRAIN and VALID datasets, which feed into the modelling framework
            2. Classification module
                1. Steps
                    1. Fit a baseline logistic regression model
                        - if the performance of the simple model suffices, the modelling pipeline ends here
                    2. If the performance is not up to the desired mark
                        - Use the Turnover asset (a much more sophisticated pipeline) to get better results
                        - Separate technical documentation for the asset (only for data scientists)
                2. Output
                    - A final model object that is used for predictions (1)
                    - Missing value treatment object (1)
                    - Categorical feature encoding objects/dictionary (1)
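
The preprocessing and baseline steps above could look roughly like the sketch below. It assumes scikit-learn is among the project requirements, uses median / most-frequent imputation, one-hot encoding and ROC AUC as illustrative choices, and the function names and the `label` column are hypothetical. It is not the code in Turnover_modelling_FW.ipynb.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_baseline(numeric_cols, categorical_cols):
    """Missing value treatment, encoding and scaling feeding a logistic regression."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocess = ColumnTransformer([
        ("num", numeric, numeric_cols),
        ("cat", categorical, categorical_cols),
    ])
    return Pipeline([
        ("preprocess", preprocess),
        ("model", LogisticRegression(max_iter=1000)),
    ])


def fit_and_score(pipeline, train: pd.DataFrame, valid: pd.DataFrame, target="label"):
    """Fit on TRAIN and return the fitted pipeline with its ROC AUC on VALID."""
    pipeline.fit(train.drop(columns=[target]), train[target])
    scores = pipeline.predict_proba(valid.drop(columns=[target]))[:, 1]
    return pipeline, roc_auc_score(valid[target], scores)
```

Note that this sketch bundles the imputer, the encoder and the model into a single pipeline object, whereas the asset keeps them as three separate objects; the bundled form is only a convenience for illustration.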

4. **Model Interpretation and Deployment**
    1. Prediction
        1. Notebook
            1. Name - Predict.ipynb
            2. Location - Azure VM xxx
        2. Steps
            1. Open the Predict.ipynb notebook
            2. Make sure the 3 objects from the previous model creation exercise are present in the folder xxx
            3. Run the notebook to generate the predictions for all the active people as of that day
        3. Output
            - The final predictions (probabilities) are generated and saved as a flatfile
    2. Factor analysis
        1. Notebook
            1. Name - Factor_analysis.ipynb
            2. Location - Azure VM xxx
        2. Steps
            1. Open the Factor_analysis.ipynb notebook
            2. Make sure the required input file (the predictions flatfile from the above step in the pipeline) is present in the location xxx
            3. Run the notebook to generate the final flatfile with the predictions and the macro factors/questions mapped to each employee
        3. Output
            - the final flatfile with everything except for the PBP mapping is generated
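
The prediction step can be sketched as below, assuming the model object exposes a scikit-learn style `predict_proba` and that the active-employee frame is already preprocessed. The function name `score_active` and the column names `employee_id` and `leave_probability` are hypothetical; Predict.ipynb may be organised differently.

```python
import pandas as pd


def score_active(model, active: pd.DataFrame, id_col: str = "employee_id") -> pd.DataFrame:
    """Return one leave probability per active employee, highest risk first."""
    features = active.drop(columns=[id_col])
    # Column 1 is the probability of class 1 (leaver), assuming classes [0, 1].
    proba = model.predict_proba(features)[:, 1]
    out = pd.DataFrame({
        id_col: active[id_col].to_numpy(),
        "leave_probability": proba,
    })
    return out.sort_values("leave_probability", ascending=False).reset_index(drop=True)
```

The result can then be written out with `DataFrame.to_csv(..., index=False)` to produce the predictions flatfile that Factor_analysis.ipynb expects as input.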

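The data sources mapping file is read and printed as a table with:

```python
import pandas as pd
from tabulate import tabulate

# Load the data sources mapping from the mounted Azure Datalake
sources = pd.read_csv('/dbfs/mnt/datalake/NAZ/People/TurnoverModel/Data/Handover/DataSourcesMapping.csv')

# Print the mapping as a pipe-delimited (Markdown) table
print(tabulate(sources, tablefmt="pipe", headers="keys", showindex=False))
```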

## **Consumption Guidelines**
&nbsp;
1. Click (*Stay Interviews*) - this phase details the steps to take once the final predictions dataset with the macro factors mapped has been generated
    1. Steps
        1. Take the final predictions flatfile with the macro factors mapped out
        2. Perform the PBP mapping (manual for now; to be automated later)
            - Steps for the same are:
                1. ...
                2. ...
        3. Create the final output file and export it into the Azure SQL Db
2. Feedback Module
    1. Adoption Module
        1. Trends/Insights Generation (High level stakeholder dashboard)
        2. Feedback Incorporation
        3. Benefit/ROI calculator
    2. Health Module
        1. Data Pollution Module
        2. Model retrain/refresh cadence determiner
      
**__in progress__**

## **Closure Module**
&nbsp;
1. Transition process
2. Leverage Module
3. Appendix and Glossary
4. References
    
**__in progress__**