-----
## Predicting Global Supply Chain Outcomes for Essential HIV Medicines using Machine Learning Methods
------

### Author: Tichakunda Mangono

### **Capstone Project, Udacity Machine Learning Engineer Nanodegree, September 2017**



- ***Key Question:*** *Can we use procurement transaction data to predict whether a delivery will be delayed and to estimate the length of the delay?*
- ***Main Data Source:*** *From The Website: https://data.pepfar.net/additionalData. Procurement transaction data from the Supply Chain Management System (SCMS), administered by the United States Agency for International Development (USAID), provides information on health commodities, prices, and delivery destinations.*

# Notebook 4: Final Model & Results

#### *Chart of Trends in Supply Chain metrics for on-time and in-full delivery rates*

<img src='Data\chart-declining-supply-chain-performance.png' width=600 height=400>

### Project Overview

- **Background:** Only 19.5M of the ~37M people living with HIV are getting the essential medicines they need, so a reliable supply of these medicines is critical. Recent evidence https://www.devex.com/news/exclusive-documents-reveal-largest-usaid-health-project-in-trouble-90933 suggests that supply chain performance for major global programs has worsened after recent changes in the organizations managing the supply chain. See also the chart displayed above.


- **Problem Statement:** Significant delays in the delivery of medicines disrupt treatment, can lead to loss of life, and ultimately increase supply chain costs. The goal is to use machine learning to determine when and which products are likely to be delayed, as well as to quantify the extent of the delay.



- **Datasets & Inputs:** Publicly available ***supply chain data*** from the U.S. President's Emergency Plan for AIDS Relief (PEPFAR) https://data.pepfar.net/additionalData; ***Logistics Performance Index*** data from the World Bank https://lpi.worldbank.org/international/global?sort=asc&order=Infrastructure; ***Fragile States Index*** data from the Fund for Peace http://fundforpeace.org/fsi/excel/; and finally factory location and continent from the Google Maps API: http://maps.googleapis.com/maps/api/geocode/json?



- **Solution Statement:** A combined "classification-then-regression" machine learning model. The classification algorithm predicts whether a particular product will be delayed, and the regression algorithm then predicts the length of the delay on the subset of the data that the classifier predicts will be delayed. This mimics the streamlined, prioritized decision process of a supply chain manager.
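The two-stage idea can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the notebook's actual code: the function names `fit_two_stage` and `predict_two_stage`, the feature matrix, and the choice to train the regressor on the truly delayed training rows are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor


def fit_two_stage(X, delay_days, random_state=0):
    """Fit a delayed/on-time classifier, then a regressor on the delayed rows.

    Assumption: the regressor is trained on rows that were truly delayed.
    """
    is_delayed = (delay_days > 0).astype(int)
    clf = RandomForestClassifier(n_estimators=50, random_state=random_state)
    clf.fit(X, is_delayed)
    reg = RandomForestRegressor(n_estimators=50, random_state=random_state)
    reg.fit(X[is_delayed == 1], delay_days[is_delayed == 1])
    return clf, reg


def predict_two_stage(clf, reg, X):
    """Predict a delay length only for rows the classifier flags; 0 elsewhere."""
    flagged = clf.predict(X) == 1
    predicted_days = np.zeros(X.shape[0])
    if flagged.any():
        predicted_days[flagged] = reg.predict(X[flagged])
    return flagged, predicted_days


# Synthetic stand-in data (not the SCMS dataset): roughly 1 in 4 rows delayed.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
delay_days = np.where(X[:, 0] > 0.7, 10 + 5 * np.abs(X[:, 1]), 0.0)

clf, reg = fit_two_stage(X[:300], delay_days[:300])
flagged, predicted_days = predict_two_stage(clf, reg, X[300:])
print("Flagged as delayed:", int(flagged.sum()), "of", len(flagged))
```

Rows that the classifier does not flag keep a predicted delay of zero, so a manager would only review the flagged subset and its estimated delay lengths.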



- **Benchmark Model:** The default scikit-learn RandomForestClassifier and RandomForestRegressor will be used as benchmarks. Several models will then be explored to improve on the benchmark, including other ensemble and tree-based models, Support Vector Machines (SVM), and XGBoost.



- **Evaluation Metrics:** Recall and F1-score will be used for the classification part, while R-squared and RMSE will be used for the regression part of the combined model.
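As a small worked example of these four metrics, the sketch below computes them with scikit-learn on toy arrays. The labels and delay values are invented for illustration and are not results from this project.

```python
import numpy as np
from sklearn.metrics import f1_score, mean_squared_error, r2_score, recall_score

# Toy classification labels: 1 = delayed, 0 = on time.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0])

recall = recall_score(y_true, y_pred)  # share of true delays that were caught
f1 = f1_score(y_true, y_pred)          # harmonic mean of precision and recall

# Toy regression targets: delay length in days for delayed items.
days_true = np.array([10.0, 20.0, 30.0, 40.0])
days_pred = np.array([12.0, 18.0, 33.0, 37.0])

rmse = np.sqrt(mean_squared_error(days_true, days_pred))  # error in days
r2 = r2_score(days_true, days_pred)                       # variance explained

print(f"recall={recall:.3f} f1={f1:.3f} rmse={rmse:.2f} r2={r2:.3f}")
```

On these toy inputs the recall is 0.5 (2 of 4 delays caught), the F1-score is about 0.571, the RMSE is about 2.55 days, and R-squared is 0.948. Recall is emphasized because a missed delay is usually costlier than a false alarm.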

### Summary of Process so far:

#### 1. Data Cleaning
- Understand the data descriptions and available fields
- Handling Missing Values
    - Dosage, Shipping and Line Item Insurance
- Investigating Misclassified/miscategorized features
    - Purchase Quotation date (implied from purchase order date)
    - Purchase Order date (imputed from delivery date scheduled)
    - Weight: many entries were strings rather than numbers. Calculated the average weight per item within each item, molecule or dosage group, then multiplied by line item quantity
    - Freight Cost: many entries were strings rather than numbers. Same trick as weight, but this time calculated as a proportion of ln_item_weight at the right level. Note bundled vs. unbundled shipments
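The weight imputation described above might look roughly like this in pandas. The column names (`molecule`, `line_item_quantity`, `weight`) and the placeholder strings are hypothetical stand-ins, and the sketch groups by a single level only, whereas the notes mention item, molecule or dosage groups.

```python
import pandas as pd

# Hypothetical sample: some weights are free-text strings instead of numbers.
df = pd.DataFrame({
    "molecule": ["A", "A", "A", "B", "B"],
    "line_item_quantity": [10, 20, 30, 5, 10],
    "weight": ["50", "100", "Weight Captured Separately", "10", "See other line"],
})

# Non-numeric strings become NaN so they can be imputed.
weight_kg = pd.to_numeric(df["weight"], errors="coerce")

# Average weight per single item within each molecule group (NaN rows ignored).
per_item = weight_kg / df["line_item_quantity"]
group_mean = per_item.groupby(df["molecule"]).transform("mean")

# Fill the gaps with group average weight per item times the line item quantity.
df["weight_imputed"] = weight_kg.fillna(group_mean * df["line_item_quantity"])
print(df[["molecule", "line_item_quantity", "weight_imputed"]])
```

Here the third row is filled with 150 (5 per item times 30 items) and the last row with 20 (2 per item times 10 items). Freight cost could be handled the same way, using a cost-per-unit-weight ratio instead of a weight-per-item ratio.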

#### 2. Feature Engineering
- Feature Extraction
    - Dates: year, month, day, weekday, quarter, weekofyear to capture time aspect for purchase order dates, and scheduled delivery date
    - Numeric: counts, sums, proportions and measures of central tendency for country-year, factory-year, vendor-year, molecule-year, brand-year
    - Categorical: weight captured separately, shipment configuration, freight cost included in the commodity price or invoiced separately
    - Predicted variables (actual delivery date less scheduled delivery date)
    - Time series variations and auto-correlation (lagging, cumulative and rolling stats)
- Feature Creation
    - Fragile States Index (FSI) for country stability
    - Logistics Performance Index
    - Factory location, country and continent (separating and identifying origins vs. destination in all indices)
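A minimal sketch of the date-part and lagged features listed above, using pandas. The column names (`country`, `scheduled_delivery_date`, `delay_days`) and the values are hypothetical; the notebook's own helper functions may work differently.

```python
import pandas as pd

# Hypothetical sample of deliveries.
df = pd.DataFrame({
    "country": ["X", "X", "X", "Y", "Y"],
    "scheduled_delivery_date": pd.to_datetime(
        ["2014-01-06", "2014-02-15", "2014-03-20", "2014-01-10", "2014-04-01"]
    ),
    "delay_days": [0, 12, 3, 0, 7],
})
df = df.sort_values(["country", "scheduled_delivery_date"]).reset_index(drop=True)

# Date parts that capture the time aspect of the schedule.
dates = df["scheduled_delivery_date"].dt
df["sched_year"] = dates.year
df["sched_month"] = dates.month
df["sched_weekday"] = dates.weekday  # Monday = 0
df["sched_quarter"] = dates.quarter

# Lagged and cumulative statistics per country. Shifting by one row means
# each feature only uses earlier deliveries, never the current outcome.
by_country = df.groupby("country")["delay_days"]
df["prev_delay"] = by_country.shift(1)
df["cum_mean_prev_delay"] = by_country.transform(
    lambda s: s.shift(1).expanding().mean()
)
print(df)
```

The first delivery in each country has no history, so its lagged features are missing and would need a fill value or a model that tolerates missing data.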
    
#### 3. Exploratory Data Analysis & Feature Selection
- High correlations among the volume, quantity and value features
- More delays over the weekend? Specific years were worse. Not much signal from the quarter
- Significant pairwise correlations
- Numerical: some signal from all of the country, vendor etc. aggregates (high volumes, trade numbers etc.), with vendor features showing noticeably more signal than the others. Some features are useless, though. Select the informative ones and take away the highly correlated ones
- Country Fragility and Logistics indices correlate with delays
- Dimensionality reduction: PCA, first and second dimensions
- Feature importances: several features with small individual impact; importance is shared among correlated features
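One common way to remove highly correlated features and then project onto two principal components is sketched below. The helper `drop_highly_correlated`, the 0.9 threshold, and the synthetic columns are assumptions for illustration, not the selection rule actually used in the earlier notebooks.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def drop_highly_correlated(features, threshold=0.9):
    """Drop the later column of each pair with |correlation| above threshold."""
    corr = features.corr().abs()
    # Keep only the strict upper triangle so each pair is checked once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop), to_drop


# Synthetic features: "value" is almost a multiple of "quantity".
rng = np.random.default_rng(0)
quantity = rng.normal(size=200)
features = pd.DataFrame({
    "quantity": quantity,
    "value": 3.0 * quantity + rng.normal(scale=0.01, size=200),
    "fsi": rng.normal(size=200),
})

reduced, dropped = drop_highly_correlated(features)
print("Dropped:", dropped)

# Standardize before PCA so no feature dominates because of its scale.
components = PCA(n_components=2).fit_transform(
    StandardScaler().fit_transform(reduced)
)
print("PCA output shape:", components.shape)
```

Dropping one member of each correlated pair also makes tree-based feature importances easier to read, since importance is otherwise split between near-duplicate columns.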


#### 5. Pre-processing Pipeline
- Pre-processing steps: StandardScaler, logarithmic transform, dummy (one-hot) encoding, label encoding, and oversampling techniques
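The pre-processing steps could be chained along the following lines. This is a sketch under assumptions: the column names are hypothetical, and plain random oversampling written in NumPy stands in for whichever imblearn technique the project used.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Hypothetical sample with one skewed numeric and one categorical feature.
df = pd.DataFrame({
    "line_item_value": [100.0, 2500.0, 40.0, 900.0, 12000.0, 75.0],
    "shipment_mode": ["Air", "Truck", "Air", "Ocean", "Air", "Truck"],
    "delayed": [0, 0, 0, 0, 1, 1],
})

# Numeric columns: log-transform to tame skew, then standardize.
numeric = Pipeline([
    ("log", FunctionTransformer(np.log1p)),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer(
    [
        ("num", numeric, ["line_item_value"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["shipment_mode"]),
    ],
    sparse_threshold=0,  # always return a dense array
)
X = preprocess.fit_transform(df[["line_item_value", "shipment_mode"]])
y = df["delayed"].to_numpy()

# Random oversampling: resample minority rows until both classes are equal.
rng = np.random.default_rng(0)
minority = np.flatnonzero(y == 1)
majority = np.flatnonzero(y == 0)
extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
keep = np.concatenate([majority, minority, extra])
X_balanced, y_balanced = X[keep], y[keep]
print(X.shape, np.bincount(y_balanced))
```

Oversampling should only be applied to the training split, so that duplicated delayed rows never leak into the data used for evaluation.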

#### 6. Model Benchmark
- Classification with RandomForestClassifier: Recall = 0.264, F1-score = 0.379, total of 107 delays correctly found
- Regression with RandomForestRegressor: R-squared = 0.876, RMSE = 14 days vs. a mean of 27 days

#### 7. Model Selection
- Classification: LinearSVC, SVC, KNeighborsClassifier, LogisticRegressionCV, LogisticRegression, SGDClassifier, BaggingClassifier, ExtraTreesClassifier, RandomForestClassifier, MLPClassifier. Show the chart and proof of the top performers
    - ExtraTrees, MLP, Random Forest
- Regression: show the charts and the numbers; ExtraTrees, MLP, Random Forest
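A comparison of candidate classifiers can be run with a simple cross-validation loop like the one below. The synthetic imbalanced dataset, the subset of four models, and the settings are illustrative assumptions, so the ranking it prints says nothing about the project's actual results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic imbalanced data: about 15% of rows belong to the "delayed" class.
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.85, 0.15], random_state=0
)

candidates = {
    "ExtraTrees": ExtraTreesClassifier(n_estimators=100, random_state=0),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "KNeighbors": KNeighborsClassifier(),
}

# Mean 5-fold cross-validated recall on the positive (delayed) class.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="recall").mean()
    for name, model in candidates.items()
}

ranking = sorted(scores, key=scores.get, reverse=True)
for name in ranking:
    print(f"{name:20s} recall={scores[name]:.3f}")
```

Swapping `scoring="recall"` for `"f1"` and the classifiers for regressors scored with `"r2"` gives the matching comparison for the regression stage.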

### Final Model and Results
#### **Synopsis:**

A combined “classification-then-regression” machine learning model can help avoid the public health 
and economic costs associated with delayed deliveries of HIV medicines. An ensemble classification 
algorithm, Extra Trees, is able to detect 1 in 2 delayed item deliveries. This is a significant 
improvement over a null hypothesis model, which would detect only 1 in 9 delayed items, and a 
considerable improvement over the benchmark Random Forest classification algorithm, which 
catches 1 in 3 delayed items. Once delayed items are identified, an Extra Trees regression 
algorithm can predict the length of delay to within 12 days (RMSE) with an R-Squared of 0.86, 
an improvement over 16 days (RMSE) and an R-Squared of 0.81 with the benchmark Random Forest regression. 

### Requirements: software and libraries used
- A Python module, "my_helper_functions.py", is included in this folder with a set of my own helper functions
- The rest of the libraries can be installed using either Anaconda or pip
- I was running Python 3.6.1 on a 64-bit Windows system

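#### Install/Download the following libraries and APIs 
1. Python 3.6.1 
2. my_helper_functions - provided in this folder. This is my own module of helper functions. It will be required to run most of the code
3. pandas
4. numpy
5. matplotlib
6. seaborn
7. time (Python standard library, no install needed)
8. datetime (Python standard library, no install needed)
9. pandas_profiling
10. pivottablejs
11. missingno
12. os (Python standard library, no install needed)
13. sklearn (installed as scikit-learn)
14. yellowbrick
15. imblearn (installed as imbalanced-learn)
16. googlemaps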
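Before running the notebook, a quick check like the following can report which of the listed modules are not importable in the current environment. The helper `find_missing` is a hypothetical convenience written for this example and is not part of my_helper_functions.

```python
import importlib.util

# Import names of the modules listed above (the local helper module is omitted).
REQUIRED = [
    "pandas", "numpy", "matplotlib", "seaborn", "time", "datetime",
    "pandas_profiling", "pivottablejs", "missingno", "os", "sklearn",
    "yellowbrick", "imblearn", "googlemaps",
]


def find_missing(module_names):
    """Return the module names that cannot be found in this environment."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


missing = find_missing(REQUIRED)
print("Missing modules:", ", ".join(missing) if missing else "none")
```

Anything reported as missing can then be installed with Anaconda or pip before opening the notebook.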