## Data

This dataset was originally posted on Kaggle. **The key task is to predict whether a product/part will go on backorder.**

Product backorder may be the result of strong sales performance (e.g. the product is in such high demand that production cannot keep up with sales). However, backorders can upset consumers and lead to canceled orders and decreased customer loyalty. Companies want to avoid backorders, but also avoid overstocking every product (leading to higher inventory costs).

This dataset has ~1.9 million observations of products/parts over an 8-week period. The source of the data is unreferenced.

* __Outcome__: whether the product went on backorder
* __Predictors__: Current inventory, sales history, forecasted sales, recommended stocking amount, product risk flags etc. (22 predictors in total)

The features and the target variable of the dataset are as follows:

**Description**
~~~
# Features: 
sku - Random ID for the product
national_inv - Current inventory level for the part
lead_time - Transit time for product (if available)
in_transit_qty - Amount of product in transit from source
forecast_3_month - Forecast sales for the next 3 months
forecast_6_month - Forecast sales for the next 6 months
forecast_9_month - Forecast sales for the next 9 months
sales_1_month - Sales quantity for the prior 1 month time period
sales_3_month - Sales quantity for the prior 3 month time period
sales_6_month - Sales quantity for the prior 6 month time period
sales_9_month - Sales quantity for the prior 9 month time period
min_bank - Minimum recommended amount to stock
potential_issue - Source issue for part identified
pieces_past_due - Parts overdue from source
perf_6_month_avg - Source performance for prior 6 month period
perf_12_month_avg - Source performance for prior 12 month period
local_bo_qty - Amount of stock orders overdue
deck_risk - Part risk flag
oe_constraint - Part risk flag
ppap_risk - Part risk flag
stop_auto_buy - Part risk flag
rev_stop - Part risk flag

# Target 
went_on_backorder - Product actually went on backorder
~~~

Two data files are given and the files are accessible in the JupyterHub environment:
 * `/dsa/data/all_datasets/back_order/Kaggle_Training_Dataset_v2.csv`
 * `/dsa/data/all_datasets/back_order/Kaggle_Test_Dataset_v2.csv`


 
<span style='background:yellow'>**NOTE:** The training data file is 117MB. **Do NOT try to version control any data files** (training, test, or created); you may blow through the _push limit_.</span>  
You can easily lock up a notebook with bad coding practices.  
Please save your project early and often, and use `git commit` to checkpoint your progress.

## Exploration, Training, and Validation

You will examine the _training_ dataset and perform 
 * **data preparation and exploratory data analysis**, 
 * **anomaly detection / removal**,
 * **dimensionality reduction** and then
 * **train and validate**.

We aim to develop at least 3 unique pipelines (see M5 to learn about pipelines). **By unique we mean that if an ML method (i.e., a classification, feature selection, or anomaly detection method) is used in Pipeline 1, that method should not be used in Pipeline 2 or Pipeline 3.** 

For the 3 pipelines, you are free to pick any models from Scikit-Learn or any custom models that work within an sklearn pipeline. Here is a pool of methods. 


### Pool of Anomaly Detection Methods (Discussed in M4)
1. IsolationForest
2. EllipticEnvelope
3. LocalOutlierFactor
4. OneClassSVM
5. SGDOneClassSVM

### Pool of Feature Selection Methods (Discussed in M3)

1. VarianceThreshold
2. SelectKBest with any scoring method (e.g., chi2, f_classif, mutual_info_classif)
3. SelectPercentile
4. SelectFpr, SelectFdr, or SelectFwe
5. GenericUnivariateSelect
6. PCA
7. FactorAnalysis
8. RFE
9. SelectFromModel


### Classification Methods (Discussed in M1-M2)
1. Decision Tree
2. Random Forest
3. Logistic Regression
4. Naive Bayes
5. Linear SVC
6. SVC with kernels
7. KNeighborsClassifier
8. GradientBoostingClassifier
9. XGBClassifier
10. LGBM Classifier
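
As an illustration of how one method from each pool might be combined, here is a minimal sketch. It runs on synthetic data from `make_classification`, not on the backorder data, and the particular choices (`IsolationForest`, `SelectKBest`, `RandomForestClassifier`, and all parameter values) are assumptions made for the example. Because a plain scikit-learn `Pipeline` does not remove rows between steps, the anomaly filter is applied before the pipeline in this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest, RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the sampled backorder data (assumption, not the real dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=16,
                                     n_informative=6, random_state=17)

# Step 1: anomaly detection. fit_predict returns +1 for inliers and -1 for outliers.
detector = IsolationForest(contamination=0.05, random_state=17)
inlier_mask = detector.fit_predict(X_demo) == 1
X_inliers, y_inliers = X_demo[inlier_mask], y_demo[inlier_mask]

# Steps 2 and 3: feature selection followed by classification in one Pipeline
pipeline_1 = Pipeline([
    ('scale', StandardScaler()),
    ('select', SelectKBest(score_func=f_classif, k=8)),
    ('classify', RandomForestClassifier(n_estimators=50, random_state=17)),
])
pipeline_1.fit(X_inliers, y_inliers)

print('rows kept after anomaly removal:', X_inliers.shape[0])
print('features kept:', pipeline_1.named_steps['select'].get_support().sum())
```

A second and third pipeline would swap in different entries from each pool so that no method is reused.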



### Validation Assessment

Your first, intermediate, result will be an **assessment** of the models' performance.
This assessment should be grounded in a 5-fold or 10-fold cross-validation methodology. Give an unbiased evaluation of the best model within each pipeline. This should include, as a bare minimum, the confusion matrix, precision, recall, F1-score, and accuracy for each classifier.
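
One possible way to produce these metrics is sketched below. It uses synthetic data and a `LogisticRegression` placeholder (both assumptions for the example); `cross_val_predict` collects out-of-fold predictions so that every row is scored by a model that did not see it during fitting.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in for the sampled training data (assumption)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=17)

# Stratified 5-fold split; shuffling matters when rows are ordered by class
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)
model = LogisticRegression(max_iter=1000)

# Out-of-fold predictions: each row is predicted by a model fitted on the other folds
y_pred = cross_val_predict(model, X_demo, y_demo, cv=cv)

cm = confusion_matrix(y_demo, y_pred)
print(cm)
print(classification_report(y_demo, y_pred, digits=3))
```

`classification_report` prints precision, recall, F1-score, and accuracy in one table; the confusion matrix has true classes in rows and predicted classes in columns.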

## Testing

Once we have chosen our final model, we need to re-train it using all the training data. The final evaluation should then be performed on the given test dataset. 
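
The refit-then-test step could look roughly like the sketch below. The data is synthetic and the `DecisionTreeClassifier` stands in for whichever model is actually chosen; both are assumptions for illustration.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-ins for the prepared training and test sets (assumption)
X_all, y_all = make_classification(n_samples=600, n_features=10, random_state=17)
X_train, X_test, y_train, y_test = train_test_split(
    X_all, y_all, test_size=0.25, stratify=y_all, random_state=17)

# Hypothetical winner of the validation stage
chosen_model = DecisionTreeClassifier(max_depth=4, random_state=17)

# Refit a fresh copy on all training rows, then score once on the held-out test rows
final_model = clone(chosen_model).fit(X_train, y_train)
test_pred = final_model.predict(X_test)
print(classification_report(y_test, test_pred, digits=3))
```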





--- 
##  Overview / Roadmap

**General steps**:
* Part I: Preprocessing
  * Dataset carpentry & Exploratory Data Analysis
    * Develop functions to perform the necessary steps; you will have to apply the same carpentry to both the Training and the Testing data.
  * Generate a **smart sample** of the data
* Part II: Training and Validation
  * Create 3 alternative pipelines, each does:
      * Anomaly detection
      * Dimensionality reduction
      * Classification
* Part III: Testing
  * Train the chosen model on the full training data
  * Evaluate the model against the testing data
  * Write a summary of your processing and an analysis of the model performance
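
Since the same carpentry has to reach both files, one option is to split it into a function that learns its settings from the training frame and a function that applies them to any frame. The sketch below is one way to do that under stated assumptions: the function names `fit_preprocessing` and `apply_preprocessing` are hypothetical, and the tiny frames are made up to show the call pattern.

```python
import numpy as np
import pandas as pd

def fit_preprocessing(train_df, flag_columns, threshold=0.95):
    """Learn fill values and redundant columns from the training frame only."""
    numeric = train_df.select_dtypes(include='number')
    corr = numeric.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return {
        'drop': [c for c in upper.columns if (upper[c] > threshold).any()],
        'numeric_fill': numeric.mean().to_dict(),
        'flag_fill': {c: train_df[c].mode()[0] for c in flag_columns},
    }

def apply_preprocessing(df, params):
    """Apply the learned steps to any frame (training or testing)."""
    out = df.copy()
    out = out.fillna({**params['numeric_fill'], **params['flag_fill']})
    for col in params['flag_fill']:
        out[col] = (out[col] == 'Yes').astype(int)
    return out.drop(columns=params['drop'])

# Tiny made-up frames, used only to show the call pattern
train_demo = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'a_copy': [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with 'a'
    'b': [5.0, np.nan, 1.0, 3.0],
    'flag': ['No', 'Yes', 'No', None],
})
test_demo = pd.DataFrame({'a': [9.0], 'a_copy': [18.0], 'b': [np.nan], 'flag': [None]})

params = fit_preprocessing(train_demo, flag_columns=['flag'])
train_ready = apply_preprocessing(train_demo, params)
test_ready = apply_preprocessing(test_demo, params)
print(test_ready)
```

Learning the fill values and the drop list from the training frame and reusing them on the test frame keeps the two files in the same column layout and avoids taking statistics from the test data.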




# Part I: Data Preprocessing

In this part, we preprocess the given training set. 


In [1]:
%matplotlib inline
import matplotlib.pyplot as plt

import os, sys
import itertools
import numpy as np
import pandas as pd
import seaborn as sns
import joblib

from imblearn.under_sampling import ClusterCentroids#,TomekLinks
from sklearn.utils import resample

## Load dataset

**Description**
~~~
sku - Random ID for the product
national_inv - Current inventory level for the part
lead_time - Transit time for product (if available)
in_transit_qty - Amount of product in transit from source
forecast_3_month - Forecast sales for the next 3 months
forecast_6_month - Forecast sales for the next 6 months
forecast_9_month - Forecast sales for the next 9 months
sales_1_month - Sales quantity for the prior 1 month time period
sales_3_month - Sales quantity for the prior 3 month time period
sales_6_month - Sales quantity for the prior 6 month time period
sales_9_month - Sales quantity for the prior 9 month time period
min_bank - Minimum recommended amount to stock
potential_issue - Source issue for part identified
pieces_past_due - Parts overdue from source
perf_6_month_avg - Source performance for prior 6 month period
perf_12_month_avg - Source performance for prior 12 month period
local_bo_qty - Amount of stock orders overdue
deck_risk - Part risk flag
oe_constraint - Part risk flag
ppap_risk - Part risk flag
stop_auto_buy - Part risk flag
rev_stop - Part risk flag
went_on_backorder - Product actually went on backorder. 
~~~

**Note**: This is a real-world dataset without any preprocessing.  
There will also be warnings due to the fact that the 1st column mixes integer and string values.  
**NOTE:** The last column, `went_on_backorder`, is what we are trying to predict.
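
pandas typically emits a `DtypeWarning` when a column read in chunks ends up holding more than one type. A minimal sketch of one way to handle this is shown below on a made-up miniature CSV; the footer-like last line is an assumption about what a mixed-type `sku` column might contain, not a statement about the real file.

```python
import io
import pandas as pd

# Made-up miniature CSV; the last line imitates a non-data footer row (assumption)
raw = io.StringIO(
    "sku,national_inv,went_on_backorder\n"
    "1001,11,No\n"
    "1002,24,Yes\n"
    "(2 rows),,\n"
)

# Reading sku as text avoids pandas guessing a mixed integer/string type
demo = pd.read_csv(raw, dtype={'sku': str})

# Rows where every column except sku is missing carry no information
empty_rows = demo.drop(columns='sku').isna().all(axis=1)
demo = demo.loc[~empty_rows].reset_index(drop=True)
print(demo)
```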


In [2]:
# Dataset location
DATASET = '/dsa/data/all_datasets/back_order/Kaggle_Training_Dataset_v2.csv'
assert os.path.exists(DATASET)

# Load and shuffle
dataset = pd.read_csv(DATASET).sample(frac = 1).reset_index(drop=True)

dataset.head().transpose()



|  | 0 | 1 | 2 | 3 | 4 |
| --- | --- | --- | --- | --- | --- |
| sku | 3013412 | 3273206 | 2833348 | 1123844 | 2920347 |
| national_inv | 11.0 | 24.0 | 58.0 | 5000.0 | 510.0 |
| lead_time | 8.0 | 2.0 | 16.0 | 52.0 | 8.0 |
| in_transit_qty | 0.0 | 0.0 | 0.0 | 0.0 | 152.0 |
| forecast_3_month | 0.0 | 48.0 | 0.0 | 0.0 | 240.0 |
| forecast_6_month | 0.0 | 48.0 | 0.0 | 0.0 | 520.0 |
| forecast_9_month | 0.0 | 48.0 | 0.0 | 0.0 | 660.0 |
| sales_1_month | 2.0 | 0.0 | 4.0 | 0.0 | 163.0 |
| sales_3_month | 4.0 | 0.0 | 18.0 | 0.0 | 298.0 |
| sales_6_month | 7.0 | 0.0 | 40.0 | 0.0 | 642.0 |


In [3]:
dataset.describe().transpose()

|  | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| national_inv | 1687860.0 | 496.111782 | 29615.233831 | -27256.0 | 4.0 | 15.0 | 80.0 | 12334404.0 |
| lead_time | 1586967.0 | 7.872267 | 7.056024 | 0.0 | 4.0 | 8.0 | 9.0 | 52.0 |
| in_transit_qty | 1687860.0 | 44.052022 | 1342.741731 | 0.0 | 0.0 | 0.0 | 0.0 | 489408.0 |
| forecast_3_month | 1687860.0 | 178.119284 | 5026.553102 | 0.0 | 0.0 | 0.0 | 4.0 | 1427612.0 |
| forecast_6_month | 1687860.0 | 344.986664 | 9795.151861 | 0.0 | 0.0 | 0.0 | 12.0 | 2461360.0 |
| forecast_9_month | 1687860.0 | 506.364431 | 14378.923562 | 0.0 | 0.0 | 0.0 | 20.0 | 3777304.0 |
| sales_1_month | 1687860.0 | 55.926069 | 1928.195879 | 0.0 | 0.0 | 0.0 | 4.0 | 741774.0 |
| sales_3_month | 1687860.0 | 175.02593 | 5192.377625 | 0.0 | 0.0 | 1.0 | 15.0 | 1105478.0 |
| sales_6_month | 1687860.0 | 341.728839 | 9613.167104 | 0.0 | 0.0 | 2.0 | 31.0 | 2146625.0 |
| sales_9_month | 1687860.0 | 525.269701 | 14838.613523 | 0.0 | 0.0 | 4.0 | 47.0 | 3205172.0 |


## Processing

In this section, the goal is to figure out:

* which columns we can use directly,  
* which columns are usable after some processing,  
* and which columns are not processable or obviously irrelevant (like product id) that we will discard.

Then process and prepare this dataset for creating a predictive model.

In [4]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1687861 entries, 0 to 1687860
Data columns (total 23 columns):
 #   Column             Non-Null Count    Dtype  
---  ------             --------------    -----  
 0   sku                1687861 non-null  object 
 1   national_inv       1687860 non-null  float64
 2   lead_time          1586967 non-null  float64
 3   in_transit_qty     1687860 non-null  float64
 4   forecast_3_month   1687860 non-null  float64
 5   forecast_6_month   1687860 non-null  float64
 6   forecast_9_month   1687860 non-null  float64
 7   sales_1_month      1687860 non-null  float64
 8   sales_3_month      1687860 non-null  float64
 9   sales_6_month      1687860 non-null  float64
 10  sales_9_month      1687860 non-null  float64
 11  min_bank           1687860 non-null  float64
 12  potential_issue    1687860 non-null  object 
 13  pieces_past_due    1687860 non-null  float64
 14  perf_6_month_avg   1687860 non-null  float64
 15  perf_12_month_avg  1687860 non-n

### Take samples and examine the dataset

In [5]:
dataset.iloc[:3,:6]

|  | sku | national_inv | lead_time | in_transit_qty | forecast_3_month | forecast_6_month |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 3013412 | 11.0 | 8.0 | 0.0 | 0.0 | 0.0 |
| 1 | 3273206 | 24.0 | 2.0 | 0.0 | 48.0 | 48.0 |
| 2 | 2833348 | 58.0 | 16.0 | 0.0 | 0.0 | 0.0 |


In [6]:
dataset.iloc[:3,6:12]

|  | forecast_9_month | sales_1_month | sales_3_month | sales_6_month | sales_9_month | min_bank |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.0 | 2.0 | 4.0 | 7.0 | 18.0 | 0.0 |
| 1 | 48.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 4.0 | 18.0 | 40.0 | 60.0 | 8.0 |


In [7]:
dataset.iloc[:3,12:18]

|  | potential_issue | pieces_past_due | perf_6_month_avg | perf_12_month_avg | local_bo_qty | deck_risk |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | No | 0.0 | 0.97 | 0.97 | 0.0 | No |
| 1 | No | 0.0 | 1.0 | 1.0 | 0.0 | Yes |
| 2 | No | 0.0 | 0.85 | 0.8 | 0.0 | No |


In [8]:
dataset.iloc[:3,18:24]

|  | oe_constraint | ppap_risk | stop_auto_buy | rev_stop | went_on_backorder |
| --- | --- | --- | --- | --- | --- |
| 0 | No | Yes | Yes | No | No |
| 1 | No | No | Yes | No | No |
| 2 | No | No | Yes | No | No |


### Drop columns that are obviously irrelevant or not processable

In [9]:
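# Correlate numeric columns only; object columns such as sku cannot be correlated
dataset.select_dtypes(include='number').corr().abs()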

|  | national_inv | lead_time | in_transit_qty | forecast_3_month | forecast_6_month | forecast_9_month | sales_1_month | sales_3_month | sales_6_month | sales_9_month | min_bank | pieces_past_due | perf_6_month_avg | perf_12_month_avg | local_bo_qty |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| national_inv | 1.0 | 0.003326 | 0.098238 | 0.078199 | 0.079744 | 0.078948 | 0.147449 | 0.192605 | 0.225067 | 0.239613 | 0.399969 | 0.030677 | 0.013544 | 0.010732 | 0.014887 |
| lead_time | 0.003326 | 1.0 | 0.007238 | 0.00801 | 0.008513 | 0.008738 | 0.006013 | 0.007279 | 0.00727 | 0.007313 | 0.008198 | 0.0015 | 0.09994 | 0.106019 | 0.001306 |
| in_transit_qty | 0.098238 | 0.007238 | 1.0 | 0.662648 | 0.687768 | 0.679152 | 0.61927 | 0.698417 | 0.689908 | 0.659372 | 0.749974 | 0.16746 | 0.003282 | 0.004292 | 0.066612 |
| forecast_3_month | 0.078199 | 0.00801 | 0.662648 | 1.0 | 0.99049 | 0.977337 | 0.684494 | 0.781178 | 0.835585 | 0.825539 | 0.725042 | 0.361214 | 0.008445 | 0.008694 | 0.039419 |
| forecast_6_month | 0.079744 | 0.008513 | 0.687768 | 0.99049 | 1.0 | 0.994945 | 0.70177 | 0.808755 | 0.868099 | 0.858253 | 0.738553 | 0.363147 | 0.008343 | 0.008499 | 0.039724 |
| forecast_9_month | 0.078948 | 0.008738 | 0.679152 | 0.977337 | 0.994945 | 1.0 | 0.716367 | 0.829911 | 0.891884 | 0.881894 | 0.735891 | 0.366001 | 0.008306 | 0.008421 | 0.039732 |
| sales_1_month | 0.147449 | 0.006013 | 0.61927 | 0.684494 | 0.70177 | 0.716367 | 1.0 | 0.918548 | 0.867479 | 0.815959 | 0.756137 | 0.249526 | 0.001163 | 0.00237 | 0.066188 |
| sales_3_month | 0.192605 | 0.007279 | 0.698417 | 0.781178 | 0.808755 | 0.829911 | 0.918548 | 1.0 | 0.975594 | 0.929491 | 0.856017 | 0.304565 | 0.001488 | 0.002837 | 0.07103 |
| sales_6_month | 0.225067 | 0.00727 | 0.689908 | 0.835585 | 0.868099 | 0.891884 | 0.867479 | 0.975594 | 1.0 | 0.971833 | 0.83711 | 0.323552 | 0.002898 | 0.004221 | 0.057765 |
| sales_9_month | 0.239613 | 0.007313 | 0.659372 | 0.825539 | 0.858253 | 0.881894 | 0.815959 | 0.929491 | 0.971833 | 1.0 | 0.80089 | 0.317692 | 0.003438 | 0.004749 | 0.04888 |


In [10]:
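# Create correlation matrix of the numeric columns, fill diagonal and upper half with NaNs
corr = dataset.select_dtypes(include = 'number').corr().abs()
mask = np.zeros_like(corr, dtype = bool)
mask[np.triu_indices_from(mask)] = True
corr[mask] = np.nan
(corr
 .style
 .background_gradient(cmap = 'coolwarm', axis = None, vmin = -1, vmax = 1)
 .highlight_null('#f1f1f1')  # Color NaNs grey (positional: the keyword name differs across pandas versions)
 .format('{:.3f}'))          # Three decimals; replaces set_precision, which newer pandas no longer provides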

|  | national_inv | lead_time | in_transit_qty | forecast_3_month | forecast_6_month | forecast_9_month | sales_1_month | sales_3_month | sales_6_month | sales_9_month | min_bank | pieces_past_due | perf_6_month_avg | perf_12_month_avg | local_bo_qty |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| national_inv |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| lead_time | 0.003 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| in_transit_qty | 0.098 | 0.007 |  |  |  |  |  |  |  |  |  |  |  |  |  |
| forecast_3_month | 0.078 | 0.008 | 0.663 |  |  |  |  |  |  |  |  |  |  |  |  |
| forecast_6_month | 0.08 | 0.009 | 0.688 | 0.99 |  |  |  |  |  |  |  |  |  |  |  |
| forecast_9_month | 0.079 | 0.009 | 0.679 | 0.977 | 0.995 |  |  |  |  |  |  |  |  |  |  |
| sales_1_month | 0.147 | 0.006 | 0.619 | 0.684 | 0.702 | 0.716 |  |  |  |  |  |  |  |  |  |
| sales_3_month | 0.193 | 0.007 | 0.698 | 0.781 | 0.809 | 0.83 | 0.919 |  |  |  |  |  |  |  |  |
| sales_6_month | 0.225 | 0.007 | 0.69 | 0.836 | 0.868 | 0.892 | 0.867 | 0.976 |  |  |  |  |  |  |  |
| sales_9_month | 0.24 | 0.007 | 0.659 | 0.826 | 0.858 | 0.882 | 0.816 | 0.929 | 0.972 |  |  |  |  |  |  |


In [11]:
# Add code below this comment  (Question #E101)
# ----------------------------------
def drop_uneccessary_features(data_frame, threshold = 0.95):
    """Drop `sku` and one column of every highly correlated pair, in place."""
    # Remove sku feature (an identifier, not a predictor)
    data_frame.drop('sku', axis = 1, inplace = True)

    # Create absolute correlation matrix of the numeric columns
    corr_matrix = data_frame.select_dtypes(include = 'number').corr().abs()

    # Select upper triangle of correlation matrix
    # (the np.bool alias is deprecated since NumPy 1.20; the builtin bool is equivalent)
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k = 1).astype(bool))

    # Find highly-correlated features to drop
    to_drop = [column for column in upper.columns if any(upper[column] > threshold)]

    # Drop highly-correlated features 
    data_frame.drop(to_drop, axis = 1, inplace = True)

drop_uneccessary_features(dataset)

dataset.info()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1687861 entries, 0 to 1687860
Data columns (total 17 columns):
 #   Column             Non-Null Count    Dtype  
---  ------             --------------    -----  
 0   national_inv       1687860 non-null  float64
 1   lead_time          1586967 non-null  float64
 2   in_transit_qty     1687860 non-null  float64
 3   forecast_3_month   1687860 non-null  float64
 4   sales_1_month      1687860 non-null  float64
 5   sales_3_month      1687860 non-null  float64
 6   min_bank           1687860 non-null  float64
 7   potential_issue    1687860 non-null  object 
 8   pieces_past_due    1687860 non-null  float64
 9   perf_6_month_avg   1687860 non-null  float64
 10  local_bo_qty       1687860 non-null  float64
 11  deck_risk          1687860 non-null  object 
 12  oe_constraint      1687860 non-null  object 
 13  ppap_risk          1687860 non-null  object 
 14  stop_auto_buy      1687860 non-null  object 
 15  rev_stop           1687860 non-n
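
The correlation filter used in `drop_uneccessary_features` can be seen on a small made-up frame. This is only a sketch: the column names and values are invented for the example. It shows that, for each highly correlated pair, the later column is the one that gets dropped while the earlier one is kept.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(17)
base = rng.normal(size=200)

# Invented columns for illustration only
toy = pd.DataFrame({
    'sales_1': base,
    'sales_3': base * 3 + rng.normal(scale=0.01, size=200),  # near-duplicate of sales_1
    'lead': rng.normal(size=200),                            # independent column
})

corr = toy.corr().abs()
# Keep only entries above the diagonal so each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
print(to_drop)
```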

### Find unique values of string columns

Now make sure that these Yes/No columns really contain only Yes or No.  
If that's true, proceed to convert them into binary values (0s and 1s).

**Tip**: use [unique()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.unique.html) function of pandas Series.

Example

~~~python
print('went_on_backorder', dataset['went_on_backorder'].unique())
~~~

In [12]:
# All the column names of these yes/no columns
yes_no_columns = list(filter(lambda i: dataset[i].dtype!=np.float64, dataset.columns))
print(yes_no_columns)

# Add code below this comment  (Question #E102)
# ----------------------------------
for column_name in yes_no_columns:
    print(column_name, dataset[column_name].unique())


['potential_issue', 'deck_risk', 'oe_constraint', 'ppap_risk', 'stop_auto_buy', 'rev_stop', 'went_on_backorder']
potential_issue ['No' 'Yes' nan]
deck_risk ['No' 'Yes' nan]
oe_constraint ['No' 'Yes' nan]
ppap_risk ['Yes' 'No' nan]
stop_auto_buy ['Yes' 'No' nan]
rev_stop ['No' 'Yes' nan]
went_on_backorder ['No' 'Yes' nan]


You may also see **nan** among the possible values; it represents missing values in the dataset.

We fill them with the most frequent value, the [Mode](https://en.wikipedia.org/wiki/Mode_%28statistics%29) in statistics.

In [13]:
for column_name in yes_no_columns:
    # mode() ignores NaN, so the result is the most frequent real value
    mode = dataset[column_name].mode()[0]
    print('Filling missing values of {} with {}'.format(column_name, mode))
    # Assign the result back; in-place fillna on a selected column is unreliable in newer pandas
    dataset[column_name] = dataset[column_name].fillna(mode)

# Fill missing lead time data with mean 
dataset['lead_time'] = dataset['lead_time'].fillna(dataset['lead_time'].mean())

# Remove any rows with any remaining NaN values
dataset = dataset.dropna(how = 'any')
dataset = dataset.reset_index(drop = True)

print(dataset.isnull().sum()) # view NaN counts in columns

Filling missing values of potential_issue with No
Filling missing values of deck_risk with No
Filling missing values of oe_constraint with No
Filling missing values of ppap_risk with No
Filling missing values of stop_auto_buy with Yes
Filling missing values of rev_stop with No
Filling missing values of went_on_backorder with No
national_inv         0
lead_time            0
in_transit_qty       0
forecast_3_month     0
sales_1_month        0
sales_3_month        0
min_bank             0
potential_issue      0
pieces_past_due      0
perf_6_month_avg     0
local_bo_qty         0
deck_risk            0
oe_constraint        0
ppap_risk            0
stop_auto_buy        0
rev_stop             0
went_on_backorder    0
dtype: int64


### Convert yes/no columns into binary (0s and 1s)

In [14]:
# Add code below this comment  (Question #E103)
# ----------------------------------
for col in ['potential_issue', 'deck_risk', 'oe_constraint', 'ppap_risk',
            'stop_auto_buy', 'rev_stop', 'went_on_backorder']:
    dataset[col] = (dataset[col] == 'Yes').astype(int)


Now all columns should be either int64 or float64.

In [15]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1687860 entries, 0 to 1687859
Data columns (total 17 columns):
 #   Column             Non-Null Count    Dtype  
---  ------             --------------    -----  
 0   national_inv       1687860 non-null  float64
 1   lead_time          1687860 non-null  float64
 2   in_transit_qty     1687860 non-null  float64
 3   forecast_3_month   1687860 non-null  float64
 4   sales_1_month      1687860 non-null  float64
 5   sales_3_month      1687860 non-null  float64
 6   min_bank           1687860 non-null  float64
 7   potential_issue    1687860 non-null  int64  
 8   pieces_past_due    1687860 non-null  float64
 9   perf_6_month_avg   1687860 non-null  float64
 10  local_bo_qty       1687860 non-null  float64
 11  deck_risk          1687860 non-null  int64  
 12  oe_constraint      1687860 non-null  int64  
 13  ppap_risk          1687860 non-null  int64  
 14  stop_auto_buy      1687860 non-null  int64  
 15  rev_stop           1687860 non-n

### Smartly sample the data into a more manageable size for cross-fold validation in Grid Search

**Note**: This is a good point to re-balance the dataset before actually moving on. For sampling we can either take advantage of the pandas/numpy `sample` method or use the `imblearn` [package](https://imbalanced-learn.org/stable/user_guide.html#user-guide). 

The `imblearn` module implements its own pipeline on top of the sklearn pipeline, and it is possible to add sampling strategies within the `imblearn` pipeline. We are not required to use the `imblearn` pipeline for this project. 

In [16]:
num_backorder = np.sum(dataset['went_on_backorder']==1)
print('backorder ratio:', num_backorder, '/', len(dataset), '=', num_backorder / len(dataset))

backorder ratio: 11293 / 1687860 = 0.006690720794378681
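
With a positive rate this low, accuracy alone says little. The short calculation below uses the two counts printed above to show what a classifier that always answers "No" would score; it is a worked illustration, not part of the required solution.

```python
# Counts taken from the backorder ratio printed above
n_backorder = 11293
n_total = 1687860

# A classifier that always answers "No" is right on every non-backorder row
baseline_accuracy = (n_total - n_backorder) / n_total
baseline_recall = 0 / n_backorder  # it never finds a single backorder

print(f'always-No accuracy: {baseline_accuracy:.4f}')
print(f'always-No recall:   {baseline_recall:.4f}')
```

This is why the assessment asks for the confusion matrix, precision, recall, and F1-score in addition to accuracy.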


Create a smart sample of the dataset. You can either store the data to csv files or simply use `joblib` to dump the variables and load them in Part 2. 

**Example code for using joblib:**

Say we need to store three objects (sampled_X, sampled_y, model) to a file. 

```python
import joblib

# for dumping 
joblib.dump([sampled_X, sampled_y, model], 'data/sample-data-v1.pkl')

# for loading
sampled_X, sampled_y, model = joblib.load('data/sample-data-v1.pkl')

```


In [17]:
dataset['went_on_backorder'].value_counts()

0    1676567
1      11293
Name: went_on_backorder, dtype: int64

In [18]:
# Disabled experiment: undersampling with Tomek links (requires
# `from collections import Counter` and `from imblearn.under_sampling import TomekLinks`)
# X = dataset.iloc[:, :-1]
# y = dataset.went_on_backorder
# tl = TomekLinks()
#
# # Undersample data
# resampled_X, resampled_y = tl.fit_resample(X, y)
# print(sorted(Counter(resampled_y).items()))
#
# # Build under-sampled dataframe
# dataset_under = pd.DataFrame(resampled_X)
#
# # Restore column names
# dataset_under.columns = X.columns
#
# # Restore y values
# dataset_under['went_on_backorder'] = pd.Series(resampled_y)

In [19]:
# Disabled experiment: undersampling with ClusterCentroids
# (requires `from collections import Counter`)
# X = dataset.iloc[:, :-1]
# y = dataset.went_on_backorder
#
# cc = ClusterCentroids(random_state = 17)
# resampled_X, resampled_y = cc.fit_resample(X, y)
# print('y', sorted(Counter(resampled_y).items()))
# print('X', resampled_X.shape)

In [20]:
# Add code below this comment   (Question #E104) 
# ----------------------------------
# Undersampling serves two purposes: it balances the classes and reduces the dataset to a manageable size
# Separate majority (no backorder) and minority (backorder) classes
majority = dataset[dataset.went_on_backorder == 0]
minority = dataset[dataset.went_on_backorder == 1]

# Downsample majority
majority_under = resample(majority,
                          replace = False, # sample without replacement
                          n_samples = minority.shape[0], # match minority's n
                          random_state = 17) # seeded for reproducible results

# Combine minority and downsampled majority
train_under = pd.concat([majority_under, minority]).reset_index(drop=True)

# Checking counts
print(train_under.went_on_backorder.value_counts())

# Break undersampled data into X and y again
resampled_X = train_under.drop(columns = 'went_on_backorder')
resampled_y = train_under.went_on_backorder

resampled_X.info()  # info() prints its report itself
print('y',resampled_y)

0    11293
1    11293
Name: went_on_backorder, dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22586 entries, 0 to 22585
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   national_inv      22586 non-null  float64
 1   lead_time         22586 non-null  float64
 2   in_transit_qty    22586 non-null  float64
 3   forecast_3_month  22586 non-null  float64
 4   sales_1_month     22586 non-null  float64
 5   sales_3_month     22586 non-null  float64
 6   min_bank          22586 non-null  float64
 7   potential_issue   22586 non-null  int64  
 8   pieces_past_due   22586 non-null  float64
 9   perf_6_month_avg  22586 non-null  float64
 10  local_bo_qty      22586 non-null  float64
 11  deck_risk         22586 non-null  int64  
 12  oe_constraint     22586 non-null  int64  
 13  ppap_risk         22586 non-null  int64  
 14  stop_auto_buy     22586 non-null  int64  
 15  rev_stop          22586 non


**Note:** After sampling the data, you may want to write the data to a file for reloading later.

<span style="background: yellow;">If required, remove the `dataset` variable to avoid any memory-related issue.</span> 
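
Releasing a large frame once the sample exists could look like the sketch below. The variable names are hypothetical stand-ins; in the notebook the large object is `dataset`.

```python
import gc
import pandas as pd

# Hypothetical stand-ins: a large frame and the small sample drawn from it
full_frame = pd.DataFrame({'value': range(100_000)})
small_sample = full_frame.sample(n=100, random_state=17).reset_index(drop=True)

# Drop the only reference to the large frame, then ask Python to reclaim memory
del full_frame
gc.collect()

print(small_sample.shape)
```

The sample is a separate object, so it remains usable after the large frame is deleted.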

In [21]:
# Write your smart sampling to local file  
# ----------------------------------
# Pickle the sampled data (create the target directory first if it is missing)
os.makedirs('data', exist_ok = True)
joblib.dump([resampled_X, resampled_y, train_under], 'data/sample-data-v1.pkl')
 

['data/sample-data-v1.pkl']

In [30]:
resampled_X, resampled_y, train_under = joblib.load('data/sample-data-v1.pkl')

# info() prints its report itself and returns None, so it is not wrapped in print()
resampled_X.info()
print('\ny\n', resampled_y, '\n')
train_under.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22586 entries, 0 to 22585
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   national_inv      22586 non-null  float64
 1   lead_time         22586 non-null  float64
 2   in_transit_qty    22586 non-null  float64
 3   forecast_3_month  22586 non-null  float64
 4   sales_1_month     22586 non-null  float64
 5   sales_3_month     22586 non-null  float64
 6   min_bank          22586 non-null  float64
 7   potential_issue   22586 non-null  int64  
 8   pieces_past_due   22586 non-null  float64
 9   perf_6_month_avg  22586 non-null  float64
 10  local_bo_qty      22586 non-null  float64
 11  deck_risk         22586 non-null  int64  
 12  oe_constraint     22586 non-null  int64  
 13  ppap_risk         22586 non-null  int64  
 14  stop_auto_buy     22586 non-null  int64  
 15  rev_stop          22586 non-null  int64  
dtypes: float64(10), int64(6)
memory usage: 2

You should have made a couple of commits so far in this project.  
**Definitely make a commit of the notebook now!**  
Comment should be: `Final Project, Checkpoint - Data Sampled`


# Save your notebook!
## Then `File > Close and Halt`