# SENDY LOGISTICS DATA PREPROCESSING
---
1. Make the **first line of each code cell a comment** that tells the team what the task or code does.
2. If you take a **code snippet** from **Stack Overflow or a blog**, make the **second line a link to the original post** so team members can look it up later if they need clarity.
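
A cell that follows both rules might look like the sketch below. The toy DataFrame and the placeholder on the second line are illustrative assumptions; paste the real link when the snippet comes from a post.

```python
# Count the missing values per column, highest first
# Source: <paste the Stack Overflow / blog link here>
import pandas as pd

# toy data, only here to make the cell runnable on its own
toy_df = pd.DataFrame({"a": [1, None, 3], "b": [None, None, 1]})
missing_counts = toy_df.isna().sum().sort_values(ascending=False)
print(missing_counts)
```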

---
## HEADS UP
*The following steps are a guideline, not mandatory steps, and they are not necessarily in order.*

- split data into subsets (train & validation/test)
- data preparation
    - Imputing missing values
    - Changing data types if necessary, e.g. [df.convert_dtypes()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.convert_dtypes.html) | [df.astype()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html); for date-based features, [pd.to_datetime(df['date'])](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html?highlight=to_datetime)
    - One Hot Encoding [more](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)
    - Ordinal Encoding [more](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html#sklearn.preprocessing.OrdinalEncoder)
    - Target Encoding (Do more Research)
    - Target Transformation Regressor (Do [more Research](https://scikit-learn.org/stable/modules/generated/sklearn.compose.TransformedTargetRegressor.html?highlight=transform#sklearn.compose.TransformedTargetRegressor))
        - The target might need to be transformed when using linear algorithms such as Linear Regression.
- [Scale](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html?highlight=scale#sklearn.preprocessing.StandardScaler) & transform 
- feature Engineering (read more)
    - Feature Interaction
    - Bin Numeric Features
    - Trigonometry Features
    - Group Features
    - Polynomial Features
    - Combine Rare Levels
- feature Selection (AKA variables selection)
    - SelectKBest, SelectFromModel, RFE [more here](https://scikit-learn.org/stable/search.html?q=feature+selection) 
    - Feature Importance
    - Collinearity (remove multicollinearity using a correlation threshold >= 0.9)
        - Rule: if feature A is highly correlated with feature B but A is less correlated with the target than B is, then remove feature A
    - Principal Component Analysis (PCA)
    - VarianceThreshold [view](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html?highlight=feature%20selection#sklearn.feature_selection.VarianceThreshold)
        - Set a threshold and ignore low-variance features that fall below it.
- Clustering (Removing Outliers / creating clusters)
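
The collinearity rule in the list above can be sketched as follows. This is a minimal illustration on deterministic toy data, not the team's agreed method: the helper name `drop_collinear_features` and the toy column names are hypothetical, and Pearson correlation is assumed.

```python
# Drop the weaker feature of every highly correlated pair (threshold >= 0.9)
import numpy as np
import pandas as pd


def drop_collinear_features(X, y, threshold=0.9):
    """Return the features to remove, following the A-vs-B rule from the list."""
    corr = X.corr().abs()              # pairwise |correlation| between features
    target_corr = X.corrwith(y).abs()  # |correlation| of each feature with the target
    to_drop = set()
    cols = list(X.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if a in to_drop or b in to_drop:
                continue
            if corr.loc[a, b] >= threshold:
                # remove whichever feature is less correlated with the target
                weaker = a if target_corr[a] < target_corr[b] else b
                to_drop.add(weaker)
    return sorted(to_drop)


# deterministic toy data (hypothetical columns)
n = np.arange(50)
distance_km = np.linspace(1, 30, 50)
toy_X = pd.DataFrame({
    "distance_km": distance_km,
    "distance_m": distance_km * 1000 + 800 * np.sin(n),  # near-duplicate feature
    "temperature": 25 + 5 * np.cos(1.7 * n),
})
toy_y = pd.Series(distance_km * 120 + 20 * np.cos(n))

dropped = drop_collinear_features(toy_X, toy_y)
toy_X_reduced = toy_X.drop(columns=dropped)
```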
---

# 1. Library Imports
---
Keep it clean, import Libraries at the Top!

In [12]:
# data manipulation
import pandas as pd
import numpy as np

# data visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# Pipeline
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer

# preprocessing
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import StandardScaler

# More imports Below


# 2. Import Datasets
---
- By default the notebook fetches the data remotely from the GitHub links; switch to the local paths if need be.
> e.g. replace URL_TRAIN with URL_local_TRAIN
- DO NOT FORGET TO CHANGE BACK THE LINKS BEFORE CREATING A PULL REQUEST
> Use the following links for notebooks on a local machine (Jupyter Notebook/JupyterLab); this saves bandwidth.
> This assumes the notebook.ipynb is inside the Notebooks folder.

```python

    URL_local_TRAIN = "/data/Train.csv"
    URL_local_TEST = "/data/Test.csv"
    URL_local_RIDERS = "/data/Riders.csv" 
    URL_local_DD = "/data/VariableDefinitions.csv"

```
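
One way to make the switch harder to forget is a single flag that picks the source, sketched below. The flag `USE_LOCAL_DATA` and the helper `pick_source` are hypothetical names, not part of the notebook, and the local path is copied from the cell above, so adjust it to your checkout.

```python
# Pick between the remote GitHub link and the local copy with one flag
USE_LOCAL_DATA = False  # hypothetical flag; keep it False before creating a pull request

URL_TRAIN = "https://raw.githubusercontent.com/Explore-EDSA-2020/Sendy-Logistics-Challenge/master/data/Train.csv"
URL_local_TRAIN = "/data/Train.csv"  # copied from the cell above


def pick_source(remote_url, local_path, use_local=USE_LOCAL_DATA):
    """Return the local path when use_local is True, otherwise the remote URL."""
    return local_path if use_local else remote_url


train_source = pick_source(URL_TRAIN, URL_local_TRAIN)
```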

In [15]:
# CONSTANTS

URL_TRAIN = "https://raw.githubusercontent.com/Explore-EDSA-2020/Sendy-Logistics-Challenge/master/data/Train.csv"
URL_TEST = "https://raw.githubusercontent.com/Explore-EDSA-2020/Sendy-Logistics-Challenge/master/data/Test.csv"
URL_RIDERS = "https://raw.githubusercontent.com/Explore-EDSA-2020/Sendy-Logistics-Challenge/master/data/Riders.csv" 
URL_DD = "https://raw.githubusercontent.com/Explore-EDSA-2020/Sendy-Logistics-Challenge/master/data/VariableDefinitions.csv" # Data Dictionary


In [14]:
# reading the data to dataframe

train_df  = pd.read_csv(URL_TRAIN)
test_df   = pd.read_csv(URL_TEST)
riders_df = pd.read_csv(URL_RIDERS)
data_dictionary_df = pd.read_csv(URL_DD)


In [16]:
# making a copy of the data to avoid altering the original data
train_riders = train_df.copy()
test_riders  = test_df.copy()

# merge train and test with the riders data
train_riders = train_riders.merge(riders_df, how='left', on='Rider Id')
test_riders  = test_riders.merge(riders_df, how='left', on='Rider Id')

# view dimension
print('train without riders: ', train_df.shape)
print('train merged with riders: ', train_riders.shape)
print('---------------------------------------')
print('test without riders: ', test_df.shape)
print('test merged with riders: ', test_riders.shape)

train without riders:  (21201, 29)
train merged with riders:  (21201, 33)
---------------------------------------
test without riders:  (7068, 25)
test merged with riders:  (7068, 29)
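
The unchanged row counts above suggest that each `Rider Id` appears once in the riders table. A hedged way to make that assumption explicit is to let pandas check it during the merge, sketched here on toy frames. Every column except `Rider Id` is a hypothetical stand-in.

```python
# Left merge that fails loudly if a rider id is duplicated in the riders table
import pandas as pd

toy_orders = pd.DataFrame({
    "Order No": ["Order_1", "Order_2", "Order_3"],
    "Rider Id": ["Rider_A", "Rider_B", "Rider_A"],
})
toy_riders = pd.DataFrame({
    "Rider Id": ["Rider_A", "Rider_B"],
    "No_Of_Orders": [120, 45],
})

merged = toy_orders.merge(
    toy_riders,
    how="left",
    on="Rider Id",
    validate="many_to_one",  # raises MergeError if toy_riders repeats a Rider Id
    indicator=True,          # adds a `_merge` column: 'both' or 'left_only'
)

# orders whose rider was not found in the riders table
unmatched = int((merged["_merge"] == "left_only").sum())
print(len(toy_orders), len(merged), unmatched)
```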


# 3. Data Preprocessing
---
Avoid imputing missing values with statistics computed from the test data. Instead, use a pipeline so that the test set inherits the **train properties**, e.g. mean, median, mode (most common value) or a constant such as "missing". **Ask why in the upcoming meeting!**
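
A minimal sketch of that idea, using toy data with hypothetical values: the transformer is fitted on the train frame only, so the median used to fill the test frame is the train median. The median and "missing" strategies are just one possible choice.

```python
# Fit the imputers on train only, then reuse the learned values on test
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

toy_train = pd.DataFrame({
    "Temperature": [20.0, np.nan, 26.0, 23.0],
    "Vehicle Type": pd.Series(["Bike", "Bike", np.nan, "Bike"], dtype=object),
})
toy_test = pd.DataFrame({
    "Temperature": [np.nan, 30.0],
    "Vehicle Type": pd.Series([np.nan, "Bike"], dtype=object),
})

preprocessor = make_column_transformer(
    # numeric: fill gaps with the median learned from the train data
    (SimpleImputer(strategy="median"), ["Temperature"]),
    # categorical: fill gaps with the constant "missing", then one-hot encode
    (
        make_pipeline(
            SimpleImputer(strategy="constant", fill_value="missing"),
            OneHotEncoder(handle_unknown="ignore"),
        ),
        ["Vehicle Type"],
    ),
    sparse_threshold=0,  # always return a dense array
)

train_ready = preprocessor.fit_transform(toy_train)  # statistics learned here
test_ready = preprocessor.transform(toy_test)        # no refitting on test
```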

---
You take it from here!
