# SENDY LOGISTICS EXPLORATORY DATA ANALYSIS
---
1. Make your **first line a comment** so the team can see at a glance what the task or code does.
2. If you take a **code snippet** from **Stack Overflow or a blog**, make your **second line a link to the post** so team members can refer to it later if they need clarity.

---
## HEADS UP
*The following steps serve as a guideline, not as mandatory steps, and they need not be followed in order.*




# 1. Library Imports
---
Keep it clean: import libraries at the top!

In [5]:
# data manipulation
import pandas as pd
import numpy as np

# data visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# More imports Below


# 2. Import Datasets
---
- By default the notebook fetches the data remotely through the GitHub links; switch to the local paths if need be.
> e.g. replace URL_TRAIN with URL_local_TRAIN
- DO NOT FORGET TO CHANGE THE LINKS BACK BEFORE CREATING A PULL REQUEST
> Use the following links for local-machine notebooks/JupyterLab (this will save data).
> This assumes the notebook.ipynb is inside the Notebooks folder.

```python
URL_local_TRAIN = "/data/Train.csv"
URL_local_TEST = "/data/Test.csv"
URL_local_RIDERS = "/data/Riders.csv" 
URL_local_DD = "/data/VariableDefinitions.csv"

```
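
One way to avoid editing the links by hand is to resolve each file at run time. The following is a minimal sketch, not part of the original notebook: `data_source`, `REMOTE_BASE` and `LOCAL_DIR` are hypothetical names, and the local folder is assumed to be the `/data` location used above.

```python
from pathlib import Path

# Remote folder shared by the GitHub links used in this notebook
REMOTE_BASE = ("https://raw.githubusercontent.com/Explore-EDSA-2020/"
               "Sendy-Logistics-Challenge/master/data")

# Assumption: local copies live where the URL_local_* constants point
LOCAL_DIR = Path("/data")


def data_source(filename, local_dir=LOCAL_DIR):
    """Return the local path if the file exists there, else the GitHub URL.

    Hypothetical helper, shown for illustration only.
    """
    local_path = Path(local_dir) / filename
    if local_path.is_file():
        return str(local_path)
    return f"{REMOTE_BASE}/{filename}"


URL_TRAIN = data_source("Train.csv")
```

With a helper like this the same notebook could run locally and from GitHub, so there would be no links to change back before a pull request.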
---

In [6]:
# CONSTANTS

URL_TRAIN = "https://raw.githubusercontent.com/Explore-EDSA-2020/Sendy-Logistics-Challenge/master/data/Train.csv"
URL_TEST = "https://raw.githubusercontent.com/Explore-EDSA-2020/Sendy-Logistics-Challenge/master/data/Test.csv"
URL_RIDERS = "https://raw.githubusercontent.com/Explore-EDSA-2020/Sendy-Logistics-Challenge/master/data/Riders.csv" 
URL_DD = "https://raw.githubusercontent.com/Explore-EDSA-2020/Sendy-Logistics-Challenge/master/data/VariableDefinitions.csv" # Data Dictionary


In [7]:
# read the data into DataFrames

train_df  = pd.read_csv(URL_TRAIN)
test_df   = pd.read_csv(URL_TEST)
riders_df = pd.read_csv(URL_RIDERS)
data_dictionary_df = pd.read_csv(URL_DD)

In [8]:
# making a copy of the data to avoid altering the original data
train_riders = train_df.copy()
test_riders  = test_df.copy()

# merge train and test with the riders data
train_riders = train_riders.merge(riders_df, how='left', on='Rider Id')
test_riders  = test_riders.merge(riders_df, how='left', on='Rider Id')

# view dimension
print('train without riders: ', train_df.shape)
print('train merged with riders: ', train_riders.shape)
print('---------------------------------------')
print('test without riders: ', test_df.shape)
print('test merged with riders: ', test_riders.shape)

train without riders:  (21201, 29)
train merged with riders:  (21201, 33)
---------------------------------------
test without riders:  (7068, 25)
test merged with riders:  (7068, 29)
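
The shapes above show that the left merge kept every order and added the four rider columns. pandas can also enforce this at merge time. The sketch below uses toy frames (not the real data) to show the `validate` and `indicator` parameters of `DataFrame.merge`; the rider `R9` is made up to illustrate an order whose rider is missing from the riders table.

```python
import pandas as pd

# Toy stand-ins for train_df and riders_df
toy_orders = pd.DataFrame({
    "Order No": ["O1", "O2", "O3", "O4"],
    "Rider Id": ["R1", "R2", "R1", "R9"],  # R9 has no entry in toy_riders
})
toy_riders = pd.DataFrame({
    "Rider Id": ["R1", "R2"],
    "No_Of_Orders": [10, 20],
})

toy_merged = toy_orders.merge(
    toy_riders,
    how="left",
    on="Rider Id",
    validate="many_to_one",  # raises MergeError if a rider appears twice in toy_riders
    indicator=True,          # adds a "_merge" column: "both" or "left_only"
)

# Orders whose rider was not found in the riders table
unmatched = int((toy_merged["_merge"] == "left_only").sum())
print(toy_merged.shape, "unmatched orders:", unmatched)
```

On the real data an unmatched count of zero would confirm that every order has rider metrics attached.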


# 3. DOMAIN KNOWLEDGE
---
Read the Data Dictionary to Understand what each Feature Holds.

### Variables

**1. Order details**
- Order No – Unique number identifying the order
- User Id – Unique number identifying the customer on a platform
- Vehicle Type – For this competition limited to bikes; in practice, the Sendy service also extends to trucks and vans
- Platform Type – Platform used to place the order; there are 4 types
- Personal or Business – Customer type

**2. Placement times**
- Placement - Day of Month, i.e. 1-31
- Placement - Weekday (Monday = 1)
- Placement - Time - Time of day the order was placed

**3. Confirmation times**
- Confirmation - Day of Month, i.e. 1-31
- Confirmation - Weekday (Monday = 1)
- Confirmation - Time - Time of day the order was confirmed by a rider

**4. Arrival at Pickup times**
- Arrival at Pickup - Day of Month, i.e. 1-31
- Arrival at Pickup - Weekday (Monday = 1)
- Arrival at Pickup - Time - Time of day the rider arrived at the location to pick up the order - as marked by the rider through the Sendy application

**5. Pickup times**
- Pickup - Day of Month, i.e. 1-31
- Pickup - Weekday (Monday = 1)
- Pickup - Time - Time of day the rider picked up the order - as marked by the rider through the Sendy application


**Arrival at Destination times (columns missing in Test set)**

- Arrival at Delivery - Day of Month, i.e. 1-31
- Arrival at Delivery - Weekday (Monday = 1)
- Arrival at Delivery - Time - Time of day the rider arrived at the destination to deliver the order - as marked by the rider through the Sendy application

- Distance covered (KM) - The distance from Pickup to Destination
- Temperature - Temperature at the time of order placement in Degrees Celsius (measured every three hours)
- Precipitation in Millimeters - Precipitation at the time of order placement (measured every three hours)
- Pickup Latitude and Longitude - Latitude and longitude of pick up location
- Destination Latitude and Longitude - Latitude and longitude of delivery location
- Rider ID – ID of the Rider who accepted the order
- Time from Pickup to Arrival - Time in seconds between ‘Pickup’ and ‘Arrival at Destination’ - calculated from the columns for the purpose of facilitating the task
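
Since the target is described as the number of seconds between 'Pickup' and 'Arrival at Destination', it can be reproduced from the two time columns. The sketch below runs on toy rows; the 12-hour clock format (`%I:%M:%S %p`) and the `seconds` column name are assumptions for illustration, so check them against the real data first.

```python
import pandas as pd

# Toy rows; the time strings are made up and their format is assumed
toy_times = pd.DataFrame({
    "Pickup - Time": ["10:27:30 AM", "11:44:09 AM"],
    "Arrival at Destination - Time": ["10:39:55 AM", "12:17:22 PM"],
})

TIME_FORMAT = "%I:%M:%S %p"  # assumed 12-hour clock with AM/PM

pickup = pd.to_datetime(toy_times["Pickup - Time"], format=TIME_FORMAT)
arrival = pd.to_datetime(toy_times["Arrival at Destination - Time"], format=TIME_FORMAT)

# Difference between the two timestamps, expressed in seconds
toy_times["seconds"] = (arrival - pickup).dt.total_seconds()
print(toy_times["seconds"].tolist())
```

This simple version ignores deliveries that cross midnight; those would need the Day of Month columns as well.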

**Rider metrics**
- Rider ID – Unique number identifying the rider (same as in order details)
- No of Orders – Number of Orders the rider has delivered
- Age – Number of days since the rider delivered the first order
- Average Rating – Average rating of the rider
- No of Ratings - Number of ratings the rider has received. Rating an order is optional for the customer.

# 4. QUICK DATA OVERVIEW
---
What seems to be odd?

**------------------------------riders data------------------------------**

In [9]:
# Riders data overview 
riders_df

|  | Rider Id | No_Of_Orders | Age | Average_Rating | No_of_Ratings |
|---|---|---|---|---|---|
| 0 | Rider_Id_396 | 2946 | 2298 | 14.0 | 1159 |
| 1 | Rider_Id_479 | 360 | 951 | 13.5 | 176 |
| 2 | Rider_Id_648 | 1746 | 821 | 14.3 | 466 |
| 3 | Rider_Id_753 | 314 | 980 | 12.5 | 75 |
| 4 | Rider_Id_335 | 536 | 1113 | 13.7 | 156 |
| ... | ... | ... | ... | ... | ... |
| 955 | Rider_Id_896 | 152 | 99 | 12.4 | 18 |
| 956 | Rider_Id_149 | 69 | 101 | 10.2 | 10 |
| 957 | Rider_Id_270 | 338 | 96 | 14.4 | 41 |
| 958 | Rider_Id_201 | 159 | 96 | 15.0 | 9 |


In [10]:
# investigate missing values and the data type 
riders_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 960 entries, 0 to 959
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Rider Id        960 non-null    object 
 1   No_Of_Orders    960 non-null    int64  
 2   Age             960 non-null    int64  
 3   Average_Rating  960 non-null    float64
 4   No_of_Ratings   960 non-null    int64  
dtypes: float64(1), int64(3), object(1)
memory usage: 37.6+ KB
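
`info()` reports non-null counts column by column, which gets hard to scan on wider frames such as the merged train data. A compact alternative is sketched below on a toy frame; `missing_report` is a hypothetical helper and the values are made up.

```python
import pandas as pd


def missing_report(df):
    """List only the columns with missing values (hypothetical helper)."""
    counts = df.isna().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
        "dtype": df.dtypes.astype(str),
    })
    # Keep columns that have at least one missing value, worst first
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame with made-up values
toy_weather = pd.DataFrame({
    "Temperature": [20.4, None, 19.2, None],
    "Distance (KM)": [4, 16, 3, 9],
})
report = missing_report(toy_weather)
print(report)
```

An empty report means no column has missing values, which is what `info()` shows for the riders data.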


In [11]:
# check whether the rider IDs are really unique, to learn how many riders we have
riders_df['Rider Id'].nunique()

960

In [12]:
# get the summary statistics of the riders
# We wish to understand the average number of orders made; later on we will check per rider
riders_df.describe()

|  | No_Of_Orders | Age | Average_Rating | No_of_Ratings |
|---|---|---|---|---|
| count | 960.0 | 960.0 | 960.0 | 960.0 |
| mean | 1864.851042 | 1200.234375 | 13.412604 | 427.983333 |
| std | 1880.337785 | 810.930171 | 2.675794 | 486.957931 |
| min | 2.0 | 96.0 | 0.0 | 0.0 |
| 25% | 261.75 | 478.25 | 13.5 | 30.0 |
| 50% | 1475.5 | 1021.0 | 14.0 | 223.0 |
| 75% | 2847.25 | 1891.5 | 14.3 | 678.75 |
| max | 9756.0 | 3764.0 | 15.2 | 2298.0 |
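
A quick way to gauge outliers from this summary is the interquartile range (IQR) rule. The sketch below applies it to the No_Of_Orders quartiles printed above; `iqr_fences` is a hypothetical helper, and the 1.5 multiplier is the usual convention rather than something the data dictates.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences of the IQR rule (hypothetical helper)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles and maximum of No_Of_Orders, copied from the describe() output
orders_q1, orders_q3, orders_max = 261.75, 2847.25, 9756.0

lower, upper = iqr_fences(orders_q1, orders_q3)
has_high_outliers = orders_max > upper
print(lower, upper, has_high_outliers)
```

Under this rule the upper fence is 6725.5 orders, so the maximum of 9756 lies beyond it and at least one rider would count as a high outlier.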


NOTES:
- We have **960 unique riders** with no missing values.
- We should later investigate possible outliers.
- On average, a rider has delivered **1864.85** orders, with a maximum of **9756**.
- Age seems to represent the rider's experience, so riders with an Age above the 50% mark (the median) may be somewhat faster at delivery. Does that help explain the target?
- Assumptions:
    - Age < 25% (478.25): basic rider (25% of the riders are basic riders)
    - Age < 50% (1021.0): intermediate rider. Note that 1200.234375 is the mean Age, not the 50% mark.
    - Age < 75% (1891.5): advanced rider
    - This might be a potential feature, e.g. **riders_rank --> {basic, intermediate, advanced}**
- Interesting features are:
    - Age
    - Average_Rating
    - No_of_Ratings

> What method is used to rate the riders, and which factors go into a rating?

> Do Age and No of Orders help explain the target?
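
The proposed `riders_rank` feature could be built by binning Age at its quartiles. The sketch below uses a toy Age column. The notes do not say how riders above the 75% mark should be labelled, so this version makes an assumption and treats everyone at or above the median as advanced; adjust the bins if a fourth rank is wanted.

```python
import pandas as pd

# Toy Age values (days since first delivery), made up for illustration
toy_riders = pd.DataFrame({"Age": [96, 300, 478, 800, 1021, 1500, 1891, 3764]})

# Quartile cut points computed from the data itself
q25, q50 = toy_riders["Age"].quantile([0.25, 0.50])

# Assumption: riders at or above the median are all labelled "advanced"
bins = [-float("inf"), q25, q50, float("inf")]
labels = ["basic", "intermediate", "advanced"]

# right=False makes each bin include its lower edge: [low, high)
toy_riders["riders_rank"] = pd.cut(
    toy_riders["Age"], bins=bins, labels=labels, right=False
)
print(toy_riders)
```

Because the cut points come from the frame they are applied to, they should be computed on the riders table once and reused for both train and test.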

**------------------------------Train data with riders------------------------------**

In [13]:
# Let's verify what we learned in the domain knowledge section
# get columns from train and test
column_train = train_riders.columns.to_list()
column_test = test_riders.columns.to_list()

# get columns in train but not in test, excluding the target
[column for column in column_train if column not in column_test and column !=  'Time from Pickup to Arrival']

['Arrival at Destination - Day of Month',
 'Arrival at Destination - Weekday (Mo = 1)',
 'Arrival at Destination - Time']
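
The same comparison can be written with pandas index set operations, which also makes it easy to drop the extra columns. Since the target is calculated from the arrival columns and they are absent from the test set, they probably should not be used as model inputs. The sketch below works on toy frames that carry column names only; `toy_train`, `toy_test` and `TARGET` are illustrative names.

```python
import pandas as pd

TARGET = "Time from Pickup to Arrival"

# Toy frames with column names only (no rows are needed for this check)
toy_train = pd.DataFrame(
    columns=["Order No", "Pickup - Time", "Arrival at Destination - Time", TARGET]
)
toy_test = pd.DataFrame(columns=["Order No", "Pickup - Time"])

# Columns present in train but not in test, leaving the target out
train_only = toy_train.columns.difference(toy_test.columns).difference([TARGET])

# Drop them so train has the same features as test, plus the target
toy_train_aligned = toy_train.drop(columns=train_only)
print(train_only.tolist())
```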

# 5. Visualizations
---
**Three branches for understanding the data**
1. Composition
2. Comparison
3. Relationships
---
Take it from here!
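
As a starting point, the three branches could map to one plot each. The sketch below uses a toy frame with made-up values; the column names follow the data dictionary but should be checked against the real frame, and the choice of plots is only a suggestion.

```python
import matplotlib
matplotlib.use("Agg")  # only for running outside a notebook; drop this line in Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Toy rows with made-up values; column names are assumed
toy = pd.DataFrame({
    "Personal or Business": ["Business", "Personal", "Business", "Business"],
    "Distance (KM)": [4, 16, 3, 9],
    "Time from Pickup to Arrival": [745, 1993, 455, 1341],
})

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# 1. Composition: how the orders split by customer type
toy["Personal or Business"].value_counts().plot.bar(ax=axes[0], title="Composition")

# 2. Comparison: mean delivery time per customer type
(toy.groupby("Personal or Business")["Time from Pickup to Arrival"]
    .mean()
    .plot.bar(ax=axes[1], title="Comparison"))

# 3. Relationship: distance against delivery time
toy.plot.scatter(x="Distance (KM)", y="Time from Pickup to Arrival",
                 ax=axes[2], title="Relationship")

fig.tight_layout()
```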