# Mr Haulage - Data Quality Analysis
## Author: Lottie Jane Pollard

*"Garbage in, Garbage out" - George Fuechsel, IBM*

### Business Case Analysis Information

Mr Haulage runs a box delivery firm that has been in his family for generations. The firm holds a contract with Defence to deliver supplies within the UK. The fleet of vehicles that services this contract is starting to show its age, and he wants to replace his current trucks with new ones. The firm transports two types of boxes: a small box and a large box. The prices paid by the customer are shown below:


<img src="images/table_1_delivery-payments.png" alt="table_1_delivery-payments" width="350"/>


The manager of the firm is deciding on how many trucks to buy to transport these boxes. There are two types of trucks on the market; both types can perform one delivery per day:



<img src="images/table_2_truck_details.png" alt="table_2_truck_details" width="1000"/>


Mr Haulage has provided access to a dataset for about two years’ worth of orders (attached). What should Mr Haulage buy to replace his fleet?


In [1]:
# listing libraries to import

import pandas as pd

In [2]:
# Naming convention 'df' will be used for the initial quality analysis & data cleansing

# import dataset provided by Mr Haulage
df = pd.read_excel('mr_haulage_order_details.xlsx')

# configure display settings
pd.set_option('display.max_columns', None, 'display.width', None, 'display.max_colwidth', None)

# show the head of df to see what I'm working with
df.head(50)

Unnamed: 0,Order ID,Customer ID,Order Date,Order Time,Item Serial,Box Type,Delivery Region,Distance (miles)
0,1097342,733603,22/08/2021,00:14,30351,Small,South East,70
1,1097343,405061,22/08/2021,07:08,17634,Small,Greater London,32
2,1097344,842139,22/08/2021,10:15,25598,Small,South West,190
3,1097345,211806,22/08/2021,17:05,10104,Small,South West,85
4,1097346,103222,22/08/2021,23:48,3252,Small,Greater London,43
5,1097347,603400,22/08/2021,23:57,62831,Small,Greater London,33
6,1097348,837737,23/08/2021,02:11,90766,Large,West Midlands,143
7,1097349,334749,23/08/2021,04:43,93186,Large,Greater London,45
8,1097350,239710,23/08/2021,11:49,99590,Large,North East,210
9,1097351,730371,23/08/2021,14:03,39952,Small,South West,110


In [3]:
# check the size of the dataset
print(f"The dataset has {df.shape[0]} rows & {df.shape[1]} columns")

The dataset has 2000 rows & 8 columns


In [4]:
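# check for null values in the dataset
null_values = df.isnull().sum()

# only report complete data if the null counts actually confirm it
if null_values.sum() == 0:
    print("The dataset has been examined for missing or null values, and it has been confirmed that all columns contain complete data. Therefore, further investigation into missing values is not necessary at this time.")
else:
    print(f"Missing values were found in: {null_values[null_values > 0].index.tolist()}")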

null_values

The dataset has been examined for missing or null values, and it has been confirmed that all columns contain complete data. Therefore, further investigation into missing values is not necessary at this time.


Order ID            0
Customer ID         0
Order Date          0
Order Time          0
Item Serial         0
Box Type            0
Delivery Region     0
Distance (miles)    0
dtype: int64
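
`isnull()` only counts true nulls (NaN, None, NaT), so blank strings or placeholder text such as 'N/A' would pass the check above unnoticed. A minimal sketch of an extra check follows. It uses made-up rows rather than Mr Haulage's data, and the `placeholders` list and the `count_hidden_missing` helper are assumptions for illustration:

```python
import pandas as pd

# Made-up rows for illustration only; not taken from Mr Haulage's dataset
sample = pd.DataFrame({
    "Box Type": ["Small", " ", "Large", ""],
    "Delivery Region": ["South East", "N/A", "South West", "North East"],
})

# Assumption: these strings should be treated as "missing"
placeholders = ["", "n/a", "na", "none", "null", "-"]


def count_hidden_missing(frame: pd.DataFrame) -> pd.Series:
    """Count blank or placeholder strings in each column."""
    # Compare on a trimmed, lowercased string version of every cell
    cleaned = frame.astype(str).apply(lambda col: col.str.strip().str.lower())
    return cleaned.isin(placeholders).sum()


hidden_missing = count_hidden_missing(sample)
print(hidden_missing)
```

Real nulls are still best counted with `isnull()`; this sketch only covers values that look filled in but carry no information.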

In [6]:
# convert column names to lowercase and replace whitespace with underscores
df.columns = df.columns.str.lower().str.replace(' ', '_')

# store new column names to a list for potential accessing later
columns = df.columns.tolist()
columns_str = "\n".join(columns)
print(f"The new column names are:\n{columns_str}")

df

The new column names are:
order_id
customer_id
order_date
order_time
item_serial
box_type
delivery_region
distance_(miles)


Unnamed: 0,order_id,customer_id,order_date,order_time,item_serial,box_type,delivery_region,distance_(miles)
0,1097342,733603,22/08/2021,00:14,30351,Small,South East,70
1,1097343,405061,22/08/2021,07:08,17634,Small,Greater London,32
2,1097344,842139,22/08/2021,10:15,25598,Small,South West,190
3,1097345,211806,22/08/2021,17:05,10104,Small,South West,85
4,1097346,103222,22/08/2021,23:48,3252,Small,Greater London,43
...,...,...,...,...,...,...,...,...
1995,1099337,882969,09/04/2023,18:11,17983,Small,Greater London,33
1996,1099338,103222,10/04/2023,01:02,29091,Small,South West,233
1997,1099339,710623,10/04/2023,11:32,387,Small,South West,300
1998,1099340,932977,10/04/2023,17:54,80608,Large,South East,48
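
The rename above keeps the brackets in 'distance_(miles)', which means that column can only be reached with `df['distance_(miles)']`. One possible alternative, sketched below, collapses every run of non-alphanumeric characters into a single underscore. The `to_snake_case` helper is hypothetical, and the rest of this notebook continues to use 'distance_(miles)':

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Lowercase a column label and collapse non-alphanumeric runs to '_'."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")


# Column labels as they appear in the raw spreadsheet
raw_columns = ["Order ID", "Customer ID", "Order Date", "Order Time",
               "Item Serial", "Box Type", "Delivery Region", "Distance (miles)"]

# An empty frame is enough to demonstrate the rename
sample = pd.DataFrame(columns=raw_columns).rename(columns=to_snake_case)
print(sample.columns.tolist())
```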


In [7]:
# let's check the data types of each column as we seem to have dates, ints & string type columns
df.dtypes

order_id             int64
customer_id          int64
order_date          object
order_time          object
item_serial          int64
box_type            object
delivery_region     object
distance_(miles)     int64
dtype: object

Some columns have not been given the most suitable data type, so let's change them:

- order_id, customer_id, item_serial & distance_(miles) are already correct as 'int64'
- order_date & order_time can be converted to pandas 'datetime64' objects
- box_type & delivery_region can be converted to 'category' objects

NB: Changing the 'order_time' data type proved too complex & not necessary. Whether converted to datetime64 or Timedelta, it would have to be converted back to a 'string' type afterwards, and we won't be using the order time for this analysis. This column may be dropped if I can't see a use for it within this particular analysis.
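
If the order time is needed in a later analysis, one option is to combine the date and time strings into a single timestamp rather than converting 'order_time' on its own. This is a minimal sketch only, assuming both columns are still the raw 'dd/mm/yyyy' and 'HH:MM' strings; the `order_datetime` and `order_hour` column names are hypothetical:

```python
import pandas as pd

# Two rows copied from the head of the dataset, still as raw strings
sample = pd.DataFrame({
    "order_date": ["22/08/2021", "23/08/2021"],
    "order_time": ["00:14", "14:03"],
})

# Join the two strings and parse them in one pass with an explicit format
sample["order_datetime"] = pd.to_datetime(
    sample["order_date"] + " " + sample["order_time"],
    format="%d/%m/%Y %H:%M",
)

# The hour (or any other time component) is then available via the .dt accessor
sample["order_hour"] = sample["order_datetime"].dt.hour
print(sample.dtypes)
```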

In [8]:
# First, I'll apply a lambda function to 'box_type' & 'delivery_region' columns to convert the object to lowercase & replace whitespace with underscores for consistency & ease of later use
df[['box_type', 'delivery_region']] = df[['box_type', 'delivery_region']].apply(lambda x: x.str.lower().str.replace(' ', '_'))

# I'll also convert them into 'category' datatypes
df['box_type'] = df['box_type'].astype('category')
df['delivery_region'] = df['delivery_region'].astype('category')

# Convert 'order_date' to datetime64 datatype
df['order_date'] = pd.to_datetime(df['order_date'], format='%d/%m/%Y')

df.dtypes

order_id                     int64
customer_id                  int64
order_date          datetime64[ns]
order_time                  object
item_serial                  int64
box_type                  category
delivery_region           category
distance_(miles)             int64
dtype: object

In [9]:
# sort the dataset by order_date ascending to check the time period we are working with
df = df.sort_values(by='order_date', ascending=True)

# get the earliest & latest dates in the dataset
earliest_order_date = df['order_date'].iloc[0]
latest_order_date = df['order_date'].iloc[-1]

# calculate the number of days the dataset spans
no_of_days_data = (latest_order_date - earliest_order_date).days

# format the dates to 'dd/mm/yyyy'
formatted_earliest = earliest_order_date.strftime('%d/%m/%Y')
formatted_latest = latest_order_date.strftime('%d/%m/%Y')

print(f"The dataset contains orders between {formatted_earliest} and {formatted_latest}")
print(f"The dataset contains orders spanning {no_of_days_data} days.")

df

The dataset contains orders between 22/08/2021 and 10/04/2023
The dataset contains orders spanning 596 days.


Unnamed: 0,order_id,customer_id,order_date,order_time,item_serial,box_type,delivery_region,distance_(miles)
0,1097342,733603,2021-08-22,00:14,30351,small,south_east,70
1,1097343,405061,2021-08-22,07:08,17634,small,greater_london,32
2,1097344,842139,2021-08-22,10:15,25598,small,south_west,190
3,1097345,211806,2021-08-22,17:05,10104,small,south_west,85
4,1097346,103222,2021-08-22,23:48,3252,small,greater_london,43
...,...,...,...,...,...,...,...,...
1993,1099335,216509,2023-04-09,06:40,4716,small,greater_london,8
1997,1099339,710623,2023-04-10,11:32,387,small,south_west,300
1998,1099340,932977,2023-04-10,17:54,80608,large,south_east,48
1996,1099338,103222,2023-04-10,01:02,29091,small,south_west,233
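
The earliest and latest dates can also be read with `min()` and `max()`, which does not depend on the row order after sorting. The sketch below uses only the two boundary dates reported above plus one date in between. It also separates elapsed days from calendar days covered: 596 days elapse between the first and last order, which is 597 calendar days when both end dates are counted.

```python
import pandas as pd

# Boundary dates reported above, deliberately left unsorted
order_dates = pd.Series(pd.to_datetime(
    ["10/04/2023", "22/08/2021", "23/08/2021"], format="%d/%m/%Y"
))

earliest = order_dates.min()
latest = order_dates.max()

# Days elapsed between the first and last order
elapsed_days = (latest - earliest).days

# Calendar days covered, counting both the first and the last day
calendar_days = elapsed_days + 1

print(f"{elapsed_days} days elapsed, {calendar_days} calendar days covered")
```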


In [10]:
# okay, let's check for unique values or the potential for other duplicate information

messages = []
for col in columns:
    unique_count = df[col].nunique()
    messages.append(f"There are {unique_count} unique values in {col}")

for message in messages:
    print(message)

There are 2000 unique values in order_id
There are 1792 unique values in customer_id
There are 597 unique values in order_date
There are 1083 unique values in order_time
There are 1982 unique values in item_serial
There are 2 unique values in box_type
There are 8 unique values in delivery_region
There are 289 unique values in distance_(miles)


We have 2000 rows in the dataset & 2000 unique orders, which shows no duplicate orders have been included, therefore negating the need to drop any specific records from the dataset at this stage.

We can see that there are 1,792 individual customers in this dataset. For the purpose of this analysis, I am working on the assumption that the single 'Defence Contract' mentioned in the brief is made up of many customers, which would explain the many Customer IDs. However, I'd like to clarify this with Mr Haulage: if incorrect data has been provided, it will significantly skew the results of my analysis.

I'd also like to clarify the meaning of 'item_serial' (is it the Defence contractor's internal box serial number?). There are 2,000 orders but only 1,982 unique item serials, a difference of 18, so some orders share a serial. These could be duplicate orders placed accidentally by the Defence contractor that were never fulfilled or invoiced, which would affect my upcoming cost analysis. For the purpose of this analysis, I will treat 'order_id' as the unique identifier (the 'Primary Key') & reindex the table accordingly.

Distance in miles has been provided, but no information has been provided on the miles that can be driven per day or the hours worked per day. This analysis will therefore be based on each truck visiting ONE REGION per day; I strongly advise that further information be provided.

We only have 2 unique box types, which we know to be correct as Small & Large

We have 8 unique delivery regions, which I will verify next.

At this stage, I'd consider dropping the following columns, which may not be required; however, I will wait until after my exploratory data analysis:

- 'order_time' - I cannot see what effect this has on my analysis, as I can't speculate on the details of the customer's Service Level Agreement with the Defence contractor or any KPIs attached to it.
- 'item_serial' - Without further information on the meaning of this column, & therefore any further insight into duplication, I will have to exclude it from my analysis this time.
- 'distance_(miles)' - I have no information on the number of miles that can be driven per day or the working hours in which deliveries are carried out; therefore, for the purpose of this analysis, 'distance_(miles)' is irrelevant. I will be basing my analysis on a truck visiting ONE REGION per day instead.
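
Before excluding 'item_serial', it may be worth listing the orders that share a serial so they can be raised with Mr Haulage. A minimal sketch on made-up rows (the values below are illustrative, not real orders):

```python
import pandas as pd

# Made-up rows for illustration only
sample = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "customer_id": [10, 11, 10, 12, 13],
    "item_serial": [500, 501, 500, 502, 501],
})

# Count how often each serial appears and keep those seen more than once
serial_counts = sample["item_serial"].value_counts()
repeated_serials = serial_counts[serial_counts > 1].index

# Pull every order that uses a repeated serial, grouped by serial for review
repeated_rows = (
    sample[sample["item_serial"].isin(repeated_serials)]
    .sort_values(["item_serial", "order_id"])
)
print(repeated_rows)
```

Seeing the repeated serials side by side would show whether they belong to the same customer and date (a likely accidental repeat) or to unrelated orders (a likely reused serial).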

In [11]:
# Let's do an extra check to make sure there aren't any records with the same customer_id, order_date & item_serial

duplicates = df[df.duplicated(subset=['customer_id', 'order_date', 'item_serial'], keep=False)]

# check the size of the dataset
print(f"The dataset has {df.shape[0]} rows & {df.shape[1]} columns")
print(f"The dataset is the same size, we can confirm there are definitely no duplicate records")

The dataset has 2000 rows & 8 columns
The dataset is the same size, we can confirm there are definitely no duplicate records
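
The shape of `df` cannot change in the cell above because nothing is dropped, so a stricter check would inspect the `duplicates` frame itself. A minimal sketch on made-up rows; the `find_duplicate_rows` helper is hypothetical:

```python
import pandas as pd


def find_duplicate_rows(frame: pd.DataFrame, subset: list) -> pd.DataFrame:
    """Return every row whose values in `subset` occur more than once."""
    return frame[frame.duplicated(subset=subset, keep=False)]


# Made-up rows for illustration only: orders 1 and 3 share the same key
sample = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": [10, 11, 10, 12],
    "order_date": pd.to_datetime(
        ["2021-08-22", "2021-08-22", "2021-08-22", "2021-08-23"]
    ),
    "item_serial": [500, 501, 500, 502],
})

key = ["customer_id", "order_date", "item_serial"]
dupes = find_duplicate_rows(sample, key)

if dupes.empty:
    print("No rows share the same customer_id, order_date & item_serial")
else:
    print(f"{len(dupes)} rows share the same customer_id, order_date & item_serial")
```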


In [14]:
# check that the delivery regions don't contain spelling mistakes, which could split one region into several categories; the 8 unique regions below are all different & verified regions of the UK

unique_regions = list(df['delivery_region'].unique())
unique_regions

['south_east',
 'greater_london',
 'south_west',
 'south_wales',
 'north_east',
 'west_midlands',
 'east_midlands',
 'north_wales']
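
Checking the regions by eye works for 8 values, but the same check can be automated against an agreed list. The sketch below assumes the 8 regions shown above are the full expected set and uses a made-up misspelling ('greater_londn') to show what would be flagged:

```python
import difflib

import pandas as pd

# Assumption: the 8 regions listed above are the complete set of valid values
expected_regions = {
    "south_east", "greater_london", "south_west", "south_wales",
    "north_east", "west_midlands", "east_midlands", "north_wales",
}

# Made-up column containing one deliberate misspelling
observed = pd.Series(
    ["south_east", "greater_london", "greater_londn", "south_west"],
    dtype="category",
)

# Any value outside the expected set is a candidate spelling mistake
unexpected = sorted(set(observed.astype(str)) - expected_regions)

# Suggest the closest expected region for each unexpected value
suggestions = {
    value: difflib.get_close_matches(value, sorted(expected_regions), n=1)
    for value in unexpected
}
print(suggestions)
```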

In [12]:
# create 'order_month', 'order_year' & 'financial_quarter' columns based on 'order_date' column

# month
df['order_month'] = df['order_date'].dt.strftime('%B')

# year
df['order_year'] = df['order_date'].dt.year

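# financial quarter (financial year starting in April: Apr-Jun = Q1, Jul-Sep = Q2, Oct-Dec = Q3, Jan-Mar = Q4)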
df['financial_quarter'] = df['order_date'].dt.month.apply(
    lambda x: 'Q1' if 4 <= x <= 6 else 'Q2' if 7 <= x <= 9 else 'Q3' if 10 <= x <= 12 else 'Q4'
)

df

Unnamed: 0,order_id,customer_id,order_date,order_time,item_serial,box_type,delivery_region,distance_(miles),order_month,order_year,financial_quarter
0,1097342,733603,2021-08-22,00:14,30351,small,south_east,70,August,2021,Q2
1,1097343,405061,2021-08-22,07:08,17634,small,greater_london,32,August,2021,Q2
2,1097344,842139,2021-08-22,10:15,25598,small,south_west,190,August,2021,Q2
3,1097345,211806,2021-08-22,17:05,10104,small,south_west,85,August,2021,Q2
4,1097346,103222,2021-08-22,23:48,3252,small,greater_london,43,August,2021,Q2
...,...,...,...,...,...,...,...,...,...,...,...
1993,1099335,216509,2023-04-09,06:40,4716,small,greater_london,8,April,2023,Q1
1997,1099339,710623,2023-04-10,11:32,387,small,south_west,300,April,2023,Q1
1998,1099340,932977,2023-04-10,17:54,80608,large,south_east,48,April,2023,Q1
1996,1099338,103222,2023-04-10,01:02,29091,small,south_west,233,April,2023,Q1
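
The same mapping can be written as vectorised arithmetic instead of a row-by-row lambda. This is a sketch under the same assumption as the cell above, namely a financial year starting on 1 April, shown on four hand-picked dates, one per quarter:

```python
import pandas as pd

# One date from each financial quarter
dates = pd.Series(pd.to_datetime(
    ["2021-08-22", "2021-10-01", "2023-01-15", "2023-04-10"]
))

# Shift the calendar so April becomes month 0, then bucket into groups of three
quarter_number = (dates.dt.month - 4) % 12 // 3 + 1

financial_quarter = "Q" + quarter_number.astype(str)
print(financial_quarter.tolist())  # ['Q2', 'Q3', 'Q4', 'Q1']
```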


In [13]:
# let's set the 'order_id' as the Primary Key / Unique Identifier for our dataset
df.set_index('order_id', inplace=True)

df

order_id,customer_id,order_date,order_time,item_serial,box_type,delivery_region,distance_(miles),order_month,order_year,financial_quarter
1097342,733603,2021-08-22,00:14,30351,small,south_east,70,August,2021,Q2
1097343,405061,2021-08-22,07:08,17634,small,greater_london,32,August,2021,Q2
1097344,842139,2021-08-22,10:15,25598,small,south_west,190,August,2021,Q2
1097345,211806,2021-08-22,17:05,10104,small,south_west,85,August,2021,Q2
1097346,103222,2021-08-22,23:48,3252,small,greater_london,43,August,2021,Q2
...,...,...,...,...,...,...,...,...,...,...
1099335,216509,2023-04-09,06:40,4716,small,greater_london,8,April,2023,Q1
1099339,710623,2023-04-10,11:32,387,small,south_west,300,April,2023,Q1
1099340,932977,2023-04-10,17:54,80608,large,south_east,48,April,2023,Q1
1099338,103222,2023-04-10,01:02,29091,small,south_west,233,April,2023,Q1
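
`set_index` does not check uniqueness by default, so a Primary Key is only enforced if that check is requested. A minimal sketch using the `verify_integrity` parameter, on made-up rows in which the third order_id is a deliberate duplicate:

```python
import pandas as pd

# Made-up rows: the last order_id repeats the second one on purpose
sample = pd.DataFrame({
    "order_id": [1097342, 1097343, 1097343],
    "box_type": ["small", "small", "large"],
})

try:
    # verify_integrity=True raises ValueError if the new index has duplicates
    sample.set_index("order_id", verify_integrity=True)
    index_is_unique = True
except ValueError as err:
    index_is_unique = False
    print(f"order_id cannot be used as a Primary Key: {err}")

# With the duplicate removed, the same call succeeds
clean = sample.drop_duplicates("order_id").set_index(
    "order_id", verify_integrity=True
)
print(clean)
```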


In [16]:
# we can now save the cleansed dataset to a new csv file ready for some exploratory analysis
# 'order_id' is now the index, so the index must be written or the Primary Key would be lost
df.to_csv('cleansed_mr_haulage_order_details.csv', index=True, sep=',')

print("A cleansed version of the dataset 'mr_haulage_order_details.xlsx' has been saved for analysis as 'cleansed_mr_haulage_order_details.csv'")

A cleansed version of the dataset 'mr_haulage_order_details.xlsx' has been saved for analysis as 'cleansed_mr_haulage_order_details.csv'
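
CSV files do not store data types, and the index is only written when `index` is left at its default of `True`. When the file is loaded for exploratory analysis, the index, dates & categories therefore need to be restored explicitly. A minimal sketch of the round trip, using an in-memory buffer and two made-up rows instead of the real file:

```python
import io

import pandas as pd

# Two illustrative rows with the same dtypes as the cleansed dataset
sample = pd.DataFrame({
    "order_id": [1097342, 1097348],
    "order_date": pd.to_datetime(["2021-08-22", "2021-08-23"]),
    "box_type": pd.Categorical(["small", "large"]),
}).set_index("order_id")

# Write to an in-memory buffer; the index is included by default
buffer = io.StringIO()
sample.to_csv(buffer)
buffer.seek(0)

# Restore the index, the datetime column and the category column on load
reloaded = pd.read_csv(
    buffer,
    index_col="order_id",
    parse_dates=["order_date"],
    dtype={"box_type": "category"},
)
print(reloaded.dtypes)
```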
