# Mr Haulage - Fleet Analysis - Data Cleaning & Initial Inspection
## Author: Lottie Jane Pollard

*"Garbage in, Garbage out" - George Fuechsel, IBM*

### Business Case Analysis Brief:

Mr Haulage runs a box delivery firm that has been in his family for generations. The firm holds a contract with Defense to deliver supplies within the UK. The old fleet of vehicles servicing this contract is starting to show its age, and he wants to replace his current trucks with new ones. The firm transports two types of boxes: a small box and a large box. The prices paid by the customer are shown below:


<img src="/Users/lottiejanepollare/Library/Mobile Documents/com~apple~CloudDocs/CV, Profiles, Interviews & Job Applications/applications/techmodal_analyst_data_engineer/20230825_Analyst_case_study_submission_Lottie_Jane_Pollard/images/table_1_delivery-payments.png" alt="table_1_delivery-payments" width="350"/>


The manager of the firm is deciding on how many trucks to buy to transport these boxes. There are two types of trucks on the market; both types can perform one delivery per day:



<img src="/Users/lottiejanepollare/Library/Mobile Documents/com~apple~CloudDocs/CV, Profiles, Interviews & Job Applications/applications/techmodal_analyst_data_engineer/20230825_Analyst_case_study_submission_Lottie_Jane_Pollard/images/table_2_truck_details.png" alt="table_2_truck_details.png" width="1000"/>


Mr Haulage has provided access to a dataset for about two years’ worth of orders.


### Challenge:

What should Mr Haulage buy to replace his fleet?


### Assumptions:

I will list the assumptions that my recommendations will be based on here as I go:

- Assume all customer_id values pertain to the single defense contract in question
- Assume item_serial is allowed duplicate values. There are 1,982 unique values across 2,000 orders (18 fewer than the number of orders), so it is close to unique per order; however, the order dates & delivery regions of the repeats are different, so I have assumed duplicates to be allowed in this context.

### Questions for Mr Haulage & his Management Team:

- I'd like to clarify with the business whether all records in this dataset pertain to the ONE defense contract in question. There are 1,792 customers across 2,000 orders, indicating little repeat business; I'd expect fewer customer_id values & more repeat orders when analysing one specific contract. I am working on the assumption that the 'Defense Contract' is made up of many customers, which would account for the many Customer IDs. However, I'd like to confirm this with Mr Haulage.
- Could you clarify the meaning of item_serial? I have checked as best I can for duplicate data here, but without further context I can only assume its meaning & allow duplicates. There are 2,000 orders & 1,982 unique item serials (a difference of only 18), so it strikes me that the Defense contractor may have placed duplicate records accidentally, for example two order_id's / purchase orders raised for the same delivery, which may not have been fulfilled or made it to invoicing.



### Initial Inspection:

I'll begin with an initial inspection of the dataset.

In [1]:
# libraries to import

import pandas as pd

In [2]:
# NB: dataset naming convention 'df' will be used for the initial quality analysis & data cleansing

# import dataset provided by Mr Haulage
df = pd.read_excel('/Users/lottiejanepollare/Library/Mobile Documents/com~apple~CloudDocs/CV, Profiles, Interviews & Job Applications/applications/techmodal_analyst_data_engineer/20230825_Analyst_case_study_submission_Lottie_Jane_Pollard/datasets/mr_haulage_order_details.xlsx')

# configure display settings: display all columns regardless of df width, disable wrapping columns to display entire field, no truncating columns, display an English (day-first) date format
pd.set_option('display.max_columns', None, 'display.width', None, 'display.max_colwidth', None, 'display.date_dayfirst', True)
# NB: the day-first setting does not appear to change how dates are displayed in this notebook & I'm not sure why - possibly because I'm running it in the PyCharm IDE, not Jupyter Notebook itself

# show the head of df to see what I'm working with
df.head(50)

Unnamed: 0,Order ID,Customer ID,Order Date,Order Time,Item Serial,Box Type,Delivery Region,Distance (miles)
0,1097342,733603,22/08/2021,00:14,30351,Small,South East,70
1,1097343,405061,22/08/2021,07:08,17634,Small,Greater London,32
2,1097344,842139,22/08/2021,10:15,25598,Small,South West,190
3,1097345,211806,22/08/2021,17:05,10104,Small,South West,85
4,1097346,103222,22/08/2021,23:48,3252,Small,Greater London,43
5,1097347,603400,22/08/2021,23:57,62831,Small,Greater London,33
6,1097348,837737,23/08/2021,02:11,90766,Large,West Midlands,143
7,1097349,334749,23/08/2021,04:43,93186,Large,Greater London,45
8,1097350,239710,23/08/2021,11:49,99590,Large,North East,210
9,1097351,730371,23/08/2021,14:03,39952,Small,South West,110


In [3]:
# let's check the size of the dataset
print(f"The dataset has {df.shape[0]} rows & {df.shape[1]} columns")

The dataset has 2000 rows & 8 columns


In [4]:
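# let's check the completeness of the dataset by checking for missing or null values
null_values = df.isnull().sum()

# only report the dataset as complete if the check actually confirms it
if null_values.sum() == 0:
    print("The dataset has been examined for missing or null values, and it has been confirmed that all columns contain complete data. Therefore, further investigation into missing values is not necessary at this time.")
else:
    print("Missing or null values were found; see the per-column counts below.")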

null_values

The dataset has been examined for missing or null values, and it has been confirmed that all columns contain complete data. Therefore, further investigation into missing values is not necessary at this time.


Order ID            0
Customer ID         0
Order Date          0
Order Time          0
Item Serial         0
Box Type            0
Delivery Region     0
Distance (miles)    0
dtype: int64

In [5]:
# let's convert column names to lowercase & replace whitespace with underscores for uniformity
df.columns = df.columns.str.lower().str.replace(' ', '_')

# let's store the column names to a list to access later
columns = df.columns.tolist()
print(f"The list of new column names are:\n{columns}")

The list of new column names are:
['order_id', 'customer_id', 'order_date', 'order_time', 'item_serial', 'box_type', 'delivery_region', 'distance_(miles)']


In [6]:
# let's check for unique values & the potential for duplicate information

messages = []
for col in columns:
    unique_count = df[col].nunique()
    messages.append(f"There are {unique_count} unique values in {col}")

for message in messages:
    print(message)
    print(f"--------------------------")

There are 2000 unique values in order_id
--------------------------
There are 1792 unique values in customer_id
--------------------------
There are 597 unique values in order_date
--------------------------
There are 1083 unique values in order_time
--------------------------
There are 1982 unique values in item_serial
--------------------------
There are 2 unique values in box_type
--------------------------
There are 8 unique values in delivery_region
--------------------------
There are 289 unique values in distance_(miles)
--------------------------


### Observations:
- 2,000 unique order_id's across 2,000 rows mean we don't have any duplicate order numbers, so there is no need to drop any records at this stage. This also identifies 'order_id' as a unique identifier / primary key, & it will be used as such throughout
- 1,792 unique customers out of 2,000 orders tell me there isn't much repeat business. I'd like to clarify with the business that all records in this dataset pertain to the ONE defense contract in question
- 1,982 unique item_serial values. I'd like to clarify the meaning of this column to better understand its relevance to the business. There is potential for duplicate data here that will need investigating
- 2 unique values in box_type as expected, small & large
- 8 unique delivery regions; I will check these for spelling to confirm there are no duplicates here
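
The item_serial & customer_id observations above could be followed up with something like the sketch below. It runs on a toy frame, not the real dataset: the first three rows reuse values from the head of the data, while the fourth row is invented purely so that a repeated item_serial exists. The names `sample`, `repeated_serials` & `repeat_customers` are my own illustrative choices.

```python
import pandas as pd

# toy sample: rows 1-3 reuse values shown in the head of the dataset,
# row 4 is MADE UP to demonstrate a repeated item_serial
sample = pd.DataFrame({
    "order_id": [1097342, 1097343, 1097344, 1097999],
    "customer_id": [733603, 405061, 842139, 111111],
    "item_serial": [30351, 17634, 25598, 30351],
})

# keep=False flags every row that shares an item_serial, not just the later repeats
repeated_serials = sample[sample.duplicated(subset="item_serial", keep=False)]
repeated_serials = repeated_serials.sort_values(["item_serial", "order_id"])

# count orders per customer to quantify repeat business
orders_per_customer = sample.groupby("customer_id")["order_id"].count()
repeat_customers = orders_per_customer[orders_per_customer > 1]

print(repeated_serials)
print(f"{len(repeat_customers)} customers placed more than one order")
```

Viewing the repeated serials side by side makes it easier to judge whether the order dates & delivery regions really differ before asking Mr Haulage about them.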

In [7]:
# let's check the current data type of each column
df.dtypes

order_id             int64
customer_id          int64
order_date          object
order_time          object
item_serial          int64
box_type            object
delivery_region     object
distance_(miles)     int64
dtype: object

We can see that:

- order_id, customer_id, item_serial & distance_(miles) are correct as 'int64'
- order_date is stored as 'object' (strings) & can be converted to pandas 'datetime64'; order_time can be parsed the same way, keeping only the time component
- box_type & delivery_region can be converted to the 'category' dtype for better memory efficiency (even though the dataset is relatively small, storing 2 box_type categories & 8 delivery_region categories plus integer codes is more efficient than storing 2 x 2,000 strings)
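
To illustrate the memory point above, here is a minimal sketch comparing the two dtypes. The column is synthetic (2,000 strings drawn from two labels, mirroring the shape of box_type but not the real data), & the variable names are illustrative only.

```python
import pandas as pd

# synthetic stand-in for box_type: 2,000 strings with only 2 distinct labels
as_object = pd.Series(["Small", "Large"] * 1000, dtype="object")
as_category = as_object.astype("category")

# deep=True counts the bytes held by the Python string objects themselves
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object: {object_bytes} bytes, category: {category_bytes} bytes")
```

The exact byte counts depend on the pandas & Python versions, so I haven't quoted any here; the category version should come out noticeably smaller.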

In [8]:
# first, I'll apply a lambda function to 'box_type' & 'delivery_region' columns to convert the object to lowercase & replace whitespace with underscores for uniformity
df[['box_type', 'delivery_region']] = df[['box_type', 'delivery_region']].apply(lambda x: x.str.lower().str.replace(' ', '_'))

# I'll also convert them in to 'category' datatypes
df['box_type'] = df['box_type'].astype('category')
df['delivery_region'] = df['delivery_region'].astype('category')

# let's convert 'order_date' to datetime64 datatype
df['order_date'] = pd.to_datetime(df['order_date'], format='%d/%m/%Y')

# let's parse 'order_time' & keep only the time component as Python datetime.time values (the column dtype therefore stays 'object')
df['order_time'] = pd.to_datetime(df['order_time'], format='%H:%M').dt.time

# check the new data types
df.dtypes

order_id                     int64
customer_id                  int64
order_date          datetime64[ns]
order_time                  object
item_serial                  int64
box_type                  category
delivery_region           category
distance_(miles)             int64
dtype: object

### Let's add in some extra columns to allow for financial analysis

In [9]:
# let's create 'order_week', 'order_month', 'order_year' & 'financial_quarter' columns based on 'order_date' column

# financial week (ISO 8601 week number)
df['order_week'] = df['order_date'].dt.isocalendar().week

# financial month (full month name)
df['order_month'] = df['order_date'].dt.strftime('%B')

# financial year
df['order_year'] = df['order_date'].dt.year

# financial quarter (assuming financial year 1st Jan - 31st Dec)
df['financial_quarter'] = df['order_date'].dt.month.apply(
    lambda x: 'Q1' if 1 <= x <= 3 else 'Q2' if 4 <= x <= 6 else 'Q3' if 7 <= x <= 9 else 'Q4'
)
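
As a hedged aside on the date columns above: under the same calendar-year assumption, the quarter can also be derived with the vectorised `dt.quarter` accessor, & it is worth knowing that ISO week numbers belong to an ISO year that can differ from the calendar year around New Year. The sketch below uses two dates from the dataset's range plus two illustrative dates of my own choosing; all variable names are hypothetical.

```python
import pandas as pd

# 22/08/2021 & 10/04/2023 are the first & last order dates in the dataset;
# 01/01/2022 & 25/12/2021 are illustrative dates picked to cover Q1 & Q4
dates = pd.Series(
    pd.to_datetime(["22/08/2021", "10/04/2023", "01/01/2022", "25/12/2021"], format="%d/%m/%Y")
)

# the lambda approach used in the cell above
lambda_quarter = dates.dt.month.apply(
    lambda x: 'Q1' if 1 <= x <= 3 else 'Q2' if 4 <= x <= 6 else 'Q3' if 7 <= x <= 9 else 'Q4'
)

# vectorised equivalent, valid only while the financial year matches the calendar year
vectorised_quarter = "Q" + dates.dt.quarter.astype(str)

# ISO calendar: 1st Jan 2022 falls in ISO week 52 of ISO year 2021
iso = dates.dt.isocalendar()
print(iso[["year", "week"]])
```

If weekly totals are ever grouped by 'order_year' & 'order_week' together, the ISO year from `isocalendar()` may be the safer grouping key for dates in the first or last days of a year.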

### Let's sort the dataset by order_date ascending & find out the time period we are working with

In [10]:
# sort the dataset by order_date ascending
df = df.sort_values(by='order_date', ascending=True)

# get the earliest & latest dates in the dataset
earliest_order_date = df['order_date'].iloc[0]
latest_order_date = df['order_date'].iloc[-1]

# calculate the number of days the dataset spans
no_of_days_data = (latest_order_date - earliest_order_date).days

# format the date
formatted_earliest = earliest_order_date.strftime('%d/%m/%Y')
formatted_latest = latest_order_date.strftime('%d/%m/%Y')

print(f"The dataset contains orders between {formatted_earliest} and {formatted_latest}")
print(f"The dataset contains orders spanning {no_of_days_data} days.")

df

The dataset contains orders between 22/08/2021 and 10/04/2023
The dataset contains orders spanning 596 days.


Unnamed: 0,order_id,customer_id,order_date,order_time,item_serial,box_type,delivery_region,distance_(miles),order_week,order_month,order_year,financial_quarter
0,1097342,733603,2021-08-22,00:14:00,30351,small,south_east,70,33,August,2021,Q3
1,1097343,405061,2021-08-22,07:08:00,17634,small,greater_london,32,33,August,2021,Q3
2,1097344,842139,2021-08-22,10:15:00,25598,small,south_west,190,33,August,2021,Q3
3,1097345,211806,2021-08-22,17:05:00,10104,small,south_west,85,33,August,2021,Q3
4,1097346,103222,2021-08-22,23:48:00,3252,small,greater_london,43,33,August,2021,Q3
...,...,...,...,...,...,...,...,...,...,...,...,...
1993,1099335,216509,2023-04-09,06:40:00,4716,small,greater_london,8,14,April,2023,Q2
1997,1099339,710623,2023-04-10,11:32:00,387,small,south_west,300,15,April,2023,Q2
1998,1099340,932977,2023-04-10,17:54:00,80608,large,south_east,48,15,April,2023,Q2
1996,1099338,103222,2023-04-10,01:02:00,29091,small,south_west,233,15,April,2023,Q2
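
A small sketch of the same date-span calculation using `min()` & `max()`, which does not rely on the frame being sorted first. The series below holds only the earliest & latest order dates reported above plus one other date from the head of the data; the variable names are illustrative.

```python
import pandas as pd

# deliberately unsorted: latest date first
order_dates = pd.Series(
    pd.to_datetime(["10/04/2023", "22/08/2021", "23/08/2021"], format="%d/%m/%Y")
)

earliest = order_dates.min()
latest = order_dates.max()

# difference between the two dates, & the count of calendar days including both ends
span_days = (latest - earliest).days
calendar_days_inclusive = span_days + 1

print(f"{span_days} days between first & last order, {calendar_days_inclusive} calendar days inclusive")
```

The inclusive count of 597 calendar days matches the 597 unique order_date values found earlier, which suggests at least one order was placed on every day in the period.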


There are 2,000 orders & 1,982 unique item serials (a difference of only 18), so it strikes me that the Defense contractor may have placed duplicate records accidentally, for example two order_id's / purchase orders raised for the same delivery, which may not have been fulfilled or made it to invoicing. I will perform an extra check to ensure, as best I can, that the dataset does not contain duplicate records. I will also ask Mr Haulage for clarification.


In [11]:
# let's do an extra check to make sure there aren't any records with the same customer_id, order_date & item_serial
duplicates = df[df.duplicated(subset=['customer_id', 'order_date', 'item_serial'], keep=False)]

# count the rows that share customer_id, order_date & item_serial with another row
# NB: 0 rows means no duplicates on this key; it cannot rule out duplicate orders raised on different dates
print(f"There are {duplicates.shape[0]} rows with matching customer_id, order_date & item_serial")

There are 0 rows with matching customer_id, order_date & item_serial


Let's check the delivery regions do not contain spelling mistakes & therefore potential duplicates

In [12]:
# list the unique delivery regions in alphabetical order so any spelling mistakes (& therefore duplicate regions) can be spotted by eye
unique_regions = list(df['delivery_region'].unique().sort_values())

print(f"I can see that the {len(unique_regions)} unique regions are real-world unique regions")

unique_regions

I can see that the 8 unique regions are real-world unique regions


['east_midlands',
 'greater_london',
 'north_east',
 'north_wales',
 'south_east',
 'south_wales',
 'south_west',
 'west_midlands']
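
The check above is done by eye. As a minimal sketch of how it could be automated, the observed regions can be compared against an expected set. Here `expected_regions` is simply the 8 regions listed above, & the `observed` series is a toy example whose last entry is a deliberate, made-up misspelling.

```python
import pandas as pd

# the 8 regions found in the dataset, used here as the reference set
expected_regions = {
    "east_midlands", "greater_london", "north_east", "north_wales",
    "south_east", "south_wales", "south_west", "west_midlands",
}

# toy data: 'south_wset' is an invented typo to show what a failure looks like
observed = pd.Series(["south_east", "greater_london", "south_west", "south_wset"])

# anything not in the reference set is flagged for review
unexpected = sorted(set(observed) - expected_regions)

print(f"Unexpected region names: {unexpected}")
```

On the real dataset this list would be expected to come back empty.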

## Let's start our cost analysis by adding a column for revenue per box to allow for some Pandas aggregation & visualisations

In [13]:
# first, we will create the values based on if/else criteria (small box = £20 & large box = £100)
initial_values = [20.00 if box_type == 'small' else 100.00 if box_type == 'large' else 0 for box_type in df['box_type']]

# then, insert a new column with the above initial values, I'll insert the new column, 'order_revenue' after 'item_serial' at column index 4
df.insert(loc=4, column='order_revenue', value=initial_values)

# set the data type for 'order_revenue' to float64 & display floats to 2 decimal places so they read as monetary values
df['order_revenue'] = df['order_revenue'].astype('float64')
pd.set_option('display.float_format', '{:.2f}'.format)

df

Unnamed: 0,order_id,customer_id,order_date,order_time,order_revenue,item_serial,box_type,delivery_region,distance_(miles),order_week,order_month,order_year,financial_quarter
0,1097342,733603,2021-08-22,00:14:00,20.00,30351,small,south_east,70,33,August,2021,Q3
1,1097343,405061,2021-08-22,07:08:00,20.00,17634,small,greater_london,32,33,August,2021,Q3
2,1097344,842139,2021-08-22,10:15:00,20.00,25598,small,south_west,190,33,August,2021,Q3
3,1097345,211806,2021-08-22,17:05:00,20.00,10104,small,south_west,85,33,August,2021,Q3
4,1097346,103222,2021-08-22,23:48:00,20.00,3252,small,greater_london,43,33,August,2021,Q3
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1993,1099335,216509,2023-04-09,06:40:00,20.00,4716,small,greater_london,8,14,April,2023,Q2
1997,1099339,710623,2023-04-10,11:32:00,20.00,387,small,south_west,300,15,April,2023,Q2
1998,1099340,932977,2023-04-10,17:54:00,100.00,80608,large,south_east,48,15,April,2023,Q2
1996,1099338,103222,2023-04-10,01:02:00,20.00,29091,small,south_west,233,15,April,2023,Q2
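
One alternative to the if/else list comprehension above, offered as a sketch only: a price lookup with `Series.map`, which raises an error on an unrecognised box type instead of silently assigning 0. The prices are those used in the cell above; `BOX_PRICES` & `revenue_for` are hypothetical names of my own.

```python
import pandas as pd

# prices per box type, as used in the cell above
BOX_PRICES = {"small": 20.00, "large": 100.00}


def revenue_for(box_types: pd.Series) -> pd.Series:
    """Map each box type to its price, failing loudly on unknown types."""
    # cast to object first so category & plain string columns behave the same
    revenue = box_types.astype("object").map(BOX_PRICES)
    if revenue.isna().any():
        unknown = sorted(set(box_types[revenue.isna()]))
        raise ValueError(f"No price defined for box types: {unknown}")
    return revenue.astype("float64")


# toy input mirroring the two box types in the dataset
order_revenue = revenue_for(pd.Series(["small", "large", "small"], dtype="category"))
print(order_revenue.tolist())
```

Failing loudly means a new or misspelt box type would be noticed straight away, whereas a revenue of 0 could quietly skew the totals.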


### I'm happy with the quality of the dataset as far as is reasonably possible within the given scope. Let's save the dataset to a new, clean csv file.
### A csv file doesn't store column datatypes, so I'll save the metadata for this separately.

In [14]:
# save data types to dict, storing each dtype by name so it is written to Excel as plain text
metadata = df.dtypes.astype(str).to_dict()

# save dict to dataframe
metadata_df = pd.DataFrame(list(metadata.items()), columns=['column_name', 'datatype'])

# save to excel
metadata_df.to_excel('/Users/lottiejanepollare/Library/Mobile Documents/com~apple~CloudDocs/CV, Profiles, Interviews & Job Applications/applications/techmodal_analyst_data_engineer/20230825_Analyst_case_study_submission_Lottie_Jane_Pollard/datasets/metadata.xlsx', index=False)

print("The metadata for the column datatypes has been saved to 'datasets/metadata.xlsx'")

The metadata for the column datatypes has been saved to 'datasets/metadata.xlsx'


In [15]:
# let's save the cleansed dataset to a new csv file for further analysis
df.to_csv('/Users/lottiejanepollare/Library/Mobile Documents/com~apple~CloudDocs/CV, Profiles, Interviews & Job Applications/applications/techmodal_analyst_data_engineer/20230825_Analyst_case_study_submission_Lottie_Jane_Pollard/datasets/cleansed_mr_haulage_order_details.csv', index=False, sep=',')

print(f"A cleansed version of the dataset 'mr_haulage_order_details.xlsx' has been saved for analysis as 'cleansed_mr_haulage_order_details.csv'")

A cleansed version of the dataset 'mr_haulage_order_details.xlsx' has been saved for analysis as 'cleansed_mr_haulage_order_details.csv'
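
Since the csv itself carries no datatypes, they need to be re-applied when the file is loaded for analysis. The sketch below shows one way to do that with `pd.read_csv`, using an in-memory buffer & a toy frame instead of the real file path; the column names match the cleansed dataset, but the values & variable names are illustrative.

```python
import io

import pandas as pd

# toy frame with one column of each dtype that a csv round trip would lose
sample = pd.DataFrame({
    "order_id": [1097342, 1097348],
    "order_date": pd.to_datetime(["22/08/2021", "23/08/2021"], format="%d/%m/%Y"),
    "box_type": pd.Categorical(["small", "large"]),
})

# write to an in-memory buffer in place of a file on disk
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)

# re-apply the datatypes on load: category via dtype, dates via parse_dates
restored = pd.read_csv(
    buffer,
    dtype={"box_type": "category"},
    parse_dates=["order_date"],
)

print(restored.dtypes)
```

The same `dtype` & `parse_dates` arguments could be built from the saved metadata file, although I have not done that here.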
