## Project 1: Data Preprocessing

### **Objective**:

* Clean and preprocess a dataset of shipment times and routes.

### **Some methods to apply**:
1. Load the dataset from a CSV file.
2. Handle missing values appropriately.
3. Encode categorical variables.
4. Normalize numerical features.

In [4]:
import pandas as pd
import numpy as np

In [5]:
# Show every column and row when a DataFrame is displayed.
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

In [6]:
ShipData = pd.read_csv("ShipmentData.csv")

#### This dataset is based on an original supply chain dataset from Kaggle. I had it extrapolated to 25,000 records by filling in likely future (or past) values based on observed patterns, and added a distance column for future deep learning work.

In [8]:
print(ShipData.shape)
ShipData.head(5)

(25000, 25)


Unnamed: 0,Product type,SKU,Price,Availability,Number of products sold,Revenue generated,Customer demographics,Stock levels,Lead times,Order quantities,Shipping times,Shipping carriers,Shipping costs,Supplier name,Location,Lead time,Production volumes,Manufacturing lead time,Manufacturing costs,Inspection results,Defect rates,Transportation modes,Routes,Costs,Distance (km)
0,skincare,SKU0,69.02,0,254,19937.77,Unknown,94,25,92,3,Carrier B,674.73,Supplier 5,Mumbai,11,739,24,1.97,Fail,2.73,Air,Route B,413.86,1016
1,haircare,SKU1,96.01,18,510,52220.92,Male,84,17,42,4,Carrier C,247.07,Supplier 1,Mumbai,22,293,21,62.02,Pending,0.79,Rail,Route C,490.68,967
2,cosmetics,SKU2,26.6,62,177,5081.47,Male,90,4,51,3,Carrier A,663.83,Supplier 5,Mumbai,11,86,29,11.98,Pass,3.42,Air,Route A,867.51,1011
3,cosmetics,SKU3,76.59,68,195,17142.99,Unknown,90,4,22,4,Carrier A,268.11,Supplier 4,Chennai,29,914,26,23.99,Pending,4.8,Rail,Route A,600.23,1039
4,haircare,SKU4,89.51,1,281,27986.87,Male,11,26,40,7,Carrier A,162.01,Supplier 5,Kolkata,12,838,23,72.95,Fail,2.21,Sea,Route B,105.88,1058


### Normalize column names


In [12]:
ShipData.columns = ShipData.columns.str.lower().str.strip().str.replace(' ', '_')
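
The cell above only replaces spaces, so a name such as `Distance (km)` keeps its parentheses. As a minimal sketch of one alternative (not what this notebook does), a regular expression can collapse every run of non-word characters into a single underscore. The toy frame `df` is an assumption for illustration; its column names are taken from the dataset shown above.

```python
import pandas as pd

# Toy frame (illustrative only) reusing three column names from the dataset.
df = pd.DataFrame(columns=["Product type", "Lead times", "Distance (km)"])

df.columns = (
    df.columns.str.strip()
    .str.lower()
    .str.replace(r"[^\w]+", "_", regex=True)  # any run of non-word characters -> "_"
    .str.strip("_")                           # drop a leading or trailing "_"
)

print(list(df.columns))  # ['product_type', 'lead_times', 'distance_km']
```

The rest of this notebook keeps the original result, so the distance column is still referred to as `distance_(km)`.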

### Find and handle missing values, if any

In [14]:
ShipData.isnull().sum()

product_type               0
sku                        0
price                      0
availability               0
number_of_products_sold    0
revenue_generated          0
customer_demographics      0
stock_levels               0
lead_times                 0
order_quantities           0
shipping_times             0
shipping_carriers          0
shipping_costs             0
supplier_name              0
location                   0
lead_time                  0
production_volumes         0
manufacturing_lead_time    0
manufacturing_costs        0
inspection_results         0
defect_rates               0
transportation_modes       0
routes                     0
costs                      0
distance_(km)              0
dtype: int64
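
Every count is zero, so nothing needs to be imputed here. Had there been gaps, one common option is to fill numeric columns with the median and categorical columns with the most frequent value. The sketch below assumes a small toy frame `df` with deliberately missing entries; it is not part of the original notebook, and the right strategy depends on the column.

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative only) with one missing value per column.
df = pd.DataFrame({
    "price": [69.02, np.nan, 26.60, 76.59],
    "shipping_carriers": ["Carrier A", "Carrier C", None, "Carrier A"],
})

num_cols = df.select_dtypes(include="number").columns
cat_cols = df.select_dtypes(exclude="number").columns

# Numeric columns: fill with the per-column median.
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Categorical columns: fill with the most frequent value (the mode).
for col in cat_cols:
    df[col] = df[col].fillna(df[col].mode().iloc[0])

print(df.isnull().sum())
```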

### Find and handle duplicates, if any

In [16]:
ShipData.duplicated().value_counts()

False    25000
Name: count, dtype: int64
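
All 25,000 rows are unique, so no rows need to be dropped. For reference, a minimal sketch of how duplicates could be removed if any were found, using a toy frame `df` that is assumed for illustration only:

```python
import pandas as pd

# Toy frame (illustrative only): the third row repeats the second.
df = pd.DataFrame({
    "sku": ["SKU0", "SKU1", "SKU1", "SKU2"],
    "price": [69.02, 96.01, 96.01, 26.60],
})

n_dupes = int(df.duplicated().sum())  # rows that repeat an earlier row
deduped = df.drop_duplicates(keep="first").reset_index(drop=True)

print(n_dupes, len(deduped))  # 1 3
```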

### Check that each column has the right data type

In [18]:
ShipData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 25 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   product_type             25000 non-null  object 
 1   sku                      25000 non-null  object 
 2   price                    25000 non-null  float64
 3   availability             25000 non-null  int64  
 4   number_of_products_sold  25000 non-null  int64  
 5   revenue_generated        25000 non-null  float64
 6   customer_demographics    25000 non-null  object 
 7   stock_levels             25000 non-null  int64  
 8   lead_times               25000 non-null  int64  
 9   order_quantities         25000 non-null  int64  
 10  shipping_times           25000 non-null  int64  
 11  shipping_carriers        25000 non-null  object 
 12  shipping_costs           25000 non-null  float64
 13  supplier_name            25000 non-null  object 
 14  location              

### Methods Used for Project 1:

* Loaded dataset with 25 features and 25k records
* Verified no missing values or duplicates
* Selected key features: Routes, Shipping times, Distance, etc.
* Encoded categorical 'Routes' with LabelEncoder
* Normalized numerical features with StandardScaler
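
The encoding and scaling steps listed above could look roughly like the sketch below. It is a reconstruction under assumptions, not the notebook's exact cells: the toy frame `df` reuses the five rows from the `head()` output, and the choice of `shipping_times` and `distance_(km)` as the numeric features is assumed.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Toy frame (illustrative only) built from the first five rows shown earlier.
df = pd.DataFrame({
    "routes": ["Route B", "Route C", "Route A", "Route A", "Route B"],
    "shipping_times": [3, 4, 3, 4, 7],
    "distance_(km)": [1016, 967, 1011, 1039, 1058],
})

# Map each route label to an integer; classes are sorted alphabetically.
encoder = LabelEncoder()
df["routes_encoded"] = encoder.fit_transform(df["routes"])

# Rescale the numeric features to zero mean and unit variance.
num_cols = ["shipping_times", "distance_(km)"]
scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(df[num_cols]), columns=num_cols, index=df.index
)

print(df["routes_encoded"].tolist())  # [1, 2, 0, 0, 1]
print(scaled.round(2))
```

Note that scikit-learn documents `LabelEncoder` as a tool for target labels; for input features, `OrdinalEncoder` or `OneHotEncoder` is the more usual choice, since integer codes imply an order that routes do not have.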

### Results:

#### Successfully completed preprocessing with:

* Clean dataset (no missing values/duplicates)
* Normalized column names

### Possible Improvement(s):

* Implement ColumnTransformer for a unified preprocessing pipeline
* Add feature correlation analysis before selection
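
Both improvements can be sketched in a few lines. The example below is a hedged illustration, not a finished pipeline: the toy frame `df` reuses the five rows from the `head()` output, the split into `num_cols` and `cat_cols` is an assumption, and one-hot encoding is swapped in for the label encoding used earlier.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame (illustrative only) built from the first five rows shown earlier.
df = pd.DataFrame({
    "routes": ["Route B", "Route C", "Route A", "Route A", "Route B"],
    "transportation_modes": ["Air", "Rail", "Air", "Rail", "Sea"],
    "shipping_times": [3, 4, 3, 4, 7],
    "distance_(km)": [1016, 967, 1011, 1039, 1058],
})

num_cols = ["shipping_times", "distance_(km)"]    # assumed numeric features
cat_cols = ["routes", "transportation_modes"]     # assumed categorical features

# Step 1: inspect pairwise Pearson correlations before choosing features.
corr = df[num_cols].corr()
print(corr.round(2))

# Step 2: one object that scales numeric columns and one-hot encodes
# categorical columns, so the same transformation can be reapplied later.
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

X = preprocessor.fit_transform(df)
print(X.shape)  # (5, 8): 2 scaled columns + 3 routes + 3 transport modes
```

Fitting the `ColumnTransformer` on training data only and then calling `transform` on new data keeps the scaling statistics and category lists consistent across splits.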