Analysis on Shipment Data

In [6]:
import pandas as pd
from ydata_profiling import ProfileReport 


In [8]:
df = pd.read_csv("data/shipment.csv")



In [9]:
# Build the profiling report; explorative=True enables the more detailed analyses
profile = ProfileReport(df, title="Shipment Data Report", explorative=True)

In [10]:
profile.to_notebook_iframe()



### 📌 Column: Customer Id

- The `Customer Id` serves as a **primary key** and can be used for uniquely identifying records or lookup purposes.
- It contains **no missing values or duplicates**, ensuring high data integrity.
- Since it carries **no predictive power** for modeling shipment cost, it should be **dropped** from the feature set during model training.



### 📌 Column: Artist Name

- The `Artist Name` column has **99.2% distinct values** and **no missing data**. Because nearly every record has its own value, the column behaves like an identifier and offers little predictive value for shipment cost, so it can be **dropped**.
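
A minimal sketch of how identifier-like columns such as `Customer Id` and `Artist Name` could be detected and dropped. The toy values and the 0.95 cut-off are assumptions for illustration; only the column names come from the report.

```python
import pandas as pd

# Hypothetical toy frame; column names follow the report, values are made up.
toy = pd.DataFrame({
    "Customer Id": ["c1", "c2", "c3", "c4"],
    "Artist Name": ["Ann", "Bob", "Cy", "Dee"],
    "Weight": [10.0, 12.0, 10.0, 12.0],
})

# Only inspect non-numeric columns: continuous numeric features are
# naturally near-unique and should not be mistaken for identifiers.
text_cols = toy.select_dtypes(exclude="number")

# Share of distinct values per column; a ratio near 1.0 means ID-like.
distinct_ratio = text_cols.nunique() / len(toy)

# 0.95 is an assumed threshold, not a value taken from the report.
id_like = distinct_ratio[distinct_ratio > 0.95].index.tolist()
features = toy.drop(columns=id_like)
```

On the real data the same check would presumably also flag `Customer Location`, which the report shows as 100% unique.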


### 📌 Column: Artist Reputation

- The `Artist Reputation` column is a real-valued feature ranging from 0 to 1, with **11.5% missing values**, which will need to be imputed or handled appropriately.
- The distribution appears roughly **symmetric** (mean ≈ median, skewness ≈ 0) and **not heavily tailed** (negative kurtosis), suggesting a fairly uniform spread of values.
- The **standard deviation (0.26)** is close to the roughly 0.29 of a uniform distribution on [0, 1], and the **moderate IQR (0.44)** shows a meaningful spread in the central 50% of the data, so the feature is far from constant and could help in distinguishing between records.
- Together, these imply that `Artist Reputation` holds informative variance and should be **retained** after handling missing values.
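
One possible way to handle the missing `Artist Reputation` values is median imputation with a missingness flag. This is a sketch on made-up values; the choice of the median and the flag are assumptions, not something the report prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical sample on the 0-1 scale with two gaps.
rep = pd.Series([0.2, np.nan, 0.5, 0.8, np.nan, 0.5], name="Artist Reputation")

# Keep a flag so a model can still see which rows were imputed.
was_missing = rep.isna()

# A roughly symmetric distribution makes mean and median nearly
# interchangeable; the median is used here as the more robust default.
filled = rep.fillna(rep.median())
```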


### 📌 Column: Height

- `Height` is a real-valued variable with a wide range (3 to 73) and **moderate standard deviation (11.97)**, indicating significant variation across records.
- A **moderate IQR of 18** and **positive skewness (0.59)** suggest the distribution is slightly right-skewed but still usable for modeling.
- Despite 5.8% missing values, the meaningful spread and variation in data make this feature **valuable for prediction** after imputation.


### 📌 Column: Width

- `Width` has a **moderate standard deviation (5.42)** and **IQR of 6**, indicating sufficient variability to potentially impact predictions.
- The **positive skewness (1.55)** and **high kurtosis (3.36)** imply a long right tail and presence of outliers, which may need transformation or treatment.
- Despite 9% missing values, its variability and possible correlation with cost suggest it should be **retained with proper preprocessing**.
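
The shape statistics quoted for `Height` and `Width` can be reproduced with pandas, and the IQR gives a simple outlier rule. The values below are hypothetical, and the 1.5 × IQR fence is a common convention assumed here rather than the report's own criterion.

```python
import pandas as pd

# Hypothetical widths with one unusually large piece.
width = pd.Series([4, 5, 5, 6, 6, 7, 8, 9, 10, 40], name="Width", dtype=float)

skew = width.skew()          # > 0 indicates a longer right tail
excess_kurt = width.kurt()   # pandas reports excess kurtosis (normal = 0)

# Tukey-style fences built from the interquartile range.
q1, q3 = width.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = width[(width < q1 - 1.5 * iqr) | (width > q3 + 1.5 * iqr)]
```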


### 📌 Column: Weight

- `Weight` shows **extremely high skewness (21.56)** and **kurtosis (731.84)**, indicating a heavily right-skewed distribution with extreme outliers.
- The **very large range and standard deviation** suggest the presence of a few disproportionately large values, which may distort model training.
- Due to its high correlation and potential impact on cost, this feature should be **retained with log transformation or outlier handling** to normalize its effect.
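
A minimal sketch of the log transformation suggested for `Weight`, using `numpy.log1p` so that zero values stay defined. The sample values are made up to mimic a few extreme weights.

```python
import numpy as np
import pandas as pd

# Hypothetical weights: mostly small, one extreme value.
weight = pd.Series([3.0, 10.0, 25.0, 60.0, 150.0, 120000.0], name="Weight")

# log1p(x) = log(1 + x) compresses the right tail and is defined at 0.
log_weight = np.log1p(weight)

# expm1 is the exact inverse, so predictions or summaries can be mapped back.
restored = np.expm1(log_weight)
```

The transform reduces, but does not necessarily remove, the skew, so checking `log_weight.skew()` afterwards is worthwhile.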


### 📌 Column: Material

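- `Material` is a **categorical feature** with only 7 distinct values, making it suitable for **encoding techniques** such as one-hot encoding, or label encoding when the model is tree-based (integer labels impose an arbitrary order that linear models would misread).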
- With **11.8% missing values**, imputation (e.g., using mode) may be necessary before use in modeling.
- Given its **high correlation with the target**, this feature should be **retained** after handling missing data.
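
Mode imputation followed by one-hot encoding could look like the sketch below. The material names are hypothetical placeholders; the report only states that there are 7 distinct values.

```python
import numpy as np
import pandas as pd

# Hypothetical categories with two missing entries.
material = pd.Series(
    ["Wood", "Clay", np.nan, "Wood", "Brass", np.nan], name="Material"
)

# mode() ignores missing values; take the first entry in case of ties.
mode_value = material.mode().iloc[0]
filled = material.fillna(mode_value)

# One indicator column per category.
one_hot = pd.get_dummies(filled, prefix="Material")
```

With roughly 12% of values missing, treating "missing" as its own category is an alternative worth comparing against mode imputation.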



### 📌 Column: Price Of Sculpture

- The **extremely high skewness (22.21)** and **kurtosis (727.3)** indicate a long right tail and extreme outliers in `Price Of Sculpture`.
- A **large IQR (84.24)** relative to the lower quartiles, combined with a high **CV (7.39)**, confirms substantial variability and imbalance in value distribution.
- These characteristics reinforce the need for **log transformation** or **outlier treatment** to stabilize variance before modeling.
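
For reference, the coefficient of variation (CV) is the standard deviation divided by the mean. The sketch below computes it with pandas' sample standard deviation on made-up prices; the profiling report may use a slightly different convention, so figures need not match exactly.

```python
import pandas as pd

# Hypothetical prices: a cluster of cheap pieces and one very expensive one.
price = pd.Series([5.0, 8.0, 12.0, 20.0, 35.0, 5000.0], name="Price Of Sculpture")

# CV > 1 means the spread exceeds the mean, typical of heavy right tails.
cv = price.std() / price.mean()

# The IQR is unaffected by the single extreme value.
iqr = price.quantile(0.75) - price.quantile(0.25)
```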


### 📌 Column: Base Shipping Price

- `Base Shipping Price` has **no missing values** and over **57% unique entries**, offering useful continuous variability for modeling.
- With a **moderate standard deviation (26.87)** and **IQR of 41.21**, it shows a healthy spread, though a slight right skew (0.91) exists.
- Given its **high correlation with the target**, it is a **valuable predictor** and should be retained as-is or optionally normalized.


### 📌 Column: International

- `International` is a **boolean feature** with only 2 distinct values and **no missing data**, i.e., a binary flag (e.g., domestic vs. international).
- Its simplicity, **high correlation**, and potential impact on cost make it an **important categorical feature to retain**.

### 📌 Column: Express Shipment

- `Express Shipment` is a **binary categorical feature** with **no missing values** and a distribution of 67.2% False and 32.8% True.
- This slight imbalance is acceptable, and the feature likely captures urgency or priority, which can influence shipping cost.
- Due to its **high correlation**, it should be **retained** and **label encoded** (False = 0, True = 1) for modeling.
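
A short sketch of the False = 0 / True = 1 encoding, assuming the column is already parsed as boolean. If the CSV stores it as text instead, an explicit mapping would be needed first. The sample values are made up.

```python
import pandas as pd

# Hypothetical flag column, already boolean.
express = pd.Series(
    [False, True, False, False, True, False], name="Express Shipment"
)

# Class balance as proportions, comparable to the 67.2% / 32.8% split above.
share = express.value_counts(normalize=True)

# Booleans cast directly to 0/1 integers.
encoded = express.astype(int)
```

The same cast would apply to `International`.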


### 📌 Column: Scheduled Date & Delivery Date

- Both columns are **datetime features** with no missing or invalid entries, ranging from **2015 to 2019**.
- While individually they may not add much value, the **difference between Delivery Date and Scheduled Date** can be engineered as a new feature (e.g., shipping duration) to capture delivery delays or efficiency.
- Hence, they should be **retained temporarily for feature engineering** and then optionally dropped afterward.
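
The duration feature described above can be sketched as follows. The sample dates, the ISO date format, and the new column name `Shipping Days` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical date pairs; the real file may use a different date format.
dates = pd.DataFrame({
    "Scheduled Date": ["2015-06-07", "2017-03-06", "2019-01-10"],
    "Delivery Date": ["2015-06-03", "2017-03-05", "2019-01-14"],
})

for col in ["Scheduled Date", "Delivery Date"]:
    dates[col] = pd.to_datetime(dates[col], format="%Y-%m-%d")

# Positive = delivered after the scheduled date, negative = delivered early.
dates["Shipping Days"] = (
    dates["Delivery Date"] - dates["Scheduled Date"]
).dt.days

# The raw datetime columns can be dropped once the feature exists.
engineered = dates.drop(columns=["Scheduled Date", "Delivery Date"])
```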

---

### 📌 Column: Customer Location

- `Customer Location` is a **text field with 100% unique values**, meaning it acts as an identifier rather than a useful predictive feature.
- Since it doesn’t generalize across records and has no repeated patterns, it should be **dropped** for modeling.


### 📌 Column: Cost (Target Variable)

- `Cost` is the **target variable**, with **no missing values** and high uniqueness (97.8%), ideal for regression modeling.
- It is **heavily right-skewed (skewness = 29.82)** with extreme **outliers (kurtosis = 1124.77)**, and even contains **10.1% negative values**, which is unusual for cost data.
- This variable requires **careful preprocessing**: the **negative values must be handled first** (a log transformation is undefined for them), after which **log transformation** or **outlier capping** can be applied to ensure stable and accurate model performance.
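
One possible preprocessing order for `Cost` is sketched below on made-up values. Taking the absolute value assumes the negative costs are sign errors, which the data alone does not confirm; dropping or separately flagging those rows is an equally defensible choice. The 99th-percentile cap is likewise an assumed threshold.

```python
import numpy as np
import pandas as pd

# Hypothetical costs including negatives and one extreme value.
cost = pd.Series([120.0, -85.0, 300.0, 45.0, -10.0, 250000.0], name="Cost")

# Quantify the problem before changing anything.
negative_share = (cost < 0).mean()

# ASSUMPTION: negatives are sign errors, so the magnitude is kept.
cost_abs = cost.abs()

# Cap extreme values at an assumed upper percentile.
cap = cost_abs.quantile(0.99)
cost_capped = cost_abs.clip(upper=cap)

# Only now is the log transform defined for every row.
log_cost = np.log1p(cost_capped)
```

If the model is trained on `log_cost`, predictions need `numpy.expm1` to return to the original scale.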


### 📊 Heatmap Interpretation

- This heatmap shows the **correlation matrix** of numerical and boolean features in the dataset, where the values range from **-1 to +1**:
  - **+1 (dark blue)**: strong positive correlation
  - **0 (light)**: no correlation
  - **-1 (dark red)**: strong negative correlation
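
A comparable matrix can be computed directly with pandas. The sketch below uses Pearson correlation on made-up values; the profiling report may use a different correlation measure, so its numbers need not match.

```python
import pandas as pd

# Hypothetical numeric and boolean columns named after the report's features.
toy = pd.DataFrame({
    "Weight": [10.0, 20.0, 30.0, 40.0, 50.0],
    "Base Shipping Price": [12.0, 25.0, 33.0, 41.0, 58.0],
    "Express Shipment": [False, True, False, True, True],
    "Cost": [100.0, 210.0, 290.0, 420.0, 520.0],
})

# Cast booleans to 0/1 so they enter the matrix alongside numeric columns.
corr = toy.astype(float).corr(method="pearson")

# Rank features by their correlation with the target.
cost_corr = corr["Cost"].drop("Cost").sort_values(ascending=False)
```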

---

### 🔍 Key Observations:

- `Cost` shows **strong positive correlation** with:
  - `Weight`
  - `Price Of Sculpture`
  - `Base Shipping Price`
  - (and to a lesser extent) `Height` and `Width`
  
- `Weight` is also positively correlated with `Width`, `Height`, and `Base Shipping Price`, indicating that larger/heavier sculptures likely cost more to ship.

- `Artist Reputation`, `Material`, `International`, and `Express Shipment` show **weaker but visible positive correlations** with `Cost`, suggesting they may still contribute value when combined with other features.

- Most categorical/boolean features (e.g., `Installation Included`, `Fragile`, `Remote Location`) have low correlations individually, but may still hold **non-linear influence** or interact with other variables.

---

### ✅ Conclusion:

- Features like `Weight`, `Price Of Sculpture`, and `Base Shipping Price` are **key drivers** of `Cost` and should be prioritized in modeling.
- Lower-correlation features shouldn't be discarded outright, especially in tree-based models (e.g., XGBoost) that can handle complex interactions.
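
To illustrate why a low linear correlation alone is weak grounds for discarding a feature, the sketch below uses a purely synthetic pair (not columns from the shipment data) in which `y` is fully determined by `x`, yet the Pearson correlation is zero.

```python
import pandas as pd

# Synthetic example: y depends on x exactly, but not linearly.
x = pd.Series([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = x ** 2

# Pearson correlation only measures linear association.
pearson = x.corr(y)

# The dependence shows up once the non-linear term is used directly.
pearson_squared_term = (x ** 2).corr(y)
```

Tree-based models can pick up this kind of structure without manual feature construction, which is the rationale for keeping low-correlation features in the candidate set.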
