# ARTI308 - Machine Learning
## Lab 3: Exploratory Data Analysis (EDA) Assignment



---
## Step 1: Import Libraries

In [1]:
!pip install matplotlib seaborn pandas numpy








In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings('ignore')
sns.set_theme(style='whitegrid')

print("Libraries imported successfully!")



Libraries imported successfully!


---
##  Load and Preview the Dataset

In [3]:
df = pd.read_csv("Chocolate_Sales.csv")

print("Dataset loaded successfully!")
print(f"Shape: {df.shape[0]} rows × {df.shape[1]} columns")
print()
df.head()

Dataset loaded successfully!
Shape: 3282 rows × 6 columns



| | Sales Person | Country | Product | Date | Amount | Boxes Shipped |
|---|---|---|---|---|---|---|
| 0 | Jehu Rudeforth | UK | Mint Chip Choco | 04/01/2022 | \$5,320.00 | 180 |
| 1 | Van Tuxwell | India | 85% Dark Bars | 01/08/2022 | \$7,896.00 | 94 |
| 2 | Gigi Bohling | India | Peanut Butter Cubes | 07/07/2022 | \$4,501.00 | 91 |
| 3 | Jan Morforth | Australia | Peanut Butter Cubes | 27/04/2022 | \$12,726.00 | 342 |
| 4 | Jehu Rudeforth | UK | Peanut Butter Cubes | 24/02/2022 | \$13,685.00 | 184 |


In [None]:
df.tail()

---
##  Data Cleaning & Type Conversion

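The `Date` column is stored as day-first text (`DD/MM/YYYY`) and needs to be converted to datetime format. The `Amount` column is also text: it contains a dollar sign (`$`) and comma thousands separators, both of which must be removed before converting to numeric.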

In [None]:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)

df['Amount'] = df['Amount'].replace(r'[\$,]', '', regex=True)
df['Amount'] = pd.to_numeric(df['Amount'])

df['Month'] = df['Date'].dt.to_period('M')
df['Year'] = df['Date'].dt.year
df['Revenue_per_Box'] = df['Amount'] / df['Boxes Shipped']

print("Data types after cleaning:")
print(df.dtypes)
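
The cleaning steps above can be checked on a tiny sample before running them on the full file. The sketch below uses a made-up two-row frame in the raw CSV format (the `sample` name and its rows are illustrative, not part of the notebook). It passes an explicit `format` string, which is a stricter alternative to `dayfirst=True` because a date that does not match raises an error instead of being reinterpreted.

```python
import pandas as pd

# Hypothetical two-row sample that mimics the raw CSV format
sample = pd.DataFrame({
    "Date": ["04/01/2022", "27/04/2022"],
    "Amount": ["$5,320.00", "$12,726.00"],
})

# An explicit day-first format reads 04/01/2022 as 4 January, not 1 April
sample["Date"] = pd.to_datetime(sample["Date"], format="%d/%m/%Y")

# Strip "$" and "," first; errors="coerce" turns unparseable values into NaN
cleaned = sample["Amount"].str.replace(r"[$,]", "", regex=True)
sample["Amount"] = pd.to_numeric(cleaned, errors="coerce")

print(sample.dtypes)
print(sample)
```

If `errors="coerce"` is used on the real data, it is worth re-running the missing-value check afterwards, since any value that failed to parse would show up there as `NaN`.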

---
##  Check Missing Values

In [None]:
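total_missing = df.isna().sum().sum()
print("Missing values per column:")
print(df.isna().sum())
print()
print(f"Total missing values: {total_missing}")
# Only report a clean dataset when the count is actually zero
if total_missing == 0:
    print("No missing values found; the dataset is clean.")
else:
    print("Missing values found; handle them before further analysis.")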

---
##  Check for Duplicate Rows

In [None]:
duplicate_count = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count}")

if duplicate_count == 0:
    print("No duplicate rows found.")
else:
    print("Duplicates found. Removing them...")
    df = df.drop_duplicates()
    print(f"Removed {duplicate_count} duplicates.")

---
##  Dataset Shape and Structure

In [None]:
print(f"Number of rows    : {df.shape[0]}")
print(f"Number of columns : {df.shape[1]}")
print()
print("Column names:", list(df.columns))
print()
print("Unique values per categorical column:")
for col in ['Sales Person', 'Country', 'Product']:
    print(f"  {col}: {df[col].nunique()} unique values")

---
##  Descriptive Statistics

In [None]:
df.describe(include='all')

In [None]:
print("Key Statistics:")
print(f"  Total Revenue        : ${df['Amount'].sum():,.2f}")
print(f"  Average Sale Amount  : ${df['Amount'].mean():,.2f}")
print(f"  Median Sale Amount   : ${df['Amount'].median():,.2f}")
print(f"  Max Sale Amount      : ${df['Amount'].max():,.2f}")
print(f"  Min Sale Amount      : ${df['Amount'].min():,.2f}")
print()
print(f"  Total Boxes Shipped  : {df['Boxes Shipped'].sum():,}")
print(f"  Avg Boxes per Sale   : {df['Boxes Shipped'].mean():.1f}")

---
##  Univariate Analysis

Univariate analysis examines each variable individually to understand its distribution.

In [None]:
plt.figure(figsize=(9, 5))
sns.histplot(df['Amount'], bins=30, kde=True, color='chocolate')
plt.title('Distribution of Revenue (Amount)', fontsize=14)
plt.xlabel('Amount ($)')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()

 The revenue distribution is slightly right-skewed. Most sales fall between \$2,000–\$10,000, with a few large outlier sales above \$20,000.
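
Skewness read from a histogram can be backed up with a number. The following is a minimal sketch on synthetic log-normal values that stand in for `df['Amount']` (the `amounts` series is generated, not real sales data). A positive result from `Series.skew()` and a mean above the median both point to a right tail.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values standing in for df['Amount'] (not real data)
rng = np.random.default_rng(0)
amounts = pd.Series(rng.lognormal(mean=8.5, sigma=0.6, size=1000))

skewness = amounts.skew()  # a value above 0 indicates a longer right tail
mean_val, median_val = amounts.mean(), amounts.median()

print(f"Skewness: {skewness:.2f}")
print(f"Mean: {mean_val:,.2f}  Median: {median_val:,.2f}")
```

On the real data, the same two lines can be applied directly to `df['Amount']` and `df['Boxes Shipped']`.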

In [None]:
plt.figure(figsize=(9, 5))
sns.histplot(df['Boxes Shipped'], bins=30, kde=True, color='saddlebrown')
plt.title('Distribution of Boxes Shipped', fontsize=14)
plt.xlabel('Boxes Shipped')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()

 Boxes shipped also show a right-skewed distribution. Most shipments are relatively small (under 200 boxes), while a few large orders exceed 600 boxes.

In [None]:
plt.figure(figsize=(9, 5))
country_counts = df['Country'].value_counts()
sns.barplot(x=country_counts.index, y=country_counts.values, palette='YlOrBr')
plt.title('Number of Transactions per Country', fontsize=14)
plt.xlabel('Country')
plt.ylabel('Number of Transactions')
plt.tight_layout()
plt.show()

 Australia has the highest number of transactions, followed by India and the UK, which makes Australia the most active market by transaction count.

---
## Bivariate Analysis

Bivariate analysis examines relationships between two variables.

In [None]:
country_revenue = df.groupby('Country')['Amount'].sum().sort_values(ascending=False)

plt.figure(figsize=(10, 5))
country_revenue.plot(kind='bar', color='chocolate', edgecolor='black')
plt.title('Total Revenue by Country', fontsize=14)
plt.ylabel('Revenue ($)')
plt.xlabel('Country')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

print(country_revenue)

In [None]:
product_revenue = df.groupby('Product')['Amount'].sum().sort_values(ascending=False).head(10)

plt.figure(figsize=(12, 5))
product_revenue.plot(kind='bar', color='peru', edgecolor='black')
plt.title('Top 10 Products by Revenue', fontsize=14)
plt.ylabel('Revenue ($)')
plt.xlabel('Product')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

print(product_revenue)

In [None]:
sp_revenue = df.groupby('Sales Person')['Amount'].sum().sort_values(ascending=False).head(10)

plt.figure(figsize=(10, 5))
sp_revenue.plot(kind='barh', color='sienna', edgecolor='black')
plt.title('Top 10 Sales Persons by Revenue', fontsize=14)
plt.xlabel('Revenue ($)')
plt.tight_layout()
plt.show()

print(sp_revenue)

In [None]:
plt.figure(figsize=(9, 5))
sns.scatterplot(x='Boxes Shipped', y='Amount', data=df, alpha=0.4, color='chocolate')
plt.title('Boxes Shipped vs Revenue', fontsize=14)
plt.xlabel('Boxes Shipped')
plt.ylabel('Amount ($)')
plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(10, 5))
sns.boxplot(x='Country', y='Amount', data=df, palette='YlOrBr')
plt.title('Revenue Distribution by Country', fontsize=14)
plt.xlabel('Country')
plt.ylabel('Amount ($)')
plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(7, 7))
df['Country'].value_counts().plot(
    kind='pie', autopct='%1.1f%%',
    colors=sns.color_palette('YlOrBr', 6)
)
plt.title('Transactions Share by Country', fontsize=14)
plt.ylabel('')
plt.tight_layout()
plt.show()

---
## Correlation Analysis

In [None]:
plt.figure(figsize=(7, 5))
corr_matrix = df[['Amount', 'Boxes Shipped', 'Revenue_per_Box']].corr()
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='YlOrBr', linewidths=0.5)
plt.title('Correlation Matrix', fontsize=14)
plt.tight_layout()
plt.show()

print(corr_matrix)

- `Amount` and `Boxes Shipped` have a moderate positive correlation (~0.6): larger shipments tend to coincide with higher revenue, although correlation alone does not establish causation.
- `Revenue_per_Box` has a weaker correlation with `Boxes Shipped`, which suggests that the price per box varies across products and orders.
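
Pearson correlation, the default in `DataFrame.corr()`, measures linear association and is sensitive to extreme values, so comparing it with the rank-based Spearman coefficient is a reasonable robustness check for skewed data. The sketch below uses synthetic data (the `demo` frame and its price range are assumptions for illustration only).

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the cleaned DataFrame (values are made up)
rng = np.random.default_rng(42)
boxes = rng.integers(1, 500, size=500)
amount = boxes * rng.uniform(10, 60, size=500)  # assumed price per box
demo = pd.DataFrame({"Amount": amount, "Boxes Shipped": boxes})

# Same columns, two correlation methods
pearson = demo.corr(method="pearson").loc["Amount", "Boxes Shipped"]
spearman = demo.corr(method="spearman").loc["Amount", "Boxes Shipped"]

print(f"Pearson : {pearson:.2f}")
print(f"Spearman: {spearman:.2f}")
```

If the two coefficients differ noticeably on the real data, a few large orders are probably driving the Pearson value.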

---
## Time-Based Analysis

In [None]:
monthly_revenue = df.groupby('Month')['Amount'].sum()

plt.figure(figsize=(12, 5))
monthly_revenue.plot(marker='o', color='chocolate', linewidth=2)
plt.title('Monthly Revenue Trend', fontsize=14)
plt.ylabel('Revenue ($)')
plt.xlabel('Month')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
monthly_boxes = df.groupby('Month')['Boxes Shipped'].sum()

plt.figure(figsize=(12, 5))
monthly_boxes.plot(marker='s', color='saddlebrown', linewidth=2)
plt.title('Monthly Boxes Shipped Trend', fontsize=14)
plt.ylabel('Total Boxes Shipped')
plt.xlabel('Month')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

 Revenue and box shipments follow similar trends over time. Both show visible peaks and dips across months, which may point to seasonal patterns in chocolate sales; confirming seasonality would require comparing the same months across several years.
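
Peaks and dips are easier to compare when expressed as month-over-month change and smoothed with a moving average. This is a minimal sketch on made-up monthly totals that stand in for `monthly_revenue` (the `monthly` series and the 3-month window are illustrative choices).

```python
import pandas as pd

# Made-up monthly totals standing in for monthly_revenue
monthly = pd.Series(
    [100.0, 120.0, 90.0, 110.0, 130.0, 125.0],
    index=pd.period_range("2022-01", periods=6, freq="M"),
)

pct_change = monthly.pct_change() * 100         # month-over-month change in %
rolling_avg = monthly.rolling(window=3).mean()  # 3-month moving average

summary = pd.DataFrame({
    "Revenue": monthly,
    "MoM %": pct_change.round(1),
    "3M avg": rolling_avg.round(1),
})
print(summary)
```

The first row of `MoM %` and the first two rows of `3M avg` are `NaN` because there is no earlier data to compare against.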

---
##  Outlier Detection

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

sns.boxplot(y=df['Amount'], color='chocolate', ax=axes[0])
axes[0].set_title('Outliers in Revenue (Amount)', fontsize=13)

sns.boxplot(y=df['Boxes Shipped'], color='saddlebrown', ax=axes[1])
axes[1].set_title('Outliers in Boxes Shipped', fontsize=13)

plt.tight_layout()
plt.show()

for col in ['Amount', 'Boxes Shipped']:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    outliers = df[(df[col] < Q1 - 1.5*IQR) | (df[col] > Q3 + 1.5*IQR)]
    print(f"{col}: {len(outliers)} outliers detected")

Both `Amount` and `Boxes Shipped` contain outliers. These represent unusually large orders and are likely legitimate high-value transactions rather than data errors.
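
The 1.5 × IQR rule used above can be wrapped in a small helper that also reports the cut-off values, which makes it easier to judge whether the flagged rows are plausible. The `iqr_bounds` function and the sample values below are hypothetical and only illustrate the rule.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences of the k * IQR rule for a numeric Series."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up values with one obvious extreme
values = pd.Series([10, 12, 13, 14, 15, 16, 18, 100])
lower, upper = iqr_bounds(values)
flagged = values[(values < lower) | (values > upper)]

print(f"Fences: [{lower:.2f}, {upper:.2f}]")
print(f"Flagged: {flagged.tolist()} ({len(flagged) / len(values):.1%} of rows)")
```

Reporting the share of flagged rows alongside the count helps decide whether to keep, cap, or inspect them.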

---
## Summary & Results

After performing a complete EDA on the Chocolate Sales dataset, here are the key findings:

| Finding | Detail |
|---|---|
| **Dataset Size** | 3,282 rows × 6 columns |
| **Missing Values** | None |
| **Duplicate Rows** | None |
| **Date Range** | 2022 |
| **Total Revenue** | ~$19.79 Million |
| **Top Country** | Australia (highest revenue & transactions) |
| **Top Product** | Smooth Sliky Salty |
| **Correlation** | Moderate positive correlation between Boxes Shipped and Revenue |
| **Outliers** | Present in both Amount and Boxes Shipped — likely large legitimate orders |
| **Distribution** | Right-skewed for both Amount and Boxes Shipped |
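
The "Top Country" row combines two separate rankings, revenue and transaction count. One way to check both in a single pass is a grouped aggregation, sketched here on made-up rows in the cleaned schema (the `demo` frame is illustrative and not taken from `Chocolate_Sales.csv`).

```python
import pandas as pd

# Made-up rows in the cleaned schema (not taken from Chocolate_Sales.csv)
demo = pd.DataFrame({
    "Country": ["Australia", "India", "Australia", "UK", "India", "Australia"],
    "Amount": [5000.0, 7000.0, 4000.0, 3000.0, 2000.0, 6000.0],
})

# Total revenue and number of transactions per country
by_country = (
    demo.groupby("Country")["Amount"]
    .agg(total_revenue="sum", transactions="count")
    .sort_values("total_revenue", ascending=False)
)

print(by_country)
print("Top by revenue     :", by_country["total_revenue"].idxmax())
print("Top by transactions:", by_country["transactions"].idxmax())
```

Running the same aggregation on `df` would show whether one country leads on both measures or only on one.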

### Business Recommendations
Australia should remain the key target market, since it contributes the most in both sales volume and revenue. Promotional activities should focus on high-performing products such as Smooth Sliky Salty to build on their strong revenue. Identifying seasonal demand trends from the monthly sales peaks would support more efficient stock management and inventory planning. Finally, the team can improve overall sales performance and consistency by studying the tactics of top-performing salespeople and adopting their methods.