# Completed Project: Sales Data Analysis
This notebook covers the full Task 2 requirements: loading a dataset, cleaning it, exploratory data analysis (EDA), visualization, and answering the research questions.

## Research questions answered
1. How do manufacturing and freight costs affect net sales and profit over time?
2. Which cost component contributes most to margin variability?


In [None]:
# Imports
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path

BASE = Path('../')  # adjust if opened from /notebooks
DATA_PATH = BASE / 'data' / 'sample_sales.csv'
df = pd.read_csv(DATA_PATH)
df['date'] = pd.to_datetime(df['date'])
df.head()


## Data cleaning
- Removed duplicates (if any).
- Converted `date` to datetime type.
- Counted missing values per column and printed summary statistics (no imputation is performed).


In [None]:
# Basic cleaning steps
df = df.drop_duplicates().reset_index(drop=True)
print('Shape:', df.shape)
print('\nMissing values:\n', df.isnull().sum())
df.describe()


## Exploratory Data Analysis
### 1) Time series of sales metrics
Saved figure: `outputs/timeseries_sales.png`


In [None]:
# Load and display the timeseries image
from IPython.display import Image, display
display(Image(filename='../outputs/timeseries_sales.png'))
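
The cell above only displays a saved image; the code that produced it is not part of this notebook. As a rough sketch of how such a time-series figure could be drawn, the example below plots metric columns against `date`. The `demo` frame, the column names `net_sales` and `profit`, and the helper `plot_timeseries` are all hypothetical stand-ins, not names confirmed by the dataset.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for df; the real column names may differ.
demo = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=6, freq="MS"),
    "net_sales": [100.0, 110.0, 105.0, 120.0, 130.0, 125.0],
    "profit": [20.0, 24.0, 21.0, 27.0, 31.0, 28.0],
})


def plot_timeseries(frame, metrics):
    """Plot each metric column against the date column on a single axis."""
    ordered = frame.sort_values("date")
    fig, ax = plt.subplots(figsize=(8, 4))
    for col in metrics:
        ax.plot(ordered["date"], ordered[col], marker="o", label=col)
    ax.set_xlabel("date")
    ax.set_ylabel("amount")
    ax.set_title("Sales metrics over time")
    ax.legend()
    fig.autofmt_xdate()  # tilt date labels so they do not overlap
    return fig, ax


fig, ax = plot_timeseries(demo, ["net_sales", "profit"])
# fig.savefig("../outputs/timeseries_sales.png", dpi=150)  # would overwrite the saved figure
```

Sorting by `date` before plotting avoids zig-zag lines if the CSV rows are not in chronological order.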


### 2) Cost breakdown (COGS, manufacturing, freight)
Saved figure: `outputs/cost_breakdown.png`


In [None]:
display(Image(filename='../outputs/cost_breakdown.png'))
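
One plausible way to build a cost breakdown like the one shown is to total each cost column and compare the shares. This is a sketch under assumptions: the column names `cogs`, `manufacturing_cost`, and `freight_cost` are hypothetical, and the values are made up for illustration only.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical cost columns; substitute the names used in sample_sales.csv.
demo = pd.DataFrame({
    "cogs": [60.0, 65.0, 62.0, 70.0],
    "manufacturing_cost": [25.0, 27.0, 26.0, 30.0],
    "freight_cost": [5.0, 6.0, 5.0, 7.0],
})
cost_cols = ["cogs", "manufacturing_cost", "freight_cost"]

totals = demo[cost_cols].sum()   # total spend per component
shares = totals / totals.sum()   # each component as a fraction of all costs

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(list(totals.index), totals.to_numpy())
ax.set_ylabel("total cost")
ax.set_title("Cost breakdown by component")
```

Printing `shares` next to the chart makes it easier to judge how small one component is relative to the others.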


### 3) Correlation matrix
Saved figure: `outputs/correlation_matrix.png`


In [None]:
display(Image(filename='../outputs/correlation_matrix.png'))
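
A correlation matrix of this kind can be computed with `DataFrame.corr()` on the numeric columns and drawn as a heatmap. The sketch below uses a hypothetical `demo` frame with invented column names and values; it is not the code that generated the saved figure.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical numeric columns standing in for the real dataset.
demo = pd.DataFrame({
    "net_sales": [100.0, 120.0, 110.0, 130.0, 125.0],
    "cogs": [60.0, 70.0, 66.0, 80.0, 74.0],
    "freight_cost": [5.0, 6.0, 5.0, 7.0, 6.0],
    "profit": [20.0, 26.0, 22.0, 25.0, 27.0],
})

# Pearson correlation between every pair of numeric columns.
corr = demo.select_dtypes("number").corr()

fig, ax = plt.subplots(figsize=(5, 4))
im = ax.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
fig.colorbar(im, ax=ax, label="Pearson r")
fig.tight_layout()
```

Fixing `vmin=-1` and `vmax=1` keeps the color scale symmetric, so positive and negative correlations of equal strength get equally intense colors.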


### 4) Manufacturing cost vs Net Sales (scatter + linear fit)
Saved figure: `outputs/manuf_vs_netsales.png`


In [None]:
display(Image(filename='../outputs/manuf_vs_netsales.png'))
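
The "scatter + linear fit" in the heading can be reproduced with an ordinary least-squares line from `numpy.polyfit`. The arrays below are invented illustration data, and whether the saved figure used `polyfit` or another fitting routine is not stated in the notebook.

```python
import numpy as np
import matplotlib.pyplot as plt

# Invented values for illustration; replace with the real columns.
manufacturing_cost = np.array([20.0, 25.0, 30.0, 35.0, 40.0])
net_sales = np.array([90.0, 105.0, 118.0, 135.0, 152.0])

# Degree-1 least-squares fit: net_sales is approximated by slope * cost + intercept.
slope, intercept = np.polyfit(manufacturing_cost, net_sales, deg=1)
r = np.corrcoef(manufacturing_cost, net_sales)[0, 1]  # Pearson correlation

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(manufacturing_cost, net_sales, label="observations")
xs = np.linspace(manufacturing_cost.min(), manufacturing_cost.max(), 50)
ax.plot(xs, slope * xs + intercept, color="tab:red",
        label=f"fit: y = {slope:.2f}x + {intercept:.2f}")
ax.set_xlabel("manufacturing cost")
ax.set_ylabel("net sales")
ax.legend()
```

With only a handful of points the fitted slope is sensitive to any single observation, which is why the findings treat the trend as suggestive rather than conclusive.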


### 5) Margin distributions
Saved figure: `outputs/margins_boxplot.png`


In [None]:
display(Image(filename='../outputs/margins_boxplot.png'))
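
The notebook does not say how the margins in the boxplot were defined. A minimal sketch, assuming gross margin is `(net_sales - cogs) / net_sales` and profit margin is `profit / net_sales`, with hypothetical column names and made-up values:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical columns; the margin definitions below are assumptions.
demo = pd.DataFrame({
    "net_sales": [100.0, 120.0, 110.0, 130.0],
    "cogs": [60.0, 66.0, 66.0, 78.0],
    "profit": [20.0, 30.0, 22.0, 26.0],
})

margins = pd.DataFrame({
    "gross_margin": (demo["net_sales"] - demo["cogs"]) / demo["net_sales"],
    "profit_margin": demo["profit"] / demo["net_sales"],
})

fig, ax = plt.subplots(figsize=(5, 4))
ax.boxplot([margins[col].to_numpy() for col in margins.columns])
# Boxes are drawn at positions 1..n, so label those ticks by column name.
ax.set_xticks(range(1, len(margins.columns) + 1))
ax.set_xticklabels(margins.columns)
ax.set_ylabel("margin (fraction of net sales)")
```

Expressing margins as fractions of net sales puts both boxes on the same scale, which makes their spreads directly comparable.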


## Findings & Answers to Research Questions
**Q1. How do manufacturing and freight costs affect net sales and profit over time?**
- Visual inspection (time series & scatter) shows manufacturing cost has a visible relationship with net sales in this sample. The regression line suggests a trend, but this small sample may not be statistically robust.
- Freight costs are relatively small compared to COGS and manufacturing, so their impact on net sales is minor in this dataset.

**Q2. Which cost component contributes most to margin variability?**
- From the cost breakdown and correlation matrix, `COGS` is the largest cost component and shows the strongest negative relationship with profit and net sales. It is the primary driver of margin variability here.
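
The answer above rests on visual inspection of the cost breakdown and correlation matrix. One way to put a number on "contributes most" is a covariance decomposition, which is a different technique from the one used in this notebook. The sketch assumes that the cost components do not overlap and that profit equals net sales minus their sum, so the margin is one minus the total cost ratio. Column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical data; assumes the three components are non-overlapping costs.
demo = pd.DataFrame({
    "net_sales": [100.0, 200.0, 100.0, 200.0],
    "cogs": [60.0, 132.0, 58.0, 140.0],
    "manufacturing_cost": [20.0, 42.0, 20.0, 44.0],
    "freight_cost": [5.0, 10.0, 6.0, 10.0],
})
cost_cols = ["cogs", "manufacturing_cost", "freight_cost"]

# Each cost as a fraction of net sales, row by row.
ratios = demo[cost_cols].div(demo["net_sales"], axis=0)
total_ratio = ratios.sum(axis=1)

# Var(total) = sum_i Cov(ratio_i, total), so these shares add up to 1.
contribution = ratios.apply(lambda s: s.cov(total_ratio)) / total_ratio.var()
print(contribution.sort_values(ascending=False))
```

A negative share means that component tends to move against the total and dampens margin swings rather than amplifying them.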


## Limitations & Next Steps
- The sample dataset is small and synthetic; replace it with a larger, real dataset for graded work.
- Perform statistical tests (correlation significance, regression diagnostics) for formal inference.
- Consider time-series decomposition if you have longer date ranges.
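
As an illustration of the correlation-significance step, the sketch below estimates a two-sided permutation p-value for a Pearson correlation using only NumPy. The function name `permutation_pvalue`, its parameters, and the sample arrays are hypothetical; a library routine such as `scipy.stats.pearsonr` would be a reasonable alternative if SciPy is available.

```python
import numpy as np


def permutation_pvalue(x, y, n_permutations=2000, seed=0):
    """Return (observed r, two-sided permutation p-value) for Pearson correlation."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    observed = np.corrcoef(x, y)[0, 1]
    hits = 0
    for _ in range(n_permutations):
        # Shuffling y breaks any real association with x.
        permuted = rng.permutation(y)
        if abs(np.corrcoef(x, permuted)[0, 1]) >= abs(observed):
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (hits + 1) / (n_permutations + 1)


# Invented, nearly linear data for demonstration.
cost = np.arange(8, dtype=float)
sales = np.array([1.1, 2.9, 5.2, 6.8, 9.1, 11.0, 12.8, 15.1])
r_obs, p_value = permutation_pvalue(cost, sales)
print(f"r = {r_obs:.3f}, p = {p_value:.4f}")
```

A permutation test makes no normality assumption, which suits a small sample, but its resolution is limited by `n_permutations`: the smallest reportable p-value is `1 / (n_permutations + 1)`.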


In [None]:
# Save the cleaned dataset. HTML export is a separate, optional step that is
# not performed here (for example, `jupyter nbconvert --to html` from a shell).
OUT = Path('../outputs')
OUT.mkdir(parents=True, exist_ok=True)
df.to_csv(OUT / 'cleaned_sales.csv', index=False)
print('Saved cleaned dataset to', OUT / 'cleaned_sales.csv')
