# Project: Sales Data Analysis (Starter Notebook)

**Objective:** An example notebook that shows how to explore a dataset, pose research questions, and visualize the results.

## Folder structure
- `/data` : raw datasets (sample_sales.csv)
- `/notebooks` : Jupyter notebooks
- `/scripts` : helper scripts
- `/outputs` : figures and exported results

## How to run
1. Install the requirements: `pip install pandas matplotlib seaborn notebook`.
2. Open this notebook in JupyterLab or Jupyter Notebook and run the cells from top to bottom.

----


In [None]:
# Imports
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

DATA_PATH = Path('../data/sample_sales.csv')  # relative to /notebooks; adjust if the notebook is opened from elsewhere
df = pd.read_csv(DATA_PATH)
df.head()
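
The later cells assume that the CSV contains specific columns (`date`, `gross_sales`, `net_sales`, `profit`, `COGS`, `manufacturing_cost`, `freight_cost`). A minimal sketch of an early check could look like the following. The helper name `missing_columns` and the `EXPECTED_COLUMNS` list are illustrative: the list is inferred from the columns this notebook uses, not from a documented schema.

```python
import pandas as pd

# Columns referenced by later cells in this notebook (inferred, not an official schema).
EXPECTED_COLUMNS = [
    'date', 'gross_sales', 'net_sales', 'profit',
    'COGS', 'manufacturing_cost', 'freight_cost',
]


def missing_columns(frame: pd.DataFrame, expected=EXPECTED_COLUMNS) -> list:
    """Return the expected column names that are absent from `frame`, in order."""
    return [col for col in expected if col not in frame.columns]


# Tiny illustrative frame (made-up values) that lacks most of the columns.
example = pd.DataFrame({'date': ['2024-01-01'], 'gross_sales': [100.0]})
print(missing_columns(example))
```

In this notebook, `missing_columns(df)` returning an empty list would indicate that the remaining cells can find the columns they need.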


## Quick EDA
- Check missing values
- Summary statistics
- Time series of gross sales, net sales, and profit


In [None]:
# Basic EDA
print('Shape:', df.shape)
print('\nMissing values:\n', df.isnull().sum())
display(df.describe())
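
Raw missing-value counts are easier to judge next to the share of rows they represent. The sketch below adds that share. The helper name `missing_summary` and the example frame are hypothetical and use made-up values.

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and share of missing values per column, most-affected column first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        'missing': counts,
        # max(..., 1) avoids dividing by zero on an empty frame
        'share': counts / max(len(frame), 1),
    })
    return summary.sort_values('missing', ascending=False)


# Made-up values, for illustration only.
example = pd.DataFrame({
    'net_sales': [10.0, None, 30.0, None],
    'profit': [1.0, 2.0, None, 4.0],
})
print(missing_summary(example))
```

In the notebook, `display(missing_summary(df))` would show the same table for the sales data.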


In [None]:
# Convert date and plot time series
df['date'] = pd.to_datetime(df['date'])
ts = df.groupby('date')[['gross_sales','net_sales','profit']].sum().reset_index()
plt.figure(figsize=(10,5))
plt.plot(ts['date'], ts['gross_sales'], label='Gross Sales')
plt.plot(ts['date'], ts['net_sales'], label='Net Sales')
plt.plot(ts['date'], ts['profit'], label='Profit')
plt.legend()
plt.title('Time series of Sales Metrics')
plt.xlabel('Date')
plt.ylabel('Amount')
plt.grid(True)
plt.show()
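
Daily series can be noisy, so one way to look for trends or seasonality is to aggregate the same metrics by calendar month. The following is a sketch under the assumption that the `date` column parses with `pd.to_datetime`. The helper name `monthly_totals` and the example values are illustrative.

```python
import pandas as pd


def monthly_totals(frame: pd.DataFrame, date_col: str = 'date') -> pd.DataFrame:
    """Sum every numeric column per calendar month."""
    dated = frame.assign(**{date_col: pd.to_datetime(frame[date_col])})
    months = dated[date_col].dt.to_period('M')  # e.g. 2024-01
    return dated.drop(columns=date_col).groupby(months).sum(numeric_only=True)


# Made-up values, for illustration only.
example = pd.DataFrame({
    'date': ['2024-01-05', '2024-01-20', '2024-02-03'],
    'gross_sales': [100.0, 50.0, 70.0],
})
print(monthly_totals(example))
```

Applied to the notebook's data, `monthly_totals(df)[['gross_sales', 'net_sales', 'profit']].plot()` would draw the monthly version of the chart above.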


## Suggested Research Questions (pick 1-2)
1. How do manufacturing and freight costs affect net sales and profit over time?
2. Are there seasonal patterns or clear trends in gross/net sales across the sample period?
3. Which cost component (COGS, manufacturing, freight) contributes most to margin variability?
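
As a first descriptive pass at question 1, the correlation between each cost column and each outcome column can be tabulated. This is only a sketch: the helper name `cost_outcome_correlation` is hypothetical, the example values are made up, and a Pearson correlation describes linear association, not a causal effect.

```python
import pandas as pd


def cost_outcome_correlation(frame: pd.DataFrame, cost_cols, outcome_cols) -> pd.DataFrame:
    """Pearson correlation of each cost column (rows) with each outcome column (columns)."""
    cost_cols, outcome_cols = list(cost_cols), list(outcome_cols)
    corr = frame[cost_cols + outcome_cols].corr()
    return corr.loc[cost_cols, outcome_cols]


# Made-up values, for illustration only.
example = pd.DataFrame({
    'manufacturing_cost': [2.0, 4.0, 6.0, 8.0],
    'freight_cost': [1.0, 2.0, 3.0, 4.0],
    'net_sales': [10.0, 20.0, 30.0, 40.0],
    'profit': [8.0, 6.0, 4.0, 2.0],
})
print(cost_outcome_correlation(
    example,
    cost_cols=['manufacturing_cost', 'freight_cost'],
    outcome_cols=['net_sales', 'profit'],
))
```

With the notebook's data, the same call on `df` with the cost and outcome column names used in the other cells gives a starting table to interpret alongside the time series plot.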


In [None]:
# Cost breakdown plot
costs = df.groupby('date')[['COGS', 'manufacturing_cost', 'freight_cost']].sum()
# Format the index as plain dates so the bar labels omit the time component
costs.index = pd.to_datetime(costs.index).strftime('%Y-%m-%d')
costs.plot(kind='bar', stacked=True, figsize=(10, 5))
plt.title('Daily Cost Breakdown')
plt.xlabel('Date')
plt.ylabel('Amount')
plt.tight_layout()
plt.show()
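
The stacked bars show levels. For question 3, a rough numeric companion is the spread of each component's daily total. The sketch below reports the mean, sample standard deviation, and coefficient of variation per component. It assumes the three cost columns are separate, additive components (as the stacked chart also does), and it measures variability of the costs themselves, which is only a proxy for their effect on margin. The helper name `cost_variability` and the example values are illustrative.

```python
import pandas as pd


def cost_variability(frame: pd.DataFrame, cost_cols, date_col: str = 'date') -> pd.DataFrame:
    """Mean, sample std, and coefficient of variation of each component's daily total."""
    daily = frame.groupby(date_col)[list(cost_cols)].sum()
    stats = pd.DataFrame({'mean': daily.mean(), 'std': daily.std()})
    stats['cv'] = stats['std'] / stats['mean']
    return stats.sort_values('std', ascending=False)


# Made-up values, for illustration only.
example = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-03'],
    'COGS': [10.0, 10.0, 20.0, 20.0],
    'freight_cost': [1.0, 1.0, 6.0, 10.0],
})
print(cost_variability(example, ['COGS', 'freight_cost']))
```

In the notebook this could be called as `cost_variability(df, ['COGS', 'manufacturing_cost', 'freight_cost'])`.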


## Exporting results
- Save cleaned data to `/outputs`
- Save figures to `/outputs`


In [None]:
# Export the cleaned dataset (path is relative to /notebooks)
OUT = Path('../outputs')
OUT.mkdir(parents=True, exist_ok=True)
df.to_csv(OUT / 'cleaned_sales.csv', index=False)
print('Saved cleaned dataset to', OUT / 'cleaned_sales.csv')
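
The export list also mentions figures, which the cell above does not cover. A minimal sketch of a figure-saving helper follows. The name `save_figure`, the PNG format, and the `dpi` default are assumptions for illustration, and the demo writes to a temporary folder, not to `/outputs`.

```python
import tempfile
from pathlib import Path

from matplotlib.figure import Figure


def save_figure(fig: Figure, out_dir, name: str, dpi: int = 150) -> Path:
    """Write `fig` as a PNG into `out_dir` (created if needed) and return the path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / f'{name}.png'
    fig.savefig(target, dpi=dpi, bbox_inches='tight')
    return target


# Illustrative run: draw a trivial figure and save it to a temporary folder.
demo_fig = Figure(figsize=(4, 3))
demo_fig.subplots().plot([0, 1, 2], [0, 1, 4])
saved = save_figure(demo_fig, Path(tempfile.mkdtemp()) / 'outputs', 'demo')
print(saved.name)
```

In the plotting cells, a call such as `save_figure(plt.gcf(), Path('../outputs'), 'sales_timeseries')` (the file name is hypothetical) would typically need to run before `plt.show()`, because the inline notebook backend usually closes the figure once it has been displayed.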
