# Sales Analysis — Starter Notebook

This **starter notebook** is a ready-to-run template that walks you step-by-step through the project:
1. Setup & required packages
2. Loading `Sales.xlsx`
3. Inspecting the data
4. Cleaning skeleton (you edit column names to match your file)
5. Creating a Date table
6. KPI examples (Total Sales, Profit, YOY comparisons)
7. Example visualisations (matplotlib / seaborn)
8. Exporting cleaned data & quick Git instructions

**How to use**:
- Put `Sales.xlsx` in the same project folder (the repository folder you opened in VS Code).
- Open this notebook in VS Code (Jupyter extension) and run cells from top to bottom.
- Edit column names in the cleaning cell if your Excel uses different headers.
- If you prefer Google Colab later, you can upload this notebook there directly.

---


## Environment & installation (one-time)

If you already installed packages, skip this. Otherwise, open a terminal in VS Code and run:

```bash
# Recommended (run in terminal, NOT a notebook cell)
python -m pip install --upgrade pip
python -m pip install pandas numpy matplotlib seaborn openpyxl plotly
```

If your system uses `py` instead of `python`, run `py -m pip ...` instead.

In VS Code pick the Python interpreter that has these packages (bottom-right corner).
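
To confirm that the selected interpreter can actually see the packages listed above, a quick check along these lines may help. This is a minimal sketch: `REQUIRED` and `missing_packages` are hypothetical names introduced here for illustration, and the check only tests that each package can be located, not that its version is suitable.

```python
# Hypothetical helper: report which of the packages from the install command
# cannot be found by the interpreter running this code.
from importlib.util import find_spec

REQUIRED = ["pandas", "numpy", "matplotlib", "seaborn", "openpyxl", "plotly"]

def missing_packages(names):
    """Return the names that the current interpreter cannot locate."""
    return [name for name in names if find_spec(name) is None]

missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available.")
```

If anything is reported missing, the interpreter selected in VS Code is probably not the one you installed into.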


In [None]:
# Imports - run this cell
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Notebook display settings
%matplotlib inline
pd.options.display.max_columns = 50
sns.set_context('talk')
plt.rcParams['figure.figsize'] = (10,6)

print('libraries imported')

In [None]:
# Load dataset - place 'Sales.xlsx' in the same folder as this notebook
# If your file is named differently, change the filename below.
try:
    df = pd.read_excel('Sales.xlsx')
    print('Loaded Sales.xlsx successfully')
except FileNotFoundError:
    print('Sales.xlsx not found - make sure the file is in the same folder as this notebook.')
    raise  # re-raise so the following cells do not run without data

# Quick preview
print('\nColumns:')
print(df.columns.tolist())
df.head()

In [None]:
# Basic inspection
print('Shape:', df.shape)
df.info()  # prints its summary directly and returns None, so no display() needed
display(df.isna().sum())

## Cleaning skeleton (edit to match your real column names)

Below is a safe, editable cleaning template. Look at the printed column list above and adjust names if necessary.
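
If your headers differ from the names this notebook expects (`Date`, `Sales`, `Profit`), one option is an explicit rename map. The sketch below uses made-up toy data; the headers `Order Date`, `Revenue` and `Net Profit` and the name `rename_map` are assumptions for illustration only, so substitute whatever your own file contains.

```python
import pandas as pd

# Toy frame standing in for a file whose headers differ from the template
raw = pd.DataFrame({
    "Order Date": ["2023-01-05", "2023-02-10"],
    "Revenue": ["100.5", "200"],
    "Net Profit": ["20", "n/a"],
})

# Hypothetical mapping: left = your headers, right = names used in this notebook
rename_map = {"Order Date": "Date", "Revenue": "Sales", "Net Profit": "Profit"}
tidy = raw.rename(columns=rename_map)

# Same conversions as the cleaning cell; unparseable values become NaT / NaN
tidy["Date"] = pd.to_datetime(tidy["Date"], errors="coerce")
for col in ["Sales", "Profit"]:
    tidy[col] = pd.to_numeric(tidy[col], errors="coerce")

print(tidy.dtypes)
print(tidy)
```

An explicit map is more predictable than automatic detection when several columns contain the word "date".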


In [None]:
# Copy and adapt this cleaning block to your dataset.
df = df.copy()

# Normalize column names (quick tidy)
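# regex=False treats '(' and ')' as literal characters; on older pandas versions
# the default regex mode would fail on an unbalanced parenthesis.
df.columns = (df.columns
                  .str.strip()
                  .str.replace(' ', '_', regex=False)
                  .str.replace('\n', '_', regex=False)
                  .str.replace('(', '', regex=False)
                  .str.replace(')', '', regex=False))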

print('Normalized column names:')
print(df.columns.tolist())

# Example: convert Date and numeric columns
# Change these names if your file uses different headers (e.g., 'Order Date' or 'Order_Date')
date_candidates = [c for c in df.columns if 'date' in c.lower()]
print('Date-like columns detected:', date_candidates)

if date_candidates:
    date_col = date_candidates[0]
    print('Using', date_col, 'as the Date column')
    df[date_col] = pd.to_datetime(df[date_col], errors='coerce')
    df.rename(columns={date_col: 'Date'}, inplace=True)
else:
    print('No Date-like column automatically detected. You must set a Date column manually.')

# Numeric conversions - adapt names as needed
for col in ['Sales', 'Profit', 'Cost', 'Order_Quantity', 'Order_Quantity.1', 'OrderQuantity']:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col], errors='coerce')

# Remove exact duplicates
before = df.shape[0]
df.drop_duplicates(inplace=True)
after = df.shape[0]
print(f'Removed {before-after} exact duplicate rows.')

# Show rows with missing critical values (Date or Sales)
for col in ['Date', 'Sales']:
    if col in df.columns:
        display(df[df[col].isna()].head())
    else:
        print(f'Column "{col}" not found - rename the matching column above.')

# Show cleaned head
df.head()

## Create Date table (calendar)
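
A calendar table is typically joined back to the sales data so that days without any sales still appear in reports. The sketch below illustrates that idea on made-up toy data; the names `calendar`, `daily` and `daily_full` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy sales with no rows for 2023-01-02
sales = pd.DataFrame({
    "Date": pd.to_datetime(["2023-01-01", "2023-01-01", "2023-01-03"]),
    "Sales": [100.0, 50.0, 70.0],
})

# Small calendar covering the full range, one row per day
calendar = pd.DataFrame({"Date": pd.date_range("2023-01-01", "2023-01-03", freq="D")})
calendar["Year"] = calendar["Date"].dt.year
calendar["Weekday"] = calendar["Date"].dt.day_name()

# Aggregate to one row per day, then left-join onto the calendar
daily = sales.groupby("Date", as_index=False)["Sales"].sum()
daily_full = calendar.merge(daily, on="Date", how="left").fillna({"Sales": 0.0})

print(daily_full)
```

The left join keeps every calendar day, and days with no matching sales are filled with 0 rather than dropped.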

In [None]:
# Create a Date table covering the full range in the data
min_date = df['Date'].min().normalize()
max_date = df['Date'].max().normalize()
print('Date range in data:', min_date, 'to', max_date)

date_range = pd.date_range(start=min_date, end=max_date, freq='D')
dates = pd.DataFrame({'Date': date_range})
dates['Year'] = dates['Date'].dt.year
dates['Month'] = dates['Date'].dt.month_name()
dates['MonthNum'] = dates['Date'].dt.month
dates['Quarter'] = dates['Date'].dt.quarter
dates['Weekday'] = dates['Date'].dt.day_name()

dates.head()

## KPI examples (Total Sales, Profit, YOY comparison)

The cell below shows a simple way to compute yearly totals and previous-year (PY) values.
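
A year-over-year percentage is conventionally the change divided by the previous year's value. Note also that `shift(1)` takes the previous row, which is only the previous year when no year is missing from the data. The sketch below, on made-up numbers, shows one alternative that looks up `Year - 1` explicitly; the frame `yearly` and the lookup `by_year` are hypothetical names.

```python
import pandas as pd

# Toy yearly totals with 2022 missing
yearly = pd.DataFrame({"Year": [2020, 2021, 2023], "Sales": [100.0, 120.0, 90.0]})

# Look up the previous calendar year explicitly instead of the previous row
by_year = yearly.set_index("Year")["Sales"]
yearly["Sales_PY"] = yearly["Year"].sub(1).map(by_year)

yearly["Delta"] = yearly["Sales"] - yearly["Sales_PY"]
yearly["Delta_pct"] = yearly["Delta"] / yearly["Sales_PY"]  # change relative to PY

print(yearly)
```

With this lookup, 2023 gets no previous-year value because 2022 is absent, instead of silently being compared against 2021.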


In [None]:
# Safe check that required columns exist
for col in ['Date','Sales','Profit']:
    if col not in df.columns:
        print(f'Warning: expected column "{col}" not found. Please adapt the cleaning step above.')

# Add Year column
df['Year'] = df['Date'].dt.year

# Total Sales by Year with previous year comparison
sales_by_year = df.groupby('Year')['Sales'].sum().reset_index().sort_values('Year')
sales_by_year['Sales_PY'] = sales_by_year['Sales'].shift(1)
sales_by_year['Delta'] = sales_by_year['Sales'] - sales_by_year['Sales_PY']
# YoY growth is measured relative to the previous year's sales
sales_by_year['Delta_pct'] = sales_by_year['Delta'] / sales_by_year['Sales_PY']
sales_by_year.style.format({'Sales':'{:.2f}','Sales_PY':'{:.2f}','Delta':'{:.2f}','Delta_pct':'{:.2%}'})

## Example visualisations

Below are a couple of example plots you can run and adapt. They assume columns like `Product`, `City`, `Channel` may exist in your dataset.
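
Because optional columns such as `Product` or `Channel` may be absent, it can help to wrap the aggregation in a small guard before plotting. This is a sketch on made-up data; `sales_by` is a hypothetical helper, not something defined elsewhere in the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "Channel": ["Online", "Retail", "Online", "Wholesale"],
    "Sales": [120.0, 80.0, 30.0, 70.0],
})

def sales_by(frame, column, top_n=5):
    """Total Sales per category, largest first; None if a column is missing."""
    if column not in frame.columns or "Sales" not in frame.columns:
        return None
    return frame.groupby(column)["Sales"].sum().nlargest(top_n).reset_index()

by_channel = sales_by(toy, "Channel")
print(by_channel)

# The toy data has no Product column, so the helper returns None
print(sales_by(toy, "Product"))
```

A non-`None` result can be passed straight to a plot, for example `sns.barplot(data=by_channel, x='Sales', y='Channel')`.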


In [None]:
# Sales by month (monthly trend) - example
# Ensure we have a Date column and Sales numeric
if 'Date' in df.columns and 'Sales' in df.columns:
    # Note: pandas 2.2+ prefers the alias 'ME' (month end) and warns on 'M'
    monthly = df.set_index('Date').resample('M')['Sales'].sum().reset_index()
    monthly['Month'] = monthly['Date'].dt.to_period('M').astype(str)
    plt.figure(figsize=(12,5))
    sns.lineplot(data=monthly, x='Month', y='Sales', marker='o')
    plt.xticks(rotation=45)
    plt.title('Monthly Sales Trend')
    plt.tight_layout()
    plt.show()
else:
    print('Date or Sales column missing - cannot plot monthly trend.')

In [None]:
# Top 5 cities by Sales - example
if 'City' in df.columns and 'Sales' in df.columns:
    top_cities = df.groupby('City')['Sales'].sum().nlargest(5).reset_index()
    sns.barplot(data=top_cities, x='Sales', y='City')
    plt.title('Top 5 Cities by Sales')
    plt.show()
else:
    print('City or Sales column not found - skip top cities.')

## Export cleaned data & Git instructions

When you are happy with the cleaning, run the export cell below to save the cleaned CSV, then add both the notebook and the CSV to Git:

```bash
# In your terminal inside the project folder:
git add starter_sales_analysis.ipynb cleaned_sales.csv
git commit -m "Add starter notebook and cleaned data"
git push origin main
```
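
One thing to keep in mind about the exported CSV: it stores plain text, so the `Date` column does not come back as a datetime unless you ask for it when reloading. The sketch below demonstrates this with made-up data written to a temporary folder, so it does not touch your real `cleaned_sales.csv`; the variable names are hypothetical.

```python
import os
import tempfile

import pandas as pd

cleaned = pd.DataFrame({
    "Date": pd.to_datetime(["2023-01-05", "2023-02-10"]),
    "Sales": [100.5, 200.0],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cleaned_sales.csv")
    cleaned.to_csv(path, index=False)
    plain = pd.read_csv(path)                            # Date is read as text
    reloaded = pd.read_csv(path, parse_dates=["Date"])   # Date parsed as datetime

print("Without parse_dates:", plain["Date"].dtype)
print("With parse_dates:   ", reloaded["Date"].dtype)
```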



In [None]:
# Export cleaned CSV (optional)
try:
    df.to_csv('cleaned_sales.csv', index=False)
    print('Exported cleaned_sales.csv')
except Exception as e:
    print('Could not export cleaned_sales.csv:', e)