# Online Retail Analysis Assignment

Welcome to the Online Retail Analysis assignment! In this assignment, you'll analyse real-world transaction data from a UK-based online retailer. Follow each step and create the necessary Python functions to analyse and manipulate the data.

## Dataset Information
The dataset contains transactions occurring between 2010 and 2011 for a UK-based online retailer. Many of the customers are wholesalers.

- **Download Dataset**: [Online Retail Data Set](https://archive.ics.uci.edu/ml/datasets/online+retail)
- **File Name**: `Online Retail.xlsx`

## Instructions

1. Download the dataset and place it in the same directory as your code files.
2. Create a Python file named `retail_analysis.py` containing the required functions below.

## Functions to Implement

### 1. `load_data(filename: str) -> pd.DataFrame`
   - **Description**: Load the dataset from an Excel file and return a DataFrame.
   - **Parameters**: 
     - `filename`: The filename of the dataset (string).
   - **Returns**: A Pandas DataFrame with the dataset.
   - **Notes**: Make sure the function reads the Excel file correctly.
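
As a starting point, the loading step could look like the minimal sketch below. It assumes the workbook has a single sheet with a header row and that the optional `openpyxl` package is installed, which pandas needs to read `.xlsx` files. The grading script may check details this sketch does not cover.

```python
import pandas as pd


def load_data(filename: str) -> pd.DataFrame:
    """Read the first worksheet of an Excel workbook into a DataFrame."""
    # pandas selects the reader engine from the file extension;
    # .xlsx files require the optional openpyxl dependency.
    return pd.read_excel(filename)
```

A call such as `load_data("Online Retail.xlsx")` then returns the raw transactions, provided the file sits in the working directory.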

### 2. `clean_data(df: pd.DataFrame) -> pd.DataFrame`
   - **Description**: Clean the dataset by removing rows with a missing `CustomerID` and rows with a negative `Quantity` or `UnitPrice`.
   - **Parameters**: 
     - `df`: The raw dataset DataFrame.
   - **Returns**: A cleaned DataFrame.
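
The filtering described above can be sketched on a small hand-made table. The toy rows are invented for illustration. The sketch reads "negative" literally, so rows with a zero quantity or price are kept, and it resets the index, which is an assumption rather than a stated requirement.

```python
import pandas as pd


def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows without a CustomerID and rows with negative Quantity or UnitPrice."""
    cleaned = df.dropna(subset=["CustomerID"])
    # Keep only rows where both values are zero or positive.
    keep = (cleaned["Quantity"] >= 0) & (cleaned["UnitPrice"] >= 0)
    return cleaned[keep].reset_index(drop=True)


# Toy data (invented): only the first row passes every check.
toy = pd.DataFrame({
    "CustomerID": [17850.0, None, 13047.0, 12583.0],
    "Quantity": [6, 2, -1, 3],
    "UnitPrice": [2.55, 3.39, 1.25, -4.00],
})
cleaned = clean_data(toy)
print(cleaned)
```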

### 3. `top_customers(df: pd.DataFrame, n: int) -> pd.DataFrame`
   - **Description**: Identify the top `n` customers by total spending.
   - **Parameters**: 
     - `df`: The cleaned DataFrame.
     - `n`: Number of top customers to return.
   - **Returns**: A DataFrame with the top `n` customers and their total spending.
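
One way to approach this is to compute a per-row total (`Quantity * UnitPrice`), sum it per customer, and keep the `n` largest sums. The sketch below does that on invented toy data. The output column name `total_spending` is a hypothetical choice, since the specification does not name it.

```python
import pandas as pd


def top_customers(df: pd.DataFrame, n: int) -> pd.DataFrame:
    """Return the n customers with the highest summed Quantity * UnitPrice."""
    # total_spending is an assumed column name, not one given in the brief.
    totals = df.assign(total_spending=df["Quantity"] * df["UnitPrice"])
    per_customer = totals.groupby("CustomerID", as_index=False)["total_spending"].sum()
    return per_customer.nlargest(n, "total_spending").reset_index(drop=True)


# Toy data (invented): customer 1 spends 14, customer 2 spends 3, customer 3 spends 20.
toy = pd.DataFrame({
    "CustomerID": [1, 2, 1, 3],
    "Quantity": [2, 1, 1, 10],
    "UnitPrice": [5.0, 3.0, 4.0, 2.0],
})
top_two = top_customers(toy, 2)
print(top_two)
```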

### 4. `monthly_sales(df: pd.DataFrame) -> pd.DataFrame`
   - **Description**: Calculate total monthly sales.
   - **Parameters**: 
     - `df`: The cleaned DataFrame.
   - **Returns**: A DataFrame with two columns: `month` and `total_sales`.
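
A possible sketch of the monthly aggregation follows. It assumes the date column is called `InvoiceDate` and that `month` should be a year-month label such as `2010-12`, so that December 2010 and December 2011 stay separate. The brief does not fix the label format, so treat this as one reasonable reading.

```python
import pandas as pd


def monthly_sales(df: pd.DataFrame) -> pd.DataFrame:
    """Sum Quantity * UnitPrice per calendar month."""
    work = pd.DataFrame({
        # Year-month label (assumed format), e.g. "2010-12".
        "month": pd.to_datetime(df["InvoiceDate"]).dt.to_period("M").astype(str),
        "total_sales": df["Quantity"] * df["UnitPrice"],
    })
    return work.groupby("month", as_index=False)["total_sales"].sum()


# Toy data (invented): two sales in December 2010, one in January 2011.
toy = pd.DataFrame({
    "InvoiceDate": pd.to_datetime(["2010-12-01", "2010-12-15", "2011-01-04"]),
    "Quantity": [1, 2, 3],
    "UnitPrice": [10.0, 5.0, 2.0],
})
by_month = monthly_sales(toy)
print(by_month)
```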

### 5. `most_popular_products(df: pd.DataFrame, n: int) -> pd.DataFrame`
   - **Description**: Find the top `n` most popular products by quantity sold.
   - **Parameters**: 
     - `df`: The cleaned DataFrame.
     - `n`: Number of top products to return.
   - **Returns**: A DataFrame with the top `n` products by quantity sold.
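
The ranking by quantity can be sketched in the same group-and-sum style. This version assumes products are identified by the `Description` column. Grouping by `StockCode` would be an equally plausible reading, so check which one the grading script expects.

```python
import pandas as pd


def most_popular_products(df: pd.DataFrame, n: int) -> pd.DataFrame:
    """Return the n products with the highest total Quantity sold."""
    # Grouping key is an assumption: Description rather than StockCode.
    quantities = df.groupby("Description", as_index=False)["Quantity"].sum()
    return quantities.nlargest(n, "Quantity").reset_index(drop=True)


# Toy data (invented): MUG sells 6 units, CLOCK 4, LAMP 2.
toy = pd.DataFrame({
    "Description": ["MUG", "LAMP", "MUG", "CLOCK"],
    "Quantity": [5, 2, 1, 4],
})
popular = most_popular_products(toy, 2)
print(popular)
```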



## Conceptual Questions (Multiple Choice)


- **Q1.** Why is it essential to clean the data by removing rows with missing `CustomerID` values?
    - A) Missing `CustomerID` values indicate erroneous entries that could skew analysis.
    - B) Removing them helps improve performance by reducing the data size.
    - C) `CustomerID` is a foreign key that links to customer-specific data, so its absence makes those rows incomplete.
    - D) Missing values can cause errors during data visualization.


- **Q2.** What advantage does grouping data by month provide in sales analysis?
    - A) It allows us to detect and analyse seasonal trends in sales.
    - B) It improves data security by anonymizing dates.
    - C) Monthly grouping is the most accurate time scale for detecting outliers.
    - D) It makes it easier to calculate customer retention metrics.


- **Q3.** When calculating the total revenue from a sale, which of the following formulas is correct?
    - A) `Quantity + UnitPrice`
    - B) `Quantity - UnitPrice`
    - C) `Quantity * UnitPrice`
    - D) `Quantity / UnitPrice`


- **Q4.** Which of the following is the best reason to identify the most popular products?
    - A) To know which products contribute most to customer satisfaction.
    - B) To optimize stock levels for high-demand items.
    - C) To understand which product generates the highest profit per unit.
    - D) To determine which products to discount in order to increase sales.


- **Q5.** If we want to calculate the total revenue per `CustomerID`, which approach is most efficient?
    - A) Sort data by `CustomerID` and sum up the `Quantity`.
    - B) Group by `CustomerID` and sum the `Quantity` and `UnitPrice` columns.
    - C) Group by `CustomerID` and calculate the sum of the `Total` column (`Quantity * UnitPrice`).
    - D) Calculate the average of the `Quantity` for each `CustomerID`.

### 6. `get_mcq_answers() -> dict`
- **Description**: Return a dictionary containing your answers to the multiple-choice questions above.
    - Each key is a question ID, numbered in the order the questions appear (`"Q1"`, `"Q2"`, and so on).
    - Each value is a set of the selected answer choices (e.g., `{"A"}` or `{"A", "C"}`).

- **Returns**: A dictionary mapping each question ID to the set of selected answers.
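
The expected return shape can be sketched as follows. The empty sets are placeholders only, not answers. The sketch assumes the five questions above are keyed `"Q1"` to `"Q5"` in the order they appear.

```python
def get_mcq_answers() -> dict:
    """Return one set of selected choices per question ID."""
    # Placeholders only: replace each empty set with your own choice(s),
    # for example {"A"} or {"A", "C"}.
    return {
        "Q1": set(),
        "Q2": set(),
        "Q3": set(),
        "Q4": set(),
        "Q5": set(),
    }


answers = get_mcq_answers()
print(sorted(answers))
```

Sets are used rather than lists so that the order in which you list multiple choices does not matter when answers are compared.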
   
## Submission
- Implement the functions in `retail_analysis.py`.
- I will share the pytest grading script and the solution after the submission deadline so that you can review your answers against the marking criteria and see how your submission was evaluated.