

### **BAS240 Data Analytics Programming - Assignment: Customer and Transaction Analysis**

#### **Objective**
This assignment gives you hands-on experience with data analysis using Pandas in Python. You will perform data aggregation, data preprocessing, outlier detection, table joining, diagnostic analysis, and predictive analysis on customer, transaction, product, and supplier data.

#### **Dataset Overview**
You will work with four interconnected tables:
1. **Customers**: Basic information about customers.
2. **Products**: Details about each product.
3. **Suppliers**: Information about product suppliers.
4. **Transactions**: Records of each transaction made by customers.

#### **Assignment Tasks**

#### **Part 1: Data Aggregation**
1. Calculate the **total revenue** generated by each product category.
2. Find the **average income** of customers based on their location.
3. Determine the **total quantity of products** bought by each customer. Identify the top 10 customers by quantity purchased.
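
As a starting point, the three aggregations above can be sketched with `.groupby()`. This is a minimal illustration on toy data, not the assignment's dataset: the column names (`CustomerID`, `Category`, `Quantity`, `Price`, `Location`, `Income`) are assumptions, and it assumes the product category is already present on each transaction row. In the real tables you may need to merge Transactions with Products first.

```python
import pandas as pd

# Toy data with hypothetical column names; adapt to the actual schema.
transactions = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3, 3],
    "Category": ["toys", "books", "toys", "books", "books"],
    "Quantity": [2, 1, 5, 3, 1],
    "Price": [10.0, 20.0, 10.0, 20.0, 20.0],
})
customers = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "Location": ["North", "South", "North"],
    "Income": [40000.0, 55000.0, 60000.0],
})

# Assumption: revenue of a transaction row is quantity times unit price.
transactions["Revenue"] = transactions["Quantity"] * transactions["Price"]

# 1. Total revenue per product category.
revenue_by_category = transactions.groupby("Category")["Revenue"].sum()

# 2. Average customer income per location.
avg_income_by_location = customers.groupby("Location")["Income"].mean()

# 3. Total quantity per customer, keeping the 10 largest (fewer if there are fewer customers).
top_customers = transactions.groupby("CustomerID")["Quantity"].sum().nlargest(10)

print(revenue_by_category)
print(avg_income_by_location)
print(top_customers)
```

`nlargest(10)` returns the result already sorted in descending order, which saves a separate `sort_values()` call.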

#### **Part 2: Data Preprocessing**
1. Handle any **missing values** in each table. Document your approach and reasoning.
2. **Standardize** product categories (e.g., make all lowercase or ensure consistent naming).
3. **Encode** categorical variables where necessary (e.g., Gender, Employment Status) for later analysis.
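
One possible way to approach these preprocessing steps is sketched below on toy data. The column names and the specific choices (median imputation for `Income`, an explicit `"Unknown"` label for missing `Gender`, one-hot encoding) are illustrative assumptions; you should choose and justify your own strategy for the real tables.

```python
import pandas as pd

# Toy data with hypothetical column names and deliberately messy values.
customers = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "Gender": ["F", "M", None, "F"],
    "Income": [40000.0, None, 60000.0, 50000.0],
})
products = pd.DataFrame({
    "ProductID": [10, 11, 12],
    "Category": [" Toys", "toys ", "BOOKS"],
})

# Count missing values per column before deciding how to handle them.
print(customers.isnull().sum())

# Assumption: the median is a reasonable fill value for a skewed numeric column.
customers["Income"] = customers["Income"].fillna(customers["Income"].median())

# Assumption: missing categories are kept as their own explicit label.
customers["Gender"] = customers["Gender"].fillna("Unknown")

# Standardize category names: trim whitespace and convert to lowercase.
products["Category"] = products["Category"].str.strip().str.lower()

# One-hot encode Gender into 0/1 indicator columns.
encoded = pd.get_dummies(customers, columns=["Gender"], dtype=int)
print(encoded)
```

One-hot encoding avoids implying an order between categories, which a simple integer mapping (for example F=0, M=1, Unknown=2) would do.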

#### **Part 3: Outlier Detection**
1. **Identify and analyze outliers** in the Income column of the Customers table. Discuss potential reasons for these outliers.
2. Detect outliers in the **Quantity** and **Price** columns of the Transactions table, using statistical or visual methods (e.g., box plots). What are your observations?
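
One common statistical method is the interquartile range (IQR) rule, which is also what a standard box plot uses to draw its whiskers. The sketch below is an illustration on made-up income values; the helper name `iqr_outliers` and the multiplier `k = 1.5` are assumptions, not requirements of the assignment.

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside
    [Q1 - k * IQR, Q3 + k * IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return (series < lower) | (series > upper)

# Made-up incomes with one extreme value.
income = pd.Series(
    [42000, 45000, 47000, 50000, 52000, 55000, 400000], name="Income"
)

mask = iqr_outliers(income)
print(income[mask])  # rows flagged as outliers
```

The same helper can be applied to the Quantity and Price columns. Flagged rows are candidates for inspection, not automatically errors: a very high income may be genuine.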

#### **Part 4: Joining Tables**
1. **Merge** the Transactions and Customers tables to create a consolidated view of each transaction, enriched with customer demographics.
2. Join the Products table to this consolidated dataset to add product details to each transaction.
3. Finally, **link the supplier information** to each transaction through the ProductID and SupplierID keys.
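
The three joins above can be chained with `.merge()`. The sketch below uses toy tables and assumes that `SupplierID` is stored in the Products table, so suppliers are reached through the product; check the real schema before relying on this. The `validate="many_to_one"` argument makes pandas raise an error if a key is unexpectedly duplicated on the right-hand side.

```python
import pandas as pd

# Toy tables with hypothetical column names.
transactions = pd.DataFrame({
    "TransactionID": [100, 101, 102],
    "CustomerID": [1, 2, 1],
    "ProductID": [10, 11, 10],
    "Quantity": [2, 1, 3],
})
customers = pd.DataFrame({
    "CustomerID": [1, 2],
    "Income": [40000.0, 55000.0],
})
products = pd.DataFrame({
    "ProductID": [10, 11],
    "Category": ["toys", "books"],
    "SupplierID": [500, 501],  # assumption: Products carries the supplier key
})
suppliers = pd.DataFrame({
    "SupplierID": [500, 501],
    "SupplierName": ["Acme", "Globex"],
})

# Left joins keep every transaction, even if a lookup key has no match.
full = (
    transactions
    .merge(customers, on="CustomerID", how="left", validate="many_to_one")
    .merge(products, on="ProductID", how="left", validate="many_to_one")
    .merge(suppliers, on="SupplierID", how="left", validate="many_to_one")
)

# Sanity check: joining lookup tables should not change the number of rows.
print(len(full) == len(transactions))
print(full)
```

After each merge, it is worth checking for new missing values, since they reveal transactions whose customer, product, or supplier key has no match.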

#### **Part 5: Diagnostic Analysis**
1. Analyze the relationship between **customer income and total spending** across transactions. Do higher-income customers spend more?
2. Investigate the **impact of product categories** on total spending. Which product categories contribute most to revenue?
3. Perform a **time-series analysis** on the transaction data to identify peak sales periods within the year. Does spending vary by season or month?
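
A possible starting point for the income and seasonality questions is sketched below. The data is invented and deliberately simple, and the column names (`Date`, `Amount`) are assumptions. A correlation coefficient summarizes only the linear relationship, so a scatter plot is a useful complement.

```python
import pandas as pd

# Invented data with hypothetical column names.
tx = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3, 3],
    "Date": ["2024-01-15", "2024-12-02", "2024-12-20", "2024-06-05", "2024-12-24"],
    "Amount": [20.0, 30.0, 100.0, 60.0, 90.0],
})
customers = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "Income": [30000.0, 50000.0, 70000.0],
})

# Total spending per customer, joined back to the customer attributes.
spending = (
    tx.groupby("CustomerID")["Amount"].sum().rename("TotalSpending").reset_index()
)
per_customer = customers.merge(spending, on="CustomerID", how="left")

# Pearson correlation between income and total spending.
corr = per_customer["Income"].corr(per_customer["TotalSpending"])
print(f"Income vs. spending correlation: {corr:.2f}")

# Monthly totals: parse the dates, then group by calendar month.
tx["Date"] = pd.to_datetime(tx["Date"])
monthly = tx.groupby(tx["Date"].dt.month)["Amount"].sum()
peak_month = monthly.idxmax()
print(monthly)
print(f"Peak month: {peak_month}")
```

For category impact, the same `groupby` pattern applies with the category column as the key, once product details have been joined to the transactions.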

#### **Part 6: Predictive Analysis**
1. Build a **linear regression model** to predict the transaction amount based on customer demographics (e.g., Age, Income) and product category.
2. Assess the **accuracy of your model** and discuss any insights or limitations. Which variables seem to influence transaction amounts most?
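
The modeling workflow can be sketched as follows. The data here is synthetic, generated with a known relationship purely so the example runs on its own; the column names, the train/test split ratio, and the choice of R² and mean absolute error as metrics are assumptions, not requirements.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (not the assignment's dataset), seeded for reproducibility.
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "Age": rng.integers(18, 70, size=n),
    "Income": rng.normal(50000, 10000, size=n),
    "Category": rng.choice(["books", "toys", "games"], size=n),
})
# Invented target: depends on income, age, and category, plus random noise.
data["Amount"] = (
    0.001 * data["Income"]
    + 0.5 * data["Age"]
    + (data["Category"] == "toys") * 20
    + rng.normal(0, 2, size=n)
)

# One-hot encode the category; drop_first avoids a redundant indicator column.
X = pd.get_dummies(
    data[["Age", "Income", "Category"]],
    columns=["Category"], drop_first=True, dtype=float,
)
y = data["Amount"]

# Hold out 25% of the rows to evaluate on data the model has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

r2 = r2_score(y_test, pred)
mae = mean_absolute_error(y_test, pred)
coefs = pd.Series(model.coef_, index=X.columns)

print(f"R^2: {r2:.3f}, MAE: {mae:.2f}")
print(coefs)
```

When comparing coefficients, keep in mind that they depend on each feature's scale: a small coefficient on Income (measured in currency units) does not by itself mean income matters less than Age. Standardizing the features first is one way to make the comparison fairer.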



#### **Tips for Success**
- Use Pandas functions like `.groupby()`, `.merge()`, `.isnull()`, `.fillna()`, and `.describe()` to streamline your analysis.
- For predictive analysis, you may use `sklearn.linear_model.LinearRegression` or similar models available in scikit-learn.
- Document each step clearly, especially when making assumptions or decisions during preprocessing.

