**Assignment Submission Guidelines**

**1. Submission Platform:**

- Submit your completed assignment through [Specify Platform: e.g., Google Classroom, Canvas, GitHub Classroom, etc.].

**2. Submission Format:**

- Submit the Google Colab Notebook (.ipynb file) provided as the assignment template.
- Do not create a new notebook. Fill in the provided template.

**3. Template Completion:**

The template notebook contains:
- The code to generate the online_shopping_data.csv dataset.
- Placeholders for your code and explanations for each question.

Follow the instructions within the template.
- Code Cells:
  - Place your code solutions directly in the designated code cells below each question.
- Markdown Cells:
  - Provide your explanations and justifications in the designated Markdown cells.
- Report section:
  - Complete the markdown section at the bottom of the notebook titled "Report".
  - In this section, compile the explanation of each of the questions.
  - Answer the following data analysis questions:
   1. "What are the key characteristics of the customer base in this dataset?"
   2. "Which factors appear to have the strongest influence on product price or customer rating?"
   3. "What are the most common missing data patterns, and what implications might they have on our analysis?"
   4. "Based on your analysis, what are 2-3 recommendations you would make to improve sales or customer satisfaction?"

- Do not modify the structure of the template notebook.

**4. File Naming:**

Ensure the file name remains as provided in the template. Do not rename the file.

**5. Timely Submission:**

- Submit your completed template notebook by the deadline: **24th of March, 2025**.
- Late submissions will be penalized as follows:
  - Submissions received by **5:00pm on 26th of March, 2025** will receive a maximum of 5 marks for timely submission.
  - Submissions received after that time will receive 0 marks for timely submission.

**6. Report:**

- Complete the "Report" section at the end of your notebook.
- Ensure your report is:
  - Well-organized and easy to read.
  - Clear and concise.
  - Free of grammatical errors.

**7. Code Execution:**

Ensure your completed notebook runs without errors from top to bottom.
Before submitting, restart the kernel and run all cells to confirm reproducibility.



**8. Academic Integrity:**

All work must be your own.
Plagiarism will result in a failing grade.
Cite any external resources you use.



**Tips for Success:**

- Start the assignment early.
- Read the instructions within the template carefully.
- Plan your approach before coding.
- Test your code thoroughly.
- Document your work clearly.
- Review the rubrics to understand the grading criteria.


**Grading Rubrics:**

Total 50 Marks

- Timely Submission: 10 Marks
- Report: 10 Marks
- Level 1 (Basic Questions): 5 Marks (1 x 5 = 5)
- Level 2 (Intermediate Questions): 10 Marks (2 x 5 = 10)
- Level 3 (Advanced Questions): 15 Marks (3 x 5 = 15)

## **Assignment Title: Analyzing Online Shopping Trends - A Data Exploration**

**Background**

You are a data analyst working for "ShopInsights," a market research firm specializing in e-commerce trends. ShopInsights partners with online retailers to provide insights into customer behavior, product performance, and market dynamics.

Your team has compiled a dataset of customer purchase data from an online marketplace. This dataset includes information on customer demographics, product details, purchase history, and reviews.

Your goal is to explore and analyze this data to uncover key trends and patterns that can help online retailers:

- Understand customer purchasing behavior.
- Optimize product offerings and marketing strategies.
- Improve customer satisfaction and retention.
- Identify popular product categories and trends.

In this assignment, you will explore and analyze the online shopping data. If you choose to tackle the advanced level, you will delve deeper by building predictive models to understand the key drivers of customer purchases and provide recommendations for enhancing future data collection.

**Dataset (Synthetic):**

We will create a synthetic dataset resembling Amazon-like purchase data.

In [None]:
import pandas as pd
import numpy as np
import random
from datetime import datetime, timedelta

def generate_online_shopping_data(num_records=1000):
    """Generates synthetic online shopping data."""

    data = []
    for record_id in range(1, num_records + 1):
        customer_id = random.randint(1001, 2000)
        product_category = random.choice(['Electronics', 'Clothing', 'Books', 'Home & Kitchen', 'Beauty'])
        product_price = np.random.uniform(10, 500)
        purchase_date = datetime(2023, 1, 1) + timedelta(days=random.randint(0, 364))
        customer_age = random.randint(18, 70)
        customer_gender = random.choice(['Male', 'Female', 'Other'])
        customer_location = random.choice(['Urban', 'Suburban', 'Rural'])
        rating = random.choice([1, 2, 3, 4, 5, np.nan])
        review = random.choice(['Good product', 'Average', 'Poor quality', 'Excellent', np.nan])
        shipping_method = random.choice(['Standard', 'Express', np.nan])
        discount_applied = random.choice([True, False])

        data.append({
            'RecordID': record_id,
            'CustomerID': customer_id,
            'ProductCategory': product_category,
            'ProductPrice': product_price,
            'PurchaseDate': purchase_date,
            'CustomerAge': customer_age,
            'CustomerGender': customer_gender,
            'CustomerLocation': customer_location,
            'Rating': rating,
            'Review': review,
            'ShippingMethod': shipping_method,
            'DiscountApplied': discount_applied
        })

    df = pd.DataFrame(data)
    df['PurchaseDate'] = pd.to_datetime(df['PurchaseDate'])
    return df

# Generate and save the dataset
online_shopping_df = generate_online_shopping_data()
online_shopping_df.to_csv('online_shopping_data.csv', index=False)

print("Synthetic online shopping dataset generated: online_shopping_data.csv")

Synthetic online shopping dataset generated: online_shopping_data.csv
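
The generator above draws from both Python's `random` module and `np.random` without fixing a seed, so each run produces a different dataset and therefore different numeric answers. If repeatable results are wanted, and the instructor permits an extra line before the generator call, seeding both generators is one option. The sketch below is only an illustration: the helper `draw_sample` and the seed value 42 are hypothetical and mimic just two of the template's draws.

```python
import random

import numpy as np


def draw_sample(seed=42):
    """Seed both generators, then draw two values the way the template does."""
    random.seed(seed)      # controls random.randint and random.choice
    np.random.seed(seed)   # controls np.random.uniform
    customer_id = random.randint(1001, 2000)
    product_price = np.random.uniform(10, 500)
    return customer_id, product_price


first = draw_sample()
second = draw_sample()
print(first == second)  # the same seed gives the same draws
```

Both seeds are needed because the two libraries keep separate random states.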


**The Data**

The data comes from a compilation by ShopInsights, available in 'online_shopping_data.csv'. Each row represents a single customer's purchase record:

**RecordID** - Unique identifier for each purchase record.

**CustomerID** - Unique identifier for each customer.

**ProductCategory** - Category of the purchased product:
- Electronics
- Clothing
- Books
- Home & Kitchen
- Beauty

**ProductPrice** - Price of the purchased product (in USD).

**PurchaseDate** - Date of purchase (YYYY-MM-DD).

**CustomerAge** - Age of the customer (in years).

**CustomerGender** - Gender of the customer:
- Male
- Female
- Other

**CustomerLocation** - Location of the customer:
- Urban
- Suburban
- Rural

**Rating** - Customer rating of the product: 1-5, or NaN if no rating.

**Review** - Customer review of the product, or NaN if no review.

**ShippingMethod** - Shipping method used:
- Standard
- Express
- NaN (if not specified)

**DiscountApplied** - Indicates whether a discount was applied: True/False.

## **Basic (RBT Levels: 2, 3):**

Total: 5 Marks

Each question carries 1 mark.

**Question 1: Missing Value Identification**

Identify the columns in the dataset that contain missing values. How many missing values are present in each column?
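
As a hint for the technique rather than a model answer, per-column missing counts can be obtained with `isna().sum()`. The frame `toy` and its values below are hypothetical; the same calls apply to the assignment DataFrame.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with gaps in two of its three columns
toy = pd.DataFrame({
    "Rating": [5, np.nan, 3, np.nan],
    "Review": ["Good product", None, "Average", "Excellent"],
    "ProductPrice": [19.9, 250.0, 75.5, 120.0],
})

missing_counts = toy.isna().sum()                          # missing entries per column
columns_with_missing = missing_counts[missing_counts > 0]  # keep only affected columns
print(columns_with_missing)
```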

In [None]:
# Question 1: Missing Value Identification
# Identify the columns in the dataset that contain missing values. How many missing values are present in each column?
# Your Code Here:

**Explanation**

[Your explanation here]

**Question 2: Basic Missing Value Handling**

Remove all rows that contain at least one missing value. How many rows are removed? Explain why you chose this approach.


In [None]:
# Question 2: Basic Missing Value Handling
# Remove all rows that contain at least one missing value. How many rows are removed? Explain why you chose this approach.
# Your Code Here:

**Explanation**

[Your explanation here]

**Question 3: Data Type Conversion**

Verify the data types of each column in the online shopping dataset. Convert the 'ProductPrice' column to a float data type and the 'PurchaseDate' column to a datetime data type. Explain why these data types are appropriate.
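
A short sketch of the two conversions on a hypothetical frame `toy`, where both columns start out as text. This is also the situation after reloading a CSV with `pd.read_csv`, which returns dates as text unless told to parse them.

```python
import pandas as pd

# Toy frame (hypothetical values): prices and dates stored as strings
toy = pd.DataFrame({
    "ProductPrice": ["19.99", "250", "75.5"],
    "PurchaseDate": ["2023-01-15", "2023-06-30", "2023-12-01"],
})
print(toy.dtypes)  # inspect the types before converting

toy["ProductPrice"] = toy["ProductPrice"].astype(float)    # text -> float
toy["PurchaseDate"] = pd.to_datetime(toy["PurchaseDate"])  # text -> datetime64
print(toy.dtypes)  # confirm the conversion
```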


In [None]:
# Question 3: Data Type Conversion
# Verify the data types of each column. Convert the 'ProductPrice' column to a float data type and the 'PurchaseDate' column to a datetime data type. Explain why these data types are appropriate.
# Your Code Here:
# ... your code here ...

**Explanation**

[Your explanation here]

**Question 4: Renaming Columns**

Rename the 'CustomerID' column to 'Customer_ID' and the 'ProductCategory' column to 'Category'. Explain why renaming columns can be useful.


In [None]:
# Question 4: Renaming Columns
# Rename the 'CustomerID' column to 'Customer_ID' and the 'ProductCategory' column to 'Category'. Explain why renaming columns can be useful.
# Your Code Here:
# ... your code here ...


**Explanation**

[Your explanation here]

**Question 5: Duplicate Row Removal**

Check for and remove any duplicate rows in the dataset. How many duplicate rows were found and removed?
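
A minimal sketch of counting and dropping exact duplicate rows, using a hypothetical frame `toy`. In the generated dataset each row has its own 'RecordID', so finding zero full-row duplicates is a plausible and legitimate result.

```python
import pandas as pd

# Toy frame (hypothetical values); the third row repeats the first
toy = pd.DataFrame({
    "CustomerID": [1001, 1002, 1001, 1003],
    "ProductPrice": [25.0, 310.0, 25.0, 45.0],
})

n_duplicates = toy.duplicated().sum()  # rows identical to an earlier row
deduplicated = toy.drop_duplicates()   # keeps the first occurrence of each row
print(n_duplicates, len(deduplicated))
```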


In [None]:
# Question 5: Duplicate Row Removal
# Check for and remove any duplicate rows in the dataset. How many duplicate rows were found and removed?
# Your Code Here:

**Explanation**

[Your explanation here]

## **Intermediate (RBT Levels: 3, 4):**

Total: 10 Marks

Each question carries 2 marks.



**Question 6: Targeted Missing Value Imputation**

Impute the missing values in the 'Rating' column with the median rating. Explain why you chose this imputation method.
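
The mechanics of median imputation can be sketched on a hypothetical frame `toy`. `Series.median()` skips NaN by default, so the median is computed from the observed ratings only.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with two missing ratings
toy = pd.DataFrame({"Rating": [1, 5, np.nan, 4, np.nan, 2]})

median_rating = toy["Rating"].median()               # median of the observed values 1, 2, 4, 5
toy["Rating"] = toy["Rating"].fillna(median_rating)  # replace only the missing entries
print(median_rating, toy["Rating"].tolist())
```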

In [None]:
# Question 6: Targeted Missing Value Imputation
# Impute the missing values in the 'Rating' column with the median rating. Explain why you chose this imputation method.
# Your Code Here:
# ... your code here ...


**Explanation**

[Your explanation here]



Impute the missing values in the 'ShippingMethod' column with the string 'Unknown'. Explain why you chose this imputation method.


In [None]:

# Impute the missing values in the 'ShippingMethod' column with the string 'Unknown'. Explain why you chose this imputation method.
# Your Code Here:
# ... your code here ...



**Explanation**

[Your explanation here]

**Question 7: Binning Numerical Data and Visualization**

Create a new categorical column called 'PriceRange' by binning the 'ProductPrice' column into appropriate price ranges (e.g., Low, Medium, High). Explain your binning strategy. Create a bar chart showing the distribution of products in each price range.
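
One way to bin with fixed cut points is `pd.cut`. The sketch below uses a hypothetical frame `toy`, and the bin edges 0/100/300/500 are illustrative assumptions, not required values; choose and justify your own.

```python
import pandas as pd

# Toy frame (hypothetical values)
toy = pd.DataFrame({"ProductPrice": [12.0, 49.0, 150.0, 320.0, 480.0, 90.0]})

# Assumed cut points; intervals are right-inclusive: (0, 100], (100, 300], (300, 500]
bins = [0, 100, 300, 500]
labels = ["Low", "Medium", "High"]
toy["PriceRange"] = pd.cut(toy["ProductPrice"], bins=bins, labels=labels)

counts = toy["PriceRange"].value_counts().reindex(labels)  # counts in Low/Medium/High order
print(counts)
```

With matplotlib installed, `counts.plot(kind="bar")` draws the corresponding bar chart.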


In [None]:
# Question 7: Binning Numerical Data and Visualization
# Create a new categorical column called 'PriceRange' by binning the 'ProductPrice' column into appropriate price ranges (e.g., Low, Medium, High).
# Explain your binning strategy. Create a bar chart showing the distribution of products in each price range.
# Your Code Here:
# ... your code here ...



**Explanation**

[Your explanation here]

Create a new categorical column called 'AgeGroup' by binning the 'CustomerAge' column into quantiles. Explain your binning strategy. Create a boxplot chart showing the distribution of 'ProductPrice' based on 'AgeGroup'.
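
Quantile binning differs from fixed-edge binning in that the edges are derived from the data so that each bin holds roughly the same number of rows. A sketch with `pd.qcut` on a hypothetical frame `toy` follows; the labels Q1 to Q4 are illustrative.

```python
import pandas as pd

# Toy frame (hypothetical values)
toy = pd.DataFrame({
    "CustomerAge": [18, 22, 27, 33, 41, 48, 55, 70],
    "ProductPrice": [35.0, 120.0, 80.0, 410.0, 260.0, 55.0, 300.0, 150.0],
})

# Four quantile-based bins (quartiles), each receiving about a quarter of the rows
toy["AgeGroup"] = pd.qcut(toy["CustomerAge"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])

# Per-group price summary, the numbers a boxplot would display
summary = toy.groupby("AgeGroup", observed=True)["ProductPrice"].describe()
print(summary)
```

With matplotlib installed, `toy.boxplot(column="ProductPrice", by="AgeGroup")` draws one box per age group.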

In [None]:
# Create a new categorical column called 'AgeGroup' by binning the 'CustomerAge' column into quantiles.
# Explain your binning strategy. Create a boxplot chart showing the distribution of 'ProductPrice' based on 'AgeGroup'.

# Your Code Here:
# ... your code here ...

**Explanation**

[Your explanation here]

**Question 8: Outlier Detection and Removal**

Use the IQR method to identify and remove outliers from the 'ProductPrice' column. Explain your outlier detection and removal process.
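
The IQR rule flags values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR, where IQR = Q3 - Q1. The sketch below applies it to a hypothetical frame `toy` containing one obviously extreme price.

```python
import pandas as pd

# Toy frame (hypothetical values); 900.0 is far from the rest
toy = pd.DataFrame({"ProductPrice": [20.0, 35.0, 40.0, 45.0, 50.0, 55.0, 60.0, 900.0]})

q1 = toy["ProductPrice"].quantile(0.25)
q3 = toy["ProductPrice"].quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

is_inlier = toy["ProductPrice"].between(lower_fence, upper_fence)  # inclusive bounds
cleaned = toy[is_inlier]
print(f"fences: [{lower_fence}, {upper_fence}], removed: {len(toy) - len(cleaned)}")
```

Because the template draws prices uniformly between 10 and 500, the rule may flag few or no rows on the assignment data. Reporting zero outliers, with the fences shown, is a valid outcome.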


In [None]:
# Question 8: Outlier Detection and Removal
# Use the IQR method to identify and remove outliers from the 'ProductPrice' column. Explain your outlier detection and removal process.
# Your Code Here:
# ... your code here ...



**Explanation**

[Your explanation here]

**Question 9: String Manipulation**

Clean the 'Review' column by removing any leading or trailing whitespace. Convert all values to lowercase to ensure consistency.
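
The `.str` accessor applies string methods element-wise and leaves missing values as missing. A sketch on a hypothetical frame `toy`:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with stray whitespace, mixed case and a gap
toy = pd.DataFrame({"Review": ["  Good product", "EXCELLENT  ", np.nan, "Poor quality"]})

# Strip surrounding whitespace, then lowercase; NaN entries pass through unchanged
toy["Review"] = toy["Review"].str.strip().str.lower()
print(toy["Review"].tolist())
```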

In [None]:
# Question 9: String Manipulation
# Clean the 'Review' column by removing any leading or trailing whitespace. Convert all values to lowercase to ensure consistency.
# Your Code Here:
# ... your code here ...

**Explanation**

[Your explanation here]

**Question 10: Dummy Variable Creation and Stacked Bar Plot**

Create dummy variables for the 'CustomerGender' and 'ProductCategory' columns. Explain how dummy variables are used in data analysis. Create a stacked bar plot to visualize the distribution of 'CustomerGender' within each 'ProductCategory'.
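
Two pandas tools cover this question: `pd.get_dummies` for the indicator columns and `pd.crosstab` for the counts behind a stacked bar plot. The frame `toy` below is hypothetical.

```python
import pandas as pd

# Toy frame (hypothetical values)
toy = pd.DataFrame({
    "CustomerGender": ["Male", "Female", "Other", "Female", "Male", "Female"],
    "ProductCategory": ["Books", "Books", "Beauty", "Beauty", "Books", "Clothing"],
})

# One indicator column per distinct value, named <column>_<value>
dummies = pd.get_dummies(toy, columns=["CustomerGender", "ProductCategory"])

# Rows are categories, columns are genders, cells are purchase counts
gender_by_category = pd.crosstab(toy["ProductCategory"], toy["CustomerGender"])
print(dummies.columns.tolist())
print(gender_by_category)
```

With matplotlib installed, `gender_by_category.plot(kind="bar", stacked=True)` draws one stacked bar per category.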


In [None]:
# Question 10: Dummy Variable Creation and Stacked Bar Plot
# Create dummy variables for the 'CustomerGender' and 'ProductCategory' columns. Explain how dummy variables are used in data analysis.
# Create a stacked bar plot to visualize the distribution of 'CustomerGender' within each 'ProductCategory'.
# Your Code Here:
# ... your code here ...



**Explanation**

[Your explanation here]

## **Advanced (RBT Levels: 4, 5):**

Total: 15 Marks

Each question carries 3 marks.

**Question 11: Conditional Missing Value Imputation**

Impute missing values in the 'Rating' column conditionally. If a record's 'Review' is NaN, impute 'Rating' with the overall median rating. Otherwise, impute with the median rating of the records whose 'Review' is not NaN. Explain your approach.
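
One possible reading of this rule, sketched on a hypothetical frame `toy`: compute both medians first, pick the fill value per row from whether 'Review' is missing, then fill only the missing ratings. Other interpretations are defensible if explained.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values)
toy = pd.DataFrame({
    "Rating": [5, np.nan, 1, np.nan, 4, 2],
    "Review": ["Excellent", "Average", np.nan, np.nan, "Good product", np.nan],
})

overall_median = toy["Rating"].median()                              # all observed ratings
reviewed_median = toy.loc[toy["Review"].notna(), "Rating"].median()  # rows that have a review

# Per-row fill value: overall median where Review is missing, reviewed median otherwise
fill_values = pd.Series(
    np.where(toy["Review"].isna(), overall_median, reviewed_median),
    index=toy.index,
)
toy["Rating"] = toy["Rating"].fillna(fill_values)  # observed ratings stay untouched
print(toy)
```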

In [None]:
# Question 11: Conditional Missing Value Imputation
# Impute missing values in the 'Rating' column. If 'Review' is NaN, impute 'Rating' with the overall median rating.
# Otherwise, impute with the median rating of the reviews that are not NaN. Explain your approach.
# Your Code Here:
# ... your code here ...


**Explanation**

[Your explanation here]

**Question 12: Custom Binning Function**

Write a custom function to create a 'PriceCategory' column based on the 'ProductPrice'. Categorize prices below \$50 as 'Low', prices between \$50 and \$200 as 'Medium', and prices above \$200 as 'High'. Apply this function to create the new column.
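
A sketch of a row-wise categorizer applied with `Series.apply`, on a hypothetical frame `toy`. The question does not say which bucket the boundary values belong to. Treating exactly 50 and exactly 200 as 'Medium' is an assumption made here, as are the function name `price_category` and the choice to leave missing prices missing.

```python
import numpy as np
import pandas as pd


def price_category(price):
    """Map a price to 'Low', 'Medium' or 'High' (boundaries assumed inclusive for Medium)."""
    if pd.isna(price):
        return np.nan  # leave missing prices uncategorized
    if price < 50:
        return "Low"
    if price <= 200:
        return "Medium"
    return "High"


# Toy frame (hypothetical values), including both boundary prices
toy = pd.DataFrame({"ProductPrice": [10.0, 50.0, 125.0, 200.0, 200.01, 499.0]})
toy["PriceCategory"] = toy["ProductPrice"].apply(price_category)
print(toy)
```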

In [None]:
# Question 12: Custom Binning Function
# Write a custom function to create a 'PriceCategory' column based on the 'ProductPrice'.
# Categorize prices below $50 as 'Low', prices between $50 and $200 as 'Medium', and prices above $200 as 'High'. Apply this function to create the new column.
# Your Code Here:
# ... your code here ...



**Explanation**

[Your explanation here]

**Question 13: Grouped Transformations and Line Chart**

Calculate the average 'ProductPrice' for each 'ProductCategory'. Then create a new column called 'PriceNormalized' that represents each product's 'ProductPrice' as a z-score within its category, that is, the price minus the category mean, divided by the category standard deviation. Create a line chart visualizing the average normalized price across categories, sorted by average normalized price.
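
A within-group z-score can be sketched with `groupby(...).transform`, which returns a result aligned row by row with the original frame. The frame `toy` is hypothetical, and using the sample standard deviation (pandas' default, `ddof=1`) is an assumption.

```python
import pandas as pd

# Toy frame (hypothetical values)
toy = pd.DataFrame({
    "ProductCategory": ["Books", "Books", "Books", "Beauty", "Beauty", "Beauty"],
    "ProductPrice": [10.0, 20.0, 30.0, 100.0, 200.0, 300.0],
})

grouped = toy.groupby("ProductCategory")["ProductPrice"]
category_mean = grouped.transform("mean")  # each row receives its category's mean
category_std = grouped.transform("std")    # sample standard deviation (ddof=1)

toy["PriceNormalized"] = (toy["ProductPrice"] - category_mean) / category_std
print(toy)
```

By construction, the mean of a z-score within the group it was computed over is zero up to rounding, so expect the per-category averages of 'PriceNormalized' to sit very close to 0. That is worth addressing in your explanation.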


In [None]:
# Question 13: Grouped Transformations and Line Chart
# Calculate the average 'ProductPrice' for each 'ProductCategory'.
# Then create a new column called 'PriceNormalized' that represents each product's 'ProductPrice' as a z-score relative to its category's average.
# Create a line chart visualizing the average normalized Price across categories sorted by average normalized Price.
# Your Code Here:
# ... your code here ...


**Explanation**

[Your explanation here]

**Question 14: Data Sampling and Validation**

Randomly sample 30% of the dataset. Use this sample to calculate the average 'ProductPrice' for each 'CustomerLocation'. Compare these means to the means calculated using the entire dataset. Discuss any differences and their potential implications.
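
A sketch of sampling a fraction and comparing group means side by side, on a hypothetical frame `toy`. The `random_state` value 42 is arbitrary and only makes the draw repeatable.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): 60 rows, 20 per location
toy = pd.DataFrame({
    "CustomerLocation": ["Urban", "Suburban", "Rural"] * 20,
    "ProductPrice": np.linspace(10, 500, 60),
})

sample = toy.sample(frac=0.3, random_state=42)  # 30% of the rows, without replacement

full_means = toy.groupby("CustomerLocation")["ProductPrice"].mean()
sample_means = sample.groupby("CustomerLocation")["ProductPrice"].mean()

# Align both results on location and compute the gap
comparison = pd.DataFrame({"full": full_means, "sample": sample_means})
comparison["difference"] = comparison["sample"] - comparison["full"]
print(comparison)
```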

In [None]:
# Question 14: Data Sampling and Validation
# Randomly sample 30% of the dataset. Use this sample to calculate the mean 'ProductPrice' for each 'CustomerLocation'.
# Compare these means to the means calculated using the entire dataset. Discuss any differences and their potential implications.
# Your Code Here:
# ... your code here ...



**Explanation**

[Your explanation here]

**Question 15: Merging Hypothetical Data**

Imagine you have a second dataset with customer demographic information (e.g., income level, marital status). Merge this hypothetical dataset with the online shopping dataset using the 'CustomerID' column as a key. Explain your merge strategy and how this merged data could be used for further analysis.
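
A sketch of one merge strategy: a left join that keeps every purchase and attaches demographics where available. Both frames and the columns `IncomeLevel` and `MaritalStatus` are hypothetical stand-ins for the second dataset. A left join is only one reasonable choice; an inner join would instead drop purchases from customers without demographic data.

```python
import pandas as pd

# Purchases: several rows may share a CustomerID (hypothetical values)
purchases = pd.DataFrame({
    "CustomerID": [1001, 1002, 1001, 1003],
    "ProductPrice": [25.0, 310.0, 99.0, 45.0],
})

# Hypothetical demographics: one row per customer
demographics = pd.DataFrame({
    "CustomerID": [1001, 1002, 1004],
    "IncomeLevel": ["High", "Medium", "Low"],
    "MaritalStatus": ["Single", "Married", "Married"],
})

# validate="many_to_one" raises an error if a CustomerID repeats in demographics
merged = purchases.merge(demographics, on="CustomerID", how="left", validate="many_to_one")
print(merged)
```

Customer 1003 has no demographic row, so its `IncomeLevel` and `MaritalStatus` come back as NaN. Counting such rows after the merge is a quick coverage check.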


In [None]:
# Question 15: Merging Hypothetical Data
# Imagine you have a second dataset with customer demographic information (e.g., income level, marital status).
# Merge this hypothetical dataset with the online shopping dataset using the 'CustomerID' column as a key. Explain your merge strategy and how this merged data could be used for further analysis.
# Your Code Here:
# ... your code here ...



**Explanation**

[Your explanation here]

**Report**

**Part 1**

- In this section, compile the explanation of each of the questions.

**Part 2**

- Answer the following data analysis questions:
 1. "What are the key characteristics of the customer base in this dataset?"
 2. "Which factors appear to have the strongest influence on product price or customer rating?"
 3. "What are the most common missing data patterns, and what implications might they have on our analysis?"
 4. "Based on your analysis, what are 2-3 recommendations you would make to improve sales or customer satisfaction?"

## **Answers**