## **Candidate Information**

**Fill in your details below:**
- **Name:** Prasanna Nadrajan R
- **Email:** prasannanadrajan.r@gmail.com
- **Phone:** 8667687297
- **College/University:** Rajalakshmi Engineering College, Chennai
- **Course/Branch:** B.Tech. Artificial Intelligence and Data Science
- **Start Time:** 25/12/2025 2:32 PM
- **End Time:** 03:08 PM


# Evoastra Ventures Intern Assessment Task
**Duration: 35 Minutes | Total Points: 100**

---


## **Dataset for Assessment**

**E-commerce Customer Behavior Dataset**
- **Direct Link:** https://www.kaggle.com/datasets/shriyashjagtap/e-commerce-customer-for-behavior-analysis

**Dataset Columns:**
- **Customer ID:** Unique identifier for each customer
- **Customer Name:** Name of the customer
- **Customer Age:** Age of the customer
- **Gender:** Gender of the customer
- **Purchase Date:** Date of each purchase
- **Product Category:** Category of the purchased product
- **Product Price:** Price of the purchased product
- **Quantity:** Quantity of product purchased
- **Total Purchase Amount:** Total amount spent in each transaction
- **Payment Method:** Payment method used (credit card, PayPal, etc.)
- **Returns:** Whether customer returned products (0 = No, 1 = Yes)
- **Churn:** Whether customer has churned (0 = Retained, 1 = Churned)


---
## **SECTION A: Data Understanding & Basic Analysis (25 Points - 10 Minutes)**


### Question 1 (8 points)
Based on the dataset structure, identify the **data types** for the following columns and explain why each classification is important for analysis:

- **Customer Age**
- **Gender**
- **Total Purchase Amount**
- **Churn**


In [1]:
# Customer Age: Integer (or float if fractional ages are allowed). Important for numerical analysis, correlation with purchase behavior, segmentation (e.g., age groups), and identifying trends.

# Gender: Categorical (Nominal). Important for demographic analysis, understanding purchasing patterns across genders, targeted marketing, and identifying potential gender-based disparities.

# Total Purchase Amount: Float (or Integer for whole currency units). Important for financial analysis, revenue calculation, identifying high-value customers, tracking spending trends, and for various aggregations and statistical measures.

# Churn: Categorical (Binary/Boolean, 0 or 1). Important as the target variable for churn prediction models, for calculating churn rates, understanding factors leading to churn, and for customer retention strategies.
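
A minimal sketch of how these classifications might be enforced when loading raw rows, assuming every field arrives as a string (as it would from a CSV reader). The `raw_row` values and the `COLUMN_CASTS` mapping are hypothetical illustrations, not taken from the dataset.

```python
# Hypothetical raw record, as a CSV reader would deliver it (all strings).
raw_row = {
    "Customer Age": "34",
    "Gender": "Female",
    "Total Purchase Amount": "1250.50",
    "Churn": "0",
}

# Assumed mapping from column name to the type used for analysis.
COLUMN_CASTS = {
    "Customer Age": int,               # discrete numeric
    "Gender": str,                     # nominal category
    "Total Purchase Amount": float,    # continuous numeric
    "Churn": lambda v: bool(int(v)),   # binary target (0/1)
}

# Apply each cast to the matching field.
typed_row = {col: cast(raw_row[col]) for col, cast in COLUMN_CASTS.items()}
print(typed_row)
```

Casting early means a malformed value (for example, a non-numeric age) raises an error at load time rather than silently distorting later aggregations.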

### Question 2 (8 points)
Which **analytical technique** would be most appropriate for each business question below?

a) "Which product categories generate the highest revenue?"
b) "Can we predict customer churn based on purchase behavior?"
c) "What is the relationship between customer age and spending patterns?"
d) "Which payment methods are preferred by different customer segments?"


In [None]:
# a) Product categories with highest revenue: Descriptive Statistics & Categorical Analysis (e.g., aggregation and summation of 'Total Purchase Amount' grouped by 'Product Category').

# b) Predicting customer churn: Classification (Machine Learning Models like Logistic Regression, Decision Trees, Random Forest, Gradient Boosting).

# c) Age and spending relationship: Correlation Analysis or Regression Analysis (e.g., linear regression) to quantify the relationship, or Segmentation Analysis to compare spending across age groups.

# d) Payment method preferences: Cross-Tabulation / Contingency Tables (to see frequency distributions) or Segmentation Analysis (to analyze preferences within different customer segments).
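
As an illustration of techniques (a) and (d), the sketch below aggregates revenue by category and builds a simple cross-tabulation using only the standard library. The `transactions` list is hypothetical sample data; with the real dataset the same idea would typically be expressed with a grouped sum and a contingency table.

```python
from collections import Counter, defaultdict

# Hypothetical transactions: (product category, payment method, total purchase amount).
transactions = [
    ("Electronics", "Credit Card", 1200.0),
    ("Fashion", "PayPal", 450.0),
    ("Electronics", "PayPal", 800.0),
    ("Home", "Credit Card", 300.0),
    ("Fashion", "Credit Card", 250.0),
]

# a) Revenue per category: sum 'Total Purchase Amount' grouped by 'Product Category'.
revenue_by_category = defaultdict(float)
for category, _payment, amount in transactions:
    revenue_by_category[category] += amount

# Rank categories from highest to lowest revenue.
ranked = sorted(revenue_by_category.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)

# d) Cross-tabulation: number of transactions per (category, payment method) pair.
crosstab = Counter((category, payment) for category, payment, _amount in transactions)
print(crosstab)
```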

### Question 3 (9 points)
**Data Quality Assessment:** What are the top 3 potential data quality issues you would check for in this e-commerce dataset before starting analysis? For each issue, suggest one method to detect it.


In [None]:
# Issue 1: Missing Values (e.g., in 'Customer Age', 'Product Price', 'Quantity', 'Payment Method').
# Detection Method 1: Use `df.isnull().sum()` to count missing values per column.

# Issue 2: Inconsistent Data Formats or Typos (e.g., 'Gender' having 'Male', 'male', 'M'; 'Payment Method' having variations like 'Credit Card', 'credit card').
# Detection Method 2: Use `df['column_name'].value_counts()` or `df['column_name'].unique()` to inspect unique values and their frequencies.

# Issue 3: Outliers or Incorrect Values (e.g., 'Customer Age' being 0 or over 100, 'Product Price' or 'Quantity' being negative, 'Total Purchase Amount' not matching 'Product Price' * 'Quantity').
# Detection Method 3: Use descriptive statistics (`df.describe()`), box plots, or custom checks (e.g., `df[(df['Customer Age'] <= 0) | (df['Customer Age'] > 100)]` or `df[df['Product Price'] < 0]`) to identify values outside expected ranges.
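
The three checks above can be sketched without any external library. The `rows` below are hypothetical, and the rule that 'Total Purchase Amount' should equal price times quantity is an assumption; in real data, discounts or taxes could legitimately break it.

```python
from collections import Counter

# Hypothetical rows; None marks a missing value.
rows = [
    {"Customer Age": 34, "Gender": "Male", "Product Price": 100.0,
     "Quantity": 2, "Total Purchase Amount": 200.0},
    {"Customer Age": None, "Gender": "male", "Product Price": 50.0,
     "Quantity": 1, "Total Purchase Amount": 50.0},
    {"Customer Age": 130, "Gender": "Female", "Product Price": -20.0,
     "Quantity": 3, "Total Purchase Amount": 75.0},
]

# Issue 1: count missing values per column.
missing = Counter(col for row in rows for col, value in row.items() if value is None)

# Issue 2: list distinct raw labels; case variants appear as separate entries.
gender_labels = Counter(row["Gender"] for row in rows)

# Issue 3: flag out-of-range ages, negative prices, and inconsistent totals.
def is_suspect(row):
    age = row["Customer Age"]
    bad_age = age is not None and not (0 < age <= 100)
    bad_price = row["Product Price"] < 0
    expected_total = row["Product Price"] * row["Quantity"]
    bad_total = abs(expected_total - row["Total Purchase Amount"]) > 0.01
    return bad_age or bad_price or bad_total

suspect_rows = [index for index, row in enumerate(rows) if is_suspect(row)]
print(missing, gender_labels, suspect_rows)
```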

---
## **SECTION B: Customer Analysis & Business Intelligence (35 Points - 15 Minutes)**


### Scenario: E-commerce Revenue Analysis

Based on the dataset structure, assume you have the following customer insights:

**Customer Segments by Age:**
- **Young (18-30):** 40% of customers, Average Purchase Amount: ₹850, Return Rate: 12%
- **Middle-aged (31-50):** 45% of customers, Average Purchase Amount: ₹1,200, Return Rate: 8%
- **Senior (51+):** 15% of customers, Average Purchase Amount: ₹950, Return Rate: 15%

**Additional Information:**
- Average customer acquisition cost: ₹180
- Platform profit margin: 20% of purchase amount
- Customer churn rates: Young (25%), Middle-aged (15%), Senior (30%)


### Question 4 (15 points)
**Calculate and analyze:**

a) Which customer segment generates the highest **net profit per customer** (considering returns)? Show your calculations. (8 points)

b) Which segment has the **best customer lifetime value** considering churn rates? Provide reasoning. (7 points)


In [7]:
# Your Calculations and Analysis for Question 4

# --- Parameters Definition ---

young_segment = {
    "customer_percentage": 0.40,
    "average_purchase_amount": 850,
    "return_rate": 0.12,
    "churn_rate": 0.25
}

middle_aged_segment = {
    "customer_percentage": 0.45,
    "average_purchase_amount": 1200,
    "return_rate": 0.08,
    "churn_rate": 0.15
}

senior_segment = {
    "customer_percentage": 0.15,
    "average_purchase_amount": 950,
    "return_rate": 0.15,
    "churn_rate": 0.30
}

average_customer_acquisition_cost = 180
platform_profit_margin = 0.20

print("Parameters defined successfully.\n")

# --- a) Net profit per customer calculations ---

print("a) Net profit per customer calculations:")

# Young Customers
young_adjusted_purchase_amount = young_segment["average_purchase_amount"] * (1 - young_segment["return_rate"])
young_gross_profit_per_customer = young_adjusted_purchase_amount * platform_profit_margin
young_net_profit_per_customer = young_gross_profit_per_customer - average_customer_acquisition_cost
print("\nYoung Customers:")
print(f"  Adjusted Purchase Amount: ₹{young_adjusted_purchase_amount:.2f}")
print(f"  Gross Profit per Customer: ₹{young_gross_profit_per_customer:.2f}")
print(f"  Net Profit per Customer: ₹{young_net_profit_per_customer:.2f}")

# Middle-aged Customers
middle_aged_adjusted_purchase_amount = middle_aged_segment["average_purchase_amount"] * (1 - middle_aged_segment["return_rate"])
middle_aged_gross_profit_per_customer = middle_aged_adjusted_purchase_amount * platform_profit_margin
middle_aged_net_profit_per_customer = middle_aged_gross_profit_per_customer - average_customer_acquisition_cost
print("\nMiddle-aged Customers:")
print(f"  Adjusted Purchase Amount: ₹{middle_aged_adjusted_purchase_amount:.2f}")
print(f"  Gross Profit per Customer: ₹{middle_aged_gross_profit_per_customer:.2f}")
print(f"  Net Profit per Customer: ₹{middle_aged_net_profit_per_customer:.2f}")

# Senior Customers
senior_adjusted_purchase_amount = senior_segment["average_purchase_amount"] * (1 - senior_segment["return_rate"])
senior_gross_profit_per_customer = senior_adjusted_purchase_amount * platform_profit_margin
senior_net_profit_per_customer = senior_gross_profit_per_customer - average_customer_acquisition_cost
print("\nSenior Customers:")
print(f"  Adjusted Purchase Amount: ₹{senior_adjusted_purchase_amount:.2f}")
print(f"  Gross Profit per Customer: ₹{senior_gross_profit_per_customer:.2f}")
print(f"  Net Profit per Customer: ₹{senior_net_profit_per_customer:.2f}")

print("\n--- Analysis for Question 4a ---")
# Compare net profits and identify the highest
segments_net_profit = {
    "Young": young_net_profit_per_customer,
    "Middle-aged": middle_aged_net_profit_per_customer,
    "Senior": senior_net_profit_per_customer
}
highest_profit_segment = max(segments_net_profit, key=segments_net_profit.get)
highest_profit_value = segments_net_profit[highest_profit_segment]
print(f"The customer segment that generates the highest net profit per customer is: {highest_profit_segment} with a net profit of ₹{highest_profit_value:.2f}\n")

# --- b) Customer Lifetime Value Analysis ---

print("b) Customer Lifetime Value Analysis:\n")

# Calculate CLV for Young Customers
young_clv = young_net_profit_per_customer / young_segment['churn_rate'] if young_segment['churn_rate'] != 0 else float('inf')
print(f"  Young Customers CLV: ₹{young_clv:.2f}")

# Calculate CLV for Middle-aged Customers
middle_aged_clv = middle_aged_net_profit_per_customer / middle_aged_segment['churn_rate'] if middle_aged_segment['churn_rate'] != 0 else float('inf')
print(f"  Middle-aged Customers CLV: ₹{middle_aged_clv:.2f}")

# Calculate CLV for Senior Customers
senior_clv = senior_net_profit_per_customer / senior_segment['churn_rate'] if senior_segment['churn_rate'] != 0 else float('inf')
print(f"  Senior Customers CLV: ₹{senior_clv:.2f}")

print("\n--- Analysis for Question 4b ---")
# Compare CLVs and identify the best segment
segments_clv = {
    "Young": young_clv,
    "Middle-aged": middle_aged_clv,
    "Senior": senior_clv
}
best_clv_segment = max(segments_clv, key=segments_clv.get)
best_clv_value = segments_clv[best_clv_segment]
print(f"The segment with the best Customer Lifetime Value (CLV) is: {best_clv_segment} with a CLV of ₹{best_clv_value:.2f}")

print("\nReasoning:")
print(f"The {best_clv_segment} segment has the highest positive Customer Lifetime Value (₹{best_clv_value:.2f}) among all segments. \nThis indicates that, despite its churn rate, the profitability generated per customer from this segment, adjusted for returns and acquisition costs, leads to the most significant long-term value for the business. \nConversely, the Young and Senior segments show negative CLVs, suggesting that the current business model is losing money on these customers over their lifetime, primarily due to lower net profit per customer and higher churn rates.")

Parameters defined successfully.

a) Net profit per customer calculations:

Young Customers:
  Adjusted Purchase Amount: ₹748.00
  Gross Profit per Customer: ₹149.60
  Net Profit per Customer: ₹-30.40

Middle-aged Customers:
  Adjusted Purchase Amount: ₹1104.00
  Gross Profit per Customer: ₹220.80
  Net Profit per Customer: ₹40.80

Senior Customers:
  Adjusted Purchase Amount: ₹807.50
  Gross Profit per Customer: ₹161.50
  Net Profit per Customer: ₹-18.50

--- Analysis for Question 4a ---
The customer segment that generates the highest net profit per customer is: Middle-aged with a net profit of ₹40.80

b) Customer Lifetime Value Analysis:

  Young Customers CLV: ₹-121.60
  Middle-aged Customers CLV: ₹272.00
  Senior Customers CLV: ₹-61.67

--- Analysis for Question 4b ---
The segment with the best Customer Lifetime Value (CLV) is: Middle-aged with a CLV of ₹272.00

Reasoning:
The Middle-aged segment has the highest positive Customer Lifetime Value (₹272.00) among all segments. 
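This indicates that, despite its churn rate, the profitability generated per customer from this segment, adjusted for returns and acquisition costs, leads to the most significant long-term value for the business. 
Conversely, the Young and Senior segments show negative CLVs, suggesting that the current business model is losing money on these customers over their lifetime, primarily due to lower net profit per customer and higher churn rates.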
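
The calculation above divides net profit (already reduced by the acquisition cost) by the churn rate, which in effect charges the acquisition cost in every period. A common simplified alternative treats acquisition as a one-time cost. The sketch below assumes one purchase per period, a constant churn rate, and an expected lifetime of `1 / churn_rate` periods; the function name `simple_clv` is hypothetical, and this is one possible model rather than the method the assessment prescribes.

```python
ACQUISITION_COST = 180   # one-time cost per customer, from the scenario
PROFIT_MARGIN = 0.20     # platform margin on purchase amount, from the scenario

# Scenario figures: (average purchase amount, return rate, churn rate).
segments = {
    "Young": (850, 0.12, 0.25),
    "Middle-aged": (1200, 0.08, 0.15),
    "Senior": (950, 0.15, 0.30),
}

def simple_clv(avg_purchase, return_rate, churn_rate):
    """Per-period gross profit times expected lifetime, minus one-time acquisition cost."""
    gross_profit = avg_purchase * (1 - return_rate) * PROFIT_MARGIN
    expected_periods = 1 / churn_rate
    return gross_profit * expected_periods - ACQUISITION_COST

clv_by_segment = {name: simple_clv(*values) for name, values in segments.items()}
for name, value in clv_by_segment.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions all three segments come out positive (about 418.40, 1292.00, and 358.33 respectively), and the Middle-aged segment still ranks first, so the ranking holds while the sign of the Young and Senior CLVs depends on how acquisition cost is modeled.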

### Question 5 (10 points)
**Strategic Recommendations:** Based on your analysis, what would be your **top 2 marketing strategies** to maximize overall profitability? Consider customer acquisition, retention, and return rates.


In [None]:
# Your Answer for Question 5

# Strategy 1: Prioritize Middle-aged Customer Retention and Upselling.
# Reasoning: The Middle-aged segment shows the highest positive net profit per customer and the best Customer Lifetime Value (CLV). Focus significant marketing and customer service efforts on retaining these customers, encouraging repeat purchases, and upselling/cross-selling higher-margin products. Implement loyalty programs, personalized recommendations, and exclusive offers tailored to this segment to further boost their spending and reduce their already lower churn rate.

# Strategy 2: Implement Targeted Interventions for Young and Senior Segments to Reduce Losses.
# Reasoning: Both Young and Senior segments currently have negative net profits per customer and negative CLVs, indicating that the business is losing money on them. For the Young segment, consider strategies to bring their 12% return rate closer to the Middle-aged segment's 8% (e.g., improved product descriptions, better sizing guides, virtual try-on features) and to increase average purchase value. For the Senior segment, which has both the highest return rate (15%) and the highest churn rate (30%), focus on improving customer satisfaction (e.g., simplified user interfaces, dedicated customer support, personalized assistance). Re-evaluate acquisition channels for these segments to ensure positive ROI.
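
To size the gap these strategies need to close, the sketch below computes the average purchase amount at which a segment's return-adjusted margin would just cover the acquisition cost, using the same single-purchase view as Question 4a. The helper `break_even_purchase` is a hypothetical name introduced for illustration.

```python
ACQUISITION_COST = 180
PROFIT_MARGIN = 0.20

# Scenario figures per segment.
return_rates = {"Young": 0.12, "Middle-aged": 0.08, "Senior": 0.15}
average_purchase = {"Young": 850, "Middle-aged": 1200, "Senior": 950}

def break_even_purchase(return_rate):
    """Average purchase amount at which return-adjusted margin equals the acquisition cost."""
    return ACQUISITION_COST / (PROFIT_MARGIN * (1 - return_rate))

# A positive gap means the segment's average purchase is below break-even.
gaps = {}
for segment, rate in return_rates.items():
    target = break_even_purchase(rate)
    gaps[segment] = target - average_purchase[segment]
    print(f"{segment}: break-even at {target:.2f}, gap {gaps[segment]:.2f}")
```

On these figures the Young segment would need roughly 173 more per purchase and the Senior segment roughly 109 more, while the Middle-aged segment is already above break-even.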

### Question 6 (10 points)
**Churn Prevention:** You notice that customers who make purchases in the "Electronics" category have a 35% churn rate, while "Fashion" category customers have only 18% churn rate. What **specific data analysis** would you conduct using the available dataset columns to understand this difference, and what **action plan** would you recommend?


In [None]:
# Data Analysis Plan:
# 1. Demographics Comparison: Compare 'Customer Age' and 'Gender' distributions for customers primarily purchasing 'Electronics' versus 'Fashion' categories. Are there significant age or gender differences that might influence preferences or churn?
#    - Method: Use descriptive statistics (mean, median, mode) and visualizations (histograms, bar charts) for age and gender distribution per category.

# 2. Purchase Behavior Analysis: Analyze 'Product Price', 'Quantity', and 'Total Purchase Amount' for both categories.
#    - Method: Compare average purchase values, frequency of purchases, and item quantities. Look for correlations between purchase value/frequency and churn within each category. For example, do customers buying high-ticket electronics churn more quickly?

# 3. Payment Method Preferences: Investigate if certain 'Payment Methods' are more common in 'Electronics' purchases and if these methods correlate with higher churn.
#    - Method: Cross-tabulate 'Payment Method' with 'Product Category' and 'Churn' to identify patterns.

# 4. Return Behavior Deep Dive: Analyze the 'Returns' column specifically for 'Electronics' customers.
#    - Method: Segment 'Electronics' customers who returned products and compare their churn rate with those who did not return products. Investigate the types of electronics returned and the reasons (if available in a real-world scenario). Correlate return frequency/value with churn.

# 5. Time-based Analysis: Use 'Purchase Date' to compare purchase recency and frequency for 'Electronics' versus 'Fashion' customers.
#    - Method: The dataset has no churn date column, so use the time since each customer's last purchase in a category as a proxy, and compare its distribution between churned and retained customers.

# Action Plan:
# Based on the findings from the data analysis:
# 1. Targeted Product/Service Improvements for Electronics: If specific types of electronics or high-priced items are associated with higher churn/returns, consider improving product quality, providing more detailed product descriptions, enhancing post-purchase support, or extending warranty periods.
# 2. Personalized Retention Campaigns: For 'Electronics' customers showing early signs of churn (e.g., after a return), implement targeted re-engagement campaigns. This could include personalized offers, tutorials on product usage, or proactive customer service outreach.
# 3. Enhance Customer Education and Support for Electronics: If returns or churn in 'Electronics' are linked to product complexity or dissatisfaction, provide better installation guides, troubleshooting resources, or a dedicated customer support channel for electronics-related queries.
# 4. Feedback Loop Implementation: Establish a robust system to collect customer feedback, especially from 'Electronics' purchasers and those who churned, to understand pain points and address them proactively. Use surveys or post-return feedback forms.
# 5. Review Pricing and Value Proposition: If the analysis reveals that 'Electronics' customers perceive lower value for money, review pricing strategies or bundle products/services to increase perceived value and reduce churn.
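
Step 4 of the analysis plan (comparing churn between customers who returned products and those who did not, within each category) could be sketched as follows. The `records` are hypothetical and assume one row per customer; the real dataset would first need to be aggregated to that level.

```python
from collections import defaultdict

# Hypothetical customer-level records: (product category, returns flag, churn flag).
records = [
    ("Electronics", 1, 1), ("Electronics", 1, 1), ("Electronics", 0, 0),
    ("Electronics", 0, 1), ("Fashion", 1, 0), ("Fashion", 0, 0),
    ("Fashion", 0, 1), ("Fashion", 0, 0),
]

# Accumulate [churned count, total count] per (category, returned) group.
groups = defaultdict(lambda: [0, 0])
for category, returned, churned in records:
    groups[(category, returned)][0] += churned
    groups[(category, returned)][1] += 1

# Churn rate within each group.
churn_rate = {key: churned / total for key, (churned, total) in groups.items()}
for (category, returned), rate in sorted(churn_rate.items()):
    print(f"{category:12s} returned={returned} churn_rate={rate:.2f}")
```

If the gap between the two categories shrinks once returns are held fixed, returns are a plausible part of the explanation; if it persists, the other dimensions in the plan deserve more attention.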

---
## **SECTION C: Research Methodology & Predictive Analytics (25 Points - 8 Minutes)**


### Scenario: Churn Prediction Model Development

Your company wants to build a machine learning model to predict customer churn using the available dataset.


### Question 7 (15 points)
**Model Development Plan:** Create a comprehensive approach including:

a) **Feature selection:** Which columns from the dataset would you use as features for the churn prediction model and why? (5 points)
b) **Data preprocessing steps:** What preprocessing would you apply to prepare the data? (5 points)
c) **Model evaluation metrics:** Which metrics would you use to evaluate model performance for this business problem? (5 points)


In [None]:
# Your Answer for Question 7

# a) Feature Selection:
# We would use the following columns as features and 'Churn' as the target variable:
# - Customer Age: Numerical feature, likely to influence purchase behavior and churn.
# - Gender: Categorical feature, potential differences in churn rates between genders.
# - Product Category: Categorical feature, certain categories might be associated with higher or lower churn.
# - Product Price: Numerical feature, price sensitivity can impact churn.
# - Quantity: Numerical feature, the number of items per transaction could be a churn indicator.
# - Total Purchase Amount: Numerical feature, overall spending patterns are crucial.
# - Payment Method: Categorical feature, preference for certain payment methods might correlate with loyalty.
# - Returns: Binary/Categorical feature, a strong indicator of dissatisfaction and potential churn.
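# 'Customer ID', 'Customer Name', 'Purchase Date' would likely not be used directly as features; 'Purchase Date' could be engineered into features like 'days since last purchase' or 'purchase frequency'. Because the dataset has one row per transaction while churn describes a customer, features would typically be aggregated per 'Customer ID' before modeling.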

# b) Data Preprocessing Steps:
# 1. Handle Missing Values: Depending on the extent, missing values could be imputed (e.g., mean/median for numerical, mode for categorical) or rows/columns with excessive missingness could be dropped.
# 2. Encode Categorical Variables: 'Gender', 'Product Category', 'Payment Method' would need to be converted into numerical format. One-hot encoding is suitable for nominal categorical variables.
# 3. Scale Numerical Features: 'Customer Age', 'Product Price', 'Quantity', 'Total Purchase Amount' often benefit from scaling (e.g., StandardScaler or MinMaxScaler) to prevent features with larger magnitudes from dominating distance-based or gradient-based models; tree-based models are largely insensitive to scaling.
# 4. Feature Engineering (Optional but Recommended): Create new features from 'Purchase Date' (e.g., 'days since last purchase', 'purchase frequency') to capture temporal patterns.
# 5. Handle Imbalanced Data: Churn datasets are typically imbalanced (fewer churned customers than retained). Techniques like oversampling (SMOTE), undersampling, or using algorithms robust to imbalance should be considered during model training. Resampling should be applied to the training split only, so that evaluation data keeps the real class distribution.

# c) Model Evaluation Metrics:
# For an imbalanced churn prediction problem, accuracy alone is not sufficient. We would use a combination of the following metrics:
# 1. Precision: The proportion of predicted churners who actually churned (minimizing false positives).
# 2. Recall (Sensitivity): The proportion of actual churners that were correctly identified (minimizing false negatives, crucial for retention efforts).
# 3. F1-Score: The harmonic mean of precision and recall, providing a balanced measure.
# 4. Area Under the Receiver Operating Characteristic Curve (AUC-ROC): Measures the model's ability to distinguish between churned and retained customers across various threshold settings.
# 5. Confusion Matrix: To visualize the number of true positives, true negatives, false positives, and false negatives, providing a clear breakdown of classification performance.
# 6. Specificity: The proportion of actual non-churners that were correctly identified.
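
The threshold-based metrics listed in part (c) all derive from the four confusion-matrix counts. The sketch below computes them by hand for hypothetical labels (1 = churned); AUC-ROC is omitted because it needs predicted scores rather than hard labels.

```python
def classification_metrics(y_true, y_pred):
    """Confusion-matrix counts and derived metrics for binary labels (1 = churned)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    # Guard each ratio against an empty denominator.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn, "precision": precision,
            "recall": recall, "specificity": specificity, "f1": f1}

# Hypothetical true labels and model predictions.
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

For a retention use case, recall usually deserves the most weight when missing a churner costs more than contacting a loyal customer unnecessarily; the balance depends on the cost of each intervention.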

### Question 8 (10 points)
**Business Impact Analysis:** Identify 3 potential business challenges in implementing a churn prediction model and propose one **data-driven solution** for each challenge using insights from the customer behavior dataset.


In [None]:
# Your Answer for Question 8

# Challenge 1: Data Scarcity/Quality for Specific Segments or Behaviors.
# Business Impact: Model might perform poorly or generalize incorrectly for niche customer groups or rare churn events due to insufficient or noisy data.
# Data-driven Solution 1: Leverage 'Product Category' and 'Payment Method' to enrich sparse demographic data. For example, if 'Electronics' customers have high churn, but limited demographic data, analyze common 'Payment Methods' within this category and their correlation to churn. For rare events, consider data augmentation techniques or oversampling during model training, focusing on customer segments with higher churn rates like 'Young' and 'Senior' (as identified in Q4).

# Challenge 2: Actionability and Integration with Existing Systems.
# Business Impact: Even with an accurate model, if predictions aren't easily integrated into marketing/CRM systems or don't trigger clear actions, the model's value is lost.
# Data-driven Solution 2: Design model outputs that map directly to intervention strategies. For instance, if 'Returns' turns out to be a strong predictor of churn, an automated workflow could trigger a personalized offer or a customer service call for customers who have returned products and are predicted to churn, using 'Customer ID' for targeting.

# Challenge 3: Evolving Customer Behavior and Model Drift.
# Business Impact: Customer preferences, market conditions, and product offerings change, causing the model's predictive power to degrade over time.
# Data-driven Solution 3: Implement a continuous monitoring and retraining pipeline. Regularly re-evaluate model performance using recent 'Purchase Date' and 'Churn' data. Utilize 'Product Category' and 'Total Purchase Amount' trends to identify shifts in customer behavior and trigger model retraining with updated data. For example, if new product categories emerge or average purchase amounts change significantly, retrain the model to capture these new patterns.
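
One simple form the monitoring step in Solution 3 could take is a check on how far the recent mean 'Total Purchase Amount' has moved from a baseline window. The 20% threshold, the function name `needs_retraining`, and the sample amounts are all hypothetical; a production pipeline would more likely track several features and the model's own metrics.

```python
from statistics import mean

RELATIVE_SHIFT_THRESHOLD = 0.20  # hypothetical: flag drift if the mean moves by more than 20%

def needs_retraining(baseline_amounts, recent_amounts, threshold=RELATIVE_SHIFT_THRESHOLD):
    """Return (drift flag, relative shift) comparing recent and baseline mean purchase amounts."""
    baseline_mean = mean(baseline_amounts)
    recent_mean = mean(recent_amounts)
    relative_shift = abs(recent_mean - baseline_mean) / baseline_mean
    return relative_shift > threshold, relative_shift

# Hypothetical purchase amounts from the training window and from the latest month.
baseline = [900.0, 1100.0, 1000.0, 1000.0]
recent = [1300.0, 1250.0, 1350.0, 1300.0]

flag, shift = needs_retraining(baseline, recent)
print(flag, round(shift, 2))
```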

---
## **SECTION D: Professional Communication & Problem-Solving (15 Points - 2 Minutes)**


### Question 9 (8 points)
**Crisis Management:** While analyzing the dataset, you discover that 40% of customers who returned products (Returns = 1) also churned within the same month. However, your initial analysis showed returns don't strongly correlate with churn. As a team member, describe your immediate approach to investigate this discrepancy and communicate findings to stakeholders (60-80 words).


In [None]:
# Your Answer for Question 9 (60-80 words)

# My immediate approach is to reconcile the two results: compare the 40% same-month churn rate among returners with the churn rate among non-returners, and re-run the correlation at monthly granularity using 'Purchase Date'. I would also check for data entry errors and inconsistent definitions of churn. I would then tell stakeholders that returns may be a churn signal that the aggregate correlation masked, and that causality needs further investigation before we recommend specific actions.
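
The two observations are not necessarily contradictory: a 40% churn rate among returners can coexist with a weak overall correlation when non-returners churn at a similar rate or when returners are a small share of customers. The sketch below computes the phi coefficient (the Pearson correlation for two binary variables) from a 2x2 table; all counts are hypothetical.

```python
from math import sqrt

def phi_coefficient(a, b, c, d):
    """Phi coefficient from a 2x2 table.

    a = returned and churned,      b = returned and retained,
    c = not returned and churned,  d = not returned and retained.
    """
    denominator = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denominator if denominator else 0.0

# Hypothetical counts: 40% of returners churn, versus 30% of non-returners.
returned_churned, returned_retained = 40, 60
not_returned_churned, not_returned_retained = 270, 630

phi = phi_coefficient(returned_churned, returned_retained,
                      not_returned_churned, not_returned_retained)
print(f"phi = {phi:.3f}")
```

With these made-up counts phi is roughly 0.065, a weak association, even though 40% of returners churned. The comparison against the non-returner baseline is what determines whether the finding matters.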

### Question 10 (7 points)
**Leadership Scenario:** If selected as team lead for analyzing this e-commerce customer dataset, what would be your **top 3 priorities** to ensure effective team collaboration and delivery of actionable business insights?


In [None]:
# Your Answer for Question 10

# Priority 1: Clear Project Scoping and Role Definition.
# I would ensure that the project objectives are clearly defined, understood by everyone, and broken down into manageable tasks. Each team member would have well-defined roles and responsibilities aligned with their strengths, fostering ownership and minimizing overlaps or gaps. This clarity is crucial for efficient workflow and team accountability.

# Priority 2: Establish a Robust Communication and Collaboration Framework.
# I would set up regular, structured sync-ups (daily stand-ups, weekly deep-dives) and designate preferred communication channels (e.g., dedicated chat, shared documentation platform). This ensures timely information exchange, transparent progress tracking, quick resolution of blockers, and encourages an open environment for sharing ideas and feedback.

# Priority 3: Focus on Business Impact and Actionable Insights.
# From the outset, I would emphasize connecting every analytical task to its potential business implication. This involves constantly asking "So what?" and "What action can be taken?" to ensure that our findings are not just statistically sound but also practical and valuable for decision-makers. Regular review of findings with stakeholders would ensure alignment and refine our approach for maximum impact.

---
## **Self-Assessment Section**


In [None]:
# Time Management Check
# Did you complete all sections within 35 minutes? (Yes/No): Yes

# Which section took the most time? Section B (Business Analysis), due to the calculations and detailed reasoning required for CLV.

# Which section was most challenging? Section B (Business Analysis), particularly ensuring accurate CLV calculation and insightful strategic recommendations.

# Confidence Level (1-10 scale):
# Section A (Data Understanding): 9
# Section B (Business Analysis): 8
# Section C (Research Methodology): 9
# Section D (Communication): 9

# Additional Comments: The assessment provided a good balance of theoretical and practical business analysis, with clear problem statements.

---
## **Submission Instructions**

1. **Save this notebook** with the filename: `YourName_Evoastra_Assessment.ipynb`
2. **Ensure all code cells have been executed** and answers are visible
3. **Double-check** that all sections are completed

**Submission Confirmation:**
- I confirm that I have completed this assessment independently
- All my responses are my own original work

**Digital Signature:** ___Prasannanadrajan______  
**Final Submission Time:** ______03:08PM________

---
## **Evaluation Criteria**

**Scoring Breakdown:**
- **Section A (Data Understanding):** 25 points
- **Section B (Business Analysis):** 35 points  
- **Section C (Research Methodology):** 25 points
- **Section D (Communication):** 15 points
- **Total:** 100 points

**Team Selection Criteria:**
- **Team Lead:** Score ≥ 75 points + Strong Section D performance
- **Co-Lead:** Score ≥ 65 points + Good Section D performance
- **Team Member:** Successful completion of assessment

---

*Good luck with your assessment! Focus on clear reasoning, accurate calculations, and practical business applications.*