# Evoastra Ventures Intern Assessment Task
**Duration: 35 Minutes | Total Points: 100**

---


## **Candidate Information**

**Fill in your details below:**
- **Name:** *Ketan Malviya*
- **Email:** *ketanmalviya9424840@gmail.com*
- **Phone:** *8817462881*
- **College/University:** *Madhav Institute of Technology and Science*
- **Course/Branch:** *B.Tech (Mathematics and Computing)*
- **Start Time:** *7:30 pm*
- **End Time:** *8 pm*


## **Dataset for Assessment**

**E-commerce Customer Behavior Dataset**
- **Direct Link:** https://www.kaggle.com/datasets/shriyashjagtap/e-commerce-customer-for-behavior-analysis

**Dataset Columns:**
- **Customer ID:** Unique identifier for each customer
- **Customer Name:** Name of the customer
- **Customer Age:** Age of the customer
- **Gender:** Gender of the customer
- **Purchase Date:** Date of each purchase
- **Product Category:** Category of the purchased product
- **Product Price:** Price of the purchased product
- **Quantity:** Quantity of product purchased
- **Total Purchase Amount:** Total amount spent in each transaction
- **Payment Method:** Payment method used (credit card, PayPal, etc.)
- **Returns:** Whether customer returned products (0 = No, 1 = Yes)
- **Churn:** Whether customer has churned (0 = Retained, 1 = Churned)


---
## **SECTION A: Data Understanding & Basic Analysis (25 Points - 10 Minutes)**


### Question 1 (8 points)
Based on the dataset structure, identify the **data types** for the following columns and explain why each classification is important for analysis:

- **Customer Age**
- **Gender**
- **Total Purchase Amount**
- **Churn**


# Your Answer for Question 1
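**Customer Age:** Numeric (discrete integer values, usually treated as continuous)  
Importance: Supports numeric operations such as averages, age binning for segmentation, and correlation with spending.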

**Gender:** Categorical (Nominal)  
Importance: Useful for demographic and preference-based analysis.

**Total Purchase Amount:** Numeric (Continuous)  
Importance: Used for revenue calculations and RFM analysis.

**Churn:** Categorical (Binary)  
Importance: Important for churn prediction and retention strategies.
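
A quick way to confirm these classifications is to inspect the column dtypes in pandas. The sketch below uses a small hand-made sample (the rows are illustrative, not taken from the Kaggle file) and assumes the column names match the dataset description above.

```python
import pandas as pd

# Illustrative rows only; column names follow the dataset description.
df = pd.DataFrame({
    "Customer Age": [23, 41, 67, 35],
    "Gender": ["Male", "Female", "Female", "Male"],
    "Total Purchase Amount": [850.0, 1200.5, 950.0, 430.25],
    "Churn": [0, 1, 0, 0],
})

# Numeric columns are inferred automatically. Nominal and binary columns
# are cast explicitly so later steps treat them as categories, not numbers.
df["Gender"] = df["Gender"].astype("category")
df["Churn"] = df["Churn"].astype("category")

print(df.dtypes)
```

Casting `Churn` to a category prevents accidental arithmetic on the 0/1 labels, although many modelling libraries expect the target as integers, so the cast may need to be reversed before training.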



### Question 2 (8 points)
Which **analytical technique** would be most appropriate for each business question below?

a) "Which product categories generate the highest revenue?"
b) "Can we predict customer churn based on purchase behavior?"
c) "What is the relationship between customer age and spending patterns?"
d) "Which payment methods are preferred by different customer segments?"


# Your Answer for Question 2
a) Descriptive analytics (Group-by + Aggregation)  
   → To identify revenue contribution of each product category.

b) Predictive analytics (Classification models)  
   → To predict churn using customer behavior features.

c) Correlation analysis and regression  
   → To study the relationship between age and spending.

d) Categorical analysis (Cross-tab, Chi-square test)  
   → To determine preferences of payment methods across segments.
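
As a rough illustration of techniques (a), (c), and (d), the following pandas sketch runs each one on a few made-up transactions. The values are invented, and `Gender` stands in for "customer segment" purely as an assumption; technique (b) needs a trained classifier and is not shown here.

```python
import pandas as pd

# Illustrative transactions; values are made up for the example.
df = pd.DataFrame({
    "Product Category": ["Electronics", "Fashion", "Electronics",
                         "Home", "Fashion", "Home"],
    "Total Purchase Amount": [1200, 300, 800, 450, 250, 500],
    "Customer Age": [25, 34, 52, 41, 29, 60],
    "Payment Method": ["Credit Card", "PayPal", "Credit Card",
                       "Cash", "PayPal", "Credit Card"],
    "Gender": ["Male", "Female", "Female", "Male", "Female", "Male"],
})

# (a) Revenue per category, highest first.
revenue_by_category = (
    df.groupby("Product Category")["Total Purchase Amount"]
      .sum()
      .sort_values(ascending=False)
)

# (c) Pearson correlation between age and spending.
age_spend_corr = df["Customer Age"].corr(df["Total Purchase Amount"])

# (d) Payment-method counts per segment (Gender is the assumed segment).
payment_by_segment = pd.crosstab(df["Gender"], df["Payment Method"])

print(revenue_by_category, age_spend_corr, payment_by_segment, sep="\n\n")
```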


### Question 3 (9 points)
**Data Quality Assessment:** What are the top 3 potential data quality issues you would check for in this e-commerce dataset before starting analysis? For each issue, suggest one method to detect it.


# Your Answer for Question 3

---

## **Issue 1: Missing Values**
Many real-world datasets contain missing entries such as missing age, missing purchase amount, or incomplete demographic fields.  
This can reduce model accuracy and distort statistical analysis.

**Detection Method:**  
`df.isnull().sum()`  
(Shows total number of missing values per column)

---

## **Issue 2: Duplicate Records**
Duplicate customer rows or repeated transactions inflate revenue, skew customer counts, and harm model performance.

**Detection Method:**  
`df.duplicated().sum()`  
(Returns how many duplicate rows exist)

---

## **Issue 3: Outliers in Numeric Columns**
Columns like *Customer Age* or *Total Purchase Amount* may contain unusually large or unrealistic values, which distort averages and impact ML models.

**Detection Method:**  
Use descriptive statistics or visualizations:  
- `df.describe()`  
- Boxplot for outlier detection:  
  `sns.boxplot(x=df['Total Purchase Amount'])`
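
The three detection methods can be run together on a small sample. This is a minimal sketch with invented rows. The IQR rule used for outliers is an addition here, not part of the answer above; it is the criterion that boxplot whiskers usually apply by default.

```python
import pandas as pd

# Illustrative sample with one missing age, one exact duplicate row,
# and one implausibly large purchase amount.
df = pd.DataFrame({
    "Customer ID": [1, 2, 3, 3, 4, 5, 6],
    "Customer Age": [25, None, 40, 40, 31, 58, 47],
    "Total Purchase Amount": [850, 1200, 950, 950, 1100, 900, 50000],
})

missing_per_column = df.isnull().sum()          # Issue 1
duplicate_rows = int(df.duplicated().sum())     # Issue 2

# Issue 3: flag values more than 1.5 * IQR outside the quartiles.
amount = df["Total Purchase Amount"]
q1, q3 = amount.quantile(0.25), amount.quantile(0.75)
iqr = q3 - q1
outliers = df[(amount < q1 - 1.5 * iqr) | (amount > q3 + 1.5 * iqr)]

print(missing_per_column, duplicate_rows, outliers, sep="\n\n")
```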

---



---
## **SECTION B: Customer Analysis & Business Intelligence (35 Points - 15 Minutes)**


### Scenario: E-commerce Revenue Analysis

Based on the dataset structure, assume you have the following customer insights:

**Customer Segments by Age:**
- **Young (18-30):** 40% of customers, Average Purchase Amount: ₹850, Return Rate: 12%
- **Middle-aged (31-50):** 45% of customers, Average Purchase Amount: ₹1,200, Return Rate: 8%
- **Senior (51+):** 15% of customers, Average Purchase Amount: ₹950, Return Rate: 15%

**Additional Information:**
- Average customer acquisition cost: ₹180
- Platform profit margin: 20% of purchase amount
- Customer churn rates: Young (25%), Middle-aged (15%), Senior (30%)


### Question 4 (15 points)
**Calculate and analyze:**

a) Which customer segment generates the highest **net profit per customer** (considering returns)? Show your calculations. (8 points)

b) Which segment has the **best customer lifetime value** considering churn rates? Provide reasoning. (7 points)


## **Question 4: Net Profit & Customer Lifetime Value Analysis**

### **Given Data**
- **Platform profit margin:** 20% of purchase amount  
- **Customer acquisition cost (CAC):** ₹180  
- **Return rates:**  
  - Young: 12%  
  - Middle-aged: 8%  
  - Senior: 15%  
- **Churn rates:**  
  - Young: 25%  
  - Middle-aged: 15%  
  - Senior: 30%  

### **Average Purchase Amounts**
- Young: ₹850  
- Middle-aged: ₹1200  
- Senior: ₹950  

---

# **a) Net Profit per Customer (after returns)**

### **Formula Used**
1. **Effective purchase amount after return adjustment:**  
   `Effective Amount = Purchase Amount × (1 − Return Rate)`

2. **Platform profit:**  
   `Profit = Effective Amount × 20%`

3. **Net Profit per Customer:**  
   `Net Profit = Profit − CAC`

---

## **1️⃣ Young Customers (18–30)**  
- Purchase Amount = 850  
- Return Rate = 12%  

**Effective Amount:**  
850 × (1 − 0.12) = **₹748**

**Platform Profit:**  
748 × 0.20 = **₹149.6**

**Net Profit:**  
149.6 − 180 = **₹−30.4**

➡️ **Young customers generate a loss of ₹30.4 per customer.**

---

## **2️⃣ Middle-aged Customers (31–50)**  
- Purchase Amount = 1200  
- Return Rate = 8%  

**Effective Amount:**  
1200 × (1 − 0.08) = **₹1104**

**Platform Profit:**  
1104 × 0.20 = **₹220.8**

**Net Profit:**  
220.8 − 180 = **₹40.8**

➡️ **Middle-aged customers generate a profit of ₹40.8 per customer.**

---

## **3️⃣ Senior Customers (51+)**  
- Purchase Amount = 950  
- Return Rate = 15%  

**Effective Amount:**  
950 × (1 − 0.15) = **₹807.5**

**Platform Profit:**  
807.5 × 0.20 = **₹161.5**

**Net Profit:**  
161.5 − 180 = **₹−18.5**

➡️ **Senior customers generate a loss of ₹18.5 per customer.**

---

# ✅ **Conclusion (Net Profit):**  
### **Middle-aged (31–50) customers generate the highest net profit: ₹40.8 per customer.**
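
The three calculations above can be reproduced with a few lines of Python. This sketch only encodes the formulas and scenario figures already stated; the function and variable names are illustrative.

```python
CAC = 180.0     # customer acquisition cost (₹), from the scenario
MARGIN = 0.20   # platform profit margin, from the scenario

# (average purchase amount in ₹, return rate) per segment.
segments = {
    "Young": (850.0, 0.12),
    "Middle-aged": (1200.0, 0.08),
    "Senior": (950.0, 0.15),
}

def net_profit(purchase_amount: float, return_rate: float) -> float:
    """Net profit per customer after returns and acquisition cost."""
    effective_amount = purchase_amount * (1 - return_rate)
    return effective_amount * MARGIN - CAC

net_profits = {name: round(net_profit(*vals), 2) for name, vals in segments.items()}
best_segment = max(net_profits, key=net_profits.get)
print(net_profits, best_segment)
```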

---

# **b) Customer Lifetime Value (CLV) Analysis**

### **Formula Used**
`CLV ≈ Net Profit per Customer / Churn Rate`

(Expected customer lifetime ≈ 1 / Churn Rate periods, so high churn means a short lifetime and low churn a long one.)

---

## **1️⃣ Young Customers**
- Net Profit = **−30.4**
- Churn Rate = **25%**  
CLV = (−30.4) / 0.25 = **−121.6**

---

## **2️⃣ Middle-aged Customers**
- Net Profit = **40.8**
- Churn Rate = **15%**  
CLV = 40.8 / 0.15 = **272**

---

## **3️⃣ Senior Customers**
- Net Profit = **−18.5**
- Churn Rate = **30%**  
CLV = (−18.5) / 0.30 = **−61.7**

---

# ✅ **Conclusion (Customer Lifetime Value):**  
### **Middle-aged customers have the BEST Customer Lifetime Value (CLV = ₹272)**  
because they:  
- Generate positive profit  
- Have the **lowest churn rate (15%)**  
- Have the highest average purchase amount (₹1,200)  

Thus, **Middle-aged customers are the most valuable segment for the business.**
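
The simplified CLV formula used in this answer can be checked the same way. The sketch below takes the net-profit figures from part (a) and the scenario's churn rates as inputs; the variable names are illustrative.

```python
# Net profit per customer (₹) from part (a) and churn rates from the scenario.
net_profit = {"Young": -30.4, "Middle-aged": 40.8, "Senior": -18.5}
churn_rate = {"Young": 0.25, "Middle-aged": 0.15, "Senior": 0.30}

# Simplified CLV: per-period net profit times expected lifetime (1 / churn).
clv = {seg: round(net_profit[seg] / churn_rate[seg], 1) for seg in net_profit}
expected_lifetime = {seg: round(1 / churn_rate[seg], 2) for seg in churn_rate}
best_segment = max(clv, key=clv.get)
print(clv, expected_lifetime, best_segment)
```

Because net profit already has CAC subtracted, this form charges the acquisition cost in every period. A common alternative subtracts CAC once, as `margin / churn − CAC`. That gives different magnitudes (about ₹1,292 for Middle-aged, ₹418 for Young, and ₹358 for Senior with the same inputs) but still ranks Middle-aged first.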

---


### Question 5 (10 points)
**Strategic Recommendations:** Based on your analysis, what would be your **top 2 marketing strategies** to maximize overall profitability? Consider customer acquisition, retention, and return rates.


## **Question 5 — Strategic Recommendations**

Based on the profit and CLV analysis, the following **top 2 marketing strategies** are recommended to maximize profitability:

---

### **Strategy 1: Focus on Middle-aged (31–50) Segment Through Personalized Retention Campaigns**
**Reasoning:**
- This segment generates the **highest net profit** per customer (+₹40.8).
- They have the **highest CLV (₹272)** and the **lowest churn rate (15%)**.
- Retaining them is likely to give the highest return on marketing spend.

**Action Ideas:**
- Personalized product recommendations  
- Loyalty rewards and exclusive discounts  
- Membership programs for long-term engagement  

---

### **Strategy 2: Reduce Return Rates Through Product Quality & Transparent Information**
**Reasoning:**
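- High return rates significantly reduce profit margins.
- Young (12%) and Senior (15%) return rates push both segments further below break-even. With zero returns, Senior would turn slightly positive (₹190 − ₹180 = ₹10), while Young would still lose ₹10 (₹170 − ₹180), so Young also needs a higher order value or a lower acquisition cost.
- Reducing returns directly increases retained revenue.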

**Action Ideas:**
- Improve product descriptions and images  
- Add customer reviews and size/fit guides  
- Provide quality checks before delivery  
- Use A/B testing to reduce return-causing issues  

---



### Question 6 (10 points)
**Churn Prevention:** You notice that customers who make purchases in the "Electronics" category have a 35% churn rate, while "Fashion" category customers have only 18% churn rate. What **specific data analysis** would you conduct using the available dataset columns to understand this difference, and what **action plan** would you recommend?


## **Question 6 — Churn Prevention Analysis & Action Plan**

Customers buying **Electronics** have a **35% churn rate**, while **Fashion** customers have only **18% churn**.  
To understand and address this, the following data analysis and action plan is recommended.

---

## **Data Analysis Plan**

1. **Analyze Return Rates by Category**
   - Hypothesis: Electronics have higher return rates.
   - Use: `df.groupby("Product Category")["Returns"].mean()`

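2. **Study Customer Complaints / Reviews**
   - Check if Electronics have quality or defect issues.
   - Use text analysis or review sentiment scoring. This needs review data, which is not among the listed columns.

3. **Compare Average Delivery Time**
   - Electronics may have slower shipping or installation delays.
   - Use: `df.groupby("Product Category")["DeliveryTime"].mean()`, where `DeliveryTime` is a hypothetical column that would have to be joined from logistics data.

4. **Analyze Price Sensitivity**
   - Electronics buyers may be more price-sensitive.
   - Compare discount usage and abandoned cart data if available; within this dataset, compare `Product Price` and `Quantity` for churned versus retained customers.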

5. **Analyze Purchase Frequency & Repeat Rates**
   - Electronics are typically bought less frequently than Fashion.
   - Use: purchases per customer in each category, computed from `Customer ID` and `Purchase Date`.
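
The parts of this plan that rely only on the listed dataset columns can be sketched in pandas as follows. The rows are invented for illustration, and the column names are assumed to match the dataset description.

```python
import pandas as pd

# Illustrative transactions; values are made up for the example.
df = pd.DataFrame({
    "Customer ID": [1, 1, 2, 3, 3, 3, 4, 5],
    "Product Category": ["Electronics", "Electronics", "Electronics", "Fashion",
                         "Fashion", "Fashion", "Fashion", "Electronics"],
    "Returns": [1, 0, 1, 0, 0, 1, 0, 0],
    "Churn": [1, 1, 1, 0, 0, 0, 0, 0],
})

# Return rate per category (share of transactions with Returns == 1).
return_rate = df.groupby("Product Category")["Returns"].mean()

# Churn rate per category, counting each customer once per category.
customers = df.drop_duplicates(["Customer ID", "Product Category"])
churn_rate = customers.groupby("Product Category")["Churn"].mean()

# Average number of purchases per customer within each category.
purchase_freq = (
    df.groupby(["Product Category", "Customer ID"]).size()
      .groupby(level="Product Category").mean()
)

print(return_rate, churn_rate, purchase_freq, sep="\n\n")
```

Deduplicating by customer before computing churn avoids over-weighting customers who bought many items, since `Churn` is a customer-level flag repeated on every transaction row.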

---

## **Action Plan**

### **1. Reduce Electronics Churn Through Better Post-purchase Support**
- Provide installation assistance and troubleshooting help.
- Offer faster, cheaper repair or replacement.
- Extend warranty and add guarantee options.

### **2. Improve Product Quality & Listing Accuracy**
- Add detailed specifications, comparison charts, and clearer photos.
- Increase pre-shipment quality checks.

### **3. Introduce Electronics-specific Loyalty & Protection Plans**
- Cashback for repeat electronics purchases.
- Exclusive membership for high-value electronics buyers.

### **4. Faster Delivery & Priority Support**
- Priority shipping for electronic devices.
- 24×7 customer support for electronics shoppers.

---

## **Conclusion**
Electronics churn is likely higher due to the following factors, which the analysis above should confirm:
- Higher return rates  
- Quality/defect issues  
- Low repeat purchase frequency  
- Higher customer expectations  

By improving support, quality, and experience, churn can be significantly reduced.

---


---
## **SECTION C: Research Methodology & Predictive Analytics (25 Points - 8 Minutes)**


### Scenario: Churn Prediction Model Development

Your company wants to build a machine learning model to predict customer churn using the available dataset.


### Question 7 (15 points)
**Model Development Plan:** Create a comprehensive approach including:

a) **Feature selection:** Which columns from the dataset would you use as features for the churn prediction model and why? (5 points)
b) **Data preprocessing steps:** What preprocessing would you apply to prepare the data? (5 points)
c) **Model evaluation metrics:** Which metrics would you use to evaluate model performance for this business problem? (5 points)


## **Question 7 — Model Development Plan**

Your company wants to build a machine learning model to predict customer churn.  
Below is the complete development plan covering feature selection, preprocessing, and evaluation.

---

## **a) Feature Selection (5 points)**

The following columns, and features derived from them, are candidate features because they are likely to be associated with customer churn:

1. **Customer Age**  
   - Impacts spending habits, loyalty, and buying frequency.

2. **Gender**  
   - Helps detect demographic churn patterns.

3. **Total Purchase Amount**  
   - Higher spending customers usually churn less.

4. **Purchase Frequency / Number of Orders**  
   - Low activity signals risk of churn.

5. **Return Rate / Number of Returns**  
   - High return rate often leads to dissatisfaction → churn.

6. **Product Category Purchased**  
   - Some categories (e.g., electronics) have higher churn.

7. **Payment Method**  
   - Helps capture convenience-related drop-offs.

8. **Engagement Metrics (e.g., recency)**  
   - More recently active customers are less likely to churn. Recency can be derived from Purchase Date; last-login data is not among the listed columns.

**Target Variable:**  
- **Churn (Yes/No)**
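
Several of these features (frequency, return rate, recency) are not raw columns and would have to be aggregated per customer from the transaction rows. A minimal pandas sketch, using invented rows and illustrative feature names:

```python
import pandas as pd

# Illustrative transactions; values are made up for the example.
tx = pd.DataFrame({
    "Customer ID": [1, 1, 2, 3, 3, 3],
    "Purchase Date": pd.to_datetime([
        "2023-01-05", "2023-03-10", "2023-02-01",
        "2023-01-20", "2023-02-15", "2023-03-28",
    ]),
    "Total Purchase Amount": [500, 700, 1200, 300, 250, 450],
    "Returns": [0, 1, 0, 0, 0, 1],
    "Churn": [0, 0, 1, 0, 0, 0],
})

# Reference date for recency: the day after the last purchase in the data
# (an assumption; a real project would use the analysis cut-off date).
snapshot = tx["Purchase Date"].max() + pd.Timedelta(days=1)

features = tx.groupby("Customer ID").agg(
    frequency=("Purchase Date", "count"),
    total_spend=("Total Purchase Amount", "sum"),
    return_rate=("Returns", "mean"),
    last_purchase=("Purchase Date", "max"),
    churn=("Churn", "max"),
)
features["recency_days"] = (snapshot - features["last_purchase"]).dt.days
features = features.drop(columns="last_purchase")
print(features)
```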

---

## **b) Data Preprocessing Steps (5 points)**

1. **Handle Missing Values**
   - Numerical: mean/median imputation  
   - Categorical: mode imputation

2. **Encoding Categorical Variables**
   - One-hot encoding for Gender, Product Category, Payment Method

3. **Scaling Numerical Features**
   - StandardScaler or MinMaxScaler for Age, Purchase Amount, Frequency

4. **Outlier Treatment**
   - Use IQR or z-score for high-spending or unrealistic age values

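5. **Train-Test Split**
   - 70% training, 30% testing (or 80/20 depending on dataset size)
   - Fit imputation and scaling on the training split only, then apply them to the test split, so test data does not leak into training statistics

6. **Class Imbalance Handling**
   - Use SMOTE (on the training split only) or class weights if churned customers are the minority class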
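
The imputation, encoding, scaling, and splitting steps can be sketched with pandas alone. The table below is invented, the helper `preprocess` is hypothetical, and outlier treatment and SMOTE are left out for brevity. In practice scikit-learn's `train_test_split` and `StandardScaler` would usually replace the manual versions shown here.

```python
import pandas as pd

# Illustrative customer-level table; values are made up for the example.
df = pd.DataFrame({
    "Customer Age": [25, None, 40, 58, 31, 47, 36, 62, 29, 44],
    "Gender": ["Male", "Female", None, "Male", "Female",
               "Female", "Male", "Female", "Male", "Female"],
    "Total Purchase Amount": [850.0, 1200.0, 950.0, 400.0, 1100.0,
                              900.0, 760.0, 1300.0, 640.0, 980.0],
    "Churn": [0, 0, 1, 1, 0, 0, 0, 1, 0, 0],
})

# Split first so imputation and scaling statistics come from training rows only.
train = df.sample(frac=0.7, random_state=42)
test = df.drop(train.index)

num_cols = ["Customer Age", "Total Purchase Amount"]
medians = train[num_cols].median()
gender_mode = train["Gender"].mode()[0]

# Scaling statistics are computed on the imputed training data.
train_filled = train[num_cols].fillna(medians)
mean, std = train_filled.mean(), train_filled.std()

def preprocess(part: pd.DataFrame) -> pd.DataFrame:
    """Apply training-set statistics to any split (hypothetical helper)."""
    out = part.copy()
    out[num_cols] = out[num_cols].fillna(medians)       # median imputation
    out["Gender"] = out["Gender"].fillna(gender_mode)   # mode imputation
    out[num_cols] = (out[num_cols] - mean) / std        # standard scaling
    return pd.get_dummies(out, columns=["Gender"])      # one-hot encoding

X_train = preprocess(train)
# Align test columns with training columns in case a category is absent.
X_test = preprocess(test).reindex(columns=X_train.columns, fill_value=0)
```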

---

## **c) Model Evaluation Metrics (5 points)**

To evaluate churn prediction models, use:

1. **Accuracy**  
   - Overall correctness of predictions; can be misleading when churners are a small minority.

2. **Precision**  
   - How many predicted churners are actually churners.

3. **Recall (Most Important)**  
   - Ability to identify customers who will churn.  
   - High recall = fewer missed churners.

4. **F1-Score**  
   - Best balance between precision and recall.

5. **ROC-AUC Score**  
   - Measures the model’s ability to distinguish churners from non-churners across all classification thresholds.
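
To make the first four metrics concrete, here is a small pure-Python sketch that computes them from binary labels. The labels are invented and the function name is illustrative; ROC-AUC is omitted because it needs predicted scores rather than hard labels.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = churned)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Illustrative labels: 4 churners out of 10 customers.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this example accuracy is 70% even though half of the actual churners are missed, which is why recall is singled out for churn problems.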



### Question 8 (10 points)
**Business Impact Analysis:** Identify 3 potential business challenges in implementing a churn prediction model and propose one **data-driven solution** for each challenge using insights from the customer behavior dataset.


## **Question 8 — Business Impact Analysis**

Identify 3 potential business challenges in implementing a churn prediction model and propose one data-driven solution for each.

---

## **Challenge 1: Data Quality Issues (Missing values, duplicates, noise)**
Poor quality data leads to poor predictions.

### **Data-driven Solution 1:**  
Implement automated data-cleaning pipelines:  
- Missing value imputation  
- Duplicate removal  
- Outlier detection  
- Validation rules during data entry  

This ensures the model learns from accurate and reliable data.

---

## **Challenge 2: Imbalanced Churn Classes**
Usually only a small percentage of customers churn, so a model can score high accuracy by predicting "retained" for everyone and still miss most churners.

### **Data-driven Solution 2:**  
Use class-imbalance techniques such as:  
- **SMOTE (Synthetic Minority Oversampling Technique)**  
- **Class weights** in algorithms  
- **Threshold tuning**  

This improves the model’s ability to correctly identify churners.
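
Two of these techniques, class weights and threshold tuning, are simple enough to sketch in pure Python. The labels and scores below are hypothetical. The weighting formula `n_samples / (n_classes * class_count)` is the commonly used "balanced" heuristic and is an assumption here, not something the answer specifies.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

def recall_at_threshold(y_true, scores, threshold):
    """Share of actual churners whose predicted score meets the threshold."""
    churner_scores = [s for t, s in zip(y_true, scores) if t == 1]
    return sum(s >= threshold for s in churner_scores) / len(churner_scores)

# Illustrative labels (20% churn) and hypothetical model scores.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.15, 0.30, 0.25, 0.40, 0.35, 0.45, 0.70]

weights = balanced_class_weights(y_true)
recall_default = recall_at_threshold(y_true, scores, 0.5)
recall_lowered = recall_at_threshold(y_true, scores, 0.4)
print(weights, recall_default, recall_lowered)
```

Lowering the threshold from 0.5 to 0.4 catches both churners in this toy data, but it also flags one retained customer (score 0.40), so the threshold should be chosen against the cost of a wasted retention offer.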



---
## **SECTION D: Professional Communication & Problem-Solving (15 Points - 2 Minutes)**


### Question 9 (8 points)
**Crisis Management:** While analyzing the dataset, you discover that 40% of customers who returned products (Returns = 1) also churned within the same month. However, your initial analysis showed returns don't strongly correlate with churn. As a team member, describe your immediate approach to investigate this discrepancy and communicate findings to stakeholders (60-80 words).


## **Question 9 — Crisis Management (60–80 words)**

To investigate the 40% churn among customers who returned products, I would first validate the data by checking return timestamps, churn dates, and potential data-entry inconsistencies. Next, I would segment customers by product category, return reason, and purchase frequency to identify hidden patterns. I would run correlation and cohort analyses to confirm whether returns indirectly impact churn. Finally, I would communicate findings to stakeholders with a clear explanation of the discrepancy and recommend next investigative steps.
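
One early check is whether the two findings actually conflict: a 40% churn rate among returners is compatible with a weak overall correlation if non-returners churn at a similar rate. The counts below are hypothetical, chosen only to illustrate that possibility.

```python
from math import sqrt

# Hypothetical 2x2 counts (not from the dataset), chosen so that
# 40% of returners churn while the baseline churn rate is also high.
returned_churned, returned_retained = 80, 120
kept_churned, kept_retained = 280, 520

churn_given_return = returned_churned / (returned_churned + returned_retained)
churn_given_no_return = kept_churned / (kept_churned + kept_retained)

# Phi coefficient: Pearson correlation between two binary variables.
a, b, c, d = returned_churned, returned_retained, kept_churned, kept_retained
phi = (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))

print(churn_given_return, churn_given_no_return, round(phi, 3))
```

With these numbers returners churn at 40% and non-returners at 35%, giving a phi coefficient of about 0.04, so comparing the two conditional rates on the real data is a sensible first step before reporting a discrepancy.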


### Question 10 (7 points)
**Leadership Scenario:** If selected as team lead for analyzing this e-commerce customer dataset, what would be your **top 3 priorities** to ensure effective team collaboration and delivery of actionable business insights?


## **Question 10 — Leadership Scenario**

If selected as the team lead for analyzing the e-commerce dataset, my top 3 priorities would be:

### **Priority 1: Establish a Clear Analysis Framework**
Define objectives, assign responsibilities, and set timelines to ensure the team has clarity and alignment.

### **Priority 2: Maintain Data Quality & Consistency**
Implement validation checks, ensure proper preprocessing, and standardize workflows so that insights are reliable.

### **Priority 3: Deliver Actionable Insights, Not Just Reports**
Focus on translating analytical results into practical recommendations for marketing, retention, and product teams to drive measurable business impact.



---
## **Self-Assessment Section**



### **Time Management Check**
**Did you complete all sections within 35 minutes?**  
Yes/No: **Yes**

### **Which section took the most time?**
**Section C (Research Methodology & Predictive Analytics)** took the most time because it required detailed planning, explanation of metrics, and structured reasoning.

### **Which section was most challenging?**
**Section B (Business Analysis & Customer Insights)** was the most challenging because it involved numerical calculations, interpretation, and linking data patterns to business decisions.

---

### **Confidence Level (1–10 scale)**  
**Section A (Data Understanding):** 9  
**Section B (Business Analysis):** 8  
**Section C (Research Methodology):** 8  
**Section D (Communication):** 9  

---

### **Additional Comments:**  
This assessment helped me strengthen my analytical thinking, structured communication, and model-building approach. It also improved my understanding of how data translates into real business decisions. Overall, it was a valuable learning experience.




---

## **Submission Instructions**
1. **Save this notebook** with the filename: `Ketan_Evoastra_Assessment.ipynb`
2. **Ensure all code cells have been executed** and outputs are visible.
3. **Double-check** that every section is complete.

---

## **Submission Confirmation**
I confirm that I have completed this assessment independently.  
All responses are my own original work.

**Digital Signature:** _Ketan_  
**Final Submission Time:** 30 mins



---
## **Evaluation Criteria**

**Scoring Breakdown:**
- **Section A (Data Understanding):** 25 points
- **Section B (Business Analysis):** 35 points  
- **Section C (Research Methodology):** 25 points
- **Section D (Communication):** 15 points
- **Total:** 100 points

**Team Selection Criteria:**
- **Team Lead:** Score ≥ 75 points + Strong Section D performance
- **Co-Lead:** Score ≥ 65 points + Good Section D performance
- **Team Member:** Successful completion of assessment

---

*Good luck with your assessment! Focus on clear reasoning, accurate calculations, and practical business applications.*