
### **Situation:**  
As the **item catalog size increased** and the number of concurrent requests grew (especially during ), we started experiencing **latency issues**, struggling to meet our **SLA of 70 milliseconds** for inference. Given the complexity of **graph computations and attention mechanisms**, inference time was a bottleneck, impacting real-time recommendations.  

---

### **Task:**  
My goal was to **optimize the training and inference pipeline** to ensure that the model could handle the growing catalog size while keeping the response time within the SLA.  

I had to:  
✅ Identify bottlenecks in the **model architecture, data pipelines, and compute efficiency**.  
✅ Improve inference speed without significantly compromising model accuracy.  
✅ Ensure the **deployment pipeline scaled efficiently** with increasing data volume. Specifically, when scaling the model to multiple nodes (horizontal scaling), performance did not scale linearly, leading to high request latencies and uneven load distribution. 

---

### **Action:**   

#### **1️⃣ Optimizing Model Architecture & Computation Efficiency** 
- **Quantized the Model**: Used **TorchScript with FP16 precision** to reduce model size and speed up inference.  
- **Optimized Attention Mechanism**: Replaced the standard **scaled dot-product attention** with an **approximate attention mechanism** (e.g., **performing key-query projection with lower-dimensional embeddings**). 
- **Simplified DGL Aggregation**: Switched the DGL `aggregation()` step to a pre-defined method (`mean()`) during inference instead of the user-defined GRU aggregation.
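
To make the aggregation change concrete, here is a minimal pure-Python sketch of what mean neighbor aggregation computes over an edge list. It is an illustration only and does not use the DGL API; `mean_aggregate` and the toy graph are hypothetical names invented for this example.

```python
from collections import defaultdict


def mean_aggregate(edges, features):
    """For each node, average the feature vectors of its in-neighbors.

    Nodes with no incoming edges keep their own features.
    """
    sums = {}
    counts = defaultdict(int)
    for src, dst in edges:
        vec = features[src]
        if dst not in sums:
            sums[dst] = [0.0] * len(vec)
        for i, value in enumerate(vec):
            sums[dst][i] += value
        counts[dst] += 1

    aggregated = {}
    for node, vec in features.items():
        if node in sums:
            aggregated[node] = [s / counts[node] for s in sums[node]]
        else:
            aggregated[node] = list(vec)
    return aggregated


# Hypothetical toy session graph: items 0 and 1 both point to item 2.
edges = [(0, 2), (1, 2)]
features = {0: [1.0, 2.0], 1: [3.0, 4.0], 2: [0.0, 0.0]}
aggregated = mean_aggregate(edges, features)
```

A mean reducer has no learned parameters and no sequential state, so it needs a single pass over the edges, whereas a GRU aggregator has to step through each node's neighbors in order. That is the likely source of the speedup, at some cost in expressiveness.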

#### **2️⃣ Enhancing Data Pipeline**  
- **Sparse Tensor Optimization**: Instead of storing large adjacency matrices, I utilized **sparse representations with PyTorch SparseTensors**, reducing computation overhead.  
- **Session History Caching**: Ensured that caching of guest session history was enabled at the inference layer.
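
As a rough illustration of the session-history caching idea, the sketch below keeps the most recent items per guest in a small in-memory LRU cache. The class name `SessionHistoryCache`, its parameters, and the eviction policy are assumptions made for this example; the actual caching layer is not described in these notes.

```python
from collections import OrderedDict


class SessionHistoryCache:
    """LRU cache mapping a guest ID to that guest's most recent item IDs."""

    def __init__(self, max_guests=10000, max_items=20):
        self.max_guests = max_guests  # evict least recently used guests beyond this
        self.max_items = max_items    # keep only the latest items per session
        self._store = OrderedDict()

    def append(self, guest_id, item_id):
        # Re-inserting moves the guest to the most recently used position.
        history = self._store.pop(guest_id, [])
        history.append(item_id)
        self._store[guest_id] = history[-self.max_items:]
        if len(self._store) > self.max_guests:
            self._store.popitem(last=False)  # drop the least recently used guest

    def get(self, guest_id):
        if guest_id not in self._store:
            return []
        self._store.move_to_end(guest_id)
        return list(self._store[guest_id])


cache = SessionHistoryCache(max_guests=2, max_items=3)
for item in ["a", "b", "c", "d"]:
    cache.append("guest-1", item)
recent = cache.get("guest-1")
```

With a cache like this, the model reads a guest's recent history from memory on each request instead of rebuilding it from an upstream store.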

---

## TODOs

#### **3️⃣ Improving Deployment & Hardware Utilization**  
- **ONNX Runtime Acceleration**: Converted the PyTorch model to **ONNX** and used **ONNX Runtime** for optimized inference.  
- **Parallelized Batch Processing**: Deployed the model with **TorchServe** and enabled **asynchronous batch inference**, reducing per-request compute time.  
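
The asynchronous batching idea can be sketched with the standard library alone. This is not TorchServe's implementation; `MicroBatcher`, `fake_model`, and the batch size and wait values are hypothetical, chosen only to show how concurrent requests can be grouped into one model call.

```python
import asyncio


class MicroBatcher:
    """Groups concurrent requests into batches before calling the model."""

    def __init__(self, batch_fn, max_batch_size=8, max_wait_s=0.005):
        self.batch_fn = batch_fn              # list of inputs -> list of outputs
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self.queue = asyncio.Queue()

    async def predict(self, item):
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((item, future))
        return await future

    async def run(self):
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]  # wait for the first request
            deadline = loop.time() + self.max_wait_s
            while len(batch) < self.max_batch_size:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            outputs = self.batch_fn([item for item, _ in batch])
            for (_, future), output in zip(batch, outputs):
                future.set_result(output)


async def demo():
    batch_sizes = []

    def fake_model(inputs):  # stand-in for a real batched forward pass
        batch_sizes.append(len(inputs))
        return [x * 2 for x in inputs]

    batcher = MicroBatcher(fake_model, max_batch_size=4, max_wait_s=0.05)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.predict(i) for i in range(4)))
    worker.cancel()
    return results, batch_sizes


results, batch_sizes = asyncio.run(demo())
```

The trade-off is that each request may wait up to `max_wait_s` for a batch to fill, so the wait budget has to be small relative to the 70ms SLA.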

---

### **Result:**  
📌 **Inference latency reduced by ~40%**, bringing the 95th-percentile (p95) response time down from **95ms → ~55ms** and meeting the SLA of 70ms.  
📌 **Model maintained >98% of original accuracy**, despite optimizations.     
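
The arithmetic behind the latency claim can be checked with a short sketch. The 95ms, 55ms, and 70ms figures come from the result above; the latency samples and both helper functions are hypothetical and only show one common way (nearest rank) to compute a p95.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Smallest sample such that at least pct percent of samples are <= it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def relative_reduction(before, after):
    return (before - after) / before


# Figures quoted in the result above.
reduction = relative_reduction(95.0, 55.0)  # about 0.42, i.e. roughly 40%
meets_sla = 55.0 <= 70.0

# Hypothetical latency samples in milliseconds, for illustration only.
samples = [40, 42, 45, 47, 48, 50, 51, 52, 53, 54,
           55, 55, 56, 57, 58, 59, 60, 61, 62, 90]
p95 = percentile_nearest_rank(samples, 95)
```

A percentile is a better SLA measure than the mean because a single slow request, like the 90ms outlier here, moves the mean but does not move the p95.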

---

Here’s an enhanced **STAR response** with more **technical depth** on how **TAC was applied to deal recommendations**:  

---

### **Situation:**  
At Target, I was leading **ML initiatives in the Item PRZ team**, focusing on **session-based recommendation models**. Our primary focus was on **personalizing product recommendations**, but I saw an opportunity to extend our TAC to **deal recommendations** in the **Deals PRZ team**.  



---

### **Situation:**  
As a **Lead Data Scientist** in Target’s **Item Personalization team**, I took the initiative to engage with my director and Principal Scientist to explore the potential of using a session-based recommendation model for a Deals use case. They encouraged me to connect with stakeholders on the Deals team to understand their specific needs and determine whether a session-based model could provide value to one of their placements.

We identified and implemented **two distinct use cases** to assess **TAC’s effectiveness**:  
1️⃣ **Comparing TAC-based recommendations (deals similar to the items a guest browsed, i.e., based on guest-level activity) against the existing Top Sellers ranking placement (overall top-selling items).**  
2️⃣ **Integrating TAC into PDP-related offer placement using a new microservice.**  

---

### **Task:**  
The key objectives were to:  
✅ **Assess TAC’s performance against the baseline Deals model (item similarity sort: find deals that are similar to the top-selling items).**  
✅ **Design and implement a separate TAC-Deals workflow and microservice.**  
✅ **Analyze experimental results and provide recommendations on next steps.**  

---

### **Action:**  

#### **1️⃣ Use Case 1: TAC vs. Top Sellers (A/B Testing)**  
- Conducted an **A/B test comparing TAC’s recommendations** to the baseline Deals model (**item similarity sort**).  
- **Optimized TAC embeddings** to incorporate deal-specific embeddings, so that TAC item recommendations could be mapped to similar deals.  
- Deployed the **TAC-atc model on a subset of users** and tracked **conversion rate, engagement rate, and statistical significance**.  

🔹 **Technical Adjustments:**  
- **Normalized TAC’s attention weights** to avoid over-prioritization of long-session interactions.  
- **Used Approximate Nearest Neighbors (ANN)** with **FAISS indexing** to improve retrieval speed for large-scale inference.  
- **Monitored latency** using **Grafana dashboards**, ensuring inference remained within the **70ms SLA**.  
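
For context on the retrieval step, the sketch below shows the exact brute-force nearest-neighbor search that an ANN index such as FAISS approximates at scale. It deliberately does not call FAISS; the function names, the deal IDs, and the two-dimensional embeddings are made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_deals(query, deal_embeddings, k=2):
    """Return the k deal IDs whose embeddings are most similar to the query."""
    scored = [(cosine(query, vec), deal_id) for deal_id, vec in deal_embeddings.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [deal_id for _, deal_id in scored[:k]]


# Hypothetical embeddings: map a recommended item's vector to similar deals.
deal_embeddings = {
    "deal_a": [1.0, 0.0],
    "deal_b": [0.7, 0.7],
    "deal_c": [0.0, 1.0],
}
nearest = top_k_deals([0.9, 0.1], deal_embeddings, k=2)
```

The exact search scores every deal for every query, which grows linearly with catalog size. An ANN index trades a small amount of recall for much faster lookups, which matters under a 70ms budget.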

📌 **Result:** TAC achieved a **7.12% improvement in conversion rate**, but the results were **not statistically significant** compared to the baseline.  
📌 **Outcome:** The Deals team decided **not to productionize this use case**. (Can use this for 'Explain a time you failed')
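
A two-proportion z-test shows how a lift of about 7% can still fail to reach significance. The visitor and conversion counts below are invented so that the relative lift comes out near 7.12%; they are not the experiment's data, and the test actually used for the analysis is not stated in these notes.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return z, p_value


# Hypothetical counts: baseline converts 295 of 10,000, TAC converts 316 of 10,000.
control_conv, control_n = 295, 10000
tac_conv, tac_n = 316, 10000

lift = (tac_conv / tac_n) / (control_conv / control_n) - 1
z, p_value = two_proportion_z_test(control_conv, control_n, tac_conv, tac_n)
significant = p_value < 0.05
```

With these assumed counts the p-value is well above 0.05, so the observed lift could plausibly be noise. A larger sample or a longer test window would be needed to confirm an effect of this size.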

---

#### **2️⃣ Use Case 2: TAC-Deals Integration for PDP Offer Placement**  
- Designed and implemented a **new training workflow and microservice (TAC-Deals)** to test **session-based deal recommendations** on **Product Detail Pages (PDPs)**.  
- Experimented with **three variations**:  
  - **Control**: Existing item-related offers.  
  - **V1**: TAC-Deals model.  
  - **V2**: Hybrid approach (item-related offers + offer similarity sort).  
- Integrated the TAC-Deals model into **Target’s real-time inference pipeline**, ensuring it could handle **high-traffic PDP requests** with **low-latency inference**.  
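
One common way to split traffic across three variations is deterministic hashing of the guest ID, sketched below. How traffic was actually assigned in this experiment is not described here, so the function `assign_variant`, the experiment key, and the equal three-way split are all assumptions.

```python
import hashlib

VARIANTS = ("control", "v1", "v2")


def assign_variant(guest_id, experiment="tac-deals-pdp"):
    """Map a guest to a variant so the same guest always gets the same experience."""
    digest = hashlib.sha256(f"{experiment}:{guest_id}".encode("utf-8")).hexdigest()
    return VARIANTS[int(digest, 16) % len(VARIANTS)]


# Count assignments over hypothetical guest IDs to check the split is roughly even.
counts = {variant: 0 for variant in VARIANTS}
for i in range(3000):
    counts[assign_variant(f"guest-{i}")] += 1
```

Hashing on the experiment name plus the guest ID keeps assignments stable across requests without storing any state, and keeps this experiment's buckets independent of other experiments.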

🔹 **Engineering Enhancements:**    
- **Implemented model quantization** via **ONNX Runtime** to reduce inference latency.  
- **Parallelized model inference** across GPUs using **TorchServe with Kubernetes auto-scaling**.  
- **Logged real-time user interactions** via **Kafka streaming**, feeding insights back into model retraining.  

📌 **Result:** The **Control group outperformed** both TAC-based variations (**V1 & V2**) in terms of **interaction rate and display-to-conversion rate**.  
📌 **Outcome:** Highlighted **opportunities for further refinement**, including:  
   - **Re-weighting short-term user intent in TAC’s attention mechanism.**  
   - **Exploring multi-task learning to better balance deal conversions and engagement.**  

---

### **Result & Learnings:**  
✅ **Demonstrated TAC’s adaptability** for cross-team use cases, even though **initial results were inconclusive**.  
✅ **Provided actionable insights** for refining TAC in future deal applications.  
✅ **Established a scalable TAC-Deals microservice**, allowing future optimizations.  
✅ **Validated experimental rigor** with well-structured A/B testing and real-world deployment.  

---
