
# 🛠️ How We Created Synthetic Customer Feedback Using GPT for Churn Prediction

*Author: Beata Faron  
Date: 2025-07-14*

In this notebook, we explain how we generated **realistic customer feedback** for a Telco churn dataset using **GPT-based language models**.  
This synthetic data lets us simulate scenarios where a company has access to customer comments, without violating privacy or relying on incomplete text fields.

We focus on:
- Generating feedback from structured data (e.g., churn status, contract type),
- Using prompt engineering,
- Applying lightweight heuristics to filter poor or repetitive answers.


### Imports

In [None]:
import pandas as pd
import time



## Step 0. Set up OpenAI Access

To generate synthetic customer feedback using GPT, you’ll need access to the OpenAI API.

#### Follow these steps:

1. **Create an OpenAI account**
   Visit: [https://platform.openai.com/signup](https://platform.openai.com/signup)
   Sign up using your email or GitHub/Google account.

2. **Navigate to the API section**
   After logging in, go to: [https://platform.openai.com/account/api-keys](https://platform.openai.com/account/api-keys)

3. **Generate a new secret API key**

   * Click on `+ Create new secret key`
   * Copy the generated token and store it **securely** – you won’t be able to see it again later.

     
     

4. **Install the OpenAI Python package** (if not already installed)


In [None]:
%pip install openai

5. **Set your API key in your notebook or environment**

In [None]:
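import os

# Placeholder only; the client created later in this notebook reads the key from this variable
os.environ["OPENAI_API_KEY"] = "your-secret-api-key"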
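
Since the key should be stored securely, it is usually safer to set it outside the notebook and only read it here. A minimal sketch, assuming the key lives in an environment variable named `OPENAI_API_KEY`; the helper name `load_api_key` is hypothetical and not part of the OpenAI package:

```python
import os


def load_api_key(var_name="OPENAI_API_KEY"):
    """Return the API key from the environment, or fail with a clear message."""
    # Hypothetical helper: reads the variable and strips accidental whitespace
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {var_name} environment variable before running this notebook."
        )
    return key
```

Failing early like this avoids running the whole generation loop only to collect authentication errors.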

## Step 1. Preview of the Telco Dataset

In [None]:
df = pd.read_csv("/kaggle/input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv")
# Show a few rows
df.sample(5, random_state=42)


## Step 2. GPT Prompt Design

We designed a prompt that generates personalized feedback based on key customer attributes.  
Example prompt (used with `text-davinci-003` or `gpt-3.5-turbo`):

```
You are a customer of a telecom company. 
Your contract is {Contract}, you pay {MonthlyCharges}$ per month, and your internet service is {InternetService}. 
You {Churned} from the company. 
Please write a realistic reason (2-3 sentences) why you decided to leave or stay.
```

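A prompt of this kind was generated dynamically for each row in the dataset. The function below builds a more compact, profile-style variant from the same attributes, plus tenure and payment method.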


In [None]:
# Function to construct a GPT prompt from structured customer data
def create_prompt(row):
    return (
        f"Write a realistic customer feedback based on this profile:\n"
        f"Churn: {row['Churn']}\n"
        f"Tenure: {row['tenure']} months\n"
        f"Contract type: {row['Contract']}\n"
        f"Monthly Charges: ${row['MonthlyCharges']}\n"
        f"Internet Service: {row['InternetService']}\n"
        f"Payment Method: {row['PaymentMethod']}"
    )
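
For comparison, the narrative template from the start of this step can also be filled in row by row. This is only a sketch: the function name `create_template_prompt` is hypothetical, and the wording used for `{Churned}` ("churned" / "did not churn") is an assumption based on the `Churn` column holding the strings `Yes` and `No`.

```python
def create_template_prompt(row):
    """Fill the narrative prompt template with values from one customer row."""
    # Assumption: Churn is "Yes" or "No"; the phrasing for {Churned} is our own choice
    churned = "churned" if row["Churn"] == "Yes" else "did not churn"
    return (
        "You are a customer of a telecom company. "
        f"Your contract is {row['Contract']}, you pay {row['MonthlyCharges']}$ per month, "
        f"and your internet service is {row['InternetService']}. "
        f"You {churned} from the company. "
        "Please write a realistic reason (2-3 sentences) why you decided to leave or stay."
    )


# Illustrative row with made-up values, not taken from the dataset
example_row = {
    "Churn": "Yes",
    "Contract": "Month-to-month",
    "MonthlyCharges": 70.35,
    "InternetService": "Fiber optic",
}
example_prompt = create_template_prompt(example_row)
print(example_prompt)
```

Because the function only indexes by column name, it works with a plain dict as well as with a pandas row passed through `df.apply(..., axis=1)`.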


## ➜ Generate feedback sample

In [None]:
# Initialize OpenAI client (use your own API key or secret manager)
import os
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

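# Work on a small random sample first (assumed size; increase it once the output looks right)
sample_df = df.sample(5, random_state=42).copy()

# Store generated feedback here
feedbacks = []

# Loop over each customer row and generate GPT response
for idx, row in sample_df.iterrows():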
    prompt = create_prompt(row)
    
    try:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7,
            max_tokens=150
        )
        feedback = response.choices[0].message.content.strip()
    except Exception as e:
        feedback = f"ERROR: {str(e)}"
    
    feedbacks.append(feedback)
    time.sleep(1.5)  # Sleep to respect rate limits
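
A fixed sleep keeps the request rate low, but a transient failure still ends up as an `ERROR:` string. One possible refinement is to retry with exponential backoff. The helper below is a sketch, not part of the original notebook: `call_with_retries` and its parameters are hypothetical, and it takes any zero-argument callable so it can be tried without an API key.

```python
import time


def call_with_retries(make_request, max_attempts=3, base_delay=1.5, sleep=time.sleep):
    """Call make_request(); after a failure wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_attempts):
        try:
            return make_request()
        except Exception:
            # Broad on purpose for this sketch; re-raise once the attempts are used up
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

In the loop above, the API call could be wrapped as `call_with_retries(lambda: client.chat.completions.create(...))`, keeping the existing `try`/`except` as the last line of defense.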


| Parameter     | Description                                                 | Example Value                             | Effect                                     |
| ------------- | ----------------------------------------------------------- | ----------------------------------------- | ------------------------------------------ |
| `model`       | GPT model version to use                                    | `"gpt-3.5-turbo"`                         | Fast, cost-efficient, conversational       |
| `messages`    | List of message objects simulating a chat (role + content)  | `[{"role": "user", "content": prompt}]`   | Simulates a user asking a question         |
| `temperature` | Controls randomness and creativity of output                | `0.2` to `1.0`                            | Higher = more variety, lower = more stable |
| `max_tokens`  | Maximum length of the generated text (tokens ≈ words × 1.3) | `150`                                     | Higher values allow longer answers         |


| Temperature | Example Feedback                                                                   |
| ----------- | ---------------------------------------------------------------------------------- |
| `0.2`       | *"I left because the bill increased without notice."*                              |
| `0.7`       | *"After months of hidden fees and poor customer service, I finally had enough."*   |
| `0.95`      | *"Boom! My Wi-Fi exploded and took my patience with it. Time to cut the cord!"* 😅 |


| max\_tokens | Approximate Output                       |
| ----------- | ---------------------------------------- |
| `50`        | 1–2 short sentences                      |
| `100`       | 2–4 medium sentences                     |
| `150`       | Up to a paragraph (default in our setup) |
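
The rule of thumb from the parameter table (tokens ≈ words × 1.3) can be turned into a rough word budget for each `max_tokens` setting. This is only an approximation: real token counts depend on the tokenizer and the text, and both function names are hypothetical.

```python
import math

TOKENS_PER_WORD = 1.3  # rule of thumb, not an exact tokenizer count


def estimate_tokens(word_count, tokens_per_word=TOKENS_PER_WORD):
    """Approximate number of tokens needed for a text of word_count words."""
    # Round first so floating-point noise cannot push the result up by one
    return math.ceil(round(word_count * tokens_per_word, 6))


def max_words_for(max_tokens, tokens_per_word=TOKENS_PER_WORD):
    """Approximate number of words that fit into a max_tokens budget."""
    return math.floor(max_tokens / tokens_per_word)


word_budgets = {limit: max_words_for(limit) for limit in (50, 100, 150)}
print(word_budgets)  # {50: 38, 100: 76, 150: 115}
```

Under this estimate, a 2-3 sentence answer of about 60 words fits comfortably within `max_tokens=150`.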


## ➜ Append feedback to the sample

In [None]:
# Add both prompt and feedback to the sample dataframe
sample_df["PromptInput"] = sample_df.apply(create_prompt, axis=1)
sample_df["CustomerFeedback"] = feedbacks

# Display results
sample_df[["customerID", "PromptInput", "CustomerFeedback"]].head()
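
Because the generation loop stores failed calls as strings starting with `ERROR:`, those rows are worth separating before any analysis. A small sketch in plain Python; the name `split_failed` is hypothetical:

```python
def split_failed(feedbacks, error_prefix="ERROR:"):
    """Return (ok_positions, failed_positions) for a list of generated texts."""
    ok_positions, failed_positions = [], []
    for position, text in enumerate(feedbacks):
        if text.startswith(error_prefix):
            failed_positions.append(position)
        else:
            ok_positions.append(position)
    return ok_positions, failed_positions
```

With the positions in hand, `sample_df.iloc[ok_positions]` keeps only the rows with usable feedback, and the failed positions can be retried later.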


## Step 3. Quality Filtering Heuristics

We used a mix of simple, efficient heuristics to clean the generated feedback:
- **Semantic similarity check** using embeddings to detect near-duplicates
- **Regex filters** to remove overly generic or broken outputs
- **Manual spot-checking** for the first batches

This ensured the final dataset was diverse and believable.
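
The first two heuristics can be sketched as follows. This is an illustration under stated assumptions, not the exact filters we ran: the regex patterns, the `min_words` limit, and the `0.95` threshold are hypothetical, and the embeddings are assumed to be precomputed by whatever embedding model you use and passed in as lists of floats.

```python
import math
import re

# Hypothetical patterns for broken or generic outputs; tune them on your own data
BAD_OUTPUT_PATTERNS = [
    re.compile(r"^ERROR:"),                      # failed API calls stored by the loop
    re.compile(r"\bas an ai\b", re.IGNORECASE),  # the model stepped out of the customer role
]


def is_low_quality(text, min_words=5):
    """Flag very short texts and texts matching any bad-output pattern."""
    if len(text.split()) < min_words:
        return True
    return any(pattern.search(text) for pattern in BAD_OUTPUT_PATTERNS)


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def drop_near_duplicates(texts, embeddings, threshold=0.95):
    """Keep a text only if it is not too similar to any text kept before it."""
    kept = []
    for i, embedding in enumerate(embeddings):
        if all(cosine_similarity(embedding, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return [texts[i] for i in kept]
```

The pairwise comparison is quadratic in the number of texts, which is acceptable for small batches; larger runs would need a vectorized or approximate nearest-neighbor approach.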


# Supporting Files



➡️ The final dataset is available here: [Telco Customer Churn + Realistic Customer Feedback](https://www.kaggle.com/datasets/beatafaron/telco-customer-churn-realistic-customer-feedback) <br>
➡️ Modeling notebook (using this data): [Exploring Customer Churn & GPT-generated Feedback](https://www.kaggle.com/code/beatafaron/eda-customer-churn-gpt-generated-feedback)


# Lesson learned
Note: Later analysis showed that some GPT responses reflected the churn label too directly. To reduce this label leakage, we truncated the feedback to remove overly revealing opening phrases and introduced realistic noise. This is discussed in a follow-up notebook.
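
To make the truncation idea concrete, here is a rough sketch that drops an opening sentence when it starts with a phrase that gives away the label. The list of openers and the function name `strip_revealing_opener` are hypothetical; the follow-up notebook describes the approach actually used, and the noise step is not covered here.

```python
import re

# Hypothetical openers that reveal the churn label in the first sentence
REVEALING_OPENER = re.compile(
    r"^i (left|cancelled|canceled|switched|stayed|decided to (leave|stay))\b[^.!?]*[.!?]\s*",
    re.IGNORECASE,
)


def strip_revealing_opener(text):
    """Remove a label-revealing first sentence, unless nothing would remain."""
    stripped = REVEALING_OPENER.sub("", text, count=1).strip()
    return stripped if stripped else text
```

Single-sentence feedback is left unchanged so that no row ends up with empty text.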

➡️ Real noise simulation: [Feedback Noise Simulation & Fallback Testing](https://www.kaggle.com/code/beatafaron/feedback-noise-simulation-fallback-testing)
