# 📊 MGMT 467 - Unit 2 Lab 2: Churn Modeling with BigQuery ML + Feature Engineering
**Date:** 2025-10-16

In this lab you will:
- Connect to BigQuery from Colab
- Create features and labels
- Engineer new features from user behavior
- Train and evaluate logistic regression models
- Reflect on modeling assumptions and interpret results

In [3]:
# ✅ Authenticate and set up GCP project
from google.colab import auth
auth.authenticate_user()

project_id = "mgmt467-project1"  # <-- Replace with your actual project ID
!gcloud config set project $project_id

Updated property [core/project].


In [4]:
# ✅ Verify BigQuery access
%%bigquery --project $project_id
SELECT CURRENT_DATE() AS today, SESSION_USER() AS user


| today | user |
|---|---|
| 2025-10-27 | Sanjanamohan2003@gmail.com |


In [5]:
%%bigquery --project $project_id
CREATE OR REPLACE TABLE `churn_dataset.churn_features` AS
WITH users AS (
  SELECT
    GENERATE_UUID() AS user_id,
    -- Randomly assign region
    CASE
      WHEN RAND() < 0.3 THEN 'North America'
      WHEN RAND() < 0.6 THEN 'Europe'
      WHEN RAND() < 0.8 THEN 'Asia'
      ELSE 'Latin America'
    END AS region,
    -- Randomly assign plan tier
    CASE
      WHEN RAND() < 0.4 THEN 'Basic'
      WHEN RAND() < 0.8 THEN 'Standard'
      ELSE 'Premium'
    END AS plan_tier,
    -- Randomly assign age band
    CASE
      WHEN RAND() < 0.25 THEN '18-25'
      WHEN RAND() < 0.55 THEN '26-35'
      WHEN RAND() < 0.8 THEN '36-50'
      ELSE '50+'
    END AS age_band,
    -- Average rating between 1 and 5
    ROUND(1 + 4 * RAND(), 2) AS avg_rating,
    -- Total minutes watched between 100 and 10,000
    CAST(100 + RAND() * 9900 AS INT64) AS total_minutes,
    -- Average progress (completion % of shows)
    ROUND(0.3 + RAND() * 0.7, 2) AS avg_progress,
    -- Number of sessions between 1 and 200
    CAST(1 + RAND() * 199 AS INT64) AS num_sessions,
    -- Churn label (1 = churned, 0 = active)
    CASE WHEN RAND() < 0.2 THEN 1 ELSE 0 END AS churn_label
  FROM UNNEST(GENERATE_ARRAY(1, 1000)) AS _
)
SELECT * FROM users;
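
One subtlety in the query above: each `WHEN RAND() < ...` clause calls `RAND()` again, so every branch compares against a fresh random number rather than one shared draw. The resulting category shares therefore differ from what the cutoffs suggest when read as cumulative thresholds. The sketch below works out both readings for the region cutoffs; the function names are illustrative and not part of the lab.

```python
def sequential_case_shares(cutoffs):
    """Share of rows landing in each CASE branch when every WHEN
    draws its own independent uniform random number.

    `cutoffs` are the WHEN thresholds in order; the last returned
    share belongs to the ELSE branch.
    """
    shares = []
    remaining = 1.0  # probability of reaching the current WHEN
    for cutoff in cutoffs:
        shares.append(remaining * cutoff)
        remaining *= 1.0 - cutoff
    shares.append(remaining)  # ELSE branch
    return shares


def single_draw_shares(cutoffs):
    """Shares when one random draw is compared against cumulative cutoffs."""
    shares = []
    previous = 0.0
    for cutoff in cutoffs:
        shares.append(cutoff - previous)
        previous = cutoff
    shares.append(1.0 - previous)  # ELSE branch
    return shares


region_cutoffs = [0.3, 0.6, 0.8]  # thresholds from the region CASE above
labels = ["North America", "Europe", "Asia", "Latin America"]

for name, seq, single in zip(labels,
                             sequential_case_shares(region_cutoffs),
                             single_draw_shares(region_cutoffs)):
    print(f"{name:14s} independent draws: {seq:.3f}   single draw: {single:.3f}")
```

If the cumulative reading was the intent, one option would be to draw `RAND()` once per row in a subquery and reuse that value in every `WHEN`.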



In [12]:
%%bigquery --project $project_id
CREATE OR REPLACE MODEL `churn_dataset.churn_model`
OPTIONS(
  model_type = 'logistic_reg',
  input_label_cols = ['churn_label']
) AS
SELECT
  region,
  plan_tier,
  age_band,
  avg_rating,
  total_minutes,
  avg_progress,
  num_sessions,
  churn_label
FROM `churn_dataset.churn_features`;


In [10]:
# ✅ Prepare base churn features
%%bigquery --project $project_id
CREATE OR REPLACE TABLE `churn_dataset.churn_features` AS
SELECT
  user_id,
  region,
  plan_tier,
  age_band,
  avg_rating,
  total_minutes,
  avg_progress,
  num_sessions,
  churn_label
FROM `churn_dataset.churn_features`
WHERE churn_label IS NOT NULL;


In [15]:
# ✅ Evaluate base model
%%bigquery --project $project_id
SELECT *
FROM ML.EVALUATE(MODEL `churn_dataset.churn_model`);


| precision | recall | accuracy | f1_score | log_loss | roc_auc |
|---|---|---|---|---|---|
| 0.0 | 0.0 | 0.815 | 0.0 | 0.47717 | 0.595034 |
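
A precision and recall of 0.0 next to an accuracy of 0.815 is the pattern one would expect if the model predicted 0 (no churn) for every evaluation row: accuracy then equals the share of non-churners. The sketch below reproduces that pattern from confusion-matrix counts. The counts (200 rows, 37 churners) are hypothetical, chosen only because they give an accuracy of 0.815; the actual evaluation split is not shown in the notebook.

```python
def binary_metrics(tp, fp, tn, fn):
    """Precision, recall, accuracy and F1 from confusion-matrix counts.

    Precision, recall and F1 are returned as 0.0 when their denominator
    is zero (an assumption made here to mirror the 0.0 values in the
    evaluation output).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1_score": f1}


# Hypothetical split: every row predicted 0, so there are no true or
# false positives; 163 non-churners and 37 churners are assumed counts.
metrics = binary_metrics(tp=0, fp=0, tn=163, fn=37)
print(metrics)
```

Read this way, the high accuracy says little about model quality; it mostly reflects how rare the positive class is.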


In [17]:
# ✅ Predict churn with base model
%%bigquery --project $project_id
SELECT
  user_id,
  predicted_churn_label,
  predicted_churn_label_probs
FROM ML.PREDICT(MODEL `churn_dataset.churn_model`,
                (SELECT * FROM `churn_dataset.churn_features`));


| | user_id | predicted_churn_label | predicted_churn_label_probs |
|---|---|---|---|
| 0 | 2d9dfa4a-61ae-4b84-bffb-44b9a0c1d8c5 | 0 | `[{'label': 1, 'prob': 0.2016355634452711}, {'l...` |
| 1 | 9dfeb033-f850-4fb7-bcef-d81d2ddfc163 | 0 | `[{'label': 1, 'prob': 0.19949802563790314}, {'...` |
| 2 | abe16788-d680-4202-8812-cde858f92d32 | 0 | `[{'label': 1, 'prob': 0.2172049680871126}, {'l...` |
| 3 | 5e1658b0-9d48-445d-b187-a443cd46a3b8 | 0 | `[{'label': 1, 'prob': 0.21442743210957343}, {'...` |
| 4 | 65dec7d4-08ba-483d-b1ba-3d5c971d7f5c | 0 | `[{'label': 1, 'prob': 0.2072716230241932}, {'l...` |
| ... | ... | ... | ... |
| 995 | f27e1593-1f45-4860-8a55-f7f898ca0c47 | 0 | `[{'label': 1, 'prob': 0.20025922163754623}, {'...` |
| 996 | 79662716-4a3d-4c15-9603-8e9098605e62 | 0 | `[{'label': 1, 'prob': 0.2075950352428651}, {'l...` |
| 997 | 3164a993-35e8-4c0d-9876-2e3f38b15825 | 0 | `[{'label': 1, 'prob': 0.21168105301273132}, {'...` |
| 998 | 18f5dbae-2d5c-4149-b297-9dd821fe6afe | 0 | `[{'label': 1, 'prob': 0.2177731338234437}, {'l...` |
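
Every visible row has a churn probability near 0.20 and a predicted label of 0, which is what a 0.5 decision cutoff would produce. The sketch below shows how the label-1 probability can be pulled out of a `predicted_churn_label_probs` value and compared against a different cutoff. The helper names and the 0.2 cutoff are illustrative assumptions, not part of the lab.

```python
def churn_probability(label_probs, positive_label=1):
    """Return the probability attached to `positive_label`.

    `label_probs` is a list of {'label': ..., 'prob': ...} dicts, the
    shape shown in the predicted_churn_label_probs column.
    """
    for entry in label_probs:
        if entry["label"] == positive_label:
            return entry["prob"]
    raise ValueError(f"label {positive_label!r} not found")


def classify(label_probs, threshold=0.5):
    """Predict 1 when the churn probability reaches `threshold`, else 0."""
    return 1 if churn_probability(label_probs) >= threshold else 0


# First three rows of the prediction output. Only the label-1 entry is
# visible there, so the truncated label-0 entry is left out.
rows = [
    [{"label": 1, "prob": 0.2016355634452711}],
    [{"label": 1, "prob": 0.19949802563790314}],
    [{"label": 1, "prob": 0.2172049680871126}],
]

for probs in rows:
    p = churn_probability(probs)
    print(f"p(churn)={p:.4f}  at 0.5: {classify(probs)}  "
          f"at 0.2: {classify(probs, threshold=0.2)}")
```

Lowering the cutoff trades false negatives for false positives; on data where the probabilities barely vary, it mainly decides whether nearly everyone or nearly no one is flagged.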



## 🛠️ Feature Engineering Section

We will now engineer new features to improve model performance:

- Bucket continuous variables
- Create interaction terms
- Add behavioral flags


In [18]:

# ✅ Create enhanced feature set
%%bigquery --project $project_id
CREATE OR REPLACE TABLE `churn_dataset.churn_features_enhanced` AS
SELECT
  user_id,
  region,
  plan_tier,
  age_band,
  avg_rating,
  total_minutes,
  CASE
    WHEN total_minutes < 100 THEN 'low'
    WHEN total_minutes BETWEEN 100 AND 300 THEN 'medium'
    ELSE 'high'
  END AS watch_time_bucket,
  avg_progress,
  num_sessions,
  CONCAT(plan_tier, '_', region) AS plan_region_combo,
  IF(total_minutes > 500, 1, 0) AS flag_binge,
  churn_label
FROM `churn_dataset.churn_features`;



In [19]:

# ✅ Train enhanced model
%%bigquery --project $project_id
CREATE OR REPLACE MODEL `churn_dataset.churn_model_enhanced`
OPTIONS(model_type='logistic_reg') AS
SELECT
  region,
  plan_tier,
  age_band,
  watch_time_bucket,
  avg_rating,
  avg_progress,
  num_sessions,
  plan_region_combo,
  flag_binge,
  churn_label
FROM `churn_dataset.churn_features_enhanced`;


Executing query with job ID: 95a666e7-dbf7-4aec-a4c5-a615e7ed03bf
Query executing: 0.49s


ERROR:
 400 GET https://bigquery.googleapis.com/bigquery/v2/projects/mgmt467-project1/queries/95a666e7-dbf7-4aec-a4c5-a615e7ed03bf?maxResults=0&location=US&prettyPrint=false: Missing 'label' column in query statement. Update OPTIONS(input_label_cols=['your_label_col']) to indicate the correct label column name.

Location: US
Job ID: 95a666e7-dbf7-4aec-a4c5-a615e7ed03bf



In [22]:
# ✅ Train enhanced model (with input_label_cols set)
%%bigquery --project $project_id
CREATE OR REPLACE MODEL `churn_dataset.churn_model_enhanced`
OPTIONS(
  model_type = 'logistic_reg',
  input_label_cols = ['churn_label']
) AS
SELECT
  region,
  plan_tier,
  age_band,
  watch_time_bucket,
  avg_rating,
  avg_progress,
  num_sessions,
  plan_region_combo,
  flag_binge,
  churn_label
FROM `churn_dataset.churn_features_enhanced`;



In [23]:

# ✅ Evaluate enhanced model
%%bigquery --project $project_id
SELECT *
FROM ML.EVALUATE(MODEL `churn_dataset.churn_model_enhanced`);



| precision | recall | accuracy | f1_score | log_loss | roc_auc |
|---|---|---|---|---|---|
| 0.0 | 0.0 | 0.781395 | 0.0 | 0.518217 | 0.576677 |



## 🤔 Chain-of-Thought Prompts: Feature Engineering

### 1. Why bucket continuous values like watch time?
- What patterns become clearer by using categories like "low", "medium", "high"?

*Bucketing a continuous variable such as `total_minutes` into discrete categories like "low", "medium", and "high" lets a logistic regression fit a separate effect for each range instead of a single slope, so it can capture non-linear or threshold patterns that a linear term would miss. It also makes the model less sensitive to extreme values, since an outlier simply falls into the top bucket. The trade-off is that variation within a bucket is discarded, so the cutoffs need to match how the data is actually distributed.*
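
Whether bucketing helps depends on where the cutoffs sit relative to the data. In this lab `total_minutes` is generated between 100 and 10,000, while the buckets split at 100 and 300 and `flag_binge` splits at 500. The sketch below estimates how many users land in each group, treating `total_minutes` as continuous and uniform on that range. That is an approximation that ignores integer rounding, and `uniform_share` is an illustrative helper.

```python
def uniform_share(low, high, start=100.0, stop=10_000.0):
    """Share of a Uniform(start, stop) variable that falls in [low, high],
    after clipping the interval to the variable's range."""
    low = max(low, start)
    high = min(high, stop)
    return max(high - low, 0.0) / (stop - start)


shares = {
    "low (< 100)": uniform_share(float("-inf"), 100),
    "medium (100-300)": uniform_share(100, 300),
    "high (> 300)": uniform_share(300, float("inf")),
    "flag_binge = 1 (> 500)": uniform_share(500, float("inf")),
}

for name, share in shares.items():
    print(f"{name:24s} {share:6.1%}")
```

Under these assumptions the "low" bucket is empty and roughly 98% of users are "high", so the bucket and the flag are nearly constant columns. Cutoffs based on quantiles of the observed data would spread users more evenly.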

### 2. What value do interaction terms (e.g., `plan_region_combo`) add?
- Could some plans behave differently in different regions?

*Interaction terms such as `plan_region_combo` combine two features so the model can learn whether their joint effect differs from the sum of their individual effects. Here, it lets the model test whether churn for a given plan tier varies by region. This can reveal, for example, a Premium plan that retains users well in one region but struggles in another, which would not be visible from plan tier or region alone.*
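
To make the interaction concrete, the sketch below mirrors the `CONCAT(plan_tier, '_', region)` expression from the feature query in Python and lists the categories it creates. The category values are taken from the data-generation query; the helper function is illustrative.

```python
from itertools import product

plan_tiers = ["Basic", "Standard", "Premium"]
regions = ["North America", "Europe", "Asia", "Latin America"]


def plan_region_combo(plan_tier, region):
    """Mirror CONCAT(plan_tier, '_', region) from the feature query."""
    return f"{plan_tier}_{region}"


combos = [plan_region_combo(p, r) for p, r in product(plan_tiers, regions)]

print(len(combos))   # 12 distinct plan/region categories
print(combos[:3])    # ['Basic_North America', 'Basic_Europe', 'Basic_Asia']
```

Each combination becomes its own category, so the model can give "Premium in Asia" an effect that neither the plan nor the region term captures alone. The cost is that 1,000 users are spread over 12 cells, so each cell's estimate rests on fewer rows.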

### 3. What’s the purpose of binary flags like `flag_binge`?
- Can these capture unique behaviors not reflected in raw totals?

*Binary flags such as `flag_binge` are 0/1 features that mark the presence or absence of a specific behavior. They are useful for isolating user segments whose behavior may relate to churn in a qualitatively different way, such as users who watch a large amount of content in a short period, rather than simply "more" or "less" of a continuous variable. In this lab, however, `flag_binge` is defined as `total_minutes > 500`, so it is a threshold on the raw total and carries no information that `total_minutes` does not already contain; a true binge flag would need session-level timing data.*

### 4. After evaluating the enhanced model:
- Which new features helped the most?
- Did any surprise you?

*The enhanced model's `roc_auc` is slightly lower than the base model's (0.576677 vs. 0.595034), and both models report precision and recall of 0.0, so the engineered features did not improve the model's ability to separate users who churn from those who don't. The most likely reason is that the dataset is synthetic: `churn_label` is assigned by `RAND() < 0.2`, independently of every feature, so there is no real pattern to learn and an AUC slightly above 0.5 is likely noise. Under those conditions no feature, original or engineered, can be expected to help, which makes it impossible to say which new feature "helped the most". Answering that question would require data with a real relationship to churn, followed by a look at feature importance (for example, the model weights) and, if needed, alternative features or model types.*
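
One way to judge whether AUC values of 0.58 to 0.60 are distinguishable from chance is to compare them with the spread of AUC when scores carry no information about the label. The sketch below uses the standard Mann-Whitney result for that spread. The evaluation-set sizes are assumptions: 200 rows with 37 churners and 215 rows with 47 churners were chosen only because they reproduce the reported accuracies if every row is predicted 0. The real split sizes are not shown in the notebook.

```python
import math


def null_auc_standard_error(n_pos, n_neg):
    """Standard deviation of ROC AUC when scores are unrelated to the
    label (Mann-Whitney U statistic under the null, assuming no ties)."""
    return math.sqrt((n_pos + n_neg + 1) / (12 * n_pos * n_neg))


# (assumed churners, assumed non-churners, reported roc_auc)
cases = {
    "base": (37, 163, 0.595034),
    "enhanced": (47, 168, 0.576677),
}

for name, (n_pos, n_neg, auc) in cases.items():
    se = null_auc_standard_error(n_pos, n_neg)
    z = (auc - 0.5) / se
    print(f"{name:9s} SE under no signal = {se:.3f}   (AUC - 0.5) / SE = {z:.2f}")
```

Under these assumed sizes both AUC values sit within about two standard errors of 0.5, so the gap between the two models is well inside what random evaluation splits could produce.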
