# 📊 MGMT 467 - Unit 2 Lab 2: Churn Modeling with BigQueryML + Feature Engineering
**Date:** 2025-10-16

In this lab you will:
- Connect to BigQuery from Colab
- Create features and labels
- Engineer new features from user behavior
- Train and evaluate logistic regression models
- Reflect on modeling assumptions and interpret results

In [1]:
# ✅ Authenticate and set up GCP project
from google.colab import auth
auth.authenticate_user()

project_id = "mgmt-467-2500"  # <-- Replace with your actual project ID
!gcloud config set project $project_id

Updated property [core/project].


In [18]:
# ✅ Verify BigQuery access
%%bigquery --project $project_id
SELECT CURRENT_DATE() AS today, SESSION_USER() AS user


| | today | user |
|---|---|---|
| 0 | 2025-10-20 | gallagherdaniel555@gmail.com |


In [39]:
# ✅ Prepare base churn features
%%bigquery --project $project_id
CREATE OR REPLACE TABLE `netflix.churn_features` AS
SELECT
  user_id,
  age,
  gender,
  country,
  city,
  subscription_plan,
  monthly_spend,
  household_size,
  created_at,
  subscription_start_date,
  is_active
FROM `netflix.users`;


In [40]:
%%bigquery --project $project_id

-- Step 1: Add a new column called churn_label
ALTER TABLE `netflix.churn_features`
ADD COLUMN churn_label INT64;


In [41]:
%%bigquery --project $project_id

-- Step 2: Populate the new column with random 0s and 1s
UPDATE `netflix.churn_features`
SET churn_label =
    CASE
        WHEN RAND() < 0.5 THEN 0  -- Assign 0 to approximately 50% of rows
        ELSE 1  -- Assign 1 to the remaining rows
    END
WHERE churn_label IS NULL; -- Only update rows where churn_label is currently NULL


In [42]:
# ✅ Train base logistic regression model
%%bigquery --project $project_id
CREATE OR REPLACE MODEL `netflix.churn_model`
OPTIONS(model_type='logistic_reg', input_label_cols=['churn_label']) AS
SELECT
  user_id,  -- identifier column; BigQuery ML treats it as a feature unless it is excluded
  age,
  gender,
  country,
  city,
  subscription_plan,
  monthly_spend,
  household_size,
  created_at,
  subscription_start_date,
  is_active,
  churn_label
FROM `netflix.churn_features`;


In [43]:
# ✅ Evaluate base model
%%bigquery --project $project_id
SELECT *
FROM ML.EVALUATE(MODEL `netflix.churn_model`);


| | precision | recall | accuracy | f1_score | log_loss | roc_auc |
|---|---|---|---|---|---|---|
| 0 | 0.525956 | 0.357143 | 0.50096 | 0.425414 | 0.693921 | 0.502329 |
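
A `roc_auc` this close to 0.5 is what randomly assigned labels tend to produce. The sketch below assumes labels drawn independently of a feature, the way the `RAND()` step assigns `churn_label`. The helper `roc_auc` and the synthetic spend values are illustrative and not part of the lab.

```python
import random

def roc_auc(scores, labels):
    """AUC via the rank-sum formula; tied scores share their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover every item tied with position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

rng = random.Random(467)  # fixed seed so the run is repeatable
n = 5000
spend = [rng.uniform(5, 40) for _ in range(n)]               # synthetic feature
labels = [0 if rng.random() < 0.5 else 1 for _ in range(n)]  # coin-flip labels

auc_random = roc_auc(spend, labels)
print(f"AUC of a feature against random labels: {auc_random:.3f}")
```

Whatever feature is used as the score, the result stays near 0.5 because the labels carry no information about it.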


In [44]:
# ✅ Predict churn with base model
%%bigquery --project $project_id
SELECT
  user_id,
  predicted_churn_label,
  predicted_churn_label_probs
FROM ML.PREDICT(MODEL `netflix.churn_model`,
                (SELECT * FROM `netflix.churn_features`));


| | user_id | predicted_churn_label | predicted_churn_label_probs |
|---|---|---|---|
| 0 | user_04304 | 0 | [{'label': 1, 'prob': 0.4953124098262236}, {'l... |
| 1 | user_07830 | 0 | [{'label': 1, 'prob': 0.4997124288898135}, {'l... |
| 2 | user_03656 | 0 | [{'label': 1, 'prob': 0.47546490073953857}, {'... |
| 3 | user_04289 | 0 | [{'label': 1, 'prob': 0.44260191045095554}, {'... |
| 4 | user_09364 | 0 | [{'label': 1, 'prob': 0.4439633922736284}, {'l... |
| ... | ... | ... | ... |
| 10295 | user_08362 | 1 | [{'label': 1, 'prob': 0.5522749686375736}, {'l... |
| 10296 | user_09454 | 1 | [{'label': 1, 'prob': 0.5121736367185027}, {'l... |
| 10297 | user_01140 | 1 | [{'label': 1, 'prob': 0.5504343992942049}, {'l... |
| 10298 | user_07363 | 1 | [{'label': 1, 'prob': 0.5291633457625049}, {'l... |



## 🛠️ Feature Engineering Section

We will now engineer new features to improve model performance:

- Bucket continuous variables
- Create interaction terms
- Add behavioral flags


In [46]:

# ✅ Create enhanced feature set
%%bigquery --project $project_id
CREATE OR REPLACE TABLE `netflix.churn_features_enhanced` AS
SELECT
  user_id,
  age,
  gender,
  country,
  city,
  subscription_plan,
  monthly_spend,
  household_size,
  created_at,
  subscription_start_date,
  is_active,
  CASE
    WHEN monthly_spend < 10 THEN 'low'
    WHEN monthly_spend BETWEEN 10 AND 25 THEN 'medium'
    ELSE 'high'
  END AS monthly_spend_bucket,
  CONCAT(country, '_', subscription_plan) AS country_plan_combo,
  churn_label
FROM `netflix.churn_features`;



In [49]:

# ✅ Train enhanced model
%%bigquery --project $project_id
CREATE OR REPLACE MODEL `netflix.churn_model_enhanced`
OPTIONS(model_type='logistic_reg', input_label_cols=['churn_label']) AS
SELECT
  user_id,  -- identifier column; BigQuery ML treats it as a feature unless it is excluded
  age,
  gender,
  city,
  monthly_spend_bucket,
  household_size,
  created_at,
  subscription_start_date,
  is_active,
  churn_label
FROM `netflix.churn_features_enhanced`;



In [51]:

# ✅ Evaluate enhanced model
%%bigquery --project $project_id
SELECT *
FROM ML.EVALUATE(MODEL `netflix.churn_model_enhanced`);



| | precision | recall | accuracy | f1_score | log_loss | roc_auc |
|---|---|---|---|---|---|---|
| 0 | 0.520788 | 0.46439 | 0.515939 | 0.490975 | 0.692345 | 0.525261 |
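
Two quick sanity checks on the reported metrics, as a sketch: F1 should equal the harmonic mean of precision and recall, and a model that always predicts a probability of 0.5 has a log loss of ln 2 ≈ 0.6931. The numbers are copied from the two `ML.EVALUATE` outputs in this notebook; the variable and function names are illustrative.

```python
import math

# Metrics copied from the ML.EVALUATE outputs in this notebook.
base = {"precision": 0.525956, "recall": 0.357143,
        "f1_score": 0.425414, "log_loss": 0.693921}
enhanced = {"precision": 0.520788, "recall": 0.46439,
            "f1_score": 0.490975, "log_loss": 0.692345}

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Log loss of a constant 0.5 prediction on binary labels: -ln(0.5) = ln 2.
coin_flip_log_loss = math.log(2)

for name, m in [("base", base), ("enhanced", enhanced)]:
    gap = m["log_loss"] - coin_flip_log_loss
    print(f"{name}: recomputed f1 = {f1(m['precision'], m['recall']):.6f} "
          f"(reported {m['f1_score']}), log_loss - ln 2 = {gap:+.6f}")
```

A positive gap means the model's probabilities are slightly worse than a constant 0.5 guess, and a negative gap means slightly better.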



## 🤔 Chain-of-Thought Prompts: Feature Engineering

### 1. Why bucket continuous values like watch time?
- What patterns become clearer by using categories like "low", "medium", "high"?

### 2. What value do interaction terms (e.g., `plan_tier_region`) add?
- Could some plans behave differently in different regions?

### 3. What’s the purpose of binary flags like `flag_binge`?
- Can these capture unique behaviors not reflected in raw totals?

### 4. After evaluating the enhanced model:
- Which new features helped the most?
- Did any surprise you?

✍️ Write your responses in a text cell below or in a shared doc for discussion.


## Responses to Chain-of-Thought Prompts: Feature Engineering

### 1. Why bucket continuous values like watch time?
Bucketing continuous values can help to:
- Simplify the relationship between a continuous feature and the target variable (`churn_label`) when that relationship is not linear.
- Reduce the impact of outliers.
- Make the feature more interpretable by grouping users into distinct categories (e.g., 'low', 'medium', 'high'), which can reveal clearer patterns related to churn. For example, one tier might have a noticeably higher churn rate than the others.

In this lab the bucketed variable is `monthly_spend` (stored as `monthly_spend_bucket`) rather than watch time, but the same reasoning applies.
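
As a sketch of the idea, the `monthly_spend_bucket` rule from the enhanced feature table can be written as a small Python function. The function name and sample values are illustrative. The thresholds are the ones in the notebook's `CASE` expression, including the way a missing spend falls through to the `ELSE` branch.

```python
from collections import Counter

def spend_bucket(monthly_spend):
    """Mirror the CASE expression: < 10 is low, 10 to 25 inclusive is medium, else high."""
    # In SQL a NULL fails both comparisons and falls through to ELSE,
    # so a missing spend is labeled 'high'; this sketch keeps that behavior.
    if monthly_spend is None:
        return "high"
    if monthly_spend < 10:
        return "low"
    if 10 <= monthly_spend <= 25:
        return "medium"
    return "high"

sample_spend = [4.99, 10, 17.5, 25, 25.01, None]  # illustrative values
bucket_counts = Counter(spend_bucket(s) for s in sample_spend)
print(bucket_counts)
```

If missing spend should not count as 'high', an explicit NULL branch placed first in the `CASE` would be one way to separate it.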

### 2. What value do interaction terms (e.g., `plan_tier_region`) add?
Interaction terms capture the combined effect of two or more features that might be different from the sum of their individual effects. For example:
- The impact of a specific `subscription_plan` on churn might vary significantly depending on the `region` due to local market conditions, pricing, or content preferences. An interaction term like `plan_tier_region` can capture these unique combinations and improve the model's ability to identify specific segments that are more or less likely to churn.
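
A minimal sketch of why the combination can matter, using made-up rows: churn rates are computed per country, per plan, and per combined key built the same way `CONCAT(country, '_', subscription_plan)` builds `country_plan_combo`. The rows and the helper `churn_rate_by` are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows: (country, subscription_plan, churn_label)
rows = [
    ("US", "Basic", 1), ("US", "Basic", 1), ("US", "Premium", 0), ("US", "Premium", 0),
    ("CA", "Basic", 0), ("CA", "Basic", 0), ("CA", "Premium", 1), ("CA", "Premium", 1),
]

def churn_rate_by(key_fn, data):
    """Average churn_label per group, where key_fn maps a row to its group key."""
    totals = defaultdict(lambda: [0, 0])  # key -> [churned, count]
    for row in data:
        key = key_fn(row)
        totals[key][0] += row[2]
        totals[key][1] += 1
    return {k: churned / count for k, (churned, count) in totals.items()}

by_country = churn_rate_by(lambda r: r[0], rows)
by_plan = churn_rate_by(lambda r: r[1], rows)
by_combo = churn_rate_by(lambda r: f"{r[0]}_{r[1]}", rows)

print(by_country)  # both countries show the same churn rate
print(by_plan)     # both plans show the same churn rate
print(by_combo)    # only the combined key separates churners
```

In this toy data neither feature alone says anything about churn, which is the case a logistic regression cannot fit without the combined category.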

### 3. What’s the purpose of binary flags like `flag_binge`?
Binary flags are used to highlight specific, often non-linear, behaviors or characteristics that might be strong indicators of the target variable but are not well-represented by raw continuous or categorical features alone.
- A `flag_binge` can capture the behavior of users who watch an unusually high amount of content in a short period. This "binge" behavior might be strongly correlated with either churn (e.g., they consumed all content they wanted and are leaving) or retention (e.g., they are highly engaged). A simple total watch time might not capture this distinct behavior as effectively as a dedicated flag.
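
A sketch of such a flag, assuming a hypothetical per-day watch log (the `netflix.users` columns used in this lab contain no watch data) and an arbitrary 300-minute threshold. Both users have the same weekly total, but only one of them binged.

```python
BINGE_MINUTES_PER_DAY = 300  # assumed threshold, chosen only for illustration

# Hypothetical daily watch minutes over one week.
daily_minutes = {
    "user_a": [60, 60, 60, 60, 60, 60, 60],  # steady viewer
    "user_b": [0, 0, 0, 0, 0, 60, 360],      # one long viewing day
}

def flag_binge(minutes_per_day, threshold=BINGE_MINUTES_PER_DAY):
    """Return 1 if any single day reaches the threshold, else 0."""
    return int(any(m >= threshold for m in minutes_per_day))

features = {
    user: {"total_minutes": sum(days), "flag_binge": flag_binge(days)}
    for user, days in daily_minutes.items()
}
print(features)
```

The raw total cannot tell these two users apart, while the flag can.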

### 4. After evaluating the enhanced model:
(This requires running the evaluation cell `423b6d00` and analyzing its output.)
- Comparing the `ML.EVALUATE` metrics of the enhanced model (`netflix.churn_model_enhanced`) with those of the base model (`netflix.churn_model`): ROC AUC rose from 0.502 to 0.525, accuracy from 0.501 to 0.516, recall from 0.357 to 0.464, and F1 from 0.425 to 0.491, while precision slipped from 0.526 to 0.521 and log loss barely moved (0.6939 to 0.6923). Both models remain close to chance. That is expected here because `churn_label` was assigned at random with `RAND()`, so the small gain may simply be noise.
- The only engineered feature the enhanced model was trained on is `monthly_spend_bucket`; `country_plan_combo` was created but left out of the training query, and no binary flags were built. To see which inputs the model leans on, run an `ML.WEIGHTS` query on each model and compare the coefficients: the inputs with the largest positive or negative weights have the strongest effect on the predicted churn probability, and any unexpected sign or size is worth a closer look.
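
One way to read an `ML.WEIGHTS` result once it is loaded into Python, as a sketch: numeric inputs carry a single `weight`, while categorical inputs carry per-category weights in `category_weights`. The rows below are invented placeholders in that shape, not output from this lab's models, and `strongest_weights` is a hypothetical helper.

```python
# Invented placeholder rows shaped like an ML.WEIGHTS result (not real output).
weights_rows = [
    {"processed_input": "age", "weight": -0.02, "category_weights": []},
    {"processed_input": "household_size", "weight": 0.05, "category_weights": []},
    {"processed_input": "monthly_spend_bucket", "weight": None, "category_weights": [
        {"category": "low", "weight": 0.11},
        {"category": "medium", "weight": -0.03},
        {"category": "high", "weight": -0.08},
    ]},
    {"processed_input": "__INTERCEPT__", "weight": 0.01, "category_weights": []},
]

def strongest_weights(rows, top_n=3):
    """Flatten numeric and per-category weights, then rank by absolute size."""
    flat = []
    for row in rows:
        if row["processed_input"] == "__INTERCEPT__":
            continue  # the intercept is not a feature
        if row["weight"] is not None:
            flat.append((row["processed_input"], row["weight"]))
        for cat in row["category_weights"]:
            flat.append((f"{row['processed_input']}={cat['category']}", cat["weight"]))
    return sorted(flat, key=lambda item: abs(item[1]), reverse=True)[:top_n]

top_features = strongest_weights(weights_rows)
print(top_features)
```

Raw weights on numeric inputs depend on each input's scale, so ranking them against category weights is only a rough guide.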

In [None]:
!nbstripout Labs/Lab5_Classification_BQML.ipynb

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [67]:
!git config --global user.name "DanielGallagher1"
!git config --global user.email "gallagherdaniel555@gmail.com"

In [68]:
from getpass import getpass
token = getpass('Enter your GitHub token: ')
!git clone https://{token}@github.com/DanielGallagher1/mgmt467-analytics-portfolio.git

Enter your GitHub token: ··········
Cloning into 'mgmt467-analytics-portfolio'...
remote: Enumerating objects: 65, done.
remote: Counting objects: 100% (65/65), done.
remote: Compressing objects: 100% (59/59), done.
remote: Total 65 (delta 32), reused 10 (delta 4), pack-reused 0 (from 0)
Receiving objects: 100% (65/65), 808.46 KiB | 5.46 MiB/s, done.
Resolving deltas: 100% (32/32), done.


In [73]:
%cd /content/mgmt467-analytics-portfolio/

/content/mgmt467-analytics-portfolio


In [74]:
!cp "/content/drive/My Drive/MGMT467/Labs/Lab5_Classification_BQML.ipynb" "/content/mgmt467-analytics-portfolio/Labs/"

In [75]:
!git add Labs/Lab5_Classification_BQML.ipynb
!git commit -m "Added Lab5_Classification_BQML.ipynb with latest analysis"
!git push https://{token}@github.com/DanielGallagher1/mgmt467-analytics-portfolio.git main

[main e065813] Added Lab5_Classification_BQML.ipynb with latest analysis
 1 file changed, 1 insertion(+)
 create mode 100644 Labs/Lab5_Classification_BQML.ipynb
Enumerating objects: 6, done.
Counting objects: 100% (6/6), done.
Delta compression using up to 2 threads
Compressing objects: 100% (4/4), done.
Writing objects: 100% (4/4), 18.96 KiB | 3.79 MiB/s, done.
Total 4 (delta 2), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (2/2), completed with 2 local objects.
To https://github.com/DanielGallagher1/mgmt467-analytics-portfolio.git
   28a3a1e..e065813  main -> main


In [None]:
!nbstripout Labs/Lab5_Classification_BQML.ipynb