<a href="https://colab.research.google.com/github/Tomas-Turner/mgmt467-analytics-portfolio/blob/main/Unit2_Lab2_Churn_Modeling_FeatureEngineering_Colab_(1).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📊 MGMT 467 - Unit 2 Lab 2: Churn Modeling with BigQueryML + Feature Engineering
**Date:** 2025-10-16

In this lab you will:
- Connect to BigQuery from Colab
- Create features and labels
- Engineer new features from user behavior
- Train and evaluate logistic regression models
- Reflect on modeling assumptions and interpret results

In [2]:
# ✅ Authenticate and set up GCP project
from google.colab import auth
auth.authenticate_user()

project_id = "mgmt-467-35946"  # <-- Replace with your actual project ID
!gcloud config set project $project_id

Updated property [core/project].


In [18]:
%%bigquery --project $project_id
-- ✅ Verify BigQuery access
SELECT CURRENT_DATE() AS today, SESSION_USER() AS user

| | today | user |
|---|---|---|
| 0 | 2025-10-27 | tomasturner443@gmail.com |


In [22]:
%%bigquery --project $project_id
CREATE OR REPLACE TABLE `netflix.cleaned_features` AS
SELECT
  u.user_id,
  u.state_province AS region,
  u.subscription_plan AS plan_tier,
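  -- NOTE: these aliases do not match the source columns: age_band holds raw age,
  -- avg_rating holds monthly spend, and total_minutes holds household size.
  u.age AS age_band,
  u.monthly_spend AS avg_rating,
  u.household_size AS total_minutes,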
  -- Calculate average progress from watch_history
  AVG(w.progress_percentage) AS avg_progress,
  -- Count how many watch sessions each user has
  COUNT(w.session_id) AS num_sessions,
  -- Label churned users
  CASE WHEN u.is_active = FALSE THEN 1 ELSE 0 END AS churn_label
FROM `mgmt-467-35946.netflix.users` AS u
LEFT JOIN `mgmt-467-35946.netflix.watch_history` AS w
  ON u.user_id = w.user_id
GROUP BY
  u.user_id, region, plan_tier, age_band, avg_rating, total_minutes, churn_label;
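
Because the query uses a LEFT JOIN, users with no rows in `watch_history` are kept, with `num_sessions` of 0 and a NULL `avg_progress`. The sketch below mimics that aggregation in plain Python on a couple of made-up rows. The toy data and the `build_features` helper are hypothetical and only illustrate the join semantics, not the actual tables.

```python
# Toy stand-ins for the users and watch_history tables (hypothetical rows).
users = [
    {"user_id": "u1", "is_active": True},
    {"user_id": "u2", "is_active": False},  # has no watch sessions
]
watch_history = [
    {"user_id": "u1", "session_id": "s1", "progress_percentage": 80.0},
    {"user_id": "u1", "session_id": "s2", "progress_percentage": 40.0},
]


def build_features(users, watch_history):
    """Mimic LEFT JOIN + GROUP BY user_id with AVG and COUNT."""
    rows = []
    for u in users:
        sessions = [w for w in watch_history if w["user_id"] == u["user_id"]]
        progress = [
            w["progress_percentage"]
            for w in sessions
            if w["progress_percentage"] is not None
        ]
        rows.append({
            "user_id": u["user_id"],
            # AVG over zero non-NULL values is NULL in SQL; None here.
            "avg_progress": sum(progress) / len(progress) if progress else None,
            # COUNT(column) skips NULLs, so an unmatched user gets 0.
            "num_sessions": sum(1 for w in sessions if w["session_id"] is not None),
            # CASE WHEN is_active = FALSE THEN 1 ELSE 0 END
            "churn_label": 1 if u["is_active"] is False else 0,
        })
    return rows


features = build_features(users, watch_history)
for row in features:
    print(row)
```

If many users end up with a NULL `avg_progress`, it may be worth counting them before training, since the model then relies on whatever missing-value handling BigQuery ML applies.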


In [23]:
%%bigquery --project $project_id
-- ✅ Prepare base churn features
CREATE OR REPLACE TABLE `netflix.churn_features` AS
SELECT
  user_id,
  region,
  plan_tier,
  age_band,
  avg_rating,
  total_minutes,
  avg_progress,
  num_sessions,
  churn_label
FROM `netflix.cleaned_features`
WHERE churn_label IS NOT NULL;

In [25]:
%%bigquery --project $project_id
-- ✅ Train base logistic regression model
CREATE OR REPLACE MODEL `netflix.churn_model_base`
OPTIONS(model_type='logistic_reg', input_label_cols=['churn_label']) AS
SELECT
  region,
  plan_tier,
  age_band,
  avg_rating,
  total_minutes,
  avg_progress,
  num_sessions,
  churn_label
FROM `netflix.churn_features`;

In [29]:
%%bigquery --project $project_id
-- ✅ Evaluate base model
SELECT *
FROM ML.EVALUATE(MODEL `netflix.churn_model_base`);

| | precision | recall | accuracy | f1_score | log_loss | roc_auc |
|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.849037 | 0.0 | 0.426463 | 0.472275 |
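
One way to read these metrics: precision and recall of 0.0 together with accuracy near 0.85 is the pattern produced by a classifier that never predicts churn on data where roughly 15% of users churned. The sketch below reproduces that pattern on a hypothetical evaluation set. The 151/849 split is an assumption chosen to match the reported accuracy, not data from the lab.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 means churn."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def metrics(y_true, y_pred):
    """Compute precision, recall, accuracy and F1, using 0.0 when undefined."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(y_true)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1_score": f1}


# Hypothetical evaluation set: 1,000 users, 151 of whom churned.
y_true = [1] * 151 + [0] * 849
y_pred = [0] * len(y_true)  # a "model" that never predicts churn

result = metrics(y_true, y_pred)
print(result)
```

High accuracy here comes entirely from the class imbalance, which is why accuracy alone is a poor yardstick for churn models.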


In [31]:
%%bigquery --project $project_id
-- ✅ Predict churn with base model
SELECT
  user_id,
  predicted_churn_label,
  predicted_churn_label_probs
FROM ML.PREDICT(MODEL `netflix.churn_model_base`,
                (SELECT * FROM `netflix.churn_features`));

| | user_id | predicted_churn_label | predicted_churn_label_probs |
|---|---|---|---|
| 0 | user_03064 | 0 | [{'label': 1, 'prob': 0.1776848281166091}, {'l... |
| 1 | user_07309 | 0 | [{'label': 1, 'prob': 0.17695735269916}, {'lab... |
| 2 | user_07856 | 0 | [{'label': 1, 'prob': 0.18098830077934172}, {'... |
| 3 | user_02509 | 0 | [{'label': 1, 'prob': 0.1783810571184799}, {'l... |
| 4 | user_01767 | 0 | [{'label': 1, 'prob': 0.1612832894975099}, {'l... |
| ... | ... | ... | ... |
| 9995 | user_02238 | 0 | [{'label': 1, 'prob': 0.1624518075234975}, {'l... |
| 9996 | user_03520 | 0 | [{'label': 1, 'prob': 0.17430167939180838}, {'... |
| 9997 | user_02041 | 0 | [{'label': 1, 'prob': 0.16177520599515546}, {'... |
| 9998 | user_06505 | 0 | [{'label': 1, 'prob': 0.17904221673918924}, {'... |
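
Every visible row has a churn probability between roughly 0.16 and 0.18, well under the default 0.5 cutoff, which is why `predicted_churn_label` is 0 throughout. The sketch below applies two cutoffs to the five label-1 probabilities shown at the top of the output above. The `flag_churners` helper and the 0.175 cutoff are hypothetical and chosen only to show the effect of the threshold.

```python
# Churn (label 1) probabilities copied from the first five rows of the output.
churn_probs = {
    "user_03064": 0.1776848281166091,
    "user_07309": 0.17695735269916,
    "user_07856": 0.18098830077934172,
    "user_02509": 0.1783810571184799,
    "user_01767": 0.1612832894975099,
}


def flag_churners(probs, threshold):
    """Return the sorted user IDs whose churn probability meets the threshold."""
    return sorted(user for user, p in probs.items() if p >= threshold)


default_flags = flag_churners(churn_probs, 0.5)    # default-style cutoff
lowered_flags = flag_churners(churn_probs, 0.175)  # hypothetical lower cutoff

print("threshold 0.5  :", default_flags)
print("threshold 0.175:", lowered_flags)
```

In BigQuery ML, a similar effect can be obtained by passing a `threshold` value in the optional settings STRUCT of `ML.PREDICT` or `ML.EVALUATE`. Because the probabilities are so tightly clustered, any cutoff is likely to flag users almost arbitrarily, which points to weak features rather than a threshold problem.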



## 🛠️ Feature Engineering Section

We will now engineer new features to improve model performance:

- Bucket continuous variables
- Create interaction terms
- Add behavioral flags


In [32]:

%%bigquery --project $project_id
-- ✅ Create enhanced feature set
CREATE OR REPLACE TABLE `netflix.churn_features_enhanced` AS
SELECT
  user_id,
  region,
  plan_tier,
  age_band,
  avg_rating,
  total_minutes,
  CASE
    WHEN total_minutes < 100 THEN 'low'
    WHEN total_minutes BETWEEN 100 AND 300 THEN 'medium'
    ELSE 'high'
  END AS watch_time_bucket,
  avg_progress,
  num_sessions,
  CONCAT(plan_tier, '_', region) AS plan_region_combo,
  IF(total_minutes > 500, 1, 0) AS flag_binge,
  churn_label
FROM `netflix.churn_features`;
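
Before training on these columns, it can help to check how the bucket and flag rules behave on the values actually stored in `total_minutes`. The sketch below mirrors the SQL expressions in plain Python; the helper names and sample values are hypothetical. Note that in this notebook `total_minutes` is populated from `household_size`, so if those values are small counts (an assumption), every user would land in the `low` bucket with `flag_binge` of 0.

```python
def watch_time_bucket(total_minutes):
    """Mirror the CASE expression; a NULL (None) input falls through to ELSE."""
    if total_minutes is not None and total_minutes < 100:
        return "low"
    if total_minutes is not None and 100 <= total_minutes <= 300:
        return "medium"
    return "high"


def flag_binge(total_minutes):
    """Mirror IF(total_minutes > 500, 1, 0); a NULL comparison is not true."""
    return 1 if total_minutes is not None and total_minutes > 500 else 0


def plan_region_combo(plan_tier, region):
    """Mirror CONCAT(plan_tier, '_', region), which is NULL if any input is NULL."""
    if plan_tier is None or region is None:
        return None
    return f"{plan_tier}_{region}"


# Hypothetical household-size-like values: all fall in one bucket.
small_values = [1, 2, 4, 6]
small_buckets = [watch_time_bucket(v) for v in small_values]
small_flags = [flag_binge(v) for v in small_values]

# Hypothetical values on a true minutes scale, including the boundaries.
minute_values = [50, 100, 300, 301, 650, None]
minute_buckets = [watch_time_bucket(v) for v in minute_values]
minute_flags = [flag_binge(v) for v in minute_values]

print(small_buckets, small_flags)
print(minute_buckets, minute_flags)
print(plan_region_combo("Premium", "Ontario"))
```

A feature that takes the same value for every row carries no signal, so a quick `GROUP BY watch_time_bucket` count on the real table would confirm whether these engineered columns vary at all.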


In [35]:
%%bigquery --project $project_id
-- ✅ Train enhanced model
CREATE OR REPLACE MODEL `netflix.churn_model_enhanced`
OPTIONS(model_type='logistic_reg', input_label_cols=['churn_label']) AS
SELECT
  region,
  plan_tier,
  age_band,
  watch_time_bucket,
  avg_rating,
  avg_progress,
  num_sessions,
  plan_region_combo,
  flag_binge,
  churn_label
FROM `netflix.churn_features_enhanced`;

Query is running:   0%|          |

In [36]:

%%bigquery --project $project_id
-- ✅ Evaluate enhanced model
SELECT *
FROM ML.EVALUATE(MODEL `netflix.churn_model_enhanced`);


| | precision | recall | accuracy | f1_score | log_loss | roc_auc |
|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.848727 | 0.0 | 0.427576 | 0.480636 |



## 🤔 Chain-of-Thought Prompts: Feature Engineering

### 1. Why bucket continuous values like watch time?
- What patterns become clearer by using categories like "low", "medium", "high"?

### 2. What value do interaction terms (e.g., `plan_region_combo`) add?
- Could some plans behave differently in different regions?

### 3. What’s the purpose of binary flags like `flag_binge`?
- Can these capture unique behaviors not reflected in raw totals?

### 4. After evaluating the enhanced model:
- Which new features helped the most?
- Did any surprise you?

✍️ Write your responses in a text cell below or in a shared doc for discussion.


- 1.) I think that bucketing continuous variables can help capture non-linear relationships with the target variable. For example, the impact of watch time on churn might not be linear across all values, and bucketing lets the model learn a separate effect for "low", "medium", and "high" watch times. It can also reduce the influence of outliers and make the model more robust. Patterns might also become clearer if there are distinct groups of users with different watch-time behaviours. For instance, very low watch-time users might be more likely to churn than medium watch-time users, while extremely high watch-time users might have a different churn rate again.
- 2.) I think that interaction terms allow the model to capture synergistic or antagonistic effects between features. For example, a certain subscription plan might perform very well in one region but poorly in another. The `plan_region_combo` interaction can model this specific relationship, which wouldn't be captured by considering `plan_tier` and `region` independently.
- 3.) Binary flags are useful for highlighting specific behaviours or characteristics that might be strongly associated with the target variable. In this case, `flag_binge` identifies users with very high watch time, who might be a distinct group with different churn patterns. The flag can capture a specific "segment" of users that might not be well represented by the raw `total_minutes` variable alone.
- 4.) Looking at the output and comparing the `roc_auc` metric, we see a slight increase from 0.472275 in the base model to 0.480636 in the enhanced model. The increase suggests that the new features (watch time bucket, plan-region combo, and binge flag) may have added a little value in helping the model differentiate between churned and non-churned users, but the evidence is weak: both AUC values are below 0.5, precision and recall are 0.0 for both models, and accuracy and log loss are marginally worse for the enhanced model (0.848727 vs. 0.849037, and 0.427576 vs. 0.426463). I would say the small AUC gain did not surprise me, since it makes sense that the added features would give the model slightly more signal.
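
To put the AUC comparison in context, ROC AUC can be read as the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner, so 0.5 corresponds to random ranking. The sketch below computes it by direct pairwise comparison on made-up labels and scores. The `roc_auc` helper and the toy data are hypothetical, and this is not how BigQuery ML computes the metric internally.

```python
def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = churned) and model scores.
y_true = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.1, 0.8, 0.35, 0.4, 0.3]

auc_informative = roc_auc(y_true, scores)              # mostly correct ranking
auc_constant = roc_auc(y_true, [0.5] * len(y_true))    # no ranking information
auc_inverted = roc_auc(y_true, [-s for s in scores])   # ranking flipped

print(auc_informative, auc_constant, auc_inverted)
```

On this reading, values of 0.47 and 0.48 both sit slightly on the wrong side of random, so the difference between the two models is better treated as noise until it is confirmed on more data.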