<a href="https://colab.research.google.com/github/bigDataNCloud/classResources/blob/main/colab_advanced_analytics_run_and_visualize.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🔍 Advanced Analytics Colab Notebook
This notebook runs end-to-end BigQuery modeling for churn and conversion analysis using:
- KMeans clustering
- Linear regression
- Logistic regression
- Manual segmentation
- Logistic Model Tree logic

**Dataset**: `prof-big-data.churn25.churnstudy`


In [12]:

from google.colab import auth
auth.authenticate_user()

from google.cloud import bigquery
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

client = bigquery.Client(project="prof-big-data")


## 📌 Clustering: Basic User Profile (4 clusters)

In [9]:
query = '''CREATE OR REPLACE MODEL `prof-big-data.churn25.kmeans_basic`
OPTIONS(model_type='kmeans', num_clusters=4, standardize_features=TRUE) AS
SELECT
  sessions,
  day_playtime / 60.0 AS playtime_minutes,
  active_days
FROM `prof-big-data.churn25.churnstudy`;'''
client.query(query).result()
print('✅ Done.')

✅ Done.
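The `standardize_features=TRUE` option rescales each numeric input before distances are computed, so a column with a large range such as `playtime_minutes` does not drown out `sessions` or `active_days`. A minimal sketch of the idea, assuming z-score scaling (subtract the mean, divide by the standard deviation); the helper name `standardize` and the toy values are hypothetical and are not drawn from `churnstudy`:

```python
from statistics import mean, pstdev

def standardize(column):
    """Z-score a list of numbers: subtract the mean, divide by the population std dev."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant column carries no information for clustering.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]

# Hypothetical toy values on very different scales.
sessions = [2, 4, 6, 8]
playtime_minutes = [100.0, 200.0, 300.0, 400.0]

z_sessions = standardize(sessions)
z_playtime = standardize(playtime_minutes)

# After scaling, both columns have mean 0 and standard deviation 1,
# so neither dominates a Euclidean distance.
print(z_sessions)
print(z_playtime)
```

Without scaling, the raw playtime differences (hundreds of minutes) would outweigh session-count differences (single digits) in every distance calculation.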


## 📊 Evaluate KMeans Model Quality

In [13]:
query_eval = """
SELECT *
FROM ML.EVALUATE(MODEL `prof-big-data.churn25.kmeans_basic`)
"""
df_eval = client.query(query_eval).to_dataframe()
df_eval


   davies_bouldin_index  mean_squared_distance
0              1.150668               1.370435
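To make the two metrics concrete, here is a small sketch that computes them by hand from the textbook definitions on toy 2-D points. It assumes Euclidean distance and may differ in detail from how BigQuery ML computes its values; the function names and the toy clusters are hypothetical.

```python
from math import dist

def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def davies_bouldin(clusters):
    """Average, over clusters, of the worst (scatter_i + scatter_j) / centroid distance."""
    cents = [centroid(c) for c in clusters]
    # Scatter: mean distance from each point to its own centroid.
    scatter = [sum(dist(p, m) for p in c) / len(c) for c, m in zip(clusters, cents)]
    k = len(clusters)
    worst = [
        max((scatter[i] + scatter[j]) / dist(cents[i], cents[j])
            for j in range(k) if j != i)
        for i in range(k)
    ]
    return sum(worst) / k

def mean_squared_distance(clusters):
    """Mean squared distance from every point to its assigned centroid."""
    total, count = 0.0, 0
    for c in clusters:
        m = centroid(c)
        total += sum(dist(p, m) ** 2 for p in c)
        count += len(c)
    return total / count

# Hypothetical toy data: compact, well-separated clusters vs. spread-out, close ones.
tight = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
loose = [[(0, 0), (0, 6)], [(4, 0), (4, 6)]]

print(davies_bouldin(tight), mean_squared_distance(tight))
print(davies_bouldin(loose), mean_squared_distance(loose))
```

The compact, well-separated clusters score lower on both metrics, which is the direction to look for when comparing cluster counts.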


In [None]:
# prompt: Using dataframe df_eval: please explain the results

# Display the evaluation metrics from the dataframe
print(df_eval)

# Explain the metrics
print("\nThe dataframe `df_eval` contains evaluation metrics for a clustering model.")

# Explain the Davies-Bouldin Index
print("\nThe 'davies_bouldin_index' measures how well separated the clusters are relative to how spread out each one is.")
print("Lower is better, and 0 is the lowest possible value.")
print(f"The value of {df_eval['davies_bouldin_index'].iloc[0]:.2f} has no absolute pass/fail threshold; judge it against other cluster counts or feature sets on this dataset.")

# Explain the Mean Squared Distance
print("\nThe 'mean_squared_distance' is the average squared distance from each data point to its assigned cluster centroid.")
print("It measures cluster dispersion: a lower value means points sit closer to their centroids.")
print(f"The value of {df_eval['mean_squared_distance'].iloc[0]:.2f} depends on the scale of the input features, so compare it only across models trained on the same inputs.")

# Explain the context of the results
print("\nSince the dataframe only contains one row, these values represent the evaluation metrics for a single clustering result.")
print("To fully understand the quality of the clustering, these values should be compared to results from different clustering algorithms,")
print("different numbers of clusters, or known benchmarks for similar datasets.")



## 📌 Clustering: Game Behavior Segments (5 clusters)

In [15]:
query = '''CREATE OR REPLACE MODEL `prof-big-data.churn25.kmeans_game_behavior`
OPTIONS(model_type='kmeans', num_clusters=5, standardize_features=TRUE) AS
SELECT
  sessions,
  active_days,
  consecutive_days,
  IF(game1_play, 1, 0) AS game1,
  IF(game2_play, 1, 0) AS game2
FROM `prof-big-data.churn25.churnstudy`;'''
client.query(query).result()
print('✅ Done.')

✅ Done.


## 📌 Clustering: Full Engagement Profile (6 clusters)

In [None]:
query = '''CREATE OR REPLACE MODEL `prof-big-data.churn25.kmeans_full_profile`
OPTIONS(model_type='kmeans', num_clusters=6, standardize_features=TRUE) AS
SELECT
  sessions,
  active_days,
  consecutive_days,
  day_playtime / 60.0 AS playtime_minutes,
  IF(game1_play, 1, 0) AS game1,
  IF(game2_play, 1, 0) AS game2
FROM `prof-big-data.churn25.churnstudy`;'''
client.query(query).result()
print('✅ Done.')
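With three KMeans models trained, their evaluation metrics can be gathered side by side. The sketch below builds one `ML.EVALUATE` query per model and sorts the results by Davies-Bouldin index. The helper names and the `run_query` parameter are hypothetical; `run_query` is any callable that takes a SQL string and returns one row as a dict. Because the three models use different feature sets, treat the ranking as a rough guide rather than a strict comparison.

```python
MODELS = [
    "prof-big-data.churn25.kmeans_basic",
    "prof-big-data.churn25.kmeans_game_behavior",
    "prof-big-data.churn25.kmeans_full_profile",
]

def evaluate_query(model_name):
    """Build the ML.EVALUATE statement for a fully qualified model name."""
    return f"SELECT * FROM ML.EVALUATE(MODEL `{model_name}`)"

def collect_metrics(model_names, run_query):
    """Run ML.EVALUATE for each model and return rows sorted by Davies-Bouldin index.

    run_query is a hypothetical callable: SQL string in, one metrics row (dict) out.
    """
    rows = []
    for name in model_names:
        row = dict(run_query(evaluate_query(name)))
        row["model"] = name.split(".")[-1]  # short name for display
        rows.append(row)
    return sorted(rows, key=lambda r: r["davies_bouldin_index"])

# In this notebook, run_query could be wired to the BigQuery client like so:
# run_query = lambda sql: client.query(sql).to_dataframe().iloc[0].to_dict()
# pd.DataFrame(collect_metrics(MODELS, run_query))
```

Passing the query runner in as an argument keeps the helper easy to try with a stub before spending BigQuery quota.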

## 📈 Linear Regression: Predicting Playtime

In [None]:
query = '''CREATE OR REPLACE MODEL `prof-big-data.churn25.linear_playtime`
OPTIONS(model_type='linear_reg', input_label_cols=['playtime_minutes']) AS
SELECT
  sessions,
  active_days,
  consecutive_days,
  day_playtime / 60.0 AS playtime_minutes
FROM `prof-big-data.churn25.churnstudy`;'''
client.query(query).result()
print('✅ Done.')

## 📈 Linear Regression: Predicting Session Count

In [None]:
query = '''CREATE OR REPLACE MODEL `prof-big-data.churn25.linear_sessions`
OPTIONS(model_type='linear_reg', input_label_cols=['sessions']) AS
SELECT
  day_playtime / 60.0 AS playtime_minutes,
  active_days,
  consecutive_days,
  sessions
FROM `prof-big-data.churn25.churnstudy`;'''
client.query(query).result()
print('✅ Done.')

## 📈 Linear Regression: Predicting Consecutive Streak

In [None]:
query = '''CREATE OR REPLACE MODEL `prof-big-data.churn25.linear_streak`
OPTIONS(model_type='linear_reg', input_label_cols=['consecutive_days']) AS
SELECT
  sessions,
  active_days,
  day_playtime / 60.0 AS playtime_minutes,
  consecutive_days
FROM `prof-big-data.churn25.churnstudy`;'''
client.query(query).result()
print('✅ Done.')
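Running `ML.EVALUATE` on a `linear_reg` model reports, among other metrics, `mean_absolute_error` and `r2_score`. The following sketch shows what those two numbers mean using the standard formulas on hypothetical actual and predicted values; it is an illustration, not output from the models above.

```python
def mean_absolute_error(actual, predicted):
    """Average size of the prediction error, in the units of the target."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def r2_score(actual, predicted):
    """Share of the target's variance explained by the predictions (1.0 is perfect)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical playtime values in minutes.
actual = [10, 20, 30, 40]
predicted = [12, 18, 33, 37]

mae = mean_absolute_error(actual, predicted)
r2 = r2_score(actual, predicted)
print(mae, r2)
```

Mean absolute error is read in the target's own units (minutes, sessions, or days for the three models), while R² is unitless and can be compared across them.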

## 🔍 Logistic Regression: Conversion Prediction

In [None]:
query = '''CREATE OR REPLACE MODEL `prof-big-data.churn25.logit_conversion`
OPTIONS(model_type='logistic_reg', input_label_cols=['converted']) AS
SELECT
  sessions,
  active_days,
  consecutive_days,
  day_playtime / 60.0 AS playtime_minutes,
  IF(game1_play, 1, 0) AS game1,
  IF(game2_play, 1, 0) AS game2,
  IF(freetrial_status = 'renewed', 1, 0) AS converted
FROM `prof-big-data.churn25.churnstudy`;'''
client.query(query).result()
print('✅ Done.')
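A logistic regression model outputs a probability of conversion, and metrics such as precision, recall, and accuracy depend on the cutoff used to turn that probability into a yes/no prediction. The sketch below computes those three metrics at a given threshold. The function name, the `>=` comparison, and the toy labels and probabilities are assumptions for illustration only.

```python
def classification_metrics(labels, probabilities, threshold=0.5):
    """Precision, recall, and accuracy after cutting probabilities at `threshold`."""
    preds = [1 if p >= threshold else 0 for p in probabilities]
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, yhat in pairs if y == 1 and yhat == 1)
    fp = sum(1 for y, yhat in pairs if y == 0 and yhat == 1)
    fn = sum(1 for y, yhat in pairs if y == 1 and yhat == 0)
    tn = sum(1 for y, yhat in pairs if y == 0 and yhat == 0)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / len(labels),
    }

# Hypothetical converted flags and predicted conversion probabilities.
labels = [1, 0, 1, 1, 0, 0]
probabilities = [0.9, 0.4, 0.6, 0.3, 0.7, 0.1]

at_half = classification_metrics(labels, probabilities, threshold=0.5)
at_quarter = classification_metrics(labels, probabilities, threshold=0.25)
print(at_half)
print(at_quarter)
```

Lowering the threshold catches more converters (higher recall) at the cost of more false alarms (lower precision), which matters when renewals are a small share of users.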

## 🌲 Manual Segmentation: Light / Moderate / Power Users

In [None]:
query = '''CREATE OR REPLACE TABLE `prof-big-data.churn25.segmented_users` AS
SELECT *,
  CASE
    WHEN sessions > 10 AND consecutive_days >= 5 THEN 'power_user'
    -- Users with more than 10 sessions but a streak under 5 days count as moderate, not light.
    WHEN sessions >= 5 THEN 'moderate_user'
    ELSE 'light_user'
  END AS user_segment
FROM `prof-big-data.churn25.churnstudy`;'''
client.query(query).result()
print('✅ Done.')
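Before training a model on a single segment, it helps to check how many users each segment holds and how often they convert, since a segment with very few rows or only one outcome cannot support a useful classifier. A minimal sketch, assuming rows arrive as dicts with `user_segment` and `freetrial_status` keys; the function name and the toy rows are hypothetical.

```python
from collections import defaultdict

def conversion_rate_by_segment(rows):
    """Return {segment: (user_count, share with freetrial_status == 'renewed')}."""
    totals = defaultdict(int)
    renewed = defaultdict(int)
    for row in rows:
        segment = row["user_segment"]
        totals[segment] += 1
        if row["freetrial_status"] == "renewed":
            renewed[segment] += 1
    return {seg: (totals[seg], renewed[seg] / totals[seg]) for seg in totals}

# Hypothetical rows; in the notebook they could come from
# client.query("SELECT user_segment, freetrial_status FROM "
#              "`prof-big-data.churn25.segmented_users`").to_dataframe().to_dict("records")
rows = [
    {"user_segment": "power_user", "freetrial_status": "renewed"},
    {"user_segment": "power_user", "freetrial_status": "renewed"},
    {"user_segment": "power_user", "freetrial_status": "cancelled"},
    {"user_segment": "light_user", "freetrial_status": "cancelled"},
    {"user_segment": "light_user", "freetrial_status": "renewed"},
]

summary = conversion_rate_by_segment(rows)
print(summary)
```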

## 🌲 Logistic Model: Power Users Only

In [None]:
query = '''CREATE OR REPLACE MODEL `prof-big-data.churn25.logit_power_users`
OPTIONS(model_type='logistic_reg', input_label_cols=['converted']) AS
SELECT
  sessions,
  active_days,
  consecutive_days,
  IF(game1_play, 1, 0) AS game1,
  IF(game2_play, 1, 0) AS game2,
  IF(freetrial_status = 'renewed', 1, 0) AS converted
FROM `prof-big-data.churn25.segmented_users`
WHERE user_segment = 'power_user';'''
client.query(query).result()
print('✅ Done.')