Google Cloud-native web analytics using Google Compute Engine, Cloud SQL, Cloud Dataflow, Google Kubernetes Engine, and Google Deployment Manager

Mohitsai/google-analytics-simulation

Google Analytics Simulation – Google Cloud Platform

Project Overview

This project simulates a large-scale web analytics system using Google Cloud Platform (GCP). The system deploys 10,000 webpages on Google Cloud Storage, tracks 100,000+ user requests, and applies machine learning to predict user demographics based on web traffic behavior. The project incorporates cloud-native big data processing, machine learning, and scalable orchestration.

Key Features

  • Web Traffic Simulation:
    • Hosted 10,000 webpages in Google Cloud Storage.
    • Served content via a Google Compute Engine (GCE) Virtual Machine.
    • Tracked 100,000+ HTTP requests, capturing metadata such as location, age, and gender.
  • Real-Time Traffic Analysis with Cloud SQL & Pub/Sub:
    • Logged all user requests in Google Cloud SQL (MySQL 8.0).
    • Used Google Pub/Sub to track and handle requests from banned countries.
  • PageRank Computation with Google Cloud Dataflow:
    • Processed webpage link structures in real time using Apache Beam on Cloud Dataflow.
    • Identified high-authority pages, improving search and ranking insights.
  • Machine Learning for User Demographics Prediction:
    • Trained an ML model to predict user demographics based on web request metadata.
    • Achieved 99.7% accuracy in classification.
  • Scalable Orchestration with Google Kubernetes Engine (GKE):
    • Deployed the system in GKE, ensuring high availability and fault tolerance.
    • Used Google Deployment Manager for automated infrastructure setup.
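
The metadata capture listed under Web Traffic Simulation could look roughly like the sketch below. The header names (`X-country`, `X-age`, `X-gender`) and the `RequestRecord` type are assumptions made for illustration; the README does not say how the simulated clients transmit these fields.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class RequestRecord:
    """One logged request; field names are illustrative, not taken from the repo."""
    path: str
    country: Optional[str]
    age: Optional[str]
    gender: Optional[str]


def parse_request(path: str, headers: Mapping[str, str]) -> RequestRecord:
    """Build a record from a request path and its headers.

    Lookup is case-insensitive because HTTP header names are.
    Missing headers become None instead of raising.
    """
    lowered = {name.lower(): value for name, value in headers.items()}
    return RequestRecord(
        path=path,
        country=lowered.get("x-country"),   # hypothetical header name
        age=lowered.get("x-age"),           # hypothetical header name
        gender=lowered.get("x-gender"),     # hypothetical header name
    )


if __name__ == "__main__":
    print(parse_request("/page_42.html", {"X-Country": "Canada", "X-Age": "26-35"}))
```

Keeping the parsing step separate from the web server makes it easy to unit-test without sending real HTTP traffic.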

System Architecture

1️⃣ Web Serving Layer

  • Google Cloud Storage (GCS): Stores 10,000 webpages.
  • Compute Engine (GCE VM): Serves webpages to users.

2️⃣ Data Ingestion & Storage

  • Cloud SQL (MySQL 8.0): Stores 100,000+ user requests.
  • Pub/Sub Topic & Subscription:
    • Tracks banned country requests.
    • Notifies the logging system of policy violations.
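
The publishing side of this flow might be sketched as follows. The block list contents, the JSON payload shape, and the helper names are all hypothetical; only `PublisherClient`, `topic_path`, and `publish` come from the `google-cloud-pubsub` client library.

```python
import json

# Hypothetical block list; the README does not name the banned countries.
BANNED_COUNTRIES = frozenset({"Examplestan", "Samplevia"})


def is_banned(country, banned=BANNED_COUNTRIES):
    """Return True when the request's country is on the block list (case-insensitive)."""
    if country is None:
        return False
    return country.strip().lower() in {c.lower() for c in banned}


def build_violation_message(country, path):
    """Encode a violation as UTF-8 JSON bytes, since Pub/Sub payloads must be bytes."""
    return json.dumps({"country": country, "path": path}).encode("utf-8")


def report_violation(project_id, topic_id, country, path):
    """Publish a violation and return the server-assigned message ID.

    Requires the google-cloud-pubsub package and valid credentials.
    """
    # Imported lazily so the pure helpers above can run without the library.
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project_id, topic_id)
    future = publisher.publish(topic_path, data=build_violation_message(country, path))
    return future.result()  # blocks until the service accepts the message
```

A web server would call `is_banned` on each request and, when it returns `True`, call `report_violation` before rejecting the request.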

3️⃣ Real-Time Analytics & Processing

  • Google Cloud Dataflow (Apache Beam):
    • Computes PageRank for all 10,000 webpages.
    • Identifies influential pages for performance insights.
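
To make the PageRank step concrete, here is a minimal single-machine sketch of the iterative algorithm. It is plain Python, not the project's Apache Beam pipeline, and the damping factor, tolerance, and sample graph are conventional or made-up values rather than settings taken from the repo.

```python
def pagerank(links, damping=0.85, tol=1e-9, max_iter=100):
    """Iterative PageRank over {page: [pages it links to]}.

    Rank held by pages with no outgoing links is spread evenly over
    all pages, so the ranks always sum to 1.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    if n == 0:
        return {}
    ranks = {p: 1.0 / n for p in pages}

    for _ in range(max_iter):
        # Total rank sitting on pages without outgoing links.
        dangling = sum(ranks[p] for p in pages if not links.get(p))
        base = (1.0 - damping) / n + damping * dangling / n
        new_ranks = {p: base for p in pages}
        for page, targets in links.items():
            if targets:
                share = damping * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
        delta = sum(abs(new_ranks[p] - ranks[p]) for p in pages)
        ranks = new_ranks
        if delta < tol:  # stop once the ranks have stabilised
            break
    return ranks


if __name__ == "__main__":
    graph = {"a.html": ["b.html", "c.html"], "b.html": ["c.html"], "c.html": ["a.html"]}
    for page, score in sorted(pagerank(graph).items(), key=lambda kv: -kv[1]):
        print(f"{page}: {score:.4f}")
```

In a distributed pipeline the inner loop becomes a group-by-target-page aggregation, but the per-iteration arithmetic is the same.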

4️⃣ Machine Learning & Deployment

  • ML Model:
    • Predicts user demographics from metadata.
    • Achieves 99.7% accuracy.
  • Google Kubernetes Engine (GKE):
    • Deploys the prediction service for real-time analytics.
  • Google Deployment Manager:
    • Automates resource provisioning.
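
For reference, an accuracy figure like the one quoted above is the share of held-out examples whose predicted label matches the true label. A minimal sketch follows; the labels in the usage example are made up and are not the project's data.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label and prediction lists must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(1 for truth, guess in zip(y_true, y_pred) if truth == guess)
    return correct / len(y_true)


if __name__ == "__main__":
    # Three of four made-up predictions match, so this prints 0.75.
    print(accuracy(["F", "M", "F", "M"], ["F", "M", "M", "M"]))
```

Accuracy alone can look high when one class dominates the data, so it is worth checking per-class results as well.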

Deployment Steps

1️⃣ Prerequisites

  • Google Cloud SDK installed & authenticated.
  • Cloud Storage Bucket with webpages.
  • Cloud SQL Instance set up with MySQL.
  • Pub/Sub Topic & Subscription created.

2️⃣ Deploy Web Serving VM

gcloud compute instances create web-server \
    --zone=us-central1-a \
    --machine-type=e2-micro \
    --image-family=debian-11 \
    --image-project=debian-cloud \
    --metadata=startup-script-url=gs://your-bucket/startup-script.sh

3️⃣ Enable Web Traffic Logging

gcloud sql instances create web-traffic-db \
    --database-version=MYSQL_8_0 \
    --tier=db-f1-micro \
    --region=us-central1
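
Once the instance exists, each request can be written with a parameterized INSERT. The sketch below uses Python's built-in `sqlite3` as a local stand-in so it runs without a database server; the `requests` table and its columns are assumptions, and a MySQL driver such as PyMySQL would use `%s` placeholders instead of `?`.

```python
import sqlite3

# Hypothetical schema; the README does not show the real table layout.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS requests (
    id INTEGER PRIMARY KEY,
    path TEXT NOT NULL,
    country TEXT,
    age TEXT,
    gender TEXT
)
"""

INSERT_REQUEST = "INSERT INTO requests (path, country, age, gender) VALUES (?, ?, ?, ?)"


def log_request(conn, path, country, age, gender):
    """Insert one request row.

    Placeholders keep client-supplied values out of the SQL text,
    which prevents SQL injection through request metadata.
    """
    with conn:  # commits on success, rolls back if execute raises
        conn.execute(INSERT_REQUEST, (path, country, age, gender))


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(CREATE_TABLE)
    log_request(conn, "/page_1.html", "Canada", "26-35", "Female")
    print(conn.execute("SELECT COUNT(*) FROM requests").fetchone()[0])
```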

4️⃣ Deploy Banned Country Tracker

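from google.cloud import pubsub_v1

PROJECT_ID = "ds-561-mohitsai"
SUBSCRIPTION_ID = "banned-country-topic-sub"

def callback(message):
    # Pub/Sub delivers the payload as bytes; decode before printing.
    print(f"Received banned country request: {message.data.decode('utf-8')}")
    # Acknowledge so the message is not redelivered.
    message.ack()

def listen_for_banned_requests():
    subscriber = pubsub_v1.SubscriberClient()
    subscription_path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION_ID)

    # subscribe() starts a background streaming pull and returns a future.
    future = subscriber.subscribe(subscription_path, callback=callback)
    print(f"Listening for messages on {subscription_path}...")

    # The context manager closes the client's connection on exit.
    with subscriber:
        try:
            future.result()  # Block until the stream fails or is cancelled.
        except KeyboardInterrupt:
            future.cancel()  # Request shutdown of the streaming pull.
            future.result()  # Wait for the shutdown to finish.

if __name__ == "__main__":
    listen_for_banned_requests()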

5️⃣ Compute PageRank on Cloud Dataflow

gcloud dataflow jobs run compute-pagerank \
    --gcs-location gs://your-bucket/path-to-pagerank-pipeline \
    --region us-central1

6️⃣ Train & Deploy ML Model on GKE

gcloud container clusters create analytics-cluster \
    --num-nodes=3 --zone=us-central1-a

gcloud builds submit --tag gcr.io/$PROJECT_ID/predictor

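gcloud container clusters get-credentials analytics-cluster \
    --zone=us-central1-a

kubectl create deployment predictor-service \
    --image=gcr.io/$PROJECT_ID/predictor

# Assumes the container listens on port 8080; adjust --target-port to match the image.
kubectl expose deployment predictor-service \
    --type=LoadBalancer --port=80 --target-port=8080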

Insights & Results

  • Top Web Pages: Identified via PageRank computation.
  • Traffic Analysis: Logged 100,000+ requests, categorized by region, age, gender.
  • Banned Country Tracking: Requests flagged via Pub/Sub logs.
  • ML Model Accuracy: Achieved 99.7% accuracy in demographic predictions.
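
The categorization in the Traffic Analysis bullet amounts to a group-and-count over the logged requests. A rough sketch, assuming each request is available as a dictionary with hypothetical `country`, `age`, and `gender` keys (the sample rows are invented):

```python
from collections import Counter


def summarize(records, field):
    """Count requests per value of `field`, most common first.

    Rows where the field is missing or empty are grouped under "unknown".
    """
    counts = Counter((record.get(field) or "unknown") for record in records)
    return counts.most_common()


if __name__ == "__main__":
    sample = [
        {"country": "Canada", "gender": "Female"},
        {"country": "Canada", "gender": "Male"},
        {"country": "Brazil", "gender": "Female"},
    ]
    for field in ("country", "gender"):
        print(field, summarize(sample, field))
```

Against the Cloud SQL table the same result would come from a `GROUP BY` query; the Python version is handy for quick checks on exported rows.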

Contributing & Usage

  • Modify and adapt configurations as needed.
  • Ensure credentials & permissions are set correctly.
  • Star ⭐ the repo if you found this useful!

Contact

Feel free to reach out via:


© 2025 Mohit Sai Gutha | Built using Google Cloud, Dataflow & GKE
