Google Analytics Simulation – Google Cloud Platform

Project Overview

This project simulates a large-scale web analytics system using Google Cloud Platform (GCP). The system deploys 10,000 webpages on Google Cloud Storage, tracks 100,000+ user requests, and applies machine learning to predict user demographics based on web traffic behavior. The project incorporates cloud-native big data processing, machine learning, and scalable orchestration.

Key Features

Web Traffic Simulation:
- Hosted 10,000 webpages in Google Cloud Storage.
- Served content via a Google Compute Engine (GCE) Virtual Machine.
- Tracked 100,000+ HTTP requests, capturing metadata such as location, age, and gender.
Real-Time Traffic Analysis with Cloud SQL & Pub/Sub:
- Logged all user requests in Google Cloud SQL (MySQL 8.0).
- Used Google Pub/Sub to track and handle requests from banned countries.
PageRank Computation with Google Cloud Dataflow:
- Processed the link structure of the hosted webpages in real time using Apache Beam on Cloud Dataflow.
- Ranked pages by PageRank score to identify high-authority pages.
Machine Learning for User Demographics Prediction:
- Trained an ML model to predict user demographics based on web request metadata.
- Achieved 99.7% accuracy in classification.
Scalable Orchestration with Google Kubernetes Engine (GKE):
- Deployed the system in GKE, ensuring high availability and fault tolerance.
- Used Google Deployment Manager for automated infrastructure setup.

System Architecture

1️⃣ Web Serving Layer

Google Cloud Storage (GCS): Stores 10,000 webpages.
Compute Engine (GCE VM): Serves webpages to users.

2️⃣ Data Ingestion & Storage

Cloud SQL (MySQL 8.0): Stores 100,000+ user requests.
Pub/Sub Topic & Subscription:
- Tracks banned country requests.
- Notifies the logging system of policy violations.

3️⃣ Real-Time Analytics & Processing

Google Cloud Dataflow (Apache Beam):
- Computes PageRank for all 10,000 webpages.
- Identifies influential pages for performance insights.
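
For readers unfamiliar with PageRank, the computation can be illustrated locally before moving it to Dataflow. The following is a minimal pure-Python sketch, not the Beam pipeline from this repo. The damping factor (0.85), the iteration count, and the page names are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iterative PageRank over a dict of {page: [pages it links to]}.

    damping and iterations are illustrative defaults, not values
    taken from this project.
    """
    # Collect every page, including ones that only appear as link targets.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start from a uniform distribution.
    ranks = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread evenly.
        dangling = sum(ranks[p] for p in pages if not links.get(p))
        new_ranks = {
            p: (1.0 - damping) / n + damping * dangling / n for p in pages
        }
        # Each page splits its rank equally among its outgoing links.
        for page, targets in links.items():
            if targets:
                share = damping * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
        ranks = new_ranks
    return ranks


if __name__ == "__main__":
    # Hypothetical three-page link graph.
    toy_links = {
        "0.html": ["1.html", "2.html"],
        "1.html": ["2.html"],
        "2.html": ["0.html"],
    }
    for page, score in sorted(pagerank(toy_links).items(), key=lambda kv: -kv[1]):
        print(page, round(score, 4))
```

The scores always sum to 1, so they can be read as the share of attention each page receives. In a Beam pipeline the same per-iteration step would typically be expressed as a group-by-target followed by a sum.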

4️⃣ Machine Learning & Deployment

ML Model:
- Predicts user demographics from metadata.
- Achieves 99.7% accuracy.
Google Kubernetes Engine (GKE):
- Deploys the prediction service for real-time analytics.
Google Deployment Manager:
- Automates resource provisioning.
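
Deployment Manager accepts templates written in Python, which makes the provisioning step easy to review alongside the rest of the code. The sketch below shows what a template for the web-serving VM might look like. It is not the template from the google-deployment-manager folder; the property names `zone` and `machineType` and the `-vm` name suffix are assumptions.

```python
COMPUTE_URL_BASE = "https://www.googleapis.com/compute/v1/"


def generate_config(context):
    """Deployment Manager entry point: returns the resources to create.

    context.env and context.properties are supplied by Deployment Manager;
    the property names read here are assumptions for this sketch.
    """
    project = context.env["project"]
    zone = context.properties["zone"]
    machine_type = context.properties["machineType"]

    vm = {
        "name": context.env["name"] + "-vm",
        "type": "compute.v1.instance",
        "properties": {
            "zone": zone,
            "machineType": (
                COMPUTE_URL_BASE
                + "projects/" + project
                + "/zones/" + zone
                + "/machineTypes/" + machine_type
            ),
            "disks": [{
                "boot": True,
                "autoDelete": True,
                "initializeParams": {
                    # Same image family as the gcloud command in Deployment Steps.
                    "sourceImage": COMPUTE_URL_BASE
                    + "projects/debian-cloud/global/images/family/debian-11",
                },
            }],
            "networkInterfaces": [{
                "network": COMPUTE_URL_BASE
                + "projects/" + project + "/global/networks/default",
                # An access config gives the VM an external IP.
                "accessConfigs": [{
                    "name": "External NAT",
                    "type": "ONE_TO_ONE_NAT",
                }],
            }],
        },
    }
    return {"resources": [vm]}
```

A YAML config would import this file and pass `zone` and `machineType` as properties, so the same template can be reused across environments.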

Deployment Steps

1️⃣ Prerequisites

Google Cloud SDK installed & authenticated.
Cloud Storage Bucket with webpages.
Cloud SQL Instance set up with MySQL.
Pub/Sub Topic & Subscription created.

2️⃣ Deploy Web Serving VM

gcloud compute instances create web-server \
    --zone=us-central1-a \
    --machine-type=e2-micro \
    --image-family=debian-11 \
    --image-project=debian-cloud \
    --metadata=startup-script-url=gs://your-bucket/startup-script.sh

3️⃣ Enable Web Traffic Logging

gcloud sql instances create web-traffic-db \
    --database-version=MYSQL_8_0 \
    --tier=db-f1-micro \
    --region=us-central1
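
The command above only creates the database instance; the table layout is not shown in this README. As a rough sketch of what request logging could look like, the example below uses Python's built-in `sqlite3` as a local stand-in for Cloud SQL so it runs without any cloud resources. The table name, column names, and sample values are assumptions based on the metadata listed under Key Features.

```python
import sqlite3

# Hypothetical schema; the real Cloud SQL table may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS requests (
    id        INTEGER PRIMARY KEY,
    page      TEXT NOT NULL,
    country   TEXT NOT NULL,
    age       TEXT,
    gender    TEXT,
    is_banned INTEGER NOT NULL DEFAULT 0
)
"""


def log_request(conn, page, country, age, gender, is_banned=False):
    """Insert one request row using a parameterized query."""
    # sqlite3 uses "?" placeholders; MySQL drivers typically use "%s".
    conn.execute(
        "INSERT INTO requests (page, country, age, gender, is_banned) "
        "VALUES (?, ?, ?, ?, ?)",
        (page, country, age, gender, int(is_banned)),
    )
    conn.commit()


def requests_by_country(conn):
    """Return {country: request count} for all logged requests."""
    rows = conn.execute(
        "SELECT country, COUNT(*) FROM requests GROUP BY country"
    )
    return dict(rows.fetchall())


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    log_request(conn, "1.html", "Canada", "26-35", "Female")
    log_request(conn, "2.html", "Canada", "36-45", "Male")
    print(requests_by_country(conn))
```

Parameterized queries matter here because the request metadata comes from client-supplied headers and should never be interpolated into SQL directly.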

4️⃣ Deploy Banned Country Tracker

from google.cloud import pubsub_v1

PROJECT_ID = "ds-561-mohitsai"
SUBSCRIPTION_ID = "banned-country-topic-sub"

def callback(message):
    print(f"Received banned country request: {message.data.decode('utf-8')}")
    message.ack()

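def listen_for_banned_requests():
    subscriber = pubsub_v1.SubscriberClient()
    subscription_path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION_ID)

    # subscribe() returns a future; callbacks run on background threads.
    future = subscriber.subscribe(subscription_path, callback=callback)
    print(f"Listening for messages on {subscription_path}...")

    # The context manager closes the client's connections on exit.
    with subscriber:
        try:
            future.result()  # Block until the stream fails or is cancelled.
        except KeyboardInterrupt:
            future.cancel()  # Signal the streaming pull to stop.
            future.result()  # Wait for the shutdown to finish.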

if __name__ == "__main__":
    listen_for_banned_requests()
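
The subscriber above only receives messages; something on the web-serving side has to publish them. The sketch below shows one possible shape for that publishing step. The banned-country list, the payload fields, and the function names are all hypothetical, and the publisher is passed in as a parameter so the logic can be exercised without a GCP project.

```python
import json

# Placeholder names; the actual banned-country list is not given in this README.
BANNED_COUNTRIES = {"Atlantis", "Freedonia"}


def build_banned_event(country, client_ip, path):
    """Encode a banned request as UTF-8 JSON bytes (Pub/Sub payloads are bytes)."""
    event = {"country": country, "client_ip": client_ip, "path": path}
    return json.dumps(event).encode("utf-8")


def report_if_banned(publisher, topic_path, country, client_ip, path):
    """Publish an event when the request comes from a banned country.

    publisher is any object with a publish(topic, data) method, such as
    google.cloud.pubsub_v1.PublisherClient. Returns True if an event
    was published.
    """
    if country not in BANNED_COUNTRIES:
        return False
    publisher.publish(topic_path, build_banned_event(country, client_ip, path))
    return True
```

With the real client, `publisher` would be a `pubsub_v1.PublisherClient()` and `topic_path` would come from `publisher.topic_path(project_id, topic_id)`. Because the payload is UTF-8 JSON, the `message.data.decode('utf-8')` call in the subscriber callback prints it as readable text.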

5️⃣ Compute PageRank on Cloud Dataflow

gcloud dataflow jobs run compute-pagerank \
    --gcs-location gs://your-bucket/path-to-pagerank-pipeline \
    --region us-central1

6️⃣ Train & Deploy ML Model on GKE

gcloud container clusters create analytics-cluster \
    --num-nodes=3 --zone=us-central1-a

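gcloud builds submit --tag gcr.io/$PROJECT_ID/predictor

# Point kubectl at the cluster created above.
gcloud container clusters get-credentials analytics-cluster \
    --zone=us-central1-a

# Run the predictor image as a Deployment on GKE.
kubectl create deployment predictor-service \
    --image=gcr.io/$PROJECT_ID/predictor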

Insights & Results

Top Web Pages: Identified via PageRank computation.
Traffic Analysis: Logged 100,000+ requests, categorized by region, age, and gender.
Banned Country Tracking: Requests flagged via Pub/Sub logs.
ML Model Accuracy: Achieved 99.7% accuracy in demographic predictions.
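
For context on the accuracy figure, classification accuracy is the fraction of predictions that match the true label, so 99.7% corresponds to 997 correct predictions out of every 1,000. A minimal sketch of the calculation follows; the label values are hypothetical and the function is not taken from the project code.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


if __name__ == "__main__":
    # Hypothetical labels: 997 of 1,000 predictions are correct.
    truth = ["Female"] * 1000
    predicted = ["Female"] * 997 + ["Male"] * 3
    print(f"{accuracy(truth, predicted):.1%}")
```

Accuracy alone can hide weak performance on rare classes, so a per-class breakdown is worth checking when the demographic groups are unevenly represented.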

Contributing & Usage

Modify and adapt configurations as needed.
Ensure credentials & permissions are set correctly.
Star ⭐ the repo if you found this useful!

Contact

Feel free to reach out via:

LinkedIn
Email
