This project simulates a large-scale web analytics system using Google Cloud Platform (GCP). The system deploys 10,000 webpages on Google Cloud Storage, tracks 100,000+ user requests, and applies machine learning to predict user demographics based on web traffic behavior. The project incorporates cloud-native big data processing, machine learning, and scalable orchestration.
- Web Traffic Simulation:
- Hosted 10,000 webpages in Google Cloud Storage.
- Served content via a Google Compute Engine (GCE) Virtual Machine.
- Tracked 100,000+ HTTP requests, capturing metadata such as location, age, and gender.
- Real-Time Traffic Analysis with Cloud SQL & Pub/Sub:
- Logged all user requests in Google Cloud SQL (MySQL 8.0).
- Used Google Pub/Sub to track and handle requests from banned countries.
- PageRank Computation with Google Cloud Dataflow:
- Processed webpage link structures in real time using Apache Beam on Cloud Dataflow.
- Identified high-authority pages to inform search and ranking analysis.
- Machine Learning for User Demographics Prediction:
- Trained an ML model to predict user demographics based on web request metadata.
- Achieved 99.7% accuracy in classification.
- Scalable Orchestration with Google Kubernetes Engine (GKE):
- Deployed the system on GKE for high availability and fault tolerance.
- Used Google Deployment Manager for automated infrastructure setup.
1️⃣ Web Serving Layer
- Google Cloud Storage (GCS): Stores 10,000 webpages.
- Compute Engine (GCE VM): Serves webpages to users.
2️⃣ Data Ingestion & Storage
- Cloud SQL (MySQL 8.0): Stores 100,000+ user requests.
- Pub/Sub Topic & Subscription:
- Tracks requests from banned countries.
- Notifies the logging system of policy violations.
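The request log described above can be pictured as a single table. The sketch below uses Python's built-in `sqlite3` with an in-memory database as a local stand-in for the Cloud SQL (MySQL 8.0) instance; the table name, column names, and the `banned` flag are assumptions for illustration, not the project's actual schema.

```python
import sqlite3

# Hypothetical schema: one row per HTTP request (column names are assumed).
SCHEMA = """
CREATE TABLE requests (
    id INTEGER PRIMARY KEY,
    page TEXT NOT NULL,
    country TEXT NOT NULL,
    age_group TEXT,
    gender TEXT,
    banned INTEGER NOT NULL DEFAULT 0
)
"""


def log_request(conn, page, country, age_group, gender, banned=False):
    """Insert one request row using a parameterized query."""
    conn.execute(
        "INSERT INTO requests (page, country, age_group, gender, banned) "
        "VALUES (?, ?, ?, ?, ?)",
        (page, country, age_group, gender, int(banned)),
    )


def requests_by_country(conn):
    """Return (country, request_count) pairs, most frequent first."""
    rows = conn.execute(
        "SELECT country, COUNT(*) FROM requests "
        "GROUP BY country ORDER BY COUNT(*) DESC, country"
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    log_request(conn, "1.html", "US", "26-35", "Female")
    log_request(conn, "2.html", "XX", "16-25", "Male", banned=True)
    print(requests_by_country(conn))
```

Against the real MySQL instance the same statements would need a MySQL driver and its placeholder style (typically `%s` rather than `?`).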
3️⃣ Real-Time Analytics & Processing
- Google Cloud Dataflow (Apache Beam):
- Computes PageRank for all 10,000 webpages.
- Identifies influential pages for performance insights.
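To make the PageRank step concrete, here is a minimal single-machine sketch of power-iteration PageRank in plain Python. It is not the project's Apache Beam pipeline; the function name, the damping factor of 0.85, the iteration count, and the toy link graph are all assumptions chosen for illustration.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Power-iteration PageRank over a mapping of page -> outgoing links."""
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread evenly.
        dangling = sum(ranks[p] for p in pages if not links.get(p))
        new_ranks = {
            p: (1.0 - damping) / n + damping * dangling / n for p in pages
        }
        # Each page passes an equal share of its rank to every page it links to.
        for page, targets in links.items():
            if not targets:
                continue
            share = damping * ranks[page] / len(targets)
            for target in targets:
                new_ranks[target] += share
        ranks = new_ranks
    return ranks


if __name__ == "__main__":
    toy_links = {
        "a.html": ["b.html", "c.html"],
        "b.html": ["c.html"],
        "c.html": ["a.html"],
        "d.html": ["c.html"],
    }
    ranked = sorted(pagerank(toy_links).items(), key=lambda kv: -kv[1])
    for page, rank in ranked:
        print(f"{page}: {rank:.4f}")
```

In the toy graph, `c.html` receives links from three pages and ends up with the highest rank, which is the sense in which a page counts as influential here.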
4️⃣ Machine Learning & Deployment
- ML Model:
- Predicts user demographics from metadata.
- Achieves 99.7% accuracy.
- Google Kubernetes Engine (GKE):
- Deploys the prediction service for real-time analytics.
- Google Deployment Manager:
- Automates resource provisioning.
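For reference, a classification accuracy figure is the fraction of predictions that match the true labels. The sketch below shows that calculation next to a majority-class baseline, which is a common way to put an accuracy number in context. The function names and example labels are hypothetical and are not taken from the project's model or data.

```python
from collections import Counter
from typing import Hashable, Sequence


def accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Fraction of predictions that equal the true label."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def majority_baseline(y_true: Sequence[Hashable]) -> float:
    """Accuracy obtained by always predicting the most common label."""
    if not y_true:
        raise ValueError("y_true must be non-empty")
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)


if __name__ == "__main__":
    # Hypothetical labels, for illustration only.
    truth = ["26-35", "16-25", "26-35", "26-35"]
    preds = ["26-35", "16-25", "16-25", "26-35"]
    print(f"accuracy: {accuracy(truth, preds):.2f}")
    print(f"majority baseline: {majority_baseline(truth):.2f}")
```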
- Google Cloud SDK installed & authenticated.
- Cloud Storage Bucket with webpages.
- Cloud SQL Instance set up with MySQL.
- Pub/Sub Topic & Subscription created.
gcloud compute instances create web-server \
  --zone=us-central1-a \
  --machine-type=e2-micro \
  --image-family=debian-11 \
  --image-project=debian-cloud \
  --metadata=startup-script-url=gs://your-bucket/startup-script.sh

gcloud sql instances create web-traffic-db \
  --database-version=MYSQL_8_0 \
  --tier=db-f1-micro \
  --region=us-central1
from google.cloud import pubsub_v1

PROJECT_ID = "ds-561-mohitsai"
SUBSCRIPTION_ID = "banned-country-topic-sub"


def callback(message):
    # Log the flagged request, then acknowledge it so Pub/Sub does not redeliver it.
    print(f"Received banned country request: {message.data.decode('utf-8')}")
    message.ack()


def listen_for_banned_requests():
    subscriber = pubsub_v1.SubscriberClient()
    subscription_path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION_ID)
    future = subscriber.subscribe(subscription_path, callback=callback)
    print(f"Listening for messages on {subscription_path}...")
    try:
        # Block the main thread while messages are handled on background threads.
        future.result()
    except KeyboardInterrupt:
        # Stop pulling messages and wait for the shutdown to complete.
        future.cancel()
        future.result()


if __name__ == "__main__":
    listen_for_banned_requests()
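The subscriber above needs something to publish the messages it consumes. A minimal sketch of that publishing side follows. The topic name `banned-country-topic` is inferred from the subscription name and may differ in the real project, while the project ID, the banned-country codes, and the JSON payload shape are placeholders invented for illustration.

```python
import json

PROJECT_ID = "your-project-id"      # placeholder, replace with your project
TOPIC_ID = "banned-country-topic"   # assumed from the subscription name
BANNED_COUNTRIES = {"XX", "YY"}     # placeholder codes, not the real list


def build_banned_message(country, page):
    """Return the UTF-8 payload to publish, or None if the country is allowed."""
    if country not in BANNED_COUNTRIES:
        return None
    return json.dumps({"country": country, "page": page}).encode("utf-8")


def publish_if_banned(country, page):
    """Publish a notification for a banned-country request; return its message ID."""
    payload = build_banned_message(country, page)
    if payload is None:
        return None
    # Imported lazily so the payload logic works without the client library.
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)
    future = publisher.publish(topic_path, data=payload)
    # Block until Pub/Sub confirms the publish and returns the message ID.
    return future.result()
```

Keeping the payload builder separate from the network call means the banned-country check can be exercised locally without credentials.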
gcloud dataflow jobs run compute-pagerank \
  --gcs-location gs://your-bucket/path-to-pagerank-pipeline \
  --region us-central1

gcloud container clusters create analytics-cluster \
  --num-nodes=3 --zone=us-central1-a

gcloud builds submit --tag gcr.io/$PROJECT_ID/predictor

gcloud run deploy predictor-service \
  --image gcr.io/$PROJECT_ID/predictor \
  --platform managed --region us-central1
- Top Web Pages: Identified via PageRank computation.
- Traffic Analysis: Logged 100,000+ requests, categorized by region, age, and gender.
- Banned Country Tracking: Requests flagged via Pub/Sub logs.
- ML Model Accuracy: Achieved 99.7% accuracy in demographic predictions.
- Modify and adapt configurations as needed.
- Ensure credentials & permissions are set correctly.
- Star ⭐ the repo if you found this useful!
© 2025 Mohit Sai Gutha | Built using Google Cloud, Dataflow & GKE