Merge pull request #99 from Thomas-George-T/feature_machine_learning
Feature machine learning
Moheth2000 committed Dec 13, 2023
2 parents a9f0cad + 4e622a8 commit bbe7781
Showing 7 changed files with 195 additions and 49 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -91,3 +91,5 @@ logs

# environment files
.env

ecommerce-mlops-406821-40598235283c.json
117 changes: 76 additions & 41 deletions README.md
@@ -1,5 +1,10 @@
[![Pytest](https://github.com/Thomas-George-T/Ecommerce-Data-MLOps/actions/workflows/pytest.yml/badge.svg)](https://github.com/Thomas-George-T/Ecommerce-Data-MLOps/actions/workflows/pytest.yml)
# Ecommerce Customer Segmentation & MLOps
[Ashkan Ghanavati](https://github.com/AshyScripts)
[Bardia Mouhebat](https://github.com/baridamm)
[Komal Pardeshi](https://github.com/kokomocha)
[Moheth Muralidharan](https://github.com/Moheth2000)
[Thomas George Thomas](https://github.com/Thomas-George-T)

<p align="center">
<br>
@@ -162,6 +167,8 @@ The following is the explanation of our Data pipeline DAG

## Data Pipeline Components

![Data Pipeline](assets/Data_Pipeline.png "Data Pipeline")

The data pipeline in this project consists of several interconnected modules, each performing specific tasks to process the data. We utilize Airflow and Docker to orchestrate and containerize these modules, with each module functioning as a task in the main data pipeline DAG (`datapipeline`).
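The DAG below is a minimal sketch of how these modules could be wired together in Airflow; the task names and callables are illustrative assumptions, not the exact operators defined in the repository.

```python
# A minimal sketch of the `datapipeline` DAG; task names and callables here are
# illustrative assumptions, not the exact modules defined in the repository.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def download_data():
    """Placeholder for the module that fetches the raw e-commerce dataset."""


def preprocess_data():
    """Placeholder for the module that cleans and transforms the downloaded data."""


with DAG(
    dag_id="datapipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    download = PythonOperator(task_id="download_data", python_callable=download_data)
    preprocess = PythonOperator(task_id="preprocess_data", python_callable=preprocess_data)

    # Each module runs as a task; downstream tasks wait for upstream ones to finish.
    download >> preprocess
```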

### 1. Downloading Data:
@@ -257,7 +264,75 @@ In managing models for Staging, Production, and Archiving, we rely on MLflow.
![Distribution_of_clusters](assets/Distribtion_customers.png)

<p align="center">The plot above visualises the distribution of customers into clusters.</p>


# Model Insights

## Segmentation Clusters

### Cluster 0

Profile: Recurrent High Spenders with High Cancellations

- Customers in this cluster buy a wide range of unique products and have very high overall spending.
- They transact frequently, but they also cancel transactions often and at a high rate.
- They typically shop early in the day (low Hour value) and have very short average intervals between transactions.
- Their high monthly spending variability suggests that their spending patterns are less predictable than those of other clusters.
- Despite their high expenditure, they show a low spending trend, suggesting that their high spending levels may decline over time.

![Cluster 0](data/plots/Cluster0.jpeg)

### Cluster 1

Profile: Intermittent Big Spenders with a High Spending Trend

- Customers in this cluster spend moderately but transact infrequently, as shown by their high Days_Since_Last_Purchase and Average_Days_Between_Purchases values.
- Their spending trend is very high, indicating that they have been spending more over time.
- These customers, who are primarily from the UK, prefer to shop late in the day, as shown by the high Hour value.
- They cancel a modest number of transactions, with a moderate cancellation frequency and rate.
- Their comparatively high average transaction value indicates that when they do shop, they tend to make large purchases.

![Cluster 1](data/plots/Cluster1.jpeg)

### Cluster 2

Profile: Sporadic Shoppers with a Proclivity for Weekend Shopping

- Customers in this cluster make fewer purchases and spend less money overall.
- Their high Day_of_Week value suggests a slight inclination toward weekend shopping.
- Their monthly spending variation is low (low Monthly_Spending_Std), and their spending trend is steady but on the lower side.
- They have a low cancellation frequency and rate, indicating that they rarely cancel transactions.
- When they do shop, they typically spend less per transaction, as shown by their lower average transaction value.


![Cluster 2](data/plots/Cluster2.jpeg)

## Customer RFM Trends based on Clusters

![Customer Trends Histogram](data/plots/histogram_analysis.png)


<hr>

# Cost Analysis

Breakdown of the costs associated with the machine learning pipeline on Google Cloud Platform (GCP), hosted in the us-east1 region.

## Initial Cost Analysis

Model Training using Vertex AI: $3.58

Deploying Model: $1.75

Total Training and Deployment Cost: $5.33

## Serving Analysis

Daily Online Prediction for Model Serving: $6.63

Weekly serving cost: $46.41

Monthly serving cost: $185.64

Yearly serving cost: $2,423.72
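As a rough sanity check, the recurring figures above scale directly from the daily serving cost. The sketch below assumes a 7-day week, a 28-day month (inferred from the numbers shown), and a 365-day year, so the yearly figure is an approximation of the one reported above.

```python
# Back-of-the-envelope scaling of the daily online-prediction cost.
# The 28-day month is an assumption inferred from the figures above.
DAILY_SERVING_COST = 6.63  # USD per day

weekly = DAILY_SERVING_COST * 7    # ~46.41 USD
monthly = DAILY_SERVING_COST * 28  # ~185.64 USD
yearly = DAILY_SERVING_COST * 365  # ~2,420 USD, close to the reported 2,423.72

print(f"Weekly: ${weekly:,.2f}  Monthly: ${monthly:,.2f}  Yearly: ${yearly:,.2f}")
```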

<hr>

@@ -359,43 +434,3 @@ Most important declarations in the code:
```
<hr>

Binary file added assets/Data_Pipeline.png
7 changes: 2 additions & 5 deletions gcpdeploy/src/inference.py
Expand Up @@ -48,18 +48,15 @@ def predict_custom_trained_model(

predict_custom_trained_model(
project="1002663879452",
endpoint_id="363199476979990528",
endpoint_id="3665182428772696064",
location="us-east1",
instances= {
"instances": [
{

"PC1": 1000.595596,
"PC2": -0.944713,
"PC3": 0.340492,
"PC4": 1.335999,
"PC5": 0.135310,
"PC6": 0.506377
}
]
}
)
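The body of `predict_custom_trained_model` is collapsed in this diff; the sketch below shows the shape such a helper usually takes, following the standard google-cloud-aiplatform online-prediction pattern. The implementation in `gcpdeploy/src/inference.py` may differ in its details.

```python
# Sketch of a Vertex AI online-prediction helper (standard google-cloud-aiplatform
# pattern); the actual implementation in gcpdeploy/src/inference.py may differ.
from google.cloud import aiplatform
from google.protobuf import json_format
from google.protobuf.struct_pb2 import Value


def predict_custom_trained_model(
    project: str,
    endpoint_id: str,
    instances,
    location: str = "us-east1",
    api_endpoint: str = "us-east1-aiplatform.googleapis.com",
):
    client_options = {"api_endpoint": api_endpoint}
    client = aiplatform.gapic.PredictionServiceClient(client_options=client_options)

    # Accept a single dict or a list of dicts and convert them to protobuf Values.
    instances = instances if isinstance(instances, list) else [instances]
    instances = [json_format.ParseDict(inst, Value()) for inst in instances]
    parameters = json_format.ParseDict({}, Value())

    endpoint = client.endpoint_path(project=project, location=location, endpoint=endpoint_id)
    response = client.predict(endpoint=endpoint, instances=instances, parameters=parameters)

    for prediction in response.predictions:
        print("prediction:", dict(prediction))
```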
6 changes: 4 additions & 2 deletions gcpdeploy/src/serve/Dockerfile
Expand Up @@ -6,16 +6,18 @@ WORKDIR /app
# Copy the current directory contents into the container at /app
COPY serve/predict.py /app/

COPY serve/ecommerce-mlops-406821-40598235283c.json /app/

# Install Flask and google-cloud-storage
RUN pip install Flask google-cloud-storage joblib scikit-learn grpcio gcsfs python-dotenv pandas flask
RUN pip install Flask google-cloud-storage joblib scikit-learn grpcio gcsfs python-dotenv pandas flask google-cloud-logging google-cloud-bigquery google-auth

ENV AIP_STORAGE_URI=gs://ecommerce_retail_online_mlops/model
ENV AIP_HEALTH_ROUTE=/ping
ENV AIP_PREDICT_ROUTE=/predict
ENV AIP_HTTP_PORT=8080
ENV BUCKET_NAME=ecommerce_retail_online_mlops
ENV PROJECT_ID=ecommerce-mlops-406821

ENV BIGQUERY_TABLE_ID=ecommerce-mlops-406821.mlops_project_dataset.model_monitoring_copy

# Run serve.py when the container launches
ENTRYPOINT ["python", "predict.py"]
109 changes: 108 additions & 1 deletion gcpdeploy/src/serve/predict.py
@@ -1,10 +1,19 @@
from flask import Flask, jsonify, request
from google.cloud import storage
# from google.cloud import storage
import joblib
import os
import json
from dotenv import load_dotenv
import pandas as pd
# Experimental Start
import time
from datetime import datetime
from google.cloud import storage, logging, bigquery
from google.cloud.bigquery import SchemaField
from google.api_core.exceptions import NotFound
from google.oauth2 import service_account
from google.logging.type import log_severity_pb2 as severity
# Experimental End

load_dotenv()

@@ -13,6 +22,57 @@

app = Flask(__name__)

## Experimental start
# Set up Google Cloud logging
service_account_file = 'ecommerce-mlops-406821-40598235283c.json'
credentials = service_account.Credentials.from_service_account_file(service_account_file)
client = logging.Client(credentials=credentials)
logger = client.logger('training_pipeline')
# Initialize BigQuery client
bq_client = bigquery.Client(credentials=credentials)
table_id = os.environ['BIGQUERY_TABLE_ID']


def get_table_schema():
"""Build the table schema for the output table
Returns:
List: List of `SchemaField` objects"""
return [

SchemaField("PC1", "FLOAT", mode="NULLABLE"),
SchemaField("PC2", "FLOAT", mode="NULLABLE"),
SchemaField("PC3", "FLOAT", mode="NULLABLE"),
SchemaField("PC4", "FLOAT", mode="NULLABLE"),
SchemaField("PC5", "FLOAT", mode="NULLABLE"),
SchemaField("PC6", "FLOAT", mode="NULLABLE"),
SchemaField("prediction", "FLOAT", mode="NULLABLE"),
SchemaField("timestamp", "TIMESTAMP", mode="NULLABLE"),
SchemaField("latency", "FLOAT", mode="NULLABLE"),
]


def create_table_if_not_exists(client, table_id, schema):
"""Create a BigQuery table if it doesn't exist
Args:
client (bigquery.client.Client): A BigQuery Client
table_id (str): The ID of the table to create
schema (List): List of `SchemaField` objects
Returns:
None"""
try:
client.get_table(table_id) # Make an API request.
print("Table {} already exists.".format(table_id))
except NotFound:
print("Table {} is not found. Creating table...".format(table_id))
table = bigquery.Table(table_id, schema=schema)
client.create_table(table) # Make an API request.
print("Created table {}.{}.{}".format(table.project, table.dataset_id, table.table_id))

## Experimental End

def initialize_variables():
"""
Initialize environment variables.
@@ -101,8 +161,50 @@ def predict():

request_instances = request_json['instances']

## Experimental start
logger.log_text("Received prediction request.", severity='INFO')

prediction_start_time = time.time()
current_timestamp = datetime.now().isoformat()
## Experimental end

prediction = model.predict(pd.DataFrame(list(request_instances)))

## Experimental start
prediction_end_time = time.time()
prediction_latency = prediction_end_time - prediction_start_time
## Experimental end

prediction = prediction.tolist()

## Experimental start

logger.log_text(f"Prediction results: {prediction}", severity='INFO')

rows_to_insert = [
{
"PC1": instance['PC1'],
"PC2": instance['PC2'],
"PC3": instance['PC3'],
"PC4": instance['PC4'],
"PC5": instance['PC5'],
"PC6": instance['PC6'],
"prediction": pred,
"timestamp": current_timestamp,
"latency": prediction_latency
}
for instance, pred in zip(request_instances, prediction)
]

errors = bq_client.insert_rows_json(table_id, rows_to_insert)
if errors == []:
logger.log_text("New predictions inserted into BigQuery.", severity='INFO')
else:
logger.log_text(f"Encountered errors inserting predictions into BigQuery: {errors}", severity='ERROR')


## Experiment end
# print("prediction",prediction)
output = {'predictions': [{'cluster': pred} for pred in prediction]}
return jsonify(output)

@@ -112,6 +214,11 @@ def predict():

model = load_model(bucket, bucket_name)

## Experiment start
schema = get_table_schema()
create_table_if_not_exists(bq_client, table_id, schema)
## Experiment end


if __name__ == '__main__':
app.run(host='0.0.0.0', port=8080)
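For local testing, the snippet below is a hedged sketch of how the prediction route could be exercised, assuming the container is running on localhost:8080 and the Flask route is registered at the `AIP_PREDICT_ROUTE` (`/predict`) configured in the Dockerfile; the payload mirrors the PC1–PC6 schema logged to BigQuery.

```python
# Minimal local smoke test for the serving container's /predict route.
# Assumes the container is running locally and listening on port 8080.
import requests

payload = {
    "instances": [
        {
            "PC1": 1000.595596,
            "PC2": -0.944713,
            "PC3": 0.340492,
            "PC4": 1.335999,
            "PC5": 0.135310,
            "PC6": 0.506377,
        }
    ]
}

resp = requests.post("http://localhost:8080/predict", json=payload)
resp.raise_for_status()
print(resp.json())  # e.g. {"predictions": [{"cluster": ...}]}
```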
3 changes: 3 additions & 0 deletions requirements.txt
@@ -43,3 +43,6 @@ gcsfs
python-dotenv
kaleido==0.2.1
grpcio==1.51.3
google-cloud-logging
google-cloud-bigquery
google-auth
