# MLOps

## 1. Data Management and Feature Engineering
Importance: Data is the foundation of machine learning models. Proper management ensures high-quality, relevant data for training. Feature engineering transforms raw data into features that improve model performance.

Benefits:

- Better model accuracy through relevant features.
- Less noise in the data, which leads to more robust models.
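
As a minimal sketch of what feature engineering can look like on sales-style data, the block below derives calendar and ratio features from raw columns. The toy values, the helper `add_features`, and the derived column names (`month`, `day_of_week`, `profit_per_unit`) are illustrative assumptions and are not part of the pipeline used in this notebook.

```python
import pandas as pd

# Toy data with hypothetical values; column names mirror the sales data used in this notebook.
raw = pd.DataFrame({
    "Invoice Date": pd.to_datetime(["2020-01-01", "2020-02-15", "2020-03-31"]),
    "Units Sold": [100, 200, 50],
    "Operating Profit": [500.0, 800.0, 300.0],
})

def add_features(df):
    """Return a copy of df with simple calendar and ratio features added."""
    out = df.copy()
    out["month"] = out["Invoice Date"].dt.month
    out["day_of_week"] = out["Invoice Date"].dt.dayofweek  # Monday=0
    out["profit_per_unit"] = out["Operating Profit"] / out["Units Sold"]
    return out

features = add_features(raw)
print(features[["month", "day_of_week", "profit_per_unit"]])
```

Working on a copy keeps the raw frame unchanged, so the same input can be reused with a different feature set.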

In [2]:
import pandas as pd

# Optimize the DataFrame's dtypes to reduce its memory footprint
def optimize_data_types(data):
    """Optimize data types to reduce memory usage (modifies and returns `data`)."""
    for column in data.columns:
        column_type = data[column].dtype

        # Booleans are already compact; test them first because pandas
        # also reports bool as a numeric dtype.
        if pd.api.types.is_bool_dtype(column_type):
            continue

        # Downcast numeric columns to the smallest type that holds their values
        if pd.api.types.is_integer_dtype(column_type):
            data[column] = pd.to_numeric(data[column], downcast='integer')
        elif pd.api.types.is_float_dtype(column_type):
            # Note: downcasting to float32 can lose precision
            data[column] = pd.to_numeric(data[column], downcast='float')

        # Text columns: use category when values repeat often, else the string dtype
        elif pd.api.types.is_object_dtype(column_type):
            unique_values = data[column].nunique()
            total_values = data[column].size
            # 50% unique values threshold (guarded against empty columns)
            if total_values > 0 and unique_values / total_values < 0.5:
                data[column] = data[column].astype('category')
            else:
                data[column] = data[column].astype('string')

    return data
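
To make the effect of these conversions concrete, here is a small self-contained sketch that measures memory before and after applying the same two steps (integer downcasting and category conversion) directly. The frame, its column names, and its values are hypothetical, and the actual savings depend on the data and the pandas version.

```python
import pandas as pd

# Hypothetical frame: a low-cardinality text column and small integers stored as int64.
df = pd.DataFrame({
    "region": ["West", "East", "West", "East"] * 250,
    "units": list(range(1000)),
})
df["units"] = df["units"].astype("int64")

before = df.memory_usage(deep=True).sum()

# The same two steps optimize_data_types applies: downcast integers, categorize repeated strings.
df["units"] = pd.to_numeric(df["units"], downcast="integer")
df["region"] = df["region"].astype("category")

after = df.memory_usage(deep=True).sum()
print(f"before: {before} bytes, after: {after} bytes")
print(df.dtypes)
```

`memory_usage(deep=True)` is used so that the size of the Python string objects is included in the comparison.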

### Local Machine (In House)

In [4]:
# Local machine: load the Excel export; the column headers are on row index 4
data_org = pd.read_excel('../data/Adidas.xlsx', header=4).drop(['Unnamed: 0'], axis=1)
data_org['Retailer ID'] = data_org['Retailer ID'].astype("string")

# Aggregate the numeric columns per unique combination of the grouping keys
data_org = data_org.groupby(["Invoice Date", 'City', 'Retailer', 'Product', 'Region', 'Retailer ID', 'State', 'Sales Method']).sum()

# Downcast dtypes to reduce memory usage
data_org = optimize_data_types(data_org)

# Move 'Invoice Date' (index level 0) back into a regular column
data_org = data_org.reset_index(level=0)

# Join the remaining index levels into a single string key
data_org.index = data_org.index.map('_'.join)

# Expose the key as an 'ID' column
data_org = data_org.reset_index().rename(columns={"index": "ID"})
data_org.head()


| | ID | Invoice Date | Price per Unit | Units Sold | Total Sales | Operating Profit | Operating Margin |
|---|---|---|---|---|---|---|---|
| 0 | New York_Foot Locker_Men's Street Footwear_Nor... | 2020-01-01 | 50.0 | 1200 | 600000.0 | 300000.0 | 0.5 |
| 1 | New York_Foot Locker_Men's Street Footwear_Nor... | 2020-01-01 | 47.0 | 336 | 15792.0 | 9633.12 | 0.61 |
| 2 | New York_Foot Locker_Men's Street Footwear_Nor... | 2020-01-01 | 34.0 | 384 | 13056.0 | 6789.12 | 0.52 |
| 3 | Philadelphia_Foot Locker_Women's Apparel_North... | 2020-01-01 | 68.0 | 83 | 5644.0 | 2426.92 | 0.43 |
| 4 | Philadelphia_Foot Locker_Women's Apparel_North... | 2020-01-01 | 128.0 | 358 | 210649.0 | 63282.68 | 0.62 |
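
The ID construction used in the cell above can be hard to follow on the full dataset, so the sketch below replays it on a tiny frame. The rows are hypothetical and only three grouping keys are used; the sequence of calls (`groupby(...).sum()`, `reset_index(level=0)`, `index.map('_'.join)`, `reset_index().rename(...)`) follows the cell above.

```python
import pandas as pd

# Hypothetical rows; only three of the grouping keys are used to keep the sketch short.
sales = pd.DataFrame({
    "Invoice Date": ["2020-01-01", "2020-01-01", "2020-01-02"],
    "City": ["New York", "New York", "Philadelphia"],
    "Retailer": ["Foot Locker", "Foot Locker", "Foot Locker"],
    "Units Sold": [10, 5, 7],
})

# Aggregate per key combination; the keys become a MultiIndex.
grouped = sales.groupby(["Invoice Date", "City", "Retailer"]).sum()

# Move the date back into a column, leaving (City, Retailer) tuples in the index.
grouped = grouped.reset_index(level=0)

# Join each index tuple into one string and expose it as an ID column.
grouped.index = grouped.index.map("_".join)
grouped = grouped.reset_index().rename(columns={"index": "ID"})
print(grouped)
```

Note that `'_'.join` only works when every remaining index level holds strings, which is presumably why `Retailer ID` is cast to a string type before grouping.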


### Azure ML
- **Services**: Data preparation services, Azure Data Factory.
- **Pros**: Integrated data services; Azure Data Factory allows for complex data workflows.
- **Cons**: Some features may have a steep learning curve.


In [14]:
#!pip install azure-identity azure-mgmt-resource azure-mgmt-sql azure-mgmt-storage azure-mgmt-datafactory azure-storage-file-share azure-ai-ml pyodbc

ERROR: Could not find a version that satisfies the requirement azure-mgmt-machinelearning (from versions: none)
ERROR: No matching distribution found for azure-mgmt-machinelearning

In [44]:
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient
from azure.mgmt.sql import SqlManagementClient
from azure.storage.fileshare import ShareServiceClient
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.datafactory import DataFactoryManagementClient
#from azure.mgmt.ml import MachineLearningServiceClient
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Workspace
from azure.ai.ml.entities import Data
from azure.ai.ml import Input, Output
from azure.ai.ml import dsl, load_component
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml import command

In [51]:
# Azure credentials and configuration
# Secrets are read from environment variables instead of being hardcoded.
# The variable name AZURE_SQL_ADMIN_PASSWORD is an assumption; use any name you like.
import os

credential = DefaultAzureCredential()
subscription_id = os.environ['AZURE_SUBSCRIPTION_ID']  # your subscription ID
resource_group_name = 'azur_ml'  # your resource group name
location = 'centralus'  # your location
svr_name = 'azurmlserver'  # your SQL server name
db_name = 'azurmldb'  # your database name
admin_login = 'azurmlusername'  # your admin username
admin_login_pass = os.environ['AZURE_SQL_ADMIN_PASSWORD']  # your admin password


In [46]:

# Create Resource Management client
resource_client = ResourceManagementClient(credential, subscription_id)


In [47]:

def create_resource_group(resource_client):
    print(f"Creating resource group '{resource_group_name}'...")
    resource_group = resource_client.resource_groups.create_or_update(
        resource_group_name,
        {'location': location}
    )
    print(f"Resource group '{resource_group_name}' created.")


In [48]:
create_resource_group(resource_client)

Creating resource group 'azur_ml'...


Resource group 'azur_ml' created.


In [35]:
def delete_resource_group(resource_client):
    print(f"Deleting resource group '{resource_group_name}'...")
    resource_client.resource_groups.begin_delete(resource_group_name).result()
    print(f"Resource group '{resource_group_name}' deleted.")

In [43]:
delete_resource_group(resource_client)

Deleting resource group 'azur_ml'...
Resource group 'azur_ml' deleted.


In [52]:

def create_sql_database(credential, subscription_id, svr_name, db_name, admin_login, admin_login_pass):
    sql_client = SqlManagementClient(credential, subscription_id)
    server_name = svr_name #'your-sql-server-name'
    database_name = db_name #'your-database-name'

    print(f"Creating SQL server '{server_name}'...")
    server = sql_client.servers.begin_create_or_update(
        resource_group_name,
        server_name,
        {
            'location': location,
            'administrator_login': admin_login, #'your_admin_username',
            'administrator_login_password': admin_login_pass #'your_admin_password'
        }
    ).result()

    print(f"Creating SQL database '{database_name}'...")
    database = sql_client.databases.begin_create_or_update(
        resource_group_name,
        server_name,
        database_name,
        {
            'location': location,
            'sku': {'name': 'S0'}
        }
    ).result()
    print(f"SQL database '{database_name}' created.")



In [53]:
create_sql_database(credential, subscription_id, svr_name, db_name, admin_login, admin_login_pass)

Creating SQL server 'azurmlserver'...
Creating SQL database 'azurmldb'...
SQL database 'azurmldb' created.


In [None]:
# Delete SQL Server
def delete_sql_server(credential, subscription_id, resource_group_name, server_name):
    # server_name must be the same name used when the server was created
    sql_client = SqlManagementClient(credential, subscription_id)

    print(f"Deleting SQL server '{server_name}'...")
    sql_client.servers.begin_delete(
        resource_group_name,
        server_name
    ).wait()
    print(f"SQL server '{server_name}' deleted.")

In [None]:
def create_storage_account_and_file_share():
    storage_client = StorageManagementClient(credential, subscription_id)
    storage_account_name = 'yourstorageaccount'
    file_share_name = 'your-file-share'

    print(f"Creating storage account '{storage_account_name}'...")
    storage_account = storage_client.storage_accounts.begin_create(
        resource_group_name,
        storage_account_name,
        {
            'location': location,
            'kind': 'StorageV2',
            'sku': {'name': 'Standard_LRS'}
        }
    ).result()

    print(f"Creating file share '{file_share_name}'...")
    storage_client.file_shares.create(
        resource_group_name,
        storage_account_name,
        file_share_name,
        {}
    )
    print(f"File share '{file_share_name}' created.")

def create_data_factory():
    adf_client = DataFactoryManagementClient(credential, subscription_id)
    factory_name = 'your-data-factory-name'

    print(f"Creating Data Factory '{factory_name}'...")
    # factories.create_or_update returns the factory directly (it is not a poller)
    factory = adf_client.factories.create_or_update(
        resource_group_name,
        factory_name,
        {
            'location': location
        }
    )
    print(f"Data Factory '{factory_name}' created.")

def create_ml_workspace():
    # Uses azure-ai-ml (imported above) because MachineLearningServiceClient is not imported
    ml_client = MLClient(credential, subscription_id, resource_group_name)
    workspace_name = 'your-ml-workspace-name'

    print(f"Creating Machine Learning workspace '{workspace_name}'...")
    workspace = Workspace(name=workspace_name, location=location)
    # begin_create returns a poller; result() blocks until provisioning finishes
    workspace = ml_client.workspaces.begin_create(workspace).result()
    print(f"Machine Learning workspace '{workspace_name}' created.")

def perform_data_preparation():
    # This is a simplified example. In a real scenario, you would use the Azure ML SDK
    # to create a dataset, define a data preparation step, and run an experiment.
    print("Performing data preparation...")
    ml_client = MLClient(credential, subscription_id, resource_group_name, 'your-ml-workspace-name')

    # Register a placeholder data asset; create_or_update expects a Data entity
    data_asset = Data(
        name='your-dataset-name',
        path='path/to/your/data',
        type=AssetTypes.URI_FILE
    )
    dataset = ml_client.data.create_or_update(data_asset)

    print("Data preparation complete.")

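if __name__ == "__main__":
    # Pass the arguments each function's signature requires
    create_resource_group(resource_client)
    create_sql_database(credential, subscription_id, svr_name, db_name, admin_login, admin_login_pass)
    create_storage_account_and_file_share()
    create_data_factory()
    create_ml_workspace()
    perform_data_preparation()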


### AWS SageMaker
- **Services**: SageMaker Data Wrangler, SageMaker Ground Truth.
- **Pros**: Easy data preparation and labeling; user-friendly interface for feature engineering.
- **Cons**: Can be complex when integrating multiple services.

### Databricks
- **Services**: Apache Spark, Delta Lake.
- **Pros**: Excellent for big data processing; Delta Lake allows ACID transactions.
- **Cons**: Requires more setup; may not be ideal for smaller datasets.

---


## 2. Model Development and Experimentation
### Azure ML
- **Services**: Automated Machine Learning (AutoML), Jupyter notebooks.
- **Pros**: User-friendly interface; strong AutoML capabilities.
- **Cons**: AutoML can be limited in terms of customization.

### AWS SageMaker
- **Services**: SageMaker Studio, built-in algorithms.
- **Pros**: Comprehensive development environment; supports multiple frameworks.
- **Cons**: Learning curve for advanced features.

### Databricks
- **Services**: Collaborative notebooks, MLflow for tracking experiments.
- **Pros**: Excellent for collaboration among teams; MLflow integrates well for experiment tracking.
- **Cons**: Complexity increases with larger teams and projects.

---

## 3. Model Training
### Azure ML
- **Services**: Compute instances, distributed training.
- **Pros**: Scalable training options; easy integration with Azure infrastructure.
- **Cons**: Costs can increase with heavy usage.

### AWS SageMaker
- **Services**: Managed training jobs, automatic model tuning (hyperparameter tuning).
- **Pros**: Fully managed; supports distributed training.
- **Cons**: Costs can be hard to estimate because the pricing model is complex.
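
The idea behind hyperparameter tuning can be sketched without any cloud service: evaluate a set of candidate settings and keep the best one. The block below is a plain grid search in standard-library Python. The objective `validation_loss`, the parameter names, and the search space are hypothetical stand-ins, and this is not the SageMaker API; managed tuning services may use other search strategies.

```python
from itertools import product

def validation_loss(learning_rate, depth):
    """Hypothetical stand-in for training a model and returning its validation loss."""
    return (learning_rate - 0.1) ** 2 + (depth - 4) ** 2 * 0.01

# Hypothetical search space: every combination of these values is tried.
search_space = {
    "learning_rate": [0.01, 0.1, 0.5],
    "depth": [2, 4, 8],
}

def grid_search(objective, space):
    """Evaluate every combination in space and return (best_params, best_loss)."""
    names = list(space)
    best_params, best_loss = None, float("inf")
    for values in product(*(space[name] for name in names)):
        params = dict(zip(names, values))
        loss = objective(**params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss

best_params, best_loss = grid_search(validation_loss, search_space)
print(best_params, best_loss)
```

Grid search evaluates every combination, so its cost grows multiplicatively with each added parameter, which is one reason managed training jobs and tuning budgets matter in practice.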

### Databricks
- **Services**: Auto-scaling clusters, distributed training.
- **Pros**: High scalability; supports Spark MLlib for distributed ML tasks.
- **Cons**: Can become expensive for extensive use.

---


## 4. Model Evaluation and Validation
### Azure ML
- **Services**: Built-in metrics for evaluation, confusion matrix visualizations.
- **Pros**: Comprehensive evaluation metrics; user-friendly dashboard.
- **Cons**: Some advanced evaluation features may require manual implementation.
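
For evaluation features that need manual implementation, a confusion matrix and the metrics derived from it can be computed with pandas alone. The sketch below uses hypothetical labels and predictions for a binary classifier; it is independent of any platform's built-in dashboards.

```python
import pandas as pd

# Hypothetical labels and predictions for a binary classifier.
y_true = pd.Series([1, 0, 1, 1, 0, 0, 1, 0], name="actual")
y_pred = pd.Series([1, 0, 0, 1, 0, 1, 1, 0], name="predicted")

# Rows are actual classes, columns are predicted classes.
confusion = pd.crosstab(y_true, y_pred)

tp = confusion.loc[1, 1]  # true positives
fp = confusion.loc[0, 1]  # false positives
fn = confusion.loc[1, 0]  # false negatives
tn = confusion.loc[0, 0]  # true negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(confusion)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```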

### AWS SageMaker
- **Services**: SageMaker Model Monitor.
- **Pros**: Automated monitoring of model performance; integrates with SageMaker.
- **Cons**: Requires setup to monitor effectively.

### Databricks
- **Services**: MLflow for tracking metrics.
- **Pros**: Great for custom evaluation metrics; flexible tracking.
- **Cons**: Requires familiarity with MLflow.

---


## 5. Model Deployment and Serving
### Azure ML
- **Services**: Azure Kubernetes Service (AKS), Azure Container Instances.
- **Pros**: Seamless deployment options; easy to scale.
- **Cons**: Configuration can be complex for beginners.

### AWS SageMaker
- **Services**: Real-time endpoints, batch transform jobs.
- **Pros**: Fully managed endpoints; easy scaling.
- **Cons**: Pricing can add up with high traffic.

### Databricks
- **Services**: Databricks Jobs for scheduling and serving.
- **Pros**: Simple integration with existing workflows; supports batch and streaming.
- **Cons**: Limited out-of-the-box serving options compared to others.

---


## 6. Performance Monitoring
### Azure ML
- **Services**: Azure Monitor, built-in monitoring tools.
- **Pros**: Integrates well with Azure services for end-to-end monitoring.
- **Cons**: Can be costly for extensive monitoring.

### AWS SageMaker
- **Services**: SageMaker Model Monitor, CloudWatch.
- **Pros**: Automated monitoring and alerts; comprehensive metrics.
- **Cons**: Requires setup and configuration.

### Databricks
- **Services**: Spark UI, built-in performance monitoring.
- **Pros**: Good for tracking Spark jobs; integrates with existing monitoring solutions.
- **Cons**: May not provide as detailed metrics for ML models as other platforms.

---


## 7. ML Metadata Store
### Azure ML
- **Services**: Azure ML Workspace.
- **Pros**: Organizes experiments, models, and datasets; easy access.
- **Cons**: Might lack some advanced features of specialized ML metadata stores.

### AWS SageMaker
- **Services**: SageMaker Model Registry.
- **Pros**: Centralized model management; easy versioning.
- **Cons**: Can be limited in features compared to dedicated tools.

### Databricks
- **Services**: MLflow tracking.
- **Pros**: Comprehensive tracking of experiments and models.
- **Cons**: Requires understanding of MLflow for effective use.

---

In [1]:
import pandas as pd