# MuleSoft_Data_Sample_Baseline 


In our context, the baseline is a statistical reference point that describes what "normal" looks like for the system, such as the usual number of API requests. It acts as a benchmark: by comparing current behavior with the expected range, we can quickly spot anomalies, trends, and deviations from the norm, such as spikes or drops in traffic.



The statistical metrics that support our baseline are:

*-Average (Mean): the central value of the data, providing a general "middle point".*

*-Standard Deviation (σ): measures how much the values deviate from the average.*

*-Lower Threshold (Avg - 2σ): the minimum count that is likely to occur under normal conditions.*

*-Upper Threshold (Avg + 2σ): the maximum count that is likely to occur under normal conditions.*
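
Written out as formulas, a sketch of these four quantities for $n$ observations $x_1, \dots, x_n$ could look as follows. The $n - 1$ denominator (sample standard deviation) is an assumption chosen to match the default behavior of pandas `std`; the symbols $L$ and $U$ are introduced here only for illustration.

$$
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
\sigma = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}
$$

$$
L = \bar{x} - 2\sigma, \qquad U = \bar{x} + 2\sigma
$$

If the values are roughly normally distributed, about 95% of them fall within $[L, U]$.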

 



For example, take the “Average Request count” graph in MuleSoft over a period of 15 days. We will have:



*-Average (Mean): the average number of requests over the 15 days for each API/App (10 in total).*

*-Standard Deviation (σ): how much the request counts deviate from that average for each API/App (10 in total).*

*-Lower Threshold (Avg - 2σ): the minimum number of requests that is likely to occur under normal conditions for each API/App.*

*-Upper Threshold (Avg + 2σ): the maximum number of requests that is likely to occur under normal conditions for each API/App.*
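
To make the per-API computation concrete, here is a minimal sketch using only the Python standard library. The API names, the request counts, and the helper `baseline` are all hypothetical and only illustrate the idea; the series is shortened to 5 days for brevity.

```python
import statistics

def baseline(counts, n_sigma=2):
    """Return (average, sigma, lower, upper) for a list of request counts."""
    average = statistics.mean(counts)
    sigma = statistics.stdev(counts)  # sample standard deviation (n - 1)
    return average, sigma, average - n_sigma * sigma, average + n_sigma * sigma

# Hypothetical daily request counts per API (illustrative values only)
requests_per_api = {
    "orders-api": [100, 110, 105, 115, 120],
    "billing-api": [40, 42, 38, 41, 39],
}

# One baseline per API, computed independently of the others
baselines = {api: baseline(counts) for api, counts in requests_per_api.items()}

for api, (average, sigma, lower, upper) in baselines.items():
    print(f"{api}: avg={average:.2f}, sigma={sigma:.2f}, range=[{lower:.2f}, {upper:.2f}]")
```

Each API gets its own range, so a count that is normal for a busy API can still be flagged as unusual for a quiet one.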

 


**Example:** 

For the baseline range = [Avg - 2σ, Avg + 2σ]:

Any value within this range is considered "normal" or "expected": system behavior is normal and no action is needed unless other metrics (e.g., latency, errors) show issues. Values outside this range are potential anomalies.

Concrete example: if the baseline range is [95.86, 124.14]:

**-Request counts within this range indicate normal operation; the system is healthy.**

**-Values outside this range might require investigation on our part.**
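
The check against the range can be sketched as a small function. The function name `classify` and the three observed counts are hypothetical; only the range [95.86, 124.14] comes from the example above.

```python
def classify(value, lower, upper):
    """Label a request count against the baseline range [lower, upper]."""
    if value < lower:
        return "below baseline"
    if value > upper:
        return "above baseline"
    return "normal"

lower, upper = 95.86, 124.14  # baseline range from the concrete example

for count in (90, 110, 130):  # hypothetical observations
    print(count, "->", classify(count, lower, upper))
```

Distinguishing "below" from "above" is a design choice made for this sketch: a drop in traffic and a spike usually point to different causes.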

 

## Import packages and libraries

In [None]:
!pip install plotly

In [None]:
import openpyxl 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go

In [None]:
pd.set_option('display.max_rows', None)  # Show all rows
pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.width', None)  # Adjust the width to fit the content


### Import data

In [None]:
# The following is a sample of MuleSoft metrics. It contains the Time, App_id, Heap_commited, Heap_Used and Heap_total columns


Mulesoft_metrics_wb = openpyxl.load_workbook('Mulesoft_metrics.xlsx')
Mulesoft_metrics = Mulesoft_metrics_wb.active

print(Mulesoft_metrics)
print(f'Total number of rows: {Mulesoft_metrics.max_row}. And total number of columns: {Mulesoft_metrics.max_column}')


### Data processing, cleaning transformation

In [None]:
# Load the Excel sheet into a DataFrame

data_mulesoft = pd.read_excel('Mulesoft_metrics.xlsx')

print(data_mulesoft)


In [None]:
# Clean and format the columns

columns_to_clean = ["Heap_commited", "Heap_Used", "Heap_total"]

# Remove the " 0" substring from each value, then convert the column to float
data_mulesoft[columns_to_clean] = data_mulesoft[columns_to_clean].apply(lambda x: x.str.replace(" 0", "").astype(float))

# Strip the trailing space from the column name and parse it as datetime
data_mulesoft.rename(columns={"Date and Time ": "Date and Time"}, inplace=True)
data_mulesoft["Date and Time"] = pd.to_datetime(data_mulesoft["Date and Time"])

# Keep only the time of day, then drop the original datetime column
data_mulesoft["Time"] = data_mulesoft["Date and Time"].dt.time
data_mulesoft = data_mulesoft.drop(columns=["Date and Time"])


print(data_mulesoft)
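
The cleaning step above raises an error if any value cannot be converted to a float. As a hedged alternative, the sketch below converts with `pd.to_numeric(..., errors="coerce")` so that unparseable entries become `NaN` and can be counted. The raw values and the " MB" suffix are assumptions made for illustration; the real format of the Excel columns may differ.

```python
import pandas as pd

# Hypothetical raw values; the real column format may differ
raw = pd.DataFrame({"Heap_Used": ["512.5 MB", "498.0 MB", "n/a"]})

# Strip the assumed unit suffix, then convert; unparseable entries become NaN
cleaned = pd.to_numeric(
    raw["Heap_Used"].str.replace(" MB", "", regex=False).str.strip(),
    errors="coerce",
)

print(cleaned)
print("Rows that failed to parse:", int(cleaned.isna().sum()))
```

Counting the `NaN` rows before computing the baseline makes it visible how much data was lost to parsing problems.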



### Calculate the metrics for our baseline

In [None]:
# Calculate the average and standard deviation of Heap_Used for each app

metrics = data_mulesoft.groupby('App_id').agg(
    average_heap_used=('Heap_Used', 'mean'),
    std_heap_used=('Heap_Used', 'std')
).reset_index()
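
One detail worth knowing about this aggregation: pandas `std` computes the sample standard deviation (`ddof=1`), whereas NumPy's `np.std` defaults to the population version (`ddof=0`). An app with a single measurement therefore gets `NaN` as its standard deviation. The sketch below shows both effects on a small hypothetical sample (the values and app names are made up).

```python
import numpy as np
import pandas as pd

# Hypothetical sample: app "B" has a single measurement
sample = pd.DataFrame({
    "App_id": ["A", "A", "A", "B"],
    "Heap_Used": [100.0, 110.0, 120.0, 200.0],
})

stats = sample.groupby("App_id")["Heap_Used"].agg(["mean", "std", "count"])
print(stats)

# pandas uses the sample standard deviation (ddof=1); NumPy defaults to ddof=0
values_a = sample.loc[sample["App_id"] == "A", "Heap_Used"].to_numpy()
print("Sample std (ddof=1):", np.std(values_a, ddof=1))
print("Population std (ddof=0):", np.std(values_a))
```

Checking the `count` column alongside the standard deviation helps spot apps with too few data points for a meaningful baseline.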



In [None]:
# Calculate thresholds as Avg ± 2σ, matching the baseline range defined in the introduction

metrics['lower_threshold'] = metrics['average_heap_used'] - 2 * metrics['std_heap_used']
metrics['upper_threshold'] = metrics['average_heap_used'] + 2 * metrics['std_heap_used']

In [None]:
print(metrics)

In [None]:
data_mulesoft = data_mulesoft.merge(metrics, on='App_id')

In [None]:
print(data_mulesoft)

### Let's highlight the outliers

In [None]:
data_mulesoft['outlier'] = (
    (data_mulesoft['Heap_Used'] < data_mulesoft['lower_threshold']) |
    (data_mulesoft['Heap_Used'] > data_mulesoft['upper_threshold'])
)

outliers = data_mulesoft[data_mulesoft['outlier']]

print(outliers)
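
Printing every outlier row can get long. A per-app summary is often easier to read, and the sketch below shows one possible way to build it. The frame `df` is a hypothetical stand-in that reuses the `App_id` and `outlier` column names; the summary column names are made up for this example.

```python
import pandas as pd

# Hypothetical frame reusing the App_id and outlier column names
df = pd.DataFrame({
    "App_id": ["A", "A", "A", "B", "B"],
    "outlier": [False, True, False, False, False],
})

# Summing a boolean column counts the True values
summary = df.groupby("App_id")["outlier"].agg(outlier_count="sum", total="count")
summary["outlier_share"] = summary["outlier_count"] / summary["total"]

print(summary)
```

An app with a high `outlier_share` may need a closer look, or its baseline may simply not fit its usage pattern.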

### Visualization of the Data

In [None]:
fig = px.line(
    data_mulesoft, x='Time', y='Heap_Used', color='App_id',
    title='Heap Used Over Time',
    labels={'Heap_Used': 'Heap Used (MB)'}
)
fig.show()

In [None]:
# Plot heap usage together with its thresholds, one figure per app

for app in data_mulesoft['App_id'].unique():
    app_data = data_mulesoft[data_mulesoft['App_id'] == app]
    fig = go.Figure()
    fig.add_trace(go.Scatter(
        x=app_data['Time'], y=app_data['Heap_Used'], mode='lines', name='Heap Used'
    ))
    fig.add_trace(go.Scatter(
        x=app_data['Time'], y=app_data['upper_threshold'], mode='lines', name='Upper Threshold', line=dict(dash='dash')
    ))
    fig.add_trace(go.Scatter(
        x=app_data['Time'], y=app_data['lower_threshold'], mode='lines', name='Lower Threshold', line=dict(dash='dash')
    ))
    fig.update_layout(title=f"Memory Utilization for App {app}", xaxis_title='Time', yaxis_title='Heap Used')
    fig.show()

In [None]:
!pip install nbconvert

In [None]:
!jupyter nbconvert --to markdown Mulesoft_Baseline_Sample.ipynb


In [None]:
!jupyter nbconvert --to html  Mulesoft_Baseline_Sample.ipynb