## Microsoft Azure Logging

NOTE

This was commercial work completed to help build out tooling for proactive infrastructure analysis and problem detection from Microsoft Azure data. The dataset used here was generated in Microsoft Excel to closely mirror the original quantitative data. Furthermore, categorical names and datetime values have been changed.

SYSTEM INFRASTRUCTURE BACKGROUND

This type of logging indicates which servers are hit by client traffic/requests for subject-specific information to be returned across Microsoft Azure cloud infrastructure. Specific clients are aligned with Azure clusters/engine nodes. A cluster will include certain engine nodes at different periods in time, depending on customer/client traffic expectations.

Server
We will not describe the particular servers' functionality in this case. However, what is important to understand is that the servers assigned to specific nodes in this instance were being hit to return in-house functionality. This Python script will not detail that functionality. What is important is the use case of the code: pulling specific information out of our data.

Using our logging tooling we can acquire up to 15 days of logs for specific system infrastructure, aimed at both proactive and reactive problem detection and resolution.

Data

- Date : log date
- EngineNode : Specific engine node used across and assigned to specific clusters for specific clients
- Client : Live client|customer
- RequestSize(TB) : Customer|client server request size in terabytes
- Milliseconds : Time in milliseconds to return the specific server function to the client/customer
- Seconds : Time in seconds to return the specific server function to the client/customer
- Minutes : Time in minutes to return the specific server function to the client/customer
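
The three timing columns are unit conversions of a single measurement. A minimal sketch of that conversion in plain Python, assuming raw values arrive as strings such as `'1500ms'` (the sample values and the helper name `parse_milliseconds` are hypothetical, not part of the original tooling):

```python
def parse_milliseconds(raw: str) -> int:
    """Convert a raw log value such as '1500ms' to an integer millisecond count."""
    text = raw.strip()
    if text.endswith("ms"):
        text = text[:-2]  # drop only the literal 'ms' suffix
    return int(text)


def to_seconds(milliseconds: int) -> float:
    return milliseconds / 1000


def to_minutes(milliseconds: int) -> float:
    return milliseconds / 60000


sample = ["1500ms", " 250ms", "60000ms"]  # hypothetical raw values
parsed = [parse_milliseconds(value) for value in sample]
seconds = [to_seconds(ms) for ms in parsed]
minutes = [to_minutes(ms) for ms in parsed]
print(parsed, seconds, minutes)
```

Deriving both `Seconds` and `Minutes` from the millisecond value keeps the three columns consistent with each other.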

In [None]:
import numpy as np # import numpy package for numerical python
import pandas as pd # import pandas

# import visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px 
from plotly.subplots import make_subplots
import plotly.graph_objects as go

from sklearn.preprocessing import normalize, StandardScaler # data pre-processing libraries and modules

#### DATA CLEANING & FEATURE ENGINEERING

In [None]:
# Create DataFrame from the Excel file and parse dates
data = pd.read_excel('DataDecodedPrepared.xlsx', parse_dates = True)

# create a deep copy of the original dataframe
data_copy = data.copy(deep = True)

# drop the column at position 6 (zero-based)
data.drop(data.columns[[6]], axis=1, inplace=True)

# strip the trailing 'ms' unit characters from the string values
data['Milliseconds'] = data['Milliseconds'].str.rstrip('ms')
# remove leading/trailing white space
data['Client'] = data['Client'].str.strip()
data['EngineNode'] = data['EngineNode'].str.strip()

# convert 'Date' column from object to datetime
data['Date'] = pd.to_datetime(data['Date'])

# convert 'Milliseconds' from string to integer
data['Milliseconds'] = data['Milliseconds'].astype('int64')

# new column: milliseconds converted to seconds
data['Seconds'] = data['Milliseconds']/1000

# new column: milliseconds converted to minutes
data['Minutes'] = data['Milliseconds']/60000

# second name for the same dataframe, used for pre-processing later in the analysis
# (this is an alias, not an independent copy)
pre_process = data

In [None]:
data

#### LOG FREQUENCY

Each row in the dataset is an individual alert, so the total number of alerts is simply len(df). However, the dataset has no column that numbers the alerts. We can add one with further feature engineering by creating a new column from the dataframe index, which increases by 1 for each row.

NOTE
- It's important to remember zero indexing for this task: the index starts at 0, so we add 1 to number the alerts from 1.

VISUALIZATION
- We can then plot this discrete quantitative data against time using the datetime column in the pandas dataframe. We will use a histogram to visualize the data.

In [None]:
data['AlertFrequency'] = data.index + 1 # running alert number: zero-based row index plus 1

In [None]:
# plotly express histogram
fig = px.histogram(data_frame = data, x = 'Date', y = 'AlertFrequency',color_discrete_sequence=['blue'],
                   opacity=0.6, width=1000, height=500 ,histnorm="probability")
fig.update_layout(
    margin=dict(l=20, r=20, t=20, b=20),
    paper_bgcolor="LightSteelBlue",
)
fig.show()

In terms of subject matter this is highly important information, as it shows when our pricing software was under the most strain within this time period. This can then be cross-analysed against the further findings below.
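
Because every row is one log entry, the log frequency per day is a count of rows grouped by calendar date. A small sketch in plain Python using made-up timestamps (the `timestamps` list is hypothetical; in the notebook the values would come from the `Date` column):

```python
from collections import Counter
from datetime import datetime

# hypothetical log timestamps, one per log entry
timestamps = [
    datetime(2021, 3, 1, 9, 15),
    datetime(2021, 3, 1, 17, 40),
    datetime(2021, 3, 2, 8, 5),
    datetime(2021, 3, 3, 11, 30),
    datetime(2021, 3, 3, 11, 45),
    datetime(2021, 3, 3, 23, 59),
]

# count log entries per calendar day
logs_per_day = Counter(ts.date() for ts in timestamps)

# the day with the most entries is the period of heaviest load
busiest_day, busiest_count = max(logs_per_day.items(), key=lambda item: item[1])

for day in sorted(logs_per_day):
    print(day, logs_per_day[day])
print("Busiest day:", busiest_day, "with", busiest_count, "entries")
```

With pandas, `data.set_index('Date').resample('D').size()` should give the same per-day counts directly from the dataframe.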

#### DESCRIPTIVE STATISTICS

In [None]:
data.info()

In [None]:
data.describe()

In [None]:
data.describe(include=object)

## Correlation Analysis - Client Request Size (TB) v Server Return Time

We can test the current correlation of the mentioned variables over the dataset's timeframe, which is approximately 15 days. This alone will not be completely useful. However, if we can gauge the correlation here and then test it over differing time periods, this could prove useful for key infrastructure system health checks and problem detection.
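
One way to test the correlation over differing time periods is to compute it separately per window, for example per day, and compare the values. A rough sketch in plain Python with synthetic records (the `records` values and the `pearson` helper are illustrative assumptions, not the project's data or tooling):

```python
from collections import defaultdict
from datetime import date
from math import sqrt


def pearson(xs, ys):
    """Pearson's r computed from its definition."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# (day, request size in TB, return time in seconds) - synthetic values
records = [
    (date(2021, 3, 1), 1.0, 9.0), (date(2021, 3, 1), 2.0, 7.5),
    (date(2021, 3, 1), 3.0, 6.0), (date(2021, 3, 2), 1.0, 4.0),
    (date(2021, 3, 2), 2.0, 5.0), (date(2021, 3, 2), 4.0, 8.0),
]

# group the paired measurements by day
by_day = defaultdict(lambda: ([], []))
for day, size_tb, secs in records:
    by_day[day][0].append(size_tb)
    by_day[day][1].append(secs)

# one correlation value per day
daily_r = {day: pearson(sizes, times) for day, (sizes, times) in by_day.items()}
for day, r in sorted(daily_r.items()):
    print(day, round(r, 3))
```

A day whose coefficient departs sharply from the others would be a candidate for closer inspection; each window needs enough rows for the value to be meaningful.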

Statistical Correlation Analysis

Correlation analysis in research is a statistical method used to measure the strength of the linear relationship between two variables and compute their association. Simply put, correlation analysis measures how closely a change in one variable is accompanied by a change in the other; it does not by itself show that one causes the other.

The correlation coefficient measures the intensity of the linear relationship between the variables involved in a correlation analysis. It is represented by the symbol r and is a unitless value between -1 and 1.
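
For reference, the standard textbook definition of the sample Pearson coefficient for paired observations (x_i, y_i), i = 1, ..., n, with sample means x̄ and ȳ, is:

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

This is the quantity that pandas `Series.corr` computes with its default `method='pearson'`.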

Correlation between two variables can be positive, negative, or absent. Let's look at each of these three types.

- Positive correlation: A positive correlation between two variables means both variables move in the same direction. An increase in one variable tends to be accompanied by an increase in the other, and vice versa.
- Negative correlation: A negative correlation between two variables means that the variables move in opposite directions. An increase in one variable tends to be accompanied by a decrease in the other, and vice versa.
- Weak/Zero correlation: Little or no correlation exists when there is no linear relationship between the two variables.
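
The three cases can be illustrated with tiny synthetic series. This is a sketch only: the numbers are made up, and the `pearson` helper is a plain-Python stand-in for the library call used later in the notebook.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's r computed from its definition."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # moves in the same direction as x
falling = [10, 8, 6, 4, 2]   # moves in the opposite direction to x
symmetric = [4, 1, 0, 1, 4]  # depends on x, but not linearly

r_positive = pearson(x, rising)
r_negative = pearson(x, falling)
r_zero = pearson(x, symmetric)
print(r_positive, r_negative, r_zero)
```

The last series is fully determined by `x`, yet its coefficient is zero, which is why a weak r only rules out a linear relationship.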


In [None]:
# Generate scatterplot (discrete quantitative data across time)
fig = px.scatter(data_frame=data,x = 'Date',y = 'RequestSize(TB)',color_discrete_sequence=['green'],
                   opacity=0.6,width=1000, height=650)
fig.update_layout(
    margin=dict(l=20, r=20, t=20, b=20),
    paper_bgcolor="lightsteelblue",
)
fig.show()

The variable of interest on the y axis is treated here as discrete quantitative data, that is, indivisible values that can be counted. The plot above would also help with OUTLIER DETECTION; however, that is not of interest in this instance.

The line plots below allow us to visually analyse the trend of both variables of interest across time. From them we can get a rough visual idea of Pearson's correlation between the two variables.

In [None]:
# Generate line plots using plotly (one subplot per variable)
fig = make_subplots(rows=2, cols=1,subplot_titles=("Time Series Request Size (Terabytes) Analysis", "Time Series Server Request Return Time in Seconds"))
# add one trace per subplot
fig.add_trace(go.Scatter(x=data['Date'], y=data['RequestSize(TB)']),
              row=1, col=1)
fig.add_trace(go.Scatter(x=data['Date'], y=data['Seconds']),
              row=2, col=1)

# Update xaxis properties
fig.update_xaxes(title_text="Date|Time", row=1, col=1)
fig.update_xaxes(title_text="Date|Time", row=2, col=1)
# Update yaxis properties
fig.update_yaxes(title_text="RequestSize(TB)", row=1, col=1)
fig.update_yaxes(title_text="Seconds", row=2, col=1)

# Update title and height
fig.update_layout(height=900, width=1500,
                  title_text="Statistical Correlation Analysis")

fig.show()

# Pearson's correlation
print("RequestSize(TB) v Seconds Correlation Analysis: ", data['RequestSize(TB)'].corr(data['Seconds'])) # Series.corr uses Pearson's method by default

We can see that there is a negative correlation of -0.55, meaning that the two variables tend to move in opposite directions: an increase in one variable is associated with a decrease in the other, and vice versa. Based on the calculated correlation value, we can conclude that the linear relationship between the two variables is of medium strength.

HOWEVER, the two variables have different units of measurement, one being time and the other count data, so their numerical values differ in magnitude. The question arises of how the two can be compared, and whether the above correlation analysis may be untrue/misleading.
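
On the units question, it helps to recall that Pearson's r is dimensionless: each variable is centred on its mean and divided by its standard deviation, which removes the units. A change of units therefore leaves the coefficient unchanged. A small check with synthetic numbers (the values and helper names are illustrative only):

```python
from statistics import mean, pstdev


def zscores(values):
    """Rescale values to mean 0 and standard deviation 1, which removes their units."""
    centre, spread = mean(values), pstdev(values)
    return [(value - centre) / spread for value in values]


def pearson(xs, ys):
    """Pearson's r expressed as the average product of paired z-scores."""
    return mean([zx * zy for zx, zy in zip(zscores(xs), zscores(ys))])


size_tb = [0.5, 1.2, 2.0, 2.7, 3.1, 4.4]        # synthetic request sizes in TB
time_ms = [9100, 7400, 8000, 5200, 6100, 3900]  # synthetic return times in ms

size_gb = [value * 1000 for value in size_tb]   # same sizes expressed in GB
time_s = [value / 1000 for value in time_ms]    # same times expressed in seconds

r_original = pearson(size_tb, time_ms)
r_rescaled = pearson(size_gb, time_s)
print(round(r_original, 6), round(r_rescaled, 6))
```

Both calls print the same value, so differing units on their own do not distort the coefficient.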

### Data Pre-Processing

In this instance we want to test pre-processing our data to regulate the magnitude of both variables' values and see whether the generated correlation value changes. Simply put, this method aims to put the values of both variables on a common scale. Note that Pearson's r is already unit-free and is unchanged by column-wise standardization alone, so any difference in the result comes from the row-wise normalization step.
- Normalising the data has been chosen for sample testing. In-depth knowledge of pre-processing techniques and their use cases would be required, along with further testing and analysis.

Normalization has been chosen because this type of pre-processing has been worked with before. We could test other pre-processing techniques; however, we will not be doing so for this test.
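
The two pre-processing steps used in this test behave differently, which matters when reading the result. The sketch below mimics, in plain Python and on synthetic values, the default behaviour of scikit-learn's `StandardScaler` (column-wise z-scores) followed by `normalize` (each row scaled to unit L2 norm). The data and helper names are assumptions for illustration.

```python
from math import hypot
from statistics import mean, pstdev


def standardize(column):
    """Column-wise z-score: subtract the mean, divide by the population std."""
    centre, spread = mean(column), pstdev(column)
    return [(value - centre) / spread for value in column]


def pearson(xs, ys):
    """Pearson's r as the average product of paired z-scores."""
    return mean([a * b for a, b in zip(standardize(xs), standardize(ys))])


size_tb = [0.5, 1.2, 2.0, 2.7, 3.1, 4.4]  # synthetic request sizes (TB)
time_s = [9.1, 7.4, 8.0, 5.2, 6.1, 3.9]   # synthetic return times (s)

# Step 1: column-wise standardization (a linear rescaling of each column)
size_z, time_z = standardize(size_tb), standardize(time_s)

# Step 2: row-wise L2 normalization (each row is divided by its own length)
rows = [(a / hypot(a, b), b / hypot(a, b)) for a, b in zip(size_z, time_z)]
size_n = [row[0] for row in rows]
time_n = [row[1] for row in rows]

r_raw = pearson(size_tb, time_s)
r_standardized = pearson(size_z, time_z)
r_normalized = pearson(size_n, time_n)
print(round(r_raw, 4), round(r_standardized, 4), round(r_normalized, 4))
```

Standardization leaves r unchanged, while row-wise normalization places every (size, time) pair on the unit circle and so changes the coefficient. A shift in r after this step reflects the transformation, not a correction for units.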

In [None]:
# keep only the numeric columns of interest; drop returns a new dataframe, so `data` is left untouched
pre_process = pre_process.drop(columns=['Date','Host','Service','EngineNode','Client','Milliseconds','Minutes','AlertFrequency'], errors='ignore')
pre_process.head()

In [None]:
# Initialise the StandardScaler (column-wise: zero mean, unit variance)
scaler = StandardScaler()
data_scaled = scaler.fit_transform(pre_process)

# Normalise each row (sample) to unit L2 norm; this rescales rows and does not make the data Gaussian
data_normalized = normalize(data_scaled)

# Convert the numpy array into a pandas DataFrame
data_normalized = pd.DataFrame(data_normalized)

# Rename the columns
data_normalized.columns = pre_process.columns

data_normalized.head()

In [None]:
fig = make_subplots(rows=2, cols=1,subplot_titles=("Time Series Request Size (Terabytes) Analysis", "Time Series Server Request Return Time  in Seconds"))

fig.add_trace(go.Scatter(x=data_normalized.index, y=data_normalized['RequestSize(TB)']),
              row=1, col=1)
fig.add_trace(go.Scatter(x=data_normalized.index, y=data_normalized['Seconds']),
              row=2, col=1)

# Update xaxis properties (x is the row index of the normalized dataframe, not the date)
fig.update_xaxes(title_text="Row index", row=1, col=1)
fig.update_xaxes(title_text="Row index", row=2, col=1)

# Update yaxis properties
fig.update_yaxes(title_text="RequestSize(TB) - Terabytes", row=1, col=1)
fig.update_yaxes(title_text="Seconds", row=2, col=1)

# Update title and height
fig.update_layout(height=900, width=1500,
                  title_text="Statistical Correlation Analysis")

fig.show()

# Pearson's correlation on the normalized data
print("RequestSize(TB) v Seconds Correlation Analysis (normalized): ", data_normalized['RequestSize(TB)'].corr(data_normalized['Seconds'])) # statistical correlation method

In [None]:
# Pearson's correlation: original data v normalized data
print("RequestSize(TB) v Seconds Correlation Analysis (original): ", data['RequestSize(TB)'].corr(data['Seconds'])) # statistical correlation method
print("RequestSize(TB) v Seconds Correlation Analysis (normalized): ", data_normalized['RequestSize(TB)'].corr(data_normalized['Seconds'])) # statistical correlation method

#### Correlation Analysis Conclusion
- From our testing, the correlation value returned for these two features indicates a medium-strength correlation. This potentially indicates that request size in terabytes is related to the request return time in seconds, although the coefficient reported above is negative, so in this sample larger requests coincide with shorter, not longer, return times. Aligning this with subject knowledge, a relationship is plausible, as certain (x) will generate more traffic on certain dates. However, what would be required is further testing over a longer period and cross-comparative analysis.